Transformer (v5)

BERT is a transformer model that uses self-attention rather than recurrent connections. It represents text as a sequence of tokens and learns contextual relationships between tokens through attention mechanisms. Self-attention allows BERT to learn from the entire input sequence at once, rather than sequentially like RNNs, making it easy to parallelize computations.


BERT

Transformer
李宏毅
Hung-yi Lee
Transformer
Seq2seq model with “Self-attention”
Sequence
• RNN: the outputs b^1, b^2, b^3, b^4 of a layer are computed from a^1, a^2, a^3, a^4 one step at a time, so it is hard to parallelize.
[Figure: RNN layer mapping the previous layer a^1 … a^4 to the next layer b^1 … b^4]

Using CNN to replace RNN
• A CNN can be parallelized, and filters in a higher layer can consider a longer part of the sequence.
[Figure: stacked 1-D convolution filters over a^1 … a^4 producing b^1 … b^4]
Self-Attention
• Each output b^i is obtained based on the whole input sequence.
• b^1, b^2, b^3, b^4 can be computed in parallel.
[Figure: a Self-Attention Layer maps a^1 … a^4 to b^1 … b^4]
• You can try to replace anything that has been done by RNN with self-attention.
Self-attention ("Attention Is All You Need", https://arxiv.org/abs/1706.03762)
• Each input x^i is first embedded: a^i = W x^i.
• From every a^i, three vectors are computed:
  • query (to match others): q^i = W^q a^i
  • key (to be matched): k^i = W^k a^i
  • value (information to be extracted): v^i = W^v a^i
[Figure: each of a^1 … a^4 produces its own (q^i, k^i, v^i) triple]
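Not from the slides: a minimal NumPy sketch of these per-token projections, with toy dimensions and random matrices standing in for W, W^q, W^k, W^v.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy embedding / projection dimension (assumed)

x = rng.normal(size=(4, d))             # four input tokens x^1 ... x^4 (rows)
W  = rng.normal(size=(d, d))            # embedding matrix W
Wq = rng.normal(size=(d, d))            # W^q
Wk = rng.normal(size=(d, d))            # W^k
Wv = rng.normal(size=(d, d))            # W^v

a = x @ W.T                             # a^i = W x^i
q = a @ Wq.T                            # q^i = W^q a^i   (queries)
k = a @ Wk.T                            # k^i = W^k a^i   (keys)
v = a @ Wv.T                            # v^i = W^v a^i   (values)
print(q.shape, k.shape, v.shape)        # (4, 8) each: one (q, k, v) triple per token
```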
Self-attention
• Take each query q and do attention against every key k.
• Scaled Dot-Product Attention: \alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}, where d is the dimension of q and k.
[Figure: the dot product of q^1 with k^1 … k^4 gives \alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}]
Self-attention
• The scores are passed through a soft-max: \hat{\alpha}_{1,i} = \exp(\alpha_{1,i}) / \sum_j \exp(\alpha_{1,j})
[Figure: Soft-max over \alpha_{1,1} … \alpha_{1,4} gives \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}]
Self-attention
• b^1 = \sum_i \hat{\alpha}_{1,i} v^i : considering the whole sequence.
[Figure: b^1 is the weighted sum of v^1 … v^4 with weights \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}]
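A follow-up sketch (again illustrative, toy shapes and random vectors) of the three steps above for the first query: scaled dot products, soft-max, then the weighted sum that gives b^1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # dim of q and k (toy value)
q = rng.normal(size=(4, d))                 # queries q^1 ... q^4 (rows)
k = rng.normal(size=(4, d))                 # keys    k^1 ... k^4
v = rng.normal(size=(4, d))                 # values  v^1 ... v^4

alpha_1 = q[0] @ k.T / np.sqrt(d)           # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1_hat = np.exp(alpha_1) / np.exp(alpha_1).sum()   # soft-max over i
b1 = alpha_1_hat @ v                        # b^1 = sum_i alpha_hat_{1,i} v^i
print(b1.shape)                             # (8,)
```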
Self-attention
• Take each query q and do attention against every key k: likewise, b^2 = \sum_i \hat{\alpha}_{2,i} v^i
[Figure: b^2 is the weighted sum of v^1 … v^4 with weights \hat{\alpha}_{2,1} … \hat{\alpha}_{2,4}]
Self-attention
• b^1, b^2, b^3, b^4 can be computed in parallel.
[Figure: a Self-Attention Layer maps x^1 … x^4 to b^1 … b^4]
Self-attention
• q^i = W^q a^i for every i, so all queries come from one product: Q = [q^1 q^2 q^3 q^4] = W^q [a^1 a^2 a^3 a^4] = W^q I
• Likewise K = [k^1 k^2 k^3 k^4] = W^k I and V = [v^1 v^2 v^3 v^4] = W^v I, where I is the matrix whose columns are a^1 … a^4.
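A quick numeric check (toy random matrices, not from the slides) that computing every q^i separately agrees with the single product Q = W^q I.

```python
import numpy as np

rng = np.random.default_rng(0)
Wq = rng.normal(size=(8, 8))                # W^q
I  = rng.normal(size=(8, 4))                # columns of I are a^1 ... a^4

Q_batched = Wq @ I                          # Q = W^q I
Q_per_token = np.stack([Wq @ I[:, i] for i in range(4)], axis=1)   # q^i = W^q a^i
print(np.allclose(Q_batched, Q_per_token))  # True
```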
Self-attention
• b^1 is built from the weights \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}, and each score is a dot product with a key:
  \alpha_{1,1} = k^1 \cdot q^1,  \alpha_{1,2} = k^2 \cdot q^1,  \alpha_{1,3} = k^3 \cdot q^1,  \alpha_{1,4} = k^4 \cdot q^1
• Stacking the keys as rows, all four scores come from one matrix-vector product:
  [\alpha_{1,1}; \alpha_{1,2}; \alpha_{1,3}; \alpha_{1,4}] = K^T q^1, where the rows of K^T are k^1 … k^4
  (the division by \sqrt{d} is ignored here for simplicity)
Self-attention
• The same holds for every query (e.g. b^2 = \sum_i \hat{\alpha}_{2,i} v^i), so all the scores form one matrix product:
  A = \begin{bmatrix}
        \alpha_{1,1} & \alpha_{2,1} & \alpha_{3,1} & \alpha_{4,1} \\
        \alpha_{1,2} & \alpha_{2,2} & \alpha_{3,2} & \alpha_{4,2} \\
        \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3} & \alpha_{4,3} \\
        \alpha_{1,4} & \alpha_{2,4} & \alpha_{3,4} & \alpha_{4,4}
      \end{bmatrix} = K^T Q
  where the rows of K^T are k^1 … k^4 and the columns of Q are q^1 … q^4.
• Applying the soft-max to each column of A gives \hat{A} (entries \hat{\alpha}_{j,i}).
Self-attention
• The outputs (e.g. b^2 = \sum_i \hat{\alpha}_{2,i} v^i) are also one matrix product:
  O = [b^1 b^2 b^3 b^4] = [v^1 v^2 v^3 v^4] \hat{A} = V \hat{A}
Self-attention
• Q = W^q I,  K = W^k I,  V = W^v I
• A = K^T Q,  \hat{A} = soft-max(A)
• O = V \hat{A}
• In the end it is just a pile of matrix multiplications, which is easy to accelerate with a GPU.
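Putting it together, a minimal NumPy sketch of the whole layer as matrix multiplications, using the column-vector convention of the slides (toy sizes and random weights; the 1/\sqrt{d} scaling that the derivation above dropped for simplicity is put back in).

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Matrix-form self-attention: columns of I are the inputs a^1 ... a^n."""
    d = Wq.shape[0]
    Q, K, V = Wq @ I, Wk @ I, Wv @ I            # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(d)                    # A = K^T Q (with scaling)
    E = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat = E / E.sum(axis=0, keepdims=True)    # soft-max over each column (over the keys)
    return V @ A_hat                            # O = V A_hat; columns are b^1 ... b^n

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 4))                     # 4 tokens of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
O = self_attention(I, Wq, Wk, Wv)
print(O.shape)                                  # (8, 4)
```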
Multi-head Self-attention (2 heads as example)
• As before, q^i = W^q a^i; each head then gets its own projection: q^{i,1} = W^{q,1} q^i and q^{i,2} = W^{q,2} q^i (and likewise k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2}).
• Head 1 attends only with the head-1 vectors (q^{i,1} against k^{j,1}, weighting v^{j,1}), producing b^{i,1}.
[Figure: positions i and j each split their (q, k, v) into two heads]
Multi-head Self-attention (2 heads as example)
• Head 2 does the same with q^{i,2}, k^{j,2}, v^{j,2}, producing b^{i,2}.
[Figure: the second head attends independently of the first]
Multi-head Self-attention (2 heads as example)
• The per-head outputs are concatenated and projected back: b^i = W^O [b^{i,1} ; b^{i,2}]
[Figure: b^{i,1} and b^{i,2} are stacked and multiplied by W^O to give b^i]
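A hedged sketch of the 2-head case, following the slides' recipe (project q, k, v once, apply per-head projections, attend per head, concatenate, multiply by W^O); all matrices here are random placeholders.

```python
import numpy as np

def attend(Q, K, V):
    """Single-head attention; columns of Q, K, V correspond to positions."""
    A = K.T @ Q / np.sqrt(Q.shape[0])
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return V @ (E / E.sum(axis=0, keepdims=True))

rng = np.random.default_rng(0)
d, n, n_heads = 8, 4, 2
I = rng.normal(size=(d, n))                           # columns are a^1 ... a^n
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Wq @ I, Wk @ I, Wv @ I                      # shared q^i, k^i, v^i

head_outputs = []
for h in range(n_heads):
    # per-head projections W^{q,h}, W^{k,h}, W^{v,h} applied to q^i, k^i, v^i
    Wqh, Wkh, Wvh = (rng.normal(size=(d // n_heads, d)) for _ in range(3))
    head_outputs.append(attend(Wqh @ Q, Wkh @ K, Wvh @ V))   # columns are b^{i,h}

Wo = rng.normal(size=(d, d))                          # W^O
B = Wo @ np.vstack(head_outputs)                      # b^i = W^O [b^{i,1} ; b^{i,2}]
print(B.shape)                                        # (8, 4)
```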
Positional Encoding
• There is no position information in self-attention, so a positional vector e^i is added to the embedding: a^i + e^i.
• Original paper: each position has a unique positional vector e^i (not learned from data).
• In other words: each x^i appends a one-hot position vector p^i (1 in the i-th dim, 0 elsewhere). With W = [W^I  W^P]:
  W [x^i ; p^i] = W^I x^i + W^P p^i = a^i + e^i
  so adding e^i is equivalent to appending the one-hot position to the input.
[Figure: visualization of the positional encoding values, ranging from -1 to 1; source of image: http://jalammar.github.io/illustrated-transformer/]
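A small numeric check of that equivalence (toy sizes and random W^I, W^P; illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                               # embedding dim and sequence length (toy values)
W_I = rng.normal(size=(d, d))             # W^I multiplies the word part x^i
W_P = rng.normal(size=(d, n))             # W^P multiplies the one-hot position part p^i
W = np.hstack([W_I, W_P])                 # W = [W^I  W^P]

i = 2
x_i = rng.normal(size=d)                  # some input vector x^i
p_i = np.eye(n)[i]                        # one-hot position vector p^i (1 in the i-th dim)

lhs = W @ np.concatenate([x_i, p_i])      # W [x^i ; p^i]
rhs = W_I @ x_i + W_P @ p_i               # a^i + e^i
print(np.allclose(lhs, rhs))              # True
```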
Review: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be

Seq2seq with Attention
[Figure: a seq2seq-with-attention model in which both the Encoder (x^1 … x^4 → h^1 … h^4) and the Decoder (contexts c^1, c^2, c^3 → outputs o^1, o^2, o^3) are built from Self-Attention Layers]
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Transformer
• Using Chinese to English translation as example: the Encoder reads "機 器 學 習" ("machine learning"), and the Decoder, starting from <BOS>, generates "machine" and the following words one at a time.
[Figure: Encoder and Decoder during translation]
Transformer
[Figure: the Transformer block diagram]
• After self-attention the input a is added to the output b (residual connection), and the sum is passed through Layer Norm to give b'.
• Layer Norm (https://arxiv.org/abs/1607.06450) normalizes each single example across its features to μ = 0, σ = 1; Batch Norm (https://www.youtube.com/watch?v=BZh1ltr5Rkg) instead normalizes the same feature across a batch.
• In the decoder, the attention over the encoder output attends on the input sequence, while the Masked self-attention attends only on the already generated sequence.
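A sketch (not from the slides) of how the "Masked" self-attention can be implemented: scores for positions that have not been generated yet are set to -inf before the soft-max, so each position only attends to itself and earlier positions.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                     # scores[i, j]: how much position i attends to j

future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i (not generated yet)
scores = np.where(future, -np.inf, scores)           # block attention to future positions

E = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = E / E.sum(axis=1, keepdims=True)           # each row sums to 1; zero weight on the future
print(np.round(weights, 2))
```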
Attention Visualization
[Figure: attention weight visualizations from https://arxiv.org/abs/1706.03762]

Attention Visualization
[Figure: the encoder self-attention distribution for the word "it" from the 5th to the 6th layer of a Transformer trained on English-to-French translation (one of eight attention heads); source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]
Multi-head Attention
Example Application
• If you can use seq2seq, you can use the Transformer.
• Example: a Summarizer that takes a whole Document Set as input (https://arxiv.org/abs/1801.10198).
Universal Transformer

https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Self-Attention GAN

https://arxiv.org/abs/1805.08318
