Transformer (v5)

BERT is a transformer model that uses self-attention rather than recurrent connections. It represents text as a sequence of tokens and learns contextual relationships between tokens through attention mechanisms. Self-attention allows BERT to learn from the entire input sequence at once, rather than sequentially like RNNs, making it easy to parallelize computations.


BERT

Transformer
李宏毅
Hung-yi Lee
Transformer
Seq2seq model with “Self-attention”
Sequence
• RNN: the outputs b^1, b^2, b^3, b^4 of a layer are computed from a^1, a^2, a^3, a^4 one step at a time, so it is hard to parallelize.
[Figure: RNN layer mapping the previous layer a^1 … a^4 to the next layer b^1 … b^4]

Using CNN to replace RNN
• A CNN can be parallelized, and filters in a higher layer can consider a longer part of the sequence.
[Figure: stacked 1-D convolution filters over a^1 … a^4 producing b^1 … b^4]
Self-Attention
• Each output b^i is obtained based on the whole input sequence.
• b^1, b^2, b^3, b^4 can be computed in parallel.
[Figure: a Self-Attention Layer maps a^1 … a^4 to b^1 … b^4]
• You can try to replace anything that has been done by RNN with self-attention.
Self-attention ("Attention Is All You Need", https://arxiv.org/abs/1706.03762)
• Each input x^i is first embedded: a^i = W x^i.
• From every a^i, three vectors are computed:
  • query (to match others): q^i = W^q a^i
  • key (to be matched): k^i = W^k a^i
  • value (information to be extracted): v^i = W^v a^i
[Figure: each of a^1 … a^4 produces its own (q^i, k^i, v^i) triple]
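Not from the slides: a minimal NumPy sketch of these per-token projections, with toy dimensions and random matrices standing in for W, W^q, W^k, W^v.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy embedding / projection dimension (assumed)

x = rng.normal(size=(4, d))             # four input tokens x^1 ... x^4 (rows)
W  = rng.normal(size=(d, d))            # embedding matrix W
Wq = rng.normal(size=(d, d))            # W^q
Wk = rng.normal(size=(d, d))            # W^k
Wv = rng.normal(size=(d, d))            # W^v

a = x @ W.T                             # a^i = W x^i
q = a @ Wq.T                            # q^i = W^q a^i   (queries)
k = a @ Wk.T                            # k^i = W^k a^i   (keys)
v = a @ Wv.T                            # v^i = W^v a^i   (values)
print(q.shape, k.shape, v.shape)        # (4, 8) each: one (q, k, v) triple per token
```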
Self-attention
• Take each query q and do attention against every key k.
• Scaled Dot-Product Attention: \alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}, where d is the dimension of q and k.
[Figure: the dot product of q^1 with k^1 … k^4 gives \alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}]
Self-attention
• The scores are passed through a soft-max: \hat{\alpha}_{1,i} = \exp(\alpha_{1,i}) / \sum_j \exp(\alpha_{1,j})
[Figure: Soft-max over \alpha_{1,1} … \alpha_{1,4} gives \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}]
Self-attention
• b^1 = \sum_i \hat{\alpha}_{1,i} v^i : considering the whole sequence.
[Figure: b^1 is the weighted sum of v^1 … v^4 with weights \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}]
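A follow-up sketch (again illustrative, toy shapes and random vectors) of the three steps above for the first query: scaled dot products, soft-max, then the weighted sum that gives b^1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # dim of q and k (toy value)
q = rng.normal(size=(4, d))                 # queries q^1 ... q^4 (rows)
k = rng.normal(size=(4, d))                 # keys    k^1 ... k^4
v = rng.normal(size=(4, d))                 # values  v^1 ... v^4

alpha_1 = q[0] @ k.T / np.sqrt(d)           # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1_hat = np.exp(alpha_1) / np.exp(alpha_1).sum()   # soft-max over i
b1 = alpha_1_hat @ v                        # b^1 = sum_i alpha_hat_{1,i} v^i
print(b1.shape)                             # (8,)
```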
Self-attention
• Take each query q and do attention against every key k: likewise, b^2 = \sum_i \hat{\alpha}_{2,i} v^i
[Figure: b^2 is the weighted sum of v^1 … v^4 with weights \hat{\alpha}_{2,1} … \hat{\alpha}_{2,4}]
Self-attention
• b^1, b^2, b^3, b^4 can be computed in parallel.
[Figure: a Self-Attention Layer maps x^1 … x^4 to b^1 … b^4]
Self-attention
• q^i = W^q a^i for every i, so all queries come from one product: Q = [q^1 q^2 q^3 q^4] = W^q [a^1 a^2 a^3 a^4] = W^q I
• Likewise K = [k^1 k^2 k^3 k^4] = W^k I and V = [v^1 v^2 v^3 v^4] = W^v I, where I is the matrix whose columns are a^1 … a^4.
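A quick numeric check (toy random matrices, not from the slides) that computing every q^i separately agrees with the single product Q = W^q I.

```python
import numpy as np

rng = np.random.default_rng(0)
Wq = rng.normal(size=(8, 8))                # W^q
I  = rng.normal(size=(8, 4))                # columns of I are a^1 ... a^4

Q_batched = Wq @ I                          # Q = W^q I
Q_per_token = np.stack([Wq @ I[:, i] for i in range(4)], axis=1)   # q^i = W^q a^i
print(np.allclose(Q_batched, Q_per_token))  # True
```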
Self-attention
• b^1 is built from the weights \hat{\alpha}_{1,1} … \hat{\alpha}_{1,4}, and each score is a dot product with a key:
  \alpha_{1,1} = k^1 \cdot q^1,  \alpha_{1,2} = k^2 \cdot q^1,  \alpha_{1,3} = k^3 \cdot q^1,  \alpha_{1,4} = k^4 \cdot q^1
• Stacking the keys as rows, all four scores come from one matrix-vector product:
  [\alpha_{1,1}; \alpha_{1,2}; \alpha_{1,3}; \alpha_{1,4}] = K^T q^1, where the rows of K^T are k^1 … k^4
  (the division by \sqrt{d} is ignored here for simplicity)
Self-attention
• The same holds for every query (e.g. b^2 = \sum_i \hat{\alpha}_{2,i} v^i), so all the scores form one matrix product:
  A = \begin{bmatrix}
        \alpha_{1,1} & \alpha_{2,1} & \alpha_{3,1} & \alpha_{4,1} \\
        \alpha_{1,2} & \alpha_{2,2} & \alpha_{3,2} & \alpha_{4,2} \\
        \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3} & \alpha_{4,3} \\
        \alpha_{1,4} & \alpha_{2,4} & \alpha_{3,4} & \alpha_{4,4}
      \end{bmatrix} = K^T Q
  where the rows of K^T are k^1 … k^4 and the columns of Q are q^1 … q^4.
• Applying the soft-max to each column of A gives \hat{A} (entries \hat{\alpha}_{j,i}).
Self-attention
• The outputs (e.g. b^2 = \sum_i \hat{\alpha}_{2,i} v^i) are also one matrix product:
  O = [b^1 b^2 b^3 b^4] = [v^1 v^2 v^3 v^4] \hat{A} = V \hat{A}
Self-attention
• Q = W^q I,  K = W^k I,  V = W^v I
• A = K^T Q,  \hat{A} = soft-max(A)
• O = V \hat{A}
• In the end it is just a pile of matrix multiplications, which is easy to accelerate with a GPU.
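Putting it together, a minimal NumPy sketch of the whole layer as matrix multiplications, using the column-vector convention of the slides (toy sizes and random weights; the 1/\sqrt{d} scaling that the derivation above dropped for simplicity is put back in).

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Matrix-form self-attention: columns of I are the inputs a^1 ... a^n."""
    d = Wq.shape[0]
    Q, K, V = Wq @ I, Wk @ I, Wv @ I            # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(d)                    # A = K^T Q (with scaling)
    E = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat = E / E.sum(axis=0, keepdims=True)    # soft-max over each column (over the keys)
    return V @ A_hat                            # O = V A_hat; columns are b^1 ... b^n

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 4))                     # 4 tokens of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
O = self_attention(I, Wq, Wk, Wv)
print(O.shape)                                  # (8, 4)
```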
Multi-head Self-attention (2 heads as example)
• As before, q^i = W^q a^i; each head then gets its own projection: q^{i,1} = W^{q,1} q^i and q^{i,2} = W^{q,2} q^i (and likewise k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2}).
• Head 1 attends only with the head-1 vectors (q^{i,1} against k^{j,1}, weighting v^{j,1}), producing b^{i,1}.
[Figure: positions i and j each split their (q, k, v) into two heads]
Multi-head Self-attention (2 heads as example)
• Head 2 does the same with q^{i,2}, k^{j,2}, v^{j,2}, producing b^{i,2}.
[Figure: the second head attends independently of the first]
Multi-head Self-attention (2 heads as example)
• The per-head outputs are concatenated and projected back: b^i = W^O [b^{i,1} ; b^{i,2}]
[Figure: b^{i,1} and b^{i,2} are stacked and multiplied by W^O to give b^i]
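A hedged sketch of the 2-head case, following the slides' recipe (project q, k, v once, apply per-head projections, attend per head, concatenate, multiply by W^O); all matrices here are random placeholders.

```python
import numpy as np

def attend(Q, K, V):
    """Single-head attention; columns of Q, K, V correspond to positions."""
    A = K.T @ Q / np.sqrt(Q.shape[0])
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return V @ (E / E.sum(axis=0, keepdims=True))

rng = np.random.default_rng(0)
d, n, n_heads = 8, 4, 2
I = rng.normal(size=(d, n))                           # columns are a^1 ... a^n
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Wq @ I, Wk @ I, Wv @ I                      # shared q^i, k^i, v^i

head_outputs = []
for h in range(n_heads):
    # per-head projections W^{q,h}, W^{k,h}, W^{v,h} applied to q^i, k^i, v^i
    Wqh, Wkh, Wvh = (rng.normal(size=(d // n_heads, d)) for _ in range(3))
    head_outputs.append(attend(Wqh @ Q, Wkh @ K, Wvh @ V))   # columns are b^{i,h}

Wo = rng.normal(size=(d, d))                          # W^O
B = Wo @ np.vstack(head_outputs)                      # b^i = W^O [b^{i,1} ; b^{i,2}]
print(B.shape)                                        # (8, 4)
```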
Positional Encoding
• There is no position information in self-attention, so a positional vector e^i is added to the embedding: a^i + e^i.
• Original paper: each position has a unique positional vector e^i (not learned from data).
• In other words: each x^i appends a one-hot position vector p^i (1 in the i-th dim, 0 elsewhere). With W = [W^I  W^P]:
  W [x^i ; p^i] = W^I x^i + W^P p^i = a^i + e^i
  so adding e^i is equivalent to appending the one-hot position to the input.
[Figure: visualization of the positional encoding values, ranging from -1 to 1; source of image: http://jalammar.github.io/illustrated-transformer/]
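A small numeric check of that equivalence (toy sizes and random W^I, W^P; illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                               # embedding dim and sequence length (toy values)
W_I = rng.normal(size=(d, d))             # W^I multiplies the word part x^i
W_P = rng.normal(size=(d, n))             # W^P multiplies the one-hot position part p^i
W = np.hstack([W_I, W_P])                 # W = [W^I  W^P]

i = 2
x_i = rng.normal(size=d)                  # some input vector x^i
p_i = np.eye(n)[i]                        # one-hot position vector p^i (1 in the i-th dim)

lhs = W @ np.concatenate([x_i, p_i])      # W [x^i ; p^i]
rhs = W_I @ x_i + W_P @ p_i               # a^i + e^i
print(np.allclose(lhs, rhs))              # True
```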
Review: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be

Seq2seq with Attention
[Figure: a seq2seq-with-attention model in which both the Encoder (x^1 … x^4 → h^1 … h^4) and the Decoder (contexts c^1, c^2, c^3 → outputs o^1, o^2, o^3) are built from Self-Attention Layers]
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Transformer
• Using Chinese to English translation as example: the Encoder reads "機 器 學 習" ("machine learning"), and the Decoder, starting from <BOS>, generates "machine" and the following words one at a time.
[Figure: Encoder and Decoder during translation]
Transformer
[Figure: the Transformer block diagram]
• After self-attention the input a is added to the output b (residual connection), and the sum is passed through Layer Norm to give b'.
• Layer Norm (https://arxiv.org/abs/1607.06450) normalizes each single example across its features to μ = 0, σ = 1; Batch Norm (https://www.youtube.com/watch?v=BZh1ltr5Rkg) instead normalizes the same feature across a batch.
• In the decoder, the attention over the encoder output attends on the input sequence, while the Masked self-attention attends only on the already generated sequence.
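A sketch (not from the slides) of how the "Masked" self-attention can be implemented: scores for positions that have not been generated yet are set to -inf before the soft-max, so each position only attends to itself and earlier positions.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                     # scores[i, j]: how much position i attends to j

future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i (not generated yet)
scores = np.where(future, -np.inf, scores)           # block attention to future positions

E = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = E / E.sum(axis=1, keepdims=True)           # each row sums to 1; zero weight on the future
print(np.round(weights, 2))
```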
Attention Visualization
[Figure: attention weight visualizations from https://arxiv.org/abs/1706.03762]

Attention Visualization
[Figure: the encoder self-attention distribution for the word "it" from the 5th to the 6th layer of a Transformer trained on English-to-French translation (one of eight attention heads); source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]
Multi-head Attention
Example Application
• If you can use seq2seq, you can use the Transformer.
• Example: a Summarizer that takes a whole Document Set as input (https://arxiv.org/abs/1801.10198).
Universal Transformer

https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Self-Attention GAN

https://arxiv.org/abs/1805.08318
