(9,10) Transformers - 3

The document provides an overview of Transformers in Natural Language Processing, detailing their building blocks such as self-attention mechanisms and multi-head attention. It compares Transformers with traditional RNNs, highlighting advantages like parallelization and handling long sequences. Key concepts include the encoder-decoder architecture and the significance of positional encoders in maintaining word order.


Natural Language Processing Series

Transformers
Prof. Khaled Mostafa El-Sayed
[email protected]

Faculty of Computers and Artificial Intelligence


Cairo University

Aug 2020

Transformers
Processing Sequential Data

Transformer Building Blocks

Self Attention Mechanism

Multi-Head Attention

Parallelization Implementation Issues

Positional Information

Building Transformer Block

Transformer Decoder

Transformer Encoder as a Classifier

What is Next?

Processing Sequential Data

1-D Convolutional Neural Network (1-D CNN)

Recurrent Neural Network (RNN)

1-D Convolutional Neural Network

Can be Computed In Parallel

[Figure: a convolutional kernel slides over the word vectors; each layer applies a non-linearity to produce a feature map, and stacked layers cover progressively longer spans of the sequence. *** Not Practical for Very Long Sequences ***]

As the word-sequence length increases, many convolution layers must be stacked so that distant words can interact.
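The growth in required depth can be made concrete by computing the receptive field of a stack of 1-D convolutions (a minimal sketch; the kernel size k = 3 is an illustrative assumption, not taken from the slides):

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Span of input words visible to one output of a stacked 1-D CNN.

    Each additional layer widens the receptive field by (kernel_size - 1).
    """
    return 1 + num_layers * (kernel_size - 1)

# Relating words 1024 positions apart with k=3 needs ~512 stacked layers.
layers_needed = (1024 - 1) // (3 - 1) + 1
```

With k = 3, one layer sees only 3 words; covering a 1024-word span takes about 512 layers, which is why this design does not scale to very long sequences.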
Traditional RNN Seq2Seq model
[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[1] Bottleneck
The meaning of the entire input sequence is expected to be captured by a single context vector with fixed dimensionality.

[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[2] Sequential Processing
Processing can NOT be done in parallel: the hidden state for X2 must wait until X1 is processed, the state for X3 must wait until X2 is processed, and so on.

[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[3] Very Long Sequences
Vanishing gradients during backpropagation through time:
Short sequences: plain RNN
Long sequences: LSTM, GRU
Very long sequences: FAIL

[Figure: gradients Δ flowing back through encoder steps X0 … X10 shrink as the sequence grows.]
RNN Seq2Seq Model [with Attention]
[To overcome the bottleneck challenge]

At each output step, an attention mechanism computes a weighted sum of the encoder hidden states.

The decoder utilizes:
the context vector, and
the weighted sum of hidden states.

[Figure: Encoder → Attention over encoder hidden states → Decoder]
Transformer
"Attention is All You Need"

Encoder: multi-head self-attention blocks.
Decoder: multi-head self-attention blocks plus encoder-decoder attention.

The attention mechanism supports:
Parallelization of the encoder
Parallelization of the decoder (training mode only)
Very long sequence lengths

Transformer Building Blocks

Encoder – Decoder Transformer
Orientation is not the Issue

Let us focus on the Building blocks


Encoder – Decoder Transformer
Encoder Decoder

Encoder – Decoder Transformer

N cascaded blocks in both the encoder and the decoder.

Encoder-decoder attention: the output of the LAST encoder block is fed (as V and K) to ALL cascaded decoder blocks.

Inputs to an encoder block (V, K, Q): all come from the previous block.

Inputs to a decoder block (V, K, Q): (V, K) come from the encoder-decoder attention, (Q) from the previous block.
Encoder – Decoder Transformer

The decoder output at step "t-1" is supplied as the decoder input at step "t".

First decoder block: no input from the encoder attention.

Input positional encoder: position of the input word within the input sequence.
Output positional encoder: position of the output word within the output sequence.
Encoder – Decoder Transformer

Both encoder and decoder blocks contain a "Multi-Head Self Attention" block:
"h" heads, each computing scaled dot-product attention over its (V, K, Q) inputs;
the outputs of the heads are concatenated.
Encoder – Decoder Transformer

Each block also contains:
Layer normalization
A feed-forward layer
Residual connections

Self Attention Mechanism

[Recap: encoder-decoder Transformer diagram, highlighting the scaled dot-product attention inside the multi-head attention block.]
Self Attention Mechanism (Fundamental Operation)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

[Figure: dot products → softmax → weighted sum]
Similarity: w'2j = X2 · Xj (dot product of X2 with each of X1 … X4)
Probabilities: w2j = softmax over j of w'2j
Weighted sum: Y2 = Σj w2j Xj
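The three steps can be sketched directly in code (a minimal NumPy sketch; the four random word vectors and d = 8 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 words, embedding dimension d=8

def attend(i: int, X: np.ndarray) -> np.ndarray:
    """Self-attention output Y_i for word i (no projections or scaling yet)."""
    scores = X @ X[i]                          # similarity: dot with every word
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities
    return weights @ X                         # weighted sum of word vectors

Y2 = attend(1, X)                    # attention output for word X2
```

Note that the output Y2 lives in the same space as the inputs: it is a convex combination of the word vectors.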

Self Attention Mechanism (Fundamental Operation)

Remember: in general, for query word Xi, Yi = Σj softmaxj(Xi · Xj) Xj — the same computation as above with Xi = X2 and Xj ranging over all words in the sentence.
Self Attention Mechanism (Not Sensitive to Word Order)
(Self attention of word X2 w.r.t. all words in the shuffled sentence (X4, X2, X1, X3))

Shuffling the input words only permutes the similarity terms; the weighted sum Y2 is unchanged. The sequential nature of the input is IGNORED.

Mitigated later with a positional encoder.
Self Attention Mechanism (as Query and {Key-Value} Dictionary)
(Query with X2 against all keys in the dictionary and retrieve the corresponding values)

X2 acts as a query (Q) run against ALL keys (K): each key is matched with a relevance percentage (the softmax of the dot products), and the VALUES (V) of the matched keys are retrieved, weighted by their relevance, to form Y2.
Self Attention Mechanism (Input Word Plays Different Roles)

In the computation above, the same vector X2 is used in three roles:
as the Query (when X2 attends to the sentence),
as a Key (when other words are matched against X2),
as a Value (when X2 contributes to a weighted sum).

Should we use the SAME X2 vector for the three roles?
Self Attention Mechanism (Linear Transformation of Inputs)
(Query with X2 against all keys in the dictionary and retrieve the corresponding values)

Generate three variants of X2, one per role:

X2 WQ = XQ2   (the variant X2 queries with)
X2 WK = XK2   (the variant used as a KEY)
X2 WV = XV2   (the variant used as a Value)

where X2 = [X21 X22 X23 X24] is 1×d and WQ, WK, WV are d×d matrices.

WQ, WK, WV are controllable (trainable) parameters, allowing the incoming vectors to be modified to suit the three roles they must play.
Self Attention Mechanism (Linear Transformation of Inputs)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

q2 = X2 Wq
Kj = Xj Wk and Vj = Xj Wv for j = 1 … 4
Similarity: Kj · q2 ; Probabilities: softmax ; Weighted sum: Y2 = Σj softmaxj(Kj · q2) Vj

(Each X is 1×d, each W is d×d; d is the embedding length.)
Self Attention Mechanism
(Embedding Vector Dimension and Softmax Sensitivity)

As the dimension "d" of the embedding Xi (1×d) increases, the calculated weight values "w" (dot products) increase. High weight values kill the gradient and slow down learning:

softmax([1, 2, 3, 4])     = [0.0321, 0.0871, 0.2369, 0.6439]   (small "d")
softmax([5, 10, 15, 20])  = [0.0000, 0.0000, 0.0067, 0.9933]   (medium "d")
softmax([10, 20, 30, 40]) = [0.0000, 0.0000, 0.0000, 1.0000]   (large "d")
Self Attention Mechanism
(Embedding Vector Dimension and Softmax Sensitivity)

Remedy: scale down the "weight" values before applying the softmax. The scaling factor should be related to the embedding dimension "d".
Self Attention Mechanism
(Scaling Based on the Square Root of the Embedding Vector Dimension)

If every element of a d-dimensional vector V equals M, the vector's length is √(VᵀV) = M√d. Dot products therefore grow with √d even when the individual element values do not, so √d is a good choice for the scaling factor, applied before the softmax:

w'ij → softmax( w'ij / √d )

Example: d = 100, √d = 10. Scaling [10, 20, 30, 40] by 10 gives [1, 2, 3, 4], whose softmax [0.0321, 0.0871, 0.2369, 0.6439] is no longer saturated (compare the unscaled softmax [0.0000, 0.0000, 0.0000, 1.0000]).

Remember: "d" is the dimension of the embedding, NOT the length of the input sequence.
Self Attention Mechanism (Adding Scaling)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

q2 = X2 Wq ; Kj = Xj Wk ; Vj = Xj Wv
Scaling by √d: Y2 = Σj softmaxj( (Kj · q2) / √d ) Vj

(Each X is 1×d, each W is d×d; d is the embedding length.)
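Putting the projections and the scaling together for one query word (a minimal NumPy sketch; the random matrices stand in for trained Wq, Wk, Wv):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(4, d))               # sentence X1..X4, each 1 x d
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q2 = X[1] @ Wq                            # query variant of X2
K = X @ Wk                                # key variant of every word
V = X @ Wv                                # value variant of every word
weights = softmax(K @ q2 / np.sqrt(d))    # scaled similarities -> probabilities
Y2 = weights @ V                          # weighted sum of values
```

This is exactly the scaled dot-product attention of one position; the next slide shows why the same computation runs for all positions at once.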
Parallel Processing
Transform Sequence Elements in Parallel
[Calculation of the self attention of words (X1, X2, X3, X4)]

[Figure: the per-word attention computation is replicated four times, producing Y1, Y2, Y3, Y4. Each Yi depends only on the inputs X1 … X4, never on another output Yj, so all copies *** Run In Parallel ***.]

In matrix form: Q = X Wq, K = X Wk, V = X Wv, and Y = softmax(Q Kᵀ / √d) V, computed for all positions at once.
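The parallel (matrix) form can be checked against a per-word loop (a sketch; the row-wise softmax normalizes each position independently):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
X = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Batched: one matrix product handles every position simultaneously.
Y_parallel = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sequential reference: one position at a time.
Y_loop = np.stack([softmax(K @ Q[i] / np.sqrt(d)) @ V for i in range(4)])
```

The two results agree, which is the whole point: the loop over positions can be replaced by matrix products that GPUs execute in parallel.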

Multi-Head Attention

Wide Architecture

Narrow Architecture


Multi-Head Self Attention

Combining several self-attention mechanisms gives self-attention greater power of discrimination.

Each head is indexed with "r".
Each head has its own matrices Wq(r), Wk(r), Wv(r).
For input X, each attention head produces a different output vector Y(r).
All output vectors are concatenated and passed through a linear transformation to reduce the dimension back to the original X dimension.
Multi Head Self Attention of Word X2 — "Wide" Architecture

[Figure: H = 5 heads (h = 1 … 5); head r has its own full-size (d×d) matrices Wq(r), Wk(r), Wv(r) and computes its own scaled dot-product attention output Y2(r) (1×d). The H vectors Y2(r) are concatenated into a 1×dH vector, then a linear transformation T (dH×d) maps the concatenation back to Y2 (1×d). d is the embedding length.]
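The wide architecture for one query word can be sketched compactly (assumptions: H = 2 heads for brevity, random matrices in place of trained parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
d, H = 8, 2
X = rng.normal(size=(4, d))                   # sentence X1..X4

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def head_output(i, Wq, Wk, Wv):
    """Scaled dot-product attention output (1 x d) for word i in one head."""
    q = X[i] @ Wq
    K, V = X @ Wk, X @ Wv
    return softmax(K @ q / np.sqrt(d)) @ V

# Each head owns full-size d x d projection matrices ("wide").
heads = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(H)]
concat = np.concatenate([head_output(1, *h) for h in heads])   # 1 x dH
T = rng.normal(size=(d * H, d))               # linear reduction dH -> d
Y2 = concat @ T
```

The concatenation has dimension dH, so the final linear transformation T is what restores the original embedding dimension d.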
Multi Head Self Attention of Word X2 — "Narrow" Architecture

[Figure: each input vector of the sequence X1 … X4 is split into H = 5 chunks of size d/H, and chunk r is sent to head r. Head r uses small (d/H × d/H) matrices Wq(r), Wk(r), Wv(r) and produces an output Y2(r) of size 1×d/H. The H head outputs are concatenated back into Y2 (1×d), so no extra linear reduction is required. d is the embedding length.]

Parallelization
Implementation Issues

Using a FF NN for Multiplication of a Vector by a Matrix

A fully connected layer with input word X = (X1, X2, X3, X4) computes

[y1 y2 y3 y4]ᵀ = Sigmoid( [W11 W21 W31 W41; W12 W22 W32 W42; W13 W23 W33 W43; W14 W24 W34 W44] · [X1 X2 X3 X4]ᵀ + [b1 b2 b3 b4]ᵀ )

i.e. each output is yi = Sigmoid( W1i X1 + W2i X2 + W3i X3 + W4i X4 + bi ).
Using a FF NN for Multiplication of ONE Vector by a Matrix
Example: multiply the vector of the current word X by Wq (Query). {Remember: Wq is a square matrix.}

Use the same fully connected layer, but with NO non-linearity and NO bias:

[y1 y2 y3 y4]ᵀ = Linear( [W11 W21 W31 W41; W12 W22 W32 W42; W13 W23 W33 W43; W14 W24 W34 W44] · [X1 X2 X3 X4]ᵀ + [0 0 0 0]ᵀ )

With a linear activation and zero bias, the layer computes exactly the matrix-vector product.
Using a FF NN for Multiplication of Multiple Vectors by a Matrix
Example: multiply the vectors of ALL words by Wk (Key) or Wv (Value). {Wk and Wv are square matrices.}

Design the FF NN with the following considerations:
Activation: Linear
Bias: Zeros
Inputs: all words within the same paragraph (X1, X2, X3, X4, X5)

Feeding each embedding vector X1 … X5 through the same layer produces the corresponding projections Y1 … Y5 = [y1 y2 y3 y4] for every word, in one batched operation.
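The three design choices (linear activation, zero bias, batched inputs) reduce the layer to a single matrix multiplication, which can be checked directly (a NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
X = rng.normal(size=(5, d))        # five word vectors as rows
W = rng.normal(size=(d, d))        # e.g. Wk or Wv (square matrix)
b = np.zeros(d)                    # zero bias
identity = lambda z: z             # linear activation

# Dense layer applied to each word separately...
Y_layer = np.stack([identity(W @ x + b) for x in X])
# ...equals one batched matrix product over all words.
Y_batched = X @ W.T
```

This equivalence is why the "FF NN" view and the "matrix multiplication" view of the Q/K/V projections are interchangeable.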
Using a FF NN for Multiplication of a Vector by Multiple Matrices
Example: multiply the vector of the current word by multiple (different) Wq matrices (the queries of multiple heads).

The embedding vector of the current word X = [X1 X2 X3 X4] is multiplied by Wq of head 1, Wq of head 2, and Wq of head 3, yielding the corresponding projection [y1 y2 y3 y4] at each head.
Using a FF NN for Multiplication of Multiple Vectors by Multiple Matrices
Usage: multiply the vectors of ALL words by multiple (different) Wk matrices (the keys of multiple heads).

A sequence of 5 words and a network with 3 heads: each embedding vector X1 … X5 is multiplied by Wk of head 1, head 2, and head 3, producing the corresponding projection of each word at each head — all fifteen projections from one batched computation.

Adding Position Information

[1] Position Encoding

[2] Position Embedding

Pre-processing of the Input Sequence

Sequence of words with length N: Word 0, Word 1, …, Word N-1
Word embedding (e.g. word2vec): X0, X1, …, XN-1
Position information: Pos 0, Pos 1, …, Pos N-1
Final embedding: combine the word embedding and the position information (for each word): I/P 0, I/P 1, …, I/P N-1
Input to the encoder: these vectors feed the first self-attention layer in the Transformer.
[1] Position Encoding

Position information: 0, 1, 2, …, N-1.
A positional encoder is a function mapping each position to a real-valued vector (one vector per position), with vector length equal to the word-embedding length.

The function may have fixed parameters, or TRAINABLE parameters.
Suggested Positional Encoders
(Use the Word Index as the Position Encoding)

Encode each position as its index: 0, 1, 2, …, N-1 (step δ = 1). The first position encoding is 0 and the last is N-1 (it depends on the sequence length).

Problems:
If N is large (e.g. sequence length = 1024), large position values will dominate when the position encoding is combined with the word embedding.
If the system is trained with max sequence length = 256, how can it deal with larger position values?
The position encoding is a scalar while the word embedding is a vector (how to combine them?)
Suggested Positional Encoders
(Normalize the Sequence Length to "1")

Encode positions as 0, δ, 2δ, …, 1-δ, 1 with δ = 1 / sequence length. The first position encoding is 0 and the last is 1 (independent of the sequence length).

Problems:
The δ value depends on the sequence length (δ is small for a long sequence and large for a short one). In other words, the encoding of position #4 in a short sequence differs from position #4 in a long sequence, though they are the same position.
The position encoding is a scalar while the word embedding is a vector (how to combine them?)
Suggested Positional Encoders
(Rules Controlling a Good Positional Encoder)

[1] The δ value should be the same for long and short sequences (the same positions are encoded with the same values irrespective of sequence length).
[2] The range of position-encoded values should NOT depend on the sequence length (a predefined fixed range, e.g. from 0 to 1, or from -1 to 1, …).
[3] The position encoding should be a VECTOR (not a scalar) of the same dimension as the word embedding (in order to easily combine the position-encoding and word-encoding vectors).
Sine/Cosine Encoders
(an industrial application: motor control systems)

Incremental rotary position encoders are used in many applications to measure angular position and speed. They produce sin/cos output signals — 1-Vpp differential analog signals over an EMC-compliant interface — from which the position (angle) is recovered.

For more info: "Extraction of High Resolution Position Information from Sinusoidal Encoders", J. Burke, J. F. Moynihan, K. Unterkofler.
Positions Defined as Angles

Encode position t (t = 0, 1, 2, 3, …) as the 2-D vector (sin(ωt), cos(ωt)): each word of the sequence sits at an angle on the unit circle, ω radians apart.

[Figure: sequences of 5, 9, and 12 words placed around the circle as Word 0, Word 1, … at successive multiples of ω.]

Using a smaller ω value allows representation of longer sequences before positions wrap around (practically, ω = 1/10000 is used).
Vector of position "t": (sin(ωt), cos(ωt)). Vector of position "t+δ": (sin(ω(t+δ)), cos(ω(t+δ))).

Using sin(a+b) = sin(a) cos(b) + cos(a) sin(b) and cos(a+b) = cos(a) cos(b) - sin(a) sin(b):

sin(ω(t+δ)) = sin(ωt + ωδ) = sin(ωt) cos(ωδ) + cos(ωt) sin(ωδ)
cos(ω(t+δ)) = cos(ωt + ωδ) = cos(ωt) cos(ωδ) - sin(ωt) sin(ωδ)

In matrix form:

[ sin(ω(t+δ)) ]   [  cos(ωδ)  sin(ωδ) ] [ sin(ωt) ]
[ cos(ω(t+δ)) ] = [ -sin(ωδ)  cos(ωδ) ] [ cos(ωt) ]

From computer graphics: this is the rotation matrix rotating the vector at position "t" by the angle "ωδ" (e.g. "ω", "2ω", "3ω", …) — and the matrix is independent of the position "t".
Squared Distance Between the Vectors of Positions "t+δ" and "t" [Two Consecutive Positions]

Let a = ωt and b = ωδ; recall sin²(a) + cos²(a) = 1.

Vector of pos "t+δ": ( sin(a) cos(b) + cos(a) sin(b),  -sin(a) sin(b) + cos(a) cos(b) )
Vector of pos "t":   ( sin(a), cos(a) )

[ sin(a) cos(b) + cos(a) sin(b) - sin(a) ]² + [ -sin(a) sin(b) + cos(a) cos(b) - cos(a) ]²
= [ sin(a)(cos(b)-1) + cos(a) sin(b) ]² + [ -sin(a) sin(b) + cos(a)(cos(b)-1) ]²
= sin²(a)(cos(b)-1)² + cos²(a) sin²(b) + 2 sin(a) cos(a) sin(b)(cos(b)-1)
  + sin²(a) sin²(b) + cos²(a)(cos(b)-1)² - 2 sin(a) cos(a) sin(b)(cos(b)-1)
= [ sin²(a) + cos²(a) ](cos(b)-1)² + [ sin²(a) + cos²(a) ] sin²(b)
= (cos(b)-1)² + sin²(b)
= cos²(b) - 2 cos(b) + 1 + sin²(b) = 2 - 2 cos(b) = 2(1 - cos(ωδ))

Irrespective of position "t".
Sin/Cos Positional Encoder
A vector of dimension "2" to define position: (sin(ωt), cos(ωt)), where "t" is the step (position) and ω is the frequency of the trigonometric functions sine/cosine.

This vector can define the word position at step "t", t = 0, 1, 2, 3, 4, 5, ….
Values are between -1 and 1 (bounded values).
The distance between the vectors representing positions "t" and "t+1" is constant irrespective of "t" [Distance(Pos1, Pos2) = Distance(Pos90, Pos91)].
"t" may take any positive value (no limitation on the length of the sequence).
Remaining problem: the position vector has length 2, so it cannot be combined with the word-embedding vector.
Sin/Cos Positional Encoder
(Positional Vector with the Same Dimension as the Embedding Dimension "d")

Build the position vector from sin/cos pairs (d must be even):
Number of vector pairs K = d/2, with pair index k = 0, 1, …, d/2-1.
Element index i = 0, 1, …, d-1: an even index i = 2k holds the sine term sin(ωk t); an odd index i = 2k+1 holds the cosine term cos(ωk t).
Use a different "ω" value for each pair, selected by the pair index k:

ωk = 1 / 10000^(2k/d)

Largest "ω" value = 1 (used in the first pair); smallest "ω" value ≈ 1/10000 (used in the last pair).

Example: d = 100 → K = 50 pairs, k = 0 … 49, i = 0 … 99, sine terms at even indices [0, 2, 4, …], cosine terms at odd indices [1, 3, 5, …].

Note: the "ω" value depends on the embedding dimension "d" ONLY; it does NOT depend on the sequence length.
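The full d-dimensional encoder described above, as a NumPy sketch:

```python
import numpy as np

def positional_encoding(t: int, d: int) -> np.ndarray:
    """Sin/cos position vector for step t, embedding dimension d (even).

    Element 2k is sin(w_k t) and element 2k+1 is cos(w_k t),
    with w_k = 1 / 10000**(2k/d).
    """
    k = np.arange(d // 2)
    w = 1.0 / 10000 ** (2 * k / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(w * t)   # even indices: sine terms
    pe[1::2] = np.cos(w * t)   # odd indices: cosine terms
    return pe

pe0 = positional_encoding(0, 100)   # position 0: sines are 0, cosines are 1
```

Every element is bounded in [-1, 1] and the vector matches the embedding dimension, satisfying all three rules for a good positional encoder.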
[2] Position Embedding
(Similar to Word Embedding)

The position information 0, 1, 2, …, N-1 is treated like tokens (similar to word tokenization): an embedding layer learns the position vectors (it is trained on the position values), producing real-valued vectors VectorP0 … VectorPN-1 (one vector per position), with vector length equal to the word-embedding length.

Limitation: positions > N-1 can NOT be embedded — the embedding must be trained on the maximum expected sequence length.


Building Transformer Block

Adding
Residual Connections
Layer Normalization
Feed Forward Network


Multi Head Self Attention Block

Input:
Word embedding + positional information [first Transformer block]
Output of the previous Transformer block [stacked Transformer blocks]

Output:
Encoded vectors of the input sequence [with the same dimension as the input vectors]

Remember:
Different input sequences should be of the same length (fixed-length sequences) [apply padding or cutting].
The output sequence is of the same length as the input sequence: the block maps X0 … XN-1 to Y0 … YN-1.
[68]
Transformer Blocks (stacked over Multi-Head Attention)

  [Figure: the encoder–decoder Transformer architecture, built from Multi-Head
  Attention blocks; each Multi-Head Attention block contains Scaled Dot-Product
  Attention heads fed by V, K, and Q inputs.]
[69]
Residual Connections and Normalization

Adding a Residual Connection from:
 Word Embedding + Positional Information [First Transformer Block]
 Output of the Previous Transformer Block [Stacked Transformer Blocks]

Adding “LAYER” Normalization (NOT Batch Normalization):
 Output: encoded vectors of the input sequence [with the SAME dimension as the input vectors]

  [Diagram: the inputs X0 … XN−1 are added element-wise (residual connections) to the
  outputs of the Multi-Head Self-Attention Block, then Layer Normalization is applied
  to produce Y0 … YN−1.]
[70]
Normalization
The behavior of machine learning algorithms changes when the input distribution changes
{Covariate Shift}.
 As layer weights change during the training steps, the outputs of each layer’s
 activation functions — which are the inputs to the next layer — change as well,
 so the next layer sees a different input distribution.

Conclusion:
 The input distribution of each layer changes with each training step.

The basic idea behind Normalization is to:
 Limit covariate shift by normalizing the activations of each layer.
 Normalization: transforming the inputs to be zero mean and unit variance.

Normalization allows each layer to:
 Learn on a more stable distribution of inputs.
 Accelerate the training of the network (no need to stick with very small learning rates).
[71]
Normalization
Instead of restricting the activations of each layer to be strictly zero mean and unit variance,
Normalization allows the network to learn parameters γ and β that can shift the mean
and variance to any values that minimize the Loss.

[email protected]
[72]
Normalization
Normalization can be applied to the Activations or to the Weights:

 Normalization of Activations:
  Batch Normalization
  Instance Normalization
  Group Normalization
  Layer Normalization

 Normalization of Weights:
  Weight Normalization
[73]

Batch Data (Feature Vectors)

  [Figure: feature-vector layout along three axes — the feature vector of one input
  sample (using one filter), the feature vectors of one input sample using 4 filters
  [4 channels], and the feature vectors of FIVE input samples [mini-batch]. Axes:
  feature vector, channels, and batch size.]
[74]
Batch Normalization
 Calculate (then normalize with) μ, σ computed over ALL samples in the batch, for each channel.
 Batch Normalization calculates the mean and variance of each mini-batch instead of
 calculating the mean and variance of the whole data.
 The mean and variance differ from one mini-batch to another (a source of error).
 The mini-batch size should be large enough to minimize this “batching” effect (a constraint).
[75]
Layer, Group, and Instance Normalization (Sample-Based Normalization)

 Instance Normalization: calculate (then normalize with) μ, σ for ONE channel
  in ONE sample (the output of ONE filter).
 Group Normalization: the same, computed over a GROUP of channels (the general case).
 Layer Normalization: calculate (then normalize with) μ, σ for ALL channels in the
  same sample (the inputs to the same layer) — the Transformer choice.
[76]
Feed-Forward Neural Network
A Feed-Forward Network is applied to Each Vector:
 One hidden layer with ReLU activation [size: at least double the input dimension].
 Linear-activation output layer [size: same as the input vector dimension].

  [Diagram: each output vector Y0 … YN−1 of the residual + Layer Normalization stage
  is passed through the same Feed-Forward Network (one hidden layer with ReLU),
  producing vectors of the same dimension.]
[77]
  [Diagram: a second residual connection adds the Feed-Forward Network’s input to its
  output, followed by Layer Normalization (NOT Batch Normalization), producing the
  Final Output Y0 … YN−1. The full stack so far: input X0 … XN−1 (word embedding +
  positional information) → Multi-Head Self-Attention Block → Add (residual) & Layer
  Normalization → Feed-Forward Networks (one hidden layer with ReLU) → Add (residual)
  & Layer Normalization.]
[78]
  [Diagram: the complete Encoder Block — Block Input X0 … XN−1 → Multi-Head
  Self-Attention Block → Add (residual) & Layer Normalization → Feed-Forward Networks
  (one hidden layer with ReLU) → Add (residual) & Layer Normalization → Block Output
  Y0 … YN−1.]
[79]
  [Diagram: the same Encoder Block, with the Multi-Head Self-Attention Block’s
  V, K, and Q inputs all taken from the Block Input X0 … XN−1 (self-attention).]
Transformer Stacked Encoder Blocks [80]

  [Diagram: the Transformer Encoder — the Input Sequence Word 0 … Word N−1 passes
  through Word Embedding plus Positional Encoding/Embedding to give the Encoder Input
  X0 … XN−1, then through a stack of Encoder Blocks (Block 0, Block 1, Block 2, …,
  Block M) to give the Encoder Output Y0 … YN−1.]
[81]

Transformer Decoder

[email protected]
[82]
Encoder–Decoder Transformer

  [Figure: the full encoder–decoder architecture. The Decoder reuses the encoder’s
  building blocks (multi-head attention, residual connections, layer normalization,
  feed-forward networks), with two changes: a Mask is added to its self-attention,
  and an Encoder–Decoder Attention block is inserted, taking K and V from the
  Encoder Output and Q from the decoder.]
[83]
  [Figure: one Decoder Block. From bottom to top: Word and Position Encoding of the
  shifted target sequence (SOS, Je, me, sens, malade) → Masked Multi-Head
  Self-Attention Block → Residual Connection and Layer Normalization → Multi-Head
  Encoder–Decoder Attention Block (Q from the decoder; K and V from the Transformer
  Encoder, which encodes the source sentence “I feel Sick”) → Residual Connection
  and Layer Normalization → Feed-Forward Networks (one hidden layer with ReLU) →
  Residual Connection and Layer Normalization → outputs Y0, Y1, Y2, ….]
[84]
Self-Attention Look-Ahead Mask
(Used in the Decoder’s first layer)

Mask matrix (rows: query word; columns: key word):

            SOS    Je     me     sens   malade
  SOS        0     -∞     -∞     -∞     -∞
  Je         0      0     -∞     -∞     -∞
  me         0      0      0     -∞     -∞
  sens       0      0      0      0     -∞
  malade     0      0      0      0      0

 Mask values are ADDED to the masked (scaled) similarity values before the Softmax.
 Adding “0” does not change the original values, so it has no effect on the Softmax output.
 Adding “-∞” gives a “zero” output from the Softmax, leaving zero attention scores for future words.
[85]
Self-Attention Mechanism for the Decoder (Adding the Mask)

  [Figure: the scaled dot-product pipeline for query word Z2 over Z1 … Z4. Each Zi is
  projected to a key Ki = Zi·Wk(r) and a value Vi = Zi·WV(r); Z2 is projected to the
  query q2 = Z2·Wq(r) (each W is d×d, where d is the embedding length). The similarities
  K·q are scaled by √d, the Look-Ahead Mask is added (0 for Z1, Z2; −∞ for the future
  words Z3, Z4), the Softmax converts them to probabilities, and the weighted SUM of
  the values gives the output Z2 — with zero weight on the masked future words.]
[86]
Encoder–Decoder Attention Mechanism (In the Decoder)
(Attention of word Z2 w.r.t. ALL words of the encoder output: Y1, Y2, Y3, Y4)

  [Figure: the same scaled dot-product pipeline, but the keys Ki = Yi·Wk(r) and values
  Vi = Yi·WV(r) come From the Encoder Output, while the query q2 = Z2·Wq(r) comes From
  the Decoder Self-Attention Layer (each W is d×d, where d is the embedding length).
  The similarities K·q are scaled by √d, passed through the Softmax, and used to form
  the weighted SUM O2 of the encoder values.]
[87]
Decoder Final Output
 The output of the Last Decoder Block is passed through a Linear Projection to a
 vector of length Vocab Size, then through a Softmax, giving the probability of
 each word in the Vocabulary.
[88]
  [Figure: the complete Encoder–Decoder Transformer. Left, the Encoder: input words
  Word 0 … Word N−1 pass through Word Embedding plus Positional Encoding/Embedding
  to give X0 … XN−1, then through stacked Encoder Blocks (Block 0 … Block M),
  producing Y0 … YN−1. Right, the Decoder: the shifted target (SOS Je me sens malade)
  passes through Word and Position Encoding, then through a matching stack of decoder
  blocks, each containing a Masked Multi-Head Self-Attention Block, an Encoder–Decoder
  Attention Block (Q from the decoder; K and V from the encoder output), and
  Feed-Forward Networks (one hidden layer with ReLU), each followed by a Residual
  Connection and Layer Normalization; a final Softmax produces the output.]
[89]

Transformer Encoder
as a Classifier

[email protected]
Output Classes: C1, C2, C3 [90]

  [Diagram: the Transformer Encoder as a classifier. The Input Sequence (word-embedding
  vectors X0 … XN−1 plus position-information vectors VectorP0 … VectorPN−1) passes
  through stacked Transformer Blocks, giving the Output Sequence Y0 … YN−1. Global
  Average Pooling reduces it to a Single vector representing the whole sequence, and
  a Softmax layer maps that vector to the output classes.]
[91]
What is Next

 GPT-1 and GPT-2 (Generative Pre-trained Transformers)
  Radford et al., OpenAI, June 2018 and February 2019.
 BERT (Bidirectional Encoder Representations from Transformers)
  Devlin et al., Google AI Language, May 2019.
 RoBERTa (Robustly Optimized BERT Approach)
  Liu et al., Facebook AI, June 2019.
 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  Zihang Dai et al., Carnegie Mellon University and Google Brain.
 T5 (Text-to-Text Transfer Transformer)
 GPT-3 (175 Billion Parameters, Beta Release June 2020)
[92]

Thank You

[email protected]
