(9,10) Transformers - 3

The document provides an overview of Transformers in Natural Language Processing, detailing their building blocks such as self-attention mechanisms and multi-head attention. It compares Transformers with traditional RNNs, highlighting advantages like parallelization and handling long sequences. Key concepts include the encoder-decoder architecture and the significance of positional encoders in maintaining word order.


Natural Language Processing Series

Transformers
Prof. Khaled Mostafa El-Sayed
[email protected]

Faculty of Computers and Artificial Intelligence


Cairo University

Aug 2020

Transformers
Processing Sequential Data

Transformer Building Blocks

Self Attention Mechanism

Multi-Head Attention

Parallelization Implementation Issues

Positional Information

Building Transformer Block

Transformer Decoder

Transformer Encoder as a Classifier

What is Next?

Processing Sequential Data

1-D Convolutional Neural Network (1-D CNN)

Recurrent Neural Network (RNN)

1-D Convolutional Neural Network

Can be Computed In Parallel

[Figure: a convolutional kernel slides over the word vectors; each layer applies a non-linearity to produce a feature map, and stacked layers cover progressively longer spans of the sequence. *** Not Practical for Very Long Sequences ***]

As the word-sequence length increases, many convolution layers must be stacked so that distant words can interact.
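The growth in required depth can be made concrete by computing the receptive field of a stack of 1-D convolutions (a minimal sketch; the kernel size k = 3 is an illustrative assumption, not taken from the slides):

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Span of input words visible to one output of a stacked 1-D CNN.

    Each additional layer widens the receptive field by (kernel_size - 1).
    """
    return 1 + num_layers * (kernel_size - 1)

# Relating words 1024 positions apart with k=3 needs ~512 stacked layers.
layers_needed = (1024 - 1) // (3 - 1) + 1
```

With k = 3, one layer sees only 3 words; covering a 1024-word span takes about 512 layers, which is why this design does not scale to very long sequences.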
Traditional RNN Seq2Seq model
[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[1] Bottleneck
The meaning of the entire input sequence is expected to be captured by a single context vector with fixed dimensionality.

[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[2] Sequential Processing
Processing can NOT be done in parallel: the hidden state for X2 must wait until X1 is processed, the state for X3 must wait until X2 is processed, and so on.

[Figure: Encoder (RNN) → Context Vector (of whole input) → Decoder (RNN)]
Traditional RNN Seq2Seq Model (Challenges)

[3] Very Long Sequences
Vanishing gradients during backpropagation through time:
Short sequences: plain RNN
Long sequences: LSTM, GRU
Very long sequences: FAIL

[Figure: gradients Δ flowing back through encoder steps X0 … X10 shrink as the sequence grows.]
RNN Seq2Seq Model [with Attention]
[To overcome the bottleneck challenge]

At each output step, an attention mechanism computes a weighted sum of the encoder hidden states.

The decoder utilizes:
the context vector, and
the weighted sum of hidden states.

[Figure: Encoder → Attention over encoder hidden states → Decoder]
Transformer
"Attention is All You Need"

Encoder: multi-head self-attention blocks.
Decoder: multi-head self-attention blocks plus encoder-decoder attention.

The attention mechanism supports:
Parallelization of the encoder
Parallelization of the decoder (training mode only)
Very long sequence lengths

Transformer Building Blocks

Encoder – Decoder Transformer
Orientation is not the Issue

Let us focus on the Building blocks


Encoder – Decoder Transformer
Encoder Decoder

Encoder – Decoder Transformer

N cascaded blocks in both the encoder and the decoder.

Encoder-decoder attention: the output of the LAST encoder block is fed (as V and K) to ALL cascaded decoder blocks.

Inputs to an encoder block (V, K, Q): all come from the previous block.

Inputs to a decoder block (V, K, Q): (V, K) come from the encoder-decoder attention, (Q) from the previous block.
Encoder – Decoder Transformer

The decoder output at step "t-1" is supplied as the decoder input at step "t".

First decoder block: no input from the encoder attention.

Input positional encoder: position of the input word within the input sequence.
Output positional encoder: position of the output word within the output sequence.
Encoder – Decoder Transformer

Both encoder and decoder blocks contain a "Multi-Head Self Attention" block:
"h" heads, each computing scaled dot-product attention over its (V, K, Q) inputs;
the outputs of the heads are concatenated.
Encoder – Decoder Transformer

Each block also contains:
Layer normalization
A feed-forward layer
Residual connections

Self Attention Mechanism

[Recap: encoder-decoder Transformer diagram, highlighting the scaled dot-product attention inside the multi-head attention block.]
Self Attention Mechanism (Fundamental Operation)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

[Figure: dot products → softmax → weighted sum]
Similarity: w'2j = X2 · Xj (dot product of X2 with each of X1 … X4)
Probabilities: w2j = softmax over j of w'2j
Weighted sum: Y2 = Σj w2j Xj
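The three steps can be sketched directly in code (a minimal NumPy sketch; the four random word vectors and d = 8 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 words, embedding dimension d=8

def attend(i: int, X: np.ndarray) -> np.ndarray:
    """Self-attention output Y_i for word i (no projections or scaling yet)."""
    scores = X @ X[i]                          # similarity: dot with every word
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities
    return weights @ X                         # weighted sum of word vectors

Y2 = attend(1, X)                    # attention output for word X2
```

Note that the output Y2 lives in the same space as the inputs: it is a convex combination of the word vectors.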

Self Attention Mechanism (Fundamental Operation)

Remember: in general, for query word Xi, Yi = Σj softmaxj(Xi · Xj) Xj — the same computation as above with Xi = X2 and Xj ranging over all words in the sentence.
Self Attention Mechanism (Not Sensitive to Word Order)
(Self attention of word X2 w.r.t. all words in the shuffled sentence (X4, X2, X1, X3))

Shuffling the input words only permutes the similarity terms; the weighted sum Y2 is unchanged. The sequential nature of the input is IGNORED.

Mitigated later with a positional encoder.
Self Attention Mechanism (as Query and {Key-Value} Dictionary)
(Query with X2 against all keys in the dictionary and retrieve the corresponding values)

X2 acts as a query (Q) run against ALL keys (K): each key is matched with a relevance percentage (the softmax of the dot products), and the VALUES (V) of the matched keys are retrieved, weighted by their relevance, to form Y2.
Self Attention Mechanism (Input Word Plays Different Roles)

In the computation above, the same vector X2 is used in three roles:
as the Query (when X2 attends to the sentence),
as a Key (when other words are matched against X2),
as a Value (when X2 contributes to a weighted sum).

Should we use the SAME X2 vector for the three roles?
Self Attention Mechanism (Linear Transformation of Inputs)
(Query with X2 against all keys in the dictionary and retrieve the corresponding values)

Generate three variants of X2, one per role:

X2 WQ = XQ2   (the variant X2 queries with)
X2 WK = XK2   (the variant used as a KEY)
X2 WV = XV2   (the variant used as a Value)

where X2 = [X21 X22 X23 X24] is 1×d and WQ, WK, WV are d×d matrices.

WQ, WK, WV are controllable (trainable) parameters, allowing the incoming vectors to be modified to suit the three roles they must play.
Self Attention Mechanism (Linear Transformation of Inputs)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

q2 = X2 Wq
Kj = Xj Wk and Vj = Xj Wv for j = 1 … 4
Similarity: Kj · q2 ; Probabilities: softmax ; Weighted sum: Y2 = Σj softmaxj(Kj · q2) Vj

(Each X is 1×d, each W is d×d; d is the embedding length.)
Self Attention Mechanism
(Embedding Vector Dimension and Softmax Sensitivity)

As the dimension "d" of the embedding Xi (1×d) increases, the calculated weight values "w" (dot products) increase. High weight values kill the gradient and slow down learning:

softmax([1, 2, 3, 4])     = [0.0321, 0.0871, 0.2369, 0.6439]   (small "d")
softmax([5, 10, 15, 20])  = [0.0000, 0.0000, 0.0067, 0.9933]   (medium "d")
softmax([10, 20, 30, 40]) = [0.0000, 0.0000, 0.0000, 1.0000]   (large "d")
Self Attention Mechanism
(Embedding Vector Dimension and Softmax Sensitivity)

Remedy: scale down the "weight" values before applying the softmax. The scaling factor should be related to the embedding dimension "d".
Self Attention Mechanism
(Scaling Based on the Square Root of the Embedding Vector Dimension)

If every element of a d-dimensional vector V equals M, the vector's length is √(VᵀV) = M√d. Dot products therefore grow with √d even when the individual element values do not, so √d is a good choice for the scaling factor, applied before the softmax:

w'ij → softmax( w'ij / √d )

Example: d = 100, √d = 10. Scaling [10, 20, 30, 40] by 10 gives [1, 2, 3, 4], whose softmax [0.0321, 0.0871, 0.2369, 0.6439] is no longer saturated (compare the unscaled softmax [0.0000, 0.0000, 0.0000, 1.0000]).

Remember: "d" is the dimension of the embedding, NOT the length of the input sequence.
Self Attention Mechanism (Adding Scaling)
(Self attention of word X2 w.r.t. all words in the sentence (X1, X2, X3, X4))

q2 = X2 Wq ; Kj = Xj Wk ; Vj = Xj Wv
Scaling by √d: Y2 = Σj softmaxj( (Kj · q2) / √d ) Vj

(Each X is 1×d, each W is d×d; d is the embedding length.)
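Putting the projections and the scaling together for one query word (a minimal NumPy sketch; the random matrices stand in for trained Wq, Wk, Wv):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(4, d))               # sentence X1..X4, each 1 x d
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q2 = X[1] @ Wq                            # query variant of X2
K = X @ Wk                                # key variant of every word
V = X @ Wv                                # value variant of every word
weights = softmax(K @ q2 / np.sqrt(d))    # scaled similarities -> probabilities
Y2 = weights @ V                          # weighted sum of values
```

This is exactly the scaled dot-product attention of one position; the next slide shows why the same computation runs for all positions at once.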
Parallel Processing
Transform Sequence Elements in Parallel
[Calculation of the self attention of words (X1, X2, X3, X4)]

[Figure: the per-word attention computation is replicated four times, producing Y1, Y2, Y3, Y4. Each Yi depends only on the inputs X1 … X4, never on another output Yj, so all copies *** Run In Parallel ***.]

In matrix form: Q = X Wq, K = X Wk, V = X Wv, and Y = softmax(Q Kᵀ / √d) V, computed for all positions at once.
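The parallel (matrix) form can be checked against a per-word loop (a sketch; the row-wise softmax normalizes each position independently):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
X = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Batched: one matrix product handles every position simultaneously.
Y_parallel = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sequential reference: one position at a time.
Y_loop = np.stack([softmax(K @ Q[i] / np.sqrt(d)) @ V for i in range(4)])
```

The two results agree, which is the whole point: the loop over positions can be replaced by matrix products that GPUs execute in parallel.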

Multi-Head Attention

Wide Architecture

Narrow Architecture


Multi-Head Self Attention

Combining several self-attention mechanisms gives self-attention greater power of discrimination.

Each head is indexed with "r".
Each head has its own matrices Wq(r), Wk(r), Wv(r).
For input X, each attention head produces a different output vector Y(r).
All output vectors are concatenated and passed through a linear transformation to reduce the dimension back to the original X dimension.
Multi Head Self Attention of Word X2 — "Wide" Architecture

[Figure: H = 5 heads (h = 1 … 5); head r has its own full-size (d×d) matrices Wq(r), Wk(r), Wv(r) and computes its own scaled dot-product attention output Y2(r) (1×d). The H vectors Y2(r) are concatenated into a 1×dH vector, then a linear transformation T (dH×d) maps the concatenation back to Y2 (1×d). d is the embedding length.]
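The wide architecture for one query word can be sketched compactly (assumptions: H = 2 heads for brevity, random matrices in place of trained parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
d, H = 8, 2
X = rng.normal(size=(4, d))                   # sentence X1..X4

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def head_output(i, Wq, Wk, Wv):
    """Scaled dot-product attention output (1 x d) for word i in one head."""
    q = X[i] @ Wq
    K, V = X @ Wk, X @ Wv
    return softmax(K @ q / np.sqrt(d)) @ V

# Each head owns full-size d x d projection matrices ("wide").
heads = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(H)]
concat = np.concatenate([head_output(1, *h) for h in heads])   # 1 x dH
T = rng.normal(size=(d * H, d))               # linear reduction dH -> d
Y2 = concat @ T
```

The concatenation has dimension dH, so the final linear transformation T is what restores the original embedding dimension d.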
Multi Head Self Attention of Word X2 — "Narrow" Architecture

[Figure: each input vector of the sequence X1 … X4 is split into H = 5 chunks of size d/H, and chunk r is sent to head r. Head r uses small (d/H × d/H) matrices Wq(r), Wk(r), Wv(r) and produces an output Y2(r) of size 1×d/H. The H head outputs are concatenated back into Y2 (1×d), so no extra linear reduction is required. d is the embedding length.]

Parallelization
Implementation Issues

Using a FF NN for Multiplication of a Vector by a Matrix

A fully connected layer with input word X = (X1, X2, X3, X4) computes

[y1 y2 y3 y4]ᵀ = Sigmoid( [W11 W21 W31 W41; W12 W22 W32 W42; W13 W23 W33 W43; W14 W24 W34 W44] · [X1 X2 X3 X4]ᵀ + [b1 b2 b3 b4]ᵀ )

i.e. each output is yi = Sigmoid( W1i X1 + W2i X2 + W3i X3 + W4i X4 + bi ).
Using a FF NN for Multiplication of ONE Vector by a Matrix
Example: multiply the vector of the current word X by Wq (Query). {Remember: Wq is a square matrix.}

Use the same fully connected layer, but with NO non-linearity and NO bias:

[y1 y2 y3 y4]ᵀ = Linear( [W11 W21 W31 W41; W12 W22 W32 W42; W13 W23 W33 W43; W14 W24 W34 W44] · [X1 X2 X3 X4]ᵀ + [0 0 0 0]ᵀ )

With a linear activation and zero bias, the layer computes exactly the matrix-vector product.
Using a FF NN for Multiplication of Multiple Vectors by a Matrix
Example: multiply the vectors of ALL words by Wk (Key) or Wv (Value). {Wk and Wv are square matrices.}

Design the FF NN with the following considerations:
Activation: Linear
Bias: Zeros
Inputs: all words within the same paragraph (X1, X2, X3, X4, X5)

Feeding each embedding vector X1 … X5 through the same layer produces the corresponding projections Y1 … Y5 = [y1 y2 y3 y4] for every word, in one batched operation.
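The three design choices (linear activation, zero bias, batched inputs) reduce the layer to a single matrix multiplication, which can be checked directly (a NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
X = rng.normal(size=(5, d))        # five word vectors as rows
W = rng.normal(size=(d, d))        # e.g. Wk or Wv (square matrix)
b = np.zeros(d)                    # zero bias
identity = lambda z: z             # linear activation

# Dense layer applied to each word separately...
Y_layer = np.stack([identity(W @ x + b) for x in X])
# ...equals one batched matrix product over all words.
Y_batched = X @ W.T
```

This equivalence is why the "FF NN" view and the "matrix multiplication" view of the Q/K/V projections are interchangeable.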
Using a FF NN for Multiplication of a Vector by Multiple Matrices
Example: multiply the vector of the current word by multiple (different) Wq matrices (the queries of multiple heads).

The embedding vector of the current word X = [X1 X2 X3 X4] is multiplied by Wq of head 1, Wq of head 2, and Wq of head 3, yielding the corresponding projection [y1 y2 y3 y4] at each head.
Using a FF NN for Multiplication of Multiple Vectors by Multiple Matrices
Usage: multiply the vectors of ALL words by multiple (different) Wk matrices (the keys of multiple heads).

A sequence of 5 words and a network with 3 heads: each embedding vector X1 … X5 is multiplied by Wk of head 1, head 2, and head 3, producing the corresponding projection of each word at each head — all fifteen projections from one batched computation.

Adding Position Information

[1] Position Encoding

[2] Position Embedding

Pre-processing of the Input Sequence

Sequence of words with length N: Word 0, Word 1, …, Word N-1
Word embedding (e.g. word2vec): X0, X1, …, XN-1
Position information: Pos 0, Pos 1, …, Pos N-1
Final embedding: combine the word embedding and the position information (for each word): I/P 0, I/P 1, …, I/P N-1
Input to the encoder: these vectors feed the first self-attention layer in the Transformer.
[1] Position Encoding

Position information: 0, 1, 2, …, N-1.
A positional encoder is a function mapping each position to a real-valued vector (one vector per position), with vector length equal to the word-embedding length.

The function may have fixed parameters, or TRAINABLE parameters.
Suggested Positional Encoders
(Use the Word Index as the Position Encoding)

Encode each position as its index: 0, 1, 2, …, N-1 (step δ = 1). The first position encoding is 0 and the last is N-1 (it depends on the sequence length).

Problems:
If N is large (e.g. sequence length = 1024), large position values will dominate when the position encoding is combined with the word embedding.
If the system is trained with max sequence length = 256, how can it deal with larger position values?
The position encoding is a scalar while the word embedding is a vector (how to combine them?)
Suggested Positional Encoders
(Normalize the Sequence Length to "1")

Encode positions as 0, δ, 2δ, …, 1-δ, 1 with δ = 1 / sequence length. The first position encoding is 0 and the last is 1 (independent of the sequence length).

Problems:
The δ value depends on the sequence length (δ is small for a long sequence and large for a short one). In other words, the encoding of position #4 in a short sequence differs from position #4 in a long sequence, though they are the same position.
The position encoding is a scalar while the word embedding is a vector (how to combine them?)
Suggested Positional Encoders
(Rules Controlling a Good Positional Encoder)

[1] The δ value should be the same for long and short sequences (the same positions are encoded with the same values irrespective of sequence length).
[2] The range of position-encoded values should NOT depend on the sequence length (a predefined fixed range, e.g. from 0 to 1, or from -1 to 1, …).
[3] The position encoding should be a VECTOR (not a scalar) of the same dimension as the word embedding (in order to easily combine the position-encoding and word-encoding vectors).
Sine/Cosine Encoders
(an industrial application: motor control systems)

Incremental rotary position encoders are used in many applications to measure angular position and speed. They produce sin/cos output signals — 1-Vpp differential analog signals over an EMC-compliant interface — from which the position (angle) is recovered.

For more info: "Extraction of High Resolution Position Information from Sinusoidal Encoders", J. Burke, J. F. Moynihan, K. Unterkofler.
Positions Defined as Angles

Encode position t (t = 0, 1, 2, 3, …) as the 2-D vector (sin(ωt), cos(ωt)): each word of the sequence sits at an angle on the unit circle, ω radians apart.

[Figure: sequences of 5, 9, and 12 words placed around the circle as Word 0, Word 1, … at successive multiples of ω.]

Using a smaller ω value allows representation of longer sequences before positions wrap around (practically, ω = 1/10000 is used).
Vector of position "t": (sin(ωt), cos(ωt)). Vector of position "t+δ": (sin(ω(t+δ)), cos(ω(t+δ))).

Using sin(a+b) = sin(a) cos(b) + cos(a) sin(b) and cos(a+b) = cos(a) cos(b) - sin(a) sin(b):

sin(ω(t+δ)) = sin(ωt + ωδ) = sin(ωt) cos(ωδ) + cos(ωt) sin(ωδ)
cos(ω(t+δ)) = cos(ωt + ωδ) = cos(ωt) cos(ωδ) - sin(ωt) sin(ωδ)

In matrix form:

[ sin(ω(t+δ)) ]   [  cos(ωδ)  sin(ωδ) ] [ sin(ωt) ]
[ cos(ω(t+δ)) ] = [ -sin(ωδ)  cos(ωδ) ] [ cos(ωt) ]

From computer graphics: this is the rotation matrix rotating the vector at position "t" by the angle "ωδ" (e.g. "ω", "2ω", "3ω", …) — and the matrix is independent of the position "t".
Squared Distance Between the Vectors of Positions "t+δ" and "t" [Two Consecutive Positions]

Let a = ωt and b = ωδ; recall sin²(a) + cos²(a) = 1.

Vector of pos "t+δ": ( sin(a) cos(b) + cos(a) sin(b),  -sin(a) sin(b) + cos(a) cos(b) )
Vector of pos "t":   ( sin(a), cos(a) )

[ sin(a) cos(b) + cos(a) sin(b) - sin(a) ]² + [ -sin(a) sin(b) + cos(a) cos(b) - cos(a) ]²
= [ sin(a)(cos(b)-1) + cos(a) sin(b) ]² + [ -sin(a) sin(b) + cos(a)(cos(b)-1) ]²
= sin²(a)(cos(b)-1)² + cos²(a) sin²(b) + 2 sin(a) cos(a) sin(b)(cos(b)-1)
  + sin²(a) sin²(b) + cos²(a)(cos(b)-1)² - 2 sin(a) cos(a) sin(b)(cos(b)-1)
= [ sin²(a) + cos²(a) ](cos(b)-1)² + [ sin²(a) + cos²(a) ] sin²(b)
= (cos(b)-1)² + sin²(b)
= cos²(b) - 2 cos(b) + 1 + sin²(b) = 2 - 2 cos(b) = 2(1 - cos(ωδ))

Irrespective of position "t".
Sin/Cos Positional Encoder
A vector of dimension "2" to define position: (sin(ωt), cos(ωt)), where "t" is the step (position) and ω is the frequency of the trigonometric functions sine/cosine.

This vector can define the word position at step "t", t = 0, 1, 2, 3, 4, 5, ….
Values are between -1 and 1 (bounded values).
The distance between the vectors representing positions "t" and "t+1" is constant irrespective of "t" [Distance(Pos1, Pos2) = Distance(Pos90, Pos91)].
"t" may take any positive value (no limitation on the length of the sequence).
Remaining problem: the position vector has length 2, so it cannot be combined with the word-embedding vector.
Sin/Cos Positional Encoder
(Positional Vector with the Same Dimension as the Embedding Dimension "d")

Build the position vector from sin/cos pairs (d must be even):
Number of vector pairs K = d/2, with pair index k = 0, 1, …, d/2-1.
Element index i = 0, 1, …, d-1: an even index i = 2k holds the sine term sin(ωk t); an odd index i = 2k+1 holds the cosine term cos(ωk t).
Use a different "ω" value for each pair, selected by the pair index k:

ωk = 1 / 10000^(2k/d)

Largest "ω" value = 1 (used in the first pair); smallest "ω" value ≈ 1/10000 (used in the last pair).

Example: d = 100 → K = 50 pairs, k = 0 … 49, i = 0 … 99, sine terms at even indices [0, 2, 4, …], cosine terms at odd indices [1, 3, 5, …].

Note: the "ω" value depends on the embedding dimension "d" ONLY; it does NOT depend on the sequence length.
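The full d-dimensional encoder described above, as a NumPy sketch:

```python
import numpy as np

def positional_encoding(t: int, d: int) -> np.ndarray:
    """Sin/cos position vector for step t, embedding dimension d (even).

    Element 2k is sin(w_k t) and element 2k+1 is cos(w_k t),
    with w_k = 1 / 10000**(2k/d).
    """
    k = np.arange(d // 2)
    w = 1.0 / 10000 ** (2 * k / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(w * t)   # even indices: sine terms
    pe[1::2] = np.cos(w * t)   # odd indices: cosine terms
    return pe

pe0 = positional_encoding(0, 100)   # position 0: sines are 0, cosines are 1
```

Every element is bounded in [-1, 1] and the vector matches the embedding dimension, satisfying all three rules for a good positional encoder.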
[2] Position Embedding
(Similar to Word Embedding)

The position information 0, 1, 2, …, N-1 is treated like tokens (similar to word tokenization): an embedding layer learns the position vectors (it is trained on the position values), producing real-valued vectors VectorP0 … VectorPN-1 (one vector per position), with vector length equal to the word-embedding length.

Limitation: positions > N-1 can NOT be embedded — the embedding must be trained on the maximum expected sequence length.


Building Transformer Block

Adding
Residual Connections
Layer Normalization
Feed Forward Network


Multi Head Self Attention Block

Input:
Word embedding + positional information [first Transformer block]
Output of the previous Transformer block [stacked Transformer blocks]

Output:
Encoded vectors of the input sequence [with the same dimension as the input vectors]

Remember:
Different input sequences should be of the same length (fixed-length sequences) [apply padding or cutting].
The output sequence is of the same length as the input sequence: the block maps X0 … XN-1 to Y0 … YN-1.
[68]
Transformer Blocks (stacked over Multi-Head Attention)

  [Figure: the encoder–decoder Transformer architecture, built from Multi-Head
  Attention blocks; each Multi-Head Attention block contains Scaled Dot-Product
  Attention heads fed by V, K, and Q inputs.]
[69]
Residual Connections and Normalization

Adding a Residual Connection from:
 Word Embedding + Positional Information [First Transformer Block]
 Output of the Previous Transformer Block [Stacked Transformer Blocks]

Adding “LAYER” Normalization (NOT Batch Normalization):
 Output: encoded vectors of the input sequence [with the SAME dimension as the input vectors]

  [Diagram: the inputs X0 … XN−1 are added element-wise (residual connections) to the
  outputs of the Multi-Head Self-Attention Block, then Layer Normalization is applied
  to produce Y0 … YN−1.]
[70]
Normalization
The behavior of machine learning algorithms changes when the input distribution changes
{Covariate Shift}.
 As layer weights change during the training steps, the outputs of each layer’s
 activation functions — which are the inputs to the next layer — change as well,
 so the next layer sees a different input distribution.

Conclusion:
 The input distribution of each layer changes with each training step.

The basic idea behind Normalization is to:
 Limit covariate shift by normalizing the activations of each layer.
 Normalization: transforming the inputs to be zero mean and unit variance.

Normalization allows each layer to:
 Learn on a more stable distribution of inputs.
 Accelerate the training of the network (no need to stick with very small learning rates).
[71]
Normalization
Instead of restricting the activations of each layer to be strictly zero mean and unit variance,
Normalization allows the network to learn parameters γ and β that can shift the mean
and variance to any values that minimize the Loss.

[email protected]
[72]
Normalization
Normalization can be applied to the Activations or to the Weights:

 Normalization of Activations:
  Batch Normalization
  Instance Normalization
  Group Normalization
  Layer Normalization

 Normalization of Weights:
  Weight Normalization
[73]

Batch Data (Feature Vectors)

  [Figure: feature-vector layout along three axes — the feature vector of one input
  sample (using one filter), the feature vectors of one input sample using 4 filters
  [4 channels], and the feature vectors of FIVE input samples [mini-batch]. Axes:
  feature vector, channels, and batch size.]
[74]
Batch Normalization
 Calculate (then normalize with) μ, σ computed over ALL samples in the batch, for each channel.
 Batch Normalization calculates the mean and variance of each mini-batch instead of
 calculating the mean and variance of the whole data.
 The mean and variance differ from one mini-batch to another (a source of error).
 The mini-batch size should be large enough to minimize this “batching” effect (a constraint).
[75]
Layer, Group, and Instance Normalization (Sample-Based Normalization)

 Instance Normalization: calculate (then normalize with) μ, σ for ONE channel
  in ONE sample (the output of ONE filter).
 Group Normalization: the same, computed over a GROUP of channels (the general case).
 Layer Normalization: calculate (then normalize with) μ, σ for ALL channels in the
  same sample (the inputs to the same layer) — the Transformer choice.
[76]
Feed-Forward Neural Network
A Feed-Forward Network is applied to Each Vector:
 One hidden layer with ReLU activation [size: at least double the input dimension].
 Linear-activation output layer [size: same as the input vector dimension].

  [Diagram: each output vector Y0 … YN−1 of the residual + Layer Normalization stage
  is passed through the same Feed-Forward Network (one hidden layer with ReLU),
  producing vectors of the same dimension.]
[77]
  [Diagram: a second residual connection adds the Feed-Forward Network’s input to its
  output, followed by Layer Normalization (NOT Batch Normalization), producing the
  Final Output Y0 … YN−1. The full stack so far: input X0 … XN−1 (word embedding +
  positional information) → Multi-Head Self-Attention Block → Add (residual) & Layer
  Normalization → Feed-Forward Networks (one hidden layer with ReLU) → Add (residual)
  & Layer Normalization.]
[78]
  [Diagram: the complete Encoder Block — Block Input X0 … XN−1 → Multi-Head
  Self-Attention Block → Add (residual) & Layer Normalization → Feed-Forward Networks
  (one hidden layer with ReLU) → Add (residual) & Layer Normalization → Block Output
  Y0 … YN−1.]
[79]
  [Diagram: the same Encoder Block, with the Multi-Head Self-Attention Block’s
  V, K, and Q inputs all taken from the Block Input X0 … XN−1 (self-attention).]
Transformer Stacked Encoder Blocks [80]

  [Diagram: the Transformer Encoder — the Input Sequence Word 0 … Word N−1 passes
  through Word Embedding plus Positional Encoding/Embedding to give the Encoder Input
  X0 … XN−1, then through a stack of Encoder Blocks (Block 0, Block 1, Block 2, …,
  Block M) to give the Encoder Output Y0 … YN−1.]
[81]

Transformer Decoder

[email protected]
[82]
Encoder–Decoder Transformer

  [Figure: the full encoder–decoder architecture. The Decoder reuses the encoder’s
  building blocks (multi-head attention, residual connections, layer normalization,
  feed-forward networks), with two changes: a Mask is added to its self-attention,
  and an Encoder–Decoder Attention block is inserted, taking K and V from the
  Encoder Output and Q from the decoder.]
[83]
  [Figure: one Decoder Block. From bottom to top: Word and Position Encoding of the
  shifted target sequence (SOS, Je, me, sens, malade) → Masked Multi-Head
  Self-Attention Block → Residual Connection and Layer Normalization → Multi-Head
  Encoder–Decoder Attention Block (Q from the decoder; K and V from the Transformer
  Encoder, which encodes the source sentence “I feel Sick”) → Residual Connection
  and Layer Normalization → Feed-Forward Networks (one hidden layer with ReLU) →
  Residual Connection and Layer Normalization → outputs Y0, Y1, Y2, ….]
[84]
Self-Attention Look-Ahead Mask
(Used in the Decoder’s first layer)

Mask matrix (rows: query word; columns: key word):

            SOS    Je     me     sens   malade
  SOS        0     -∞     -∞     -∞     -∞
  Je         0      0     -∞     -∞     -∞
  me         0      0      0     -∞     -∞
  sens       0      0      0      0     -∞
  malade     0      0      0      0      0

 Mask values are ADDED to the masked (scaled) similarity values before the Softmax.
 Adding “0” does not change the original values, so it has no effect on the Softmax output.
 Adding “-∞” gives a “zero” output from the Softmax, leaving zero attention scores for future words.
[85]
Self-Attention Mechanism for the Decoder (Adding the Mask)

  [Figure: the scaled dot-product pipeline for query word Z2 over Z1 … Z4. Each Zi is
  projected to a key Ki = Zi·Wk(r) and a value Vi = Zi·WV(r); Z2 is projected to the
  query q2 = Z2·Wq(r) (each W is d×d, where d is the embedding length). The similarities
  K·q are scaled by √d, the Look-Ahead Mask is added (0 for Z1, Z2; −∞ for the future
  words Z3, Z4), the Softmax converts them to probabilities, and the weighted SUM of
  the values gives the output Z2 — with zero weight on the masked future words.]
[86]
Encoder–Decoder Attention Mechanism (In the Decoder)
(Attention of word Z2 w.r.t. ALL words of the encoder output: Y1, Y2, Y3, Y4)

  [Figure: the same scaled dot-product pipeline, but the keys Ki = Yi·Wk(r) and values
  Vi = Yi·WV(r) come From the Encoder Output, while the query q2 = Z2·Wq(r) comes From
  the Decoder Self-Attention Layer (each W is d×d, where d is the embedding length).
  The similarities K·q are scaled by √d, passed through the Softmax, and used to form
  the weighted SUM O2 of the encoder values.]
[87]
Decoder Final Output
 The output of the Last Decoder Block is passed through a Linear Projection to a
 vector of length Vocab Size, then through a Softmax, giving the probability of
 each word in the Vocabulary.
[88]
  [Figure: the complete Encoder–Decoder Transformer. Left, the Encoder: input words
  Word 0 … Word N−1 pass through Word Embedding plus Positional Encoding/Embedding
  to give X0 … XN−1, then through stacked Encoder Blocks (Block 0 … Block M),
  producing Y0 … YN−1. Right, the Decoder: the shifted target (SOS Je me sens malade)
  passes through Word and Position Encoding, then through a matching stack of decoder
  blocks, each containing a Masked Multi-Head Self-Attention Block, an Encoder–Decoder
  Attention Block (Q from the decoder; K and V from the encoder output), and
  Feed-Forward Networks (one hidden layer with ReLU), each followed by a Residual
  Connection and Layer Normalization; a final Softmax produces the output.]
[89]

Transformer Encoder
as a Classifier

[email protected]
Output Classes: C1, C2, C3 [90]

  [Diagram: the Transformer Encoder as a classifier. The Input Sequence (word-embedding
  vectors X0 … XN−1 plus position-information vectors VectorP0 … VectorPN−1) passes
  through stacked Transformer Blocks, giving the Output Sequence Y0 … YN−1. Global
  Average Pooling reduces it to a Single vector representing the whole sequence, and
  a Softmax layer maps that vector to the output classes.]
[91]
What is Next

 GPT-1 and GPT-2 (Generative Pre-trained Transformers)
  Radford et al., OpenAI, June 2018 and February 2019.
 BERT (Bidirectional Encoder Representations from Transformers)
  Devlin et al., Google AI Language, May 2019.
 RoBERTa (Robustly Optimized BERT Approach)
  Liu et al., Facebook AI, June 2019.
 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  Zihang Dai et al., Carnegie Mellon University and Google Brain.
 T5 (Text-to-Text Transfer Transformer)
 GPT-3 (175 Billion Parameters, Beta Release June 2020)
[92]

Thank You

[email protected]
