(9,10) Transformers - 3
Transformers
Prof. Khaled Mostafa El-Sayed
[email protected]
Aug 2020
Transformers
Processing Sequential Data
Multi-Head Attention
Positional Information
Transformer Decoder
What is Next ?
1-D Convolutional Neural Network
[Figure: a convolution kernel slides over the word vectors; each convolution layer is followed by a non-linearity and produces a feature map over the sequence length. *** Not practical for very long sequences ***]
As the word sequence length increases, we need to stack many convolution layers.
Traditional RNN Seq2Seq model
Encoder (RNN) → Decoder (RNN)
Traditional RNN Seq2Seq model (Challenges)
[1] Bottleneck
The meaning of the entire input sequence is expected to be captured by a single context vector with fixed dimensionality, passed from the Encoder (RNN) to the Decoder (RNN).
Traditional RNN Seq2Seq model (Challenges)
Sequential processing: each step must wait until X1 has been processed, then X2, and so on; the Encoder (RNN) and Decoder (RNN) cannot process the sequence in parallel.
Traditional RNN Seq2Seq model (Challenges)
Back-propagation: gradients must flow back through every time step (X0 … X10) of the Encoder (RNN) and Decoder (RNN), which makes training on long sequences difficult.
RNN Seq2Seq model [with Attention]
[To overcome the Bottleneck challenge]
The Decoder utilizes:
● the context vector
● a weighted sum of the Encoder hidden states (attention)
Transformer
"Attention Is All You Need"
Both the Encoder and the Decoder are built from Multi-Head Self-Attention blocks; the Decoder additionally uses Encoder-Decoder Attention.
Encoder – Decoder Transformer
Orientation is not the Issue
Encoder – Decoder Transformer
N cascaded blocks in both the Encoder and the Decoder (Encoder Blocks and Decoder Blocks), each operating on V, K, Q.
Encoder – Decoder Transformer
Both Encoder and Decoder blocks contain a "Multi-Head Self-Attention" block:
● "h" heads, each applying Scaled Dot-Product Attention to projected Q', K', V'
● the outputs of the heads are concatenated
Encoder – Decoder Transformer
Each block also uses Layer Normalization and Residual Connections around its Multi-Head Attention (Scaled Dot-Product Attention over V, K, Q).
Encoder – Decoder Transformer
[Architecture figure: Encoder and Decoder built from Multi-Head Attention (Scaled Dot-Product Attention over V, K, Q), with the Decoder producing the Outputs.]
Self Attention Mechanism (Fundamental Operation)
Self-attention of word X2 w.r.t. all the words in the sentence (X1, X2, X3, X4)
[Figure: dot products of X2 with each word X1 … X4 give the weights W21 … W24; a softmax normalizes them; Y2 is the weighted sum Σ_j W2j · Xj.]
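Below is a minimal NumPy sketch (not from the slides) of this fundamental operation; the four toy word vectors and the embedding length d = 3 are made-up values.

```python
import numpy as np

# Toy sentence: 4 word vectors X1..X4, each of embedding length d = 3 (made-up values).
X = np.array([[0.1, 0.3, 0.2],   # X1
              [0.5, 0.1, 0.4],   # X2
              [0.2, 0.2, 0.6],   # X3
              [0.4, 0.0, 0.1]])  # X4

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

x2 = X[1]                        # the word we attend from
w2 = softmax(X @ x2)             # W21..W24: softmax of the dot products of X2 with every Xj
y2 = w2 @ X                      # Y2: weighted sum of all word vectors

print("attention weights:", w2)  # they sum to 1
print("Y2:", y2)
```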
Self Attention Mechanism (Fundamental Operation)
Remember: Xi = X2 is the word we attend from; the Xj are all the words in the sentence; W2j = softmax_j(X2 · Xj) and Y2 = Σ_j W2j · Xj.
Self Attention Mechanism (Not Sensitive to Word Order)
Self-attention of word X2 w.r.t. all the words in the sentence (X4, X2, X1, X3): permuting the words only permutes the weights W2j; the output Y2 is unchanged.
Dictionary view: the Query X2 (Q) is matched against ALL Keys (K); the softmax turns the matches into relevance percentages; the Values (V) of the matched keys are returned when running the query, weighted by their relevance and summed.
Self Attention Mechanism (Input Word Plays Different Roles)
Query using X2: match it against all Keys in the dictionary and retrieve the corresponding Values.
In the sentence (X4, X2, X1, X3) the same vector X2 is used as the Query, as a Key, and as a Value.
Should we use the SAME X2 vector for the three roles?
Self Attention Mechanism (Linear Transformation of Inputs)
Query using X2: match it against all Keys in the dictionary and retrieve the corresponding Values.
Three learned d×d matrices turn each 1×d input vector into its three role-specific variants:
● q2 = X2 · WQ  (the X2 variant to QUERY with)
● Kj = Xj · WK  (the Xj variant used as KEY)
● Vj = Xj · WV  (the Xj variant used as VALUE)
Y2 = Σ_j softmax_j( Kj · q2 ) · Vj
d: Embedding Length; each W is d×d, each input and output vector is 1×d.
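A minimal NumPy sketch (not the lecture's code) of this linear-transformation step for one word; the random X, WQ, WK, WV are stand-ins for real embeddings and learned matrices, and no scaling is applied yet.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d))          # toy word vectors X1..X4 (stand-ins for real embeddings)
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))   # stand-ins for learned d x d matrices

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q2 = X[1] @ WQ                       # X2 in its Query role
K  = X @ WK                          # every word in its Key role   (rows K1..K4)
V  = X @ WV                          # every word in its Value role (rows V1..V4)

w2 = softmax(K @ q2)                 # relevance of each key to the query (no sqrt(d) scaling yet)
y2 = w2 @ V                          # Y2: values weighted by their relevance
```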
Self Attention Mechanism
(Embedding Vector Dimension and Softmax Sensitivity)
As the dimension of the embedding "d" increases, the calculated weight values "w" (the dot products fed to the softmax) increase.
High weight values saturate the softmax, kill the gradient, and slow down learning.
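A small numeric illustration (my own, with random vectors) of how the dot products grow with d and push the softmax towards a one-hot output:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.normal(size=d)
    K = rng.normal(size=(8, d))
    scores = K @ q                       # dot products spread out roughly like sqrt(d)
    p = softmax(scores)
    print(d, scores.std().round(1), p.max().round(3))
# As d grows, the scores spread out, the softmax output approaches one-hot,
# and the gradient through the softmax goes to ~0 (learning slows down).
```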
Self Attention Mechanism (Adding Scaling)
Self-attention of word X2 w.r.t. all the words in the sentence (X1, X2, X3, X4), with every dot product scaled by √d before the softmax:
Y2 = Σ_j softmax_j( (Kj · q2) / √d ) · Vj
where q2 = X2 · WQ, Kj = Xj · WK, Vj = Xj · WV, each W is d×d, and d is the Embedding Length.
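Relative to the previous sketch, the scaling is a one-line change; a hedged helper (my own naming) could look like this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(q, K, V, scale=True):
    scores = K @ q
    if scale:
        scores = scores / np.sqrt(q.shape[-1])   # divide by sqrt(d) before the softmax
    return softmax(scores) @ V
```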
Parallel Processing
Transform sequence elements in parallel
[Figure: the self-attention of all words (X1, X2, X3, X4) — the Wq/Wk/Wv projections, scaled dot products (÷√d), softmax, and weighted sums — is computed for every word at the same time, producing Y1, Y2, Y3, Y4. *** Run In Parallel ***]
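A minimal NumPy sketch (not the lecture's code) of the parallel, matrix form of self-attention, Y = softmax(Q·Kᵀ/√d)·V; shapes and random weights are illustrative.

```python
import numpy as np

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Self-attention for all words at once: Y = softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ WQ, X @ WK, X @ WV            # project every word into its three roles
    d = X.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # one row of attention weights per word
    return A @ V                                # Y1..Y4 computed in parallel

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                     # X1..X4
W = [rng.normal(size=(d, d)) for _ in range(3)]
Y = self_attention(X, *W)                       # shape (4, d): Y1, Y2, Y3, Y4
```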
Multi-Head Attention
Wide Architecture
Narrow Architecture
Encoder – Decoder Transformer
[Architecture recap: Multi-Head Attention built from Scaled Dot-Product Attention over V, K, Q.]
Combining several self-attention mechanisms gives the self-attention greater power of discrimination.
For an input X, each attention head r produces a different output vector Y(r).
Multi-Head Self-Attention of Word X2 ("Wide" Architecture)
[Figure: the full 1×d input vectors X1 … X4 are fed to every head h = 1 … 5; each head has its own Wq(h), Wk(h), Wv(h) and runs its own scaled dot-product attention (dot products, ÷√d, softmax, weighted sum), producing its own Y2(h). The H head outputs are concatenated and a linear transformation [T] of shape dH×d maps the concatenation back to the final 1×d output Y2.]
[A second figure shows the per-head view with 1×(d/H) vectors and (d/H)×(d/H) weight matrices (d: Embedding Length), with the input X1 X2 X3 X4 routed "To Head 1", "To Head 2", ….]
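A minimal NumPy sketch (not the lecture's code) of the "wide" multi-head variant; the head count H = 5, the dimensions, and the random weight matrices are illustrative assumptions.

```python
import numpy as np

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

def single_head(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(X.shape[-1]), axis=-1)
    return A @ V

def multi_head_wide(X, heads, T):
    """'Wide' multi-head attention: every head sees the full d-dim input;
    the H outputs are concatenated, then mapped back to d by T (shape H*d x d)."""
    outs = [single_head(X, *h) for h in heads]      # one (n, d) output per head
    return np.concatenate(outs, axis=-1) @ T        # (n, H*d) @ (H*d, d) -> (n, d)

rng = np.random.default_rng(0)
n, d, H = 4, 8, 5
X = rng.normal(size=(n, d))
heads = [[rng.normal(size=(d, d)) for _ in range(3)] for _ in range(H)]
T = rng.normal(size=(H * d, d))
Y = multi_head_wide(X, heads, T)                    # shape (4, 8)
```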
Parallelization
Implementation Issues
Using FF NN for Multiplication of a Vector by a Matrix
Input word X = [X1 X2 X3 X4]
A fully connected layer with weight matrix W (4×4), biases b, and Sigmoid activation computes
    Yᵀ = Sigmoid( W · X + b ),
i.e. each output y_i is a weighted sum of the inputs X1 … X4 plus a bias b_i, passed through the Sigmoid.
Using FF NN for Multiplication of ONE Vector by a Matrix
Example: multiply the vector of the current word X by Wq (Query). {Remember: Wq is a square matrix}
To make the fully connected layer compute a pure matrix multiplication:
● No non-linearity (Linear activation)
● No bias (b = 0)
so that  Yᵀ = Linear( Wq · X + 0 ) = Wq · X.
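A small NumPy check (my own) that a fully connected layer with linear activation and zero bias reduces to a matrix multiplication; the random Wq and x are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq = rng.normal(size=(d, d))            # square "query" matrix, as on the slide
x  = rng.normal(size=d)                 # current word X

# A fully connected layer with LINEAR activation and ZERO bias ...
def ff_layer(x, W, b, activation=lambda z: z):
    return activation(W @ x + b)

y_layer  = ff_layer(x, Wq, np.zeros(d))  # ... is exactly a matrix multiplication
y_matmul = Wq @ x
assert np.allclose(y_layer, y_matmul)
```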
Self Attention Mechanism (Linear Transformation of Inputs) — recap
Self-attention of word X2 w.r.t. all the words in the sentence (X1, X2, X3, X4):
q2 = X2 · WQ,  Kj = Xj · WK,  Vj = Xj · WV,  Y2 = Σ_j softmax_j( Kj · q2 ) · Vj
(each W is d×d; d: Embedding Length)
Using FF NN for Multiplication of Multiple Vectors by a Matrix
Example: multiply the vectors of ALL words by Wk (Key) or Wv (Value). {Wk and Wv are square matrices}
Design the FF NN with the following considerations:
● Activation: Linear
● Bias: zeros
● Inputs: all words within the same paragraph (X1, X2, X3, X4, X5)
The same layer is applied to every word, so Yiᵀ = W · Xi for i = 1 … 5: one weight matrix multiplies a whole batch of vectors.
Using FF NN for Multiplication of Vector by a Multiple Matrices
Example: multiply the vector of the current word by multiple (different) Wq matrices (the Queries of multiple heads).
The embedding vector of the current word X = [X1 X2 X3 X4] is multiplied by Wq of Head 1, Wq of Head 2, and Wq of Head 3; each product [y1 y2 y3 y4] is the corresponding projection at that head.
Using FF NN for Multiplication of Multiple Vectors by Multiple Matrices
Usage: multiply the vectors of ALL words by multiple (different) Wk matrices (the Keys of multiple heads). A sketch of this batched multiplication follows below.
● Sequence of 5 words
● Network with 3 heads
● Embedding dimension = 5
[Figure: a grid of projected vectors — the corresponding projection of word #1 … word #5 at Head 1, Head 2, and Head 3.]
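A minimal NumPy sketch (my own) of the batched idea on this slide and the previous one: stacking the per-head matrices side by side lets one matrix multiplication produce every word's projection at every head. The sizes and random values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, H = 5, 4, 3                        # 5-word sequence, embedding dim 4, 3 heads
X = rng.normal(size=(n_words, d))              # rows: X1..X5
Wk_heads = [rng.normal(size=(d, d)) for _ in range(H)]   # one Wk per head

# Stack the per-head matrices side by side: one (d, H*d) matrix.
Wk_all = np.concatenate(Wk_heads, axis=1)

# One matrix multiplication gives every word's Key projection at every head.
K_all = X @ Wk_all                             # shape (5, H*d)

# Column block h holds the projections belonging to head h:
K_head1 = K_all[:, 0:d]
assert np.allclose(K_head1, X @ Wk_heads[0])
```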
Multi-Head Self-Attention of Word X2 (recap)
[Figure as before: the H = 5 heads each produce a Y2(h); the concatenated head outputs (1×dH) are mapped by the linear transformation [T] (dH×d) to the final 1×d output Y2.]
Encoder – Decoder Transformer
[Architecture recap: Multi-Head Attention built from Scaled Dot-Product Attention over V, K, Q.]
Pre-processing of Input Sequence
Sequence of words with length N: Word 0, Word 1, …, Word N-1
Word embedding (e.g. word2vec): X0, X1, …, XN-1
Position information: Pos 0, Pos 1, …, Pos N-1
Final embedding: combine the word embedding and the position information for each word → I/P 0, I/P 1, …, I/P N-1
Input to Encoder: the input to the first self-attention layer in the Transformer.
[1] Position Encoding
Position information: 0, 1, 2, 3, 4, 5, …, N-2, N-1
Positional Encoder: a function that maps the positions to real-valued vectors.
Suggested Positional Encoders
(Use Word Index as Position Encoding)
Use the position index directly: 0, 1, 2, 3, 4, 5, …, N-2, N-1 (δ = 1 between consecutive positions).
First position encoding = 0, last position encoding = N-1 (depends on the sequence length).
Problems:
● If N is large (e.g. sequence length = 1024), the large position values will dominate when the position encoding is combined with the word embedding.
● If the system is trained with a maximum sequence length of 256, how can it deal with longer sequences?
● The position encoding is a scalar while the word embedding is a vector (how to combine them?).
Suggested Positional Encoders
(Normalize Seq. Length to “1”)
Use normalized position values: 0, δ, 2δ, 3δ, 4δ, 5δ, …, 1-δ, 1 with δ = 1 / sequence length.
First position encoding = 0, last position encoding = 1 (independent of the sequence length).
Problems:
● The δ value depends on the sequence length (δ is small for a long sequence and large for a short one).
● In other words, the encoding of position #4 in a short sequence will differ from that of position #4 in a long sequence (though they are the same position).
● The position encoding is a scalar while the word embedding is a vector (how to combine them?).
Suggested Positional Encoders
(Rules for a Good Positional Encoder)
[1] The δ value should be the same for long and short sequences
(the same positions are encoded with the same values irrespective of the length of the sequence).
[2] The range of the encoded position values should NOT depend on the sequence length
(a predefined fixed range, e.g. from 0 to 1, or from -1 to 1, …).
[3] The position encoding should be a VECTOR (not a scalar) with the same dimension as the word embedding
(so that the position encoding and word embedding vectors can easily be combined).
Sine/Cosine Encoders
(an industrial application: motor control systems)
A Sin/Cos encoder outputs the signals sin(ωt) and cos(ωt) to report a position (an angle).
[Figure: word positions are defined as angles on a circle — Word 0, Word 1, Word 2, Word 3, Word 4, … at successive steps of +ω, with t = 0, 1, 2, 3, ….]
[Figure: a sequence of 9 words — positions defined as angles sin(ωt), cos(ωt), t = 0, 1, 2, …, with Word 0 … Word 8 spaced by +ω around the circle.]
[Figure: a sequence of 12 words — Word 0 … Word 11 placed as angles spaced by +ω. A small ω allows the representation of longer sequences (practically, use ω = 1/10000).]
Vector of position "t":  [ sin(ωt), cos(ωt) ]

Squared distance between the vector of position "t+δ" and the vector of position "t" [two consecutive positions], writing a = ωt and b = ωδ:

  [ sin(a)cos(b) + cos(a)sin(b) − sin(a) ]² + [ cos(a)cos(b) − sin(a)sin(b) − cos(a) ]²
= [ sin²(a) + cos²(a) ] (cos(b) − 1)² + [ sin²(a) + cos²(a) ] sin²(b)
= (cos(b) − 1)² + sin²(b)
= cos²(b) − 2cos(b) + 1 + sin²(b)
= 2 − 2cos(b) = 2( 1 − cos(ωδ) )

Irrespective of position "t".
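A short numeric check (my own, with an arbitrary ω and δ) that the squared distance between consecutive position vectors does not depend on t:

```python
import numpy as np

def pos_vec(t, w):
    return np.array([np.sin(w * t), np.cos(w * t)])

w, delta = 0.1, 1.0
for t in (0, 7, 90):
    sq_dist = np.sum((pos_vec(t + delta, w) - pos_vec(t, w)) ** 2)
    print(t, sq_dist)                 # the same value for every t
print(2 * (1 - np.cos(w * delta)))    # the closed form 2(1 - cos(w*delta))
```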
Sin/Cos Positional Encoder
A vector of dimension 2 defines the position:  [ sin(ωt), cos(ωt) ],
where "t" is the step (position) and ω is the frequency of the sine/cosine functions.
This vector can be used to define the word position at step "t", t = 0, 1, 2, 3, 4, 5, 6, 7, ….
➢ Values are between -1 and 1 (bounded values).
➢ The distance between the vectors representing positions "t" and "t+1" is constant, irrespective of "t" [Distance(Pos1, Pos2) = Distance(Pos90, Pos91)].
➢ "t" may take any positive value (no limitation on the length of the sequence).
➢ Position vector length = 2 (cannot yet be combined with the word embedding vector).
Sin/Cos Positional Encoder
Positional vector with the same dimension as the embedding dimension "d" [an even number].   Example: d = 100
The d×1 vector representing position "t" is built from sin/cos pairs:
  index 0: sin(ω0·t),  index 1: cos(ω0·t),  index 2: sin(ω1·t),  index 3: cos(ω1·t), …, index d-2: sin(ω(d/2-1)·t),  index d-1: cos(ω(d/2-1)·t)
● Number of vector pairs "k" = d/2  [0, 1, 2, …, d/2-1]   (example: k = 0 … 49)
● Element index "i" = [0, 1, 2, …, d-1]                   (example: i = 0 … 99)
● An even index "i" is a sine term, i = 2k   (0, 2, 4, …)
● An odd index "i" is a cosine term, i = 2k+1   (1, 3, 5, …)
Any "ω" is accepted, but use a different "ω" for each pair; select the "ω" value based on the pair index "k":
  ωk = 1 / 10000^(2k/d)
Largest "ω" value = 1 (used in the first pair); smallest "ω" value ≈ 1/10000 (used in the last pair).
Note: the "ω" values depend ONLY on the embedding dimension "d"; they do NOT depend on the sequence length.
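A minimal NumPy sketch (not the lecture's code) of this sinusoidal positional encoding; the stand-in embeddings in the usage lines are random placeholders.

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encoding: pair k uses w_k = 1 / 10000**(2k/d);
    even index 2k holds sin(w_k * t), odd index 2k+1 holds cos(w_k * t)."""
    P = np.zeros((n_positions, d))
    k = np.arange(d // 2)
    w = 1.0 / (10000 ** (2 * k / d))          # largest w = 1, smallest ~ 1/10000
    t = np.arange(n_positions)[:, None]
    P[:, 0::2] = np.sin(t * w)                # even indices: sine terms
    P[:, 1::2] = np.cos(t * w)                # odd indices: cosine terms
    return P

# Usage: add the position vector to each word embedding (same dimension d).
d, n = 100, 16
E = np.random.default_rng(0).normal(size=(n, d))    # stand-in word embeddings
final_input = E + positional_encoding(n, d)          # input to the first encoder block
```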
[2] Position Embedding
(Similar to Word Embedding)
Position information: 0, 1, 2, 3, 4, 5, …, N-2, N-1
Positional embedding: an embedding layer learns the position vectors (trained on the position values).
Adding
Residual Connections
Layer Normalization
Feed Forward Network
Output:
Encoded Vectors of Input Sequence [With Same Input Vectors Dimension]
Remember:
Different Input Sequences should be of Same Length (Fixed Length Sequences)
[Apply Pad or Cut]
Output Sequence is of same Length as Input Sequence
Residual Connections and Normalization
Adding a residual connection from:
● Word Embedding + Positional Information [first Transformer block], or
● the output of the previous Transformer block [stacked Transformer blocks].
[Figure: the input to the self-attention block (X0 … XN-1, i.e. word embedding + positional information, or the previous block's output) is added through residual connections to the output of the Multi-Head Self-Attention block (Y0 … YN-1), followed by Layer Normalization (NOT Batch Normalization).]
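A minimal NumPy sketch (my own) of the add-and-norm step; `multi_head_self_attention` in the usage comment is a placeholder for the attention block sketched earlier.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer_out)

# Usage (placeholder names):
# x = word embeddings + positional information, or the previous block's output
# y = add_and_norm(x, multi_head_self_attention(x))
```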
Normalization
The behavior of machine learning algorithms changes when the input distribution changes {covariate shift}.
The layer weights change during the training steps → the outputs of each layer's activation functions change → the inputs to the next layer arrive with a different distribution.
Conclusion: the input distribution of each layer changes at every training step.
Normalization
Normalization of activations vs. normalization of weights:
● Batch Normalization
● Weight Normalization
● Instance Normalization
● Group Normalization
● Layer Normalization
[Figure: activations arranged as a cube — feature vector (channels) × batch size.]
Batch Normalization
Calculate (then normalize with) μ and σ over ALL samples in the batch, separately for each channel.
The mean and variance will differ from one mini-batch to another (a source of error).
The mini-batch size should be large enough to minimize this "batching" effect (a constraint).
Layer, Group, and Instance Normalization
Sample-based normalization (statistics computed per sample, not per batch).
Layer Normalization: calculate (then normalize with) μ and σ over ALL channels of the same sample (the input to the same layer) — the Transformer's choice.
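A toy NumPy comparison (my own, with random data) of the axes the two schemes normalize over:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))          # a mini-batch: 8 samples, 16 features (channels)

# Batch Normalization: statistics per feature, computed ACROSS the samples in the batch.
bn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-5)

# Layer Normalization (the Transformer's choice): statistics per sample,
# computed across ALL features of that sample -- independent of the batch.
ln = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-5)
```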
Feed Forward Neural Network
Feed Forward Network for Each Vector
One Hidden Layer with RELU Activation, [Size: at least double Input dimension]
Linear Activation Output Layer [Size: same Input Vector dimension]
[Figure: the complete encoder block — the input (word embedding + positional information) goes through the Multi-Head Self-Attention block, a residual connection, and Layer Normalization (NOT Batch Normalization); the result Y0 … YN-1 then goes through per-vector Feed Forward Networks (one hidden layer with ReLU), a second residual connection, and a second Layer Normalization, giving the final output Y0 … YN-1.]
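A minimal NumPy sketch (not the lecture's code) of the position-wise feed-forward network described above; the hidden size of 2×d and the random weights are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network applied to each vector independently:
    one hidden layer with ReLU (at least 2x the input size), linear output of size d."""
    return relu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 8, 2 * 8
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

Y = rng.normal(size=(4, d))             # stand-in for the add-and-norm output after self-attention
Y_ff = feed_forward(Y, W1, b1, W2, b2)  # same shape (4, d); followed by another add & norm
```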
Transformer Encoder: stacked encoder blocks
[Figure: the input sequence (Word 0 … Word N-1) enters Encoder Block 0; its block output Y0 … YN-1 feeds Encoder Block 1, then Block 2, …, up to Encoder Block M; each block repeats the self-attention + add & norm + feed-forward + add & norm structure.]
Transformer Decoder
Encoder – Decoder Transformer: the Decoder
[Figure: the Decoder block shares most of its structure with the Encoder block (✓ for the shared parts); the differences are the Encoder-Decoder Attention sub-layer (V and K from the encoder output, Q from the decoder) and the added mask (≈) in the decoder's self-attention, before producing the Outputs.]
Decoder block structure:
● Masked Multi-Head Self-Attention block → Residual Connection and Layer Normalization
● Multi-Head Encoder-Decoder Attention block (Q from the decoder, K and V from the encoder output) → Residual Connection and Layer Normalization
● Feed Forward Networks [one hidden layer with ReLU activation] → Residual Connection and Layer Normalization

Look-ahead mask for the target sentence "SOS Je me sens malade" (rows: query position, columns: key position):
          SOS   Je    me    sens  malade
SOS        0    -∞    -∞    -∞    -∞
Je         0     0    -∞    -∞    -∞
me         0     0     0    -∞    -∞
sens       0     0     0     0    -∞
malade     0     0     0     0     0
Adding "0" does not change the original values, so it has no effect on the softmax output.
Adding "-∞" makes the softmax output zero, leaving zero attention scores for future words.
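A minimal NumPy sketch (my own) of building the look-ahead mask and adding it to toy attention scores before the softmax:

```python
import numpy as np

def look_ahead_mask(n):
    """Upper-triangular mask: 0 on and below the diagonal, -inf above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

scores = np.ones((5, 5))                        # toy scores for "SOS Je me sens malade"
masked = softmax(scores + look_ahead_mask(5))   # adding -inf zeroes attention to future words
print(masked.round(2))
# Row 0 attends only to position 0, row 1 to positions 0-1, and so on.
```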
Self Attention Mechanism for the Decoder (Adding the Mask)
[Figure: masked self-attention of decoder word Z2 w.r.t. (Z1, Z2, Z3, Z4) — q2 = Z2 · WQ, Kj = Zj · WK, Vj = Zj · WV; the look-ahead mask adds 0 to the (÷√d) scaled scores of Z1 and Z2 and -∞ to those of Z3 and Z4 before the softmax, so the weighted sum ignores future words. d: Embedding Length.]
Encoder-Decoder Attention Mechanism (In the Decoder)
Attention of decoder word Z2 w.r.t. all the encoder outputs (Y1, Y2, Y3, Y4):
q2 = Z2 · WQ   (Z1, Z2, Z3, Z4 come from the decoder's self-attention layer)
Kj = Yj · WK,  Vj = Yj · WV   (the Yj are the encoder outputs)
O2 = Σ_j softmax_j( (Kj · q2) / √d ) · Vj
(each W is d×d; d: Embedding Length)
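A minimal NumPy sketch (not the lecture's code) of this cross-attention step; the sequence lengths and random weights are illustrative.

```python
import numpy as np

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(Z, Y_enc, WQ, WK, WV):
    """Queries come from the decoder's (masked) self-attention output Z;
    Keys and Values come from the encoder output Y_enc."""
    Q = Z @ WQ            # decoder side
    K = Y_enc @ WK        # encoder side
    V = Y_enc @ WV        # encoder side
    A = softmax(Q @ K.T / np.sqrt(Z.shape[-1]), axis=-1)
    return A @ V          # O1, O2, ...: one output per decoder position

rng = np.random.default_rng(0)
d = 8
Z = rng.normal(size=(4, d))        # decoder-side vectors Z1..Z4
Y_enc = rng.normal(size=(6, d))    # encoder outputs for a 6-word source sentence
W = [rng.normal(size=(d, d)) for _ in range(3)]
O = encoder_decoder_attention(Z, Y_enc, *W)   # shape (4, d)
```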
Decoder Final Output
The decoder outputs Y0 … YN-1 go through a Linear Projection to the vocabulary size, then a Softmax, giving the probability of each word in the vocabulary.
[Figure: the full stack — the source words (Word 0 … Word N-1) get word embedding and positional encoding/embedding and pass through the encoder blocks; the target words so far ("SOS Je me sens malade") get word and position encoding and pass through the decoder's masked Multi-Head Self-Attention, Encoder-Decoder Attention (Q from the decoder, K and V from the encoder), Feed Forward Networks, and residual connections with Layer Normalization; the final decoder vectors are projected to the vocabulary and softmaxed.]
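A minimal NumPy sketch (my own) of the final linear projection and softmax over the vocabulary; the vocabulary size and random projection matrix are placeholders.

```python
import numpy as np

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, vocab_size = 8, 1000
Y_dec = rng.normal(size=(5, d))                 # decoder outputs, one vector per position
W_proj = rng.normal(size=(d, vocab_size))       # linear projection to the vocabulary size

logits = Y_dec @ W_proj                         # shape (5, vocab_size)
probs = softmax(logits, axis=-1)                # probability of each word in the vocab
next_word_ids = probs.argmax(axis=-1)           # e.g. pick the most probable word per position
```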
Transformer Encoder
as a Classifier
[Figure: the input sequence — word embedding vectors X0 … XN-1 plus position-information vectors — passes through stacked Transformer blocks (the Transformer Encoder); the encoder output feeds the output classes C1, C2, C3.]
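A minimal NumPy sketch (my own) of one way to put a classification head on the encoder output; mean-pooling the sequence and the random weights are illustrative assumptions, not the lecture's prescribed design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_classes = 8, 3
Y_enc = rng.normal(size=(10, d))         # outputs of the last encoder block, one per word

pooled = Y_enc.mean(axis=0)              # pool the sequence into a single vector (one option)
W_cls = rng.normal(size=(d, n_classes))  # classification head
class_probs = softmax(pooled @ W_cls)    # probabilities for classes C1, C2, C3
```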
What is Next
Thank You