Transformer Numerical

The document describes the process of multi-headed attention in transformer models. It involves: 1) Computing queries, keys and values from input embeddings. 2) Calculating attention weights by taking the dot product of queries and keys, applying a softmax, and multiplying with values. 3) Adding the attended output to the input and applying normalization, forming the output of the attention layer.


Encoder

Input: How are you (positions 1, 2, 3)

Input embedding (one column per token How, are, you):
0.69  0.56  0.43
0.55  0.91  0.42
0.60  0.95  0.15

Positional embedding (must be of the same dimension as the input):
0.90  0.59  0.96
0.47  0.19  0.67
0.69  0.57  0.30

X = input + positional embedding
1.59  1.15  1.39
1.02  1.10  1.09
1.29  1.52  0.45

X^T
How  1.59  1.02  1.29
are  1.15  1.10  1.52
you  1.39  1.09  0.45

Projection weights (keys and queries are 2-dimensional, which gives the √2 scaling used below; values keep the model dimension of 3):

W_k
0.40  0.71  0.21
0.95  0.71  0.65

W_q
0.41  0.02  0.69
0.51  0.38  0.53

W_v
0.72  0.23  0.07
0.53  0.89  0.75
0.39  0.17  0.95

K = W_k × X
1.63  1.56  1.42
3.07  2.86  2.39

Q = W_q × X
1.56  1.54  0.90
1.88  1.81  1.36

V = W_v × X
1.47  1.19  1.28
2.72  2.73  2.04
2.02  2.08  1.15
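The projections above are plain matrix products. Below is a minimal NumPy sketch (the variable names are illustrative, not from the worksheet) that rebuilds X from the two embeddings and applies the three weight matrices; rounding to two decimals reproduces the tables above.

import numpy as np

# One column per token: How, are, you.
input_embedding = np.array([[0.69, 0.56, 0.43],
                            [0.55, 0.91, 0.42],
                            [0.60, 0.95, 0.15]])
positional_embedding = np.array([[0.90, 0.59, 0.96],
                                 [0.47, 0.19, 0.67],
                                 [0.69, 0.57, 0.30]])

X = input_embedding + positional_embedding        # X = input + positional embedding

W_k = np.array([[0.40, 0.71, 0.21],
                [0.95, 0.71, 0.65]])
W_q = np.array([[0.41, 0.02, 0.69],
                [0.51, 0.38, 0.53]])
W_v = np.array([[0.72, 0.23, 0.07],
                [0.53, 0.89, 0.75],
                [0.39, 0.17, 0.95]])

K = W_k @ X        # 2 x 3: one 2-dimensional key per token
Q = W_q @ X        # 2 x 3: one 2-dimensional query per token
V = W_v @ X        # 3 x 3: one 3-dimensional value per token

print(np.round(X, 2))   # -> 1.59 1.15 1.39 / 1.02 1.10 1.09 / 1.29 1.52 0.45
print(np.round(K, 2))   # -> 1.63 1.56 1.42 / 3.07 2.86 2.39
print(np.round(Q, 2))   # -> 1.56 1.54 0.90 / 1.88 1.81 1.36
print(np.round(V, 2))   # -> 1.47 1.19 1.28 / 2.72 2.73 2.04 / 2.02 2.08 1.15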

Encoder self-attention

Attention(Q,K) = softmax((Q^T × K)/√2)
f(Q,K,V) = Attention(Q,K) × V^T

Q^T × K
8.31  7.81  6.71
8.07  7.58  6.51
5.64  5.29  4.53

(Q^T × K)/√2
5.88  5.52  4.74
5.71  5.36  4.60
3.99  3.74  3.20

e^((Q^T × K)/√2)
357.81  249.64  114.43
301.87  212.72   99.48
 54.05   42.10   24.53

Attention(Q,K) = softmax((Q^T × K)/√2)
How  0.50  0.35  0.16
are  0.49  0.35  0.16
you  0.45  0.35  0.20

V^T
How  1.47  2.72  2.02
are  1.19  2.73  2.08
you  1.28  2.04  1.15

f(Q,K,V) = (softmax((Q^T × K)/√2)) × V^T
How  1.36  2.64  1.92
are  1.34  2.61  1.90
you  1.33  2.59  1.87

Step 1: Add of (Add & Norm − Attn): X^T + f(Q,K,V)
How  2.95  3.66  3.21
are  2.49  3.71  3.42
you  2.72  3.68  2.32

Step 2: Norm of (Add & Norm − Attn)
Normalize feature-wise, z-normalization: x = (x − μ)/σ
How   1.22  -1.14   0.48
are  -1.22   1.30   0.92
you   0.00  -0.16  -1.39
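Every "Norm" half of an Add & Norm step in this worksheet is the same operation: a feature-wise z-normalization x = (x − μ)/σ applied to each column of the Add result, using the population standard deviation. A small sketch, assuming that convention:

import numpy as np

# Step 1 result (X^T + f(Q, K, V)); rows = tokens How, are, you; columns = features.
added = np.array([[2.95, 3.66, 3.21],
                  [2.49, 3.71, 3.42],
                  [2.72, 3.68, 2.32]])

def feature_wise_znorm(a):
    """z-normalize each feature (column) across the tokens: x -> (x - mean) / std."""
    return (a - a.mean(axis=0)) / a.std(axis=0)   # std with ddof=0, as the worksheet uses

print(np.round(feature_wise_znorm(added), 2))
# ->  1.22 -1.14  0.48
#    -1.22  1.30  0.92
#     0.00 -0.16 -1.39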

Encoder feed-forward network

FFN_ei = relu(W_e1 × (Add & Norm − Attn)^T)
FFN_e = W_e2 × FFN_ei

W_e1
0.63  0.51  0.22
0.81  0.88  0.70
0.33  0.72  0.12

W_e2
0.24  0.23  0.71
0.48  0.03  0.82
0.71  0.71  0.68

W_e1 × (Add & Norm − Attn)^T
 0.29  0.10  -0.39
 0.32  0.80  -1.11
-0.36  0.64  -0.28

FFN_ei = relu(W_e1 × (Add & Norm − Attn)^T)
0.29  0.10  0.00
0.32  0.80  0.00
0.00  0.64  0.00

(FFN_e)^T
How  0.14  0.15  0.43
are  0.66  0.60  1.07
you  0.00  0.00  0.00

Step 1: Add of (Add & Norm − FFN): (Add & Norm − Attn) + (FFN_e)^T
How   1.36  -0.99   0.91
are  -0.56   1.90   1.99
you   0.00  -0.16  -1.39

Step 2: Norm of (Add & Norm − FFN) = X_e
How   1.36  -1.02   0.29
are  -1.03   1.36   1.05
you  -0.33  -0.34  -1.34

This is the final encoder output X_e.
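The feed-forward step is a two-layer position-wise network with a ReLU in between. A sketch with the weights above, applied to (Add & Norm − Attn)^T (again built from the rounded table values, so it matches the worksheet only up to the last digit):

import numpy as np

W_e1 = np.array([[0.63, 0.51, 0.22],
                 [0.81, 0.88, 0.70],
                 [0.33, 0.72, 0.12]])
W_e2 = np.array([[0.24, 0.23, 0.71],
                 [0.48, 0.03, 0.82],
                 [0.71, 0.71, 0.68]])

# (Add & Norm - Attn)^T: one column per token How, are, you.
add_norm_attn_T = np.array([[ 1.22, -1.22,  0.00],
                            [-1.14,  1.30, -0.16],
                            [ 0.48,  0.92, -1.39]])

ffn_ei = np.maximum(W_e1 @ add_norm_attn_T, 0.0)   # FFN_ei = relu(W_e1 x (Add&Norm-Attn)^T)
ffn_e = W_e2 @ ffn_ei                              # FFN_e = W_e2 x FFN_ei

print(np.round(ffn_e.T, 2))   # one row per token
# ~ 0.14 0.15 0.43 / 0.66 0.60 1.07 / 0.00 0.00 0.00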
Decoder

Output: I am fine (positions 1, 2, 3)

Output embedding (one column per token I, am, fine):
0.19  0.84  0.38
1.00  0.71  0.51
0.80  0.49  0.67

Positional embedding (must be of the same dimension as the output):
0.19  0.58  0.87
0.30  0.13  0.21
0.82  0.08  0.76

Y = output + positional embedding
0.38  1.42  1.25
1.30  0.84  0.72
1.62  0.57  1.43

Y^T
I     0.38  1.30  1.62
am    1.42  0.84  0.57
fine  1.25  0.72  1.43

Decoder masked self-attention

W_k
0.91  0.36  0.72
0.19  0.94  0.12

W_q
0.27  0.85  0.89
0.26  0.77  0.80

W_v
0.07  0.43  0.29
0.72  0.24  0.91
0.50  0.41  0.78

K = W_k × Y
1.98  2.01  2.43
1.49  1.13  1.09

Q = W_q × Y
2.65  1.60  2.22
2.40  1.47  2.02

V = W_v × Y
1.06  0.63  0.81
2.06  1.74  2.37
1.99  1.50  2.04
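These are the same three projections as on the encoder side, now applied to Y. A short sketch reusing the decoder weights above:

import numpy as np

Y = np.array([[0.38, 1.42, 1.25],     # one column per token I, am, fine
              [1.30, 0.84, 0.72],
              [1.62, 0.57, 1.43]])

W_k = np.array([[0.91, 0.36, 0.72], [0.19, 0.94, 0.12]])
W_q = np.array([[0.27, 0.85, 0.89], [0.26, 0.77, 0.80]])
W_v = np.array([[0.07, 0.43, 0.29], [0.72, 0.24, 0.91], [0.50, 0.41, 0.78]])

print(np.round(W_k @ Y, 2))   # K = W_k x Y -> ~ 1.98 2.01 2.43 / 1.49 1.13 1.09
print(np.round(W_q @ Y, 2))   # Q = W_q x Y -> ~ 2.65 1.60 2.22 / 2.40 1.47 2.02
print(np.round(W_v @ Y, 2))   # V = W_v x Y -> ~ 1.06 0.63 0.81 / 2.06 1.74 2.37 / 1.99 1.50 2.04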
f(Q,K,V) = Attention(Q,K) × V^T, with
Attention(Q,K) = softmax((Q^T × K)/√2 + Mask)

Q^T × K
I     8.82  8.04  9.06
am    5.36  4.88  5.49
fine  7.41  6.74  7.60

(Q^T × K)/√2
I     6.24  5.69  6.41
am    3.79  3.45  3.88
fine  5.24  4.77  5.37

Mask (each position may attend only to itself and earlier positions):
I     0        -1E+099  -1E+099
am    0         0       -1E+099
fine  0         0        0

e^((Q^T × K)/√2 + Mask)
I     512.86    0.00     0.00
am     44.26   31.50     0.00
fine  188.67  117.92   214.86

Attention(Q,K) = softmax((Q^T × K)/√2 + Mask)
I     1.00  0.00  0.00
am    0.58  0.42  0.00
fine  0.36  0.23  0.41

V^T
I     1.06  2.06  1.99
am    0.63  1.74  1.50
fine  0.81  2.37  2.04

f(Q,K,V) = Attention(Q,K) × V^T
I     1.06  2.06  1.99
am    0.88  1.93  1.78
fine  0.86  2.11  1.90

Step 1: Add of (Add & Norm − Attn): Y^T + f(Q,K,V)
I     1.44  3.36  3.61
am    2.30  2.77  2.35
fine  2.11  2.83  3.33

Step 2: Norm of (Add & Norm − Attn)
Normalize feature-wise, z-normalization: x = (x − μ)/σ
I    -1.38   1.41   0.95
am    0.95  -0.82  -1.38
fine  0.43  -0.59   0.43
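The only difference from the encoder's self-attention is the look-ahead mask: before the softmax, every score that would let a position see a later position is pushed to a very large negative number (−1E+099 in the worksheet), so its exponential becomes zero. A sketch of the masked step, using the rounded Q, K, V above:

import numpy as np

Q = np.array([[2.65, 1.60, 2.22],
              [2.40, 1.47, 2.02]])
K = np.array([[1.98, 2.01, 2.43],
              [1.49, 1.13, 1.09]])
V = np.array([[1.06, 0.63, 0.81],
              [2.06, 1.74, 2.37],
              [1.99, 1.50, 2.04]])

n = Q.shape[1]                                   # number of decoder positions (I, am, fine)
scores = (Q.T @ K) / np.sqrt(K.shape[0])         # (Q^T x K) / sqrt(2)

mask = np.triu(np.full((n, n), -1e99), k=1)      # 0 on/below the diagonal, -1e99 above it
weights = np.exp(scores + mask)                  # e^((Q^T x K)/sqrt(2) + Mask)
weights /= weights.sum(axis=1, keepdims=True)    # softmax -> causal attention weights

attended = weights @ V.T                         # f(Q, K, V)

print(np.round(weights, 2))    # ~ 1.00 0.00 0.00 / 0.58 0.42 0.00 / 0.36 0.23 0.41
print(np.round(attended, 2))   # ~ 1.06 2.06 1.99 / 0.88 1.93 1.78 / 0.86 2.11 1.90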
Encoder-decoder (cross) attention

Q is provided by the decoder; K and V are provided by the encoder output X_e:
K_en = W_(k−en) × X_e
V_en = W_(v−en) × X_e
Q_de = W_(q−de) × (Add & Norm − Attn)^T

W_(k−en)
0.23  0.10  0.06
0.70  0.13  0.45

W_(v−en)
0.27  0.59  0.54
0.42  0.97  0.82
0.02  0.22  0.47

K_en
0.19  -0.12   0.09
0.67  -0.69  -0.26

(V_en)^T
How  -0.42  -0.70  -0.35
are   0.34   0.61   0.12
you  -0.03   0.04  -0.39

Q_de^T
I     0.61   0.44
am   -0.85  -0.67
fine  0.24   0.23

Q_de^T × K_en
 0.41  -0.38  -0.06
-0.61   0.56   0.10
 0.20  -0.19  -0.04

(Q_de^T × K_en)/√2
 0.29  -0.27  -0.04
-0.43   0.40   0.07
 0.14  -0.13  -0.03

e^((Q_de^T × K_en)/√2)
1.34  0.76  0.96
0.65  1.49  1.07
1.15  0.88  0.97

Attention(Q_de, K_en) = softmax((Q_de^T × K_en)/√2)
I     0.44  0.25  0.31
am    0.20  0.46  0.33
fine  0.38  0.29  0.32
      How   are   you

f(Q,K,V) = Attention(Q_de, K_en) × (V_en)^T
I    -0.11  -0.14  -0.24
am    0.06   0.15  -0.14
fine -0.07  -0.08  -0.22

Step 1: Add of (Add & Norm − Attn): (previous Add & Norm) + f(Q,K,V)
I    -1.49   1.27   0.71
am    1.01  -0.67  -1.52
fine  0.36  -0.67   0.21

Step 2: Norm of (Add & Norm − Attn)
I    -1.37   1.41   0.95
am    0.99  -0.71  -1.38
fine  0.38  -0.71   0.43
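Cross-attention reuses the same recipe; only the sources of Q, K and V change: K_en and V_en are projected from the encoder output X_e, while Q_de comes from the decoder's previous Add & Norm output. A sketch starting from the projected values in the tables above:

import numpy as np

# Decoder-side queries: one 2-dimensional query per target token I, am, fine.
Q_de_T = np.array([[ 0.61,  0.44],
                   [-0.85, -0.67],
                   [ 0.24,  0.23]])
# Encoder-side keys and values for the source tokens How, are, you.
K_en = np.array([[0.19, -0.12,  0.09],
                 [0.67, -0.69, -0.26]])
V_en_T = np.array([[-0.42, -0.70, -0.35],
                   [ 0.34,  0.61,  0.12],
                   [-0.03,  0.04, -0.39]])

scores = (Q_de_T @ K_en) / np.sqrt(K_en.shape[0])   # (Q_de^T x K_en) / sqrt(2)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)       # Attention(Q_de, K_en)
attended = weights @ V_en_T                         # x V_en^T

print(np.round(weights, 2))    # ~ 0.44 0.25 0.31 / 0.20 0.46 0.33 / 0.38 0.29 0.32
print(np.round(attended, 2))   # ~ -0.11 -0.14 -0.24 / 0.06 0.15 -0.14 / -0.07 -0.08 -0.22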

Decoder feed-forward network

FFN_di = relu(W_d1 × (Add & Norm − Attn)^T)
FFN_d = W_d2 × FFN_di

FFN_di (one column per token I, am, fine)
0.65  0.00  0.00
0.71  0.00  0.00
0.23  0.00  0.00

(FFN_d)^T
I     0.92  0.81  1.17
am    0.00  0.00  0.00
fine  0.00  0.00  0.00

Step 1: Add of (Add & Norm − FFN): (Add & Norm − Attn) + (FFN_d)^T
I    -0.45   2.22   2.12
am    0.99  -0.71  -1.38
fine  0.38  -0.71   0.43

Step 2: Norm of (Add & Norm − FFN)
I    -1.28   1.41   1.21
am    1.16  -0.71  -1.24
fine  0.12  -0.71   0.03

This is the final decoder output that feeds the linear layer.

Linear layer and softmax

output = W_l × (Norm of (Add & Norm − FFN))^T

W_l (one row per vocabulary word)
How   -539.81   -110  -442.76
are  -1042.62   -213  -854.19
you   -588.67   -120  -482.64
I     -804.35   -164  -658.95
am    -696.62   -142  -571.18
fine  -916.44   -188  -750.24

W_l was obtained as expected output × inverse of (Norm of (Add & Norm − FFN))^T, so that the linear layer reproduces the expected logits.
Inverse of (Norm of (Add & Norm − FFN))^T:
-491.1220044  -100  -402.1786
-490.9586057  -100  -402.8322
-484.3681917  -100  -395.8606

Output = W_l × (Norm of (Add & Norm − FFN))^T  (one column per output position)
How   0.12  0.94  0.04
are   0.65  0.99  0.49
you   0.30  0.82  0.08
I     1.00  0.49  0.15
am    0.33  1.00  0.09
fine  0.17  0.71  1.00

e^Output
How   1.127497  2.559981  1.040811
are   1.915541  2.691234  1.632316
you   1.349859  2.270500  1.083287
I     2.718282  1.632316  1.161834
am    1.390968  2.718282  1.094174
fine  1.185305  2.033991  2.718282
Final Output = softmax(Output), computed over the vocabulary for each position:
How   0.12  0.18  0.12
are   0.20  0.19  0.19
you   0.14  0.16  0.12
I     0.28  0.12  0.13
am    0.14  0.20  0.13
fine  0.12  0.15  0.31
(columns = output positions 1, 2, 3)

Taking the highest probability in each column, the decoder predicts "I" at position 1 (0.28), "am" at position 2 (0.20) and "fine" at position 3 (0.31).
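The readout above is a single linear layer followed by a softmax over the six-word vocabulary at each position. A sketch of that last step, including the greedy argmax that picks the predicted sentence:

import numpy as np

vocab = ["How", "are", "you", "I", "am", "fine"]

# W_l: one row per vocabulary word, taken from the table above.
W_l = np.array([[ -539.81, -110.0, -442.76],
                [-1042.62, -213.0, -854.19],
                [ -588.67, -120.0, -482.64],
                [ -804.35, -164.0, -658.95],
                [ -696.62, -142.0, -571.18],
                [ -916.44, -188.0, -750.24]])

# (Norm of (Add & Norm - FFN))^T: one column per decoder position.
decoder_out_T = np.array([[-1.28,  1.16,  0.12],
                          [ 1.41, -0.71, -0.71],
                          [ 1.21, -1.24,  0.03]])

logits = W_l @ decoder_out_T                  # Output = W_l x (...)^T, 6 x 3
probs = np.exp(logits)
probs /= probs.sum(axis=0, keepdims=True)     # Final Output = softmax over the vocabulary

for pos in range(probs.shape[1]):
    best = int(np.argmax(probs[:, pos]))
    print(f"position {pos + 1}: {vocab[best]} (p = {probs[best, pos]:.2f})")
# -> position 1: I (p = 0.28), position 2: am (p = 0.20), position 3: fine (p = 0.31)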