CS485 Ch5 Transformers
[Figure: key-value lookup; labels: Encode, Decode, Key 3 / Value 3, Key 4 / Value 4, Value]
Attention Mechanism
• Mimics the retrieval of a value from a key-value store, given a query.
• Measures the similarity between the query and each key, and produces an output based on those similarities.
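As a rough illustration of the retrieval analogy, the sketch below picks the value whose key is most similar to the query. The toy keys, the string values, and the helper name `retrieve_most_similar` are all hypothetical, and the dot product is just one possible similarity measure.

```python
import numpy as np

# Hypothetical toy database: each row of `keys` is a key vector,
# and `values` holds the value stored under the matching key.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = ["north", "east", "north-east"]

def retrieve_most_similar(query, keys, values):
    """Return the value whose key has the largest dot product with the query."""
    similarities = keys @ query          # one similarity score per key
    best = int(np.argmax(similarities))  # index of the most similar key
    return values[best], similarities

query = np.array([0.1, 0.9])
value, sims = retrieve_most_similar(query, keys, values)
print(value, sims)
```

This is a "hard" lookup that returns exactly one value. Attention softens the same idea by blending all values according to their similarity scores.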
[Figure: a query $q$ is compared with keys $k_1, k_2, k_3, k_4$ to produce similarity scores $s_1, s_2, s_3, s_4$]
• Similarity:
– Dot product: $q^\top k_i$
– Scaled dot product: $\frac{q^\top k_i}{\sqrt{d}}$, where $d$ is the dimensionality of each key
– General dot product: $q^\top W k_i$
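The three similarity measures listed above can be sketched in NumPy as follows. The function names and example vectors are illustrative assumptions. $W$ is treated here as a square weight matrix, which the slides do not define, and it is set to the identity only to keep the arithmetic easy to check.

```python
import numpy as np

def dot_similarity(q, k):
    """Plain dot product q^T k."""
    return float(q @ k)

def scaled_dot_similarity(q, k):
    """Dot product divided by sqrt(d), where d is the key dimensionality."""
    d = k.shape[-1]
    return float(q @ k) / float(np.sqrt(d))

def general_dot_similarity(q, W, k):
    """General dot product q^T W k with a weight matrix W."""
    return float(q @ W @ k)

# Hypothetical example vectors with d = 4.
q = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.5, 0.5, 0.5])
W = np.eye(4)  # assumption: identity, so the general form equals the plain dot product

s_dot = dot_similarity(q, k)
s_scaled = scaled_dot_similarity(q, k)
s_general = general_dot_similarity(q, W, k)
print(s_dot, s_scaled, s_general)
```

With $d = 4$, the scaled version is the plain dot product divided by 2, which shows how the scaling keeps scores from growing with key dimensionality.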
[Figure: query $q$ and keys $k_1$, $k_2$, $k_3$ drawn as vectors; one key gives the largest scaled dot product with $q$ and therefore the highest similarity]
• $\mathrm{attention}(q, k, v) = \sum_i a_i v_i$
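A minimal sketch of this weighted sum follows. It assumes the weights $a_i$ come from a softmax over scaled dot-product similarities, which is the usual choice but is not stated in this excerpt. The matrices `K` and `V` and the function names are hypothetical.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def attention(q, K, V):
    """Return sum_i a_i * v_i and the weights a.

    q: query vector of shape (d,)
    K: keys, one per row, shape (n, d)
    V: values, one per row, shape (n, d_v)
    """
    d = K.shape[1]
    s = K @ q / np.sqrt(d)  # scaled dot-product similarity per key
    a = softmax(s)          # assumption: weights are softmax(s)
    return a @ V, a         # weighted sum of the value rows

# Hypothetical example with four key-value pairs.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0], [-10.0, 0.0]])
q = np.array([2.0, 2.0])

out, a = attention(q, K, V)
print(a, out)
```

The third key has the largest similarity to `q`, so its value contributes most to the output, while the other values still contribute in proportion to their weights.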