CS541 HW4
Self attention
Name:
CWID:
Email ID:
Submission guidelines:
1. Submit this notebook along with its PDF version. You can do this by clicking File -> Print -> "Save as PDF".
2. Name the files as "<mailID_HWnumber.extension>". For example, if your mail ID is abcdefg@stevens.edu, name the files abcdefg_HW4.ipynb and abcdefg_HW4.pdf.
Illustrated: Self-Attention
Step-by-step guide to self-attention with illustrations and code
Reference: Medium article (by the article author)
This notebook walks you through the mathematical operations involved in a self-attention module.
1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3
In [ ]:
import torch
x = [
[1, 0, 1, 0], # Input 1
[0, 2, 0, 2], # Input 2
[1, 1, 1, 1] # Input 3
]
x = torch.tensor(x, dtype=torch.float32)
x
Out[ ]:
tensor([[1., 0., 1., 0.],
[0., 2., 0., 2.],
[1., 1., 1., 1.]])
In [ ]:
w_key = [
[0, 0, 1],
[1, 1, 0],
[0, 1, 0],
[1, 1, 0]
]
w_query = [
[1, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 1, 1]
]
w_value = [
[0, 2, 0],
[0, 3, 0],
[1, 0, 3],
[1, 1, 0]
]
w_key = torch.tensor(w_key, dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)
Note: In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate scheme such as a Gaussian distribution, Xavier initialisation, or Kaiming initialisation.
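As an illustration only (not part of the assignment, and the tensor names below are hypothetical), a minimal sketch of such random initialisation in PyTorch:

import torch

torch.manual_seed(0)                       # for reproducibility
d_in, d_attn = 4, 3                        # dimensions used in this notebook
w_rand = torch.empty(d_in, d_attn)
torch.nn.init.xavier_uniform_(w_rand)      # Xavier/Glorot uniform initialisation
w_gauss = torch.randn(d_in, d_attn) * 0.1  # small Gaussian-initialised weights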
Obtaining the keys:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

Obtaining the values:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3]
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

Obtaining the queries:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
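In code, each of the three products above is a single matrix multiplication of x with the corresponding weight matrix. A minimal sketch, using the tensors defined earlier:

keys    = x @ w_key      # (3, 3) key representations
queries = x @ w_query    # (3, 3) query representations
values  = x @ w_value    # (3, 3) value representations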
Note: In practice, a bias vector may be added to the product of the matrix multiplication.
The correct values for all 3 have been commented in the cell below so you can verify your
output.
In [ ]:
print("Keys: \n ", keys)
# tensor([[0., 1., 1.],
# [4., 4., 0.],
# [2., 3., 1.]])
To obtain attention scores, we start off by taking the dot product between Input 1’s query (red) and all keys (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).
            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
Notice that we only use the query from Input 1. Later we’ll repeat this same step for the other queries.
Note: The above operation is known as dot product attention, one of the several score functions. Other score
functions include scaled dot product and additive/concat.
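As a sketch, assuming keys and queries were derived as above, dot product attention is a single matrix product; the scaled variant mentioned in the note simply divides by the square root of the key dimension:

attn_scores = queries @ keys.T                       # (3, 3) dot product scores
scaled_scores = attn_scores / keys.shape[-1] ** 0.5  # scaled dot product variant (not used below)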
In [ ]:
print(attn_scores)
Q3. Calculate and print the softmax of the attention scores. (30
points)
In [ ]:
from torch.nn.functional import softmax
# Q3 30 points
print(attn_scores_softmax)
# tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
# [6.0337e-06, 9.8201e-01, 1.7986e-02],
# [2.9539e-04, 8.8054e-01, 1.1917e-01]])
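For reference, torch.nn.functional.softmax normalises along the dimension passed as dim, so each slice along that dimension sums to 1. A minimal toy example on an unrelated tensor (not the assignment's values):

from torch.nn.functional import softmax
import torch

t = torch.tensor([[1., 2., 3.],
                  [1., 1., 1.]])
print(softmax(t, dim=-1))   # each row sums to 1
# tensor([[0.0900, 0.2447, 0.6652],
#         [0.3333, 0.3333, 0.3333]])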
In [ ]:
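# Broadcasting note: values[:, None] has shape (3, 1, 3) and
# attn_scores_softmax.T[:, :, None] has shape (3, 3, 1), so their product is a
# (3, 3, 3) tensor in which entry [i, j] is value i weighted by the softmax
# score that query j assigns to input i.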
weighted_values = values[:,None] * attn_scores_softmax.T[:,:,None]
print(weighted_values)
The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all the keys, including its own.
Note: The dimension of query and key must always be the same because of the dot product score function.
However, the dimension of value may be different from query and key. The resulting output will consequently
follow the dimension of value.
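To illustrate this note, a minimal sketch with hypothetical shapes (not the assignment's tensors), where the value dimension differs from the key/query dimension:

import torch

q = torch.randn(3, 4)                     # 3 queries of dimension 4
k = torch.randn(3, 4)                     # keys must match the query dimension
v = torch.randn(3, 2)                     # values may use a different dimension
weights = torch.softmax(q @ k.T, dim=-1)  # (3, 3) attention weights
out = weights @ v                         # output follows the value dimension: (3, 2)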
In [ ]:
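# Summing over dim 0 adds up, for each query, its three weighted value vectors,
# collapsing the (3, 3, 3) tensor into a (3, 3) matrix whose rows are the outputs.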
outputs = weighted_values.sum(dim=0)
print(outputs)