
CS 541-A-Homework 4

Self attention

Fill your details below

Name:

CWID:

Email ID:

References: Cite your references here

Submission guidelines:

1. Submit this notebook along with its PDF version. You can do this by clicking File->Print->"Save as
PDF"

2. Name the files as "<mailID_HWnumber.extension>". For example, if your mail ID is abcdefg@stevens.edu, then name
the files abcdefg_HW1.ipynb and abcdefg_HW1.pdf.

3. Please do not Zip your files.

Illustrated: Self-Attention
Step-by-step guide to self-attention with illustrations and code

Reference: the Medium article "Illustrated: Self-Attention" and its accompanying Colab notebook by Manuel Romero


What do BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, MobileBERT, TinyBERT
and CamemBERT all have in common? And I'm not looking for the answer "BERT". Answer: self-attention.
We are not only talking about architectures bearing the name "BERT", but more precisely about Transformer-based
architectures. Transformer-based architectures, which are primarily used in modelling language understanding
tasks, eschew recurrent neural networks (RNNs) and instead rely entirely on self-attention
mechanisms to draw global dependencies between inputs and outputs. But what's the math behind this?

The main content of this kernel is to walk you through the mathematical operations involved in a self-attention
module.

Step 0. What is self-attention?


If you're wondering whether self-attention is similar to attention, the answer is yes! They fundamentally share the
same concept and many common mathematical operations. A self-attention module takes in n inputs and
returns n outputs. What happens in this module? In layman's terms, the self-attention mechanism allows the
inputs to interact with each other ("self") and find out which inputs they should pay more attention to ("attention").
The outputs are aggregates of these interactions, weighted by the attention scores.

In the following, we are going to explain and implement the steps below (a compact matrix formula summarising them is given right after the list):

1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3
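In compact matrix form, and using the unscaled dot-product attention applied throughout this tutorial, these steps can be summarised as follows, where $X$ holds the inputs row-wise and $W_K$, $W_Q$, $W_V$ are the weight matrices introduced in Step 2:

$$K = X W_K,\qquad Q = X W_Q,\qquad V = X W_V,\qquad \text{Output} = \operatorname{softmax}\!\left(Q K^{\top}\right) V$$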

In [ ]:
import torch

Step 1: Prepare inputs


For this tutorial, for the sake of simplicity, we start with 3 inputs, each with dimension 4.
In [ ]:

x = [
[1, 0, 1, 0], # Input 1
[0, 2, 0, 2], # Input 2
[1, 1, 1, 1] # Input 3
]
x = torch.tensor(x, dtype=torch.float32)
x
Out[ ]:
tensor([[1., 0., 1., 0.],
[0., 2., 0., 2.],
[1., 1., 1., 1.]])

Step 2: Initialise weights


Every input must have three representations. These representations are called key (orange),
query (red), and value (purple). For this example, let's say we want these representations to have a
dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape
of 4×3.

In [ ]:
w_key = [
[0, 0, 1],
[1, 1, 0],
[0, 1, 0],
[1, 1, 0]
]
w_query = [
[1, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 1, 1]
]
w_value = [
[0, 2, 0],
[0, 3, 0],
[1, 0, 3],
[1, 1, 0]
]
w_key = torch.tensor(w_key, dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)

print("Weights for key: \n ", w_key)


print("Weights for query: \n ", w_query)
print("Weights for value: \n ", w_value)

Weights for key:
tensor([[0., 0., 1.],
[1., 1., 0.],
[0., 1., 0.],
[1., 1., 0.]])
Weights for query:
tensor([[1., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 1.]])
Weights for value:
tensor([[0., 2., 0.],
[0., 3., 0.],
[1., 0., 3.],
[1., 1., 0.]])

Note: In a neural network setting, these weights are usually small numbers, initialised randomly using an
appropriate scheme such as Gaussian, Xavier, or Kaiming initialisation.
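
For completeness, here is a minimal sketch of how such a random initialisation might look in PyTorch. This is purely illustrative; the fixed integer weights above are what the rest of the tutorial uses, and the tensor names ending in _rand are hypothetical.

import torch

d_in, d_out = 4, 3  # input dimension and key/query/value dimension used in this tutorial

# Illustrative random initialisations (not used in the remaining steps)
w_key_rand = torch.empty(d_in, d_out)
torch.nn.init.normal_(w_key_rand, mean=0.0, std=0.02)   # Gaussian

w_query_rand = torch.empty(d_in, d_out)
torch.nn.init.xavier_uniform_(w_query_rand)             # Xavier / Glorot

w_value_rand = torch.empty(d_in, d_out)
torch.nn.init.kaiming_uniform_(w_value_rand)            # Kaiming / He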

Step 3: Derive key, query and value


Now that we have the three sets of weights, let’s actually obtain the key , query and value representations for
every input.

Obtaining the keys:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

Obtaining the values:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3]
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

Obtaining the querys:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
Note: In practice, a bias vector may be added to the product of the matrix multiplication.
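
For reference, a minimal sketch of one way to carry out these multiplications in PyTorch, assuming the tensors x, w_key, w_query and w_value defined in the cells above (a sketch only, not the required solution for Q1):

# Each representation is a plain matrix product of the inputs with the
# corresponding weight matrix: (3 x 4) @ (4 x 3) -> (3 x 3)
keys   = x @ w_key
querys = x @ w_query
values = x @ w_value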

Q1. (45 points)

1. Obtain and print the keys. (15 points)

2. Obtain and print the querys. (15 points)

3. Obtain and print the values. (15 points)

The correct values for all 3 have been commented in the cell below so you can verify your
output.

In [ ]:
print("Keys: \n ", keys)
# tensor([[0., 1., 1.],
# [4., 4., 0.],
# [2., 3., 1.]])

print("Querys: \n ", querys)


# tensor([[1., 0., 2.],
# [2., 2., 2.],
# [2., 1., 3.]])
print("Values: \n ", values)
# tensor([[1., 2., 3.],
# [2., 8., 0.],
# [2., 6., 3.]])
Keys:
tensor([[0., 1., 1.],
[4., 4., 0.],
[2., 3., 1.]])
Querys:
tensor([[1., 0., 2.],
[2., 2., 2.],
[2., 1., 3.]])
Values:
tensor([[1., 2., 3.],
[2., 8., 0.],
[2., 6., 3.]])

Step 4: Calculate attention scores

To obtain attention scores, we start by taking the dot product of Input 1's query (red) with all the keys
(orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention
scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other querys.

Note: The above operation is known as dot-product attention, one of several score functions. Other score
functions include scaled dot product and additive/concat.
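
For reference, a minimal sketch of the dot-product scores for all three queries at once, assuming the keys and querys tensors from Step 3 (a sketch only, not the required solution for Q2):

# Dot-product attention scores: row i holds query i dotted with every key
attn_scores = querys @ keys.T        # shape (3, 3)

# The scaled dot-product variant mentioned above would simply divide by
# the square root of the key dimension, e.g. (querys @ keys.T) / 3 ** 0.5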

Q2. Calculate and print the attention scores. (25 points)


Correct output has been commented for verification.

In [ ]:
print(attn_scores)

# tensor([[ 2., 4., 4.], # attention scores from Query 1


# [ 4., 16., 12.], # attention scores from Query 2
# [ 4., 12., 10.]]) # attention scores from Query 3
tensor([[ 2., 4., 4.],
[ 4., 16., 12.],
[ 4., 12., 10.]])

Step 5: Calculate softmax


Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]   (values rounded; the exact result appears in the cell below)
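
For reference, a minimal sketch assuming the attn_scores tensor from Step 4; the softmax is taken along the last dimension so that each row of scores sums to 1 (a sketch only, not the required solution for Q3):

from torch.nn.functional import softmax

# Row-wise softmax: each input's attention over all keys sums to 1
attn_scores_softmax = softmax(attn_scores, dim=-1)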

Q3. Calculate and print the softmax of the attention scores. (30
points)
In [ ]:
from torch.nn.functional import softmax

# Q3 30 points
print(attn_scores_softmax)
# tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
# [6.0337e-06, 9.8201e-01, 1.7986e-02],
# [2.9539e-04, 8.8054e-01, 1.1917e-01]])

# For readability, approximate the above as follows


attn_scores_softmax = [
[0.0, 0.5, 0.5],
[0.0, 1.0, 0.0],
[0.0, 0.9, 0.1]
]
attn_scores_softmax = torch.tensor(attn_scores_softmax)
print(attn_scores_softmax)

tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],


[6.0337e-06, 9.8201e-01, 1.7986e-02],
[2.9539e-04, 8.8054e-01, 1.1917e-01]])
tensor([[0.0000, 0.5000, 0.5000],
[0.0000, 1.0000, 0.0000],
[0.0000, 0.9000, 0.1000]])

Step 6: Multiply scores with values


The softmaxed attention score for each input (blue) is multiplied with its corresponding value (purple). This
results in 3 alignment vectors (yellow). In this tutorial, we'll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]


2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

In [ ]:
# Broadcasting: weighted_values[i, j] = attn_scores_softmax[j, i] * values[i]
weighted_values = values[:,None] * attn_scores_softmax.T[:,:,None]
print(weighted_values)

tensor([[[0.0000, 0.0000, 0.0000],


[0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000]],

[[1.0000, 4.0000, 0.0000],


[2.0000, 8.0000, 0.0000],
[1.8000, 7.2000, 0.0000]],

[[1.0000, 3.0000, 1.5000],


[0.0000, 0.0000, 0.0000],
[0.2000, 0.6000, 0.3000]]])

Step 7: Sum weighted values


Take all the weighted values (yellow) and sum them element-wise:

[0.0, 0.0, 0.0]


+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1 , which is based on the query representation
from Input 1 interacting with all other keys, including itself.

Step 8: Repeat for Input 2 & Input 3

Note: The dimension of query and key must always be the same because of the dot product score function.
However, the dimension of value may be different from query and key. The resulting output will consequently
follow the dimension of value.

In [ ]:
# Sum the weighted values over all inputs; row j is the output for query j
outputs = weighted_values.sum(dim=0)
print(outputs)

# tensor([[2.0000, 7.0000, 1.5000], # Output 1


# [2.0000, 8.0000, 0.0000], # Output 2
# [2.0000, 7.8000, 0.3000]]) # Output 3
tensor([[2.0000, 7.0000, 1.5000],
[2.0000, 8.0000, 0.0000],
[2.0000, 7.8000, 0.3000]])
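
To tie everything together, here is a compact end-to-end sketch that wraps Steps 3–8 into a single hypothetical helper (self_attention is not part of the assignment; it simply reproduces the computation above, using the exact softmax rather than the rounded one):

import torch
from torch.nn.functional import softmax

def self_attention(x, w_key, w_query, w_value):
    # Unscaled dot-product self-attention, following Steps 3-8 above
    keys   = x @ w_key        # Step 3: derive key, query and value
    querys = x @ w_query
    values = x @ w_value
    attn_scores = querys @ keys.T                   # Step 4: dot-product scores
    attn_weights = softmax(attn_scores, dim=-1)     # Step 5: row-wise softmax
    return attn_weights @ values                    # Steps 6-8: weight and sum values

print(self_attention(x, w_key, w_query, w_value))
# With the exact softmax the outputs are approximately:
# tensor([[1.9366, 6.6831, 1.5951],
#         [2.0000, 7.9640, 0.0540],
#         [1.9997, 7.7599, 0.3584]])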
