
Understanding Transformers: A Step-by-Step Math Example

Step 1 — Defining our Dataset


The dataset used to create ChatGPT is 570 GB. We, however, will use a very small dataset so that the
numerical calculations can be shown visually.

Our entire dataset contains only three sentences, all of which are dialogues taken from a TV show.
Although our dataset is cleaned, in real-world scenarios cleaning a dataset requires a significant
amount of effort.

Step 2 — Finding Vocab Size


The vocabulary size is the total number of unique words in our dataset. It can be
calculated from N, the total number of words in our dataset, as shown below.

In order to find N, we need to break our dataset into individual words.

After obtaining N, we perform a set operation to remove duplicates, and then we can count the
unique words to determine the vocabulary size. Therefore, the vocabulary size is 23, as there
are 23 unique words in our dataset.
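
A rough sketch of this in Python (only the dialogue line quoted later in the article is reproduced; the other two sentences of the corpus are omitted here):

# Hypothetical stand-in for the three-sentence corpus; only the line quoted
# later in this article is shown, the other two dialogue lines are omitted.
corpus = [
    "when you play the game of thrones you win or you die",
    # ... two more dialogue lines from the show
]

# Break the dataset into individual words to obtain N, the total word count.
words = " ".join(corpus).split()
N = len(words)

# A set operation removes duplicates; counting what remains gives the vocab size.
vocab = sorted(set(words))
vocab_size = len(vocab)   # 23 for the full three-sentence dataset used in the article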

Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
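
A minimal sketch of that encoding, continuing the snippet above:

# Assign a unique integer to every unique word in the vocabulary.
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# Example: encode a sentence as a list of integers.
encoded = [word_to_id[w] for w in "when you play the game of thrones".split()]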
Step 4 — Calculating Embedding
Let’s select a sentence from our corpus that will be processed in our transformer architecture.

We have selected our input, and we need to find an embedding vector for it. The original paper
uses a 512-dimensional embedding vector for each input word. For our case, we will work with a
smaller embedding dimension of 6 so that the calculations are easy to visualize.

The values of the embedding vectors are initially filled with random numbers between 0 and 1.
They will later be updated as our transformer starts learning the meanings of the words.
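
A sketch of this initialization with NumPy, assuming the 6-dimensional embedding described above (variable names here are illustrative):

import numpy as np

d_model = 6                        # embedding dimension in this example (512 in the paper)
rng = np.random.default_rng(0)     # fixed seed so the sketch is reproducible

# One random embedding vector per unique word, values drawn uniformly from [0, 1).
# These are trainable parameters that get updated during training.
embedding_table = rng.random((vocab_size, d_model))

tokens = "when you play the game of thrones".split()
word_embeddings = np.stack([embedding_table[word_to_id[w]] for w in tokens])   # (7, 6)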

Step 5 — Calculating Positional Embedding


Now we need to find positional embeddings for our input. There are two formulas for positional
embedding, chosen depending on whether the index i of the value within each word's embedding vector
is even or odd. As you know, our input sentence is “when you play the game of thrones”, and its
starting word is “when”, with a starting position (pos) of 0 and a dimension (d) of 6. For i from 0
to 5, we calculate the positional embedding for the first word of the input sentence.

Similarly, we can calculate positional embedding for all the words in our input sentence.
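
The two formulas are the sinusoidal ones from the original paper, sine for even indices and cosine for odd indices. A sketch of the calculation, continuing the snippets above:

# PE(pos, 2i)   = sin(pos / 10000^(2i / d))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d))
def positional_embedding(pos, d):
    pe = np.zeros(d)
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        pe[i] = np.sin(angle)          # even index: sine
        if i + 1 < d:
            pe[i + 1] = np.cos(angle)  # odd index: cosine
    return pe

# One positional embedding per word position in the sentence.
positional_embeddings = np.stack([positional_embedding(p, d_model) for p in range(len(tokens))])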

Step 6 — Concatenating Positional and Word Embeddings


After calculating positional embedding, we need to add word embeddings and positional
embeddings.
The resultant matrix obtained by combining the two (the word embedding matrix and the positional
embedding matrix) will serve as the input to the encoder part.
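
In code, this is a simple element-wise addition:

# Element-wise sum of word embeddings and positional embeddings;
# this (7 x 6) matrix is the input to the encoder.
encoder_input = word_embeddings + positional_embeddings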

Step 7 — Multi Head Attention


Multi-head attention is composed of many single-head attentions, and it is up to us how many
heads we combine. For example, Meta's LLaMA LLM uses 32 heads in its attention layers. Below is
an illustration of what single-head attention looks like.

There are three inputs: query, key, and value. Each of these matrices is obtained by multiplying the
transpose of the matrix we computed earlier (the sum of the word embedding and positional embedding
matrices) by a different weight matrix.
Say we are computing the query matrix. The weight matrix must have as many rows as the transposed
matrix has columns, while the number of columns in the weight matrix can be anything; for example, we
choose 4 columns for our weight matrix. The values in the weight matrix are initialized randomly
between 0 and 1 and will later be updated once our transformer starts learning the meanings of these
words.
Similarly, we can compute the key and value matrices using the same procedure, but the values in
their weight matrices must be different.

So, after these multiplications, the resultant query, key, and value matrices are obtained:
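
A sketch of these projections, written in the more common row-wise form Q = X · W_q rather than the transposed form described above; the 4-column weight shape matches the choice made in the text, and the snippet continues from the earlier ones:

d_k = 4   # chosen number of columns for the Q/K/V weight matrices

# Three different, randomly initialised weight matrices (updated during training).
W_q = rng.random((d_model, d_k))
W_k = rng.random((d_model, d_k))
W_v = rng.random((d_model, d_k))

# Project the (word embedding + positional embedding) matrix into Q, K and V.
Q = encoder_input @ W_q   # (7, 4)
K = encoder_input @ W_k   # (7, 4)
V = encoder_input @ W_v   # (7, 4)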
Now that we have all three matrices, let’s start calculating single-head attention step by step.

For scaling the resultant matrix, we reuse the dimension of our embedding vector, which is 6: the
scores are divided by the square root of this dimension.
The next step of masking is optional, and we won’t be calculating it. Masking is like telling the
model to focus only on what’s happened before a certain point and not peek into the future while
figuring out the importance of different words in a sentence. It helps the model understand things
in a step-by-step manner, without cheating by looking ahead. Next, we apply
the softmax operation to our scaled resultant matrix.

We now perform the final multiplication with the value matrix to obtain the resultant matrix from single-head attention.
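
Putting these steps together, here is a sketch of single-head attention (scaling by the square root of the embedding dimension, as this example does; the original paper uses the square root of the key dimension):

def softmax(x):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(Q, K, V, d=d_model):
    scores = Q @ K.T / np.sqrt(d)   # dot-product scores, then scaling
    weights = softmax(scores)       # a mask, if used, would be applied before this
    return weights @ V              # final multiplication with the value matrix

attention_output = single_head_attention(Q, K, V)   # (7, 4)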
We have calculated single-head attention, while multi-head attention comprises many single-head
attentions, as stated earlier. Below is a visual of how it looks:

Each single-head attention has three inputs: query, key, and value, and each of the three has its
own set of weights. Once all the single-head attentions have produced their resultant matrices, these
are concatenated, and the concatenated matrix is transformed linearly by multiplying it with a weight
matrix initialized with random values, which will later be updated when the transformer starts
training. In our case we are working with a single head, but this is how it looks when working with
multi-head attention. In either case, whether it is single-head or multi-head attention, the
resultant matrix needs to be transformed linearly once again by multiplying it with a weight matrix.

Make sure that the number of columns of this linear weight matrix equals the number of columns of
the matrix we computed earlier (word embedding + positional embedding), because in the next step we
will add the resultant matrix to that (word embedding + positional embedding) matrix.
Now that we have computed the resultant matrix for multi-head attention, we can move on to the
adding and normalizing step.
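
A sketch of the multi-head case, assuming two heads purely for illustration; the output weight matrix W_o brings the concatenated result back to 6 columns so it can be added to the (word embedding + positional embedding) matrix:

h = 2   # number of heads, chosen only for illustration

# Each head gets its own randomly initialised (W_q, W_k, W_v) triple.
head_outputs = []
for _ in range(h):
    Wq, Wk, Wv = (rng.random((d_model, d_k)) for _ in range(3))
    head_outputs.append(single_head_attention(encoder_input @ Wq,
                                              encoder_input @ Wk,
                                              encoder_input @ Wv))

# Concatenate the head outputs, then transform linearly back to d_model columns.
concatenated = np.concatenate(head_outputs, axis=-1)   # (7, h * 4)
W_o = rng.random((h * d_k, d_model))
multi_head_output = concatenated @ W_o                 # (7, 6)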

Step 8 — Adding and Normalizing


Once we obtain the resultant matrix from multi-head attention, we have to add it to our original
matrix. Let’s do it first.

To normalize the above matrix, we need to compute the mean and standard deviation of each row.
We then subtract the corresponding row mean from each value of the matrix and divide by the
corresponding standard deviation.

Adding a small error term to the denominator prevents it from becoming zero and keeps the whole
expression from blowing up to infinity.
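
A sketch of this add and norm step, with a small epsilon in the denominator as described:

def add_and_norm(x, sublayer_output, eps=1e-6):
    # Residual connection: add the sub-layer output to its input.
    s = x + sublayer_output
    # Row-wise normalisation: subtract the row mean, divide by the row std deviation.
    mean = s.mean(axis=-1, keepdims=True)
    std = s.std(axis=-1, keepdims=True)
    return (s - mean) / (std + eps)

normalized = add_and_norm(encoder_input, multi_head_output)   # (7, 6)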

Step 9 — Feed Forward Network


After normalizing the matrix, it will be processed through a feedforward network. We will use a
very basic network that contains only one linear layer and one ReLU activation layer. This is how
it looks visually:
First, we calculate the linear layer by multiplying our last computed matrix with a randomly
initialized weight matrix (which will be updated when the transformer starts learning) and adding
a bias matrix that also contains random values.

After calculating the linear layer, we need to pass it through the ReLU layer and use its formula.
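
A sketch of this tiny feed-forward network, a single linear layer with a bias followed by ReLU:

# Randomly initialised weights and bias; both are updated during training.
W_ff = rng.random((d_model, d_model))
b_ff = rng.random((1, d_model))

def feed_forward(x):
    linear = x @ W_ff + b_ff        # linear layer
    return np.maximum(0, linear)    # ReLU: max(0, value)

ff_output = feed_forward(normalized)   # (7, 6)

The add and norm step that follows simply reuses the add_and_norm helper above on ff_output.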
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed-forward network, we add it to the matrix
obtained from the previous add and norm step, and then normalize it using the row-wise mean and
standard deviation.
The output matrix of this add and norm step will serve as the key and value matrices in one of the
multi-head attention mechanisms present in the decoder part, which you can easily see by tracing
outward from the add and norm to the decoder section.

Step 11 — Decoder Part


The good news is that everything up to now has covered the encoder part, and all the steps we have
performed, from encoding our dataset to passing our matrix through the feedforward network, were
new calculations. From here on, the remaining architecture of the transformer (the decoder part)
involves similar kinds of matrix multiplications.
We won’t be calculating the entire decoder because most of its portion contains similar
calculations to what we have already done in the encoder. Calculating the decoder in detail would
only make the blog lengthy due to repetitive steps. Instead, we only need to focus on the
calculations of the input and output of the decoder.
During training, there are two inputs to the decoder. One comes from the encoder, where the output
matrix of its last add and norm layer serves as the key and value for the second multi-head
attention layer in the decoder part. Below is the visualization of it.
The query matrix, on the other hand, comes from the decoder after its first add and norm step. The
second input to the decoder is the text to be predicted. If you remember, our input to the encoder
is “when you play the game of thrones”, so the input to the decoder is the target text, which in
our case is “you win or you die”. But this input text needs to follow a standard wrapping of tokens
that makes the transformer aware of where to start and where to end.

Here, <start> and <end> are two new tokens being introduced. Moreover, the decoder takes one
token as input at a time. That means <start> serves as an input, and “you” should be the text
predicted for it.
As we already know, these embeddings are filled with random values, which will later be updated
during the training process. The rest of the blocks are computed in the same way as we computed
them earlier in the encoder part.
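
A sketch of that wrapping, continuing the earlier snippets and adding the two new tokens to the vocabulary:

# Wrap the target text with the start/end markers the decoder expects.
decoder_text = "<start> you win or you die <end>".split()

# Extend the word-to-id mapping with the two new tokens.
for tok in ("<start>", "<end>"):
    if tok not in word_to_id:
        word_to_id[tok] = len(word_to_id)

decoder_ids = [word_to_id[t] for t in decoder_text]
# During training the decoder sees one token at a time:
# given "<start>" it should predict "you", given "<start> you" it should predict "win", and so on.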

Step 12 — Understanding Mask Multi Head Attention


In a Transformer, the masked multi-head attention is like a spotlight that a model uses to focus on
different parts of a sentence. It’s special because it doesn’t let the model cheat by looking at
words that come later in the sentence. This helps the model understand and generate sentences
step by step, which is important in tasks like talking or translating words into another language.
Suppose we have the following input matrix, where each row represents a position in the
sequence, and each column represents a feature:

Now, let’s understand the masked multi-head attention components having two heads:
1. Linear Projections (Query, Key, Value): Assume the linear projections for each
head: Head 1: Wq1,Wk1,Wv1 and Head 2: Wq2,Wk2,Wv2
2. Calculate Attention Scores: For each head, calculate attention scores using the dot product
of Query and Key, and apply the mask to prevent attending to future positions.
3. Apply Softmax: Apply the softmax function to obtain attention weights.
4. Weighted Summation (Value): Multiply the attention weights by the Value to get the
weighted sum for each head.
5. Concatenate and Linear Transformation: Concatenate the outputs from both heads and
apply a linear transformation.
Let's do a simplified calculation, assuming two conditions:
• Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.
• Q = K = V = Input Matrix
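
A sketch of that simplified calculation with a small hypothetical 3 x 2 input matrix (the article's actual matrix is not reproduced here), reusing the softmax helper defined earlier:

# Hypothetical input: 3 positions, 2 features per position.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# With W_q = W_k = W_v = I for both heads, each head's Q, K and V equal X.
Q = K = V = X

# Causal mask: position i may only attend to positions <= i.
seq_len = X.shape[0]
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = Q @ K.T / np.sqrt(X.shape[1])
scores[future] = -np.inf          # masked entries become zero after softmax
weights = softmax(scores)
head_output = weights @ V         # identical for both heads under these assumptions

# Concatenate the two (identical) head outputs and apply a final linear transformation.
concat = np.concatenate([head_output, head_output], axis=-1)
W_out = rng.random((concat.shape[1], X.shape[1]))
masked_mha_output = concat @ W_out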

The concatenation step combines the outputs from the two attention heads into a single set of
information. Imagine you have two friends who each give you advice on a problem.
Concatenating their advice means putting both pieces of advice together so that you have a more
complete view of what they suggest. In the context of the transformer model, this step helps
capture different aspects of the input data from multiple perspectives, contributing to a richer
representation that the model can use for further processing.

Step 13 — Calculating the Predicted Word


The output matrix of the last add and norm block of the decoder must contain the same number of
rows as the input matrix, while the number of columns can be anything; here, we work with 6.

The resultant matrix of the decoder's last add and norm block must be flattened so that it can be
matched with a linear layer that finds the predicted probability of each unique word in our dataset (corpus).

This flattened layer will be passed through a linear layer to compute the logits (scores) of each
unique word in our dataset.
Once we obtain the logits, we can use the softmax function to normalize them and find the word
that contains the highest probability.
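
A sketch of this final step, continuing the earlier snippets and assuming a single 6-column decoder output row for the current position:

# Hypothetical decoder output for the current step: 1 row, 6 columns.
decoder_output = rng.random((1, d_model))
flattened = decoder_output.flatten()              # (6,)

# Linear layer mapping the flattened vector to one logit per word in the vocabulary.
id_to_word = {i: w for w, i in word_to_id.items()}
W_vocab = rng.random((flattened.shape[0], len(word_to_id)))
logits = flattened @ W_vocab

probs = softmax(logits)                           # normalise logits into probabilities
predicted_word = id_to_word[int(np.argmax(probs))]
# With the article's numbers this comes out as "you"; random weights give an arbitrary word.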

So, based on our calculations, the predicted word from the decoder is “you”.
This predicted word, “you”, will then be treated as the input word for the decoder, and this process
continues until the <end> token is predicted.
