
Understanding Transformers: A Step-by-Step Math Example

Step 1 — Defining our Dataset


The dataset used to create ChatGPT is 570 GB. We, however, will use a very small dataset so that the
numerical calculations can be shown visually.

Our entire dataset contains only three sentences, all of which are dialogues taken from a TV show.
Although our dataset is cleaned, in real-world scenarios cleaning a dataset requires a significant
amount of effort.

Step 2 — Finding Vocab Size


The vocabulary size is the total number of unique words in our dataset. It can be
calculated from N, the total number of words in our dataset, as shown below.

In order to find N, we need to break our dataset into individual words.

After obtaining N, we perform a set operation to remove duplicates, and then we can count the
unique words to determine the vocabulary size. Therefore, the vocabulary size is 23, as there
are 23 unique words in our dataset.
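
A rough sketch of this in Python (only the dialogue line quoted later in the article is reproduced; the other two sentences of the corpus are omitted here):

# Hypothetical stand-in for the three-sentence corpus; only the line quoted
# later in this article is shown, the other two dialogue lines are omitted.
corpus = [
    "when you play the game of thrones you win or you die",
    # ... two more dialogue lines from the show
]

# Break the dataset into individual words to obtain N, the total word count.
words = " ".join(corpus).split()
N = len(words)

# A set operation removes duplicates; counting what remains gives the vocab size.
vocab = sorted(set(words))
vocab_size = len(vocab)   # 23 for the full three-sentence dataset used in the article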

Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
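
A minimal sketch of that encoding, continuing the snippet above:

# Assign a unique integer to every unique word in the vocabulary.
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# Example: encode a sentence as a list of integers.
encoded = [word_to_id[w] for w in "when you play the game of thrones".split()]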
Step 4 — Calculating Embedding
Let’s select a sentence from our corpus that will be processed in our transformer architecture.

We have selected our input, and we need to find an embedding vector for it. The original paper
uses a 512-dimensional embedding vector for each input word. For our case, we will work with a
smaller embedding dimension of 6 so that the calculations are easy to visualize.

The values of the embedding vectors are initially filled with random numbers between 0 and 1.
They will later be updated as our transformer starts learning the meanings of the words.
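
A sketch of this initialization with NumPy, assuming the 6-dimensional embedding described above (variable names here are illustrative):

import numpy as np

d_model = 6                        # embedding dimension in this example (512 in the paper)
rng = np.random.default_rng(0)     # fixed seed so the sketch is reproducible

# One random embedding vector per unique word, values drawn uniformly from [0, 1).
# These are trainable parameters that get updated during training.
embedding_table = rng.random((vocab_size, d_model))

tokens = "when you play the game of thrones".split()
word_embeddings = np.stack([embedding_table[word_to_id[w]] for w in tokens])   # (7, 6)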

Step 5 — Calculating Positional Embedding


Now we need to find positional embeddings for our input. There are two formulas for positional
embedding, chosen depending on whether the index i of the value within each word's embedding vector
is even or odd. As you know, our input sentence is “when you play the game of thrones”, and its
starting word is “when”, with a starting position (pos) of 0 and a dimension (d) of 6. For i from 0
to 5, we calculate the positional embedding for the first word of the input sentence.

Similarly, we can calculate positional embedding for all the words in our input sentence.
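
The two formulas are the sinusoidal ones from the original paper, sine for even indices and cosine for odd indices. A sketch of the calculation, continuing the snippets above:

# PE(pos, 2i)   = sin(pos / 10000^(2i / d))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d))
def positional_embedding(pos, d):
    pe = np.zeros(d)
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        pe[i] = np.sin(angle)          # even index: sine
        if i + 1 < d:
            pe[i + 1] = np.cos(angle)  # odd index: cosine
    return pe

# One positional embedding per word position in the sentence.
positional_embeddings = np.stack([positional_embedding(p, d_model) for p in range(len(tokens))])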

Step 6 — Concatenating Positional and Word Embeddings


After calculating positional embedding, we need to add word embeddings and positional
embeddings.
The resultant matrix obtained by combining the two (the word embedding matrix and the positional
embedding matrix) will serve as the input to the encoder part.
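
In code, this is a simple element-wise addition:

# Element-wise sum of word embeddings and positional embeddings;
# this (7 x 6) matrix is the input to the encoder.
encoder_input = word_embeddings + positional_embeddings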

Step 7 — Multi Head Attention


Multi-head attention is composed of many single-head attentions, and it is up to us how many
heads we combine. For example, Meta's LLaMA LLM uses 32 heads in its attention layers. Below is
an illustration of what single-head attention looks like.

There are three inputs: query, key, and value. Each of these matrices is obtained by multiplying the
transpose of the matrix we computed earlier (the sum of the word embedding and positional embedding
matrices) by a different weight matrix.
Say we are computing the query matrix. The weight matrix must have as many rows as the transposed
matrix has columns, while the number of columns in the weight matrix can be anything; for example, we
choose 4 columns for our weight matrix. The values in the weight matrix are initialized randomly
between 0 and 1 and will later be updated once our transformer starts learning the meanings of these
words.
Similarly, we can compute the key and value matrices using the same procedure, but the values in
their weight matrices must be different.

So, after these multiplications, the resultant query, key, and value matrices are obtained:
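
A sketch of these projections, written in the more common row-wise form Q = X · W_q rather than the transposed form described above; the 4-column weight shape matches the choice made in the text, and the snippet continues from the earlier ones:

d_k = 4   # chosen number of columns for the Q/K/V weight matrices

# Three different, randomly initialised weight matrices (updated during training).
W_q = rng.random((d_model, d_k))
W_k = rng.random((d_model, d_k))
W_v = rng.random((d_model, d_k))

# Project the (word embedding + positional embedding) matrix into Q, K and V.
Q = encoder_input @ W_q   # (7, 4)
K = encoder_input @ W_k   # (7, 4)
V = encoder_input @ W_v   # (7, 4)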
Now that we have all three matrices, let’s start calculating single-head attention step by step.

For scaling the resultant matrix, we reuse the dimension of our embedding vector, which is 6: the
scores are divided by the square root of this dimension.
The next step of masking is optional, and we won’t be calculating it. Masking is like telling the
model to focus only on what’s happened before a certain point and not peek into the future while
figuring out the importance of different words in a sentence. It helps the model understand things
in a step-by-step manner, without cheating by looking ahead. Next, we apply
the softmax operation to our scaled resultant matrix.

We now perform the final multiplication with the value matrix to obtain the resultant matrix from single-head attention.
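
Putting these steps together, here is a sketch of single-head attention (scaling by the square root of the embedding dimension, as this example does; the original paper uses the square root of the key dimension):

def softmax(x):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(Q, K, V, d=d_model):
    scores = Q @ K.T / np.sqrt(d)   # dot-product scores, then scaling
    weights = softmax(scores)       # a mask, if used, would be applied before this
    return weights @ V              # final multiplication with the value matrix

attention_output = single_head_attention(Q, K, V)   # (7, 4)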
We have calculated single-head attention, while multi-head attention comprises many single-head
attentions, as stated earlier. Below is a visual of how it looks:

Each single-head attention has three inputs: query, key, and value, and each of the three has its
own set of weights. Once all the single-head attentions have produced their resultant matrices, these
are concatenated, and the concatenated matrix is transformed linearly by multiplying it with a weight
matrix initialized with random values, which will later be updated when the transformer starts
training. In our case we are working with a single head, but this is how it looks when working with
multi-head attention. In either case, whether it is single-head or multi-head attention, the
resultant matrix needs to be transformed linearly once again by multiplying it with a weight matrix.

Make sure that the number of columns of this linear weight matrix equals the number of columns of
the matrix we computed earlier (word embedding + positional embedding), because in the next step we
will add the resultant matrix to that (word embedding + positional embedding) matrix.
Now that we have computed the resultant matrix for multi-head attention, we can move on to the
adding and normalizing step.
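
A sketch of the multi-head case, assuming two heads purely for illustration; the output weight matrix W_o brings the concatenated result back to 6 columns so it can be added to the (word embedding + positional embedding) matrix:

h = 2   # number of heads, chosen only for illustration

# Each head gets its own randomly initialised (W_q, W_k, W_v) triple.
head_outputs = []
for _ in range(h):
    Wq, Wk, Wv = (rng.random((d_model, d_k)) for _ in range(3))
    head_outputs.append(single_head_attention(encoder_input @ Wq,
                                              encoder_input @ Wk,
                                              encoder_input @ Wv))

# Concatenate the head outputs, then transform linearly back to d_model columns.
concatenated = np.concatenate(head_outputs, axis=-1)   # (7, h * 4)
W_o = rng.random((h * d_k, d_model))
multi_head_output = concatenated @ W_o                 # (7, 6)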

Step 8 — Adding and Normalizing


Once we obtain the resultant matrix from multi-head attention, we have to add it to our original
matrix. Let’s do it first.

To normalize the above matrix, we need to compute the mean and standard deviation of each row.
We then subtract the corresponding row mean from each value of the matrix and divide by the
corresponding standard deviation.

Adding a small error term to the denominator prevents it from becoming zero and keeps the whole
expression from blowing up to infinity.
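
A sketch of this add and norm step, with a small epsilon in the denominator as described:

def add_and_norm(x, sublayer_output, eps=1e-6):
    # Residual connection: add the sub-layer output to its input.
    s = x + sublayer_output
    # Row-wise normalisation: subtract the row mean, divide by the row std deviation.
    mean = s.mean(axis=-1, keepdims=True)
    std = s.std(axis=-1, keepdims=True)
    return (s - mean) / (std + eps)

normalized = add_and_norm(encoder_input, multi_head_output)   # (7, 6)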

Step 9 — Feed Forward Network


After normalizing the matrix, it will be processed through a feedforward network. We will use a
very basic network that contains only one linear layer and one ReLU activation layer. This is how
it looks visually:
First, we calculate the linear layer by multiplying our last computed matrix with a randomly
initialized weight matrix (which will be updated when the transformer starts learning) and adding
a bias matrix that also contains random values.

After calculating the linear layer, we need to pass it through the ReLU layer and use its formula.
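
A sketch of this tiny feed-forward network, a single linear layer with a bias followed by ReLU:

# Randomly initialised weights and bias; both are updated during training.
W_ff = rng.random((d_model, d_model))
b_ff = rng.random((1, d_model))

def feed_forward(x):
    linear = x @ W_ff + b_ff        # linear layer
    return np.maximum(0, linear)    # ReLU: max(0, value)

ff_output = feed_forward(normalized)   # (7, 6)

The add and norm step that follows simply reuses the add_and_norm helper above on ff_output.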
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed-forward network, we add it to the matrix
obtained from the previous add and norm step, and then normalize it using the row-wise mean and
standard deviation.
The output matrix of this add and norm step will serve as the key and value matrices in one of the
multi-head attention mechanisms present in the decoder part, which you can easily see by tracing
outward from the add and norm to the decoder section.

Step 11 — Decoder Part


The good news is that everything up to now has covered the encoder part, and all the steps we have
performed, from encoding our dataset to passing our matrix through the feedforward network, were
new calculations. From here on, the remaining architecture of the transformer (the decoder part)
involves similar kinds of matrix multiplications.
We won’t be calculating the entire decoder because most of its portion contains similar
calculations to what we have already done in the encoder. Calculating the decoder in detail would
only make the blog lengthy due to repetitive steps. Instead, we only need to focus on the
calculations of the input and output of the decoder.
During training, there are two inputs to the decoder. One comes from the encoder, where the output
matrix of its last add and norm layer serves as the key and value for the second multi-head
attention layer in the decoder part. Below is the visualization of it.
The query matrix, on the other hand, comes from the decoder after its first add and norm step. The
second input to the decoder is the text to be predicted. If you remember, our input to the encoder
is “when you play the game of thrones”, so the input to the decoder is the target text, which in
our case is “you win or you die”. But this input text needs to follow a standard wrapping of tokens
that makes the transformer aware of where to start and where to end.

Here, <start> and <end> are two new tokens being introduced. Moreover, the decoder takes one
token as input at a time. That means <start> serves as an input, and “you” should be the text
predicted for it.
As we already know, these embeddings are filled with random values, which will later be updated
during the training process. The rest of the blocks are computed in the same way as we computed
them earlier in the encoder part.
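
A sketch of that wrapping, continuing the earlier snippets and adding the two new tokens to the vocabulary:

# Wrap the target text with the start/end markers the decoder expects.
decoder_text = "<start> you win or you die <end>".split()

# Extend the word-to-id mapping with the two new tokens.
for tok in ("<start>", "<end>"):
    if tok not in word_to_id:
        word_to_id[tok] = len(word_to_id)

decoder_ids = [word_to_id[t] for t in decoder_text]
# During training the decoder sees one token at a time:
# given "<start>" it should predict "you", given "<start> you" it should predict "win", and so on.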

Step 12 — Understanding Mask Multi Head Attention


In a Transformer, the masked multi-head attention is like a spotlight that a model uses to focus on
different parts of a sentence. It’s special because it doesn’t let the model cheat by looking at
words that come later in the sentence. This helps the model understand and generate sentences
step by step, which is important in tasks like talking or translating words into another language.
Suppose we have the following input matrix, where each row represents a position in the
sequence, and each column represents a feature:

Now, let’s understand the masked multi-head attention components having two heads:
1. Linear Projections (Query, Key, Value): Assume the linear projections for each
head: Head 1: Wq1,Wk1,Wv1 and Head 2: Wq2,Wk2,Wv2
2. Calculate Attention Scores: For each head, calculate attention scores using the dot product
of Query and Key, and apply the mask to prevent attending to future positions.
3. Apply Softmax: Apply the softmax function to obtain attention weights.
4. Weighted Summation (Value): Multiply the attention weights by the Value to get the
weighted sum for each head.
5. Concatenate and Linear Transformation: Concatenate the outputs from both heads and
apply a linear transformation.
Let's do a simplified calculation, assuming two conditions:
• Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.
• Q = K = V = Input Matrix
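
A sketch of that simplified calculation with a small hypothetical 3 x 2 input matrix (the article's actual matrix is not reproduced here), reusing the softmax helper defined earlier:

# Hypothetical input: 3 positions, 2 features per position.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# With W_q = W_k = W_v = I for both heads, each head's Q, K and V equal X.
Q = K = V = X

# Causal mask: position i may only attend to positions <= i.
seq_len = X.shape[0]
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = Q @ K.T / np.sqrt(X.shape[1])
scores[future] = -np.inf          # masked entries become zero after softmax
weights = softmax(scores)
head_output = weights @ V         # identical for both heads under these assumptions

# Concatenate the two (identical) head outputs and apply a final linear transformation.
concat = np.concatenate([head_output, head_output], axis=-1)
W_out = rng.random((concat.shape[1], X.shape[1]))
masked_mha_output = concat @ W_out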

The concatenation step combines the outputs from the two attention heads into a single set of
information. Imagine you have two friends who each give you advice on a problem.
Concatenating their advice means putting both pieces of advice together so that you have a more
complete view of what they suggest. In the context of the transformer model, this step helps
capture different aspects of the input data from multiple perspectives, contributing to a richer
representation that the model can use for further processing.

Step 13 — Calculating the Predicted Word


The output matrix of the last add and norm block of the decoder must contain the same number of
rows as the input matrix, while the number of columns can be anything; here, we work with 6.

The resultant matrix of the decoder's last add and norm block must be flattened so that it can be
matched with a linear layer that finds the predicted probability of each unique word in our dataset (corpus).

This flattened layer will be passed through a linear layer to compute the logits (scores) of each
unique word in our dataset.
Once we obtain the logits, we can use the softmax function to normalize them and find the word
that contains the highest probability.
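
A sketch of this final step, continuing the earlier snippets and assuming a single 6-column decoder output row for the current position:

# Hypothetical decoder output for the current step: 1 row, 6 columns.
decoder_output = rng.random((1, d_model))
flattened = decoder_output.flatten()              # (6,)

# Linear layer mapping the flattened vector to one logit per word in the vocabulary.
id_to_word = {i: w for w, i in word_to_id.items()}
W_vocab = rng.random((flattened.shape[0], len(word_to_id)))
logits = flattened @ W_vocab

probs = softmax(logits)                           # normalise logits into probabilities
predicted_word = id_to_word[int(np.argmax(probs))]
# With the article's numbers this comes out as "you"; random weights give an arbitrary word.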

So, based on our calculations, the predicted word from the decoder is “you”.
This predicted word, “you”, will then be treated as the input word for the decoder, and this process
continues until the <end> token is predicted.
