Solved Example of Transformers
Our entire dataset contains only three sentences, all of which are dialogues taken from a TV show.
Although our dataset is cleaned, in real-world scenarios cleaning a dataset requires a significant
amount of effort.
After obtaining N (the total number of words in our dataset), we perform a set operation to remove duplicates, and then we count the unique words to determine the vocabulary size. Therefore, the vocabulary size is 23, as there are 23 unique words in our dataset.
Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
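A minimal sketch of these two steps, using placeholder sentences (the article's actual three dialogues are not reproduced here; with them, the same procedure yields the vocabulary size of 23):

```python
# Hypothetical stand-in corpus; the article's real dataset has three TV-show dialogues
dataset = [
    "this is the first placeholder sentence",
    "this is the second placeholder sentence",
    "and this is the third one",
]

# N = total number of words across all sentences
words = " ".join(dataset).lower().split()
N = len(words)

# The set operation removes duplicates; the remaining unique words form the vocabulary
vocab = sorted(set(words))
vocab_size = len(vocab)    # 23 for the article's actual corpus

# Step 3 - Encoding: assign a unique number to each unique word
word_to_id = {word: idx for idx, word in enumerate(vocab)}
print(vocab_size, word_to_id)
```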
Step 4 — Calculating Embedding
Let’s select a sentence from our corpus that will be processed in our transformer architecture.
We have selected our input, and now we need to find an embedding vector for it. The original paper uses a 512-dimensional embedding vector for each input word. Since we need to work with a smaller embedding dimension to visualize how the calculation takes place, we will use a dimension of 6 for the embedding vector.
The values of the embedding vector are between 0 and 1 and are filled randomly at the beginning. They will later be updated as our transformer starts learning the relationships between the words.
Similarly, we can calculate positional embedding for all the words in our input sentence.
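A minimal sketch of this step, assuming the sinusoidal positional-encoding formula from the original paper; the embedding values are random placeholders, just like the randomly initialized values described above:

```python
import numpy as np

d_model = 6                           # embedding dimension used in this walkthrough
tokens = ["when", "you", "play"]      # hypothetical input sentence
n = len(tokens)

rng = np.random.default_rng(0)
# Word embeddings: random values in [0, 1), updated later as the transformer trains
word_emb = rng.random((n, d_model))

# Positional embedding, assuming the original paper's formula:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
pos_emb = np.zeros((n, d_model))
for pos in range(n):
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pos_emb[pos, i] = np.sin(angle)
        pos_emb[pos, i + 1] = np.cos(angle)

# The matrix handed to the attention block is the element-wise sum of the two
x = word_emb + pos_emb                # shape: (3, 6)
```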
There are three inputs: query, key, and value. Each of these matrices is obtained by multiplying the transpose of the matrix we computed earlier (the sum of the word embedding and positional embedding matrices) by a different weight matrix.
For example, to compute the query matrix, the weight matrix must have the same number of rows as the transposed matrix has columns, while the number of columns of the weight matrix can be anything; here, we suppose 4 columns. The values in the weight matrix are random numbers between 0 and 1, which will later be updated when our transformer starts learning the meaning of these words.
Similarly, we can compute the key and value matrices using the same procedure, but the values in their weight matrices must be different.
So, after multiplying the matrices, we obtain the resultant query, key, and value matrices:
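A sketch of this step, written with the conventional row layout (one row per word); the weight values are random placeholders and the column count of 4 follows the choice above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 6, 4                 # 3 words, 6-dim embeddings, 4-column weights

# Stand-in for the (word embedding + positional embedding) matrix from the previous step
x = rng.random((n, d_model))

# Three different weight matrices, filled with random values in [0, 1) and learned later
W_q = rng.random((d_model, d_k))
W_k = rng.random((d_model, d_k))
W_v = rng.random((d_model, d_k))

Q = x @ W_q                               # query matrix, shape (3, 4)
K = x @ W_k                               # key matrix,   shape (3, 4)
V = x @ W_v                               # value matrix, shape (3, 4)
```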
Now that we have all three matrices, let’s start calculating single-head attention step by step.
For scaling the resultant matrix, we reuse the dimension of our embedding vector, which is 6, and divide the scores by its square root (√6).
The next step of masking is optional, and we won’t be calculating it. Masking is like telling the
model to focus only on what’s happened before a certain point and not peek into the future while
figuring out the importance of different words in a sentence. It helps the model understand things
in a step-by-step manner, without cheating by looking ahead. So now we will be applying
the softmax operation on our scaled resultant matrix.
We then do the final multiplication step, multiplying the softmax output by the value matrix, to obtain the resultant matrix of single-head attention.
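Putting the scaling, optional masking, softmax, and final multiplication together, single-head attention can be sketched as below; the division by √6 follows the embedding dimension reused above, and the mask is left out just as in the text:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max keeps the exponentials stable
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_attention(Q, K, V, d_scale=6, mask=None):
    scores = Q @ K.T / np.sqrt(d_scale)   # scale using the embedding dimension (6)
    if mask is not None:                  # optional masking step, skipped in this example
        scores = scores + mask
    weights = softmax(scores)             # attention weights for every word pair
    return weights @ V                    # final multiplication with the value matrix

# Stand-in Q, K, V matrices from the previous step
rng = np.random.default_rng(0)
Q, K, V = rng.random((3, 4)), rng.random((3, 4)), rng.random((3, 4))
out = single_head_attention(Q, K, V)      # resultant matrix, shape (3, 4)
```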
We have calculated single-head attention, while multi-head attention comprises many single-head attentions, as stated earlier. Below is a visual of how it looks:
Each single-head attention has three inputs: query, key, and value, and each head has its own set of weights. Once all the single-head attentions output their resultant matrices, they are all concatenated, and the concatenated matrix is transformed linearly once more by multiplying it with a weight matrix initialized with random values, which will later be updated when the transformer starts training. In our case we are working with single-head attention, but this is how it looks when multi-head attention is used.
In either case, whether it's single-head or multi-head attention, the resultant matrix needs to be transformed linearly once again by multiplying it with a weight matrix.
Make sure the number of columns of this linear weight matrix equals the number of columns of the matrix we computed earlier (word embedding + positional embedding), because in the next step we will add the resultant normalized matrix to the (word embedding + positional embedding) matrix.
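A sketch of the multi-head version, reusing the single_head_attention function from the sketch above; note that the final weight matrix is given 6 columns so the output can later be added to the (word embedding + positional embedding) matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, num_heads = 3, 6, 4, 2

x = rng.random((n, d_model))      # stand-in (word embedding + positional embedding) matrix

head_outputs = []
for _ in range(num_heads):
    # Every head has its own randomly initialized query, key, and value weights
    W_q, W_k, W_v = rng.random((d_model, d_k)), rng.random((d_model, d_k)), rng.random((d_model, d_k))
    head_outputs.append(single_head_attention(x @ W_q, x @ W_k, x @ W_v))

# Concatenate all heads, then transform linearly with a randomly initialized weight matrix
concat = np.concatenate(head_outputs, axis=-1)    # shape (3, num_heads * 4)
W_o = rng.random((num_heads * d_k, d_model))      # columns = 6, matching the embedding matrix
mha_out = concat @ W_o                            # shape (3, 6)
```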
Now that we have computed the resultant matrix for multi-head attention, we will work on the add and norm step.
To normalize the above matrix, we need to compute the mean and standard deviation of each row.
We subtract the corresponding row mean from each value of the matrix and divide the result by the corresponding standard deviation.
Adding a small error value (epsilon) to the denominator prevents it from being zero and keeps the entire term from becoming infinite.
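A sketch of the add and norm step with these row-wise statistics; eps is the small error value mentioned above:

```python
import numpy as np

def add_and_norm(residual, sublayer_out, eps=1e-6):
    # Add the sublayer output to its input, then normalize every row
    added = residual + sublayer_out
    mean = added.mean(axis=-1, keepdims=True)    # row-wise mean
    std = added.std(axis=-1, keepdims=True)      # row-wise standard deviation
    return (added - mean) / (std + eps)          # eps keeps the denominator from being zero

rng = np.random.default_rng(0)
x = rng.random((3, 6))          # stand-in (word embedding + positional embedding) matrix
mha_out = rng.random((3, 6))    # stand-in multi-head attention output
normed = add_and_norm(x, mha_out)
```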
After calculating the linear layer, we need to pass its output through the ReLU layer, applying the formula ReLU(x) = max(0, x).
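A sketch of the feed-forward step under the assumption of one hidden linear layer followed by ReLU and a second linear layer that brings the width back to 6; the hidden size of 8 is an illustrative choice, not taken from the article:

```python
import numpy as np

def feed_forward(x, d_hidden=8):
    # Hypothetical feed-forward block: linear -> ReLU -> linear back to 6 columns
    rng = np.random.default_rng(1)
    W1, b1 = rng.random((x.shape[1], d_hidden)), rng.random(d_hidden)
    W2, b2 = rng.random((d_hidden, x.shape[1])), rng.random(x.shape[1])
    hidden = np.maximum(0, x @ W1 + b1)          # ReLU(x) = max(0, x)
    return hidden @ W2 + b2

normed = np.random.default_rng(0).random((3, 6)) # stand-in output of the previous add & norm
ffn_out = feed_forward(normed)                   # shape (3, 6)
```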
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed-forward network, we add it to the matrix obtained from the previous add and norm step, and then normalize it using the row-wise mean and standard deviation.
The output matrix of this add and norm step will serve as the key and value matrices in one of the multi-head attention mechanisms present in the decoder part, which you can easily understand by tracing from this add and norm block to the decoder section.
Here, <start> and <end> are two new tokens being introduced. Moreover, the decoder takes one token as input at a time. This means that <start> will serve as the input, and "you" must be the predicted text for it.
As we already know, these embeddings are filled with random values, which will later be updated during the training process. The rest of the blocks are computed in the same way as we computed them earlier in the encoder part.
Now, let's understand the masked multi-head attention component, which has two heads:
1. Linear Projections (Query, Key, Value): Assume the linear projections for each
head: Head 1: Wq1,Wk1,Wv1 and Head 2: Wq2,Wk2,Wv2
2. Calculate Attention Scores: For each head, calculate attention scores using the dot product
of Query and Key, and apply the mask to prevent attending to future positions.
3. Apply Softmax: Apply the softmax function to obtain attention weights.
4. Weighted Summation (Value): Multiply the attention weights by the Value to get the
weighted sum for each head.
5. Concatenate and Linear Transformation: Concatenate the outputs from both heads and
apply a linear transformation.
Let’s do a simplified calculation:
Assuming two conditions:
1. Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.
2. Q = K = V = Input Matrix
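Under these two assumptions, Q = K = V = the input matrix for both heads, so the whole calculation can be sketched as follows (the 3-row input matrix is a hypothetical stand-in for the decoder embeddings):

```python
import numpy as np

def masked_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    # Causal mask: -inf above the diagonal so no token can attend to future positions
    mask = np.where(np.triu(np.ones((n, n)), k=1) == 1, -np.inf, 0.0)
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                        # weighted summation of the values

X = np.random.default_rng(0).random((3, 4))   # hypothetical decoder input matrix

# With Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, both heads receive Q = K = V = X
head1 = masked_attention(X, X, X)
head2 = masked_attention(X, X, X)

concat = np.concatenate([head1, head2], axis=-1)   # step 5: concatenate the two heads
W_o = np.eye(concat.shape[1])                      # linear transformation (identity for simplicity)
output = concat @ W_o
```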
The concatenation step combines the outputs from the two attention heads into a single set of
information. Imagine you have two friends who each give you advice on a problem.
Concatenating their advice means putting both pieces of advice together so that you have a more
complete view of what they suggest. In the context of the transformer model, this step helps
capture different aspects of the input data from multiple perspectives, contributing to a richer
representation that the model can use for further processing.
The resultant matrix of the decoder's last add and norm block must be flattened so that it can be matched with a linear layer to find the predicted probability of each unique word in our dataset (corpus). This flattened layer will be passed through a linear layer to compute the logits (scores) of each unique word in our dataset.
Once we obtain the logits, we can use the softmax function to normalize them and find the word with the highest probability.
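A sketch of the flatten, linear, and softmax steps, assuming the 23-word vocabulary computed earlier and random placeholder weights for the linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(23)]       # placeholder names for the 23 unique words

decoder_out = rng.random((3, 6))               # stand-in: last add & norm output of the decoder
flat = decoder_out.flatten()                   # flatten the matrix into a single vector

W_vocab = rng.random((flat.shape[0], len(vocab)))
logits = flat @ W_vocab                        # one score (logit) per unique word

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                    # softmax normalizes the logits into probabilities

predicted_word = vocab[int(np.argmax(probs))]  # the word with the highest probability
```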
So, based on our calculations, the predicted word from the decoder is "you". This predicted word, "you", will be treated as the next input word for the decoder, and this process continues until the <end> token is predicted.
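Conceptually, the whole generation loop can be sketched like this, where predict_next_word is a hypothetical stand-in for one full decoder pass over the tokens produced so far:

```python
def greedy_decode(predict_next_word, max_len=20):
    # predict_next_word is a placeholder for a complete pass through the decoder
    tokens = ["<start>"]
    while len(tokens) < max_len:
        next_word = predict_next_word(tokens)   # e.g. "you" on the very first step
        if next_word == "<end>":                # stop once the <end> token is predicted
            break
        tokens.append(next_word)                # the prediction becomes the next decoder input
    return tokens[1:]                           # generated words, without the <start> token
```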