Transformers Explained Visually (Part 1): Overview of Functionality | by Ketan Doshi | Towards Data Science
We’ve been hearing a lot about Transformers and with good reason. They
have taken the world of NLP by storm in the last few years. The Transformer
is an architecture that uses Attention to significantly improve the
performance of deep learning NLP translation models. It was first
introduced in the paper “Attention Is All You Need” and was quickly established
as the leading architecture for most text data applications.
Since then, numerous projects including Google’s BERT and OpenAI’s GPT
series have built on this foundation and published performance results that
handily beat existing state-of-the-art benchmarks.
Here’s a quick summary of the previous and following articles in the series.
My goal throughout will be to understand not just how something works but
why it works that way.
2. How it works (internal operation end-to-end: how data flows and what computations are performed, including matrix representations)

4. Why Attention Boosts Performance (not just what Attention does but why it works so well; how Attention captures the relationships between words in a sentence)

And, from a second list of related NLP articles:

2. Bleu Score (Bleu Score and Word Error Rate are two essential metrics for NLP models)
What is a Transformer
The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output, e.g. translating an input English sentence to Spanish.
At its core, it contains a stack of Encoder layers and Decoder layers. To avoid confusion, we will refer to an individual layer as an Encoder or a Decoder, and will use Encoder stack or Decoder stack for a group of Encoder or Decoder layers.

The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output.
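To make this high-level structure concrete, here is a minimal sketch in PyTorch. It is not the article's reference implementation: the names (`SimpleTransformer`, `src_vocab`, `tgt_vocab`) and sizes are illustrative, `nn.Transformer` is used as a stand-in for the Encoder and Decoder stacks, and positional encoding and masking are omitted for brevity.

```python
import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    """Sketch: Embedding layers -> Encoder/Decoder stacks -> Output layer."""
    def __init__(self, src_vocab, tgt_vocab, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)   # Embedding for the Encoder's input
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)   # Embedding for the Decoder's input
        # nn.Transformer bundles the Encoder stack and the Decoder stack
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.output_layer = nn.Linear(d_model, tgt_vocab)   # Output layer: scores over the vocabulary

    def forward(self, src_ids, tgt_ids):
        src = self.src_embed(src_ids)       # (batch, src_len, d_model)
        tgt = self.tgt_embed(tgt_ids)       # (batch, tgt_len, d_model)
        dec_out = self.transformer(src, tgt)
        return self.output_layer(dec_out)   # (batch, tgt_len, tgt_vocab)
```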
All the Encoders are identical to one another. Similarly, all the Decoders are
identical.
Each Encoder contains a Self-attention layer and a Feed-forward layer. The Decoder also contains the Self-attention layer and the Feed-forward layer, as well as a second Encoder-Decoder attention layer.
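As a rough sketch of what one such layer looks like (a hand-written PyTorch approximation, not the article's code; residual connections, layer normalization, and masking are left out so the three sub-layers stay visible):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of one Decoder layer: Self-attention, Encoder-Decoder attention, Feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, tgt, enc_out):
        # Self-attention: the target sequence attends to itself
        x, _ = self.self_attn(tgt, tgt, tgt)
        # Encoder-Decoder attention: queries come from the Decoder,
        # keys and values come from the Encoder stack's output
        x, _ = self.cross_attn(x, enc_out, enc_out)
        # Position-wise Feed-forward network
        return self.feed_forward(x)
```

An Encoder layer looks the same, minus the Encoder-Decoder attention sub-layer.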
While processing a word, Attention enables the model to focus on other words in the input that are closely related to that word. E.g. ‘Ball’ is closely related to ‘blue’ and ‘holding’. On the other hand, ‘blue’ is not related to ‘boy’.
In the first sentence, the word ‘it’ refers to ‘cat’, while in the second it refers
to ‘milk’. When the model processes the word ‘it’, self-attention gives the
model more information about its meaning so that it can associate ‘it’ with
the correct word.
To enable the model to handle more nuances about the intent and semantics of the sentence, Transformers compute multiple attention scores for each word.

E.g. while processing the word ‘it’, the first score highlights ‘cat’, while the second score highlights ‘hungry’. So when it decodes the word ‘it’, by translating it into a different language, for instance, it will incorporate some aspect of both ‘cat’ and ‘hungry’ into the translated word.
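A quick way to see these multiple scores in practice (a sketch assuming a recent PyTorch version; the sizes and the random input are placeholders for real word embeddings):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, seq_len = 16, 4, 6                 # e.g. a 6-word sentence, 4 attention heads
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)                 # stand-in embeddings for the 6 words
# average_attn_weights=False returns one (seq_len x seq_len) score matrix per head
_, weights = attn(x, x, x, need_weights=True, average_attn_weights=False)
print(weights.shape)                                 # torch.Size([1, 4, 6, 6])
# Row i of each head's matrix shows how strongly word i attends to every other word.
# Different heads can highlight different relationships, e.g. 'it'->'cat' vs 'it'->'hungry'.
```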
Let’s first look at the flow of data during Training. Training data consists of two parts:

The source or input sequence (e.g. “You are welcome” in English, for a translation problem)

The destination or target sequence (its translation in the target language)
4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.

5. The Output layer converts it into word probabilities and the final output sequence.

6. The Transformer’s Loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during back-propagation.
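To make these steps concrete, here is a hedged sketch of a single training step, reusing the illustrative `SimpleTransformer` class from earlier. The token ids, the special-token convention (SOS/EOS), and the hyperparameters are made up, and the causal target mask used in real training is omitted for brevity:

```python
import torch
import torch.nn as nn

SOS, EOS = 1, 2                              # hypothetical special-token ids
src = torch.tensor([[5, 6, 7]])              # e.g. token ids for "You are welcome"
tgt = torch.tensor([[SOS, 8, 9, EOS]])       # token ids for the target sentence

model = SimpleTransformer(src_vocab=100, tgt_vocab=100)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

decoder_input = tgt[:, :-1]                  # target sequence fed to the Decoder (Teacher Forcing)
labels = tgt[:, 1:]                          # expected output, shifted one position to the left

logits = model(src, decoder_input)           # Output layer scores for every target position
loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

optimizer.zero_grad()
loss.backward()                              # gradients for back-propagation
optimizer.step()
```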
Inference
During Inference, we have only the input sequence and don’t have the target
sequence to pass as input to the Decoder. The goal of the Transformer is to
produce the target sequence from the input sequence alone.
So, like in a Seq2Seq model, we generate the output in a loop and feed the
output sequence from the previous timestep to the Decoder in the next
timestep until we come across an end-of-sentence token.
The difference from the Seq2Seq model is that, at each timestep, we re-feed
the entire output sequence generated thus far, rather than just the last word.
4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.
6. We take the last word of the output sequence as the predicted word. That
word is now filled into the second position of our Decoder input
sequence, which now contains a start-of-sentence token and the first
word.
7. Go back to step #3. As before, feed the new Decoder sequence into the
model. Then take the second word of the output and append it to the
Decoder sequence. Repeat this until it predicts an end-of-sentence token.
Note that since the Encoder sequence does not change for each iteration,
we do not have to repeat steps #1 and #2 each time (Thanks to Michal
Kučírka for pointing this out).
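A hedged sketch of this generation loop, again using the illustrative `SimpleTransformer` class and made-up special-token ids (greedy decoding only; a real system might use beam search and, as noted above, would cache the Encoder output instead of recomputing it):

```python
import torch

SOS, EOS = 1, 2                                   # hypothetical special-token ids
src = torch.tensor([[5, 6, 7]])                   # token ids of the input sentence

model = SimpleTransformer(src_vocab=100, tgt_vocab=100)   # untrained here, so outputs are meaningless
model.eval()

generated = torch.tensor([[SOS]])                 # Decoder input starts with only the start token
with torch.no_grad():
    for _ in range(20):                           # cap on the output length
        logits = model(src, generated)            # re-feed the entire sequence generated so far
        next_word = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # prediction at the last position
        generated = torch.cat([generated, next_word], dim=1)        # append it to the Decoder input
        if next_word.item() == EOS:               # stop at the end-of-sentence token
            break

print(generated)                                  # <sos> ... <eos> as token ids
```

Note that this simple sketch also re-runs the Encoder on every iteration; since the Encoder sequence does not change, a real implementation would compute it once and reuse it.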
Teacher Forcing
The approach of feeding the target sequence to the Decoder during training
is known as Teacher Forcing. Why do we do this and what does that term
mean?
During training, we could have used the same approach that is used during
inference. In other words, run the Transformer in a loop, take the last word
from the output sequence, append it to the Decoder input and feed it to the
Decoder for the next iteration. Finally, when the end-of-sentence token is
predicted, the Loss function would compare the generated output sequence
to the target sequence in order to train the network.
Not only would this looping cause training to take much longer, but it would also make the model harder to train: each word would have to be predicted from the model’s own earlier, possibly incorrect, predictions, so errors made early in the sequence would compound.
Before Transformers came along, recurrent models such as RNNs and LSTMs were the standard architectures for sequence problems, but their design limits performance. They process the input sequence sequentially, one word at a time, which means that the computation for time-step t cannot start until the computation for time-step t - 1 has completed. This slows down both training and inference.
As an aside, CNNs can compute all of their outputs in parallel, which makes convolutions much faster. However, they have limitations of their own in dealing with long-range dependencies between words.

The Transformer architecture addresses both of these problems. It processes all the words in the sequence in parallel, thus greatly speeding up computation.
And the distance between words in the input sequence does not matter: Attention is equally good at computing dependencies between adjacent words and between words that are far apart.
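To make the contrast concrete, here is a minimal sketch (PyTorch assumed; the layer sizes are arbitrary) of a recurrent layer that must step through the sequence one position at a time, versus a Transformer encoder layer that processes every position in a single call:

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 10, 32)                     # a 10-word sequence of 32-dim embeddings

# RNN: time-step t cannot start until time-step t-1 has finished
rnn = nn.RNN(input_size=32, hidden_size=32, batch_first=True)
h = torch.zeros(1, 1, 32)
for t in range(seq.size(1)):
    _, h = rnn(seq[:, t:t + 1, :], h)            # one word at a time

# Transformer encoder layer: all 10 positions are processed in one call,
# and Attention relates every word to every other word regardless of distance
enc_layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = enc_layer(seq)                             # (1, 10, 32)
```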
And finally, if you liked this article, you might also enjoy my other series on
Audio Deep Learning, Geolocation Machine Learning, and Image Caption
architectures.