
INTUITIVE TRANSFORMERS SERIES NLP

Transformers Explained Visually (Part 1): Overview of Functionality
A Gentle Guide to Transformers, how they are used for NLP, and why
they are better than RNNs, in Plain English. How Attention helps
improve performance.

Ketan Doshi


Published in Towards Data Science · 10 min read · Dec 13, 2020

Photo by Arseny Togulev on Unsplash

We’ve been hearing a lot about Transformers and with good reason. They
have taken the world of NLP by storm in the last few years. The Transformer
is an architecture that uses Attention to significantly improve the
performance of deep learning NLP translation models. It was first
introduced in the paper Attention is all you need and was quickly established
as the leading architecture for most text data applications.

Since then, numerous projects including Google’s BERT and OpenAI’s GPT
series have built on this foundation and published performance results that
handily beat existing state-of-the-art benchmarks.

Over a series of articles, I’ll go over the basics of the Transformer, its architecture, and how it works internally. We will cover the Transformer’s functionality in a top-down manner. In later articles, we will look under the covers to understand the operation of the system in detail. We will also do a deep dive into the workings of multi-head attention, which is the heart of the Transformer.

Here’s a quick summary of the previous and following articles in the series.
My goal throughout will be to understand not just how something works but
why it works that way.

1. Overview of functionality — this article (How Transformers are used, and why they are better than RNNs. Components of the architecture, and behavior during Training and Inference)

2. How it works (Internal operation end-to-end. How data flows and what
computations are performed, including matrix representations)

3. Multi-head Attention (Inner workings of the Attention module throughout the Transformer)

4. Why Attention Boosts Performance (Not just what Attention does but why it
works so well. How does Attention capture the relationships between words in a
sentence)

And if you’re interested in NLP applications in general, I have some other articles you might like.

1. Beam Search (Algorithm commonly used by Speech-to-Text and NLP applications to enhance predictions)

2. Bleu Score (Bleu Score and Word Error Rate are two essential metrics for NLP
models)

What is a Transformer?
The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output, e.g. translating an input English sentence to Spanish.

(Image by Author)

At its core, it contains a stack of Encoder layers and Decoder layers. To avoid confusion, we will refer to an individual layer as an Encoder or a Decoder, and will use Encoder stack or Decoder stack for a group of Encoder or Decoder layers.

The Encoder stack and the Decoder stack each have their corresponding
Embedding layers for their respective inputs. Finally, there is an Output layer
to generate the final output.
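
To make this concrete, here is a minimal sketch of that structure in PyTorch. The vocabulary sizes, model dimensions, and the use of torch.nn.Transformer are illustrative assumptions rather than the article’s reference code, and Position Encoding is left out for brevity.

```python
import torch
import torch.nn as nn

# Minimal sketch of the structure described above: two Embedding layers,
# an Encoder stack, a Decoder stack, and an Output layer. The sizes and the
# use of torch.nn.Transformer are illustrative assumptions; Position
# Encoding is omitted for brevity.
class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab=10_000, tgt_vocab=10_000, d_model=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)   # Embedding for the Encoder input
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)   # Embedding for the Decoder input
        self.transformer = nn.Transformer(                  # Encoder stack + Decoder stack
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True)
        self.output = nn.Linear(d_model, tgt_vocab)         # Output layer -> a score per word

    def forward(self, src_ids, tgt_ids):
        enc_in = self.src_embed(src_ids)         # (batch, src_len, d_model)
        dec_in = self.tgt_embed(tgt_ids)         # (batch, tgt_len, d_model)
        out = self.transformer(enc_in, dec_in)   # (batch, tgt_len, d_model)
        return self.output(out)                  # (batch, tgt_len, tgt_vocab)

model = TranslationTransformer()
src = torch.randint(0, 10_000, (2, 7))   # a batch of 2 source sentences, 7 tokens each
tgt = torch.randint(0, 10_000, (2, 5))   # the corresponding target sentences, 5 tokens each
print(model(src, tgt).shape)             # torch.Size([2, 5, 10000])
```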

(Image by Author)

All the Encoders are identical to one another. Similarly, all the Decoders are
identical.

(Image by Author)

The Encoder contains the all-important Self-attention layer that computes the relationship between different words in the sequence, as well as a Feed-forward layer.

The Decoder contains the Self-attention layer and the Feed-forward layer,
as well as a second Encoder-Decoder attention layer.

Each Encoder and Decoder has its own set of weights.

The Encoder is a reusable module that is the defining component of all Transformer architectures. In addition to the above two layers, it also has Residual skip connections around both layers along with two LayerNorm layers.
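
As an illustration, a single Encoder layer with that structure might look like the sketch below (a simplified version for clarity; dropout and other production details are left out).

```python
import torch
import torch.nn as nn

# Simplified sketch of one Encoder layer: Self-attention and Feed-forward,
# each wrapped in a Residual skip connection followed by LayerNorm.
# Dropout and other details are omitted; this is illustrative only.
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)       # every position attends to every other
        x = self.norm1(x + attn_out)                # residual connection + LayerNorm
        x = self.norm2(x + self.feed_forward(x))    # residual connection + LayerNorm
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(layer(x).shape)         # torch.Size([2, 10, 512])
```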

(Image by Author)

There are many variations of the Transformer architecture. Some Transformer architectures have no Decoder at all and rely only on the Encoder.

What does Attention Do?


The key to the Transformer’s ground-breaking performance is its use of
Attention.

While processing a word, Attention enables the model to focus on other words in the input that are closely related to that word.

For example, ‘Ball’ is closely related to ‘blue’ and ‘holding’. On the other hand, ‘blue’ is not related to ‘boy’.

The Transformer architecture uses self-attention by relating every word in the input sequence to every other word.

For example, consider two sentences:

The cat drank the milk because it was hungry.

The cat drank the milk because it was sweet.

In the first sentence, the word ‘it’ refers to ‘cat’, while in the second it refers to ‘milk’. When the model processes the word ‘it’, self-attention gives the model more information about its meaning so that it can associate ‘it’ with the correct word.

Dark colors represent higher attention (Image by Author)

To enable it to handle more nuances about the intent and semantics of the
sentence, Transformers include multiple attention scores for each word.

For example, while processing the word ‘it’, the first score highlights ‘cat’, while the second score highlights ‘hungry’. So when it decodes the word ‘it’, by translating it into a different language, for instance, it will incorporate some aspect of both ‘cat’ and ‘hungry’ into the translated word.
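
To give a feel for what these scores are, here is a tiny sketch that computes one set of attention weights using scaled dot-product attention. The random vectors are stand-ins for learned word representations, so the numbers themselves are meaningless; the point is the shape of the computation.

```python
import torch
import torch.nn.functional as F

# Illustrative only: random vectors stand in for the learned representations
# of the words in "The cat drank the milk because it was hungry".
words = ["The", "cat", "drank", "the", "milk", "because", "it", "was", "hungry"]
d_k = 64
queries = torch.randn(len(words), d_k)
keys = torch.randn(len(words), d_k)
values = torch.randn(len(words), d_k)

# Scaled dot-product attention: each word scores every other word, and a
# softmax turns those scores into weights that sum to 1 for each word.
scores = queries @ keys.T / d_k ** 0.5   # (9, 9) score matrix
weights = F.softmax(scores, dim=-1)      # (9, 9) attention weights
output = weights @ values                # each word becomes a weighted mix of all the values

# The row for 'it' shows how much attention 'it' pays to each word. With a
# trained model, 'cat' (or 'milk') would receive a high weight here.
for word, w in zip(words, weights[words.index("it")]):
    print(f"{word:>8}: {w.item():.2f}")
```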

(Image by Author)

Training the Transformer


The Transformer works slightly differently during Training and while doing
Inference.

Let’s first look at the flow of data during Training. Training data consists of
two parts:

The source or input sequence (e.g. “You are welcome” in English, for a translation problem)

The destination or target sequence (e.g. “De nada” in Spanish)

The Transformer’s goal is to learn how to output the target sequence, by using both the input and target sequence.

(Image by Author)

The Transformer processes the data like this:

1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.

2. The stack of Encoders processes this and produces an encoded representation of the input sequence.

3. The target sequence is prepended with a start-of-sentence token, converted into Embeddings (with Position Encoding), and fed to the Decoder.

4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.

5. The Output layer converts it into word probabilities and the final output sequence.

6. The Transformer’s Loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during back-propagation.
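
Here is a sketch of one such training step, reusing the illustrative TranslationTransformer from the earlier sketch. The token IDs and the start/end-of-sentence indices are made-up assumptions, and the look-ahead mask that stops the Decoder from peeking at future words is omitted here (see the mask sketch in the Teacher Forcing section below).

```python
import torch
import torch.nn as nn

# Sketch of one training step, continuing the illustrative
# TranslationTransformer defined earlier. Token IDs are made up and
# <sos>=1 / <eos>=2 are assumed special-token indices. The Decoder's
# look-ahead mask is omitted to keep the sketch short.
model = TranslationTransformer()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

src = torch.tensor([[5, 6, 7]])               # "You are welcome" as token IDs
tgt = torch.tensor([[8, 9]])                  # "De nada" as token IDs

sos, eos = torch.tensor([[1]]), torch.tensor([[2]])
decoder_input = torch.cat([sos, tgt], dim=1)  # step 3: <sos> De nada
labels = torch.cat([tgt, eos], dim=1)         # what the model should output: De nada <eos>

optimizer.zero_grad()
logits = model(src, decoder_input)            # steps 1-5: word scores at every position
loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))  # step 6
loss.backward()                               # back-propagation
optimizer.step()
```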

Inference
During Inference, we have only the input sequence and don’t have the target
sequence to pass as input to the Decoder. The goal of the Transformer is to
produce the target sequence from the input sequence alone.

So, like in a Seq2Seq model, we generate the output in a loop and feed the
output sequence from the previous timestep to the Decoder in the next
timestep until we come across an end-of-sentence token.

The difference from the Seq2Seq model is that, at each timestep, we re-feed
the entire output sequence generated thus far, rather than just the last word.

Inference flow, after first timestep (Image by Author)

The flow of data during Inference is:

1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.

2. The stack of Encoders processes this and produces an encoded representation of the input sequence.

3. Instead of the target sequence, we use an empty sequence with only a start-of-sentence token. This is converted into Embeddings (with Position Encoding) and fed to the Decoder.

4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.

5. The Output layer converts it into word probabilities and produces an output sequence.

6. We take the last word of the output sequence as the predicted word. That word is now filled into the second position of our Decoder input sequence, which now contains a start-of-sentence token and the first word.

7. Go back to step #3. As before, feed the new Decoder sequence into the model. Then take the second word of the output and append it to the Decoder sequence. Repeat this until it predicts an end-of-sentence token. Note that since the Encoder sequence does not change for each iteration, we do not have to repeat steps #1 and #2 each time (Thanks to Michal Kučírka for pointing this out).
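
Here is a sketch of that loop, again reusing the illustrative TranslationTransformer, greedily picking the highest-probability word at each step. The start and end-of-sentence token indices are assumptions.

```python
import torch

# Greedy decoding sketch, continuing the illustrative TranslationTransformer
# example. <sos>=1 and <eos>=2 are assumed token indices. For brevity the
# full model is re-run each iteration; a real implementation would compute
# the Encoder's output once and reuse it (steps #1 and #2).
SOS, EOS, MAX_LEN = 1, 2, 20

model.eval()
src = torch.tensor([[5, 6, 7]])      # the input sentence as token IDs
generated = torch.tensor([[SOS]])    # step 3: start-of-sentence token only

with torch.no_grad():
    for _ in range(MAX_LEN):
        logits = model(src, generated)                            # steps 3-5
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # step 6: best word at last position
        generated = torch.cat([generated, next_id], dim=1)        # step 7: append and repeat
        if next_id.item() == EOS:
            break

print(generated)   # <sos> followed by the predicted token IDs
```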

Teacher Forcing
The approach of feeding the target sequence to the Decoder during training
is known as Teacher Forcing. Why do we do this and what does that term
mean?

During training, we could have used the same approach that is used during
inference. In other words, run the Transformer in a loop, take the last word
from the output sequence, append it to the Decoder input and feed it to the
Decoder for the next iteration. Finally, when the end-of-sentence token is
predicted, the Loss function would compare the generated output sequence
to the target sequence in order to train the network.

Not only would this looping cause training to take much longer, but it would also make it harder to train the model. The model would have to predict the second word based on a potentially erroneous first predicted word, and so on.

Instead, by feeding the target sequence to the Decoder, we are giving it a hint, so to speak, just like a Teacher would. Even if it predicts an erroneous first word, it can instead use the correct first word to predict the second word, so that those errors don’t keep compounding.

In addition, the Transformer is able to output all the words in parallel without looping, which greatly speeds up training.
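
This parallel output is possible because the Decoder’s self-attention applies a look-ahead (causal) mask: all positions are computed at once, but each position is only allowed to attend to the positions before it, so it cannot cheat by looking at the words it is supposed to predict. A small sketch of such a mask:

```python
import torch

# A look-ahead (causal) mask for a 5-word target sequence. All positions are
# computed in parallel, but position i may only attend to positions <= i,
# so the Decoder cannot peek at the words it is being trained to predict.
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# True marks a blocked position; a boolean mask like this can be passed as
# attn_mask to nn.MultiheadAttention or as tgt_mask to nn.Transformer.
```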

What are Transformers used for?


Transformers are very versatile and are used for most NLP tasks, such as language modeling and text classification. They are frequently used in sequence-to-sequence models for applications such as Machine Translation, Text Summarization, Question-Answering, Named Entity Recognition, and Speech Recognition.

There are different flavors of the Transformer architecture for different problems. The basic Encoder Layer is used as a common building block for these architectures, with different application-specific ‘heads’ depending on the problem being solved.

Transformer Classification architecture


A Sentiment Analysis application, for instance, would take a text document
as input. A Classification head takes the Transformer’s output and generates
predictions of the class labels such as a positive or negative sentiment.
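
As a rough sketch, a Classification head on top of an Encoder stack might look like this; the mean pooling, dimensions, and two-class output are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch: an Encoder-only Transformer with a Classification
# head. The mean pooling, dimensions, and two-class output are assumptions.
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=512, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # Encoder stack
        self.classifier = nn.Linear(d_model, num_classes)                   # Classification head

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))   # (batch, seq_len, d_model)
        pooled = x.mean(dim=1)                    # average over the whole document
        return self.classifier(pooled)            # (batch, num_classes): positive / negative

model = SentimentClassifier()
doc = torch.randint(0, 10_000, (1, 50))   # a document of 50 token IDs
print(model(doc).shape)                   # torch.Size([1, 2])
```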

(Image by Author)

Transformer Language Model architecture


A Language Model architecture would take the initial part of an input
sequence such as a text sentence as input, and generate new text by
predicting sentences that would follow. A Language Model head takes the
Transformer’s output and generates a probability for every word in the
vocabulary. The highest probability word becomes the predicted output for
the next word in the sentence.
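
Here is a sketch of that last step, with assumed dimensions and a random stand-in for the Transformer’s output. Greedily picking the single highest-probability word is the simplest choice; sampling from the distribution is also common in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a Language Model head: project the Transformer's
# output back to vocabulary size and pick the next word. The dimensions and
# random stand-in output are assumptions.
vocab_size, d_model = 10_000, 512
lm_head = nn.Linear(d_model, vocab_size)

transformer_output = torch.randn(1, 8, d_model)   # stand-in for the Transformer's output
                                                   # after reading an 8-token prompt
logits = lm_head(transformer_output[:, -1, :])    # a score for every word in the vocabulary
probs = F.softmax(logits, dim=-1)                  # a probability for every word
next_word_id = probs.argmax(dim=-1)                # highest-probability word = predicted next word
print(next_word_id)                                # the index of the predicted word in the vocabulary
```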

(Image by Author)

How are they better than RNNs?


RNNs and their cousins, LSTMs and GRUs, were the de facto architecture for
all NLP applications until Transformers came along and dethroned them.

RNN-based sequence-to-sequence models performed well, and when the Attention mechanism was first introduced, it was used to enhance their performance.

However, they had two limitations:

It was challenging to deal with long-range dependencies between words that were spread far apart in a long sentence.

They process the input sequence sequentially, one word at a time, which means that the computation for time-step t cannot start until the computation for time-step t - 1 is complete. This slows down training and inference.

As an aside, with CNNs, all of the outputs can be computed in parallel, which
makes convolutions much faster. However, they also have limitations in
dealing with long-range dependencies:

In a convolutional layer, only parts of the image (or words if applied to text data) that are close enough to fit within the kernel size can interact with each other. For items that are further apart, you need a much deeper network with many layers.

The Transformer architecture addresses both of these limitations. It got rid of RNNs altogether and relied exclusively on the benefits of Attention.

They process all the words in the sequence in parallel, thus greatly
speeding up computation.

(Image by Author)

The distance between words in the input sequence does not matter. It is
equally good at computing dependencies between adjacent words and
words that are far apart.

Now that we have a high-level idea of what a Transformer is, we can go deeper into its internal functionality in the next article to understand the details of how it works.

And finally, if you liked this article, you might also enjoy my other series on
Audio Deep Learning, Geolocation Machine Learning, and Image Caption
architectures.

Audio Deep Learning Made Simple (Part 1): State-of-the-Art Techniques
A Gentle Guide to the world of disruptive deep learning audio applications and architectures. And why we all need to…
towardsdatascience.com

Leveraging Geolocation Data for Machine Learning: Essential Techniques
A Gentle Guide to Feature Engineering and Visualization with Geospatial data, in Plain English
towardsdatascience.com

Image Captions with Deep Learning: State-of-the-Art Architectures
A Gentle Guide to Image Feature Encoders, Sequence Decoders, Attention, and Multi-modal Architectures, in Plain English
towardsdatascience.com

Let’s keep learning!
