
Transformer in NLP

INSTRUCTOR NAME: SHUKDEV DATTA


ML DEVELOPER AT INNOVATIVE SKILLS
Transformer in NLP
Transformer models are a type of deep learning model used for natural language processing
(NLP) tasks. They can learn long-range dependencies between words in a sentence, which makes them
very powerful for tasks such as machine translation, text summarization, and question answering.

Transformer models work by first encoding the input sentence into a sequence of vectors. This encoding
is done using a self-attention mechanism, which allows the model to learn the relationships between the
words in the sentence.

Once the input sentence has been encoded, the model decodes it into a sequence of output tokens. This
decoding is also done using a self-attention mechanism.

The attention mechanism is what allows transformer models to learn long-range dependencies between
words in a sentence. The attention mechanism works by focusing on the most relevant words in the
input sentence when decoding the output tokens.
Encoding & Decoding
NLP transformer architecture
The transformer model is made up of two main components: an encoder and a decoder. The encoder takes
the input sentence as input and produces a sequence of vectors. The decoder then takes these vectors as
input and produces the output sentence.
Working Principle of the Transformer Architecture
The encoder consists of a stack of self-attention layers. Each self-attention layer takes a sequence of
vectors as input and produces a new sequence of vectors. The self-attention layer works by first
computing a score for each pair of words in the input sequence. The score for a pair of words is a measure
of how related the two words are. The self-attention layer then uses these scores to compute a weighted
sum of the input vectors. The weighted sum is the output of the self-attention layer.
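The scoring-and-weighted-sum procedure just described can be sketched in a few lines of NumPy. This is a deliberately simplified version: a real transformer first projects the inputs into separate query, key, and value matrices with learned weights, while here the input vectors are used directly for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Simplified self-attention: each row of X is one word's vector.
    Scores measure how related each pair of words is; the output is a
    score-weighted sum of the input vectors."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # pairwise relatedness scores
    weights = softmax(scores, axis=-1)  # one weight distribution per word
    return weights @ X                  # weighted sum of input vectors

X = np.random.randn(4, 8)   # a "sentence" of 4 words, 8-dim vectors
out = self_attention(X)
print(out.shape)            # one new vector per word: (4, 8)
```

The division by the square root of the vector dimension is the "scaled" part of scaled dot-product attention; it keeps the scores from growing with dimensionality and saturating the softmax.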
The decoder also consists of a stack of self-attention layers, together with encoder-decoder
(cross-)attention layers and feed-forward layers. The self-attention layers work the same way as in
the encoder, while the cross-attention layers attend to the encoder's output vectors. The decoder
produces the output tokens one at a time, each new token conditioned on the tokens generated so far.
The output tokens are the words in the output sentence.

The attention mechanism is what allows the transformer model to learn long-range dependencies
between words in a sentence. The attention mechanism works by focusing on the most relevant words in
the input sentence when decoding the output tokens.

For example, let’s say we want to translate the sentence “I love you” from English to Spanish. The
transformer model would first encode the sentence into a sequence of vectors. Then, the model would
decode the vectors into a sequence of Spanish words. The attention mechanism would allow the model
to focus on the words “I” and “you” in the English sentence when decoding the Spanish words “te amo”.
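The token-by-token decoding described above can be sketched as a simple loop. Here `decoder_step` is a hypothetical stand-in for the full decoder (in reality it would run the attention layers over the encoder output and the tokens generated so far); the toy version below simply returns a canned Spanish translation to show the control flow.

```python
def greedy_decode(encoder_vectors, decoder_step,
                  start_token="<s>", end_token="</s>", max_len=20):
    """Autoregressive decoding sketch: repeatedly ask the decoder for the
    next token, conditioned on the encoder output and what has been
    generated so far, until it emits an end-of-sentence token."""
    generated = [start_token]
    for _ in range(max_len):
        next_tok = decoder_step(encoder_vectors, generated)
        if next_tok == end_token:
            break
        generated.append(next_tok)
    return generated[1:]  # drop the start token

# Toy stand-in for the decoder: returns a canned translation token by token.
canned = ["te", "amo", "</s>"]
def toy_step(enc, generated):
    return canned[len(generated) - 1]

print(greedy_decode(None, toy_step))  # ['te', 'amo']
```

This greedy loop always picks the decoder's single proposed token; real systems often use beam search instead, keeping several candidate translations alive at once.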
Encoding in Detail
Decoding in Detail
Encoder-only models
Decoder-only models
Differences
What are transformer models
built of
Embedding layer: The embedding layer converts the input text into a sequence of vectors. The vectors represent the
meaning of the words in the text.

Self-attention layers: The self-attention layers allow the model to learn long-range dependencies between words in
a sentence. The self-attention layers work by computing a score for each pair of words in the sentence. The score for
a pair of words is a measure of how related the two words are. The self-attention layers then use these scores to
compute a weighted sum of the input vectors. The weighted sum is the output of the self-attention layer.

Positional encoding: The positional encoding layer adds information about the position of each word in the
sentence. This is important for learning long-range dependencies, as it allows the model to know which words are
close to each other in the sentence.

Decoder: The decoder takes the output of the self-attention layers as input and produces a sequence of output
tokens. The output tokens are the words in the output sentence.
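The embedding layer from the first bullet is, at its core, a lookup table: each word id selects one row of a learned matrix. A minimal sketch (the vocabulary and table here are made-up placeholders; in a real model the table's values are learned during training):

```python
import numpy as np

vocab = {"i": 0, "love": 1, "you": 2}   # toy vocabulary
d_model = 8                              # embedding dimension
# In a real model these vectors are learned; here they are random.
embedding_table = np.random.randn(len(vocab), d_model)

def embed(words):
    """Embedding lookup: each word's id selects a row (its vector)."""
    ids = [vocab[w] for w in words]
    return embedding_table[ids]

vectors = embed(["i", "love", "you"])
print(vectors.shape)  # (3, 8): one vector per input word
```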
Positional Encoding
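One common choice, used in the original "Attention Is All You Need" paper, is sinusoidal positional encoding: each position gets a fixed vector of sines and cosines at different frequencies, which is added to the word embeddings. A minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dims get sines
    pe[:, 1::2] = np.cos(angles)  # odd dims get cosines
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)  # (10, 16): one encoding vector per position
```

Because the encodings are deterministic functions of position, the model can in principle generalize to sequence lengths not seen during training; many later models instead use learned position embeddings.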
Training techniques of
Transformer models
Masked language modeling: Masked language modeling is a technique used to train transformer
models to predict missing words in a sentence. This helps the model learn to attend to the most
relevant context words in a sentence.

Attention masking: Attention masking is a technique used to prevent the model from attending to
future words in a sentence. This is important because, at inference time, the model must predict
each word using only the words that come before it.

Gradient clipping: Gradient clipping is a technique used to prevent the gradients from becoming too
large ("exploding gradients"). This helps to stabilize the training process.
Masked Language Modeling
Masked language modeling is a technique used to train transformer models, like BERT, to predict the
missing words in a sentence. The idea is to randomly mask (replace) some of the words in the input
sentence with a special token, such as [MASK], and then train the model to predict what the original
words were. This helps the model to learn to attend to the most relevant words in a sentence, as it has
to figure out which words are missing and what they might be based on the context of the surrounding
words. This is useful for tasks like language understanding, where the model needs to be able to
understand and generate human-like text. Masked language modeling is a key component of the
pretraining process for transformer models, as it helps the model to learn the relationships between
words and how to generate text that is coherent and grammatically correct.
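The masking step itself can be sketched as follows. This is a simplification: BERT actually masks about 15% of tokens, and of those, some are kept unchanged or replaced with a random word rather than `[MASK]`. The `mask_tokens` helper below is illustrative, not a library function, and the demo uses a high masking rate so the effect is visible on a short sentence.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK], recording the
    original words as the prediction targets for training."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # remember what the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
# High mask_prob purely for demonstration on a 6-word sentence.
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(targets)
```

During training, the model's loss is computed only at the masked positions: it must reconstruct each hidden word from the surrounding context.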
Attention Masking
Attention masking is a technique used in transformer decoders, such as GPT-style models, to prevent
the model from attending to future words in a sentence during training. The idea is to mask out the
attention scores for any word that comes after the current word in the input sequence, so that the
softmax assigns those positions zero weight. This is important because at inference time the model
generates text left to right and must predict each word using only the words before it; if it could
peek at future words during training, the prediction task would become trivial and the model would
learn nothing useful. By masking the attention scores for future words, the model is forced to
attend only to the words that come before the current position. Attention masking (also called
causal masking) is a key component of training autoregressive transformer models, as it teaches
them to generate text that is coherent and grammatically correct one token at a time.
Attention Masking Example
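Concretely, a causal mask can be built by setting the scores for all future positions to negative infinity before the softmax, so they receive exactly zero attention weight. A minimal NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may attend only to positions <= i.
    Future positions get -inf, which softmax turns into zero weight."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(future == 1, -np.inf, 0.0)

scores = np.random.randn(4, 4)            # raw attention scores
masked_scores = scores + causal_mask(4)   # future positions -> -inf
weights = np.exp(masked_scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: no weight on the future
```

The first row of `weights` is always `[1, 0, 0, 0]`: the first word can attend only to itself, the second word to the first two positions, and so on.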
Gradient Clipping
Gradient clipping is a technique used in machine learning to prevent the gradients of the model's
parameters from becoming too large during training. Gradients are used to update the model's
parameters in the direction that reduces the loss function, which measures how well the model is
performing on the training data. If the gradients become too large (the "exploding gradient"
problem), the model's parameters change too much at each step, which can make training unstable or
cause the loss to diverge. Gradient clipping stabilizes the training process by setting a threshold
for the maximum allowed gradient norm (or value). If the gradient exceeds this threshold, it is
scaled down so that it doesn't become too large. Gradient clipping is a common technique in deep
learning and is particularly useful when training deep neural networks with many layers, where
gradients can compound as they propagate backward.
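Clipping by global norm, the same idea implemented by PyTorch's `torch.nn.utils.clip_grad_norm_`, can be sketched in plain NumPy as follows:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """If the combined (global) norm of all gradients exceeds max_norm,
    scale every gradient down so the global norm equals max_norm;
    otherwise leave the gradients unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]            # global norm = 5
clipped = clip_by_norm(grads, max_norm=1.0)
print(clipped[0])                          # [0.6 0.8], norm = 1
```

Scaling all gradients by the same factor preserves the update direction; only the step size is reduced, which is why norm-based clipping is usually preferred over clipping each value independently.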
Thank You!!!
