
Transformer:

▪ The transformer is based on the 2017 paper "Attention Is All
You Need".
▪ All models before the transformer could represent words as
vectors, but these vectors did not capture context, and the
meaning of a word changes with its context. For example, "bank"
in "riverbank" vs. "bank" in "bank robber" might have had the
same vector representation before the attention mechanism came
about.

• A transformer is an encoder-decoder model that uses the
attention mechanism.
➔ It takes advantage of parallelization and can also process a
large amount of data at the same time because of its
model architecture.

The transformer model consists of an encoder and a decoder. The
encoder encodes the input sequence and passes it to the decoder,
and the decoder decodes a representation for the relevant task.
• The encoding component is a stack of encoders.
• All encoders are identical in structure but have different
weights.

▪ Each encoder can be broken down into two sub-layers.
• The first sub-layer is self-attention. The encoder's input
first flows through the self-attention layer, which helps the
encoder look at other relevant words in the input as it
encodes each word.
• The second sub-layer is the feedforward layer. The output of
the self-attention layer is fed to a feedforward neural
network (FFN). The exact same FFN is applied independently
to each position.
▪ The decoder has both the self-attention and the feedforward
layers, but between them is an encoder-decoder attention layer
that helps the decoder focus on relevant parts of the input sentence.
▪ Each word in the input sequence is transformed into an
embedding vector.
▪ Each embedding vector flows through two layers of the
encoder:
1) Self-Attention layer
o At each position, the word’s embedding vector
undergoes a self-attention mechanism.
o Dependencies are present between different paths in
this layer since the attention mechanism compares each
word with others in the sequence.
2) Feedforward Neural Network
o After self-attention, each embedding vector passes
through the same feedforward neural network.
o The feedforward network processes each vector
independently, without dependencies between them.
This step ensures parallelism, improving efficiency.
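
A minimal sketch of one encoder layer with its two sub-layers, assuming
illustrative sizes and random weights (not values from the paper), and
omitting residual connections and layer normalization for brevity:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes and randomly initialised weights for the sketch.
d_model, d_ff, seq_len = 64, 256, 5
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def encoder_layer(x):
    # Sub-layer 1: self-attention. Every position attends to every other
    # position, so there are dependencies between positions here.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_model))
    attended = weights @ V
    # Sub-layer 2: the exact same FFN applied to each position independently.
    return np.maximum(0.0, attended @ W1) @ W2

x = rng.normal(size=(seq_len, d_model))   # one embedding vector per input word
print(encoder_layer(x).shape)             # (5, 64)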

In the self-attention layer, each input embedding is projected into Query
(Q), Key (K), and Value (V) vectors. These vectors are computed using
weight matrices that the transformer learns during the training process.

All of these computations happen in parallel in the form of matrix
computations, as sketched below.
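
A sketch of this projection step in isolation, with assumed (not actual)
dimensions, shows that one matrix multiplication per projection produces
Q, K, and V for every position at once:

import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 4, 64, 64         # assumed sizes for illustration

X = rng.normal(size=(seq_len, d_model))   # one row per input embedding
W_q = rng.normal(size=(d_model, d_k))     # weight matrices learned during training
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# A single matrix multiplication per projection yields Q, K, V for all
# positions at once -- the parallelism referred to above.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)          # (4, 64) (4, 64) (4, 64)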
The next step is to calculate softmax scores.
o The softmax function calculates a score for each word in
the sequence.
o These scores represent the importance or focus level placed on
each word relative to the others.
o The intuition behind this is to emphasize important words
while minimizing the influence of less relevant words.

The next step is to sum up the weighted value vectors, which produces the
output of the self-attention layer at this position. The resulting
vector is then sent to the feedforward neural network.
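
Putting the scoring and the weighted sum together, a sketch of the whole
attention computation (the scaling by the square root of the key dimension
follows the original paper; the sizes are again assumptions):

import numpy as np

rng = np.random.default_rng(1)
seq_len, d_k = 4, 64                        # assumed sizes, matching the sketch above
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)             # one score per pair of positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row

# Weighted sum of the value vectors: one output vector per position,
# each of which is then sent to the feedforward neural network.
output = weights @ V
print(output.shape)                         # (4, 64)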
Variations of the transformer:

• A popular encoder-only architecture is BERT (Bidirectional
Encoder Representations from Transformers), developed
by Google in 2018.
• BERT was trained in two variations: BERT Base, with a stack
of 12 transformer layers and approximately 110 million
parameters, and BERT Large, with 24 transformer layers and
about 340 million parameters.
• It was trained on the entire Wikipedia corpus and the Books
corpus.
• The BERT model was trained for one million steps.
• BERT was trained on two different tasks.
• BERT model was trained on two different tasks.
▪ Task 1: Masked Language Model (MLM): Tokens in the
input are masked and the model is trained to predict the
masked words.
▪ Task 2: Next Sentence Prediction (NSP): The
model is given pairs of sentences. BERT aims to
learn the relationship between sentences and
predict whether the second sentence follows the first.
• In order to train BERT, we need to feed three different kinds of
embeddings to the model: token, segment, and position
embeddings.

▪ Token embeddings: capture the meaning of each word or sub-word in
the input sequence.
▪ Segment embeddings: differentiate between the two input
segments.
▪ Position embeddings: encode the order of tokens in the
sequence, since transformers don't inherently capture
positional information.
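
As a rough sketch, BERT sums these three embeddings element-wise for each
token before feeding the result into its encoder stack; the vocabulary size,
sequence length, and hidden size below are toy assumptions, not BERT's real
configuration:

import numpy as np

rng = np.random.default_rng(2)
vocab_size, max_len, n_segments, d_model = 1000, 16, 2, 64   # toy sizes

token_emb    = rng.normal(size=(vocab_size, d_model))   # meaning of each token
segment_emb  = rng.normal(size=(n_segments, d_model))   # sentence A vs. sentence B
position_emb = rng.normal(size=(max_len, d_model))      # order of tokens

# A toy input: six token ids, the first three from segment A, the rest from B.
token_ids   = np.array([12, 45, 7, 300, 8, 2])
segment_ids = np.array([0, 0, 0, 1, 1, 1])
positions   = np.arange(len(token_ids))

# The input representation is the element-wise sum of the three embeddings.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)   # (6, 64) -- one combined embedding per input token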
