Transformer
▪ Each encoder can be broken down into two sub-layers.
• The first sub-layer is self-attention. The encoder's input first flows through a self-attention layer, which helps the encoder look at other relevant words in the input sentence as it encodes each word.
• The second sub-layer is a feedforward layer. The output of the self-attention layer is fed to a feedforward neural network (FFN). The exact same FFN is applied independently to each position.
▪ The decoder has both the self-attention and the feedforward layers, but between them sits an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
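To make the ordering of these sub-layers concrete, here is a minimal sketch in Python. The sub-layers are passed in as callables; `self_attn`, `cross_attn`, and `ffn` are hypothetical stand-ins for the actual attention and feedforward computations, not names from the source.

```python
def encoder_layer(x, self_attn, ffn):
    # Sub-layer 1: self-attention mixes information across the input positions.
    x = self_attn(x, context=x)
    # Sub-layer 2: the exact same FFN is applied independently at each position.
    return ffn(x)

def decoder_layer(y, enc_out, self_attn, cross_attn, ffn):
    # Sub-layer 1: self-attention over the decoder's own positions.
    y = self_attn(y, context=y)
    # Sub-layer 2: encoder-decoder attention, which lets the decoder focus on
    # relevant parts of the encoder's output (the input sentence).
    y = cross_attn(y, context=enc_out)
    # Sub-layer 3: position-wise feedforward network.
    return ffn(y)
```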
▪ Each word in the input sequence is transformed into an
embedding vector.
▪ Each embedding vector flows through two layers of the
encoder:
1) Self-Attention layer
o At each position, the word’s embedding vector undergoes the self-attention mechanism.
o Dependencies exist between the different paths in this layer, since the attention mechanism compares each word with the others in the sequence.
2) Feedforward Neural Network
o After self-attention, each vector passes through the same feedforward neural network.
o The feedforward network processes each vector independently, with no dependencies between positions. This step enables parallel processing, improving efficiency.
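As a sketch of why this enables parallelism, the snippet below (NumPy, with illustrative sizes chosen here rather than taken from the source) applies one shared feedforward network to a whole sequence at once and to each position separately, and checks that the results match, since no position depends on any other in this layer.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32            # illustrative sizes

# One set of FFN weights, shared by every position.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """Position-wise feedforward network: linear -> ReLU -> linear."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

attn_out = rng.normal(size=(seq_len, d_model))   # stand-in for self-attention output

# Applying the FFN to the whole sequence at once...
batched = ffn(attn_out)
# ...matches applying it to each position on its own.
per_position = np.stack([ffn(attn_out[i]) for i in range(seq_len)])
assert np.allclose(batched, per_position)
```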
The next step is to sum up the weighted value vectors, which produces the output of the self-attention layer at this position. The resulting vector is then sent to the feedforward neural network.
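The sketch below shows this weighted sum using scaled dot-product attention in NumPy; the query, key, and value vectors are random stand-ins for what learned projections of the embeddings would produce, and the sizes are illustrative rather than from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                          # illustrative sizes

# Query, key, and value vectors for each position (random stand-ins).
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Compare each word with every other word: scaled dot-product scores.
scores = Q @ K.T / np.sqrt(d_k)

# Softmax turns each row of scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Summing the weighted value vectors gives the self-attention output at each
# position; each output row is then sent to the feedforward network.
output = weights @ V
print(output.shape)                          # (4, 8)
```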
Summary:
Variations of transformer: