Notes 2 Transformer Model Architecture
The Transformer model, introduced in the paper "Attention is All You Need" by Vaswani et al. (2017),
has revolutionized the field of natural language processing (NLP) and has since found applications in
various other domains. Its innovative architecture, based solely on attention mechanisms, eliminates
the need for recurrence and convolutions, leading to improved performance and parallelization
capabilities. This report provides an in-depth analysis of the Transformer architecture.
1. Introduction:
Traditional sequence-to-sequence models, built on Recurrent Neural Networks (RNNs) and Long Short-
Term Memory (LSTM) networks, process their input sequentially, which makes them slow and difficult to
parallelize. The Transformer addresses these limitations by leveraging attention mechanisms, which
allow the model to weigh the importance of different parts of the input sequence when processing
each element. This parallel processing capability makes Transformers significantly faster and more
efficient, especially for long sequences.
2. Overall Architecture:
The Transformer architecture consists of two main components: the encoder and the decoder. Both
the encoder and decoder are composed of multiple identical layers.
2.1 Encoder:
The encoder takes the input sequence and produces a contextualized representation of each
word. Each encoder layer consists of two sub-layers:
• Multi-Head Attention: This sub-layer allows the model to attend to different parts of the
input sequence simultaneously. It takes three inputs: Query (Q), Key (K), and Value (V), which
are derived from the input embeddings through linear transformations. The attention
mechanism calculates a weighted sum of the values, where the weights are determined by
the similarity between the query and the keys. The "multi-head" aspect means this attention
calculation is performed multiple times with different learned linear transformations
(different Q, K, V matrices), and the results are concatenated. This allows the model to
capture different relationships within the sequence.
• Position-wise Feed-Forward Network: A fully connected network applied to each position
independently, adding non-linear processing capacity to every token's representation
(described in more detail later in these notes).
Both sub-layers are followed by a residual connection and layer normalization. The residual
connection adds the input of the sub-layer to its output, helping to mitigate the vanishing gradient
problem. Layer normalization normalizes the activations across the features, stabilizing training.
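Below is a minimal sketch of one encoder layer, assuming PyTorch (these notes name no framework)
and the post-layer-norm arrangement of the original paper; the hyperparameters d_model=512,
n_heads=8, and d_ff=2048 are the base values from Vaswani et al. (2017), used here only for
illustration.

    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # Sub-layer 1: multi-head self-attention, then residual connection + layer norm.
            attn_out, _ = self.self_attn(x, x, x)
            x = self.norm1(x + attn_out)
            # Sub-layer 2: position-wise feed-forward network, then residual connection + layer norm.
            x = self.norm2(x + self.ff(x))
            return x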
2.2 Decoder:
The decoder also consists of multiple identical layers, each with three sub-layers:
• Masked Multi-Head Attention: This sub-layer is similar to the multi-head attention in the
encoder, but it includes a mask to prevent the decoder from attending to future tokens. This
is crucial for autoregressive decoding, where the model generates the output sequence one
token at a time. The mask ensures that the model only attends to the tokens that have
already been generated (a small mask sketch follows this subsection).
• Multi-Head Attention (Encoder-Decoder Attention): This sub-layer performs attention over
the output of the encoder. The queries come from the previous decoder layer, while the keys
and values come from the encoder output. This allows the decoder to attend to the input
sequence and gather information relevant to generating the output.
• Position-wise Feed-Forward Network: As in the encoder, a fully connected network applied
to each position independently.
Similar to the encoder, each sub-layer in the decoder is followed by a residual connection and layer
normalization.
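To make the masking in the first decoder sub-layer concrete, here is a small sketch (again assuming
PyTorch) of a causal mask in which True marks future positions a token is not allowed to attend to.

    import torch

    def causal_mask(seq_len: int) -> torch.Tensor:
        # Entries above the diagonal mark "future" positions; True = attention not allowed.
        return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

    print(causal_mask(4))
    # tensor([[False,  True,  True,  True],
    #         [False, False,  True,  True],
    #         [False, False, False,  True],
    #         [False, False, False, False]])

Positions marked True receive a score of -inf before the softmax, so each token can only attend to
itself and the tokens generated before it.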
3. Attention Mechanism:
The core of the Transformer is the attention mechanism. It calculates the relevance of each word in
the input sequence to every other word (or to the word currently being processed in the decoder).
The most commonly used form is scaled dot-product attention:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
where:
• Q: Query matrix
• K: Key matrix
• V: Value matrix
• dₖ: dimensionality of the keys (and queries)
The dot product of the query and key matrices measures the similarity between the corresponding
words. The scaling by √dₖ prevents the dot products from becoming too large, which can lead to
unstable softmax probabilities. The softmax function normalizes the scores into a probability
distribution, and the resulting weights are used to compute a weighted sum of the value matrix.
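As a sketch, the formula above translates almost line by line into code (PyTorch assumed); the
optional mask argument anticipates the masked attention used in the decoder.

    import math
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # similarity of each query to each key
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))   # block masked (e.g. future) positions
        weights = F.softmax(scores, dim=-1)                    # normalize scores into a distribution
        return weights @ V                                     # weighted sum of the values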
Multi-head attention allows the model to attend to different aspects of the input sequence. Instead
of using a single set of Q, K, and V matrices, the model uses multiple sets. Each set learns different
linear transformations of the input embeddings. The outputs of all attention heads are concatenated
and linearly transformed to produce the final output.
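Building on the function above, a sketch of multi-head attention might look as follows; it assumes
d_model is divisible by the number of heads and reuses scaled_dot_product_attention from the
previous example.

    import torch
    import torch.nn as nn

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_k = n_heads, d_model // n_heads
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)   # final linear layer after concatenation

        def forward(self, q, k, v, mask=None):
            B, T, _ = q.shape
            # Project and reshape to (batch, heads, seq_len, d_k): one Q/K/V set per head.
            split = lambda x: x.view(B, -1, self.n_heads, self.d_k).transpose(1, 2)
            Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
            heads = scaled_dot_product_attention(Q, K, V, mask)
            # Concatenate the heads back to (batch, seq_len, d_model) and mix them with w_o.
            out = heads.transpose(1, 2).contiguous().view(B, T, -1)
            return self.w_o(out)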
Since the Transformer does not use recurrence, it needs another way to encode the positional
information of the words in the sequence. This is done by adding position embeddings to the input
embeddings. These embeddings can either be learned during training or computed from fixed
functions; several methods exist, including the sinusoidal encoding described in the next section.
The position-wise feed-forward network applies the same transformation to each position in the
sequence independently. This adds non-linear processing capacity to each position's representation;
interactions between positions are handled by the attention sub-layers.
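As a small sketch, this sub-layer is just two linear layers with a ReLU between them, applied to every
position; 512 and 2048 are the d_model and d_ff values of the original base model, used here only as
an example.

    import torch.nn as nn

    # The same two linear transformations are applied independently to every position's vector.
    feed_forward = nn.Sequential(
        nn.Linear(512, 2048),   # d_model -> d_ff
        nn.ReLU(),
        nn.Linear(2048, 512),   # d_ff -> d_model
    )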
4. Positional Encoding:
As mentioned, positional encoding is crucial for providing the Transformer with information about
the order of words in a sequence. A common approach utilizes sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where:
• pos: position of the word in the sequence
• i: dimension index (each pair of dimensions 2i, 2i+1 shares one frequency)
• d_model: dimensionality of the embeddings
These sinusoidal functions provide a unique representation for each position, allowing the model to
distinguish between different word orders.
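A short sketch of how these encodings can be computed (PyTorch assumed; d_model is taken to be
even for simplicity); the result is added to the input embeddings before the first layer.

    import torch

    def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
        angle = pos / (10000 ** (i / d_model))                          # pos / 10000^(2i/d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angle)                                  # even dimensions: sine
        pe[:, 1::2] = torch.cos(angle)                                  # odd dimensions: cosine
        return pe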
5. Decoding Process:
The decoder generates the output sequence autoregressively, one token at a time. At each step, the
decoder attends to the previously generated tokens (using the masked multi-head attention) and the
encoder output (using the second multi-head attention) to predict the next token. This process
continues until the end-of-sequence token is generated.
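The loop below sketches this autoregressive process using greedy decoding. The names model, bos_id,
and eos_id, and the assumption that the model maps (source, partial target) to per-position vocabulary
logits, are hypothetical conveniences for illustration, not something defined in these notes.

    import torch

    def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
        tgt = torch.tensor([[bos_id]])                       # start with the beginning-of-sequence token
        for _ in range(max_len):
            logits = model(src_ids, tgt)                     # decoder only sees tokens generated so far
            next_id = logits[0, -1].argmax().item()          # greedily pick the most likely next token
            tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
            if next_id == eos_id:                            # stop at the end-of-sequence token
                break
        return tgt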
6. Advantages of Transformers:
• Parallelization: Transformers can process the entire input sequence in parallel, making them
much faster than recurrent models.
7. Limitations of Transformers:
• Interpretability: While attention weights provide some insight into the model's behavior,
Transformers can still be difficult to interpret fully.
8. Conclusion:
The Transformer model has become a cornerstone of modern NLP. Its innovative architecture, based
on attention mechanisms, has enabled significant progress in various tasks, from machine translation
to text summarization. While challenges remain, ongoing research is addressing these limitations and
further improving the performance and efficiency of Transformers. The principles of attention and
parallel processing introduced by the Transformer have also influenced architectures beyond NLP,
demonstrating its profound impact on the field of artificial intelligence.