NLP Week 8: Transformers
NLP Week 8
2. (Multi-head) self-attention
3. Residual stream
4. Position embeddings
7. Group exercises
This semester
We will build language models, adding to their complexity layer by layer:
• Vaswani et al. 2017 - Attention Is All You Need
• Fleuret 2024 - The Little Book of Deep Learning
• Alammar 2018 - The Illustrated Transformer
We will stick to the following illustration
The Transformer
Zooming in
Intuition of attention
• Build up the representation of a word by selectively integrating information from all the neighbouring words
• We say that a word "attends to" some neighbouring words more than others
Attention definition
A mechanism for helping compute the embedding for a token by selectively attending to and integrating information from surrounding tokens (at the previous layer).
$$v^{k+1} = \sum_{i=1}^{n} \alpha_i \cdot v_i^{k}$$
Attention can respect time (causal)
[Figure: outputs a_1 … a_5 computed from inputs x_1 … x_5]
$$a_j = \sum_{i=1}^{n} \big(\alpha_i \cdot \mathbb{M}(i, j)\big) \cdot x_i \qquad \mathbb{M}(i, j) = \begin{cases} 1 & \text{if } i \le j \\ 0 & \text{else} \end{cases}$$
Simplified version of attention: a sum of prior words weighted by their similarity with the current word
[Figure: a_i computed as a weighted sum over x_1 … x_i]
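A minimal numpy sketch of this simplified attention (no learned parameters; the function name and toy sizes are my own), assuming dot-product similarity followed by a softmax over the current and prior words:

```python
import numpy as np

def simplified_attention(X):
    """X: (n, d) word vectors. Returns (n, d) outputs, where a_i is a
    weighted sum of x_1..x_i, weighted by dot-product similarity with x_i."""
    A = np.zeros_like(X)
    for i in range(X.shape[0]):
        scores = X[: i + 1] @ X[i]            # similarity of x_i with each prior word (and itself)
        alphas = np.exp(scores - scores.max())
        alphas /= alphas.sum()                # softmax -> weights that sum to 1
        A[i] = alphas @ X[: i + 1]            # weighted sum of the word vectors
    return A

X = np.random.randn(5, 8)                     # 5 toy "words", dimension 8
print(simplified_attention(X).shape)          # (5, 8)
```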
An Actual Attention Head is slightly more complicated
[Figure: each input x_1 … x_i is mapped to a query, a key, and a value vector]
$$q_i = x_i W^Q \qquad k_i = x_i W^K \qquad v_i = x_i W^V$$
Note: $x_i, q_i, k_i, v_i$ are row vectors here
$$\text{score}(x_i, x_j) = \frac{q_i k_j^\top}{\sqrt{d_k}} \qquad \alpha = \text{softmax}\big([\text{score}(x_i, x_j) \;\forall j \le i]\big) \qquad a_i = \sum_{j \le i} \alpha_j \cdot v_j$$
Example: calculating the value of a3
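A hedged sketch of one attention head in numpy, computing a single $a_i$ with learned projections $W^Q, W^K, W^V$ (the toy sizes and variable names are assumptions, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                       # toy sizes
X = rng.standard_normal((n, d))           # x_1..x_n as row vectors
W_Q = rng.standard_normal((d, d_k))       # learned projections W^Q, W^K, W^V
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_k))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

i = 2                                     # compute a_3 (0-indexed position 2)
q_i = X[i] @ W_Q                          # query for x_i
K = X[: i + 1] @ W_K                      # keys for x_1..x_i
V = X[: i + 1] @ W_V                      # values for x_1..x_i
alpha = softmax(q_i @ K.T / np.sqrt(d_k)) # scaled dot-product scores, softmaxed
a_i = alpha @ V                           # a_i = sum over j <= i of alpha_j * v_j
print(a_i.shape)                          # (4,) = d_k
```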
An Actual Attention Head is slightly more complicated
• Instead of one attention head, we'll have lots of them!
• Intuition: each head might be attending to the context for different purposes
• E.g., different linguistic relationships or patterns in the context
$$\text{head}_i^c = \sum_{j \le i} \alpha_{i,j}^c \cdot v_j^c \qquad a_i = (\text{head}_1 \oplus \text{head}_2 \oplus \dots \oplus \text{head}_h)\, W^O$$
$$\text{score}(x_i, x_j) = \frac{q_i k_j^\top}{\sqrt{d_k}}$$
Parallelizing attention
$$S = \frac{QK^\top}{\sqrt{d_k}} \qquad A = \text{softmax}(S)\, V$$
Masking out the future
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V$$
Another point: Attention is quadratic in length
The $QK^\top$ product is an $n \times n$ matrix of scores, so compute and memory grow quadratically with the sequence length $n$.
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V$$
Attention again
Parallelizing Multi-head Attention
This is equivalent to running the attention heads in parallel and adding their results back to the residual stream
(Whiteboard)
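A sketch of the parallelized, masked multi-head computation in numpy (reshaping into heads and the $-10^9$ masking constant are implementation choices of mine, not prescribed by the slides):

```python
import numpy as np

def causal_multihead_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d). W_Q/W_K/W_V/W_O: (d, d). h: number of heads.
    Computes A = softmax(mask(Q K^T / sqrt(d_k))) V for every head at once,
    concatenates the heads, and projects back with W^O so the result can be
    added to the residual stream."""
    n, d = X.shape
    d_k = d // h
    # project, then split the last dimension into h heads: (h, n, d_k)
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n) score matrices
    mask = np.triu(np.ones((n, n), dtype=bool), k=1) # True above the diagonal = future
    S = np.where(mask, -1e9, S)                      # masking out the future
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax
    heads = A @ V                                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)  # head_1 ⊕ ... ⊕ head_h
    return concat @ W_O                              # back to width d

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
print(causal_multihead_attention(X, *W, h=h).shape)  # (6, 16)
```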
Reminder: transformer architecture
A single transformer block
Sublayers of the transformer block: Layer Norm
LayerNorm(xi) = …
Layer Norm
Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer
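A small numpy sketch of this z-score view of layer norm, assuming the usual learned gain $\gamma$ and bias $\beta$ (the exact parametrisation on the slide may differ):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Z-score the components of a single vector x, then rescale and shift
    with the learned gain gamma and bias beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
```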
Sublayers of the transformer block: FFN
FFN(xi) = ReLU(xiW1 + b1)W2 + b2
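The same FFN written out in numpy; the toy dimensions are assumptions (the inner width d_ff is typically several times larger than d):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x_i) = ReLU(x_i W1 + b1) W2 + b2, applied to each token vector independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32                      # toy sizes
x = rng.standard_normal(d)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
print(ffn(x, W1, b1, W2, b2).shape)  # (8,)
```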
Putting together a single transformer block
A transformer is a stack of these blocks, so all the vectors are of the same dimensionality $d$
[Figure: Block 1 and Block 2 stacked]
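A compact sketch of one block and a stack of two, assuming a pre-norm arrangement of the sublayers (implementations differ on where the layer norms sit); all names and sizes here are illustrative:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalise each row (token vector) to zero mean, unit variance."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head masked attention; W_V maps back to width d so the output
    can be added to the residual stream."""
    n, d_k = X.shape[0], W_K.shape[1]
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    S = np.where(np.triu(np.ones((n, n), dtype=bool), 1), -1e9, S)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ (X @ W_V)

def transformer_block(X, params):
    """One block: attention sublayer + FFN sublayer, each with a residual connection."""
    W_Q, W_K, W_V, W1, b1, W2, b2 = params
    X = X + causal_attention(layer_norm(X), W_Q, W_K, W_V)    # attention + residual
    X = X + np.maximum(0, layer_norm(X) @ W1 + b1) @ W2 + b2  # FFN + residual
    return X                                                  # same shape (n, d), so blocks stack

rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 32
X = rng.standard_normal((n, d))
params = (rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1,
          rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d_ff)) * 0.1,
          np.zeros(d_ff), rng.standard_normal((d_ff, d)) * 0.1, np.zeros(d))
print(transformer_block(transformer_block(X, params), params).shape)  # two stacked blocks: (6, 8)
```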
Residual streams and attention
• Notice that all parts of the transformer block apply to one residual stream, except attention, which takes information from other tokens
• Elhage et al. (2021) show that we can view attention heads as literally moving information from the residual stream of a neighboring token into the current stream
Unembedding
The unembedding layer is a linear layer that projects from $h_N^L$ (shape $1 \times d$) to a logit vector. The unembedding layer is $E^\top$ (shape $d \times |\mathcal{V}|$); softmax turns the logits into probabilities over the vocabulary (shape $1 \times |\mathcal{V}|$).
$$u = h_N^L E^\top \qquad y = \text{softmax}(u)$$
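A numpy sketch of the unembedding step with weight tying (reusing the embedding matrix $E$ as $E^\top$); the toy vocabulary size and dimension are assumptions:

```python
import numpy as np

def unembed(h, E):
    """h: (1, d) final hidden state for a token; E: (|V|, d) embedding matrix.
    Logits u = h E^T have shape (1, |V|); softmax turns them into a
    distribution over the vocabulary."""
    u = h @ E.T
    y = np.exp(u - u.max())
    return y / y.sum()

rng = np.random.default_rng(0)
vocab_size, d = 10, 8                          # toy sizes
E = rng.standard_normal((vocab_size, d))       # token embedding matrix
h_N = rng.standard_normal((1, d))              # h_N^L: output of the last layer
y = unembed(h_N, E)
print(y.shape, y.sum())                        # (1, 10), probabilities sum to 1
```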
The final transformer model
∼ GPT
LM loss
The LM head takes the output of the final transformer layer $L$, multiplies it by the unembedding layer, and turns it into probabilities:
$$u_i = h_i^L E^\top \qquad y_i = \text{softmax}(u_i)$$
The loss is the negative log probability of the next word, given output $h_i^L$:
$$\mathcal{L}_{LM}(x_i) = -\log P(x_{i+1} \mid h_i^L) = -\log \hat{y}_i[x_{i+1}]$$
We get the gradients by taking the average of this loss over the batch:
$$\mathcal{L}_{LM} = -\frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \frac{1}{|s|} \sum_{i \in s} \log P(x_{i+1} \mid h_i^L)$$
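A sketch of this loss for a single sequence in numpy (averaging over the sequences of a batch would simply wrap this in an outer mean); names and toy sizes are mine:

```python
import numpy as np

def lm_loss(H, E, token_ids):
    """H: (n, d) outputs h_i^L for one sequence; E: (|V|, d) embedding matrix;
    token_ids: (n,) token indices x_1..x_n.
    Loss at position i is -log P(x_{i+1} | h_i^L); average over positions."""
    U = H @ E.T                                        # logits, (n, |V|)
    U = U - U.max(axis=-1, keepdims=True)
    Y = np.exp(U) / np.exp(U).sum(axis=-1, keepdims=True)
    next_ids = token_ids[1:]                           # predict the *next* token
    p_next = Y[np.arange(len(next_ids)), next_ids]     # P(x_{i+1} | h_i^L)
    return -np.log(p_next).mean()

rng = np.random.default_rng(0)
n, d, vocab = 6, 8, 10
H = rng.standard_normal((n, d))
E = rng.standard_normal((vocab, d))
ids = rng.integers(0, vocab, size=n)
print(lm_loss(H, E, ids))
```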
Language Models are Unsupervised Multitask Learners
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V \qquad A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Masked training intuition
• For left-to-right (causal; decoder-only) LMs, the model tries to predict the last word from the prior words:
  The water of Walden Pond is so beautifully
• And we train it to improve its predictions.
• For bidirectional masked LMs, the model tries to predict one or more missing words from all the rest of the words:
  The ____ of Walden Pond ____ so beautifully blue
• The model generates a probability distribution over the vocabulary for each missing token
• We use the cross-entropy loss from each of the model's predictions to drive the learning process.
Bidirectional Transformer ∼ BERT
MLM training in BERT
15% of the tokens are randomly chosen to be masked.
Example: "Lunch was delicious", if delicious was randomly chosen, there are three possibilities:
1. 80%: the token is replaced with the special token [MASK]
   Lunch was delicious -> Lunch was [MASK]
2. 10%: the token is replaced with a random token (sampled from the unigram probabilities)
   Lunch was delicious -> Lunch was gasp
3. 10%: the token is unchanged
   Lunch was delicious -> Lunch was delicious
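A toy sketch of the 80/10/10 corruption rule; here the random replacement is drawn uniformly from a small word list, whereas BERT samples from the unigram distribution:

```python
import random

def bert_mask(tokens, vocab, mask_rate=0.15, mask_token="[MASK]"):
    """Pick ~mask_rate of the positions as prediction targets; for each chosen
    position: 80% -> [MASK], 10% -> random token, 10% -> keep the original."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok                      # loss is computed only at these positions
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the token unchanged
    return corrupted, targets

vocab = ["lunch", "was", "delicious", "gasp", "blue", "pond"]
# high mask_rate so this tiny example usually corrupts something
print(bert_mask(["Lunch", "was", "delicious"], vocab, mask_rate=0.5))
```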
MLM loss
The LM head takes the output of the final transformer layer $L$, multiplies it by the unembedding layer, and turns it into probabilities:
$$u_i = h_i^L E^\top \qquad y_i = \text{softmax}(u_i)$$
E.g., for the $x_i$ corresponding to "long", the loss is the negative log probability of the correct word long, given output $h_i^L$:
$$\mathcal{L}_{MLM}(x_i) = -\log P(x_i \mid h_i^L)$$
We get the gradients by taking the average of this loss over the batch:
$$\mathcal{L}_{MLM} = -\frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \frac{1}{|\mathcal{M}_s|} \sum_{i \in \mathcal{M}_s} \log P(x_i \mid h_i^L)$$
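A numpy sketch of the MLM loss for one sequence, averaging only over the masked positions $\mathcal{M}_s$; names and toy sizes are assumptions:

```python
import numpy as np

def mlm_loss(H, E, target_ids, masked_positions):
    """H: (n, d) outputs h_i^L; E: (|V|, d) embeddings; target_ids: (n,) the
    original tokens; masked_positions: indices in the set M_s.
    Loss = mean over masked positions of -log P(x_i | h_i^L)."""
    U = H @ E.T
    U = U - U.max(axis=-1, keepdims=True)
    Y = np.exp(U) / np.exp(U).sum(axis=-1, keepdims=True)
    p_correct = Y[masked_positions, target_ids[masked_positions]]
    return -np.log(p_correct).mean()

rng = np.random.default_rng(0)
n, d, vocab = 6, 8, 10
H = rng.standard_normal((n, d))
E = rng.standard_normal((vocab, d))
ids = rng.integers(0, vocab, size=n)
print(mlm_loss(H, E, ids, masked_positions=np.array([1, 4])))
```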
Bidirectional Encoder Representations from Transformers
Team up!