
The Transformer

NLP Week 8

Thanks to Dan Jurafsky for most of the slides this week!


Plan for today
1. Review of notation and matrix multiplication

2. (Multi-head) self-attention

3. Residual stream

4. Position embeddings

5. Putting it all together → The Transformer

6. GPT and BERT

7. Group exercises
This semester
We will build language models, adding one layer of complexity at a time:

1. Bag of words models (basic statistical models of language)

2. N-gram models (+ sequential dependencies)

3. Hidden Markov models (+ latent categories)

4. Recurrent neural networks (+ distributed representations)

5. LSTM language models (+ long distance dependencies)

6. Transformer language models (+ attention-based dependency learning)

= Today’s language models!


A note on notation
Quick recap on our notation and matrix-matrix and matrix-vector multiplication

• Let A denote a p × d matrix
• Let X denote a d × n matrix
• Let x denote a d × 1 vector (column vector)
• x⊤ is a 1 × d vector (row vector)
• Note that: (AX)⊤ = X⊤A⊤

We need to understand:
• y = Ax
• y⊤ = x⊤A⊤
• Y = AX
• Y⊤ = X⊤A⊤

👉 will do this on the whiteboard …
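If you also want to check these identities numerically, here is a minimal NumPy sketch (the shapes p, d, n are arbitrary example values):

import numpy as np

p, d, n = 3, 4, 5
A = np.random.randn(p, d)   # p x d matrix
X = np.random.randn(d, n)   # d x n matrix
x = np.random.randn(d, 1)   # d x 1 column vector

y = A @ x                   # (p x d)(d x 1) -> p x 1
Y = A @ X                   # (p x d)(d x n) -> p x n

assert np.allclose(y.T, x.T @ A.T)    # y^T = x^T A^T
assert np.allclose(Y.T, X.T @ A.T)    # (AX)^T = X^T A^T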


The paper that started it all

What is a Transformer?
Transformer: a specific kind of network architecture, like a fancier feedforward network, but based on attention.

• Vaswani et al. 2017 - Attention Is All You Need
• Fleuret 2024 - The Little Book of Deep Learning
• Alammar 2018 - The Illustrated Transformer

We will stick to the following illustration
The Transformer
Zooming in
Intuition of attention
• Build up the representation of a word by selectively
integrating information from all the neighbouring words
• We say that a word "attends to" some neighbouring words
more than others
Intuition of attention
Attention definition
A mechanism for helping compute the embedding for a token
by selectively attending to and integrating information from
surrounding tokens (at the previous layer).

More formally: a method for doing a weighted sum of vectors.

v^{k+1} = Σ_{i=1}^{n} αi · vi^{k}
Attention can respect time (causal)

[Figure: a self-attention layer mapping inputs x1 … x5 to outputs a1 … a5, where each ai attends only to x1 … xi]

aj = Σ_{i=1}^{n} αi · 𝕄(i, j) · xi,   where 𝕄(i, j) = 1 if i ≤ j, 0 else
Simplified version of attention: a sum of prior words
weighted by their similarity with the current word

Given a sequence of token embeddings:

x1 x2 x3 x4 x5 x6 x7 xi

Produce: ai = a weighted sum of x1 through x7 (and xi),
weighted by their similarity to xi:

score(xi, xj) = xi · xj

α = softmax([score(xi, xj) for j in 1…7, i])

ai = Σ_{j≤i} αj · xj = (Σ_{j=1}^{7} αj · xj) + αi · xi
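A minimal NumPy sketch of this simplified attention (the token embeddings here are random placeholders, and the function works for any number of tokens, not just 7):

import numpy as np

def simplified_attention(X):
    """X: (n, d) token embeddings; returns the outputs a_1 … a_n as rows."""
    n, d = X.shape
    A = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]               # score(x_i, x_j) = x_i . x_j for j <= i
        alphas = np.exp(scores - scores.max())   # softmax over the prefix
        alphas /= alphas.sum()
        A[i] = alphas @ X[: i + 1]               # a_i = weighted sum of x_1 … x_i
    return A

X = np.random.randn(8, 16)                       # 8 tokens, embedding dimension 16
print(simplified_attention(X).shape)             # (8, 16)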
Intuition of attention

[Figure: ai computed as a weighted sum over x1 … x7 and xi]
An Actual Attention Head is slightly more complicated

High-level idea: instead of using vectors (like xi and x4) directly,


we'll represent 3 separate roles each vector xi plays:
• query: As the current element being compared to the
preceding inputs.
• key: as a preceding input that is being compared to the
current element to determine a similarity
• value: a value of a preceding element that gets weighted
and summed
Intuition of attention

[Figure: xi is the query; each of x1 … x7 (and xi) provides a key k and a value v]
An Actual Attention Head is slightly more complicated

We'll use matrices to project each vector xi into a


representation of its role as query, key, value:
• query: W^Q
• key: W^K
• value: W^V

qi = xi W^Q        ki = xi W^K        vi = xi W^V
Note: xi, qi, ki, vi are row vectors here
An Actual Attention Head is slightly more complicated

Given these 3 representations of xi:

qi = xi W^Q        ki = xi W^K        vi = xi W^V

To compute the similarity of the current element xi with some prior element xj,
we use the dot product between qi and kj.
And instead of summing up the xj, we sum up the vj.
Final equations for one attention head

qi = xi W^Q        ki = xi W^K        vi = xi W^V

score(xi, xj) = (qi · kj^⊤) / √dk

α = softmax([score(xi, xj) ∀ j ≤ i])

ai = Σ_{j≤i} αj · vj
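Putting these equations into code, a single (causal) attention head can be sketched as follows; the projection matrices W^Q, W^K, W^V are randomly initialised stand-ins for learned parameters:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_head(X, WQ, WK, WV):
    """X: (n, d); WQ, WK: (d, d_k); WV: (d, d_v). Returns one output row per token."""
    n, d_k = X.shape[0], WQ.shape[1]
    Q, K, V = X @ WQ, X @ WK, X @ WV
    out = np.zeros((n, WV.shape[1]))
    for i in range(n):
        scores = Q[i] @ K[: i + 1].T / np.sqrt(d_k)  # score(x_i, x_j) for j <= i
        alphas = softmax(scores)                     # attention weights over the prefix
        out[i] = alphas @ V[: i + 1]                 # a_i = sum_j alpha_j v_j
    return out

d, d_k = 16, 8
X = np.random.randn(5, d)
WQ, WK, WV = (np.random.randn(d, d_k) for _ in range(3))
print(attention_head(X, WQ, WK, WV).shape)           # (5, 8)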
Example: calculating the value of a3
An Actual Attention Head is slightly more complicated
• Instead of one attention head, we'll have lots of them!
• Intuition: each head might be attending to the context for different
purposes
• E.g., different linguistic relationships or patterns in the context

qi^c = xi W^{Qc}        ki^c = xi W^{Kc}        vi^c = xi W^{Vc}

score^c(xi, xj) = (qi^c · kj^{c⊤}) / √dk

α_{i,j}^c = softmax([score^c(xi, xj) ∀ j ≤ i])

head_i^c = Σ_{j≤i} α_{i,j}^c · vj^c

ai = (head^1 ⊕ head^2 ⊕ … ⊕ head^h) W^O

MultiHeadAttention(xi, [x1, …, xn]) = ai
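A minimal multi-head sketch, reusing the attention_head function from the sketch above; each head has its own projections, and the concatenated head outputs are projected back to d dimensions by W^O (all weights are random stand-ins):

def multi_head_attention(X, heads, WO):
    """heads: list of (WQ, WK, WV) tuples, one per head; WO: (h * d_v, d)."""
    head_outputs = [attention_head(X, WQ, WK, WV) for (WQ, WK, WV) in heads]
    return np.concatenate(head_outputs, axis=-1) @ WO   # concatenate, then project

h = 2
heads = [tuple(np.random.randn(d, d_k) for _ in range(3)) for _ in range(h)]
WO = np.random.randn(h * d_k, d)
print(multi_head_attention(X, heads, WO).shape)         # (5, 16)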
Multi-head attention
Parallelizing computation using X
So far, for the attention/transformer block, we've been computing a
single output at a single time step i in a single residual
stream.
But we can pack the N tokens of the input sequence into a
single matrix X of size [N × d].
Each row of X is the embedding of one token of the input.
X can have 1K-32K rows, each of the dimensionality of the
embedding d (the model dimension)

Q = X W^Q        K = X W^K        V = X W^V

Instead of computing score(xi, xj) = (qi · kj^⊤) / √dk one pair at a time, we can now do a single matrix multiply to combine Q and K^⊤:

S = QK^⊤ / √dk
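In code, the batched projections and the full score matrix are just a few matrix multiplies; a minimal sketch (random X and weights):

import numpy as np

N, d, d_k = 6, 16, 8
X = np.random.randn(N, d)                # one row per input token
WQ, WK, WV = (np.random.randn(d, d_k) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV         # each of shape (N, d_k)
S = Q @ K.T / np.sqrt(d_k)               # (N, N): S[i, j] = score(x_i, x_j)
print(S.shape)                           # (6, 6)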
Parallelizing attention

• Scale the scores, take the softmax, and then multiply the result by V, resulting in a matrix of shape N × d
• An attention vector for each input token

A = softmax(𝕄(QK^⊤ / √dk)) V
Masking out the future

• What is this mask function 𝕄?
• QK^⊤ has a score for each query dotted with every key, including the keys of tokens that follow the query.
• Guessing the next word is pretty simple if you already know it!

A = softmax(𝕄(QK^⊤ / √dk)) V
Masking out the future

• Add −∞ to the cells in the upper triangle
• The softmax will turn them to 0

A = softmax(𝕄(QK^⊤ / √dk)) V
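A sketch of masked (causal) attention in matrix form, adding −∞ above the diagonal before the row-wise softmax:

import numpy as np

def causal_attention(Q, K, V):
    N, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)                           # raw scores, (N, N)
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # strictly upper triangle
    S = np.where(future, -np.inf, S)                     # future positions get -inf
    A = np.exp(S - S.max(axis=-1, keepdims=True))        # row-wise softmax ...
    A = A / A.sum(axis=-1, keepdims=True)                # ... turns the -inf cells into 0
    return A @ V

Q = K = V = np.random.randn(5, 8)
print(causal_attention(Q, K, V).shape)                   # (5, 8)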
Another point: attention is quadratic in length

The score matrix QK^⊤ has shape N × N, so compute and memory grow quadratically with the context length N.

A = softmax(𝕄(QK^⊤ / √dk)) V
Attention again
Parallelizing Multi-head Attention

This is equivalent to running the attention heads in parallel and adding their results back to the residual stream
(Whiteboard)
Reminder: transformer architecture
A single transformer block
Sublayers of the transformer block: Layer Norm
LayerNorm(xi) = …
Layer Norm
Layer norm is a variation of the z-score from statistics, applied to a single
vector in a hidden layer
Sublayers of the transformer block: FFN
FFN(xi) = ReLU(xiW1 + b1)W2 + b2
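A sketch of these two sublayers in NumPy; the layer-norm gain/bias (γ, β) and the FFN weights are random placeholders, and the hidden width 4d follows common practice rather than anything stated on the slide:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # z-score each vector, then rescale and shift with learned gamma, beta
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d, d_ff = 16, 64                          # d_ff = 4d is an assumption
x = np.random.randn(5, d)
gamma, beta = np.ones(d), np.zeros(d)
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)
print(ffn(layer_norm(x, gamma, beta), W1, b1, W2, b2).shape)   # (5, 16)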
Putting together a single transformer block
A transformer is a stack of these blocks,
so all the vectors are of the same dimensionality d

[Figure: Block 1 feeding into Block 2]
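Combining the pieces, one block can be sketched as below, reusing multi_head_attention, layer_norm, and ffn from the sketches above; the pre-norm ordering shown here is one common convention and is an assumption, since the slide's diagram is not reproduced in this text:

def transformer_block(X, heads, WO, ln1, ln2, ffn_params):
    """One block: each sublayer reads from the residual stream and adds its output back."""
    X = X + multi_head_attention(layer_norm(X, *ln1), heads, WO)   # attention sublayer + residual
    X = X + ffn(layer_norm(X, *ln2), *ffn_params)                  # feedforward sublayer + residual
    return X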
Residual streams and attention
• Notice that all parts of the transformer block apply to 1 residual stream
except attention, which takes information from other tokens
• Elhage et al. (2021) show that we can view attention heads as literally
moving information from the residual stream of a neighboring token into
the current stream

Elhage et al. (2021) - A Mathematical Framework for Transformer Circuits


Residual stream view

• FFN and attention layers read from and write to the residual stream
• FFN layers have access to "one lane" only; the same computation is applied on every "lane"
• Attention layers can read from other "lanes" too
• xi is transformed into hi^L through a sequence of non-linear transformations
Putting together a single transformer block
[Figure: the block equations shown for a single vector and for a matrix of inputs]
Residual stream view

Elhage et al. (2021) - A Mathematical Framework for Transformer Circuits


Reminder: transformer architecture
Token and Position Embeddings

The matrix X (of shape N × d) has an embedding for each word in the context.
This embedding is created by adding two distinct embeddings for each input:
• token embedding
• positional embedding
Token Embeddings

The embedding matrix E has shape |𝒱| × d.
• One row for each of the |𝒱| tokens in the vocabulary.
• Each word is a row vector of d dimensions.

Given the string "Thanks for all the":

1. Tokenize with BPE and convert into vocab indices: input_ids = [5, 4000, 10532, 2224]
2. Select the corresponding rows from E, each row an embedding (row 5, row 4000, row 10532, row 2224).
Position Embeddings
There are many methods, but we'll just describe the simplest:
absolute position.
Goal: learn a position embedding matrix Epos of shape N × d.
Start with randomly initialized embeddings
• one for each integer up to some maximum length.
• i.e., just as we have an embedding for token fish, we’ll have
an embedding for position 3 and position 17.
• As with word embeddings, these position embeddings are
learned along with other parameters during training.
Each xi is just the sum of word and position embeddings
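A sketch of building the input matrix X from token ids (the embedding matrices are randomly initialised stand-ins; the ids match the "Thanks for all the" example above):

import numpy as np

vocab_size, max_len, d = 50_000, 1024, 16   # |V| and N are made-up example sizes
E = np.random.randn(vocab_size, d)          # token embedding matrix, |V| x d
E_pos = np.random.randn(max_len, d)         # position embedding matrix, N x d

input_ids = [5, 4000, 10532, 2224]          # "Thanks for all the" after BPE
X = E[input_ids] + E_pos[: len(input_ids)]  # x_i = token embedding + position embedding
print(X.shape)                              # (4, 16)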
Reminder: transformer architecture
Language modeling head

The language model head takes h_N^L (the final-layer output above the last token w_N, shape 1 × d) and outputs a distribution over the vocabulary V:
• The unembedding layer (= E^⊤, shape d × |V|) maps h_N^L to the logits u1 u2 … u|V| (shape 1 × |V|)
• A softmax over the vocabulary V turns the logits into the word probabilities y1 y2 … y|V| (shape 1 × |V|)
Language modeling head

• Unembedding layer: a linear layer that projects from h_N^L (shape 1 × d) to the logit vector, a 1 × |𝒱| vector of logits
• Why "unembedding"? Its weights are tied to E^⊤: weight tying, we use the same weights for two different matrices
Language modeling head

• Logits: the score vector u, with one score for each of the |𝒱| possible words in the vocabulary; shape 1 × |𝒱|
• Softmax turns the logits into probabilities over the vocabulary; shape 1 × |𝒱|

u = h_N^L E^⊤        y = softmax(u)
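A minimal sketch of the unembedding step with weight tying; E is the token embedding matrix (as in the embedding sketch above) and h_last stands in for h_N^L:

import numpy as np

def lm_head(h_last, E):
    """h_last: (d,) final-layer vector; E: (|V|, d) token embedding matrix (tied weights)."""
    u = h_last @ E.T                 # logits: u = h_N^L E^T, shape (|V|,)
    u = u - u.max()                  # numerically stable softmax
    y = np.exp(u)
    return y / y.sum()               # probabilities over the vocabulary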
The final transformer model

∼ GPT
LM loss

The LM head takes the output of the final transformer layer L, multiplies it by the unembedding layer, and turns it into probabilities:

ui = hi^L E^⊤        yi = softmax(ui)

The loss is the negative log probability of the next word, given the output hi^L:

ℒ_LM(xi) = − log P(xi+1 | hi^L) = − log yi[xi+1]

We get the gradients by taking the average of this loss over the batch:

ℒ_LM = − (1 / |ℬ|) Σ_{s∈ℬ} (1 / |s|) Σ_{i∈s} log P(xi+1 | hi^L)
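The per-position loss is then just the negative log probability assigned to the actual next token; a sketch reusing lm_head and E from the sketches above (the ids and hidden vector are placeholders):

h_last = np.random.randn(E.shape[1])     # stand-in for h_i^L
next_token_id = 4000                     # stand-in for x_{i+1}
y = lm_head(h_last, E)
loss = -np.log(y[next_token_id])         # L_LM(x_i) = -log P(x_{i+1} | h_i^L)
print(loss)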
Language Models are Unsupervised Multitask Learners

GPT-2 (Radford et al., 2019)


• Trained on ~40GB of text crawled from the internet
• Input context window N=1024 tokens, and model dimensionality
{d=768, d=1024, d=1280, d=1600}
• {L=12, L=24, L=36, L=48} layers of transformer blocks
• The resulting models have around {117M, 335M, 762M, 1542M}
parameters
Masked Language Modeling
• We've seen autoregressive (causal, left-to-right) LMs.
• But what about tasks for which we want to peek at future
tokens?
• Especially true for tasks where we map each input token
to an output token
• Bidirectional encoders use unmasked self-attention to
• map sequences of input embeddings x1, …, xn
• to sequences of output embeddings of the same length
h1, …, hn
• where the output vectors have been contextualized using
information from the entire input sequence.
Bidirectional Self-Attention

We just remove the mask:

Causal self-attention:          A = softmax(𝕄(QK^⊤ / √dk)) V
Bidirectional self-attention:   A = softmax(QK^⊤ / √dk) V
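In code, the only change from the causal sketch above is dropping the mask:

import numpy as np

def bidirectional_attention(Q, K, V):
    N, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)                     # no mask: every token sees every token
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # row-wise softmax
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V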
Masked training intuition
• For left-to-right (causal; decoder-only) LMs, the model tries to predict
the last word from prior words:
The water of Walden Pond is so beautifully ____
• And we train it to improve its predictions.
• For bidirectional masked LMs, the model tries to predict one or more
missing words from all the rest of the words:
The ____ of Walden Pond ____ so beautifully ____   (e.g., predicting blue for the final blank)
• The model generates a probability distribution over the vocabulary for each
missing token
• We use the cross-entropy loss from each of the model’s predictions to
drive the learning process.
Bidirectional Transformer ∼ BERT
MLM training in BERT
15% of the tokens are randomly chosen to be part of the masking
Example: "Lunch was delicious", if delicious was randomly chosen:
Three possibilities:
1. 80%: Token is replaced with special token [MASK]
Lunch was delicious -> Lunch was [MASK]
2. 10%: Token is replaced with a random token (sampled from the unigram distribution)
Lunch was delicious -> Lunch was gasp
3. 10%: Token is unchanged
Lunch was delicious -> Lunch was delicious
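A sketch of this 80/10/10 corruption scheme; the token ids, the [MASK] id, and the unigram distribution are placeholders, and marking unselected positions with -100 is just one convention for excluding them from the loss:

import numpy as np

def corrupt_for_mlm(token_ids, mask_id, vocab_size, rng, unigram_probs=None, p_select=0.15):
    token_ids = np.array(token_ids)
    corrupted = token_ids.copy()
    targets = np.full_like(token_ids, -100)            # -100 = position not used in the loss
    selected = rng.random(len(token_ids)) < p_select   # 15% of tokens chosen for masking
    for i in np.where(selected)[0]:
        targets[i] = token_ids[i]                      # the loss predicts the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id                     # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab_size, p=unigram_probs)  # 10%: random token
        # else: 10% of the time the token is left unchanged
    return corrupted, targets

rng = np.random.default_rng(0)
ids, targets = corrupt_for_mlm([2003, 1996, 12090], mask_id=103, vocab_size=30_000, rng=rng)
print(ids, targets)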
MLM loss

The LM head takes the output of the final transformer layer L, multiplies it by the unembedding layer, and turns it into probabilities:

ui = hi^L E^⊤        yi = softmax(ui)

E.g., for the xi corresponding to "long", the loss is the negative log probability of the correct word long, given the output hi^L:

ℒ_MLM(xi) = − log P(xi | hi^L)

We get the gradients by taking the average of this loss over the masked tokens in the batch:

ℒ_MLM = − (1 / |ℬ|) Σ_{s∈ℬ} (1 / |ℳ_s|) Σ_{i∈ℳ_s} log P(xi | hi^L)
Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., 2019)


• 30,000 English-only tokens (WordPiece tokenizer)
• Input context window N=512 tokens, and model dimensionality d=768
• L=12 layers of transformer blocks, each with A=12 (bidirectional)
multihead-attention layers.
• The resulting model has about 100M parameters.

XLM-RoBERTa (Conneau et al., 2020)


• 250,000 multilingual tokens (SentencePiece Unigram LM tokenizer)
• Input context window N=512 tokens, and model dimensionality d=1024
• L=24 layers of transformer blocks, with A=16 multihead attention layers
each
• The resulting model has about 550M parameters.
[15 minute break]
Implementing the Transformer!

Team up!

Open exercises/week 8 in your course folder and start writing/running


code!
