NLP Week 8: Transformers
NLP Week 8
2. (Multi-head) self-attention
3. Residual stream
4. Position embeddings
7. Group exercises
This semester
We will build language models, adding to their complexity layer by layer:
• Vaswani et al. 2017 - Attention Is All You Need
• Fleuret 2024 - The Little Book of Deep Learning
• Alammar 2018 - The Illustrated Transformer
We will stick to the following illustration
The Transformer
Zooming in
Intuition of attention
• Build up the representation of a word by selectively integrating information from all the neighbouring words
• We say that a word "attends to" some neighbouring words more than others
Attention definition
A mechanism for helping compute the embedding for a token by selectively attending to and integrating information from surrounding tokens (at the previous layer).
$$v^{k+1} = \sum_{i=1}^{n} \alpha_i \cdot v_i^{k}$$
Attention can respect time (causal)
[Figure: outputs a_1 … a_5 computed from inputs x_1 … x_5]
$$a_j = \sum_{i=1}^{n} \big(\alpha_i \cdot \mathbb{M}(i, j)\big) \cdot x_i \qquad \mathbb{M}(i, j) = \begin{cases} 1 & \text{if } i \le j \\ 0 & \text{else} \end{cases}$$
Simplified version of attention: a sum of prior words weighted by their similarity with the current word
[Figure: a_i computed as a weighted sum over x_1 … x_i]
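A minimal numpy sketch of this simplified attention (no learned parameters; the function name and toy sizes are my own), assuming dot-product similarity followed by a softmax over the current and prior words:

```python
import numpy as np

def simplified_attention(X):
    """X: (n, d) word vectors. Returns (n, d) outputs, where a_i is a
    weighted sum of x_1..x_i, weighted by dot-product similarity with x_i."""
    A = np.zeros_like(X)
    for i in range(X.shape[0]):
        scores = X[: i + 1] @ X[i]            # similarity of x_i with each prior word (and itself)
        alphas = np.exp(scores - scores.max())
        alphas /= alphas.sum()                # softmax -> weights that sum to 1
        A[i] = alphas @ X[: i + 1]            # weighted sum of the word vectors
    return A

X = np.random.randn(5, 8)                     # 5 toy "words", dimension 8
print(simplified_attention(X).shape)          # (5, 8)
```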
An Actual Attention Head is slightly more complicated
[Figure: each input x_1 … x_i is mapped to a query, a key, and a value vector]
$$q_i = x_i W^Q \qquad k_i = x_i W^K \qquad v_i = x_i W^V$$
Note: $x_i, q_i, k_i, v_i$ are row vectors here
$$\text{score}(x_i, x_j) = \frac{q_i k_j^\top}{\sqrt{d_k}} \qquad \alpha = \text{softmax}\big([\text{score}(x_i, x_j) \;\forall j \le i]\big) \qquad a_i = \sum_{j \le i} \alpha_j \cdot v_j$$
Example: calculating the value of a3
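A hedged sketch of one attention head in numpy, computing a single $a_i$ with learned projections $W^Q, W^K, W^V$ (the toy sizes and variable names are assumptions, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                       # toy sizes
X = rng.standard_normal((n, d))           # x_1..x_n as row vectors
W_Q = rng.standard_normal((d, d_k))       # learned projections W^Q, W^K, W^V
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_k))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

i = 2                                     # compute a_3 (0-indexed position 2)
q_i = X[i] @ W_Q                          # query for x_i
K = X[: i + 1] @ W_K                      # keys for x_1..x_i
V = X[: i + 1] @ W_V                      # values for x_1..x_i
alpha = softmax(q_i @ K.T / np.sqrt(d_k)) # scaled dot-product scores, softmaxed
a_i = alpha @ V                           # a_i = sum over j <= i of alpha_j * v_j
print(a_i.shape)                          # (4,) = d_k
```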
An Actual Attention Head is slightly more complicated
• Instead of one attention head, we'll have lots of them!
• Intuition: each head might be attending to the context for different purposes
• E.g., different linguistic relationships or patterns in the context
$$\text{head}_i^c = \sum_{j \le i} \alpha_{i,j}^c \cdot v_j^c \qquad a_i = (\text{head}_1 \oplus \text{head}_2 \oplus \dots \oplus \text{head}_h)\, W^O$$
$$\text{score}(x_i, x_j) = \frac{q_i k_j^\top}{\sqrt{d_k}}$$
Parallelizing attention
$$S = \frac{QK^\top}{\sqrt{d_k}} \qquad A = \text{softmax}(S)\, V$$
Masking out the future
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V$$
Another point: Attention is quadratic in length
The $QK^\top$ product is an $n \times n$ matrix of scores, so compute and memory grow quadratically with the sequence length $n$.
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V$$
Attention again
Parallelizing Multi-head Attention
This is equivalent to running the attention heads in parallel and adding their results back to the residual stream
(Whiteboard)
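A sketch of the parallelized, masked multi-head computation in numpy (reshaping into heads and the $-10^9$ masking constant are implementation choices of mine, not prescribed by the slides):

```python
import numpy as np

def causal_multihead_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d). W_Q/W_K/W_V/W_O: (d, d). h: number of heads.
    Computes A = softmax(mask(Q K^T / sqrt(d_k))) V for every head at once,
    concatenates the heads, and projects back with W^O so the result can be
    added to the residual stream."""
    n, d = X.shape
    d_k = d // h
    # project, then split the last dimension into h heads: (h, n, d_k)
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n) score matrices
    mask = np.triu(np.ones((n, n), dtype=bool), k=1) # True above the diagonal = future
    S = np.where(mask, -1e9, S)                      # masking out the future
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax
    heads = A @ V                                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)  # head_1 ⊕ ... ⊕ head_h
    return concat @ W_O                              # back to width d

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
print(causal_multihead_attention(X, *W, h=h).shape)  # (6, 16)
```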
Reminder: transformer architecture
A single transformer block
Sublayers of the transformer block: Layer Norm
LayerNorm(xi) = …
Layer Norm
Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer
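A small numpy sketch of this z-score view of layer norm, assuming the usual learned gain $\gamma$ and bias $\beta$ (the exact parametrisation on the slide may differ):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Z-score the components of a single vector x, then rescale and shift
    with the learned gain gamma and bias beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
```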
Sublayers of the transformer block: FFN
FFN(xi) = ReLU(xiW1 + b1)W2 + b2
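The same FFN written out in numpy; the toy dimensions are assumptions (the inner width d_ff is typically several times larger than d):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x_i) = ReLU(x_i W1 + b1) W2 + b2, applied to each token vector independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32                      # toy sizes
x = rng.standard_normal(d)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
print(ffn(x, W1, b1, W2, b2).shape)  # (8,)
```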
Putting together a single transformer block
A transformer is a stack of these blocks, so all the vectors are of the same dimensionality $d$
[Figure: Block 1 and Block 2 stacked]
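A compact sketch of one block and a stack of two, assuming a pre-norm arrangement of the sublayers (implementations differ on where the layer norms sit); all names and sizes here are illustrative:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalise each row (token vector) to zero mean, unit variance."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head masked attention; W_V maps back to width d so the output
    can be added to the residual stream."""
    n, d_k = X.shape[0], W_K.shape[1]
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    S = np.where(np.triu(np.ones((n, n), dtype=bool), 1), -1e9, S)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ (X @ W_V)

def transformer_block(X, params):
    """One block: attention sublayer + FFN sublayer, each with a residual connection."""
    W_Q, W_K, W_V, W1, b1, W2, b2 = params
    X = X + causal_attention(layer_norm(X), W_Q, W_K, W_V)    # attention + residual
    X = X + np.maximum(0, layer_norm(X) @ W1 + b1) @ W2 + b2  # FFN + residual
    return X                                                  # same shape (n, d), so blocks stack

rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 32
X = rng.standard_normal((n, d))
params = (rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1,
          rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d_ff)) * 0.1,
          np.zeros(d_ff), rng.standard_normal((d_ff, d)) * 0.1, np.zeros(d))
print(transformer_block(transformer_block(X, params), params).shape)  # two stacked blocks: (6, 8)
```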
Residual streams and attention
• Notice that all parts of the transformer block apply to one residual stream, except attention, which takes information from other tokens
• Elhage et al. (2021) show that we can view attention heads as literally moving information from the residual stream of a neighboring token into the current stream
Unembedding
The unembedding layer is a linear layer that projects from $h_N^L$ (shape $1 \times d$) to a logit vector. The unembedding layer is $E^\top$ (shape $d \times |\mathcal{V}|$); softmax turns the logits into probabilities over the vocabulary (shape $1 \times |\mathcal{V}|$).
$$u = h_N^L E^\top \qquad y = \text{softmax}(u)$$
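A numpy sketch of the unembedding step with weight tying (reusing the embedding matrix $E$ as $E^\top$); the toy vocabulary size and dimension are assumptions:

```python
import numpy as np

def unembed(h, E):
    """h: (1, d) final hidden state for a token; E: (|V|, d) embedding matrix.
    Logits u = h E^T have shape (1, |V|); softmax turns them into a
    distribution over the vocabulary."""
    u = h @ E.T
    y = np.exp(u - u.max())
    return y / y.sum()

rng = np.random.default_rng(0)
vocab_size, d = 10, 8                          # toy sizes
E = rng.standard_normal((vocab_size, d))       # token embedding matrix
h_N = rng.standard_normal((1, d))              # h_N^L: output of the last layer
y = unembed(h_N, E)
print(y.shape, y.sum())                        # (1, 10), probabilities sum to 1
```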
The final transformer model
∼ GPT
LM loss
The LM head takes the output of the final transformer layer $L$, multiplies it by the unembedding layer, and turns it into probabilities:
$$u_i = h_i^L E^\top \qquad y_i = \text{softmax}(u_i)$$
The loss is the negative log probability of the next word, given output $h_i^L$:
$$\mathcal{L}_{LM}(x_i) = -\log P(x_{i+1} \mid h_i^L) = -\log \hat{y}_i[x_{i+1}]$$
We get the gradients by taking the average of this loss over the batch:
$$\mathcal{L}_{LM} = -\frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \frac{1}{|s|} \sum_{i \in s} \log P(x_{i+1} \mid h_i^L)$$
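A sketch of this loss for a single sequence in numpy (averaging over the sequences of a batch would simply wrap this in an outer mean); names and toy sizes are mine:

```python
import numpy as np

def lm_loss(H, E, token_ids):
    """H: (n, d) outputs h_i^L for one sequence; E: (|V|, d) embedding matrix;
    token_ids: (n,) token indices x_1..x_n.
    Loss at position i is -log P(x_{i+1} | h_i^L); average over positions."""
    U = H @ E.T                                        # logits, (n, |V|)
    U = U - U.max(axis=-1, keepdims=True)
    Y = np.exp(U) / np.exp(U).sum(axis=-1, keepdims=True)
    next_ids = token_ids[1:]                           # predict the *next* token
    p_next = Y[np.arange(len(next_ids)), next_ids]     # P(x_{i+1} | h_i^L)
    return -np.log(p_next).mean()

rng = np.random.default_rng(0)
n, d, vocab = 6, 8, 10
H = rng.standard_normal((n, d))
E = rng.standard_normal((vocab, d))
ids = rng.integers(0, vocab, size=n)
print(lm_loss(H, E, ids))
```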
Language Models are Unsupervised Multitask Learners
$$A = \text{softmax}\!\left(\mathbb{M}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V \qquad A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Masked training intuition
• For left-to-right (causal; decoder-only) LMs, the model tries to predict the last word from the prior words:
  The water of Walden Pond is so beautifully
• And we train it to improve its predictions.
• For bidirectional masked LMs, the model tries to predict one or more missing words from all the rest of the words:
  The ____ of Walden Pond ____ so beautifully blue
• The model generates a probability distribution over the vocabulary for each missing token
• We use the cross-entropy loss from each of the model's predictions to drive the learning process.
Bidirectional Transformer ∼ BERT
MLM training in BERT
15% of the tokens are randomly chosen to be masked.
Example: "Lunch was delicious", if delicious was randomly chosen, there are three possibilities:
1. 80%: the token is replaced with the special token [MASK]
   Lunch was delicious -> Lunch was [MASK]
2. 10%: the token is replaced with a random token (sampled from the unigram probabilities)
   Lunch was delicious -> Lunch was gasp
3. 10%: the token is unchanged
   Lunch was delicious -> Lunch was delicious
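A toy sketch of the 80/10/10 corruption rule; here the random replacement is drawn uniformly from a small word list, whereas BERT samples from the unigram distribution:

```python
import random

def bert_mask(tokens, vocab, mask_rate=0.15, mask_token="[MASK]"):
    """Pick ~mask_rate of the positions as prediction targets; for each chosen
    position: 80% -> [MASK], 10% -> random token, 10% -> keep the original."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok                      # loss is computed only at these positions
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the token unchanged
    return corrupted, targets

vocab = ["lunch", "was", "delicious", "gasp", "blue", "pond"]
# high mask_rate so this tiny example usually corrupts something
print(bert_mask(["Lunch", "was", "delicious"], vocab, mask_rate=0.5))
```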
MLM loss
The LM head takes the output of the final transformer layer $L$, multiplies it by the unembedding layer, and turns it into probabilities:
$$u_i = h_i^L E^\top \qquad y_i = \text{softmax}(u_i)$$
E.g., for the $x_i$ corresponding to "long", the loss is the negative log probability of the correct word long, given output $h_i^L$:
$$\mathcal{L}_{MLM}(x_i) = -\log P(x_i \mid h_i^L)$$
We get the gradients by taking the average of this loss over the batch:
$$\mathcal{L}_{MLM} = -\frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \frac{1}{|\mathcal{M}_s|} \sum_{i \in \mathcal{M}_s} \log P(x_i \mid h_i^L)$$
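A numpy sketch of the MLM loss for one sequence, averaging only over the masked positions $\mathcal{M}_s$; names and toy sizes are assumptions:

```python
import numpy as np

def mlm_loss(H, E, target_ids, masked_positions):
    """H: (n, d) outputs h_i^L; E: (|V|, d) embeddings; target_ids: (n,) the
    original tokens; masked_positions: indices in the set M_s.
    Loss = mean over masked positions of -log P(x_i | h_i^L)."""
    U = H @ E.T
    U = U - U.max(axis=-1, keepdims=True)
    Y = np.exp(U) / np.exp(U).sum(axis=-1, keepdims=True)
    p_correct = Y[masked_positions, target_ids[masked_positions]]
    return -np.log(p_correct).mean()

rng = np.random.default_rng(0)
n, d, vocab = 6, 8, 10
H = rng.standard_normal((n, d))
E = rng.standard_normal((vocab, d))
ids = rng.integers(0, vocab, size=n)
print(mlm_loss(H, E, ids, masked_positions=np.array([1, 4])))
```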
Bidirectional Encoder Representations from Transformers
Team up!