Lesson 14 - Transformer
NLP II 2025
Jakapun Tachaiya (Ph.D.)
Outline
- Transformer
- Transfer Learning
- Pretrained Model
- BERT
- GPT
2
Transformer
3
Evolution of Large Language Models
https://arxiv.org/html/2402.06853v1 4
No RNNs, no CNNs
5
Example Tasks to Train a Transformer
● Translation
● Dialogue completion
6
Way Smarter than RNNs for Language Modeling!
7
Transformer Key Ideas
● Core Idea: Processes input sequences in
parallel using attention mechanisms, bypassing
the sequential limitations of RNNs.
○ Recurrence: not parallelizable, long “path
lengths”
○ Attention: Parallelizable, short path
lengths.
● Core Architecture:
○ Positional encoding
○ Multi-head attention and self-attention
○ Decoder’s masked attention
8
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self-attention
● Decoder’s masked attention
https://jalammar.github.io/illustrated-transformer/ 9
How does the transformer work?
Encoder: What is English? What is context?
Decoder: How do we map an English word to French?
10
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self-attention
● Decoder’s masked attention
https://jalammar.github.io/illustrated-transformer/ 11
Positional Encoding
LSTM - Reads word by word, so it knows the position of each word.
12
https://www.youtube.com/watch?v=dichIcUZfOw
Positional Encoding
Transformer - Reads all word embeddings at once (512, 768, ... tokens)
● Loses information about the position of each word
14
Why does position matter?
15
Absolute Position Embedding
[Figure: absolute position embedding vectors, indexed by position of token starting at 0]
16
Intuition behind position formula
Just a sine function
17
Intuition behind position formula
18
Intuition behind position formula
Same value at i = 4 but different at i = 2
19
Input embedding with absolute position embedding
Encodes the position of each token in a sequence into fixed embeddings that are added to the input word embeddings.
20
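Below is a minimal NumPy sketch of the sinusoidal absolute positional encoding described above; the dimensions (d_model = 8, sequence length 10) are illustrative assumptions, not the values used in the original paper.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cos
    return pe

# The encoding is simply added to the input word embeddings.
word_embeddings = np.random.randn(10, 8)             # (seq_len, d_model), toy values
inputs = word_embeddings + sinusoidal_positional_encoding(10, 8)
```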
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self-attention
● Decoder’s masked attention
https://jalammar.github.io/illustrated-transformer/ 21
Simple/Cross Attention VS Self-attention
[Figure: in simple/cross attention the query comes from outside the input sentence; in self-attention the input sentence attends to itself]
22
Multi-head attention
23
Intuition behind K, Q, V (info retrieval)
24
Intuition behind K, Q, V (info retrieval)
25
Multi-head attention is built from scaled dot-product attention (multiplicative attention)
26
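A minimal NumPy sketch of scaled dot-product (multiplicative) attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V; the shapes below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (num_queries, num_keys) score matrix
    weights = softmax(scores, axis=-1)         # each row sums to 1 (the "attention filter")
    return weights @ V, weights

Q = np.random.randn(5, 64)                     # 5 query tokens, d_k = 64
K = np.random.randn(7, 64)                     # 7 key tokens
V = np.random.randn(7, 64)                     # one value vector per key
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)                # (5, 64) (5, 7)
```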
Linear layer with NO activation function (such as ReLU)
1. Mapping inputs onto outputs
2. Changing vector dimensions
27
Multi-head attention (K, Q, V)
28
K, Q, V attention
29
K, Q, V attention
[Figure: the key matrix K and the query/key score matrix KQᵀ]
30
Attention filter
With initial random weights, the filter is less meaningful; after training, the weights capture self-attention.
31
Attention filter
32
Multi-head attention
33
Intuition on Multi-head attention
Then concatenate all heads and pass them through a linear layer to reduce the size
34
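A minimal sketch of multi-head self-attention using PyTorch's built-in nn.MultiheadAttention (the model width of 512 and 8 heads are illustrative assumptions): each head attends in its own subspace, and the concatenated heads are projected back down by a final linear layer inside the module.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
out, attn = mha(x, x, x)           # self-attention: query = key = value = x
print(out.shape)                   # torch.Size([2, 10, 512])
print(attn.shape)                  # torch.Size([2, 10, 10]), weights averaged over heads
```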
Intuition on Multi-head attention
35
Multi-head attention
36
Multi-Head Cross-Attention
Enables the decoder to selectively focus on specific parts
of the encoder's output.
● Query (Q): Derived from the decoder's current hidden
state.
● Key (K) and Value (V): Derived from the encoder's
output.
K V Q
37
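Cross-attention reuses the same module, but the query comes from the decoder while the keys and values come from the encoder output; a minimal PyTorch sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

encoder_output = torch.randn(2, 12, 512)   # K and V come from the encoded input sentence
decoder_state = torch.randn(2, 7, 512)     # Q comes from the decoder's current hidden states

out, weights = cross_attn(query=decoder_state, key=encoder_output, value=encoder_output)
print(out.shape)      # torch.Size([2, 7, 512]): one output per decoder position
print(weights.shape)  # torch.Size([2, 7, 12]): decoder positions attend over encoder positions
```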
Self Attention VS Cross Attention
Attention Terminology
● K, Q, V attention
● Multi-head
● Self-attention
○ Encoder
○ Decoder
● Cross-attention
40
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self-attention
● Decoder’s masked attention
https://jalammar.github.io/illustrated-transformer/ 41
Masked Self-Attention
● As in language modeling, we mask so the model cannot see future output tokens before predicting them
42
Masked Self-Attention
Key idea: Masking Out the Future
43
44
Masked attention
45
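A minimal NumPy sketch of masked (causal) self-attention: scores for future positions are set to -inf before the softmax, so each token can only attend to itself and earlier tokens (sizes are illustrative).

```python
import numpy as np

def causal_mask(seq_len):
    # True strictly above the diagonal marks the "future" positions to hide.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

seq_len, d_k = 5, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)

scores = Q @ K.T / np.sqrt(d_k)
scores[causal_mask(seq_len)] = -np.inf               # future tokens get zero attention weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                          # the upper triangle is all zeros
```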
Residual connection
1. Knowledge preservation
2. Mitigates the vanishing gradient problem
46
ADD & NORM
47
ADD & NORM
Layer normalization
● Shift mean to 0 and var to 1
● Standardize along the feature axis
48
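A hedged PyTorch sketch of the Add & Norm step in its original post-norm form: the sublayer output is added back to its input (residual connection) and the sum is layer-normalized along the feature axis. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model = 512
self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
norm = nn.LayerNorm(d_model)            # normalizes along the feature axis

x = torch.randn(2, 10, d_model)
attn_out, _ = self_attn(x, x, x)        # the sublayer (could also be the feed-forward block)
x = norm(x + attn_out)                  # Add (residual) & Norm
print(x.shape, float(x[0, 0].mean()))   # per-position mean is ~0 after LayerNorm
```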
Transformer Architecture Summary
Main building block: attention!
● Encoder: self-attention
● Decoder: masked self-attention
● Decoder-encoder: cross-attention
49
50
Why is the Transformer better than RNNs?
1. Self-Attention:
a. Capture dependencies between words in a sentence without being restricted
by their distance from each other.
2. Parallel Processing:
a. Unlike RNNs, which process data sequentially, transformers can process the
entire input sequence in parallel.
3. Handling Long-Range Dependencies:
a. RNNs struggle with long-range dependencies due to the vanishing gradient
problem.
b. Transformers can remember and maintain performance over longer sequences.
51
Scaling Laws: Are Transformers All We Need?
● With Transformers, language modeling performance improves as we increase model size, training
data, and compute resources.
● This power-law relationship has been observed over multiple orders of magnitude with no sign of
slowing!
52
Transformer Drawback!
Quadratic compute in self-attention: O(n²)
● Computing all pairs of
interactions/attentions means our
computation grows quadratically with
the sequence length!
● For recurrent models, it only grew
linearly!
● Prevents scaling to long sequences.
One big area of research: more efficient attention mechanisms, for example:
● Random attention
● Window attention
● Linear attention
● Flash attention
● Lightning attention
53
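A quick back-of-the-envelope sketch of the quadratic cost: the size of a single float32 n x n attention score matrix (one head, batch size 1; pure arithmetic, not a benchmark).

```python
# Memory for one n-by-n float32 attention score matrix.
for n in (512, 4_096, 32_768):
    mib = n * n * 4 / 2**20
    print(f"seq_len={n:>6}: {mib:>8.1f} MiB")
# 512 -> 1.0 MiB, 4096 -> 64.0 MiB, 32768 -> 4096.0 MiB: 8x longer input, 64x more memory.
```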
Transfer Learning Concept
Another classification task
● Can you guess whether it is a land animal or a water animal?
○ Have you ever seen this creature before?
■ You can transfer your knowledge from the past
55
https://slds-lmu.github.io/seminar_nlp_ss20/introduction-transfer-learning-for-nlp.html 56
Transfer Learning
Myth: you can’t do deep learning unless you have a million labelled examples for your problem.
Reality:
57
Transfer learning: idea
60
Model Alignment for Transfer Learning
● Source model is the single most important variable.
● Keep source model and target model well-aligned (close to each other) when possible.
● Source vocabulary should be aligned with target vocabulary (similar domain).
● Source task should be aligned with target task (similar task).
For example:
61
What is the most common
Transfer Learning Model in NLP?
62
Pre-trained
Language Model
● Learning to model the distribution of
natural language.
● Predicting the next word in a sequence
given context.
● A base model for specific downstream tasks.
● No need for labeled data (unsupervised / self-supervised learning on raw text)
63
The Pretraining / Fine-tuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
64
Pre-trained Models
Three architectures for large language models
66
1. Encoder Only (Autoencoder)
Model Type: Masked Language Models (MLMs) -
Trained by predicting words from surrounding
words on both sides.
Tasks:
1. Sequence classification
2. Token classification
67
2. Encoder - Decoder
Model Type: The original Transformer, built for seq2seq tasks.
Tasks:
1. Machine translation
2. Speech Recognition
68
3. Decoder Only (Auto-Regressive)
Model Type: Causal LLMs/Autoregressive
LLMs/Left-to-right LLMs - predict words left to
right.
Tasks:
1. Text Generation
2. Predicting next word
69
BERT
70
BERT - Bidirectional Encoder Representations from Transformers
71
BERT Ideas
1. Masked Language Model
○ fill-in-the-blank
2. Bidirectional encoder
○ Sees future tokens, giving more information to infer masked tokens
○ Can’t do (left-to-right) language modeling!
72
BERT
73
BERT VS GPT
Transformer
74
BERT VS Transformer
75
BERT - Two-Phase Training
76
Phase 1: Unsupervised Masked LM Training
15% of the tokens are randomly chosen to
be part of the masking
Three possibilities:
1. 80%: Token is replaced with special token [MASK]
● Lunch was delicious -> Lunch was [MASK]
77
Phase 1: Next Sentence Prediction
78
Input Representation
81
Training Details
● BooksCorpus (800M words) + Wikipedia (2.5B)
● Masking the input text: 15% of all tokens are chosen (see the sketch after this list).
○ 80% of the time: replaced by designated ‘[MASK]’ token
○ 10% of the time: replaced by random token
○ 10% of the time: unchanged
● Loss is cross-entropy of the prediction at the masked positions.
● Max seq length: 128 tokens for first 90%, 512 tokens for final 10%
● 1M training steps, batch size 256 = 4 days on 4 or 16 TPUs
82
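A minimal sketch of the 80/10/10 masking rule from the training details above, applied per selected token. The whitespace tokenization and tiny vocabulary here are simplifying assumptions; real BERT operates on WordPiece token ids.

```python
import random

VOCAB = ["the", "cat", "sat", "lunch", "was", "delicious"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None except at selected positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:                  # 15% of tokens are selected
            labels.append(tok)                           # loss is computed only here
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(VOCAB))   # 10%: replace with a random token
            else:
                corrupted.append(tok)                    # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

print(mask_tokens("lunch was delicious".split()))
```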
Fine-tuning BERT Use case
● Sentence/Sentence pair classification
○ E.g. spam detection, sentiment analysis, Natural Language Inference (see the sketch below)
83
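A hedged Hugging Face sketch of loading BERT with a classification head for a sentence-level task; the checkpoint name and num_labels=2 are illustrative, and a real fine-tuning run would add a labeled dataset, an optimizer, and a training loop (for example via the Trainer API).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Single sentence (sentiment/spam style); pass a second sentence for pair tasks such as NLI.
batch = tokenizer("The movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits          # shape (1, num_labels); the head is still untrained
print(logits.softmax(dim=-1))
```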
Fine-tuning BERT Use case
● Sequence Labeling
○ Tokenization, POS, NER (see the sketch below)
84
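A short sketch of the sequence-labeling use case via the token-classification pipeline; no model is specified, so the library falls back to its default NER checkpoint and the exact labels it returns should be treated as illustrative.

```python
from transformers import pipeline

ner = pipeline("token-classification", aggregation_strategy="simple")
for entity in ner("Jakapun teaches NLP at a university in Bangkok."):
    print(entity)   # expected: spans tagged roughly as PER / ORG / LOC with confidence scores
```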
Contextual Embeddings to represent words
85
BERT as a Contextual Representation
Word sense disambiguation - The task of selecting the correct sense for a word
86
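A minimal sketch of using BERT hidden states as contextual embeddings for word sense disambiguation: the same surface word gets different vectors in different contexts. The checkpoint and example sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embedding_of("I deposited cash at the bank.", "bank")
v2 = embedding_of("We had a picnic on the river bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))            # < 1.0: same word, different senses
```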
Model width
87
BERT is a stack of encoders
88
Pretrained BERT (Hugging Face)
89
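A minimal sketch of querying a pretrained BERT from Hugging Face with the fill-mask pipeline (the exact ranking of predictions is illustrative):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Lunch was [MASK]."):
    print(f'{pred["token_str"]:>12}  {pred["score"]:.3f}')
```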
Vision transformer
● Can’t feed pixel values directly into the transformer because of O(n²) attention
○ Use patches of the image instead (see the sketch below)
90
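A NumPy sketch of the patching step: a 224 x 224 RGB image becomes 196 patch tokens of dimension 768 instead of 50,176 pixel tokens. The 16 x 16 patch size matches the usual ViT-Base setting; the random image is just a stand-in.

```python
import numpy as np

img = np.random.rand(224, 224, 3)          # (H, W, C)
P = 16                                     # patch size

patches = (img.reshape(224 // P, P, 224 // P, P, 3)
              .transpose(0, 2, 1, 3, 4)    # bring the two patch-grid axes together
              .reshape(-1, P * P * 3))
print(patches.shape)                       # (196, 768): 196 tokens, each a flattened patch
# Attention over 196 tokens is cheap; over 224 * 224 = 50,176 pixels it would not be.
```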
91
Sidenote on Input Token
Why does BERT split input tokens this way?
92
Subword is the way!
With word-level tokenization, we assume a fixed vocab of tens of thousands of words, built from the training set; all novel words seen at test time are mapped to a single UNK.
● Subwords combat misspellings and unknown-word issues
93
Level of Token
94
Tokens represent Words
95
96
Check whether low-frequency tokens still make sense
97
98
Tokenizer (subwords) for Transformers
99
WordPiece Tokenization
Similar to BPE: it uses frequency of occurrence to identify potential merges, but makes the final decision based on the likelihood of the merged token.
100
SentencePiece Tokenization
101
SentencePiece Tokenization
Simply treats the input text as a sequence of Unicode characters, including whitespace.
102
Byte-Pair Encoding (BPE) Tokenizer
103
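A short sketch comparing the subword tokenizers above on a misspelled, out-of-vocabulary word: BERT's WordPiece marks continuation pieces with '##', while GPT-2's byte-level BPE marks word boundaries with a leading space symbol. Exact splits depend on the checkpoint.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")                # byte-level BPE

text = "Tokenization handles misspelled wrods gracefully"
print(bert_tok.tokenize(text))
print(gpt2_tok.tokenize(text))
# Unknown words are split into known subwords instead of collapsing to a single UNK.
```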
Pre-trained Encoder Decoder
104
Pretraining encoder-decoders
For encoder-decoders, we could do something like language
modeling, but where a prefix of every input is provided to the
encoder and is not predicted.
105
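A hedged sketch of running a pretrained encoder-decoder (T5 is used here purely as an example checkpoint): the encoder reads the whole input, and the decoder generates the output autoregressively.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: The lecture starts at noon.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```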
Pretraining encoder-decoders
1. Higher Computational Cost: Both an encoder and a
decoder are required, leading to increased memory and
computation requirements compared to simpler models like
decoder-only architectures.
2. Slower Inference: The encoder processes the entire input
sequence before the decoder starts generating the output,
resulting in a two-step process that slows down inference
compared to models that perform generation directly (e.g.,
decoder-only models).
3. Limited Suitability for Certain Tasks: These models are
better suited for sequence-to-sequence tasks (e.g.,
translation, summarization) but are less efficient for
general-purpose tasks like text generation, where
decoder-only models excel.
106
Decoder only Pretrained Model
107
Decoder Only Pretrained Model as LLM
● Generating text conditioned on previous text
108
GPT - Generative Pre-Training (OpenAI)
109
GPT
● Uses the Transformer decoder instead of the encoder
● “Self”-attention is masked so that each token can only attend to previous tokens.
● Predicts the next token in a sequence
○ Causal language modeling
https://jalammar.github.io/how-gpt3-works-visualizations-animations/ 111
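A minimal sketch of causal (left-to-right) generation with a small GPT-style checkpoint; GPT-2 is used as a stand-in and the sampling settings are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The transformer architecture is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```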
GPT Training
112
OpenAI GPT 1 (Generative Pre-Training)
Multitask learning
113
GPT - Formatting Inputs for Fine-tuning Tasks
114
Data format for SFT
● Convert existing annotated NLP datasets to an instruction-following format to continue training the LLM.
○ Supervised fine-tuning (SFT), Instruction fine-tuning
115
Multi-column dataset
● Conventional classification dataset
● Merge multiple columns into one large prompt so that fine-tuning actually works (see the sketch below).
116
https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama
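A minimal sketch of merging the columns of a conventional classification dataset into a single instruction-style prompt for SFT, in the spirit of the Unsloth tutorial linked above; the field names and template are illustrative assumptions.

```python
def to_prompt(row):
    return (
        "### Instruction:\n"
        "Classify the sentiment of the review as positive or negative.\n\n"
        f"### Review:\n{row['text']}\n\n"
        f"### Answer:\n{row['label']}"
    )

example = {"text": "The battery lasts two full days.", "label": "positive"}
print(to_prompt(example))   # one self-contained prompt string per dataset row
```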
Multi-column dataset
● Now the LLM can perform classification (even from a causal model)!
117
Evolution of GPT
https://www.kdnuggets.com/2023/05/deep-dive-gpt-models.html 118
Scaling Laws
LLM performance depends on model size, dataset size, and compute.
We can improve a model by adding parameters (more layers, wider contexts), adding data, or training for more iterations. The performance of a large language model (its loss) scales as a power law with each of these three.
119
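One common way to write the observed power-law relationship, following Kaplan et al. (2020); the scale constants and exponents depend on the setup and are intentionally left symbolic.

```latex
% Loss as a power law in parameters N, dataset size D, and compute C
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```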
Scaling Laws
● Empirical observation: scaling up models leads to reliable gains in perplexity
120
GPT Scale
[Table: GPT model sizes, characterized by depth L, width d, and number of attention heads]
121
ChatGPT
https://openai.com/blog/chatgpt/ 122
How is ChatGPT different from the GPT model?
● ChatGPT is optimized for dialogue and conversation.