Lesson 14 - Transformer

Transformer-based Models

NLP II 2025
Jakapun Tachaiya (Ph.D.)
Outline
- Transformer
- Transfer Learning
- Pretrained Model
- BERT
- GPT

2
Transformer

3
Evolution of Large Language Models

https://arxiv.org/html/2402.06853v1 4
No RNN, CNN

5
Example tasks to train a transformer
● Translation
● Dialogue completion

6
Way smarter than an RNN for language modeling!

7
Transformer Key Ideas
● Core Idea: Processes input sequences in
parallel using attention mechanisms, bypassing
the sequential limitations of RNNs.
○ Recurrence: not parallelizable, long “path
lengths”
○ Attention: Parallelizable, short path
lengths.
● Core Architecture:
○ Positional encoding
○ Multi-head attention and self attention
○ Decoder’s masked attention

8
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self attention
● Decoder’s masked attention

https://jalammar.github.io/illustrated-transformer/ 9
How does a transformer work?

Encoder
What is English?
What is context?

Decoder
How to map English words to French?

10
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self attention
● Decoder’s masked attention

https://jalammar.github.io/illustrated-transformer/ 11
Positional Encoding
LSTM reads word by word, so it knows the position of each word.

12
https://www.youtube.com/watch?v=dichIcUZfOw
Positional Encoding
Transformer reads all word embeddings at once (e.g., 512, 768, ... tokens).
● Loses the position information of each word

14
Why does position matter?

Word order changes the meaning and sentiment of a sentence.

15
Absolute Position Embedding

(Figure: position-embedding matrix; the vertical axis is the position of each token, starting at 0.)

16
Intuition behind position formula
Just a sine function

17
Intuition behind position formula

With a different frequency for each dimension index i

18
Intuition behind position formula
Same value at i = 4 but different at i = 2

19
Input embedding with absolute position embedding
Encodes the position of each token in a sequence into fixed embeddings added to the input word embeddings.

● Uses sinusoidal functions to represent positions (see the sketch below).
● Results in a word embedding with position information.

20
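To make the formula concrete, here is a minimal NumPy sketch of the sinusoidal encoding described above. It is not code from the slides; the function name and the seq_len/d_model parameters are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position encodings as in Vaswani et al. (2017)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                  # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return pe

# The encoding is simply added to the input word embeddings:
# x = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(seq_len=50, d_model=512).shape)  # (50, 512)
```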
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self attention
● Decoder’s masked attention

https://jalammar.github.io/illustrated-transformer/ 21
Simple/Cross Attention vs. Self-Attention

(Figure: in cross-attention the query comes from outside the input sentence; in self-attention the input sentence provides its own queries.)

22
Multi-head attention

Value Key Query

23
Intuition behind K, Q, V (info retrieval)

24
Intuition behind K, Q, V (info retrieval)

25
Multi-head attention is scaled dot-product attention (multiplicative attention)

(Figure: the attention block with its three inputs labeled Value, Key, and Query.)
26
Linear layer with no activation function (such as ReLU)
1. Mapping inputs onto outputs
2. Changing vector dimensions

Value Key Query

27
Multi-head attention (K, Q, V)

28
K, Q, V attention

29
K, Q, V attention

(Figure: the tokens of “When You Play The Game Of Throne” form the rows of K and the columns of Qᵀ; their product KQᵀ is the token-by-token attention score matrix.)
30
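The score matrix in the figure can be computed in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention for a single head and a single sequence (no batching); it uses the more common QKᵀ orientation, which is simply the transpose of the KQᵀ matrix shown above. The function name and toy shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- multiplicative (scaled dot-product) attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (len_q, len_k) attention filter
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                    # weighted sum of the values

# Seven tokens ("When You Play The Game Of Throne"), toy embedding size 4:
Q = K = V = np.random.randn(7, 4)
print(scaled_dot_product_attention(Q, K, V).shape)        # (7, 4)
```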
Attention filter

With initial random weights the attention filter carries little meaning; after training, the weights capture meaningful self-attention.

31
Attention filter

32
Multi-head attention

33
Intuition on Multi-head attention

Then concatenate all heads and pass the result to a linear layer to reduce the size

Each attention head focuses on specific properties

34
Intuition on Multi-head attention

35
Multi-head attention

36
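A hedged PyTorch sketch of the multi-head mechanism: project to Q, K, V, split into heads, attend per head, concatenate, and project back down. It assumes d_model = 512 with 8 heads as in the original paper; the class and variable names are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project inputs to Q/K/V, split into heads, attend, concatenate, project back."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # Linear layers with no activation: they only remap/resize vectors.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads

    def forward(self, query, key, value):
        B = query.size(0)
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(query)), split(self.w_k(key)), split(self.w_v(value))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, heads, len_q, len_k)
        attn = scores.softmax(dim=-1)                        # one attention filter per head
        out = (attn @ v).transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)                                 # concat heads, then linear reduce

# Self-attention: query, key, and value are all the same sequence.
x = torch.randn(2, 10, 512)                    # (batch, seq_len, d_model)
print(MultiHeadAttention()(x, x, x).shape)     # torch.Size([2, 10, 512])
```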
Multi-Head Cross-Attention
Enables the decoder to selectively focus on specific parts
of the encoder's output.
● Query (Q): Derived from the decoder's current hidden
state.
● Key (K) and Value (V): Derived from the encoder's
output.
K V Q

37
Self Attention VS Cross Attention
Attention Terminology
● K, Q, V attention
● Multi-head
● Self-attention
○ Encoder
○ Decoder
● Cross-attention

40
Transformer
Core Architecture:
● Positional encoding
● Multi-head attention and self attention
● Decoder’s masked attention

https://jalammar.github.io/illustrated-transformer/ 41
Masked Self-Attention
● As in a language model, we mask future tokens so the model cannot see the output before predicting it

Predict one word at a time

● Future words are masked

42
Masked Self-Attention
Key idea: Masking Out the Future

● Use a “mask” to block out certain attention scores.
● We mask to prevent the model from seeing the output before prediction.

On the left:

● Tokens in the rows (as queries) cannot pay attention to the tokens in the columns (values) that are shaded.
43
44
Masked attention

45
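A small PyTorch sketch of the masking trick: future positions are set to -inf before the softmax, so their attention weight becomes exactly zero. This is illustrative code, not taken from the slides.

```python
import torch

def causal_mask(seq_len):
    """True above the diagonal: position i may only attend to positions <= i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Apply the mask to raw attention scores before the softmax.
scores = torch.randn(5, 5)                                  # raw attention scores
scores = scores.masked_fill(causal_mask(5), float("-inf"))
weights = scores.softmax(dim=-1)
print(weights)                                              # upper triangle is all zeros
```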
Residual connection

1. Knowledge preservation
2. Mitigates the vanishing-gradient problem

Helps preserve the position-aware embeddings

46
ADD & NORM

47
ADD & NORM
Layer normalization
● Shifts the mean to 0 and the variance to 1
● Standardize along the feature axis

48
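The Add & Norm block reduces to a single line. A minimal post-norm PyTorch sketch (as in the original Transformer); the class name and d_model default are illustrative.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization (post-norm variant)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # standardizes along the feature axis

    def forward(self, x, sublayer_output):
        # "Add": the residual preserves the (position-aware) input and helps gradients flow.
        # "Norm": per token, shift the mean to 0 and the variance to 1.
        return self.norm(x + sublayer_output)

x = torch.randn(2, 10, 512)
print(AddAndNorm()(x, torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```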
Transformer Architecture Summary
Main building block: attention!

● Encoder: self-attention
● Decoder: masked self-attention
● Decoder-encoder: cross-attention

Position encodings/embeddings to inject information about sequence order

Residual connections + LayerNorm around every component

49
50
Why is the Transformer better than an RNN?
1. Self-Attention:
a. Capture dependencies between words in a sentence without being restricted
by their distance from each other.
2. Parallel Processing:
a. Unlike RNNs, which process data sequentially, transformers can process the
entire input sequence in parallel.
3. Handling Long-Range Dependencies:
a. RNNs struggle with long-range dependencies due to the vanishing gradient
problem.
b. Transformers can remember and maintain performance over longer sequences.

51
Scaling Laws: Are Transformers All We Need?
● With Transformers, language modeling performance improves as we increase model size, training
data, and compute resources.
● This power-law relationship has been observed over multiple orders of magnitude with no sign of
slowing!

52
Transformer Drawback!
Quadratic compute in self-attention: O(n²)
● Computing all pairs of
interactions/attentions means our
computation grows quadratically with
the sequence length!
● For recurrent models, it only grew
linearly!
● Prevents scaling to long sequences.
One big area of research: more efficient attention mechanisms, for example:
● Random attention
● Window attention
● Linear attention
● Flash attention
● Lightning attention
53
Transfer Learning Concept
Another classification task
● Can you guess whether it is a land animal or a water animal?
○ Have you ever seen this creature before?
■ You can transfer your knowledge from the past

55
https://slds-lmu.github.io/seminar_nlp_ss20/introduction-transfer-learning-for-nlp.html56
Transfer Learning
Myth: you can’t do deep learning unless you have a million labelled examples for your problem.

Reality:

● You can transfer learned representations from a related task.
   a. Transfer “pretrained weights” between models
● You can train on a nearby surrogate objective, for which it is easy to generate labels.

57
Transfer learning: idea

300-1,000 training samples


58
59
Transfer learning: 3 benefits

60
Model Alignment for Transfer Learning
● Source model is the single most important variable.
● Keep source model and target model well-aligned (close to each other) when possible.
● Source vocabulary should be aligned with target vocabulary (similar domain).
● Source task should be aligned with target task (similar task).

For example:

● Good: product review sentiment → product review categorization


● Good: hotel rating → restaurant rating
● Less good: product review sentiment → biology paper classification

61
What is the most common transfer learning model in NLP?

62
Pre-trained Language Model
● Learning to model the distribution of
natural language.
● Predicting the next word in a sequence
given context.
● A Base model for specific tasks.
● No need for labeled data (unsupervised
data)

63
The Pretraining / Fine-tuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.

64
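As an illustration of the pretraining/fine-tuning paradigm, here is a minimal Hugging Face Transformers sketch that initializes a sequence classifier from pretrained BERT weights; only the small classification head starts from random parameters. The checkpoint name, label count, and example sentence are placeholders for your target task.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained weights serve as parameter initialization for fine-tuning.
model_name = "bert-base-uncased"     # any suitable, well-aligned source model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("The hotel was spotless and quiet.", return_tensors="pt")
outputs = model(**inputs)            # fine-tune these logits on the target task
print(outputs.logits.shape)          # torch.Size([1, 2])
```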
Pre-trained Models
Three architectures for large language models

66
1. Encoder Only (Autoencoder)
Model Type: Masked Language Models (MLMs) -
Trained by predicting words from surrounding
words on both sides.

Model Name: BERT family

Tasks:

1. Sequence classification
2. Token classification

67
2. Encoder - Decoder
Model Type: Original Transformer for seq2seq tasks.

Model Name: BART, T5, FLAN T5, Whisper

Tasks:

1. Machine translation
2. Speech Recognition

68
3. Decoder Only (Auto-Regressive)
Model Type: Causal LLMs/Autoregressive
LLMs/Left-to-right LLMs - predict words left to
right.

Model Name: GPT, LLAMA, Claude, Mistral

Tasks:

1. Text Generation
2. Predicting next word

69
BERT

70
BERT - Bidirectional Encoder Representations from Transformers

71
BERT Ideas
1. Masked Language Model
○ fill-in-the-blank
2. Bidirectional encoder
○ See the future tokens - more
information to infer masked tokens
○ Can’t do standard left-to-right language modeling!

72
BERT

(Figure: comparing 2-direction context, 1-direction context, and Bi-LMs.)

73
BERT VS GPT

Transformer

74
BERT VS Transformer

75
BERT - 2-Phase Training

76
Phase 1: Unsupervised Masked LM Training
15% of the tokens are randomly chosen to
be part of the masking

Three possibilities:
1. 80%: Token is replaced with special token [MASK]
● Lunch was delicious -> Lunch was [MASK]

2. 10%: Token is replaced with a random token.
● Lunch was delicious -> Lunch was gasp

3. 10%: Token is unchanged.
● Lunch was delicious -> Lunch was delicious

77
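A toy sketch of the 80/10/10 masking rule above, assuming whitespace-tokenized text and a tiny made-up vocabulary; real BERT applies this to WordPiece token ids. The function name and vocabulary are illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: of the ~15% selected tokens,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # loss is computed only at these positions
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # e.g. "delicious" -> "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # e.g. "delicious" -> "gasp"
            # else: keep the original token unchanged
    return masked, labels

print(mask_tokens(["lunch", "was", "delicious"], vocab=["gasp", "table", "run"]))
```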
Phase 1: Next Sentence Prediction

[CLS], [SEP]: special tokens


Class 0: is not the next sentence.
Class 1: is the next sentence.

78
Input Representation

● [CLS], [SEP]: special tokens


● Segment: is this a token from sentence A or B?
● Position embeddings: provide position in sequence (learned in this case, not fixed)
79
BERT Input Embedding

Hidden size / model width of 768


80
BERT Input Embedding

81
Training Details
● BooksCorpus (800M words) + Wikipedia (2.5B)
● Masking the input text. 15% of all tokens are chosen.
○ 80% of the time: replaced by designated ‘[MASK]’ token
○ 10% of the time: replaced by random token
○ 10% of the time: unchanged
● Loss is cross-entropy of the prediction at the masked positions.
● Max seq length: 128 tokens for first 90%, 512 tokens for final 10%
● 1M training steps, batch size 256 = 4 days on 4 or 16 TPUs

82
Fine-tuning BERT Use case
● Sentence/Sentence pair classification
○ E.g. spam detection, sentiment analysis, Natural Language Inference

83
Fine-tuning BERT Use case
● Sequence Labeling
○ Tokenization, POS, NER

84
Contextual Embeddings to represent words

85
BERT as a Contextual Representation
Word sense disambiguation - The task of selecting the correct sense for a word

86
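To see contextual representations in action, a short Hugging Face sketch that extracts BERT's hidden state for the word "bank" in two different contexts; the vectors differ because the surrounding words differ. The sentences and checkpoint are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word gets a different vector depending on its context.
sentences = ["I sat by the river bank.", "I deposited cash at the bank."]
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt")
        hidden = model(**enc).last_hidden_state            # (1, seq_len, 768)
        idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
        print(s, hidden[0, idx, :3])                        # first few dims of "bank"
```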
Model width

87
BERT is a stack of encoders

88
Pretrained BERT (Hugging face)

89
Vision transformer
● Can’t feed pixel values directly to the transformer because of O(n²) attention
○ Use patches of the image instead

90
91
Sidenote on Input Token
Why does BERT split input tokens this way?

92
Subwords are the way!
Word-level models assume a fixed vocab of tens of thousands of words, built from the training set; all novel words seen at test time are mapped to a single UNK.
● Subwords combat misspellings and unknown-word issues

93
Level of Token

94
Tokens represent Words

95
96
Check whether low-frequency tokens still make sense
97
98
Tokenizer (subwords) for Transformers

99
WordPiece Tokenization
Similar to BPE: uses frequency of occurrence to identify potential merges, but makes the final decision based on the likelihood of the merged token.

100
SentencePiece Tokenization

Problem with WordPiece tokenization: you don’t know how to put the tokens back together into the original text (detokenization is ambiguous).

101
SentencePiece Tokenization
Simply treating the input text as a sequence of Unicode characters, including whitespace.

102
Byte-Pair Encoding (BPE) Tokenizer

Byte-pair encoding is a simple merging strategy for defining a subword vocabulary.

1. Start with a vocabulary containing only characters and an “end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

103
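A minimal Python sketch of the BPE loop in steps 1-3 above; the toy corpus and function name are illustrative, and production tokenizers add byte-level handling and pre-tokenization on top of this.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Minimal BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Words as tuples of characters plus an end-of-word symbol.
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq               # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most common adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():            # replace the pair with the new subword
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

print(learn_bpe("low lower lowest low low", num_merges=5))
```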
Pre-trained Encoder Decoder

104
Pretraining encoder-decoders
For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted.

● Model - FLAN T5, T5, BART

The encoder portion can benefit from bidirectional context; the decoder portion is used to train the whole model through language modeling, autoregressively predicting and then conditioning on one token at a time.

105
Pretraining encoder-decoders
1. Higher Computational Cost: Both an encoder and a
decoder are required, leading to increased memory and
computation requirements compared to simpler models like
decoder-only architectures.
2. Slower Inference: The encoder processes the entire input
sequence before the decoder starts generating the output,
resulting in a two-step process that slows down inference
compared to models that perform generation directly (e.g.,
decoder-only models).
3. Limited Suitability for Certain Tasks: These models are
better suited for sequence-to-sequence tasks (e.g.,
translation, summarization) but are less efficient for
general-purpose tasks like text generation, where
decoder-only models excel.

106
Decoder-only Pretrained Model

107
Decoder Only Pretrained Model as LLM
● Generating text conditioned on previous text

108
GPT - Generative Pre-Training (OpenAI)

109
GPT
● Uses Transformer decoder instead of encoder
● “Self”-attention: masked so that tokens can only attend to previous tokens.
● Predict the next token in a sequence
○ Causal language modeling

● Pure LM training objective


○ Can be used for text generation
● GPT: same params as BERT-BASE; GPT2
much bigger; GPT3 much bigger (175B
params)
110
Causal Language Model

https://jalammar.github.io/how-gpt3-works-visualizations-animations/111
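A short Hugging Face sketch of causal-LM generation with GPT-2: the model predicts one next token at a time, each time conditioning on everything generated so far. The prompt and generation length are arbitrary.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Greedy decoding: repeatedly pick the most likely next token.
inputs = tokenizer("The Transformer architecture", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```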
GPT Training

112
OpenAI GPT 1 (Generative Pre-Training)
Multitask learning

113
GPT - Formatting Inputs for Fine-tuning Tasks

114
Data format for SFT
● Convert existing annotated NLP datasets to an instruction-following format to continue training an LLM.
○ Supervised fine-tuning (SFT), Instruction fine-tuning

115
Multi-column dataset
● Conventional Classification Dataset
● Merge multiple columns into one large prompt so that fine-tuning can actually work (see the sketch below).

116
https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama
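A hedged sketch of merging dataset columns into a single Alpaca-style prompt for supervised fine-tuning; the template, instruction text, and column names ("text", "label") are hypothetical and should be adapted to the actual dataset.

```python
# Hypothetical column names and template; adjust to your dataset's schema.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def to_prompt(row):
    """Merge the columns of one annotated example into a single SFT training string."""
    return PROMPT_TEMPLATE.format(
        instruction="Classify the sentiment of the review as positive or negative.",
        input=row["text"],
        output=row["label"],
    )

example = {"text": "The battery dies within an hour.", "label": "negative"}
print(to_prompt(example))
```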
Multi-column dataset
● Now the LLM (a causal model) can perform classification!

117
Evolution of GPT

https://www.kdnuggets.com/2023/05/deep-dive-gpt-models.html118
Scaling Laws
LLM performance depends on

● Model size: the number of parameters, not counting embeddings


● Dataset size: the amount of training data
● Compute: Amount of compute (in FLOPS)

Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations. The performance of a large language model (the loss) scales as a power-law with each of these three.

119
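One common way to write this power-law relationship (following Kaplan et al., 2020; the fitted constants and exponents are omitted here) is, for model size N, dataset size D, and compute C:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Here L is the test loss, and each relation holds when the other two factors are not the bottleneck.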
Scaling Laws
● Empirical observation: scaling up models leads to reliable gains in perplexity

120
GPT Scale

(Figure: GPT model dimensions: depth L, number of attention heads, and width d.)

121
Chat-GPT

(Figure: ChatGPT = GPT + reinforcement learning; GPT = Transformer + language modeling.)

https://openai.com/blog/chatgpt/ 122
How is ChatGPT different from the GPT model?
● ChatGPT is optimized for dialogue and conversation.

● Trained on a set of prompts and instructions.

Ouyang, Long et al. “Training language models to follow instructions with human feedback.” ArXiv abs/2203.02155 (2022) 123
Transformer-based Models Summary
● Transformers: A revolutionary deep learning architecture using self-attention. It excels at capturing complex
relationships in sequential data like text and allows for parallel processing, making it efficient for large datasets.
● Transfer Learning: The standard training approach involves pre-training on vast amounts of text to learn
general language patterns, followed by fine-tuning on specific tasks. This leverages learned knowledge
effectively.
● Pre-trained Models: These models (e.g., BERT, GPT) serve as powerful, ready-to-use foundations built upon
the Transformer architecture and transfer learning.
○ BERT: Primarily an encoder-based model, strong for understanding tasks (NLU) like classification and
question answering due to its bidirectional context.
○ GPT: Primarily a decoder-based model, strong for generation tasks (NLG) like text completion and
creative writing due to its autoregressive nature.
● Impact: Transformer-based models represent a significant leap in AI, dramatically improving performance on a
wide range of Natural Language Processing tasks and enabling many modern language technologies.
