495 Lecture 13 Trans Decoder

The document discusses different techniques for text generation using decoder-only transformer models including greedy decoding, beam search, sampling, top-k sampling, and top-p sampling. It also discusses how GPT models can be adapted for downstream tasks like summarization and question answering.

Uploaded by

Mohibur Nabil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views21 pages

495 Lecture 13 Trans Decoder

Uploaded by

Mohibur Nabil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

CSE 495 (Natural Language Processing)

Lecture 13
Decoder only Transformers
Language Model

• Auto-regressive language generation assumes that the probability

distribution of a word sequence can be decomposed into the product of
conditional next word distributions
LSTMs can be used for other sequence tasks

image captioning sequence named entity

classification translation
recognition

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model

Test time:
• pick a seed
character sequence
• generate the next
character
• then the next
• then the next …

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Encoder-Decoder Transformer
GPT: Text generation
GPT: Decoder only transformer
GPT Models
Text generation using GPTx

• How can we control content generation with an unconditioned language

model?
• We use different sampling techniques
Text generation: Greedy decoding
• Greedy decoding: Always pick the next token with the highest probability
Problems of Greedy Decoding

• The generated words following the context are reasonable for greedy
decoding.
• But the model quickly starts repeating itself!
• This is a very common problem in language generation in general and even more
so in greedy and beam searches.
• Another major drawback of greedy decoding is it misses high-probability
words hidden behind a low-probability word
• The word "has“ with its high conditional probability of 0.9, is hidden
behind the word "dog,” which has only the second-highest conditional
probability, so that greedy search misses the word sequence "The",
"dog", "has"
Text generation: Beam search

• Beam search reduces the risk of missing hidden high probability word
sequences by keeping the most likely num_beams of hypotheses at each
time step and eventually choosing the hypothesis that has the overall
highest probability. Let's illustrate with num_beams=2:
Beam Decoding
• At time step 1, besides the most likely hypothesis ("The","nice"), beam search
also keeps track of the second most likely one ("The","dog").
• At time step 2, beam search finds that the word sequence ("The","dog","has"),
has a probability of 0.36
• Higher probability than ("The","nice","woman"), which has 0.2
• Beam search will almost always find an output sequence with a higher
probability than greedy search but is not guaranteed to find the most likely
output.
• While the text generated usually are fluent, the output will eventually include
repetitions of identical word sequences.
• A simple remedy is introducing n-grams penalties.
• The most common n-gram penalty ensures that no n-gram appears twice by manually
setting the probability of the following words that could create an already seen n-gram to
0.
Text generation: Sampling

• In its most basic form, sampling means randomly picking the next word
wt according to its conditional probability distribution
Sampling Decoding

• It is obvious that language generation using sampling is not deterministic

anymore.
• The word ("car") is sampled from the conditioned probability distribution
P(w∣"The"), followed by sampling ("drives") from P(w∣"The","car").
Text generation: Top-K sampling

• In Top-K sampling, the K most likely next words are filtered, and the
probability mass is redistributed among only those K next words. GPT2
adopted this sampling scheme, which was one of the reasons for its
success in story generation.
Top-k Sampling

• The generated text is very much human-sounding.

• One concern with Top-K sampling is that it does not dynamically adapt
the number of words that are filtered from the next word probability
distribution.
• This can be problematic as some words might be sampled from a very sharp
distribution, whereas others from a much flatter.
Top-p (nucleus) sampling

• Instead of sampling only from the most likely K words, in Top-p sampling
chooses from the smallest possible set of words whose cumulative
probability exceeds the probability p
Top-p nucleus

• The probability mass is redistributed after every step.

• This way, the size of the set of words (the number of words in the set)
can dynamically increase and decrease according to the next word's
probability distribution.
• While in theory, Top-p seems more elegant than Top-K, both methods
work well in practice.
• Top-p can also be combined with Top-K, which can avoid very low-ranked words
while allowing for some dynamic selection.
Text decoding conclusion

• top-p and top-K sampling produce more fluent text than traditional
greedy and beam search on open-ended language generation.
• The main issues with greedy and beam search – they generate repetitive
word sequences
• However, recent research found that this issue is caused by the model
(especially how the model is trained) rather than the decoding method
• Also, there are scenarios where top-K and top-p sampling suffer from
generating repetitive word sequences.
• According to human evaluations, beam search can generate more fluent
text than Top-p sampling when adapting the model's training objective.
How to adapt GPT to downstream tasks

• Text summarization: We can fine tune GPTx with the text and then start
text generation
• For example, we can generate 100 tokens with Top-k random sampling with k = 2
which reduces repetition and encourages more abstractive summaries than
greedy decoding
• We use the first 3 generated sentences in these 100 tokens as the summary
• Q/A: We can fine-tune the model using questions-answers pairs
(separated by special tokens)
• Then input the question followed by the separator
• The model will generate answers

Session 3 Prompt Engineering For Generative AI
No ratings yet
Session 3 Prompt Engineering For Generative AI
25 pages
2025 PADI Instructor Manual Errata (Summary of Changes Between 2024 and 2025 Versions)
No ratings yet
2025 PADI Instructor Manual Errata (Summary of Changes Between 2024 and 2025 Versions)
8 pages
BSBCRT511 Critical Thinking Skills
100% (7)
BSBCRT511 Critical Thinking Skills
18 pages
Lecture 10 - Knowledge and Reasoning - 2025 - LLM
No ratings yet
Lecture 10 - Knowledge and Reasoning - 2025 - LLM
121 pages
B.Sc. Eligibility
No ratings yet
B.Sc. Eligibility
10 pages
Unit-3 (NLP)
No ratings yet
Unit-3 (NLP)
28 pages
Module5 DS PPT
No ratings yet
Module5 DS PPT
38 pages
Introduction - To - Womens - Gender - and - Sexuality Studies
No ratings yet
Introduction - To - Womens - Gender - and - Sexuality Studies
13 pages
Certificate of Recognition: 1 Place
No ratings yet
Certificate of Recognition: 1 Place
4 pages
LLM Cheatsheet
No ratings yet
LLM Cheatsheet
1 page
Fundamentals of Generative AI
No ratings yet
Fundamentals of Generative AI
17 pages
Cheatsheet Recurrent Neural Networks
No ratings yet
Cheatsheet Recurrent Neural Networks
5 pages
Chapter 2. Transformers: A Note For Early Release Readers
No ratings yet
Chapter 2. Transformers: A Note For Early Release Readers
85 pages
Fil Ed 321 Chapter 3 This Is A Handout of Lectures
No ratings yet
Fil Ed 321 Chapter 3 This Is A Handout of Lectures
71 pages
Work Sheet On Editing-2
100% (1)
Work Sheet On Editing-2
3 pages
Solid Waste Engineering A Global Perspective 3rd Edition Worrell
No ratings yet
Solid Waste Engineering A Global Perspective 3rd Edition Worrell
310 pages
Worksheet On Preposition
100% (1)
Worksheet On Preposition
2 pages
NLP Basics
No ratings yet
NLP Basics
119 pages
Language Modeling
No ratings yet
Language Modeling
88 pages
Lecture 7
No ratings yet
Lecture 7
66 pages
University of Bristol Postgraduate Prospectus 2018 - Web
No ratings yet
University of Bristol Postgraduate Prospectus 2018 - Web
128 pages
The Diverse Landscape of Large Language Models Deepsense Ai
No ratings yet
The Diverse Landscape of Large Language Models Deepsense Ai
16 pages
解读塞尚
No ratings yet
解读塞尚
81 pages
Cs224n 2020 Lecture08 NMT
No ratings yet
Cs224n 2020 Lecture08 NMT
77 pages
2005 14165v3 PDF
No ratings yet
2005 14165v3 PDF
74 pages
Session 15-2 Future NLP & Deep Learning
No ratings yet
Session 15-2 Future NLP & Deep Learning
81 pages
495 Lecture 10 Attall
No ratings yet
495 Lecture 10 Attall
18 pages
Person-Organization Fit, Knowledge Sharing Behaviour, and Innovative Work Behaviour: A Self-Determination Perspective
No ratings yet
Person-Organization Fit, Knowledge Sharing Behaviour, and Innovative Work Behaviour: A Self-Determination Perspective
17 pages
Perspectives in Business Ethics
No ratings yet
Perspectives in Business Ethics
113 pages
Unit 5 NLP
No ratings yet
Unit 5 NLP
24 pages
Evaluation of Text Generation: A Survey
No ratings yet
Evaluation of Text Generation: A Survey
75 pages
01-Transformer Based NLP Applications
No ratings yet
01-Transformer Based NLP Applications
55 pages
LLM Book 43-102
No ratings yet
LLM Book 43-102
60 pages
Year 1 - 6 Writing Tasks
No ratings yet
Year 1 - 6 Writing Tasks
18 pages
13 TextGen 2024
No ratings yet
13 TextGen 2024
106 pages
Llms Course Andrew
No ratings yet
Llms Course Andrew
46 pages
Lesson 7 Function of Communication
No ratings yet
Lesson 7 Function of Communication
45 pages
Media Literacy Booklet Emedia Project Final ENG 1
No ratings yet
Media Literacy Booklet Emedia Project Final ENG 1
22 pages
2 Generative Models
No ratings yet
2 Generative Models
60 pages
UBC Summer School in NLP - VSP 2019 Lecture 9
No ratings yet
UBC Summer School in NLP - VSP 2019 Lecture 9
17 pages
Problem 1 Proposal
No ratings yet
Problem 1 Proposal
24 pages
DAB311 DL Week 11 RNN
No ratings yet
DAB311 DL Week 11 RNN
25 pages
495 Lecture 11 BERT
No ratings yet
495 Lecture 11 BERT
31 pages
15.chapter11 NLPApplications
No ratings yet
15.chapter11 NLPApplications
25 pages
08 NLP With Deep Learning
No ratings yet
08 NLP With Deep Learning
31 pages
Neural Text Generation: A Practical Guide: Ziang Xie Zxie@cs - Stanford.edu
No ratings yet
Neural Text Generation: A Practical Guide: Ziang Xie Zxie@cs - Stanford.edu
21 pages
Lect 07 - MT and Seq2seq
No ratings yet
Lect 07 - MT and Seq2seq
86 pages
AN2DL 05 2324 Seq2SeqAndWordEmbedding
No ratings yet
AN2DL 05 2324 Seq2SeqAndWordEmbedding
42 pages
Three 150224 Generative A I Intro
No ratings yet
Three 150224 Generative A I Intro
19 pages
NLP 2-5 Unit Notes
No ratings yet
NLP 2-5 Unit Notes
83 pages
The Concept of Competitive Advantages. Logic, Sources and Durability
No ratings yet
The Concept of Competitive Advantages. Logic, Sources and Durability
14 pages
Word Embedding
No ratings yet
Word Embedding
9 pages
Automatic Detection of Generated Text Is Easiest When Humans Are Fooled
No ratings yet
Automatic Detection of Generated Text Is Easiest When Humans Are Fooled
15 pages
Bia Report
No ratings yet
Bia Report
26 pages
NLP Midsem Last Lesson
No ratings yet
NLP Midsem Last Lesson
53 pages
Lesson 1 Intro
No ratings yet
Lesson 1 Intro
51 pages
GPT in 60 Lines of NumPy - Jay Mody
No ratings yet
GPT in 60 Lines of NumPy - Jay Mody
41 pages
Large Language Model Algorithms in Plain English
No ratings yet
Large Language Model Algorithms in Plain English
8 pages
Dr. B. R. AMBEDKAR AND MAKING OF THE CONSTITUTION - A Case Study of Indian Federalism
No ratings yet
Dr. B. R. AMBEDKAR AND MAKING OF THE CONSTITUTION - A Case Study of Indian Federalism
13 pages
Module 5
No ratings yet
Module 5
76 pages
4786 Planning With Large Language M
No ratings yet
4786 Planning With Large Language M
28 pages
The Hidden History of New Women in Serbian Culture Toward A New History of Literature Svetlana Tomic Download
No ratings yet
The Hidden History of New Women in Serbian Culture Toward A New History of Literature Svetlana Tomic Download
86 pages
Studying Developmental Processes in Accelerated Cohort-Sequential Designs With Discrete - and Continuous-Time Latent Change Score Models
No ratings yet
Studying Developmental Processes in Accelerated Cohort-Sequential Designs With Discrete - and Continuous-Time Latent Change Score Models
27 pages
cl13 gpt-2
No ratings yet
cl13 gpt-2
26 pages
cl13 GPT
No ratings yet
cl13 GPT
26 pages
ChatGPT KZ Feb2023 PDF
No ratings yet
ChatGPT KZ Feb2023 PDF
7 pages
11 Seq To Seq Model
No ratings yet
11 Seq To Seq Model
30 pages
Gsu List English 15032018
No ratings yet
Gsu List English 15032018
13 pages
Azmin MD Zamin 2020 Learning Vocabulary Through Songs
No ratings yet
Azmin MD Zamin 2020 Learning Vocabulary Through Songs
9 pages
Unit - 2
No ratings yet
Unit - 2
10 pages
MAT-116, Chapter 4.2
No ratings yet
MAT-116, Chapter 4.2
7 pages
Speech To Text Beam Search
No ratings yet
Speech To Text Beam Search
15 pages
MAT-116, Chapter 4.1 - Annex
No ratings yet
MAT-116, Chapter 4.1 - Annex
10 pages
ChatGPT Teardown
No ratings yet
ChatGPT Teardown
9 pages
Effective Chatbots Using Machine Learning and Natural Language Processing
No ratings yet
Effective Chatbots Using Machine Learning and Natural Language Processing
10 pages
Ntu Min Subj Req
No ratings yet
Ntu Min Subj Req
11 pages
Genai
No ratings yet
Genai
22 pages
Keyboard Prep Piano Classes For 8-10yo
No ratings yet
Keyboard Prep Piano Classes For 8-10yo
3 pages
Using An Inquiry Approach To Teach Science To Seco
No ratings yet
Using An Inquiry Approach To Teach Science To Seco
7 pages
RNN and LSTM Based Chatbot Using NLP: Department of Computer Science and Engineering, MSIT, New Delhi, India
No ratings yet
RNN and LSTM Based Chatbot Using NLP: Department of Computer Science and Engineering, MSIT, New Delhi, India
4 pages
Lec 11
No ratings yet
Lec 11
30 pages
Decoding Algorithms NLP
No ratings yet
Decoding Algorithms NLP
29 pages
SPMC Certified Fire Protection Specialist Cfps Exam Preparation Course Nfpa Usa
No ratings yet
SPMC Certified Fire Protection Specialist Cfps Exam Preparation Course Nfpa Usa
3 pages
Aieee CCB Spot Round
No ratings yet
Aieee CCB Spot Round
4 pages
ChatGPT and ART OF NATURAL LANGUAGE GENERATION
No ratings yet
ChatGPT and ART OF NATURAL LANGUAGE GENERATION
4 pages
Week 9 Prompt 1 - Before Class Lecture
No ratings yet
Week 9 Prompt 1 - Before Class Lecture
14 pages
Group Activity in Modaltext
No ratings yet
Group Activity in Modaltext
4 pages
University of Kashmir
No ratings yet
University of Kashmir
4 pages
Text Generation
No ratings yet
Text Generation
4 pages
The Comic Book Store
No ratings yet
The Comic Book Store
2 pages
Math & Science Department Terms Plan 2015/2016 TEACHER: Ms. Rahaf Jabbour GRADE: 1 SUBJECT: Science Periods Allotted: 4
No ratings yet
Math & Science Department Terms Plan 2015/2016 TEACHER: Ms. Rahaf Jabbour GRADE: 1 SUBJECT: Science Periods Allotted: 4
6 pages
M.Sc. Semester Examination Timetable (April 2025) - 1
No ratings yet
M.Sc. Semester Examination Timetable (April 2025) - 1
1 page
Read-Aloud Strategies Newsletter
No ratings yet
Read-Aloud Strategies Newsletter
5 pages
Beyond Effective Go: Part 1 - Achieving High-Performance Code
From Everand
Beyond Effective Go: Part 1 - Achieving High-Performance Code
Corey S Scott
5/5 (1)
Ruby Programming For Beginners: The Simple Guide to Learning Ruby Programming Language Fast!
From Everand
Ruby Programming For Beginners: The Simple Guide to Learning Ruby Programming Language Fast!
Tim Warren
2/5 (2)
Haskell Design Patterns
From Everand
Haskell Design Patterns
Lemmer Ryan
No ratings yet
Learn Python in One Hour: Programming by Example
From Everand
Learn Python in One Hour: Programming by Example
Victor R. Volkman
3/5 (2)