
CSE 495 (Natural Language Processing)

Lecture 13
Decoder-only Transformers
Language Model

• Auto-regressive language generation assumes that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions.
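In symbols, a standard way to write this factorization (with W_0 denoting the initial context, a notation not stated on the slide) is:

P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \qquad \text{with } w_{1:0} = \emptyset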
LSTMs can be used for other sequence tasks

(figure: example LSTM sequence tasks, including image captioning, sequence classification, translation, and named entity recognition)

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model

Test time:
• pick a seed character sequence
• generate the next character
• then the next
• then the next …

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
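A minimal sketch of this test-time loop in PyTorch (the model and vocabulary objects char_lm, stoi, and itos are hypothetical placeholders, not from the lecture):

import torch

# Minimal sketch of test-time character generation.
# `char_lm`, `stoi` (char -> id) and `itos` (id -> char) are placeholders.
def generate_chars(char_lm, stoi, itos, seed="The ", n_new=100):
    ids = [stoi[c] for c in seed]              # encode the seed characters
    for _ in range(n_new):
        x = torch.tensor([ids])                # shape (1, current_length)
        logits = char_lm(x)                    # (1, length, vocab_size)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)                    # feed the new character back in
    return "".join(itos[i] for i in ids)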
Encoder-Decoder Transformer
GPT: Text generation
GPT: Decoder-only transformer
GPT Models
Text generation using GPTx

• How can we control content generation with an unconditioned language model?
• We use different decoding and sampling techniques.
Text generation: Greedy decoding
• Greedy decoding: Always pick the next token with the highest probability
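A minimal sketch of greedy decoding (assuming a placeholder model that returns next-token logits; names are illustrative):

import torch

# Greedy decoding sketch: always take the argmax of the next-token logits.
# `model` is a placeholder returning logits of shape (1, length, vocab_size).
def greedy_decode(model, input_ids, max_new_tokens=20):
    for _ in range(max_new_tokens):
        logits = model(input_ids)
        next_id = torch.argmax(logits[0, -1]).view(1, 1)    # highest-probability token
        input_ids = torch.cat([input_ids, next_id], dim=1)  # append and continue
    return input_ids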
Problems of Greedy Decoding

• The words generated immediately after the context are reasonable with greedy decoding.
• But the model quickly starts repeating itself!
• This is a very common problem in language generation in general, and even more so with greedy and beam search.
• Another major drawback of greedy decoding is that it misses high-probability words hidden behind a low-probability word.
• For example, the word "has", with its high conditional probability of 0.9, is hidden behind the word "dog", which has only the second-highest conditional probability, so greedy search misses the word sequence "The", "dog", "has".
Text generation: Beam search

• Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with num_beams=2:
Beam Decoding
• At time step 1, besides the most likely hypothesis ("The","nice"), beam search also keeps track of the second most likely one ("The","dog").
• At time step 2, beam search finds that the word sequence ("The","dog","has") has a probability of 0.36
• A higher probability than ("The","nice","woman"), which has 0.2
• Beam search will almost always find an output sequence with a higher probability than greedy search, but it is not guaranteed to find the most likely output.
• While the generated text is usually fluent, the output will eventually include repetitions of identical word sequences.
• A simple remedy is introducing n-gram penalties, as in the sketch below.
• The most common n-gram penalty ensures that no n-gram appears twice, by setting to 0 the probability of any next word that would create an already-seen n-gram.
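A sketch of beam search with an n-gram penalty using the Hugging Face transformers library (the library call and the example prompt are not part of the slides; a GPT-2 checkpoint is assumed):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Example prompt (illustrative only)
input_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="pt")

beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,              # keep the 5 most likely hypotheses at each step
    no_repeat_ngram_size=2,   # n-gram penalty: no 2-gram may appear twice
    early_stopping=True,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))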
Text generation: Sampling

• In its most basic form, sampling means randomly picking the next word w_t according to its conditional probability distribution.
Sampling Decoding

• It is obvious that language generation using sampling is no longer deterministic.
• The word ("car") is sampled from the conditional probability distribution P(w∣"The"), followed by sampling ("drives") from P(w∣"The","car").
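A minimal sketch of plain sampling, under the same placeholder model assumption as in the greedy example above:

import torch

# Plain sampling sketch: draw the next token from the full conditional
# distribution instead of taking the argmax. `model` is a placeholder.
def sample_next(model, input_ids):
    logits = model(input_ids)[0, -1]                 # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # random draw, not argmax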
Text generation: Top-K sampling

• In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words (see the sketch below). GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
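A minimal sketch of the Top-K filtering step (assuming logits is a 1-D tensor of next-token logits):

import torch

# Top-K filtering sketch: keep only the K most likely tokens,
# renormalise the probability mass over them, then sample.
def top_k_sample(logits, k=50):
    topk_vals, topk_idx = torch.topk(logits, k)       # K largest logits and their ids
    probs = torch.softmax(topk_vals, dim=-1)          # mass redistributed over the K tokens
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()                    # vocabulary id of the sampled token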
Top-k Sampling

• The generated text is very human-sounding.
• One concern with Top-K sampling is that it does not dynamically adapt the number of words that are kept from the next-word probability distribution.
• This can be problematic, as some words might be sampled from a very sharp distribution, whereas others come from a much flatter one.
Top-p (nucleus) sampling

• Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
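A minimal sketch of the nucleus (Top-p) filtering step, again assuming logits is a 1-D tensor of next-token logits:

import torch

# Top-p (nucleus) filtering sketch: keep the smallest set of tokens whose
# cumulative probability exceeds p, renormalise over that set, then sample.
def top_p_sample(logits, p=0.92):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1    # size of the nucleus
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_idx[choice].item()                   # vocabulary id of the sampled token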
Top-p nucleus

• The probability mass is redistributed after every step.


• This way, the size of the set of words (the number of words in the set)
can dynamically increase and decrease according to the next word's
probability distribution.
• While Top-p seems more elegant than Top-K in theory, both methods work well in practice.
• Top-p can also be combined with Top-K, which avoids very low-ranked words while still allowing for some dynamic selection (see the combined sketch below).
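A sketch of combining Top-K and Top-p with the Hugging Face generate API (the library usage and parameter values are illustrative, not from the slides):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="pt")
sample_output = model.generate(
    input_ids,
    do_sample=True,   # sample instead of greedy/beam decoding
    max_length=50,
    top_k=50,         # first keep only the 50 most likely tokens
    top_p=0.95,       # then keep the nucleus with 95% cumulative mass
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))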
Text decoding conclusion

• Top-p and Top-K sampling produce more fluent text than traditional greedy and beam search in open-ended language generation.
• The main issue with greedy and beam search is that they generate repetitive word sequences.
• However, recent research found that this issue is caused by the model (in particular, how the model is trained) rather than by the decoding method.
• Also, there are scenarios where Top-K and Top-p sampling suffer from generating repetitive word sequences.
• According to human evaluations, beam search can generate more fluent text than Top-p sampling when the model's training objective is adapted.
How to adapt GPT to downstream tasks

• Text summarization: We can fine-tune GPTx on the text and then start text generation.
• For example, we can generate 100 tokens with Top-K random sampling with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding.
• We use the first 3 generated sentences of these 100 tokens as the summary.
• Q/A: We can fine-tune the model on question-answer pairs (separated by special tokens), as in the sketch below.
• Then input the question followed by the separator.
• The model will generate the answer.
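A minimal sketch of the Q/A input format described above (the separator token <|sep|> and the helper names are hypothetical; the slides do not specify the actual special tokens):

# Hypothetical Q/A formatting for fine-tuning and inference.
# The separator token below is a placeholder, not specified in the lecture.
SEP = " <|sep|> "

def format_training_example(question, answer):
    # fine-tuning example: question, separator, answer
    return question + SEP + answer

def format_inference_prompt(question):
    # at inference time, feed the question followed by the separator;
    # the model continues the sequence to produce the answer
    return question + SEP

print(format_training_example("Who wrote Hamlet?", "William Shakespeare"))
print(format_inference_prompt("Who wrote Hamlet?"))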
