Lecture 13
Decoder-only Transformers
Language Model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model
Test time:
• pick a seed character sequence
• generate the next character
• then the next
• then the next …
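This test-time loop can be sketched in a few lines of PyTorch. The model char_lm and the character/index lookup tables stoi and itos below are hypothetical placeholders standing in for a trained character-level language model; they are not part of the lecture material.
```python
# Minimal sketch of test-time character generation (assumes a trained model).
# `char_lm`, `stoi` (char -> index), and `itos` (index -> char) are placeholders.
import torch

def generate_chars(char_lm, seed, stoi, itos, n_new=200):
    char_lm.eval()
    idx = torch.tensor([[stoi[c] for c in seed]])    # (1, T) seed character sequence
    with torch.no_grad():
        for _ in range(n_new):
            logits = char_lm(idx)[:, -1, :]          # scores for the next character
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, 1)    # sample the next character
            idx = torch.cat([idx, next_id], dim=1)   # append it and repeat
    return "".join(itos[i] for i in idx[0].tolist())
```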
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Encoder-Decoder Transformer
GPT: Text generation
GPT: Decoder-only Transformer
GPT Models
Text generation using GPTx
• The words generated after the context are reasonable with greedy decoding.
• But the model quickly starts repeating itself!
• This is a very common problem in language generation in general, and even more so with greedy and beam search.
• Another major drawback of greedy decoding is that it misses high-probability words hidden behind a low-probability word.
• The word "has", with its high conditional probability of 0.9, is hidden behind the word "dog", which has only the second-highest conditional probability, so greedy search misses the word sequence "The", "dog", "has" (a code sketch of greedy decoding follows below).
Text generation: Beam search
• Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis with the overall highest probability. Let's illustrate with num_beams=2:
Beam Decoding
• At time step 1, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one ("The", "dog").
• At time step 2, beam search finds that the word sequence ("The", "dog", "has") has a probability of 0.4 × 0.9 = 0.36.
• This is higher than the probability of ("The", "nice", "woman"), which is only 0.2.
• Beam search will almost always find an output sequence with a higher probability than greedy search, but it is not guaranteed to find the most likely output.
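Continuing the greedy-decoding sketch above, switching to beam search only changes the generate() call; num_beams=5 is an arbitrary illustrative setting.
```python
# Beam search: keep the num_beams most likely hypotheses at each step.
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True,   # stop once all beams have finished
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```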
• While the generated text is usually fluent, the output will eventually include repetitions of identical word sequences.
• A simple remedy is to introduce n-gram penalties.
• The most common n-gram penalty ensures that no n-gram appears twice by manually setting to 0 the probability of any next word that would complete an already-seen n-gram (see the sketch below).
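A sketch of the n-gram penalty, again continuing the example above: no_repeat_ngram_size=2 zeroes out any token that would repeat an already-seen 2-gram.
```python
# Beam search with a 2-gram penalty: no 2-gram may appear twice in the output.
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```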
Text generation: Sampling
• In its most basic form, sampling means randomly picking the next word w_t according to its conditional probability distribution: w_t ~ P(w_t | w_1:t-1).
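A minimal sketch of this basic sampling step, assuming logits holds the model's scores for the next word over the vocabulary; the function name and the optional temperature parameter are illustrative, not from the slides.
```python
# Plain (ancestral) sampling: draw the next word from the full softmax distribution.
import torch

def sample_next(logits, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)   # P(w_t | w_1:t-1)
    return torch.multinomial(probs, num_samples=1).item()
```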
Sampling Decoding
• In Top-K sampling, the K most likely next words are kept, and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
Top-k Sampling
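A minimal Top-K sampling step might look like the sketch below; the function name and the default k=50 are illustrative assumptions.
```python
# Top-K sampling: keep the K highest-scoring next words and renormalize over them.
import torch

def top_k_sample(logits, k=50):
    topk_vals, topk_idx = torch.topk(logits, k)    # K most likely next words
    probs = torch.softmax(topk_vals, dim=-1)       # redistribute the mass over them
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()
```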
• Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
Top-p (Nucleus) Sampling
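A corresponding Top-p (nucleus) sampling step, sketched under the same assumptions: the cutoff keeps the smallest prefix of the sorted distribution whose cumulative probability exceeds p (p=0.92 is an illustrative default).
```python
# Top-p (nucleus) sampling: sample from the smallest set of words whose
# cumulative probability exceeds p, after renormalizing over that set.
import torch

def top_p_sample(logits, p=0.92):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1   # include the word that crosses p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_idx[choice].item()
```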
• Top-p and Top-K sampling produce more fluent text than traditional greedy and beam search on open-ended language generation.
• The main issue with greedy and beam search is that they generate repetitive word sequences.
• However, recent research found that this issue is caused by the model (especially how the model is trained) rather than by the decoding method.
• Also, there are scenarios where Top-K and Top-p sampling suffer from generating repetitive word sequences.
• According to human evaluations, beam search can generate more fluent text than Top-p sampling when the model's training objective is adapted.
How to adapt GPT to downstream tasks
• Text summarization: we can fine-tune GPTx on the text and then start text generation.
• For example, we can generate 100 tokens with Top-K random sampling with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding.
• We use the first 3 generated sentences of these 100 tokens as the summary.
• Q/A: we can fine-tune the model using question-answer pairs (separated by special tokens).
• Then input the question followed by the separator.
• The model will generate the answer (see the sketch after this list).
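A hedged sketch of the Q/A setup: the "<sep>" separator string and the assumption that the model was already fine-tuned on "question <sep> answer" sequences are illustrative, and generation reuses Top-K sampling with a small k as suggested for summarization above.
```python
# Q/A prompting sketch: assumes GPT-2 was fine-tuned on "question <sep> answer"
# sequences; "<sep>" is an illustrative separator, not a built-in GPT-2 token.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "What is the capital of France? <sep>"           # question + separator
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_k=2,            # small-k Top-K sampling, as in the summarization recipe
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```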