Decoding Algorithms in NLP
We will also be looking at the code of some basic decoding strategies – please feel free to interrupt and ask questions!
Notes:
§ f represents model logits
§ Transformation to probabilities via softmax
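To make the note concrete, here is a minimal sketch of the logits-to-probabilities step, assuming plain NumPy (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn raw model logits f into a probability distribution over the vocabulary."""
    scaled = logits / temperature          # temperature < 1 sharpens, > 1 flattens the distribution
    scaled = scaled - scaled.max()         # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

# Toy example with three candidate next tokens
logits = np.array([2.0, 1.6, 0.9])
print(softmax(logits))                     # higher logit -> higher probability
print(softmax(logits, temperature=0.5))    # lower temperature concentrates mass on the top token
```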
• Impact on Output:
• Coherence and correctness
• Creativity, diversity
• Speed of generation
• Still an underexplored area – even recently, simple innovations can lead to major gains (e.g. PLD decoding)!
• "chair" – 0.3
• "floor" – 0.2
• "roof" – 0.1
• Cons:
• Often leads to repetitive or suboptimal outputs
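As a hedged sketch over the example distribution above (renormalized here, since only three tokens are listed), this is how greedy argmax selection differs from stochastic sampling; the names are illustrative:

```python
import numpy as np

tokens = ["chair", "floor", "roof"]
probs = np.array([0.3, 0.2, 0.1])
probs = probs / probs.sum()                    # renormalize the truncated example distribution

# Greedy decoding: always pick the argmax -> deterministic, can become repetitive
greedy_token = tokens[int(np.argmax(probs))]

# Stochastic sampling: draw from the distribution -> more diverse outputs
rng = np.random.default_rng(0)
sampled_token = tokens[rng.choice(len(tokens), p=probs)]

print(greedy_token, sampled_token)
```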
• Top-k – Concept: Restrict selection to the top-k most probable tokens.
• Top-p – Concept: Choose tokens from the smallest set whose cumulative probability exceeds p.
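A rough sketch of both truncation rules, assuming NumPy and a probability vector that already sums to 1 (function names are mine, not from any library):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything but the k most probable tokens, then renormalize."""
    cutoff = np.sort(probs)[-k]                            # k-th largest probability
    filtered = np.where(probs >= cutoff, probs, 0.0)
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability exceeds p, then renormalize."""
    order = np.argsort(probs)[::-1]                        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    keep = np.concatenate(([True], cumulative[:-1] < p))   # keep tokens until the mass crosses p
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
print(top_k_filter(probs, k=2))    # only the two most probable tokens survive
print(top_p_filter(probs, p=0.9))  # smallest set covering at least 90% of the mass
```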
Sampling – min-p
Minh et al., Oct 2024 – “Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs”
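Based on the paper above, a minimal sketch of the min-p rule – the cutoff is a fraction of the top token's probability, so it adapts to how confident the model is (names and default values are illustrative):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, p_base: float = 0.1) -> np.ndarray:
    """Keep tokens whose probability is at least p_base times the top token's probability."""
    threshold = p_base * probs.max()            # dynamic cutoff: scales with model confidence
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# Confident model: only the dominant token survives the cutoff.
print(min_p_filter(np.array([0.90, 0.05, 0.03, 0.02])))
# Uncertain model: the cutoff is low, so many plausible tokens stay in play.
print(min_p_filter(np.array([0.30, 0.28, 0.22, 0.20])))
```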
§ Next sections
§ Beam search: Keeping multiple sequences under consideration simultaneously
• Terminology:
• Beam Width (k): Number of candidate sequences retained.
• Motivation:
• Avoids early commitment to a single sequence that may lead to suboptimal outputs.
1. Initialization:
   1. Start with the initial sequence (e.g. the start-of-sequence token) as the only candidate in the beam.
2. Expansion:
   1. For each sequence in the beam, generate possible next tokens.
3. Scoring:
   1. Compute scores (cumulative log probabilities) for each candidate.
4. Pruning:
   1. Retain only the k highest-scoring candidates (the beam width).
5. Iteration:
   1. Repeat until an end-of-sequence token is generated or a maximum length is reached.
image credits: https://d2l.ai/
• Why Log?
• Enhances numerical stability and turns the product of per-token probabilities into a manageable sum.
• Interpretation:
• The sequence with the highest cumulative score is considered the best candidate.
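To make the cumulative log-probability scoring concrete, here is a hedged, self-contained beam search sketch; `next_token_probs` is a toy stand-in for a real language model, and all names are illustrative:

```python
import numpy as np

def next_token_probs(sequence: list[int], vocab_size: int = 5) -> np.ndarray:
    """Toy stand-in for a language model: returns a next-token distribution for any prefix."""
    rng = np.random.default_rng(hash(tuple(sequence)) % (2**32))
    logits = rng.normal(size=vocab_size)
    return np.exp(logits) / np.exp(logits).sum()

def beam_search(start: list[int], beam_width: int, max_len: int, eos: int = 0) -> list[int]:
    beams = [(start, 0.0)]                                   # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                               # finished sequences are carried forward as-is
                candidates.append((seq, score))
                continue
            probs = next_token_probs(seq)
            for tok, p in enumerate(probs):                  # expansion: consider every possible next token
                candidates.append((seq + [tok], score + np.log(p)))  # scoring: sum of log-probs
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]                      # pruning: keep only the top-k sequences
        if all(seq[-1] == eos for seq, _ in beams):          # iteration stops when every beam has ended
            break
    return beams[0][0]                                       # highest cumulative score = best candidate

print(beam_search(start=[1], beam_width=3, max_len=6))
```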
• Advantages:
• Better Global Quality*: Reduces the risk of getting stuck in locally optimal (but globally suboptimal) decisions.
• Disadvantages:
• Computational Cost: More sequences to evaluate compared to greedy decoding.
• Reduced Diversity: Can still converge to similar outputs if beam width is narrow.
• Objective doesn’t align with training: While beam search maximizes the language modeling objective, the underlying models were only trained to predict the next token, not the full sequence!
§ Speculative decoding
• Motivation: Autoregressive generation can be slow due to sequential dependency – even simple-to-predict tokens take the same amount of time!
• Key Idea:
• Use a fast, approximate model to “speculate” future tokens
§ Key Insight: We have access to next-token probabilities for all tokens in the sequence – not just the last token!
§ If we could make educated guesses for the next k tokens, how would we leverage them?
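One possible answer, sketched below for the greedy case: run the target model once over prompt + draft, read off its preferred token at every drafted position, and accept the longest agreeing prefix. This is a simplified illustration (real speculative sampling uses a probabilistic accept/reject rule to preserve the target distribution), and the inputs are stand-ins for what an actual forward pass would return:

```python
def accept_draft(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    """Greedy verification: keep drafted tokens until the first disagreement with the target model.

    draft_tokens:  k tokens proposed by a cheap draft model (or by prompt lookup).
    target_argmax: the target model's greedy choice at each of those k positions,
                   obtained from a single forward pass over prompt + draft.
    """
    accepted = []
    for drafted, preferred in zip(draft_tokens, target_argmax):
        if drafted != preferred:
            accepted.append(preferred)        # replace the first wrong token with the model's own choice
            break
        accepted.append(drafted)              # agreement: this token is accepted "for free"
    return accepted

# If the draft gets the first two tokens right, one target-model pass yields three tokens instead of one.
print(accept_draft(draft_tokens=[12, 7, 99], target_argmax=[12, 7, 41]))   # -> [12, 7, 41]
```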
§ Let’s consider some limited use cases – document summarization, document QA, code editing
§ Is there a way to get good draft tokens without using an additional model?
§ Search for these in the prompt (e.g. the document, or previous code in the prompt) – the prompt itself effectively plays the role of the “Draft Model”
https://github.com/apoorvumang/prompt-lookup-decoding
Somasundaram et al, 2024, “PLD+: Accelerating LLM inference by leveraging Language Model Artifacts”
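A rough sketch of the prompt-lookup idea: match the most recent n-gram of the generated text against the prompt and propose the tokens that followed the match as draft tokens. Parameter names and defaults here are mine, not the linked repo's exact implementation:

```python
def prompt_lookup_draft(prompt_ids: list[int], generated_ids: list[int],
                        ngram_size: int = 3, num_draft: int = 5) -> list[int]:
    """Propose draft tokens by matching the last n-gram of the output against the prompt."""
    if len(generated_ids) < ngram_size:
        return []
    pattern = generated_ids[-ngram_size:]                       # most recently generated n-gram
    for start in range(len(prompt_ids) - ngram_size, -1, -1):   # scan the prompt for the same n-gram
        if prompt_ids[start:start + ngram_size] == pattern:
            # the tokens that followed the match in the prompt become the draft
            return prompt_ids[start + ngram_size:start + ngram_size + num_draft]
    return []                                                   # no match: fall back to normal decoding

# Toy example with integer token ids: the tail of the output also occurs in the prompt.
prompt = [5, 6, 7, 8, 9, 10, 11, 12]
output = [3, 4, 6, 7, 8]
print(prompt_lookup_draft(prompt, output))                      # -> [9, 10, 11, 12]
```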
Recap & Key Takeaways
§ Decoding basics, theoretical underpinnings
§ Overview of deterministic and stochastic methods
§ Greedy, sampling, sampling parameters (temperature, top-k, top-p, min-p)
§ Speculative decoding