
Decoding Algorithms in Large Language Models


Apoorv Saxena | Research Scientist

24 Feb 2025, Indian Institute of Science, Bangalore


Agenda
§ Introduction to decoding and theoretical foundations
§ Basic decoding strategies
§ Temperature, top-k, top-p, greedy…

§ Min-p sampling

§ Advanced decoding strategies


§ Beam search

§ Speculative decoding (and variants)

We will also be looking at the code of some basic decoding strategies – and please feel free to interrupt and
ask questions!



What is decoding?
• Definition: Generating text by selecting tokens based on model probabilities
• Context: Autoregressive models (e.g., GPT)
• Modeling assumption: Models have been trained on the next-token prediction task



Decoding in Autoregressive LLMs
§ We have a trained model that performs next-token
prediction.
§ Task: Use it to generate text iteratively.
§ Key Idea: Autoregression – Use previously generated tokens
as input to predict the next token.
§ Process:
§ Compute next token probabilities

§ Select next token

§ Append to sequence and repeat
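
A minimal sketch of this loop, assuming a Hugging Face-style causal LM (`model`, `tokenizer`, and the `select_fn` strategy hook are illustrative names, not from the slides; later slides plug different strategies into `select_fn`):

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, select_fn, max_new_tokens=50):
    """Autoregressive loop: compute next-token logits, select a token
    with the pluggable strategy `select_fn`, append, repeat."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]   # logits for the next position
        next_id = select_fn(logits)            # shape (1, 1): the chosen token
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```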



Theoretical Foundation – Language Modeling Equation
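
In standard form, with $f$ the model's logit function and $x_{<t}$ the tokens generated so far:

$$P(x_t \mid x_{<t}) = \mathrm{softmax}\big(f(x_{<t})\big)_{x_t}$$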

Notes:
§ f represents model logits
§ Transformation to probabilities via softmax



Theoretical Foundation – Sequence Probability
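
In standard form, the sequence probability factorizes autoregressively via the chain rule, consistent with the language modeling equation on the previous slide:

$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$$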

§ Implication: maximizing the overall sequence likelihood reduces to maximizing this product of per-token probabilities

§ Note: This is a theoretical construct – searching over all possible sequences is intractable in practice



Why Does Decoding Matter?
• Connections:
• Training (next-token prediction) vs. Inference (text generation)

• Impact on Output:
• Coherence and correctness

• Creativity, diversity

• Speed of generation

• Still an underexplored area – even recently, simple innovations have led to major gains (e.g., prompt lookup decoding, PLD)!



Quick Example
• Prompt: “The cat sat on the …”
• Token Probabilities:
• "mat" – 0.4

• "chair" – 0.3

• "floor" – 0.2

• "roof" – 0.1

• How do we choose the next token?



Greedy Decoding
• Definition: Always select the token with the highest probability.
• Pros:
• Simple and fast

• Cons:
• Often leads to repetitive or suboptimal outputs
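
As a selection function for the generation loop sketched earlier, greedy decoding is one line (a minimal illustration):

```python
import torch

def greedy(logits: torch.Tensor) -> torch.Tensor:
    """Always pick the single most probable token."""
    return logits.argmax(dim=-1, keepdim=True)

# e.g. generate(model, tokenizer, "The cat sat on the", select_fn=greedy)
# would pick "mat" (p=0.4) in the earlier example.
```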



Sampling: The Basics
• Definition: Randomly select tokens based on the probability distribution.
• Key Parameter: Temperature (T)
• Low T: More focused, conservative outputs.

• High T: Increased diversity and randomness.

(Figure: next-token distribution without temperature T vs. with temperature T applied)
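
A minimal sketch of temperature sampling, pluggable into the earlier loop:

```python
import torch

def temperature_sample(logits: torch.Tensor, T: float = 0.8) -> torch.Tensor:
    """Divide logits by T before the softmax, then sample.
    T < 1 sharpens the distribution (more conservative); T > 1 flattens it."""
    probs = torch.softmax(logits / T, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```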



Sampling: top-k and top-p

Top-k
• Concept: Restrict selection to the top-k most probable tokens.
• Example: k=50
• Effect: Filters out long-tail, low-probability tokens, reducing noise.

Top-p (nucleus sampling*)
• Concept: Choose tokens from the smallest set whose cumulative probability exceeds p.
• Example: p=0.9

*Holtzman et al., 2020, "The Curious Case of Neural Text Degeneration"
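
Minimal sketches of both filters, in the same style as the samplers above (names are illustrative):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Zero out everything outside the k most probable tokens, then sample."""
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx.gather(-1, choice)

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample from the smallest set of tokens whose cumulative probability
    exceeds p (the top token is always kept)."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    # drop tokens whose preceding cumulative mass already reaches p
    probs = probs.masked_fill(cumulative - probs >= p, 0.0)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```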



Sampling – min-p

Nguyen et al., Oct 2024 – "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs"
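Per the cited paper, min-p keeps only tokens whose probability is at least min_p times that of the single most likely token, so the cutoff adapts to the model's confidence. A minimal sketch in the same style as the samplers above:

```python
import torch

def min_p_sample(logits: torch.Tensor, min_p: float = 0.1) -> torch.Tensor:
    """Keep only tokens with probability >= min_p * (top token's probability),
    renormalize, then sample. High-confidence steps prune aggressively;
    flat distributions keep many candidates."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)
```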


Advanced decoding methods - Introduction
§ Decoding so far
§ Single sequence under consideration

§ Different strategies for selecting next token, given probability vector

§ Next sections
§ Beam search: Keeping multiple sequences under consideration simultaneously

§ Speculative decoding: Speed up LLM decoding without affecting output quality



Why multiple sequences?
§ Let's revisit the sequence probability formulation

• Q: Is there a way to maximize sequence probability while generating?


• Naïve answer: score all possible sequences and take the maximum – but that is intractable
• Practical answer, with a limited time/compute budget: beam search



What is Beam Search?
• Core Idea:
• Rather than choosing just the highest probability token (as in greedy decoding), beam search keeps the top-k sequences at each step.

• Terminology:
• Beam Width (k): Number of candidate sequences retained.

• Motivation:
• Avoids early commitment to a single sequence that may lead to suboptimal outputs.



How Beam Search Works – Step-by-Step
1. Initialization: Start with the initial token or prompt.
2. Expansion: For each sequence in the beam, generate possible next tokens.
3. Scoring: Compute scores (cumulative log probabilities) for each candidate.
4. Pruning: Keep the top-k highest-scoring sequences.
5. Iteration: Repeat until an end-of-sequence token is generated or a maximum length is reached.

(image credits: https://d2l.ai/)
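
A minimal sketch of these steps, again assuming a Hugging Face-style causal LM (names are illustrative):

```python
import torch

@torch.no_grad()
def beam_search(model, tokenizer, prompt, beam_width=4, max_new_tokens=50):
    """Keep the beam_width highest-scoring sequences, scored by
    cumulative log-probability."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    beams = [(ids, 0.0)]                        # (sequence, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == tokenizer.eos_token_id:
                candidates.append((seq, score))  # finished beam carries over
                continue
            # expansion: top beam_width continuations of this sequence
            log_probs = torch.log_softmax(model(seq).logits[:, -1, :], dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width, dim=-1)
            for lp, tok in zip(top_lp[0], top_ids[0]):
                new_seq = torch.cat([seq, tok.view(1, 1)], dim=-1)
                candidates.append((new_seq, score + lp.item()))
        # pruning: keep only the beam_width best candidates
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(b[0][0, -1].item() == tokenizer.eos_token_id for b in beams):
            break
    return tokenizer.decode(beams[0][0][0], skip_special_tokens=True)
```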



Scoring in Beam Search
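
In standard form, the beam score is the cumulative log-probability of the sequence:

$$\mathrm{score}(x_1, \dots, x_T) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$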

• Why Log?
• Logs turn a product of many small probabilities into a sum, which is numerically stable and easy to accumulate.

• Interpretation:
• The sequence with the highest cumulative score is considered the best candidate.



Pros & Cons of Beam Search
• Advantages:
• Improved Coherence*: Explores multiple paths, often leading to more fluent text.

• Better Global Quality*: Reduces the risk of getting stuck in locally optimal (but globally suboptimal) decisions.

• Disadvantages:
• Computational Cost: More sequences to evaluate compared to greedy decoding.

• Reduced Diversity: Can still converge to similar outputs if beam width is narrow.

• Complexity: Requires careful tuning of the beam width parameter.

• Objective doesn’t align with training: While it maximizes the language modeling objective, the underlying models were only trained to predict
next token, not the full sequence!



Speculative Decoding - Motivation & Background
• Autoregressive Generation:
• Generates text one token at a time

• Can be slow due to sequential dependency – even easy-to-predict tokens take the same amount of time!

• Need for Speed:


• Real-time applications require faster inference

• Speculative decoding offers a way to reduce latency

• Key Idea:
• Use a fast, approximate model to “speculate” future tokens

• Validate these tokens with a more accurate model



Speculative Decoding
§ What makes it possible?
§ The "Attention Is All You Need" architecture!

§ Key Insight: We have access to next-token probabilities for all tokens in the sequence – not just the last token!
§ If we could make educated guesses for next k tokens, how do we leverage it?

Great blog post on spec-dec: https://huggingface.co/blog/assisted-generation
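
A minimal sketch of one draft-and-verify step (greedy variant for clarity; the full algorithm verifies sampled tokens with rejection sampling so the output distribution is unaffected, as the slides note). `target_model`, `draft_model`, and `k` are illustrative names:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, ids, k=5):
    """One greedy speculative decoding step: the small draft model proposes
    k tokens; the large target model scores the whole extended sequence in
    ONE forward pass, and we keep the longest agreeing prefix."""
    # 1. Draft: cheap sequential proposals from the small model
    draft = ids
    for _ in range(k):
        logits = draft_model(draft).logits[:, -1, :]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verify: one parallel pass of the target model over prompt + draft
    #    (attention gives next-token logits at every position at once)
    target_logits = target_model(draft).logits
    n = ids.shape[1]
    accepted = ids
    for i in range(k):
        target_tok = target_logits[:, n + i - 1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if target_tok.item() != draft[0, n + i].item():
            break   # first disagreement: keep the target's token and stop
    return accepted
```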




Prompt Lookup Decoding
§ Speculative decoding: Requires an assistant model
§ Additional VRAM requirements

§ Need to consider speed/quality tradeoff with smaller models

§ Let's consider some limited use cases – document summarization, doc QA, code editing
§ Is there a way to get good draft tokens without using an additional model?



Prompt Lookup Decoding
§ Use the prompt itself!
§ Steps:
§ Take the last few generated tokens (so far)

§ Search for these in the prompt (e.g., the document, or earlier code in the prompt)

§ If a match is found – continuation of these tokens is the draft!

§ Use model as verifier, repeat
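
A minimal sketch of the lookup step (function and parameter names are illustrative, not the reference implementation). The returned draft is then verified by the main model exactly as in speculative decoding:

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=10):
    """Prompt lookup decoding sketch: match the last few generated tokens
    against earlier context and propose the continuation as the draft.
    `tokens` is the full list of token ids (prompt + generated so far)."""
    pattern = tokens[-ngram_size:]
    # scan right-to-left so the most recent earlier match wins;
    # start before the trailing occurrence of the pattern itself
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            draft = tokens[start + ngram_size : start + ngram_size + num_draft]
            if draft:
                return draft     # match found: its continuation is the draft
    return []                    # no match: fall back to ordinary decoding
```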



Prompt Lookup Decoding

(Figure: string matching over the prompt serves as the "Draft Model")

Part of major LLM inference libraries, including transformers and vLLM


Prompt Lookup Decoding

https://github.com/apoorvumang/prompt-lookup-decoding
Somasundaram et al., 2024, "PLD+: Accelerating LLM inference by leveraging Language Model Artifacts"
Recap & Key Takeaways
§ Decoding basics, theoretical underpinnings
§ Overview of deterministic and stochastic methods
§ Greedy, sampling, sampling parameters (temperature, top-k, top-p, min-p)

§ Advanced methods for efficiency and quality


§ Beam search

§ Speculative decoding

§ Prompt lookup decoding

§ Relatively underexplored field – even DeepSeek-R1 uses plain temp=0.7 sampling!


§ Possible to make progress on SoTA even with modest GPU resources!

