Theoretical Benefit and Limitation of Diffusion Language Model

Guhao Feng*1, Yihan Geng*1, Jian Guan2, Wei Wu2, Liwei Wang1, Di He1
*Equal contribution. 1Peking University. 2Ant Group.

arXiv:2502.09622v1 [cs.LG] 13 Feb 2025

Abstract

Diffusion language models have emerged as a promising approach for text generation. One would naturally expect this method to be an efficient replacement for autoregressive models since multiple tokens can be sampled in parallel during each diffusion step. However, its efficiency-accuracy trade-off is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness heavily depends on the target evaluation metric. Under mild conditions, we prove that when using perplexity as the metric, MDMs can achieve near-optimal perplexity with a number of sampling steps that is independent of sequence length, demonstrating that efficiency can be achieved without sacrificing performance. However, when using the sequence error rate–which is important for understanding the "correctness" of a sequence, such as a reasoning chain–we show that the required sampling steps must scale linearly with sequence length to obtain "correct" sequences, thereby eliminating MDM's efficiency advantage over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.

1. Introduction

Diffusion models (Ho et al., 2020; Song et al., 2021b) have emerged as a powerful paradigm in generative modeling, establishing state-of-the-art performance in image synthesis (Karras et al., 2022; Song et al., 2021a). Their extension to discrete domains has opened new possibilities for generating sequences, such as natural language (Campbell et al., 2022; Dieleman et al., 2022; Zheng et al., 2023; Lou et al., 2024; Campbell et al., 2024; Lovelace et al., 2024) and biological sequences (Rastogi et al., 2022; Vignac et al., 2022; Sun & Yang, 2023; Avdeyev et al., 2023). Among various discrete diffusion architectures, masked diffusion models (MDMs) (Shi et al., 2024; Sahoo et al., 2024; Ou et al., 2024)—which generate sequences by iteratively converting masks to tokens—have emerged as a prominent approach and demonstrated competitive performance across diverse language modeling tasks.

While auto-regressive models generate sequences token-by-token, discrete diffusion models can generate multiple tokens simultaneously during each step (reverse process). Therefore, it is natural to hypothesize that this parallel sampling improves generation efficiency. However, we argue that reaching a conclusion requires considering both computational cost and generation quality. Specifically, we pose the following question: do discrete diffusion models achieve superior efficiency when the generated content meets an acceptable quality standard? There may be multiple answers to this question. If diffusion models require fewer neural network executions while maintaining quality, they can offer better acceleration. Conversely, if their execution count is comparable to or exceeds that of auto-regressive models, diffusion language models may not be a better choice.

To answer the above question, we leverage two complementary metrics to evaluate the efficiency of MDMs in language modeling. The first metric, token error rate (TER), quantifies token-level accuracy, which correlates with the fluency of the generated text. In practice, perplexity is a widely used metric for measuring token-level errors of language models (Jelinek et al., 1977; Devlin et al., 2019); thus, we define the metric of TER by perplexity in this paper. The second metric, sequence error rate (SER), evaluates the correctness of an entire sequence, which is crucial for reasoning tasks requiring logically correct sequences. We provide a natural definition of SER that reflects the correctness of the whole sequence. Together, these metrics enable a comprehensive evaluation of the efficiency of MDMs under both token-level and sequence-level metrics. We first provide a positive theoretical result regarding TER. We prove that under mild conditions, MDMs can achieve near-optimal TER with a number of sampling steps that does not depend on the sequence length L. Compared to the auto-regressive model, which must be executed L times to generate the sequence, MDMs demonstrate substantial efficiency gains, especially when the generation length is long.
However, we show that this efficiency advantage diminishes when SER is considered. We theoretically prove that to achieve a low SER, the number of required sampling steps for MDMs scales at least linearly with sequence length. Intuitively, this limitation arises from the fact that SER, as a metric for the entire sequence, requires the generated sequence to be free of any error in the whole sequence, which forces MDMs to sample only a small number of tokens per step to mitigate such inconsistencies. As a result, the number of required sampling steps can be significant. It is notable that each MDM sampling step usually incurs a higher computational cost than an auto-regressive step under the same architecture, thus MDMs offer no efficiency advantage under this metric.

Finally, we validate our theoretical findings through comprehensive experiments. Our experiments examine MDMs trained on formal languages, including n-gram languages and Hidden Markov Models (HMMs), systematically analyzing the relationship between performance and efficiency under both TER and SER metrics. Additional experiments on natural language tasks, including TER evaluation on text generation and SER assessment on the GSM8k dataset (Cobbe et al., 2021), corroborate our theoretical predictions: while achieving low SER necessitates substantial sampling steps, relatively few steps suffice for TER. These results provide practical guidance for deploying diffusion language models across different applications.

2. Related Work

Discrete Diffusion Models. The auto-regressive paradigm has achieved significant success in language modeling (Dai, 2019; Floridi & Chiriatti, 2020; Achiam et al., 2023). However, its left-to-right, token-by-token generation approach is not without limitations. Notably, it faces challenges such as restricted controllability (Zhang et al., 2023) and inefficiencies in inference speed (Leviathan et al., 2023). To overcome these drawbacks, inspired by the success of diffusion models in image generation (Sohl-Dickstein et al., 2015; Song et al., 2021a; Karras et al., 2022), researchers have adapted these techniques for NLP tasks (Austin et al., 2021; He et al., 2022; Chen et al., 2022; Meng et al., 2022; Ye et al., 2023; Gulrajani & Hashimoto, 2023; Zhang et al., 2024). Discrete diffusion models, in particular, have shown promising results, achieving comparable performance with auto-regressive models across a range of NLP benchmarks.

Discrete diffusion models can be categorized based on the initialization strategy of the reverse process: (1) reverse processes that begin with masked sequences and (2) reverse processes that start with sequences of tokens sampled randomly from the vocabulary. The first category, termed masked diffusion models (MDMs), includes models such as SEDD Absorb (Lou et al., 2024) and its streamlined variants in subsequent works (Sahoo et al., 2024; Zhao et al., 2024; Shi et al., 2024; Ou et al., 2024; Zheng et al., 2024). The second category encompasses models like SEDD Uniform (Lou et al., 2024), as well as extensions introduced in follow-up studies (Campbell et al., 2024). Notably, Gat et al. (2024); Davis et al. (2024) and Campbell et al. (2024) further extend flow-matching to the discrete domain, with differing initialization strategies: the former employs masked sequences, while the latter utilizes a customized distribution for the reverse process.

Masked Diffusion Models. Among the two primary classes of discrete diffusion models, MDMs have consistently demonstrated superior performance and scalability (Lou et al., 2024; Campbell et al., 2024). For instance, in Lou et al. (2024), the masked variant of SEDD significantly outperforms its uniform counterpart across a range of benchmarks. Similarly, Campbell et al. (2024) reports that the masked variant achieves better results in most language tasks. Furthermore, recent advancements have successfully scaled MDMs to over 1 billion parameters (Gat et al., 2024; Nie et al., 2024; Gong et al., 2024; Shi et al., 2024), underscoring their robustness and adaptability to large-scale NLP models. In this paper, we focus on MDMs, and our theoretical contributions can be applied to all MDMs, including the masked variant of discrete flow matching.

Various Metrics in NLP Tasks. Evaluation metrics in NLP tasks are inherently tied to the specific objectives and requirements of their respective domains. For general language modeling tasks, perplexity (Jelinek et al., 1977; Devlin et al., 2019) remains the metric of choice due to its ability to capture a model's predictive performance effectively. However, domain-specific tasks often demand more specialized evaluation criteria. For instance, in machine translation (Bahdanau, 2014; Wu et al., 2016), the BLEU score is widely regarded as a standard measure of translation quality (Papineni et al., 2002), while text generation tasks (Sutskever, 2014) frequently rely on metrics such as ROUGE to assess output fidelity (Lin, 2004). Similarly, tasks requiring reasoning (Wei et al., 2022b), such as mathematics (Bubeck et al., 2023) or code generation (Roziere et al., 2023; Ouyang et al., 2023), commonly adopt accuracy as an intuitive and straightforward measure of success.

3. Masked Diffusion Language Model
Without loss of generality, we study the sequence generation task where the sequence length is upper bounded by L. Let V denote the vocabulary. The MDM (Lou et al., 2024; Shi et al., 2024; Gong et al., 2024; Sahoo et al., 2024) extends the vocabulary V by introducing a special mask token [m]. The forward diffusion process progressively transforms an initial sequence x_0 = (x_0^1, x_0^2, ..., x_0^L) ∈ V^L into a fully masked sequence x_1 = ([m], [m], ..., [m]) by independently masking each token according to a predefined schedule. Conversely, the reverse process defines a generative model that reconstructs a sequence by iteratively modifying a fully/partially masked sequence. Below, we formally define both the forward and reverse processes.

3.1. Forward Process

Given a sequence x_0 and a masking schedule α_t, the distribution of the sequence x_t at time t ∈ [0, 1] is expressed as:

$$q_{t|0}(x_t \mid x_0) = \prod_{i=1}^{L} q_{t|0}(x_t^i \mid x_0^i), \quad \text{where } q_{t|0}(x_t^i \mid x_0^i) = \begin{cases} \alpha_t, & x_t^i = x_0^i, \\ 1 - \alpha_t, & x_t^i = [m]. \end{cases} \qquad (1)$$

The masking schedule α_t is designed such that α_0 = 1, ensuring that the sequence remains unmasked at the start of the process. Similar to the continuous diffusion methods (Ho et al., 2020; Song et al., 2021a; Karras et al., 2022), we set α_1 = 0 (or a value approaching zero), ensuring the sequence is fully masked at the end of the forward process.
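To make Equation (1) concrete, the following is a minimal sketch (ours, not the authors' implementation) of the forward masking step at a fixed time t; the linear schedule α_t = 1 − t and the integer mask id are illustrative assumptions.

```python
import numpy as np

MASK = -1  # stand-in id for the special mask token [m]

def alpha(t: float) -> float:
    """Illustrative linear masking schedule with alpha(0) = 1 and alpha(1) = 0."""
    return 1.0 - t

def forward_mask(x0: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q_{t|0}(. | x_0): each token independently survives with
    probability alpha_t, otherwise it is replaced by the mask token."""
    keep = rng.random(x0.shape) < alpha(t)
    return np.where(keep, x0, MASK)

rng = np.random.default_rng(0)
x0 = rng.integers(0, 16, size=12)        # a toy sequence over a vocabulary of size 16
print(forward_mask(x0, t=0.7, rng=rng))  # roughly 70% of positions come out masked
```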
3.2. Reverse Process

The reverse process reconstructs a sequence from a masked version by reversing the forward dynamics. Given the sequence at time t and the original sequence x_0, the conditional distribution of the sequence at time s < t is defined as:

$$q_{s|t,0}(x_s^i \mid x_t, x_0) = \frac{1-\alpha_s}{1-\alpha_t}\,\delta_{x_t^i}(x_s^i) + \frac{\alpha_s-\alpha_t}{1-\alpha_t}\,\delta_{x_0^i}(x_s^i),$$

where δ_x(y) is the Kronecker delta function. Marginalizing over x_0 yields the true reverse process q(x_s | x_t):

$$q_{s|t}(x_s \mid x_t) = \prod_{i=1}^{L} q_{s|t}(x_s^i \mid x_t), \quad \text{where} \quad q_{s|t}(x_s^i \mid x_t) = \begin{cases} 1, & x_t^i \neq [m],\ x_s^i = x_t^i, \\ \frac{1-\alpha_s}{1-\alpha_t}, & x_t^i = [m],\ x_s^i = [m], \\ \frac{\alpha_s-\alpha_t}{1-\alpha_t}\, q_{0|t}(x_s^i \mid x_t), & x_t^i = [m],\ x_s^i \neq [m], \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$$

In MDM, a parameterized reverse model p_θ is often employed to approximate the distribution q_{0|t}(x_s^i | x_t). This model is trained by minimizing the evidence lower bound (ELBO) (Lou et al., 2024; Shi et al., 2024; Gong et al., 2024; Sahoo et al., 2024) on the negative log-likelihood of the data distribution q_0.

Inference. Inference within the MDM framework entails discretizing the reverse process to iteratively reconstruct sequences from a fully masked sequence. Let T denote the number of sampling steps. Starting with a fully masked sequence, the denoising process proceeds via q_{s|t}(x_s | x_t), where s = i/T and t = (i+1)/T. At each step, the model first samples x_0 from the conditional distribution p_θ(x_0 | x_t), followed by masking specific tokens according to q(x_s | x_t, x_0).

In practice, the reverse model is parameterized using a factorized denoising model, where the conditional distribution p_θ(x_0 | x_t) is expressed as:

$$p_\theta(x_0 \mid x_t) = \prod_{i=1}^{L} p_\theta(x_0^i \mid x_t). \qquad (3)$$

Here, each token is predicted independently using p_θ(x_0^i | x_t), allowing for efficient parallel sampling. However, this factorized approach imposes a significant limitation: it disregards interdependencies between tokens within the sequence. As a result, the factorized model p_θ(x_0 | x_t) cannot exactly match the true reverse distribution q(x_0 | x_t) (Xu et al., 2024). In this work, we analyze the conditions under which this sampling method achieves a favorable balance between efficiency and the quality of the generated sequences.
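The discretized reverse process described above can be sketched as follows. This is our own illustration rather than the authors' sampler: `denoiser` stands for any factorized model returning p_θ(x_0^i | x_t) per position, a position decoded at a step is revealed with probability (α_s − α_t)/(1 − α_t), matching the remasking rule q(x_s | x_t, x_0), and the linear schedule is again only an example.

```python
import numpy as np

MASK = -1  # stand-in id for the mask token [m]

def alpha(t: float) -> float:
    return 1.0 - t  # illustrative linear schedule

def mdm_sample(denoiser, L: int, vocab: int, T: int, rng: np.random.Generator) -> np.ndarray:
    """Sketch of MDM ancestral sampling with T discretization steps.

    denoiser(xt) -> (L, vocab) array of probabilities p_theta(x_0^i | x_t)."""
    xt = np.full(L, MASK)
    for i in range(T, 0, -1):
        t, s = i / T, (i - 1) / T
        probs = denoiser(xt)  # one network execution per step, all positions in parallel
        reveal_p = (alpha(s) - alpha(t)) / (1.0 - alpha(t))
        for pos in range(L):
            if xt[pos] == MASK and rng.random() < reveal_p:
                xt[pos] = rng.choice(vocab, p=probs[pos])
    return xt

# usage with a dummy uniform "model"
rng = np.random.default_rng(0)
uniform = lambda xt: np.full((len(xt), 16), 1.0 / 16)
print(mdm_sample(uniform, L=12, vocab=16, T=8, rng=rng))
```

Note that at the final step (s = 0) the reveal probability equals one, so every remaining masked position is decoded and the output is always a complete sequence.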
4. Theoretical Analysis

In image generation, the primary goal is typically to produce visually appealing and seamless images (Heusel et al., 2017). Language generation is more task-specific. Depending on the application, the users may prefer fluent outputs, as in article writing, or precise and accurate reasonings, as in problem-solving tasks. In this section, we explore the sampling efficiency of MDMs in addressing various language tasks with respect to different evaluation metrics.

4.1. Notations and Problem Setting

Our investigation employs the hidden Markov model (HMM) framework to analyze natural language generation. This section establishes the formal notation and problem setting that underlies our subsequent analysis.

HMMs (Eddy, 1996) provide a probabilistic foundation for modeling sequential data with latent structures, where observed sequences are generated by an underlying sequence of unobservable hidden states. Formally, an HMM H = (S, V, A, B) is characterized by the following components: a finite set of hidden states S = {s_1, s_2, ..., s_N}, an observable vocabulary V, a state transition probability matrix A ∈ R^{N×N}, an emission probability matrix B ∈ R^{N×|V|}, and an initial state distribution π ∈ R^N. Given a sequence of observations x = (x_1, x_2, ..., x_L) ∈ V^L and a sequence of hidden states s = (s_1, s_2, ..., s_L) ∈ S^L, the generative process of an HMM is governed by the following probabilistic relations:

$$\Pr(s_1) = \pi_{s_1}, \qquad \Pr(x_i \mid s_i) = B_{s_i, x_i}, \qquad \Pr(s_i \mid s_{1:i-1}) = \Pr(s_i \mid s_{i-1}) = A_{s_{i-1}, s_i}.$$

This formulation enables HMMs to capture both the sequential dependencies among hidden states and their probabilistic relationships with observed data. In the field of NLP, HMMs serve as the fundamental statistical tools to model natural language (Eddy, 1996; Marti & Bunke, 2001). A notable special case of HMM is the n-gram language model (Brown et al., 1992), which estimates the probability of a token given its preceding n − 1 tokens. Despite their simplicity, n-gram models are foundational tools in NLP tasks (Brown et al., 1992; De Novais et al., 2010). Moreover, Liu et al. (2024) suggests that scaling up n-gram models can also achieve performance comparable to modern large language models.
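For illustration, the sketch below (ours) samples hidden states and observations according to exactly these relations; the small two-state, three-token HMM at the bottom is a toy example, not one used in the paper.

```python
import numpy as np

def sample_hmm(pi, A, B, L, rng):
    """Sample (s, x) of length L from an HMM:
    Pr(s_1) = pi[s_1], Pr(s_i | s_{i-1}) = A[s_{i-1}, s_i], Pr(x_i | s_i) = B[s_i, x_i]."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)
    for _ in range(L):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))  # emit a token from state s
        s = rng.choice(A.shape[1], p=A[s])          # transition to the next state
    return np.array(states), np.array(obs)

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])            # state transition matrix
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # emission matrix over a 3-token vocabulary
print(sample_hmm(pi, A, B, L=10, rng=rng))
```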
Formally, we aim to address the following question: If MDMs have the capability to approximate a target HMM model, what are the computational costs, and do MDMs offer advantages over auto-regressive models? To evaluate the approximation quality of MDMs, we adopt two widely used metrics: TER and SER, which quantify different aspects of a model's performance.

Token Error Rate. In practice, perplexity is one of the most widely used metrics for evaluating token-level errors in language models. It quantifies the uncertainty of a model in predicting the next token in a sequence and serves as a standard measure for assessing the quality of text generation. In this paper, we define the TER by perplexity. Models with lower TER are generally considered more effective at generating fluent and coherent text. Formally, given a ground-truth language model q and an evaluated model p, the TER is computed as:

$$\mathrm{TER}(p) = 2^{\,\mathbb{E}_{x\sim q}\left[-\frac{\log(p(x))}{|x|}\right]}. \qquad (4)$$
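As a concrete reading of Equation (4), TER can be estimated by Monte Carlo: draw sequences from the ground-truth model q and average the per-token negative log-probability that the evaluated model p assigns to them. The sketch below is ours; `log2_p` is an assumed helper returning log2 p(x) for a whole sequence.

```python
import numpy as np

def estimate_ter(samples, log2_p):
    """Monte Carlo estimate of TER(p) = 2^{E_{x~q}[-log2 p(x) / |x|]}.

    samples: sequences drawn from the ground-truth model q.
    log2_p:  function returning log_2 p(x) of a whole sequence under the evaluated model."""
    rates = [-log2_p(x) / len(x) for x in samples]
    return 2.0 ** np.mean(rates)

# toy usage: p assigns probability 1/16 to every token independently
log2_uniform = lambda x: -4.0 * len(x)
print(estimate_ter([np.zeros(10), np.zeros(20)], log2_uniform))  # -> 16.0
```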
Sequence Error Rate. The SER evaluates the correctness of an entire sequence rather than individual tokens. Let q represent a target language defined over a vocabulary V, and let L_q = {x ∈ V* | q(x) > 0} denote the support set of distribution q. For a generative model p, the SER is defined as:

$$\mathrm{SER}(p) = 1 - \sum_{x\in L_q} p(x). \qquad (5)$$

This metric quantifies the probability that the model generates sequences falling outside the support set of the ground-truth distribution.

Compared to TER, SER imposes a stricter evaluation criterion by requiring the correctness of entire sequences. This makes SER particularly well-suited for tasks that demand logical consistency or reasoning, where the correctness of the complete reasoning chain is crucial.
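Equation (5) likewise admits a simple Monte Carlo estimator: sample sequences from the evaluated model and count the fraction that falls outside the support L_q. The sketch below is ours; `q_prob` is an assumed helper returning the ground-truth probability q(x), and the toy "language" accepting only increasing sequences is purely illustrative.

```python
def estimate_ser(generated, q_prob):
    """Monte Carlo estimate of SER(p) = 1 - sum_{x in L_q} p(x):
    the fraction of generated sequences x with q(x) = 0, i.e. outside the support of q."""
    errors = sum(1 for x in generated if q_prob(x) == 0.0)
    return errors / len(generated)

# toy usage: the "language" accepts only strictly increasing sequences
q_prob = lambda x: 1.0 if all(a < b for a, b in zip(x, x[1:])) else 0.0
print(estimate_ser([[1, 2, 3], [3, 1, 2], [0, 5, 9]], q_prob))  # -> 0.333...
```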
4.2. MDMs Can Generate Low-TER Sentences Efficiently

In this subsection, we rigorously examine the efficiency of sampling in MDMs, demonstrating that MDMs are capable of efficiently generating sentences with near-optimal TER. To establish the main theoretical results, we assume that the MDMs have enough expressive power and begin with the following assumption:

Assumption 4.1 (Learning with Small Error). Let q denote the target language model with vocabulary V, and let p_θ represent the reverse model trained to approximate the reverse process generating the target language under a masking schedule α_t. Assume there exists ε_learning > 0 such that the KL divergence between p_θ and the reverse process distribution generating the language q is bounded by ε_learning, i.e.,

$$D_{\mathrm{KL}}\big(q_{0|t}(x_0^i \mid x_t)\,\big\|\,p_\theta(x_0^i \mid x_t)\big) < \epsilon_{\mathrm{learning}}, \quad \forall\, t \text{ and } x_t.$$

It is worth noting that p_θ(x_0^i | x_t) = q_{0|t}(x_0^i | x_t) represents the optimal solution to the ELBO loss during training. Assumption 4.1 implies that the MDM model is well-trained and approximates the ground-truth distribution with only a small error.

During MDM inference, the time interval [0, 1] is discretized into N steps, where t_i = i/N, i ∈ [N], and sequences are iteratively reconstructed from a fully masked sequence. The following theorem shows that the sequence distribution generated by the reverse process, even with a small number of sampling steps, can achieve near-optimal TER. Consequently, MDMs exhibit high efficiency in generating n-gram language.

Theorem 4.2 (TER Bounds for n-Gram Language Generation). For any n-gram language q and any ε > 0, let p_θ denote the reverse model and L denote the sequence length. The distribution over sequences generated by p_θ is denoted as p. For any L > O((n−1)/ε^{n+0.5}), under Assumption 4.1, there exists a masking schedule α_t such that, with N = O((n−1)/ε^n) sampling steps, the TER of the MDM is upper-bounded by:

$$\log \mathrm{TER}(p) \leq \log \mathrm{TER}(q) + \epsilon_{\mathrm{learning}} + 4\epsilon \log |V|. \qquad (6)$$

The proof of this theorem is presented in Appendix B. Theorem 4.2 demonstrates that MDMs can efficiently generate sentences with high fidelity. It is notable that for a given data distribution q, the TER of a language model p achieves its global minimum when p = q. To ensure a gap of at most ε with the optimal TER during sampling, the number of required sampling steps is bounded by O((n−1)/ε^n).

The above results suggest that to achieve near-optimal TER, MDMs require only a number of sampling steps that is independent of the sequence length L. In each sampling step, the neural network model, i.e., a Transformer, is executed once. Therefore, informally, the neural network execution count is constant for MDM. This offers substantial efficiency gains over auto-regressive models, where the model must be executed L times, once for each token in the sequence. Such efficiency enables MDMs to handle long-sequence generation tasks effectively while maintaining high-quality outputs.
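To get a feel for the scaling, here is a back-of-the-envelope comparison (ours; the constants hidden by the O-notation are ignored, so the numbers are only indicative):

```latex
% Illustrative arithmetic for Theorem 4.2; O(.) constants are ignored.
\[
  n = 3,\; \epsilon = 0.1:\qquad
  N_{\mathrm{MDM}} \;=\; O\!\left(\frac{n-1}{\epsilon^{n}}\right)
                   \;=\; O\!\left(\frac{2}{0.1^{3}}\right)
                   \;=\; O(2000)\ \text{steps, independent of } L,
\]
\[
  N_{\mathrm{AR}} \;=\; L\ \text{network executions (one per generated token)},
\]
% so for generation lengths L far beyond a few thousand tokens the MDM needs
% far fewer network executions while matching TER up to the gap in Eq. (6).
```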

4.3. MDMs Cannot Generate Low-SER Sentences with A Low Cost

In this subsection, we examine the SER of sampling in MDMs and highlight a fundamental limitation of MDMs in generating logically rigorous language. We begin by establishing that, with sufficient sampling steps, the MDMs have the capability to approximate a target HMM model with perfect SER.

Theorem 4.3 (Accurate Generation of HMM with Sufficient Steps). Let q denote any HMM, and let p_θ represent the reverse model under an arbitrary masking schedule, where L is the sequence length. Let p denote the distribution over sequences generated by p_θ. Under Assumption 4.1 with a learning error ε_learning < O(δ/L), and given a sufficient number of reverse steps, the sequence error rate SER(p) of the generated text satisfies

$$\mathrm{SER}(p) \leq \delta.$$

The complete proof of Theorem 4.3 is detailed in Appendix C.1. While this result establishes the theoretical capability of MDMs to achieve low SER, we still need to estimate the computational cost to achieve it. The following theorem provides a negative result for this problem.

Theorem 4.4 (SER Bound for HMM Generation). There exists an HMM q over a vocabulary of size 16 that satisfies the following conditions: for any reverse model p_θ under Assumption 4.1 with ε_learning < 1/128, and any masking schedule α_t, let p denote the distribution over sequences generated by p_θ. There exists a constant C such that if the number of sampling steps satisfies N ≤ CL, where L is the sequence length, the SER of the generated text is lower-bounded by:

$$\mathrm{SER}(p) > \frac{1}{2}.$$

The proof is presented in Appendix C.2.

Theorem 4.4 shows that to generate sequences with low SER, the number of sampling steps in MDMs must scale at least linearly with the sequence length L, indicating that the number of neural network executions is comparable between MDMs and autoregressive models. However, this scaling law of MDMs typically leads to much higher computational costs compared to autoregressive models. For instance, in the case of Transformer-based architectures, each execution step in MDMs involves a quadratic computational complexity in terms of L, as opposed to the linear complexity of auto-regressive Transformer models in each generation step (through reusing the stored KV caches). Consequently, in accuracy-critical applications, MDMs offer no computational efficiency advantage over auto-regressive models.

Furthermore, some prior works (Sahoo et al., 2024; Ou et al., 2024) have proposed efficient sampling strategies that reuse cached outputs without requiring additional forward passes through the network when no token is modified from [m] at a given step. Nevertheless, our theoretical results remain applicable to these sampling strategies, as discussed in Appendix D.

Do TER and SER Conflict? The above results reveal that MDMs can efficiently generate low-TER sentences but may incur higher costs when evaluating the generation under SER. One might think these results are contradictory. Note that several previous works have already shown that TER (a.k.a. perplexity) may not reflect a model's true performance in solving several long-sequence understanding tasks (Huang et al., 2022; Hu et al., 2024; Luden et al., 2024). Thus, it is natural to arrive at different conclusions depending on the metric used.

Moreover, many practical scenarios have shown that the choice of evaluation metric significantly influences the conclusion of other problems. For instance, while the community has previously focused on the emergence phenomenon, recent works by Wei et al. (2022a) and Schaeffer et al. (2024) demonstrate that this phenomenon may stem from the use of non-smooth evaluation metrics. Our work further reveals that conclusions regarding the efficiency of MDMs depend heavily on the evaluation metric employed. Specifically, MDMs excel in applications where fluency is prioritized. In contrast, for reasoning-intensive tasks that demand highly accurate trajectories, MDMs may fail to offer a significant efficiency advantage over auto-regressive models.

5. Experiments

We conducted a series of experiments to empirically validate the theoretical findings, focusing on evaluating the sampling quality and computational efficiency of MDMs under diverse metrics. The results reveal that while MDMs effectively generate low-TER sequences, achieving low SER demands substantial computational resources. We will first introduce our experimental settings and then present the experimental results.
[Figure 1: two panels, "Generative Perplexity v.s. Steps" (top) and "Sequence Error Rate v.s. Steps" (bottom); x-axis: sampling steps from 8 to 2048 plus an AR reference point.]

Figure 1. Sampling Efficiency and Quality of MDMs on Formal Languages: The upper subfigure illustrates the generative perplexity of generated sequences versus the number of sampling steps for n-gram languages (n ∈ {2, 3, 4}) and HMM. The y-axis represents the generative perplexity, and the x-axis represents the sampling steps, with the last point indicating the performance of auto-regressive models. The lower subfigure shows the SER of generated sequences versus the number of sampling steps for the same formal languages. The y-axis represents the SER, while the x-axis is the same as in the upper subfigure. The number above each bar indicates the speedup of MDMs under different sampling steps compared to the auto-regressive models.
5.1. Experimental Setup

Tasks and Datasets. First, we evaluated MDMs on a variety of formal languages, including n-gram languages (n ∈ {2, 3, 4}) and HMMs. For each formal language, parameters such as transition matrices, observation matrices, and initial distributions were generated through random sampling. Detailed descriptions of the parameter generation process, along with illustrative examples of the resulting sequences, are provided in Appendix E.1. Using these formal languages, we constructed datasets comprising 1,000,000 samples, of which 990,000 were allocated for training and 10,000 for validation. When using the formal language models to generate the dataset, we set the max length to 512.
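The exact parameter-generation procedure is specified in Appendix E.1 (not reproduced here); the snippet below is only a plausible stand-in we provide for illustration, drawing each row of the transition and emission matrices, as well as the initial distribution, from a symmetric Dirichlet.

```python
import numpy as np

def random_hmm(num_states: int, vocab_size: int, seed: int = 0, concentration: float = 1.0):
    """Illustrative stand-in for the paper's parameter sampling (see Appendix E.1):
    each row of A and B, and the initial distribution pi, is drawn from a symmetric Dirichlet."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(num_states, concentration))
    A = rng.dirichlet(np.full(num_states, concentration), size=num_states)
    B = rng.dirichlet(np.full(vocab_size, concentration), size=num_states)
    return pi, A, B

pi, A, B = random_hmm(num_states=8, vocab_size=16)
print(A.shape, B.shape, round(pi.sum(), 6))  # (8, 8) (8, 16) 1.0
```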
Model Training. We adopted transformer-based architectures as the backbone models due to their scalability and expressiveness in sequence modeling tasks. Comprehensive architectural details, including the number of layers, hidden dimensions, and positional encoding schemes, are provided in Table 2 in Appendix E.2. The training process followed the framework proposed by Sahoo et al. (2024), with additional training configurations detailed in Table 3. Models were trained for 20 epochs, and their convergence was monitored using the validation set. Perplexity was used as the primary convergence metric, and the trained models achieved optimal perplexity values consistent with the ground-truth language models that generated the datasets.

Evaluation Metrics. To assess the quality of generated sequences, we used TER and SER as the primary evaluation metrics, in alignment with our theoretical framework. Computational efficiency was evaluated based on the number of sampling steps. Following prior work (Lou et al., 2024; Xu et al., 2024), generative perplexity was employed as the TER metric to evaluate the sample qualities under different sampling steps. We compute the generative perplexity using the ground-truth model to evaluate the likelihood of sequences generated by MDMs, which were subsequently converted into perplexity scores. SER was computed directly using its definition in Equation (5), leveraging ground-truth models for evaluation. For sequence generation, we utilized the ddpm_cache sampler proposed in prior work (Sahoo et al., 2024), ensuring efficient sampling. Computational efficiency was measured by the number of sampling steps, and we further discuss the influence of ddpm_cache under different sampling steps in Appendix D. Furthermore, we also test the true speedup of MDMs under different sampling steps compared to the auto-regressive models in Figure 1; the detailed testing settings are listed in Appendix E.2. To ensure robust evaluation, we generated 2000 sequences for each setting and computed both TER and SER over these samples.
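Evaluating generated samples against the ground-truth HMM requires its exact likelihood q(x); a minimal sketch (ours) using the standard forward algorithm is shown below, with generative perplexity computed as the exponential of the mean per-token negative log-likelihood (natural log); the paper's exact evaluation code may differ.

```python
import numpy as np

def hmm_log_likelihood(x, pi, A, B):
    """Forward algorithm: returns log q(x) for an observation sequence x
    under the HMM (pi, A, B), with per-step normalization for stability."""
    fwd = pi * B[:, x[0]]
    log_q = np.log(fwd.sum())
    fwd /= fwd.sum()
    for tok in x[1:]:
        fwd = (fwd @ A) * B[:, tok]
        log_q += np.log(fwd.sum())
        fwd /= fwd.sum()
    return log_q

def generative_perplexity(batch, pi, A, B):
    """Generative perplexity of a batch of generated sequences under the ground truth."""
    nll = [-hmm_log_likelihood(x, pi, A, B) / len(x) for x in batch]
    return np.exp(np.mean(nll))

# toy usage with a small two-state HMM
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(generative_perplexity([np.array([0, 0, 1, 1])], pi, A, B))
```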

Figure 2. Evaluation on Language Tasks: The left subfigure illustrates the text generation quality of MDLM-OWT across different sampling steps, with GPT2-medium as baseline. The y-axis represents the average generative perplexity of 2000 generated texts, and the x-axis indicates the number of sampling steps. The numbers above indicate the speedup of MDLM-OWT under different sampling steps compared to GPT2-medium. The right subfigure shows the accuracy of MDM on the GSM8K benchmark at different sampling steps, with Qwen-Math-1.5B as baseline. The y-axis indicates accuracy, and the x-axis represents the number of sampling steps.

To compare MDMs with auto-regressive models, we trained auto-regressive models with identical architectures and model sizes on the same datasets generated by the formal languages. These models were evaluated under the same metrics, serving as a baseline for performance comparison. The training configurations are provided in Table 4.

5.2. Experiment Results

The experiment results are presented in Figure 1. The upper subfigure shows the generative perplexity across different formal languages with the number of sampling steps varying. The x-axis represents the number of sampling steps, ranging from 8 to 2048, while the y-axis measures the generative perplexity, where lower values indicate higher text fluency and token-level accuracy. The performance of auto-regressive models is marked as the final point on the x-axis for comparison. As shown in the figure, MDMs achieve near-optimal generative perplexity with relatively few sampling steps. To achieve a perplexity similar to the auto-regressive model, MDMs only require about 64 steps and demonstrate a 1.57 times speedup compared to auto-regressive models. This demonstrates that MDMs are highly efficient at generating fluent sequences even with a small number of sampling steps. As the number of sampling steps increases, the performance of MDMs approaches that of auto-regressive models, converging to a similar level of generative perplexity.

The lower subfigure evaluates the relationship between the number of sampling steps and the SER, which measures the correctness of an entire sequence. The x-axis again represents the number of sampling steps, with the performance of auto-regressive models included as a baseline, and the y-axis measures the SER, where lower values indicate higher sequence-level accuracy. Compared to the upper subfigure, this subfigure reveals a slower improvement in SER as the number of sampling steps increases. For these formal languages, achieving low SER requires significantly more sampling steps. Moreover, even when the number of sampling steps reaches 2048, there remains a gap in SER between MDMs and auto-regressive models. These results demonstrate that auto-regressive models maintain a clear advantage in SER, as their token-by-token generation achieves zero SER across these tasks.

Figure 1 highlights the trade-off between efficiency and accuracy for MDMs empirically. While MDMs excel in generating fluent outputs with low TER, they require substantially more sampling steps to achieve low SER, particularly for reasoning-intensive tasks that demand sequence-level correctness. These experimental results further reinforce our theoretical findings.

6. Preliminary Experiments on Large Models

We further conducted an extensive set of experiments on language tasks using open-source MDMs. First, we evaluated the quality of text generation by measuring the generative perplexity of MDMs using MDLM-OWT (Sahoo et al., 2024), a diffusion language model trained on OpenWebText (Gokaslan & Cohen, 2019). For a fair comparison, we evaluated GPT2-medium (Radford et al., 2019), which is similar in size. Second, we explored the mathematical reasoning ability of MDMs on the GSM8K dataset (Cobbe et al., 2021). Given that small models typically exhibit poor reasoning performance, we used a fine-tuned diffusion language model with 1.1B non-embedding parameters proposed by Nie et al. (2024), and compared it against a model with a similar number of parameters. While the generative perplexity represents the metric of TER, mathematical reasoning is more concerned with the correctness of the entire generated sequence; thus, the GSM8K accuracy is partially consistent with the negative sequence error rate, −SER.
Text Generation. For text generation, we use MDLM-OWT, which has a context length of 1024 and a size similar to GPT2-medium, and is trained on the OWT dataset (Gokaslan & Cohen, 2019). Since our goal is to compare the acceleration of MDMs relative to auto-regressive models and examine the effect of the number of steps on text generation quality, the absolute size and capability of the model are less important. Following the approach in the original work, we used the ddpm_cache sampler and the GPT2 tokenizer. For the number of sampling steps ranging from 4 to 2048, we generated 2000 samples of length 1024 and evaluated the generative perplexity using GPT2-large. To compare MDMs with auto-regressive models, we took GPT2-medium as the baseline and computed its generative perplexity in the same manner.

The experiment result is shown in the left subfigure of Figure 2, which illustrates the text generation quality of MDLM-OWT across different sampling steps, with GPT2-medium as the baseline. The x-axis represents the number of sampling steps, and the y-axis represents the average generative perplexity of 2000 generated texts, where lower generative perplexity indicates higher fluency and, consequently, a lower TER. The numbers above indicate the speedup of MDLM-OWT under different sampling steps compared to GPT2-medium. As shown in the figure, MDLM-OWT matches the generative perplexity of GPT2-medium with only 32 steps, where there is a 2.28x speedup, and the perplexity continues to decline and converge as the number of sampling steps increases. This demonstrates that MDMs can generate texts efficiently while ensuring high fluency, which illustrates the potential of MDMs for basic language generation tasks at a larger scale.

Mathematical Reasoning. For mathematical reasoning, we used the MDM provided by Nie et al. (2024), which was fine-tuned on GSM8K using a model trained on SlimPajama (Soboleva et al., 2023) for 3.3 × 10^21 training FLOPs with 1.1B non-embedding parameters. This is so far the first MDM to be fine-tuned on mathematical reasoning tasks. We generated answers with a maximum length of 256 for the number of sampling steps ranging from 1 to 256. Since there are very few models fine-tuned on GSM8K at the same scale, we took Qwen2-Math-1.5B (Yang et al., 2024) as our baseline. We evaluated its performance following the widely used Language Model Evaluation Harness framework (Gao et al., 2024), and counted the average length of the generated answers, which partly reflects the efficiency of the model.

The experiment results are presented in the right subfigure of Figure 2. For all tested step numbers, the average length of generated answers for the MDM is around 30, while for the baseline model, the average length is about 105. Unlike text generation, the MDM does not show a significant advantage over auto-regressive models for mathematical reasoning tasks. The accuracy of the MDM decreases sharply as the number of steps falls below the sequence length, and it shows only a slight improvement as the number of steps exceeds the sequence length. While the latter may be due to the limitations of the MDM we used, the former is likely caused by insufficient sampling, which leads to a high sequence error rate. It is worth noting that the experimental setup for the MDM differs from that of the baseline models, so the baseline accuracy is provided only for reference.

Summary. Figure 2 illustrates the performance of MDMs on language tasks and their dependence on sampling steps. For text generation, MDLM-OWT achieved similar performance to GPT2-medium with few sampling steps, demonstrating efficiency in generating fluent text. On the contrary, MDMs showed no significant advantage in GSM8K accuracy, with performance declining rapidly when the number of steps fell below the sequence length. These results highlight MDMs' ability in text generation, but suggest challenges in reasoning-relevant tasks.

7. Conclusion and Limitations

Conclusion. This paper provides a rigorous theoretical and empirical analysis of the efficiency of MDMs under various metrics. We demonstrate that MDMs can achieve near-optimal TER with a fixed number of sampling steps, regardless of sequence length, making them highly efficient for tasks emphasizing fluency. However, when evaluated using SER, MDMs require sampling steps that scale linearly with sequence length, negating their efficiency advantage over auto-regressive models. These findings highlight the trade-off between efficiency and accuracy, depending on the evaluation metric. Experimental results further reinforce our theoretical results across formal and natural language tasks, offering practical guidance for deploying MDMs. While MDMs demonstrate efficiency advantages in applications prioritizing fluency, they may fall short in reasoning-intensive tasks requiring high accuracy compared to auto-regressive models.

Limitations. Our study focuses on formal languages modeled using HMMs, which, while foundational, still differ from modern language models. Extending this analysis to more advanced language models remains an important direction for future work. Additionally, we primarily analyze Masked Diffusion Models, but the broader family of diffusion-based language models, including variants like SEDD Uniform (Lou et al., 2024), requires further investigation. In summary, while our work establishes a theoretical understanding of MDMs, further exploration is needed to generalize our findings to real-world settings and to systematically analyze other diffusion approaches.
Impact Statement

This paper presents work whose goal is to advance the field of generative models. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, 2021.

Avdeyev, P., Shi, C., Tan, Y., Dudnyk, K., and Zhou, J. Dirichlet diffusion score model for biological sequence generation. In International Conference on Machine Learning, pp. 1276–1301. PMLR, 2023.

Bahdanau, D. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Brown, P. F., Della Pietra, V. J., Desouza, P. V., Lai, J. C., and Mercer, R. L. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480, 1992.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.

Campbell, A., Benton, J., Bortoli, V. D., Rainforth, T., Deligiannidis, G., and Doucet, A. A continuous time framework for discrete denoising models. In Advances in Neural Information Processing Systems, 2022.

Campbell, A., Yim, J., Barzilay, R., Rainforth, T., and Jaakkola, T. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=kQwSbv0BR4.

Chen, T., Zhang, R., and Hinton, G. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Dai, Z. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

Davis, O., Kessler, S., Petrache, M., Ceylan, I. I., Bronstein, M., and Bose, A. J. Fisher flow matching for generative modeling over discrete data. arXiv preprint arXiv:2405.14664, 2024.

De Novais, E. M., Dias Tadeu, T., and Paraboni, I. Improved text generation using n-gram statistics. In Advances in Artificial Intelligence–IBERAMIA 2010: 12th Ibero-American Conference on AI, Bahía Blanca, Argentina, November 1-5, 2010, Proceedings 12, pp. 316–325. Springer, 2010.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019.

Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., Hawthorne, C., Leblond, R., Grathwohl, W., and Adler, J. Continuous diffusion for categorical data. ArXiv, abs/2211.15089, 2022.

Eddy, S. R. Hidden markov models. Current Opinion in Structural Biology, 6(3):361–365, 1996.

Floridi, L. and Chiriatti, M. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.

Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T., Synnaeve, G., Adi, Y., and Lipman, Y. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024.

Gokaslan, A. and Cohen, V. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024.

Gulrajani, I. and Hashimoto, T. Likelihood-based diffusion language models. In Advances in Neural Information Processing Systems, 2023.

He, Z., Sun, T., Wang, K., Huang, X., and Qiu, X. Diffusionbert: Improving generative masked language models with diffusion models. In Annual Meeting of the Association for Computational Linguistics, 2022.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Hu, Y., Huang, Q., Tao, M., Zhang, C., and Feng, Y. Can perplexity reflect large language model's ability in long text understanding?, 2024. URL https://arxiv.org/abs/2405.06105.

Huang, F., Tao, T., Zhou, H., Li, L., and Huang, M. On the learning of non-autoregressive transformers. ArXiv, abs/2206.05975, 2022. URL https://api.semanticscholar.org/CorpusID:249626415.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.

Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.

Liu, J., Min, S., Zettlemoyer, L., Choi, Y., and Hajishirzi, H. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. arXiv preprint arXiv:2401.17377, 2024.

Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, pp. 32819–32848. PMLR, 2024.

Lovelace, J., Kishore, V., Chen, Y., and Weinberger, K. Q. Diffusion guided language modeling. arXiv preprint arXiv:2408.04220, 2024.

Luden, I., Giulianelli, M., and Fernández, R. Beyond perplexity: Examining temporal generalization in large language models via definition generation. Computational Linguistics in the Netherlands Journal, 13:205–232, 2024.

Marti, U.-V. and Bunke, H. Using a statistical language model to improve the performance of an hmm-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15(01):65–90, 2001.

Meng, C., Choi, K., Song, J., and Ermon, S. Concrete score matching: Generalized score matching for discrete data. In Advances in Neural Information Processing Systems, 2022.

Nie, S., Zhu, F., Du, C., Pang, T., Liu, Q., Zeng, G., Lin, M., and Li, C. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.

Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.

Ouyang, S., Zhang, J. M., Harman, M., and Wang, M. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation. arXiv preprint arXiv:2308.02828, 2023.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Rastogi, R., Schiff, Y., Hacohen, A., Li, Z., Lee, I., Deng, Y., Sabuncu, M. R., and Kuleshov, V. Semi-parametric inducing point networks and neural processes. arXiv preprint arXiv:2205.11718, 2022.

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Ellen, X., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models, 2024.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.

Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. K. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 06 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=St1giarCHLP.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Sun, Z. and Yang, Y. Difusco: Graph-based diffusion solvers for combinatorial optimization. Advances in Neural Information Processing Systems, 36:3706–3731, 2023.

Sutskever, I. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Xu, M., Geffner, T., Kreis, K., Nie, W., Xu, Y., Leskovec, J., Ermon, S., and Vahdat, A. Energy-based diffusion language models for text generation. arXiv preprint arXiv:2410.21357, 2024.

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

Ye, J., Zheng, Z., Bao, Y., Qian, L., and Gu, Q. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219, 2023.

Zhang, H., Dang, M., Peng, N., and Van den Broeck, G. Tractable control for autoregressive language generation. In International Conference on Machine Learning, pp. 40932–40945. PMLR, 2023.

Zhang, S., Wu, L., Gong, C., and Liu, X. Language rectified flow: Advancing diffusion language generation with probabilistic flows. arXiv preprint arXiv:2403.16995, 2024.

Zhao, L., Ding, X., Yu, L., and Akoglu, L. Improving and unifying discrete & continuous-time discrete denoising diffusion. arXiv preprint arXiv:2402.03701, 2024.

Zheng, K., Chen, Y., Mao, H., Liu, M.-Y., Zhu, J., and Zhang, Q. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.

Zheng, L., Yuan, J., Yu, L., and Kong, L. A reparameterized discrete diffusion model for text generation. ArXiv, abs/2302.05737, 2023.
Appendix

A. Auxiliary Lemma

In this section, we present some technical lemmas for the proof of our main results.

Lemma A.1 (Upper Bound for Multi-tokens Sampling). Let X = (X_1, X_2, ..., X_k) ∈ [N]^k be a random vector following the distribution q, where each component X_i follows the marginal distribution q_i. Define X̃ = (X̃_1, X̃_2, ..., X̃_k) ∼ p as another random vector, where the components X̃_i are sampled independently according to p_i. Let δ = max_i {D_KL(q_i ∥ p_i)}; then, the KL divergence between p and q satisfies the inequality:

$$D_{\mathrm{KL}}(q\|p) \leq (k-1)\log N + k\delta.$$

Proof. Using the chain rule for probabilities, the KL divergence can be written as:

$$D_{\mathrm{KL}}(q\|p) = \mathbb{E}_q\left[\sum_{i=1}^{k} \log\frac{q_i(x_i \mid x_{<i})}{p_i(x_i)}\right],$$

where x_{<i} = (x_1, ..., x_{i−1}). For i = 1, there are no preceding variables, so:

$$\mathbb{E}_q\left[\log\frac{q_1(x_1)}{p_1(x_1)}\right] = D_{\mathrm{KL}}(q_1\|p_1).$$

For i > 1, we bound:

$$\mathbb{E}_q\left[\log\frac{q_i(x_i \mid x_{<i})}{p_i(x_i)}\right] \leq \mathbb{E}_q\left[\log\frac{1}{p_i(x_i)}\right].$$

Decomposing $\mathbb{E}_q\left[\log\frac{1}{p_i(x_i)}\right]$, we get:

$$\mathbb{E}_q\left[\log\frac{1}{p_i(x_i)}\right] = \mathbb{E}_q\left[\log\frac{q_i(x_i)}{p_i(x_i)}\right] + \mathbb{E}_q\left[\log\frac{1}{q_i(x_i)}\right].$$

The first term is D_KL(q_i ∥ p_i), and the second term is −E_q[log q_i(x_i)], which represents the entropy of q_i. Since the entropy of any distribution over [N] is at most log N, we have:

$$-\mathbb{E}_q\left[\log q_i(x_i)\right] \leq \log N.$$

Thus:

$$\mathbb{E}_q\left[\log\frac{q_i(x_i \mid x_{<i})}{p_i(x_i)}\right] \leq D_{\mathrm{KL}}(q_i\|p_i) + \log N.$$

Summing over all i = 1, ..., k, we obtain:

$$D_{\mathrm{KL}}(q\|p) = \sum_{i=1}^{k}\mathbb{E}_q\left[\log\frac{q_i(x_i \mid x_{<i})}{p_i(x_i)}\right] \leq D_{\mathrm{KL}}(q_1\|p_1) + \sum_{i=2}^{k}\left(D_{\mathrm{KL}}(q_i\|p_i) + \log N\right).$$

Reorganizing, we have:

$$D_{\mathrm{KL}}(q\|p) \leq \sum_{i=1}^{k} D_{\mathrm{KL}}(q_i\|p_i) + (k-1)\log N.$$

Since D_KL(q_i ∥ p_i) ≤ δ for all i, the total sum of marginal KL divergences is bounded by kδ. Therefore:

$$D_{\mathrm{KL}}(q\|p) \leq k\delta + (k-1)\log N.$$

This completes the proof.
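As a quick numerical sanity check of Lemma A.1 (not part of the original proof), the snippet below builds a correlated joint distribution q over [N]^2, takes p to be the product of its marginals (so δ = 0 here), and verifies that D_KL(q‖p) ≤ (k − 1) log N + kδ.

```python
import numpy as np

def kl(a, b):
    """D_KL(a || b) for two discrete distributions given as flat arrays."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# A correlated joint distribution q over [N]^2 (k = 2 components).
N, k = 4, 2
rng = np.random.default_rng(0)
joint = rng.dirichlet(np.ones(N * N)).reshape(N, N)

q1, q2 = joint.sum(axis=1), joint.sum(axis=0)  # marginals
p = np.outer(q1, q2)                           # independent sampling with p_i = q_i, so delta = 0

lhs = kl(joint.ravel(), p.ravel())             # D_KL(q || p), i.e. the mutual information here
rhs = (k - 1) * np.log(N) + k * 0.0            # (k - 1) log N + k * delta
print(lhs, "<=", rhs, lhs <= rhs)              # the bound of Lemma A.1 holds
```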
Theoretical Benefit and Limitation of Diffusion Language Model

Lemma
Pn A.2 (Chernoff Bound). Let X1 , . . . , Xn be independent random variables taking values in {0, 1}. Define X =
i=1 Xi as the sum of these independent random variables, and let µ = E[X] denote the expected value of X. Then, the
following probabilistic bounds hold:
δ2 µ
Pr(X ≥ (1 + δ)µ) ≤ e− 2+δ , for δ ≥ 0,
δ2 µ
Pr(X ≤ (1 − δ)µ) ≤ e− 2 , for 0 < δ < 1.
Lemma A.3 (Pinsker's Inequality). Let p and q be two probability distributions. Then, the total variation distance between these distributions satisfies:

D_{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{KL}(p \| q)}.

Specifically, since D_{TV}(p, q) = \tfrac{1}{2}\|p - q\|_1, the following inequality holds:

\|p - q\|_1 \le \sqrt{2 D_{KL}(p \| q)}.

B. Proof for Theorem 4.2


This section provides the complete proof of Theorem 4.2. We first outline the proof strategy, then present the detailed
arguments with supporting lemmas and definitions.
Our proof rests on reformulating TER bounds through KL divergence and carefully analyzing dependencies in the n-gram
setting. The key steps are:

• We establish a connection between the perplexity of the discrete diffusion model and the KL divergence between
generated and data distributions. This involves deriving an upper bound on KL divergence using expected divergence
over reverse processes (Lemma B.2) and decomposing this divergence into per-step conditional KL terms (Lemma B.3).
• We analyze n-gram model dependencies through a rigorous characterization of reverse processes (Definition B.1) and
separators—(n − 1) continuous sampled tokens that create independent intervals (Definition B.4). This leads to a
precise formulation of per-step dependencies using these separators (Definition B.5).
• We derive an upper bound for the KL divergence between generated and data distributions based on the number of
per-step dependencies (Lemmas B.8 and B.9).
• We employ probabilistic bounds to analyze and bound the number of per-step dependencies (Lemmas A.1, B.10
and B.11).
• Finally, we demonstrate the existence of a schedule achieving small KL divergence with O((n−1)/ϵ^n) steps by constructing an efficient sampling schedule using the preceding lemmas (Lemma B.12).

To begin the formal proof, we introduce key definitions for analyzing the discrete diffusion process. Consider a masking schedule αt and a sequence of sampling time steps ti = (N−i)/N. For a sequence of length L, we define an instance of the discretization of the reverse process τ as follows:
Definition B.1 (An Instance of Reverse Process). Let τ = (M1, M2, . . . , MN) represent a reverse process, where Mi = {l_i^j} denotes the set of locations sampled at step i. For a sequence of length L, the sets Mi satisfy the following conditions:

\bigcup_{i \in [N]} M_i = [L] \quad \text{and} \quad \bigcap_{i \in [N]} M_i = \emptyset.

Specifically, we denote M<i as the union of all locations sampled prior to step ti:

M_{<i} = \bigcup_{j < i} M_j.

Under a given instance of the reverse process τ, at each time step ti = (N−i)/N, the set of locations Mi = {l_i^j} is sampled. Let x̃i denote the tokens associated with the locations sampled at time step ti. Given the masking schedule αt, there exist multiple possible instances of the reverse process. We denote the distribution over these instances by REVR(αt, N, L).


Lemma B.2 (KL Divergence Upper Bound for the Masking Schedule). Let q denote the data distribution over sequences of length L, and let p denote the distribution over sequences of length L generated by the reverse model pθ with masking schedule αt and N sampling steps. The KL divergence between q and p satisfies the following upper bound:

D_{KL}(q \| p) \le \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} D_{KL}(q \| p(\cdot \mid \tau)),

where the expectation is taken over the distribution of reverse processes τ induced by REVR(αt, N, L).

Proof. Let X denote the set of all possible generated sequences. Then, the KL divergence between q and p is given by:

D_{KL}(q \| p) = \sum_{x \in X} q(x) \log \frac{q(x)}{p(x)}.

Let h denote the distribution over reverse processes τ ∼ REVR(αt, N, L). Due to the convexity of \log \frac{1}{x}, by applying Jensen's inequality, we can obtain:

\log \frac{1}{p(x)} = \log \frac{1}{\sum_{\tau} h(\tau) \cdot p(x \mid \tau)} \le \sum_{\tau} h(\tau) \cdot \log \frac{1}{p(x \mid \tau)}.

Since the data distribution q is independent of the reverse process τ:

q(x) = q(x \mid \tau), \quad \forall \tau.

Therefore, we have:

\log \frac{q(x)}{p(x)} \le \sum_{\tau} h(\tau) \log \frac{q(x \mid \tau)}{p(x \mid \tau)}.

Substituting this back, we can get the final result:

D_{KL}(q \| p) \le \sum_{\tau} \sum_{x \in X} h(\tau) q(x) \log \frac{q(x \mid \tau)}{p(x \mid \tau)}
= \sum_{\tau} h(\tau) \sum_{x \in X} q(x \mid \tau) \log \frac{q(x \mid \tau)}{p(x \mid \tau)}
= \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} D_{KL}(q(\cdot \mid \tau) \| p(\cdot \mid \tau))
= \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} D_{KL}(q \| p(\cdot \mid \tau)).

We next establish an upper bound for the KL divergence between the distribution of sequences sampled under an instance of
the reverse process τ and the ground-truth distribution in the n-gram setting. To achieve this, we leverage the chain rule
for KL divergence, which allows decomposition of the KL divergence of the entire sequence into a summation of the KL
divergences at each individual step of the process.
Lemma B.3 (KL Divergence Decomposition for the Reverse Process). Consider an instance of reverse process τ = (M1, M2, . . . , MN) ∼ REVR(αt, N, L). Let x̃i denote the set of tokens corresponding to the locations sampled at time step ti, and x̃<i denote the set of tokens sampled at all steps prior to step ti. The KL divergence between the ground-truth distribution q and the distribution pτ generated by the reverse process τ and reverse model pθ satisfies the following decomposition:

D_{KL}(q \| p_\tau) = \sum_{i=1}^{N} \mathbb{E}_{\tilde{x}_{<i}} D_{KL}\big(q(\tilde{x}_i \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})\big).

Proof. Given the reverse process τ, the reverse model samples x̃i sequentially from i = 1 to N, and the probability of sampling x̃i at step ti depends only on the previously sampled tokens x̃<i. Therefore, the distribution pτ(x) can be factorized as:

p_\tau(x) = \prod_{i=1}^{N} p_\tau(\tilde{x}_i \mid \tilde{x}_{<i}).

On the other hand, since the data distribution q is independent of the reverse process τ, it can similarly be decomposed as:

q(x) = \prod_{i=1}^{N} q(\tilde{x}_i \mid \tilde{x}_{<i}).

Applying the chain rule for KL divergence, we obtain:

D_{KL}(q \| p_\tau) = \sum_{i=1}^{N} \mathbb{E}_{\tilde{x}_{<i}} D_{KL}\big(q(\tilde{x}_i \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})\big).

Next, we derive an upper bound for the KL divergence at each step of the reverse process. In the n-gram setting, it is important to note that, at step ti, the tokens x_{l_i^j} and x_{l_i^{j'}} are conditionally independent if there are at least n − 1 consecutive tokens between the positions l_i^j and l_i^{j'} that have already been sampled prior to step i. Under this condition, sampling these two tokens simultaneously incurs no sampling error, as the distributions of x_{l_i^j} and x_{l_i^{j'}} are independent.
To formalize this concept, we introduce a measure of dependencies among the tokens sampled in Mi during the reverse process. For the i-th reverse step in the n-gram setting, the number of dependencies, denoted as DEPn(Mi, M<i), is determined by the structure of M<i. Specifically, it depends on the number of separators in M<i, denoted as SEPn(M<i), as described in the following definition.
Definition B.4 (Number of Separators in a Reverse Step). Consider a reverse process τ = (M1, M2, . . . , MN), where M<i = ∪_{j<i} Mj represents the union of all previously sampled location sets. The set M<i can be partitioned into several contiguous segments. Let S1, S2, · · · , Sk denote the segments containing at least n − 1 consecutive tokens (i.e., |Sj| ≥ n − 1) with the maximum k. We refer to these segments as separators, and denote the number of separators in the set M<i as:

SEP_n(M_{<i}) = \max k \quad \text{s.t.} \quad |S_j| \ge n-1,\; S_j \subset M_{<i},\; \forall j \in [k], \qquad S_j \cap S_{j'} = \emptyset,\; \forall j \ne j'.

Note that if a contiguous segment S in M<i contains at least d(n − 1) consecutive tokens, where d is an integer, then S consists of at least d separators.
Definition B.5 (Number of Dependencies in a Reverse Step). Consider a reverse process τ = (M1, M2, . . . , MN). The separators of M<i divide the sequence into at most SEPn(M<i) + 1 disjoint intervals I1, I2, . . . , Ik. Under the n-gram setting, the sampling within each interval is independent of the sampling in other intervals. The number of dependencies of step ti is defined as |Mi| minus the number of intervals Ip (for p = 1, . . . , k) that contain at least one location in Mi:

DEP_n(M_i, M_{<i}) = |M_i| - \sum_{p=1}^{k} \mathbb{I}\left[I_p \cap M_i \ne \emptyset\right],

where \mathbb{I} is the indicator function.

To illustrate this definition, we provide the following example:


Example B.6 (Computing Dependencies in the n-gram Setting). Consider a token sequence of length 10, denoted as
x = (x1 , x2 , . . . , x10 ), with n = 4. Let the previously sampled location set be M<i = {2, 3, 4, 6, 7} and the current
location set be Mi = {1, 5, 9}.

1. Identify contiguous segments in M<i containing at least n − 1 = 3 consecutive tokens: The set M<i =
{2, 3, 4, 6, 7} forms the following contiguous segments:
{2, 3, 4} and {6, 7}.
Only the segment {2, 3, 4} contains at least n − 1 = 3 consecutive tokens. Thus, we have S1 = {2, 3, 4}. The sequence
is then divided into the following disjoint intervals:
I1 = {1}, I2 = {5, 6, 7, 8, 9, 10}.


2. Determine which intervals overlap with Mi = {1, 5, 9}: Token 1 belongs to interval I1 , and tokens 5 and 9 belong
to interval I2 .

3. Compute the number of dependencies: The number of dependencies is:

DEP_n(M_i, M_{<i}) = |M_i| - \sum_{p=1}^{k} \mathbb{I}[I_p \cap M_i \ne \emptyset] = 3 - 2 = 1.


Figure 3. Illustration of the example for computing dependencies in the n-gram setting. Tokens x2 , x3 , x4 , x6 , x7 (blue) represent
the previously sampled location set M<i , forming two contiguous segments: {2, 3, 4} and {6, 7}. The current sampled locations
x1 , x5 , x9 (red) overlap with disjoint intervals I1 = {1} and I2 = {5, 6, 7, 8, 9, 10}. The number of dependencies is computed as
DEPn (Mi , M<i ) = |Mi | − (number of overlapping intervals) = 3 − 2 = 1.

This example demonstrates how dependencies are computed, highlighting the interaction between previously sampled
locations and the current reverse step. Such formalization is critical for understanding the efficiency and accuracy of discrete
diffusion processes.
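To make the bookkeeping above concrete, the following sketch (our own illustration; the function names are not part of the paper) computes SEPn(M<i) and DEPn(Mi, M<i) directly from Definitions B.4 and B.5 and reproduces Example B.6. When a maximal run of previously sampled locations is longer than n − 1, the code picks one valid set of separators (chunks of exactly n − 1 tokens), which is one admissible choice under Definition B.4.

def count_separators(prev_locs, n):
    # SEP_n(M_{<i}): every maximal run of consecutive sampled locations
    # of length r contributes floor(r / (n - 1)) separators.
    total, run, prev = 0, 0, None
    for x in sorted(prev_locs):
        run = run + 1 if prev is not None and x == prev + 1 else 1
        prev = x
        if run == n - 1:          # a new separator is completed
            total += 1
            run = 0               # start counting the next disjoint segment
    return total

def count_dependencies(new_locs, prev_locs, n, L):
    # DEP_n(M_i, M_{<i}) = |M_i| - #(intervals between separators hit by M_i).
    separators, run, seg, prev = [], 0, [], None
    for x in sorted(prev_locs):
        if prev is not None and x == prev + 1:
            run += 1
            seg.append(x)
        else:
            run, seg = 1, [x]
        prev = x
        if run == n - 1:
            separators.append(list(seg))
            run, seg = 0, []
    blocked = {x for s in separators for x in s}
    intervals, cur = [], []
    for x in range(1, L + 1):
        if x in blocked:
            if cur:
                intervals.append(cur)
                cur = []
        else:
            cur.append(x)
    if cur:
        intervals.append(cur)
    hit = sum(1 for I in intervals if set(I) & set(new_locs))
    return len(new_locs) - hit

# Example B.6: L = 10, n = 4, M_{<i} = {2,3,4,6,7}, M_i = {1,5,9}.
prev, new = {2, 3, 4, 6, 7}, {1, 5, 9}
print(count_separators(prev, n=4))            # 1  (the segment {2,3,4})
print(count_dependencies(new, prev, 4, 10))   # 1  (= 3 - 2)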
Finally, we extend this concept to define the total number of dependencies across an entire reverse process:
Definition B.7 (Number of Dependencies in a Reverse Process). Consider a reverse process τ = (M1, M2, . . . , MN). Under the n-gram setting, the total number of dependencies in the process is defined as the sum of the dependencies across all steps:

DEP_n(\tau) = \sum_{i=1}^{N} DEP_n(M_i, M_{<i}).

Using the definition of DEPn (τ ), we can bound the KL divergence between the distribution of sequences sampled under an
instance of the reverse process and the ground-truth distribution in the n-gram setting.
Lemma B.8 (KL Divergence Upper Bound for the Instance of Reverse Process). Let q denote the data distribution for
sequences of length L, and let p denote the distribution of sequences of length L generated by reverse model pθ via the
reverse process τ . Under Assumption 4.1, the following upper bound holds:

DKL (q∥p(·|τ )) ≤ DEPn (τ ) log |V| + Lϵlearning ,

where V denotes the vocabulary.

Proof. Using Lemma B.3, we have:

D_{KL}(q \| p_\tau) = \sum_{i=1}^{N} \mathbb{E}_{\tilde{x}_{<i}} D_{KL}\big(q(\tilde{x}_i \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})\big).

For each time step ti:

\mathbb{E}_{\tilde{x}_{<i}} D_{KL}\big(q(\tilde{x}_i \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})\big) = \mathbb{E}_{\tilde{x}_{<i}} \sum_{\tilde{x}_i \in V^{|M_i|}} q(\tilde{x}_i \mid \tilde{x}_{<i}) \log \frac{q(\tilde{x}_i \mid \tilde{x}_{<i})}{p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})}.

Given Mi and M<i, the tokens x̃i at step ti can be partitioned into independently sampled token sets x̃i^(1), · · · , x̃i^(m), with kj denoting the size of each token set:

k_j = |\tilde{x}_i^{(j)}|, \quad j \in [m], \qquad m = |M_i| - DEP_n(M_i, M_{<i}).

Using the independence, for each x̃<i, we can decompose the sum into:

\sum_{\tilde{x}_i \in V^{|M_i|}} q(\tilde{x}_i \mid \tilde{x}_{<i}) \log \frac{q(\tilde{x}_i \mid \tilde{x}_{<i})}{p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})}
= \sum_{j=1}^{m} \sum_{\tilde{x}_i^{(j)} \in V^{k_j}} q(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}) \log \frac{q(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i})}{p_\tau(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i})}
= \sum_{j=1}^{m} D_{KL}\big(q(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i})\big).

Under Assumption 4.1, the KL divergence between q and pθ is bounded by:

D_{KL}\big(q_{0|t}(x_0^i \mid x_t) \,\|\, p_\theta(x_0^i \mid x_t)\big) < \epsilon_{learning}, \quad \forall\, t \text{ and } x_t.

By Lemma A.1, we know that:

D_{KL}\big(q(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i})\big) \le (k_j - 1)\log |V| + k_j \epsilon_{learning}.

Substituting back:

\sum_{\tilde{x}_i \in V^{|M_i|}} q(\tilde{x}_i \mid \tilde{x}_{<i}) \log \frac{q(\tilde{x}_i \mid \tilde{x}_{<i})}{p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})} \le \sum_{j=1}^{m} \big((k_j - 1)\log |V| + k_j \epsilon_{learning}\big).

Using the facts that

\sum_{j=1}^{m} (k_j - 1) = |M_i| - m = DEP_n(M_i, M_{<i}), \qquad \sum_{j=1}^{m} k_j = |M_i|,

we can obtain:

\mathbb{E}_{\tilde{x}_{<i}} D_{KL}\big(q(\tilde{x}_i \mid \tilde{x}_{<i}) \,\|\, p_\tau(\tilde{x}_i \mid \tilde{x}_{<i})\big) \le DEP_n(M_i, M_{<i}) \log |V| + |M_i|\, \epsilon_{learning}.

Thus, combined with the definition of DEPn(τ) and pτ = p(·|τ), we can draw the final conclusion:

D_{KL}(q \| p(\cdot \mid \tau)) \le \sum_{i=1}^{N} \big(DEP_n(M_i, M_{<i}) \log |V| + |M_i|\, \epsilon_{learning}\big) = DEP_n(\tau) \log |V| + L\, \epsilon_{learning}.

The above Lemma directly leads to the bound for the KL divergence between the distribution of sequences generated by the
reverse model with a given masking schedule and the ground-truth distribution in the n-gram setting.
Lemma B.9 (KL Divergence Upper Bound for a Masking Schedule). Let q denote the data distribution over sequences of length L, and let p denote the distribution over sequences of length L generated by the reverse model pθ with masking schedule αt and N reverse steps. Under Assumption 4.1, the KL divergence between q and p satisfies the following upper bound:

D_{KL}(q \| p) \le \log |V| \sum_{i=1}^{N} \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) + L\, \epsilon_{learning}.

Proof. By Lemma B.2, we can obtain:

D_{KL}(q \| p) \le \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} D_{KL}(q \| p(\cdot \mid \tau)).

Applying Lemma B.8 to the instances of the reverse process, we can conclude that:

D_{KL}(q \| p) \le \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} \sum_{i=1}^{N} DEP_n(M_i, M_{<i}) \log |V| + L\, \epsilon_{learning}
= \log |V| \sum_{i=1}^{N} \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) + L\, \epsilon_{learning}.

For the final estimate, we need an upper bound on the expected number of dependencies at each reverse step. First, we use the Chernoff bound to control the number of separators and new locations at each reverse step for a given masking schedule.
Lemma B.10 (Bounds on Separator and New Location Count at Each Reverse Step). Given a sequence of length L, a masking schedule αt, and N reverse steps. Assume that L is divisible by n − 1. Given the time step ti = (N−i)/N, let NEW denote the number of locations sampled at step ti, and SEPn denote the number of separators in the previously sampled locations. Under the n-gram setting, the following bounds hold for NEW and SEPn:

\Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)}\right) \le e^{-\frac{L p_i^{n-1}}{8(n-1)}},

\Pr(NEW \ge 2L\delta_i) \le e^{-\frac{L\delta_i}{3}},

where pi = α_{t_{i−1}} and δi = α_{t_i} − α_{t_{i−1}}.

Proof. Given a masking schedule αt, using the expression of the true reverse process in Equation (2) and α1 = 0, we can compute the probability p(i) of a token being sampled at time step ti to be:

p(i) = \frac{\alpha_{t_i} - \alpha_{t_{i-1}}}{1 - \alpha_{t_{i-1}}} \cdot \prod_{j=1}^{i-1} \frac{1 - \alpha_{t_j}}{1 - \alpha_{t_{j-1}}} = \alpha_{t_i} - \alpha_{t_{i-1}} = \delta_i.

Therefore, δi is the probability of a location being sampled at time step ti. Summing up δi, we know that p_i = \sum_{j=1}^{i-1} \delta_j is the probability of a location being sampled prior to time step ti.

To derive a bound for SEPn, we partition the sequence into L/(n−1) intervals, each of length n − 1. For a given interval, the probability that all locations within the interval have been sampled prior to step ti is p_i^{n-1}. Define Xj = 1 if the locations in the j-th interval have all been sampled prior to ti, and Xj = 0 otherwise. The random variables X1, X2, · · · , X_{L/(n−1)} are independent and satisfy the following expectation:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} \sum_{j=1}^{L/(n-1)} X_j = \frac{L p_i^{n-1}}{n-1}.

By the definition of SEPn, we know that:

SEP_n \ge \sum_{j=1}^{L/(n-1)} X_j.

Applying Lemma A.2 to the sum of Xj, we derive:

\Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)}\right) \le \Pr\left(\sum_{j=1}^{L/(n-1)} X_j \le \frac{L p_i^{n-1}}{2(n-1)}\right) \le e^{-\frac{L p_i^{n-1}}{8(n-1)}}.

Next, we consider the bound for NEW. Given that the sequence contains L locations and the probability of sampling any specific location at step ti is δi, the expected number of new locations sampled at ti is given by:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} NEW = L\delta_i.

Since the sampling of each location occurs independently, applying Lemma A.2, we have:

\Pr(NEW \ge 2L\delta_i) \le e^{-\frac{L\delta_i}{3}}.
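These two tail bounds can also be checked empirically. The sketch below is an illustrative aside (not part of the proof); it assumes NumPy and arbitrary values of L, n, pi, and δi, simulates the locations revealed before and at step ti independently with those probabilities, counts SEPn and NEW as defined above, and compares the empirical tail probabilities with the stated bounds.

import numpy as np

rng = np.random.default_rng(1)
L, n = 600, 3
p_i, delta_i = 0.4, 0.01          # assumed values of p_i and delta_i
trials = 2000

sep_low, new_high = 0, 0
sep_thresh = L * p_i ** (n - 1) / (2 * (n - 1))
for _ in range(trials):
    prev = rng.random(L) < p_i                               # sampled before step t_i
    new = (~prev) & (rng.random(L) < delta_i / (1 - p_i))    # sampled at step t_i (marginal prob. delta_i)
    # count separators: floor(run / (n - 1)) over maximal runs of prev
    sep, run = 0, 0
    for v in prev:
        run = run + 1 if v else 0
        if run == n - 1:
            sep += 1
            run = 0
    sep_low += sep <= sep_thresh
    new_high += new.sum() >= 2 * L * delta_i

print("Pr[SEP_n <= L p^(n-1)/(2(n-1))]:", sep_low / trials,
      "bound:", np.exp(-L * p_i ** (n - 1) / (8 * (n - 1))))
print("Pr[NEW >= 2 L delta]:", new_high / trials,
      "bound:", np.exp(-L * delta_i / 3))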

Using the above lemma, we split the estimate of the number of dependencies into three cases and bound each case separately, using estimations tailored to each case.
Lemma B.11 (Upper Bound for the Expectation of Dependencies at Each Reverse Step). Given a sequence of length L, a masking schedule αt, and N reverse steps. Assume Lδi > 1; then the expected number of dependencies at time step ti = (N−i)/N satisfies:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) \le \frac{9}{3 + L\delta_i} + \frac{C(n-1)L\delta_i^2}{p_i^{n-1}},

where pi = α_{t_{i−1}}, δi = α_{t_i} − α_{t_{i−1}}, and C is a constant.

Proof. By Lemma B.10, at step ti, the following bounds hold:

\Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)}\right) \le e^{-\frac{L p_i^{n-1}}{8(n-1)}},

\Pr(NEW \ge 2L\delta_i) \le e^{-\frac{L\delta_i}{3}}.

Since DEPn(Mi, M<i) ≥ 0, its expectation can be decomposed into three components, writing \mathbb{E}_{\tau} for \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)}:

\mathbb{E}_{\tau} DEP_n(M_i, M_{<i})
= \Pr(NEW \ge 2L\delta_i) \cdot \mathbb{E}_{\tau}\big[DEP_n(M_i, M_{<i}) \,\big|\, NEW \ge 2L\delta_i\big]   (Case 1)
+ \Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \cdot \mathbb{E}_{\tau}\left[DEP_n(M_i, M_{<i}) \,\middle|\, SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right]   (Case 2)
+ \Pr\left(SEP_n > \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \cdot \mathbb{E}_{\tau}\left[DEP_n(M_i, M_{<i}) \,\middle|\, SEP_n > \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right]   (Case 3)

We estimate these three cases separately.


Case 1: NEW ≥ 2Lδi.
By the definitions of DEPn(Mi, M<i) and NEW, we have:

DEP_n(M_i, M_{<i}) \le |M_i| = NEW.

Substituting this into the estimation, we obtain:

\Pr(NEW \ge 2L\delta_i) \cdot \mathbb{E}_{\tau}\big[DEP_n(M_i, M_{<i}) \,\big|\, NEW \ge 2L\delta_i\big] \le \Pr(NEW \ge 2L\delta_i) \cdot \mathbb{E}_{\tau}\big[NEW \,\big|\, NEW \ge 2L\delta_i\big].

Since DEPn(Mi, M<i) ≥ 0, the expectation can be expressed as an integral of the tail probability:

\mathbb{E}_{\tau}\big[NEW \,\big|\, NEW \ge 2L\delta_i\big] = \int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x \mid NEW \ge 2L\delta_i)\, dx.

It directly follows that:

\Pr(NEW \ge 2L\delta_i) \cdot \mathbb{E}_{\tau}\big[NEW \,\big|\, NEW \ge 2L\delta_i\big]
= \int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x \mid NEW \ge 2L\delta_i) \Pr(NEW \ge 2L\delta_i)\, dx
= \int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x)\, dx.

Using the same trick as in Lemma B.10, applying Lemma A.2, we can derive the bound for the probability Pr(NEW ≥ x) as:

\Pr(NEW \ge x) \le e^{-\frac{(x - L\delta_i)^2}{x + L\delta_i}}.

Note that NEW ≤ L, so we only need to consider 2δi ≤ 1. In this case, we have:

\int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x)\, dx \le \int_{2L\delta_i}^{L} e^{-\frac{(x - L\delta_i)^2}{x + L\delta_i}}\, dx.

Let y = x − Lδi ∈ [Lδi, L(1 − δi)]; the integral can be rewritten as:

\int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x)\, dx \le \int_{L\delta_i}^{L(1-\delta_i)} e^{-\frac{y^2}{y + 2L\delta_i}}\, dy.

Observing that y + 2Lδi ≤ 3y, we can obtain:

\int_{2L\delta_i}^{+\infty} \Pr(NEW \ge x)\, dx \le \int_{L\delta_i}^{L(1-\delta_i)} e^{-\frac{y}{3}}\, dy = 3\left(e^{-\frac{L\delta_i}{3}} - e^{-\frac{L(1-\delta_i)}{3}}\right) = 3 e^{-\frac{L\delta_i}{3}}\left(1 - e^{-\frac{L(1-2\delta_i)}{3}}\right).

Using the fact that e^{-x} \le \frac{1}{1+x} for x ≥ 0, we have the upper bound:

3 e^{-\frac{L\delta_i}{3}}\left(1 - e^{-\frac{L(1-2\delta_i)}{3}}\right) \le 3 e^{-\frac{L\delta_i}{3}} \le \frac{9}{3 + L\delta_i}.

Combining the above results, we know that:

\Pr(NEW \ge 2L\delta_i) \cdot \mathbb{E}_{\tau}\big[DEP_n(M_i, M_{<i}) \,\big|\, NEW \ge 2L\delta_i\big] \le \frac{9}{3 + L\delta_i}.

Case 2: SEP_n \le \frac{L p_i^{n-1}}{2(n-1)} and NEW < 2Lδi.
Similar to Case 1, we have:

DEP_n(M_i, M_{<i}) \le NEW < 2L\delta_i,

so the conditional expectation also satisfies:

\mathbb{E}_{\tau}\left[DEP_n(M_i, M_{<i}) \,\middle|\, SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right] < 2L\delta_i.

Using the probability bound, it follows that:

\Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \le \Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)}\right) \le e^{-\frac{L p_i^{n-1}}{8(n-1)}}.

Since e^{-x} \le \frac{1}{1+x} for x ≥ 0:

e^{-\frac{L p_i^{n-1}}{8(n-1)}} \le \frac{8(n-1)}{L p_i^{n-1} + 8(n-1)}.

Combining these results, we obtain:

\Pr\left(SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \cdot \mathbb{E}_{\tau}\left[DEP_n(M_i, M_{<i}) \,\middle|\, SEP_n \le \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right] \le \frac{16(n-1)L\delta_i}{L p_i^{n-1} + 8(n-1)}.

Case 3: SEP_n > \frac{L p_i^{n-1}}{2(n-1)} and NEW < 2Lδi.
Clearly, we have:

\Pr\left(SEP_n > \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \le 1.

Given a, b, let \mathbb{E}_{a,b} DEP_n(M_i, M_{<i}) denote the expectation of DEPn(Mi, M<i) under the condition SEPn = a and NEW = b:

\mathbb{E}_{a,b} DEP_n(M_i, M_{<i}) = \mathbb{E}_{\tau}\big[DEP_n(M_i, M_{<i}) \,\big|\, SEP_n = a,\, NEW = b\big].

Since all the locations are sampled independently, and DEPn(Mi, M<i) depends only on the relative positions of the separators in M<i and the new locations in Mi, the expectation \mathbb{E}_{a,b} DEP_n(M_i, M_{<i}) only depends on the ordering of separators and new locations.
Assume x1, · · · , x_{a+b} are a + b positions (not locations) in order. We can regard the process of ordering separators and new locations as the process of choosing b positions uniformly at random from the xj. For 1 ≤ j ≤ a + b − 1, define Xj = 1 if xj and x_{j+1} are both new locations, and Xj = 0 otherwise. By the definition of DEPn(Mi, M<i), we can obtain:

DEP_n(M_i, M_{<i}) = \sum_{j=1}^{a+b-1} X_j.

Since the b new locations are chosen randomly, the probability of Xj = 1 can be calculated as:

\Pr(X_j = 1) = \frac{\binom{a+b-2}{b-2}}{\binom{a+b}{b}} = \frac{b(b-1)}{(a+b)(a+b-1)}.

Therefore, the expectation of Xj is:

\mathbb{E} X_j = \frac{b(b-1)}{(a+b)(a+b-1)}.

Summing up, we have:

\mathbb{E}_{a,b} DEP_n(M_i, M_{<i}) = \mathbb{E} \sum_{j=1}^{a+b-1} X_j = (a+b-1)\, \mathbb{E} X_1 = \frac{b(b-1)}{a+b}.

Since a > \frac{L p_i^{n-1}}{2(n-1)} and b < 2Lδi, we can derive the upper bound for any such a, b:

\frac{b(b-1)}{a+b} \le \frac{b(b-1)}{\frac{L p_i^{n-1}}{2(n-1)} + b} \le \frac{2L\delta_i(2L\delta_i - 1)}{\frac{L p_i^{n-1}}{2(n-1)} + 2L\delta_i} \le \frac{8(n-1)L\delta_i^2}{p_i^{n-1} + 4(n-1)\delta_i}.

Since this holds for all such a and b, we can obtain:

\Pr\left(SEP_n > \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right) \cdot \mathbb{E}_{\tau}\left[DEP_n(M_i, M_{<i}) \,\middle|\, SEP_n > \frac{L p_i^{n-1}}{2(n-1)},\, NEW < 2L\delta_i\right]
= \sum_{a > \frac{L p_i^{n-1}}{2(n-1)},\; b < 2L\delta_i} \Pr(SEP_n = a,\, NEW = b) \cdot \mathbb{E}_{a,b} DEP_n(M_i, M_{<i})
\le \frac{8(n-1)L\delta_i^2}{p_i^{n-1} + 4(n-1)\delta_i}.

Summarizing the three cases above, we obtain:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) \le \frac{9}{3 + L\delta_i} + \frac{16(n-1)L\delta_i}{L p_i^{n-1} + 8(n-1)} + \frac{8(n-1)L\delta_i^2}{p_i^{n-1} + 4(n-1)\delta_i}.

Under the assumption Lδi ≥ 1, it is easy to see that:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) \le \frac{9}{3 + L\delta_i} + \frac{16(n-1)\delta_i}{p_i^{n-1}} + \frac{8(n-1)L\delta_i^2}{p_i^{n-1}}
\le \frac{9}{3 + L\delta_i} + \frac{16(n-1)L\delta_i^2}{p_i^{n-1}} + \frac{8(n-1)L\delta_i^2}{p_i^{n-1}}
\le \frac{9}{3 + L\delta_i} + \frac{C(n-1)L\delta_i^2}{p_i^{n-1}},

where C = 24 is a constant.

Finally, we can derive the upper bound for the KL divergence between the distribution of sequences generated by the reverse
model and the ground-truth distribution in the n-gram setting.
Lemma B.12 (Efficient Sampling with Small KL Divergence). Let q denote the data distribution over sequences of length L, and let p denote the distribution over sequences of length L generated by the reverse model pθ with a masking schedule αt and N reverse steps. Assume that pθ satisfies Assumption 4.1. For any ϵ > 0, there exists a masking schedule αt such that, for L \ge \frac{3C(n-1)}{\epsilon^{n+1/2}}, with N = O\!\left(\frac{n-1}{\epsilon^n}\right) sampling steps, the KL divergence between q and p satisfies:

\frac{D_{KL}(q \| p)}{L \log |V|} \le 4\epsilon + \frac{\epsilon_{learning}}{\log |V|}.

Proof. By Lemma B.9, we know that:

D_{KL}(q \| p) \le \log |V| \sum_{i=1}^{N} \mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} DEP_n(M_i, M_{<i}) + L\, \epsilon_{learning}.

Note that step t1 can be bounded directly using Lemma A.1: reviewing the proof above, it is easy to see that the term DEPn(M1, M<1) log |V| can be replaced by (|M1| − 1) log |V|, where V stands for the vocabulary. By the definition of δi, we know that:

\mathbb{E}_{\tau \sim REVR(\alpha_t, N, L)} (|M_1| - 1) \log |V| = (\delta_1 L - 1) \log |V|.

Applying Lemma B.11 to DEPn(Mi, M<i) for i ≥ 2, if Lδi ≥ 1, we can obtain:

D_{KL}(q \| p) \le \delta_1 L \log |V| + \log |V| \sum_{i=2}^{N} \left(\frac{9}{3 + L\delta_i} + \frac{C(n-1)L\delta_i^2}{p_i^{n-1}}\right) + L\, \epsilon_{learning}.

By the definition of pi, we know that p2 = δ1. For any small ϵ > 0, consider the following masking schedule:

\delta_1 = \epsilon, \qquad \delta_i = \delta = \frac{\epsilon^n}{C(n-1)}, \qquad p_i = \delta_1 + (i-2)\delta, \quad \forall i \ge 2.

Then, for L ≥ 1/δ, the KL divergence can be bounded by:

\frac{D_{KL}(q \| p)}{L \log |V|} \le \epsilon + \frac{9(N-1)}{L(3 + L\delta)} + \sum_{i=2}^{N} \frac{C(n-1)\delta^2}{p_i^{n-1}} + \frac{\epsilon_{learning}}{\log |V|}
= \epsilon + \frac{9(1 - \delta_1)}{L\delta(3 + L\delta)} + \frac{C(n-1)\delta^2}{\delta_1^{n-1}} + \sum_{i=1}^{N-2} \frac{C(n-1)\delta^2}{(\delta_1 + i\delta)^{n-1}} + \frac{\epsilon_{learning}}{\log |V|}
\le \epsilon + \frac{9}{L\delta(3 + L\delta)} + \frac{C(n-1)\delta^2}{\delta_1^{n-1}} + \sum_{i=1}^{N-2} \frac{C(n-1)\delta^2}{(\delta_1 + i\delta)^{n}} + \frac{\epsilon_{learning}}{\log |V|}.

By simple calculations, we know that:

\frac{9}{L\delta(3 + L\delta)} \le \epsilon, \quad \text{if } L \ge \frac{3}{\delta \epsilon^{1/2}}.

It is clear that δ ≤ 1, so:

\frac{C(n-1)\delta^2}{\delta_1^{n-1}} \le \epsilon\delta \le \epsilon.

Since x^{-n} is convex on (0, +\infty), the accumulated sum can be bounded by:

\sum_{i=1}^{N-2} \frac{C(n-1)\delta^2}{(\delta_1 + i\delta)^n} = C(n-1)\delta^{2-n} \sum_{i=1}^{N-2} \frac{1}{\left(\frac{\delta_1}{\delta} + i\right)^n}
\le C(n-1)\delta^{2-n} \int_{x=0}^{+\infty} \frac{1}{\left(\frac{\delta_1}{\delta} + x\right)^n}\, dx
= C(n-1)\delta^{2-n} \cdot \frac{1}{n-1}\left(\frac{\delta}{\delta_1}\right)^{n-1}
= \frac{C\delta}{\delta_1^{n-1}}
\le \epsilon.

Combining the above, we have:

\frac{D_{KL}(q \| p)}{L \log |V|} \le 4\epsilon + \frac{\epsilon_{learning}}{\log |V|}.

Meanwhile, the number of sampling steps is:

N = 1 + \frac{1 - \delta_1}{\delta} = O\!\left(\frac{n-1}{\epsilon^n}\right),

and the lower bound for L is:

L \ge \frac{3}{\delta \epsilon^{1/2}} = \frac{3C(n-1)}{\epsilon^{n+1/2}}.
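For concreteness, the masking schedule constructed in this proof is easy to write down explicitly. The short sketch below is an illustrative aside; it assumes the constant C = 24 from Lemma B.11 and arbitrary choices of n and ϵ, and computes δ1, δ, the resulting number of reverse steps N, the cumulative values α_{t_i}, and the lower bound on L used in the proof.

import math

def mdm_schedule(n, eps, C=24):
    """Schedule from the proof of Lemma B.12: delta_1 = eps,
    delta_i = eps**n / (C*(n-1)) for i >= 2 (illustrative sketch)."""
    delta1 = eps
    delta = eps ** n / (C * (n - 1))
    N = 1 + math.ceil((1 - delta1) / delta)          # number of reverse steps
    # cumulative masking schedule alpha_{t_0} = 0, alpha_{t_i} = delta1 + (i-1)*delta, capped at 1
    alphas = [0.0, delta1] + [min(1.0, delta1 + i * delta) for i in range(1, N)]
    L_min = math.ceil(3 / (delta * math.sqrt(eps)))  # lower bound on L used in the proof
    return N, alphas, L_min

for eps in (0.2, 0.1, 0.05):
    N, _, L_min = mdm_schedule(n=3, eps=eps)
    print(f"eps={eps}: N={N} (~O((n-1)/eps^n)), L must be at least {L_min}")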


Combining the above lemmas, we can prove Theorem 4.2 by breaking the expression of log TER(p) into two parts.
Theorem B.13 (TER Bounds for n-Gram Language Generation). For any n-gram language q and any ϵ > 0, let pθ denote the reverse model and L denote the sequence length. The distribution over sequences generated by pθ is denoted as p. For any L > O\!\left(\frac{n-1}{\epsilon^{n+0.5}}\right), under Assumption 4.1, there exists a masking schedule αt such that, with N = O\!\left(\frac{n-1}{\epsilon^n}\right) sampling steps, the perplexity of the MDM is upper-bounded by:

\log TER(p) \le \log TER(q) + \epsilon_{learning} + 4\epsilon \log |V|.

Proof. By Lemma B.12, for any L > O\!\left(\frac{n-1}{\epsilon^{n+0.5}}\right), there exists a masking schedule αt with N = O\!\left(\frac{n-1}{\epsilon^n}\right) sampling steps satisfying:

\frac{D_{KL}(q \| p)}{L \log |V|} \le 4\epsilon + \frac{\epsilon_{learning}}{\log |V|}.

In other words:

\frac{1}{L}\, \mathbb{E}_{x \sim q} \log \frac{q(x)}{p(x)} \le 4\epsilon \log |V| + \epsilon_{learning}.

By the definition of TER, we have:

\log TER(p) = \mathbb{E}_{x \sim q}\left[-\frac{\log p(x)}{|x|}\right] = \mathbb{E}_{x \sim q}\, \frac{1}{L}\left[-\log q(x) + \log \frac{q(x)}{p(x)}\right].

Note that:

\log TER(q) = \mathbb{E}_{x \sim q}\left[-\frac{\log q(x)}{|x|}\right] = \mathbb{E}_{x \sim q}\left[-\frac{1}{L}\log q(x)\right].

We can obtain:

\log TER(p) \le \log TER(q) + \epsilon_{learning} + 4\epsilon \log |V|.

C. Proof for Theorem 4.3 and Theorem 4.4


C.1. Proof for Theorem 4.3
In this section, we aim to derive the upper bound for the SER of generated sequences with sufficient reverse steps. First, we argue that, given a masking schedule αt, with sufficiently many steps, the probability of sampling multiple locations in the sequence at the same time can be made very low.
Lemma C.1 (Low Probability of Simultaneous Sampling with Sufficient Steps). Given a sequence of length L and a
masking schedule αt . For any ϵ > 0, there exists N0 , such that for any N ≥ N0 , with N reverse steps, the probability pmul
of sampling multiple locations in the sequence at the same time satisfies:

pmul < ϵ.

Proof. By Lemma B.10, we know that the probability of a location being sampled at time step ti = (N−i)/N is:

\delta_i = \alpha_{t_i} - \alpha_{t_{i-1}} = \alpha_{\frac{N-i}{N}} - \alpha_{\frac{N-i+1}{N}}.

Since all the locations are sampled independently, for two distinct locations i ≠ j in the sequence, the probability that i and j are sampled simultaneously is:

p_{i,j} = \sum_{k=1}^{N} \delta_k^2.

Summing up p_{i,j} over all pairs, the probability of having two locations sampled simultaneously can be bounded by:

p_{mul} \le \frac{L(L-1)}{2} \cdot \sum_{i=1}^{N} \delta_i^2.

Since αt is continuous on [0, 1], we know that it is uniformly continuous. Therefore, for any ϵ > 0, there exists N0 > 0 that satisfies:

|\alpha_x - \alpha_y| < \frac{2\epsilon}{L(L-1)}, \quad \forall x, y \in [0, 1],\; |x - y| < \frac{1}{N_0}.

In this case, for N > N0, we know that:

|\delta_i| = \left|\alpha_{\frac{N-i}{N}} - \alpha_{\frac{N-i+1}{N}}\right| < \frac{2\epsilon}{L(L-1)}, \quad \forall i \in [N].

Combining this with the fact that \sum_{i=1}^{N} \delta_i = 1, we can obtain:

p_{mul} \le \frac{L(L-1)}{2} \cdot \sum_{i=1}^{N} \delta_i \cdot \max_{j \in [N]} \delta_j < \epsilon.

Next, we consider the SER increase due to the learning error. Specifically, we only investigate the case where all the
locations are sampled at different steps.
Lemma C.2 (Accurate Step-by-Step Generation with Low Learning Error). Let q denote any HMM, and let pθ represent the reverse model under an arbitrary masking schedule, where L is the sequence length. Let p denote the distribution over sequences generated by pθ. Under Assumption 4.1 with a learning error ϵlearning < δ/L for some δ > 0, and given an instance of the reverse process τ = (M1, M2, · · · , MN) with |Mi| ≤ 1, let pacc denote the probability of generating a valid sequence. Then pacc satisfies:

p_{acc} \ge e^{-\delta}.

Proof. Since |Mi| ≤ 1, we only need to consider the steps where exactly one token is sampled. Let xt denote the previously sampled tokens (the current partially masked sequence), and x̃t denote the token sampled at the current step. If xt can later form a valid sequence, let Xt denote the set of valid choices for x̃t. In other words, if x̃t ∈ Xt, then the combination of xt and x̃t can still form a valid sequence, or more intuitively:

q_{0|t}(\tilde{x}_t \mid x_t) > 0.

Under Assumption 4.1, we know that:

D_{KL}\big(q_{0|t}(\tilde{x}_t \mid x_t) \,\|\, p_\theta(\tilde{x}_t \mid x_t)\big) < \epsilon_{learning}.

Since it is assumed that 0 \log 0 = 0, we have:

\sum_{\tilde{x}_t \in X_t} q_{0|t}(\tilde{x}_t \mid x_t) \log \frac{q_{0|t}(\tilde{x}_t \mid x_t)}{p_\theta(\tilde{x}_t \mid x_t)} < \epsilon_{learning}.

Equivalently, we have:

-\epsilon_{learning} < \sum_{\tilde{x}_t \in X_t} q_{0|t}(\tilde{x}_t \mid x_t) \log \frac{p_\theta(\tilde{x}_t \mid x_t)}{q_{0|t}(\tilde{x}_t \mid x_t)}.

Due to the concavity of \log x, by Jensen's inequality, we can obtain:

\sum_{\tilde{x}_t \in X_t} q_{0|t}(\tilde{x}_t \mid x_t) \log \frac{p_\theta(\tilde{x}_t \mid x_t)}{q_{0|t}(\tilde{x}_t \mid x_t)} \le \log\left(\sum_{\tilde{x}_t \in X_t} q_{0|t}(\tilde{x}_t \mid x_t) \cdot \frac{p_\theta(\tilde{x}_t \mid x_t)}{q_{0|t}(\tilde{x}_t \mid x_t)}\right) = \log \sum_{\tilde{x}_t \in X_t} p_\theta(\tilde{x}_t \mid x_t).

Therefore, the probability that each step remains valid satisfies:

\sum_{\tilde{x}_t \in X_t} p_\theta(\tilde{x}_t \mid x_t) \ge e^{-\epsilon_{learning}} \ge e^{-\delta/L}.

Since there are L locations in the sequence, the probability of generating a valid sequence is bounded by:

p_{acc} \ge \left(e^{-\delta/L}\right)^{L} = e^{-\delta}.


Combining the above lemmas, we can derive the upper bound of SER by taking sufficient reverse steps and small learning
error.
Theorem C.3 (Accurate Generation of HMM with Sufficient Steps). Let q denote any HMM, and let pθ represent the reverse model under an arbitrary masking schedule, where L is the sequence length. Let p denote the distribution over sequences generated by pθ. Under Assumption 4.1 with a learning error ϵlearning < O(δ/L), and given a sufficient number of reverse steps, the sequence error rate SER(p) of the generated text satisfies

SER(p) \le \delta.

Proof. For δ > 0, we know that:

1 - \delta < e^{-\delta}.

By Lemma C.1, given the masking schedule αt, there exists N0 such that, for N > N0 reverse steps, the probability of sampling multiple locations in the sequence at the same time is bounded by:

p_{mul} < 1 - \frac{1 - \delta}{e^{-\delta}}.

In other words, the probability of sampling all the locations at different steps is at least \frac{1 - \delta}{e^{-\delta}}. By Lemma C.2, for each reverse process in which all the locations are sampled at different steps, the probability of generating a valid sequence is lower-bounded by:

p_{acc} \ge e^{-\delta}.

Therefore, the sequence error rate SER satisfies:

SER(p) \le 1 - \frac{1 - \delta}{e^{-\delta}} \cdot e^{-\delta} = \delta.

C.2. Proof for Theorem 4.4


In this section, we aim to find an example (Example C.7) with a high sequence error rate. To present this example, we begin with a special class of languages defined under the interval setting:
Definition C.4 (Interval Setting). Consider a sequence of length L, which is divided equally into M intervals I1, I2, · · · , IM, each of length l = L/M ≥ 2. Given a masking schedule αt, an instance of the reverse process τ = (M1, M2, · · · , MN) is defined by Definition B.1. For any two locations within different intervals, their corresponding tokens are independent of each other. In other words, let x̃i^(j) denote the new tokens in Mi ∩ Ij, let x̃<i^(j) denote the previously sampled tokens in M<i ∩ Ij, and let p denote the distribution over sequences generated by the reverse model with reverse process τ; then for time step ti = (N−i)/N:

p(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}) = p(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}^{(j)}).

In this case, we have:

p(x) = \prod_{j=1}^{M} p(x^{(j)}) = \prod_{j=1}^{M} \prod_{i=1}^{N} p(\tilde{x}_i^{(j)} \mid \tilde{x}_{<i}^{(j)}).

We denote the above setting as Inter(L, l, αt).

Under the interval setting defined above, we can control the probability of sampling simultaneously in the same interval.
Lemma C.5 (Simultaneous Sampling Probability for an Interval). Consider the interval setting Inter(L, l, αt). For each interval Ij of length l, let hj denote the probability that all the locations in Ij are sampled at different time steps. Then, hj can be bounded by:

h_j \le 1 - \frac{1}{N}.

Proof. Let δi = α_{t_i} − α_{t_{i−1}}. Similar to Lemma B.10, δi is the probability of a location being sampled at time step ti. Take the first location in Ij, denote it as X1, and let X2, · · · , Xl denote the remaining l − 1 locations in Ij. If X1 is sampled at step ti, then X2, · · · , Xl must be sampled at time steps other than ti. Therefore, hj can be bounded by:

h_j \le \sum_{i=1}^{N} \delta_i (1 - \delta_i)^{l-1} \le \sum_{i=1}^{N} \delta_i (1 - \delta_i).

Let f(δ) = δ(1 − δ). Note that we have:

f''(\delta) = -2 \le 0,

which indicates that f(δ) is concave. Using Jensen's inequality, we can obtain:

h_j \le \sum_{i=1}^{N} f(\delta_i) \le N f\!\left(\frac{1}{N}\right) = 1 - \frac{1}{N}.

Using the above lemma, if we assume that sampling simultaneously within one interval increases SER, then we can derive a lower bound for SER(p).
Lemma C.6 (SER Bound for the Interval Setting). Consider the interval setting Inter(L, l, αt). Assume that sampling simultaneously in the same interval introduces an error with probability at least pe, and that other actions do not reduce error. In other words, if two locations in an interval are both sampled at step ti, then there is a probability of at least pe that the sequence will not be accurate afterwards. In this case, let p denote the distribution over sequences of length L generated by the reverse model with masking schedule αt and N reverse steps. We have the following bound for SER:

SER(p) \ge 1 - \left(1 - \frac{p_e}{N}\right)^{L/l}.

Proof. By Lemma C.5, for each interval Ij, the probability p_{error}^{(j)} of generating an error in Ij is lower-bounded by:

p_{error}^{(j)} \ge p_e (1 - h_j) \ge \frac{p_e}{N}.

Due to the independence between different intervals, the sequence error rate SER(p) can be calculated as:

SER(p) = 1 - \prod_{j=1}^{M} \left(1 - p_{error}^{(j)}\right).

Therefore, we have the bound:

SER(p) \ge 1 - \left(1 - \frac{p_e}{N}\right)^{L/l}.
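To see how this bound behaves when the number of sampling steps scales only linearly with the sequence length, the short sketch below (an illustrative aside; pe, l, and C are assumed values, not constants from the paper) evaluates 1 − (1 − pe/N)^{L/l} with N = C·L and shows that the lower bound does not vanish as L grows.

pe, l, C = 0.25, 5, 1 / 12          # assumed error prob., interval length, steps-per-token ratio
for L in (100, 1000, 10000, 100000):
    N = C * L
    ser_lb = 1 - (1 - pe / N) ** (L / l)
    print(f"L={L:>6}: SER lower bound >= {ser_lb:.3f}")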

To show that the above setting is reasonable and achievable, we give the following example, which is later shown to be the
example we are looking for.
Example C.7. Consider a sequence of length L, which is divided equally into M intervals, each of length l = L/M. Denote the k-th interval as Ik = [1 + (k − 1)l, kl]. The tokens xi, 1 ≤ i ≤ L, in the sequence satisfy the following rules:

• Each xi takes values in the set A = {a1, · · · , a_{2^{l−1}}}. For each aj ∈ A, there corresponds a vector vj = (v_{j,1}, · · · , v_{j,l−1}) ∈ {0, 1}^{l−1}, where (v_{j,1} · · · v_{j,l−1})_2 is the binary expression of j − 1. Thus, each random variable xi corresponds to a random vector (v_1^{(i)}, · · · , v_{l−1}^{(i)}), where v_j^{(i)} ∈ {0, 1} for j = 1, · · · , l − 1.

• For i ∈ Ik and j ∈ Is, if k ≠ s, then xi and xj are independent.

• For i, j ∈ Ik such that i < j, let i′ = i − (k − 1)l and j′ = j − (k − 1)l, so that xi and xj are the i′-th and j′-th elements in interval Ik, respectively. The corresponding binary components satisfy v_{j′−1}^{(i)} = v_{i′}^{(j)} ∼ Bernoulli(1/2), which is independent of all other components.

In this setup, each interval Ik contains l(l − 1)/2 such pairs of coupled components, and the shared Bernoulli variables of different pairs are mutually independent. Given an arbitrary masking schedule αt, this setting is consistent with Definition C.4. Let q denote the data distribution described above.
Under Assumption 4.1, we only need to examine the case where xt has no error. By Lemma A.3, we know that:

\left\| q_{0|t}(x_0^i \mid x_t) - p_\theta(x_0^i \mid x_t) \right\|_1 \le \sqrt{2 D_{KL}\big(q_{0|t}(x_0^i \mid x_t) \,\|\, p_\theta(x_0^i \mid x_t)\big)} \le \sqrt{2\epsilon_{learning}}.

Let M denote the set of previously sampled locations. For q and any unsampled location in interval I, all of the potential tokens x at this location which are consistent with xt have the same probability:

q(x \mid x_t) = \frac{1}{2^{\,l-1-|M \cap I|}}.

If two locations xi, xj within the same interval I are sampled simultaneously, ignoring the possible inconsistency with previously sampled tokens (since error cannot be reduced), the independence of the random variable pairs implies that the probability of generating an error is lower-bounded by:

p_e \ge \left(\tfrac{1}{2} + e_1\right)\left(\tfrac{1}{2} + e_2\right) + \left(\tfrac{1}{2} + e_3\right)\left(\tfrac{1}{2} + e_4\right),

where 1/2 is the probability (under q) of the shared component v_{i′}^{(j)} (equivalently v_{j′−1}^{(i)}) taking the value 0 or 1, and e1, e2, e3, e4 satisfy:

|e_1| + |e_3| = \left\| q_{0|t}(x_0^i \mid x_t) - p_\theta(x_0^i \mid x_t) \right\|_1,
|e_2| + |e_4| = \left\| q_{0|t}(x_0^j \mid x_t) - p_\theta(x_0^j \mid x_t) \right\|_1.

Thus, we know that:

p_e \ge \frac{1}{2} - (|e_1| + |e_2| + |e_3| + |e_4|) \ge \frac{1}{2} - 2\sqrt{2\epsilon_{learning}}.

In other words, this is consistent with the setting of Lemma C.6, with an error probability p_e = \frac{1}{2} - 2\sqrt{2\epsilon_{learning}}.

Although the example above seems a bit tricky, it can actually be modified into the form of an HMM, a commonly considered
structure for generative models.
Note C.8 (HMM Form of Example C.7). The setting described in Example C.7 can alternatively be modeled as a Hidden Markov Model (HMM), where the observation space is O = A, and the state space is S = {(i, A^{(i)}) | A^{(i)} ∈ R^{(l−1)×(l−1)}, i = 1, · · · , l}. Here, i represents the current position within the interval, and A^{(i)} is an upper triangular matrix with entries taking values of 0 or 1. For j ≤ i, the j-th row of A^{(i)} encodes the values sampled by the variable pairs formed between the j-th position and all its subsequent positions in the interval. For j > i, the j-th row of A^{(i)} is set to 0.
Given the current state s = (i, A^{(i)}), the state transition and emission process can be described as follows:

• The observation oi corresponds to the (i−1)-th column and the i-th row of the matrix A^{(i)}, where the values of the variable pairs relevant to the i-th position within the interval are encoded. Specifically, oi ∈ A corresponds to a vector vi = (v_{i,1}, · · · , v_{i,l−1}), where

v_{i,j} = \begin{cases} A_{j,\,i-1}^{(i)}, & j < i, \\ A_{i,\,j}^{(i)}, & j \ge i. \end{cases}

• If i < l, the next state is s′ = (i + 1, A^{(i+1)}), where the first i rows of A^{(i+1)} are the same as those of A^{(i)}, and A_{i+1,\,j}^{(i+1)} ∼ Bernoulli(1/2) i.i.d. for j = i + 1, · · · , l − 1, with the remaining entries set to 0.

• If i = l, the next state resets to s′ = (1, A^{(1)}), where the entries in the first row are independently sampled from Bernoulli(1/2), and the other entries are set to 0.

The size of the observation space is given by |O| = |A| = 2^{l−1}. The size of the state space is computed as:

|S| = \sum_{i=1}^{l} 2^{(2l-i-1)i/2} \le l \cdot 2^{l(l-1)/2}.
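As a quick check of these counts (an illustrative aside, not part of the construction), the lines below evaluate |O| and the state-space formula for the value l = 5 used later in Theorem C.9.

l = 5
obs_size = 2 ** (l - 1)                                               # |O| = |A| = 2^(l-1) = 16
state_size = sum(2 ** ((2 * l - i - 1) * i // 2) for i in range(1, l + 1))
bound = l * 2 ** (l * (l - 1) // 2)
print(obs_size, state_size, bound)   # 16, the exact |S|, and the bound l * 2^(l(l-1)/2)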

The above Note gives the HMM form of Example C.7. In fact, with appropriate adjustments, it can be further modified into
an n-gram language. Using the HMM defined above, we can prove Theorem 4.4.
Theorem C.9 (SER Bound for HMM Generation). There exists an HMM q over a vocabulary of size 16 that satisfies the following conditions: for any reverse model pθ under Assumption 4.1 with ϵlearning < 1/128, and any masking schedule αt, let p denote the distribution over sequences generated by pθ. There exists a constant C such that if the number of sampling steps satisfies N = CL, where L is the sequence length, the SER of the generated text is lower-bounded by:

SER(p) > \frac{1}{2}.

Proof. Take the HMM described in Note C.8, and set l = 5 and N = CL. The vocabulary is the observation space O, which satisfies |O| = 2^{l−1} = 16. By Lemma C.6, for any masking schedule αt, we have:

SER(p) \ge 1 - \left(1 - \frac{p_e}{N}\right)^{L/l}.

As illustrated in Example C.7:

p_e = \frac{1}{2} - 2\sqrt{2\epsilon_{learning}}.

Therefore, taking N = CL and letting y = CL/p_e, we have:

SER(p) \ge 1 - \left(\left(1 - \frac{1}{y}\right)^{y}\right)^{\frac{p_e}{Cl}}.

Since (1 − 1/y)^y is decreasing, and clearly y ≥ Cl/p_e, we know that:

SER(p) \ge \frac{p_e}{Cl}.

Letting C = \frac{2p_e}{l+1}, we get the lower bound:

SER(p) \ge \frac{p_e}{Cl} = \frac{l+1}{2l} > \frac{1}{2}.

In this way:

C = \frac{2p_e}{l+1} = \frac{1 - 4\sqrt{2\epsilon_{learning}}}{6} \ge \frac{1}{24} = O(1).

D. Extend to Efficient Sampling Strategies


In Sahoo et al. (2024) and Ou et al. (2024), an efficient sampling strategy, ddpm_cache, is proposed, which can reduce the sampling time by a constant factor. Specifically, this sampler is approximately 3-4 times faster than previously used samplers when the number of sampling steps is large. In this section, we discuss the influence of ddpm_cache on our conclusions under different numbers of sampling steps.
First, we briefly introduce the principle of ddpm_cache. It exploits the observation that if no locations are sampled at a given step, the sequence remains unchanged. Consequently, when the reverse model is not conditioned on time, the cached output computed the first time this sequence was passed through the reverse model can be reused, instead of running the reverse model again.
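A minimal sketch of this caching idea is shown below. It is our own illustration, not the implementation of Sahoo et al. (2024) or Ou et al. (2024); reverse_model, the MASK placeholder, and the unmasking rule are assumptions made for the sketch, and it assumes a time-independent reverse model so that cached logits stay valid. The network is only re-invoked when the sequence actually changed in the previous step.

import torch

MASK = -1  # placeholder id for the mask token (assumption)

@torch.no_grad()
def cached_sampler(reverse_model, x, alphas):
    """x: (L,) long tensor, initially all MASK.
    alphas: cumulative masking schedule alpha_{t_0} = 0, ..., alpha_{t_N} = 1."""
    logits = None
    changed = True
    for i in range(1, len(alphas)):
        if changed:                       # re-run the network only if the input changed
            logits = reverse_model(x)     # (L, |V|) token logits (placeholder model)
        # each still-masked location is unmasked at this step with probability
        # (alpha_i - alpha_{i-1}) / (1 - alpha_{i-1})
        prob = (alphas[i] - alphas[i - 1]) / (1 - alphas[i - 1] + 1e-12)
        reveal = (x == MASK) & (torch.rand_like(x, dtype=torch.float) < prob)
        if reveal.any():
            sampled = torch.distributions.Categorical(logits=logits).sample()
            x = torch.where(reveal, sampled, x)
            changed = True
        else:
            changed = False               # nothing changed: reuse the cached logits next step
    return x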


This sampling strategy does not affect our main theorems, as they are based solely on the sampled locations at each step,
while unsampled locations are not considered. As for the evaluation metrics for computational efficiency in our experiments,
we break it down into the following two cases:
1. When the number of sampling steps is much smaller than the sequence length, which is the primary scenario we focus
on, the expectation of steps where no new locations are sampled is relatively low, resulting in a computational cost that
is nearly linear with respect to the number of sampling steps.
2. As the number of sampling steps becomes larger, the computational cost is mainly dependent on the number of valid
steps where at least one location is sampled. As a matter of fact, the expectation of the number of valid steps increases
as the number of sampling steps increases, and the maximum number of valid steps is equal to the number of sampling
steps. In this case, the MDMs offer no computational advantage over auto-regressive models.
Based on the above observations, for tasks requiring a low TER, using ddpm_cache can further accelerate the generation of MDMs, suggesting high efficiency. Conversely, for tasks that demand a low SER, we have shown that the number of sampling steps needs to be large enough, so MDMs cannot generate at low cost even when using ddpm_cache. Therefore, our findings extend to MDMs with efficient sampling strategies.

E. Experiment Details
In this section, we will present the details of the experiments.
E.1. Data Generation
We evaluate the MDMs on a variety of formal languages, including n-gram languages and HMMs. For each formal language, the parameters are generated through random sampling; we present the sampling procedures in Algorithm 1 and Algorithm 2. Note that, to add some determinism to the language model for the evaluation of SER, we introduce the parameter thres to prune the tail probabilities, ensuring that the language model only generates correct sequences. For the evaluation of TER, we set thres to 0 so that the generative perplexity is well defined. The detailed parameters used to generate the formal languages are listed in Table 1.

Algorithm 1 Generate n-gram Language Model

Input:
n: number of grams
vocab size: size of vocabulary
temp: temperature (controls randomness, higher indicates more randomness)
thres: threshold for pruning small probabilities
Output: n-gram language model with parameters:
T : transition probability matrix (vocab size^{n−1} × vocab size)
Init dist: initial state distribution
1: Init dist ← rand(hidden states num)
2: Init dist ← Init dist / Σ(Init dist)
3: T ← randn(vocab size^{n−1}, vocab size) × temp
4: T ← softmax(T)
5: if thres > 0 then
6:   T[where(T < thres)] ← 0
7:   T ← T / rowsum(T)
8: end if
9: return T and Init dist
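A runnable NumPy version of Algorithm 1 might look as follows. This is a sketch under our reading of the pseudocode, not the authors' released code: the size of the initial distribution is our assumption (a distribution over the (n−1)-token starting context, where the pseudocode writes hidden states num), and the sampler that follows is added purely to show how T and Init dist would be used.

import numpy as np

def _softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generate_ngram_lm(n, vocab_size, temp, thres, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: the initial distribution is over the (n-1)-token starting context.
    init_dist = rng.random(vocab_size ** (n - 1))
    init_dist /= init_dist.sum()
    T = _softmax(rng.standard_normal((vocab_size ** (n - 1), vocab_size)) * temp)
    if thres > 0:
        T[T < thres] = 0.0
        T /= T.sum(axis=-1, keepdims=True)
    return T, init_dist

def sample_ngram(T, init_dist, n, vocab_size, length, seed=0):
    rng = np.random.default_rng(seed)
    ctx = rng.choice(len(init_dist), p=init_dist)   # (n-1)-token context encoded base vocab_size
    seq = []
    for _ in range(length):
        tok = rng.choice(vocab_size, p=T[ctx])
        seq.append(int(tok))
        ctx = (ctx * vocab_size + tok) % (vocab_size ** (n - 1))  # shift the context window
    return seq

T, init = generate_ngram_lm(n=3, vocab_size=8, temp=2.0, thres=0.008)
print(sample_ngram(T, init, n=3, vocab_size=8, length=20))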

Table 1. Generation Parameters for Different Language Models


Parameter 2-gram 3-gram 4-gram HMM
vocabulary size 8 8 8 8
Hidden States (n) N/A N/A N/A 32
Temperature 2 2 2 3.2
Threshold 0.008 0.008 0.005 0.003


Algorithm 2 Generate Hidden Markov Model


Input:
n: number of hidden states
vocab size: size of vocabulary
randomness: temperature parameter to control probability distributions
thres: threshold for pruning small transition probabilities
Output: HMM with parameters:
A: state transition matrix (n × n)
B: emission probability matrix (n × (vocab size + 1))
Init dist: initial state distribution (n-dimensional)
1: hidden states num ← n
2: Init dist ← rand(hidden states num)
3: Init dist ← Init dist / Σ(Init dist)
4: A ← randn(hidden states num, hidden states num) × randomness
5: A ← softmax(A)
6: if thres > 0 then
7: A[where(A < thres)] ← 0
8: A ← A/rowsum(A)
9: end if
10: B ← randn(hidden states num, vocab size) × randomness × 2.5
11: B ← softmax(B)
12: B[where(B < 0.05)] ← 0
13: B ← B/rowsum(B)
14: B ← concat(B, ones(hidden states num, 1)/hidden states num)
15: return A, B, and Init dist
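Similarly, a NumPy sketch of Algorithm 2 is given below, again as an illustration under our reading of the pseudocode rather than the authors' implementation. In particular, how the extra appended column of B is consumed during sampling is not specified above, so the toy sampler here simply drops it and renormalizes; this is an assumption.

import numpy as np

def _softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generate_hmm(n, vocab_size, randomness, thres, seed=0):
    rng = np.random.default_rng(seed)
    init_dist = rng.random(n)
    init_dist /= init_dist.sum()
    A = _softmax(rng.standard_normal((n, n)) * randomness)
    if thres > 0:
        A[A < thres] = 0.0
        A /= A.sum(axis=-1, keepdims=True)
    B = _softmax(rng.standard_normal((n, vocab_size)) * randomness * 2.5)
    B[B < 0.05] = 0.0
    B /= B.sum(axis=-1, keepdims=True)
    B = np.concatenate([B, np.full((n, 1), 1.0 / n)], axis=1)  # extra column, as in step 14
    return A, B, init_dist

def sample_hmm(A, B, init_dist, length, seed=0):
    rng = np.random.default_rng(seed)
    emis = B[:, :-1] / B[:, :-1].sum(axis=-1, keepdims=True)   # assumption: ignore the extra column
    s = rng.choice(len(init_dist), p=init_dist)
    seq = []
    for _ in range(length):
        seq.append(int(rng.choice(emis.shape[1], p=emis[s])))
        s = rng.choice(len(init_dist), p=A[s])
    return seq

A, B, init = generate_hmm(n=32, vocab_size=8, randomness=3.2, thres=0.003)
print(sample_hmm(A, B, init, length=20))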

E.2. Model Training and Testing


In our experiments on formal languages, all training was conducted on NVIDIA A100 GPUs. The model architectures and training configurations are listed in Table 2 and Table 3. The training configuration of the auto-regressive models is listed in Table 4.

Model Configuration
Hidden Size 512
Sequence Length 512
Number of Layers 8*
Attention Heads 8
*For the 4-gram model, the number of layers is 10.

Table 2. Model Configuration for the Formal Language Tasks

Training Configuration for MDMs


Epochs 20
Learning Rate 3e-4
Optimizer AdamW
β1 0.9
β2 0.999
Learning Rate Scheduler Cosine Scheduler with Warmup
Warmup Ratio 0.1

Table 3. Training Configuration for MDMs on the Formal Language Tasks


Training Configuration for Auto-regressive Models


Epochs 20
Learning Rate 3e-4
Optimizer AdamW
β1 0.9
β2 0.999
Learning Rate Scheduler Cosine Scheduler with Warmup
Warmup Ratio 0.1

Table 4. Training Configuration for Auto-regressive Models on the Formal Language Tasks

Speedup Testing Settings


GPU Nvidia RTX 4090
Batch Size 16
Sequence Length 512
Testing Model Configuration In Table 2

Table 5. Settings for the experiments testing the speedup of MDMs under different sampling steps compared to auto-regressive models.
