
Preprint.

HEAVY-TAILED DIFFUSION MODELS


Kushagra Pandey1,2*, Jaideep Pathak1, Yilun Xu1,
Stephan Mandt2, Michael Pritchard1,2, Arash Vahdat1†, Morteza Mardani1†
NVIDIA1, University of California, Irvine2
{pandeyk1,mandt}@uci.edu,
{jpathak,yilunx,mpritchard,avahdat,mmardani}@nvidia.com
arXiv:2410.14171v2 [cs.LG] 29 Oct 2024

ABSTRACT
Diffusion models achieve state-of-the-art generation quality across many applications, but
their ability to capture rare or extreme events in heavy-tailed distributions remains unclear.
In this work, we show that traditional diffusion and flow-matching models with standard
Gaussian priors fail to capture heavy-tailed behavior. We address this by repurposing the
diffusion framework for heavy-tail estimation using multivariate Student-t distributions.
We develop a tailored perturbation kernel and derive the denoising posterior based on
the conditional Student-t distribution for the backward process. Inspired by γ-divergence
for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers.
The resulting framework introduces controllable tail generation using only a single scalar
hyperparameter, making it easily tunable for diverse real-world distributions. As specific
instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing
diffusion and flow models that employ a Student-t prior. Remarkably, our approach is
readily compatible with standard Gaussian diffusion models and requires only minimal
code changes. Empirically, we show that our t-EDM and t-Flow outperform standard
diffusion models in heavy-tail estimation on high-resolution weather datasets in which
generating rare and extreme events is crucial.

1 INTRODUCTION
In many real-world applications, such as weather forecasting, rare or extreme events—like hurricanes or
heatwaves—can have disproportionately larger impacts than more common occurrences. Therefore, building
generative models capable of accurately capturing these extreme events is critically important (Gründemann
et al., 2022). However, learning the distribution of such data from finite samples is particularly challenging,
as the number of empirically observed tail events is typically small, making accurate estimation difficult.
One promising approach is to use heavy-tailed distributions, which allocate more density to the tails than
light-tailed alternatives. In popular generative models like Normalizing Flows (Rezende & Mohamed, 2016)
and Variational Autoencoders (VAEs) (Kingma & Welling, 2022), recent works address heavy-tail estimation
by learning a mapping from a heavy-tailed prior to the target distribution (Jaini et al., 2020; Kim et al., 2024b).
While these works advocate for heavy-tailed base distributions, their application to real-world, high-
dimensional datasets remains limited, with empirical results focused on small-scale or toy datasets. In
contrast, diffusion models (Ho et al., 2020; Song et al., 2020; Lipman et al., 2023) have demonstrated
excellent synthesis quality in large-scale applications. However, it is unclear whether diffusion models with
Gaussian priors can effectively model heavy-tailed distributions without significant modifications.
* Work during an internship at NVIDIA.
† Equal advising.


Figure 1: Toy Illustration. Our proposed diffusion model (t-Diffusion) captures heavy-tailed behavior more accurately
than standard Gaussian diffusion, as shown in the histogram comparisons (top panel, x-axis). The framework allows for
controllable tail estimation using a hyperparameter ν, which can be adjusted for each dimension. Lower ν values model
heavier tails, while higher values approach Gaussian diffusion. (Best viewed when zoomed in; see App. C.3 for details)

In this work, we first demonstrate through extensive experiments that traditional diffusion models—even
with proper normalization, preconditioning, and noise schedule design (see Section 4)—fail to accurately
capture the heavy-tailed behavior in target distributions (see Fig. 1 for a toy example). We hypothesize that,
in high-dimensional spaces, the Gaussian distribution in standard diffusion models tends to concentrate on
a narrow spherical shell, thereby neglecting the tails. To address this, we adopt the multivariate Student-t
distribution as the base noise distribution, with its degrees of freedom providing controllability over tail
estimation. Consequently, we reformulate the denoising diffusion framework using multivariate Student-t
distributions by designing a tailored perturbation kernel and deriving the corresponding denoiser. Moreover,
we draw inspiration from the γ-power Divergences (Eguchi, 2021; Kim et al., 2024a) for heavy-tailed
distributions to formulate the learning problem for our heavy-tailed denoiser.
We extend widely adopted diffusion models, such as EDM (Karras et al., 2022) and straight-line flows
(Lipman et al., 2023; Liu et al., 2022), by introducing their Student-t counterparts: t-EDM and t-Flow.
We derive the corresponding SDEs and ODEs for modeling heavy-tailed distributions. Through extensive
experiments on the HRRR dataset (Dowell et al., 2022), we train both unconditional and conditional versions
of these models. The results show that standard EDM struggles to capture tails and extreme events, whereas
t-EDM performs significantly better in modeling such phenomena. To summarize, our contributions are:

• Heavy-tailed Diffusion Models. We repurpose the diffusion model framework for heavy-tail
estimation by formulating both the forward and reverse processes using multivariate Student-t
distributions. The denoiser is learned by minimizing the γ-power divergence (Kim et al., 2024a)
between the forward and reverse posteriors.

• Continuous Counterparts. We derive continuous formulations for heavy-tailed diffusion models
and provide a principled approach to constructing ODE and SDE samplers. This enables the
instantiation of t-EDM and t-Flow as heavy-tailed analogues to standard diffusion and flow models.

• Empirical Results. Experiments on the HRRR dataset (Dowell et al., 2022), a high-resolution
dataset for weather modeling, show that t-EDM significantly outperforms EDM in capturing tail
distributions for both unconditional and conditional tasks.

• Theoretical Connections. To theoretically justify the effectiveness of our approach, we present
several theoretical connections between our framework and existing work in diffusion models and
robust statistical estimators (Futami et al., 2018).


2 BACKGROUND
As prerequisites underlying our method, we briefly summarize Gaussian diffusion models (as introduced in
(Ho et al., 2020; Sohl-Dickstein et al., 2015)) and multivariate Student-t distributions.

2.1 DIFFUSION MODELS

Diffusion models define a forward process (usually with an affine drift and no learnable parameters) to convert
data x_0 ~ p(x_0), x_0 \in \mathbb{R}^d, to noise. A learnable reverse process is then trained to generate data from noise.
In the discrete-time setting, the training objective for diffusion models can be specified as,

    \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}(q(x_T|x_0) \,\|\, p(x_T))}_{L_T} + \sum_{t > \Delta t} \underbrace{D_{\mathrm{KL}}(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t))}_{L_{t-1}} - \underbrace{\log p_\theta(x_0|x_{\Delta t})}_{L_0}\Big],   (1)

where T denotes the trajectory length while \Delta t denotes the time increment between two consecutive time
points. D_{\mathrm{KL}} denotes the Kullback-Leibler (KL) divergence, defined as D_{\mathrm{KL}}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx.
In the objective in Eq. 1, the trajectory length T is chosen to match the generative prior p(x_T) and the
forward marginal q(x_T|x_0). The second term in Eqn. 1 minimizes the KL divergence between the
forward posterior q(x_{t-\Delta t}|x_t, x_0) and the learnable posterior p_\theta(x_{t-\Delta t}|x_t), which corresponds to learning
the denoiser (i.e., predicting a less noisy state from noise). The forward marginals, posterior, and reverse
posterior are modeled using Gaussian distributions, which exhibit an analytical form of the KL divergence.
The discrete-time diffusion framework can also be extended to the continuous time setting (Song et al.,
2020; 2021; Karras et al., 2022). Recently, Lipman et al. (2023); Albergo et al. (2023a) proposed stochastic
interpolants (or flows), which allow flexible transport between two arbitrary distributions.

2.2 STUDENT-T DISTRIBUTIONS

The multivariate Student-t distribution t_d(\mu, \Sigma, \nu) with dimensionality d, location \mu, scale matrix \Sigma, and
degrees of freedom \nu is defined as,

    t_d(\mu, \Sigma, \nu) = C_{\nu,d}\Big[1 + \frac{1}{\nu}(x - \mu)^\top \Sigma^{-1}(x - \mu)\Big]^{-\frac{\nu + d}{2}},   (2)

where C_{\nu,d} is the normalizing factor. Since the multivariate Student-t distribution has polynomially decaying
density, it can model heavy-tailed distributions. Interestingly, for \nu = 1, the Student-t distribution is
analogous to the Cauchy distribution. As \nu \to \infty, the Student-t distribution converges to the Gaussian
distribution. A Student-t distributed random variable x can be reparameterized as (Andrews & Mallows,
1974), x = \mu + \Sigma^{1/2} z / \sqrt{\kappa}, with z \sim \mathcal{N}(0, I_d), \kappa \sim \chi^2(\nu)/\nu, where \chi^2 denotes the Chi-squared distribution.
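The scale-mixture reparameterization above gives a direct recipe for drawing Student-t noise with only Gaussian and Chi-squared samplers. Below is a minimal NumPy sketch of this construction (the function name and defaults are ours, for illustration only); as ν → ∞ the draws approach samples from N(µ, Σ).

    import numpy as np

    def sample_student_t(mu, sigma_sqrt, nu, size, rng=None):
        # Draw from t_d(mu, Sigma, nu) via the Gaussian scale mixture (Andrews & Mallows, 1974):
        # x = mu + Sigma^{1/2} z / sqrt(kappa), with z ~ N(0, I_d) and kappa ~ chi^2(nu)/nu.
        rng = np.random.default_rng() if rng is None else rng
        d = mu.shape[-1]
        z = rng.standard_normal((size, d))                # z ~ N(0, I_d)
        kappa = rng.chisquare(nu, size=(size, 1)) / nu    # kappa ~ chi^2(nu)/nu
        return mu + (z @ sigma_sqrt.T) / np.sqrt(kappa)   # heavy-tailed for small nu

    # Example: nu = 3 draws have much heavier tails than Gaussian draws with the same scale.
    samples = sample_student_t(np.zeros(2), np.eye(2), nu=3.0, size=100_000)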

3 HEAVY-TAILED DIFFUSION MODELS


We now repurpose standard diffusion models using multivariate Student-t distributions. The main idea
behind our design is the use of heavy-tailed generative priors (Jaini et al., 2020; Kim et al., 2024a) for
learning a transport map towards a potentially heavy-tailed target distribution. From Eqn. 1 we note three key
requirements for training diffusion models: choice of the perturbation kernel qpxt |x0 q, form of the target
denoising posterior qpxt´∆t |xt , x0 q and the parameterization of the learnable reverse posterior pθ pxt´1 |xt q.
Therefore, we begin our discussion in the context of discrete-time diffusion models and later extend to the
continuous regime. This has several advantages in terms of highlighting these three key design choices, which
might be obscured by the continuous-time framework of defining a forward and a reverse SDE (Karras et al.,
2022) while at the same time leading to a simpler construction.


3.1 NOISING PROCESS DESIGN WITH STUDENT-T DISTRIBUTIONS.

Our construction of the noising process involves three key steps.

1. Firstly, given two consecutive noisy states x_t and x_{t-\Delta t}, we specify a joint distribution
   q(x_t, x_{t-\Delta t}|x_0).
2. Secondly, given q(x_t, x_{t-\Delta t}|x_0), we construct the perturbation kernel q(x_t|x_0) =
   \int q(x_t, x_{t-\Delta t}|x_0)\, dx_{t-\Delta t}, which can be used as a noising process during training.
3. Lastly, from Steps 1 and 2, we construct the forward denoising posterior q(x_{t-\Delta t}|x_t, x_0) =
   q(x_t, x_{t-\Delta t}|x_0) / q(x_t|x_0). We will later utilize the form of q(x_{t-\Delta t}|x_t, x_0) to parameterize the reverse posterior.

It is worth noting that our construction of the noising process bypasses the specification of the forward
transition kernel q(x_t|x_{t-\Delta t}). This has the advantage that we can directly specify the form of the perturbation
kernel parameters \mu_t and \sigma_t as in Karras et al. (2022), unlike Song et al. (2020); Ho et al. (2020). We next
highlight the noising process construction in more detail.
Specifying the joint distribution q(x_t, x_{t-\Delta t}|x_0). We parameterize the joint distribution q(x_t, x_{t-\Delta t}|x_0)
as a multivariate Student-t distribution with the following form,

    q(x_t, x_{t-\Delta t}|x_0) = t_{2d}(\mu, \Sigma, \nu), \qquad \mu = [\mu_t;\ \mu_{t-\Delta t}]\, x_0, \qquad \Sigma = \begin{pmatrix} \sigma_t^2 & \sigma_{12}^2(t) \\ \sigma_{21}^2(t) & \sigma_{t-\Delta t}^2 \end{pmatrix} \otimes I_d,

where \mu_t, \sigma_t, \sigma_{12}(t), \sigma_{21}(t) are time-dependent scalar design parameters. While the choice of the parameters
\mu_t and \sigma_t determines the perturbation kernel used during training, the choice of \sigma_{12}(t) and \sigma_{21}(t) can affect
the ODE/SDE formulation for the denoising process and will be clarified when discussing sampling.
Constructing the perturbation kernel q(x_t|x_0): Given the joint distribution q(x_t, x_{t-\Delta t}|x_0) specified
as a multivariate Student-t distribution, it follows that the perturbation kernel q(x_t|x_0) is also
a Student-t distribution (Ding, 2016), parameterized as q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu) (proof in App. A.1).
We choose the scalar coefficients \mu_t and \sigma_t such that the perturbation kernel at time t = T converges to a
standard Student-t generative prior t_d(0, I_d, \nu). We discuss practical choices of \mu_t and \sigma_t in Section 3.5.
Estimating the reference denoising posterior. Given the joint distribution q(x_t, x_{t-\Delta t}|x_0) and the pertur-
bation kernel q(x_t|x_0), the denoising posterior can be specified as (see Ding (2016)),

    q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar{\mu}_t,\ \frac{\nu + d_1}{\nu + d}\,\bar{\sigma}_t^2 I_d,\ \nu + d\Big),   (3)

    \bar{\mu}_t = \mu_{t-\Delta t}\, x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}(x_t - \mu_t x_0), \qquad \bar{\sigma}_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2},   (4)

where d_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|^2. Next, we formulate the training objective for heavy-tailed diffusions.
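Before moving on, and for concreteness, the posterior parameters in Eqs. 3-4 reduce to a handful of scalar operations per sample. A minimal sketch (the function and argument names are ours; the schedule values µ_t, µ_{t-Δt}, σ_t, σ_12(t), σ_21(t) are assumed given):

    import numpy as np

    def denoising_posterior_params(x_t, x_0, mu_t, mu_prev, sig_t, sig_prev, sig12, sig21, nu):
        # Parameters of q(x_{t-dt} | x_t, x_0) = t_d(mu_bar, ((nu + d1)/(nu + d)) * sig_bar^2 I_d, nu + d)
        # following Eqs. (3)-(4); `prev` denotes the earlier time t - dt.
        d = x_t.size
        coeff = sig21**2 / sig_t**2
        mu_bar = mu_prev * x_0 + coeff * (x_t - mu_t * x_0)
        sig_bar_sq = sig_prev**2 - (sig21**2 * sig12**2) / sig_t**2
        d1 = np.sum((x_t - mu_t * x_0) ** 2) / sig_t**2
        return mu_bar, (nu + d1) / (nu + d) * sig_bar_sq, nu + d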

3.2 PARAMETERIZATION OF THE REVERSE POSTERIOR

Following Eqn. 3, we parameterize the reverse (or the denoising) posterior distribution as:

    p_\theta(x_{t-\Delta t}|x_t) = t_d(\mu_\theta(x_t, t),\ \bar{\sigma}_t^2 I_d,\ \nu + d),   (5)

where the denoiser mean \mu_\theta(x_t, t) is further parameterized as follows:

    \mu_\theta(x_t, t) = \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\mu_t\Big] D_\theta(x_t, \sigma_t).   (6)


While we adopt the "x0-prediction" parameterization (Karras et al., 2022), it is also possible to parameterize
the posterior mean with an \epsilon-prediction objective instead (Ho et al., 2020) (see App. A.2). Lastly, when
parameterizing the reverse posterior scale, we drop the data-dependent coefficient (\nu + d_1)/(\nu + d). This aligns with prior
works in diffusion models (Ho et al., 2020; Song et al., 2020; Karras et al., 2022), where it is common to
only parameterize the denoiser mean. However, heteroskedastic modeling of the denoiser is possible in our
framework and could be an interesting direction for future work. Next, we reformulate the training objective
in Eqn. 1 for heavy-tailed diffusions.

3.3 TRAINING WITH POWER DIVERGENCES

The optimization objective in Eqn. 1 primarily minimizes the KL-Divergence between a given pair of
distributions. However, since we parameterize the distributions in Eqn. 1 using multivariate Student-t
distributions, the KL-Divergence might not be a suitable choice of divergence: the KL divergence between
Student-t distributions does not admit a closed-form expression. An alternative is the \gamma-Power Divergence
(Eguchi, 2021; Kim et al., 2024a), defined as,

    D_\gamma(q \,\|\, p) = \frac{1}{\gamma}\big[C_\gamma(q, p) - H_\gamma(q)\big], \qquad \gamma \in (-1, 0) \cup (0, \infty),

    H_\gamma(p) = -\|p\|_{1+\gamma} = -\Big(\int p(x)^{1+\gamma}\, dx\Big)^{\frac{1}{1+\gamma}}, \qquad C_\gamma(q, p) = -\int q(x)\Big(\frac{p(x)}{\|p\|_{1+\gamma}}\Big)^{\gamma} dx,

where, like Kim et al. (2024a), we set \gamma = -\frac{2}{\nu + d} for the remainder of our discussion. Moreover, H_\gamma and C_\gamma
represent the \gamma-power entropy and cross-entropy, respectively. Interestingly, the \gamma-Power divergence between
two multivariate Student-t distributions, q_\nu = t_d(\mu_0, \Sigma_0, \nu) and p_\nu = t_d(\mu_1, \Sigma_1, \nu), can be tractably
computed in closed form and is defined as (see Kim et al. (2024a) for a proof),

    D_\gamma[q_\nu \,\|\, p_\nu] = -\frac{1}{\gamma}\, C_{\nu,d}^{\frac{\gamma}{1+\gamma}}\Big(1 + \frac{d}{\nu - 2}\Big)^{-\frac{\gamma}{1+\gamma}}\Big[-|\Sigma_0|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{d}{\nu - 2}\Big)
        + |\Sigma_1|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{1}{\nu - 2}\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + \frac{1}{\nu}(\mu_0 - \mu_1)^\top \Sigma_1^{-1}(\mu_0 - \mu_1)\Big)\Big].   (7)

Therefore, analogous to Eqn. 1, we minimize the following optimization objective,

    \mathbb{E}_q\Big[D_\gamma(q(x_T|x_0) \,\|\, p(x_T)) + \sum_{t > 1} D_\gamma(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t)) - \log p_\theta(x_0|x_1)\Big].   (8)

Here, we note a couple of caveats. Firstly, while replacing the KL-Divergence with the γ-Power Divergence
in the objective in Eqn. 1 might appear to be due to computational convenience, the γ-power divergence has
several connections with robust estimators (Futami et al., 2018) in statistics and provides a tunable parameter
γ which can be used to control the model density assigned at the tail (see Section 5). Secondly, while the
objective in Eqn. 1 is a valid ELBO, the objective in Eq. 8 is not. However, the following result provides a
connection between the two objectives (see proof in App. A.3),
Proposition 1. For arbitrary distributions q and p, in the limit of \gamma \to 0, D_\gamma(q \,\|\, p) converges to D_{\mathrm{KL}}(q \,\|\, p).
Consequently, for a finite-dimensional dataset with x_0 \in \mathbb{R}^d and \gamma = -\frac{2}{\nu + d}, in the limit \gamma \to 0, the
objective in Eqn. 8 converges to the objective in Eqn. 1.
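As a quick numerical sanity check of Proposition 1, D_γ and D_KL can be compared for simple one-dimensional densities directly from the definitions of H_γ and C_γ above; the gap shrinks as γ → 0. A minimal sketch using SciPy quadrature (integration bounds and test densities are illustrative choices of ours):

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    def gamma_power_divergence(q_pdf, p_pdf, gamma, lo=-20.0, hi=20.0):
        # D_gamma(q || p) = (1/gamma) * [C_gamma(q, p) - H_gamma(q)], using the definitions in Sec. 3.3.
        p_norm = quad(lambda x: p_pdf(x) ** (1.0 + gamma), lo, hi)[0] ** (1.0 / (1.0 + gamma))
        q_norm = quad(lambda x: q_pdf(x) ** (1.0 + gamma), lo, hi)[0] ** (1.0 / (1.0 + gamma))
        h_gamma = -q_norm
        c_gamma = -quad(lambda x: q_pdf(x) * (p_pdf(x) / p_norm) ** gamma, lo, hi)[0]
        return (c_gamma - h_gamma) / gamma

    q, p = stats.norm(0.0, 1.0).pdf, stats.norm(0.5, 1.2).pdf
    d_kl = quad(lambda x: q(x) * np.log(q(x) / p(x)), -20.0, 20.0)[0]
    for gamma in (-0.2, -0.05, -0.01):
        print(gamma, gamma_power_divergence(q, p, gamma), d_kl)  # D_gamma -> D_KL as gamma -> 0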

Therefore, in the limit \gamma \to 0, the standard diffusion model framework becomes a special case of our
proposed framework. Moreover, since \gamma = -2/(\nu + d), this also explains why tail estimation moves towards
Gaussian diffusion as \nu increases (see Fig. 1 for an illustration).


Component                               Gaussian Diffusion             (Ours) t-Diffusion
Perturbation Kernel q(x_t|x_0)          N(μ_t x_0, σ_t² I_d)           t_d(μ_t x_0, σ_t² I_d, ν)
Forward Posterior q(x_{t-Δt}|x_t, x_0)  N(μ̄_t, σ̄_t² I_d)              t_d(μ̄_t, ((ν+d_1)/(ν+d)) σ̄_t² I_d, ν+d)
Reverse Posterior p_θ(x_{t-Δt}|x_t)     N(μ_θ(x_t, t), σ̄_t² I_d)      t_d(μ_θ(x_t, t), σ̄_t² I_d, ν+d)
Divergence Measure                      D_KL(q ‖ p)                    D_γ(q ‖ p)

Table 1: Comparison between different modeling components for constructing Gaussian vs. heavy-tailed diffusion
models. Under the limit of ν → ∞, our proposed t-Diffusion framework converges to Gaussian diffusion models.

Simplifying the Training Objective. Plugging the form of the forward posterior q(x_{t-\Delta t}|x_t, x_0) from Eqn. 3
and the reverse posterior p_\theta(x_{t-\Delta t}|x_t) into the optimization objective in Eqn. 8, we obtain the following
simplified training loss (proof in App. A.4),

    \mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim p(x_0)}\, \mathbb{E}_{t \sim p(t)}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}\, \mathbb{E}_{\kappa \sim \frac{1}{\nu}\chi^2(\nu)} \Big\| D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}},\ \sigma_t\Big) - x_0 \Big\|_2^2.   (9)

Intuitively, the form of our training objective is similar to existing diffusion models (Ho et al., 2020; Karras
et al., 2022); the only difference lies in sampling the noisy state x_t from a Student-t perturbation kernel
instead of a Gaussian one. Next, we discuss sampling from our proposed framework under discrete and
continuous-time settings.

3.4 SAMPLING

Discrete-time Sampling. For discrete-time settings, we can simply perform ancestral sampling from the
learned reverse posterior distribution p_\theta(x_{t-\Delta t}|x_t). Therefore, following simple re-parameterization, an
ancestral sampling update can be specified as,

    x_{t-\Delta t} = \mu_\theta(x_t, t) + \bar{\sigma}_t \frac{z}{\sqrt{\kappa}}, \qquad z \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu + d)/(\nu + d),

             = \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\mu_t\Big] D_\theta(x_t, \sigma_t) + \bar{\sigma}_t \frac{z}{\sqrt{\kappa}}.
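The update above mirrors Gaussian ancestral sampling, with the Gaussian noise replaced by Student-t noise with ν + d degrees of freedom. A minimal sketch of a single reverse step (names are ours; the schedule quantities and the trained denoiser are assumed given):

    import numpy as np

    def ancestral_step(x_t, denoiser, mu_t, mu_prev, sig_t, sig21, sig_bar, nu, rng):
        # One reverse update x_t -> x_{t-dt} drawn from the Student-t reverse posterior (Sec. 3.4).
        d = x_t.size
        coeff = sig21**2 / sig_t**2
        mean = coeff * x_t + (mu_prev - coeff * mu_t) * denoiser(x_t, sig_t)
        z = rng.standard_normal(d)
        kappa = rng.chisquare(nu + d) / (nu + d)     # chi^2(nu + d)/(nu + d) scale mixture
        return mean + sig_bar * z / np.sqrt(kappa)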

Continuous-time Sampling. Due to recent advances in accelerating sampling in continuous-time diffusion
processes (Pandey et al., 2024a; Zhang & Chen, 2023; Lu et al., 2022; Song et al., 2022a; Xu et al., 2023a),
we reformulate the discrete-time dynamics of heavy-tailed diffusions in the continuous-time regime. More
specifically, we present a family of continuous-time processes in the following result (proof in App. A.5).

Proposition 2. The posterior parameterization in Eqn. 5 induces the following continuous-time dynamics,

    dx_t = \bigg[\frac{\dot{\mu}_t}{\mu_t}\, x_t - \Big[f(\sigma_t, \dot{\sigma}_t) + \frac{\dot{\mu}_t}{\mu_t}\Big](x_t - \mu_t D_\theta(x_t, \sigma_t))\bigg] dt + \sqrt{\beta(t)}\, g(\sigma_t, \dot{\sigma}_t)\, dS_t,   (10)

where f : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R} and g : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+ are scalar-valued functions, \beta(t) \in \mathbb{R}^+ is a scaling
coefficient such that the following condition holds,

    \frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \beta(t)\, g(\sigma_t, \dot{\sigma}_t)\,\Delta t\big) - 1 = f(\sigma_t, \dot{\sigma}_t)\,\Delta t,

where \dot{\mu}_t, \dot{\sigma}_t denote the first-order time-derivatives of the perturbation kernel parameters \mu_t and \sigma_t, respec-
tively, and the differential dS_t \sim t_d(0, dt, \nu + d).


Algorithm 1: Training (t-EDM)
 1: repeat
 2:   x_0 ~ p(x_0)
 3:   σ ~ LogNormal(P_mean, P_std)
 4:   x = x_0 + n,  n ~ t_d(0, σ² I_d, ν)
 5:   σ ← σ · √(ν/(ν−2))
 6:   D_θ(x, σ) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x, c_noise(σ))
 7:   λ(σ) = c_out^{-2}(σ)
 8:   Take gradient descent step on
 9:     ∇_θ [λ(σ) ||D_θ(x, σ) − x_0||²]
10: until converged

Algorithm 2: Sampling (t-EDM) (μ_t = 1, σ_t = t)
 1: sample x_0 ~ t_d(0, t_0² I_d, ν)
 2: for i ∈ {0, ..., N−1} do
 3:   d_i ← (x_i − D_θ(x_i; t_i)) / t_i
 4:   x_{i+1} ← x_i + (t_{i+1} − t_i) d_i
 5:   if t_{i+1} ≠ 0 then
 6:     d_i' ← (x_{i+1} − D_θ(x_{i+1}; t_{i+1})) / t_{i+1}
 7:     x_{i+1} ← x_i + (t_{i+1} − t_i)(½ d_i + ½ d_i')
 8:   end if
 9: end for

Figure 2: Training and Sampling algorithms for t-EDM (ν > 2). Our proposed method requires minimal code updates
(indicated with blue) over traditional Gaussian diffusion models and converges to the latter as ν → ∞.
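Below is a hedged PyTorch-style sketch of the training step in Algorithm 1. It only illustrates the changes relative to Gaussian EDM (Student-t noise and the σ rescaling by √(ν/(ν−2))); `net` is assumed to wrap the EDM preconditioning D_θ, and the LogNormal hyperparameters and loss weighting used here are illustrative stand-ins rather than the exact values from the paper.

    import torch

    def t_edm_training_step(net, x0, nu, p_mean=-1.2, p_std=1.2):
        # One t-EDM training step (Algorithm 1); only the noise distribution and the sigma
        # rescaling differ from Gaussian EDM. x0 is assumed image-shaped: (B, C, H, W).
        b = x0.shape[0]
        sigma = torch.exp(p_mean + p_std * torch.randn(b, 1, 1, 1, device=x0.device))
        # Student-t noise via the Gaussian scale mixture: n = sigma * z / sqrt(kappa).
        z = torch.randn_like(x0)
        kappa = torch.distributions.Chi2(torch.tensor(nu)).sample((b, 1, 1, 1)).to(x0.device) / nu
        n = sigma * z / kappa.sqrt()
        sigma_eff = sigma * (nu / (nu - 2)) ** 0.5   # rescaled sigma fed to the denoiser (Alg. 1, line 5)
        denoised = net(x0 + n, sigma_eff)
        weight = 1.0 / sigma_eff**2                  # stand-in for lambda(sigma) = c_out(sigma)^{-2}
        return (weight * (denoised - x0) ** 2).mean()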

Based on the result in Proposition 2, it is possible to construct deterministic/stochastic samplers for heavy-
tailed diffusions. It is worth noting that the SDE in Eqn. 10 implies adding heavy-tailed stochastic noise
during inference (Bollerslev, 1987). Next, we provide specific instantiations of the generic sampler in Eq. 10.
Sampler Instantiations. We instantiate the continuous-time SDE in Eqn. 10 by setting g(\sigma_t, \dot{\sigma}_t) = 0 and
\sigma_{12}^2(t) = \sigma_t \sigma_{t-\Delta t}. Consequently, f(\sigma_t, \dot{\sigma}_t) = -\frac{\dot{\sigma}_t}{\sigma_t}. In this case, the SDE in Eqn. 10 reduces to an ODE,
which can be represented as,

    \frac{dx_t}{dt} = \frac{\dot{\mu}_t}{\mu_t}\, x_t - \Big[-\frac{\dot{\sigma}_t}{\sigma_t} + \frac{\dot{\mu}_t}{\mu_t}\Big](x_t - \mu_t D_\theta(x_t, \sigma_t)).   (11)

Summary. To summarize our theoretical framework, we present an overview of the comparison between
Gaussian diffusion models and our proposed heavy-tailed diffusion models in Table 1.

3.5 SPECIFIC INSTANTIATIONS: T-EDM

Karras et al. (2022) highlight several design choices during training and sampling which significantly
improve sample quality while reducing the sampling budget for image datasets like CIFAR-10 (Krizhevsky,
2009) and ImageNet (Deng et al., 2009). With a similar motivation, we reformulate the perturbation kernel as
q(x_t|x_0) = t_d(s(t) x_0, s(t)^2 \sigma(t)^2 I_d, \nu) and denote the resulting diffusion model as t-EDM.
Training. During training, we set the perturbation kernel parameters to s(t) = 1 and \sigma(t) = \sigma \sim
LogNormal(P_mean, P_std). Moreover, we parameterize the denoiser D_\theta(x_t, \sigma_t) as

    D_\theta(x, \sigma) = c_{skip}(\sigma, \nu)\, x + c_{out}(\sigma, \nu)\, F_\theta(c_{in}(\sigma, \nu)\, x,\ c_{noise}(\sigma)).

Our denoiser parameterization is similar to Karras et al. (2022), with the difference that coefficients like c_out
additionally depend on \nu. We include full derivations in Appendix A.6. Consequently, our denoising loss can
be specified as follows:

    \mathcal{L}(\theta) \propto \mathbb{E}_{x_0 \sim p(x_0)}\, \mathbb{E}_{\sigma}\, \mathbb{E}_{n \sim t_d(0, \sigma^2 I_d, \nu)}\big[\lambda(\sigma, \nu)\, \|D_\theta(x_0 + n, \sigma) - x_0\|_2^2\big],   (12)

where \lambda(\sigma, \nu) is a weighting function set to \lambda(\sigma, \nu) = 1/c_{out}(\sigma, \nu)^2.
Figure 3: Sample 1-d histogram comparison between EDM (+INC), EDM (+PCP), and t-EDM on the test set for the
Vertically Integrated Liquid (VIL) channel. t-EDM captures heavy-tailed behavior more accurately than other baselines.
INC: Inverse CDF Normalization, PCP: Per-Channel Preconditioning.

Sampling. Interestingly, it can be shown that the ODE in Eqn. 11 is equivalent to the deterministic dynamics
presented in Karras et al. (2022) (see Appendix A.7 for proof). Consequently, we choose s(t) = 1 and
\sigma(t) = t during sampling, further simplifying the dynamics in Eqn. 11 to

    \frac{dx_t}{dt} = \frac{x_t - D_\theta(x_t, t)}{t}.

We adopt the timestep discretization schedule and the choice of the numerical ODE solver (Heun's method
(Ascher & Petzold, 1998)) directly from EDM.
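A minimal NumPy sketch of this sampler (Algorithm 2, i.e., Heun's method applied to dx/dt = (x − D_θ(x, t))/t), assuming a trained denoiser and a decreasing timestep grid t_0 > t_1 > ... > t_N = 0; names are ours:

    import numpy as np

    def t_edm_sample(denoiser, t_steps, shape, nu, rng):
        # Heun (2nd-order) ODE sampler for t-EDM with mu_t = 1, sigma_t = t.
        # The only change w.r.t. Gaussian EDM is the Student-t initial noise.
        z = rng.standard_normal(shape)
        kappa = rng.chisquare(nu) / nu
        x = t_steps[0] * z / np.sqrt(kappa)                  # x_0 ~ t_d(0, t_0^2 I, nu)
        for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
            d_cur = (x - denoiser(x, t_cur)) / t_cur         # Euler slope
            x_next = x + (t_next - t_cur) * d_cur
            if t_next > 0:                                   # Heun correction
                d_next = (x_next - denoiser(x_next, t_next)) / t_next
                x_next = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)
            x = x_next
        return x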
Figure 2 illustrates the ease of transitioning from a Gaussian diffusion framework (EDM) to t-EDM. Under
standard settings, transitioning to t-EDM requires as few as two lines of code change, making our method
readily compatible with existing implementations of Gaussian diffusion models.
Extension to Flows. While our discussion has been restricted to diffusion models, we can also construct flows
(Albergo et al., 2023a; Lipman et al., 2023) using heavy-tailed base distributions. We denote the resulting
model as t-Flow and discuss its construction in more detail in App. B.

4 EXPERIMENTS
To assess the effectiveness of the proposed heavy-tailed diffusion and flow models, we demonstrate experi-
ments using real-world weather data for both unconditional and conditional generation tasks. We include full
experimental details in App. C.
Datasets. We adopt the High-Resolution Rapid Refresh (HRRR) (Dowell et al., 2022) dataset, which is an
operational archive of the US km-scale forecasting model. Among multiple dynamical variables in the dataset
that exhibit heavy-tailed behavior, based on their dynamic range, we only consider the Vertically Integrated
Liquid (VIL) and Vertical Wind Velocity at level 20 (denoted as w20) channels (see App. C.1 for more details).
It is worth noting that the VIL and w20 channels have heavier right and left tails, respectively (see Fig. 6).
Tasks and Metrics. We consider both unconditional and conditional generative tasks relevant to weather
and climate science. For unconditional modeling, we aim to generate the VIL and w20 physical variables
in the HRRR dataset. For conditional modeling, we aim to generatively predict the hour-ahead (lead time \tau + 1)
evolution of VIL and w20 based on information only at the current state time \tau; see more details in the appendix and see Pathak et al. (2024) for discussion of
why hour-ahead, km-scale atmospheric prediction is a stochastic physical task appropriate for conditional
generative models. To quantify the empirical performance of unconditional modeling, we rely on comparing
1-d statistics of generated and train/test set samples. More specifically, for quantitative analysis, we report the
Kurtosis Ratio (KR), the Skewness Ratio (SR), and the Kolmogorov-Smirnov (KS)-2 sample statistic (at the
tails) between the generated and train/test set samples. For qualitative analysis, we compare 1-d histograms
between generated and train/test set samples. For the conditional task, we adopt standard probabilistic


                              VIL (Train)               VIL (Test)               w20 (Train)              w20 (Test)
           Method       ν     KR↓     SR↓    KS↓        KR↓    SR↓   KS↓         ν    KR↓    SR↓   KS↓     KR↓   SR↓   KS↓
Baselines  EDM          ∞     210.11  10.79  0.997      45.35  5.23  0.991       ∞    12.59  0.89  0.991   5.01  0.38  0.978
           +INC         ∞     11.33   2.29   0.987      1.70   0.74  0.95        ∞    1.80   0.18  0.909   0.23  0.13  0.763
           +PCP         ∞     2.12    0.72   0.800      0.31   0.09  0.522       ∞    2.17   0.70  0.838   0.40  0.24  0.648
Ours       t-EDM        3     1.06    0.43   0.431      0.54   0.23  0.114       3    2.44   0.65  0.683   0.52  0.21  0.286
           t-EDM        5     29.66   4.07   0.955      5.73   1.68  0.888       5    8.55   1.77  0.895   3.22  1.03  0.774
           t-EDM        7     24.35   4.14   0.959      4.57   1.72  0.908       7    7.03   1.58  0.82    2.55  0.89  0.622

Table 2: t-EDM outperforms standard diffusion models for unconditional generation on the HRRR dataset. For all metrics,
lower is better. Values in bold indicate the best results in a column.

prediction score metrics such as the Continuous Ranked Probability Score (CRPS), the Root-Mean Squared
Error (RMSE), and the skill-spread ratio (SSR); see, e.g., Mardani et al. (2023a); Srivastava et al. (2023). A
more detailed explanation of our evaluation protocol is provided in App. C.
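For illustration, the sketch below computes sample-based versions of the unconditional tail statistics named above: kurtosis and skewness ratios against a reference set, and a two-sample KS statistic restricted to the upper tail. The exact definitions and thresholds used in the paper are given in App. C; the ones below are our assumptions.

    import numpy as np
    from scipy import stats

    def tail_metrics(generated, reference, tail_q=0.99):
        # Illustrative KR / SR / tail-KS computations; these definitions are assumptions,
        # not necessarily the paper's exact evaluation protocol (see App. C).
        kr = abs(stats.kurtosis(generated) / stats.kurtosis(reference) - 1.0)   # kurtosis ratio error
        sr = abs(stats.skew(generated) / stats.skew(reference) - 1.0)           # skewness ratio error
        thresh = np.quantile(reference, tail_q)                                 # restrict KS to the tail
        ks = stats.ks_2samp(generated[generated > thresh],
                            reference[reference > thresh]).statistic
        return kr, sr, ks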
Methods and Baselines. In addition to standard diffusion (EDM (Karras et al., 2022)) and flow models
(Albergo et al., 2023a) based on Gaussian priors, we introduce two additional baselines that are variants
of EDM. To account for the high dynamic-range often exhibited by heavy-tailed distributions, we include
Inverse CDF Normalization (INC) as an alternative data preprocessing step to z-score normalization. Using
the former reduces dynamic range significantly and can make the data distribution closer to Gaussian. We
denote this preprocessing scheme combined with standard EDM training as EDM + INC. Alternatively, we
could instead modulate the noise levels used during EDM training as a function of the dynamic range of the
input channel while keeping the data preprocessing unchanged. The main intuition is to use more heavy-tailed
noise for large values. We denote this modulating scheme as Per-Channel Preconditioning (PCP) and denote
the resulting baseline as EDM + PCP. We elaborate on these baselines in more detail in App. C.1.2.

4.1 UNCONDITIONAL GENERATION

We assess the effectiveness of different methods on unconditional modeling for the VIL and w20 channels
in the HRRR dataset. Fig. 3 qualitatively compares 1-d histograms of sample intensities between different
methods for the VIL channel. We make the following key observations. Firstly, though EDM (with additional
tricks like noise conditioning) can improve tail coverage, t-EDM covers a broader range of extreme values
in the test set. Secondly, in addition to better dynamic range coverage, t-EDM qualitatively performs much
better in capturing the density assigned to intermediate intensity levels under the model. We note similar
observations from our quantitative results in Table 2, where t-EDM outperforms other baselines on the KS
metric, implying our model exhibits better tail estimation over competing baselines for both the VIL and
w20 channels. More importantly, unlike traditional Gaussian diffusion models like EDM, t-EDM enables
controllable tail estimation by varying ν, which could be useful when modeling a combination of channels
with diverse statistical properties. On the contrary, standard diffusion models like EDM do not have such
controllability. Lastly, we present similar quantitative results for t-Flow in Table 3. We present additional
results for unconditional modeling in App. C.1.

4.2 CONDITIONAL GENERATION

Next, we consider the task of conditional modeling, where we aim to predict the hourly evolution of a
target variable for the next lead time (τ ` 1) based on the current state at time τ . Table 4 illustrates the
performance of EDM and t-EDM on this task for the VIL and w20 channels. We make the following key
observations. Firstly, for both channels, t-EDM exhibits better CRPS and SSR scores, implying better
probabilistic forecast skill and ensemble calibration than EDM. Moreover, while t-EDM exhibits under-dispersion for


                                 VIL (Train)             VIL (Test)              w20 (Train)             w20 (Test)
           Method          ν     KR↓    SR↓   KS↓        KR↓   SR↓   KS↓         ν    KR↓    SR↓   KS↓    KR↓   SR↓    KS↓
Baselines  Gaussian Flow   ∞     0.46   0.09  0.897      0.67  0.52  0.704       ∞    2.03   0.36  0.294  0.34  0.01   0.384
Ours       t-Flow          3     1.39   0.37  0.711      0.47  0.27  0.275       5    1.08   0.21  0.333  0.07  0.42   0.512
           t-Flow          5     3.30   0.75  0.857      0.05  0.07  0.633       7    3.24   0.36  0.259  0.87  0.01   0.300
           t-Flow          7     3.36   0.84  0.844      0.04  0.02  0.603       9    5.47   0.41  0.478  1.86  0.034  0.289

Table 3: t-Flow outperforms standard Gaussian flows for unconditional generation on the HRRR dataset. For all metrics,
lower is better. Values in bold indicate the best results in a column.
                            VIL (Test)                                    w20 (Test)
           Method     ν     CRPS↓   RMSE↓   SSR (→1)   KS↓         ν     CRPS↓   RMSE↓   SSR (→1)   KS↓
Baselines  EDM        ∞     1.696   4.473   0.203      0.715       ∞     0.304   0.664   0.865      0.345
Ours       t-EDM      3     1.649   4.526   0.255      0.419       3     0.295   0.734   1.045      0.111
           t-EDM      5     1.609   4.361   0.305      0.665       5     0.301   0.674   0.901      0.323

Table 4: t-EDM outperforms EDM for conditional next-frame prediction on the HRRR dataset. Values in bold indicate
the best results in each column. We note that VIL has a higher dynamic range than w20, and thus the gains for VIL are
more apparent (see histogram plots in Fig. 6). (→1) indicates values near 1 are better.

VIL, it is well-calibrated for w20, with its SSR close to the ideal score of 1. On the contrary, the baseline
EDM model exhibits under-dispersion for both channels, thus implying overconfident predictions. Secondly,
in addition to better calibration, t-EDM is better at tail estimation (as measured by the KS statistic) for the
underlying conditional distribution. Lastly, we notice that different values of the parameter ν are optimal for
different channels, which suggests a more data-driven approach to learning the optimal ν directly. We present
additional results for conditional modeling in App. C.2.

5 DISCUSSION AND THEORETICAL INSIGHTS


To conclude, we propose a framework for constructing heavy-tailed diffusion models and demonstrate their
effectiveness over traditional diffusion models on unconditional and conditional generation tasks for a high-
resolution weather dataset. Here, we highlight some theoretical connections that help gain insights into the
effectiveness of our proposed framework while establishing connections with prior work.
Exploring distribution tails during sampling. The ODE in Eq. 11 can be re-formulated as,

    \frac{dx_t}{dt} = \frac{\dot{\mu}_t}{\mu_t}\, x_t + \sigma_t^2\Big(\frac{\nu + d_1'}{\nu + d}\Big)\Big[\frac{\dot{\mu}_t}{\mu_t} - \frac{\dot{\sigma}_t}{\sigma_t}\Big]\nabla_x \log p(x_t, t),   (13)

where d_1' = (1/\sigma_t^2)\|x_t - \mu_t D_\theta(x_t, \sigma_t)\|_2^2. By formulating the ODE in terms of the score function, we can
gain some intuition into the effectiveness of our model in modeling heavy-tailed distributions. Figure 4
illustrates the variation of the mean and variance of the multiplier (\nu + d_1')/(\nu + d) along the diffusion
trajectory across 1M samples generated from our toy models. Interestingly, as the value of \nu decreases, the
mean and variance of this multiplier increase significantly, which leads to large score multiplier weights. We
hypothesize that this behavior allows our proposed model to explore more diverse regions during sampling
(more details in App. A.10).

Figure 4: Variation of the mean and standard deviation of the ratio (\nu + d_1')/(\nu + d) with \nu across the diffusion
sampling trajectory for the toy dataset. As \nu decreases, the mean ratio and its standard deviation increase, leading to
large score multiplier weights.


Enabling efficient tail coverage during training. The optimization objective in Eq. 8 has several connections
with robust statistical estimators. More specifically, it can be shown that (proof in App. A.11),

    \nabla_\theta D_\gamma(q \,\|\, p_\theta) = -\int q(x)\Big(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\Big)^{\gamma}\Big(\nabla_\theta \log p_\theta(x) - \mathbb{E}_{\tilde{p}_\theta(x)}\big[\nabla_\theta \log p_\theta(x)\big]\Big) dx,

where q and p_\theta denote the forward (q(x_{t-\Delta t}|x_t, x_0)) and reverse diffusion posteriors (p_\theta(x_{t-\Delta t}|x_t)), respec-
tively. Intuitively, the coefficient \gamma weighs the likelihood gradient \nabla_\theta \log p_\theta(x) and can be set accordingly
to ignore or consider outliers when modeling the data distribution. Specifically, when \gamma > 1, the model
learns to ignore outliers (Futami et al., 2018; Fujisawa & Eguchi, 2008; Basu et al., 1998), since data points on
the tails would be assigned low likelihood. On the contrary, with a negative value of \gamma (as is the case in this
work, since we set \gamma = -2/(\nu + d)), the model can assign more weight to capturing these extreme values.
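To make the effect of the sign of γ concrete, one can evaluate the weight (p_θ(x)/‖p_θ‖_{1+γ})^γ that multiplies the likelihood gradient above. A small illustrative sketch (ours) for a 1-d Student-t model density, showing that a negative γ upweights tail points while a positive γ suppresses them:

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    def gradient_weights(x, pdf, gamma, lo=-100.0, hi=100.0):
        # Weight (p(x) / ||p||_{1+gamma})^gamma applied to the likelihood gradient in grad_theta D_gamma.
        p_norm = quad(lambda u: pdf(u) ** (1.0 + gamma), lo, hi)[0] ** (1.0 / (1.0 + gamma))
        return (pdf(x) / p_norm) ** gamma

    p = stats.t(df=3).pdf
    x = np.array([0.0, 2.0, 10.0])                 # bulk vs. tail points
    print(gradient_weights(x, p, gamma=-0.2))      # negative gamma: the tail point gets the largest weight
    print(gradient_weights(x, p, gamma=0.5))       # positive gamma: the tail point gets the smallest weight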
We discuss some other connections to prior work in heavy-tailed generative modeling and more recent work
in diffusion models in App. F.1 and some limitations of our approach in App. F.2.

REPRODUCIBILITY STATEMENT

We include proofs for all theoretical results introduced in the main text in Appendix A. We describe our
complete experimental setup (including data processing steps, model specification for training and inference,
description of evaluation metrics, and extended experimental results) in Appendix C.

ETHICS STATEMENT

We develop a generative framework for modeling heavy-tailed distributions and demonstrate its effectiveness
for scientific applications. In this context, we do not think our model poses a risk of misinformation or other
ethical biases associated with large-scale image synthesis models. However, we would like to point out that
similar to other generative models, our model can sometimes hallucinate predictions for certain channels,
which could impact downstream applications like weather forecasting.

REFERENCES
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying
framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023a.

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying
framework for flows and diffusions, 2023b. URL https://arxiv.org/abs/2303.08797.

D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical
Society. Series B (Methodological), 36(1):99–102, 1974. ISSN 00359246. URL http://www.jstor.
org/stable/2984774.

Uri M. Ascher and Linda R. Petzold. Computer Methods for Ordinary Differential Equations and
Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, Philadelphia, PA,
1998. doi: 10.1137/1.9781611971392. URL https://epubs.siam.org/doi/abs/10.1137/1.
9781611971392.

Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising
a density power divergence. Biometrika, 85(3):549–559, 1998.


Tim Bollerslev. A conditionally heteroskedastic time series model for speculative prices and rates of
return. The Review of Economics and Statistics, 69(3):542–547, 1987. ISSN 00346535, 15309142. URL
http://www.jstor.org/stable/1925546.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte
Carlo. Chapman and Hall/CRC, May 2011. ISBN 9780429138508. doi: 10.1201/b10905. URL
http://dx.doi.org/10.1201/b10905.
Tianfeng Chai and Roland R Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments
against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference
on Learning Representations, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255,
2009. doi: 10.1109/CVPR.2009.5206848.
Peng Ding. On the conditional distribution of the multivariate t distribution, 2016. URL https://arxiv.
org/abs/1604.00561.
Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped
langevin diffusion, 2022. URL https://arxiv.org/abs/2112.07068.
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep
networks, 2016. URL https://arxiv.org/abs/1602.02644.
David C. Dowell, Curtis R. Alexander, Eric P. James, Stephen S. Weygandt, Stanley G. Benjamin, Geoffrey S.
Manikin, Benjamin T. Blake, John M. Brown, Joseph B. Olson, Ming Hu, Tatiana G. Smirnova, Terra
Ladwig, Jaymes S. Kenyon, Ravan Ahmadov, David D. Turner, Jeffrey D. Duda, and Trevor I. Alcott.
The high-resolution rapid refresh (hrrr): An hourly updating convection-allowing forecast model. part i:
Motivation and system description. Weather and Forecasting, 37(8):1371 – 1395, 2022. doi: 10.1175/
WAF-D-21-0151.1. URL https://journals.ametsoc.org/view/journals/wefo/37/8/
WAF-D-21-0151.1.xml.
Shinto Eguchi. Chapter 2 - pythagoras theorem in information geometry and applications to generalized linear
models. In Angelo Plastino, Arni S.R. Srinivasa Rao, and C.R. Rao (eds.), Information Geometry, volume 45
of Handbook of Statistics, pp. 15–42. Elsevier, 2021. doi: https://doi.org/10.1016/bs.host.2021.06.001. URL
https://www.sciencedirect.com/science/article/pii/S0169716121000225.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi,
Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey,
Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution
image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contami-
nation. Journal of Multivariate Analysis, 99(9):2053–2081, 2008. ISSN 0047-259X. doi: https://doi.org/
10.1016/j.jmva.2008.02.004. URL https://www.sciencedirect.com/science/article/
pii/S0047259X08000456.
Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences, 2018.
URL https://arxiv.org/abs/1710.06595.


Gaby Joanne Gründemann, Nick van de Giesen, Lukas Brunner, and Ruud van der Ent. Rarest rainfall
events will see the greatest relative increase in magnitude under future climate change. Communications
Earth & Environment, 3(1), October 2022. ISSN 2662-4435. doi: 10.1038/s43247-022-00558-8. URL
http://dx.doi.org/10.1038/s43247-022-00558-8.
H. Guo, J.-C. Golaz, L. J. Donner, B. Wyman, M. Zhao, and P. Ginoux. Clubb as a unified cloud pa-
rameterization: Opportunities and challenges. Geophysical Research Letters, 42(11):4540–4547, 2015.
doi: https://doi.org/10.1002/2015GL063672. URL https://agupubs.onlinelibrary.wiley.
com/doi/abs/10.1002/2015GL063672.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium, 2018. URL https:
//arxiv.org/abs/1706.08500.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851, 2020.
M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing
splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990. doi: 10.1080/
03610919008812866. URL https://doi.org/10.1080/03610919008812866.
Priyank Jaini, Ivan Kobyzev, Yaoliang Yu, and Marcus Brubaker. Tails of lipschitz triangular flows, 2020.
URL https://arxiv.org/abs/1907.04481.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based
generative models, 2022. URL https://arxiv.org/abs/2206.00364.
Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, and Joong-Ho Won. $t^3$-variational autoencoder:
Learning heavy-tailed data with student’s t and power divergence, 2024a. URL https://arxiv.org/
abs/2312.01133.
Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, and Joong-Ho Won. $t^3$-variational autoen-
coder: Learning heavy-tailed data with student’s t and power divergence. In The Twelfth International
Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=
RzNlECeoOB.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URL https://arxiv.
org/abs/1312.6114.
Alex Krizhevsky. Learning multiple layers of features from tiny images. pp. 32–33, 2009. URL https:
//www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Mike Laszkiewicz, Johannes Lederer, and Asja Fischer. Marginal tail-adaptive normalizing flows, 2022. URL
https://arxiv.org/abs/2206.10311.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching
for generative modeling. In International Conference on Learning Representations, 2023. URL https:
//openreview.net/forum?id=PqvMRDCJT9t.
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data
with rectified flow, 2022. URL https://arxiv.org/abs/2209.03003.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver
for diffusion probabilistic model sampling in around 10 steps, 2022. URL https://arxiv.org/
abs/2206.00927.


Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash
Vahdat, Karthik Kashinath, Jan Kautz, and Mike Pritchard. Residual corrective diffusion modeling for
km-scale atmospheric downscaling. arXiv preprint arXiv:2309.15214, 2023a.
Morteza Mardani, Jiaming Song, Jan Kautz, and Arash Vahdat. A variational perspective on solving inverse
problems with diffusion models. In The Twelfth International Conference on Learning Representations,
2023b.
Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash
Vahdat, Mohammad Amin Nabian, Tao Ge, Akshay Subramaniam, Karthik Kashinath, Jan Kautz, and
Mike Pritchard. Residual corrective diffusion modeling for km-scale atmospheric downscaling, 2024. URL
https://arxiv.org/abs/2309.15214.
Frank J. Massey. The kolmogorov-smirnov test for goodness of fit. Journal of the American Statistical
Association, 46(253):68–78, 1951. ISSN 01621459, 1537274X. URL http://www.jstor.org/
stable/2280095.
Radford M. Neal. Slice sampling. The Annals of Statistics, 31(3):705 – 767, 2003. doi: 10.1214/aos/
1056562461. URL https://doi.org/10.1214/aos/1056562461.
Kushagra Pandey and Stephan Mandt. A complete recipe for diffusion generative models, 2023. URL
https://arxiv.org/abs/2303.01748.
Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. Diffusevae: Efficient, controllable
and high-fidelity generation from low-dimensional latents, 2022. URL https://arxiv.org/abs/
2201.00308.
Kushagra Pandey, Maja Rudolph, and Stephan Mandt. Efficient integrators for diffusion generative mod-
els. In The Twelfth International Conference on Learning Representations, 2024a. URL https:
//openreview.net/forum?id=qA4foxO5Gf.
Kushagra Pandey, Ruihan Yang, and Stephan Mandt. Fast samplers for inverse problems in iterative refinement
models, 2024b. URL https://arxiv.org/abs/2405.17673.
Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani,
Arash Vahdat, Shaoming Xu, Karthik Kashinath, and Michael Pritchard. Kilometer-scale convection
allowing model emulation using generative diffusion modeling, 2024. URL https://arxiv.org/
abs/2408.10958.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL
https://arxiv.org/abs/2307.01952.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2016. URL
https://arxiv.org/abs/1505.05770.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752.
Dario Shariatian, Umut Simsekli, and Alain Durmus. Denoising lévy probabilistic models, 2024. URL
https://arxiv.org/abs/2407.18609.
Raghav Singhal, Mark Goldstein, and Rajesh Ranganath. Where to diffuse, how to diffuse, and how to get
back: Automated learning for multivariate diffusions, 2023. URL https://arxiv.org/abs/2302.
07261.


John Skilling. The Eigenvalues of Mega-dimensional Matrices, pp. 455–466. Springer Netherlands, Dordrecht,
1989. ISBN 978-94-015-7860-8. doi: 10.1007/978-94-015-7860-8_48. URL https://doi.org/10.
1007/978-94-015-7860-8_48.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
PMLR, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022a. URL
https://arxiv.org/abs/2010.02502.
Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for
inverse problems. In International Conference on Learning Representations, 2022b.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. In International Conference on
Learning Representations, 2020.
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based
diffusion models, 2021. URL https://arxiv.org/abs/2101.09258.
Prakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Brether-
ton, and Stephan Mandt. Precipitation downscaling with spatiotemporal video diffusion. arXiv preprint
arXiv:2312.06071, 2023.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23
(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
Daniel S Wilks. Statistical methods in the atmospheric sciences, volume 100. Academic press, 2011.
Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi S. Jaakkola. Restart
sampling for improving generative processes. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023a. URL https://openreview.net/forum?id=wFuemocyHZ.
Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, and Tommi Jaakkola. Pfgm++:
Unlocking the potential of physics-inspired generative models. In International Conference on Machine
Learning, pp. 38566–38591. PMLR, 2023b.
Eunbi Yoon, Keehun Park, Sungwoong Kim, and Sungbin Lim. Score-based generative models with
lévy processes. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL
https://openreview.net/forum?id=0Wp3VHX0Gm.
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator, 2023.
URL https://arxiv.org/abs/2204.13902.

CONTENTS

1 Introduction
2 Background
  2.1 Diffusion Models
  2.2 Student-t Distributions
3 Heavy-Tailed Diffusion Models
  3.1 Noising Process Design with Student-t Distributions
  3.2 Parameterization of the Reverse Posterior
  3.3 Training with Power Divergences
  3.4 Sampling
  3.5 Specific Instantiations: t-EDM
4 Experiments
  4.1 Unconditional Generation
  4.2 Conditional Generation
5 Discussion and Theoretical Insights
A Proofs
  A.1 Derivation of the Perturbation Kernel
  A.2 On the Posterior Parameterization
  A.3 Proof of Proposition 1
  A.4 Derivation of the Simplified Denoising Loss
  A.5 Proof of Proposition 2
  A.6 Deriving the Denoiser Preconditioner for t-EDM
  A.7 Equivalence with the EDM ODE
  A.8 Conditional Vector Field for t-Flow
  A.9 Connection to Denoising Score Matching
  A.10 ODE Reformulation and Connections to Inverse Problems
  A.11 Connections to Robust Statistical Estimators
B Extension to Flows
C Experimental Setup
  C.1 Unconditional Modeling
    C.1.1 HRRR Dataset
    C.1.2 Baselines
    C.1.3 Evaluation
    C.1.4 Denoiser Architecture
    C.1.5 Training
    C.1.6 Sampling
    C.1.7 Extended Results on Unconditional Modeling
  C.2 Conditional Modeling
    C.2.1 HRRR Dataset for Conditional Modeling
    C.2.2 Baselines
    C.2.3 Denoiser Architecture
    C.2.4 Training
    C.2.5 Sampling
    C.2.6 Evaluation
    C.2.7 Extended Results on Conditional Modeling
  C.3 Toy Experiments
D Optimal Noise Schedule Design
  D.1 Design for EDM
  D.2 Extension to t-EDM
  D.3 Extension to Correlated Gaussian Noise
E Log-Likelihood for t-EDM
F Discussion and Limitations
  F.1 Related Work
  F.2 Limitations

A PROOFS
A.1 DERIVATION OF THE PERTURBATION KERNEL

Proof. By re-parameterization of the distribution q(x|x_0), we have,

    x = \mu + \Sigma^{1/2} z / \sqrt{\kappa}, \qquad z \sim \mathcal{N}(0, I_{2d}) \ \text{and}\ \kappa \sim \chi^2(\nu)/\nu.   (14)

This implies that the conditional distribution q(x|x_0, \kappa) = \mathcal{N}(\mu, \Sigma/\kappa). Therefore, following properties of
Gaussian distributions, q(x_t|x_0, \kappa) = \mathcal{N}(\mu_t x_0, (\sigma_t^2/\kappa)\, I_d). Therefore, from reparameterization,

    x_t\,|\,\kappa = \mu_t x_0 + \sigma_t z / \sqrt{\kappa},   (15)

which implies that q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu). This completes the proof.
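This marginalization argument can also be checked empirically: drawing [x_t; x_{t−Δt}] jointly from the 2d-dimensional Student-t and keeping only x_t should match direct draws from t_d(μ_t x_0, σ_t² I_d, ν). A small Monte Carlo sketch (ours; the schedule values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d, nu = 4, 3.0
    mu_t, mu_prev, sig_t, sig_prev, sig12 = 0.9, 0.95, 1.5, 1.2, 1.0   # illustrative schedule values
    x0 = rng.standard_normal(d)
    n = 200_000

    # 2x2 block of the joint scale matrix (Sec. 3.1); the Kronecker factor I_d acts per dimension.
    block = np.array([[sig_t**2, sig12**2], [sig12**2, sig_prev**2]])
    L = np.linalg.cholesky(block)

    # Route 1: draw [x_t; x_{t-dt}] jointly from t_{2d}(mu, Sigma, nu) and keep the x_t marginal.
    kappa = rng.chisquare(nu, size=(n, 1, 1)) / nu
    z = rng.standard_normal((n, 2, d))
    joint = np.stack([mu_t * x0, mu_prev * x0], axis=0) + (L @ z) / np.sqrt(kappa)
    xt_marginal = joint[:, 0, :]

    # Route 2: draw x_t directly from t_d(mu_t x_0, sig_t^2 I_d, nu).
    kappa2 = rng.chisquare(nu, size=(n, 1)) / nu
    xt_direct = mu_t * x0 + sig_t * rng.standard_normal((n, d)) / np.sqrt(kappa2)

    # Tail quantiles of a single coordinate agree up to Monte Carlo error.
    q = [0.001, 0.01, 0.99, 0.999]
    print(np.quantile(xt_marginal[:, 0], q))
    print(np.quantile(xt_direct[:, 0], q))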

A.2 ON THE POSTERIOR PARAMETERIZATION

The perturbation kernel q(x_t|x_0) for Student-t diffusions is parameterized as,

    q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu).   (16)


Using re-parameterization,

    x_t = \mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}}, \qquad \epsilon \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu)/\nu.   (17)

During inference, given a noisy state x_t, we have the following estimation problem,

    x_t = \mu_t\, \mathbb{E}[x_0\,|\,x_t] + \sigma_t\, \mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\,\Big|\, x_t\Big].   (18)

Therefore, the task of denoising can be posed as either estimating \mathbb{E}[x_0|x_t] or \mathbb{E}[\epsilon/\sqrt{\kappa}\,|\,x_t]. With this motivation,
the posterior p_\theta(x_{t-\Delta t}|x_t) can be parameterized appropriately. Recall the form of the forward posterior

    q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar{\mu}_t,\ \frac{\nu + d_1}{\nu + d}\,\bar{\sigma}_t^2 I_d,\ \nu + d\Big),   (19)

    \bar{\mu}_t = \mu_{t-\Delta t}\, x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}(x_t - \mu_t x_0), \qquad \bar{\sigma}_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2},   (20)

where d_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|^2. Further simplifying the mean \bar{\mu}_t,
    \bar{\mu}_t(x_t, x_0, t) = \mu_{t-\Delta t}\, x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}(x_t - \mu_t x_0)   (21)

                     = \mu_{t-\Delta t}\, x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\mu_t x_0   (22)

                     = \Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big) x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t.   (23)

Therefore, the mean \mu_\theta(x_t, t) of the reverse posterior p_\theta(x_{t-\Delta t}|x_t) can be parameterized as,

    \bar{\mu}_\theta(x_t, t) = \Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}[x_0|x_t] + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t   (24)

                    \approx \Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big) D_\theta(x_t, \sigma_t) + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t,   (25)
where \mathbb{E}[x_0|x_t] is learned using a parametric estimator D_\theta(x_t, \sigma_t). This corresponds to the x_0-prediction
parameterization presented in Eq. 6 in the main text. Alternatively, from Eqn. 18,

    \bar{\mu}_\theta(x_t, t) = \Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}[x_0|x_t] + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t   (26)

                   = \frac{1}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\Big(x_t - \sigma_t\, \mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\,\Big|\, x_t\Big]\Big) + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\, x_t   (27)

                   = \frac{\mu_{t-\Delta t}}{\mu_t}\, x_t - \frac{\sigma_t}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\,\Big|\, x_t\Big]   (28)

                   \approx \frac{\mu_{t-\Delta t}}{\mu_t}\, x_t - \frac{\sigma_t}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t \frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\epsilon_\theta(x_t, \sigma_t),   (29)

where \mathbb{E}[\epsilon/\sqrt{\kappa}\,|\,x_t] is learned using a parametric estimator \epsilon_\theta(x_t, \sigma_t). This corresponds to the \epsilon-prediction
parameterization (Ho et al., 2020).
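In code, converting between the two parameterizations is a one-line identity following Eq. 18 (a sketch under our naming, not the authors' implementation):

    def eps_from_x0(x_t, x0_pred, mu_t, sig_t):
        # Identity from Eq. 18: x_t = mu_t * E[x0|x_t] + sig_t * E[eps/sqrt(kappa)|x_t];
        # the returned value plays the role of the second conditional expectation.
        return (x_t - mu_t * x0_pred) / sig_t

    def x0_from_eps(x_t, eps_pred, mu_t, sig_t):
        # Inverse mapping: recover the x0-prediction implied by an eps-prediction.
        return (x_t - sig_t * eps_pred) / mu_t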


A.3 PROOF OF PROPOSITION 1

We restate the proposition here for convenience,


Proposition 1. For arbitrary distributions q and p, in the limit of \gamma \to 0, D_\gamma(q \,\|\, p) converges to D_{\mathrm{KL}}(q \,\|\, p).
Consequently, the objective in Eqn. 8 converges to the DDPM objective stated in Eqn. 1.

Proof. We present our proof in two parts:

1. Firstly, we establish the following relation between the \gamma-Power Divergence and the KL-Divergence
   between two distributions q and p:

       D_\gamma(q \,\|\, p) = D_{\mathrm{KL}}(q \,\|\, p) + O(\gamma).   (30)

2. Next, for the choice of \gamma = -\frac{2}{\nu + d}, we show that in the limit of \gamma \to 0, the optimization objective in
   Eqn. 8 converges to the optimization objective in Eqn. 1.

Relation between D_{KL}(q \| p) and D_\gamma(q \| p). The γ-Power Divergence as stated in (Kim et al., 2024a)
assumes the following form:
D_\gamma(q \| p) = \frac{1}{\gamma}\big[C_\gamma(q, p) - H_\gamma(q)\big]   (31)
where \gamma \in (-1, 0) \cup (0, \infty), and,
H_\gamma(p) = -\|p\|_{1+\gamma} = -\Big(\int p(x)^{1+\gamma}\, dx\Big)^{\frac{1}{1+\gamma}}, \qquad C_\gamma(q, p) = -\int q(x)\Big(\frac{p(x)}{\|p\|_{1+\gamma}}\Big)^{\gamma} dx   (32)
For subsequent analysis, we assume γ Ñ 0. Under this assumption, we simplify Hγ pqq as follows. By
definition,
ˆż 1
˙ 1`γ
Hγ pqq “ ´ }q}1`γ “ ´ qpxq1`γ dx (33)
ˆż 1
˙ 1`γ
“´ qpxqqpxqγ dx (34)
ˆż 1
˙ 1`γ
“´ qpxq exppγ log qpxqqdx (35)
ˆż 1
˙ 1`γ
“ 2

“´ qpxq 1 ` γ log qpxq ` Opγ q dx (36)
ˆż ż 1
˙ 1`γ
2
“´ qpxqdx ` γ qpxq log qpxqdx ` Opγ q (37)
ˆ ż 1
˙ 1`γ
2
“ ´ 1 ` γ qpxq log qpxqdx ` Opγ q (38)

Using the approximation p1 ` δxqα « 1 ` αδx for a small δ in the above equation, we have,
ˆ ż ı˙
γ ”
Hγ pqq “ ´ }q}1`γ « ´ 1 ` qpxq log qpxqdx ` Opγq (39)
1`γ
ˆ ”ż ı˙
« ´ 1 ` γp1 ´ γq qpxq log qpxqdx ` Opγq (40)


1
where we have used the approximation 1`γ « 1 ´ γ in the above equation. This is justified since γ is
assumed to be small enough. Therefore, we have,
ˆ ż ˙
2
Hγ pqq “ ´ }q}1`γ « ´ 1 ` γ qpxq log qpxqdx ` Opγ q (41)

Similarly, we now obtain an approximation for the power-cross entropy as follows. By definition,
ż ˆ ˙γ
ppxq
Cγ pq, pq “ ´ qpxq dx (42)
}p}1`γ
ż ˆ ˆ ˙ ˙
ppxq
“ ´ qpxq 1 ` γ log ` Opγ 2 q dx (43)
}p}1`γ
ˆż ż ˆ ˙ ˙
ppxq 2
“´ qpxqdx ` γ qpxq log ` Opγ q (44)
}p}1`γ
ˆ ż ż ˙
“ ´ 1 ` γ qpxq log ppxqdx ´ γ qpxq log }p}1`γ dx ` Opγ 2 q (45)

From Eqn. 41, it follows that,


ˆ ż ˙
}p}1`γ « 1`γ ppxq log ppxqdx ` Opγ 2 q (46)

Therefore,
ˆ ż ˙
2
log }p}1`γ “ log 1 ` γ ppxq log ppxqdx ` Opγ q (47)
ż
« γ ppxq log ppxqdx (48)

where the above result follows from the logarithmic series and ignores the terms of order Opγ 2 q or higher.
Plugging the approximation in Eqn. 48 in Eqn. 45, we have,
ˆ ż ż ˙
2
Cγ pq, pq “ ´ 1 ` γ qpxq log ppxqdx ´ γ qpxq log }p}1`γ dx ` Opγ q (49)
ˆ ż ˙
« ´ 1 ` γ qpxq log ppxqdx ` Opγ 2 q (50)

Therefore,
1“ ‰
Dγ pq } pq “ Cγ pq, pq ´ Hγ pqq (51)
γ
ˆż ż ˙
1“ 2

“ γ qpxq log qpxqdx ´ qpxq log ppxqdx ` Opγ q (52)
γ
ż
qpxq
“ qpxq log dx ` Opγq (53)
ppxq
“ DKL pq } pq ` Opγq (54)
This establishes the relationship between the KL and γ-Power divergence between two distributions. Therefore,
for two distributions q and p, the difference in the magnitude of DKL pq } pq and Dγ pq } pq is of the order of
Opγq. In the limit of γ Ñ 0, the Dγ pq } pq Ñ DKL pq } pq. This concludes the first part of our proof.


Equivalence between the objectives under γ → 0. For \gamma = -\frac{2}{\nu + d} and a finite-dimensional dataset
with x_0 \in \mathbb{R}^d, it follows that \gamma \to 0 implies \nu \to \infty. Moreover, in the limit of \nu \to \infty, the multivariate
Student-t distribution converges to a Gaussian distribution. As already shown in the previous part, under this
limit, D_\gamma(q \| p) converges to D_{KL}(q \| p). Therefore, under this limit, the optimization objective in Eqn. 8
converges to the standard DDPM objective in Eqn. 1. This completes the proof.

A.4 D ERIVATION OF THE SIMPLIFIED DENOISING LOSS

Here, we derive the simplified denoising loss presented in Eq. 9 in the main text. We specifically consider
the term D_\gamma(q(x_{t-\Delta t}|x_t, x_0) \| p_\theta(x_{t-\Delta t}|x_t)) in Eq. 8. The γ-power divergence between two Student-t
distributions is given by,
D_\gamma[q_\nu \| p_\nu] = -\frac{1}{\gamma} C_{\nu,d}^{\frac{\gamma}{1+\gamma}} \Big(1 + \frac{d}{\nu-2}\Big)^{-\frac{\gamma}{1+\gamma}} \Big[-|\Sigma_0|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{d}{\nu-2}\Big) + |\Sigma_1|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{1}{\nu-2}\big(\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + \nu(\mu_0 - \mu_1)^\top \Sigma_1^{-1}(\mu_0 - \mu_1)\big)\Big)\Big]   (55)
where \gamma = -\frac{2}{\nu+d}. Recall the definitions of the forward denoising posterior,
q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar{\mu}_t, \frac{\nu + d_1}{\nu + d}\bar{\sigma}_t^2 I_d, \nu + d\Big)   (56)
\bar{\mu}_t = \mu_{t-\Delta t} x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}(x_t - \mu_t x_0), \qquad \bar{\sigma}_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2}   (57)
and the reverse denoising posterior,
p_\theta(x_{t-\Delta t}|x_t) = t_d(\mu_\theta(x_t, t), \bar{\sigma}_t^2 I_d, \nu + d)   (58)
where the denoiser mean \mu_\theta(x_t, t) is further parameterized as follows:
\mu_\theta(x_t, t) = \frac{\sigma_{21}^2(t)}{\sigma_t^2} x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\mu_t\Big] D_\theta(x_t, \sigma_t)   (59)
Since we only parameterize the mean of the reverse posterior, the majority of the terms in the γ-power
divergence are independent of θ and can be ignored (or treated as scalar coefficients). Therefore,
D_\gamma(q(x_{t-\Delta t}|x_t, x_0) \| p_\theta(x_{t-\Delta t}|x_t)) \propto (\bar{\mu}_t - \mu_\theta(x_t, t))^\top (\bar{\mu}_t - \mu_\theta(x_t, t))   (60)
\propto \|\bar{\mu}_t - \mu_\theta(x_t, t)\|_2^2   (61)
\propto \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\mu_t\Big]^2 \|x_0 - D_\theta(x_t, \sigma_t)\|_2^2   (62)
For better sample quality, it is common to ignore the scalar multiple in prior works (Ho et al., 2020; Song
et al., 2020). Therefore, ignoring the time-dependent scalar multiple,
D_\gamma(q(x_{t-\Delta t}|x_t, x_0) \| p_\theta(x_{t-\Delta t}|x_t)) \propto \|x_0 - D_\theta(x_t, \sigma_t)\|_2^2   (63)
Therefore, the final loss function L_\theta can be stated as,
L(\theta) = \mathbb{E}_{x_0 \sim p(x_0)} \mathbb{E}_{t \sim p(t)} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)} \mathbb{E}_{\kappa \sim \frac{1}{\nu}\chi^2(\nu)} \Big\|D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}}, \sigma_t\Big) - x_0\Big\|_2^2   (64)
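For illustration, here is a schematic PyTorch estimate of the loss in Eq. 64 for one batch at a single sampled noise level; the `denoiser` callable, tensor shapes, and scalar (mu_t, sigma_t) arguments are placeholders rather than the released implementation:

```python
import torch

def t_denoising_loss(denoiser, x0, mu_t, sigma_t, nu):
    """One-sample Monte Carlo estimate of Eq. (64)."""
    eps = torch.randn_like(x0)                                          # eps ~ N(0, I_d)
    kappa = torch.distributions.Chi2(float(nu)).sample((x0.shape[0],)) / nu
    kappa = kappa.view(-1, *([1] * (x0.dim() - 1)))                     # broadcast over remaining dims
    x_t = mu_t * x0 + sigma_t * eps / kappa.sqrt()
    return ((denoiser(x_t, sigma_t) - x0) ** 2).mean()
```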


A.5 P ROOF OF P ROPOSITION 2

We restate the proposition here for convenience,


Proposition 2. The posterior parameterization in Eqn. 5 induces the following continuous-time dynamics,
dx_t = \Big[\frac{\dot{\mu}_t}{\mu_t} x_t - \Big(f(\sigma_t, \dot{\sigma}_t) + \frac{\dot{\mu}_t}{\mu_t}\Big)\big(x_t - \mu_t D_\theta(x_t, t)\big)\Big] dt + \sqrt{\beta(t) g(\sigma_t, \dot{\sigma}_t)}\, dS_t   (65)
where f : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R} and g : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+ are scalar-valued functions, \beta(t) \in \mathbb{R}^+ is a scaling
coefficient such that the following condition holds,
\frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \beta(t) g(\sigma_t, \dot{\sigma}_t)\Delta t\big) - 1 = f(\sigma_t, \dot{\sigma}_t)\Delta t   (66)
where \dot{\mu}_t, \dot{\sigma}_t denote the first-order time-derivatives of the perturbation kernel parameters \mu_t and \sigma_t respec-
tively and the differential dS_t \sim t_d(0, dt, \nu + d).

Proof. We start by writing a single sampling step from our learned posterior distribution. Recall
pθ pxt´∆t |xt q “ td pµθ pxt , tq, σ̄t2 Id , ν ` dq (67)
where (using the ϵ-prediction parameterization in App. A.2),
„ ȷ
µt´∆t 1 2 µt´∆t 2
µθ pxt , tq “ xt ` σ21 ptq ´ σt ϵθ pxt , σt q (68)
µt σt µt
From re-parameterization, we have,
σ̄t
xt´∆t “ µθ pxt , tq ` ? z z „ N p0, Id q, κ „ χ2 pν ` dq{pν ` dq (69)
κ
„ ȷ
µt´∆t 1 2 µt´∆t 2 σ̄t
xt´∆t “ xt ` σ21 ptq ´ σt ϵθ pxt , σt q ` ? z (70)
µt σt µt κ
Moreover, we choose the posterior scale to be the same as the forward posterior qpxt´1 |xt , x0 q i.e.
” σ 2 ptqσ 2 ptq ı
σ̄t2 “ σt´∆t
2
´ 21 2 12 Id (71)
σt
This implies,
σ2 ` 2
ptq “ 2 t
2
˘
σ21 σt´∆t ´ σ̄t2 (72)
σ12 ptq
2
Substituting this form of σ21 ptq into Eqn. 70, we have,
„ 2 ȷ
µt´∆t 1 σt ` 2 2
˘ µt´∆t 2 σ̄t
xt´∆t “ xt ` 2 σ t´∆t ´ σ̄ t ´ σ t ϵθ pxt , σt q ` ? z (73)
µt σt σ12 ptq µt κ
„ ȷ
µt´∆t 1 ` 2 ˘ µt´∆t σ̄ t
“ x t ` σt 2 σ ´ σ̄t2 ´ ϵθ pxt , σt q ` ? z (74)
µt σ12 ptq t´∆t µt κ
„ ȷ
µt´∆t 1 ` 2 ˘ µ t´∆t σ̄t
“ x t ` σt 2 σt´∆t ´ σ̄t2 ´ 1 ` 1 ´ ϵθ pxt , σt q ` ? z (75)
µt σ ptq µt κ
„ 12 ȷ
µt´∆t 1 ` 2 ˘ µ
9 t σ̄t
“ x t ` σt 2 σ ´ σ̄t2 ´ 1 ` ∆t ϵθ pxt , σt q ` ? z (76)
µt σ12 ptq t´∆t µt κ
µt ´µt´∆t
where in the above equation we use the first-order approximation µ9 t “ ∆t . Next, we make the
following design choices:


1. Firstly, we assume the following form of the reverse posterior variance σ̄t2 :
σ̄t2 “ βptqgpσt , σ9 t q∆t (77)
where g : R` ˆ R` Ñ R` and βptq P R` represents a time-varying scaling factor chosen
empirically which can be used to vary the noise injected at each sampling step. It is worth noting
that a positive σ9 t (as indicated in the definition of g) is a consequence of a monotonically increasing
noise schedule σt in diffusion model design.
2. Secondly, we make the following design choice:
1 ` 2 ˘
2 σt´∆t ´ σ̄t2 ´ 1 “ f pσt , σ9 t q∆t (78)
σ12 ptq
where f : R` ˆ R` Ñ R

With these two design choices, Eqn. 76 simplifies as:


„ ȷ
µt´∆t 1 ` 2 ˘ µ9 t σ̄t
xt´∆t “ xt ` σt 2 σt´∆t ´ σ̄t2 ´ 1 ` ∆t ϵθ pxt , σt q ` ? z (79)
µt σ ptq µt κ
„ 12 ȷ
µt´∆t µ9 t a ? z
“ xt ` σt f pσt , σ9 t q∆t ` ∆t ϵθ pxt , σt q ` βptqgpσt , σ9 t q ∆t ? (80)
µt µt κ
„ ȷ
µt´∆t µ9 t a ? z
“ xt ` σt f pσt , σ9 t q ` ϵθ pxt , σt q∆t ` βptqgpσt , σ9 t q ∆t ? (81)
µt µt κ
„ ȷ
` µt´∆t ˘ µ9 t a ? z
xt´∆t ´ xt “ ´ 1 xt ` σt f pσt , σ9 t q ` ϵθ pxt , σt q∆t ` βptqgpσt , σ9 t q ∆t ? (82)
µt µt κ
« „ ȷ ff
µ9 t µ9 t a ? z
xt´∆t ´ xt “ ´ xt ´ σt f pσt , σ9 t q ` ϵθ pxt , σt q ∆t ` βptqgpσt , σ9 t q ∆t ? (83)
µt µt κ
In the limit of ∆t Ñ 0:
« „ ȷ ff
µ9 t µ9 t a dWt
dxt “ xt ´ σt f pσt , σ9 t q ` ϵθ pxt , σt q dt ` βptqgpσt , σ9 t q ? (84)
µt µt κ
which gives the required SDE formulation for the diffusion posterior sampling. Next, we discuss specific
choices of f pσt , σ9 t q and gpσt , σ9 t q.
On the choice of f pσt , σ9 t q and gpσt , σ9 t q: Recall our design choices:
σ̄t2 “ βptqgpσt , σ9 t q∆t (85)
1 ` 2 ˘
2 ptq σt´∆t ´ σ̄t2 ´ 1 “ f pσt , σ9 t q∆t (86)
σ12
Substituting the value of σ̄t2 from the first design choice to the second yields the following condition:
1 ` 2 ˘
2 ptq σt´∆t ´ βptqgpσt , σ
σ12
9 t q∆t ´ 1 “ f pσt , σ9 t q∆t (87)

This concludes the proof. It is worth noting that the above equation provides two degrees of freedom, i.e.,
we can choose two variables among σ12 ptq, g, f and automatically determine the third. However, it is more
convenient to choose σ12 ptq and g, since both these quantities should be positive. Different choices of these
quantities yield different instantiations of the SDE in Eqn. 84 as illustrated in the main text.


A.6 D ERIVING THE D ENOISER P RECONDITIONER FOR T-EDM

Recall our denoiser parameterization,

Dθ px, σq “ cskip pσ, νqx ` cout pσ, νqFθ pcin pσ, νqx, cnoise pσqq (88)

Karras et al. (2022) highlight the following design choices, which we adopt directly.

1. Derive cin based on constraining the input variance to 1


2. Derive cskip and cout to constrain output variance to 1 and additionally minimizing cout to bound
scaling errors in Fθ px, σq.

The coefficient c_in can be derived by setting the network inputs to have unit variance. Therefore,
\mathrm{Var}_{x_0, n}\big[c_{in}(\sigma)(x_0 + n)\big] = 1   (89)
c_{in}(\sigma, \nu)^2\, \mathrm{Var}_{x_0, n}\big[x_0 + n\big] = 1   (90)
c_{in}(\sigma, \nu)^2\Big(\sigma_{data}^2 + \frac{\nu}{\nu-2}\sigma^2\Big) = 1   (91)
c_{in}(\sigma, \nu) = 1 \Big/ \sqrt{\frac{\nu}{\nu-2}\sigma^2 + \sigma_{data}^2}.   (92)
The coefficients c_skip and c_out can be derived by setting the training target to have unit variance. Similar to
Karras et al. (2022) our training target can be specified as:
F_{target} = \frac{1}{c_{out}(\sigma, \nu)}\big(x_0 - c_{skip}(\sigma, \nu)(x_0 + n)\big)   (93)
\mathrm{Var}_{x_0, n}\big[F_{target}(x_0, n; \sigma)\big] = 1   (94)
\mathrm{Var}_{x_0, n}\Big[\frac{1}{c_{out}(\sigma, \nu)}\big(x_0 - c_{skip}(\sigma, \nu)(x_0 + n)\big)\Big] = 1   (95)
\frac{1}{c_{out}(\sigma, \nu)^2}\mathrm{Var}_{x_0, n}\big[x_0 - c_{skip}(\sigma, \nu)(x_0 + n)\big] = 1   (96)
c_{out}(\sigma, \nu)^2 = \mathrm{Var}_{x_0, n}\big[x_0 - c_{skip}(\sigma, \nu)(x_0 + n)\big]   (97)
c_{out}(\sigma, \nu)^2 = \mathrm{Var}_{x_0, n}\big[\big(1 - c_{skip}(\sigma, \nu)\big)x_0 - c_{skip}(\sigma, \nu)\, n\big]   (98)
c_{out}(\sigma, \nu)^2 = \big(1 - c_{skip}(\sigma, \nu)\big)^2 \sigma_{data}^2 + c_{skip}(\sigma, \nu)^2 \frac{\nu}{\nu-2}\sigma^2.   (99)
Lastly, setting c_skip(σ, ν) to minimize c_out(σ, ν), we obtain,
c_{skip}(\sigma, \nu) = \sigma_{data}^2 \Big/ \Big(\frac{\nu}{\nu-2}\sigma^2 + \sigma_{data}^2\Big).   (100)
Consequently c_out(σ, ν) can be specified as:
c_{out}(\sigma, \nu) = \sqrt{\frac{\nu}{\nu-2}}\,\sigma \cdot \sigma_{data} \Big/ \sqrt{\frac{\nu}{\nu-2}\sigma^2 + \sigma_{data}^2}.   (101)
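The resulting coefficients are simple to compute; below is a small sketch (our own helper, with sigma_data defaulting to the value used in our experiments) that reduces to the EDM preconditioner as ν → ∞:

```python
import math

def t_edm_preconditioning(sigma, nu, sigma_data=1.0):
    """c_skip, c_out, c_in from Eqs. (92), (100), (101), and the c_noise choice listed in Table 5."""
    var = (nu / (nu - 2)) * sigma ** 2 + sigma_data ** 2   # variance of the noisy input x0 + n
    c_skip = sigma_data ** 2 / var
    c_out = sigma * sigma_data * math.sqrt(nu / (nu - 2)) / math.sqrt(var)
    c_in = 1.0 / math.sqrt(var)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise
```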


A.7 E QUIVALENCE WITH THE EDM ODE

Similar to Karras et al. (2022), we start by deriving the optimal denoiser for the denoising loss func-
tion. Moreover, we reformulate the form of the perturbation kernel as qpxt |x0 q “ td pµt x0 , σt2 Id , νq “
td psptqx0 , sptq2 σptq2 Id , νq by setting µt “ sptq and σt “ sptqσptq. The denoiser loss can then be specified
as follows,
“ ‰
LpD, σq “ Ex0 „ppx0 q En„td p0,σ2 Id ,νq λpσ, νq}Dpsptqrx0 ` ns, σq ´ x0 }22 (102)
“ 2

“ Ex0 „ppx0 q Ex„td psptqx0 ,sptq2 σ2 Id ,νq λpσ, νq}Dpx, σq ´ x0 }2 (103)
”ż “ ‰ ı
“ Ex0 „ppx0 q td px; sptqxi , sptq2 σ 2 Id , νq λpσ, νq}Dpx, σq ´ x0 }22 dx (104)
ż
1 ÿ “ ‰
“ td px; sptqxi , sptq2 σ 2 Id , νq λpσ, νq}Dpx, σq ´ xi }22 dx (105)
N i

where the last result follows from assuming ppx0 q as the empirical data distribution. Thus, the optimal
denoiser can be specified by setting ∇D LpD, σq “ 0. Therefore,

∇D LpD, σq “ 0 (106)

Consequently,
ż
1 ÿ “ ‰
∇D td px; sptqxi , sptq2 σ 2 Id , νq λpσ, νq}Dpx, σq ´ xi }22 dx “ 0 (107)
N i
ż
1 ÿ “ ‰
td px; sptqxi , sptq2 σ 2 Id , νq λpσ, νqpDpx, σq ´ xi q dx “ 0 (108)
N i
ż ÿ
td px; sptqxi , sptq2 σ 2 Id , νqpDpx, σq ´ xi qdx “ 0 (109)
i

The optimal denoiser Dpx, σq, can be obtained from Eq. 109 as,

td px; sptqxi , sptq2 σ 2 Id , νqxi


ř
Dpx, σq “ ři 2 2
(110)
i td px; sptqxi , sptq σ Id , νq

We can further simplify the optimal denoiser as,

td px; sptqxi , sptq2 σ 2 Id , νqxi


ř
Dpx, σq “ ři (111)
td px; sptqxi , sptq2 σ 2 Id , νq
ř i x 2
i td p sptq ; xi , σ Id , νqxi
“ ř x 2
(112)
i td p sptq ; xi , σ Id , νq
´ x ¯
“D ,σ (113)
sptq

Next, recall the ODE dynamics (Eqn. 11) in our formulation,


„ ȷ
dxt µ9 t σ9 t µ9 t
“ xt ´ ´ ` pxt ´ µt Dθ pxt , σptqqq (114)
dt µt σt µt


Reparameterizing the above ODE by setting µt “ sptq and σt “ sptqσptq,


„ ȷ
dxt µ9 t σ9 t µ9 t
“ xt ´ ´ ` pxt ´ sptqDθ pxt , σptqqq (115)
dt µt σt µt
„ ȷ
sptq
9 sptq
9 σptqsptq
9 ` σptqsptq
9
“ xt ´ ´ pxt ´ sptqDθ pxt , σptqqq (116)
sptq sptq σptqsptq
„ ȷ
sptq
9 σptq
9
“ xt ´ ´ pxt ´ sptqDθ pxt , σptqqq (117)
sptq σptq
sptq
9 σptq
9
“ xt ` pxt ´ sptqDθ pxt , σptqqq (118)
sptq σptq
Lastly, since Karras et al. (2022) propose to train the denoiser Dθ using unscaled noisy state, from Eqn. 113,
we can re-write the above ODE as,
dxt ” sptq
9 σptq
9 ı σptqsptq
9 ´ x
t
¯
“ ` xt ´ Dθ , σptq (119)
dt sptq σptq σptq sptq
The form of the ODE in Eqn. 119 is equivalent to the ODE presented in Karras et al. (2022) (Algorithm 1
line 4 in their paper). This concludes the proof.

A.8 C ONDITIONAL V ECTOR F IELD FOR T-F LOW

Here, we derive the conditional vector field for the t-Flow interpolant. Recall, the interpolant in t-Flow is
derived as follows,
x_t = t x_0 + (1 - t)\frac{\epsilon}{\sqrt{\kappa}}, \quad \epsilon \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu)/\nu   (120)
It follows that,
x_t = t\,\mathbb{E}[x_0|x_t] + (1 - t)\,\mathbb{E}\big[\tfrac{\epsilon}{\sqrt{\kappa}}\,\big|\,x_t\big]   (121)
Moreover, following Eq. 2.10 in Albergo et al. (2023b), the conditional vector field b(x_t, t) can be defined as,
b(x_t, t) = \mathbb{E}[\dot{x}_t|x_t] = \mathbb{E}[x_0|x_t] - \mathbb{E}\big[\tfrac{\epsilon}{\sqrt{\kappa}}\,\big|\,x_t\big]   (122)
From Eqns. 121 and 122, the conditional vector field can be simplified as,
b(x_t, t) = \frac{x_t - \mathbb{E}\big[\tfrac{\epsilon}{\sqrt{\kappa}}\,\big|\,x_t\big]}{t}   (123)
This concludes the proof.
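A minimal sketch of the interpolant and the induced vector field follows; here `eps_pred` stands for the learned estimate of E[ε/√κ | x_t], e.g. the output of ε_θ(x_t, σ_t):

```python
import numpy as np

def t_flow_interpolant(x0, eps, kappa, t):
    """Eq. (120): interpolate between data x0 and Student-t noise eps / sqrt(kappa)."""
    return t * x0 + (1.0 - t) * eps / np.sqrt(kappa)

def t_flow_vector_field(x_t, t, eps_pred):
    """Eq. (123): b(x_t, t) = (x_t - E[eps/sqrt(kappa) | x_t]) / t."""
    return (x_t - eps_pred) / t
```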

A.9 C ONNECTION TO D ENOISING S CORE M ATCHING

We start by defining the score for the perturbation kernel qpxt |x0 q. The pdf for the perturbation kernel
qpxt |x0 q “ td pµt x0 , σt2 Id , νq can be expressed as (ignoring the normalization constant):
„ ȷ ´pν`dq
1 ` ˘J ` ˘ 2
qpxt |x0 q 9 1` xt ´ µt x0 xt ´ µt x0 (124)
νσt2


Therefore,
„ ȷ
ν`d 1 2
∇xt log qpxt |x0 q “ ´ ∇xt log 1 ` }xt ´ µt x0 }2 (125)
2 νσt2
ˆ ˙
ν`d 1 2
“´ xt ´ µt x0 (126)
2 1 ` νσ1 2 }xt ´ µt x0 }22 νσt2
t

Denoting d_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|_2^2 for convenience, we have,
ˆ ˙ ˆ ˙
ν`d 1
∇xt log qpxt |x0 q “ ´ x t ´ µ x
t 0 (127)
ν ` d1 σt2

In Denoising Score Matching (DSM) (Vincent, 2011), the following objective is minimized,
« ff
LDSM “ Et„pptq,x0 „ppx0 q,xt „qpxt |x0 q γptq}∇xt log qpxt |x0 q ´ sθ pxt , tq}22 (128)

for some loss weighting function γptq. Parameterizing the score estimator sθ pxt , tq as:
ˆ ˙ ˆ ˙
ν`d 1
sθ pxt , tq “ ´ xt ´ µt Dθ pxt , σt q (129)
ν ` d1 σt2
With this parameterization of sθ pxt , tq and some choice of γptq, the DSM objective can be further simplified
as follows,
”´ ν ` d ¯2 › ϵ ›2 ı
LDSM “ Ex0 „ppx0 q Et Eϵ„N p0,Iq,κ„χ2 pνq{ν ›x0 ´ Dθ pµt x0 ` σt ? , σt q› (130)
› ›
ν ` d1 κ 2
” › ϵ ›2 ı
“ Ex0 „ppx0 q Et Eϵ„N p0,Iq,κ„χ2 pνq{ν λpxt , ν, tq›x0 ´ Dθ pµt x0 ` σt ? , σt q› (131)
› ›
κ 2
´ ¯2
ν`d
where λpxt , ν, tq “ ν`d 1
, which is equivalent to a scaled version of the simplified denoising loss Eq.
9. This concludes the proof. As an additional caveat, the score parameterization in Eq. 129 depends on
d1 “ σ12 }xt ´ µt x0 }22 , which can be approximated during inference as, d1 « σ12 }xt ´ µt Dθ pxt , σt q}22
t t

A.10 ODE R EFORMULATION AND C ONNECTIONS TO I NVERSE P ROBLEMS .

Recall the ODE dynamics in terms of the denoiser can be specified as,
„ ȷ
dxt µ9 t σ9 t µ9 t
“ xt ´ ´ ` pxt ´ µt Dθ pxt , σt qq (132)
dt µt σt µt
From Eq. 129, we have,
ˆ ˙
ν ` d1
xt ´ µt Dθ pxt , σt q “ ´σt2 ∇xt log ppxt , tq (133)
ν`d
where d1 “ σ12 }xt ´ µt x0 }22 . Substituting the above result in the ODE dynamics, we obtain the reformulated
t
ODE in Eq. 13.
dxt µ9 t ¯„ µ9 ȷ
2 ν ` d1 σ9 t
´
t
“ xt ` σt ´ ∇x log ppxt , tq (134)
dt µt ν`d µt σt


1
Since the term d1 is data-dependent, we can estimate it during inference as d1 « d11 “ σt2
}xt ´
µt Dθ pxt , σt q}22 . Thus, the above ODE can be reformulated as,
1 ¯
„ ȷ
dxt µ9 t 2 ν ` d1
´ µ9 t σ9 t
“ xt ` σt ´ ∇x log ppxt , tq (135)
dt µt ν`d µt σt

Tweedie’s Estimate and Inverse Problems. Given the perturbation kernel qpxt |x0 q “ td pµt x0 , σt2 Id , νq,
we have,
ϵ
xt “ µt x0 ` σt ? , ϵ „ N p0, Id q, κ „ χ2 pνq{ν (136)
κ
It follows that,
” ϵ ı
xt “ µt Erx0 |xt s ` σt E ? |xt (137)
κ
˜ ¸
1 ” ϵ ı
Erx0 |xt s “ xt ´ σt E ? |xt (138)
µt κ
˜ ¸
1 ´ ν ` d1 ¯
1
« xt ` σt2 ∇x log ppxt , tq (139)
µt ν`d

which gives us an estimate of the predicted x0 at any time t. Moreover, the framework can also be extended
for solving inverse problems using diffusion models. More specifically, given a conditional signal y, the goal
is to simulate the ODE,
dxt µ9 t ´ ν ` d1 ¯„ µ9 σ9 t
ȷ
1 t
“ xt ` σt2 ´ ∇x log ppxt |yq (140)
dt µt ν`d µt σt
µ9 t ´ ν ` d1 ¯„ µ9 ȷ
σ9 t ” ı
1 t
“ xt ` σt2 ´ wt ∇x log ppy|xt q ` ∇x log ppxt q (141)
µt ν`d µt σt
where the above decomposition follows from ppxt |yq9ppy|xt qwt ppxt q and the weight wt represents the
guidance weight of the distribution ppy|xt q. The term log ppy|xt q can now be approximated using existing
posterior approximation methods in inverse problems (Chung et al., 2022; Song et al., 2022b; Mardani et al.,
2023b; Pandey et al., 2024b)

A.11 C ONNECTIONS TO ROBUST S TATISTICAL E STIMATORS

Here, we derive the expression for the gradient of the γ-power divergence between the forward and the
reverse posteriors (denoted by q and pθ , respectively for notational convenience) i.e., ∇θ Dγ pq } pθ q. By the
definition of the γ-power divergence,
1“ ‰
Dγ pq } pθ q “ Cγ pq, pθ q ´ Hγ pqq , γ P p´1, 0q Y p0, 8q (142)
γ
ˆż 1
˙ 1`γ ż ˆ ˙γ
1`γ ppxq
Hγ ppq “ ´ }p}1`γ “´ ppxq dx Cγ pq, pq “ ´ qpxq dx (143)
}p}1`γ
Therefore,
1
∇θ Dγ pq } pθ q “ ∇θ Cγ pq, pθ q (144)
γ


Consequently, we next derive an expression for ∇θ Cγ pq, pθ q.

ˆ ż ˙γ
pθ pxq
∇θ Cγ pq, pθ q “ ´∇θqpxq dx (145)
}pθ }1`γ
ż ˆ ˙γ´1 ˜ ¸
pθ pxq pθ pxq
“ ´γ qpxq ∇θ dx (146)
}pθ }1`γ }pθ }1`γ
˙γ´1
}pθ }1`γ ∇θ pθ pxq ´ pθ pxq∇θ }pθ }1`γ
ż ˆ
pθ pxq
“ ´γ qpxq 2 dx (147)
}pθ }1`γ }pθ }1`γ

From the definition of }pθ }1`γ ,

ˆż 1
˙ 1`γ
}pθ }1`γ “ pθ pxq1`γ dx (148)
ˆż 1
˙ 1`γ
1`γ
∇θ }pθ }1`γ “ ∇θ pθ pxq dx (149)
ˆż γ
˙´ 1`γ ż
1 1`γ
` ˘
“ pθ pxq dx ∇θ pθ pxq1`γ dx (150)
1`γ
ˆż γ
˙´ 1`γ ż
1
“ pθ pxq1`γ dx p1 ` γq pθ pxqγ ∇θ pθ pxqdx (151)
1`γ
ˆż γ
˙´ 1`γ ż
1`γ
∇θ }pθ }1`γ “ pθ pxq dx pθ pxqγ ∇θ pθ pxqdx (152)
ż
}pθ }1`γ
“ˆ ˙ pθ pxqγ ∇θ pθ pxqdx (153)
ş
pθ pxq1`γ dx
ż
}pθ }1`γ
“ˆ ˙ pθ pxq1`γ ∇θ log pθ pxqdx (154)
ş
pθ pxq1`γ dx

pθ pxq1`γ
ż
“ }pθ }1`γ ˆ ˙ ∇θ log pθ pxqdx (155)
ş
pθ pxq 1`γ dx
l jh n
“p̃θ pxq

“ }pθ }1`γ Ep̃θ pxq r∇θ log pθ pxqs (156)

pθ pxq1`γ
where we denote p̃θ pxq “ ˆ ˙ for notational convenience. Substituting the above result in Eq.
ş
pθ pxq1`γ dx

147, we have the following simplification,


Algorithm 3: Training (t-Flow)
1: repeat
2:   x_1 ~ p(x_1)
3:   t ~ Uniform({1, ..., T})
4:   µ_t = t, σ_t = 1 − t
5:   x_t = µ_t x_1 + σ_t n,  n ~ t_d(0, I_d, ν)
6:   Take gradient descent step on
7:     ∇_θ ||n − ε_θ(x_t, σ_t)||²
8: until converged

Algorithm 4: Sampling (t-Flow)
1: sample x_0 ~ t_d(0, I_d, ν)
2: for i ∈ {0, ..., N − 1} do
3:   d_i ← (x_i − ε_θ(x_i; σ_{t_i})) / t_i
4:   x_{i+1} ← x_i + (t_{i+1} − t_i) d_i
5:   if t_{i+1} ≠ 0 then
6:     d_i' ← (x_{i+1} − ε_θ(x_{i+1}; σ_{t_{i+1}})) / t_{i+1}
7:     x_{i+1} ← x_i + (t_{i+1} − t_i)(1/2 d_i + 1/2 d_i')
8:   end if
9: end for
10: return x_N

Figure 5: Training and Sampling algorithms for t-Flow. Our proposed method requires minimal code updates (indicated
with blue) over traditional Gaussian flow models and converges to the latter as ν → ∞.

˙γ´1
}pθ }1`γ ∇θ pθ pxq ´ pθ pxq∇θ }pθ }1`γ
ż ˆ
pθ pxq
∇θ Cγ pq, pθ q “ ´γ qpxq 2 dx (157)
}pθ }1`γ }pθ }1`γ
˙γ´1
}pθ }1`γ ∇θ pθ pxq ´ pθ pxq }pθ }1`γ Ep̃θ pxq r∇θ log pθ pxqs
ż ˆ
pθ pxq
“ ´γ qpxq 2 dx
}pθ }1`γ }pθ }1`γ
(158)
ˆ ˙γ´1
∇θ pθ pxq ´ pθ pxqEp̃θ pxq r∇θ log pθ pxqs
ż
pθ pxq
“ ´γ qpxq dx (159)
}pθ }1`γ }pθ }1`γ
ż ˆ ˙γ´1
pθ pxq pθ pxq∇θ log pθ pxq ´ pθ pxqEp̃θ pxq r∇θ log pθ pxqs
“ ´γ qpxq dx (160)
}pθ }1`γ }pθ }1`γ
(161)
ż ˆ ˙γ ´
pθ pxq ¯
∇θ Cγ pq, pθ q “ ´γ qpxq ∇θ log pθ pxq ´ Ep̃θ pxq r∇θ log pθ pxqs dx (162)
}pθ }1`γ
Plugging this result in Eq. 144, we have the following result,
ż ˆ ˙γ ´
1 pθ pxq ¯
∇θ Dγ pq } pθ q “ ∇θ Cγ pq, pθ q “ ´ qpxq ∇θ log pθ pxq ´ Ep̃θ pxq r∇θ log pθ pxqs dx
γ }pθ }1`γ
(163)
This completes the proof. Intuitively, the second term inside the integral in Eq. 163 ensures unbiasedness of
the gradients. Therefore, the scalar coefficient γ controls the weighting on the likelihood gradient and can be
set accordingly to ignore or model outliers when modeling the data distribution.

B EXTENSION TO FLOWS
Here, we discuss an extension of our framework to flow matching models (Albergo et al., 2023a; Lipman
et al., 2023) with a Student-t base distribution. More specifically, we define a straight-line flow of the form,
x_t = t x_1 + (1 - t)\frac{\epsilon}{\sqrt{\kappa}}, \quad \epsilon \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu)/\nu   (164)
where x_1 \sim p(x_1). Intuitively, at a given time t, the flow defined in Eqn. 164 linearly interpolates between
data and Student-t noise. Following Albergo et al. (2023a), the conditional vector field which induces this
interpolant can be specified as (proof in App. A.8)
\frac{dx_t}{dt} = b(x_t, t) = \frac{x_t - \mathbb{E}\big[\tfrac{\epsilon}{\sqrt{\kappa}}\,\big|\,x_t\big]}{t}.   (165)
We estimate \mathbb{E}[\epsilon/\sqrt{\kappa}\,|\,x_t] by minimizing the objective
L(\theta) = \mathbb{E}_{x_0 \sim p(x_0)}\mathbb{E}_{t \sim U[0,1]}\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}\mathbb{E}_{\kappa \sim \chi^2(\nu)/\nu}\Big[\Big\|\epsilon_\theta\Big(t x_0 + (1-t)\frac{\epsilon}{\sqrt{\kappa}}, t\Big) - \frac{\epsilon}{\sqrt{\kappa}}\Big\|_2^2\Big].   (166)
We refer to this flow setup as t-Flow. To generate samples from our model, we simulate the ODE in Eq. 165
using Heun's solver. Figure 5 illustrates the ease of transitioning from a Gaussian flow to t-Flow. Similar to
Gaussian diffusion, transitioning to t-Flow requires very few lines of code change, making our method readily
compatible with existing implementations of flow models.

C E XPERIMENTAL S ETUP
C.1 U NCONDITIONAL M ODELING

C.1.1 HRRR DATASET


We adopt the High-Resolution Rapid Refresh (HRRR) (Dowell et al., 2022) dataset, which is an operational
archive of the US km-scale forecasting model. Among multiple dynamical variables in the dataset that
exhibit heavy-tailed behavior, based on their dynamic range, we only consider the Vertically Integrated Liquid
(VIL) and Vertical Wind Velocity at level 20 (denoted as w20) channels. How to cope with the especially
non-Gaussian nature of such physical variables on km-scales, represents an entire subfield of climate model
subgrid-scale parameterization (e.g., Guo et al. (2015)). We only use data for the years 2019-2020 for
training (17.4k samples) and the data for 2021 (8.7k samples) for testing; data before 2019 are avoided
owing to non-stationarities associated with periodic version changes of the HRRR. Lastly, while the HRRR
dataset spans the entire US, for simplicity, we work with regional crops of size 128 ˆ 128 (corresponding
to 384 ˆ 384 km over the Central US). Unless specified otherwise, we perform z-score normalization using
precomputed statistics as a preprocessing step. We do not perform any additional data augmentation.

C.1.2 BASELINES
Baseline 1: EDM. For standard Gaussian diffusion models, we use the recently proposed EDM (Karras et al.,
2022) model, which shows strong empirical performance on various image synthesis benchmarks and has
also been employed in recent work in weather forecasting and downscaling (Pathak et al., 2024; Mardani
et al., 2024). To summarize, EDM employs the following denoising loss during training,
“ ‰
Lpθq 9 Ex0 „ppx0 q Eσ En„N p0,σ2 Id q λpσq}Dθ px0 ` n, σq ´ x0 }22 (167)
2
where the noise levels σ are usually sampled from a LogNormal distribution, ppσq “ LogNormalpπmean , πstd q
Baseline 2: EDM + Inverse CDF Normalization (INC). It is commonplace to perform z-score normalization
as a data pre-processing step during training. However, since heavy-tailed channels usually exhibit a high
dynamic range, using z-score normalization for such channels cannot fully compensate for this range,
especially when working with diverse channels in downstream tasks in weather modeling. An alternative
could be to use a stronger normalization scheme like Inverse CDF Normalization, which essentially
involves the following key steps:


[Figure 6 panels: rows are Channel: VIL and Channel: w20; columns are Original Samples, Samples (after INC), and Denormalized Samples.]

Figure 6: Inverse CDF Normalization. Using Inverse CDF Normalization (INC) can help reduce channel dynamic range
during training while providing accurate denormalization. (Top Panel) INC applied to the Vertically Integrated Liquid
(VIL) channel in the HRRR dataset. (Bottom Panel) INC applied to the Vertical Wind Velocity (w20) channel in the
HRRR dataset.

1. Compute channel-wise 1-d histograms of the training data.


2. Compute channel-wise empirical CDFs from the 1-d histograms computed in Step 1.
3. Use the empirical CDFs from Step 2 to compute the CDF at each spatial location.
4. For each spatial location with a CDF value p, replace its value by the value obtained by applying the
Inverse CDF operation under the standard Normal distribution.

Fig. 6 illustrates the effect of performing normalization under this scheme. As can be observed, using such
a normalization scheme can greatly reduce the dynamic range of a given channel while offering reliable
denormalization. Moreover, since our normalization scheme only affects data preprocessing, we leave the
standard EDM model parameters unchanged for this baseline.
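Below is a minimal NumPy/SciPy sketch of these four steps for a single channel; the bin count and function names are our own choices rather than the exact implementation:

```python
import numpy as np
from scipy.stats import norm

def fit_inverse_cdf_normalizer(train_values, n_bins=10000):
    """Steps 1-2: channel-wise 1-d histogram and the empirical CDF derived from it."""
    counts, edges = np.histogram(np.ravel(train_values), bins=n_bins)
    cdf = np.cumsum(counts) / counts.sum()
    return edges, cdf

def inverse_cdf_normalize(x, edges, cdf, eps=1e-6):
    """Steps 3-4: map each value to its empirical CDF, then through the standard-normal inverse CDF."""
    p = np.interp(x, edges[1:], cdf)
    return norm.ppf(np.clip(p, eps, 1.0 - eps))
```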
Baseline 3: EDM + Per Channel-Preconditioning (PCP). Another alternative to account for extreme values
in the data (or high dynamic range) could be to instead add more heavy-tailed noise during training. This
can be controlled by modulating the πmean and πstd parameters based on the dynamic range of the channel
under consideration. Recall that these parameters control the domain of noise levels σ used during EDM
model training. In this work, we use a simple heuristic to modulate these parameters based on the normalized
channel dynamic range (denoted as d). More specifically, we set,

πmean “ ´1.2 ` α ˚ RBFpd, 1.0, βq (168)


where RBF denotes the Radial Basis Function kernel with radius=1.0, parameter β and a magnitude scaling
factor α. We keep πstd “ 1.2 fixed for all channels. Intuitively, this implies that a higher normalized dynamic
range (near 1.0) corresponds to sampling the noise levels σ from a more heavy-tailed distribution during
training. This is natural since a signal with a larger magnitude requires more noise to be converted to pure
noise during the forward diffusion process. In this work, we set α = 3.0, β = 2.0, which yields
π_mean^{vil} = 1.8 and π_mean^{w20} = 0.453 for the VIL and w20 channels, respectively. It is worth noting that, unlike the
previous baseline, we use z-score normalization as a preprocessing step for this baseline. Lastly, we keep
other EDM parameters during training and sampling fixed.
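For concreteness, one plausible reading of Eq. 168 assumes a squared-exponential kernel RBF(d, c, β) = exp(−β(d − c)²); the exact kernel parameterization is our assumption and is not specified above:

```python
import math

def pcp_pi_mean(d, alpha=3.0, beta=2.0, center=1.0, base=-1.2):
    """Channel-dependent pi_mean heuristic of Eq. (168); pi_std stays fixed at 1.2."""
    return base + alpha * math.exp(-beta * (d - center) ** 2)
```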
Baseline 4. Gaussian Flow. Since we also extend our framework to flow matching models (Albergo et al.,
2023a; Lipman et al., 2023), we also compare with a linear one-sided interpolant with a Gaussian base
distribution. More specifically,
xt “ tx1 ` p1 ´ tqϵ, ϵ „ N p0, Id q (169)
Similar to t-Flow (Section B), we train the Gaussian flow with the following objective,
”› ›2 ı
Lpθq “ Ex0 „ppx0 q Et„U r0,1s Eϵ„N p0,Id q ›ϵθ ptx0 ` p1 ´ tqϵ, tq ´ ϵ› . (170)
› ›
2

C.1.3 E VALUATION
Here, we describe our scoring protocol used in Tables 2 and 3 in more detail.
Kurtosis Ratio (KR). Intuitively, sample kurtosis characterizes the heavy-tailed behavior of a distribution and
represents the fourth-order moment. Higher kurtosis represents greater deviations from the central tendency,
such as from outliers in the data. In this work, given samples from the underlying train/test set, we generate
20k samples from our model. We then flatten all the samples and compute empirical kurtosis for both the
underlying samples from the train/test set (denoted as kdata ) and our model (denoted as ksim ). The Kurtosis
ratio is then computed as,
KR = \Big|1 - \frac{k_{sim}}{k_{data}}\Big|   (171)
Lower values of this ratio imply a better estimation of the underlying sample kurtosis.
Skewness Ratio (SR). Intuitively, sample skewness represents the asymmetry of a tailed distribution and
represents the third-order moment. In this work, given samples from the underlying train/test set, we generate
20k samples from our model. We then flatten all the samples and compute empirical skewness for both the
underlying samples from the train/test set (denoted as sdata ) and our model (denoted as ssim ). The Skewness
ratio is then computed as,
SR = \Big|1 - \frac{s_{sim}}{s_{data}}\Big|   (172)
Lower values of this ratio imply a better estimation of the underlying sample skewness.
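Both ratios are straightforward to compute on flattened samples; a short sketch using SciPy's sample moments is given below (whether raw or excess kurtosis is used is our assumption):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def tail_moment_ratios(generated, reference):
    """Kurtosis Ratio (Eq. 171) and Skewness Ratio (Eq. 172) between flattened sample sets."""
    g, r = np.ravel(generated), np.ravel(reference)
    kr = abs(1.0 - kurtosis(g) / kurtosis(r))
    sr = abs(1.0 - skew(g) / skew(r))
    return kr, sr
```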
Kolmogorov-Smirnov 2-Sample Test (KS). The KS (Massey, 1951) statistic measures the maximum
difference between the CDFs of two distributions. For heavy-tailed distributions, evaluating the KS statistic
at the tails could provide a useful measure of the efficacy of our model in estimating tails reliably. To evaluate
the KS statistic at the tails, similar to prior metrics, we generate 20k samples from our model. We then flatten
all samples in the generated and train/test sets, followed by retaining samples lying above the 99.9th percentile
(quantifying right tails/extreme region) or below the 0.1th percentile (quantifying the left tail/extreme region).
Lastly, we compute the KS statistic between the retained samples from the generated and the train/test sets
individually for each tail and average the KS statistic values for both tails to obtain an average KS score. The
final score estimates how well the model might capture both tails. As an exception, for the VIL channel,
we report KS scores only for the right tail due to the absence of a left tail in the underlying samples for this


Parameters EDM (+INC, +PCP) t-EDM Flow/t-Flow


ˆ ˙ ˆ ˙
2
cskip σdata { σ 2 ` σdata
2 2
σdata { ν
ν´2 σ
2
` 2
σdata 0
Preconditioner 2
Lb 2
b
ν 2
Lb ν 2
cout σ ¨ σdata σ 2 ` σdata ν´2 σ ¨ σdata ν´2 σ
2 ` σdata 1
L b Lb
2 ν 2
cin 1 σ 2 ` σdata 1 2
ν´2 σ ` σdata 1
1 1
cnoise 4 log σ 4 log σ σ
2 2
σ log σ „ N pπmean , πstd q log σ „ N pπmean , πstd q σ “ 1 ´ t, t „ U p0, 1q
µt 1 1 t
Training
λpσq 1{c2out pσq 1{c2out pσ, νq 1
t-Flow - Eq. 166
Loss Eq. 167 Eq. 12
Gaussian Flow - Eq. 170
Solver Heun’s (2nd order) Heun’s (2nd order) Heun’s (2nd order)
t-Flow: Eq. 165, x0 „ td p0, Id , νq
ODE Eq. 11, xT „ N p0, Id q Eq. 11, xT „ td p0, Id , νq
Sampling Flow: Eq. 165, x0 „ N p0, Id q
1 1 1 ‰˘ 1 1 1 1 1 1 1 1 1
` ρ i
“ ρ ρ ` ρ i
“ ρ ρ ‰˘ ρ ` ρ “ ρ ρ ‰˘ ρ
Discretization σmax ` N ´1 σmin ´ σmax ρ σmax ` N ´1 σmin ´ σmax σmax ` N i´1 σmin ´ σmax
Scaling: µt 1 1 t
Schedule: σt t t 1-t
σdata 1.0 1.0 N/A
Flow: 8
Hyperparameters VIL: t3, 5, 7u t-Flow (Ó)
ν 8
w20: t3, 5, 7u VIL=t3, 5, 7u
w20=t5, 7, 9u
EDM: -1.2, 1.2
EDM (+INC) : -1.2, 1.2
πmean , πstd EDM (+PCP) (Ó) -1.2, 1.2 N/A
VIL: 1.8, 1.2
w20: 0.453, 1.2
σmax , σmin 80, 0.002 80, 0.002 1.0, 0.01
NFE 18 18 25
ρ 7 7 7

Table 5: Comparison between design choices and specific hyperparameters between EDM (Karras et al., 2022) (+ related
baselines) and t-EDM (Ours, Section 3.5) for unconditional modeling (Section 4.1). INC: Inverse CDF Normalization
baselines, PCP: Per-Channel Preconditioning baseline, VIL: Vertically Integrated Liquid channel in the HRRR dataset,
w20: Vertical Wind Velocity at level 20 channel in the HRRR dataset, NFE: Number of Function Evaluations

channel (see Fig. 6 (first column) for 1-d intensity histograms for this channel). Lower values of the KS
statistic imply better density assignment at the tails by the model.
Histogram Comparisons. As a qualitative metric, comparing 1-d intensity histograms between the generated
and the original samples from the train/test set can serve as a reliable proxy to assess the tail estimation
capabilities of all models.

C.1.4 D ENOISER A RCHITECTURE


We use the DDPM++ architecture from (Karras et al., 2022; Song et al., 2020). We set the base channel
multiplier to 32 and the per-resolution channel multiplier to [1,2,2,4,4] with self-attention at resolution 16.
The rest of the hyperparameters remain unchanged from Karras et al. (2022), which results in a model size of
around 12M parameters.

C.1.5 T RAINING
We adopt the same training hyperparameters from Karras et al. (2022) for training all models. Model training
is distributed across 4 DGX nodes, each with 8 A100 GPUs, with a total batch size of 512. We train all


Parameter Model Levels Height Levels (m)


Zonal Wind (u) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 10
Meridonal Wind (v) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 10
Geopotential Height (z) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 None
Humidity (q) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 None
Pressure (p) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 None
Temperature (t) 1,2,3,4,5,6,7,8,9,10,11,13,15,20 2
Radar Reflectivity (refc) N/A Integrated
Table 6: Parameters in the HRRR dataset used for conditional modeling tasks.

models for a maximum budget of 60Mimg and select the best-performing model in terms of qualitative 1-D
histogram comparisons.

C.1.6 S AMPLING
For the EDM and related baselines (INC, PCP), we use the ODE solver presented in Karras et al. (2022). For
the t-EDM models, as presented in Section 3.5, our sampler is the same as EDM with the only difference
in the sampling of initial latents from a Student-t distribution instead (See Fig. 2 (Right)). For Flow and
t-Flow, we numerically simulate the ODE in Eq. 165 using the 2nd order Heun’s solver with the timestep
discretization proposed in Karras et al. (2022). For evaluation, we generate 20k samples from each model.
We summarize our experimental setup in more detail for unconditional modeling in Table 5.

C.1.7 E XTENDED R ESULTS ON U NCONDITIONAL M ODELING


Sample Visualization. We visualize samples generated from the t-EDM and t-Flow models for the VIL and
w20 channels in Figs. 7-10
Visualization of 1-d histograms. Similar to Fig. 3, we present additional results on histogram comparisons
between different baselines and our proposed methods for the VIL and w20 channels in Figs. 11 and 12.

C.2 C ONDITIONAL M ODELING

C.2.1 HRRR DATASET FOR C ONDITIONAL M ODELING


Similar to unconditional modeling (See App. C.1.1), we use the HRRR dataset for conditional modeling at
the 128 x 128 resolution. More specifically, for a lead time of 1hr, we sample (input, output) pairs at time
τ and τ ` 1, respectively. For the input, at time τ , we use a state vector consisting of a combination of 86
atmospheric channels (including the channel to be predicted at time τ ), which are summarized in Table 6. For
the output, at time τ ` 1, we use either the Vertically Integrated Liquid (VIL) or Vertical Wind Velocity at
level 20 (w20) channels, depending on the prediction task. Unless specified otherwise, we perform z-score
normalization using precomputed statistics as a preprocessing step without any additional data augmentation.

C.2.2 BASELINES
We adopt the standard EDM (Karras et al., 2022) for conditional modeling as our baseline.

C.2.3 D ENOISER A RCHITECTURE


We use the DDPM++ architecture from (Karras et al., 2022; Song et al., 2020). We set the base channel
multiplier to 32 and the per-resolution channel multiplier to [1,2,2,4,4] with self-attention at resolution


16. Additionally, our noisy state x is channel-wise concatenated with an 86-channel conditioning signal,
increasing the total number of input channels in the denoiser to 87. The number of output channels remains 1
since we are predicting only a single VIL/w20 channel. However, the increase in the number of parameters is
minimal since only the first convolutional layer in the denoiser is affected. Therefore, our denoiser is around
12M parameters. The rest of the hyperparameters remain unchanged from Karras et al. (2022).

C.2.4 T RAINING
We adopt the same training hyperparameters from Karras et al. (2022) for training all conditional models.
Model training is distributed across 4 DGX nodes, each with 8 A100 GPUs, with a total batch size of 512.
We train all models for a maximum budget of 60Mimg.

C.2.5 S AMPLING
For both EDM and t-EDM models, we use the ODE solver presented in Karras et al. (2022). For the t-EDM
models, as presented in Section 3.5, our sampler is the same as EDM with the only difference in the sampling
of initial latents from a Student-t distribution instead (See Fig. 2 (Right)). For a given input conditioning state,
we generate an ensemble of predictions of size 16 by randomly initializing our ODE solver with different
random seeds. All other sampling parameters remain unchanged from our unconditional modeling setup (see
App. C.1.6).

C.2.6 E VALUATION
Root Mean Square Error (RMSE). is a standard evaluation metric used to measure the difference between
the predicted values and the true values Chai & Draxler (2014). In the context of our problem, let x be the
true target and x̂ be the predicted value. The RMSE is defined as:
RMSE = \sqrt{\mathbb{E}\big[\|x - \hat{x}\|^2\big]}.

This metric captures the average magnitude of the residuals, i.e., the difference between the predicted and true
values. A lower RMSE indicates better model performance, as it suggests the predicted values are closer to
the true values on average. RMSE is sensitive to large errors, making it an ideal choice for evaluating models
where minimizing large deviations is critical.
Continuous Ranked Probability Score (CRPS). is a measure used to evaluate probabilistic predictions
Wilks (2011). It compares the entire predicted distribution F px̂q with the observed data point x. For a
probabilistic forecast with cumulative distribution function (CDF) F , and the true value x, the CRPS can be
formulated as follows:
CRPS(F, x) = \int_{-\infty}^{\infty}\big(F(y) - \mathbb{I}(y \ge x)\big)^2\, dy,

where Ip¨q is the indicator function. Unlike RMSE, CRPS provides a more comprehensive evaluation of
both the location and spread of the predicted distribution. A lower CRPS indicates a better match between
the forecast distribution and the observed data. It is especially useful for probabilistic models that output a
distribution rather than a single-point prediction.
Spread-Skill Ratio (SSR). is used to assess over/under-dispersion in probabilistic forecasts. Spread measures
the uncertainty in the ensemble forecasts and can be represented by computing the standard deviation of
the ensemble members. Skill represents the accuracy of the mean of the ensemble forecasts and can be
represented by computing the RMSE between the ensemble mean and the observations.
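As a sketch of how these ensemble scores can be computed (using the standard sample-based CRPS estimator over ensemble members; the exact estimator and windowing used for the reported numbers may differ):

```python
import numpy as np

def ensemble_crps(ens, obs):
    """Sample CRPS for an ensemble `ens` of shape (M, ...) against observations `obs` of shape (...)."""
    term1 = np.abs(ens - obs).mean(axis=0)
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean(axis=(0, 1))
    return (term1 - term2).mean()

def spread_skill_ratio(ens, obs):
    """SSR: ensemble spread (std over members) divided by skill (RMSE of the ensemble mean)."""
    spread = ens.std(axis=0).mean()
    skill = np.sqrt(((ens.mean(axis=0) - obs) ** 2).mean())
    return spread / skill
```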


Parameters EDM t-EDM


ˆ ˙ ˆ ˙
2
cskip σdata { σ 2 ` σdata
2 2
σdata { ν´2 ν
σ 2 ` σdata
2

Preconditioner 2
La 2
b
ν 2
Lb ν 2
cout σ ¨ σdata σ 2 ` σdata ν´2
σ ¨ σ data ν´2
σ 2 ` σdata
La 2
Lb ν 2
cin 1 σ 2 ` σdata 1 ν´2
σ 2 ` σdata
1 1
cnoise 4
log σ 4
log σ
2 2
σ log σ „ N pπmean , πstd q log σ „ N pπmean , πstd q
sptq 1 1
Training
λpσq 1{c2out pσq 1{c2out pσ, νq
Loss Eq. 2 in Karras et al. (2022) Eq. 12
Solver Heun’s (2nd order) Heun’s (2nd order)
ODE Eq. 11, xT „ N p0, Id q Eq. 11, xT „ td p0, Id , νq
Sampling ` ρ1 i
“ ρ1 1 ‰˘ 1
ρ
` ρ1 i
“ ρ1 1 ‰˘ 1
ρ
Discretization σmax ` N ´1 σmin ´ σmax ρ
σmax ` N ´1 σmin ´ σmax ρ

Scaling: sptq 1 1
Schedule: σptq t t
σdata 1.0 1.0
ν 8 x1 “ 20, x2 P t4, 7, 10u
πmean , πstd -1.2, 1.2 -1.2, 1.2
Hyperparameters
σmax , σmin 80, 0.002 80, 0.002
NFE 18 18
ρ 7 7
Table 7: Comparison between design choices and specific hyperparameters between EDM (Karras et al., 2022) and
t-EDM (Ours, Section 3.5) for the Toy dataset analysis in Fig. 1. NFE: Number of Function Evaluations

Scoring Criterion. Since CRPS and SSR metrics are based on predicting ensemble forecasts for a given
input state, we predict an ensemble of size 16 for 4000 samples from the VIL/w20 test set. We then enumerate
window sizes of 16 x 16 across the spatial resolution of the generated sample (128 x 128). Since the VIL
channel is quite sparse, we filter out windows with a maximum value of less than a threshold (1.0 for VIL)
and compute the CRPS, SSR, and RMSE metrics for all remaining windows. As an additional caveat, we
note that while it is common to roll out trajectories for weather forecasting, in this work, we only predict the
target at the immediate next time step.

C.2.7 E XTENDED R ESULTS ON C ONDITIONAL M ODELING


Sample Visualization. We visualize ensemble predictions generated from our conditional models for the VIL and
w20 channels in Figs. 8 and 9

C.3 T OY E XPERIMENTS

Dataset. For the toy illustration in Fig. 1, we work with the Neals Funnel dataset (Neal, 2003), which is
commonly used in the MCMC literature (Brooks et al., 2011) due to its challenging geometry. The underlying
generative process for Neal’s funnel can be specified as follows:
p(x_1, x_2) = \mathcal{N}(x_1; 0, 3)\,\mathcal{N}(x_2; 0, \exp(x_1/2)).   (173)
For training, we randomly generate 1M samples from the generative process in Eq. 173 and perform z-score
normalization as a pre-processing step.
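A small sketch of the sampling and normalization steps is given below, reading the second argument of N(·; ·, ·) in Eq. 173 as a standard deviation (our assumption, since the convention is not stated here):

```python
import numpy as np

def sample_neals_funnel(n, rng=None):
    """Generate and z-score-normalize samples from the generative process in Eq. (173)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x1 = 3.0 * rng.standard_normal(n)
    x2 = np.exp(x1 / 2.0) * rng.standard_normal(n)
    samples = np.stack([x1, x2], axis=1)
    return (samples - samples.mean(axis=0)) / samples.std(axis=0)
```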


Baselines and Models. For the standard Gaussian diffusion model baseline (2nd column in Fig. 1), we use
EDM with standard hyperparameters as presented in Karras et al. (2022). Consequently, for heavy-tailed
diffusion models (columns 3-5 in Fig. 1), we use the t-EDM instantiation of our framework as presented in
Section 3.5. Since the hyperparameter ν is key in our framework, we tune ν for each individual dimension
in our toy experiments. We fix ν to 20 for the x_1 dimension and vary it over ν ∈ {4, 7, 10} to illustrate
controllable tail estimation along the x_2 dimension.
Denoiser Architecture. For modeling the underlying denoiser Dθ px, σq, we use a simple MLP for all toy
models. At the input, we concatenate the 2-dimensional noisy state vector x with the noise level σ. We use
two hidden layers of size 64, followed by a linear output layer. This results in around 8.5k parameters for the
denoiser. We share the same denoiser architecture across all toy models.
Training. We optimize the training objective in Eq. 12 for both t-EDM and EDM (See Fig. 2 (Left)). Our
training hyperparameters are the same as proposed in Karras et al. (2022). We train all toy models for a fixed
duration of 30M samples and choose the last checkpoint for evaluation.
Sampling. For the EDM baseline, we use the ODE solver presented in Karras et al. (2022). For the t-EDM
models, as presented in Section 3.5, our sampler is the same as EDM with the only difference in the sampling
of initial latents from a Student-t distribution instead (See Fig. 2 (Right)). For visualization of the generated
samples in Fig. 1, we generate 1M samples for each model.
Overall, our experimental setup for the toy dataset analysis in Fig. 1 is summarized in Table 7.

D O PTIMAL N OISE S CHEDULE D ESIGN

In this section we discuss a strategy for choosing the parameter σmax (denoted by σ in this section for
notational brevity) in a more principled manner as compared to EDM (Karras et al., 2022). More specifically,
our approach involves directly estimating σ from the empirically observed samples which circumvents the
need to rely on ad-hoc choices of this parameter which can affect downstream sampler performance.
The main idea behind our approach is minimizing the statistical mutual information between datapoints from
the underlying data distribution, x0 „ pdata , and their noisy counterparts xσ „ ppxσ q. While a trivial (and
non-practical) way to achieve this objective could be to set a large enough σ i.e. σ Ñ 8, we instead minimize
the mutual information Ipx0 , xσ q while ensuring the magnitude of σ to be as small as possible. Formally, our
objective can be defined as,
\min_{\sigma^2} \sigma^2 \quad \text{subject to} \quad I(x_0, x_\sigma) = 0   (174)
As we will discuss later, minimizing this constrained objective provides a more principled way to obtain σ
from the underlying data statistics for a specific level of mutual information desired by the user. Next, we first
simplify the form of Ipx0 , xσ q, followed by a discussion on the estimation of σ in the context of EDM and
t-EDM. We also extend to the case of non-i.i.d noise.
Simplification of Ipx0 , xσ q. The mutual information Ipx0 , xσ q can be stated and simplified as follows,
Ipx0 , xσ q “ DKL pppx0 , xσ q } ppx0 qppxσ qq (175)
ż
ppx0 , xσ q
“ ppx0 , xσ q log dx0 dxσ (176)
ppx0 qppxσ q
ż
ppxσ |x0 q
“ ppx0 , xσ q log dx0 dxσ (177)
ppxσ q
ż
ppxσ |x0 q
“ ppxσ |x0 qppx0 q log dx0 dxσ (178)
ppxσ q


«ż ff
ppxσ |x0 q
“ Ex0 „ppx0 q ppxσ |x0 q log dxσ (179)
ppxσ q
” ı
“ Ex0 „ppx0 q DKL pppxσ |x0 q } ppxσ qq (180)

D.1 DESIGN FOR EDM

Given the simplification in Eqn. 180, the optimization problem in Eqn. 174 reduces to the following,
\min_{\sigma^2} \sigma^2 \quad \text{subject to} \quad \mathbb{E}_{x_0 \sim p(x_0)}\big[D_{KL}(p(x_\sigma|x_0) \| p(x_\sigma))\big] = 0   (181)
Since at σ_max we expect the marginal distribution p(x_σ) to converge to the generative prior (i.e., completely
destroy the structure of the data), we approximate p(x_σ) ≈ \mathcal{N}(0, \sigma^2 I_d). With this simplification, the
Lagrangian for the optimization problem in Eqn. 174 can be specified as,
\sigma_*^2 = \arg\min_{\sigma^2}\Big[\sigma^2 + \lambda\, \mathbb{E}_{x_0 \sim p(x_0)}\big[D_{KL}\big(\mathcal{N}(x_\sigma; x_0, \sigma^2) \| \mathcal{N}(x_\sigma; 0, \sigma^2)\big)\big]\Big]   (182)
Setting the gradient w.r.t. \sigma^2 to 0,
1 - \frac{\lambda}{\sigma^4}\mathbb{E}_{x_0}[x_0^\top x_0] = 0   (183)
which implies,
\sigma^2 = \sqrt{\lambda\, \mathbb{E}_{x_0}[x_0^\top x_0]}   (184)
For an empirical dataset,
\sigma^2 = \sqrt{\frac{\lambda}{N}\sum_{i=1}^{N}\|x_i\|_2^2}   (185)
This allows us to choose a σ_max from the underlying data statistics during training or sampling. It is worth
noting that the choice of the multiplier λ impacts the magnitude of σ_max. However, this parameter can be
chosen in a principled manner. At \sigma_{max}^2 = \sigma_*^2, the estimate of the mutual information is given by:
I_*(x_0, x_\sigma) = \frac{1}{\sigma_*^2}\mathbb{E}_{x_0}[x_0^\top x_0] = \sqrt{\frac{\mathbb{E}_{x_0}[x_0^\top x_0]}{\lambda}}   (186)
which implies,
\lambda = \frac{\mathbb{E}_{x_0}[x_0^\top x_0]}{I_*(x_0, x_\sigma)^2}   (187)
The above result provides a way to choose λ. Given the dataset statistics, \mathbb{E}_{x_0}[x_0^\top x_0], the user can specify an
acceptable level of mutual information I(x_0, x_\sigma) to compute the corresponding λ, which can then be used to
find the corresponding minimum σ_max required to achieve that level of mutual information. Next, we extend
this analysis to t-EDM.
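Before moving to t-EDM, here is a small numerical sketch of this recipe; the function name and the flattened (N, d) array layout are our choices:

```python
import numpy as np

def sigma_max_from_data(x, target_mi):
    """Eqs. (184)-(187): smallest sigma_max whose approximate mutual information
    I(x0, x_sigma) matches the user-specified level `target_mi`. `x` has shape (N, d)."""
    second_moment = (x ** 2).sum(axis=1).mean()   # E[x0^T x0]
    lam = second_moment / target_mi ** 2          # Eq. (187)
    sigma_sq = np.sqrt(lam * second_moment)       # Eq. (184)
    return float(np.sqrt(sigma_sq))
```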

D.2 E XTENSION TO T-EDM

In the case of t-EDM, we pose the optimization problem as follows,
\sigma_*^2 = \arg\min_{\sigma^2}\Big[\sigma^2 + \lambda\, \mathbb{E}_{x_0 \sim p(x_0)}\big[D_\gamma\big(t_d(x_0, \sigma^2 I_d, \nu) \| t_d(0, \sigma^2 I_d, \nu)\big)\big]\Big]   (188)


where D_\gamma(q \| p) is the γ-power divergence between two distributions q and p. From the definition of
the γ-Power Divergence in Eqn. 7, we have,
D_\gamma\big(t_d(x_0, \sigma^2 I_d, \nu) \| t_d(0, \sigma^2 I_d, \nu)\big) = \underbrace{-\frac{1}{\nu\gamma} C_{\nu,d}^{\frac{\gamma}{1+\gamma}}\Big(1 + \frac{d}{\nu-2}\Big)^{-\frac{1}{1+\gamma}}}_{= f(\nu, d)}\, (\sigma^2)^{-\frac{\gamma d}{2(1+\gamma)} - 1}\, \mathbb{E}_{x_0}[x_0^\top x_0]   (189)
where C_{\nu,d} = \frac{\Gamma(\frac{\nu+d}{2})}{\Gamma(\frac{\nu}{2})(\nu\pi)^{d/2}} and \gamma = -\frac{2}{\nu+d}. Solving the optimization problem yields the following optimal σ,
\sigma_*^2 = \Big[\lambda f(\nu, d)\Big(\frac{\nu-2}{\nu-2+d}\Big)\mathbb{E}[x_0^\top x_0]\Big]^{\frac{\nu-2+d}{2(\nu-2)+d}}   (190)
For an empirical dataset, we have the following simplification,
\sigma_*^2 = \Big[\lambda f(\nu, d)\Big(\frac{\nu-2}{\nu-2+d}\Big)\frac{1}{N}\sum_i \|x_i\|_2^2\Big]^{\frac{\nu-2+d}{2(\nu-2)+d}}   (191)

D.3 E XTENSION TO C ORRELATED G AUSSIAN N OISE

We now extend our formulation for optimal noise schedule design to the case of correlated noise in the
diffusion perturbation kernel. This is useful, especially for scientific applications where the data energy is
distributed quite non-uniformly across the (data) spectrum. Let R = \mathbb{E}_{x_0 \sim p(x_0)}[x_0 x_0^\top] \in \mathbb{R}^{d \times d} denote the
data correlation matrix. Let us also consider the perturbation kernel \mathcal{N}(0, \Sigma) for a positive-definite covariance
matrix \Sigma \in \mathbb{R}^{d \times d}. Following the steps in equation 182, the Lagrangian for noise covariance estimation can
be formulated as follows:
” ı
min tracepΣq ` λEx0 „ppx0 q DKL pN px0 , Σq } N p0, Σqq (192)
Σ
” ı
min tracepΣq ` λEx0 „ppx0 q xJ 0Σ
´1
x0 (193)
Σ
” ı
min tracepΣq ` λEx0 „ppx0 q tracepΣ´1 x0 xJ 0q (194)
Σ
min tracepΣq ` λ tracepΣ´1 Ex0 „ppx0 q rx0 xJ
0 sq (195)
Σ
min tracepΣq ` λ tracepΣ´1 Rq (196)
Σ
It can be shown that the optimal solution to this minimization problem is given by \Sigma_* = \sqrt{\lambda}\, R^{1/2}, where
R^{1/2} denotes the matrix square root of R. This implies that the noise energy must be distributed along the
singular vectors of the correlation matrix, with the noise energy along each singular vector proportional to the
square root of the corresponding singular value of R. We include the proof below.
Proof. We define the objective:
f pΣq “ tracepΣq ` λ tracepΣ´1 Rq. (197)
We compute the gradient of f pΣq with respect to Σ:
B B
∇Σ f pΣq “ tracepΣq ` λ tracepΣ´1 Rq. (198)
BΣ BΣ
The gradient of the first term is straightforward:
B
tracepΣq “ I. (199)


For the second term, using the matrix calculus identity, the gradient is:
B ` ˘
λ tracepΣ´1 Rq “ ´λΣ´1 RΣ´1 . (200)

Combining these results, the total gradient is:
∇Σ f pΣq “ I ´ λΣ´1 RΣ´1 . (201)
Setting the gradient to zero to find the critical point:
I ´ λΣ´1 RΣ´1 “ 0. (202)
1
Σ´1 RΣ´1 “ I. (203)
λ
which implies,
R “ λΣ2 . (204)
?
Σ “ λR1{2 . (205)

which completes the proof.
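A short sketch of computing the resulting estimator from an empirical data matrix follows (our own helper; an eigendecomposition is one of several ways to form the matrix square root):

```python
import numpy as np

def optimal_noise_covariance(x, lam):
    """Sigma* = sqrt(lam) * R^{1/2} (Eq. 205), with R the empirical correlation matrix of `x` (shape (N, d))."""
    r = (x[:, :, None] * x[:, None, :]).mean(axis=0)        # R = E[x0 x0^T]
    evals, evecs = np.linalg.eigh(r)                         # R is symmetric positive semi-definite
    r_sqrt = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    return np.sqrt(lam) * r_sqrt
```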

E L OG -L IKELIHOOD FOR T-EDM


Here, we present a method to estimate the log-likelihood for the generated samples using the ODE solver for
t-EDM (see Section 3.5). Our analysis is based on the likelihood computation in continuous-time diffusion
models as discussed in Song et al. (2020) (Appendix D.2). More specifically, given a probability-flow ODE,
dxt
“ fθ pxt , tq, (206)
dt
with a vector field fθ pxt , tq, Song et al. (2020) propose to estimate the log-likelihood of the model as follows,
żT
log ppx0 q “ log ppxT q ` ∇ ¨ fθ pxt , tqdt. (207)
0

The divergence of the vector field fθ pxt , tq is further estimated using the Skilling-Hutchinson trace estimator
(Skilling, 1989; Hutchinson, 1990) as follows,
∇ ¨ fθ pxt , tq “ Eppϵq rϵJ ∇fθ pxt , tqϵs (208)
where usually, ϵ „ N p0, Id q. For t-EDM ODE, this estimate can be further simplified as follows,
∇ ¨ fθ pxt , tq “ Eppϵq rϵJ ∇fθ pxt , tqϵs (209)
” ´ x ´ D px , tq ¯ ı
t θ t
“ Eppϵq ϵJ ∇ ϵ (210)
t
” ´ I ´ ∇ D px , tq ¯ ı
d xt θ t
“ Eppϵq ϵJ ϵ (211)
t
1 ” ı
“ Eppϵq ϵJ ϵ ´ ϵJ ∇xt Dθ pxt , tqϵ (212)
t
1 ” ı
“ Eppϵq ϵJ ϵ ´ ϵJ ∇xt Dθ pxt , tqϵ (213)
t
1” ` ˘ı
“ d ´ Eppϵq ϵJ ∇xt Dθ pxt , tqϵ (214)
t


where d is the data dimensionality. Thus, the log-likelihood can be specified as,
żT ”
1 ` ˘ı
log ppx0 q “ log ppxT q ` d ´ Eppϵq ϵJ ∇Dθ pxt , tqϵ dt. (215)
0 t

When ϵ „ N p0, σ 2 Id q, the above result can be re-formulated as,


żT ”
1 1 ` ˘ı
log ppx0 q “ log ppxT q ` d ´ 2 Eppϵq ϵJ ∇Dθ pxt , tqϵ dt. (216)
0 t σ
Moreover, using the first-order Taylor series expansion
D_\theta(x + \epsilon) = D_\theta(x) + \nabla D_\theta\, \epsilon + O(\|\epsilon\|^2)   (217)
For a sufficiently small σ, the higher-order terms in O(\|\epsilon\|^2) can be ignored since \mathbb{E}[\|\epsilon\|^2] = \sigma^2 d. Therefore,
D_\theta(x + \epsilon) \approx D_\theta(x) + \nabla D_\theta\, \epsilon   (218)
\epsilon^\top \nabla D_\theta\, \epsilon \approx \epsilon^\top\big[D_\theta(x + \epsilon) - D_\theta(x)\big]   (219)
\mathbb{E}_\epsilon\big[\epsilon^\top \nabla D_\theta\, \epsilon\big] \approx \mathbb{E}_\epsilon\big[\epsilon^\top\big(D_\theta(x + \epsilon) - D_\theta(x)\big)\big]   (220)
\mathbb{E}_\epsilon\big[\epsilon^\top \nabla D_\theta\, \epsilon\big] \approx \mathbb{E}_\epsilon\big[\epsilon^\top D_\theta(x + \epsilon)\big]   (221)
where the last step uses \mathbb{E}_\epsilon[\epsilon^\top D_\theta(x)] = 0 since \epsilon has zero mean.
Therefore, the log-likelihood expression can be further simplified as,
żT ”
1 1 ` ˘ı
log ppx0 q “ log ppxT q ` d ´ 2 Eppϵq ϵJ ∇Dθ pxt , tqϵ dt (222)
0 t σ
żT ”
1 1 ` ˘ı
log ppx0 q “ log ppxT q ` d ´ 2 Eppϵq ϵJ Dθ pxt ` ϵ, tq dt (223)
0 t σ
The advantage of this simplification is that we do not need to rely on expensive Jacobian-vector products in Eq.
215. However, since the denoiser now depends on ϵ, a Monte Carlo approximation of the expectation in the
above equation could be computationally expensive when many samples of ϵ are required.
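A schematic PyTorch sketch of this estimator is shown below; the denoiser signature and the single-probe default are our simplifications:

```python
import torch

def divergence_estimate(denoiser, x_t, t, sigma, n_probe=1):
    """Eq. (223): estimate the divergence of f_theta at (x_t, t) without Jacobian-vector
    products, using Gaussian probes eps ~ N(0, sigma^2 I) with a small sigma."""
    d = x_t.numel()
    acc = 0.0
    for _ in range(n_probe):
        eps = sigma * torch.randn_like(x_t)
        acc = acc + (eps * denoiser(x_t + eps, t)).sum() / sigma ** 2
    return (d - acc / n_probe) / t
```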

F D ISCUSSION AND L IMITATIONS


F.1 R ELATED W ORK

Connections with Denoising Score Matching. For the perturbation kernel qpxt |x0 q “ td pµt x0 , σt2 Id , νq,
the denoising score matching (Vincent, 2011; Song et al., 2020) loss, LDSM , can be formulated as,
” › ϵ ›2 ı
LDSM pθq 9 Ex0 „ppx0 q Et Eϵ„N p0,Id q Eκ„χ2 pνq{ν λpxt , ν, tq›Dθ pµt x0 ` σt ? , σt q ´ x0 › (224)
› ›
κ 2

with the scaling factor λpxt , ν, tq “ rpν ` dq{pν ` d1 qs2 where d1 “ p1{σt2 q}xt ´ µt x0 }22 (proof in App.
A.9). Therefore, the denoising score matching loss in our framework is equivalent to the simplified training
objective in Eq. 9 scaled by a data-dependent coefficient. However, in this work, we do not explore this loss
formulation and leave further exploration to future work.
Prior work in Heavy-Tailed Generative Modeling. The idea of exploring heavy-tailed priors for modeling
heavy-tailed distributions has been explored in several works in the past. More specifically, Jaini et al. (2020)
argue that a Lipschitz flow map cannot change the tails of the base distribution significantly. Consequently,
they use a heavy-tailed prior (modeled using a Student-t distribution) as the base distribution to learn Tail
Adaptive flows (TAFs), which can model the tails more accurately. In this work, we make similar observations


where standard diffusion models fail to accurately model the tails of real-world distributions. Consequently,
Laszkiewicz et al. (2022) assess the tailedness of each marginal dimension and set the prior accordingly.
On a similar note, we note that learning the tail parameter ν spatially and across channels can provide
greater modeling flexibility for downstream tasks and will be an important direction for future work on this
problem. More recently, Kim et al. (2024b) introduce heavy-tailed VAEs (Kingma & Welling, 2022; Rezende
& Mohamed, 2016) based on minimizing γ-power divergences (Eguchi, 2021). This is perhaps the closest
connection of our method with prior work since we rely on γ-power divergences to minimize the divergence
between heavy-tailed forward and reverse diffusion posteriors. However, VAEs often have scalability issues
and tend to produce blurry artifacts (Dosovitskiy & Brox, 2016; Pandey et al., 2022). On the other hand,
we work with diffusion models, which are known to scale well to large-scale modeling applications (Pathak
et al., 2024; Mardani et al., 2024; Esser et al., 2024; Podell et al., 2023). In another line of work, Yoon
et al. (2023) presents a framework for modeling heavy-tailed distributions using α-stable Levy processes
while Shariatian et al. (2024) simplify the framework proposed in Yoon et al. (2023) and instantiate it for
more practical diffusion models like DDPM. In contrast, our work deals with Student-t noise, which in
general (with the exceptions of Cauchy and the Gaussian distribution) is not α-stable and, therefore, a distinct
category of diffusion models for modeling heavy-tailed distributions. Moreover, prior works like Yoon et al.
(2023); Shariatian et al. (2024) rely on empirical evidence from light-tailed variants of small-scale datasets
like CIFAR-10 (Krizhevsky, 2009) and their efficacy on actual large-scale scientific datasets like weather
datasets remains to be seen.
Prior work in Diffusion Models. Our work is a direct extension of standard diffusion models in the literature
(Karras et al., 2022; Ho et al., 2020; Song et al., 2020). Moreover, since it only requires a few lines of
code change to transition from standard diffusion models to our framework, our work is directly compatible
with popular families of latent diffusion models (Pandey et al., 2022; Rombach et al., 2022) and augmented
diffusion models (Dockhorn et al., 2022; Pandey & Mandt, 2023; Singhal et al., 2023). Our work is also
related to prior work in diffusion models on a more theoretical level. More specifically, PFGM++ (Xu et al.,
2023b) is a unique type of generative flow model inspired by electrostatic theory. It treats d-dimensional data
as electrical charges in a D ` d-dimensional space, where the electric field lines define a bijection between a
heavy-tailed prior and the data distribution. D is a hyperparameter controlling the shape of the electric fields
that define the generative mapping. In essence, their method can be seen as utilizing a perturbation kernel:
p(x_t|x_0) \propto \big(\|x_t - x_0\|_2^2 + \sigma_t^2 D\big)^{-\frac{D+d}{2}} = t_d(x_0, \sigma_t^2 I_d, D)
When setting ν “ D, the perturbation kernel becomes equivalent to that of t-EDM, indicating the Student-t
perturbation kernel can be interpreted from another physical perspective — that of electrostatic fields and
charges. The authors demonstrated that using an intermediate value for D (or ν) leads to improved robustness
compared to diffusion models (where D Ñ 8), due to the heavy-tailed perturbation kernel.

F.2 L IMITATIONS

While our proposed framework works well for modeling heavy-tailed data, it is not without its limitations.
Firstly, while the parameter ν offers controllability for tail estimation using diffusion models, it also increases
the tuning budget by introducing an extra hyperparameter. Moreover, for diverse data channels, tuning ν
per channel could be key to good estimation at the tails. This could result in a combinatorial explosion
with manual tuning. Therefore, learning ν directly from the data could be an important direction for the
practical deployment of heavy-tailed diffusion models. Secondly, our evaluation protocol relies primarily on
comparing statistical properties computed over flattened generated samples and flattened train/test set samples.
One disadvantage of this approach is that our current evaluation metrics ignore the structure of the generated
samples. In general, developing metrics like FID (Heusel et al., 2018) to assess the perceptual quality of
synthetic data for scientific domains like weather analysis remains an important future direction.
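
To make the evaluation protocol above concrete, the following is a minimal sketch of how the two-sample Kolmogorov-Smirnov (KS) statistics reported in the figure captions below could be computed per channel; the array shapes and variable names are illustrative assumptions rather than the exact evaluation code.

# Minimal sketch, assuming NumPy/SciPy; names and shapes are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def flattened_ks(generated: np.ndarray, reference: np.ndarray) -> float:
    # generated, reference: arrays of shape (num_samples, H, W) for one channel
    # (e.g., VIL or w20). The fields are flattened into 1-d arrays and compared
    # with the two-sample KS statistic; lower values indicate closer marginals.
    gen_flat = np.asarray(generated, dtype=np.float64).ravel()
    ref_flat = np.asarray(reference, dtype=np.float64).ravel()
    return ks_2samp(gen_flat, ref_flat).statistic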


[Figure 7 panels: Train Set Samples; t-EDM (KS: 0.431); EDM (KS: 0.997)]

Figure 7: Random samples generated from t-EDM (Top Panel) and EDM (Bottom Panel) for the Vertically Integrated Liquid (VIL) channel. KS: Kolmogorov-Smirnov 2-sample statistic. Samples have been scaled logarithmically for better visualization.


[Figure 8 panels: Train Set Samples; t-Flow (KS: 0.711); Gaussian Flow (KS: 0.897)]

Figure 8: Random samples generated from t-Flow (Top Panel) and Gaussian Flow (Bottom Panel) for the Vertically Integrated Liquid (VIL) channel. KS: Kolmogorov-Smirnov 2-sample statistic. Samples have been scaled logarithmically for better visualization.


[Figure 9 panels: Train Set Samples; t-EDM (KS: 0.683); EDM (KS: 0.991)]

Figure 9: Random samples generated from t-EDM (Top Panel) and EDM (Bottom Panel) for the Vertical Wind Velocity (w20) channel. KS: Kolmogorov-Smirnov 2-sample statistic.


[Figure 10 panels: Train Set Samples; t-Flow (KS: 0.259); Gaussian Flow (KS: 0.294)]

Figure 10: Random samples generated from t-Flow (Top Panel) and Gaussian Flow (Bottom Panel) for the Vertical Wind Velocity (w20) channel. KS: Kolmogorov-Smirnov 2-sample statistic.


[Figure 11 panels: t-EDM (Channel: VIL); t-EDM (Channel: w20)]

Figure 11: 1-d histogram comparisons between generated samples and Train/Test set samples for the Vertically Integrated Liquid (VIL, see Top Panel) and Vertical Wind Velocity (w20, see Bottom Panel) channels using t-EDM (with varying ν).


[Figure 12 panels: t-Flow (Channel: VIL); t-Flow (Channel: w20)]

Figure 12: 1-d histogram comparisons between generated samples and Train/Test set samples for the Vertically Integrated Liquid (VIL, see Top Panel) and Vertical Wind Velocity (w20, see Bottom Panel) channels using t-Flow (with varying ν).


[Figure 8 panel labels (two examples): Ensemble Mean, Prediction 1, Prediction 2, Ensemble (gif), Ground Truth]

Figure 8: Qualitative visualization of samples generated from our conditional modeling for predicting the next state for the Vertically Integrated Liquid (VIL) channel. The ensemble mean represents the mean of ensemble predictions (16 in our case). Columns 2-3 represent two samples from the ensemble. The last column visualizes an animation of all ensemble members (best viewed in a dedicated PDF reader). For each sample, the rows correspond to predictions from EDM, t-EDM (ν = 3), and t-EDM (ν = 5) from top to bottom, respectively. Samples have been scaled logarithmically for better visualization.


[Figure 9 panel labels (two examples): Ensemble Mean, Prediction 1, Prediction 2, Ensemble (gif), Ground Truth]

Figure 9: Qualitative visualization of samples generated from our conditional modeling for predicting the next state for the Vertical Wind Velocity (w20) channel. The ensemble mean represents the mean of ensemble predictions (16 in our case). Columns 2-3 represent two samples from the ensemble. The last column visualizes an animation of all ensemble members (best viewed in a dedicated PDF reader). For each sample, the rows correspond to predictions from EDM, t-EDM (ν = 3), and t-EDM (ν = 5) from top to bottom, respectively.

