Heavy-Tailed Diffusion
ABSTRACT
Diffusion models achieve state-of-the-art generation quality across many applications, but
their ability to capture rare or extreme events in heavy-tailed distributions remains unclear.
In this work, we show that traditional diffusion and flow-matching models with standard
Gaussian priors fail to capture heavy-tailed behavior. We address this by repurposing the
diffusion framework for heavy-tail estimation using multivariate Student-t distributions.
We develop a tailored perturbation kernel and derive the denoising posterior based on
the conditional Student-t distribution for the backward process. Inspired by γ-divergence
for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers.
The resulting framework introduces controllable tail generation using only a single scalar
hyperparameter, making it easily tunable for diverse real-world distributions. As specific
instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing
diffusion and flow models that employ a Student-t prior. Remarkably, our approach is
readily compatible with standard Gaussian diffusion models and requires only minimal
code changes. Empirically, we show that our t-EDM and t-Flow outperform standard
diffusion models in heavy-tail estimation on high-resolution weather datasets in which
generating rare and extreme events is crucial.
1 INTRODUCTION
In many real-world applications, such as weather forecasting, rare or extreme events—like hurricanes or
heatwaves—can have disproportionately larger impacts than more common occurrences. Therefore, building
generative models capable of accurately capturing these extreme events is critically important (Gründemann
et al., 2022). However, learning the distribution of such data from finite samples is particularly challenging,
as the number of empirically observed tail events is typically small, making accurate estimation difficult.
One promising approach is to use heavy-tailed distributions, which allocate more density to the tails than
light-tailed alternatives. In popular generative models like Normalizing Flows (Rezende & Mohamed, 2016)
and Variational Autoencoders (VAEs) (Kingma & Welling, 2022), recent works address heavy-tail estimation
by learning a mapping from a heavy-tailed prior to the target distribution (Jaini et al., 2020; Kim et al., 2024b).
While these works advocate for heavy-tailed base distributions, their application to real-world, high-
dimensional datasets remains limited, with empirical results focused on small-scale or toy datasets. In
contrast, diffusion models (Ho et al., 2020; Song et al., 2020; Lipman et al., 2023) have demonstrated
excellent synthesis quality in large-scale applications. However, it is unclear whether diffusion models with
Gaussian priors can effectively model heavy-tailed distributions without significant modifications.
* Work done during an internship at NVIDIA. † Equal advising.
Figure 1: Toy Illustration. Our proposed diffusion model (t-Diffusion) captures heavy-tailed behavior more accurately
than standard Gaussian diffusion, as shown in the histogram comparisons (top panel, x-axis). The framework allows for
controllable tail estimation using a hyperparameter ν, which can be adjusted for each dimension. Lower ν values model
heavier tails, while higher values approach Gaussian diffusion. (Best viewed when zoomed in; see App. C.3 for details)
In this work, we first demonstrate through extensive experiments that traditional diffusion models—even
with proper normalization, preconditioning, and noise schedule design (see Section 4)—fail to accurately
capture the heavy-tailed behavior in target distributions (see Fig. 1 for a toy example). We hypothesize that,
in high-dimensional spaces, the Gaussian distribution in standard diffusion models concentrates on a narrow
spherical shell, thereby neglecting the tails. To address this, we adopt the multivariate Student-t
distribution as the base noise distribution, with its degrees of freedom providing controllability over tail
estimation. Consequently, we reformulate the denoising diffusion framework using multivariate Student-t
distributions by designing a tailored perturbation kernel and deriving the corresponding denoiser. Moreover,
we draw inspiration from the γ-power Divergences (Eguchi, 2021; Kim et al., 2024a) for heavy-tailed
distributions to formulate the learning problem for our heavy-tailed denoiser.
We extend widely adopted diffusion models, such as EDM (Karras et al., 2022) and straight-line flows
(Lipman et al., 2023; Liu et al., 2022), by introducing their Student-t counterparts: t-EDM and t-Flow.
We derive the corresponding SDEs and ODEs for modeling heavy-tailed distributions. Through extensive
experiments on the HRRR dataset (Dowell et al., 2022), we train both unconditional and conditional versions
of these models. The results show that standard EDM struggles to capture tails and extreme events, whereas
t-EDM performs significantly better in modeling such phenomena. To summarize, our contributions are as follows:
• Heavy-tailed Diffusion Models. We repurpose the diffusion model framework for heavy-tail
estimation by formulating both the forward and reverse processes using multivariate Student-t
distributions. The denoiser is learned by minimizing the γ-power divergence (Kim et al., 2024a)
between the forward and reverse posteriors.
• Empirical Results. Experiments on the HRRR dataset (Dowell et al., 2022), a high-resolution
dataset for weather modeling, show that t-EDM significantly outperforms EDM in capturing tail
distributions for both unconditional and conditional tasks.
2 BACKGROUND
As prerequisites underlying our method, we briefly summarize Gaussian diffusion models (as introduced in
(Ho et al., 2020; Sohl-Dickstein et al., 2015)) and multivariate Student-t distributions.
2.1 DIFFUSION MODELS

Diffusion models define a forward process (usually with an affine drift and no learnable parameters) to convert
data $x_0 \sim p(x_0)$, $x_0 \in \mathbb{R}^d$, to noise. A learnable reverse process is then trained to generate data from noise.
In the discrete-time setting, the training objective for diffusion models can be specified as,
$$
\mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>\Delta t}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-\Delta t}|x_t, x_0)\,\|\,p_\theta(x_{t-\Delta t}|x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0|x_{\Delta t})}_{L_0}\Big], \tag{1}
$$
where $T$ denotes the trajectory length and $\Delta t$ the time increment between two consecutive time points.
$D_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence, defined as $D_{\mathrm{KL}}(q\,\|\,p) = \int q(x)\log\frac{q(x)}{p(x)}\,dx$.
In the objective in Eq. 1, the trajectory length $T$ is chosen such that the generative prior $p(x_T)$ matches the
forward marginal $q(x_T|x_0)$. The second term in Eqn. 1 minimizes the KL divergence between the forward
posterior $q(x_{t-\Delta t}|x_t, x_0)$ and the learnable posterior $p_\theta(x_{t-\Delta t}|x_t)$, which corresponds to learning
the denoiser (i.e., predicting a less noisy state from noise). The forward marginals, posterior, and reverse
posterior are modeled using Gaussian distributions, which admit an analytical form of the KL divergence.
The discrete-time diffusion framework can also be extended to the continuous time setting (Song et al.,
2020; 2021; Karras et al., 2022). Recently, Lipman et al. (2023); Albergo et al. (2023a) proposed stochastic
interpolants (or flows), which allow flexible transport between two arbitrary distributions.
2.2 STUDENT-T DISTRIBUTIONS

The multivariate Student-t distribution $t_d(\mu, \Sigma, \nu)$ with dimensionality $d$, location $\mu$, scale matrix $\Sigma$, and
degrees of freedom $\nu$ is defined as,
$$
t_d(\mu, \Sigma, \nu) = C_{\nu,d}\Big[1 + \frac{1}{\nu}(x - \mu)^\top\Sigma^{-1}(x - \mu)\Big]^{-\frac{\nu+d}{2}}, \tag{2}
$$
where $C_{\nu,d}$ is the normalizing factor. Since the multivariate Student-t distribution has polynomially decaying
density, it can model heavy-tailed distributions. Interestingly, for $\nu = 1$, the Student-t distribution reduces to
the multivariate Cauchy distribution, and as $\nu \to \infty$, it converges to the Gaussian distribution. A Student-t
distributed random variable $x$ can be reparameterized as (Andrews & Mallows, 1974) $x = \mu + \Sigma^{1/2} z / \sqrt{\kappa}$,
with $z \sim \mathcal{N}(0, I_d)$ and $\kappa \sim \chi^2(\nu)/\nu$, where $\chi^2$ denotes the Chi-squared distribution.
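To make the reparameterization above concrete, the following minimal PyTorch sketch (ours, not from the paper; it assumes a batch-first tensor layout and a diagonal scale $\Sigma = \sigma^2 I_d$) draws samples by combining a Gaussian draw with a per-sample Chi-squared scale:

```python
import torch

def sample_student_t(mu: torch.Tensor, sigma: float, nu: float) -> torch.Tensor:
    """Draw x ~ t_d(mu, sigma^2 I_d, nu) via x = mu + sigma * z / sqrt(kappa)."""
    z = torch.randn_like(mu)                                        # z ~ N(0, I_d)
    kappa = torch.distributions.Chi2(nu).sample(mu.shape[:1]) / nu  # one kappa ~ chi^2(nu)/nu per sample
    kappa = kappa.view(-1, *([1] * (mu.dim() - 1)))                 # broadcast over the remaining dims
    return mu + sigma * z / kappa.sqrt()
```

Note that a single $\kappa$ is shared across all $d$ coordinates of a sample; drawing an independent $\kappa$ per coordinate would instead yield a product of univariate Student-t distributions.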
We construct the heavy-tailed noising process in three steps:

1. Firstly, given two consecutive noisy states $x_t$ and $x_{t-\Delta t}$, we specify a joint distribution $q(x_t, x_{t-\Delta t}|x_0)$.
2. Secondly, given $q(x_t, x_{t-\Delta t}|x_0)$, we construct the perturbation kernel $q(x_t|x_0) = \int q(x_t, x_{t-\Delta t}|x_0)\,dx_{t-\Delta t}$, which is used as the noising process during training.
3. Lastly, from Steps 1 and 2, we construct the forward denoising posterior $q(x_{t-\Delta t}|x_t, x_0) = \frac{q(x_t, x_{t-\Delta t}|x_0)}{q(x_t|x_0)}$. We will later utilize the form of $q(x_{t-\Delta t}|x_t, x_0)$ to parameterize the reverse posterior.
It is worth noting that our construction of the noising process bypasses the specification of the forward
transition kernel $q(x_t|x_{t-\Delta t})$. This has the advantage that we can directly specify the form of the perturbation
kernel parameters $\mu_t$ and $\sigma_t$, as in Karras et al. (2022), unlike Song et al. (2020); Ho et al. (2020). We next
describe the construction of the noising process in more detail.
Specifying the joint distribution $q(x_t, x_{t-\Delta t}|x_0)$. We parameterize the joint distribution $q(x_t, x_{t-\Delta t}|x_0)$
as a multivariate Student-t distribution of the following form,
$$
q(x_t, x_{t-\Delta t}|x_0) = t_{2d}(\mu, \Sigma, \nu), \qquad \mu = [\mu_t;\ \mu_{t-\Delta t}]\,x_0, \qquad \Sigma = \begin{pmatrix} \sigma_t^2 & \sigma_{12}^2(t) \\ \sigma_{21}^2(t) & \sigma_{t-\Delta t}^2 \end{pmatrix} \otimes I_d,
$$
where $\mu_t$, $\sigma_t$, $\sigma_{12}(t)$, and $\sigma_{21}(t)$ are time-dependent scalar design parameters. While the choice of the parameters
$\mu_t$ and $\sigma_t$ determines the perturbation kernel used during training, the choice of $\sigma_{12}(t)$ and $\sigma_{21}(t)$ affects
the ODE/SDE formulation of the denoising process and will be clarified when discussing sampling.
Constructing the perturbation kernel $q(x_t|x_0)$. Given the joint distribution $q(x_t, x_{t-\Delta t}|x_0)$ specified
as a multivariate Student-t distribution, it follows that the perturbation kernel $q(x_t|x_0)$ is also
a Student-t distribution (Ding, 2016), parameterized as $q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu)$ (proof in App. A.1).
We choose the scalar coefficients $\mu_t$ and $\sigma_t$ such that the perturbation kernel at time $t = T$ converges to a
standard Student-t generative prior $t_d(0, I_d, \nu)$. We discuss practical choices of $\mu_t$ and $\sigma_t$ in Section 3.5.
Estimating the reference denoising posterior. Given the joint distribution $q(x_t, x_{t-\Delta t}|x_0)$ and the perturbation
kernel $q(x_t|x_0)$, the denoising posterior can be specified as (see Ding (2016)),
$$
q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar\mu_t,\ \frac{\nu + d_1}{\nu + d}\,\bar\sigma_t^2 I_d,\ \nu + d\Big), \tag{3}
$$
$$
\bar\mu_t = \mu_{t-\Delta t}\,x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,(x_t - \mu_t x_0), \qquad \bar\sigma_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2}, \tag{4}
$$
where $d_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|^2$. Next, we formulate the training objective for heavy-tailed diffusions.
Following Eqn. 3, we parameterize the reverse (or denoising) posterior distribution as:
$$
p_\theta(x_{t-\Delta t}|x_t) = t_d\big(\mu_\theta(x_t, t),\ \bar\sigma_t^2 I_d,\ \nu + d\big), \tag{5}
$$
where the denoiser mean $\mu_\theta(x_t, t)$ is further parameterized as follows:
$$
\mu_\theta(x_t, t) = \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,\mu_t\Big] D_\theta(x_t, \sigma_t). \tag{6}
$$
While we adopt the "$x_0$-prediction" parameterization (Karras et al., 2022), it is also possible to parameterize
the posterior mean using an $\epsilon$-prediction objective instead (Ho et al., 2020) (see App. A.2). Lastly, when
parameterizing the reverse posterior scale, we drop the data-dependent coefficient $\frac{\nu+d_1}{\nu+d}$. This aligns with prior
works on diffusion models (Ho et al., 2020; Song et al., 2020; Karras et al., 2022), where it is common to
parameterize only the denoiser mean. However, heteroskedastic modeling of the denoiser is possible in our
framework and could be an interesting direction for future work. Next, we reformulate the training objective
in Eqn. 1 for heavy-tailed diffusions.
The optimization objective in Eqn. 1 primarily minimizes the KL divergence between a given pair of
distributions. However, since we parameterize the distributions in Eqn. 1 using multivariate Student-t
distributions, the KL divergence might not be a suitable choice, because the KL divergence between
Student-t distributions does not admit a closed-form expression. An alternative is the γ-power divergence
(Eguchi, 2021; Kim et al., 2024a), defined as,
$$
D_\gamma(q \,\|\, p) = \frac{1}{\gamma}\big[C_\gamma(q, p) - H_\gamma(q)\big], \qquad \gamma \in (-1, 0) \cup (0, \infty),
$$
$$
H_\gamma(p) = -\|p\|_{1+\gamma} = -\Big(\int p(x)^{1+\gamma}\,dx\Big)^{\frac{1}{1+\gamma}}, \qquad C_\gamma(q, p) = -\int q(x)\Big(\frac{p(x)}{\|p\|_{1+\gamma}}\Big)^{\gamma} dx,
$$
where, like Kim et al. (2024a), we set $\gamma = -\frac{2}{\nu+d}$ for the remainder of our discussion. Moreover, $H_\gamma$ and $C_\gamma$
represent the γ-power entropy and cross-entropy, respectively. Interestingly, the γ-power divergence between
two multivariate Student-t distributions, $q_\nu = t_d(\mu_0, \Sigma_0, \nu)$ and $p_\nu = t_d(\mu_1, \Sigma_1, \nu)$, can be tractably
computed in closed form and is given by (see Kim et al. (2024a) for a proof),
$$
D_\gamma[q_\nu \,\|\, p_\nu] = -\frac{1}{\gamma}\,C_{\nu,d}^{\frac{\gamma}{1+\gamma}}\Big(1 + \frac{d}{\nu-2}\Big)^{-\frac{\gamma}{1+\gamma}}\Big[-|\Sigma_0|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{d}{\nu-2}\Big) + |\Sigma_1|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{1}{\nu-2}\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + \frac{1}{\nu}(\mu_0-\mu_1)^\top\Sigma_1^{-1}(\mu_0-\mu_1)\Big)\Big]. \tag{7}
$$
Here, we note a couple of caveats. Firstly, while replacing the KL divergence with the γ-power divergence
in the objective in Eqn. 1 might appear to be purely a matter of computational convenience, the γ-power divergence has
several connections with robust estimators in statistics (Futami et al., 2018) and provides a tunable parameter
$\gamma$ which can be used to control the model density assigned to the tails (see Section 5). Secondly, while the
objective in Eqn. 1 is a valid ELBO, the objective in Eq. 8 is not. However, the following result provides a
connection between the two objectives (see proof in App. A.3).

Proposition 1. For arbitrary distributions $q$ and $p$, in the limit $\gamma \to 0$, $D_\gamma(q \,\|\, p)$ converges to $D_{\mathrm{KL}}(q \,\|\, p)$.
Consequently, for a finite-dimensional dataset with $x_0 \in \mathbb{R}^d$ and $\gamma = -\frac{2}{\nu+d}$, in the limit $\gamma \to 0$, the
objective in Eqn. 8 converges to the objective in Eqn. 1.

Therefore, in the limit $\gamma \to 0$, the standard diffusion model framework becomes a special case of our
proposed framework. Moreover, for $\gamma = -2/(\nu+d)$, this also explains why tail estimation moves towards
Gaussian diffusion as $\nu$ increases (see Fig. 1 for an illustration).
Simplifying the Training Objective. Plugging the form of the forward posterior $q(x_{t-\Delta t}|x_t, x_0)$ in Eqn. 3
and the reverse posterior $p_\theta(x_{t-\Delta t}|x_t)$ into the optimization objective in Eqn. 8, we obtain the following simplified
training loss (proof in App. A.4),
$$
\mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim p(x_0)}\,\mathbb{E}_{t \sim p(t)}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}\,\mathbb{E}_{\kappa \sim \frac{1}{\nu}\chi^2(\nu)} \Big\|D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}},\ \sigma_t\Big) - x_0\Big\|_2^2. \tag{9}
$$
Intuitively, our training objective takes the same form as in existing diffusion models (Ho et al., 2020; Karras
et al., 2022); the only difference lies in sampling the noisy state $x_t$ from a Student-t perturbation kernel
instead of a Gaussian one. Next, we discuss sampling from our proposed framework in discrete- and
continuous-time settings.
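For illustration, a hedged sketch of the loss in Eq. 9 is given below (assuming a denoiser callable `D_theta(x_t, sigma_t)` and per-sample, broadcastable schedule tensors `mu_t` and `sigma_t`; these names are ours and do not refer to any released implementation):

```python
import torch

def t_denoising_loss(D_theta, x0: torch.Tensor, mu_t: torch.Tensor,
                     sigma_t: torch.Tensor, nu: float) -> torch.Tensor:
    """E || D_theta(mu_t * x0 + sigma_t * eps / sqrt(kappa), sigma_t) - x0 ||^2  (Eq. 9)."""
    eps = torch.randn_like(x0)                                         # eps ~ N(0, I_d)
    kappa = torch.distributions.Chi2(nu).sample((x0.shape[0],)) / nu   # kappa ~ chi^2(nu)/nu
    kappa = kappa.view(-1, *([1] * (x0.dim() - 1)))
    x_t = mu_t * x0 + sigma_t * eps / kappa.sqrt()                     # Student-t perturbation kernel
    return ((D_theta(x_t, sigma_t) - x0) ** 2).mean()
```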
3.4 SAMPLING
Discrete-time Sampling. For discrete-time settings, we can simply perform ancestral sampling from the
learned reverse posterior distribution $p_\theta(x_{t-\Delta t}|x_t)$. Following a simple re-parameterization, an
ancestral sampling update can be specified as,
$$
x_{t-\Delta t} = \mu_\theta(x_t, t) + \bar\sigma_t\frac{z}{\sqrt{\kappa}}, \qquad z \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu+d)/(\nu+d),
$$
$$
= \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,\mu_t\Big] D_\theta(x_t, \sigma_t) + \bar\sigma_t\frac{z}{\sqrt{\kappa}}.
$$
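A sketch of one such ancestral update is shown below; the coefficient names follow the notation in the text (with `sigma21_sq` standing for $\sigma_{21}^2(t)$), and the scalar-versus-tensor handling is an assumption on our part:

```python
import torch

def ancestral_step(D_theta, x_t, mu_t, mu_prev, sigma_t, sigma21_sq, sigma_bar, nu: float, d: int):
    """One draw from p_theta(x_{t - dt} | x_t) using the x0-prediction parameterization (Eq. 6)."""
    coef = sigma21_sq / sigma_t ** 2
    mean = coef * x_t + (mu_prev - coef * mu_t) * D_theta(x_t, sigma_t)
    z = torch.randn_like(x_t)
    kappa = torch.distributions.Chi2(nu + d).sample((x_t.shape[0],)) / (nu + d)
    kappa = kappa.view(-1, *([1] * (x_t.dim() - 1)))
    return mean + sigma_bar * z / kappa.sqrt()
```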
Figure 2: Training and Sampling algorithms for t-EDM ($\nu > 2$). Our proposed method requires minimal code updates
(indicated in blue) over traditional Gaussian diffusion models and converges to the latter as $\nu \to \infty$.
Based on the result in Proposition 2, it is possible to construct deterministic and stochastic samplers for heavy-tailed
diffusions. It is worth noting that the SDE in Eqn. 10 implies adding heavy-tailed stochastic noise
during inference (Bollerslev, 1987). Next, we provide specific instantiations of the generic sampler in Eq. 10.

Sampler Instantiations. We instantiate the continuous-time SDE in Eqn. 10 by setting $g(\sigma_t, \dot\sigma_t) = 0$ and
$\sigma_{12}^2(t) = \sigma_t\,\sigma_{t-\Delta t}$. Consequently, $f(\sigma_t, \dot\sigma_t) = -\frac{\dot\sigma_t}{\sigma_t}$. In this case, the SDE in Eqn. 10 reduces to an ODE,
which can be represented as,
$$
\frac{dx_t}{dt} = \frac{\dot\mu_t}{\mu_t}\,x_t - \Big[\frac{\dot\mu_t}{\mu_t} - \frac{\dot\sigma_t}{\sigma_t}\Big]\big(x_t - \mu_t D_\theta(x_t, \sigma_t)\big). \tag{11}
$$
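For reference, a simple Euler discretization of the ODE in Eq. 11 might look as follows (a sketch under the assumption that the schedules $\mu_t$, $\sigma_t$ and their time derivatives are supplied as callables; this is not the paper's released sampler):

```python
import torch

def ode_euler_sampler(D_theta, x_T: torch.Tensor, ts, mu, sigma, mu_dot, sigma_dot) -> torch.Tensor:
    """Integrate Eq. 11 backwards on a descending time grid `ts` (ts[0] = T, ts[-1] ~ 0)."""
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a = mu_dot(t_cur) / mu(t_cur)
        b = sigma_dot(t_cur) / sigma(t_cur)
        drift = a * x - (a - b) * (x - mu(t_cur) * D_theta(x, sigma(t_cur)))
        x = x + (t_next - t_cur) * drift            # Euler step; note (t_next - t_cur) < 0
    return x
```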
Summary. To summarize our theoretical framework, we present an overview of the comparison between
Gaussian diffusion models and our proposed heavy-tailed diffusion models in Table 1.
Karras et al. (2022) highlight several design choices during training and sampling that significantly
improve sample quality while reducing the sampling budget for image datasets like CIFAR-10 (Krizhevsky,
2009) and ImageNet (Deng et al., 2009). With a similar motivation, we reformulate the perturbation kernel as
$q(x_t|x_0) = t_d(s(t)x_0, s(t)^2\sigma(t)^2 I_d, \nu)$ and denote the resulting diffusion model as t-EDM.
Training. During training, we set the perturbation kernel parameters to $s(t) = 1$ and $\sigma(t) = \sigma \sim \mathrm{LogNormal}(P_{\mathrm{mean}}, P_{\mathrm{std}})$.
Moreover, we parameterize the denoiser $D_\theta(x_t, \sigma_t)$ as
$$
D_\theta(x, \sigma) = c_{\mathrm{skip}}(\sigma, \nu)\,x + c_{\mathrm{out}}(\sigma, \nu)\,F_\theta\big(c_{\mathrm{in}}(\sigma, \nu)\,x,\ c_{\mathrm{noise}}(\sigma)\big).
$$
Our denoiser parameterization is similar to Karras et al. (2022), with the difference that coefficients like $c_{\mathrm{out}}$
additionally depend on $\nu$. We include full derivations in Appendix A.6. Consequently, our denoising loss can
be specified as follows:
$$
\mathcal{L}(\theta) \propto \mathbb{E}_{x_0 \sim p(x_0)}\,\mathbb{E}_{\sigma}\,\mathbb{E}_{n \sim t_d(0, \sigma^2 I_d, \nu)}\big[\lambda(\sigma, \nu)\,\|D_\theta(x_0 + n, \sigma) - x_0\|_2^2\big], \tag{12}
$$
where $\lambda(\sigma, \nu)$ is a weighting function set to $\lambda(\sigma, \nu) = 1/c_{\mathrm{out}}(\sigma, \nu)^2$.
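As an illustration, the ν-dependent coefficients can be computed as below, following the relations derived in App. A.6 (Eqs. 92 and 99). The specific choice of $c_{\mathrm{skip}}$ here, namely the EDM-style value that minimizes $c_{\mathrm{out}}$ with the Student-t noise variance $\frac{\nu}{\nu-2}\sigma^2$, and the reuse of EDM's $c_{\mathrm{noise}}$ are assumptions on our part:

```python
import math

def t_edm_precond(sigma: float, nu: float, sigma_data: float = 0.5):
    var_n = sigma ** 2 * nu / (nu - 2.0)                  # variance of t_d(0, sigma^2 I_d, nu), nu > 2
    c_in = 1.0 / math.sqrt(sigma_data ** 2 + var_n)       # unit-variance network inputs (Eq. 92)
    c_skip = sigma_data ** 2 / (sigma_data ** 2 + var_n)  # assumed: the value minimizing c_out
    c_out = math.sqrt((1 - c_skip) ** 2 * sigma_data ** 2 + c_skip ** 2 * var_n)  # Eq. 99
    c_noise = 0.25 * math.log(sigma)                      # assumed unchanged from EDM
    return c_in, c_skip, c_out, c_noise
```

As $\nu \to \infty$, `var_n` tends to $\sigma^2$ and the coefficients recover their EDM values.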
Sampling. Interestingly, it can be shown that the ODE in Eqn. 11 is equivalent to the deterministic dynamics
presented in Karras et al. (2022) (see Appendix A.7 for a proof). Consequently, we choose $s(t) = 1$ and
$\sigma(t) = t$, following Karras et al. (2022).
Figure 3: Sample 1-d histogram comparison between EDM and t-EDM on the test set for the Vertically Integrated
Liquid (VIL) channel. t-EDM captures heavy-tailed behavior more accurately than other baselines. INC: Inverse CDF
Normalization, PCP: Per-Channel Preconditioning
4 EXPERIMENTS
To assess the effectiveness of the proposed heavy-tailed diffusion and flow models, we conduct experiments
on real-world weather data for both unconditional and conditional generation tasks. We include full
experimental details in App. C.
Datasets. We adopt the High-Resolution Rapid Refresh (HRRR) dataset (Dowell et al., 2022), an
operational archive of the US km-scale forecasting model. Among the dataset's many dynamical variables
that exhibit heavy-tailed behavior, we consider, based on their dynamic range, only the Vertically Integrated
Liquid (VIL) and Vertical Wind Velocity at level 20 (denoted w20) channels (see App. C.1 for more details).
It is worth noting that the VIL and w20 channels have heavier right and left tails, respectively (see Fig. 6).
Tasks and Metrics. We consider both unconditional and conditional generative tasks relevant to weather
and climate science. For unconditional modeling, we aim to generate the VIL and w20 physical variables
in the HRRR dataset. For conditional modeling, we aim to generatively predict the hourly evolution of the
target variable at the next lead time (τ + 1), i.e., the hour-ahead evolution of VIL and w20, based only on
information at the current time τ; see the appendix for more details, and see Pathak et al. (2024) for a discussion
of why hour-ahead, km-scale atmospheric prediction is a stochastic physical task appropriate for conditional
generative models. To quantify the empirical performance of unconditional modeling, we compare
1-d statistics of generated and train/test set samples. More specifically, for quantitative analysis, we report the
Kurtosis Ratio (KR), the Skewness Ratio (SR), and the two-sample Kolmogorov-Smirnov (KS) statistic (at the
tails) between the generated and train/test set samples. For qualitative analysis, we compare 1-d histograms
of generated and train/test set samples. For the conditional task, we adopt standard probabilistic
prediction scores such as the Continuous Ranked Probability Score (CRPS), the Root-Mean-Squared
Error (RMSE), and the skill-spread ratio (SSR); see, e.g., Mardani et al. (2023a); Srivastava et al. (2023). A
more detailed explanation of our evaluation protocol is provided in App. C.
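One plausible implementation of these 1-d diagnostics is sketched below; the tail threshold (the 99th percentile of the reference data) and the use of excess kurtosis are our assumptions, not necessarily the paper's exact protocol:

```python
import numpy as np
from scipy import stats

def tail_metrics(generated: np.ndarray, reference: np.ndarray, q: float = 0.99):
    kr = stats.kurtosis(generated) / stats.kurtosis(reference)   # Kurtosis Ratio (KR)
    sr = stats.skew(generated) / stats.skew(reference)           # Skewness Ratio (SR)
    thr = np.quantile(reference, q)                              # tail threshold from the reference data
    ks = stats.ks_2samp(generated[generated > thr], reference[reference > thr]).statistic
    return kr, sr, ks
```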
Methods and Baselines. In addition to standard diffusion (EDM (Karras et al., 2022)) and flow models
(Albergo et al., 2023a) based on Gaussian priors, we introduce two additional baselines that are variants
of EDM. To account for the high dynamic range often exhibited by heavy-tailed distributions, we include
Inverse CDF Normalization (INC) as an alternative data preprocessing step to z-score normalization; it
reduces the dynamic range significantly and can make the data distribution closer to Gaussian. We
denote this preprocessing scheme combined with standard EDM training as EDM + INC. Alternatively, we
can instead modulate the noise levels used during EDM training as a function of the dynamic range of the
input channel while keeping the data preprocessing unchanged; the main intuition is to use more heavy-tailed
noise for large values. We denote this modulating scheme as Per-Channel Preconditioning (PCP) and the
resulting baseline as EDM + PCP. We elaborate on these baselines in more detail in App. C.1.2.
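For concreteness, a minimal sketch of INC is shown below (our illustration: it maps values through the empirical CDF of the training data and then through the standard-normal inverse CDF; the paper's exact preprocessing may differ):

```python
import numpy as np
from scipy import stats

def inverse_cdf_normalize(x: np.ndarray, reference: np.ndarray) -> np.ndarray:
    ranks = np.searchsorted(np.sort(reference), x, side="right") / (reference.size + 1)
    ranks = np.clip(ranks, 1e-6, 1.0 - 1e-6)     # keep ranks strictly inside (0, 1)
    return stats.norm.ppf(ranks)                 # push through the Gaussian inverse CDF
```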
4.1 UNCONDITIONAL GENERATION

We assess the effectiveness of different methods on unconditional modeling of the VIL and w20 channels
in the HRRR dataset. Fig. 3 qualitatively compares 1-d histograms of sample intensities across
methods for the VIL channel. We make the following key observations. Firstly, though EDM (with additional
tricks like noise conditioning) can improve tail coverage, t-EDM covers a broader range of extreme values
in the test set. Secondly, in addition to better dynamic-range coverage, t-EDM is qualitatively much
better at capturing the density assigned to intermediate intensity levels. We note similar
observations in our quantitative results in Table 2, where t-EDM outperforms other baselines on the KS
metric, implying better tail estimation than competing baselines for both the VIL and
w20 channels. More importantly, unlike traditional Gaussian diffusion models such as EDM, t-EDM enables
controllable tail estimation by varying ν, which can be useful when modeling a combination of channels
with diverse statistical properties. Lastly, we present similar quantitative results for t-Flow in Table 3, and
additional results for unconditional modeling in App. C.1.
4.2 CONDITIONAL GENERATION

Next, we consider the task of conditional modeling, where we aim to predict the hourly evolution of a
target variable at the next lead time (τ + 1) given the current state at time τ. Table 4 reports the
performance of EDM and t-EDM on this task for the VIL and w20 channels. We make the following key
observations. Firstly, for both channels, t-EDM exhibits better CRPS and SSR scores, implying better
probabilistic forecast skill and ensemble calibration than EDM. Moreover, while t-EDM exhibits under-dispersion for
VIL, it is well-calibrated for w20, with an SSR close to the ideal score of 1. On the contrary, the baseline
EDM model exhibits under-dispersion for both channels, implying overconfident predictions. Secondly,
in addition to better calibration, t-EDM is better at tail estimation (as measured by the KS statistic) for the
underlying conditional distribution. Lastly, we notice that different values of the parameter ν are optimal for
different channels, which suggests a more data-driven approach to learning the optimal ν directly. We present
additional results for conditional modeling in App. C.2.
Enabling efficient tail coverage during training. The optimization objective in Eq. 8 has several connections
with robust statistical estimators. More specifically, it can be shown that (proof in App. A.11)
$$
\nabla_\theta D_\gamma(q \,\|\, p_\theta) = -\int q(x)\Big(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\Big)^{\gamma}\Big(\nabla_\theta \log p_\theta(x) - \mathbb{E}_{\tilde p_\theta(x)}\big[\nabla_\theta \log p_\theta(x)\big]\Big)dx,
$$
where $q$ and $p_\theta$ denote the forward ($q(x_{t-\Delta t}|x_t, x_0)$) and reverse ($p_\theta(x_{t-\Delta t}|x_t)$) diffusion posteriors, respectively.
Intuitively, the coefficient $\gamma$ weighs the likelihood gradient $\nabla_\theta \log p_\theta(x)$ and can be set
to ignore or emphasize outliers when modeling the data distribution. Specifically, when $\gamma > 1$, the model
learns to ignore outliers (Futami et al., 2018; Fujisawa & Eguchi, 2008; Basu et al., 1998), since data points in
the tails are assigned low likelihood. On the contrary, for a negative value of $\gamma$ (as in this work,
since we set $\gamma = -2/(\nu+d)$), the model assigns more weight to capturing these extreme values.
We discuss some other connections to prior work in heavy-tailed generative modeling and more recent work
in diffusion models in App. F.1 and some limitations of our approach in App. F.2.
REPRODUCIBILITY STATEMENT
We include proofs for all theoretical results introduced in the main text in Appendix A. We describe our
complete experimental setup (including data processing steps, model specification for training and inference,
description of evaluation metrics, and extended experimental results) in Appendix C.
ETHICS STATEMENT
We develop a generative framework for modeling heavy-tailed distributions and demonstrate its effectiveness
for scientific applications. In this context, we do not think our model poses a risk of misinformation or other
ethical biases associated with large-scale image synthesis models. However, we would like to point out that
similar to other generative models, our model can sometimes hallucinate predictions for certain channels,
which could impact downstream applications like weather forecasting.
REFERENCES
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying
framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023a.
Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying
framework for flows and diffusions, 2023b. URL https://arxiv.org/abs/2303.08797.
D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical
Society. Series B (Methodological), 36(1):99–102, 1974. ISSN 00359246. URL http://www.jstor.
org/stable/2984774.
Uri M. Ascher and Linda R. Petzold. Computer Methods for Ordinary Differential Equations and
Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, Philadelphia, PA,
1998. doi: 10.1137/1.9781611971392. URL https://epubs.siam.org/doi/abs/10.1137/1.
9781611971392.
Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising
a density power divergence. Biometrika, 85(3):549–559, 1998.
Tim Bollerslev. A conditionally heteroskedastic time series model for speculative prices and rates of
return. The Review of Economics and Statistics, 69(3):542–547, 1987. ISSN 00346535, 15309142. URL
http://www.jstor.org/stable/1925546.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte
Carlo. Chapman and Hall/CRC, May 2011. ISBN 9780429138508. doi: 10.1201/b10905. URL
http://dx.doi.org/10.1201/b10905.
Tianfeng Chai and Roland R Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments
against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference
on Learning Representations, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255,
2009. doi: 10.1109/CVPR.2009.5206848.
Peng Ding. On the conditional distribution of the multivariate t distribution, 2016. URL https://arxiv.
org/abs/1604.00561.
Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped
langevin diffusion, 2022. URL https://arxiv.org/abs/2112.07068.
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep
networks, 2016. URL https://arxiv.org/abs/1602.02644.
David C. Dowell, Curtis R. Alexander, Eric P. James, Stephen S. Weygandt, Stanley G. Benjamin, Geoffrey S.
Manikin, Benjamin T. Blake, John M. Brown, Joseph B. Olson, Ming Hu, Tatiana G. Smirnova, Terra
Ladwig, Jaymes S. Kenyon, Ravan Ahmadov, David D. Turner, Jeffrey D. Duda, and Trevor I. Alcott.
The high-resolution rapid refresh (hrrr): An hourly updating convection-allowing forecast model. part i:
Motivation and system description. Weather and Forecasting, 37(8):1371 – 1395, 2022. doi: 10.1175/
WAF-D-21-0151.1. URL https://journals.ametsoc.org/view/journals/wefo/37/8/
WAF-D-21-0151.1.xml.
Shinto Eguchi. Chapter 2 - pythagoras theorem in information geometry and applications to generalized linear
models. In Angelo Plastino, Arni S.R. Srinivasa Rao, and C.R. Rao (eds.), Information Geometry, volume 45
of Handbook of Statistics, pp. 15–42. Elsevier, 2021. doi: https://doi.org/10.1016/bs.host.2021.06.001. URL
https://www.sciencedirect.com/science/article/pii/S0169716121000225.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi,
Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey,
Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution
image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contami-
nation. Journal of Multivariate Analysis, 99(9):2053–2081, 2008. ISSN 0047-259X. doi: https://doi.org/
10.1016/j.jmva.2008.02.004. URL https://www.sciencedirect.com/science/article/
pii/S0047259X08000456.
Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences, 2018.
URL https://arxiv.org/abs/1710.06595.
Gaby Joanne Gründemann, Nick van de Giesen, Lukas Brunner, and Ruud van der Ent. Rarest rainfall
events will see the greatest relative increase in magnitude under future climate change. Communications
Earth & Environment, 3(1), October 2022. ISSN 2662-4435. doi: 10.1038/s43247-022-00558-8. URL
http://dx.doi.org/10.1038/s43247-022-00558-8.
H. Guo, J.-C. Golaz, L. J. Donner, B. Wyman, M. Zhao, and P. Ginoux. Clubb as a unified cloud pa-
rameterization: Opportunities and challenges. Geophysical Research Letters, 42(11):4540–4547, 2015.
doi: https://doi.org/10.1002/2015GL063672. URL https://agupubs.onlinelibrary.wiley.
com/doi/abs/10.1002/2015GL063672.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium, 2018. URL https:
//arxiv.org/abs/1706.08500.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851, 2020.
M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing
splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990. doi: 10.1080/
03610919008812866. URL https://doi.org/10.1080/03610919008812866.
Priyank Jaini, Ivan Kobyzev, Yaoliang Yu, and Marcus Brubaker. Tails of lipschitz triangular flows, 2020.
URL https://arxiv.org/abs/1907.04481.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based
generative models, 2022. URL https://arxiv.org/abs/2206.00364.
Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, and Joong-Ho Won. $t^3$-variational autoencoder:
Learning heavy-tailed data with student's t and power divergence, 2024a. URL https://arxiv.org/
abs/2312.01133.
Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, and Joong-Ho Won. $t^3$-variational autoen-
coder: Learning heavy-tailed data with student’s t and power divergence. In The Twelfth International
Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=
RzNlECeoOB.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URL https://arxiv.
org/abs/1312.6114.
Alex Krizhevsky. Learning multiple layers of features from tiny images. pp. 32–33, 2009. URL https:
//www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Mike Laszkiewicz, Johannes Lederer, and Asja Fischer. Marginal tail-adaptive normalizing flows, 2022. URL
https://arxiv.org/abs/2206.10311.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching
for generative modeling. In International Conference on Learning Representations, 2023. URL https:
//openreview.net/forum?id=PqvMRDCJT9t.
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data
with rectified flow, 2022. URL https://arxiv.org/abs/2209.03003.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver
for diffusion probabilistic model sampling in around 10 steps, 2022. URL https://arxiv.org/
abs/2206.00927.
Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash
Vahdat, Karthik Kashinath, Jan Kautz, and Mike Pritchard. Residual corrective diffusion modeling for
km-scale atmospheric downscaling. arXiv preprint arXiv:2309.15214, 2023a.
Morteza Mardani, Jiaming Song, Jan Kautz, and Arash Vahdat. A variational perspective on solving inverse
problems with diffusion models. In The Twelfth International Conference on Learning Representations,
2023b.
Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash
Vahdat, Mohammad Amin Nabian, Tao Ge, Akshay Subramaniam, Karthik Kashinath, Jan Kautz, and
Mike Pritchard. Residual corrective diffusion modeling for km-scale atmospheric downscaling, 2024. URL
https://arxiv.org/abs/2309.15214.
Frank J. Massey. The kolmogorov-smirnov test for goodness of fit. Journal of the American Statistical
Association, 46(253):68–78, 1951. ISSN 01621459, 1537274X. URL http://www.jstor.org/
stable/2280095.
Radford M. Neal. Slice sampling. The Annals of Statistics, 31(3):705 – 767, 2003. doi: 10.1214/aos/
1056562461. URL https://doi.org/10.1214/aos/1056562461.
Kushagra Pandey and Stephan Mandt. A complete recipe for diffusion generative models, 2023. URL
https://arxiv.org/abs/2303.01748.
Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. Diffusevae: Efficient, controllable
and high-fidelity generation from low-dimensional latents, 2022. URL https://arxiv.org/abs/
2201.00308.
Kushagra Pandey, Maja Rudolph, and Stephan Mandt. Efficient integrators for diffusion generative mod-
els. In The Twelfth International Conference on Learning Representations, 2024a. URL https:
//openreview.net/forum?id=qA4foxO5Gf.
Kushagra Pandey, Ruihan Yang, and Stephan Mandt. Fast samplers for inverse problems in iterative refinement
models, 2024b. URL https://arxiv.org/abs/2405.17673.
Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani,
Arash Vahdat, Shaoming Xu, Karthik Kashinath, and Michael Pritchard. Kilometer-scale convection
allowing model emulation using generative diffusion modeling, 2024. URL https://arxiv.org/
abs/2408.10958.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL
https://arxiv.org/abs/2307.01952.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2016. URL
https://arxiv.org/abs/1505.05770.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752.
Dario Shariatian, Umut Simsekli, and Alain Durmus. Denoising lévy probabilistic models, 2024. URL
https://arxiv.org/abs/2407.18609.
Raghav Singhal, Mark Goldstein, and Rajesh Ranganath. Where to diffuse, how to diffuse, and how to get
back: Automated learning for multivariate diffusions, 2023. URL https://arxiv.org/abs/2302.
07261.
John Skilling. The Eigenvalues of Mega-dimensional Matrices, pp. 455–466. Springer Netherlands, Dordrecht,
1989. ISBN 978-94-015-7860-8. doi: 10.1007/978-94-015-7860-8_48. URL https://doi.org/10.
1007/978-94-015-7860-8_48.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
PMLR, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022a. URL
https://arxiv.org/abs/2010.02502.
Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for
inverse problems. In International Conference on Learning Representations, 2022b.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. In International Conference on
Learning Representations, 2020.
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based
diffusion models, 2021. URL https://arxiv.org/abs/2101.09258.
Prakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Brether-
ton, and Stephan Mandt. Precipitation downscaling with spatiotemporal video diffusion. arXiv preprint
arXiv:2312.06071, 2023.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23
(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
Daniel S Wilks. Statistical methods in the atmospheric sciences, volume 100. Academic press, 2011.
Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi S. Jaakkola. Restart
sampling for improving generative processes. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023a. URL https://openreview.net/forum?id=wFuemocyHZ.
Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, and Tommi Jaakkola. Pfgm++:
Unlocking the potential of physics-inspired generative models. In International Conference on Machine
Learning, pp. 38566–38591. PMLR, 2023b.
Eunbi Yoon, Keehun Park, Sungwoong Kim, and Sungbin Lim. Score-based generative models with
lévy processes. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL
https://openreview.net/forum?id=0Wp3VHX0Gm.
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator, 2023.
URL https://arxiv.org/abs/2204.13902.
CONTENTS

1 Introduction
2 Background
 2.1 Diffusion Models
 2.2 Student-t Distributions
4 Experiments
 4.1 Unconditional Generation
 4.2 Conditional Generation
A Proofs
 A.1 Derivation of the Perturbation Kernel
 A.2 On the Posterior Parameterization
 A.3 Proof of Proposition 1
 A.4 Derivation of the simplified denoising loss
 A.5 Proof of Proposition 2
 A.6 Deriving the Denoiser Preconditioner for t-EDM
 A.7 Equivalence with the EDM ODE
 A.8 Conditional Vector Field for t-Flow
 A.9 Connection to Denoising Score Matching
 A.10 ODE Reformulation and Connections to Inverse Problems
 A.11 Connections to Robust Statistical Estimators
B Extension to Flows
C Experimental Setup
 C.1 Unconditional Modeling
  C.1.1 HRRR Dataset
  C.1.2 Baselines
  C.1.3 Evaluation
  C.1.4 Denoiser Architecture
  C.1.5 Training
  C.1.6 Sampling
  C.1.7 Extended Results on Unconditional Modeling
 C.2 Conditional Modeling
  C.2.1 HRRR Dataset for Conditional Modeling
  C.2.2 Baselines
  C.2.3 Denoiser Architecture
  C.2.4 Training
  C.2.5 Sampling
  C.2.6 Evaluation
  C.2.7 Extended Results on Conditional Modeling
 C.3 Toy Experiments
A PROOFS
A.1 DERIVATION OF THE PERTURBATION KERNEL
The perturbation kernel $q(x_t|x_0)$ for Student-t diffusions is parameterized as
$$
q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu). \tag{16}
$$
Using re-parameterization,
$$
x_t = \mu_t x_0 + \sigma_t\frac{\epsilon}{\sqrt{\kappa}}, \qquad \epsilon \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu)/\nu. \tag{17}
$$

A.2 ON THE POSTERIOR PARAMETERIZATION

During inference, given a noisy state $x_t$, we have the following estimation problem,
$$
x_t = \mu_t\,\mathbb{E}[x_0|x_t] + \sigma_t\,\mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\Big|x_t\Big]. \tag{18}
$$
Therefore, the task of denoising can be posed as either estimating $\mathbb{E}[x_0|x_t]$ or $\mathbb{E}\big[\frac{\epsilon}{\sqrt{\kappa}}\big|x_t\big]$. With this motivation,
the posterior $p_\theta(x_{t-\Delta t}|x_t)$ can be parameterized appropriately. Recall the form of the forward posterior
$$
q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar\mu_t,\ \frac{\nu + d_1}{\nu + d}\,\bar\sigma_t^2 I_d,\ \nu + d\Big), \tag{19}
$$
$$
\bar\mu_t = \mu_{t-\Delta t}\,x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,(x_t - \mu_t x_0), \qquad \bar\sigma_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2}, \tag{20}
$$
where $d_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|^2$. Further simplifying the mean $\bar\mu_t$,
$$
\bar\mu_t(x_t, x_0, t) = \mu_{t-\Delta t}\,x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,(x_t - \mu_t x_0) \tag{21}
$$
$$
= \mu_{t-\Delta t}\,x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,\mu_t x_0 \tag{22}
$$
$$
= \Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t. \tag{23}
$$
Therefore, the mean $\mu_\theta(x_t, t)$ of the reverse posterior $p_\theta(x_{t-\Delta t}|x_t)$ can be parameterized as
$$
\bar\mu_\theta(x_t, t) = \Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}[x_0|x_t] + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t \tag{24}
$$
$$
\approx \Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)D_\theta(x_t, \sigma_t) + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t, \tag{25}
$$
where $\mathbb{E}[x_0|x_t]$ is learned using a parametric estimator $D_\theta(x_t, \sigma_t)$. This corresponds to the $x_0$-prediction
parameterization presented in Eq. 6 in the main text. Alternatively, from Eqn. 18,
$$
\bar\mu_\theta(x_t, t) = \Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}[x_0|x_t] + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t \tag{26}
$$
$$
= \frac{1}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\Big(x_t - \sigma_t\,\mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\Big|x_t\Big]\Big) + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t \tag{27}
$$
$$
= \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t - \frac{\sigma_t}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\Big|x_t\Big] \tag{28}
$$
$$
\approx \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t - \frac{\sigma_t}{\mu_t}\Big(\mu_{t-\Delta t} - \mu_t\frac{\sigma_{21}^2(t)}{\sigma_t^2}\Big)\epsilon_\theta(x_t, \sigma_t), \tag{29}
$$
where $\mathbb{E}\big[\frac{\epsilon}{\sqrt{\kappa}}\big|x_t\big]$ is learned using a parametric estimator $\epsilon_\theta(x_t, \sigma_t)$. This corresponds to the $\epsilon$-prediction
parameterization (Ho et al., 2020).
A.3 PROOF OF PROPOSITION 1

The proof proceeds in two steps.

1. Firstly, we establish the following relation between the γ-power divergence and the KL divergence between two distributions $q$ and $p$:
$$
D_\gamma(q \,\|\, p) = D_{\mathrm{KL}}(q \,\|\, p) + O(\gamma). \tag{30}
$$
2. Secondly, for the choice $\gamma = -\frac{2}{\nu+d}$, we show that in the limit $\gamma \to 0$ the optimization objective in Eqn. 8 converges to the optimization objective in Eqn. 1.

Relation between $D_{\mathrm{KL}}(q \,\|\, p)$ and $D_\gamma(q \,\|\, p)$. The γ-power divergence as stated in Kim et al. (2024a)
assumes the following form:
$$
D_\gamma(q \,\|\, p) = \frac{1}{\gamma}\big[C_\gamma(q, p) - H_\gamma(q)\big], \tag{31}
$$
where $\gamma \in (-1, 0) \cup (0, \infty)$, and
$$
H_\gamma(p) = -\|p\|_{1+\gamma} = -\Big(\int p(x)^{1+\gamma}\,dx\Big)^{\frac{1}{1+\gamma}}, \qquad C_\gamma(q, p) = -\int q(x)\Big(\frac{p(x)}{\|p\|_{1+\gamma}}\Big)^{\gamma} dx. \tag{32}
$$
For the subsequent analysis, we assume $\gamma \to 0$. Under this assumption, we simplify $H_\gamma(q)$ as follows. By definition,
$$
H_\gamma(q) = -\|q\|_{1+\gamma} = -\Big(\int q(x)^{1+\gamma}\,dx\Big)^{\frac{1}{1+\gamma}} \tag{33}
$$
$$
= -\Big(\int q(x)\,q(x)^{\gamma}\,dx\Big)^{\frac{1}{1+\gamma}} \tag{34}
$$
$$
= -\Big(\int q(x)\exp\big(\gamma \log q(x)\big)\,dx\Big)^{\frac{1}{1+\gamma}} \tag{35}
$$
$$
= -\Big(\int q(x)\big[1 + \gamma \log q(x) + O(\gamma^2)\big]dx\Big)^{\frac{1}{1+\gamma}} \tag{36}
$$
$$
= -\Big(\int q(x)\,dx + \gamma \int q(x)\log q(x)\,dx + O(\gamma^2)\Big)^{\frac{1}{1+\gamma}} \tag{37}
$$
$$
= -\Big(1 + \gamma \int q(x)\log q(x)\,dx + O(\gamma^2)\Big)^{\frac{1}{1+\gamma}}. \tag{38}
$$
Using the approximation $(1+\delta x)^{\alpha} \approx 1 + \alpha\delta x$ for small $\delta$ in the above equation, we have
$$
H_\gamma(q) = -\|q\|_{1+\gamma} \approx -\Big(1 + \frac{\gamma}{1+\gamma}\int q(x)\log q(x)\,dx + O(\gamma^2)\Big) \tag{39}
$$
$$
\approx -\Big(1 + \gamma(1-\gamma)\int q(x)\log q(x)\,dx + O(\gamma^2)\Big), \tag{40}
$$
where we have used the approximation $\frac{1}{1+\gamma} \approx 1 - \gamma$, which is justified since $\gamma$ is assumed to be small enough.
Therefore, we have
$$
H_\gamma(q) = -\|q\|_{1+\gamma} \approx -\Big(1 + \gamma\int q(x)\log q(x)\,dx + O(\gamma^2)\Big). \tag{41}
$$
Similarly, we now obtain an approximation for the power cross-entropy. By definition,
$$
C_\gamma(q, p) = -\int q(x)\Big(\frac{p(x)}{\|p\|_{1+\gamma}}\Big)^{\gamma} dx \tag{42}
$$
$$
= -\int q(x)\Big(1 + \gamma \log\frac{p(x)}{\|p\|_{1+\gamma}} + O(\gamma^2)\Big)dx \tag{43}
$$
$$
= -\Big(\int q(x)\,dx + \gamma\int q(x)\log\frac{p(x)}{\|p\|_{1+\gamma}}\,dx + O(\gamma^2)\Big) \tag{44}
$$
$$
= -\Big(1 + \gamma\int q(x)\log p(x)\,dx - \gamma\int q(x)\log\|p\|_{1+\gamma}\,dx + O(\gamma^2)\Big). \tag{45}
$$
By an expansion analogous to Eqns. 33-41 applied to $p$, $\|p\|_{1+\gamma} = 1 + \gamma\int p(x)\log p(x)\,dx + O(\gamma^2)$. Therefore,
$$
\log\|p\|_{1+\gamma} = \log\Big(1 + \gamma\int p(x)\log p(x)\,dx + O(\gamma^2)\Big) \tag{47}
$$
$$
\approx \gamma\int p(x)\log p(x)\,dx, \tag{48}
$$
where the above result follows from the logarithmic series and ignores terms of order $O(\gamma^2)$ or higher.
Plugging the approximation in Eqn. 48 into Eqn. 45, we have
$$
C_\gamma(q, p) = -\Big(1 + \gamma\int q(x)\log p(x)\,dx - \gamma\int q(x)\log\|p\|_{1+\gamma}\,dx + O(\gamma^2)\Big) \tag{49}
$$
$$
\approx -\Big(1 + \gamma\int q(x)\log p(x)\,dx + O(\gamma^2)\Big). \tag{50}
$$
Therefore,
$$
D_\gamma(q \,\|\, p) = \frac{1}{\gamma}\big[C_\gamma(q, p) - H_\gamma(q)\big] \tag{51}
$$
$$
= \frac{1}{\gamma}\Big(\gamma\int q(x)\log q(x)\,dx - \gamma\int q(x)\log p(x)\,dx + O(\gamma^2)\Big) \tag{52}
$$
$$
= \int q(x)\log\frac{q(x)}{p(x)}\,dx + O(\gamma) \tag{53}
$$
$$
= D_{\mathrm{KL}}(q \,\|\, p) + O(\gamma). \tag{54}
$$
This establishes the relationship between the KL and γ-power divergences between two distributions: the difference
in magnitude between $D_{\mathrm{KL}}(q \,\|\, p)$ and $D_\gamma(q \,\|\, p)$ is of order $O(\gamma)$, and in the limit $\gamma \to 0$,
$D_\gamma(q \,\|\, p) \to D_{\mathrm{KL}}(q \,\|\, p)$. This concludes the first part of the proof.
Equivalence between the objectives under $\gamma \to 0$. For $\gamma = -\frac{2}{\nu+d}$ and a finite-dimensional dataset
with $x_0 \in \mathbb{R}^d$, it follows that $\gamma \to 0$ implies $\nu \to \infty$. Moreover, in the limit $\nu \to \infty$, the multivariate
Student-t distribution converges to a Gaussian distribution. As shown in the previous part, under this
limit $D_\gamma(q \,\|\, p)$ converges to $D_{\mathrm{KL}}(q \,\|\, p)$. Therefore, under this limit, the optimization objective in Eqn. 8
converges to the standard DDPM objective in Eqn. 1. This completes the proof.
A.4 DERIVATION OF THE SIMPLIFIED DENOISING LOSS

Here, we derive the simplified denoising loss presented in Eq. 9 in the main text. We specifically consider
the term $D_\gamma(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t))$ in Eq. 8. The γ-power divergence between two Student-t
distributions is given by
$$
D_\gamma[q_\nu \,\|\, p_\nu] = -\frac{1}{\gamma}\,C_{\nu,d}^{\frac{\gamma}{1+\gamma}}\Big(1 + \frac{d}{\nu-2}\Big)^{-\frac{\gamma}{1+\gamma}}\Big[-|\Sigma_0|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{d}{\nu-2}\Big) + |\Sigma_1|^{-\frac{\gamma}{2(1+\gamma)}}\Big(1 + \frac{1}{\nu-2}\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + \frac{1}{\nu}(\mu_0-\mu_1)^\top\Sigma_1^{-1}(\mu_0-\mu_1)\Big)\Big], \tag{55}
$$
where $\gamma = -\frac{2}{\nu+d}$. Recall the definitions of the forward denoising posterior,
$$
q(x_{t-\Delta t}|x_t, x_0) = t_d\Big(\bar\mu_t,\ \frac{\nu + d_1}{\nu + d}\,\bar\sigma_t^2 I_d,\ \nu + d\Big), \tag{56}
$$
$$
\bar\mu_t = \mu_{t-\Delta t}\,x_0 + \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,(x_t - \mu_t x_0), \qquad \bar\sigma_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2}, \tag{57}
$$
and the reverse denoising posterior,
$$
p_\theta(x_{t-\Delta t}|x_t) = t_d\big(\mu_\theta(x_t, t),\ \bar\sigma_t^2 I_d,\ \nu + d\big), \tag{58}
$$
where the denoiser mean $\mu_\theta(x_t, t)$ is further parameterized as
$$
\mu_\theta(x_t, t) = \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,x_t + \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,\mu_t\Big] D_\theta(x_t, \sigma_t). \tag{59}
$$
Since we only parameterize the mean of the reverse posterior, the majority of the terms in the γ-power
divergence are independent of $\theta$ and can be ignored (or treated as scalar coefficients). Therefore,
$$
D_\gamma\big(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t)\big) \propto \big(\bar\mu_t - \mu_\theta(x_t, t)\big)^\top \big(\bar\mu_t - \mu_\theta(x_t, t)\big) \tag{60}
$$
$$
\propto \|\bar\mu_t - \mu_\theta(x_t, t)\|_2^2 \tag{61}
$$
$$
\propto \Big[\mu_{t-\Delta t} - \frac{\sigma_{21}^2(t)}{\sigma_t^2}\,\mu_t\Big]^2 \|x_0 - D_\theta(x_t, \sigma_t)\|_2^2. \tag{62}
$$
For better sample quality, it is common in prior works (Ho et al., 2020; Song et al., 2020) to drop such scalar
multiples. Therefore, ignoring the time-dependent scalar multiple,
$$
D_\gamma\big(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t)\big) \propto \|x_0 - D_\theta(x_t, \sigma_t)\|_2^2. \tag{63}
$$
Therefore, the final loss function $\mathcal{L}(\theta)$ can be stated as
$$
\mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim p(x_0)}\,\mathbb{E}_{t \sim p(t)}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}\,\mathbb{E}_{\kappa \sim \frac{1}{\nu}\chi^2(\nu)} \Big\|D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}},\ \sigma_t\Big) - x_0\Big\|_2^2. \tag{64}
$$
A.5 PROOF OF PROPOSITION 2

Proof. We start by writing a single sampling step from our learned posterior distribution. Recall
$$
p_\theta(x_{t-\Delta t}|x_t) = t_d\big(\mu_\theta(x_t, t),\ \bar\sigma_t^2 I_d,\ \nu + d\big), \tag{67}
$$
where (using the $\epsilon$-prediction parameterization in App. A.2)
$$
\mu_\theta(x_t, t) = \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \frac{1}{\sigma_t}\Big[\sigma_{21}^2(t) - \frac{\mu_{t-\Delta t}}{\mu_t}\,\sigma_t^2\Big]\epsilon_\theta(x_t, \sigma_t). \tag{68}
$$
From re-parameterization, we have
$$
x_{t-\Delta t} = \mu_\theta(x_t, t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z, \qquad z \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu+d)/(\nu+d), \tag{69}
$$
$$
x_{t-\Delta t} = \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \frac{1}{\sigma_t}\Big[\sigma_{21}^2(t) - \frac{\mu_{t-\Delta t}}{\mu_t}\,\sigma_t^2\Big]\epsilon_\theta(x_t, \sigma_t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z. \tag{70}
$$
Moreover, we choose the posterior scale to be the same as that of the forward posterior $q(x_{t-\Delta t}|x_t, x_0)$, i.e.,
$$
\bar\sigma_t^2 = \sigma_{t-\Delta t}^2 - \frac{\sigma_{21}^2(t)\,\sigma_{12}^2(t)}{\sigma_t^2}. \tag{71}
$$
This implies
$$
\sigma_{21}^2(t) = \frac{\sigma_t^2}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big). \tag{72}
$$
Substituting this form of $\sigma_{21}^2(t)$ into Eqn. 70, we have
$$
x_{t-\Delta t} = \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \frac{1}{\sigma_t}\Big[\frac{\sigma_t^2}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big) - \frac{\mu_{t-\Delta t}}{\mu_t}\,\sigma_t^2\Big]\epsilon_\theta(x_t, \sigma_t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z \tag{73}
$$
$$
= \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \sigma_t\Big[\frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big) - \frac{\mu_{t-\Delta t}}{\mu_t}\Big]\epsilon_\theta(x_t, \sigma_t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z \tag{74}
$$
$$
= \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \sigma_t\Big[\frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big) - 1 + 1 - \frac{\mu_{t-\Delta t}}{\mu_t}\Big]\epsilon_\theta(x_t, \sigma_t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z \tag{75}
$$
$$
= \frac{\mu_{t-\Delta t}}{\mu_t}\,x_t + \sigma_t\Big[\frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big) - 1 + \frac{\dot\mu_t}{\mu_t}\Delta t\Big]\epsilon_\theta(x_t, \sigma_t) + \frac{\bar\sigma_t}{\sqrt{\kappa}}\,z, \tag{76}
$$
where in the above equation we use the first-order approximation $\dot\mu_t = \frac{\mu_t - \mu_{t-\Delta t}}{\Delta t}$. Next, we make the
following design choices:

1. Firstly, we assume the following form of the reverse posterior variance $\bar\sigma_t^2$:
$$
\bar\sigma_t^2 = \beta(t)\,g(\sigma_t, \dot\sigma_t)\,\Delta t, \tag{77}
$$
where $g: \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ and $\beta(t) \in \mathbb{R}^+$ represents a time-varying scaling factor, chosen empirically, which can be used to vary the noise injected at each sampling step. It is worth noting that a positive $\dot\sigma_t$ (as indicated in the definition of $g$) is a consequence of a monotonically increasing noise schedule $\sigma_t$ in diffusion model design.
2. Secondly, we make the following design choice:
$$
\frac{1}{\sigma_{12}^2(t)}\big(\sigma_{t-\Delta t}^2 - \bar\sigma_t^2\big) - 1 = f(\sigma_t, \dot\sigma_t)\,\Delta t, \tag{78}
$$
where $f: \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$.

This concludes the proof. It is worth noting that the above equations provide two degrees of freedom, i.e.,
we can choose two variables among $\sigma_{12}^2(t)$, $g$, and $f$ and automatically determine the third. However, it is more
convenient to choose $\sigma_{12}^2(t)$ and $g$, since both of these quantities should be positive. Different choices of these
quantities yield different instantiations of the SDE in Eqn. 84, as illustrated in the main text.
A.6 DERIVING THE DENOISER PRECONDITIONER FOR T-EDM

We parameterize the denoiser as
$$
D_\theta(x, \sigma) = c_{\mathrm{skip}}(\sigma, \nu)\,x + c_{\mathrm{out}}(\sigma, \nu)\,F_\theta\big(c_{\mathrm{in}}(\sigma, \nu)\,x,\ c_{\mathrm{noise}}(\sigma)\big). \tag{88}
$$
Karras et al. (2022) highlight the following design choices, which we adopt directly.

The coefficient $c_{\mathrm{in}}$ can be derived by requiring the network inputs to have unit variance. Therefore,
$$
\operatorname{Var}_{x_0, n}\big[c_{\mathrm{in}}(\sigma)(x_0 + n)\big] = 1 \tag{89}
$$
$$
c_{\mathrm{in}}(\sigma, \nu)^2 \operatorname{Var}_{x_0, n}\big[x_0 + n\big] = 1 \tag{90}
$$
$$
c_{\mathrm{in}}(\sigma, \nu)^2 \Big(\sigma_{\mathrm{data}}^2 + \frac{\nu}{\nu-2}\,\sigma^2\Big) = 1 \tag{91}
$$
$$
c_{\mathrm{in}}(\sigma, \nu) = 1\Big/\sqrt{\frac{\nu}{\nu-2}\,\sigma^2 + \sigma_{\mathrm{data}}^2}. \tag{92}
$$
The coefficients $c_{\mathrm{skip}}$ and $c_{\mathrm{out}}$ can be derived by requiring the training target to have unit variance. Similar to
Karras et al. (2022), our training target can be specified as:
$$
F_{\mathrm{target}} = \frac{1}{c_{\mathrm{out}}(\sigma, \nu)}\big(x_0 - c_{\mathrm{skip}}(\sigma, \nu)(x_0 + n)\big) \tag{93}
$$
$$
\operatorname{Var}_{x_0, n}\big[F_{\mathrm{target}}(x_0, n; \sigma)\big] = 1 \tag{94}
$$
$$
\operatorname{Var}_{x_0, n}\Big[\frac{1}{c_{\mathrm{out}}(\sigma, \nu)}\big(x_0 - c_{\mathrm{skip}}(\sigma, \nu)(x_0 + n)\big)\Big] = 1 \tag{95}
$$
$$
\frac{1}{c_{\mathrm{out}}(\sigma, \nu)^2}\operatorname{Var}_{x_0, n}\big[x_0 - c_{\mathrm{skip}}(\sigma, \nu)(x_0 + n)\big] = 1 \tag{96}
$$
$$
c_{\mathrm{out}}(\sigma, \nu)^2 = \operatorname{Var}_{x_0, n}\big[x_0 - c_{\mathrm{skip}}(\sigma, \nu)(x_0 + n)\big] \tag{97}
$$
$$
c_{\mathrm{out}}(\sigma, \nu)^2 = \operatorname{Var}_{x_0, n}\big[\big(1 - c_{\mathrm{skip}}(\sigma, \nu)\big)x_0 - c_{\mathrm{skip}}(\sigma, \nu)\,n\big] \tag{98}
$$
$$
c_{\mathrm{out}}(\sigma, \nu)^2 = \big(1 - c_{\mathrm{skip}}(\sigma, \nu)\big)^2 \sigma_{\mathrm{data}}^2 + c_{\mathrm{skip}}(\sigma, \nu)^2\,\frac{\nu}{\nu-2}\,\sigma^2. \tag{99}
$$
A.7 EQUIVALENCE WITH THE EDM ODE

Similar to Karras et al. (2022), we start by deriving the optimal denoiser for the denoising loss function.
Moreover, we reformulate the perturbation kernel as $q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu) = t_d(s(t)x_0, s(t)^2\sigma(t)^2 I_d, \nu)$
by setting $\mu_t = s(t)$ and $\sigma_t = s(t)\sigma(t)$. The denoiser loss can then be specified as follows,
$$
\mathcal{L}(D, \sigma) = \mathbb{E}_{x_0 \sim p(x_0)}\,\mathbb{E}_{n \sim t_d(0, \sigma^2 I_d, \nu)}\big[\lambda(\sigma, \nu)\,\|D(s(t)[x_0 + n], \sigma) - x_0\|_2^2\big] \tag{102}
$$
$$
= \mathbb{E}_{x_0 \sim p(x_0)}\,\mathbb{E}_{x \sim t_d(s(t)x_0, s(t)^2\sigma^2 I_d, \nu)}\big[\lambda(\sigma, \nu)\,\|D(x, \sigma) - x_0\|_2^2\big] \tag{103}
$$
$$
= \mathbb{E}_{x_0 \sim p(x_0)}\Big[\int t_d\big(x; s(t)x_0, s(t)^2\sigma^2 I_d, \nu\big)\big[\lambda(\sigma, \nu)\,\|D(x, \sigma) - x_0\|_2^2\big]dx\Big] \tag{104}
$$
$$
= \frac{1}{N}\sum_i \int t_d\big(x; s(t)x_i, s(t)^2\sigma^2 I_d, \nu\big)\big[\lambda(\sigma, \nu)\,\|D(x, \sigma) - x_i\|_2^2\big]dx, \tag{105}
$$
where the last result follows from taking $p(x_0)$ to be the empirical data distribution. Thus, the optimal
denoiser can be obtained by setting $\nabla_D \mathcal{L}(D, \sigma) = 0$. Therefore,
$$
\nabla_D \mathcal{L}(D, \sigma) = 0 \tag{106}
$$
$$
\nabla_D \frac{1}{N}\sum_i \int t_d\big(x; s(t)x_i, s(t)^2\sigma^2 I_d, \nu\big)\big[\lambda(\sigma, \nu)\,\|D(x, \sigma) - x_i\|_2^2\big]dx = 0 \tag{107}
$$
$$
\frac{1}{N}\sum_i \int t_d\big(x; s(t)x_i, s(t)^2\sigma^2 I_d, \nu\big)\big[\lambda(\sigma, \nu)\,(D(x, \sigma) - x_i)\big]dx = 0 \tag{108}
$$
$$
\int \sum_i t_d\big(x; s(t)x_i, s(t)^2\sigma^2 I_d, \nu\big)\big(D(x, \sigma) - x_i\big)dx = 0. \tag{109}
$$
The optimal denoiser $D(x, \sigma)$ can be obtained from Eq. 109 as
$$
D(x, \sigma) = \frac{\sum_i t_d\big(x; s(t)x_i, s(t)^2\sigma^2 I_d, \nu\big)\,x_i}{\sum_j t_d\big(x; s(t)x_j, s(t)^2\sigma^2 I_d, \nu\big)}.
$$
A.8 CONDITIONAL VECTOR FIELD FOR T-FLOW

Here, we derive the conditional vector field for the t-Flow interpolant. Recall that the interpolant in t-Flow is
defined as
$$
x_t = t\,x_0 + (1-t)\,\frac{\epsilon}{\sqrt{\kappa}}, \qquad \epsilon \sim \mathcal{N}(0, I_d), \quad \kappa \sim \chi^2(\nu)/\nu. \tag{120}
$$
It follows that
$$
x_t = t\,\mathbb{E}[x_0|x_t] + (1-t)\,\mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\Big|x_t\Big]. \tag{121}
$$
Moreover, following Eq. 2.10 in Albergo et al. (2023b), the conditional vector field $b(x_t, t)$ can be defined as
$$
b(x_t, t) = \mathbb{E}[\dot x_t|x_t] = \mathbb{E}[x_0|x_t] - \mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\Big|x_t\Big]. \tag{122}
$$
From Eqns. 121 and 122, the conditional vector field can be simplified as
$$
b(x_t, t) = \frac{x_t - \mathbb{E}\big[\frac{\epsilon}{\sqrt{\kappa}}\big|x_t\big]}{t}. \tag{123}
$$
This concludes the proof.
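As a concrete (and hedged) illustration, a plain conditional-flow-matching regression onto the vector field in Eq. 122 with the Student-t base could be implemented as below; the paper's exact t-Flow objective and weighting may differ:

```python
import torch

def t_flow_loss(b_theta, x0: torch.Tensor, nu: float) -> torch.Tensor:
    """Regress b_theta(x_t, t) onto x0 - eps/sqrt(kappa) for x_t = t*x0 + (1-t)*eps/sqrt(kappa)."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))               # t ~ U(0, 1)
    eps = torch.randn_like(x0)
    kappa = torch.distributions.Chi2(nu).sample((x0.shape[0],)) / nu
    kappa = kappa.view(-1, *([1] * (x0.dim() - 1)))
    noise = eps / kappa.sqrt()                                         # Student-t base sample
    x_t = t * x0 + (1.0 - t) * noise
    target = x0 - noise                                                # conditional vector field (Eq. 122)
    return ((b_theta(x_t, t) - target) ** 2).mean()
```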
We start by defining the score for the perturbation kernel qpxt |x0 q. The pdf for the perturbation kernel
qpxt |x0 q “ td pµt x0 , σt2 Id , νq can be expressed as (ignoring the normalization constant):
„ ȷ ´pν`dq
1 ` ˘J ` ˘ 2
qpxt |x0 q 9 1` xt ´ µt x0 xt ´ µt x0 (124)
νσt2
26
Preprint.
Therefore,
„ ȷ
ν`d 1 2
∇xt log qpxt |x0 q “ ´ ∇xt log 1 ` }xt ´ µt x0 }2 (125)
2 νσt2
ˆ ˙
ν`d 1 2
“´ xt ´ µt x0 (126)
2 1 ` νσ1 2 }xt ´ µt x0 }22 νσt2
t
1
Denoting d1 “ σt2
}xt ´ µt x0 }22 for convenience, we have,
ˆ ˙ ˆ ˙
ν`d 1
∇xt log qpxt |x0 q “ ´ x t ´ µ x
t 0 (127)
ν ` d1 σt2
In Denoising Score Matching (DSM) (Vincent, 2011), the following objective is minimized,
« ff
LDSM “ Et„pptq,x0 „ppx0 q,xt „qpxt |x0 q γptq}∇xt log qpxt |x0 q ´ sθ pxt , tq}22 (128)
for some loss weighting function γptq. Parameterizing the score estimator sθ pxt , tq as:
ˆ ˙ ˆ ˙
ν`d 1
sθ pxt , tq “ ´ xt ´ µt Dθ pxt , σt q (129)
ν ` d1 σt2
With this parameterization of $s_\theta(x_t,t)$ and some choice of $\gamma(t)$, the DSM objective can be further simplified as follows,
\[
\begin{aligned}
\mathcal{L}_\text{DSM} &= \mathbb{E}_{x_0\sim p(x_0)}\,\mathbb{E}_t\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I),\,\kappa\sim\chi^2(\nu)/\nu}\left[\Big(\frac{\nu+d}{\nu+d'}\Big)^2\Big\|x_0 - D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}},\, \sigma_t\Big)\Big\|_2^2\right] && (130)\\
&= \mathbb{E}_{x_0\sim p(x_0)}\,\mathbb{E}_t\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I),\,\kappa\sim\chi^2(\nu)/\nu}\left[\lambda(x_t,\nu,t)\Big\|x_0 - D_\theta\Big(\mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}},\, \sigma_t\Big)\Big\|_2^2\right] && (131)
\end{aligned}
\]
where $\lambda(x_t,\nu,t) = \big(\frac{\nu+d}{\nu+d'}\big)^2$, which is equivalent to a scaled version of the simplified denoising loss in Eq. 9. This concludes the proof. As an additional caveat, the score parameterization in Eq. 129 depends on $d' = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|_2^2$, which can be approximated during inference as $d' \approx \frac{1}{\sigma_t^2}\|x_t - \mu_t D_\theta(x_t,\sigma_t)\|_2^2$.
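For concreteness, a minimal sketch of evaluating this weighted objective for a single example is given below, assuming a PyTorch setup; `denoiser` is a placeholder for $D_\theta$, and the scalar sampling of $\kappa$ is illustrative rather than our exact training code.

```python
# Sketch of the weighted denoising objective in Eq. 131 for one (x0, t) pair.
# mu_t and sigma_t follow the perturbation kernel q(x_t|x_0) = t_d(mu_t x_0,
# sigma_t^2 I_d, nu); `denoiser` is a placeholder network. Illustrative only.
import torch

def t_dsm_loss(denoiser, x0, mu_t, sigma_t, nu):
    d = x0.numel()
    eps = torch.randn_like(x0)                              # eps ~ N(0, I_d)
    kappa = torch.distributions.Chi2(nu).sample() / nu      # kappa ~ chi^2(nu)/nu
    xt = mu_t * x0 + sigma_t * eps / kappa.sqrt()
    d_prime = ((xt - mu_t * x0) ** 2).sum() / sigma_t ** 2  # d' = ||x_t - mu_t x0||^2 / sigma_t^2
    lam = ((nu + d) / (nu + d_prime)) ** 2                  # lambda(x_t, nu, t)
    return lam * ((denoiser(xt, sigma_t) - x0) ** 2).sum()
```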
Recall the ODE dynamics in terms of the denoiser can be specified as,
\[
\frac{dx_t}{dt} = \frac{\dot{\mu}_t}{\mu_t}\, x_t - \left[-\frac{\dot{\sigma}_t}{\sigma_t} + \frac{\dot{\mu}_t}{\mu_t}\right]\big(x_t - \mu_t D_\theta(x_t,\sigma_t)\big) \tag{132}
\]
From Eq. 129, we have,
\[
x_t - \mu_t D_\theta(x_t,\sigma_t) = -\sigma_t^2 \left(\frac{\nu+d'}{\nu+d}\right) \nabla_{x_t} \log p(x_t,t) \tag{133}
\]
where $d' = \frac{1}{\sigma_t^2}\|x_t - \mu_t x_0\|_2^2$. Substituting the above result in the ODE dynamics, we obtain the reformulated ODE in Eq. 13.
\[
\frac{dx_t}{dt} = \frac{\dot{\mu}_t}{\mu_t}\, x_t + \sigma_t^2 \left(\frac{\nu+d'}{\nu+d}\right)\left[\frac{\dot{\mu}_t}{\mu_t} - \frac{\dot{\sigma}_t}{\sigma_t}\right] \nabla_x \log p(x_t,t) \tag{134}
\]
Since the term $d'$ is data-dependent, we can estimate it during inference as $d' \approx d'_1 = \frac{1}{\sigma_t^2}\|x_t - \mu_t D_\theta(x_t,\sigma_t)\|_2^2$. Thus, the above ODE can be reformulated as,
\[
\frac{dx_t}{dt} = \frac{\dot{\mu}_t}{\mu_t}\, x_t + \sigma_t^2 \left(\frac{\nu+d'_1}{\nu+d}\right)\left[\frac{\dot{\mu}_t}{\mu_t} - \frac{\dot{\sigma}_t}{\sigma_t}\right] \nabla_x \log p(x_t,t) \tag{135}
\]
Tweedie’s Estimate and Inverse Problems. Given the perturbation kernel $q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu)$, we have,
\[
x_t = \mu_t x_0 + \sigma_t \frac{\epsilon}{\sqrt{\kappa}}, \qquad \epsilon \sim \mathcal{N}(0,I_d),\quad \kappa \sim \chi^2(\nu)/\nu \tag{136}
\]
It follows that,
\[
\begin{aligned}
x_t &= \mu_t\, \mathbb{E}[x_0|x_t] + \sigma_t\, \mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\,\Big|\, x_t\Big] && (137)\\
\mathbb{E}[x_0|x_t] &= \frac{1}{\mu_t}\left(x_t - \sigma_t\, \mathbb{E}\Big[\frac{\epsilon}{\sqrt{\kappa}}\,\Big|\, x_t\Big]\right) && (138)\\
&\approx \frac{1}{\mu_t}\left(x_t + \sigma_t^2 \Big(\frac{\nu+d'_1}{\nu+d}\Big) \nabla_x \log p(x_t,t)\right) && (139)
\end{aligned}
\]
which gives us an estimate of the predicted $x_0$ at any time $t$. Moreover, the framework can also be extended for solving inverse problems using diffusion models. More specifically, given a conditional signal $y$, the goal is to simulate the ODE,
\[
\begin{aligned}
\frac{dx_t}{dt} &= \frac{\dot{\mu}_t}{\mu_t}\, x_t + \sigma_t^2 \Big(\frac{\nu+d'_1}{\nu+d}\Big)\Big[\frac{\dot{\mu}_t}{\mu_t} - \frac{\dot{\sigma}_t}{\sigma_t}\Big] \nabla_x \log p(x_t|y) && (140)\\
&= \frac{\dot{\mu}_t}{\mu_t}\, x_t + \sigma_t^2 \Big(\frac{\nu+d'_1}{\nu+d}\Big)\Big[\frac{\dot{\mu}_t}{\mu_t} - \frac{\dot{\sigma}_t}{\sigma_t}\Big]\Big[w_t\, \nabla_x \log p(y|x_t) + \nabla_x \log p(x_t)\Big] && (141)
\end{aligned}
\]
where the above decomposition follows from $p(x_t|y) \propto p(y|x_t)^{w_t}\, p(x_t)$ and the weight $w_t$ represents the guidance weight of the distribution $p(y|x_t)$. The term $\log p(y|x_t)$ can now be approximated using existing posterior approximation methods in inverse problems (Chung et al., 2022; Song et al., 2022b; Mardani et al., 2023b; Pandey et al., 2024b).
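As an illustration of one such approximation, the sketch below follows a DPS-style strategy (Chung et al., 2022) under the assumption of a Gaussian measurement model $p(y|x_0) = \mathcal{N}(y; \mathcal{A}(x_0), \rho^2 I)$; the operator `A`, the noise level `rho`, and the Tweedie-estimate callback `x0_hat_fn` (Eq. 139) are illustrative placeholders, not part of our method.

```python
# Hedged sketch of one way to approximate grad_x log p(y|x_t) in Eq. 141:
# replace p(y|x_t) with p(y|x0_hat), where x0_hat is the Tweedie-style
# estimate of Eq. 139. The linear operator A and noise level rho are assumed.
import torch

def guidance_grad(xt, y, A, rho, x0_hat_fn):
    xt = xt.detach().requires_grad_(True)
    x0_hat = x0_hat_fn(xt)                       # Tweedie estimate E[x0|x_t] (Eq. 139)
    log_lik = -((y - A(x0_hat)) ** 2).sum() / (2 * rho ** 2)
    return torch.autograd.grad(log_lik, xt)[0]   # approx. grad_x log p(y|x_t)
```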
Here, we derive the expression for the gradient of the γ-power divergence between the forward and the reverse posteriors (denoted by $q$ and $p_\theta$, respectively, for notational convenience), i.e., $\nabla_\theta D_\gamma(q\,\|\,p_\theta)$. By the definition of the γ-power divergence,
\[
D_\gamma(q\,\|\,p_\theta) = \frac{1}{\gamma}\big[C_\gamma(q,p_\theta) - H_\gamma(q)\big], \qquad \gamma \in (-1,0)\cup(0,\infty) \tag{142}
\]
\[
H_\gamma(p) = -\|p\|_{1+\gamma} = -\left(\int p(x)^{1+\gamma}\, dx\right)^{\frac{1}{1+\gamma}}, \qquad C_\gamma(q,p) = -\int q(x)\left(\frac{p(x)}{\|p\|_{1+\gamma}}\right)^{\gamma} dx \tag{143}
\]
Therefore,
\[
\nabla_\theta D_\gamma(q\,\|\,p_\theta) = \frac{1}{\gamma}\, \nabla_\theta C_\gamma(q,p_\theta) \tag{144}
\]
\[
\begin{aligned}
\nabla_\theta C_\gamma(q,p_\theta) &= -\nabla_\theta \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma} dx && (145)\\
&= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \nabla_\theta\!\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right) dx && (146)\\
&= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \frac{\|p_\theta\|_{1+\gamma}\,\nabla_\theta p_\theta(x) - p_\theta(x)\,\nabla_\theta\|p_\theta\|_{1+\gamma}}{\|p_\theta\|_{1+\gamma}^2}\, dx && (147)
\end{aligned}
\]
\[
\|p_\theta\|_{1+\gamma} = \left(\int p_\theta(x)^{1+\gamma}\, dx\right)^{\frac{1}{1+\gamma}} \tag{148}
\]
\[
\begin{aligned}
\nabla_\theta \|p_\theta\|_{1+\gamma} &= \nabla_\theta \left(\int p_\theta(x)^{1+\gamma}\, dx\right)^{\frac{1}{1+\gamma}} && (149)\\
&= \frac{1}{1+\gamma}\left(\int p_\theta(x)^{1+\gamma}\, dx\right)^{-\frac{\gamma}{1+\gamma}} \nabla_\theta \int p_\theta(x)^{1+\gamma}\, dx && (150)\\
&= \frac{1}{1+\gamma}\left(\int p_\theta(x)^{1+\gamma}\, dx\right)^{-\frac{\gamma}{1+\gamma}} (1+\gamma)\int p_\theta(x)^{\gamma}\, \nabla_\theta p_\theta(x)\, dx && (151)\\
\nabla_\theta \|p_\theta\|_{1+\gamma} &= \left(\int p_\theta(x)^{1+\gamma}\, dx\right)^{-\frac{\gamma}{1+\gamma}} \int p_\theta(x)^{\gamma}\, \nabla_\theta p_\theta(x)\, dx && (152)\\
&= \frac{\|p_\theta\|_{1+\gamma}}{\int p_\theta(x)^{1+\gamma}\, dx} \int p_\theta(x)^{\gamma}\, \nabla_\theta p_\theta(x)\, dx && (153)\\
&= \frac{\|p_\theta\|_{1+\gamma}}{\int p_\theta(x)^{1+\gamma}\, dx} \int p_\theta(x)^{1+\gamma}\, \nabla_\theta \log p_\theta(x)\, dx && (154)\\
&= \|p_\theta\|_{1+\gamma} \int \underbrace{\frac{p_\theta(x)^{1+\gamma}}{\int p_\theta(x)^{1+\gamma}\, dx}}_{=\tilde{p}_\theta(x)} \nabla_\theta \log p_\theta(x)\, dx && (155)
\end{aligned}
\]
where we denote $\tilde{p}_\theta(x) = \frac{p_\theta(x)^{1+\gamma}}{\int p_\theta(x)^{1+\gamma}\, dx}$ for notational convenience. Substituting the above result in Eq. 147, we have,
Figure 5: Training and Sampling algorithms for t-Flow. Our proposed method requires minimal code updates (indicated with blue) over traditional Gaussian flow models and converges to the latter as $\nu \to \infty$.
\[
\begin{aligned}
\nabla_\theta C_\gamma(q,p_\theta) &= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \frac{\|p_\theta\|_{1+\gamma}\,\nabla_\theta p_\theta(x) - p_\theta(x)\,\nabla_\theta\|p_\theta\|_{1+\gamma}}{\|p_\theta\|_{1+\gamma}^2}\, dx && (157)\\
&= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \frac{\|p_\theta\|_{1+\gamma}\,\nabla_\theta p_\theta(x) - p_\theta(x)\,\|p_\theta\|_{1+\gamma}\, \mathbb{E}_{\tilde{p}_\theta(x)}[\nabla_\theta \log p_\theta(x)]}{\|p_\theta\|_{1+\gamma}^2}\, dx && (158)\\
&= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \frac{\nabla_\theta p_\theta(x) - p_\theta(x)\, \mathbb{E}_{\tilde{p}_\theta(x)}[\nabla_\theta \log p_\theta(x)]}{\|p_\theta\|_{1+\gamma}}\, dx && (159)\\
&= -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma-1} \frac{p_\theta(x)\,\nabla_\theta \log p_\theta(x) - p_\theta(x)\, \mathbb{E}_{\tilde{p}_\theta(x)}[\nabla_\theta \log p_\theta(x)]}{\|p_\theta\|_{1+\gamma}}\, dx && (160)
\end{aligned}
\]
\[
\nabla_\theta C_\gamma(q,p_\theta) = -\gamma \int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma}\Big(\nabla_\theta \log p_\theta(x) - \mathbb{E}_{\tilde{p}_\theta(x)}[\nabla_\theta \log p_\theta(x)]\Big)\, dx \tag{162}
\]
Plugging this result in Eq. 144, we have the following result,
\[
\nabla_\theta D_\gamma(q\,\|\,p_\theta) = \frac{1}{\gamma}\,\nabla_\theta C_\gamma(q,p_\theta) = -\int q(x)\left(\frac{p_\theta(x)}{\|p_\theta\|_{1+\gamma}}\right)^{\gamma}\Big(\nabla_\theta \log p_\theta(x) - \mathbb{E}_{\tilde{p}_\theta(x)}[\nabla_\theta \log p_\theta(x)]\Big)\, dx \tag{163}
\]
This completes the proof. Intuitively, the second term inside the integral in Eq. 163 ensures unbiasedness of the gradients. Therefore, the scalar coefficient $\gamma$ controls the weighting on the likelihood gradient and can be set accordingly to ignore or model outliers when modeling the data distribution.
B EXTENSION TO FLOWS
Here, we discuss an extension of our framework to flow matching models (Albergo et al., 2023a; Lipman et al., 2023) with a Student-t base distribution. More specifically, we define a straight-line flow of the form,
\[
x_t = t\, x_1 + (1-t)\, \frac{\epsilon}{\sqrt{\kappa}}, \qquad \epsilon \sim \mathcal{N}(0,I_d),\quad \kappa \sim \chi^2(\nu)/\nu \tag{164}
\]
where $x_1 \sim p(x_1)$. Intuitively, at a given time $t$, the flow defined in Eqn. 164 linearly interpolates between data and Student-t noise. Following Albergo et al. (2023a), the conditional vector field which induces this interpolant can be specified as (proof in App. A.8)
\[
\frac{dx_t}{dt} = b(x_t,t) = \frac{x_t - \mathbb{E}\big[\frac{\epsilon}{\sqrt{\kappa}}\,\big|\, x_t\big]}{t}. \tag{165}
\]
We estimate $\mathbb{E}\big[\frac{\epsilon}{\sqrt{\kappa}}\,\big|\, x_t\big]$ by minimizing the objective
\[
\mathcal{L}(\theta) = \mathbb{E}_{x_1\sim p(x_1)}\,\mathbb{E}_{t\sim U[0,1]}\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I_d)}\,\mathbb{E}_{\kappa\sim\chi^2(\nu)/\nu}\left[\Big\|\epsilon_\theta\Big(t x_1 + (1-t)\frac{\epsilon}{\sqrt{\kappa}},\, t\Big) - \frac{\epsilon}{\sqrt{\kappa}}\Big\|_2^2\right]. \tag{166}
\]
We refer to this flow setup as t-Flow. To generate samples from our model, we simulate the ODE in Eq. 165
using Heun’s solver. Figure 5 illustrates the ease of transitioning from a Gaussian flow to t-Flow. Similar to
Gaussian diffusion, transitioning to t-Flow requires very few lines of code change, making our method readily
compatible with existing implementations of flow models.
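A minimal PyTorch-style sketch of the t-Flow training step (Eq. 166) and of Heun-based simulation of the ODE in Eq. 165 is given below; `eps_net` is a placeholder network, and the uniform time grid with a small `t_min` cutoff is an illustrative choice rather than the exact discretization used in our experiments.

```python
# Sketch of t-Flow training (Eq. 166) and Heun sampling of the ODE in Eq. 165.
# eps_net(x, t) is a placeholder predicting eps/sqrt(kappa). Illustrative only.
import torch

def tflow_loss(eps_net, x1, nu):
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    eps = torch.randn_like(x1)
    kappa = torch.distributions.Chi2(nu).sample(torch.Size(x1.shape[:1])).view_as(t) / nu
    noise = eps / kappa.sqrt()                       # Student-t noise eps/sqrt(kappa)
    xt = t * x1 + (1 - t) * noise
    return ((eps_net(xt, t) - noise) ** 2).mean()

@torch.no_grad()
def tflow_sample(eps_net, shape, nu, n_steps=18, t_min=1e-3):
    b = lambda x, t: (x - eps_net(x, t)) / t         # conditional vector field (Eq. 165)
    eps = torch.randn(shape)
    kappa = torch.distributions.Chi2(nu).sample(torch.Size(shape[:1])) / nu
    kappa = kappa.view(-1, *([1] * (len(shape) - 1)))
    x = eps / kappa.sqrt()                           # initial latent ~ t_d(0, I_d, nu)
    ts = torch.linspace(t_min, 1.0, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        d0 = b(x, t0)                                # Heun's method (2nd order)
        x_pred = x + (t1 - t0) * d0
        x = x + (t1 - t0) * 0.5 * (d0 + b(x_pred, t1))
    return x
```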
C EXPERIMENTAL SETUP
C.1 UNCONDITIONAL MODELING
C.1.2 BASELINES
Baseline 1: EDM. For standard Gaussian diffusion models, we use the recently proposed EDM (Karras et al., 2022) model, which shows strong empirical performance on various image synthesis benchmarks and has also been employed in recent work in weather forecasting and downscaling (Pathak et al., 2024; Mardani et al., 2024). To summarize, EDM employs the following denoising loss during training,
\[
\mathcal{L}(\theta) \;\propto\; \mathbb{E}_{x_0\sim p(x_0)}\,\mathbb{E}_\sigma\,\mathbb{E}_{n\sim\mathcal{N}(0,\sigma^2 I_d)}\big[\lambda(\sigma)\,\|D_\theta(x_0+n,\sigma) - x_0\|_2^2\big] \tag{167}
\]
where the noise levels $\sigma$ are usually sampled from a LogNormal distribution, $p(\sigma) = \mathrm{LogNormal}(\pi_\text{mean}, \pi_\text{std})$.
Baseline 2: EDM + Inverse CDF Normalization (INC). It is commonplace to perform z-score normalization as a data pre-processing step during training. However, since heavy-tailed channels usually exhibit a high dynamic range, z-score normalization cannot fully compensate for this range, especially when working with diverse channels in downstream weather-modeling tasks. An alternative is to use a stronger normalization scheme like Inverse CDF normalization (INC).
Figure 6: Inverse CDF Normalization. Using Inverse CDF Normalization (INC) can help reduce channel dynamic range during training while providing accurate denormalization. (Top Panel) INC applied to the Vertically Integrated Liquid (VIL) channel in the HRRR dataset. (Bottom Panel) INC applied to the Vertical Wind Velocity (w20) channel in the HRRR dataset.
Fig. 6 illustrates the effect of performing normalization under this scheme. As can be observed, using such
a normalization scheme can greatly reduce the dynamic range of a given channel while offering reliable
denormalization. Moreover, since our normalization scheme only affects data preprocessing, we leave the
standard EDM model parameters unchanged for this baseline.
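The exact steps of the INC scheme are not spelled out in the text, so the following is a hedged sketch of one common implementation, assumed here purely for illustration: values are mapped through the empirical CDF fitted on training data and then through the Gaussian inverse CDF, with stored reference quantiles used to invert both steps for denormalization.

```python
# Hedged sketch of an assumed Inverse CDF (quantile) normalization; the exact
# recipe used for the INC baseline may differ from this illustration.
import numpy as np
from scipy.stats import norm

def fit_inc(train_values, n_quantiles=4096):
    probs = np.linspace(0, 1, n_quantiles)
    return probs, np.quantile(train_values, probs)    # reference quantiles

def inc_normalize(x, probs, quantiles, eps=1e-6):
    u = np.interp(x, quantiles, probs)                # empirical CDF
    return norm.ppf(np.clip(u, eps, 1 - eps))         # Gaussian inverse CDF

def inc_denormalize(z, probs, quantiles):
    return np.interp(norm.cdf(z), probs, quantiles)   # invert both steps
```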
Baseline 3: EDM + Per Channel-Preconditioning (PCP). Another alternative to account for extreme values
in the data (or high dynamic range) could be to instead add more heavy-tailed noise during training. This
can be controlled by modulating the πmean and πstd parameters based on the dynamic range of the channel
under consideration. Recall that these parameters control the domain of noise levels σ used during EDM
model training. In this work, we use a simple heuristic to modulate these parameters based on the normalized
channel dynamic range (denoted as d). More specifically, we set,
where RBF denotes the Radial Basis Function kernel with radius = 1.0, parameter $\beta$, and a magnitude scaling factor $\alpha$. We keep $\pi_\text{std} = 1.2$ fixed for all channels. Intuitively, this implies that a higher normalized dynamic range (near 1.0) corresponds to sampling the noise levels $\sigma$ from a more heavy-tailed distribution during training. This is natural since a signal with a larger magnitude requires a larger amount of noise to be destroyed during the forward diffusion process. In this work, we set $\alpha = 3.0$, $\beta = 2.0$, which yields $\pi_\text{mean}^\text{vil} = 1.8$ and $\pi_\text{mean}^\text{w20} = 0.453$ for the VIL and w20 channels, respectively. It is worth noting that, unlike the previous baseline, we use z-score normalization as a preprocessing step for this baseline. Lastly, we keep the other EDM parameters fixed during training and sampling.
Baseline 4: Gaussian Flow. Since we also extend our framework to flow matching models (Albergo et al., 2023a; Lipman et al., 2023), we additionally compare with a linear one-sided interpolant with a Gaussian base distribution. More specifically,
\[
x_t = t\, x_1 + (1-t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,I_d) \tag{169}
\]
Similar to t-Flow (Section B), we train the Gaussian flow with the following objective,
\[
\mathcal{L}(\theta) = \mathbb{E}_{x_1\sim p(x_1)}\,\mathbb{E}_{t\sim U[0,1]}\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I_d)}\Big[\big\|\epsilon_\theta(t x_1 + (1-t)\epsilon,\, t) - \epsilon\big\|_2^2\Big]. \tag{170}
\]
C.1.3 EVALUATION
Here, we describe our scoring protocol used in Tables 2 and 3 in more detail.
Kurtosis Ratio (KR). Intuitively, sample kurtosis characterizes the heavy-tailed behavior of a distribution and
represents the fourth-order moment. Higher kurtosis represents greater deviations from the central tendency,
such as from outliers in the data. In this work, given samples from the underlying train/test set, we generate
20k samples from our model. We then flatten all the samples and compute empirical kurtosis for both the
underlying samples from the train/test set (denoted as kdata ) and our model (denoted as ksim ). The Kurtosis
ratio is then computed as,
\[
\mathrm{KR} = \left|1 - \frac{k_\text{sim}}{k_\text{data}}\right| \tag{171}
\]
Lower values of this ratio imply a better estimation of the underlying sample kurtosis.
Skewness Ratio (SR). Intuitively, sample skewness represents the asymmetry of a tailed distribution and
represents the third-order moment. In this work, given samples from the underlying train/test set, we generate
20k samples from our model. We then flatten all the samples and compute empirical skewness for both the
underlying samples from the train/test set (denoted as sdata ) and our model (denoted as ssim ). The Skewness
ratio is then computed as,
\[
\mathrm{SR} = \left|1 - \frac{s_\text{sim}}{s_\text{data}}\right| \tag{172}
\]
Lower values of this ratio imply a better estimation of the underlying sample skewness.
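A short sketch of computing both ratios on flattened sample arrays is given below; the use of scipy's default (Fisher) kurtosis convention is an assumption about the exact kurtosis definition used.

```python
# Sketch of the Kurtosis Ratio (Eq. 171) and Skewness Ratio (Eq. 172) on
# flattened generated and reference (train/test) samples.
import numpy as np
from scipy.stats import kurtosis, skew

def kurtosis_ratio(generated, reference):
    return abs(1.0 - kurtosis(generated.ravel()) / kurtosis(reference.ravel()))

def skewness_ratio(generated, reference):
    return abs(1.0 - skew(generated.ravel()) / skew(reference.ravel()))
```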
Kolmogorov-Smirnov 2-Sample Test (KS). The KS (Massey, 1951) statistic measures the maximum
difference between the CDFs of two distributions. For heavy-tailed distributions, evaluating the KS statistic
at the tails could provide a useful measure of the efficacy of our model in estimating tails reliably. To evaluate
the KS statistic at the tails, similar to prior metrics, we generate 20k samples from our model. We then flatten
all samples in the generated and train/test sets, followed by retaining samples lying above the 99.9th percentile
(quantifying right tails/extreme region) or below the 0.1th percentile (quantifying the left tail/extreme region).
Lastly, we compute the KS statistic between the retained samples from the generated and the train/test sets
individually for each tail and average the KS statistic values for both tails to obtain an average KS score. The
final score estimates how well the model might capture both tails. As an exception, for the VIL channel, we report KS scores only for the right tail due to the absence of a left tail in the underlying samples for this channel (see Fig. 6 (first column) for 1-d intensity histograms for this channel). Lower values of the KS statistic imply better density assignment at the tails by the model.

Table 5: Comparison between design choices and specific hyperparameters between EDM (Karras et al., 2022) (+ related baselines) and t-EDM (Ours, Section 3.5) for unconditional modeling (Section 4.1). INC: Inverse CDF Normalization baseline, PCP: Per-Channel Preconditioning baseline, VIL: Vertically Integrated Liquid channel in the HRRR dataset, w20: Vertical Wind Velocity at level 20 channel in the HRRR dataset, NFE: Number of Function Evaluations.
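A sketch of the tail-KS protocol described above is given below; whether the percentile thresholds are computed per set or from the reference samples is an implementation assumption.

```python
# Sketch of the tail KS score: keep samples beyond the 99.9th / below the 0.1st
# percentile of each flattened set, run a two-sample KS test per tail, and
# average the two statistics (right tail only for channels without a left tail).
import numpy as np
from scipy.stats import ks_2samp

def tail_ks(generated, reference, q=0.1, right_tail_only=False):
    gen, ref = generated.ravel(), reference.ravel()
    right = ks_2samp(gen[gen >= np.percentile(gen, 100 - q)],
                     ref[ref >= np.percentile(ref, 100 - q)]).statistic
    if right_tail_only:                      # e.g., the VIL channel has no left tail
        return right
    left = ks_2samp(gen[gen <= np.percentile(gen, q)],
                    ref[ref <= np.percentile(ref, q)]).statistic
    return 0.5 * (left + right)
```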
Histogram Comparisons. As a qualitative metric, comparing 1-d intensity histograms between the generated
and the original samples from the train/test set can serve as a reliable proxy to assess the tail estimation
capabilities of all models.
C.1.5 TRAINING
We adopt the same training hyperparameters from Karras et al. (2022) for training all models. Model training
is distributed across 4 DGX nodes, each with 8 A100 GPUs, with a total batch size of 512. We train all
models for a maximum budget of 60Mimg and select the best-performing model in terms of qualitative 1-D
histogram comparisons.
C.1.6 SAMPLING
For the EDM and related baselines (INC, PCP), we use the ODE solver presented in Karras et al. (2022). For the t-EDM models, as presented in Section 3.5, our sampler is the same as in EDM, the only difference being that the initial latents are sampled from a Student-t distribution instead (see Fig. 2 (Right)). For Flow and t-Flow, we numerically simulate the ODE in Eq. 165 using the 2nd-order Heun's solver with the timestep discretization proposed in Karras et al. (2022). For evaluation, we generate 20k samples from each model.
We summarize our experimental setup in more detail for unconditional modeling in Table 5.
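The sketch below illustrates the single change on the sampling side, assuming a PyTorch setup: only the initial latent changes, while the Heun ODE steps remain those of EDM.

```python
# Sketch of the only sampler change needed for t-EDM: draw the initial latent
# from t_d(0, sigma_max^2 I_d, nu) instead of a Gaussian. Illustrative only.
import torch

def sample_initial_latent(shape, sigma_max, nu=None):
    eps = torch.randn(shape)
    if nu is None:                                   # Gaussian EDM prior
        return sigma_max * eps
    kappa = torch.distributions.Chi2(nu).sample(torch.Size(shape[:1])) / nu
    kappa = kappa.view(-1, *([1] * (len(shape) - 1)))
    return sigma_max * eps / kappa.sqrt()            # t_d(0, sigma_max^2 I_d, nu)
```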
C.2.2 BASELINES
We adopt the standard EDM (Karras et al., 2022) for conditional modeling as our baseline.
16. Additionally, our noisy state x is channel-wise concatenated with an 86-channel conditioning signal,
increasing the total number of input channels in the denoiser to 87. The number of output channels remains 1
since we are predicting only a single VIL/w20 channel. However, the increase in the number of parameters is
minimal since only the first convolutional layer in the denoiser is affected. Therefore, our denoiser is around
12M parameters. The rest of the hyperparameters remain unchanged from Karras et al. (2022).
C.2.4 TRAINING
We adopt the same training hyperparameters from Karras et al. (2022) for training all conditional models.
Model training is distributed across 4 DGX nodes, each with 8 A100 GPUs, with a total batch size of 512.
We train all models for a maximum budget of 60Mimg.
C.2.5 SAMPLING
For both EDM and t-EDM models, we use the ODE solver presented in Karras et al. (2022). For the t-EDM models, as presented in Section 3.5, our sampler is the same as in EDM, the only difference being that the initial latents are sampled from a Student-t distribution instead (see Fig. 2 (Right)). For a given input conditioning state,
we generate an ensemble of predictions of size 16 by randomly initializing our ODE solver with different
random seeds. All other sampling parameters remain unchanged from our unconditional modeling setup (see
App. C.1.6).
C.2.6 EVALUATION
Root Mean Square Error (RMSE). is a standard evaluation metric used to measure the difference between the predicted values and the true values (Chai & Draxler, 2014). In the context of our problem, let $x$ be the true target and $\hat{x}$ be the predicted value. The RMSE is defined as:
\[
\mathrm{RMSE} = \sqrt{\mathbb{E}\big[\|x - \hat{x}\|^2\big]}.
\]
This metric captures the average magnitude of the residuals, i.e., the difference between the predicted and true
values. A lower RMSE indicates better model performance, as it suggests the predicted values are closer to
the true values on average. RMSE is sensitive to large errors, making it an ideal choice for evaluating models
where minimizing large deviations is critical.
Continuous Ranked Probability Score (CRPS). is a measure used to evaluate probabilistic predictions (Wilks, 2011). It compares the entire predicted distribution $F(\hat{x})$ with the observed data point $x$. For a probabilistic forecast with cumulative distribution function (CDF) $F$, and the true value $x$, the CRPS can be formulated as follows:
\[
\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \big(F(y) - \mathbb{I}(y \ge x)\big)^2\, dy,
\]
where Ip¨q is the indicator function. Unlike RMSE, CRPS provides a more comprehensive evaluation of
both the location and spread of the predicted distribution. A lower CRPS indicates a better match between
the forecast distribution and the observed data. It is especially useful for probabilistic models that output a
distribution rather than a single-point prediction.
Spread-Skill Ratio (SSR). is used to assess over/under-dispersion in probabilistic forecasts. Spread measures
the uncertainty in the ensemble forecasts and can be represented by computing the standard deviation of
the ensemble members. Skill represents the accuracy of the mean of the ensemble forecasts and can be
represented by computing the RMSE between the ensemble mean and the observations.
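A sketch of computing these scores for one window from an ensemble of predictions is given below; the empirical ensemble estimator of CRPS is used here in place of the integral definition, which is a standard substitution but an assumption about the exact implementation.

```python
# Sketch of ensemble-based scoring for a single window: RMSE of the ensemble
# mean (skill), ensemble spread, their ratio (SSR), and an empirical CRPS.
import numpy as np

def ensemble_scores(ensemble, target):
    # ensemble: (n_members, ...) predictions; target: (...) observation
    mean = ensemble.mean(axis=0)
    rmse = np.sqrt(np.mean((mean - target) ** 2))                      # skill
    spread = np.mean(ensemble.std(axis=0, ddof=1))                     # spread
    term1 = np.mean(np.abs(ensemble - target[None]))
    term2 = 0.5 * np.mean(np.abs(ensemble[:, None] - ensemble[None, :]))
    crps = term1 - term2                                               # empirical CRPS
    return {"RMSE": rmse, "CRPS": crps, "SSR": spread / rmse}
```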
                      | EDM (Karras et al., 2022)                                        | t-EDM (Ours)
Preconditioner        |                                                                  |
  c_out               | σ · σ_data / √(σ² + σ_data²)                                     | σ · σ_data · √(ν/(ν−2)) / √((ν/(ν−2)) σ² + σ_data²)
  c_in                | 1 / √(σ² + σ_data²)                                              | 1 / √((ν/(ν−2)) σ² + σ_data²)
  c_noise             | (1/4) log σ                                                      | (1/4) log σ
  σ                   | log σ ∼ N(π_mean, π_std)                                         | log σ ∼ N(π_mean, π_std)
  s(t)                | 1                                                                | 1
Training              |                                                                  |
  λ(σ)                | 1 / c_out(σ)²                                                    | 1 / c_out(σ, ν)²
  Loss                | Eq. 2 in Karras et al. (2022)                                    | Eq. 12
Sampling              |                                                                  |
  Solver              | Heun's (2nd order)                                               | Heun's (2nd order)
  ODE                 | Eq. 11, x_T ∼ N(0, I_d)                                          | Eq. 11, x_T ∼ t_d(0, I_d, ν)
  Discretization      | σ_i = (σ_max^{1/ρ} + (i/(N−1))(σ_min^{1/ρ} − σ_max^{1/ρ}))^ρ     | same as EDM
  Scaling: s(t)       | 1                                                                | 1
  Schedule: σ(t)      | t                                                                | t
Hyperparameters       |                                                                  |
  σ_data              | 1.0                                                              | 1.0
  ν                   | ∞                                                                | x1: 20, x2 ∈ {4, 7, 10}
  π_mean, π_std       | −1.2, 1.2                                                        | −1.2, 1.2
  σ_max, σ_min        | 80, 0.002                                                        | 80, 0.002
  NFE                 | 18                                                               | 18
  ρ                   | 7                                                                | 7
Table 7: Comparison between design choices and specific hyperparameters between EDM (Karras et al., 2022) and
t-EDM (Ours, Section 3.5) for the Toy dataset analysis in Fig. 1. NFE: Number of Function Evaluations
Scoring Criterion. Since CRPS and SSR metrics are based on predicting ensemble forecasts for a given
input state, we predict an ensemble of size 16 for 4000 samples from the VIL/w20 test set. We then enumerate windows of size 16 × 16 across the spatial resolution of the generated sample (128 × 128). Since the VIL
channel is quite sparse, we filter out windows with a maximum value of less than a threshold (1.0 for VIL)
and compute the CRPS, SSR, and RMSE metrics for all remaining windows. As an additional caveat, we
note that while it is common to roll out trajectories for weather forecasting, in this work, we only predict the
target at the immediate next time step.
C.3 TOY EXPERIMENTS
Dataset. For the toy illustration in Fig. 1, we work with the Neal's Funnel dataset (Neal, 2003), which is commonly used in the MCMC literature (Brooks et al., 2011) due to its challenging geometry. The underlying generative process for Neal's funnel can be specified as follows:
\[
p(x_1, x_2) = \mathcal{N}(x_1; 0, 3)\, \mathcal{N}\big(x_2; 0, \exp(x_1/2)\big). \tag{173}
\]
For training, we randomly generate 1M samples from the generative process in Eq. 173 and perform z-score
normalization as a pre-processing step.
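A minimal sketch of generating the toy training data is shown below, assuming the second argument of $\mathcal{N}(\cdot;\cdot,\cdot)$ in Eq. 173 denotes the standard deviation; this convention is an assumption about the exact parameterization.

```python
# Sketch of sampling from the generative process in Eq. 173 followed by z-score
# normalization; std-parameterization of the Gaussian is assumed.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x1 = rng.normal(0.0, 3.0, size=n)
x2 = rng.normal(0.0, np.exp(x1 / 2.0))                 # scale depends on x1
data = np.stack([x1, x2], axis=1)
data = (data - data.mean(axis=0)) / data.std(axis=0)   # z-score normalization
```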
Baselines and Models. For the standard Gaussian diffusion model baseline (2nd column in Fig. 1), we use
EDM with standard hyperparameters as presented in Karras et al. (2022). Correspondingly, for the heavy-tailed diffusion models (columns 3-5 in Fig. 1), we use the t-EDM instantiation of our framework as presented in Section 3.5. Since the hyperparameter $\nu$ is key in our framework, we tune $\nu$ for each individual dimension in our toy experiments. We fix $\nu$ to 20 for the $x_1$ dimension and vary it over $\nu \in \{4, 7, 10\}$ to illustrate controllable tail estimation along the $x_2$ dimension.
Denoiser Architecture. For modeling the underlying denoiser $D_\theta(x, \sigma)$, we use a simple MLP for all toy
models. At the input, we concatenate the 2-dimensional noisy state vector x with the noise level σ. We use
two hidden layers of size 64, followed by a linear output layer. This results in around 8.5k parameters for the
denoiser. We share the same denoiser architecture across all toy models.
Training. We optimize the training objective in Eq. 12 for both t-EDM and EDM (See Fig. 2 (Left)). Our
training hyperparameters are the same as proposed in Karras et al. (2022). We train all toy models for a fixed
duration of 30M samples and choose the last checkpoint for evaluation.
Sampling. For the EDM baseline, we use the ODE solver presented in Karras et al. (2022). For the t-EDM models, as presented in Section 3.5, our sampler is the same as in EDM, the only difference being that the initial latents are sampled from a Student-t distribution instead (see Fig. 2 (Right)). For visualization of the generated
samples in Fig. 1, we generate 1M samples for each model.
Overall, our experimental setup for the toy dataset analysis in Fig. 1 is summarized in Table 7.
In this section, we discuss a strategy for choosing the parameter $\sigma_\text{max}$ (denoted by $\sigma$ in this section for notational brevity) in a more principled manner than in EDM (Karras et al., 2022). More specifically, our approach directly estimates $\sigma$ from the empirically observed samples, which circumvents ad-hoc choices of this parameter that can affect downstream sampler performance.
The main idea behind our approach is to minimize the statistical mutual information between datapoints from the underlying data distribution, $x_0 \sim p_\text{data}$, and their noisy counterparts $x_\sigma \sim p(x_\sigma)$. While a trivial (and impractical) way to achieve this objective would be to set a large enough $\sigma$, i.e., $\sigma \to \infty$, we instead minimize the mutual information $I(x_0, x_\sigma)$ while ensuring the magnitude of $\sigma$ is as small as possible. Formally, our objective can be defined as,
\[
\min_{\sigma^2}\; \sigma^2 \quad \text{subject to} \quad I(x_0, x_\sigma) = 0 \tag{174}
\]
As we will discuss later, minimizing this constrained objective provides a more principled way to obtain $\sigma$ from the underlying data statistics for a specific level of mutual information desired by the user. Next, we first simplify the form of $I(x_0, x_\sigma)$, followed by a discussion on the estimation of $\sigma$ in the context of EDM and t-EDM. We also extend to the case of non-i.i.d. noise.
Simplification of $I(x_0, x_\sigma)$. The mutual information $I(x_0, x_\sigma)$ can be stated and simplified as follows,
\[
\begin{aligned}
I(x_0, x_\sigma) &= D_\mathrm{KL}\big(p(x_0, x_\sigma)\,\|\, p(x_0)\,p(x_\sigma)\big) && (175)\\
&= \int p(x_0, x_\sigma) \log \frac{p(x_0, x_\sigma)}{p(x_0)\,p(x_\sigma)}\, dx_0\, dx_\sigma && (176)\\
&= \int p(x_0, x_\sigma) \log \frac{p(x_\sigma|x_0)}{p(x_\sigma)}\, dx_0\, dx_\sigma && (177)\\
&= \int p(x_\sigma|x_0)\, p(x_0) \log \frac{p(x_\sigma|x_0)}{p(x_\sigma)}\, dx_0\, dx_\sigma && (178)
\end{aligned}
\]
\[
\begin{aligned}
&= \mathbb{E}_{x_0\sim p(x_0)}\left[\int p(x_\sigma|x_0) \log \frac{p(x_\sigma|x_0)}{p(x_\sigma)}\, dx_\sigma\right] && (179)\\
&= \mathbb{E}_{x_0\sim p(x_0)}\Big[D_\mathrm{KL}\big(p(x_\sigma|x_0)\,\|\, p(x_\sigma)\big)\Big] && (180)
\end{aligned}
\]
Given the simplification in Eqn. 180, the optimization problem in Eqn. 174 reduces to the following,
\[
\min_{\sigma^2}\; \sigma^2 \quad \text{subject to} \quad \mathbb{E}_{x_0\sim p(x_0)}\Big[D_\mathrm{KL}\big(p(x_\sigma|x_0)\,\|\, p(x_\sigma)\big)\Big] = 0 \tag{181}
\]
Since at $\sigma_\text{max}$ we expect the marginal distribution $p(x_\sigma)$ to converge to the generative prior (i.e., to completely destroy the structure of the data), we approximate $p(x_\sigma) \approx \mathcal{N}(0, \sigma^2 I_d)$. With this simplification, the Lagrangian for the optimization problem in Eqn. 174 can be specified as,
\[
\sigma_*^2 = \arg\min_{\sigma^2}\; \sigma^2 + \lambda\, \mathbb{E}_{x_0\sim p(x_0)}\Big[D_\mathrm{KL}\big(\mathcal{N}(x_\sigma; x_0, \sigma^2 I_d)\,\|\, \mathcal{N}(x_\sigma; 0, \sigma^2 I_d)\big)\Big] \tag{182}
\]
Setting the gradient w.r.t. $\sigma^2$ to 0,
\[
1 - \frac{\lambda}{\sigma^4}\, \mathbb{E}_{x_0}[x_0^\top x_0] = 0 \tag{183}
\]
which implies,
\[
\sigma^2 = \sqrt{\lambda\, \mathbb{E}_{x_0}[x_0^\top x_0]} \tag{184}
\]
For an empirical dataset,
\[
\sigma^2 = \sqrt{\frac{\lambda}{N} \sum_{i=1}^{N} \|x_i\|_2^2} \tag{185}
\]
This allows us to choose a $\sigma_\text{max}$ from the underlying data statistics during training or sampling. It is worth noting that the choice of the multiplier $\lambda$ impacts the magnitude of $\sigma_\text{max}$. However, this parameter can be chosen in a principled manner. At $\sigma_\text{max}^2 = \sigma_*^2$, the estimate of the mutual information is given by:
\[
I_*(x_0, x_\sigma) = \frac{1}{\sigma_*^2}\, \mathbb{E}_{x_0}[x_0^\top x_0] = \sqrt{\frac{\mathbb{E}_{x_0}[x_0^\top x_0]}{\lambda}} \tag{186}
\]
which implies,
\[
\lambda = \frac{\mathbb{E}_{x_0}[x_0^\top x_0]}{I_*(x_0, x_\sigma)^2} \tag{187}
\]
The above result provides a way to choose $\lambda$. Given the dataset statistics, $\mathbb{E}_{x_0}[x_0^\top x_0]$, the user can specify an acceptable level of mutual information $I(x_0, x_\sigma)$ to compute the corresponding $\lambda$, which can then be used to find the corresponding minimum $\sigma_\text{max}$ required to achieve that level of mutual information. Next, we extend this analysis to t-EDM.
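Before moving on, a short sketch of this Gaussian-case recipe (Eqs. 185-187) is given below; the toy data array and the target mutual-information level are illustrative.

```python
# Sketch: given an acceptable mutual-information level I_star, set lambda from
# the data statistics (Eq. 187) and read off sigma_max via Eqs. 184-185.
import numpy as np

def sigma_max_from_data(x, i_star):
    # x: (N, d) array of flattened training samples
    mean_sq_norm = np.mean(np.sum(x ** 2, axis=1))   # empirical E[x0^T x0]
    lam = mean_sq_norm / i_star ** 2                 # Eq. 187
    sigma_sq = np.sqrt(lam * mean_sq_norm)           # Eqs. 184-185
    return np.sqrt(sigma_sq)

x = np.random.default_rng(0).standard_normal((10_000, 64))
print(sigma_max_from_data(x, i_star=0.1))
```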
where $D_\gamma(q\,\|\,p)$ is the γ-power divergence between two distributions $q$ and $p$. From the definition of the γ-power divergence in Eqn. 7, we have,
\[
D_\gamma\big(t_d(x_0, \sigma^2 I_d, \nu)\,\|\, t_d(0, \sigma^2 I_d, \nu)\big) = -\underbrace{\frac{1}{\nu\gamma}\, C_{\nu,d}^{\frac{\gamma}{1+\gamma}} \Big(1 + \frac{\gamma d}{\nu-2}\Big)^{-\frac{1}{1+\gamma}}}_{=f(\nu,d)} (\sigma^2)^{-\frac{\gamma d}{2(1+\gamma)} - 1}\, \mathbb{E}_{x_0}[x_0^\top x_0] \tag{189}
\]
where $C_{\nu,d} = \frac{\Gamma\!\left(\frac{\nu+d}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{d/2}}$ and $\gamma = -\frac{2}{\nu+d}$. Solving the optimization problem yields the following optimal $\sigma$,
\[
\sigma_*^2 = \left[\lambda\, f(\nu,d)\, \Big(\frac{\nu-2}{\nu-2+d}\Big)\, \mathbb{E}[x_0^\top x_0]\right]^{\frac{\nu-2+d}{2(\nu-2)+d}} \tag{190}
\]
For an empirical dataset, we have the following simplification,
\[
\sigma_*^2 = \left[\lambda\, f(\nu,d)\, \Big(\frac{\nu-2}{\nu-2+d}\Big)\, \frac{1}{N}\sum_i \|x_i\|_2^2\right]^{\frac{\nu-2+d}{2(\nu-2)+d}} \tag{191}
\]
We now extend our formulation for optimal noise schedule design to the case of correlated noise in the diffusion perturbation kernel. This is useful, especially for scientific applications where the data energy is distributed quite non-uniformly across the (data) spectrum. Let $R = \mathbb{E}_{x_0\sim p(x_0)}[x_0 x_0^\top] \in \mathbb{R}^{d\times d}$ denote the data correlation matrix. Let us also consider the perturbation kernel $\mathcal{N}(0, \Sigma)$ for the positive-definite covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$. Following the steps in equation 182, the Lagrangian for noise covariance estimation can be formulated as follows:
\[
\begin{aligned}
&\min_{\Sigma}\; \mathrm{trace}(\Sigma) + \lambda\, \mathbb{E}_{x_0\sim p(x_0)}\Big[D_\mathrm{KL}\big(\mathcal{N}(x_0, \Sigma)\,\|\, \mathcal{N}(0, \Sigma)\big)\Big] && (192)\\
&\min_{\Sigma}\; \mathrm{trace}(\Sigma) + \lambda\, \mathbb{E}_{x_0\sim p(x_0)}\big[x_0^\top \Sigma^{-1} x_0\big] && (193)\\
&\min_{\Sigma}\; \mathrm{trace}(\Sigma) + \lambda\, \mathbb{E}_{x_0\sim p(x_0)}\big[\mathrm{trace}(\Sigma^{-1} x_0 x_0^\top)\big] && (194)\\
&\min_{\Sigma}\; \mathrm{trace}(\Sigma) + \lambda\, \mathrm{trace}\big(\Sigma^{-1}\, \mathbb{E}_{x_0\sim p(x_0)}[x_0 x_0^\top]\big) && (195)\\
&\min_{\Sigma}\; \mathrm{trace}(\Sigma) + \lambda\, \mathrm{trace}\big(\Sigma^{-1} R\big) && (196)
\end{aligned}
\]
It can be shown that the optimal solution to this minimization problem is given by $\Sigma_* = \sqrt{\lambda}\, R^{1/2}$, where $R^{1/2}$ denotes the matrix square root of $R$. This implies that the noise energy must be distributed along the singular vectors of the data correlation matrix, with the noise energy in each direction proportional to the square root of the corresponding singular value of $R$. We include the proof below.
Proof. We define the objective:
\[
f(\Sigma) = \mathrm{trace}(\Sigma) + \lambda\, \mathrm{trace}(\Sigma^{-1} R). \tag{197}
\]
We compute the gradient of $f(\Sigma)$ with respect to $\Sigma$:
\[
\nabla_\Sigma f(\Sigma) = \frac{\partial}{\partial\Sigma}\, \mathrm{trace}(\Sigma) + \lambda\, \frac{\partial}{\partial\Sigma}\, \mathrm{trace}(\Sigma^{-1} R). \tag{198}
\]
The gradient of the first term is straightforward:
\[
\frac{\partial}{\partial\Sigma}\, \mathrm{trace}(\Sigma) = I. \tag{199}
\]
For the second term, using the matrix calculus identity, the gradient is:
\[
\frac{\partial}{\partial\Sigma}\big(\lambda\, \mathrm{trace}(\Sigma^{-1} R)\big) = -\lambda\, \Sigma^{-1} R\, \Sigma^{-1}. \tag{200}
\]
Combining these results, the total gradient is:
\[
\nabla_\Sigma f(\Sigma) = I - \lambda\, \Sigma^{-1} R\, \Sigma^{-1}. \tag{201}
\]
Setting the gradient to zero to find the critical point:
\[
I - \lambda\, \Sigma^{-1} R\, \Sigma^{-1} = 0. \tag{202}
\]
\[
\Sigma^{-1} R\, \Sigma^{-1} = \frac{1}{\lambda} I. \tag{203}
\]
which implies,
\[
\Sigma^2 = \lambda R. \tag{204}
\]
\[
\Sigma = \sqrt{\lambda}\, R^{1/2}. \tag{205}
\]
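A small sketch of computing $\Sigma_*$ from data via an eigendecomposition-based matrix square root is shown below; estimating $R$ directly from a finite sample matrix is an illustrative choice.

```python
# Sketch of the correlated-noise result: estimate R from data, take its
# symmetric square root, and scale by sqrt(lambda) to obtain Sigma*.
import numpy as np

def optimal_noise_covariance(x, lam):
    # x: (N, d) data matrix; returns Sigma* = sqrt(lambda) * R^{1/2}
    R = (x.T @ x) / x.shape[0]                 # empirical R = E[x0 x0^T]
    w, V = np.linalg.eigh(R)                   # R is symmetric PSD
    R_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return np.sqrt(lam) * R_half
```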
The divergence of the vector field $f_\theta(x_t, t)$ is further estimated using the Skilling-Hutchinson trace estimator (Skilling, 1989; Hutchinson, 1990) as follows,
\[
\nabla \cdot f_\theta(x_t, t) = \mathbb{E}_{p(\epsilon)}\big[\epsilon^\top \nabla f_\theta(x_t, t)\, \epsilon\big] \tag{208}
\]
where usually, $\epsilon \sim \mathcal{N}(0, I_d)$. For the t-EDM ODE, this estimate can be further simplified as follows,
\[
\begin{aligned}
\nabla \cdot f_\theta(x_t, t) &= \mathbb{E}_{p(\epsilon)}\big[\epsilon^\top \nabla f_\theta(x_t, t)\, \epsilon\big] && (209)\\
&= \mathbb{E}_{p(\epsilon)}\left[\epsilon^\top \nabla\Big(\frac{x_t - D_\theta(x_t, t)}{t}\Big)\, \epsilon\right] && (210)\\
&= \mathbb{E}_{p(\epsilon)}\left[\epsilon^\top \Big(\frac{I_d - \nabla_{x_t} D_\theta(x_t, t)}{t}\Big)\, \epsilon\right] && (211)\\
&= \frac{1}{t}\, \mathbb{E}_{p(\epsilon)}\Big[\epsilon^\top \epsilon - \epsilon^\top \nabla_{x_t} D_\theta(x_t, t)\, \epsilon\Big] && (212)\\
&= \frac{1}{t}\Big[d - \mathbb{E}_{p(\epsilon)}\big(\epsilon^\top \nabla_{x_t} D_\theta(x_t, t)\, \epsilon\big)\Big] && (214)
\end{aligned}
\]
where $d$ is the data dimensionality. Thus, the log-likelihood can be specified as,
\[
\log p(x_0) = \log p(x_T) + \int_0^T \frac{1}{t}\Big[d - \mathbb{E}_{p(\epsilon)}\big(\epsilon^\top \nabla D_\theta(x_t, t)\, \epsilon\big)\Big]\, dt. \tag{215}
\]
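A hedged PyTorch sketch of this estimator for the t-EDM vector field $f_\theta(x,t) = (x - D_\theta(x,t))/t$ is given below; the number of probe vectors and the use of reverse-mode vector-Jacobian products are implementation assumptions.

```python
# Sketch of the Skilling-Hutchinson estimate in Eq. 214 with Gaussian probes.
import torch

def divergence_estimate(denoiser, x, t, n_probes=4):
    x = x.detach().requires_grad_(True)
    d = x.numel()
    out = denoiser(x, t)
    acc = x.new_zeros(())
    for _ in range(n_probes):
        eps = torch.randn_like(x)
        (vjp,) = torch.autograd.grad(out, x, grad_outputs=eps, retain_graph=True)
        acc = acc + torch.sum(eps * vjp)       # eps^T (dD_theta/dx) eps
    return (d - acc / n_probes) / t            # (1/t)[d - E(eps^T grad D eps)]
```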
Connections with Denoising Score Matching. For the perturbation kernel $q(x_t|x_0) = t_d(\mu_t x_0, \sigma_t^2 I_d, \nu)$, the denoising score matching (Vincent, 2011; Song et al., 2020) loss, $\mathcal{L}_\text{DSM}$, can be formulated as,
\[
\mathcal{L}_\text{DSM}(\theta) \;\propto\; \mathbb{E}_{x_0\sim p(x_0)}\,\mathbb{E}_t\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I_d)}\,\mathbb{E}_{\kappa\sim\chi^2(\nu)/\nu}\left[\lambda(x_t,\nu,t)\,\Big\|D_\theta\Big(\mu_t x_0 + \sigma_t\frac{\epsilon}{\sqrt{\kappa}},\, \sigma_t\Big) - x_0\Big\|_2^2\right] \tag{224}
\]
with the scaling factor $\lambda(x_t,\nu,t) = \big[(\nu+d)/(\nu+d')\big]^2$ where $d' = (1/\sigma_t^2)\|x_t - \mu_t x_0\|_2^2$ (proof in App. A.9). Therefore, the denoising score matching loss in our framework is equivalent to the simplified training objective in Eq. 9 scaled by a data-dependent coefficient. However, in this work, we do not explore this loss formulation and leave further exploration to future work.
Prior work in Heavy-Tailed Generative Modeling. The use of heavy-tailed priors for modeling heavy-tailed distributions has been explored in several works in the past. More specifically, Jaini et al. (2020)
argue that a Lipschitz flow map cannot change the tails of the base distribution significantly. Consequently,
they use a heavy-tailed prior (modeled using a Student-t distribution) as the base distribution to learn Tail
Adaptive flows (TAFs), which can model the tails more accurately. In this work, we make similar observations
where standard diffusion models fail to accurately model the tails of real-world distributions. Relatedly, Laszkiewicz et al. (2022) assess the tailedness of each marginal dimension and set the prior accordingly.
On a similar note, learning the tail parameter ν spatially and across channels can provide
greater modeling flexibility for downstream tasks and will be an important direction for future work on this
problem. More recently, Kim et al. (2024b) introduce heavy-tailed VAEs (Kingma & Welling, 2022; Rezende
& Mohamed, 2016) based on minimizing γ-power divergences (Eguchi, 2021). This is perhaps the closest
connection of our method with prior work since we rely on γ-power divergences to minimize the divergence
between heavy-tailed forward and reverse diffusion posteriors. However, VAEs often have scalability issues
and tend to produce blurry artifacts (Dosovitskiy & Brox, 2016; Pandey et al., 2022). On the other hand,
we work with diffusion models, which are known to scale well to large-scale modeling applications (Pathak
et al., 2024; Mardani et al., 2024; Esser et al., 2024; Podell et al., 2023). In another line of work, Yoon
et al. (2023) presents a framework for modeling heavy-tailed distributions using α-stable Levy processes
while Shariatian et al. (2024) simplify the framework proposed in Yoon et al. (2023) and instantiate it for
more practical diffusion models like DDPM. In contrast, our work deals with Student-t noise, which in
general (with the exceptions of the Cauchy and the Gaussian distributions) is not α-stable and therefore constitutes a distinct category of diffusion models for modeling heavy-tailed distributions. Moreover, prior works like Yoon et al.
(2023); Shariatian et al. (2024) rely on empirical evidence from light-tailed variants of small-scale datasets
like CIFAR-10 (Krizhevsky, 2009) and their efficacy on actual large-scale scientific datasets like weather
datasets remains to be seen.
Prior work in Diffusion Models. Our work is a direct extension of standard diffusion models in the literature
(Karras et al., 2022; Ho et al., 2020; Song et al., 2020). Moreover, since it only requires a few lines of
code change to transition from standard diffusion models to our framework, our work is directly compatible
with popular families of latent diffusion models (Pandey et al., 2022; Rombach et al., 2022) and augmented
diffusion models (Dockhorn et al., 2022; Pandey & Mandt, 2023; Singhal et al., 2023). Our work is also
related to prior work in diffusion models on a more theoretical level. More specifically, PFGM++ (Xu et al.,
2023b) is a unique type of generative flow model inspired by electrostatic theory. It treats d-dimensional data
as electrical charges in a (D + d)-dimensional space, where the electric field lines define a bijection between a
heavy-tailed prior and the data distribution. D is a hyperparameter controlling the shape of the electric fields
that define the generative mapping. In essence, their method can be seen as utilizing a perturbation kernel:
\[
p(x_t|x_0) \;\propto\; \big(\|x_t - x_0\|_2^2 + \sigma_t^2 D\big)^{-\frac{D+d}{2}} = t_d(x_0, \sigma_t^2 I_d, D)
\]
When setting $\nu = D$, the perturbation kernel becomes equivalent to that of t-EDM, indicating the Student-t
perturbation kernel can be interpreted from another physical perspective — that of electrostatic fields and
charges. The authors demonstrated that using an intermediate value for D (or ν) leads to improved robustness
compared to diffusion models (where $D \to \infty$), due to the heavy-tailed perturbation kernel.
F.2 LIMITATIONS
While our proposed framework works well for modeling heavy-tailed data, it is not without its limitations.
Firstly, while the parameter ν offers controllability for tail estimation using diffusion models, it also increases
the tuning budget by introducing an extra hyperparameter. Moreover, for diverse data channels, tuning ν
per channel could be key to good estimation at the tails. This could result in a combinatorial explosion
with manual tuning. Therefore, learning ν directly from the data could be an important direction for the
practical deployment of heavy-tailed diffusion models. Secondly, our evaluation protocol relies primarily on
comparing the statistical properties of samples obtained by flattening the generated or train/test set samples.
One disadvantage of this approach is that our current evaluation metrics ignore the structure of the generated
samples. In general, developing metrics like FID (Heusel et al., 2018) to assess the perceptual quality of
synthetic data for scientific domains like weather analysis remains an important future direction.
Figure 11: 1-d Histogram Comparisons between samples from the generated and the Train/Test set for the Vertically Integrated Liquid (VIL, see Top Panel) and Vertical Wind Velocity (w20, see Bottom Panel) channels using t-EDM (with varying ν).
Figure 12: 1-d Histogram Comparisons between samples from the generated and the Train/Test set for the Vertically Integrated Liquid (VIL, see Top Panel) and Vertical Wind Velocity (w20, see Bottom Panel) channels using t-Flow (with varying ν).
Figure 8: Qualitative visualization of samples generated from our conditional modeling for predicting the next state for the Vertically Integrated Liquid (VIL) channel. The ensemble mean represents the mean of ensemble predictions (16 in our case). Columns 2-3 represent two samples from the ensemble. The last column visualizes an animation of all ensemble members (Best viewed in a dedicated PDF reader). For each sample, the rows correspond to predictions from EDM, t-EDM (ν = 3), and t-EDM (ν = 5) from top to bottom, respectively. Samples have been scaled logarithmically for better visualization.
Figure 9: Qualitative visualization of samples generated from our conditional modeling for predicting the next state for
the Vertical Wind Velocity (w20) channel. The ensemble mean represents the mean of ensemble predictions (16 in our
case). Columns 2-3 represent two samples from the ensemble. The last column visualizes an animation of all ensemble
members (Best viewed in a dedicated PDF reader). For each sample, the rows correspond to predictions from EDM,
t-EDM (ν “ 3), and t-EDM (ν “ 5) from top to bottom, respectively.