Diffusion-TS: Interpretable Diffusion for General Time Series Generation
1 INTRODUCTION
Time series data are ubiquitous in real-world problems, playing a crucial role in a wide variety of domains such as finance, medicine, biology, retail, and climate modeling (Lim & Zohren, 2021). However, lack of access to these dynamical data is a key hindrance to the development of machine learning solutions in cases where data sharing may lead to privacy breaches (Alaa et al., 2021). Synthesizing realistic time series data is viewed as a promising solution and has received increasing attention driven by advances in deep learning. With perceptual quality superior to GANs while avoiding the optimization challenges of adversarial training, score-based diffusion models (Song et al., 2021; 2020), especially denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020), have taken the worlds of image, video, and text generation (Ho et al., 2022; Li et al., 2022a; Dhariwal & Nichol, 2021; Harvey et al., 2022) by storm.
It is hoped that diffusion models can be extended to the time-series area to tackle the challenging problem of high-quality time series generation. Although some recent works pioneered the extension of diffusion models to time-series-related applications, almost all of them are designed for task-specific conditional generation (e.g., imputation (Tashiro et al., 2021; Alcaraz & Strodthoff, 2022) and forecasting (Li et al., 2022b; Shen & Kwok, 2023b)) that trains and samples with additional information. Meanwhile, the rare works on unconditional time-series synthesis with diffusion models focus on synthesizing univariate (Kong et al., 2021; Kollovieh et al., 2023) or short (Lim et al., 2023) time series. The first challenge is that these diffusion-based methods (Lim et al., 2023; Das et al., 2023) typically employ Recurrent Neural Networks (RNNs) as the backbone to jointly model temporal dynamics and complex correlations. Such autoregressive methods have limited long-range performance due to error accumulation and slow inference speed. The second challenge lies in that the plentiful combinations of independent components such as trend, seasonality, and local idiosyncrasies of a real-world time series are usually destroyed by gradually adding noise to the data in the diffusion process.
* Corresponding author. The code is available at https://github.com/Y-debug-sys/Diffusion-TS.
As a result, they can hardly recover the lost temporal dynamics since the temporal properties have not been intentionally preserved. This is exacerbated when a time series has apparent seasonal oscillations, since existing solutions lack the inductive bias to actively capture periodicity (LIU et al., 2022). Furthermore, it is also difficult for them to provide expert knowledge to explain both conditional and unconditional generation, and therefore they often lack interpretability.
To better tackle the aforementioned challenges, in this paper we develop Diffusion-TS, a non-autoregressive diffusion model for synthesizing high-quality time series in various scenarios. We explicitly model the temporal dynamics of highly complicated (e.g., multivariate and long-term) time series by introducing a transformer-based architecture for the underlying model that learns a disentangled seasonal-trend composition of time series. This is achieved by imposing different forms of constraints on different representations. These disentangled representations not only offer Diffusion-TS an interpretable perspective on general synthesis tasks, they also play a role in guiding the capture of complicated periodic dependencies beyond much simplified assumptions. Additionally, we design a Fourier-based loss to reconstruct the samples rather than the noise at each diffusion step, which leads to more accurate generation of the time series. Another notable design in Diffusion-TS is a conditional generation method called reconstruction-based sampling, which makes Diffusion-TS versatile for various conditional applications, such as time series imputation and forecasting, leading to greater flexibility without requiring any parameter updates.
In summary, our major contributions are as follows:
2 PROBLEM STATEMENT
We denote $X_{1:\tau} = (x_1, \ldots, x_\tau) \in \mathbb{R}^{\tau \times d}$ as a time series covering a period of $\tau$ time steps, where $d$ denotes the dimension of observed signals. Given the dataset $\mathcal{D}_A = \{X^i_{1:\tau}\}_{i=1}^{N}$ of $N$ samples of time-series signals, our unconditional goal is to use a diffusion-based generator to approach the function $\hat{X}^i_{1:\tau} = G(Z_i)$, which maps Gaussian vectors $Z_i = (z^i_1, \ldots, z^i_T) \in \mathbb{R}^{\tau \times d \times T}$ to signals that are most similar to the signals in $\mathcal{D}_A$, where $T$ denotes the total number of diffusion steps. In our method, we consider the following time series model with trend and multiple seasonalities:
$$x_j = \zeta_j + \sum_{i=1}^{m} s_{i,j} + e_j, \qquad j = 0, 1, \ldots, \tau - 1, \quad (1)$$
where $x_j$ represents the observed time series, $\zeta_j$ denotes the trend component, $s_{i,j}$ is the $i$-th seasonal component, and $e_j$ denotes the remainder, which contains the noise and some outliers at time $j$. The goal of controllable generation is to generate samples from a conditional distribution $p(\cdot \mid y)$, where $y$ is a control variable that can be any real-world signal dictating the synthesis.
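To make the decomposition in Eq. (1) concrete, the following sketch builds a toy series from a trend, two seasonal components, and a remainder term; all parameter values here are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

tau = 96                                   # number of time steps
j = np.arange(tau)

zeta = 0.02 * j + 0.5                      # slowly varying trend component zeta_j
s1 = 0.8 * np.sin(2 * np.pi * j / 24)      # first seasonal component
s2 = 0.3 * np.sin(2 * np.pi * j / 12)      # second, faster seasonal component
e = 0.05 * np.random.randn(tau)            # remainder: noise and occasional outliers

x = zeta + s1 + s2 + e                     # x_j = zeta_j + sum_i s_{i,j} + e_j  (Eq. 1)
```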
Figure 1: The forward and reverse diffusion processes on time series data in Diffusion-TS. The denoising network learns to predict the clean time series $\hat{x}_0$ from $x_t$ based on an interpretable decomposition (trend, season and error). In the reverse pass, the generator network gradually injects expert knowledge to move the synthetic series toward the real ones.
interpretable owing to the specific designs of architecture and objective; (iii) temporal information is isolated into several parts, which enables us to capture the potentially complex dynamics of the diffused data with explainable disentangled representations. The learned disentanglement also tends to make the imputation or forecasting process more reliable (LIU et al., 2022). Finally, we show how to conduct time series imputation and forecasting with the trained off-the-shelf diffusion model during the sampling process.
3.1 DIFFUSION FRAMEWORK
As shown in Figure 1, we start by introducing the diffusion model, which typically contains two processes: a forward process and a reverse process. In this setting, a sample from the data distribution $x_0 \sim q(x)$ is gradually noised into standard Gaussian noise $x_T \sim \mathcal{N}(0, I)$ by the forward process, where the transition is parameterized by $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$ with $\beta_t \in (0, 1)$ as the amount of noise added at diffusion step $t$. A neural network then learns the reverse process of gradually denoising the sample via the reverse transition $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. Learning to clean $x_T$ through the reversed diffusion process can be reduced to learning a surrogate approximator that parameterizes $\mu_\theta(x_t, t)$ for all $t$. Ho et al. (2020) trained this denoising model $\mu_\theta(x_t, t)$ using a weighted mean squared error loss, which we will refer to as
$$\mathcal{L}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \left\| \mu(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2, \quad (2)$$
where $\mu(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_0, x_t)$. This objective can be justified as optimizing a weighted variational lower bound on the data log-likelihood. Also note that the original parameterization of $\mu_\theta(x_t, t)$ can be modified in favour of $\hat{x}_0(x_t, t, \theta)$ or $\epsilon_\theta(x_t, t)$. Please refer to Appendix B for details.
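As a reference point, the following is a minimal PyTorch sketch of this framework: the closed-form forward corruption and one training step in which the network predicts the clean series $\hat{x}_0$ rather than the noise. The noise schedule, the `denoise_fn` interface, and the unweighted MSE are assumptions for illustration, not the exact Diffusion-TS implementation.

```python
import torch

T = 500                                            # total diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)              # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    a = alpha_bars[t].view(-1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def training_step(denoise_fn, x0):
    """One x_0-prediction training step: reconstruct the clean series from x_t."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t = q_sample(x0, t, torch.randn_like(x0))
    x0_hat = denoise_fn(x_t, t)                    # x_hat_0(x_t, t, theta)
    return ((x0 - x0_hat) ** 2).mean()             # simple unweighted MSE (assumption)
```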
3.2 DECOMPOSITION MODEL ARCHITECTURE
At a high level, we choose an encoder-decoder transformer that enhances the model's ability to capture global correlations and patterns of time series. This way, information about the entire noisy sequence is encoded before decoding. We renovate the decoder into a deep decomposition architecture as shown in Figure 2. The decoder adopts a multilayer structure in which each decoder block contains a transformer block, a feed-forward network block, and interpretable layers (trend and Fourier synthetic layers). Detailed descriptions of the whole model can be found in Appendix E; we now elaborate on the details of the disentangled representation.
We achieve the disentanglement by enforcing distinct forms of constraints on different components, which introduce distinct inductive biases into these components and make them more likely to learn specific semantic knowledge. The trend representation captures the intrinsic trend, which changes gradually and smoothly, and the seasonality representation illustrates the periodic patterns of the signal. The error
representation characterizes the remaining parts after removing trend and periodicity. Before we start, we define the input of the interpretable layers as $w^{i,t}_{(\cdot)}$, where $i \in \{1, \ldots, D\}$ denotes the index of the corresponding decoder block at diffusion step $t$.
Trend Synthesis. The trend component describes the smooth underlying mean of the data and aims to model slow-varying behavior. To produce reasonable trend components, we use a polynomial regressor (Oreshkin et al., 2020; Desai et al., 2021) to model the trend $V^t_{tr}$ as follows:
$$V^t_{tr} = \sum_{i=1}^{D} \left( C \cdot \operatorname{Linear}(w^{i,t}_{tr}) + \mathcal{X}^{i,t}_{tr} \right), \qquad C = [1, c, \ldots, c^p], \quad (3)$$
where $\mathcal{X}^{i,t}_{tr}$ is the mean value of the output of the $i$-th decoder block, and '$\cdot$' denotes tensor multiplication. Here the slow-varying polynomial space $C$ is the matrix of powers of the vector $c = [0, 1, 2, \ldots, \tau-2, \tau-1]^T / \tau$, and $p$ is a small degree (e.g., $p = 3$) to model low-frequency behavior.
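A minimal sketch of one trend synthetic layer in the spirit of Eq. (3) is given below. It assumes the block output is already in the output dimension $d$, and the per-feature polynomial regression via a single linear layer is an illustrative reading of the equation rather than the authors' exact module; the outer sum over decoder blocks is left to the caller.

```python
import torch
import torch.nn as nn

class TrendLayer(nn.Module):
    """Polynomial trend regressor for one decoder block (cf. Eq. 3)."""
    def __init__(self, tau, degree=3):
        super().__init__()
        self.to_poly = nn.Linear(tau, degree + 1)   # per-feature polynomial coefficients
        c = torch.arange(tau, dtype=torch.float32) / tau
        # C: (tau, degree+1) matrix of powers [1, c, ..., c^p]
        self.register_buffer("C", torch.stack([c ** p for p in range(degree + 1)], dim=-1))

    def forward(self, w):                           # w: (batch, tau, d) block output
        coef = self.to_poly(w.transpose(1, 2))      # (batch, d, degree+1)
        poly = torch.einsum("tp,bdp->btd", self.C, coef)
        return poly + w.mean(dim=1, keepdim=True)   # C * Linear(w) + block mean X_tr
```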
Seasonality & Error Synthesis. In this part, we recover the components other than the trend in the model input. These include the periodic components (seasonality) and the non-periodic ones (error). The main challenge is to automatically identify seasonal patterns from the noisy input $x_t$. Inspired by the trigonometric representation of seasonal components based on Fourier series (De Livera et al., 2011; Woo et al., 2022b), we capture the seasonal component of the time series in Fourier synthetic layers using Fourier bases:
$$A^{(k)}_{i,t} = \left| \mathcal{F}(w^{i,t}_{seas})_k \right|, \qquad \Phi^{(k)}_{i,t} = \phi\!\left( \mathcal{F}(w^{i,t}_{seas})_k \right), \quad (4)$$
$$\kappa^{(1)}_{i,t}, \cdots, \kappa^{(K)}_{i,t} = \operatorname*{arg\,TopK}_{k \in \{1, \cdots, \lfloor \tau/2 \rfloor + 1\}} \{ A^{(k)}_{i,t} \}, \quad (5)$$
$$S_{i,t} = \sum_{k=1}^{K} \left[ A^{\kappa^{(k)}_{i,t}}_{i,t} \cos\!\left( 2\pi f_{\kappa^{(k)}_{i,t}} \tau c + \Phi^{\kappa^{(k)}_{i,t}}_{i,t} \right) + A^{\kappa^{(k)}_{i,t}}_{i,t} \cos\!\left( 2\pi \bar{f}_{\kappa^{(k)}_{i,t}} \tau c + \bar{\Phi}^{\kappa^{(k)}_{i,t}}_{i,t} \right) \right], \quad (6)$$
where $\operatorname{arg\,TopK}$ selects the top $K$ amplitudes and $K$ is a hyperparameter. $A^{(k)}_{i,t}$ and $\Phi^{(k)}_{i,t}$ are the amplitude and phase of the $k$-th frequency after the discrete Fourier transform $\mathcal{F}$, respectively, $f_k$ represents the Fourier frequency of the corresponding index $k$, and $\bar{(\cdot)}$ denotes the corresponding conjugate. In effect, the Fourier synthetic layer selects the bases with the most significant amplitudes in the frequency domain and then returns to the time domain through an inverse transform to model the seasonality.
Finally, we can obtain the original signal by the following equation:
$$\hat{x}_0(x_t, t, \theta) = V^t_{tr} + \sum_{i=1}^{D} S_{i,t} + R, \quad (7)$$
where $R$ is the output of the last decoder block, which can be regarded as the sum of residual periodicity and other noise.
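As a minimal sketch of the seasonality synthesis in Eqs. (4)-(6), the function below takes an FFT of the block output, keeps the $K$ frequency bins with the largest amplitudes, and maps back to the time domain; implementing the selection with a mask and an inverse FFT (rather than an explicit cosine sum) is one possible reading, and the shapes and interface are assumptions.

```python
import torch

def fourier_season(w, K=4):
    """w: (batch, tau, d) decoder-block output -> seasonal component of the same shape."""
    tau = w.shape[1]
    spectrum = torch.fft.rfft(w, dim=1)              # complex spectrum over time, (batch, tau//2+1, d)
    amplitude = spectrum.abs()
    topk = amplitude.topk(K, dim=1).indices          # indices of the K largest amplitudes (Eq. 5)
    mask = torch.zeros_like(amplitude, dtype=torch.bool).scatter_(1, topk, True)
    filtered = spectrum * mask.to(spectrum.dtype)    # keep only the selected frequency bins
    return torch.fft.irfft(filtered, n=tau, dim=1)   # back to the time domain (Eq. 6)
```

The residual part $R$ is then whatever the last decoder block leaves after the trend and seasonal parts, and the final clean estimate follows Eq. (7).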
Figure 2: Decoder architecture of Diffusion-TS (diagram): each decoder block combines a transformer block with trend and Fourier (seasonality & error) synthetic layers; legend symbols denote tensor multiplication, element-wise addition/subtraction, and positional encoding.
where $\mathrm{FFT}$ denotes the Fast Fourier Transform (Elliott & Rao, 1982), and $\lambda_1, \lambda_2$ are weights that balance the two loss terms.
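As a hedged illustration of a loss of this kind, the sketch below combines a time-domain reconstruction term with an FFT-domain term balanced by $\lambda_1$ and $\lambda_2$; the concrete weighting and the use of the complex spectrum difference are assumptions, not the exact objective used by Diffusion-TS.

```python
import torch

def fourier_loss(x0, x0_hat, lambda1=1.0, lambda2=1e-2):
    """Time-domain MSE plus an FFT-domain term, balanced by lambda1 / lambda2 (assumed weights)."""
    time_term = ((x0 - x0_hat) ** 2).mean()
    freq_term = (torch.fft.rfft(x0, dim=1) - torch.fft.rfft(x0_hat, dim=1)).abs().pow(2).mean()
    return lambda1 * time_term + lambda2 * freq_term
```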
3.4 CONDITIONAL GENERATION FOR TIME SERIES APPLICATIONS
The above has described the details of unconditional time series generation. In this section, we describe conditional extensions of Diffusion-TS, in which the modeled $x_0$ is conditioned on targets $y$. The goal is to utilize a pre-trained diffusion model and the gradients of a classifier to approximately sample from the posterior $p(x_{0:T} \mid y) = \prod_{t=1}^{T} p(x_{t-1} \mid x_t, y)$, where $p(x_{t-1} \mid x_t, y) \propto p(x_{t-1} \mid x_t)\, p(y \mid x_{t-1}, x_t)$. Using Bayes' theorem, we run a gradient update on $x_{t-1}$ to control generation via the following score function:
$$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, y) = \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(y \mid x_{t-1}), \quad (11)$$
where $\log p(x_{t-1} \mid x_t)$ is defined by the diffusion model, while $\log p(y \mid x_{t-1})$ is parameterized by a classifier and can be approximated by $\nabla_{x_{t-1}} \log p(y \mid x_{0|t-1})$, as proved in Chung et al. (2022). The method can be explained as a way to guide the samples towards areas where the classifier has a high likelihood. Then, given the conditional part $x_a$ and the generative part $x_b$, our proposed method to approximate conditional sampling for imputation and forecasting is defined as follows:
$$\tilde{x}_0(x_t, t, \theta) = \hat{x}_0(x_t, t, \theta) + \eta \nabla_{x_t} \left( \| x_a - \hat{x}_a(x_t, t, \theta) \|_2^2 + \gamma \log p(x_{t-1} \mid x_t) \right), \quad (12)$$
where $\gamma$ is a hyperparameter that trades off the two terms (the first for conditional consistency and the second for better fluency). The gradient term can be interpreted as reconstruction-based guidance, with $\eta$ controlling its strength. Following Li et al. (2022a), we repeat the gradient update multiple times at each diffusion step to improve the control quality. Then, by replacing $\tilde{x}_a(x_t, t, \theta) := \sqrt{\bar{\alpha}_t}\, x_a + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, the sample $x_{t-1}$ is generated using the new $\tilde{x}_0$. Algorithms in Appendix F lay out how such a sampling scheme is used for conditional generation.
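A minimal sketch of the reconstruction-guided update in Eq. (12) follows; the Gaussian surrogate used for $\log p(x_{t-1} \mid x_t)$, the mask convention, and the single (rather than repeated) gradient step are illustrative assumptions.

```python
import torch

def reconstruction_guided_x0(denoise_fn, x_t, t, x_a, obs_mask, eta=1.0, gamma=0.05):
    """x_a: flattened observed values; obs_mask: boolean mask selecting them in x_hat_0."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoise_fn(x_t, t)                               # x_hat_0(x_t, t, theta)
    consistency = ((x_a - x0_hat[obs_mask]) ** 2).sum()       # ||x_a - x_hat_a||^2
    # a simple Gaussian surrogate for log p(x_{t-1} | x_t), centred at the clean estimate
    fluency = -((x_t - x0_hat) ** 2).sum()
    grad = torch.autograd.grad(consistency + gamma * fluency, x_t)[0]
    return (x0_hat + eta * grad).detach()                     # Eq. (12)
```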
4 EMPIRICAL EVALUATION
In this section, we first study the interpretable outputs of the proposed model. We then evaluate our method in two modes, unconditional and conditional generation, to verify the quality of the generated signals. For time series generation, we compare with five previous models: TimeVAE (Desai et al., 2021), Diffwave (Kong et al., 2021), TimeGAN (Yoon et al., 2019), Cot-GAN (Xu et al., 2020), and DiffTime (Coletta et al., 2023), which can be viewed as an unconditional CSDI (Tashiro et al., 2021). We also compare with CSDI and SSSD (Alcaraz & Strodthoff, 2022) on conditional tasks. Finally, we conduct experiments to validate the performance of Diffusion-TS when clean data is insufficient. Implementation details and the ablation study can be found in Appendix G and C.7, respectively.
4.1 DATASETS
We use four real-world datasets and two simulated datasets, listed in Table 11, to evaluate our method. Stocks is the Google stock price data from 2004 to 2019; each observation represents one day and has 6 features. ETTh contains data collected from electricity transformers, including load and oil temperature recorded every 15 minutes between July 2016 and July 2018. Energy is a UCI appliance energy prediction dataset with 28 features. fMRI is a benchmark for causal discovery, consisting of realistic simulations of blood-oxygen-level-dependent (BOLD) time series; we select a simulation from the original dataset that has 50 features. Sines has 5 features, each created with different frequencies and phases independently. MuJoCo is multivariate physics-simulation time series data with 14 features.
4.2 METRICS
For quantitative evaluation of the synthesized data, we consider three main criteria: (1) the distribution similarity of the time series; (2) the temporal and feature dependency; and (3) the usefulness for predictive purposes. We adopt the following evaluation metrics (see Appendix G.3 for detailed descriptions): 1) the discriminative score (Yoon et al., 2019) measures similarity by training a classification model to distinguish between original and synthetic data as a supervised task; 2) the predictive score (Yoon et al., 2019) measures the usefulness of the synthesized data by training a post-hoc sequence model to predict next-step temporal vectors using the train-synthesis-and-test-real (TSTR) protocol; 3) the Context-Fréchet Inception Distance (Context-FID) score (Paul et al., 2022) quantifies the quality of synthetic time series samples by computing the difference between representations of time series that fit into the local context; and 4) the correlational score (Ni et al., 2020) uses the absolute error between the cross-correlation matrices of real and synthetic data to assess the temporal dependency.
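To make the last metric concrete, a minimal sketch of a correlational-score-style computation is shown below; the exact estimator of Ni et al. (2020) may differ, so treat this as an illustrative approximation.

```python
import numpy as np

def correlational_score(real, synth):
    """real, synth: arrays of shape (num_samples, tau, d)."""
    def corr(x):
        flat = x.reshape(-1, x.shape[-1])           # pool samples and time steps
        return np.corrcoef(flat, rowvar=False)      # (d, d) feature correlation matrix
    return np.abs(corr(real) - corr(synth)).mean()  # mean absolute difference of the matrices
```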
We start by analyzing the reconstruction performance of our model for multivariate time series. Figure 3 illustrates examples of original and reconstructed samples by Diffusion-TS on three datasets. The model takes corrupted samples (shown in (a)) with 50 steps of added noise as input and outputs signals (shown in (c)) that aim to restore the ground truth (shown in (b)), with the aid of the decomposition into temporal trend (shown in (d)) and season & error (shown in (e)). As expected, the trend curve follows the overall shape of the signal, while the season & error oscillates around zero. Fusing the two temporal properties, the reconstructed samples show close agreement with the ground truth. Our conclusion is that, by introducing the interpretable architecture, the time series generated by Diffusion-TS exhibit explainable disentanglement with almost no accuracy loss. Additionally, we conduct a case study to further validate the interpretability on synthetic data in Appendix C.5.
In Table 1, we list the results of 24-length time series generation, the setting used in most existing related works. Diffusion-TS consistently produces higher-quality synthetic samples than the other baselines in terms of almost all metrics. For example, Diffusion-TS improves the discriminative score by an average of 50% across all six datasets. It is also notable that for high-dimensional datasets (i.e., MuJoCo, Energy and fMRI), the lead of Diffusion-TS is even more significant, demonstrating that Diffusion-TS can tackle the challenges of complex time series synthesis. Table 3 shows the results of long-term time series generation on the ETTh and Energy datasets. We randomly generate 3000 sequences of various lengths (64, 128, 256) and use the aforementioned metrics to assess the generation quality of the different methods. Diffusion-TS achieves the best overall performance, implying the efficacy of interpretable decomposition for long-term time series modelling. Notably, unlike the other baselines, the performance of Diffusion-TS changes quite steadily as the sequence length increases, which means Diffusion-TS retains better long-term robustness, a meaningful property for real-world applications.
Figure 4: t-SNE visualizations (first row) and kernel density estimates (second row) of original data versus data synthesized by Diffusion-TS and TimeGAN.
To visualize the performance of time series synthesis, we adopt two visualization methods. One projects the original and synthetic data into a 2-dimensional space using t-SNE (Van der Maaten & Hinton, 2008); the other draws the data distributions using kernel density estimation. As shown in the first row of Figure 4 and the pictures in Appendix C.2, Diffusion-TS overlaps the original data areas markedly better than TimeGAN. The second row of Figure 4 shows that the distributions of the synthetic data from Diffusion-TS are more similar to those of the original data than TimeGAN's. For more visualizations and distributions, please refer to Appendix H.
Figure 5: Examples of time series imputation (1st row) and forecasting (2nd row) for the ETTh (a) and Energy (b) datasets. Green and gray correspond to Diffusion-TS and CSDI, respectively.
Here, we present conditional generation for time series imputation and forecasting. For imputation, we apply a masking strategy following the geometric distribution used in Zerveas et al. (2021), which controls both the lengths of the missing sequences and the missing ratio $r$, instead of selecting the missing points randomly, since a single missing time point can often be easily predicted from the immediately preceding and succeeding points. For both tasks, the length of a time series is set to 48 time steps; given the first $w$ continuous time points, we forecast the remaining $48 - w$ time points. We provide imputation and forecasting examples on the ETTh and Energy datasets in Figure 5 (more examples in Appendix I). The red crosses show the observed values and the blue circles show the ground-truth imputation targets. The median values of the imputations are shown as the line, and the 5% and 95% quantiles are shown as the shade. Diffusion-TS gives more reasonable imputations and predictions with higher confidence than CSDI.
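For reference, a minimal sketch of such a geometric masking scheme (in the spirit of Zerveas et al., 2021) is given below; the parameter names and the exact transition probabilities are assumptions chosen so that masked segments have mean length `lm` and the overall missing ratio is `r`.

```python
import numpy as np

def geometric_mask(tau, lm=3, r=0.25, rng=None):
    """Boolean mask of length tau; True marks a masked (missing) time step."""
    rng = rng or np.random.default_rng()
    p_end_masked = 1.0 / lm                          # masked segments have mean length lm
    p_end_unmasked = p_end_masked * r / (1.0 - r)    # so that the overall missing ratio is r
    masked = rng.random() < r                        # start in the masked state with probability r
    mask = np.zeros(tau, dtype=bool)
    for i in range(tau):
        mask[i] = masked
        if rng.random() < (p_end_masked if masked else p_end_unmasked):
            masked = not masked                      # switch between masked and unmasked segments
    return mask
```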
Figure 6: Performance of diffusion methods for time series imputation (varying missing ratio) and forecasting (varying forecasting window) on the ETTh (a) and Energy (b) datasets. Means and confidence intervals are obtained by running each method 5 times.
We also run detailed experiments to evaluate our conditional extension. Diffusion-TS (Diffusion-TS-G), based on the aforementioned Langevin sampler, is compared with CSDI, the original Diffwave, and Diffusion-TS-R, which uses replace-based conditional sampling (see Appendix D for a detailed description of the conditional adaptation). Figure 6 shows the empirical results, where Diffusion-TS-G outperforms all baselines even at high missing ratios. We observe that Diffusion-TS-R achieves comparable accuracy at low missing ratios, but at the high missing ratios of 75% and 90% there is a clear improvement with the reconstruction-guided sampling strategy. This means that the seasonal-trend decomposition naturally facilitates infilling as restrictions increase; however,
when the missing ratio is high and there are no additional constraints, it may also make the entire sequence deviate from the ground truth. For completeness, we conduct additional experiments and include other time series baselines such as TLAE (Nguyen & Quanz, 2021) in Appendix C.3. Our model still performs well, achieving the best performance among the baselines.
Cold Starts and Irregular Settings. In this experiment, we explore whether Diffusion-TS can keep its performance from degrading under data-deficient conditions. Cold start refers to training the model when little or no data is available. We use only 10/25/50/75% of the time series data in each dataset as training data, respectively. We then remove time series values from the rest of the dataset, leading to 10% ∼ 20% missing values per data point. After a cold start on the regular part of the dataset, we leverage the model to impute these irregular time series and then continue training by incorporating them into the training set. We call this process irregular training, and Figure 11 shows the results. We observe that even at the 10% training threshold, Diffusion-TS still achieves superior results on multiple time-series datasets compared to the baselines in Table 1. In addition, Diffusion-TS shows better discriminative and predictive scores in all cases after restoration. The performance of the model remains at the same level as if there were no missing data, which shows the efficacy of Diffusion-TS under insufficient and irregular settings.
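The cold-start and irregular-training procedure can be summarized with the following high-level sketch; `fit` and `impute` are placeholder names standing in for Diffusion-TS training and conditional imputation, not the authors' actual API.

```python
def irregular_training(model, regular_series, irregular_series, masks):
    """Cold start on clean data, then restore the irregular series and continue training."""
    model.fit(regular_series)                                    # cold start: clean subset only
    restored = [model.impute(x, m) for x, m in zip(irregular_series, masks)]
    model.fit(list(regular_series) + restored)                   # continue training on all series
    return model
```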
Table 2: Ablation study for model architecture and options. (Bold indicates best performance).
Metric Method Sines Stocks ETTh Mujoco Energy fMRI
Diffusion-TS 0.006±.007 0.067±.015 0.061±.009 0.008±.002 0.122±.003 0.167±.023
Discriminative w/o FFT 0.007±.006 0.127±.019 0.096±.007 0.010±.002 0.135±.004 0.177±.013
Score w/o Interpretability 0.009±.006 0.101±.096 0.071±.010 0.021±.014 0.125±.003 0.267±.034
w/o Transformer 0.010±.007 0.104±.024 0.082±.006 0.039±.014 0.324±.015 0.123±.064
(Lower the Better) ϵ-prediction 0.040±.011 0.131±.014 0.099±.010 0.023±.006 0.197±.001 0.168±.030
Diffusion-TS 0.093±.000 0.036±.000 0.119±.002 0.007±.000 0.250±.000 0.099±.000
Predictive w/o FFT 0.093±.000 0.038±.000 0.121±.004 0.008±.001 0.250±.000 0.100±.001
Score w/o Interpretability 0.093±.000 0.037±.000 0.119±.008 0.008±.001 0.251±.000 0.101±.000
w/o Transformer 0.093±.000 0.036±.000 0.126±.004 0.008±.000 0.319±.006 0.099±.000
(Lower the Better) ϵ-prediction 0.097±.000 0.039±.000 0.120±.002 0.008±.001 0.251±.000 0.100±.000
Original 0.094±.001 0.036±.001 0.121±.005 0.007±.001 0.250±.003 0.090±.001
We find that the best result on each dataset is usually obtained by Diffusion-TS. Removing the attention and residuals often causes large performance drops. However, when the dataset has high frequency content and dimensionality (e.g., fMRI), a network with only the interpretable design achieves the most significant performance improvement. In addition, the FFT loss term and the signal-prediction parameterization play a crucial role in general. The conclusion is that our design is relatively stable across all experimental settings, and the decomposition not only adds interpretability without losing accuracy, but also boosts the unconditional generation of the diffusion model when there are obvious frequency fluctuations in the dataset.
5 CONCLUSIONS
In this paper, we have proposed Diffusion-TS, a DDPM-based method for general time series generation. In addition to the generative capability of DDPMs, Diffusion-TS is powered by a time-series-specific loss design and a transformer-based deep decomposition architecture. Moreover, our model trained for unconditional generation can readily be extended to conditional generation by incorporating gradients into the sampling. The experiments show that our model is capable of a wide range of time-series generative tasks and achieves competitive performance. As shown in Table 8, one notable limitation of DDPMs is the high cost of inference, which demands more computing resources to generate a sample than GAN-based approaches, although our underlying model is lightweight. Improving Diffusion-TS so that both conditional and unconditional inference procedures converge in less time is potential future work.
REFERENCES
Ahmed Alaa, Alex James Chan, and Mihaela van der Schaar. Generative time-series modeling with
fourier flows. In International Conference on Learning Representations, 2021.
Juan Miguel Lopez Alcaraz and Nils Strodthoff. Diffusion-based time series imputation and fore-
casting with structured state space models. arXiv preprint arXiv:2208.09399, 2022.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
Gérard Biau, Maxime Sangnier, and Ugo Tanielian. Some theoretical insights into wasserstein gans,
2021.
Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications,
volume 31999. McGraw-Hill New York, 1986.
Edward Choi, Zhen Xu, Yujia Li, Michael W. Dusenberry, Gerardo Flores, Yuan Xue, and An-
drew M. Dai. Learning the graphical structure of electronic health records with graph convolu-
tional transformer, 2020.
Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion
posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022.
Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. Stl: A seasonal-
trend decomposition. J. Off. Stat, 6(1):3–73, 1990.
Andrea Coletta, Sriram Gopalakrishan, Daniel Borrajo, and Svitlana Vyetrenko. On the constrained
time-series generation problem. arXiv preprint arXiv:2307.01717, 2023.
Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, and Yi-Zhe Song. Chirodiff: Modelling
chirographic data with diffusion models. arXiv preprint arXiv:2304.03785, 2023.
Alysha M De Livera, Rob J Hyndman, and Ralph D Snyder. Forecasting time series with complex
seasonal patterns using exponential smoothing. Journal of the American statistical association,
106(496):1513–1527, 2011.
Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-
encoder for multivariate time series generation, 2021.
Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021.
Alexander Dokumentov and Rob J. Hyndman. Str: Seasonal-trend decomposition using regression,
2021.
DF Elliott and KR Rao. Fast fourier transform and convolution algorithms, 1982.
Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series
generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633, 2017.
Elizabeth Fons, Alejandro Sztrajman, Yousef El-laham, Alexandros Iosifidis, and Svitlana
Vyetrenko. Hypertime: Implicit neural representation for time series, 2022.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the
ACM, 63(11):139–144, 2020.
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and
Baining Guo. Vector quantized diffusion model for text-to-image synthesis, 2022.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Im-
proved training of wasserstein gans, 2017.
William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible
diffusion modeling of long videos, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J.
Fleet. Video diffusion models, 2022.
S. Rao Jammalamadaka, Jinwen Qiu, and Ning Ning. Multivariate bayesian structural time series
model, 2018.
Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Time-series generation by contrastive imi-
tation. Advances in Neural Information Processing Systems, 34:28968–28982, 2021a.
Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Time-series generation by contrastive imi-
tation. Advances in Neural Information Processing Systems, 34:28968–28982, 2021b.
Jinsung Jeon, Jeonghak Kim, Haryong Song, Seunghyeon Cho, and Noseong Park. Gt-gan: General
purpose time series synthesis with generative adversarial networks. Advances in Neural Informa-
tion Processing Systems, 35:36999–37010, 2022.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang,
and Yuyang Wang. Predict, refine, synthesize: Self-guiding diffusion models for probabilistic
time series forecasting. arXiv preprint arXiv:2307.11494, 2023.
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile
diffusion model for audio synthesis, 2021.
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto.
Diffusion-lm improves controllable text generation, 2022a.
Yan Li, Xinjiang Lu, Yaqing Wang, and Dejing Dou. Generative time series forecasting with dif-
fusion, denoise, and disentanglement. Advances in Neural Information Processing Systems, 35:
23009–23022, 2022b.
Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical
Transactions of the Royal Society A, 379(2194):20200209, 2021.
Haksoo Lim, Minjung Kim, Sewon Park, and Noseong Park. Regular time-series generation using
sgm. arXiv preprint arXiv:2301.08518, 2023.
SHUAI LIU, Xiucheng Li, Gao Cong, Yile Chen, and YUE JIANG. Multivariate time-series impu-
tation with disentangled temporal representations. In The Eleventh International Conference on
Learning Representations, 2022.
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao,
Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video
generation, 2023.
Olof Mogren. C-rnn-gan: Continuous recurrent neural networks with adversarial training, 2016.
Nam Nguyen and Brian Quanz. Temporal latent auto-encoder: A method for probabilistic multi-
variate time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 35, pp. 9117–9125, 2021.
Hao Ni, Lukasz Szpruch, Magnus Wiese, Shujian Liao, and Baoren Xiao. Conditional sig-
wasserstein gans for time series generation. arXiv preprint arXiv:2006.05421, 2020.
Rachel Nosowsky and Thomas J Giordano. The health insurance portability and accountability act
of 1996 (hipaa) privacy rule: implications for clinical research. Annu. Rev. Med., 57:575–590,
2006.
Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis
expansion analysis for interpretable time series forecasting, 2020.
Jeha Paul, Bohlke-Schneider Michael, Mercado Pedro, Kapoor Shubham, Singh Nirwan Rajbir,
Flunkert Valentin, Gasthaus Jan, and Januschowski Tim. Psa-gan: Progressive self attention gans
for synthetic time series, 2022.
Hengzhi Pei, Kan Ren, Yuqing Yang, Chang Liu, Tao Qin, and Dongsheng Li. Towards generating
real-world time series data. In 2021 IEEE International Conference on Data Mining (ICDM), pp.
469–478. IEEE, 2021.
Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising dif-
fusion models for multivariate probabilistic time series forecasting. In International Conference
on Machine Learning, pp. 8857–8868. PMLR, 2021.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2016.
David Salinas, Valentin Flunkert, and Jan Gasthaus. Deepar: Probabilistic forecasting with autore-
gressive recurrent networks, 2019.
Steven L Scott and Hal R Varian. Bayesian variable selection for nowcasting economic time series.
In Economic analysis of the digital economy, pp. 119–135. University of Chicago Press, 2015.
Lifeng Shen and James Kwok. Non-autoregressive conditional diffusion models for time series
prediction. arXiv preprint arXiv:2306.05043, 2023a.
Lifeng Shen and James Kwok. Non-autoregressive conditional diffusion models for time series
prediction. arXiv preprint arXiv:2306.05043, 2023b.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, et al. Score-based generative modeling
through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
Yang Song, Conor Durkan, Iain Murray, et al. Maximum likelihood training of score-based diffusion
models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will
Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond. Self-
conditioned embedding diffusion for text generation, 2022.
Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based
diffusion models for probabilistic time series imputation, 2021.
Marina Theodosiou. Forecasting monthly and quarterly time series using stl decomposition. Inter-
national Journal of Forecasting, 27:1178–1195, 10 2011. doi: 10.1016/j.ijforecast.2010.11.002.
Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine
learning research, 9(11), 2008.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion
for prediction, generation, and interpolation. Advances in Neural Information Processing Systems,
35:23371–23385, 2022.
Qingsong Wen, Jingkun Gao, Xiaomin Song, Liang Sun, Huan Xu, and Shenghuo Zhu. Robuststl:
A robust seasonal-trend decomposition algorithm for long time series, 2018.
Qingsong Wen, Zhe Zhang, Yan Li, and Liang Sun. Fast robuststl: Efficient and robust seasonal-
trend decomposition for time series with complex patterns. In Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2203–2213,
2020.
Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Cost: Contrastive
learning of disentangled seasonal-trend representations for time series forecasting, 2022a.
Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Etsformer: Exponential
smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022b.
Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition trans-
formers with auto-correlation for long-term series forecasting. Advances in Neural Information
Processing Systems, 34:22419–22430, 2021.
Julian Wyatt, Adam Leach, Sebastian M. Schmon, and Chris G. Willcocks. Anoddpm: Anomaly
detection with denoising diffusion probabilistic models using simplex noise. In 2022 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 649–655,
2022. doi: 10.1109/CVPRW56347.2022.00080.
Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequen-
tial data via causal optimal transport. Advances in neural information processing systems, 33:
8798–8809, 2020.
Dazhi Yang, Vishal Sharma, Zhen Ye, Li Hong Idris Lim, Lu Zhao, and Aloysius Aryaputera. Fore-
casting of global horizontal irradiance by exponential smoothing, using decompositions. Energy,
81:111–119, 03 2015. doi: 10.1016/j.energy.2014.11.082.
Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial net-
works. Advances in neural information processing systems, 32, 2019.
Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu,
and Ying Nian Wu. Latent diffusion energy-based model for interpretable text modeling, 2022.
Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and
Bixiong Xu. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 36, pp. 8980–8987, 2022.
Luca Zancato, Alessandro Achille, Giovanni Paolini, Alessandro Chiuso, and Stefano Soatto. Stric:
Stacked residuals of interpretable components for time series anomaly detection. 2022.
George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eick-
hoff. A transformer-based framework for multivariate time series representation learning. In
Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp.
2114–2124, 2021.
Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency
enhanced decomposed transformer for long-term series forecasting, 2022.
A RELATED WORK
Time Series Generation. Deep generative models of all kinds have exhibited high-quality samples
in a wide variety of domains, where time series generation is one of the most challenging tasks in
the generative research field. Existing literature for time-series synthesis is based predominantly on
generative adversarial networks (GANs), and many GAN-based architectures use recurrent networks
to model temporal dynamics (Mogren, 2016; Esteban et al., 2017; Yoon et al., 2019; Pei et al., 2021;
Jeon et al., 2022). Among these GAN-based works, Mogren (2016) first introduced a GAN structure
with LSTM called C-RNN-GAN. Then a Recurrent Conditional GAN (RCGAN) (Esteban et al.,
2017) was proposed which uses a basic RNN as generator and discriminator and auxiliary label
information as a condition to generate medical time series. Yoon et al. developed TimeGAN (Yoon
et al., 2019) by adding an embedding function and supervised loss to the original GAN for capturing
the temporal dynamics of data throughout time. In addition, Pei et al. (2021) proposed RTSGAN,
and Paul et al. (2022) designed PSA-GAN that generates long univariate time series samples of high
quality using progressive growing of GANs along with self-attention.
Due to the instabilities typical of adversarial objectives, another line of work in the time series field
recently focuses on other deep generative methods. Jarrett et al. (2021a) imitates the sequential be-
havior of time series using a stepwise-decomposable energy model as reinforcement. Fourier Flows
(Alaa et al., 2021) was proposed as a method based on normalizing flows followed by a chain of
spectral filters leading to an exact likelihood optimization. Desai et al. (2021) introduced TimeVAE
which implements an interpretable temporal structure, and achieves reasonable results on time series
synthesis with Variational Autoencoder (VAE). Most recently, Implicit neural representations (INRs)
have also been used for time series generation. Fons et al. (2022) leverages the latent embeddings
learned by the hypernetwork for the synthesis of new time series by interpolation.
Denoising Diffusion Probabilistic Models. Denoising diffusion probabilistic models (DDPMs) are
a new class of generative models with nice properties (Ho et al., 2020). They have demonstrated
great success in both continuous and discrete data domains, producing images (Ho et al., 2020;
Dhariwal & Nichol, 2021; Gu et al., 2022), text (Li et al., 2022a; Yu et al., 2022; Strudel et al.,
2022), audio (Kong et al., 2021), and video (Luo et al., 2023; Harvey et al., 2022; Ho et al., 2022)
that have state-of-the-art sample quality. Very recently, diffusion models have also been developed
for time series data. TimeGrad (Rasul et al., 2021) is a conditional diffusion model which predicts
in an autoregressive manner, with the denoising process guided by the hidden state of a recurrent
neural network. CSDI (Tashiro et al., 2021) and SSSD (Alcaraz & Strodthoff, 2022) use a self-
supervised masking to guide the denoising process like image inpainting. To alleviate the problem of
boundary disharmony, the non-autoregressive diffusion model TimeDiff (Shen & Kwok, 2023b) uses
future mixup and autoregressive initialization for conditioning. Regarding unconditional generation,
Lim et al. (2023) employ recurrent neural networks as the backbone to produce regular 24-length time series using an SGM. Most recently, building on the structured state space model used in SSSD, Kollovieh et al. (2023) conduct univariate time series generation and forecasting with a self-guidance strategy. Meanwhile, Coletta et al. (2023) approximated the diffusion function based on CSDI, where
they remove the side information provided as an embedding. They also introduced GuidedDiffTime to handle new constraints, such as trends or fixed values, without re-training.
B DENOISING DIFFUSION PROBABILISTIC MODELS
In this section, we provide a brief overview of DDPMs. At a high level, diffusion models sample from a distribution by reversing a gradual noising process. In particular, the forward process $q$ gradually corrupts the original data $x_0 \in \mathbb{R}^d$ via a fixed Markov chain $x_0, \ldots, x_T$, with each variable in $\mathbb{R}^d$, as follows:
$$q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I), \qquad q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad (13)$$
where $\beta_t \in (0, 1)$ is the variance at diffusion step $t$. The increasingly noisy variables $x_{1:T}$ have the same dimensionality as $x_0$, and $x_T$ is an isotropic Gaussian.
A notable property of the forward process is that, using the notation $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, we can sample $x_t$ at any arbitrary time step $t$ in closed form:
$$q(x_t \mid x_0) := \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I). \quad (14)$$
Thus, using the reparameterization trick (Kingma & Welling, 2013) and defining $\epsilon \sim \mathcal{N}(0, I)$, we have
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon. \quad (15)$$
Starting from the noise $x_T$, we can run the reverse process, parameterized by the model $p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$, to get $x_0$. The diffusion model is trained to maximize the marginal likelihood of the data $\mathbb{E}_{x_0}[\log p_\theta(x_0)]$, and we can write the variational lower bound (VLB) as follows:
$$\mathcal{L}_{\mathrm{vlb}} := \underbrace{-\log p_\theta(x_0 \mid x_1)}_{\mathcal{L}_0} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\!\left( q(x_{t-1} \mid x_t, x_0)\, \|\, p_\theta(x_{t-1} \mid x_t) \right)}_{\mathcal{L}_{t-1}} + \underbrace{D_{\mathrm{KL}}\!\left( q(x_T \mid x_0)\, \|\, p(x_T) \right)}_{\mathcal{L}_T}. \quad (16)$$
The main objective is a sum of independent terms $\mathcal{L}_{t-1}$. There are many different ways to parameterize $\mu_\theta(x_t, t)$, and the most obvious option is to predict $\mu_\theta(x_t, t)$ directly:
$$\mathcal{L}_{\mathrm{origin}} := \mathbb{E}_{t, x_0} \left[ \frac{1}{2\Sigma^2_\theta(x_t, t)} \| \hat{\mu}(x_t, x_0) - \mu_\theta(x_t, t) \|^2 \right], \quad (17)$$
where $\hat{\mu}(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_t, x_0)$, which is defined as follows:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left( x_{t-1}; \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t, \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t I \right). \quad (18)$$
Finally, Ho et al. (2020) found that predicting $\epsilon$ worked best, especially when combined with a reweighted loss function:
$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]. \quad (19)$$
But they also admitted that there is the possibility of predicting $x_0$ directly. Note that estimating $\epsilon$ in Equation 19 is equivalent to estimating a scaled version of the score function, i.e., the gradient of the log density of the noisy data:
$$\nabla_{x_t} \log q_t(x_t \mid x_0) \approx -\frac{1}{1 - \bar{\alpha}_t} \left( x_t - \sqrt{\bar{\alpha}_t}\, x_0 \right) = -\frac{1}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon. \quad (20)$$
Thus, data generation through denoising depends on the score function and can be seen as noise-conditional score-based generation (Voleti et al., 2022). Then the reverse process $p_\theta(x_{t-1} \mid x_t)$ in DDPM can be written as:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad (21)$$
where $\sigma_t$ is a hyperparameter and $z$ is standard Gaussian noise.
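As a minimal sketch, the reverse update of Eq. (21) can be written as follows for a noise-prediction network; the choice $\sigma_t = \sqrt{\beta_t}$, the scalar time index, and the `eps_theta` interface are assumptions for illustration.

```python
import torch

def p_sample_step(eps_theta, x_t, t, betas, alphas, alpha_bars):
    """One reverse step x_t -> x_{t-1} following Eq. (21); t is a Python int."""
    eps = eps_theta(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                               # no noise is added at the final step
    sigma_t = betas[t].sqrt()                     # one common choice for sigma_t
    return mean + sigma_t * torch.randn_like(x_t)
```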
In this section, we present additional experiments that were omitted from the main body of the paper due to limited space.
In order to further verify the stability of its performance in producing long multivariate time series data, we add the discriminative score and the predictive score:
Table 3: Detailed Results of Long-term Time-series Generation. (Bold indicates best performance).
Dataset Length Diffusion-TS TimeGAN TimeVAE Diffwave DiffTime Cot-GAN
Context-FID 64 0.631±.058 1.130±.102 0.827±.146 1.543±.153 1.279±.083 3.008±.277
Score 128 0.787±.062 1.553±.169 1.062±.134 2.354±.170 2.554±.318 2.639±.427
(Lower the Better) 256 0.423±.038 5.872±.208 0.826±.093 2.899±.289 3.524±.830 4.075±.894
Correlational 64 0.082±.005 0.483±.019 0.067±.006 0.186±.008 0.094±.010 0.271±.007
Score 128 0.088±.005 0.188±.006 0.054±.007 0.203±.006 0.113±.012 0.176±.006
(Lower the Better) 256 0.064±.007 0.522±.013 0.046±.007 0.199±.003 0.135±.006 0.222±.010
ETTh
Figure 7: t-SNE plots of time series of length 24, 64, 128 and 256 (panels (a)-(d), ETTh) synthesized by Diffusion-TS and TimeGAN. Red dots represent real data instances, and blue dots represent generated data samples in all plots.
Figure 8: PCA plots of time series of length 24, 64, 128 and 256 (panels (a)-(d), ETTh) synthesized by Diffusion-TS and TimeGAN. Red dots represent real data instances, and blue dots represent generated data samples in all plots.
To further compare the diversity of the time series generated by Diffusion-TS and TimeGAN on the ETTh dataset, we applied PCA and t-SNE analyses to visualize how well the generated time series distributions cover the real data distributions. According to the figures, synthetic samples generated by Diffusion-TS have significantly more overlap with the original data than those of the SOTA method.
To demonstrate that the excellent performance of Diffusion-TS extends to further baselines, we repeat the imputation and forecasting experiments of Alcaraz & Strodthoff (2022). All baseline results were collected from the original publications. We first conduct imputation experiments on the MuJoCo sequences of length 100. We report the MSE for a single imputation per sample on the test set, averaged over 5 trials. Table 4 shows the empirical results on the MuJoCo dataset, where Diffusion-TS outperforms all baselines except CSDI on the 70% missing ratio. We then demonstrate Diffusion-TS's forecasting capabilities on the Solar dataset collected from GluonTS (Alexandrov et al., 2020), where the conditional values and forecast horizon are 168 and 24 time steps, respectively. The accuracy of all models on the long time series is shown in Table 5. Similar to the observation made in the imputation tasks, our proposed method achieves significantly better performance. These results further support the effectiveness of our proposed model, as well as its ability to process long time series.
Table 4: Imputation MSE results for the MuJoCo data set. Here, we use a concise error notation
where the values in brackets affect the least significant digits e.g. 0.572(12) signifies 0.572 ± 0.012.
Similarly, all MSE results are in the order of 1e-3.
Model 70% Missing 80% Missing 90% Missing
RNN GRU-D 11.34 14.21 19.68
ODE-RNN 9.86 12.09 16.47
NeuralCDE 8.35 10.71 13.52
Latent-ODE 3 2.95 3.6
NAOMI 1.46 2.32 4.42
NRTSI 0.63 1.22 4.06
CSDI 0.24(3) 0.61(10) 4.84(2)
SSSD 0.59(8) 1.00(5) 1.90(3)
Diffusion-TS 0.37(3) 0.43(3) 0.73(12)
Table 5: Time series forecasting results for the solar data set.
Model MSE
GP-copula 9.8e2±5.2e1
TransMAF 9.3e2
TLAE 6.8e2±7.5e1
CSDI 9.0e2±6.1e1
SSSD 5.03e2±1.06e1
Diffusion-TS 3.75e2±3.6e1
Figures 9 and 10 show example prediction results on the Solar and Stocks datasets, respectively. As can be seen, Diffusion-TS produces high-quality predictions close to the ground truth.
Figure 9: Visualizations of time series forecasting on the Solar dataset by the proposed Diffusion-TS (history, ground truth and prediction for several example series).
In this subsection, we provide the irregular training results on four real-world datasets mentioned in
the main paper in Figure 11.
Figure 10: Visualizations of time series forecasting on the Stocks dataset by the proposed Diffusion-TS (history, ground truth and prediction for several example series).
Figure 11: Discriminative and predictive scores on Energy, fMRI, ETTh and Stocks when using different regular ratios in the cold start and irregular training experiments with 10% ∼ 20% missing values. Cold start only uses clean data, while irregular training restores and incorporates irregular time series during training. Overall, Diffusion-TS can mostly achieve the quality of no missing values via irregular training.
Figure 12: Disentanglement validation on a synthetic dataset (panels from top to bottom: dirty input, original trend, learnt trend, original season, learnt season, learnt residual, original data, reconstruction).
We did limited hyperparameter tuning in this study to find default hyperparameters that perform well across datasets. The range considered for each hyperparameter is: batch size: [32, 64, 128]; number of attention heads: [4, 8]; basic dimension: [32, 64, 96, 128]; diffusion steps: [50, 200, 500, 1000]; and guidance strength: [1., 1e-1, 5e-2, 1e-2, 1e-3]. In Table 6, we evaluate the impact of different guidance strengths according to the mean squared errors on the MuJoCo dataset. We notice that increasing or reducing the strength does not improve the results of conditional generation. A single Nvidia 3090 GPU is used for model training. In all of our experiments, we use cosine noise scheduling and optimize our network using Adam with $(\beta_1, \beta_2) = (0.9, 0.96)$. A linearly decaying learning rate starts at 0.0008 after 500 iterations of warmup. For conditional generation, we set the number of inference steps and $\gamma$ to 200 and 0.05, respectively. We use 90% of the dataset for training and the rest for testing. In Table 7, we report the model size of our method and the baselines. Diffusion models generally have more parameters than other methods, but our method has fewer parameters than Diffwave. In Table 8, we list the detailed hyperparameter settings.
Table 6: Diffusion-TS with different guidance strengths γ ∈ [1., 1e-1, 5e-2, 1e-2, 1e-3].
γ 70% Missing 80% Missing 90% Missing
1. 2.8(13) 4.1(10) 6.8(17)
1e-1 0.37(4) 0.45(0) 0.82(9)
5e-2 0.37(3) 0.43(3) 0.73(12)
1e-2 0.60(5) 0.70(10) 1.07(14)
1e-3 3.1(8) 7.2(20) 19.6(22)
Table 8: Hyperparameters, training details, and compute resources used for each model
Parameter Sines Stocks ETTh MuJoCo Energy fMRI
attention heads 4 4 4 4 4 4
attention head dimension 16 16 16 16 24 24
encoder layers 1 2 3 3 4 4
decoder layers 2 2 2 2 3 4
batch size 128 64 128 128 64 64
timesteps / sampling steps 500 500 500 1000 1000 1000
training steps 12000 10000 18000 14000 25000 15000
model size 232,177 291,318 350,459 357,214 1,135,144 1,382,290
training time 17min 15min 31min 25min 60min 44min
sampling time (every 2000) 23s 26s 31s 50s 65s 72s
To evaluate the effectiveness of Diffusion-TS, we compare the full version of Diffusion-TS with four variants: (1) w/o FFT, i.e., Diffusion-TS without the Fourier-based loss term during training; (2) w/o Interpretability, i.e., Diffusion-TS without the seasonal-trend design in the network; (3) w/o Transformer, i.e., Diffusion-TS based on a convolutional network without the encoder and self-attention mechanism; and (4) ϵ-prediction, i.e., Diffusion-TS using the traditional noise-prediction parameterization for training and sampling. As shown in Table 9, we provide the ablation study on 24-length time series in a detailed version.
Table 9: Ablation study for model architecture and options. (Bold indicates best performance).
Metric Method Sines Stocks ETTh Mujoco Energy fMRI
Diffusion-TS 0.006±.000 0.147±.025 0.116±.010 0.013±.001 0.089±.024 0.105±.006
Context-FID w/o FFT 0.008±.001 0.216±.031 0.219±.021 0.017±.002 0.109±.014 0.189±.003
Score w/o Interpretability 0.008±.000 0.181±.029 0.121±.004 0.018±.002 0.105±.020 0.149±.010
w/o Transformer 0.007±.000 0.254±.024 0.229±.024 0.040±.003 0.496±.088 0.100±.010
(Lower the Better) ϵ-prediction 0.019±.004 0.215±.033 0.265±.008 0.017±.001 0.189±.054 0.190±.003
Diffusion-TS 0.015±.004 0.004±.001 0.049±.008 0.193±.027 0.856±.147 1.411±.042
Correlational w/o FFT 0.016±.005 0.005±.005 0.063±.012 0.204±.017 0.891±.023 1.625±.021
Score w/o Interpretability 0.016±.003 0.010±.007 0.067±.009 0.199±.026 1.013±.088 1.340±.050
w/o Transformer 0.016±.005 0.042±.007 0.055±.005 0.254±.060 2.164±.057 1.061±.041
(Lower the Better) ϵ-prediction 0.024±.006 0.007±.006 0.065±.008 0.208±.016 1.111±.114 1.166±.031
Diffusion-TS 0.006±.007 0.067±.015 0.061±.009 0.008±.002 0.122±.003 0.167±.023
Discriminative w/o FFT 0.007±.006 0.127±.019 0.096±.007 0.010±.002 0.135±.004 0.177±.013
Score w/o Interpretability 0.009±.006 0.101±.096 0.071±.010 0.021±.014 0.125±.003 0.267±.034
w/o Transformer 0.010±.007 0.104±.024 0.082±.006 0.039±.014 0.324±.015 0.123±.064
(Lower the Better) ϵ-prediction 0.040±.011 0.131±.014 0.099±.010 0.023±.006 0.197±.001 0.168±.030
Diffusion-TS 0.093±.000 0.036±.000 0.119±.002 0.007±.000 0.250±.000 0.099±.000
Predictive w/o FFT 0.093±.000 0.038±.000 0.121±.004 0.008±.001 0.250±.000 0.100±.001
Score w/o Interpretability 0.093±.000 0.037±.000 0.119±.008 0.008±.001 0.251±.000 0.101±.000
w/o Transformer 0.093±.000 0.036±.000 0.126±.004 0.008±.000 0.319±.006 0.099±.000
(Lower the Better) ϵ-prediction 0.097±.000 0.039±.000 0.120±.002 0.008±.001 0.251±.000 0.100±.000
Original 0.094±.001 0.036±.001 0.121±.005 0.007±.001 0.250±.003 0.090±.001
We also conduct a fine-grained ablation study to verify the effectiveness of our proposed disentangled
framework. The ablation results are shown in Table 10. We find that the model performance
drops no matter which part is removed, which validates that all of our proposed disentangled rep-
resentations play an important role in generative tasks and jointly enhance the performance of the
final model.
We now introduce the imputation method with Diffwave and Diffusion-TS-R used for the
experiments in Section 4.5. Given an irregular sample $x = [x^{ob}, x^{ta}]$, where $x^{ob}$ denotes
all observed values and $x^{ta}$ denotes all missing values, Song et al. (2020) proposed a general
method for conditional sampling from the jointly trained diffusion model $p_\theta(x)$. Defining the
known and unknown dimensions of $x_t$ as $\Omega(x_t)$ and $\bar{\Omega}(x_t)$ respectively, they have
$p_t\left(\bar{\Omega}(x_t) \mid \Omega(x_0) = y\right) \approx p_t\left(\bar{\Omega}(x_t);\hat{\Omega}(x_t)\right)$,
where $\hat{\Omega}(x_t)$ denotes samples from $p_t\left(\Omega(x_t) \mid \Omega(x_0) = y\right)$, and
$\bar{\Omega}(x_t);\hat{\Omega}(x_t)$ represents the concatenation of the two sets of dimensions. More
concretely, in their approach the samples for $x_t^{ob}$ are replaced at each iteration by exact samples
from the forward process $q(x_t^{ob} \mid x^{ob})$ in Equation 14, while the sampling procedure for
updating $x_t^{ta}$ still samples from $p_\theta(x_t \mid x_{t+1})$. The samples $x_t$ then have the correct
marginal distribution, and $x_t^{ta}$ will conform with $x_t^{ob}$ through the denoising process. Using
this strategy, we can generate an intact sample that follows the correct conditional distribution in
addition to the correct marginal.
[Figure: architecture diagram showing the forward process q(x_t | x_{t−1}) and the conditional reverse process p(x_{t−1} | x_t, y) over x_T, ..., x_t, x_{t−1}, ..., x_0; the diffusion-step embedding; stacked encoder blocks with multihead attention, feed-forward, LayerNorm, and positional encoding; and decoder blocks with multihead attention, trend synthesis, and Fourier synthetic layers whose outputs are combined into x̂_0.]
Figure 13: Overview of Diffusion-TS, which uses a Transformer-inspired architecture with the decomposition design.
We will refer to this infilling method for conditional sampling as replace-based imputation
with a diffusion model.
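A minimal sketch of this replace-based sampling loop is given below; `p_sample` is a placeholder for one reverse diffusion step of the trained model, and the standard DDPM forward marginal is assumed for Equation 14.

```python
import torch

@torch.no_grad()
def replace_based_imputation(model, x_obs, mask, betas):
    """Minimal sketch of replace-based conditional sampling (Song et al., 2020).
    x_obs: observed values (zeros at missing positions); mask: 1 = observed."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    x_t = torch.randn_like(x_obs)                  # start from pure noise
    for t in reversed(range(len(betas))):
        # forward-diffuse the observed part to the current noise level
        noise = torch.randn_like(x_obs)
        x_obs_t = (alphas_cumprod[t].sqrt() * x_obs
                   + (1.0 - alphas_cumprod[t]).sqrt() * noise)
        # keep exact samples on known dims, model samples on unknown dims
        x_t = mask * x_obs_t + (1.0 - mask) * x_t
        # one reverse step p_theta(x_{t-1} | x_t); `p_sample` is a placeholder
        x_t = model.p_sample(x_t, t)
    return mask * x_obs + (1.0 - mask) * x_t       # final imputed sample
```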
E MODEL DETAILS
We now describe the details of our model architecture. The overview of Diffusion-TS is shown
in Figure 13. As described in the main paper, the framework contains two parts: a sequence encoder and
an interpretable decoder. Each encoder block contains a full attention block and a feed-forward network
block, while each transformer block in the decoder consists of a full attention and a cross attention
that incorporates the encoded information. For the diffusion embedding, we follow previous works (Ho
et al., 2020; Gu et al., 2022) and use the transformer sinusoidal positional embedding to encode the diffu-
sion step. The diffusion step t is injected into the network using the Adaptive Layer Normalization
operator, which can be written as $a_t\,\mathrm{LayerNorm}(w) + b_t$, where $w$ denotes the intermediate activations and $a_t$
and $b_t$ are obtained from a linear projection of the diffusion embedding.
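A possible PyTorch-style sketch of this operator is shown below; the exact projection and broadcasting details in our implementation may differ.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """Adaptive LayerNorm sketch: a_t * LayerNorm(w) + b_t, where a_t and b_t
    come from a linear projection of the diffusion-step embedding."""
    def __init__(self, dim, emb_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(emb_dim, 2 * dim)   # produces (a_t, b_t)

    def forward(self, w, t_emb):
        # w: (batch, length, dim); t_emb: (batch, emb_dim)
        a_t, b_t = self.proj(t_emb).chunk(2, dim=-1)
        return a_t.unsqueeze(1) * self.norm(w) + b_t.unsqueeze(1)
```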
F ALGORITHMS
The full reconstruction-guided sampling algorithm is illustrated on the left in Appendix F, where Replace(·)
denotes the replace-based imputation algorithm introduced in Appendix D. As mentioned above, we take multiple
gradient steps for each diffusion step to improve the generative quality. Since the reverse process
of a diffusion model can be viewed as two phases, creating in the early stages and smoothing in the
remaining stages, we run more gradient updates at large diffusion steps t to guide the generation and then
reduce the number of gradient steps in later stages to accelerate sampling. The algorithm on the right in
Appendix F presents this optimization in detail.
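As a rough sketch (not the exact algorithm from Appendix F), the step-dependent guidance schedule can look like the following; `predict_x0` and `p_sample` are placeholder names for the model's x̂_0 prediction and one reverse step, and the Replace(·) substitution is omitted for brevity.

```python
import torch

def guided_reverse_process(model, x_obs, mask, timesteps=200, gamma=0.05,
                           max_grad_steps=4, min_grad_steps=1):
    """Sketch: more guidance gradient updates at large diffusion step t,
    fewer as t shrinks, to accelerate sampling."""
    x_t = torch.randn_like(x_obs)
    for t in reversed(range(timesteps)):
        # linearly shrink the number of guidance updates as t decreases
        n_grad = max(min_grad_steps,
                     round(max_grad_steps * t / max(timesteps - 1, 1)))
        for _ in range(n_grad):
            with torch.enable_grad():
                x_in = x_t.detach().requires_grad_(True)
                x0_hat = model.predict_x0(x_in, t)
                loss = ((mask * (x0_hat - x_obs)) ** 2).sum()
                grad = torch.autograd.grad(loss, x_in)[0]
            x_t = x_in - gamma * grad            # reconstruction guidance
        x_t = model.p_sample(x_t.detach(), t)    # one reverse diffusion step
    return x_t
```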
G EXPERIMENT DETAILS
G.1 DATASETS
Table 11 shows the statistics of the datasets, and all datasets are available online via the provided links.
G.2 BASELINES
We adapt and optimize the following publicly available source code for the generative experiments.
• TimeGAN: https://github.com/jsyoon0823/TimeGAN
• TimeVAE: https://github.com/abudesai/timeVAE
• Cot-GAN: https://github.com/tianlinxu312/cot-gan
• Diffwave: https://diffwave-demo.github.io/
• CSDI / DiffTime: https://github.com/ermongroup/CSDI
For the GAN-based methods, we use a 3-layer GRU-based neural network architecture with a hidden
size 4 times larger than the feature size. We modify the settings of TimeVAE and Diffwave so
that they have roughly the same number of trainable parameters as ours. For CSDI, we set the number of
residual layers to 4, the residual channels to 64, and the number of attention heads to 8. We also change the kernel size
of the convolutional layers following the settings of DiffTime. Finally, both Diffwave and CSDI use
the same diffusion settings (e.g., diffusion steps) as ours where applicable.
Discriminative & Predictive score. The discriminative score is calculated as |accuracy − 0.5|,
while the predictive score is the mean absolute error (MAE) between the predicted values
and the ground-truth values on the test data. For a fair comparison, we reuse the experimental settings
of TimeGAN (Yoon et al., 2019) for the discriminative and predictive scores. Both the classifier and the
sequence-prediction model use a 2-layer GRU-based neural network architecture.
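For illustration, a minimal sketch of the post-hoc classifier and score computation might look as follows; the hidden size and the training loop are illustrative assumptions rather than our exact setup.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Post-hoc 2-layer GRU classifier used for the discriminative score."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, length, features)
        _, h = self.gru(x)
        return self.head(h[-1]).squeeze(-1)      # logit: real vs. synthetic

def discriminative_score(logits, labels):
    """|accuracy - 0.5| on held-out real (label 1) vs. synthetic (label 0)."""
    acc = ((logits > 0).float() == labels).float().mean().item()
    return abs(acc - 0.5)
```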
Context-FID score. A lower FID score means the synthetic sequences are distributed closer to the
original data. Paul et al. (2022) proposed a Fréchet Inception distance (FID)-like score, the Context-FID
(Context-Fréchet Inception distance) score, by replacing the Inception model of the original FID with
a time series representation learning method called TS2Vec (Yue et al., 2022). They showed that
the lowest-scoring models correspond to the best-performing models in downstream tasks and that
the Context-FID score correlates with the downstream forecasting performance of the generative
model. Specifically, we first sample synthetic and real time series respectively, and then
compute the FID score on the representations after encoding them with a pre-trained TS2Vec model.
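A minimal sketch of the FID computation on the TS2Vec embeddings is shown below; the encoding step itself is omitted, and the embedding arrays are assumed inputs.

```python
import numpy as np
from scipy import linalg

def context_fid(emb_real, emb_fake):
    """FID between TS2Vec embeddings of real and synthetic series.
    emb_real, emb_fake: (n_samples, embedding_dim) arrays obtained from a
    pre-trained TS2Vec encoder (not shown here)."""
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                   # drop tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```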
Correlational score. Following Ni et al. (2020), we estimate the covariance of the i-th and j-th
feature of a time series as follows:
$$\operatorname{cov}_{i,j} = \frac{1}{T}\sum_{t=1}^{T} X_i^t X_j^t - \left(\frac{1}{T}\sum_{t=1}^{T} X_i^t\right)\left(\frac{1}{T}\sum_{t=1}^{T} X_j^t\right). \tag{22}$$
Then the metric on the correlation between the real data and the synthetic data is computed by
$$\frac{1}{10}\sum_{i,j}^{d}\left|\frac{\operatorname{cov}^{r}_{i,j}}{\sqrt{\operatorname{cov}^{r}_{i,i}\operatorname{cov}^{r}_{j,j}}} - \frac{\operatorname{cov}^{f}_{i,j}}{\sqrt{\operatorname{cov}^{f}_{i,i}\operatorname{cov}^{f}_{j,j}}}\right|, \tag{23}$$
where d is the number of features and the superscripts r and f denote covariances estimated from the real and synthetic (fake) data, respectively.
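A rough NumPy sketch of this metric is given below; pooling samples and time steps when estimating the covariance is an assumption, since the text does not fully specify this detail.

```python
import numpy as np

def correlational_score(real, fake):
    """Rough sketch of Eqs. (22)-(23): compare the feature-correlation
    matrices of real and synthetic data.
    real, fake: arrays of shape (n_samples, length, n_features)."""
    def corr_matrix(x):
        flat = x.reshape(-1, x.shape[-1])        # pool samples and time steps
        cov = np.cov(flat, rowvar=False)         # Eq. (22), pooled estimate
        std = np.sqrt(np.diag(cov))
        return cov / np.outer(std, std)
    diff = np.abs(corr_matrix(real) - corr_matrix(fake))
    return diff.sum() / 10.0                     # Eq. (23)
```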
H ADDITIONAL VISUALIZATIONS
We provide additional visualization and distribution results in Figure 14 and Figure 15.
I ADDITIONAL EXAMPLES
We present additional examples illustrating the characteristics of time-series imputation and
forecasting on the Energy dataset in Figures 16 to 19.
[Figure: rows correspond to Sines, Stocks, ETTh, MuJoCo, Energy, and fMRI; columns are (a) Ours, (b) TimeGAN, (c) TimeVAE, (d) Diffwave, (e) DiffTime, (f) Cot-GAN.]
Figure 14: t-SNE visualizations of all methods on multivariate datasets. Red is for original data, and
blue for synthetic data.
[Figure: density plots; rows include Sines, ETTh, MuJoCo, Energy, and fMRI; columns are (a) Ours, (b) TimeGAN, (c) TimeVAE, (d) Diffwave, (e) DiffTime, (f) Cot-GAN.]
Figure 15: Distributions of all methods on multivariate datasets. Blue solid line is for original data,
and yellow dotted line for synthetic data.
Figure 16: Comparison of imputation for the Energy dataset (90% missing). The result is for a time
series sample with all 28 features. The red crosses show observed values and the blue circles show
ground-truth imputation targets. Green and gray colors correspond to Diffusion-TS and Diffwave,
respectively. For each method, median values of imputations are shown as the line and 5% and 95%
quantiles are shown as the shade.
Figure 17: Comparison of imputation for the Energy dataset (50% missing). The result is for a time
series sample with all 35 features. The red crosses show observed values and the blue circles show
ground-truth imputation targets. Green and gray colors correspond to Diffusion-TS and Diffwave,
respectively. For each method, median values of imputations are shown as the line and 5% and 95%
quantiles are shown as the shade.
Figure 18: Comparison of forecasting for the Energy dataset (36 forecasting window). The result
is for a time series sample with all 28 features. The red crosses show observed values and the blue
circles show ground-truth imputation targets. Green and gray colors correspond to Diffusion-TS and
Diffwave, respectively. For each method, median values of imputations are shown as the line and
5% and 95% quantiles are shown as the shade.
Figure 19: Comparison of forecasting for the Energy dataset (24 forecasting window). The result
is for a time series sample with all 28 features. The red crosses show observed values and the blue
circles show ground-truth imputation targets. Green and gray colors correspond to Diffusion-TS and
Diffwave, respectively. For each method, median values of imputations are shown as the line and
5% and 95% quantiles are shown as the shade.