Probabilistic Forecasting of Real-Time Electricity Market Signals
via Interpretable Generative AI

This work was supported in part by the National Science Foundation under Award EECS 2218110.

Abstract

This paper introduces a generative AI approach to probabilistic forecasting of real-time electricity market signals, including locational marginal prices, interregional price spreads, and demand-supply imbalances. We present WIAE-GPF, a Weak Innovation AutoEncoder-based Generative Probabilistic Forecasting architecture that generates future samples of multivariate time series. Unlike traditional black-box models, WIAE-GPF offers interpretability through the Wiener-Kallianpur innovation representation for nonparametric time series, making it a nonparametric generalization of the Wiener/Kalman filter-based forecasting. A novel learning algorithm with structural convergence guarantees is proposed, ensuring that, under ideal training conditions, the generated forecast samples match the ground truth conditional probability distribution. Extensive tests using publicly available data from U.S. independent system operators under various point and probabilistic forecasting metrics demonstrate that WIAE-GPF consistently outperforms classical methods and cutting-edge machine learning techniques.

Keywords: Probabilistic forecasting, electricity price forecasting, representation learning, generative artificial intelligence, energy and ancillary market forecasting, risk-sensitive market operations.
Journal: International Journal of Forecasting
Affiliation: School of Electrical and Computer Engineering, Cornell University, Ithaca, New York 14853, USA

1 Introduction

A key feature of generative AI is its ability to produce artificial samples that closely resemble real-world data. In particular, generative AI learns the underlying structure of a phenomenon from examples, enabling it to generate an arbitrarily large number of artificial samples that exhibit the same properties as the original data. Leveraging advanced neural networks and machine learning techniques, generative AI has achieved performance in real-world applications that far surpasses conventional methods [1].

The classical field of Probabilistic Forecasting (PF) naturally aligns with generative AI. In particular, PF predicts the conditional probability distribution of the future given past time series observations. Once this distribution is known, Monte Carlo samples of the future series can be generated. However, forecasting such conditional distributions presents significant computational and sampling challenges. First, nonparametric distribution forecasting is an infinite-dimensional functional estimation problem, often requiring finite-dimensional reductions, such as histograms with a finite number of bins or quantiles with a finite set of levels. Second, for continuous time series, each future realization is uniquely tied to a specific past history, making it inherently difficult to learn the true conditional distribution from data. The standard approach to PF is to assume a parametric model, reducing the infinite-dimensional forecasting problem to a finite-dimensional prediction of distribution parameters.

This paper applies generative AI principles to derive nonparametric Generative Probabilistic Forecasting (GPF), bypassing the modeling, computational, and sample-complexity obstacles associated with forecasting conditional probability distributions. Currently, no nonparametric GPF techniques exist. By developing practical GPF solutions, we demonstrate the somewhat surprising conclusion that the GPF problem is simpler and more practical than PF.

As powerful as generative AI has become, it is often criticized as mysterious and uninterpretable, which casts doubt on whether the generated samples indeed follow the ground-truth distribution or merely appear to resemble the training data. For time series forecasting problems, it is difficult, if not impossible, to compare the similarities between training data and generated samples. No existing GPF technique offers any guarantee that the generated samples follow the correct conditional distribution. This paper aims to derive an interpretable GPF by making connections to the classic theory of time series representation pioneered by Wiener, Kallianpur, and Rosenblatt [2, 3].

1.1 Literature Review

The literature on parametric and nonparametric probabilistic forecasting of electricity market signals is extensive. In this review, we focus on short-term (real-time) forecasting techniques for wholesale electricity prices and dispatch quantities, such as area demand-supply imbalances. Drawing on previous surveys [4, 5, 6], we place particular emphasis on machine learning-based techniques developed in the past decade.

Wholesale electricity prices and dispatch imbalances are endogenously determined by optimization-based market clearing processes, making them highly volatile due to their sensitivity to binding constraints. Consequently, real-time prices and dispatch quantities exhibit behavior distinct from exogenous physical processes like wind, solar, and demand time series. For this reason, we exclude the extensive body of energy forecasting literature, although those techniques can also be applied to price forecasting (for more on energy forecasting, see [7] and references therein). Also excluded are forecasting techniques from the market operator’s perspective, where network parameters and offers/bids are known. See, e.g., [8] and references therein.

Probabilistic forecasting methods generally fall into parametric or nonparametric categories. Parametric approaches model future time series variables by predicting a parameterized conditional probability distribution, thus reducing an infinite-dimensional inference problem to a finite-dimensional one. Popular parametric methods include autoregressive and moving average models [9, 10, 11, 12, 13, 14], Gaussian models [15], Student’s t-distribution [16], and others [17]. While parametric models offer computational tractability, they often sacrifice accuracy due to model mismatches.

Nonparametric forecasting has a long history (see [4] for a review up to 1997). These methods estimate the underlying probability distribution or its properties, such as quantiles, without assuming a specific parametric form. However, classical nonparametric techniques [18] face significant sample and computational challenges, especially when time series exhibit complex temporal dependencies. Quantile regression is among the most popular techniques for forecasting electricity prices. By estimating multiple quantiles, it approximates the underlying probability density function through a histogram. A well-known application of quantile regression for day-ahead LMP forecasting can be found in [19].

Over the past decade, deep learning technologies have significantly advanced point forecasting, PF, and GPF methods, utilizing various architectures and learning principles. Examples include Extreme Learning Machines (ELM) [20], Recurrent Neural Networks (RNN) [21], Variational Autoencoders (VAE) [22, 23, 24, 25], Long Short-Term Memory (LSTM) networks, diffusion models [26], Generative Adversarial Networks (GAN) [27], and Large Language Models (LLMs) featuring transformers and attention mechanisms [28].

Among the deep-learning-based PF and GPF techniques, the VAE-based method [22] is the most closely related to the proposed WIAE-GPF approach; both share a similar forecasting architecture but differ in the training of the autoencoders. VAE typically relies on a parametric assumption, often a conditional Gaussian distribution for the latent process, whereas WIAE-GPF imposes the innovation constraint that requires the latent process to be independent and identically distributed and uniform (IID-uniform). See Sec. 2.

LLM-based generative AI, originally developed for linguistic time series, has recently demonstrated superior performance in various applications, including electricity price forecasting [29]. The transformer architecture, with its attention mechanism, has played a pivotal role in this success. However, despite their impressive results, these non-interpretable AI techniques give little understanding of the factors that lead to effective probabilistic forecasting. In particular, there is no theoretical guarantee that LLM-based GPF can generate samples with the correct conditional probability distribution even when the training sample size is unbounded. Additionally, comprehensive empirical studies comparing LLMs to other machine learning methods for price forecasting remain scarce. To address this, we include three award-winning LLM-based models [30, 31] in our numerical comparisons in Section X to benchmark their performance against other techniques.

1.2 Summary of Contributions

We propose Weak Innovation Autoencoder-based GPF (WIAE-GPF), a novel approach inspired by the classic Wiener-Kallianpur innovation representation of nonparametric time series [2] and a relaxation by Rosenblatt [3]. A key contribution of this work is to establish formally that the GPF architecture shown in Fig. 1 is “provably correct.” By provably correct, we mean that, with an optimally trained WIAE autoencoder $(G_{\theta^{*}},H_{\eta^{*}})$, the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ at time $t$ follows the probability distribution of the future variable ${\bm{X}}_{t+T}$ given $({\bm{X}}_{0}={\bm{x}}_{0},\cdots,{\bm{X}}_{t}={\bm{x}}_{t})$, the observed time series up to time $t$.

Figure 1: Forecasting pipeline for WIAE-GPF.

WIAE-GPF stands out as an interpretable GPF model because of its connection to the classic Wiener/Kalman predictor under parametric Gaussian and state-space assumptions. Specifically, the Innovation Encoder in Fig. 1 functions analogously to a causal whitening filter (via spectral factorization), while the Decoder corresponds to the linear Wiener predictor. Furthermore, the encoder’s operation parallels the measurement update in Kalman filtering, and the decoder mirrors the prediction (time) update. In essence, WIAE-GPF can be viewed as a non-Gaussian nonparametric extension of the Wiener/Kalman predictor.

Note that the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ is a function of the observed time series ${\bm{x}}_{0:t}$ and an independently generated exogenous random vector $\tilde{\mathcal{V}}_{t}=({\bf V}_{1},\cdots,{\bf V}_{T})$ with IID uniformly distributed components. Given the observations, $\tilde{{\bm{X}}}_{t}$ is therefore a function of the random vector $\tilde{\mathcal{V}}_{t}$. By generating $\tilde{\mathcal{V}}_{t}$ from $T$ IID samples of the uniform distribution, WIAE-GPF produces realizations of $\tilde{{\bm{X}}}_{t}$ following the same conditional distribution as ${\bm{X}}_{t+T}$. The formal definition of WIAE and its learning algorithm are presented in Sec. 2. The WIAE-GPF architecture and its validity are shown in Sec. 3.
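
To make the sampling mechanism concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a scalar Gaussian AR(1) process, for which an encoder and decoder can be written in closed form, as a stand-in for a trained WIAE pair. The coefficient `A`, the noise scale `SIGMA`, and the function names `encoder`, `decoder`, and `forecast_samples` are hypothetical.

```python
import random
from statistics import NormalDist, fmean, pvariance

A, SIGMA = 0.8, 1.0          # hypothetical AR(1) model standing in for a trained WIAE
STD_NORMAL = NormalDist()    # standard normal CDF and inverse CDF


def encoder(xs):
    """Stand-in encoder: past observations -> innovations in (0, 1)."""
    prevs = [0.0] + list(xs[:-1])          # assumes the initial state is 0
    return [STD_NORMAL.cdf((x - A * p) / SIGMA) for x, p in zip(xs, prevs)]


def decoder(vs):
    """Stand-in decoder: innovation sequence -> last decoded value."""
    x = 0.0
    for v in vs:
        x = A * x + SIGMA * STD_NORMAL.inv_cdf(v)
    return x


def forecast_samples(x_past, horizon, n_samples, seed=0):
    """Draw samples of the value `horizon` steps ahead, given the observed past."""
    rng = random.Random(seed)
    nu_past = encoder(x_past)              # innovations of the observed past
    samples = []
    for _ in range(n_samples):
        # T IID uniform draws play the role of the exogenous vector.
        v_future = [rng.random() for _ in range(horizon)]
        samples.append(decoder(nu_past + v_future))
    return samples


x_past = [0.5, 1.2, 0.9, 2.0]
T = 3
samples = forecast_samples(x_past, T, n_samples=10000)
sample_mean = fmean(samples)
sample_var = pvariance(samples)
```

For this toy model the generated samples can be checked against the known conditional law of the AR(1) process: `sample_mean` should be close to `A**T * x_past[-1]` and `sample_var` close to `SIGMA**2 * (1 + A**2 + A**4)` when `T = 3`.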

Because practical implementations of WIAE are necessarily finite-dimensional, we establish a structural convergence property: the conditional distribution of the WIAE-GPF output converges to the conditional distribution of the time series. See Sec. 3.2 for details.

Despite accelerated advances in representing and learning time series models, generative AI techniques have seen only limited application in power system operations. Missing in particular are validation and comparative studies using real-world market data. We fill this gap by comparing WIAE-GPF with leading traditional and machine-learning algorithms for three applications: real-time LMP forecasting for energy markets, interregional LMP spread forecasting for interchange markets, and area control error (ACE) forecasting for regulation markets. Such comparisons are essential because these real-time market signals have characteristics not present in media signals such as video and natural language time series, where machine learning techniques have demonstrated success. Both LMP and ACE are highly dynamic time series with frequent spikes. Our comparison study offers a compelling case for WIAE-GPF across multiple performance measures for point and probabilistic forecasters.

The idea of WIAE-GPF was first presented in a preliminary version of this work [32], over which the current paper makes substantial new contributions. (Footnote: Based on a Turnitin comparison, this paper exhibits less than 15% overall similarity and less than 4% similarity to the preliminary version.) In particular, the Bayesian sufficiency theorem in Sec. 3.1 is significantly stronger than that in [32]. Also new are a theorem (Theorem 1 in Sec. 3.1) that formally establishes the validity of WIAE-GPF and a theorem on structural convergence (Theorem 2 in Sec. 3.2). We consider three specific real-time market applications, two of which were not considered in [32]. The numerical results for all three applications, as well as the analysis and discussions, are all new.

1.3 Organization and Notations

This paper is organized as follows. Sec. 2 defines a nonparametric time series model, its innovation representations, and the learning algorithm of WIAE. Sec. 3 develops WIAE-GPF, the proposed GPF algorithm. Sec. 4 presents the application of WIAE-GPF in three market operations and the comparison studies of major forecasting benchmarks.

The notations used in this paper are standard. Random variables are in capital letters and their realizations in lowercase. Boldface letters are typically used for vectors and matrices. We use $({\bm{X}}_{t})$ to denote a multivariate random time series, where the column vector ${\bm{X}}_{t}=(X_{1t},\cdots,X_{dt})$ is the time series at time $t$, and $(X_{it})$ is the $i$th sub-time series of $({\bm{X}}_{t})$. In this paper, ${\bm{X}}_{t_{1}:t_{2}}$ denotes the segment of $({\bm{X}}_{t})$ from $t_{1}$ to $t_{2}$, i.e., ${\bm{X}}_{t_{1}:t_{2}}=({\bm{X}}_{t_{1}},\cdots,{\bm{X}}_{t_{2}})$.
For two random vectors ${\bm{X}}$ and ${\bf Y}$, ${\bm{X}}\stackrel{\text{a.s.}}{=}{\bf Y}$ means the two are equal almost surely, and ${\bm{X}}\stackrel{\text{d}}{=}{\bf Y}$ means the two are equal in distribution. An IID random sequence with marginal cumulative distribution function $F$ is denoted by ${\bm{X}}_{t}\stackrel{\text{IID}}{\sim}F$. Table 1 shows the major designated symbols used in the paper.

Table 1: Abbreviations and mathematical notations used in this paper.

| Symbol | Meaning |
| --- | --- |
| GPF | Generative Probabilistic Forecasting. |
| WIAE | Weak Innovation AutoEncoder. |
| ACE | Area Control Error. |
| LLM | Large Language Model. |
| NMSE | Normalized Mean Square Error. |
| NMAE | Normalized Mean Absolute Error. |
| MASE | Mean Absolute Scaled Error. |
| sMAPE | Symmetric Mean Absolute Percentage Error. |
| CRPS | Continuous Ranked Probability Score. |
| CP | Coverage Probability. |
| CPE | Coverage Probability Error. |
| NCW | Normalized Coverage Width. |
| $({\bm{X}}_{t})$ | The random process of predictive interest. |
| $({\bf V}_{t})$ | The innovation sequence. |
| $({\bf U}_{t})$ | An IID sequence of uniform distribution. |
| $(\hat{{\bm{X}}}_{t})$ | The reconstruction sequence output by the WIAE decoder. |
| $(\hat{{\bf V}}_{t}^{(m)})$ | The weak innovation sequence estimated by an $m$-dimensional WIAE. |
| $(\hat{{\bm{X}}}_{t}^{(m)})$ | The reconstruction sequence estimated by an $m$-dimensional WIAE. |
| $({\bm{x}}_{t})$ | A sequence of real numbers indicating the past realizations of $({\bm{X}}_{t})$. |
| $({\bm{\nu}}_{t})$ | A sequence of real numbers indicating the past realizations of $({\bf V}_{t})$. |
| $G$ | WIAE encoder function. |
| $H$ | WIAE decoder function. |
| $G_{\theta}$ | A neural network approximation of $G$ parameterized by $\theta$. |
| $H_{\eta}$ | A neural network approximation of $H$ parameterized by $\eta$. |
| $D_{\gamma}$ | Innovation discriminator that measures the distance between $({\bf V}_{t})$ and $({\bf U}_{t})$. |
| $D_{\omega}$ | Reconstruction discriminator that measures the distance between $({\bm{X}}_{0:t+T})$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$. |
| ${\cal U}[0,1]^{d}$ | The continuous $d$-dimensional uniform distribution on $[0,1]^{d}$. |

2 Innovation Representation Learning

2.1 Strong and Weak Innovation Representations

In 1958, Wiener and Kallianpur proposed an innovation representation of scalar time series [2]. In the parlance of modern machine learning, an innovation representation is a causal autoencoder, shown in Fig. 2, with the latent process $({\bf V}_{t})$ being an IID-uniform innovation sequence. In particular, ${\bf V}_{t}$ represents the new information (innovation) contained in ${\bm{X}}_{t}$ that is independent of the past ${\bm{X}}_{0:t-1}=({\bm{X}}_{t-1},{\bm{X}}_{t-2},\cdots)$. Mathematically, the innovation representation of the time series is defined by causal mappings $(G,H)$ and $({\bf V}_{t})$:

\begin{align}
{\bf V}_{t} &= G({\bm{X}}_{t},{\bm{X}}_{t-1},\cdots), \tag{1.1}\\
({\bf V}_{t}) &\stackrel{\text{IID}}{\sim} {\cal U}[0,1]^{d}, \tag{1.2}\\
\hat{{\bm{X}}}_{t} &= H({\bf V}_{t},{\bf V}_{t-1},\cdots). \tag{1.3}
\end{align}

Figure 2: An autoencoder interpretation of innovation representations.

The Wiener-Kallianpur innovation autoencoder further requires that the decoder output $(\hat{{\bm{X}}}_{t})$ reconstruct the input $({\bm{X}}_{t})$ almost surely, i.e., $(\hat{{\bm{X}}}_{t})\stackrel{\text{a.s.}}{=}({\bm{X}}_{t})$, which makes the Wiener-Kallianpur autoencoder a strong innovation autoencoder. The perfect causal reconstruction implies that the innovation sequence $({\bf V}_{t})$ is a sufficient statistic for all decision-making based on $({\bm{X}}_{t})$. Therefore, using the IID-uniform $({\bf V}_{t})$ for decision-making incurs no performance loss.
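
As an illustration of a strong innovation representation, consider a scalar Gaussian AR(1) process, a case in which the encoder and decoder are available in closed form: the encoder whitens the one-step residual and maps it through the standard normal CDF, and the decoder inverts this map. The sketch below is not taken from the paper; the coefficient `A`, the noise scale `SIGMA`, the zero initial state, and the function names are assumptions made for illustration.

```python
import random
from statistics import NormalDist

A, SIGMA = 0.8, 1.0          # hypothetical AR(1) coefficient and noise std
STD_NORMAL = NormalDist()    # standard normal CDF and inverse CDF


def encode(x_t, x_prev):
    """Causal encoder: whiten the one-step residual, then map it to (0, 1)."""
    return STD_NORMAL.cdf((x_t - A * x_prev) / SIGMA)


def decode(v_t, x_prev_decoded):
    """Causal decoder: invert the encoder using the previously decoded value."""
    return A * x_prev_decoded + SIGMA * STD_NORMAL.inv_cdf(v_t)


def simulate(n, seed=0):
    """Simulate n steps of the AR(1) process starting from state 0."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n):
        x = A * x + rng.gauss(0.0, SIGMA)
        path.append(x)
    return path


xs = simulate(2000)

# Encode each observation together with its predecessor (initial state 0).
vs = [encode(x, prev) for x, prev in zip(xs, [0.0] + xs[:-1])]

# Decode recursively from the innovations alone.
x_hat, prev = [], 0.0
for v in vs:
    prev = decode(v, prev)
    x_hat.append(prev)

max_err = max(abs(a - b) for a, b in zip(xs, x_hat))   # reconstruction error
mean_v = sum(vs) / len(vs)                             # near 0.5 if uniform
```

Here `max_err` is at the level of floating-point round-off, reflecting almost-sure reconstruction, while the innovations `vs` behave like IID uniform draws on (0, 1). Outside such Gaussian settings no closed form is generally available, which is what motivates learning the pair from data.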

However, Rosenblatt showed that the Wiener-Kallianpur (strong) innovation representation does not exist for broad classes of random processes, including some of the widely used finite-state Markov chains [3]. Rosenblatt suggested a weak innovation representation, requiring that the autoencoder output $(\hat{{\bm{X}}}_{t})$ match its input $({\bm{X}}_{t})$ only in distribution:

$$
({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})\stackrel{\text{d}}{=}({\bm{X}}_{0:t+T}),\quad\forall t. \tag{2}
$$

Herein, we call the autoencoder $(G,H)$ for the weak innovation representation the Weak Innovation AutoEncoder (WIAE).

2.2 Innovation Representation Learning

Beyond the Gaussian and additive Gaussian models, there is no known algorithm to obtain WIAE, especially when the underlying time series is nonparametric with an unknown probability structure. In [33], the authors proposed a GAN-based learning of the strong innovation representation by jointly minimizing the Wasserstein distance of the latent process from the uniform IID process and the mean squared error ($l_{2}$ distance) of the autoencoder output as the estimate of the input. However, the strong innovation representation applies only to a restricted class of time series, and the joint optimization of the autoencoder with mixed Wasserstein and $l_{2}$ distance measures can be challenging. Finally, learning a scalar innovation representation limits the ability to incorporate multiple time series observations. The WIAE learning proposed below overcomes these shortcomings.

2.3 WIAE Learning

We present a deep learning approach to learn a WIAE for the weak innovation representation defined in (2). Shown in Fig. 3 is the schematic that highlights key components of the WIAE learning.

Figure 3: Training schematic of WIAE. Dashed lines indicate the flow of training information.

The encoder $G_{\theta}$ and decoder $H_{\eta}$ are causal convolutional neural networks parameterized by coefficients $\theta$ and $\eta$, respectively. The weak innovation representation, at its core, matches the input-output distributions and constrains the latent process $({\bf V}_{t})$ to be IID-uniform. To this end, we introduce two neural network discriminators, the innovation discriminator $D_{\gamma}$ and the reconstruction discriminator $D_{\omega}$, with parameters $\gamma$ and $\omega$ respectively, to enforce the IID-uniform constraint (1.2) and the distribution-matching condition (2). In particular, the innovation discriminator $D_{\gamma}$ compares the distributions of $(\hat{{\bf V}}_{t})$ and $({\bf V}_{t})$, and the reconstruction discriminator $D_{\omega}$ compares the joint distributions of ${\bm{X}}_{0:t+T}$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ with sufficiently large $T$.
These discriminators produce error signals to update the neural network parameters $(\theta,\eta,\gamma,\omega)$. In this work, we adopt the Wasserstein discriminator proposed in [34] to compute the Wasserstein distance between distributions.
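
The causality of the encoder and decoder means that each output at time $t$ depends only on inputs at times up to $t$. As a minimal sketch of this property, and not the network used in the paper, the following pure-Python function implements a single causal one-dimensional convolution with left zero-padding; the function name `causal_conv1d` and the kernel values are hypothetical.

```python
def causal_conv1d(x, weights):
    """Return y with y[t] = sum_k weights[k] * x[t - k], taking x[s] = 0 for s < 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:              # never look at inputs after time t
                acc += w * x[t - k]
        y.append(acc)
    return y


weights = [0.5, 0.3, 0.2]               # hypothetical kernel of length 3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_conv1d(x, weights)

# Causality check: changing inputs at times 3 and 4 leaves outputs 0..2 unchanged.
x_perturbed = x[:3] + [100.0, -100.0]
y_perturbed = causal_conv1d(x_perturbed, weights)
```

Stacking such layers with nonlinearities keeps the overall map causal, since a composition of causal maps is causal.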

Because the two discriminators are both based on the Wasserstein distance, the parameters $(\theta,\eta,\omega,\gamma)$ of the four networks can be obtained via a single optimization:

$$
\begin{aligned}
L:=\min_{\theta,\eta}\max_{\gamma,\omega}\Big(&\mathbb{E}\big[D_{\gamma}(({\bf U}_{t}))\big]-\mathbb{E}\big[D_{\gamma}((\hat{{\bf V}}_{t}))\big]\\
&+\lambda\big(\mathbb{E}[D_{\omega}({\bm{X}}_{0:t+T})]-\mathbb{E}[D_{\omega}(({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T}))]\big)\Big),
\end{aligned}
\tag{3}
$$

where $\lambda$ is a real number that scales the two Wasserstein distances. The two parts of the inner maximization loop of the loss function (3) regularize $(G_{\theta},H_{\eta})$ according to the IID-uniform constraint (1.2) and the distribution-matching condition (2). The inner maximization over $\gamma$ and $\omega$ evaluates the two Wasserstein distances, so minimizing its value with respect to $\theta$ and $\eta$ is equivalent to enforcing that $({\bf V}_{t})$ be IID uniform and that $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ have the same distribution as ${\bm{X}}_{0:t+T}$. The training of the four neural networks is standard; we used the off-the-shelf Adam optimizer.
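
The bracketed objective can be estimated from mini-batches by replacing each expectation with a sample average. The sketch below shows only this evaluation step for fixed critics, under stated assumptions: the critics `d_gamma` and `d_omega` are hypothetical placeholder functions rather than trained neural networks, the data segments are synthetic, and the function name `wiae_objective` is not from the paper.

```python
import random
from statistics import fmean


def wiae_objective(d_gamma, d_omega, u, v_hat, x_real, x_recon, lam=1.0):
    """Sample estimate of the bracketed objective for fixed critics.

    Each of u, v_hat, x_real, x_recon is a list of segments (lists of floats).
    """
    innovation = fmean([d_gamma(s) for s in u]) - fmean([d_gamma(s) for s in v_hat])
    reconstruction = (fmean([d_omega(s) for s in x_real])
                      - fmean([d_omega(s) for s in x_recon]))
    return innovation + lam * reconstruction


# Hypothetical fixed critics; in training these would be neural networks
# whose parameters are maximized over.
def d_gamma(segment):
    return fmean(segment)


def d_omega(segment):
    return fmean(segment)


rng = random.Random(0)


def uniform_segments(n, length):
    return [[rng.random() for _ in range(length)] for _ in range(n)]


u = uniform_segments(2000, 8)
v_bad = [[v * v for v in seg] for seg in uniform_segments(2000, 8)]  # not uniform
v_good = uniform_segments(2000, 8)                                   # uniform
x_real = uniform_segments(2000, 8)   # placeholder data segments
x_recon = x_real                     # identical segments: reconstruction term is zero

gap_bad = wiae_objective(d_gamma, d_omega, u, v_bad, x_real, x_recon)
gap_good = wiae_objective(d_gamma, d_omega, u, v_good, x_real, x_recon)
```

With this particular critic, a latent sequence that is not uniform yields a clearly positive value (`gap_bad`), whereas a uniform one yields a value near zero (`gap_good`). In actual training, the critics would be updated to maximize this quantity and the encoder and decoder to minimize it.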

In a practical implementation of WIAE, finite (input) dimensional neural networks are used. The training of a finite-dimensional WIAE via (3) must also be implemented by finite segments of the random processes involved. In Sec. 3.2, we consider the implications of such practical restrictions.

3 WIAE-GPF and its Properties

In this section, we introduce WIAE-GPF, a generative probabilistic forecasting technique based on the weak innovation representation. Specifically, given past observations $\bm{x}_{0:t}$, WIAE-GPF produces (arbitrarily many) samples of $\tilde{\bm{X}}_t$ that have the conditional distribution of $\bm{X}_{t+T}$. We next present the structure of WIAE-GPF, the Bayesian sufficiency of WIAE, and a structural convergence result for finite-dimensional implementations of WIAE.

3.1 Structure of WIAE-GPF

The structure of the proposed WIAE-GPF forecaster is shown in Fig. 1. At time $t$, given the realization $\bm{X}_{0:t} = \bm{x}_{0:t}$ and the autoencoder $(G_{\theta^*}, H_{\eta^*})$ trained by (3), the WIAE encoder $G_{\theta^*}$ generates the innovation sequence $\boldsymbol{\nu}_{0:t}$ from $\bm{x}_{0:t}$. The WIAE decoder $H_{\eta^*}$ maps $\boldsymbol{\nu}_{0:t}$ and independently generated IID-uniform pseudo innovations $\tilde{\mathcal{V}}_t \stackrel{\text{IID}}{\sim} \mathcal{U}[0,1]^T$ to produce a sample $\tilde{\bm{X}}_t = \tilde{\bm{x}}_t$.

Note that when forecasting $\bm{X}_{t+T}$, we do not have realizations of $\bm{X}_{t+1:t+T}$. The salient feature of WIAE-GPF is that it replaces samples from the unknown and arbitrarily distributed $\bm{X}_{t+1:t+T}$ by realizations of pseudo innovations $\tilde{\mathcal{V}}_t$ known to be IID uniform. Thus, once the autoencoder is trained, generating random samples with the conditional distribution of $\bm{X}_{t+T}$ is trivial.
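The sampling mechanism can be illustrated on a toy process for which the encoder and decoder are known in closed form. The sketch below assumes a scalar Gaussian AR(1) model $X_t = aX_{t-1} + \sigma\varepsilon_t$, for which $\nu_t = \Phi((x_t - a x_{t-1})/\sigma)$ is IID uniform; this model, the function names, and the clipping constant `EPS` are illustrative assumptions and stand in for the trained networks $(G_{\theta^*}, H_{\eta^*})$, which are not reproduced here.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf and inv_cdf
EPS = 1e-12                # keeps innovations strictly inside (0, 1)


def _clip(v):
    return min(max(v, EPS), 1.0 - EPS)


def encode(xs, a, sigma):
    """Toy encoder G: observations x_0..x_t -> uniform innovations nu_1..nu_t."""
    return [
        _clip(STD_NORMAL.cdf((xs[i] - a * xs[i - 1]) / sigma))
        for i in range(1, len(xs))
    ]


def decode(x0, innovations, a, sigma):
    """Toy decoder H: initial value plus innovations -> last reconstructed value."""
    x = x0
    for v in innovations:
        x = a * x + sigma * STD_NORMAL.inv_cdf(v)
    return x


def gpf_samples(xs, horizon, num_samples, a, sigma, rng):
    """Draw samples of X_{t+T} given x_{0:t} by appending T pseudo innovations."""
    nu = encode(xs, a, sigma)  # innovations of the observed past
    samples = []
    for _ in range(num_samples):
        pseudo = [_clip(rng.random()) for _ in range(horizon)]  # IID U[0,1]
        samples.append(decode(xs[0], nu + pseudo, a, sigma))
    return samples
```

For this toy model the conditional law of $X_{t+T}$ given $x_t$ is Gaussian with mean $a^T x_t$ and variance $\sigma^2\sum_{j=0}^{T-1} a^{2j}$, so the empirical moments of the generated samples can be checked against known values.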

We now establish the validity of WIAE-GPF by showing that the WIAE-GPF output $\tilde{\bm{X}}_t$ has the same conditional distribution as $\bm{X}_{t+T}$ given $\bm{X}_{0:t} = \bm{x}_{0:t}$. This is not obvious because the input of $H_{\eta^*}$ includes an exogenous random vector $\tilde{\mathcal{V}}_t$ and the weak innovation $(\mathbf{V}_{0:t})$ that may not be a sufficient statistic.

We first show that the weak innovation sequence is Bayesian sufficient, which implies that any stochastic decision involving the future time series $\bm{X}_{t+T}$ can be made without loss based on the innovations $\mathbf{V}_{0:t}$. (A statistic $T(X)$ is Bayesian sufficient for the estimation of a random variable $Y$ if the posterior distribution of $Y$ given $X$ is the same as the one given $T(X)$ [35].) The same result was first presented in [32] under the more restrictive setting of $H_{\eta^*}$ being injective.

Lemma 1 (Bayesian Sufficiency of Multivariate Weak Innovations)

Let $(\bm{X}_t)$ be a stationary time series for which the weak innovation representation exists. Let $(\mathbf{V}_t)$ be the weak innovation representation of $(\bm{X}_t)$. Then, for all $\bm{x}$ and $\bm{X}_{0:t} = \bm{x}_{0:t}$,

\[
\Pr[\bm{X}_{t+T} \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}] = \Pr\left[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid \mathbf{V}_{0:t} = \boldsymbol{\nu}_{0:t}\right]. \tag{4}
\]

Proof: By the definition of weak innovation representation,

\[
\begin{aligned}
\Pr[\bm{X}_{t+T} \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}] &= \Pr[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}] \\
&\stackrel{(a)}{=} \Pr[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid G_{\theta^*}(\bm{X}_{0:t}) = G_{\theta^*}(\bm{x}_{0:t})] \\
&= \Pr[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid \mathbf{V}_{0:t} = \boldsymbol{\nu}_{0:t}],
\end{aligned} \tag{5}
\]

where $(a)$ follows from the Markovian structure of the autoencoder, i.e., $\bm{X}_{0:t} \stackrel{G_{\theta^*}}{\rightarrow} \hat{\mathbf{V}}_{0:t} \stackrel{H_{\eta^*}}{\rightarrow} \hat{\bm{X}}_{0:t}$. By the definition of Bayesian sufficiency [35], $\mathbf{V}_{0:t} = G_{\theta^*}(\bm{X}_{0:t})$ is a sufficient statistic for $\bm{X}_{t+T}$ for all $T > 0$. $\square$

The validity of WIAE-GPF is shown next.

Theorem 1 (Validity of WIAE-GPF)

For all $T > 0$, the conditional distribution of the WIAE-GPF output $\tilde{\bm{X}}_t$ given $\bm{X}_{0:t} = \bm{x}_{0:t}$ is identical to that of $\bm{X}_{t+T}$ (given $\bm{X}_{0:t} = \bm{x}_{0:t}$), i.e.,

\[
\Pr[\bm{X}_{t+T} \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}] = \Pr[\tilde{\bm{X}}_t \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}]. \tag{6}
\]

Proof: By Lemma 1,

\[
\Pr[\bm{X}_{t+T} \leq \bm{x} \mid \bm{X}_{0:t} = \bm{x}_{0:t}] = \Pr[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid \mathbf{V}_{0:t} = \boldsymbol{\nu}_{0:t}], \tag{7}
\]

where $\mathbf{V}_{0:t} = G_{\theta^*}(\bm{X}_{0:t})$ and $\boldsymbol{\nu}_{0:t} = G_{\theta^*}(\bm{x}_{0:t})$. Now consider

\[
\begin{cases}
\hat{\bm{X}}_{t+T} = H_{\eta^*}(\mathbf{V}_{0:t}, \mathbf{V}_{t+1:t+T}) \\
\tilde{\bm{X}}_t = H_{\eta^*}(\mathbf{V}_{0:t}, \tilde{\mathcal{V}}_t),
\end{cases}
\]

where, by definition, $(\mathbf{V}_{0:t}, \mathbf{V}_{t+1:t+T}, \tilde{\mathcal{V}}_t)$ are jointly independent IID uniform sequences, and $\tilde{\mathcal{V}}_t \stackrel{\text{d}}{=} \mathbf{V}_{t+1:t+T}$. Therefore,

\[
\Pr[\hat{\bm{X}}_{t+T} \leq \bm{x} \mid \mathbf{V}_{0:t} = \boldsymbol{\nu}_{0:t}] = \Pr[\tilde{\bm{X}}_t \leq \bm{x} \mid \mathbf{V}_{0:t} = \boldsymbol{\nu}_{0:t}]. \tag{8}
\]

Combining (7) and (8), we have (6). $\square$

3.2 Structural Convergence of WIAE-GPF

This section focuses on the practical issue of finite-dimensional implementations of the WIAE and discriminators shown in Fig. 3. Clearly, no machine learning technique guarantees that a finite-dimensional implementation can extract weak innovations, even with an unbounded number of historical samples. Here we present a structural convergence result showing that, under ideal training conditions with unbounded training samples and global convergence of training, the innovations generated by a finite-dimensional WIAE converge in distribution to the true weak innovations as the input dimension grows.

The structural convergence analysis assumes that the training samples are unbounded and that the training algorithm converges to the global minimum. Let $G_{\theta^*}^{(m)}$ be the optimally trained finite (input) dimensional CNN encoder that takes $m$ consecutive time-shifted observations $\bm{X}_{t-m+1:t}$ and produces the latent process $\left(\hat{\mathbf{V}}^{(m)}_t\right)$. Likewise, let $H^{(m)}_{\eta^*}$ be the optimally trained $m$-dimensional CNN decoder that produces the WIAE output sequence $\left(\hat{\bm{X}}^{(m)}_t\right)$. The finite-dimensional discriminators that take $n$ consecutive inputs, denoted by $(D_{\omega}^{(n)}, D_{\gamma}^{(n)})$, are defined similarly. In this paper, we choose $n = m$.

To analyze the asymptotic property of finite (input) dimensional WIAE-GPF, we make the following assumptions:

  A1. Existence: The random process $(\bm{X}_t)$ has a weak innovation representation defined in (1.1)-(1.3) and (2), and there exists a causal WIAE with continuous $G$ and $H$.

  A2. Feasibility: There exists a sequence of finite-input-dimensional autoencoder functions $(G_{\bar{\theta}}^{(m)}, H_{\bar{\eta}}^{(m)})$ that converges uniformly to $(G, H)$ under the mean-squared distance metric.

  A3. Training: The training sample sizes are infinite. The training algorithm for all finite-dimensional WIAE using finite-dimensional training samples converges almost surely to the global optimum.

Theorem 2

Under (A1-A3),

\[
(\bm{X}_{0:t}, \hat{\bm{X}}_{t+T}^{(m)}) \stackrel{\text{d}}{\rightarrow} (\bm{X}_{0:t}, \bm{X}_{t+T}), \quad \forall t \tag{9}
\]

as $m$ goes to infinity.

Proof: See Appendix A.

3.3 From GPF to Point and Quantile Forecasting

GPF produces samples of the conditional probability distribution, from which point and quantile forecasts can be easily computed. Here we outline how to compute several popular point and quantile forecasts. To this end, let $\left\{\tilde{\bm{x}}_t^{(k)}, k = 1, \cdots, K\right\}$ be the set of GPF-generated samples from the probability distribution of the time series at time $t+T$ conditioned on past observations up to time $t$. For notational simplicity, we assume that $\left\{\tilde{\bm{x}}_t^{(k)}\right\}$ is sorted in ascending order.

  1. Minimum Mean Squared Error (MMSE) Forecast: The MMSE forecast is the mean of the conditional distribution. The MMSE forecast $\hat{\bm{x}}^{\text{MMSE}}_t$ by a GPF is given by the conditional sample mean

     \[
     \hat{\bm{x}}^{\text{MMSE}}_t = \frac{1}{K}\sum_{k=1}^{K} \tilde{\bm{x}}_t^{(k)}.
     \]

  2. Minimum Mean Absolute Error (MMAE) Forecast: The MMAE forecast is the median of the conditional distribution. The MMAE forecast $\hat{\bm{x}}^{\text{MMAE}}_t$ by a GPF is given by the conditional sample median

     \[
     \hat{\bm{x}}^{\text{MMAE}}_t =
     \begin{cases}
     \tilde{\bm{x}}_t^{((K+1)/2)}, & \text{if $K$ is odd} \\
     0.5\left(\tilde{\bm{x}}_t^{(K/2)} + \tilde{\bm{x}}_t^{(K/2+1)}\right), & \text{if $K$ is even}.
     \end{cases}
     \]

  3. Quantile Forecast: The GPF forecast of the $q$-quantile $\hat{\bm{x}}^{q\text{-quantile}}_t$ is given by

     \[
     \hat{\bm{x}}^{q\text{-quantile}}_t =
     \begin{cases}
     \tilde{\bm{x}}_t^{(qK)}, & \text{if $qK$ is an integer} \\
     0.5\left(\tilde{\bm{x}}_t^{([qK])} + \tilde{\bm{x}}_t^{([qK]+1)}\right), & \text{otherwise},
     \end{cases}
     \]

     where $[a]$ denotes the greatest integer not exceeding $a$.
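The three rules above translate directly into a few lines of code. The sketch below assumes scalar samples (a multivariate series would be handled one component at a time), and the function names are illustrative. The guard for $[qK] = 0$ and the floating-point tolerance in the integer check are additions that the formulas above do not specify.

```python
import math
from statistics import fmean


def mmse_forecast(samples):
    """Conditional sample mean."""
    return fmean(samples)


def mmae_forecast(samples):
    """Conditional sample median, using the odd/even rule above."""
    xs = sorted(samples)
    k = len(xs)
    if k % 2 == 1:
        return xs[(k + 1) // 2 - 1]          # 1-indexed (K+1)/2
    return 0.5 * (xs[k // 2 - 1] + xs[k // 2])


def quantile_forecast(samples, q):
    """q-quantile forecast, using the integer/non-integer rule above."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    xs = sorted(samples)
    k = len(xs)
    qk = q * k
    nearest = round(qk)
    # Tolerance guards against floating-point error in q * K (an assumption).
    if nearest >= 1 and abs(qk - nearest) < 1e-9:
        return xs[nearest - 1]               # 1-indexed qK
    # Clamp so that both order statistics exist when [qK] is 0 (an assumption).
    lower = min(max(math.floor(qk), 1), max(k - 1, 1))
    return 0.5 * (xs[lower - 1] + xs[min(lower, k - 1)])
```

Sorting inside each function removes the need for the caller to pre-sort the samples, at the cost of an extra $O(K \log K)$ pass per call.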

3.4 Evaluation Metrics

Comparing probabilistic forecasting methods is difficult due to the lack of ground truth for the underlying conditional distribution. However, because a GPF can produce arbitrarily many Monte Carlo samples (we used 1000 Monte Carlo samples and the sample average to obtain point estimates), it can be evaluated by all point forecasting metrics. More importantly, an ideal probabilistic forecaster that produces the correct conditional distribution will, under regularity conditions, perform well under any point forecasting metric. Evaluating a GPF method with a set of point forecasting metrics is therefore appropriate. To this end, we also compared WIAE-GPF with several well-tested point forecasting techniques.

We used four popular point forecasting metrics and three widely used probabilistic forecasting metrics; see Appendix B for their definitions. Normalized mean squared error (NMSE) measures the error associated with the mean of the estimated conditional probability distribution. Normalized mean absolute error (NMAE) measures the error associated with the median. Mean absolute scaled error (MASE) is the ratio of the NMAE of a method to that of the (naive) persistence predictor, which uses the latest available observation as the forecast. Symmetric mean absolute percentage error (sMAPE) averages and symmetrizes the percentage error computed at each time stamp and is less sensitive to outliers. The probabilistic forecasting metrics were the continuous ranked probability score (CRPS) [36], coverage probability error (CPE), and normalized coverage width (NCW). CRPS evaluates the quadratic difference between the predicted empirical cumulative distribution function (c.d.f.) and an indicator c.d.f. based on the ground truth. CPE and NCW are often used to evaluate prediction intervals. CPE is the deviation of the coverage probability (CP) from the nominal confidence level $\beta\%$, whereas NCW represents the width of the prediction intervals. At a similar level of CP, the method with the smaller NCW gives more accurate prediction intervals. In this paper, we computed the CPE and NCW of the 10%, 50%, and 90% intervals predicted by each probabilistic method.
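To make two of these metrics concrete, the sketch below computes CRPS from a set of forecast samples and MASE from point forecasts. The CRPS estimator uses the identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ applied to the empirical distribution, which equals the integrated squared difference between the empirical c.d.f. and the indicator c.d.f. at $y$. Scalar targets are assumed, the function names are illustrative, and any normalization the paper's own definitions apply is not reproduced here.

```python
from statistics import fmean


def crps_from_samples(samples, y):
    """CRPS of the empirical c.d.f. of `samples` against the observation y."""
    xs = sorted(samples)
    k = len(xs)
    term_obs = fmean(abs(x - y) for x in xs)  # E|X - y|
    # 0.5 * E|X - X'| via order statistics: (1/K^2) * sum_i (2i - K - 1) x_(i)
    spread = sum((2 * i - k - 1) * x for i, x in enumerate(xs, start=1)) / (k * k)
    return term_obs - spread


def mase(actuals, forecasts, last_observed):
    """Ratio of the method's MAE to that of the persistence predictor.

    last_observed[i] is the latest observation available when actuals[i]
    was forecast; a common normalization constant cancels in the ratio.
    """
    mae_model = fmean(abs(y - f) for y, f in zip(actuals, forecasts))
    mae_naive = fmean(abs(y - p) for y, p in zip(actuals, last_observed))
    return mae_model / mae_naive
```

A MASE value below one indicates that the method improves on the persistence predictor over the evaluated window.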

For probabilistic forecasting techniques, we used their conditional means as the point forecasts when evaluated by NMSE, whereas the conditional medians were used for NMAE, MASE, and sMAPE. For the quantile regression technique, BWGVT, we used the estimated 0.5-quantile as its point forecast for all metrics since it is unclear how to compute an empirical mean from quantiles. When producing interval forecasts for GPF methods, we took the empirical quantiles from the Monte Carlo forecasts of future values. In particular, the $\beta$-coverage interval was defined as the $\beta$-width interval symmetric around the sample median.
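One way to read the interval construction is as the central interval between the empirical $(1-\beta)/2$ and $(1+\beta)/2$ quantiles of the Monte Carlo samples. The sketch below follows that reading, which is an assumption, as are the nearest-rank quantile convention, the signed form of the coverage error, and the function names. The width is reported without the normalization used for NCW, whose definition is not reproduced here.

```python
import math


def empirical_quantile(sorted_samples, p):
    """Nearest-rank empirical quantile (an assumed convention)."""
    k = len(sorted_samples)
    # Small offset guards against floating-point error in p * k.
    rank = min(max(math.ceil(p * k - 1e-9), 1), k)
    return sorted_samples[rank - 1]


def central_interval(samples, beta):
    """Interval holding a fraction beta of the samples around the median."""
    xs = sorted(samples)
    lower = empirical_quantile(xs, (1.0 - beta) / 2.0)
    upper = empirical_quantile(xs, (1.0 + beta) / 2.0)
    return lower, upper


def coverage_summary(sample_sets, actuals, beta):
    """Return (coverage probability, CP - beta, mean interval width)."""
    hits = 0
    total_width = 0.0
    for samples, y in zip(sample_sets, actuals):
        lower, upper = central_interval(samples, beta)
        hits += lower <= y <= upper       # True counts as 1
        total_width += upper - lower
    n = len(actuals)
    cp = hits / n
    return cp, cp - beta, total_width / n
```

Here `sample_sets[i]` holds the forecast samples issued for the target `actuals[i]`, so the coverage probability is the fraction of targets that fall inside their own intervals.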

4 WIAE-GPF for Market Operations

We now apply WIAE-GPF to forecasting market signals such as locational marginal prices and market imbalances. At the outset, we recognize that the underlying random processes are not known to be stationary, whereas WIAE-GPF is derived from a representation of stationary processes. Here, we rely on the hypothesis that these processes are approximately stationary locally within the forecasting horizon. Our evaluations on real market data, presented here, lend some support to this hypothesis. Brief discussions of the limitations and possible extensions can be found in Sec. 5.

We conducted extensive experiments to compare leading GPF and point forecasting techniques based on a suite of performance metrics. This section summarizes our findings for three market applications where GPF is particularly valuable to system operators and market participants: (a) LMP forecasting for the optimal bidding in NYISO’s 5-minute real-time energy market, (b) GPF for the interregional LMP spread for the Coordinated Transaction Scheduling (CTS) [37] market between NYISO and PJM, and (c) ACE forecasting for regulation services using PJM’s 15-second ACE data. Common to these applications is that the forecasted variables are endogenously determined by the market operations. In contrast to exogenous variables such as wind/solar generations and inelastic demands, the LMP and ACE values are the results of dispatch and commitment optimization, where binding constraints introduce spikes in dual variables from which LMPs are computed. They are highly dynamic as shown in Fig. 4.

Figure 4: Real-time, day-ahead LMPs, and load at Long Island, and real-time LMP at NYC July 2023.

4.1 Baseline Methods in Comparison

Table 2: Comparison of the baselines.

| Algorithm | Forecasting Type | Time Series Model | Forecaster Output | ML Models |
|---|---|---|---|---|
| WIAE-GPF | Probabilistic | Nonparametric | Generative | CNN + WIAE |
| TLAE [22] | Probabilistic | Parametric | Generative | RNN + VAE |
| DeepVAR [14] | Probabilistic | Parametric (AR Model) | Model Parameters | LSTM |
| BWGVT [28] | Probabilistic | Nonparametric | Forecasted Quantiles | LLM + Quantile Regression |
| Pyraformer [30] | Point | Nonparametric | Point Estimate | LLM |
| Informer [31] | Point | Nonparametric | Point Estimate | LLM |

We compared WIAE-GPF with six leading forecasters based on their relevance to power system applications and their established reputations. See Table 2 for attributes of these techniques and references. WIAE-GPF is the only nonparametric GPF forecaster. Because there are limited nonparametric GPF techniques, we also included in our comparison popular machine-learning-based parameterized GPF and point-forecasting techniques.

For deep-learning techniques, we compared with DeepVAR [14], a multivariate generalization of the popular DeepAR that has become a key baseline for time series forecasting in multiple applications. Temporal Latent AutoEncoder (TLAE) [22] is an autoencoder-based parametric GPF in which the conditional distribution of the future time series variables is obtained by passing a conditional Gaussian process through the decoder, with the conditional mean and variance computed by the encoder. Once these parameters are estimated from observed realizations, Gaussian Monte Carlo samples are fed into the decoder to generate samples of the forecasted random variable.

We also included three popular forecasting techniques based on LLMs. The first is the award-winning Informer [31]. The second is Pyraformer [30], which captures temporal dependency at multiple granularities (Pyraformer was a keynote presentation at the International Conference on Learning Representations (ICLR) 2022). Pyraformer showed superior performance over a wide range of LLM-based point forecasting techniques. The third, BWGVT [28] (we have named the algorithm by the first letters of the authors’ last names), was developed specifically for LMP forecasting and combines quantile regression with a transformer architecture derived for LLMs.

4.2 LMP forecasting for Energy Market Participation

For a self-scheduled resource submitting a quantity bid to the energy market, the ability to forecast future prices is essential in constructing its bids and offers. With GPF generating future LMP realizations, the problem of optimal offer/bid strategies can be formulated as scenario-based stochastic optimization [38]. Our experiment was based on a use case of a merchant storage owner submitting quantity offers and bids to a deregulated wholesale market, using LMP from NYISO as the hypothetical price realizations.

The real-time market of NYISO closes 60 minutes ahead of actual delivery, which means that the forecasting horizon needs to be at least 60 minutes. Real-time LMPs and load were collected every 5 minutes, and day-ahead LMPs every hour. Two experiments were conducted to produce probabilistic forecasts of 60-minute-ahead LMPs at Long Island (LONGIL) using (a) the day-ahead prices and the current and past real-time LMPs at LONGIL, along with the system load up to the time of submitting the bid; and (b) the neighboring NYC real-time LMP in addition to the data in (a).

Electricity prices are seasonal. We selected July 2023, October 2023, January 2024, and April 2024 to represent summer, fall, winter, and spring, respectively. Our method, along with all baseline methods in Table 2, was tested using NYISO real-time price data at LONGIL collected during these months. For every week in the selected months, a forecaster was trained using historical data from the preceding 30-day period. The same forecaster was then used to produce forecasts for that week.
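The weekly retraining schedule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `weekly_windows` is hypothetical, and aligning the first test week with the first day of the month is an assumption, since the text does not state how weeks are delimited.

```python
from datetime import date, timedelta

def weekly_windows(month_start, month_end, train_days=30, test_days=7):
    """Return (train_start, train_end, test_start, test_end) tuples.

    All dates are inclusive. Each test week is preceded by a training
    window covering the `train_days` days immediately before it.
    Assumption: weeks start on the first day of the month.
    """
    windows = []
    test_start = month_start
    while test_start <= month_end:
        # The last test window is truncated at the end of the month.
        test_end = min(test_start + timedelta(days=test_days - 1), month_end)
        train_start = test_start - timedelta(days=train_days)
        train_end = test_start - timedelta(days=1)
        windows.append((train_start, train_end, test_start, test_end))
        test_start = test_end + timedelta(days=1)
    return windows

# Summer test month used in the experiments.
july = weekly_windows(date(2023, 7, 1), date(2023, 7, 31))
for train_start, train_end, test_start, test_end in july:
    print(f"train {train_start}..{train_end} -> test {test_start}..{test_end}")
```

Under these assumptions, July 2023 yields five test windows, the last one covering only July 29 to 31.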

Fig. 4 shows the real-time LMP trajectories at both LONGIL and NYC in July 2023, along with the demand and the day-ahead LMP at LONGIL. The real-time LMP at LONGIL was highly volatile in all four months. The real-time LMPs at LONGIL and NYC showed apparent spatial dependencies, while the dependencies between day-ahead and real-time LMPs at LONGIL, and between load and real-time LMP, were less obvious. The real-time LMPs for the other three seasons resemble those in summer, with summer and fall being the most volatile and spring the least. During summer and fall, real-time LMPs frequently spiked above $1000/MWh due to high demand and weather-related factors. In contrast, winter and spring experienced more stable prices, although on April 29th at 7:40 PM the real-time LMP at LONGIL surged to $6752/MWh before dropping back below $1000/MWh by 8 PM.

Table 3: Comparison of volatility between seasons
| Season | Summer | Fall | Winter | Spring | Spring (excl. Apr. 29th, 2024) |
| --- | --- | --- | --- | --- | --- |
| Standard Deviation | 52.8266 | 68.3016 | 43.6466 | 96.8191 | 21.6991 |

The standard deviations of the LMPs in each season are listed in Table 3. Although spring LMPs appear to have the highest volatility, the high standard deviation is mainly due to the extreme outlier price (over $6500/MWh) on Apr. 29th, 2024. Excluding this extreme price from the dataset, spring LMPs exhibited the lowest volatility. Winter volatility is higher than that of spring (excluding Apr. 29th), though still lower than that of summer and fall.

As illustrated by Fig. 4, the real-time LMPs at LONGIL exhibit significant spikes with absolute values far exceeding the rest. These spikes are challenging to predict, and even minor prediction errors for these spikes can disproportionately impact the overall evaluation metrics. To ensure a more informative comparison among techniques, we calculated the evaluation metrics only on the “normal” LMPs, defined as those within three standard deviations from the mean.
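The three-standard-deviation rule above can be illustrated with a short sketch. The function name `normal_mask` and the toy prices are hypothetical; computing the mean and standard deviation over the full evaluation series (rather than, say, the training set) is an assumption, as the text does not specify it.

```python
import numpy as np

def normal_mask(prices, k=3.0):
    """Boolean mask of samples within k standard deviations of the mean."""
    prices = np.asarray(prices, dtype=float)
    mu, sigma = prices.mean(), prices.std()
    return np.abs(prices - mu) <= k * sigma

# Toy series: 99 ordinary prices plus one spike of the size seen on Apr. 29th.
prices = np.array([30.0] * 50 + [35.0] * 49 + [6752.0])
mask = normal_mask(prices)

# Metrics would then be computed on prices[mask] and the matching forecasts.
print(int(mask.sum()), "of", prices.size, "samples kept")
```

In this toy series only the spike is excluded, so evaluation metrics computed on the masked samples are not dominated by a single extreme price.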

Real-time LMP forecasting performance: an overall summary

We analyzed the overall performance of all methods across the four seasons. WIAE-GPF consistently achieved the best overall performance on all evaluation metrics, both point and probabilistic. It ranked first in NMSE, CRPS, and CPE for nearly all cases, while it placed second in NMAE, MASE, and sMAPE in most instances. The strong performance of WIAE-GPF can be attributed to its focus on matching the conditional distribution, with a validity guarantee ensuring that the Monte Carlo samples generated have the same conditional distribution as the actual time series variable. The strong performance of WIAE-GPF compared with benchmark techniques under NMSE, NMAE, and other point forecasting metrics suggests that the WIAE-GPF approximates the true conditional probability distribution well.

The second best-performing technique overall was DeepVAR, which excelled in NMAE and MASE for summer and fall. This indicates that DeepVAR achieved accurate parameter estimation for the mean and median of the Gaussian AR model it assumes. However, its CPE and CRPS metrics are generally worse than those of WIAE-GPF, possibly due to a model mismatch between the volatile and complex conditional distribution of real-time LMP and the Gaussian AR assumption of DeepVAR.

The three LLM-based estimators (BWGVT, Pyraformer, and Informer) performed similarly, outperforming TLAE but generally falling short of WIAE-GPF and DeepVAR. Pyraformer and Informer, both point forecasters trained to minimize mean squared forecasting error, performed better under NMSE. Notably, Pyraformer achieved the best NMSE among all methods during the winter month. However, their NMAE, MASE, and sMAPE scores were generally worse than those of WIAE-GPF and DeepVAR in most cases. Informer underperformed Pyraformer in the other seasons, but it achieved the best NMAE and MASE during the spring month, when LMP exhibited substantially lower volatility. We attribute the underperformance of the LLM-based point estimators to two factors: the attention mechanism for long-range dependency is unnecessary for these signals, and the large number of parameters makes training difficult. For details, see Sec. 4.5.

The LLM-based probabilistic forecasting technique BWGVT uses quantile regression to predict future distributions. Its quantile predictions were sensitive to outliers, especially compared to the GPF methods that rely on a stochastic latent process, leading to poorer point forecasting performance. BWGVT tended to predict larger intervals, as indicated by its high NCW across all cases. As a result, its CPEs were always positive, meaning that the prediction intervals it generated covered more than the nominal percentages. Consequently, both its point estimation results and CRPS were worse than those of DeepVAR and WIAE-GPF.
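For readers who want to reproduce the probabilistic metrics discussed here from Monte Carlo forecast samples, the following sketch may help. The paper's exact metric definitions are given in an earlier section that is not reproduced here, so this rests on two assumptions: CRPS is estimated with the standard sample-based form E|X - y| - 0.5 E|X - X'|, and the coverage error is the empirical coverage of a central prediction interval minus the nominal level, which matches the sign convention used above (positive means over-coverage). The function names are hypothetical.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS for one observation y: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def coverage_error(samples, y, nominal):
    """Empirical coverage of the central `nominal` interval minus `nominal`.

    samples: array of shape (T, S), S forecast samples per time step.
    y:       array of shape (T,), realized values.
    Assumption: the interval is central, i.e. bounded by symmetric quantiles.
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = np.quantile(samples, 0.5 - nominal / 2, axis=1)
    hi = np.quantile(samples, 0.5 + nominal / 2, axis=1)
    covered = (y >= lo) & (y <= hi)
    return covered.mean() - nominal

# Toy check: four time steps sharing the same 101 forecast samples 0..100.
toy_samples = np.tile(np.arange(101.0), (4, 1))
toy_truth = np.array([50.0, 50.0, 50.0, 200.0])
cpe90 = coverage_error(toy_samples, toy_truth, 0.9)  # 3 of 4 covered
crps_point = crps_from_samples([2.0, 2.0, 2.0], 5.0)  # degenerate forecast
```

A forecaster that predicts overly wide intervals, as reported for BWGVT, would show a positive value of `coverage_error` at every nominal level.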

TLAE performed the worst across the board. Its point evaluation metrics consistently ranked the worst among all methods, while its probabilistic evaluation metrics were slightly better. TLAE is a VAE-based autoencoder with a correlated Gaussian latent process. It generates samples of forecasts at the next time stamp by re-parameterization. For multi-step-ahead prediction, TLAE draws samples iteratively by substituting the mean of the samples generated for the previous time stamp. This heuristic does not guarantee that the conditional distribution of the generated samples matches the conditional distribution of the future given the current observations, and it could also cause an accumulation of biases.

Table 4: Performance evaluation of forecasting results for real-time price forecasting at Long Island. The numbers in parentheses are the rankings of the algorithms. The columns under the label of LONGIL report the GPF performance of the 12-step forecasting of the LMP at LONGIL.
| Season | Methods | NMSE | NMAE | MASE | sMAPE | CRPS | CPE (90%) [NCW] | CPE (50%) [NCW] | CPE (10%) [NCW] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Summer (2023.07.01-2023.07.31) | WIAE-GPF | (1) **0.1084** | (2) 0.2341 | (2) 0.5812 | (2) 0.2644 | (1) **13.2607** | (1) **0.0009** [0.0718] | (1) **-0.0264** [0.0261] | (1) **0.0059** [0.0055] |
| | TLAE [22] | (6) 0.4690 | (6) 0.4616 | (6) 0.9768 | (6) 0.4436 | (4) 23.3288 | (4) -0.1034 [0.0723] | (3) 0.1492 [0.0268] | (3) -0.0646 [0.0027] |
| | DeepVAR [14] | (3) 0.2009 | (1) **0.1849** | (1) **0.3911** | (1) **0.2496** | (2) 14.0716 | (2) 0.0275 [0.0777] | (2) -0.0459 [0.226] | (2) -0.0243 [0.0034] |
| | BWGVT [28] | (4) 0.2206 | (4) 0.2804 | (4) 0.5934 | (3) 0.2711 | (3) 15.0476 | (3) 0.0968 [0.0862] | (4) 0.1646 [0.0419] | (4) 0.1279 [0.0081] |
| | Pyraformer [30] | (2) 0.1090 | (3) 0.2717 | (3) 0.5749 | (4) 0.2934 | N/A | N/A | N/A | N/A |
| | Informer [31] | (5) 0.4505 | (5) 0.3935 | (5) 0.8326 | (5) 0.3329 | N/A | N/A | N/A | N/A |
| Fall (2023.10.01-2023.10.31) | WIAE-GPF | (1) **0.1020** | (2) 0.2806 | (2) 0.4131 | (2) 0.2522 | (1) **10.9362** | (1) **0.0002** [0.0711] | (1) **-0.0390** [0.0292] | (1) **-0.0157** [0.0054] |
| | TLAE [22] | (6) 0.6181 | (6) 0.4693 | (6) 0.6908 | (6) 0.4349 | (4) 19.1394 | (3) -0.0487 [0.0335] | (3) -0.0395 [0.0135] | (2) 0.0230 [0.0059] |
| | DeepVAR [14] | (5) 0.5684 | (1) **0.2328** | (1) **0.3427** | (3) 0.2662 | (2) 12.6109 | (2) 0.0205 [0.0414] | (2) -0.0328 [0.106] | (3) -0.0246 [0.0018] |
| | BWGVT [28] | (4) 0.2992 | (5) 0.3696 | (5) 0.5441 | (5) 0.3098 | (3) 14.1423 | (4) 0.0905 [0.0837] | (4) 0.1763 [0.0393] | (4) 0.0367 [0.0074] |
| | Pyraformer [30] | (2) 0.1080 | (3) 0.2949 | (3) 0.4341 | (1) **0.2116** | N/A | N/A | N/A | N/A |
| | Informer [31] | (3) 0.2615 | (4) 0.2985 | (4) 0.4394 | (4) 0.2823 | N/A | N/A | N/A | N/A |
| Winter (2024.01.01-2024.01.31) | WIAE-GPF | (2) 0.0906 | (2) 0.2263 | (2) 0.3369 | (1) **0.2307** | (1) **12.1949** | (1) **0.0014** [0.1203] | (2) 0.0358 [0.0493] | (2) 0.0196 [0.0092] |
| | TLAE [22] | (6) 0.2244 | (6) 0.3306 | (6) 0.4922 | (6) 0.4355 | (4) 20.7468 | (3) -0.0337 [0.0371] | (3) -0.0369 [0.0159] | (4) -0.0502 [0.0027] |
| | DeepVAR [14] | (3) 0.1507 | (1) **0.1614** | (1) **0.3273** | (2) 0.2499 | (2) 13.1403 | (2) 0.0211 [0.1525] | (1) **-0.0160** [0.220] | (3) -0.0287 [0.0049] |
| | BWGVT [28] | (5) 0.1649 | (4) 0.2725 | (4) 0.4057 | (3) 0.2515 | (3) 15.2919 | (4) 0.081 [0.1808] | (4) 0.1804 [0.0545] | (1) **0.0158** [0.0058] |
| | Pyraformer [30] | (1) **0.0721** | (5) 0.2786 | (5) 0.4144 | (5) 0.2747 | N/A | N/A | N/A | N/A |
| | Informer [31] | (4) 0.1553 | (3) 0.2528 | (3) 0.3763 | (4) 0.2617 | N/A | N/A | N/A | N/A |
| Spring (2024.04.01-2024.04.30) | WIAE-GPF | (1) **0.0797** | (2) 0.2137 | (2) 0.2942 | (2) 0.1289 | (1) **5.3111** | (1) **0.0068** [0.1485] | (1) **0.0208** [0.0624] | (1) **0.0148** [0.0113] |
| | TLAE [22] | (6) 0.1903 | (6) 0.2729 | (6) 0.3756 | (6) 0.4016 | (4) 15.3330 | (3) -0.0510 [0.1270] | (2) 0.0273 [0.0635] | (2) 0.0217 [0.0143] |
| | DeepVAR [14] | (5) 0.1105 | (4) 0.2345 | (4) 0.3228 | (1) **0.1233** | (2) 5.9420 | (2) 0.0154 [0.1665] | (3) -0.0423 [0.572] | (3) -0.0415 [0.0069] |
| | BWGVT [28] | (3) 0.0877 | (3) 0.2188 | (3) 0.3011 | (5) 0.2166 | (3) 14.6103 | (4) 0.0805 [0.1551] | (4) 0.2436 [0.0871] | (4) 0.1177 [0.0196] |
| | Pyraformer [30] | (4) 0.1022 | (5) 0.2890 | (5) 0.3978 | (3) 0.1788 | N/A | N/A | N/A | N/A |
| | Informer [31] | (2) 0.0845 | (1) **0.1867** | (1) **0.2570** | (4) 0.1880 | N/A | N/A | N/A | N/A |

Summer real-time LMP prediction

Test results for July 2023 are shown in rows 2-7 of Table 4, with the best performances highlighted in bold. We observed that WIAE-GPF outperformed other methods in probabilistic evaluation metrics and NMSE, while ranking a close second in NMAE, MASE, and sMAPE. In contrast, DeepVAR excelled in these three metrics, which are associated with the conditional median. Pyraformer, which was trained to minimize MSE, achieved the second-best NMSE but did not show notable performance in the other metrics. We believe this may be because DeepVAR, as a Gaussian AR process with fewer parameters and a simpler training objective, is better at estimating the conditional median. On the other hand, WIAE-GPF, being nonparametric and trained with two Wasserstein discriminators, is more sensitive to hyperparameters and initialization. However, its nonparametric nature allows it to avoid the model misspecification issues that can affect parametric methods like DeepVAR when predicting the overall conditional distribution, as evidenced by its superior performance in CRPS and CPE.

Figure 5: Trajectories of the real-time price at LONGIL in July (top row) and October (bottom row) 2023, and the predictions generated by selected methods. Top row: (a) WIAE-GPF, (b) Pyraformer, (c) TLAE, (d) DeepVAR. Bottom row: (e) WIAE-GPF, (f) Pyraformer, (g) TLAE, (h) DeepVAR.

Figure 6: Trajectories of the real-time price at LONGIL in January (top row) and April (bottom row) 2024, and the predictions generated by selected methods. Top row: (a) WIAE-GPF, (b) Pyraformer, (c) TLAE, (d) DeepVAR. Bottom row: (e) WIAE-GPF, (f) Informer, (g) TLAE, (h) DeepVAR.

To gain insights into the performance of WIAE-GPF and other benchmark techniques, we plotted the ground truth trajectories (black) and trajectory forecasts generated by WIAE-GPF (red) and a competing algorithm (blue) in Fig. 5 (top row). Note that the spikes were not predicted by any method. This is not surprising given how these spikes are produced. Aside from the spikes, the figures show that WIAE-GPF (red) tracked the ground truth (black) closely, consistent with its low NMAE (second only to DeepVAR in Table 4). We also observed that WIAE-GPF had the smallest variation, consistent with its having the smallest NMSE. Furthermore, WIAE-GPF was the least affected by the price spikes. This is because, as a GPF method, the Monte Carlo samples used to produce the MMSE point estimate were less likely to include extreme samples.
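The point estimates discussed above are derived from the generated samples. The sketch below shows the two standard choices, under the usual assumption that the MMSE estimate is the sample mean (used with NMSE) and the minimum absolute error estimate is the sample median (used with NMAE and MASE). The function name and the toy samples are hypothetical.

```python
import numpy as np

def point_forecasts(samples):
    """Return (mean, median) point forecasts along the last (sample) axis."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=-1), np.median(samples, axis=-1)

# One time step with five forecast samples, one of which is a spike.
toy = np.array([[20.0, 22.0, 24.0, 26.0, 500.0]])
mmse_estimate, median_estimate = point_forecasts(toy)
```

In this toy case a single extreme sample pulls the mean far above the median, which illustrates why a generator that rarely emits extreme samples yields point forecasts that are less affected by spikes.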

Fall Real-time LMP Prediction

Test results for October, shown in rows 8-13 of Table 4, are similar to the summer results. According to Table 3, fall LMPs were somewhat more volatile than summer LMPs. WIAE-GPF performed the best on all metrics except NMAE, MASE, and sMAPE, where it ranked second. DeepVAR performed the best for NMAE and MASE, where the conditional median is used as the point forecaster. Pyraformer achieved the best sMAPE and the second-best NMSE.

The performance can also be understood by examining the trajectories shown in Fig. 5 (bottom row). Similarly, WIAE-GPF exhibited the smallest variation and remained closest to the ground truth (excluding the spikes). In contrast, Fig. 5(h) reveals that DeepVAR tended to predict shifted peaks. This behavior is characteristic of the AR process, which is heavily influenced by past observations.

Winter Real-time LMP Prediction

According to Table 3, January 2024 LMPs were less volatile than those of summer and fall. As seen from Table 4, WIAE-GPF obtained the best sMAPE, CRPS, and CPE (90%), and came a close second on the rest. Pyraformer, with its MSE-minimizing training objective, achieved the best NMSE at around 7%. DeepVAR had the best NMAE, MASE, and CPE (50%), indicating its capability to accurately estimate the conditional distribution around the median.

Fig. 6 (top row) shows the trajectories of the winter LMP predictions. DeepVAR exhibited larger variability when predicting the LMP of January 2024, which explains its higher NMSE. Pyraformer and WIAE-GPF achieved a similar level of variance when predicting winter LMP. TLAE tracked the ground truth the worst among the four methods shown.

Spring Real-time LMP Prediction

April 2024 has the least volatile LMPs among the four months. For this rolling-window experiment, WIAE-GPF achieved the best metrics except for NMAE, MASE and sMAPE. Informer performed better than Pyraformer on the spring dataset, with the best NMAE and MASE. DeepVAR obtained the best sMAPE.

Fig. 6 (bottom row) shows the trajectories of the spring LMP predictions. DeepVAR and TLAE exhibited larger deviations in spring, which explains their higher NMSE. Informer exhibited the best ground-truth tracking capability, with WIAE-GPF a close second.

4.3 Interregional LMP spread for Interchange Markets

The interchange market aims to improve overall economic efficiency across ISOs by allowing virtual bidders to arbitrage price differences at proxy buses of two neighboring ISOs. This experiment was based on the use case of a virtual bidder bidding into the CTS market between NYISO and PJM. The proxy buses of this market were Sandy Point of NYISO and Neptune of PJM.

The CTS market closes 75 minutes ahead of delivery and is cleared every 15 minutes. A virtual bidder submits a price-quantity bid along with the direction of the virtual trade from the source of the proxy with low LMP to the destination proxy with high LMP. Once the market is cleared, the settlement is based on the actual LMP spread between the two proxies and the cleared quantity. The bidder profits if the virtual trade direction matches the direction of the real-time LMP spread. Otherwise, the bidder incurs a loss. Therefore, the ability to predict the LMP spread direction is especially important.
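The settlement logic described above can be written as a one-line payoff. This is a deliberately simplified sketch: it assumes the bid clears in full, ignores fees, the bid price, and any other settlement details of the actual CTS rules, and treats each 15-minute interval as 0.25 hours. The function name `cts_trade_profit` and its parameters are hypothetical.

```python
def cts_trade_profit(direction, cleared_mw, spread, hours=0.25):
    """Simplified payoff of a cleared virtual trade for one interval.

    direction:  +1 if the trade is scheduled from proxy A to proxy B,
                -1 for the opposite direction.
    cleared_mw: cleared quantity in MW.
    spread:     realized LMP at B minus LMP at A, in $/MWh.
    hours:      interval length (assumed 15 minutes = 0.25 h).
    """
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    return direction * cleared_mw * spread * hours

# Realized spread of 4 $/MWh and a cleared quantity of 10 MW.
gain = cts_trade_profit(+1, 10.0, 4.0)   # direction matches the spread
loss = cts_trade_profit(-1, 10.0, 4.0)   # direction opposes the spread
```

The sign of the payoff depends only on whether the trade direction matches the sign of the realized spread, which is why the direction forecast is singled out in the evaluation.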

We performed a 75-minute ahead LMP spread forecasting using the interface power flow and LMP spread data between NYISO and PJM at the Neptune proxy, collected in February 2024. The interface power flow samples were collected every 5 minutes, and LMP spread every 15 minutes. We used the first 24 days for training and validation, and the last 5 days of February for testing.

We added the Prediction Error Rate (PER) as a measure of the accuracy of the virtual trading direction prediction, given that the sign of the spread is of great importance to profitability. PER is the percentage of forecasts that do not have the same direction as the ground truth. For point forecasts, we compared the signs of the forecasts with the signs of the ground truth. For probabilistic forecasting, we compared the direction of the ground truth with that of the minimum error-probability prediction of the LMP spread, which is the sign of the conditional median. For GPF, we compared the sign of the sample median with the sign of the ground truth.
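A minimal sketch of the PER computation described above follows. The function names are hypothetical, and the handling of exact zeros (a zero forecast counts as an error unless the realized spread is also zero) is an assumption, since the text does not discuss ties.

```python
import numpy as np

def prediction_error_rate(forecast, truth):
    """Fraction of time steps where the forecast sign differs from the truth."""
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(np.sign(forecast) != np.sign(truth)))

def per_from_samples(samples, truth):
    """PER for a generative forecaster: use the sign of the sample median.

    samples: array of shape (T, S), S spread samples per time step.
    """
    return prediction_error_rate(np.median(samples, axis=1), truth)

# Toy example: four time steps, three forecast samples each.
toy_spread = np.array([5.0, -3.0, 2.0, -1.0])
toy_samples = np.array([
    [1.0, 2.0, 3.0],     # median  2 -> correct direction
    [-1.0, 2.0, 4.0],    # median  2 -> wrong direction
    [-2.0, 1.0, 3.0],    # median  1 -> correct direction
    [-5.0, -4.0, 1.0],   # median -4 -> correct direction
])
per = per_from_samples(toy_samples, toy_spread)
```

Here one of the four directions is wrong, so the PER is 0.25.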

Table 5: Evaluation of forecasting results for spread forecasting between NYISO and PJM.
| Methods | NMSE | NMAE | MASE | sMAPE | PER | CRPS | CPE (90%) [NCW] | CPE (50%) [NCW] | CPE (10%) [NCW] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WIAE-GPF | (1) **0.0098** | (1) **0.2738** | (1) **0.2418** | (1) **0.4493** | (1) **0.0606** | (1) **4.0329** | (1) **0.0215** [0.4427] | (2) -0.0274 [0.5255] | (3) 0.0212 [0.1692] |
| TLAE [22] | (5) 0.9592 | (5) 0.9785 | (5) 0.8641 | (4) 0.4785 | (4) 0.3692 | (2) 15.5195 | (3) -0.0443 [0.7745] | (1) **0.0052** [0.8784] | (5) -0.0933 [0.3133] |
| DeepVAR [14] | (6) 1.8986 | (3) 0.7224 | (3) 0.6380 | (5) 0.4806 | (3) 0.3505 | (4) 32.8296 | (2) -0.0355 [2.0279] | (3) -0.1480 [1.4739] | (1) **-0.0021** [0.4198] |
| BWGVT [28] | (3) 0.9053 | (4) 0.8525 | (4) 0.7529 | (3) 0.4674 | (2) 0.2313 | (3) 31.5660 | (4) 0.0989 [5.1788] | (4) 0.1835 [6.0939] | (4) 0.0656 [4.0473] |
| Pyraformer [30] | (4) 0.9478 | (6) 1.2674 | (6) 1.1193 | (6) 0.4909 | (6) 0.6738 | N/A | N/A | N/A | N/A |
| Informer [31] | (2) 0.8045 | (2) 0.4185 | (2) 0.4252 | (2) 0.4580 | (5) 0.5487 | N/A | N/A | N/A | N/A |

Table 6: Estimation Results of ACE forecasting for PJM. The prediction step is 5-minute ahead.
| Methods | NMSE | NMAE | MASE | sMAPE | CRPS | CPE (90%) [NCW] | CPE (50%) [NCW] | CPE (10%) [NCW] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WIAE-GPF | (1) **0.5957** | (1) **0.7555** | (1) **0.4698** | (1) **0.1059** | (1) **0.0081** | (1) **-0.0016** [0.9199] | (1) **0.0321** [0.9336] | (1) **-0.0132** [0.8885] |
| TLAE [22] | (5) 1.1727 | (5) 1.0605 | (5) 0.6595 | (3) 0.2782 | (4) 1.5541 | (4) -0.7857 [0.0004] | (4) -0.4489 [0.0005] | (4) -0.0957 [0.0027] |
| DeepVAR [14] | (7) 1.4431 | (7) 1.1750 | (7) 0.7307 | (5) 0.3952 | (3) 1.2947 | (3) -0.3526 [0.5665] | (3) -0.2560 [0.5296] | (3) -0.0521 [0.5434] |
| BWGVT [28] | (3) 0.9562 | (2) 0.9793 | (2) 0.6090 | (4) 0.3168 | (2) 1.2488 | (2) 0.0065 [1.8309] | (2) 0.0754 [2.0385] | (5) 0.0996 [2.4261] |
| Pyraformer [30] | (4) 0.9783 | (4) 0.9948 | (4) 0.6186 | (7) 0.4986 | N/A | N/A | N/A | N/A |
| Informer [31] | (2) 0.6006 | (3) 0.9819 | (3) 0.6106 | (2) 0.2247 | N/A | N/A | N/A | N/A |

Figure 7: Trajectories of the interregional LMP spread between NYISO and PJM, and the predictions generated by selected methods. Panels: (a) WIAE-GPF, (b) Pyraformer, (c) TLAE, (d) DeepVAR.

As seen from Table 5, WIAE-GPF performed better than all other techniques on every metric except CPE (50%) and CPE (10%). TLAE was second-best in CRPS (15.5195) but slightly worse than BWGVT when evaluated under point estimation metrics. Its sequential sampling of the latent Gaussian process added to its numerical instability. BWGVT was the overall second-best performing probabilistic technique. Its transformer architecture, with an enhanced capability for capturing long-term temporal dependency, did not offer enough gain to offset the training difficulty imposed by the larger number of deep-learning parameters; see Sec. 4.5. BWGVT also tended to predict wide intervals covering more than the nominal percentage. Point estimation techniques, namely Pyraformer and Informer, were not competitive when evaluated under point estimation metrics other than NMSE. Among probabilistic methods, DeepVAR performed similarly to the LLM methods. The (semi) parametric methods suffered from model mismatch and were sensitive to sudden changes. Thus, shifted peaks and valleys were often observed in their predictions.

The same observation can be made from Fig. 7. WIAE-GPF has the most stable prediction of interregional LMP spreads, which is corroborated by its smallest NMSE and NMAE. Pyraformer also follows the trend of LMP spreads accurately but with higher variance. The AR-based parametric model, DeepVAR, tended to predict shifted spikes and failed to catch rapid and dramatic changes in the LMP spread.

4.4 Area Control Error Forecasting for Reserve Market Participants

ACE is defined as the difference between actual and scheduled load-generation imbalance, adjusted by the area frequency deviation [39]. It is the control signal for frequency regulation, and its probabilistic forecasting is especially important for the operator to procure resources and market participants to bid in the regulation ancillary service market.

In this subsection, we present the simulation results of a 5-minute ahead forecasting of ACE. We utilized the ACE data from Jan 24th to 26th, collected by PJM. The ACE signal is measured every 15 seconds and can be quite volatile, as shown by the trajectory in Fig. 8.

Figure 8: Trajectory of ACE at PJM, Jan. 24th - 26th, 2023.

As shown in Table 6, WIAE-GPF achieved better performance than the other methods, with CRPS less than 0.01 and sMAPE less than 11%. We attribute the strong performance of WIAE-GPF to the simplicity of its latent process and its Bayesian sufficiency. BWGVT ranked second among all methods since the ACE data has few outliers, but its CRPS of 1.2488 is dramatically larger than that of WIAE-GPF. Its CPE and NCW for the 10% confidence interval prediction also showed that it cannot accurately predict a narrow interval. Pyraformer and Informer, trained to minimize MSE, performed better under NMSE than under NMAE. With NMSE over 110%, TLAE had the worst performance among GPF methods. DeepVAR and SNARX performed worse than the other forecasting methods, with NMSE and NMAE larger than 110%, possibly due to model mismatch.

4.5 Discussion: On using LLM for price forecasting

Table 7: Statistics that model long-range dependency of time series.
Metrics         Real Time  Interchange Spread  ACE
Hurst Exponent  0.5257     0.5301              0.5351
DFA             0.6053     0.6614              0.8609

The success of LLM-based prediction in natural language processing has ignited broad interest in adopting LLMs in various applications, including electricity price forecasting with BWGVT, Pyraformer, and Informer. Our experiments showed that the innovation-based method (WIAE) performed uniformly better than the three LLM techniques, except for the real-time prediction at LONGIL-NYC under NMSE, where Pyraformer was the best among all forecasters. Note that the innovation representation used in WIAE can model long-range dependencies of the random process, though not explicitly, and that WIAE does not include attention modeling.

As LLM-based forecasting techniques, Pyraformer, Informer, and BWGVT sometimes performed better than TLAE (see Table 4). Compared with the more straightforward deep learning method DeepVAR, however, LLM-based forecasting did not show clear advantages. The authors of [40] pointed out that a simple convolutional neural network outperformed RNNs and LLMs on imbalance price forecasting because the forecasted time series are not well suited to complicated deep learning models.

To understand whether long-range dependencies matter in the probabilistic forecasting of electricity market signals, we examined the characteristics of LMP signals using the Hurst exponent and detrended fluctuation analysis (DFA) as indicators of long-range dependency; both parameters range over [0,1], with a deviation from 0.5 indicating long-range dependency.

Table 7 shows the estimated Hurst exponent and DFA slope. Both displayed a slight deviation from 0.5. By comparison, English and Korean literature [41, 42] is known to exhibit long-range dependencies, with Hurst exponents ranging from 0.64 to 0.73. The long-term effect in real-time electricity market signals is therefore minimal.
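This section does not state which estimators produced the values in Table 7. As a minimal sketch, assuming a rescaled-range (R/S) estimate of the Hurst exponent and first-order DFA, the two statistics might be computed as follows. The function names `hurst_rs` and `dfa_exponent` and the window sizes in `scales` are illustrative choices, not taken from the paper; the series is assumed to be one-dimensional, longer than the largest window, and non-constant within each window.

```python
import numpy as np


def hurst_rs(x, scales=(16, 32, 64, 128, 256)):
    """Rescaled-range (R/S) estimate of the Hurst exponent (illustrative)."""
    x = np.asarray(x, dtype=float)
    rs = []
    for s in scales:
        n_win = len(x) // s
        segs = x[: n_win * s].reshape(n_win, s)
        # Cumulative deviation from the window mean.
        dev = np.cumsum(segs - segs.mean(axis=1, keepdims=True), axis=1)
        value_range = dev.max(axis=1) - dev.min(axis=1)
        # Assumes every window has nonzero standard deviation.
        rs.append(np.mean(value_range / segs.std(axis=1)))
    # Slope of log(R/S) versus log(window size).
    slope, _ = np.polyfit(np.log(scales), np.log(rs), 1)
    return slope


def dfa_exponent(x, scales=(16, 32, 64, 128, 256)):
    """First-order DFA scaling exponent (illustrative)."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-removed series
    t_axis = np.arange(max(scales))
    fluct = []
    for s in scales:
        n_win = len(profile) // s
        segs = profile[: n_win * s].reshape(n_win, s)
        t = t_axis[:s]
        # Least-squares linear trend fitted separately in each window.
        coeffs = np.polyfit(t, segs.T, 1)  # shape (2, n_win)
        trend = np.outer(coeffs[0], t) + coeffs[1][:, None]
        fluct.append(np.sqrt(np.mean((segs - trend) ** 2)))
    # Slope of log F(s) versus log s.
    slope, _ = np.polyfit(np.log(scales), np.log(fluct), 1)
    return slope
```

For an uncorrelated series, both estimates should lie near 0.5, which is the reference point used in the discussion above. Finite-sample bias, especially for R/S with short windows, means small deviations from 0.5 should be interpreted with care.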

While further studies are necessary, LLMs may not be suitable for electricity market signals where long-range dependencies are not evident. Indeed, real-time LMPs are computed either on an interval-by-interval basis or as part of a short sliding-window economic dispatch. Any temporal coupling is a result of temporal dependencies of demand and supply, neither of which has been shown to have long-range dependencies. An unproven hypothesis is that the model complexity of LLMs may offset any benefit they bring to price forecasting.

5 Conclusion

This paper introduces WIAE-GPF, a generative AI method for probabilistic forecasting of nonparametric time series, building on the innovation representation developed by Wiener, Kallianpur, and Rosenblatt over six decades ago. Three key findings emerge from our research. First, the innovation representation enables WIAE-GPF to produce accurate conditional probability distributions, provided perfect learning conditions are met. To the best of our knowledge, WIAE-GPF is the first nonparametric generative probabilistic forecasting (GPF) technique with such theoretical guarantees.

Second, WIAE-GPF outperformed leading machine learning-based probabilistic forecasting methods in our numerical experiments using real-world market data. This includes advanced models employing transformer architectures, attention mechanisms, and large language models. Additionally, the local stationarity hypothesis, integral to the weak innovation representation, held robustly in our studies.

Third, this paper establishes Bayesian sufficiency for Rosenblatt’s weak innovation representation, validating it as a canonical framework for time series and a powerful tool for stochastic decision-making. Its potential applications in power system anomaly detection are explored in [43, 44, 45].

Finally, deep learning methods are often criticized for their black-box nature, with some viewing them as mysterious techniques that produce impressive yet opaque results. In contrast, WIAE-GPF offers a highly intuitive and interpretable architecture, closely paralleling the classic Kalman filter. Specifically, in Kalman filtering, the innovation is extracted during the measurement update, followed by time-updated predictions based on a state-space model. Similarly, WIAE-GPF extracts innovations using its weak innovation encoder and generates time-updated predictions via the weak innovation decoder. In this sense, WIAE-GPF can be seen as a generalization of Kalman filtering to nonparametric and non-Gaussian settings.

It is important to address the limitations and future directions of this work. WIAE-GPF is derived from the innovation representation of stationary processes, so extending it to certain classes of nonstationary processes would be a natural next step. Notably, an innovation representation exists for nonstationary Gaussian processes using time-varying state-space models. Expanding WIAE-GPF to handle nonstationary time series under regime-switching models is also a promising avenue, particularly given the demonstrated effectiveness of regime-switching techniques in price forecasting. See [40, 28] for relevant applications.

Appendix A Proof of Theorem 2

Let $\left(\bar{\mathbf{V}}_t^{(m)}\right)$ and $\left(\bar{\bm{X}}_t^{(m)}\right)$ denote the latent process and the reconstruction sequence under weights $\bar{\theta}_m$ and $\bar{\eta}_m$:

\begin{align*}
\bar{\mathbf{V}}_t^{(m)} &= G_{\bar{\theta}}^{(m)}(\bm{X}_t, \bm{X}_{t-1}, \cdots, \bm{X}_{t-m+1}),\\
\bar{\bm{X}}_t^{(m)} &= H_{\bar{\eta}}^{(m)}(\bar{\mathbf{V}}_t, \bar{\mathbf{V}}_{t-1}, \cdots, \bar{\mathbf{V}}_{t-m+1}).
\end{align*}

We define the loss of a WIAE pair $(G_\theta, H_\eta)$ achieved under a pair of $m$-dimensional discriminators as

\begin{multline*}
L^{(m)}(\theta,\eta) := \max_{\gamma,\omega}\Big(\mathbb{E}[D_\gamma^{(m)}(\mathbf{U}_{t:t-m+1})] - \mathbb{E}[D_\gamma^{(m)}(\hat{\mathbf{V}}_{t:t-m+1})]\\
+ \lambda\big(\mathbb{E}[D_\omega^{(m)}(\bm{X}_{t-m+2:t+T})] - \mathbb{E}[D_\omega^{(m)}((\bm{X}_{t-m+2:t}, \hat{\bm{X}}_{t+1:t+T}))]\big)\Big).
\end{multline*}

We first show that $L^{(m)}(\theta_m^*, \eta_m^*) \rightarrow 0$ as $m \rightarrow \infty$, where $(\theta_m^*, \eta_m^*)$ denotes the optimal weights of $(G_\theta^{(m)}, H_\eta^{(m)})$ obtained by minimizing (3).

Following the line of [46], we define the distance between two random processes $(\bm{X}_t)$ and $(\mathbf{Y}_t)$ by the expected $\ell_\infty$ norm:

\[
d\left((\bm{X}_t), (\mathbf{Y}_t)\right) := \mathbb{E}\left[\sup_t \lvert \bm{X}_t - \mathbf{Y}_t \rvert\right].
\]

The uniform convergence assumed in assumption A2 is also defined on metric spaces with distance measure $d(\cdot,\cdot)$. Hence, by assumption A2, $G_{\bar{\theta}}^{(m)} \rightarrow G$ uniformly, which implies that, for every $\epsilon > 0$, there exists an $M_1$ such that, for all $m > M_1$, $d\left((\bar{\mathbf{V}}_t^{(m)}), (\mathbf{V}_t)\right) < \epsilon$. Thus, for every bounded and continuous $F: \ell^\infty(T) \to \mathbb{R}$,

\[
d\left(F((\mathbf{V}_t)), F((\bar{\mathbf{V}}_t^{(m)}))\right) < \delta(\epsilon).
\]

In other words,

\[
\lim_{m \rightarrow \infty} \mathbb{E}[F((\bar{\mathbf{V}}_t^{(m)}))] = \mathbb{E}[F((\mathbf{V}_t))],
\]

which fulfills the definition of weak convergence. Therefore,

\[
\bar{\mathbf{V}}_{t:t-m+1}^{(m)} \stackrel{d}{\rightarrow} \mathbf{V}_{t:t-m+1}, \tag{10}
\]

due to the fact that convergence in expectation implies convergence in distribution.

Similarly, by the uniform convergence of $H_{\bar{\eta}}^{(m)}$ to $H$, we have that, for all $m > M_2$,

\[
d\left((\hat{\bm{X}}_t), (H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))\right) < \epsilon,
\]

where $(H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))$ represents the random sequence generated by passing $(\mathbf{V}_t)$ through $H_{\bar{\eta}}^{(m)}$. Thus, for every bounded and continuous $F: \ell^\infty(T) \to \mathbb{R}$,

\[
d\left(F((\hat{\bm{X}}_t)), F((H_{\bar{\eta}}^{(m)}((\mathbf{V}_t))))\right) < \delta(\epsilon).
\]

Hence, $(H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))$ converges in distribution to $(\hat{\bm{X}}_t)$. Since $H$ is continuous and $H_{\bar{\eta}}^{(m)}$ converges uniformly to $H$, $H_{\bar{\eta}}^{(m)}$ is also continuous. Thus, by the continuous mapping theorem,

\[
\bar{\mathbf{V}}_{t-m+1:t}^{(m)} \stackrel{d}{\rightarrow} \mathbf{V}_{t-m+1:t} \;\Rightarrow\; H_{\bar{\eta}}^{(m)}(\bar{\mathbf{V}}_{t-m+1:t}^{(m)}) \stackrel{d}{\rightarrow} H_{\bar{\eta}}^{(m)}(\mathbf{V}_{t-m+1:t}),
\]

that is, $\bar{\bm{X}}_t^{(m)} \stackrel{d}{\rightarrow} H_{\bar{\eta}}^{(m)}(\mathbf{V}_t, \cdots, \mathbf{V}_{t-m+1})$. Therefore,

\[
(\bm{X}_{t-m+2:t}, \bar{\bm{X}}_{t+T}^{(m)}) \stackrel{d}{\rightarrow} (\bm{X}_{t-m+2:t}, \hat{\bm{X}}_{t+T}) \stackrel{d}{=} (\bm{X}_{t-m+2:t}, \bm{X}_{t+T}). \tag{11}
\]

By (10) and (11), $L^{(m)}(\bar{\theta}_m, \bar{\eta}_m) \rightarrow 0$. Since $\theta_m^*$ and $\eta_m^*$ are the optimal parameters obtained by minimizing (3) evaluated by $m$-dimensional discriminators $(D_\omega^{(m)}, D_\gamma^{(m)})$,

\[
L^{(m)}(\theta_m^*, \eta_m^*) := \min_{\theta,\eta} L^{(m)}(\theta,\eta) \leq L^{(m)}(\bar{\theta}_m, \bar{\eta}_m) \rightarrow 0.
\]

Because $L^{(m)}(\theta_m^*, \eta_m^*) \rightarrow 0$ as $m \rightarrow \infty$, $\bm{V}_{t:t-m+1}^{(m)} \stackrel{d}{\rightarrow} \bm{V}_{t:t-m+1}^{(m)}$ and $(\bm{X}_{t-m+2:t}, \hat{\bm{X}}_{t+T}^{(m)}) \stackrel{d}{\rightarrow} (\bm{X}_{t-m+2:t}, \bm{X}_{t+T}^{(m)})$ follow directly from the equivalence of convergence in Wasserstein distance and convergence in distribution [47].
Since the discriminator dimensionality also goes to $\infty$, we have $(\bm{X}_{0:t}, \hat{\bm{X}}_{t+T}^{(m)}) \stackrel{d}{\rightarrow} (\bm{X}_{0:t}, \bm{X}_{t+T})$. Further, the convergence in distribution of $\hat{\bm{X}}_{t+T}^{(m)} \mid \bm{X}_{0:t} = \bm{x}_{0:t}$ to $\bm{X}_{t+T} \mid \bm{X}_{0:t} = \bm{x}_{0:t}$ follows from a simple application of Bayes' rule. $\square$

Appendix B Definition of Metrics for Time Series Forecasting

Given the original time series $(\bm{x}_t)$, the forecasts $(\tilde{\bm{x}}_t)$, the dataset size $N$, and the prediction step $T$, the point estimation metrics can be calculated through:

\begin{align*}
\mbox{NMSE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N} (\bm{x}_t - \tilde{\bm{x}}_{t-T})^2}{\frac{1}{N-T}\sum_{t=T+1}^{N} \bm{x}_t^2},\\
\mbox{NMAE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N} |\bm{x}_t - \tilde{\bm{x}}_{t-T}|}{\frac{1}{N-T}\sum_{t=T+1}^{N} |\bm{x}_t|},\\
\mbox{MASE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N} |\bm{x}_t - \tilde{\bm{x}}_{t-T}|}{\frac{1}{N-T}\sum_{t=T+1}^{N} |\bm{x}_t - \bm{x}_{t-T}|},\\
\mbox{sMAPE} &= \frac{1}{N-T}\sum_{t=T+1}^{N} \frac{|\bm{x}_t - \tilde{\bm{x}}_{t-T}|}{(|\bm{x}_t| + |\tilde{\bm{x}}_{t-T}|)/2}.
\end{align*}

The purpose of adopting multiple metrics is to evaluate the forecasting performance comprehensively. NMSE and NMAE evaluate the overall performance, and MASE reflects the performance relative to the naive forecaster: methods with MASE smaller than 1 outperform the naive forecaster. sMAPE is the symmetric counterpart of the mean absolute percentage error (MAPE) and is bounded both above and below. Because the actual values in electricity datasets can be very close to 0, which nullifies the effectiveness of MAPE, we regard sMAPE as the better metric.
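The point metrics above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the evaluation code used in the paper: it assumes a scalar series, assumes every sum runs over the targets $t = T+1, \ldots, N$, and uses a hypothetical function name `point_metrics` whose `forecasts` argument holds one point forecast per target.

```python
import numpy as np


def point_metrics(x, forecasts, T):
    """Point-forecast metrics for a scalar series (illustrative sketch).

    x         : observations x_1, ..., x_N (length N)
    forecasts : forecasts of x_{T+1}, ..., x_N, each issued T steps earlier
                (length N - T)
    T         : prediction step
    """
    x = np.asarray(x, dtype=float)
    target = x[T:]                       # x_t for t = T+1, ..., N
    pred = np.asarray(forecasts, dtype=float)
    naive = x[:-T]                       # naive forecaster: x_{t-T}
    abs_err = np.abs(target - pred)
    return {
        "NMSE": np.mean((target - pred) ** 2) / np.mean(target ** 2),
        "NMAE": np.mean(abs_err) / np.mean(np.abs(target)),
        "MASE": np.mean(abs_err) / np.mean(np.abs(target - naive)),
        "sMAPE": np.mean(abs_err / ((np.abs(target) + np.abs(pred)) / 2)),
    }
```

Passing the naive forecast itself, `point_metrics(x, x[:-T], T)`, gives a MASE of exactly 1, which matches the interpretation of MASE as performance relative to the naive forecaster.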

For probabilistic methods, we further evaluates their CRPS. CRPS can be computed from

CRPS=1NTt=T+1N((F~(𝒙|𝒙1:tT)𝕀𝒙t𝒙)2𝑑𝒙),CRPS1𝑁𝑇superscriptsubscript𝑡𝑇1𝑁subscriptsuperscript~𝐹conditional𝒙subscript𝒙:1𝑡𝑇subscript𝕀subscript𝒙𝑡𝒙2differential-d𝒙\displaystyle\mbox{CRPS}=\frac{1}{N-T}\sum_{t=T+1}^{N}\left(\int_{\mathbb{R}}% \left(\tilde{F}({\bm{x}}|{\bm{x}}_{1:t-T})-\mathbb{I}_{{\bm{x}}_{t}\leq{\bm{x}% }}\right)^{2}d{\bm{x}}\right),CRPS = divide start_ARG 1 end_ARG start_ARG italic_N - italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_T + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( ∫ start_POSTSUBSCRIPT blackboard_R end_POSTSUBSCRIPT ( over~ start_ARG italic_F end_ARG ( bold_italic_x | bold_italic_x start_POSTSUBSCRIPT 1 : italic_t - italic_T end_POSTSUBSCRIPT ) - blackboard_I start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ bold_italic_x end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d bold_italic_x ) ,

where $\mathbb{I}$ is the indicator function and $\tilde{F}({\bm{x}}|{\bm{x}}_{1:t-T})$ is the empirical cumulative distribution function (c.d.f.) of $\tilde{{\bm{X}}}_{t-T}$ conditioned on ${\bm{X}}_{1:t-T}={\bm{x}}_{1:t-T}$, as predicted by the probabilistic forecasting method. CRPS thus compares the empirical conditional c.d.f. produced by a probabilistic method with the indicator c.d.f. $\mathbb{I}_{{\bm{x}}_{t}\leq{\bm{x}}}$ of the true value ${\bm{x}}_{t}$. It can be viewed as a generalization of MAE to probabilistic methods.
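For a scalar series, the integral inside the CRPS sum can be evaluated exactly from forecast samples. The sketch below uses the standard identity $\int(F(x)-\mathbb{I}_{y\leq x})^{2}dx=\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$, which holds for the empirical c.d.f. of the samples. The function names and the sample-based input layout are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def crps_from_samples(samples, observation):
    """CRPS of the empirical c.d.f. of `samples` against one scalar observation.

    Uses E|X - y| - 0.5 * E|X - X'|, with both expectations taken over
    the empirical distribution of the forecast samples.
    """
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - observation))
    term_spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - 0.5 * term_spread)

def mean_crps(sample_matrix, observations):
    """Average CRPS over forecast instants.

    Row i of `sample_matrix` holds the samples generated for the value
    observations[i], so the two inputs are assumed to be aligned already.
    """
    scores = [crps_from_samples(s, y) for s, y in zip(sample_matrix, observations)]
    return float(np.mean(scores))

# With a single sample the score reduces to the absolute error,
# which is the sense in which CRPS generalizes MAE.
print(crps_from_samples([2.0], 5.0))
```

The pairwise term costs $O(m^{2})$ in the number of samples $m$ per instant, which is usually acceptable for a few hundred to a few thousand samples.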

The coverage probability (CP) of a confidence interval predictor is the (estimated) probability that the ground truth falls within the predicted interval. For a $T$-step prediction of $\beta\%$-intervals, we denote the upper and lower bounds by $\hat{U}_{t|t-T,\beta}$ and $\hat{L}_{t|t-T,\beta}$, respectively. CP can be computed through

$$\mbox{CP}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\mathbb{I}_{{\bm{x}}_{t}\in[\hat{L}_{t|t-T,\beta},\hat{U}_{t|t-T,\beta}]}.$$

The closer the CP is to its nominal value $\beta\%$, the more accurate the prediction. Thus, the coverage probability error (CPE) is often adopted for evaluation. CPE measures the deviation of CP from its nominal value $\beta\%$:

$$\mbox{CPE}(\beta\%)=\mbox{CP}(\beta\%)-\beta\%.$$

A CPE closer to zero indicates a more accurate prediction interval estimate.
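CP and CPE can be computed with a few lines of NumPy. In the sketch below the nominal level is passed as a fraction (0.9 for a 90% interval), and the helper that builds central intervals from forecast samples by taking the $(1-\beta)/2$ and $(1+\beta)/2$ sample quantiles is one possible construction, stated here as an assumption. All function names are illustrative.

```python
import numpy as np

def interval_from_samples(sample_matrix, beta):
    """Central interval per forecast instant from forecast samples.

    Assumption: row i holds the samples for instant i, and the interval is
    bounded by the (1 - beta)/2 and (1 + beta)/2 sample quantiles.
    """
    sample_matrix = np.asarray(sample_matrix, dtype=float)
    lower, upper = np.quantile(
        sample_matrix, [(1 - beta) / 2, (1 + beta) / 2], axis=1
    )
    return lower, upper

def coverage_probability(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    actual = np.asarray(actual, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((actual >= lower) & (actual <= upper)))

def coverage_probability_error(actual, lower, upper, beta):
    """CP minus the nominal coverage level beta (given as a fraction)."""
    return coverage_probability(actual, lower, upper) - beta

# Hypothetical example: two of the four actual values are covered.
actual = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 0.0, 0.0, 5.0]
upper = [2.0, 3.0, 2.5, 6.0]
print(coverage_probability(actual, lower, upper))
print(coverage_probability_error(actual, lower, upper, beta=0.9))
```

A negative CPE in this sketch means the intervals cover the truth less often than their nominal level claims, while a positive CPE means they are wider than needed to reach that level.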

Although CP and CPE are widely adopted for their simplicity, they only estimate the unconditional coverage and therefore do not measure the accuracy of the coverage based on the forecasted conditional probability distribution. This limitation was discussed in [48].

In particular, while a good forecaster produces small CPE and a forecaster with high CPE must be a poor forecaster, a forecaster producing small CPE may not be a good forecaster. To this end, the normalized coverage width (NCW) can be used as a secondary measure. NCW is defined as

$$\mbox{NCW}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\frac{\hat{U}_{t|t-T,\beta}-\hat{L}_{t|t-T,\beta}}{\hat{U}_{\beta}-\hat{L}_{\beta}},$$

where $\hat{U}_{\beta}$ and $\hat{L}_{\beta}$ are the upper and lower bounds of the prediction interval estimated from the empirical quantiles of the testing data. For instance, when predicting a $90\%$ interval, $\hat{U}_{90}$ is the empirical $0.95$-quantile of the testing set, whereas $\hat{L}_{90}$ is the empirical $0.05$-quantile of the testing set. As a result, NCW is the average width of the predicted intervals, normalized by the width of the interval estimated from the empirical marginal distribution of the testing set. One would expect that conditioning on observations yields a more concentrated prediction interval than the one estimated from the unconditional distribution. Hence, a method with NCW smaller than $1$ estimates prediction intervals more accurately than the unconditional estimate. At a similar level of CP, the method with the smaller NCW shows better accuracy in prediction interval estimation.
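The NCW definition above can be sketched as follows, again with the nominal level given as a fraction. The function name and the use of `np.quantile` with its default linear interpolation for the empirical quantiles are assumptions made for illustration.

```python
import numpy as np

def normalized_coverage_width(test_values, lower, upper, beta):
    """Average predicted interval width over the unconditional interval width.

    The normalizer is the width between the empirical (1 - beta)/2 and
    (1 + beta)/2 quantiles of the testing set, e.g. the 0.05 and 0.95
    quantiles when beta = 0.9.
    """
    test_values = np.asarray(test_values, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    q_lo, q_hi = np.quantile(test_values, [(1 - beta) / 2, (1 + beta) / 2])
    return float(np.mean(upper - lower) / (q_hi - q_lo))

# Hypothetical testing set 0, 1, ..., 100 with constant-width predicted intervals.
test_values = np.arange(101.0)
lower = test_values - 22.5
upper = test_values + 22.5
print(normalized_coverage_width(test_values, lower, upper, beta=0.9))
```

In this hypothetical case the predicted intervals are half as wide as the unconditional 90% interval, so the NCW is about 0.5.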

References

  • [1] A. Green, The great acceleration: CIO perspectives on generative AI, Tech. rep., MIT Technology Review Insights (2023).
    URL https://www.databricks.com/sites/default/files/2023-07/ebook_mit-cio-generative-ai-report.pdf
  • [2] N. Wiener, Nonlinear Problems in Random Theory, Technology Press of Massachusetts Institute of Technology, Cambridge, MA, 1958.
  • [3] M. Rosenblatt, Stationary Processes as Shifts of Functions of Independent Random Variables, Journal of Mathematics and Mechanics 8 (5) (1959) 665–681.
  • [4] W. Härdle, H. Lütkepohl, R. Chen, A review of nonparametric time series analysis, International Statistical Review / Revue Internationale de Statistique 65 (1) (1997) 49–72. doi:10.2307/1403432.
  • [5] J. Nowotarski, R. Weron, Computing electricity spot price prediction intervals using quantile regression and forecast averaging, Computational Statistics 30 (3) (2015) 791–803. doi:10.1007/s00180-014-0523-0.
  • [6] R. Weron, A. Misiorek, Forecasting spot electricity prices: A comparison of parametric and semiparametric time series models, International Journal of Forecasting 24 (4) (2008) 744–763. doi:10.1016/j.ijforecast.2008.08.004.
  • [7] T. Hong, P. Pinson, Y. Wang, R. Weron, D. Yang, H. Zareipour, Energy forecasting: A review and outlook, IEEE Open Access Journal of Power and Energy 7 (2020) 376–388. doi:10.1109/OAJPE.2020.3029979.
  • [8] Y. Ji, R. J. Thomas, L. Tong, Probabilistic forecasting of real-time lmp and network congestion, IEEE Transactions on Power Systems 32 (2) (2017) 831–841. doi:10.1109/TPWRS.2016.2592380.
  • [9] M. Zhou, Z. Yan, Y. X. Ni, G. Li, Y. Nie, Electricity price forecasting with confidence-interval estimation through an extended arima approach, IEE Proc.-Gener.Transmiss.Distrib 153 (2) (2006) 187–195.
  • [10] J. P. González, A. M. S. Muñoz San Roque, E. A. Pérez, Forecasting functional time series with a new hilbertian armax model: Application to electricity price forecasting, IEEE Transactions on Power Systems 33 (1) (2018) 545–556. doi:10.1109/TPWRS.2017.2700287.
  • [11] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
  • [12] L. M. Lima, P. Damien, D. W. Bunn, Bayesian predictive distributions for imbalance prices with time-varying factor impacts, IEEE Transactions on Power Systems 38 (1) (2023) 349–357. doi:10.1109/TPWRS.2022.3165149.
  • [13] S. Chai, Z. Xu, Y. Jia, Conditional density forecast of electricity price based on ensemble elm and logistic emos, IEEE Transactions on Smart Grid 10 (3) (2019) 3031–3043. doi:10.1109/TSG.2018.2817284.
  • [14] D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, J. Gasthaus, High-dimensional multivariate forecasting with low-rank gaussian copula processes, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
  • [15] G. Dudek, Multilayer perceptron for GEFCom2014 probabilistic electricity price forecasting, International Journal of Forecasting 32 (3) (2016) 1057–1060. doi:10.1016/j.ijforecast.2015.11.009.
  • [16] D. Lee, H. Shin, R. Baldick, Bivariate probabilistic wind power and real-time price forecasting and their applications to wind power bidding strategy development, IEEE Transactions on Power Systems 33 (6) (2018) 6087–6097. doi:10.1109/TPWRS.2018.2830785.
  • [17] J. Nowotarski, R. Weron, Recent advances in electricity price forecasting: A review of probabilistic forecasting, Renewable and Sustainable Energy Reviews 81 (2018) 1548–1568. doi:10.1016/j.rser.2017.05.234.
  • [18] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Fifth Edition, Chapman and Hall/CRC., 2011.
  • [19] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
    URL https://www.sciencedirect.com/science/article/pii/S0140988321000268
  • [20] C. Zhang, Y. Fu, Probabilistic electricity price forecast with optimal prediction interval, IEEE Transactions on Power Systems (2023) 1–10. doi:10.1109/TPWRS.2023.3235193.
  • [21] J.-F. Toubeau, T. Morstyn, J. Bottieau, K. Zheng, D. Apostolopoulou, Z. De Grève, Y. Wang, F. Vallée, Capturing spatio-temporal dependencies in the probabilistic forecasting of distribution locational marginal prices, IEEE Transactions on Smart Grid 12 (3) (2021) 2663–2674. doi:10.1109/TSG.2020.3047863.
  • [22] N. Nguyen, B. Quanz, Temporal Latent Auto-Encoder: A Method for Probabilistic Multivariate Time Series Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 9117–9125. doi:10.1609/aaai.v35i10.17101.
  • [23] Z. Zheng, L. Wang, L. Yang, Z. Zhang, Generative Probabilistic Wind Speed Forecasting: A Variational Recurrent Autoencoder Based Method, IEEE Transactions on Power Systems 37 (2) (2022) 1386–1398. doi:10.1109/TPWRS.2021.3105101.
  • [24] L. Li, J. Zhang, J. Yan, Y. Jin, Y. Zhang, Y. Duan, G. Tian, Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 8420–8428. doi:10.1609/aaai.v35i10.17023.
  • [25] M. Khodayar, S. Mohammadi, M. E. Khodayar, J. Wang, G. Liu, Convolutional Graph Autoencoder: A Generative Deep Neural Network for Probabilistic Spatio-Temporal Solar Irradiance Forecasting, IEEE Transactions on Sustainable Energy 11 (2) (2020) 571–583. doi:10.1109/TSTE.2019.2897688.
  • [26] Y. Li, X. Lu, Y. Wang, D. Dou, Generative time series forecasting with diffusion, denoise, and disentanglement, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, Vol. 35, Curran Associates, Inc., 2022, pp. 23009–23022.
  • [27] Z. Zhang, M. Wu, Predicting real-time locational marginal prices: A gan-based approach, IEEE Transactions on Power Systems 37 (2) (2022) 1286–1296. doi:10.1109/TPWRS.2021.3106263.
  • [28] J. Bottieau, Y. Wang, Z. De Grève, F. Vallée, J.-F. Toubeau, Interpretable transformer model for capturing regime switching effects of real-time electricity prices, IEEE Transactions on Power Systems 38 (3) (2023) 2162–2176. doi:10.1109/TPWRS.2022.3195970.
  • [29] S. Majumder, L. Dong, F. Doudi, Y. Cai, C. Tian, D. Kalathi, K. Ding, A. A. Thatte, N. Li, L. Xie, Exploring the capabilities and limitations of large language models in the electric energy sector, arXiv:2403.09125 (2024). arXiv:2403.09125.
  • [30] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting, 2022.
  • [31] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (12) (2021) 11106–11115. doi:10.1609/aaai.v35i12.17325.
  • [32] X. Wang, L. Tong, Q. Zhao, Generative probabilistic time series forecasting and applications in grid operations, to appear in the Proceedings of Conference on Information Sciences and Systems (2024).
    URL https://arxiv.org/abs/2402.13870
  • [33] X. Wang, L. Tong, Innovations Autoencoder and its Application in One-class Anomalous Sequence Detection, Journal of Machine Learning Research 23 (49) (2022) 1–27.
  • [34] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv:1701.07875 (Jan. 2017).
  • [35] P. J. Bickel, K. A. Doksum, Mathematical statistics: basic ideas and selected topics. (2nd ed.), Vol. 1, Pearson Prentice Hall, Upper Saddle River, N.J., 2007.
  • [36] T. Gneiting, M. Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application 1 (1) (2014) 125–151. doi:10.1146/annurev-statistics-062713-085831.
  • [37] M. White, R. Pike, C. Brown, R. Coutu, B. Ewing, S. Johnson, C. Mendrala, White paper: Inter-regional interchange scheduling analysis and options, Tech. rep., ISO New England and New York ISO (January 2011).
  • [38] E. Tómasson, M. R. Hesamzadeh, F. A. Wolak, Optimal offer-bid strategy of an energy storage portfolio: A linear quasi-relaxation approach, Applied Energy 260 (2020) 114251. doi:10.1016/j.apenergy.2019.114251.
  • [39] NERC, Balancing and frequency control, Tech. rep., NERC Resource Subcommittee, Princeton, NJ (January 2011).
    URL https://www.nerc.com/comm/OC/BAL0031_Supporting_Documents_2017_DL/NERC%20Balancing%20and%20Frequency%20Control%20040520111.pdf
  • [40] V. N. Ganesh, D. Bunn, Forecasting imbalance price densities with statistical methods and neural networks, IEEE Transactions on Energy Markets, Policy and Regulation 2 (1) (2024) 30–39. doi:10.1109/TEMPR.2023.3293693.
  • [41] M. A. Montemurro, P. A. Pury, Long-range fractal correlations in literary corpora, Fractals 10 (4) (2002) 451–461.
  • [42] J. Bhan, S. Kim, J. Kim, Y. Kwon, S. il Yang, K. Lee, Long-range correlations in Korean literary corpora, Chaos, Solitons & Fractals 29 (1) (2006) 69–81. doi:10.1016/j.chaos.2005.08.214.
  • [43] X. Wang, L. Tong, Innovations autoencoder and its application in one-class anomalous sequence detection, J. Mach. Learn. Res. 23 (1) (Jan 2022).
  • [44] K. R. Mestav, X. Wang, L. Tong, A deep learning approach to anomaly sequence detection for high-resolution monitoring of power systems, IEEE Transactions on Power Systems 38 (1) (2023) 4–13. doi:10.1109/TPWRS.2022.3168529.
  • [45] L. Tong, X. Wang, Q. Zhao, Grid monitoring and protection with continuous point-on-wave measurements and generative ai, arXiv:2403.06942 (2024). arXiv:2403.06942.
    URL https://arxiv.org/abs/2403.06942
  • [46] J. Hoffmann-Jørgensen, Stochastic Processes on Polish Spaces, Aarhus Universitet, Matematisk Institut., Aarhus, Denmark, 1991.
  • [47] C. Villani, The Wasserstein distances, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 57–75. doi:10.1007/978-3-540-71050-9_6.
  • [48] P. F. Christoffersen, Evaluating Interval Forecasts, International Economic Review 39 (4) (1998) 841–862. doi:10.2307/2527341.