Time Series Forecasting With Deep Learning: A Survey
Numerous deep learning architectures have been developed to accommodate the diversity of time series datasets across different domains. In this article, we survey common encoder and decoder designs used in both one-step-ahead and multi-horizon time series forecasting – describing how temporal information is incorporated into predictions by each model.
$\hat{y}_{i,t+1} = f(y_{i,t-k:t}, x_{i,t-k:t}, s_i)$, (2.1)

where $\hat{y}_{i,t+1}$ is the model forecast, $y_{i,t-k:t} = \{y_{i,t-k}, \ldots, y_{i,t}\}$, $x_{i,t-k:t} = \{x_{i,t-k}, \ldots, x_{i,t}\}$ are observations of the target and exogenous inputs respectively over a look-back window $k$, $s_i$ is static metadata associated with the entity (e.g. sensor location), and $f(\cdot)$ is the prediction function learnt by the model. While we focus on univariate forecasting in this survey (i.e. 1-D targets), we note that the same components can be extended to multivariate models without loss of generality [26, 27, 28, 29, 30]. For notational simplicity, we omit the entity index $i$ in subsequent sections unless explicitly required.
where $g_{enc}(\cdot)$, $g_{dec}(\cdot)$ are encoder and decoder functions respectively, recalling that the subscript $i$ from Equation (2.1) has been removed to simplify notation (e.g. $y_{i,t}$ replaced by $y_t$). These encoders and decoders hence form the basic building blocks of deep learning architectures, with the choice of network determining the types of relationships that can be learnt by our model. In this section, we examine modern design choices for encoders, as overviewed in Figure 1, and their relationship to traditional temporal models. In addition, we explore common network outputs and loss functions used in time series forecasting applications.
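To make the encoder–decoder decomposition concrete, the following is a minimal Python sketch (not taken from any of the surveyed models; the linear maps, dimensions and variable names are illustrative assumptions) of a one-step-ahead forecast produced by composing an encoder and a decoder over a lookback window:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: lookback window k, exogenous input size, hidden size.
k, n_exog, hidden = 24, 3, 8

# Hypothetical encoder: summarises the lookback window and static metadata
# into a latent variable z_t (here, a simple linear-tanh map for illustration).
W_enc = rng.normal(size=(hidden, k + k * n_exog + 1))

def g_enc(y_window, x_window, s):
    """z_t = g_enc(y_{t-k:t}, x_{t-k:t}, s)."""
    features = np.concatenate([y_window, x_window.ravel(), [s]])
    return np.tanh(W_enc @ features)

# Hypothetical decoder: maps the latent variable to the forecast.
w_dec = rng.normal(size=hidden)

def g_dec(z_t):
    """yhat_{t+1} = g_dec(z_t)."""
    return float(w_dec @ z_t)

# One-step-ahead forecast f(.) as the composition of encoder and decoder.
y_window = rng.normal(size=k)            # target history y_{t-k:t}
x_window = rng.normal(size=(k, n_exog))  # exogenous inputs x_{t-k:t}
s = 1.0                                  # static metadata (e.g. an entity identifier)

y_hat = g_dec(g_enc(y_window, x_window, s))
print(f"one-step-ahead forecast: {y_hat:.3f}")
```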
$(W * h)(l, t) = \sum_{\tau=0}^{k} W(l, \tau)\, h^{l}_{t-\tau}$, (2.5)
where $h^l_t \in \mathbb{R}^{H_{in}}$ is an intermediate state at layer $l$ at time $t$, $*$ is the convolution operator, $W(l, \tau) \in \mathbb{R}^{H_{out} \times H_{in}}$ is a fixed filter weight at layer $l$, and $A(\cdot)$ is an activation function, such as a sigmoid function, representing any architecture-specific non-linear processing. For CNNs that use a total of $L$ convolutional layers, we note that the encoder output is then $z_t = h^L_t$.
Considering the 1-D case, we can see that Equation (2.5) bears a strong resemblance to finite
impulse response (FIR) filters in digital signal processing [35]. This leads to two key implications
for temporal relationships learnt by CNNs. Firstly, in line with the spatial invariance assumptions
for standard CNNs, temporal CNNs assume that relationships are time-invariant – using the same
set of filter weights at each time step and across all time. In addition, CNNs can only use inputs within their defined lookback window, or receptive field, to make forecasts. As such, the
receptive field size k needs to be tuned carefully to ensure that the model can make use of all
relevant historical information. It is worth noting that a single causal CNN layer with a linear
activation function is equivalent to an auto-regressive (AR) model.
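As a rough illustration of Equation (2.5), the sketch below implements a single causal convolutional layer in plain numpy (an illustrative toy, not code from the surveyed architectures); with a linear activation and scalar states it reduces to exactly the AR-style weighted sum noted above:

```python
import numpy as np

def causal_conv(h, W):
    """Causal 1-D convolution per Equation (2.5):
    output[t] = sum_{tau=0..k} W[tau] * h[t - tau], with zero-padding for t - tau < 0.
    h : (T,) sequence of scalar intermediate states at one layer.
    W : (k + 1,) filter weights.
    """
    k = len(W) - 1
    h_pad = np.concatenate([np.zeros(k), h])   # pad the past so outputs stay causal
    return np.array([W @ h_pad[t:t + k + 1][::-1] for t in range(len(h))])

rng = np.random.default_rng(0)
y = rng.normal(size=50)

# With a linear activation, a single causal layer is an AR-style model:
# yhat_t = w_0 * y_t + w_1 * y_{t-1} + ... + w_k * y_{t-k}.
w = np.array([0.5, 0.3, 0.2])
print(causal_conv(y, w)[:5])
```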
To incorporate information from the more distant past without an excessively large filter, dilated convolutional layers can be used, taking the form below:

$(W * h)(l, t, d_l) = \sum_{\tau=0}^{\lfloor k / d_l \rfloor} W(l, \tau)\, h^{l}_{t - d_l \tau}$, (2.6)
where $\lfloor \cdot \rfloor$ is the floor operator and $d_l$ is a layer-specific dilation rate. Dilated convolutions can hence be interpreted as convolutions of a down-sampled version of the lower-layer features – reducing resolution to incorporate information from the distant past. As such, by increasing the dilation rate with each layer, dilated convolutions can gradually aggregate information at different time blocks, allowing for more history to be used in an efficient manner. With the WaveNet architecture of [32], for instance, dilation rates are increased in powers of 2 with adjacent time blocks aggregated in each layer – allowing for $2^l$ time steps to be used at layer $l$, as shown in Figure 1a.
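The following sketch (illustrative only; the filter weights and dilation schedule are assumptions) implements Equation (2.6) and stacks layers with WaveNet-style dilation rates to show how the receptive field grows with depth:

```python
import numpy as np

def dilated_causal_conv(h, W, d):
    """Dilated causal convolution per Equation (2.6):
    output[t] = sum_{tau} W[tau] * h[t - d * tau], with zero-padding of the past.
    """
    T = len(h)
    out = np.zeros(T)
    for t in range(T):
        for tau, w in enumerate(W):
            idx = t - d * tau
            if idx >= 0:
                out[t] += w * h[idx]
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=64)

# WaveNet-style stacking: the dilation rate doubles with each layer (1, 2, 4, ...),
# so the receptive field grows exponentially with depth.
W = np.array([0.6, 0.4])  # two-tap filter per layer (illustrative)
dilations = [1, 2, 4, 8]
for d in dilations:
    h = dilated_causal_conv(h, W, d)

receptive_field = 1 + sum((len(W) - 1) * d for d in dilations)
print("receptive field after 4 layers:", receptive_field)  # 16 time steps
```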
(ii) Recurrent Neural Networks
Recurrent neural networks (RNNs) have historically been used in sequence modelling [31]. At their core, RNN cells contain an internal memory state which acts as a compact summary of past information, and which is recursively updated with new observations at each time step:

$z_t = \nu(z_{t-1}, y_t, x_t, s)$, (2.7)
where $z_t \in \mathbb{R}^H$ is the hidden internal state of the RNN, and $\nu(\cdot)$ is the learnt memory update function. For instance, the Elman RNN [41], one of the simplest RNN variants, takes the form below:

$\hat{y}_{t+1} = \gamma_y(W_y z_t + b_y)$, (2.8)
$z_t = \gamma_z(W_{z_1} z_{t-1} + W_{z_2} y_t + W_{z_3} x_t + W_{z_4} s + b_z)$, (2.9)
where $W_{\cdot}$, $b_{\cdot}$ are the linear weights and biases of the network respectively, and $\gamma_y(\cdot)$, $\gamma_z(\cdot)$ are network activation functions. Note that RNNs do not require the explicit specification of a lookback window as per the CNN case. From a signal processing perspective, the main recurrent layer – i.e. Equation (2.9) – thus resembles a non-linear version of infinite impulse response (IIR) filters.
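A minimal numpy sketch of the recurrent update in Equations (2.7)–(2.9), using an Elman-style cell (all weights, sizes and names are illustrative assumptions rather than any specific published model):

```python
import numpy as np

rng = np.random.default_rng(0)
H, n_exog = 8, 3  # hidden size and exogenous input size (illustrative)

# Elman-style parameters: recurrent, target, exogenous and static weights plus bias.
W_z = rng.normal(scale=0.1, size=(H, H))
W_y = rng.normal(scale=0.1, size=(H, 1))
W_x = rng.normal(scale=0.1, size=(H, n_exog))
W_s = rng.normal(scale=0.1, size=(H, 1))
b_z = np.zeros(H)
w_out, b_out = rng.normal(scale=0.1, size=H), 0.0

def rnn_step(z_prev, y_t, x_t, s):
    """z_t = nu(z_{t-1}, y_t, x_t, s): tanh recurrent update (Equation (2.7))."""
    return np.tanh(W_z @ z_prev + (W_y @ [y_t]) + W_x @ x_t + (W_s @ [s]) + b_z)

# Unroll over a sequence: no explicit lookback window is needed, since the hidden
# state carries a summary of the entire past (IIR-like behaviour).
z = np.zeros(H)
y_seq, x_seq, s = rng.normal(size=20), rng.normal(size=(20, n_exog)), 1.0
for y_t, x_t in zip(y_seq, x_seq):
    z = rnn_step(z, y_t, x_t, s)

y_hat = w_out @ z + b_out   # one-step-ahead forecast from the final hidden state
print(f"forecast: {y_hat:.3f}")
```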
Long Short-term Memory Due to the infinite lookback window, older variants of RNNs can suffer from limitations in learning long-range dependencies in the data [42, 43], owing to issues with exploding and vanishing gradients [31]. Intuitively, this can be seen as a form of resonance in the memory state. Long Short-Term Memory networks (LSTMs) [44] were hence developed to address these limitations by improving gradient flow within the network. This is achieved through the use of a cell state $c_t$ which stores long-term information, modulated through a series of gates as below:

Input gate: $i_t = \sigma(W_{i_1} z_{t-1} + W_{i_2} y_t + W_{i_3} x_t + W_{i_4} s + b_i)$,
Output gate: $o_t = \sigma(W_{o_1} z_{t-1} + W_{o_2} y_t + W_{o_3} x_t + W_{o_4} s + b_o)$,
Forget gate: $f_t = \sigma(W_{f_1} z_{t-1} + W_{f_2} y_t + W_{f_3} x_t + W_{f_4} s + b_f)$,
where $z_{t-1}$ is the hidden state of the LSTM, and $\sigma(\cdot)$ is the sigmoid activation function. The gates modify the hidden and cell states of the LSTM as below:

Hidden state: $z_t = o_t \odot \tanh(c_t)$,
Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{c_1} z_{t-1} + W_{c_2} y_t + W_{c_3} x_t + W_{c_4} s + b_c)$,

where $\odot$ is the element-wise (Hadamard) product, and $\tanh(\cdot)$ is the tanh activation function.
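Below is a compact sketch of a generic LSTM cell update consistent with the gating equations above (a textbook form with illustrative parameter names, not code from a particular paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(z_prev, c_prev, inputs, params):
    """One LSTM update: gates modulate what is written to and read from the cell state."""
    v = np.concatenate([z_prev, inputs])            # [hidden state, new observations]
    i = sigmoid(params["W_i"] @ v + params["b_i"])  # input gate
    o = sigmoid(params["W_o"] @ v + params["b_o"])  # output gate
    f = sigmoid(params["W_f"] @ v + params["b_f"])  # forget gate
    c_tilde = np.tanh(params["W_c"] @ v + params["b_c"])
    c = f * c_prev + i * c_tilde                    # cell state: long-term memory
    z = o * np.tanh(c)                              # hidden state
    return z, c

rng = np.random.default_rng(0)
H, n_in = 8, 5
params = {f"W_{g}": rng.normal(scale=0.1, size=(H, H + n_in)) for g in "iofc"}
params.update({f"b_{g}": np.zeros(H) for g in "iofc"})

z, c = np.zeros(H), np.zeros(H)
for obs in rng.normal(size=(20, n_in)):             # observations = [y_t, x_t, s] stacked
    z, c = lstm_step(z, c, obs, params)
print(z[:3])
```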
Relationship to Bayesian Filtering As examined in [39], Bayesian filters [45] and RNNs are similar in their maintenance of a hidden state which is recursively updated over time. For Bayesian filters, such as the Kalman filter [46], inference is performed by updating the sufficient statistics of the latent state – using a series of state transition and error correction steps. As the Bayesian filtering steps use deterministic equations to modify sufficient statistics, the RNN can be viewed as a simultaneous approximation of both steps – with the memory vector containing all relevant information required for prediction.
(iii) Attention Mechanisms
The development of attention mechanisms [47, 48] has also led to improvements in long-term dependency learning, with attention layers aggregating temporal features using dynamically generated weights:

$h_t = \sum_{\tau=0}^{k} \alpha(\kappa_t, q_\tau)\, v_{t-\tau}$,

where the key $\kappa_t$, query $q_\tau$ and value $v_{t-\tau}$ are intermediate features produced at different time steps by lower levels of the network. Furthermore, $\alpha(\kappa_t, q_\tau) \in [0, 1]$ is the attention weight for $t - \tau$ generated at time $t$, and $h_t$ is the context vector output of the attention layer. Note that multiple attention layers can also be used together as per the CNN case, with the output from the final layer forming the encoded latent variable $z_t$.
Recent work has also demonstrated the benefits of using attention mechanisms in time series
forecasting applications, with improved performance over comparable recurrent networks [52,
53, 54]. For instance, [52] use attention to aggregate features extracted by RNN encoders, with
attention weights produced as below:
where $\alpha(t) = [\alpha(t, 0), \ldots, \alpha(t, k)]$ is a vector of attention weights, $\kappa_{t-1}$, $q_t$ are outputs from LSTM encoders used for feature extraction, and $\mathrm{softmax}(\cdot)$ is the softmax activation function. More recently, Transformer architectures have also been considered in [53, 54], which apply scaled dot-product self-attention [49] to features extracted within the lookback window. From a time series
modelling perspective, attention provides two key benefits. Firstly, networks with attention are
able to directly attend to any significant events that occur. In retail forecasting applications, for
example, this includes holiday or promotional periods which can have a positive effect on sales.
Secondly, as shown in [54], attention-based networks can also learn regime-specific temporal
dynamics – by using distinct attention weight patterns for each regime.
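A minimal sketch of scaled dot-product attention over a lookback window, in the spirit of [49] (dimensions, variable names and the toy data are assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention(query, keys, values):
    """Context vector h_t = sum_tau alpha(kappa, q) * v_{t-tau}.
    query  : (d,)       query for the current time step
    keys   : (k+1, d)   keys for positions in the lookback window
    values : (k+1, d_v) values for the same positions
    """
    d = query.shape[0]
    scores = keys @ query / np.sqrt(d)     # scaled dot-product scores
    alpha = softmax(scores)                # attention weights in [0, 1], summing to 1
    return alpha @ values, alpha           # weighted average of values, plus the weights

rng = np.random.default_rng(0)
k, d, d_v = 24, 16, 8
q = rng.normal(size=d)
K = rng.normal(size=(k + 1, d))
V = rng.normal(size=(k + 1, d_v))

h_t, alpha = attention(q, K, V)
# Inspecting alpha shows which past time steps the forecast attends to most,
# e.g. holiday or promotional periods in retail demand data.
print("most attended lag:", int(np.argmax(alpha)))
```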
While the loss functions above are the most common across applications, we note that the
flexibility of neural networks also allows for more complex losses to be adopted – e.g. losses for
quantile regression [56] and multinomial classification [32].
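For instance, a quantile (pinball) loss of the kind used for quantile regression outputs [56] can be sketched as below (a generic formulation, with illustrative inputs):

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: penalises under- and over-prediction asymmetrically."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.0, 10.0, 13.0])

# Averaging the loss across several quantiles (e.g. P10, P50, P90) trains a network
# to output an approximate predictive distribution rather than a single point estimate.
for q in (0.1, 0.5, 0.9):
    print(q, round(quantile_loss(y_true, y_pred, q), 3))
```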
Probabilistic Outputs While point estimates are crucial to predicting the future value of a target,
understanding the uncertainty of a model’s forecast can be useful for decision makers in different
domains. When forecast uncertainties are wide, for instance, model users can exercise more caution
when incorporating predictions into their decision making, or alternatively rely on other sources
of information. In some applications, such as financial risk management, having access to the full
predictive distribution will allow decision makers to optimise their actions in the presence of rare
events – e.g. allowing risk managers to insulate portfolios against market crashes.
A common way to model uncertainties is to use deep neural networks to generate parameters
of known distributions [27, 37, 38]. For example, Gaussian distributions are typically used for
forecasting problems with continuous targets, with the networks outputting means and variance
parameters for the predictive distributions at each step as below:
$\mu(t, \tau) = W_\mu h^L_t + b_\mu$, (2.21)
$\zeta(t, \tau) = \mathrm{softplus}(W_\Sigma h^L_t + b_\Sigma)$, (2.22)
where $h^L_t$ is the final layer of the network, and $\mathrm{softplus}(\cdot)$ is the softplus activation function, used to ensure that standard deviations take only positive values.
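A minimal sketch of such a Gaussian output head, together with the Gaussian negative log-likelihood that networks of this kind are commonly trained with (weight names and sizes are illustrative assumptions):

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))   # maps real values to (0, inf)

def gaussian_head(h_L, W_mu, b_mu, W_sigma, b_sigma):
    """Map the final network layer h_L to a mean and standard deviation (Eqs. 2.21-2.22)."""
    mu = W_mu @ h_L + b_mu
    sigma = softplus(W_sigma @ h_L + b_sigma)   # softplus keeps sigma strictly positive
    return mu, sigma

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2) - a common training loss."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(0)
H = 8
h_L = rng.normal(size=H)                        # final layer of the network at time t
W_mu, b_mu = rng.normal(size=H), 0.0
W_sigma, b_sigma = rng.normal(size=H), 0.0

mu, sigma = gaussian_head(h_L, W_mu, b_mu, W_sigma, b_sigma)
print(f"forecast N({mu:.2f}, {sigma:.2f}^2), NLL at y=1.0: {gaussian_nll(1.0, mu, sigma):.3f}")
```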
$\hat{y}_{t+\tau} = f(y_{t-k:t}, x_{t-k:t}, u_{t-k:t+\tau}, s, \tau)$,

where $\tau \in \{1, \ldots, \tau_{max}\}$ is a discrete forecast horizon, $u_t$ are known future inputs (e.g. date information, such as the day-of-week or month) across the entire horizon, and $x_t$ are inputs that can only be observed historically. In line with traditional econometric approaches [57, 58], deep learning architectures for multi-horizon forecasting can be divided into iterative and direct methods – as shown in Figure 2 and described in detail below.
Figure 2: Main types of multi-horizon forecasting models. Colours are used to distinguish between model weights – with iterative models using a common model across the entire horizon and direct methods taking a sequence-to-sequence approach.

As autoregressive models are trained in the exact same fashion as one-step-ahead prediction models (i.e. via backpropagation through time), the iterative approach allows for the easy generalisation
of standard models to multi-step forecasting. However, as a small amount of error is produced
at each time step, the recursive structure of iterative methods can potentially lead to large error
accumulations over longer forecasting horizons. In addition, iterative methods assume that all
inputs but the target are known at run-time – requiring only samples of the target to be fed into
future time steps. This can be a limitation in many practical scenarios where observed inputs exist,
motivating the need for more flexible methods.
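The sketch below illustrates the iterative strategy with a toy AR model standing in for a trained one-step-ahead network, recursively feeding each forecast back in as the newest observation (an illustration of the strategy only, not any surveyed model):

```python
import numpy as np

def one_step_model(window, weights):
    """Stand-in for a trained one-step-ahead forecaster yhat_{t+1} = f(y_{t-k:t})."""
    return float(weights @ window)

def iterative_forecast(history, weights, horizon):
    """Recursively feed forecasts back into the lookback window for multi-step prediction.
    Note: errors made at early steps propagate into later steps - the key drawback of
    iterative methods relative to direct (sequence-to-sequence) approaches."""
    window = list(history[-len(weights):])
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step_model(np.array(window), weights)
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]      # treat the forecast as the newest observation
    return np.array(forecasts)

rng = np.random.default_rng(0)
history = np.sin(np.arange(50) * 0.3) + 0.05 * rng.normal(size=50)
weights = np.array([-0.2, 0.1, 0.3, 0.8])  # toy AR(4) coefficients (illustrative)
print(iterative_forecast(history, weights, horizon=5))
```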
$\hat{y}_{i,t+\tau} = \exp(W_{ES}\, h^L_{i,t+\tau} + b_{ES}) \times l_{i,t} \times \gamma_{i,t+\tau}$, (3.1)
$l_{i,t} = \beta_1^{(i)}\, y_{i,t} / \gamma_{i,t} + (1 - \beta_1^{(i)})\, l_{i,t-1}$, (3.2)
$\gamma_{i,t} = \beta_2^{(i)}\, y_{i,t} / l_{i,t} + (1 - \beta_2^{(i)})\, \gamma_{i,t-\kappa}$, (3.3)

where $h^L_{i,t+\tau}$ is the final layer of the network for the $\tau$-step-ahead forecast, $l_{i,t}$ is a level component, $\gamma_{i,t}$ is a seasonality component with period $\kappa$, and $\beta_1^{(i)}, \beta_2^{(i)}$ are entity-specific static coefficients. From the above equations, we can see that the exponential smoothing components ($l_{i,t}, \gamma_{i,t}$) handle the broader (e.g. exponential) trends within the datasets, reducing the need for additional input scaling.
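The following sketch computes multiplicative level and seasonality components in the spirit of Equations (3.2)–(3.3) (the smoothing coefficients, the period, and the simplification of using the previous period's seasonal factor in the level update are assumptions):

```python
import numpy as np

def holt_winters_components(y, beta1, beta2, kappa):
    """Multiplicative level/seasonality updates in the spirit of Equations (3.2)-(3.3).
    For simplicity the level update divides by the seasonal factor from one period
    earlier (gamma_{t - kappa}), a standard Holt-Winters style approximation."""
    T = len(y)
    level = np.empty(T)
    season = np.ones(T + kappa)   # season[j + kappa] stores gamma_j; first kappa are initial factors
    level_prev = y[0]
    for t in range(T):
        g_prev = season[t]                                                   # gamma_{t - kappa}
        level[t] = beta1 * y[t] / g_prev + (1 - beta1) * level_prev          # cf. Eq. (3.2)
        season[t + kappa] = beta2 * y[t] / level[t] + (1 - beta2) * g_prev   # cf. Eq. (3.3)
        level_prev = level[t]
    return level, season[kappa:]

rng = np.random.default_rng(0)
t = np.arange(120)
y = (10 + 0.1 * t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12)) + 0.2 * rng.normal(size=120)

level, season = holt_winters_components(y, beta1=0.3, beta2=0.1, kappa=12)
# The network then only needs to model the remaining (normalised) variation, with
# forecasts recombined multiplicatively as in Equation (3.1).
print("final level:", round(level[-1], 2), "final seasonal factor:", round(season[-1], 2))
```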
$y_t = a(h^L_{i,t+\tau})^T l_t + \phi(h^L_{i,t+\tau})\, \epsilon_t$, (3.4)
$l_t = F(h^L_{i,t+\tau})\, l_{t-1} + q(h^L_{i,t+\tau}) + \Sigma(h^L_{i,t+\tau}) \odot \Sigma_t$, (3.5)

where $l_t$ is the hidden latent state, $a(\cdot), F(\cdot), q(\cdot)$ are linear transformations of $h^L_{i,t+\tau}$, $\phi(\cdot), \Sigma(\cdot)$ are linear transformations with softmax activations, $\epsilon_t \sim N(0, 1)$ is a univariate residual and $\Sigma_t \sim N(0, I)$ is a multivariate normal random variable.
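A short sketch of simulating one step of a linear-Gaussian state space of the form in Equations (3.4)–(3.5), with fixed placeholder values standing in for the parameters a network would generate from $h^L_{i,t+\tau}$ (all values are illustrative):

```python
import numpy as np

def state_space_step(l_prev, a, F, q, phi, Sigma, rng):
    """One simulation step of the linear state space in Equations (3.4)-(3.5):
    the latent state evolves linearly with multivariate noise, and the observation
    is a noisy linear read-out of the state."""
    Sigma_t = rng.standard_normal(len(l_prev))        # multivariate N(0, I) innovation
    l_t = F @ l_prev + q + Sigma * Sigma_t            # Eq. (3.5), element-wise noise scaling
    eps_t = rng.standard_normal()                     # univariate N(0, 1) residual
    y_t = a @ l_t + phi * eps_t                       # Eq. (3.4)
    return y_t, l_t

rng = np.random.default_rng(0)
dim = 3
# Placeholders for a(h), F(h), q(h), phi(h), Sigma(h), which a network would
# actually generate from its final layer h^L at each forecast step.
a = np.array([1.0, 0.5, 0.0])
F = 0.9 * np.eye(dim)
q = np.zeros(dim)
phi, Sigma = 0.1, np.array([0.05, 0.05, 0.05])

l = np.zeros(dim)
trajectory = []
for _ in range(5):
    y, l = state_space_step(l, a, F, q, phi, Sigma, rng)
    trajectory.append(round(float(y), 3))
print(trajectory)
```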
Attention outputs can hence also be interpreted as a weighted average over temporal features, using the weights supplied by the attention layer at each step. An analysis of attention weights can then be used to understand the relative importance of features at each time step. Instance-wise interpretability studies have been performed in [53, 55, 76], where the authors used specific examples to show how the magnitudes of $\alpha(t, \tau)$ can indicate which time points were most significant for predictions. By analysing distributions of attention vectors across time, [54] also shows how attention mechanisms can be used to identify persistent temporal relationships – such as seasonal patterns – in the dataset.
Competing Interests. The author(s) declare that they have no competing interests.