
Time Series Forecasting using RNNs: an Extended Attention Mechanism to Model Periods and Handle Missing Values

arXiv:1703.10089v1 [cs.LG] 29 Mar 2017

Yagmur Gizem Cinar (yagmur.cinar@imag.fr)
Univ. Grenoble Alps/CNRS/Grenoble INP, Grenoble, France

Hamid Mirisaee (hamid.mirisaee@imag.fr)
Univ. Grenoble Alps/CNRS/Grenoble INP, Grenoble, France

Parantapa Goswami (parantapa.goswami@imag.fr)
Univ. Grenoble Alps/CNRS/Grenoble INP, Grenoble, France

Eric Gaussier (eric.gaussier@imag.fr)
Univ. Grenoble Alps/CNRS/Grenoble INP, Grenoble, France

Ali Ait-Bachir (a.ait-bachir@coservit.com)
Coservit, Grenoble, France

Vadim Strijov (strijov@ccas.ru)
Computing Center of the Russian Academy of Sciences, Moscow, Russia

Abstract

In this paper, we study the use of recurrent neural networks (RNNs) for modeling and forecasting time series. We first illustrate the fact that standard sequence-to-sequence RNNs neither capture well periods in time series nor handle well missing values, even though many real-life time series are periodic and contain missing values. We then propose an extended attention mechanism that can be deployed on top of any RNN and that is designed to capture periods and make the RNN more robust to missing values. We show the effectiveness of this novel model through extensive experiments with multiple univariate and multivariate datasets.

1. Introduction

Forecasting future values of temporal variables is termed time series prediction or time series forecasting and has applications in a variety of fields, such as finance, economics, meteorology, or customer support center operations. A considerable number of stochastic (De Gooijer & Hyndman, 2006) and machine learning based (Bontempi et al., 2013) approaches have been proposed for this problem. A particular class of approaches that has recently received much attention for modeling sequences is based on sequence-to-sequence Recurrent Neural Networks (RNNs) (Graves, 2013). Sequence-to-sequence RNNs constitute a flexible class of methods particularly well adapted to time series forecasting when one aims at predicting a sequence of future values on the basis of a sequence of past values. They have furthermore led to state-of-the-art results in different applications (such as machine translation or image captioning). We explore here how to adapt them to general time series.

In real-life scenarios, time series data often display periods, due for example to environmental factors such as seasonality, or to the patterns underlying the activities measured (the workload on professional email servers, for example, has both weekly and daily periods). A first question one can ask is thus: Can sequence-to-sequence RNNs model periods in time series? Furthermore, missing observations and gaps in the data are also common, due e.g. to lost records, mistakes during data entry, or faulty sensors. In addition, for multivariate time series, underlying variables can have mixed sampling rates, i.e. different variables may have different sampling frequencies, again resulting in values missing at certain times when comparing the different variables. The standard strategy to tackle missing values is to apply numerical or deterministic approaches, such as interpolation and data imputation methods, to explicitly estimate or infer the missing values, and then apply classical methods for forecasting.
In the context of sequence-to-sequence RNNs, a standard technique, called padding, consists in repeating the hidden state of the last observed value over the missing values. All these methods however make strong assumptions about the functional form of the data, and imputed values, especially for long gaps, can introduce sources of errors and biases. The second question we address here is thus: Do sequence-to-sequence RNNs handle well missing values?

To answer the above questions, we first conduct in Section 2 simple experiments to illustrate how state-of-the-art sequence-to-sequence RNNs behave on time series. Even though our approach is general in the sense that it can be applied to any sequence-to-sequence RNN, we consider in this study bidirectional RNNs (Schuster & Paliwal, 1997) based on Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) with peephole connections (Gers et al., 2002), known to perform very well in practice (Graves, 2013), and we also make use of the attention mechanism recently introduced in (Bahdanau et al., 2014). We then present in Section 3 two extensions of the attention mechanism to better handle periods and missing values, prior to describing their application to multivariate time series (Section 4) and evaluating their impact on several univariate and multivariate time series in Section 5. We finally discuss the related work in Section 6. In the remainder, we use the term RNNs to refer to sequence-to-sequence RNNs.

2. How do RNNs behave on time series? A study of periodicity and missing values

We focus at first on univariate time series, even though most of the elements we discuss apply directly to multivariate time series (we will consider multivariate time series in a second step). As mentioned before, time series forecasting consists in predicting future values from past, observed values. The time span of the past values, denoted by $T$, is termed history, whereas the time span of the future values to be predicted, denoted by $T'$, is termed forecast horizon (in multi-step ahead prediction, which we consider here, $T' > 1$). The prediction is modeled as a regression-like problem where the goal is to learn the relation $y = r(x)$, where $y = (x_{T+1}, \ldots, x_{T+i}, \ldots, x_{T+T'})$ is the output sequence and $x = (x_1, \ldots, x_j, \ldots, x_T)$ is the input sequence. Both input and output sequences are ordered and indexed by time instants. For clarity's sake, and without loss of generality, for the input sequence $x = (x_1, \ldots, x_j, \ldots, x_T)$, the output sequence $y$ is rewritten as $y = (y_1, \ldots, y_i, \ldots, y_{T'})$.

RNNs rely on three parts: one dedicated to encoding the input, referred to as encoder; one dedicated to constructing a summary of the encoding, which we will refer to as summarizer; and one dedicated to generating the output, referred to as decoder. The encoder represents each input $x_j$, $1 \le j \le T$, as a hidden state $h_j = f(x_j, h_{j-1})$, $h_j \in \mathbb{R}^n$, where the function $f$ corresponds here to the non-linear transformation implemented in LSTM with peephole connections (Gers et al., 2002). The equations of LSTM with peephole connections are given in Eq. 1. For bidirectional RNNs (Schuster & Paliwal, 1997), the input is read both forward and backward, leading to two vectors $\overrightarrow{h}_j = f(x_j, \overrightarrow{h}_{j-1})$ and $\overleftarrow{h}_j = f(x_j, \overleftarrow{h}_{j+1})$. The final hidden state for any input $x_j$ is constructed simply by concatenating the corresponding forward and backward hidden states, i.e. $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]^T$, where now $h_j \in \mathbb{R}^{2n}$.

$$
\begin{aligned}
i_j &= \sigma(W_i x_j + U_i h_{j-1} + C'_i c'_{j-1}) \\
f_j &= \sigma(W_f x_j + U_f h_{j-1} + C'_f c'_{j-1}) \\
c'_j &= f_j c'_{j-1} + i_j \tanh(W_{c'} x_j + U_{c'} h_{j-1}) \\
o_j &= \sigma(W_o x_j + U_o h_{j-1} + C'_o c'_{j-1}) \\
h_j &= o_j \tanh(c'_j)
\end{aligned} \qquad (1)
$$

The summarizer builds, from the sequence of input hidden states $h_j$, $1 \le j \le T$, a context vector $c = q(\{h_1, \ldots, h_j, \ldots, h_T\})$. In its most simple form, the function $q$ just selects the last hidden state (Graves, 2013): $q(\{h_1, \ldots, h_j, \ldots, h_T\}) = h_T$. More recently, in (Bahdanau et al., 2014), an attention mechanism was used to construct different context vectors $c_i$ for different outputs $y_i$ ($1 \le i \le T'$) as a weighted sum of the hidden states of the encoder representing the input history: $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$, where $\alpha_{ij}$ are the attention weights. They are computed as:

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{T} \exp(e_{ij'})} \qquad (2) $$

where

$$ e_{ij} = a(s_{i-1}, h_j) = v_a^T \tanh(W_a s_{i-1} + U_a h_j) \qquad (3) $$

with $a$ being a feedforward neural network with weights $W_a$, $U_a$ and $v_a$, trained in conjunction with the entire encoder-decoder framework. $s_{i-1}$ is the hidden state obtained by the decoder (see below) at time $i-1$. One can note that $a$ scores the importance of the input at time $j$ (specifically its representation $h_j$ by the encoder) for the output at time $i$, given the previous hidden state $s_{i-1}$ of the decoder. This allows the model to concentrate, or put attention, on certain parts of the input history to predict each output.

The decoder parallels the encoder by associating each output $y_i$, $1 \le i \le T'$, to a hidden state vector $s_i = g(y_{i-1}, s_{i-1}, c_i)$, where $s_i \in \mathbb{R}^n$ and $y_{i-1}$ denotes the output at time $i-1$. The function $g$ corresponds again to an LSTM with peephole connections, with an explicit context added (Graves, 2013).
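As a concrete illustration of the attention mechanism of Eqs. 2-3, the following NumPy sketch computes the attention weights and the context vector for a single decoder step. It is our own illustration, not the paper's implementation (which relies on Theano/Lasagne); the weight matrices and dimensions are placeholders.

```python
import numpy as np

def attention_step(h, s_prev, W_a, U_a, v_a):
    """One decoder step of the additive attention of Eqs. 2-3.

    h      : (T, 2n) encoder hidden states (bidirectional, hence 2n)
    s_prev : (n,)    previous decoder hidden state s_{i-1}
    Returns the attention weights alpha_i (T,) and the context vector c_i (2n,).
    """
    # e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j)                 (Eq. 3)
    scores = np.tanh(s_prev @ W_a.T + h @ U_a.T) @ v_a           # (T,)
    # alpha_{ij} = softmax of the scores over history positions  (Eq. 2)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # c_i = sum_j alpha_{ij} h_j
    context = weights @ h                                        # (2n,)
    return weights, context

# Toy usage with random placeholder parameters.
T, n, att = 8, 4, 5
rng = np.random.default_rng(0)
h = rng.normal(size=(T, 2 * n))
s_prev = rng.normal(size=n)
W_a, U_a, v_a = rng.normal(size=(att, n)), rng.normal(size=(att, 2 * n)), rng.normal(size=att)
alpha, c = attention_step(h, s_prev, W_a, U_a, v_a)
```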
The equations of LSTM with peephole connections and context are given in Eq. 4. Each output is then predicted in sequence through:

$$ y_i = W_{out}\, s_i + b_{out}, $$

where $b_{out}$ is a scalar and $W_{out} \in \mathbb{R}^n$.

$$
\begin{aligned}
i_i &= \sigma(W_i y_{i-1} + U_i s_{i-1} + C'_i c'_{i-1} + C_i c_i) \\
f_i &= \sigma(W_f y_{i-1} + U_f s_{i-1} + C'_f c'_{i-1} + C_f c_i) \\
c'_i &= f_i c'_{i-1} + i_i \tanh(W_{c'} y_{i-1} + U_{c'} s_{i-1} + C_{c'} c_i) \\
o_i &= \sigma(W_o y_{i-1} + U_o s_{i-1} + C'_o c'_{i-1} + C_o c_i) \\
s_i &= o_i \tanh(c'_i)
\end{aligned} \qquad (4)
$$

2.1. RNNs and periodicity

To illustrate how RNNs behave w.r.t. periods in time series, we retained two periodic time series, fully described in Section 5 and representing respectively (a) the electrical consumption load in Poland over the period of 10 years (1/1/2002-7/31/2012), and (b) the maximum daily temperature, again in Poland, over the period of 12 years (1/1/2002-7/31/2014). The first time series, referred to as PSE, has two main periods, a daily one and a weekly one, whereas the second time series, referred to as PW, has a yearly period. Figure 1 (left and middle) displays the autocorrelation (Chatfield, 2016) for each time series and shows these different periods (note that the term original indicates that we used the dataset as it is, without any modification). We ran the RNN described above, with (RNN-A) and without (RNN) the attention mechanism (in the latter case, the context is taken to be the last hidden state). The experimental setting retained is the one described in Section 5.

We also added to the input sequence the time stamp (or position) information $j$, to help the RNNs capture potential periods. When the positions are added, each input at time $j$ contains two values: $x_j$ and $j$, $1 \le j \le T$. The position is of course not predicted in the output. The results we obtained, evaluated in terms of mean squared error (MSE) and displayed in Figure 1 (right), show that adding the time stamps to the input does not improve the RNNs: the results slightly degrade on PSE (from 0.039 to 0.053) and are almost the same on PW. Furthermore, on these time series, the use of the attention mechanism has almost no impact: the MSE remains almost the same when the attention mechanism is used.

To see whether standard RNNs are able to capture periods in time series, we plot, in Figure 2, the weights obtained by the attention mechanism. As mentioned before, the attention mechanism aims at capturing the importance of the input at time $j$ for predicting the output at time $i$. Were standard RNNs able to capture periods, we should see that the corresponding weights in the attention mechanism are higher. However, as one can note, this is not the case: for RNNs with time stamp information, a higher weight is put on the first instances, which do not correspond to the actual periods of the time series. Higher weights are put with standard RNNs on the more recent history on PSE. This seems more reasonable, and leads to better results, even though the period at one day is missed. On PW, the weights with standard RNNs are uniformly distributed over the different time stamps. This shows that current RNNs do not always detect (and make use of) the periods underlying the data.

2.2. RNNs and missing values

We now turn to the problem of missing values and use this time a degraded version of the above time series in which 15% of the original values are missing, either in consecutive sequences (to form gaps in the data) or randomly (we refer again the reader to Section 5 for further detail on these datasets). As mentioned before, a standard technique to handle such missing values in RNNs, called padding, consists in repeating the hidden state of the last observed value over the missing values. Another standard approach on temporal data is to reconstruct missing values using interpolation. Among the existing interpolation methods, we experimented with three methods from three different categories, namely linear interpolation, non-linear spline interpolation and kernel-based Fourier transform interpolation (Meijering, 2002). We measured their reconstruction ability on all four datasets we used for experiments (described in Section 5.1) in terms of MSE, calculated between the interpolated and original datasets. In all the cases, we found that linear interpolation provided the minimum reconstruction error. We thus compare here the use of linear interpolation and padding on RNNs, with and without the attention mechanism. The results obtained, displayed in Figure 3 (left), show that linear interpolation tends to outperform the padding technique on both datasets¹, particularly on PSE15. Furthermore, as before, the attention mechanism brings here no improvement.

Looking at the weights obtained by the attention mechanism (Figure 3 (middle and right)) for both PSE and PW, one can see that, not surprisingly, the missing input data sometimes get high weights and are primarily used to predict output values, despite their lack of reliability. In particular, for PW, the missing values corresponding to the gap comprised between 12 and 8 months in the history get very high values. This indicates that current RNNs are not entirely robust to missing values, as an important part of their decision can be based on them.

¹ We obtained the same results on all collections, with different levels of missing values.
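For illustration, the two treatments of missing values compared above can be sketched in a few lines. This is our own simplification: in the paper, padding repeats the hidden state of the last observed value inside the RNN, whereas the sketch below simply carries the last observed input value forward; marking missing entries with NaN is also our assumption.

```python
import numpy as np

def pad_missing(x):
    """Padding: carry the last observed value forward over missing entries (NaN)."""
    x = x.copy()
    last = x[0]
    for j in range(len(x)):
        if np.isnan(x[j]):
            x[j] = last          # repeat the last observed value
        else:
            last = x[j]
    return x

def interpolate_missing(x):
    """Linear interpolation between the observed values surrounding each gap."""
    x = x.copy()
    idx = np.arange(len(x))
    observed = ~np.isnan(x)
    x[~observed] = np.interp(idx[~observed], idx[observed], x[observed])
    return x

# Example: a series with a three-step gap.
series = np.array([1.0, 2.0, np.nan, np.nan, np.nan, 6.0])
print(pad_missing(series))          # [1. 2. 2. 2. 2. 6.]
print(interpolate_missing(series))  # [1. 2. 3. 4. 5. 6.]
```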
Figure 1. Autocorrelation for PSE and PW. A time span of 5 weeks is considered for PSE (left), and of 2.5 years for PW (middle). Comparison of RNNs with and without attention, and with and without time stamp (or position) information (right).

Figure 2. Weights of the attention mechanism for the RNNs without and with time stamp (or position) information on both PSE (left) and PW (right). The weights are averaged over all test examples (see Section 5.3).

Figure 3. Comparison of padding and linear interpolation on PSE and PW with 15% missing values (left). Examples of attention weights on missing values for PSE (middle) and PW (right).
3. Extended attention mechanism

We present here two simple extensions of the attention mechanism that allow one to model periods and better handle missing values in RNNs.

3.1. Modeling periodicity

With a history size $T$ and a forecast horizon $T'$, the possible periods to rely on for prediction range in the set $\{1, \cdots, T + T' - 1\}$. We here explicitly model all possible periods through a real vector, which we will refer to as $\tau$, of dimension $T + T' - 1$. This vector is used to encode the importance of all possible periods, and to decrease or increase the importance of the corresponding input to predict the current output. To this end, we modify the original attention mechanism to reweigh the attention weights as follows:

$$ e_{ij,\tau} = v_a^T \tanh(W_a s_{i-1} + U_a h_j) \times \big(\tau^T \Delta(i,j)\big) \qquad (5) $$

where $\Delta(i,j) \in \mathbb{R}^{T+T'-1}$ is a binary vector that is 1 on dimension $(i-j)$ and 0 elsewhere. We will refer to this model as RNN-$\tau$.

As one can note, $\tau_{(i-j)}$ will either decrease or increase the weight $e_{ij}$ computed by the original attention mechanism. Since $\tau$ is learned along with the other parameters of the RNN, we expect that $\tau_{(i-j)}$ will be high for those values of $(i-j)$ that correspond to periods of the time series. We will see in Section 5 that this is indeed the case.

3.2. Handling missing values

Provided their proportion is not too important (otherwise the forecasting task is less relevant), missing values in time series can easily be identified by considering that the most common interval between consecutive points corresponds to the sampling rate of the time series. Points not present at the expected intervals are then considered as missing. This strategy works well in practice, provided the sampling rate of the time series does not change much over time. It is thus possible to identify missing values in time series, and then use padding or interpolation techniques to represent them. Padded or interpolated inputs should however not be treated as standard inputs, as they are less reliable than other inputs. Furthermore, be it with padding or interpolation techniques, when the size of a gap, i.e. the number of consecutive missing values, is important, the further away the missing value is from the observed values (the last one in case of padding, the last and next ones in case of interpolation), the less confident one is in the padded hidden states or interpolated values. It is thus desirable to decrease the importance of missing values according to their position in gaps.

We propose to do so by another modification of the attention mechanism that again reweighs the attention weights. We consider two reweighing schemes: a first one that penalizes missing values further away from the last observed value through a decaying function, a priori adapted to padding, and a second one that penalizes missing values depending on whether they are at the beginning, middle or end of a gap, a priori adapted to interpolation methods that rely on values before and after the gap. The first reweighing scheme corresponds to the following equation, which extends Eq. 5 and relies on an exponentially decaying function:

$$ e_{ij,\tau\mu,1} = e_{ij,\tau} \times f_1(\mu, j) \qquad (6) $$

with:

$$ f_1(\mu, j) = \begin{cases} \exp(-\mu(j - j_{last})) & \text{if } j \text{ is missing} \\ 1 & \text{otherwise} \end{cases} $$

where $j_{last}$ denotes the last observed value before $j$ and $\mu$ is an additional parameter that is learned with the other parameters of the RNN. $e_{ij,\tau}$ is given by Eq. 5. We will refer to this model as RNN-$\tau\mu$-1.

Similarly, the second reweighing scheme is defined as follows:

$$ e_{ij,\tau\mu,2} = e_{ij,\tau} \times \big(1 + \mu^T \mathrm{Pos}(j; \theta^g)\big) \qquad (7) $$

where $\mathrm{Pos}(j; \theta^g)$ is a three-dimensional vector which is null if $j$ is not missing; otherwise, denoting the length of the gap by $|g|$: if $\frac{j - j_{last}}{|g|} < \theta_1^g$, then the first coordinate of $\mathrm{Pos}(j; \theta^g)$ is set to 1; if $\theta_1^g < \frac{j - j_{last}}{|g|} < \theta_2^g$, then the second coordinate is set to 1; and if $\frac{j - j_{last}}{|g|} > \theta_2^g$, then the third coordinate is set to 1. Note that if only one value is missing, then $|g| = 1$.

$\mu$ is now a three-dimensional vector, learned with the other parameters of the RNN, where each of the coordinates aims at reweighing the impact of missing values on the prediction task according to their position in gaps. The values $\theta_1^g$ and $\theta_2^g$ are hyper-parameters; they are set here in such a way that they divide each gap $g$ into three equal parts corresponding to the beginning of the gap, its middle and its end. We will refer to this model as RNN-$\tau\mu$-2. As one can note, if there is no gap in the data (i.e. no missing values), then both Eq. 6 and Eq. 7 reduce to Eq. 5.

Figure 4 illustrates the modifications brought to the attention mechanism to model periods and handle missing values (the reweighing scheme of RNN-$\tau\mu$-1 is used but can nevertheless be replaced by the one of RNN-$\tau\mu$-2).
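To make the reweighing schemes concrete, here is a small NumPy sketch of Eqs. 5 and 6 (RNN-τ combined with the RNN-τµ-1 decay). It is our own simplification: τ and µ are fixed arrays below, whereas in the model they are learned with the other network parameters, we index τ by the temporal distance T + i − j between output and input, as in Figure 4, and the missing-value mask and distances to the last observed value are assumed to be precomputed.

```python
import numpy as np

def extended_scores(scores, i, T, tau, mu, missing, dist_to_last):
    """Reweigh raw attention scores e_{ij} following Eqs. 5-6 (RNN-tau-mu-1).

    scores       : (T,) raw scores e_{ij} from Eq. 3 for decoder step i
    tau          : (T + T' - 1,) period vector (learned in the paper, fixed here)
    mu           : scalar decay parameter of Eq. 6 (learned in the paper)
    missing      : (T,) boolean mask, True where x_j was missing
    dist_to_last : (T,) distance j - j_last to the last observed value
    """
    j = np.arange(1, T + 1)
    lag = (T + i) - j                        # temporal distance output i <- input j
    e_tau = scores * tau[lag - 1]            # Eq. 5: tau^T Delta(i, j) picks one coordinate
    decay = np.where(missing, np.exp(-mu * dist_to_last), 1.0)
    return e_tau * decay                     # Eq. 6

# Toy usage: T = 6 past steps, T' = 2, forecast step i = 1, one gap at positions 3-4.
T, T_out, i = 6, 2, 1
scores = np.ones(T)
tau = np.ones(T + T_out - 1)
missing = np.array([False, False, True, True, False, False])
dist = np.array([0, 0, 1, 2, 0, 0])
raw = extended_scores(scores, i, T, tau, 0.5, missing, dist)
weights = np.exp(raw - raw.max())
weights /= weights.sum()                     # normalization of Eq. 2
```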
Figure 4. RNN-τµ-1 mechanism (Eq. 6). This figure shows the step of predicting the output y_i at time i, given the input history vector x = [x_1, ..., x_j, ..., x_T]. For RNN-τµ-2, we replace exp(·) according to Eq. 7.

Table 1. Datasets.
Name                                Type          #Instances   History size   Forecast horizon   Sampling rate
Polish Electricity (PSE)            Univariate    46379        96             4                  2 hours
Polish Weather (PW)                 Multivariate  4595         548            7                  1 day
Air Quality (AQ)                    Multivariate  9471         192            6                  1 hour
Household Power Consumption (HPC)   Multivariate  17294        96             4                  2 hours
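The sampling rates listed in Table 1 are what the missing-value identification rule at the beginning of Section 3.2 relies on: the most common interval between consecutive points is taken as the sampling rate, and absent time stamps are flagged as missing. The following is a minimal sketch of that rule; it is our own illustration, under the assumption that observation time stamps are available as a numeric array.

```python
import numpy as np

def find_missing(timestamps):
    """Identify missing time stamps from the dominant sampling interval (Section 3.2)."""
    t = np.sort(np.asarray(timestamps))
    gaps = np.diff(t)
    values, counts = np.unique(gaps, return_counts=True)
    rate = values[np.argmax(counts)]                 # most common interval = sampling rate
    expected = np.arange(t[0], t[-1] + rate, rate)   # the full regular grid
    return np.setdiff1d(expected, t)                 # expected but unobserved time stamps

# Toy usage: an hourly series with observations missing at hours 3 and 4.
print(find_missing([0, 1, 2, 5, 6, 7, 8]))           # -> [3 4]
```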

4. Multivariate Extension

We now consider the multivariate extension of the above model. As each variable in a $K$-variate time series can have its own periods, we propose here to apply the RNN with the extended attention mechanism described above to each variable $k$, $1 \le k \le K$, of the time series. We thus construct, for each variable $k$, context vectors on the basis of the following equation:

$$ \alpha_{ij}^{(k)} = \frac{\exp\big(e_{ij}^{(k)}\big)}{\sum_{j'=1}^{T} \exp\big(e_{ij'}^{(k)}\big)} \qquad (8) $$

where $e_{ij}^{(k)}$ is given by Eqs. 3, 5, 6 or 7. The context vector for the $i$th output of the $k$th variable is then defined by $c_i^{(k)} = \sum_{j=1}^{T} \alpha_{ij}^{(k)} h_j^{(k)}$, where $h_j^{(k)}$ is the encoder hidden state at time stamp $j$ for the $k$th variable.

Lastly, to predict the output in the multivariate case while taking into account potential dependencies between the different variables, we concatenate the context vectors from the different variables into a single context vector $c_i$ that is used as input to the decoder, the rest of the decoder architecture being unchanged:

$$ c_i = [c_i^{(1)T} \cdots c_i^{(K)T}]^T $$

As each $c_i^{(k)}$ is of dimension $2n$ (that is, the dimension of the encoder hidden states), $c_i$ is of dimension $2Kn$.

We now turn to the experimental validation of the proposed RNN.

5. Experiments

In this section, we first describe the datasets used in this study and explain the experimental settings, prior to presenting the results we obtain with the proposed methods.

5.1. Datasets and settings

We retained four widely used and publicly available datasets (Datasets) that are described in Table 1. The values for the history size were set so that they encompass the known periods of the datasets. They can of course be tuned by cross-validation if one does not want to identify the potential periods by checking the autocorrelation curves. In general, the forecast horizon should reflect the nature of the data and the application one has in mind, with of course a trade-off between a long forecast horizon and prediction quality. For this purpose, the forecast horizons of PSE, PW, AQ and HPC are chosen as 8 hours, 1 week, 6 hours and 8 hours respectively. In Table 1, one can also find the sampling rate of each dataset.

Note that for PW, we selected the Warsaw metropolitan area, which covers only one recording station. For the univariate experiments, from PW we selected the max temperature series, from AQ we picked PT08.S4(NO2) and from HPC we selected the global active power variable.

To implement the RNN models discussed before, we used theano² (Theano Development Team, 2016) and Lasagne³ on a Linux system with 256GB of memory and a 32-core Intel Xeon @2.60GHz. All parameters are regularized and learned through stochastic backpropagation with an adaptive learning rate for each parameter (Kingma & Ba, 2014), the objective function being the MSE on the output. The hyperparameters we considered for the different RNNs are given in Table 2. To tune the hyperparameters, we performed a two-level grid search, where in the first level we tuned the mini-batch size by using 10 random combinations of the hyperparameters.

² http://deeplearning.net/software/theano/
³ https://lasagne.readthedocs.io
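Coming back to the multivariate extension of Section 4, the sketch below shows how per-variable attention weights (Eq. 8) produce per-variable contexts that are concatenated into the single context vector fed to the decoder. It is our own illustration; the shapes, variable names and random inputs are placeholders.

```python
import numpy as np

def softmax(scores):
    w = np.exp(scores - scores.max())
    return w / w.sum()

def multivariate_context(h_list, score_list):
    """Build the decoder context c_i for one output step (Section 4).

    h_list     : list of K arrays, each (T, 2n) -- encoder states per variable
    score_list : list of K arrays, each (T,)    -- scores e^(k)_{ij} per variable
    Returns c_i of dimension 2*K*n, the concatenation of the K per-variable contexts.
    """
    contexts = []
    for h_k, e_k in zip(h_list, score_list):
        alpha_k = softmax(e_k)           # Eq. 8, applied per variable
        contexts.append(alpha_k @ h_k)   # c_i^(k), dimension 2n
    return np.concatenate(contexts)      # c_i, dimension 2Kn

# Toy usage with K = 3 variables, T = 5 history steps, n = 4 encoder units.
rng = np.random.default_rng(1)
K, T, n = 3, 5, 4
h_list = [rng.normal(size=(T, 2 * n)) for _ in range(K)]
score_list = [rng.normal(size=T) for _ in range(K)]
c_i = multivariate_context(h_list, score_list)
print(c_i.shape)  # (24,) == 2 * K * n
```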
Within the set {1, 32, 64, 128, 256, 512}, we retained the mini-batch size of 64 as it performs best. Then, in the second level, we tuned all the hyperparameters shown in Table 2 and picked the top 4 settings according to the RNN model with the standard attention mechanism. We use these top 4 settings for all the RNN-based models we are evaluating, selecting the best setting for each model on the validation set. Note that in all those top 4 settings, the learning rate, regularization coefficient and regularization type are the same (shown in bold face in the upper part of Table 2). In other words, the top 4 settings differ in the number of attention units and the number of RNN units, all the other hyperparameters being the same. Lastly, the objective function for all RNN-based models is the MSE.

Table 2. Hyperparameters of the grid search.
Hyperparameter         Values
Learning rate          {0.01, 0.001, 0.0001}
Regularization type    {L1, L2}
Regularization coeff   {0.001, 0.0001, 0.00001}
# RNN units            {128, 256}
# Attention units      {128, 256}

In order to assess whether the proposed methods are robust to missing values, we introduced different levels of missing values in the datasets with the following strategy: for each dataset, we added 5%, 10%, 15% and 20% of missing values. As the datasets should contain both random missing values and gaps, we introduced half gaps and half random missing values. For instance, to add 10% of missing values, we add 5% of random missing values and 5% of gaps. Furthermore, as the gaps can be of different sizes, we let the length of gaps vary from 5 to 100, picked at random. Note that for HPC and AQ, which already contain missing values, we first introduced new missing values until we reach half of the desired percentage, then we introduce the gaps. In our experiments, the percentage of missing values introduced is shown as a number along with the dataset name; the original dataset is named original. Also note that the time series AQ already has 5% of missing values, so we skip the 5% level and start with 10%.

We compare the methods discussed before, namely standard RNN, RNN-A, RNN-τ and RNN-τµ-1/2, with gradient boosted trees (GBT) and random forests (RF), as these ensemble methods have been shown to provide state-of-the-art results on various forecasting problems and to outperform moving-average methods (e.g. (Kane et al., 2014a)). We apply grid search over the number of trees {500, 1000, 2000} and the number of features {$n_{features}$, $\sqrt{n_{features}}$, $\log_2(n_{features})$} for RF, and over the learning rate {0.01, 0.05, 0.1, 0.25} for GBT. As stated in Section 2, we use linear interpolation and padding for the datasets with missing values.

For evaluation, we use MSE and the symmetric mean absolute percentage error (SMAPE). MSE is defined as

$$ MSE = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2 $$

and gives the average L2 distance between the predicted value $\hat{y}_i$ and the true value $y_i$, where $N$ is the total number of predictions, including all horizons. SMAPE is defined as

$$ SMAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i - y_i|}{(|y_i| + |\hat{y}_i|)/2} $$

and calculates the L1 error divided by the average of the predicted and true values; SMAPE is thus a scaled L1 error.

Lastly, we divided the datasets by retaining the first 75% of each dataset for training-validation and the last 25% for testing. We applied 5-fold cross validation on the training-validation sets for RF and GBT. For RNN-based methods, we divided the training-validation sets into the first 75% for training (56.25% of the data) and the last 25% for validation (18.75% of the data).
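For reference, the two evaluation measures can be computed as follows; this is our own sketch of the formulas above, applied to the pooled set of N predictions over all horizons (note that SMAPE is left undefined by the formula when a true and predicted value are both exactly zero).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def smape(y_true, y_pred):
    """Symmetric MAPE: L1 error scaled by the average magnitude of true and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2))

# Toy usage over a flattened set of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.5, 3.0])
print(mse(y_true, y_pred), smape(y_true, y_pred))
```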
Figure 5. MSE and SMAPE metrics for interpolated data.
5.2. Overall results on univariate time series

Figure 5 and Figure 6 illustrate the univariate experiments with the interpolation and padding methods, respectively, using the MSE and the SMAPE as evaluation measures. We nevertheless focus our discussion on MSE as it is the metric based on which the problem is optimized.

As one can observe in Figure 5, for the AQ dataset, the three proposed models, i.e. RNN-τ and RNN-τµ-1/2, yield better MSE results than the other methods. The same conclusion can be drawn from the SMAPE figure for the same dataset. Furthermore, as conjectured in Section 3, RNN-τµ-2 tends to perform better than RNN-τµ-1 on these interpolated data. Figure 6 illustrates the MSE and SMAPE results for the padded data. One can observe in Figure 6 that, in general, RNN-τµ-1 performs better in terms of MSE, as anticipated in Section 3, and is very close to RNN-τµ-2 in terms of SMAPE. Note that, in Figure 6, we removed RF and GBT, as padding cannot be used with them (missing values are always interpolated for these methods). One can also observe that on the original dataset, with only 5% missing values, RNN-τ has the lowest MSE on both padded and interpolated data.

In Figure 5, similar results, both in terms of MSE and SMAPE, are observed on PSE, with again a slight advantage for RNN-τµ-2 over RNN-τµ-1, the difference with the other methods being important when the proportion of missing values increases, e.g. for PSE20. On the other hand, one can see in Figure 6 that when the data is padded and there is a considerable amount of missing values, i.e. PSE20, RNN-τµ-1 outperforms RNN-τµ-2, both in MSE and SMAPE, as conjectured in Section 3. The baselines, RF and GBT, do not behave well on AQ and PSE, and yield worse results than the RNN-based methods. Moreover, RNN-A yields better results than RNN for interpolated data when we have the maximum amount of missing values, both in terms of MSE and SMAPE.

The situation is more contrasted on HPC, where there is no clear difference between the methods for the interpolated data. In this case, the proposed methods are comparable to the standard attention mechanism. When there are many missing values (HPC15 and HPC20), RNN-τµ-2 is slightly better than the other methods. However, in the case of padded data, all three proposed methods outperform RNN and RNN-A, particularly when the amount of missing values increases.

Similarly to HPC, there is no clear difference for PW between the methods in the case of interpolated data. However, if we compare the RNN-based methods, we see that except for PW10, the proposed methods outperform RNN and RNN-A, particularly in the case of PW20. The situation is a bit better in the case of padded data; for instance, there is a clear improvement in MSE on PW10 and a slight improvement in SMAPE on PW15.

5.3. RNN-τ and periodicity

As mentioned before, RNN-τ aims at detecting the periodicity of the data, which is done via the τ vector explained in Section 3.1. Here, we show the effectiveness of this approach by illustrating how the attention weights of the original attention model behave compared to those of RNN-τ. Obviously, we expect RNN-τ to put higher weights on the inputs corresponding to periods. To illustrate this point, we chose the PSE dataset, which has two periods, weekly and daily (see Section 2).
Figure 6. MSE and SMAPE metrics for padded data.
To observe how RNN-τ behaves, we average the attention weights over all test examples and all forecast horizons. Figure 7 (left) shows the average attention weights on PSE with RNN-A and RNN-τ. As one can observe, RNN-A fails to capture both periods (1 day and 7 days), while RNN-τ can effectively spot those two periods by giving them higher weights. We also illustrate the learned τ values of PSE in Figure 7 (right), which shows that the highest values are assigned to the periods of the data, i.e. 7 days and 1 day.

Figure 7. Comparison of averaged attention weights of PSE-original using RNN-A and RNN-τ (left). The τ values learned with the RNN-τ method on PSE-original (right).

5.4. RNN-τµ and missing values

The RNN-τµ-1/2 models are designed to capture periodicity and to prevent the attention mechanism from assigning high weights to the missing instances. As noted before, RNN-τµ-2 with interpolated data is overall the best performing method. To show its effectiveness in handling missing values, we investigated the attention weights it assigned to missing instances. For each dataset, we averaged the attention weights of RNN-τµ-2 and compared them to those of RNN-A. Figure 8 illustrates the average attention weights and the confidence intervals for all datasets.

As one can observe, in 11 cases out of 15, RNN-τµ-2 provides attention weights which are lower on missing values than those of RNN-A. On average, the attention weight that RNN-τµ puts on missing values is 48% smaller than that of RNN-A, which shows how the µ vector can efficiently take care of missing values inside the data, regardless of their position, by not putting much weight on them.

Figure 8. Comparison of average weights of PSE-original using RNN-A and RNN-τµ.
5.5. Results on multivariate time series

As mentioned in Section 4, the proposed methods can be extended to multivariate time series. Here, we show how this extension can effectively outperform the state-of-the-art methods as well as the standard attention mechanism. To illustrate that, we pick all four global time series of HPC, namely global active power, global reactive power, voltage and global intensity. We predict the first one, which is important for e.g. monthly electricity bills. For AQ, we selected the four variables associated with real sensors (thus excluding nominal sensors), namely C6H6(GT), NO2(GT), CO(GT) and NOx(GT), and predict the first one, C6H6(GT), as it is the most important for NMHC-related air pollution (Vito et al., 2009). Note that this differs from the univariate case, in which we focused on the nominal sensor PT08.S4(NO2) to observe the effect of different amounts of missing values and gaps. Also note that all the time series selected from AQ have 18% of missing values; for HPC we consider 5, 10 and 15% of missing values.

Figure 9 (left) shows the results of our experiments on these multivariate series. We only show GBT here since it generally outperforms RF. As one can see in this figure, RNN-τ and RNN-τµ-1/2 provide better results than RNN in all cases. Compared to RNN-A, this is also true in three cases out of five, the methods being comparable in the last two cases. In all cases, RNN-τ outperforms GBT and, except for the AQ dataset, RNN-τµ-1/2 always provide better results than the other approaches.

Similarly to what happens in the univariate case, RNN-τµ-1/2 behave better when the amount of missing values increases, with again a slight preference for RNN-τµ-2, as conjectured in Section 3. Furthermore, relying on several variables for the prediction improves the results over the univariate case. As illustrated in Figure 9 (right), the results obtained with the multivariate extension are systematically and significantly better than the ones obtained in the univariate case (the same variable is predicted in both cases).

6. Related Work

The notion of stochasticity in time series modeling and prediction was introduced long ago (Yule, 1927). Since then, various stochastic models have been developed; notable among these are the autoregressive (AR) (Walker, 1931) and moving average (MA) (Slutzky, 1937) methods. These two models were combined in a more general and more effective framework, known as autoregressive moving average (ARMA), or autoregressive integrated moving average (ARIMA) when differencing is included in the modelling (Box & Jenkins, 1968). Vector ARIMA, or VARIMA (Tiao & Box, 1981), is the multivariate extension of the univariate ARIMA models, where each time series instance is represented using a vector.

Neural networks are regarded as a promising tool for time series prediction (Zhang, 2001; Crone et al., 2011) due to their data-driven and self-adaptive nature, their ability to approximate any continuous function and their inherent non-linearity. The idea of using neural networks for prediction dates back to 1964, when an adaptive linear network was used for weather forecasting (Hu, 1964). But the research was quite limited due to the lack of a training algorithm for general multilayer networks at the time. Since the introduction of the backpropagation algorithm (Rumelhart et al., 1988), there has been much development in the use of neural networks for forecasting.
Figure 9. Multivariate extension of the proposed methods on AQ and HPC (left). Comparison of multivariate and univariate time series prediction (right). Multivariate results are prefixed with m and univariate ones with u.

It was shown in a simulated study that neural networks can be used for modeling and forecasting nonlinear time series (Laepes & Farben, 1987). Several studies (Werbos, 1988; Sharda & Patil, 1990) have found that neural networks are able to outperform traditional stochastic models such as ARIMA or Box-Jenkins approaches (Box & Jenkins, 1968).

With the advent of SVMs, they have been used to formulate time series prediction as regression estimation using Vapnik's insensitive loss function and Huber's loss function (Müller et al., 1997). The authors showed that SVM-based prediction outperforms traditional neural network based methods. Since then, different versions of SVMs have been applied to time series prediction and many different SVM forecasting algorithms have been derived (Raicharoen et al., 2003; Fan et al., 2006). More recently, random forests (Breiman, 2001) have been used for time series prediction due to their accuracy and ease of execution. Random forest regression has been used for prediction in the fields of finance (Creamer & Freund, 2004) and bioinformatics (Kusiak et al., 2013), and has been shown to outperform ARIMA (Kane et al., 2014b).

Traditional neural networks allow only feedforward connections between the neurons of a layer and the neurons of the following layer. In contrast, recurrent neural networks (RNNs) (Jordan, 1986; Elman, 1990) allow both forward and feedback, or recurrent, connections between the neurons of different layers. Hence, RNNs are able to incorporate contextual information from past inputs, which makes them an attractive choice for predicting general sequence-to-sequence data, including time series (Graves, 2013). In the present study we use RNNs for modeling time series. Early work (Connor et al., 1991) has shown that RNNs (a) are a type of nonlinear autoregressive moving average (NARMA) model and (b) outperform feedforward networks and various types of linear statistical models on time series. Subsequently, various RNN-based models were developed for different time series, such as noisy foreign exchange rate prediction (Giles et al., 2001), chaotic time series prediction in communication engineering (Jaeger & Haas, 2004) or stock price prediction (Hsieh et al., 2011). A detailed review of the applications of RNNs, along with other deep learning based approaches, for different time series prediction tasks can be found in (Längkvist et al., 2014).

RNNs based on LSTMs (Hochreiter & Schmidhuber, 1997), which we consider in our study, alleviate the vanishing gradient problem of traditional RNNs. They have furthermore been shown to outperform traditional RNNs on various temporal tasks (Gers et al., 2001; 2002). More recently, they have been used for predicting the next frame in a video and for interpolating intermediate frames (Ranzato et al., 2014), for forecasting the future rainfall intensity in a region (Xingjian et al., 2015), or for modeling clinical data consisting of multivariate time series of observations (Lipton et al., 2015).

Adding an attention mechanism on the decoder side of an encoder-decoder RNN framework enables the network to focus on the interesting parts of the encoded sequence (Bahdanau et al., 2014). Attention mechanisms are used in many applications such as image description (Xu et al., 2015), image generation (Karol et al., 2015), phoneme recognition (Chorowski et al., 2015), heart failure prediction (Choi et al., 2016), as well as time series prediction (Riemer et al., 2016) and classification (Choi et al., 2016). Many studies also apply attention mechanisms to external memory (Graves et al., 2014; 2016).

The previous work of (Riemer et al., 2016) uses an attention mechanism to determine the importance of a factor among other factors that affect a time series. In contrast, we use an extended attention mechanism to model periods and emphasize the interesting parts of an input sequence with missing values. Our approach is applicable to both univariate and multivariate time series prediction.
However, as far as we know, no work has focused on analyzing the adequacy of RNNs (based or not on LSTMs) for time series, in particular w.r.t. their ability to model periods and handle missing values (for this latter case, some methods based on e.g. Gaussian processes have been proposed to handle irregularly sampled data and missing values (Ghassemi et al., 2015), but none related to RNNs). To our knowledge, this study is the first one to address this problem.

7. Conclusion

In this paper we studied the abilities of RNNs for modeling and forecasting time series. We used state-of-the-art RNNs based on a bidirectional LSTM encoder-decoder with an attention mechanism, and illustrated their deficiencies in capturing the periodicity of the data and in properly handling missing values. To alleviate this, we proposed two architectural modifications over the traditional attention mechanism: a first one to learn and exploit the periodicity in the temporal data (RNN-τ) and a second one to handle both random missing values and long gaps in the data (RNN-τµ-1/2). We further extended the entire framework to multivariate time series.

The experiments we conducted over multiple univariate and multivariate time series demonstrate the effectiveness of these modifications. RNN-τ and RNN-τµ-1/2 are able to perform not only better than traditional RNNs and RNNs with the original attention mechanism, but also better than two state-of-the-art baselines based on random forests and gradient boosted trees. Furthermore, by design, RNN-τ can capture several periods in a time series and make use of all of them to forecast new values. We provided an experimental illustration of this fact on a time series with both weekly and daily periods. Regarding missing values, the two variants we propose, RNN-τµ-1/2, are well adapted to the different techniques used in RNNs to handle missing values: padding and interpolation. We showed that these variants rely more on actual values and less on padded or interpolated values than the original RNNs, thus making the proposed RNN more robust to missing values.

References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bontempi, Gianluca, Taieb, Souhaib Ben, and Le Borgne, Yann-Aël. Machine learning strategies for time series forecasting. In Business Intelligence, pp. 62–77. Springer, 2013.

Box, George EP and Jenkins, Gwilym M. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.

Breiman, Leo. Random forests. Machine Learning, 45(1):5–32, 2001.

Chatfield, Chris. The Analysis of Time Series: An Introduction. CRC Press, 2016.

Choi, Edward, Bahadori, Mohammad Taha, Sun, Jimeng, Kulas, Joshua, Schuetz, Andy, and Stewart, Walter. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512, 2016.

Chorowski, Jan K, Bahdanau, Dzmitry, Serdyuk, Dmitriy, Cho, Kyunghyun, and Bengio, Yoshua. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585, 2015.

Connor, Jerome, Atlas, Les E, and Martin, Douglas R. Recurrent networks and NARMA modeling. In NIPS, pp. 301–308, 1991.

Creamer, Germán G and Freund, Yoav. Predicting performance and quantifying corporate governance risk for Latin American ADRs and banks. 2004.

Crone, Sven F, Hibon, Michele, and Nikolopoulos, Konstantinos. Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27(3):635–660, 2011.

Datasets. PSE: http://www.pse.pl. PW: https://globalweather.tamu.edu. AQ: http://archive.ics.uci.edu. HPC: http://archive.ics.uci.edu/.

De Gooijer, Jan G and Hyndman, Rob J. 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473, 2006.

Elman, Jeffrey L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Fan, Yugang, Li, Ping, and Song, Zhihuan. Dynamic least squares support vector machine. In The Sixth World Congress on Intelligent Control and Automation (WCICA 2006), volume 1, pp. 4886–4889. IEEE, 2006.
Gers, Felix A, Eck, Douglas, and Schmidhuber, Jürgen. Applying LSTM to time series predictable through time-window approaches. In International Conference on Artificial Neural Networks, pp. 669–676. Springer, 2001.

Gers, Felix A, Schraudolph, Nicol N, and Schmidhuber, Jürgen. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug):115–143, 2002.

Ghassemi, Marzyeh, Pimentel, Marco AF, Naumann, Tristan, Brennan, Thomas, Clifton, David A, Szolovits, Peter, and Feng, Mengling. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2015, pp. 446. NIH Public Access, 2015.

Giles, C Lee, Lawrence, Steve, and Tsoi, Ah Chung. Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1-2):161–183, 2001.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hsieh, Tsung-Jung, Hsiao, Hsiao-Fen, and Yeh, Wei-Chang. Forecasting stock markets using wavelet transforms and recurrent neural networks: An integrated system based on artificial bee colony algorithm. Applied Soft Computing, 11(2):2510–2525, 2011.

Hu, Michael Jen-Chao. Application of the adaline system to weather forecasting. 1964.

Jaeger, Herbert and Haas, Harald. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.

Jordan, Michael I. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, University of California, 1986.

Kane, Michael J, Price, Natalie, Scotch, Matthew, and Rabinowitz, Peter. Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics, 15(1):276, 2014a.

Kane, Michael J, Price, Natalie, Scotch, Matthew, and Rabinowitz, Peter. Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics, 15(1):276, 2014b.

Karol, G, Danihelka, I, Graves, A, Rezende, D, and Wierstra, D. DRAW: a recurrent neural network for image generation. ICML, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kusiak, Andrew, Verma, Anoop, and Wei, Xiupeng. A data-mining approach to predict influent quality. Environmental Monitoring and Assessment, 185(3):2197–2210, 2013.

Laepes, A and Farben, R. Nonlinear signal processing using neural networks: prediction and system modelling. Technical report, Los Alamos National Laboratory, Los Alamos, NM, 1987.

Längkvist, Martin, Karlsson, Lars, and Loutfi, Amy. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.

Lipton, Zachary C, Kale, David C, Elkan, Charles, and Wetzell, Randall. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.

Meijering, Erik. A chronology of interpolation: From ancient astronomy to modern signal and image processing. Proceedings of the IEEE, 90(3):319–342, 2002.

Müller, K-R, Smola, Alexander J, Rätsch, Gunnar, Schölkopf, Bernhard, Kohlmorgen, Jens, and Vapnik, Vladimir. Predicting time series with support vector machines. In International Conference on Artificial Neural Networks, pp. 999–1004. Springer, 1997.

Raicharoen, T, Lursinsap, C, and Sanguanbhoki, P. Application of critical support vector machine to time series prediction. In Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), volume 5, pp. 25–28, 2003.

Ranzato, MarcAurelio, Szlam, Arthur, Bruna, Joan, Mathieu, Michael, Collobert, Ronan, and Chopra, Sumit. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
Riemer, Matthew, Vempaty, Aditya, Calmon, Flavio P, Heath III, Fenno F, Hull, Richard, and Khabiri, Elham. Correcting forecasts with multifactor neural attention. In Proceedings of The 33rd International Conference on Machine Learning, pp. 3010–3019, 2016.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

Schuster, Mike and Paliwal, Kuldip K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

Sharda, Ramesh and Patil, R. Neural networks as forecasting experts: an empirical test. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pp. 491–494. IEEE, 1990.

Slutzky, Eugen. The summation of random causes as the source of cyclic processes. Econometrica: Journal of the Econometric Society, pp. 105–146, 1937.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Tiao, George C and Box, George EP. Modeling multiple time series with applications. Journal of the American Statistical Association, 76(376):802–816, 1981.

Vito, Saverio De, Piga, Marco, Martinotto, Luca, and Francia, Girolamo Di. CO, NO2 and NOx urban pollution monitoring with on-field calibrated electronic nose by automatic bayesian regularization. Sensors and Actuators B: Chemical, 143(1):182–191, 2009. ISSN 0925-4005. doi: 10.1016/j.snb.2009.08.041. URL http://www.sciencedirect.com/science/article/pii/S092540050900673X.

Walker, Gilbert. On periodicity in series of related terms. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 131(818):518–532, 1931.

Werbos, Paul J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.

Xingjian, SHI, Chen, Zhourong, Wang, Hao, Yeung, Dit-Yan, Wong, Wai-Kin, and Woo, Wang-chun. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810, 2015.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron C, Salakhutdinov, Ruslan, Zemel, Richard S, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pp. 77–81, 2015.

Yule, G Udny. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 226:267–298, 1927.

Zhang, Guoqiang Peter. An investigation of neural networks for linear time-series forecasting. Computers & Operations Research, 28(12):1183–1202, 2001.
