Multi-Step-Ahead Prediction With Neural Networks
R. Boné, M. Crucianu
Publication of the RFAI team, 9èmes rencontres internationales « Approches Connexionnistes en Sciences Économiques et en Gestion », 21-22 November 2002, Boulogne-sur-Mer, France, pp. 97-106.
Abstract
We review existing approaches to using neural networks for solving multi-step-ahead prediction problems. A few experiments allow us to further explore the relationship between the ability to learn longer-range dependencies and performance in multi-step-ahead prediction. Finally, we focus on characteristics of various multi-step-ahead prediction problems that encourage us to prefer one method over another.
1 Introduction
While reliable multi-step-ahead (MS) time series prediction has many important
applications and is often the intended outcome, published literature usually considers
single-step-ahead (SS) prediction. The main reason for this is the increased difficulty of the
problems requiring MS prediction and the fact that the results obtained by simple
extensions of techniques developed for SS prediction are often disappointing. Moreover, while many different techniques perform rather similarly on SS prediction problems, significant differences show up when extensions of these techniques are employed on MS problems.
The purpose of this short review of existing work concerning the use of neural networks for
MS prediction is to investigate the relationship between modeling approaches and
prediction problems.
In the following section, we present the main existing approaches to using neural networks for coping with MS prediction problems. Section 3 provides some additional
experimental results regarding an expected relation between the ability to learn longer-
range dependencies and performance in MS prediction. The discussion in section 4 is an
attempt to identify characteristics of various MS prediction problems that encourage us to
prefer one method over another.
2 Modeling approaches
Before taking a closer look at the existing approaches we must introduce some notation.
Consider x(t), for 0 ≤ t ≤ TD, the time series data one can employ for building a model. In most cases, the available data actually consists of samples of x(t) obtained with a time interval of τ. In multi-step-ahead prediction, given {x(t), x(t − τ), …, x(t − nτ), …} one is looking for a good estimate x̂(t + hτ) of x(t + hτ), h being the number of steps ahead.
The most common approach in dealing with a prediction problem can be traced back to
[Yule, 1927] and consists in using a fixed number M of past values (a fixed-length time
window sliding over the time series) when building the prediction:
x(t) = [x(t), x(t − τ), …, x(t − (M − 1)τ)]   (1)
x̂(t + τ) = f(x(t))   (2)
Most of the current work on single-step-ahead prediction relies on a result in [Takens,
1980] showing that under several assumptions (among which the absence of noise) it is
possible to obtain a perfect estimate of x(t + τ ) according to (2) if M ≥ 2d + 1 , where d is
the dimension of the stationary attractor generating the time series. In this approach, all the
memory of the past is preserved in the sliding time window.
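As a concrete illustration of equations (1) and (2), the sketch below builds fixed-length time-window inputs and fits a small MLP as the one-step predictor f. The toy series, the window length M and the network size are illustrative assumptions, not settings taken from the works cited above.

# Minimal sketch of the fixed-time-window approach of equations (1)-(2).
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_windows(series, M):
    """Inputs x(t) = [x(t), ..., x(t - (M-1))] and targets x(t+1)."""
    X = np.array([series[t - M + 1:t + 1][::-1] for t in range(M - 1, len(series) - 1)])
    y = np.array([series[t + 1] for t in range(M - 1, len(series) - 1)])
    return X, y

# Toy series: a noisy sine wave standing in for real data.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.05 * rng.standard_normal(1000)

M = 6
X, y = make_windows(series, M)
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X[:800], y[:800])                       # learn the map f of equation (2)
print("test MSE:", np.mean((model.predict(X[800:]) - y[800:]) ** 2))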
An alternative solution is to introduce a memory of the past in the model itself and only
keep a time window of small length (usually M = 1 ). Time series prediction with recurrent
neural networks usually corresponds to such a solution. Memory of the past is maintained
in the internal state of the model, s(t), which evolves according to (for M = 1):
s(t + τ) = g(s(t), x(t))   (3)
The estimate of x(t + τ) is provided by
x̂(t + τ) = h(s(t))   (4)
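The state-space formulation of equations (3) and (4) can be illustrated by the following minimal sketch, in which the state update g is a tanh recurrence with randomly chosen, untrained weights; the point is only to show where the memory of the past resides, not to produce a useful prediction.

# Minimal sketch of equations (3)-(4): the memory of the past lives in the state s(t).
import numpy as np

rng = np.random.default_rng(1)
n_state = 5
W_in = rng.standard_normal((n_state, 1)) * 0.5         # input -> state weights
W_rec = rng.standard_normal((n_state, n_state)) * 0.3  # state -> state weights
w_out = rng.standard_normal(n_state) * 0.5             # state -> output weights (the map h)

def step(s, x):
    """g of equation (3): next state from the current state and the current observation."""
    return np.tanh(W_rec @ s + W_in @ np.array([x]))

s = np.zeros(n_state)
series = np.sin(np.linspace(0, 4 * np.pi, 200))
for x in series:
    s = step(s, x)          # the time window has length M = 1: only x(t) is fed in
x_hat = w_out @ s           # h of equation (4): estimate of the next value
print("(untrained) prediction:", x_hat)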
To perform the prediction, the quantizer identifies the segment that is closest to the input vector x(t), ignoring the first coordinate, which corresponds to γx(t + τ) (now unknown). The local model associated with this segment is selected and produces the prediction x̂(t + τ).
Neural network models are often employed in the vector quantization phase. Self-organizing feature maps (SOFMs), which have a constrained topology, were the first to be employed [Walter et al., 1990]; see also [Vesanto, 1997]. However, the predefined topology of SOFMs may impose unnecessary constraints on the quantization process, so some work used instead free-topology networks such as neural-gas networks [Martinetz et al., 1993] or dynamic cell structures [Chudy and Farkas, 1998]. The upper bound on the number of
different segments must be carefully chosen. Indeed, the number of segments determines
the number of training examples in each segment, which imposes an upper bound on the
number of parameters of the corresponding local model.
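The sketch below illustrates this local scheme. The cited works rely on SOFMs or neural-gas networks for the quantization; plain k-means and per-segment linear models are used here only as stand-ins, and the toy series, window length and number of segments are illustrative assumptions.

# Local approach: quantize the time-window vectors, then fit one simple linear
# (AR-like) model per segment and use the model of the closest segment to predict.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def make_windows(series, M):
    X = np.array([series[t - M + 1:t + 1][::-1] for t in range(M - 1, len(series) - 1)])
    y = np.array([series[t + 1] for t in range(M - 1, len(series) - 1)])
    return X, y

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1500)) + 0.05 * rng.standard_normal(1500)
X, y = make_windows(series, M=6)

n_segments = 8                                    # must be chosen carefully (see text)
quantizer = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit(X)
local_models = {}
for k in range(n_segments):
    mask = quantizer.labels_ == k                 # training examples falling in segment k
    local_models[k] = LinearRegression().fit(X[mask], y[mask])

def predict(x_window):
    k = quantizer.predict(x_window.reshape(1, -1))[0]          # closest segment
    return local_models[k].predict(x_window.reshape(1, -1))[0]

print("one-step prediction:", predict(X[-1]))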
The fact that local methods were successful in many MS prediction problems [Vesanto,
1997], [Chudy and Farkas, 1998], [McNames et al., 1999], [McNames, 2000], [Gers et al.,
2001] shows that a robust quantization of the set of x vectors can often be performed and
that even simple AR local models can provide very good results. Some authors also
mention the difficulties encountered by the local approach on a few other problems
[Vesanto, 1997], [Gers et al., 2001].
We must note that local approaches rely heavily on several assumptions. If only a
limited amount of data is available for training, then the relevant dimension of the space
actually spanned by the x vectors encountered must be significantly lower than the
dimension of the time window. Under the same circumstances (rather common as far as
applications are concerned), the time series should exhibit only a few different behaviors (a
few segments should be enough) and simple local models should prove satisfactory. If any
of these assumptions is false, the number of parameters becomes huge (and so does the
amount of data required for reliable training) either because too many segments are
required or because each local model has too many parameters.
Global approaches follow either equations (1) and (2), or equations (3) and (4). A global
approach attempts to build a single complex model for the entire range of behaviors
identified in the time series. It is usually considered that such a model can be more
parsimonious than a set of local models. However, in order to be able to model an entire
range of behaviors, powerful models have to be used; such models appear to be rather
difficult to train.
Given their universal approximation properties, neural networks such as multi-layer
perceptrons (MLPs) or recurrent networks (RNs) are good candidate models for the global
approaches. Among the many neural network architectures employed for time series
prediction we can mention MLPs with a time window in the input [Weigend et al., 1990],
MLPs with finite impulse response (FIR) connections (equivalent to time windows) both
from the input to the hidden layer and from the hidden layer to the output [Wan, 1993],
recurrent networks obtained by providing MLPs with a feedback from the output
[Czernichow, 1996], simple recurrent networks [Suykens and Vandewalle, 1995], recurrent
networks with FIR connections [El Hihi and Bengio, 1996], [Lin et al., 1996] and recurrent
networks with both internal loops and feedback from the output [Parlos et al., 2000].
One can notice that many authors included time-delayed connections at various places in the feed-forward or recurrent networks, providing an explicit memory of the past.
These additions appear to be (implicitly and sometimes explicitly) justified by the fact that
they promote learning of longer-range dependencies in the data, which is supposed to be
generally helpful for SS prediction and in particular for MS prediction problems. However,
the relationship between learning longer-range dependencies and performance in MS
prediction has been neither theoretically elucidated nor experimentally explored.
A general drawback of the global approach is the difficulty of learning a single model
for the entire range of behaviors of the time series. In SS prediction the resulting models
are usually unreliable for only a small part of the entire range of behaviors, but these
difficult data points produce significantly negative effects in MS prediction schemes. This
observation explains why valuable results were obtained by applying boosting methods to
time-dependent processes [Avnimelech and Intrator, 1999] or by using support vector
machines [Suykens and Vandewalle, 2000].
Recently, a significantly different method that also follows the global approach was put
forward in [Jaeger, 2001]. A huge recurrent neural network is randomly generated so that its units present a rich set of behaviors in response to the input sequence; then, by training only the output weights, these behaviors are combined in order to obtain the desired prediction.
Among the very promising results obtained by this method, we can mention MS prediction
with a very long time horizon for a chaotic time series.
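A minimal numerical sketch of this idea follows: a fixed random recurrent "reservoir" is driven by the series and only a linear readout is fitted by least squares. The network size, scalings and toy series are illustrative assumptions, far from the large networks and careful setup of [Jaeger, 2001].

# Echo-state-style sketch: fixed random recurrent weights, trained linear readout.
import numpy as np

rng = np.random.default_rng(0)
n_res = 100
W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))
W_res = rng.standard_normal((n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # keep the spectral radius below 1

series = np.sin(np.linspace(0, 30 * np.pi, 1500))          # toy input sequence
states = np.zeros((len(series), n_res))
s = np.zeros(n_res)
for t in range(1, len(series)):
    s = np.tanh(W_res @ s + W_in[:, 0] * series[t - 1])    # reservoir update
    states[t] = s                                          # state after seeing x(0..t-1)

# Fit the readout to predict the next value, discarding an initial washout period.
washout = 100
A, y = states[washout:], series[washout:]
w_out = np.linalg.lstsq(A, y, rcond=None)[0]
print("training NMSE:", np.mean((A @ w_out - y) ** 2) / np.var(y))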
the iterated method and at least as well as the corrected iterated method. However, this
result relies on several assumptions, among which the ability of the model to learn the
different target functions (the one for SS prediction and the one for direct MS prediction)
perfectly. This assumption can hardly be satisfied by a local approach, because individual
local models are usually simple and the (direct) MS prediction function may be complex.
The comparison then only holds for global approaches. But even in a global approach the
learning algorithm welcomes some form of “help”. For instance, improved results were
obtained by using recurrent networks and training them with progressively increasing
prediction horizons [Suykens and Vandewalle, 1995] or including time-delayed
connections from the output of the network to its input [Parlos et al., 2000].
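The iterated and direct strategies referred to above can be contrasted with the following sketch, in which ridge regressors on a fixed time window stand in for the neural networks; the window length, horizon and toy data are illustrative assumptions.

# Iterated vs. direct multi-step-ahead prediction with a time-window model.
import numpy as np
from sklearn.linear_model import Ridge

def make_windows(series, M, h):
    """Inputs x(t) = [x(t), ..., x(t - (M-1))] and targets x(t+h)."""
    X = np.array([series[t - M + 1:t + 1][::-1] for t in range(M - 1, len(series) - h)])
    y = np.array([series[t + h] for t in range(M - 1, len(series) - h)])
    return X, y

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1200)) + 0.05 * rng.standard_normal(1200)
M, h = 6, 5

# Iterated method: one single-step model, applied h times on its own outputs.
X1, y1 = make_windows(series, M, h=1)
one_step = Ridge(alpha=1e-3).fit(X1, y1)
window = list(series[-M:][::-1])                  # most recent value first
for _ in range(h):
    nxt = one_step.predict(np.array(window).reshape(1, -1))[0]
    window = [nxt] + window[:-1]                  # feed the prediction back as an input
print("iterated prediction at t+%d:" % h, window[0])

# Direct method: a separate model trained to predict h steps ahead in one shot.
Xh, yh = make_windows(series, M, h=h)
direct = Ridge(alpha=1e-3).fit(Xh, yh)
print("direct prediction at t+%d:" % h, direct.predict(series[-M:][::-1].reshape(1, -1))[0])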
A method which is related to the direct method was suggested in [Duhoux et al., 2001]
and consists of chaining several networks. For a time horizon of k, a first network learns to
predict at t+1, then a second network is trained to predict at t+2 by using as a
supplementary input the prediction provided by the first network, and so on, until the
desired time horizon is reached. This method was experimentally found to provide better
predictions than the iterated method in a global approach, but the total number of
parameters is proportional to the longest prediction horizon. Owing to the incremental
development of models for progressively longer time horizons, for a local approach this
method should probably be preferred to the direct method.
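The chained scheme can be sketched as follows, with ridge regressors standing in for the networks of [Duhoux et al., 2001]; the window length, horizon and toy data are illustrative assumptions.

# Chaining: the model for horizon k+1 receives, as a supplementary input,
# the prediction produced by the model for horizon k.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1200)) + 0.05 * rng.standard_normal(1200)
M, K = 6, 5                                       # window length, longest horizon

X = np.array([series[t - M + 1:t + 1][::-1] for t in range(M - 1, len(series) - K)])
targets = [np.array([series[t + k] for t in range(M - 1, len(series) - K)])
           for k in range(1, K + 1)]

models, inputs = [], X
for k in range(K):
    m = Ridge(alpha=1e-3).fit(inputs, targets[k])           # model for horizon k+1
    models.append(m)
    # The next model in the chain also sees the predictions produced by this one.
    inputs = np.hstack([inputs, m.predict(inputs).reshape(-1, 1)])

# Prediction at t+K for the last available window.
x = series[-M:][::-1].reshape(1, -1)
for m in models:
    pred = m.predict(x)
    x = np.hstack([x, pred.reshape(1, -1)])
print("chained prediction at t+%d:" % K, pred[0])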
3 Illustrative experiments
In order to better understand the important issue of the relationship between learning
longer-range dependencies and performance in MS prediction, we decided to perform
specific experiments. These experiments contribute to the discussion in section 4. We
tested on MS prediction problems two constructive algorithms that were originally
developed for learning medium or long-range dependencies in time series [Boné et al.,
2000], [Boné et al., in press]. These constructive algorithms perform a selective addition of
time-delayed connections to recurrent networks and were shown to produce parsimonious
models (few parameters, linear prior on the longer-range dependencies) having good results
on SS prediction problems.
Such results, together with the fact that a longer-range memory embodied in the time
delays should allow a network to better retain the past information when predicting at a
longer horizon, let us anticipate improved results on MS prediction problems. Some more
support for this claim is provided by the experimental evidence in [Parlos et al., 2000]
concerning the successful use of time delays in recurrent networks for MS prediction. We
expected the constructive algorithms to identify the most useful delays for a given problem
and network architecture, instead of using an entire range of delays.
the time-delayed connections, the signal no longer crosses nonlinear activation functions between successive time steps.
Instead of systematically adding FIR connections to a recurrent network, each
connection encompassing a whole range of delays, we opted for a constructive approach:
start with an RN having no time-delayed connections, then selectively add a few such
connections. The two algorithms we present in the following allow us to choose the
location and the delay associated with a time-delayed connection which is added to an RN.
The first heuristic [Boné et al., in press] for defining the relevance of a candidate connection is closely tied to BPTT-like underlying learning algorithms [Rumelhart et al., 1986]. This method makes use of measures computed during gradient descent, and its order of complexity is the same as that of BPTT. For this first heuristic, a connection is considered useful if it can make an important contribution to the computation of the gradient of the error with respect to the weights. The resulting algorithm is called Constructive Back-Propagation Through Time (CBPTT).
The second heuristic [Boné et al., 2000] is a sort of breadth-first search. It explores the
alternatives for the location and the delay associated with a new connection by adding that
connection and performing a few iterations of the underlying learning algorithm. The
connection that produces the largest increase in performance during these few iterations is
then added, and learning continues until error increases on the stop set. Another
exploratory stage begins for the addition of a new connection. The breadth-first heuristic
does not need any gradient information and can be applied in combination with learning
algorithms which are not based on the gradient. However, in the experiments reported here
we employed BPTT as the underlying learning algorithm and for this reason we called the
resulting constructive algorithm Exploratory Back-Propagation Through Time (EBPTT).
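The exploratory selection idea can be illustrated with a much-simplified stand-in: the recurrent network is replaced by a linear autoregressive model and each "candidate time-delayed connection" by a candidate extra input lag, quickly trained and scored on a stop set. This only mimics the breadth-first search over candidates, not EBPTT itself, which adds delayed connections inside a recurrent network trained by BPTT.

# Breadth-first selection of useful delays (simplified stand-in for the exploratory heuristic).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1500)) + 0.05 * rng.standard_normal(1500)

def dataset(series, lags):
    """Inputs built from the chosen lags, target is the next value."""
    start = max(lags)
    X = np.array([[series[t - d] for d in lags] for t in range(start, len(series) - 1)])
    y = np.array([series[t + 1] for t in range(start, len(series) - 1)])
    return X, y

def stop_set_error(lags):
    X, y = dataset(series, lags)
    n = int(0.8 * len(X))
    model = Ridge(alpha=1e-3).fit(X[:n], y[:n])              # quick training
    return np.mean((model.predict(X[n:]) - y[n:]) ** 2)      # score on the stop set

lags, candidates = [0], list(range(1, 20))                   # start from the current value only
for _ in range(3):                                           # add three "connections"
    scores = {d: stop_set_error(lags + [d]) for d in candidates if d not in lags}
    best = min(scores, key=scores.get)                       # keep the most useful delay
    lags.append(best)
    print("added delay %d, stop-set MSE %.4g" % (best, scores[best]))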
[Figure 1: yearly sunspot averages from 1700 to 1975, with the Test 1 and Test 2 intervals marked.]
[Two panels, Test 1 and Test 2: NMSE of BPTT, CBPTT and EBPTT for prediction horizons of 1 to 6 steps ahead.]
FIGURE 2. SUNSPOTS TIME SERIES: MEAN NMSE ON THE TEST SETS AS A FUNCTION OF THE
PREDICTION HORIZON.
One can see that for Test 2 performance degrades much faster than for Test 1. It is commonly accepted that the behavior on Test 2 cannot be explained (by some longer-range phenomenon) given the available history. The short-range information available in SS prediction lets the network evaluate the rate of change in the number of sunspots. Such information is missing in MS prediction, so nothing compensates for the lack of knowledge concerning the longer-range phenomenon that could explain the behavior of Test 2.
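The curves in Figure 2 (and in Figures 4 and 5 below) report the normalized mean squared error (NMSE). A common definition, assumed here, divides the mean squared error by the variance of the target series:

# Normalized mean squared error (assumption: normalization by the variance of the targets).
import numpy as np

def nmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

print(nmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))      # small usage example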
The Mackey-Glass benchmarks [Mackey and Glass, 1977] are well-known for the
evaluation of SS and MS prediction methods. The time series are generated by the
following nonlinear differential equation:
dx/dt = −0.1 x(t) + 0.2 x(t − θ) / (1 + x^10(t − θ))   (5)
The behavior is chaotic for θ > 16.8. The results in the literature usually concern θ = 17
(known as MG17, Figure 3) and θ = 30 (MG30). The data is generated and then sampled
with a period of 6, according to the common practice (see e.g. [Wan, 1993]). We use the
first 500 values for the learning set and the next 100 values for the test set.
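A series following equation (5) can be generated numerically, for instance by simple Euler integration and sub-sampling with a period of 6 as described above; the integration step and the constant initial history in the sketch below are illustrative assumptions.

# Generate a Mackey-Glass series by Euler integration of equation (5), then sub-sample.
import numpy as np

def mackey_glass(n_samples, theta=17, steps_per_sample=60, sample_period=6, x0=1.2):
    dt = sample_period / steps_per_sample            # integration step (0.1 here)
    n_steps = n_samples * steps_per_sample
    delay = int(round(theta / dt))
    x = np.full(n_steps + delay + 1, x0)             # constant initial history
    for t in range(delay, n_steps + delay):
        x_del = x[t - delay]
        x[t + 1] = x[t] + dt * (-0.1 * x[t] + 0.2 * x_del / (1.0 + x_del ** 10))
    return x[delay::steps_per_sample][:n_samples]    # keep one value every 6 time units

mg17 = mackey_glass(600, theta=17)                   # MG17: 500 training + 100 test values
train, test = mg17[:500], mg17[500:]
print(len(train), len(test), float(train.mean()))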
[Figure 3: the MG17 time series (first 600 sampled values, ranging roughly from 0.4 to 1.4).]
[NMSE up to about 0.2, for prediction horizons of 1 to 14 steps ahead.]
FIGURE 4. MG17 TIME SERIES: MEAN NMSE ON THE TEST SET AS A FUNCTION OF THE
PREDICTION HORIZON.
[Curves for BPTT, CBPTT and EBPTT: NMSE up to about 1.0, for prediction horizons of 1 to 14 steps ahead.]
FIGURE 5. MG30 TIME SERIES: MEAN NMSE ON THE TEST SET AS A FUNCTION OF THE
PREDICTION HORIZON.
4 Discussion
When the dimension of the attractor generating the time series is low, good local models can be obtained and a local approach should be preferred. When the dimension of
the attractor increases, either as a consequence of a noisy system or because explicit long-
range dependencies are present, the performance of local approaches is likely to degrade
faster than the performance of global approaches. Note that a chaotic attractor displays a
long-range behavior that can often be explained by short-range dependencies only.
The negative effect on local models of the presence of long-range dependencies is not
well understood; for example, in [Vesanto, 1997] the lack of power of the local models is
blamed for the lower performance obtained on MG30. A simple alternative explanation is that the time window may not be long enough. Since any amount of
training data can be generated for MG30, an experiment with a longer time window can
easily be performed and should help in clarifying this issue.
The frequent expectation that improved learning of longer-range dependencies in the
data necessarily implies better performance in MS prediction appears to be unfounded.
While an advantage in MS prediction can indeed be obtained when long-range
dependencies are strong, performance is poor when such dependencies have a low
importance, as shown by our results on MG17. Similar poor results were obtained on
MG17 with LSTM [Gers et al., 2001], an algorithm that is also dedicated to long-range
dependencies.
The general weakness of global models as compared to local models on time series like
MG17 can easily be explained, for the iterated approach, by the accumulation of errors on
the most difficult parts of the range of behaviors. For direct MS prediction the difficulty of
a problem appears to increase significantly, but a more comprehensive explanation has yet
to be found.
Further comparisons between global and local approaches need to be performed on MS
prediction problems where exogenous variables are present. We expect global approaches
to scale better to such situations.
5 Conclusion
The papers concerning MS prediction we cited here are only a part of the recent work on
the subject, but the methods we mentioned represent well the existing approaches. On the
problems that are most frequently encountered in the published literature, comparisons favor the local approach. However, the promising global methods that were put forward recently deserve further evaluation. Combinations of global and local
approaches should also be explored, such as using a smaller number of more complex
models instead of many simple models in a local approach.
We attempted to relate features of MS prediction problems to the characteristics of
existing modeling methods. To pursue this effort, more data should be collected concerning
MS problems, either from prior knowledge regarding the problems or from the
experimental results obtained by different prediction methods on these problems.
References
[1] A.F. Atiya, S.M. El-Shoura, S.I. Shaheen, M.S. El-Sherif (1999) A Comparison Between Neural-Network Forecasting Techniques - Case Study: River Flow Forecasting, IEEE Transactions on Neural Networks, Vol. 10, pp. 402-409.
[2] R. Avnimelech, N. Intrator (1999) Boosting Regression Estimators, Neural
Computation, vol. 11, pp. 491-513.
[3] R. Boné, M. Crucianu, J.-P. Asselin de Beauville (2000) Two Constructive
Algorithms for Improved Time-series Processing with Recurrent Neural Networks,
IEEE International Workshop on Neural Networks for Signal Processing, pp. 55-64,
Sydney, Australia.
[4] R. Boné, M. Crucianu, J.-P. Asselin de Beauville (2002) Learning Long-Term
Dependencies by the Selective Addition of Time-Delayed Connections to Recurrent
Neural Networks, Neurocomputing, in press.
[5] L. Chudy, I. Farkas (1998) Prediction of Chaotic Time-Series Using Dynamic Cell
Structures and Local Linear Models, Neural Network World, 8(5), pp. 481-489.
[6] T. Czernichow (1996) Apport des réseaux récurrents à la prévision de séries
temporelles, application à la prévision de consommation d'électricité, Thèse de
Doctorat, Université Paris 6, Paris.
[7] M. Duhoux, J. Suykens, B. de Moor, J. Vandewalle (2001) Improved Long-term
Temperature Prediction by Chaining of Neural Networks, International Journal of
Neural Systems, World Scientific Publishing Company, Vol. 11, pp. 1-10.
[8] S. El Hihi, Y. Bengio (1996) Hierarchical Recurrent Neural Networks for Long-Term
Dependencies, in M. Mozer, D. S. Touretzky and M. Perrone, Eds., Advances in
Neural Information Processing Systems, Cambridge, MA, MIT Press. VIII, pp. 493-
499.
[9] F. Gers, D. Eck, J. Schmidhuber (2001) Applying LSTM to Time Series Predictable
Through Time-Window Approaches, International Conference on Artificial Neural
Networks, Vienna, Austria, pp. 669-675.
[10] H. Jaeger (2001) The “Echo State” Approach to Analyzing and Training Recurrent
Neural Networks, GMD Report 148, GMD, Germany.
[11] T. Lin, B.G. Horne, P. Tino, C.L. Giles (1996) Learning Long-Term Dependencies in
NARX Recurrent Neural Networks, IEEE Transactions on Neural Networks 7(6), pp.
1329-1335.
[12] M. Mackey, L. Glass (1977) Oscillations and Chaos in Physiological Control Systems, Science, Vol. 197, pp. 287-289.
[13] T. Martinetz, S.G. Berkovich, K. Schulten (1993) Neural-gas Network for Vector
Quantization and Its Application to Time-series Prediction, IEEE Transactions on
Neural Networks, 4(4), pp. 558-569.
[14] J. McNames, J.A.K. Suykens, J. Vandewalle (1999) Winning Entry of the K.U.Leuven
Time-Series Prediction Competition, International Journal of Bifurcation and Chaos,
vol. 9, no. 8, pp. 1485-1500.
[15] J. McNames (2000) Local Modeling Optimization for Time Series Prediction, 8th
European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 305-310.
[16] A.G. Parlos, O.T. Rais, A.F. Atiya (2000) Multi-step-ahead Prediction Using Dynamic
Recurrent Neural Networks, Neural Networks, Vol. 13, pp. 765-786.
[17] D.E. Rumelhart, G.E. Hinton, R.J. Williams (1986) Learning Internal Representations
by Error Propagation, in D. E. Rumelhart, J. McClelland (Eds.), Parallel Distributed
Processing: Explorations in the Microstructure of Cognition Vol. 1, pp. 318-362,
Cambridge, MA: MIT Press.
[18] J.A.K. Suykens, J. Vandewalle (1995) Learning a Simple Recurrent Neural State
Space Model to Behave Like Chua’s Double Scroll, IEEE Transactions on Circuits
and Systems-I, Vol. 42, pp. 499-502.
[19] J.A.K. Suykens, J. Vandewalle (2000) Recurrent Least Squares Support Vector Machines, IEEE Transactions on Circuits and Systems-I, Vol. 47, no. 7, pp. 1109-1114.
[20] F. Takens (1980) Detecting Strange Attractors in Fluid Turbulence, in D. A. Rand and L. S. Young, Eds., Dynamical Systems and Turbulence, Springer-Verlag, New York, pp.
366-381.
[21] J. Vesanto (1997) Using the SOM and Local Models in Time-Series Prediction,
Proceedings of the Workshop on Self-Organizing Maps (WSOM’97), Espoo, Finland,
pp. 209-214.
[22] J. Walter, H. Ritter, K.J. Schulten (1990) Non-linear Prediction with Self-organizing
Feature Maps, in Proceedings of the International Joint Conference on Neural
Networks (IJCNN’90), vol. 2, pp. 589-594.
[23] E.A. Wan (1993) Finite Impulse Response Neural Networks with Applications in
Time Series Prediction, Ph.D. Thesis, Stanford University, USA.
[24] A.S. Weigend, B.A. Huberman, D.E. Rumelhart (1990) Predicting the Future: A
Connectionist Approach, International Journal of Neural Systems, 1(3), pp. 193-209.
[25] G.U. Yule (1927) On a Method of Investigating Periodicity in Disturbed Series with
Special Reference to Wolfer's Sunspot Numbers, Philos. Trans. Roy. Soc. London Ser.
A 226, pp. 267-298.