Image-Based Time Series Forecasting

A.-A. Semenoglou, E. Spiliotis, V. Assimakopoulos

Neural Networks 157 (2023) 39–53
Journal homepage: www.elsevier.com/locate/neunet
https://doi.org/10.1016/j.neunet.2022.10.006
0893-6080/© 2022 Elsevier Ltd. All rights reserved.

Article history: Received 22 September 2021; Received in revised form 4 August 2022; Accepted 6 October 2022; Available online 15 October 2022.

Keywords: Time series; Forecasting; Images; Deep learning; Convolutional neural networks; M competitions

Abstract: Inspired by the successful use of deep learning in computer vision, in this paper we introduce ForCNN, a novel deep learning method for univariate time series forecasting that mixes convolutional and dense layers in a single neural network. Instead of using conventional, numeric representations of time series data as input to the network, the proposed method considers visual representations of the data in the form of images to directly produce point forecasts. Three variants of deep convolutional neural networks are examined to process the images: the first is based on VGG-19, the second on ResNet-50, and the third on a self-designed architecture. The performance of the proposed approach is evaluated using time series of the M3 and M4 forecasting competitions. Our results suggest that image-based time series forecasting methods can outperform both standard and state-of-the-art forecasting models.
they are not ''data-hungry'' and are computationally cheap. On the negative side, statistical methods are less generic in the sense that they prescribe the data generation process of the series, display limited learning capacity, and are typically trained in a series-by-series fashion, thus failing to capture valuable information that can be extracted from large data sets (Semenoglou, Spiliotis, Makridakis, & Assimakopoulos, 2021). The turning point for the perception towards ML methods in generic time series forecasting tasks was probably the M4 competition (Makridakis, Spiliotis, & Assimakopoulos, 2020), where the winning method mixed long short-term memory (LSTM) networks with standard exponential smoothing equations (Smyl, 2020), while the runner-up utilized XGBoost to optimally combine nine (mostly) statistical baseline forecasting models (Montero-Manso, Athanasopoulos, Hyndman, & Talagala, 2020). These results demonstrated the benefits of ''cross-learning'' and highlighted the potential value of ML methods.

Empirical evidence from forecasting competitions, along with greater accessibility to computing resources and ML libraries, also paved the way towards more advanced and accurate DL methods. Since the literature is vast, especially when application-specific methods are considered, below we focus on some generic DL forecasting methods that have become particularly popular in recent years. Sen, Yu, and Dhillon (2019) proposed a DL method, based on temporal convolutions, that leverages both global learning from the entire data set and local learning from each time series separately. Salinas, Flunkert, Gasthaus, and Januschowski (2020) introduced DeepAR, a recurrent DL model that uses stacked LSTM layers to estimate the parameters of pre-defined distributions that represent the future observations of the time series being predicted, thus allowing the generation of both point and probabilistic forecasts. Lim, Arık, Loeff, and Pfister (2021) proposed the Temporal Fusion Transformer (TFT), a deep neural network (NN) that utilizes LSTM encoders and a self-attention mechanism to generate multi-step forecasts. Finally, N-BEATS, introduced by Oreshkin, Carpov, Chapados, and Bengio (2019), involves deep NNs that build on fully-connected layers with backward and forward residual links. Apart from achieving state-of-the-art accuracy, N-BEATS results in interpretable forecasts.

It becomes evident that most state-of-the-art DL time series forecasting methods are based on recurrent neural networks (RNNs) that use numeric representations of time series data as input (Hewamalage et al., 2021), thus ignoring the advances reported for CNNs in computer vision. Indeed, although CNNs (Lai, Chang, Yang, & Liu, 2018; Shih, Sun, & Lee, 2019) and temporal CNNs (Van Den Oord et al., 2016) have recently become more popular for time series forecasting, they usually handle time series data as 1D numeric vectors instead of 2D images, which is surprising if we consider that CNNs were originally designed for analyzing visual imagery.

Moreover, even when this is not the case, the NNs are typically used to develop classifiers or meta-learners that support the forecasting process, not to directly produce forecasts. For example, Cohen, Balch, and Veloso (2019) compared the performance of various classification models that used either numeric or visual representations of the S&P 500 index as input to predict whether the price will go up or down, concluding that the latter approach resulted in superior accuracy. In a similar classification task, Du and Barucca (2020) proposed a framework that converts the log return of financial time series to a gray-scale spectrum and uses a deep CNN to predict the future direction of prices. Cohen, Sood, Zeng, Balch, and Veloso (2020) introduced an image-to-image forecasting method where a convolutional autoencoder is used to process images of financial time series and provide the corresponding visual forecasts. Finally, Li, Kang, and Li (2020) explored the use of time series imaging for combining forecasts from different baseline forecasting methods, based on the features that computer vision algorithms had extracted from each series.

CNNs have been considered alongside 2D inputs only in specific forecasting applications. For instance, Ma, Dai, He, Ma, Wang, and Wang (2017) leverage the temporal and spatial dependencies of traffic to create 2D images that are then introduced to a CNN-based model for predicting traffic speed. Zhang, Zheng, Qi, Li, Yi, and Li (2018) estimate crowd inflow and outflow in different regions of a city by employing a deep CNN, considering city grid maps of past inflows/outflows as input. Even in cases where no spatial dependencies exist in the data, 2D input vectors can still be constructed, provided that the available data set contains variables that are correlated with the target series. For example, Zhang and Guo (2020) encoded data from various variables that affect energy consumption as images in an attempt to predict more accurately the hourly electricity consumption at residential level. However, when tasked with forecasting sets of uncorrelated series without exogenous variables, the aforementioned approaches for constructing 2D inputs become inapplicable. As a result, in univariate time series forecasting settings, like in our case, 1D CNNs become more relevant.

Despite the use of images in certain time series analysis tasks, to the best of our knowledge, no study has previously investigated the use of images as input to NNs to directly forecast a large set of uncorrelated time series without additional explanatory variables. Given the potential benefits of such an approach, we believe that image-based time series forecasting deserves more attention. The use of visual time series representations and deep 2D convolutions constitutes a novel approach for extrapolating patterns when compared to other types of NNs that rely solely on numeric input. Spatial structural information that is apparent in a visual representation of a time series offers a unique perspective, even if the same information is encoded in the original numeric data. An example of that comes from the way humans process information: it is easier for most of us to recognize time series patterns, such as trends, by looking at the plot of a series than by reading through the successive observations. Another major advantage of treating time series as images comes from the breakthroughs in the field of computer vision. Research has provided key insights and deep architectures that enable the extraction of useful features from images, which can be transferred to the forecasting domain with minimal adjustments and promising results. Finally, the image-based approach proposed in this paper is flexible and expandable depending on the application considered. For instance, the proposed methods can incorporate exogenous variables as additional features by including more channels as input. Motivated by the above, the contributions of this paper are threefold:

• We introduce an end-to-end, image-based DL approach for univariate time series forecasting, to be called ForCNN. We provide a detailed description of the overall framework that includes the necessary pre-processing steps and the architecture of the self-designed encoder used.

• We explore the potential value of well-established networks such as ResNet-50 and VGG-19 as alternatives to the self-designed encoder. Leveraging such networks alleviates the burden of developing and optimizing new architectures by relying on already optimized and trained ones. The results suggest that it is possible to use such networks to generate accurate forecasts, thus further supporting the utilization of image-based forecasting approaches.

• We demonstrate the performance of the proposed approach through an empirical evaluation using two different sets of
series that consist of the monthly and yearly data of well-known forecasting competitions, namely the M3 (Makridakis & Hibon, 2000) and M4 (Makridakis et al., 2020). Both accuracy and computational cost metrics are considered to provide a comprehensive assessment, while various established methods, including state-of-the-art ML and DL models, are employed as benchmarks.

The rest of the paper is organized as follows: Section 2 introduces the methodological approach proposed for transforming time series data into images and the architecture of the models used for generating point forecasts. Section 3 describes the experimental design used to empirically evaluate the overall performance of the proposed approach. This includes the data sets used for training and testing ForCNN, the performance measures utilized, details on the implementation of ForCNN, and the benchmarks selected for assessing the relative performance of the proposed approach. Section 4 presents and discusses the results. Finally, Section 5 concludes the paper and provides directions for future research.

2. Methodological approach

A time series is defined as a set of data points listed in chronological order, with successive observations taken at regularly spaced intervals of time. A time series Y of length n can be expressed as:

$$Y_t = \{y_t \in \mathbb{R} \mid t = 1, 2, \ldots, n\}, \quad (1)$$

where $y_t$ represents the value of the time series at time t. Univariate time series forecasting aims at estimating the future values of the series based solely on its past observations. Solving this problem depends on discovering a function (or forecasting method) that approximates the underlying data generation process of the series, as follows:

$$f : \mathbb{R}^{w} \rightarrow \mathbb{R}^{h}, \qquad [y_{n+1}, \ldots, y_{n+h}] = f(y_{n-w+1}, \ldots, y_{n}) + [e_{n+1}, \ldots, e_{n+h}], \quad (2)$$

where $\{y_{n+1}, \ldots, y_{n+h}\}$ are the future h observations of the series, f is the approximation function used to forecast these observations using the w past values $(y_{n-w+1}, \ldots, y_{n})$ as input, and $\{e_{n+1}, \ldots, e_{n+h}\}$ are the corresponding forecast errors. It becomes apparent that accurate forecasting requires methods that minimize the forecast error.

The methodological approach proposed in this paper for forecasting time series through images consists of two phases. In the first phase, the time series data, originally provided as 1D numeric vectors, are pre-processed to be properly exported as 2D images. Each image corresponds to a particular window of the in-sample data of the series and serves as a substitute for the respective numeric input that would have been typically provided to standard ML or DL methods for training and forecasting purposes. In the second phase, the images created are used for training the deep NN, ForCNN, and producing out-of-sample forecasts.

2.1. Time series pre-processing

In order for all extracted images to depict time periods of equal length, the in-sample data of the time series are first divided into equal-sized windows of w observations each. Then, min–max scaling is employed to adjust the window values to the target range of [0, 1] and facilitate training across the numerous, diverse series of the data set (Shanker, Hu, & Hung, 1996). This means that, essentially, even different samples from the same series are treated as being independent from each other. Although the input values of each sample are always bound to the [0, 1] range, the corresponding output values are not necessarily. Especially in tasks that involve forecasting heavily trended time series, this can be a useful attribute of the training set, since it allows the NNs to learn that future values can be above or below any known value, across all samples. This approach has been successfully applied in order to create training data sets for similar forecasting problems in the past (Semenoglou et al., 2021; Smyl, 2020) and was therefore adopted in the proposed methodology.

Having scaled the original time series data, the next step of the pre-processing phase refers to the visualization, i.e. the transformation of the 1D numeric vectors into 2D images. This is done using simple line plots where the horizontal x-axis represents the time period each observation corresponds to and the vertical y-axis the scaled values of the observations. The plots are created and exported using the Matplotlib plotting library for Python.

Considering that the plotted line is the most important element in the created images, with its form and position containing the complete information required for generating forecasts, the width of the line is accordingly thickened to make the time series patterns more apparent. Other visual elements usually contained in plots, such as axes and legends, are not exported in the final images for the sake of simplicity. A monochromatic, black-and-white color scheme is chosen for creating the images, with the line representing the series being white and the background being black. Thus, a single 8-bit integer value is used to represent each pixel, instead of the three or more values typically required for each pixel in colored pictures. As a result, the redundant coloring information is eliminated, the input of the forecasting model is simplified, and the memory requirements for working with the images are significantly reduced. Finally, in order for all inputs to have the same dimensions, all images are resized to 64 × 64 pixels. Fig. 1 displays the final representations of 16 indicative time series that have been exported following the aforementioned pre-processing steps.

Fig. 1. Indicative visual representations of time series used by ForCNN as input, with w set to 18 observations.
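The pipeline of Section 2.1 is straightforward to reproduce. The following sketch renders a min–max-scaled window as a 64 × 64 white-on-black gray-scale array; the helper name, line width, and DPI settings are our own assumptions, as the paper specifies only the visual conventions and the final image size:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def window_to_image(window, size=64, linewidth=3):
    """Render one min-max-scaled window as a white-on-black line plot,
    returned as a (size, size) array of 8-bit gray-scale pixels."""
    # Min-max scaling to [0, 1], computed from the window itself.
    lo, hi = window.min(), window.max()
    scaled = (window - lo) / (hi - lo) if hi > lo else np.zeros_like(window)

    # Plot without axes or legend: the line is the only visual element.
    fig, ax = plt.subplots(figsize=(1, 1), dpi=size)
    ax.plot(scaled, color="white", linewidth=linewidth)
    ax.set_ylim(-0.05, 1.05)
    ax.axis("off")
    fig.patch.set_facecolor("black")
    fig.subplots_adjust(left=0, right=1, top=1, bottom=0)

    fig.canvas.draw()
    rgb = np.asarray(fig.canvas.buffer_rgba())[..., :3]
    plt.close(fig)
    # Collapse the redundant color channels to a single 8-bit channel.
    return rgb.mean(axis=-1).astype(np.uint8)

# Sliding windows of w in-sample observations per series.
series = np.cumsum(np.random.randn(60))  # toy yearly series
w = 18
images = np.stack([window_to_image(series[i:i + w])
                   for i in range(len(series) - w + 1)])
print(images.shape)  # (43, 64, 64)
```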
2.2. Model architecture

The images created in the pre-processing phase are provided as input to ForCNN in order for the DL network to be trained and then used to produce point forecasts for the time series of interest. The overall network consists of two modules, namely an encoder and a regressor, as shown in Fig. 2. Although these modules have distinct roles, they are trained concurrently as a single model.

Fig. 2. Overview of the proposed image-based time series forecasting method, ForCNN, consisting of an encoder and a regressor module.

Note that ForCNN serves as a generic framework for developing DL, image-based time series forecasting models. Thus, although in the following sections we will focus on a model that utilizes a self-designed encoder, to be called ForCNN-SD, we will later discuss how variants of this model can be constructed using other well-established deep CNNs. The architecture of ForCNN-SD is presented in Fig. 3.

Fig. 3. The ForCNN-SD architecture. Top: The model consists of a series of stacks that take images as input and create latent representations of them. Fully-connected (FC) layers are then used to produce h-step-ahead forecasts based on the representations provided. Middle: Each stack consists of several convolutional blocks and a final convolution layer (Conv2D) for reducing the size of the exported feature maps. Bottom: Each block consists of three sets of convolution layers of the following transformations: 2D convolution (Conv2D), batch normalization (Batch Norm), and application of the ReLU activation function. A shortcut connection is used at the block level.

2.2.1. Encoder

The objective of the first module of the NN, i.e. the encoder, is to transform each image X, provided as input to the network, into a vector W that contains a latent representation of X. This allows the patterns of the series to be effectively filtered and the information that has to be learned by the network to be accordingly reduced in size, facilitating training without missing any significant knowledge. Given their successful use in numerous image-related applications (Anwar, Majid, Qayyum, Awais, Alnowami, & Khan, 2018; Kamilaris & Prenafeta-Boldú, 2018; Rawat & Wang, 2017), this task is performed through layers that apply 2D convolutions to the input. Other advantages of 2D convolution layers that make them attractive for undertaking this task include their ability to account for local dependencies around the pixels and their relatively few parameters, which accelerate training without sacrificing learning capacity.

Specifically, in order to build the encoder of the network, we consider a deep convolutional architecture inspired by that of ResNet (He et al., 2016). Multiple studies and the overall trend in ML research favor deeper NN architectures when working with images (Chollet, 2017; Simonyan & Zisserman, 2015; Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016), since the addition of more layers typically allows the model to progressively recognize and learn more specific patterns. At the same time, since deeper models are more difficult to train, they require additional components that allow them to handle issues related to vanishing gradients and degradation of training accuracy (He et al., 2016). To deal with these issues, ResNet utilizes ''shortcut connections'' between the layers that enable the input information of a block to be mapped directly to its output. Thus, when the identity mapping is considered optimal, the shortcut offers a direct way of achieving it, rather than training a large number of parameters, used in non-linear transformations, to approximate the identity function.

Each convolutional layer of the encoder consists of a 2D convolution of its respective input (Conv2D) that uses 3 × 3 filters and zero padding to maintain the original input dimensions, followed by batch normalization (Batch Norm) and the application of the activation function, which is the Rectified Linear Unit (ReLU). The layers are then organized into blocks (Block), with each block having 3 convolutional layers and an identity shortcut from its input to its output. Note that the two information flows, i.e. from the main path of the block and the shortcut, are merged before the final application of the ReLU activation function. Finally, the residual convolutional blocks are organized into stacks (Stack). The number of convolutional filters used increases by a factor of 2 as more stacks are added, while the size of the feature maps decreases by the same factor. Note also that, instead of employing traditional pooling layers with fixed functions (e.g. max pooling or average pooling), 2D convolutions are used with 2 × 2 filters and a stride of 2, essentially decreasing the spatial size by the required factor. After the final stack, the resulting feature maps are concatenated and flattened to form the embedding vector W.

2.2.2. Regressor

The objective of the second module of the NN, i.e. the regressor, is to produce the requested forecasts F given the embedding vector W that has been created by the encoder. The regressor is implemented as a simple NN with fully-connected (FC) non-linear hidden layers (the ReLU function is used for transforming the input of the nodes) and a linear output layer. The forecasts are produced for all the forecasting horizons considered simultaneously, meaning that the nodes of the output layer are equal in number to the length of the forecasting horizon examined.

Note that the regressor can be implemented using recurrent layers instead of the fully-connected layers proposed in ForCNN-SD. Recently, combinations of convolutional and LSTM-based architectures have been successfully introduced for time series forecasting (Livieris, Pintelas, & Pintelas, 2020; Xue, Triguero, Figueredo, & Landa-Silva, 2019) and a similar approach could possibly be exploited in an image-based model. However, selecting between fully-connected and recurrent layers for the regressor is not a trivial process. Depending on the application, NNs that are based on dense layers can be as accurate as models based on recurrent layers, if not more so. N-BEATS (Oreshkin et al., 2019), which is based on fully-connected layers, is an indicative example of such architectures, demonstrating that deep MLPs can achieve state-of-the-art performance in forecasting problems similar to ours. An additional point to consider is that LSTM-based architectures are inherently more complex to deploy effectively and require more time and computational resources, as they have more parameters that need optimizing. Thus, in the present study, fully-connected layers were used in order to develop a regressor that is easy to implement, efficient in terms of the computational time and resources needed for training purposes, and relatively accurate when used in batch time series forecasting tasks (Semenoglou et al., 2021).

2.2.3. Generalizing ForCNN

As noted earlier, ForCNN-SD adopts a specific, self-designed architecture. This is done in order for the proposed method to be tailored to the particular characteristics of the images used as input and to the forecasting task at hand (e.g. single-channel input of relatively simple images). However, the main framework can be adjusted according to the preferences of the user and the requirements of the forecasting task. For instance, although gray-scale line plots provide a clear and intuitive representation of the series, requiring also less memory to store and process, they may be replaced by recurrence or colored figures (Li et al., 2020). Similarly, the layers of the regressor can be extended or reduced in size depending on the complexity of the forecasting task and the length of the forecasting horizon. More importantly, the proposed encoder can be replaced by any other NN capable of extracting meaningful features from images.
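As a concrete illustration of the design described in Sections 2.2.1–2.2.2, the sketch below assembles residual convolutional blocks into stacks and attaches a fully-connected regressor. The counts (blocks per stack, filters, FC widths) and the compile settings are placeholders of our own — the optimized values are reported in Appendix A, Table A.4, which is not reproduced in this excerpt — but the wiring follows the text: 3 × 3 Conv2D → Batch Norm → ReLU layers, an identity shortcut merged before each block's final ReLU, and stride-2 2 × 2 convolutions between stacks in place of pooling:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn(x, filters, activate=True):
    # 3x3 convolution with zero padding, then batch normalization.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x) if activate else x

def residual_block(x, filters):
    # Three convolutional layers with an identity shortcut; the two
    # information flows are merged before the final ReLU.
    shortcut = x
    y = conv_bn(x, filters)
    y = conv_bn(y, filters)
    y = conv_bn(y, filters, activate=False)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_forcnn_sd(input_size=64, horizon=6, n_stacks=3,
                    blocks_per_stack=2, base_filters=32):
    inp = tf.keras.Input(shape=(input_size, input_size, 1))
    # Project to the first stack's width so identity shortcuts line up.
    x = conv_bn(inp, base_filters)
    filters = base_filters
    for _ in range(n_stacks):
        for _ in range(blocks_per_stack):
            x = residual_block(x, filters)
        # Strided 2x2 convolution halves the feature maps and doubles
        # the number of filters, replacing a fixed pooling layer.
        filters *= 2
        x = layers.Conv2D(filters, 2, strides=2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    w = layers.Flatten()(x)  # embedding vector W
    # Regressor: fully-connected ReLU layers and a linear output,
    # one node per step of the forecasting horizon.
    f = layers.Dense(128, activation="relu")(w)
    f = layers.Dense(64, activation="relu")(f)
    out = layers.Dense(horizon, activation="linear")(f)
    return tf.keras.Model(inp, out)

model = build_forcnn_sd()
model.compile(optimizer="adam", loss="mae")  # loss choice is illustrative
```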
In order to demonstrate the flexibility of the proposed method and further explore the potential value of image-based time series forecasting models, we proceed by considering two variants of the ForCNN-SD model, constructed by replacing the suggested encoder with well-established CNNs from the field of image recognition. To that end, we consider the ResNet-50 and VGG-19 DL models, to be called ForCNN-ResNet and ForCNN-VGG, respectively. Note that, since ResNet-50 and VGG-19 are used just for extracting the embedding vector W from the images, the top FC layers of their architectures are dropped before being introduced to the ForCNN framework. Moreover, given that both models were originally developed for handling colored images consisting of three channels, we simply repeat the constructed gray-scale images three times to match their input format. The pre-trained weights of ResNet-50 and VGG-19 on the ImageNet data set are used to initialize the encoder modules of ForCNN-ResNet and ForCNN-VGG, while random weights are used to initialize their regressor modules.
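A minimal sketch of this encoder swap, assuming the standard tf.keras.applications interface (the glue code and the regressor layer sizes are our own, not the paper's):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_forcnn_pretrained(backbone="resnet", input_size=64, horizon=6):
    inp = tf.keras.Input(shape=(input_size, input_size, 1))
    # The pre-trained backbones expect three channels, so the single
    # gray-scale channel is simply repeated three times.
    x = layers.Concatenate()([inp, inp, inp])

    if backbone == "resnet":   # ForCNN-ResNet
        enc = tf.keras.applications.ResNet50(
            include_top=False,             # drop the original FC head
            weights="imagenet",            # ImageNet pre-trained weights
            input_shape=(input_size, input_size, 3))
    else:                      # ForCNN-VGG
        enc = tf.keras.applications.VGG19(
            include_top=False, weights="imagenet",
            input_shape=(input_size, input_size, 3))

    w = layers.Flatten()(enc(x))           # embedding vector W
    # Randomly initialized regressor on top of the pre-trained encoder.
    f = layers.Dense(128, activation="relu")(w)
    out = layers.Dense(horizon, activation="linear")(f)
    return tf.keras.Model(inp, out)

forcnn_resnet = build_forcnn_pretrained("resnet")
forcnn_vgg = build_forcnn_pretrained("vgg")
```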
3. Experimental setup

3.1. Data set

In order to evaluate the forecasting performance of the proposed method, we consider two sets of series: the 23,000 yearly time series of the M4 competition data set (Makridakis et al., 2020) and the 1428 monthly series of the M3 competition data set (Makridakis & Hibon, 2000). Both data sets are comprised of multiple series from diverse domains (finance, micro, macro, industry, and demographics) that facilitate training and provide better generalization of our findings (Spiliotis et al., 2020a). Moreover, they are publicly available and well-established as benchmark sets, enabling the replication of our results (Makridakis, Assimakopoulos, & Spiliotis, 2018a) and their direct comparison to those of other studies claiming accuracy improvements using either standard or state-of-the-art forecasting methods (Makridakis, Spiliotis, & Assimakopoulos, 2018c).

In the case of the yearly series, following the original set-up of the M4 competition, the forecasting horizon, h, is set equal to 6 years and the number of output nodes of the regressor module of the NNs is accordingly defined. Consequently, the last 6 observations of each series (out-of-sample) are hidden for the final evaluation of the models, while the rest of the observations (in-sample) are used for hyper-parameter optimization, training, and validation purposes. Also, a sliding window strategy is applied on the in-sample part of the series, where possible, in order to extract multiple windows from each series and maximize the number of available training and validation samples. The size of the windows used for creating the images, w, is set equal to 3 × h = 18, so that a reasonable number of observations is available for analyzing the patterns of the data and producing the corresponding forecasts. As a result, the final train set consists of approximately 235,000 samples, extracted from the in-sample part of the initial 23,000 yearly time series.

Similarly, in the case of the monthly series, the set-up of the M3 competition was followed for training and evaluating the examined models. The forecasting horizon, h, is set equal to 18 months. Thus, the last 18 observations of each series are hidden until the evaluation phase, while the rest of the observations are used for training and validating the models. Note that monthly series often exhibit strong seasonality that may be difficult for NNs to capture, especially when relatively short series are involved and no exogenous date-related information is provided (Barker, 2020; Zhang & Qi, 2005). Therefore, classical multiplicative decomposition is applied for seasonally adjusting the series before further processing them, if needed, as determined by an autocorrelation-based seasonality test (Makridakis et al., 2020). The inverse transformation is then applied to re-seasonalize the forecasts of the series that were identified as seasonal. The sliding window strategy, as described above, is used on the in-sample part of the monthly series to create the train set of the models. The size of the input windows, w, is set equal to 36 observations (3 seasonal periods). As a result, the final train set consists of approximately 67,000 samples, extracted from the in-sample part of the initial 1428 monthly time series.

Note that the minimum and maximum values used in the pre-processing phase of the method for scaling the data are defined for each window separately, using only the in-sample part of the series, i.e. the 18 and 36 observations of each window in the case of the yearly and monthly series, respectively. These values are then used for re-scaling the forecasts and measuring forecasting performance, either for training or testing purposes. Also note that, in the testing phase, if a window has fewer observations than what is required to produce forecasts, then additional data points are artificially added at its start using a naive backcasting approach, i.e. the first available data point of the window is used to fill the missing ones backwards. On the contrary, incomplete windows are dropped in the training phase.
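As a rough sketch of these two pre-processing details — classical multiplicative decomposition for seasonal adjustment and naive backcasting for short test windows — consider the following helpers (our own; the autocorrelation-based seasonality test that decides whether to deseasonalize a series is omitted here):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def deseasonalize(series: pd.Series, period: int = 12):
    """Classical multiplicative decomposition: returns the seasonally
    adjusted series and the seasonal indices, which can later be tiled
    forward to re-seasonalize the forecasts."""
    res = seasonal_decompose(series, model="multiplicative", period=period)
    return series / res.seasonal, res.seasonal

def pad_window(values, w: int) -> np.ndarray:
    """Naive backcasting: left-pad a too-short test window by repeating
    its first available observation backwards."""
    values = np.asarray(values, dtype=float)
    if len(values) >= w:
        return values[-w:]
    return np.concatenate([np.full(w - len(values), values[0]), values])

print(pad_window([5.0, 6.0, 7.0], w=6))  # [5. 5. 5. 5. 6. 7.]
```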
3.2. ForCNN implementation

ForCNN-SD is implemented using Python 3.7.6, Tensorflow v.2.2.0, and the Keras API built on top of Tensorflow. Given the complex nature of the model and the large number of hyper-parameters available, optimizing the architecture of the network and the training process is essential. In this respect, a hyper-parameter search is performed to determine the optimal values for the most critical hyper-parameters, based on the yearly series of the M4 competition. The Tree-of-Parzen-Estimators (TPE) algorithm is employed for performing the search, as implemented in the HyperOpt library for Python (Bergstra, Komer, Eliasmith, Yamins, & Cox, 2015). The optimal values, as determined by the search, are then used for building and training the final model. Appendix A contains a summary of the optimized hyper-parameters, the search range for each of them, and their final optimal values. Furthermore, for the two ForCNN variants, i.e. ForCNN-ResNet and ForCNN-VGG, the self-designed encoder was replaced by the pre-trained ResNet50 and VGG19 networks respectively, as implemented in Tensorflow's Keras API.

The optimal hyper-parameter values, as determined by the optimization process on the yearly series of the M4 competition, are also transferred to the models used for forecasting the monthly series of the M3 competition. As a result, it is possible to investigate whether the proposed DL architecture of ForCNN-SD and its variants (ForCNN-ResNet and ForCNN-VGG) has the ability to generalize well and adapt to data with different features.

Finally, to reduce the effect that random initial weights may have on the final forecasting performance of the proposed method, improve its accuracy and robustness, and draw more objective conclusions (Barrow, Crone, & Kourentzes, 2010; Kourentzes, Barrow, & Crone, 2014), we consider an ensemble of 50 different ForCNN-SD models and use the median operator to obtain the final forecasts, i.e. we train ForCNN-SD 50 times and combine the outputs of the individual models for each forecasting horizon separately. In this regard, although all models used within the ensemble have the same architecture and hyper-parameters, each model displays different initial weights and uses different training and validation splits from the initial data set. The same ensembling strategy is utilized for ForCNN-ResNet and ForCNN-VGG.
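The median combination step reduces to a single NumPy call once the members' forecasts are stacked; a minimal sketch with toy shapes (1428 series, horizon 18, matching the M3 monthly set-up):

```python
import numpy as np

rng = np.random.default_rng(42)
# One forecast array per trained network: (members, series, horizon).
member_forecasts = rng.random((50, 1428, 18))

# Median operator applied independently per series and per horizon.
ensemble_forecast = np.median(member_forecasts, axis=0)
print(ensemble_forecast.shape)  # (1428, 18)
```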
Ensembling has long been considered a useful practice in the forecasting literature (Bates & Granger, 1969) and numerous forecast combination schemes, both simple and sophisticated, have been proposed to exploit its full potential (for an encyclopedic review of forecast combinations, please refer to section 2.6 of Petropoulos et al., 2022). Simply put, since no forecasting model can consistently outperform all models available, combining the forecasts of multiple models, each making different assumptions about the data and having a different structure, can effectively improve overall forecast accuracy. By ensembling forecasts, the errors of the participating models cancel each other out, while model and parameter uncertainty is reduced (Petropoulos, Hyndman, & Bergmeir, 2018). As a result, the practice of combining has become particularly popular among the ML forecasting community as well, where bagging, boosting, and heavy ensembling are widely used to improve the performance of single models (Bojer & Meldgaard, 2021). These findings motivated the utilization of ensembling in the present study.

3.3. Benchmarks

We consider six different benchmarks to demonstrate the relative forecasting performance of the proposed approach, as described below.

• Theta: Theta (Assimakopoulos & Nikolopoulos, 2000) is a well-established and highly accurate statistical method that became popular due to its superior performance in the M3 competition (Makridakis & Hibon, 2000), being also the most accurate benchmark in the M4. Therefore, it is often considered a standard of comparison. The method was implemented as in the M4 competition.¹

• ES-RNN: This method, proposed by Smyl (2020), was the winning submission of the M4 competition and the most accurate one for the yearly series of the respective data set, also used in the present study. The method is essentially a hybrid, mixing traditional exponential smoothing models with an LSTM-based architecture. Although methods that outperform the winning submission of a competition after the out-of-sample data have been released cannot claim any victory, such comparisons are useful for understanding their potential value over existing state-of-the-art methods. Note that, for the purposes of the present study, ES-RNN was not replicated. Instead, the original forecasts submitted to the M4 competition were used.²

• MLP: This method is an ensemble of 50 simple feed-forward NNs with fully-connected layers and serves as a standard ''cross-learning'' ML benchmark (Semenoglou et al., 2021). Similar to the proposed method, MLP is implemented using Python 3.7.6, Tensorflow v.2.2.0, and the Keras API. Following the steps outlined in Section 2, the numeric vectors, before being exported as images, are used as input to the MLP. This means that, for the case of the 1428 monthly series of the M3 competition, data were seasonally adjusted and then scaled. Accordingly, samples were simply scaled for the case of the 23,000 yearly series of the M4 competition. As for the output, the required multi-step forecasts are produced simultaneously, exactly as done by the regressor module of the ForCNN method. Although MLP is relatively simple in nature, its architecture and training hyper-parameters were optimized using the TPE algorithm in order for the comparisons performed with the optimized ForCNN methods to be fair (for more details on the selected values, please see Appendix A).

• CNN-1D: This method is an ensemble of 50 simple CNNs that use 1D vectors as input, similar to the MLP. Comparing the proposed image-based methods to CNN-1D is important, since it expands the discussion by considering an approach that uses convolutions on traditional numeric representations of the series. CNN-1D is implemented using Python 3.7.6, Tensorflow v.2.2.0, and the Keras API. As is the case with the MLP, the 1D numeric vectors are constructed by slicing the available series and then scaling the resulting windows based on their in-sample part. Also, for the case of the 1428 monthly series of the M3 competition, series were seasonally adjusted before being sliced and scaled. The multi-step forecasts required are produced simultaneously from the last dense layer of CNN-1D. The architecture and training process of CNN-1D were optimized using the TPE algorithm. Appendix A provides a more detailed summary of CNN-1D's architecture and training.

• N-BEATS: This method was proposed by Oreshkin et al. (2019) and is essentially an ensemble of 180 deep neural networks of fully-connected layers with backward and forward residual links. Currently, it is considered the state-of-the-art approach in time series forecasting, and has been successfully used in various applications (Putz, Gumhalter, & Auer, 2021; Stevenson, Rodriguez-Fernandez, Minisci, & Camacho, 2022), including the second best submission of the most recent M5 competition (Anderer & Li, 2022). In the original paper, two N-BEATS configurations were presented based on the interpretability of the underlying models, N-BEATS-Generic and N-BEATS-Interpretable. For the purposes of this study we employ the N-BEATS-Generic variant (to be simply called N-BEATS), since it provides marginally more accurate forecasts, according to the original paper. Note that N-BEATS was configured and trained as proposed by the authors in the original study. The method was replicated using the official release provided.³

• DeepAR: This method was proposed by Salinas et al. (2020) and is an auto-regressive recurrent neural network model based on LSTM cells. It was initially developed as an architecture for probabilistic forecasting, but it is also capable of providing very accurate point forecasts. DeepAR has been employed in several studies and applications (Mashlakov, Kuronen, Lensu, Kaarna, & Honkapuro, 2021; Park, Park, & Hwang, 2020), including the third best submission of the most recent M5 competition (Jeon & Seong, 2021), and is now considered a well-established standard of comparison. In order for the comparisons performed with the rest of the methods to be fair, DeepAR's hyper-parameters were optimized using the TPE algorithm in a fashion similar to that of the proposed image-based models (for more details on the selected values, please see Appendix A). Furthermore, 50 DeepAR models were trained and their forecasts were combined using the median operator. For the purposes of this study, the DeepAR models were implemented in Python, using the GluonTS modeling package, developed by Amazon (Alexandrov et al., 2019).

3.4. Forecasting accuracy measures

Enabling direct comparisons of our results with those of other relevant studies is of major importance. Hence, we choose to evaluate the forecasting accuracy of the presented models using the official measures of the M3 and M4 forecasting competitions. Specifically, the official measure used in the M3 competition was

1. https://github.com/Mcompetitions/M4-methods/blob/master/Benchmarks%20and%20Evaluation.R
2. Forecasts retrieved from the GitHub repository of the M4 competition (https://github.com/Mcompetitions/M4-methods/tree/master/Point%20Forecasts).
3. github.com/ElementAI/N-BEATS
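The section breaks off at this point. For reference, the symmetric mean absolute percentage error (sMAPE) — the measure used for the rankings in Fig. 5 and the error distributions in Fig. B.6 — is conventionally defined in the M3/M4 competitions as follows (our addition, with $y_t$ the actual value and $\hat{y}_t$ the forecast):

```latex
\mathrm{sMAPE} = \frac{2}{h} \sum_{t=n+1}^{n+h}
    \frac{\lvert y_t - \hat{y}_t \rvert}{\lvert y_t \rvert + \lvert \hat{y}_t \rvert}
    \times 100\%
```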
Fig. 4. Indicative plots of time series and the corresponding predictions provided by ForCNN-SD and N-BEATS. The green line represents the actual historical and
future observations. The orange line represents the forecasts generated by ForCNN-SD, while the blue line represents the forecasts generated by N-BEATS. The top
two rows show indicative cases where ForCNN-SD significantly outperforms N-BEATS. The bottom two rows show cases where N-BEATS significantly outperforms
ForCNN-SD. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. Average ranks and 95% confidence intervals of the three ForCNN variants and benchmarks considered over the 23,000 M4 yearly (left) and 1428 M3 monthly
(right) series. The multiple comparisons with the best test, as proposed by Koning et al. (2005), is applied using sMAPE for ranking the methods.
Fig. B.6. Distribution of sMAPE forecasting errors of the 50 networks participating in each of the ensembles. ‘‘X’’ marks the forecasting accuracy of the respective
ensemble. The top figure corresponds to the results based on the 23,000 yearly series of the M4 competition. The bottom figure corresponds to the results based on
the 1428 monthly series of the M3 competition.
as input to the proposed method and eliminate the need for designing and optimizing a new convolutional architecture from scratch. Between the two, the accuracy and computational cost of ForCNN-VGG was comparable to that of the self-designed encoder, while ForCNN-ResNet's performance was less impressive, irrespective of the evaluation data set. Overall, our results suggest that using VGG-19 as a replacement for the encoder is indeed a viable option.

Our findings greatly motivate future research in the area of image-based time series forecasting, which could focus on, but is not limited to, the following aspects:

• The present study considered two large, diverse sets of 23,000 yearly and 1428 monthly time series respectively, which originated from six particular domains (micro, macro, industry, finance, demographic, and other). Although monthly series differ significantly from yearly ones, both data sets consist of continuous, relatively short series of low frequency. It would be interesting to examine how the proposed method performs for the case of high-frequency data (e.g. daily and hourly series) or intermittent and lumpy data, as well as for cases of special forecasting applications (e.g. stock market, retail, and energy forecasting).

• Recurrence and colored plots could be used instead of simple, gray-scale line figures to extract more information from the series and provide more accurate forecasts.

• Additional work should be done to understand how existing, advanced DL models used in computer vision can be exploited for identifying time series patterns, extracting knowledge from large data sets, and using such knowledge to improve forecasting accuracy, while reducing computational cost.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Appendix A. ForCNN-SD, MLP, CNN-1D & DeepAR hyper-parameters

This appendix provides details on the hyper-parameter optimization of the proposed ForCNN-SD model, as well as of the MLP, CNN-1D and DeepAR benchmarks. The Tree-of-Parzen-Estimators (TPE) algorithm is used on a validation set to determine the optimal set of values after 100 iterations for all models considered. Note that, in all cases, the optimal hyper-parameter values are determined based on the yearly series of the M4 competition and they are also used for forecasting the monthly series of the M3 competition. Table A.4 contains the hyper-parameters that are being optimized, along with the search space and their final, optimal values for ForCNN-SD. Tables A.5–A.7 contain similar information for the MLP, CNN-1D and DeepAR respectively.

Table A.7
The architecture and the training hyper-parameters of the DeepAR used as benchmark in this study, along with the considered search space for each hyper-parameter and the optimal values as determined by the optimization algorithm.

Hyper-parameter               Search space                            Optimal value
Cell type                     [LSTM, GRU]                             LSTM
Number of recurrent layers    [1, 2, 3, 5, 10]                        3
Number of cells               [16, 32, 64, 128, 256, 512]             512
Dropout rate                  [7%, 12%, 20%]                          7%
Training epochs               [64, 128, 256, 512, 1024, 2048]         128
Number of batches per epoch   [32, 64, 128, 256, 512, 1024]           128
Batch size                    [32, 64, 128, 256, 512]                 128
Learning rate                 [0.0001, 0.0005, 0.001, 0.005, 0.01]    0.005
Patience                      [8, 16, 32, 64]                         8
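To illustrate how the optimal values of Table A.7 map onto GluonTS, a hedged sketch follows. The estimator and trainer argument names follow the mxnet-based GluonTS 0.x API (the exact placement of some arguments, e.g. batch_size, varies across versions), and the one-series dataset is a toy placeholder:

```python
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# Toy monthly dataset in GluonTS format (placeholder values).
train_ds = ListDataset(
    [{"start": "2000-01-01", "target": np.random.rand(120)}], freq="M")

estimator = DeepAREstimator(
    freq="M",
    prediction_length=18,            # M3 monthly horizon
    cell_type="lstm",                # Table A.7: cell type
    num_layers=3,                    # Table A.7: recurrent layers
    num_cells=512,                   # Table A.7: cells per layer
    dropout_rate=0.07,               # Table A.7: dropout rate
    trainer=Trainer(
        epochs=128,                  # Table A.7: training epochs
        num_batches_per_epoch=128,   # Table A.7: batches per epoch
        batch_size=128,              # Table A.7: batch size
        learning_rate=0.005))        # Table A.7: learning rate
# Table A.7 also lists a patience of 8, applied to the learning-rate
# schedule; it is not shown here as its wiring is version-dependent.
predictor = estimator.train(train_ds)
```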
Appendix B. Forecasting errors distribution

In this appendix we provide more details on the distribution of sMAPE errors of the individual networks that contribute forecasts to the ensemble. Note that the final forecasts of each ForCNN variant are calculated by combining the output of 50 networks of that architecture, as described in Section 3.2. Fig. B.6 shows the distributions of errors of the individual networks of the three ForCNN variants, for the cases of the yearly and the monthly data respectively. The forecasting accuracy of each variant's ensemble is also shown, as the corresponding ''X'' sign.

The first major conclusion from the analysis relates to the benefit of employing ensembles of networks, as opposed to using a single network. Even though this approach requires more computational time, the forecasting accuracy improvements are significant across all architectures and sets of series. Our results are aligned with existing literature in forecasting that highlights the importance of combining predictions from different models in order to make the final forecasts more robust. That is also the case for N-BEATS (Oreshkin et al., 2019), which is essentially an ensemble of 180 individual networks.

When considering the M3 monthly series, the forecasting errors of the networks are similarly distributed across the three ForCNN variants, with the respective ensembles being far more accurate in the comparison. In the case of the M4 yearly data, each variant's networks exhibit different patterns. According to the distribution of errors, the individual ForCNN-SD networks are more robust and their accuracy is closer to that of the final ensemble. On the other hand, ForCNN-VGG networks exhibit a wide range of forecasting errors, with the majority of the models being considerably less accurate than the respective ensemble. Finally, the large majority of ForCNN-ResNet networks have similar performance, being significantly less accurate than the ensemble. However, 2 out of the 50 networks generate forecasts that are at least as accurate as the forecasts of the final ensemble.
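The comparison behind Fig. B.6 — per-network sMAPE against the sMAPE of the median ensemble — can be sketched as follows (toy data; the actual forecasts and test values are those of Section 3):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE averaged over all series and horizon steps."""
    return 200 * np.mean(np.abs(y_pred - y_true)
                         / (np.abs(y_true) + np.abs(y_pred)))

rng = np.random.default_rng(0)
y_test = rng.uniform(1.0, 2.0, size=(1428, 18))                 # toy actuals
member_forecasts = y_test + rng.normal(0, 0.1, (50, 1428, 18))  # 50 networks

member_errors = np.array([smape(y_test, f) for f in member_forecasts])
ensemble_error = smape(y_test, np.median(member_forecasts, axis=0))
print(member_errors.mean(), ensemble_error)  # ensemble is typically lower
```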
References

Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., et al. (2019). GluonTS: Probabilistic time series models in Python. arXiv preprint arXiv:1906.05264.
Anderer, M., & Li, F. (2022). Hierarchical forecasting with a top-down alignment of independent-level forecasts. International Journal of Forecasting. http://dx.doi.org/10.1016/j.ijforecast.2021.12.015.
Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M. K. (2018). Medical image analysis using convolutional neural networks: a review. Journal of Medical Systems, 42(11), 226. http://dx.doi.org/10.1007/s10916-018-1088-1.
Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: A decomposition approach to forecasting. International Journal of Forecasting, 16, 521–530. http://dx.doi.org/10.1016/S0169-2070(00)00066-2.
Barker, J. (2020). Machine learning in M4: What makes a good unstructured model? International Journal of Forecasting, 36(1), 150–155. http://dx.doi.org/10.1016/j.ijforecast.2019.06.001.
Barrow, D. K., Crone, S. F., & Kourentzes, N. (2010). An evaluation of neural network ensembles and model selection for time series prediction. In The 2010 international joint conference on neural networks (pp. 1–8). http://dx.doi.org/10.1109/IJCNN.2010.5596686.
Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Journal of the Operational Research Society, 20(4), 451–468.
Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), Article 014008. http://dx.doi.org/10.1088/1749-4699/8/1/014008.
Bojer, C. S., & Meldgaard, J. P. (2021). Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting, 37(2), 587–603.
Catania, L., & Bernardi, M. (2017). MCS: Model confidence set procedure. R package version 0.1.3.
Chae, Y. T., Horesh, R., Hwang, Y., & Lee, Y. M. (2016). Artificial neural network model for forecasting sub-hourly electricity usage in commercial buildings. Energy and Buildings, 111, 184–194. http://dx.doi.org/10.1016/j.enbuild.2015.11.045.
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE conference on computer vision and pattern recognition (pp. 1800–1807). http://dx.doi.org/10.1109/CVPR.2017.195.
Cohen, N., Balch, T., & Veloso, M. (2019). The effect of visual design in image classification. arXiv preprint arXiv:1907.09567.
Cohen, N., Sood, S., Zeng, Z., Balch, T., & Veloso, M. (2020). Visual forecasting of time series with image-to-image regression. arXiv preprint arXiv:2011.09052.
Du, B., & Barucca, P. (2020). Image processing tools for financial time series classification. arXiv preprint arXiv:2008.06042.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. In 2019 IEEE/CVF international conference on computer vision (pp. 6201–6210). http://dx.doi.org/10.1109/ICCV.2019.00630.
Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2), 654–669. http://dx.doi.org/10.1016/j.ejor.2017.11.054.
Frolov, S., Hinz, T., Raue, F., Hees, J., & Dengel, A. (2021). Adversarial text-to-image synthesis: A review. Neural Networks, 144, 187–209. http://dx.doi.org/10.1016/j.neunet.2021.07.019.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-Gonzalez, P., & Garcia-Rodriguez, J. (2018). A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70, 41–65. http://dx.doi.org/10.1016/j.asoc.2018.05.018.
Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set. Econometrica, 79(2), 453–497.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.
Hewamalage, H., Bergmeir, C., & Bandara, K. (2021). Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1), 388–427. http://dx.doi.org/10.1016/j.ijforecast.2020.06.008.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In 2017 IEEE conference on computer vision and pattern recognition (pp. 2261–2269). http://dx.doi.org/10.1109/CVPR.2017.243.
Huber, J., & Stuckenschmidt, H. (2020). Daily retail demand forecasting using machine learning with emphasis on calendric special days. International Journal of Forecasting, 36(4), 1420–1438. http://dx.doi.org/10.1016/j.ijforecast.2020.02.005.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688. http://dx.doi.org/10.1016/j.ijforecast.2006.03.001.
Jeon, Y., & Seong, S. (2021). Robust recurrent network model for intermittent time-series forecasting. International Journal of Forecasting. http://dx.doi.org/10.1016/j.ijforecast.2021.07.004.
Kamilaris, A., & Prenafeta-Boldú, F. X. (2018). A review of the use of convolutional neural networks in agriculture. The Journal of Agricultural Science, 156(3), 312–322. http://dx.doi.org/10.1017/S0021859618000436.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF conference on computer vision and pattern recognition (pp. 4396–4405). http://dx.doi.org/10.1109/CVPR.2019.00453.
Ker, J., Wang, L., Rao, J., & Lim, T. (2018). Deep learning applications in medical image analysis. IEEE Access, 6, 9375–9389. http://dx.doi.org/10.1109/ACCESS.2017.2788044.
Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. (2005). The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21(3), 397–409. http://dx.doi.org/10.1016/j.ijforecast.2004.10.003.
Kourentzes, N., Barrow, D. K., & Crone, S. F. (2014). Neural network ensemble operators for time series forecasting. Expert Systems with Applications, 41(9), 4235–4244. http://dx.doi.org/10.1016/j.eswa.2013.12.011.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems: Vol. 25 (pp. 1097–1105). Curran Associates, Inc.
Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2018). Modeling long- and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research and development in information retrieval, SIGIR '18 (pp. 95–104). Association for Computing Machinery. http://dx.doi.org/10.1145/3209978.3210006.
Li, X., Kang, Y., & Li, F. (2020). Forecasting with time series imaging. Expert Systems with Applications, 160, Article 113680. http://dx.doi.org/10.1016/j.eswa.2020.113680.
Lim, B., Arık, S. O., Loeff, N., & Pfister, T. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764. http://dx.doi.org/10.1016/j.ijforecast.2021.03.012.
Livieris, I. E., Pintelas, E., & Pintelas, P. (2020). A CNN–LSTM model for gold price time-series forecasting. Neural Computing and Applications, 32(23), 17351–17360. http://dx.doi.org/10.1007/s00521-020-04867-x.
Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., & Wang, Y. (2017). Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 17(4). http://dx.doi.org/10.3390/s17040818.
Majumdar, A., & Gupta, M. (2019). Recurrent transform learning. Neural Networks, 118, 271–279. http://dx.doi.org/10.1016/j.neunet.2019.07.003.
Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527–529. http://dx.doi.org/10.1016/0169-2070(93)90079-3.
Makridakis, S. (2017). The forthcoming artificial intelligence (AI) revolution: Its impact on society and firms. Futures, 90, 46–60. http://dx.doi.org/10.1016/j.futures.2017.03.006.
Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018a). Objectivity, reproducibility and replicability in forecasting research. International Journal of Forecasting, 34(4), 835–838. http://dx.doi.org/10.1016/j.ijforecast.2018.05.001.
Makridakis, S., & Hibon, M. (2000). The M3-competition: results, conclusions and implications. International Journal of Forecasting, 16(4), 451–476. http://dx.doi.org/10.1016/S0169-2070(00)00057-1.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018b). Statistical and machine learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3), 1–26. http://dx.doi.org/10.1371/journal.pone.0194889.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018c). The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34. http://dx.doi.org/10.1016/j.ijforecast.2018.06.001.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020). The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1), 54–74. http://dx.doi.org/10.1016/j.ijforecast.2019.04.014.
Mashlakov, A., Kuronen, T., Lensu, L., Kaarna, A., & Honkapuro, S. (2021). Assessing the performance of deep learning models for multivariate probabilistic energy forecasting. Applied Energy, 285, Article 116405. http://dx.doi.org/10.1016/j.apenergy.2020.116405.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J., & Talagala, T. S. (2020). FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1), 86–92. http://dx.doi.org/10.1016/j.ijforecast.2019.02.011.
Naseer, S., Saleem, Y., Khalid, S., Bashir, M. K., Han, J., Iqbal, M. M., et al. (2018). Enhanced network anomaly detection based on deep neural networks. IEEE Access, 6, 48231–48246. http://dx.doi.org/10.1109/ACCESS.2018.2863036.
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.
Park, S., Park, S., & Hwang, E. (2020). Normalized residue analysis for deep learning based probabilistic forecasting of photovoltaic generations. In 2020 IEEE international conference on big data and smart computing (BigComp) (pp. 483–486). http://dx.doi.org/10.1109/BigComp48618.2020.00-20.
Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Barrow, D. K., Ben Taieb, S., et al. (2022). Forecasting: theory and practice. International Journal of Forecasting, 38(3), 705–871.
Petropoulos, F., Hyndman, R. J., & Bergmeir, C. (2018). Exploring the sources of uncertainty: Why does bagging for time series forecasting work? European Journal of Operational Research, 268(2), 545–554.
Putz, D., Gumhalter, M., & Auer, H. (2021). A novel approach to multi-horizon wind power forecasting based on deep neural architecture. Renewable Energy, 178, 494–505. http://dx.doi.org/10.1016/j.renene.2021.06.099.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. http://dx.doi.org/10.1162/neco_a_00990.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. http://dx.doi.org/10.1007/s11263-015-0816-y.
Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3), 1181–1191. http://dx.doi.org/10.1016/j.ijforecast.2019.07.001.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Semenoglou, A.-A., Spiliotis, E., Makridakis, S., & Assimakopoulos, V. (2021). Investigating the accuracy of cross-learning time series forecasting methods. International Journal of Forecasting, 37(3), 1072–1084. http://dx.doi.org/10.1016/j.ijforecast.2020.11.009.
Sen, R., Yu, H.-F., & Dhillon, I. S. (2019). Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems: Vol. 32. Curran Associates, Inc.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
Shanker, M., Hu, M., & Hung, M. (1996). Effect of data standardization on neural network training. Omega, 24(4), 385–397. http://dx.doi.org/10.1016/0305-0483(96)00010-2.
Shih, S.-Y., Sun, F.-K., & Lee, H.-Y. (2019). Temporal pattern attention for multivariate time series forecasting. Machine Learning, 108(8–9), 1421–1441. http://dx.doi.org/10.1007/s10994-019-05815-0.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1), 75–85. http://dx.doi.org/10.1016/j.ijforecast.2019.03.017.
Spiliotis, E., Kouloumos, A., Assimakopoulos, V., & Makridakis, S. (2020a). Are forecasting competitions data representative of the reality? International Journal of Forecasting, 36(1), 37–53. http://dx.doi.org/10.1016/j.ijforecast.2018.12.007.
Spiliotis, E., Makridakis, S., Semenoglou, A.-A., & Assimakopoulos, V. (2020b). Comparison of statistical and machine learning methods for daily SKU demand forecasting. Operational Research, 1–25.
Stevenson, E., Rodriguez-Fernandez, V., Minisci, E., & Camacho, D. (2022). A deep learning approach to solar radio flux forecasting. Acta Astronautica, 193, 595–606. http://dx.doi.org/10.1016/j.actaastro.2021.08.004.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In K. Chaudhuri, & R. Salakhutdinov (Eds.), Proceedings of the 36th international conference on machine learning, Proceedings of Machine Learning Research: Vol. 97. PMLR.
Tian, C., Fei, L., Zheng, W., Xu, Y., Zuo, W., & Lin, C.-W. (2020). Deep learning on image denoising: An overview. Neural Networks, 131, 251–275. http://dx.doi.org/10.1016/j.neunet.2020.07.025.
Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. In The 9th ISCA speech synthesis workshop (p. 125).
Wang, J., & Wang, J. (2017). Forecasting stochastic neural network based on financial empirical mode decomposition. Neural Networks, 90, 8–20. http://dx.doi.org/10.1016/j.neunet.2017.03.004.
Xue, N., Triguero, I., Figueredo, G. P., & Landa-Silva, D. (2019). Evolving deep CNN-LSTMs for inventory time series prediction. In 2019 IEEE congress on evolutionary computation (pp. 1517–1524). http://dx.doi.org/10.1109/CEC.2019.8789957.
Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13(3), 55–75. http://dx.doi.org/10.1109/MCI.2018.2840738.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing.
Zhang, G., & Guo, J. (2020). A novel ensemble method for hourly residential electricity consumption forecasting by imaging time series. Energy, 203, Article 117858. http://dx.doi.org/10.1016/j.energy.2020.117858.
Zhang, G., & Qi, M. (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2), 501–514. http://dx.doi.org/10.1016/j.ejor.2003.08.037.
Zhang, J., Zheng, Y., Qi, D., Li, R., Yi, X., & Li, T. (2018). Predicting citywide crowd flows using deep spatio-temporal residual networks. Artificial Intelligence, 259, 147–166. http://dx.doi.org/10.1016/j.artint.2018.03.002.
Zhao, Z., Zheng, P., Xu, S., & Wu, X. (2019). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232. http://dx.doi.org/10.1109/TNNLS.2018.2876865.