0% found this document useful (0 votes)
5 views15 pages

2023, Semenoglou - Image Based Time Series Forecasting

This paper introduces ForCNN, a novel deep learning method for univariate time series forecasting that utilizes visual representations of time series data as images instead of traditional numeric inputs. The approach combines convolutional and dense layers, evaluating its performance against established forecasting methods using data from the M3 and M4 competitions. Results indicate that image-based forecasting can outperform standard models, highlighting the potential of deep learning in time series analysis.

Uploaded by

practice752
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views15 pages

2023, Semenoglou - Image Based Time Series Forecasting

This paper introduces ForCNN, a novel deep learning method for univariate time series forecasting that utilizes visual representations of time series data as images instead of traditional numeric inputs. The approach combines convolutional and dense layers, evaluating its performance against established forecasting methods using data from the M3 and M4 competitions. Results indicate that image-based forecasting can outperform standard models, highlighting the potential of deep learning in time series analysis.

Uploaded by

practice752
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Neural Networks 157 (2023) 39–53

Contents lists available at ScienceDirect

Neural Networks
journal homepage: www.elsevier.com/locate/neunet

Image-based time series forecasting: A deep convolutional neural


network approach

Artemios-Anargyros Semenoglou , Evangelos Spiliotis, Vassilios Assimakopoulos
Forecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece

article info a b s t r a c t

Article history: Inspired by the successful use of deep learning in computer vision, in this paper we introduce ForCNN,
Received 22 September 2021 a novel deep learning method for univariate time series forecasting that mixes convolutional and dense
Received in revised form 4 August 2022 layers in a single neural network. Instead of using conventional, numeric representations of time series
Accepted 6 October 2022
data as input to the network, the proposed method considers visual representations of it in the form
Available online 15 October 2022
of images to directly produce point forecasts. Three variants of deep convolutional neural networks
Keywords: are examined to process the images, the first based on VGG-19, the second on ResNet-50, while the
Time series third on a self-designed architecture. The performance of the proposed approach is evaluated using
Forecasting time series of the M3 and M4 forecasting competitions. Our results suggest that image-based time
Images series forecasting methods can outperform both standard and state-of-the-art forecasting models.
Deep Learning
© 2022 Elsevier Ltd. All rights reserved.
Convolutional Neural Networks
M competitions

1. Introduction 2014), expanding their architecture, and improving their per-


formance (Sermanet, Eigen, Zhang, Mathieu, Fergus, & LeCun,
Machine learning (ML) and more recently deep learning (DL) 2013). For instance, Simonyan and Zisserman (2015) proposed
are at the heart of many modern scientific and technological the utilization of smaller filters along with the addition of more
advances, with computer vision (Feichtenhofer, Fan, Malik, & layers. This CNN architecture, often called VGG, outperformed
He, 2019; Ker, Wang, Rao, & Lim, 2018), natural language pro- AlexNet when tested in the ILSVRC data set. ResNet (He, Zhang,
cessing (Young, Hazarika, Poria, & Cambria, 2018), and anomaly Ren, & Sun, 2016) was another critical development in the
detection (Naseer et al., 2018) being only a few of the areas where area of image recognition as it introduced shortcut connections
ML and DL algorithms have dominated over the years, influencing and allowed for deeper models to be effectively trained. Other
both academia and daily life (Makridakis, 2017). architectures, claiming even more accurate results, are those
In the field of computer vision, ML and DL have been of the DenseNet (Huang, Liu, Van Der Maaten, & Weinberger,
2017), Xception (Chollet, 2017), MobileNet (Sandler, Howard,
successfully applied in several pattern recognition and con-
Zhu, Zhmoginov, & Chen, 2018), and EfficientNet (Tan & Le, 2019).
tent understanding tasks, including, but not limited to,
Despite their resounding success in the aforementioned ap-
image classification (Rawat & Wang, 2017), semantic segmen-
plications, ML and DL algorithms have been adopted at a much
tation (Garcia-Garcia, Orts-Escolano, Oprea, Villena-Martinez,
slower pace in the field of time series forecasting (Hewamalage,
Martinez-Gonzalez, & Garcia-Rodriguez, 2018), object detec-
Bergmeir, & Bandara, 2021), also focusing on particular fore-
tion (Zhao, Zheng, Xu, & Wu, 2019), image denoising (Tian,
casting applications. For instance, although ML and DL methods
Fei, Zheng, Xu, Zuo, & Lin, 2020), and generation of artificial
have been successfully used in energy (Chae, Horesh, Hwang, &
images (Frolov, Hinz, Raue, Hees, & Dengel, 2021; Karras, Laine, Lee, 2016; Majumdar & Gupta, 2019), retail demand Huber and
& Aila, 2019). AlexNet, developed by Krizhevsky, Sutskever, and Stuckenschmidt (2020) and Spiliotis, Makridakis, Semenoglou,
Hinton (2012), was one of the first deep convolutional neural and Assimakopoulos (2020b), and stock market (Fischer & Krauss,
networks (CNNs) to be used for large-scale image recognition, 2018; Wang & Wang, 2017) forecasting applications, their results
winning the ImageNet Large-Scale Visual Recognition Challenge have been mixed in generic forecasting tasks and especially in
(ILSVRC) in 2012 (Russakovsky et al., 2015). Since then, many those where data availability is limited (Makridakis, Spiliotis, &
studies have focused on understanding CNNs (Zeiler & Fergus, Assimakopoulos, 2018b) or series display heterogeneous char-
acteristics (Spiliotis, Kouloumos, Assimakopoulos, & Makridakis,
∗ Corresponding author. 2020a). In such settings, statistical time series forecasting meth-
E-mail addresses: [email protected] (A.-A. Semenoglou), [email protected] ods, such as exponential smoothing and ARIMA models, have
(E. Spiliotis), [email protected] (V. Assimakopoulos). long been considered as the standard methods of choice since

https://doi.org/10.1016/j.neunet.2022.10.006
0893-6080/© 2022 Elsevier Ltd. All rights reserved.
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

they are not ‘‘data-hungry’’ and are computational cheap. On explored the use of time series imaging for combining forecasts
the negative side, statistical methods are less generic in a sense from different baseline forecasting methods based on the features
that they prescribe the data generation process of the series, that computer vision algorithms had extracted from each series.
display limited learning capacity, and are typically trained in a CNNs have been considered alongside 2D inputs only in spe-
series-by-series fashion, thus failing to capture valuable infor- cific forecasting applications. For instance, Ma, Dai, He, Ma, Wang,
mation that can be extracted from large data sets (Semenoglou, and Wang (2017) leverage the temporal and spatial dependen-
Spiliotis, Makridakis, & Assimakopoulos, 2021). The turning point cies of traffic to create 2D images that are then introduced to
for the perception towards ML methods in generic time series a CNN-based model for predicting traffic speed. Zhang, Zheng,
forecasting tasks was probably the M4 competition (Makridakis, Qi, Li, Yi, and Li (2018) estimate crowd inflow and outflow in
Spiliotis, & Assimakopoulos, 2020) where the winning method different regions of a city by employing a deep CNN, consid-
mixed long short-term memory (LSTM) networks with standard ering city grid maps of past inflows/outflows as input. Even in
exponential smoothing equations (Smyl, 2020), while the runner- cases where no spatial dependencies exist in the data, 2D input
up utilized XGBoost to optimally combine nine (mostly) statistical vectors can still be constructed, provided that the available data
baseline forecasting models (Montero-Manso, Athanasopoulos, set contains variables that are correlated with the target series.
Hyndman, & Talagala, 2020). These results demonstrated the For example, Zhang and Guo (2020) encoded data from various
benefits of ‘‘cross-learning’’ and highlighted the potential value variables that affect energy consumption as images in an attempt
of ML methods. to predict more accurately the hourly electricity consumption
Empirical evidence from forecasting competitions, along with at residential level. However, when tasked with forecasting sets
greater accessibility to computer resources and ML libraries, also
of uncorrelated series without exogenous variables, the afore-
paved the way towards more advanced and accurate DL meth-
mentioned approaches, used for constructing 2D inputs, become
ods. Since the literature is vast, especially when application-
inapplicable. As a result, in univariate time series forecasting
specific methods are considered, below we focus on some generic
settings, like in our case, 1D CNNs become more relevant.
DL forecasting methods that have become particularly popu-
Despite the use of images in certain time series analysis tasks,
lar in the recent years. Sen, Yu, and Dhillon (2019) proposed
to the best of our knowledge, no study has previously investigated
a DL method, based on temporal convolutions, that leverages
the use of images as input to NNs to directly forecast a large
both global learning from the entire data set and local learning
set of uncorrelated time series, without additional explanatory
from each time series separately. Salinas, Flunkert, Gasthaus, and
variables. Given the potential benefits of such an approach, we
Januschowski (2020) introduced DeepAR, a recurrent DL model
that uses stacked LSTM layers to estimate the parameters of pre- believe that image-based time series forecasting requires more
defined distributions that represent the future observations of attention. The use of visual time series representations and deep
the time series being predicted, thus allowing the generation 2D convolutions constitutes a novel approach for extrapolating
of both point and probabilistic forecasts. Lim, Arık, Loeff, and patterns when compared to other types of NNs that solely rely
Pfister (2021) proposed Temporal Fusion Transformer (TFT), a on numeric input. Spatial structural information that is apparent
deep neural network (NN) that utilizes LSTM encoders and a self- in a visual representation of a time series offers a unique per-
attention mechanism to generate multi-step forecasts. Finally, spective, even if the same information is encoded in the original
N-BEATS, introduced by Oreshkin, Carpov, Chapados, and Bengio numeric data. An example of that comes from the way humans
(2019), involves deep NNs that build on fully-connected layers process information, being easier for most of us to recognize
with backward and forward residual links. Apart from achiev- time series patterns like trends by looking at the plot of the
ing state-of-the-art accuracy, N-BEATS results in interpretable series compared to reading through the subsequent observations.
forecasts. Another major advantage of treating time series as images comes
It becomes evident that most state-of-the-art DL time se- from the breakthroughs in the field of computer vision. Research
ries forecasting methods are based on recurrent neural networks has provided key insights and deep architectures that enable the
(RNNs) that use numeric representations of time series data as extraction of useful features from images that can be transferred
input (Hewamalage et al., 2021), thus ignoring the advances in the forecasting domain with minimal adjustments and promis-
reported for CNNs in computer vision. Indeed, although CNNs (Lai, ing results. Finally, the image-based approach proposed in this
Chang, Yang, & Liu, 2018; Shih, Sun, & Lee, 2019) and temporal paper is flexible and expandable depending on the application
CNNs (Van Den Oord et al., 2016) have recently become more considered. For instance, the proposed methods can incorporate
popular for time series forecasting, they usually handle time exogenous variables as additional features by including more
series data as 1D numeric vectors instead of 2D images, which is channels as input. Motivated by the above, the contributions of
surprising if we consider that CNNs have been originally designed this paper are threefold:
for analyzing visual imagery.
Moreover, even when this is not the case, the NNs are typically • We introduce an end-to-end, image-based DL approach for
used to develop classifiers or meta-learners that support the univariate time series forecasting, to be called ForCNN. We
forecasting process and not to directly produce forecasts. For provide a detailed description of the overall framework that
example, Cohen, Balch, and Veloso (2019) compared the perfor- includes the necessary pre-processing steps and the archi-
mance of various classification models that used either numeric tecture of the self-designed encoder used.
or visual representations of the S&P 500 index as input to predict • We explore the potential value of well-established networks
whether the price will go up or down, concluding that the latter such as the ResNet-50 and VGG-19 as alternatives to the
approach resulted in superior accuracy. In a similar classification self-designed encoder. Leveraging such networks alleviates
task, Du and Barucca (2020) proposed a framework that converts the burden of developing and optimizing new architectures
the log return of financial time series to a gray-scale spectrum and by relying on already optimized and trained ones. The re-
uses a deep CNN to predict the future direction of prices. Cohen, sults suggest that it is possible to use such networks to
Sood, Zeng, Balch, and Veloso (2020) introduced an image-to- generate accurate forecasts, thus further supporting the uti-
image forecasting method where a convolutional autoencoder is lization of image-based forecasting approaches.
used to process images of financial time series and provide the • We demonstrate the performance of the proposed approach
corresponding visual forecasts. Finally, Li, Kang, and Li (2020) through an empirical evaluation using two different sets of
40
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

series that consist of the monthly and yearly data of well- [0, 1] range, the corresponding output values are not necessarily.
known forecasting competitions, namely the M3 (Makri- Especially in tasks that involve forecasting heavily trended time
dakis & Hibon, 2000) and M4 (Makridakis et al., 2020). series, this can be a useful attribute of the training set since
Both accuracy and computational cost metrics are consid- it allows the NNs to learn that future values can be above or
ered to provide a comprehensive assessment, while various below any known value, across all samples. This approach has
established methods, including state-of-the-art ML and DL been successfully applied in order to create training data sets
models, are employed as benchmarks. for similar forecasting problems in the past (Semenoglou et al.,
2021; Smyl, 2020) and was therefore adopted in the proposed
The rest of the paper is organized as follows: Section 2 in-
methodology.
troduces the methodological approach proposed for transforming
Having scaled the original time series data, the next step of
time series data into images and the architecture of the mod-
the pre-processing phase refers to the visualization, i.e. the trans-
els used for generating point forecasts. Section 3 describes the
formation of the 1D numeric vectors into 2D images. This is done
experimental design used to empirically evaluate the overall per-
using simple line plots where the horizontal x-axis represents the
formance of the proposed approach. This includes the data sets
time period each observation corresponds to and the vertical y-
used for training and testing ForCNN, the performance measures
axis the scaled values of the observations. The plots are created
utilized, details on the implementation of ForCNN, and the bench-
and exported using the Matplotlib plotting library for Python.
marks selected for assessing the relative performance of the
Considering that the plotted line is the most important ele-
proposed approach. Section 4 presents and discusses the results.
ment in the created images, with its form and position containing
Finally, Section 5 concludes the paper and provides directions of
the complete information required for generating forecasts, the
future research.
width of the line is accordingly thickened to make the time series
patterns more apparent. Other visual elements usually contained
2. Methodological approach
in plots, such as axes and legend, are not exported in the final
images for the sake of simplicity. A monochromatic, black and
A time series is defined as a set of data points listed in
white color scheme is chosen for creating the images, with the
chronological order, with successive observations taken at reg-
line representing the series being white and the background color
ularly spaced intervals of time. A time series Y of length n can be
being black. Thus, a single 8-bit integer value is used to represent
expressed as:
each pixel instead of the three or more values typically required
Yt = {yt ∈ R | t = 1, 2, . . . , n}, (1) for each pixel in colored pictures. As a result, the redundant
coloring information is eliminated, the input of the forecasting
where yt represents the value of the time series at time t. Univari- model is simplified, and the memory requirements for working
ate time series forecasting aims at estimating the future values of with the images are significantly reduced. Finally, in order for
the series based solely on its past observations. Solving this prob- all inputs to have the same dimensions, all images are resized
lem depends on discovering a function (or forecasting method) to 64 × 64 pixels. Fig. 1 displays the final representations of
that approximates the underlying data generation process of the 16 indicative time series that have been exported following the
series, as follows: aforementioned pre-processing steps.
f : Rw −
→ Rh
(2) 2.2. Model architecture
[yn+1 , . . . , yn+h ] = f (yn−w+1 , . . . , yn ) + [en+1 , . . . , en+h ]
where {yn+1 , . . . , yn+h } are the future h observations of the series, The images created in the pre-processing phase are provided
f is the approximation function used to forecast these observa- as input to ForCNN in order for the DL network to be trained
tions using w past values (y1 , . . . , yn ) as input, and {en+1 , . . . , and then used to produce point forecasts for the time series of
en+h } are the corresponding forecast errors. It becomes apparent interest. The overall network consists of two modules, namely
that accurate forecasting requires methods that minimize forecast an encoder and a regressor, as shown in Fig. 2. Although these
error. modules have distinct roles, they are trained concurrently as a
The methodological approach proposed in this paper for fore- single model.
casting time series through images consists of two phases. In the Note that ForCNN serves as a generic framework for devel-
first phase, the time series data, originally provided as 1D numeric oping DL, image-based time series forecasting models. Thus, al-
vectors, are pre-processed to be properly exported as 2D images. though in the following sections we will focus on a model that
Each image corresponds to a particular window of the in-sample utilizes a self-designed encoder, to be called ForCNN-SD, we will
data of the series and serves as a substitute for the respective nu- later discuss how variants of this model can be constructed using
meric input that would have been typically provided to standard other well-established deep CNNs. The architecture of ForCNN-SD
ML or DL methods for training and forecasting purposes. In the is presented in Fig. 3.
second phase, the images created are used for training the deep
NN, ForCNN, and producing out-of-sample forecasts. 2.2.1. Encoder
The objective of the first module of the NN, i.e. the encoder,
2.1. Time series pre-processing is to transform each image X , provided as input to the net-
work, into a vector W that contains a latent representation of
In order for all images to be extracted to depict time periods X . This will allow the patterns of the series to be effectively
of equal length, the in-sample data of the time series are first filtered and the information that has to be learned by the network
divided into equal-sized windows of w observations each. Then, to be accordingly reduced in size to facilitate training without
the min–max scaling is employed to adjust the window values missing any significant knowledge. Given their successful use
in the target range of [0, 1] and facilitate training across the in numerous image related applications (Anwar, Majid, Qayyum,
numerous, diverse series of the data set (Shanker, Hu, & Hung, Awais, Alnowami, & Khan, 2018; Kamilaris & Prenafeta-Boldú,
1996). This means that, essentially, even different samples from 2018; Rawat & Wang, 2017), this task is performed through layers
the same series are treated as being independent from each other. that apply 2D convolutions to the input. Other advantages of 2D
Although the input values of each sample are always bound to the convolution layers that make them attractive for undertaking this
41
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Fig. 1. Indicative visual representations of time series used by ForCNN as input, with w set to 18 observations.

Fig. 2. Overview of the proposed image-based time series forecasting method, ForCNN, consisting of an encoder and a regressor module.

task include their ability to account for local dependencies around the identity mapping is considered optimal, the shortcut offers a
the pixels and their relatively few parameters that accelerate direct way of achieving it rather than training a large number of
training without sacrificing learning capacity. parameters, used in non-linear transformations, to approximate
Specifically, in order to build the encoder of the network, we the identity function.
consider a deep convolutional architecture which is inspired by Each convolutional layer of the encoder consists of a 2D convo-
that of ResNet (He et al., 2016). Multiple studies and the overall lution of its respective input (Conv2D) and uses 3 × 3 filters and
trend in ML research favor deeper NN architectures when work- zero padding to maintain the original input dimensions, followed
ing with images (Chollet, 2017; Simonyan & Zisserman, 2015; by batch normalization (Batch Norm) and the application of the
Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016) since the ad- activation function, which is the Rectified Linear Unit (ReLU). The
dition of more layers typically allows the model to progressively layers are then organized into blocks (Block), with each block
recognize and learn more specific patterns. At the same time, having 3 convolutional layers and an identity shortcut from its
since deeper models are more difficult to train, they require input to its output. Note that the two information flows, i.e. from
additional components that allow them to handle issues related the main path of the block and the shortcut, are merged before
to vanishing gradients and degradation of training accuracy (He the final application of the ReLU activation function. Finally, the
et al., 2016). To deal with these issues, ResNet utilizes ‘‘shortcut residual convolutional blocks are organized into stacks (Stack).
connections’’ between the layers that enable the input informa- The number of convolutional filters used increases by a factor of 2
tion of a block to be mapped directly to its output. Thus, when as more stacks are being added while the size of the feature maps
42
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Fig. 3. The ForCNN-SD architecture. Top: The model consists of a series of stacks that take images as input and create latent representations of them. Fully-connected
(FC) layers are then used to produce h-step-ahead forecasts based on the representations provided. Middle: Each stack consists of several convolutional blocks
and a final convolution layer (Conv2D) for reducing the size of the exported feature maps. Bottom: Each block consists of three sets of convolution layers of the
following transformations: 2D convolution (Conv2D), batch normalization (Batch Norm), and application of the ReLU activation function. A shortcut connection is
used at the block level.

decreases by the same factor. Note also that instead of employing an indicative example of such architectures, demonstrating that
traditional pooling layers with fixed functions (e.g. max pooling deep MLPs can achieve state-of-the-art performance in forecast-
or average pooling), 2D convolutions are used with 2 × 2 filters ing problems similar to ours. An additional point to consider
and a stride of 2, essentially decreasing the spatial size by the is that LSTM-based architectures are inherently more complex
required factor. After the final stack, the resulting feature maps to deploy effectively and require more time and computational
are concatenated and flattened to form the embedding vector W . resources as they have more parameters that need optimizing.
Thus, in the present study, fully-connected layers were used in
2.2.2. Regressor order to develop a regressor that is easy to implement, efficient
The objective of the second module of the NN, i.e. the regres- in terms of computational time and resources needed for training
sor, is to produce the requested forecasts F given the embedding purposes, and relatively accurate when used in batch time series
vector W that has been created by the encoder. The regressor forecasting tasks (Semenoglou et al., 2021).
is implemented as a simple NN with fully-connected (FC) non-
linear hidden layers (the ReLU function is used for transforming
2.2.3. Generalizing ForCNN
the input of the nodes) and a linear output layer. The forecasts
As noted earlier, ForCNN-SD adopts a specific, self-designed
are produced for all the forecasting horizons considered simulta-
neously, meaning that the nodes of the output layer are equal in architecture. This is done in order for the proposed method to
size with the forecasting horizon examined. be tailored to the particular characteristics of the images used
Note that the regressor can be implemented using recurrent as input at the forecasting task at hand (e.g. single channel in-
layers instead of the fully-connected layers proposed in ForCNN- put of relatively simple images). However, the main framework
SD. Recently, combinations of convolutional and LSTM-based ar- can be adjusted according to the preferences of the user and
chitectures have been successfully introduced for time series the requirements of the forecasting task. For instance, although
forecasting (Livieris, Pintelas, & Pintelas, 2020; Xue, Triguero, gray-scale line plots provide a clear and intuitive representation
Figueredo, & Landa-Silva, 2019) and a similar approach can pos- of the series, requiring also less memory to store and process,
sibly be exploited for use at an image-based model. However, they may be replaced by recurrence or colored figures (Li et al.,
selecting between using fully-connected or recurrent layers for 2020). Similarly, the layers of the regressor can be extended or
the regressor is not a trivial process. Depending on the applica- reduced in size depending on the complexity of the forecasting
tion, NNs that are based on dense layers can be as accurate as task and the length of the forecasting horizon. More importantly,
models based on recurrent layers, if not more. N-BEATS (Ore- the proposed encoder can be replaced by any other NN capable
shkin et al., 2019), which is based on fully-connected layers, is of extracting meaningful features from images.
43
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

In order to demonstrate the flexibility of the proposed method autocorrelation-based seasonality test (Makridakis et al., 2020).
and further explore the potential value of image-based time series The inverse transformation is then applied to re-seasonalize the
forecasting models, we proceed by considering two variants of forecasts of the series that were identified as seasonal. The sliding
the ForCNN-SD model, constructed by replacing the suggested window strategy, as described above, is used on the in-sample
encoder with well-established CNNs in the field of image recog- part of the monthly series to create the train set of the models.
nition. To that end, we consider the ResNet-50 and VGG-19 DL The size of the input windows, w , is set equal to 36 observations
models, to be called ForCNN-ResNet and ForCNN-VGG, respec- (3 seasonal periods). As a result, the final train set consists of
tively. Note that since ResNet-50 and VGG-19 are used just for approximately 67,000 samples, extracted from the in-sample part
extracting the embedding vector W from the images, the top FC of the initial 1428 monthly time series.
layers of their architectures are dropped before being introduced Note that the minimum and maximum values used in the pre-
to the ForCNN framework. Moreover, given that both models processing phase of the method for scaling the data are defined
were originally developed for handling colored images, consisting for each window separately using only the in-sample part of the
of three channels, we simply repeat the constructed gray-scale series, i.e. the 18 and 36 observations of each window in the case
images three times to match their input format. The pre-trained of the yearly and monthly series, respectively. These values are
weights of ResNet-50 and VGG-19 on the ImageNet data set are then used for re-scaling the forecasts and measuring forecasting
used to initialize the encoder modules of ForCNN-ResNet and performance, either for training or testing purposes. Also note
ForCNN-VGG, while random weights to initialize their regressor that, in the testing phase, if a window has less observations
modules. than what is required to produce forecasts, then additional data
points are artificially added at its start using a naive backcasting
3. Experimental setup approach, i.e. the first available data point of the window is used
to fill the missing ones backwards. On the contrary, incomplete
3.1. Data set windows are dropped in the training phase.

In order to evaluate the forecasting performance of the pro- 3.2. ForCNN implementation
posed method, we consider two sets of series: the 23,000 yearly
time series of the M4 competition data set (Makridakis et al., ForCNN-SD is implemented using Python 3.7.6, Tensorflow
2020) and the 1428 monthly series of the M3 competition data v.2.2.0, and the Keras API built on top of Tensorflow. Given
set (Makridakis & Hibon, 2000). Both data sets are comprised the complex nature of the model and the large number hyper-
of multiple series from diverse domains (finance, micro, macro, parameters available, optimizing the architecture of the network
industry, and demographics) that facilitate training and provide and the training process is essential. In this respect, a hyper-
better generalization of our findings (Spiliotis et al., 2020a). More- parameter search is performed to determine the optimal values
over, they are publicly available and well-established as bench- for the most critical hyper-parameters, based on the yearly series
mark sets, enabling the replication of our results (Makridakis, of the M4 competition. The Tree-of-Parzen-Estimators (TPE) al-
Assimakopoulos, & Spiliotis, 2018a) and their direct comparison gorithm is employed for performing the search, as implemented
to those of other studies claiming accuracy improvements using in the HyperOpt library for Python (Bergstra, Komer, Eliasmith,
either standard or state-of-the-art forecasting methods (Makri- Yamins, & Cox, 2015). The optimal values, as determined by
dakis, Spiliotis, & Assimakopoulos, 2018c). the search, are then used for building and training the final
In the case of the yearly series, following the original set-up model. Appendix A contains a summary of the optimized hyper-
of the M4 competition, the forecasting horizon, h, is set equal parameters, the search range for each of them, and their fi-
to 6 years and the number of the output nodes of the regressor nal optimal values. Furthermore, for the two ForCNN variants,
module of the NNs is accordingly defined. Consequently, the last 6 i.e. ForCNN-ResNet and ForCNN-VGG, the self-designed encoder
observations of the series (out-of-sample) are hidden for the final was replaced by the pre-trained ResNet50 and VGG19 networks
evaluation of the models, while the rest of the observations (in- respectively, as implemented in Tensorflow’s Keras API.
sample) are used for hyper-parameter optimization, training, and The optimal hyper-parameters values, as determined by the
validation purposes. Also, a sliding window strategy is applied optimization process on the yearly series of the M4 competi-
on the in-sample part of the series, where possible, in order to tion, are also transferred to the models used for forecasting the
extract multiple windows from each series and maximize the monthly series of the M3 competition. As a result, it is possible
number of available training and validation samples. The size of to investigate whether the proposed DL architecture of ForCNN-SD
the windows used for creating the images, w , is set equal to and its variants (ForCNN-ResNet and ForCNN-VGG) has the ability
3 × h = 18 so that a reasonable number of observations is to generalize well and adapt to data of different features.
available for analyzing the patterns of the data and producing the Finally, to reduce the effect that random initial weights may
corresponding forecasts. As a result, the final train set consists have on the final forecasting performance of the proposed method,
of approximately 235,000 samples, extracted from the in-sample improve its accuracy and robustness, and draw more objective
part of the initial 23,000 yearly time series. conclusions (Barrow, Crone, & Kourentzes, 2010; Kourentzes,
Similarly, in the case of the monthly series, the set-up of the Barrow, & Crone, 2014), we consider an ensemble of 50 different
M3 competition was followed for training and evaluating the ForCNN-SD models and use the median operator to obtain the
examined models. The forecasting horizon, h, is set equal to 18 final forecasts, i.e. train ForCNN-SD 50 times and combine the
months. Thus, the last 18 observations of the series are hidden outputs of the individual models for each forecasting horizon
until the evaluation phase, while the rest of the observations are separately. In this regard, although all models used within the
used for training and validating the models. Note that monthly se- ensemble will have the same architecture and hyper-parameters,
ries often exhibit strong seasonality that may be difficult for NNs each model will display different initial weights and use dif-
to capture, especially when relatively short series are involved ferent training and validation splits from the initial data set.
and no exogenous date-related information is provided (Barker, The same ensembling strategy is utilized for ForCNN-ResNet and
2020; Zhang & Qi, 2005). Therefore, the classical multiplicative ForCNN-VGG.
decomposition is applied for seasonally adjusting the series be- Ensembling has long been considered as a useful practice in
fore further processing them, if needed, as determined by an the forecasting literature (Bates & Granger, 1969) and numerous
44
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

forecast combination schemes, both simple and sophisticated • CNN-1D: This method is an ensemble of 50 simple CNNs
ones, have been proposed to exploit its full potential (for an that use 1D vectors as input, similar to the MLP. Comparing
encyclopedic review on forecast combinations, please refer to the proposed image-based methods to CNN-1D is important,
section 2.6 of Petropoulos et al., 2022). Simply put, since no fore- since it expands the discussion by considering an approach
casting model can consistently outperform all models available, that uses convolutions on traditional numeric representa-
combining the forecasts of multiple models, each making differ- tions of the series. CNN-1D is implemented using Python
ent assumptions about the data and having a different structure, 3.7.6, Tensorflow v.2.2.0, and the Keras API. As is the case
can effectively improve overall forecast accuracy. By ensembling with the MLP, the 1D numeric vectors are constructed by
forecasts the errors of the participating models cancel each other, slicing the available series and then scaling the resulting
while model and parameter uncertainty is reduced (Petropou- windows based on their in-sample part. Also, for the case
los, Hyndman, & Bergmeir, 2018). As a result, the practice of of the 1428 monthly series of the M3 competition, series
combining has become particularly popular among the ML fore- were seasonally adjusted before being sliced and scaled. The
casting community as well, where bagging, boosting, and heavy multi-step forecasts required are produced simultaneously
ensembling are widely used to improve the performance of single from the last dense layer of CNN-1D. The architecture and
models (Bojer & Meldgaard, 2021). These findings motivated the training process of CNN-1D were optimized using the TPE
utilization of ensembling in the present study. algori Appendix A provides a more detailed summary of
CNN-1D’s architecture and training.
3.3. Benchmarks • N-BEATS: This method was proposed by Oreshkin et al.
(2019) and is essentially an ensemble of 180 deep neural
We consider five different benchmarks to demonstrate the
networks of fully-connected layers with backward and for-
relative forecasting performance of the proposed approach, as
ward residual links. Currently, it is considered as the state-
described below.
of-the-art approach in time series forecasting, and has been
• Theta: Theta (Assimakopoulos & Nikolopoulos, 2000) is a successfully used in various applications (Putz, Gumhalter,
well-established and highly accurate statistical method that & Auer, 2021; Stevenson, Rodriguez-Fernandez, Minisci, &
became popular due to its superior performance in the Camacho, 2022), including the second best submission of
M3 competition (Makridakis & Hibon, 2000), being also the the most recent M5 competition (Anderer & Li, 2022). In the
most accurate benchmark in the M4. Therefore, it is of- original paper, two N-BEATS configurations were presented
ten considered a standard of comparison. The method was based on the interpretability of the underlying models, N-
implemented as in the M4 competition.1 BEATS-Generic and N-BEATS-Interpretable. For the purposes
• ES-RNN: This method, proposed by Smyl (2020), was the of this study we employ the N-BEATS-Generic variant (to
winning submission of the M4 competition and the most be simply called N-BEATS) since it provides marginally more
accurate one for the yearly series of the respective data set, accurate forecasts, according to the original paper. Note that
also used in the present study. The method is essentially N-BEATS was configured and trained as proposed by the
a hybrid, mixing traditional exponential smoothing models authors in the original study. The method was replicated
with an LSTM-based architecture. Although methods that using the official release provided.3 .
outperform the winning submission of a competition after • DeepAR: This method was proposed by Salinas et al. (2020)
the out-of-sample data have been released cannot claim any and is an auto-regressive recurrent neural network model
victory, such comparisons are useful for understanding their based on LSTM cells. It was initially developed as an archi-
potential value over existing state-of-the-art methods. Note tecture for probabilistic forecasting but it is also capable of
that, for the purposes of the present study, ES-RNN was not providing very accurate point forecasts. DeepAR has been
replicated. Instead, the original forecasts submitted to the employed in several studies and applications (Mashlakov,
M4 competition were used.2 Kuronen, Lensu, Kaarna, & Honkapuro, 2021; Park, Park, &
• MLP: This method is an ensemble of 50, simple feed-forward Hwang, 2020), including the third best submission of the
NNs with fully-connected layers and serves as a standard most recent M5 competition (Jeon & Seong, 2021), and is
‘‘cross-learning’’ ML benchmark (Semenoglou et al., 2021). now considered a well-established standard of comparison.
Similar to the proposed method, MLP is implemented using In order for the comparisons performed with the rest of
Python 3.7.6, Tensorflow v.2.2.0, and the Keras API. Follow- the methods to be fair, DeepAR’s hyper-parameters were
ing the steps outlined in Section 2, the numeric vectors, optimized using TPE algorithm in a fashion similar to that
before being exported as images, are used as input to the of the proposed imaged-based models (for more details
MLP. This means that, for the case of the 1428 monthly series on the selected values, please see Appendix A). Further-
of the M3 competition, data were seasonally adjusted and more, 50 DeepAR models were trained and their forecasts
then scaled. Accordingly, samples were simply scaled for the were combined using the median operator. For the purposes
case of the 23,000 yearly series of the M4 competition. As for of this study, the DeepAR models were implemented in
the output, the required multi-step forecasts are produced Python, using the GluonTS modeling package, developed by
simultaneously, exactly as done by the regressor module of Amazon (Alexandrov et al., 2019).
the ForCNN method. Although MLP is relatively simple in its
nature, its architecture and training hyper-parameters were 3.4. Forecasting accuracy measures
optimized using the TPE algorithm in order for the com-
parisons performed with the optimized ForCNN methods to
Enabling direct comparisons of our results with those of other
be fair (for more details on the selected values please see
relevant studies is of major importance. Hence, we choose to
Appendix A).
evaluate the forecasting accuracy of the presented models using
the official measures of the M3 and M4 forecasting competitions.
1 https://github.com/Mcompetitions/M4-methods/blob/master/Benchmarks%
Specifically, the official measure used in the M3 competition was
20and%20Evaluation.R.
2 Forecasts retrieved from the GitHub repository of the M4 competition (htt
ps://github.com/Mcompetitions/M4-methods/tree/master/Point%20Forecasts). 3 github.com/ElementAI/N-BEATS

45
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

the symmetric mean absolute percentage error (sMAPE; Makri- Table 1


dakis, 1993), while in M4 the overall weighted average (OWA) of Overall forecasting accuracy and computational cost of the proposed method
and the considered benchmarks across the 23,000 yearly series of the M4
the sMAPE and the mean absolute scaled error (MASE; Hyndman competition. Three variants of the proposed method are examined.
& Koehler, 2006). The sMAPE, MASE, and OWA measures are
Model sMAPE MASE OWA CTTraining CTInference Memory
defined as: usage
n+h
2 ∑ |yt − ft | Benchmarks
sMAPE = × 100%, (3) Theta 14.593 3.382 0.872 – 2.3 –
h |yt | + |ft |
t =n+1 ES-RNN 13.176 2.980 0.778 – – –
1
∑n+h MLP 13.089 2.954 0.772 351 1 3.0
t =n+1 |yt − ft |
MASE = h
1
∑n , (4) CNN-1D 13.100 2.958 0.773 214 1 2.8
|tt − tt −1 | N-BEATS 13.114 2.936 0.771 4766 12 3.1
n−1 t =2
DeepAR 13.562 3.146 0.811 3317 1397 0.6
sMAPEMethod MASEMethod
sMAPENaiv e2
+ MASENaiv e2 Examined models
OWAMethod = , (5)
2 ForCNN-SD 13.024 2.935 0.768 3349 14 5.1
ForCNN-VGG 13.119 2.956 0.773 3715 11 5.8
where ft is the forecast of the examined method at point t, yt the ForCNN-ResNet 13.342 3.010 0.787 11038 18 6.0
actual value of the series at the same point, and n the number of
available historical observations. sMAPEMethod and MASEMethod refer
to the overall forecasting accuracy of the examined method, as
measured across the complete set of predicted series by sMAPE The first, major finding of Table 1 is that using time se-
and MASE, respectively, with Naive2 being the benchmark used ries images for producing forecasts instead of conventional, nu-
for scaling these scores and calculating OWA, as originally done meric vectors, leads to notable accuracy improvements regardless
in the M4 competition (Makridakis et al., 2020). the measure used for assessing forecasting performance. Specifi-
cally, ForCNN-SD provides 0.50%, 0.64%, and 0.52% more accurate
3.5. Computational cost measures forecasts than the MLP benchmark in terms of sMAPE, MASE,
and OWA accordingly. Moreover, in comparison to CNN-1D that
Apart from forecasting accuracy, another critical aspect for also uses convolutions on traditional time series representations,
assessing the utility of sophisticated DL models is their com- ForCNN-SD is 0.58%, 0.78%, and 0.65% more accurate, as measured
putational cost. Depending on the application and the time or by sMAPE, MASE, and OWA. When compared to N-BEATS, the top
resources available, minor accuracy improvements may not al- performing DL benchmark considered, the forecasts provided by
ways justify the increased cost. Thus, in order to provide a more ForCNN-SD are more accurate by 0.69%, 0.03%, and 0.38% accord-
comprehensive comparison of the proposed forecasting methods, ing to sMAPE, MASE, and OWA, respectively. When compared
we also report (in minutes) the time required for fitting the meth- against ES-RNN, the winning submission of the M4 competi-
ods (CTTraining ) and for extracting the forecasts given the fitted tion, the accuracy improvements are even more significant, with
methods (CTInference ), as well as the maximum memory usage in ForCNN-SD being 1.15%, 1.51%, and 1.29% more accurate according
GB (Memory Usage). The measures are also reported for all the to sMAPE, MASE, and OWA, respectively. Furthermore, ForCNN-
benchmarks considered. SD provides 4.13%, 7.19%, and 5.60% more accurate forecasts
Note that for the case of the Theta method, CTInference is reported than DeepAR in terms of sMAPE, MASE, and OWA. Finally, when
as the total time required for generating the forecasts since, when compared to Theta, a strong statistical benchmark, ForCNN-SD
working with statistical approaches, it is difficult to distinguish displays even better results, as it is 10.75%, 13.22%, and 11.93%
between training and inference time. Also, for the case of the more accurate according to sMAPE, MASE, and OWA. At this
ES-RNN method, it was not possible to report computational point, it is important to note that DeepAR, ES-RNN, MLP, CNN-1D
demand, since the method was not replicated by us, as described and N-BEATS are highly competitive benchmarks and significantly
in Section 3.3. When multiple models are trained as components more accurate than any standard time series forecasting method.
of an ensemble, CTTraining refers to the time required for training Therefore, any reported improvement over these benchmarks,
all participating models sequentially. Similarly, CTInference refers to even if small in absolute values, is of utmost importance. Thus,
the time needed for calculating and combining the forecasts of all it is evident that the proposed method is a promising alternative
the models involved in the ensemble. for batch time series forecasting.
All forecasting methods were trained and evaluated using an A second point of discussion concerns the performance of
Ubuntu 18.04.6 LTS server with the following characteristics: 2 ForCNN-VGG and ForCNN-ResNet. ForCNN-VGG is less accurate
Intel Xeon Gold 5118 CPUs @ 2.30 GHz, x64 based processor, 24 than ForCNN-SD by 0.73%, 0.72%, and 0.65% as measured by
cores, 48 logical processors, 64.00 GB RAM, one Quadro RTX 5000 sMAPE, MASE, and OWA. By the same measures, ForCNN-ResNet is
GPU with 16 GB RAM. also less accurate than ForCNN-SD by 2.44%, 2.56%, and 2.47%, re-
spectively. Despite the fact that neither ForCNN-VGG nor ForCNN-
4. Results and discussion ResNet outperform our proposed architecture, their accuracy is
still noteworthy. ForCNN-VGG provides more accurate forecasts
The overall forecasting accuracy and computational cost of the than both ES-RNN, DeepAR and Theta, while ForCNN-ResNet out-
proposed methods along with that of the considered benchmarks performs DeepAR and Theta. An explanation for the relatively
is summarized in Table 1 for the 23,000 yearly series of the worse performance of the two variants of ForCNN may lie in
M4 competition and in Table 2 for the 1428 monthly series the fact that VGG and ResNet were originally designed to handle
of the M3 competition. The second, third, and fourth columns larger, colored images and for performing a different computer
correspond to the scores reported according to the sMAPE, MASE, vision task (image recognition vs. pattern analysis). As a result,
and OWA measures, respectively. The fifth, sixth, and seventh using such pre-trained NNs would probably require further ad-
columns correspond to the training time, inference time, and justments to their architectures and the overall framework of the
maximum memory usage, respectively. Note that, as described methodology to achieve more accurate forecasts. On the positive
in Section 2.2.3, three variants of the proposed method are ex- side, the results indicate that relatively simple, self-designed
amined, each one utilizing a different encoder for analyzing the encoders can be effectively used to analyze time series images
images provided as input. and improve forecasting performance.
46
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Table 2 In comparison with MLP, ForCNN-SD generates 2.84%, 3.48%, and


Overall forecasting accuracy and computational cost of the proposed method 3.16% more accurate forecasts, as measured by sMAPE, MASE, and
and the considered benchmarks across all 1428 monthly series of the M3
competition. Three variants of the proposed method are examined.
OWA. Furthermore, ForCNN-SD outperforms the simpler convolu-
Model sMAPE MASE OWA CTTraining CTInference Memory
tional benchmark, CNN-1D, by 2.89%, 3.81%, and 3.26% in terms of
usage sMAPE, MASE, and OWA respectively. Finally, when compared to
Benchmarks Theta, the winning submission of the M3 competition, ForCNN-SD
exhibits even better results, as it is 3.96%, 3.63%, and 3.80% more
Theta 13.888 0.863 0.830 – 0.2 –
MLP 13.738 0.862 0.825 71 0.2 4.0 accurate according to sMAPE, MASE, and OWA.
CNN-1D 13.757 0.866 0.827 106 0.2 4.0 Another interesting finding from the evaluation on the 1428
N-BEATS 13.363 0.813 0.790 3032 9.0 2.5 monthly series is the increased accuracy of ForCNN-VGG com-
DeepAR 13.572 0.868 0.823 9686 136 0.7 pared to those of ForCNN-SD, as opposed to the case of the yearly
Examined models series, where both ForCNN-VGG and ForCNN-ResNet variants were
ForCNN-SD 13.359 0.833 0.800 3747 3.1 8.3 found to be less accurate than the self-designed architecture.
ForCNN-VGG 13.228 0.833 0.796 4087 3.0 8.3 Specifically, ForCNN-VGG is 0.99%, 0.00%, and 0.49% more accurate
ForCNN-ResNet 13.669 0.847 0.816 4264 3.9 8.6
than ForCNN-SD according to MAPE, MASE, and OWA. When
compared to N-BEATS, ForCNN-VGG is 1.02% more accurate in
terms of sMAPE but 2.40% and 0.70% less accurate in terms of
The reported forecasting performances are the result of fol- MASE and OWA. On the other hand, ForCNN-ResNet remains less
lowing the steps outlined in Sections 2 and 3 and are indicative of accurate than ForCNN-SD by 2.32%, 1.68%, and 2.00%, as measured
a forecasting pipeline that includes both a hyper-parameter opti- by sMAPE, MASE, and OWA, but outperforms the MLP, CNN-1D
mization process for determining the ‘‘optimal’’ architecture and and Theta benchmarks.
an ensemble of 50 randomly initialized models for eliminating the To identify cases where ForCNN-SD may result in significantly
uncertainty from randomly initializing the NNs. Despite that, it more accurate forecasts compared to N-BEATS, the most accurate
is interesting to explore the sensitivity of the method’s accuracy benchmark considered, and vice-versa, Fig. 4 presents a set of in-
to different network hyper-parameter values and initializations dicative examples of time series drawn from the M3 and M4 data
regardless, especially for the case of the proposed ForCNN-SD. sets. As seen, both N-BEATS and ForCNN-SD are highly capable of
First, in order to assess the effect the different hyper-parameter generating accurate forecasts, although in some cases they may
values on the forecasting accuracy of ForCNN-SD, 10 additional either over- or underestimate trends.
ForCNN-SD’s variants were trained and evaluated based on the In order to further investigate the differences reported be-
final configurations tested by the TPE algorithm. These configura- tween the proposed imaged-based DL methods and the bench-
tions include encoders and regressors of different sizes, encoders marks, we employ the multiple comparisons with the best (MCB)
with and without shortcut connections, and trainers with varying test (Koning, Franses, Hibon, & Stekler, 2005). The test computes
learning rates, patience values, and batch sizes. To make the com- the average ranks of the forecasting methods according to sMAPE
parisons fair, ensembles of 50 models were used when evaluating or MASE across the complete data set of the study and concludes
each ForCNN-SD configuration. The average forecasting accuracy whether or not these are statistically different. Fig. 5 presents the
of the different ForCNN-SD configurations was 13.043%, according results of the analysis. If the intervals of two methods do not
to sMAPE, with a standard deviation of 0.040%. Based on the overlap, this indicates a statistically different performance. Thus,
out-of-sample evaluation, the best configuration has an accuracy methods that do not overlap with the gray interval of the figures
of 13.000%, while the worse configuration has an accuracy of are considered significantly worse than the best, and vice versa.
13.138%. It is interesting to note that the configuration suggested We find that, according to MCB, ForCNN-SD is significantly
by the optimization algorithm A is not the best performing, with more accurate than the rest of the methods considered when
4 out of the 10 configurations tested providing more accurate used for forecasting the yearly series of the M4 competition.
results. In addition to that, we examine the sensitivity of ForCNN- Moreover, we observe that ForCNN-VGG is similarly accurate to
SD to random initializations. For that purpose, we train and two state-of-the-art forecasting methods (N-BEATS and ES-RNN)
evaluate 10 additional ensembles, consisting of 50 ForCNN-SD and that ForCNN-ResNet manages to provide better forecasts than
networks each, with different initial random seeds. All models DeepAR and the statistical benchmark, Theta. The results are
share the same hyper-parameters, as determined by the original slightly different for the case of the M3 monthly series where
optimization process, making them comparable to the ForCNN-SD ForCNN-SD, ForCNN-VGG, and N-BEATS are identified as similarly
method reported in Table 1. The average accuracy of the forecasts accurate, with the latter model displaying the lowest average
across all tested ensembles was 13.025% in terms of sMAPE, with rank.
a standard deviation of 0.010%. The best performing ForCNN-SD In addition to MCB, we apply the model confidence test
ensemble has an accuracy of 13.010%, while the worse configura- (Hansen, Lunde, & Nason, 2011), as implemented in the MCS
tion has an accuracy of 13.043%. Overall, these findings confirm package for R (Catania & Bernardi, 2017), that is better suited
that the reported ForCNN-SD results are indicative and relatively to understand whether it is possible to identify a superior set of
insensitive to hyper-parameter selection or initialization. methods. This bootstrap-based approach stops when one or more
The results of Table 2 indicate that, similar to the yearly methods are identified as ‘‘superior’’ to all others given a certain
data, ForCNN-SD can generate highly accurate forecasts when level of confidence. The results of the test are summarized in
used for predicting monthly time series. However, in this case, Table 3. As seen, the findings of the model confidence test are
the benefits of the proposed image-based approach are more in general agreement with those of the MCB test, identifying
limited. When compared to N-BEATS, the most accurate DL bench- ForCNN-SD, MLP, CNN-1D, N-BEATS, and ForCNN-VGG as superior
mark, ForCNN-SD’s forecasts are 0.03% more accurate according methods in the M4 data set, while ForCNN-VGG, N-BEATS, and
to sMAPE, but 2.40% and 1.19% less accurate according to MASE ForCNN-SD as superior models in the M3 data set. Based on the
and OWA respectively. Yet, ForCNN-SD is still more accurate than above, we conclude that the results of Tables 1 and 2 are aligned
the rest of the benchmarks considered. Specifically, it provides with those of the statistical tests employed and that, although
1.59%, 4.20%, and 2.90% more accurate forecasts than the DeepAR the benchmarks considered were competitive, including state-of-
benchmark in terms of sMAPE, MASE, and OWA, respectively. the-art implementations of ML and DL models, image-based time
47
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Fig. 4. Indicative plots of time series and the corresponding predictions provided by ForCNN-SD and N-BEATS. The green line represents the actual historical and
future observations. The orange line represents the forecasts generated by ForCNN-SD, while the blue line represents the forecasts generated by N-BEATS. The top
two rows show indicative cases where ForCNN-SD significantly outperforms N-BEATS. The bottom two rows show cases where N-BEATS significantly outperforms
ForCNN-SD. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3 Overall, ForCNN-SD is faster than N-BEATS and DeepAR by approx-


Superior forecasting methods based on the model confidence test. imately 24 and 23 h, respectively, for the case of the 23,000 yearly
Measure Superior set of methods series. When the 1428 monthly series are considered, ForCNN-
M4 Yearly ForCNN-SD; MLP ; CNN-1D ; N-BEATS; ForCNN-VGG SD is 9 h slower than N-BEATS but 105 h faster than DeepAR.
M3 Monthly ForCNN-VGG; N-BEATS; ForCNN-SD Between the three ForCNN variants, ForCNN-SD has the lowest
training time but ForCNN-VGG has the lowest inference time.
Observe that the differences in training time are much greater
(ForCNN-SD is faster than ForCNN-VGG by 6 to 9 h) than those in
series forecasting managed to provide similarly accurate, if not
inference time (ForCNN-VGG is faster than ForCNN-SD by 4 min)
better results.
and, as a result, ForCNN-SD is faster over the entire forecasting
Finally, it is important to comment of the computational re- process. In contrast, ForCNN-ResNet requires significantly more
quirements of the examined methods, as reported in Tables 1 and time compared to the rest of the methods. Note, however, that
2. As expected, relatively more sophisticated DL methods require computational time can be improved by parallelizing the training
more time for training and generating forecasts compared to sim- process. Furthermore, in terms of the memory usage, the three
pler ML and statistical methods, such as MLP, CNN-1D and Theta. ForCNN variants have similar memory requirements and, overall,
48
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Fig. 5. Average ranks and 95% confidence intervals of the three ForCNN variants and benchmarks considered over the 23,000 M4 yearly (left) and 1428 M3 monthly
(right) series. The multiple comparisons with the best test, as proposed by Koning et al. (2005), is applied using sMAPE for ranking the methods.

Table A.4 Table A.5


The architecture and the training hyper-parameters of the ForCNN-SD model The architecture and the training hyper-parameters of the MLP used as
proposed in this study, along with the considered search space for each benchmark in this study, along with the considered search space for each
hyper-parameter and the optimal values as determined by the optimization hyper-parameter and the optimal values as determined by the optimization
algorithm. algorithm.
Hyper-parameter Search space Optimal value Hyper-parameter Search space Optimal value
Number of stacks [2, 3, 4, 5] 5 Number of hidden layers [1, 2, 3] 3
Number of blocks [2, 3, 4, 5] 3 Size of hidden layers [0.5, 1.0, 1.5, 2.0] ×IS 1.5 × IS
Number of top FC layers [1, 2, 3] 2 Optimizer [Adam] Adam
Shortcut connections [True, False] True Learning rate [0.0001, 0.0005, 0.001
Optimizer [Adam] Adam 0.001, 0.005, 0.01]
Learning rate [0.0001, 0.0005, 0.001 Max training epochs [1000] 1000
0.001, 0.005, 0.01] Early stopping patience [2, 5, 10] 10
Max training epochs [1000] 1000 Batch size [64, 128, 256, 512] 64
Early stopping patience [2, 5, 10] 10 Loss function [MAE, MSE, sMAPE] MAE
Batch size [64, 128, 256, 512] 64
Loss function [MAE, MSE, sMAPE] MAE
Table A.6
The architecture and the training hyper-parameters of the CNN-1D used as
benchmark in this study, along with the considered search space for each
they demand more memory compared to traditional statistical, hyper-parameter and the optimal values as determined by the optimization
algorithm.
ML, and DL methods. These requirements can be attributed to the
Hyper-parameter Search space Optimal value
fact that images, being large 2D matrices, take up more space than
the corresponding 1D numeric vectors. Number of convolutional layers [1, 2, 3, 5] 3
Number of convolutional filters [8, 16, 32, 64, 128] 16
Kernel size [1, 2, 3] 3
5. Conclusions Dilation rate [1, 2, 3] 2
Optimizer [Adam] Adam
This study investigated whether the use of visual time series Learning rate [0.0001, 0.0005, 0.005
representations and architectures inspired by the area of com- 0.001, 0.005, 0.01]
Max training epochs [1000] 1000
puter vision can lead to superior forecasting accuracy when com- Early stopping patience [2, 5, 10] 10
pared to existing state-of-the-art time series forecasting methods. Batch size [64, 128, 256, 512] 128
The proposed DL method, called ForCNN, which mixes convolu- Loss function [MAE, MSE] MAE
tional and dense layers in a single neural network, constitutes
an end-to-end solution for univariate time series forecasting that
leverages the recent advances reported for the case of deep CNN
when used for analyzing images and recognizing patterns. able to generate highly accurate predictions when tasked with
Our results suggest that image-based time series forecast- forecasting monthly series, outperforming most benchmarks and
ing can produce highly accurate forecasts and outperform both being on par with the top state-of-the-art DL method. More
standard and advanced forecasting methods of either statisti- importantly, the results indicate that DL models may be able to
cal or ML nature. When yearly series, typically dominated by handle more effectively visual time series representations than
trend patterns, are considered, the proposed ForCNN-SD outper- conventional, numeric ones to generate forecasts.
forms all other benchmarks in terms of forecasting accuracy Pre-trained, well-established deep CNNs, like ResNet-50 and
and computational time. Moreover, image-based approaches are VGG-19, were also considered for analyzing the images provided
49
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Fig. B.6. Distribution of sMAPE forecasting errors of the 50 networks participating in each of the ensembles. ‘‘X’’ marks the forecasting accuracy of the respective
ensemble. The top figure corresponds to the results based on the 23,000 yearly series of the M4 competition. The bottom figure corresponds to the results based on
the 1428 monthly series of the M3 competition.

Table A.7 that using VGG-19 as a replacement for the encoder is indeed a
The architecture and the training hyper-parameters of the DeepAR used as viable option.
benchmark in this study, along with the considered search space for each
hyper-parameter and the optimal values as determined by the optimization
Our findings greatly motivate future research in the area of
algorithm. image-based time series forecasting, that could focus on, but not
Hyper-parameter Search space Optimal value limited to, the following aspects:
Cell type [LSTM, GRU] LSTM
Number of recurrent [1, 2, 3, 5 10] 3
• The present study considered two large, diverse sets of
layers 23,000 yearly and 1428 monthly time series respectively,
Number of cells [16, 32, 64, 128, 256, 512] 512 that originated from six particular domains (micro, macro,
Dropout rate [7%, 12%, 20%] 7% industry, finance, demographic and other). Although monthly
Training epochs [64, 128, 256, 512, 1024, 128 series differ significantly from yearly ones, both data sets
2048]
Number of batches per [32, 64, 128, 256, 512, 128
consist of continuous, relatively short series of low fre-
epoch 1024] quency. It would be interesting to examine how the pro-
Batch size [32, 64, 128, 256, 512] 128 posed method performs for the case of high-frequency data
Learning rate [0.0001, 0.0005, 0.001, 0.005 (e.g. daily and hourly series) or intermittent and lumpy
0.005, 0.01]
data, as well as, for cases of special forecasting applications
Patience [8, 16, 32, 64] 8
(e.g. stock market, retail, and energy forecasting).
• Recurrence and colored plots could be used instead of sim-
ple, gray-scale line figures to extract more information from
the series and provide more accurate forecasts.
as input to the proposed method and eliminate the need for
• Additional work should be done to understand how ex-
designing and optimizing new convolutional architecture from isting, advanced DL models used in computer vision can
scratch. Between the two, the accuracy and computational cost be exploited for identifying time series patterns, extracting
of ForCNN-VGG was comparable to that of the self-designed en- knowledge from large data sets, and using such knowledge
coder, while ForCNN-ResNet’s performance was less impressive, to improve forecasting accuracy, while reducing computa-
irrespective of the evaluation data set. Overall, our results suggest tional cost.
50
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Declaration of competing interest References

Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J.,
The authors declare that they have no known competing finan- Januschowski, T., et al. (2019). GluonTS: Probabilistic time series models in
cial interests or personal relationships that could have appeared python. arXiv preprint arXiv:1906.05264.
to influence the work reported in this paper. Anderer, M., & Li, F. (2022). Hierarchical forecasting with a top-down alignment
of independent-level forecasts. International Journal of Forecasting, http://dx.
doi.org/10.1016/j.ijforecast.2021.12.015.
Data availability Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M.
K. (2018). Medical image analysis using convolutional neural networks: a
review. Journal of Medical Systems, 42(11), 226. http://dx.doi.org/10.1007/
Data will be made available on request. s10916-018-1088-1.
Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: A decomposition
approach to forecasting. International Journal of Forecasting, 16, 521–530.
Appendix A. ForCNN-SD, MLP, CNN-1D & DeepAR hyper-para- http://dx.doi.org/10.1016/S0169-2070(00)00066-2.
Barker, J. (2020). Machine learning in M4: What makes a good unstructured
meters
model? International Journal of Forecasting, 36(1), 150–155. http://dx.doi.org/
10.1016/j.ijforecast.2019.06.001.
This appendix provides details on the hyper-parameter opti- Barrow, D. K., Crone, S. F., & Kourentzes, N. (2010). An evaluation of neural
network ensembles and model selection for time series prediction. In The
mization of the proposed ForCNN-SD model, as well as, of the MLP, 2010 international joint conference on neural networks (pp. 1–8). http://dx.
CNN-1D and DeepAR benchmarks. The Tree-of-Parzen-Estimators doi.org/10.1109/IJCNN.2010.5596686.
(TPE) algorithm is used on a validation set to determine the op- Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Journal of
timal set of values after 100 iterations for all models considered. the Operational Research Society, 20(4), 451–468.
Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt:
Note that, in all cases, the optimal hyper-parameter values are a python library for model selection and hyperparameter optimization.
determined based on the yearly series of the M4 competition and Computational Science & Discovery, 8(1), Article 014008. http://dx.doi.org/10.
they are also used for forecasting the monthly series of the M3 1088/1749-4699/8/1/014008.
Bojer, C. S., & Meldgaard, J. P. (2021). Kaggle forecasting competitions: An
competition. Table A.4 contains the hyper-parameters that are be-
overlooked learning opportunity. International Journal of Forecasting, 37(2),
ing optimized, along with the search space and their final, optimal 587–603.
values for ForCNN-SD. Tables A.5–A.7 contain similar information Catania, L., & Bernardi, M. (2017). MCS: Model confidence set procedure. R
for the MLP, CNN-1D and DeepAR respectively. package version 0.1.3.
Chae, Y. T., Horesh, R., Hwang, Y., & Lee, Y. M. (2016). Artificial neural network
model for forecasting sub-hourly electricity usage in commercial buildings.
Energy and Buildings, 111, 184–194. http://dx.doi.org/10.1016/j.enbuild.2015.
Appendix B. Forecasting errors distribution
11.045.
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolu-
In this appendix we provide more details on the distribution of tions. In 2017 IEEE conference on computer vision and pattern recognition (pp.
1800–1807). http://dx.doi.org/10.1109/CVPR.2017.195.
sMAPE errors of the individual networks that contribute forecasts
Cohen, N., Balch, T., & Veloso, M. (2019). The effect of visual design in image
to the ensemble. Note that the final forecasts of each ForCNN classification. arXiv preprint arXiv:1907.09567.
variant are calculated by combining the output of 50 networks Cohen, N., Sood, S., Zeng, Z., Balch, T., & Veloso, M. (2020). Visual forecasting
of that architecture, as described in Section 3.2. Fig. B.6 shows of time series with image-to-image regression. arXiv preprint arXiv:2011.
09052.
the distributions of errors of the individual networks of the three Du, B., & Barucca, P. (2020). Image processing tools for financial time series
ForCNN variants, for the cases of the yearly and the monthly data classification. arXiv preprint arXiv:2008.06042.
respectively. The forecasting accuracy of each variant’s ensemble Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video
recognition. In 2019 IEEE/CVF international conference on computer vision (pp.
is also shown, as the corresponding ‘‘X’’ sign.
6201–6210). http://dx.doi.org/10.1109/ICCV.2019.00630.
The first major conclusion from the analysis relates to the Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory
benefit of employing ensembles of networks, as opposed to us- networks for financial market predictions. European Journal of Operational
ing a single network. Even though this approach requires more Research, 270(2), 654–669. http://dx.doi.org/10.1016/j.ejor.2017.11.054.
Frolov, S., Hinz, T., Raue, F., Hees, J., & Dengel, A. (2021). Adversarial text-to-
computational time, the forecasting accuracy improvements are
image synthesis: A review. Neural Networks, 144, 187–209. http://dx.doi.org/
significant across all architectures and sets of series. Our results 10.1016/j.neunet.2021.07.019.
are aligned with existing literature in forecasting that highlights Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-
the importance of combining predictions from different models in Gonzalez, P., & Garcia-Rodriguez, J. (2018). A survey on deep learning
techniques for image and video semantic segmentation. Applied Soft
order to make the final forecasts more robust. That is also the case Computing, 70, 41–65. http://dx.doi.org/10.1016/j.asoc.2018.05.018.
for N-BEATS (Oreshkin et al., 2019) that is essentially an ensemble Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set.
of 180 individual networks. Econometrica, 79(2), 453–497.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
When considering the M3 monthly series, the forecasting er-
recognition. In 2016 IEEE conference on computer vision and pattern recognition
rors of the networks are similarly distributed across the three (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.
ForCNN variants, with the respective ensembles being far more Hewamalage, H., Bergmeir, C., & Bandara, K. (2021). Recurrent neural networks
accurate in the comparison. In the case of the M4 yearly data, for time series forecasting: Current status and future directions. International
Journal of Forecasting, 37(1), 388–427. http://dx.doi.org/10.1016/j.ijforecast.
each variant’s networks exhibit different patterns. According to 2020.06.008.
the distribution of errors, the individual ForCNN-SD networks are Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely
more robust and their accuracy is closer to that of the final connected convolutional networks. In 2017 IEEE conference on computer
ensemble. On the other hand, ForCNN-VGG networks exhibit a vision and pattern recognition (pp. 2261–2269). http://dx.doi.org/10.1109/
CVPR.2017.243.
wide range of forecasting errors, with the majority of the models Huber, J., & Stuckenschmidt, H. (2020). Daily retail demand forecasting using
being considerably less accurate than the respective ensemble. Fi- machine learning with emphasis on calendric special days. International
nally, the large majority of ForCNN-ResNet networks have similar Journal of Forecasting, 36(4), 1420–1438. http://dx.doi.org/10.1016/j.ijforecast.
2020.02.005.
performance, being significantly less accurate than the ensemble.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast
However, 2 out of the 50 networks generate forecasts that are at accuracy. International Journal of Forecasting, 22(4), 679–688. http://dx.doi.
least as accurate as the forecasts of the final ensemble. org/10.1016/j.ijforecast.2006.03.001.

51
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Jeon, Y., & Seong, S. (2021). Robust recurrent network model for intermittent Park, S., Park, S., & Hwang, E. (2020). Normalized residue analysis for deep
time-series forecasting. International Journal of Forecasting, http://dx.doi.org/ learning based probabilistic forecasting of photovoltaic generations. In 2020
10.1016/j.ijforecast.2021.07.004. IEEE international conference on big data and smart computing (BigComp) (pp.
Kamilaris, A., & Prenafeta-Boldú, F. X. (2018). A review of the use of convo- 483–486). http://dx.doi.org/10.1109/BigComp48618.2020.00-20.
lutional neural networks in agriculture. The Journal of Agricultural Science, Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Barrow, D. K., Ben
156(3), 312–322. http://dx.doi.org/10.1017/S0021859618000436. Taieb, S., et al. (2022). Forecasting: theory and practice. International Journal
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for of Forecasting, 38(3), 705–871.
generative adversarial networks. In 2019 IEEE/CVF conference on computer Petropoulos, F., Hyndman, R. J., & Bergmeir, C. (2018). Exploring the sources of
vision and pattern recognition (pp. 4396–4405). http://dx.doi.org/10.1109/ uncertainty: Why does bagging for time series forecasting work? European
CVPR.2019.00453. Journal of Operational Research, 268(2), 545–554.
Ker, J., Wang, L., Rao, J., & Lim, T. (2018). Deep learning applications in medical Putz, D., Gumhalter, M., & Auer, H. (2021). A novel approach to multi-horizon
image analysis. IEEE Access, 6, 9375–9389. http://dx.doi.org/10.1109/ACCESS. wind power forecasting based on deep neural architecture. Renewable Energy,
2017.2788044. 178, 494–505. http://dx.doi.org/10.1016/j.renene.2021.06.099.
Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. (2005). The M3 competition: Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for im-
Statistical tests of the results. International Journal of Forecasting, 21(3), age classification: A comprehensive review. Neural Computation, 29(9),
397–409. http://dx.doi.org/10.1016/j.ijforecast.2004.10.003. 2352–2449. http://dx.doi.org/10.1162/neco_a_00990.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015).
Kourentzes, N., Barrow, D. K., & Crone, S. F. (2014). Neural network ensemble
ImageNet large scale visual recognition challenge. International Journal of
operators for time series forecasting. Expert Systems with Applications, 41(9),
Computer Vision (IJCV), 115(3), 211–252. http://dx.doi.org/10.1007/s11263-
4235–4244. http://dx.doi.org/10.1016/j.eswa.2013.12.011.
015-0816-y.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with
Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). Deepar: Prob-
deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou,
abilistic forecasting with autoregressive recurrent networks. International
& K. Q. Weinberger (Eds.), Vol. 25, Advances in neural information processing
Journal of Forecasting, 36(3), 1181–1191. http://dx.doi.org/10.1016/j.ijforecast.
systems (pp. 1097–1105). Curran Associates, Inc..
2019.07.001.
Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2018). Modeling long- and short-term
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mo-
temporal patterns with deep neural networks. In The 41st international ACM
bileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the
SIGIR conference on research and development in information retrieval SIGIR
IEEE conference on computer vision and pattern recognition.
’18 (pp. 95–104). Association for Computing Machinery, http://dx.doi.org/10.
Semenoglou, A.-A., Spiliotis, E., Makridakis, S., & Assimakopoulos, V. (2021).
1145/3209978.3210006.
Investigating the accuracy of cross-learning time series forecasting methods.
Li, X., Kang, Y., & Li, F. (2020). Forecasting with time series imaging. Expert International Journal of Forecasting, 37(3), 1072–1084. http://dx.doi.org/10.
Systems with Applications, 160, Article 113680. http://dx.doi.org/10.1016/j. 1016/j.ijforecast.2020.11.009.
eswa.2020.113680. Sen, R., Yu, H.-F., & Dhillon, I. S. (2019). Think globally, act locally: A deep neural
Lim, B., Arık, S. O., Loeff, N., & Pfister, T. (2021). Temporal fusion transformers network approach to high-dimensional time series forecasting. In H. Wallach,
for interpretable multi-horizon time series forecasting. International Journal H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Vol.
of Forecasting, 37(4), 1748–1764. http://dx.doi.org/10.1016/j.ijforecast.2021. 32, Advances in neural information processing systems. Curran Associates, Inc..
03.012. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y.
Livieris, I. E., Pintelas, E., & Pintelas, P. (2020). A CNN–LSTM model for gold (2013). OverFeat: Integrated recognition, localization and detection using
price time-series forecasting. Neural Computing and Applications, 32(23), convolutional networks. arXiv preprint arXiv:1312.6229.
17351–17360. http://dx.doi.org/10.1007/s00521-020-04867-x. Shanker, M., Hu, M., & Hung, M. (1996). Effect of data standardization on neural
Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., & Wang, Y. (2017). Learning traffic network training. Omega, 24(4), 385–397. http://dx.doi.org/10.1016/0305-
as images: A deep convolutional neural network for large-scale trans- 0483(96)00010-2.
portation network speed prediction. Sensors, 17(4), http://dx.doi.org/10.3390/ Shih, S.-Y., Sun, F.-K., & Lee, H.-Y. (2019). Temporal pattern attention for
s17040818. multivariate time series forecasting. Machine Learning, 108(8–9), 1421–1441.
Majumdar, A., & Gupta, M. (2019). Recurrent transform learning. Neural Networks, http://dx.doi.org/10.1007/s10994-019-05815-0.
118, 271–279. http://dx.doi.org/10.1016/j.neunet.2019.07.003. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks
Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. for large-scale image recognition. In International conference on learning
International Journal of Forecasting, 9(4), 527–529. http://dx.doi.org/10.1016/ representations.
0169-2070(93)90079-3. Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural
Makridakis, S. (2017). The forthcoming artificial intelligence (AI) revolution: Its networks for time series forecasting. International Journal of Forecasting,
impact on society and firms. Futures, 90, 46–60. http://dx.doi.org/10.1016/j. 36(1), 75–85. http://dx.doi.org/10.1016/j.ijforecast.2019.03.017.
futures.2017.03.006. Spiliotis, E., Kouloumos, A., Assimakopoulos, V., & Makridakis, S. (2020a). Are
Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018a). Objectivity, repro- forecasting competitions data representative of the reality? International
ducibility and replicability in forecasting research. International Journal of Journal of Forecasting, 36(1), 37–53. http://dx.doi.org/10.1016/j.ijforecast.
Forecasting, 34(4), 835–838. http://dx.doi.org/10.1016/j.ijforecast.2018.05.001. 2018.12.007.
Spiliotis, E., Makridakis, S., Semenoglou, A.-A., & Assimakopoulos, V. (2020b).
Makridakis, S., & Hibon, M. (2000). The M3-competition: results, conclusions
Comparison of statistical and machine learning methods for daily SKU
and implications. International Journal of Forecasting, 16(4), 451–476. http:
demand forecasting. Operational Research, 1–25.
//dx.doi.org/10.1016/S0169-2070(00)00057-1, The M3- Competition.
Stevenson, E., Rodriguez-Fernandez, V., Minisci, E., & Camacho, D. (2022). A
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018b). Statistical and machine
deep learning approach to solar radio flux forecasting. Acta Astronautica, 193,
learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3),
595–606. http://dx.doi.org/10.1016/j.actaastro.2021.08.004.
1–26. http://dx.doi.org/10.1371/journal.pone.0194889.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018c). The M4 competi-
the inception architecture for computer vision. In Proceedings of the IEEE
tion: Results, findings, conclusion and way forward. International Journal of
conference on computer vision and pattern recognition.
Forecasting, 34, http://dx.doi.org/10.1016/j.ijforecast.2018.06.001.
Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020). The M4 competition: neural networks. In K. Chaudhuri, & R. Salakhutdinov (Eds.), proceedings
100,000 time series and 61 forecasting methods. International Journal of of machine learning research: Vol. 97, Proceedings of the 36th international
Forecasting, 36(1), 54–74. http://dx.doi.org/10.1016/j.ijforecast.2019.04.014. conference on machine learning. PMLR.
Mashlakov, A., Kuronen, T., Lensu, L., Kaarna, A., & Honkapuro, S. (2021). Assess- Tian, C., Fei, L., Zheng, W., Xu, Y., Zuo, W., & Lin, C.-W. (2020). Deep learning
ing the performance of deep learning models for multivariate probabilistic on image denoising: An overview. Neural Networks, 131, 251–275. http:
energy forecasting. Applied Energy, 285, Article 116405. http://dx.doi.org/10. //dx.doi.org/10.1016/j.neunet.2020.07.025.
1016/j.apenergy.2020.116405. Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J., & Talagala, T. S. (2020). et al. (2016). WaveNet: A generative model for raw audio. In The 9th ISCA
FFORMA: Feature-based forecast model averaging. International Journal of speech synthesis workshop (p. 125).
Forecasting, 36(1), 86–92. http://dx.doi.org/10.1016/j.ijforecast.2019.02.011. Wang, J., & Wang, J. (2017). Forecasting stochastic neural network based on
Naseer, S., Saleem, Y., Khalid, S., Bashir, M. K., Han, J., Iqbal, M. M., et al. (2018). financial empirical mode decomposition. Neural Networks, 90, 8–20. http:
Enhanced network anomaly detection based on deep neural networks. IEEE //dx.doi.org/10.1016/j.neunet.2017.03.004.
Access, 6, 48231–48246. http://dx.doi.org/10.1109/ACCESS.2018.2863036. Xue, N., Triguero, I., Figueredo, G. P., & Landa-Silva, D. (2019). Evolving deep
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural CNN-LSTMs for inventory time series prediction. In 2019 IEEE congress
basis expansion analysis for interpretable time series forecasting. arXiv on evolutionary computation (pp. 1517–1524). http://dx.doi.org/10.1109/CEC.
preprint arXiv:1905.10437. 2019.8789957.

52
A.-A. Semenoglou, E. Spiliotis and V. Assimakopoulos Neural Networks 157 (2023) 39–53

Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep Zhang, G., & Qi, M. (2005). Neural network forecasting for seasonal and trend
learning based natural language processing [review article]. IEEE Computa- time series. European Journal of Operational Research, 160(2), 501–514. http:
tional Intelligence Magazine, 13(3), 55–75. http://dx.doi.org/10.1109/MCI.2018. //dx.doi.org/10.1016/j.ejor.2003.08.037.
2840738. Zhang, J., Zheng, Y., Qi, D., Li, R., Yi, X., & Li, T. (2018). Predicting citywide crowd
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional flows using deep spatio-temporal residual networks. Artificial Intelligence,
networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer
259, 147–166. http://dx.doi.org/10.1016/j.artint.2018.03.002.
vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing.
Zhao, Z., Zheng, P., Xu, S., & Wu, X. (2019). Object detection with deep learning:
Zhang, G., & Guo, J. (2020). A novel ensemble method for hourly residential
electricity consumption forecasting by imaging time series. Energy, 203, A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11),
Article 117858. http://dx.doi.org/10.1016/j.energy.2020.117858. 3212–3232. http://dx.doi.org/10.1109/TNNLS.2018.2876865.

53

You might also like