
Springer Lecture Notes in Computer Science


Towards the Automated Selection of ML Models for Time-Series Data Forecasting

Yi Chen1[0009−0003−9868−4286] and Verena Kantere2[1111−2222−3333−4444]

School of Electrical Engineering and Computer Science


Faculty of Engineering, University of Ottawa
{ychen808,vkantere}@uottawa.ca

Abstract. Analyzing and forecasting time-series data is challenging,
since such data typically comes with characteristics, such as seasonality,
which may impact models’ performance. Such characteristics are frequently
not known before implementing models. At the same time, the abundance of
ML models makes selecting a suitable model for a specific dataset
difficult. To solve this problem, research is currently exploring the
creation of automated model-selection techniques. However, the
characteristics of datasets have not been considered. To address this gap,
this work aims to explore the appropriateness of models with respect to
the features of time-series datasets. We collect a wide range of models
and time-series datasets and choose some of them to conduct a series
of experiments to explore how different elements, such as the input size,
the number of external features, and seasonality affect the performances
of selected models. We make a thorough quantitative and qualitative
analysis of the experimental results. Based on this analysis we formulate
several outcomes that are helpful in time-series data forecasting. Fur-
ther, we design a decision tree based on these outcomes, which can be
used as a first step towards the creation of an automated model-selection
technique for time-series forecasting.

Keywords: Model Selection · Deep learning · Time-series data.

1 Introduction

Data is recorded and stored over time in a wide range of domains. These ob-
servations lead to a collection of organized statistics called time-series data, a
set of data points ordered in time [1, 2]. Time-series data analysis and forecast-
ing are significant for many applications in business and industry, such as the
stock market and exchange, weather forecasting, electricity management, fuel
consumption, etc.[3]. There are two types of time-series data: regular time series
and irregular time series. Regular time series is the most common type and refers
to observations that occur at regular intervals of time [4]. This kind of data is
recorded in intervals of various granularity, like year, month, day, hour, or even
minute. In contrast, irregular time series refer to observations made at irregular time
intervals. An example of this type is the collection of data related to a patient’s
2 Y. Chen et al.

medical tests. Such data may be obtained only if the patient heads to a clinic and
carries out a test, an event which may not happen in regular time intervals [4].
The analysis of time series has inherent complexity. First, most time series ex-
hibit seasonality or elaborate cyclical patterns. Second, time-series data is often
affected by external factors that should be considered during analysis. Moreover,
the forecasting of time-series data usually relies on previous time points, so it is
sensitive to the variation in time. For these reasons, the analysis and
forecasting of time-series data is vital but challenging. Nevertheless,
several methods for time-series data analysis have been proposed, such as
ARIMA [5], Prophet [6], and Deep Learning (DL) models. Specifically, in
2017 Facebook released the source of Prophet, a forecasting tool available
in Python and R, which was developed to meet business-level forecasting
demands [7]. Moreover, DL models such as Long Short-Term Memory (LSTM) [6]
and Gated Recurrent Unit (GRU) [8] are also used for time-series prediction.
In order to perform forecasting with ML models, it is necessary to imple-
ment a type of model appropriate for the characteristics of the input datasets.
However, this is a challenging task since there is a wide variety of ML models
to choose from and, furthermore, users may not know the characteristics of the
input time-series dataset. Thus, the selection of the appropriate model can be
time-consuming, or, even, inaccurate.
Proposed Solution. To solve the problem of model selection for time-series
data forecasting, we explore the association of the suitability of models
with the features of input time-series datasets. Such an association can
lead to the design of techniques that select the appropriate model in an
automated manner, given an estimation of the characteristics of the
time-series data. Toward this end, we have designed a series
of experiments that consider various models that are used for time-series data
forecasting and have selected the most appropriate one based on the character-
istics of the input datasets. Our exploratory experimental analysis leads to the
formation of specific outcomes that can be used as guidelines for the appropri-
ate selection of models. We showcase how the outcomes can be used towards the
creation of an automated model-selection technique by designing a decision tree
that employs them.
In the rest of this paper, Section 2 summarizes related work and Section 3
gives an overview of our methodology for the creation of our exploratory ex-
perimental study. Section 4 describes the evaluation of the ML/DL models, and
Section 5 presents details of the implementation of the models. Section 6 presents
the experimental results, and Section 7 describes the proposed outcomes and dis-
cusses their possible application, showcasing it in the design of a decision tree.
Section 8 concludes the paper and discusses the direction for future work.

2 Related Work

Model selection is a topic in ML research that has attracted a lot of interest, since
it promises to provide both convenience and efficiency in using ML models. In
Title Suppressed Due to Excessive Length 3

1994, Yumi Iwasaki and Alon Y. Levy proposed an algorithm for selecting model
fragments automatically for simulation [9]. They designed the algorithm based
on relevance reasoning, which is used to determine which phenomenon can affect
the query [9]. This algorithm helps in choosing the required model fragments for
simulation [9]. In addition, an R package glmulti which aimed at selecting gen-
eralized linear models automatically was designed and implemented by Vincent
Calcagno in 2010 [10]. In 2016, Gustavo Malkomes et al. [11] employed Bayesian
optimization for automated model selection. They constructed a novel kernel
between models to explain a given dataset which helps them achieve the goal of
finding a model for the given dataset without human assistance [11]. Moreover,
Lars Kotthoff et al. [12] released Auto-WEKA, which adds automatic
selection technology to the original WEKA platform. They also used
Bayesian optimization, to help users identify the best approach for
their particular datasets. In recent years, Abdelhak Bentaleb et al. [13]
proposed an Automated Model for Prediction, which is used for predicting
network bandwidth. In 2020, Yuanxiangyin et al. [14] presented an automated model
selection framework to find the most suitable model for time series anomaly de-
tection [14]. They achieved this by invoking a pre-trained model selector and
a parameter estimator [14]. In 2022, Chunnan Wang et al. proposed an
algorithm, AutoTS, which designs a suitable forecasting model for a
given time-series dataset [15]. First, they constructed a search space by
decomposing time-series forecasting models into 7 modules. Then they
employed two-stage pruning and a knowledge graph analysis method to prune
the search space and understand each component [15]. In 2023, a framework for automated
model selection in natural language processing was created by Shehan Saleem
and Sapna Kumarapathirage [16]. They conducted trials on 2 models (BOWRF
and FastText) to select the best-performing model and evaluated
performance by macro F1 and time [16]. Chengxin Gong and Jinwen Ma
proposed a Bayesian Ying-Yang annealing learning algorithm for the
parameter learning of the TMGPFR model with automated model selection [17].
In the same year, Amazon Web Services released AutoGluon-TimeSeries, which
is part of the AutoGluon framework [18]. It combines classic statistical
models and deep learning models with an ensembling technique and helps
users perform time-series forecasting more efficiently and simply.
Although several model selection techniques have been proposed, they
target various areas and not specifically time-series data. Moreover, none
of these techniques considers the characteristics of time-series datasets;
they only train the models without this information. In this work, we fill this gap by
conducting a series of experiments and acquiring several outcomes that can be
applied to time-series data forecasting.

3 Methodology of Experiments

In order to extract useful outcomes for the design of an automated
model-selection technique for time-series forecasting, we devised the
following methodology. First, we reviewed a variety of models and
time-series datasets. Second, we collected different types of time-series
datasets and assessed their suitability. Third, we chose several kinds of
models and 2 datasets to conduct experiments. Fourth, we meticulously
designed the exploratory experimental study and conducted the experiments.
Finally, we analysed the experimental results in depth and determined
clear outcomes, which can be used to design a technique for automated
model selection.
In the following we give details about the first 3 steps of our
methodology. The rest of the steps are described in the following sections.

– Review of models. There are various models that can be used for time-series
data forecasting. The Autoregressive and Moving Average (ARMA) model is an
important way to study time series [19]. Based on this, one of the most popu-
lar algorithms in time-series data prediction was proposed, the Autoregressive
Integrated Moving Average (ARIMA) [5]. Moreover, a time-series forecasting
tool called Prophet, which was developed to meet the demand of the business
level data was released by Facebook [7]. Prophet took not only holidays or break
intervals, but also trends, outlier detection, and missing data into considera-
tion [7]. One traditional method, Exponential Smoothing, which was created
by Brown [20] in the 1950s to develop a tracking model for fire control informa-
tion on the location of submarines, can also be used for time-series data. Besides,
Deep Learning models are available for time-series data forecasting as well. Deep
Learning (DL) is a subset of ML that focuses on building models on the neural
network architecture [2]. Some DL models, such as LSTM, GRU, and
Convolutional Neural Network (CNN), are used for time series. LSTM and GRU
are 2 sub-types of Recurrent Neural Network (RNN). The special
characteristic of an RNN is that it can keep information from past
elements of a sequence and use it to process the next element by
calculating a hidden state [2]. Unlike an RNN, a CNN applies a convolution
operation, which allows it to create a reduced set of features [2]. CNN is
the main architecture behind some algorithms for image classification and
segmentation, and it can also be used for time-series data. There are also
variations of these models, such as Bidirectional LSTM, Bidirectional GRU,
and CNN-LSTM, which extend the traditional ones.
– Collection and review of datasets. Beyond the models, we took into con-
sideration the possible characteristics of time-series datasets. Besides character-
istics that also occur in other types of datasets, such as linearity, time-series
have some special characteristics. A prevalent one is related to whether the
dataset is stationary or not. Some models, like ARMA, assume a stationary
series, and ARIMA handles non-stationarity only through differencing.
Another important feature of time-series datasets is that
they may exhibit seasonality. We collected different kinds of datasets. After a
thorough search, we obtained 6 complete time-series datasets: AEP hourly [21],
Air Passengers [2], Steel industry data [22], Canadian climate history [23], mi-
crodata [2], and DailyDelhiClimate [24]. AEP hourly is hourly power consump-
tion data from PJM Interconnection LLC’s website, recorded from 1998 to 2001.
Air Passengers reflects the number of passengers in each month from December
1948 to December 1960. Steel industry data is the amount of energy
consumption of a company which produces coils, steel plates, and iron
plates. Related information on energy consumption is stored on the website
of the Korea Electric Power Corporation. Canadian climate history and
DailyDelhiClimate contain climate information for Canada and Delhi,
respectively. The last one, microdata, is the real
gross domestic product of the United States from 1959 to 2009 [2].
– Selection of models and datasets. The next step of our methodology was
to choose some models and datasets to conduct an experimental evaluation. We
chose several types of DL models as these can be effective for the processing of
non-stationary data. Compared to other models, DL ones have the advantage
that they tend to perform better as more data is available, so they are appro-
priate for large complex datasets [2]. We considered 3 categories of such models,
based on their architecture: normal DL networks, bidirectional networks, and hy-
brid networks. Due to their architectural differences, these categories can have
a complementary performance on various time-series data. Concerning normal
DL networks, we selected LSTM and GRU. LSTM is a powerful tool for modeling
sequence data and is particularly good at tasks with long-term dependencies,
which makes it suitable for time-series data. GRU is a modification of LSTM but with
a simpler structure. Furthermore, we selected the corresponding bidirectional
networks of these 2 models, so that we can do a 1-1 comparison of their per-
formance. Finally, CNN-LSTM represents the hybrid neural networks category.
CNN-LSTM is one of the most popular hybrid networks since it combines the
capabilities of CNN for spatial feature extraction and LSTM for processing time-
series data. Besides models, we also selected 2 of the 6 datasets that we have
explored. Since we focus on time-series datasets, we needed the datasets to ex-
hibit the basic and most important characteristics of time-series datasets, i.e.
stationarity and seasonality. After conducting the ADF test (a test of
stationarity) [2] and decomposition on all datasets, we chose
DailyDelhiClimate and Steel industry data. Beyond exhibiting the
aforementioned characteristics, these 2 datasets are most appropriate
because they involve data about climate and energy consumption, which may
be affected by external factors. Thus, they give us the opportunity to
explore whether training with more features related to external factors
benefits the performance of models.

4 Design of the Model Evaluation

In this section we describe the datasets and the metrics we have used in our
experimental evaluation, as well as the design of the experiments.

4.1 Dataset

We have conducted experiments using 2 datasets with different
characteristics: dataset 1, Steel industry data, and dataset 2,
DailyDelhiClimateTrain. Steel industry data is a dataset about energy
consumption in the steel industry. This dataset is sourced from the UCI
Machine Learning Repository [22]. The head of this dataset is shown in
Table 1.

Table 1. Head of Steel industry data.

date              Usage  L C R.P1  L C R.P2  CO2  L C P F1  L C P F2  NSM   WS       D o w   L T
01/01/2018 00:15  3.17   2.95      0         0    73.21     100       900   Weekday  Monday  Load Type
01/01/2018 00:30  4      4.46      0         0    66.77     100       1800  Weekday  Monday  Load Type
01/01/2018 00:45  3.24   3.28      0         0    70.28     100       2700  Weekday  Monday  Load Type
01/01/2018 01:00  3.31   3.56      0         0    68.09     100       3600  Weekday  Monday  Load Type
01/01/2018 01:15  3.82   4.5       0         0    64.72     100       4500  Weekday  Monday  Load Type

There are 11 features in the dataset. The date is the continuous time from
01/01/2018 00:15, recorded every 15 minutes. Usage kWh (Usage), Lagging
Current reactive (L C R.P1), Leading Current reactive (L C R.P2), CO2,
Lagging Current Power Factor (L C P F1), and Leading Current Power Factor
(L C P F2) are all different elements of steel energy evaluation. NSM means
‘Number of Seconds from Midnight’. WeekStatus (WS) and Day of Week (D o w)
both represent the date status. Finally, Load Type (L T) is a category with
the values Light Load, Medium Load, and Maximum Load. The Steel industry
data dataset is stationary: the ADF test [2] results in a p-value of 0.0,
which is below the usual 0.05 significance level, so the unit-root null
hypothesis is rejected.
The DailyDelhiClimateTrain dataset provides the climate in Delhi from
January 1st, 2013 to April 24th, 2017. This dataset is collected from the
Weather Underground API [24]. There are 5 features in the dataset, which
are shown in Table 2.

Table 2. Head of DailyDelhiClimateTrain.

date        meantemp           humidity           wind speed          meanpressure
2013-01-01  10.0               84.5               0.0                 1015.6666666666700
2013-01-02  7.4                92.0               2.98                1017.8
2013-01-03  7.166666666666670  87.0               4.633333333333330   1018.6666666666700
2013-01-04  8.666666666666670  71.33333333333330  1.2333333333333300  1017.1666666666700
2013-01-05  6.0                86.83333333333330  3.70                1016.5

From Table 2, we can see that this dataset records 4 elements of climate in
Delhi: mean temperature, humidity, wind speed, and mean pressure. This is a
representative dataset that exhibits seasonality. The climate in a city
follows a roughly regular pattern each year, and the seasonal component
from the STL decomposition [2] of the dataset is shown in Fig. 1.
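A quick way to check for a yearly pattern before running a full decomposition is to look at the autocorrelation at the candidate seasonal lag. The sketch below is not the STL decomposition or the ADF test used in this work; it is a simpler stand-in written in pure Python. The function name `autocorrelation` and the synthetic series are illustrative assumptions, not part of the original study.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: co-movement between the series and its lagged copy.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Synthetic daily series with a 365-day cycle (illustrative, not the Delhi data).
period = 365
series = [25 + 10 * math.sin(2 * math.pi * t / period) for t in range(4 * period)]

yearly = autocorrelation(series, period)          # clearly positive at the full cycle
half_year = autocorrelation(series, period // 2)  # clearly negative at half the cycle
```

A strongly positive value at the seasonal lag together with a strongly negative value at half that lag is what one would expect from a series with a pronounced yearly cycle.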

4.2 Metrics
Fig. 1. Seasonal Composition of DailyDelhiClimateTrain.

In our study we used metrics to evaluate the models in terms of 2 aspects:
time and accuracy. We use 2 accuracy metrics: the Mean Absolute Percentage
Error (MAPE) and the Mean Squared Error (MSE). MAPE calculates the relative
error between predicted and actual values, and it is expressed as a
percentage, which makes the result simple to interpret. However, MAPE is
undefined if an actual value is zero, since the actual value appears in the
denominator. MSE is the mean of the squared differences between predicted
and actual values, so it is always non-negative. One limitation is that
squaring magnifies errors, especially large ones, so MSE is sensitive to
outliers. MAPE and MSE are employed in both the training and the test
phases to measure the accuracy of the results and help us understand the
suitability of models for the input datasets. Beyond these 2 metrics, we
also recorded the number of epochs in each training process. The number of
epochs is a convenient proxy for training time, and it helps reveal the
efficiency of the models.
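The two accuracy metrics can be written in a few lines. The following is a minimal sketch in pure Python, assuming plain lists of numbers as input; the function names `mse` and `mape` and the sample values are illustrative, and raising an error on a zero actual value is one possible way to handle the MAPE limitation mentioned above.

```python
def mse(actual, predicted):
    """Mean Squared Error: the average of the squared differences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        # The actual value is the denominator, so a zero makes MAPE undefined.
        raise ZeroDivisionError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Illustrative values only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

mse_value = mse(actual, predicted)    # (100 + 400 + 0) / 3
mape_value = mape(actual, predicted)  # 100 * (0.1 + 0.1 + 0) / 3
```

Note how the single error of 20 contributes four times as much to MSE as the error of 10, while both contribute equally to MAPE because each is 10% of its actual value.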

4.3 Design of Experiments


For our evaluation we implemented 5 kinds of neural network models. These
models are used to process the datasets in 8 kinds of scenarios, to analyze
whether different horizons and numbers of features affect the performance
of the models. The 8 experiments for each model are shown in Table 3.

Table 3. Experiments in Each Model.

Models (each runs all 8 experiments): LSTM, Bi-LSTM, GRU, Bi-GRU, CNN-LSTM

Experiments:
1. 1 feature + horizon (1)
2. 1 external feature + horizon (5)
3. 1 external feature + date features + horizon (1)
4. 1 external feature + date features + horizon (5)
5. 2 external features + horizon (1)
6. 2 external features + horizon (5)
7. 2 external features + date features + horizon (1)
8. 2 external features + date features + horizon (5)
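The 8 experiments in Table 3 are the combinations of three two-valued factors: the number of features (1 or 2), whether date features are added, and the horizon (1 or 5). A minimal sketch of enumerating them follows. It assumes that the experiments are numbered in the order in which they appear in the table; the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

MODELS = ["LSTM", "Bi-LSTM", "GRU", "Bi-GRU", "CNN-LSTM"]

# One experiment per combination of the three factors varied in Table 3.
# product() varies the last factor fastest, which matches the table order.
experiments = [
    {"id": i, "external_features": n, "date_features": use_date, "horizon": h}
    for i, (n, use_date, h) in enumerate(
        product([1, 2], [False, True], [1, 5]), start=1
    )
]

# Every model runs every experiment (repeated for each dataset).
runs = [(model, exp["id"]) for model in MODELS for exp in experiments]
```

With this numbering, experiment 3 is the single-feature, date-feature, horizon-1 setting, and experiment 7 is its 2-feature counterpart.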

In Table 3, the ‘horizon’ is the length of time for which forecasts are
generated. We create experiments for a short horizon (1) and a long horizon
(5) so that we can see how this influences the results. Another factor we
consider is the ‘date’ feature, which encapsulates the date and time of the
data collection. The ‘date’ feature
can give us important information about time patterns. For example, the con-
sumption of electricity may present some kinds of seasonality and have higher
values in summer and lower values in spring. DateTime feature engineering is
the process of creating new features from date and time information in order to
improve predictive accuracy in ML models[25]. In this work, the ‘date’ feature
includes several subdivisions of time, like year, month, day, day of week, and day
of year. The latter can signify special days in a year, like the day of Christmas.
We have created the values for these subdivisions based on the original times-
tamps of the data. Moreover, employing ‘external features’ while training the
model may have a positive impact on the prediction. For instance, in a factory,
consumption of energy may result in a change of temperature inside, so the data
of temperature contributes to the prediction of the amount of energy. Therefore,
our experiments also take either 1 or 2 ‘external features’ into consideration.
The 8 types of experiments are implemented for each of the 5 models shown in
Table 3 in order to understand the impact of these features on the training and
the prediction. All experiments for all models are implemented for both selected
datasets.
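The date subdivisions described above (year, month, day, day of week, and day of year) can be derived directly from a timestamp. The following is a minimal sketch using only the Python standard library. The helper name `extract_date_features` is hypothetical, and the default format string assumes that the Steel industry timestamps are written as day/month/year, which the source does not state explicitly.

```python
from datetime import datetime


def extract_date_features(timestamp, fmt="%d/%m/%Y %H:%M"):
    """Split one timestamp string into calendar subdivisions."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "day_of_week": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": dt.timetuple().tm_yday,  # 1 ... 365 (366 in leap years)
    }


# First timestamp of Table 1 (format assumed to be day/month/year).
features = extract_date_features("01/01/2018 00:15")

# A fixed calendar day such as Christmas maps to the same day_of_year
# in every non-leap year.
christmas = extract_date_features("2013-12-25", fmt="%Y-%m-%d")
```

Each resulting field can then be appended as an extra input column next to the target and the external features.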

5 Description of Model Implementation

In the following we give details of the implementation of the models,
focusing on scaling and inversing, as well as windowing.

5.1 Scaling and Inversing

Feature scaling is a method used to normalize the range of independent
variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing
step [25]. Scaling the data can help to balance the impact of all variables
on the distance calculation and can help to improve the performance of the
algorithm [26]. In some DL models, like LSTM and GRU, scaling the dataset
is a necessary step that can help increase accuracy and efficiency. First,
different features may have different ranges or units, so features with
larger values may have a greater impact on the loss function. Scaling
avoids this because the features then have similar ranges, which also helps
speed up the training process. Second, scaling helps prevent vanishing or
exploding gradients during training by ensuring that the input data lies in
a small range. Thus, scaling the input dataset is indispensable in neural
networks. In this work, we choose the MinMaxScaler to normalize the data.
Its use is simple, since the MinMaxScaler class is provided by the
preprocessing module of the sklearn library and can be imported directly.
All we need to do is define the boundaries of the scaling and fit the
scaler on the data we want to scale. One technique used in the experiments
with date features and 2 features is to define several scalers: one for the
target feature and others for the remaining features. In this way,
inversing the scaled data is correct and easy to follow. Normalization is
usually performed on the training data. Afterwards, the scaled data needs
to be converted back to the original data range for subsequent analysis and
interpretation. This process is called inversing, and it usually happens
after predicting and before evaluating the model. An important point is
that the same scaler that was used for scaling must be used for inversing,
to guarantee that the predictions are transformed back to the original
range. In the first experiment, only one feature is scaled, so only one
scaler is used for inversing the predictions. However, the situation is
different in the second experiment. There, 2 kinds of features are scaled
at the beginning: one scaler, called scaler_date, is for the date features,
and another one, called scaler_target, is for the target values. Therefore,
scaler_target should be used for inversing the predictions instead of
scaler_date. After that, the evaluation step is straightforward.
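The idea of keeping separate scalers, and inverting predictions with the target scaler only, can be illustrated as follows. This sketch does not use sklearn: `MinMaxScaler1D` is a small hypothetical stand-in, written in pure Python, that mimics the fit / transform / inverse_transform pattern for a single feature. The date-feature values and the "predictions" are made-up illustrative numbers.

```python
class MinMaxScaler1D:
    """Tiny stand-in for a min-max scaler on one feature (a list of floats)."""

    def __init__(self, feature_range=(0.0, 1.0)):
        self.low, self.high = feature_range
        self.data_min = None
        self.data_max = None

    def fit(self, values):
        self.data_min, self.data_max = min(values), max(values)
        return self

    def _span(self):
        # Fall back to 1.0 for constant data to avoid division by zero.
        return (self.data_max - self.data_min) or 1.0

    def transform(self, values):
        width = self.high - self.low
        return [self.low + (v - self.data_min) / self._span() * width for v in values]

    def inverse_transform(self, scaled):
        width = self.high - self.low
        return [self.data_min + (s - self.low) / width * self._span() for s in scaled]


target = [3.17, 4.0, 3.24, 3.31, 3.82]        # Usage values from Table 1
day_of_year = [1.0, 32.0, 60.0, 91.0, 121.0]  # illustrative date feature

# One scaler per kind of feature, as described in the text.
scaler_target = MinMaxScaler1D().fit(target)
scaler_date = MinMaxScaler1D().fit(day_of_year)

scaled_target = scaler_target.transform(target)
scaled_date = scaler_date.transform(day_of_year)

# Pretend these are model outputs in the scaled space.
predictions_scaled = [0.0, 0.5, 1.0]
# Invert with the target scaler; using scaler_date here would map the
# predictions into the day-of-year range instead of the Usage range.
predictions = scaler_target.inverse_transform(predictions_scaled)
```

Keeping the target in its own scaler means the inverse step needs no knowledge of how many other columns were scaled alongside it.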

5.2 Window

Applying DL for forecasting relies on creating appropriate time windows,
which allow us to format the data in the right way to be fed to neural
network models. Data windowing is a process in which users define a
sequence of time-series data and separate it into 2 parts: inputs and
labels [2]. The DL model can fit the inputs, generate predictions, compare
them to the labels, and repeat this process until it cannot improve the
accuracy of its predictions [2]. In this work, a function called
create_dataset is defined to achieve this. The create_dataset function
receives two parameters: dataset and look_back. The dataset is the input
dataset that users want to feed to the model for forecasting, and look_back
is the retrospective period, i.e., the number of previous points in time
that are used to predict the next point. If the value of look_back is 1,
the value at time (t-1) is used to predict the value at time t. At the
beginning of the function, 2 lists are initialized: dataX and dataY. dataX
stores the input features, and dataY stores the corresponding target
values. A loop then traverses the dataset. Since look_back time points are
needed to complete each prediction, the loop stops at index
len(dataset) - look_back. The line a = dataset[i : i + look_back] extracts
look_back time points starting at index i, and these values are used to
predict the next value. The feature data a is then appended to dataX. At
the same time, the value at the point in time immediately after the
look_back window is the target, which is appended to dataY. In the end, the
2 lists are returned as the output of this window function. Overall, this
function creates windows from the input dataset by extracting sequence
segments, and its output can be fed to the neural network for training.
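The windowing function described above can be reconstructed from the prose as follows. This is a sketch based on that description, not the original code of this work: it operates on a plain Python list holding a single feature, returns lists rather than arrays, and the default value of `look_back` is an assumption.

```python
def create_dataset(dataset, look_back=1):
    """Split a univariate series into (input window, next value) pairs."""
    dataX, dataY = [], []
    # Stop early enough that every window still has a following target value.
    for i in range(len(dataset) - look_back):
        a = dataset[i : i + look_back]        # look_back consecutive points
        dataX.append(a)
        dataY.append(dataset[i + look_back])  # the point right after the window
    return dataX, dataY


series = [10, 20, 30, 40, 50]
X, y = create_dataset(series, look_back=2)
# X is [[10, 20], [20, 30], [30, 40]] and y is [30, 40, 50]
```

A series of length n therefore yields n - look_back training pairs; before being passed to a recurrent network, the windows would typically be converted to an array of shape (samples, look_back, features).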

Fig. 2. Result of DailyDelhiClimate Dataset.

6 Experimental Results

We present our experimental results for the two selected datasets, namely the
DailyDelhiClimate and the Steel Industry datasets.

6.1 Results of DailyDelhiClimate Dataset

The results for the first dataset, DailyDelhiClimate, are shown in Fig. 2.
In the figure the results are marked with different colors, with red
denoting the smallest, and green the largest, values. In general, the most
accurate model is BiGRU, which achieves the lowest error rate in
experiments 1, 4, and 8 and has the most low-error (red) cells overall. In
contrast, LSTM is not as precise as the other models, since its results
contain more yellow cells. On the other hand, LSTM needs the fewest
training epochs. The number of epochs is the smallest for LSTM in
experiments 1, 6, and 8, which means it is efficient, as its training time
is short. There are some general observations that we can make based on the
results of all 8 experiments. All of the highest errors occurred in
experiment 3, for which the horizon value is 1 and which uses date as an
additional feature. A likely reason is that the models rely more on recent
observations when the horizon is 1, and date features may not provide
direct information about upcoming changes but may introduce unwanted noise
into the model, especially if these features have no obvious correlation
with the target variable. Thus, in most experiments where the horizon is 1,
the results are better without date features. But if the horizon is 5,
i.e., for longer-term forecasting, the results are better with date
features, because long-term prediction can make use of them. Additionally,
in most experiments the results are more accurate when the horizon is 1
than when it is 5. Usually, it is easier and more accurate to predict
points in the near future than those that are farther away: near-term data
are more reflective of the current patterns and trends, and the farther a
data point is in the future, the less it may be affected by such patterns
and trends and the more it may be affected by other factors. The more
long-term a prediction is, the higher the risk of error, because each
prediction of the model relies on the predictions of the previous time
step. Concerning the number of features, we observe that the higher the
number of features, the more accurate a prediction is, since the model can
learn the data in a more holistic manner. At the same time, the results of
experiments with date features are better than those without these
features, except for experiment 3. To a certain extent, date features help
models understand and capture the seasonality and trends of this dataset.

6.2 Results of Steel Industry Dataset


We present results on the stationary dataset Steel Industry in Fig.3. The figure
demonstrates the MAPE of training set in each model by different kinds of
experiments. We observe that LSTM is the best model for this dataset in almost
all experiments except experiments 1 and 2. In other words, LSTM performs
best if there are more features used in training process. Similarly, CNN-LSTM
is also one of the models that performs well in forecasting of this dataset, since
its result is above the average level among the 5 models. On the other hand,
the performance of BiGRU is not as outstanding as it is for the first dataset,
especially in experiments 2, 4, 6, and 8. Hence, BiGRU is not suitable for this
dataset when the horizon is 5. Yet, the training of BiGRU usually takes a shorter
time. There are more red cells in BiGRU in the last column which indicates that
it needs fewer epochs to obtain the most optimal model. From the results on
the this dataset, it is clear that the performance of the models with horizon is
1 is better than is 5 since there are some red-yellow intervals in the figure. This
means that the models are better at short-term forecasting, and the horizon may
affect the precision of it. Models targeting short-term forecasts are often simpler
because they only need to capture patterns in recent data. Meanwhile, in
long-term forecasting each prediction of the model depends on the previous
output, so errors may propagate over time and result in a large error rate in
the end. Moreover, experiment 3, for which the horizon is 1 and training
includes the date features, still produces the worst result on this dataset.
The reason may be the same as before: the model is more likely to rely on
recent observations when the horizon is 1, and date features may not provide
direct information about upcoming changes. Finally, the results of the
experiments follow almost the same pattern across the models: experiment 7
often achieves the lowest error rate. In experiment 7, the models are trained
with 2 features plus the date features, which gives them more information about
the patterns and relationships in the dataset, so they can learn it more
comprehensively.

Fig. 3. Results on the Steel Industry Dataset.
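
The error-propagation argument above can be illustrated with a small sketch.
It assumes a recursive forecasting scheme in which each prediction is fed back
as input. The one-step model `biased_model`, its 12% growth estimate, and the
toy series are hypothetical and are not taken from the experiments; only the
MAPE metric itself comes from the text.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def recursive_forecast(one_step_model, history, horizon):
    """Forecast `horizon` steps by feeding each prediction back as input."""
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return forecasts


def biased_model(window):
    """Hypothetical one-step model that slightly overestimates growth."""
    return window[-1] * 1.12


# Toy series that truly grows by 10% per step (illustrative data only).
history = [100.0, 110.0, 121.0]
actual = [121.0 * 1.10 ** k for k in range(1, 6)]
predicted = recursive_forecast(biased_model, history, horizon=5)

# Percentage error at each forecast step; it grows with the step index.
step_errors = [100.0 * abs((a - p) / a) for a, p in zip(actual, predicted)]
```

In this toy setting the small one-step bias compounds at every step, so the
MAPE over a horizon of 5 is larger than the MAPE over a horizon of 1.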

7 Applications of Experimental Results


In this section, we first introduce the main outcomes of the experiments. Then,
we propose ways in which these outcomes can be used in practical forecasting.

7.1 Outcomes

Based on the thorough analysis of the experimental results presented in
Section 6, we can extract several outcomes that summarize how to select a model
according to the characteristics of the input dataset. The outcomes are listed
below:
- Outcome 1: Choose GRU and BiGRU for processing small time-series datasets.
- Outcome 2: Choose LSTM for processing large time-series datasets.
- Outcome 3: Extract date features and use them for training if the dataset
comes with seasonality.
- Outcome 4: Use more time-dependent features in training if the dataset has
features other than the target feature.
- Outcome 5: Choose to perform short-term forecasting instead of long-term
forecasting for time-series data.
These five Outcomes give us directions for choosing suitable types of models
for time-series data forecasting and for defining their parameters. They take
the features of the datasets into account first, which makes the results more
precise and understandable.
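
Outcome 3 relies on extracting date features from timestamps. A minimal sketch
follows, assuming the hour, day of week, month, and a weekend flag as the
features. The chosen features and the function name `date_features` are
illustrative assumptions and are not taken from the experiments.

```python
from datetime import datetime


def date_features(timestamp):
    """Return simple calendar features for one timestamp (assumed feature set)."""
    day_of_week = timestamp.weekday()  # Monday = 0, Sunday = 6
    return {
        "hour": timestamp.hour,
        "day_of_week": day_of_week,
        "month": timestamp.month,
        "is_weekend": int(day_of_week >= 5),
    }


# Illustrative timestamps: a Wednesday evening and a Saturday morning.
timestamps = [datetime(2023, 12, 20, 22), datetime(2023, 12, 23, 9)]
rows = [date_features(ts) for ts in timestamps]
```

Each resulting row can be joined to the target value at the same timestamp, so
that the model sees the calendar position of an observation together with its
value.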

7.2 Applications of Outcomes

Propose a Decision Tree to Design a Model Selection Technique. The outcomes
above provide us with a new way to select a suitable model for time-series data
forecasting. They involve the size of the dataset, the number of external
features, and seasonality, all of which are crucial for the models'
performance. Therefore, we can employ a decision tree to represent these
conditions and design a Model Selection Technique based on it. The decision
tree for the whole process is shown in Fig. 4.
The Decision Tree, outlined in Fig. 4, was created based on the experiments and
Outcomes above. The first characteristic we check is the size of the dataset.
According to Outcomes 1 and 2, the size of the dataset determines the type of
model used for forecasting. Next, we check whether the number of features is
more than one and whether there is seasonality in the dataset. If the dataset
comes with more than one feature, two features are employed to train the
models. In particular, if it contains another time-dependent feature, priority
is given to this kind of feature, since it may be an external factor of the
target feature. The technique thus chooses different models and setups
depending on which conditions the input meets. For example, in the left part of
the figure, where the size of the input is large, the LSTM model with a
specific setup (horizon = 1, training with date features and two features) is
selected if the input comes with seasonality and more than one feature; this is
the first output of the left part of the figure. An Automated Model Selection
Technique can therefore be designed in this way: it selects a suitable model
for the input automatically, based on the analysis of the characteristics of
the dataset and the Outcomes obtained from the previous experiments.
Fig. 4. Decision Tree of Model Selection.
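
The selection logic described above can be sketched as a small function. This
is a minimal sketch under stated assumptions, not the exact tree of Fig. 4:
the row threshold `LARGE_DATASET_ROWS` is hypothetical (no numeric boundary
between small and large datasets is given here), and all branches other than
the large, seasonal, multi-feature case are inferred from Outcomes 1 to 5.

```python
from dataclasses import dataclass

# Hypothetical boundary between "small" and "large" datasets (an assumption).
LARGE_DATASET_ROWS = 50_000


@dataclass
class Recommendation:
    model: str
    horizon: int
    use_date_features: bool
    n_training_features: int


def select_model(n_rows, n_features, has_seasonality):
    """Recommend a model and setup from dataset characteristics."""
    # Outcomes 1 and 2: dataset size decides the model family.
    model = "LSTM" if n_rows >= LARGE_DATASET_ROWS else "GRU or BiGRU"
    return Recommendation(
        model=model,
        horizon=1,  # Outcome 5: prefer short-term forecasting.
        use_date_features=has_seasonality,  # Outcome 3.
        # Outcome 4: train with two features when more than one is available.
        n_training_features=2 if n_features > 1 else 1,
    )


# Large dataset with seasonality and more than one feature.
large_seasonal = select_model(n_rows=100_000, n_features=3, has_seasonality=True)
```

For the large, seasonal, multi-feature input, the sketch returns the setup
named in the text: LSTM with horizon 1, date features, and two training
features.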

Used as Meta-Information to Design an Automated Model Selection Technique.
Several different Automated Model Selection Techniques have been proposed to
meet the need of selecting a suitable model. Most of them are designed by
training models on a large number of meta-datasets and collecting the related
results. This is a useful method for acquiring accurate information. However,
these techniques do not consider the features of datasets, which have an impact
on the performance of models. Therefore, our Outcomes can be used as
meta-information that is incorporated into the training process.
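
One way to make such meta-information concrete is to describe each dataset by
the characteristics that the Outcomes refer to. The sketch below is an
illustration only: detecting seasonality through the autocorrelation at a
candidate lag, the threshold of 0.5, and the function names are assumptions
and are not part of the experiments.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0  # Constant series or lag too long: no usable signal.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def meta_features(target, n_features, seasonal_lag, threshold=0.5):
    """Summarize a dataset by size, feature count, and a seasonality flag."""
    return {
        "n_rows": len(target),
        "n_features": n_features,
        # Assumed rule: strong correlation at the seasonal lag means seasonality.
        "has_seasonality": autocorrelation(target, seasonal_lag) > threshold,
    }


# Synthetic hourly series with a 24-step cycle (illustrative data only).
hourly = [math.sin(2 * math.pi * t / 24) for t in range(240)]
info = meta_features(hourly, n_features=2, seasonal_lag=24)
```

A dictionary of this kind could accompany each meta-dataset, so that a
selection technique is trained on dataset characteristics as well as on model
results.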

8 Conclusion and Future Work


Time-series data arises in various domains, and its analysis can improve many
aspects of our lives. However, several specific characteristics of time-series
data make it challenging to analyze. A large number of models have been
proposed to analyze and predict time-series data. Yet, it is difficult to
choose a suitable model among the abundance of proposed models. To tackle this
issue, we conducted a series of experiments to explore the relationship between
the features of datasets and the models, and to investigate whether we can
select a suitable model based on the characteristics of the input dataset. We
collected a wide range of pertinent models as well as multiple time-series
datasets. We then chose five kinds of models and two datasets with different
characteristics to conduct experiments. Moreover, we applied eight types of
settings to each model in order to find the best setup of parameters. Finally,
we obtained a series of Outcomes that form the foundation for selecting a
proper model for a time-series dataset. We also proposed several ways in which
these Outcomes can be used. We designed a decision tree that outputs a
recommendation of the most suitable model based on the characteristics of the
input dataset, selecting the model by checking several conditions. Moreover, we
proposed that these Outcomes can be used as meta-information in the training
process to design an Automated Model Selection Technique.
We will continue to refine these Outcomes. We intend to employ more datasets to
obtain more accurate Outcomes. At the same time, more models can be considered
to make our results more general.

References
1. Esling, P., & Agon, C. (2012). Time-series data mining. ACM Computing Surveys
(CSUR), 45(1), 1-34.
2. Peixeiro, M. (2022). Time series forecasting in python. Simon and Schuster.
3. Mahalakshmi, G., Sridevi, S., & Rajaram, S. (2016, January). A survey on forecast-
ing of time series data. In 2016 International Conference on Computing Technologies
and Intelligent Data Engineering (ICCTIDE’16) (pp. 1-8). IEEE.
4. Joseph, M. (2022). Modern Time Series Forecasting with Python: Explore industry-
ready time series forecasting using modern machine learning and deep learning.
Packt Publishing Ltd.
5. Box, G. E. P., & Tiao, G. C. (1975). Intervention Analysis with Applications to
Economic and Environmental Problems. Journal of the American Statistical Asso-
ciation, 70(349), 70–79.
6. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE
tra nsactions on Signal Processing, 45(11), 2673-2681.
7. Jha, B. K., & Pande, S. (2021, April). Time series forecasting model for supermar-
ket sales using FB-prophet. In 2021 5th International Conference on Computing
Methodologies and Communication (ICCMC) (pp. 547-554). IEEE.
8. Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, DZmitry; Bougares, Fethi;
Schwenk, Holger; Bengio, Yoshua (2014). ”Learning Phrase Representations using
RNN Encoder-Decoder for Statistical Machine Translation”
9. Iwasaki, Y., & Levy, A. Y. (1994, August). Automated model selection for simula-
tion. In AAAI (pp. 1183-1190).
10. Calcagno, V., & de Mazancourt, C. (2010). glmulti: an R package for easy auto-
mated model selection with (generalized) linear models. Journal of statistical soft-
ware, 34, 1-29.
11. Malkomes, G., Schaff, C., & Garnett, R. (2016). Bayesian optimization for auto-
mated model selection. Advances in neural information processing systems, 29.
12. Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2017).
Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in
WEKA. Journal of Machine Learning Research, 18(25), 1-5.
13. Bentaleb, A., Begen, A. C., Harous, S., & Zimmermann, R. (2020). Data-driven
bandwidth prediction models and automated model selection for low latency. IEEE
Transactions on Multimedia, 23, 2588-2601.
14. Ying, Y., Duan, J., Wang, C., Wang, Y., Huang, C., & Xu, B. (2020). Automated
model selection for time-series anomaly detection. arXiv preprint arXiv:2009.04395.
15. Wang, C., Chen, X., Wu, C., Wang, H. (2022). AutoTS: Automatic time series fore-
casting model design based on two-stage pruning. arXiv preprint arXiv:2203.14169.
16 Y. Chen et al.

16. Saleem, S., & Kumarapathirage, S. (2023, June). AutoNLP: A Framework for
Automated Model Selection in Natural Language Processing. In 2023 18th Iberian
Conference on Information Systems and Technologies (CISTI) (pp. 1-4). IEEE.
17. Gong, C., & Ma, J. (2023). Automated Model Selection of the Two-Layer Mixtures
of Gaussian Process Functional Regressions for Curve Clustering and Prediction.
Mathematics, 11(12), 2592.
18. Shchur, O., Turkmen, A. C., Erickson, N., Shen, H., Shirkov, A., Hu, T., Wang,
B. (2023, December). AutoGluon–TimeSeries: AutoML for probabilistic time series
forecasting. In International Conference on Automated Machine Learning (pp. 9-1).
PMLR.
19. Mondal, P., Shit, L., & Goswami, S. (2014). Study of effectiveness of time series
modeling (ARIMA) in forecasting stock prices. International Journal of Computer
Science, Engineering and Applications, 4(2), 13.
20. Gardner Jr, E. S. (1985). Exponential smoothing: The state of the art. Journal of
forecasting, 4(1), 1-28.
21. Hour Energy Consumption, https://www.kaggle.com/datasets/robikscube/hourly-
energy-consumption, Last accessed 2023/12/20
22. Steel Industry Energy Consumption, https://www.kaggle.com/datasets/csafrit2/steel-
industry-energy-consumption, Last accessed 2023/12/21
23. Medium-Ds-Unsupervised-Anomaly-Detection-Deepant-Lstmae,
https://github.com/bmonikraj/medium-ds-unsupervised-anomaly-detection-
deepant-lstmae, Last accessed 2023/12/21
24. Daily Climate Time Series Data, https://www.kaggle.com/datasets/sumanthvrao/daily-
climate-time-series-data, Last accessed 2023/12/21
25. Practical Guide for Feature Engineering of Time Series Data,
https://dotdata.com/blog/practical-guide-for-feature-engineering-of-time-series-
data, Last accessed 2024/02/02
26. When to Perform a Feature Scaling, https://www.atoti.io/articles/when-to-
perform-a-feature-scaling, Last accessed 2024/01/04
27. Why Scaling Your Data Is Important, https://medium.com/codex/why-scaling-
your-data-is-important, Last accessed 2024/02/05
