An Automated Forecasting Framework Based On Method Recommendation For Seasonal Time Series
SESSION 2: Performance Learning ICPE '20, April 20–24, 2020, Edmonton, AB, Canada
three different approaches and propose our own time series characteristics (see Section 3.2.1). (iv) In a broad evaluation (see Section 4), we analyze the different approaches, investigate the impact of the time series generation, and compare our forecast framework with state-of-the-art forecasting methods.

Without our framework, a simple and straight-forward approach for choosing the best-suited method for a given time series would be based on trial and error, or the consultation of an expert. However, both possibilities are expensive, time-consuming, or error-prone. That is, automating the choice of the best method in conjunction with the hybrid approaches leads to good forecasting results and helps to save time and costs.

2 BACKGROUND
Before explaining our approach in detail, we outline some background concepts. Thus, Section 2.1 gives a short introduction to time series. Afterward, the time series decomposition is explained. Finally, the frequency detection and Fourier terms are outlined.

2.1 Time Series
A univariate time series is an ordered collection of values of a quantity obtained over a specific period or from a certain point in time. In general, observations are recorded in successive and equidistant time steps (e.g., hours). Typically, internal patterns exist, such as autocorrelation, trend, or seasonal variation.

One of the essential characteristics of a time series is stationarity. Hence, most statistical forecasting methods assume that the time series is either stationary or can be “stationarized” through a transformation. The statistical properties (such as mean, variance, and auto-correlation) of a stationary time series do not change over time. Therefore, a stationary time series is easier to model and forecast. In practice, however, time series usually show a mix of trend and/or seasonal patterns and are thus non-stationary [1]. To this end, time series are transformed, seasonally adjusted, made trend-stationary by removing the trend, or made difference-stationary by possibly repeated differencing.

2.2 Time Series Decomposition
As a time series consists of different components, a common approach is to break down the time series into its components. These parts can either be used for modifying the data (e.g., removing the trend or the seasonality), or they can be used as intrinsic features (e.g., modeling different recurring patterns).

A common method for decomposing a time series is STL (Seasonal and Trend decomposition using Loess) [5]. STL can handle any type of seasonality, allows the seasonal pattern to change over time, and disassembles the given time series into the components trend T, season S, and irregular I (also called remainder). The long-term development of a time series (i.e., upwards, downwards, or stagnating) is called trend. Usually, the trend is a monotone function unless external events trigger a break and cause a change in the direction. The presence of recurring patterns within a regular period in the time series is called seasonality. These patterns are caused by climate, customs, or traditional habits. The unpredictable part of a time series is called the irregular component, possibly following a specific statistical noise distribution. It is also considered the residual time series after all other components have been removed.

2.3 Fourier Terms & Frequency Detection
In many fields, especially for forecasting, it is helpful to know the frequencies, i.e., the lengths of the seasonal patterns. For instance, if the most dominant frequency is unknown for a given time series, the time series cannot be decomposed by the method explained above. By dominant, we mean the most common period, i.e., the seasonal pattern such as days in a year. An established approach for frequency analysis is the Fourier transform, which allows determining the distribution of frequencies or the spectral density of the time series. As a time series can be represented as a weighted sum of sinusoidal components, the found frequencies can be used to retrieve these components, also referred to as Fourier terms.

3 APPROACH
As our approach is two-fold, we first introduce the automatic decomposition, feature extraction, and forecasting of a time series. In Section 3.2, we explain the recommendation system for selecting the most suitable machine learning approach. Afterward, the considered time series characteristics are presented. Finally, the used machine learning methods are highlighted.

3.1 Automatic Time Series Forecasting
The assumption of data stationarity is an inherent limitation for time series forecasting. Any time series property that eludes stationarity, such as a non-constant mean (trend), seasonality, non-constant variance, or a multiplicative effect, poses a challenge for proper model building. Consequently, we design an automated time series forecasting method that addresses these issues. Figure 1 shows the work-flow of the automatic time series forecasting part. The blue rectangle boxes reflect actions, the green trapezoids machine learning features, the grey rounded boxes the target for the machine learning, and the rounded white boxes everything else. The functioning can be grouped into four steps (dashed red boxes): (i) preprocessing, (ii) recommendation, (iii) forecasting, and (iv) postprocessing. Each part is described in the following.

3.1.1 Preprocessing. This step is responsible for preparing the time series and extracting the intrinsic features for the machine learning algorithm. The first step consists of the frequency estimation. If the time series has a certain frequency, this frequency is chosen. Otherwise, the most dominant frequency is estimated. Next, if the time series has multiplicative effects, the logarithm is used to transform the time series. The Fourier terms (the sine and cosine pair) for the most dominant frequency are determined and used as intrinsic features later on. Although most forecasting methods assume stationary time series, many time series exhibit trend and/or seasonal patterns. To tackle the non-stationarity, our approach decomposes the time series and then handles each part separately. To this end, the time series is decomposed by STL (see Section 2.2) into season, trend, and remainder. The seasonal component is used as an intrinsic feature later on. The remainder is ignored since it is irregular and hard to predict, and is therefore associated with a high error rate. Finally, the trend is removed from the time series to
Figure 1: Work-flow of the automatic time series forecasting part, grouped into preprocessing, recommendation, forecasting, and postprocessing.

make the time series trend-stationary. The detrended time series is the target value for model building.

3.1.2 Recommendation. The detrended time series is passed from the preprocessing step and is the basis for the recommendation. The recommendation selects which machine learning algorithm is best suited to model the detrended time series. Thus, time series characteristics are extracted from the detrended time series. Based on these characteristics, a suitable machine learning method is selected. The detailed recommendation is explained in Section 3.2.

3.1.3 Forecasting. To build a suitable forecast model that takes the features derived in the previous step into account, we use the machine learning algorithm recommended by the last step. To reduce the model error and later the forecast error, we exclude the trend and the remainder as features. The trend was removed during the first step to make the time series trend-stationary. The remainder of the time series is not explicitly considered a feature. That is, the machine learning method notices a difference that is missing to fully recreate the target value. In other words, this difference is the remainder and is learned implicitly as the machine learning method tries to explain this difference. Consequently, the considered features include the season and the Fourier terms, and the target value corresponds to the detrended time series. Although seasonality can also violate stationarity, time series models usually take seasonality explicitly into account. Also, machine learning methods are suitable for pattern recognition. To this end, we keep the seasonality as a feature.

To forecast the time series, each feature and the trend have to be forecast separately. As the season and the Fourier terms are recurring patterns per definition, these features can merely be continued. Based on the trend component, an ARIMA¹ model [11] without seasonality is determined that forecasts the future trend of the time series. Simultaneously, the forecast patterns of the season and Fourier terms, in combination with the model, are used to predict the detrended time series.

¹We select ARIMA as it is able to estimate the trend even from a few points, and we use an automatic version that selects the most suited model [10].

3.1.4 Postprocessing. In this last step, the forecast trend is appended to the forecast detrended time series to assemble the forecast time series. Moreover, if the time series was multiplicative, the forecast time series is re-transformed with the exponential function. Finally, the forecast time series is returned.

3.2 Machine Learning Recommendation
To tackle the problem that arises with the "No-Free-Lunch Theorem", we employ a recommendation system for machine learning approaches. The idea is to choose the most suitable method based on the time series characteristics. Figure 2 shows the recommendation work-flow. The blue rectangle boxes reflect actions, the green trapezoids reflect machine learning features, the grey rounded boxes the machine learning target, and the rounded white boxes everything else. The functioning can be grouped into two phases (dashed red boxes): (i) an offline phase and (ii) an online phase. Both phases are described in the following.

3.2.1 Offline Phase. The offline phase learns the rules for recommending a specific method based on time series characteristics; it runs during start-up or while no forecast is currently conducted. To this end, our approach requires an initial set of time series that are stored in the associated storage. To have a broad training set independent of the number of original time series, the first step in this phase is to create new time series based on the original time series in the storage. For this purpose, three different methods are used:

(i) The first method splits time series into smaller parts to have a more diverse set of time series with different lengths. The length of a split is the maximum between a freely configurable length and 10% of the original length. (ii) The core idea of the second method is to decompose the time series, modify one component, and assemble the modified component and the two remaining parts to a new time series. More precisely, this method modifies each component one after the other and creates, therefore, three new time series. For the modification, the divisors of the frequency of the time series are determined. For each divisor, the components are modified differently according to the ratio of the frequency and the divisor: the trend becomes steeper; the season is compressed, i.e., the period length becomes shorter; the remainder is stretched. (iii) The third method
Figure 2: Work-flow of the machine learning recommendation, grouped into an offline phase and an online phase.
also decomposes the time series. More precisely, it combines each component of each time series with each component of the other time series. The length of the resulting time series is equal to the shortest component that was used.

Due to the limitations of STL, which requires at least two full periods, only new time series with a length greater than two times the period plus one are considered valid. Created time series that do not fulfill this requirement are considered invalid and are discarded. This method is able to create a huge training set (including the original time series) with a high diversity of time series characteristics. Roughly, the size of the training set is the number of original time series to the power of three.

After the training set is generated, the time series characteristics (see Section 3.3) of each time series are extracted. As the machine learning methods have to handle the detrended time series, the characteristics are also calculated on the detrended time series. At the same time, the machine learning method evaluation is conducted. During the evaluation, each method (see Section 3.4) performs a forecast for each time series. To this end, the time series is split into history (the first 80% of the time series) and future (the remaining 20%). For the forecasting, each method gets, as explained in Section 3.1, the Fourier terms and the season as input, while the detrended time series is the target. Then, for each time series and each method, the forecast error, in this case the mean absolute percentage error (MAPE), is calculated:

MAPE := (100%/n) · Σ_{t=1}^{n} |(y_t − f_t) / y_t|.    (1)

In this equation, n is the forecast horizon, y_t the actual value, and f_t the forecast value. To have a comparable forecast measure among all time series, we normalize the forecast errors of each time series by the lowest error. This normalization results in values ≥ 1 for each time series. Further, the best method has a value of 1. We define these values as the forecast accuracy degradation ϑ, showing how much worse the forecast accuracy is compared to the best method. For instance, a forecast accuracy degradation of 1.05 means that the method is 5% worse. Based on the forecast accuracy degradation, the best method for each time series is determined.

Based on the time series characteristics and the best method for each time series, the recommendation rules can be learned. For this purpose, we envision three different approaches:

(i) The first approach A_C is a classification task. That is, a random forest is used to map the time series characteristics for the given time series to the machine learning method with the lowest forecast error. (ii) The core idea of the second approach A_R is to learn how much each method is worse than the best method. In more detail, the approach calculates for each method how much worse this method is compared to the method with the lowest forecast error for given time series characteristics. Then, a random forest is used as a regressor for each machine learning method in question for the selection. In other words, the random forest tries to find a function that learns how much worse the method is in comparison to the best method based on the time series characteristics. After each method has estimated how much worse the forecast will be for a new time series, the method with the lowest value is chosen. (iii) The third method is a hybrid approach A_H that combines the first two approaches. More specifically, a random forest regressor is used for each machine learning method available to estimate how much worse the method is in comparison to the method with the lowest error. Then, another random forest is used as a classifier to map the estimates of how much worse the forecasts will be to the best method. The idea is to minimize the regression error of each method. For example, if one method always claims to have the lowest degradation, but it does not perform as well, the classification shall learn this behavior.

3.2.2 Online Phase. This phase takes place when a forecast for a given time series is conducted. First, the characteristics of the time series are extracted. Then, the recommendation rules are applied to the characteristics, and a machine learning method is selected. Afterward, the forecasting approach (see Section 3.1) performs the forecast. Finally, the time series is saved within the time series storage, and new time series can be generated, as explained in Section 3.2.1.

3.3 Time Series Characteristics
To train a machine learning method for choosing the best method, suitable features are required. Thus, we calculate for each time series a set of characteristics. These characteristics contain information about the time series, statistical measures, characteristics proposed by Wang et al. [18], characteristics proposed by Lemke and Gabrys [13], and characteristics we propose in this work. The used time series characteristics and the associated calculation instructions are listed in Table 1. In contrast to the work of Wang et al., we use the raw values of the characteristics to avoid arbitrary normalization factors.
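To make the evaluation metric of Section 3.2.1 concrete, the following is a minimal sketch (ours, not the authors' implementation; all identifiers are illustrative) of the MAPE from Equation 1 and the forecast accuracy degradation ϑ derived from it:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error (Eq. 1): 100%/n * sum |(y_t - f_t)/y_t|."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def accuracy_degradation(errors_per_method):
    """Normalize each method's forecast error by the lowest error for this
    time series; the best method gets 1.0, and 1.05 means 5% worse."""
    best = min(errors_per_method.values())
    return {m: e / best for m, e in errors_per_method.items()}

# Hypothetical forecast errors of three methods on one time series:
theta = accuracy_degradation({"rf": 10.0, "svr": 10.5, "cubist": 12.0})
# theta["rf"] == 1.0, theta["svr"] == 1.05, theta["cubist"] == 1.2
```

The degradation values, not the raw errors, are what the regression-based recommendation approaches (A_R, A_H) learn to predict.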
3.4 Machine Learning Methods
For the forecasting task, we only consider machine learning methods in this paper, as statistical methods such as ARIMA can typically only process the time series without additional information. This means that the extracted features (see Section 3.1) cannot be used by such methods. In addition, machine learning methods can handle any number of features. That is, for a possible extension of our approach with external information, these features can be added. The used machine learning methods (see Section 3.1) are listed in the following: (i) Catboost applies gradient boosting of decision trees [15]. (ii) Cubist is a regression model that combines the ideas of M5 with additional corrections as described by Quinlan [16]. (iii) Evtree implements an evolutionary algorithm for learning globally optimal classification and regression trees [9]. (iv) NNetar is a feed-forward neural network trained with lagged values of the time series [10]. (v) Random Forest (RF) uses bagging for generating samples from the data set used for learning [2]. (vi) Rpart trains a regression tree using recursive partitioning, based on the CART algorithm by Breiman et al. [3]. (vii) Support Vector Regression (SVR) uses the same principles as SVM for classification [8]. (viii) XGBoost uses gradient tree boosting where trees are generated sequentially. That is, each tree is grown with knowledge from the last trained tree [4].
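As a hedged illustration of how one of these methods slots into the forecasting step of Section 3.1.3 (a sketch of ours, not the authors' code: a random forest stands in for the recommended method, a toy series and a known linear trend stand in for real data and the STL/ARIMA trend handling):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, horizon, freq = 200, 40, 20

# Toy series: linear trend + seasonal pattern + noise (stand-in for real data).
t = np.arange(n + horizon)
season = np.sin(2 * np.pi * t / freq)
series = 0.05 * t + season + rng.normal(0.0, 0.1, t.size)

# Detrend first (the framework uses STL; here the linear trend is known).
trend = 0.05 * t
detrended = series - trend

# Features as in Section 3.1.3: season plus the Fourier-term pair of the
# dominant frequency; the target is the detrended series.
fourier = np.column_stack([np.sin(2 * np.pi * t / freq),
                           np.cos(2 * np.pi * t / freq)])
X = np.column_stack([season, fourier])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:n], detrended[:n])

# Recurring features are merely continued into the future; the (here known)
# trend is forecast separately and re-added in the postprocessing step.
forecast = model.predict(X[n:]) + trend[n:]
```

In the framework itself, the trend continuation would come from an automatically selected non-seasonal ARIMA model rather than the known slope used here.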
4 EVALUATION
Before discussing the evaluation, we introduce the used data set in Section 4.1. Then, we explain the methodology and the evaluation metrics in Section 4.2. Afterwards, we analyze how well the different machine learning methods perform on the data set. Based on this information, we evaluate our recommendation approaches in Section 4.4. In Section 4.5, we investigate how the diversity of the data set is increased by the time series generation. Finally, we compare our forecasting framework with state-of-the-art methods.

4.1 Data Set
To have a sound and broad evaluation of our approach, a highly heterogeneous data set that covers different domains and characteristics is required. Indeed, there are numerous data sets available online: competitions (e.g., NN3², M3³, and M4), Kaggle, R packages, and many more. Although, for instance, the M4 competition set contains 100,000 time series, these time series have low frequencies (1, 4, 12, and 24) and short forecasting horizons (6 to 48 data points). Further, the median length of a time series is 106. That is, we assume that this data set alone is not suitable for benchmarking forecasting methods for all kinds of domains.

To this end, our data set⁴ consists of 150 real-world and publicly available time series. The time series are collected from various sources including Wikipedia Project-Counts, Internet Traffic Archive, R packages, Kaggle, Datamarket, and many more. Further, the data set reflects different use cases, e.g., Internet accesses, sales volume, etc. Moreover, our data set covers the same frequencies as the M4 competition and additional frequencies (7, 48, 52, 60, 96, 144, 168, 365, 2160, and 6480). Further, our forecast horizons range from 8 to 7,304 data points, and the median length is 595.

²NN3 competition: http://www.neural-forecasting-competition.com/NN3/
³M3 competition: https://forecasters.org/resources/time-series-data/m3-competition/
⁴Time series data set available at https://zenodo.org/record/3508552

4.2 Evaluation Methodology
To evaluate our approach, we divide the original data set into 100 training time series and 50 validation time series. To avoid an arbitrary split, we divide the data set into 100 unique splits. In other words, we train and evaluate our approach on 100 different time series train and test sets. We also made sure that all time series are spread across all splits.

As described in Section 3.2.1, our approach expands for each split the size of the training set to have a sound training set for the recommendation. That is, our approach uses in each split the 100 time series for the generation of new time series. In contrast to the description of the approach, we restrict the approach to use only 10,000 instead of the roughly 1,000,000 time series. More precisely, the training data in each split contains the original 100 time series and 9,900 new time series.

4.3 Machine Learning Method Analysis
For reference, we investigate how each of the chosen machine learning methods performs in the forecasting process on the data set without recommendation, i.e., without changing the method depending on the input time series. To this end, we observe for each method how often the method (i) is the best method in each split (best method in split), (ii) has on average the lowest forecast accuracy degradation in each split (on avg. lowest error in split), and (iii) is over all time series the best method (total best method). We report the respective percentages in Table 2, showing these three observations for the training data and test data for each method. While the distribution of percentages of which method is the best over all time series is almost similar for the training and test data, the distributions per split differ considerably. While Nnetar was the method most often achieving the best forecast accuracy in every training split, it reaches the same performance in only 73% of the test data splits. Cubist had in 55% of the training splits on average the lowest forecast accuracy degradation. In the test data, Cubist has in only 17% of the splits on average the lowest forecast accuracy degradation.

In a nutshell, we see from these results that the dynamic choice of the best performing method is a crucial task with significant potential. Even choosing a method based on straight-forward metrics derived from the training data (for instance, choosing the method which was on average the best method in the training data) may lead to a bad performance.

4.4 Evaluation of the Recommendation
As the recommendation of the best suitable method is an essential pillar of our forecasting framework, we examine the recommendation performance of our envisioned approaches (see Section 3.2.1). To have a ground truth for the competition, we define the following three method selection strategies: (i) Selecting the best method for each time series a-posteriori (S*). (ii) Selecting the method which had the lowest average forecast accuracy degradation in each training split (S_L). (iii) Selecting the method which was most often the best method in each training split (S_B). Note that, based on our analysis in Section 4.3, the method Nnetar will be chosen.

The results of the comparison between these six methods are presented in Table 3. For each approach/strategy, this table lists the median, average, and standard deviation of the accuracy degradation ϑ over all 100 splits.

The best values are shown by S*. Indeed, this result is not surprising as this strategy has a-posteriori knowledge. Thus, this method has the role of showing the theoretically best possible values. In other words, S* is the base-line for the recommendation. Consequently, only five methods remain for a fair competition. In terms of the average forecast accuracy degradation, the regression-based approaches (A_H being on average 15.9% worse than always choosing the best method, and A_R with a value of 1.172) outperform the remaining approaches/strategies. Taking also the median and the standard deviation of the forecast accuracy degradation into account, it can be seen that the meta-learning layer of A_H is able to improve the performance of A_R in all measures of the forecast accuracy degradation. The worst forecast accuracy degradation is shown by S_L, followed by A_C. In contrast, A_C exhibits the lowest median, followed by the regression-based approaches. The worst median is shown by S_B. Observing the standard deviation of the forecast accuracy degradation, S_L, A_C, and A_R exhibit high values. The lowest value is shown by A_H.

The median and mean values can be better understood if the distribution of the ranking of the recommended methods is taken
Table 2: Investigation of the forecast performance of the different machine learning methods.
into account. Figure 3 shows the distribution of the rankings. The ranks of S_L are almost equally distributed. S_B selects almost either the best or the worst method. More precisely, it recommends the worst method for 51.4% of the time series. For all recommendation approaches, the distribution of ranks two to five drops. The regression-based approaches select the best or second-best method in more than 30% of the cases. However, choosing the worst method is almost as likely as choosing the best method. In contrast to all other methods, A_C chooses the best, second-best, or third-best method with more than 50%, but also has an almost 25% chance of choosing the worst method. In fact, none of the methods shows a proper distribution that decreases with increasing rank.

Table 3: Comparison of the recommendation methods.

            S*      S_L     S_B     A_C     A_R     A_H
Avg. ϑ      1.000   1.409   1.235   1.249   1.172   1.159
Median ϑ    1.000   1.045   1.076   1.016   1.035   1.032
SD ϑ        0.000   3.674   0.427   2.458   1.382   0.382

Figure 3: Distribution of the rankings.

4.5 Evaluating the Time Series Generation
One central problem of machine learning is the inherent limitation to predict only what has been learned during the training phase. In other words, machine learning methods have a limited ability for extrapolation. This also holds true for our recommendation. Consequently, we try to consider as many time series with different characteristics as possible to improve the recommendation for unknown time series. Thus, we analyze in this section how the new time series generation affects the diversity of the time series characteristics. To this end, we collect for each time series characteristic the values from the original data and the new generated time series. Then, we normalize the data with a min-max-scaling between 0 and 1 for each time series characteristic for a comparable analysis. On top of this, we depict each characteristic in a spider chart (see Figure 4). In this diagram, the maximal values of the new data (grey) and original data (purple), and the minimum values of the new data (green) and original data (blue) are shown. Each edge of this chart represents a time series characteristic. For almost all characteristics, the newly generated time series expand the spectrum of the data both in terms of the maximum value and minimum value.

Figure 4: Time series generation result. (Spider chart over the characteristics length, frequency, standard deviation, remainder SD, proportion remainder, proportion season, mean period entropy, remainder kurtosis, Durbin-Watson, 2nd freq, 3rd freq, max spec, and num peaks.)

4.6 Evaluation of Forecast Accuracy
To investigate how well our forecasting framework performs, we compare the forecasting error (i.e., MAPE) of our approach with three state-of-the-art approaches that are briefly described in the following: ETS [12] is a statistical method and builds an exponential smoothing state space model consisting of trend, season, and error. Each component can be combined in an additive or multiplicative manner, or it may be skipped. tBATS [7] extends ETS using a trigonometric representation based on Fourier series for the season and an ARMA model for the error. Further, the data is transformed with a Box-Cox transformation. sARIMA [11] determines the orders of the autoregressive model, the moving average model, and the differentiation. sARIMA models one seasonal pattern, and each non-seasonal component of the ARIMA model is extended with its seasonal counterpart. Table 4 lists the average, median, and standard deviation of the forecast error for all 100 splits. Each of our approaches exhibits a lower average MAPE and standard deviation than the state-of-the-art methods. The worst average MAPE (56.96%) is achieved by ETS.
In contrast, tBATS has the lowest median MAPE (10.83%), followed by A_C (12.31%), while ETS again shows the highest median error. To sum up, our approaches are equally accurate in terms of the median forecast error but have a lower average and standard deviation of the forecast error than the state-of-the-art methods.

Table 4: Comparison of the forecast error.

MAPE     A_C     A_R     A_H     ETS      tBATS   sARIMA
Avg.     24.40   23.26   23.68   56.96    36.28   28.12
Median   12.31   13.07   13.18   14.47    10.83   13.00
SD       50.31   40.41   38.52   136.22   98.68   64.72

5 RELATED WORK
To face the "No-Free-Lunch Theorem", i.e., to minimize the variance of monolithic forecasting methods, many hybrid mechanisms and forecast recommendation systems have been developed. The first idea of selecting a forecasting method based on rules was introduced by Collopy and Armstrong in 1992 [6]. In their work, they manually created an expert system. The rules are based on 18 time series characteristics and include four methods. However, this rule set was created by human experts, and each modification requires human interaction. In 2009, Wang et al. introduced two approaches for forecasting method recommendation [18]. Firstly, they propose hierarchical clustering and self-organizing maps; secondly, a decision tree technique is applied. The generated rules are based on 13 time series characteristics and cover four methods. Unfortunately, the proposed rules were not evaluated. In 2010, Lemke and Gabrys investigated the applicability of different meta-learning approaches [13]. In their work, they use 17 time series and six error characteristics while using eight methods and seven combination approaches. In 2018, Talagala et al. proposed in a technical paper a feature-based forecast-model selection [17]. To this end, they simulate time series that are generated by fitting exponential smoothing

forecasts. For the recommendation of the best-suited method, we introduce three different approaches, and in addition to time series characteristics from the literature, we propose our own characteristics. In an extensive evaluation, we compare the three proposed recommendation approaches, investigate the impact of time series generation, and compare the forecasting framework with state-of-the-art methods. Although the proposed recommendation approaches perform equally well, our approach achieves the best forecasting accuracy in comparison with the state-of-the-art techniques.

ACKNOWLEDGEMENTS
This work was co-funded by the German Research Foundation (DFG) under grant No. (KO 3445/11-1) and the IHK (Industrie- und Handelskammer) Würzburg-Schweinfurt.

REFERENCES
[1] Ratnadip Adhikari and R. K. Agrawal. 2013. An Introductory Study on Time Series Modeling and Forecasting. CoRR abs/1302.6613 (2013).
[2] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[3] Leo Breiman, Joseph H. Friedman, R. A. Olshen, and C. J. Stone. 1983. Classification and Regression Trees.
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In ACM SIGKDD 2016. ACM, 785–794.
[5] Robert B. Cleveland, William S. Cleveland, Jean E. McRae, and Irma Terpenning. 1990. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics 6, 1 (1990), 3–73.
[6] Fred Collopy and J. Scott Armstrong. 1992. Rule-based forecasting: Development and validation of an expert systems approach to combining time series extrapolations. Management Science 38, 10 (1992), 1394–1414.
[7] Alysha M. De Livera, Rob J. Hyndman, and Ralph D. Snyder. 2011. Forecasting time series with complex seasonal patterns using exponential smoothing. J. Amer. Statist. Assoc. 106, 496 (2011), 1513–1527.
[8] Harris Drucker, Christopher J. C. Burges, Linda Kaufman, Alex J. Smola, and Vladimir Vapnik. 1997. Support vector regression machines. In Advances in Neural Information Processing Systems. 155–161.
[9] Thomas Grubinger, Achim Zeileis, and Karl-Peter Pfeiffer. 2014. evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R. Journal of Statistical Software 61, 1 (2014), 1–29.
[10] Rob Hyndman, George Athanasopoulos, Christoph Bergmeir, Gabriel Caceres, Leanne Chhay, Mitchell O'Hara-Wild, Fotios Petropoulos, Slava Razbash, Earo Wang, and Farah Yasmeen. 2018. forecast: Forecasting functions for time series and linear models. http://pkg.robjhyndman.com/forecast R package version 8.4.
[11] Rob J. Hyndman and George Athanasopoulos. 2014. Forecasting: principles and
and ARIMA models to the original data. A random forest classifier practice. OTexts, Melbourne, Australia.
[12] Rob J Hyndman, Anne B Koehler, Ralph D Snyder, and Simone Grose. 2002. A
is then used to map 25 to 30 time series characteristics (depending state space framework for automatic forecasting using exponential smoothing
on the time series) to the best forecast method. In their work, they methods. International Journal of forecasting 18, 3 (2002), 439–454.
consider seven methods. As the work of Wang et al. [18] were not [13] Christiane Lemke and Bogdan Gabrys. 2010. Meta-learning for time series
forecasting and forecast combination. Neurocomputing 73, 10-12 (2010), 2006–
evaluated, Züfle et al. investigate and compare these rules to two 2016.
proposed dynamic recommendation algorithms [20]. [14] Steven M Pincus, Igor M Gladstone, and Richard A Ehrenkranz. 1991. A regularity
In contrast to the related work that only introduce the selection of statistic for medical data analysis. Journal of clinical monitoring 7, 4 (1991), 335–
345.
the best forecasting method, we propose an overarching framework [15] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro-
that combines the selection of the best method and the forecast itself. gush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical
features. In Advances in Neural Information Processing Systems. 6638–6648.
While the aforementioned works use solely statistical methods, the [16] J Ross Quinlan. 1993. Combining instance-based and model-based learning. In
focus in this work lies in machine learning-based regressor methods. Proceedings of the tenth international conference on machine learning. 236–243.
Further, for the evaluation, we use a highly diverse data set. Further, [17] Priyanga Talagala, Rob Hyndman, George Athanasopoulos, et al. 2018. Meta-
learning how to forecast time series. Technical Report. Monash University, De-
our selection mechanism creates also new time series by combining partment of Econometrics and Business Statistics.
actual time series to increase the diversity of the data set. [18] Xiaozhe Wang, Kate Smith-Miles, and Rob Hyndman. 2009. Rule induction for
forecasting method selection: Meta-learning the characteristics of univariate
time series. Neurocomputing 72, 10âĂŞ12 (2009), 2581 – 2594.
6 CONCLUSION [19] D. H. Wolpert and W. G. Macready. 1997. No free lunch theorems for optimization.
In this work, we propose an automated forecasting framework that IEEE Transactions on Evolutionary Computation 1, 1 (Apr 1997), 67–82.
[20] Marwin Züfle, André Bauer, Veronika Lesch, Christian Krupitzer, Nikolas Herbst,
(i) extracts characteristics from a given time series, (ii) selects the Samuel Kounev, and Valentin Curtef. 2019. Autonomic Forecasting Method Selec-
best-suited machine learning method based on recommendation, tion: Examination and Ways Ahead. In Proceedings of the 16th IEEE International
Conference on Autonomic Computing (ICAC). IEEE.
and finally, (iii) performs the forecast. Our approach offers the ben-
efit of not relying on a single method with its possibly inaccurate
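For clarity, the error metric aggregated in Table 4 is the mean absolute percentage error (MAPE), summarized per method by its average, median, and standard deviation over all evaluated series. The following standard-library Python sketch illustrates the computation; the data values are hypothetical and this is not the framework's actual evaluation code:

```python
import statistics

def mape(actual, forecast):
    """Mean absolute percentage error in percent (assumes no zero actuals)."""
    return 100.0 * statistics.mean(
        abs(a - f) / abs(a) for a, f in zip(actual, forecast)
    )

# Hypothetical per-series errors for one forecasting method (illustration only).
per_series_mape = [
    mape([100, 120, 140], [110, 115, 150]),
    mape([80, 90, 100], [85, 95, 90]),
]

# The three summary rows reported in Table 4.
summary = {
    "Avg.": statistics.mean(per_series_mape),
    "Median": statistics.median(per_series_mape),
    "SD": statistics.stdev(per_series_mape),  # sample standard deviation
}
```

Reporting the median alongside the average matters because MAPE is unbounded above: a few badly forecast series can inflate the mean and standard deviation of a method whose typical accuracy is good, as Table 4 suggests for tBATS (best median, but high average and SD).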
55
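To make the recommendation step of the framework concrete: both our approach and the feature-based approaches discussed in Section 5 map a vector of time series characteristics to the method expected to perform best. The sketch below uses a deliberately simplified 1-nearest-neighbour look-up in characteristics space; the feature names, values, and method labels are hypothetical, and the recommenders evaluated in the paper are learned classifiers rather than this toy look-up:

```python
import math

# Hypothetical training data: characteristics of known series -> best method.
# Features: (strength of seasonality, strength of trend, spectral entropy).
training = [
    ((0.9, 0.2, 0.3), "Random Forest"),
    ((0.1, 0.8, 0.5), "XGBoost"),
    ((0.5, 0.5, 0.9), "SVR"),
]

def recommend(characteristics):
    """Return the method that was best for the most similar known series."""
    _, best_method = min(
        training,
        key=lambda pair: math.dist(pair[0], characteristics),  # Euclidean distance
    )
    return best_method

# A strongly seasonal series with little trend:
print(recommend((0.85, 0.1, 0.2)))  # -> "Random Forest"
```

The design point this illustrates is that the recommendation is learned offline from series whose best method is already known, so choosing a method for a new series reduces to a cheap look-up or classifier call at forecast time.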