A Novel Active Multi-Source Transfer Learning Algorithm For Time Series Forecasting
https://doi.org/10.1007/s10489-020-01871-5
Abstract
In Time Series Forecasting (TSF), researchers usually assume that enough training data can be obtained, with the old and new data satisfying the same distribution. However, time series data often exhibit time-varying characteristics, which lead to relatively large differences between old and new data. As is well known, single-source TSF Transfer Learning (TL) faces the problem of negative transfer. Addressing this issue, this paper proposes a new Multi-Source TL algorithm, abbreviated as the MultiSrcTL algorithm, and a novel Active Multi-Source Transfer Learning algorithm, abbreviated as the AcMultiSrcTL algorithm, with the latter integrating Multi-Source TL with Active Learning (AL) and taking the former as its sub-algorithm. We introduce domain adaptation theory into this work, and accordingly analyze the expected target risk of TSF under the multi-source setting. For the development of MultiSrcTL, we make full use of source similarity and domain dependability, using the Maximum Mean Discrepancy (MMD) statistical indicator to measure the similarity between domains, so as to promote better transfer. A domain relation matrix is constructed to describe the relationships between source domains, so that the source-source and source-target relations are adequately considered. In the design of AcMultiSrcTL, Kullback-Leibler divergence is used to measure the similarity of related indicators to select the appropriate source domain. The uncertainty sampling method and the distribution match weighting technique are integrated, yielding a new sample selection scheme. The empirical results on six benchmark datasets demonstrate the applicability and effectiveness of the two proposed algorithms for multi-source TSF TL.
Keywords Time series forecasting (TSF) · Transfer learning (TL) · Multi-source transfer learning (MSTL) · Active multi-source transfer learning (AMSTL)
current task, and successfully transfer it to the current prediction task, is particularly important.
We have investigated Transfer Learning (TL) algorithms for TSF in [8], which mainly focuses on solving the severe challenges caused by the transfer of knowledge over a long period of time. Different from traditional supervised learning algorithms, the TL paradigm does not strictly require the training data and testing data to follow the same distribution [9–11]. It refers to the transfer of knowledge gained from the source domain to another different but related target domain, so as to realize the reuse of knowledge.
Compared with traditional TSF methods, the investigation conducted in [8] fully considers the old data from a long time ago and innovatively establishes a TSF transfer learning framework. Based upon this framework, a novel algorithm fusing the transfer learning scheme, Online Sequential Extreme Learning Machine with Kernels (OS-ELMK) [12, 13], and the Ensemble Learning (EL) paradigm, termed TrEnOS-ELMK [8] for short, is constructed. With TrEnOS-ELMK, the knowledge learned from the old data can be effectively used to solve the current predictive task.
For traditional TSF problems, it is generally assumed that the data arrive in the same batch, and that the training data and testing data follow the same distribution. However, in the problem setting considered in this paper, the training data come from the old data, which are far away from the current time, and the distributions of the old data and the current data differ. To solve this problem, inspired by [8], the research work carried out in this paper also applies the TL scheme to TSF tasks, attempting to make full use of the knowledge learnt from the old data and transfer it to new prediction tasks. The key difference between the research work of this paper and that of [8] lies in that TrEnOS-ELMK implements single-source TSF TL, while this work realizes multi-source TSF TL.
The motivation for applying multi-source TL to the study of TSF is elaborated as follows. The purpose of TL is to mine the potential knowledge within the source domain, so as to help form the target domain predictor [9]. However, the transferability between the source and target domains determines whether transfer learning can be successfully realized [9, 14, 15]. Negative transfer often occurs if the correlation between the source and target domains is relatively low. To address this concern, researchers have turned to a more realistic transfer learning paradigm, i.e., Multi-Source Transfer Learning, which has demonstrated its effectiveness for transfer learning tasks [9, 14–20]. TSF transfer learning is also often faced with the above issue. The distribution of time series data usually changes gradually and significantly over time, and therefore, single-source TSF TL may also be confronted with the challenge of negative transfer. This is precisely the motivation why we study multi-source TSF TL in this work.
For designing a desirable multi-source TSF TL algorithm, a multi-source TL method proposed by the authors of [20] caught our attention, which transfers knowledge learned from multiple source domains to the target domain. Their algorithm is similar to the multi-task learning paradigm, training a matrix of inter-source relationships for the multi-source TL problem so as to formulate all source domains jointly in one framework. Inspired by both their research work in [20] and the TL framework proposed by us in [8], we propose a novel Multi-Source TL (MultiSrcTL) algorithm for single-step-ahead TSF in this work.
To further improve forecasting performance, a novel Active Multi-Source Transfer Learning (AcMultiSrcTL) algorithm combining multi-source TL and AL is proposed, which takes the MultiSrcTL algorithm as its sub-algorithm, specifically for single-step-ahead TSF. The problem setting of multi-source domains brings certain challenges to the selection of samples without true values. Therefore, this work designs an active learning sample selection scheme applicable to multi-source domains to select the most representative samples for fine-tuning the model. Since not all samples in each source domain are used for training, this method saves model training time to a certain extent.
As far as we know, there are few researches on multi-source TSF TL algorithms, and even fewer on constructing an active multi-source TL algorithm and applying it to TSF. The existing investigations on TL, AL, and the fusion of TL and AL mainly focus on pattern classification problems. Consequently, the research work in this paper possesses relatively strong originality.
The key technologies and novelties of the proposed MultiSrcTL sub-algorithm are summarized as below:
Firstly, source similarity and domain dependability are made the best of. The Maximum Mean Discrepancy (MMD) statistical indicator is used to measure the similarity between the source domain and the target domain, so as to achieve a promising transfer learning effect.
Secondly, incorporating the idea of multi-task learning, a relation matrix is formed to describe the relations between source domains, constructing all source domains jointly in one framework, so that the source-source and source-target relations are taken full advantage of.
The critical technologies and originalities of the proposed entire AcMultiSrcTL algorithm are summarized as follows:
Firstly, Kullback-Leibler (KL) divergence is employed to measure the similarity between the current ratio of the data with true values within each source and the uniform distribution, so as to implement appropriate selection of the source domain.
Secondly, the AcMultiSrcTL algorithm is designed to exploit, simultaneously, source domains with both high and low ratios of data with true values, with the former aiming at implementing dependable inquiry, and the latter aiming at acquiring high marginal benefit of prediction performance.
Thirdly, the uncertainty sampling approach and the distribution match weighting technique are combined to formulate a new sample selection scheme for selecting the most representative and informative samples.
The difference between the research work carried out in [20] and this paper mainly lies in that the former focuses on classification problems, while the research focus of the latter is multi-source TSF TL issues.
In this paper, a synthetic dataset, a natural dataset [21, 22] and four financial datasets [23] are used as the experimental datasets to verify the predictive performance of our algorithms, i.e., the Mackey-Glass, Zuerich monthly sunspot numbers (SunSpot), Dow Jones Industrial Average (DJI), Nikkei 225 (^N225), Johnson Outdoors Inc. (JOUT), and Advanced Micro Devices, Inc. (AMD) datasets. It has been verified through the empirical investigation that the two proposed algorithms, i.e., the MultiSrcTL algorithm and the AcMultiSrcTL algorithm, possess significantly improved prediction accuracy compared with several other state-of-the-art TSF algorithms. At the same time, it can be observed from the experimental results that the predictive performance of our proposed AcMultiSrcTL algorithm is superior to that of the MultiSrcTL algorithm in most cases.
The rest of the paper is organized as follows. Section 2 briefly introduces the existing time series forecasting algorithms and background knowledge of transfer learning and active learning. Section 3 presents the exhaustive derivation of the proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm. In Section 4, the proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm are described at length. In Section 5, experimental results and analysis are given. Finally, in Section 6, the algorithms proposed in this paper are summarized and prospected.

2 Related work

2.1 Existing state-of-the-art techniques for time series forecasting

TSF was first studied in the field of mathematics. The existing TSF techniques can be divided into linear prediction techniques and nonlinear prediction ones.
Linear prediction techniques: The TSF models falling into this category are constructed in the form of a linear function. For example, the earliest proposed Autoregressive (AR) model aims to minimize the squared error between the predicted results and the actual results. Based on AR, some new linear prediction models have been put forward, including the Moving Average (MA) model and the Autoregressive Moving Average (ARMA) model [2], with the latter being extensively applied in various TSF tasks. However, linear models can only process linear and stationary time series data. For nonlinear and non-stationary data, researchers try to convert them into smooth time series data to acquire relatively desirable prediction results, for example, in the ARIMA model [3]. Then, with the development of machine learning theory, nonlinear prediction techniques were proposed.
Nonlinear prediction techniques: Since time series data are often nonlinear and non-stationary in real-world applications, traditional linear prediction techniques have difficulties in adapting to these situations. Only by means of difference processing can stable time series data be obtained, which is not enough to fully express the changes of the time series data over time. At the same time, if the time series contains wrong or missing data, difference processing cannot be carried out. To solve this problem, non-stationary TSF algorithms were proposed, such as Support Vector Machine (SVM) [5], Extreme Learning Machine (ELM) [13], and Artificial Neural Networks (ANNs) [24]. Empirical results presented in [25, 26] show that, compared with linear TSF algorithms, nonlinear ones usually have better and more reliable performance.
Numerical methods are widely used in ARMA for optimization, which have high computational cost and may fall into local minima. To solve this problem, in [2], a two-level architecture was proposed, where the ARMA coefficients are estimated using low-level Evolutionary Algorithms (EA) with real coding, and all possible ARMA model spaces are automatically searched by the EA. The entire evolutionary process is guided by the Bayesian Information Criterion (BIC) to prevent overfitting. In [3], a TSF method combining ARIMA and Genetic Programming (GP) was proposed to ameliorate the problem that nonlinear time series cannot be accurately predicted. The proposed method utilized the advantages of the ARIMA and GP models in both linear and nonlinear modeling.
Based on the conventional wavelet neural network, a local linear wavelet neural network for solving linear TSF problems is proposed in [27], and a hybrid Particle Swarm Optimization (PSO) algorithm combined with diversity and gradient descent learning methods is constructed. A new method is proposed in [28] to predict chaotic time series, where an Ant Colony Optimization (ACO) paradigm is used to analyze the topology of a given time series; the Lorenz system and the Mackey-Glass equation are used to verify the performance of the algorithm. ANNs have good performance in TSF. However, when an ANN is used to deal with corresponding problems, a series of parameters needs to be set. To address this issue, a method of using a genetic algorithm to automatically search for proper parameter values was proposed, so as to facilitate the application of ANNs in TSF tasks [24].
In recent years, researchers have tried to combine traditional ANNs with ARIMA to achieve better TSF performance. In the new model built by M. Khashei et al. [29], the unique advantages of ARIMA in linear modeling were utilized to identify the linear structures existing in the data, and an ANN was used to capture the data generation process.
Recently, deep learning models have shown good performance in many data mining fields, including TSF, such as LSTM [6], GRU [7], and the Deep Belief Network (DBN) [30]. In [30], T. Kuremoto et al. proposed a TSF model based upon Deep Belief Nets (DBNs) developed by Hinton and Salakhutdinov, which are probabilistic generative NNs constituted of multiple layers of Restricted Boltzmann Machines (RBMs). A deep network composed of three layers of RBMs is constructed to acquire the characteristics of the time series input space, where the Back-Propagation (BP) algorithm is employed to fine-tune the weight vectors of the RBMs. During training, the Particle Swarm Optimization (PSO) algorithm is utilized to determine the structures of the NNs and the learning rates.

2.2 Application of SVM in TSF

SVM is a nonlinear machine learning algorithm, which was originally proposed to solve pattern classification problems. SVM is a kind of generalized linear classifier that classifies data in the manner of supervised learning, with its decision boundary being the maximum-margin hyperplane for the learning samples. Then, driven by the needs of practical applications, Support Vector Regression (SVR) was proposed for solving regression problems, and it also has good application prospects in TSF tasks. In low-dimensional space, there is always the problem of linear inseparability. To solve this issue, kernel functions are introduced into SVM and SVR, which map the low-dimensional space onto a high-dimensional space, so that linear separability can be achieved. Common kernel functions include the linear kernel function, the polynomial kernel function, the Gaussian kernel function, and the sigmoid one.
The aim of SVR is to find a regression plane such that all the data of a set are closest to the plane. Traditional regression methods consider the prediction correct only if the regression f(x) is equal to y; for example, the loss is often calculated as (f(x) − y)² in linear regression. In contrast, SVR holds that as long as the deviation between f(x) and y is not too large, the prediction can be considered correct, and it is not necessary to calculate the loss. The specific mode of operation is to set a threshold α and calculate the loss only for the data points that satisfy |f(x) − y| > α; the prediction of the model is regarded as accurate for the data points inside the decision boundary. Because the optimization goal of SVR is to minimize structural risk, it has good generalization ability and has been widely used in the field of TSF.
The authors of [31] proposed a good solution for nonlinear TSF, which no longer uses the kernel trick in nonlinear processing, but performs linear SVR in the high-dimensional reservoir state space. At the same time, the principle of structural risk minimization is adopted to solve TSF problems. In [32], the Least Squares Support Vector Machine (LS-SVM) with multiple types of kernels was proposed to solve the TSF problem, and was applied to the Mackey-Glass time series with uniform Gaussian noise, achieving good effect.
L. J. Cao et al. [33] investigated the application of SVM in financial TSF. The feasibility of SVM in financial TSF was examined by comparing SVM with the Back Propagation (BP) neural network and the Radial Basis Function (RBF) neural network. They further studied the variability in performance of SVM relative to the free parameters, so as to fully consider the non-stationarity of financial time series. Through simulation experiments, the conclusion that SVM is superior to the other NN models in financial TSF was reached. Section 1 of [33] briefly introduces the advantages of SVM and the feasibility of its application in financial TSF; the adaptive parameters SVM (ASVM) is then proposed. Section 2 sets forth the related theory and mathematical support of SVM in regression approximation. In order to explain the feasibility of the application of SVM in TSF, Section 3 compares the experimental results achieved by SVM with those obtained by BP and RBF. By changing one free parameter at a time, the optimal values of the three free parameters and the number of support vectors of SVM are investigated in Section 4. Section 5 discusses and proposes the Adaptive Parameters SVM (ASVM) to make the predictions more accurate. The last section summarizes their work.
From the elaboration of reference [33], it can be found that the application of SVM in TSF is important. Therefore, SVM is used as the base model for the two algorithms proposed in this paper.

2.3 Transfer learning (TL) and active learning (AL)

2.3.1 Transfer learning (TL)

Unlike traditional machine learning algorithms, TL breaks down the assumption that training data and testing data must follow the same distribution. The TL scheme is designed to solve the problem of using previously known empirical knowledge to develop new models for other different but relevant fields. Ever since its emergence, researchers have carried out a lot of exploration and investigation on TL. A method called Cross-Domain Support Vector Machine (CDSVM) was proposed in [34], which uses the support vectors extracted from the source data to find a reasonable boundary for the target data.
In [35], a new deep learning architecture for the multi-label zero-shot learning (ZSL) method was proposed, which can predict the labels of multiple unknown classes for each input instance.
In addition, in TL, instances of the unseen (target) class during the training phase are often classified as one of the seen (source) classes when tested; therefore, their performance under generalized ZSL settings is very poor. In [36], a new deep learning architecture was proposed for multi-label ZSL. The proposed algorithm can be used to solve multi-label classification and ZSL problems by constructing a framework to describe the relationships between multiple labels.

2.3.2 Active learning (AL)

AL is a machine learning method proposed by researchers in recent years. The AL paradigm queries the most useful unlabeled samples through certain algorithms, gives these samples to experts for labeling, and then uses the queried samples to train the learner. Compared with traditional supervised learning models, AL can deal well with large training sets. By selecting the most informative instances, AL can reduce the amount of training data and the cost of manual labeling.
The research work carried out in [37] is essentially different from the traditional active selection strategy; it overcomes the deficiency in cross-domain generalization ability of manually designed selection strategies and transforms the active selection strategy into a regression problem. The strategies have significant effects on real datasets in several different domains, including Striatum, Magnetic Resonance Imaging (MRI), Credit Card, Splice, and Higgs. Person re-identification is regarded as a binary classification problem in [38], and the most valuable samples are selected by an SVM-based AL framework for training. In [39], an AL method based on SVM was proposed, which selects unlabeled samples from the training set based on uncertainty and diversity, and the generated classifier can be applied to the recognition of high-resolution images.

2.3.3 Researches on the combination of TL and AL

Both TL and AL are proposed to solve the problem of lack of training samples; however, these two machine learning paradigms give different solutions. TL is a machine learning method that improves learning by transferring knowledge learned from related domains to target domains. The main principle of AL is to actively select a small number of core samples, which can provide more important and crucial information to the learner.
However, TL and AL still have their own shortcomings. AL has a large demand for training samples, while the cost of obtaining training samples in certain fields is high. Although TL can obtain training samples at zero cost, the distribution of the obtained sample data is often greatly different from that of the target domain, which brings considerable difficulties to the application of knowledge and may even cause negative transfer. In recent years, the combination of the TL and AL paradigms has aroused researchers' interest and wide attention.
In [40], a new Active Incremental Fine-Tuning (AIFT) algorithm was proposed to solve the problem of insufficient labeled data in the field of medical image processing, where AL and TL are integrated. The AIFT algorithm starts with a completely unlabeled dataset and does not require initial labeled data. It improves the learner step by step through continuous fine-tuning rather than repeated retraining. Simultaneously, it can also mine the consistency of the patches of each candidate sample to select the candidate set worthy of labeling. The authors of [41] proposed an active transfer learning framework, which utilizes labeled data from related tasks to improve the performance of AL machines. In the research work conducted in [42], domain experts are required by an active learner to label a few most informative target samples for TL, so that the trained learner can be applied to the classification of multi-view head-pose.

3 Exhaustive derivation of the proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm

This work attempts to make full use of old data far away from the prediction point, rather than simply and directly discarding the old data. Inspired by both the multi-source TL scheme employed in [20] and the single-source TL framework proposed by us in [8], we establish here a novel multi-source TL framework ad hoc for the research of TSF. By calculating the dependability of each source domain and the approximation between each of the source domains and the target domain, we can balance the relationship between them and achieve more effective transfer, so as to better realize TSF TL tasks. The formation of the multi-source TL framework, the proposal of the MultiSrcTL algorithm, and the further proposal of the AcMultiSrcTL algorithm reflect the originalities of this work. To the best of our knowledge, few researches have been conducted on TSF TL problems. The existing researches on TL, AL, and the combination of TL and AL mainly focus on the study of pattern classification problems. There are even fewer researches on multi-source TSF TL algorithms, and on designing an active multi-source TL algorithm and applying it to TSF.
In Table 1, we summarize the frequently used notations in the proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm.

3.1 Mathematical derivation of the proposed MultiSrcTL sub-algorithm
Table 1 Frequently used notations and their descriptions

Notation        Description
H               Hypothesis space
χ               Instance space
A_HΔH           Subsets of χ that are the support of some hypotheses in HΔH
Pr∗[A]          Probability of the happening of event A under the corresponding distribution
ε∗(h)           Expected risk of h in the corresponding domain
ε̂∗(h)           Empirical risk of h in the corresponding domain
N_k^old_U       Number of the samples without true values in the kth source domain
N_k^old_L       Number of the samples with true values in the kth source domain
N_T             Number of samples in the target domain
λ_i             Combination risk of the ith ideal hypothesis
m′              Number of instances without true values in every source domain and the target domain
m               Sum of instances with true values in all the source domains
b               Confidence tolerance
n               Number of sub-domains within the original source domain
D_i^old         The ith source domain data (the old dataset)
D_i^old_L       The instance subset with true values in the ith source domain
D_i^old_U       The instance subset without true values in the ith source domain
D^new           The target domain (the new dataset)
In the problem setting addressed in this work, the distributions of the training set and the testing set are different; therefore, the transfer learning paradigm is employed to deal with this problem of inconsistent distribution. Next, we first introduce the domain adaptation theory demonstrated in [43, 44], and then apply it to our work, so as to analyze the risk of the target domain under the setting of multiple source domains in TSF.
Under the problem setting of inconsistent distribution, what is required first is to measure the distance between two distributions D and D′. For the regression task, a natural method is to measure the discrepancy between distributions. For this purpose, a symmetric difference hypothesis space HΔH is defined as below:

HΔH = { h(x) ⊕ h′(x) : h, h′ ∈ H }    (1)

where ⊕ is the XOR operator. H denotes a hypothesis space for the instance space χ, which has finite VC-dimension [20, 43, 44]. Then, the discrepancy between the two distributions D and D′ can be calculated as [20, 43, 44]:

d_HΔH(D, D′) = 2 sup_{A ∈ A_HΔH} | Pr_D[A] − Pr_D′[A] |    (2)

where A_HΔH denotes the set of subsets of χ that are the support of some hypotheses in HΔH, Pr_D[A] is the probability of the happening of event A under the distribution D, Pr_D′[A] is the probability of the happening of event A under the distribution D′, and sup(·) represents the supremum.
The empirical distribution discrepancy between the ith source domain and the target domain is computed as d̂_HΔH(D_i^old_U, D^new), where D_i^old represents the ith source domain, and also the ith old data domain. D_i^old_U denotes the subset of instances without true values in the ith source domain, and D_i^old_L denotes the subset of instances with true values in the ith source domain, respectively. D^new denotes the target domain, and also the new data domain.
The expected risk ε_D^old(h, p) of a hypothesis is defined as the probability, according to the distribution of the source domain, that a hypothesis h disagrees with the prediction function p, which can be shortened to ε_D^old(h). The empirical risk of a hypothesis on the source domain can be expressed as ε̂_D^old(h). The expected risk and empirical risk of the other source domains and of the target domain can then be defined by utilizing the same method, respectively.
To simplify the problem, it can be assumed that the multiple source domains, generated by separating the time series data, have the same number of samples without true values. The target risk is similar to the source risk and is affected by the source risk, the prediction functions of the source domain and target domain, and the distributions of D_i^old and D^new. The risk boundary on the target domain can then be calculated accordingly. All in all, the ultimate goal of the learner is to find a hypothesis that minimizes the target risk ε_D^new(h), by making full use of the domain adaptation theory. This is also the purpose of the derivation of the expected target domain risk, in the setting of multiple source domains, in this section.
According to [43, 44], we first re-write the ideal hypothesis in the setting of single source domain as:
h* = argmin_{h ∈ H} [ ε_D^new(h) + ε_D^old(h) ]    (3)

Based on Eq. (3), the ideal hypothesis of the ith source and target domains can be expressed as:

h*_i = argmin_{h ∈ H} [ ε_D^new(h) + ε_{D_i^old}(h) ]    (4)

Then, according to Eq. (4), the combination risk of the ith ideal hypothesis can be defined as:

λ_i = ε_D^new(h*_i) + ε_{D_i^old}(h*_i)    (5)

where λ_i represents the combination risk of the ith ideal hypothesis. The ideal hypothesis clearly embodies the concept of adaptability. If the ideal hypothesis does not perform well, it cannot be expected that a good target learner can be achieved by minimizing the source error. According to the ideal hypothesis and the combination risk of the single-source TL model, the ideal hypothesis and the combination risk of the multi-source TL model can be deduced by analogy as:

h*_{ν,σ} = argmin_{h ∈ H} { ε_D^new(h) + Σ_{i=1}^{K} [ ν_i σ + (1−ν_i)(1−σ)/(K−1) ] ε_{D_i^old}(h) }    (6)

λ_{ν,σ} = ε_D^new(h*_{ν,σ}) + Σ_{i=1}^{K} [ ν_i σ + (1−ν_i)(1−σ)/(K−1) ] ε_{D_i^old}(h*_{ν,σ})    (7)

The combination risk λ_{ν,σ} reflects the adaptability between the weighted sources and the target domain.
Now, we can derive the risk bound on the target domain (i.e., the new data domain). Based on the research work conducted in [43, 44], the risk bound on the new data domain, in the setting of a single source domain, can be rewritten as Eq. (8):

ε_D^new(h) ≤ ε_D^old(h) + (1/2) d̂_HΔH(D^old_U, D^new) + 4 √[ (2d log(2m′) + log(4/δ)) / m′ ] + λ    (8)

where m′ represents both the size of D^old_U and that of D^new. In Eq. (8), it is assumed that all domains have the same amount of instances without true values, which is a reasonable assumption, because these instances are relatively easy to obtain. Based on Eq. (8), the expected risk of the target domain, in the setting of multi-source TL, can be derived as:

ε_D^new(ĥ) = ε_D^new( Σ_{i=1}^{K} ν_i h_i ) = Σ_{i=1}^{K} ν_i ε_D^new(ĥ_i)
           = Σ_{i=1}^{K} ν_i [ σ ε_D^new(ĥ_i) + (1−σ)/(K−1) Σ_{j=1, j≠i}^{K} ε_D^new(ĥ_j) ]    (9)

Applying the single-source bound of Eq. (8) to each term then gives:

ε_D^new(ĥ) ≤ Σ_{i=1}^{K} ν_i { σ [ ε_{D_i^old}(ĥ_i) + (1/2) d̂_HΔH(D_i^old_U, D^new) + 4 √( (2d log(2m′) + log(4/δ)) / m′ ) + λ_i ]
           + (1−σ)/(K−1) Σ_{j=1, j≠i}^{K} [ ε_{D_j^old}(ĥ_j) + (1/2) d̂_HΔH(D_j^old_U, D^new) + 4 √( (2d log(2m′) + log(4/δ)) / m′ ) + λ_j ] }    (10)
Bounding each expected source risk in Eq. (10) by its empirical estimate, and collecting the ideal-hypothesis terms into λ_{ν,σ}, yields:

ε_D^new(ĥ) ≤ Σ_{i=1}^{K} ν_i { σ [ ε̂_{D_i^old}(ĥ_i) + (1/2) d̂_HΔH(D_i^old_U, D^new) ]
           + (1−σ)/(K−1) Σ_{j=1, j≠i}^{K} [ ε̂_{D_j^old}(ĥ_j) + (1/2) d̂_HΔH(D_j^old_U, D^new) ] }
           + 4 √[ (2d log(2m′) + log(4/δ)) / m′ ] + λ_{ν,σ}    (11)

           ≤ Σ_{i=1}^{K} ν_i { σ [ ε̂_{D_i^old}(ĥ_i) + (1/2) d̂_HΔH(D_i^old_U, D^new) ]
           + (1−σ)/(K−1) Σ_{j=1, j≠i}^{K} [ ε̂_{D_j^old}(ĥ_j) + (1/2) d̂_HΔH(D_j^old_U, D^new) ]
           + √[ σ²/ξ_i + ((1−σ)/(K−1))² Σ_{j=1, j≠i}^{K} (1/ξ_j) ] · √[ (d log(2m) − log δ) / (2m) ] }
           + 4 √[ (2d log(2m′) + log(4/δ)) / m′ ] + λ_{ν,σ}    (12)

           = Σ_{i=1}^{K} ν_i { σ [ ε̂_{D_i^old}(ĥ_i) + (1/2) d̂_HΔH(D_i^old_U, D^new) ]
           + (1−σ)/(K−1) Σ_{j=1, j≠i}^{K} [ ε̂_{D_j^old}(ĥ_j) + (1/2) d̂_HΔH(D_j^old_U, D^new) ] }
           + ( Σ_{i=1}^{K} ν_i √[ σ²/ξ_i + ((1−σ)/(K−1))² Σ_{j=1, j≠i}^{K} (1/ξ_j) ] ) √[ (d log(2m) − log δ) / (2m) ]
           + 4 √[ (2d log(2m′) + log(4/δ)) / m′ ] + λ_{ν,σ}

where ξ_i is associated with the proportion of the samples with true values within the ith source domain (cf. Eq. (13)).
Note that, by using the concentration factor σ, the quantity of samples with true values in each source domain, i.e., N_i^old_L, is replaced by the total amount of samples in all the source domains, i.e., m = Σ_{i=1}^{K} N_i^old_L. When ν_i = 0, the target domain data will be ignored and the bound will be the ideal one; however, the bound will then be accompanied by the empirical estimate of the source domain errors. On the contrary, when ν_i = 1, what we use is only the target domain data, and the bound will be the standard learning bound using only target data. At the same time, the bound can help to trade off a small amount of target domain data against a large amount of less relevant source domain data by selecting different ν_i values.
As we discussed above, for the TL setting of a single source domain, the authors of [43, 44] analyzed the expected risk of the target domain.
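As a sanity check on the form of Eq. (12) as reconstructed above (the exact constants should be read with that caveat): setting the concentration factor σ = 1 removes all peer-source terms, and the bound collapses to a ν-weighted combination of per-source bounds of the same shape as the single-source bound of Eq. (8):

\varepsilon_{D^{new}}(\hat{h}) \;\le\; \sum_{i=1}^{K} \nu_i \Big[ \hat{\varepsilon}_{D_i^{old}}(\hat{h}_i) + \tfrac{1}{2}\, \hat{d}_{H\Delta H}\big(D_i^{old\_U}, D^{new}\big) \Big] + \Big( \sum_{i=1}^{K} \tfrac{\nu_i}{\sqrt{\xi_i}} \Big) \sqrt{\tfrac{d\log(2m) - \log\delta}{2m}} + 4\sqrt{\tfrac{2d\log(2m') + \log(4/\delta)}{m'}} + \lambda_{\nu,1}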
Based on their theory presented in [43, 44], we derive the risk bound for the TL setting of multi-source domains considered in the research of TSF. Based upon this risk bound, we construct a robust model, which uses the TL paradigm to solve the problem, for the research of TSF, that the distributions of training data and testing data are different. Up to now, we have analyzed the expected target domain risk under the TL setting of multi-source domains applicable to TSF. This step of the research makes full use of the domain adaptation theory, based upon which we propose the MultiSrcTL algorithm.

3.2 Mathematical derivation of the proposed AcMultiSrcTL algorithm

This paper mainly investigates the multi-source TL problem of TSF. Therefore, different from the setting of a single source domain, where the AL module only needs to work in a single source, an AL module is designed in this work to implement two tasks: one is to select an appropriate source domain, and the other is to choose appropriate samples from this selected source domain for the usage of AL.
For selecting an appropriate source domain, it can be observed from Eq. (12) that, when the proportion of samples with true values in each source domain is uniform, the learner can achieve the best performance. Inspired by [40], this paper uses Kullback-Leibler (KL) divergence to measure the degree of similarity between the current ratio of the data with true values within each source and the uniform distribution, so as to implement proper selection of the source domain. Based on the KL divergence, we can get a Bernoulli random variable to choose the proper source, which will be elaborated in Section 4.2. Here, the degree of similarity measured by KL divergence can be written as:

D_KL(ξ ‖ uniform) = Σ_{i=1}^{K} ξ(i) log [ ξ(i) / ( Σ_{i=1}^{K} ξ(i) / K ) ] = Σ_{i=1}^{K} ξ(i) log [ K ξ(i) / Σ_{i=1}^{K} ξ(i) ]    (13)

where ξ(i) is the probability distribution of the ith source domain.
How to choose appropriate samples without true values from a corresponding source domain to query their true values is a more challenging task. When selecting samples, based on the consideration of performance, the algorithm wants to obtain the most informative samples. The authors of [20, 45] proposed the following density-weighted uncertainty sample selection scheme, which chooses the samples that influence the current error the most, rather than the samples that bring about the smallest future error:

x* = argmax_{x_i ∈ D^old_U} E[ (ŷ_i − y_i)² | x_i ] p(x_i)    (14)

where y_i and ŷ_i denote the true value and the predicted value of the instance x_i, respectively. In this manner, the scheme chooses the samples carrying the most information. Although the scheme cannot guarantee the smallest generalization error of the algorithm, it has a great chance to markedly boost its generalization performance.

4 The proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm

TL aims at contributing to the building of the target domain predictor by mining the potential knowledge acquired from the source domain. The transferability between the source and target domains is crucial for the successful implementation of TL [9, 14, 15]. When the degree of correlation between the source and target domains is rather low or even close to zero, single-source TSF TL might be faced with the challenge of negative transfer. This is just the motivation why we investigate multi-source TSF TL in this work.
This paper mainly solves two challenges of the multi-source TSF TL problem. The first one is how to construct a robust multi-source TSF TL model, which can transfer knowledge by exploiting different source similarities and domain dependabilities. The second challenge is how to use AL to promote better transfer under the scenario of uneven source domains, so that the prediction accuracy of the model can be further improved.
Incorporating the principle of source similarity and domain dependability, we make use of the source-source and source-target relations to solve the first challenge, transferring knowledge from the old data domains (source domains) to the new data domain (target domain). For addressing the second challenge, we integrate distribution matching with uncertainty sampling for selecting the most informative samples. After continuously selecting the most informative samples without true values, the AL paradigm will query their true values and apply these samples to continuously fine-tune the learner, so as to further deepen its learning procedure.
As the results of this work, the MultiSrcTL algorithm and, further, the ultimate AcMultiSrcTL algorithm, which takes MultiSrcTL as its sub-algorithm, are proposed, on the basis of their mathematical derivations presented in Sections 3.1 and 3.2, respectively.

4.1 The MultiSrcTL sub-algorithm

For investigating TSF tasks in this work, we try to build a multi-source transfer learning algorithm to solve the problem of different distributions between the old data and new data. If a source contains only a small number of samples with true values, that source, called a sparse source, will be difficult to use. The MultiSrcTL algorithm makes full use of the inter-source relationships by measuring the similarity and the dependability, and it can effectively utilize the sparse sources by taking the peer sources into consideration. By reasonably setting the confidence tolerance and employing the source-source relation matrix, the sparse source domain with true values can be better transferred.
The pseudocode of the MultiSrcTL algorithm is presented in detail in the following Algorithm 1.
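Alongside Algorithm 1 (whose full listing defines the complete training procedure), the following minimal Python sketch — not the authors' pseudocode — illustrates the two selection quantities derived in Section 3.2: the KL score of Eq. (13) over the per-source labeled ratios, and the density-weighted uncertainty score of Eq. (14). The squared-error expectation is approximated here by the variance of an ensemble of regressors, and the density p(x_i) by a user-supplied estimator; both are illustrative assumptions rather than choices prescribed by the paper.

import numpy as np

def kl_to_uniform(xi):
    # Eq. (13): D_KL(xi || uniform) = sum_i xi_i * log(K * xi_i / sum_j xi_j)
    # assumes strictly positive ratios xi_i
    xi = np.asarray(xi, dtype=float)
    return float(np.sum(xi * np.log(len(xi) * xi / xi.sum())))

def uncertainty_scores(candidates, predictors, density):
    """Eq. (14)-style score for unlabeled samples of one source domain.

    candidates: (n, d) array of samples without true values.
    predictors: list of fitted regressors; their disagreement stands in
                for E[(y_hat - y)^2 | x] (assumption).
    density:    callable returning p(x) per row, e.g. a fitted KDE (assumption).
    """
    preds = np.stack([m.predict(candidates) for m in predictors])  # (n_models, n)
    err_proxy = preds.var(axis=0)              # ensemble variance as uncertainty
    return err_proxy * density(candidates)     # density-weighted uncertainty

def select_sample(candidates, predictors, density):
    # x* = argmax of the density-weighted uncertainty score
    return int(np.argmax(uncertainty_scores(candidates, predictors, density)))

In the AcMultiSrcTL algorithm, Eq. (20) below replaces the density term p(x_i) by the distribution-match weights ν, but the selection logic remains the same argmax.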
The computational complexity of Algorithm 1, i.e., the MultiSrcTL algorithm, equals O((N^old + N^new)·O(n)) + O(N_i^old_L·O(n)), i.e., O(n²), where N^old denotes the number of samples in each old data domain (source domain), N^new is the number of samples in the new data domain (target domain), and N_i^old_L represents the number of samples with true values in the ith source domain.
In this paper, we use the MMD statistical indicator [46, 47] to measure the distribution difference between the new data domain and an old data domain. Firstly, the corresponding two distributions are mapped onto a Reproducing Kernel Hilbert Space (RKHS). Then, the difference of the means of the two distributions, i.e., the value of MMD, can be computed, which is used to determine how similar the two distributions are. This method can avoid complex density estimation. In the scenario of the TSF problem studied in this paper, the value of MMD can be computed as:

MMD(f, D_i^old, D^new) = min_{ν_i} ‖ Σ_{j=1}^{N_i^old} ν_i^j f(D_i^old) − (1/N^new) Σ_{k=1}^{N^new} f(D^new) ‖    (15)

where f(x) is a feature mapping onto the RKHS H_s, and ν_i^j represents the weights of the ith source sub-domain data. Through exploiting the kernel approach, the optimization in Eq. (15) can be seen as a quadratic programming issue, which is easy to work out, where the Gaussian kernel function is generally adopted.
Simultaneously, the multi-task learning paradigm is incorporated into the design of the MultiSrcTL algorithm. In the process of multi-source TL, it is not only necessary to consider the relations between the multiple source domains and the target domain, but also required to take the relations between different source domains into account. Consequently, a relation matrix, which depicts the relations between different source domains, is formulated as below:

M_ij = exp( θ ε̂_{D_i^old}(ĥ_j) ) / Σ_{j′ ∈ N^old, j′ ≠ i} exp( θ ε̂_{D_i^old}(ĥ_j′) ),  if i ≠ j;   M_ij = 0, otherwise    (16)

where ĥ_j represents the learner trained based upon the jth source domain, θ is the control parameter restraining the propagation of the predictive errors, and ε̂_{D_i^old}(ĥ_j) represents the empirical error generated by the learner ĥ_j. ε̂_{D_i^old}(ĥ_j) is computed by utilizing the R-Square statistic as below:

ε̂_{D_i^old}(ĥ_j) = 1 − Σ ( y_i^old_L − ŷ )² / Σ ( y_i^old_L − ȳ_i^old_L )²    (17)

The relation matrix M is a square matrix with the main diagonal entries being zeros. Knowledge transmission and transfer learning possess asymmetry; namely, learners trained based on more dependable sources can be successfully transferred to less dependable sources, while the opposite is not necessarily true. Therefore, M is not required to be a symmetric matrix.
A critical effect of M lies in that it can explicitly evaluate the dependability of the sources, so that the generalization error of the algorithm can be made lower. TL has specific transitivity; namely, if the learner ĥ_j trained based on the jth source domain possesses good predictive performance over the ith source domain, and, simultaneously, the distribution difference between the ith source domain and the target domain is small, then ĥ_j would have favorable predictive performance over the target domain.
The idea of multi-source learning is incorporated into the MultiSrcTL algorithm, with the weights of the multiple source domains being different from each other. For properly calculating the weight of each source domain, source similarity and dependability are required to be taken into account, simultaneously and fully. Thus, the weights of the sources are formulated as:

ω = δ · [ σ E_n + (1−σ) M ]    (18)

where the coefficient σ controls the trade-off between source similarity and domain dependability, and E_n ∈ R^{n×n} is an identity matrix. The parameter δ evaluates the similarity between the source domain and the target domain, pairwise. Based on the distribution difference, we can also use MMD to measure the parameter δ:

δ = exp( −θ·(ν_i^j)^ρ ) / Σ_{i=1}^{n} exp( −θ·(ν_i^j)^ρ )    (19)

where ρ is a manually specified parameter to regulate the value of δ. That is, the smaller the value of MMD is, the more similar the source domain is to the target domain. Thus, the weight of each source domain can be measured by ω.

4.2 The proposed AcMultiSrcTL algorithm specialized for TSF

In this section, aiming at fine-tuning the learner and further boosting its forecasting performance, we incorporate the Active Learning (AL) paradigm into the MultiSrcTL algorithm, and propose a new Active Multi-Source Transfer Learning (AcMultiSrcTL) algorithm specialized for TSF, which regards MultiSrcTL as its sub-algorithm.
The MultiSrcTL sub-algorithm focuses on, on the premise of ensuring the prediction performance of the algorithm, using as few samples with true values as possible for training. In the final AcMultiSrcTL algorithm, we focus on adopting the AL scheme to further improve the prediction accuracy of the acquired learner. In Algorithm 2, the specific pseudocode of the AcMultiSrcTL algorithm is presented.
[Figure (flowchart of the proposed AcMultiSrcTL algorithm): the knowledge in the source domains D_1^old, …, D_N^old (each comprising a subset with true values, D_N^old_L, and a subset without, D_N^old_U) is applied to the existing new samples, yielding a model S acquired through multi-source TL; newly selected samples Ψ, drawn from D_1^old − Ψ, …, D_N^old − Ψ, are then used to fine-tune model S until the final model S is obtained]
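The flow in the figure above can be read as a simple loop; the sketch below is a hedged, high-level rendering of it (not the authors' Algorithm 2): pick a source by the KL rule of Eq. (13), score its unlabeled samples with the combined criterion of Eq. (20) below, query the selected sample's true value, and fine-tune the model. All names, the budget handling, and the fine-tuning call are assumptions of this sketch.

import numpy as np

def ac_multi_src_tl(model, sources, budget, choose_source, score_samples, query_true_value):
    """High-level active multi-source loop (illustrative sketch only).

    sources: list of dicts with 'labeled' (list) and 'unlabeled' ((n, d) array) pools.
    choose_source: KL-based Bernoulli selection over labeled ratios (Eq. (13)).
    score_samples: combined uncertainty / distribution-match score (Eq. (20)).
    """
    for _ in range(budget):
        s = choose_source(sources)                       # pick the source to query from
        pool = sources[s]['unlabeled']
        idx = int(np.argmax(score_samples(pool, model))) # most informative sample
        x = pool[idx]
        y = query_true_value(x)                          # inquire the true value
        sources[s]['unlabeled'] = np.delete(pool, idx, axis=0)  # never re-select a used sample
        sources[s]['labeled'].append((x, y))
        model.fit_incremental([x], [y])                  # fine-tune the learner (assumed API)
    return model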
In the AcMultiSrcTL algorithm, we combine the uncertainty sampling approach with the distribution match weighting technique formulated in Eq. (15), designing a new sample selection technique presented as below:

x* = argmax_{x_i ∈ D_o^old} E[ (ŷ_i − y_i)² | x_i ] · ν_i^{o(ℓ)}    (20)

5 Experiments

5.1 Experimental datasets

In this paper, we select a synthetic dataset, a natural dataset, and four financial datasets for the performance verification of the two proposed algorithms. These include the Mackey-Glass dataset [21, 48], the Zuerich monthly sunspot numbers dataset [22], the Dow Jones Industrial Average (DJI) dataset [23], Nikkei 225 (^N225) [23], Johnson Outdoors Inc. (JOUT) [23], and Advanced Micro Devices, Inc. (AMD) [23]. The latter four financial datasets are all from Yahoo Finance. For each source domain, the samples are evenly divided into 10 sub-source domains according to time, to achieve the setting of the multi-source domain.

Table 3 The parameter settings of the proposed AcMultiSrcTL algorithm and MultiSrcTL sub-algorithm

MultiSrcTL parameter      tw          σ             b
value                     [10, 14]    [0.01, 0.1]   [0.8, 1]
AcMultiSrcTL parameter    tw          σ             b           C
value                     [10, 14]    [0.01, 0.1]   [0.8, 1]    100

5.1.1 Mackey-Glass dataset

dy(t)/dt = −p₁ y(t) + p₂ y(t−τ) / ( 1 + y^{p₃}(t−τ) )    (21)

According to Eq. (21), 1000 samples are produced as the training data (i.e., the old data) by means of the fourth-order Runge-Kutta method, with the step size set as 0.1 [49]. By changing the time step size to 0.11, 1000 samples are generated as the testing data (i.e., the new data) utilizing the same fourth-order Runge-Kutta method. In accordance with [21, 48], the parameters of Eq. (21) are set as p₁ = 0.2, p₂ = 0.1, p₃ = 10, τ = 17, and y(0) = 1.0. The relationship between the old data and the new data is depicted in Fig. 2.
As shown in Fig. 2, the distributions of the old data and the new data generated by the Mackey-Glass model are different. However, there inherently exists a relationship between the old and new data, due to the fact that they are generated by the same model.
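For reproducibility of the synthetic setting above, the following minimal sketch integrates Eq. (21) with a classical RK4 scheme, taking the delayed value from the stored trajectory by linear interpolation; the constant history y(t) = 1.0 for t ≤ 0 is an assumption of this sketch, as the paper only specifies y(0) = 1.0.

import numpy as np

def mackey_glass(n_steps, h=0.1, p1=0.2, p2=0.1, p3=10.0, tau=17.0, y0=1.0):
    """Integrate Eq. (21) with RK4; history y(t) = y0 for t <= 0 (assumption)."""
    ts, ys = [0.0], [y0]
    def f(t, y):
        # delayed term via linear interpolation of the stored trajectory;
        # np.interp clamps to y0 left of t = 0, matching the assumed history
        y_del = np.interp(t - tau, ts, ys)
        return -p1 * y + p2 * y_del / (1.0 + y_del ** p3)
    for _ in range(n_steps):
        t, y = ts[-1], ys[-1]
        k1 = f(t, y)
        k2 = f(t + h / 2.0, y + h * k1 / 2.0)
        k3 = f(t + h / 2.0, y + h * k2 / 2.0)
        k4 = f(t + h, y + h * k3)
        ys.append(y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
        ts.append(t + h)
    return np.asarray(ys)

old_data = mackey_glass(1000, h=0.10)   # training series (old data)
new_data = mackey_glass(1000, h=0.11)   # testing series (new data)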
Table 4 The RMSE values obtained by the comparative algorithms on the six datasets

RMSE (E-03)    AEE      CDSVR    OLSVR    OS-ELM    OS-EMELM   LSTM     GRU       MultiSrcTL   AcMultiSrcTL
MackeyGlass    35.968   66.277   97.919   29.582    40.960     10.023   23.639    9.339        8.457
SunSpot        87.841   71.727   76.102   59.869    61.356     51.287   86.998    50.118       48.858
JOUT           95.111   49.429   61.37    94.425    87.906     93.293   84.150    38.908       38.823
N225           73.371   30.49    75.709   18.005    23.421     19.454   19.501    16.54        15.260
AMD            58.588   61.95    81.081   34.432    36.887     31.120   59.380    27.513       27.314
DJI            74.01    34.228   51.654   101.806   87.73      96.197   137.141   27.054       24.631

In Table 4, the bold RMSE value represents the lowest value of RMSE obtained on the corresponding dataset by the different algorithms. The expression 8.457E-03 is scientific notation, indicating 8.457 × 10⁻³. The same marks in the following tables have the same meaning.
Table 5 The MAE values obtained by the comparative algorithms on the six datasets

MAE (E-03)     AEE      CDSVR    OLSVR    OS-ELM    OS-EMELM   LSTM     GRU       MultiSrcTL   AcMultiSrcTL
MackeyGlass    27.448   55.095   90.497   22.367    32.088     7.777    19.184    7.553        6.991
SunSpot        65.594   55.429   60.585   43.195    44.373     37.115   78.011    34.713       33.972
JOUT           64.35    47.176   55.88    63.441    59.41      61.575   79.274    27.436       25.452
N225           66.309   24.769   70.954   13.56     17.945     15.839   15.173    13.414       12.176
AMD            37.456   61.106   71.187   21.333    22.982     18.565   45.867    18.094       17.875
DJI            65.52    30.546   50.864   86.209    74.363     80.232   127.046   23.37        20.833
In this work, we use the data from 1749 to 1849 as the old data and those from 2010 to 2018 as the new data. Over time the number of sunspots can change greatly; therefore, there exists a big difference between the data for the first 100 years and the last 18 years.

5.1.3 Johnson Outdoors, Inc. (JOUT) dataset

The JOUT dataset is drawn from Yahoo Finance [23] and it records the stock data from 1987.10 to date. With the passage of time, the distribution of stock prices in different periods is also different. We select the weekly closing price stock data from 1987.10.5 to 2019.6.3, with the first 800 data being employed as the old data, and the last 240 data being used as the new data.

5.1.4 Dow Jones Industrial Average (DJI) dataset

The DJI dataset is obtained from Yahoo Finance [23] and it records the stock data from 1985.1 to date. We select the closing price data of stocks from 1985.1.28 to 2019.6.3, with the first 1000 data being used as the old data, and the last 280 data being utilized as the new data.
Table 6 RMSE t-test results between AcMultiSrcTL and other comparative algorithms

Remark: An H value of 1 indicates that the proposed algorithm significantly outperforms the corresponding rival algorithm at the 5% significance level on the basis of the RMSE metric.
Table 7 MAE t-test results between AcMultiSrcTL and other comparative algorithms

Remark: An H value of 1 indicates that the proposed algorithm significantly outperforms the corresponding rival algorithm at the 5% significance level on the basis of the MAE metric.
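Tables 6 and 7 summarize paired comparisons at the 5% level; a minimal sketch of how such an H value can be computed from two algorithms' per-run errors is given below. The one-sided paired t-test via scipy is an assumption of this sketch — the paper does not state the exact test variant.

import numpy as np
from scipy import stats

def h_value(errors_proposed, errors_rival, alpha=0.05):
    """H = 1 if the proposed algorithm's errors are significantly lower."""
    t, p_two_sided = stats.ttest_rel(errors_proposed, errors_rival)
    # convert the two-sided p-value to a one-sided one in the 'lower error' direction
    p_one_sided = p_two_sided / 2.0 if t < 0 else 1.0 - p_two_sided / 2.0
    return int(p_one_sided < alpha)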
5.1.5 Nikkei 225 (N225) dataset

The N225 dataset also comes from Yahoo Finance [23] and it records N225 stock data from 1970.6.1 to date. In this experiment, we select the weekly stock closing price from 1970.6.1 to 2019.6.2 as the experimental data, the first 1200 data as the old data, and the last 240 data as the new data.

5.1.6 Advanced Micro Devices, Inc. (AMD) dataset

The AMD dataset is also from Yahoo Finance [23]. This dataset records stock data from 1980.5.17 to date. In this paper, we select the weekly stock closing price from 1980.5.17 to 2019.6.3 as the experimental data, the first 1000 data as the old data, and the last 230 data as the new data.
In addition, the data need to be standardized. The most common way is normalization; that is, the data are adjusted to the range of [0, 1], and the normalization formula is as follows:

x′_i = ( x_i − x_min ) / ( x_max − x_min )    (22)

where x′_i is the normalized value, x_i is the original value, x_max is the maximum value of the original time series, and x_min is its minimum value.

5.2 Experimental settings

Several typical TL algorithms and advanced TSF algorithms are used as comparative models to evaluate the performance of the two algorithms presented in this paper. These comparative algorithms include Online Sequential ELM (OS-ELM) [50], Online Sequential Improved Error Minimized ELM (OS-EMELM) [51], Online Support Vector Regression (OLSVR) [52], Cross Domain Support Vector Regression (CDSVR) [34], Adaptive Ensemble Models of ELM (AEE) [53], and two deep learning methods, i.e., Long Short-Term Memory (LSTM) [6] and Gated Recurrent Unit (GRU) [7]. In addition, in order to demonstrate the role of each independent part of the AcMultiSrcTL algorithm, including the TL and AL modules, it is also compared with its sub-algorithm MultiSrcTL.
Among the comparative models, OS-ELM and OS-EMELM are variant algorithms of ELM based on the online sequential learning scheme. With the arrival of the data one by one or batch by batch, the two models are constantly updated during operation. Similarly, our proposed AcMultiSrcTL model is updated after active learning through the AL scheme. In addition, when dealing with time series data, the AcMultiSrcTL algorithm can learn from old data; however, OS-ELM and OS-EMELM cannot achieve this function.
AEE takes ELM as its base model and adopts the method of ensemble learning. In every time step, the base model is retrained based on the old data, which is bound to affect the performance of the algorithm and prolong its running time. In contrast, the AcMultiSrcTL algorithm proposed in this work can update the learner employing the AL scheme, possessing high computational efficiency.
OLSVR is a method designed based on SVR. It has been widely applied in the field of TSF. It is also an online learning method, which can continuously update the learning model. In comparison, the AcMultiSrcTL algorithm adopts the idea of TL, which can realize the transfer of knowledge grasped from old data.
CDSVR is a TL algorithm that can realize knowledge transfer between old and new data. However, the CDSVR algorithm does not update the model using the AL scheme, while the AcMultiSrcTL algorithm integrates TL with AL and fine-tunes the model through the AL scheme, resulting in improved predictive accuracy.
LSTM possesses unique advantages in dealing with TSF problems; it makes full use of the timing characteristics of the recurrent neural network, being stronger in the analysis of time series data. Also, LSTM relieves the problems of gradient vanishing and gradient explosion in the recurrent neural network, demonstrating better performance in TSF. Similar to LSTM, GRU is also proposed to solve the gradient problem of long-term memory and back propagation in the recurrent neural network. Since LSTM and GRU are widely used in nonlinear TSF, in this work, these two deep learning methods are employed as comparative models for our proposed algorithms.
Table 2 displays the total quantity of samples in all the source domains and the quantity of samples in the target domain, as well as the sizes of the training and testing sets used in the proposed MultiSrcTL sub-algorithm and AcMultiSrcTL algorithm. For our proposed two algorithms, the samples in the training set all come from the source domains, and those in the testing set all come from the target domain. For CDSVR, a transfer learning algorithm, its training samples come from the source domain and its testing samples come from the target domain; its amounts of training and testing samples equal those of the AcMultiSrcTL algorithm. Except for CDSVR, the amounts of training and testing samples for the other contrast algorithms employed in this work also equal those of the AcMultiSrcTL algorithm. Besides, except for CDSVR, all the training and testing samples for the other contrast algorithms are drawn from the target domain. It should be noted that, for the MultiSrcTL sub-algorithm within the AcMultiSrcTL algorithm, the proportion of the samples with true values is set to 40% of the size of each source domain dataset. Then, in the active learning sub-procedure within AcMultiSrcTL, the inquiry budget is set as 20% of the size of the source domain dataset to fine-tune the learner. Also, the newly chosen samples should not overlap with the already used ones.
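The preprocessing and split protocol just described — min-max normalization per Eq. (22), 10 time-ordered sub-source domains per source, 40% of each treated as having true values, and an inquiry budget of 20% — can be sketched as follows; array shapes and the chronological split are assumptions consistent with the description above.

import numpy as np

def normalize(series):
    """Min-max scaling of Eq. (22) to the range [0, 1]."""
    s = np.asarray(series, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def build_sources(series, n_subdomains=10, labeled_ratio=0.4):
    """Split an old-data series into time-ordered sub-source domains (sketch)."""
    sources = []
    for block in np.array_split(np.asarray(series), n_subdomains):  # chronological blocks
        n_lab = int(labeled_ratio * len(block))
        sources.append({'labeled': list(block[:n_lab]),   # 40% with true values
                        'unlabeled': block[n_lab:]})      # rest available for inquiry
    return sources

def inquiry_budget(series, ratio=0.2):
    return int(ratio * len(series))                       # 20% of the source dataset size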
values obtained by the corresponding algorithms are displayed in Tables 4 and 5, respectively.

It can be seen from Table 4 that the AcMultiSrcTL algorithm achieves the minimum value of RMSE on all the six datasets. As can be seen from Table 5, the minimum value of MAE is also obtained by AcMultiSrcTL on all the six datasets. It can be concluded that, among all the comparative algorithms, AcMultiSrcTL has the most desirable RMSE and MAE values. Among the remaining algorithms, the predictive performance of the MultiSrcTL algorithm is the best. However, MultiSrcTL is only a TL algorithm and does not have the ability to learn actively, resulting in several limitations in dealing with the under-fitting problem.

At the same time, in order to verify whether the predictive performance of the AcMultiSrcTL algorithm proposed in this paper is significantly superior to that of the other rival algorithms, t-tests are conducted pairwise on the experimental results of the corresponding algorithms at the significance level of 5%, with the t-test results shown in Tables 6 and 7. As can be seen from Tables 6 and 7, for 91 out of 96 t-tests, the predictive performance of the AcMultiSrcTL algorithm is significantly superior to that of the comparative algorithms.
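As a hedged illustration of how such comparisons can be computed, the sketch below evaluates RMSE and MAE and runs a t-test at the 5% level on per-run errors. The synthetic predictions are placeholders, and the paired form of the test is one plausible reading, since the exact variant is not spelled out here.

```python
import numpy as np
from scipy import stats

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

rng = np.random.default_rng(1)
y_true = rng.normal(size=300)

# Placeholder per-run RMSEs for two algorithms over 10 repeated runs; in
# the paper these would come from AcMultiSrcTL and one rival algorithm.
runs_a = [rmse(y_true, y_true + rng.normal(scale=0.05, size=300)) for _ in range(10)]
runs_b = [rmse(y_true, y_true + rng.normal(scale=0.08, size=300)) for _ in range(10)]
mae_a = mae(y_true, y_true + rng.normal(scale=0.05, size=300))

# Pairwise t-test at the 5% significance level on the per-run errors.
t_stat, p_value = stats.ttest_rel(runs_a, runs_b)
print(f"MAE = {mae_a:.4f}, t = {t_stat:.3f}, p = {p_value:.4f}, "
      f"significant: {p_value < 0.05}")
```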
For clarity and intuition, the predictive results and absolute prediction errors acquired by the corresponding algorithms on the six benchmark time series datasets are illustrated in Figs. 3, 4, 5, 6, 7 and 8, respectively. It is worth noting that, in order to clearly show the prediction performances of the two proposed algorithms and those of all the comparative algorithms, three kinds of prediction results, i.e., that of the best of the conventional methods, those of the two proposed methods, and the real data, are displayed in Figs. 3, 4, 5, 6, 7 and 8, because if the experimental results obtained by all the algorithms were exhibited, the prediction results and errors shown in the figures would be difficult to observe distinctly.

According to the RMSE metric, among the conventional methods, LSTM shows the best performance on the Mackey-Glass, Sunspot, and AMD datasets. CDSVR is the optimal conventional algorithm on the JOUT and DJI datasets, while the best conventional algorithm on the N225 dataset is OS-ELM. Consequently, the prediction results and errors obtained by the corresponding algorithms are displayed in Figs. 3, 4, 5, 6, 7 and 8.

Figure 3a and b display the predicted values and absolute prediction error values obtained by each algorithm on the Mackey-Glass dataset, respectively. From Fig. 3a, it can be observed that the trends of the prediction results obtained by the corresponding algorithms on the Mackey-Glass dataset are all close to the trend of the real curve. However, it is clearly shown in Fig. 3b that the fitting degree of the AcMultiSrcTL algorithm is higher and its predictive error is the smallest.

Figure 4a and b exhibit the experimental results acquired by the corresponding algorithms on the SunSpot dataset. In terms of Fig. 4a and b, it can be concluded that the prediction curve of the AcMultiSrcTL algorithm is the closest to the real curve. For the best conventional algorithm, LSTM, the prediction curve is also similar to the real curve, but its performance is worse than those of the AcMultiSrcTL algorithm and the MultiSrcTL algorithm.

Figure 5a and b depict the prediction values and errors of the corresponding algorithms on the JOUT dataset, respectively. From Fig. 5a, it can be seen that the predicted curve of the AcMultiSrcTL algorithm is very close to the real curve. The MultiSrcTL algorithm and the CDSVR algorithm also show relatively good prediction effects. At the same time, it can be seen from Fig. 5 that the prediction effect of AEE is poor. Figure 6a and b reflect the experimental results obtained by the corresponding algorithms on N225, where the prediction performance of the AcMultiSrcTL algorithm is the best, the MultiSrcTL algorithm ranks second, and the OS-ELM algorithm is the best among the other contrast algorithms. LSTM and GRU also show good predictive performance.
[Table 9: predictive performance of AcMultiSrcTL under different values of the confidence tolerance b: 0.1, 1, 10, 20, 30, 40, 50]
It can be observed from Fig. 7a and b that the prediction results of LSTM, MultiSrcTL and the AcMultiSrcTL algorithm are all close to the real values, while AcMultiSrcTL achieves the best fitting degree. The prediction performance of the MultiSrcTL algorithm is close to that of the AcMultiSrcTL algorithm, and the performance of LSTM is better than those of the remaining algorithms. It can also be clearly observed from Fig. 8a and b that, on the DJI dataset, the prediction accuracy of the AcMultiSrcTL algorithm is the highest, with its absolute prediction errors being close to zero.

It can be concluded from Figs. 3, 4, 5, 6, 7 and 8 that the AcMultiSrcTL algorithm possesses both high prediction accuracy and low absolute error, being applicable to single-step-ahead TSF problems.

The AcMultiSrcTL algorithm proposed in this paper contains several hyperparameters, among which the concentration factor σ and the confidence tolerance b are the two most important. In order to analyze the influence of these two parameters on the predictive performance of the AcMultiSrcTL algorithm, we have assigned different values to each parameter and carried out repeated experiments on the six benchmark datasets. Through these experiments, the appropriate value range of each parameter can be determined easily.

From Table 8, it can be concluded that the AcMultiSrcTL algorithm achieves the best prediction performance on the synthetic dataset and the natural dataset when σ is 0.03. In the cases when σ is less than 0.03, its prediction performance gradually becomes better as σ increases; when the value of σ is greater than 0.03, the prediction performance of the algorithm gradually decreases as σ increases. On the four financial datasets, AcMultiSrcTL achieves the optimal performance when σ is taken as 0.01, and the trend is the same as on the former two datasets. It can be observed from Tables 8 and 9 that the influences of the values of b and σ on the algorithm performance exhibit similar trends. AcMultiSrcTL acquires the best predictive performance when b is 10 on the former two datasets, while its best predictive performance is obtained when b is taken as 1 on the latter four financial datasets.
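The sensitivity analysis described above amounts to a grid search over σ and b with repeated runs. The sketch below shows the shape of such a procedure; evaluate_rmse is a hypothetical stand-in for training and testing AcMultiSrcTL with a given (σ, b) pair, the b candidates follow Table 9, and the σ grid is an assumption.

```python
import itertools

# Candidate values: the b grid mirrors Table 9, the sigma grid is assumed.
sigma_grid = [0.005, 0.01, 0.03, 0.05, 0.1]
b_grid = [0.1, 1, 10, 20, 30, 40, 50]

def evaluate_rmse(sigma, b):
    """Hypothetical stand-in: train AcMultiSrcTL with (sigma, b), repeat
    the experiment, and return the averaged test RMSE on one dataset."""
    return (sigma - 0.03) ** 2 + 0.001 * abs(b - 10)  # toy response surface

results = {(s, b): evaluate_rmse(s, b)
           for s, b in itertools.product(sigma_grid, b_grid)}
best = min(results, key=results.get)
print("best (sigma, b):", best, "with RMSE", round(results[best], 6))
```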
At the same time, the running time is also an important indicator of algorithm performance. We have reported the total running time of the corresponding algorithms on the six benchmark datasets in Table 10.

It can be seen from Table 10 that, among the seven contrast algorithms, OS-ELM and OS-EMELM usually require the shortest running time, while, in contrast, LSTM needs the longest running time. Because the MultiSrcTL algorithm only implements transfer learning, its running time is similar to that of AEE. By incorporating the active learning paradigm, the AcMultiSrcTL algorithm can actively select the samples, query their true values in each domain, and then update the learner continuously, as sketched below. Compared with the other algorithms, the AcMultiSrcTL algorithm takes a little more time; however, its predictive performance is significantly improved.
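The active update cycle just described (score the unlabeled samples, query the true values of the most uncertain ones, and fine-tune the learner) can be sketched as follows. The ensemble-variance uncertainty score, the linear learner, and the synthetic oracle are simplifying assumptions; the actual AcMultiSrcTL combines uncertainty sampling with distribution-match weighting over multiple source domains.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy pool of unlabeled source samples and a simple linear learner.
X_pool = rng.normal(size=(100, 4))
w = np.zeros(4)

def predict(X, w):
    return X @ w

def uncertainty(X, w, scale=0.1, n_models=5):
    # Proxy for predictive uncertainty: variance across perturbed models.
    preds = np.stack([predict(X, w + rng.normal(scale=scale, size=w.shape))
                      for _ in range(n_models)])
    return preds.var(axis=0)

budget, labeled = 20, np.zeros(len(X_pool), dtype=bool)
for _ in range(budget):
    scores = uncertainty(X_pool, w)
    scores[labeled] = -np.inf            # never re-query a used sample
    i = int(np.argmax(scores))           # most uncertain sample
    labeled[i] = True
    y_i = X_pool[i].sum() + rng.normal(scale=0.01)  # oracle: query true value
    # One gradient step fine-tunes the learner on the newly labeled sample.
    grad = 2 * (predict(X_pool[i], w) - y_i) * X_pool[i]
    w -= 0.05 * grad

print("queried:", int(labeled.sum()), "weights:", np.round(w, 3))
```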
Table 10  Total running time (s) of the corresponding algorithms on the six benchmark datasets

Time (s)      AEE        CDSVR      OLSVR      OS-ELM     OS-EMELM   LSTM     GRU      MultiSrcTL  AcMultiSrcTL
MackeyGlass   3.2610     2.023E-1   3.6512E2   1.8504E-1  1.7508E-1  159.24   140.98   1.8735      6.2974
SunSpot       1.6780     2.4308E-1  1.9.39E2   1.8608E-1  2.606E-1   144.53   130.25   1.7098      6.0837
JOUT          4.6009E-1  2.073E-1   1.4460E1   1.8503E-1  1.8406E-1  95.27    82.19    0.9660      2.6210
N225          1.8394     2.1042E-1  1.0406E2   2.3304E-1  2.3308E-1  140.81   116.98   1.8631      5.3627
AMD           1.5300     1.3000E-1  2.3910E1   1.4402E-1  1.4302E-1  119.45   100.21   1.3581      3.8358
DJI           1.5470     1.3704E-1  4.5281E1   1.8003E-1  1.4555E-1  118.80   103.56   1.5136      4.3059
Considering comprehensively its performance and efficiency, the AcMultiSrcTL algorithm is suitable for single-step-ahead TSF.

6 Conclusions and future work

Time series forecasting has always been a difficult problem in the fields of mathematics and machine learning, especially non-stationary TSF. With the passage of time, time series data often change greatly, and the old and new data frequently no longer satisfy the same distribution, resulting in the negative transfer of single-source TSF TL. Aiming at addressing this problem, this work firstly builds a multi-source TL framework specifically for single-step-ahead TSF, on the basis of which a multi-source TL algorithm, i.e., the MultiSrcTL algorithm, is proposed. It can make full use of the old data to predict the future values of the new data.

Besides, as the second contribution of this paper, a new Active Multi-Source Transfer Learning algorithm fusing multi-source TL and AL, i.e., the AcMultiSrcTL algorithm, is proposed, which takes MultiSrcTL as its sub-algorithm. The AcMultiSrcTL algorithm can properly select the samples without true values and gradually fine-tune the learner, so as to further enhance its predictive performance in the setting of multi-source TSF TL. All in all, this paper proposes the MultiSrcTL algorithm and the AcMultiSrcTL algorithm to solve the problem of insufficient training of the learner caused by the lack of recent time series samples.

The experimental results obtained on the six benchmark time series datasets fully verify the superiorities of the MultiSrcTL algorithm and the AcMultiSrcTL algorithm, including desirable performance and favorable robustness. These superiorities demonstrate that the two proposed algorithms are suitable for addressing multi-source TSF TL problems, having good prospects of practical application.

It is worth noting that the two algorithms we proposed in this paper are mainly for single-step-ahead TSF. In future work, we will work on multi-source transfer learning for multi-step-ahead TSF. At present, there are three classical multi-step-ahead TSF strategies: the recursive strategy, the direct strategy, and the multi-input multi-output one. The recursive strategy, which uses the prediction results as the inputs of the model, will have a negative impact on the prediction performance. The direct strategy needs to construct a separate model for each predictive step, causing a high time complexity of the forecasting algorithm. Therefore, in our future work, we plan to adopt the multi-input multi-output strategy to deal with the multi-step-ahead TSF problem, and to employ multi-input multi-output SVM as the base model; the contrast between the recursive and multi-input multi-output strategies is sketched below.

In this case, the output of the model is no longer a scalar but a vector of predicted values, whose size equals the number of predictive steps. In the setting of multi-source domains, the inter-source relation matrix plays an important role. Unlike the two-dimensional inter-source relation matrix under the single-step setting, the relation matrix of multi-step-ahead TSF should be a three-dimensional matrix to suit the multi-output base model. The weight of the predicted value vector of the base model should also be converted from a scalar to a vector. Then, the expected risk bound should be re-derived to address multi-source TL and multi-step-ahead TSF problems. We will try to re-construct the risk bound and design a corresponding multi-source TL and multi-step-ahead TSF algorithm in our future work.
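To illustrate the contrast between the recursive and multi-input multi-output strategies mentioned above, the following sketch forecasts three steps ahead with each. The linear least-squares maps are stand-ins for the multi-input multi-output SVM we plan to employ, and the synthetic series is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
series = np.sin(np.arange(500) * 0.1) + rng.normal(scale=0.05, size=500)

def make_windows(x, n_in, n_out):
    # Each row: n_in lagged inputs -> vector of n_out future values (MIMO).
    X, Y = [], []
    for t in range(len(x) - n_in - n_out + 1):
        X.append(x[t:t + n_in])
        Y.append(x[t + n_in:t + n_in + n_out])
    return np.array(X), np.array(Y)

n_in, n_out = 12, 3
X, Y = make_windows(series, n_in, n_out)

# MIMO strategy: one model whose output is a vector of n_out predictions,
# here a least-squares linear map standing in for a multi-output SVM.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
y_vec = X[-1] @ W                     # one shot: all 3 steps at once

# Recursive strategy for contrast: a 1-step model fed its own predictions,
# so errors can accumulate across steps.
w1, *_ = np.linalg.lstsq(X, Y[:, :1], rcond=None)
window = list(X[-1])
y_rec = []
for _ in range(n_out):
    y_hat = (np.array(window) @ w1).item()
    y_rec.append(y_hat)
    window = window[1:] + [y_hat]

print("MIMO:", np.round(y_vec, 3), "recursive:", np.round(y_rec, 3))
```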
Acknowledgments This work is supported by the National Key R&D Program of China (Grant Nos. 2018YFC2001600, 2018YFC2001602), and the National Natural Science Foundation of China under Grant no. 61473150.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Corberán-Vallet A, Bermúdez JD, Vercher E (2011) Forecasting correlated time series with exponential smoothing models. Int J Forecast 27(2):252–265
2. Cortez P, Rocha M, Neves J (2004) Evolving time series forecasting ARMA models. J Heuristics 10(4):415–429
3. Lee YS, Tong LI (2011) Forecasting time series using a methodology based on autoregressive integrated moving average and genetic programming. Knowl-Based Syst 24(1):66–72
4. Liang LZ, Shao F (2010) The study on short-time wind speed prediction based on time-series neural network algorithm. Asia-Pacific Power Energy Eng Conf 2010
5. Mager J, Paasche U, Sick B (2009) Forecasting financial time series with support vector machines based on dynamic kernels. 2008 IEEE Conference on Soft Computing in Industrial Applications, pp 252–257
6. Yang HM, Pan ZS, Tao Q (2017) Robust and adaptive online time series prediction with long short-term memory. Comput Intell Neurosci 1–9
7. Zhao R, Wang D, Yan R, Mao K, Shen F, Wang J (2018) Machine health monitoring using local feature-based gated recurrent unit networks. IEEE Trans Ind Electron 65(2):1539–1548
8. Ye R, Dai Q (2018) A novel transfer learning framework for time series forecasting. Knowl-Based Syst 156:74–99
9. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
10. Shao L, Zhu F, Li XL (2015) Transfer learning for visual categorization: a survey. IEEE Trans Neural Netw Learn Syst 26(5):1019–1034
11. Yang Q (2009) Transfer learning beyond text classification. Asian Conf Mach Learn, Springer 5828:10–22
12. Wang XY, Han M (2014) Online sequential extreme learning machine with kernels for nonstationary time series prediction. Neurocomputing 145:90–97
13. Scardapane S, Comminiello D, Scarpiniti M, Uncini A (2015) Online sequential extreme learning machine with kernels. IEEE Trans Neural Netw Learn Syst 26(9):2214–2220
14. Fang M, Guo Y, Zhang XS, Li X (2015) Multi-source transfer learning based on label shared subspace. Pattern Recogn Lett 51:101–106
15. Eaton E, Desjardins M (2011) Selective transfer between learning tasks using task-based boosting. Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp 337–342
16. Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition, pp 1855–1862
17. Al-Stouhi S, Reddy CK (2011) Adaptive boosting for transfer learning using dynamic updates. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp 60–75
18. Eaton E, Desjardins M (2009) Set-based boosting for instance-level transfer. Proceedings of the International Conference on Data Mining Workshops, pp 422–428
19. Gao J, Fan W, Sun Y, Han J (2009) Heterogeneous source consensus learning via decision propagation and negotiation. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 339–348
20. Wang Z, Carbonell J (2018) Towards more reliable transfer learning. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 794–810
21. Ardalani-Farsa M, Zolfaghari S (2010) Chaotic time series prediction with residual analysis method using hybrid Elman-NARX neural networks. Neurocomputing 73(13–15):2540–2553
22. Time Series Data Library. Available: Time Series Data Library - Data provider — DataMarket
23. Yahoo Finance [EB/OL]. Available: http://finance.yahoo.com/
24. Ferreira TAE, Vasconcelos GC, Adeodato PJL (2008) A new intelligent system methodology for time series forecasting with artificial neural networks. Neural Process Lett 28(2):113–129
25. Fakhr MW (2015) Sparse locally linear and neighbor embedding for nonlinear time series prediction. 2015 Tenth International Conference on Computer Engineering & Systems, pp 371–377
26. Ciz R, Rudajev V (2007) Linear and nonlinear attributes of ultrasonic time series recorded from experimentally loaded rock samples and total failure prediction. Int J Rock Mech Min Sci 44(33):457–467
27. Chen YH, Yang B, Dong JW (2006) Time-series prediction using a local linear wavelet neural network. Neurocomputing 69(4–6):449–465
28. Gromov VA, Shulga AN (2012) Chaotic time series prediction with employment of ant colony optimization. Expert Syst Appl 39(9):8474–8478
29. Khashei M, Bijari M (2011) A novel hybridization of artificial neural networks and ARIMA models for time series forecasting. Appl Soft Comput 11(2):2664–2675
30. Kuremoto T, Kimura S, Kobayashi K, Obayashi M (2014) Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing 137:47–56
31. Shi ZW, Han M (2007) Support vector echo-state machine for chaotic time-series prediction. IEEE Trans Neural Netw 18(2):359–372
32. Zhu JY, Ren B, Zhang HX, Deng ZT (2002) Time series prediction via new support vector machines. Int Conf Mach Learn Cybernet 1–4:364–366
33. Cao LJ, Tay FEH (2003) Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans Neural Netw Learn Syst 14(6):1506–1518
34. Jiang W, Zavesky E, Chang SF, Loui A (2008) Cross-domain learning methods for high-level visual concept classification. 2008 15th IEEE International Conference on Image Processing, vols 1–5, pp 161–164
35. Lee CW, Fang W, Yeh CK, Wang YCF (2018) Multi-label zero-shot learning with structured knowledge graphs. IEEE Conference on Computer Vision and Pattern Recognition, pp 1576–1585
36. Song J, Shen CC, Yang YZ, Liu Y, Song ML (2018) Transductive unbiased embedding for zero-shot learning. Proc IEEE Conf Comput Vis Pattern Recognit, pp 1024–1033
37. Konyushkova K, Raphael S, Fua P (2017) Learning active learning from data. Neural Inf Process Syst Conf 30
38. Xiang JP (2012) Active learning for person re-identification. 2012 International Conference on Machine Learning and Cybernetics, IEEE 1:336–340
39. Bruzzone L, Persello C (2009) Active learning for classification of remote sensing images. IEEE Int Geosci Remote Sens Symp 1–5:1995–1998
40. Zhou ZW, Shin J, Zhang L, Gurudu S, Gotway M, Liang JM (2017) Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. 30th IEEE Conference on Computer Vision and Pattern Recognition, pp 4761–4772
41. Kale D, Liu Y (2013) Accelerating active learning with transfer learning. 2013 IEEE 13th International Conference on Data Mining, pp 1085–1090
42. Yan Y, Subramanian R, Lanz O, Sebe N (2012) Active transfer learning for multi-view head-pose classification. Proceedings of the 21st International Conference on Pattern Recognition, IEEE, pp 1168–1171
43. Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J (2008) Learning bounds for domain adaptation. Conference on Neural Information Processing Systems, pp 129–136
44. Ben-David S, Blitzer J, Crammer K, Pereira F (2006) Analysis of representations for domain adaptation. Int Conf Neural Inf Process Syst, pp 137–144
45. Nguyen HT, Smeulders A (2004) Active learning using pre-clustering. Proceedings of the 21st International Conference on Machine Learning, ACM, pp 336–340
46. Huang J, Smola AJ, Gretton A, Borgwardt KM, Schölkopf B (2006) Correcting sample selection bias by unlabeled data. Int Conf Neural Inf Process Syst, pp 601–608
47. Gretton A, Borgwardt K, Rasch MJ, Schölkopf B, Smola AJ (2008) A kernel method for the two-sample problem. Adv Neural Inf Proces Syst, pp 513–520
48. Chandra R, Zhang M (2012) Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction. Neurocomputing 86(12):116–123
49. Rong HJ, Sundararajan N, Huang GB, Saratchandran P (2006) Sequential adaptive fuzzy inference system (SAFIS) for nonlinear system identification and prediction. Fuzzy Sets Syst 157(9):1260–1275
50. Nan-Ying L, Guang-Bin H, Saratchandran P, Sundararajan N (2006) A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans Neural Netw 17(6):1411–1423
51. Jiao X, Liu Z, Yong G, Pan Z (2016) Time series prediction based on online sequential improved error minimized extreme learning machine. Proc ELM-2015 1:193–209
52. Ma J, Theiler J, Perkins S (2003) Accurate on-line support vector regression. Neural Comput 15:2683–2703
53. van Heeswijk M et al (2009) Adaptive ensemble models of extreme learning machines for time series prediction. Proc 19th Int Conf Artif Neural Netw 5769, pp 305–314

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.