Efficient Online Learning Algorithms Based On LSTM Neural Networks
Abstract—We investigate online nonlinear regression and introduce novel regression structures based on the long short term memory (LSTM) networks. For the introduced structures, we also provide highly efficient and effective online training methods. To train these novel LSTM-based structures, we put the underlying architecture in a state space form and introduce highly efficient and effective particle filtering (PF)-based updates. We also provide stochastic gradient descent and extended Kalman filter-based updates. Our PF-based training method guarantees convergence to the optimal parameter estimation in the mean square error sense provided that we have a sufficient number of particles and satisfy certain technical conditions. More importantly, we achieve this performance with a computational complexity in the order of the first-order gradient-based methods by controlling the number of particles. Since our approach is generic, we also introduce a gated recurrent unit (GRU)-based approach by directly replacing the LSTM architecture with the GRU architecture, where we demonstrate the superiority of our LSTM-based approach in the sequential prediction task via different real life data sets. In addition, the experimental results illustrate significant performance improvements achieved by the introduced algorithms with respect to the conventional methods over several different benchmark real life data sets.

Index Terms—Gated recurrent unit (GRU), Kalman filtering, long short term memory (LSTM), online learning, particle filtering (PF), regression, stochastic gradient descent (SGD).

Manuscript received October 30, 2016; revised May 5, 2017 and August 15, 2017; accepted August 15, 2017. Date of publication September 13, 2017; date of current version July 18, 2018. This work was supported by TUBITAK under Contract 115E917. (Corresponding author: Tolga Ergen.) The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, 06800 Ankara, Turkey (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2017.2741598

I. INTRODUCTION

A. Preliminaries

THE problem of estimating an unknown desired signal is one of the main subjects of interest in the contemporary online learning literature, where we sequentially receive a data sequence related to a desired signal to predict the signal's next value [1]. This problem is known as online regression and it is extensively studied in the neural network [2], machine learning [1], and signal processing literatures [3], especially for prediction tasks [4]. In these studies, nonlinear approaches are generally employed because, for certain applications, linear modeling is inadequate due to the constraints on linearity [3]. Here, in particular, we study nonlinear regression in an online setting, where we sequentially observe a data sequence and its label to find a nonlinear relation between them to predict the future labels.

There exists a wide range of nonlinear modeling approaches in the machine learning and signal processing literatures for regression [1], [3]. However, most of these approaches usually suffer from high computational complexity and they may provide inadequate performance due to stability and overfitting issues [3]. Neural network-based regression algorithms are also introduced for nonlinear modeling since neural networks are capable of modeling highly nonlinear and complex structures [2], [4], [5]. However, they are also shown to be prone to overfitting problems and demonstrate less than adequate performance in certain applications [6], [7]. To remedy these issues and further enhance their performance, neural networks composed of multiple layers, i.e., known as deep neural networks (DNNs), are recently introduced [8]. In DNNs, a layered structure is employed so that each layer performs a feature extraction based on the previous layers [8]. With this mechanism, DNNs are able to model highly nonlinear and complex structures [9]. However, this layered structure performs poorly in capturing time dependencies in the data, so that DNNs can provide only limited performance in modeling time series and processing temporal data [10]. As a remedy, basic recurrent neural networks (RNNs) are introduced since these networks have inherent memory that can store the past information [5]. However, basic RNNs lack control structures, so that the long-term components cause either an exponential growth or decay in the norm of gradients during training, which are the well-known exploding and vanishing gradient problems, respectively [6], [11]. Hence, they are insufficient to capture long-term dependencies in the data, which significantly restricts their performance in real life tasks [12]. In order to resolve this issue, a novel RNN architecture with several control structures, i.e., the long short term memory (LSTM) network [12], [13], is introduced. However, in the classical LSTM structures, we do not have the direct contribution of the regression vector to the output, i.e., the desired signal is regressed only using the state vector [4]. Hence, in this paper, we introduce LSTM-based online regression architectures, where we also incorporate the direct contribution of the regression vectors, inspired from the well-known ARMA models [14].

After the neural network structure is fixed, there exists a wide range of different methods to train the corresponding parameters in an online manner. Especially the first-order gradient-based approaches are widely used due to their efficiency in training because of the well-known backpropagation recursion [4], [15].
However, these techniques provide poorer performance compared with the second-order gradient-based techniques [5], [16]. As an example, the real-time recurrent learning (RTRL) algorithm is highly efficient in calculating gradients [15], [16]. However, since the RTRL algorithm exploits only the first-order gradient information, it performs poorly on ill-conditioned problems [17]. On the other side, although the second-order gradient-based techniques provide much better performance, they are highly complex compared with the first-order methods [5], [16], [18]. As an example, the well-known extended Kalman filter (EKF) method also uses the second-order information to boost its performance, which requires updating the error covariance matrix of the parameter estimate and brings an additional complexity accordingly [19]. Furthermore, the second-order gradient-based methods provide limited training performance due to an abundance of saddle points in neural network-based applications [20]. To alleviate the training issues, we introduce particle filtering (PF) [21]-based online updates for the LSTM architecture. In particular, we first put the LSTM architecture in a nonlinear state space form and formulate the parameter learning problem in this setup. Based on this form, we introduce a PF-based estimation algorithm to effectively learn the parameters. Here, our training method guarantees convergence to the optimal parameter estimation performance in an online manner provided that we have sufficiently many particles and satisfy certain technical conditions. Furthermore, by controlling the number of particles in our experiments, we demonstrate that we can significantly reduce the computational complexity while providing a superior performance compared with the conventional second-order methods. Here, our training approach is generic such that we also put the recently introduced gated recurrent unit (GRU) architecture [22] in a nonlinear state space form and then apply our algorithms to learn its parameters. Through an extensive set of simulations, we illustrate significant performance improvements achieved by our algorithms compared with the conventional methods [18], [23].

B. Prior Art and Comparisons

Neural network-based learning methods are powerful in modeling highly nonlinear structures such that a single hidden layer neural network can adequately model any nonlinear structure [24]. In addition, these methods, especially complex RNN-based methods, are capable of effectively processing temporal data and modeling time series [4], [12]. Complex RNNs, e.g., LSTM networks, provide this performance thanks to their memory to keep the past information and several control gates to regulate the information flow inside the network [12], [13]. However, for complex RNNs, adequate performance requires high computational complexity, i.e., training of a large number of parameters at every time instance [4]. Thus, to mitigate complexity, the LSTM network-based methods in [16] and [5] choose a low-complexity first-order gradient-based technique, i.e., stochastic gradient descent (SGD) [23], to train their parameters. Even though there exist certain applications of LSTM trained with the second-order techniques, e.g., EKF in [18] and a Hessian free technique in [25], they suffer from complexity issues and also poor performance due to an abundance of saddle points [20]. On the contrary, for basic RNNs, we have fewer parameters to train; however, these neural networks do not have control structures [12], [13]. Hence, the exploding and vanishing gradient problems occur due to long-term components [6], [11]. These problems prevent the basic RNNs from learning correlation between distant events [6]. To ameliorate performance, the basic RNN-based learning methods in [5] and [16] choose high-complexity second-order gradient-based techniques to train their parameters. Hence, either low-complexity neural networks or low-complexity training methods are chosen to avoid an unmanageable computational complexity increase. However, basic RNNs suffer from inadequately capturing long- and short-term dependencies compared with complex networks [12], [13]. On the other hand, the first-order gradient-based methods suffer from slower convergence and poorer performance compared with the second-order gradient-based techniques [5]. To circumvent these issues, in this paper, we derive online updates based on the PF algorithm [21] to train the LSTM architecture. Thus, we not only provide the second-order training without any ad hoc linearization but also accomplish this with a computational complexity in the order of the first-order methods (by carefully controlling the number of particles in modeling).

We emphasize that the conventional neural network-based learning methods [5], [16], [18], [23] suffer from the well-known complexity–performance tradeoff. Due to this tradeoff, they usually are not chosen to address the nonlinear regression problem. There are certain neural network-based methods [5], [16] that particularly investigate the nonlinear regression; however, they only employ the basic RNN architecture for this purpose. In addition, in their regression approach, they provide the final estimate by setting the output of the basic RNN architecture as a scalar value so that the final estimate becomes a linear combination of only the internal states. Instead, in this paper, we employ the LSTM architecture for the nonlinear regression and also introduce additional terms to incorporate the direct contribution of the regression vector to our final estimate. Therefore, we significantly improve the regression performance as illustrated in our simulations.

C. Contributions

Our main contributions are as follows.
1) As the first time in the literature, we introduce online learning algorithms based on the LSTM architecture for data regression, where we efficiently train the LSTM architecture in an online manner using our PF-based approach.
2) We propose novel LSTM-based regression structures to compute the final estimate, where we introduce an additional gate to the classical LSTM architecture to incorporate the direct contribution of the input regressor inspired from the ARMA models.
3) We put the LSTM equations in a nonlinear state space form and then derive online updates based on the state-of-the-art state estimation techniques [21], [26] for each parameter.
Here, our PF-based method achieves a substantial performance improvement in online parameter training with respect to the conventional second- and first-order methods [18], [23].
4) We achieve this substantial improvement with a computational complexity in the order of the first-order gradient-based methods [18], [23] by controlling the number of particles in our method. In our simulations, we also illustrate that by controlling the number of particles, we can achieve the same complexity with the first-order gradient-based methods while providing a far superior performance compared with both the first- and second-order methods.
5) Through an extensive set of simulations involving real life and financial data, we illustrate performance improvements achieved by our algorithms with respect to the conventional methods [18], [23]. Furthermore, since our approach is generic, we also introduce GRU-based algorithms by directly applying our approach to the GRU architecture, i.e., also a complex RNN, in Section IV.

D. Organization of This Paper

The organization of this paper is as follows. We introduce the online regression problem and then describe our LSTM-based model in Section II. We then introduce different architectures to compute the final estimate for data regression in Section III-A. In Section III-B, we review the conventional training methods and extend these methods to the introduced architectures. We then introduce our PF-based training algorithm in Section III-C. In Section IV, we illustrate the merits of the proposed algorithms and training methods via an extensive set of experiments involving real life and financial data, and we also introduce a GRU-based approach for online learning tasks. We then finalize our paper with concluding remarks in Section V.

II. MODEL AND PROBLEM DESCRIPTION

All vectors are column vectors and denoted by boldface lower case letters. Matrices are represented by boldface capital letters. For a vector $u$ (or a matrix $U$), $u^T$ ($U^T$) is the ordinary transpose. The time index is given as a subscript, e.g., $u_t$ is the vector at time $t$. Here, $\mathbf{1}$ is a vector of all ones, $\mathbf{0}$ is a vector or matrix of all zeros, and $I$ is the identity matrix, where the size is understood from the context. Given a vector $u$, $\mathrm{diag}(u)$ is the diagonal matrix constructed from the entries of $u$.

We sequentially receive $\{d_t\}_{t \ge 1}$, $d_t \in \mathbb{R}$, and regression vectors $\{x_t\}_{t \ge 1}$, $x_t \in \mathbb{R}^p$, such that our goal is to estimate $d_t$ based on our current and past observations $\{\ldots, x_{t-1}, x_t\}$. Given our estimate $\hat{d}_t$, which can only be a function of $\{\ldots, x_{t-1}, x_t\}$ and $\{\ldots, d_{t-2}, d_{t-1}\}$, we suffer the loss $l(d_t, \hat{d}_t)$. This framework models a wide range of machine learning problems including financial analysis [27], tracking [28], and state estimation [19]. As an example, in one step ahead data prediction under the square error loss, where we sequentially receive data and predict the next sample, we receive $x_t = [x_t, x_{t-1}, \ldots, x_{t-p+1}]^T$ and then generate $\hat{d}_t$; after $d_t = x_{t+1}$ is observed, we suffer $l(d_t, \hat{d}_t) = (d_t - \hat{d}_t)^2$.

In this paper, to generate the sequential estimates $\hat{d}_t$, we use RNNs. The basic RNN structure is described by the following set of equations [16]:
$h_t = \kappa(W^{(h)} x_t + R^{(h)} h_{t-1})$  (1)
$y_t = u(R^{(y)} h_t)$  (2)
where $h_t \in \mathbb{R}^m$ is the state vector, $x_t \in \mathbb{R}^p$ is the input, and $y_t \in \mathbb{R}^m$ is the output. The functions $\kappa(\cdot)$ and $u(\cdot)$ apply to vectors pointwise and are commonly set to $\tanh(\cdot)$. For the coefficient matrices, we have $W^{(h)} \in \mathbb{R}^{m \times p}$, $R^{(h)} \in \mathbb{R}^{m \times m}$, and $R^{(y)} \in \mathbb{R}^{m \times m}$.

As a special case of RNNs, we use the LSTM neural network [12] with only one hidden layer. Although there exists a wide range of different implementations of the LSTM network, we use the most widely used extension, where the nonlinearities are set to the hyperbolic tangent function and the peephole connections are eliminated. This LSTM architecture is defined by the following set of equations [12]:
$z_t = h(W^{(z)} x_t + R^{(z)} y_{t-1} + b^{(z)})$  (3)
$i_t = \sigma(W^{(i)} x_t + R^{(i)} y_{t-1} + b^{(i)})$  (4)
$f_t = \sigma(W^{(f)} x_t + R^{(f)} y_{t-1} + b^{(f)})$  (5)
$c_t = \Lambda_t^{(i)} z_t + \Lambda_t^{(f)} c_{t-1}$  (6)
$o_t = \sigma(W^{(o)} x_t + R^{(o)} y_{t-1} + b^{(o)})$  (7)
$y_t = \Lambda_t^{(o)} h(c_t)$  (8)
where $\Lambda_t^{(f)} = \mathrm{diag}(f_t)$, $\Lambda_t^{(i)} = \mathrm{diag}(i_t)$, and $\Lambda_t^{(o)} = \mathrm{diag}(o_t)$. Furthermore, $c_t \in \mathbb{R}^m$ is the state vector, $x_t \in \mathbb{R}^p$ is the input vector, and $y_t \in \mathbb{R}^m$ is the output vector. Here, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively. The functions $g(\cdot)$ and $h(\cdot)$ apply to vectors pointwise and are commonly set to $\tanh(\cdot)$. Similarly, the sigmoid function $\sigma(\cdot)$ applies pointwise to the vector elements. For the coefficient matrices and the weight vectors, we have $W^{(z)} \in \mathbb{R}^{m \times p}$, $R^{(z)} \in \mathbb{R}^{m \times m}$, $b^{(z)} \in \mathbb{R}^m$, $W^{(i)} \in \mathbb{R}^{m \times p}$, $R^{(i)} \in \mathbb{R}^{m \times m}$, $b^{(i)} \in \mathbb{R}^m$, $W^{(f)} \in \mathbb{R}^{m \times p}$, $R^{(f)} \in \mathbb{R}^{m \times m}$, $b^{(f)} \in \mathbb{R}^m$, $W^{(o)} \in \mathbb{R}^{m \times p}$, $R^{(o)} \in \mathbb{R}^{m \times m}$, and $b^{(o)} \in \mathbb{R}^m$. Given the output $y_t$, we generate the final estimate as
$\hat{d}_t = w_t^T y_t$  (9)
where the final regression coefficients $w_t$ will be trained in an online manner in the following. Our goal is to design the system parameters so that $\sum_{t=1}^{n} l(d_t, \hat{d}_t)$ or $E[l(d_t, \hat{d}_t)]$ is minimized.
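For concreteness, a minimal NumPy sketch of one step of the recursion in (3)–(9) under the square error loss is given below. The helper name lstm_step, the parameter dictionary, and the random initialization are illustrative assumptions rather than part of the formal description; the elementwise products are equivalent to the diagonal-matrix notation $\Lambda_t^{(\cdot)}$ used in (6) and (8).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, P):
    """One step of (3)-(8); P holds the matrices W^(.), R^(.) and biases b^(.)."""
    z = np.tanh(P['Wz'] @ x_t + P['Rz'] @ y_prev + P['bz'])   # (3)
    i = sigmoid(P['Wi'] @ x_t + P['Ri'] @ y_prev + P['bi'])   # (4)
    f = sigmoid(P['Wf'] @ x_t + P['Rf'] @ y_prev + P['bf'])   # (5)
    c = i * z + f * c_prev                                    # (6): diag(i_t) z_t + diag(f_t) c_{t-1}
    o = sigmoid(P['Wo'] @ x_t + P['Ro'] @ y_prev + P['bo'])   # (7)
    y = o * np.tanh(c)                                        # (8): diag(o_t) h(c_t)
    return y, c

# Illustrative dimensions and a random parameter set.
m, p = 8, 8
rng = np.random.default_rng(0)
P = {k: 0.1 * rng.standard_normal((m, p if k[0] == 'W' else m))
     for k in ['Wz', 'Wi', 'Wf', 'Wo', 'Rz', 'Ri', 'Rf', 'Ro']}
P.update({k: np.zeros(m) for k in ['bz', 'bi', 'bf', 'bo']})

w = np.zeros(m)                    # final regression coefficients w_t
y, c = np.zeros(m), np.zeros(m)
x_t = rng.standard_normal(p)
y, c = lstm_step(x_t, y, c, P)
d_hat = w @ y                      # final estimate (9): d_hat_t = w_t^T y_t
d_t = 0.0                          # placeholder observation
loss = (d_t - d_hat) ** 2          # squared error suffered once d_t is observed
```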
Remark 1: The basic LSTM network can be extended by including the last $s$ outputs in the recursion, e.g., $\{y_{t-s}, \ldots, y_{t-1}\}$; however, this case corresponds to an extended output definition, i.e., an extended super output vector consisting of all $\{y_{t-s}, \ldots, y_{t-1}\}$. We use only $y_{t-1}$ for notational simplicity.

In the following section, we first introduce novel LSTM network-based regression architectures inspired from the ARMA models. Then, we review and extend the conventional methods [18], [23] to learn the parameters of LSTM in an online manner. Finally, we provide our novel PF-based training method.
TABLE I: Comparison of the computational complexities of the proposed online training methods. Here, p represents the dimensionality of the regressor space, m represents the dimensionality of the network's output space, and N represents the number of particles for the PF algorithm.
these gates may restrict the exposure of the state and input
contents in nonlinear regression problems. To expose the full
content of the state and input vectors, we remove the control
and output gates in (11) and introduce the third regression
architecture as follows:
$\hat{d}_t^{(3)} = w_t^T h(c_t) + v_t^T h(x_t).$  (12)
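As an illustration only, the final estimate in (12) can be computed as in the following sketch, assuming $c_t$ is produced by the LSTM recursion in (3)–(8) and $h(\cdot) = \tanh(\cdot)$; the function name is hypothetical.

```python
import numpy as np

def estimate_arch3(c_t, x_t, w_t, v_t):
    """Third regression architecture, (12): d_hat = w^T h(c_t) + v^T h(x_t)."""
    return w_t @ np.tanh(c_t) + v_t @ np.tanh(x_t)
```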
calculations to the introduced architectures. For the weight vector, we use
$w_{t+1} = w_t - \mu_t \nabla_{w_t} l(d_t, \hat{d}_t) = w_t + 2\mu_t (d_t - \hat{d}_t)\, \Lambda_t^{(o)} h(c_t)$  (13)
where for the learning rate $\mu_t$, we have $\mu_t \to 0$ as $t \to \infty$ and $\sum_{k=1}^{t} \mu_k \to \infty$ as $t \to \infty$, e.g., $\mu_t = 1/t$. For the parameter $W^{(z)}$, we have the following update:
$W^{(z)} = W^{(z)} - \mu_t \nabla_{W^{(z)}} l(d_t, \hat{d}_t).$

To obtain (19), we compute the partial derivatives of (3)–(5) with respect to $w_{ij}^{(z)}$ as follows:
$\dfrac{\partial i_t}{\partial w_{ij}^{(z)}} = \Lambda_t^{(\sigma'(\zeta^{(i)}))} \left[ R^{(i)} \Lambda_{t-1}^{(o)} \Lambda_{t-1}^{(h'(c))} \psi_{ij,t-1}^{(z)} + R^{(i)} \Lambda_{t-1}^{\left(\partial o / \partial w_{ij}^{(z)}\right)} h(c_{t-1}) \right]$  (20)
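A minimal sketch of the first-order update of the final regression weights in (13) is given below; it assumes the squared error loss, $y_t = \Lambda_t^{(o)} h(c_t)$ from (8), and the decaying step size $\mu_t = 1/t$ mentioned above. The function name is illustrative.

```python
import numpy as np

def sgd_update_w(w, d_t, o_t, c_t, t):
    """SGD update of the final regression vector as in (13).
    Since d_hat_t = w^T y_t with y_t = diag(o_t) tanh(c_t), the gradient of
    (d_t - d_hat_t)^2 with respect to w is -2 (d_t - d_hat_t) y_t."""
    mu_t = 1.0 / t                       # satisfies mu_t -> 0 and sum_k mu_k -> infinity
    y_t = o_t * np.tanh(c_t)             # diag(o_t) h(c_t)
    d_hat = w @ y_t
    return w + 2.0 * mu_t * (d_t - d_hat) * y_t, d_hat
```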
have an additional update for $v_t$ as follows:
$v_{t+1} = v_t + 2\mu_t (d_t - \hat{d}_t)\, \Lambda_t^{(\alpha)} h(x_t).$  (24)
Then, we follow the derivations in (13), (15), (16), and (19)–(22). For $\hat{d}_t^{(3)}$, we just set $\Lambda_t^{(\alpha)} = I$ and $\Lambda_t^{(o)} = I$, and then all the derivations in (13), (15), (16), (19), and (20)–(24) follow as in $\hat{d}_t^{(2)}$.

According to the update equations in (15), (16), and (19), the update of an entry of a parameter has a computational complexity of $O(m^2 + mp)$ due to the matrix vector multiplications in (17). Since we have $mp$, $m^2$, and $m$ entries for $W^{(.)}$, $R^{(.)}$, and $b^{(.)}$, respectively, this results in $O(m^4 + m^2 p^2)$ computational complexity to update the entries of all parameters, as given in Table I.

2) Online Learning With the EKF Algorithm: We next provide the updates based on the EKF algorithm in order to train the parameters of the system described in (3)–(8) and (10). In the literature, there are certain EKF-based methods to train LSTM (see [18], [30]); however, these methods estimate only the parameters, i.e., $\theta_t$. However, in our case, we also estimate the state and the output vector of LSTM, i.e., $c_t$ and $y_t$, respectively. In the following, we derive the updates for our approach and extend these to the introduced architectures.

The EKF algorithm assumes that the posterior density function of the states given the observations is Gaussian [19]. This assumption can be satisfied by introducing perturbations. We apply the EKF algorithm [19] to estimate $y_t$, $c_t$, and $\theta_t$ as follows:
$\begin{bmatrix} y_{t|t} \\ c_{t|t} \\ \theta_{t|t} \end{bmatrix} = \begin{bmatrix} y_{t|t-1} \\ c_{t|t-1} \\ \theta_{t|t-1} \end{bmatrix} + L_t \big(d_t - w_{t|t-1}^T y_{t|t-1}\big)$  (31)
$y_{t|t-1} = \tau(c_{t|t-1}, x_t, y_{t-1|t-1})$  (32)
$c_{t|t-1} = \phi(c_{t-1|t-1}, x_t, y_{t-1|t-1})$  (33)
$\theta_{t|t-1} = \theta_{t-1|t-1}$  (34)
$L_t = \Sigma_{t|t-1} H_t \big(H_t^T \Sigma_{t|t-1} H_t + R_t\big)^{-1}$  (35)
$\Sigma_{t|t} = \Sigma_{t|t-1} - L_t H_t^T \Sigma_{t|t-1}$  (36)
$\Sigma_{t|t-1} = F_{t-1} \Sigma_{t-1|t-1} F_{t-1}^T + Q_{t-1}$  (37)
where $\Sigma \in \mathbb{R}^{(2m+n_\theta) \times (2m+n_\theta)}$ is the error covariance matrix, $L_t \in \mathbb{R}^{(2m+n_\theta)}$ is the Kalman gain, $Q_t \in \mathbb{R}^{(2m+n_\theta) \times (2m+n_\theta)}$ is the process noise covariance, and $R_t \in \mathbb{R}$ is the measurement noise variance. We compute $H_t$ and $F_t$ as follows:
$H_t^T = \begin{bmatrix} \dfrac{\partial \hat{d}_t}{\partial y} & \dfrac{\partial \hat{d}_t}{\partial c} & \dfrac{\partial \hat{d}_t}{\partial \theta} \end{bmatrix}\Bigg|_{y = y_{t|t-1},\; c = c_{t|t-1},\; \theta = \theta_{t|t-1}}$  (38)
and
$F_t = \begin{bmatrix} \dfrac{\partial \tau(c, x_t, y)}{\partial y} & \dfrac{\partial \tau(c, x_t, y)}{\partial c} & \dfrac{\partial \tau(c, x_t, y)}{\partial \theta} \\ \dfrac{\partial \phi(c, x_t, y)}{\partial y} & \dfrac{\partial \phi(c, x_t, y)}{\partial c} & \dfrac{\partial \phi(c, x_t, y)}{\partial \theta} \\ \mathbf{0} & \mathbf{0} & I \end{bmatrix}\Bigg|_{y = y_{t|t},\; c = c_{t|t},\; \theta = \theta_{t|t}}$  (39)
C. Online Training Based on the PF Algorithm

Since the conventional training methods [18], [23] provide restricted performance as explained in the previous section, we introduce a novel PF-based method that provides superior performance compared with the second-order training methods. Furthermore, we achieve this performance with a computational complexity in the order of the first-order methods depending on the choice of N, as shown in Table I. In the following, we derive the updates for our PF-based training method and extend these calculations to the introduced architectures.

The PF algorithm [21] requires no assumptions other than the independence of the noise samples in (29) and (30). Hence, we modify the system in (29) and (30) as follows:
$a_t = \varphi(a_{t-1}, x_t) + \eta_t$  (40)
$d_t = w_t^T y_t + \xi_t$  (41)
where $\eta_t$ and $\xi_t$ are independent noise samples, $\varphi(\cdot, \cdot)$ is the nonlinear mapping in (29), and
$a_t = \begin{bmatrix} y_t \\ c_t \\ \theta_t \end{bmatrix}.$
For (40) and (41), we seek to obtain $E[a_t | d_{1:t}]$, i.e., the optimal state estimate in the mean square error (MSE) sense. For this purpose, we first find the posterior probability density function $p(a_t | d_{1:t})$. We then calculate the conditional mean of the state vector based on the posterior density function. To obtain the density function, we employ the PF algorithm [21] as follows.

Let $\{a_t^i, \omega_t^i\}_{i=1}^{N}$ denote the samples and the associated weights of the desired distribution, i.e., $p(a_t | d_{1:t})$. Then, we obtain the desired distribution from its samples as follows:
$p(a_t | d_{1:t}) \approx \sum_{i=1}^{N} \omega_t^i \, \delta(a_t - a_t^i)$  (42)
where $\delta(\cdot)$ represents the Dirac delta function. Since obtaining the samples from the desired distribution is intractable in most cases [21], an intermediate function is introduced to obtain the samples $\{a_t^i\}_{i=1}^{N}$, which is called the importance function [21]. Hence, we first obtain the samples from the importance function and then estimate the desired density function based on these samples as follows. As an example, in order to calculate $E_p[a_t | d_{1:t}]$, we use the following trick:
$E_p[a_t | d_{1:t}] = E_q\!\left[ a_t \,\frac{p(a_t | d_{1:t})}{q(a_t | d_{1:t})} \,\Big|\, d_{1:t} \right]$
where $E_f$ represents an expectation operation with respect to a certain density function $f(\cdot)$. Hence, we observe that we can use $q(\cdot)$, i.e., called the importance function, when direct sampling from the desired distribution $p(\cdot)$ is intractable. Here, we use $q(a_t | d_{1:t})$ as our importance function to obtain the samples, and the corresponding weights are calculated as follows:
$\omega_t^i \propto \dfrac{p(a_t^i | d_{1:t})}{q(a_t^i | d_{1:t})}$  (43)
where the weights are normalized such that
$\sum_{i=1}^{N} \omega_t^i = 1.$

To simplify the weight calculation, we can factorize (43) to obtain a recursive formulation for the update of the weights as follows [26]:
$\omega_t^i \propto \dfrac{p(d_t | a_t^i)\, p(a_t^i | a_{t-1}^i)}{q(a_t^i | a_{t-1}^i, d_t)}\, \omega_{t-1}^i.$  (44)
In (44), we aim to choose the importance function such that the variance of the weights is minimized. Thus, we can guarantee that all the particles have nonnegligible weights and contribute considerably to (42) [33]. In this sense, the optimal choice of the importance function is $p(a_t | a_{t-1}^i, d_t)$; however, this requires an integration that does not have an analytic form in most cases [34]. Thus, we choose $p(a_t | a_{t-1}^i)$ as the importance function, which provides a small variance for the weights, but not zero as the optimal importance function does [21], [34]. This simplifies (44) as follows:
$\omega_t^i \propto p(d_t | a_t^i)\, \omega_{t-1}^i.$  (45)
We can now get the desired distribution to compute the conditional mean of the augmented state vector $a_t$ using (42) and (45). By this, we obtain the conditional mean for $a_t$ as follows:
$E[a_t | d_{1:t}] = \int a_t\, p(a_t | d_{1:t})\, da_t \approx \int a_t \sum_{i=1}^{N} \omega_t^i\, \delta(a_t - a_t^i)\, da_t = \sum_{i=1}^{N} \omega_t^i a_t^i.$  (46)
While applying the PF algorithm, the variance of the weights inevitably increases over time so that after a few time steps, all but one of the weights get values that are very close to zero [33]. Due to this reason, although particles with very small weights have almost no contribution to our estimate in (46), we have to update them using (40) and (45). Hence, most of our computational effort is used for the particles with negligible weights, which is known as the degeneracy problem [21]. To measure degeneracy, we use the effective sample size introduced in [35], which is calculated as follows:
$N_{eff} = \dfrac{1}{\sum_{i=1}^{N} (\omega_t^i)^2}.$  (47)
Note that a small $N_{eff}$ value indicates that the variance of the weights is high, i.e., the degeneracy problem. If $N_{eff}$ is smaller than a certain threshold [33], then we apply the resampling algorithm introduced in [26], which eliminates the particles with negligible weights and focuses on the particles with large weights to avoid degeneracy. By this, we obtain an online training method (see Algorithm 1 for the pseudocode) that converges to $E[a_t | d_{1:t}]$, where the convergence is guaranteed under certain conditions as follows.

Remark 5: For the PF derivations of $\hat{d}_t^{(2)}$, we change the observation model in (41) according to the definition in (11). We also modify $a_t$ by adding $v_t$, $W^{(\alpha)}$, $R^{(\alpha)}$, and $b^{(\alpha)}$ to $\theta_t$.
For the PF derivations of $\hat{d}_t^{(3)}$, we modify (41) according to the definition in (12). Furthermore, we modify $\theta_t$ by removing $W^{(\alpha)}$, $R^{(\alpha)}$, $b^{(\alpha)}$, $W^{(o)}$, $R^{(o)}$, and $b^{(o)}$ from its definition for $\hat{d}_t^{(2)}$.

Theorem 1: Let $a_t$ be the state vector such that
$\sup_{a_t} |a_t|^4\, p(d_t | a_t) < K_t$  (48)
where $K_t$ is a finite constant independent of $N$. Then we have the following convergence result:
$\sum_{i=1}^{N} \omega_t^i a_t^i \to E[a_t | d_{1:t}] \quad \text{as } N \to \infty.$

Proof of Theorem 1: From [36], we have
$E\left[\left| E[\pi(a_t) | d_{1:t}] - \sum_{i=1}^{N} \omega_t^i \pi(a_t^i) \right|^4\right] \le C_t \dfrac{\|\pi\|_{t,4}^4}{N^2}$  (49)
where
$\|\pi\|_{t,4} \triangleq \max\big\{1,\; \big(E[|\pi(a_k)|^4 \,|\, d_{1:k}]\big)^{1/4},\; k = 1, 2, \ldots, t\big\}$
$\pi \in B_t^4$, i.e., a class of functions with certain properties described in [36], and $C_t$ represents a finite constant independent of $N$. With (48), $\pi(a_t) = a_t$ satisfies the conditions of $B_t^4$. Therefore, applying $\pi(a_t) = a_t$ to (49) and then evaluating (49) as $N$ goes to infinity conclude our proof.

This theorem provides a convergence result under (48). The inequality in (48) implies that the conditional distribution of the observations, i.e., $p(d_t | a_t)$, decays faster than $a_t$ increases [36]. Since generic distributions usually decrease exponentially, e.g., the Gaussian distribution, or they are nonzero only for bounded intervals, (48) is not a strict assumption for $a_t$. Hence, we can conclude that Theorem 1 can be employed for most cases.

According to the update equations in (40), (41), (45), and (46), each particle costs $O(m^2 + mp)$ due to the matrix vector multiplications in (40) and (41), and this results in $O(N(m^2 + mp))$ computational complexity to update all particles.

Algorithm 1 Online Training Based on the PF Algorithm
1: for i = 1 : N do
2:   Draw $a_t^i \sim p(a_t | a_{t-1}^i)$
3:   Assign $\omega_t^i$ according to (45)
4: end for
5: Calculate total weight: $S = \sum_{j=1}^{N} \omega_t^j$
6: for i = 1 : N do
7:   Normalize: $\omega_t^i = \omega_t^i / S$
8: end for
9: Calculate $N_{eff}$ according to (47)
10: if $N_{eff} < N_T$ then    % $N_T$ is a threshold for $N_{eff}$
11:   Apply the resampling algorithm in [26]
12:   Obtain new pairs $\{\bar{a}_t^i, \bar{\omega}_t^i\}_{i=1}^{N}$, where $\bar{\omega}_t^i = 1/N, \forall i$
13: end if
14: Using $\{\bar{a}_t^i, \bar{\omega}_t^i\}_{i=1}^{N}$, compute the estimate according to (46)
same, we perform another experiment on the same data
This theorem provides a convergence result under (48).
The inequality in (48) implies that the conditional distrib- set for the PF-based algorithm. Finally, we consider three
ution of the observations, i.e., p(dt |at ), decays faster than benchmark real data sets, i.e., elevators [38], bank [39], and
at increases [36]. Since generic distributions usually decrease pumadyn [38], to evaluate the regression performances of our
exponentially, e.g., Gaussian distribution, or they are nonzero algorithms.
only for bounded intervals, (48) is not a strict assumption We first consider the kinematic data set [37], i.e., a sim-
for at . Hence, we can conclude that Theorem 1 can be ulation of eight-link all-revolute robotic arm. Our aim is to
employed for most cases. predict the distance of the effector from a target. We first
select a fixed architecture. For this purpose, we can choose
According to update equations in (40), (41), (45), and (46),
each particle costs O(m 2 + mp) due to the matrix any one of three architectures since the algorithm with the
vector multiplications in (40) and (41), and this results in best performance is the same for all three architectures as
O(N(m 2 + mp)) computational complexity to update all detailed later in this section. Here, we choose Architecture 1.
particles. Furthermore, we choose the parameters such that all the
introduced algorithms reach their maximum performance for
fair comparison. To provide this fair setup, we have the
IV. S IMULATIONS following parameters. For this data set, the input vector is
In this section, we illustrate the performances of our algo- x t ∈ R8 and we set the output dimension of the neural network
rithms on different benchmark real data sets under various as m = 8. For the PF-based algorithm, the crucial parameter
scenarios. We first consider the regression performance for is the number of particles; we set this parameter as N = 1500.
real life data sets such as kinematic [37], elevators [38], In addition, we choose ηt and ξt as zero-mean Gaussian
bank [39], and pumadyn [38]. We then consider the regression random variables with the covariance Cov[ηt ] = 0.01 I
performance for financial data sets, e.g., Alcoa stock price [40] and the variance Var[ξt ] = 0.25, respectively. For the
and Hong Kong exchange rate data [41]. We then compare the EKF-based algorithm, we choose the initial error covariance as
performances of the algorithms based on two different neural 0|0 = 0.01 I. Moreover, we choose Q t = 0.01 I and
networks, i.e., the LSTM and GRU networks [22]. Finally, Rt = 0.25. For the SGD-based algorithm, we set
we comparatively illustrate the merits of our LSTM-based the learning rate as μ = 0.03. As seen in Fig. 2,
regression architectures described in (10)–(12). the PF-based algorithm converges to a much smaller final
Throughout this section, “Architecture 1” represents the MSE level, and hence significantly outperforms the other
LSTM network with (10) as the final estimate equation, algorithms.
similarly “Architecture 2” represents the LSTM network In order to illustrate the effect of the number of particles
with (11), and “Architecture 3” represents the LSTM network on the convergence rate, we perform a new experiment on the
with (12). kinematic data set, where we use the same setup except the
TABLE II: Time accumulated errors and the corresponding training times (in seconds) of the LSTM-based algorithms for the elevators, pumadyn, and bank data sets. Note that here we use a computer with an i5-6400 processor, 2.7-GHz CPU, and 16-GB RAM.
Fig. 7. Comparison of the LSTM and GRU architectures in terms of regression error performance for (a) PF-based algorithm, (b) EKF-based algorithm,
and (c) SGD-based algorithm.
Hence, they have significant differences. To compare them, we use the Hong Kong exchange rate data set as in the previous section. For a fair comparison, we again select a fixed architecture. Here, since we compare the performances of the networks rather than the algorithms, we arbitrarily choose one of the architectures. We select Architecture 1. Moreover, we choose the same parameters as in the previous subsection so that the convergence rates of the algorithms are the same. With this fair setup, Fig. 7(a)–(c) shows that the LSTM network-based approach achieves a smaller steady-state error; therefore, it is superior to the GRU architecture-based approach in the sequential prediction task in our experiments.
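For reference, a sketch of the standard GRU recursion of [22] is given below; the exact formulation and parameter naming used in our GRU-based algorithms may differ in details, so this block is illustrative only. Unlike the LSTM in (3)–(8), the GRU has no separate cell state $c_t$ and no output gate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, y_prev, P):
    """Standard GRU cell of [22]: update gate u_t, reset gate r_t, and a
    candidate output; P holds the GRU weight matrices and biases."""
    u = sigmoid(P['Wu'] @ x_t + P['Ru'] @ y_prev + P['bu'])        # update gate
    r = sigmoid(P['Wr'] @ x_t + P['Rr'] @ y_prev + P['br'])        # reset gate
    y_tilde = np.tanh(P['Wy'] @ x_t + P['Ry'] @ (r * y_prev) + P['by'])
    return (1.0 - u) * y_prev + u * y_tilde                        # new output y_t
```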
D. Different Regression Architectures

In this section, we compare the performances of different LSTM-based regression architectures. For this purpose, we use the Hong Kong exchange rate data set as in the previous section. For a fair comparison, we select the parameters such that the convergence rates of the algorithms are the same. We choose the same parameters as in the previous subsection except $\Sigma_{0|0} = 0.01 I$. Under this fair setup, Table III shows that for the PF- and EKF-based algorithms, Architecture 2 achieves a smaller time accumulated error thanks to the contribution of the regression vector with the control gate $\alpha_t$. Due to the lack of the control and output gates, although Architecture 3 also has the direct contribution of the regression vector, it has a greater error value compared with its competitors. For the SGD-based algorithm, the direct contribution of the regression vector does not provide improvement on the error performance. Hence, Architecture 1 achieves a smaller time accumulated error. However, overall Architecture 2 trained with the PF-based algorithm achieves the smallest time accumulated error.

V. CONCLUSION

We studied the nonlinear regression problem in an online setting and introduced novel LSTM-based online algorithms for data regression. We then introduced low-complexity and effective online training methods for these algorithms. We achieved these by first proposing novel regression algorithms to compute the final estimate, where we introduced an additional gate to the classical LSTM architecture. We then put the LSTM system in a state space form, and then based on this form, we derived online updates based on the SGD, EKF, and PF algorithms [17], [19], [26] to train the LSTM architecture. By this way, we obtain an effective online training method, which guarantees convergence to the optimal parameter estimation provided that we have a sufficient number of particles and satisfy certain technical conditions. We achieve this performance with a computational complexity in the order of the first-order gradient-based methods [5], [16] by controlling the number of particles. In Section IV, thanks to the generic structure of our approach, we also introduced a GRU architecture-based approach by directly replacing the LSTM equations with the GRU architecture and observed that our LSTM-based approach is superior to the GRU-based approach in the sequential prediction tasks studied in this paper. Furthermore, we demonstrate significant performance improvements achieved by the introduced algorithms with respect to the conventional methods [18], [23] over several different data sets (used in this paper).

REFERENCES

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge, U.K.: Cambridge Univ. Press, 2006.
[2] D. F. Specht, "A general regression neural network," IEEE Trans. Neural Netw., vol. 2, no. 6, pp. 568–576, Nov. 1991.
[3] A. C. Singer, G. W. Wornell, and A. V. Oppenheim, "Nonlinear autoregressive modeling and estimation in the presence of noise," Digit. Signal Process., vol. 4, no. 4, pp. 207–221, Oct. 1994.
[4] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space Odyssey," IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2016.2582924.
[5] A. C. Tsoi, "Gradient based learning methods," in Adaptive Processing of Sequences and Data Structures, C. L. Giles and M. Gori, Eds. Berlin, Germany: Springer, Sep. 1998, pp. 27–62, doi: 10.1007/BFb0053994.
[6] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Ph.D. dissertation, Inst. Inform., Tech. Univ. Munich, München, Germany, 1991.
[7] N. D. Vanli, M. O. Sayin, I. Delibalta, and S. S. Kozat, "Sequential nonlinear learning for distributed multiagent systems via extreme learning machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 546–558, Mar. 2017.
[8] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608014002135
[9] U. Shaham, A. Cloninger, and R. R. Coifman, "Provable approximation properties for deep neural networks," Appl. Comput. Harmon. Anal., 2016, doi: 10.1016/j.acha.2016.04.003. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520316300033
[10] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 190–198.
[11] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[13] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, Oct. 2000. [Online]. Available: http://dx.doi.org/10.1162/089976600300015015
[14] J. Fan and Q. Yao, ARMA Modeling and Forecasting. New York, NY, USA: Springer, 2003, pp. 89–123. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-69395-8_3
[15] J. Mazumdar and R. G. Harley, "Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents," IEEE Trans. Ind. Electron., vol. 55, no. 9, pp. 3484–3491, Sep. 2008.
[16] H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the Echo State Network Approach. Sankt Augustin, Germany: GMD-Forschungszentrum Informationstechnik, 2002.
[17] A. H. Sayed, Fundamentals of Adaptive Filtering. Hoboken, NJ, USA: Wiley, 2003.
[18] J. A. Pérez-Ortiz, F. A. Gers, D. Eck, and J. Schmidhuber, "Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets," Neural Netw., vol. 16, no. 2, pp. 241–250, Mar. 2003.
[19] B. D. O. Anderson and J. B. Moore, Optimal Filtering. North Chelmsford, MA, USA: Courier Corporation, 2012.
[20] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), Cambridge, MA, USA, 2014, pp. 2933–2941. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969154
[21] P. M. Djuric et al., "Particle filtering," IEEE Signal Process. Mag., vol. 20, no. 5, pp. 19–38, Sep. 2003.
[22] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014. [Online]. Available: https://arxiv.org/abs/1412.3555
[23] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, no. 2, pp. 270–280, 1989.
[24] B. C. Csáji, "Approximation with artificial neural networks," Faculty Sci., Eötvös Loránd Univ., Budapest, Hungary, Tech. Rep., 2001, vol. 24, p. 48.
[25] J. Martens and I. Sutskever, "Learning recurrent neural networks with Hessian-free optimization," in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 1033–1040.
[26] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[27] Z. Li, Y. Li, F. Yu, and D. Ge, "Adaptively weighted support vector regression for financial time series prediction," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2014, pp. 3062–3065.
[28] I. Patras and E. Hancock, "Regression-based template tracking in presence of occlusions," in Proc. 8th Int. Workshop Image Anal. Multimedia Interact. Services (WIAMIS), Jun. 2007, p. 15.
[29] D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its Applications. New York, NY, USA: Wiley, 1988.
[30] F. A. Gers, J. A. Pérez-Ortiz, D. Eck, and J. Schmidhuber, "DEKF-LSTM," in Proc. ESANN, 2002, pp. 369–376.
[31] Y. C. Ho and R. Lee, "A Bayesian approach to problems in stochastic estimation and control," IEEE Trans. Autom. Control, vol. 9, no. 4, pp. 333–339, Oct. 1964.
[32] M. Enescu, M. Sirbu, and V. Koivunen, "Recursive estimation of noise statistics in Kalman filter based MIMO equalization," in Proc. 27th General Assembly Int. Union Radio Sci. (URSI), Maastricht, The Netherlands, 2002, pp. 17–24.
[33] A. Kong, J. S. Liu, and W. H. Wong, "Sequential imputations and Bayesian missing data problems," J. Amer. Statist. Assoc., vol. 89, no. 425, pp. 278–288, 1994.
[34] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statist. Comput., vol. 10, no. 3, pp. 197–208, Jul. 2000.
[35] N. Bergman, "Recursive Bayesian estimation," Doctoral dissertation, Dept. Elect. Eng., Linköping Univ., Linköping, Sweden, 1999, vol. 579.
[36] X.-L. Hu, T. B. Schon, and L. Ljung, "A basic convergence result for particle filtering," IEEE Trans. Signal Process., vol. 56, no. 4, pp. 1337–1348, Apr. 2008.
[37] C. E. Rasmussen et al., Delve Data Sets. Accessed: Oct. 1, 2016. [Online]. Available: http://www.cs.toronto.edu/~delve/data/datasets.html
[38] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," J. Multiple-Valued Logic Soft Comput., vol. 17, nos. 2–3, pp. 255–287, 2011.
[39] L. Torgo. Regression Data Sets. Accessed: Oct. 1, 2016. [Online]. Available: http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
[40] Alcoa Inc. Common Stock. Accessed: Oct. 1, 2016. [Online]. Available: http://finance.yahoo.com/quote/AA?ltr=1
[41] E. W. Frees. Regression Modelling With Actuarial and Financial Applications. Accessed: Oct. 1, 2016. [Online]. Available: http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/data.html

Tolga Ergen received the B.S. degree in electrical and electronics engineering from Bilkent University, Ankara, Turkey, in 2016. He is currently pursuing the M.S. degree with the Department of Electrical and Electronics Engineering, Bilkent University. His current research interests include online learning, adaptive filtering, machine learning, optimization, and statistical signal processing.

Suleyman Serdar Kozat (A'10–M'11–SM'11) received the B.S. (Hons.) degree from Bilkent University, Ankara, Turkey, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Urbana, IL, USA. He joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member and later became a Project Leader with the Pervasive Speech Technologies Group, where he focused on problems related to statistical signal processing and machine learning. He was a Research Associate with the Cryptography and Anti-Piracy Group, Microsoft Research, Redmond, WA, USA. He is currently an Associate Professor with the Electrical and Electronics Engineering Department, Bilkent University. He has co-authored over 100 papers in refereed high impact journals and conference proceedings. He holds several patent inventions (used in several different Microsoft and IBM products) due to his research accomplishments with the IBM Thomas J. Watson Research Center and Microsoft Research. His current research interests include cyber security, anomaly detection, big data, data intelligence, adaptive filtering, and machine learning algorithms for signal processing. Dr. Kozat received many international and national awards. He is the Elected President of the IEEE Signal Processing Society, Turkey Chapter.