
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 29, NO. 8, AUGUST 2018

Efficient Online Learning Algorithms Based on LSTM Neural Networks

Tolga Ergen and Suleyman Serdar Kozat, Senior Member, IEEE

Abstract— We investigate online nonlinear regression and introduce novel regression structures based on the long short term memory (LSTM) networks. For the introduced structures, we also provide highly efficient and effective online training methods. To train these novel LSTM-based structures, we put the underlying architecture in a state space form and introduce highly efficient and effective particle filtering (PF)-based updates. We also provide stochastic gradient descent and extended Kalman filter-based updates. Our PF-based training method guarantees convergence to the optimal parameter estimation in the mean square error sense provided that we have a sufficient number of particles and satisfy certain technical conditions. More importantly, we achieve this performance with a computational complexity in the order of the first-order gradient-based methods by controlling the number of particles. Since our approach is generic, we also introduce a gated recurrent unit (GRU)-based approach by directly replacing the LSTM architecture with the GRU architecture, where we demonstrate the superiority of our LSTM-based approach in the sequential prediction task via different real life data sets. In addition, the experimental results illustrate significant performance improvements achieved by the introduced algorithms with respect to the conventional methods over several different benchmark real life data sets.

Index Terms— Gated recurrent unit (GRU), Kalman filtering, long short term memory (LSTM), online learning, particle filtering (PF), regression, stochastic gradient descent (SGD).

Manuscript received October 30, 2016; revised May 5, 2017 and August 15, 2017; accepted August 15, 2017. Date of publication September 13, 2017; date of current version July 18, 2018. This work was supported by TUBITAK under Contract 115E917. (Corresponding author: Tolga Ergen.) The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, 06800 Ankara, Turkey (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2017.2741598

I. INTRODUCTION

A. Preliminaries

THE problem of estimating an unknown desired signal is one of the main subjects of interest in the contemporary online learning literature, where we sequentially receive a data sequence related to a desired signal to predict the signal's next value [1]. This problem is known as online regression, and it is extensively studied in the neural network [2], machine learning [1], and signal processing literatures [3], especially for prediction tasks [4]. In these studies, nonlinear approaches are generally employed because, for certain applications, linear modeling is inadequate due to the constraints on linearity [3]. Here, in particular, we study nonlinear regression in an online setting, where we sequentially observe a data sequence and its label and seek a nonlinear relation between them to predict the future labels.

There exists a wide range of nonlinear modeling approaches in the machine learning and signal processing literatures for regression [1], [3]. However, most of these approaches suffer from high computational complexity, and they may provide inadequate performance due to stability and overfitting issues [3]. Neural network-based regression algorithms have also been introduced for nonlinear modeling since neural networks are capable of modeling highly nonlinear and complex structures [2], [4], [5]. However, they are also shown to be prone to overfitting and to demonstrate less than adequate performance in certain applications [6], [7]. To remedy these issues and further enhance their performance, neural networks composed of multiple layers, known as deep neural networks (DNNs), were recently introduced [8]. In DNNs, a layered structure is employed so that each layer performs a feature extraction based on the previous layers [8]. With this mechanism, DNNs are able to model highly nonlinear and complex structures [9]. However, this layered structure performs poorly in capturing time dependencies in the data, so DNNs provide only limited performance in modeling time series and processing temporal data [10]. As a remedy, basic recurrent neural networks (RNNs) were introduced since these networks have inherent memory that can store the past information [5]. However, basic RNNs lack control structures, so the long-term components cause either an exponential growth or decay in the norm of the gradients during training, which are the well-known exploding and vanishing gradient problems, respectively [6], [11]. Hence, they are insufficient to capture long-term dependencies in the data, which significantly restricts their performance in real life tasks [12]. In order to resolve this issue, a novel RNN architecture with several control structures, i.e., the long short term memory (LSTM) network [12], [13], was introduced. However, in the classical LSTM structures, we do not have the direct contribution of the regression vector to the output, i.e., the desired signal is regressed only using the state vector [4]. Hence, in this paper, we introduce LSTM-based online regression architectures, where we also incorporate the direct contribution of the regression vectors, inspired by the well-known ARMA models [14].


After the neural network structure is fixed, there exists a wide range of different methods to train the corresponding parameters in an online manner. Especially the first-order gradient-based approaches are widely used due to their efficiency in training because of the well-known backpropagation recursion [4], [15]. However, these techniques provide poorer performance compared with the second-order gradient-based techniques [5], [16]. As an example, the real-time recurrent learning (RTRL) algorithm is highly efficient in calculating gradients [15], [16]. However, since the RTRL algorithm exploits only the first-order gradient information, it performs poorly on ill-conditioned problems [17]. On the other side, although the second-order gradient-based techniques provide much better performance, they are highly complex compared with the first-order methods [5], [16], [18]. As an example, the well-known extended Kalman filter (EKF) method also uses the second-order information to boost its performance, which requires updating the error covariance matrix of the parameter estimate and brings an additional complexity accordingly [19]. Furthermore, the second-order gradient-based methods provide limited training performance due to an abundance of saddle points in neural network-based applications [20]. To alleviate these training issues, we introduce particle filtering (PF) [21]-based online updates for the LSTM architecture. In particular, we first put the LSTM architecture in a nonlinear state space form and formulate the parameter learning problem in this setup. Based on this form, we introduce a PF-based estimation algorithm to effectively learn the parameters. Here, our training method guarantees convergence to the optimal parameter estimation performance in an online manner provided that we have sufficiently many particles and satisfy certain technical conditions. Furthermore, by controlling the number of particles in our experiments, we demonstrate that we can significantly reduce the computational complexity while providing a superior performance compared with the conventional second-order methods. Our training approach is generic, so we also put the recently introduced gated recurrent unit (GRU) architecture [22] in a nonlinear state space form and then apply our algorithms to learn its parameters. Through an extensive set of simulations, we illustrate significant performance improvements achieved by our algorithms compared with the conventional methods [18], [23].

B. Prior Art and Comparisons

Neural network-based learning methods are powerful in modeling highly nonlinear structures, such that a single hidden layer neural network can adequately model any nonlinear structure [24]. In addition, these methods, especially complex RNN-based methods, are capable of effectively processing temporal data and modeling time series [4], [12]. Complex RNNs, e.g., LSTM networks, provide this performance thanks to their memory, which keeps the past information, and their control gates, which regulate the information flow inside the network [12], [13]. However, for complex RNNs, adequate performance requires high computational complexity, i.e., training of a large number of parameters at every time instance [4]. Thus, to mitigate complexity, the LSTM network-based methods in [16] and [5] choose a low-complexity first-order gradient-based technique, i.e., stochastic gradient descent (SGD) [23], to train their parameters. Even though there exist certain applications of LSTM trained with second-order techniques, e.g., EKF in [18] and a Hessian-free technique in [25], they suffer from complexity issues and also poor performance due to an abundance of saddle points [20]. On the contrary, for basic RNNs, we have fewer parameters to train; however, these neural networks do not have control structures [12], [13]. Hence, the exploding and vanishing gradient problems occur due to long-term components [6], [11]. These problems prevent the basic RNNs from learning correlations between distant events [6]. To ameliorate performance, the basic RNN-based learning methods in [5] and [16] choose a high-complexity second-order gradient-based technique to train their parameters. Hence, either low-complexity neural networks or low-complexity training methods are chosen to avoid an unmanageable increase in computational complexity. However, basic RNNs suffer from inadequately capturing long- and short-term dependencies compared with complex networks [12], [13]. On the other hand, the first-order gradient-based methods suffer from slower convergence and poorer performance compared with the second-order gradient-based techniques [5]. To circumvent these issues, in this paper, we derive online updates based on the PF algorithm [21] to train the LSTM architecture. Thus, we not only provide second-order training without any ad hoc linearization but also accomplish this with a computational complexity in the order of the first-order methods (by carefully controlling the number of particles in modeling).

We emphasize that the conventional neural network-based learning methods [5], [16], [18], [23] suffer from the well-known complexity–performance tradeoff. Due to this tradeoff, they are usually not chosen to address the nonlinear regression problem. There are certain neural network-based methods [5], [16] that particularly investigate nonlinear regression; however, they only employ the basic RNN architecture for this purpose. In addition, in their regression approach, they provide the final estimate by setting the output of the basic RNN architecture as a scalar value, so that the final estimate becomes a linear combination of only the internal states. Instead, in this paper, we employ the LSTM architecture for nonlinear regression and also introduce additional terms to incorporate the direct contribution of the regression vector into our final estimate. Therefore, we significantly improve the regression performance, as illustrated in our simulations.

C. Contributions

Our main contributions are as follows.
1) For the first time in the literature, we introduce online learning algorithms based on the LSTM architecture for data regression, where we efficiently train the LSTM architecture in an online manner using our PF-based approach.
2) We propose novel LSTM-based regression structures to compute the final estimate, where we introduce an additional gate to the classical LSTM architecture to incorporate the direct contribution of the input regressor, inspired by the ARMA models.


3) We put the LSTM equations in a nonlinear state space form and then derive online updates based on the state-of-the-art state estimation techniques [21], [26] for each parameter. Here, our PF-based method achieves a substantial performance improvement in online parameter training with respect to the conventional second- and first-order methods [18], [23].
4) We achieve this substantial improvement with a computational complexity in the order of the first-order gradient-based methods [18], [23] by controlling the number of particles in our method. In our simulations, we also illustrate that, by controlling the number of particles, we can achieve the same complexity as the first-order gradient-based methods while providing a far superior performance compared with both the first- and second-order methods.
5) Through an extensive set of simulations involving real life and financial data, we illustrate performance improvements achieved by our algorithms with respect to the conventional methods [18], [23]. Furthermore, since our approach is generic, we also introduce GRU-based algorithms by directly applying our approach to the GRU architecture, i.e., also a complex RNN, in Section IV.

D. Organization of This Paper

The organization of this paper is as follows. We introduce the online regression problem and then describe our LSTM-based model in Section II. We then introduce different architectures to compute the final estimate for data regression in Section III-A. In Section III-B, we review the conventional training methods and extend these methods to the introduced architectures. We then introduce our PF-based training algorithm in Section III-C. In Section IV, we illustrate the merits of the proposed algorithms and training methods via an extensive set of experiments involving real life and financial data, and we also introduce a GRU-based approach for online learning tasks. We then finalize our paper with concluding remarks in Section V.

II. MODEL AND PROBLEM DESCRIPTION

All vectors are column vectors and denoted by boldface lower case letters. Matrices are represented by boldface capital letters. For a vector u (or a matrix U), u^T (U^T) is the ordinary transpose. The time index is given as a subscript, e.g., u_t is the vector at time t. Here, 1 is a vector of all ones, 0 is a vector or matrix of all zeros, and I is the identity matrix, where the size is understood from the context. Given a vector u, diag(u) is the diagonal matrix constructed from the entries of u.

We sequentially receive {d_t}_{t≥1}, d_t ∈ R, and regression vectors {x_t}_{t≥1}, x_t ∈ R^p, such that our goal is to estimate d_t based on our current and past observations {..., x_{t-1}, x_t}. Given our estimate d̂_t, which can only be a function of {..., x_{t-1}, x_t} and {..., d_{t-2}, d_{t-1}}, we suffer the loss l(d_t, d̂_t). This framework models a wide range of machine learning problems including financial analysis [27], tracking [28], and state estimation [19]. As an example, in one step ahead data prediction under the square error loss, where we sequentially receive data and predict the next sample, we receive x_t = [x_t, x_{t-1}, ..., x_{t-p+1}]^T and then generate d̂_t; after d_t = x_{t+1} is observed, we suffer l(d_t, d̂_t) = (d_t - d̂_t)^2.

In this paper, to generate the sequential estimates d̂_t, we use RNNs. The basic RNN structure is described by the following set of equations [16]:

h_t = κ(W^{(h)} x_t + R^{(h)} h_{t-1})    (1)
y_t = u(R^{(y)} h_t)    (2)

where h_t ∈ R^m is the state vector, x_t ∈ R^p is the input, and y_t ∈ R^m is the output. The functions κ(·) and u(·) apply to vectors pointwise and are commonly set to tanh(·). For the coefficient matrices, we have W^{(h)} ∈ R^{m×p}, R^{(h)} ∈ R^{m×m}, and R^{(y)} ∈ R^{m×m}.

As a special case of RNNs, we use the LSTM neural network [12] with only one hidden layer. Although there exists a wide range of different implementations of the LSTM network, we use the most widely used extension, where the nonlinearities are set to the hyperbolic tangent function and the peephole connections are eliminated. This LSTM architecture is defined by the following set of equations [12]:

z_t = h(W^{(z)} x_t + R^{(z)} y_{t-1} + b^{(z)})    (3)
i_t = σ(W^{(i)} x_t + R^{(i)} y_{t-1} + b^{(i)})    (4)
f_t = σ(W^{(f)} x_t + R^{(f)} y_{t-1} + b^{(f)})    (5)
c_t = D_t^{(i)} z_t + D_t^{(f)} c_{t-1}    (6)
o_t = σ(W^{(o)} x_t + R^{(o)} y_{t-1} + b^{(o)})    (7)
y_t = D_t^{(o)} h(c_t)    (8)

where D_t^{(f)} = diag(f_t), D_t^{(i)} = diag(i_t), and D_t^{(o)} = diag(o_t). Furthermore, c_t ∈ R^m is the state vector, x_t ∈ R^p is the input vector, and y_t ∈ R^m is the output vector. Here, i_t, f_t, and o_t are the input, forget, and output gates, respectively. The functions g(·) and h(·) apply to vectors pointwise and are commonly set to tanh(·). Similarly, the sigmoid function σ(·) applies pointwise to the vector elements. For the coefficient matrices and the weight vectors, we have W^{(z)} ∈ R^{m×p}, R^{(z)} ∈ R^{m×m}, b^{(z)} ∈ R^m, W^{(i)} ∈ R^{m×p}, R^{(i)} ∈ R^{m×m}, b^{(i)} ∈ R^m, W^{(f)} ∈ R^{m×p}, R^{(f)} ∈ R^{m×m}, b^{(f)} ∈ R^m, W^{(o)} ∈ R^{m×p}, R^{(o)} ∈ R^{m×m}, and b^{(o)} ∈ R^m. Given the output y_t, we generate the final estimate as

d̂_t = w_t^T y_t    (9)

where the final regression coefficients w_t will be trained in an online manner in the following. Our goal is to design the system parameters so that Σ_{t=1}^{n} l(d_t, d̂_t) or E[l(d_t, d̂_t)] is minimized.

Remark 1: The basic LSTM network can be extended by including the last s outputs in the recursion, e.g., {y_{t-s}, ..., y_{t-1}}; however, this case corresponds to an extended output definition, i.e., an extended super output vector consisting of all {y_{t-s}, ..., y_{t-1}}. We use only y_{t-1} for notational simplicity.

In the following section, we first introduce novel LSTM network-based regression architectures inspired by the ARMA models. Then, we review and extend the conventional methods [18], [23] to learn the parameters of the LSTM in an online manner. Finally, we provide our novel PF-based training method.
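To make the recursion in (3)–(9) concrete, the following is a minimal NumPy sketch of one LSTM step and the basic final estimate in (9). The weight shapes follow the definitions above; the dimensions, random initialization, and dictionary key names are illustrative assumptions rather than choices made in the paper, and the elementwise products play the role of the diagonal matrices D_t^{(·)} in (6) and (8).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(params, x_t, c_prev, y_prev):
    """One forward step of the LSTM in (3)-(8); params holds the W^(.), R^(.), b^(.)
    matrices with the shapes given in Section II."""
    z = np.tanh(params["Wz"] @ x_t + params["Rz"] @ y_prev + params["bz"])   # (3)
    i = sigmoid(params["Wi"] @ x_t + params["Ri"] @ y_prev + params["bi"])   # (4)
    f = sigmoid(params["Wf"] @ x_t + params["Rf"] @ y_prev + params["bf"])   # (5)
    c = i * z + f * c_prev                                                   # (6)
    o = sigmoid(params["Wo"] @ x_t + params["Ro"] @ y_prev + params["bo"])   # (7)
    y = o * np.tanh(c)                                                       # (8)
    return c, y, o

# Illustrative dimensions and random parameters (assumptions, not from the paper).
m, p = 8, 8
rng = np.random.default_rng(0)
params = {}
for name in ["z", "i", "f", "o"]:
    params["W" + name] = 0.1 * rng.standard_normal((m, p))
    params["R" + name] = 0.1 * rng.standard_normal((m, m))
    params["b" + name] = np.zeros(m)

c, y, w = np.zeros(m), np.zeros(m), np.zeros(m)
x_t = rng.standard_normal(p)
c, y, o = lstm_step(params, x_t, c, y)
d_hat = w @ y   # final estimate in (9); w is trained online in Section III
```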


TABLE I
Comparison of the Computational Complexities of the Proposed Online Training Methods. p Represents the Dimensionality of the Regressor Space, m Represents the Dimensionality of the Network's Output Space, and N Represents the Number of Particles for the PF Algorithm
(Table body, from the complexities derived in Section III: SGD O(m^4 + m^2 p^2); EKF O(m^8 + m^4 p^4); PF O(N(m^2 + mp)).)

Fig. 1. Detailed schematic of the proposed architecture in (11) for the regression tasks. Note that for the summations before the gate and h(·) functions, we multiply x_t and y_{t-1} by W^{(.)} and R^{(.)}, respectively, and we also add the weight vector b^{(.)} to these summations. We omit these operations for presentation simplicity.

III. NOVEL LEARNING ALGORITHMS BASED ON LSTM NEURAL NETWORKS

In this section, we first introduce our novel contributions for data regression. For these contributions, we also derive online updates based on the SGD, EKF, and PF algorithms.

A. Different Regression Architectures

We first consider the direct linear combination of the output y_t with the weight vector w_t. In this case, given (8), we generate the final estimate as

d̂_t^{(1)} = w_t^T y_t = w_t^T D_t^{(o)} h(c_t)    (10)

where w_t ∈ R^m. In (10), the final estimate of the system does not directly depend on x_t. However, in generic nonlinear regression tasks, the final estimate usually depends on the current regression vector as well [29]. For this purpose, we introduce a linear term to incorporate the effects of the input vector, i.e., the regression vector, into the final estimate, as shown in Fig. 1. Hence, we introduce the second regression architecture as

d̂_t^{(2)} = w_t^T D_t^{(o)} h(c_t) + v_t^T D_t^{(α)} h(x_t)    (11)

with v_t ∈ R^p, in accordance with (10), where D_t^{(α)} = diag(α_t) and

α_t = σ(W^{(α)} x_t + R^{(α)} D_{t-1}^{(o)} h(c_{t-1}) + b^{(α)}).

Here, the final estimate directly depends on x_t, and the dependence is controlled by the control gate, i.e., α_t.

In (10) and (11), the effects of the input and state vectors are controlled by the control and output gates, respectively. Thus, these gates may restrict the exposure of the state and input contents in nonlinear regression problems. To expose the full content of the state and input vectors, we remove the control and output gates in (11) and introduce the third regression architecture as follows:

d̂_t^{(3)} = w_t^T h(c_t) + v_t^T h(x_t).    (12)

Note that d̂_t^{(2)} is our most general architecture to compute the final estimate, since the updates for d̂_t^{(1)} are a special case when D_t^{(α)} = 0 and the updates for d̂_t^{(3)} are a special case when D_t^{(o)} = I and D_t^{(α)} = I. In the following sections, we provide the full derivations for d̂_t^{(1)} for notational and presentation simplicity, and we also provide the required updates to extend these basic derivations to d̂_t^{(2)} and d̂_t^{(3)}.
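The following is a minimal sketch of the three final-estimate computations in (10)–(12), reusing the lstm_step sketch above. The control-gate weight shapes W^{(α)} ∈ R^{p×p}, R^{(α)} ∈ R^{p×m}, and b^{(α)} ∈ R^p are inferred so that α_t matches the dimension of h(x_t) (consistent with the parameter count given later in Remark 4); the key names and inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def final_estimates(w, v, alpha_params, x_t, c_t, c_prev, o_t, o_prev):
    """Final estimates in (10)-(12). o_t and o_prev are the output-gate vectors of
    the current and previous LSTM steps; alpha_params holds W^(a), R^(a), b^(a)."""
    # Architecture 1, (10): output gate controls the state contribution.
    d1 = w @ (o_t * np.tanh(c_t))
    # Control gate alpha_t in (11), driven by x_t and the gated previous state.
    alpha = sigmoid(alpha_params["Wa"] @ x_t
                    + alpha_params["Ra"] @ (o_prev * np.tanh(c_prev))
                    + alpha_params["ba"])
    # Architecture 2, (11): adds a gated direct contribution of the regressor.
    d2 = d1 + v @ (alpha * np.tanh(x_t))
    # Architecture 3, (12): both gates removed, full exposure of state and input.
    d3 = w @ np.tanh(c_t) + v @ np.tanh(x_t)
    return d1, d2, d3
```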


B. Conventional Online Training Algorithms

In this section, we introduce methods to learn the corresponding parameters of the introduced architectures in an online manner. We first derive the online updates based on the SGD algorithm [17], also known as the RTRL algorithm [23] in the neural network literature, where we derive the recursive gradient formulations to obtain the online updates for the LSTM architecture.

The SGD algorithm exploits only the first-order gradient information, so it usually converges more slowly compared with the second-order gradient-based techniques and performs poorly on ill-conditioned problems [17]. To mitigate these problems, we next consider the second-order gradient-based techniques, which have a faster convergence rate and are more robust against ill-conditioned problems [5]. We first put the LSTM equations in a nonlinear state space form so that we can consider the EKF algorithm [19] to train the parameters in an online manner. However, the EKF algorithm requires the first-order Taylor series expansion to linearize the nonlinear network equations, and this degrades its performance [5], [19]. In addition, Table I shows that the EKF algorithm has high computational complexity compared with the SGD algorithm. In the following sections, we derive both the SGD- and EKF-based training methods and extend these derivations to the regression architectures in (10)–(12).

1) Online Learning With the SGD Algorithm: For each parameter set, we next derive the stochastic gradient updates, i.e., the RTRL algorithm [23], to minimize the instantaneous loss l(d_t, d̂_t) = (d_t − d̂_t)^2, and we extend these calculations to the introduced architectures. For the weight vector, we use

w_{t+1} = w_t − μ_t ∇_{w_t} l(d_t, d̂_t) = w_t + 2 μ_t (d_t − d̂_t) D_t^{(o)} h(c_t)    (13)

where for the learning rate μ_t, we require μ_t → 0 as t → ∞ and Σ_{k=1}^{t} μ_k → ∞ as t → ∞, e.g., μ_t = 1/t. For the parameter W^{(z)}, we have the following update:

W^{(z)} = W^{(z)} − μ_t ∇_{W^{(z)}} l(d_t, d̂_t).

For notational simplicity, we derive the updates for each entry of W^{(z)} separately. We denote the entry in the ith row and jth column of W^{(z)} by w_{ij}^{(z)}. We have the following update for each entry of W^{(z)}:

w_{ij}^{(z)} = w_{ij}^{(z)} + 2 μ_t (d_t − d̂_t) w_t^T \frac{∂ D_t^{(o)} h(c_t)}{∂ w_{ij}^{(z)}}.    (14)

We write the partial derivative in (14) as

\frac{∂ D_t^{(o)} h(c_t)}{∂ w_{ij}^{(z)}} = D_t^{(∂o)} h(c_t) + D_t^{(o)} D_t^{(h'(c))} \frac{∂ c_t}{∂ w_{ij}^{(z)}}    (15)

where h'(·) denotes the derivative of h(·) with respect to its argument, D_t^{(h'(c))} = diag(h'(c_t)), and

D_t^{(∂o)} = diag\!\left(\frac{∂ o_t}{∂ w_{ij}^{(z)}}\right).

Now, we compute the partial derivatives of o_t and c_t with respect to w_{ij}^{(z)}. Taking the derivative of (7) gives

\frac{∂ o_t}{∂ w_{ij}^{(z)}} = D_t^{(σ'(ζ^{(o)}))} \left[ R^{(o)} D_{t-1}^{(o)} D_{t-1}^{(h'(c))} ψ_{ij,t-1}^{(z)} + R^{(o)} D_{t-1}^{(∂o)} h(c_{t-1}) \right]    (16)

where D_t^{(σ'(ζ^{(o)}))} = diag(σ'(ζ_t^{(o)})),

ζ_t^{(o)} = W^{(o)} x_t + R^{(o)} D_{t-1}^{(o)} h(c_{t-1}) + b^{(o)}    (17)

and

ψ_{ij,t-1}^{(z)} = \frac{∂ c_{t-1}}{∂ w_{ij}^{(z)}}.    (18)

To get (15), we also compute the partial derivative of c_t with respect to w_{ij}^{(z)}. Using (18), we write the following recursive equation:

ψ_{ij,t}^{(z)} = D_t^{(z)} \frac{∂ i_t}{∂ w_{ij}^{(z)}} + D_t^{(i)} \frac{∂ z_t}{∂ w_{ij}^{(z)}} + D_{t-1}^{(c)} \frac{∂ f_t}{∂ w_{ij}^{(z)}} + D_t^{(f)} ψ_{ij,t-1}^{(z)}    (19)

where D_t^{(z)} = diag(z_t) and D_{t-1}^{(c)} = diag(c_{t-1}). To obtain (19), we compute the partial derivatives of (3)–(5) with respect to w_{ij}^{(z)} as follows:

\frac{∂ i_t}{∂ w_{ij}^{(z)}} = D_t^{(σ'(ζ^{(i)}))} \left[ R^{(i)} D_{t-1}^{(o)} D_{t-1}^{(h'(c))} ψ_{ij,t-1}^{(z)} + R^{(i)} D_{t-1}^{(∂o)} h(c_{t-1}) \right]    (20)

\frac{∂ f_t}{∂ w_{ij}^{(z)}} = D_t^{(σ'(ζ^{(f)}))} \left[ R^{(f)} D_{t-1}^{(o)} D_{t-1}^{(h'(c))} ψ_{ij,t-1}^{(z)} + R^{(f)} D_{t-1}^{(∂o)} h(c_{t-1}) \right]    (21)

\frac{∂ z_t}{∂ w_{ij}^{(z)}} = D_t^{(h'(ζ^{(z)}))} \left[ δ_{ij} x_t + R^{(z)} D_{t-1}^{(o)} D_{t-1}^{(h'(c))} ψ_{ij,t-1}^{(z)} + R^{(z)} D_{t-1}^{(∂o)} h(c_{t-1}) \right]    (22)

where ζ_t^{(i)}, ζ_t^{(f)}, and ζ_t^{(z)} denote the corresponding arguments of σ(·) and h(·) in (4), (5), and (3), and δ_{ij} is an m × p matrix with all entries zero, except a 1 in the ijth position. With these equations, we can compute (19) and then obtain (15) using (19) and (16). By this, we have all the required equations for the SGD update in (14).

Remark 2: Here, we derive the updates just for the entries of W^{(z)}. When we take the partial derivative of d̂_t with respect to the entries of the other parameters, (14), (15), (18), and (19) still hold with a change of the derivative variable. For (16) and (20)–(22), we also have a change in the form and location of the δ_{ij} x_t term. In particular, as in (22), when we take the derivatives with respect to the entries of W^{(.)}, R^{(.)}, and b^{(.)}, respectively, additional δ_{ij} x_t, δ_{ij} y_{t-1}, and δ_{ij} terms appear in the derivative equation of the corresponding structure, i.e., one of (16) and (20)–(22). Here, the size of δ_{ij} changes accordingly.
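Before extending these updates to d̂_t^{(2)} (Remark 3 below), the following is a minimal sketch of the simplest of the above recursions, the output-weight update (13) for Architecture 1. The internal-weight updates (14)–(22), which additionally propagate ψ_{ij,t} and the gate derivatives, are omitted here for brevity; the randomly generated c_t and o_t are placeholders standing in for the LSTM forward pass, and the decaying step size μ_t = 1/t follows the condition stated after (13).

```python
import numpy as np

def sgd_output_weight_step(w, d_t, c_t, o_t, mu_t):
    """One online SGD step for the final regression vector w_t, as in (13),
    for Architecture 1 where d_hat = w^T D_t^(o) h(c_t)."""
    y_t = o_t * np.tanh(c_t)                        # (8)
    d_hat = float(w @ y_t)                          # (10)
    w_next = w + 2.0 * mu_t * (d_t - d_hat) * y_t   # (13)
    return w_next, d_hat

# Illustrative usage with mu_t = 1/t (placeholder data, not from the paper).
rng = np.random.default_rng(1)
m = 8
w = np.zeros(m)
for t in range(1, 6):
    c_t = rng.standard_normal(m)                         # stand-in LSTM state
    o_t = 1.0 / (1.0 + np.exp(-rng.standard_normal(m)))  # stand-in output gate
    d_t = rng.standard_normal()                          # observed desired signal
    w, d_hat = sgd_output_weight_step(w, d_t, c_t, o_t, mu_t=1.0 / t)
```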
Remark 3: In the case of d̂_t^{(2)}, instead of (14), we have the following update:

w_{ij}^{(z)} = w_{ij}^{(z)} + 2 μ_t (d_t − d̂_t) \left[ w_t^T \frac{∂ D_t^{(o)} h(c_t)}{∂ w_{ij}^{(z)}} + v_t^T D_t^{(∂α)} h(x_t) \right]    (23)

where D_t^{(∂α)} = diag(∂α_t/∂w_{ij}^{(z)}) and the introduced partial derivative term ∂α_t/∂w_{ij}^{(z)} is computed in the same manner as (16). Furthermore, we have an additional update for v_t as follows:

v_{t+1} = v_t + 2 μ_t (d_t − d̂_t) D_t^{(α)} h(x_t).    (24)

Then, we follow the derivations in (13), (15), (16), and (19)–(22). For d̂_t^{(3)}, we just set D_t^{(o)} = I and D_t^{(α)} = I, and then all the derivations in (13), (15), (16), (19), and (20)–(24) follow as for d̂_t^{(2)}.

According to the update equations in (15), (16), and (19), the update of an entry of a parameter has a computational complexity of O(m^2 + mp) due to the matrix-vector multiplications in (17). Since we have mp, m^2, and m entries for W^{(.)}, R^{(.)}, and b^{(.)}, respectively, this results in O(m^4 + m^2 p^2) computational complexity to update the entries of all parameters, as given in Table I.

2) Online Learning With the EKF Algorithm: We next provide the updates based on the EKF algorithm in order to train the parameters of the system described in (3)–(8) and (10). In the literature, there are certain EKF-based methods to train LSTM (see [18], [30]); however, these methods estimate only the parameters, i.e., θ_t. In our case, we also estimate the state and the output vector of the LSTM, i.e., c_t and y_t, respectively. In the following, we derive the updates for our approach and extend these to the introduced architectures.

The EKF algorithm assumes that the posterior density function of the states given the observations is Gaussian [19]. This assumption can be satisfied by introducing perturbations to the system equations via Gaussian noise [31]. Hence, we first write the LSTM system in a nonlinear state space form and then introduce Gaussian noise terms to be able to use the EKF updates. For convenience, we group the parameters {w, W^{(z)}, R^{(z)}, b^{(z)}, W^{(i)}, R^{(i)}, b^{(i)}, W^{(f)}, R^{(f)}, b^{(f)}, W^{(o)}, R^{(o)}, b^{(o)}} together into a vector θ, θ ∈ R^{n_θ}, where n_θ = 4m(m + p) + 5m. By this, we write the LSTM system as

y_t = τ(c_t, x_t, y_{t-1}) + ϵ_t    (25)
c_t = Φ(c_{t-1}, x_t, y_{t-1}) + v_t    (26)
θ_t = θ_{t-1} + e_t    (27)
d_t = w_t^T y_t + ε_t    (28)

where τ(·) and Φ(·) are the nonlinear functions in (8) and (6), respectively, and ϵ_t, e_t, v_t, and ε_t are zero-mean Gaussian random variables. In addition, [ϵ_t^T, v_t^T, e_t^T]^T and ε_t have covariance Q_t and variance R_t, respectively. Here, we assume that Q_t and R_t are known or can be estimated from the data, as detailed later in this paper. We write (25)–(27) in a compact form as

[y_t; c_t; θ_t] = [τ(c_t, x_t, y_{t-1}); Φ(c_{t-1}, x_t, y_{t-1}); θ_{t-1}] + [ϵ_t; v_t; e_t]    (29)
d_t = w_t^T y_t + ε_t.    (30)

In the system described in (29) and (30), we are able to observe only d_t, and we can estimate y_t, c_t, and θ_t based on the observed d_t values. Thus, we directly apply the EKF algorithm [19] to estimate y_t, c_t, and θ_t as follows:

[y_{t|t}; c_{t|t}; θ_{t|t}] = [y_{t|t-1}; c_{t|t-1}; θ_{t|t-1}] + L_t (d_t − w_{t|t-1}^T y_{t|t-1})    (31)
y_{t|t-1} = τ(c_{t|t-1}, x_t, y_{t-1|t-1})    (32)
c_{t|t-1} = Φ(c_{t-1|t-1}, x_t, y_{t-1|t-1})    (33)
θ_{t|t-1} = θ_{t-1|t-1}    (34)
L_t = Σ_{t|t-1} H_t (H_t^T Σ_{t|t-1} H_t + R_t)^{-1}    (35)
Σ_{t|t} = Σ_{t|t-1} − L_t H_t^T Σ_{t|t-1}    (36)
Σ_{t|t-1} = F_{t-1} Σ_{t-1|t-1} F_{t-1}^T + Q_{t-1}    (37)

where Σ ∈ R^{(2m+n_θ)×(2m+n_θ)} is the error covariance matrix, L_t ∈ R^{2m+n_θ} is the Kalman gain, Q_t ∈ R^{(2m+n_θ)×(2m+n_θ)} is the process noise covariance, and R_t ∈ R is the measurement noise variance. We compute H_t and F_t as follows:

H_t^T = [∂d̂_t/∂y, ∂d̂_t/∂c, ∂d̂_t/∂θ], evaluated at y = y_{t|t-1}, c = c_{t|t-1}, θ = θ_{t|t-1}    (38)

F_t = [ ∂τ/∂y, ∂τ/∂c, ∂τ/∂θ ; ∂Φ/∂y, ∂Φ/∂c, ∂Φ/∂θ ; 0, 0, I ], evaluated at y = y_{t|t}, c = c_{t|t}, θ = θ_{t|t}

where F_t ∈ R^{(2m+n_θ)×(2m+n_θ)} and H_t ∈ R^{2m+n_θ}. For (35) and (37), we use Q_t and R_t; however, these may not be known in advance. To estimate R_t, we can use exponential smoothing as follows:

R_t = (1 − α) R_{t-1} + α λ_t^2

where 0 < α < 1 is the smoothing constant and

λ_t = d_t − w_{t|t-1}^T y_{t|t-1}.    (39)

For the estimation of Q_t, we cannot use the exponential smoothing technique due to our inability to observe the states at each time instance. Although there exists a wide variety of techniques to estimate Q_t, we use the algorithm in [32], which provides a highly effective estimate of Q_t.

Remark 4: For the EKF derivations of d̂_t^{(2)}, we change the observation model in (30), the update in (31), the Jacobian computation in (38), and the definition in (39) according to the definition of the architecture in (11). In addition, we also extend the parameter vector θ_t by adding v_t, W^{(α)}, R^{(α)}, and b^{(α)}. Hence, we have θ_t ∈ R^{n_θ}, where n_θ = (4m + p)(m + p) + 5m + 2p. For the EKF derivations of d̂_t^{(3)}, we change the observation model in (30), the update in (31), the Jacobian computation in (38), and the definition in (39) according to (12). Moreover, we modify θ_t by removing W^{(α)}, R^{(α)}, b^{(α)}, W^{(o)}, R^{(o)}, and b^{(o)} from its definition for d̂_t^{(2)}. Hence, we obtain θ_t ∈ R^{n_θ}, where n_θ = 3m(m + p) + 4m + p.

According to the update equations in (31)–(33) and (35)–(37), the computational complexity of the updates based on the EKF algorithm results in O(m^8 + m^4 p^4) due to the matrix multiplications in (35)–(37).
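The following is a minimal sketch of one EKF step (31)–(37) on the augmented state a = [y; c; θ], together with the exponential smoothing estimate of R_t in (39). The stacked transition map and the Jacobians F_t and H_t in (38), which for the LSTM would be obtained by differentiating (6) and (8) (e.g., via automatic differentiation), are supplied by the caller; the argument names and the way they are passed in are assumptions for illustration.

```python
import numpy as np

def ekf_step(a_prev, Sigma_prev, d_t, f_trans, F_t, H_t, w_prev, Q, R):
    """One EKF step for the augmented state a = [y; c; theta], following (31)-(37).
    f_trans: callable implementing the stacked noiseless transition in (29);
    F_t, H_t: Jacobians as in (38), supplied by the caller."""
    # Prediction: (32)-(34) and (37).
    a_pred = f_trans(a_prev)
    Sigma_pred = F_t @ Sigma_prev @ F_t.T + Q
    # Innovation and Kalman gain, (35); H_t is a vector, so the inverse is a scalar division.
    y_pred = a_pred[:len(w_prev)]                  # predicted network output y_{t|t-1}
    innov = d_t - w_prev @ y_pred                  # lambda_t in (39)
    s = float(H_t @ Sigma_pred @ H_t) + R
    L = (Sigma_pred @ H_t) / s
    # Correction: (31) and (36).
    a_post = a_pred + L * innov
    Sigma_post = Sigma_pred - np.outer(L, H_t @ Sigma_pred)
    return a_post, Sigma_post, innov

def update_R(R_prev, innov, alpha=0.1):
    """Exponential smoothing estimate of the measurement noise variance, as in (39)."""
    return (1.0 - alpha) * R_prev + alpha * innov ** 2
```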
C. Online Training Based on the PF Algorithm

Since the conventional training methods [18], [23] provide restricted performance, as explained in the previous section, we introduce a novel PF-based method that provides superior performance compared with the second-order training methods. Furthermore, we achieve this performance with a computational complexity in the order of the first-order methods, depending on the choice of N, as shown in Table I. In the following, we derive the updates for our PF-based training method and extend these calculations to the introduced architectures.

The PF algorithm [21] requires no assumptions other than the independence of the noise samples in (29) and (30). Hence, we modify the system in (29) and (30) as follows:

a_t = ϕ(a_{t-1}, x_t) + η_t    (40)
d_t = w_t^T y_t + ξ_t    (41)

where η_t and ξ_t are independent noise samples, ϕ(·, ·) is the nonlinear mapping in (29), and

a_t = [y_t; c_t; θ_t].

For (40) and (41), we seek to obtain E[a_t | d_{1:t}], i.e., the optimal state estimate in the mean square error (MSE) sense. For this purpose, we first find the posterior probability density function p(a_t | d_{1:t}). We then calculate the conditional mean of the state vector based on the posterior density function. To obtain the density function, we employ the PF algorithm [21] as follows.

Let {a_t^i, ω_t^i}_{i=1}^{N} denote the samples and the associated weights of the desired distribution, i.e., p(a_t | d_{1:t}). Then, we obtain the desired distribution from its samples as follows:

p(a_t | d_{1:t}) ≈ Σ_{i=1}^{N} ω_t^i δ(a_t − a_t^i)    (42)

where δ(·) represents the Dirac delta function. Since obtaining the samples from the desired distribution is intractable in most cases [21], an intermediate function is introduced to obtain the samples {a_t^i}_{i=1}^{N}, which is called the importance function [21]. Hence, we first obtain the samples from the importance function and then estimate the desired density function based on these samples as follows. As an example, in order to calculate E_p[a_t | d_{1:t}], we use the following trick:

E_p[a_t | d_{1:t}] = E_q[ a_t \, p(a_t | d_{1:t}) / q(a_t | d_{1:t}) \mid d_{1:t} ]

where E_f represents an expectation operation with respect to a certain density function f(·). Hence, we observe that we can use q(·), i.e., the importance function, when direct sampling from the desired distribution p(·) is intractable. Here, we use q(a_t | d_{1:t}) as our importance function to obtain the samples, and the corresponding weights are calculated as follows:

ω_t^i ∝ p(a_t^i | d_{1:t}) / q(a_t^i | d_{1:t})    (43)

where the weights are normalized such that

Σ_{i=1}^{N} ω_t^i = 1.

To simplify the weight calculation, we can factorize (43) to obtain a recursive formulation for the update of the weights as follows [26]:

ω_t^i ∝ \frac{p(d_t | a_t^i) \, p(a_t^i | a_{t-1}^i)}{q(a_t^i | a_{t-1}^i, d_t)} ω_{t-1}^i.    (44)

In (44), we aim to choose the importance function such that the variance of the weights is minimized. Thus, we can guarantee that all the particles have nonnegligible weights and contribute considerably to (42) [33]. In this sense, the optimal choice of the importance function is p(a_t | a_{t-1}^i, d_t); however, this requires an integration that does not have an analytic form in most cases [34]. Thus, we choose p(a_t | a_{t-1}^i) as the importance function, which provides a small variance for the weights, although not zero as the optimal importance function does [21], [34]. This simplifies (44) as follows:

ω_t^i ∝ p(d_t | a_t^i) ω_{t-1}^i.    (45)

We can now use the desired distribution to compute the conditional mean of the augmented state vector a_t via (42) and (45). By this, we obtain the conditional mean for a_t as follows:

E[a_t | d_{1:t}] = ∫ a_t p(a_t | d_{1:t}) d a_t ≈ ∫ a_t Σ_{i=1}^{N} ω_t^i δ(a_t − a_t^i) d a_t = Σ_{i=1}^{N} ω_t^i a_t^i.    (46)

While applying the PF algorithm, the variance of the weights inevitably increases over time, so that after a few time steps all but one of the weights take values that are very close to zero [33]. For this reason, although particles with very small weights have almost no contribution to our estimate in (46), we have to update them using (40) and (45). Hence, most of our computational effort is spent on the particles with negligible weights, which is known as the degeneracy problem [21]. To measure degeneracy, we use the effective sample size introduced in [35], which is calculated as follows:

N_{eff} = 1 / Σ_{i=1}^{N} (ω_t^i)^2.    (47)

Note that a small N_{eff} value indicates that the variance of the weights is high, i.e., the degeneracy problem. If N_{eff} is smaller than a certain threshold [33], then we apply the resampling algorithm introduced in [26], which eliminates the particles with negligible weights and focuses on the particles with large weights to avoid degeneracy. By this, we obtain an online training method (see Algorithm 1 for the pseudocode) that converges to E[a_t | d_{1:t}], where the convergence is guaranteed under certain conditions as follows.
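The following is a minimal sketch of one step of this PF-based update (cf. Algorithm 1 below). It assumes, consistent with the Gaussian-noise setup above, that the likelihood p(d_t | a_t^i) implied by (41) is Gaussian with variance var_xi, that sampling from p(a_t | a_{t-1}^i) is done by propagating through the mapping ϕ of (40) and adding Gaussian noise with variance var_eta, and that the output weight w is carried inside θ within the augmented particle. The callable phi, the slice arguments, and the multinomial resampling (standing in for the resampling algorithm of [26]) are illustrative assumptions.

```python
import numpy as np

def pf_step(particles, weights, d_t, phi, w_out_index, var_eta, var_xi, rng,
            resample_threshold=0.5):
    """One PF step following (40)-(47).
    particles: (N, dim) array of augmented states a_t^i = [y; c; theta];
    phi: callable a_{t-1} -> a_t implementing the noiseless mapping in (40);
    w_out_index: (w_slice, y_slice) selecting w and y so that d_hat^i = w^T y."""
    N, dim = particles.shape
    # Draw a_t^i ~ p(a_t | a_{t-1}^i): propagate and add process noise (steps 1-2).
    particles = np.array([phi(a) for a in particles])
    particles = particles + rng.normal(scale=np.sqrt(var_eta), size=(N, dim))
    # Weight update (45) with a Gaussian likelihood for (41), then normalize (steps 3-8).
    w_slice, y_slice = w_out_index
    d_pred = np.einsum("ij,ij->i", particles[:, w_slice], particles[:, y_slice])
    weights = weights * np.exp(-0.5 * (d_t - d_pred) ** 2 / var_xi)
    weights = weights / weights.sum()
    # Degeneracy check via the effective sample size (47) and resampling (steps 9-13).
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < resample_threshold * N:
        idx = rng.choice(N, size=N, p=weights)   # multinomial stand-in for [26]
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    # Conditional-mean estimate (46) (step 14).
    a_hat = weights @ particles
    return particles, weights, a_hat
```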
Algorithm 1 Online Training Based on the PF Algorithm
1: for i = 1 : N do
2:   Draw a_t^i ~ p(a_t | a_{t-1}^i)
3:   Assign ω_t^i according to (45)
4: end for
5: Calculate the total weight: S = Σ_{j=1}^{N} ω_t^j
6: for i = 1 : N do
7:   Normalize: ω_t^i = ω_t^i / S
8: end for
9: Calculate N_eff according to (47)
10: if N_eff < N_T then   % N_T is a threshold for N_eff
11:   Apply the resampling algorithm in [26]
12:   Obtain new pairs {ā_t^i, ω̄_t^i}_{i=1}^{N}, where ω̄_t^i = 1/N, ∀i
13: end if
14: Using {ā_t^i, ω̄_t^i}_{i=1}^{N}, compute the estimate according to (46)

Remark 5: For the PF derivations of d̂_t^{(2)}, we change the observation model in (41) according to the definition in (11). We also modify a_t by adding v_t, W^{(α)}, R^{(α)}, and b^{(α)} to θ_t. For the PF derivations of d̂_t^{(3)}, we modify (41) according to the definition in (12). Furthermore, we modify θ_t by removing W^{(α)}, R^{(α)}, b^{(α)}, W^{(o)}, R^{(o)}, and b^{(o)} from its definition for d̂_t^{(2)}.

Theorem 1: Let a_t be the state vector such that

sup_{a_t} |a_t|^4 p(d_t | a_t) < K_t    (48)

where K_t is a finite constant independent of N. Then we have the following convergence result:

Σ_{i=1}^{N} ω_t^i a_t^i → E[a_t | d_{1:t}] as N → ∞.

Proof of Theorem 1: From [36], we have

E[ | E[π(a_t) | d_{1:t}] − Σ_{i=1}^{N} ω_t^i π(a_t^i) |^4 ] ≤ C_t ||π||_{t,4}^4 / N^2    (49)

where

||π||_{t,4} ≜ max{1, (E[|π(a_{t'})|^4 | d_{1:t'}])^{1/4}, t' = 1, 2, ..., t}

π ∈ B_t^4, i.e., a class of functions with certain properties described in [36], and C_t represents a finite constant independent of N. With (48), π(a_t) = a_t satisfies the conditions of B_t^4. Therefore, applying π(a_t) = a_t to (49) and then evaluating (49) as N goes to infinity concludes our proof.

This theorem provides a convergence result under (48). The inequality in (48) implies that the conditional distribution of the observations, i.e., p(d_t | a_t), decays faster than a_t increases [36]. Since generic distributions usually decrease exponentially, e.g., the Gaussian distribution, or are nonzero only on bounded intervals, (48) is not a strict assumption for a_t. Hence, we can conclude that Theorem 1 can be employed for most cases.

According to the update equations in (40), (41), (45), and (46), each particle costs O(m^2 + mp) due to the matrix-vector multiplications in (40) and (41), and this results in O(N(m^2 + mp)) computational complexity to update all particles.

IV. SIMULATIONS

In this section, we illustrate the performances of our algorithms on different benchmark real data sets under various scenarios. We first consider the regression performance for real life data sets such as kinematic [37], elevators [38], bank [39], and pumadyn [38]. We then consider the regression performance for financial data sets, e.g., the Alcoa stock price [40] and the Hong Kong exchange rate data [41]. We then compare the performances of the algorithms based on two different neural networks, i.e., the LSTM and GRU networks [22]. Finally, we comparatively illustrate the merits of our LSTM-based regression architectures described in (10)–(12).

Throughout this section, "Architecture 1" represents the LSTM network with (10) as the final estimate equation; similarly, "Architecture 2" represents the LSTM network with (11), and "Architecture 3" represents the LSTM network with (12).

A. Real Life Data Sets

In this section, we evaluate the performances of the algorithms for the real life data sets. We first evaluate the performances of the algorithms for the kinematic data set [37]. We then examine the effect of the number of particles on the convergence rate of the PF-based algorithm using the same data set. Furthermore, in order to illustrate the effects of model size while keeping the computation time the same, we perform another experiment on the same data set for the PF-based algorithm. Finally, we consider three benchmark real data sets, i.e., elevators [38], bank [39], and pumadyn [38], to evaluate the regression performances of our algorithms.

We first consider the kinematic data set [37], i.e., a simulation of an eight-link all-revolute robotic arm. Our aim is to predict the distance of the effector from a target. We first select a fixed architecture. For this purpose, we can choose any one of the three architectures, since the algorithm with the best performance is the same for all three architectures, as detailed later in this section. Here, we choose Architecture 1. Furthermore, we choose the parameters such that all the introduced algorithms reach their maximum performance for a fair comparison. To provide this fair setup, we use the following parameters. For this data set, the input vector is x_t ∈ R^8 and we set the output dimension of the neural network as m = 8. For the PF-based algorithm, the crucial parameter is the number of particles; we set this parameter as N = 1500. In addition, we choose η_t and ξ_t as zero-mean Gaussian random variables with the covariance Cov[η_t] = 0.01 I and the variance Var[ξ_t] = 0.25, respectively. For the EKF-based algorithm, we choose the initial error covariance as Σ_{0|0} = 0.01 I. Moreover, we choose Q_t = 0.01 I and R_t = 0.25. For the SGD-based algorithm, we set the learning rate as μ = 0.03. As seen in Fig. 2, the PF-based algorithm converges to a much smaller final MSE level, and hence significantly outperforms the other algorithms.
Fig. 2. Sequential prediction performances of the algorithms for the kinematic data set.

In order to illustrate the effect of the number of particles on the convergence rate, we perform a new experiment on the kinematic data set, where we use the same setup except for the number of particles. In Fig. 3, we observe that as the number of particles increases, the PF-based algorithm achieves a lower MSE value with a faster convergence rate. Furthermore, as N increases, the marginal performance improvement becomes smaller compared with the previous N values. As an example, we observed that even though there is a significant improvement between the N = 50 and N = 100 cases, there is only a slight improvement between the N = 500 and N = 1500 cases. Hence, if we further increase N, the marginal performance improvement may not be worth the increase in computational complexity for our case. Thus, we illustrate that N = 1500 is a reasonable choice for our simulations.

Fig. 3. Comparison of the PF-based algorithm with different numbers of particles for the kinematic data set.

In addition to the simulation for the convergence rate, we perform another experiment on the same data set in order to observe the effects of model size while keeping the computation time the same. To provide this setup, we choose four different combinations of the output dimension m and the number of particles N so that each combination consumes the same amount of computation time. In Fig. 4, we observe that as the model size increases, the performance of the PF-based algorithm decreases. Since the PF-based algorithm approximates a density function based on the particles, as the number of particles decreases, we expect to obtain worse approximations of the density function. Hence, Fig. 4 matches our interpretation of the PF-based algorithm.

Fig. 4. Comparison of the PF-based algorithm with different N-m combinations for the kinematic data set. Note that all the combinations have the same computation time.

Other than the kinematic data set, we also consider the elevators [38], bank [39], and pumadyn [38] data sets. For all of these data sets, we again select a fixed architecture, i.e., Architecture 1. In addition, we choose the performance-maximizing parameters while forcing the PF-based algorithm to consume less training time than the other algorithms by controlling N. With this setup, we have the following parameter selection for each data set. The elevators data set is obtained from a procedure related to controlling an F16 aircraft, and our aim is to predict the variable that expresses the actions of the aircraft. For this data set, we have x_t ∈ R^18 and we set the output dimension of the neural network as m = 18. For the other parameters, we use the same settings as in the kinematic data set case, except that we choose N = 100, Q_t = 0.0016 I, Cov[η_t] = 0.0016 I, and μ = 0.7. Moreover, the pumadyn data set is obtained from the simulation of a Unimation Puma 560 robotic arm, and our goal is to predict the angular acceleration of the arm. We have x_t ∈ R^32 and we set the output dimension of the neural network as m = 32. In addition, we set the learning rate as μ = 0.4 and the number of particles as N = 170. For the other parameters, we use the same settings as in the elevators data set case. Finally, the bank data set is generated from a simulator that simulates the queues in banks, and our aim is to predict the fraction of the customers that leave the bank due to full queues. In this case, we have x_t ∈ R^32 and we set the output dimension of the neural network as m = 32. Moreover, we set the learning rate as μ = 0.07 and the number of particles as N = 150. For the other parameters, we use the same settings as in the elevators data set case.
TABLE II
Time Accumulated Errors and the Corresponding Training Times (in Seconds) of the LSTM-Based Algorithms for the Elevators, Pumadyn, and Bank Data Sets. Note That Here We Use a Computer With an i5-6400 Processor, 2.7-GHz CPU, and 16-GB RAM

As shown in Table II, the PF-based algorithm achieves a smaller time accumulated error value while consuming less training time compared with its competitors; therefore, it has superior performance compared with the other algorithms in these real life tasks.

B. Financial Data Sets

In this section, we evaluate the performances of the algorithms under two different financial scenarios. We first consider the Alcoa stock price data set [40], which contains the daily stock price values. Our goal is to predict the future prices by examining the past prices. As in the previous section, we first choose a fixed architecture. Since for all architectures we obtain the best performance from the same algorithm, as detailed later in this section, we can choose any architecture. Hence, we again choose Architecture 1. Moreover, we set the parameters such that all the introduced algorithms converge to the same steady-state error level. To provide this fair setup, we choose the parameters as follows. For the Alcoa stock price data set, we choose to examine the prices of the previous five days, so that we have the input x_t ∈ R^5 and we set the output dimension of the neural network as m = 5. For the PF-based algorithm, we set the number of particles as N = 2000. In addition, we choose η_t and ξ_t as zero-mean Gaussian random variables with Cov[η_t] = 0.0036 I and Var[ξ_t] = 0.01. For the EKF-based algorithm, we choose Σ_{0|0} = 0.0036 I, Q_t = 0.0036 I, and R_t = 0.01. For the SGD-based algorithm, we set the learning rate as μ = 0.1. With these fair settings, Fig. 5 illustrates that the PF-based algorithm converges much faster.

Fig. 5. Future price prediction performances of the algorithms for the Alcoa stock price data set.

Aside from the Alcoa stock price data set, we also consider the Hong Kong exchange rate data set [41], for which we have the amount of Hong Kong dollars that one is able to buy for US$1 on a daily basis. Our aim is to predict the future exchange rates by exploiting the data of the previous five days. We again choose Architecture 1, and then we select the parameters such that the convergence rates of the algorithms are the same. We use the same parameters as in the Alcoa stock price data set case except Q_t = 0.0004 I and Cov[η_t] = 0.0004 I. In this case, Fig. 6 shows that the PF-based algorithm converges to a much smaller steady-state error value.

Fig. 6. Exchange rate prediction performances of the algorithms for the Hong Kong exchange rate data set.

C. LSTM and GRU Networks

In this section, we consider the regression performances of the algorithms based on two different RNNs, i.e., the LSTM and GRU networks. In the previous sections, we used the LSTM architecture. Since our approach is generic, we also apply it to the recently introduced GRU architecture, which is described by the following set of equations [22]:

z̃_t = σ(W^{(z̃)} x_t + R^{(z̃)} y_{t-1})    (50)
r_t = σ(W^{(r)} x_t + R^{(r)} y_{t-1})    (51)
ỹ_t = g(W^{(y)} x_t + r_t ⊙ (R^{(y)} y_{t-1}))    (52)
y_t = ỹ_t ⊙ z̃_t + y_{t-1} ⊙ (1 − z̃_t)    (53)

where x_t ∈ R^p is the input vector, y_t ∈ R^m is the output vector, and ⊙ denotes the elementwise product. The functions g(·) and σ(·) are set to the hyperbolic tangent and sigmoid functions, respectively. For the coefficient matrices, we have W^{(z̃)} ∈ R^{m×p}, R^{(z̃)} ∈ R^{m×m}, W^{(r)} ∈ R^{m×p}, R^{(r)} ∈ R^{m×m}, W^{(y)} ∈ R^{m×p}, and R^{(y)} ∈ R^{m×m}. Here, z̃_t and r_t are the update and reset gates, respectively.
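The following is a minimal sketch of one GRU step as defined in (50)–(53); the final estimate is then formed from y_t exactly as in (9)–(12). The dictionary key names and any initialization are illustrative assumptions; as in (50)–(53), no bias terms are used.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(params, x_t, y_prev):
    """One forward step of the GRU in (50)-(53); params holds the W^(.) and R^(.)
    matrices with the shapes given above."""
    z = sigmoid(params["Wz"] @ x_t + params["Rz"] @ y_prev)               # (50) update gate
    r = sigmoid(params["Wr"] @ x_t + params["Rr"] @ y_prev)               # (51) reset gate
    y_tilde = np.tanh(params["Wy"] @ x_t + r * (params["Ry"] @ y_prev))   # (52) candidate output
    y = y_tilde * z + y_prev * (1.0 - z)                                  # (53) output
    return y
```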
Fig. 7. Comparison of the LSTM and GRU architectures in terms of regression error performance for (a) PF-based algorithm, (b) EKF-based algorithm,
and (c) SGD-based algorithm.

TABLE III accumulated error among our alternatives; hence, it signifi-


T IME A CCUMULATED E RRORS OF THE LSTM-BASED R EGRESSION cantly outperforms its competitors in these simulations.
A LGORITHMS D ESCRIBED IN (10)–(12) FOR E ACH A LGORITHM

V. C ONCLUSION
We studied the nonlinear regression problem in an online
setting and introduced novel LSTM-based online algorithms
for data regression. We then introduced low-complexity
and effective online training methods for these algorithms.
Hence, they have significant differences. To compare them, We achieved these by first proposing novel regression algo-
we use the Hong Kong exchange rate data set as in the rithms to compute the final estimate, where we introduced an
previous section. For a fair comparison, we again select a additional gate to the classical LSTM architecture. We then
fixed architecture. Here, since we compare the performances of put the LSTM system in a state space form, and then based
the networks rather than the algorithms, we arbitrarily choose on this form, we derived online updates based on the SGD,
one of the architectures. We select Architecture 1. Moreover, EKF, and PF algorithms [17], [19], [26] to train the LSTM
we choose the same parameters with the previous subsection architecture. By this way, we obtain an effective online training
so that convergence rates of the algorithms are the same. With method, which guarantees convergence to the optimal para-
this fair setup, Fig. 7(a)–(c) shows that the LSTM network- meter estimation provided that we have a sufficient number of
based approach achieves a smaller steady-state error; therefore, particles and satisfy certain technical conditions. We achieve
it is superior to the GRU architecture-based approach in the this performance with a computational complexity in the
sequential prediction task in our experiments. order of the first-order gradient-based methods [5], [16] by
controlling the number of particles. In Section IV, thanks to
D. Different Regression Architectures the generic structure of our approach, we also introduced a
In this section, we compare the performances of different GRU architecture-based approach by directly replacing the
LSTM-based regression architectures. For this purpose, we use LSTM equations with the GRU architecture and observed
the Hong Kong exchange rate data set as in the previous that our LSTM-based approach is superior to the GRU-based
section. For a fair comparison, we select the parameters such approach in the sequential prediction tasks studied in this
that the convergence rates of the algorithms are the same. paper. Furthermore, we demonstrate significant performance
We choose the same parameter with the previous subsection improvements achieved by the introduced algorithms with
except 0|0 = 0.01 I. Under this fair setup, Table III shows respect to the conventional methods [18], [23] over several
that for the PF- and EKF-based algorithms, Architecture 2 different data sets (used in this paper).
V. CONCLUSION

We studied the nonlinear regression problem in an online setting and introduced novel LSTM-based online algorithms for data regression. We then introduced low-complexity and effective online training methods for these algorithms. We achieved this by first proposing novel regression algorithms to compute the final estimate, where we introduced an additional gate to the classical LSTM architecture. We then put the LSTM system in a state space form and, based on this form, derived online updates based on the SGD, EKF, and PF algorithms [17], [19], [26] to train the LSTM architecture. In this way, we obtain an effective online training method, which guarantees convergence to the optimal parameter estimation provided that we have a sufficient number of particles and satisfy certain technical conditions. We achieve this performance with a computational complexity in the order of the first-order gradient-based methods [5], [16] by controlling the number of particles. In Section IV, thanks to the generic structure of our approach, we also introduced a GRU architecture-based approach by directly replacing the LSTM equations with the GRU equations and observed that our LSTM-based approach is superior to the GRU-based approach in the sequential prediction tasks studied in this paper. Furthermore, we demonstrated significant performance improvements achieved by the introduced algorithms with respect to the conventional methods [18], [23] over several different data sets used in this paper.
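As a rough illustration of the training procedure summarized above, the sketch below implements one generic sampling-importance-resampling particle-filter step for online parameter estimation of a one-step predictor. It assumes random-walk parameter dynamics, Gaussian observation noise, and a hypothetical predict(theta, x_t) function standing in for the LSTM-based regression structure; it is a simplified stand-in under these assumptions, not the exact state-space formulation derived in the paper.

```python
import numpy as np

def pf_online_update(particles, weights, x_t, d_t, predict,
                     q_std=1e-2, obs_std=1e-1, rng=np.random.default_rng(0)):
    """One particle-filtering step for online parameter estimation.

    particles : (N, P) array, each row a candidate parameter vector theta^(i)
    weights   : (N,) importance weights summing to one
    predict   : hypothetical function predict(theta, x_t) -> scalar estimate d_hat
    """
    N = particles.shape[0]
    # 1) Propagate particles with a random-walk proposal (artificial parameter dynamics).
    particles = particles + q_std * rng.standard_normal(particles.shape)
    # 2) Reweight by the likelihood of the new observation d_t under each particle.
    preds = np.array([predict(theta, x_t) for theta in particles])
    loglik = -0.5 * ((d_t - preds) / obs_std) ** 2
    weights = weights * np.exp(loglik - loglik.max())
    weights = weights / weights.sum()
    # 3) Resample (systematic resampling) when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < N / 2:
        positions = (rng.random() + np.arange(N)) / N
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), N - 1)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    # Point estimate of the parameters as the weighted particle mean.
    theta_hat = weights @ particles
    return particles, weights, theta_hat
```

The per-step cost grows linearly with the number of particles N, which is consistent with keeping the overall complexity in the order of first-order gradient-based methods by controlling N.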
REFERENCES

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge, U.K.: Cambridge Univ. Press, 2006.
[2] D. F. Specht, "A general regression neural network," IEEE Trans. Neural Netw., vol. 2, no. 6, pp. 568–576, Nov. 1991.
[3] A. C. Singer, G. W. Wornell, and A. V. Oppenheim, "Nonlinear autoregressive modeling and estimation in the presence of noise," Digit. Signal Process., vol. 4, no. 4, pp. 207–221, Oct. 1994.
[4] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space Odyssey," IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2016.2582924.


[5] A. C. Tsoi, "Gradient based learning methods," in Adaptive Processing of Sequences and Data Structures, C. L. Giles and M. Gori, Eds. Berlin, Germany: Springer, Sep. 1998, pp. 27–62. [Online]. Available: https://doi.org/10.1007/BFb0053994, doi: 10.1007/BFb0053994.
[6] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Ph.D. dissertation, Inst. Inform., Tech. Univ. Munich, München, Germany, 1991.
[7] N. D. Vanli, M. O. Sayin, I. Delibalta, and S. S. Kozat, "Sequential nonlinear learning for distributed multiagent systems via extreme learning machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 546–558, Mar. 2017.
[8] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608014002135
[9] U. Shaham, A. Cloninger, and R. R. Coifman, "Provable approximation properties for deep neural networks," Appl. Comput. Harmon. Anal., 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520316300033, doi: 10.1016/j.acha.2016.04.003.
[10] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 190–198.
[11] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[13] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, Oct. 2000. [Online]. Available: http://dx.doi.org/10.1162/089976600300015015
[14] J. Fan and Q. Yao, ARMA Modeling and Forecasting. New York, NY, USA: Springer, 2003, pp. 89–123. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-69395-8_3
[15] J. Mazumdar and R. G. Harley, "Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents," IEEE Trans. Ind. Electron., vol. 55, no. 9, pp. 3484–3491, Sep. 2008.
[16] H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the Echo State Network Approach. Sankt Augustin, Germany: GMD-Forschungszentrum Informationstechnik, 2002.
[17] A. H. Sayed, Fundamentals of Adaptive Filtering. Hoboken, NJ, USA: Wiley, 2003.
[18] J. A. Pérez-Ortiz, F. A. Gers, D. Eck, and J. Schmidhuber, "Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets," Neural Netw., vol. 16, no. 2, pp. 241–250, Mar. 2003.
[19] B. D. O. Anderson and J. B. Moore, Optimal Filtering. North Chelmsford, MA, USA: Courier Corporation, 2012.
[20] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), Cambridge, MA, USA, 2014, pp. 2933–2941. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969154
[21] P. M. Djuric et al., "Particle filtering," IEEE Signal Process. Mag., vol. 20, no. 5, pp. 19–38, Sep. 2003.
[22] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. (2014). "Empirical evaluation of gated recurrent neural networks on sequence modeling." [Online]. Available: https://arxiv.org/abs/1412.3555
[23] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, no. 2, pp. 270–280, 1989.
[24] B. C. Csáji, "Approximation with artificial neural networks," Faculty Sci., Eötvös Loránd Univ., Budapest, Hungary, Tech. Rep., 2001, vol. 24, p. 48.
[25] J. Martens and I. Sutskever, "Learning recurrent neural networks with Hessian-free optimization," in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 1033–1040.
[26] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[27] Z. Li, Y. Li, F. Yu, and D. Ge, "Adaptively weighted support vector regression for financial time series prediction," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2014, pp. 3062–3065.
[28] I. Patras and E. Hancock, "Regression-based template tracking in presence of occlusions," in Proc. 8th Int. Workshop Image Anal. Multimedia Interact. Services (WIAMIS), Jun. 2007, p. 15.
[29] D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its Applications. New York, NY, USA: Wiley, 1988.
[30] F. A. Gers, J. A. Pérez-Ortiz, D. Eck, and J. Schmidhuber, "DEKF-LSTM," in Proc. ESANN, 2002, pp. 369–376.
[31] Y. C. Ho and R. Lee, "A Bayesian approach to problems in stochastic estimation and control," IEEE Trans. Autom. Control, vol. 9, no. 4, pp. 333–339, Oct. 1964.
[32] M. Enescu, M. Sirbu, and V. Koivunen, "Recursive estimation of noise statistics in Kalman filter based MIMO equalization," in Proc. 27th General Assembly Int. Union Radio Sci. (URSI), Maastricht, The Netherlands, 2002, pp. 17–24.
[33] A. Kong, J. S. Liu, and W. H. Wong, "Sequential imputations and Bayesian missing data problems," J. Amer. Statist. Assoc., vol. 89, no. 425, pp. 278–288, 1994.
[34] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statist. Comput., vol. 10, no. 3, pp. 197–208, Jul. 2000.
[35] N. Bergman, "Recursive Bayesian estimation," Doctoral dissertation, Dept. Elect. Eng., Linköping Univ., Linköping, Sweden, 1999, vol. 579.
[36] X.-L. Hu, T. B. Schon, and L. Ljung, "A basic convergence result for particle filtering," IEEE Trans. Signal Process., vol. 56, no. 4, pp. 1337–1348, Apr. 2008.
[37] C. E. Rasmussen et al., Delve Data Sets. Accessed: Oct. 1, 2016. [Online]. Available: http://www.cs.toronto.edu/~delve/data/datasets.html
[38] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," J. Multiple-Valued Logic Soft Comput., vol. 17, nos. 2–3, pp. 255–287, 2011.
[39] L. Torgo. Regression Data Sets. Accessed: Oct. 1, 2016. [Online]. Available: http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
[40] Alcoa Inc. Common Stock. Accessed: Oct. 1, 2016. [Online]. Available: http://finance.yahoo.com/quote/AA?ltr=1
[41] E. W. Frees. Regression Modelling With Actuarial and Financial Applications. Accessed: Oct. 1, 2016. [Online]. Available: http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/data.html

Tolga Ergen received the B.S. degree in electrical and electronics engineering from Bilkent University, Ankara, Turkey, in 2016. He is currently pursuing the M.S. degree with the Department of Electrical and Electronics Engineering, Bilkent University.
His current research interests include online learning, adaptive filtering, machine learning, optimization, and statistical signal processing.

Suleyman Serdar Kozat (A'10–M'11–SM'11) received the B.S. (Hons.) degree from Bilkent University, Ankara, Turkey, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Urbana, IL, USA.
He joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member and later became a Project Leader with the Pervasive Speech Technologies Group, where he focused on problems related to statistical signal processing and machine learning. He was a Research Associate with the Cryptography and Anti-Piracy Group, Microsoft Research, Redmond, WA, USA. He is currently an Associate Professor with the Electrical and Electronics Engineering Department, Bilkent University. He has co-authored over 100 papers in refereed high impact journals and conference proceedings. He holds several patent inventions (used in several different Microsoft and IBM products) due to his research accomplishments with the IBM Thomas J. Watson Research Center and Microsoft Research. His current research interests include cyber security, anomaly detection, big data, data intelligence, adaptive filtering, and machine learning algorithms for signal processing.
Dr. Kozat received many international and national awards. He is the Elected President of the IEEE Signal Processing Society, Turkey Chapter.
