
Training Recurrent Neural Networks via Forward Propagation Through Time

Anil Kag¹  Venkatesh Saligrama¹

¹Department of Electrical and Computer Engineering, Boston University, USA. Correspondence to: Anil Kag <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Back-propagation through time (BPTT) has been widely used for training Recurrent Neural Networks (RNNs). BPTT updates RNN parameters on an instance by back-propagating the error in time over the entire sequence length and, as a result, leads to poor trainability due to the well-known gradient explosion/decay phenomena. While a number of prior works have proposed to mitigate the vanishing/explosion effect through careful RNN architecture design, these RNN variants still train with BPTT. We propose a novel forward-propagation algorithm, FPTT, where at each time step, for an instance, we update RNN parameters by optimizing an instantaneous risk function. Our proposed risk is a regularization penalty at time t that evolves dynamically based on previously observed losses, and allows RNN parameter updates to converge to a stationary solution of the empirical RNN objective. We consider both sequence-to-sequence as well as terminal-loss problems. Empirically, FPTT outperforms BPTT on a number of well-known benchmark tasks, thus enabling architectures like LSTMs to solve long-range dependency problems.

1. Introduction

Recurrent Neural Networks (RNNs) have been successfully employed in many sequential learning tasks including language modelling, speech recognition, and terminal prediction. An RNN is described by its parameters W ∈ 𝒲 and the transition function f : 𝒲 × 𝒳 × ℋ → ℋ, which takes the RNN parameters W, the current input x_t ∈ 𝒳, and the previous hidden state h_{t−1} ∈ ℋ ⊆ R^D, and outputs the next state h_t:

$$h_t = f(W, x_t, h_{t-1}) \qquad (1)$$

Given the training dataset {x^i, y^i}_{i=1}^N with N examples of T-length sequences x^i, y^i, we optimize the following empirical risk function:

$$[W^*, v^*] = \arg\min_{W,v} L(W, v) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \ell(y_t^i, \hat{y}_t^i) \qquad (2)$$
$$\forall i, t: \quad \hat{y}_t^i = v^\top h_t^i; \quad h_t^i = f(W, x_t^i, h_{t-1}^i); \quad h_0^i = 0$$

where the RNN parameters W and the classifier v ∈ R^D are optimized. This objective is generally optimized by gradient descent. The gradient expression for each example i ∈ [N] is a sum of products of partial gradients, and is commonly referred to as the error back-propagated through time (BPTT). This is a direct result of the fact that the hidden state h_t^i, at each time, is recursively updated by the same parameter W. As a consequence, via the chain rule, we get:

$$\frac{\partial L}{\partial W} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\partial \ell(y_t^i, \hat{y}_t^i)}{\partial W} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\partial \ell(y_t^i, \hat{y}_t^i)}{\partial \hat{y}_t^i} \frac{\partial \hat{y}_t^i}{\partial h_t^i} \sum_{j=1}^{t} \Big( \prod_{s=j}^{t} \frac{\partial h_s^i}{\partial h_{s-1}^i} \Big) \frac{\partial h_{j-1}^i}{\partial W} \qquad (3)$$

We highlight two fundamental aspects of BPTT:

• Trainability: Unless the partial terms ∂h_s/∂h_{s−1} stay close to identity, the product of these terms can explode or vanish. As such, if gradients vanish, earlier times t ≪ T have little contribution to the overall error. If gradients explode, we see the opposite effect, namely, later states may have little contribution.

• Complexity: Gradient computation is expensive for large T, since it involves a sum-product of T terms, resulting in Ω(T²) scaling for each example; for N examples, the computational cost for processing the dataset once scales as Ω(NT²). Note that this calculation assumes naive computation of the sum-product. In practice, with additional memory overhead, an efficient gradient-propagation scheme results in cost linear in the length of the sequence. BPTT has an Ω(T) memory overhead associated with storing all the intermediate hidden states in the time horizon.

In this work, our focus is on simplifying the RNN training procedure. Once the RNN parameters are learnt, the inference process remains the same as before. Our goal is to reduce the computation of BPTT so that each step only involves taking a derivative for a single time step.
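To make the trainability issue above concrete, the short numpy sketch below (illustrative only; the scalar gains stand in for the norms of the Jacobians ∂h_s/∂h_{s−1} in Eq. 3) shows how the product over T steps collapses or blows up unless each factor stays near 1:

```python
import numpy as np

T = 500
for gain in (0.95, 1.00, 1.05):
    # Scalar stand-in for the per-step Jacobian norm |dh_s/dh_{s-1}|.
    # The contribution of an early time step to the gradient at time T
    # scales with the product of the gains along the way (Eq. 3).
    product = np.prod(np.full(T, gain))
    print(f"gain={gain:.2f}: product over {T} steps = {product:.3e}")

# gain=0.95 -> ~7e-12  (vanishing: early steps barely contribute)
# gain=1.00 -> 1.0     (well-conditioned)
# gain=1.05 -> ~4e+10  (exploding: late states dominate/destabilize)
```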
Challenges. Minimizing Eq. 2 poses two challenges:
(a) Dynamics. Equation 1 enforces a temporal constraint on allowable transitions.
(b) Time-Invariance. The transition matrices W are fixed, and as a result the dynamics of RNNs are time-invariant.

Let us examine a few potential directions in this context.

Method of Multipliers allows for eliminating the constraints imposed by (a) and (b) by introducing a regularizer based on an augmented Lagrangian. This approach is often adopted in distributed optimization (Boyd et al., 2011). Eliminating (a) could be accomplished with ADMM methods with a squared-norm penalty, or other specialized functions (Gu et al., 2020). For (b), leveraging the key insight in distributed optimization, we can write condition (b) as W_t = W_{t−1} for all t, and rewrite it as a penalty. Nevertheless, while block-coordinate descent allows for computational efficiency, memory expands substantially (O(TD² + NTD)).

Online Gradient Method (OGD). We can view ℓ_t(W) = ℓ(y_t, v^⊤ f(W, x_t, h_{t−1})) as the instantaneous loss incurred in round t by "playing" the parameter W. We can update W_{t+1} = W_t − η∇ℓ_t(W_t) based on the observed loss. While this could work, there is no reason why the W_t's should converge, and furthermore, it is unclear how to choose a constant parameter W based on the sequence of updates. In general, we have observed in experiments that training performance based on time-varying transition matrices does not reflect test-time behavior and does not generalize well.

Follow-the-Regularized-Leader Rule (McMahan et al., 2013). Rather than optimizing the instantaneous loss as in OGD, we utilize all of the previously seen losses, namely L_t(W) = Σ_{j=1}^{t} ℓ_j(W), and attempt to find an update direction. Nevertheless, this approach suffers from the same issue as BPTT, since for large t ≈ T, finding a descent direction involves back-propagation through ≈ T steps. Furthermore, this method adds a multiplicative factor of T to the run-time in comparison to BPTT.

Our Forward-Propagation Method. We propose a novel forward-propagation-through-time (FPTT) method based on instantaneous dynamic regularization. FPTT at each time takes a gradient step to minimize an instantaneous risk function. The instantaneous risk is the loss at time t plus a dynamically evolving regularizer. This dynamics is controlled by a state vector, which summarizes past losses. FPTT has the built-in property that the point of convergence of the W_t sequence is also a stationary point of the global empirical risk in Equation 2. The resulting method has a light-weight footprint and is computationally efficient. For sequence-to-sequence modelling tasks our learning scheme integrates easily since the losses are instantaneous, i.e., at timestep t, we immediately get feedback for the updated W_t. For terminal prediction problems we present a simple scheme to construct surrogate losses at any timestep using the label for the entire sequence.

We then conduct a number of experiments on benchmark datasets and show that our proposed method is particularly effective on tasks that exhibit long-range dependencies. In summary, our proposed method suggests that vanilla LSTMs are effective tools for inferring long-term dependencies, and exhibit performance matching state-of-the-art competitors, even those with higher capacities and well-designed architectures.

Toy Example. As a sneak preview, we demonstrate the effectiveness of FPTT on the Add Task (see Sec. 4 for details) against BPTT on training LSTMs under an identical test/train split. Figure 1 shows that FPTT solves this problem while BPTT fails to find the correct parameters; it stays near the same loss value throughout the training phase. BPTT's poor behavior on this task has been observed in previous works (Kag et al., 2020; Zhang et al., 2018).

Figure 1. Add Task (T = 500): Comparison between standard learning and forward propagation. [Plot: mean squared error vs. training steps for FPTT-LSTM and LSTM.]

Contributions.
• We propose forward-propagation-through-time (FPTT) as an alternative to conventional BPTT.
• FPTT takes a gradient step of an instantaneous time-dependent risk function at each time. The risk function is the regularized loss, with a dynamically evolving regularization. The dynamic penalty requires minimal memory, thus allowing for rapid gradient computation.
• We construct surrogate losses for terminal prediction tasks, to guide FPTT to learn at intermediate times.
• We perform empirical evaluations to demonstrate the utility and superiority of FPTT over BPTT.
• Our FPTT algorithm can be readily deployed in any deep learning library. We have released our implementation at https://github.com/anilkagak2/FPTT
2. Related Work

There is a vast literature on RNNs that spans novel architectural designs and algorithmic/methodological improvements. Here, we list only the closely related works.

Learning Algorithms. While a number of RNN training methods have been proposed, BPTT remains the single most dominant method (Rumelhart et al., 1986; Werbos, 1990). BPTT unrolls the recurrent logic for the entire time horizon and computes the gradient through this horizon. It has been observed to be computationally expensive (storing the hidden states and computing the gradient for the entire time horizon) and, unless the RNN architecture is carefully designed, it leads to vanishing/exploding gradients (Hochreiter, 1991; Bengio et al., 2013). Truncated BPTT (Williams & Peng, 1990) is the variant of BPTT where gradient flow is truncated after a fixed number of timesteps. This fails to learn dependencies present beyond the fixed window.

Real time recurrent learning (RTRL) (Williams & Zipser, 1989), a BPTT alternative, proposes to propagate the partials ∂h_t/∂h_{t−1} and ∂h_t/∂W from timestep t to t+1, noting that there is significant overlap in the product term (see Eq. 3) from time t to t+1, which allows for recursive computation. Early attempts suffered large memory overheads limiting their usage, and while recent attempts (Mujika et al., 2018; Menick et al., 2021; Tallec & Ollivier, 2018; Ollivier & Charpiat, 2015) have been more successful, these methods still fall short of BPTT performance, and so trainability of RNNs remains a significant issue.

In contrast to these gradient-based methods, FPTT is based on directly updating RNN parameters, and the updates optimize an instantaneous loss. Thus, RNN parameters are allowed to vary after each time step, and over time our updates converge to a stationary solution. As a result, our method has a low memory footprint and a small computational cost per iteration, and, most importantly, exhibits empirical performance dominating BPTT training on many long-term dependency datasets.

RNN Architectures. The choice of architecture has a significant impact on trainability with BPTT. To this end, gated variants have been introduced: Long Short Term Memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Cho et al., 2014) have considerably improved performance, but have been shown to fare poorly on tasks involving long range dependencies due to vanishing gradients (Zhang et al., 2018; Chang et al., 2019; Kusupati et al., 2018).

Unitary RNNs (Arjovsky et al., 2016; Jing et al., 2017; Zhang et al., 2018; Mhammedi et al., 2017; Kerg et al., 2019; Lezcano-Casado & Martínez-Rubio, 2019) focus on designing well-conditioned state-transition matrices, attempting to enforce the unitary property during training. This helps mitigate the gradient issues, but these models are less popular due to the overhead involved in imposing unitary/orthogonal constraints. There are other designs, such as those based on ODEs (Chang et al., 2019; Kag et al., 2020; Kag & Saligrama, 2021; Kusupati et al., 2018; Erichson et al., 2021). These RNNs enforce the partials between the hidden states to be near identity, thus mitigating the gradient issues faced by architectures like LSTMs/GRUs.

Our view is that architecture and training methods are complementary. Our ablative results suggest that FPTT improves upon BPTT on different architectures.

Miscellaneous. Miller & Hardt (2019) analyze LSTMs from the stability perspective and show that in an unconstrained form LSTMs are unstable. They show that under some conditions on the non-linearity in the transition, stable recurrent neural networks can be represented by a feed-forward network. While stability reasons about exploding gradients, it does not eliminate the vanishing-gradient issue during training. Linsley et al. (2020) address the large memory cost of BPTT, which scales linearly with the number of time steps. They modify the training process by replacing the BPTT algorithm with the Recurrent BackProp (RBP) algorithm, which optimizes the parameters to achieve steady-state dynamics. We point out that their scheme is orthogonal to FPTT and we can leverage RBP to provide similar benefits.

3. Method

First, we describe our proposed algorithm and our learning objective. We describe a general pseudo code that can be leveraged to train any RNN architecture. Finally, we describe a method for dealing with terminal prediction tasks.

Notation. The training set B = {x^i, y^i}_{i=1}^N consists of N examples. Each input x^i ∈ 𝒳 is a T-length sequence which can be written as {x_1^i, x_2^i, ⋯, x_T^i}. For sequence-to-sequence modelling tasks, the label y^i ∈ 𝒴 is a T-length sequence written as {y_1^i, y_2^i, ⋯, y_T^i}. In the terminal prediction case, the label y^i ∈ 𝒴 is provided at a single timestep T as the feedback for the entire input sequence x^i. We drop the superscript i denoting a data point wherever it can be inferred from the context. With initial hidden state h_0 = 0 ∈ ℋ, an RNN with parameters W ∈ 𝒲 and transition function f : 𝒲 × 𝒳 × ℋ → ℋ generates the hidden-state sequence {h_1, h_2, ⋯, h_T} for the data point {x, y}. We use ℓ_t(W) = ℓ(y_t, v^⊤ f(W, x_t, h_{t−1})) to denote the loss incurred at time step t, where v ∈ 𝒱 is a linear classifier.

To simplify the equations, (a) we use only one example and drop the sum over i and the superscript i used in Eq. 2, since the motivation behind the proposal remains the same, and (b) we assume v is constant, while in practice, to learn v, the same operations applied on W are applied on v as well.
Algorithm 1 Training RNN with BackProp
  Input: Training data B = {x^i, y^i}_{i=1}^N, Timesteps T
  Input: Learning rate η, #Epochs E
  Initialize: W_1 randomly in the domain 𝒲
  for e = 1 to E do
    Randomly shuffle B
    for i = 1 to N do
      Set: (x, y) = (x^i, y^i) and h_0 = 0
      for t = 1 to T do
        Update: h_t = f(W, x_t, h_{t−1})
      end for
      Loss: ℓ(W) = Σ_{t=1}^{T} ℓ(y_t, v^⊤ h_t)
      Set: W_{i+1} = W_i − η ∇_W ℓ(W)|_{W=W_i}
    end for
    Reset: W_1 = W_{N+1}
  end for
  Return: W_{N+1}

3.1. FPTT: Forward Propagation Through Time

Given one example (x, y) = ({x_t}_{t=1}^T, {y_t}_{t=1}^T) and an initial parameter estimate W_0, BPTT (see Algorithm 1) updates the parameter once by taking the gradient of the T-length loss Σ_{t=1}^{T} ℓ_t(W). In contrast, we update parameters at every time step t by utilizing (x_t, y_t) to avoid getting penalized by the T-length gradient dependence. Since the parameters update very frequently, we need to incorporate two mechanisms into the parameter updates: (a) stability in updates, so that a single step does not stray; and (b) since our training does not follow the standard RNN transition (i.e., keep a single parameter W through the input sequence), our updates should ensure that the iterates converge to a single parameter. This in turn guarantees that towards the end of the training sequence we will mimic an RNN.

To build motivation for our method (Algorithm 2), we refer to the sequence of updates on a single instance i ∈ [N] at time step t. The first update is a gradient step of the loss ℓ_t(W) for a fixed value of W̄_t. As such, we track one additional copy of the parameter W_t, namely W̄_t ∈ 𝒲. At time step t, we update parameters using the supervision (x_t, y_t) and the previous iterates W_t, W̄_t. Following T updates, on a new instance, we set the initial weight parameter W_0 ← W_{T+1}, and the iteration continues.

To understand our scheme, let us consider the situation where the number of gradient steps on the loss approaches infinity. In this case, our equations read as:

$$W_{t+1} = \arg\min_W \; \ell_t(W) + \frac{\alpha}{2} \Big\| W - \bar{W}_t - \frac{1}{2\alpha} \nabla \ell_{t-1}(W_t) \Big\|^2 \qquad (4)$$

$$\bar{W}_{t+1} = \frac{1}{2}(\bar{W}_t + W_{t+1}) - \frac{1}{2\alpha} \nabla \ell_t(W_{t+1}) \qquad (5)$$

These update equations are loosely inspired by consensus in distributed optimization problems over a star-network.¹

¹Distributed agents connected over a star-network seek to solve a joint optimization problem, which requires seeking consensus on the decision variables (Boyd et al., 2011). A master agent coordinates with the agents to communicate and synchronize decision variables in an iterative fashion. Eq. 2 could be viewed in a number of ways: as a single-agent network, a T-node network, or an N-node network, etc. Each of these in turn leads to different coordinating mechanisms.

Intuition. The basic concept here is that W̄_t represents, in principle, the running average of all the W_t's seen so far, with a small correction term (Eq. 5). Therefore, the updates impose proximity to the running average in the update step. However, this alone is not sufficient to converge to stationary points of Eq. 3; we will show this later. As such, W̄_t is a vector that summarizes past losses. Eq. 5 is also the first-order condition for ℓ_t(W_{t+1}) + (α/2)‖W_{t+1} − W̄_t − (1/α)∇ℓ_{t−1}(W_t)‖². Taken together, the scheme resembles an alternating optimization method for a joint risk function over W, W̄: namely, we hold W̄_t fixed and optimize W, and after the update, optimize W̄ with W fixed. However, notice that unlike the conventional setting, here the risk functions are time-varying. Note that Eq. 5 requires the gradient of the loss ℓ_t at the new iterate W_{t+1}. The computational cost for this step can be eliminated by keeping a running estimate λ_t with update equation λ_{t+1} = λ_t − α(W_{t+1} − W̄_t) and initial value λ_0 = 0.

Observe that for large α, we expect W_{t+1} to be close to the previous W_t, and this would result in the hidden-state sequence {h_t}_{t=1}^T being essentially very close to the one generated by a single W ≈ W_{t+1} ≈ W_t. In effect this would simulate hidden-state trajectories with a static, time-invariant RNN parameter.

Pseudo Code. Algorithms 1 and 2 enumerate the learning schemes for BPTT and FPTT respectively. These procedures can be utilized to train an RNN architecture in any popular deep learning framework with minimal effort. Note that for simplicity we write the algorithms with batch size 1; this is relaxed to the conventional choice of a larger batch size in our experiments. In FPTT, starting with small values of α, we gradually increase α to enforce the constraint. We explore the impact of this hyper-parameter in the ablative experiments (see Sec. 4.2).

Remarks. (a) Note that even though we have a separate W_t for each timestep, we do not suffer an additional storage overhead of factor T. This follows from the fact that we solve these sub-problems forward in time and only solve for W_{t+1} at timestep t. (b) We show that the iterates converge in our ablative experiments (see supplementary); below we provide an explanation for the convergence.
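Before turning to that explanation, a minimal PyTorch sketch of the per-step FPTT update is given below: one gradient step on the objective in Eq. 4 followed by the W̄ update of Eq. 5, with the running estimate λ replacing the extra gradient evaluation as described above. This is our reading of Algorithm 2 (below), not the released implementation; `model` is assumed to follow the `(input, state) -> (output, state)` convention:

```python
import torch

def fptt_step(model, loss_fn, x_t, y_t, state, W_bar, lam, alpha, opt):
    """One FPTT update. W_bar and lam are dicts of tensors keyed like
    model.named_parameters(); initialize W_bar to the initial weights
    and lam to zeros before each sequence."""
    out, state = model(x_t, state)    # h_t = f(W, x_t, h_{t-1})
    loss = loss_fn(out, y_t)          # instantaneous loss l_t(W)
    for name, p in model.named_parameters():
        # Dynamic regularizer of Eq. 4, with lam standing in for
        # grad l_{t-1}(W_t):  (alpha/2) * ||W - W_bar - lam/(2 alpha)||^2
        r = p - W_bar[name] - lam[name] / (2.0 * alpha)
        loss = loss + (alpha / 2.0) * (r ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()                        # single gradient step gives W_{t+1}
    with torch.no_grad():
        for name, p in model.named_parameters():
            # lam_{t+1} = lam_t - alpha * (W_{t+1} - W_bar_t)
            lam[name] -= alpha * (p - W_bar[name])
            # Eq. 5: W_bar_{t+1} = (W_bar_t + W_{t+1})/2 - lam_{t+1}/(2 alpha)
            W_bar[name] = 0.5 * (W_bar[name] + p) - lam[name] / (2.0 * alpha)
    # Detach the state so no gradient ever flows across time steps.
    return tuple(s.detach() for s in state)
```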
Algorithm 2 Training RNN with FPTT
  Input: Training data B = {x^i, y^i}_{i=1}^N, Timesteps T
  Input: Learning rate η, Hyper-parameter α, #Epochs E
  Initialize: W_1 randomly in the domain 𝒲
  Initialize: W̄_1 = W_1
  for e = 1 to E do
    Randomly shuffle B
    for i = 1 to N do
      Set: (x, y) = (x^i, y^i) and h_0 = 0
      for t = 1 to T do
        Update: h_t = f(W, x_t, h_{t−1})
        ℓ_t(W) = ℓ(y_t, v^⊤ h_t)
        ℓ(W) = ℓ_t(W) + (α/2) ‖W − W̄_t − (1/2α) ∇ℓ_{t−1}(W_t)‖²
        W_{t+1} = W_t − η ∇_W ℓ(W)|_{W=W_t}
        W̄_{t+1} = ½(W̄_t + W_{t+1}) − (1/2α) ∇ℓ_t(W_{t+1})
      end for
      Reset: W_1 = W_T and W̄_1 = W̄_T
    end for
  end for
  Return: W_T

Convergence. Let us focus on the arg min step in Eq. 4. Taking the gradient w.r.t. W results in the following dynamics:

$$\nabla \ell_t(W_{t+1}) - \nabla \ell_{t-1}(W_t) + \alpha (W_{t+1} - \bar{W}_t) = 0$$
$$\bar{W}_{t+1} = \frac{1}{2}(\bar{W}_t + W_{t+1}) - \frac{1}{2\alpha} \nabla \ell_t(W_{t+1}) \qquad (6)$$

Let us see why these equations allow for reaching a stationary point of Eq. 2. For now, suppose the sequence W_t converges to a limit point W_∞. One way to ensure this happens is to view Eq. 6 as a map from [W_t, W̄_t]^⊤ → [W_{t+1}, W̄_{t+1}]^⊤, show that this map is contractive, and as such invoke the Banach fixed-point theorem. Nevertheless, this is difficult to show, and we assume it is true for now.

Proposition 1. In Algorithm 2, suppose the sequence W_t is bounded and converges to a limit point W_∞. Further assume the loss function ℓ_t is smooth and Lipschitz. Let the cumulative loss after T iterations² be F(W) = (1/T) Σ_{t=1}^{T} ℓ_t(W). It follows that W_∞ is a stationary point of Eq. 3, i.e., lim_{T→∞} (∂F/∂W)(W_∞) = 0.

We sketch the proof below (see Sec. 6.7 for the detailed proof). Rewriting the first equation in Eq. 6 as W_{t+1} = W̄_t + (1/α)(∇ℓ_{t−1}(W_t) − ∇ℓ_t(W_{t+1})), we note that if W_{t+1} → W_∞, then, invoking a Cesàro-mean³ argument, the corresponding averages do as well: (1/T) Σ_{t=1}^{T} W_{t+1} → W_∞ as T → ∞. In turn, we note that the second term in the above expression telescopes, and consequently it follows that (1/T) Σ_{t=1}^{T} W_{t+1} = (1/T) Σ_{t=1}^{T} W̄_t − (1/(αT)) ∇ℓ_T(W_{T+1}). Now, under the smoothness and Lipschitz conditions, we can assume that (1/(αT)) ∇ℓ_T(W_{T+1}) → 0 in all of its components. As a result, we have (1/T) Σ_{t=1}^{T} W̄_t → W_∞ as well. Plugging these facts into the second equation, we get (1/(2αT)) Σ_{t=1}^{T} ∇ℓ_t(W_{t+1}) ≈ 0. Now we also know that W_{t+1} ≈ W_∞ for sufficiently large T, and using standard arguments it follows that (1/(2αT)) Σ_{t=1}^{T} ∇ℓ_t(W_∞) also approaches zero. This is the proposed stationarity condition, and our claim follows.

²For simplicity of exposition, we concatenate all the losses ℓ_t into a single online stream and get rid of the index N, which gets repeated to provide T iterations of the gradient updates.
³https://www.ee.columbia.edu/~vittorio/CesaroMeans.pdf

Computational Complexity. The BPTT gradient cost scales as Ω(T), as seen from Eq. 3. Although FPTT over T time steps also leads to Ω(T) gradient computations, it is worth noting that the constants involved in taking a gradient over the full length T are higher than those for computing single-step gradients. However, FPTT has more arithmetic operations per gradient step, and as such a tradeoff exists. BPTT has a higher memory overhead since it stores intermediate hidden states for the full time horizon T. In contrast, since FPTT only optimizes instantaneous loss functions, it does not require storing hidden states for the full time horizon. We list the computational complexities of the different algorithms in Table 1.

Table 1. Per-instance computational cost for gradient updates, parameter updates, and memory storage overhead. A parameter update involves several arithmetic operations (see Algo. 2), exceeding the cost of a gradient update by a constant factor. Note that the constant associated with gradient computation is a monotonically increasing function c(·) of the sequence length, i.e., c(1) < c(K) < c(T).

Algorithm | Gradient Updates | Parameter Updates | Memory Storage
BPTT      | Ω(c(T) T)        | Ω(1)              | Ω(T)
FTRL      | Ω(c(T) T²)       | Ω(T)              | Ω(T)
FPTT      | Ω(c(1) T)        | Ω(T)              | Ω(1)
FPTT-K    | Ω(c(K) T)        | Ω(K)              | Ω(T/K)

FPTT-K. Instead of updating parameters at every timestep, we could perform the updates in Eq. 4 only K times over the sequence of length T. To do this, for each example, we consider a window of size ω = ⌊T/K⌋, and define a windowed loss ℓ̄_{t,ω}(W) = (1/ω) Σ_{τ=t−ω}^{t} ℓ_τ(W). We then loop this over K steps instead of T. Setting K = 1 is the same as learning the RNN through BPTT, while K = T results in FPTT. We provide ablative experiments to study the effect of this parameter on learning efficiency in Section 4.

Intermediate Losses for Terminal Prediction. In our exposition so far, we assumed that we have access to an instantaneous loss ℓ_t at timestep t. While this is true for sequence-to-sequence modelling tasks, for terminal prediction tasks we only get one label y for the entire input sequence x. Let P̂ = softmax(v^⊤ h_t) be our current estimate of the label distribution and Q be our estimate from the last training epoch. We use cross-entropy for the classification loss. We construct intermediate losses ℓ_t for any timestep t as a convex combination of two terms: (a) cross-entropy using the current label distribution P̂, and (b) a divergence-like term that enforces P̂ and Q to stay close. This results in the following loss:

$$\ell_t = \beta \, \ell_t^{CE} + (1 - \beta) \, \ell_t^{Div}$$
$$\ell_t^{CE} = -\sum_{\bar{y} \in \mathcal{Y}} \mathbb{1}_{\bar{y} = y} \log \hat{P}(\bar{y}); \qquad \ell_t^{Div} = -\sum_{\bar{y} \in \mathcal{Y}} Q(\bar{y}) \log \hat{P}(\bar{y})$$

where β ∈ [0, 1]. Our intuition is that for timesteps near T, the classification loss should be weighted more and the divergence term less. In the beginning, we should have much less confidence in the classification loss and more in the divergence term. This leads to the natural choice β = t/T, which achieves the desired effect.
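A small sketch of this surrogate loss (our rendering; `logits` are v^⊤ h_t, `q` is the class distribution saved from the previous epoch, and β = t/T):

```python
import torch
import torch.nn.functional as F

def terminal_surrogate_loss(logits, y, q, t, T):
    """l_t = beta * CE + (1 - beta) * divergence-like term, with beta = t/T."""
    beta = t / T
    log_p = F.log_softmax(logits, dim=-1)     # log P_hat
    ce = F.nll_loss(log_p, y)                 # -log P_hat(y), cross-entropy
    div = -(q * log_p).sum(dim=-1).mean()     # -sum_ybar Q(ybar) log P_hat(ybar)
    return beta * ce + (1.0 - beta) * div
```

Early in the sequence the divergence term dominates, keeping P̂ close to the previous epoch's estimate; near t = T the true-label cross-entropy takes over.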
Extensions to Stacked/Hierarchical RNNs. There can be various extensions of our scheme to multi-layered RNNs. One simple scheme is to treat the transition in the stacked RNN as a multi-layered function which transforms hidden states from one time step to the next. Our language modelling experiments (Sec. 4.3) on PTB word- and character-level tasks use this extension for the 3-layer stacked LSTM models. Note that ideally such an extension should work for hierarchical RNNs as well, since state transitions can be handled at the most frequent update equation. We leave this as a potential future direction.

4. Experiments

In this section we empirically demonstrate that the proposed algorithm outperforms BPTT. First, we provide ablative experiments to justify our default choice of hyper-parameters and the chosen architecture. Next, we run FPTT on sequence-to-sequence modelling tasks. Finally, we benchmark FPTT on terminal prediction tasks, which provide the true label only for the full input sequence.

4.1. Experimental Setup

We implement FPTT in PyTorch using the pseudo code given in Algorithm 2. We perform our experiments on a single GTX 1080 Ti GPU. The benchmark datasets used in this study are publicly available along with a train and test split. For hyper-parameter tuning, we set aside a validation set on tasks where one is not available. Wherever applicable we use grid search for tuning hyper-parameters (details in the supplementary). In our experiments, we use the LSTM (Hochreiter & Schmidhuber, 1997) as the default RNN architecture for evaluation purposes. LSTMs have been shown to suffer from vanishing/exploding gradients on many tasks (Zhang et al., 2018; Chang et al., 2019; Kag et al., 2020). Our reasoning follows from the fact that they are widely available with efficient CUDA implementations in most popular deep learning libraries. This also reduces our experimentation cost. Although we use LSTMs to show that FPTT works on many benchmark datasets, we provide an ablative study to show that FPTT works on many RNN architectures (see Sec. 4.2).

Since many RTRL algorithms do not scale to the large-scale benchmark tasks, we do not show their performance in our results. TBPTT has been shown to perform poorly in comparison to BPTT (Trinh et al., 2018). Hence, we only consider BPTT as the baseline and add the prefix FPTT whenever RNNs are trained with the proposed algorithm; otherwise the training algorithm is assumed to be BPTT. Wherever applicable we include known results from the literature to compare against our BPTT implementation.

Figure 2. Ablative Experiment: Add Task (T = 200) solved by splitting into multiple parts. Note that FPTT with K = 1 corresponds to BPTT for the LSTM, while K = 200 updates W_t at every timestep. The figure demonstrates that as K increases, the performance of the algorithm improves. [Plot: mean squared error vs. training steps for K = 3, 5, 10, 100, 200.]

4.2. Ablative experiments

The ablative experiments below highlight key aspects of FPTT.

Effect of the parameter K. As described in Sec. 3, each round of RNN parameter updates is more expensive than a gradient update. We can address this issue by choosing a suitable K and running FPTT-K. Smaller K decreases the number of parameter updates, but can impact convergence. For the Add Task with sequence length T = 200, we learn LSTMs with different K values (K = 1, 3, 5, 10, 100, 200). Note that K = 1 is essentially BPTT, as we only update the RNN parameter after seeing the entire sequence. Figure 2 shows that higher K values result in better convergence. On the other hand, higher values of K are more expensive since, as described in Table 1, the cost of parameter updates exceeds that of gradients by a constant factor.⁴ Supposing this factor is C², the highest computational efficiency is achieved by FPTT-K with K = ⌊T/C⌋. Conversely, small values of K can lead to poor training, as observed in Figure 2; the windowing involved is sketched below.

⁴This factor is difficult to pin down since gradients leverage CUDA-PyTorch, while parameter updates are not optimized.
As a rule of thumb, we use K ≈ √T for all our other experiments. This results in meaningful performance (trainability) and computational efficiency matching the GPU LSTM implementation.

Table 2. CIFAR-10: Different RNN architectures.

Architecture        | Accuracy | #Params
LSTM                | 60.11%   | 67K
FPTT LSTM           | 71.03%   | 67K
GRU                 | 66.28%   | 51K
FPTT GRU            | 71.37%   | 51K
Antisymmetric       | 62.41%   | 37K
FPTT Antisymmetric  | 72.13%   | 37K

Choice of architecture. In this experiment we show that FPTT provides non-trivial gains for many RNN architectures. We train one-layer LSTM, GRU, and Antisymmetric RNN architectures on the CIFAR-10 dataset with the same settings as described in Section 4.4. Table 2 shows that RNNs trained with FPTT gain about 5-10 points in accuracy over the RNNs trained with BPTT. In the remaining experiments, we reduce experimentation cost by only performing evaluations on LSTMs, as they are readily available in PyTorch with a very efficient CUDA implementation. We also want to show that replacing BPTT with FPTT allows LSTMs to achieve performance near the state of the art achieved by recent architectural improvements.

Auxiliary Losses in BPTT vs. FPTT. In this experiment we augment BPTT with the auxiliary losses (proposed for terminal prediction in Section 3), similar to (Trinh et al., 2018), in order to isolate the gains from auxiliary losses in the BPTT routine. Table 3 shows that our auxiliary losses help BPTT improve its performance, but it still lags behind the proposed algorithm.

Table 3. CIFAR-10: BPTT + Auxiliary Loss vs. FPTT.

Model          | Accuracy | #Params
LSTM           | 60.11%   | 67K
Aux-Loss+LSTM  | 65.65%   | 67K
FPTT LSTM      | 71.03%   | 67K

Sensitivity to the α hyper-parameter. Note that a very small value of α, i.e. α → 0, would lead FPTT to ignore the regularizer and only optimize the instantaneous loss at every step, resulting in diverging iterates. A very high value of α would lead FPTT to only optimize the regularizer, and hence to very poor generalization performance. We explore the sensitivity to the α hyper-parameter in Algorithm 2 using the PTB-300 language modelling dataset. In this experiment we train FPTT with the following α values: {1.0, 0.8, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}. The best perplexity is reached at α = 0.5, while α = 1.0 fails to converge to a good solution. The performance also starts to decrease for α ≤ 0.05. We show the full result in the appendix (see Table 8 in Sec. 6.3).

4.3. Sequence Modelling

We perform experiments on three variants of the sequence-to-sequence benchmark Penn Tree Bank (PTB) dataset (McAuley & Leskovec, 2013). We provide full details of these experiments in the supplementary.

PTB-300 is a word-level language modelling task with a difficult sequence length of 300 and has been used in many previous works to study long-range dependencies in language modeling (Zhang et al., 2018; Kusupati et al., 2018; Kag et al., 2020). Table 4 shows the test perplexity for our experimental runs along with results from earlier works. Note that improved architectures such as FastGRNN (Kusupati et al., 2018), SpectralRNN (Zhang et al., 2018), and IncrementalRNN (Kag et al., 2020) show improvements over LSTMs trained using the backpropagation algorithm. By incorporating FPTT as the training algorithm, we improve the LSTM's test perplexity by nearly 11 points, thus outperforming the reported LSTM results.

Table 4. Results for PTB word-level language modelling: sequence length 300, 1-layer LSTM.

Model                              | Perplexity | #Params
FastGRNN (Kusupati et al., 2018)   | 116.11     | 53K
IncrementalRNN (Kag et al., 2020)  | 115.71     | 30K
SpectralRNN (Zhang et al., 2018)   | 130.20     | 31K
LSTM (Zhang et al., 2018)          | 130.21     | 64K
LSTM (Kusupati et al., 2018)       | 117.41     | 210K
LSTM                               | 117.09     | 210K
FPTT LSTM                          | 106.27     | 210K

PTB-w is the traditional word-level language modelling variant of the PTB dataset. It uses a sequence length of 70, and we follow (Yang et al., 2018) to set up this experiment. We use a three-layer LSTM model for this task with embedding dimension 280 and hidden size 1150. We report results with dynamic evaluation (Krause et al., 2018) on the trained model. We use the same architecture and training setup to train LSTMs with both BPTT and FPTT. Table 5 demonstrates that the LSTM trained with FPTT achieves better performance than the one trained with BPTT.

PTB-c is the character-level modelling task, which uses a sequence length of 150. We follow (Merity et al., 2018) to set up the character-level task. We use 3-layer LSTM models as recommended, with hidden size 1000 and embedding dimension 200. We train this model with both BPTT and FPTT under the same settings. As shown in Table 5, the LSTM trained with FPTT achieves better bits-per-character and has comparable performance with the existing state-of-the-art results present in this table.
Table 5. Results for the PTB-w and PTB-c datasets. We use the AWD-LSTM model in our PTB-c experiments and AWD-LSTM with Mixture-of-Softmaxes (Yang et al., 2018) in the PTB-w experiments. For the PTB-w dataset, wherever applicable, all baselines report results with dynamic evaluation (Krause et al., 2018). Training with FPTT outperforms the model trained with BPTT.

                                                    |           PTB-c            |            PTB-w
Model                                               | Hidden Dim | BPC   | #Params | Hidden Dim | Perplexity | #Params
Trellis-Net (Bai et al., 2019)                      | 1000       | 1.158 | 13.4M   | 1000       | 54.19      | 34M
AWD-LSTM (Merity et al., 2018; Krause et al., 2018) | 1000       | 1.175 | 13.8M   | 1150       | 51.1       | 24M
Dense IndRNN (Li et al., 2019)                      | 2000       | 1.18  | 45.7M   | 2000       | 50.97      | 52M
LSTM                                                | 1000       | 1.183 | 13.8M   | 1150       | 51.9       | 22M
FPTT LSTM                                           | 1000       | 1.165 | 13.8M   | 1150       | 50.96      | 22M

4.4. Terminal Prediction

We benchmark FPTT on popular terminal prediction tasks to demonstrate that the proposed algorithm provides non-trivial gains over BPTT in this setting as well. For a fair comparison, following previous works (Zhang et al., 2018; Kusupati et al., 2018; Kag et al., 2020), we use LSTMs with a 128-dimensional hidden state and Adam as the optimizer, with initial learning rate 1e-3 for both algorithms. We provide other hyper-parameter tuning details in the supplementary (see Sec. 6.1).

Add Task (Hochreiter & Schmidhuber, 1997) has been used to evaluate long-range dependencies in RNN architectures. An example data point consists of two sequences (x1, x2) of length T and a target label y. x1 contains real-valued entries drawn uniformly from [0, 1]; x2 is a binary sequence with exactly two 1s; and the label y is the sum of the two entries in x1 at the positions where x2 has 1s. For both algorithms (BPTT and FPTT), we use episodic training where a train batch of size 128 is presented to the RNN to update its parameters, and evaluation uses an independently drawn test set. We use the difficult sequence lengths T = 750 and T = 1000 in this task.

Figure 3 shows the convergence plots for both algorithms in these two settings. It demonstrates that FPTT helps LSTMs solve this task while BPTT stays around the same loss value throughout the training phase. Note that this observation is consistent with previous works (Kag et al., 2020; Zhang et al., 2018).

Figure 3. Results for the Add Task with large sequence lengths: (a) T = 750, and (b) T = 1000. [Plots: mean squared error vs. training steps for FPTT-LSTM and LSTM.]

Pixel & Permute MNIST, CIFAR-10 are sequential variants of the popular image classification datasets MNIST (Lecun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009). MNIST consists of images of 10 digits with shape 28 × 28 × 1, while CIFAR-10 consists of images with shape 32 × 32 × 3. The input images are flattened into a sequence (row-wise). At each time step, 1 and 3 pixels are presented as the input for the MNIST and CIFAR datasets respectively. This construction results in Pixel MNIST and CIFAR datasets with 784- and 1024-length sequences respectively. Permute-MNIST is obtained by applying a fixed permutation to the Pixel MNIST sequence. This creates a harder problem than the Pixel setting since there are no obvious patterns to exploit.

Table 6 lists the performance of LSTMs trained using BPTT and FPTT, along with the known results from the literature on these datasets. It shows that LSTMs trained with FPTT outperform the ones trained with BPTT. We point out that the performance of FPTT LSTMs is reasonably close to the best performance reported on these datasets by better architectures with higher complexity.
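For reference, a minimal generator for the Add Task described above (a standard construction; the batching and tensor shapes are our choices):

```python
import torch

def add_task_batch(batch_size, T):
    """x1 ~ U[0,1]^T; x2 is binary with exactly two 1s; y is the sum of
    the two entries of x1 at the positions marked by x2."""
    x1 = torch.rand(batch_size, T)
    x2 = torch.zeros(batch_size, T)
    for b in range(batch_size):
        i, j = torch.randperm(T)[:2]       # two distinct marker positions
        x2[b, i] = x2[b, j] = 1.0
    y = (x1 * x2).sum(dim=1)               # target label
    x = torch.stack([x1, x2], dim=-1)      # (batch, T, 2) input sequence
    return x, y
```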
Table 6. Results for Sequential MNIST, Permute MNIST, and Sequential CIFAR-10. Models listed below use 1 layer, except IndRNN and TrellisNet, which are multi-layered architectures.

                                          |      Seq-MNIST      |    Permute-MNIST    |      CIFAR-10
Model                                     | Accuracy | #Params  | Accuracy | #Params  | Accuracy | #Params
AntisymmetricRNN (Chang et al., 2019)     | 98.8%    | 10K      | 93.1%    | 10K      | 62.20%   | 37K
IncrementalRNN (Kag et al., 2020)         | 98.13%   | 4K       | 95.62%   | 8K       | -        | -
IndRNN (6 layers) (Li et al., 2018)       | 99.0%    | -        | 96.0%    | -        | -        | -
TrellisNet (16 layers) (Bai et al., 2019) | 99.20%   | 8M       | 98.13%   | 8M       | 73.42%   | 8M
r-LSTM (Trinh et al., 2018)               | 98.4%    | 100K     | 95.2%    | 100K     | 72.20%   | 101K
LSTM (Chang et al., 2019)                 | 97.3%    | 68K      | 92.6%    | 68K      | 59.70%   | 69K
LSTM (Trinh et al., 2018)                 | 98.3%    | 100K     | 89.4%    | 100K     | 58.80%   | 101K
LSTM (TBPTT-300) (Trinh et al., 2018)     | 11.3%    | 100K     | 88.8%    | 100K     | 49.01%   | 101K
LSTM                                      | 97.71%   | 66K      | 88.91%   | 66K      | 60.11%   | 67K
FPTT LSTM                                 | 98.67%   | 66K      | 94.75%   | 66K      | 71.03%   | 67K

Below we list the benefits of the proposed algorithm.

(A) Better Generalization. FPTT provides better generalization on many benchmark tasks compared to BPTT. Tables 4, 5, and 6 show that FPTT yields better test performance than training with BPTT. Note that Table 2 shows similar gains on architectures other than LSTMs.

(B) Learning Long-Term Dependency Tasks. Training with FPTT enables LSTMs to solve LTD tasks. Our experiments evaluate FPTT on many LTD datasets (Add Task, Permute/Pixel MNIST, CIFAR-10, and PTB-300). Figure 3 shows that FPTT enables LSTMs to solve the Add Task, while BPTT was unable to solve this task. Similarly, Tables 4 and 6 show that FPTT outperforms BPTT on LTD tasks.

(C) Better Model Efficiency. FPTT-trained LSTMs compete with higher-complexity models. Table 6 shows that our 1-layer LSTMs are competitive with multi-layered, higher-capacity deep RNN models (TrellisNet (Bai et al., 2019), IndRNN (Li et al., 2018)). In retrospect, it is worth considering that the higher-complexity models have often stemmed from the inability to train LSTMs on long-term dependency tasks. FPTT points to the fact that the issue is not capacity but the training method.

(D) Learning Short-Term Dependency Tasks. FPTT shows competitive performance on sequence modelling tasks. Our experiments on language modelling tasks with shorter sequence lengths (the PTB-w and PTB-c datasets) demonstrate that the proposed method can learn the short-term dependencies present in these datasets. Table 5 shows that FPTT provides better test performance than BPTT.

(E) Computational Efficiency/Convergence. We discussed the computational trade-off of the proposed method in Table 1. This trade-off allows FPTT to provide training complexity similar to BPTT while providing better statistical trainability and generalization. We show a training-time comparison in the supplementary (see Sec. 6.2). Additionally, FPTT reduces the memory overhead by only storing hidden states between two iterate updates; in contrast, BPTT stores all the intermediate states in the time horizon T.

5. Conclusion

We proposed a novel forward-propagation-through-time (FPTT) method for training RNNs based on sequentially updating parameters forward through time. As such, our method, at each time t, involves taking a gradient step of an instantaneously constructed regularized risk, where the regularizer evolves dynamically and is updated based on past history. Our method exhibits a light-weight footprint and improves LSTM trainability on benchmark long-term dependency tasks, bypassing the vanishing/exploding gradient issues encountered while training LSTMs with BPTT. As a result, we show that LSTMs have sufficient capacity, and often realize results that are competitive with much higher-capacity models.

Acknowledgements

We would like to thank the reviewers for their insightful comments. This research was supported by National Science Foundation grants CCF-2007350 (VS), CCF-2022446 (VS), and CCF-1955981 (VS), the Data Science Faculty and Student Fellowship from the Rafik B. Hariri Institute, the Office of Naval Research Grant N0014-18-1-2257, and by a gift from the ARM corporation.

References

Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1120-1128. PMLR, 2016. URL http://proceedings.mlr.press/v48/arjovsky16.html.
Bai, S., Kolter, J. Z., and Koltun, V. Trellis networks for sequence modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyeVtoRqtQ.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8624-8628, 2013.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011. URL https://doi.org/10.1561/2200000016.

Chang, B., Chen, M., Haber, E., and Chi, E. H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryxepo0cFX.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014. URL http://www.aclweb.org/anthology/D14-1179.

Erichson, N. B., Azencot, O., Queiruga, A., Hodgkinson, L., and Mahoney, M. W. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=-N7PBXqOUJZ.

Gu, F., Askari, A., and Ghaoui, L. E. Fenchel lifted networks: A Lagrange relaxation of neural network training. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3362-3371. PMLR, 2020. URL http://proceedings.mlr.press/v108/gu20a.html.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., and Soljačić, M. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1733-1741. PMLR, 2017. URL http://proceedings.mlr.press/v70/jing17a.html.

Kag, A. and Saligrama, V. Time adaptive recurrent neural network, 2021. URL https://openreview.net/forum?id=VDUovuK0gV.

Kag, A., Zhang, Z., and Saligrama, V. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HylpqA4FwS.

Kerg, G., Goyette, K., Puelma Touzel, M., Gidel, G., Vorontsov, E., Bengio, Y., and Lajoie, G. Non-normal recurrent neural network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/9d7099d87947faa8d07a272dd6954b80-Paper.pdf.

Krause, B., Kahembwe, E., Murray, I., and Renals, S. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2766-2775. PMLR, 2018. URL http://proceedings.mlr.press/v80/krause18a.html.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., and Varma, M. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper/2018/file/ab013ca67cf2d50796b0c11d1b8bc95d-Paper.pdf.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278-2324, 1998.

Lezcano-Casado, M. and Martínez-Rubio, D. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3794-3803. PMLR, 2019. URL http://proceedings.mlr.press/v97/lezcano-casado19a.html.

Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5457-5466, 2018. doi: 10.1109/CVPR.2018.00572.

Li, S., Li, W., Cook, C., Gao, Y., and Zhu, C. Deep independently recurrent neural network (IndRNN), 2019.

Linsley, D., Karkada Ashok, A., Govindarajan, L. N., Liu, R., and Serre, T. Stable and expressive recurrent vision models. In Advances in Neural Information Processing Systems, volume 33, pp. 10456-10467, 2020. URL https://proceedings.neurips.cc/paper/2020/file/766d856ef1a6b02f93d894415e6bfa0e-Paper.pdf.

McAuley, J. and Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pp. 165-172, 2013. URL http://doi.acm.org/10.1145/2507157.2507163.

McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M., Hrafnkelsson, A. M., Boulos, T., and Kubica, J. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.

Menick, J., Elsen, E., Evci, U., Osindero, S., Simonyan, K., and Graves, A. Practical real time recurrent learning with a sparse approximation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=q3KSThy2GwB.

Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyyGPP0TZ.

Mhammedi, Z., Hellicar, A., Rahman, A., and Bailey, J. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2401-2409. PMLR, 2017. URL http://proceedings.mlr.press/v70/mhammedi17a.html.

Miller, J. and Hardt, M. Stable recurrent models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.

Mujika, A., Meier, F., and Steger, A. Approximating real-time recurrent learning with random Kronecker factors. In Advances in Neural Information Processing Systems, volume 31, pp. 6594-6603, 2018. URL https://proceedings.neurips.cc/paper/2018/file/dba132f6ab6a3e3d17a8d59e82105f4c-Paper.pdf.

Ollivier, Y. and Charpiat, G. Training recurrent networks online without backtracking. CoRR, abs/1507.07680, 2015. URL http://arxiv.org/abs/1507.07680.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Internal Representations by Error Propagation, pp. 318-362. MIT Press, Cambridge, MA, USA, 1986.

Tallec, C. and Ollivier, Y. Unbiased online recurrent optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJQDjk-0b.

Trinh, T., Dai, A., Luong, T., and Le, Q. Learning longer-term dependencies in RNNs with auxiliary losses. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4965-4974. PMLR, 2018. URL http://proceedings.mlr.press/v80/trinh18a.html.

Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990. doi: 10.1109/5.58337.
Williams, R. J. and Peng, J. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490-501, 1990. URL https://doi.org/10.1162/neco.1990.2.4.490.

Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989. doi: 10.1162/neco.1989.1.2.270.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkwZSG-CZ.

Zhang, J., Lei, Q., and Dhillon, I. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5806-5814. PMLR, 2018. URL http://proceedings.mlr.press/v80/zhang18g.html.
