BiLSTM BPTT
Abstract— In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full gradient version of the LSTM learning algorithm. We evaluate Bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and Long Short Term Memory (LSTM) is much faster and also more accurate than both standard Recurrent Neural Nets (RNNs) and time-windowed Multilayer Perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.¹

¹ An abbreviated version of some portions of this article appeared in (Graves and Schmidhuber, 2005), as part of the IJCNN 2005 conference proceedings, published under the IEEE copyright.

I. INTRODUCTION

For neural networks, there are two main ways of incorporating context into sequence processing tasks: collect the inputs into overlapping time-windows, and treat the task as spatial; or use recurrent connections to model the flow of time directly. Using time-windows has two major drawbacks: firstly the optimal window size is task dependent (too small and the net will neglect important information, too large and it will overfit on the training data), and secondly the network is unable to adapt to shifted or timewarped sequences. However, standard RNNs (by which we mean RNNs containing hidden layers of recurrently connected neurons) have limitations of their own. Firstly, since they process inputs in temporal order, their outputs tend to be mostly based on previous context (there are ways to introduce future context, such as adding a delay between the outputs and the targets; but these do not usually make full use of backwards dependencies). Secondly they are known to have difficulty learning time-dependencies more than a few timesteps long (Hochreiter et al., 2001). An elegant solution to the first problem is provided by bidirectional networks (Section II). For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (Section III).

Our experiments concentrate on framewise phoneme classification (i.e. mapping a sequence of speech frames to a sequence of phoneme labels associated with those frames). This task is both a first step towards full speech recognition (Robinson, 1994; Bourlard and Morgan, 1994), and a challenging benchmark in sequence processing. In particular, it requires the effective use of contextual information.

The contents of the rest of this paper are as follows: in Section II we discuss bidirectional networks, and answer a possible objection to their use in causal tasks; in Section III we describe the Long Short Term Memory (LSTM) network architecture, and our modification to its error gradient calculation; in Section IV we describe the experimental data and how we used it in our experiments; in Section V we give an overview of the various network architectures; in Section VI we describe how we trained (and retrained) them; in Section VII we present and discuss the experimental results, and in Section VIII we make concluding remarks. Appendix A contains the pseudocode for training LSTM networks with a full gradient calculation, and Appendix B is an outline of bidirectional training with RNNs.

II. BIDIRECTIONAL RECURRENT NEURAL NETS

The basic idea of bidirectional recurrent neural nets (BRNNs) (Schuster and Paliwal, 1997; Baldi et al., 1999) is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. (In some cases a third network is used in place of the output layer, but here we have used the simpler model). This means that for every point in a given sequence, the BRNN has complete, sequential information about all points before and after it. Also, because the net is free to use as much or as little of this context as necessary, there is no need to find a (task-dependent) time-window or target delay size. In Appendix B we give an outline of the bidirectional algorithm, and Figure 1 illustrates how the forwards and reverse subnets combine to classify phonemes. BRNNs have given improved results in sequence learning tasks, notably protein structure prediction (PSP) (Baldi et al., 2001; Chen and Chaudhari, 2004) and speech processing (Schuster, 1999; Fukada et al., 1999).
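As a minimal illustration of this idea (our sketch, not the implementation used in the paper), the following NumPy code runs two independent recurrent hidden layers over a sequence, one forwards and one backwards, and connects both to the same softmax output layer. The tanh hidden units, layer sizes and weight names are illustrative assumptions.

import numpy as np

def brnn_forward(x, params):
    """Illustrative BRNN forward pass: two recurrent subnets share one output layer.

    x      : (T, n_in) input sequence (e.g. one frame of acoustic features per row)
    params : dict of weight matrices, randomly initialised below
    returns: (T, n_out) framewise class probabilities
    """
    n_hid = params["W_fwd"].shape[0]

    def run(seq, W_in, W_rec):
        h, states = np.zeros(n_hid), []
        for t in range(seq.shape[0]):
            h = np.tanh(W_in @ seq[t] + W_rec @ h)   # simple tanh recurrent layer
            states.append(h)
        return np.stack(states)

    # Forward subnet reads the sequence left to right, reverse subnet right to left;
    # the reverse states are flipped back so that both align with the original frames.
    h_fwd = run(x, params["V_fwd"], params["W_fwd"])
    h_bwd = run(x[::-1], params["V_bwd"], params["W_bwd"])[::-1]

    # Both subnets feed the same output layer, which therefore sees past and future context.
    logits = h_fwd @ params["U_fwd"].T + h_bwd @ params["U_bwd"].T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # softmax over phoneme classes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, T = 26, 64, 61, 100          # illustrative sizes only
    params = {
        "V_fwd": rng.normal(0, 0.1, (n_hid, n_in)),  "W_fwd": rng.normal(0, 0.1, (n_hid, n_hid)),
        "V_bwd": rng.normal(0, 0.1, (n_hid, n_in)),  "W_bwd": rng.normal(0, 0.1, (n_hid, n_hid)),
        "U_fwd": rng.normal(0, 0.1, (n_out, n_hid)), "U_bwd": rng.normal(0, 0.1, (n_out, n_hid)),
    }
    probs = brnn_forward(rng.normal(size=(T, n_in)), params)
    print(probs.shape)                               # (100, 61): one distribution per frame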
A. Bidirectional Networks and Online Causal Tasks

In a spatial task like PSP, it is clear that any distinction between input directions should be discarded. But for temporal problems like speech recognition, relying on knowledge of the future seems at first sight to violate causality — at least if the task is online. How can we base our understanding of what we've heard on something that hasn't been said yet? However, human listeners do exactly that. Sounds, words, and even whole sentences that at first mean nothing are found to make sense in the light of future context. What we must remember is the distinction between tasks that are truly online - requiring an output after every input - and those where outputs are only needed at the end of some input segment. For the first class of problems BRNNs are useless, since meaningful outputs are only available after the net has run backwards. But the point is that speech recognition, along with most other 'online' causal tasks, is in the second class: an output at the end of every segment (e.g. sentence) is fine. Therefore, we see no objection to using BRNNs to gain improved performance on speech recognition tasks. On a more practical note, given the relative speed of activating neural nets, the delay incurred by running an already trained net backwards as well as forwards is small.

In general, the BRNNs examined here make the following assumptions about their input data: that it can be divided into finitely long segments, and that the effect of each of these on the others is negligible. For speech corpora like TIMIT, made up of separately recorded utterances, this is clearly the case. For real speech, the worst it can do is neglect contextual effects that extend across segment boundaries — e.g. the ends of sentences or dialogue turns. Moreover, such long term effects are routinely neglected by current speech recognition systems.

[Fig. 1 panels (top to bottom): Target, Bidirectional Output, Forward Net Only, Reverse Net Only, for the utterance "sil w ah n sil ow sil f ay v sil".]

III. LSTM

The Long Short Term Memory architecture (Hochreiter and
duration of the current phoneme, ensuring that short phonemes are as significant to the training as longer ones. However, we recorded a slightly lower framewise classification score with BLSTM trained with this error function (see Section VII-D).

VI. NETWORK TRAINING

For all architectures, we calculated the full error gradient using online BPTT (BPTT truncated to the lengths of the utterances), and trained the weights using gradient descent with momentum. We kept the same training parameters for all experiments: initial weights randomised in the range

A. Retraining

For the experiments with varied time-windows or target delays, we iteratively retrained the networks instead of starting again from scratch. For example, for LSTM with a target delay of 2, we first trained with delay 0, then took the best net and retrained it (without resetting the weights) with delay 1, then retrained again with delay 2. To find the best networks, we retrained the LSTM nets for 5 epochs at each iteration, the RNN nets for 10, and the MLPs for 20. It is possible that longer retraining times would have given improved results. For the retrained MLPs, we had to add extra (randomised) weights from the input layers, since the input size grew with the time-window.

Although primarily a means to reduce training time, we have also found that retraining improves final performance (Graves et al., 2005; Beringer, 2004a). Indeed, the best result in this paper was achieved by retraining (on the BLSTM net trained with a weighted error function, then retrained with normal cross-entropy error). The benefits presumably come from escaping the local minima that gradient descent algorithms tend to get caught in.
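As a rough sketch of this retraining schedule (not the authors' code), the following Python outline keeps the weights of the trained net and retrains it at each successive delay for the per-architecture epoch counts given above; train and evaluate are hypothetical helpers.

# Sketch of the iterative retraining schedule described above (not the authors' code).
# `train` and `evaluate` are hypothetical helpers; only the schedule itself
# (keep the trained weights, increase the target delay, retrain briefly) follows the text.

RETRAIN_EPOCHS = {"lstm": 5, "rnn": 10, "mlp": 20}   # retraining epochs per iteration, from the text

def retrain_with_increasing_delay(net, arch, max_delay, train, evaluate):
    """Train at delay 0, then retrain the same weights at delays 1, 2, ..., max_delay."""
    results = {}
    for delay in range(max_delay + 1):
        epochs = None if delay == 0 else RETRAIN_EPOCHS[arch]   # full training run at delay 0
        train(net, target_delay=delay, n_epochs=epochs)          # weights are NOT reset between delays
        results[delay] = evaluate(net)                           # record the framewise score per delay
    return results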
[Fig. 2 panels: BLSTM Duration Weighted Error; BRNN; MLP 10 Frame Time-Window. Framewise outputs over the excerpt "q ae dx ah w ix n dcl d ow sil" ("at a window").]

Fig. 2. The best exemplars of each architecture classifying the excerpt "at a window" from an utterance in the TIMIT database. In general, the networks found the vowels (here, "ix" is confused with "ih", "ah" with "ax" and "axr", and "ae" with "eh") more difficult than the consonants (e.g. "w" and "n"), which in English are more distinct. For BLSTM, the net with duration weighted error tends to do better on short phones (e.g. the closure and stop "dcl" and "d"), and worse on longer ones ("ow"), as expected. Note the more jagged trajectories for the MLP net (e.g. for "q" and "ow"); this is presumably because it has no recurrency to smooth the outputs.

TABLE I
FRAMEWISE PHONEME CLASSIFICATION ON THE TIMIT DATABASE: BIDIRECTIONAL LSTM

Network              Training Set Score   Test Set Score   Epochs
BLSTM (1)            77.0%                69.7%            20
BLSTM (2)            77.9%                70.1%            21
BLSTM (3)            77.3%                69.9%            20
BLSTM (4)            77.8%                69.8%            22
BLSTM (5)            77.1%                69.4%            19
BLSTM (6)            77.8%                69.8%            21
BLSTM (7)            76.7%                69.9%            18
mean                 77.4%                69.8%            20.1
standard deviation    0.5%                 0.2%             1.3

VII. RESULTS

Table I contains the outcomes of 7 randomly initialised training runs with BLSTM. For the rest of the paper, we use their mean as the result for BLSTM. The standard deviation in the test set scores (0.2%) gives an indication of what constitutes a significant difference in network performance.

The last three entries in Table II come from the papers indicated (note that Robinson did not quote framewise classification scores; the result for his network was recorded by Schuster, using the original software). The rest are from our own experiments. For the MLP, RNN and LSTM nets we give the best results, and those achieved with least contextual
information (i.e. with no target delay / time-window). The number of epochs includes both training and retraining.

TABLE II
FRAMEWISE PHONEME CLASSIFICATION ON THE TIMIT DATABASE: MAIN RESULTS

Network                                 Training Set   Test Set   Epochs
BLSTM (retrained)                       78.6%          70.2%      17
BLSTM                                   77.4%          69.8%      20.1
BRNN                                    76.0%          69.0%      170
BLSTM (Weighted Error)                  75.7%          68.9%      15
LSTM (5 frame delay)                    77.6%          66.0%      34
RNN (3 frame delay)                     71.0%          65.2%      139
LSTM (backwards, 0 frame delay)         71.1%          64.7%      15
LSTM (0 frame delay)                    70.9%          64.6%      15
RNN (0 frame delay)                     69.9%          64.5%      120
MLP (10 frame time-window)              67.6%          63.1%      990
MLP (no time-window)                    53.6%          51.4%      835
RNN (Chen and Jamieson, 1996)           69.9%          74.2%      -
RNN (Robinson, 1994; Schuster, 1999)    70.6%          65.3%      -
BRNN (Schuster, 1999)                   72.1%          65.1%      -

There are some differences between the results quoted in this paper and in our previous work (Graves and Schmidhuber, 2005). The most significant of these is the improved score we achieved here with the bidirectional RNN (69.0% instead of 64.7%). Previously we had stopped the BRNN after 65 epochs, when it appeared to have converged; here, however, we let it run for 225 epochs (10 times as long as LSTM), and kept the best net on the validation set, after 170 epochs. As can be seen from Figure 4, the learning curves for the non-LSTM networks are very slow, and contain several sections where the error temporarily increases, making it difficult to know when training should be stopped.

The results for the unidirectional LSTM and RNN nets are also better here; this is probably due to our use of larger networks, and the fact that we retrained between different target delays. Again it should be noted that at the moment we do not have an optimal method for choosing retraining times.

A. Comparison Between LSTM and Other Architectures

The most obvious difference between LSTM and the RNN and MLP nets was the training time (see Figure 4). In particular, the BRNN took more than 8 times as long to converge as BLSTM, despite having more or less equal computational complexity per time-step (see Section V-A). There was a similar time increase between the unidirectional LSTM and RNN nets, and the MLPs were slower still (990 epochs for the best MLP result).

The training time of 17 epochs for our most accurate network (retrained BLSTM) is remarkably fast, needing just a few hours on an ordinary desktop computer. Elsewhere we have seen figures of between 40 and 120 epochs quoted for RNN convergence on this task, usually with more advanced training algorithms than the one used here.

A possible explanation of why RNNs took longer to train than LSTM on this task is that they require more fine-tuning of their weights to make use of the contextual information, since their error signals tend to decay after a few timesteps. A detailed analysis of the evolution of the weights would be required to check this.

As well as being faster, the LSTM nets were also slightly more accurate. Although the final difference in score between BLSTM and BRNN on this task is small (0.8%), the results in Table I strongly suggest that it is significant. The fact that the difference is not larger could mean that long time dependencies (more than 10 timesteps or so) are not very helpful to this task.

It is interesting to note how much more prone to overfitting LSTM was than standard RNNs. For LSTM, after only 15-20 epochs the performance on the validation and test sets would begin to fall, while that on the training set would continue to rise (the highest score we recorded on the training set with BLSTM was 86.4%, and still improving). With the RNNs, on the other hand, we never observed a large drop in test set score. This suggests a difference in the way the two architectures learn. Given that in the TIMIT corpus no speakers or sentences are shared by the training and test sets, it is possible that LSTM's overfitting was partly caused by its better adaptation to long range regularities (such as phoneme ordering within words, or speaker specific pronunciations) than normal RNNs. If this is true, we would expect a greater distinction between the two architectures on tasks with more training data.

B. Comparison with Previous Work

Overall, BLSTM outperformed any neural network we found in the literature on this task, apart from the RNN used by Chen and Jamieson. Their result (which we were unable to approach with standard RNNs) is surprising, as they quote a substantially higher score on the test set than the training set: all other methods reported here were better on the training than the test set, as expected.

In general, it is difficult to compare with previous work on this task, owing to the many variations in training data (different preprocessing, different subsets of the TIMIT corpus, different target representations) and experimental method (different learning algorithms, error functions, network sizes etc.). This is why we reimplemented all the architectures ourselves.

C. Effect of Increased Context

As is clear from Figure 3, networks with access to more contextual information tended to get better results. In particular, the bidirectional networks were substantially better than the unidirectional ones. For the unidirectional nets, note that LSTM benefits more from longer target delays than RNNs; this could be due to LSTM's greater facility with long timelags, allowing it to make use of the extra context without suffering as much from having to remember previous inputs.

Interestingly, LSTM with no time delay returns almost identical results whether trained forwards or backwards. This suggests that the context in both directions is equally important. However, with bidirectional nets, the forward subnet usually dominates the outputs (see Figure 1).
[Fig. 3 plot: "Framewise Phoneme Classification Scores". % Frames Correctly Classified vs. Target Delay / Window Size, with curves for BLSTM Retrained, BLSTM, BRNN, BLSTM Weighted Error, LSTM, RNN and MLP.]

Fig. 3. Framewise phoneme classification results for all networks on the TIMIT test set. The number of frames of introduced context (time-window size for MLPs, target delay size for unidirectional LSTM and RNNs) is plotted along the x axis. Therefore the results for the bidirectional nets (clustered around 70%) are plotted at x=0.

[Fig. 4 plot: "Learning Curves for Three Architectures". % Frames Correctly Classified vs. Training Epochs, with training and test set curves for BLSTM, BRNN and MLP.]

Fig. 4. Learning curves for BLSTM, BRNN and MLP with no time-window. For all experiments, LSTM was much faster to converge than either the RNN or MLP architectures.
For the MLPs, performance increased with time-window size, and it appears that even larger windows would have been desirable. However, with fully connected networks, the number of weights required for such large input layers makes training prohibitively slow.

D. Weighted Error

The experiment with a weighted error function gave slightly inferior framewise performance for BLSTM (68.9%, compared to 69.7%). However, the purpose of this weighting is to improve overall phoneme recognition, rather than framewise classification (see Section V-C). As a measure of its success, if we assume a perfect knowledge of the test set segmentation (which in real-life situations we cannot), and integrate the network outputs over each phoneme, then BLSTM with weighted errors gives a phoneme correctness of 74.4%, compared to 71.2% with normal errors.
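To make the segment-level measure concrete, here is a small NumPy sketch of the procedure described above (our illustration, not the paper's evaluation code): the framewise outputs are summed over each phoneme segment, assumed known, and the phoneme counts as correct if the arg-max of the summed output matches the target label.

import numpy as np

def phoneme_correctness(frame_probs, segments, labels):
    """Segment-level accuracy from framewise outputs.

    frame_probs : (T, n_classes) framewise class probabilities
    segments    : list of (start, end) frame indices, one per phoneme (end exclusive)
    labels      : list of target class indices, one per phoneme
    """
    correct = 0
    for (start, end), target in zip(segments, labels):
        summed = frame_probs[start:end].sum(axis=0)   # integrate the outputs over the phoneme
        correct += int(summed.argmax() == target)
    return correct / len(labels)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    probs = rng.dirichlet(np.ones(61), size=50)        # 50 frames, 61 classes (toy data)
    segs = [(0, 20), (20, 35), (35, 50)]               # assumed known segmentation
    print(phoneme_correctness(probs, segs, labels=[3, 17, 42]))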
VIII. CONCLUSION AND FUTURE WORK

In this paper we have compared bidirectional LSTM to other neural network architectures on the task of framewise phoneme classification. We have found that bidirectional networks are significantly more effective than unidirectional ones, and that LSTM is much faster to train than standard RNNs and MLPs, and also slightly more accurate. We conclude that bidirectional LSTM is an architecture well suited to this and other speech processing tasks, where context is vitally important.

In the future we would like to apply BLSTM to full speech recognition, for example as part of a hybrid RNN / Hidden Markov Model system.

APPENDIX A: PSEUDOCODE FOR FULL GRADIENT LSTM

The following pseudocode details the forward pass, backward pass, and weight updates of an extended LSTM layer in a multi-layer net. The error gradient is calculated with online BPTT (i.e. BPTT truncated to the lengths of input sequences, with weight updates after every sequence). As is standard with BPTT, the network is unfolded over time, so that connections arriving at layers are viewed as coming from the previous timestep. We have tried to make it clear which equations are LSTM specific, and which are part of the standard BPTT algorithm. Note that for the LSTM equations, the order of execution is important.

Notation

The input sequence over which the training takes place is labelled $S$ and it runs from time $\tau_0$ to $\tau_1$. $x_k(\tau)$ refers to the network input to unit $k$ at time $\tau$, and $y_k(\tau)$ to its activation. Unless stated otherwise, all network inputs, activations and partial derivatives are evaluated at time $\tau$ — e.g. $y_c \equiv y_c(\tau)$. $E(\tau)$ refers to the (scalar) output error of the net at time $\tau$. The training target for output unit $k$ at time $\tau$ is denoted $t_k(\tau)$. $N$ is the set of all units in the network, including input and bias units, that can be connected to other units. Note that this includes LSTM cell outputs, but not LSTM gates or internal states (whose activations are only visible within their own memory blocks). $w_{ij}$ is the weight from unit $j$ to unit $i$.
The LSTM equations are given for a single memory block only. The generalisation to multiple blocks is trivial: simply repeat the calculations for each block, in any order. Within each block, we use the suffixes $\iota$, $\phi$ and $\omega$ to refer to the input gate, forget gate and output gate respectively. The suffix $c$ refers to an element of the set of cells $C$. $s_c$ is the state value of cell $c$ — i.e. its value after the input and forget gates have been applied. $f$ is the squashing function of the gates, and $g$ and $h$ are respectively the cell input and output squashing functions.

Forward Pass

• Reset all activations to 0.
• Running forwards from time $\tau_0$ to time $\tau_1$, feed in the inputs and update the activations. Store all hidden layer and output activations at every timestep.
• For each LSTM block, the activations are updated as follows:

Input Gates:
$$x_\iota = \sum_{j \in N} w_{\iota j}\, y_j(\tau-1) + \sum_{c \in C} w_{\iota c}\, s_c(\tau-1)$$
$$y_\iota = f(x_\iota)$$

Forget Gates:
$$x_\phi = \sum_{j \in N} w_{\phi j}\, y_j(\tau-1) + \sum_{c \in C} w_{\phi c}\, s_c(\tau-1)$$
$$y_\phi = f(x_\phi)$$

Cells:
$$\forall c \in C,\quad x_c = \sum_{j \in N} w_{c j}\, y_j(\tau-1)$$
$$s_c = y_\phi\, s_c(\tau-1) + y_\iota\, g(x_c)$$

Output Gates:
$$x_\omega = \sum_{j \in N} w_{\omega j}\, y_j(\tau-1) + \sum_{c \in C} w_{\omega c}\, s_c(\tau)$$
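As an illustration of the forward pass above (a sketch, not the authors' code), the following NumPy function updates a single memory block with one cell for one timestep. The logistic sigmoid for $f$ and tanh for $g$ and $h$ are assumed choices, and the output-gate activation $y_\omega = f(x_\omega)$ and cell output $y_c = y_\omega h(s_c)$ are completed here in the standard LSTM way.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_forward(y_prev, s_prev, w):
    """One timestep of a single LSTM memory block with one cell (illustrative).

    y_prev : activations y_j(tau-1) of the units j in N feeding this block
    s_prev : cell state s_c(tau-1)
    w      : dict of input weights (w_*j vectors) and peephole weights (w_*c scalars)
             for the input gate (i), forget gate (p), cell (c) and output gate (o)
    """
    f, g, h = sigmoid, np.tanh, np.tanh              # assumed gate / input / output squashing functions

    # Input gate:  x_i = sum_j w_ij y_j(t-1) + w_ic s_c(t-1);  y_i = f(x_i)
    y_in = f(w["w_ij"] @ y_prev + w["w_ic"] * s_prev)
    # Forget gate: x_p = sum_j w_pj y_j(t-1) + w_pc s_c(t-1);  y_p = f(x_p)
    y_forget = f(w["w_pj"] @ y_prev + w["w_pc"] * s_prev)
    # Cell input and state: s_c = y_p s_c(t-1) + y_i g(x_c)
    x_c = w["w_cj"] @ y_prev
    s = y_forget * s_prev + y_in * g(x_c)
    # Output gate peeks at the *current* state s_c(t):  x_o = sum_j w_oj y_j(t-1) + w_oc s_c(t)
    y_out = f(w["w_oj"] @ y_prev + w["w_oc"] * s)
    # Cell output (completed here in the standard way): y_c = y_o h(s_c)
    y_c = y_out * h(s)
    return y_c, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 8                                            # size of the set N feeding the block (toy value)
    w = {k: rng.uniform(-0.1, 0.1, n) for k in ("w_ij", "w_pj", "w_cj", "w_oj")}
    w.update({k: rng.uniform(-0.1, 0.1) for k in ("w_ic", "w_pc", "w_oc")})
    y_c, s = 0.0, 0.0
    for t in range(5):                               # drive the block with a short random sequence
        y_prev = rng.normal(size=n)
        y_c, s = lstm_block_forward(y_prev, s, w)
    print(y_c, s)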
Backward Pass

Cell Outputs:
$$\forall c \in C,\quad \text{define } \epsilon_c = \sum_{j \in N} w_{j c}\, \delta_j(\tau+1)$$

Output Gates:
$$\delta_\omega = f'(x_\omega) \sum_{c \in C} \epsilon_c\, h(s_c)$$

States:
$$\frac{\partial E}{\partial s_c}(\tau) = \epsilon_c\, y_\omega\, h'(s_c) + \frac{\partial E}{\partial s_c}(\tau+1)\, y_\phi(\tau+1) + \delta_\iota(\tau+1)\, w_{\iota c} + \delta_\phi(\tau+1)\, w_{\phi c} + \delta_\omega\, w_{\omega c}$$

Cells:
$$\forall c \in C,\quad \delta_c = y_\iota\, g'(x_c)\, \frac{\partial E}{\partial s_c}$$

Forget Gates:
$$\delta_\phi = f'(x_\phi) \sum_{c \in C} \frac{\partial E}{\partial s_c}\, s_c(\tau-1)$$

Input Gates:
$$\delta_\iota = f'(x_\iota) \sum_{c \in C} \frac{\partial E}{\partial s_c}\, g(x_c)$$

• Using the standard BPTT equation, accumulate the $\delta$'s to get the partial derivatives of the cumulative sequence error:
$$\text{define } E_{total}(S) = \sum_{\tau=\tau_0}^{\tau_1} E(\tau)$$
$$\text{define } \nabla_{ij}(S) = \frac{\partial E_{total}(S)}{\partial w_{ij}}$$
$$\implies \nabla_{ij}(S) = \sum_{\tau=\tau_0+1}^{\tau_1} \delta_i(\tau)\, y_j(\tau-1)$$

Update Weights

• After the presentation of sequence $S$, with learning rate $\alpha$
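As a generic sketch of the gradient accumulation above and the gradient-descent-with-momentum update described in Section VI (weight updates after every sequence), with the learning rate and momentum values as placeholder assumptions rather than the paper's settings:

import numpy as np

def update_weights(weights, velocity, deltas, activations, alpha=1e-4, momentum=0.9):
    """Accumulate grad_ij(S) = sum_t delta_i(t) * y_j(t-1) over one sequence S,
    then apply a gradient-descent-with-momentum step (placeholder hyperparameters)."""
    grad = np.zeros_like(weights)
    # deltas[t] are the error signals delta_i(tau); activations[t-1] are the y_j(tau-1)
    for t in range(1, len(deltas)):
        grad += np.outer(deltas[t], activations[t - 1])
    velocity[...] = momentum * velocity - alpha * grad
    weights += velocity
    return weights, velocity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_i, n_j, T = 4, 6, 10                           # toy layer sizes and sequence length
    W = rng.uniform(-0.1, 0.1, (n_i, n_j))
    V = np.zeros_like(W)
    deltas = [rng.normal(size=n_i) for _ in range(T)]
    acts = [rng.normal(size=n_j) for _ in range(T)]
    W, V = update_weights(W, V, deltas, acts)
    print(W.shape)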