
Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures

Alex Graves∗ and Jürgen Schmidhuber∗†
∗ IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
† TU Munich, Boltzmannstr. 3, 85748 Garching, Munich, Germany
{alex,juergen}@idsia.ch

Abstract— In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full gradient version of the LSTM learning algorithm. We evaluate Bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and Long Short Term Memory (LSTM) is much faster and also more accurate than both standard Recurrent Neural Nets (RNNs) and time-windowed Multilayer Perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.¹

¹ An abbreviated version of some portions of this article appeared in (Graves and Schmidhuber, 2005), as part of the IJCNN 2005 conference proceedings, published under the IEEE copyright.

I. INTRODUCTION

For neural networks, there are two main ways of incorporating context into sequence processing tasks: collect the inputs into overlapping time-windows, and treat the task as spatial; or use recurrent connections to model the flow of time directly. Using time-windows has two major drawbacks: firstly the optimal window size is task dependent (too small and the net will neglect important information, too large and it will overfit on the training data), and secondly the network is unable to adapt to shifted or timewarped sequences. However, standard RNNs (by which we mean RNNs containing hidden layers of recurrently connected neurons) have limitations of their own. Firstly, since they process inputs in temporal order, their outputs tend to be mostly based on previous context (there are ways to introduce future context, such as adding a delay between the outputs and the targets; but these do not usually make full use of backwards dependencies). Secondly they are known to have difficulty learning time-dependencies more than a few timesteps long (Hochreiter et al., 2001). An elegant solution to the first problem is provided by bidirectional networks (Section II). For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (Section III).

Our experiments concentrate on framewise phoneme classification (i.e. mapping a sequence of speech frames to a sequence of phoneme labels associated with those frames). This task is both a first step towards full speech recognition (Robinson, 1994; Bourlard and Morgan, 1994), and a challenging benchmark in sequence processing. In particular, it requires the effective use of contextual information.

The contents of the rest of this paper are as follows: in Section II we discuss bidirectional networks, and answer a possible objection to their use in causal tasks; in Section III we describe the Long Short Term Memory (LSTM) network architecture, and our modification to its error gradient calculation; in Section IV we describe the experimental data and how we used it in our experiments; in Section V we give an overview of the various network architectures; in Section VI we describe how we trained (and retrained) them; in Section VII we present and discuss the experimental results, and in Section VIII we make concluding remarks. Appendix A contains the pseudocode for training LSTM networks with full gradient calculation, and Appendix B is an outline of bidirectional training with RNNs.

II. BIDIRECTIONAL RECURRENT NEURAL NETS

The basic idea of bidirectional recurrent neural nets (BRNNs) (Schuster and Paliwal, 1997; Baldi et al., 1999) is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. (In some cases a third network is used in place of the output layer, but here we have used the simpler model). This means that for every point in a given sequence, the BRNN has complete, sequential information about all points before and after it. Also, because the net is free to use as much or as little of this context as necessary, there is no need to find a (task-dependent) time-window or target delay size. In Appendix B we give an outline of the bidirectional algorithm, and Figure 1 illustrates how the forwards and reverse subnets combine to classify phonemes. BRNNs have given improved results in sequence learning tasks, notably protein structure prediction (PSP) (Baldi et al., 2001; Chen and Chaudhari, 2004) and speech processing (Schuster, 1999; Fukada et al., 1999).

A. Bidirectional Networks and Online Causal Tasks

In a spatial task like PSP, it is clear that any distinction between input directions should be discarded. But for temporal problems like speech recognition, relying on knowledge of the future seems at first sight to violate causality — at least if the task is online. How can we base our understanding of what we've heard on something that hasn't been said yet? However, human listeners do exactly that. Sounds, words, and even whole sentences that at first mean nothing are found to make sense in the light of future context. What we must remember is the distinction between tasks that are truly online - requiring an output after every input - and those where outputs are only needed at the end of some input segment. For the first class of problems BRNNs are useless, since meaningful outputs are only available after the net has run backwards. But the point is that speech recognition, along with most other 'online' causal tasks, is in the second class: an output at the end of every segment (e.g. sentence) is fine. Therefore, we see no objection to using BRNNs to gain improved performance on speech recognition tasks. On a more practical note, given the relative speed of activating neural nets, the delay incurred by running an already trained net backwards as well as forwards is small.

In general, the BRNNs examined here make the following assumptions about their input data: that it can be divided into finitely long segments, and that the effect of each of these on the others is negligible. For speech corpora like TIMIT, made up of separately recorded utterances, this is clearly the case. For real speech, the worst it can do is neglect contextual effects that extend across segment boundaries — e.g. the ends of sentences or dialogue turns. Moreover, such long term effects are routinely neglected by current speech recognition systems.
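To make the combination concrete, the following is a minimal Python/numpy sketch of the bidirectional scheme described above: one recurrent subnet reads the frames forwards, a second reads them backwards, and both feed a single shared softmax output layer, so every frame is classified with both past and future context. The simple tanh subnets and all names (rnn_states, brnn_forward, and so on) are ours for illustration; they are not the networks used in the experiments.

import numpy as np

def rnn_states(x, W_in, W_rec, b):
    """Run a simple tanh RNN over a (T, n_in) sequence; return its (T, n_hid) states."""
    states = np.zeros((x.shape[0], W_rec.shape[0]))
    h = np.zeros(W_rec.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W_in + h @ W_rec + b)
        states[t] = h
    return states

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def brnn_forward(x, fwd, bwd, W_out, b_out):
    """Bidirectional pass: the forward subnet sees x in time order, the reverse
    subnet sees x reversed; both state sequences feed one shared output layer."""
    h_f = rnn_states(x, *fwd)              # forward-in-time states
    h_b = rnn_states(x[::-1], *bwd)[::-1]  # reverse-in-time states, re-aligned to frames
    h = np.concatenate([h_f, h_b], axis=1)
    return softmax(h @ W_out + b_out)      # (T, n_classes) framewise posteriors

# Toy shapes: 26 input features per frame and 61 phoneme classes, as in the paper.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 26, 32, 61, 100
make = lambda *shape: rng.uniform(-0.1, 0.1, shape)
fwd = (make(n_in, n_hid), make(n_hid, n_hid), make(n_hid))
bwd = (make(n_in, n_hid), make(n_hid, n_hid), make(n_hid))
probs = brnn_forward(rng.standard_normal((T, n_in)), fwd, bwd,
                     make(2 * n_hid, n_out), make(n_out))
assert probs.shape == (T, n_out)

Concatenating the two state sequences before a single shared weight matrix is equivalent to giving each subnet its own connections into the common output layer.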
[Figure 1 — panels: Target, Bidirectional Output, Forward Net Only, Reverse Net Only; time axis labelled with the phoneme sequence "sil w ah n sil ow sil f ay v sil" for the utterance "one oh five".]
Fig. 1. A bidirectional LSTM net classifying the utterance "one oh five" from the Numbers95 corpus. The different lines represent the activations (or targets) of different output nodes. The bidirectional output combines the predictions of the forward and reverse subnets; it closely matches the target, indicating accurate classification. To see how the subnets work together, their contributions to the output are plotted separately ("Forward Net Only" and "Reverse Net Only"). As we might expect, the forward net is more accurate. However there are places where its substitutions ('w'), insertions (at the start of 'ow') and deletions ('f') are corrected by the reverse net. In addition, both are needed to accurately locate phoneme boundaries, with the reverse net tending to find the starts and the forward net tending to find the ends ('ay' is a good example of this).

III. LSTM

The Long Short Term Memory architecture (Hochreiter and Schmidhuber, 1997; Gers et al., 2002) was motivated by an analysis of error flow in existing RNNs (Hochreiter et al., 2001), which found that long time lags were inaccessible to existing architectures, because backpropagated error either blows up or decays exponentially.

An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer. Each one contains one or more recurrently connected memory cells and three multiplicative units - the input, output and forget gates - that provide continuous analogues of write, read and reset operations for the cells. More precisely, the input to the cells is multiplied by the activation of the input gate, the output to the net is multiplied by that of the output gate, and the previous cell values are multiplied by the forget gate. The net can only interact with the cells via the gates.

Recently, we have concentrated on applying LSTM to real world sequence processing problems. In particular, we have studied isolated word recognition (Graves et al., 2004b; Graves et al., 2004a) and continuous speech recognition (Eck et al., 2003; Beringer, 2004b).

A. LSTM Gradient Calculation

The original LSTM training algorithm (Gers et al., 2002) used an error gradient calculated with a combination of Real Time Recurrent Learning (RTRL) (Robinson and Fallside, 1987) and Back Propagation Through Time (BPTT) (Williams and Zipser, 1995). The backpropagation was truncated after one timestep, because it was felt that long time dependencies would be dealt with by the memory blocks, and not by the (vanishing) flow of backpropagated error gradient. Partly to check this assumption, and partly to ease the implementation of Bidirectional LSTM, we calculated the full error gradient for the LSTM architecture. See Appendix A for the revised pseudocode. For both bidirectional and unidirectional nets, we found that using the full gradient gave slightly higher performance than the original algorithm. It had the added benefit of making LSTM directly comparable to other RNNs, since it could now be trained with standard BPTT. Also, since the full gradient can be checked numerically, its implementation was easier to debug.
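As a concrete rendering of the gating arithmetic described above, here is a small Python/numpy sketch of one timestep of a single one-cell memory block, following the pattern of the forward-pass equations in Appendix A (multiplicative gates, recurrent "peephole" weights from the cell state). It uses tanh squashing rather than the [−2, 2] and [0, 1] ranges given in Section V, and the names are ours, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_step(x, prev_out, prev_state, p):
    """One timestep of a one-cell LSTM memory block.
    x: input frame; prev_out: previous block output; prev_state: previous cell state;
    p: dict of weight vectors w_i, w_f, w_c, w_o plus scalar peepholes pi, pf, po."""
    z = np.concatenate([x, [prev_out]])               # inputs plus recurrent block output
    i = sigmoid(z @ p["w_i"] + p["pi"] * prev_state)  # input gate
    f = sigmoid(z @ p["w_f"] + p["pf"] * prev_state)  # forget gate
    c_in = np.tanh(z @ p["w_c"])                      # squashed cell input, g(x_c)
    state = f * prev_state + i * c_in                 # s_c = y_phi s_c(tau-1) + y_iota g(x_c)
    o = sigmoid(z @ p["w_o"] + p["po"] * state)       # output gate, peephole to the new state
    return o * np.tanh(state), state                  # y_c = y_omega h(s_c)

# Toy usage over a short sequence of 26-dimensional frames.
rng = np.random.default_rng(1)
n_in = 26
p = {k: rng.uniform(-0.1, 0.1, n_in + 1) for k in ("w_i", "w_f", "w_c", "w_o")}
p.update(pi=0.1, pf=0.1, po=0.1)
out, state = 0.0, 0.0
for frame in rng.standard_normal((5, n_in)):
    out, state = lstm_block_step(frame, out, state, p)

A full LSTM layer simply repeats this for every block; what Section III-A changes is not this forward pass but how the error gradient flowing back through it is calculated.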
IV. EXPERIMENTAL DATA

The data for our experiments came from the TIMIT corpus (Garofolo et al., 1993) of prompted utterances, collected by Texas Instruments. The utterances were chosen to be phonetically rich, and the speakers represent a wide variety of American dialects. The audio data is divided into sentences, each of which is accompanied by a complete phonetic transcript. We preprocessed the audio data into 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels. We also extracted the log-energy and the first order derivatives of it and the other coefficients, giving a vector of 26 coefficients per frame. The frame size was 10 ms and the input window was 25 ms.

For consistency with the literature, we used the complete set of 61 phonemes provided in the transcriptions for classification. In full speech recognition, it is common practice to use a reduced set of phonemes (Robinson, 1991), by merging those with similar sounds, and not separating closures from stops.

A. Training and Testing Sets

The standard TIMIT corpus comes partitioned into training and test sets, containing 3696 and 1344 utterances respectively. In total there were 1,124,823 frames in the training set, and 410,920 in the test set. No speakers or sentences exist in both the training and test sets. We used 184 of the training set utterances (chosen randomly, but kept constant for all experiments) as a validation set and trained on the rest. All results for the training and test sets were recorded at the point of lowest cross-entropy error on the validation set.
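For readers who want a comparable front end, the sketch below assembles 26-dimensional frame vectors in the spirit of the preprocessing above (12 MFCCs plus log-energy, with first-order derivatives of all 13, at a 10 ms frame rate and a 25 ms window). It leans on librosa as a stand-in feature extractor; the function name, the librosa-based pipeline and the exact filterbank settings are our assumptions, not the preprocessor actually used for the experiments.

import numpy as np
import librosa

def timit_style_features(wav_path, sr=16000):
    """Return a (num_frames, 26) array: 12 MFCCs + log-energy and their deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop, win = int(0.010 * sr), int(0.025 * sr)   # 10 ms frame rate, 25 ms window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26,
                                hop_length=hop, n_fft=win)
    log_energy = np.log(librosa.feature.rms(y=y, frame_length=win,
                                            hop_length=hop) + 1e-10)
    static = np.vstack([mfcc[1:13], log_energy])  # 12 cepstra + log-energy per frame
    deltas = librosa.feature.delta(static)        # first order derivatives
    return np.vstack([static, deltas]).T          # (num_frames, 26)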
V. NETWORK ARCHITECTURES

We used the following five neural network architectures in our experiments (henceforth referred to by the abbreviations in brackets):
• Bidirectional LSTM, with two hidden LSTM layers (forwards and backwards), both containing 93 memory blocks of one cell each (BLSTM)
• Unidirectional LSTM, with one hidden LSTM layer, containing 140 one-cell memory blocks, trained backwards with no target delay, and forwards with delays from 0 to 10 frames (LSTM)
• Bidirectional RNN with two hidden layers containing 185 sigmoidal units each (BRNN)
• Unidirectional RNN with one hidden layer containing 275 sigmoidal units, trained with target delays from 0 to 10 frames (RNN)
• MLP with one hidden layer containing 250 sigmoidal units, and symmetrical time-windows from 0 to 10 frames (MLP)

All nets contained an input layer of size 26 (one for each MFCC coefficient), and an output layer of size 61 (one for each phoneme). The input layers were fully connected to the hidden layers and the hidden layers were fully connected to the output layers. For the recurrent nets, the hidden layers were also fully connected to themselves. The LSTM blocks had the following activation functions: logistic sigmoids in the range [−2, 2] for the input and output squashing functions of the cell, and in the range [0, 1] for the gates. The non-LSTM nets had logistic sigmoid activations in the range [0, 1] in the hidden layers. All units were biased.

None of our experiments with more complex network topologies (e.g. multiple hidden layers, several LSTM cells per block, direct connections between input and output layers) led to improved results.

A. Computational Complexity

The hidden layer sizes were chosen to ensure that all networks had roughly the same number of weights W (≈ 100,000). However, for the MLPs the network grew with the time-window size, and W varied between 22,061 and 152,061. For all networks, the computational complexity was dominated by the O(W) feedforward and feedback operations. This means that the bidirectional nets and the LSTM nets did not take significantly more time to train per epoch than the unidirectional nets, the RNNs, or the (equivalently sized) MLPs.

B. Range of Context

Only the bidirectional nets had access to the complete context of the frame being classified (i.e. the whole input sequence). For MLPs, the amount of context depended on the size of the time-window. The results for the MLP with no time-window (presented only with the current frame) give a baseline for performance without context information. However, some context is implicitly present in the window averaging and first-derivatives of the preprocessor.

Similarly, for unidirectional LSTM and RNN, the amount of future context depended on the size of the target delay. The results with no target delay (trained forwards or backwards) give a baseline for performance with context in one direction only.

C. Output Layers

For the output layers, we used the cross entropy error function and the softmax activation function, as is standard for 1 of K classification (Bishop, 1995). The softmax function ensures that the network outputs are all between zero and one, and that they sum to one on every timestep. This means they can be interpreted as the posterior probabilities of the phonemes at a given frame, given all the inputs up to the current one (with unidirectional nets) or all the inputs in the whole sequence (with bidirectional nets).

Several alternative error functions have been studied for this task (Chen and Jamieson, 1996). One modification in particular has been shown to have a positive effect on full speech recognition. This is to weight the error according to the duration of the current phoneme, ensuring that short phonemes are as significant to the training as longer ones. However, we recorded a slightly lower framewise classification score with BLSTM trained with this error function (see Section VII-D).
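The following is a minimal Python/numpy sketch of the output layer described in this subsection: a softmax over the 61 phoneme classes paired with the framewise cross-entropy error. It also shows why, with this pairing, the output-layer error signal reduces to y_k − t_k, the form used in the backward pass of Appendix A. The function names are ours.

import numpy as np

def softmax(a):
    """Softmax over the last axis: outputs in (0, 1) that sum to one on every frame."""
    a = a - a.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def framewise_cross_entropy(logits, targets):
    """Cross-entropy summed over a (T, 61) sequence, with integer phoneme targets (T,).
    The softmax outputs can be read as per-frame posteriors over the phoneme classes."""
    probs = softmax(logits)
    T = logits.shape[0]
    return -np.log(probs[np.arange(T), targets] + 1e-12).sum()

def output_deltas(logits, targets):
    """With softmax plus cross-entropy, the output deltas are simply y_k - t_k."""
    probs = softmax(logits)
    return probs - np.eye(logits.shape[1])[targets]

logits = np.random.default_rng(3).standard_normal((4, 61))   # 4 frames, 61 classes
targets = np.array([0, 7, 7, 60])
loss, deltas = framewise_cross_entropy(logits, targets), output_deltas(logits, targets)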
VI. NETWORK TRAINING

For all architectures, we calculated the full error gradient using online BPTT (BPTT truncated to the lengths of the utterances), and trained the weights using gradient descent with momentum. We kept the same training parameters for all experiments: initial weights randomised in the range [−0.1, 0.1], a learning rate of 10⁻⁵ and a momentum of 0.9. At the end of each utterance, weight updates were carried out and network activations were reset to 0.

Keeping the training algorithm and parameters constant allowed us to concentrate on the effect of varying the architecture. However it is possible that different training methods would be better suited to different networks.

A. Retraining

For the experiments with varied time-windows or target delays, we iteratively retrained the networks instead of starting again from scratch. For example, for LSTM with a target delay of 2, we first trained with delay 0, then took the best net and retrained it (without resetting the weights) with delay 1, then retrained again with delay 2. To find the best networks, we retrained the LSTM nets for 5 epochs at each iteration, the RNN nets for 10, and the MLPs for 20. It is possible that longer retraining times would have given improved results. For the retrained MLPs, we had to add extra (randomised) weights from the input layers, since the input size grew with the time-window.

Although primarily a means to reduce training time, we have also found that retraining improves final performance (Graves et al., 2005; Beringer, 2004a). Indeed, the best result in this paper was achieved by retraining (on the BLSTM net trained with a weighted error function, then retrained with normal cross-entropy error). The benefits presumably come from escaping the local minima that gradient descent algorithms tend to get caught in.
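As a concrete rendering of the per-utterance update just described, the short Python/numpy sketch below applies gradient descent with momentum using the paper's settings (initial weights in [−0.1, 0.1], learning rate 10⁻⁵, momentum 0.9). The random stand-in gradients and the function name are ours, not the authors' training code.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(-0.1, 0.1, size=1000)   # initial weights randomised in [-0.1, 0.1]
velocity = np.zeros_like(weights)
lr, momentum = 1e-5, 0.9

def momentum_step(weights, grads, velocity, lr, momentum):
    """One update per utterance: delta_w(S) = -lr * grad(S) + momentum * delta_w(S-1)."""
    velocity = -lr * grads + momentum * velocity
    return weights + velocity, velocity

# Per-utterance loop: accumulate the gradient over the utterance with BPTT
# (a random stand-in here), update the weights once, then reset activations.
for utterance_grad in rng.standard_normal((3, weights.size)):
    weights, velocity = momentum_step(weights, utterance_grad, velocity, lr, momentum)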

[Figure 2 — panels: Targets, BLSTM, BLSTM Duration Weighted Error, BRNN, MLP 10 Frame Time-Window; time axis labelled with the phoneme sequence "q ae dx ah w ix n dcl d ow sil" for the excerpt "at a window".]
Fig. 2. The best exemplars of each architecture classifying the excerpt "at a window" from an utterance in the TIMIT database. In general, the networks found the vowels more difficult (here, "ix" is confused with "ih", "ah" with "ax" and "axr", and "ae" with "eh") than the consonants (e.g. "w" and "n"), which in English are more distinct. For BLSTM, the net with duration weighted error tends to do better on short phones (e.g. the closure and stop "dcl" and "d"), and worse on longer ones ("ow"), as expected. Note the more jagged trajectories for the MLP net (e.g. for "q" and "ow"); this is presumably because it has no recurrency to smooth the outputs.

VII. RESULTS

TABLE I
FRAMEWISE PHONEME CLASSIFICATION ON THE TIMIT DATABASE: BIDIRECTIONAL LSTM

Network               Training Set Score   Test Set Score   Epochs
BLSTM (1)             77.0%                69.7%            20
BLSTM (2)             77.9%                70.1%            21
BLSTM (3)             77.3%                69.9%            20
BLSTM (4)             77.8%                69.8%            22
BLSTM (5)             77.1%                69.4%            19
BLSTM (6)             77.8%                69.8%            21
BLSTM (7)             76.7%                69.9%            18
mean                  77.4%                69.8%            20.1
standard deviation    0.5%                 0.2%             1.3

Table I contains the outcomes of 7 randomly initialised training runs with BLSTM. For the rest of the paper, we use their mean as the result for BLSTM. The standard deviation in the test set scores (0.2%) gives an indication of significant difference in network performance.

The last three entries in Table II come from the papers indicated (note that Robinson did not quote framewise classification scores; the result for his network was recorded by Schuster, using the original software). The rest are from our own experiments. For the MLP, RNN and LSTM nets we give the best results, and those achieved with least contextual information (i.e. with no target delay / time-window). The number of epochs includes both training and retraining.
TABLE II
FRAMEWISE PHONEME CLASSIFICATION ON THE TIMIT DATABASE: MAIN RESULTS

Network                                  Training Set   Test Set   Epochs
BLSTM (retrained)                        78.6%          70.2%      17
BLSTM                                    77.4%          69.8%      20.1
BRNN                                     76.0%          69.0%      170
BLSTM (Weighted Error)                   75.7%          68.9%      15
LSTM (5 frame delay)                     77.6%          66.0%      34
RNN (3 frame delay)                      71.0%          65.2%      139
LSTM (backwards, 0 frame delay)          71.1%          64.7%      15
LSTM (0 frame delay)                     70.9%          64.6%      15
RNN (0 frame delay)                      69.9%          64.5%      120
MLP (10 frame time-window)               67.6%          63.1%      990
MLP (no time-window)                     53.6%          51.4%      835
RNN (Chen and Jamieson, 1996)            69.9%          74.2%      -
RNN (Robinson, 1994; Schuster, 1999)     70.6%          65.3%      -
BRNN (Schuster, 1999)                    72.1%          65.1%      -

There are some differences between the results quoted in this paper and in our previous work (Graves and Schmidhuber, 2005). The most significant of these is the improved score we achieved here with the bidirectional RNN (69.0% instead of 64.7%). Previously we had stopped the BRNN after 65 epochs, when it appeared to have converged; here, however, we let it run for 225 epochs (10 times as long as LSTM), and kept the best net on the validation set, after 170 epochs. As can be seen from Figure 4, the learning curves for the non-LSTM networks are very slow, and contain several sections where the error temporarily increases, making it difficult to know when training should be stopped.

The results for the unidirectional LSTM and RNN nets are also better here; this is probably due to our use of larger networks, and the fact that we retrained between different target delays. Again it should be noted that at the moment we do not have an optimal method for choosing retraining times.

A. Comparison Between LSTM and Other Architectures

The most obvious difference between LSTM and the RNN and MLP nets was the training time (see Figure 4). In particular, the BRNN took more than 8 times as long to converge as BLSTM, despite having more or less equal computational complexity per time-step (see Section V-A). There was a similar time increase between the unidirectional LSTM and RNN nets, and the MLPs were slower still (990 epochs for the best MLP result).

The training time of 17 epochs for our most accurate network (retrained BLSTM) is remarkably fast, needing just a few hours on an ordinary desktop computer. Elsewhere we have seen figures of between 40 and 120 epochs quoted for RNN convergence on this task, usually with more advanced training algorithms than the one used here.

A possible explanation of why RNNs took longer to train than LSTM on this task is that they require more fine-tuning of their weights to make use of the contextual information, since their error signals tend to decay after a few timesteps. A detailed analysis of the evolution of the weights would be required to check this.

As well as being faster, the LSTM nets were also slightly more accurate. Although the final difference in score between BLSTM and BRNN on this task is small (0.8%), the results in Table I strongly suggest that it is significant. The fact that the difference is not larger could mean that long time dependencies (more than 10 timesteps or so) are not very helpful to this task.

It is interesting to note how much more prone to overfitting LSTM was than standard RNNs. For LSTM, after only 15-20 epochs the performance on the validation and test sets would begin to fall, while that on the training set would continue to rise (the highest score we recorded on the training set with BLSTM was 86.4%, and still improving). With the RNNs, on the other hand, we never observed a large drop in test set score. This suggests a difference in the way the two architectures learn. Given that in the TIMIT corpus no speakers or sentences are shared by the training and test sets, it is possible that LSTM's overfitting was partly caused by its better adaptation to long range regularities (such as phoneme ordering within words, or speaker specific pronunciations) than normal RNNs. If this is true, we would expect a greater distinction between the two architectures on tasks with more training data.

B. Comparison with Previous Work

Overall BLSTM outperformed any neural network we found in the literature on this task, apart from the RNN used by Chen and Jamieson. Their result (which we were unable to approach with standard RNNs) is surprising as they quote a substantially higher score on the test set than the training set: all other methods reported here were better on the training than the test set, as expected.

In general, it is difficult to compare with previous work on this task, owing to the many variations in training data (different preprocessing, different subsets of the TIMIT corpus, different target representations) and experimental method (different learning algorithms, error functions, network sizes etc). This is why we reimplemented all the architectures ourselves.

C. Effect of Increased Context

As is clear from Figure 3, networks with access to more contextual information tended to get better results. In particular, the bidirectional networks were substantially better than the unidirectional ones. For the unidirectional nets, note that LSTM benefits more from longer target delays than RNNs; this could be due to LSTM's greater facility with long timelags, allowing it to make use of the extra context without suffering as much from having to remember previous inputs.

Interestingly, LSTM with no time delay returns almost identical results whether trained forwards or backwards. This suggests that the context in both directions is equally important. However, with bidirectional nets, the forward subnet usually dominates the outputs (see Figure 1).
For the MLPs, performance increased with time-window size, and it appears that even larger windows would have been desirable. However, with fully connected networks, the number of weights required for such large input layers makes training prohibitively slow.

[Figure 3 — "Framewise Phoneme Classification Scores": % frames correctly classified versus target delay / window size, one curve per architecture (BLSTM Retrained, BLSTM, BRNN, BLSTM Weighted Error, LSTM, RNN, MLP).]
Fig. 3. Framewise phoneme classification results for all networks on the TIMIT test set. The number of frames of introduced context (time-window size for MLPs, target delay size for unidirectional LSTM and RNNs) is plotted along the x axis. Therefore the results for the bidirectional nets (clustered around 70%) are plotted at x=0.

[Figure 4 — "Learning Curves for Three Architectures": % frames correctly classified versus training epochs, training and test curves for BLSTM, BRNN and the MLP.]
Fig. 4. Learning curves for BLSTM, BRNN and MLP with no time-window. For all experiments, LSTM was much faster to converge than either the RNN or MLP architectures.
D. Weighted Error

The experiment with a weighted error function gave slightly inferior framewise performance for BLSTM (68.9%, compared to 69.7%). However, the purpose of this weighting is to improve overall phoneme recognition, rather than framewise classification (see Section V-C). As a measure of its success, if we assume a perfect knowledge of the test set segmentation (which in real-life situations we cannot), and integrate the network outputs over each phoneme, then BLSTM with weighted errors gives a phoneme correctness of 74.4%, compared to 71.2% with normal errors.
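The paper does not spell out the weighting formula, so the Python/numpy sketch below is only one plausible reading of Sections V-C and VII-D: each frame's cross-entropy is scaled by the inverse duration of the phoneme segment it belongs to, so that short phonemes contribute as much to the total error as long ones. All names are ours.

import numpy as np

def duration_weighted_cross_entropy(probs, targets, segment_ids):
    """probs: (T, K) softmax outputs; targets: (T,) phoneme indices;
    segment_ids: (T,) index of the phoneme segment each frame belongs to."""
    T = probs.shape[0]
    frame_loss = -np.log(probs[np.arange(T), targets] + 1e-12)
    durations = np.bincount(segment_ids)[segment_ids]   # per-frame segment length
    return (frame_loss / durations).sum()                # each segment weighs the same

# Example: a 6-frame stretch containing one short and one long phoneme segment.
rng = np.random.default_rng(2)
probs = rng.dirichlet(np.ones(61), size=6)               # valid per-frame posteriors
targets = np.array([5, 5, 17, 17, 17, 17])
segment_ids = np.array([0, 0, 1, 1, 1, 1])
loss = duration_weighted_cross_entropy(probs, targets, segment_ids)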
VIII. CONCLUSION AND FUTURE WORK

In this paper we have compared bidirectional LSTM to other neural network architectures on the task of framewise phoneme classification. We have found that bidirectional networks are significantly more effective than unidirectional ones, and that LSTM is much faster to train than standard RNNs and MLPs, and also slightly more accurate. We conclude that bidirectional LSTM is an architecture well suited to this and other speech processing tasks, where context is vitally important.

In the future we would like to apply BLSTM to full speech recognition, for example as part of a hybrid RNN / Hidden Markov Model system.

APPENDIX A: PSEUDOCODE FOR FULL GRADIENT LSTM

The following pseudocode details the forward pass, backward pass, and weight updates of an extended LSTM layer in a multi-layer net. The error gradient is calculated with online BPTT (i.e. BPTT truncated to the lengths of input sequences, with weight updates after every sequence). As is standard with BPTT, the network is unfolded over time, so that connections arriving at layers are viewed as coming from the previous timestep. We have tried to make it clear which equations are LSTM specific, and which are part of the standard BPTT algorithm. Note that for the LSTM equations, the order of execution is important.

Notation

The input sequence over which the training takes place is labelled S and it runs from time τ0 to τ1. x_k(τ) refers to the network input to unit k at time τ, and y_k(τ) to its activation. Unless stated otherwise, all network inputs, activations and partial derivatives are evaluated at time τ — e.g. y_c ≡ y_c(τ). E(τ) refers to the (scalar) output error of the net at time τ. The training target for output unit k at time τ is denoted t_k(τ). N is the set of all units in the network, including input and bias units, that can be connected to other units. Note that this includes LSTM cell outputs, but not LSTM gates or internal states (whose activations are only visible within their own memory blocks). w_ij is the weight from unit j to unit i.
The LSTM equations are given for a single memory block only. The generalisation to multiple blocks is trivial: simply repeat the calculations for each block, in any order. Within each block, we use the suffixes ι, φ and ω to refer to the input gate, forget gate and output gate respectively. The suffix c refers to an element of the set of cells C. s_c is the state value of cell c — i.e. its value after the input and forget gates have been applied. f is the squashing function of the gates, and g and h are respectively the cell input and output squashing functions.

Forward Pass

• Reset all activations to 0.
• Running forwards from time τ0 to time τ1, feed in the inputs and update the activations. Store all hidden layer and output activations at every timestep.
• For each LSTM block, the activations are updated as follows:

Input Gates:
  x_ι = Σ_{j∈N} w_ιj y_j(τ−1) + Σ_{c∈C} w_ιc s_c(τ−1)
  y_ι = f(x_ι)

Forget Gates:
  x_φ = Σ_{j∈N} w_φj y_j(τ−1) + Σ_{c∈C} w_φc s_c(τ−1)
  y_φ = f(x_φ)

Cells:
  ∀c ∈ C:  x_c = Σ_{j∈N} w_cj y_j(τ−1)
           s_c = y_φ s_c(τ−1) + y_ι g(x_c)

Output Gates:
  x_ω = Σ_{j∈N} w_ωj y_j(τ−1) + Σ_{c∈C} w_ωc s_c(τ)
  y_ω = f(x_ω)

Cell Outputs:
  ∀c ∈ C:  y_c = y_ω h(s_c)
Backward Pass

• Reset all partial derivatives to 0.
• Starting at time τ1, propagate the output errors backwards through the unfolded net, using the standard BPTT equations for a softmax output layer and the cross-entropy error function:

  define δ_k(τ) = ∂E(τ)/∂x_k
  δ_k(τ) = y_k(τ) − t_k(τ),  k ∈ output units

• For each LSTM block the δ's are calculated as follows:

Cell Outputs:
  ∀c ∈ C:  define ϵ_c = Σ_{j∈N} w_jc δ_j(τ+1)

Output Gates:
  δ_ω = f′(x_ω) Σ_{c∈C} ϵ_c h(s_c)

States:
  ∂E/∂s_c (τ) = ϵ_c y_ω h′(s_c) + ∂E/∂s_c (τ+1) y_φ(τ+1) + δ_ι(τ+1) w_ιc + δ_φ(τ+1) w_φc + δ_ω w_ωc

Cells:
  ∀c ∈ C:  δ_c = y_ι g′(x_c) ∂E/∂s_c

Forget Gates:
  δ_φ = f′(x_φ) Σ_{c∈C} (∂E/∂s_c) s_c(τ−1)

Input Gates:
  δ_ι = f′(x_ι) Σ_{c∈C} (∂E/∂s_c) g(x_c)

• Using the standard BPTT equation, accumulate the δ's to get the partial derivatives of the cumulative sequence error:

  define E_total(S) = Σ_{τ=τ0..τ1} E(τ)
  define ∇_ij(S) = ∂E_total(S)/∂w_ij
  ⟹  ∇_ij(S) = Σ_{τ=τ0+1..τ1} δ_i(τ) y_j(τ−1)

Update Weights

• After the presentation of sequence S, with learning rate α and momentum m, update all weights with the standard equation for gradient descent with momentum:

  Δw_ij(S) = −α ∇_ij(S) + m Δw_ij(S−1)

APPENDIX B: ALGORITHM OUTLINE FOR BIDIRECTIONAL RECURRENT NEURAL NETWORKS

We quote the following method for training bidirectional recurrent nets with BPTT (Schuster, 1999). As above, training takes place over an input sequence running from time τ0 to τ1. All network activations and errors are reset to 0 at τ0 and τ1.

Forward Pass: Feed all input data for the sequence into the BRNN and determine all predicted outputs.
• Do forward pass just for forward states (from time τ0 to τ1) and backward states (from time τ1 to τ0).
• Do forward pass for output layer.

Backward Pass: Calculate the error function derivative for the sequence used in the forward pass.
• Do backward pass for output neurons.
• Do backward pass just for forward states (from time τ1 to τ0) and backward states (from time τ0 to τ1).

Update Weights
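Section III-A notes that the full gradient can be checked numerically, which is what made the implementation easier to debug. The snippet below is a generic central-difference check of that kind, in Python/numpy; it is our illustration, not the authors' code. In practice loss_fn would evaluate the cumulative sequence error E_total(S) of Appendix A (or the BRNN error of Appendix B) with one weight perturbed at a time.

import numpy as np

def numerical_gradient(loss_fn, weights, eps=1e-6):
    """Central-difference estimate of dE/dw for every weight."""
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Toy check against a known analytic gradient: E(w) = 0.5 * ||w||^2, so dE/dw = w.
w = np.linspace(-1.0, 1.0, 5)
estimate = numerical_gradient(lambda v: 0.5 * np.dot(v, v), w)
assert np.allclose(estimate, w, atol=1e-4)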
ACKNOWLEDGMENTS

The authors would like to thank Nicole Beringer for her expert advice on linguistics and speech recognition. This work was supported by the SNF under grant number 200020-100249.
REFERENCES

Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., and Soda, G. (2001). Bidirectional dynamics for protein secondary structure prediction. Lecture Notes in Computer Science, 1828:80–104.
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. BIOINF: Bioinformatics, 15.
Beringer, N. (2004a). Human language acquisition in a machine learning task. Proc. ICSLP.
Beringer, N. (2004b). Human language acquisition methods in a machine learning task. In Proceedings of the 8th International Conference on Spoken Language Processing, pages 2233–2236.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, Inc.
Bourlard, H. and Morgan, N. (1994). Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers.
Chen, J. and Chaudhari, N. S. (2004). Capturing long-term dependencies for protein secondary structure prediction. In Yin, F., Wang, J., and Guo, C., editors, Advances in Neural Networks - ISNN 2004, International Symposium on Neural Networks, Part II, volume 3174 of Lecture Notes in Computer Science, pages 494–500, Dalian, China. Springer.
Chen, R. and Jamieson, L. (1996). Experiments on the implementation of recurrent neural networks for speech phone recognition. In Proceedings of the Thirtieth Annual Asilomar Conference on Signals, Systems and Computers, pages 779–782.
Eck, D., Graves, A., and Schmidhuber, J. (2003). A new approach to continuous speech recognition using LSTM recurrent neural networks. Technical Report IDSIA-14-03, IDSIA, www.idsia.ch/techrep.html.
Fukada, T., Schuster, M., and Sagisaka, Y. (1999). Phoneme boundary estimation using bidirectional recurrent neural networks and its applications. Systems and Computers in Japan, 30(4):20–30.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., and Dahlgren, N. L. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM.
Gers, F., Schraudolph, N., and Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143.
Graves, A., Beringer, N., and Schmidhuber, J. (2004a). A comparison between spiking and differentiable recurrent neural networks on spoken digit recognition. In The 23rd IASTED International Conference on Modelling, Identification, and Control, Grindelwald.
Graves, A., Beringer, N., and Schmidhuber, J. (2005). Rapid retraining on speech data with LSTM recurrent networks. Technical Report IDSIA-09-05, IDSIA, www.idsia.ch/techrep.html.
Graves, A., Eck, D., Beringer, N., and Schmidhuber, J. (2004b). Biologically plausible speech recognition with LSTM neural nets. In First International Workshop on Biologically Inspired Approaches to Advanced Information Technology, Lausanne.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the 2005 International Joint Conference on Neural Networks, Montreal, Canada.
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F., editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
Robinson, A. J. (1991). Several improvements to a recurrent error propagation network phone recognition system. Technical Report CUED/F-INFENG/TR82, University of Cambridge.
Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2):298–305.
Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Schuster, M. (1999). On supervised learning from sequential data with applications for speech recognition. PhD thesis, Nara Institute of Science and Technology, Kyoto, Japan.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681.
Williams, R. J. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin, Y. and Rumelhart, D. E., editors, Back-propagation: Theory, Architectures and Applications, pages 433–486. Lawrence Erlbaum Publishers, Hillsdale, N.J.
