
Chapter 4

Long Short-Term Memory

As discussed in the previous chapter, an important benefit of recurrent neural
networks is their ability to use contextual information when mapping between
input and output sequences. Unfortunately, for standard RNN architectures, the
range of context that can be accessed in practice is quite limited. The problem
is that the influence of a given input on the hidden layer, and therefore on the
network output, either decays or blows up exponentially as it cycles around the
network’s recurrent connections. This effect is often referred to in the literature
as the vanishing gradient problem (Hochreiter, 1991; Hochreiter et al., 2001a;
Bengio et al., 1994). The vanishing gradient problem is illustrated schematically
in Figure 4.1.
Numerous attempts were made in the 1990s to address the problem of
vanishing gradients for RNNs. These included non-gradient based training
algorithms, such as simulated annealing and discrete error propagation (Ben-
gio et al., 1994), explicitly introduced time delays (Lang et al., 1990; Lin
et al., 1996; Plate, 1993) or time constants (Mozer, 1992), and hierarchical
sequence compression (Schmidhuber, 1992). The approach favoured by this
book is the Long Short-Term Memory (LSTM) architecture (Hochreiter and
Schmidhuber, 1997).
This chapter reviews the background material for LSTM. Section 4.1 describes
the basic structure of LSTM and explains how it tackles the vanishing gradient
problem. Section 4.2 discusses the effect of preprocessing on long range
dependencies. Section 4.3 discusses an approximate and an exact algorithm for
calculating the LSTM error gradient. Section 4.4 describes some enhancements
to the basic LSTM architecture. Section 4.6 provides all the equations required
to train and apply LSTM networks.

4.1 Network Architecture


The LSTM architecture consists of a set of recurrently connected subnets,
known as memory blocks. These blocks can be thought of as a differentiable
version of the memory chips in a digital computer. Each block contains one or
more self-connected memory cells and three multiplicative units—the input,

output and forget gates—that provide continuous analogues of write, read
and reset operations for the cells.

Fig. 4.1 The vanishing gradient problem for RNNs. The shading of the
nodes in the unfolded network indicates their sensitivity to the inputs at time one
(the darker the shade, the greater the sensitivity). The sensitivity decays over
time as new inputs overwrite the activations of the hidden layer, and the network
‘forgets’ the first inputs.
Figure 4.2 provides an illustration of an LSTM memory block with a sin-
gle cell. An LSTM network is the same as a standard RNN, except that the
summation units in the hidden layer are replaced by memory blocks, as illus-
trated in Fig. 4.3. LSTM blocks can also be mixed with ordinary summation
units, although this is typically not necessary. The same output layers can
be used for LSTM networks as for standard RNNs.
The multiplicative gates allow LSTM memory cells to store and access
information over long periods of time, thereby mitigating the vanishing gra-
dient problem. For example, as long as the input gate remains closed (i.e.
has an activation near 0), the activation of the cell will not be overwrit-
ten by the new inputs arriving in the network, and can therefore be made
available to the net much later in the sequence, by opening the output gate.
The preservation over time of gradient information by LSTM is illustrated in
Figure 4.4.
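
To make this concrete, the following minimal numerical sketch steps a single
memory cell by hand, with the gate activations chosen directly rather than
computed from weights. It is an illustration of the behaviour shown in
Figure 4.4, not code from the book.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One memory cell stepped by hand: the input gate is held shut and the forget
# gate held open, so the state written at an earlier timestep survives.
s = 3.7                                      # cell state written at some earlier timestep
for t in range(100):
    input_gate = sigmoid(-10.0)              # ~0: new inputs are shut out
    forget_gate = sigmoid(10.0)              # ~1: the previous state is retained
    candidate = np.tanh(np.random.randn())   # whatever arrives at the cell input
    s = forget_gate * s + input_gate * candidate

print(s)  # still approximately 3.7: the cell 'remembers' the first input
# Opening the output gate at a later timestep exposes h(s) to the rest of the net.
```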
Over the past decade, LSTM has proved successful at a range of synthetic
tasks requiring long range memory, including learning context free languages
(Gers and Schmidhuber, 2001), recalling high precision real numbers over
extended noisy sequences (Hochreiter and Schmidhuber, 1997) and various
tasks requiring precise timing and counting (Gers et al., 2002). In particular,
it has solved several artificial problems that remain impossible with any other
RNN architecture.
Additionally, LSTM has been applied to various real-world problems, such
as protein secondary structure prediction (Hochreiter et al., 2007; Chen and
Chaudhari, 2005), music generation (Eck and Schmidhuber, 2002), reinforcement
learning (Bakker, 2002), speech recognition (Graves and Schmidhuber, 2005b;
Graves et al., 2006) and handwriting recognition (Liwicki et al., 2007; Graves
et al., 2008). As would be expected, its advantages are most pronounced for
problems requiring the use of long range contextual information.

Fig. 4.2 LSTM memory block with one cell. The three gates are nonlinear
summation units that collect activations from inside and outside the block, and
control the activation of the cell via multiplications (small black circles). The input
and output gates multiply the input and output of the cell while the forget gate
multiplies the cell’s previous state. No activation function is applied within the
cell. The gate activation function ‘f’ is usually the logistic sigmoid, so that the gate
activations are between 0 (gate closed) and 1 (gate open). The cell input and output
activation functions (‘g’ and ‘h’) are usually tanh or logistic sigmoid, though in some
cases ‘h’ is the identity function. The weighted ‘peephole’ connections from the cell
to the gates are shown with dashed lines. All other connections within the block
are unweighted (or equivalently, have a fixed weight of 1.0). The only outputs from
the block to the rest of the network emanate from the output gate multiplication.

Fig. 4.3 An LSTM network. The network consists of four input units, a hidden
layer of two single-cell LSTM memory blocks and five output units. Not all
connections are shown. Note that each block has four inputs but only one output.

Fig. 4.4 Preservation of gradient information by LSTM. As in Figure 4.1, the
shading of the nodes indicates their sensitivity to the inputs at time one; in this case
the black nodes are maximally sensitive and the white nodes are entirely insensitive.
The states of the input, forget, and output gates are displayed below, to the left of,
and above the hidden layer respectively. For simplicity, all gates are either entirely
open (‘O’) or closed (‘—’). The memory cell ‘remembers’ the first input as long as the
forget gate is open and the input gate is closed. The sensitivity of the output layer
can be switched on and off by the output gate without affecting the cell.

4.2 Influence of Preprocessing


The above discussion raises an important point about the influence of pre-
processing. If we can find a way to transform a task containing long range
contextual dependencies into one containing only short-range dependencies
before presenting it to a sequence learning algorithm, then architectures such
as LSTM become somewhat redundant. For example, a raw speech signal
typically has a sampling rate of over 40 kHz. Clearly, a great many timesteps
would have to be spanned by a sequence learning algorithm attempting to
label or model an utterance presented in this form. However, when the signal
is first transformed into a 100 Hz series of mel-frequency cepstral coefficients,
it becomes feasible to model the data using an algorithm whose contextual
range is relatively short, such as a hidden Markov model.
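
As a rough illustration of the numbers involved, the sketch below converts a
waveform into a 100 Hz MFCC series using librosa; the library choice, sampling
rate and frame parameters are assumptions for this example, not details from
the book.

```python
import numpy as np
import librosa

sr = 48_000                       # raw speech sampled at over 40 kHz
y = np.random.randn(2 * sr)       # stand-in for a two-second utterance

# A hop of sr // 100 samples (10 ms) yields roughly 100 feature frames per
# second, so the sequence learner must span ~480x fewer timesteps than it
# would with raw samples.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=1024, hop_length=sr // 100)

print(y.shape[0])      # 96000 raw samples
print(mfcc.shape[1])   # ~200 MFCC frames
```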
Nonetheless, if such a transform is difficult or unknown, or if we simply
wish to get a good result without having to design task-specific preprocessing
methods, algorithms capable of handling long time dependencies are essential.

4.3 Gradient Calculation


Like the networks discussed in the last chapter, LSTM is a differentiable func-
tion approximator that is typically trained with gradient descent. Recently,
non gradient-based training methods of LSTM have also been considered
(Wierstra et al., 2005; Schmidhuber et al., 2007), but they are outside the
scope of this book.
The original LSTM training algorithm (Hochreiter and Schmidhuber, 1997)
used an approximate error gradient calculated with a combination of Real
Time Recurrent Learning (RTRL; Robinson and Fallside, 1987) and Back-
propagation Through Time (BPTT; Williams and Zipser, 1995). The BPTT
part was truncated after one timestep, because it was felt that long time
dependencies would be dealt with by the memory blocks, and not by the
(vanishing) flow of activation around the recurrent connections. Truncating
the gradient has the benefit of making the algorithm completely online, in
the sense that weight updates can be made after every timestep. This is
an important property for tasks such as continuous control or time-series
prediction.
However, it is also possible to calculate the exact LSTM gradient with
untruncated BPTT (Graves and Schmidhuber, 2005b). As well as being more
accurate than the truncated gradient, the exact gradient has the advantage of
being easier to debug, since it can be checked numerically using the technique
described in Section 3.1.4.1. Only the exact gradient is used in this book,
and the equations for it are provided in Section 4.6.
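
The numerical check referred to above amounts to comparing the analytic
gradient with symmetrical finite differences of the loss. A minimal sketch,
with loss_fn and grad_fn as placeholders for whatever network implementation
is being tested:

```python
import numpy as np

def check_gradient(loss_fn, grad_fn, weights, eps=1e-5):
    """Compare an analytic gradient against symmetrical finite differences.

    loss_fn(weights) -> scalar loss; grad_fn(weights) -> analytic gradient.
    Both are placeholders, not functions defined in the book.
    """
    analytic = grad_fn(weights)
    numeric = np.zeros_like(weights)
    for i in range(weights.size):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        numeric.flat[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    # The relative error should be tiny (e.g. < 1e-6) if the gradient code is right.
    return np.max(np.abs(analytic - numeric) /
                  (np.abs(analytic) + np.abs(numeric) + 1e-12))
```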

4.4 Architectural Variants


In its original form, LSTM contained only input and output gates. The
forget gates (Gers et al., 2000), along with additional peephole weights (Gers
et al., 2002) connecting the gates to the memory cell were added later to give
extended LSTM (Gers, 2001). The purpose of the forget gates was to provide
a way for the memory cells to reset themselves, which proved important for
tasks that required the network to ‘forget’ previous inputs. The peephole
connections, meanwhile, improved the LSTM’s ability to learn tasks that
require precise timing and counting of the internal states.
Since LSTM is entirely composed of simple multiplication and summation
units, and connections between them, it is straightforward to create further
variants of the block architecture. Indeed it has been shown that alternative
structures with equally good performance on toy problems such as learning
context-free and context-sensitive languages can be evolved automatically
(Bayer et al., 2009). However the standard extended form appears to be a
good general purpose structure for sequence labelling, and is used exclusively
in this book.

4.5 Bidirectional Long Short-Term Memory


Using LSTM as the network architecture in a bidirectional recurrent neural
network (Section 3.2.4) yields bidirectional LSTM (Graves and Schmidhuber,
2005a,b; Chen and Chaudhari, 2005; Thireou and Reczko, 2007). Bidirec-
tional LSTM provides access to long range context in both input directions,
and will be used extensively in later chapters.
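
Structurally, bidirectional LSTM just runs one LSTM layer over the input
sequence forwards and another backwards, and concatenates their outputs at
each timestep. A minimal sketch, where forward_layer and backward_layer are
placeholders for any sequence-to-sequence layer (they are not functions defined
in the book):

```python
import numpy as np

def bidirectional_lstm(x_seq, forward_layer, backward_layer):
    """Bidirectional LSTM: one layer reads the sequence forwards, another
    reads it backwards, and their outputs are concatenated per timestep.

    forward_layer / backward_layer: any callables mapping an input sequence
    of shape (T, I) to an output sequence of shape (T, H).
    """
    fwd = forward_layer(x_seq)                  # processes t = 1 .. T
    bwd = backward_layer(x_seq[::-1])[::-1]     # processes t = T .. 1, then re-aligned
    return np.concatenate([fwd, bwd], axis=-1)  # (T, 2H): context from both directions
```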

4.6 Network Equations


This section provides the equations for the activation (forward pass) and
BPTT gradient calculation (backward pass) of an LSTM hidden layer within
a recurrent neural network.
As before, $w_{ij}$ is the weight of the connection from unit $i$ to unit $j$, the
network input to unit $j$ at time $t$ is denoted $a_j^t$ and the activation of unit $j$ at
time $t$ is $b_j^t$. The LSTM equations are given for a single memory block only.
For multiple blocks the calculations are simply repeated for each block, in
any order. The subscripts $\iota$, $\phi$ and $\omega$ refer respectively to the input gate,
forget gate and output gate of the block. The subscript $c$ refers to one of
the $C$ memory cells. The peephole weights from cell $c$ to the input, forget
and output gates are denoted $w_{c\iota}$, $w_{c\phi}$ and $w_{c\omega}$ respectively. $s_c^t$ is the state of
cell $c$ at time $t$ (i.e. the activation of the linear cell unit). $f$ is the activation
function of the gates, and $g$ and $h$ are respectively the cell input and output
activation functions.
Let $I$ be the number of inputs, $K$ be the number of outputs and $H$ be
the number of cells in the hidden layer. Note that only the cell outputs $b_c^t$
are connected to the other blocks in the layer. The other LSTM activations,
such as the states, the cell inputs, or the gate activations, are only visible
within the block. We use the index $h$ to refer to cell outputs from other
blocks in the hidden layer, exactly as for standard hidden units. As with
standard RNNs the forward pass is calculated for a length $T$ input sequence
$x$ by starting at $t = 1$ and recursively applying the update equations while
incrementing $t$, and the BPTT backward pass is calculated by starting at
$t = T$, and recursively calculating the unit derivatives while decrementing $t$
to one (see Section 3.2 for details). The final weight derivatives are found by
summing over the derivatives at each timestep, as expressed in Eqn. (3.35).
Recall that
$$\delta_j^t \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial a_j^t} \qquad (4.1)$$

where L is the loss function used for training.


The order in which the equations are calculated during the forward and
backward passes is important, and should proceed as specified below. As with
standard RNNs, all states and activations are initialised to zero at t = 0, and
all δ terms are zero at t = T + 1.

4.6.1 Forward Pass


Input Gates

$$a_\iota^t = \sum_{i=1}^{I} w_{i\iota}\, x_i^t + \sum_{h=1}^{H} w_{h\iota}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota}\, s_c^{t-1} \qquad (4.2)$$

$$b_\iota^t = f(a_\iota^t) \qquad (4.3)$$

Forget Gates

$$a_\phi^t = \sum_{i=1}^{I} w_{i\phi}\, x_i^t + \sum_{h=1}^{H} w_{h\phi}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi}\, s_c^{t-1} \qquad (4.4)$$

$$b_\phi^t = f(a_\phi^t) \qquad (4.5)$$

Cells

$$a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1} \qquad (4.6)$$

$$s_c^t = b_\phi^t\, s_c^{t-1} + b_\iota^t\, g(a_c^t) \qquad (4.7)$$

Output Gates

$$a_\omega^t = \sum_{i=1}^{I} w_{i\omega}\, x_i^t + \sum_{h=1}^{H} w_{h\omega}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega}\, s_c^t \qquad (4.8)$$

$$b_\omega^t = f(a_\omega^t) \qquad (4.9)$$

Cell Outputs

$$b_c^t = b_\omega^t\, h(s_c^t) \qquad (4.10)$$
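
For concreteness, the NumPy sketch below implements equations (4.2)–(4.10) for
a single memory block with $C$ cells. It is an illustrative implementation under
simplifying assumptions, not the book's code: the weight dictionary layout is
invented, $g$ and $h$ default to tanh, and the recurrent terms use the block's own
cell outputs as the hidden-layer activations $b_h$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_forward(x, W, g=np.tanh, h=np.tanh):
    """Forward pass of one LSTM memory block with C cells, eqs. (4.2)-(4.10).

    x: input sequence of shape (T, I).
    W: dict of weight arrays (layout assumed for this sketch):
       W['xi'], W['xf'], W['xo']: (I,) input weights to the three gates,
       W['hi'], W['hf'], W['ho']: (C,) recurrent weights from the cell outputs,
       W['pi'], W['pf'], W['po']: (C,) peephole weights,
       W['xc']: (I, C) and W['hc']: (C, C) weights to the cell inputs.
    Returns cell outputs b_c (T, C) and cell states s_c (T, C).
    """
    T = x.shape[0]
    C = W['xc'].shape[1]
    b = np.zeros((T + 1, C))    # cell outputs b_c^t; row 0 is the t = 0 initialisation
    s = np.zeros((T + 1, C))    # cell states s_c^t
    for t in range(1, T + 1):
        xt, b_prev, s_prev = x[t - 1], b[t - 1], s[t - 1]
        b_i = sigmoid(xt @ W['xi'] + b_prev @ W['hi'] + W['pi'] @ s_prev)  # (4.2)-(4.3) input gate
        b_f = sigmoid(xt @ W['xf'] + b_prev @ W['hf'] + W['pf'] @ s_prev)  # (4.4)-(4.5) forget gate
        a_c = xt @ W['xc'] + b_prev @ W['hc']                              # (4.6) cell inputs
        s[t] = b_f * s_prev + b_i * g(a_c)                                 # (4.7) cell states
        b_o = sigmoid(xt @ W['xo'] + b_prev @ W['ho'] + W['po'] @ s[t])    # (4.8)-(4.9) output gate
        b[t] = b_o * h(s[t])                                               # (4.10) cell outputs
    return b[1:], s[1:]
```

Note that when the input gate activation is near zero and the forget gate near
one, equation (4.7) reduces to $s_c^t \approx s_c^{t-1}$, which is the preservation
behaviour illustrated in Figure 4.4.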

4.6.2 Backward Pass

$$\epsilon_c^t \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial b_c^t} \qquad\qquad \epsilon_s^t \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial s_c^t}$$

Cell Outputs

$$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{h=1}^{H} w_{ch}\, \delta_h^{t+1} \qquad (4.11)$$

Output Gates

$$\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t \qquad (4.12)$$

States

$$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1}\, \epsilon_s^{t+1} + w_{c\iota}\, \delta_\iota^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t \qquad (4.13)$$

Cells

$$\delta_c^t = b_\iota^t\, g'(a_c^t)\, \epsilon_s^t \qquad (4.14)$$

Forget Gates

$$\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t \qquad (4.15)$$

Input Gates

$$\delta_\iota^t = f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t \qquad (4.16)$$
