Long Short-Term Memory
A. Graves: Supervised Sequence Labelling with Recurrent Neural Networks, SCI 385, pp. 37–45.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2012
Fig. 4.1 The vanishing gradient problem for RNNs. The shading of the
nodes in the unfolded network indicates their sensitivity to the inputs at time one
(the darker the shade, the greater the sensitivity). The sensitivity decays over
time as new inputs overwrite the activations of the hidden layer, and the network
‘forgets’ the first inputs.
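The decay sketched in Fig. 4.1 can be made concrete with a toy scalar RNN (our own illustrative construction, not from the text): by the chain rule, the sensitivity of the hidden state at time t to the state at time one is a product of per-step factors w · tanh′(a), and when each factor has magnitude below one the product shrinks geometrically.

```python
import math

def sensitivity(w_rec, steps):
    """Sensitivity dh_t/dh_1 for a scalar tanh RNN h_t = tanh(w_rec*h_{t-1} + x_t).

    The chain rule gives dh_t/dh_1 as a product of factors w_rec * tanh'(a_t).
    Since tanh' <= 1, any |w_rec| < 1 makes the product shrink geometrically --
    the fading shade in Fig. 4.1.
    """
    h, grad = 0.0, 1.0
    for _ in range(steps):
        a = w_rec * h + 0.1              # small constant input (illustrative)
        h = math.tanh(a)
        grad *= w_rec * (1.0 - h * h)    # tanh'(a) = 1 - tanh(a)^2
    return grad

for t in (1, 5, 20):
    print(t, sensitivity(0.5, t))
```

After twenty steps the sensitivity is already negligible, which is exactly the behaviour LSTM was designed to avoid.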
Fig. 4.2 LSTM memory block with one cell. The three gates are nonlinear
summation units that collect activations from inside and outside the block, and
control the activation of the cell via multiplications (small black circles). The input
and output gates multiply the input and output of the cell while the forget gate
multiplies the cell’s previous state. No activation function is applied within the
cell. The gate activation function ‘f’ is usually the logistic sigmoid, so that the gate
activations are between 0 (gate closed) and 1 (gate open). The cell input and output
activation functions (‘g’ and ‘h’) are usually tanh or logistic sigmoid, though in some
cases ‘h’ is the identity function. The weighted ‘peephole’ connections from the cell
to the gates are shown with dashed lines. All other connections within the block
are unweighted (or equivalently, have a fixed weight of 1.0). The only outputs from
the block to the rest of the network emanate from the output gate multiplication.
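The block described in the caption can be sketched as code. This is a minimal single-cell step assuming logistic-sigmoid gates, tanh for both 'g' and 'h', and scalar peephole weights; the function and argument names are ours, not the book's.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_block_step(a_inp, a_in_gate, a_forget, a_out_gate, s_prev,
                    w_peep_in=0.0, w_peep_forget=0.0, w_peep_out=0.0):
    """One step of the single-cell memory block of Fig. 4.2.

    The a_* arguments stand in for the summed weighted activations each unit
    collects from inside and outside the block. Gates use the logistic
    sigmoid f; the cell input/output squashing functions g and h are tanh.
    The dashed peephole weights connect the cell state to the three gates.
    """
    b_in = sigmoid(a_in_gate + w_peep_in * s_prev)         # input gate
    b_forget = sigmoid(a_forget + w_peep_forget * s_prev)  # forget gate
    s = b_forget * s_prev + b_in * math.tanh(a_inp)        # state: no activation inside cell
    b_out = sigmoid(a_out_gate + w_peep_out * s)           # output gate sees the new state
    return b_out * math.tanh(s), s                         # block output, new state

# Forget gate driven wide open, input gate shut: the stored state persists.
out, s = lstm_block_step(a_inp=1.0, a_in_gate=-20.0, a_forget=20.0,
                         a_out_gate=0.0, s_prev=3.0)
```

With these gate settings the new state stays essentially equal to the previous state of 3.0, illustrating how the multiplicative gates let the cell retain information indefinitely.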
Fig. 4.3 An LSTM network. The network consists of four input units, a hid-
den layer of two single-cell LSTM memory blocks and five output units. Not all
connections are shown. Note that each block has four inputs but only one output.
Input Gates

a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}   (4.2)

b_\iota^t = f(a_\iota^t)   (4.3)
Forget Gates

a_\phi^t = \sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}   (4.4)

b_\phi^t = f(a_\phi^t)   (4.5)
Cells

a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1}   (4.6)

s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t)   (4.7)
Output Gates

a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t   (4.8)

b_\omega^t = f(a_\omega^t)   (4.9)
Cell Outputs

b_c^t = b_\omega^t h(s_c^t)   (4.10)

Backward Pass

\epsilon_c^t \overset{\text{def}}{=} \frac{\partial L}{\partial b_c^t} \qquad \epsilon_s^t \overset{\text{def}}{=} \frac{\partial L}{\partial s_c^t}
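The forward-pass equations above can be sketched for a single-cell block (H = C = 1) with I inputs. The weight-dictionary keys and helper functions below are our own illustrative layout, not notation from the book:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def lstm_forward(xs, W):
    """Forward pass of one single-cell memory block over a sequence xs.

    Follows the forward-pass equations for input gate, forget gate, cell,
    output gate and cell output. W maps hypothetical keys ('x_iota',
    'h_iota', 'c_iota', ...) to input-weight lists and scalar recurrent /
    peephole weights; the layout is ours, chosen for readability.
    """
    b_prev, s_prev = 0.0, 0.0    # b_h^{t-1} and s_c^{t-1} start at zero
    outputs = []
    for x in xs:
        # input gate, with peephole to the *previous* state s_c^{t-1}
        b_iota = sigmoid(dot(W['x_iota'], x) + W['h_iota'] * b_prev + W['c_iota'] * s_prev)
        # forget gate, likewise peeking at s_c^{t-1}
        b_phi = sigmoid(dot(W['x_phi'], x) + W['h_phi'] * b_prev + W['c_phi'] * s_prev)
        # cell input and state update: s = forget * old state + input gate * g(a_c)
        a_c = dot(W['x_c'], x) + W['h_c'] * b_prev
        s = b_phi * s_prev + b_iota * math.tanh(a_c)
        # output gate peeks at the *current* state s_c^t
        b_omega = sigmoid(dot(W['x_omega'], x) + W['h_omega'] * b_prev + W['c_omega'] * s)
        # cell output: output gate times h(s)
        b = b_omega * math.tanh(s)
        outputs.append(b)
        b_prev, s_prev = b, s
    return outputs

W = {k: ([0.1, -0.2] if k.startswith('x_') else 0.1) for k in (
    'x_iota', 'h_iota', 'c_iota', 'x_phi', 'h_phi', 'c_phi',
    'x_c', 'h_c', 'x_omega', 'h_omega', 'c_omega')}
ys = lstm_forward([[1.0, 0.5], [0.0, -1.0]], W)
```

Because the output gate lies in (0, 1) and h is tanh, the block output is always bounded in (−1, 1), whatever the stored state.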
Cell Outputs

\epsilon_c^t = \sum_{k=1}^{K} w_{ck} \delta_k^t + \sum_{h=1}^{H} w_{ch} \delta_h^{t+1}   (4.11)
Output Gates

\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t) \epsilon_c^t   (4.12)
States

\epsilon_s^t = b_\omega^t h'(s_c^t) \epsilon_c^t + b_\phi^{t+1} \epsilon_s^{t+1} + w_{c\iota} \delta_\iota^{t+1} + w_{c\phi} \delta_\phi^{t+1} + w_{c\omega} \delta_\omega^t   (4.13)

Cells

\delta_c^t = b_\iota^t g'(a_c^t) \epsilon_s^t   (4.14)
Forget Gates

\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1} \epsilon_s^t   (4.15)
Input Gates

\delta_\iota^t = f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t) \epsilon_s^t   (4.16)
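As a sanity check on the backward pass, the input-gate derivative implied by (4.16) can be compared against a finite difference of the state update (4.7). This sketch fixes a single cell with ε_s = ∂L/∂s = 1 (i.e. the loss is the state itself) and a frozen forget-gate activation; the setup and values are our own, chosen only for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def state(a_iota, a_c, s_prev, b_phi):
    """s_c^t from eq. (4.7): forget gate * old state + input gate * g(a_c)."""
    return b_phi * s_prev + sigmoid(a_iota) * math.tanh(a_c)

# Analytic input-gate delta from eq. (4.16) for one cell with epsilon_s = 1:
# delta_iota = f'(a_iota) * g(a_c), using f'(a) = f(a)(1 - f(a)) for the sigmoid.
a_iota, a_c, s_prev, b_phi = 0.3, -0.7, 2.0, 0.9
f = sigmoid(a_iota)
delta_iota = f * (1.0 - f) * math.tanh(a_c)

# Central finite difference of the same derivative through eq. (4.7):
eps = 1e-6
numeric = (state(a_iota + eps, a_c, s_prev, b_phi)
           - state(a_iota - eps, a_c, s_prev, b_phi)) / (2 * eps)
```

The two values agree to well within the finite-difference error, confirming that (4.16) is the exact derivative of the state update with respect to the input-gate pre-activation.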