Imperial DL Course 2022 RNN Notes
This is a common form for the loss function in many sequential modelling tasks such as video/audio
sequence reconstruction. The derivative of the loss function w.r.t. $\theta$ is $\frac{d}{d\theta} L(\theta) = \sum_{t=1}^{T} \frac{d}{d\theta} L(y_t)$, therefore it remains to compute $\frac{d}{d\theta} L(y_t)$ for $\theta = \{W_h, W_x, W_y, b_h, b_y\}$.
Deriving these derivatives requires the chain rule, which will be explained next.
Figure 1: Visualising Back-propagation through time (BPTT) without truncation. The black arrows
show forward pass computations, while the red arrows show the gradient back-propagation in order
to compute ∇Wh L(yt ).
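To complement Figure 1, below is a minimal numerical sketch of BPTT for a vanilla RNN. The tanh activation, the squared loss and all dimensions are assumptions for illustration; the notes keep $\phi_h$ and $L$ generic.

```python
import numpy as np

# Assumed model for this sketch:
#   h_t = tanh(W_h h_{t-1} + W_x x_t + b_h),  y_t = W_y h_t + b_y,
#   L(y_t) = 0.5 * ||y_t - target_t||^2.
rng = np.random.default_rng(0)
D_h, D_x, D_y, T = 4, 3, 2, 5
W_h = rng.normal(size=(D_h, D_h)) * 0.5
W_x = rng.normal(size=(D_h, D_x)) * 0.5
W_y = rng.normal(size=(D_y, D_h)) * 0.5
b_h, b_y = np.zeros(D_h), np.zeros(D_y)
xs, targets = rng.normal(size=(T, D_x)), rng.normal(size=(T, D_y))

# Forward pass (black arrows in Figure 1): store all hidden states for BPTT.
hs = [np.zeros(D_h)]
for t in range(T):
    hs.append(np.tanh(W_h @ hs[-1] + W_x @ xs[t] + b_h))

# Backward pass (red arrows in Figure 1): accumulate dL/dW_h = sum_t dL(y_t)/dW_h,
# propagating each dL(y_t)/dW_h back from step t to step 1.
dW_h = np.zeros_like(W_h)
for t in range(1, T + 1):
    y_t = W_y @ hs[t] + b_y
    grad_h = W_y.T @ (y_t - targets[t - 1])      # dL(y_t)/dh_t
    for l in range(t, 0, -1):
        pre_act = grad_h * (1.0 - hs[l] ** 2)    # tanh'(a_l) = 1 - tanh(a_l)^2
        dW_h += np.outer(pre_act, hs[l - 1])     # contribution through h_l
        grad_h = W_h.T @ pre_act                 # push gradient back to h_{l-1}
print(dW_h)
```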
Figure 2: Visualising the gradient step with/without gradient clipping. Source: Goodfellow et al. [2016].
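A minimal sketch of gradient clipping by global norm, one common way to implement the idea in Figure 2; the threshold value is an arbitrary choice here.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    This mirrors the clipped gradient step in Figure 2: the update direction is
    kept, only its length is shrunk when it is too large.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```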
Now consider a simple case where ϕh (·) is an identity mapping so that ϕ′h (·) = 1. We further
assume the hidden states have scalar values, i.e. $\dim(h_t) = 1$. Then we have $\prod_{l=\tau}^{t-1} \frac{dh_{l+1}}{dh_l} = (W_h^\top)^{t-\tau}$, which can vanish or explode when $t - \tau$ is large, depending on whether $|W_h| < 1$ or not. In the general case where $W_h$ is a matrix, depending on whether the largest singular value (i.e. the spectral norm) of $W_h$ is smaller or larger than 1, the spectral norm of $\prod_{l=\tau}^{t-1} \frac{dh_{l+1}}{dh_l} = (W_h^\top)^{t-\tau}$ will vanish or explode as $t - \tau$ increases. When $\phi_h(\cdot)$ is selected as the sigmoid function or the hyperbolic tangent function, the gradient vanishing problem can still happen. Take the hyperbolic tangent function as an example: when the entries of $h_t$ are close to $\pm 1$, then $\phi'_h(\cdot) \approx 0$, i.e. the derivative is saturated. Multiplying several such saturated derivatives together also leads to the gradient vanishing problem.
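To make the argument concrete, the snippet below checks the scalar case with illustrative values of $W_h$ on either side of 1, and the matrix case with a randomly generated $W_h$ rescaled so its largest singular value is below 1.

```python
import numpy as np

# Scalar case with identity phi_h: prod_{l=tau}^{t-1} dh_{l+1}/dh_l = W_h^(t-tau).
for W_h in (0.9, 1.1):                   # illustrative values on either side of 1
    for gap in (10, 50, 100):            # gap = t - tau
        print(f"W_h={W_h}, t-tau={gap}: product = {W_h ** gap:.3e}")
# W_h = 0.9 shrinks towards 0 (vanishing), W_h = 1.1 blows up (exploding).

# Matrix case: when the largest singular value of W_h is below 1, the spectral
# norm of (W_h^T)^(t-tau) decays towards 0 as t - tau grows.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 8))
W_h /= 1.2 * np.linalg.norm(W_h, 2)      # rescale so the largest singular value is ~0.83
print("largest singular value:", np.linalg.norm(W_h, 2))
print("spectral norm of (W_h^T)^100:", np.linalg.norm(np.linalg.matrix_power(W_h.T, 100), 2))
```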
to control the error flows; in detail, the computation proceeds as follows (with $\sigma(\cdot)$ denoting the sigmoid function):
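The full gate equations are not reproduced in this excerpt; as a stand-in, here is a minimal sketch of one forward step of a standard LSTM cell (Hochreiter and Schmidhuber [1997]), assuming the common parametrisation in which each gate has its own weights acting on the concatenation $[h_{t-1}; x_t]$. The exact convention used in the notes may differ in minor details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell (a sketch).

    params holds W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o, each weight matrix
    acting on the concatenation [h_{t-1}; x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])    # candidate cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde                      # new cell state
    h_t = o_t * np.tanh(c_t)                                # new hidden state
    return h_t, c_t
```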
*Gradient computation
Readers are encouraged to derive the gradient of $L(y_t)$ with respect to $\theta$ themselves. Specifically, for the recurrent weight matrix $W_c$, computing the derivative $\frac{dL(y_t)}{dW_c}$ requires the following terms:
(21)
$$\frac{dc_t}{dh_{t-1}} = c_{t-1} \odot \frac{df_t}{dh_{t-1}} + \tilde{c}_t \odot \frac{di_t}{dh_{t-1}} + i_t \odot \frac{d\tilde{c}_t}{dh_{t-1}}. \qquad (22)$$
This means that computing $\frac{dc_t}{dW_c}$ requires computing
$$\prod_{l=\tau}^{t-1} \frac{dc_{l+1}}{dc_l} = \prod_{l=\tau}^{t-1} \left[ f_{l+1} + o_l \odot \frac{d\tanh(c_l)}{dc_l} \odot \frac{dc_{l+1}}{dh_l} \right]$$
for all $\tau = 1, \ldots, t$. There is no guarantee that this term will not vanish or explode; however, the usage of forget gates makes the issue less severe. To see this, notice that by expanding the product term above, it contains terms proportional to $f_{i+1} \odot \prod_{l=\tau}^{i} o_l \odot \frac{dc_{l+1}}{dh_l}$ for $i = \tau+1, \ldots, t-1$. So if in the forward pass the network sets $f_{i+1} \to 0$ (i.e. forgetting the previous cell state), then this will also likely bring $f_{i+1} \odot \prod_{l=\tau}^{i} o_l \odot \frac{dc_{l+1}}{dh_l} \approx 0$, which helps cope with the gradient explosion problem. On the other hand, $\frac{dc_t}{dW_c}$ also contains terms proportional to $\prod_{l=\tau+1}^{i} f_l \odot o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau}$ for $i = \tau+1, \ldots, t-1$. This means that if the network sets $f_l \to 1$ for $l = \tau+1, \ldots, i$ (i.e. maintaining the cell state until at least time $t = i$), then it is likely that $\prod_{l=\tau+1}^{i} f_l \odot o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau} \approx o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau}$, which helps prevent the gradient information at time $\tau$ from vanishing when $o_\tau \to 1$, and is thus helpful for learning longer-term dependencies.
longer term dependencies. The gradients dW dct
c
dot
and dW c
also require computing products of dodhi+1 i
and
dci+1
dhi terms, and analogous analysis can be done for those product terms. It is worth emphasising
again that LSTM does NOT solve the gradient vanishing/explosion problem completely, however
empirical evidences have shown that it is easier for LSTMs to learn longer term dependencies when
compared with the simple RNN.
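As a loose numerical illustration of the argument above, the snippet below uses scalar stand-ins for the per-step terms; the values are arbitrary and only meant to show the qualitative effect of forget gates close to 1 versus forget gates that are half-closed.

```python
import numpy as np

# Scalar stand-in for prod_l [ f_{l+1} + o_l * tanh'(c_l) * dc_{l+1}/dh_l ].
rng = np.random.default_rng(0)
steps = 50
other_term = 0.05 * rng.normal(size=steps)   # stand-in for o_l * tanh'(c_l) * dc_{l+1}/dh_l
for f in (0.99, 0.5):                        # forget gates mostly open vs half-closed
    prod = np.prod(f + other_term)
    print(f"forget gate ~ {f}: product over {steps} steps = {prod:.3e}")
# With f ~ 0.99 the product stays of order 1, so gradient information survives
# across many steps; with f ~ 0.5 it decays roughly like 0.5^50 ~ 1e-15.
```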
The network parameters are then $\theta = \{W_z, W_r, W_h, b_z, b_r, b_h\}$. We see here that $z_t$ acts as the update gate, which determines how much of the current information is incorporated into the hidden states, and $r_t$ is named the reset gate, which also impacts the maintenance of the historical information.
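A minimal sketch of the GRU forward step (Cho et al. [2014]) using the parameters $\theta = \{W_z, W_r, W_h, b_z, b_r, b_h\}$ above. The convention of concatenating $[h_{t-1}; x_t]$ and the exact placement of the reset gate are assumptions here, since implementations differ in these details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One forward step of a GRU cell (a sketch)."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])     # update gate
    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])     # reset gate
    h_tilde = np.tanh(params["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + params["b_h"])
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde              # convex combination via update gate
    return h_t
```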
$$p_\theta(y_l | y_{<l}, v) = p_\theta(y_l | h^d_l, c^d_l), \quad h^d_l, c^d_l = \mathrm{LSTM}^{dec}_\theta(y_{<l}), \qquad (28)$$
and the decoder LSTM has its internal recurrent states $h^d_0, c^d_0$ initialised using the input sequence representation $v = \mathrm{enc}(x_{1:T})$. The encoder also uses an LSTM, meaning that $v$ is constructed from the encoder LSTM's recurrent states $h^e_t, c^e_t$ (e.g. the final states $h^e_T, c^e_T$).
Figure 3: Visualising the forward pass of Seq2Seq model in train (left) and test (right) time.
Regarding the prediction of the first output $y_1$, either $p_\theta(y_1|x_{1:T})$ can be produced using the last recurrent states of the encoder LSTM (i.e. $p_\theta(y_1|x_{1:T}) = p_\theta(y_1|h^e_T, c^e_T)$), or we can add a "start of sentence" token as $y_0$ and compute the probability vector for $y_1$ using the LSTM decoder: $p_\theta(y_1|x_{1:T}) = p_\theta(y_1|y_0, v)$. This model is named the sequence-to-sequence model, or Seq2Seq model for short, and was proposed by Sutskever et al. [2014].
The decoder LSTM forward passes at training and test time are different, as visualised in Figure 3. In training, since maximum likelihood training requires evaluating $p_\theta(y_l|y_{<l}, v)$ for the output sequences $y_{1:L}$ from the dataset, the inputs to the decoder LSTM are the words in the data label sequence, and the output of the LSTM is the probability vector for the current word, which is then used to compute the MLE objective (i.e. the negative cross-entropy). At test time, however, there is no ground-truth output sequence provided, therefore the input to the decoder LSTM at step $l$ is the predicted word $y_{l-1} \sim p_\theta(y_{l-1}|y_{<l-1}, v)$. In practice prediction is done by e.g. beam search rather than naive sequential sampling [Sutskever et al., 2014].
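The contrast between the two passes can be sketched as follows. Greedy decoding stands in for beam search for brevity, and `decoder_step`, `embed` and the start/end-of-sentence token handling are placeholders, not notation from the notes.

```python
import numpy as np

def train_decode(decoder_step, embed, y_true, state):
    """Training-time pass (teacher forcing): feed the ground-truth previous word."""
    log_lik = 0.0
    prev = embed("<sos>")                      # hypothetical start-of-sentence token
    for y_l in y_true:
        probs, state = decoder_step(prev, state)
        log_lik += np.log(probs[y_l])          # log p_theta(y_l | y_<l, v)
        prev = embed(y_l)                      # ground truth, not the model's prediction
    return log_lik                             # MLE objective for this sequence

def test_decode(decoder_step, embed, state, max_len=50, eos_id=1):
    """Test-time pass: feed back the model's own prediction (greedy decoding)."""
    output, prev = [], embed("<sos>")
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)
        y_l = int(np.argmax(probs))            # greedy choice; beam search in practice
        if y_l == eos_id:
            break
        output.append(y_l)
        prev = embed(y_l)
    return output
```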
In NLP applications such as machine translation, both $x_{1:T}$ and $y_{1:L}$ are sequences of words, which cannot be directly processed by neural networks. Instead, each $x_t$ ($y_l$) needs to be mapped to a real-valued vector before feeding it to the encoder (decoder) LSTM. A naive approach is to use one-hot encoding: assuming that the input sentence is in English and we have a sorted English vocabulary of size $V$, each $x_t$ is then represented as a length-$V$ indicator vector whose only non-zero entry is at the index of the corresponding word in the vocabulary.
This is clearly inefficient since the English vocabulary has tens of thousands of words. Instead, it is recommended to map the words to their word embeddings using word2vec [Mikolov et al., 2013a] or GloVe [Pennington et al., 2014], which have much smaller dimension. More importantly, semantics are preserved to some extent in these word embeddings, e.g. it has been shown that vector arithmetic results like "emb(king) - emb(male) + emb(female) ≈ emb(queen)" hold approximately for word2vec embeddings.
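A toy contrast between the two representations; the vocabulary and the embedding matrix below are illustrative stand-ins, not real word2vec/GloVe vectors.

```python
import numpy as np

vocab = ["the", "king", "queen", "male", "female"]      # toy vocabulary, size V = 5
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Length-V indicator vector: inefficient when V is tens of thousands."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Embedding lookup: a dense V x d matrix of learned (or pretrained) vectors, d << V.
d = 8
emb_matrix = np.random.default_rng(0).normal(size=(len(vocab), d))

def emb(word):
    return emb_matrix[word_to_idx[word]]

print(one_hot("king").shape, emb("king").shape)         # (5,) vs (8,)
```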
In many NLP applications the decoder output is a probability vector $p_\theta(y_l|y_{<l}, v)$ which specifies the predictive probability of each word in the vocabulary. When the vocabulary is large (which is often the case), applying softmax to obtain the probability vector can be computationally very expensive. Interested readers can check e.g. hierarchical softmax [Mikolov et al., 2013a] and negative sampling [Mikolov et al., 2013b] for solutions that mitigate this issue.
1.4 *Generative models for sequences
To generate sequential data such as video, text and audio, one needs to build a generative model
pθ (x1:T ) and train it with e.g. (approximate) maximum likelihood. In the following we discuss two
types of latent variable models that are often used in sequence generation tasks.
If the variational lower bound is used for training, then this also requires an approximate posterior $q_\phi(z|x_{1:T})$ to be optimised:
$$\phi^*, \theta^* = \arg\max_{\phi, \theta} \mathcal{L}(\phi, \theta), \quad \mathcal{L}(\phi, \theta) = \mathbb{E}_{p_{data}(x_{1:T})}\Big[\underbrace{\mathbb{E}_{q_\phi(z|x_{1:T})}[\log p_\theta(x_{1:T}|z)] - \mathrm{KL}[q_\phi(z|x_{1:T})\,||\,p(z)]}_{:=\mathcal{L}(x_{1:T}, \phi, \theta)}\Big]. \qquad (31)$$
Now it remains to define the encoder and decoder distributions, such that they can process sequences of any length. This can be achieved using e.g. LSTMs to define an auto-regressive decoder:
$$p_\theta(x_{1:T}|z) = \prod_{t=1}^{T} p_\theta(x_t|x_{<t}, z), \quad p_\theta(x_1|x_{<1}, z) = p_\theta(x_1|z), \qquad (32)$$
with the distributional parameters of $p_\theta(x_t|x_{<t}, z)$ defined by $\mathrm{LSTM}_\theta(x_{<t})$, which has its recurrent states $h_0, c_0$ initialised using $z$. For the encoder, LSTMs can also be used to process the input:
$$q_\phi(z|x_{1:T}) = \mathcal{N}(z; \mu_\phi(x_{1:T}), \mathrm{diag}(\sigma^2_\phi(x_{1:T}))), \quad \mu_\phi(x_{1:T}), \log \sigma_\phi(x_{1:T}) = \mathrm{LSTM}_\phi(x_{1:T}). \qquad (33)$$
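Putting (31)-(33) together, a one-sample Monte Carlo estimate of $\mathcal{L}(x_{1:T}, \phi, \theta)$ can be sketched as below. The functions `lstm_encode`, `init_decoder_state` and `lstm_decode_step` are placeholders standing in for the LSTMs above, and a standard normal prior $p(z) = \mathcal{N}(0, I)$ is assumed so that the KL term has a closed form.

```python
import numpy as np

def elbo_single_sequence(x_seq, lstm_encode, init_decoder_state, lstm_decode_step, rng):
    """One-sample estimate of L(x_{1:T}, phi, theta) in (31), as a sketch."""
    mu, log_sigma = lstm_encode(x_seq)                        # encoder, as in (33)
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)    # reparameterisation trick
    # Analytic KL between N(mu, diag(sigma^2)) and the standard normal prior p(z).
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)
    # Auto-regressive reconstruction term: sum_t log p_theta(x_t | x_<t, z), as in (32).
    log_lik, state, x_prev = 0.0, init_decoder_state(z), None  # h_0, c_0 initialised from z
    for x_t in x_seq:
        log_prob_fn, state = lstm_decode_step(x_prev, state)
        log_lik += log_prob_fn(x_t)
        x_prev = x_t
    return log_lik - kl
```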
State-space models
State-space models assume that for every observation xt at time t, there is a latent variable zt
that generates it, and the sequence dynamic model is defined in the latent space rather than in the
observation space. In detail, a prior dynamic model is assumed on the transitions of the latent states
zt , often in an auto-regressive way:
$$p_\theta(z_{1:T}) = \prod_{t=1}^{T} p_\theta(z_t|z_{<t}), \quad p_\theta(z_1|z_{<1}) = p_\theta(z_1). \qquad (34)$$
The observation $x_t$ at time $t$ is assumed to be conditionally dependent on $z_t$ only, and this conditional distribution is also called the emission model:
$$p_\theta(x_t|z_{1:T}) = p_\theta(x_t|z_t). \qquad (35)$$
Combining both definitions, we have the sequence generative model defined as
$$p_\theta(x_{1:T}) = \int \prod_{t=1}^{T} p_\theta(x_t|z_t) p_\theta(z_t|z_{<t}) \, dz_{1:T}. \qquad (36)$$
A variational lower-bound objective for training this state-space model requires an approximate posterior distribution $q_\phi(z_{1:T}|x_{1:T})$:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{p_{data}(x_{1:T})}\Big[\mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}[\log p_\theta(x_{1:T}|z_{1:T})] - \mathrm{KL}[q_\phi(z_{1:T}|x_{1:T})\,||\,p_\theta(z_{1:T})]\Big]$$
$$= \mathbb{E}_{p_{data}(x_{1:T})}\Big[\mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}\Big[\sum_{t=1}^{T} \log p_\theta(x_t|z_t)\Big] - \mathrm{KL}[q_\phi(z_{1:T}|x_{1:T})\,||\,p_\theta(z_{1:T})]\Big]. \qquad (37)$$
Different from the image generation case, the prior distribution $p_\theta(z_{1:T})$ now also has learnable parameters in $\theta$, so in this case it is less appropriate to view this KL term as a "regulariser". The expanded expression for the variational lower bound depends on the definition of the encoder distribution $q_\phi(z_{1:T}|x_{1:T})$. The simplest solution is to use a factorised approximate posterior
$$q_\phi(z_{1:T}|x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t|x_{\leq t}), \qquad (38)$$
$$p_\theta(z_t|z_{<t}) = p_\theta(z_t|h^p_t, c^p_t), \quad h^p_t, c^p_t = \mathrm{LSTM}_\theta(z_{<t}). \qquad (40)$$
This means the previous latent states $z_{<t}$ are summarised by the LSTM internal recurrent states $h^p_t$ and $c^p_t$, which are then transformed into the distributional parameters of $p_\theta(z_t|z_{<t})$. The factorised encoder distribution can also be defined using an LSTM:
$$q_\phi(z_t|x_{\leq t}) = q_\phi(z_t|h^q_t, c^q_t), \quad h^q_t, c^q_t = \mathrm{LSTM}_\phi(x_{\leq t}). \qquad (41)$$
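To make the generative direction concrete, the following sketch performs ancestral sampling from (34)-(36), with the LSTM prior of (40) and the emission model represented by placeholder functions; `prior_lstm_step`, `emit` and `init_state` are assumptions rather than notation from the notes, and Gaussian transitions are assumed for illustration.

```python
import numpy as np

def sample_sequence(T, prior_lstm_step, emit, init_state, rng):
    """Ancestral sampling from the state-space model (34)-(36), as a sketch.

    prior_lstm_step(z_prev, state) returns ((mu, sigma), new_state), giving the
    Gaussian parameters of p_theta(z_t | z_<t) as in (40); emit(z_t) returns the
    mean of the emission model p_theta(x_t | z_t) in (35).
    """
    zs, xs = [], []
    state, z_prev = init_state, None
    for _ in range(T):
        (mu, sigma), state = prior_lstm_step(z_prev, state)
        z_t = mu + sigma * rng.normal(size=mu.shape)    # z_t ~ p_theta(z_t | z_<t)
        x_t = emit(z_t)                                 # emission mean (noise could be added)
        zs.append(z_t)
        xs.append(x_t)
        z_prev = z_t
    return zs, xs
```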
References
Arjovsky, M., Shah, A., and Bengio, Y. (2016). Unitary evolution recurrent neural networks. In
International Conference on Machine Learning, pages 1120–1128.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating
sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computa-
tional Natural Language Learning, pages 10–21. Association for Computational Linguistics.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.
Fabius, O. and van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint
arXiv:1412.6581.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.
deeplearningbook.org.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.
Le, Q. V., Jaitly, N., and Hinton, G. E. (2015). A simple way to initialize recurrent networks of
rectified linear units. arXiv preprint arXiv:1504.00941.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representa-
tions in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representa-
tions of words and phrases and their compositionality. Advances in neural information processing
systems, 26:3111–3119.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representa-
tion. In Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pages 1532–1543.
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of
learning in deep linear neural networks. In International Conference on Learning Representations.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Advances in neural information processing systems, 27:3104–3112.