Imperial DL Course 2022 RNN Notes
This is a common form for the loss function in many sequential modelling tasks such as video/audio
sequence reconstruction. The derivative of the loss function w.r.t. $\theta$ is $\frac{d}{d\theta} L(\theta) = \sum_{t=1}^{T} \frac{d}{d\theta} L(y_t)$, therefore it remains to compute $\frac{d}{d\theta} L(y_t)$ for $\theta = \{W_h, W_x, W_y, b_h, b_y\}$.
Deriving these derivatives requires the chain rule, which will be explained next.
Figure 1: Visualising Back-propagation through time (BPTT) without truncation. The black arrows
show forward pass computations, while the red arrows show the gradient back-propagation in order
to compute ∇Wh L(yt ).
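To complement Figure 1, below is a minimal numerical sketch of BPTT for a vanilla RNN. The tanh activation, the squared loss and all dimensions are assumptions for illustration; the notes keep $\phi_h$ and $L$ generic.

```python
import numpy as np

# Assumed model for this sketch:
#   h_t = tanh(W_h h_{t-1} + W_x x_t + b_h),  y_t = W_y h_t + b_y,
#   L(y_t) = 0.5 * ||y_t - target_t||^2.
rng = np.random.default_rng(0)
D_h, D_x, D_y, T = 4, 3, 2, 5
W_h = rng.normal(size=(D_h, D_h)) * 0.5
W_x = rng.normal(size=(D_h, D_x)) * 0.5
W_y = rng.normal(size=(D_y, D_h)) * 0.5
b_h, b_y = np.zeros(D_h), np.zeros(D_y)
xs, targets = rng.normal(size=(T, D_x)), rng.normal(size=(T, D_y))

# Forward pass (black arrows in Figure 1): store all hidden states for BPTT.
hs = [np.zeros(D_h)]
for t in range(T):
    hs.append(np.tanh(W_h @ hs[-1] + W_x @ xs[t] + b_h))

# Backward pass (red arrows in Figure 1): accumulate dL/dW_h = sum_t dL(y_t)/dW_h,
# propagating each dL(y_t)/dW_h back from step t to step 1.
dW_h = np.zeros_like(W_h)
for t in range(1, T + 1):
    y_t = W_y @ hs[t] + b_y
    grad_h = W_y.T @ (y_t - targets[t - 1])      # dL(y_t)/dh_t
    for l in range(t, 0, -1):
        pre_act = grad_h * (1.0 - hs[l] ** 2)    # tanh'(a_l) = 1 - tanh(a_l)^2
        dW_h += np.outer(pre_act, hs[l - 1])     # contribution through h_l
        grad_h = W_h.T @ pre_act                 # push gradient back to h_{l-1}
print(dW_h)
```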
Figure 2: Visualising the gradient step with/without gradient clipping. Source: Goodfellow et al. [2016].
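A minimal sketch of gradient clipping by global norm, one common way to implement the idea in Figure 2; the threshold value is an arbitrary choice here.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    This mirrors the clipped gradient step in Figure 2: the update direction is
    kept, only its length is shrunk when it is too large.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```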
Now consider a simple case where ϕh (·) is an identity mapping so that ϕ′h (·) = 1. We further
assume the hidden states have scalar values, i.e. $\dim(h_t) = 1$. Then we have $\prod_{l=\tau}^{t-1} \frac{dh_{l+1}}{dh_l} = (W_h^\top)^{t-\tau}$, which can vanish or explode when $t - \tau$ is large, depending on whether $|W_h| < 1$ or not. In the general case where $W_h$ is a matrix, depending on whether the largest singular value (i.e. the spectral norm) of $W_h$ is smaller or larger than 1, the spectral norm of $\prod_{l=\tau}^{t-1} \frac{dh_{l+1}}{dh_l} = (W_h^\top)^{t-\tau}$ will vanish or explode as $t - \tau$ increases. When $\phi_h(\cdot)$ is selected as the sigmoid function or the hyperbolic tangent function, the gradient vanishing problem can still happen. Take the hyperbolic tangent function as an example: when the entries of $h_t$ are close to $\pm 1$, then $\phi'_h(\cdot) \approx 0$, i.e. the derivative is saturated. Multiplying several such saturated derivatives together also leads to the gradient vanishing problem.
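To make the argument concrete, the snippet below checks the scalar case with illustrative values of $W_h$ on either side of 1, and the matrix case with a randomly generated $W_h$ rescaled so its largest singular value is below 1.

```python
import numpy as np

# Scalar case with identity phi_h: prod_{l=tau}^{t-1} dh_{l+1}/dh_l = W_h^(t-tau).
for W_h in (0.9, 1.1):                   # illustrative values on either side of 1
    for gap in (10, 50, 100):            # gap = t - tau
        print(f"W_h={W_h}, t-tau={gap}: product = {W_h ** gap:.3e}")
# W_h = 0.9 shrinks towards 0 (vanishing), W_h = 1.1 blows up (exploding).

# Matrix case: when the largest singular value of W_h is below 1, the spectral
# norm of (W_h^T)^(t-tau) decays towards 0 as t - tau grows.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 8))
W_h /= 1.2 * np.linalg.norm(W_h, 2)      # rescale so the largest singular value is ~0.83
print("largest singular value:", np.linalg.norm(W_h, 2))
print("spectral norm of (W_h^T)^100:", np.linalg.norm(np.linalg.matrix_power(W_h.T, 100), 2))
```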
to control the error flows; in detail, the computation proceeds as follows (with $\sigma(\cdot)$ denoting the sigmoid function):
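The full gate equations are not reproduced in this excerpt; as a stand-in, here is a minimal sketch of one forward step of a standard LSTM cell (Hochreiter and Schmidhuber [1997]), assuming the common parametrisation in which each gate has its own weights acting on the concatenation $[h_{t-1}; x_t]$. The exact convention used in the notes may differ in minor details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell (a sketch).

    params holds W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o, each weight matrix
    acting on the concatenation [h_{t-1}; x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])    # candidate cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde                      # new cell state
    h_t = o_t * np.tanh(c_t)                                # new hidden state
    return h_t, c_t
```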
*Gradient computation
Readers are encouraged to derive the gradient of $L(y_t)$ with respect to $\theta$ themselves. Specifically, for the recurrent weight matrix $W_c$, computing the derivative $\frac{dL(y_t)}{dW_c}$ requires the following terms:
(21)
$$\frac{dc_t}{dh_{t-1}} = c_{t-1} \odot \frac{df_t}{dh_{t-1}} + \tilde{c}_t \odot \frac{di_t}{dh_{t-1}} + i_t \odot \frac{d\tilde{c}_t}{dh_{t-1}}. \qquad (22)$$
This means that computing $\frac{dc_t}{dW_c}$ requires computing
$$\prod_{l=\tau}^{t-1} \frac{dc_{l+1}}{dc_l} = \prod_{l=\tau}^{t-1} \left[ f_{l+1} + o_l \odot \frac{d\tanh(c_l)}{dc_l} \odot \frac{dc_{l+1}}{dh_l} \right]$$
for all $\tau = 1, \ldots, t$. There is no guarantee that this term will not vanish or explode; however, the usage of forget gates makes the issue less severe. To see this, notice that by expanding the product term above, it contains terms proportional to $f_{i+1} \odot \prod_{l=\tau}^{i} o_l \odot \frac{dc_{l+1}}{dh_l}$ for $i = \tau+1, \ldots, t-1$. So if in the forward pass the network sets $f_{i+1} \to 0$ (i.e. forgetting the previous cell state), then this will also likely bring $f_{i+1} \odot \prod_{l=\tau}^{i} o_l \odot \frac{dc_{l+1}}{dh_l} \approx 0$, which helps cope with the gradient explosion problem. On the other hand, $\frac{dc_t}{dW_c}$ also contains terms proportional to $\prod_{l=\tau+1}^{i} f_l \odot o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau}$ for $i = \tau+1, \ldots, t-1$. This means that if the network sets $f_l \to 1$ for $l = \tau+1, \ldots, i$ (i.e. maintaining the cell state until at least time $t = i$), then it is likely that $\prod_{l=\tau+1}^{i} f_l \odot o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau} \approx o_\tau \odot \frac{dc_{\tau+1}}{dh_\tau}$, which helps prevent the gradient information at time $\tau$ from vanishing when $o_\tau \to 1$, and is thus helpful for learning longer-term dependencies.
longer term dependencies. The gradients dW dct
c
dot
and dW c
also require computing products of dodhi+1 i
and
dci+1
dhi terms, and analogous analysis can be done for those product terms. It is worth emphasising
again that LSTM does NOT solve the gradient vanishing/explosion problem completely, however
empirical evidences have shown that it is easier for LSTMs to learn longer term dependencies when
compared with the simple RNN.
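As a loose numerical illustration of the argument above, the snippet below uses scalar stand-ins for the per-step terms; the values are arbitrary and only meant to show the qualitative effect of forget gates close to 1 versus forget gates that are half-closed.

```python
import numpy as np

# Scalar stand-in for prod_l [ f_{l+1} + o_l * tanh'(c_l) * dc_{l+1}/dh_l ].
rng = np.random.default_rng(0)
steps = 50
other_term = 0.05 * rng.normal(size=steps)   # stand-in for o_l * tanh'(c_l) * dc_{l+1}/dh_l
for f in (0.99, 0.5):                        # forget gates mostly open vs half-closed
    prod = np.prod(f + other_term)
    print(f"forget gate ~ {f}: product over {steps} steps = {prod:.3e}")
# With f ~ 0.99 the product stays of order 1, so gradient information survives
# across many steps; with f ~ 0.5 it decays roughly like 0.5^50 ~ 1e-15.
```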
The network parameters are then $\theta = \{W_z, W_r, W_h, b_z, b_r, b_h\}$. We see here that $z_t$ acts as the update gate, which determines how much of the current information is incorporated into the hidden states, and $r_t$ is named the reset gate, which also impacts the maintenance of the historical information.
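A minimal sketch of the GRU forward step (Cho et al. [2014]) using the parameters $\theta = \{W_z, W_r, W_h, b_z, b_r, b_h\}$ above. The convention of concatenating $[h_{t-1}; x_t]$ and the exact placement of the reset gate are assumptions here, since implementations differ in these details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One forward step of a GRU cell (a sketch)."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])     # update gate
    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])     # reset gate
    h_tilde = np.tanh(params["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + params["b_h"])
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde              # convex combination via update gate
    return h_t
```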
$$p_\theta(y_l | y_{<l}, v) = p_\theta(y_l | h^d_l, c^d_l), \quad h^d_l, c^d_l = \mathrm{LSTM}^{dec}_\theta(y_{<l}), \qquad (28)$$
and the decoder LSTM has its internal recurrent states $h^d_0, c^d_0$ initialised using the input sequence representation $v = \mathrm{enc}(x_{1:T})$. The encoder also uses an LSTM, meaning that $v$ is constructed from the encoder LSTM's recurrent states $h^e_t, c^e_t$ (e.g. the final states $h^e_T, c^e_T$).
Figure 3: Visualising the forward pass of Seq2Seq model in train (left) and test (right) time.
Regarding the prediction of the first output $y_1$, either $p_\theta(y_1|x_{1:T})$ can be produced using the last recurrent states of the encoder LSTM (i.e. $p_\theta(y_1|x_{1:T}) = p_\theta(y_1|h^e_T, c^e_T)$), or we can add a "start of sentence" token as $y_0$ and compute the probability vector for $y_1$ using the LSTM decoder: $p_\theta(y_1|x_{1:T}) = p_\theta(y_1|y_0, v)$. This model is named the sequence-to-sequence model, or Seq2Seq model for short, and was proposed by Sutskever et al. [2014].
The decoder LSTM forward passes at training and test time are different, as visualised in Figure 3. In training, since maximum likelihood training requires evaluating $p_\theta(y_l|y_{<l}, v)$ for the output sequences $y_{1:L}$ from the dataset, the inputs to the decoder LSTM are the words in the data label sequence, and the output of the LSTM is the probability vector for the current word, which is then used to compute the MLE objective (i.e. the negative cross-entropy). At test time, however, there is no ground-truth output sequence provided, therefore the input to the decoder LSTM at step $l$ is the predicted word $y_{l-1} \sim p_\theta(y_{l-1}|y_{<l-1}, v)$. In practice prediction is done by e.g. beam search rather than naive sequential sampling [Sutskever et al., 2014].
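The contrast between the two passes can be sketched as follows. Greedy decoding stands in for beam search for brevity, and `decoder_step`, `embed` and the start/end-of-sentence token handling are placeholders, not notation from the notes.

```python
import numpy as np

def train_decode(decoder_step, embed, y_true, state):
    """Training-time pass (teacher forcing): feed the ground-truth previous word."""
    log_lik = 0.0
    prev = embed("<sos>")                      # hypothetical start-of-sentence token
    for y_l in y_true:
        probs, state = decoder_step(prev, state)
        log_lik += np.log(probs[y_l])          # log p_theta(y_l | y_<l, v)
        prev = embed(y_l)                      # ground truth, not the model's prediction
    return log_lik                             # MLE objective for this sequence

def test_decode(decoder_step, embed, state, max_len=50, eos_id=1):
    """Test-time pass: feed back the model's own prediction (greedy decoding)."""
    output, prev = [], embed("<sos>")
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)
        y_l = int(np.argmax(probs))            # greedy choice; beam search in practice
        if y_l == eos_id:
            break
        output.append(y_l)
        prev = embed(y_l)
    return output
```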
In NLP applications such as machine translation, both $x_{1:T}$ and $y_{1:L}$ are sequences of words, which cannot be directly processed by neural networks. Instead, each $x_t$ ($y_l$) needs to be mapped to a real-valued vector before feeding it to the encoder (decoder) LSTM. A naive approach is to use one-hot encoding: assuming that the input sentence is in English and we have a sorted English vocabulary of size $V$, each $x_t$ is then represented as a length-$V$ indicator vector whose only non-zero entry is at the index of the corresponding word in the vocabulary.
This is clearly inefficient since the English vocabulary has tens of thousands of words. Instead, it is recommended to map the words to their word embeddings using word2vec [Mikolov et al., 2013a] or GloVe [Pennington et al., 2014], which have much smaller dimension. More importantly, semantics are preserved to some extent in these word embeddings, e.g. it has been shown that vector arithmetic results like "emb(king) - emb(male) + emb(female) ≈ emb(queen)" hold approximately for word2vec embeddings.
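A toy contrast between the two representations; the vocabulary and the embedding matrix below are illustrative stand-ins, not real word2vec/GloVe vectors.

```python
import numpy as np

vocab = ["the", "king", "queen", "male", "female"]      # toy vocabulary, size V = 5
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Length-V indicator vector: inefficient when V is tens of thousands."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Embedding lookup: a dense V x d matrix of learned (or pretrained) vectors, d << V.
d = 8
emb_matrix = np.random.default_rng(0).normal(size=(len(vocab), d))

def emb(word):
    return emb_matrix[word_to_idx[word]]

print(one_hot("king").shape, emb("king").shape)         # (5,) vs (8,)
```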
In many NLP applications the decoder output is a probability vector $p_\theta(y_l|y_{<l}, v)$ which specifies the predictive probability of each word in the vocabulary. When the vocabulary is large (which is often the case), applying softmax to obtain the probability vector can be computationally very expensive. Interested readers can check e.g. hierarchical softmax [Mikolov et al., 2013a] and negative sampling [Mikolov et al., 2013b] for solutions that mitigate this issue.
1.4 *Generative models for sequences
To generate sequential data such as video, text and audio, one needs to build a generative model
pθ (x1:T ) and train it with e.g. (approximate) maximum likelihood. In the following we discuss two
types of latent variable models that are often used in sequence generation tasks.
If the variational lower bound is used for training, then this also requires an approximate posterior $q_\phi(z|x_{1:T})$ to be optimised:
$$\phi^*, \theta^* = \arg\max_{\phi, \theta} \mathcal{L}(\phi, \theta), \quad \mathcal{L}(\phi, \theta) = \mathbb{E}_{p_{data}(x_{1:T})}\Big[\underbrace{\mathbb{E}_{q_\phi(z|x_{1:T})}[\log p_\theta(x_{1:T}|z)] - \mathrm{KL}[q_\phi(z|x_{1:T})\,||\,p(z)]}_{:=\mathcal{L}(x_{1:T}, \phi, \theta)}\Big]. \qquad (31)$$
Now it remains to define the encoder and decoder distributions, such that they can process sequences of any length. This can be achieved using e.g. LSTMs to define an auto-regressive decoder:
$$p_\theta(x_{1:T}|z) = \prod_{t=1}^{T} p_\theta(x_t|x_{<t}, z), \quad p_\theta(x_1|x_{<1}, z) = p_\theta(x_1|z), \qquad (32)$$
with the distributional parameters of $p_\theta(x_t|x_{<t}, z)$ defined by $\mathrm{LSTM}_\theta(x_{<t})$, which has its recurrent states $h_0, c_0$ initialised using $z$. For the encoder, LSTMs can also be used to process the input:
$$q_\phi(z|x_{1:T}) = \mathcal{N}(z; \mu_\phi(x_{1:T}), \mathrm{diag}(\sigma^2_\phi(x_{1:T}))), \quad \mu_\phi(x_{1:T}), \log \sigma_\phi(x_{1:T}) = \mathrm{LSTM}_\phi(x_{1:T}). \qquad (33)$$
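Putting (31)-(33) together, a one-sample Monte Carlo estimate of $\mathcal{L}(x_{1:T}, \phi, \theta)$ can be sketched as below. The functions `lstm_encode`, `init_decoder_state` and `lstm_decode_step` are placeholders standing in for the LSTMs above, and a standard normal prior $p(z) = \mathcal{N}(0, I)$ is assumed so that the KL term has a closed form.

```python
import numpy as np

def elbo_single_sequence(x_seq, lstm_encode, init_decoder_state, lstm_decode_step, rng):
    """One-sample estimate of L(x_{1:T}, phi, theta) in (31), as a sketch."""
    mu, log_sigma = lstm_encode(x_seq)                        # encoder, as in (33)
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)    # reparameterisation trick
    # Analytic KL between N(mu, diag(sigma^2)) and the standard normal prior p(z).
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)
    # Auto-regressive reconstruction term: sum_t log p_theta(x_t | x_<t, z), as in (32).
    log_lik, state, x_prev = 0.0, init_decoder_state(z), None  # h_0, c_0 initialised from z
    for x_t in x_seq:
        log_prob_fn, state = lstm_decode_step(x_prev, state)
        log_lik += log_prob_fn(x_t)
        x_prev = x_t
    return log_lik - kl
```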
State-space models
State-space models assume that for every observation xt at time t, there is a latent variable zt
that generates it, and the sequence dynamic model is defined in the latent space rather than in the
observation space. In detail, a prior dynamic model is assumed on the transitions of the latent states
zt , often in an auto-regressive way:
$$p_\theta(z_{1:T}) = \prod_{t=1}^{T} p_\theta(z_t|z_{<t}), \quad p_\theta(z_1|z_{<1}) = p_\theta(z_1). \qquad (34)$$
The observation $x_t$ at time $t$ is assumed to be conditionally dependent on $z_t$ only, and this conditional distribution is also called the emission model:
$$p_\theta(x_t|z_{1:T}) = p_\theta(x_t|z_t). \qquad (35)$$
Combining both definitions, we have the sequence generative model defined as
$$p_\theta(x_{1:T}) = \int \prod_{t=1}^{T} p_\theta(x_t|z_t) p_\theta(z_t|z_{<t}) \, dz_{1:T}. \qquad (36)$$
A variational lower-bound objective for training this state-space model requires an approximate posterior distribution $q_\phi(z_{1:T}|x_{1:T})$:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{p_{data}(x_{1:T})}\Big[\mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}[\log p_\theta(x_{1:T}|z_{1:T})] - \mathrm{KL}[q_\phi(z_{1:T}|x_{1:T})\,||\,p_\theta(z_{1:T})]\Big]$$
$$= \mathbb{E}_{p_{data}(x_{1:T})}\Big[\mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}\Big[\sum_{t=1}^{T} \log p_\theta(x_t|z_t)\Big] - \mathrm{KL}[q_\phi(z_{1:T}|x_{1:T})\,||\,p_\theta(z_{1:T})]\Big]. \qquad (37)$$
Different from the image generation case, the prior distribution $p_\theta(z_{1:T})$ now also has learnable parameters in $\theta$, so in this case it is less appropriate to view this KL term as a "regulariser". The expanded expression for the variational lower bound depends on the definition of the encoder distribution $q_\phi(z_{1:T}|x_{1:T})$. The simplest solution is to use a factorised approximate posterior
$$q_\phi(z_{1:T}|x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t|x_{\leq t}), \qquad (38)$$
$$p_\theta(z_t|z_{<t}) = p_\theta(z_t|h^p_t, c^p_t), \quad h^p_t, c^p_t = \mathrm{LSTM}_\theta(z_{<t}). \qquad (40)$$
This means the previous latent states $z_{<t}$ are summarised by the LSTM internal recurrent states $h^p_t$ and $c^p_t$, which are then transformed into the distributional parameters of $p_\theta(z_t|z_{<t})$. The factorised encoder distribution can also be defined using an LSTM:
$$q_\phi(z_t|x_{\leq t}) = q_\phi(z_t|h^q_t, c^q_t), \quad h^q_t, c^q_t = \mathrm{LSTM}_\phi(x_{\leq t}). \qquad (41)$$
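To make the generative direction concrete, the following sketch performs ancestral sampling from (34)-(36), with the LSTM prior of (40) and the emission model represented by placeholder functions; `prior_lstm_step`, `emit` and `init_state` are assumptions rather than notation from the notes, and Gaussian transitions are assumed for illustration.

```python
import numpy as np

def sample_sequence(T, prior_lstm_step, emit, init_state, rng):
    """Ancestral sampling from the state-space model (34)-(36), as a sketch.

    prior_lstm_step(z_prev, state) returns ((mu, sigma), new_state), giving the
    Gaussian parameters of p_theta(z_t | z_<t) as in (40); emit(z_t) returns the
    mean of the emission model p_theta(x_t | z_t) in (35).
    """
    zs, xs = [], []
    state, z_prev = init_state, None
    for _ in range(T):
        (mu, sigma), state = prior_lstm_step(z_prev, state)
        z_t = mu + sigma * rng.normal(size=mu.shape)    # z_t ~ p_theta(z_t | z_<t)
        x_t = emit(z_t)                                 # emission mean (noise could be added)
        zs.append(z_t)
        xs.append(x_t)
        z_prev = z_t
    return zs, xs
```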
References
Arjovsky, M., Shah, A., and Bengio, Y. (2016). Unitary evolution recurrent neural networks. In
International Conference on Machine Learning, pages 1120–1128.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating
sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computa-
tional Natural Language Learning, pages 10–21. Association for Computational Linguistics.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.
Fabius, O. and van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint
arXiv:1412.6581.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.
deeplearningbook.org.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.
Le, Q. V., Jaitly, N., and Hinton, G. E. (2015). A simple way to initialize recurrent networks of
rectified linear units. arXiv preprint arXiv:1504.00941.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representa-
tions in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representa-
tions of words and phrases and their compositionality. Advances in neural information processing
systems, 26:3111–3119.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representa-
tion. In Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pages 1532–1543.
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of
learning in deep linear neural networks. In International Conference on Learning Representations.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Advances in neural information processing systems, 27:3104–3112.