04 RNN Slides
How can we process a sequence?
[Figure: unrolled computation graph h0 → f → h1 → f → h2 → ... → f → h5 with inputs x1, ..., x5]

def f(x, h):
    return h + (x == 0)
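• For example, a minimal Python sketch of scanning a sequence with this f (the example inputs are arbitrary); the state h counts how many zeros have been seen so far:

h = 0                      # initial state h0
for x in [3, 0, 5, 0, 0]:  # inputs x1, ..., x5
    h = f(x, h)
print(h)                   # -> 3 (three zeros in the sequence)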
Generic processor
• "Dear #XYZ there is no network in my area and internet service is pathetic from past one week. Kindly help me out." (negative review)
• "Although the value added services being provided are great but the prices are high." (mixed review)
• "Great work done #XYZ Problem resolved by customer care in just one day." (positive review)
• To build a generic processor, we can use the same computational graph with a learnable f:
[Figure: the same unrolled graph h0 → f → h1 → ... → h5 with inputs x1, ..., x5, now with a learnable f]
Vanilla recurrent neural network (RNN)
• We can use the same building block as in the standard multilayer perceptron (MLP):
f(x, h) = tanh(Wh + Ux + b)
[Figure: unrolled RNN h0 → f → h1 → ... → f → h5 with inputs x1, ..., x5 and a softmax output layer applied to the last state]
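• A minimal PyTorch sketch of this recurrence (the module name and the dimensions below are illustrative, not from the slides):

import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    # implements f(x, h) = tanh(W h + U x + b)
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)  # the bias b lives here
    def forward(self, x, h):
        return torch.tanh(self.W(h) + self.U(x))

cell = VanillaRNNCell(input_dim=10, hidden_dim=20)
h = torch.zeros(1, 20)                 # h0
for x_t in torch.randn(5, 1, 10):      # x1, ..., x5
    h = cell(x_t, h)                   # the same cell is applied at every step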
RNN vs feedforward network
[Figure: a feedforward network x → f1 → h1 → f2 → h2 → L(·, y) next to an RNN h0 → f → h1 → f → h2 → L(·, y) with inputs x1, x2]
Training recurrent neural networks
Training an RNN
• Just like for a feedforward network, the parameters of an RNN can be found by (stochastic)
gradient descent.
[Figure: RNN with shared parameters θ: h0 → f → h1 → f → h2 → softmax → z, loss L(z, y), inputs x1, x2]
Backpropagation
• We can compute the derivatives wrt the model parameters θ and w using the chain rule.
∂L/∂θ = (∂L/∂y)(∂y/∂θ)
∂L/∂w = (∂L/∂y)(∂y/∂h)(∂h/∂w) = (∂L/∂h)(∂h/∂w)

[Figure: feedforward graph x → f1(·, w) → h → f2(·, θ) → y → L with the backward messages ∂L/∂y and ∂L/∂h]
Backpropagation in RNN
• The difference in the RNN is that each layer implements the same function f with the same (shared) parameters θ:

L = L(h2),   h2 = f(x2, h1, θ),   h1 = f(x1, h0, θ)

[Figure: h0 → f → h1 → f → h2 → L with inputs x1, x2 and shared parameters θ]
Backpropagation in RNN
• For the analysis, treat the two applications of f as if they used separate copies θ1 and θ2 of the shared parameters (with θ1 = θ2 = θ).
• Finally, we can combine the gradients wrt the shared parameters:

∂L/∂θ = (∂L/∂θ1)(∂θ1/∂θ) + (∂L/∂θ2)(∂θ2/∂θ) = ∂L/∂θ1 + ∂L/∂θ2

[Figure: h0 → f → h1 → f → h2 → L with inputs x1, x2, backward messages ∂L/∂h1, ∂L/∂h2 and gradients ∂L/∂θ1, ∂L/∂θ2 wrt the parameter copies]
• We need to compute gradients through all possible paths and aggregate them.
• The backpropagation algorithm applied to RNN is called backpropagation through time.
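• Automatic differentiation performs this aggregation for us: gradients from every time step are summed into the shared parameters. A minimal sketch using PyTorch's built-in RNNCell (the sizes and the loss are placeholders):

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=10, hidden_size=20)   # shared parameters θ
x = torch.randn(5, 1, 10)                          # 5 time steps, batch of 1
h = torch.zeros(1, 20)
for x_t in x:                                      # unroll the graph through time
    h = cell(x_t, h)                               # same cell (same θ) at every step
loss = h.pow(2).sum()                              # some scalar loss on the last state
loss.backward()                                    # backpropagation through time
print(cell.weight_hh.grad.norm())                  # gradient summed over all time steps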
Problems with RNN training
Does recurrence cause problems for training?
• Assume that we are not careful about selecting the activation φ and choose it to be the identity mapping φ(a) = a, with h0 = 0 and b = 0.
• Let us write the hidden state at time t:
h_t = W h_{t−1} + U x_t = W(W h_{t−2} + U x_{t−1}) + U x_t = WW h_{t−2} + WU x_{t−1} + U x_t = ... = Σ_{τ=1}^{t} W^{t−τ} U x_τ

[Figure: unrolled graph 0 → f → f → f → ... → f with inputs x1, x2, x3, ..., xt; the successive states are U x1, then WU x1 + U x2, then WWU x1 + WU x2 + U x3, ..., then Σ_{τ=1}^{t} W^{t−τ} U x_τ]
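• A small numerical sketch of this effect (the dimensions and the factor 1.2 are arbitrary choices for illustration): iterating a linear recurrence with a W whose spectral radius exceeds 1 makes the state norm blow up.

import torch

d = 8
W = 1.2 * torch.eye(d)              # spectral radius 1.2 > 1
U = torch.eye(d)
h = torch.zeros(d)
for t in range(50):
    x_t = torch.randn(d)
    h = W @ h + U @ x_t             # linear recurrence (phi = identity)
    if t % 10 == 0:
        print(t, h.norm().item())   # the norm grows roughly like 1.2**t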
Analysis for diagonalizable W
• For simplicity, let us assume that matrix W is diagonalizable and its eigenvalue decomposition
W = QΛQ−1 exists, where Q contains the eigenvectors of W and Λ is the diagonal matrix
containing the eigenvalues of W. We can then re-write W^{t−τ} = Q Λ^{t−τ} Q^{−1}, so that h_t = Σ_{τ=1}^{t} Q Λ^{t−τ} Q^{−1} U x_τ.
Analysis for a more general case
• Suppose W Q_m = Q_m Λ, where the columns of Q_m are eigenvectors of W and Λ is the diagonal matrix of the corresponding eigenvalues λ_i.
• We can write U x_τ = Q_m z + z_0 where z_0 is orthogonal to the eigenvectors in Q_m, that is Q_m^⊤ z_0 = 0.
• Then, one term in the expression for h_t is W^{t−τ} U x_τ = W^{t−τ} Q_m z + W^{t−τ} z_0.
• Let us look at the first term only: W^{t−τ} Q_m z = Q_m Λ^{t−τ} z = Σ_i q_i λ_i^{t−τ} z_i.
• Again, if one of the eigenvalues is such that |λ_i| > 1, then the norm of q_i λ_i^{t−τ} z_i will grow exponentially, causing explosions in the forward computations.
Explosions in forward computations
[Figure: example of hidden-state magnitudes exploding during the forward pass]
Are there similar problems in backward computations?
• Let us look at the longest path of derivative computations for an RNN h_t = φ(W h_{t−1} + U x_t + b):

[Figure: unrolled graph h0 → f → h1 → f → h2 → ... → f → ht → L with inputs x1, ..., xt and backward messages ∂L/∂h1, ∂L/∂h2, ..., ∂L/∂ht]

• ∂L/∂h_1 is a column vector of the partial derivatives ∂L/∂h_{1i}.
• We denote φ′_τ = φ′(W h_{τ−1} + U x_τ + b).
Gradient explosions (Pascanu et al., 2013)
(∂L/∂h_1)^⊤ = (∂L/∂h_t)^⊤ ∏_{τ=t,...,2} ∂h_τ/∂h_{τ−1} = (∂L/∂h_t)^⊤ ∏_{τ=t,...,2} diag(φ′_τ) W

[Figure: plots of φ(h) = tanh(h) and its derivative]

• Suppose φ(h) = tanh(h) and all the neurons in the RNN are not saturated, which means that |φ′_τ| ≥ γ for some γ > 0.
• If the spectral radius of W is greater than 1/γ, then the norm of the gradient can grow exponentially with the sequence length: the gradients explode.
How to cope with gradient explosions?
• A common remedy is gradient clipping: when the norm of the gradient exceeds a chosen threshold, rescale the gradient so that its norm equals the threshold (Pascanu et al., 2013).
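• A minimal PyTorch sketch of clipping the global gradient norm (the model, the loss, and the threshold 1.0 are placeholders for illustration):

import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(5, 1, 10)                  # one training sequence
out, h_last = model(x)
loss = out.pow(2).mean()                   # placeholder loss
loss.backward()                            # gradients may explode here
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip the global norm
optimizer.step()                           # update with the clipped gradients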
Vanishing gradients
• For φ(h) = tanh(h), the derivatives satisfy 0 < φ′_τ ≤ 1, so the product of many such factors shrinks towards zero and the gradients vanish.
• To avoid vanishing gradients, it is good to keep the neurons in the non-saturated regime, where the derivatives φ′ are close to 1.
Vanishing gradients
• The vanishing gradients problem makes it difficult to learn long-range dependencies in the data:
• In sentiment analysis, it is difficult to capture the effect of the first words in a paragraph on the
predicted class.
• In time-series modeling, it is difficult to capture slowly changing phenomena.
• Vanilla RNNs h_t = φ(W h_{t−1} + U x_t + b) are rarely used in practice.
• Recurrent units with gating mechanisms work better.
• Gated recurrent unit (GRU) (Cho et al., 2014)
• Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997)
Historical note on RNNs
• Recurrent neural networks for sequential data processing were proposed in the 80s
(Rumelhart et al., 1986; Elman, 1990; Werbos, 1988).
• RNNs did not gain much popularity because they were particularly difficult to train with
backpropagation:
• Unstable training because of gradient explosions
• Difficulty learning long-term dependencies due to vanishing gradients (Bengio et al., 1994)
• The breakthrough came with the invention of Long Short-Term Memory (LSTM) RNN (Hochreiter
and Schmidhuber, 1997) which was designed to solve the gradient explosion/vanishing problem.
• LSTM remained largely unnoticed in the community until the deep learning boom started.
Gated recurrent unit (GRU)
(Cho et al., 2014)
Gated recurrent unit (GRU)
h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)        (update gate)
h̃_t = φ(W(r_t ⊙ h_{t−1}) + U x_t + b_h)     (candidate state)
r_t = σ(W_r h_{t−1} + U_r x_t + b_r)        (reset gate)

where ⊙ denotes element-wise multiplication.
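• A minimal PyTorch sketch of these equations (the module name, the dimensions, and the choice φ = tanh are illustrative):

import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.Wu = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Uu = nn.Linear(input_dim, hidden_dim)   # bias b_u lives here
        self.Wr = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Ur = nn.Linear(input_dim, hidden_dim)   # bias b_r
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)    # bias b_h
    def forward(self, x, h):
        u = torch.sigmoid(self.Wu(h) + self.Uu(x))       # update gate u_t
        r = torch.sigmoid(self.Wr(h) + self.Ur(x))       # reset gate r_t
        h_tilde = torch.tanh(self.W(r * h) + self.U(x))  # candidate state
        return (1 - u) * h + u * h_tilde                 # new state h_t

cell = GRUCellSketch(input_dim=10, hidden_dim=20)
h = torch.zeros(1, 20)
for x_t in torch.randn(5, 1, 10):
    h = cell(x_t, h)

• In practice one would use the built-in torch.nn.GRU or torch.nn.GRUCell.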
Gated recurrent unit (GRU)
• State update:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t,   where h̃_t = φ(W(r_t ⊙ h_{t−1}) + U x_t + b_h)
Does GRU help with the vanishing gradient problem?
• For simplicity, let us assume that the state of an RNN is one-dimensional and all intermediate
signals do not depend on time step τ :
∂L/∂h_1 = ∂L/∂h_t ∏_{τ=t,...,2} ((1 − u_τ) + u_τ φ′_τ w r_τ) = ∂L/∂h_t ((1 − u) + u γ r)^{t−1} = ∂L/∂h_t ((1 + γ/2)/2)^{t−1}

where γ = φ′ w, and the last equality assumes u = r = 1/2.
Does GRU help with the vanishing gradient problem?
• Gradient propagation in GRU (simplified): ∂L/∂h_1 = ∂L/∂h_t ((1 + γ/2)/2)^{t−1}
• Let us do the same simplified analysis for the vanilla RNN:

∂L/∂h_1 = ∂L/∂h_t ∏_{τ=t,...,2} diag(φ′_τ) W = ∂L/∂h_t ∏_{τ=t,...,2} φ′_τ w = ∂L/∂h_t γ^{t−1},   where γ = φ′ w

• If γ is small, the gradients in GRU decay with rate 1/2, which is much better than the rate of γ in the vanilla RNN.
• If γ is large, the magnitudes of the gradients grow exponentially as O(γ^t / 4^t), which is better than O(γ^t) in the vanilla RNN.
• Thus, the gating mechanism mitigates the problem of vanishing/exploding gradients. Gradients may still explode or vanish in GRU, but such problems occur less often than in the vanilla RNN.
Connection to probabilistic graphical models
for sequential data
Linear dynamical systems
p(h_1) = N(h_1 | µ_1, R_1)
p(h_t | h_{t−1}) = N(h_t | B h_{t−1}, R)
p(x_t | h_t) = N(x_t | A h_t, V)

[Figure: graphical model with latent chain h1 → h2 → h3 → h4 and observations x1, x2, x3, x4]
• Inference in linear dynamical systems: Find the conditional distribution p(ht | x1 , . . . , xt ) of latent
variables h1 , h2 , . . . , ht given the observation sequence x1 , x2 , . . . , xt .
• Since it is a linear Gaussian probabilistic model, the inference can be done using the
message-passing algorithm (see, e.g., Chapter 13 of Bishop, 2006) which yields the Kalman filter.
Kalman filter: Message-passing in linear dynamical systems
1. Prediction: p(h_t | x_1, ..., x_{t−1}) = N(h_t | →h_t, P_t) with

→h_t = B h̄_{t−1}
P_t = B Σ_{t−1} B^⊤ + R

where h̄_{t−1} and Σ_{t−1} are the mean and covariance of the corrected distribution from the previous step, and →h_t, P_t denote the predicted mean and covariance.

[Figure: message passing along the chain ... → h_{t−2} → h_{t−1} → h_t with observations x_{t−2}, x_{t−1}, x_t; the forward message carries →h_t and P_t]
Kalman filter in one-dimensional case
• Let us look closer at the correction equation for the mean values of the hidden states
h̄_t = →h_t + K_t (x_t − A →h_t)
K_t = P_t A^⊤ (A P_t A^⊤ + V)^{−1}
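• A hedged one-dimensional sketch of the full predict/correct recursion in plain Python (the numbers are made up, and the variance-correction line is the standard Kalman update even though it is not written out above):

# Model: h_t = b*h_{t-1} + noise(r),  x_t = a*h_t + noise(v)
a, b, r, v = 1.0, 0.9, 0.1, 0.5
h_bar, sigma = 0.0, 1.0                       # corrected mean and variance
for x_t in [0.3, 0.1, -0.2, 0.4]:             # observations
    h_pred = b * h_bar                        # predicted mean
    p_pred = b * sigma * b + r                # predicted variance
    k = p_pred * a / (a * p_pred * a + v)     # Kalman gain K_t
    h_bar = h_pred + k * (x_t - a * h_pred)   # corrected mean
    sigma = (1 - k * a) * p_pred              # corrected variance
    print(round(h_bar, 3), round(sigma, 3))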
Motivation of gatings in recurrent units
h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)

• This example justifies the use of gates in recurrent units: the gates allow combining information gained from the previous observations with information from the current observation.
• The same intuition holds for nonlinear dynamical systems (extended Kalman filter), which can be learned by RNNs.
Computational graph of RNN as implementation of message passing
• Message passing in the (one-dimensional) linear dynamical system can be written in a gated form:

h̄_t = (1 − u_t) →h_t + u_t (x_t / a)
u_t = σ(log(a² p_{t−1} / v)) = a² p_{t−1} / (a² p_{t−1} + v)

(in the multivariate case, the term x_t / a becomes A† x_t)

• Compare this with the state update of a gated recurrent unit:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)

[Figure: the graphical model ... → h_{t−2} → h_{t−1} → h_t with observations x_{t−2}, x_{t−1}, x_t shown next to the corresponding RNN computational graph]

• The computational graph of an RNN with gatings can be seen as an implementation of an inference procedure for a probabilistic graphical model with sequential data.
Long short-term memory (LSTM)
(Hochreiter and Schmidhuber, 1997)
Long short-term memory (LSTM) unit
h_t = o_t ⊙ φ_h(c_t)
o_t = σ(W_o h_{t−1} + U_o x_t + b_o)        (output gate)
Initialization of the forget gates (Jozefowicz et al., 2015)
• Forget gate and cell update:

f_t = σ(W_f h_{t−1} + U_f x_t + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ φ_c(W_c h_{t−1} + U_c x_t + b_c)

where i_t is the input gate, i_t = σ(W_i h_{t−1} + U_i x_t + b_i).
• Common initialization of the forget gate: small random weights for b_f. This initialization effectively sets the forget gate to 1/2, and therefore the gradient vanishes with a factor of 1/2 per timestep. It works well in many problems.
• However, sometimes an RNN can fail to learn long-term dependencies. This problem can be
addressed by initializing the forget gates bf to large values such as 1 or 2.
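• A minimal PyTorch sketch of this trick; it relies on PyTorch's gate ordering (input, forget, cell, output) within the bias vectors, and the value 1.0 is one possible choice:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)
H = lstm.hidden_size
for name, param in lstm.named_parameters():
    if name.startswith("bias"):              # bias_ih_l0 and bias_hh_l0
        param.data[H:2 * H].fill_(1.0)       # the forget-gate slice corresponds to b_f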
Architecture search for recurrent units
• LSTM and GRU have somewhat similar but different architectures. Can there be even better
architectures of the recurrent unit?
• Jozefowicz et al. (2015) performed a random search over architectures by constructing recurrent units from a selected set of operations and testing their performance on a set of standard benchmarks.
• The best architectures found in that procedure were very similar to GRU!
z = σ(W_xz x_t + b_z)
r = σ(W_xr x_t + W_hr h_t + b_r)
h_{t+1} = tanh(W_hh(r ⊙ h_t) + tanh(x_t) + b_h) ⊙ z + h_t ⊙ (1 − z)
Layer normalization (Ba et al., 2016)
• Batch normalization significantly reduces the training time in feed-forward neural networks. Can
we apply the same idea to recurrent networks?
• If we apply BN to an RNN, we need to compute and store separate statistics for each time step in
a sequence. This is problematic if a test sequence is longer than any of the training sequences.
• Layer normalization (LN) is a modification of BN in which statistics are computed over the hidden
units of a single time step:
µ = (1/H) Σ_{i=1}^{H} x_i        σ² = (1/H) Σ_{i=1}^{H} (x_i − µ)²

where H is the number of hidden units in a layer. LN also has bias and gain parameters:

x̃ = γ (x − µ) / √(σ² + ε) + β
• In GRU or LSTM, LN is usually applied before the non-linearity.
• It has been observed that layer-normalized RNNs train faster.
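• A minimal sketch of applying layer normalization inside a recurrent update, before the non-linearity (the module name and sizes are illustrative):

import torch
import torch.nn as nn

class LNRNNCell(nn.Module):
    # h_new = tanh(LN(W h + U x + b))
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)
        self.ln = nn.LayerNorm(hidden_dim)   # includes the gain (gamma) and bias (beta)
    def forward(self, x, h):
        return torch.tanh(self.ln(self.W(h) + self.U(x)))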
Sequence-to-sequence models
for neural machine translation
Neural machine translation
Simple sequence-to-sequence model
• The simplest sequence-to-sequence model uses two RNNs: encoder and decoder.
Simple sequence-to-sequence model
• The encoder is an RNN that encodes the input sentence into a vector c = h5 .
• The whole sentence is represented as a single vector (sometimes called a thought vector).
Simple sequence-to-sequence model
• The decoder is an RNN that converts the encoded representation c into the output sentence.
• At each step, the decoder cell receives the previous word and the input-sequence representation c as inputs.
Simple sequence-to-sequence model: Training
• To produce a categorical distribution over words, we process the hidden states z_t of the decoder RNN with a linear layer W and apply softmax: p(y_t = i | y_{<t}, X) ∝ exp(w_i^⊤ z_t).
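• In code, this is a linear output layer followed by softmax, or, during training, a cross-entropy loss applied directly to the logits. A sketch with illustrative sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim = 1000, 20
out_layer = nn.Linear(hidden_dim, vocab_size, bias=False)  # rows of the weight matrix are the vectors w_i
z_t = torch.randn(1, hidden_dim)                           # decoder hidden state at step t
logits = out_layer(z_t)                                    # logits_i = w_i^T z_t
probs = F.softmax(logits, dim=-1)                          # p(y_t = i | y_<t, X)
loss = F.cross_entropy(logits, torch.tensor([42]))         # training loss for a target word with index 42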
Simple sequence-to-sequence model: Test time
• At test time, the correct output words are not available, so the translation is generated from the model's own predictive distribution p(y_t | y_{<t}, X).
• Taking the most probable word at each step (or sampling from the output distribution) is a greedy strategy. It is suboptimal: we are interested in the whole output sequence that has the highest probability.
• The most likely sequence is usually found with beam search (see, e.g., Cho, 2015).
Teacher forcing
• Training time: Feed correct words as inputs of the decoder (this is called teacher forcing).
• Test time: Feed the decoder’s own predictions as inputs (generation mode).
• The decoder needs to learn to work in the generation mode (without teacher forcing).
• To enable this, we can toggle teacher forcing on and off during training.
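• A hedged sketch of a decoder training loop with a teacher-forcing toggle (all module names, sizes, and the 0.5 ratio are illustrative placeholders, and the encoder is omitted):

import random
import torch
import torch.nn as nn

vocab, hidden = 1000, 32
embed = nn.Embedding(vocab, hidden)
cell = nn.GRUCell(hidden, hidden)
out = nn.Linear(hidden, vocab)
criterion = nn.CrossEntropyLoss()

target = torch.randint(0, vocab, (6,))        # placeholder target sentence (word indices)
h = torch.zeros(1, hidden)                    # would come from the encoder
inp = torch.tensor([0])                       # index 0 plays the role of <bos> here
use_tf = random.random() < 0.5                # toggle teacher forcing for this sequence
loss = 0.0
for t in range(target.size(0)):
    h = cell(embed(inp), h)                   # one decoder step
    logits = out(h)
    loss = loss + criterion(logits, target[t:t+1])
    inp = target[t:t+1] if use_tf else logits.argmax(dim=-1)  # correct word vs own prediction
loss.backward()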
Home assignment
Assignment 04 rnn
• In the home assignment, you need to implement a sequence-to-sequence model for machine translation.
Building a computational graph with an RNN in PyTorch
• There are two ways to build a computational graph with RNNs in PyTorch.
• In simple cases, the whole sequence can be processed with one call:
h = torch.zeros(...)
h = rnn.forward(x, h)
• In more difficult cases, you need to build a graph with a for-loop:
h = torch.zeros(...)
for x_t in x:
    h = rnn.forward(x_t, h)
• The initial states of RNNs are often initialized with zeros.
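• A concrete sketch of both options with a built-in GRU (the sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(7, 4, 10)                  # (seq_len, batch, input_size)

# Option 1: process the whole sequence with one call
rnn = nn.GRU(input_size=10, hidden_size=20)
h0 = torch.zeros(1, 4, 20)                 # (num_layers, batch, hidden_size)
outputs, h_last = rnn(x, h0)               # outputs: (seq_len, batch, hidden_size)

# Option 2: build the graph step by step with a cell
cell = nn.GRUCell(input_size=10, hidden_size=20)
h = torch.zeros(4, 20)
for x_t in x:                              # x_t: (batch, input_size)
    h = cell(x_t, h)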
How to represent words
• A simple word representation is a one-hot vector: word i is represented with a vector z such that z_i = 1 and z_j = 0 for j ≠ i.
• Better representation:
• represent each word i as a vector wi
• treat all vectors wi as model parameters and tune them in the training procedure
• this is equivalent to Wz where W is a matrix of word embeddings (word vectors wi in its columns).
• This is implemented in torch.nn.Embedding(num_embeddings, embedding_dim)
• num_embeddings is the size of the dictionary
• embedding_dim is the size of each embedding vector w_i
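• For example (the vocabulary size 1000 and dimension 50 are arbitrary):

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=50)
words = torch.tensor([5, 12, 7])           # a sentence as word indices
vectors = embed(words)                     # shape (3, 50): one learnable vector w_i per word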
Recommended reading