04 RNN Slides
How can we process a sequence?
[Figure: unrolled computation graph h0 → f → h1 → f → h2 → ... → f → h5 with inputs x1, ..., x5]

def f(x, h):
    return h + (x == 0)
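• For example, a minimal Python sketch of scanning a sequence with this f (the example inputs are arbitrary); the state h counts how many zeros have been seen so far:

h = 0                      # initial state h0
for x in [3, 0, 5, 0, 0]:  # inputs x1, ..., x5
    h = f(x, h)
print(h)                   # -> 3 (three zeros in the sequence)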
Generic processor
• "Dear #XYZ there is no network in my area and internet service is pathetic from past one week. Kindly help me out." (negative review)
• "Although the value added services being provided are great but the prices are high." (mixed review)
• "Great work done #XYZ Problem resolved by customer care in just one day." (positive review)
• To build a generic processor, we can use the same computational graph with a learnable f:
[Figure: the same unrolled graph h0 → f → h1 → ... → h5 with inputs x1, ..., x5, now with a learnable f]
Vanilla recurrent neural network (RNN)
• We can use the same building block as in the standard multilayer perceptron (MLP):
f(x, h) = tanh(Wh + Ux + b)
[Figure: unrolled RNN h0 → f → h1 → ... → f → h5 with inputs x1, ..., x5 and a softmax output layer applied to the last state]
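• A minimal PyTorch sketch of this recurrence (the module name and the dimensions below are illustrative, not from the slides):

import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    # implements f(x, h) = tanh(W h + U x + b)
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)  # the bias b lives here
    def forward(self, x, h):
        return torch.tanh(self.W(h) + self.U(x))

cell = VanillaRNNCell(input_dim=10, hidden_dim=20)
h = torch.zeros(1, 20)                 # h0
for x_t in torch.randn(5, 1, 10):      # x1, ..., x5
    h = cell(x_t, h)                   # the same cell is applied at every step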
RNN vs feedforward network
[Figure: a feedforward network x → f1 → h1 → f2 → h2 → L(·, y) next to an RNN h0 → f → h1 → f → h2 → L(·, y) with inputs x1, x2]
Training recurrent neural networks
Training an RNN
• Just like for a feedforward network, the parameters of an RNN can be found by (stochastic)
gradient descent.
[Figure: RNN with shared parameters θ: h0 → f → h1 → f → h2 → softmax → z, loss L(z, y), inputs x1, x2]
Backpropagation
• We can compute the derivatives wrt the model parameters θ and w using the chain rule.
∂L/∂θ = (∂L/∂y)(∂y/∂θ)
∂L/∂w = (∂L/∂y)(∂y/∂h)(∂h/∂w) = (∂L/∂h)(∂h/∂w)

[Figure: feedforward graph x → f1(·, w) → h → f2(·, θ) → y → L with the backward messages ∂L/∂y and ∂L/∂h]
Backpropagation in RNN
• The difference in the RNN is that each layer implements the same function f with the same (shared) parameters θ:

L = L(h2),   h2 = f(x2, h1, θ),   h1 = f(x1, h0, θ)

[Figure: h0 → f → h1 → f → h2 → L with inputs x1, x2 and shared parameters θ]
Backpropagation in RNN
• For the analysis, treat the two applications of f as if they used separate copies θ1 and θ2 of the shared parameters (with θ1 = θ2 = θ).
• Finally, we can combine the gradients wrt the shared parameters:

∂L/∂θ = (∂L/∂θ1)(∂θ1/∂θ) + (∂L/∂θ2)(∂θ2/∂θ) = ∂L/∂θ1 + ∂L/∂θ2

[Figure: h0 → f → h1 → f → h2 → L with inputs x1, x2, backward messages ∂L/∂h1, ∂L/∂h2 and gradients ∂L/∂θ1, ∂L/∂θ2 wrt the parameter copies]
• We need to compute gradients through all possible paths and aggregate them.
• The backpropagation algorithm applied to RNN is called backpropagation through time.
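• Automatic differentiation performs this aggregation for us: gradients from every time step are summed into the shared parameters. A minimal sketch using PyTorch's built-in RNNCell (the sizes and the loss are placeholders):

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=10, hidden_size=20)   # shared parameters θ
x = torch.randn(5, 1, 10)                          # 5 time steps, batch of 1
h = torch.zeros(1, 20)
for x_t in x:                                      # unroll the graph through time
    h = cell(x_t, h)                               # same cell (same θ) at every step
loss = h.pow(2).sum()                              # some scalar loss on the last state
loss.backward()                                    # backpropagation through time
print(cell.weight_hh.grad.norm())                  # gradient summed over all time steps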
Problems with RNN training
Does recurrence cause problems for training?
• Assume that we are not careful about selecting the activation φ and choose it to be the identity mapping φ(a) = a, with h0 = 0 and b = 0.
• Let us write the hidden state at time t:
h_t = W h_{t−1} + U x_t = W(W h_{t−2} + U x_{t−1}) + U x_t = WW h_{t−2} + WU x_{t−1} + U x_t = ... = Σ_{τ=1}^{t} W^{t−τ} U x_τ

[Figure: unrolled graph 0 → f → f → f → ... → f with inputs x1, x2, x3, ..., xt; the successive states are U x1, then WU x1 + U x2, then WWU x1 + WU x2 + U x3, ..., then Σ_{τ=1}^{t} W^{t−τ} U x_τ]
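• A small numerical sketch of this effect (the dimensions and the factor 1.2 are arbitrary choices for illustration): iterating a linear recurrence with a W whose spectral radius exceeds 1 makes the state norm blow up.

import torch

d = 8
W = 1.2 * torch.eye(d)              # spectral radius 1.2 > 1
U = torch.eye(d)
h = torch.zeros(d)
for t in range(50):
    x_t = torch.randn(d)
    h = W @ h + U @ x_t             # linear recurrence (phi = identity)
    if t % 10 == 0:
        print(t, h.norm().item())   # the norm grows roughly like 1.2**t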
Analysis for diagonalizable W
• For simplicity, let us assume that matrix W is diagonalizable and its eigenvalue decomposition
W = QΛQ−1 exists, where Q contains the eigenvectors of W and Λ is the diagonal matrix
containing the eigenvalues of W. We can then re-write W^{t−τ} = Q Λ^{t−τ} Q^{−1}, so that h_t = Σ_{τ=1}^{t} Q Λ^{t−τ} Q^{−1} U x_τ.
Analysis for a more general case
• Suppose W Q_m = Q_m Λ, where the columns of Q_m are eigenvectors of W and Λ is the diagonal matrix of the corresponding eigenvalues λ_i.
• We can write U x_τ = Q_m z + z_0 where z_0 is orthogonal to the eigenvectors in Q_m, that is Q_m^⊤ z_0 = 0.
• Then, one term in the expression for h_t is W^{t−τ} U x_τ = W^{t−τ} Q_m z + W^{t−τ} z_0.
• Let us look at the first term only: W^{t−τ} Q_m z = Q_m Λ^{t−τ} z = Σ_i q_i λ_i^{t−τ} z_i.
• Again, if one of the eigenvalues is such that |λ_i| > 1, then the norm of q_i λ_i^{t−τ} z_i will grow exponentially, causing explosions in the forward computations.
Explosions in forward computations
[Figure: example of hidden-state magnitudes exploding during the forward pass]
Are there similar problems in backward computations?
• Let us look at the longest path of derivative computations for an RNN h_t = φ(W h_{t−1} + U x_t + b):

[Figure: unrolled graph h0 → f → h1 → f → h2 → ... → f → ht → L with inputs x1, ..., xt and backward messages ∂L/∂h1, ∂L/∂h2, ..., ∂L/∂ht]

• ∂L/∂h_1 is a column vector of the partial derivatives ∂L/∂h_{1i}.
• We denote φ′_τ = φ′(W h_{τ−1} + U x_τ + b).
Gradient explosions (Pascanu et al., 2013)
(∂L/∂h_1)^⊤ = (∂L/∂h_t)^⊤ ∏_{τ=t,...,2} ∂h_τ/∂h_{τ−1} = (∂L/∂h_t)^⊤ ∏_{τ=t,...,2} diag(φ′_τ) W

[Figure: plots of φ(h) = tanh(h) and its derivative]

• Suppose φ(h) = tanh(h) and all the neurons in the RNN are not saturated, which means that |φ′_τ| ≥ γ for some γ > 0.
• If the spectral radius of W is greater than 1/γ, then the norm of the gradient can grow exponentially with the sequence length: the gradients explode.
How to cope with gradient explosions?
• A common remedy is gradient clipping: when the norm of the gradient exceeds a chosen threshold, rescale the gradient so that its norm equals the threshold (Pascanu et al., 2013).
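• A minimal PyTorch sketch of clipping the global gradient norm (the model, the loss, and the threshold 1.0 are placeholders for illustration):

import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(5, 1, 10)                  # one training sequence
out, h_last = model(x)
loss = out.pow(2).mean()                   # placeholder loss
loss.backward()                            # gradients may explode here
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip the global norm
optimizer.step()                           # update with the clipped gradients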
Vanishing gradients
• For φ(h) = tanh(h), the derivatives satisfy 0 < φ′_τ ≤ 1, so the product of many such factors shrinks towards zero and the gradients vanish.
• To avoid vanishing gradients, it is good to keep the neurons in the non-saturated regime, where the derivatives φ′ are close to 1.
Vanishing gradients
• The vanishing gradients problem makes it difficult to learn long-range dependencies in the data:
• In sentiment analysis, it is difficult to capture the effect of the first words in a paragraph on the
predicted class.
• In time-series modeling, it is difficult to capture slowly changing phenomena.
• Vanilla RNNs h_t = φ(W h_{t−1} + U x_t + b) are rarely used in practice.
• Recurrent units with gating mechanisms work better.
• Gated recurrent unit (GRU) (Cho et al., 2014)
• Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997)
Historical note on RNNs
• Recurrent neural networks for sequential data processing were proposed in the 80s
(Rumelhart et al., 1986; Elman, 1990; Werbos, 1988).
• RNNs did not gain much popularity because they were particularly difficult to train with
backpropagation:
• Unstable training because of gradient explosions
• Difficulty learning long-term dependencies due to vanishing gradients (Bengio et al., 1994)
• The breakthrough came with the invention of Long Short-Term Memory (LSTM) RNN (Hochreiter
and Schmidhuber, 1997) which was designed to solve the gradient explosion/vanishing problem.
• LSTM remained largely unnoticed in the community until the deep learning boom started.
Gated recurrent unit (GRU)
(Cho et al., 2014)
Gated recurrent unit (GRU)
h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)        (update gate)
h̃_t = φ(W(r_t ⊙ h_{t−1}) + U x_t + b_h)     (candidate state)
r_t = σ(W_r h_{t−1} + U_r x_t + b_r)        (reset gate)

where ⊙ denotes element-wise multiplication.
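• A minimal PyTorch sketch of these equations (the module name, the dimensions, and the choice φ = tanh are illustrative):

import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.Wu = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Uu = nn.Linear(input_dim, hidden_dim)   # bias b_u lives here
        self.Wr = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Ur = nn.Linear(input_dim, hidden_dim)   # bias b_r
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)    # bias b_h
    def forward(self, x, h):
        u = torch.sigmoid(self.Wu(h) + self.Uu(x))       # update gate u_t
        r = torch.sigmoid(self.Wr(h) + self.Ur(x))       # reset gate r_t
        h_tilde = torch.tanh(self.W(r * h) + self.U(x))  # candidate state
        return (1 - u) * h + u * h_tilde                 # new state h_t

cell = GRUCellSketch(input_dim=10, hidden_dim=20)
h = torch.zeros(1, 20)
for x_t in torch.randn(5, 1, 10):
    h = cell(x_t, h)

• In practice one would use the built-in torch.nn.GRU or torch.nn.GRUCell.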
Gated recurrent unit (GRU)
• State update:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t,   where h̃_t = φ(W(r_t ⊙ h_{t−1}) + U x_t + b_h)
Does GRU help with the vanishing gradient problem?
• For simplicity, let us assume that the state of an RNN is one-dimensional and all intermediate
signals do not depend on time step τ :
∂L/∂h_1 = ∂L/∂h_t ∏_{τ=t,...,2} ((1 − u_τ) + u_τ φ′_τ w r_τ) = ∂L/∂h_t ((1 − u) + u γ r)^{t−1} = ∂L/∂h_t ((1 + γ/2)/2)^{t−1}

where γ = φ′ w, and the last equality assumes u = r = 1/2.
Does GRU help with the vanishing gradient problem?
• Gradient propagation in GRU (simplified): ∂L/∂h_1 = ∂L/∂h_t ((1 + γ/2)/2)^{t−1}
• Let us do the same simplified analysis for the vanilla RNN:

∂L/∂h_1 = ∂L/∂h_t ∏_{τ=t,...,2} diag(φ′_τ) W = ∂L/∂h_t ∏_{τ=t,...,2} φ′_τ w = ∂L/∂h_t γ^{t−1},   where γ = φ′ w

• If γ is small, the gradients in GRU decay with rate 1/2, which is much better than the rate of γ in the vanilla RNN.
• If γ is large, the magnitudes of the gradients grow exponentially as O(γ^t / 4^t), which is better than O(γ^t) in the vanilla RNN.
• Thus, the gating mechanism mitigates the problem of vanishing/exploding gradients. Gradients may still explode or vanish in GRU, but such problems occur less often than in the vanilla RNN.
Connection to probabilistic graphical models
for sequential data
Linear dynamical systems
p(h_1) = N(h_1 | µ_1, R_1)
p(h_t | h_{t−1}) = N(h_t | B h_{t−1}, R)
p(x_t | h_t) = N(x_t | A h_t, V)

[Figure: graphical model with latent chain h1 → h2 → h3 → h4 and observations x1, x2, x3, x4]
• Inference in linear dynamical systems: Find the conditional distribution p(ht | x1 , . . . , xt ) of latent
variables h1 , h2 , . . . , ht given the observation sequence x1 , x2 , . . . , xt .
• Since it is a linear Gaussian probabilistic model, the inference can be done using the
message-passing algorithm (see, e.g., Chapter 13 of Bishop, 2006) which yields the Kalman filter.
Kalman filter: Message-passing in linear dynamical systems
1. Prediction: p(h_t | x_1, ..., x_{t−1}) = N(h_t | →h_t, P_t) with

→h_t = B h̄_{t−1}
P_t = B Σ_{t−1} B^⊤ + R

where h̄_{t−1} and Σ_{t−1} are the mean and covariance of the corrected distribution from the previous step, and →h_t, P_t denote the predicted mean and covariance.

[Figure: message passing along the chain ... → h_{t−2} → h_{t−1} → h_t with observations x_{t−2}, x_{t−1}, x_t; the forward message carries →h_t and P_t]
Kalman filter in one-dimensional case
• Let us look closer at the correction equation for the mean values of the hidden states
h̄_t = →h_t + K_t (x_t − A →h_t)
K_t = P_t A^⊤ (A P_t A^⊤ + V)^{−1}
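• A hedged one-dimensional sketch of the full predict/correct recursion in plain Python (the numbers are made up, and the variance-correction line is the standard Kalman update even though it is not written out above):

# Model: h_t = b*h_{t-1} + noise(r),  x_t = a*h_t + noise(v)
a, b, r, v = 1.0, 0.9, 0.1, 0.5
h_bar, sigma = 0.0, 1.0                       # corrected mean and variance
for x_t in [0.3, 0.1, -0.2, 0.4]:             # observations
    h_pred = b * h_bar                        # predicted mean
    p_pred = b * sigma * b + r                # predicted variance
    k = p_pred * a / (a * p_pred * a + v)     # Kalman gain K_t
    h_bar = h_pred + k * (x_t - a * h_pred)   # corrected mean
    sigma = (1 - k * a) * p_pred              # corrected variance
    print(round(h_bar, 3), round(sigma, 3))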
Motivation of gatings in recurrent units
h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)

• This example justifies the use of gates in recurrent units: the gates allow combining information gained from the previous observations with information from the current observation.
• The same intuition holds for nonlinear dynamical systems (extended Kalman filter), which can be learned by RNNs.
Computational graph of RNN as implementation of message passing
• Message passing in the (one-dimensional) linear dynamical system can be written in a gated form:

h̄_t = (1 − u_t) →h_t + u_t (x_t / a)
u_t = σ(log(a² p_{t−1} / v)) = a² p_{t−1} / (a² p_{t−1} + v)

(in the multivariate case, the term x_t / a becomes A† x_t)

• Compare this with the state update of a gated recurrent unit:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t
u_t = σ(W_u h_{t−1} + U_u x_t + b_u)

[Figure: the graphical model ... → h_{t−2} → h_{t−1} → h_t with observations x_{t−2}, x_{t−1}, x_t shown next to the corresponding RNN computational graph]

• The computational graph of an RNN with gatings can be seen as an implementation of an inference procedure for a probabilistic graphical model with sequential data.
Long short-term memory (LSTM)
(Hochreiter and Schmidhuber, 1997)
Long short-term memory (LSTM) unit
h_t = o_t ⊙ φ_h(c_t)
o_t = σ(W_o h_{t−1} + U_o x_t + b_o)        (output gate)
Initialization of the forget gates (Jozefowicz et al., 2015)
• Forget gate and cell update:

f_t = σ(W_f h_{t−1} + U_f x_t + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ φ_c(W_c h_{t−1} + U_c x_t + b_c)

where i_t is the input gate, i_t = σ(W_i h_{t−1} + U_i x_t + b_i).
• Common initialization of the forget gate: small random weights for b_f. This initialization effectively sets the forget gate to 1/2, and therefore the gradient vanishes with a factor of 1/2 per timestep. It works well in many problems.
• However, sometimes an RNN can fail to learn long-term dependencies. This problem can be
addressed by initializing the forget gates bf to large values such as 1 or 2.
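• A minimal PyTorch sketch of this trick; it relies on PyTorch's gate ordering (input, forget, cell, output) within the bias vectors, and the value 1.0 is one possible choice:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)
H = lstm.hidden_size
for name, param in lstm.named_parameters():
    if name.startswith("bias"):              # bias_ih_l0 and bias_hh_l0
        param.data[H:2 * H].fill_(1.0)       # the forget-gate slice corresponds to b_f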
Architecture search for recurrent units
• LSTM and GRU have somewhat similar but different architectures. Can there be even better
architectures of the recurrent unit?
• Jozefowicz et al. (2015) performed a random search over architectures by constructing recurrent units from a selected set of operations and testing their performance on a set of standard benchmarks.
• The best architectures found in that procedure were very similar to GRU!
z = σ(W_xz x_t + b_z)
r = σ(W_xr x_t + W_hr h_t + b_r)
h_{t+1} = tanh(W_hh(r ⊙ h_t) + tanh(x_t) + b_h) ⊙ z + h_t ⊙ (1 − z)
Layer normalization (Ba et al., 2016)
• Batch normalization significantly reduces the training time in feed-forward neural networks. Can
we apply the same idea to recurrent networks?
• If we apply BN to an RNN, we need to compute and store separate statistics for each time step in
a sequence. This is problematic if a test sequence is longer than any of the training sequences.
• Layer normalization (LN) is a modification of BN in which statistics are computed over the hidden
units of a single time step:
µ = (1/H) Σ_{i=1}^{H} x_i        σ² = (1/H) Σ_{i=1}^{H} (x_i − µ)²

where H is the number of hidden units in a layer. LN also has bias and gain parameters:

x̃ = γ (x − µ) / √(σ² + ε) + β
• In GRU or LSTM, LN is usually applied before the non-linearity.
• It has been observed that layer-normalized RNNs train faster.
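• A minimal sketch of applying layer normalization inside a recurrent update, before the non-linearity (the module name and sizes are illustrative):

import torch
import torch.nn as nn

class LNRNNCell(nn.Module):
    # h_new = tanh(LN(W h + U x + b))
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(input_dim, hidden_dim)
        self.ln = nn.LayerNorm(hidden_dim)   # includes the gain (gamma) and bias (beta)
    def forward(self, x, h):
        return torch.tanh(self.ln(self.W(h) + self.U(x)))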
Sequence-to-sequence models
for neural machine translation
Neural machine translation
Simple sequence-to-sequence model
• The simplest sequence-to-sequence model uses two RNNs: encoder and decoder.
Simple sequence-to-sequence model
• The encoder is an RNN that encodes the input sentence into a vector c = h5 .
• The whole sentence is represented as a single vector (sometimes called a thought vector).
Simple sequence-to-sequence model
• The decoder is an RNN that converts the encoded representation c into the output sentence.
• At each step, the decoder cell receives the previous word and the input-sequence representation c as inputs.
Simple sequence-to-sequence model: Training
• To produce a categorical distribution over words, we process the hidden states z_t of the decoder RNN with a linear layer W and apply softmax: p(y_t = i | y_{<t}, X) ∝ exp(w_i^⊤ z_t).
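• In code, this is a linear output layer followed by softmax, or, during training, a cross-entropy loss applied directly to the logits. A sketch with illustrative sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim = 1000, 20
out_layer = nn.Linear(hidden_dim, vocab_size, bias=False)  # rows of the weight matrix are the vectors w_i
z_t = torch.randn(1, hidden_dim)                           # decoder hidden state at step t
logits = out_layer(z_t)                                    # logits_i = w_i^T z_t
probs = F.softmax(logits, dim=-1)                          # p(y_t = i | y_<t, X)
loss = F.cross_entropy(logits, torch.tensor([42]))         # training loss for a target word with index 42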
Simple sequence-to-sequence model: Test time
• At test time, the correct output words are not available, so the translation is generated from the model's own predictive distribution p(y_t | y_{<t}, X).
• Taking the most probable word at each step (or sampling from the output distribution) is a greedy strategy. It is suboptimal: we are interested in the whole output sequence that has the highest probability.
• The most likely sequence is usually found with beam search (see, e.g., Cho, 2015).
Teacher forcing
• Training time: Feed correct words as inputs of the decoder (this is called teacher forcing).
• Test time: Feed the decoder’s own predictions as inputs (generation mode).
• The decoder needs to learn to work in the generation mode (without teacher forcing).
• To enable this, we can toggle teacher forcing on and off during training.
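• A hedged sketch of a decoder training loop with a teacher-forcing toggle (all module names, sizes, and the 0.5 ratio are illustrative placeholders, and the encoder is omitted):

import random
import torch
import torch.nn as nn

vocab, hidden = 1000, 32
embed = nn.Embedding(vocab, hidden)
cell = nn.GRUCell(hidden, hidden)
out = nn.Linear(hidden, vocab)
criterion = nn.CrossEntropyLoss()

target = torch.randint(0, vocab, (6,))        # placeholder target sentence (word indices)
h = torch.zeros(1, hidden)                    # would come from the encoder
inp = torch.tensor([0])                       # index 0 plays the role of <bos> here
use_tf = random.random() < 0.5                # toggle teacher forcing for this sequence
loss = 0.0
for t in range(target.size(0)):
    h = cell(embed(inp), h)                   # one decoder step
    logits = out(h)
    loss = loss + criterion(logits, target[t:t+1])
    inp = target[t:t+1] if use_tf else logits.argmax(dim=-1)  # correct word vs own prediction
loss.backward()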
Home assignment
Assignment 04 rnn
• In the home assignment, you need to implement a sequence-to-sequence model for machine translation.
Building a computational graph with an RNN in PyTorch
• There are two ways to build a computational graph with RNNs in PyTorch.
• In simple cases, the whole sequence can be processed with one call:
h = torch.zeros(...)
h = rnn.forward(x, h)
• In more difficult cases, you need to build a graph with a for-loop:
h = torch.zeros(...)
for x_t in x:
    h = rnn.forward(x_t, h)
• The initial states of RNNs are often initialized with zeros.
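• A concrete sketch of both options with a built-in GRU (the sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(7, 4, 10)                  # (seq_len, batch, input_size)

# Option 1: process the whole sequence with one call
rnn = nn.GRU(input_size=10, hidden_size=20)
h0 = torch.zeros(1, 4, 20)                 # (num_layers, batch, hidden_size)
outputs, h_last = rnn(x, h0)               # outputs: (seq_len, batch, hidden_size)

# Option 2: build the graph step by step with a cell
cell = nn.GRUCell(input_size=10, hidden_size=20)
h = torch.zeros(4, 20)
for x_t in x:                              # x_t: (batch, input_size)
    h = cell(x_t, h)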
How to represent words
• A simple word representation is a one-hot vector: word i is represented with a vector z such that z_i = 1 and z_j = 0 for j ≠ i.
• Better representation:
• represent each word i as a vector wi
• treat all vectors wi as model parameters and tune them in the training procedure
• this is equivalent to Wz where W is a matrix of word embeddings (word vectors wi in its columns).
• This is implemented in torch.nn.Embedding(num_embeddings, embedding_dim)
• num_embeddings is the size of the dictionary
• embedding_dim is the size of each embedding vector w_i
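• For example (the vocabulary size 1000 and dimension 50 are arbitrary):

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=50)
words = torch.tensor([5, 12, 7])           # a sentence as word indices
vectors = embed(words)                     # shape (3, 50): one learnable vector w_i per word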
Recommended reading