
CS-E4890 Deep Learning

Lecture #4: Recurrent neural networks

Jorma Laaksonen — Juho Kannala — Alexander Ilin


Sequence modeling

• Previously: inputs and outputs are vectors of fixed sizes


• MNIST: inputs: 28x28 images, outputs: 10 classes
• In some tasks, inputs can be sequences, and each sequence can have a different number of elements:

  (x1^(1), x2^(1), x3^(1)) → y^(1)
  (x1^(2), x2^(2), x3^(2), x4^(2)) → y^(2)

• Example: sentiment analysis


  "Dear #XYZ there is no network in my area and internet service is pathetic from past one week. Kindly help me out." → negative review
  "Although the value added services being provided are great but the prices are high." → mixed review
  "Great work done #XYZ Problem resolved by customer care in just one day." → positive review
How can we process a sequence?

• Example: count the number of zeros in an input sequence (x1, x2, x3, . . ., xT)


      h = 0
      for x in input_sequence:
          if x == 0:
              h = h + 1

• How to implement this in a computational graph:

  [Figure: chain h0 → f → h1 → f → ... → f → h5, with the inputs x1, ..., x5 fed into the corresponding f blocks]

      def f(x, h):
          return h + (x == 0)
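• A minimal sketch of unrolling this graph over a toy input sequence (the values below are chosen only for illustration):

      def f(x, h):
          return h + (x == 0)          # increment the counter when a zero is seen

      h = 0                            # h0
      for x in [3, 0, 5, 0, 0]:        # x1, ..., x5
          h = f(x, h)                  # h1, ..., h5
      print(h)                         # prints 3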
Generic processor

• How can we learn to process sequences from training examples?


• Example: sentiment analysis

  "Dear #XYZ there is no network in my area and internet service is pathetic from past one week. Kindly help me out." → negative review
  "Although the value added services being provided are great but the prices are high." → mixed review
  "Great work done #XYZ Problem resolved by customer care in just one day." → positive review

• To build a generic processor, we can use the same computational graph with a learnable f:

  [Figure: chain h0 → f → h1 → f → ... → f → h5 with inputs x1, ..., x5, where f is now a learnable function]
Vanilla recurrent neural network (RNN)

• We can use the same building block as in the standard multilayer perceptron (MLP):

f (x, h) = tanh(Wh + Ux + b)

  [Figure: RNN chain h0 → f → h1 → f → ... → f → h5 with inputs x1, ..., x5; a softmax output layer is attached to the last state h5]

• The same function is applied recurrently, hence the name recurrent neural network (RNN).

• h is often called the hidden state.
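• A minimal PyTorch sketch of this vanilla RNN cell (the input and hidden sizes below are illustrative assumptions):

      import torch
      import torch.nn as nn

      class VanillaRNNCell(nn.Module):
          def __init__(self, input_size, hidden_size):
              super().__init__()
              self.W = nn.Linear(hidden_size, hidden_size, bias=False)   # W h
              self.U = nn.Linear(input_size, hidden_size, bias=True)     # U x + b

          def forward(self, x, h):
              return torch.tanh(self.W(h) + self.U(x))                   # tanh(W h + U x + b)

      cell = VanillaRNNCell(8, 16)
      h = torch.zeros(1, 16)                    # h0
      for x_t in torch.randn(5, 1, 8):          # x1, ..., x5
          h = cell(x_t, h)                      # h1, ..., h5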

RNN vs feedforward network

[Figure: left, the computational graph of a feedforward network: x → f1 → h1 → f2 → h2 → L, with separate parameters W1 and W2 and the target y feeding the loss L. Right, the computational graph of an RNN: h0 → f → h1 → f → h2 → L, with inputs x1, x2 added at every step and the same parameters θ used by every application of f.]

• External inputs are added at every step.


• The same parameters are used in every layer (time step).

Training recurrent neural networks
Training an RNN

• Just like for a feedforward network, the parameters of an RNN can be found by (stochastic)
gradient descent.
  [Figure: RNN h0 → f → h1 → f → h2 with shared parameters θ; a softmax layer maps h2 to the output z, and the loss L compares z with the target y]

• For example, we can tune parameters θ by minimizing the cost function

  θ* = arg min_θ − (1/N) Σ_{n=1..N} Σ_{j=1..K} y_j^(n) log z_j^(n)

• We need to compute gradients wrt parameters θ.
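• A minimal single-training-step sketch for an RNN classifier (module names and sizes below are illustrative assumptions, not the course code); nn.CrossEntropyLoss combines the softmax and the negative log-likelihood above, and loss.backward() computes the gradients w.r.t. θ:

      import torch
      import torch.nn as nn

      rnn = nn.RNN(input_size=8, hidden_size=16)       # tanh RNN
      readout = nn.Linear(16, 10)                      # maps h_T to K = 10 class scores
      loss_fn = nn.CrossEntropyLoss()                  # softmax + negative log-likelihood
      optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)

      x = torch.randn(5, 32, 8)                        # sequence of length 5, batch of 32
      y = torch.randint(0, 10, (32,))                  # class labels
      _, h_T = rnn(x)                                  # final hidden state, shape (1, 32, 16)
      loss = loss_fn(readout(h_T.squeeze(0)), y)
      optimizer.zero_grad()
      loss.backward()                                  # backpropagation through time
      optimizer.step()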

Backpropagation

• Recall backpropagation in a multi-layer model that operates with scalars:


  L = L(y),   y = f2(h, θ),   h = f1(x, w)

• We can compute the derivatives wrt the model parameters θ and w using the chain rule.

  ∂L/∂θ = (∂L/∂y) · (∂y/∂θ)

  ∂L/∂w = (∂L/∂y) · (∂y/∂h) · (∂h/∂w),   where (∂L/∂y) · (∂y/∂h) = ∂L/∂h

  [Figure: the graph x → f1 → h → f2 → y → L with parameters w and θ; the gradients ∂L/∂y, ∂L/∂h, ∂L/∂w, ∂L/∂θ are computed by propagating backwards through the graph]
Backpropagation in RNN

• The difference in the RNN is that each layer implements the same function f with the same (shared) parameters θ:

  L = L(h2),   h2 = f(x2, h1, θ),   h1 = f(x1, h0, θ)

  [Figure: the two-step RNN graph h0 → f → h1 → f → h2 → L with inputs x1, x2 and shared parameters θ]

• Let us assume for now that the parameters of the layers are not shared.

  [Figure: the same two-step graph, now with separate parameters θ1 and θ2 for the two applications of f]
Backpropagation in RNN

• We can compute the derivatives with respect to the parameters θ1 and θ2 using the chain rule:

  ∂L/∂θ2 = (∂L/∂h2) · (∂h2/∂θ2)

  ∂L/∂θ1 = (∂L/∂h2) · (∂h2/∂h1) · (∂h1/∂θ1),   where (∂L/∂h2) · (∂h2/∂h1) = ∂L/∂h1

  [Figure: the two-step graph with the gradients ∂L/∂θ1, ∂L/∂θ2, ∂L/∂h1, ∂L/∂h2 flowing backwards]

• We can compute the derivatives efficiently using backpropagation.

Backpropagation in RNN

• Finally, we can combine the gradients with respect to the shared parameters (since θ1 = θ2 = θ, the Jacobians ∂θ1/∂θ and ∂θ2/∂θ are identity):

  ∂L/∂θ = (∂L/∂θ1) · (∂θ1/∂θ) + (∂L/∂θ2) · (∂θ2/∂θ) = ∂L/∂θ1 + ∂L/∂θ2

  [Figure: the two-step graph in which the shared parameter θ feeds both θ1 and θ2]

• We need to compute gradients through all possible paths and aggregate them.
• The backpropagation algorithm applied to RNNs is called backpropagation through time.
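• A small sketch of this in PyTorch autograd: calling the same module at every time step makes the per-step gradients accumulate into the shared parameters (the toy cell below is an illustrative stand-in for f):

      import torch
      import torch.nn as nn

      cell = nn.Linear(1, 1)                  # a stand-in for f with shared parameters θ
      h = torch.zeros(1, 1)                   # h0
      for x_t in torch.randn(2, 1, 1):        # two time steps x1, x2
          h = torch.tanh(cell(x_t) + h)       # h1, h2
      loss = h.sum()
      loss.backward()                         # backpropagation through time
      print(cell.weight.grad)                 # ∂L/∂θ1 + ∂L/∂θ2, aggregated automatically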

Problems with RNN training
Does recurrence cause problems for training?

• Consider a vanilla RNN:

ht = f (xt , ht−1 , W, U, b) = φ(Wht−1 + Uxt + b)

• Assume that we are not careful about selecting φ and we select it to be an identity mapping
φ(a) = a, h0 = 0 and b = 0.
• Let us write the hidden state at time t:

  ht = W ht−1 + U xt = W(W ht−2 + U xt−1) + U xt = WW ht−2 + WU xt−1 + U xt = ... = Σ_{τ=1..t} W^(t−τ) U xτ

  [Figure: the unrolled chain 0 → f → f → f → ... → f with inputs x1, x2, x3, ..., xt; the intermediate states are U x1, WU x1 + U x2, WWU x1 + WU x2 + U x3, ..., Σ_{τ=1..t} W^(t−τ) U xτ]
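• A quick numerical sketch of this effect (sizes, values, and the spectral radius below are illustrative): with an identity activation, the norm of ht grows roughly like 1.1^t once the spectral radius of W is 1.1:

      import torch

      torch.manual_seed(0)
      W = 1.1 * torch.eye(4)              # spectral radius 1.1 > 1
      U = torch.randn(4, 4)
      h = torch.zeros(4)
      for t in range(50):
          x_t = torch.randn(4)
          h = W @ h + U @ x_t             # identity activation φ(a) = a
          if (t + 1) % 10 == 0:
              print(t + 1, h.norm().item())   # the norm grows roughly like 1.1**t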

Analysis for diagonalizable W

• For simplicity, let us assume that matrix W is diagonalizable and its eigenvalue decomposition
W = QΛQ−1 exists, where Q contains the eigenvectors of W and Λ is the diagonal matrix
containing the eigenvalues of W. We can then re-write:

  W^(t−τ) = (QΛQ−1)(QΛQ−1) · · · (QΛQ−1)   [t−τ times]   = Q Λ^(t−τ) Q−1
• Let us now look at one term in the formula ht = Σ_{τ=1..t} W^(t−τ) U xτ:

  W^(t−τ) U xτ = Q Λ^(t−τ) Q−1 U xτ = Q Λ^(t−τ) z = Σ_i qi λi^(t−τ) zi

  where we denote z = Q−1 U xτ and zi is the i-th component of z.


• If there is an eigenvalue λi such that |λi| > 1, then the norm of the corresponding term qi λi^(t−τ) zi will grow exponentially with t, causing explosions in the forward computations.
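• A quick numerical check of the identity W^(t−τ) = Q Λ^(t−τ) Q−1 used above (a sketch with a randomly drawn W, assumed diagonalizable):

      import torch

      torch.manual_seed(0)
      W = torch.randn(3, 3)
      evals, Q = torch.linalg.eig(W)                     # eigenvalues and eigenvectors (complex in general)
      k = 5
      lhs = torch.linalg.matrix_power(W, k).to(evals.dtype)
      rhs = Q @ torch.diag(evals**k) @ torch.linalg.inv(Q)
      print(torch.allclose(lhs, rhs, atol=1e-3))         # True, up to numerical error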
Analysis for a more general case

• Let Qm be an n × m matrix containing the m linearly independent unit-norm eigenvectors of W in its columns and Λ be a diagonal matrix made of the corresponding eigenvalues λi:

  W Qm = Qm Λ

• We can write U xτ = Qm z + z0, where z0 belongs to the null space of Qm^T, that is Qm^T z0 = 0.

• Then, one term in the expression for ht is W^(t−τ) U xτ = W^(t−τ) Qm z + W^(t−τ) z0.
• Let us look at the first term only:

  W^(t−τ) Qm z = W^(t−τ−1) W Qm z = W^(t−τ−1) Qm Λ z = W^(t−τ−2) Qm Λ² z = ... = Qm Λ^(t−τ) z = Σ_i qi λi^(t−τ) zi

• Again, if one of the eigenvalues is such that |λi| > 1, then the norm of qi λi^(t−τ) zi will grow exponentially, causing explosions in the forward computations.
Explosions in forward computations

• The largest absolute value of the eigenvalues is called the spectral radius:

  spectral radius(W) = max_i |λi|

• Forward explosions happen if the spectral radius of W is greater than 1.


• Will explosions happen if we use tanh nonlinearity at each time step?

ht = φ(Wht−1 + Uxt + b) = tanh(Wht−1 + Uxt + b)

• Since tanh is bounded in (−1, 1), the explosions cannot happen.


• This is the reason why tanh is most commonly used in RNNs.

Are there similar problems in backward computations?

• Let us look at the longest path of derivative computations (shown in red in the figure) for an RNN

  ht = φ(W ht−1 + U xt + b)

  [Figure: the unrolled chain h0 → f → h1 → f → h2 → ... → f → ht → L with inputs x1, ..., xt; the backward path propagates ∂L/∂ht, ..., ∂L/∂h2, ∂L/∂h1 through the chain]

  (∂L/∂h1)^T = (∂L/∂ht)^T · Π_{τ=t,...,2} ∂hτ/∂hτ−1 = (∂L/∂ht)^T · Π_{τ=t,...,2} diag(φ'τ) W

• ∂L/∂h1 is a column vector of the partial derivatives ∂L/∂h1i
• φ'τ = φ'(W hτ−1 + U xτ + b)
Gradient explosions (Pascanu et al., 2013)

  (∂L/∂h1)^T = (∂L/∂ht)^T · Π_{τ=t,...,2} ∂hτ/∂hτ−1 = (∂L/∂ht)^T · Π_{τ=t,...,2} diag(φ'τ) W

• Suppose φ(h) = tanh(h) and all our neurons in the RNN are not saturated, which means that

  |φ'τ| ≥ γ

• If the spectral radius of W is greater than 1/γ, then the gradient explodes.

  [Figure: plots of φ(h) = tanh(h) and its derivative φ'(h)]

• The gradient may explode even for a bounded activation function φ!


• To avoid explosions, it is good to keep neurons in the saturated regime where the derivatives φ' are small.

How to cope with gradient explosions?

• Gradient explosions (caused by recurrence) are one problem with training RNNs.

• One workaround: clip the gradient if it is larger than some pre-defined value:
  • clipping can be done element-wise (Mikolov, 2012) or by clipping the norm (Pascanu et al., 2013):

    if ‖g‖ ≥ ∆, then g ← ∆ g / ‖g‖

• In PyTorch, clipping of gradients can be done by re-writing parameter.grad.data after calling loss.backward().

  [Figure from Pascanu et al. (2013)]
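• A minimal sketch of norm clipping with PyTorch's built-in helper (the model and loss below are illustrative); torch.nn.utils.clip_grad_norm_ applies the rule above to the total norm over all given parameters, which is a common variant:

      import torch
      import torch.nn as nn

      rnn = nn.RNN(input_size=8, hidden_size=16)
      x = torch.randn(5, 1, 8)
      out, _ = rnn(x)
      loss = out.pow(2).mean()                 # a dummy loss, just for the sketch
      loss.backward()
      torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)   # ∆ = 1.0
      # ... then take the optimizer step as usual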

Vanishing gradients

• Let us look at the gradients again:

  (∂L/∂h1)^T = (∂L/∂ht)^T · Π_{τ=t,...,2} ∂hτ/∂hτ−1 = (∂L/∂ht)^T · Π_{τ=t,...,2} diag(φ'τ) W

• The absolute values |φ'τ| are bounded:

  0 < |φ'τ| ≤ 1

• If the spectral radius of W is smaller than 1, the gradient will vanish (its norm will decay exponentially as t increases).

  [Figure: plots of φ(h) = tanh(h) and its derivative φ'(h)]

• To avoid vanishing gradients, it is good to keep neurons in the non-saturated regime where the derivatives φ' are close to 1.

Vanishing gradients

• The vanishing gradients problem makes it difficult to learn long-range dependencies in the data:
• In sentiment analysis, it is difficult to capture the effect of the first words in a paragraph on the
predicted class.
• In time-series modeling, it is difficult to capture slowly changing phenomena.
• Vanilla RNNs ht = φ(W ht−1 + U xt + b) are rarely used in practice.
• Recurrent units with gating mechanisms work better.
• Gated recurrent unit (GRU) (Cho et al., 2014)
• Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997)

Historical note on RNNs

• Recurrent neural networks for sequential data processing were proposed in the 80s
(Rumelhart et al., 1986; Elman, 1990; Werbos, 1988).
• RNNs did not gain much popularity because they were particularly difficult to train with
backpropagation:
• Unstable training because of gradient explosions
• Difficulty to learn long-term dependencies due to vanishing gradients (Bengio et al., 1994)
• The breakthrough came with the invention of Long Short-Term Memory (LSTM) RNN (Hochreiter
and Schmidhuber, 1997) which was designed to solve the gradient explosion/vanishing problem.
• LSTM remained largely unnoticed in the community until the deep learning boom started.

Gated recurrent unit (GRU)
(Cho et al., 2014)
Gated recurrent unit (GRU)

• Motivation for gating in GRU:


• Vanilla RNN ht = φ(W ht−1 + U xt + b) rewrites all the elements of the state ht−1 with new values ht.
• How can we keep old values for some elements of ht−1 ?
• GRU uses an update gate ut ∈ (0, 1) that controls which states should be updated:

  ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t
  ut = σ(Wu ht−1 + Uu xt + bu)

  where σ(x) = 1/(1 + e^(−x)) is the sigmoid function and h̃t are the new state candidates.

• The new state candidates are computed using only the states selected by the reset gate rt:

  h̃t = φ(W(rt ⊙ ht−1) + U xt + bh)
  rt = σ(Wr ht−1 + Ur xt + br)

Gated recurrent unit (GRU)

• State update: ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t

• Update gate: ut = σ(Wu ht−1 + Uu xt + bu)

• New candidate state: h̃t = φ(W(rt ⊙ ht−1) + U xt + bh)

• Reset gate: rt = σ(Wr ht−1 + Ur xt + br)
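• A minimal GRU-cell sketch following the equations above (sizes are illustrative; PyTorch's built-in nn.GRUCell uses a closely related parameterization):

      import torch
      import torch.nn as nn

      class GRUCellSketch(nn.Module):
          def __init__(self, input_size, hidden_size):
              super().__init__()
              self.lin_u = nn.Linear(input_size + hidden_size, hidden_size)  # update gate u_t
              self.lin_r = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate r_t
              self.lin_h = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state

          def forward(self, x, h):
              xh = torch.cat([x, h], dim=-1)
              u = torch.sigmoid(self.lin_u(xh))                              # u_t
              r = torch.sigmoid(self.lin_r(xh))                              # r_t
              h_tilde = torch.tanh(self.lin_h(torch.cat([x, r * h], dim=-1)))
              return (1 - u) * h + u * h_tilde                               # h_t

      cell = GRUCellSketch(8, 16)
      h = torch.zeros(1, 16)
      for x_t in torch.randn(5, 1, 8):
          h = cell(x_t, h)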
Does GRU help with the vanishing gradient problem?

• GRU update rule for the state:

  ht = (1 − ut) ⊙ ht−1 + ut ⊙ φ(W(rt ⊙ ht−1) + U xt)

• Let us look at the gradient (back)propagation assuming that ut and rt are fixed:

  ∂hτ/∂hτ−1 = diag(1 − uτ) + diag(uτ) diag(φ'τ) W diag(rτ),   where φ'τ = φ'(W(rτ ⊙ hτ−1) + U xτ)

  (∂L/∂h1)^T = (∂L/∂ht)^T · Π_{τ=t,...,2} ∂hτ/∂hτ−1 = (∂L/∂ht)^T · Π_{τ=t,...,2} [diag(1 − uτ) + diag(uτ) diag(φ'τ) W diag(rτ)]

• For simplicity, let us assume that the state of the RNN is one-dimensional and all intermediate signals do not depend on the time step τ:

  ∂L/∂ht · Π_{τ=t,...,2} [(1 − uτ) + uτ φ'τ w rτ] = ∂L/∂ht · ((1 − u) + u γ r)^(t−1) = ∂L/∂ht · ((1 + γ/2)/2)^(t−1)

  where γ = φ'τ w and we also assumed that the gates are half-closed: u = r = 1/2.
Does GRU help with the vanishing gradient problem?

• Gradient propagation in GRU (simplified):  ∂L/∂ht · ((1 + γ/2)/2)^(t−1)

• Let us do the same simplified analysis for the vanilla RNN:

  (∂L/∂ht)^T · Π_{τ=t,...,2} diag(φ'τ) W  →  ∂L/∂ht · Π_{τ=t,...,2} φ'τ w = ∂L/∂ht · γ^(t−1),   where γ = φ'τ w

• If γ is small, the gradients in GRU decay with rate 1/2, which is much better than the rate of γ in the vanilla RNN.
• If γ is large, the magnitudes of the gradients grow exponentially as O(γ^t / 4^t), which is better than O(γ^t) in the vanilla RNN.
• Thus, the gating mechanism mitigates the problem of vanishing/exploding gradients. Gradients may still explode or vanish in GRU, but such problems occur more rarely than in the vanilla RNN.
Connection to probabilistic graphical models for sequential data
Linear dynamical systems

• Consider a linear Gaussian model with temporal structure (time series):

  p(h1) = N(h1 | µ1, R1)
  p(ht | ht−1) = N(ht | B ht−1, R)
  p(xt | ht) = N(xt | A ht, V)

  [Figure: graphical model with the latent chain h1 → h2 → h3 → h4 and observations x1, x2, x3, x4]

• Inference in linear dynamical systems: Find the conditional distribution p(ht | x1 , . . . , xt ) of latent
variables h1 , h2 , . . . , ht given the observation sequence x1 , x2 , . . . , xt .
• Since it is a linear Gaussian probabilistic model, the inference can be done using the
message-passing algorithm (see, e.g., Chapter 13 of Bishop, 2006) which yields the Kalman filter.

Kalman filter: Message-passing in linear dynamical systems




1. Prediction: p(ht | x1, ..., xt−1) = N(ht | ĥt, Pt)

   ĥt = B h̄t−1
   Pt = B Σt−1 B^T + R

2. Correction: p(ht | x1, ..., xt) = N(ht | h̄t, Σt)

   h̄t = ĥt + Kt (xt − A ĥt)
   Σt = (I − Kt A) Pt
   Kt = Pt A^T (A Pt A^T + V)^−1

   Here ĥt, Pt denote the predicted mean and covariance, and h̄t, Σt the corrected ones.

   [Figure: the chain ... → ht−2 → ht−1 → ht with observations xt−2, xt−1, xt; the prediction message (ĥt, Pt) is passed along the chain and the observation message A† xt is passed from xt to ht. The message from xt to ht is usually not explicitly expressed in the derivations of the Kalman filter.]
Kalman filter in one-dimensional case

• Let us look closer at the correction equation for the mean values of the hidden states

  h̄t = ĥt + Kt (xt − A ĥt)
  Kt = Pt A^T (A Pt A^T + V)^−1

  in the one-dimensional case:

  h̄t = ĥt + kt (xt − a ĥt) = ĥt + (pt a)/(a² pt + v) · (xt − a ĥt)

     = ĥt − (pt a²)/(a² pt + v) · ĥt + (pt a)/(a² pt + v) · xt = v/(a² pt + v) · ĥt + (a² pt)/(a² pt + v) · xt/a

     = (1 − ut) ĥt + ut · xt/a

  where ut = (a² pt)/(a² pt + v), which can be written as ut = σ(log(a² pt / v)).

• The updated value of the state is a trade-off between the estimate ĥt computed before observing xt (the prior) and the value xt/a justified by the observation xt (the likelihood).
Motivation of gatings in recurrent units

• Kalman filter update in the one-dimensional case:

  h̄t = (1 − ut) ĥt + ut · xt/a
  ut = σ(log(a² pt / v))

• Compare this with the GRU update rule:

  ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t
  ut = σ(Wu ht−1 + Uu xt + bu)

• This example justifies the use of gating in recurrent units: gating allows combining information gained from the previous observations with the current observation.
• The same intuitions hold for nonlinear dynamical systems (extended Kalman filter), which can be learned by RNNs.

Computational graph of RNN as implementation of message passing



• Message passing in linear dynamical systems:

  h̄t = (1 − ut) ĥt + ut · xt/a
  ut = σ(log(a² pt−1 / v))

  [Figure: the graphical-model chain ... → ht−2 → ht−1 → ht with observations xt−2, xt−1, xt; the messages (ĥt, Pt) pass along the chain and A† xt comes from the observations]

• Computational graph of an RNN with gatings:

  ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t
  ut = σ(Wu ht−1 + Uu xt + bu)

  [Figure: the corresponding RNN computational graph ... → ht−2 → ht−1 → ht with inputs xt−2, xt−1, xt]

• The computational graph of an RNN with gatings can be seen as an implementation of an inference procedure for a probabilistic graphical model with sequential data.
Long short-term memory (LSTM)
(Hochreiter and Schmidhuber, 1997)
Long short-term memory (LSTM) unit

• LSTM was designed to prevent vanishing and exploding gradients.


• The unit has two states: hidden state ht and cell state ct .
• The new cell state is a (gated) sum of the old state and an update:

  ct = ft ⊙ ct−1 + it ⊙ φc(Wc ht−1 + Uc xt + bc)

  where the forget gate ft ∈ (0, 1) and the input gate it ∈ (0, 1).


• The gradient propagation for the cell state c:

  ∂ct/∂ct−1 = diag(ft)

  and if we set ft to 1, the gradient neither grows nor decreases.
Long short-term memory (LSTM) unit

• Update of the cell state:

  ct = ft ⊙ ct−1 + it ⊙ φc(Wc ht−1 + Uc xt + bc)

  forget gate  ft = σ(Wf ht−1 + Uf xt + bf)
  input gate   it = σ(Wi ht−1 + Ui xt + bi)

• The hidden state (the output vector of the LSTM unit):

  ht = ot ⊙ φh(ct)

  output gate  ot = σ(Wo ht−1 + Uo xt + bo)
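• A minimal sketch of stepping over a sequence with PyTorch's built-in nn.LSTMCell, which maintains the two states (ht, ct) described above (sizes are illustrative):

      import torch
      import torch.nn as nn

      cell = nn.LSTMCell(input_size=8, hidden_size=16)
      h = torch.zeros(1, 16)                  # hidden state h0
      c = torch.zeros(1, 16)                  # cell state c0
      for x_t in torch.randn(5, 1, 8):        # x1, ..., x5
          h, c = cell(x_t, (h, c))            # the gates are computed inside the cell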

Initialization of the forget gates (Jozefowicz et al., 2015)

• Forget gate and cell-state update:

  ft = σ(Wf ht−1 + Uf xt + bf)
  ct = ft ⊙ ct−1 + it ⊙ φc(Wc ht−1 + Uc xt + bc)

• Common initialization of the forget gate: small random weights for bf. This initialization effectively sets the forget gate to 1/2 and therefore the gradient vanishes with a factor of 1/2 per timestep. It works well in many problems.
• However, sometimes an RNN can fail to learn long-term dependencies. This problem can be addressed by initializing the forget-gate bias bf to large values such as 1 or 2 (see the sketch below).
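• A sketch of this initialization for PyTorch's nn.LSTM (sizes are illustrative); PyTorch stores each LSTM bias as concatenated (input, forget, cell, output) gate blocks, so the forget-gate slice is the second hidden_size-long block:

      import torch
      import torch.nn as nn

      hidden_size = 16
      lstm = nn.LSTM(input_size=8, hidden_size=hidden_size)
      for name, param in lstm.named_parameters():
          if "bias" in name:
              with torch.no_grad():
                  param[hidden_size:2 * hidden_size].fill_(1.0)   # the forget-gate block b_f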

Architecture search for recurrent units

• LSTM and GRU have somewhat similar but different architectures. Can there be even better
architectures of the recurrent unit?
• Jozefowicz et al. (2015) performed random search of the architecture by constructing the
recurrent unit from a selected set of operations. The performance was tested on a set of standard
benchmarks.
• The best architectures found in that procedure were very similar to GRU!

  z = σ(Wxz xt + bz)
  r = σ(Wxr xt + Whr ht + br)
  ht+1 = tanh(Whh(r ⊙ ht) + tanh(xt) + bh) ⊙ z + ht ⊙ (1 − z)
Layer normalization (Ba et al., 2016)

• Batch normalization significantly reduces the training time in feed-forward neural networks. Can
we apply the same idea to recurrent networks?
• If we apply BN to an RNN, we need to compute and store separate statistics for each time step in
a sequence. This is problematic if a test sequence is longer than any of the training sequences.
• Layer normalization (LN) is a modification of BN in which statistics are computed over the hidden
units of a single time step:
  µ = (1/H) Σ_{i=1..H} xi        σ² = (1/H) Σ_{i=1..H} (xi − µ)²

  where H is the number of hidden units in a layer. LN also has gain and bias parameters γ and β:

  x̃ = γ ⊙ (x − µ) / √(σ² + ε) + β
• In GRU or LSTM, LN is usually applied before the non-linearity (a minimal sketch follows below).
• It has been observed that layer-normalized RNNs train faster.
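• A minimal sketch of a layer-normalized vanilla RNN cell, with nn.LayerNorm applied to the pre-activation, i.e. before the tanh (sizes are illustrative):

      import torch
      import torch.nn as nn

      class LNRNNCell(nn.Module):
          def __init__(self, input_size, hidden_size):
              super().__init__()
              self.W = nn.Linear(hidden_size, hidden_size, bias=False)
              self.U = nn.Linear(input_size, hidden_size)
              self.ln = nn.LayerNorm(hidden_size)   # per-time-step statistics, with gain and bias

          def forward(self, x, h):
              return torch.tanh(self.ln(self.W(h) + self.U(x)))

      cell = LNRNNCell(8, 16)
      h = torch.zeros(1, 16)
      for x_t in torch.randn(5, 1, 8):
          h = cell(x_t, h)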

Sequence-to-sequence models for neural machine translation
Neural machine translation

• The task is to translate a sentence from a source language to a target language.


• Inputs and outputs are sequences of words. We need a model that transforms input sequences
into output sequences (a sequence-to-sequence model).
• Input and output sequences may be of different lengths.

Simple sequence-to-sequence model

• The simplest sequence-to-sequence model uses two RNNs: encoder and decoder.

Simple sequence-to-sequence model

• The encoder is an RNN that encodes the input sentence into a vector c = h5 (its final hidden state).
• The whole sentence is represented as a single vector (a "thought vector").

Simple sequence-to-sequence model

• The decoder is an RNN that converts the representation c into the output sentence.

• Each decoder step also receives the previous word and the input-sequence representation c as inputs.

Simple sequence-to-sequence model: Training

• The minimized cost is the negative log-likelihood of the output sequence:


  L = − (1/N) Σ_n Σ_{t=1..Tn} log p(yt^(n) | y<t^(n), X^(n))

• To produce a categorical distribution over words, we process the hidden states zt of the decoder RNN with a linear layer W and apply softmax: p(yt = i | y<t, X) ∝ exp(wi^T zt).
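• A minimal sketch of this readout step (vocabulary size, dimensions, and tensors below are illustrative); nn.functional.cross_entropy combines the softmax and the negative log-likelihood above:

      import torch
      import torch.nn as nn

      vocab_size, hidden_size, T, batch = 1000, 16, 7, 4
      readout = nn.Linear(hidden_size, vocab_size)     # the linear layer W (w_i in its rows)
      z = torch.randn(T, batch, hidden_size)           # decoder hidden states z_1, ..., z_T
      y = torch.randint(0, vocab_size, (T, batch))     # target word indices
      logits = readout(z)                              # unnormalized scores w_i^T z_t
      loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))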

Simple sequence-to-sequence model: Test time

• How to generate the output sequence for a given input sequence?


• We can sample a sequence of words using the predicted categorical distribution:

  p(yt^(n) | y<t^(n), X^(n))

• This is suboptimal: we are interested in the whole sequence that has the highest probability; sampling from the output distribution is a greedy search.
• The most likely sequence is usually found with beam search (see, e.g., Cho, 2015).
Teacher forcing

• Training time: Feed correct words as inputs of the decoder (this is called teacher forcing).
• Test time: Feed the decoder’s own predictions as inputs (generation mode).

• The decoder needs to learn to work in the generation mode (without teacher forcing).
• To enable this, we can toggle teacher forcing on and off during training.
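• A minimal sketch of a decoder step loop with such a toggle (decoder_cell, embed and readout below are hypothetical placeholders, not the assignment's actual modules):

      import random

      def decode(decoder_cell, embed, readout, h, targets, teacher_forcing_ratio=0.5):
          y_prev = targets[0]                        # e.g. a start-of-sequence token
          outputs = []
          for t in range(1, len(targets)):
              h = decoder_cell(embed(y_prev), h)     # one decoder step
              logits = readout(h)
              outputs.append(logits)
              if random.random() < teacher_forcing_ratio:
                  y_prev = targets[t]                # teacher forcing: feed the correct word
              else:
                  y_prev = logits.argmax(dim=-1)     # generation mode: feed own prediction
          return outputs, h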

Home assignment
Assignment 04 rnn

• In the home assignment, you need to implement a sequence-to-sequence model for statistical machine translation.

Building a computational graph with an RNN in PyTorch

• There are two ways to build a computational graph with RNNs in PyTorch.
• In simple cases, the whole sequence can be processed with one call:

      h = torch.zeros(...)
      h = rnn.forward(x, h)

• In more difficult cases, you need to build the graph with a for-loop:

      h = torch.zeros(...)
      for x_t in x:
          h = rnn.forward(x_t, h)
• The initial states of RNNs are often initialized with zeros.

How to represent words

• A simple word representation is a one-hot vector: word i is represented with a vector z such that zi = 1 and zj = 0 for j ≠ i.
• Better representation:
• represent each word i as a vector wi
• treat all vectors wi as model parameters and tune them in the training procedure
• this is equivalent to Wz where W is a matrix of word embeddings (word vectors wi in its columns).
• This is implemented in torch.nn.Embedding(num_embeddings, embedding_dim)
  • num_embeddings is the size of the dictionary
  • embedding_dim is the size of each embedding vector wi
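• A minimal usage sketch (the dictionary and embedding sizes below are illustrative):

      import torch
      import torch.nn as nn

      embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)
      word_ids = torch.tensor([[5, 42, 7]])      # a batch with one 3-word sentence
      vectors = embedding(word_ids)              # shape (1, 3, 64)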

Recommended reading

• Chapter 10 of the Deep Learning book.


• C. Olah. Understanding LSTM Networks.
• K. Cho. Natural Language Understanding with Distributed Representation.
