
CHAPTER 9

Recurrent Neural Networks

So far, we have limited our attention to domains in which each output y is assumed to
have been generated as a function of an associated input x, and our hypotheses have been
“pure” functions, in which the output depends only on the input (and the parameters we
have learned that govern the function’s behavior). In the next few chapters, we are going
to consider cases in which our models need to go beyond functions. In particular, behavior
as a function of time will be an important concept:

• In recurrent neural networks, the hypothesis that we learn is not a function of a single
input, but of the whole sequence of inputs that the predictor has received.

• In reinforcement learning, the hypothesis is either a model of a domain (such as a game)
as a recurrent system, or a policy, which is a pure function but whose loss is determined
by the ways in which the policy interacts with the domain over time.

In this chapter, we introduce state machines. We start with deterministic state machines,
and then consider recurrent neural network (RNN) architectures to model their behavior.
Later, in Chapter 10, we will study Markov decision processes (MDPs) that extend to consider
probabilistic (rather than deterministic) transitions in our state machines. RNNs and MDPs
will enable description and modeling of temporally sequential patterns of behavior that are
important in many domains.

9.1 State machines


A state machine is a description of a process (computational, physical, economic) in terms
of its potential sequences of states. (This is such a pervasive idea that it has been given
many names in many subareas of computer science, control theory, physics, etc., including
automaton, transducer, dynamical system, and system.)

The state of a system is defined to be all you would need to know about the system to
predict its future trajectories as well as possible. It could be the position and velocity of an
object, the locations of your pieces on a game board, or the current traffic densities on a
highway network.

Formally, we define a state machine as (S, X, Y, s0 , fs , fo ) where
• S is a finite or infinite set of possible states;

• X is a finite or infinite set of possible inputs;

MIT 6.390 Fall 2022 81

• Y is a finite or infinite set of possible outputs;

• s0 ∈ S is the initial state of the machine;

• fs : S × X → S is a transition function, which takes an input and a previous state and
produces a next state;

• fo : S → Y is an output function, which takes a state and produces an output.

The basic operation of the state machine is to start with state s0 , then iteratively compute,
for t ≥ 1:

st = fs (st−1 , xt )    (9.1)
yt = fo (st )    (9.2)

(In some cases, we will pick the starting state from a set or distribution.)

The diagram below illustrates this process. Note that the “feedback” connection of
st back into fs has to be buffered or delayed by one time step; otherwise, what it
is computing would not generally be well defined.

[Diagram: the input xt and the delayed previous state st−1 feed into fs , which produces
st ; st feeds into fo , which produces the output yt .]

So, given a sequence of inputs x1 , x2 , . . . , the machine generates a sequence of outputs

y1 = fo (fs (s0 , x1 )) ,   y2 = fo (fs (fs (s0 , x1 ), x2 )) ,   . . . .

We sometimes say that the machine transduces sequence x into sequence y. The output at
time t can have dependence on inputs from steps 1 to t. (There are a huge number of major
and minor variations on the idea of a state machine. We'll just work with one specific one
in this section and another one in the next, but don't worry if you see other variations out
in the world!)

One common form is finite state machines, in which S, X, and Y are all finite sets. They are
often described using state transition diagrams such as the one below, in which nodes stand
for states and arcs indicate transitions. Nodes are labeled by which output they generate
and arcs are labeled by which input causes the transition. (All computers can be described,
at the digital level, as finite state machines. Big, but finite!)

One can verify that the state machine below reads binary strings and determines the
parity of the number of zeros in the given string. Check for yourself that all input
binary strings end in state S1 if and only if they contain an even number of zeros.
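The zero-parity machine described above can be sketched directly from the (S, X, Y, s0 , fs , fo ) definition. The state names S1 (even number of zeros so far) and S0 (odd), and the output values "even"/"odd", are illustrative choices, not fixed by the text:

```python
# A minimal sketch of the zero-parity finite state machine.
# States: "S1" = even number of zeros seen so far, "S0" = odd.

def fs(state, x):
    """Transition function: a '0' flips the parity; a '1' leaves it alone."""
    if x == "0":
        return "S0" if state == "S1" else "S1"
    return state

def fo(state):
    """Output function: report the current parity of zeros."""
    return "even" if state == "S1" else "odd"

def transduce(inputs, s0="S1"):
    """Run the machine on an input sequence, producing the output sequence."""
    s, outputs = s0, []
    for x in inputs:
        s = fs(s, x)          # Eq. 9.1
        outputs.append(fo(s)) # Eq. 9.2
    return outputs

print(transduce("0010"))  # ['odd', 'even', 'even', 'odd']
```

The start state is S1 because the empty string contains zero zeros, an even number.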

Last Updated: 12/06/22 14:56:16



Another common structure that is simple but powerful and used in signal processing
and control is linear time-invariant (LTI) systems. In this case, all the quantities are real-
valued vectors: S = Rm , X = Rℓ , and Y = Rn . The functions fs and fo are linear functions of
their inputs. The transition function is described by the state matrix A and the input matrix
B; the output function is defined by the output matrix C, each with compatible dimensions.
In discrete time, they can be defined by a linear difference equation, like

st = fs (st−1 , xt ) = Ast−1 + Bxt , (9.3)


yt = fo (st ) = Cst , (9.4)

and can be implemented using state to store relevant previous input and output informa-
tion. We will study recurrent neural networks which are a lot like a non-linear version of an
LTI system.
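The difference equations (9.3)–(9.4) can be simulated directly. In this sketch the matrices A, B, C and the input sequence are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

# Simulate a discrete-time LTI system: s_t = A s_{t-1} + B x_t, y_t = C s_t.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])   # state matrix (m x m)
B = np.array([[1.0],
              [0.5]])        # input matrix (m x l)
C = np.array([[1.0, -1.0]])  # output matrix (n x m)

def lti_transduce(xs, s0):
    """Map an input sequence to an output sequence, carrying the state along."""
    s, ys = s0, []
    for x in xs:
        s = A @ s + B @ x    # Eq. 9.3
        ys.append(C @ s)     # Eq. 9.4
    return ys

xs = [np.array([1.0]), np.array([0.0]), np.array([0.0])]
ys = lti_transduce(xs, s0=np.zeros(2))
```

With a single unit impulse as input, the output sequence traces out the system's impulse response.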

9.2 Recurrent neural networks


In Chapter 7, we studied neural networks and how the weights of a network can be
obtained by training on data, so that the neural network will model a function that
approximates the relationship between the (x, y) pairs in a supervised-learning training
set. In Section 9.1 above, we introduced state machines to describe sequential temporal
behavior. Here in Section 9.2, we explore recurrent neural networks by defining the
architecture and weight matrices in a neural network to enable modeling of such state
machines. Then, in Section 9.3, we present a loss function that may be employed for
training sequence-to-sequence RNNs, and then consider application to language translation
and recognition in Section 9.4. In Section 9.5, we'll see how to use gradient-descent
methods to train the weights of an RNN so that it performs a transduction that matches as
closely as possible a training set of input-output sequences.

A recurrent neural network is a state machine with neural networks constituting the
functions fs and fo :

st = fs (W sx xt + W ss st−1 + W0ss )    (9.5)
yt = fo (W o st + W0o ) .    (9.6)

The inputs, states, and outputs are all vector-valued:

xt : ℓ × 1    (9.7)
st : m × 1    (9.8)
yt : v × 1 .    (9.9)

The weights in the network, then, are

W sx : m × ℓ    (9.10)
W ss : m × m    (9.11)
W0ss : m × 1    (9.12)
W o : v × m    (9.13)
W0o : v × 1    (9.14)

with activation functions fs and fo .

(We are very sorry! This course material has evolved from different sources, which used
W T x in the forward pass for regular feed-forward NNs and W x for the forward pass in
RNNs. This inconsistency doesn't make any technical difference, but is a potential source
of confusion.)


Study Question: Check dimensions here to be sure it all works out. Remember that
we apply fs and fo elementwise, unless fo is a softmax activation.
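One way to check the dimensions is to run a single step of Eqs. 9.5–9.6 in numpy. The dimensions ℓ = 2, m = 3, v = 1, the random weights, and the choice of tanh for fs and the identity for fo are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, v = 2, 3, 1                 # input, state, and output dimensions (arbitrary)

W_sx  = rng.normal(size=(m, l))   # m x l
W_ss  = rng.normal(size=(m, m))   # m x m
W0_ss = rng.normal(size=(m, 1))   # m x 1
W_o   = rng.normal(size=(v, m))   # v x m
W0_o  = rng.normal(size=(v, 1))   # v x 1

def rnn_step(x, s_prev):
    """One step of Eqs. 9.5-9.6, with f_s = tanh and f_o = identity."""
    s = np.tanh(W_sx @ x + W_ss @ s_prev + W0_ss)   # Eq. 9.5, m x 1
    y = W_o @ s + W0_o                              # Eq. 9.6, v x 1
    return s, y

s, y = rnn_step(x=np.ones((l, 1)), s_prev=np.zeros((m, 1)))
```

Since tanh is applied elementwise, the state s stays m × 1 and every entry lies strictly between −1 and +1.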


9.3 Sequence-to-sequence RNN


Now, how can we set up an RNN to model and be trained to produce a transduction of one
sequence to another? This problem is sometimes called sequence-to-sequence mapping. You
can think of it as a kind of regression problem: given an input sequence, learn to generate
the corresponding output sequence.
A training set has the form ((x(1) , y(1) ), . . . , (x(q) , y(q) )), where

• x(i) and y(i) are length-n(i) sequences;

• sequences in the same pair are the same length; and sequences in different pairs may
have different lengths.

(One way to think of training a sequence classifier is to reduce it to a transduction
problem, where yt = 1 if the sequence x1 , . . . , xt is a positive example of the class of
sequences and −1 otherwise.)
Next, we need a loss function. We start by defining a loss function on sequences. There
are many possible choices, but usually it makes sense just to sum up a per-element loss
function on each of the output values, where g is the predicted sequence and y is the actual
one:

Lseq (g(i) , y(i) ) = Σ_{t=1}^{n(i)} Lelt (g(i)_t , y(i)_t ) .    (9.15)

The per-element loss function Lelt will depend on the type of yt and what information it is
encoding, in the same way as for a supervised network. (So it could be NLL, squared loss,
etc.)
Then, letting W = (W sx , W ss , W o , W0ss , W0o ), our overall goal is to minimize the
objective

J(W) = (1/q) Σ_{i=1}^{q} Lseq (RNN(x(i) ; W), y(i) ) ,    (9.16)

where RNN(x; W) is the output sequence generated, given input sequence x.
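Eqs. 9.15–9.16 amount to the following computation. The squared-error per-element loss is one of the choices mentioned above, and `rnn` stands for any function mapping an input sequence to an output sequence; both are illustrative assumptions:

```python
import numpy as np

def L_elt(g_t, y_t):
    """Per-element squared loss (one possible choice; NLL etc. also work)."""
    return float(np.sum((g_t - y_t) ** 2))

def L_seq(g, y):
    """Eq. 9.15: sum the per-element loss over the sequence."""
    return sum(L_elt(g_t, y_t) for g_t, y_t in zip(g, y))

def J(rnn, data):
    """Eq. 9.16: average sequence loss over the training pairs in `data`."""
    return sum(L_seq(rnn(x), y) for x, y in data) / len(data)
```

For example, with `rnn` as the identity map, J just averages the squared distance between each input sequence and its target sequence.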


It is typical to choose fs to be tanh (remember that it looks like a sigmoid but ranges
from −1 to +1) but any non-linear activation function is usable. We choose fo to align
with the types of our outputs and the loss function, just as we would do in regular
supervised learning.

9.4 RNN as a language model


A language model is a sequence-to-sequence RNN which is trained on a character sequence
of the form c = (c1 , c2 , . . . , ck ), and is used to predict the next character ct , t ≤ k,
given a sequence of the previous (t − 1) tokens (a “token” is generally a character or a
word):

ct = RNN ((c1 , c2 , . . . , ct−1 ); W)    (9.17)

We can convert this to a sequence-to-sequence training problem by constructing a data
set of q different (x, y) sequence pairs, where we make up new special tokens, start and
end, to signal the beginning and end of the sequence:

x = (⟨start⟩, c1 , c2 , . . . , ck )    (9.18)
y = (c1 , c2 , . . . , ⟨end⟩)    (9.19)
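Constructing such a pair might look like the following sketch, using the plain strings "<start>" and "<end>" as stand-ins for the special tokens:

```python
def make_pair(tokens):
    """Build an (x, y) training pair for next-token prediction (Eqs. 9.18-9.19).

    x is the token sequence with <start> prepended; y is the sequence with
    <end> appended, so y[t] is the prediction target for input x[t].
    """
    x = ["<start>"] + list(tokens)
    y = list(tokens) + ["<end>"]
    return x, y

x, y = make_pair("hi")
# x == ['<start>', 'h', 'i'],  y == ['h', 'i', '<end>']
```

Note that x and y have the same length, as required of sequences in the same training pair.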

9.5 Back-propagation through time


Now the fun begins! We can now try to find a W to minimize J using gradient descent. We
will work through the simplest method, back-propagation through time (BPTT), in detail. This
is generally not the best method to use, but it’s relatively easy to understand. In Section 9.6
we will sketch alternative methods that are in much more common use.


What we want you to take away from this section is that, by “unrolling” a recurrent
network out to model a particular sequence, we can treat the whole thing as a feed-forward
network with a lot of parameter sharing. Thus, we can tune the parameters using
stochastic gradient descent, and learn to model sequential mappings. The concepts here
are very important. While the details matter if you need to implement something, we
present the mathematical details below primarily to convey and explain the larger
concepts.

Calculus reminder: total derivative Most of us are not very careful about the differ-
ence between the partial derivative and the total derivative. We are going to use a nice
example from the Wikipedia article on partial derivatives to illustrate the difference.
The volume of a circular cone depends on its height and radius:

V(r, h) = πr²h / 3 .    (9.20)

The partial derivatives of volume with respect to radius and height are

∂V/∂r = 2πrh/3   and   ∂V/∂h = πr²/3 .    (9.21)

They measure the change in V assuming everything is held constant except the
single variable we are changing. Now assume that we want to preserve the cone's
proportions, in the sense that the ratio of radius to height stays constant. Then we
can't really change one without changing the other. In this case, we really have to
think about the total derivative. If we're interested in the total derivative with respect
to r, we sum the “paths” along which r might influence V:

dV/dr = ∂V/∂r + (∂V/∂h)(dh/dr)    (9.22)
      = 2πrh/3 + (πr²/3)(dh/dr)    (9.23)

Or if we're interested in the total derivative with respect to h, we consider how h
might influence V, either directly or via r:

dV/dh = ∂V/∂h + (∂V/∂r)(dr/dh)    (9.24)
      = πr²/3 + (2πrh/3)(dr/dh)    (9.25)

Just to be completely concrete, let's think of a right circular cone with a fixed angle
α = tan⁻¹(r/h), so that if we change r or h then α remains constant. So we have
r = h tan α; letting the constant c = tan α, we have r = ch. Thus, we finally have

dV/dr = 2πrh/3 + (πr²/3)(1/c)    (9.26)
dV/dh = πr²/3 + (2πrh/3) c .    (9.27)
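These formulas are easy to sanity-check numerically; the values c = 0.5 and h = 2.0 below are arbitrary:

```python
import math

c, h = 0.5, 2.0          # arbitrary fixed proportion and height
r = c * h                # the constraint r = ch

# With r = ch substituted in, V(h) = pi * c^2 * h^3 / 3, so dV/dh = pi * c^2 * h^2.
direct = math.pi * c**2 * h**2

# Eq. 9.27: the same quantity as a sum of partial-derivative "paths".
total = math.pi * r**2 / 3 + (2 * math.pi * r * h / 3) * c

assert abs(direct - total) < 1e-12
```

The same check works for Eq. 9.26 with dh/dr = 1/c.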

The BPTT process goes like this:

(1) Sample a training pair of sequences (x, y); let their length be n.


(2) “Unroll” the RNN to be length n (picture for n = 3 below), and initialize s0 :

Now, we can see our problem as one of performing what is almost an ordinary back-
propagation training procedure in a feed-forward neural network, but with the
difference that the weight matrices are shared among the layers. In many ways, this is
similar to what ends up happening in a convolutional network, except that in the
convnet the weights are re-used spatially, and here they are re-used temporally.

(3) Do the forward pass, to compute the predicted output sequence g:

z1t = W sx xt + W ss st−1 + W0ss    (9.28)
st = fs (z1t )    (9.29)
z2t = W o st + W0o    (9.30)
gt = fo (z2t )    (9.31)

(4) Do the backward pass to compute the gradients. For both W ss and W sx we need to find

dLseq (g, y)/dW = Σ_{u=1}^{n} dLelt (gu , yu )/dW .

Letting Lu = Lelt (gu , yu ) and using the total derivative, which is a sum over all the
ways in which W affects Lu , we have

dLseq (g, y)/dW = Σ_{u=1}^{n} Σ_{t=1}^{n} (∂st /∂W)(∂Lu /∂st ) .

Re-organizing, we have

dLseq (g, y)/dW = Σ_{t=1}^{n} (∂st /∂W) Σ_{u=1}^{n} (∂Lu /∂st ) .

Because st only affects Lt , Lt+1 , . . . , Ln ,

dLseq (g, y)/dW = Σ_{t=1}^{n} (∂st /∂W) Σ_{u=t}^{n} (∂Lu /∂st )
               = Σ_{t=1}^{n} (∂st /∂W) [ ∂Lt /∂st + δ^{st} ] ,    (9.32)

where δ^{st} denotes the bracketed sum Σ_{u=t+1}^{n} ∂Lu /∂st .

This δ^{st} is the dependence of the future loss (incurred after step t) on the state st .
(That is, δ^{st} is how much we can blame state st for all the future element losses.)

We can compute it backwards, with t going from n down to 1. The trickiest part is
figuring out how early states contribute to later losses. We define the future loss after
step t to be

Ft = Σ_{u=t+1}^{n} Lelt (gu , yu ) ,    (9.33)

so

δ^{st} = ∂Ft /∂st .    (9.34)
At the last stage, Fn = 0, so δ^{sn} = 0.
Now, working backwards,

δ^{st−1} = ∂/∂st−1 Σ_{u=t}^{n} Lelt (gu , yu )    (9.35)
         = (∂st /∂st−1 ) ∂/∂st Σ_{u=t}^{n} Lelt (gu , yu )    (9.36)
         = (∂st /∂st−1 ) ∂/∂st [ Lelt (gt , yt ) + Σ_{u=t+1}^{n} Lelt (gu , yu ) ]    (9.37)
         = (∂st /∂st−1 ) ( ∂Lelt (gt , yt )/∂st + δ^{st} )    (9.38)

Now, we can use the chain rule again to find the dependence of the element loss at
time t on the state at that same time,

∂Lelt (gt , yt )/∂st = (∂z2t /∂st ) (∂Lelt (gt , yt )/∂z2t ) ,    (9.39)

where the factors have dimensions (m × 1) = (m × v)(v × 1), and the dependence of the
state at time t on the state at the previous time,

∂st /∂st−1 = (∂z1t /∂st−1 )(∂st /∂z1t ) = W ss T (∂st /∂z1t ) ,    (9.40)

where every factor is (m × m).


Note that ∂st /∂z1t is formally an m × m diagonal matrix, with the values along
the diagonal being f′s (z1t,i ), 1 ≤ i ≤ m. But since this is a diagonal matrix,
one could represent it as an m × 1 vector f′s (z1t ). In that case the product
of the matrix W ss T with the vector f′s (z1t ), denoted W ss T ∗ f′s (z1t ), should be
interpreted as follows: take the first column of the matrix W ss T and multiply
each of its elements by the first element of the vector f′s (z1t ), then take the
second column of the matrix W ss T and multiply each of its elements by the
second element of the vector, and so on.

Putting this all together, we end up with

δ^{st−1} = W ss T (∂st /∂z1t ) [ W o T (∂Lt /∂z2t ) + δ^{st} ] ,    (9.41)

where the leading product is ∂st /∂st−1 and the bracketed factor is ∂Ft−1 /∂st .

We're almost there! Now, we can describe the actual weight updates. Using Eq. 9.32
and recalling the definition δ^{st} = ∂Ft /∂st , as we iterate backwards, we can
accumulate the terms in Eq. 9.32 to get the gradient for the whole loss:

dLseq /dW ss = Σ_{t=1}^{n} dLelt (gt , yt )/dW ss = Σ_{t=1}^{n} (∂z1t /∂W ss )(∂st /∂z1t )(∂Ft−1 /∂st )    (9.42)
dLseq /dW sx = Σ_{t=1}^{n} dLelt (gt , yt )/dW sx = Σ_{t=1}^{n} (∂z1t /∂W sx )(∂st /∂z1t )(∂Ft−1 /∂st )    (9.43)

We can handle W o separately; it's easier because it does not affect future losses in the
way that the other weight matrices do:

dLseq /dW o = Σ_{t=1}^{n} dLt /dW o = Σ_{t=1}^{n} (∂Lt /∂z2t )(∂z2t /∂W o )    (9.44)

Assuming we have ∂Lt /∂z2t = (gt − yt ) (which ends up being true for squared loss,
softmax-NLL, etc.), then

dLseq /dW o = Σ_{t=1}^{n} (gt − yt ) sTt ,    (9.45)

where (gt − yt ) is v × 1 and sTt is 1 × m, so each term is v × m.

Whew!
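The whole procedure can be collected into a short sketch. The choices below, tanh for fs, the identity for fo, squared loss with a 1/2 factor, omitted offsets, and the small random dimensions, are all illustrative assumptions; the analytic gradient for W ss is checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
l, m, v, n = 2, 3, 2, 4                 # arbitrary small dimensions and length
W_sx = rng.normal(size=(m, l))
W_ss = 0.5 * rng.normal(size=(m, m))    # scaled down to avoid tanh saturation
W_o = rng.normal(size=(v, m))
xs = [rng.normal(size=(l, 1)) for _ in range(n)]
ys = [rng.normal(size=(v, 1)) for _ in range(n)]

def forward(W_ss):
    """Unrolled forward pass (Eqs. 9.28-9.31); offsets omitted for brevity."""
    ss, gs = [np.zeros((m, 1))], []
    for x in xs:
        ss.append(np.tanh(W_sx @ x + W_ss @ ss[-1]))
        gs.append(W_o @ ss[-1])
    return ss, gs

def loss(W_ss):
    """L_seq with per-element loss 0.5 * ||g_t - y_t||^2."""
    _, gs = forward(W_ss)
    return 0.5 * sum(float(np.sum((g - y) ** 2)) for g, y in zip(gs, ys))

def bptt_grad_Wss(W_ss):
    """Accumulate dL_seq/dW_ss by the backward recursion of this section."""
    ss, gs = forward(W_ss)
    grad = np.zeros_like(W_ss)
    delta = np.zeros((m, 1))                          # delta^{s_n} = 0
    for t in range(n, 0, -1):
        dL_dst = W_o.T @ (gs[t - 1] - ys[t - 1])      # Eq. 9.39
        dz1 = (1 - ss[t] ** 2) * (dL_dst + delta)     # via f_s' = 1 - tanh^2
        grad += dz1 @ ss[t - 1].T                     # one term of Eq. 9.42
        delta = W_ss.T @ dz1                          # Eq. 9.41
    return grad

# Finite-difference check on one entry of W_ss.
g = bptt_grad_Wss(W_ss)
eps = 1e-6
Wp, Wm = W_ss.copy(), W_ss.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(g[0, 0] - numeric) < 1e-4 * max(1.0, abs(numeric))
```

Note how the single vector `delta` carries all of the future-loss information backwards, so the whole pass costs no more than the forward pass did.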

Study Question: Derive the updates for the offsets W0ss and W0o .

9.6 Vanishing gradients and gating mechanisms


Let's take a careful look at the backward propagation of the gradient along the sequence:

δ^{st−1} = (∂st /∂st−1 ) ( ∂Lelt (gt , yt )/∂st + δ^{st} ) .    (9.46)


Consider a case where only the output at the end of the sequence is incorrect, but it
depends critically, via the weights, on the input at time 1. In this case, we will multiply
the loss at step n by

(∂s2 /∂s1 )(∂s3 /∂s2 ) · · · (∂sn /∂sn−1 ) .    (9.47)

In general, this quantity will either grow or shrink exponentially with the length of the
sequence, and make it very difficult to train.
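To see the effect of Eq. 9.47 concretely, one can multiply together n copies of a fixed Jacobian and watch the product's norm; the particular matrix below is an arbitrary illustration:

```python
import numpy as np

# Multiply n identical Jacobians (an arbitrary matrix whose eigenvalues have
# magnitude < 1) and watch the product shrink exponentially with n.
J = np.array([[0.5, 0.1],
              [0.0, 0.5]])

norms = []
for n in (1, 10, 50):
    P = np.eye(2)
    for _ in range(n):
        P = J @ P
    norms.append(np.linalg.norm(P))

# norms shrinks roughly like 0.5**n; scaling J up so its eigenvalues exceed 1
# would instead make the product explode.
```

Either way, the gradient signal reaching the early steps is badly scaled relative to the loss at step n.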
Study Question: The last time we talked about exploding and vanishing gradients, it
was to justify per-weight adaptive step sizes. Why is that not a solution to the problem
this time?
An important insight that really made recurrent networks work well on long sequences
is the idea of gating.

9.6.1 Simple gated recurrent networks


A computer only ever updates some parts of its memory on each computation cycle. We
can take this idea and use it to make our networks more able to retain state values over
time and to make the gradients better-behaved. We will add a new component to our
network, called a gating network. Let gt be an m × 1 vector of values and let W gx and
W gs be m × ℓ and m × m weight matrices, respectively. We will compute gt as

gt = sigmoid(W gx xt + W gs st−1 )    (9.48)

(it can have an offset, too, but we are omitting it for simplicity) and then change the
computation of st to be

st = (1 − gt ) ∗ st−1 + gt ∗ fs (W sx xt + W ss st−1 + W0ss ) ,    (9.49)

where ∗ is component-wise multiplication. We can see, here, that the output of the gating
network is deciding, for each dimension of the state, how much it should be updated now.
This mechanism makes it much easier for the network to learn to, for example, “store”
some information in some dimension of the state, and then not change it during future
state updates, or change it only under certain conditions on the input or other aspects of
the state.
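A single gated update, Eqs. 9.48–9.49, might look like the following sketch; the dimensions and random weights are arbitrary, fs = tanh is assumed, and the gate offset is omitted as in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
l, m = 2, 3                                  # arbitrary input/state dimensions
W_gx, W_gs = rng.normal(size=(m, l)), rng.normal(size=(m, m))
W_sx, W_ss = rng.normal(size=(m, l)), rng.normal(size=(m, m))
W0_ss = rng.normal(size=(m, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x, s_prev):
    """One gated update: g decides, per dimension, how much the state changes."""
    g = sigmoid(W_gx @ x + W_gs @ s_prev)                 # Eq. 9.48, in (0, 1)
    candidate = np.tanh(W_sx @ x + W_ss @ s_prev + W0_ss) # proposed new state
    return (1 - g) * s_prev + g * candidate               # Eq. 9.49

s = gated_step(np.ones((l, 1)), np.zeros((m, 1)))
```

Where a gate entry is near 0, the corresponding state dimension is copied through almost unchanged, which is exactly the "storage" behavior described above.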
Study Question: Why is it important that the activation function for g be a sigmoid?

9.6.2 Long short-term memory


The idea of gating networks can be applied to make a state machine that is even more like
a computer memory, resulting in a type of network called an LSTM, for “long short-term
memory.” (Yet another awesome name for a neural network!) We won't go into the details
here, but the basic idea is that there is a memory cell (really, our state vector) and
three (!) gating networks. The input gate selects (using a “soft” selection as in the gated
network above) which dimensions of the state will be updated with new values; the forget
gate decides which dimensions of the state will have their old values moved toward 0;
and the output gate decides which dimensions of the state will be used to compute the
output value. These networks have been used in applications like language translation
with really amazing results. A diagram of the architecture is shown below:

[LSTM architecture diagram]
