Module 4 Part 1

Module 4 of the Deep Learning course focuses on Recurrent Neural Networks (RNNs), covering their design, computational graphs, and applications in sequence-to-sequence tasks. It explains the architecture of RNNs, including encoder-decoder models, and discusses concepts such as backpropagation through time and teacher forcing. The module highlights the importance of RNNs in processing sequential data like text and time-series, and introduces advanced techniques like attention mechanisms for improved performance in tasks such as machine translation.


CST414

DEEP LEARNING
Module-4 PART -I

SYLLABUS

Module 4 (Recurrent Neural Networks)

• Recurrent neural networks – computational graphs, RNN design, encoder-decoder sequence-to-sequence architectures, deep recurrent networks, recursive neural networks, modern RNNs: LSTM and GRU.
Recurrent neural networks

• Recurrent neural networks are designed for sequential data such as text sentences, time-series, and other discrete sequences like biological sequences.
• The input is of the form x1 . . . xn, where xt is a d-dimensional point received at time-stamp t.
• In a text setting, the vector xt contains the one-hot encoded word at the t-th time-stamp.
• In one-hot encoding, we have a vector of length equal to the lexicon size, and the component for the relevant word has a value of 1; all other components are 0 (see the sketch below).
• Successive words are dependent on one another.
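A minimal sketch of one-hot encoding in Python/numpy. The toy lexicon, the word-to-index mapping, and the example sentence are illustrative assumptions, not taken from the slides.

import numpy as np

# Hypothetical toy lexicon; the word-to-index mapping is an assumption for illustration.
lexicon = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(lexicon)}

def one_hot(word, vocab_size):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    x = np.zeros(vocab_size)
    x[word_to_index[word]] = 1.0
    return x

# Encode a sentence as the sequence x1 ... xn fed to the RNN, one vector per time-stamp.
sentence = ["the", "cat", "sat"]
inputs = [one_hot(w, len(lexicon)) for w in sentence]
print(inputs[1])  # [0. 1. 0. 0. 0.]  -> one-hot vector for "cat"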
• The key point is that there is an input xt at each time-stamp, and a hidden state ht that changes at each time-stamp as new data points arrive. Each time-stamp also has an output value yt.
• When used in the text setting of predicting the next word, this approach is referred to as language modeling.
• The hidden state at time t is given by a function of the input vector at time t and the hidden vector at time (t − 1):

  ht = f(ht−1, xt)

• A separate function yt = g(ht) is used to learn the output probabilities from the hidden states.
• Note that the functions f(·) and g(·) are the same at each time-stamp.
• A key point here is the presence of the self-loop in Figure 1.17(a), which will cause the hidden state of the neural network to change after the input of each xt.
• In practice, one works with sequences of finite length, and it makes sense to unfurl the loop into a “time-layered” network that looks more like a feed-forward network. This network is shown in Figure 1.17(b) (a forward-pass sketch follows below).
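A minimal sketch of the unfurled recurrence in Python/numpy, assuming f is a tanh of an affine map and g is a softmax, with weight matrix names and sizes chosen here for illustration only. It shows the same f and g being applied at every time-stamp as the loop is unrolled.

import numpy as np

rng = np.random.default_rng(0)
d, p, k = 5, 8, 5                # input, hidden, and lexicon sizes -- illustrative assumptions
W_xh = rng.normal(scale=0.1, size=(p, d))   # input-to-hidden weights (shared across time)
W_hh = rng.normal(scale=0.1, size=(p, p))   # hidden-to-hidden weights (shared across time)
W_hy = rng.normal(scale=0.1, size=(k, p))   # hidden-to-output weights (shared across time)

def f(h_prev, x):
    """Hidden-state update ht = f(ht-1, xt); here a tanh of an affine map (an assumption)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x)

def g(h):
    """Output function yt = g(ht): softmax probabilities over the lexicon."""
    o = W_hy @ h
    e = np.exp(o - o.max())
    return e / e.sum()

# Unfurling the self-loop: the same f and g are applied at every time-stamp.
h = np.zeros(p)                                   # h0, fixed to a constant
for x in [rng.normal(size=d) for _ in range(4)]:  # a length-4 input sequence x1 ... x4
    h = f(h, x)                                   # hidden state changes after each xt
    y = g(h)                                      # output probabilities at time-stamp t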
• The weight matrices of the connections are shared by multiple connections in the time-layered network to ensure that the same function is used at each time-stamp. This sharing is the key to the domain-specific insights that are learned by the network.
• The backpropagation algorithm takes the sharing and temporal length into account when updating the weights during the learning process. This special type of backpropagation algorithm is referred to as backpropagation through time (BPTT).
• Because of its recursive nature, the recurrent network has the ability to compute a function of variable-length inputs.
• For example, starting at h0, which is typically fixed to some constant (such as the zero vector), the hidden state at time t can be written as a composition ht = f(f(. . . f(h0, x1), . . .), xt) = Ft(x1, . . . , xt).
• Note that the function Ft(·) varies with the value of t. Such an approach is particularly useful for variable-length inputs like text sentences.
• An interesting theoretical property of recurrent neural networks is that they are Turing complete. What this means is that, given enough data and computational resources, a recurrent neural network can simulate any algorithm.

COMPUTATIONAL GRAPHS
• A recurrent neural network is a neural network that is specialized for processing a sequence of values x(1), . . . , x(τ).
• A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.
• The idea of a computational graph can be extended to include unfolding a recursive or recurrent computation into a computational graph.
• Unfolding this graph results in the sharing of parameters across a deep network structure.
• For example, consider the classical form of a dynamical system:

  s(t) = f(s(t−1); θ)        (10.1)

• Equation 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t − 1.
• For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold equation 10.1 for τ = 3 time steps, we obtain

  s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

• Such an expression can now be represented by a traditional directed acyclic computational graph (a small numerical sketch follows Figure 10.1 below).

Figure 10.1: The classical dynamical system described by equation 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.
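A minimal sketch of unfolding s(t) = f(s(t−1); θ) for τ = 3 time steps. The particular transition function f, the value of θ, and the initial state are illustrative assumptions, not from the slides.

# Unfolding the recurrence by applying the definition tau - 1 = 2 times.
theta = 0.9

def f(s, theta):
    return theta * s + 1.0   # some fixed transition function parametrized by theta (assumption)

s1 = 2.0                     # initial state
s2 = f(s1, theta)
s3 = f(s2, theta)            # same as f(f(s1, theta), theta): the unfolded expression
print(s3 == f(f(s1, theta), theta))   # True -- the unfolded graph computes the same value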
• As another example, let us consider a dynamical system driven by an external signal x(t):

  s(t) = f(s(t−1), x(t); θ)        (10.4)

• Essentially, any function involving recurrence can be considered a recurrent neural network.
• To indicate that the state is the hidden units of the network, we now rewrite equation 10.4 using the variable h to represent the state:

  h(t) = f(h(t−1), x(t); θ)
• One way to draw the RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network.
• In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state.
• The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time.
• Each variable for each time step is drawn as a separate node of the computational graph, as in the right of figure 10.2.
• The unfolded graph now has a size that depends on the sequence length.
• The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), . . . , x(2), x(1)) as input and produces the current state, whereas the unfolded recurrent structure allows us to factorize g(t) into repeated application of a function f.
• The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another state, rather than in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time step.
• The recurrent graph and the unrolled graph have their uses.
• The recurrent graph is succinct.
• The unfolded graph provides an explicit description of which computations to perform.
• The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
RNN Design

Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Equation 10.8 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.
• Some examples of important design patterns for recurrent neural networks include the following:
• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4.
• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5 (a minimal sketch of this sequence-to-one pattern follows below).
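A minimal sketch of the third pattern in Python/numpy: the network reads the whole sequence through hidden-to-hidden recurrence and emits a single output at the end (e.g. for sequence classification). The sizes and weight names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
d, p, k = 4, 6, 2   # input, hidden, and output sizes -- illustrative assumptions
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
V = rng.normal(scale=0.1, size=(k, p))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(xs):
    """Read the entire sequence with hidden-to-hidden recurrence, then emit one output."""
    h = np.zeros(p)
    for x in xs:                 # no per-step outputs; only the hidden state is propagated
        h = np.tanh(U @ x + W @ h)
    return softmax(V @ h)        # single output produced after the whole sequence is read

xs = [rng.normal(size=d) for _ in range(7)]   # a length-7 input sequence
print(classify_sequence(xs))                  # e.g. class probabilities for the whole sequence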
• The RNN of figure 10.4 is strictly less powerful (it can express a smaller set of functions) than those in the family represented by figure 10.3.
• The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future.
• There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce.
• Unless o is very high-dimensional and rich, it will usually lack important information from the past.
• This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others.
• For the RNN of figure 10.3, for each time step from t = 1 to t = τ we apply the following update equations (equation 10.8):

  a(t) = b + W h(t−1) + U x(t)
  h(t) = tanh(a(t))
  o(t) = c + V h(t)
  ŷ(t) = softmax(o(t))
• where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections.
• This is an example of a recurrent network that maps an input sequence to an output sequence of the same length.
• The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then

  L({x(1), . . . , x(τ)}, {y(1), . . . , y(τ)}) = Σt L(t) = − Σt log pmodel( y(t) | x(1), . . . , x(t) )
• Computing the gradient of this loss involves a forward propagation pass through the unrolled graph followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one.
• States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time, or BPTT.
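A minimal BPTT sketch, assuming PyTorch is available; the sizes, initialization, and targets are illustrative assumptions. The forward loop keeps every intermediate hidden state in the autograd graph (the O(τ) memory cost), and the single backward() call propagates gradients right to left through all τ time steps to the shared parameters.

import torch

torch.manual_seed(0)
d, p, k, tau = 4, 6, 3, 5                       # sizes and sequence length -- illustrative assumptions
U = (0.1 * torch.randn(p, d)).requires_grad_()  # input-to-hidden
W = (0.1 * torch.randn(p, p)).requires_grad_()  # hidden-to-hidden
V = (0.1 * torch.randn(k, p)).requires_grad_()  # hidden-to-output
b = torch.zeros(p, requires_grad=True)
c = torch.zeros(k, requires_grad=True)

xs = torch.randn(tau, d)                        # input sequence
ys = torch.tensor([0, 2, 1, 1, 0])              # made-up targets

h = torch.zeros(p)
loss = torch.tensor(0.0)
for t in range(tau):                            # forward pass: every h(t) is kept in the graph
    h = torch.tanh(b + W @ h + U @ xs[t])
    o = c + V @ h
    loss = loss - torch.log_softmax(o, dim=0)[ys[t]]   # accumulate per-step NLL

loss.backward()      # backward pass through time: gradients flow right-to-left over all tau steps
print(W.grad.shape)  # gradient of the shared hidden-to-hidden weights, summed over time steps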
• Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y(t) as input at time t + 1.
• For a sequence of two time steps, the conditional maximum likelihood criterion is

  log p( y(1), y(2) | x(1), x(2) ) = log p( y(2) | y(1), x(1), x(2) ) + log p( y(1) | x(1), x(2) )

• We see that at time t = 2, the model is trained to maximize the conditional probability of y(2) given both the x sequence so far and the previous y value from the training set.
• Maximum likelihood thus specifies that during training, rather than feeding the model’s own output back into itself, these connections should be fed with the target values specifying what the correct output should be.
• Teacher forcing was originally motivated as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections.
• Some models may thus be trained with both teacher forcing and BPTT.
• The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input (a minimal sketch contrasting the two modes follows below).
ENCODER – DECODER SEQUENCE TO SEQUENCE ARCHITECTURES
• Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length.
• This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length.
• The input to the RNN is called the “context.” We want to produce a representation of this context, C. The context C might be a vector or a sequence of vectors that summarize the input sequence.
• (1) An encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
• (2) A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence.
• In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of

  log P( y(1), . . . , y(ny) | x(1), . . . , x(nx) )

over all the pairs of x and y sequences in the training set.
• The last state h(nx) of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.
• If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN.
• There is no constraint that the encoder must have the same size of hidden layer as the decoder.
• One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.
• This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector.
• They introduced an attention mechanism that learns to associate elements of the sequence C to elements of the output sequence (a minimal sketch of the idea follows below).
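A minimal Python/numpy sketch of the attention idea: instead of a single fixed-size C, the decoder forms a different weighted summary of all encoder states at each output step. This uses a simple dot-product score for illustration; Bahdanau et al. (2015) use a learned additive scoring function, and the sizes here are assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
p = 6                                        # hidden size -- illustrative assumption
encoder_states = rng.normal(size=(9, p))     # one hidden state per input position (nx = 9)
decoder_state  = rng.normal(size=p)          # current decoder hidden state

scores  = encoder_states @ decoder_state     # one relevance score per input position
weights = softmax(scores)                    # attention weights sum to 1
context = weights @ encoder_states           # weighted summary of the input used at this step

print(weights.round(2))   # which input positions the decoder attends to for this output element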
Encoder
• A stack of several recurrent units where each accepts a single element of the input sequence, collects information for that element and propagates it forward.
• In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word.
• The hidden states h_i are computed using the formula:

  h_i = f( W^(hh) h_{i−1} + W^(hx) x_i )
Encoder Vector
• This is the final hidden state produced from the encoder part of the model. It is calculated using the formula above.
• This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
Decoder
• A stack of several recurrent units where each predicts an output y_t at a time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
• Any hidden state h_i is computed using the formula:

  h_i = f( W^(hh) h_{i−1} )
• We are just using the previous hidden state to compute the next one.
• The output y_t at time step t is computed using the formula:

  y_t = softmax( W^(S) h_t )

• We calculate the outputs using the hidden state at the current time step together with the respective weight W^(S). Softmax is used to create a probability vector which will help us determine the final output (e.g. the word in the question-answering problem). A minimal end-to-end sketch of this encoder-decoder follows below.
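A minimal Python/numpy sketch of the encoder-decoder forward pass described above. The weight names, sizes, one-hot inputs, and fixed number of decoding steps are illustrative assumptions; a practical system would use learned embeddings, LSTM/GRU units, and a stopping symbol.

import numpy as np

rng = np.random.default_rng(5)
d_in, p, k_out = 5, 8, 6    # input vocab size, hidden size, output vocab size -- assumptions
W_hx = rng.normal(scale=0.1, size=(p, d_in))   # encoder input-to-hidden
W_hh = rng.normal(scale=0.1, size=(p, p))      # encoder hidden-to-hidden
W_dh = rng.normal(scale=0.1, size=(p, p))      # decoder hidden-to-hidden
W_S  = rng.normal(scale=0.1, size=(k_out, p))  # decoder hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs):
    """Encoder: h_i = f(W_hh h_{i-1} + W_hx x_i); the final hidden state is the context C."""
    h = np.zeros(p)
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)
    return h                                    # encoder vector C

def decode(C, steps):
    """Decoder: starts from C, h_t = f(W_dh h_{t-1}), y_t = softmax(W_S h_t)."""
    h, outputs = C, []
    for _ in range(steps):
        h = np.tanh(W_dh @ h)
        outputs.append(int(np.argmax(softmax(W_S @ h))))   # pick the most probable word index
    return outputs

question = [np.eye(d_in)[i] for i in [0, 3, 2]]   # one-hot encoded input sequence (assumption)
C = encode(question)
print(decode(C, steps=4))                         # output word indices; lengths may differ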
Figure: machine translation from English to Spanish.

Applications
The encoder-decoder architecture has many applications, such as:
• Google's Machine Translation
• Question-answering chatbots
• Speech recognition
• Time-series applications
