Module 4-1
• Unlike traditional feedforward networks, RNNs have loops that allow information to
persist, making them well-suited for tasks where the order of data matters.
• E.g.: time series forecasting, speech recognition, and natural language processing (NLP)
• By applying the same update rule at each step of the sequence, the RNN ensures that every output is generated using the same set of weights, as shown in the sketch below.
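A minimal NumPy sketch of this weight sharing (the names rnn_step, W_xh and W_hh are illustrative, not taken from any particular library):

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))   # input-to-hidden weights, shared across all steps
W_hh = rng.normal(size=(4, 4))   # hidden-to-hidden weights, shared across all steps
b = np.zeros(4)

def rnn_step(h_prev, x_t):
    # The same W_xh, W_hh and b are reused at every time step.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(4)                      # initial hidden state
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 three-dimensional inputs
    h = rnn_step(h, x_t)             # one update rule applied repeatedly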
(b) Time-Layered Representation (Unrolled RNN)
• The RNN is expanded across time steps to show how information flows through the
sequence.
• Each word in a sentence (e.g., "the cat chased the mouse") is processed in a step-
by-step manner.
• The hidden state (ht) is passed to the next time step, allowing the network to retain
memory of previous words.
• The final output represents the predicted words.
Real-Life Applications of RNNs
Computational graphs
• A computational graph represents the sequence of operations in a neural
network, mapping inputs and parameters to outputs and loss.
• This enables backpropagation through time (BPTT) while maintaining shared parameters across time steps.
The basic formula for an RNN is:
h(t) = f(h(t-1), x(t); θ)
where θ denotes the parameters of the function f.
• Unfolding maps the left-hand graph to the right-hand graph in the figure below (both are computational graphs of an RNN without the output o).
• The black square indicates an interaction with a delay of one time step, from the state at time t to the state at time t + 1.
• Unfolding with parameter sharing is better than using different parameters per position: there are fewer parameters to estimate, and the model generalizes to sequences of various lengths.
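As a concrete illustration of unfolding, the recurrence above can be expanded for a few steps, starting from an initial state h(0):

h(1) = f(h(0), x(1); θ)
h(2) = f(h(1), x(2); θ) = f(f(h(0), x(1); θ), x(2); θ)
h(3) = f(h(2), x(3); θ) = f(f(f(h(0), x(1); θ), x(2); θ), x(3); θ)

Every application of f uses the same θ, which is exactly the parameter sharing that lets the unrolled graph handle sequences of any length.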
RNN design
1. Variation 1 of RNN (basic form): hidden2hidden connections,
sequence output
• The computational graph to compute the training loss of a recurrent network that
maps an input sequence of x values to a corresponding sequence of output o
values.
• The loss L evaluates the difference between the output o and the target y.
• With a softmax output, o holds the unnormalized log probabilities, and ŷ = softmax(o) is compared with y.
• The RNN is structured with three weight matrices:
• 𝑈: Connects input to hidden state
• 𝑊: Recurrent hidden-to-hidden connections
• 𝑉: Connects hidden state to output
This setup enables learning from sequential data.
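A minimal NumPy sketch of this variation, assuming tanh hidden units and a softmax output layer (the sizes and variable names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 4
U = rng.normal(size=(n_hid, n_in))     # U: input -> hidden
W = rng.normal(size=(n_hid, n_hid))    # W: recurrent hidden -> hidden
V = rng.normal(size=(n_out, n_hid))    # V: hidden -> output
b, c = np.zeros(n_hid), np.zeros(n_out)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

x_seq = rng.normal(size=(6, n_in))       # input sequence of 6 steps
y_seq = rng.integers(0, n_out, size=6)   # target class at each step

h, loss = np.zeros(n_hid), 0.0
for x_t, y_t in zip(x_seq, y_seq):
    h = np.tanh(U @ x_t + W @ h + b)     # hidden2hidden recurrence
    o = V @ h + c                        # unnormalized log probabilities
    y_hat = softmax(o)
    loss += -np.log(y_hat[y_t])          # loss L accumulated over time steps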
2. Variation 2 of RNN: output2hidden connections, sequence output
• It produces an output at each time step and has recurrent connections only from the output at one time step to the hidden units at the next time step.
• Teacher forcing can be used to train RNNs as in Fig. 10.4, where only output2hidden connections exist.
• i.e., hidden2hidden connections are absent.
• In teacher forcing, instead of using the model’s predicted output, we provide the
actual correct output from the training data for the previous time step.
• This helps the model learn faster and more accurately because it doesn't have to
rely on its own potentially incorrect predictions.
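A schematic sketch of teacher forcing for this output2hidden variation (the function names and sizes are placeholders; only the idea of feeding back the ground-truth previous output comes from the slide):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 4
W_oh = rng.normal(size=(n_hid, n_out))   # previous *output* -> hidden (output2hidden)
W_xh = rng.normal(size=(n_hid, n_in))    # current input -> hidden
V = rng.normal(size=(n_out, n_hid))      # hidden -> output logits

def step(y_prev_onehot, x_t):
    # No hidden2hidden link: the only recurrence is through the previous output.
    h = np.tanh(W_oh @ y_prev_onehot + W_xh @ x_t)
    return V @ h

x_seq = rng.normal(size=(5, n_in))
y_true = rng.integers(0, n_out, size=5)  # correct outputs from the training data

y_prev = np.zeros(n_out)                 # start-of-sequence placeholder
for x_t, y_t in zip(x_seq, y_true):
    logits = step(y_prev, x_t)           # model's prediction for this step
    # Teacher forcing: condition the next step on the true y_t,
    # not on the model's own (possibly incorrect) prediction.
    y_prev = np.eye(n_out)[y_t]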
Encoder-decoder sequence-to-sequence architectures
• The Encoder-Decoder sequence-to-sequence architecture is a neural network framework designed for tasks where the input and output sequences have different lengths.
• It consists of an encoder and decoder.
An encoder or reader or input RNN processes the input sequence X = (x(1), . . . , x(nx)). The encoder emits the context C, usually as a simple function of its final hidden state.
A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence Y = (y(1), . . . , y(ny)).
Encoder vector (also called the context vector) is the fixed-length representation
of the input sequence that the decoder uses to generate the output sequence.
• Commonly used in speech recognition, machine translation, and question answering.
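A compact sketch of the encoder-decoder idea, assuming plain tanh RNN cells (all names and sizes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 3, 6, 4
U_e, W_e = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))   # encoder
W_d, C_d = rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid))  # decoder
V_d = rng.normal(size=(d_out, d_hid))                                        # decoder output

def encode(x_seq):
    h = np.zeros(d_hid)
    for x_t in x_seq:
        h = np.tanh(U_e @ x_t + W_e @ h)
    return h                                    # context C = final hidden state

def decode(context, n_steps):
    s, outputs = np.zeros(d_hid), []
    for _ in range(n_steps):
        s = np.tanh(W_d @ s + C_d @ context)    # conditioned on the fixed-length C
        outputs.append(V_d @ s)                 # logits for one output token
    return outputs

x_seq = rng.normal(size=(4, d_in))              # input length nx = 4
y_logits = decode(encode(x_seq), n_steps=6)     # output length ny = 6 (nx and ny may differ)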
• Training: two RNNs (the input RNN and the output RNN) are trained jointly to maximize the average of log P(y(1), …, y(ny) | x(1), …, x(nx)) over all pairs of x and y sequences in the training set.
• If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN.
• One clear limitation of this architecture is when the context C output by the
encoder RNN has a dimension that is too small to properly summarize a long
sequence.
• Bahdanau et al. (2015) proposed making C a variable-length sequence instead of a
fixed-size vector.
• They introduced an attention mechanism, allowing the decoder to focus on different parts of the input dynamically.
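A rough sketch of the attention idea (dot-product scoring is used here purely for illustration; Bahdanau et al. actually use a small learned network to compute the scores):

import numpy as np

def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over input positions
    # A different weighted mix of encoder states at every decoder step,
    # instead of a single fixed-size context C.
    return weights @ encoder_states

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(7, 6))   # 7 encoder hidden states of size 6
dec_state = rng.normal(size=6)         # current decoder hidden state
c_t = attention_context(dec_state, enc_states)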
How does the Sequence-to-Sequence Model work?
Example
Consider the input sequence "I am a Student" to be encoded. There will be 4 time steps in total (4 tokens) for the encoder model. At each time step, the hidden state h is updated using the previous hidden state and the current input.
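Sketched step by step in the earlier notation, where f is the encoder update and embed(.) is a hypothetical embedding lookup:

h(1) = f(h(0), embed("I"))
h(2) = f(h(1), embed("am"))
h(3) = f(h(2), embed("a"))
h(4) = f(h(3), embed("Student"))

The final state h(4) becomes the context C that is handed to the decoder.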
Deep recurrent networks
• The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:
1. from the input to the hidden state, x(t) → h(t)
2. from the previous hidden state to the next hidden state, h(t-1) → h(t)
3. from the hidden state to the output, h(t) → o(t)
• However, we can use multiple layers for each of the above transformations, which results in deep recurrent networks (see the sketch below).
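A minimal sketch of one such deep (stacked) recurrent network, where the hidden state of the lower layer becomes the input of the upper layer at the same time step (names and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_in, d1, d2 = 3, 5, 5
U1, W1 = rng.normal(size=(d1, d_in)), rng.normal(size=(d1, d1))   # lower layer
U2, W2 = rng.normal(size=(d2, d1)), rng.normal(size=(d2, d2))     # upper layer

h1, h2 = np.zeros(d1), np.zeros(d2)
for x_t in rng.normal(size=(6, d_in)):
    h1 = np.tanh(U1 @ x_t + W1 @ h1)   # block 1: raw input -> lower hidden state
    h2 = np.tanh(U2 @ h1 + W2 @ h2)    # block 2: hidden -> hidden, now one layer deeper
    # block 3 (hidden -> output) would read from h2 here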
• The previous figure shows a significant benefit of decomposing the state of an RNN into multiple layers.
• The lower layers in the hierarchy can be seen as transforming the raw input into a representation that is more appropriate for the higher levels of the hidden state.
• However, shallower architectures are easier to optimize, and the extra depth makes the shortest path from a variable in time step t to a variable in time step t + 1 longer.
Recursive neural networks
• Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph.
• The typical computational graph for a recursive network is illustrated in Fig. 10.14
• Recursive networks have been successfully applied to processing data structures as input to neural nets, in natural language processing as well as in computer vision.
• One clear advantage of recursive networks over recurrent nets is that, for a sequence of length τ, the depth can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies.
• In some application domains, external methods can suggest the appropriate tree
structure.
• For example, when processing natural language sentences, the tree structure for
the recursive network can be fixed to the structure of the parse tree of the
sentence provided by a natural language parser.
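A small sketch of a recursive network over a fixed binary tree (the tree is hand-built here for illustration; in practice it would come from a parser as described above):

import numpy as np

rng = np.random.default_rng(0)
d = 4
W_left, W_right = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def compose(node):
    # A leaf is a vector; an internal node is a (left, right) pair.
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    # The same W_left / W_right are shared at every internal node.
    return np.tanh(W_left @ compose(left) + W_right @ compose(right))

leaves = [rng.normal(size=d) for _ in range(4)]          # e.g. word vectors
tree = ((leaves[0], leaves[1]), (leaves[2], leaves[3]))  # balanced tree of depth O(log n)
root = compose(tree)                                     # one vector for the whole sequence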
Modern RNNs
• Modern RNNs include LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units).
• They address the vanishing and exploding gradient problems in traditional RNNs.
• They use gates to control information flow, allowing them to retain long-term
dependencies and adapt their weights at each time step.
• GRUs simplify the LSTM's gating with just reset and update gates, making them more computationally efficient.
• These models are widely used in NLP, speech recognition, and time-series
forecasting due to their ability to reduce information loss.
LSTM (Long Short-Term Memory)
• LSTM is a type of RNN designed to handle sequential data and capture long-term
dependencies.
• It was introduced to solve the problem of vanishing and exploding gradients that
traditional RNNs suffer from when learning long-term dependencies.
• Memory Cells: Unlike regular RNNs, LSTMs have special units called memory cells
that help them remember information over long sequences.
• Gating Mechanism: LSTMs use three gates (input, forget, and output) to control
the flow of information.
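A condensed NumPy sketch of one LSTM step with the three gates (this follows the standard formulation; the weight names and sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
# One weight matrix per gate plus one for the candidate cell contents.
Wf, Wi, Wo, Wc = (rng.normal(size=(d_hid, d_in + d_hid)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(Wf @ z)             # forget gate: what to erase from the memory cell
    i = sigmoid(Wi @ z)             # input gate: what new information to store
    o = sigmoid(Wo @ z)             # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z)       # candidate cell contents
    c_t = f * c_prev + i * c_tilde  # memory cell: largely additive update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(6, d_in)):
    h, c = lstm_step(x_t, h, c)

The largely additive cell update is what lets gradients flow across many time steps without vanishing as quickly as in a plain RNN.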
GRU (Gated Recurrent Unit)
• GRU is a type of RNN architecture designed to solve problems like vanishing
gradients and inefficient long-term dependency learning in traditional RNNs.
• It is similar to LSTM, but is more computationally efficient due to having fewer
parameters.
GRU Architecture
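A matching NumPy sketch of one GRU step with its two gates (standard formulation; names and sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
Wr, Wz, Wh = (rng.normal(size=(d_hid, d_in + d_hid)) for _ in range(3))

def gru_step(x_t, h_prev):
    z_in = np.concatenate([x_t, h_prev])
    r = sigmoid(Wr @ z_in)                                     # reset gate
    z = sigmoid(Wz @ z_in)                                     # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # no separate memory cell

h = np.zeros(d_hid)
for x_t in rng.normal(size=(6, d_in)):
    h = gru_step(x_t, h)

With three weight matrices instead of the LSTM's four, the GRU has fewer parameters, which is where its computational efficiency comes from.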
Difference between RNN and Modern RNN
Previous Year Questions
• What is a computational graph and how is it used in the context of RNN?
• Compare LSTM and RNN.
• Suppose you were given a task of predicting long-term dependencies in data. Which architecture would you prefer: LSTM or RNN? Justify your answer and explain its architecture.
• List and explain the applications of deep recurrent neural networks.
• With neat diagram explain GRU architecture.
• Explain the concept of ‘Unrolling through time’ in Recurrent Neural Networks.
• How does a recursive neural network work?
• Draw and explain the architecture of LSTM.
• How does encoder-decoder RNN work?
• Draw and explain the architecture of Recurrent Neural Networks.
• Describe how an LSTM takes care of the vanishing gradient problem.
• Sketch diagrams of different Recurrent Neural Network patterns and explain them in
detail.
• Discuss different ways to make a Recurrent Neural Network (RNN) a deep RNN with the help of diagrams.