
Deep Learning

Unit – 3
Question Bank
1. Which of the following RNN patterns outputs a single prediction after reading
an entire sequence?
A) Encoder
B) Acceptor
C) Transducer
D) Decoder
Answer: B) Acceptor
2. In a sequence-to-sequence model, what is the main role of the encoder?
A) Generate predictions
B) Classify input tokens
C) Encode input into a context vector
D) Store gradients
Answer: C) Encode input into a context vector
3. The transducer RNN pattern is best suited for:
A) Single output from sequence
B) Labeling sequences
C) Fixed-length inputs
D) Image classification
Answer: B) Labeling sequences
4. What is used to compute gradients in RNNs?
A) Stochastic Gradient Descent
B) Adam Optimizer
C) Backpropagation Through Time (BPTT)
D) RMSProp
Answer: C) Backpropagation Through Time (BPTT)
5. Which of the following is a problem when training deep RNNs?
A) Overfitting
B) Gradient Explosion
C) Gradient Vanishing
D) Fast convergence
Answer: C) Gradient Vanishing
6. Bidirectional RNNs are useful because they:
A) Reduce computation time
B) Use GPU acceleration
C) Consider both past and future context
D) Skip hidden states
Answer: C) Consider both past and future context
7. In sequence-to-sequence models, the decoder:
A) Reads the input directly
B) Produces an intermediate representation
C) Outputs a new sequence
D) Stores attention weights
Answer: C) Outputs a new sequence
8. What is a major benefit of Deep Recurrent Networks?
A) Simpler computations
B) Faster training
C) Better hierarchical feature learning
D) No need for gradient updates
Answer: C) Better hierarchical feature learning
9. Recursive Neural Networks are mainly used for:
A) Time series prediction
B) Image classification
C) Hierarchical data structures

D) Sequence labeling
Answer: C) Hierarchical data structures
10.What causes the vanishing gradient problem in RNNs?
A) Overfitting
B) Large weights
C) Long-term dependencies
D) Lack of training data
Answer: C) Long-term dependencies
11.What do leaky units in RNNs help with?
A) Increase computation
B) Avoid memory usage
C) Stabilize gradient flow
D) Block input noise
Answer: C) Stabilize gradient flow
12.Skip connections help in RNNs by:
A) Increasing memory usage
B) Speeding up inference
C) Improving gradient flow
D) Reducing dropout
Answer: C) Improving gradient flow
13.Dropout is mainly used in RNNs for:
A) Speeding up training
B) Preventing overfitting
C) Increasing depth
D) Enhancing weights
Answer: B) Preventing overfitting
14.Which of the following is a gated RNN architecture?
A) CNN
B) LSTM
C) MLP

D) Boltzmann Machine
Answer: B) LSTM
15.In an LSTM cell, the forget gate is responsible for:
A) Selecting the output
B) Forgetting part of the cell state
C) Adding new information
D) Copying inputs
Answer: B) Forgetting part of the cell state
16.What does the input gate in an LSTM do?
A) Forgets hidden states
B) Stores long-term memory
C) Adds new input to cell state
D) Filters outputs
Answer: C) Adds new input to cell state
17.What does the output gate control in an LSTM?
A) Gradient flow
B) Final output from the cell
C) Dropout rate
D) Learning rate
Answer: B) Final output from the cell
18.Why are LSTMs preferred over simple RNNs?
A) More parameters
B) Less computation
C) Solve vanishing gradient problem
D) Shorter training time
Answer: C) Solve vanishing gradient problem
19.A key feature of Bidirectional RNNs is:
A) Two output vectors
B) Backward-only computation
C) Parallel sequence modeling

D) Processing sequences in both directions
Answer: D) Processing sequences in both directions
20.In a sequence-to-sequence model, which part handles the output generation?
A) Encoder
B) Input Gate
C) Decoder
D) Memory Unit
Answer: C) Decoder
2-Marks Questions:
1. What is an Acceptor in RNN design patterns?
An Acceptor processes an input sequence and outputs a single value, typically
used in classification tasks like sentiment analysis. It summarizes the entire input
into one decision output. It is generally followed by a dense layer with softmax
or sigmoid activation for final classification. It ignores the intermediate states
and focuses on the final hidden state for prediction.
2. What does an Encoder do in RNNs?
An Encoder converts a sequence into a fixed-length vector that captures the
essential information. This representation is then used by another network
(usually a decoder) for further tasks. Encoders are vital in sequence-to-sequence
learning. They help handle variable-length input sequences and compress them
into meaningful context.
3. Define Transducer in the context of RNNs.
A Transducer transforms an input sequence into an output sequence of the same
or different length. It's commonly used in applications like speech recognition
and translation. It combines encoding and decoding steps into a single model.
Transducers can be implemented using attention mechanisms for dynamic
alignment.

4. What is Backpropagation Through Time (BPTT)?


BPTT unfolds the RNN across time steps and applies standard
backpropagation. It computes gradients across each time step but can become
inefficient and unstable for long sequences. BPTT can be truncated for long
sequences to improve performance. It helps in updating weights by
considering dependencies across time.

5. What are the main issues in gradient computation in RNNs?
Vanishing gradients make it hard to learn long-term dependencies; exploding
gradients can cause numerical instability. Proper initialization, gradient clipping,
and using LSTM/GRU help address these problems. Choosing suitable
activation functions like ReLU also helps.
6. What does sequence modeling conditioned on contexts mean?
It involves incorporating external information (like speaker, time, or
environment) into the model. This helps generate more accurate and relevant
predictions. Context can be added as an additional input at each time step or as
part of the initial state. It enables contextual adaptation of predictions.
7. What is a Bidirectional RNN?
A Bidirectional RNN has two RNNs: one reads the input forward, and the other
reads it backward. Their outputs are combined to improve context
understanding. It is useful when the entire sequence is available at once. It
improves understanding of ambiguous words based on future context.
8. What is the advantage of Bidirectional RNNs?
They provide a richer context by using information from both past and future
words. This is especially useful in tasks like named entity recognition and
POS tagging. However, they are not ideal for real-time applications. They
increase computational cost but provide better accuracy.
9. Explain the Sequence-to-Sequence model in RNNs.
It uses an encoder to represent the input as a context vector and a decoder to
produce the output sequence. This design is vital for variable-length output
tasks. Attention mechanisms are often added for better performance. Seq2Seq
is commonly used in translation, summarization, and question answering.
10.Where are Seq2Seq models used?
Used in neural machine translation, speech-to-text systems, and
summarization. Attention mechanisms are often added to improve
performance. They are also employed in question answering and dialogue
generation. These models handle input-output sequences of different lengths
effectively.
11.What is a Deep Recurrent Network?
A Deep RNN has multiple RNN layers stacked on top of each other, allowing
hierarchical feature extraction across different time scales. Deeper models

can capture more abstract features but are harder to train. They enhance
model expressiveness and abstraction power.
12.Why use Deep Recurrent Networks?
They offer improved representation capacity and can model complex patterns
more effectively. However, they require careful training due to potential
vanishing gradients. Techniques like skip connections and batch
normalization help in training. They help in learning both low-level and high-
level temporal features.
13.What are Recursive Neural Networks?
These networks process inputs with hierarchical or tree-like structures, useful
for syntactic parsing and image scene understanding. Each node in the tree is
computed based on its children. They are suitable for modeling structures
with nested or grammatical rules.
14.How do Recursive Neural Networks differ from RNNs?
While RNNs process sequences linearly, Recursive Neural Networks
combine inputs based on a tree structure, handling hierarchical relationships
better. They are well-suited for natural language with parse tree structures.
Recursive models handle compositional semantics efficiently.
15.What is the challenge of long-term dependencies in RNNs?
RNNs struggle with remembering distant information due to vanishing
gradients. This limits their ability to connect earlier and later parts of the
sequence. LSTMs and GRUs are designed to overcome this limitation. Long-
term dependencies are essential in language understanding.
16.How do Leaky Units help RNNs?
Leaky units allow some information to persist over time, addressing the
vanishing gradient problem. They use a fixed decay rate that mixes old and
new hidden states. This helps capture longer dependencies. Leaky integration
allows gradual forgetting instead of abrupt reset.
17.What are Skip Connections in RNNs?
Skip connections link non-consecutive layers, improving gradient flow and
training speed. They reduce vanishing gradients and enable deeper
architectures to be trained efficiently. They allow models to reuse earlier
learned features across layers.
18.How does Dropout help RNNs?

Dropout randomly disables neurons during training to prevent overfitting. It
is carefully applied to avoid disrupting temporal dependencies. Variational
dropout can be used to apply the same dropout mask at each time step.
Dropout improves model generalization and reduces over-reliance on specific
neurons.
19.What is the role of gates in LSTM architecture?
Gates (input, forget, output) regulate information flow in LSTMs,
determining what to store, update, or discard, thus enhancing memory
control. The forget gate is key in controlling memory retention. Gates enable
selective memory manipulation, aiding learning of long-range patterns.
20.Why are LSTMs effective for long-term dependencies?
LSTMs use cell states and gating mechanisms to retain relevant information
over long sequences, mitigating vanishing gradient issues common in vanilla
RNNs. Their design allows selective memory updates and forgetting. They
can maintain information over hundreds of time steps.

10-Marks Questions:
1. Sequence-to-Sequence Model in RNN
In deep learning, many complex problems can be solved by constructing better neural network architectures. The RNN (Recurrent Neural Network) and its variants are very useful in sequence-to-sequence learning, and the RNN variant LSTM (Long Short-Term Memory) is the most widely used cell in seq2seq tasks.
The encoder-decoder architecture for recurrent neural networks is the standard neural machine translation method; it rivals and in some cases outperforms classical statistical machine translation methods.
This architecture is relatively recent, having been pioneered in 2014, yet it has been adopted as the core technology inside production systems such as Google's machine translation service.
Encoder-Decoder Model
There are three main blocks in the encoder-decoder model,
 Encoder
 Hidden Vector
 Decoder
The Encoder will convert the input sequence into a single-dimensional vector (hidden
vector). The decoder will convert the hidden vector into the output sequence.
Encoder-Decoder models are jointly trained to maximize the conditional probabilities of
the target sequence given the input sequence.
How does the Sequence-to-Sequence Model work?
In order to fully understand the model’s underlying logic, we will go over the below
illustration:

Encoder-decoder sequence to sequence model


Encoder
 Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
 For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep, X[t].
 After all the inputs are read by the encoder model, the final hidden state represents the context/summary of the whole input sequence.
 Example: Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4 tokens) for the encoder model. At each time step, the hidden state h is updated using the previous hidden state and the current input.

Example: Encoder
 At the first timestep t1, the previous hidden state h0 is taken to be zero or randomly initialized. The first RNN cell then updates the current hidden state using the first input and h0. Each cell outputs two things — an updated hidden state and an output for that stage. The outputs at each stage are discarded and only the hidden states are propagated to the next step.
 The hidden states h_t are computed using the formula:
h_t = f(W^(hh) · h_{t-1} + W^(hx) · x_t)
 At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is updated according to both. The same update is repeated for all four stages of the example taken.
 A stack of several recurrent units (LSTM or GRU cells for better performance)
where each accepts a single element of the input sequence, collects information for
that element, and propagates it forward.
 In the question-answering problem, the input sequence is a collection of all words
from the question. Each word is represented as x_i where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.
Encoder Vector
 This is the final hidden state produced from the encoder part of the model. It is
calculated using the formula above.
 This vector aims to encapsulate the information for all input elements in order to
help the decoder make accurate predictions.
 It acts as the initial hidden state of the decoder part of the model.
Decoder
 The Decoder generates the output sequence by predicting the next output Yt given
the hidden state ht.
 The input for the decoder is the final hidden vector obtained at the end of encoder
model.
 Each step has three inputs: the hidden vector from the previous step h_{t-1}, the previous step's output y_{t-1}, and the original (encoder) hidden vector h.
 At the first step, the encoder's output vector, the special START symbol and an empty hidden state h0 are given as input; the outputs obtained are y1 and the updated hidden state h1 (the information of the output is subtracted from the hidden vector).
 The second step takes the updated hidden state h1, the previous output y1 and the original hidden vector h as its current inputs, and produces the hidden vector h2 and output y2.
 The outputs produced at each timestep of the decoder form the actual output sequence. The model keeps predicting outputs until the END symbol occurs.
 A stack of several recurrent units where each predicts an output y_t at a time step t.

 Each recurrent unit accepts a hidden state from the previous unit and produces an
output as well as its own hidden state.
 In the question-answering problem, the output sequence is a collection of all words
from the answer. Each word is represented as y_i where i is the order of that word.

Example: Decoder.
 Any hidden state h_t is computed using the formula:
h_t = f(W^(hh) · h_{t-1})
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
 We use Softmax activation function at the output layer.

 It is used to produce a probability distribution over a vector of values, assigning the target class a high probability.
 The output y_t at time step t is computed using the formula:
y_t = softmax(W^(S) · h_t)
We calculate the outputs using the hidden state at the current time step together with the
respective weight W(S). Softmax is used to create a probability vector that will help us
determine the final output (e.g. word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of different lengths to each other. The inputs and outputs do not need to be aligned one-to-one, and their lengths can differ. This opens up a whole new range of problems that can be solved with such an architecture.
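As a concrete illustration of the encoder-decoder idea above, here is a minimal sketch using PyTorch GRU layers. The class names, vocabulary size and dimensions are illustrative assumptions, not the exact model from the description.

```python
# Minimal encoder-decoder sketch in PyTorch (illustrative, not a production model).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) of token ids
        _, h = self.rnn(self.embed(src))     # h: (1, batch, hidden) = context vector
        return h                             # final hidden state summarizes the input

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, h):               # tgt: (batch, tgt_len), h: encoder context
        outputs, _ = self.rnn(self.embed(tgt), h)
        return self.out(outputs)             # (batch, tgt_len, vocab) scores per word

# Usage: encode the source sequence, then decode starting from the context vector.
enc, dec = Encoder(1000, 64, 128), Decoder(1000, 64, 128)
src = torch.randint(0, 1000, (2, 5))         # batch of 2 source sequences, length 5
tgt = torch.randint(0, 1000, (2, 7))         # teacher-forced target inputs, length 7
logits = dec(tgt, enc(src))                  # (2, 7, 1000)
```

In practice an attention mechanism is usually added on top of this basic design, as noted earlier in this answer.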
Applications
It has many applications, such as
 Google’s Machine Translation
 Question answering chatbots
 Speech recognition
 Time Series Application etc.,

2. Deep Recurrent Networks


Recurrent Neural Networks (RNNs) work a bit differently from regular neural networks. In a feedforward neural network, information flows in one direction, from input to output. In an RNN, however, information is fed back into the system after each step. Think of it like reading a sentence: when you are trying to predict the next word, you do not just look at the current word but also need to remember the words that came before to make an accurate guess.
RNNs allow the network to “remember” past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on it. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.
This image showcases the basic architecture of RNN and the feedback loop mechanism
where the output is passed back as input for the next time step.
How RNN Differs from Feedforward Neural Networks?
Feedforward Neural Networks (FNNs) process data in one direction from input to output
without retaining information from previous inputs. This makes them suitable for tasks
with independent inputs like image classification. However FNNs struggle with sequential
data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow
information from previous steps to be fed back into the network. This feedback
enables RNNs to remember prior inputs making them ideal for tasks where context is
important.

Key Components of RNNs
1. Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a
hidden state that maintains information about previous inputs in a sequence. Recurrent
units can “remember” information from prior steps by feeding back their hidden state,
allowing them to capture dependencies across time.

2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time
steps. During unfolding each step of the sequence is represented as a separate layer in a
series illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT) a learning process where
errors are propagated across time steps to adjust the network’s weights enhancing the
RNN’s ability to learn dependencies within sequential data.

RNN Unfolding
Recurrent Neural Network Architecture
RNNs share similarities in input and output structures with other deep learning
architectures but differ significantly in how information flows from input to output.
Unlike traditional deep neural networks, where each dense layer has distinct weight
matrices, RNNs use shared weights across time steps, allowing them to remember
information over sequences.
In RNNs, the hidden state H_i is calculated for every input X_i to retain sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h = σ(U·X + W·h_{t-1} + B)
Here, h represents the current hidden state, U and W are weight matrices, and B is the bias.
2. Output Calculation:
Y = O(V·h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the output weights and bias.
3. Overall Function:
Y = f(X, h, W, U, V, B, C)
This function defines the entire RNN operation, where the state matrix S holds each element s_i representing the network's state at each time step i.
The computation in most recurrent neural networks can be decomposed into three blocks
of parameters and associated transformations:
1. From the input to the hidden state

2. From the previous hidden state to the next hidden state
3. From the hidden state to the output
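As a concrete illustration of these three blocks, here is a minimal single-step sketch in PyTorch; the matrices U, W and V correspond to the input-to-hidden, hidden-to-hidden and hidden-to-output transformations, and all sizes are illustrative assumptions.

```python
import torch

input_size, hidden_size, output_size = 4, 8, 3
U = torch.randn(hidden_size, input_size)     # block 1: input  -> hidden
W = torch.randn(hidden_size, hidden_size)    # block 2: hidden -> hidden
V = torch.randn(output_size, hidden_size)    # block 3: hidden -> output
b = torch.zeros(hidden_size)
c = torch.zeros(output_size)

x_t = torch.randn(input_size)                # input at time step t
h_prev = torch.zeros(hidden_size)            # previous hidden state h(t-1)

h_t = torch.tanh(U @ x_t + W @ h_prev + b)   # hidden state update
y_t = V @ h_t + c                            # output logits; apply softmax if needed
```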

Blocks of parameters as a shallow transformation

• With the standard RNN architecture, each of these three blocks is associated with a single weight matrix.
• When the network is unfolded, each of these blocks corresponds to a shallow transformation.
• By a shallow transformation we mean a transformation that would be represented by a single layer within a deep MLP, typically a learned affine transformation followed by a fixed nonlinearity.
• Would it be advantageous to introduce depth into each of these operations? Experimental evidence strongly suggests so: we need enough depth to perform the required transformations.

Ways of making an RNN deep
1. The hidden recurrent state can be broken down into groups organized hierarchically.
2. Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output transformations.
3. The path-lengthening effect of this extra depth can be mitigated by introducing skip connections.

Recurrent states broken down into groups

We can think of the lower levels of the hierarchy as playing the role of transforming the raw input into a representation that is more appropriate for the higher levels of the hidden state.

Deeper computation in hidden-to-hidden

• We can go a step further and use a separate MLP (possibly deep) for each of the three blocks:
1. From the input to the hidden state
2. From the previous hidden state to the next hidden state
3. From the hidden state to the output
• Considerations of representational capacity suggest allocating enough capacity in each of these three steps.
• However, doing so by adding depth may hurt learning by making optimization difficult; in general it is easier to optimize shallower architectures.
• Adding the extra depth makes the shortest path from a variable at time step t to a variable at time step t+1 longer.

3. Introducing skip connections

• For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps compared with the ordinary RNN.
• This can be mitigated by introducing skip connections in the hidden-to-hidden path, so that information can jump over the extra layers.
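A minimal sketch of a deep (stacked) recurrent network in PyTorch is shown below; num_layers stacks recurrent layers so that the hidden sequence of one layer becomes the input of the next. Sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the hidden sequence of layer 1 is the input of layer 2.
deep_rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(8, 20, 32)          # (batch, time steps, features)
outputs, (h_n, c_n) = deep_rnn(x)   # outputs: (8, 20, 64) from the top layer
                                    # h_n, c_n: (num_layers, batch, 64)
```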

Recurrent Neural Architecture
How does RNN work?
At each time step, RNNs process units with a fixed activation function. These units have an internal hidden state that acts as memory, retaining information from previous time steps. This memory allows the network to store past knowledge and adapt based on new inputs.
Updating the Hidden State in RNNs
The current hidden state h_t depends on the previous state h_{t-1} and the current input x_t, and is calculated using the following relations:
1. State Update:
h_t = f(h_{t-1}, x_t)
where:
 h_t is the current state
 h_{t-1} is the previous state
 x_t is the input at the current time step
2. Activation Function Application:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)
Here, W_hh is the weight matrix for the recurrent neuron, and W_xh is the weight matrix for the input neuron.
3. Output Calculation:
y_t = W_hy · h_t
where y_t is the output and W_hy is the weight at the output layer.
These parameters are updated using backpropagation. However, since RNN works
on sequential data here we use an updated backpropagation which is known
as backpropagation through time.
Backpropagation Through Time (BPTT) in RNNs
Since RNNs process sequential data Backpropagation Through Time (BPTT) is
used to update the network's parameters. The loss function L(θ) depends on the final hidden state h3, and each hidden state relies on the preceding ones, forming a sequential dependency chain:
h3 depends on h2, h2 depends on h1, …, h1 depends on h0.

Backpropagation Through Time (BPTT) In RNN
In BPTT, gradients are backpropagated through each time step. This is essential for
updating network parameters based on temporal dependencies.
1. Simplified Gradient Calculation:
∂L(θ)/∂W = (∂L(θ)/∂h3) · (∂h3/∂W)
2. Handling Dependencies in Layers:
Each hidden state is updated based on its dependencies:
h3 = σ(W·h2 + b)
The gradient is then calculated for each state, considering dependencies from
previous hidden states.
3. Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into an explicit (direct) part and an implicit part that sums up the indirect paths from each hidden state to the weights:
∂h3/∂W = ∂⁺h3/∂W + (∂h3/∂h2) · (∂⁺h2/∂W)
(here ∂⁺ denotes the explicit, direct dependence on W).
4. Final Gradient Expression:
The final derivative of the loss function with respect to the weight matrix W is computed as:
∂L(θ)/∂W = (∂L(θ)/∂h3) · Σ_{k=1}^{3} (∂h3/∂hk) · (∂hk/∂W)
This iterative process is the essence of backpropagation through time.
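As a rough illustration of BPTT in practice, the sketch below unrolls an RNN cell over the time steps of a toy batch and lets a single backward pass propagate gradients through every step; the data, sizes and loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)
readout = nn.Linear(8, 1)
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(16, 5, 4)          # toy batch: (batch, time steps, features)
target = torch.randn(16, 1)

h = torch.zeros(16, 8)             # initial hidden state h0
for t in range(x.size(1)):         # unroll the recurrence over time
    h = cell(x[:, t, :], h)        # h_t = tanh(W_ih x_t + W_hh h_{t-1} + biases)

loss = nn.functional.mse_loss(readout(h), target)
loss.backward()                    # BPTT: gradients flow back through all time steps
opt.step()
```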
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the
network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input
and a single output. It is used for straightforward classification tasks such as binary
classification where no sequential data is involved.

One to One RNN


2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce multiple
outputs over time. This is useful in tasks where one input triggers a sequence of
predictions (outputs). For example in image captioning a single image can be used
as input to generate a sequence of words as a caption.

One to Many RNN
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is needed
to make one prediction. In sentiment analysis the model receives a sequence of
words (like a sentence) and produces a single output like positive, negative or
neutral.

Many to One RNN


4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a
sequence of outputs. In language translation task a sequence of words in one
language is given as input, and a corresponding sequence in another language is
generated as output.

3. Modeling Sequences Conditioned on Context with RNNs

Graphical models of RNNs without/with inputs

1. Directed graphical models of RNNs without inputs


• Consider a set of random variables y(t). Viewed as a fully connected graphical model, every y(t) would depend on all earlier values; the RNN recurrence
h(t) = f(h(t-1), x(t); θ)
gives an efficient parameterization of this model.

2. RNNs that do include a sequence of inputs x(1), x(2), ..., x(τ) use the standard update equations:
a(t) = b + W h(t-1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)

• RNNs allow the graphical-model view to be extended to represent not only the joint distribution over the y variables but also a conditional distribution over y given x.

The conditional probability distributions (CPDs) of the model depend on the RNN design pattern:
1. Recurrent connections between hidden units
2. Recurrent connections only from the output at one time step to the hidden units at the next time step

Extending RNNs to represent a conditional distribution P(y|x)

• A model representing a distribution P(y; θ) can be reinterpreted as a model representing a conditional distribution P(y|ω) with ω = θ.
• We can extend such a model to represent a distribution P(y|x) by using the same P(y|ω) as before, but making ω a function of x.
• In the case of an RNN this can be achieved in several ways; the most common choices are described next.

Taking a single vector x as an extra input

• Instead of taking a sequence x(t), t = 1, …, τ as input, we can take a single vector x as input.
• When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence.
• Common ways of providing an extra input to an RNN are:
1. An extra input at each time step, or
2. As the initial state h(0), or
3. Both
The first and most common approach is described next.
Mapping vector x into a distribution over sequences Y: an extra input at each time step

• The interaction between the input x and the hidden unit vector h(t) is parameterized by a newly introduced weight matrix R that was absent from the model of only the y sequence.
• This is appropriate for tasks such as image captioning, where a single image x is the input and produces a sequence of words describing the image.
• Each element y(t) of the observed output sequence serves both as input (for the current time step) and, during training, as target.
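A minimal sketch of the "extra input at each time step" pattern is given below: a fixed context vector x (for example, an image feature) is concatenated to the decoder input at every step, which plays the role of the extra weight matrix R mentioned above. All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextConditionedRNN(nn.Module):
    def __init__(self, vocab_size, emb_size, ctx_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # The GRU sees [word embedding ; context] at every time step, so the context
        # contributes through its own slice of the input-to-hidden weights.
        self.rnn = nn.GRU(emb_size + ctx_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, y_in, context):          # y_in: (batch, T), context: (batch, ctx)
        emb = self.embed(y_in)                  # (batch, T, emb)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        h_seq, _ = self.rnn(torch.cat([emb, ctx], dim=-1))
        return self.out(h_seq)                  # per-step scores over the vocabulary

model = ContextConditionedRNN(vocab_size=1000, emb_size=32, ctx_size=256, hidden_size=128)
caption_in = torch.randint(0, 1000, (2, 6))     # previous words y(t-1) fed as input
image_feat = torch.randn(2, 256)                # single context vector x per example
logits = model(caption_in, image_feat)          # (2, 6, 1000)
```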

RNN receiving a sequence of vectors x(t) as input

• The RNN described by a(t) = b + W h(t-1) + U x(t) corresponds to a conditional distribution P(y(1), …, y(τ) | x(1), …, x(τ)).
• It makes a conditional independence assumption that this distribution factorizes as ∏_t P(y(t) | x(1), …, x(t)).

• To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t+1.
• The model can then represent arbitrary probability distributions over the y sequence.
• Limitation: both sequences must be of the same length. Removing this restriction is discussed in Section 10.4.

Removing the conditional independence assumption

• Connections from the previous output to the current state allow the RNN to model an arbitrary distribution over sequences of y.
• Such a conditional RNN maps a variable-length sequence of x values into a distribution over sequences of y values of the same length. Compare this to the earlier model, which is only able to represent distributions in which the y values are conditionally independent of each other given the x values.

4. Gradient Computation
In the realm of deep learning, the optimization process plays a crucial role in
training neural networks. Gradient descent, a fundamental optimization algorithm,
can sometimes encounter two common issues: vanishing gradients and exploding gradients. This section looks at these challenges: what they are, why they occur, and how to mitigate them.
What is Vanishing Gradient?
The vanishing gradient problem is a challenge that emerges during
backpropagation when the derivatives or slopes of the activation functions
become progressively smaller as we move backward through the layers of a
neural network. This phenomenon is particularly prominent in deep networks with
many layers, hindering the effective training of the model. The weight updates become extremely tiny, or even exponentially small, which can significantly prolong the training time and, in the worst case, halt the training process altogether.
Why Does the Problem Occur?
During backpropagation, as the gradients propagate back through the layers of the network, they can decrease significantly. This means that as they move from the output layer back toward the input layer, the gradients become progressively smaller. As a result, the weights of the earlier layers, which receive these small gradients, are updated little or not at all at each iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and hyperbolic tangent (tanh) activation functions, because their derivatives fall within the ranges 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small, so the updated weights closely resemble the original ones.
The sigmoid and tanh functions squash their inputs into the ranges [0, 1] and [-1, 1], so they saturate at 0 or 1 for sigmoid and at -1 or 1 for tanh. In these saturated regions, especially when inputs are very small or very large, the derivatives are very close to zero. This may not be a major concern in shallow networks with a few layers, but it becomes a pronounced issue in deep networks: when the inputs fall in saturated regions, the gradients approach zero, resulting in little update to the weights of the earlier layers. As more layers are added, these small gradients multiply across layers and decay significantly, so the first layers learn very slowly, hindering overall model performance and potentially leading to convergence failure.
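A tiny numeric illustration of this effect, assuming nothing beyond the standard sigmoid derivative: multiplying derivatives that are at most 0.25 over many layers or time steps drives the product toward zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 0.0                                   # pre-activation where the derivative is largest
d = sigmoid(z) * (1.0 - sigmoid(z))       # sigmoid'(0) = 0.25, its maximum value

for n_layers in (5, 10, 30):
    print(n_layers, d ** n_layers)        # 0.25**30 is about 8.7e-19: the gradient vanishes
```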

How can we identify?
Identifying the vanishing gradient problem typically involves monitoring the
training dynamics of a deep neural network.
 One key indicator is observing model weights converging to 0 or stagnation
in the improvement of the model's performance metrics over training epochs.
 During training, if the loss function fails to decrease significantly, or if there
is erratic behavior in the learning curves, it suggests that the gradients may be
vanishing.
 Additionally, examining the gradients themselves during backpropagation can
provide insights. Visualization techniques, such as gradient histograms or
norms, can aid in assessing the distribution of gradients throughout the
network.
How can we solve the issue?
 Batch Normalization : Batch normalization normalizes the inputs of each
layer, reducing internal covariate shift. This can help stabilize and accelerate
the training process, allowing for more consistent gradient flow.
 Activation function: An activation function like the Rectified Linear Unit (ReLU) can be used. With ReLU, the gradient is 0 for negative or zero input and 1 for positive input, which helps alleviate the vanishing gradient issue. In other words, ReLU maps negative input values to 0 and passes positive input values through unchanged, so its gradient does not saturate for positive inputs.
 Skip Connections and Residual Networks (ResNets): Skip connections, as
seen in ResNets, allow the gradient to bypass certain layers during
backpropagation. This facilitates the flow of information through the
network, preventing gradients from vanishing.
 Long Short-Term Memory Networks (LSTMs) and Gated Recurrent
Units (GRUs): In the context of recurrent neural networks (RNNs),
architectures like LSTMs and GRUs are designed to address the vanishing
gradient problem in sequences by incorporating gating mechanisms .
 Gradient Clipping: Gradient clipping involves imposing a threshold on the gradients during backpropagation. Limiting the magnitude of the gradients prevents them from exploding, which would otherwise destabilize learning; a sketch is shown below.
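A minimal sketch of gradient clipping with PyTorch's built-in clip_grad_norm_ utility, applied between the backward pass and the optimizer step; the model, loss and threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 15, 10)                       # toy batch: (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()                         # placeholder loss for illustration

opt.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 (prevents exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```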

5. What is LSTM – Long Short-Term Memory?

Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN) designed by Hochreiter & Schmidhuber. LSTMs can capture long-term dependencies in sequential data, making them ideal for tasks like language translation, speech recognition and time series forecasting.
Unlike traditional RNNs, which use a single hidden state passed through time, LSTMs introduce a memory cell that holds information over extended periods, addressing the challenge of learning long-term dependencies.
Problem with Long-Term Dependencies in RNN
Recurrent Neural Networks (RNNs) are designed to handle sequential data by
maintaining a hidden state that captures information from previous time steps.
However they often face challenges in learning long-term dependencies where
information from distant time steps becomes crucial for making accurate predictions
for current state. This problem is known as the vanishing gradient or exploding
gradient problem.
 Vanishing Gradient: When training a model over time, the gradients (which
help the model learn) can shrink as they pass through many steps. This makes
it hard for the model to learn long-term patterns since earlier information
becomes almost irrelevant.
 Exploding Gradient: Sometimes, gradients can grow too large, causing
instability. This makes it difficult for the model to learn properly, as the
updates to the model become erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture
long-term dependencies in sequential data.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information to add to, remove from and output from the memory cell.
 Input gate: Controls what information is added to the memory cell.
 Forget gate: Determines what information is removed from the memory cell.
 Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows
through the network which allows them to learn long-term dependencies. The
network has a hidden state which is like its short-term memory. This memory is
updated using the current input, the previous hidden state and the current state of the
memory cell.
Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and
different memory blocks called cells.

LSTM Model
Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the current time step) and h_{t-1} (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If for a particular cell-state element the output is close to 0, that piece of information is forgotten, and for an output close to 1 the information is retained for future use.
The equation for the forget gate is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where:
 W_f represents the weight matrix associated with the forget gate.
 [h_t-1, x_t] denotes the concatenation of the current input and the previous
hidden state.
 b_f is the bias with the forget gate.
 σ is the sigmoid activation function.

Forget Gate
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs h_{t-1} and x_t. Then, a candidate vector is created using the tanh function, which gives outputs from -1 to +1 based on h_{t-1} and x_t. Finally, the candidate vector and the regulated (sigmoid) values are multiplied to obtain the useful information. The equations for the input gate are:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
We multiply the previous cell state by f_t, discarding the information we previously chose to forget. Next, we add i_t ⊙ Ĉ_t, the candidate values scaled by how much we chose to update each state value:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t
where
 ⊙ denotes element-wise multiplication
 tanh is the tanh activation function

Input Gate

Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be output using the inputs h_{t-1} and x_t. Finally, the tanh vector and the regulated (sigmoid) values are multiplied and sent as the output and as the hidden-state input to the next cell. The equation for the output gate is:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

Output Gate
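The three gate equations above can be written out directly for a single LSTM step, as in the minimal sketch below; the weight shapes and names are illustrative assumptions.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = torch.cat([h_prev, x_t], dim=-1)            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)            # forget gate
    i_t = torch.sigmoid(z @ W_i.T + b_i)            # input gate
    c_hat = torch.tanh(z @ W_c.T + b_c)             # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                # new cell state
    o_t = torch.sigmoid(z @ W_o.T + b_o)            # output gate
    h_t = o_t * torch.tanh(c_t)                     # new hidden state
    return h_t, c_t

hidden, inp = 8, 4
x_t = torch.randn(1, inp)
h_prev, c_prev = torch.zeros(1, hidden), torch.zeros(1, hidden)
Ws = [torch.randn(hidden, hidden + inp) for _ in range(4)]
bs = [torch.zeros(hidden) for _ in range(4)]
h_t, c_t = lstm_step(x_t, h_prev, c_prev, *Ws, *bs)
```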
Bidirectional LSTM Model
Bidirectional LSTM (Bi LSTM/ BLSTM) is a variation of normal LSTM which
processes sequential data in both forward and backward directions. This allows Bi
LSTM to learn longer-range dependencies in sequential data than traditional LSTMs
which can only process sequential data in one direction.
 Bi LSTMs are made up of two LSTM networks one that processes the input
sequence in the forward direction and one that processes the input sequence
in the backward direction.
 The outputs of the two LSTM networks are then combined to produce the
final output.
LSTM models including Bi LSTMs have demonstrated state-of-the-art performance
across various tasks such as machine translation, speech recognition and text
summarization.
LSTM networks can be stacked to form deeper models allowing them to learn more
complex patterns in data. Each layer in the stack captures different levels of
information and time-based relationships in the input.
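A minimal usage sketch of a stacked bidirectional LSTM in PyTorch; the sizes are illustrative assumptions. Note that the output feature dimension doubles because the forward and backward outputs are concatenated.

```python
import torch
import torch.nn as nn

bi_lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
                  bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 32)               # (batch, time steps, features)
outputs, (h_n, c_n) = bi_lstm(x)
print(outputs.shape)                      # (8, 20, 128): forward + backward concatenated
print(h_n.shape)                          # (4, 8, 64): num_layers * 2 directions
```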
Applications of LSTM
Some of the famous applications of LSTM includes:
 Language Modeling: Used in tasks like language modeling, machine
translation and text summarization. These networks learn the dependencies
between words in a sentence to generate coherent and grammatically correct
sentences.
 Speech Recognition: Used in transcribing speech to text and recognizing
spoken commands. By learning speech patterns they can match spoken words
to corresponding text.
 Time Series Forecasting: Used for predicting stock prices, weather and
energy consumption. They learn patterns in time series data to predict future
events.
 Anomaly Detection: Used for detecting fraud or network intrusions. These
networks can identify patterns in data that deviate drastically and flag them as
potential anomalies.
 Recommender Systems: In recommendation tasks like suggesting movies,
music and books. They learn user behavior patterns to provide personalized
suggestions.
 Video Analysis: Applied in tasks such as object detection, activity
recognition and action classification. When combined with Convolutional
Neural Networks (CNNs) they help analyze video data and extract useful
information.
GRU:
In machine learning, Recurrent Neural Networks (RNNs) are essential for tasks involving sequential data such as text, speech and time-series analysis. While traditional RNNs struggle with capturing long-term dependencies due to the vanishing gradient problem, architectures like Long Short-Term Memory (LSTM) networks were developed to overcome this limitation.
However, LSTMs have a fairly complex structure with a higher computational cost. To address this, Gated Recurrent Units (GRUs) were introduced; they simplify the LSTM architecture by merging some of its gating mechanisms, offering a more efficient solution for many sequential tasks without sacrificing performance.
Getting Started with Gated Recurrent Units (GRU)
Gated Recurrent Units (GRUs) are a type of RNN introduced by Cho et al. in
2014. The core idea behind GRUs is to use gating mechanisms to selectively
update the hidden state at each time step allowing them to remember important
information while discarding irrelevant details. GRUs aim to simplify the LSTM
architecture by merging some of its components and focusing on just two main
gates: the update gate and the reset gate.

Core Structure of GRUs

The GRU consists of two main gates:
1. Update Gate (z_t): This gate decides how much information from the previous hidden state should be retained for the next time step.
2. Reset Gate (r_t): This gate determines how much of the past hidden state should be forgotten.
These gates allow the GRU to control the flow of information in a more efficient manner compared to traditional RNNs, which rely solely on the hidden state.
Equations for GRU Operations
The internal workings of a GRU can be described using the following equations:
1. Reset gate:
r_t = σ(W_r · [h_{t-1}, x_t])
The reset gate determines how much of the previous hidden state h_{t-1} should be forgotten.
2. Update gate:
z_t = σ(W_z · [h_{t-1}, x_t])
The update gate controls how much of the new information x_t should be used to update the hidden state.
3. Candidate hidden state:
h'_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t])
This is the potential new hidden state, calculated from the current input and the (reset-scaled) previous hidden state.
4. Hidden state:
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h'_t
The final hidden state is an interpolation between the previous hidden state and the candidate, weighted by the update gate.
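A minimal sketch of a single GRU update implementing the four equations above with plain tensor operations; weight names and sizes are illustrative assumptions and bias terms are omitted for brevity.

```python
import torch

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    z_in = torch.cat([h_prev, x_t], dim=-1)             # [h_{t-1}, x_t]
    r_t = torch.sigmoid(z_in @ W_r.T)                   # reset gate
    z_t = torch.sigmoid(z_in @ W_z.T)                   # update gate
    cand = torch.tanh(torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h.T)  # candidate h'_t
    return (1 - z_t) * h_prev + z_t * cand              # new hidden state h_t

hidden, inp = 8, 4
x_t, h_prev = torch.randn(1, inp), torch.zeros(1, hidden)
W_r, W_z, W_h = (torch.randn(hidden, hidden + inp) for _ in range(3))
h_t = gru_step(x_t, h_prev, W_r, W_z, W_h)
```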

How GRUs Solve the Vanishing Gradient Problem
Like LSTMs, GRUs were designed to address the vanishing gradient
problem which is common in traditional RNNs. GRUs help mitigate this issue by
using gates that regulate the flow of gradients during training ensuring that
important information is preserved and that gradients do not shrink excessively
over time. By using these gates, GRUs maintain a balance between remembering
important past information and learning new, relevant data.
