DL Unit-3 Question Bank
Unit – 3
Question Bank
1. Which of the following RNN patterns outputs a single prediction after reading
an entire sequence?
A) Encoder
B) Acceptor
C) Transducer
D) Decoder
Answer: B) Acceptor
2. In a sequence-to-sequence model, what is the main role of the encoder?
A) Generate predictions
B) Classify input tokens
C) Encode input into a context vector
D) Store gradients
Answer: C) Encode input into a context vector
3. The transducer RNN pattern is best suited for:
A) Single output from sequence
B) Labeling sequences
C) Fixed-length inputs
D) Image classification
Answer: B) Labeling sequences
4. What is used to compute gradients in RNNs?
A) Stochastic Gradient Descent
B) Adam Optimizer
C) Backpropagation Through Time (BPTT)
D) RMSProp
Answer: C) Backpropagation Through Time (BPTT)
5. Which of the following is a problem when training deep RNNs?
A) Overfitting
B) Gradient Explosion
C) Gradient Vanishing
D) Fast convergence
Answer: C) Gradient Vanishing
6. Bidirectional RNNs are useful because they:
A) Reduce computation time
B) Use GPU acceleration
C) Consider both past and future context
D) Skip hidden states
Answer: C) Consider both past and future context
7. In sequence-to-sequence models, the decoder:
A) Reads the input directly
B) Produces an intermediate representation
C) Outputs a new sequence
D) Stores attention weights
Answer: C) Outputs a new sequence
8. What is a major benefit of Deep Recurrent Networks?
A) Simpler computations
B) Faster training
C) Better hierarchical feature learning
D) No need for gradient updates
Answer: C) Better hierarchical feature learning
9. Recursive Neural Networks are mainly used for:
A) Time series prediction
B) Image classification
C) Hierarchical data structures
D) Sequence labeling
Answer: C) Hierarchical data structures
10. What causes the vanishing gradient problem in RNNs?
A) Overfitting
B) Large weights
C) Long-term dependencies
D) Lack of training data
Answer: C) Long-term dependencies
11. What do leaky units in RNNs help with?
A) Increase computation
B) Avoid memory usage
C) Stabilize gradient flow
D) Block input noise
Answer: C) Stabilize gradient flow
12. Skip connections help in RNNs by:
A) Increasing memory usage
B) Speeding up inference
C) Improving gradient flow
D) Reducing dropout
Answer: C) Improving gradient flow
13. Dropout is mainly used in RNNs for:
A) Speeding up training
B) Preventing overfitting
C) Increasing depth
D) Enhancing weights
Answer: B) Preventing overfitting
14. Which of the following is a gated RNN architecture?
A) CNN
B) LSTM
C) MLP
D) Boltzmann Machine
Answer: B) LSTM
15. In an LSTM cell, the forget gate is responsible for:
A) Selecting the output
B) Forgetting part of the cell state
C) Adding new information
D) Copying inputs
Answer: B) Forgetting part of the cell state
16. What does the input gate in an LSTM do?
A) Forgets hidden states
B) Stores long-term memory
C) Adds new input to cell state
D) Filters outputs
Answer: C) Adds new input to cell state
17. What does the output gate control in an LSTM?
A) Gradient flow
B) Final output from the cell
C) Dropout rate
D) Learning rate
Answer: B) Final output from the cell
18. Why are LSTMs preferred over simple RNNs?
A) More parameters
B) Less computation
C) Solve vanishing gradient problem
D) Shorter training time
Answer: C) Solve vanishing gradient problem
19. A key feature of Bidirectional RNNs is:
A) Two output vectors
B) Backward-only computation
C) Parallel sequence modeling
D) Processing sequences in both directions
Answer: D) Processing sequences in both directions
20. In a sequence-to-sequence model, which part handles the output generation?
A) Encoder
B) Input Gate
C) Decoder
D) Memory Unit
Answer: C) Decoder
2-Marks Questions:
1. What is an Acceptor in RNN design patterns?
An Acceptor processes an input sequence and outputs a single value, typically
used in classification tasks like sentiment analysis. It summarizes the entire input
into one decision output. It is generally followed by a dense layer with softmax
or sigmoid activation for final classification. It ignores the intermediate states
and focuses on the final hidden state for prediction.
2. What does an Encoder do in RNNs?
An Encoder converts a sequence into a fixed-length vector that captures the
essential information. This representation is then used by another network
(usually a decoder) for further tasks. Encoders are vital in sequence-to-sequence
learning. They help handle variable-length input sequences and compress them
into meaningful context.
3. Define Transducer in the context of RNNs.
A Transducer transforms an input sequence into an output sequence of the same
or different length. It's commonly used in applications like speech recognition
and translation. It combines encoding and decoding steps into a single model.
Transducers can be implemented using attention mechanisms for dynamic
alignment.
5. What are the main issues in gradient computation in RNNs?
Vanishing gradients make it hard to learn long-term dependencies; exploding
gradients can cause numerical instability. Proper initialization, gradient clipping,
and using LSTM/GRU help address these problems. Choosing suitable
activation functions like ReLU also helps.
6. What does sequence modeling conditioned on contexts mean?
It involves incorporating external information (like speaker, time, or
environment) into the model. This helps generate more accurate and relevant
predictions. Context can be added as an additional input at each time step or as
part of the initial state. It enables contextual adaptation of predictions.
7. What is a Bidirectional RNN?
A Bidirectional RNN has two RNNs: one reads the input forward, and the other
reads it backward. Their outputs are combined to improve context
understanding. It is useful when the entire sequence is available at once. It
improves understanding of ambiguous words based on future context.
8. What is the advantage of Bidirectional RNNs?
They provide a richer context by using information from both past and future
words. This is especially useful in tasks like named entity recognition and
POS tagging. However, they are not ideal for real-time applications. They
increase computational cost but provide better accuracy.
9. Explain the Sequence-to-Sequence model in RNNs.
It uses an encoder to represent the input as a context vector and a decoder to
produce the output sequence. This design is vital for variable-length output
tasks. Attention mechanisms are often added for better performance. Seq2Seq
is commonly used in translation, summarization, and question answering.
10. Where are Seq2Seq models used?
Used in neural machine translation, speech-to-text systems, and
summarization. Attention mechanisms are often added to improve
performance. They are also employed in question answering and dialogue
generation. These models handle input-output sequences of different lengths
effectively.
11. What is a Deep Recurrent Network?
A Deep RNN has multiple RNN layers stacked on top of each other, allowing
hierarchical feature extraction across different time scales. Deeper models
can capture more abstract features but are harder to train. They enhance
model expressiveness and abstraction power.
12. Why use Deep Recurrent Networks?
They offer improved representation capacity and can model complex patterns
more effectively. However, they require careful training due to potential
vanishing gradients. Techniques like skip connections and batch
normalization help in training. They help in learning both low-level and high-
level temporal features.
13. What are Recursive Neural Networks?
These networks process inputs with hierarchical or tree-like structures, useful
for syntactic parsing and image scene understanding. Each node in the tree is
computed based on its children. They are suitable for modeling structures
with nested or grammatical rules.
14. How do Recursive Neural Networks differ from RNNs?
While RNNs process sequences linearly, Recursive Neural Networks
combine inputs based on a tree structure, handling hierarchical relationships
better. They are well-suited for natural language with parse tree structures.
Recursive models handle compositional semantics efficiently.
15. What is the challenge of long-term dependencies in RNNs?
RNNs struggle with remembering distant information due to vanishing
gradients. This limits their ability to connect earlier and later parts of the
sequence. LSTMs and GRUs are designed to overcome this limitation. Long-
term dependencies are essential in language understanding.
16. How do Leaky Units help RNNs?
Leaky units allow some information to persist over time, addressing the
vanishing gradient problem. They use a fixed decay rate that mixes old and
new hidden states. This helps capture longer dependencies. Leaky integration
allows gradual forgetting instead of abrupt reset.
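A minimal NumPy sketch of this leaky-integration idea (the function name, the decay rate alpha and the toy sequence are illustrative, not taken from any library):

import numpy as np

def leaky_update(h_prev, h_candidate, alpha=0.9):
    # Leaky integration: mix the old hidden state with the new candidate.
    # alpha close to 1 keeps more of the past (slow forgetting);
    # alpha close to 0 follows the new input more quickly.
    return alpha * h_prev + (1.0 - alpha) * h_candidate

h = np.zeros(4)                       # initial hidden state
for x in np.random.randn(10, 4):      # a toy sequence of 10 input vectors
    candidate = np.tanh(x)            # stand-in for the usual RNN update
    h = leaky_update(h, candidate)    # information persists across steps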
17. What are Skip Connections in RNNs?
Skip connections link non-consecutive layers, improving gradient flow and
training speed. They reduce vanishing gradients and enable deeper
architectures to be trained efficiently. They allow models to reuse earlier
learned features across layers.
18. How does Dropout help RNNs?
Dropout randomly disables neurons during training to prevent overfitting. It
is carefully applied to avoid disrupting temporal dependencies. Variational
dropout can be used to apply the same dropout mask at each time step.
Dropout improves model generalization and reduces over-reliance on specific
neurons.
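As an illustrative sketch, PyTorch's nn.LSTM exposes a dropout argument that is applied between stacked layers rather than across time steps; the layer sizes below are arbitrary:

import torch
import torch.nn as nn

# Dropout is applied to the outputs of each stacked layer except the last,
# so the recurrence within a layer is not disrupted.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
              dropout=0.3, batch_first=True)

x = torch.randn(8, 20, 32)            # (batch, time steps, features)
out, (h_n, c_n) = rnn(x)              # out has shape (8, 20, 64)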
19. What is the role of gates in LSTM architecture?
Gates (input, forget, output) regulate information flow in LSTMs,
determining what to store, update, or discard, thus enhancing memory
control. The forget gate is key in controlling memory retention. Gates enable
selective memory manipulation, aiding learning of long-range patterns.
20. Why are LSTMs effective for long-term dependencies?
LSTMs use cell states and gating mechanisms to retain relevant information
over long sequences, mitigating vanishing gradient issues common in vanilla
RNNs. Their design allows selective memory updates and forgetting. They
can maintain information over hundreds of time steps.
10-Marks Questions:
1. Sequence-to-Sequence Model in RNN
In deep learning, many complex problems can be solved by constructing better neural network architectures. The RNN (Recurrent Neural Network) and its variants are very useful in sequence-to-sequence learning, and the RNN variant LSTM (Long Short-Term Memory) is the most widely used cell in seq2seq learning tasks.
The encoder-decoder architecture for recurrent neural networks is the standard neural machine translation method; it rivals and in some cases outperforms classical statistical machine translation methods.
This architecture was pioneered in 2014 and has since been adopted as the core technology inside systems such as Google's machine translation service.
Encoder-Decoder Model
There are three main blocks in the encoder-decoder model:
Encoder
Hidden Vector
Decoder
The Encoder will convert the input sequence into a single-dimensional vector (hidden
vector). The decoder will convert the hidden vector into the output sequence.
Encoder-Decoder models are jointly trained to maximize the conditional probabilities of
the target sequence given the input sequence.
How the Sequence to Sequence Model works?
In order to fully understand the model's underlying logic, we will walk through the encoder and the decoder step by step with an example.
Example: Encoder
At the first timestep t1, the previous hidden state h0 is taken as zero or randomly initialized. The first RNN cell then updates the current hidden state using the first input and h0. Each cell outputs two things: the updated hidden state and an output for that step. The outputs at each step are discarded and only the hidden states are propagated to the next step.
The hidden states are computed as h_t = tanh(W_hh·h_{t-1} + W_xh·x_t).
At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is updated based on both. The same process repeats for all four timesteps in the example taken.
The encoder is a stack of several recurrent units (LSTM or GRU cells for better performance), where each accepts a single element of the input sequence, collects information for that element, and propagates it forward.
In the question-answering problem, the input sequence is a collection of all words
from the question. Each word is represented as x_i where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network: we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.
Encoder Vector
This is the final hidden state produced from the encoder part of the model. It is
calculated using the formula above.
This vector aims to encapsulate the information for all input elements in order to
help the decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
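A minimal NumPy sketch of the encoder loop described above (the weight names U, W, the tanh update and all sizes are illustrative, following the formulas in this section):

import numpy as np

def encode(inputs, U, W, b):
    # Run the encoder over a sequence and return only the final hidden state;
    # the per-step outputs are discarded, as described above.
    h = np.zeros(W.shape[0])                  # h_0 initialised to zeros
    for x_t in inputs:
        h = np.tanh(U @ x_t + W @ h + b)      # h_t = tanh(U·x_t + W·h_{t-1} + b)
    return h                                  # the context (encoder) vector

rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
context = encode(rng.normal(size=(5, 8)), U, W, b)   # 5 timesteps, 8 features each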
Decoder
The Decoder generates the output sequence by predicting the next output Yt given the hidden state ht.
The initial input for the decoder is the final hidden vector obtained at the end of the encoder.
Each decoder step has three inputs: the hidden vector from the previous step h_{t-1}, the previous output y_{t-1}, and the original context vector h.
At the first step, the encoder's context vector, the special START symbol and an empty hidden state h_{t-1} are given as input; the outputs obtained are y1 and the updated hidden state h1 (the information already emitted as output is effectively removed from the hidden vector).
The second step takes the updated hidden state h1, the previous output y1 and the original context vector h as inputs, and produces the hidden state h2 and output y2.
The output produced at each timestep of the decoder is part of the actual output sequence. The model keeps predicting outputs until the END symbol occurs.
The decoder is a stack of several recurrent units, where each predicts an output y_t at time step t.
Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
Example: Decoder.
Any hidden state h_t is computed as h_t = tanh(W_hh·h_{t-1}); that is, we just use the previous hidden state to compute the next one.
Output Layer
We use the Softmax activation function at the output layer.
It is used to produce a probability distribution from a vector of values, with the target class having the highest probability.
The output y_t at time step t is computed as y_t = softmax(W_S·h_t).
We calculate the outputs using the hidden state at the current time step together with the
respective weight W(S). Softmax is used to create a probability vector that will help us
determine the final output (e.g. word in the question-answering problem).
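A matching sketch of one decoder step with the softmax output layer (W_hh, W_s, the vocabulary size and the greedy loop are illustrative placeholders):

import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h_prev, W_hh, W_s):
    # One decoder step: update the hidden state, then emit a distribution.
    h_t = np.tanh(W_hh @ h_prev)          # next hidden state from the previous one
    y_t = softmax(W_s @ h_t)              # probability distribution over the vocabulary
    return h_t, y_t

rng = np.random.default_rng(1)
W_hh, W_s = rng.normal(size=(16, 16)), rng.normal(size=(100, 16))  # vocabulary of 100 tokens
h = rng.normal(size=16)                   # in practice: the encoder's context vector
tokens = []
for _ in range(4):                        # unroll a few steps (until END in practice)
    h, y = decoder_step(h, W_hh, W_s)
    tokens.append(int(np.argmax(y)))      # greedy choice of the next token id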
The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the inputs and outputs do not need to be aligned one-to-one and their lengths can differ. This opens a whole new range of problems that can be solved with this architecture.
Applications
It has many applications, such as:
Google's Machine Translation
Question answering chatbots
Speech recognition
Time series applications, etc.
Key Components of RNNs
1. Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a
hidden state that maintains information about previous inputs in a sequence. Recurrent
units can “remember” information from prior steps by feeding back their hidden state,
allowing them to capture dependencies across time.
2. RNN Unfolding
RNN unfolding, or unrolling, is the process of expanding the recurrent structure over time steps. During unfolding, each step of the sequence is represented as a separate layer in a series, illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data.
RNN Unfolding
Recurrent Neural Network Architecture
RNNs share similarities in input and output structures with other deep learning
architectures but differ significantly in how information flows from input to output.
Unlike traditional deep neural networks, where each dense layer has distinct weight
matrices, RNNs use shared weights across time steps, allowing them to remember
information over sequences.
In RNNs, the hidden state h_i is calculated for every input x_i to retain sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h_t = σ(U·x_t + W·h_{t-1} + B)
Here, h_t represents the current hidden state, U and W are weight matrices, and B is the bias.
2. Output Calculation:
Y = O(V·h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the weights and bias.
3. Overall Function:
Y = f(X, h, W, U, V, B, C)
This function defines the entire RNN operation, where the state matrix S holds each element s_i representing the network's state at each time step i.
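The same computation written as a small NumPy forward pass; a sketch that assumes sigmoid for σ and leaves the output activation O to the caller:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, U, W, V, B, C):
    # Unrolled RNN: h_t = sigmoid(U·x_t + W·h_{t-1} + B), y_t = V·h_t + C.
    h = np.zeros(W.shape[0])
    states, outputs = [], []
    for x in xs:
        h = sigmoid(U @ x + W @ h + B)    # the same weights are shared at every step
        states.append(h)
        outputs.append(V @ h + C)         # apply the output activation O as needed
    return states, outputs

rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
B, C = np.zeros(8), np.zeros(3)
states, outputs = rnn_forward(rng.normal(size=(6, 4)), U, W, V, B, C)  # 6 steps, 4 features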
The computation in most recurrent neural networks can be decomposed into three blocks
of parameters and associated transformations:
1. From the input to the hidden state
2. From the previous hidden state to the next hidden state
3. From the hidden state to the output
In the standard RNN architecture, each of these three blocks is associated with a single weight matrix, i.e., a shallow transformation. By a shallow transformation we mean a transformation that would be represented by a single layer within a deep MLP. A recurrent network can be made deeper in several ways:
1. The hidden recurrent state can be broken down into groups organized hierarchically.
2. The path-lengthening effect of deeper computation in these blocks can be mitigated by introducing skip connections.
Recurrent states broken down into groups
Recurrent Neural Architecture
How does RNN work?
At each time step, an RNN processes the current input with a fixed activation function. The recurrent units have an internal hidden state that acts as memory, retaining information from previous time steps. This memory allows the network to store past knowledge and adapt based on new inputs.
Updating the Hidden State in RNNs
The current hidden state h_t depends on the previous state h_{t-1} and the current input x_t, and is calculated using the following relations:
1. State Update:
h_t = f(h_{t-1}, x_t)
where:
h_t is the current state
h_{t-1} is the previous state
x_t is the input at the current time step
2. Activation Function Application:
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
Here, W_hh is the weight matrix for the recurrent neuron, and W_xh is the weight matrix for the input neuron.
3. Output Calculation:
y_t = W_hy·h_t
where y_t is the output and W_hy is the weight matrix at the output layer.
These parameters are updated using backpropagation. However, since RNNs work on sequential data, we use a modified form of backpropagation known as backpropagation through time.
Backpropagation Through Time (BPTT) in RNNs
Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state h3, and each hidden state relies on the preceding ones, forming a sequential dependency chain:
h3 depends on h2, h2 depends on h1, ..., h1 depends on h0.
Backpropagation Through Time (BPTT) In RNN
In BPTT, gradients are backpropagated through each time step. This is essential for
updating network parameters based on temporal dependencies.
1. Simplified Gradient Calculation:
∂L(θ)/∂W = (∂L(θ)/∂h3) · (∂h3/∂W)
2. Handling Dependencies in Layers:
Each hidden state is updated based on its dependencies:
h3 = σ(W·h2 + b)
The gradient is then calculated for each state, considering dependencies from
previous hidden states.
3. Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into an explicit (direct) part and an implicit part that sums up the indirect paths from each hidden state to the weights:
∂h3/∂W = ∂⁺h3/∂W + (∂h3/∂h2) · (∂⁺h2/∂W)
where the superscript + denotes the explicit (direct) contribution.
4. Final Gradient Expression:
The final derivative of the loss function with respect to the weight matrix W
is computed:
∂L(θ)/∂W = (∂L(θ)/∂h3) · Σ_{k=1..3} (∂h3/∂hk) · (∂hk/∂W)
This iterative process is the essence of backpropagation through time.
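A small NumPy sketch of this accumulation for a 3-step vanilla RNN with h_t = tanh(W·h_{t-1} + U·x_t) and a simple squared-error loss on the final state (sizes and the loss are illustrative):

import numpy as np

rng = np.random.default_rng(0)
H, D, T = 4, 3, 3
W, U = rng.normal(size=(H, H)) * 0.5, rng.normal(size=(H, D)) * 0.5
xs, target = rng.normal(size=(T, D)), rng.normal(size=H)

# Forward pass, storing hidden states and pre-activations for the backward pass.
hs, pre = [np.zeros(H)], []
for x in xs:
    a = W @ hs[-1] + U @ x
    pre.append(a)
    hs.append(np.tanh(a))

dL_dh = hs[-1] - target                 # dL/dh_T for L = 0.5 * ||h_T - target||^2

# Backpropagation through time: walk back over the steps, accumulating dL/dW.
dW = np.zeros_like(W)
delta = dL_dh
for t in reversed(range(T)):
    delta_a = delta * (1.0 - np.tanh(pre[t]) ** 2)   # through the tanh nonlinearity
    dW += np.outer(delta_a, hs[t])                   # explicit contribution at step t
    delta = W.T @ delta_a                            # carry the gradient back to h_{t-1}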
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the
network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input
and a single output. It is used for straightforward classification tasks such as binary
classification where no sequential data is involved.
2. One-to-Many RNN
A One-to-Many RNN takes a single input and produces a sequence of outputs. A typical example is image captioning, where a single image is used to generate a sequence of words.
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is needed
to make one prediction. In sentiment analysis, the model receives a sequence of words (such as a sentence) and produces a single output like positive, negative or neutral.
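A hedged PyTorch sketch of this many-to-one pattern for sentiment classification (the vocabulary size, layer sizes and the three classes are arbitrary choices for illustration):

import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    # Many-to-one: read the whole sequence, classify from the last hidden state.
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)
        _, h_n = self.rnn(embedded)                # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])            # logits for positive/negative/neutral

model = SentimentRNN()
logits = model(torch.randint(0, 5000, (2, 10)))    # 2 sentences of 10 token ids each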
Graphical models of RNNs without/with inputs
a(t) = b + W·h(t-1) + U·x(t)
h(t) = tanh(a(t))
o(t) = c + V·h(t)
• RNNs allow the graphical model view to be extended to represent not only the joint distribution over the y variables but also a conditional distribution over y given x.
• The conditional probability distributions (CPDs) of the model depend on the RNN design pattern.
Extending RNNs to represent the conditional distribution P(y | x)
• Instead of taking a sequence x(t), t = 1, ..., τ, as input, we can take a single vector x as input.
• When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence.
The interaction between the input x and the hidden unit vector h(t) is parameterized by a newly introduced weight matrix R that was absent from the model of only the sequence of y values.
Each element y(t) of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step).
We can also extend the RNN to receive a whole sequence of vectors x(t) as input.
• Such an RNN corresponds to a conditional distribution over the y sequence that makes a conditional independence assumption: the distribution factorizes as ∏ P(y(t) | x(1), ..., x(t)).
• By adding connections from the output at time t to the hidden unit at time t+1, this assumption is removed and the model can then represent arbitrary probability distributions over the y sequence.
4. Gradient Computation
In deep learning, the optimization process plays a crucial role in training neural networks. Gradient descent, a fundamental optimization algorithm, can encounter two common issues: vanishing gradients and exploding gradients. This section looks at what they are, why they occur, and how to mitigate them.
What is Vanishing Gradient?
The vanishing gradient problem is a challenge that emerges during
backpropagation when the derivatives or slopes of the activation functions
become progressively smaller as we move backward through the layers of a
neural network. This phenomenon is particularly prominent in deep networks with
many layers, hindering the effective training of the model. When the weight updates become extremely tiny, or even exponentially small, training time is significantly prolonged and, in the worst-case scenario, training can halt altogether.
Why the Problem Occurs?
During backpropagation, as the gradients propagate back through the layers of the network, they shrink significantly. By the time they travel from the output layer back to the input layer, the gradients have become very small. As a result, the weights in the early layers, which receive these small gradients, are updated little or not at all at each iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and hyperbolic tangent (tanh) activation functions because their derivatives fall within the ranges 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small, causing the updated weights to closely resemble the original ones. This persistence of small updates contributes to the vanishing gradient issue.
The sigmoid and tanh functions squash their inputs to the ranges [0, 1] and [-1, 1], so they saturate at 0 or 1 for sigmoid and at -1 or 1 for tanh. In these saturated regions, especially when the inputs are very small or very large, the derivatives are very close to zero. While this may not be a major concern in shallow networks with a few layers, it is a more pronounced issue in deep networks: when the inputs fall in saturated regions, the gradients approach zero, resulting in little update to the weights of the earlier layers. As more layers are added, these small gradients multiply across layers and decay significantly, so the first layers learn very slowly; this hinders overall model performance and can lead to convergence failure.
How can we identify it?
Identifying the vanishing gradient problem typically involves monitoring the
training dynamics of a deep neural network.
One key indicator is observing model weights converging to 0 or stagnation
in the improvement of the model's performance metrics over training epochs.
During training, if the loss function fails to decrease significantly, or if there
is erratic behavior in the learning curves, it suggests that the gradients may be
vanishing.
Additionally, examining the gradients themselves during backpropagation can
provide insights. Visualization techniques, such as gradient histograms or
norms, can aid in assessing the distribution of gradients throughout the
network.
How can we solve the issue?
Batch Normalization : Batch normalization normalizes the inputs of each
layer, reducing internal covariate shift. This can help stabilize and accelerate
the training process, allowing for more consistent gradient flow.
Activation function: Activation functions like the Rectified Linear Unit
(ReLU) can be used. With ReLU, the gradient is 0 for negative (and zero)
inputs and 1 for positive inputs, which helps alleviate the vanishing
gradient issue: negative inputs are mapped to 0, while positive inputs are
passed through unchanged.
Skip Connections and Residual Networks (ResNets): Skip connections, as
seen in ResNets, allow the gradient to bypass certain layers during
backpropagation. This facilitates the flow of information through the
network, preventing gradients from vanishing.
Long Short-Term Memory Networks (LSTMs) and Gated Recurrent
Units (GRUs): In the context of recurrent neural networks (RNNs),
architectures like LSTMs and GRUs are designed to address the vanishing
gradient problem in sequences by incorporating gating mechanisms.
Gradient Clipping: Gradient clipping involves imposing a threshold on the
gradients during backpropagation. Limiting the magnitude of the gradients
prevents them from exploding, which would otherwise destabilize learning;
a short sketch is shown below.
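As an illustration of the last point, a hedged PyTorch sketch of gradient clipping inside a single training step (the model, the toy data and the max_norm value are placeholders):

import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 25, 16)                 # toy batch: (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()                   # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale the gradients so their global norm does not exceed max_norm,
# preventing exploding gradients from destabilising the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()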
5. What is LSTM – Long Short Term Memory?
LSTM Model
Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the current time step) and h_{t-1} (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If, for a particular element of the cell state, the output is close to 0, that piece of information is forgotten; for an output close to 1, the information is retained for future use.
The equation for the forget gate is:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the current input and the previous
hidden state.
b_f is the bias term of the forget gate.
σ is the sigmoid activation function.
Forget Gate
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similarly to the forget gate) using the inputs h_{t-1} and x_t. Then, a candidate vector is created using the tanh function, which gives an output from -1 to +1 containing all the possible values from h_{t-1} and x_t. Finally, the values of the candidate vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
Ĉ_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
We multiply the previous cell state by f_t, discarding the information we had previously chosen to forget. Next, we add i_t ⊙ Ĉ_t. This represents the updated candidate values, scaled by how much we chose to update each state value:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t
where ⊙ denotes element-wise multiplication and tanh is the tanh activation function.
Input Gate
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs h_{t-1} and x_t. Finally, the values of the vector and the regulated values are multiplied and sent as the output of the current cell, which also serves as input (hidden state) to the next cell. The equations for the output gate are:
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
Output Gate
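The gate equations above can be collected into a single step function; a NumPy sketch assuming concatenated [h_{t-1}, x_t] inputs and externally supplied weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    # One LSTM step following the f_t, i_t, C_t and o_t equations above.
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    c_hat = np.tanh(Wc @ z + bc)               # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat           # new cell state
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    h_t = o_t * np.tanh(c_t)                   # new hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 8, 4
Ws = [rng.normal(size=(H, H + D)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):              # a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, *Ws, *bs)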
Bidirectional LSTM Model
Bidirectional LSTM (Bi LSTM/ BLSTM) is a variation of normal LSTM which
processes sequential data in both forward and backward directions. This allows Bi
LSTM to learn longer-range dependencies in sequential data than traditional LSTMs
which can only process sequential data in one direction.
Bi-LSTMs are made up of two LSTM networks: one that processes the input sequence in the forward direction and one that processes it in the backward direction.
The outputs of the two LSTM networks are then combined to produce the final output.
LSTM models including Bi LSTMs have demonstrated state-of-the-art performance
across various tasks such as machine translation, speech recognition and text
summarization.
LSTM networks can be stacked to form deeper models, allowing them to learn more complex patterns in data. Each layer in the stack captures different levels of information and time-based relationships in the input.
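In PyTorch, both stacking and bidirectionality are exposed as arguments of nn.LSTM; the sizes below are illustrative:

import torch
import torch.nn as nn

# Two stacked layers, each reading the sequence forward and backward.
bilstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
                 bidirectional=True, batch_first=True)

x = torch.randn(4, 15, 32)         # (batch, time steps, features)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                   # (4, 15, 128): forward and backward outputs concatenated
print(h_n.shape)                   # (4, 4, 64): num_layers * 2 directions, batch, hidden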
Applications of LSTM
Some of the famous applications of LSTM include:
Language Modeling: Used in tasks like language modeling, machine
translation and text summarization. These networks learn the dependencies
between words in a sentence to generate coherent and grammatically correct
sentences.
Speech Recognition: Used in transcribing speech to text and recognizing
spoken commands. By learning speech patterns they can match spoken words
to corresponding text.
Time Series Forecasting: Used for predicting stock prices, weather and
energy consumption. They learn patterns in time series data to predict future
events.
Anomaly Detection: Used for detecting fraud or network intrusions. These
networks can identify patterns in data that deviate drastically and flag them as
potential anomalies.
Recommender Systems: In recommendation tasks like suggesting movies,
music and books. They learn user behavior patterns to provide personalized
suggestions.
Video Analysis: Applied in tasks such as object detection, activity
recognition and action classification. When combined with Convolutional
Neural Networks (CNNs) they help analyze video data and extract useful
information.
GRU:
In machine learning, Recurrent Neural Networks (RNNs) are essential for tasks involving sequential data such as text, speech and time-series analysis. While traditional RNNs struggle to capture long-term dependencies due to the vanishing gradient problem, architectures like Long Short-Term Memory (LSTM) networks were developed to overcome this limitation.
However, LSTMs have a fairly complex structure with a higher computational cost. To overcome this, the Gated Recurrent Unit (GRU) was introduced; it simplifies the LSTM architecture by merging some of its gating mechanisms, offering a more efficient solution for many sequential tasks without sacrificing performance.
Getting Started with Gated Recurrent Units (GRU)
Gated Recurrent Units (GRUs) are a type of RNN introduced by Cho et al. in 2014. The core idea behind GRUs is to use gating mechanisms to selectively update the hidden state at each time step, allowing them to remember important information while discarding irrelevant details. GRUs aim to simplify the LSTM architecture by merging some of its components and focusing on just two main gates: the update gate and the reset gate.
Core Structure of GRUs
GRUs compute their gates from the previous hidden state h_{t-1} and the current input x_t:
1. Reset gate: controls how much of the previous hidden state is used when forming the candidate state.
r_t = σ(W_r·[h_{t-1}, x_t])
2. Update gate: controls how much of the new information x_t should be used to update the hidden state.
z_t = σ(W_z·[h_{t-1}, x_t])
3. Candidate hidden state: the potential new hidden state, calculated from the current input and the reset-scaled previous hidden state.
h'_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t])
4. Hidden state: the final hidden state, an interpolation between the previous hidden state and the candidate, controlled by the update gate.
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h'_t
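The same four equations as a NumPy step function (a sketch with the bias terms omitted, matching the formulas above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    # One GRU step: reset gate, update gate, candidate state, new hidden state.
    z_in = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    r_t = sigmoid(Wr @ z_in)                        # reset gate
    z_t = sigmoid(Wz @ z_in)                        # update gate
    cand_in = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    h_cand = np.tanh(Wh @ cand_in)                  # candidate hidden state h'_t
    return (1.0 - z_t) * h_prev + z_t * h_cand      # interpolate old and new

rng = np.random.default_rng(0)
H, D = 8, 4
Wr, Wz, Wh = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
h = np.zeros(H)
for x in rng.normal(size=(5, D)):                   # a toy sequence of 5 inputs
    h = gru_step(x, h, Wr, Wz, Wh)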
How GRUs Solve the Vanishing Gradient Problem
Like LSTMs, GRUs were designed to address the vanishing gradient
problem which is common in traditional RNNs. GRUs help mitigate this issue by
using gates that regulate the flow of gradients during training, ensuring that
important information is preserved and that gradients do not shrink excessively
over time. By using these gates, GRUs maintain a balance between remembering
important past information and learning new, relevant data.