
AD3501-DEEP LEARNING

UNIT-3

RECURRENT NEURAL NETWORKS

Er. R. Muthu Eshwaran, M.E., (Ph.D.),

AP/AI&DS

Ramco Institute of Technology

Rajapalayam
AD3501-DEEP LEARNING

UNIT III RECURRENT NEURAL NETWORKS

Unfolding Graphs – RNN Design Patterns: Acceptor – Encoder – Transducer; Gradient Computation – Sequence Modeling Conditioned on Contexts – Bidirectional RNN – Sequence to Sequence RNN – Deep Recurrent Networks – Recursive Neural Networks – Long Term Dependencies; Leaky Units: Skip connections and dropouts; Gated Architecture: LSTM.

RECURRENT NEURAL NETWORKS

Introduction to RNN

Traditional neural networks treat their inputs and outputs independently, which makes them inefficient when dealing with sequential data. Hence, a new kind of network, the Recurrent Neural Network (RNN), was introduced to store the results of previous steps in an internal memory. These results are fed back into the network, together with the current input, to predict the output of the layer. This allows RNNs to be used in applications like pattern detection, speech and voice recognition, natural language processing, and time series prediction.
Below is how we can convert a Feed-Forward Neural Network into a Recurrent Neural Network:

Fig: Simple Recurrent Neural Network


An RNN has hidden layers that act as memory locations, storing the outputs of a layer in a loop. Here, “x” is the input layer, “h” is the hidden layer (which acts as the memory), and “y” is the output layer. A, B, and C are the network parameters used to improve the output of the model. At any given time t, the current state is formed by combining the input x(t) with the state carried over from time t-1. The output at any given time is fed back into the network to improve subsequent outputs.
Why Recurrent Neural Networks?
RNNs were created because feed-forward neural networks have a few limitations:

• Cannot handle sequential data


• Considers only the current input
• Cannot memorize previous inputs
The solution to these issues is the RNN. An RNN can handle sequential data, accepting both the current input and previously received inputs, and it can memorize previous inputs thanks to its internal memory.

How Do Recurrent Neural Networks Work?


In Recurrent Neural networks, the information cycles through a loop to the middle hidden layer.

Fig: Working of Recurrent Neural Network

The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the
middle layer.

The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation function, weights, and biases. If the parameters of these hidden layers are not affected by the previous time step, i.e., the network has no memory, then we turn to a recurrent neural network.
The Recurrent Neural Network will standardize the different activation functions and weights and
biases so that each hidden layer has the same parameters. Then, instead of creating multiple hidden
layers, it will create one and loop over it as many times as required.
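To make this weight sharing concrete, here is a minimal NumPy sketch (the layer sizes, the tanh activation, and the names W_xh and W_hh are illustrative assumptions, not something specified above) in which one hidden layer is looped over every time step of a toy sequence:

import numpy as np

np.random.seed(0)
input_size, hidden_size, seq_len = 4, 8, 5

# One set of shared parameters, reused at every time step.
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (the loop)
b_h = np.zeros(hidden_size)

x_seq = np.random.randn(seq_len, input_size)  # a toy input sequence
h = np.zeros(hidden_size)                     # initial hidden state (the "memory")

for t in range(seq_len):
    # The same W_xh, W_hh and b_h are applied at every step.
    h = np.tanh(W_xh @ x_seq[t] + W_hh @ h + b_h)
    print("step", t, "hidden state norm:", round(float(np.linalg.norm(h)), 3))

The same hidden layer is thus reused at every time step, which is exactly the parameter sharing described above.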

Feed-Forward Neural Networks vs Recurrent Neural Networks


A feed-forward neural network allows information to flow only in the forward direction, from the
input nodes, through the hidden layers, and to the output nodes. There are no cycles or loops in the
network. Below is a simplified representation of a feed-forward neural network:

Fig: Feed-forward Neural Network

In a feed-forward neural network, decisions are based only on the current input. It does not memorize past data and has no notion of future inputs. Feed-forward neural networks are used in general
regression and classification problems.

Applications of Recurrent Neural Networks

Image Captioning: RNNs are used to caption an image by analysing the activities present.

Time Series Prediction: Any time series problem, like predicting the prices of stocks in a
particular month, can be solved using an RNN.

Natural Language Processing: Text mining and Sentiment analysis can be carried out using an
RNN for Natural Language Processing (NLP).

Machine Translation: Given an input in one language, RNNs can be used to translate the input
into different languages as output.

Advantages of Recurrent Neural Network


Recurrent Neural Networks (RNNs) have several advantages over other types of neural networks,
including:

Ability to Handle Variable-Length Sequences: RNNs are designed to handle input sequences of
variable length, which makes them well-suited for tasks such as speech recognition, natural
language processing, and time series analysis.

Memory of Past Inputs: RNNs have a memory of past inputs, which allows them to capture
information about the context of the input sequence. This makes them useful for tasks such as
language modelling, where the meaning of a word depends on the context in which it appears.

Parameter Sharing: RNNs share the same set of parameters across all time steps, which reduces the number of parameters that need to be learned and can lead to better generalization.

Non-Linear Mapping: RNNs use non-linear activation functions, which allow them to learn
complex, non-linear mappings between inputs and outputs.

Sequential Processing: RNNs process input sequences step by step, which lets them model the order of the data naturally (although, as noted under the disadvantages, this sequential nature also limits parallelism).

Flexibility: RNNs can be adapted to a wide range of tasks and input types, including text, speech,
and image sequences.
Improved Accuracy: RNNs have been shown to achieve state-of-the-art performance on a variety
of sequence modeling tasks, including language modeling, speech recognition, and machine
translation.

These advantages make RNNs a powerful tool for sequence modelling and analysis, and have led
to their widespread use in a variety of applications, including natural language processing, speech
recognition, and time series analysis.

Disadvantages of Recurrent Neural Network


Although Recurrent Neural Networks (RNNs) have several advantages, they also have some
disadvantages. Here are some of the main disadvantages of RNNs:

Vanishing and Exploding Gradients: RNNs can suffer from the problem of vanishing or
exploding gradients, which can make it difficult to train the network effectively. This occurs when
the gradients of the loss function with respect to the parameters become very small or very large
as they propagate through time.

Computational Complexity: RNNs can be computationally expensive to train, especially when dealing with long sequences. This is because the network has to process each input in sequence, which can be slow.

Difficulty in Capturing Long-Term Dependencies: Although RNNs are designed to capture information about past inputs, they can struggle to capture long-term dependencies in the input sequence. This is because the gradients can become very small as they propagate through time, which can cause the network to forget important information.

Lack of Parallelism: RNNs are inherently sequential, which makes it difficult to parallelize the
computation. This can limit the speed and scalability of the network.

Difficulty in Choosing the Right Architecture: There are many different variants of RNNs, each
with its own advantages and disadvantages. Choosing the right architecture for a given task can be
challenging, and may require extensive experimentation and tuning.

Difficulty in Interpreting the Output: The output of an RNN can be difficult to interpret, especially when dealing with complex inputs such as natural language or audio. This can make it difficult to understand how the network is making its predictions.

These disadvantages are important to consider when deciding whether to use an RNN for a given task.
However, many of these issues can be addressed through careful design and training of the network
and through techniques such as regularization and attention mechanisms.
The four commonly used types of Recurrent Neural Networks are:

1. One-to-One
The simplest type of RNN is One-to-One, which allows a single input and a single output. It has
fixed input and output sizes and acts as a traditional neural network. The One-to-One application
can be found in Image Classification.

One-to One
2. One-to-Many
One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a
fixed input size and gives a sequence of data outputs. Its applications can be found in Music
Generation and Image Captioning.

One-to-Many
3. Many-to-One
Many-to-One is used when a single output is required from multiple input units or a sequence of
them. It takes a sequence of inputs to display a fixed output. Sentiment Analysis is a common
example of this type of Recurrent Neural Network.

4. Many-to-Many
Many-to-Many are used to generate a sequence of output data from a sequence of input units.
This type of RNN is further divided into the following two subcategories:

1. Equal Unit Size: In this case, the number of input and output units is the same. A common application can be found in Named-Entity Recognition.
2. Unequal Unit Size: In this case, inputs and outputs have different numbers of units. Its
application can be found in Machine Translation.

Two Issues of Standard RNNs

1. Vanishing Gradient Problem


Recurrent Neural Networks enable us to model time-dependent and sequential data problems, such
as stock market prediction, machine translation, and text generation. However, we will find that RNNs are hard to train because of gradient problems.

RNNs suffer from the problem of vanishing gradients. The gradients carry information used in the
RNN, and when the gradient becomes too small, the parameter updates become insignificant. This
makes the learning of long data sequences difficult.
2. Exploding Gradient Problem

While training a neural network, if the slope tends to grow exponentially instead of decaying, this
is called an Exploding Gradient. This problem arises when large error gradients accumulate,
resulting in very large updates to the neural network model weights during the training
process. Long training time, poor performance, and bad accuracy are the major issues in gradient
problems.
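A small NumPy experiment (purely illustrative; the matrix sizes and weight scales are arbitrary choices) shows why both problems appear: back-propagating through many time steps repeatedly multiplies the gradient by the recurrent weight matrix, so its norm either shrinks towards zero or blows up depending on the weight scale.

import numpy as np

np.random.seed(0)
hidden_size, T = 16, 50
grad = np.ones(hidden_size)  # pretend gradient arriving at the last time step

for scale, label in [(0.5, "small weights (vanishing)"), (1.5, "large weights (exploding)")]:
    W_hh = np.random.randn(hidden_size, hidden_size) * scale / np.sqrt(hidden_size)
    g = grad.copy()
    for t in range(T):
        # Ignoring the activation derivative for simplicity, each step of
        # back-propagation through time multiplies the gradient by W_hh^T.
        g = W_hh.T @ g
    print(label, "- gradient norm after", T, "steps:", np.linalg.norm(g))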


Variant RNN Architectures


There are several variant RNN architectures that have been developed over the years to address
the limitations of the standard RNN architecture. Here are a few examples:

Long Short-Term Memory (LSTM) Networks


LSTM is a type of RNN that is designed to handle the vanishing gradient problem that can occur
in standard RNNs. It does this by introducing three gating mechanisms that control the flow of
information through the network: the input gate, the forget gate, and the output gate. These gates
allow the LSTM network to selectively remember or forget information from the input sequence,
which makes it more effective for long-term dependencies.

Gated Recurrent Unit (GRU) Networks


GRU is another type of RNN that is designed to address the vanishing gradient problem. It has two
gates: the reset gate and the update gate. The reset gate determines how much of the previous state
should be forgotten, while the update gate determines how much of the new state should be
remembered. This allows the GRU network to selectively update its internal state based on the
input sequence.
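A minimal NumPy sketch of one GRU step, following the standard reset-gate/update-gate equations (the weight names, the concatenated [h_prev, x] layout, and the omission of bias terms are simplifying assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W_z, W_r, W_h):
    # One GRU time step on the concatenated vector [h_prev, x].
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                     # update gate
    r = sigmoid(W_r @ hx)                                     # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                   # blend old and new state

np.random.seed(0)
hidden, inp = 6, 3
W_z = np.random.randn(hidden, hidden + inp) * 0.1
W_r = np.random.randn(hidden, hidden + inp) * 0.1
W_h = np.random.randn(hidden, hidden + inp) * 0.1

h = np.zeros(hidden)
for x in np.random.randn(4, inp):   # a toy 4-step input sequence
    h = gru_step(x, h, W_z, W_r, W_h)
print("final GRU hidden state:", np.round(h, 3))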

Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be useful
for speech recognition and natural language processing tasks.

Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder network
that generates the output sequence based on the encoder's representation. This architecture is
commonly used for sequence-to-sequence tasks such as machine translation.

Attention Mechanisms
Attention mechanisms are a technique that can be used to improve the performance of RNNs on
tasks that involve long input sequences. They work by allowing the network to attend to different
parts of the input sequence selectively rather than treating all parts of the input sequence equally.
This can help the network focus on the input sequence's most relevant parts and ignore irrelevant
information.
These are just a few examples of the many variant RNN architectures that have been developed
over the years. The choice of architecture depends on the specific task and the characteristics of
the input and output sequences.

Encoder-Decoder Model

There are three main blocks in the encoder-decoder model,

• Encoder

• Hidden Vector

• Decoder
The encoder converts the input sequence into a single fixed-length vector (the hidden vector).
The decoder will convert the hidden vector into the output sequence.

Encoder-Decoder models are jointly trained to maximize the conditional probabilities of the
target sequence given the input sequence.

SEQUENCE TO SEQUENCE RNN

How does the Sequence to Sequence Model work?

In order to fully understand the model’s underlying logic, we will go over the below illustration:

Encoder-decoder sequence to sequence model

Encoder

• Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.

• For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep, X[t].

• After all the inputs are read by encoder model, the final hidden state of the model represents
the context/summary of the whole input sequence.
• Example: Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4 tokens) for the encoder model. At each time step, the hidden state h will be updated using the previous hidden state and the current input.

Example: Encoder

• At the first timestep t1, the previous hidden state h0 is initialized to zero or chosen randomly. The first RNN cell then updates the current hidden state using the first input and h0. Each cell outputs two things: an updated hidden state and an output for that stage. The outputs at each stage are discarded, and only the hidden states are propagated to the next cell.

• The hidden states are computed using the formula h(t) = f(W(hh)·h(t-1) + W(hx)·x(t)), where f is the activation function (typically tanh).

• At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is computed from both. This continues for all four stages of the example.

• The encoder is thus a stack of several recurrent units (LSTM or GRU cells for better performance), each of which accepts a single element of the input sequence, collects information for that element, and propagates it forward.

• In the question-answering problem, the input sequence is a collection of all words from the
question. Each word is represented as xi where i is the order of that word.
This simple formula represents the update rule of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h(t-1) and the input vector x(t).
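The encoder loop above can be sketched in a few lines of NumPy (token embeddings and weights are random placeholders, and tanh is an assumed choice of activation):

import numpy as np

np.random.seed(1)
emb_size, hidden_size = 5, 8

tokens = ["I", "am", "a", "Student"]             # the 4-token example above
x_seq = np.random.randn(len(tokens), emb_size)   # toy embeddings, one per token

W_hx = np.random.randn(hidden_size, emb_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1

h = np.zeros(hidden_size)                        # h0: initial hidden state
for t, x_t in enumerate(x_seq):
    h = np.tanh(W_hh @ h + W_hx @ x_t)           # per-step outputs are discarded
    print("after token", repr(tokens[t]), "the hidden state has been updated")

encoder_vector = h   # final hidden state: the context/summary of the whole sequence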

Encoder Vector

• This is the final hidden state produced from the encoder part of the model. It is calculated using
the formula above.

• This vector aims to encapsulate the information for all input elements in order to help the
decoder make accurate predictions.

• It acts as the initial hidden state of the decoder part of the model.

Decoder

• The Decoder generates the output sequence by predicting the next output Yt given the hidden
state ht.

• The input for the decoder is the final hidden vector obtained at the end of the encoder model.

• Each decoder step has three inputs: the hidden vector from the previous step h(t-1), the previous output y(t-1), and the original encoder hidden vector h.

• At the first step, the encoder's output vector, the special START symbol, and an empty hidden state h(t-1) are given as input; the outputs obtained are y1 and the updated hidden state h1 (conceptually, the information already emitted as output is removed from the hidden vector).

• The second step takes the updated hidden state h1, the previous output y1, and the original hidden vector h as inputs, and produces the hidden vector h2 and the output y2.

• The outputs produced at each timestep of the decoder form the actual output sequence. The model keeps predicting outputs until the END symbol occurs.

• A stack of several recurrent units where each predicts an output yt at a time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as
well as its hidden state.

• In the question-answering problem, the output sequence is a collection of all words from the
answer. Each word is represented as yi where i is the order of that word.

Example: Decoder.

• Any hidden state is computed using the formula h(t) = f(W(hh)·h(t-1)).

As you can see, we just use the previous hidden state to compute the next one.

Output Layer

• We use the Softmax activation function at the output layer.

• It converts a vector of scores into a probability distribution, from which the class with the highest probability is taken as the prediction.

• The output y(t) at time step t is computed using the formula y(t) = softmax(W(S)·h(t)).


We calculate the outputs using the hidden state at the current time step together with the respective
weight W(S). Softmax is used to create a probability vector that will help us determine the final
output (e.g. word in the question-answering problem).
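A NumPy sketch of a few decoder steps with this softmax output layer (the vocabulary size, the weights, and the use of the encoder vector as the initial hidden state follow the description above; all numbers are placeholders):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(2)
hidden_size, vocab_size = 8, 6

W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # decoder recurrence
W_S = np.random.randn(vocab_size, hidden_size) * 0.1    # output projection W(S)

h = np.random.randn(hidden_size)   # stands in for the encoder vector

for step in range(3):
    h = np.tanh(W_hh @ h)          # h(t) = f(W(hh) . h(t-1))
    y = softmax(W_S @ h)           # probability distribution over the vocabulary
    print("step", step, "- predicted token id:", int(np.argmax(y)),
          "max probability:", round(float(y.max()), 3))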

The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the input and output sequences do not have to be aligned, and their lengths can differ. This opens up a whole new range of problems that can be solved using such an architecture.

Applications

This model has many applications, such as:

• Google’s Machine Translation

• Question answering chatbots

• Speech recognition

• Time Series Application etc.,

BIDIRECTIONAL RNN

A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural network (RNN) that processes input data in both forward and backward directions. The goal of a Bi-RNN is to capture the contextual dependencies in the input data by processing it in both directions, which can be useful in a variety of natural language processing (NLP) tasks.

In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the
forward direction, while the other processes it in the reverse direction. The outputs of these two
RNNs are then combined in some way to produce the final output.

One common way to combine the outputs of the forward and reverse RNNs is to concatenate them,
but other methods, such as element-wise addition or multiplication can also be used. The choice
of combination method can depend on the specific task and the desired properties of the final
output.
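The sketch below (NumPy, with random illustrative weights) runs one simple RNN left-to-right and a second one right-to-left over the same sequence, then combines the two hidden states at each position by concatenation, as described above:

import numpy as np

def rnn_pass(x_seq, W_x, W_h):
    # Run a simple tanh RNN over x_seq and return the hidden state at each step.
    h = np.zeros(W_h.shape[0])
    states = []
    for x in x_seq:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

np.random.seed(3)
inp, hid, T = 4, 6, 5
x_seq = np.random.randn(T, inp)

W_x_f, W_h_f = np.random.randn(hid, inp) * 0.1, np.random.randn(hid, hid) * 0.1
W_x_b, W_h_b = np.random.randn(hid, inp) * 0.1, np.random.randn(hid, hid) * 0.1

forward = rnn_pass(x_seq, W_x_f, W_h_f)                # left-to-right pass
backward = rnn_pass(x_seq[::-1], W_x_b, W_h_b)[::-1]   # right-to-left pass, realigned

# Each position now sees both past (forward) and future (backward) context.
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print("combined state size per position:", combined[0].shape)   # (2 * hid,)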
Need for Bi-directional RNNs

• A uni-directional recurrent neural network (RNN) processes input sequences in a single direction,
either from left to right or right to left.

• This means that the network can only use information from earlier time steps when making
predictions at later time steps.

• This can be limiting, as the network may not capture important contextual information relevant to the
output prediction.

• For example, in natural language processing tasks, a uni-directional RNN may not accurately predict
the next word in a sentence if the previous words provide important context for the current word.

Consider an example where we could use the recurrent network to predict the masked word in a
sentence.

1. Apple is my favorite ____.
2. Apple is my favorite ____, and I work there.
3. Apple is my favorite ____, and I am going to buy one.

In the first sentence, the answer could be fruit, company, or phone. But in the second and third
sentences, it cannot be a fruit.

A Recurrent Neural Network that can only process the inputs from left to right might not be able to accurately predict the right answer for the sentences discussed above.

To perform well on natural language tasks, the model must be able to process the sequence in
both directions.

Bi-directional RNNs
• A bidirectional recurrent neural network (RNN) is a type of recurrent neural network (RNN)
that processes input sequences in both forward and backward directions.

• This allows the RNN to capture information from the input sequence that is relevant to the output prediction but would be lost in a traditional RNN that processes the input sequence in only one direction.

• This allows the network to consider information from the past and future when making
predictions rather than just relying on the input data at the current time step.

• This can be useful for tasks such as language processing, where understanding the context of
a word or phrase can be important for making accurate predictions.

• In general, bidirectional RNNs can help improve the performance of a model on a variety of
sequence-based tasks.
This means that the network has two separate RNNs:

1. One that processes the input sequence from left to right


2. Another one that processes the input sequence from right to left.

These two RNNs are typically referred to as the forward and backward RNNs, respectively.

During the forward pass of the RNN, the forward RNN processes the input sequence in the usual
way by taking the input at each time step and using it to update the hidden state. The updated
hidden state is then used to predict the output at that time step.

Back-propagation through time (BPTT) is a widely used algorithm for training recurrent neural
networks (RNNs). It is a variant of the back-propagation algorithm specifically designed to handle
the temporal nature of RNNs, where the output at each time step depends on the inputs and outputs
at previous time steps.

In the case of a bidirectional RNN, BPTT involves two separate Back-propagation passes: one for
the forward RNN and one for the backward RNN. During the forward pass, the forward RNN
processes the input sequence in the usual way and makes predictions for the output sequence.
These predictions are then compared to the target output sequence, and the error is back-
propagated through the network to update the weights of the forward RNN.

During the backward pass, the backward RNN processes the input sequence in reverse order and
makes predictions for the output sequence. These predictions are then compared to the target output
sequence in reverse order, and the error is back-propagated through the network to update the
weights of the backward RNN.

Once both passes are complete, the weights of the forward and backward RNNs are updated based
on the errors computed during the forward and backward passes, respectively. This process is
repeated for multiple iterations until the model converges and the predictions of the bidirectional
RNN are accurate.
This allows the bidirectional RNN to consider information from past and future time steps when
making predictions, which can significantly improve the model's accuracy.
Applications of Bi-directional RNNs

Bidirectional recurrent neural networks (RNNs) can outperform traditional RNNs on various tasks,
particularly those involving sequential data processing. Some examples of tasks where
bidirectional RNNs have been shown to outperform traditional RNNs include:

• Natural language processing tasks, such as language translation and sentiment analysis, where
understanding the context of a word or phrase can be important for making accurate predictions.

• Time series forecasting tasks, such as predicting stock prices or weather patterns, where the
sequence of past data can provide important clues about future trends.

• Audio processing tasks, such as speech recognition or music generation, where the information
in the audio signal can be complex and non-linear.

In general, bidirectional RNNs can be useful for any task where the input data has a temporal
structure and where understanding the context of the data is important for making accurate
predictions.

Advantages and Disadvantages of Bi-directional RNNs

Advantages:

Bidirectional Recurrent Neural Networks (RNNs) have several advantages over traditional
RNNs. Some of the key advantages of bidirectional RNNs include the following:
• Improved performance on tasks that involve processing sequential data. Because bidirectional
RNNs can consider information from both past and future time steps when making
predictions, they can outperform traditional RNNs on tasks such as natural language
processing, time series forecasting, and audio processing.

Disadvantages:

However, Bidirectional RNNs also have some disadvantages. Some of the key disadvantages of
bidirectional RNNs include the following:

• Increased computational complexity. Because bidirectional RNNs have two separate RNNs
(one for the forward pass and one for the backward pass), they can require more computational
resources to train and evaluate than traditional RNNs. This can make them more difficult to
implement and less efficient in terms of runtime performance.

• More difficult to optimize. Because bidirectional RNNs have more parameters (due to the two
separate RNNs), they can be more difficult to optimize. This can make finding the right set of
weights for the model challenging and lead to slower convergence during training.

• The need for longer input sequences. For a bidirectional RNN to capture long-term
dependencies in the data, it typically requires longer input sequences than a traditional RNN.
This can be a disadvantage in situations where the input data is limited or noisy, as it may not
be possible to generate enough input data to train the model effectively.

RECURSIVE NEURAL NETWORKS

Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed
and structured information. With RvNN, you can get a structured prediction by recursively
applying the same set of weights to structured inputs. The word recursive indicates that the neural network is applied recursively to its own output.

Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The
tree structure means combining child nodes and producing parent nodes. Each child-parent bond
has a weight matrix, and similar children have the same weights. The number of children for every
node in the tree is fixed to enable it to perform recursive operations and use the same weights.
RvNNs are used when there's a need to parse an entire sentence.

To calculate the parent node's representation, we add the products of the weight matrices (Wi) and the children's representations (Ci) and apply the transformation f (a small numerical sketch follows the list below):

h = f( W1·C1 + W2·C2 + … + Wc·Cc )

Where:

• h is the parent node's representation,


• Wi represents the weight matrix for the ith child,
• Ci is the representation of the ith child node,
• c is the number of children,
• f is the transformation function applied to the sum of the products.
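Here is the promised numerical sketch of the parent-node computation h = f( Σ Wi·Ci ) for a node with two children (the dimensionality, the tanh choice for f, and the random values are illustrative assumptions):

import numpy as np

np.random.seed(4)
dim = 5   # every node representation has the same dimensionality

children = [np.random.randn(dim), np.random.randn(dim)]        # child representations C1, C2
weights = [np.random.randn(dim, dim) * 0.1 for _ in children]  # one weight matrix per child position

def parent(children, weights, f=np.tanh):
    # h = f( sum_i Wi @ Ci ): combine the child representations into the parent.
    return f(sum(W @ c for W, c in zip(weights, children)))

h_parent = parent(children, weights)
print("parent representation:", np.round(h_parent, 3))

# Because the number of children per node is fixed and the weights are shared,
# the same function can be applied recursively all the way up the tree.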
Recurrent Neural Network vs. Recursive Neural Networks

• Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for
processing sequential data. They are closely related to the Recursive Neural Network.
• Recurrent Neural Networks represent temporal sequences, and thus find application in Natural Language Processing (NLP), since language-related data like sentences and paragraphs are sequential in nature. Recurrent networks are usually chain structures. The
weights are shared across the chain length, keeping the dimensionality constant.
• On the other hand, Recursive Neural Networks operate on hierarchical data models due to their
tree structure. There are a fixed number of children for each node in the tree so that it can execute
recursive operations and use the same weights for each step. Child representations are combined
into parent representations.
• For such hierarchical inputs, a recursive network can be more efficient than a feed-forward network.
• Recurrent networks are recursive networks with a chain (time) structure, so recursive networks can be seen as a generalization of recurrent networks.

Recursive Neural Network Implementation

A Recursive Neural Network can be used for sentiment analysis of natural language sentences. Sentiment analysis is one of the most important tasks in Natural Language Processing (NLP): it identifies the writing tone and the sentiment of the writer in a particular sentence. If a writer expresses any sentiment, basic labels describing the writing tone are recognized. We want to identify the smaller components, like noun or verb phrases, and order them in a syntactic hierarchy. For example, the model identifies whether the sentence showcases a constructive form of writing or negative word choices.

A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases
and words we must combine to form the perfect syntactic tree for a given sentence.

Let us consider the representation of the phrase -- "a lot of fun" in the following sentence.

Programming is a lot of fun.

An RNN representation of this phrase would not be suitable because it considers only sequential relations. Each hidden state depends on the representations of all preceding words, so a sub-phrase that does not start at the beginning of the sentence cannot be represented on its own. With an RNN, by the time the word 'fun' is processed, the hidden state represents the whole sentence up to that point.

However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the representation of the exact phrase: it lies in the hidden state of the node representing “a lot of fun”. Thus, syntactic parsing can be implemented with the help of Recursive Neural Networks.

Benefits of RvNNs for Natural Language Processing

• The two significant advantages of Recursive Neural Networks for Natural Language Processing
are their structure and reduction in network depth.
• As already explained, the tree structure of Recursive Neural Networks can manage hierarchical
data like in parsing problems.

• Another benefit of RvNN is that the trees can have a logarithmic height. When there are O(n)
input words, a Recursive Neural Network can represent a binary tree with height O(log n). This
lessens the distance between the first and last input elements. Hence, the long-term dependency
turns shorter and easier to grab.

Disadvantages of RvNNs for Natural Language Processing

• The main disadvantage of recursive neural networks can be the tree structure. Using the tree
structure introduces a particular inductive bias into our model: the assumption that the data follow a tree hierarchy. When that assumption does not hold, the network may not be able to learn the patterns that actually exist.
• Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and
ambiguous. Interestingly, there can be many parse trees for a single sentence.
• Also, it is more time-consuming and labor-intensive to label the training data for recursive
neural networks than to construct recurrent neural networks. Manually parsing a sentence into
short components is more time-consuming and tedious than assigning a label to a sentence.

Gated Architecture

LONG SHORT TERM MEMORY NETWORK (LSTM)

LSTM is a widely used architecture in the field of deep learning. It is a variety of recurrent neural network (RNN) that is capable of learning long-term dependencies, especially in sequence prediction problems.

LSTMs are predominantly used to learn, process, and classify sequential data because these
networks can learn long-term dependencies between time steps of data. Common LSTM
applications include sentiment analysis, language modelling, speech recognition, and video
analysis.

LSTM has feedback connections, i.e., it is capable of processing entire sequences of data, not just single data points such as images. This finds application in speech recognition, machine
translation, etc. LSTM is a special kind of RNN, which shows outstanding performance on a large
variety of problems.

The Logic behind LSTM

The central role of an LSTM model is held by a memory cell known as a ‘cell state’ that maintains
its state over time. The cell state is the horizontal line that runs through the top of the below
diagram. It can be visualized as a conveyor belt through which information just flows, unchanged.
Information can be added to or removed from the cell state in LSTM and is regulated by gates.
These gates optionally let the information flow in and out of the cell. Each gate consists of a pointwise multiplication operation and a sigmoid neural network layer that together regulate the flow.

The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should be
let through,’ and one means ‘everything should be let through.’
1. Forget Gate(f): At forget gate the input is combined with the previous output to generate
a fraction between 0 and 1 that determines how much of the previous state needs to be preserved (or, in other words, how much of the state should be forgotten). This output is then
multiplied with the previous state. Note: An activation output of 1.0 means “remember
everything” and activation output of 0.0 means “forget everything.” From a different
perspective, a better name for the forget gate might be the “remember gate”
2. Input Gate(i): The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block, producing the new values that must be added to the previous state. This gated vector is then added to the previous state to generate the current state.
3. Input Modulation Gate(g): It is often considered a sub-part of the input gate, and much of the literature on LSTMs does not even mention it, assuming it is inside the input gate. It is used
to modulate the information that the Input gate will write onto the Internal State Cell by
adding non-linearity to the information and making the information Zero-mean. This is
done to reduce the learning time as Zero-mean input has faster convergence. Although this
gate’s actions are less important than the others and are often treated as a finesse-
providing concept, it is good practice to include this gate in the structure of the LSTM
unit.
4. Output Gate(o): At the output gate, the input and previous state are gated as before to generate another scaling fraction, which is combined with the output of the tanh block applied to the current state to produce the new hidden state. This output is then given out. The output and state are fed back into the LSTM block.
The basic workflow of a Long Short Term Memory Network is similar to the workflow of a
Recurrent Neural Network with the only difference being that the Internal Cell State is also passed
forward along with the Hidden State.
Working of an LSTM recurrent unit:
1. Take as input the current input, the previous hidden state, and the previous internal cell
state.
2. Calculate the values of the four different gates by following the below steps:-

• For each gate, calculate the parameterized vectors for the current input and the previous hidden state by multiplying each with the respective weights for that gate.

• Apply the respective activation function for each gate element-wise on the parameterized vectors: the sigmoid function for the forget, input, and output gates, and tanh for the input modulation gate.

3. Calculate the current internal cell state by first calculating the element-wise multiplication
vector of the input gate and the input modulation gate, then calculate the element-wise
multiplication vector of the forget gate and the previous internal cell state and then
add the two vectors.

4. Calculate the current hidden state by first taking the element-wise hyperbolic tangent of
the current internal cell state vector and then performing element-wise multiplication with
the output gate.

The above-stated working is illustrated below.

Note that the blue circles denote element-wise multiplication. The weight matrix W contains
different weights for the current input vector and the previous hidden state for each gate.
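The four gates and the two state updates listed above can be written out directly. Below is a hedged NumPy sketch of one LSTM step (a common formulation; the concatenated [h_prev, x] layout, the omission of biases, and the weight names are simplifying assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_g, W_o):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx)    # forget gate: how much of the old cell state to keep
    i = sigmoid(W_i @ hx)    # input gate: how much new information to admit
    g = np.tanh(W_g @ hx)    # input modulation gate: candidate values
    o = sigmoid(W_o @ hx)    # output gate: what part of the state to expose
    c = f * c_prev + i * g   # step 3: new internal cell state
    h = o * np.tanh(c)       # step 4: new hidden state
    return h, c

np.random.seed(5)
hid, inp = 6, 3
W_f, W_i, W_g, W_o = (np.random.randn(hid, hid + inp) * 0.1 for _ in range(4))

h, c = np.zeros(hid), np.zeros(hid)
for x in np.random.randn(4, inp):    # a toy 4-step input sequence
    h, c = lstm_step(x, h, c, W_f, W_i, W_g, W_o)
print("final hidden state:", np.round(h, 3))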
LSTMs work in a 3-step process.

Step 1: Decide How Much Past Data It Should Remember

The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function determines this. It looks at the previous state h(t-1) along with the current input x(t) and computes the function:

f(t) = σ( W(f)·[h(t-1), x(t)] + b(f) )

f(t) is the forget gate: it decides which information from the previous time step is not important and should be deleted.

Consider the following two sentences:

1. Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at
Chemistry.”

2. Let the current input at x(t) be “John plays football well. He told me yesterday over the phone
that he had served as the captain of his college football team.”

The forget gate realizes there might be a change in context after encountering the first full stop. It
compares with the current input sentence at x(t). The next sentence talks about John, so the
information on Alice is deleted. The position of the subject is vacated and assigned to John.

Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts: a sigmoid function and a tanh function. The sigmoid function decides which values to let through (0 or 1), and the tanh function gives weightage to the values that are passed, deciding their level of importance (-1 to 1).

i(t) = σ( W(i)·[h(t-1), x(t)] + b(i) )

i(t) is the input gate: it determines which information to let through based on its significance in the current time step.

With the current input at x(t), the input gate recognizes the important information: John plays football, and the fact that he was the captain of his college team is important.

“He told me yesterday over the phone” is less important; hence it's forgotten. This process of
adding some new information can be done via the input gate.

Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which decides
what parts of the cell state make it to the output. Then, we put the cell state through tanh to push
the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
o(t) = σ( W(o)·[h(t-1), x(t)] + b(o) )

o(t) is the output gate: it allows the passed-in information to impact the output in the current time step.

Let’s consider this example of predicting the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ____ was awarded player of the match.” There could be many choices for the blank. The current input “brave” is an adjective, and adjectives describe a noun. So, “John” could be the best output after “brave”.

LSTM Applications

LSTM networks find useful applications in the following areas:


• Language modeling
• Machine translation
• Handwriting recognition
• Image captioning
• Image generation using attention models
• Question Answering
• Video-to-text conversion
• Polyphonic music modeling
• Speech synthesis
• Protein secondary structure prediction
Skip Connections
Skip connections are a type of shortcut that connects the output of one layer to the input of another
layer that is not adjacent to it. For example, in a CNN with four layers, A, B, C, and D, a skip
connection could connect layer A to layer C, layer B to layer D, or both.

The skip connection is a standard module in many convolutional architectures. Using a skip connection,
we provide an alternative path for the gradient (with back-propagation). It is experimentally
validated that these additional paths are often beneficial for model convergence. Skip
connections in deep architectures, as the name suggests, skip some layers in the neural network
and feed the output of one layer as the input to the next layers (instead of only the next
one).

As previously explained, using the chain rule, we must keep multiplying terms with the error
gradient as we go backwards. However, in the long chain of multiplication, if we multiply many
things together that are less than one, then the resulting gradient will be very small. Thus, the
gradient becomes very small as we approach the earlier layers in a deep architecture. In some
cases, the gradient becomes zero, meaning that we do not update the early layers at all.

In general, there are two fundamental ways that one could use skip connections through different
non-sequential layers:

a) Addition, as in residual architectures,
b) Concatenation, as in densely connected architectures.

We will first describe addition, which is commonly referred to as the residual skip connection.
Skip connections via addition

The core idea is to back-propagate through the identity function, by just using a vector
addition. Then the gradient would simply be multiplied by one and its value will be maintained
in the earlier layers. This is the main idea behind Residual Networks (ResNets): they stack these
skip residual blocks together. We use an identity function to preserve the gradient.

Mathematically, we can represent the residual block and calculate its partial derivative (gradient) with respect to the loss function like this: the block computes H(x) = F(x) + x, so by the chain rule ∂L/∂x = (∂L/∂H)·(∂F/∂x + 1). The “+ 1” contributed by the identity path keeps the gradient from vanishing even when ∂F/∂x is small.
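A minimal NumPy sketch of the additive skip connection (the inner transform F, the ReLU, and the vector sizes are illustrative assumptions): the block output is F(x) + x, so the identity path gives the gradient a route that is simply multiplied by one.

import numpy as np

np.random.seed(6)
dim = 8
W1 = np.random.randn(dim, dim) * 0.1
W2 = np.random.randn(dim, dim) * 0.1

def F(x):
    # The residual branch: a small two-layer transform with a ReLU in between.
    return W2 @ np.maximum(0.0, W1 @ x)

def residual_block(x):
    # Addition requires F(x) and x to have the same dimensionality.
    return F(x) + x   # identity (skip) path + residual path

x = np.random.randn(dim)
y = residual_block(x)
print("input and output have the same shape:", x.shape == y.shape)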

Apart from the vanishing gradients, there is another reason that we commonly use them. For a
plethora of tasks (such as semantic segmentation, optical flow estimation, etc.) there is some
information that was captured in the initial layers and we would like to allow the later layers to
also learn from them. It has been observed that in earlier layers the learned features
correspond to lower semantic information that is extracted from the input. If we had not used
the skip connection that information would have turned too abstract.

Skip connections via concatenation

As stated, for many dense prediction problems, there is low-level information shared between
the input and output, and it would be desirable to pass this information directly across the
net. The alternative way that we can achieve skip connections is by concatenation of previous
feature maps. The most famous such architecture is DenseNet. Below we can see an example of feature reusability by concatenation with 5 convolutional layers:
This architecture heavily uses feature concatenation to ensure maximum information flow between
layers in the network. This is achieved by connecting all layers directly with each other via
concatenation, as opposed to ResNets. Practically, what we do is concatenate the feature channel
dimension.

This leads to:
a) an enormous number of feature channels on the last layers of the network,
b) more compact models, and
c) extreme feature reusability.
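The DenseNet-style concatenation described above can be sketched as follows (the feature maps are random arrays standing in for convolution outputs; the growth rate and spatial size are arbitrary): each layer receives the channel-wise concatenation of all previous feature maps, so the number of input channels grows with depth.

import numpy as np

np.random.seed(7)
H, W = 4, 4
growth = 3                                   # channels each layer adds
features = [np.random.randn(growth, H, W)]   # output of an initial layer

for layer in range(4):
    # Input to this layer: concatenation of ALL previous feature maps
    # along the channel dimension (axis 0 here).
    layer_input = np.concatenate(features, axis=0)
    # Stand-in for a convolution that produces `growth` new channels.
    new_features = np.tanh(np.random.randn(growth, H, W) * 0.1 + layer_input.mean(axis=0))
    features.append(new_features)
    print("layer", layer, "- input channels:", layer_input.shape[0])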

Short and long skip connections in Deep Learning


In more practical terms, we have to be careful when introducing additive skip connections in our
deep learning model. The dimensionality has to be the same in addition and also in
concatenation apart from the chosen channel dimension. That is the reason why we see that
additive skip connections are used in two kinds of setups:

a) Short skip connections
b) Long skip connections.

Short skip connections are used along with consecutive convolutional layers that do not change
the input dimension (see ResNet), while long skip connections usually exist in encoder-decoder
architectures. It is known that the global information (shape of the image and other statistics)
resolves what, while local information resolves where (small details in an image patch).
Long skip connections often exist in architectures that are symmetrical, where the spatial
dimensionality is reduced in the encoder part and is gradually increased in the decoder part as
illustrated below. In the decoder part, one can increase the dimensionality of a feature map
via transpose convolutional layers. The transposed convolution operation forms the same
connectivity as the normal convolution but in the backward direction.
Benefits of skip connections

Skip connections can provide several benefits for CNNs, such as improving accuracy and
generalization, solving the vanishing gradient problem, and enabling deeper networks. Skip
connections can help the network to learn more complex and diverse patterns from the data and
reduce the number of parameters and operations needed by the network. Additionally, skip
connections can help to alleviate the problem of vanishing gradients by providing alternative paths
for the gradients to flow. Furthermore, they can make it easier and faster to train deeper networks,
which have more expressive power and can capture more features from the data.

Drawbacks of skip connections

Skip connections are a popular and powerful technique for improving the performance and
efficiency of CNNs, but they are not a panacea. They can help preserve information and gradients,
combine features, solve the vanishing gradient problem, and enable deeper networks. However,
they can also increase complexity and memory requirements, introduce redundancy and noise, and
require careful design and tuning to match the network architecture and data domain. Different
types and locations of skip connections can have different impacts on the network performance,
with some being more beneficial or harmful than others. Thus, it is essential to understand how
skip connections work and how to use them wisely and effectively for CNNs.

Dropouts

Dropout refers to units (neurons) that are intentionally and randomly dropped from a neural network during training to improve generalization and training time. A neural network is software attempting to emulate the actions of the human brain.

Neural networks are the building blocks of any machine-learning architecture. They consist of one
input layer, one or more hidden layers, and an output layer.

When we train our neural network (or model) by updating each of its weights, it might become
too dependent on the dataset we are using. Therefore, when this model has to make a prediction or
classification, it will not give satisfactory results. This is known as over-fitting. We might
understand this problem through a real-world example: If a student of mathematics learns only
one chapter of a book and then takes a test on the whole syllabus, he will probably fail.

To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012.
This technique is known as dropout.
The basic idea of this method is to, based on probability, temporarily “drop out” neurons from our
original network. Doing this for every training example gives us different models for each one.
Afterwards, when we want to test our model, we take the average of each model to get our
answer/prediction.

Dropout during training

We assign ‘p’ to represent the probability of a neuron, in the hidden layer, being excluded from
the network; this probability value is usually equal to 0.5. We do the same process for the input
layer whose probability value is usually lower than 0.5 (e.g. 0.2). Remember, we delete the
connections going into, and out of, the neuron when we drop it.

Dropout during testing


An output, given from a model trained using the dropout technique, is a bit different: We can take
a sample of many dropped-out models and compute the geometric mean of their output neurons
by multiplying all the numbers together and taking the n-th root of the product. However, since this
is computationally expensive, we use the original model instead by simply cutting all of the hidden
units’ weights in half. This will give us a good approximation of the average for each of the
different dropped-out models.
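A NumPy sketch of the scheme described above (p = 0.5 and the toy activations are arbitrary choices): during training each hidden unit is dropped with probability p, and at test time every unit is kept but its contribution is scaled by the keep probability, which approximates averaging over the many dropped-out models.

import numpy as np

np.random.seed(8)
p_drop = 0.5                        # probability of dropping a hidden unit
activations = np.random.rand(10)    # toy hidden-layer activations

# Training: randomly zero out units; a fresh mask is drawn for every example.
mask = (np.random.rand(activations.size) >= p_drop).astype(float)
train_out = activations * mask

# Testing: keep every unit but scale by the keep probability (1 - p).
test_out = activations * (1.0 - p_drop)

print("training output:", np.round(train_out, 2))
print("test-time output:", np.round(test_out, 2))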

RNN DESIGN PATTERNS
