AD3501 - DEEP LEARNING
UNIT 3
Introduction to RNN
Traditional neural networks have independent input and output layers, which makes them inefficient at dealing with sequential data. Hence, a new type of neural network, the Recurrent Neural Network (RNN), was introduced; it stores the results of previous outputs in an internal memory. These results are then fed back into the network along with the new inputs in order to predict the output of the layer. This allows RNNs to be used in applications like pattern detection, speech and voice recognition, natural language processing, and time series prediction.
Below is how we can convert a Feed-Forward Neural Network into a Recurrent Neural Network:
The input layer ‘x’ takes in the input to the neural network, processes it, and passes it on to the middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions, weights, and biases. In a plain feed-forward network these hidden layers are not affected by what the network has seen at earlier time steps, i.e. the network has no memory; when such memory is needed, we use a recurrent neural network instead.
The Recurrent Neural Network standardizes the activation functions, weights, and biases so that each hidden layer has the same parameters. Then, instead of creating multiple hidden layers, it creates one and loops over it as many times as required.
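As a minimal sketch of this looping, the snippet below applies one shared set of weights at every time step (PyTorch tensors; the sizes and variable names such as input_size and hidden_size are assumptions chosen only for illustration):

    import torch

    input_size, hidden_size, seq_len = 8, 16, 5

    # one shared set of parameters, reused at every time step
    W_xh = torch.randn(hidden_size, input_size) * 0.1
    W_hh = torch.randn(hidden_size, hidden_size) * 0.1
    b_h = torch.zeros(hidden_size)

    x = torch.randn(seq_len, input_size)   # the input sequence
    h = torch.zeros(hidden_size)           # initial hidden state h0

    for t in range(seq_len):
        # same weights at every step: h_t = tanh(W_xh·x_t + W_hh·h_(t-1) + b)
        h = torch.tanh(W_xh @ x[t] + W_hh @ h + b_h)

    # h now summarizes the whole sequence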
Feed-forward Neural Network
In a feed-forward neural network, decisions are based only on the current input. It does not memorize past data and has no notion of sequence. Feed-forward neural networks are used in general regression and classification problems.
Applications of RNNs
Image Captioning: RNNs are used to caption an image by analysing the activities present in it.
Time Series Prediction: Any time series problem, like predicting the prices of stocks in a particular month, can be solved using an RNN.
Natural Language Processing: Text mining and sentiment analysis can be carried out using an RNN for Natural Language Processing (NLP).
Machine Translation: Given an input in one language, RNNs can be used to translate the input into different languages as output.
Advantages of RNNs
Ability to Handle Variable-Length Sequences: RNNs are designed to handle input sequences of variable length, which makes them well-suited for tasks such as speech recognition, natural language processing, and time series analysis.
Memory of Past Inputs: RNNs have a memory of past inputs, which allows them to capture information about the context of the input sequence. This makes them useful for tasks such as language modelling, where the meaning of a word depends on the context in which it appears.
Parameter Sharing: RNNs share the same set of parameters across all time steps, which reduces the number of parameters that need to be learned and can lead to better generalization.
Non-Linear Mapping: RNNs use non-linear activation functions, which allow them to learn
complex, non-linear mappings between inputs and outputs.
Sequential Processing: RNNs process input sequences step by step, which lets them model the temporal order of the data naturally.
Flexibility: RNNs can be adapted to a wide range of tasks and input types, including text, speech,
and image sequences.
Improved Accuracy: RNNs have been shown to achieve state-of-the-art performance on a variety of sequence modeling tasks, including language modeling, speech recognition, and machine translation.
These advantages make RNNs a powerful tool for sequence modelling and analysis, and have led
to their widespread use in a variety of applications, including natural language processing, speech
recognition, and time series analysis.
Disadvantages of RNNs
Vanishing and Exploding Gradients: RNNs can suffer from the problem of vanishing or exploding gradients, which can make it difficult to train the network effectively. This occurs when the gradients of the loss function with respect to the parameters become very small or very large as they propagate through time.
Lack of Parallelism: RNNs are inherently sequential, which makes it difficult to parallelize the computation. This can limit the speed and scalability of the network.
Difficulty in Choosing the Right Architecture: There are many different variants of RNNs, each with its own advantages and disadvantages. Choosing the right architecture for a given task can be challenging, and may require extensive experimentation and tuning.
These disadvantages are important to consider when deciding whether to use an RNN for a given task.
However, many of these issues can be addressed through careful design and training of the network
and through techniques such as regularization and attention mechanisms.
The four commonly used types of Recurrent Neural Networks are:
1. One-to-One
The simplest type of RNN is One-to-One, which allows a single input and a single output. It has
fixed input and output sizes and acts as a traditional neural network. The One-to-One application
can be found in Image Classification.
One-to-One
2. One-to-Many
One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a
fixed input size and gives a sequence of data outputs. Its applications can be found in Music
Generation and Image Captioning.
One-to-Many
3. Many-to-One
Many-to-One is used when a single output is required from multiple input units or a sequence of
them. It takes a sequence of inputs to display a fixed output. Sentiment Analysis is a common
example of this type of Recurrent Neural Network.
4. Many-to-Many
Many-to-Many are used to generate a sequence of output data from a sequence of input units.
This type of RNN is further divided into the following two subcategories:
1. Equal Unit Size: In this case, the number of input and output units is the same. A common application can be found in Named-Entity Recognition.
2. Unequal Unit Size: In this case, inputs and outputs have different numbers of units. Its
application can be found in Machine Translation.
1. Vanishing Gradient Problem
RNNs suffer from the problem of vanishing gradients. The gradients carry information used in the RNN, and when the gradient becomes too small, the parameter updates become insignificant. This makes the learning of long data sequences difficult.
2. Exploding Gradient Problem
While training a neural network, if the slope tends to grow exponentially instead of decaying, this is called an Exploding Gradient. This problem arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during the training process. Long training time, poor performance, and low accuracy are the major issues caused by gradient problems.
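A common remedy for the exploding-gradient side of this issue is gradient clipping, where the norm of the gradients is capped before each weight update. A minimal sketch (PyTorch; the toy model, learning rate, and max_norm value are assumptions made only for illustration):

    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    x = torch.randn(4, 10, 8)        # (batch, time, features)
    target = torch.randn(4, 10, 16)  # dummy target for the hidden outputs

    output, _ = model(x)
    loss = criterion(output, target)

    optimizer.zero_grad()
    loss.backward()
    # cap the gradient norm so that large error gradients cannot blow up the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()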
Variant RNN Architectures
Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be useful
for speech recognition and natural language processing tasks.
Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder network
that generates the output sequence based on the encoder's representation. This architecture is
commonly used for sequence-to-sequence tasks such as machine translation.
Attention Mechanisms
Attention mechanisms are techniques that can be used to improve the performance of RNNs on tasks that involve long input sequences. They work by allowing the network to attend selectively to different parts of the input sequence rather than treating all parts equally.
This can help the network focus on the input sequence's most relevant parts and ignore irrelevant
information.
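As a rough sketch of the idea, simple dot-product attention over the encoder's hidden states can be written as below (the tensor shapes and names are assumptions; practical attention mechanisms also add learned projections):

    import torch
    import torch.nn.functional as F

    hidden_size, src_len = 16, 7

    encoder_states = torch.randn(src_len, hidden_size)   # one vector per input position
    decoder_state = torch.randn(hidden_size)              # current decoder hidden state

    # score each encoder position against the decoder state, then normalize
    scores = encoder_states @ decoder_state                # (src_len,)
    weights = F.softmax(scores, dim=0)                     # attention weights sum to 1

    # the context vector focuses on the most relevant parts of the input sequence
    context = weights @ encoder_states                     # (hidden_size,)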
These are just a few examples of the many variant RNN architectures that have been developed
over the years. The choice of architecture depends on the specific task and the characteristics of
the input and output sequences.
Encoder-Decoder Model
The model consists of three parts:
• Encoder
• Hidden Vector
• Decoder
The Encoder will convert the input sequence into a single-dimensional vector (hidden vector).
The decoder will convert the hidden vector into the output sequence.
Encoder-Decoder models are jointly trained to maximize the conditional probabilities of the
target sequence given the input sequence.
In order to fully understand the model’s underlying logic, we will go over the encoder and decoder step by step below.
Encoder
• Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
• For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep, X[t].
• After all the inputs are read by the encoder, the final hidden state of the model represents the context/summary of the whole input sequence.
• Example: Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4 tokens) for the encoder. At each time step, the hidden state h will be updated using the previous hidden state and the current input.
Example: Encoder
• At the first timestep t1, the previous hidden state h0 is taken to be zero or chosen randomly. The first RNN cell updates the current hidden state using the first input and h0. Each step outputs two things: the updated hidden state and an output for that step. The per-step outputs are discarded; only the hidden states are propagated to the next step.
• At the second timestep t2, the hidden state h1 and the second input X[2] are given as inputs, and the hidden state is updated to h2 according to both. The same happens for all four steps of the example.
• A stack of several recurrent units (LSTM or GRU cells for better performance) where each
accepts a single element of the input sequence, collects information for that element, and
propagates it forward.
• In the question-answering problem, the input sequence is a collection of all words from the
question. Each word is represented as xi where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network: we just apply the appropriate weights to the previous hidden state h(t-1) and the input vector x_t:
    h_t = f(W_hh · h_(t-1) + W_xh · x_t)
Encoder Vector
• This is the final hidden state produced from the encoder part of the model. It is calculated using
the formula above.
• This vector aims to encapsulate the information for all input elements in order to help the
decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
Decoder
• The Decoder generates the output sequence by predicting the next output Yt given the hidden
state ht.
• The input for the decoder is the final hidden vector obtained at the end of the encoder.
• Each step has three inputs: the hidden vector from the previous step h(t-1), the previous output y(t-1), and the original hidden vector h.
• At the first step, the encoder’s output vector, the special START symbol, and an empty hidden state h0 are given as inputs; the outputs obtained are y1 and the updated hidden state h1 (the information already emitted as output is, in effect, removed from the hidden vector).
• The second step takes the updated hidden state h1, the previous output y1, and the original hidden vector h as its current inputs, and produces the hidden state h2 and output y2.
• The output produced at each timestep of the decoder is part of the actual output sequence. The model keeps predicting until the END symbol occurs.
• A stack of several recurrent units where each predicts an output yt at a time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as
well as its hidden state.
• In the question-answering problem, the output sequence is a collection of all words from the
answer. Each word is represented as yi where i is the order of that word.
Example: Decoder.
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
• It produces a probability distribution (typically via a softmax) over the possible outputs from the decoder’s vector of values, so that the target class receives a high probability.
The power of this model lies in the fact that it can map sequences of different lengths to each other. The input and output lengths are not tied to each other and can differ, which opens up a whole new range of problems that can be solved with this architecture.
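A compact sketch of the encoder-decoder idea (PyTorch; the vocabulary sizes, the choice of GRU cells, and the class layout are assumptions made only for this example):

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, hidden_size=32):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, hidden_size)
            self.tgt_emb = nn.Embedding(tgt_vocab, hidden_size)
            self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
            self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, tgt_vocab)

        def forward(self, src, tgt):
            # encoder: read the whole input; its final hidden state is the context vector
            _, context = self.encoder(self.src_emb(src))
            # decoder: start from the context vector and generate the output sequence
            dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
            return self.out(dec_out)   # scores over the target vocabulary at each step

    model = Seq2Seq(src_vocab=100, tgt_vocab=120)
    src = torch.randint(0, 100, (2, 4))   # e.g. "I am a Student" -> 4 tokens, batch of 2
    tgt = torch.randint(0, 120, (2, 6))   # shifted target sequence fed to the decoder
    logits = model(src, tgt)              # shape: (2, 6, 120)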
Applications
• Speech recognition
BIDIRECTIONAL RNN
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the
forward direction, while the other processes it in the reverse direction. The outputs of these two
RNNs are then combined in some way to produce the final output.
One common way to combine the outputs of the forward and reverse RNNs is to concatenate them, but other methods, such as element-wise addition or multiplication, can also be used. The choice of combination method can depend on the specific task and the desired properties of the final output.
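A minimal sketch of a bidirectional RNN in PyTorch (the sizes are assumptions); here the forward and backward outputs are combined by concatenation, which is why the output feature dimension is doubled:

    import torch
    import torch.nn as nn

    birnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

    x = torch.randn(4, 10, 8)            # (batch, time, features)
    output, (h_n, c_n) = birnn(x)

    # forward and backward hidden states are concatenated at every time step
    print(output.shape)                  # torch.Size([4, 10, 32]) -> 2 * hidden_size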
Need for Bi-directional RNNs
• A uni-directional recurrent neural network (RNN) processes input sequences in a single direction,
either from left to right or right to left.
• This means that the network can only use information from earlier time steps when making
predictions at later time steps.
• This can be limiting, as the network may not capture important contextual information relevant to the
output prediction.
• For example, in natural language processing tasks, a uni-directional RNN may not accurately predict
the next word in a sentence if the previous words provide important context for the current word.
Consider an example where we could use the recurrent network to predict the masked word in a
sentence.
1. Apple is my favorite ___.
2. Apple is my favorite ___, and I work there.
3. Apple is my favorite ___, and I am going to buy one.
In the first sentence, the answer could be fruit, company, or phone. But in the second and third sentences, it cannot be a fruit.
A Recurrent Neural Network that can only process the inputs from left to right might not be able
to accurately predict the right answer for sentences discussed above.
To perform well on natural language tasks, the model must be able to process the sequence in
both directions.
Bi-directional RNNs
• A bidirectional recurrent neural network is a type of RNN that processes input sequences in both forward and backward directions.
• This allows the RNN to capture information from the input sequence that may be relevant to
the output prediction, but the same could be lost in a traditional RNN that only processes the
input sequence in one direction.
• This allows the network to consider information from the past and future when making
predictions rather than just relying on the input data at the current time step.
• This can be useful for tasks such as language processing, where understanding the context of
a word or phrase can be important for making accurate predictions.
• In general, bidirectional RNNs can help improve the performance of a model on a variety of
sequence-based tasks.
This means that the network has two separate RNNs: one that processes the input sequence from left to right, and one that processes it from right to left. These two RNNs are typically referred to as the forward and backward RNNs, respectively.
During the forward pass of the RNN, the forward RNN processes the input sequence in the usual
way by taking the input at each time step and using it to update the hidden state. The updated
hidden state is then used to predict the output at that time step.
Back-propagation through time (BPTT) is a widely used algorithm for training recurrent neural
networks (RNNs). It is a variant of the back-propagation algorithm specifically designed to handle
the temporal nature of RNNs, where the output at each time step depends on the inputs and outputs
at previous time steps.
In the case of a bidirectional RNN, BPTT involves two separate Back-propagation passes: one for
the forward RNN and one for the backward RNN. During the forward pass, the forward RNN
processes the input sequence in the usual way and makes predictions for the output sequence.
These predictions are then compared to the target output sequence, and the error is back-
propagated through the network to update the weights of the forward RNN.
During the backward pass, the backward RNN processes the input sequence in reverse order and
makes predictions for the output sequence. These predictions are then compared to the target output
sequence in reverse order, and the error is back-propagated through the network to update the
weights of the backward RNN.
Once both passes are complete, the weights of the forward and backward RNNs are updated based
on the errors computed during the forward and backward passes, respectively. This process is
repeated for multiple iterations until the model converges and the predictions of the bidirectional
RNN are accurate.
This allows the bidirectional RNN to consider information from past and future time steps when
making predictions, which can significantly improve the model's accuracy.
Applications of Bi-directional RNNs
Bidirectional recurrent neural networks (RNNs) can outperform traditional RNNs on various tasks,
particularly those involving sequential data processing. Some examples of tasks where
bidirectional RNNs have been shown to outperform traditional RNNs include:
• Natural language processing tasks, such as language translation and sentiment analysis, where understanding the context of a word or phrase can be important for making accurate predictions.
• Time series forecasting tasks, such as predicting stock prices or weather patterns, where the
sequence of past data can provide important clues about future trends.
• Audio processing tasks, such as speech recognition or music generation, where the information
in the audio signal can be complex and non-linear.
In general, bidirectional RNNs can be useful for any task where the input data has a temporal
structure and where understanding the context of the data is important for making accurate
predictions.
Advantages:
Bidirectional Recurrent Neural Networks (RNNs) have several advantages over traditional
RNNs. Some of the key advantages of bidirectional RNNs include the following:
• Improved performance on tasks that involve processing sequential data. Because bidirectional
RNNs can consider information from both past and future time steps when making
predictions, they can outperform traditional RNNs on tasks such as natural language
processing, time series forecasting, and audio processing.
Disadvantages:
However, Bidirectional RNNs also have some disadvantages. Some of the key disadvantages of
bidirectional RNNs include the following:
• Increased computational complexity. Because bidirectional RNNs have two separate RNNs
(one for the forward pass and one for the backward pass), they can require more computational
resources to train and evaluate than traditional RNNs. This can make them more difficult to
implement and less efficient in terms of runtime performance.
• More difficult to optimize. Because bidirectional RNNs have more parameters (due to the two
separate RNNs), they can be more difficult to optimize. This can make finding the right set of
weights for the model challenging and lead to slower convergence during training.
• The need for longer input sequences. For a bidirectional RNN to capture long-term
dependencies in the data, it typically requires longer input sequences than a traditional RNN.
This can be a disadvantage in situations where the input data is limited or noisy, as it may not
be possible to generate enough input data to train the model effectively.
Recursive Neural Networks
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed
and structured information. With RvNN, you can get a structured prediction by recursively
applying the same set of weights on structured inputs. The word recursive indicates that the neural
network is applied to its output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The
tree structure means combining child nodes and producing parent nodes. Each child-parent bond
has a weight matrix, and similar children have the same weights. The number of children for every
node in the tree is fixed to enable it to perform recursive operations and use the same weights.
RvNNs are used when there's a need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (Wi) and the children's representations (Ci) and apply the transformation f:
    h_parent = f(W1 · C1 + W2 · C2 + … + Wn · Cn)
where Wi is the weight matrix shared across the i-th child position and Ci is the representation of the i-th child.
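A small sketch of this parent composition for a binary tree node (the two-children case; the sizes, the random weights, and the choice of tanh for f are assumptions made only for illustration):

    import torch

    hidden_size = 16

    # shared weight matrices, reused at every child-parent bond of the tree
    W1 = torch.randn(hidden_size, hidden_size) * 0.1   # left child
    W2 = torch.randn(hidden_size, hidden_size) * 0.1   # right child
    b = torch.zeros(hidden_size)

    def compose(c1, c2):
        # parent = f(W1·C1 + W2·C2 + b), applied recursively up the tree
        return torch.tanh(W1 @ c1 + W2 @ c2 + b)

    # e.g. build the phrase "a lot of fun" bottom-up from word vectors
    a, lot, of, fun = (torch.randn(hidden_size) for _ in range(4))
    of_fun = compose(of, fun)
    lot_of_fun = compose(lot, of_fun)
    a_lot_of_fun = compose(a, lot_of_fun)   # hidden state representing the whole phrase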
• Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for
processing sequential data. They are closely related to the Recursive Neural Network.
• Recurrent Neural Networks represent temporal sequences, which is why they find application in Natural Language Processing (NLP), since language-related data like sentences and paragraphs are sequential in nature. Recurrent networks are usually chain structures. The weights are shared across the chain length, keeping the dimensionality constant.
• On the other hand, Recursive Neural Networks operate on hierarchical data models due to their
tree structure. There are a fixed number of children for each node in the tree so that it can execute
recursive operations and use the same weights for each step. Child representations are combined
into parent representations.
• The efficiency of a recursive network is higher than a feed-forward network.
• Recurrent networks are recurrent over time; a recurrent network is simply a recursive network whose structure is a chain, so recursive networks are a generalization of recurrent networks.
A Recursive Neural Network is used for sentiment analysis in natural language sentences. It is one
of the most important tasks of Natural language Processing (NLP), which identifies the writing
tone and sentiments of the writer in a particular sentence. If a writer expresses any sentiment, basic
labels about the writing tone are recognized. We want to identify the smaller components like
nouns or verb phrases and order them in a syntactic hierarchy. For example, it identifies whether
the sentence showcases a constructive form of writing or negative word choices.
A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases
and words we must combine to form the perfect syntactic tree for a given sentence.
Let us consider the representation of the phrase -- "a lot of fun" in the following sentence.
An RNN representation of this phrase would not be suitable because it considers only sequential
relations. Each state varies with the preceding words' representation. So, a subsequence that doesn't
occur at the beginning of the sentence can't be represented. With RNN, when processing the word
'fun,' the hidden state will represent the whole sentence.
However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the representation of the exact phrase: it lies in the hidden state of the node R(a lot of fun). Thus, syntactic parsing can be implemented with the help of Recursive Neural Networks.
• The two significant advantages of Recursive Neural Networks for Natural Language Processing
are their structure and reduction in network depth.
• As already explained, the tree structure of Recursive Neural Networks can manage hierarchical
data like in parsing problems.
• Another benefit of RvNNs is that the trees can have a logarithmic height. When there are O(n) input words, a Recursive Neural Network can represent a binary tree with height O(log n). This lessens the distance between the first and last input elements. Hence, long-term dependencies become shorter and easier to capture.
• The main disadvantage of recursive neural networks is the tree structure itself. Using the tree structure introduces a strong inductive bias into our model: the assumption that the data follow a tree hierarchy. When that assumption does not hold, the network may not be able to learn the existing patterns.
• Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and
ambiguous. Interestingly, there can be many parse trees for a single sentence.
• Also, labelling the training data for recursive neural networks is more time-consuming and labor-intensive than for recurrent neural networks: manually parsing a sentence into short components is far more tedious than assigning a single label to the sentence.
Gated Architecture
Long Short-Term Memory (LSTM) networks are widely used in the field of Deep Learning. The LSTM is a variety of recurrent neural network (RNN) that is capable of learning long-term dependencies, especially in sequence prediction problems.
LSTMs are predominantly used to learn, process, and classify sequential data because these
networks can learn long-term dependencies between time steps of data. Common LSTM
applications include sentiment analysis, language modelling, speech recognition, and video
analysis.
LSTM has feedback connections, i.e., it is capable of processing entire sequences of data, not just single data points such as images. This finds application in speech recognition, machine translation, etc. LSTM is a special kind of RNN which shows outstanding performance on a large variety of problems.
The central role in an LSTM model is held by a memory cell known as the ‘cell state’, which maintains its state over time. In the usual LSTM diagram, the cell state is the horizontal line that runs through the top of the cell. It can be visualized as a conveyor belt through which information flows, unchanged. Information can be added to or removed from the cell state, and this is regulated by gates. These gates optionally let information flow into and out of the cell; each gate consists of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should be
let through,’ and one means ‘everything should be let through.’
1. Forget Gate(f): At forget gate the input is combined with the previous output to generate
a fraction between 0 and 1, that determines how much of the previous state need to be
preserved (or in other words, how much of the state should be forgotten). This output is then
multiplied with the previous state. Note: An activation output of 1.0 means “remember
everything” and activation output of 0.0 means “forget everything.” From a different
perspective, a better name for the forget gate might be the “remember gate”
2. Input Gate(i): The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block that produces the new values to be added to the previous state. This gated vector is then added to the previous state to generate the current state.
3. Input Modulation Gate(g): It is often considered as a sub-part of the input gate and much
literature on LSTM’s does not even mention it and assume it is inside the Input gate. It is used
to modulate the information that the Input gate will write onto the Internal State Cell by
adding non-linearity to the information and making the information Zero-mean. This is
done to reduce the learning time as Zero-mean input has faster convergence. Although this
gate’s actions are less important than the others and are often treated as a finesse-
providing concept, it is good practice to include this gate in the structure of the LSTM
unit.
4. Output Gate(o): At the output gate, the input and previous state are gated as before to generate another scaling fraction, which is multiplied with the tanh of the current state to produce the output. This output is then given out, and the output and state are fed back into the LSTM block.
The basic workflow of a Long Short Term Memory Network is similar to the workflow of a
Recurrent Neural Network with the only difference being that the Internal Cell State is also passed
forward along with the Hidden State.
Working of an LSTM recurrent unit:
1. Take input the current input, the previous hidden state, and the previous internal cell
state.
2. Calculate the values of the four different gates by following the below steps:
• For each gate, calculate the parameterized vector by multiplying the current input and the previous hidden state with the respective weights for that gate.
• Apply the respective activation function for each gate element-wise on the parameterized vectors (sigmoid for the forget, input, and output gates; tanh for the input modulation gate).
3. Calculate the current internal cell state by first calculating the element-wise multiplication
vector of the input gate and the input modulation gate, then calculate the element-wise
multiplication vector of the forget gate and the previous internal cell state and then
add the two vectors.
4. Calculate the current hidden state by first taking the element-wise hyperbolic tangent of
the current internal cell state vector and then performing element-wise multiplication with
the output gate.
Note that in the usual LSTM diagram the circles denote element-wise multiplication. The weight matrix W contains different weights for the current input vector and the previous hidden state for each gate.
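These steps can be written out directly. A sketch of a single LSTM recurrent unit using plain tensor operations (the sizes and weight names are assumptions; g is the input modulation gate, i.e. the tanh candidate):

    import torch

    input_size, hidden_size = 8, 16

    # one weight matrix and bias per gate, acting on [x_t, h_(t-1)] concatenated
    W_f, W_i, W_g, W_o = (torch.randn(hidden_size, input_size + hidden_size) * 0.1 for _ in range(4))
    b_f, b_i, b_g, b_o = (torch.zeros(hidden_size) for _ in range(4))

    def lstm_cell(x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev])           # step 1: current input and previous hidden state
        f = torch.sigmoid(W_f @ z + b_f)       # forget gate (0 = forget, 1 = keep)
        i = torch.sigmoid(W_i @ z + b_i)       # input gate
        g = torch.tanh(W_g @ z + b_g)          # input modulation gate (candidate values)
        o = torch.sigmoid(W_o @ z + b_o)       # output gate
        c = f * c_prev + i * g                 # step 3: new internal cell state
        h = o * torch.tanh(c)                  # step 4: new hidden state
        return h, c

    x_t = torch.randn(input_size)
    h, c = lstm_cell(x_t, torch.zeros(hidden_size), torch.zeros(hidden_size))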
LSTMs work in a 3-step process.
Step 1: Decide Which Information to Omit from the Cell
The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function determines this. It looks at the previous state h(t-1) along with the current input x(t) and computes the function.
f_t – forget gate: decides which information from the previous time step is unimportant and should be deleted.
1. Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at
Chemistry.”
2. Let the current input at x(t) be “John plays football well. He told me yesterday over the phone
that he had served as the captain of his college football team.”
The forget gate realizes there might be a change in context after encountering the first full stop. It
compares with the current input sentence at x(t). The next sentence talks about John, so the
information on Alice is deleted. The position of the subject is vacated and assigned to John.
Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts: one is the sigmoid function, and the other is the tanh function. The sigmoid function decides which values to let through (0 or 1). The tanh function gives weightage to the values that are passed, deciding their level of importance (-1 to 1).
i_t – input gate: determines which information to let through based on its significance in the current time step.
With the current input at x(t), the input gate analyses the important information: that John plays football and that he was the captain of his college team is important.
“He told me yesterday over the phone” is less important; hence it's forgotten. This process of
adding some new information can be done via the input gate.
Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which decides
what parts of the cell state make it to the output. Then, we put the cell state through tanh to push
the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
o_t – output gate: allows the passed-in information to impact the output in the current time step.
Let’s consider this example to predict the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ___ was awarded player of the match.” There could be many choices for the blank. The current input brave is an adjective, and adjectives describe a noun. So, “John” could be the best output after brave.
LSTM Applications
Common LSTM applications include sentiment analysis, language modelling, speech recognition, machine translation, and video analysis.
Skip Connections
A skip connection is a standard module in many convolutional architectures. Using a skip connection, we provide an alternative path for the gradient during back-propagation. It has been experimentally validated that these additional paths are often beneficial for model convergence. Skip connections in deep architectures, as the name suggests, skip some layers in the neural network and feed the output of one layer as the input to later layers (instead of only the next one).
As previously explained, using the chain rule, we must keep multiplying terms with the error
gradient as we go backwards. However, in the long chain of multiplication, if we multiply many
things together that are less than one, then the resulting gradient will be very small. Thus, the
gradient becomes very small as we approach the earlier layers in a deep architecture. In some
cases, the gradient becomes zero, meaning that we do not update the early layers at all.
In general, there are two fundamental ways to use skip connections through different non-sequential layers: addition, as in residual networks, and concatenation, as in densely connected networks.
The core idea is to back-propagate through the identity function, by just using a vector
addition. Then the gradient would simply be multiplied by one and its value will be maintained
in the earlier layers. This is the main idea behind Residual Networks (ResNets): they stack these
skip residual blocks together. We use an identity function to preserve the gradient.
Mathematically, we can represent the residual block and calculate its partial derivative (gradient), given the loss function, like this:
    y = F(x) + x,   dL/dx = (dL/dy) · (dF/dx + 1) = (dL/dy) · dF/dx + dL/dy
so even if dF/dx becomes very small, the term dL/dy still reaches the earlier layers through the identity path.
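A minimal residual block sketch in PyTorch (the channel count and layer choices are assumptions; the essential part is the identity addition out + x):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels=64):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.conv1(x))
            out = self.conv2(out)
            # skip connection: add the input back, so the gradient can flow through the identity path
            return self.relu(out + x)

    block = ResidualBlock()
    y = block(torch.randn(1, 64, 32, 32))   # same shape in and out: (1, 64, 32, 32)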
Apart from the vanishing gradients, there is another reason that we commonly use them. For a
plethora of tasks (such as semantic segmentation, optical flow estimation, etc.) there is some
information that was captured in the initial layers and we would like to allow the later layers to
also learn from them. It has been observed that in earlier layers the learned features
correspond to lower semantic information that is extracted from the input. If we had not used
the skip connection that information would have turned too abstract.
As stated, for many dense prediction problems, there is low-level information shared between
the input and output, and it would be desirable to pass this information directly across the
net. The alternative way that we can achieve skip connections is by concatenation of previous feature maps. The best-known deep learning architecture built on this idea is DenseNet. An example of feature reusability by concatenation with a few convolutional layers is sketched after the list below.
This architecture heavily uses feature concatenation to ensure maximum information flow between layers in the network. This is achieved by connecting all layers directly with each other via concatenation, as opposed to ResNets. Practically, what we do is concatenate along the feature channel dimension.
This leads to:
a) an enormous number of feature channels in the last layers of the network,
b) more compact models, and
c) extreme feature reusability.
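A sketch of skip connections by concatenation in the DenseNet style (the layer count, channel sizes, and growth rate are assumptions); each layer receives the feature maps of all earlier layers:

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        def __init__(self, in_channels=16, growth_rate=16, num_layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            for i in range(num_layers):
                self.layers.append(nn.Sequential(
                    nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1),
                    nn.ReLU(),
                ))

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                # each layer sees the concatenation of all previous feature maps
                out = layer(torch.cat(features, dim=1))
                features.append(out)
            return torch.cat(features, dim=1)    # channels grow with every layer

    block = DenseBlock()
    y = block(torch.randn(1, 16, 32, 32))        # output channels: 16 + 3 * 16 = 64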
Short skip connections are used along with consecutive convolutional layers that do not change the input dimension (see ResNet), while long skip connections usually exist in encoder-decoder
architectures. It is known that the global information (shape of the image and other statistics)
resolves what, while local information resolves where (small details in an image patch).
Long skip connections often exist in architectures that are symmetrical, where the spatial dimensionality is reduced in the encoder part and is gradually increased in the decoder part. In the decoder part, one can increase the dimensionality of a feature map via transpose convolutional layers. The transposed convolution operation forms the same connectivity as the normal convolution but in the backward direction.
Benefits of skip connections
Skip connections can provide several benefits for CNNs, such as improving accuracy and
generalization, solving the vanishing gradient problem, and enabling deeper networks. Skip
connections can help the network to learn more complex and diverse patterns from the data and
reduce the number of parameters and operations needed by the network. Additionally, skip
connections can help to alleviate the problem of vanishing gradients by providing alternative paths
for the gradients to flow. Furthermore, they can make it easier and faster to train deeper networks,
which have more expressive power and can capture more features from the data.
Skip connections are a popular and powerful technique for improving the performance and
efficiency of CNNs, but they are not a panacea. They can help preserve information and gradients,
combine features, solve the vanishing gradient problem, and enable deeper networks. However,
they can also increase complexity and memory requirements, introduce redundancy and noise, and
require careful design and tuning to match the network architecture and data domain. Different
types and locations of skip connections can have different impacts on the network performance,
with some being more beneficial or harmful than others. Thus, it is essential to understand how
skip connections work and how to use them wisely and effectively for CNNs.
Dropouts
Dropout refers to units (neurons) that are intentionally and temporarily dropped from a neural network during training in order to reduce over-fitting and improve generalization. A neural network is software attempting to emulate the actions of the human brain.
Neural networks are the building blocks of any machine-learning architecture. They consist of one
input layer, one or more hidden layers, and an output layer.
When we train our neural network (or model) by updating each of its weights, it might become too dependent on the dataset we are using. Then, when this model has to make a prediction or classification, it will not give satisfactory results. This is known as over-fitting. We can understand this problem through a real-world example: if a student of mathematics learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012.
This technique is known as dropout.
The basic idea of this method is to, based on probability, temporarily “drop out” neurons from our
original network. Doing this for every training example gives us different models for each one.
Afterwards, when we want to test our model, we take the average of each model to get our
answer/prediction.
We assign ‘p’ to represent the probability of a neuron, in the hidden layer, being excluded from
the network; this probability value is usually equal to 0.5. We do the same process for the input
layer whose probability value is usually lower than 0.5 (e.g. 0.2). Remember, we delete the
connections going into, and out of, the neuron when we drop it.
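A short sketch of dropout during training (PyTorch; the layer sizes are assumptions, while the p values of 0.2 for the input and 0.5 for the hidden layer follow the probabilities mentioned above):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Dropout(p=0.2),        # drop input units with probability 0.2
        nn.Linear(20, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),        # drop hidden units with probability 0.5
        nn.Linear(64, 2),
    )

    x = torch.randn(8, 20)

    model.train()                 # dropout active: a different thinned network per example
    train_out = model(x)

    model.eval()                  # dropout disabled: the full network approximates the average
    with torch.no_grad():
        test_out = model(x)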