Unit V
A simple RNN has a feedback loop, as shown in the first diagram of the above figure. The feedback loop shown in the gray rectangle can be unrolled over three time steps to produce the second network of the above figure. Of course, you can vary the architecture so that the network unrolls over 𝑘 time steps.
Hence, in the feedforward pass of an RNN, the network computes the values of
the hidden units and the output after 𝑘 time steps. The weights associated with
the network are shared temporally.
Each recurrent layer has two sets of weights:
• One for the input
• One for the hidden unit
The last feedforward layer, which computes the final output for the 𝑘th time step, is just like an ordinary layer of a traditional feedforward network. A minimal sketch of this forward pass is given below.
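As a minimal sketch (plain NumPy; the dimensions and weight names are illustrative assumptions, not taken from the text), the same input-to-hidden and hidden-to-hidden weights are reused at every time step, and the output is computed from the hidden state after the final step:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Unroll a vanilla RNN over the input sequence with temporally shared weights."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state
    for x_t in x_seq:                     # the same W_xh, W_hh, b_h are reused at each step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return W_hy @ h + b_y                 # output computed after the k-th time step

# Hypothetical sizes: 4-dimensional inputs, 8 hidden units, 2 outputs, k = 5 steps
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
W_hy, b_h, b_y = rng.normal(size=(2, 8)), np.zeros(8), np.zeros(2)
y = rnn_forward([rng.normal(size=4) for _ in range(5)], W_xh, W_hh, W_hy, b_h, b_y)
```

The key point is that W_xh, W_hh, and b_h are shared across all time steps; only the final layer (W_hy, b_y) acts like an ordinary feedforward layer.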
Types of RNNs:
• One to Many
This type of neural network has a single input and multiple outputs. Image captioning is an example of this.
• Many to One
This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is an example of this sort of network, where a given sentence can be classified as expressing positive or negative sentiment.
• Many to Many
This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation is one example.
Suppose you wish to predict the last word in the text: “The clouds are in the ______.”
The most obvious answer to this is “sky.” We don’t need any further context to predict the last word in the above sentence.
Now consider this sentence: “I have been staying in Spain for the last 10 years… I can speak fluent ______.”
The word you predict will depend on the previous few words in the context. Here, you need the context of Spain to predict the last word in the text, and the most fitting answer for this sentence is “Spanish.” The gap between the relevant information and the point where it is needed may become very large. LSTMs help to solve this problem.
*****************************************************************
Bidirectional Recurrent Neural Networks (BRNNs)
Inputting a sequence:
A sequence of data points, each represented as a vector with the same dimensionality, is fed into a BRNN. The sequences may have different lengths.
Dual Processing:
The data is processed in both the forward and backward directions. In the forward direction, the hidden state at time step t is determined from the input at that step and the hidden state at step t-1. In the backward direction, the hidden state at step t is calculated from the input at step t and the hidden state at step t+1.
A non-linear activation function on the weighted sum of the input and previous
hidden state is used to calculate the hidden state at each step. This creates a
memory mechanism that enables the network to remember data from earlier
steps in the process.
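A minimal NumPy sketch of this dual processing (the weight names are illustrative assumptions) computes one hidden-state sequence left-to-right and a second one right-to-left, then concatenates them at each step:

```python
import numpy as np

def birnn_hidden_states(x_seq, W_x_f, W_h_f, b_f, W_x_b, W_h_b, b_b):
    """Forward pass of one bidirectional RNN layer; returns a concatenated state per step."""
    T, H = len(x_seq), W_h_f.shape[0]
    h_f, h_b = np.zeros((T, H)), np.zeros((T, H))
    for t in range(T):                       # forward direction: uses the state at t-1
        prev = h_f[t - 1] if t > 0 else np.zeros(H)
        h_f[t] = np.tanh(W_x_f @ x_seq[t] + W_h_f @ prev + b_f)
    for t in reversed(range(T)):             # backward direction: uses the state at t+1
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(H)
        h_b[t] = np.tanh(W_x_b @ x_seq[t] + W_h_b @ nxt + b_b)
    return np.concatenate([h_f, h_b], axis=1)   # shape (T, 2H)

# Toy usage: 6 steps of 4-dimensional inputs, 8 hidden units per direction
rng = np.random.default_rng(0)
x_seq = [rng.normal(size=4) for _ in range(6)]
states = birnn_hidden_states(
    x_seq,
    rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8),   # forward weights
    rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8),   # backward weights
)   # states.shape == (6, 16)
```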
Training:
The network is trained through a supervised learning approach where the goal
is to minimize the discrepancy between the predicted output and the actual
output. The network adjusts its weights in the input-to-hidden and hidden-to-
output connections during training through backpropagation.
To calculate the hidden state and the output of an RNN unit, we use the following formulas:
h_t = A(W_x · x_t + W_h · h_(t-1) + b_h)
y_t = A(W_y · h_t + b_y)
where,
A = activation function, W = weight matrix, b = bias
A key advantage of BRNNs:
• Enhanced accuracy:
BRNNs frequently yield more accurate predictions since they take both past and future context into account.
Bi-RNNs have been applied to various natural language processing (NLP) tasks,
including:
• Sentiment Analysis:
By taking into account both the prior and subsequent context, BRNNs can be used to classify the sentiment of a given sentence (a sketch follows this list).
• Part-of-Speech Tagging:
The classification of words in a phrase into their corresponding parts of
speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
• Machine Translation:
BRNNs can be used in encoder-decoder models for machine translation,
where the decoder creates the target sentence and the encoder analyses
the source sentence in both directions to capture its context.
• Speech Recognition:
When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech
recognition systems.
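For example, a hedged tf.keras sketch of a bidirectional LSTM for the sentiment-analysis task mentioned above (the vocabulary size and layer sizes are assumptions, not values from the text):

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 10_000, 128   # illustrative assumptions

sentiment_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # The Bidirectional wrapper runs the LSTM over the sequence in both directions
    # and concatenates the forward and backward states.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # positive vs. negative sentiment
])
sentiment_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```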
However, BRNNs also have some drawbacks:
• Computational complexity:
Given that they process data both forward and backward, BRNNs can be computationally expensive due to the increased number of calculations required.
• Difficulty in parallelization:
Due to the requirement for sequential processing in both the forward and
backward directions, BRNNs can be challenging to parallelize.
• Overfitting:
BRNNs are prone to overfitting since they have many parameters, which can result in overly complicated models, especially when trained on small datasets.
• Interpretability:
Due to the processing of data in both forward and backward directions,
BRNNs can be tricky to interpret since it can be difficult to comprehend
what the model is doing and how it is producing predictions.
Deep Recurrent Networks (DRNs)
• RNNs are a type of neural network designed to work with sequential data, where the output of each step is dependent on the previous steps.
• This makes them particularly suitable for tasks like natural language processing (NLP), time series prediction, and speech recognition.
• Each layer in a deep recurrent network (DRN) passes its output as input to the next layer, enabling the network to learn hierarchical representations of sequential data.
There are several types of recurrent units that can be used in deep recurrent networks (instantiated in the sketch after this list), such as:
• Vanilla RNNs:
These are the simplest form of recurrent units, where the output is
computed based on the current input and the previous hidden state.
• Long Short-Term Memory (LSTM):
LSTMs are a type of recurrent unit that introduces gating mechanisms to
control the flow of information within the network, allowing it to learn
long-range dependencies more effectively and mitigate the vanishing
gradient problem.
• Gated Recurrent Units (GRUs):
GRUs are like LSTMs but have a simpler structure with fewer parameters,
making them computationally more efficient.
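All three unit types expose the same interface in common deep-learning libraries; for instance, in tf.keras (the layer size of 64 is chosen arbitrarily):

```python
import tensorflow as tf

# The three recurrent unit types described above, interchangeable within a deep RNN
vanilla_rnn = tf.keras.layers.SimpleRNN(64)   # vanilla recurrent unit
lstm_unit   = tf.keras.layers.LSTM(64)        # gating mitigates vanishing gradients
gru_unit    = tf.keras.layers.GRU(64)         # simpler gating, fewer parameters
```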
Steps to develop a deep RNN application
Developing an end-to-end deep RNN application involves several steps,
including data preparation, model architecture design, training the model, and
deploying it. Here is an example of an end-to-end deep RNN application for
sentiment analysis.
Data preparation:
The first step is to gather and preprocess the data. In this case, we’ll need a
dataset of text reviews labelled with positive or negative sentiment. The text data needs to be cleaned, tokenized, and converted to a numerical format. This can be done using libraries like NLTK or spaCy in Python.
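A minimal preprocessing sketch (plain Python, with a naive regex tokenizer standing in for NLTK or spaCy; all names here are illustrative) that cleans the reviews, tokenizes them, and converts the tokens to integer ids:

```python
import re

def clean_and_tokenize(text):
    """Lowercase, strip punctuation, and split into tokens (a stand-in for NLTK/spaCy)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(reviews):
    """Map each token to an integer id; 0 is reserved for unknown/padding."""
    vocab = {}
    for review in reviews:
        for tok in clean_and_tokenize(review):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode(review, vocab):
    return [vocab.get(tok, 0) for tok in clean_and_tokenize(review)]

reviews = ["Great movie, loved it!", "Terrible plot and bad acting."]   # toy examples
vocab = build_vocab(reviews)
encoded = [encode(r, vocab) for r in reviews]   # integer sequences, ready for padding
```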
Model architecture design:
A typical model for this task consists of the following components, with multiple recurrent layers stacked together (a model sketch follows this list):
• Embedding Layer:
Converts the input sequence into a dense representation suitable for
processing by the recurrent layers. It typically involves mapping each
element of the sequence (e.g., word or data point) to a high-dimensional
vector space.
• Recurrent Layers:
Consist of multiple recurrent units stacked together. Each layer processes
the input sequence sequentially, capturing temporal dependencies.
Common types of recurrent units include vanilla RNNs, LSTMs, and GRUs.
• Output Layer:
Takes the output from the recurrent layers and produces the final
prediction or output. The structure of this layer depends on the specific
task, such as classification (e.g., softmax activation) or regression (e.g.,
linear activation).
• Output (Prediction):
The final output of the network, which could be a sequence of predictions
for each time step or a single prediction for the entire sequence,
depending on the task.
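Putting these components together, a hedged tf.keras sketch of the sentiment-analysis model (all sizes are assumptions) could look like this; training then minimizes binary cross-entropy through backpropagation:

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 10_000, 128   # illustrative assumptions

model = tf.keras.Sequential([
    # Embedding layer: maps token ids to dense vectors
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Recurrent layers: multiple units stacked together; the lower layer returns its
    # full output sequence so the next layer can process it step by step
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    # Output layer: a single prediction for the entire sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=..., validation_split=...) would train it
```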
• Transfer Learning:
Pre-training deep recurrent networks on large-scale datasets for related
tasks (e.g., language modeling) and fine-tuning them for specific tasks
often leads to improved performance. The hierarchical representations
learned during pre-training capture generic features of the data, which
can be beneficial for downstream tasks with limited labeled data.
• Computational Complexity:
Deep recurrent networks with multiple layers can be computationally
expensive to train and deploy, especially when dealing with large-scale
datasets or complex architectures. The computational complexity
increases with the number of layers, making it challenging to train deep
models on resource-constrained devices or in real-time applications.
• Overfitting:
Deep recurrent networks are prone to overfitting, especially when dealing with small datasets or overly complex models. With a large number of parameters, deep models have a high capacity to memorize noise or irrelevant patterns in the training data, leading to poor generalization performance on unseen data. Regularization techniques such as dropout and weight decay are commonly used to prevent overfitting (see the sketch after this list).
• Difficulty in Interpretability:
Understanding the internal workings of deep recurrent networks and
interpreting their decisions can be challenging. With multiple layers of
non-linear transformations, it can be difficult to interpret the learned
representations and understand how the network arrives at a particular
prediction. This lack of interpretability can be a significant drawback in
applications where transparency and interpretability are essential.
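To make the regularization point above concrete, dropout and weight decay can be attached directly to a recurrent layer; a small tf.keras sketch with assumed hyperparameter values:

```python
import tensorflow as tf

# Regularized LSTM layer: dropout on inputs and recurrent connections, plus L2 weight decay
regularized_lstm = tf.keras.layers.LSTM(
    64,
    dropout=0.2,                                         # drop a fraction of the input units
    recurrent_dropout=0.2,                               # drop a fraction of the recurrent units
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),   # weight decay on the input weights
)
```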
In the above diagram, there are three modules with two additional novel blocks in the end-to-end framework, i.e., the encoder network, analysis block, binarizer, decoder network, and synthesis block. Image patches are given directly to the analysis block as input, and it generates latent features using the proposed analysis encoder block. The entire framework architecture is presented in the architecture diagram.
A single iteration of the end-to-end framework is represented in the equation below.
Regularization techniques commonly used in autoencoders include the following (a combined sketch follows this list):
• L1 and L2 Regularization:
L1 and L2 regularization penalize the magnitude of the weights in the autoencoder's neural network. By adding a regularization term to the loss function proportional to either the L1 or L2 norm of the weights, these techniques encourage sparsity (in the case of L1 regularization) or small weights (in the case of L2 regularization), helping prevent overfitting.
• Dropout:
Dropout is a regularization technique that randomly sets a fraction of the
input units to zero during each training iteration. This helps prevent the
autoencoder's neural network from relying too heavily on any individual
input features, forcing it to learn more robust representations.
• Batch Normalization:
Batch normalization normalizes the activations of each layer in the
autoencoder's neural network, helping stabilize and accelerate the
training process. By reducing internal covariate shift, batch
normalization acts as a regularizer, making the autoencoder more
resistant to overfitting.
• Noise Injection:
Noise injection involves adding noise to the input data or the activations
of the autoencoder's hidden layers during training. This helps prevent the
autoencoder from memorizing the training data and encourages it to
learn more generalizable representations.
• Contractive Regularization:
Contractive regularization penalizes the Frobenius norm of the Jacobian
matrix of the encoder with respect to the input data. This encourages the
encoder to learn representations that are invariant to small changes in
the input data, making the autoencoder more robust to variations in the
input.
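As a combined illustration of several of these techniques (not a prescription), here is a hedged tf.keras sketch of a small dense autoencoder that uses noise injection on the input, dropout, an L2 weight penalty, and an L1 sparsity penalty on the code; all dimensions and coefficients are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

INPUT_DIM, CODE_DIM = 784, 32   # e.g. flattened 28x28 images; sizes are assumptions

inputs = tf.keras.Input(shape=(INPUT_DIM,))
x = layers.GaussianNoise(0.1)(inputs)                          # noise injection on the input
x = layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)  # L2 penalty on the weights
x = layers.Dropout(0.2)(x)                                     # dropout on hidden activations
code = layers.Dense(CODE_DIM, activation="relu",
                    activity_regularizer=regularizers.l1(1e-5))(x)  # L1 sparsity on the code
x = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(INPUT_DIM, activation="sigmoid")(x)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
```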
Stochastic Encoders and Decoders
Stochastic Encoder:
In a VAE, the encoder network outputs the parameters of a probability
distribution instead of a deterministic encoding. Instead of directly
outputting the latent representation of the input data, the encoder
outputs the mean and variance (or other parameters) of a Gaussian
distribution that represents the distribution of possible latent variables
given the input. The latent variable is then sampled from this distribution
to generate a stochastic representation.
Stochastic Decoder:
Similarly, the decoder network in a VAE accepts a sampled latent variable
as input instead of a deterministic encoding. This sampled latent variable
is generated by sampling from the distribution outputted by the encoder.
The decoder then generates the reconstructed output based on this
sampled latent variable.
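A minimal tf.keras sketch of this stochastic encoder/decoder pair (layer sizes are assumptions, and the KL-divergence term of the VAE loss is omitted for brevity) uses the reparameterization trick to sample the latent variable:

```python
import tensorflow as tf
from tensorflow.keras import layers

INPUT_DIM, LATENT_DIM = 784, 2   # illustrative assumptions

# Stochastic encoder: outputs the mean and log-variance of a Gaussian over latent variables
enc_in = tf.keras.Input(shape=(INPUT_DIM,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)

def sample(args):
    """Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)."""
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])   # sampled (stochastic) latent variable

# Stochastic decoder: reconstructs the input from the sampled latent variable
h_dec = layers.Dense(256, activation="relu")(z)
dec_out = layers.Dense(INPUT_DIM, activation="sigmoid")(h_dec)

vae = tf.keras.Model(enc_in, dec_out)
```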
Cost Function Calculation
Contractive autoencoders