
Unit 2 DL

Deep Learning

Uploaded by

Prawin raj
Copyright
© © All Rights Reserved

UNIT II

RECURRENT NEURAL NETWORKS

LSTM - GRU - Encoder Decoder architectures - Deep Unsupervised Learning: Autoencoders - Variational Auto-encoders - Adversarial Generative Networks - Auto-encoder and DBM - Attention and memory models - Dynamic Memory Models

1. Dynamic Memory Models


Dynamic memory models extend Recurrent Neural Networks (RNNs) so that they can capture and use information over long sequences. The key approaches are the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, which address the vanishing gradient problem of traditional RNNs. Attention mechanisms allow networks to focus on the relevant parts of an input sequence, while memory-augmented networks add an external memory for more flexible learning. Hierarchical RNNs, and non-recurrent transformer models such as BERT, further improve the modeling of complex dependencies in sequential data. Together, these models handle long-term dependencies and context better across a wide range of tasks.

1.1 LSTM:

Long Short-Term Memory (LSTM) stands out as a significant breakthrough in recurrent neural network (RNN) architecture, specifically crafted to overcome challenges
like the vanishing gradient problem encountered in traditional RNNs. Designed by
Hochreiter and Schmidhuber, LSTM excels in capturing long-term dependencies, making it
particularly adept for sequence prediction tasks in various domains such as time series
analysis, natural language processing (NLP), and speech recognition.

At the heart of LSTM's innovation lies the incorporation of a unique architectural element known as the memory cell. This dynamic container facilitates the storage, access,
and manipulation of information over extended sequences. Governed by input, forget, and
output gates, the memory cell enables the LSTM network to selectively retain or discard
information, providing a powerful mechanism for learning and utilizing long-term
dependencies within sequential data.

Beyond its fundamental capabilities, LSTMs can be further optimized by stacking them to form deep LSTM networks, enhancing their ability to comprehend and learn
intricate patterns within sequential data. Furthermore, their adaptability extends to
collaborative use with other neural network architectures like Convolutional Neural
Networks (CNNs), particularly beneficial in applications such as image and video analysis.
In essence, LSTM stands as a potent solution in the domain of deep learning, showcasing
exceptional performance across a diverse spectrum of challenges associated with processing
sequence-based data.

1.1.1 Working mechanism of LSTM:

The working mechanism of a Long Short-Term Memory (LSTM) network is orchestrated through a dynamic interplay of its key components: the input gate (i_t), the output gate (o_t), and the forget gate (f_t). At each time step t, the input gate takes charge of regulating
the flow of new information into the memory cell, evaluating the relevance of incoming data
through a sigmoid activation function. Simultaneously, the memory cell state undergoes an
update, guided by the candidate values generated by the hyperbolic tangent (tanh) function.
This candidate set captures the essential features of the current input. The subsequent
memory cell update selectively combines these candidate values with the existing memory
cell state, modulated by the output of the forget gate. This process allows the LSTM to
adaptively store or discard information, maintaining a relevant memory cell state. Finally,
the output gate governs the generation of the final output (h_t), deciding which information
from the current memory cell state should be transmitted to subsequent layers or serve as the
network's output. This orchestrated mechanism enables LSTMs to effectively capture and
utilize long-term dependencies in sequential data, making them a powerful tool in various
applications, including natural language processing and time series analysis.

1.1.2 LSTM Components:


1.1.2.1 Memory Cell:

The memory cell in a Long Short-Term Memory (LSTM) network can be likened to a
reservoir proficiently storing and retrieving information across extensive sequences. It
assumes a pivotal role in empowering the network to grasp and preserve vital patterns over
time, addressing the complexities associated with learning from prolonged dependencies.
Unlike traditional neural networks, the memory cell enables LSTMs to selectively manage
information, offering a nuanced comprehension of context in sequential data.

1.1.2.2 Input Gate:

The input gate plays a crucial role in overseeing the inflow of new information into
the memory cell, determining the relevance of incoming data that should be stored. It is
responsible for selectively incorporating useful information into the cell state. The process
begins with the application of the sigmoid function to regulate the information, akin to the forget gate, utilizing inputs h_t-1 and x_t. This sigmoid activation filters the values to be retained. Subsequently, a vector is constructed using the hyperbolic tangent (tanh) function, producing an output ranging from -1 to +1 and encapsulating all potential values from h_t-1 and x_t. Finally, the vector values and the regulated values are multiplied to yield the pertinent information that contributes to the memory cell state. The equations for the input gate, the candidate values, and the resulting cell state update are:

i_t = σ(W_i · [h_t-1, x_t] + b_i)

Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)

C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t

where

• ⊙ denotes element-wise multiplication

• tanh is the hyperbolic tangent activation function


Here, i_t represents the output of the input gate, determined by the weighted sums and biases
in conjunction with the sigmoid activation function.

1.1.2.3 Output Gate:

The output gate holds authority over the information that is conveyed to subsequent layers or
serves as the output of the LSTM, ensuring that pertinent information is effectively
integrated into the network's overarching processing. It is responsible for the crucial task of
extracting valuable information from the current cell state for presentation as the output. The
process unfolds with the application of the hyperbolic tangent (tanh) function to generate a vector from the cell state. Subsequently, the information is regulated using the sigmoid function, similar to the input gate, with filtering based on the values to be retained, involving inputs h_t-1 and x_t. Lastly, the product of the vector values and the regulated values is computed, constituting the hidden state h_t that is transmitted to subsequent layers and serves as input to the next cell. The equations for the output gate and the resulting hidden state are:

o_t = σ(W_o · [h_t-1, x_t] + b_o)

h_t = o_t ⊙ tanh(C_t)


1.1.2.4 Forget Gate:

The forget gate assumes a pivotal role in the strategic elimination of unnecessary or less
relevant information from the memory cell, thereby facilitating a selective forgetting
mechanism crucial for maintaining the pertinence of stored information. This mechanism
targets the removal of information that is no longer deemed useful in the cell state. The
forget gate takes two inputs: x_t (the input at the current time step) and h_t-1 (the previous cell output). These inputs are multiplied by weight matrices, and a bias is added. The result is then passed through a sigmoid activation function, yielding a value between 0 and 1 for each element of the cell state. An output close to 0 signifies that the associated piece of information is forgotten, while an output close to 1 indicates that the information is retained for future use. The equation for the forget gate is:

f_t = σ(W_f · [h_t-1, x_t] + b_f)


where:

• W_f represents the weight matrix associated with the forget gate.

• [h_t-1, x_t] denotes the concatenation of the previous hidden state and the current input.

• b_f is the bias associated with the forget gate.

• σ is the sigmoid activation function.
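
The gate equations of this section can be combined into a single time step. The following is a minimal NumPy sketch rather than a reference implementation: the helper names `lstm_step` and `init_params`, the parameter dictionary, the layer sizes, and the random initialisation are all assumptions made for illustration. The last line of the step, h_t = o_t ⊙ tanh(C_t), is the standard LSTM output rule that the output gate description implies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step built from the gate equations in this section."""
    concat = np.concatenate([h_prev, x_t])                   # [h_t-1, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])    # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])    # input gate
    c_hat = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_hat                         # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])    # output gate
    h_t = o_t * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(
            0.0, 0.1, (hidden_size, hidden_size + input_size))
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# Run a 5-step sequence of 3-dimensional inputs through a 4-unit cell.
params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
sequence = np.random.default_rng(1).normal(size=(5, 3))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Because the forget and input gates are sigmoid outputs, they scale the old cell state and the candidate values by factors between 0 and 1 instead of switching them fully on or off.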

1.1.3 Advantages of LSTM:

• Long-Term Dependency Handling
• Selective Information Retention
• Effective in Sequence Prediction
• Mitigation of Vanishing Gradient Problem
• Versatility in Architectural Configurations
• Combination with Other Architectures

1.1.4 Disadvantages of LSTM:

• Computational Complexity
• Increased Training Data Requirements
• Potential Overfitting
• Difficulty in Interpretability
• Hyperparameter Sensitivity
• Resource Intensive

1.1.5 Applications of LSTM:

1. Natural Language Processing (NLP)
2. Speech Recognition
3. Time Series Forecasting
4. Gesture Recognition
5. Healthcare
6. Autonomous Vehicles
7. Financial Forecasting
8. Video Analysis and Action Recognition

1.2 GRU

The Gated Recurrent Unit (GRU) is an advancement of the standard recurrent neural network (RNN). It was introduced by Kyunghyun Cho et al. in 2014. GRUs are very similar to Long Short-Term Memory (LSTM) networks: just like an LSTM, a GRU uses gates to control the flow of information. GRUs are newer than LSTMs and were designed with a simpler architecture. Unlike an LSTM, a GRU does not have a separate cell state (C_t); it only has a hidden state (H_t). Because of this simpler architecture, GRUs are generally faster to train.

1.2.1 The architecture of Gated Recurrent Unit

Let's now look at how a GRU works. A GRU cell is broadly similar to an LSTM cell or a plain RNN cell.

At each timestamp t, it takes an input x_t and the hidden state H_t-1 from the previous timestamp t-1. It then outputs a new hidden state H_t, which is in turn passed to the next timestamp.

A GRU has two main gates, as opposed to the three gates in an LSTM cell. The first is the reset gate and the other is the update gate.

1.2.1.1 Reset Gate (Short term memory)


The reset gate is responsible for the short-term memory of the network, i.e. the hidden state (H_t). In the standard GRU formulation, the reset gate is computed as:

r_t = σ(x_t · U_r + H_t-1 · W_r)

This is very similar to the LSTM gate equations. The value of r_t ranges from 0 to 1 because of the sigmoid function. Here U_r and W_r are the weight matrices for the reset gate.

1.2.1.2 Update Gate (Long Term memory)

Similarly, the update gate handles long-term memory. In the standard formulation its equation has the same form as that of the reset gate:

u_t = σ(x_t · U_u + H_t-1 · W_u)

The only difference is the weight matrices, i.e. U_u and W_u.

1.2.2 How GRU Works

Now let's see how these gates function. A GRU computes the hidden state H_t in two steps. The first step is to generate what is known as the candidate hidden state.

1.2.2.1 Candidate Hidden State

In the standard formulation, the candidate hidden state is computed as:

Ĥ_t = tanh(x_t · U_g + (r_t ⊙ H_t-1) · W_g)

where U_g and W_g denote the weight matrices of the candidate state. The cell takes the input and the hidden state from the previous timestamp t-1, which is multiplied element-wise by the reset gate output r_t. This combined information is then passed through the tanh function, and the resulting value is the candidate hidden state.

The most important part of this equation is how the value of the reset gate controls how much influence the previous hidden state can have on the candidate state.

If the value of r_t is equal to 1, the entire information from the previous hidden state H_t-1 is considered. Likewise, if the value of r_t is 0, the information from the previous hidden state is completely ignored.

1.2.2.2 Hidden state

Once we have the candidate state, it is used to generate the current hidden state H_t. This is where the update gate comes into the picture. Instead of using separate gates as in an LSTM, a GRU uses a single update gate to control both the historical information, H_t-1, and the new information that comes from the candidate state:

H_t = u_t ⊙ H_t-1 + (1 - u_t) ⊙ Ĥ_t

Now assume the value of u_t is around 0. Then the first term in the equation vanishes, which means the new hidden state will not carry much information from the previous hidden state. At the same time, the factor (1 - u_t) in the second term becomes almost 1, which means the hidden state at the current timestamp will consist of the information from the candidate state only.

Similarly, if the value of u_t is 1, the second term becomes entirely 0 and the current hidden state depends entirely on the first term, i.e. the information from the hidden state at the previous timestamp t-1.

Hence the value of u_t, which ranges from 0 to 1, is critical in this equation.
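
The two-step computation described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: it uses the standard GRU formulation without bias terms, the candidate-state weight names `U_g` and `W_g` and the function name `gru_step` are hypothetical, and the sizes and random weights are chosen only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step: reset gate, update gate, candidate, new hidden state."""
    r_t = sigmoid(x_t @ p["U_r"] + h_prev @ p["W_r"])             # reset gate
    u_t = sigmoid(x_t @ p["U_u"] + h_prev @ p["W_u"])             # update gate
    h_cand = np.tanh(x_t @ p["U_g"] + (r_t * h_prev) @ p["W_g"])  # candidate state
    h_t = u_t * h_prev + (1.0 - u_t) * h_cand                     # blend old and new
    return h_t, r_t, u_t

# Assumed sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
p = {}
for gate in ("r", "u", "g"):
    p[f"U_{gate}"] = rng.normal(0.0, 0.1, (input_size, hidden_size))
    p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, r_t, u_t = gru_step(x_t, h, p)
print(h)
```

Note that only the hidden state `h` is carried from one step to the next; there is no separate cell state as in an LSTM.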

1.2.2.3 End Notes

To summarize, let's see how a GRU differs from an LSTM.

An LSTM has three gates, whereas a GRU has only two. In an LSTM they are the input gate, forget gate, and output gate; in a GRU they are the reset gate and the update gate.

An LSTM has two states: the cell state (long-term memory) and the hidden state (short-term memory). A GRU has only one state, the hidden state (H_t).

2. Encoder-decoder architecture

2.1 What is seq2seq Model in Machine Learning?

Seq2Seq, initially introduced by Google for machine translation, marked a paradigm shift from the rudimentary approach prevalent before its inception. Previously, translations
lacked consideration for grammar and sentence structure, merely converting each word
without contextual understanding. The advent of Seq2Seq brought about a transformative
approach, leveraging deep learning to not only translate based on the current word but also
taking its context into account.

Seq2Seq, short for Sequence-to-Sequence, represents a machine learning model with versatile applications like machine translation, text summarization, and image captioning.
Comprising two fundamental components, namely the Encoder and the Decoder, Seq2Seq
models are trained using datasets containing input-output pairs, where both input and output
are sequences of tokens. The training objective is to maximize the likelihood of generating
the correct output sequence given the input sequence.

Extensively employed in natural language processing tasks, Seq2Seq models excel in handling variable-length input and output sequences. The incorporation of the attention mechanism further enhances performance, enabling the decoder to focus selectively on specific segments of the input sequence while generating the output. Seq2Seq has thus evolved into a versatile tool, finding utility in diverse domains such as image captioning, conversational models, and text summarization.

2.2 How the Sequence to Sequence Model works?

In order to fully understand the model’s underlying logic, we will go over the below
illustration:

2.2.1 Encoder-Decoder Stack

The Encoder-Decoder Stack, as implied by its name, processes a sequence of words (either a sentence or multiple sentences) as input and produces an output sequence of words.
This operation is facilitated through the utilization of a recurrent neural network (RNN).
While the conventional RNN is infrequently employed due to the vanishing gradient issue,
more advanced versions such as LSTM or GRU are preferred. In the context of seq2seq,
Google's proposed version incorporates LSTM, addressing the vanishing gradient problem.

The LSTM builds up word context by considering two inputs at each time step: the current input token and its own previous output, illustrating the recurrent nature in which the output serves as input. Both the encoder and decoder components are commonly implemented using Recurrent Neural Networks (RNNs) or Transformers.

The model consists of three parts:

• encoder,
• intermediate (encoder) vector,
• decoder.

2.2.1.1 Encoders

The process of comprehending text reflects the iterative nature of human cognition,
where each word in a sentence undergoes systematic processing, accumulating information
until the completion of the text. This iterative accumulation of information finds a parallel in
the Deep Learning field with Recurrent Neural Networks (RNNs), which operate through
recurrent iterations over similar units. In this context, a text encoder plays a pivotal role,
transforming textual content into a numeric representation. This transformation involves the
use of a stack of recurrent units, often employing LSTM or GRU cells for superior
performance. Each recurrent unit within the stack processes a single element of the input
sequence, collecting and propagating information forward. In scenarios like question-answering problems, the input sequence comprises all words from the question, with each word represented as x_i based on its sequential order in the question. The hidden states h_i are computed using the formula:

h_t = f(W^(hh) · h_(t-1) + W^(hx) · x_t)

Here W^(hh) and W^(hx) denote the hidden-to-hidden and input-to-hidden weight matrices, and f is the activation function. This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.

2.2.1.2 Encoder Vector

• This is the final hidden state produced from the encoder part of the model. It is calculated using the formula above.

• This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.

• It acts as the initial hidden state of the decoder part of the model.

2.2.1.3 Decoders

In contrast to encoders, decoders unfold a vector that represents the sequential state
and generates meaningful outputs such as text, tags, or labels. A crucial distinction from
encoders is that decoders necessitate both the hidden state and the output from the preceding
state. The decoder comprises a stack of multiple recurrent units, where each unit predicts an
output y_t at a specific time step t. Each recurrent unit receives a hidden state from the
preceding unit and produces an output along with its own hidden state. In the context of a
question-answering problem, the output sequence constitutes a compilation of all words from
the answer, with each word denoted as y_i, corresponding to its order in the response. Any hidden state h_i is computed using the formula:

h_t = f(W^(hh) · h_(t-1))

where W^(hh) is the hidden-to-hidden weight matrix and f is the activation function. As you can see, we are just using the previous hidden state to compute the next one.

The output y_t at time step t is computed using the formula:

y_t = softmax(W^(S) · h_t)

We calculate the outputs using the hidden state at the current time step together with the respective weight W^(S). Softmax is used to create a probability vector, which helps us determine the final output.

Let’s make it clearer with the example below, which shows how machine translation works:

The encoder produced state C representing the sentence in the source language (English): I
love learning.
Then, the decoder unfolded that state C into the target language (Spanish): Amo el
aprendizaje.
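
The encoder, encoder vector, and decoder described in this section can be traced in a small forward pass. The sketch below is illustrative only. It assumes the plain RNN recurrences h_t = f(W^(hh) · h_(t-1) + W^(hx) · x_t) for the encoder and h_t = f(W^(hh) · h_(t-1)), y_t = softmax(W^(S) · h_t) for the decoder, with tanh as the activation f. The function names, sizes, and random (untrained) weights are hypothetical, so the predicted token ids carry no meaning.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def encode(inputs, W_hh, W_hx):
    """Fold the input sequence into a single encoder vector."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_hx @ x_t)
    return h  # final hidden state = encoder vector

def decode(h, W_hh, W_s, steps):
    """Unfold the encoder vector into one probability vector per output step."""
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)             # next hidden state from the previous one
        outputs.append(softmax(W_s @ h))  # distribution over the output vocabulary
    return np.array(outputs)

rng = np.random.default_rng(0)
hidden_size, input_size, vocab_size = 6, 4, 5
W_hh_enc = rng.normal(0.0, 0.5, (hidden_size, hidden_size))
W_hx = rng.normal(0.0, 0.5, (hidden_size, input_size))
W_hh_dec = rng.normal(0.0, 0.5, (hidden_size, hidden_size))
W_s = rng.normal(0.0, 0.5, (vocab_size, hidden_size))

# Three random vectors stand in for the embeddings of a three-word source sentence.
source = rng.normal(size=(3, input_size))
encoder_vector = encode(source, W_hh_enc, W_hx)
probs = decode(encoder_vector, W_hh_dec, W_s, steps=3)
predicted_ids = probs.argmax(axis=1)
print(predicted_ids)
```

The only link between the two halves is `encoder_vector`, which is why this fixed-size vector has to summarize the whole input sequence.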

2.3 Advantages of Encoder-Decoder Architectures:

1. Versatility Across Tasks

2. Information Compression

3. Context-Aware Generation

4. Handling Variable-Length Input

5. Attention Mechanism

2.4 Disadvantages of Encoder-Decoder Architectures:

1. Complexity

2. Overfitting

3. Training Time

4. Need for Sufficient Data

5. Difficulty in Capturing Long-Term Dependencies

6. Interpretability

7. Resource Intensive

3. Deep learning

Deep learning, a specialized domain within machine learning, centers around artificial neural network architecture. This branch possesses the capability to discern intricate patterns and relationships within data, eliminating the need for explicit programming. Bolstered by advances in processing power and the availability of extensive datasets, deep learning is grounded in artificial neural networks (ANNs) with many layers, known as deep neural networks (DNNs). These networks draw inspiration from the structure and function of the human brain's biological neurons and are designed to learn from substantial amounts of data.

At its core, deep learning utilizes neural networks to model and address complex
problems. These networks mirror the structure and function of the human brain, featuring
interconnected nodes organized in layers. A defining characteristic of deep learning lies in
the deployment of deep neural networks, characterized by multiple layers. These networks
excel in autonomously learning complex representations, unveiling hierarchical patterns and
features without the need for manual feature engineering.

Deep learning's success spans various fields, including image recognition, natural
language processing, speech recognition, and recommendation systems. Prominent
architectures within deep learning encompass Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs). While training
deep neural networks traditionally demands substantial data and computational resources,
the landscape has evolved with the advent of cloud computing and specialized hardware,
such as Graphics Processing Units (GPUs), streamlining the training process.

Deep learning can be used for supervised, unsupervised, and reinforcement machine learning, and it handles each of these settings in a different way.

3.1.1 Supervised machine learning:

Supervised machine learning involves a technique where a neural network learns to predict or classify data by utilizing labeled datasets. In this approach, both input features and
target variables are provided. The neural network refines its predictions through
backpropagation, adjusting its parameters based on the cost or error derived from the
disparity between predicted and actual targets. This iterative process fine-tunes the network's
ability to make accurate predictions. Deep learning algorithms, including Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), find application in
various supervised tasks such as image classification, recognition, sentiment analysis, and
language translation.
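
The training loop described above (predict, measure the error against the labels, adjust the parameters) can be illustrated with the smallest possible network, a single sigmoid neuron. This is a hedged sketch: the toy dataset (logical OR), the learning rate, and the number of steps are assumptions chosen only for illustration. With one layer, backpropagation reduces to the single gradient step shown here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Labeled dataset: input features X with target variable y (logical OR).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)  # weights
b = 0.0          # bias
learning_rate = 1.0
losses = []

for step in range(2000):
    p = sigmoid(X @ w + b)  # forward pass: predictions
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy cost
    losses.append(loss)
    error = p - y           # gradient of the cost w.r.t. the pre-activation
    w -= learning_rate * (X.T @ error) / len(y)  # adjust weights
    b -= learning_rate * np.mean(error)          # adjust bias

predictions = (sigmoid(X @ w + b) >= 0.5).astype(float)
print(losses[0], losses[-1], predictions)
```

The cost falls over the iterations because each update moves the parameters against the gradient of the error between predicted and actual targets.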

3.1.2 Unsupervised machine learning:

Unsupervised machine learning relies on neural networks to autonomously discover patterns or cluster datasets without labeled information. In this technique, no target variables
are provided, and the machine self-determines hidden patterns or relationships within the
datasets. Utilizing deep learning algorithms like autoencoders and generative models,
unsupervised tasks, such as clustering, dimensionality reduction, and anomaly detection, are
effectively addressed. This approach is valuable when dealing with unlabeled datasets,
allowing the system to glean insights and structure from the data without explicit guidance
from labeled outcomes.

3.1.3 Reinforcement Machine Learning

Reinforcement Machine Learning involves an agent learning to make decisions within an environment to maximize a reward signal. The agent interacts with the environment, taking actions and observing the resulting rewards. Deep learning is employed to learn policies, i.e. mappings from states to actions that maximize cumulative reward over time. Deep reinforcement learning algorithms, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), are applied to tasks like robotics and game playing. This technique is particularly effective for scenarios where the agent needs to adapt and learn optimal strategies through trial and error.

3.1.4 Artificial neural networks:

Artificial neural networks, also known as neural networks or neural nets, are
constructed based on the principles of the structure and functioning of human neurons. The
initial layer, called the input layer, receives input from external sources and transmits it to the
second layer known as the hidden layer. Neurons in the hidden layer receive information
from the preceding layer, perform weighted computations, and then transmit the results to
the subsequent layer. These connections are weighted, implying that the influence of inputs
from the previous layer is optimized through distinct weights assigned to each input. During
the training process, these weights are adjusted to improve the model's performance.

Artificial neurons, also referred to as units, constitute the building blocks of artificial neural networks. These networks are organized in layers, and the number of units per layer, which can range from a dozen to millions, depends on the complexity of the underlying patterns in the dataset. Typically, an artificial neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives external data for analysis or learning by the neural network.

In a fully connected artificial neural network, an input layer and multiple hidden
layers are sequentially connected. Each neuron in a layer receives input from the preceding
layer or the input layer. The output of one neuron serves as input for others in the next layer,
continuing until the final layer produces the network's output. As data traverses through
hidden layers, it undergoes transformations, eventually providing valuable information to the
output layer, which delivers the network's response.

Units in neural networks are interconnected between layers, with each link
possessing weights that dictate the influence of one unit on another. Through these weighted
connections, the neural network progressively learns about the data, culminating in an output
from the output layer.
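
The flow of data through weighted connections can be made concrete with a short forward pass. The sketch below is a minimal illustration, not a complete framework: the layer sizes, the ReLU activation in the hidden layers, and the random weights are assumptions, and no training is performed.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass x through fully connected layers given as (weights, bias) pairs."""
    activation = x
    for W, b in layers[:-1]:
        # Hidden layer: weighted sum of the previous layer's outputs, then ReLU.
        activation = relu(W @ activation + b)
    W_out, b_out = layers[-1]
    return W_out @ activation + b_out  # output layer (no activation here)

# Assumed architecture: 4 inputs, two hidden layers of 8 units, 2 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
layers = [(rng.normal(0.0, 0.5, (n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

y = forward(rng.normal(size=4), layers)
print(y.shape)
```

Each row of a weight matrix holds the weights of one unit, so the matrix product computes every unit's weighted sum for a layer in one step.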

3.2 Auto-encoders

An autoencoder is a type of artificial neural network used for unsupervised learning of data encodings. Its primary objective is to learn a lower-dimensional representation (encoding) of higher-dimensional data, commonly for dimensionality reduction. Training teaches the network to capture the essential features of the input, for example an image, so that it can be represented more concisely. Autoencoders are often described as self-supervised models: they are trained like supervised networks, but the target is the input itself, so no labels are required. An autoencoder comprises two integral components:

1. Encoder: Serving as a compression unit, it condenses the input data.

2. Decoder: Responsible for decompressing the condensed input by reconstructing it.

Both the encoder and decoder are built from neural network (NN) layers: the encoder reduces the size of the input image, and the decoder recreates it. In CNN autoencoders these layers are convolutional, max-pooling, flattening, etc., while RNN/LSTM autoencoders use the corresponding recurrent layers.

3.2.1 The Architecture of Autoencoders:

Autoencoders are structured into three integral components:

1. Encoder: This module compresses the input data from the train-validate-test set, producing an encoded representation significantly smaller than the initial data.

2. Bottleneck: Serving as the core repository of compressed knowledge representations, the bottleneck module is the pivotal section of the network, encapsulating the most crucial information.

3. Decoder: Tasked with the "decompression" of knowledge representations, the decoder module reconstructs the data from its encoded form. The resulting output is then compared with a ground truth for evaluation.

The architecture as a whole looks something like this:


3.2.1.1 Encoder

The encoder functions as a collection of convolutional blocks coupled with pooling modules, working in tandem to condense the model's input into a concise segment known as
the bottleneck. Sequentially, the bottleneck leads to the decoder, comprising a sequence of
upsampling modules designed to revert the compressed feature to the configuration of an
image. In conventional autoencoders, the anticipated output mirrors the input data with
diminished noise. On the contrary, variational autoencoders produce an entirely new image,
leveraging the input information supplied to the model.

3.2.1.2 Bottleneck

The neural network's most crucial yet ironically smallest component is the
bottleneck. It serves the purpose of constraining information flow from the encoder to the
decoder, permitting only the most essential information to traverse. By design, the bottleneck
captures the maximum information inherent in an image, forming a knowledge
representation of the input. This compressed representation prevents the neural network from
memorizing the input, mitigating the risk of overfitting. It's worth noting that a smaller
bottleneck reduces the likelihood of overfitting. However, excessively small bottlenecks may
limit information storage, increasing the potential for vital information to be lost through the
pooling layers of the encoder.

3.2.1.3 Decoder

The decoder, comprised of upsampling and convolutional blocks, is the final component
responsible for reconstructing the output from the bottleneck. Given that the input to the
decoder is a compressed knowledge representation, its role is akin to that of a
"decompressor," rebuilding the image from its latent attributes.

Thus, the encoder-decoder structure helps us extract the most from an image in the form of
data and establish useful correlations between various inputs within the network.

3.2.2 How to train autoencoders?

Before commencing the training of an autoencoder, it's imperative to set four crucial
hyperparameters:

1. Code Size: The code size, representing the size of the bottleneck, stands as a pivotal
hyperparameter for tuning the autoencoder. It dictates the degree of compression
applied to the data and can serve as a regularization term.

2. Number of Layers: Similar to all neural networks, the depth of both the encoder and
the decoder constitutes a significant hyperparameter for autoencoders. A higher depth
increases model complexity, while a lower depth enhances processing speed.

3. Number of Nodes per Layer: This hyperparameter defines the number of nodes per
layer, influencing the weights used in each layer. Generally, the number of nodes
decreases as we move across subsequent layers in the autoencoder, corresponding to
the reduction in input size.

4. Reconstruction Loss: The choice of loss function for training the autoencoder
hinges on the desired adaptation to input and output types. For image data, popular
reconstruction loss functions include Mean Squared Error (MSE) Loss and L1 Loss.
In scenarios where inputs and outputs fall within the range [0,1], such as in MNIST,
Binary Cross Entropy can also serve as an effective reconstruction loss.
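
The three reconstruction losses named in the list above can be compared on a tiny example. The NumPy functions below are an illustrative sketch: the function names, the clipping constant `eps`, and the sample vectors are assumptions, and deep learning frameworks ship their own implementations of these losses.

```python
import numpy as np

def mse_loss(x, x_hat):
    """Mean Squared Error between input and reconstruction."""
    return np.mean((x - x_hat) ** 2)

def l1_loss(x, x_hat):
    """Mean absolute error (L1 loss)."""
    return np.mean(np.abs(x - x_hat))

def bce_loss(x, x_hat, eps=1e-7):
    """Binary Cross Entropy; assumes inputs and outputs lie in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([0.0, 1.0, 1.0, 0.0])      # original "pixels"
x_hat = np.array([0.1, 0.9, 0.8, 0.2])  # reconstruction

print(mse_loss(x, x_hat))  # approximately 0.025
print(l1_loss(x, x_hat))   # approximately 0.15
print(bce_loss(x, x_hat))  # approximately 0.164
```

MSE punishes large per-pixel errors more strongly than L1, while Binary Cross Entropy is only meaningful when both inputs and outputs are scaled to [0, 1].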

3.2.3 Types of Autoencoders


Autoencoders are flexible neural networks that can be customized for various tasks. They
come in different forms, each with unique strengths and limitations.

• Undercomplete Autoencoders: Generate a compressed version of the input data.

• Denoising Autoencoders: Improve robustness to noise and irrelevant information.

• Sparse Autoencoders: Learn more compact and efficient data representations.

• Contractive Autoencoders: Generate representations less sensitive to minor data variations.

• Variational Autoencoders: Generate new data points that resemble the training data.

The choice of autoencoder depends on the specific task and data characteristics.

3.2.3.1 Undercomplete Autoencoders

Undercomplete autoencoders represent an unsupervised neural network designed for generating a compressed representation of input data. This is achieved by taking an
image as input and attempting to predict the same image as output, effectively
reconstructing the image from its compressed bottleneck region. The main utility of such
autoencoders lies in generating a latent space or bottleneck that serves as a compressed
surrogate for the input data. When required, this compressed representation can be easily
decompressed using the network. This compression technique can be conceptualized as a
form of dimensionality reduction, streamlining the representation of the data.
3.2.3.2. Sparse Autoencoders

Unlike undercomplete autoencoders, which are regulated by adjusting the size of the
bottleneck, sparse autoencoders are regulated by controlling how many nodes are active in
each hidden layer. Since it is impractical to design a neural network with a variable number
of nodes in its hidden layers, sparse autoencoders instead penalize the activation of
neurons within these layers. Simply put, the loss function includes a term that measures
how many neurons are activated and imposes a penalty directly proportional to it. This
penalty, known as the sparsity function, acts as a regularizer that restrains the neural
network from activating additional neurons. There are two primary methods for integrating
the sparsity regularizer term into the loss function.
I. L1 Loss: Here, we add the magnitude of the hidden activations to the loss as the
sparsity regularizer, scaled by a tuning parameter, as we do for general regularizers.
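
A common way to write this objective is sketched below. The squared reconstruction error, the reconstruction symbol x̂, and the penalty weight λ are assumptions introduced here for illustration.

```latex
L = \lVert x - \hat{x} \rVert^2 + \lambda \sum_i \left\lvert a_i^{(h)} \right\rvert
```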

Where h represents the hidden layer, i represents the image in the minibatch, and a represents
the activation

II. KL-Divergence: In this case, we consider the activations over a collection of
samples at once rather than summing them as in the L1 Loss method. We constrain
the average activation of each neuron over this collection.

Taking the ideal distribution to be a Bernoulli distribution, we include the KL divergence
within the loss to reduce the difference between the current distribution of the activations
and the ideal (Bernoulli) distribution.
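
One standard form of this penalty is sketched below. The target mean activation ρ, the batch size m, the neuron index j, and the weight β are symbols assumed here for illustration; they do not appear in the text above.

```latex
\begin{align*}
\hat{\rho}_j &= \frac{1}{m} \sum_{i=1}^{m} a_j^{(h)}(x_i), \\
\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) &= \rho \log \frac{\rho}{\hat{\rho}_j}
  + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}, \\
L &= \lVert x - \hat{x} \rVert^2 + \beta \sum_j \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)
\end{align*}
```

Each term is zero when the average activation of neuron j equals the target ρ and grows as the two diverge, so a small ρ pushes most neurons to stay inactive for most inputs.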

3.2.3.3. Contractive Autoencoders

In a contractive autoencoder, the input is encoded into a bottleneck and is
subsequently reconstructed by the decoder. The bottleneck serves to learn a
representation of the image during this passage. To prevent the network from learning the
identity function and simply mapping input to output, a regularization term is incorporated
into the contractive autoencoder. The underlying principle of contractive autoencoders is
that similar inputs should yield similar encodings and latent space representations. This
implies that minor variations in the input should not result in significant variations in the
latent space. To train a model under this constraint, the derivatives of the hidden layer
activations with respect to the input must be kept small. Mathematically, this amounts to
adding a penalty on the norm of these derivatives to the loss.
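
A typical form of the contractive objective is sketched below, assuming a squared reconstruction error, a penalty weight λ, and the squared Frobenius norm of the Jacobian of the hidden layer (all three are choices made here for illustration).

```latex
L = \lVert x - \hat{x} \rVert^2
  + \lambda \left\lVert \frac{\partial h(x)}{\partial x} \right\rVert_F^2
```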

Where h represents the hidden layer and x represents the input


3.2.3.4. Denoising Autoencoders

Denoising autoencoders specialize in the removal of noise from images. Unlike
regular autoencoders, they do not receive the ground-truth image as input; instead, they
receive a noisy version of it. This noisy image, produced by digitally corrupting the
original, is fed into the encoder-decoder architecture, and the output is then compared
with the clean ground-truth image. The denoising autoencoder eliminates noise by learning
a representation of the input in which the noise can be filtered out easily. Although
removing noise directly from the image is difficult, the autoencoder achieves this by
mapping the input data onto a lower-dimensional manifold, as in undercomplete
autoencoders, where noise filtering becomes more manageable. The denoising autoencoder
relies on non-linear dimensionality reduction, and the commonly used loss functions are L2
or L1 loss.
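
The key point, that the loss compares the reconstruction of the noisy input with the clean target, can be sketched as follows. The Gaussian corruption, the function names, and the identity stand-in for a trained autoencoder are all assumptions made for this illustration.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    """Corrupt clean inputs in [0, 1] with additive Gaussian noise."""
    return np.clip(x + sigma * rng.normal(size=x.shape), 0.0, 1.0)

def denoising_loss(reconstruct, x_clean, sigma, rng):
    """L2 loss between the reconstruction of the NOISY input and the CLEAN target."""
    x_noisy = add_gaussian_noise(x_clean, sigma, rng)
    x_hat = reconstruct(x_noisy)          # the model never sees x_clean as input
    return np.mean((x_hat - x_clean) ** 2)

rng = np.random.default_rng(0)
x_clean = rng.uniform(size=(8, 16))       # toy batch: 8 samples, 16 features

def identity(x):
    """Hypothetical stand-in for a trained autoencoder (does no denoising)."""
    return x

loss = denoising_loss(identity, x_clean, sigma=0.1, rng=rng)
print(loss)
```

With the identity stand-in the loss simply measures the injected noise; a trained denoising autoencoder would be expected to drive this value lower.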
3.2.3.5. Variational Autoencoders

Variational autoencoders (VAEs) address a specific limitation in standard autoencoders.
While autoencoders create a compressed representation called the latent space during
training, this space may lack continuity, making interpolation challenging. VAEs tackle this
issue by expressing their latent attributes as a probability distribution, resulting in a
continuous latent space that allows for easy sampling and interpolation. Unlike traditional
autoencoders, VAEs provide a more nuanced understanding of the latent space, enabling
smoother transitions between different representations, a crucial aspect for tasks requiring
meaningful interpolation in the latent domain. This characteristic makes VAEs particularly
valuable in applications where generating diverse and meaningful outputs is essential.
3.2.4. Applications of Autoencoders:

1. Image Compression

2. Anomaly Detection

3. Feature Learning

4. Denoising

5. Data Generation

6. Image-to-Image Translation

7. Representation Learning

3.3 Variational Autoencoder (VAE)

Variational autoencoders (VAEs) were introduced in 2013 by Kingma and Welling, then
at the University of Amsterdam. Unlike traditional autoencoders, VAEs offer a probabilistic
approach to representing observations in latent space. Instead of producing a single value
for each latent attribute, the encoder in a VAE generates a probability distribution. This
distinctive feature allows for a more nuanced understanding of latent attributes. VAEs find
applications in designing complex generative models, fitting them to large datasets, and
generating diverse outputs, such as fictional celebrity faces and high-resolution digital
artwork. They have also been applied to machine learning tasks such as image generation
and reinforcement learning.

3.3.1 Neural Networks in the Model

Variational Autoencoders (VAEs) leverage neural networks for both encoding and
decoding processes. The encoder network reduces input data to a lower-dimensional latent
space, while the decoder network reconstructs the original data from the latent code.

In the training phase, VAEs fine-tune the parameters of the encoder and decoder
networks by minimizing the reconstruction error together with the Kullback-Leibler (KL)
divergence between the approximate posterior produced by the encoder and the prior over
the latent variables. This optimization is commonly achieved through algorithms like
stochastic gradient descent.

3.3.2 Architecture:
Variational Autoencoders (VAEs) distinguish themselves from traditional
autoencoders by providing a statistical approach to describing dataset samples in the latent
space. In a VAE, the encoder generates a probability distribution in the bottleneck layer
instead of a single output value.

Autoencoders, a type of unsupervised neural network, encompass two primary
components: an encoder, often a convolutional neural network, that learns an efficient
encoding of the data, and a decoder that regenerates images similar to those in the dataset
using the latent space from the bottleneck layer.

The typical VAE architecture comprises an encoder network mapping input data to a
lower-dimensional latent space and a decoder network reconstructing the original data from
the latent code. The encoder's output includes the mean and variance of a Gaussian
distribution, facilitating the sampling of the latent code. The decoder network is tailored to
reconstruct the original data from this latent code.

In a simple VAE architecture, the encoder network maps the input data to the latent code,
and the decoder network maps the latent code back to the reconstructed data. The VAE is
then trained to minimize the reconstruction error between the input and reconstructed data,
subject to the regularization of the latent code.

3.3.3 Intuitions About the Regularization

Apart from the architectural considerations, another crucial element of Variational
Autoencoders (VAEs) is the regularization applied to the latent code. In VAEs, the
latent code is regularized toward a Gaussian prior with fixed mean and variance
(typically a standard normal distribution). This regularization serves as a preventive
measure against overfitting, encouraging a smooth distribution in the latent code rather
than memorization of the training data.

The regularization not only fosters smooth interpolation between training data points
but also enhances the VAE's capability to generate novel data samples resembling the
training data. Additionally, this regularization prevents the decoder network from achieving
perfect reconstruction of the input data. Instead, it compels the decoder to acquire a more
generalized representation of the data, contributing to the VAE's proficiency in generating
diverse data samples.

Here is one way to mathematically formulate the regularization in VAE:

The encoder network outputs the parameters of a Gaussian distribution for the latent
code, typically the mean and the log-variance (or standard deviation). The latent code is then
sampled from this Gaussian distribution. The KL divergence between this distribution and a
prior (often assumed to be a standard normal distribution) is added as a regularization term
to the VAE's loss function.

A KL divergence term in the loss function will encourage the learned latent variables
to have similar distributions to the prior.
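
For a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form. The sketch below shows it together with the reparameterization trick for sampling the latent code; the trick is a standard technique that is not described in the text above, and the function names and array shapes are assumptions made for illustration.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((2, 3))        # toy encoder output: batch of 2, latent dimension 3
log_var = np.zeros((2, 3))   # log-variance 0 means unit variance
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)  # zero here, since q equals the prior
print(z.shape, kl)
```

Predicting the log-variance rather than the variance keeps the variance positive without any constraint on the encoder output.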

3.3.3.1 Formula for KL divergence:

KL(q(z∣x) ∣∣ p(z)) = Eq(z∣x)[log q(z∣x) − log p(z)]

Overall, the regularization in a VAE helps improve the model's ability to generate new data
samples and prevent overfitting the training data.
3.3.3.2 Mathematical Details of VAEs

Probabilistic Framework and Assumptions

The probabilistic framework of a VAE is typically defined as follows:

• Latent variables: z, which is assumed to follow a prior distribution p(z) (e.g.,
Gaussian).

• Observed variables: x, which is assumed to follow a likelihood distribution p(x∣z)
(e.g., Bernoulli).

• The joint distribution of the observed and latent variables is: p(x,z) = p(x∣z)p(z)

The main goal of the VAE is to learn the posterior distribution of the latent variables
given the observed variables, which is defined as p(z∣x). Because this true posterior is
generally intractable, the VAE uses an encoder network to approximate it with a learned
distribution q(z∣x).

The generative process of a VAE can be represented by the following directed graphical
model:

z <-------- p(z) (prior)

x <-------- p(x|z) (likelihood)

where x is the observed variable and z is the latent variable.

The VAE learns the parameters of the model by maximizing the Evidence Lower Bound
(ELBO), which is defined as

ELBO = Eq(z∣x)[log p(x∣z)] − KL(q(z∣x) ∣∣ p(z))

The first term on the right-hand side of the equation is the reconstruction term, which
measures how well the VAE can reconstruct the input data. The second term, the KL
divergence, measures the difference between the approximate posterior and the prior
distribution.

A VAE uses a probabilistic framework to model the data by assuming that the input data is
generated from a latent space according to certain probabilistic distributions. Because the
true posterior and the exact likelihood of the input data are generally intractable, the VAE
maximizes the ELBO, a lower bound on the log-likelihood, instead of maximizing the
likelihood directly.
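
As an illustration, the negative ELBO (the quantity minimized in practice) can be estimated from a single latent sample as sketched below. The Bernoulli likelihood, the standard normal prior, the one-sample Monte Carlo estimate, and all names are assumptions chosen for this example.

```python
import numpy as np

def negative_elbo(x, x_prob, mu, log_var, eps=1e-7):
    """Per-sample negative ELBO for a Bernoulli likelihood and N(0, I) prior.

    x        : inputs in [0, 1], shape (batch, features)
    x_prob   : decoder output p(x = 1 | z) for one sampled z, same shape as x
    mu, log_var : encoder outputs, shape (batch, latent_dim)
    """
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    # One-sample Monte Carlo estimate of E_q[log p(x|z)] (reconstruction term).
    log_lik = np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)
    # Closed-form KL(q(z|x) || p(z)) for a diagonal Gaussian q and standard normal p.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return -(log_lik - kl)

# Toy values (hypothetical): one sample, two binary features, one latent dimension.
x = np.array([[1.0, 0.0]])
x_prob = np.array([[0.5, 0.5]])
loss = negative_elbo(x, x_prob, mu=np.zeros((1, 1)), log_var=np.zeros((1, 1)))
print(loss)
```

Minimizing this quantity trades off reconstruction quality (the first term) against keeping the approximate posterior close to the prior (the second term).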

3.3.4 Variational Inference Formulation

The Variational Inference Formulation for a VAE is typically defined as follows:

• Approximate posterior distribution: q(z∣x)

• True posterior distribution: p(z∣x)

The goal is to find the approximate posterior distribution q(z∣x) that is closest to the true
posterior distribution p(z∣x) in terms of KL divergence.

The KL divergence between the two distributions is defined as

KL(q(z∣x)∣∣p(z∣x))=Eq(z∣x)[log(q(z∣x))−log(p(z∣x))]

The VAE is trained to minimize this KL divergence by maximizing the evidence lower bound
(ELBO), defined as

ELBO = Eq(z∣x)[log p(x∣z)] − KL(q(z∣x) ∣∣ p(z))

Since log p(x) = ELBO + KL(q(z∣x) ∣∣ p(z∣x)) and log p(x) does not depend on q, raising
the ELBO lowers the KL divergence to the true posterior.

The first term on the right-hand side of the ELBO is the reconstruction term, which
measures how well the VAE can reconstruct the input data. The second term, the KL
divergence, measures the difference between the approximate posterior and the prior
distribution p(z).
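
The link between the ELBO and the KL divergence to the true posterior can be derived in a few lines. The sketch below uses only the definitions given in this section, together with p(x,z) = p(x∣z)p(z) = p(z∣x)p(x).

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{p(z \mid x)}\right] \\
  &= \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
   + \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right] \\
  &= \mathrm{ELBO} + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big), \\
\mathrm{ELBO}
  &= \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]
   + \mathbb{E}_{q(z \mid x)}[\log p(z) - \log q(z \mid x)] \\
  &= \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]
   - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big).
\end{align*}
```

Because the KL divergence is non-negative, the ELBO never exceeds log p(x), which is what makes it a lower bound.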

A VAE with Variational Inference can be represented by the following directed graphical
model:

z <-------- p(z|x) (true posterior)

x <-------- p(x|z) (likelihood)

z <-------- q(z|x) (approximate posterior)

where x is the observed variable and z is the latent variable.

Overall, Variational Inference is used by VAE to approximate the true posterior distribution
over the latent space with a simpler distribution by minimizing the KL divergence between
the variational distribution and the true posterior distribution and reconstructing the input
data as accurately as possible.

4 Generative Adversarial Network (GAN)

A Generative Adversarial Network (GAN) is a machine learning model where two
neural networks, the generator and the discriminator, engage in a competitive learning
process using deep learning methods. Operating primarily in an unsupervised manner, GANs
employ a competitive zero-sum game framework, where gains for one component equal
losses for the other.

The generator aims to produce outputs that closely resemble real data, while the
discriminator endeavors to distinguish between artificially generated and authentic
examples. In image GANs, the generator is typically built from deconvolutional
(transposed convolution) layers and the discriminator is a convolutional neural network.
This adversarial dynamic drives the training process, pushing the generator to create
outputs that can deceive the discriminator.

Generative models, such as GANs, have the capability of generating new data that
resembles their training data. In this process, the generator is trained to produce deceptive
data, while the discriminator learns to differentiate between the generator's output and
genuine examples. The ongoing feedback loop between these adversarial networks leads to
the refinement of the generator's ability to produce high-quality and realistic outputs,
while the discriminator improves its accuracy in identifying artificially created data. For
example, a GAN can be trained to generate lifelike images of human faces that do not
correspond to any actual person.

4.1 Architecture of GAN

A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
4.1.1 Generator Model:

The generator model plays a pivotal role in the Generative Adversarial Network
(GAN) by generating novel and accurate data. Taking random noise as input, the generator
transforms it into complex data samples, such as text or images, typically implemented as a
deep neural network. During training, the generator captures the underlying distribution of
the training data through layers of learnable parameters, adjusting its output to closely
resemble real data. Backpropagation is employed for fine-tuning these parameters. The
success of the generator lies in its capability to produce high-quality and diverse samples,
effectively deceiving the discriminator in the adversarial interplay.

4.1.2 Generator Loss (JG)

For generated samples, the generator minimizes the log likelihood that the discriminator is
right. This loss incentivizes the generator to produce samples that the discriminator is
likely to classify as real, that is, D(G(zi)) close to 1 and log D(G(zi)) close to 0.
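
A common form of this loss, averaged over a minibatch of m noise samples, is sketched below. The batch size m is a symbol assumed here for illustration.

```latex
J_G = -\frac{1}{m} \sum_{i=1}^{m} \log D\big(G(z_i)\big)
```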

Where,

• JG measures how well the generator is fooling the discriminator.

• log D(G(zi)) represents the log probability that the discriminator classifies the
generated sample G(zi) as real.

• The generator aims to minimize this loss, encouraging the production of samples that
the discriminator classifies as real (D(G(zi)) close to 1).

4.1.3 Discriminator Model:

In the framework of Generative Adversarial Networks (GANs), a discriminator model,
implemented as an artificial neural network, serves the crucial role of distinguishing between
generated and authentic input. This discriminator acts as a binary classifier by assessing
input samples and assigning probabilities of authenticity. Through iterative exposure to both
genuine data from the dataset and artificial samples produced by the generator, the
discriminator refines its parameters, enhancing its ability to discern between real and
synthetic data. For image data, convolutional layers are commonly employed in its
architecture, with other relevant structures used for other modalities. The primary objective
of the discriminator in the adversarial training process is to maximize its capability to
accurately identify generated samples as fraudulent and real samples as authentic. As the
generator and discriminator engage in mutual refinement, the discriminator becomes
increasingly discerning, contributing to the generation of highly realistic synthetic data by
the GAN.

4.1.4 Discriminator Loss (JD)

The discriminator minimizes the negative log likelihood of correctly classifying both
generated and real samples. This loss incentivizes the discriminator to accurately categorize
generated samples as fake (D(G(zi)) close to 0) and real samples as real (D(xi) close to 1).
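
A common form of this loss, averaged over m real samples and m generated samples, is sketched below. The batch size m is a symbol assumed here for illustration.

```latex
J_D = -\frac{1}{m} \sum_{i=1}^{m} \log D(x_i)
      -\frac{1}{m} \sum_{i=1}^{m} \log\big(1 - D(G(z_i))\big)
```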

Where,

• JD assesses the discriminator’s ability to discern between generated and actual
samples.

• The log likelihood that the discriminator will accurately categorize real data is
represented by log D(xi).

• The log likelihood that the discriminator will correctly categorize generated samples
as fake is represented by log(1−D(G(zi))).

• The discriminator aims to reduce this loss by accurately identifying artificial and real
samples.
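
Given the discriminator's output probabilities, both GAN losses can be computed in a few lines. This is a minimal sketch: the function names, the clipping constant, and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def generator_loss(d_fake, eps=1e-7):
    """J_G = -mean(log D(G(z))): small when the discriminator rates fakes as real."""
    return -np.mean(np.log(np.clip(d_fake, eps, 1.0)))

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """J_D = -mean(log D(x)) - mean(log(1 - D(G(z))))."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# Toy discriminator outputs (hypothetical): confident on real data, sceptical of fakes.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.3])
print(generator_loss(d_fake), discriminator_loss(d_real, d_fake))
```

In this toy case the discriminator loss is low and the generator loss is high, which reflects a discriminator that is currently winning the game.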

4.2 How GANs Work

The name GAN can be broken down into the following three parts:

• Generative: This describes how data is generated in terms of a probabilistic model.

• Adversarial: The model is trained in an adversarial setting.

• Networks: Deep neural networks are used as the artificial intelligence (AI)
algorithms for training.

The initial step in constructing a Generative Adversarial Network (GAN) involves
defining the desired end output and assembling an initial training dataset based on those
specifications. Random input is then fed to the generator until it achieves a foundational
level of accuracy in generating outputs. Subsequently, the generated samples or images,
along with actual data points from the training dataset, are input into the discriminator.
After both the generator and discriminator models have processed the data, optimization
with backpropagation begins. The discriminator evaluates each image and produces a
probability score ranging from 0 to 1, where 1 signifies a real image and 0 indicates a
fake one. These values are checked for success and the process is repeated until the
desired outcome is attained.

A GAN typically takes the following steps:

1. The generator takes random numbers as input and outputs an image.

2. The discriminator receives this generated image in addition to a stream of images from
the real, ground-truth data set.

3. The discriminator takes in both real and fake images and outputs probabilities (a
value between 0 and 1), where 1 indicates a prediction of authenticity and 0
indicates a fake.

This creates a double feedback loop where the discriminator is in a feedback loop with
the ground truth of the images and the generator is in a feedback loop with the discriminator.
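
The double feedback loop can be illustrated with a deliberately tiny GAN on one-dimensional data. Everything in this sketch is an assumption made for illustration: the linear generator, the logistic discriminator, the real data drawn from N(4, 1), the hand-derived gradients, and the hyperparameters. It is not expected to train stably; it only shows the order of the updates.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(steps=200, batch=64, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Generator:      G(z) = a * z + b           (parameters a, b)
    Discriminator:  D(x) = sigmoid(w * x + c)  (parameters w, c)
    """
    rng = np.random.default_rng(seed)
    a, b, w, c = 1.0, 0.0, 0.1, 0.0
    for _ in range(steps):
        # Step 1: the generator turns random numbers into fake samples.
        z = rng.normal(size=batch)
        fake = a * z + b
        # Step 2: the discriminator also receives real samples.
        real = rng.normal(loc=4.0, scale=1.0, size=batch)
        # Step 3: the discriminator outputs probabilities between 0 and 1.
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        # Discriminator feedback loop: gradient step on
        # J_D = -mean(log D(x)) - mean(log(1 - D(G(z)))), generator held fixed.
        grad_w = np.mean((d_real - 1.0) * real) + np.mean(d_fake * fake)
        grad_c = np.mean(d_real - 1.0) + np.mean(d_fake)
        w, c = w - lr * grad_w, c - lr * grad_c
        # Generator feedback loop: gradient step on J_G = -mean(log D(G(z))),
        # backpropagated through the discriminator, which is held fixed.
        d_fake = sigmoid(w * fake + c)
        grad_a = np.mean((d_fake - 1.0) * w * z)
        grad_b = np.mean((d_fake - 1.0) * w)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b, w, c

a, b, w, c = train_toy_gan()
print(a, b, w, c)
```

Note that the discriminator update uses both real and fake samples, while the generator update only uses the discriminator's verdict on fake samples, matching the two feedback loops described above.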
4.3 Different Types of GAN Models

• Vanilla GAN: The most straightforward type of GAN, Vanilla GAN employs simple
multi-layer perceptrons as the Generator and the Discriminator. The algorithm
optimizes the GAN objective using stochastic gradient descent.

• Conditional GAN (CGAN): Introducing conditional parameters, CGAN
incorporates an additional parameter 'y' into the Generator for generating
corresponding data. Labels are included in the input to the Discriminator, aiding in
distinguishing real data from the fake generated data.

• Deep Convolutional GAN (DCGAN): Among the most popular and successful
implementations, DCGAN utilizes ConvNets instead of multi-layer perceptrons.
The ConvNets are implemented without max pooling, which is replaced by strided
convolutions, and the layers are not fully connected.

• Laplacian Pyramid GAN (LAPGAN): LAPGAN employs multiple Generator and
Discriminator networks along with different levels of the Laplacian Pyramid. This
approach, known for producing high-quality images, involves down-sampling the
image at each layer and then up-scaling it in a backward pass. The image gains noise
from the Conditional GAN at these layers until it reaches its original size.

• Super Resolution GAN (SRGAN): SRGAN utilizes a deep neural network and an
adversarial network to generate higher-resolution images. Specifically designed for
optimally up-scaling native low-resolution images, SRGAN enhances details while
minimizing errors in the process.

4.4 Training & Prediction of Generative Adversarial Networks (GANs)

Step 1: Define the Problem

The project's success hinges on a well-defined problem statement. It is crucial to
articulate the specific challenge GANs will address, whether it involves creating content
such as audio, poems, text, or images.

Step 2: Select GAN Architecture


With various GAN architectures available, the next step is to choose the most suitable one
for the defined problem. Different types of GANs cater to specific applications, and selecting
the appropriate architecture is vital.

Step 3: Train Discriminator on Real Dataset

The Discriminator is first trained on a real dataset for n epochs. During this phase the
Generator is held fixed, so no gradients are propagated back into the Generator and only
the Discriminator's weights are updated. The real data is supplied without noise and
provides the positive examples, while instances created by the Generator serve as the
negative (fake) examples. During discriminator training:

• It classifies both real and fake data.

• The discriminator loss helps improve its performance by penalizing it when it
misclassifies real as fake or vice versa.

• The weights of the discriminator are updated by backpropagating the discriminator loss.

Step 4: Train Generator

The Generator is given random noise as input and transforms it into fake outputs. While
the Generator is trained the Discriminator is idle, and while the Discriminator is trained
the Generator is idle. Producing meaningful output from random noise takes time and
requires many epochs. The steps to train the generator are listed below.

• Sample random noise and produce a generator output from it.

• Have the discriminator classify the generator output as real or fake.

• Calculate the loss from the discriminator's classification.

• Backpropagate through both the discriminator and the generator to calculate
gradients.

• Use the gradients to update only the generator weights.

Step 5: Train Discriminator on Fake Data


Generated samples from the Generator are passed to the Discriminator, which
distinguishes between real and fake data. The Discriminator's feedback is crucial for refining
the Generator's performance.

Step 6: Train Generator with Discriminator's Output

The Generator undergoes further training based on the feedback provided by the
Discriminator. This iterative process continues until the Generator successfully fools the
Discriminator.

4.5 Applications of Generative Adversarial Networks (GANs):

1. Image Synthesis and Generation

2. Image-to-Image Translation

3. Text-to-Image Synthesis

4. Data Augmentation

5. Data Generation for Training

4.6 Advantages of Generative Adversarial Networks (GANs):

1. Synthetic Data Generation

2. High-Quality Results

3. Unsupervised Learning

4. Versatility
4.7 Disadvantages of Generative Adversarial Networks (GANs):

1. Training Instability

2. Computational Cost

3. Overfitting

4. Bias and Fairness

5. Interpretability and Accountability

5. Auto-encoder and Deep Boltzmann Machine (DBM)

A DBM is a type of probabilistic graphical model that consists of multiple layers of
stochastic latent variables. Units in each layer are connected to units in the adjacent
layers, with no connections between units within the same layer. The connections are
defined by undirected edges, and the model is trained to capture complex dependencies in
the data.

5.1 What is Deep Boltzmann Machine (DBM)?

A Boltzmann machine is an unsupervised deep learning model in which every node
is connected to every other node. It is a type of recurrent neural network, and the nodes make
binary decisions with some level of bias.

These machines are not deterministic deep learning models; they are stochastic
or generative deep learning models. They are representations of a system.

A Boltzmann machine has two kinds of nodes:

• Visible nodes:
These are nodes that can be measured and are measured.
• Hidden nodes:
These are nodes that cannot be measured or are not measured.

According to some experts, a Boltzmann machine can be called a stochastic Hopfield
network with hidden units. It is a network of units with an ‘energy’ defined for the
overall network.

Boltzmann machines seek to reach thermal equilibrium, which essentially means optimizing
the global distribution of energy. The temperature and energy of the system are analogies
borrowed from thermodynamics and are not literal.

A Boltzmann machine comes with a learning algorithm that enables it to discover
interesting features in datasets composed of binary vectors. The learning algorithm tends to
be slow in networks that have many layers of feature detectors, but it can be made
faster by learning one layer of feature detectors at a time.

Boltzmann machines use stochastic binary units to reach probability distribution
equilibrium (that is, to minimize energy). Multiple Boltzmann machines can be combined to
form far more sophisticated systems such as deep belief networks.
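
The notions of energy and stochastic binary units can be made concrete with a small sketch. It assumes the standard Boltzmann machine energy with a symmetric weight matrix W (zero diagonal) and biases b, and uses Gibbs sampling to update the units; these symbols, the function names, and the toy values are assumptions introduced here for illustration.

```python
import numpy as np

def energy(s, W, b):
    """E(s) = -1/2 * s^T W s - b^T s for binary states s (W symmetric, zero diagonal)."""
    return -0.5 * s @ W @ s - b @ s

def gibbs_step(s, W, b, rng, temperature=1.0):
    """Resample every unit once from its conditional distribution given the others."""
    s = s.copy()
    for i in range(len(s)):
        activation = W[i] @ s + b[i]      # W[i, i] == 0, so unit i itself is excluded
        p_on = 1.0 / (1.0 + np.exp(-activation / temperature))
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])                # two units joined by one positive weight
b = np.zeros(2)
s = np.array([1.0, 1.0])
print(energy(s, W, b))
s_next = gibbs_step(s, W, b, rng)
```

Repeating such steps many times lets the network settle toward its equilibrium distribution, in which low-energy states are the most probable.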

The Boltzmann machine is named after Ludwig Boltzmann, an Austrian physicist who
came up with the Boltzmann distribution. However, this type of network was developed
by Geoffrey Hinton and Terrence Sejnowski in the 1980s.

5.2 How does a Boltzmann machine work?

Boltzmann machines are non-deterministic (stochastic) generative Deep
Learning models that have only two kinds of nodes: hidden and visible nodes. They don’t
have any output nodes, and that’s what gives them the non-deterministic feature. Unlike
networks that learn and optimize patterns through a typical 1 or 0 type output using
Stochastic Gradient Descent, they learn patterns without such an output.

A major difference is that unlike other traditional networks (artificial, convolutional, or
recurrent neural networks), which don’t have any connections between the input nodes,
Boltzmann Machines have connections among the input nodes. Every node is connected to
all other nodes irrespective of whether they are input or hidden nodes. This enables them to
share information among themselves and self-generate subsequent data. You’d only
measure what’s on the visible nodes and not what’s on the hidden nodes. After the input is
provided, the Boltzmann machines are able to capture the parameters, patterns and
correlations among the data. It is because of this that they are known as deep generative
models and they fall into the class of Unsupervised Deep Learning.

5.3 Types of Boltzmann Machines

There are three types of Boltzmann machines. These are:

• Restricted Boltzmann Machines (RBMs)
• Deep Belief Networks (DBNs)
• Deep Boltzmann Machines (DBMs)

5.4 Advantages:

• Generative Modeling:
  ◦ DBMs are capable of generative modeling, meaning they can generate new
    samples from the learned distribution. This is valuable for tasks such as
    generating realistic data instances.
• Representation Learning:
  ◦ DBMs can learn hierarchical and distributed representations of data. This is
    advantageous for capturing complex patterns and high-level features in the
    input.
• Unsupervised Learning:
  ◦ DBMs can be trained in an unsupervised manner, meaning they can learn
    from data without explicit labels. This is particularly useful when labeled data
    is scarce or expensive to obtain.
• Capturing Dependencies:
  ◦ DBMs are designed to capture dependencies and interactions among
    variables, making them suitable for modeling complex relationships in the
    data.
5.5 Disadvantages:

• Computational Complexity:
  ◦ Training and inference in DBMs can be computationally expensive. The
    learning algorithms often involve sampling procedures, and as the depth of
    the model increases, the training process becomes more challenging.
• Difficulty in Training:
  ◦ Training deep generative models, including DBMs, can be challenging. The
    training process may suffer from issues like slow convergence,
    vanishing/exploding gradients, and mode collapse.
• Limited Scalability:
  ◦ Scaling DBMs to handle large datasets and high-dimensional input spaces can
    be difficult. The number of parameters in the model grows rapidly with the
    depth and size of each layer, which can lead to scalability issues.
• Sensitivity to Hyperparameters:
  ◦ DBMs have various hyperparameters, and the performance of the model can
    be sensitive to their settings. Finding the right set of hyperparameters for a
    specific task can be non-trivial.

6. Attention and memory models

6.1 Attention Mechanism:


The attention mechanism is a neural network architecture, often employed in
Encoder-Decoder models, that enhances model performance by dynamically assigning
weights to different elements in the input. This allows the model to focus on specific sections
of the input, capturing dependencies and relationships within the data. Particularly valuable
in tasks involving sequential or structured data, such as natural language processing or
computer vision, the attention mechanism enables effective handling of long-range
dependencies and improved performance by selectively attending to crucial features or
contexts.

In the context of visual attention, recurrent models utilize reinforcement learning to
focus on key areas of an image. Governed by a recurrent neural network, the glimpse
network dynamically selects locations for exploration over time, and such models have been
reported to outperform convolutional neural networks on some classification tasks. This
framework extends beyond image identification and finds applications in visual
reinforcement learning, such as aiding robots in choosing behaviors to achieve specific
goals. While the basic use involves supervised learning, the incorporation of reinforcement
learning enables more adaptable decision-making based on feedback and rewards.

In image captioning, the application of attention mechanisms significantly enhances
the quality and accuracy of generated captions. By focusing on relevant image regions
during each step of caption creation, the model synchronizes visual and textual modalities.
This attention to various areas of the image at each time step allows the model to produce
more detailed, contextually appropriate captions. Attention-based image captioning models
excel at capturing minute details, handling complex scenes, and delivering cohesive and
informative captions that closely align with the visual content.

The attention mechanism is a technique widely applied in machine learning and
natural language processing to improve model accuracy by focusing on relevant data. It
allows the model to assign weights to input attributes based on their importance, enhancing
performance in tasks such as speech recognition, image captioning, and machine translation.
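
The idea of weighting inputs by importance can be sketched with scaled dot-product attention for a single query. This is one of several possible scoring schemes, chosen here for brevity; the function names, shapes, and toy values are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    scores = scores - np.max(scores, axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(query, keys, values):
    """query: (d,), keys: (T, d), values: (T, d_v) -> context (d_v,), weights (T,)."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # similarity of query to each key
    weights = softmax(scores)                          # importance of each input position
    context = weights @ values                         # weighted sum of the values
    return context, weights

# Toy example (hypothetical values): three input positions, key dimension 2.
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])
context, weights = attention(query, keys, values)
print(weights, context)
```

The first key points in the same direction as the query, so it receives the largest weight and its value dominates the context vector.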

6.1.1 How Attention Mechanism Works

An attention mechanism in a neural network model typically consists of the following steps:

1. Data Encoding:

The input sequence is first encoded (embedded) as a sequence of vector
representations, for example the encoder's hidden states. This encoding puts the
input into a format suitable for processing by the attention mechanism.

2. Query Vector Generation:

A query vector is created based on the current state or context of the model.
This vector serves as a representation of the information that the model aims to focus
on or retrieve from the input.

3. Key-Value Pair Formation:

The representations of the input are divided into key-value pairs. Keys
capture information critical for determining relevance, while values encompass the
actual data or information.

4. Similarity Calculation:

The next step involves computing the similarity between the query vector and
each key. This computation measures their compatibility or relevance, and various
similarity metrics, such as dot product, cosine similarity, or scaled dot product, can
be utilized.

The symbols used in the similarity (score) function are:

• hs: Encoder source hidden state at position s

• yi: Decoder target hidden state at position i

• W: Weight matrix

• v: Weight vector

5. Calculation of Attention Weights:

The similarity scores are passed through a softmax function, producing
attention weights that are non-negative and sum to 1. Each weight conveys the
relevance assigned to the corresponding key-value pair.

6. Weighted Summation:

The attention weights are then applied to the respective values, producing
a weighted sum. This process consolidates the pertinent information from the input,
emphasizing each value in proportion to its attention weight.

The sum runs over all Ts key-value pairs, where Ts is the total number of source
hidden states in the encoder.

7. Context Vector Formation:

The weighted sum forms a context vector containing attended or focused
information from the input. This vector encapsulates the pertinent context for the
ongoing step or task.

8. Model Integration:

The context vector integrates with the model's present state or hidden
representation, furnishing additional information or context for subsequent steps or
layers of the model.

9. Iterative Process:

Steps 2 to 8 are repeated at every step of the model (for example, at each
output position of a decoder), enabling the attention mechanism to dynamically focus
on different segments of the input sequence or data.
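
The quantities named in steps 4 to 7 can be written out as follows. This is a sketch under an assumption: the presence of a weight matrix W and a weight vector v suggests the additive (concat) scoring function, which is one common choice but is not confirmed by the text. Here y_i is the target hidden state at position i and h_s the source hidden state at position s; the names \alpha_{is} (attention weight) and c_i (context vector) are introduced here for illustration.

```latex
\begin{align}
\operatorname{score}(y_i, h_s) &= v^{\top} \tanh\!\left( W \, [\, y_i ; h_s \,] \right) \\
\alpha_{is} &= \frac{\exp\!\left( \operatorname{score}(y_i, h_s) \right)}
                    {\sum_{s'=1}^{T_s} \exp\!\left( \operatorname{score}(y_i, h_{s'}) \right)} \\
c_i &= \sum_{s=1}^{T_s} \alpha_{is} \, h_s
\end{align}
```

The first line is the similarity calculation (step 4), the second the softmax that yields the attention weights (step 5), and the third the weighted sum that forms the context vector (steps 6 and 7). Other scoring functions listed in step 4, such as the dot product, change only the first line.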

Integrating an attention mechanism empowers the model to adeptly capture dependencies,
highlight crucial information, and dynamically concentrate on various aspects of the input.
This enhancement results in superior performance across tasks such as machine translation,
text summarization, and image recognition.
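
As a minimal sketch of steps 3 to 7 in Section 6.1.1, the following pure-Python example computes attention for a single query. It assumes scaled dot-product similarity (one of the metrics mentioned in step 4) rather than any particular model's scoring function; the function names `softmax` and `attend` and the toy keys, values, and query are illustrative and do not come from the text.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are non-negative and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query attention; returns (context_vector, attention_weights)."""
    d = len(query)
    # Step 4: scaled dot-product similarity between the query and each key
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Step 5: softmax turns the scores into attention weights
    weights = softmax(scores)
    # Steps 6-7: the weighted sum of the values forms the context vector
    dim = len(values[0])
    context = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(dim)
    ]
    return context, weights


# Toy example (hypothetical data): three key-value pairs, one query
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

context, weights = attend(query, keys, values)
print("attention weights:", weights)
print("context vector:", context)
```

In this toy setup the first and third keys align with the query equally well, so they receive equal weights, and the second key receives a smaller one. In a full model, `attend` would be called once per output step with a fresh query (step 9), and the returned context vector would be combined with the model's hidden state (step 8).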
