Unit 2 DL
1.1 LSTM:
The memory cell in a Long Short-Term Memory (LSTM) network can be likened to a
reservoir proficiently storing and retrieving information across extensive sequences. It
assumes a pivotal role in empowering the network to grasp and preserve vital patterns over
time, addressing the complexities associated with learning from prolonged dependencies.
Unlike traditional neural networks, the memory cell enables LSTMs to selectively manage
information, offering a nuanced comprehension of context in sequential data.
The input gate plays a crucial role in overseeing the inflow of new information into
the memory cell, determining the relevance of incoming data that should be stored. It is
responsible for selectively incorporating useful information into the cell state. The process
begins with the application of the sigmoid function to regulate the information, akin to the
forget gate, utilizing inputs h_t-1 and x_t. This sigmoid activation filters the values to be
retained. Subsequently, a candidate vector is constructed using the hyperbolic tangent (tanh)
function, producing an output ranging from -1 to +1 that encapsulates all potential values
from h_t-1 and x_t. Finally, the candidate vector and the regulated values are multiplied to
yield the pertinent information that contributes to the memory cell state. The equations for
the input gate capture this process:

i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_C · [h_t-1, x_t] + b_C)

where i_t is the input gate activation, Ĉ_t is the candidate vector, W_i and W_C are the
corresponding weight matrices, and b_i and b_C are the bias terms.
The output gate holds authority over the information that is conveyed to subsequent layers or
serves as the output of the LSTM, ensuring that pertinent information is effectively
integrated into the network's overarching processing. It is responsible for the crucial task of
extracting valuable information from the current cell state for presentation as the output. The
process unfolds with the application of the hyperbolic tangent (tanh) function to generate a
vector from the cell. Subsequently, the information undergoes regulation using the sigmoid
function, similar to the input gate, with filtering based on values to be retained, involving
inputs ht-1 and xt. Lastly, the product of the vector values and the regulated values is
computed, constituting the output that is transmitted to subsequent layers and serves as input
to the next cell. Capturing this process, the equations for the output gate are expressed as:

o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t · tanh(C_t)

where o_t is the output gate activation, C_t is the current cell state, and h_t is the hidden
state passed on as the output.
The forget gate assumes a pivotal role in the strategic elimination of unnecessary or less
relevant information from the memory cell, thereby facilitating a selective forgetting
mechanism crucial for maintaining the pertinence of stored information. This mechanism
targets the removal of information that is no longer deemed useful in the cell state. The
forget gate incorporates two key inputs: x_t (the input at the current time step) and h_t-1
(the previous cell's output). These inputs are multiplied by weight matrices, biases are
added, and the result is passed through a sigmoid activation function, yielding a value
between 0 and 1. When the output is close to 0 for a particular cell state entry, the associated
piece of information is forgotten, while an output close to 1 indicates that the information is
retained for future use. The equation for the forget gate is articulated as:

f_t = σ(W_f · [h_t-1, x_t] + b_f)

where
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the previous hidden state and the current input.
b_f is the bias term, and σ is the sigmoid activation function.
Disadvantages of LSTM:
1. Computational Complexity
2. Increased Training Data Requirements
3. Potential Overfitting
4. Difficulty in Interpretability
5. Hyperparameter Sensitivity
6. Resource Intensive
1.2 GRU
GRU, or Gated Recurrent Unit, is an advancement of the standard RNN, i.e., the recurrent
neural network. It was introduced by Kyunghyun Cho et al. in 2014.
GRUs are very similar to Long Short-Term Memory (LSTM). Just like LSTM, GRU uses
gates to control the flow of information. Being newer than LSTM, GRUs offer some
improvements over it and have a simpler architecture. Another interesting thing about GRU
is that, unlike LSTM, it does not have a separate cell state (Ct); it only has a hidden state
(Ht). Due to the simpler architecture, GRUs are faster to train.
Now let's understand how GRU works. A GRU cell is more or less similar to an LSTM cell
or an RNN cell.
At each timestamp t, it takes an input x_t and the hidden state H_t-1 from the previous
timestamp t-1. It then outputs a new hidden state H_t, which is in turn passed to the next
timestamp.
Now there are primarily two gates in a GRU as opposed to three gates in an LSTM
cell. The first gate is the Reset gate and the other one is the update gate.
The reset gate is computed as:

r_t = σ(x_t · U_r + H_t-1 · W_r)

If you remember the LSTM gate equations, this is very similar to them. The value of r_t will
range from 0 to 1 because of the sigmoid function. Here U_r and W_r are the weight
matrices for the reset gate.
Similarly, we have an update gate for long-term memory, and the equation of the gate is
shown below:

u_t = σ(x_t · U_u + H_t-1 · W_u)

Here U_u and W_u are the weight matrices for the update gate.
Now let’s see the functioning of these gates. To find the Hidden state Ht in GRU, it
follows a two-step process. The first step is to generate what is known as the candidate
hidden state, as shown below:

Ĥ_t = tanh(x_t · U_g + (r_t ⊙ H_t-1) · W_g)

It takes the input and the hidden state from the previous timestamp t-1, the latter multiplied
element-wise by the reset gate output r_t. This entire information is then passed through the
tanh function, and the resulting value is the candidate hidden state.
The most important part of this equation is how we are using the value of the reset gate
to control how much influence the previous hidden state can have on the candidate state.
If the value of rt is equal to 1 then it means the entire information from the previous
hidden state Ht-1 is being considered. Likewise, if the value of rt is 0 then that means the
information from the previous hidden state is completely ignored.
Once we have the candidate state, it is used to generate the current hidden state H_t. This is
where the update gate comes into the picture:

H_t = u_t ⊙ H_t-1 + (1 − u_t) ⊙ Ĥ_t

This is a very interesting equation: instead of using a separate gate as in LSTM, the GRU
uses a single update gate to control both the historical information H_t-1 and the new
information that comes from the candidate state.
Now assume the value of u_t is around 0; then the first term in the equation vanishes,
which means the new hidden state will not carry much information from the previous hidden
state. At the same time, the second part becomes almost one, which essentially means the
hidden state at the current timestamp will consist of information from the candidate state
only. Similarly, if the value of u_t is 1, the second term becomes entirely 0 and the current
hidden state will depend entirely on the first term, i.e., the information from the hidden state
at the previous timestamp t-1. Hence we can conclude that the value of u_t, which ranges
from 0 to 1, is critical in this equation.
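The following is a minimal NumPy sketch of one GRU forward step under the equations above. The weight names (U_r, W_r, U_u, W_u, U_g, W_g) follow the notation used here; the dimensions and random initialization are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, U_r, W_r, U_u, W_u, U_g, W_g):
    """One GRU forward step following the equations above."""
    r_t = sigmoid(x_t @ U_r + h_prev @ W_r)            # reset gate
    u_t = sigmoid(x_t @ U_u + h_prev @ W_u)            # update gate
    h_hat = np.tanh(x_t @ U_g + (r_t * h_prev) @ W_g)  # candidate hidden state
    return u_t * h_prev + (1.0 - u_t) * h_hat          # new hidden state H_t

n_in, n_hid = 4, 3                                     # toy sizes
rng = np.random.default_rng(0)
U = lambda: rng.standard_normal((n_in, n_hid)) * 0.1
W = lambda: rng.standard_normal((n_hid, n_hid)) * 0.1
h_t = gru_step(rng.standard_normal(n_in), np.zeros(n_hid),
               U(), W(), U(), W(), U(), W())
```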
LSTM has three gates; GRU, on the other hand, has only two. In LSTM they are the input
gate, forget gate, and output gate, whereas in GRU we have a reset gate and an update gate.
In LSTM we have two states: the cell state, or long-term memory, and the hidden state, also
known as short-term memory. In the case of GRU, there is only one state, i.e., the hidden
state (Ht).
2. Encoder-decoder architecture
In order to fully understand the model's underlying logic, we will go over its components
below.
The LSTM enhances word context development by considering two inputs at each
time step: one from the user and the other from its previous output, illustrating the recurrent
nature where the output serves as input. Both the encoder and decoder components are
commonly implemented using Recurrent Neural Networks (RNNs) or Transformers.
The architecture consists of three components: the encoder, the intermediate (encoder)
vector, and the decoder.
2.2.1.1 Encoders
The process of comprehending text reflects the iterative nature of human cognition,
where each word in a sentence undergoes systematic processing, accumulating information
until the completion of the text. This iterative accumulation of information finds a parallel in
the Deep Learning field with Recurrent Neural Networks (RNNs), which operate through
recurrent iterations over similar units. In this context, a text encoder plays a pivotal role,
transforming textual content into a numeric representation. This transformation involves the
use of a stack of recurrent units, often employing LSTM or GRU cells for superior
performance. Each recurrent unit within the stack processes a single element of the input
sequence, collecting and propagating information forward. In scenarios like question-
answering problems, the input sequence comprises all words from the question, with each
word represented as x_i based on its sequential order in the question. The hidden
states h_i are computed using the formula:

h_t = f(W^(hh) · h_t-1 + W^(hx) · x_t)
This simple formula represents the result of an ordinary recurrent neural network. As
you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and
the input vector x_t.
2.2.1.2 Encoder Vector
This is the final hidden state produced by the encoder part of the model. It is
calculated using the formula above.
This vector aims to encapsulate the information for all input elements in order to help
the decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
2.2.1.3 Decoders
In contrast to encoders, decoders unfold a vector that represents the sequential state
and generate meaningful outputs such as text, tags, or labels. A crucial distinction from
encoders is that decoders necessitate both the hidden state and the output from the preceding
state. The decoder comprises a stack of multiple recurrent units, where each unit predicts an
output y_t at a specific time step t. Each recurrent unit receives a hidden state from the
preceding unit and produces an output along with its own hidden state. In the context of a
question-answering problem, the output sequence constitutes a compilation of all words from
the answer, with each word denoted as y_i, corresponding to its order in the response. Any
hidden state h_i is computed using the formula:

h_t = f(W^(hh) · h_t-1)
As you can see, we are just using the previous hidden state to compute the next one.
We calculate the outputs using the hidden state at the current time step together with
the respective weight W^(S). Softmax is used to create a probability vector that helps us
determine the final output:

y_t = softmax(W^(S) · h_t)
Let’s make it clearer with the example below, which shows how machine translation works:
The encoder produced state C representing the sentence in the source language (English): I
love learning.
Then, the decoder unfolded that state C into the target language (Spanish): Amo el
aprendizaje.
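The following is a minimal NumPy sketch of the encoder-decoder recurrence described above. The weights W_hh, W_hx, and W_s, the toy dimensions, and the random inputs are illustrative stand-ins for learned embeddings and trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_hid, n_in, n_vocab = 8, 5, 10
rng = np.random.default_rng(0)
W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1    # hidden-to-hidden weights
W_hx = rng.standard_normal((n_hid, n_in)) * 0.1     # input-to-hidden weights
W_s  = rng.standard_normal((n_vocab, n_hid)) * 0.1  # hidden-to-output weights

# Encoder: h_t = tanh(W_hh @ h_{t-1} + W_hx @ x_t) over the input sequence.
h = np.zeros(n_hid)
for x_t in rng.standard_normal((4, n_in)):          # 4 input "words"
    h = np.tanh(W_hh @ h + W_hx @ x_t)
context = h                                         # the encoder vector

# Decoder: unfold the context vector, emitting one token per step.
h = context
for _ in range(3):                                  # 3 output "words"
    h = np.tanh(W_hh @ h)                           # h_t = f(W_hh @ h_{t-1})
    y_t = softmax(W_s @ h)                          # probability over the vocabulary
    print(y_t.argmax())                             # index of the predicted token
```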
Advantages of the encoder-decoder architecture include:
2. Information Compression
3. Context-Aware Generation
5. Attention Mechanism
Disadvantages include:
1. Complexity
2. Overfitting
3. Training Time
6. Interpretability
7. Resource Intensive
3. Deep learning
At its core, deep learning utilizes neural networks to model and address complex
problems. These networks mirror the structure and function of the human brain, featuring
interconnected nodes organized in layers. A defining characteristic of deep learning lies in
the deployment of deep neural networks, characterized by multiple layers. These networks
excel in autonomously learning complex representations, unveiling hierarchical patterns and
features without the need for manual feature engineering.
Deep learning's success spans various fields, including image recognition, natural
language processing, speech recognition, and recommendation systems. Prominent
architectures within deep learning encompass Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs). While training
deep neural networks traditionally demands substantial data and computational resources,
the landscape has evolved with the advent of cloud computing and specialized hardware,
such as Graphics Processing Units (GPUs), streamlining the training process.
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine
learning, and it uses a variety of methods to process each.
Artificial neural networks, also known as neural networks or neural nets, are
constructed based on the principles of the structure and functioning of human neurons. The
initial layer, called the input layer, receives input from external sources and transmits it to the
second layer known as the hidden layer. Neurons in the hidden layer receive information
from the preceding layer, perform weighted computations, and then transmit the results to
the subsequent layer. These connections are weighted, implying that the influence of inputs
from the previous layer is optimized through distinct weights assigned to each input. During
the training process, these weights are adjusted to improve the model's performance.
Artificial neurons, also referred to as units, constitute the building blocks of artificial
neural networks. These networks are organized in layers, with the complexity varying based
on the underlying patterns in the dataset. Whether a layer contains a dozen or millions of
units, it influences the intricacies of neural networks. Typically, an artificial neural network
consists of an input layer, one or more hidden layers, and an output layer. The input layer
receives external data for analysis or learning by the neural network.
In a fully connected artificial neural network, an input layer and multiple hidden
layers are sequentially connected. Each neuron in a layer receives input from the preceding
layer or the input layer. The output of one neuron serves as input for others in the next layer,
continuing until the final layer produces the network's output. As data traverses through
hidden layers, it undergoes transformations, eventually providing valuable information to the
output layer, which delivers the network's response.
Units in neural networks are interconnected between layers, with each link
possessing weights that dictate the influence of one unit on another. Through these weighted
connections, the neural network progressively learns about the data, culminating in an output
from the output layer.
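The following is a minimal NumPy sketch of a forward pass through such a fully connected network, assuming illustrative layer sizes and random (untrained) weights.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A toy fully connected network: 4 inputs -> 5 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.standard_normal((2, 5)) * 0.1, np.zeros(2)

x = rng.standard_normal(4)        # external input to the input layer
h = relu(W1 @ x + b1)             # hidden layer: weighted sum + activation
y = W2 @ h + b2                   # output layer delivers the network's response
```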
3.2 Auto-encoders
Both the encoder and decoder incorporate a combination of neural network (NN) layers,
which reduce the size of the input image and then recreate it. In the context of CNN
autoencoders, these layers consist of convolutional, max pooling, flattening, and similar
layers, while for RNN/LSTM autoencoders, the corresponding recurrent layers are
employed.
1. Encoder: This module undertakes the compression of the input data from the train-
validate-test set, producing an encoded representation significantly smaller in scale
than the initial data.
3.2.1.2 Bottleneck
The neural network's most crucial yet ironically smallest component is the
bottleneck. It serves the purpose of constraining information flow from the encoder to the
decoder, permitting only the most essential information to traverse. By design, the bottleneck
captures the maximum information inherent in an image, forming a knowledge
representation of the input. This compressed representation prevents the neural network from
memorizing the input, mitigating the risk of overfitting. It's worth noting that a smaller
bottleneck reduces the likelihood of overfitting. However, excessively small bottlenecks may
limit information storage, increasing the potential for vital information to be lost through the
pooling layers of the encoder.
3.2.1.3 Decoder
The decoder, comprised of upsampling and convolutional blocks, is the final component
responsible for reconstructing the output from the bottleneck. Given that the input to the
decoder is a compressed knowledge representation, its role is akin to that of a
"decompressor," rebuilding the image from its latent attributes.
Thus, the encoder-decoder structure helps us extract the most from an image in the form of
data and establish useful correlations between various inputs within the network.
Before commencing the training of an autoencoder, it's imperative to set four crucial
hyperparameters; a code sketch illustrating all four follows the list:
1. Code Size: The code size, representing the size of the bottleneck, stands as a pivotal
hyperparameter for tuning the autoencoder. It dictates the degree of compression
applied to the data and can serve as a regularization term.
2. Number of Layers: Similar to all neural networks, the depth of both the encoder and
the decoder constitutes a significant hyperparameter for autoencoders. A higher depth
increases model complexity, while a lower depth enhances processing speed.
3. Number of Nodes per Layer: This hyperparameter defines the number of nodes per
layer, influencing the weights used in each layer. Generally, the number of nodes
decreases as we move across subsequent layers in the autoencoder, corresponding to
the reduction in input size.
4. Reconstruction Loss: The choice of loss function for training the autoencoder
hinges on the desired adaptation to input and output types. For image data, popular
reconstruction loss functions include Mean Squared Error (MSE) Loss and L1 Loss.
In scenarios where inputs and outputs fall within the range [0,1], such as in MNIST,
Binary Cross Entropy can also serve as an effective reconstruction loss.
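The following is a minimal PyTorch sketch showing where each of the four hyperparameters appears; the specific sizes, depths, and the choice of MSE loss are illustrative assumptions, not prescriptions.

```python
import torch
from torch import nn

code_size = 16                      # hyperparameter 1: size of the bottleneck

encoder = nn.Sequential(            # hyperparameters 2 and 3: depth and nodes per layer
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, code_size),
)
decoder = nn.Sequential(
    nn.Linear(code_size, 128), nn.ReLU(),
    nn.Linear(128, 784),
)

loss_fn = nn.MSELoss()              # hyperparameter 4: reconstruction loss
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(32, 784)             # a dummy batch of flattened 28x28 images
code = encoder(x)                   # compressed knowledge representation
x_hat = decoder(code)               # reconstruction from the bottleneck
loss = loss_fn(x_hat, x)
loss.backward()
optimizer.step()
```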
Variational Autoencoders: Generate new data points that resemble the training data.
The choice of autoencoder depends on the specific task and data characteristics.
Unlike undercomplete autoencoders, which are regulated by adjusting the size of the
bottleneck, sparse autoencoders are governed by modifying the number of nodes in each
hidden layer. Due to the impracticality of designing a neural network with a variable number
of nodes in its hidden layers, sparse autoencoders adopt a strategy of penalizing the
activation of specific neurons within these layers. Simply put, the loss function includes a
term that calculates the activated neurons' count and imposes a penalty directly proportional
to it. This penalty, known as the sparsity function, acts as a regularizer, restraining the neural
network from activating additional neurons. There are two primary methods for integrating
the sparsity regularizer term into the loss function.
I. L1 Loss: Here, we add the magnitude of the sparsity regularizer to the loss as we do for
general regularizers:

L = Loss + λ · Σ_i |a_i^(h)|

where h represents the hidden layer, i represents the image in the minibatch, and a
represents the activation.
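The following is a minimal PyTorch sketch of this L1 sparsity penalty; the layer sizes and the sparsity weight λ (lam) are illustrative assumptions.

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(32, 784)
a = encoder(x)                        # hidden-layer activations a
x_hat = decoder(a)

lam = 1e-4                            # sparsity weight (illustrative)
recon = nn.functional.mse_loss(x_hat, x)
sparsity = lam * a.abs().sum()        # L1 penalty on activated neurons
loss = recon + sparsity               # L = Loss + lambda * sum |a|
loss.backward()
```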
Applications of autoencoders include:
1. Image Compression
2. Anomaly Detection
3. Feature Learning
4. Denoising
5. Data Generation
6. Image-to-Image Translation
7. Representation Learning
Variational Autoencoders (VAEs) leverage neural networks for both encoding and
decoding processes. The encoder network reduces input data to a lower-dimensional latent
space, while the decoder network reconstructs the original data from the latent code.
In the training phase, VAEs fine-tune the parameters of the encoder and decoder
networks by minimizing the reconstruction error and the Kullback-Leibler (KL) divergence
between the variational and true posterior distributions. This optimization is commonly
achieved through algorithms like stochastic gradient descent.
3.3.2 Architecture:
Variational Autoencoders (VAEs) distinguish themselves from traditional
autoencoders by providing a statistical approach to describing dataset samples in the latent
space. In a VAE, the encoder generates a probability distribution in the bottleneck layer
instead of a single output value.
The typical VAE architecture comprises an encoder network mapping input data to a
lower-dimensional latent space and a decoder network reconstructing the original data from
the latent code. The encoder's output includes the mean and variance of a Gaussian
distribution, facilitating the sampling of the latent code. The decoder network is tailored to
reconstruct the original data from this latent code.
In this architecture, the encoder network maps the input data to the latent code, and the
decoder network maps the latent code back to the reconstructed data. The VAE is then
trained to minimize the reconstruction error between the input and reconstructed data.
The regularization not only fosters smooth interpolation between training data points
but also enhances the VAE's capability to generate novel data samples resembling the
training data. Additionally, this regularization prevents the decoder network from achieving
perfect reconstruction of the input data. Instead, it compels the decoder to acquire a more
generalized representation of the data, contributing to the VAE's proficiency in generating
diverse data samples.
The encoder network outputs the parameters of a Gaussian distribution for the latent
code, typically the mean and the log-variance (or standard deviation). The latent code is then
sampled from this Gaussian distribution. The KL divergence between this distribution and a
prior (often assumed to be a standard normal distribution) is added as a regularization term
to the VAE's loss function.
A KL divergence term in the loss function will encourage the learned latent variables
to have similar distributions to the prior.
KL(q(z∣x) ∣∣ p(z)) = E[log q(z∣x) − log p(z)]
Overall, the regularization in a VAE helps improve the model's ability to generate new data
samples and prevents overfitting to the training data.
3.3.3.2 Mathematical Details of VAEs
The main goal of the VAE is to learn the true posterior distribution of the latent variables
given the observed variables, which are defined as p(z∣x). The VAE uses an encoder network
to approximate the true posterior distribution with a learned approximation q(z∣x) to achieve
this.
A VAE can be represented as a directed graphical model, as shown below:
The VAE learns the parameters of the model by maximizing the Evidence Lower Bound
(ELBO), which is defined as
ELBO = E[log p(x∣z)] − KL(q(z∣x) ∣∣ p(z))
The first term on the right-hand side of the equation is the reconstruction term, which
measures how well the VAE can reconstruct the input data. The second term, KL divergence,
measures the difference between the approximate posterior and the prior distribution.
A VAE uses a probabilistic framework to model the data by assuming that the input data is
generated from a latent space according to certain probabilistic distributions. The goal is to
learn the true posterior distribution by maximizing the likelihood of the input data.
The goal is to find the approximate posterior distribution q(z∣x) that is closest to the true
posterior distribution p(z∣x) in terms of KL divergence:

KL(q(z∣x) ∣∣ p(z∣x)) = E_q(z∣x)[log q(z∣x) − log p(z∣x)]
The VAE is trained to minimize this KL divergence by maximizing the evidence lower
bound (ELBO), defined as:

ELBO = E_q(z∣x)[log p(x∣z)] − KL(q(z∣x) ∣∣ p(z))
The first term on the right-hand side of the equation is the reconstruction term, which
measures how well the VAE can reconstruct the input data. The second term, the KL
divergence, measures the difference between the approximate posterior and the prior
distribution; maximizing the ELBO indirectly minimizes the KL divergence to the true
posterior.
A VAE with Variational Inference can likewise be represented as a directed graphical
model, as shown below:
Overall, Variational Inference is used by VAE to approximate the true posterior distribution
over the latent space with a simpler distribution by minimizing the KL divergence between
the variational distribution and the true posterior distribution and reconstructing the input
data as accurately as possible.
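The following is a minimal PyTorch sketch of the VAE objective described above, using the reparameterization trick and the closed-form KL term for a standard normal prior; the placeholder linear encoder/decoder and all sizes are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Placeholder encoder/decoder; a real VAE would use deeper networks.
enc = nn.Linear(784, 2 * 16)        # outputs mean and log-variance (latent dim 16)
dec = nn.Linear(16, 784)

x = torch.rand(32, 784)
mu, logvar = enc(x).chunk(2, dim=1)

# Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I).
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

x_hat = torch.sigmoid(dec(z))

# ELBO = reconstruction term - KL(q(z|x) || p(z)); we minimize the negative.
recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed form vs. N(0, I)
loss = recon + kl
loss.backward()
```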
Generative models, such as GANs, have the capability of generating new data that
resembles their training data. In this process, the generator is trained to produce deceptive
data, while the discriminator learns to differentiate between the generator's output and
genuine examples.
The ongoing feedback loop between these adversarial networks leads to the refinement of
the generator's ability to produce high-quality and realistic outputs, while the discriminator
improves its accuracy in identifying artificially created data. For example, a GAN can be
trained to generate lifelike images of human faces that do not correspond to any actual
person.
A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
4.1.1 Generator Model:
The generator model plays a pivotal role in the Generative Adversarial Network
(GAN) by generating novel and accurate data. Taking random noise as input, the generator
transforms it into complex data samples, such as text or images, typically implemented as a
deep neural network. During training, the generator captures the underlying distribution of
the training data through layers of learnable parameters, adjusting its output to closely
resemble real data. Backpropagation is employed for fine-tuning these parameters. The
success of the generator lies in its capability to produce high-quality and diverse samples,
effectively deceiving the discriminator in the adversarial interplay.
For generated samples, the generator minimizes the negative log-likelihood that the
discriminator classifies them as real. Due to this loss, the generator is incentivized to
generate samples that the discriminator is likely to classify as real (D(G(z_i)) close to 1):

J_G = −(1/m) · Σ_{i=1}^{m} log D(G(z_i))

where
log D(G(z_i)) represents the log probability of the discriminator classifying the
generated sample as real.
The generator aims to minimize this loss, encouraging the production of samples that
the discriminator classifies as real (D(G(z_i)) close to 1).
The discriminator minimizes the negative log-likelihood of correctly classifying both
produced and real samples. This loss incentivizes the discriminator to accurately categorize
generated samples as fake (1 − D(G(z_i)) close to 1) and real samples as real (D(x_i) close
to 1):

J_D = −(1/m) · Σ_{i=1}^{m} [ log D(x_i) + log(1 − D(G(z_i))) ]

where
log D(x_i) represents the log likelihood that the discriminator will accurately
categorize real data.
log(1 − D(G(z_i))) represents the log likelihood that the discriminator will correctly
categorize generated samples as fake.
The discriminator aims to reduce this loss by accurately identifying artificial and real
samples.
This creates a double feedback loop where the discriminator is in a feedback loop with
the ground truth of the images and the generator is in a feedback loop with the discriminator.
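The following is a minimal PyTorch sketch of this double feedback loop, alternating one discriminator update and one generator update; the placeholder networks, sizes, and learning rates are illustrative.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(16, 784)            # a dummy batch of real samples
z = torch.randn(16, 8)                # random noise input to the generator

# Discriminator step: classify real as 1 and generated as 0 (generator idle).
opt_D.zero_grad()
loss_D = bce(D(real), torch.ones(16, 1)) + bce(D(G(z).detach()), torch.zeros(16, 1))
loss_D.backward()
opt_D.step()

# Generator step: fool the discriminator into outputting 1 (discriminator idle).
opt_G.zero_grad()
loss_G = bce(D(G(z)), torch.ones(16, 1))
loss_G.backward()
opt_G.step()
```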
4.3 Different Types of GAN Models
Vanilla GAN: The most straightforward type of GAN, Vanilla GAN employs simple
multi-layer perceptrons as the Generator and the Discriminator. The algorithm
optimizes a mathematical equation using stochastic gradient descent.
Deep Convolutional GAN (DCGAN): Among the most popular and successful
implementations, DCGAN utilizes ConvNets instead of multi-layer perceptrons.
ConvNets replace max pooling with convolutional stride, and the layers are not fully
connected.
Super Resolution GAN (SRGAN): SRGAN utilizes a deep neural network and an
adversarial network to generate higher-resolution images. Specifically designed for
optimally up-scaling native low-resolution images, SRGAN enhances details while
minimizing errors in the process.
The discriminator loss helps improve its performance and penalizes it when it
misclassifies real as fake or vice versa.
The generator is given random noise as input and generates fake outputs from it. When the
generator is being trained, the discriminator is idle, and when the discriminator is being
trained, the generator is idle. During generator training, the network tries to transform
random noise into meaningful data; getting meaningful output from the generator takes time
and many epochs. The steps to train a generator are listed below.
The Generator undergoes further training based on the feedback provided by the
Discriminator. This iterative process continues until the Generator successfully fools the
Discriminator.
4.5 Applications of GANs:
2. Image-to-Image Translation
3. Text-to-Image Synthesis
4. Data Augmentation
4.6 Advantages of GANs:
2. High-Quality Results
3. Unsupervised Learning
4. Versatility
4.7 Disadvantages of Generative Adversarial Networks (GANs):
1. Training Instability
2. Computational Cost
3. Overfitting
5. Boltzmann Machines
Boltzmann machines are not deterministic deep learning models; they are stochastic
or generative deep learning models. They are representations of a system.
Visible nodes:
These are nodes that can be measured and are measured.
Hidden nodes:
These are nodes that cannot be measured or are not measured.
They use stochastic binary units to reach probability distribution equilibrium (to minimize
energy). It is possible to get multiple Boltzmann machines to collaborate together to form far
more sophisticated systems like deep belief networks.
The Boltzmann machine is named after Ludwig Boltzmann, the Austrian scientist who
came up with the Boltzmann distribution. However, this type of network was first developed
by Geoffrey Hinton, together with Terrence Sejnowski.
A major difference is that, unlike other traditional networks (artificial, convolutional, or
recurrent networks), which don't have any connections between the input nodes, Boltzmann
machines have connections among the input nodes. Every node is connected to all other
nodes irrespective of whether they are input
or hidden nodes. This enables them to share information among themselves and self-generate
subsequent data. You’d only measure what’s on the visible nodes and not what’s on the
hidden nodes. After the input is provided, the Boltzmann machines are able to capture all the
parameters, patterns and correlations among the data. It is because of this that they are
known as deep generative models and they fall into the class of Unsupervised Deep
Learning.
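To make the stochastic binary units concrete, the following is a minimal NumPy sketch of one Gibbs sampling step in a restricted Boltzmann machine (a commonly used simplification in which visible-visible and hidden-hidden connections are dropped); the weights are randomly initialized for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = rng.standard_normal((n_vis, n_hid)) * 0.1   # visible-hidden couplings
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)     # biases

v = rng.integers(0, 2, n_vis).astype(float)     # stochastic binary visible units

# One Gibbs sampling step: sample hidden given visible, then visible given hidden.
p_h = sigmoid(v @ W + b_h)
h = (rng.random(n_hid) < p_h).astype(float)
p_v = sigmoid(h @ W.T + b_v)
v_new = (rng.random(n_vis) < p_v).astype(float)

# Energy of a joint configuration; repeated sampling drives the system toward
# low-energy (high-probability) states, i.e. toward equilibrium.
energy = -(v @ W @ h) - b_v @ v - b_h @ h
```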
5.4 Advantages:
Generative Modeling:
o DBMs are capable of generative modeling, meaning they can generate new
samples from the learned distribution. This is valuable for tasks such as
generating realistic data instances.
Representation Learning:
o DBMs can learn hierarchical and distributed representations of data. This is
advantageous for capturing complex patterns and high-level features in the
input.
Unsupervised Learning:
o DBMs can be trained in an unsupervised manner, meaning they can learn
from data without explicit labels. This is particularly useful when labeled data
is scarce or expensive to obtain.
Capturing Dependencies:
o DBMs are designed to capture dependencies and interactions among
variables, making them suitable for modeling complex relationships in the
data.
5.5 Disadvantages:
Computational Complexity:
o Training and inference in DBMs can be computationally expensive. The
learning algorithms often involve sampling procedures, and as the depth of
the model increases, the training process becomes more challenging.
Difficulty in Training:
o Training deep generative models, including DBMs, can be challenging. The
training process may suffer from issues like slow convergence,
vanishing/exploding gradients, and mode collapse.
Limited Scalability:
o Scaling DBMs to handle large datasets and high-dimensional input spaces can
be difficult. The number of parameters in the model grows rapidly with the
depth and size of each layer, which can lead to scalability issues.
Sensitivity to Hyperparameters:
o DBMs have various hyperparameters, and the performance of the model can
be sensitive to their settings. Finding the right set of hyperparameters for a
specific task can be a non-trivial task.
An attention mechanism in a neural network model typically consists of the following steps
(a code sketch of the core computation follows the list):
1. Data Encoding:
The initial step involves representing or embedding the input sequence of data
using a set of representations. This encoding process transforms the input into a
format suitable for processing by the attention mechanism.
2. Query Generation:
A query vector is created based on the current state or context of the model.
This vector serves as a representation of the information that the model aims to focus
on or retrieve from the input.
3. Key-Value Pair Creation:
The representations of the input are divided into key-value pairs. Keys
capture information critical for determining relevance, while values encompass the
actual data or information.
4. Similarity Calculation:
The next step involves computing the similarity between the query vector and
each key. This computation measures their compatibility or relevance, and various
similarity metrics, such as dot product, cosine similarity, or scaled dot product, can
be utilized.
5. Attention Weight Calculation:
The similarity scores are normalized, typically with a softmax function, to produce the
attention weights. With additive scoring, for example:

α = softmax(v^T · tanh(W · [query; key]))

where,
W: Weight Matrix
v : Weight vector
6. Weighted Summation:
The attention weights are then applied to the respective values, producing
a weighted sum. This process consolidates the pertinent information from the input,
emphasizing its importance as determined by the attention mechanism:

c_t = Σ_{s=1}^{Ts} α_t,s · h_s

Here, Ts: Total number of key-value pairs (source hidden states) in the encoder.
8. Model Integration:
The context vector integrates with the model's present state or hidden
representation, furnishing additional information or context for subsequent steps or
layers of the model.
9. Iterative Process:
Steps 2 to 8 are iteratively performed for each step or iteration of the model,
enabling the attention mechanism to dynamically focus on varied segments of the
input sequence or data.
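The following is a minimal NumPy sketch of these steps using scaled dot-product similarity (one of the metrics named in step 4); the dimensions and the random query, keys, and values are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Scaled dot-product attention over the steps described above."""
    d_k = keys.shape[-1]
    scores = keys @ query / np.sqrt(d_k)      # step 4: similarity calculation
    weights = softmax(scores)                 # step 5: attention weights
    return weights @ values, weights          # step 6: weighted summation

rng = np.random.default_rng(0)
T_s, d = 5, 8                                 # T_s source states, dimension d
q = rng.standard_normal(d)                    # step 2: query vector
K = rng.standard_normal((T_s, d))             # step 3: keys
V = rng.standard_normal((T_s, d))             # step 3: values
context, alpha = attention(q, K, V)           # context vector and its weights
```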