Unit 4 Notes

Attention Mechanisms

Attention mechanisms enhance deep learning models by selectively focusing on important input
elements, improving prediction accuracy and computational efficiency. They prioritize and emphasize
relevant information, acting as a spotlight to enhance overall model performance.

Fundamentally, the attention mechanism is akin to our brain's neurological system, which emphasizes
relevant sounds while filtering out background distractions. In the realm of deep learning, it allows
neural networks to attribute varying levels of importance to different input segments, significantly
boosting their capability to capture essential information. This process is crucial in tasks such as natural
language processing (NLP), where attention aids in aligning relevant parts of a source sentence during
translation or question-answering activities.

Working:

1. Breaking Down the Input: Let’s say you have a bunch of words (or any kind of data) that you
want the computer to understand. First, it breaks down this input into smaller pieces, like
individual words.

2. Picking Out Important Bits: Then, it looks at these pieces and decides which ones are the most
important. It does this by comparing each piece to a question or ‘query’ it has in mind.

3. Assigning Importance: Each piece gets a score based on how well it matches the question. The
higher the score, the more important that piece is.

4. Focusing Attention: After scoring each piece, it figures out how much attention to give to each
one. Pieces with higher scores get more attention, while less important ones get less attention.

5. Putting It All Together: Finally, it adds up all the pieces, but gives more weight to the important
ones. This way, the computer gets a clearer picture of what’s most important in the input.
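
These five steps can be sketched in a few lines of Python. The following is a minimal illustration using NumPy, with made-up vectors standing in for the input pieces and the query:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

# Step 1: break the input into pieces, each represented as a vector
pieces = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Step 2: compare each piece to a "query" the model has in mind
query = np.array([1.0, 0.5])

# Step 3: assign each piece an importance score (dot product with the query)
scores = pieces @ query

# Step 4: higher-scoring pieces get more attention (weights sum to 1)
weights = softmax(scores)

# Step 5: weighted sum of the pieces, emphasizing the important ones
output = weights @ pieces
print(weights, output)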

Attention is an interface connecting the encoder and decoder that provides the decoder with
information from every encoder hidden state. With this framework, the model is able to selectively
focus on valuable parts of the input sequence and hence, learn the association between them. This
helps the model to cope efficiently with long input sentences.

Attention mechanisms in deep learning are used to help the model focus on the most relevant parts of
the input when making a prediction. In many problems, the input data may be very large and complex,
and it can be difficult for the model to process all of it. Attention mechanisms allow the model to
selectively focus on the parts of the input that are most important for making a prediction, and to ignore
the less relevant parts. This can help the model to make more accurate predictions and to run more
efficiently.

Intuition

The figure below demonstrates an Encoder-Decoder architecture with an attention layer.


The idea is to keep the decoder as it is and to replace the sequential RNN/LSTM in the encoder with a
bidirectional RNN/LSTM.

Here, we give attention to some words by considering a window of size Tx (say four words x1, x2, x3, and
x4). Using these four words, we create a context vector c1, which is given as input to the decoder.
Similarly, we create a context vector c2 from the same four words but with different weights. The weights
α1, α2, α3, and α4 determine how much attention each word receives, and the sum of all weights within one window is equal to 1.

Similarly, we create context vectors from different sets of words with different α values.

The attention model computes a set of attention weights denoted by α(i,1), α(i,2), …, α(i,Tx), because not
all the inputs are used equally in generating each output. The context vector ci for the
output word yi is generated as the weighted sum of the annotations:

[ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j ]

The attention weights are calculated by normalizing (with a softmax) the output score of a feed-forward neural network a
that captures the alignment between the input at position j and the output at position i:

[ e_{ij} = a(s_{i-1}, h_j), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} ]

Implementation

Let's take an example where a translator reads the English(input language) sentence while writing down
the keywords from the start till the end, after which it starts translating to Portuguese (the output
language). While translating each English word, it makes use of the keywords it has understood.

Attention places different focus on different words by assigning each word a score. Then, using the
softmax of these scores, we aggregate the encoder hidden states with a weighted sum to get the context
vector.

The implementation of an attention layer can be broken down into six steps (Step 0 to Step 5).

Step 0: Prepare hidden states.

First, prepare all the available encoder hidden states (green) and the first decoder hidden state (red). In
our example, we have 4 encoder hidden states and the current decoder hidden state. (Note: the last
consolidated encoder hidden state is fed as input to the first time step of the decoder. The output of this
first time step of the decoder is called the first decoder hidden state.)

Step 1: Obtain a score for every encoder hidden state.

A score (scalar) is obtained by a score function (also known as alignment score function or alignment
model). In this example, the score function is a dot product between the decoder and encoder hidden
states.

Step 2: Run all the scores through a softmax layer.

We pass the scores through a softmax layer so that the softmax scores (scalars) add up to 1. These softmax
scores represent the attention distribution.

Step 3: Multiply each encoder hidden state by its softmax score.

By multiplying each encoder hidden state with its softmax score (scalar), we obtain the alignment vector
or the annotation vector. This is exactly the mechanism where alignment takes place.
Step 4: Sum the alignment vectors.

The alignment vectors are summed up to produce the context vector. The context vector aggregates the
information from the alignment vectors of the previous step.

Step 5: Feed the context vector into the decoder.

The context vector is passed to the decoder, where it is typically combined with the previous decoder output or hidden state to generate the next output word; exactly how it is fed in depends on the architecture.
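
Steps 0 to 5 can be traced end to end with a toy NumPy example. The four encoder hidden states, the decoder hidden state, and the dot-product score function below are illustrative stand-ins, not values from a real model:

import numpy as np

# Step 0: prepare hidden states (4 encoder states and the current decoder state)
encoder_states = np.array([[0.0, 1.0, 1.0],
                           [5.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0],
                           [0.0, 5.0, 1.0]])
decoder_state = np.array([10.0, 5.0, 10.0])

# Step 1: obtain a score for every encoder hidden state (dot product)
scores = encoder_states @ decoder_state

# Step 2: run all the scores through a softmax layer
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Step 3: multiply each encoder hidden state by its softmax score
alignment = weights[:, None] * encoder_states

# Step 4: sum the alignment vectors to produce the context vector
context = alignment.sum(axis=0)

# Step 5: the context vector is then fed into the decoder
print(context)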

Recurrent Models of Visual Attention

Recurrent models of visual attention combine the principles of attention mechanisms and recurrent
neural networks (RNNs) to effectively process and analyze visual data. This approach is particularly
useful in tasks such as image captioning, visual question answering, and video analysis, where the ability
to focus on specific regions of an image or sequence is crucial.

Key Features:

1. Visual Attention:

o Visual attention mimics human visual perception by allowing models to focus on specific
areas of an image while ignoring irrelevant parts. This selective focus helps in extracting
meaningful features and improving performance in various tasks.

2. Recurrent Neural Networks (RNNs):

o RNNs are designed for sequential data, maintaining a hidden state that captures
information from previous time steps. They are particularly effective for tasks where the
input is a sequence, such as text or time series data.

3. Combining Attention and RNNs:

o Integrating attention mechanisms into RNNs allows the model to dynamically adjust its
focus on different parts of the visual input as it generates output sequences, such as
captions or answers to questions.

Architecture of Recurrent Models of Visual Attention

1. Input Representation:

o Images are typically processed using Convolutional Neural Networks (CNNs) to extract
feature maps. These feature maps represent different regions of the image and their
corresponding visual features.

2. Attention Mechanism:

o The attention mechanism computes a set of attention weights that determine the
importance of different regions of the feature map.
o Attention scores can be calculated based on the current hidden state of the RNN and
the feature representations.

3. Context Vector:

o The attended feature representation (context vector) is computed as a weighted sum of
the feature map, where the weights are determined by the attention mechanism.

4. Recurrent Processing:

o The RNN (often an LSTM or GRU) takes the current hidden state and the context vector
as inputs. The hidden state is updated based on the attended features, allowing the
model to maintain context over time.

5. Output Generation:

o The output of the RNN can be used for various tasks, such as generating a sequence of
words (in image captioning) or predicting a class label (in visual question answering).

Attention Mechanism Details

1. Calculating Attention Weights:

o Given the feature map (F) and the previous hidden state (h_{t-1}), the attention weights
can be computed using a scoring function (f): [ e_{tj} = f(h_{t-1}, F_j) ] where (e_{tj}) is
the attention score for the (j)-th region of the feature map at time (t).

2. Softmax Normalization:

o The attention scores are normalized using the softmax function: [ \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})} ] where (\alpha_{tj}) represents the attention weight
for the (j)-th region.

3. Weighted Feature Representation:

o The context vector is computed as a weighted sum of the feature map: [ \mathbf{c}_t = \sum_{j} \alpha_{tj} F_j ] where (\mathbf{c}_t) is the context vector for the current time
step.
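
The three equations above can be sketched in PyTorch as follows. This is a minimal illustration that assumes an additive (MLP) scoring function f, a flattened 7x7 CNN feature map with 512-dimensional region features, and a 256-dimensional RNN hidden state; all of these sizes, and the scoring function itself, are assumptions for the example:

import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    # scoring function f over feature-map regions, conditioned on h_{t-1}
    def __init__(self, feat_dim=512, hid_dim=256, attn_dim=128):
        super().__init__()
        self.W_h = nn.Linear(hid_dim, attn_dim)   # projects h_{t-1}
        self.W_f = nn.Linear(feat_dim, attn_dim)  # projects each region F_j
        self.v = nn.Linear(attn_dim, 1)           # maps to a scalar score e_{tj}

    def forward(self, feats, h_prev):
        # feats: (batch, regions, feat_dim), h_prev: (batch, hid_dim)
        e = self.v(torch.tanh(self.W_f(feats) + self.W_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)    # attention weights alpha_{tj}
        c = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # context vector c_t
        return c, alpha

# toy usage: a 7x7 feature map flattened into 49 regions
feats = torch.randn(2, 49, 512)
h_prev = torch.randn(2, 256)
c, alpha = VisualAttention()(feats, h_prev)
print(c.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 49])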

Applications of Recurrent Models of Visual Attention

1. Image Captioning:

o Models can generate descriptive captions for images by focusing on different parts of
the image at each time step of the caption generation process.

2. Visual Question Answering (VQA):


o By attending to relevant regions of an image while answering questions about it, models
can provide more accurate and context-aware answers.

3. Object Detection and Recognition:

o Attention mechanisms can help models focus on specific objects within an image,
improving their ability to detect and classify items.

4. Video Analysis:

o In video processing, recurrent models of visual attention can dynamically focus on
different frames or regions of interest, enhancing understanding of temporal dynamics.

Attention Mechanisms for Machine Translation in Deep Learning

Attention mechanisms have revolutionized machine translation by allowing models to focus on
specific parts of the input sequence when generating translations. This approach enhances the
performance of neural machine translation (NMT) systems, particularly in handling long
sentences and complex structures.

The attention mechanism calculates a weighted sum of the input sequence at each step of the
output sequence. The weights are based on how relevant each part of the input sequence is to the
output word currently being generated.

Neural Machine Translation (NMT)

NMT is a large neural network that is trained in an end-to-end fashion to translate one
language into another. The figure below is an illustration of NMT with an RNN-based encoder-decoder architecture.
Figure 1: Neural machine translation as a stacking recurrent architecture for translating a source
sequence A B C D into a target sequence X Y Z. Here <eos> marks the end of a sentence.
NMT directly models the conditional probability p(y|x) of translating a source sentence
(x1, x2, …, xn) into a target sentence (y1, y2, …, ym).
NMT consists of two components:
1. An encoder which computes a representation s for each source sentence
2. A decoder which generates the translation one word at a time and hence decomposes the
conditional probability as:

[ \log p(y \mid x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s) ]

(the probability of a translation y given the source sentence x)


One could parametrize the probability of decoding each word y_j as

[ p(y_j \mid y_{<j}, s) = \text{softmax}(g(h_j)) ]

where the RNN hidden unit h_j could be modeled as

[ h_j = f(h_{j-1}, s) ]

where
g: a transformation function that outputs a vocabulary-sized vector
h: the RNN hidden unit
f: a function that computes the current hidden state given the previous hidden state.
The training objective for the translation process could be framed as the negative log-likelihood over the training corpus D (the loss function):

[ J = \sum_{(x, y) \in D} -\log p(y \mid x) ]
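
In practice this objective is the per-word negative log-likelihood (cross-entropy) summed over the training corpus. A brief PyTorch sketch, where the vocabulary size, the batch and sentence shapes, and the random stand-in logits g(h_j) are all assumptions for illustration:

import torch
import torch.nn.functional as F

vocab_size = 10000
# logits g(h_j) for each target position: (batch, target_len, vocab_size)
logits = torch.randn(4, 7, vocab_size)
targets = torch.randint(0, vocab_size, (4, 7))   # gold target words y_j

# J = sum over the corpus of -log p(y | x), decomposed per target word
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())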

Generative Adversarial Networks

Training a Generative Adversarial Network

Using GANs for Generating Image Data

GAN is an algorithm that uses two neural networks: a Generator G and a Discriminator D. The two
networks compete against one another (hence the term ‘adversarial’).

The Generator creates synthetic data, while the Discriminator tries to distinguish between the
generated data and real data. This competition leads to highly realistic data that can often pass for
real.

GANs are usually trained to generate images from random noise. A GAN typically has two parts:
a Generator that generates new samples of images, and a Discriminator that classifies images as
real or fake. For example, we can train a GAN model to generate digit images that look like the
hand-written digit images of the MNIST dataset. Beyond this, GANs are widely used for voice,
image, and video generation.

o Generator: a model that generates new, plausible data examples for the problem
domain.

o Discriminator: a model that classifies the given examples as real (from the
domain) or fake (generated).
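
The competition between these two parts is formalized by the minimax objective of the original GAN paper (Goodfellow et al., 2014):

[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] ]

Here D(x) is the probability the Discriminator assigns to a sample x being real, and G(z) is the sample the Generator produces from random noise z.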
Creating novel images given an image dataset is one of the strengths of a specific branch of models
called Generative Adversarial Networks (GAN). These networks specialize in unsupervised/semi-
supervised image generation given any image data.

Building the Model

The GAN we want to create comprises two major parts:

 Generator

 Discriminator.

The Generator is responsible for creating novel images, while the Discriminator is responsible for
understanding how good the generated image is.

The entire architecture we want to build for the GAN's image generation is shown in the following
diagram.

Example:
 MNIST (Modified National Institute of Standards and Technology) dataset. This dataset contains
handwritten digit images of size 28x28.
 MNIST is an easy dataset for a GAN such as the one we are building, as it has small, single-channel images.
 The shape of each image is defined as 28x28x1. The last dimension
corresponds to the number of channels in an image. Since we are using the MNIST dataset in
black and white, we only have a single channel.
 z_size is the dimensionality of the latent space we sample from. In this case, we set it to 100. This
number could be modified if required.

Defining the Generator

The job of the Generator (G) is to create realistic images that the Discriminator fails to recognize as
fake. Thus, the Generator is an essential component that enables a GAN's image generation ability. The
architecture we consider in this article comprises fully connected (FC) layers and Leaky ReLU activations.
The final layer of the Generator has a TanH activation rather than a LeakyReLU. This choice was
made because we want to bring the generated image into the same range as the normalized MNIST
data (-1, 1).
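
A minimal PyTorch sketch of such a generator follows. The notes specify only FC layers, LeakyReLU activations, and a final TanH; the particular layer widths are illustrative assumptions:

import torch.nn as nn

z_size = 100          # latent dimension, as set above
img_size = 28 * 28    # flattened single-channel MNIST image

# FC + LeakyReLU blocks, ending with Tanh so outputs lie in (-1, 1)
generator = nn.Sequential(
    nn.Linear(z_size, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, img_size),
    nn.Tanh(),
)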

Defining the Discriminator

The GAN uses the Discriminator (D) to identify how real the Generator's outputs look by returning a
probability of real vs. fake. This part of the network can be treated as a binary classification problem. To
solve this binary classification problem, we need a rather simple network composed of blocks of fully
connected (FC) layers, Leaky ReLU activations, and Dropout layers. Note that the final block has an FC
layer followed by a Sigmoid. The final Sigmoid activation returns the classification probability that we
require.
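
A matching PyTorch sketch of the discriminator, again with assumed layer widths and dropout rates:

import torch.nn as nn

# blocks of FC layers, Leaky ReLU activations, and Dropout,
# ending with an FC layer and a Sigmoid
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 512),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(256, 1),
    nn.Sigmoid(),     # classification probability: real vs. fake
)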
Having defined all the required functions, we can train the network to optimize the losses.

Steps for the GAN's image generation are as follows:


 Load an image, and generate random noise the same size as the loaded image.
 Send these images to the Discriminator and calculate the real vs. fake probability.
 Generate another noise of the same size. Send this noise to the Generator.
 Run training for the Generator for a few epochs.
 Repeat all the steps until a satisfactory image is generated.
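
These steps can be condensed into a training loop. The sketch below assumes the generator and discriminator defined earlier, binary cross-entropy losses, and an Adam optimizer; the optimizer choice and learning rate are assumptions, not given in the notes:

import torch
import torch.nn.functional as F

# assumes `generator`, `discriminator`, and `z_size` from the sketches above
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):   # real_images: (batch, 784), scaled to (-1, 1)
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # train the Discriminator: real images vs. generated (fake) images
    noise = torch.randn(batch, z_size)
    fake_images = generator(noise).detach()   # detach: do not update G here
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_labels)
              + F.binary_cross_entropy(discriminator(fake_images), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # train the Generator: try to make the Discriminator output "real"
    noise = torch.randn(batch, z_size)
    g_loss = F.binary_cross_entropy(discriminator(generator(noise)), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

Calling train_step on successive batches and repeating for several epochs corresponds to the loop above; generated samples can be inspected periodically until they look satisfactory.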

Conditional Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a deep learning framework used to generate
random, plausible examples based on our needs. A GAN contains two essential parts that are always
competing against each other in a repetitive process (as adversaries). These two essential parts
are:
 Generator Network: the neural network responsible for creating (or generating)
new data. The data can be in the form of images, text, video, sound, etc., depending on the data
the network is trained on.
 Discriminator Network: its job is to distinguish between real data from the
dataset and fake data generated by the generator.

The Conditional Generative Adversarial Network (cGAN) is a model used in deep learning, a
branch of machine learning. It enables more precise generation and discrimination of images,
helping machines to be trained and to learn on their own.

Imagine the need to generate images of only Mercedes cars when you have trained your
model on a collection of cars. To do that, you need to provide the GAN model with a specific
“condition,” which can be done by providing the car’s name (or label). Conditional generative
adversarial networks work in the same way as GANs, except that the generation of data in a cGAN is
conditioned on specific input information, which could be labels, class information, or any other
relevant features. This conditioning enables more precise and targeted data generation.

To understand what a cGAN is, you first need to become familiar with deep learning, a
process that involves feeding a computer program thousands of data points so that it can learn to
recognize them. The Generative Adversarial Network (GAN) represents the initial training
approach. It sets up a dialogue between two networks: the generator and the discriminator.
On one side, the generator creates fake images that are supposed to be as realistic as possible,
with the aim of deceiving the opposing network: the discriminator.
On the other side, the discriminator observes images coming from both the generator and a
database. It must determine which images come from the database (and label them as real) and
which images are generated by the generator (and are therefore fake).
When the discriminator correctly classifies fakes as fakes and real images as real, it receives positive
feedback; if it fails in its task, it receives negative feedback. Gradually, thanks to the gradient
descent algorithm, it determines the range of data that allows it to recognize a real image,
learns from its mistakes, and improves. The generator, in turn, learns from the discriminator's
verdicts and progressively enhances its ability to create more realistic images.
The cGAN or how to maximize the performance of the generator and the discriminator
With a conditional GAN, it’s possible to send more precise information, called class labels, to
both the generator and the discriminator to guide data generation. These pieces of
information help specify the data produced by the generator and evaluated by the discriminator,
allowing them to arrive at the desired results more quickly.

The labels guide the generator’s production to generate more specific information. For example,
instead of producing images of clothing in general, it will produce images of pants, jackets, or
socks based on the provided label.
On the discriminator’s side, the labels help the network better distinguish between real images
and the fake images provided by the generator, making it more efficient.
Architecture and Working of cGANs

Conditioning in GANs:

 GANs can be extended to a conditional model by providing additional information (denoted as y)
to both the generator and discriminator.

 This additional information (y) can be any kind of auxiliary information, such as class labels or
data from other modalities.

 In the generator, the prior input noise (z) and y are combined in a joint hidden representation.

Generator Architecture:

 The generator takes both the prior input noise (z) and the additional information (y) as inputs.

 These inputs are combined in a joint hidden representation, and the generator produces
synthetic samples.

 The adversarial training framework allows flexibility in how this hidden representation is
composed.

Discriminator Architecture:
 The discriminator takes both real data (x) and the additional information (y) as inputs.

 The discriminator’s task is to distinguish between real data and synthetic data generated by the
generator conditioned on y.
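
A condensed PyTorch sketch of this conditioning follows. It assumes 10 classes (e.g. MNIST digits), a 100-dimensional noise vector, and flattened 28x28 images; the label y is embedded and concatenated with the noise (in the generator) or the image (in the discriminator) to form the joint hidden representation:

import torch
import torch.nn as nn

n_classes, z_size, img_size = 10, 100, 28 * 28

class CGANGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)  # embed the label y
        self.net = nn.Sequential(
            nn.Linear(z_size + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, img_size), nn.Tanh(),
        )

    def forward(self, z, y):
        # joint hidden representation: concatenate noise z with label info y
        return self.net(torch.cat([z, self.label_emb(y)], dim=1))

class CGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(img_size + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y):
        # condition the real-vs-fake decision on the same label y
        return self.net(torch.cat([x, self.label_emb(y)], dim=1))

# toy usage: ask for images of the digit 3 (hypothetical label choice)
z = torch.randn(4, z_size)
y = torch.full((4,), 3, dtype=torch.long)
fake = CGANGenerator()(z, y)
prob_real = CGANDiscriminator()(fake, y)

For instance, passing the label 3 with every noise vector asks the generator for images of the digit 3, and the discriminator judges real vs. fake with that same label in hand.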
