DL Unit 5
IMAGE SEGMENTATION
Most deep learning approaches to image segmentation follow an encoder-decoder design. The encoder produces a latent-space representation of the input, which the decoder decodes into segment maps, in other words maps outlining each object's location in the image.
Types of Image Segmentation tasks
Image segmentation tasks can be classified into three groups based on the amount and type of
information they convey.
Semantic Segmentation
Instance Segmentation
Panoptic Segmentation
Semantic segmentation
Semantic segmentation refers to the classification of pixels in an image into semantic classes.
Pixels belonging to a particular class are simply classified to that class with no other
information or context taken into consideration.
As might be expected, this becomes an ill-defined problem when there are multiple closely
grouped instances of the same class in the image. For an image of a crowd in a street, a
semantic segmentation model would predict the entire crowd region as belonging to the
“pedestrian” class, providing very little in-depth detail or information about the image.
Instance segmentation
Instance segmentation models classify pixels into categories on the basis of “instances” rather
than classes.
An instance segmentation algorithm has no idea of the class a classified region belongs to but
can segregate overlapping or very similar object regions on the basis of their boundaries.
If the same image of a crowd we talked about before is fed to an instance segmentation
model, the model would be able to segregate each person from the crowd as well as the
surrounding objects (ideally), but would not be able to predict what each region/object is an
instance of.
Panoptic segmentation
Panoptic segmentation, the most recently developed segmentation task, can be expressed as
the combination of semantic segmentation and instance segmentation where each instance of
an object in the image is segregated and the object’s identity is predicted.
Panoptic segmentation algorithms find large-scale applicability in popular tasks like self-
driving cars where a huge amount of information about the immediate surroundings must be
captured with the help of a stream of images.
Encoder-decoder architectures for semantic segmentation became popular with works like
SegNet (by Badrinarayanan et al.) in 2015.
SegNet proposes the use of a combination of convolutional and downsampling blocks to
squeeze information into a bottleneck and form a representation of the input. The decoder
then reconstructs input information to form a segment map highlighting regions on the input
and grouping them under their classes.
Finally, the decoder has a sigmoid activation at the end that squeezes the output in the range
(0,1).
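A minimal sketch of this encoder-decoder idea in Keras (the layer sizes, input resolution, and single-channel sigmoid output are illustrative assumptions, not SegNet's actual configuration):

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(128, 128, 3))
# Encoder: convolution + downsampling squeezes information into a bottleneck
x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D()(x)                       # 64 x 64
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D()(x)                       # 32 x 32 bottleneck
# Decoder: upsampling + convolution reconstructs a full-resolution segment map
x = layers.UpSampling2D()(x)                       # 64 x 64
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.UpSampling2D()(x)                       # 128 x 128
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)   # per-pixel score in (0, 1)
model = Model(inputs, outputs)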
U-NET:
SegNet was accompanied by the release of another independent segmentation work at the
same time, U-Net (by Ronneberger et al.), which introduced skip connections as a solution for
the loss of information observed in the downsampling layers of typical encoder-decoder
networks.
Skip connections are connections that go from the encoder directly to the decoder without
passing through the bottleneck.
In other words, feature maps at various levels of encoded representations are captured and
concatenated to feature maps in the decoder. This helps to reduce data loss by aggressive
pooling and downsampling as done in the encoder blocks of an encoder-decoder architecture.
Skip Connections were a big hit, specifically in the domain of medical imaging, with U-Net
providing state-of-the-art results in cell segmentation for the diagnosis of diseases.
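A minimal Keras sketch of a skip connection: an encoder feature map bypasses the bottleneck and is concatenated with the corresponding decoder feature map (shapes and layer sizes are illustrative):

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(128, 128, 3))
enc = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)   # encoder features
down = layers.MaxPooling2D()(enc)
bottleneck = layers.Conv2D(64, 3, padding='same', activation='relu')(down)
up = layers.UpSampling2D()(bottleneck)
# Skip connection: encoder features are concatenated with the decoder features
merged = layers.Concatenate()([up, enc])
dec = layers.Conv2D(32, 3, padding='same', activation='relu')(merged)
outputs = layers.Conv2D(1, 1, activation='sigmoid')(dec)
model = Model(inputs, outputs)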
DEEP LAB
DeepLab made use of atrous convolutions replacing simple pooling operations and preventing
significant information loss while downsampling. They further introduced multi-scale feature
extraction with the help of Atrous Spatial Pyramid Pooling to help the network segment objects
regardless of their sizes.
To recover boundary information, one of the most important parts of semantic as well as instance
segmentation, they made use of fully connected Conditional Random Fields (CRFs).
Coupling the fine-grained localization accuracy of CRFs with the recognition capacity of CNNs led
DeepLab to provide highly accurate segment maps, beating methods like FCNs and SegNet by a wide
margin.
Papers like SegNet, U-Net, and DeepLab laid the groundwork for future work like Mask R-CNN (from
Facebook AI Research), the later versions of the DeepLab series (from Google), and works like PSPNet and GSCNN.
Image segmentation is an important step in artificial vision. Machines need to divide visual
data into segments for segment-specific processing to take place.
Image segmentation thus finds its way in prominent fields like Robotics, Medical Imaging,
Autonomous Vehicles, and Intelligent Video Analytics.
Image segmentation aids machine perception and locomotion by pointing out objects in their path of
motion, enabling them to change paths effectively and understand the context of their environment.
Apart from locomotion, segmentation of images helps machines segregate the objects they are
working with and enables them to interact with real-world objects using only vision as a reference.
This allows the machine to be useful almost anywhere without much constraint.
Medical Imaging is an important domain of computer vision that focuses on the diagnosis of diseases
from visual data, both in the form of simple visual data and biomedical scans.
Segmentation forms an important role in medical imaging as it helps doctors identify possible
malignant features in images in a fast and accurate manner.
Using image segmentation, diagnosis of diseases can not only be speeded up but can also be made
cheaper, thereby benefiting thousands across the globe.
X-Ray segmentation
Smart Cities
Smart Cities often have CCTV cameras for real-time monitoring of pedestrians, traffic, and crime.
This monitoring can be easily automated with the help of image segmentation.
With AI-based monitoring, crimes can be reported faster, road accidents can be followed up with
immediate ambulances, and speeding cars can be easily caught and penalized.
The use of image segmentation and AI-based monitoring can thus improve the lifestyle of people.
Pedestrian detection
Traffic analytics
Video Surveillance
Self Driving cars are one of the biggest applications of image segmentation with the planning of
routes and movement depending heavily on it.
Semantic and instance segmentation helps these vehicles to identify road patterns and other vehicles,
thereby enabling a hassle-free and smooth ride.
Drivable surface semantic segmentation
Object detection is a supervised machine learning problem, which means you must
train your models on labeled examples.
It becomes as simple as feeding input visuals and receiving a fully marked-up output
visual.
A key component is the object detection bounding box which identifies the edges of the
object tagged with a clear-cut quadrilateral — typically either a square or rectangle.
They are accompanied by a label of the object, whether it is a person, a car, or a dog to
describe the target object.
Bounding boxes can overlap to showcase multiple objects in a given shot as long as the
model has prior knowledge of items it is tagging.
Object detection is a subset of object recognition, where the object is not only identified
but also located in an image. This allows for multiple objects to be identified and
located within the same image
Let’s break down the other computer vision tasks individually for a greater understanding of
each one:
What sets object detection with deep learning apart from alternative approaches is the employment
of convolutional neural networks (CNNs).
Neural networks loosely mimic the complex neural architecture of the human brain. They
primarily consist of an input layer, hidden inner layers, and an output layer.
The learning for these neural networks can be supervised, semi-supervised, and unsupervised,
referring to how much of the training data is annotated, if at all (unsupervised).
Deep neural networks for object detection yield by far the quickest and most accurate results
for single and multiple object detection since CNNs are capable of automated learning with less
manual engineering involved.
There is a world to unpack regarding deep learning and CNNs, but today we will only focus on
key points that regard object detection algorithms and models.
Object detection is not possible without models designed especially for handling that task.
These object detection models are trained with hundreds of thousands of visual content to
optimize the detection accuracy on an automatic basis later on.
Training and refining models are made efficient through the help of readily available datasets
like COCO (Common Objects in Context) to help give you a head start in scaling your
annotation pipeline.
R-CNN
The first largely successful family of methods was R-CNN (Region-Based Convolutional
Neural Network), which was proposed in 2014. It surpassed its predecessors by extracting
merely 2,000 regions from the image, which were referred to as region proposals, instead of an
exceedingly large number of regions prior to this.
The input image is selected, of which 2,000 region proposals are extracted.
Next, the features would be extracted from each individual region, which would then go
on to be classified as one of the known classes.
The primary shortcoming of R-CNN lies in the fact that although it extracted 2,000
region proposals, it was nonetheless a lengthy process. That is what paved the way to
the new and improved Fast R-CNN.
Problems with R-CNN
It still takes a huge amount of time to train the network, as you have to classify 2,000
region proposals per image.
It cannot be implemented in real time, as it takes around 47 seconds for each test image.
Fast R-CNN
To overcome the drawbacks of R-CNN, a faster object detection algorithm called Fast R-CNN was
built. The approach is similar to the R-CNN algorithm. But, instead of feeding the region
proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature
map. From the convolutional feature map, we identify the region of proposals and warp them
into squares and by using a RoI pooling layer we reshape them into a fixed size so that it can be
fed into a fully connected layer. From the RoI feature vector, we use a softmax layer to predict
the class of the proposed region and also the offset values for the bounding box.
The reason “Fast R-CNN” is faster than R-CNN is because you don’t have to feed 2000 region
proposals to the convolutional neural network every time. Instead, the convolution operation is
done only once per image and a feature map is generated from it.
Faster R-CNN
Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find the
region proposals. Selective search is a slow and time-consuming process affecting the
performance of the network. Therefore, Shaoqing Ren et al. came up with an object detection
algorithm that eliminates the selective search algorithm and lets the network learn the region
proposals.
Similar to Fast R-CNN, the image is provided as an input to a convolutional network which
provides a convolutional feature map. Instead of using selective search algorithm on the feature
map to identify the region proposals, a separate network is used to predict the region proposals.
The predicted region proposals are then reshaped using a RoI pooling layer which is then used to
classify the image within the proposed region and predict the offset values for the bounding
boxes.
YOLO (You Only Look Once)
All of the previous object detection algorithms use regions to localize the object within the
image. The network does not look at the complete image; instead, it looks at the parts of the
image which have high probabilities of containing the object. YOLO, or You Only Look Once, is an
object detection algorithm much different from the region-based algorithms seen above. In YOLO, a
single convolutional network predicts the bounding boxes and the class probabilities for these boxes.
How YOLO works is that we take an image and split it into an SxS grid, and within each grid cell
we take m bounding boxes. For each bounding box, the network outputs a class probability and
offset values for the box. The bounding boxes with a class probability above a threshold value
are selected and used to locate the object within the image.
YOLO is orders of magnitude faster (45 frames per second) than other object detection
algorithms. The limitation of the YOLO algorithm is that it struggles with small objects within the
image; for example, it might have difficulty detecting a flock of birds. This is due to the
spatial constraints of the algorithm.
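A toy numpy sketch of the selection step described above (the grid size, number of boxes per cell, the 5-value box encoding, and the threshold are illustrative assumptions; a real YOLO head also predicts per-class scores and applies non-maximum suppression):

import numpy as np

S, m = 7, 2                      # 7x7 grid, 2 boxes per cell (illustrative)
# Assumed network output per cell and box: [confidence, x, y, w, h]
preds = np.random.rand(S, S, m, 5)

threshold = 0.6
selected = []
for row in range(S):
    for col in range(S):
        for b in range(m):
            conf = preds[row, col, b, 0]
            if conf >= threshold:        # keep boxes whose score clears the threshold
                selected.append((row, col, preds[row, col, b, 1:]))
print(f"{len(selected)} boxes kept out of {S * S * m}")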
Applications
Image Captioning refers to the process of generating textual description from an image –
based on the objects and actions in the image.
Image Captioning is the process of generating a textual description for given images. It has been
a very important and fundamental task in the Deep Learning domain. Image captioning has a
huge number of applications. NVIDIA is using image captioning technologies to create an
application to help people who have low or no eyesight.
If we are shown an image and told to describe it, we might say: “A puppy on a blue towel” or “A
brown dog playing with a green ball”. So, how are we doing this? While forming the
description, we are seeing the image, but at the same time, we are looking to create a meaningful
sequence of words. The first part is handled by CNNs and the second is handled by RNNs.
Now, there is one issue we might have overlooked here. We have seen that we can describe the
same image in several ways. So, how do we evaluate our model? For sequence-to-sequence problems,
like summarization, language translation, or captioning, we use a metric called the BLEU
score.
BLEU stands for Bilingual Evaluation Understudy. It is a metric for evaluating a generated
sentence against a reference sentence. A perfect match scores 1.0 and a perfect mismatch scores 0.0.
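A quick illustration using nltk's sentence_bleu (the candidate and reference sentences here are made up):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['a', 'brown', 'dog', 'playing', 'with', 'a', 'green', 'ball']]
candidate = ['a', 'dog', 'playing', 'with', 'a', 'ball']
# Smoothing avoids zero scores when some higher-order n-grams have no match
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # closer to 1.0 means closer to the reference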
We have seen that we need to create a multimodal neural network that uses feature vectors
obtained from both an RNN and a CNN, so consequently we will have two inputs. One is the image
we need to describe, which is fed to the CNN, and the second is the sequence of words in the
caption produced so far, fed as input to the RNN.
We are dealing with two types of information, a language one and an image one. So the
question arises: how, or in what order, should we introduce these pieces of information into our
model? More precisely, we need a language RNN model since we want to generate a
word sequence, so when should we introduce the image data vectors into the language
model? A paper by Marc Tanti and Albert Gatt, Institute of Linguistics and Language
Technology, University of Malta, covered a comparison study of all the approaches. Let’s look
into the approaches.
Types of Architectures
The first architecture is called the Injecting Architecture and the second one is called the Merging Architecture.
In the Injecting Architecture, the image data is introduced along with the language data, and the
image and language data mixture is represented together. The RNN trains on the mixture. So, at
every step of training, the RNN uses the mixture of both pieces of information to predict the next
word, and consequently, the RNN fine-tunes image information as well during training.
In the Merging Architecture, the image data is not introduced inside the RNN network. The
image and the language information are encoded separately and then introduced together into a feed-
forward network, creating a multimodal layer architecture.
What is CNN?
CNN is a powerful algorithm for image processing. These algorithms are currently the best
algorithms we have for the automated processing of images. Many companies use these
algorithms to do things like identifying the objects in an image.
Images contain data as a combination of red, green, and blue (RGB) values. Matplotlib can be used to import an image into
memory from a file. The computer does not see an image; all it sees is an array of
numbers. Color images are stored in 3-dimensional arrays. The first two dimensions correspond
to the height and width of the image (the number of pixels). The last dimension corresponds to
the red, green, and blue colors present in each pixel.
Convolutional Neural Networks are specialized for applications in image and video recognition.
CNNs are mainly used in image analysis tasks like image recognition, object detection, and
segmentation.
1) Convolutional Layer: In a typical neural network, each input neuron is connected to the next
hidden layer. In a CNN, only a small region of the input layer neurons connects to each neuron
in the hidden layer.
2) Pooling Layer: The pooling layer is used to reduce the dimensionality of the feature map.
There will be multiple activation & pooling layers inside the hidden layer of the CNN.
3) Fully-Connected Layer: Fully connected layers form the last few layers in the network.
The input to the fully connected layer is the output from the final Pooling or
Convolutional Layer, which is flattened and then fed into the fully connected layer. A minimal
example combining these three layer types is sketched below.
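A minimal Keras sketch combining the three layer types above (the input size and the number of output classes are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),            # input image (size assumed)
    layers.Conv2D(32, 3, activation='relu'),   # convolutional layer: local connectivity
    layers.MaxPooling2D(),                     # pooling layer: reduces feature-map size
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),                          # flatten before the fully connected layers
    layers.Dense(64, activation='relu'),       # fully connected layer
    layers.Dense(10, activation='softmax')     # 10 output classes (assumed)
])
model.summary()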
How would the LSTM, or any other sequence prediction model, understand the input image? We
cannot directly input the RGB image tensor, as such models are ill-equipped to work with raw image inputs.
Instead, we use a deep CNN to extract features from the image, which are then fed into the LSTM.
This is called the CNN-LSTM model, specifically designed for sequence prediction problems
with spatial inputs, like images or videos. This architecture involves using Convolutional Neural
Network (CNN) layers for feature extraction on the input data, combined with LSTMs to perform
sequence prediction on the extracted features.
Neural network models for captioning involve two main elements:
1. Feature Extraction.
2. Language Model.
Feature extraction
The feature extraction model is a neural network that given an image is able to extract the salient
features, often in the form of a fixed-length vector. A deep convolutional neural network, or
CNN, is used as the feature extraction submodel. This network can be trained directly on the
images in your dataset. Alternatively, you can use a pre-trained convolutional model .
Language Model
For image captioning, we are creating an LSTM based model that is used to predict the
sequences of words, called the caption, from the feature vectors obtained from the VGG
network.
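A minimal sketch of such a captioning model in Keras, following the merge-style design discussed earlier (the vocabulary size, the 4096-dimensional VGG-style feature vector, and the maximum caption length are illustrative assumptions):

from tensorflow.keras import layers, Model

vocab_size, max_len, feat_dim = 5000, 30, 4096     # assumed sizes

# Image branch: pre-extracted CNN (e.g. VGG-style) feature vector
img_in = layers.Input(shape=(feat_dim,))
img_vec = layers.Dense(256, activation='relu')(img_in)

# Language branch: the partial caption generated so far
txt_in = layers.Input(shape=(max_len,))
txt_emb = layers.Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = layers.LSTM(256)(txt_emb)

# Merge the two modalities and predict the next word of the caption
merged = layers.add([img_vec, txt_vec])
out = layers.Dense(vocab_size, activation='softmax')(merged)
model = Model([img_in, txt_in], out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')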
Model Evaluation:
BLEU stands for Bilingual Evaluation Understudy. It is a metric for evaluating a generated
sentence against a reference sentence. A perfect match scores 1.0 and a perfect mismatch scores 0.0.
GANs are a powerful class of neural networks that are used for unsupervised learning.
Given a training set, a GAN learns to generate new data with the same statistics as the
training set. It does this with an algorithmic architecture that uses two neural networks to
generate new, synthetic instances of data that closely resemble the real data.
Generative Adversarial Networks are deep learning machines that combine two separate
models into one architecture. The two components are:
Generator Model
Discriminator Model
Zero-sum game: a situation in which one person or group can win something only by causing
another person or group to lose it.
The generator model tries to generate new data samples similar to those in the problem
domain.
The goal of the generator is to fool the discriminator, so the generative neural network is
trained to maximise the final classification error (between true and generated data)
Generative models: they create new data instances that resemble your training data. For
example, GANs can create images that look like photographs of human faces, even though the
faces don't belong to any real person.
The generator takes a random input vector of fixed length and tries to generate a sample in the
same domain as the training data. The input is drawn randomly from a Gaussian distribution.
Once the model is trained, it can be used to generate new samples that resemble the training data.
The discriminator tries to identify whether the example presented is fake (comes from a
generator) or real (comes from the actual data domain).
The goal of the discriminator is to detect fake generated data, so the discriminative neural
network is trained to minimise the final classification error.
The training of GANs typically begins with the discriminator. The discriminator may even
first be preliminarily trained to recognize the samples from the dataset using the softmax layer.
Once the discriminator is in place, we start feeding it the samples generated by the thus far
untrained generator.
The discriminator produces a classification error when predicting whether an image came from the
dataset or from the generator.
As the learning proceeds, the generator network learns to produce samples that are closer and
closer to the original data up to a point where the generation is indistinguishable (at least for
the discriminator network) from the data.
For producing better images, we may use transposed convolutions, an extension made in
deep convolutional GANs.
The competition between the generator and the discriminator makes them adversaries, which
gives the name to GANs.
At each iteration of the training process, the weights of the generative network are updated in
order to increase the classification error (error gradient ascent over the generator’s
parameters), whereas the weights of the discriminative network are updated so as to decrease
this error (error gradient descent over the discriminator’s parameters).
Here, the generative model captures the distribution of data and is trained in such a manner that
it tries to maximize the probability of the Discriminator in making a mistake.
The Discriminator, on the other hand, is based on a model that estimates the probability that the
sample that it got is received from the training data and not from the Generator. The GANs are
formulated as a minimax game, where the Discriminator tries to maximize its reward V(D, G)
and the Generator tries to minimize the Discriminator’s reward, or in other words, maximize the
Discriminator’s error. It can be mathematically described by the formula below:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))]
where,
G = Generator
D = Discriminator
Pdata(x) = distribution of real data
P(z) = distribution of the generator’s input noise z
x = sample from Pdata(x)
z = sample from P(z)
D(x) = Discriminator network
G(z) = Generator network
Different types of GANs:
GANs are now a very active topic of research and there have been many different types of
GAN implementation. Some of the important ones that are actively being used currently are
described below:
Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are
simple multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to
optimize the mathematical equation above using stochastic gradient descent.
Conditional GAN (CGAN): CGAN can be described as a deep learning method in which
some conditional parameters are put into place. In CGAN, an additional parameter ‘y’ is
added to the Generator for generating the corresponding data. Labels are also put into the
input to the Discriminator in order for the Discriminator to help distinguish the real data
from the fake generated data.
Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most
successful implementations of GAN. It is composed of ConvNets in place of multi-layer
perceptrons. The ConvNets are implemented without max pooling, which is in fact replaced
by convolutional stride. Also, the layers are not fully connected.
Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-
frequency residual. This approach uses multiple numbers of Generator and Discriminator
networks and different levels of the Laplacian Pyramid. This approach is mainly used because
it produces very high-quality images. The image is first down-sampled at each layer of the
pyramid and then up-scaled again at each layer in a backward pass, where the image
acquires some noise from the Conditional GAN at these layers until it reaches its original size.
Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a
GAN in which a deep neural network is used along with an adversarial network in order to
produce higher-resolution images. This type of GAN is particularly useful in optimally up-
scaling native low-resolution images to enhance their details while minimizing errors in doing so.
Application of GANs
Generate new data from available data – generating new samples that resemble, but are not
identical to, the existing real samples.
Generate realistic pictures of people that have never existed.
GANs are not limited to images; they can generate text, articles, songs, poems, etc.
Generate music or clone voices.
Text to Image Generation (Object GAN and Object Driven GAN)
Creation of anime characters in Game Development and animation production.
Image to Image Translation – We can translate one Image to another without
changing the background of the source image. For example, Gans can replace a dog
with a cat.
Low resolution to High resolution – If you pass a low-resolution Image or video,
GAN can produce a high-resolution Image version of the same.
Prediction of the next frame in a video – by training a neural network on small frames
of video, GANs are capable of generating or predicting the next frame of the video.
Interactive Image Generation – GANs are capable of generating images
and video footage in an art form if they are trained on the right real dataset.
Steps to Implement Basic GAN
1. Importing all libraries
2. Getting the Dataset
3. Data Preparation – It includes various steps to accomplish like preprocessing data,
scaling, flattening, and reshaping the data.
4. Define the function Generator and Discriminator.
5. Create a Random Noise and then create an Image with Random Noise.
6. Setting Parameters like defining epoch, batch size, and Sample size.
7. Define the function of generating Sample Images.
8. Train the Discriminator, then train the Generator, and it will create images.
9. See what quality of images is created by the Generator. (A minimal code sketch of these steps follows.)
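A minimal sketch of steps 4–8 for 28x28 grayscale images in Keras (the noise dimension, layer sizes, the random placeholder batch, and the single training step shown are illustrative assumptions, not a tuned implementation):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

noise_dim = 100

# Step 4: define the Generator and Discriminator
generator = keras.Sequential([
    keras.Input(shape=(noise_dim,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(28 * 28, activation='sigmoid'),
    layers.Reshape((28, 28))
])
discriminator = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')      # real (1) vs fake (0)
])
discriminator.compile(loss='binary_crossentropy', optimizer='adam')

# Combined model used to train the generator; the discriminator is frozen inside it
# (the standard Keras GAN pattern)
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(loss='binary_crossentropy', optimizer='adam')

# Steps 5-8: one illustrative training step on a random placeholder "real" batch
batch = 32
real_images = np.random.rand(batch, 28, 28)          # placeholder for real data
noise = np.random.normal(size=(batch, noise_dim))    # Step 5: random noise
fake_images = generator.predict(noise, verbose=0)

discriminator.train_on_batch(real_images, np.ones((batch, 1)))   # train D on real
discriminator.train_on_batch(fake_images, np.zeros((batch, 1)))  # train D on fake
gan.train_on_batch(noise, np.ones((batch, 1)))                   # train G to fool D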
Video to text with LSTM Models
LSTM:
Long Short-Term Memory (LSTM) is a kind of recurrent neural network. In an RNN, the output from the
last step is fed as input to the current step. LSTM was designed by Hochreiter & Schmidhuber. It
tackled the problem of long-term dependencies in RNNs: a plain RNN cannot predict words that
depend on information stored far back in the sequence, though it can give accurate predictions
from recent information.
As the gap length increases, an RNN does not give efficient performance. An LSTM can, by
design, retain information for a long period of time. It is used for processing, predicting,
and classifying on the basis of time-series data.
Long Short- Term Memory (LSTM) networks are a modified version of recurrent
neural networks, which makes it easier to remember past data in memory.
1. Input gate – It discovers which values from the input should be used to modify the memory.
A sigmoid function decides which values to let through (0 or 1), and a tanh function gives
weightage to the values that are passed, deciding their level of importance on a scale from -1
to 1.
2. Forget gate – It discovers which details should be discarded from the block. A sigmoid
function decides this: it looks at the previous state (h_{t-1}) and the current input (x_t) and outputs a
number between 0 (omit this) and 1 (keep this) for each number in the cell state C_{t-1}.
3. Output gate – The input and the memory of the block are used to decide the output.
A sigmoid function decides which values to let through (0 or 1), and a tanh function gives
weightage to the values that are passed, deciding their level of importance from -1 to 1; this is
multiplied with the output of the sigmoid. The standard gate equations are summarized below.
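Putting the three gates together, a standard formulation of the LSTM cell (using h_{t-1}, x_t and C_t as above) is:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     (candidate memory)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t         (cell state update)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                              (hidden state)

where \sigma is the sigmoid function and \odot denotes element-wise multiplication.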
It represents a full RNN cell that takes the current input of the sequence xi, and outputs the
current hidden state, hi, passing this to the next RNN cell for our input sequence. The inside
of an LSTM cell is a lot more complicated than a traditional RNN cell, while the
conventional RNN cell has a single "internal layer" acting on the current state (ht-1) and
input (xt).
Video to text with LSTM Models:
In video to text the methods for generating open-domain video descriptions should be
sensitive to temporal structure and allow both input (sequence of frames) and output
(sequence of words) of variable length.
Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of
video frames to a sequence of words in order to generate a description of the event in the
video clip. Our model naturally is able to learn the temporal structure of the sequence of
frames as well as the sequence model of the generated sentences, i.e. a language model. We
evaluate several variants of our model that exploit different visual features on a standard
set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
Our S2VT approach performs video description using a sequence to sequence model. It
incorporates a stacked LSTM which first reads the sequence of frames and then generates a
sequence of words. The input visual sequence to the model is comprised of RGB and/or
optical flow CNN outputs.
We propose a sequence to sequence model for video description, where the input is the
sequence of video frames (x_1, . . . , x_n) and the output is the sequence of words (y_1, . . . , y_m).
Naturally, both the input and output are of variable, potentially different, lengths. In our
case, there are typically many more frames than words. In our model, we estimate the
conditional probability of an output sequence given an input sequence, i.e. p(y_1, . . . , y_m | x_1, . . . , x_n).
The main idea to handle variable-length input and output is to first encode the input
sequence of frames, one at a time, representing the video using a latent vector
representation, and then decode from that representation to a sentence, one word at a time.
For an input x_t at time step t, the LSTM computes a hidden/control state h_t and a memory cell
state c_t, which is an encoding of everything the cell has observed until time t:

(h_t, c_t) = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1})    (2)
Thus, in the encoding phase, given an input sequence X = (x_1, . . . , x_n), the LSTM computes a
sequence of hidden states (h_1, . . . , h_n). During decoding, it defines a distribution over the
output sequence Y = (y_1, . . . , y_m) given the input sequence X as:

p(Y \mid X) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})    (3)
where the distribution p(y_t | h_{n+t}) is given by a softmax over all of the words in the
vocabulary (see Equation 5). Note that h_{n+t} is obtained from h_{n+t-1} and y_{t-1} based on the
recursion in Equation 2.
Sequence to sequence video to text: Our approach, S2VT, is depicted in Figure 2. While earlier
models first encode the input sequence to a fixed-length vector using one LSTM and then use another
LSTM to map the vector to a sequence of outputs, we rely on a single LSTM for both the
encoding and decoding stages. This allows parameter sharing between the encoding and
decoding stages.
Our model uses a stack of two LSTMs with 1000 hidden units each. Figure 2 shows the
LSTM stack unrolled over time. When two LSTMs are stacked together, as in our case, the
hidden representation (h_t) from the first LSTM layer (colored red) is provided
as the input (x_t) to the second LSTM (colored green). The top LSTM layer in our architecture
is used to model the visual frame sequence, and the next layer is used to model the output
word sequence.
Training and Inference
In the first several time steps, the top LSTM layer (colored red in Figure 2) receives a
sequence of frames and encodes them while the second LSTM layer receives the hidden
representation (ht) and concatenates it with null padded input words (zeros), which it
then encodes. There is no loss during this stage when the LSTMs are encoding.
After all the frames in the video clip are exhausted, the second LSTM layer is fed the
beginning-of-sentence (<BOS>) tag, which prompts it to start decoding its current hidden
representation into a sequence of words.
While training, in the decoding stage the model maximizes the log-likelihood of the
predicted output sentence given the hidden representation of the visual frame sequence and
the previous words it has seen. From Equation 3, for a model with parameters θ and output
sequence Y = (y_1, . . . , y_m), this is formulated as:

\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h_{n+t-1}, y_{t-1}; \theta)    (4)
This log-likelihood is optimized over the entire training dataset using stochastic gradient
descent. The loss is computed only when the LSTM is learning to decode. Since this loss is
propagated back in time, the LSTM learns to generate an appropriate hidden state
representation (h_n) of the input sequence. The output (z_t) of the second LSTM layer is used
to obtain the emitted word (y). We apply a softmax function to get the probability distribution
over the words y' in the vocabulary V:
p(y \mid z_t) = \frac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}    (5)
We note that, during the decoding phase, the visual frame representation for the first LSTM
layer is simply a vector of zeros that acts as padding input. We require an explicit end-of-
sentence tag (<EOS>) to terminate each sentence, since this enables the model to define a
distribution over sequences of varying lengths. At test time, during each decoding step we
choose the word y_t with the maximum probability after the softmax (from Equation 5) until
the model emits the <EOS> token.
We propose a stack of two LSTMs that learn a representation of a sequence of frames in
order to decode it into a sentence that describes the event in the video. The top LSTM layer
(colored red) models visual feature inputs. The second LSTM layer (colored green) models
language given the text input and the hidden representation of the video sequence. We use <BOS>
to indicate begin-of-sentence and <EOS> for the end-of-sentence tag. Zeros are used as a <pad>
when there is no input at the time step.
Attention Models in Computer vision
Since the introduction of Transformer in the work “Attention is all you need”, there has been a
transition in the field of NLP towards replacing Recurrent Neural Networks (RNN) with
attention-based networks.
The aim of attention models is to reduce larger, more complicated tasks into smaller, more
manageable areas of attention to understand and process sequentially.
The models work within neural networks, which are a type of network model with a similar
structure and processing methods as the human brain for simplifying and processing
information. This allows for efficient and sequential data processing, especially when the
network needs to categorize entire datasets.
Self-Attention:
Self-attention refers to the mechanism of relating different positions of a single sequence to compute
a representation of that same sequence. Self-attention works by comparing every word in the
sentence to every other word and reweighting the word embeddings of each word to
include contextual relevance.
Multi-Head Attention:
Multi-Head Attention is a module for attention mechanism that runs an attention module
several times in parallel. Hence, to understand its logic it is first needed to understand the
Attention module itself. The two most commonly used attention functions are Additive
Attention and Dot-Product Attention.
The basic structure of the Attention module is that there are two lists of vectors, x1 and x2, one
which is attended to and the other which attends. The vector x2 generates a ‘query’ while the
vector x1 creates a ‘key’ and a ‘value’. The idea behind the attention function is to map the
query and the set of key-value pairs to an output. “The output is computed as a weighted sum of
the values, where the weight assigned to each value is computed by a compatibility
function of the query with the corresponding key”. The output is computed as follows:
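Assuming the scaled dot-product form used in the Transformer paper, this is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V

where d_k is the dimensionality of the keys; dividing by \sqrt{d_k} keeps the dot products at a scale where the softmax still has useful gradients.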
The key/value/query concepts come from retrieval systems. For
example, when typing a query on YouTube to search for some video, the search engine will map
your query against a set of keys (video title, description, etc.) linked with candidate videos in
the database. Then, it will present you with the best-matched videos (the values).
Convolutional Block Attention Module (CBAM):
Compared to multi-head attention, this type of attention was designed specifically for feed-
forward convolutional neural networks and can be applied at every convolutional block in
deep networks. CBAM contains two sequential sub-modules called the Channel Attention
Module (CAM) and the Spatial Attention Module (SAM). While channel refers to the number
of features or channels for each pixel, spatial refers to the feature maps of dimension (h x w).
Spatial Attention Module (SAM):
This module is comprised of a three-fold sequential operation. The first part of it is called the
Channel Pool; it applies Max Pooling and Average Pooling across the channels of the input
(c × h × w) to generate an output with shape (2 × h × w). This is the input to a
convolution layer that outputs a 1-channel feature map (1 × h × w). After passing this output
through a Batch-Norm and an optional ReLU, the data goes to a Sigmoid activation layer.
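A minimal sketch of this spatial-attention computation in TensorFlow/Keras (channels-last layout, i.e. (h, w, c) rather than (c, h, w), and the 7x7 kernel size are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x, kernel_size=7):
    # Channel Pool: max- and average-pool across the channel axis -> (batch, h, w, 2)
    max_pool = tf.reduce_max(x, axis=-1, keepdims=True)
    avg_pool = tf.reduce_mean(x, axis=-1, keepdims=True)
    pooled = tf.concat([max_pool, avg_pool], axis=-1)
    # 1-channel convolution + batch norm + sigmoid gives a (batch, h, w, 1) spatial mask
    mask = layers.Conv2D(1, kernel_size, padding='same')(pooled)
    mask = layers.BatchNormalization()(mask)
    mask = tf.sigmoid(mask)
    return x * mask   # re-weight the input feature map spatially

features = tf.random.normal((1, 32, 32, 64))   # a dummy feature map
print(spatial_attention(features).shape)       # (1, 32, 32, 64)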
Channel Attention Module (CAM):
This module first decomposes the input tensor into two vectors of dimensionality (c ×
1 × 1), generated by Global Average Pooling (GAP) and Global Max Pooling (GMP) respectively.
Thereafter, the output goes through a fully connected layer followed by a ReLU activation
layer.
While this is a powerful technique for improving computer vision, the most work so far with
attention mechanisms has focused on Neural Machine Translation (NMT).
Using attention mechanisms in NMT is a much simpler approach. Here, the meaning of a
sentence is mapped into a fixed-length vector, which then generates a translation based on
that vector as a whole. The goal isn’t to translate the sentence word for word, but rather pay
attention to the general, “high level” overall sentiment. Besides greatly improving
accuracy, this attention-driven learning approach is much easier to construct and faster
to train.
By applying bottom-up visual attention computing models to specific computer vision tasks,
the generalization performance of the network can be improved. In the
process of development, attention mechanisms in computer vision have evolved into different
categories and different models.
The global attention model, which is also similar to the soft attention model, collects inputs
from all encoder and decoder states before evaluating the current state to determine the
output. This model uses each encoder step and each decoder preview step to calculate the
attention weights or align weights. This allows the model to find the decoder output.
It has been used in many fields of computer vision, such as classification, detection,
segmentation, model generation, video processing, etc.
Mechanisms of soft attention can be categorized into spatial attention, channel attention, mixed
attention, and self-attention.
Spatial attention:
Spatial attention assigns a weight to each spatial position of the feature map, so that the
network focuses on the most informative regions of the image rather than treating all
locations equally.
Channel attention:
Weights are added to each channel to show the relevance between the channel and the key
information: a greater weight means higher relevancy, and more attention should be
paid to the corresponding channel. Channel attention models the importance of each feature
channel and then enhances or suppresses it for different tasks.
Mixed attention:
Mixed attention combines spatial and channel attention while adding only a few more parameters.
Self-attention:
In a convolutional neural network, the convolution kernel is confined by its size, which can
only use local information to calculate the target pixel, so it may lead to deviations due to the
ignorance of global information.
If each pixel in the feature map is regarded as a random variable and the pairwise covariances
are calculated, the value of each predicted pixel can be enhanced or weakened based on its
similarity to other pixels in the image. The mechanism of employing similar pixels in
training and prediction while ignoring dissimilar pixels is called the self-attention
mechanism.
The goal is to achieve a global reference for each pixel-level prediction. By using the self-attention
mechanism, this global reference can be realized during the training and prediction of models.
The resulting model achieves a better bias-variance trade-off, making its predictions more reasonable.
The local attention model is similar to the global attention model, but it only uses a few encoder
positions to determine the align weights. The model calculates the align weights and context
vector by using the first single-aligned position and a selection of words from the encoder
source.
The local attention model also allows for monotonic alignment and predictive alignment.
Monotonic alignment assumes only select information matters, whereas predictive
alignment allows the model itself to predict the final alignment position. The local attention
model is similar to the hard attention model.
Hard attention:
The mechanism of soft attention has been successfully applied in the field of computer vision.
As the mechanism of hard attention can select important features from input information, it is
observed as a more efficient and direct approach.
Hard attention pays most attention to the elements related to the task and temporarily ignores the
other signals. It segments the input image into several blocks and then applies a self-attention-style
architecture: relevant blocks are selected at each time step, and once determined, the Attention
Agent makes decisions only according to these blocks while ignoring the other blocks.
As usual, back-propagation is utilized to optimize the neural networks.
The upper row: input transforming - the sliding window splits the input image into
smaller blocks and then “flattens” them for future processing.
The middle row: block election - the modified self-attention modules vote between blocks
to generate a vector of block importance.
The lower row: action generation – Attention Agent selects the most important blocks,
extracts corresponding features, and makes decisions on their basis. It has been proven that
Attention Agent has successfully learned to pay attention to different regions in the input
image.
Tips for using attention models:
Explore different models-Consider the different types of models available for attention
mechanism. Think about which may best meet your needs and provide the most accurate
results.
Provide training- It's important to provide consistent back propagation training and
reinforcement to ensure your attention models are accurate and effective. This helps
identify potential errors within your models, helping you find ways to refine and improve them.
Use them for translation- Implement attention models to support language translations.
Using them frequently may help improve the accuracy of your translations.
CASE STUDY:
NAMED ENTITY RECOGNITION:
NER – Definition
• Named entity recognition (NER) ‒ also called entity identification or entity
extraction ‒ is a natural language processing (NLP) technique that automatically
identifies named entities in a text and classifies them into predefined categories.
• An entity is basically the thing that is consistently talked about or referred to in the text.
• Entities can be names of people, organizations, locations, times, quantities, monetary
values, percentages, and more.
• Named entity recognition (NER) is one of the most popular data preprocessing
tasks. It involves the identification of key information in the text and its classification into
a set of predefined categories.
• Named entity recognition (NER) helps to easily identify the key elements in a text,
like names of people, places, brands, monetary values, and more.
• With named entity recognition, key information can be extracted to understand what a
text is about, or merely use it to collect important information to store in a database.
NER – Concept
It detects named entities like person, organization, place, date, etc.
It predicts the entities based on a model which was trained using labelled data.
It is a supervised learning task.
It involves the identification of key information in the text and its classification into a set
of predefined categories.
It is one of the most popular data preprocessing tasks.
NER – Process
• Two Processes that are involved:
Detecting the entities from the text
Classifying them into different categories
• First find the entities mentioned in a given text and
• Assign them to a particular class in our list of predefined entities.
Example (NER tags used: Person, Location):
<START:Person>Nelson<END> lives in <START:Location>India<END>
NER – Steps
• Steps to build the custom NER model for detecting the job role in job postings in
spaCy 3.0:
• Annotate the data to train the model.
• Convert the annotated data into the spaCy bin object.
• Generate the config file from the spaCy website.
• Train the model in the command line.
• Load and test the saved model
• Keyword extraction uses machine learning artificial intelligence (AI) with natural
language processing (NLP) to break down human language so that it can be
understood and analyzed by machines. It’s used to find keywords from all manner of
text: regular documents and business reports, social media comments, online forums
and reviews, news reports, and more.
• Imagine that thousands of online reviews about a product have to be analysed. Keyword
extraction helps to sift through the whole set of data and obtain the words that best
describe each review in just seconds. That way, what customers mention most
often can be seen easily and automatically, saving teams hours upon hours of manual
processing.
• Ambiguity in NE
• Charles has told Nokia employees to come back to their respective offices at least 40
hrs a week or leave the company.
• Charles – Person
• Nokia – Organization
• NER tells what entity each mention is, i.e. it extracts the entities from amidst a lot of text.
Approaches:
• Content Recommendation
• If a user watches a lot of comedies on Netflix, they get more recommendations that have been
classified under the entity Comedy.
Methods of NER
• One way is to train the model for multi-class classification using different machine
learning algorithms, but it requires a lot of labelling. In addition to labelling the
model also requires a deep understanding of context to deal with the ambiguity of
the sentences. This makes it a challenging task for a simple machine learning
algorithm.
• Another way is the Conditional Random Field (CRF), which is implemented by both NLP
speech taggers and NLTK. It is a probabilistic model that can be used to model
sequential data such as words.
Deep Learning Based NER: Deep learning NER is much more accurate than the
previous methods, as it is capable of assembling words. This is due to the fact that it uses
a method called word embedding, which is capable of understanding the semantic and
syntactic relationship between various words. It is also able to learn to analyze topic-
specific as well as high-level words automatically. This makes deep learning NER
applicable to performing multiple tasks. Deep learning can do most of the repetitive
work itself, so researchers, for example, can use their time more efficiently.
• Implementation (Python 3)
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Charles has told Nokia employees to come back to their offices."   # sample text
doc = nlp(text)

# sentence segmentation
sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
    print(token.text)

# print named entities with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
Applications
• gene identification,
• identification of drug names and disease names. These experiments use CRFs
with features engineered for their domain data.
• Spam filters.
• Sentence translation.
Emotions or attitudes towards a topic can be positive, negative, or neutral. This makes
sentiment analysis a text classification task.
“I am not sure if I liked the movie.” – Neutral
“It was the most worst movie I have ever seen.” – Negative
A recurrent neural network (RNN) is a type of artificial neural network which uses
sequential data or time series data.
It is a type of neural network where the output from the previous step is fed as input
to the current step.
While traditional deep neural networks assume that inputs and outputs are
independent of each other, the output of a recurrent neural network depends on the prior
elements within the sequence.
Why Recurrent Neural Networks?
RNNs were created because there were a few issues with the feed-forward neural
network:
o Cannot handle sequential data
o Considers only the current input
o Cannot memorize previous inputs
The solution to these issues is the RNN. An RNN can handle sequential data,
accepting the current input data, and previously received inputs.
RNNs can memorize previous inputs due to their internal memory (for example, to predict
the next word of a sentence).
How RNN Works:
The input layer ‘x’ takes in the input to the neural network and processes it and passes
it onto the middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with its own
activation functions, weights, and biases.
The Recurrent Neural Network will standardize the different activation
functions, weights, and biases so that each hidden layer has the same parameters.
Then, instead of creating multiple hidden layers, it will create one and loop over it as
many times as required.
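In equation form, this looping over a single hidden layer is the standard RNN recurrence:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y

where the same weight matrices W_{xh}, W_{hh}, W_{hy} are reused at every time step t.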
Document summarization
Problem Statement:
Imagine the task of determining whether a product’s review is positive or negative;
you could do it yourself just by reading it, right? But what happens when the company
you work for sells 2k products every single day? Are you going to read all the
reviews and manually classify them? Let’s be honest, your job would be the worst
ever. That is where Sentiment Analysis comes in and makes your life and job easier.
Solution:
The common and most basic steps in sentiment analysis are:
Remove URLs and email addresses from every single sample — because they won’t
add meaningful value.
Remove punctuation signs — otherwise your model won’t understand that “good!”
and “good” actually mean the same thing.
Lowercase all text — because you want to make the input text as generic as possible.
For example, a “Good” at the beginning of a phrase would otherwise be treated differently
from a “good” in another sample.
Remove stop-words — because they only add noise and won’t make the data more
meaningful. Stop words are very common words like 'if', 'but', 'we', 'he', 'she', and 'they'.
Stemming/Lemmatizing: Lemmatizing generally returns valid words (words that exist), while
stemming techniques return (most of the time) shortened words; that is why
lemmatizing is used more in real-world implementations. This is how lemmatizers vs.
stemmers work: suppose you want to find the root word of ‘caring’:
‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’
Preparing IMDB reviews for Sentiment Analysis:
IMDB movie review dataset is a collection of 50K movie reviews tagged with
corresponding true sentiment value. Out of which 25K reviews belong to the ‘positive‘
category and the rest, 25K belong to the ‘negative‘ sentiment category.
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
import nltk
We load the dataset into a pandas dataframe with the help of the following code :
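A minimal version of that loading step, assuming the dataset is available locally as a CSV file named 'IMDB Dataset.csv' with 'review' and 'sentiment' columns (the filename is an assumption):

data = pd.read_csv('IMDB Dataset.csv')   # columns assumed: 'review', 'sentiment'
print(data.shape)
data.head()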
Data Preprocessing
First, we need to remove HTML tags, URLs, and non-alphanumeric characters from the
reviews. We do that with the help of the remove_tags function, using regex substitutions:
def remove_tags(string):
    removelist = ""
    result = re.sub(r'<[^>]+>', '', string)               # remove HTML tags
    result = re.sub(r'https?://\S+', '', result)          # remove URLs
    result = re.sub(r'[^\w'+removelist+']', ' ', result)  # remove non-alphanumeric characters
    result = result.lower()
    return result

data['review'] = data['review'].apply(lambda cw: remove_tags(cw))
We also need to remove stopwords from the corpus. Stopwords are commonly used words
like ‘and’, ‘the’, ‘at’ that do not add any special meaning or significance to a sentence. A list
of stopwords are available with nltk, and they can be removed from the corpus using the
following code :
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data['review'] = data['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
Next, we lemmatize the text to obtain the root form of words, known as lemmas. For example, the lemma of the words
reading, reads, read is read. This helps save unnecessary computational overhead in trying to
decipher entire words, as the meanings of most words are well expressed by their separate
lemmas. We perform lemmatization using the WordNetLemmatizer() from nltk. The text is
first broken into individual words using the WhitespaceTokenizer() from nltk. We write a
small helper function for this:

nltk.download('wordnet')   # needed by the WordNet lemmatizer
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    st = ""
    for w in w_tokenizer.tokenize(text):
        st = st + lemmatizer.lemmatize(w) + " "
    return st
data['review'] = data.review.apply(lemmatize_text)
data
Next, we print some basic statistics about the dataset and check whether it is balanced (an
equal number of each label). Ideally, the dataset should be balanced, because a severely
imbalanced dataset can bias the classifier towards the majority class.

s = 0.0
for i in data['review']:
    word_list = i.split()
    s = s + len(word_list)
print("Average length of each review : ", s / data.shape[0])
pos = 0
for i in range(data.shape[0]):
    if data.iloc[i]['sentiment'] == 'positive':
        pos = pos + 1
neg = data.shape[0] - pos
print("Percentage of reviews with positive sentiment is " + str(pos / data.shape[0] * 100) + "%")
print("Percentage of reviews with negative sentiment is " + str(neg / data.shape[0] * 100) + "%")
>>Average length of each review : 119.57112
>>Percentage of reviews with positive sentiment is 50.0%
>>Percentage of reviews with negative sentiment is 50.0%
Encoding Labels and Making Train-Test Splits
reviews = data['review'].values
labels = data['sentiment'].values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
Finally, we split the dataset into train and test parts using train_test_split from
sklearn.model_selection. We use 80% of the dataset for training and 20% for testing.
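A sketch of that split (the variable names train_sentences and test_sentences are chosen to match the tokenization code below; the stratify option and random_state are assumptions):

train_sentences, test_sentences, train_labels, test_labels = train_test_split(
    reviews, encoded_labels, test_size=0.2, stratify=encoded_labels, random_state=42)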
Before being fed into the LSTM model, the data needs to be tokenized and padded:
Tokenizing: Keras’ inbuilt tokenizer API is fit on the dataset; it splits the sentences into
words and creates a dictionary of all unique words found, each with a uniquely assigned
integer. Each sentence is then converted into an array of integers representing the individual
words present in it.
Sequence Padding: The array representing each sentence is padded with zeroes so that all
sequences are brought to the same length (max_length).
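The tokenizer itself has to be created and fit before these conversions; a minimal sketch (the vocab_size, oov_token, and max_length values are illustrative assumptions):

vocab_size = 3000
max_length = 200
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(train_sentences)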
# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)
# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)
Building the Model
An embedding layer of dimension 100 converts each word in the sentence into a fixed-length
dense vector of size 100. The input dimension is set as the vocabulary size, and the output
dimension is 100. Each word in the input will hence get represented by a vector of size 100.
A bidirectional LSTM layer of 64 units.
A dense (fully connected) layer of 24 units with relu activation.
A dense layer of 1 unit and sigmoid activation outputs the probability that the review is
positive, i.e. that the label is 1.
# model initialization
embedding_dim = 100   # each word is represented by a vector of size 100 (as described above)
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
# model summary
model.summary()
The model is compiled with binary cross-entropy loss and the Adam optimizer. Since we have a
binary classification problem, binary cross-entropy loss is used; it compares each of the
predicted probabilities to the actual class label (0 or 1). The Adam optimizer is a variant of
stochastic gradient descent used to train deep learning models. Accuracy is used as the primary
evaluation metric.
num_epochs = 5
history = model.fit(train_padded, train_labels,
                    epochs=num_epochs, verbose=1,
                    validation_split=0.1)

prediction = model.predict(test_padded)
# Get labels based on probability: 1 if p >= 0.5 else 0
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
print("Accuracy of prediction on test set : ", accuracy_score(test_labels,pred_labels))
The accuracy of prediction on the test set comes out to be 87.27%! You can improve the
accuracy further by playing around with the model hyperparameters, further tuning the model
architecture, or changing the train-test split ratio. You could also train the model for a larger
number of epochs; we stopped at five epochs because of the computational time. Ideally, the
model should be trained until the training and validation losses converge.
We can use our trained model to determine the sentiment of new, unseen movie reviews not
present in the dataset. Each new text must be tokenized and padded before being fed as input
to the model. The model.predict() function returns the probability of a positive review. If
the probability is greater than 0.5, we consider the review to be positive, else
negative.
# Example inference on new, unseen reviews (the sample sentences below are made up)
sentence = ["The movie was a masterpiece, I loved every minute of it.",
            "A complete waste of time, the plot made no sense."]
# New text must be tokenized and padded exactly like the training data
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, padding='post', maxlen=max_length)
prediction = model.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ", s)
Conclusion
Sentiment analysis, like any other classification task, can be performed with many different
machine learning and deep learning models, such as Naive Bayes, KNN, SVM or CNN,
ANN, etc. Now that you know its basics go ahead and explore other models to perform
sentiment analysis.
Long Short-Term Memory is an advanced version of recurrent neural network
(RNN) architecture that was designed to model chronological sequences and their
long-range dependencies more precisely than conventional RNNs.
The major highlights include the interior design of a basic LSTM cell, the variations
brought into the LSTM architecture, and a few applications of LSTMs that are highly
in demand.
It also makes a comparison between LSTMs and GRUs. The article concludes with
a list of disadvantages of the LSTM network and a brief introduction of the
upcoming attention-based models that are swiftly replacing LSTMs in the real
world.
Introduction:
The basic difference between the architectures of RNNs and LSTMs is that the
hidden layer of LSTM is a gated unit or gated cell.
It consists of four layers that interact with one another in a way to produce the
output of that cell along with the cell state. These two things are then passed onto
the next hidden layer.
Unlike RNNs, which have only a single neural net layer of tanh, an LSTM
comprises three logistic sigmoid gates and one tanh layer.
Gates have been introduced in order to limit the information that is passed through the cell.
They determine which part of the information will be needed by the next cell and which
part is to be discarded
The output is usually in the range of 0-1 where ‘0’ means ‘reject all’ and ‘1’ means
‘include all’.
Conventional LSTM:
Variations:
With the increasing popularity of LSTMs, various alterations have been tried on the
conventional LSTM architecture to simplify the internal design of cells to make
them work in a more efficient way and to reduce the computational complexity.
Gers and Schmidhuber introduced peephole connections which allowed gate layers
to have knowledge about the cell state at every instant. Some LSTMs also made use
of a coupled input and forget gate instead of two separate gates that helped in
making both the decisions simultaneously.
Another variation was the use of the Gated Recurrent Unit (GRU), which reduced
the design complexity by reducing the number of gates. It uses a combination of the
cell state and hidden state, and also an update gate which has the forget and input
gates merged into it.
1. Figure-A represents what a basic LSTM network looks like. Only one layer of LSTM
between an input and output layer has been shown here.
2. Figure-B represents Deep LSTM which includes a number of LSTM layers in between
the input and output. The advantage is that the input values fed to the network not only
go through several LSTM layers but also propagate through time within one LSTM cell.
Hence, parameters are well distributed within multiple layers. This results in thorough
processing of the inputs at each time step.
3. Figure-C represents LSTM with the Recurrent Projection layer where the recurrent
connections are taken from the projection layer to the LSTM layer input. This
architecture was designed to reduce the high learning computational complexity (O(N) per
time step) of the standard LSTM RNN.
4. Figure-D represents Deep LSTM with a Recurrent Projection Layer consisting of
multiple LSTM layers where each layer has its own projection layer. The increased
depth is quite useful in the case where the memory size is too large. Having increased
depth prevents overfitting in models as the inputs to the network need to go through
many nonlinear functions.
Applications:
1. Language modelling or text generation, which involves predicting the next words when a
sequence of words is fed as input. Language models can operate at the character
level, n-gram level, sentence level, or even paragraph level.
2. Image processing, which involves analyzing a picture and summarizing its content in a
sentence, as well as speech and handwriting recognition.
3. Music generation which is quite similar to that of text generation where LSTMs predict
musical notes instead of text by analyzing a combination of given notes fed as input.
4. Language Translation involves mapping a sequence in one language to a sequence in
another language.
Disadvantages
1. The cell has become quite complex with the additional features (such as forget
gates) being brought into the picture.
2. Hardware-wise, LSTMs become quite inefficient.
3. LSTMs are affected by different random weight initializations and hence can behave quite
similarly to a feed-forward neural net. They prefer small weight initializations
instead.
4. LSTMs are prone to overfitting and it is difficult to apply the dropout algorithm to curb
this issue.