
IMAGE CAPTIONING USING CNN AND RNN


Akkem Hema Bhargavi, Ganesh G, Jagan Mohan Reddy Dwarampudi, Sowmiya TN
Under the guidance of
Ms. Akshita Mehta, Mr. Naveen Nanda

Abstract—The purpose of this model is to generate captions for an image. Image captioning aims at generating captions of an image automatically using deep learning techniques. Initially, the objects in the image are detected using a Convolutional Neural Network (InceptionV3). Using the objects detected, a syntactically and semantically correct caption for the image is generated using a Recurrent Neural Network (LSTM) with an attention mechanism. In our project, we use a traffic sign dataset that is captioned by the above-mentioned process. This model is of great benefit to the visually impaired in order to cross roads safely.

Index Terms—CNN, Deep learning, RNN

I. INTRODUCTION

Image captioning, or generating the description of an image in natural language, has received a lot of attention in recent times. It has emerged as an important and challenging area as research in image recognition advances. It is interesting because it has many practical applications, such as labeling large image datasets and assisting the visually impaired. It requires a level of understanding well beyond general object detection and image classification, and so it is regarded as a grand challenge. The field is a crossover of the modern Artificial Intelligence areas of Natural Language Processing and Computer Vision.

Top-down and bottom-up are the general image captioning approaches. The top-down approach starts from the image, which is later converted into words, while the bottom-up approach starts with words that describe various aspects of the image and are then combined. The two approaches suffer from missing finer details and from the difficulty of forming sentences out of individual aspects, respectively.

Visual attention is a selective mapping process in the human visual system. It is important for the semantic natural language description of images: people tend to talk about the crucial parts of a scene rather than everything in it. We follow the visual attention approach to image captioning and have used an attention-based model to achieve this. Overfitting the training data poses a massive difficulty because the biggest available dataset has only around two million labeled samples, and this data has to be used for
a. Extraction of the features
b. Correlation with the labels
c. Syntactic understanding of the labels
d. Modeling of the language for syntactically sound captions

The overfitting problem lies in the fact that the model memorises the inputs and produces similar-sounding captions for images that differ only in specific details, such as a person on a ramp with a skateboard and a person on a table with a skateboard.

II. METHODOLOGY

We have trained a visual-attention-based model for the captioning of images with traffic signs. This model can also be extended to general captioning, which might require a larger dataset of images. Our model first pre-processes the images through the pre-trained InceptionV3 model. The last output layer is then taken through some convolution layers and processed to output features with the required dimensions.

In parallel, the captions of every image are tokenized, and <start> and <end> tokens are appended to all the sentences. Finally, the model is trained with each image and its corresponding real captions. The final weights are used to predict the captions for the test images. These captions are not generated all at once but rather word by word: the prediction of every word requires the probability of occurrence of that particular word given the previous words.

This method turned out to be quite effective, and it is the approach we propose from our research.
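As an illustration of this pre-processing step, the sketch below appends the <start> and <end> tokens and converts the captions into padded integer sequences with the Keras tokenizer. The example captions are placeholders, and the 5,000-word vocabulary cap is borrowed from the captioning setup described later; this is not our exact code.

    import tensorflow as tf

    # Placeholder captions standing in for the real annotations of our dataset.
    captions = ["a stop sign at the side of a road",
                "a pedestrian crossing sign near a school"]

    # Wrap every sentence with the <start> and <end> tokens.
    captions = ["<start> " + c + " <end>" for c in captions]

    # Tokenize; '<' and '>' are kept out of the filter list so the special
    # tokens survive, and out-of-vocabulary words map to <unk>.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=5000, oov_token="<unk>",
        filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~')
    tokenizer.fit_on_texts(captions)
    sequences = tokenizer.texts_to_sequences(captions)

    # Pad so that every caption in a batch has the same length.
    cap_vector = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')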
Convolutional Neural Networks:

1. Mask RCNN

Mask Region-based Convolutional Neural Network (Mask RCNN) has been the new technology for instance segmentation, and there are many papers and tutorials with good-quality open source code for reference. Mask RCNN is a deep neural network aimed at solving the instance segmentation problem in machine learning or computer vision. In other words, it can be used to separate the different objects in an image or a video.

You provide it with an image, and it gives you the bounding boxes for the objects, their classes and their masks.

There are two stages in Mask RCNN. First, it generates proposals about the regions where there could be an object, based on the input image. Second, it predicts the class of the object, refines the bounding box and generates a pixel-level mask of the object, based on the first-stage proposal. Both stages are connected to the backbone structure, which is ResNet.
Now let us look at the primary stage. A light-weight neural network called the Region Proposal Network (RPN) scans all the Feature Pyramid Network (FPN) top-bottom pathways (hereinafter called the feature map) and proposes regions which may contain objects. While scanning the feature map efficiently, we need a way to bind features to their raw image locations. Here come the anchors: anchors are a set of boxes with predefined locations and scales relative to the image. Ground-truth classes (only object or background, binary-classified at this stage) and bounding boxes are assigned to individual anchors according to their Intersection over Union (IoU) value.
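For reference, here is a minimal sketch of the Intersection over Union computation behind this assignment. Boxes are assumed to be in [x1, y1, x2, y2] form, and the thresholds in the closing comment are the values commonly used for the RPN rather than settings from our experiments.

    def iou(box_a, box_b):
        # Intersection over Union of two boxes given as [x1, y1, x2, y2].
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        intersection = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return intersection / float(area_a + area_b - intersection)

    # An anchor is typically labelled 'object' when its IoU with some ground-truth
    # box is above a high threshold (e.g. 0.7) and 'background' below a low one (e.g. 0.3).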
As anchors with different scales bind to different levels of the feature map, the RPN uses these anchors to work out where in the feature map an object should be and what the size of its bounding box is. Here we may agree that convolving, downsampling and upsampling keep features at the same relative locations as the objects in the original image and do not shuffle them around. At the second stage, another neural network takes the regions proposed by the primary stage, assigns them to specific areas of a feature-map level, scans these areas, and generates object classes, bounding boxes and masks. The procedure is similar to the RPN. The differences are that, without the assistance of anchors, stage two uses a trick called Region of Interest Align (RoIAlign) to locate the relevant areas of the feature map, and there is a branch which generates a mask for every object at the pixel level.

Illustration of Mask RCNN

2. YOLO

"You Only Look Once", commonly abbreviated as YOLO, is the best available object detection algorithm that works in real time. Traditional object classification can be done with a simple neural network, but object detection in a given scenario is a whole lot different. It requires a different approach, as conventional classification algorithms use predefined image dimensions, whereas real-time object detection requires the system to scan the whole image for the recognition of the object. YOLO solves this problem while also providing the bounding boxes for the corresponding trained objects.

YOLO works great for real-time detection, as its name suggests: You Only Look Once. It has 24 convolutional layers and 2 dense layers, totalling 26 layers, and it works on the Darknet architecture, created by the first author of the YOLO paper.

The algorithm divides any given image into an N x N grid, which is then sent to the neural network; the network detects the presence of objects in each grid cell and produces bounding boxes for the corresponding objects along with probabilities and labels. The later modified models of YOLO are YOLOv2 and YOLOv3, which have 30 and 106 layers respectively. They were modified for more accurate predictions that work even with smaller and finer details in the image.
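To make the grid idea concrete, the sketch below (purely illustrative, not the Darknet implementation) maps a bounding-box centre to the grid cell that is responsible for predicting that object.

    def responsible_cell(cx, cy, img_w, img_h, n=13):
        # Return the (row, col) of the N x N grid cell containing the box centre (cx, cy).
        col = int(cx / img_w * n)
        row = int(cy / img_h * n)
        # Clamp in case the centre lies exactly on the right or bottom edge.
        return min(row, n - 1), min(col, n - 1)

    # Example: a stop sign centred at (400, 260) in a 416 x 416 image
    # falls into cell responsible_cell(400, 260, 416, 416) == (8, 12).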

We have tested a YOLOv3 pretrained model on the COCO dataset, which was able to detect 80 different labelled classes. Later, the same was implemented with our custom dataset of 790 traffic sign images for training. These images were taken from the Google Open Images Dataset v5. The data was pre-annotated, which made it easy for us to train the model. Our model was able to detect traffic signs both in images and in real time.

The hyperparameters for this model were tuned to use Adam as the optimizer and a batch size of 8, and the model was trained for 50 epochs with 88 steps per epoch.

Recurrent Neural Networks:

A Recurrent Neural Network (RNN) has the ability to process a sequence of arbitrary length and is hence used for the generation of a sequence of text.

In this model, we worked with a Shakespeare dataset. Given a sequence of characters from this data ("Shakespear"), a model is trained to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly. The model demonstrates how to generate text using a character-based RNN.

It is trained on small batches of text and is able to generate a longer sequence of text with coherent structure. The TensorFlow library is used along with other libraries such as numpy and os. The dataset is downloaded and read, and the data is then vectorized, i.e. before training, the strings are mapped to a numerical representation. The input to the model is a sequence of characters, and we train the model to predict the output. We used tf.data to split the text into manageable sequences, and the data is shuffled and packed into batches.
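The vectorization and batching steps follow the standard TensorFlow character-level recipe; the sketch below is illustrative, and the local file name as well as the sequence, buffer and batch sizes are assumed values.

    import numpy as np
    import tensorflow as tf

    text = open('shakespeare.txt', 'rb').read().decode('utf-8')  # assumed local copy of the dataset

    # Vectorize: map each unique character to an integer id and back.
    vocab = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    idx2char = np.array(vocab)
    text_as_int = np.array([char2idx[ch] for ch in text])

    # Split into fixed-length sequences with tf.data; each example pairs the input
    # characters with the same characters shifted by one, so the model learns to
    # predict the next character.
    seq_length = 100
    chars = tf.data.Dataset.from_tensor_slices(text_as_int)
    sequences = chars.batch(seq_length + 1, drop_remainder=True)
    dataset = sequences.map(lambda s: (s[:-1], s[1:]))
    dataset = dataset.shuffle(10000).batch(64, drop_remainder=True)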
A GRU is used to build the model. The loss function and optimizer used are sparse categorical cross-entropy and Adam respectively. The model is then trained for a certain number of epochs; after training, the model is used to generate text from the Shakespeare data.
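A minimal sketch of the GRU model and training setup described above follows; the layer sizes are the usual tutorial defaults rather than values measured in our runs.

    import tensorflow as tf

    vocab_size = 65        # number of unique characters, taken from the vectorization step
    embedding_dim = 256
    rnn_units = 1024

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.GRU(rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size)   # logits over the next character
    ])

    # Sparse categorical cross-entropy on the logits, optimized with Adam.
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer='adam', loss=loss)
    # model.fit(dataset, epochs=10)        # trained for a certain number of epochs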
Hyperparameters are variables which determine the network structure and how the network is trained; they are defined before training. The hyperparameters in this model are the sequence length, batch size, buffer size, embedding dimension, RNN units, optimizer, number of epochs and number of characters to generate.

Hyperparameter tuning is the process of adjusting the above parameters in order to optimize the model. When the sequence length, batch size, buffer size, embedding dimension, RNN units or number of characters was increased too far, a resource-exhausted error was raised. If the optimizer was changed to RMSprop, the accuracy of the model was reduced, so we use the Adam optimizer as it gives the best accuracy for this model. The more epochs, the higher the accuracy.

Image Captioning:

Caption generation is an artificial intelligence problem where a textual description must be generated for a photograph. Try captioning the picture below. The caption a human may give would be "A dog on grass and some pink flowers". There might be other descriptions too, but all of them mean the same thing. It is so easy for us, as human beings, to just glance at a picture and describe it in an appropriate language. But how does a computer program generate a suitable caption for the image?

Deep learning comes into the picture at this point: two models, a CNN and an RNN, are used to caption the images. The CNN is used to detect the objects in the image and the RNN is used to generate the captions for the image. The working of both these models was shown previously. In our model, we used the MS-COCO dataset to train our model. The dataset has about 82,000 images, every image having at least 5 different caption annotations.

Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. Here, we use InceptionV3 (which is pretrained on ImageNet) to preprocess and classify the images. After preprocessing, the output is cached to disk and the captions are tokenized, with the vocabulary size limited to 5,000 words to save memory, while the remaining words are replaced with the UNK token. The features are extracted from the lower convolutional layer of InceptionV3, giving us a vector of shape (8, 8, 2048).
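The feature extraction and caching step can be sketched as follows; the image and output paths are placeholders, and the reshape to (64, 2048) simply flattens the 8 x 8 spatial grid.

    import numpy as np
    import tensorflow as tf

    # InceptionV3 pretrained on ImageNet, without the classification head; its last
    # convolutional output has shape (8, 8, 2048) for 299 x 299 inputs.
    image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
    feature_extractor = tf.keras.Model(image_model.input, image_model.layers[-1].output)

    def extract_features(image_path):
        img = tf.io.read_file(image_path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)
        features = feature_extractor(tf.expand_dims(img, 0))                         # (1, 8, 8, 2048)
        return tf.reshape(features, (features.shape[0], -1, features.shape[3]))      # (1, 64, 2048)

    # Cache to disk so that training does not re-run InceptionV3 every epoch.
    feats = extract_features('images/stop_sign.jpg')                                 # placeholder path
    np.save('features/stop_sign.npy', feats.numpy())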

This vector is then passed through the CNN Encoder (which consists of a single fully connected layer). An RNN (here a GRU) then attends over the image to predict the next word. The next step is to train the model by extracting the features stored in the respective .npy files and passing those features through the encoder. The encoder output, the hidden state (initialized to zero) and the decoder input (which is the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.
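A sketch of this encoder, a single fully connected layer applied to each of the 64 feature locations, is shown below; the embedding dimension is an assumed value.

    import tensorflow as tf

    class CNN_Encoder(tf.keras.Model):
        # Maps the cached (64, 2048) InceptionV3 features into an embedding space.
        def __init__(self, embedding_dim=256):
            super().__init__()
            self.fc = tf.keras.layers.Dense(embedding_dim)

        def call(self, x):
            # x: (batch, 64, 2048) -> (batch, 64, embedding_dim)
            return tf.nn.relu(self.fc(x))

    encoder = CNN_Encoder()
    # features = encoder(batch_of_cached_features)   # fed to the attention decoder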
The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss. Teacher forcing is used to decide the next input to the decoder: teacher forcing is a technique where the target word is passed as the next input to the decoder. The final step is to calculate the gradients, apply them with the optimizer and backpropagate through the model.
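One training step with teacher forcing can be sketched as below. The decoder object and its call signature are assumptions standing in for the GRU attention decoder, and padded positions in the target caption are masked out of the loss.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                                reduction='none')

    def train_step(img_features, target, encoder, decoder, tokenizer):
        loss = 0.0
        hidden = decoder.reset_state(batch_size=target.shape[0])   # hidden state initialized to 0
        # The first decoder input is the <start> token for every caption in the batch.
        dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

        with tf.GradientTape() as tape:
            features = encoder(img_features)
            for i in range(1, target.shape[1]):
                predictions, hidden, _ = decoder(dec_input, features, hidden)
                mask = tf.cast(tf.math.not_equal(target[:, i], 0), tf.float32)   # ignore padding
                loss += tf.reduce_mean(loss_object(target[:, i], predictions) * mask)
                dec_input = tf.expand_dims(target[:, i], 1)   # teacher forcing: feed the true word

        variables = encoder.trainable_variables + decoder.trainable_variables
        gradients = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(gradients, variables))
        return loss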
III. EXPERIMENTAL RESULTS

Some of our results after implementing Mask RCNN are as follows:

Figure 1: The figure shows how Mask RCNN works on a given image for object detection.

The result of using the YOLOv3 algorithm on the traffic signs dataset is as follows:

Figure 2: The stop sign is detected as a traffic sign using YOLOv3.

The result of the RNN text generation model is:

Figure 3

The results of image captioning are:

Figure 4

Figure 5

Figures 4 and 5 show how the traffic signs are captioned.

IV. CONCLUSION

In our project, we have developed a model to caption images. We also extended the model by converting the generated captions to speech, and we hope to do further work on voice synthesis. We carried out research in order to understand our models in depth and have executed each model separately. We learned how the deep learning techniques work and how to create these models. We faced many challenges while running the model and with our datasets, but we later learned how to rectify the mistakes and built an efficient model.

V. ACKNOWLEDGMENT

We would like to thank our mentors Mr. Naveen Nanda and Ms. Akshita Mehta for helpful communications and discussions regarding the work.
