Image Captioning Using CNN & RNN
Abstract—The purpose of this model is to generate captions for an image. Image captioning aims at generating captions of an image automatically using deep learning techniques. Initially, the objects in the image are detected using a Convolutional Neural Network (InceptionV3). Using the objects detected, a syntactically and semantically correct caption for the image is generated using Recurrent Neural Networks (LSTM) with an attention mechanism. In our project, we are using a traffic sign dataset that is captioned by the above mentioned process. This model is of great benefit to the visually impaired in order to cross roads safely.
Index Terms—CNN, Deep learning, RNN captioning of images with traffic signs. This model can also be
I. INTRODUCTION

Image captioning, or generating a natural language description of an image, has received a lot of attention in recent times. It has emerged as an important and challenging area as research in image recognition advances. It is interesting because it has many practical applications, such as labeling large image datasets and assisting the visually impaired. It requires a level of understanding well beyond general object detection and image classification, and so it is regarded as a grand challenge. The field is a crossover between modern Artificial Intelligence models for Natural Language Processing and Computer Vision.

Top-down and bottom-up are the general image captioning approaches. While the top-down approach starts from the image, which is later converted into words, the bottom-up approach starts with words describing the various aspects of the image, which are then combined. The two approaches suffer, respectively, from missing finer details and from having to formulate sentences out of individual aspects.

Visual attention is a selective mapping process in the human visual system. It is important for the semantic natural language description of images: people often talk about the crucial parts of a scene rather than everything in it. We will discuss the visual attention approach for image captioning and have used an attention-based model to achieve it. Overfitting the training data poses a massive difficulty because the biggest available dataset has only around two million labeled examples. This data has to be used for
a. Extraction of the features
b. Correlation with the labels
c. Syntactic understanding of the labels
d. Modeling of the language for syntactically sound captions

The overfitting problem lies in the fact that the model memorises the inputs and produces similar-sounding captions for images that differ only in specific details, such as a person on a ramp with a skateboard versus a person on a table with a skateboard.
Top-down and Bottom-up are the general image captioning
approaches. While the top-down starts from the image which Convolutional Neural Networks:
is later converted into words, the bottom-up approach starts
with words which describe the various aspects that are 1. Mask RCNN
combined. Both the approaches suffer from describing finer
details and formulation of sentences from individual aspects Mask Region-based Convolutional Neural Networks has been
respectively. the new technology in terms of instance segmentation. There
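A minimal sketch of this caption-preparation step, assuming TensorFlow/Keras; the example captions are placeholders, and the tokenizer settings (5000-word vocabulary, <unk> token) follow the description given later in the Image Captioning subsection:

import tensorflow as tf

captions = ['a stop sign at an intersection',
            'a pedestrian crossing sign on a pole']        # placeholder captions
captions = ['<start> ' + c + ' <end>' for c in captions]   # append start/end tokens

# Keep only the most frequent words; everything else maps to the <unk> token.
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=5000, oov_token='<unk>',
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)

# Pad all captions to the same length so they can be batched.
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')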
Finally, the model is trained with each image and its corresponding real captions. The final weights are used to predict the captions for the test images. These captions are not generated all at once but rather word by word; the prediction of every word requires the probability of occurrence of that particular word, given the previous words.
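Written out in our notation (not the paper's), with I the image and w_1, ..., w_T the caption words, this word-by-word prediction factorizes the caption probability as:

P(w_1, ..., w_T | I) = Π_{t=1}^{T} P(w_t | w_1, ..., w_{t-1}, I)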
This method turned out to be quite effective, and it is what we propose based on our research.
Convolutional Neural Networks:

1. Mask RCNN

Mask Region-based Convolutional Neural Network (Mask RCNN) has become the standard technique for instance segmentation, and there are many papers, tutorials and good-quality open source implementations available for reference. Mask RCNN is a deep neural network aimed at solving the instance segmentation problem in machine learning and computer vision. In other words, it can be used to separate the different objects in an image or a video.

You provide it with an image, and it gives you bounding boxes for the objects, their classes and their masks.
Mask RCNN has two stages. First, it generates proposals about regions where there could be an object, based on the input image. Second, it predicts the class of the object, refines the bounding box, and generates a pixel-level mask of the object based on the first-stage proposal. Both stages are connected to the backbone structure, which is ResNet.

In the first stage, a light-weight neural network called the Region Proposal Network (RPN) scans the Feature Pyramid Network (FPN) top-bottom pathways (hereinafter called the feature map) and proposes regions which may contain objects. To scan the feature map efficiently, we need a way to bind features to their raw image locations; this is where anchors come in. Anchors are a set of boxes with predefined locations and scales relative to the image. Ground-truth classes (only object or background, binary-classified at this stage) and bounding boxes are assigned to individual anchors according to an Intersection over Union threshold.

As anchors with different scales bind to different levels of the feature map, the RPN uses these anchors to work out where in the feature map an object should be and what size its bounding box is.

Here we may note that convolving, downsampling and upsampling keep features in the same relative locations as the objects in the original image, and do not shuffle them around. At the second stage, another neural network takes the regions proposed by the first stage, assigns them to specific areas of a feature map level, scans these areas, and generates object classes, bounding boxes and masks. The procedure looks similar to the RPN. The differences are that, without the assistance of anchors, the second stage uses a trick called Region of Interest Align (RoIAlign) to locate the relevant areas of the feature map, and that there is a branch which generates a pixel-level mask for every object.

[Figure: Illustration of Mask RCNN]
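As an illustration only (the paper does not state which implementation was used, and this example uses PyTorch rather than TensorFlow), a minimal inference sketch with torchvision's Mask R-CNN, which returns exactly these outputs (boxes, class labels and per-instance masks):

import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open('street.jpg').convert('RGB')    # placeholder input image
inputs = [F.to_tensor(image)]                      # list of CHW tensors in [0, 1]

with torch.no_grad():
    outputs = model(inputs)                        # one dict per input image

result = outputs[0]
keep = result['scores'] > 0.5                      # illustrative confidence threshold
print(result['boxes'][keep])                       # [N, 4] bounding boxes
print(result['labels'][keep])                      # COCO class indices
print(result['masks'][keep].shape)                 # [N, 1, H, W] per-instance masks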
2. YOLO

"You Only Look Once", commonly abbreviated as YOLO, is one of the best available object detection algorithms that works in real time. Traditional object classification can be done with a simple neural network, but object detection in a given scene is quite different. It requires a different approach, since conventional classification algorithms work on predefined image dimensions, whereas real-time object detection requires the system to scan the whole image in order to recognise the objects. YOLO solves this problem while also providing bounding boxes for the corresponding trained objects.

YOLO works well for real-time detection, as its name suggests: you only look once. It has 24 convolutional layers and 2 dense layers, totalling 26 layers, and it works on the Darknet architecture created by the first author of the YOLO paper.

The algorithm divides a given image into an N x N grid, which is passed through the neural network to detect the presence of objects in any of the grid cells and to produce bounding boxes for the corresponding objects together with probabilities and labels. The later models, YOLOv2 and YOLOv3, have 30 and 106 layers respectively; they were modified for more accurate predictions that work even with smaller and finer details in the image.

We have tested a YOLOv3 model pretrained on the COCO dataset, which was able to detect 80 different labelled classes. Later, the same model was trained on our custom dataset of 790 traffic sign images taken from the Google Open Images Dataset v5. The data was pre-annotated, which made it easy for us to train the model. Our model was able to detect traffic signs both in static images and in real time.

The hyperparameters for this model were tuned to use the Adam optimizer and a batch size of 8, and the model was trained for 50 epochs with 88 steps per epoch.
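For illustration (the paper does not detail the inference code), a minimal YOLOv3 inference sketch using OpenCV's dnn module with Darknet-format config and weight files; the file names and threshold are placeholders:

import cv2
import numpy as np

# Placeholder file names; any Darknet-format YOLOv3 config/weights pair works.
net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')

image = cv2.imread('traffic_sign.jpg')
h, w = image.shape[:2]

# YOLOv3 expects a square, normalised blob (416x416 is a common input size).
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row is [cx, cy, bw, bh, objectness, class scores...], relative to the image size.
for out in outputs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(class_id, scores[class_id], (cx - bw / 2, cy - bh / 2, bw, bh))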
Recurrent Neural Networks:

A Recurrent Neural Network (RNN) can process a sequence of arbitrary length and is therefore used to generate a sequence of text.

In this model, we worked with a Shakespeare dataset. Given a sequence of characters from this data ("Shakespear"), a model is trained to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly. The model demonstrates how to generate text using a character-based RNN.

It is trained on small batches of text and is able to generate a longer sequence of text with coherent structure. The TensorFlow library is used along with other libraries such as numpy and os. The dataset is downloaded and read, and the data is then vectorized, i.e. before training, the strings are mapped to a numerical representation. The input to the model is a sequence of characters, and we train the model to predict the output. We used tf.data to split the text into manageable sequences, and the data is shuffled and packed into batches.

A GRU is used to build the model. The loss function and optimizer used are sparse categorical cross-entropy and Adam respectively. The model is then trained for a certain number of epochs; after training, the model is used to generate text from the Shakespeare data.

Hyperparameters are variables which determine the network structure and how the network is trained, and they are defined before training. The hyperparameters in this model are the sequence length, batch size, buffer size, embedding dimension, number of RNN units, optimizer, number of epochs, and number of characters to generate.

Hyperparameter tuning is the process of adjusting these parameters in order to optimize the model. When the sequence length, batch size, buffer size, embedding dimension, number of RNN units or number of characters to generate was increased too far, a resource-exhausted error was raised. When the optimizer was changed to RMSprop, the accuracy of the model dropped, so we use the Adam optimizer, which gives the best accuracy for this model. The more epochs we train for, the higher the accuracy.
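A minimal sketch of such a character-level GRU model in tf.keras, following the structure described above; the hyperparameter values shown are illustrative, not the exact ones we used:

import tensorflow as tf

vocab_size = 65       # number of unique characters (illustrative)
embedding_dim = 256   # embedding dimension
rnn_units = 1024      # GRU units
batch_size = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units, return_sequences=True,
                        stateful=True, recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)   # logits over the next character
])

# The Dense layer outputs raw logits, so from_logits=True.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss)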
Image Captioning:

Caption generation is an artificial intelligence problem where a textual description must be generated for a photograph. Try captioning the picture below. The dataset used for captioning contains about 82,000 images, every image having at least 5 different caption annotations.

Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. Here, we use InceptionV3 (which is pretrained on ImageNet) to preprocess and classify the images. After preprocessing, the output is cached to disk, and the captions are tokenized with the vocabulary size limited to 5000 words to save memory; the remaining words are replaced with an UNK token. The features are extracted from the last convolutional layer of InceptionV3, giving us a vector of shape (8, 8, 2048), as sketched below.
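A minimal sketch of this feature-extraction step, assuming TensorFlow/Keras; the image path is a placeholder:

import tensorflow as tf

# InceptionV3 pretrained on ImageNet, without the classification head,
# so that the output is the last convolutional feature map.
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(image_model.input, image_model.output)

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))                 # InceptionV3 input size
    return tf.keras.applications.inception_v3.preprocess_input(img)

img = load_image('example.jpg')                             # placeholder path
features = feature_extractor(tf.expand_dims(img, 0))        # shape (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))   # (1, 64, 2048)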
This vector is then passed through the CNN Encoder (which consists of a single fully connected layer). The RNN (here a GRU) then attends over the image to predict the next word. The next step is to train the model: the features stored in the respective .npy files are extracted and passed through the encoder. The encoder output, the hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.

The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss. Teacher forcing is used to decide the next input to the decoder. Teacher forcing is a technique where the target word is passed as the next input to the decoder. The final step is to calculate the gradients, apply them through the optimizer, and backpropagate through the model, as in the sketch below.
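A condensed sketch of one training step with teacher forcing, assuming encoder and decoder objects with the interfaces described above; the class and method names are illustrative, not taken from the paper:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padding tokens so they do not contribute to the loss.
    per_token = loss_object(real, pred)
    mask = tf.cast(tf.math.not_equal(real, 0), per_token.dtype)
    return tf.reduce_mean(per_token * mask)

@tf.function
def train_step(img_features, target, encoder, decoder, tokenizer):
    loss = 0.0
    hidden = decoder.reset_state(batch_size=target.shape[0])        # hidden state initialized to 0
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        enc_output = encoder(img_features)                          # CNN Encoder: a single FC layer
        for t in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, enc_output, hidden)
            loss += loss_function(target[:, t], predictions)
            dec_input = tf.expand_dims(target[:, t], 1)             # teacher forcing: feed the target word
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss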
III. EXPERIMENTAL RESULTS

VI. REFERENCES