Image Captioning Using CNN & RNN

The purpose of this model is to generate captions for an image. Image captioning aims at generating captions of an image automatically using deep learning techniques. Initially, the objects in the image are detected using a Convolutional Neural Network (InceptionV3). Using the objects detected, a syntactically and semantically correct caption for the image is generated using Recurrent Neural Networks (LSTM) with attention mechanism.

Uploaded by

Ganesh

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

147 views

Image Captioning Using CNN & RNN

Uploaded by

Ganesh

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 4

1

IMAGE CAPTIONING USING CNN AND RNN

Akkem Hema Bhargavi, Ganesh G, Jagan Mohan Reddy Dwarampudi, Sowmiya TN
Under the guidance of
Ms. Akshita Mehta, Mr. Naveen Nanda

Abstract—The purpose of this model is to generate d. Modeling of the language for syntactically sounding
captions for an image. Image captioning aims at captions
generating captions of an image automatically using deep
learning techniques. Initially, the objects in the image are The overfitting problem lies in the fact that the model
detected using a Convolutional Neural Network memorises the inputs and uses captions that sound similar for
(InceptionV3). Using the objects detected, a syntactically the images which have a bit variation of specific details like a
and semantically correct caption for the image is person on a ramp with a skateboard and a person on a table
generated using Recurrent Neural Networks (LSTM) with with a skateboard.
attention mechanism. In our project, we are using a traffic
sign dataset that is captioned by the above mentioned
process. This model is of great benefit to the visually II.METHODOLOGY
impaired in order to cross roads safely.
We have trained a visual attention based model for the
Index Terms—CNN, Deep learning, RNN captioning of images with traffic signs. This model can also be
extended for a general captioning which might require a larger
dataset of images. Our model first pre-processes the images
through InceptionV3 pre-trained model. Then, the last output
I. INTRODUCTION layer is further taken through some convolution layers which
is then processed to output the features with dimensions as
Image captioning or generating the description of an image in required.
natural language has received a lot of attention in recent times. Parallelly, the captions of every image are tokenized and
It emerged as an important and challenging area with research <start>, <end> tokens are appended for all the sentences.
advancing in image recognition. It is interesting due to the fact Finally, the model is trained with each image and it’s
that it has a lot of practical applications like labeling large corresponding real captions. The final weights are used to
image datasets, assisting the visually impaired. It requires the predict the captions for the test images. These captions are not
level of understanding way beyond the general object generated at a time but rather word by word. The prediction of
detection and image classification and so, is regarded as a every word requires the probability of occurrence of that
grand challenge. The field is a crossover of the modern particular word, given the previous words.
Artificial Intelligence models of Natural Language This method turned out to be quite effective and this is
Processing and Computer Vision. proposed by us from our research.
Top-down and Bottom-up are the general image captioning
approaches. While the top-down starts from the image which Convolutional Neural Networks:
is later converted into words, the bottom-up approach starts
with words which describe the various aspects that are 1. Mask RCNN
combined. Both the approaches suffer from describing finer
details and formulation of sentences from individual aspects Mask Region-based Convolutional Neural Networks has been
respectively. the new technology in terms of instance segmentation. There
Visual attention is a selective mapping process in the human are many papers, tutorials with good quality open source
visual system. It is important for the semantic natural language codes for reference. Mask RCNN is a deep neural network
description of images. So, people often talk about the crucial aimed to unravel instance segmentation problem in machine
parts rather than everything. We will talk about the visual learning or computer vision. In other words, it can be used to
attention approach for image captioning. We have used an separate different objects in an image or a video.
attention-based model to achieve the same. Overfitting the
training data poses a massive difficulty due to the fact that the You provide it with an image, it gives you the output
biggest available dataset has around only two million labeled bounding boxes for the objects, their classes and masks.
individual data. This data has to be used for
a. Extraction of the features
There are two stages of Mask RCNN. First, it generates
b. Correlation with the labels
proposals about the regions where there could be an object
c. Syntactic understanding of the labels
2

supported the input image. Second, it predicts the category of whole image for the recognition of the object. YOLO solves
the output, refines the bounding box and generates a mask in the problem also providing the bounding boxes for the
pixel level of the object supporting the primary stage proposal. corresponding trained objects.
Both stages are connected to the backbone structure which is
ResNet. YOLO works great for real-time detection as its name
suggests: You Only Look Once. It has 24 convolutional layers
Now let’s check out the primary stage, a light-weight neural and 2 dense layers totalling to 26 layers. It works on the
network called Region Proposal Network(RPN) scans all Darknet architecture, created by the first author of the YOLO
Feature Pyramid Network(FPN) top-bottom pathways paper.
(hereinafter mentioned feature map) and proposes regions
which can contain objects. While scanning feature map in an The algorithm divides any given image into an N x N grid
efficient way, we'd like a way to bind features to its raw image which are then sent to the neural network to detect the
location. Here come the anchors. Anchors are a group of presence of any objects in any of the grids and comes up with
boxes with predefined locations and scales relative to pictures. bounding boxes for the corresponding objects with
Ground-truth classes (only object or background binary probabilities and labels. The later modified models of YOLO
classified at this stage) and bounding boxes are assigned to are YOLOv2 and YOLOv3 which had 30 and 106 layers
individual anchors consistent with some Intersection over respectively. They were modified for more accurate
Union value. predictions that could work with even smaller and finer details
in the image.
As anchors with different scales bind to different levels of
feature map, RPN uses these anchors to work out where of the We have tested a YOLOv3 pretrained model with the COCO
feature map ‘should’ get an object and what size of its dataset which was able to detect 80 different labelled classes.
bounding box is. Later, the same was implemented with our custom dataset of
Here we may agree that convolving, downsampling and 790 traffic sign images for training. These images were taken
upsampling would keep features staying an equivalent relative from Google Open Image Dataset v5. The data was pre
locations as the objects in the original image, and wouldn’t annotated which made it easy for us to train the model. Our
mess them around. At the second stage, another neural model was able to detect traffic signs in an image and in real-
network takes proposed regions by the primary stage and time.
assigns them to many specific areas of a feature map level,
scans these areas, and generates objects classes, bounding The hyperparameters for this model are tuned to have the
boxes and masks. The procedure looks almost like RPN. optimizer as Adam, batch size as 8 and the model was trained
Differences are that without the assistance of anchors, stage- for 50 epochs with 88 steps per epoch.
two used a trick called Region Of Interest Align to locate the
relevant areas of feature map, and there's a branch which Recurrent Neural Networks:
generates masks for every object in the pixel level.
A Recurrent neural network (RNN) has the ability to
process a sequence of arbitrary length and hence used for
generation of a sequence of text.

Illustration of Mask RCNN In this model, we worked with a Shakespeare dataset. Given a
sequence of characters from this data ("Shakespear"), a model
is trained in order to predict the next character in the sequence
2. YOLO ("e"). Longer sequences of text can be generated by calling the
model repeatedly. The model demonstrates how to generate
“You Only Look Once” abbreviated or most ordinarily text using character-based RNN.
referred to as YOLO is the best available object detection
algorithm that works in real-time. Traditional object It is trained on small batches of text, and is able to generate a
classification can be done with a simple neural network but longer sequence of text with coherent structure. Tensorflow
object detection in a given scenario is a whole lot different. It library is used along with other libraries such as numpy, os.
requires a different approach as the conventional classification The dataset is then downloaded and read.
algorithms use predefined image dimensions whereas real- Later, the data is vectorized i.e. before training, the strings are
time object detection requires the system or model to scan the mapped to a numerical representation. The input to the model
3

is a sequence of characters, and we train the model to predict about 82,000 images, every image having at least 5 different
the output. We used tf.data to split the text into manageable caption annotations.
sequences and the data is shuffled and packed into batches.
Transfer learning is a machine learning method where a model
GRU is used in order to build the model. The loss function is developed for a task is reused as the starting point for a
and optimizer used are sparse categorical cross entropy and model on a second task. Here, we use InceptionV3 (which is
adam respectively. The model is then trained using a certain pretrained on Imagenet) to preprocess and classify the images.
number of epochs. After training,the model is used to generate After preprocessing, the output is cached to the disk and the
the text from the Shakespeare data. captions are tokenized with the vocabulary size being limited
to 5000 words to save memory while the remaining words are
Hyperparameters are variables which determine the network replaced with UNK token. The features are extracted from the
structure and how the network is trained (predefined). The lower convolutional layer of InceptionV3 giving us a vector of
hyperparameters in this model are sequence length, batch size, shape (8, 8, 2048).
buffer size, embedding dimension, rnn units, optimizer,
epochs, number of char to generated. This vector is then passed through the CNN Encoder ( consists
of a single Fully connected layer). Then RNN (here GRU)
Hyperparameter tuning is the process of tuning the above attends over the image to predict the next word. The next step
parameters in order to optimize the model. When the sequence is to train the model by extracting the features stored in the
length, batch size, buffer size, embedding dimension, rnn respective .npy files and then pass those features through the
units, number of char is increased to a greater extent, resource encoder. The encoder output, hidden state (initialized to 0) and
exhausted error was displayed. the decoder input (which is the start token) is passed to the
And if the optimizer was changed to RMSprop then the decoder.The decoder returns the predictions and the decoder
accuracy of the model gets reduced, so we use the adam hidden state.
optimizer as it gives the best accuracy for this model. More The decoder hidden state is then passed back into the model
the number of epochs, more is the accuracy. and the predictions are used to calculate the loss.Use teacher
forcing to decide the next input to the decoder.
Image Captioning:
Teacher forcing is a technique where the target word is passed
Caption generation is an artificial intelligence problem where as the next input to the decoder. The final step is to calculate
a textual description must be generated for a photograph. the gradients and apply it to the optimizer and back propagate
Try captioning the picture below. the model .

III.EXPERIMENTAL RESULTS

Some of our results after implementing Mask RCNN are as

follows:

Caption generation is an artificial intelligence problem where

a textual description must be generated for a photograph.
Try captioning the below picture.

The caption a human may give will be “A dog on grass and

some pink flowers”. There might be other descriptions too but
all of them mean the same thing.It’s so easy for us, as human
beings, to just have a glance at a picture and describe it in an
appropriate language. But how does a computer program
generate suitable caption for the image?
Figure 1: The figure shows how Mask RCNN works on a
Deep learning comes into picture at this point, two models: given image for object detection
CNN and RNN are used in order to caption the images. CNN
is used to detect the objects in the image and RNN is used to
generate the captions for the image. The working model of
both these models are shown previously. In our model, we The result of using YOLOv3 algorithm on the traffic signs
used MS-COCO dataset to train our model. The dataset has dataset is as follows:
4

Figure 2: The stop sign is detected as a traffic sign using

YoloV3.
Figure 5

The result for RNN model of text generation is: IV.CONCLUSION

In our project, we have developed a model to caption the

images. We also extended our model by converting our
captions to speech.
We hope to do future works on voice synthesis.
We have done research in order to understand our models in
depth and have executed each model separately. We learned
how the deep learning techniques work and how to create
these models.
We faced many challenges while running the model and with
Figure 3 our datasets. But later on we learned how to rectify the
mistakes and made an efficient model.
The result of image captioning are: V.ACKNOWLEDGMENT

We would like to thank our mentors Mr.Naveen Nanda and

Ms.Aksitha Mehta for helpful communications and
discussions regarding the work.

VI.REFERENCES

[1] Andrej Karpathy, Li Fei-Fei, “DeepVisual-Semantic

Alignments for Generating Image Description”,
Department of Computer Science, Stanford University,
published in IEEE Xplore 2015.

[2] Oriol Vinyals, Alexander Toshev, Samy

Bengio,Dumitru Erhan ,“ShowandTell:
ANeuralImageCaptionGenerator”,
a rXiv 2015.

[3] KelvinXu, JimmyLeiBa, RyanKiros, KyunghyunCho

Figure 4 ,AaronCourville,RuslanSalakhutdinov, RichardS.Zemel
Z,YoshuaBengio,
Figure 4 and 5 shows how the traffic signs are captioned. “Show,AttendandTell:NeuralImageCaption
GenerationwithVisualAttention”,arXiv 2015.

[4] Rahul Singha, Aayush Sharmaa, “Image captioning

using Deep Neural Networks”, Iowa State
University,published in Research gate 2018.

100 NLP Questions
100% (5)
100 NLP Questions
23 pages
ELEC5471M MATLABproject FT16
No ratings yet
ELEC5471M MATLABproject FT16
3 pages
Objectdetection
No ratings yet
Objectdetection
7 pages
Narrative Paragraph Generation
No ratings yet
Narrative Paragraph Generation
13 pages
Manuscript Template 2
No ratings yet
Manuscript Template 2
13 pages
Learning Hierarchical Features For Scene Labeling
No ratings yet
Learning Hierarchical Features For Scene Labeling
15 pages
Image Caption Generator Using CNN and LSTM
No ratings yet
Image Caption Generator Using CNN and LSTM
8 pages
Automated Image Captioning Using CNN and RNN
No ratings yet
Automated Image Captioning Using CNN and RNN
17 pages
1.convolutional Neural Networks For Image Classification
No ratings yet
1.convolutional Neural Networks For Image Classification
11 pages
Expt 6
No ratings yet
Expt 6
1 page
5 Major Computervision Technique
No ratings yet
5 Major Computervision Technique
10 pages
IA 3 Must Study Merged
No ratings yet
IA 3 Must Study Merged
69 pages
Research on Learning Representations in Computer Vision
No ratings yet
Research on Learning Representations in Computer Vision
52 pages
Image Captioning Using R-CNN & LSTM Deep Learning Model
No ratings yet
Image Captioning Using R-CNN & LSTM Deep Learning Model
4 pages
CNN Model For Image Classification Using Resnet: Dr. Senbagavalli M & Swetha Shekarappa G
No ratings yet
CNN Model For Image Classification Using Resnet: Dr. Senbagavalli M & Swetha Shekarappa G
10 pages
Deep Learning Image Classification
No ratings yet
Deep Learning Image Classification
11 pages
Context Encoders Feature Learning by Inpainting
No ratings yet
Context Encoders Feature Learning by Inpainting
9 pages
Real Time Object Detection in Surveillance Cameras With 2xjeq74wam
No ratings yet
Real Time Object Detection in Surveillance Cameras With 2xjeq74wam
8 pages
Pathak Context Encoders Feature CVPR 2016 Paper
No ratings yet
Pathak Context Encoders Feature CVPR 2016 Paper
9 pages
Image Summarizer: Seeing Through Machine Using Deep Learning Algorithm
No ratings yet
Image Summarizer: Seeing Through Machine Using Deep Learning Algorithm
7 pages
Deep Learning Based Text To Image Genera
No ratings yet
Deep Learning Based Text To Image Genera
6 pages
Conference Paper A5
No ratings yet
Conference Paper A5
9 pages
Deep Learning Notes
No ratings yet
Deep Learning Notes
14 pages
Image+Caption(1)
No ratings yet
Image+Caption(1)
8 pages
paper3
No ratings yet
paper3
11 pages
Deep Learning
No ratings yet
Deep Learning
9 pages
Paper 82-Hyperspectral Image Classification
No ratings yet
Paper 82-Hyperspectral Image Classification
7 pages
DenseCap - Fully Convolutional Localization Networks For Dense Captioning
No ratings yet
DenseCap - Fully Convolutional Localization Networks For Dense Captioning
10 pages
PGCON Paper Final
No ratings yet
PGCON Paper Final
4 pages
DL Unit2
No ratings yet
DL Unit2
25 pages
Image Caption Generator
No ratings yet
Image Caption Generator
2 pages
boualleg@M2GARSS 2020
No ratings yet
boualleg@M2GARSS 2020
4 pages
Region-Based Convolutional Networks For Accurate Object Detection and Segmentation
No ratings yet
Region-Based Convolutional Networks For Accurate Object Detection and Segmentation
21 pages
Image Caption Generator Final Report
No ratings yet
Image Caption Generator Final Report
28 pages
Sample Term Paper
No ratings yet
Sample Term Paper
7 pages
IEEE Journal Paper Template 1
No ratings yet
IEEE Journal Paper Template 1
5 pages
Sommaire CNN Presentation
No ratings yet
Sommaire CNN Presentation
10 pages
Automatic Image Captioning Using Neural Networks
No ratings yet
Automatic Image Captioning Using Neural Networks
9 pages
Indian Institute of Technology Kanpur CS-698 Visual Recognition
No ratings yet
Indian Institute of Technology Kanpur CS-698 Visual Recognition
3 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Image Caption Generator
100% (1)
Image Caption Generator
20 pages
Untitled Document
No ratings yet
Untitled Document
23 pages
Gatys 2016
No ratings yet
Gatys 2016
10 pages
Ijcrt 196552
No ratings yet
Ijcrt 196552
6 pages
ml2
No ratings yet
ml2
70 pages
Convolutional Neural Networks For Image Classification
No ratings yet
Convolutional Neural Networks For Image Classification
5 pages
07
No ratings yet
07
7 pages
Automatic Image Captioning Combining Natural Language Processing and
No ratings yet
Automatic Image Captioning Combining Natural Language Processing and
14 pages
Context Encoders: Feature Learning by Inpainting
No ratings yet
Context Encoders: Feature Learning by Inpainting
12 pages
BbbbbbbbE Project Research PPR 1-4-6
No ratings yet
BbbbbbbbE Project Research PPR 1-4-6
3 pages
Research Paper Final
No ratings yet
Research Paper Final
5 pages
Image Upscaling Based Convolutional Neural Network For Better Reconstruction Quality
No ratings yet
Image Upscaling Based Convolutional Neural Network For Better Reconstruction Quality
6 pages
Automatic Image Caption Generation System
No ratings yet
Automatic Image Caption Generation System
4 pages
An Improved Automatic Image Annotation Approach Using Convolutional Neural Network-Slantlet Transform
No ratings yet
An Improved Automatic Image Annotation Approach Using Convolutional Neural Network-Slantlet Transform
13 pages
Image Captioning Using Deep Stacked LSTMS, Contextual Word Embeddings and Data Augmentation
No ratings yet
Image Captioning Using Deep Stacked LSTMS, Contextual Word Embeddings and Data Augmentation
18 pages
Deep Learning For X Ray Image To Text Generation
No ratings yet
Deep Learning For X Ray Image To Text Generation
4 pages
Image Caption Generator Using AI: Review - 1
No ratings yet
Image Caption Generator Using AI: Review - 1
9 pages
A Deep Neural Network For Image Quality Assessment
No ratings yet
A Deep Neural Network For Image Quality Assessment
5 pages
Neural Networks
No ratings yet
Neural Networks
3 pages
Technical Seminar PDF
No ratings yet
Technical Seminar PDF
9 pages
Ref12
No ratings yet
Ref12
7 pages
yarn own bd'
No ratings yet
yarn own bd'
9 pages
Distributed Systems - Final Materials
No ratings yet
Distributed Systems - Final Materials
181 pages
CS6513-COMPUTER GRAPHICS LABORATORY-664424542-computer Graphics Lab Manual 2013 Regulation
100% (1)
CS6513-COMPUTER GRAPHICS LABORATORY-664424542-computer Graphics Lab Manual 2013 Regulation
77 pages
Week 4 Lec 16-20 With Watermarking PDF
No ratings yet
Week 4 Lec 16-20 With Watermarking PDF
155 pages
Week 5 Lec 21-25 With Watermarking PDF
No ratings yet
Week 5 Lec 21-25 With Watermarking PDF
152 pages
SRS Matrimony
No ratings yet
SRS Matrimony
16 pages
Survival Analysis Theory 2024-4
No ratings yet
Survival Analysis Theory 2024-4
49 pages
PCFG
No ratings yet
PCFG
79 pages
The Evaluation Report of SHA-256 Crypt Analysis Hash Function
No ratings yet
The Evaluation Report of SHA-256 Crypt Analysis Hash Function
5 pages
Systems Engineering: Le On
No ratings yet
Systems Engineering: Le On
20 pages
M-III Important Questions
No ratings yet
M-III Important Questions
4 pages
Exercise 07
No ratings yet
Exercise 07
5 pages
Statistical and Machine Learning Models in Credit Scoring A Systematic
No ratings yet
Statistical and Machine Learning Models in Credit Scoring A Systematic
21 pages
Experimental Methods: Department of Applied Mechanics
No ratings yet
Experimental Methods: Department of Applied Mechanics
12 pages
12th Maths EM Unit Test 1 Model Question Paper English Medium PDF Download
No ratings yet
12th Maths EM Unit Test 1 Model Question Paper English Medium PDF Download
2 pages
A Model For Synthesis Process
No ratings yet
A Model For Synthesis Process
8 pages
Buy ebook Analysis And Design Of Algorithms 2nd Edition Amrinder Arora cheap price
100% (2)
Buy ebook Analysis And Design Of Algorithms 2nd Edition Amrinder Arora cheap price
71 pages
Unit Iii Convolutional Networks and Sequence Modelling
No ratings yet
Unit Iii Convolutional Networks and Sequence Modelling
38 pages
Gradient Descent and Cost Function
No ratings yet
Gradient Descent and Cost Function
14 pages
Signals & System II Mid Format QP 2017 (2,3 & 4)
No ratings yet
Signals & System II Mid Format QP 2017 (2,3 & 4)
1 page
1 s2.0 S2214509522001784 Main
No ratings yet
1 s2.0 S2214509522001784 Main
17 pages
Unit Iii Notes - Final
No ratings yet
Unit Iii Notes - Final
31 pages
autogluon-cheat-sheet
No ratings yet
autogluon-cheat-sheet
1 page
Data Structure and Algorithm
No ratings yet
Data Structure and Algorithm
7 pages
P, Pi & PID Controller: By:-Karan Sati
No ratings yet
P, Pi & PID Controller: By:-Karan Sati
16 pages
Worksheet B Key Topic 1.2 Rates of Change
No ratings yet
Worksheet B Key Topic 1.2 Rates of Change
2 pages
Machine Learning, Winter 2020, PPHA 30545
No ratings yet
Machine Learning, Winter 2020, PPHA 30545
5 pages
Stack PDF
No ratings yet
Stack PDF
29 pages
Bridge Course 2
No ratings yet
Bridge Course 2
14 pages
Assignment #1
No ratings yet
Assignment #1
3 pages
Lab Report 3
No ratings yet
Lab Report 3
11 pages
Lab 4-Image Segmentation Using U-Net
No ratings yet
Lab 4-Image Segmentation Using U-Net
9 pages
New Crypto Lab File
No ratings yet
New Crypto Lab File
24 pages