Image Captioning Using R-CNN & LSTM Deep Learning Model
ISSN No:-2456-2165
Abstract:- Image captioning is the process of creating a text description of an image. It uses both Natural Language Processing (NLP) and Computer Vision to generate the captions. The image captioning task combines object detection, where the descriptions consist of a single word such as cat or skateboard, with image captioning, where one predicted region covers the full image, for example "cat riding a skateboard". To address the localization and description tasks together, we propose a Fully Convolutional Localization Network that processes a picture with a single forward pass and can be trained consistently in a single round of optimization. To process an image, the input image is first passed through a CNN. A localization layer then proposes regions; it includes a region detection network adopted from Faster R-CNN and a captioning network. The model directly combines the Faster R-CNN framework for region detection with a long short-term memory (LSTM) network for captioning.

Keywords:- Image Captioning, Recurrent Neural Network, Long Short-Term Memory, Convolutional Neural Network, Faster R-CNN, Natural Language Processing.

I. INTRODUCTION

Creating captions is an important task relevant to both computer vision and natural language processing. Imitating the human ability to describe pictures with a machine is itself a remarkable step along the line of artificial intelligence [1-2]. The motivation of our project is to capture how objects in an image are related to each other and to express those relations in the English language (generate captions) [3-4].

Many image captioning techniques are used to translate raw image files. These techniques do not capture the exact context of the image and therefore cannot be used to gain insight into the image files. A new system is proposed for image captioning which makes use of contextual information and thereby describes the image accurately [5-8].

This work aims to generate captions using neural language models. The number of proposed models for the image labeling task has increased significantly since neural language models, ANNs and convolutional neural networks (CNNs) became popular [9-16]. This paper is based on one of these works, which combines a recurrent neural network with a CNN. The main objective is to improve the model by making minor architectural changes and by using phrases as elementary units instead of words, which can lead to semantically and syntactically better captions.

II. IMAGE CAPTIONING APPROACHES

2.1 Retrieval Based Image Captioning
Retrieval based image captioning is one of the traditional ways of producing captions for an image. Given an input image, retrieval based captioning methods generate a caption by retrieving one or more sentences from a pre-loaded pool of sentences. The generated caption can either be exactly one of the pre-loaded sentences or a sentence framed from the retrieved ones.

The disadvantage of this approach is that a large pool of images is required, together with the set of sentences allocated to those images from which captions are formed. The output captions generated by this technique do not have any grammatical errors, but they are sentences that were written as captions for other images. In some cases, the captions may therefore be irrelevant to the input image. The major drawback of this method is its limited capability to describe images in its own unique way.

2.2 Template Based Image Captioning
Template based captioning is a different kind of method, used in early work, that is based on fixed templates. Captions are generated from the templates through a syntactically and semantically restricted process. Certain visual concepts must first be recognized to generate a description of an image with the template-based method; the recognized visual concepts are then connected into a sentence. To create a sentence without grammatical errors, optimization algorithms are required.

The disadvantage of this method is that description generation under the template-based framework is strictly limited to the image content recognized by the visual models, and the number of available visual models is typically low. There are usually limitations on the coverage, creativity, and complexity of the sentences generated.
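The two traditional approaches above can be sketched in a few lines of code. This is a minimal, hypothetical illustration, not the paper's method: the image feature vectors, the caption pool, and the template slots are all invented for the example, and in a real system the features would come from a CNN and the concepts from an object detector.

```python
import math

# --- Retrieval based captioning (Section 2.1, minimal sketch) ---
# Pre-loaded pool of (image features, caption) pairs; the 3-d feature
# vectors here are hand-made stand-ins for CNN features.
CAPTION_POOL = [
    ([0.9, 0.1, 0.0], "a cat sitting on a sofa"),
    ([0.1, 0.9, 0.0], "a dog running in a park"),
    ([0.0, 0.2, 0.9], "a person riding a skateboard"),
]

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_caption(query_features):
    # Return the pre-loaded caption whose image is most similar to the
    # query. The result is always grammatical, but it was written for a
    # *different* image -- the weakness noted in Section 2.1.
    return max(CAPTION_POOL, key=lambda p: cosine(p[0], query_features))[1]

# --- Template based captioning (Section 2.2, minimal sketch) ---
def template_caption(concepts):
    # Fill a fixed syntactic template with recognized visual concepts.
    # Coverage is strictly limited to the concepts the detector knows
    # and to this one sentence pattern -- the weakness in Section 2.2.
    return "a {subject} {verb} a {obj}".format(**concepts)
```

For example, `retrieve_caption([0.1, 0.3, 0.8])` returns the skateboard caption because that pool entry's features are the closest, and `template_caption({"subject": "cat", "verb": "riding", "obj": "skateboard"})` yields "a cat riding a skateboard".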