
Volume 6, Issue 6, June-2021 International Journal of Innovative Science and Research Technology

ISSN No: 2456-2156

Automatic Image Caption Generation System


Satyabrat Mandal
Student in Department of Information Technology,
Smt. Kashibai Navale College Of Engineering,
Savitribai Phule Pune University,
Ambegaon, Pune, Maharashtra, India.

Nachiket Lele
Student in Department of Information Technology,
Smt. Kashibai Navale College Of Engineering,
Savitribai Phule Pune University,
Ambegaon, Pune, Maharashtra, India.

Chinmay Kunawar
Student in Department of Information Technology,
Smt. Kashibai Navale College Of Engineering,
Savitribai Phule Pune University,
Ambegaon, Pune, Maharashtra, India.

Abstract:- Computer vision has become omnipresent in our society, with uses in several fields. In this project, we focus on one of the visually demanding recognition tasks in computer vision, namely image captioning. The problem of generating language descriptions for images is still considered an open one, and it has been studied more rigorously in the field of video. In the past few years more emphasis has been given to still images and their description in human-understandable natural language. The task of detecting scenes and objects has become easier thanks to the studies of the last few years. The main motive of our project is to train convolutional neural networks with various hyperparameters on large image datasets such as Flickr8k, using architectures such as ResNet, and to combine the resulting image classifiers with a recurrent neural network to obtain the desired caption for the image. In this paper we present the detailed architecture of the image captioning model.

Keywords:- Computer Vision, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Xception, Flickr8k, LSTM, Preprocessing.

I. INTRODUCTION

In the past few years the field of AI known as deep learning has developed greatly because of its impressive results, in terms of accuracy, compared with existing machine learning algorithms. Producing a meaningful sentence from an image is a difficult task, but done successfully it can have a huge impact, for example by helping the visually impaired gain a better understanding of images.

Image captioning is considered somewhat more difficult than image classification, which has been the main focus of the computer vision community. Finding the relationships between the objects in the image is the most important factor to consider. In addition to this visual understanding of the image, the semantic knowledge has to be expressed in a natural language such as English, which means that a language model is also required. The attempts made in the past have mostly stitched these two models together.

In the proposed model we combine them into a single model consisting of a Convolutional Neural Network (CNN) encoder that creates image encodings; we use the Xception architecture with some modifications. These encodings are then passed to an LSTM network layer, a kind of Recurrent Neural Network. The specification used for the LSTM network is similar to the ones used in machine translators. We then use the Flickr8k dataset to train the model. The model generates a caption as output, built from the dictionary formed by the tokens of the captions in the training set.

II. PROBLEM DEFINITION

Image caption generation has been considered a challenging and significant research area that constantly follows advances in statistical language modelling and image recognition. Caption generation can benefit many people, for example by helping the visually impaired through automatic captions for the millions of images uploaded to the internet every day, which will help them understand the World Wide Web.

III. PROBLEM SOLUTION

In our view the main components of image captioning are a CNN and an RNN; the two are then merged to obtain captions for the images, as described below.
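As a rough illustration of how the two components can be merged, the sketch below wires up a CNN-encoder plus LSTM-decoder "merge" model in Keras. It is a minimal sketch under stated assumptions: the 2048-dimensional image-feature size (Xception's pooled output), the 256-unit embedding and LSTM sizes, and the vocabulary and caption-length values are illustrative, not taken from this paper.

```python
# Minimal sketch of a CNN-encoder + LSTM-decoder "merge" captioning model in Keras.
# The layer sizes, vocab_size and max_length below are illustrative assumptions.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # number of distinct caption words (assumed)
max_length = 34     # longest tokenized caption, in words (assumed)

# Image branch: a pre-extracted Xception feature vector is projected to 256-d.
image_input = Input(shape=(2048,))
image_branch = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: the partial caption (integer word ids) is embedded and run through an LSTM.
caption_input = Input(shape=(max_length,))
caption_branch = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_branch = LSTM(256)(Dropout(0.5)(caption_branch))

# Merge the two branches and predict a probability for every word in the vocabulary.
merged = Dense(256, activation="relu")(add([image_branch, caption_branch]))
next_word = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At each step such a model takes the image encoding and the caption generated so far, and predicts the next word; repeating this until an end marker is produced yields the full caption.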

Convolutional Neural Network (CNN)

Fig. 1: CNN Architecture

Convolutional Neural Networks (CNNs) have been an important factor in the improvement of image classification. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has produced various open-source deep learning architectures such as ZFNet, AlexNet, VGG16, ResNet and Xception, which have a great ability to classify images; for encoding our images we use Xception in our model. The image used for classification needs to be resized to 224*224. The only preprocessing done is subtracting the mean RGB values, computed over the training images, from each pixel. The CNN layers use 3*3 filters and the stride length is fixed at 1; max pooling is done using a 2*2-pixel window with a stride length of 2. The output of the encoder is thus a 1*1*4096 encoding, which is then passed to the language-generating RNN. Other frameworks such as ResNet are also successful in this field, but they are computationally very expensive, since the number of layers in ResNet is very high compared to Xception, and they therefore require a very powerful system.
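As a rough sketch of this encoding step, the snippet below uses the pretrained Xception model shipped with Keras, with its classification head removed. Note the assumptions: Keras' Xception expects a 299*299 input and its own preprocess_input scaling (rather than the 224*224 resize and mean subtraction described above), its pooled output is 2048-dimensional, and the image path is hypothetical.

```python
# Minimal sketch: encode an image with a pretrained, headless Xception network.
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

encoder = Xception(include_top=False, pooling="avg")   # global-average-pooled features

def encode_image(path):
    image = load_img(path, target_size=(299, 299))     # resize to Xception's input size
    pixels = preprocess_input(img_to_array(image))     # scale pixels the way Xception expects
    return encoder.predict(np.expand_dims(pixels, 0), verbose=0)[0]

features = encode_image("example.jpg")                 # hypothetical image path
print(features.shape)                                  # -> (2048,)
```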

Recurrent Neural Network (RNN)

Recurrent neural networks are a type of artificial neural network in which the connections between units form a directed cycle. A recurrent neural network can also be described as a network with loops, in which information persists inside the network; it can be thought of as multiple copies of the same network, each passing a message to its successor. One of the problems with RNNs is that they do not take long-term dependencies into account. To overcome this problem of "long-term dependencies", Hochreiter and Schmidhuber put forward the Long Short-Term Memory (LSTM). The key idea behind the LSTM network is the horizontal line running along the top of the cell, known as the cell state. The cell state is carried through all the repeating modules, and each module modifies it with the help of gates. This lets the LSTM network retain all the available information.

Fig. 2: A simple recurrent neural network unrolled into a chain of copies of itself

Fig. 3: Four interacting layers in an LSTM cell

Datasets to be used
For the task of image captioning we use the Flickr8k dataset. The dataset contains 8000 images with 5 captions per image, and is split by default into image and text folders. Each image has a unique id, and the captions for each image are stored against the corresponding id. The dataset contains 6000 training images, 1000 development images and 1000 test images.

Fig. 4: Sample photo with captions from the Flickr8k dataset
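The caption file can be read into a dictionary keyed by image id. The sketch below assumes the common Flickr8k.token.txt layout, where each line holds "image_id#n<TAB>caption"; the file name and layout are assumptions about the particular distribution being used.

```python
# Minimal sketch: load the Flickr8k captions into a dict of image id -> list of captions.
from collections import defaultdict

def load_captions(path="Flickr8k.token.txt"):           # assumed file name and format
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            image_id, caption = line.rstrip("\n").split("\t")
            image_id = image_id.split("#")[0]            # strip the per-image caption index #0..#4
            captions[image_id].append(caption.lower())
    return captions

captions = load_captions()
# each of the 8000 image ids should now map to its five reference captions
```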
Tokenizing Captions
The Recurrent Neural Network (RNN) part of the system is trained on the captions given in the Flickr8k dataset. We train the RNN to forecast the next word of a sentence from the preceding words; to do this we have to convert the captions linked with the images into lists of tokenized words, turning any string into a list of integers.

First, we go through all the training captions and generate a dictionary that maps every distinct word to a numerical index, so each word we pass through has a corresponding integer value in this dictionary. The words in this dictionary are referred to as our vocabulary. Each word of a caption is then converted into the vector format in which it will be used. After this step, we train the RNN to predict the next word of a sentence.
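A minimal sketch of this tokenizing step, using Keras' Tokenizer to build the word-to-index dictionary (our vocabulary) and to turn caption strings into integer sequences; `captions` is assumed to be the dictionary from the previous sketch.

```python
# Minimal sketch: build the vocabulary and turn captions into integer sequences.
from tensorflow.keras.preprocessing.text import Tokenizer

all_captions = [c for caps in captions.values() for c in caps]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)          # builds the word -> integer index dictionary
vocab_size = len(tokenizer.word_index) + 1    # +1 because index 0 is reserved for padding

# Convert one caption string into the list of integers the RNN is trained on.
sequence = tokenizer.texts_to_sequences([all_captions[0]])[0]
```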

IV. RESULTS AND OBSERVATION

We tested our system on around 300 images of different categories and observed that for about 178 images we got perfect captions; these images mostly contained very few objects, around one or two. However, when the subject of an image wears a multicoloured shirt the system cannot recognise the colours and picks the brightest colour, such as red, as the main colour, as in Fig. 5. It also cannot distinguish between moving and still objects, and it does not detect multiple instances of the same object, as in Fig. 6.

From our observations we found a precision of 63%, which was better than what had been obtained with the datasets used before it.

Fig. 5: Snapshot of the output

Fig. 6: Snapshot of the output

Fig. 7: Snapshot of the output
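For reference, captions such as those shown in Fig. 5-7 can be produced by decoding one word at a time until an end marker is emitted. The sketch below shows a simple greedy decoding loop; it assumes the `model`, `tokenizer`, `max_length` and `encode_image` objects from the earlier sketches and hypothetical "startseq"/"endseq" markers wrapped around every training caption, and the greedy strategy itself is an assumption rather than something stated in this paper.

```python
# Minimal sketch: greedy, word-by-word caption generation at test time.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

index_to_word = {i: w for w, i in tokenizer.word_index.items()}

def generate_caption(image_path):
    features = encode_image(image_path).reshape(1, -1)   # (1, 2048) image encoding
    caption = "startseq"                                  # assumed start-of-caption marker
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([features, seq], verbose=0)[0]
        word = index_to_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":              # assumed end-of-caption marker
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```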
V. CONCLUSION

Image caption generation combines a Convolutional Neural Network and a Long Short-Term Memory network to detect objects and caption images. Image caption generation has many advantages, and we have discussed a convolutional approach to it. Even though automatically generating captions for images is a complex task, with the help of such models and powerful deep learning networks it is possible to obtain good results.

As future scope, we can extend our project to the next level by modifying our model to generate captions even for live video. Currently our model generates captions only for still images, which is itself a difficult task, and captioning live video is much more complex. The system is completely GPU-based, and captioning live video is not feasible on general CPUs. Video captioning is a popular research area that is going to change people's way of life, with use cases in almost every domain; it can automate major tasks such as video surveillance and other security work. We can also extend our work by enhancing the model to produce a voice clip for the generated caption, which will help visually impaired people get an idea of the image.

ACKNOWLEDGEMENT

We are very grateful to all the teachers of our college who have helped us with their valuable guidance towards the completion of our project entitled "Automatic Image Caption Generation System", which is part of our Bachelor of Engineering (B.E.) syllabus. We convey our genuine regards to our department, which has provided us with the essential guidance.

We want to convey our special thanks to Prof. M. V. Raut for providing us with all the necessary instructions and guidance, solving our problems, giving us insight at each and every step, and contributing her knowledge and experience to making this project come true. We are also very thankful to Prof. R. H. Borhade and Prof. L. V. Patil for their invaluable support. The acknowledgement would be incomplete without mentioning our Principal, Prof. Dr. A. V. Deshpande, whose constant assistance and encouragement have been highly important in making our project.

We would also like to express our gratitude towards our parents and friends for their kind cooperation and encouragement, which helped us finish this project.


