Automatic Image Caption Generation System
Chinmay Kunawar
Student in Department of Information Technology,
Smt. Kashibai Navale College Of Engineering, Savitribai Phule Pune University,
Ambegaon, Pune, Maharashtra, India.
Abstract:- Computer vision has become omnipresent in our society, with uses in several fields. In this project, we focus on one of the visually demanding recognition tasks in computer vision: image captioning. The problem of generating language descriptions for images is still considered open, and it has been studied more rigorously in the field of videos. In the past few years, more emphasis has been given to still images and their descriptions in human-understandable natural language. The task of detecting scenes and objects has become easier owing to the studies of the last few years. The main motive of our project is to train convolutional neural networks such as ResNet, applying various hyperparameters to large image datasets such as Flickr8k, and to combine the resulting image encodings with a recurrent neural network to obtain the desired caption for the image. In this paper we present the detailed architecture of the image captioning model.

Keywords:- Computer Vision, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Xception, Flickr8k, LSTM, Preprocessing.

I. INTRODUCTION
In the past few years, the field of AI known as deep learning has developed greatly because of its impressive results in terms of accuracy compared with existing machine learning algorithms. It is a difficult task to generate a meaningful sentence from an image, but if done successfully it can have a huge impact, for example helping the visually impaired gain a better understanding of images.

The attempts made in the past have mostly stitched two separate models together. In the model proposed here, we attempt to combine them into one model, which consists of a Convolutional Neural Network (CNN) encoder that creates the image encodings. We use the Xception architecture with some modifications. These encodings are then passed to an LSTM network layer, a kind of Recurrent Neural Network. The specifications used for the LSTM network are similar to the ones utilized in machine translators. We then use the Flickr8k dataset to train the model. The model generates a caption as output, built from the dictionary that is formed from the tokens of the captions in the training set.
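As an illustration, a minimal sketch of such a combined encoder-decoder can be written in Keras as below; the 256-unit layers, the 2048-dimensional Xception feature vector, and the vocab_size and max_length values are illustrative assumptions rather than settings taken from our experiments.

from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

vocab_size = 7577   # assumed: number of unique tokens in the captions
max_length = 34     # assumed: longest caption length, in tokens

# Image branch: a pre-extracted Xception feature vector is projected
# down to the size of the LSTM state.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Text branch: the partial caption is embedded and run through an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge both branches and predict the next word of the caption.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')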
II. PROBLEM DEFINITION

Image caption generation has been considered a challenging and significant research area that constantly follows advancements in statistical language modelling and image recognition systems. Caption generation can benefit many people, for example the visually impaired: automatically captioning the millions of images uploaded to the internet every day will help them understand the World Wide Web.

III. PROBLEM SOLUTION

In our view, the main components of image captioning are the CNN and the RNN; merging the two yields captions for the images, as follows.

Convolutional Neural Network (CNN)
Convolutional Neural Networks have been an important factor in the improvement of image classification. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has produced various open-source deep learning architectures, such as ZFNet, AlexNet, VGG16, ResNet and Xception, which have a great ability to classify images. For encoding our images we use Xception in our model. The image used for classification needs to be a 224*224 image. The only preprocessing done is subtracting from each pixel the mean RGB value determined over the training images. The CNN layers use 3*3 filters, and the stride length is fixed at 1. Max pooling is done using a 2*2-pixel window with a stride length of 2. Images therefore need to be converted into 224*224-dimensional images. The output of the encoder is a 1*1*4096 encoding, which is then passed to the language-generating RNN. Other architectures, such as ResNet, are also successful in this field, but they are computationally very expensive, since the number of layers in ResNet is very high compared to Xception, and they therefore require a very powerful system.
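A minimal sketch of this preprocessing step is given below, assuming placeholder mean RGB values; in practice these are computed once over the training images.

import numpy as np
from PIL import Image

# Assumed per-channel RGB means; computed from the training images
# in an actual run, the values here are placeholders.
MEAN_RGB = np.array([123.68, 116.78, 103.94], dtype=np.float32)

def preprocess(path):
    # Resize to the 224*224 input size expected by the encoder.
    img = Image.open(path).convert('RGB').resize((224, 224))
    x = np.asarray(img, dtype=np.float32)
    # Subtract the training-set mean from every pixel, per channel.
    x -= MEAN_RGB
    # Add a batch dimension: shape (1, 224, 224, 3).
    return x[np.newaxis, ...]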
Recurrent Neural Network (RNN)
Recurrent neural networks are a type of artificial neural network in which the connections between units form a directed cycle. They can also be described as networks with loops, in which information persists. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to its successor. One of the problems with plain RNNs is that they do not take long-term dependencies into account. To overcome this problem of "long-term dependencies", Hochreiter and Schmidhuber put forward the Long Short-Term Memory (LSTM). The key idea behind the LSTM network is the horizontal line running along the top of the repeating module, known as the cell state. All the repeating modules are connected by the cell state, and every module modifies it with the help of gates. This lets the LSTM network persist all the available information.

Fig. 3: Four interacting layers in an LSTM layer
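A minimal NumPy sketch of a single LSTM time step follows, showing how the four gates interact with the cell state described above; the parameter shapes and names are illustrative, not a definitive implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b stack the parameters of the four interacting layers:
    # forget gate f, input gate i, candidate g, output gate o.
    z = W @ x + U @ h_prev + b      # shape (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    # The cell state is the "horizontal line": the forget gate drops
    # old information, the input gate admits new information.
    c = f * c_prev + i * g
    # The output gate decides what part of the cell state is exposed.
    h = o * np.tanh(c)
    return h, c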
Datasets to be used
For the task of image captioning we use the Flickr8k dataset. The dataset contains 8000 images with 5 captions per image. By default the dataset is split into image and text folders. Each image has a unique id, and the captions for each image are stored against the respective id.

The dataset contains 6000 training images, 1000 development images and 1000 test images.

Fig. 4: Sample photo with captions from the Flickr8k dataset
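A minimal sketch of reading such a caption file is shown below; the file name Flickr8k.token.txt and its "image_id#n<TAB>caption" line format follow the standard Flickr8k text distribution and may need adjusting for other copies of the dataset.

from collections import defaultdict

def load_captions(path='Flickr8k.token.txt'):
    # Maps each image id to its list of five captions.
    captions = defaultdict(list)
    with open(path, encoding='utf-8') as f:
        for line in f:
            key, caption = line.strip().split('\t', 1)
            image_id = key.split('#')[0]   # drop the '#0'..'#4' suffix
            captions[image_id].append(caption.lower())
    return captions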
Tokenizing Captions
The Recurrent Neural Network (RNN) segment is trained on the captions given in the Flickr8k dataset. We train the RNN to forecast the succeeding word of a sentence from the foregoing words. For this we need to transform the captions linked with the images into lists of tokenized words; this turns any string into a list of integers.

First, we go through all the training captions and generate a dictionary that maps every distinct word to a numerical index, so each word we encounter has a corresponding integer value in this dictionary. The words of this dictionary are referred to as our vocabulary. Each word in a caption is remapped through it, and the caption is then converted into the vector format we require. After this step, we
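A minimal sketch of this tokenization using the Keras Tokenizer follows; the startseq/endseq markers are a common convention we assume here, not part of the dataset.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Each caption is wrapped with assumed start/end markers so the RNN
# knows where a sentence begins and ends.
captions = ['startseq a dog runs through the grass endseq',
            'startseq two children play on the beach endseq']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)            # builds the word -> index dictionary
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

# Turn each caption into a list of integers and pad to equal length.
seqs = tokenizer.texts_to_sequences(captions)
max_length = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=max_length, padding='post')
print(vocab_size, padded.shape)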