Automatic Image Captioning Bot With CNN and RNN: - Submitted By-Harkirat Singh CSE-3 01976802717
Datasets
• Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
• Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
• Flickr 30K. A collection of 30 thousand described images taken from flickr.com.
• Exploring Image Captioning Datasets, 2016
Data Collection
There are many open-source datasets available for this problem, such as Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
For the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling out the request form provided by the University of Illinois at Urbana-Champaign. Training a model on a large number of images may also not be feasible on a system that is not a very high-end PC/laptop.
This dataset contains 8000 images, each with 5 captions (as we have already seen in the Introduction section, an image can have multiple captions, all being relevant simultaneously).
(Figure: image captioning example, caption "A white dog in a grassy area")
These images are split as follows (a small loading sketch follows the list):
•Training Set — 6000 images
•Dev Set — 1000 images
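As a rough sketch of how the split can be read in Python (assuming the plain-text split files that ship with the Flickr 8k download, e.g. "Flickr_8k.trainImages.txt"; the file names here are an assumption about the local copy):

# Sketch: read the train/dev split from the Flickr 8k split files.
# The file names below are assumptions about the downloaded dataset layout.
def load_split(filename):
    with open(filename) as f:
        # one image file name (e.g. "1000268201_693b08cb0e.jpg") per line
        return set(line.strip() for line in f if line.strip())

train_images = load_split("Flickr_8k.trainImages.txt")  # ~6000 image names
dev_images = load_split("Flickr_8k.devImages.txt")      # ~1000 image names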
Data Preprocessing — Images
Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed-size vector which can then be fed as input to the neural network. For this purpose, we opt for transfer learning using the InceptionV3 model (a convolutional neural network) created by Google Research.
This model was trained on the ImageNet dataset to perform image classification on 1000 different classes of images. However, our purpose here is not to classify the image but just to get a fixed-length, informative vector for each image. This process is called automatic feature engineering.
Hence, we just remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image, as shown in the sketch below.
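A minimal sketch of this step, assuming Keras/TensorFlow is available (the image path below is only an example), is to load InceptionV3 and take the output of its penultimate (global average pooling) layer as the image encoding:

# Sketch: extract 2048-length bottleneck features with InceptionV3 (Keras/TensorFlow).
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load InceptionV3 pretrained on ImageNet and drop the final softmax layer,
# keeping the 2048-dimensional average-pooling output as the image vector.
base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def encode_image(img_path):
    # InceptionV3 expects 299x299 RGB input.
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(x)        # scale pixel values to [-1, 1]
    x = np.expand_dims(x, axis=0)  # add a batch dimension
    return encoder.predict(x).reshape(2048)

feature_vector = encode_image("example.jpg")  # shape: (2048,)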
Data Preparation
This is one of the most important steps in this case study. Here we will understand how to prepare the data in a manner that is convenient to give as input to the deep learning model.
Consider we have 2 images and their 2 corresponding captions as given:
(Train image 1) Caption -> The black cat sat on grass
(Train image 2) Caption -> The white cat is walking on road
First we need to convert both the images to their corresponding 2048-length feature vectors as discussed above. Let "Image_1" and "Image_2" be the feature vectors of the first two images respectively.
Secondly, let's build the vocabulary for the first two (train) captions by adding the two tokens "startseq" and "endseq" to both of them (assume we have already performed the basic cleaning steps):
Caption_1 -> "startseq the black cat sat on grass endseq"
Caption_2 -> "startseq the white cat is walking on road endseq"
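A small illustrative sketch of this step in Python, using only the two example captions above (the dictionary keys and variable names are purely for illustration):

# Sketch: wrap the (already cleaned) captions with start/end tokens and build the vocabulary.
train_captions = {
    "Image_1": ["the black cat sat on grass"],
    "Image_2": ["the white cat is walking on road"],
}

# Add "startseq" and "endseq" so the model knows where a caption begins and ends.
for img_id, caps in train_captions.items():
    train_captions[img_id] = ["startseq " + c + " endseq" for c in caps]

# Vocabulary = the set of all words appearing in the training captions.
vocab = set()
for caps in train_captions.values():
    for c in caps:
        vocab.update(c.split())

print(sorted(vocab))
# ['black', 'cat', 'endseq', 'grass', 'is', 'on', 'road', 'sat', 'startseq', 'the', 'walking', 'white']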
THANK YOU!