A Project Report on
Image Captioning
Submitted to
Mrs. Arpana Saxena
Ajay Kumar Garg Engineering College (MCA), Ghaziabad
Submitted By:
Aman Singh - 1900270140002
Shashank Saxena - 1900270140010
TABLE OF CONTENTS
Abstract
Acknowledgements
Keywords
List of Abbreviations
1.1 Preface
Chapter 3 - SRS/Algorithm
Abstract:
• Image Captioning is the process of generating a textual description of an image.
• It uses both Natural Language Processing and Computer Vision to generate the captions.
• The dataset is in the form [image → captions]: it consists of input images and their corresponding output captions.
• This process has many potential applications in real life. A noteworthy one would be to save the caption of an image so that it can be retrieved easily at a later stage just on the basis of this description.
Tools Used:
• Python 3.7
• PyCharm
• Google Colab
• Jupyter Notebook
• Anaconda
Technologies:
• Python
• TensorFlow
• Flask
• scikit-learn
• Computer Vision
• Natural Language Processing
Acknowledgements:
Several approaches have been made to solve this task. One of the most notable works was put forward by Andrej Karpathy, now Director of AI at Tesla, in his Ph.D. at Stanford. In this report, we discuss the most widely used and well-known approaches proposed as solutions to this problem. We also look at a Python demo example on the Flickr dataset.
If we are told to describe it, maybe we will describe it as "A puppy on a blue towel" or "A brown dog playing with a green ball". So, how are we doing this? While forming the description, we are seeing the image, but at the same time we are trying to create a meaningful sequence of words. The first part is handled by CNNs and the second by RNNs. If we can obtain a suitable dataset with images and their corresponding human descriptions, we can train networks to automatically caption images. FLICKR 8K, FLICKR 30K, and MS-COCO are some of the most widely used datasets for this purpose.
Table of Contents
1. Introduction
   1.1 Preface
   1.2 Problem Description and Motivation
3. SRS/Algorithm
5. Data Set
CHAPTER: 1
INTRODUCTION
Well, some of you might say "A white dog in a grassy area", some may say "White dog with brown spots", and yet others might say "A dog on grass and some pink flowers".
All of these captions are certainly relevant for this image, and there may be others as well. But the point to note is that it is so easy for us, as human beings, to just glance at a picture and describe it in an appropriate language. Even a 5-year-old could do this with utmost ease.
But can you write a computer program that takes an image as input and produces a relevant caption as output?
The Problem
Just prior to the recent development of Deep Neural Networks, this problem was inconceivable even to the most advanced researchers in Computer Vision. But with the advent of Deep Learning, this problem can be solved quite easily, provided we have the required dataset.
This problem was well researched by Andrej Karpathy in his PhD thesis at Stanford [1]; he is now the Director of AI at Tesla.
The purpose of this report is to explain, in as simple words as possible, how Deep Learning can be used to solve the problem of generating a caption for a given image, hence the name Image Captioning.
To get a better feel for this problem, I strongly recommend using the state-of-the-art system created by Microsoft called Caption Bot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.
1.1 Preface
Motivation:
We must first understand how important this problem is in real-world scenarios. Let's see a few applications where a solution to this problem can be very useful.
• Aid to the blind — We can create a product for the blind that guides them while travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, where it is shown how Nvidia Research is trying to create such a product.
• CCTV cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions, then we can raise alarms as soon as some malicious activity is going on somewhere. This could probably help reduce some crimes and/or accidents.
CHAPTER: 2
For example, for the picture on the right-hand side, we can describe it as "A man is trying to murder his cs231n partner with a clipper". Attention helps us to determine the relationships between the objects.
2.1 Related Work
Work[3] (Szegedy et al.) proposed a deep convolutional neural network architecture codenamed Inception. The main hallmark of this architecture is the improved utilization of the computing resources inside the network. For example, our project tried to use the layers "inception3b" and "inception4b" to get captions and attention, because features learned from the lower layers can contain more accurate information about the correlation between words in a caption and specific locations in the image. Another work presented a generative model based on a deep recurrent architecture that combines advances in computer vision and machine translation and can be used to generate natural sentences describing an image; the model is trained to maximize the likelihood of the target description sentence given the training image. Work[5] (Jeff et al.) introduced a model based on deep convolutional networks that performed very well on image interpretation tasks. Their recurrent convolutional model and long-term RNN models are suitable for large-scale visual learning, are end-to-end trainable, and demonstrate the value of such models on benchmark video recognition tasks.
The attention mechanism has a long history, especially in image recognition; related work includes work[6] and work[7] (Larochelle et al.). But until recently, attention was not incorporated into recurrent neural network architectures. Work[8] (Volodymyr et al.) uses reinforcement learning as an alternative way to predict the attention point, which is closer to how human attention works. However, a reinforcement learning model cannot use backpropagation and is therefore not end-to-end trainable, so it is not widely used in NLP. In work[9] the authors use a recurrent network and an attention mechanism to generate a grammar tree. In work[10] the authors use an RNN model to read in text. Work[2] (Andrej et al.) presented a model that generates natural language descriptions of images and their regions. They combined Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. In Work[1] (Xu et al.) an attention mechanism is used in the generation of image captions. They use a convolutional neural network to encode the image and a recurrent neural network with an attention mechanism to generate the caption. By visualizing the attention weights, we can explain which part of the image the model is focusing on while generating the caption.
CHAPTER: 3
SRS / Algorithm
3.1 Purpose
Data Collection
There are many open-source datasets available for this problem, such as Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
For the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Also, training a model with a large number of images may not be feasible on a system that is not a very high-end PC/laptop.
This dataset contains 8000 images, each with 5 captions (as we have already seen in the Introduction section, an image can have multiple captions, all being relevant simultaneously).
One of the files is “Flickr8k.token.txt” which contains the name of each image
along with its 5 captions. We can read this file as follows:
# Below is the path for the file "Flickr8k.token.txt" on your disk
filename = "/dataset/TextFiles/Flickr8k.token.txt"
with open(filename, 'r') as file:
    doc = file.read()
Thus every line contains <image name>#i <caption>, where 0 ≤ i ≤ 4, i.e. the name of the image, the caption number (0 to 4), and the actual caption.
Now, we create a dictionary named "descriptions" which contains the names of the images (without the .jpg extension) as keys and, as values, a list of the 5 captions for the corresponding image.
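The code for building this dictionary is not reproduced in the report; a minimal sketch of this step, parsing the doc string loaded above (the exact parsing details are an assumption), could look like this:

descriptions = dict()
for line in doc.split('\n'):
    # each line looks like: <image name>.jpg#<i> <caption>
    tokens = line.split()
    if len(tokens) < 2:
        continue
    image_id, image_desc = tokens[0], tokens[1:]
    # drop the .jpg extension and the #i caption index from the image name
    image_id = image_id.split('.')[0]
    # re-join the caption tokens into a single string
    image_desc = ' '.join(image_desc)
    if image_id not in descriptions:
        descriptions[image_id] = list()
    descriptions[image_id].append(image_desc)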
(Sample output: the value stored for one image is a list of five captions, ending with '… brown dog is running through a snow covered field .' and 'The white and brown dog is running over the surface of the snow .')
Data Cleaning
When we deal with text, we generally perform some basic cleaning, like lower-casing all the words (otherwise "hello" and "Hello" will be regarded as two separate words), removing special tokens (like '%', '$', '#', etc.), and eliminating words which contain numbers (like 'hey199', etc.).
import string

# prepare a translation table for removing punctuation
table = str.maketrans('', '', string.punctuation)

# tokenize the caption
desc = desc.split()
# convert to lower case
desc = [word.lower() for word in desc]
# remove punctuation from each token
desc = [w.translate(table) for w in desc]
# keep only alphabetic tokens (drops words containing numbers)
desc = [w for w in desc if w.isalpha()]
Create a vocabulary of all the unique words present across all the 8000*5 (i.e. 40000) image captions (corpus) in the dataset:
vocabulary = set()
for key in descriptions.keys():
    [vocabulary.update(d.split()) for d in descriptions[key]]
print('Original Vocabulary Size: %d' % len(vocabulary))
Original Vocabulary Size: 8763
This means we have 8763 unique words across all the 40000 image captions. We write all these captions along with their image names to a new file, namely "descriptions.txt", and save it on the disk.
However, if we think about it, many of these words will occur only a few times, say 1, 2 or 3 times. Since we are creating a predictive model, we would not like to have all the words in our vocabulary, but only the words which are more likely to occur, i.e. the common ones. This helps the model become more robust to outliers and make fewer mistakes.
Hence we consider only those words which occur at least 10 times in the entire
corpus. The code for this is below:
Code to retain only those words which occur at least 10 times in the corpus
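Since the original listing is a screenshot that is not reproduced here, the following is a minimal sketch of the filtering step; the threshold variable and the use of the train_descriptions dictionary (introduced later in the Data Set chapter) are assumptions based on the surrounding text:

word_count_threshold = 10
word_counts = {}
for key in train_descriptions.keys():
    for sent in train_descriptions[key]:
        for w in sent.split(' '):
            word_counts[w] = word_counts.get(w, 0) + 1

# keep only the words that occur at least 10 times
vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print('Vocabulary size after filtering: %d' % len(vocab))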
So now we have only 1651 unique words in our vocabulary. However, we will append 0's (zero padding, explained later), and thus the total number of words = 1651 + 1 = 1652 (one index for the 0).
Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed-sized vector which can then be fed as input to the neural network. For this purpose, we opt for transfer learning using the InceptionV3 model (a Convolutional Neural Network) created by Google Research.
Hence, we just remove the last softmax layer from the model and extract a 2048-length vector (the bottleneck features) for every image as follows:
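The extraction code itself is not reproduced in the report; a minimal sketch of this step with Keras (the variable name model_new is an assumption) would be:

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet
base_model = InceptionV3(weights='imagenet')
# Drop the final softmax layer and keep the 2048-dimensional bottleneck output
model_new = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)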
Now, we pass every image to this model to get the corresponding 2048 length
feature vector as follows:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import preprocess_input
import numpy as np

# Convert all the images to size 299x299 as expected by the InceptionV3 model
img = image.load_img(image_path, target_size=(299, 299))
# Convert the PIL image to a 3-dimensional numpy array
x = image.img_to_array(img)
# Add one more dimension (the batch axis)
x = np.expand_dims(x, axis=0)
# Preprocess the image using preprocess_input() from the inception module
x = preprocess_input(x)
# Pass the image through the truncated model (model_new above) to get the (1, 2048) feature vector
x = model_new.predict(x)
# Reshape from (1, 2048) to (2048, )
x = np.reshape(x, x.shape[1])
We save all the bottleneck train features in a Python dictionary and save it on the disk as a pickle file, namely "encoded_train_images.pkl", whose keys are image names and values are the corresponding 2048-length feature vectors.
NOTE: This process might take an hour or two if you do not have a high-end PC/laptop.
Similarly, we encode all the test images and save them in the file "encoded_test_images.pkl".
But the prediction of the entire caption, given the image, does not happen at once. We predict the caption word by word. Thus, we need to encode each word into a fixed-sized vector. This part will be covered when we look at the model design, but for now we will create two Python dictionaries, namely "wordtoix" (pronounced "word to index") and "ixtoword" (pronounced "index to word").
Stated simply, we will represent every unique word in the vocabulary by an integer (index). As seen above, we have 1651 unique words in the corpus plus the padding index, and thus each word will be represented by an integer index between 1 and 1651 (index 0 being reserved for padding).
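A minimal sketch of these two dictionaries, assuming the filtered word list vocab from the sketch above, could be:

ixtoword = {}
wordtoix = {}
ix = 1  # index 0 is reserved for zero padding
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

vocab_size = len(ixtoword) + 1  # 1651 words + 1 for the padding index = 1652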
There is one more parameter that we need to calculate, i.e., the maximum length of
a caption and we do it as below:
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# determine the maximum sequence length
max_length = max_length(train_descriptions)
This is one of the most important steps in this case study. Here we will understand how to prepare the data in a form that is convenient to feed as input to the deep learning model.
Hereafter, I will try to explain the remaining steps using a small worked example, as follows:
Now, let's say we use the first two images and their captions to train the model and the third image to test our model.
The questions that will now be answered are: how do we frame this as a supervised learning problem? What does the data matrix look like? How many data points do we have? And so on.
First, we need to convert both images to their corresponding 2048-length feature vectors as discussed above. Let "Image_1" and "Image_2" be the feature vectors of the first two images respectively.
Secondly, let's build the vocabulary for the first two (train) captions by adding the two tokens "startseq" and "endseq" to both of them (assume we have already performed the basic cleaning steps):
vocab = {black, cat, endseq, grass, is, on, road, sat, startseq, the, walking, white}
black - 1, cat - 2, endseq - 3, grass - 4, is - 5, on - 6, road - 7, sat - 8, startseq - 9, the - 10, walking - 11, white - 12
Now let's try to frame it as a supervised learning problem where we have a set of data points D = {Xi, Yi}, where Xi is the feature vector of data point 'i' and Yi is the corresponding target variable.
Let's take the first image vector Image_1 and its corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is the input and the caption is what we need to predict. But we predict the caption as follows:
First, we provide the image vector and the first word as input and try to predict the second word.
Then we provide the image vector and the first two words as input and try to predict the third word.
And so on...
Thus, we can summarize the data matrix for one image and its corresponding caption as follows:
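Written out for Image_1 and the caption above, the (input, target) pairs look like this (partial caption on the left, next word to predict on the right):

Image_1 + "startseq"                            -> "the"
Image_1 + "startseq the"                        -> "black"
Image_1 + "startseq the black"                  -> "cat"
Image_1 + "startseq the black cat"              -> "sat"
Image_1 + "startseq the black cat sat"          -> "on"
Image_1 + "startseq the black cat sat on"       -> "grass"
Image_1 + "startseq the black cat sat on grass" -> "endseq"

So this single image with its caption expands into seven data points.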
It must be noted that one image+caption pair is not a single data point but multiple data points, depending on the length of the caption.
Similarly, if we consider both images and their captions, our data matrix will then look as follows:
We must now understand that in every data point, it's not just the image which goes as input to the system, but also a partial caption, which helps to predict the next word in the sequence.
However, as already discussed, we are not going to pass the actual English text of the caption; rather, we are going to pass the sequence of indices, where each index represents a unique word.
Since we have already created an index for each word, let's now replace the words with their indices and understand what the data matrix will look like.
Since we will be doing batch processing (explained later), we need to make sure that every sequence is of equal length. Hence we need to append 0's (zero padding) at the end of each sequence. But how many zeros should we append to each sequence?
Well, this is the reason we calculated the maximum length of a caption, which is 34 (if you remember). So we append as many zeros as needed to make every sequence 34 tokens long.
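As a small sketch of this padding step (assuming Keras is used; the example indices follow the toy mapping given earlier, and 'post' padding matches appending zeros at the end):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# hypothetical partial caption "startseq the black cat" encoded with the toy
# word indices from above (startseq-9, the-10, black-1, cat-2)
seq = [9, 10, 1, 2]
# append zeros so that the sequence length becomes 34
padded = pad_sequences([seq], maxlen=34, padding='post')[0]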
Word Embeddings
As already stated above, we will map every word (index) to a 200-dimensional vector, and for this purpose we will use a pre-trained GloVe model:
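The loading code is not shown in the report; a minimal sketch (the file name glove.6B.200d.txt is an assumption) is:

import numpy as np

# load the 200-dimensional GloVe vectors into a dictionary: word -> vector
embeddings_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %d word vectors.' % len(embeddings_index))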
Now, for all the 1652 unique words in our vocabulary, we create an embedding
matrix which will be loaded into the model before training.
embedding_dim = 200

# Get a 200-dim dense vector for each of the words in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoix.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector
Model Architecture
Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API, which allows us to create merge models.
First, let's look at the overall architecture, which contains the high-level sub-modules:
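The architecture diagram itself is not reproduced here; the following is a hedged sketch of such a merge model with the Keras Functional API (the layer width of 256 and the dropout rate of 0.5 are assumptions, not values taken from the report):

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# image feature branch: the 2048-length bottleneck vector
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# partial caption branch: sequence of word indices -> embedding -> LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# decoder: merge the two branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)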
The plot below helps to visualize the structure of the network and better understand the two streams of input:
The text in red on the right-hand side of the plot provides comments to map your understanding of the data preparation to the model architecture.
The LSTM (Long Short-Term Memory) layer is nothing but a specialized Recurrent Neural Network that processes the sequence input (partial captions in our case).
Recall that we created an embedding matrix from the pre-trained GloVe model, which we need to load into the model before starting the training:
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False
Notice that since we are using a pre-trained embedding layer, we need to freeze it (trainable = False) before training the model, so that it does not get updated during backpropagation.
model.compile(loss='categorical_crossentropy', optimizer='adam')
Output -> An appropriate word, the next one in the sequence of the partial caption provided as input (or, in probability terms, conditioned on the image vector and the partial caption).
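For a single data point, obtaining this output word from the model could look like the following fragment (photo and seq are the image feature vector and the padded index sequence from the earlier steps; the names are assumptions):

import numpy as np

# photo: (1, 2048) image feature vector; seq: (1, max_length) padded index sequence
yhat = model.predict([photo, seq], verbose=0)  # softmax scores over the vocabulary
next_word = ixtoword[int(np.argmax(yhat))]     # pick the most probable next word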
The model was then trained for 30 epochs with an initial learning rate of 0.001 and 3 pictures per batch (batch size). However, after 20 epochs, the learning rate was reduced to 0.0001 and the model was trained on 6 pictures per batch.
This generally makes sense because during the later stages of training, as the model moves towards convergence, we must lower the learning rate so that we take smaller steps towards the minima. Increasing the batch size over time also makes the gradient updates less noisy.
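A hedged sketch of this two-stage schedule (the training loop itself is not shown in the report, so the fit calls are indicated only as comments; the use of Adam and K.set_value is an assumption):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K

# Stage 1: first 20 epochs, learning rate 0.001, 3 pictures per batch
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))
# model.fit(...) with batches of 3 pictures for 20 epochs

# Stage 2: remaining 10 epochs, learning rate 0.0001, 6 pictures per batch
K.set_value(model.optimizer.learning_rate, 0.0001)
# model.fit(...) with batches of 6 pictures for 10 more epochs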
CHAPTER: 4
A: State Diagram
B: DFD
C: Level 1 DFD
CHAPTER: 5
Data set
5.1 Dataset
Data are the basis of artificial intelligence. People are increasingly discovering that many patterns which are difficult to find directly can be discovered from a large amount of data. For the image description generation task, there are currently several rich datasets, such as MSCOCO, Flickr8k, Flickr30k, PASCAL 1K, the AI Challenger Dataset, and STAIR Captions, and evaluating on them has gradually become the norm. In these datasets, each image has five reference descriptions, and Table 2 summarizes the number of images in each dataset. In order to have multiple independent descriptions of each image, the datasets use different phrasings to describe the same image.
Flickr8k/Flickr30k [81, 82]: Flickr8k images come from Yahoo's photo album site Flickr; the dataset contains 8,000 photos, with 6,000 images for training, 1,000 for validation, and 1,000 for testing. Flickr30k contains 31,783 images collected from the Flickr website.
The text file “Flickr_8k.trainImages.txt” contains the names of the images that
belong to the training set. So we load these names into a list “train”.
filename = 'dataset/TextFiles/Flickr_8k.trainImages.txt'
doc = load_doc(filename)
train = list()
for line in doc.split('\n'):
    # skip empty lines
    if len(line) < 1:
        continue
    # the image identifier is the file name without the extension
    identifier = line.split('.')[0]
    train.append(identifier)
print('Dataset: %d' % len(train))
Dataset: 6000
Thus we have separated the 6000 training images in the list named “train”.
Now, we load the descriptions of these images from "descriptions.txt" (saved on the hard disk) into the Python dictionary "train_descriptions".
However, when we load them, we will add two tokens to every caption as follows (their significance is explained later):
'startseq' -> This is a start sequence token which will be added at the start of every caption.
‘endseq’ -> This is an end sequence token which will be added at the end of every
caption.
doc = load_doc('descriptions.txt')
train_descriptions = dict()
for line in doc.split('\n'):
    # split line by white space
    tokens = line.split()
    # skip empty lines
    if len(tokens) < 2:
        continue
    # first token is the image id, the rest is the description
    image_id, image_desc = tokens[0], tokens[1:]
    # keep only the images that belong to the training set
    if image_id in train:
        if image_id not in train_descriptions:
            train_descriptions[image_id] = list()
        # wrap the description in the start and end sequence tokens
        desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
        # store
        train_descriptions[image_id].append(desc)
print('Descriptions: train=%d' % len(train_descriptions))
# Descriptions: train=6000
CHAPTER: 6
Snapshot of Forms
CHAPTER: 7
CHAPTER: 8
Many deep learning-based methods have been proposed for generating automatic image captions in recent years. Supervised learning, reinforcement learning, and GAN-based methods are commonly used to generate image captions. Both visual space and multimodal space can be used in supervised learning-based methods; the main difference between the two lies in the mapping. Visual space-based methods perform explicit mapping from images to descriptions, whereas multimodal space-based methods incorporate implicit vision and language models. Supervised learning-based methods are further categorized into Encoder-Decoder architecture-based, Compositional architecture-based, Attention-based, Semantic concept-based, Stylized captions, Dense image captioning, and Novel object-based image captioning. Encoder-Decoder architecture-based methods use a simple CNN and a text generator to produce image captions. Attention-based image captioning methods focus on different salient parts of the image and achieve better performance than encoder-decoder architecture-based methods. Semantic concept-based image captioning methods selectively focus on different parts of the image and can generate semantically rich captions. Dense image captioning methods can generate region-based image captions. Stylized image captions express various emotions such as romance, pride, and shame. GAN- and RL-based image captioning methods can generate diverse and multiple captions.
MSCOCO, Flickr30k, and Flickr8k are common and popular datasets used for image captioning. MSCOCO is a very large dataset, and all the images in these datasets have multiple captions. The Visual Genome dataset is mainly used for region-based image captioning. Different evaluation metrics are used for measuring the performance of image captioning methods. The BLEU metric is good for evaluating short sentences. ROUGE has different variants which can be used for evaluating different types of text. METEOR can perform an evaluation on various segments of a caption. SPICE is better at capturing the semantic details of captions than the other evaluation metrics. Although success has been achieved in recent years, there is still large scope for improvement. Generation-based methods can generate novel captions for every image; however, to some extent these methods fail to detect prominent objects and attributes and their relationships, which limits the generation of accurate and multiple captions. In addition, the accuracy of the generated captions largely depends on syntactically correct and diverse captions, which in turn rely on a powerful and sophisticated language generation model. Existing methods demonstrate their performance on datasets where the images are collected from the same domain; therefore, working on open-domain datasets will be an interesting avenue for research in this area. Image-based factual descriptions are not enough to generate high-quality captions; external knowledge can be added in order to generate more attractive image captions. Supervised learning needs a large amount of labelled data for training.
8.1 References
1. https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
2. https://arxiv.org/abs/1411.4555
3. https://arxiv.org/abs/1703.09137
4. https://arxiv.org/abs/1708.02043
5. https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
6. https://www.youtube.com/watch?v=yk6XDFm3J2c
7. https://www.appliedaicourse.com/