

A Project report on
Image Captioning

Submitted to
Mrs. Arpana Saxena
Ajay Kumar Garg Engineering College-MCA, Ghaziabad

Submitted By:
Aman Singh - 1900270140002
Shashank Saxena - 1900270140010

Dr. A.P.J. Abdul Kalam Technical University,


Uttar Pradesh, Lucknow

TABLE OF CONTENTS

Abstract
Acknowledgements
Keywords
Summary for Lay Audience
Table of Contents
List of Tables
List of Figures
List of Abbreviations

Chapter 1 - Introduction
    1.1 Preface
    1.2 Problem Description and Motivation
Chapter 2 - Background and Related Work
Chapter 3 - SRS/Algorithm
Chapter 4 - System Design Document (DFD, ER Diagram, UML Diagram, State Diagram)
Chapter 5 - Database Design/Data Set
Chapter 6 - Snapshots of Forms
Chapter 7 - Snapshots of Reports/Results Evaluation, Analysis and Conclusion
Chapter 8 - Discussion and Future Work


Abstract:
• Image Captioning is the process of generating a textual description of an image.
• It uses both Natural Language Processing and Computer Vision to generate the captions.
• The dataset is in the form [image → captions]: it consists of input images and their corresponding output captions.
• This process has many potential applications in real life. A noteworthy one is to store the caption of an image so that the image can be retrieved easily at a later stage on the basis of this description alone.
• For example, a caption is generated for the image shown below.



• The task of image captioning can be divided logically into two modules: an image-based model, which extracts the features and nuances from the image, and a language-based model, which translates the features and objects given by the image-based model into a natural sentence.
• For the image-based model (the encoder) we usually rely on a Convolutional Neural Network, and for the language-based model (the decoder) we rely on a Recurrent Neural Network.

Tools & Technologies Used


Tools-

• Python 3.7
• PyCharm
• Google Colab
• Jupyter Notebook
• Anaconda

Technologies-

• Python
• TensorFlow
• Flask
• scikit-learn
• Computer Vision
• Natural Language Processing


Acknowledgements:

We take this opportunity to acknowledge everyone who helped us at every stage of this project.

We would like to express our greatest appreciation to all the individuals who helped and supported us throughout the project. We are thankful to our mentor for her ongoing support during the project, from the initial advice and encouragement to the final report.

A special acknowledgement goes to our seniors, who helped us complete the project by exchanging interesting ideas and sharing their experience.

We also wish to thank our parents for their undivided support and interest; they inspired and encouraged us to go our own way, and without them we would have been unable to complete this project.

Finally, we want to thank our friends, who appreciated our work and motivated us to continue it.

➢ Summary for Lay Audience:


Image Captioning is the process of generating a textual description for a given image. It has become a very important and fundamental task in the Deep Learning domain, and it has a wide range of applications. NVIDIA, for example, is using image captioning technologies to create an application to help people who have low or no eyesight.

Several approaches have been proposed to solve the task. One of the most notable works was put forward by Andrej Karpathy, Director of AI at Tesla, in his Ph.D. at Stanford. In this report, we talk about the most used and well-known approaches proposed as a solution to this problem. We also look at a Python demo on the Flickr dataset.

So, let's start.

Image captioning can be regarded as an end-to-end sequence-to-sequence problem, as it converts an image, which can be regarded as a sequence of pixels, into a sequence of words. For this purpose, we need to process both the language (the statements) and the images. For the language part we use Recurrent Neural Networks, and for the image part we use Convolutional Neural Networks to obtain the feature vectors.

Now, how does the idea work?


Say we, as humans, are looking at a scene such as the one given below.

If we are told to describe it, maybe we will describe it as "A puppy on a blue towel" or "A brown dog playing with a green ball". So, how are we doing this? While forming the description, we are looking at the image, but at the same time we are trying to create a meaningful sequence of words. The first part is handled by CNNs and the second by RNNs. If we can obtain a suitable dataset with images and their corresponding human descriptions, we can train networks to automatically caption images. Flickr8k, Flickr30k, and MS-COCO are some of the most widely used datasets for this purpose.


Caption generation is an interesting artificial intelligence problem in which a descriptive sentence is generated for a given image. It combines techniques from computer vision, to understand the content of the image, with a language model from the field of natural language processing, to turn that understanding into words in the right order. Image captioning has various applications, such as recommendations in editing applications, usage in virtual assistants, image indexing, aids for visually impaired persons, social media, and several other natural language processing applications. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem. It has been demonstrated that deep learning models are able to achieve very good results on the caption generation problem. Instead of requiring complex data preparation or a pipeline of specifically designed models, a single end-to-end model can be defined to predict a caption given a photo. In order to evaluate our model, we measure its performance on the Flickr8k dataset using the standard BLEU metric. These results show that our proposed model performs better than standard image captioning models in the performance evaluation.
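As an illustration only, BLEU scores against the five reference captions per image can be computed with NLTK as sketched below; the tokenized reference and generated captions are assumed to have been prepared elsewhere and are not part of the original report.

from nltk.translate.bleu_score import corpus_bleu

# references: one list of tokenized reference captions per test image
# hypotheses: one tokenized generated caption per test image
def evaluate_bleu(references, hypotheses):
    # BLEU-1 to BLEU-4, the weights commonly used for caption evaluation
    b1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
    b2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
    b3 = corpus_bleu(references, hypotheses, weights=(0.33, 0.33, 0.33, 0))
    b4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
    return b1, b2, b3, b4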


Table of Contents

S.no  Title
1.    Introduction
          1.1 Preface
          1.2 Problem Description and Motivation
2.    Background and Related Work
3.    SRS/Algorithm
4.    System Design Document
5.    Data Set
6.    Snapshot of Reports/Result Evaluation, Analysis and Conclusion
7.    Discussion and Future Work


CHAPTER: 1

INTRODUCTION

What do you see in the below picture?

Can you write a caption?

Well, some of you might say "A white dog in a grassy area", some may say "White dog with brown spots", and yet others might say "A dog on grass and some pink flowers".

Definitely, all of these captions are relevant for this image, and there may be others as well. But the point I want to make is this: it is so easy for us, as human beings, to just glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with utmost ease.


But, can you write a computer program that takes an image as input and produces a
relevant caption as output?

The Problem

Just prior to the recent development of Deep Neural Networks, this problem was inconceivable even to the most advanced researchers in Computer Vision. But with the advent of Deep Learning, this problem can be solved quite easily if we have the required dataset.

This problem was well researched by Andrej Karpathy in his PhD thesis at Stanford [1]; he is now also the Director of AI at Tesla.

The purpose of this report is to explain, in as simple words as possible, how Deep Learning can be used to solve the problem of generating a caption for a given image, hence the name Image Captioning.


To get a better feel for this problem, I strongly recommend using the state-of-the-art system created by Microsoft called Caption Bot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.

• 1.1 Preface

• 1.2 Problem Description

The problem introduces a captioning task, which requires a computer vision system to both localize and describe salient regions of images in natural language. The image captioning task generalizes object detection, which corresponds to the case where the descriptions consist of a single word. Given a set of images and prior knowledge about their content, the task is to find the correct semantic label for each entire image.

Input: an image.
Expected output: a natural language description of the input image.


Motivation:

We must first understand how important this problem is for real-world scenarios. Let's look at a few applications where a solution to this problem can be very useful.

• Self-driving cars: Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.

• Aid to the blind: We can create a product for the blind which will guide them while travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, where it is shown how NVIDIA research is trying to create such a product.

• CCTV cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions, then we can raise alarms as soon as some malicious activity is going on somewhere. This could probably help reduce some crime and/or accidents.

• Automatic captioning can help make Google Image Search as good as Google Search, as every image could first be converted into a caption and then search could be performed based on the caption.


CHAPTER: 2

Background And Related work

Automatically generating captions for an image demonstrates a computer's understanding of the image, which is a fundamental task of intelligence. A caption model not only needs to find which objects are contained in the image, but also needs to be able to express their relationships in a natural language such as English. Recent work has also introduced attention, which can store and report the information and relationships between the most salient features and clusters in the image. Xu's work describes approaches to caption generation that attempt to incorporate a form of attention, with two variants: a "hard" attention mechanism and a "soft" attention mechanism. In that work, the comparison of the two mechanisms shows that "soft" attention works better, and we will implement the "soft" mechanism in our project. If we have enough time, we will also implement the "hard" mechanism and compare the results.

In our project, we perform image-to-sentence generation. This application bridges vision and natural language. If we can do well in this task, we can then utilize natural language processing technologies to understand the world in images. In addition, we introduce an attention mechanism, which is able to recognize what a word refers to in the image and thus summarize the relationships between objects in the image. This will be a powerful tool for utilizing the massive amount of unformatted image data, which dominates the data in the world. As an example, for the picture on the right-hand side, we can describe it as "A man is trying to murder his cs231n partner with a clipper"; attention helps us to determine the relationship between the objects.

Related work. Work [3] (Szegedy et al.) proposed a deep convolutional neural network architecture codenamed Inception. The main hallmark of this architecture is the improved utilization of the computing resources inside the network. For example, our project tried to use the layers "inception3b" and "inception4b" to get captions and attention, because features learned from the lower layers can contain more accurate information about the correlation between words in the caption and specific locations in the image. Another work presented a generative model based on a deep recurrent architecture, combining advances in computer vision and machine translation, that can be used to generate natural sentences describing an image; the model is trained to maximize the likelihood of the target description sentence given the training image. Work [5] (Jeff et al.) introduced a model based on deep convolutional networks that performs very well on image interpretation tasks; their recurrent convolutional model and long-term RNN models are suitable for large-scale visual learning, are end-to-end trainable, and demonstrate their value on benchmark video recognition tasks.

The attention mechanism has a long history, especially in image recognition; related work includes work [6] and work [7] (Larochelle et al.), but until recently attention was not included in recurrent neural network architectures. Work [8] (Volodymyr et al.) uses reinforcement learning as an alternative way to predict the attention point, which is closer to human attention; however, the reinforcement learning model cannot use backpropagation, so it is not end-to-end trainable and is therefore not widely used in NLP. In work [9] the authors use recurrent neural networks and an attention mechanism to generate grammar trees, and in work [10] the author uses an RNN model to read in text. Work [2] (Andrej et al.) presented a model that generates natural language descriptions of images and their regions; it combines Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. In Work [1] (Xu et al.) an attention mechanism is used in the generation of image captions: a convolutional neural network encodes the image, and a recurrent neural network with an attention mechanism generates the caption. By visualizing the attention weights, we can explain which part of the image the model is focusing on while generating the caption.
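Since the project plans to implement the "soft" variant described above, a minimal numpy sketch of how soft attention weights over spatial image features could be computed is given below for illustration only; the function name, parameter names, and shapes are assumptions, not code from the report or from Xu et al.

import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    # features: (L, D) array of L spatial feature vectors from the CNN
    # hidden:   (H,)   current hidden state of the RNN decoder
    # W_f, W_h, v: learned projections with assumed shapes (D, A), (H, A), (A,)
    # score each image location against the current decoder state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v       # shape (L,)
    # normalize the scores into attention weights that sum to 1 (softmax)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                          # shape (L,)
    # context vector: attention-weighted sum of the feature vectors
    context = weights @ features                               # shape (D,)
    return context, weights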

CHAPTER: 3

SRS/Algorithm

3.1 Purpose


Data Collection

There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.

But for the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Also, training a model with a large number of images may not be feasible on a system that is not a very high-end PC/laptop.

This dataset contains 8000 images, each with 5 captions (as we have already seen in the Introduction section, an image can have multiple captions, all being relevant simultaneously).

These images are bifurcated as follows:

• Training Set — 6000 images

• Dev Set — 1000 images

• Test Set — 1000 images


Understanding the data

One of the files is “Flickr8k.token.txt” which contains the name of each image
along with its 5 captions. We can read this file as follows:
# Below is the path of the file "Flickr8k.token.txt" on your disk
filename = "/dataset/TextFiles/Flickr8k.token.txt"
with open(filename, 'r') as file:
    doc = file.read()

The text file looks as follows:

101654506_8eb26cfb60.jpg#0 A brown and white dog is running through the snow .
101654506_8eb26cfb60.jpg#1 A dog is running in the snow
101654506_8eb26cfb60.jpg#2 A dog running through snow .
101654506_8eb26cfb60.jpg#3 a white and brown dog is running through a snow covered field .
101654506_8eb26cfb60.jpg#4 The white and brown dog is running over the surface of the snow .

Thus every line contains the <image name>#i <caption>, where 0≤i≤4

i.e. the name of the image, caption number (0 to 4) and the actual caption.


Now, we create a dictionary named “descriptions” which contains the name of the
image (without the .jpg extension) as keys and a list of the 5 captions for the
corresponding image as values.

descriptions = dict()
for line in doc.split('\n'):
    # split line by white space
    tokens = line.split()
    # skip empty lines
    if len(tokens) < 2:
        continue
    # take the first token as the image id, the rest as the description
    image_id, image_desc = tokens[0], tokens[1:]
    # remove the filename extension from the image id
    image_id = image_id.split('.')[0]
    # convert the description tokens back to a string
    image_desc = ' '.join(image_desc)
    if image_id not in descriptions:
        descriptions[image_id] = list()
    descriptions[image_id].append(image_desc)

For example, with reference to the sample above, the dictionary will look as follows:

descriptions['101654506_8eb26cfb60'] = ['A brown and white dog is running through the snow .', 'A dog is running in the snow', 'A dog running through snow .', 'a white and brown dog is running through a snow covered field .', 'The white and brown dog is running over the surface of the snow .']

Data Cleaning

When we deal with text, we generally perform some basic cleaning like lower-
casing all the words (otherwise“hello” and “Hello” will be regarded as two separate
words), removing special tokens (like ‘%’, ‘$’, ‘#’, etc.), eliminating words which
contain numbers (like ‘hey199’, etc.).

The below code does these basic cleaning steps:

import string

# prepare a translation table for removing punctuation
table = str.maketrans('', '', string.punctuation)
for key, desc_list in descriptions.items():
    for i in range(len(desc_list)):
        desc = desc_list[i]
        # tokenize
        desc = desc.split()
        # convert to lower case
        desc = [word.lower() for word in desc]
        # remove punctuation from each token
        desc = [w.translate(table) for w in desc]
        # drop single-character tokens
        desc = [word for word in desc if len(word) > 1]
        # drop tokens that contain numbers
        desc = [word for word in desc if word.isalpha()]
        # store back as a single string
        desc_list[i] = ' '.join(desc)

Create a vocabulary of all the unique words present across all the 8000*5 (i.e.
40000) image captions (corpus) in the data set :
vocabulary = set()
for key in descriptions.keys():
[vocabulary.update(d.split()) for d in descriptions[key]]
print('Original Vocabulary Size: %d' % len(vocabulary))
Original Vocabulary Size: 8763

This means we have 8763 unique words across all the 40000 image captions. We write all these captions along with their image names to a new file, namely "descriptions.txt", and save it on the disk.

However, if we think about it, many of these words will occur only a few times, say 1, 2 or 3 times. Since we are creating a predictive model, we would not like to have all the words in our vocabulary, but only the words which are more likely to occur, i.e., the common ones. This helps the model become more robust to outliers and make fewer mistakes.


Hence we consider only those words which occur at least 10 times in the entire
corpus. The code for this is below:

# Create a list of all the training captions
all_train_captions = []
for key, val in train_descriptions.items():
    for cap in val:
        all_train_captions.append(cap)

# Consider only words which occur at least 10 times in the corpus
word_count_threshold = 10
word_counts = {}
for sent in all_train_captions:
    for w in sent.split(' '):
        word_counts[w] = word_counts.get(w, 0) + 1

vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print('preprocessed words %d' % len(vocab))

Code to retain only those words which occur at least 10 times in the corpus


So now we have only 1651 unique words in our vocabulary. However, we will
append 0’s (zero padding explained later) and thus total words = 1651+1
= 1652 (one index for the 0)

Data Preprocessing — Images

Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.

We need to convert every image into a fixed sized vector which can then be fed as
input to the neural network. For this purpose, we opt for transfer learning by using
the InceptionV3 model (Convolutional Neural Network) created by Google
Research.

This model was trained on the ImageNet dataset to perform image classification on 1000 different classes of images. However, our purpose here is not to classify the image but just to obtain a fixed-length informative vector for each image. This process is called automatic feature engineering.

Hence, we simply remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image as follows:


Feature Vector Extraction (Feature Engineering) from InceptionV3

The code for this is as follows:


from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# Get the InceptionV3 model trained on ImageNet data
model = InceptionV3(weights='imagenet')
# Remove the last layer (the output softmax layer) from InceptionV3
model_new = Model(model.input, model.layers[-2].output)

Now, we pass every image to this model to get the corresponding 2048 length
feature vector as follows:
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Convert the image to size 299x299 as expected by the InceptionV3 model
img = image.load_img(image_path, target_size=(299, 299))
# Convert the PIL image to a 3-dimensional numpy array
x = image.img_to_array(img)
# Add one more dimension (the batch dimension)
x = np.expand_dims(x, axis=0)
# preprocess the image using preprocess_input() from the inception module
x = preprocess_input(x)
# pass the image through the truncated model to get the (1, 2048) feature vector
fea_vec = model_new.predict(x)
# reshape from (1, 2048) to (2048, )
fea_vec = np.reshape(fea_vec, fea_vec.shape[1])


We store all the bottleneck train features in a Python dictionary and save it on the disk as a pickle file, namely "encoded_train_images.pkl", whose keys are image names and whose values are the corresponding 2048-length feature vectors.

NOTE: This process might take an hour or two if you do not have a high end
PC/laptop.

Similarly we encode all the test images and save them in the file
“encoded_test_images.pkl”
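A minimal sketch of this encode-and-save loop is shown below; here encode(image_path) is assumed to wrap the preprocessing and prediction steps shown above, and train_img_paths is an assumed list of paths to the training images (neither name appears in the original report).

import pickle

encoding_train = {}
for img_path in train_img_paths:            # assumed list of training image paths
    img_name = img_path.split('/')[-1]      # use the file name as the dictionary key
    encoding_train[img_name] = encode(img_path)

# save the name -> 2048-length feature vector dictionary to disk
with open('encoded_train_images.pkl', 'wb') as f:
    pickle.dump(encoding_train, f)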

Data Preprocessing — Captions


We must note that captions are something that we want to predict. So during the
training period, captions will be the target variables (Y) that the model is learning to
predict.

But the prediction of the entire caption, given the image does not happen at once.
We will predict the caption word by word. Thus, we need to encode each word
into a fixed sized vector. However, this part will be seen later when we look at the
model design, but for now we will create two Python Dictionaries namely
“wordtoix” (pronounced — word to index) and “ixtoword” (pronounced — index to
word).


Stated simply, we will represent every unique word in the vocabulary by an integer (index). As seen above, we have 1652 words in the vocabulary, and thus each word will be represented by an integer index between 1 and 1652.

These two Python dictionaries can be used as follows:

wordtoix[‘abc’] -> returns index of the word ‘abc’

ixtoword[k] -> returns the word whose index is ‘k’

The code used is as below:


ixtoword = {}
wordtoix = {}
ix = 1
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

There is one more parameter that we need to calculate, i.e., the maximum length of a caption, and we do it as below:

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Max Description Length: %d' % max_length)

Max Description Length: 34

So the maximum length of any caption is 34.

Data Preparation using Generator Function

This is one of the most important steps in this case study. Here we will understand how to prepare the data in a form that is convenient to feed as input to the deep learning model.

Hereafter, I will try to explain the remaining steps using a small example, as follows:

Consider we have 3 images and their 3 corresponding captions as follows:

(Train image 1) Caption -> The black cat sat on grass


(Train image 2) Caption -> The white cat is walking on road

(Test image) Caption -> The black cat is walking on grass

Now, let’s say we use the first two images and their captions to train the model
and the third image to test our model.

Now the questions that will be answered are: how do we frame this as a supervised learning problem? What does the data matrix look like? How many data points do we have? And so on.


First, we need to convert both images to their corresponding 2048-length feature vectors, as discussed above. Let "Image_1" and "Image_2" be the feature vectors of the first two images, respectively.

Secondly, let’s build the vocabulary for the first two (train) captions by adding the
two tokens “startseq” and “endseq” in both of them: (Assume we have already
performed the basic cleaning steps)

Caption_1 -> “startseq the black cat sat on grass endseq”

Caption_2 -> “startseq the white cat is walking on road endseq”

vocab = {black, cat, endseq, grass, is, on, road, sat, startseq, the, walking, white}

Let’s give an index to each word in the vocabulary:

black -1, cat -2, endseq -3, grass -4, is -5, on -6, road -7, sat -8, startseq -9, the -10,
walking -11, white -12

Now let’s try to frame it as a supervised learning problem where we have a set of
data points D = {Xi, Yi}, where Xi is the feature vector of data point ‘i’ and Yi is
the corresponding target variable.


Let's take the first image vector Image_1 and its corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is the input and the caption is what we need to predict. But the way we predict the caption is as follows:

For the first step, we provide the image vector and the first word as input and try to predict the second word, i.e.:

Input = Image_1 + ‘startseq’; Output = ‘the’

Then we provide image vector and the first two words as input and try to predict the
third word, i.e.:

Input = Image_1 + ‘startseq the’; Output = ‘cat’

And so on…

Thus, we can summarize the data matrix for one image and its corresponding
caption as follows:


Data points corresponding to one image and its caption

It must be noted that one image and its caption do not form a single data point but multiple data points, depending on the length of the caption.

Similarly if we consider both the images and their captions, our data matrix will
then look as follows:


Data Matrix for both the images and captions

We must now understand that in every data point, it is not just the image which goes as input to the system, but also a partial caption, which helps to predict the next word in the sequence.

Since we are processing sequences, we will employ a Recurrent Neural Network to read these partial captions (more on this later).

However, we have already discussed that we are not going to pass the actual English text of the caption; rather, we are going to pass the sequence of indices, where each index represents a unique word.


Since we have already created an index for each word, let's now replace the words with their indices and see what the data matrix will look like:

Data matrix after replacing the words by their indices

Since we will be doing batch processing (explained later), we need to make sure that every sequence is of equal length. Hence we need to append 0's (zero padding) at the end of each sequence. But how many zeros should we append to each sequence?


Well, this is the reason we calculated the maximum length of a caption, which is 34 (if you remember). So we will append as many zeros as needed to bring every sequence to a length of 34.

The data matrix will then look as follows:

Appending zeros to each sequence to make them all of same length 34
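The report does not reproduce the generator code itself; below is a minimal sketch of a generator that implements the scheme described above, assuming the train_descriptions, encoded photo features, wordtoix, max_length and vocab_size objects built earlier (the batching details are an assumption, not the report's exact code).

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, photos, wordtoix, max_length, num_photos_per_batch):
    X1, X2, y = [], [], []
    n = 0
    while True:                                  # loop forever; the training loop decides when to stop
        for key, desc_list in descriptions.items():
            n += 1
            photo = photos[key + '.jpg']         # 2048-length feature vector for this image
            for desc in desc_list:
                # encode the caption as a sequence of word indices
                seq = [wordtoix[w] for w in desc.split(' ') if w in wordtoix]
                # one data point per (partial caption, next word) pair
                for i in range(1, len(seq)):
                    in_seq, out_word = seq[:i], seq[i]
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]            # zero padding
                    out_word = to_categorical([out_word], num_classes=vocab_size)[0]  # one-hot target
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_word)
            if n == num_photos_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = [], [], []
                n = 0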

Word Embeddings

As already stated above, we will map every word (index) to a 200-dimensional vector, and for this purpose we will use a pre-trained GloVe model:


import os
import numpy as np

# Load the GloVe vectors
glove_dir = 'dataset/glove'
embeddings_index = {}  # empty dictionary
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

Now, for all the 1652 unique words in our vocabulary, we create an embedding
matrix which will be loaded into the model before training.
embedding_dim = 200
vocab_size = len(vocab) + 1   # 1651 words + 1 for the zero-padding index = 1652

# Get the 200-dim dense vector for each word in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoix.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector

Model Architecture

Since the input consists of two parts, an image vector and a partial caption, we
cannot use the Sequential API provided by the Keras library. For this reason, we
use the Functional API which allows us to create Merge Models.

First, let’s look at the brief architecture which contains the high level sub-modules:


High level architecture

We define the model as follows:

Code to define the Model
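The model definition appears in the report only as a screenshot. A sketch of such a merge model in the Keras functional API, consistent with the description above, could look as follows; the 256-unit layer sizes and 0.5 dropout rates are assumptions, while the 2048-length image input, the 200-dimensional embeddings, max_length and vocab_size come from the preceding sections (the order in which the two input branches are declared is also a choice of this sketch).

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# image-feature branch: 2048-length InceptionV3 bottleneck vector
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# partial-caption branch: word indices -> GloVe embeddings -> LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# merge the two streams and predict a probability distribution over the vocabulary
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)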

Let’s look at the model summary:


Summary of the parameters in the model

The below plot helps to visualize the structure of the network and better understand
the two streams of input:


Flowchart of the architecture

The text in red on the right side contains the comments provided to help you map your understanding of the data preparation to the model architecture.

The LSTM (Long Short Term Memory) layer is nothing but a specialized
Recurrent Neural Network to process the sequence input (partial captions in our
case).

Recall that we had created an embedding matrix from a pre-trained Glove model
which we need to include in the model before starting the training:

model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

Notice that since we are using a pre-trained embedding layer, we need to freeze it
(trainable = False), before training the model, so that it does not get updated during
the backpropagation.

Finally we compile the model using the adam optimizer

model.compile(loss='categorical_crossentropy', optimizer='adam')

Finally, the weights of the model will be updated through the backpropagation algorithm, and the model will learn to output a word given an image feature vector and a partial caption. So, in summary, we have:

Input_1 -> Partial Caption

Input_2 -> Image feature vector

Output -> An appropriate word, next in the sequence of partial caption provided in
the input_1 (or in probability terms we say conditioned on image vector and the
partial caption)
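At inference time this next-word prediction is applied repeatedly until the end token is produced. The greedy-search sketch below is not part of the original report; it assumes the model, wordtoix, ixtoword and max_length objects defined above, and the same input ordering (image features first) used in the model sketch earlier.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(photo, model, wordtoix, ixtoword, max_length):
    # photo is a 2048-length feature vector for one image
    in_text = 'startseq'
    for _ in range(max_length):
        seq = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo.reshape(1, 2048), seq], verbose=0)
        word = ixtoword.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    # drop the 'startseq' token and return the generated caption
    return ' '.join(in_text.split()[1:])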

Hyper parameters during training:

The model was then trained for 30 epochs with the initial learning rate of 0.001
and 3 pictures per batch (batch size). However after 20 epochs, the learning rate
was reduced to 0.0001 and the model was trained on 6 pictures per batch.

This generally makes sense because during the later stages of training, since
the model is moving towards convergence, we must lower the learning rate so
that we take smaller steps towards the minima. Also increasing the batch size
over time helps your gradient updates to be more powerful.
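A sketch of how this two-phase schedule could be run with the data_generator sketched above is given below. The epoch counts, learning rates and batch sizes are taken from the text; recompiling the model to change the learning rate is an assumption of this sketch, not necessarily how the original training script did it.

from tensorflow.keras.optimizers import Adam

# phase 1: 20 epochs, learning rate 0.001, 3 pictures per batch
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))
number_pics_per_batch = 3
steps = len(train_descriptions) // number_pics_per_batch
for _ in range(20):
    generator = data_generator(train_descriptions, encoding_train, wordtoix, max_length, number_pics_per_batch)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

# phase 2: 10 more epochs, learning rate 0.0001, 6 pictures per batch
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001))
number_pics_per_batch = 6
steps = len(train_descriptions) // number_pics_per_batch
for _ in range(10):
    generator = data_generator(train_descriptions, encoding_train, wordtoix, max_length, number_pics_per_batch)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)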

CHAPTER: 4

System design document

A: State Diagram

B: DFD


C: Level 1 DFD


CHAPTER: 5
Data set

This chapter discusses the open-source datasets in this field and the methods used to evaluate generated sentences. Data, computational power, and algorithms are the three major elements of the current development of artificial intelligence; the three complement and enhance each other. It can be said that a good dataset can make an algorithm or model more effective. The image description task is similar to machine translation, and its evaluation methods extend from machine translation to form the field's own unique evaluation criteria.

5.1 Dataset

Data are the basis of artificial intelligence. People are increasingly discovering that many patterns that are otherwise difficult to find can be discovered from a large amount of data. For the image description generation task, there are currently rich and varied datasets, such as MSCOCO, Flickr8k, Flickr30k, PASCAL 1K, the AI Challenger Dataset, and STAIR Captions, and these are gradually becoming standard benchmarks. In these datasets, each image has five reference descriptions, and Table 2 summarizes the number of images in each dataset. In order to have multiple independent descriptions of each image, the datasets describe the same image with different phrasings.

Flickr8k/Flickr30k [81, 82]. The Flickr8k images come from Yahoo's photo album site Flickr; the dataset contains 8,000 photos: 6,000 images for training, 1,000 for validation, and 1,000 for testing. Flickr30k contains 31,783 images collected from the Flickr website, mostly depicting humans participating in an event. The corresponding manual label for each image is still 5 sentences.

Using this dataset, we proceed as follows.

Loading the training set

The text file “Flickr_8k.trainImages.txt” contains the names of the images that
belong to the training set. So we load these names into a list “train”.
filename = 'dataset/TextFiles/Flickr_8k.trainImages.txt'
doc = load_doc(filename)
train = list()
for line in doc.split('\n'):
    # skip empty lines
    if len(line) < 1:
        continue
    identifier = line.split('.')[0]
    train.append(identifier)
print('Dataset: %d' % len(train))

Dataset: 6000

Thus we have separated the 6000 training images in the list named “train”.

Now, we load the descriptions of these images from “descriptions.txt” (saved on the
hard disk) in the Python dictionary “train_descriptions”.

However, when we load them, we will add two tokens in every caption as follows
(significance explained later):

‘startseq’ -> This is a start sequence token which will be added at the start of every
caption.

‘endseq’ -> This is an end sequence token which will be added at the end of every
caption.

doc = load_doc('descriptions.txt')
train_descriptions = dict()
for line in doc.split('\n'):
    # split line by white space
    tokens = line.split()
    if len(tokens) < 2:
        continue
    # split the image id from the description
    image_id, image_desc = tokens[0], tokens[1:]
    # skip images not in the training set
    if image_id in train:
        if image_id not in train_descriptions:
            train_descriptions[image_id] = list()
        # wrap the description in start/end tokens
        desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
        # store
        train_descriptions[image_id].append(desc)

print('Descriptions: train=%d' % len(train_descriptions))
# Descriptions: train=6000


CHAPTER: 6

Snapshot of Forms



CHAPTER: 7

Snapshot of Results Evaluation




CHAPTER: 8

Discussion And Future Work

Many deep learning-based methods have been proposed in recent years for generating automatic image captions. Supervised learning, reinforcement learning, and
GAN based methods are commonly used in generating image captions. Both visual
space and multimodal space can be used in supervised learning-based methods. The
main difference between visual space and multimodal space occurs in mapping.
Visual space-based methods perform explicit mapping from images to descriptions.
In contrast, multimodal space-based methods incorporate implicit vision and
language models. Supervised learning-based methods are further categorized into
Encoder-Decoder architecture-based, Compositional architecture-based, Attention-
based, Semantic concept-based, Stylized captions, Dense image captioning, and
Novel object-based image captioning. Encoder-Decoder architecture-based methods
use a simple CNN and a text generator for generating image captions. Attention-
based image captioning methods focus on different salient parts of the image and
achieve better performance than encoder-decoder architecture-based methods.
Semantic concept-based image captioning methods selectively focus on different
parts of the image and can generate semantically rich captions. Dense image
captioning methods can generate region based image captions. Stylized image
captions express various emotions such as romance, pride, and shame. GAN and RL
based image captioning methods can generate diverse and multiple captions.
MSCOCO, Flickr30k and Flickr8k are common and popular datasets used for image captioning. The MSCOCO dataset is very large, and all the images in these datasets have multiple captions. The Visual Genome dataset is mainly used for region-based image captioning. Different evaluation metrics are used for measuring the quality of image captions. The BLEU metric is good for evaluating short sentences.
ROUGE has different types and they can be used for evaluating different types of
texts. METEOR can perform an evaluation on various segments of a caption. SPICE
is better in understanding semantic details of captions compared to other evaluation
metrics. Although success has been achieved in recent years, there is still a large
scope for improvement. Generation based methods can generate novel captions for
every image. However, these methods fail to detect prominent objects and attributes
and their relationships to some extent in generating accurate and multiple captions.
In addition to this, the accuracy of the generated captions largely depends on producing syntactically correct and diverse captions, which in turn relies on a powerful and sophisticated language generation model. Existing methods show their performance
on the datasets where images are collected from the same domain. Therefore,
working on open domain dataset will be an interesting avenue for research in this
area. Image-based factual descriptions are not enough to generate high-quality
captions. External knowledge can be added in order to generate attractive image
captions. Supervised learning needs a large amount of labelled data for training.

8.1 References


1. https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
2. https://arxiv.org/abs/1411.4555
3. https://arxiv.org/abs/1703.09137
4. https://arxiv.org/abs/1708.02043
5. https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
6. https://www.youtube.com/watch?v=yk6XDFm3J2c
7. https://www.appliedaicourse.com/
