A Project Report on
Image Captioning
Submitted to
Mrs. Arpana Saxena
Ajay Kumar Garg Engineering College (MCA), Ghaziabad
Submitted By:
Aman Singh - 1900270140002
Shashank Saxena - 1900270140010
TABLE OF CONTENTS
Abstract
Acknowledgements
Keywords
List of Abbreviations
1.1 Preface
Chapter 3 - SRS/Algorithm
Abstract:
• Image Captioning is the process of generating a textual description of an image.
• It uses both Natural Language Processing and Computer Vision to generate the captions.
• The dataset is in the form [image → captions]: it consists of input images and their corresponding output captions.
• This process has many potential applications in real life. A noteworthy one would be to save the caption of an image so that it can be retrieved easily at a later stage just on the basis of this description.
Tools Used:
• Python 3.7
• PyCharm
• Google Colab
• Jupyter Notebook
• Anaconda
Technologies:
• Python
• TensorFlow
• Flask
• scikit-learn
• Computer Vision
• Natural Language Processing
Acknowledgements:
Several approaches have been made to solve this task. One of the most notable works was put forward by Andrej Karpathy, now Director of AI at Tesla, in his Ph.D. at Stanford. In this report, we discuss the most widely used and well-known approaches proposed as solutions to this problem. We also look at a Python demo example on the Flickr dataset.
If we are told to describe it, maybe we will describe it as "A puppy on a blue towel" or "A brown dog playing with a green ball". So, how are we doing this? While forming the description, we are seeing the image, but at the same time we are trying to create a meaningful sequence of words. The first part is handled by CNNs and the second by RNNs. If we can obtain a suitable dataset with images and their corresponding human descriptions, we can train networks to automatically caption images. FLICKR 8K, FLICKR 30K, and MS-COCO are some of the most widely used datasets for this purpose.
Table of Contents
1. Introduction
   1.1 Preface
   1.2 Problem Description and Motivation
3. SRS/Algorithm
5. Data Set
CHAPTER: 1
INTRODUCTION
Well, some of you might say "A white dog in a grassy area", some may say "White dog with brown spots", and yet others might say "A dog on grass and some pink flowers".
All of these captions are certainly relevant for this image, and there may be others as well. But the point to note is that it is so easy for us, as human beings, to just glance at a picture and describe it in an appropriate language. Even a 5-year-old could do this with utmost ease.
But can you write a computer program that takes an image as input and produces a relevant caption as output?
The Problem
Just prior to the recent development of Deep Neural Networks, this problem was inconceivable even to the most advanced researchers in Computer Vision. But with the advent of Deep Learning, this problem can be solved quite easily, provided we have the required dataset.
This problem was well researched by Andrej Karpathy in his PhD thesis at Stanford [1]; he is now the Director of AI at Tesla.
The purpose of this report is to explain, in as simple words as possible, how Deep Learning can be used to solve the problem of generating a caption for a given image, hence the name Image Captioning.
To get a better feel for this problem, I strongly recommend using the state-of-the-art system created by Microsoft called Caption Bot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.
1.1 Preface
Motivation:
We must first understand how important this problem is in real-world scenarios. Let's see a few applications where a solution to this problem can be very useful.
• Aid to the blind — We can create a product for the blind that guides them while travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, where it is shown how Nvidia Research is trying to create such a product.
• CCTV cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions, then we can raise alarms as soon as some malicious activity is going on somewhere. This could probably help reduce some crimes and/or accidents.
CHAPTER: 2
For example, for the picture on the right-hand side, we can describe it as "A man is trying to murder his cs231n partner with a clipper". Attention helps us to determine the relationships between the objects.
2.1 Related Work
Work[3] (Szegedy et al.) proposed a deep convolutional neural network architecture codenamed Inception. The main hallmark of this architecture is the improved utilization of the computing resources inside the network. For example, our project tried to use the layers "inception3b" and "inception4b" to get captions and attention, because features learned from the lower layers can contain more accurate information about the correlation between words in a caption and specific locations in the image. Another work presented a generative model based on a deep recurrent architecture that combines advances in computer vision and machine translation and can be used to generate natural sentences describing an image; the model is trained to maximize the likelihood of the target description sentence given the training image. Work[5] (Jeff et al.) introduced a model based on deep convolutional networks that performed very well on image interpretation tasks. Their recurrent convolutional model and long-term RNN models are suitable for large-scale visual learning, are end-to-end trainable, and demonstrate the value of such models on benchmark video recognition tasks.
The attention mechanism has a long history, especially in image recognition; related work includes work[6] and work[7] (Larochelle et al.). But until recently, attention was not incorporated into recurrent neural network architectures. Work[8] (Volodymyr et al.) uses reinforcement learning as an alternative way to predict the attention point, which is closer to how human attention works. However, a reinforcement learning model cannot use backpropagation and is therefore not end-to-end trainable, so it is not widely used in NLP. In work[9] the authors use a recurrent network and an attention mechanism to generate a grammar tree. In work[10] the authors use an RNN model to read in text. Work[2] (Andrej et al.) presented a model that generates natural language descriptions of images and their regions. They combined Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. In Work[1] (Xu et al.) an attention mechanism is used in the generation of image captions. They use a convolutional neural network to encode the image and a recurrent neural network with an attention mechanism to generate the caption. By visualizing the attention weights, we can explain which part of the image the model is focusing on while generating the caption.
CHAPTER: 3
SRS / Algorithm
3.1 Purpose
Data Collection
There are many open-source datasets available for this problem, such as Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
For the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Also, training a model with a large number of images may not be feasible on a system that is not a very high-end PC/laptop.
This dataset contains 8000 images, each with 5 captions (as we have already seen in the Introduction section, an image can have multiple captions, all being relevant simultaneously).
One of the files is “Flickr8k.token.txt” which contains the name of each image
along with its 5 captions. We can read this file as follows:
# Below is the path for the file "Flickr8k.token.txt" on your disk
filename = "/dataset/TextFiles/Flickr8k.token.txt"
with open(filename, 'r') as file:
    doc = file.read()
Thus every line contains <image name>#i <caption>, where 0 ≤ i ≤ 4, i.e. the name of the image, the caption number (0 to 4), and the actual caption.
Now, we create a dictionary named "descriptions" which contains the names of the images (without the .jpg extension) as keys and, as values, a list of the 5 captions for the corresponding image.
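The code for building this dictionary is not reproduced in the report; a minimal sketch of this step, parsing the doc string loaded above (the exact parsing details are an assumption), could look like this:

descriptions = dict()
for line in doc.split('\n'):
    # each line looks like: <image name>.jpg#<i> <caption>
    tokens = line.split()
    if len(tokens) < 2:
        continue
    image_id, image_desc = tokens[0], tokens[1:]
    # drop the .jpg extension and the #i caption index from the image name
    image_id = image_id.split('.')[0]
    # re-join the caption tokens into a single string
    image_desc = ' '.join(image_desc)
    if image_id not in descriptions:
        descriptions[image_id] = list()
    descriptions[image_id].append(image_desc)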
(Sample output: the value stored for one image is a list of five captions, ending with '… brown dog is running through a snow covered field .' and 'The white and brown dog is running over the surface of the snow .')
Data Cleaning
When we deal with text, we generally perform some basic cleaning, like lower-casing all the words (otherwise "hello" and "Hello" will be regarded as two separate words), removing special tokens (like '%', '$', '#', etc.), and eliminating words which contain numbers (like 'hey199', etc.).
import string

# prepare a translation table for removing punctuation
table = str.maketrans('', '', string.punctuation)

# tokenize the caption
desc = desc.split()
# convert to lower case
desc = [word.lower() for word in desc]
# remove punctuation from each token
desc = [w.translate(table) for w in desc]
# keep only alphabetic tokens (drops words containing numbers)
desc = [w for w in desc if w.isalpha()]
Create a vocabulary of all the unique words present across all the 8000*5 (i.e. 40000) image captions (corpus) in the dataset:
vocabulary = set()
for key in descriptions.keys():
    [vocabulary.update(d.split()) for d in descriptions[key]]
print('Original Vocabulary Size: %d' % len(vocabulary))
Original Vocabulary Size: 8763
This means we have 8763 unique words across all the 40000 image captions. We write all these captions along with their image names to a new file, namely "descriptions.txt", and save it on the disk.
However, if we think about it, many of these words will occur only a few times, say 1, 2 or 3 times. Since we are creating a predictive model, we would not like to have all the words in our vocabulary, but only the words which are more likely to occur, i.e. the common ones. This helps the model become more robust to outliers and make fewer mistakes.
Hence we consider only those words which occur at least 10 times in the entire
corpus. The code for this is below:
Code to retain only those words which occur at least 10 times in the corpus
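Since the original listing is a screenshot that is not reproduced here, the following is a minimal sketch of the filtering step; the threshold variable and the use of the train_descriptions dictionary (introduced later in the Data Set chapter) are assumptions based on the surrounding text:

word_count_threshold = 10
word_counts = {}
for key in train_descriptions.keys():
    for sent in train_descriptions[key]:
        for w in sent.split(' '):
            word_counts[w] = word_counts.get(w, 0) + 1

# keep only the words that occur at least 10 times
vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print('Vocabulary size after filtering: %d' % len(vocab))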
So now we have only 1651 unique words in our vocabulary. However, we will append 0's (zero padding, explained later), and thus the total number of words = 1651 + 1 = 1652 (one index for the 0).
Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed-sized vector which can then be fed as input to the neural network. For this purpose, we opt for transfer learning using the InceptionV3 model (a Convolutional Neural Network) created by Google Research.
Hence, we just remove the last softmax layer from the model and extract a 2048-length vector (the bottleneck features) for every image as follows:
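The extraction code itself is not reproduced in the report; a minimal sketch of this step with Keras (the variable name model_new is an assumption) would be:

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet
base_model = InceptionV3(weights='imagenet')
# Drop the final softmax layer and keep the 2048-dimensional bottleneck output
model_new = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)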
Now, we pass every image to this model to get the corresponding 2048 length
feature vector as follows:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import preprocess_input
import numpy as np

# Convert all the images to size 299x299 as expected by the InceptionV3 model
img = image.load_img(image_path, target_size=(299, 299))
# Convert the PIL image to a 3-dimensional numpy array
x = image.img_to_array(img)
# Add one more dimension (the batch axis)
x = np.expand_dims(x, axis=0)
# Preprocess the image using preprocess_input() from the inception module
x = preprocess_input(x)
# Pass the image through the truncated model (model_new above) to get the (1, 2048) feature vector
x = model_new.predict(x)
# Reshape from (1, 2048) to (2048, )
x = np.reshape(x, x.shape[1])
We save all the bottleneck train features in a Python dictionary and save it on the disk as a pickle file, namely "encoded_train_images.pkl", whose keys are image names and values are the corresponding 2048-length feature vectors.
NOTE: This process might take an hour or two if you do not have a high-end PC/laptop.
Similarly, we encode all the test images and save them in the file "encoded_test_images.pkl".
But the prediction of the entire caption, given the image, does not happen at once. We predict the caption word by word. Thus, we need to encode each word into a fixed-sized vector. This part will be covered when we look at the model design, but for now we will create two Python dictionaries, namely "wordtoix" (pronounced "word to index") and "ixtoword" (pronounced "index to word").
Stated simply, we will represent every unique word in the vocabulary by an integer (index). As seen above, we have 1651 unique words in the corpus plus the padding index, and thus each word will be represented by an integer index between 1 and 1651 (index 0 being reserved for padding).
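A minimal sketch of these two dictionaries, assuming the filtered word list vocab from the sketch above, could be:

ixtoword = {}
wordtoix = {}
ix = 1  # index 0 is reserved for zero padding
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

vocab_size = len(ixtoword) + 1  # 1651 words + 1 for the padding index = 1652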
There is one more parameter that we need to calculate, i.e., the maximum length of
a caption and we do it as below:
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# determine the maximum sequence length
max_length = max_length(train_descriptions)
This is one of the most important steps in this case study. Here we will understand how to prepare the data in a form that is convenient to feed as input to the deep learning model.
Hereafter, I will try to explain the remaining steps using a small worked example, as follows:
Now, let's say we use the first two images and their captions to train the model and the third image to test our model.
The questions that will now be answered are: how do we frame this as a supervised learning problem? What does the data matrix look like? How many data points do we have? And so on.
First, we need to convert both images to their corresponding 2048-length feature vectors as discussed above. Let "Image_1" and "Image_2" be the feature vectors of the first two images respectively.
Secondly, let's build the vocabulary for the first two (train) captions by adding the two tokens "startseq" and "endseq" to both of them (assume we have already performed the basic cleaning steps):
vocab = {black, cat, endseq, grass, is, on, road, sat, startseq, the, walking, white}
black - 1, cat - 2, endseq - 3, grass - 4, is - 5, on - 6, road - 7, sat - 8, startseq - 9, the - 10, walking - 11, white - 12
Now let's try to frame it as a supervised learning problem where we have a set of data points D = {Xi, Yi}, where Xi is the feature vector of data point 'i' and Yi is the corresponding target variable.
Let's take the first image vector Image_1 and its corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is the input and the caption is what we need to predict. But we predict the caption as follows:
First, we provide the image vector and the first word as input and try to predict the second word.
Then we provide the image vector and the first two words as input and try to predict the third word.
And so on...
Thus, we can summarize the data matrix for one image and its corresponding caption as follows:
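Written out for Image_1 and the caption above, the (input, target) pairs look like this (partial caption on the left, next word to predict on the right):

Image_1 + "startseq"                            -> "the"
Image_1 + "startseq the"                        -> "black"
Image_1 + "startseq the black"                  -> "cat"
Image_1 + "startseq the black cat"              -> "sat"
Image_1 + "startseq the black cat sat"          -> "on"
Image_1 + "startseq the black cat sat on"       -> "grass"
Image_1 + "startseq the black cat sat on grass" -> "endseq"

So this single image with its caption expands into seven data points.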
It must be noted that one image+caption pair is not a single data point but multiple data points, depending on the length of the caption.
Similarly, if we consider both images and their captions, our data matrix will then look as follows:
We must now understand that in every data point, it's not just the image which goes as input to the system, but also a partial caption, which helps to predict the next word in the sequence.
However, as already discussed, we are not going to pass the actual English text of the caption; rather, we are going to pass the sequence of indices, where each index represents a unique word.
Since we have already created an index for each word, let's now replace the words with their indices and understand what the data matrix will look like.
Since we will be doing batch processing (explained later), we need to make sure that every sequence is of equal length. Hence we need to append 0's (zero padding) at the end of each sequence. But how many zeros should we append to each sequence?
Well, this is the reason we calculated the maximum length of a caption, which is 34 (if you remember). So we append as many zeros as needed to make every sequence 34 tokens long.
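As a small sketch of this padding step (assuming Keras is used; the example indices follow the toy mapping given earlier, and 'post' padding matches appending zeros at the end):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# hypothetical partial caption "startseq the black cat" encoded with the toy
# word indices from above (startseq-9, the-10, black-1, cat-2)
seq = [9, 10, 1, 2]
# append zeros so that the sequence length becomes 34
padded = pad_sequences([seq], maxlen=34, padding='post')[0]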
Word Embeddings
As already stated above, we will map every word (index) to a 200-dimensional vector, and for this purpose we will use a pre-trained GloVe model:
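The loading code is not shown in the report; a minimal sketch (the file name glove.6B.200d.txt is an assumption) is:

import numpy as np

# load the 200-dimensional GloVe vectors into a dictionary: word -> vector
embeddings_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %d word vectors.' % len(embeddings_index))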
Now, for all the 1652 unique words in our vocabulary, we create an embedding
matrix which will be loaded into the model before training.
embedding_dim = 200

# Get a 200-dim dense vector for each of the words in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoix.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector
Model Architecture
Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API, which allows us to create merge models.
First, let's look at the overall architecture, which contains the high-level sub-modules:
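The architecture diagram itself is not reproduced here; the following is a hedged sketch of such a merge model with the Keras Functional API (the layer width of 256 and the dropout rate of 0.5 are assumptions, not values taken from the report):

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# image feature branch: the 2048-length bottleneck vector
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# partial caption branch: sequence of word indices -> embedding -> LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# decoder: merge the two branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)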
The plot below helps to visualize the structure of the network and better understand the two streams of input:
The text in red on the right-hand side of the plot provides comments to map your understanding of the data preparation to the model architecture.
The LSTM (Long Short-Term Memory) layer is nothing but a specialized Recurrent Neural Network that processes the sequence input (partial captions in our case).
Recall that we created an embedding matrix from the pre-trained GloVe model, which we need to load into the model before starting the training:
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False
Notice that since we are using a pre-trained embedding layer, we need to freeze it (trainable = False) before training the model, so that it does not get updated during backpropagation.
model.compile(loss='categorical_crossentropy', optimizer='adam')
Output -> An appropriate word, the next one in the sequence of the partial caption provided as input (or, in probability terms, conditioned on the image vector and the partial caption).
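For a single data point, obtaining this output word from the model could look like the following fragment (photo and seq are the image feature vector and the padded index sequence from the earlier steps; the names are assumptions):

import numpy as np

# photo: (1, 2048) image feature vector; seq: (1, max_length) padded index sequence
yhat = model.predict([photo, seq], verbose=0)  # softmax scores over the vocabulary
next_word = ixtoword[int(np.argmax(yhat))]     # pick the most probable next word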
The model was then trained for 30 epochs with an initial learning rate of 0.001 and 3 pictures per batch (batch size). However, after 20 epochs, the learning rate was reduced to 0.0001 and the model was trained on 6 pictures per batch.
This generally makes sense because during the later stages of training, as the model moves towards convergence, we must lower the learning rate so that we take smaller steps towards the minima. Increasing the batch size over time also makes the gradient updates less noisy.
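A hedged sketch of this two-stage schedule (the training loop itself is not shown in the report, so the fit calls are indicated only as comments; the use of Adam and K.set_value is an assumption):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K

# Stage 1: first 20 epochs, learning rate 0.001, 3 pictures per batch
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))
# model.fit(...) with batches of 3 pictures for 20 epochs

# Stage 2: remaining 10 epochs, learning rate 0.0001, 6 pictures per batch
K.set_value(model.optimizer.learning_rate, 0.0001)
# model.fit(...) with batches of 6 pictures for 10 more epochs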
CHAPTER: 4
A: State Diagram
B: DFD
C: Level 1 DFD
CHAPTER: 5
Data set
5.1 Dataset
Data are the basis of artificial intelligence. People are increasingly discovering that many patterns which are difficult to find directly can be discovered from a large amount of data. For the image description generation task, there are currently several rich datasets, such as MSCOCO, Flickr8k, Flickr30k, PASCAL 1K, the AI Challenger Dataset, and STAIR Captions, and evaluating on them has gradually become the norm. In these datasets, each image has five reference descriptions, and Table 2 summarizes the number of images in each dataset. In order to have multiple independent descriptions of each image, the datasets use different phrasings to describe the same image.
Flickr8k/Flickr30k [81, 82]: Flickr8k images come from Yahoo's photo album site Flickr; the dataset contains 8,000 photos, with 6,000 images for training, 1,000 for validation, and 1,000 for testing. Flickr30k contains 31,783 images collected from the Flickr website.
The text file “Flickr_8k.trainImages.txt” contains the names of the images that
belong to the training set. So we load these names into a list “train”.
filename = 'dataset/TextFiles/Flickr_8k.trainImages.txt'
doc = load_doc(filename)
train = list()
for line in doc.split('\n'):
    # skip empty lines
    if len(line) < 1:
        continue
    # the image identifier is the file name without the extension
    identifier = line.split('.')[0]
    train.append(identifier)
print('Dataset: %d' % len(train))
Dataset: 6000
Thus we have separated the 6000 training images in the list named “train”.
Now, we load the descriptions of these images from "descriptions.txt" (saved on the hard disk) into the Python dictionary "train_descriptions".
However, when we load them, we will add two tokens to every caption as follows (their significance is explained later):
'startseq' -> This is a start sequence token which will be added at the start of every caption.
‘endseq’ -> This is an end sequence token which will be added at the end of every
caption.
doc = load_doc('descriptions.txt')
train_descriptions = dict()
for line in doc.split('\n'):
    # split line by white space
    tokens = line.split()
    # skip empty lines
    if len(tokens) < 2:
        continue
    # first token is the image id, the rest is the description
    image_id, image_desc = tokens[0], tokens[1:]
    # keep only the images that belong to the training set
    if image_id in train:
        if image_id not in train_descriptions:
            train_descriptions[image_id] = list()
        # wrap the description in the start and end sequence tokens
        desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
        # store
        train_descriptions[image_id].append(desc)
print('Descriptions: train=%d' % len(train_descriptions))
# Descriptions: train=6000
CHAPTER: 6
Snapshot of Forms
CHAPTER: 7
CHAPTER: 8
Many deep learning-based methods have been proposed for generating automatic image captions in recent years. Supervised learning, reinforcement learning, and GAN-based methods are commonly used to generate image captions. Both visual space and multimodal space can be used in supervised learning-based methods; the main difference between the two lies in the mapping. Visual space-based methods perform explicit mapping from images to descriptions, whereas multimodal space-based methods incorporate implicit vision and language models. Supervised learning-based methods are further categorized into Encoder-Decoder architecture-based, Compositional architecture-based, Attention-based, Semantic concept-based, Stylized captions, Dense image captioning, and Novel object-based image captioning. Encoder-Decoder architecture-based methods use a simple CNN and a text generator to produce image captions. Attention-based image captioning methods focus on different salient parts of the image and achieve better performance than encoder-decoder architecture-based methods. Semantic concept-based image captioning methods selectively focus on different parts of the image and can generate semantically rich captions. Dense image captioning methods can generate region-based image captions. Stylized image captions express various emotions such as romance, pride, and shame. GAN- and RL-based image captioning methods can generate diverse and multiple captions.
MSCOCO, Flickr30k, and Flickr8k are common and popular datasets used for image captioning. MSCOCO is a very large dataset, and all the images in these datasets have multiple captions. The Visual Genome dataset is mainly used for region-based image captioning. Different evaluation metrics are used for measuring the performance of image captioning methods. The BLEU metric is good for evaluating short sentences. ROUGE has different variants which can be used for evaluating different types of text. METEOR can perform an evaluation on various segments of a caption. SPICE is better at capturing the semantic details of captions than the other evaluation metrics. Although success has been achieved in recent years, there is still large scope for improvement. Generation-based methods can generate novel captions for every image; however, to some extent these methods fail to detect prominent objects and attributes and their relationships, which limits the generation of accurate and multiple captions. In addition, the accuracy of the generated captions largely depends on syntactically correct and diverse captions, which in turn rely on a powerful and sophisticated language generation model. Existing methods demonstrate their performance on datasets where the images are collected from the same domain; therefore, working on open-domain datasets will be an interesting avenue for research in this area. Image-based factual descriptions are not enough to generate high-quality captions; external knowledge can be added in order to generate more attractive image captions. Supervised learning needs a large amount of labelled data for training.
8.1 References
1. https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
2. https://arxiv.org/abs/1411.4555
3. https://arxiv.org/abs/1703.09137
4. https://arxiv.org/abs/1708.02043
5. https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
6. https://www.youtube.com/watch?v=yk6XDFm3J2c
7. https://www.appliedaicourse.com/