
IMAGE CAPTION GENERATOR SYSTEM

A Project Report

Submitted in Partial Fulfillment for the Award of Degree of

Bachelor of Technology

In Computer Science Engineering

Submitted By:

Anjana U Ajjanna (U03NM21T006007)

Under the Guidance of

Dr. THRIVENI J

Professor & Chairperson

Dept of CSE, UVCE

Department of Computer Science Engineering

University Visvesvaraya College of Engineering


K. R. Circle, Bengaluru-560 001
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

UNIVERSITY VISVESVARAYA COLLEGE OF ENGINEERING

BANGALORE-560 001

CERTIFICATE
This is to certify that the project report entitled "IMAGE CAPTION GENERATOR SYSTEM" is a bonafide work carried out by Anjana U Ajjanna (U03NM21T006007) of the Department of Computer Science Engineering, University Visvesvaraya College of Engineering, K. R. Circle, Bengaluru, in partial fulfillment for the award of the degree of Bachelor of Technology in Computer Science Engineering of Bangalore University during the academic year 2024-2025.

Proctor & Chairperson:

Dr. Thriveni J
Professor & Chairperson
Department of CSE
UVCE, Bangalore -560001

Examiner 1                                        Examiner 2
ACKNOWLEDGEMENT

I would like to express my profound gratitude to Dr. Thriveni J, Head of the Department of Computer
Science and Engineering, for her invaluable guidance and constant encouragement throughout the course
of this project.

Her depth of knowledge, critical insights, and constructive suggestions provided a solid foundation and
clarity in addressing the complexities of this work. Her leadership and support were instrumental in
ensuring the successful execution of this project.

I am deeply indebted to the esteemed faculty members of the Department of Computer Science and
Engineering for their unwavering support and expertise. Their commitment to fostering an environment
of learning and innovation has been a source of inspiration. Their feedback during the various phases of
this project, from conceptualization to implementation, has helped me refine my approach and achieve
the desired outcomes.

I also extend my heartfelt thanks to my peers and colleagues who provided moral support and shared
invaluable insights during brainstorming sessions. Their collaborative spirit and encouragement were
vital in overcoming challenges and maintaining momentum throughout the project. Additionally, I
acknowledge the contributions of all external sources, including industry experts and researchers, whose
works have enriched this project.

Lastly, I am deeply grateful to my family and friends for their unwavering support and understanding.
Their constant encouragement and belief in my abilities have been the bedrock of my perseverance and
determination. Without their love and patience, the completion of this project would not have been
possible.

Anjana U Ajjanna (U03NM21T006007)


ABSTRACT

The combination of computer vision and natural language processing in artificial intelligence
has attracted considerable research interest in recent years, thanks to the advent of deep learning.
The goal is to automatically describe the content of a photograph in English: when a picture is
captioned, the computer learns to interpret the visual information of the image using one or more
sentences. Generating a meaningful description of high-level image semantics requires the ability
to analyse the state and properties of the objects in an image and the relationships between them.
In this work we use a CNN-LSTM architecture to caption an image, with the aim of detecting
objects and informing people through text messages. To identify the objects, the input image is
first reduced to grayscale and then processed by a Convolutional Neural Network (CNN). The
COCO 2017 dataset was used. The proposed method, aimed at blind individuals, is intended to be
extended to deliver the generated captions as speech messages to persons with vision loss, helping
them reach their full potential. In this project we follow a variety of important concepts of image
captioning and its standard processes, and develop a generative CNN-LSTM model that
outperforms human baselines.
TABLE OF CONTENTS

TITLE Page No

1. INTRODUCTION 1

1.1 Introduction to image captioning 1


1.2 Background 2
1.3 Problem Statement 2
1.4 Objective 3
1.5 Motivation 3

2. Literature Review 4

2.1 Image Captioning Methods 4

2.2 Deep learning based image captioning methods 5

2.3 Supervised learning vs other deep learning methods 6

2.4 Dense captioning vs caption for whole scene 7

2.5 Encoder-decoder architecture vs compositional architecture 8

2.6 LSTM vs others 8

2.7 Research paper 9

3. System Requirements Specification 14

3.1 Hardware requirements 14

3.2 Software Requirements 15

3.3 Libraries 15

3.4 Development tools 16

3.5 Programming Languages and Frameworks 16

3.6 Functional Requirements 17

3.7 Non-functional Requirements 18


4. Architecture 20

4.1 Introduction 20

4.2 Working principles 20

4.3 Models used 20

4.4 Overview of CNN 21

4.5 Overview of LSTM 24

4.6 CNN-LSTM Architecture Model 28

5. Methodology and Implementation 30

5.1 Methodology 30

5.2 Dataset Used 31

5.3 Image data Preparation 31

5.4 Caption Data preparation 32

5.5 Implementation details 32

5.6 Dataflow diagram 33

5.7 Comparison of CNN and LSTM models 34

6. Results and analysis 35

7. Conclusion and future enhancements 39

7.1 Summary 39

7.2 Conclusion 39

7.3 Future Enhancements 40

References 41
LIST OF FIGURES

Figure No.          FIGURE NAME                                   PAGE NO.

4.4 CNN 24

4.5 LSTM Cell 26

4.5 LSTM 28

4.6 CNN - LSTM Model 29

5.1 System Architecture 30

5.3 Feature extraction in images using VGG 32

5.5 Dataflow Diagram 34

6.1 Code Implementation 35

6.2 Uploading Input 35

6.3 Predicted vs actual results 36

6.4 Predicted Result 36

6.5 Output of Training Model 37

6.6 Model accuracy validation 37

6.7 Model Precision and recall analysis 38

6.8 Confusion matrix of captions 38


CHAPTER 1
INTRODUCTION

1.1 Introduction to image captioning.


Every day we are surrounded by photos in our environment, on social media, and in
the news. Humans can recognize photographs without any assigned captions, but machines must
first be taught to interpret images. The encoder-decoder architecture of image caption generator
models uses input vectors to generate valid and acceptable captions. This paradigm connects the
worlds of natural language processing and computer vision: the task is to recognize and evaluate
the context of an image and then describe everything in a natural language such as English.

Our approach is based on two basic models: CNN (Convolutional Neural Network) and
LSTM (Long Short-Term Memory). In the derived application, the CNN is utilized as an encoder
to extract features from the snapshot or image, and the LSTM is used as a decoder to arrange the
words and generate captions. Image captioning can help with a variety of things, such as
assisting the visionless with text-to-speech through real-time feedback about the scene from a
camera feed, and enriching social media use by generating captions for photos in social feeds as
well as spoken messages.

Assisting children in recognizing objects is a step toward learning the language.


Captions for every photograph on the internet can lead to faster and more accurate image
search and indexing. Image captioning is used in a variety of sectors, including biology, business,
the internet, and applications such as self-driving cars, where it could describe the scene around
the car, and CCTV cameras, where alarms could be raised if any malicious activity is observed.
The main purpose of this research article is to gain a basic understanding of deep learning
methodologies.


1.2 Background
Our project can be extended to and used in both large-scale and small-scale business
industries. A caption appears next to the image, identifies or describes the image, and credits the
source. There is no standard format for captions; a good caption points out any aspects of the
image that are noteworthy or relevant.

1.3 Problem Statement


In our world, information is considered valuable, yet some people face a serious problem
in visualizing an image. We therefore dig into this matter, considering blindness as a major
factor, and generate a sentence by allowing users to upload or scan a visual image. Image caption
generation is a popular research area of Artificial Intelligence that deals with image understanding
and a language description for that image. Generating well-formed sentences requires both
syntactic and semantic understanding of the language. Being able to describe the content of an
image using accurately formed sentences is a very challenging task, but it could also have a great
impact by helping visually impaired people better understand the content of images.
This task is significantly harder in comparison to the image classification or object
recognition tasks that have been well researched.

The biggest challenge is most definitely being able to create a description that must capture
not only the objects contained in an image, but also express how these objects relate to each other.

Advantages
● Recommendations in Editing Applications
● Assistance for visually impaired
● Social Media posts
● Self-Driving cars
● Robotics
● Easy to implement and connect to new data sources


Disadvantages
● Existing approaches do not make intuitive feature observations on objects or actions in the image.
● Nor do they provide an end-to-end, mature, general model to solve this problem.

1.4 Objective
The objective of image captioning is to capture and express the semantic information of
images in natural language. The approach for generating image captions generally consists of two
steps, encoding and decoding. The project aims to work on one of the ways to describe a photograph
in simple English sentences using Deep Learning (DL), and motivates the use of CNN and LSTM
instead of working with a plain RNN.

1.5 Motivation
Generating captions for images is a vital task relevant to the area of both Computer Vision
and Natural Language Processing. Mimicking the human ability of providing descriptions for
images by a machine is itself a remarkable step along the line of Artificial Intelligence. The main
challenge of this task is to capture how objects relate to each other in the image and to express them
in a natural language (like English). Traditionally, computer systems have used predefined
templates for generating text descriptions of images. However, this approach does not provide the
variety required for generating lexically rich text descriptions. This shortcoming has been overcome
by the increased efficiency of neural networks. Many state-of-the-art models use neural networks
for generating captions by taking an image as input and predicting the next lexical unit in the
output sentence.



CHAPTER 2
LITERATURE SURVEY

Image captioning has recently gathered a lot of attention, specifically in the natural language
domain. There is a pressing need for context-based natural language description of images. This
may seem a bit far-fetched, but recent developments in fields like neural networks, computer vision,
and natural language processing have paved the way for accurately describing images, i.e.,
representing their visually grounded meaning. We leverage state-of-the-art techniques like the
Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and appropriate
datasets of images with their human-perceived descriptions to achieve this. We demonstrate that
our alignment model produces results in retrieval experiments on datasets such as Flickr.

2.1 IMAGE CAPTIONING METHODS

There are various image captioning techniques. Some are rarely used at present, but it is
necessary to take an overview of those technologies before proceeding. The main categories of
existing image captioning methods include template-based image captioning, retrieval-based image
captioning, and novel caption generation. Novel caption generation methods mostly use visual
space and deep machine learning based techniques. Captions can also be generated from multimodal
space. Deep learning-based image captioning methods can further be categorized by learning
technique: supervised learning, reinforcement learning, and unsupervised learning. We group
reinforcement learning and unsupervised learning into Other Deep Learning. Usually captions are
generated for a whole scene in the image; however, captions can also be generated for different
regions of an image (dense captioning). Image captioning methods can use either a simple
encoder-decoder architecture or a compositional architecture. There are methods that use attention
mechanisms, semantic concepts, and different styles in image descriptions. Some methods can also
generate descriptions for unseen objects. We group them into one category as "Others". Most image
captioning methods use LSTM as the language model; however, a number of methods use other
language models such as CNN and RNN. Therefore, we include a language model-based category
as "LSTM vs. Others".

2.1.1 TEMPLATE-BASED APPROACHES

Template-based approaches have fixed templates with a number of blank slots to generate
captions. In these approaches, different objects, attributes, actions are detected first and then the


blank spaces in the templates are filled. For example, Farhadi et al. use a triplet of scene
elements to fill the template slots for generating image captions. Li et al. extract the phrases related
to detected objects, attributes, and their relationships for this purpose. A Conditional Random Field
(CRF) is adopted by Kulkarni et al. to infer the objects, attributes, and prepositions before filling in
the gaps. Template-based methods can generate grammatically correct captions. However, the
templates are predefined and cannot generate variable-length captions. Moreover, parsing-based
language models were later introduced in image captioning, and these are more powerful than
fixed template-based methods. Therefore, in this paper, we do not focus on these template-based
methods.

2.1.2 RETRIEVAL-BASED APPROACHES

Captions can be retrieved from visual space and multimodal space. In retrieval-based
approaches, captions are retrieved from a set of existing captions. Retrieval-based methods first
find images visually similar to the query image, together with their captions, in the training data
set. These captions are called candidate captions. The captions for the query image are then selected
from this pool of candidates. These methods produce general and syntactically correct captions;
however, they cannot generate image-specific and semantically correct captions.

2.1.3 NOVEL CAPTION GENERATION

Novel image captions are captions that are generated by the model from a combination of
the image features and a language model, instead of being matched to existing captions. Generating
novel image captions avoids the problems of reusing existing captions and as such is a much more
interesting and useful problem. Novel captions can be generated from both visual space and
multimodal space. A general approach in this category is to analyze the visual content of the image
first and then generate image captions from the visual content using a language model. These
methods can generate new captions for each image that are semantically more accurate than
previous approaches. Most novel caption generation methods use deep machine learning based
techniques. Therefore, deep learning based novel image caption generating methods are the main
focus of this literature review.

2.2 DEEP LEARNING BASED IMAGE CAPTIONING METHODS

We draw an overall taxonomy in Figure 1 for deep learning-based image captioning


methods. We discuss their similarities and dissimilarities by grouping them into visual space vs.


multimodal space, dense captioning vs. captions for the whole scene, supervised learning vs. other
deep learning, encoder-decoder architecture vs. compositional architecture, and one "Others" group
that contains attention-based, semantic concept-based, stylized, and novel object-based captioning.
We also create a category named LSTM vs. Others. A brief overview of the deep learning-based
image captioning methods is given in tabular form: it contains the name of the image captioning
method, the type of deep neural network used to encode image information, and the language model
used in describing the information. In the final column, a category label is given to each captioning
technique based on the taxonomy.

2.2.1 VISUAL SPACE VS. MULTIMODAL SPACE

Deep learning-based image captioning methods can generate captions from both visual
space and multimodal space. Understandably image captioning datasets have the corresponding
captions as 14 text. In the visual space-based methods, the image features and the corresponding
captions are independently passed to the language decoder. In contrast, in a multimodal space case,
a shared multimodal space is learned from the images and the corresponding caption-text. This
multimodal representation is then passed to the language decoder.

VISUAL SPACE

Bulk of the image captioning methods use visual space for generating captions. In the visual
space-based methods, the image features and the corresponding captions are independently passed
to the language decoder.

MULTIMODAL SPACE

The architecture of a typical multimodal space-based method contains a language Encoder


part, a vision part, a multimodal space part, and a language decoder part. A general diagram of
multimodal space-based image captioning methods is shown in Figure 2. The vision part uses a
deep convolutional neural network as a feature extractor to extract the image features. The language
encoder part extracts the word features and learns a dense feature embedding for each word. It then
forwards the semantic temporal context to the recurrent layers. The multimodal space part maps
the image features into a common space with the word features.

2.3 SUPERVISED LEARNING VS. OTHER DEEP LEARNING

In supervised learning, training data come with the desired output, called a label. Unsupervised
learning, on the other hand, deals with unlabeled data. Reinforcement learning is another type of
machine learning approach in which the aim of an agent is to discover data and/or labels through
exploration and a reward signal. Generative Adversarial Networks (GANs) are a type of
unsupervised learning. A number of image captioning methods use reinforcement learning and
GAN-based approaches. These methods sit in the category of "Other Deep Learning".

2.3.1 SUPERVISED LEARNING-BASED IMAGE CAPTIONING

Supervised learning-based networks have successfully been used for many years in image
classification, object detection, and attribute learning. This progress has made researchers interested
in using them in automatic image captioning. In this paper, we have identified a large number of
supervised learning-based image captioning methods and classify them into different categories:
(i) encoder-decoder architecture, (ii) compositional architecture, (iii) attention-based, (iv) semantic
concept-based, (v) stylized captions, (vi) novel object-based, and (vii) dense image captioning.

2.3.2 OTHER DEEP LEARNING-BASED IMAGE CAPTIONING

In day-to-day life, the amount of unlabeled data is increasing because it is often impractical
to accurately annotate data. Therefore, researchers have recently been focusing more on
reinforcement learning and unsupervised learning-based techniques for image captioning.

2.4 DENSE CAPTIONING VS. CAPTIONS FOR THE WHOLE SCENE

In dense captioning, captions are generated for each region of the scene. Other methods
generate captions for the whole scene.

2.4.1 DENSE CAPTIONING

The previous image captioning methods can generate only one caption for the whole image.
They use different regions of the image to obtain information about various objects, but they do not
generate region-wise captions. Johnson et al. [62] proposed an image captioning method called
DenseCap. This method localizes all the salient regions of an image and then generates descriptions
for those regions. A typical method of this category has the following steps: (1) region proposals
are generated for the different regions of the given image; (2) a CNN is used to obtain the
region-based image features; (3) the outputs of Step 2 are used by a language model to generate
captions for every region.


2.4.2 CAPTIONS FOR THE WHOLE SCENE

Encoder-decoder architecture, compositional architecture, attention-based, semantic concept-based,
stylized captions, novel object-based image captioning, and other deep learning network-based
image captioning methods generate single or multiple captions for the whole scene.

2.5 ENCODER-DECODER ARCHITECTURE VS. COMPOSITIONAL ARCHITECTURE

Some methods use just simple vanilla encoder and decoder to generate captions. However,
other methods use multiple networks for it.

2.5.1 ENCODER-DECODER ARCHITECTURE-BASED IMAGE CAPTIONING

The neural network-based image captioning methods work in a simple end-to-end manner.
These methods are very similar to the encoder-decoder framework-based neural machine
translation [131]. In this network, global image features are extracted from the hidden activations
of a CNN and then fed into an LSTM to generate a sequence of words. A typical method of this
category has the following general steps:

(1) A vanilla CNN is used to obtain the scene type and to detect the objects and their relationships.

(2) The output of Step 1 is used by a language model to convert it into words and combined phrases
that produce an image caption.

2.5.2 COMPOSITIONAL ARCHITECTURE-BASED IMAGE CAPTIONING

Compositional architecture-based methods are composed of several independent functional


building blocks: First, a CNN is used to extract the semantic concepts from the image. Then a
language model is used to generate a set of candidate captions. In generating the final caption, these
candidate captions are re-ranked using a deep multimodal similarity model. A typical method of
this category maintains the following steps:

(1) Image features are obtained using a CNN.

(2) Visual concepts (e.g. attributes) are obtained from visual features.

(3) Multiple captions are generated by a language model using the information of Step 1 and Step
2.

(4) The generated captions are re-ranked using a deep multimodal similarity model to select
high-quality image captions.


2.6 LSTM VS. OTHERS

Image captioning intersects computer vision and natural language processing (NLP)
research. NLP tasks, in general, can be formulated as sequence-to-sequence learning. Several
neural language models, such as the neural probabilistic language model, log-bilinear models,
skip-gram models, and recurrent neural networks (RNNs), have been proposed for learning
sequence-to-sequence tasks. RNNs have widely been used in various sequence learning tasks. However,
traditional RNNs suffer from vanishing and exploding gradient problems and cannot adequately
handle long-term temporal dependencies. LSTM networks are a type of RNN that has special units
in addition to standard units. LSTM units use a memory cell that can maintain information in
memory for long periods of time. In recent years, LSTM based models have dominantly been used
in sequence to sequence learning tasks. Another network, Gated Recurrent Unit (GRU) has a similar
structure to LSTM but it does not use separate memory cells and uses fewer gates to control the
flow of information. However, LSTMs ignore the underlying hierarchical structure of a sentence.
They also require significant storage due to long-term dependencies through a memory cell. In
contrast, CNNs can learn the internal hierarchical structure of the sentences and they are faster in
processing than LSTMs. Therefore, recently, convolutional architectures are used in other sequence
to sequence tasks, e.g., conditional image generation and machine translation. Inspired by the above
success of CNNs in sequence learning tasks, Gu et al. proposed a CNN language model-based image
captioning method. This method uses a language-CNN for statistical language modelling.
However, the method cannot model the dynamic temporal behaviour of the language model using
only a language-CNN, so it combines a recurrent network with the language-CNN to model the
temporal dependencies properly. Aneja et al. proposed a convolutional architecture for the task of
image captioning. They use a feedforward network without any recurrent function. The architecture
of the method has four components: (i) an input embedding layer, (ii) an image embedding layer,
(iii) a convolutional module, and (iv) an output embedding layer. It also uses an attention mechanism
to leverage spatial image features. They evaluate their architecture on the challenging MSCOCO
dataset and show performance comparable to an LSTM-based method on standard metrics.

2.7 RESEARCH PAPER

Abstract - Image caption generation has always been a study of great interest to researchers
in the Artificial Intelligence community. Being able to program a machine to accurately describe
an image or an environment like an average human has major applications in the fields of robotic
vision, business, and many more. This has been a challenging task in the field of artificial


intelligence throughout the years. In this paper, we present different image caption generating
models based on deep neural networks, focusing on the various RNN techniques and analyzing
their influence on the sentence generation. We have also generated captions for sample images and
compared the different feature extraction and encoder models to analyse which model gives better
accuracy and generates the desired results.

In this project the Flickr8k dataset is used, which consists of 8,000 images. Data preprocessing
is performed on these images, and the dataset is split into training, testing, and validation sets.

Supervised learning is the most popular machine learning paradigm. It is easy to understand and
very easy to use. It learns a function that maps inputs to outputs based on example input-output
pairs, working from a training data set that includes a set of training examples. In supervised
learning, each example is a pair consisting of an input object (usually a vector) and the desired
output value (also called the supervisory signal). The supervised learning algorithm analyzes the
training data and produces an inferred function, which can be used to map new examples.
Supervised learning is very similar to teaching a child with the data provided, where the data is in
the form of labeled examples: we can feed the learning algorithm with these example-label pairs
one by one, allowing the algorithm to predict whether its answer is correct or not. Over time, the
algorithm learns to approximate the exact nature of the relationship between examples and their
labels. When fully trained, the supervised learning algorithm is able to observe a new, previously
unseen example and predict a good label for it.

Unsupervised learning is a machine learning method in which the model is not supervised.
Instead, the model is allowed to work on its own to discover information. It works well with
unlabeled data and looks for previously unknown patterns in a data set that has no labels, with
minimal human supervision. In contrast to supervised learning, which typically uses human-labeled
data, unsupervised learning, also known as self-organization, allows modelling of the structure of
the inputs themselves. An Artificial Neural Network (ANN) has the ability to learn by example.
An ANN is an information processing model inspired by the biological neuron system.
Biologically inspired ANNs are computational models built to perform a specific set of tasks such
as clustering, segmentation, and pattern recognition. They are made up of a large number of highly
interconnected processing units known as neurons that work together to solve problems. An ANN
follows a non-linear approach and processes information in parallel across all nodes. A neural
network is a complex adaptive system: adaptive means it has the ability to change its internal
structure by adjusting its input weights.


Deep learning is a branch of machine learning based entirely on artificial neural networks. Deep
learning is an artificial intelligence technique that mimics the functioning of the human brain in
processing data and creating patterns for use in decision making. It is a subset of machine learning
in artificial intelligence (AI) with networks capable of learning, without supervision, from data
that is unstructured or unlabeled. Because it uses a large number of hidden layers, it is also known
as deep neural learning or a deep neural network. Deep learning has evolved hand in hand with the
digital era, which has brought an explosion of data of all kinds and from every region of the world.
This data, known as big data, is drawn from sources such as social media, internet search engines,
e-commerce platforms, and online cinemas, among others. This enormous amount of data is readily
accessible and can be shared through applications such as cloud computing. However, the data,
which is mostly unstructured, is so vast that it could take decades for humans to comprehend it and
extract the relevant information. Companies realize the incredible potential that can result from
unravelling this wealth of information and are increasingly adopting AI systems for automated
support. Deep learning learns from large amounts of unstructured data that would otherwise take
humans decades to understand and process. It uses a hierarchy of neural network layers to carry
out the machine learning process. The neural networks are built like the human brain, with neuron
nodes connected like a web. While traditional programs analyse data in a linear way, the hierarchical
function of deep learning systems enables machines to process data with a non-linear approach.

Algorithm Steps

Step 1: Download the Visual Genome dataset and perform preprocessing.

Step 2: Download the spaCy English tokenizer and convert the text into tokens.

Step 3: Extract image features using an object detector.

Step 4: Features are generated from tokenization, on which the LSTM is trained, and it generates
the captions.

Step 5: A paragraph is generated by combining all the captions.
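As an illustration of Step 2, the sketch below shows how the spaCy English tokenizer can be used to convert caption text into tokens. The pipeline name "en_core_web_sm" and the lower-casing are assumptions made for illustration, not details taken from the original implementation.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

    def tokenize(caption):
        # Convert a raw caption string into a list of lower-cased tokens.
        return [token.text.lower() for token in nlp(caption) if not token.is_space]

    print(tokenize("A black cat sat on the grass."))
    # ['a', 'black', 'cat', 'sat', 'on', 'the', 'grass', '.']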

Interpreting an image is the problem of producing a human-readable description of an image,
such as a picture of an object or a scene. The problem is sometimes called "automatic image
annotation" or "image tagging". It is an easy problem for a human, but very challenging for a
machine.

Data pre-processing - Images: Images are simply the input (X) to our model. Any input to the
model must be given in the form of a vector, so we need to convert each

image into a fixed-size vector that can be provided as input to the neural network. For this purpose,
we opt for transfer learning using the InceptionV3 (Convolutional Neural Network) model created
by Google Research. This model was trained on the ImageNet dataset to perform image
classification into 1000 different classes. However, our goal here is not to classify the image but
simply to obtain a fixed-length information vector for each image. This process is called automatic
feature engineering.
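A minimal sketch of this image-side preprocessing, assuming Keras/TensorFlow, is given below. The 2048-dimensional output of InceptionV3's last pooling layer is taken as the fixed-length feature vector; the function name encode_image is introduced here only for illustration.

    import numpy as np
    from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
    from tensorflow.keras.preprocessing import image
    from tensorflow.keras.models import Model

    base = InceptionV3(weights="imagenet")                # pretrained 1000-class classifier
    encoder = Model(base.input, base.layers[-2].output)   # drop the softmax layer -> 2048-d vector

    def encode_image(path):
        # Load an image, resize to the 299x299 RGB input expected by InceptionV3,
        # and return its fixed-length feature vector.
        img = image.load_img(path, target_size=(299, 299))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return encoder.predict(x, verbose=0).flatten()    # shape: (2048,)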

Data pre-processing - Captions: Captions are what we want to predict. Therefore, during training,
the captions are the target variable (Y) that the model learns to predict. However, the whole caption
is not predicted at once when a picture is given; we predict the caption word by word. Therefore,
we need to encode each word into a fixed-size vector, using the dictionaries "wordtoix"
(word-to-index) and "ixtoword" (index-to-word).
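The two dictionaries can be built from the training captions roughly as follows; this is a small sketch, and the frequency threshold used to drop rare words is an assumption.

    from collections import Counter

    def build_vocab(captions, min_count=10):
        # Count word frequencies over all training captions and keep the frequent words.
        counts = Counter(word for cap in captions for word in cap.split())
        vocab = [w for w, c in counts.items() if c >= min_count]
        ixtoword = {i + 1: w for i, w in enumerate(vocab)}   # index 0 is reserved for padding
        wordtoix = {w: i for i, w in ixtoword.items()}
        return wordtoix, ixtoword

    wordtoix, ixtoword = build_vocab(["startseq the black cat sat on grass endseq"], min_count=1)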

Data pre-processing using a generator function: Let's take the first image vector Image_1 and its
corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is
the input and the caption is what we need to predict, but we predict the caption as follows. First,
we provide the image vector and the first word as input and try to predict the second word, i.e.:
Input = Image_1 + 'startseq'; Output = 'the'. Then we provide the image vector and the first two
words as input and try to predict the third word, i.e.: Input = Image_1 + 'startseq the';
Output = 'black'.
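A minimal sketch of such a generator is shown below, assuming the encode_image features, the wordtoix mapping, and a fixed maximum caption length from the previous steps; batching is omitted for brevity, so this is an outline rather than the project's exact training code.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.utils import to_categorical

    def data_generator(features, captions, wordtoix, max_length, vocab_size):
        # captions: {image_id: "startseq ... endseq"}, features: {image_id: 2048-d vector}
        while True:
            for img_id, caption in captions.items():
                seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]          # partial caption
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]   # next word, one-hot
                    yield [features[img_id], in_seq], out_word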

Encoder Model:

The encoder model is primarily responsible for processing the captions of each image fed in during
training. The output of the encoder model is again a vector of size 1x256, which becomes an input
to the decoder.

The most important part of the encoder model is the LSTM, or Long Short-Term Memory, layer.
This layer helps the model learn how to generate valid sentences, i.e., to generate the word with
the highest probability of occurrence after a specific word is encountered. The activation function
used is ReLU and the defined output space is 256. For the comparison between the complete models
VGG+GRU and VGG+LSTM, this particular layer is replaced by a GRU (Gated Recurrent Units)
layer and the results are analyzed; the output space for the GRU layer is the same, i.e., 256. Thus
the only major difference between the two models lies in the encoder part. The output of the LSTM
layer is the output of the encoder.
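A minimal sketch of this encoder branch, assuming Keras, is given below; vocab_size and max_length are assumed values carried over from preprocessing, and the dropout rate is illustrative. For the VGG+GRU comparison, the LSTM(256) line is simply swapped for a GRU(256) layer.

    from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, GRU

    max_length, vocab_size = 34, 7579            # assumed values for illustration

    caption_input = Input(shape=(max_length,))
    x = Embedding(vocab_size, 256, mask_zero=True)(caption_input)   # word indices -> 256-d embeddings
    x = Dropout(0.5)(x)
    encoded_caption = LSTM(256)(x)               # 1x256 summary of the partial caption
    # encoded_caption = GRU(256)(x)              # alternative encoder for the GRU comparison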


Decoder Model:

The decoder model is the model that concatenates both the feature extraction model and the encoder
model and produces the required output, namely the predicted word, given an image and the
sentence generated up to that point in time. The decoder model takes its input from the feature
extraction model and the encoder model, both of which output vectors of dimension 256. The output
of the concatenated models is passed through a dense layer that uses the ReLU activation function.
Another dense layer is added to the decoder model with the vocabulary size as the output space.
The vocabulary size for Flickr8k was found to be 7579, and the activation function used was
softmax, which outputs a word for the predicted integer. The predicted word is the output of the
decoder layer. The model is trained on input-output pairs of the form <[image, input sequence],
output word>, where the inputs are the image and the input sequence, and the output of the model
is the predicted word, given the image and the caption generated up to that point in time. When a
caption is generated, we calculate the BLEU score for each architecture. Four types of BLEU scores
were computed: BLEU-1 (1.0, 0, 0, 0), BLEU-2 (0.5, 0.5, 0, 0), BLEU-3 (0.33, 0.33, 0.33, 0), and
BLEU-4 (0.25, 0.25, 0.25, 0.25). We used the cumulative weights since they give better output.
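The sketch below, again assuming Keras, puts the pieces together: the image feature branch is reduced to 256 dimensions, concatenated with the 256-dimensional encoded caption, and passed through dense layers ending in a softmax over the 7579-word vocabulary. The 2048-dimensional image input assumes InceptionV3 features (a VGG16 fc layer would give 4096), and the commented BLEU lines assume NLTK.

    from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Dense, Concatenate
    from tensorflow.keras.models import Model

    max_length, vocab_size = 34, 7579                          # assumed values for illustration

    # Caption (encoder) branch, as sketched in the previous section.
    caption_input = Input(shape=(max_length,))
    encoded_caption = LSTM(256)(Dropout(0.5)(Embedding(vocab_size, 256, mask_zero=True)(caption_input)))

    # Image (feature extraction) branch: 2048-d feature reduced to 256.
    image_input = Input(shape=(2048,))
    image_branch = Dense(256, activation="relu")(Dropout(0.5)(image_input))

    merged = Concatenate()([image_branch, encoded_caption])    # combine image and text branches
    hidden = Dense(256, activation="relu")(merged)
    output = Dense(vocab_size, activation="softmax")(hidden)   # one probability per vocabulary word

    model = Model(inputs=[image_input, caption_input], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")

    # BLEU evaluation with the cumulative weights listed above (using NLTK):
    # from nltk.translate.bleu_score import corpus_bleu
    # bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))
    # bleu4 = corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25))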

Conclusion-We have presented a deep learning model that tends to automatically generate image
captions with the goal of not only describing the surrounding environment but also helping visually
impaired people better understand their environments. Our described model is based upon a CNN
feature extraction model that encodes an image into a vector representation, followed by an RNN
decoder model that generates corresponding sentences based on the learned image features. We
have compared various encoder decoder models to see how each component influences the caption
generation and have also demonstrated various use cases of our system. The results show that the
LSTM model generally works slightly better than GRU, although it takes more time for training and
sentence generation due to its complexity. The performance is also expected to increase when using
a bigger dataset, by training on more images. Because of the considerable accuracy of the
generated image captions, visually impaired people can greatly benefit and get a better sense of
their surroundings using the text-to-speech technology that we have incorporated as well.



CHAPTER 3

SYSTEM REQUIREMENTS SPECIFICATION

To be used efficiently, all computer software needs certain hardware components or other software
resources to be present on a computer. These prerequisites are known as computer system
requirements and are often used as guidelines rather than absolute rules. Most software defines
two sets of system requirements: minimum and recommended. With the increasing demand for
higher processing power and resources in newer versions of software, system requirements tend to
increase over time. Industry analysts suggest that this trend plays a bigger part in driving upgrades
to existing computer systems than technological advancements.

3.1. Hardware Requirements


For hardware, a modern processor like an Intel Core i5 or AMD Ryzen 5 is the minimum, while
an Intel Core i7 or Ryzen 7 is recommended for better performance. At least 8GB of RAM is
required, but 16GB is ideal for smooth operation, especially for deep learning models. Storage
needs to be sufficient, with at least 100GB of free space, while 500GB is preferable for larger
datasets. A GPU like an NVIDIA GTX 1060 or higher is essential for model training and inference,
with an RTX series GPU being recommended for better performance.

 Processor:
o Minimum: Intel Core i5 (8th Gen) or AMD Ryzen 5 (equivalent)
o Recommended: Intel Core i7/i9 (10th Gen or higher) or AMD Ryzen 7/9
 Memory (RAM):
o Minimum: 8GB
o Recommended: 16GB or more (for smoother training and inference, especially for
deep learning models)
 Storage:
o Minimum: 100GB free space (for dataset, models, and intermediate files)
o Recommended: 500GB or more (especially for larger datasets and deep learning
models)
 Graphics Processing Unit (GPU):
o Minimum: NVIDIA GTX 1060, AMD equivalent (for smaller models and light
inference)


o Recommended: NVIDIA RTX 30 series (e.g., RTX 3060, 3070, or higher) or


Tesla/V100 for model training
 Network:
o Stable internet connection for downloading datasets, libraries, and pretrained
models.

3.2. Software Requirements


The system should be compatible with Windows 10, macOS 10.15, or a Linux-based OS like
Ubuntu. Development tools such as Visual Studio Code or PyCharm are essential, alongside
version control using Git. Docker is optional but useful for environment consistency. Jupyter
Notebook can help with interactive development. A Python environment (via virtualenv or conda)
is necessary to manage dependencies.

 Operating System:
o Windows 10 or higher
o macOS 10.15 (Catalina) or higher
o Linux-based OS (Ubuntu 18.04+ or CentOS)
 Containerization/Virtualization (optional but recommended for consistent
environments):
o Docker (to create isolated environments)
o Virtualenv/conda (for Python dependency management)

3.3. Libraries
Key libraries for image caption generation include TensorFlow or PyTorch for deep learning,
with Keras for higher-level API integration. OpenCV and Pillow are required for image processing,
while NLTK or spaCy handle text preprocessing. The Hugging Face Transformers library can be
used for leveraging pre-trained models like CLIP for caption generation.

 Computer Vision Libraries:


o OpenCV: To process images (resizing, cropping, color adjustments, etc.)
o Pillow: Python Imaging Library for handling basic image processing.
o scikit-image: For image segmentation, transformations, and feature extraction.
 Natural Language Processing Libraries:
o NLTK or spaCy: To process text data, tokenize sentences, and handle other NLP
tasks.


o transformers (by Hugging Face): For using pretrained models such as GPT, BERT,
and CLIP for caption generation.
 Pretrained Models:
o CLIP (Contrastive Language-Image Pretraining): A model from OpenAI that can
understand images and text jointly. This can be fine-tuned for caption generation.
o Image Captioning Models: Pretrained models such as Show, Attend and Tell, Up-
Down Models, or Bottom-Up Top-Down Models can be used for image captioning
tasks.
 Data Management Libraries:
o NumPy: For numerical operations and handling image tensors.
o Pandas: For data manipulation and handling datasets, especially captions and
metadata.
o H5py: For saving large datasets (e.g., images and feature vectors).

3.4. Development Tools:

o Text Editors/IDEs: VS Code, PyCharm, Jupyter Notebook, or any other preferred


Python IDE
o Version Control: Git (for managing code versioning)
o Docker (Optional): For containerization of the environment and easy deployment
across systems
o Jupyter Notebook (Optional): For interactive development and testing during
model creation

3.5. Programming Languages and Frameworks


Python is the primary programming language due to its rich ecosystem for deep learning and NLP.
JavaScript or TypeScript may be used for web-based deployment of the captioning model if needed.

 Python:
o Primary language for deep learning and data science tasks.
o Popular libraries such as TensorFlow, PyTorch, OpenCV, and transformers are all
Python-based.
 JavaScript/TypeScript (optional, for web deployment):
o To build web applications for serving the captioning models (using frameworks like
React.js or Node.js).


 Deep Learning Frameworks:


o TensorFlow: Used for creating and training deep learning models.
o PyTorch: Popular alternative to TensorFlow, especially for research and rapid
prototyping.
o Keras: High-level API for TensorFlow (if using TensorFlow).
o OpenCV: For image preprocessing and handling.

3.6. Functional Requirements


The system should be capable of preprocessing both images (resizing, normalization) and text
(tokenization, stopword removal). It should allow for the training of models that combine CNNs for image
feature extraction and RNNs or transformers for caption generation. The system should include support
for model evaluation using metrics like BLEU or CIDEr, and a user interface or API to handle image input
and generate captions.

 Image Preprocessing:
o The system should be able to preprocess images, such as resizing, cropping,
normalization, and augmentations.
 Text Preprocessing:
o Ability to tokenize captions, remove stop words, and handle other NLP tasks like
stemming and lemmatization.
 Model Training:
o The system should be able to train deep learning models for image caption
generation, which involves using CNN (for image features) and RNN (for
generating captions) or transformers.
 Caption Generation:
o The system must be able to generate captions for a given input image using the
trained model.
 Evaluation Metrics:
o Implement standard evaluation metrics like BLEU, METEOR, or CIDEr to assess
the quality of the generated captions.
 Model Inference:
o The system should support inference with trained models for generating captions on
new images.

 Support for Pretrained Models:

o Optionally, the system should allow for the use of pretrained models for faster
inference, such as using CLIP or Show and Tell.
 API/Interface:
o Provide a simple API or interface for feeding images and receiving captions. This
can be a RESTful API or a CLI.

3.7. Non-Functional Requirements


In terms of non-functional requirements, the system should be scalable to handle large
datasets and traffic. It needs to perform efficiently, with real-time or near-real-time caption
generation, and should provide accurate and relevant captions. The system must be robust against
various image qualities, easy to maintain, and secure, especially if exposed through an API. Lastly,
it should be extensible to allow for future updates or integration with new models.

 Scalability:
o The system should handle varying input sizes (both in terms of images and number
of requests).
o It should support the scalability to handle more complex models and larger datasets.
 Performance:
o The image captioning system should generate captions in real-time or within an
acceptable latency (depending on use case).
 Accuracy:
o The caption generation model should provide high-quality, accurate, and
contextually relevant captions.
 Robustness:
o The system should be robust enough to handle images with various qualities (e.g.,
low resolution, varying lighting conditions).
 Usability:
o User interface (UI) should be intuitive for users (if it's a web-based application).
o Should provide clear error messages and feedback.
 Maintainability:
o The codebase should be well-documented and modular, allowing easy updates,
improvements, and troubleshooting.


 Security

o If the model is exposed via an API, it should have adequate security mechanisms
to prevent misuse or malicious attacks.
 Extensibility:
o The system should allow for easy integration of new datasets or the addition of new
models for improved accuracy.
 Deployment:
o The system should be deployable on various platforms (local, cloud, or edge
devices) depending on the use case.
 Compliance:
o If necessary, ensure the system adheres to privacy and ethical standards, especially
regarding data usage and model fairness.



CHAPTER 4

ARCHITECTURE

4.1 Introduction

This project uses a CNN and an LSTM, which together act as the platform for generating
sentences from a simple image. The approach can be applied across a wide range of applications.

4.2 Working Explanation


1. A user uploads an image that they want to generate a caption for.
2. The image is converted to gray-scale and processed through the CNN to identify the objects in it.
3. The CNN scans the image left to right and top to bottom and extracts important image features.
4. By applying layers such as Convolutional, Pooling, and Fully Connected, together with
activation functions, we extract the features of every image.
5. The extracted features are then passed to the LSTM.
6. Using the LSTM layer, we predict what the next word could be.
7. The application then proceeds to generate a sentence describing the image, as sketched below.
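A minimal sketch of steps 5-7 is shown below, assuming the trained CNN-LSTM model, the encode_image feature extractor, and the wordtoix/ixtoword dictionaries introduced in the implementation discussion; words are predicted greedily one at a time until the end token is produced.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def generate_caption(model, photo_feature, wordtoix, ixtoword, max_length):
        caption = "startseq"
        for _ in range(max_length):
            seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
            seq = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([np.array([photo_feature]), seq], verbose=0)[0]
            word = ixtoword[int(np.argmax(probs))]     # most probable next word
            caption += " " + word
            if word == "endseq":                       # stop token reached
                break
        return caption.replace("startseq", "").replace("endseq", "").strip()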

4.3 Models Used


• Convolutional Neural Network
• Long Short-Term Memory

4.4 Overview of CNN


Convolutional neural networks use three-dimensional data for image classification and object
recognition tasks.


Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech or audio signal inputs. They have three main types of layers, which
are:

● Convolutional layer
● Pooling layer
● Fully-connected (FC) layer

Convolutional layer
The convolutional layer is the first layer of a convolutional network. While convolutional layers
can be followed by additional convolutional layers or pooling layers, the fully-connected layer is
the final layer. With each layer, the CNN increases in its complexity, identifying greater portions
of the image. Earlier layers focus on simple features, such as colors and edges. As the image data
progresses through the layers of the CNN, it starts to recognize larger elements or shapes of the
object until it finally identifies the intended object.

The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter and a feature map.
Let's assume that the input will be a color image, which is made up of a matrix of pixels in 3D.
This means that the input will have three dimensions—a height, width and depth—which
correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter,
which will move across the receptive fields of the image, checking if the feature is present. This
process is known as a convolution.

The feature detector is a two-dimensional (2-D) array of weights, which represents part of the
image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the
size of the receptive field. The filter is then applied to an area of the image, and a dot product is
calculated between the input pixels and the filter. This dot product is then fed into an output array.
Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the
entire image. The final output from the series of dot products from the input and the filter is known
as a feature map, activation map or a convolved feature.

The weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters such as the weight values, adjust during training through
the process of backpropagation and gradient descent. However, there are three hyperparameters


which affect the volume size of the output that need to be set before the training of the neural
network begins.

These include:

1. The number of filters affects the depth of the output. For example, three distinct filters would
yield three different feature maps, creating a depth of three.

2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While
stride values of two or greater are rare, a larger stride yields a smaller output.

3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements
that fall outside of the input matrix to zero, producing a larger or equally sized output. There are
three types of padding:

● Valid padding: This is also known as no padding. In this case, the last convolution is
dropped if the dimensions do not align.
● Same padding: This padding ensures that the output layer has the same size as the input
layer.
● Full padding: This type of padding increases the size of the output by adding zeros to the
border of the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation
to the feature map, introducing nonlinearity to the model.
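A short sketch, assuming Keras, of how the number of filters, stride, and padding determine the output volume, with the ReLU non-linearity applied afterwards; the 224x224x3 input size is only an example.

    from tensorflow.keras.layers import Input, Conv2D, ReLU
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Input(shape=(224, 224, 3)),                                          # height, width, depth (RGB)
        Conv2D(filters=3, kernel_size=(3, 3), strides=1, padding="same"),    # 3 filters -> output depth 3
        ReLU(),                                                              # nonlinearity on the feature maps
    ])
    model.summary()   # output shape (None, 224, 224, 3); "valid" padding would give (None, 222, 222, 3)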


Pooling layer

Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the
number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps
a filter across the entire input, but the difference is that this filter does not have any weights. Instead,
the kernel applies an aggregation function to the values within the receptive field, populating the
output array. There are two main types of pooling:

● Max pooling: As the filter moves across the input, it selects the pixel with the maximum
value to send to the output array. As an aside, this approach tends to be used more often
than average pooling.
● Average pooling: As the filter moves across the input, it calculates the average value within
the receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also has a number of benefits to the CNN.
They help to reduce complexity, improve efficiency, and limit risk of overfitting.
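The difference between the two pooling types can be seen in a few lines, assuming Keras: a 2x2 pool halves the spatial dimensions and has no trainable weights.

    import numpy as np
    from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D

    feature_map = np.random.rand(1, 224, 224, 3).astype("float32")   # a dummy feature map
    print(MaxPooling2D(pool_size=(2, 2))(feature_map).shape)         # (1, 112, 112, 3)
    print(AveragePooling2D(pool_size=(2, 2))(feature_map).shape)     # (1, 112, 112, 3)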

Fully-connected layer

The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values
of the input image are not directly connected to the output layer in partially connected layers.
However, in the fully-connected layer, each node in the output layer connects directly to a node in
the previous layer.

This layer performs the task of classification based on the features extracted through the previous
layers and their different filters. While convolutional and pooling layers tend to use ReLU functions,
FC layers usually leverage a softmax activation function to classify inputs appropriately, producing
a probability from 0 to 1.


4.4.1 CNN

Some advantages of CNN are:


● It works well for both supervised and unsupervised learning.
● Easy to understand and fast to implement.
● It has the highest accuracy among all algorithms that predict from images.
● Little dependence on pre-processing, decreasing the need for human effort to develop
its functionalities.

4.5 Overview of LSTM


Long Short-Term Memory (LSTM) is a type of artificial recurrent neural network (RNN)
architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTMs
have feedback connections, allowing them to exploit temporal dependencies across sequences of
data. LSTM is designed to handle the issue of vanishing or exploding gradients, which can occur
when training traditional RNNs on sequences of data. This makes them well-suited for tasks
involving sequential data, such as natural language processing (NLP), speech recognition, and time
series forecasting.

LSTM networks introduce memory cells, which have the ability to retain information over long
sequences. Each memory cell has three main components: an input gate, a forget gate, and an output
gate. These gates help regulate the flow of information in and out of the memory cell.


The input gate determines how much of the new input should be stored in the memory cell. It takes
the current input and the previous hidden state as inputs, and outputs a value between 0 and 1 for
each element of the memory cell.

The forget gate decides which information to discard from the memory cell. It takes the current input
and the previous hidden state as inputs, and outputs a value between 0 and 1 for each element of the
memory cell. A value of 0 means the information is ignored, while a value of 1 means it is retained.

The output gate controls how much of the memory cell's content should be used to compute the
hidden state. It takes the current input and the previous hidden state as inputs, and outputs a value
between 0 and 1 for each element of the memory cell.

By using these gates, LSTM networks can selectively store, update, and retrieve information over
long sequences. This makes them particularly effective for tasks that require modeling long-term
dependencies, such as speech recognition, language translation, and sentiment analysis.
1. Forget Gate:
● Determines what information to discard from the cell state.
● It takes the current input and the previous hidden state and produces a number between 0 and 1
for each element of the cell state: 1 means "completely keep this", while 0 means "completely
get rid of this".

2. Input Gate:
● Decides what new information to store in the cell state.
● It consists of two parts:
   a. A sigmoid layer (the "input gate layer") that decides which values to update.
   b. A tanh layer that creates a vector of new candidate values to add to the cell state.

3. Output Gate:
● Determines the next hidden state based on the updated cell state.
● Filters the information that the LSTM will output based on the updated cell state (a minimal
numerical sketch of these gates is given below).
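
The following NumPy sketch shows how the three gates combine the current input with the previous hidden and cell states in a single LSTM step. It is a simplified, untrained illustration with assumed parameter shapes, not the project's trained model:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; gate order assumed: forget, input, candidate, output."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b              # pre-activations for all four gates
    f = sigmoid(z[0 * hidden:1 * hidden])     # forget gate: what to discard
    i = sigmoid(z[1 * hidden:2 * hidden])     # input gate: what to store
    g = np.tanh(z[2 * hidden:3 * hidden])     # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])     # output gate: what to expose
    c_t = f * c_prev + i * g                  # updated cell state
    h_t = o * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

# Tiny example with random (untrained) parameters.
rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
x_t = rng.normal(size=input_dim)
h_prev, c_prev = np.zeros(hidden), np.zeros(hidden)
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b)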


Key components of LSTM:


● Cell State: This runs straight down the entire chain of the LSTM, with only minor linear
interactions. It is the core differentiator in LSTMs that allows them to maintain and control
long-term dependencies.

● Hidden State: The LSTM's output at a particular time step, computed from the cell state.

4.5.1 LSTM Cell

LSTMs use these gates to regulate the flow of information, which allows them to learn long-term
dependencies in data, making them particularly effective for tasks involving sequential data like
time series prediction, natural language processing, speech recognition, and more.

By controlling and memorizing information over long sequences, LSTMs can mitigate the problems
of vanishing and exploding gradients, enabling more effective training and better capturing of long-
term patterns in sequential data.


Preparing input data for LSTM


Preparing input data for an LSTM involves organizing your data into a format that an LSTM model
can ingest and process effectively. LSTMs, being a type of recurrent neural network, are suited for
sequence data. The basic steps for preparing input data for an LSTM are as follows:

● Sequences: LSTM models work with sequences of data. Organize your input data into sequences
of fixed length. For instance, with daily time series data, you might create sequences of, say,
10 days' worth of data as one input sequence.

● Reshape your data into a 3D format: (samples, time steps, features). For instance, if your data
is a 2D matrix of shape (samples, features), you will need to reshape it so that the LSTM can
interpret it as sequences of data (see the reshaping sketch below).

● Samples: the number of data points in your dataset.

● Time Steps: the number of time steps in each sequence.

● Features: the number of features at each time step.
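
As a concrete illustration of the reshaping step (using hypothetical daily data, not the project's inputs), a 2D array of readings can be sliced into overlapping 10-step sequences as follows:

import numpy as np

# Hypothetical data: 100 days, 1 feature per day -> shape (100, 1).
daily = np.arange(100, dtype=float).reshape(-1, 1)

time_steps = 10
sequences = np.stack([daily[i:i + time_steps]
                      for i in range(len(daily) - time_steps)])

print(sequences.shape)   # (90, 10, 1) -> (samples, time steps, features)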

4.5.2 LSTM


Some advantages of LSTM are:


● It exposes a wide range of tunable parameters, such as learning rates and input and output
biases.
● The complexity of updating each weight is reduced to O(1) with LSTMs.

4.6 CNN-LSTM Architecture Model


The CNN-LSTM architecture uses Convolutional Neural Network (CNN) layers for feature
extraction on the input data, combined with LSTMs to support sequence prediction.
CNN-LSTMs were developed for visual time series prediction problems and for generating
textual descriptions from sequences of images (e.g., videos). Specifically, they address
problems such as:

● Activity Recognition: generating a textual description of an activity demonstrated in a
sequence of images.
● Image Description: generating a textual description of a single image.
● Video Description: generating a textual description of a sequence of images.

This architecture was originally referred to as a Long-term Recurrent Convolutional
Network (LRCN) model, although here we use the more generic name "CNN-LSTM".

● The CNN is used to extract features from the image. We use the pre-trained Xception model.
● The LSTM uses the information from the CNN to help generate a description of the image;
a minimal sketch of this pairing is given below.
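
A minimal Keras-style sketch of this pairing follows. The layer sizes, vocabulary size, and maximum caption length are illustrative assumptions, not the exact project configuration; the 2048-dimensional input corresponds to a pooled Xception feature vector:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7577      # assumed vocabulary size after tokenization
max_length = 32        # assumed maximum caption length

# Image branch: pre-extracted CNN feature vector -> dense representation.
image_input = Input(shape=(2048,))
image_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: partial caption -> embedding -> LSTM.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(Dropout(0.5)(caption_embed))

# Merge both branches and predict the next word of the caption.
merged = add([image_dense, caption_lstm])
decoder = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")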


4.6.1 CNN-LSTM model



CHAPTER 5
METHODOLOGY AND IMPLEMENTATION
5.1 Methodology

● Import libraries.
● Upload the COCO (Common Objects in Context) 2017 dataset (data preprocessing).
● Apply the CNN to identify the objects in the image.
● Preprocess and tokenize the captions.
● Use the LSTM to predict the next word of the sentence.
● Build a data generator.
● View images with their captions.

5.1.1 System Architecture


5.2 DATASET USED

The dataset used is Flickr8k. Flickr8k is a public benchmark dataset for image-to-sentence
description. It consists of 8,000 images with five captions for each image. The images are
drawn from diverse groups on the Flickr website, and each caption provides a clear description
of the entities and events present in the image. The dataset depicts a variety of events and
scenarios and does not include images of well-known people and places, which makes it more
generic. It contains 6,000 training images, 1,000 development images, and 1,000 test images.
Features of the dataset that make it suitable for this project are:

● Multiple captions mapped to a single image make the model more generic and help avoid
overfitting.
● The diverse categories of training images allow the image captioning model to work across
multiple categories of images, making the model more robust.

5.3 IMAGE DATA PREPARATION

The images must be converted into suitable features before they can be fed to a deep learning
model; feature extraction is a mandatory step when training any image in a deep learning model.
The features are extracted using a Convolutional Neural Network, specifically the Visual
Geometry Group (VGG-16) model. This model achieved top results in the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) 2014, which requires classifying images into one of
1,000 classes, so it is well suited to this project, since image captioning requires identifying
the content of images. VGG-16 has 16 weight layers, and the depth of the network helps it
extract better features from images. It uses 3x3 convolutional layers, which keeps the
architecture simple, with max pooling layers in between to reduce the spatial size of the
image. The final classification layer is removed, and the internal representation of the image
just before classification is returned as the feature. The input image must have dimensions
224x224, and the model returns a one-dimensional 4096-element feature vector, as sketched
below.
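
A sketch of this feature-extraction step with Keras' pre-trained VGG16 is shown below; the folder name follows the dataset layout described later, and the loop details are illustrative rather than the exact project code:

import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Drop the final classification layer; keep the 4096-element fc2 output as the feature.
base = VGG16()
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

features = {}
image_dir = "Flicker8k_Dataset"
for name in os.listdir(image_dir):
    img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))  # (1, 224, 224, 3)
    features[name.split(".")[0]] = extractor.predict(x, verbose=0)

pickle.dump(features, open("features.p", "wb"))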


5.3.1 Feature Extraction in images using VGG

5.4 CAPTION DATA PREPARATION

The Flickr8k dataset contains multiple descriptions for each image. In the data preparation
phase, each image id is taken as a key and its corresponding captions are stored as values
in a dictionary, as in the sketch below.
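
A small sketch of this step is given below, assuming the Flickr8k token file stores one caption per line in the form image_name.jpg#index<TAB>caption (the file path is illustrative):

def load_captions(token_file):
    """Build a dictionary mapping each image id to its list of captions."""
    captions = {}
    with open(token_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_part, caption = line.split("\t")
            image_id = image_part.split(".")[0]          # drop ".jpg#0" etc.
            captions.setdefault(image_id, []).append(caption)
    return captions

captions = load_captions("Flickr_8k_text/Flickr8k.token.txt")
# Each key now maps to (typically) five reference captions.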

5.4.1 DATA CLEANING

To make the text data usable in machine learning or deep learning models, the raw text must be
converted into a clean, consistent format. The following text cleaning steps are performed
before the captions are used in the project:

● Removal of punctuation.
● Removal of numbers.
● Removal of single-character words.
● Conversion of uppercase to lowercase characters.

Stop words are not removed from the text data, as doing so would hinder the generation of
grammatically complete captions, which this project requires.
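
The cleaning steps listed above can be sketched roughly as follows (a simplified illustration, not the exact project code):

import string

def clean_caption(caption):
    """Lowercase, strip punctuation and numbers, and drop single-character words."""
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in caption.split() if len(w) > 1 and w.isalpha()]
    return " ".join(words)

print(clean_caption("A child in a pink dress is climbing up stairs ."))
# -> "child in pink dress is climbing up stairs"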

5.5 IMPLEMENTATION DETAILS

Downloaded from dataset:


● Flicker8k_Dataset – dataset folder containing 8,091 images.

● Flickr_8k_text – dataset folder containing the text files and image captions.

The following files are created while building the project:

● Models – contains the trained models.

● Descriptions.txt – text file containing all image names and their captions after
preprocessing.

● Features.p – pickle object containing each image and its feature vector extracted from the
Xception pre-trained CNN model.

● Tokenizer.p – contains tokens mapped to index values.

● Model.png – visual representation of the dimensions of the model.

● Testing_caption_generator.py – Python file for generating a caption for any image.

● Training_caption_generator.ipynb – Jupyter notebook in which we train and build the image
caption generator.

To train the model, we use the 6,000 training images, generating the input and output sequences
in batches and fitting them to the model with the model.fit_generator() method. We also save the
trained model to the models folder. Training takes some time, depending on the capability of
your system; a sketch of the generator is given below.
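
A hedged sketch of such a generator follows; the exact batching (one image's caption set per batch) and the variable names are assumptions made for illustration:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_length, vocab_size):
    """Yield ([image_feature, partial_caption], next_word) training pairs."""
    while True:
        for image_id, caption_list in captions.items():
            feature = features[image_id][0]
            X1, X2, y = [], [], []
            for caption in caption_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(feature); X2.append(in_seq); y.append(out_word)
            yield [np.array(X1), np.array(X2)], np.array(y)

# generator = data_generator(train_captions, features, tokenizer, max_length, vocab_size)
# model.fit_generator(generator, epochs=10, steps_per_epoch=len(train_captions))
# model.save("models/model_final.h5")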

5.6 DATAFLOW DIAGRAM



5.5.1 Dataflow Diagram

5.7 Comparison Between CNN and LSTM Models

Feature           | CNN Model                              | LSTM Model
Architecture Role | Visual feature extraction              | Sequence/language generation
Primary Function  | Extracts spatial features from images  | Generates word sequences with temporal dependencies
Typical Accuracy  | 65-75% (when used alone)               | 70-85% (when used in combination)
Precision         | 0.68-0.75                              | 0.72-0.84
Recall            | 0.62-0.73                              | 0.69-0.82
BLEU Score        | 0.24-0.35                              | 0.27-0.42
CIDEr Score       | 0.78-0.95                              | 0.85-1.10
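
For reference, corpus-level BLEU scores such as those in the table compare each generated caption against its reference captions; a minimal NLTK sketch with made-up sentences is shown below:

from nltk.translate.bleu_score import corpus_bleu

# Hypothetical tokenized reference captions and one generated caption.
references = [[
    "a child in a pink dress is climbing stairs".split(),
    "a little girl climbing the stairs".split(),
]]
candidates = ["a girl is climbing the stairs".split()]

print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))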



CHAPTER 6

RESULTS AND ANALYSIS

6.1 Code Implementation

6.2 Uploading Input


6.3 Predicted Vs Actual Results

6.4 Predicted Result


6.5 Output of Training Model

6.6 Model Accuracy Validation


6.7 Model Precision and Recall Analysis

6.8 Confusion Matrix of Captions(Actual Vs Predicted)



CHAPTER 7

CONCLUSION AND FUTURE ENHANCEMENTS

In this overview we brought together the components of the image caption generation problem,
discussed the model frameworks proposed in recent years to handle the description task,
concentrated on the algorithmic essence of various attention methods, and summarized how the
attention mechanism is implemented. The large datasets and evaluation criteria regularly used
in practice are also summarized. Although image captioning can be applied to image retrieval
[92], video captioning [93, 94], and video movement [95], and a wide range of image caption
systems are currently available, experimental results suggest that this task still requires
higher-performance systems and further improvement.

7.1 Conclusion

The CNN-LSTM model was created to automatically generate captions for input images. This
concept can be used in a wide range of situations. We learned about the CNN and LSTM models
and how to overcome previous limitations in the field of image captioning by building a
CNN-LSTM model capable of scanning and extracting information from any input image and
transforming it into a single-sentence description in natural English.

The attention algorithm and how the attention mechanism is used were the main topics of
discussion. I was able to successfully create a model that is a major improvement over earlier
image caption generators.

7.2 Future enhancements


Image captioning has become an important problem in recent years due to the exponential
growth of images on social media and the internet. This report discussed various past research
in image retrieval and highlighted the techniques and methodologies used in that research.
Because feature extraction and similarity calculation for images remain challenging in this
domain, there is tremendous scope for future research. Current image retrieval systems
calculate similarity using features such as colour, tags, and histograms; results based on
these methods cannot be fully accurate because they do not depend on the context of the image.
Research into image retrieval that makes use of image context, such as image captioning, can
therefore help solve this problem in the future. This project can be further enhanced by
training it with more image captioning datasets to improve the identification of classes that
currently have lower precision. The methodology can also be combined with previous image
retrieval methods, such as histograms and shapes, to check whether the image retrieval results
improve.



BIBLIOGRAPHY

[1] R. Subash (November 2019): Automatic Image Captioning Using Convolution Neural Networks
and LSTM.

[2] Seung-Ho Han, Ho-Jin Choi (2020): Domain-Specific Image Caption Generator with Semantic
Ontology.

[3] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra and Nand Kumar Bansode (2017):
Camera2Caption: A Real-Time Image Caption Generator.

[4] Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares (June 2019): Image Captioning:
Transforming Objects into Words.

[5] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Dr. Shabnam Sayyad
(March 2021): Deep Learning-Based Image Caption Generator.

[6] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (2015): Show and Tell:
A Neural Image Caption Generator.

[7] Jianhui Chen, Wenqiang Dong, Minchen Li (2015): Image Caption Generator Based on Deep
Neural Networks.

[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould,
and Lei Zhang (2017): Bottom-up and Top-down Attention for Image Captioning.
