A Project Report
Bachelor of Technology
Submitted By:
Anjana V (U03NM21T029004)
Under the Guidance of:
Dr. THRIVENI J
CERTIFICATE
This is to certify that the project report entitled "IMAGE CAPTION GENERATOR SYSTEM" is a bonafide work carried out by Anjana V (U03NM21T029004) of the Department of Computer Science Engineering, University Visvesvaraya College of Engineering, K. R. Circle, Bengaluru, in partial fulfillment for the award of the degree of Bachelor of Technology in Computer Science Engineering of Bangalore University during the academic year 2024-2025.
Dr. Thriveni J
Professor & Chairperson
Department of CSE
UVCE, Bangalore -560001
Examiner 1        Examiner 2
Acknowledgement
I would like to express my profound gratitude to Dr. Thriveni J, Head of the Department of Computer
Science and Engineering, for her invaluable guidance and constant encouragement throughout the course
of this project.
Her depth of knowledge, critical insights, and constructive suggestions provided a solid foundation and
clarity in addressing the complexities of this work. Her leadership and support were instrumental in
ensuring the successful execution of this project.
I am deeply indebted to the esteemed faculty members of the Department of Computer Science and
Engineering for their unwavering support and expertise. Their commitment to fostering an environment
of learning and innovation has been a source of inspiration. Their feedback during the various phases of
this project, from conceptualization to implementation, has helped me refine my approach and achieve
the desired outcomes.
I also extend my heartfelt thanks to my peers and colleagues who provided moral support and shared
invaluable insights during brainstorming sessions. Their collaborative spirit and encouragement were
vital in overcoming challenges and maintaining momentum throughout the project. Additionally, I
acknowledge the contributions of all external sources, including industry experts and researchers, whose
works have enriched this project.
Lastly, I am deeply grateful to my family and friends for their unwavering support and understanding.
Their constant encouragement and belief in my abilities have been the bedrock of my perseverance and
determination. Without their love and patience, the completion of this project would not have been
possible.
ABSTRACT
The combination of computer vision and natural language processing in artificial intelligence has sparked a great deal of research interest in recent years, thanks to the advent of deep learning. Image captioning automatically describes the content of a photograph in English: the computer learns to interpret the visual information of the image and express it in one or more sentences. Generating a meaningful description of high-level picture semantics requires the ability to analyze the state and properties of the objects in the scene and the relationships between them. In this work we apply a CNN-LSTM architecture to the captioning of a graphical image, aiming to detect objects and inform people about them through text messages. To identify the objects correctly, the input image is first reduced to grayscale and then processed by a Convolutional Neural Network (CNN). The COCO 2017 dataset was used. The proposed method is intended to be extended for blind individuals and persons with vision loss by converting the captions to speech messages, helping them reach their full potential. In this project we follow a variety of important concepts of image captioning and its standard processes, and develop a generative CNN-LSTM model that outperforms human baselines.
TABLE OF CONTENTS
1. INTRODUCTION
2. Literature Review
3.3 Libraries
4.1 Introduction
5.1 Methodology
7.1 Summary
7.2 Conclusion
References
LIST OF FIGURES
4.4 CNN
4.5 LSTM
Our approach is based on two basic models: CNN (Convolutional Neural Network) and
LSTM (Long Short-Term Memory). CNN is utilized as an encoder in the derived application
to extract features from the snapshot or image, and LSTM is used as a decoder to organize the
words and generate captions. Image captioning can help with a variety of tasks, such as assisting the visually impaired through real-time text-to-speech descriptions of the scene from a camera feed, and enriching social media by automatically generating captions for photos in feeds, which can also be delivered as spoken messages.
1.2 Background
Our project can be applied in both large-scale and small-scale business settings. A caption appears next to the image and identifies or describes the image, and credits the source. There is no standard format for captions; a good caption points out any aspects of the image that are noteworthy or relevant.
The biggest challenge is most definitely being able to create a description that captures not only the objects contained in an image but also expresses how these objects relate to each other.
Advantages
● Recommendations in Editing Applications
● Assistance for visually impaired
● Social Media posts
● Self-Driving cars
● Robotics
● Easy to implement and connect to new data sources
Disadvantages
1.4 Objective
The objective of image captioning is to capture and express the semantic information of images in natural language. Approaches to generating image captions generally consist of two stages, namely encoding and decoding. The project aims to describe a photograph in simple English sentences using Deep Learning (DL), and motivates the use of CNN and LSTM models instead of working with a plain RNN.
1.5 Motivation
Generating captions for images is a vital task relevant to the area of both Computer Vision
and Natural Language Processing. Mimicking the human ability of providing descriptions for
images by a machine is itself a remarkable step along the line of Artificial Intelligence. The main
challenge of this task is to capture how objects relate to each other in the image and to express them
in a natural language (like English). Traditionally, computer systems have used predefined templates for generating text descriptions of images. However, this approach does not provide the variety required for generating lexically rich text descriptions. This shortcoming has been overcome by the increased efficiency of neural networks. Many state-of-the-art models use neural networks for generating captions by taking the image as input and predicting the next lexical unit of the output sentence.
Image captioning has recently attracted a lot of attention, particularly in the natural language domain. There is a pressing need for context-based natural language descriptions of images. This may once have seemed far-fetched, but recent developments in fields like neural networks, computer vision and natural language processing have paved the way for accurately describing images, i.e. representing their visually grounded meaning. We leverage state-of-the-art techniques such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and appropriate datasets of images paired with human-written descriptions to achieve this. We demonstrate that our alignment model produces good results in retrieval experiments on datasets such as Flickr.
There are various image captioning techniques; some are rarely used at present, but it is necessary to take an overview of these technologies before proceeding. The main categories of existing image captioning methods are template-based image captioning, retrieval-based image captioning, and novel caption generation. Novel caption generation methods mostly work in visual space and use deep machine learning based techniques, although captions can also be generated from a multimodal space. Deep learning-based image captioning methods can further be categorized by learning technique: supervised learning, reinforcement learning, and unsupervised learning. We group reinforcement learning and unsupervised learning into "Other Deep Learning". Usually captions are generated for the whole scene in the image; however, captions can also be generated for different regions of an image (dense captioning). Image captioning methods can use either a simple Encoder-Decoder architecture or a Compositional architecture. There are methods that use attention mechanisms, semantic concepts, and different styles in image descriptions, and some can also generate descriptions for unseen objects; we group these into one category as "Others". Most image captioning methods use an LSTM as the language model, but a number of methods use other language models such as CNNs and RNNs. Therefore, we include a language model-based category, "LSTM vs. Others".
Template-based approaches have fixed templates with a number of blank slots to generate
captions. In these approaches, different objects, attributes, actions are detected first and then the
blank spaces in the templates are filled. For example, Farhadi et al. use a triplet of scene
elements to fill the template slots for generating image captions. Li et al. extract the phrases related
to detected objects, attributes and their relationships for this purpose. A Conditional Random Field
(CRF) is adopted by Kulkarni et al. to infer the objects, attributes, and prepositions before filling in
the gaps. Template-based methods can generate grammatically correct captions. However,
templates are predefined and cannot generate variable-length captions. Moreover, later on, parsing
based language models have been introduced in image captioning which are more powerful than
fixed template-based methods. Therefore, in this paper, we do not focus on these template based
methods.
Captions can be retrieved from visual space and multimodal space. In retrieval-based
approaches, captions are retrieved from a set of existing captions. Retrieval based methods first
find the visually similar images with their captions from the training data set. These captions are
called candidate captions. The captions for the query image are selected from this pool of candidate captions.
These methods produce general and syntactically correct captions. However, they cannot generate
image specific and semantically correct captions.
Novel image captions are captions generated by the model from a combination of the image features and a language model, instead of matching an existing caption. Generating novel image captions solves both of the problems of using existing captions and as such is a much more interesting and useful problem. Novel captions can be generated from both visual space and multimodal space. A general approach in this category is to analyze the visual content of the image first and then generate image captions from that visual content using a language model. These methods can generate new captions for each image that are semantically more accurate than previous approaches. Most novel caption generation methods use deep machine learning based techniques; therefore, deep learning based novel image caption generating methods are our main focus in this literature review.
We categorize the deep learning-based methods as follows: visual space vs. multimodal space, dense captioning vs. captions for the whole scene, supervised learning vs. other deep learning, Encoder-Decoder architecture vs. Compositional architecture, and one "Others" group that contains attention-based, semantic concept-based, stylized, and novel object-based captioning. We also create a category named "LSTM vs. Others". A brief overview of the deep learning-based image captioning methods is shown in the table: it lists the name of each image captioning method, the type of deep neural network used to encode image information, and the language model used in describing the information. In the final column, we give a category label to each captioning technique based on this taxonomy.
Deep learning-based image captioning methods can generate captions from both visual
space and multimodal space. Understandably image captioning datasets have the corresponding
captions as text. In the visual space-based methods, the image features and the corresponding
captions are independently passed to the language decoder. In contrast, in a multimodal space case,
a shared multimodal space is learned from the images and the corresponding caption-text. This
multimodal representation is then passed to the language decoder.
VISUAL SPACE
Bulk of the image captioning methods use visual space for generating captions. In the visual
space-based methods, the image features and the corresponding captions are independently passed
to the language decoder.
MULTIMODAL SPACE
In multimodal space-based methods, a shared multimodal space is learned from the images and their corresponding caption text, and this representation is then passed to the language decoder.
In supervised learning, training data come with the desired output, called a label. Unsupervised learning, on the other hand, deals with unlabeled data; Generative Adversarial Networks (GANs) are a type of unsupervised learning. Reinforcement learning is another type of machine learning approach, in which an agent aims to discover data and/or labels through exploration and a reward signal. A number of image captioning methods use reinforcement learning and GAN based approaches; these methods sit in the category of "Other Deep Learning".
Supervised learning-based networks have successfully been used for many years in image classification, object detection and attribute learning. This progress has made researchers interested in using them for automatic image captioning. In this paper, we have identified a large number of supervised learning-based image captioning methods and classify them into different categories: (i) Encoder-Decoder Architecture, (ii) Compositional Architecture, (iii) Attention-based, (iv) Semantic concept-based, (v) Stylized captions, (vi) Novel object-based, and (vii) Dense image captioning.
In day-to-day life, the volume of unlabeled data keeps increasing because it is often impractical to annotate data accurately. Therefore, researchers have recently been focusing more on reinforcement learning and unsupervised learning-based techniques for image captioning.
In dense captioning, captions are generated for each region of the scene. Other methods
generate captions for the whole scene.
The previous image captioning methods can generate only one caption for the whole image.
They use different regions of the image to obtain information of various objects. However, these
methods do not generate region-wise captions. Johnson et al. [62] proposed an image captioning method called DenseCap, which localizes all the salient regions of an image and then generates descriptions for those regions. A typical method of this category has the following steps: (1) Region proposals are generated for the different regions of the given image. (2) A CNN is used to obtain the region-based image features. (3) The outputs of Step 2 are used by a language model to generate captions for every region. These steps form the block diagram of a typical dense captioning method.
Some methods use just simple vanilla encoder and decoder to generate captions. However,
other methods use multiple networks for it.
The neural network-based image captioning methods work as just simple end to end
manner. These methods are very similar to the encoder-decoder framework-based neural machine
translation [131]. In this network, global image features are extracted from the hidden activations
of CNN and then fed them into an LSTM to generate a sequence of words. A typical method of this
category has the following general steps:
(1) A vanilla CNN is used to obtain the scene type and to detect the objects and their relationships.
(2) The output of Step 1 is used by a language model to convert it into words and combined phrases that produce an image caption.
Compositional architecture-based methods, in contrast, compose the caption from several building blocks. A typical method of this category has the following steps:
(1) Image features are obtained using a CNN.
(2) Visual concepts (e.g. attributes) are obtained from the visual features.
(3) Multiple captions are generated by a language model using the information of Step 1 and Step 2.
(4) The generated captions are re-ranked using a deep multimodal similarity model to select high-quality image captions. This is the common structure of compositional network-based image captioning methods.
Image captioning intersects computer vision and natural language processing (NLP)
research. NLP tasks, in general, can be formulated as a sequence to sequence learning. Several
neural language models such as neural probabilistic language model , log-bilinear models , skip-
gram models , and recurrent neural networks (RNNs) have been proposed for learning sequence to
sequence tasks. RNNs have widely been used in various sequence learning tasks. However,
traditional RNNs suffer from vanishing and exploding gradient problems and cannot adequately
handle long-term temporal dependencies. LSTM networks are a type of RNN that has special units
in addition to standard units. LSTM units use a memory cell that can maintain information in
memory for long periods of time. In recent years, LSTM based models have dominantly been used
in sequence to sequence learning tasks. Another network, Gated Recurrent Unit (GRU) has a similar
structure to LSTM but it does not use separate memory cells and uses fewer gates to control the
flow of information. However, LSTMs ignore the underlying hierarchical structure of a sentence.
They also require significant storage due to long-term dependencies through a memory cell. In
contrast, CNNs can learn the internal hierarchical structure of the sentences and they are faster in
processing than LSTMs. Therefore, recently, convolutional architectures are used in other sequence
to sequence tasks, e.g., conditional image generation and machine translation. Inspired by the above
success of CNNs in sequence learning tasks, Gu et al. proposed a CNN language model-based image captioning method. This method uses a language-CNN for statistical language modelling. However, the method cannot model the dynamic temporal behaviour of language using a language-CNN alone, so it combines a recurrent network with the language-CNN to model the temporal dependencies properly. Aneja et al. proposed a convolutional architecture for the task of image captioning. They use a feed-forward network without any recurrent function. The architecture has four components: (i) input embedding layer, (ii) image embedding layer, (iii) convolutional module, and (iv) output embedding layer. It also uses an attention mechanism to leverage spatial image features. They evaluate their architecture on the challenging MSCOCO dataset and show performance comparable to an LSTM based method on standard metrics.
Abstract - Image Caption Generation has always been a study of great interest to the researchers
in the Artificial Intelligence department. Being able to program a machine to accurately describe
an image or an environment like an average human has major applications in the field of robotic
vision, business and many more. This has been a challenging task in the field of artificial
intelligence throughout the years. In this paper, we present different image caption generating
models based on deep neural networks, focusing on the various RNN techniques and analyzing
their influence on the sentence generation. We have also generated captions for sample images and
compared the different feature extraction and encoder models to analyse which model gives better
accuracy and generates the desired results.
In this project the Flickr8k dataset is used, which consists of 8,000 images. Data preprocessing is performed on these images and the dataset is split into train, test and validation sets.
Supervised learning is the most popular machine learning paradigm. It is easy to understand and very easy to use. It learns a function that maps inputs to outputs based on example input-output pairs, inferred from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (usually a vector) and the desired output value (also called the supervisory signal). The supervised learning algorithm analyzes the training data and produces an inferred function, which can then be used to map new examples. Supervised learning is very similar to teaching a child from provided data in the form of labeled examples: we feed the learning algorithm these example-label pairs one by one, allowing it to check whether it predicts the correct answer or not. Over time, the algorithm learns to approximate the exact nature of the relationship between examples and their labels. When fully trained, the supervised learning algorithm will be able to take a new, previously unseen example and predict a good label for it.
Unsupervised learning is a machine learning method where you do not supervise the model; instead, you let the model work on its own to discover information. It works with unlabeled data and looks for patterns that were not previously found in a data set that has no labels, with minimal human supervision. In contrast to supervised learning, which usually uses labeled data, unsupervised learning, also known as self-organization, allows a model to be built over the input data itself. A Neural Network (or Artificial Neural Network, ANN) has the ability to learn by example. An ANN is an information processing model inspired by the biological neuron system. ANNs are biologically inspired computational models built to perform specific sets of tasks such as clustering, segmentation, pattern recognition, etc. An ANN is made up of a large number of highly interconnected processing units known as neurons, which work together to solve problems. It follows a non-linear approach and processes information in parallel across all nodes. A neural network is a complex adaptive system: adaptive means it has the ability to change its internal structure by adjusting the input weights.
Deep learning is a branch of machine learning based entirely on artificial neural networks. It is an artificial intelligence technique that mimics the way the human brain processes data and creates patterns used in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) with networks that can learn, without supervision, from data that is unstructured or unlabeled. Because these networks have a large number of hidden layers, the approach is also known as deep neural learning or deep neural networks. Deep learning has evolved hand-in-hand with the digital age, which has brought an explosion of data across all genres and regions of the world. This data, known as big data, is drawn from sources such as social media, internet search engines, e-commerce platforms, and online cinemas, among others. This enormous amount of data is readily accessible and can be shared through applications such as cloud computing. However, the data, which is mostly unstructured, is so vast that it could take decades for humans to comprehend it and extract the relevant information. Companies are realizing the incredible power that can come from unraveling this wealth of information and are increasingly adopting AI systems for automated support. Deep learning learns from large amounts of unstructured data that would otherwise take humans decades to understand and process. It uses a hierarchy of neural network layers to carry out the machine learning process. The neural networks are built like the human brain, with neuron nodes connected together like a web. While traditional programs analyze data in a linear way, the hierarchical function of deep learning systems enables machines to process data in a non-linear way.
Algorithm Steps
Step 2: Download the spaCy English tokenizer and convert the text into tokens.
Step 4: Features are generated from the tokens, on which the LSTM is trained to generate the captions.
Step 5: A paragraph is generated by combining all the captions.
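As a concrete illustration of Step 2, the short sketch below tokenizes a caption with spaCy. It assumes the small English model (en_core_web_sm) has been installed separately; the sample caption is only a placeholder.

import spacy

# Assumes: "python -m spacy download en_core_web_sm" has been run beforehand
nlp = spacy.load("en_core_web_sm")

caption = "A black cat sat on the grass."          # placeholder caption
doc = nlp(caption)

# Keep lowercase word tokens and drop punctuation
tokens = [tok.text.lower() for tok in doc if not tok.is_punct]
print(tokens)   # ['a', 'black', 'cat', 'sat', 'on', 'the', 'grass']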
Data pre-processing - Images: Images are the input (X) to our model. Any input to the model should be given in the form of a vector, so we need to convert each image into a fixed-size vector that can be fed to the neural network. For this purpose, we use transfer learning with the InceptionV3 (Convolutional Neural Network) model created by Google Research. This model was trained on the ImageNet dataset to perform image classification into 1000 different classes. However, our goal here is not to classify the image but simply to obtain a fixed-length information vector for each image. This process is called automatic feature engineering.
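A minimal sketch of this feature-extraction step with the standard Keras InceptionV3 model is shown below; the use of tensorflow.keras and the file name are assumptions, and the 2048-element vector is what remains once the final classification layer is dropped.

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet")                               # pretrained classifier
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)   # drop the softmax layer

def encode_image(path):
    img = image.load_img(path, target_size=(299, 299))               # InceptionV3 expects 299x299 input
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x).reshape(2048)                          # fixed-length feature vector

feature_vector = encode_image("example.jpg")                         # hypothetical file name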
Data pre-processing – Captions: The captions are what we want to predict, so during training they are the target variable (Y) that the model learns to predict. However, a whole caption is not predicted at once for a given picture; we predict the caption word by word. Therefore, we need to encode each word as a fixed-size vector, which is done with the dictionaries "wordtoix" (pronounced "word to index") and "ixtoword" (pronounced "index to word").
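The sketch below shows one simple way such dictionaries could be built from the cleaned training captions; the two sample captions and the reservation of index 0 for padding are assumptions.

# Placeholder captions; in the project these come from the cleaned Flickr8k descriptions
captions = ["startseq the black cat sat on grass endseq",
            "startseq a dog runs across the field endseq"]

vocab = sorted({word for cap in captions for word in cap.split()})

wordtoix = {word: i + 1 for i, word in enumerate(vocab)}   # index 0 reserved for padding
ixtoword = {i: word for word, i in wordtoix.items()}

vocab_size = len(wordtoix) + 1                             # +1 accounts for the padding index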
Data pre-processing using a generator function: Let's take the first image vector, Image_1, and its corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is the input and the caption is what we need to predict. But the caption is predicted as follows: first, we provide the image vector and the first word as input and try to predict the second word, i.e. Input = Image_1 + 'startseq'; Output = 'the'. Then we provide the image vector and the first two words as input and try to predict the third word, i.e. Input = Image_1 + 'startseq the'; Output = 'black'.
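The tiny sketch below simply prints how this one caption unrolls into (partial sequence, next word) training pairs; it illustrates the idea rather than reproducing the project's actual generator code.

caption = "startseq the black cat sat on grass endseq"
words = caption.split()

# Each prefix of the caption becomes an input, and the word that follows it becomes the target
for i in range(1, len(words)):
    print("Input = Image_1 + '%s'  ->  Output = '%s'" % (" ".join(words[:i]), words[i]))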
Encoder Model:
The encoder model is primarily responsible for processing the captions of each image fed while
training. The output of the encoder model is again vectors of size 1*256 which would again be an
input to the decoder sequences.
The most important part of the encoder model is the LSTM (Long Short-Term Memory) layer. This layer helps the model learn how to generate valid sentences, producing the word with the highest probability of occurrence after a specific word is encountered. The activation function used is ReLU, a piecewise-linear activation function, and the output space is 256. For the comparison between the complete models VGG+GRU and VGG+LSTM, this particular layer is replaced by a GRU (Gated Recurrent Unit) layer and the results are analyzed; the output space of the GRU layer is the same, i.e. 256. Thus the only major difference between the two models is in the encoder part. The output of the LSTM layer is the output of the encoder.
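A hedged Keras sketch of this caption-encoder branch is given below. The 256-dimensional output space follows the text; vocab_size uses the Flickr8k value quoted later (7579), while max_length is a placeholder. Swapping the LSTM line for GRU(256) gives the VGG+GRU variant used in the comparison.

from tensorflow.keras.layers import Input, Embedding, LSTM

vocab_size, max_length = 7579, 34          # 7579 from the report; 34 is a placeholder

caption_input = Input(shape=(max_length,))
embedded = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_encoding = LSTM(256)(embedded)     # 1x256 vector summarizing the partial caption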
Decoder Model:
The decoder model concatenates the outputs of both the feature extraction model and the encoder model and produces the required output, which is the predicted word given an image and the sentence generated up to that point in time. As described above, the decoder model takes as input the outputs of the feature extraction model and the encoder model, both of which are vectors of dimension 256. The concatenated output is passed through a dense layer that uses the ReLU activation function. Another dense layer is added to the decoder model with the vocabulary size as its output space. The vocabulary size in Flickr8k was found to be 7579, and the activation function used is softmax, which outputs a word for the predicted integer. The predicted word is the output of the decoder layer. The model is trained with the following input-output mapping: the inputs are the image and the partial caption sequence, and the output is the predicted next word, given the image and the caption generated up to that point in time. When a caption is generated we calculate the BLEU score for each architecture. Four types of BLEU scores were computed: BLEU-1 (1.0, 0, 0, 0), BLEU-2 (0.5, 0.5, 0, 0), BLEU-3 (0.33, 0.33, 0.33, 0) and BLEU-4 (0.25, 0.25, 0.25, 0.25). We have used the cumulative weights since they give better output.
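The sketch below assembles the pieces described above into one Keras model: a 256-unit dense layer over the image features, the embedding+LSTM caption branch, concatenation, a ReLU dense layer, and a softmax layer over the 7579-word vocabulary. Layer sizes follow the text; the 4096-dimensional feature input, max_length, the optimizer and the toy BLEU example at the end are assumptions.

from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, concatenate
from tensorflow.keras.models import Model
from nltk.translate.bleu_score import corpus_bleu

vocab_size, max_length, feature_dim = 7579, 34, 4096   # 34 and 4096 are assumptions

# Image-feature branch (output: 1x256)
img_input = Input(shape=(feature_dim,))
fe = Dense(256, activation="relu")(img_input)

# Caption-encoder branch (output: 1x256)
cap_input = Input(shape=(max_length,))
se = Embedding(vocab_size, 256, mask_zero=True)(cap_input)
se = LSTM(256)(se)

# Decoder: concatenate both branches, dense ReLU layer, softmax over the vocabulary
dec = Dense(256, activation="relu")(concatenate([fe, se]))
next_word = Dense(vocab_size, activation="softmax")(dec)

model = Model(inputs=[img_input, cap_input], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Cumulative BLEU-4 on a toy reference/candidate pair, using the (0.25, 0.25, 0.25, 0.25) weights
references = [[["a", "black", "cat", "sat", "on", "grass"]]]
candidate = [["the", "black", "cat", "sat", "on", "the", "grass"]]
print(corpus_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)))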
Conclusion - We have presented a deep learning model that automatically generates image captions with the goal of not only describing the surrounding environment but also helping visually impaired people better understand their environments. The described model is based upon a CNN feature extraction model that encodes an image into a vector representation, followed by an RNN decoder model that generates corresponding sentences based on the learned image features. We have compared various encoder-decoder models to see how each component influences caption generation and have also demonstrated various use cases of our system. The results show that the LSTM model generally works slightly better than the GRU, although it takes more time for training and sentence generation due to its complexity. The performance is also expected to increase when using a bigger dataset and training on a larger number of images. Because of the considerable accuracy of the generated image captions, visually impaired people can greatly benefit and get a better sense of their surroundings using the text-to-speech technology that we have incorporated as well.
To be used efficiently, all computer software needs certain hardware components or other software resources to be present on a computer. These prerequisites are known as system requirements and are often used as guidelines rather than absolute rules. Most software defines two sets of system requirements: minimum and recommended. With the increasing demand for higher processing power and resources in newer versions of software, system requirements tend to increase over time. Industry analysts suggest that this trend plays a bigger part in driving upgrades to existing computer systems than technological advancements.
Processor:
o Minimum: Intel Core i5 (8th Gen) or AMD Ryzen 5 (equivalent)
o Recommended: Intel Core i7/i9 (10th Gen or higher) or AMD Ryzen 7/9
Memory (RAM):
o Minimum: 8GB
o Recommended: 16GB or more (for smoother training and inference, especially for
deep learning models)
Storage:
o Minimum: 100GB free space (for dataset, models, and intermediate files)
o Recommended: 500GB or more (especially for larger datasets and deep learning
models)
Graphics Processing Unit (GPU):
o Minimum: NVIDIA GTX 1060, AMD equivalent (for smaller models and light
inference)
Operating System:
o Windows 10 or higher
o macOS 10.15 (Catalina) or higher
o Linux-based OS (Ubuntu 18.04+ or CentOS)
Containerization/Virtualization (optional but recommended for consistent
environments):
o Docker (to create isolated environments)
o Virtualenv/conda (for Python dependency management)
3.3. Libraries
Key libraries for image caption generation include TensorFlow or PyTorch for deep learning,
with Keras for higher-level API integration. OpenCV and Pillow are required for image processing,
while NLTK or spaCy handle text preprocessing. The Hugging Face Transformers library can be
used for leveraging pre-trained models like CLIP for caption generation.
o transformers (by Hugging Face): For using pretrained models such as GPT, BERT,
and CLIP for caption generation.
Pretrained Models:
o CLIP (Contrastive Language-Image Pretraining): A model from OpenAI that can
understand images and text jointly. This can be fine-tuned for caption generation.
o Image Captioning Models: Pretrained models such as Show, Attend and Tell, Up-
Down Models, or Bottom-Up Top-Down Models can be used for image captioning
tasks.
Data Management Libraries:
o NumPy: For numerical operations and handling image tensors.
o Pandas: For data manipulation and handling datasets, especially captions and
metadata.
o H5py: For saving large datasets (e.g., images and feature vectors).
Python:
o Primary language for deep learning and data science tasks.
o Popular libraries such as TensorFlow, PyTorch, OpenCV, and transformers are all
Python-based.
JavaScript/TypeScript (optional, for web deployment):
o To build web applications for serving the captioning models (using frameworks like
React.js or Node.js).
Image Preprocessing:
o The system should be able to preprocess images, such as resizing, cropping,
normalization, and augmentations.
Text Preprocessing:
o Ability to tokenize captions, remove stop words, and handle other NLP tasks like
stemming and lemmatization.
Model Training:
o The system should be able to train deep learning models for image caption
generation, which involves using CNN (for image features) and RNN (for
generating captions) or transformers.
Caption Generation:
o The system must be able to generate captions for a given input image using the
trained model.
Evaluation Metrics:
o Implement standard evaluation metrics like BLEU, METEOR, or CIDEr to assess
the quality of the generated captions.
Model Inference:
o The system should support inference with trained models for generating captions on
new images.
o Optionally, the system should allow for the use of pretrained models for faster
inference, such as using CLIP or Show and Tell.
API/Interface:
o Provide a simple API or interface for feeding images and receiving captions. This
can be a RESTful API or a CLI.
Scalability:
o The system should handle varying input sizes (both in terms of images and number
of requests).
o It should support the scalability to handle more complex models and larger datasets.
Performance:
o The image captioning system should generate captions in real-time or within an
acceptable latency (depending on use case).
Accuracy:
o The caption generation model should provide high-quality, accurate, and
contextually relevant captions.
Robustness:
o The system should be robust enough to handle images with various qualities (e.g.,
low resolution, varying lighting conditions).
Usability:
o User interface (UI) should be intuitive for users (if it's a web-based application).
o Should provide clear error messages and feedback.
Maintainability:
o The codebase should be well-documented and modular, allowing easy updates,
improvements, and troubleshooting.
Security:
o If the model is exposed via an API, it should have adequate security mechanisms to prevent misuse or malicious attacks.
Extensibility:
o The system should allow for easy integration of new datasets or the addition of new
models for improved accuracy.
Deployment:
o The system should be deployable on various platforms (local, cloud, or edge
devices) depending on the use case.
Compliance:
o If necessary, ensure the system adheres to privacy and ethical standards, especially
regarding data usage and model fairness.
ARCHITECTURE
4.1 Introduction
This project is built on CNN and LSTM models, which act as the platform for generating sentences from a simple image, and the approach can be applied across a wide range of applications.
Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech or audio signal inputs. They have three main types of layers, which
are:
Convolutional layer
Pooling layer
Fully-connected (FC) layer
Convolutional layer
The convolutional layer is the first layer of a convolutional network. While convolutional layers
can be followed by additional convolutional layers or pooling layers, the fully-connected layer is
the final layer. With each layer, the CNN increases in its complexity, identifying greater portions
of the image. Earlier layers focus on simple features, such as colors and edges. As the image data
progresses through the layers of the CNN, it starts to recognize larger elements or shapes of the
object until it finally identifies the intended object.
The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter and a feature map.
Let's assume that the input will be a color image, which is made up of a matrix of pixels in 3D.
This means that the input will have three dimensions—a height, width and depth—which
correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter,
which will move across the receptive fields of the image, checking if the feature is present. This
process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which represents part of the
image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the
size of the receptive field. The filter is then applied to an area of the image, and a dot product is
calculated between the input pixels and the filter. This dot product is then fed into an output array.
Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the
entire image. The final output from the series of dot products from the input and the filter is known
as a feature map, activation map or a convolved feature.
The weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters, such as the weight values, adjust during training through
the process of backpropagation and gradient descent. However, there are three hyperparameters
which affect the volume size of the output that need to be set before the training of the neural
network begins.
These include:
1. The number of filters affects the depth of the output. For example, three distinct filters would
yield three different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While
stride values of two or greater are rare, a larger stride yields a smaller output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements
that fall outside of the input matrix to zero, producing a larger or equally sized output. There are
three types of padding:
Valid padding: This is also known as no padding. In this case, the last convolution is
dropped if dimensions do not align.
Same padding: This padding ensures that the output layer has the same size as the input
layer.
Full padding: This type of padding increases the size of the output by adding zeros to the
border of the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation
to the feature map, introducing nonlinearity to the model.
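The toy NumPy sketch below illustrates the operation described above: a 3x3 filter slides over a single-channel input with a given stride, each dot product fills one cell of the feature map, and ReLU is applied afterwards. It is purely didactic; the filter values and the 6x6 input are arbitrary.

import numpy as np

def conv2d(image, kernel, stride=1):
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)        # dot product of patch and filter
    return out

image = np.random.rand(6, 6)                          # arbitrary single-channel input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                       # 3x3 vertical-edge filter
feature_map = np.maximum(conv2d(image, kernel), 0)    # ReLU non-linearity
print(feature_map.shape)                              # (4, 4): "valid" padding shrinks the output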
Pooling layer
Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the
number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps
a filter across the entire input, but the difference is that this filter does not have any weights. Instead,
the kernel applies an aggregation function to the values within the receptive field, populating the
output array. There are two main types of pooling:
Max pooling: As the filter moves across the input, it selects the pixel with the maximum
value to send to the output array. As an aside, this approach tends to be used more often
compared to average pooling.
Average pooling: As the filter moves across the input, it calculates the average value within
the receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also has a number of benefits to the CNN.
They help to reduce complexity, improve efficiency, and limit risk of overfitting.
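A small NumPy sketch of the two pooling variants follows; the 4x4 input and the 2x2 window with stride 2 are arbitrary choices for illustration.

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))    # keeps the largest value in each 2x2 window
print(pool2d(x, mode="avg"))    # averages each 2x2 window instead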
Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values
of the input image are not directly connected to the output layer in partially connected layers.
However, in the fully-connected layer, each node in the output layer connects directly to a node in
the previous layer.
This layer performs the task of classification based on the features extracted through the previous
layers and their different filters. While convolutional and pooling layers tend to use ReLU functions,
FC layers usually leverage a softmax activation function to classify inputs appropriately, producing
a probability from 0 to 1.
Figure 4.4.1: CNN
LSTM networks introduce memory cells, which have the ability to retain information over long
sequences. Each memory cell has three main components: an input gate, a forget gate, and an output
gate. These gates help regulate the flow of information in and out of the memory cell.
The input gate determines how much of the new input should be stored in the memory cell. It takes
the current input and the previous hidden state as inputs, and outputs a value between 0 and 1 for
each element of the memory cell.
The forget gate decides which information to discard from the memory cell. It takes the current input
and the previous hidden state as inputs, and outputs a value between 0 and 1 for each element of the
memory cell. A value of 0 means the information is ignored, while a value of 1 means it is retained.
The output gate controls how much of the memory cell's content should be used to compute the
hidden state. It takes the current input and the previous hidden state as inputs, and outputs a value
between 0 and 1 for each element of the memory cell.
By using these gates, LSTM networks can selectively store, update, and retrieve information over
long sequences. This makes them particularly effective for tasks that require modeling long-term
dependencies, such as speech recognition, language translation, and sentiment analysis.
1. Forget Gate:
Determines what information to discard from the cell state.
It takes input (the current time step and the previous hidden state) and produces a number between 0 and 1 for each number in the cell state: 1 represents "completely keep this" while 0 represents "completely get rid of this".
2. Input Gate:
a. A sigmoid layer (the "input gate layer") that decides which values to update.
b. A tanh layer (which creates a vector of new candidate values to add to the cell state).
3. Output Gate:
Determines the next hidden state based on the updated cell state.
Filters the information that the LSTM will output based on the updated cell state.
Cell State:
The cell state runs straight down the entire chain of the LSTM, with only some minor linear interactions. It is the core differentiator in LSTMs that allows them to maintain and control long-term dependencies.
Hidden State:
The LSTM's output at a particular time step, based on the cell state.
LSTMs use these gates to regulate the flow of information, which allows them to learn long-term
dependencies in data, making them particularly effective for tasks involving sequential data like
time series prediction, natural language processing, speech recognition, and more.
By controlling and memorizing information over long sequences, LSTMs can mitigate the problems
of vanishing and exploding gradients, enabling more effective training and better capturing of long-
term patterns in sequential data.
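The following NumPy sketch walks through a single LSTM time step with the three gates described above. The weight matrices are random placeholders rather than trained parameters, and the parameter layout is one common convention, not the only one.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Pre-activations for the forget (f), input (i), output (o) gates and the candidate (g)
    z = {k: W[k] @ x_t + U[k] @ h_prev + b[k] for k in "fiog"}
    f = sigmoid(z["f"])              # forget gate: what to discard from the old cell state
    i = sigmoid(z["i"])              # input gate: how much of the candidate to store
    o = sigmoid(z["o"])              # output gate: how much of the cell to expose
    g = np.tanh(z["g"])              # candidate values
    c_t = f * c_prev + i * g         # updated cell state
    h_t = o * np.tanh(c_t)           # new hidden state
    return h_t, c_t

n_in, n_hidden = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_hidden, n_in)) for k in "fiog"}
U = {k: rng.standard_normal((n_hidden, n_hidden)) for k in "fiog"}
b = {k: np.zeros(n_hidden) for k in "fiog"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)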
Sequences: LSTM models work with sequences of data. Organize your input data into
sequences of fixed length. For instance, in the context of time series data, if you have daily data,
you might create sequences of, say, 10 days' worth of data as one input sequence.
Reshape your data to be in a 3D format: (samples, time steps, features). For instance, if your
data is in the form of a 2D matrix (samples, features), you'll need to reshape it so that the
LSTM can interpret it as sequences of data.
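A short sketch of that reshaping step, using the 10-day window mentioned above; the 100x3 matrix is dummy data.

import numpy as np

daily = np.random.rand(100, 3)        # 100 days of data, 3 features per day
time_steps = 10

# Build overlapping windows of 10 consecutive days each
sequences = np.stack([daily[i:i + time_steps] for i in range(len(daily) - time_steps + 1)])
print(sequences.shape)                # (91, 10, 3) -> (samples, time steps, features)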
Figure 4.5.2: LSTM
● CNN is used for extracting features from the image. We will use the pre-trained model
Xception.
● LSTM will use the information from CNN to help generate a description of the image.
● Import Libraries
The image must be converted to suitable features so that it can be used to train a deep learning model; feature extraction is a mandatory step for training any image in a deep learning model. The features are extracted using a Convolutional Neural Network (CNN), namely the Visual Geometry Group (VGG-16) model. This model achieved top results in the ImageNet Large Scale Visual Recognition Challenge at classifying images into one of the 1000 classes given in the challenge. Hence, it is well suited to this project, as image captioning requires the identification of image content. VGG-16 has 16 weight layers, and the deeper stack of layers helps in better feature extraction from images. The network uses 3x3 convolutional layers, making its architecture simple, and uses max pooling layers in between to reduce the volume size of the image. The last layer of the network, which predicts the classification, is removed, and the internal representation of the image just before classification is returned as the feature. The input image should have dimensions 224x224, and the model extracts features of the image and returns a 1-dimensional, 4096-element vector.
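A hedged sketch of this VGG-16 feature extraction with the standard Keras weights is shown below; removing the final classification layer leaves the 4096-element activation described above. The image file name is a placeholder.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet")
feature_model = Model(inputs=vgg.input, outputs=vgg.layers[-2].output)   # drop the softmax layer

img = image.load_img("example.jpg", target_size=(224, 224))              # VGG-16 expects 224x224 input
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
feature = feature_model.predict(x).reshape(4096)                         # 1-D, 4096-element vector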
The Flickr8k dataset contains multiple descriptions for each image. In the data preparation phase, each image id is taken as a key and its corresponding captions are stored as the values in a dictionary.
In order to use the text dataset in machine learning or deep learning models, the raw text must be converted to a usable format. The following text cleaning steps are performed before it is used in the project:
● Removal of punctuation.
● Removal of numbers.
● Removal of single-character words.
● Conversion of uppercase to lowercase characters.
Stop words are not removed from the text data, as that would hinder the generation of the grammatically complete captions needed for this project.
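The sketch below combines the data-preparation and text-cleaning steps above: each image id becomes a dictionary key, its captions become the values, and every caption is lowercased and stripped of punctuation, numbers and single-character words. The Flickr8k token file format (image_id#n, a tab, then the caption) and the added startseq/endseq markers are assumptions consistent with the earlier examples.

import string

def load_and_clean(token_file):
    table = str.maketrans("", "", string.punctuation)
    descriptions = {}
    with open(token_file) as f:
        for line in f:
            if "\t" not in line:
                continue
            img_id, caption = line.strip().split("\t", 1)
            img_id = img_id.split("#")[0]                              # drop the "#0".."#4" suffix
            words = caption.lower().translate(table).split()           # lowercase, strip punctuation
            words = [w for w in words if len(w) > 1 and w.isalpha()]   # drop numbers and 1-letter words
            descriptions.setdefault(img_id, []).append("startseq " + " ".join(words) + " endseq")
    return descriptions

descriptions = load_and_clean("Flickr8k.token.txt")                    # assumed file name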
Flickr_8k_text – Dataset folder which contains text files and captions of images. The below
files will be created by us while making the project.
Descriptions.txt – This text file contains all image names and their captions after
preprocessing.
Features.p – Pickle object that contains an image and their feature vector extracted from the
Xception pre-trained CNN model.
To train the model, we will use the 6000 training images, generating the input and output sequences in batches and fitting them to the model using the model.fit_generator() method. We also save the model to our models folder. This will take some time depending on your system's capability.
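A hedged sketch of this training loop is shown below. It assumes the descriptions dictionary, the extracted features, wordtoix, max_length, vocab_size and the compiled model from the earlier steps; the epoch count and the one-image-per-batch choice are placeholders.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, features, wordtoix, max_length, vocab_size):
    while True:                                            # loop endlessly over the training set
        for img_id, caps in descriptions.items():
            X_img, X_seq, y = [], [], []
            for cap in caps:
                seq = [wordtoix[w] for w in cap.split() if w in wordtoix]
                for i in range(1, len(seq)):
                    X_img.append(features[img_id])
                    X_seq.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                    y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
            yield [np.array(X_img), np.array(X_seq)], np.array(y)

epochs, steps = 10, 6000                                   # one batch per training image
for e in range(epochs):
    generator = data_generator(descriptions, features, wordtoix, max_length, vocab_size)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_%d.h5" % e)                   # save after each epoch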
Architecture        Role                               Typical Accuracy
CNN                 Visual feature extraction          65-75% (when used alone)
LSTM                Sequence/language generation       70-85% (when used in combination)
7.1 Summary
In this overview we combined all components of the image caption generation problem, addressed the model frameworks proposed in recent years to handle the description task, concentrated on the algorithmic essence of various attention methods, and summarized how the attention mechanism is implemented. The large datasets and evaluation criteria that are regularly utilized in practice are also summarized. Although image captioning can be used for image retrieval [92], video captioning [93, 94], and video movement [95], and a wide range of image caption systems are currently available, experimental results suggest that this task still requires higher-performance systems and further improvement.
7.2 Conclusion
The CNN-LSTM model was created to automatically generate captions for the input
images. This concept can be used in a wide range of situations. We learned about the CNN
model, and LSTM models, and how to overcome previous limitations in the field of graphical
image captioning by building a CNN-LSTM model capable of scanning and extracting
information from any input image and transforming it into a single line sentence in natural
language English.
The attention algorithm and how the attention mechanism is used were the main topics of discussion. I was able to successfully create a model that is a major improvement over earlier image caption generators.
Since image understanding and similarity calculation in images are challenging in this domain, there is tremendous scope for further research in the future. Current image retrieval systems compute similarity using features such as color, tags, histograms, etc. These methodologies cannot give completely accurate results because they do not depend on the context of the image. Hence, research on image retrieval that makes use of the context of images, such as image captioning, could help solve this problem in the future. This project can be further enhanced to improve the identification of classes that have lower precision by training it with more image captioning datasets. This methodology can also be combined with previous image retrieval methods based on histograms, shapes, etc., to check whether the image retrieval results improve.
REFERENCES
[1] R. Subash (November 2019): Automatic Image Captioning Using Convolutional Neural Networks and LSTM.
[2] Seung-Ho Han, Ho-Jin Choi (2020): Domain-Specific Image Caption Generator with Semantic Ontology.
[3] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra and Nand Kumar Bansode (2017): Camera2Caption: A Real-Time Image Caption Generator.
[4] Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares (June 2019): Image Captioning: Transforming Objects into Words.
[5] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Dr.
[6] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (2015): Show and Tell: A Neural Image Caption Generator.
[7] Jianhui Chen, Wenqiang Dong, Minchen Li (2015): Image Caption Generator Based on Deep Neural Networks.
[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2017): Bottom-Up and Top-Down Attention for Image Captioning.