Image Captioning Using Deep Learning
Guided by
Mr. Saurabh Rastogi
The explosion of digital media has led to a massive amount of visual content being created and
shared online. Images and videos have become an integral part of social media platforms, with
people sharing and consuming them more than ever before. However, this trend has posed
significant challenges for individuals who are visually impaired or blind, as they are unable to
perceive and understand the visual content. For them, the online world is mostly textual, which
limits their access to information and entertainment. Additionally, even for individuals with
normal vision, there are times when it is challenging to comprehend the context or identify
relevant information within an image quickly. Thus, there is a need to develop a system that can
accurately describe the visual content of images using natural language. This will enable
individuals who are visually impaired to understand and appreciate visual content, and also help
people who do not have time to carefully inspect images to quickly identify the relevant
information they need. The development of such a system is critical for creating a more inclusive
and accessible online world.
In this project, we explore the application of deep learning techniques to generate captions for
images. Deep learning is a subfield of machine learning that enables machines to learn and
improve from experience by feeding large amounts of data into a neural network. The neural
network consists of layers of interconnected nodes that process input data and produce output
predictions. The main goal of this project is to develop an image captioning system that can
understand the content of images and generate captions that accurately describe what is depicted
in the image. The system will utilize deep learning techniques such as convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), which are the current state-of-the-art in
image captioning. The use of deep learning techniques in image captioning has significantly
improved the accuracy of image description. CNNs are used to extract features from the image,
while RNNs generate the corresponding natural language descriptions. The CNNs and RNNs are
trained on large datasets of images and their corresponding captions, which allows the system to
learn to generate accurate captions based on the content of the image.
The development of an accurate image captioning system has numerous applications in various
fields, including healthcare, education, content creation on social media, story narration and
security. In healthcare, the system can be used to assist doctors in diagnosing medical conditions
by analyzing medical images and generating accurate captions. In education, the system can be
used to provide visually impaired students with access to visual content in textbooks and other
educational materials. In entertainment, the system can be used to automatically generate
captions for videos and images shared on social media platforms. Image captioning can also be
used in content creation on social media to assist creators in generating creative, unique and
appropriate captions for their images, videos and posts in order to attract more viewers and
improve their engagement with their audience. In security, the system can be used to detect
violence, robbery and other similar scenarios by generating captions for frames from a live video
feed and comparing them with keywords from the training dataset.
In conclusion, the development of an accurate image captioning system using deep learning
techniques has the potential to transform the way we perceive visual content. The system will
enable individuals who are visually impaired to access and understand visual content in a more
meaningful way and provide quick access to relevant information within images for everyone.
The project will contribute significantly to the field of computer vision and natural language
processing and has a wide range of applications in various fields.
Objective
● Exploring different techniques for incorporating contextual and semantic information into
the image captioning models. This may include leveraging pre-trained models such as
BERT or GPT or using our own model.
● Developing and training deep learning models that can accurately generate captions for a
given image.
● Evaluating the performance of the image captioning models using evaluation metrics such
as BLEU. This involves comparing the generated captions to the reference captions and
assessing the quality of the generated captions in terms of accuracy.
Literature Survey
In 2015, Vinyals et al. introduced a neural network-based model for image captioning, called
Show and Tell. The model used a convolutional neural network (CNN) to extract image features,
and a long short-term memory (LSTM) network to generate captions. The model was trained on
the COCO dataset and achieved state-of-the-art performance.
In 2015, Xu et al. proposed an attention-based model for image captioning, called Show, Attend
and Tell. The model used a soft attention mechanism to selectively focus on different regions of the
image while generating the caption. The model was trained on the Flickr30k dataset and
outperformed the previous state-of-the-art methods.
In 2017, Anderson et al. introduced a bottom-up and top-down attention mechanism for image
captioning, called Up-Down. The model first generated a set of image features using a bottom-up
approach, and then used a top-down attention mechanism to focus on different parts of the image
while generating the caption. The model was trained on the COCO dataset and achieved
state-of-the-art performance.
In 2018, Lu et al. proposed a dual attention network for image captioning, called DA-Net. The
model used both spatial and channel-wise attention mechanisms to selectively focus on different
regions and features of the image while generating the caption. The model was trained on the
COCO dataset and outperformed the previous state-of-the-art methods.
Feasibility Study
Image captioning involves generating a natural language description of an image, which has
many potential applications in areas such as assistive technology, image search, and content
generation. The goal of this project is to develop an image captioning system that can accurately
describe a wide range of images. There has been significant research in the field of image
captioning in recent years, with many deep learning-based models achieving impressive results.
However, there is still room for improvement, particularly in accurately describing complex
scenes and generating captions that are both informative and natural-sounding. The development
of a high-quality image captioning system could have many potential benefits, including
improving accessibility for visually impaired individuals and enhancing the search capabilities of
image-based platforms.
Based on our review of the literature, there is a significant need for continued research in the
field of image captioning, particularly in accurately describing complex scenes and generating
natural-sounding captions. There are several existing datasets that can be used for training an
image captioning model, although it may be necessary to augment these datasets with additional
images and captions to ensure sufficient coverage of different types of images. The
computational resources required for training an image captioning model are significant,
although they can be obtained through cloud computing services or other means.
There are several potential limitations and challenges of an image captioning system that should
be considered, including the need for human input to evaluate the quality of generated captions
and the difficulty of accurately describing complex scenes. However, these challenges can be
addressed through the use of human evaluation metrics and the development of more
sophisticated deep learning models.
The potential applications of an image captioning system are numerous, including aiding visually
impaired individuals, improving image search capabilities, and generating captions for social
media or other platforms. These applications can be evaluated through user testing and other
methods.
Methodology
Semantic segmentation in the context of image analytics enables us to identify the objects in an
image, but it falls short of describing the relationships between those objects using verbs or
contextual information. For instance, a security camera may pick up a person and a car but fail to
indicate that the person is breaking into the car. Automatic caption generation lets us recognise
such events, and the generated captions can be used to prompt users to view the photos or videos
and take appropriate action. In this study, we constructed the system using a CNN-LSTM
architecture and compared our findings with GPT-2-generated captions. Similar work in this field
has used state-of-the-art transformers to generate pertinent captions, which inspired our group to
attempt a solution that addresses this surveillance business problem. When using surveillance
footage systems, a user must frequently keep an eye on several screens at once and is expected to
act appropriately upon noticing anything questionable. This demands reliable and accurate
multitasking, and it is unrealistic to expect a single individual to reliably watch multiple screens at
once while also keeping an eye out for odd behaviour. To solve this issue, an image captioning
system that looks at such frames and assigns a caption to each of them can be used. The generated
captions can then be used to alert the concerned parties.
Data Collection
We selected the following Kaggle dataset to use for training our model:
https://www.kaggle.com/datasets/kunalgupta2616/flickr-8k-images-with-captions. There are
8092 photos in this dataset that were taken from Flickr. Each photograph is paired with five
captions, which serve as the target labels and are stored in a CSV file. This CSV file is well
structured, with columns for the image filename and its captions.
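As a rough illustration, the captions file can be loaded and the five captions grouped per image with pandas; the column names used below (image and caption) are assumptions and may need to be adjusted to match the actual file:

```python
import pandas as pd

# Load the captions CSV; the column names "image" and "caption" are assumed
# and may need adjusting to match the actual Kaggle file.
df = pd.read_csv("captions.csv")

# Group the five captions belonging to each image filename.
captions_per_image = df.groupby("image")["caption"].apply(list).to_dict()

print(len(captions_per_image))                 # expected: roughly 8092 images
print(next(iter(captions_per_image.items())))  # one image with its captions
```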
Although the dataset's size was sufficient to train a reasonably accurate image captioning model,
it did not contain enough surveillance-related images. To address this class imbalance and to make
the model learn about dangers such as armed robbery, guns and knives, we incorporated an
additional 568 surveillance photos covering themes like weapons, knives and crime. These images
had to be captioned manually. This exercise ensured we had a sufficient number of photographs to
train the model on dangers and emergency situations.
Data Pre-Processing
The captions for each image first had to be cleaned and prepared. During this process, all captions
were lowercased, and punctuation and other special characters were removed. The next step was to
tokenize the captions and build a vocabulary. A vocabulary is a list of key-value pairs, each
containing a word and its corresponding token index. We also recorded the number of times each
word appeared in the captions; a word was added to the vocabulary only if it appeared more than
five times.
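A minimal sketch of this cleaning and vocabulary-building step is shown below; the function names, special tokens, and exact regular expression are illustrative choices rather than the project's actual code:

```python
import re
from collections import Counter

def clean_caption(caption):
    # Lowercase the caption and strip punctuation/special characters,
    # keeping only letters and spaces, then split into word tokens.
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", "", caption)
    return caption.split()

def build_vocab(all_captions, min_count=5):
    # Count how often each word appears across every cleaned caption.
    counts = Counter(word for cap in all_captions for word in clean_caption(cap))
    # Keep only words that occur more than `min_count` times, as described above.
    frequent = sorted(w for w, c in counts.items() if c > min_count)
    # Map each word to a token index, reserving a few special tokens.
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word in frequent:
        vocab[word] = len(vocab)
    return vocab
```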
Model Architecture
Although it is a difficult task, the ability of a machine to automatically describe the objects in a
picture, together with their relationships or the activity being performed, using a learnt language
model is crucial in many fields. Besides naming the objects in the picture, the generated description
should also capture their attributes, connections, and functions. The generated caption must also be
written in a language that is common to humans, such as English.
Convolutional Neural Networks (CNNs), a class of deep learning algorithms, take in an input
image and learn to weight its various features and objects so that images can be distinguished from
one another; here, a CNN is used to extract features from the image. Long Short-Term Memory
(LSTM) networks are a type of Recurrent Neural Network (RNN) capable of learning order
dependence in sequence prediction problems. LSTM is chosen over a plain RNN because of the
vanishing and exploding gradient problems in RNNs: generating text requires remembering a large
amount of historical information, which makes LSTM better suited for this purpose. Since the
phrases in a caption are simply sequences of words, the LSTM is used to predict the next word.
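One common way to realise this CNN-LSTM design is the "merge" architecture sketched below in Keras, where precomputed CNN features and the partial caption are combined to predict the next word. The layer sizes, vocabulary size, caption length, and the use of a 2048-dimensional feature vector (as produced by, for example, a pretrained InceptionV3 with its classification head removed) are illustrative assumptions rather than the exact configuration used in this project:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length (in tokens)

# Image branch: a 2048-d feature vector extracted beforehand by a pretrained CNN.
image_input = Input(shape=(2048,))
image_features = Dropout(0.5)(image_input)
image_features = Dense(256, activation="relu")(image_features)

# Text branch: the partial caption so far, passed through an embedding and an LSTM.
caption_input = Input(shape=(max_length,))
caption_embedding = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = Dropout(0.5)(caption_embedding)
caption_features = LSTM(256)(caption_features)

# Merge both branches and predict the next word of the caption.
decoder = add([image_features, caption_features])
decoder = Dense(256, activation="relu")(decoder)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time, the caption is generated word by word: the model is fed the image features together with the words predicted so far, and the process repeats until an end token is produced or the maximum length is reached.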
Model Training
The next step is to train the model on the preprocessed dataset. This involves feeding the
preprocessed images and captions to the model and adjusting the weights to minimize the loss
function. As the loss decreases, we can observe the model beginning to learn the word sequences
and how they relate to the CNN output. Training can take a few hours to complete because it is a
computationally demanding task.
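Continuing the architecture sketch above, the training step might look as follows; the placeholder arrays stand in for the real preprocessed image features, caption token sequences, and one-hot next-word targets, and the epoch count and batch size are assumptions:

```python
import numpy as np

# Placeholder data standing in for the preprocessed dataset.
num_samples = 1000
X_image = np.random.rand(num_samples, 2048).astype("float32")             # CNN features
X_seq = np.random.randint(0, vocab_size, size=(num_samples, max_length))  # padded token ids
next_word = np.random.randint(0, vocab_size, size=num_samples)
y_word = np.zeros((num_samples, vocab_size), dtype="float32")
y_word[np.arange(num_samples), next_word] = 1.0                           # one-hot targets

# Fit the merged CNN-LSTM model defined above; the loss should fall steadily
# as the model learns to predict the next caption word.
model.fit([X_image, X_seq], y_word, epochs=20, batch_size=64, validation_split=0.1)
```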
Evaluation Metrics
Once the model is trained, it needs to be evaluated on a separate test dataset to check its
performance. This can be done by calculating metrics such as BLEU. The BLEU (Bilingual
Evaluation Understudy) score is a measure originally devised to assess machine-translated text;
more generally, it compares a sentence produced by a machine against one or more reference
sentences. We chose this metric because it is fast, computationally cheap, simple to understand,
and widely used to evaluate image captioning models. The BLEU score is a number between 0 and
1: a score of 0 means the machine-generated text has no overlap with the reference text, while a
score of 1 means it overlaps perfectly with the reference text.
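For illustration, the BLEU score between a generated caption and its reference captions can be computed with NLTK as sketched below; the sentences are made up purely for demonstration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference captions and a machine-generated caption for one image.
references = [
    "a dog is running across the grass".split(),
    "a brown dog runs through a green field".split(),
]
candidate = "a dog runs across the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")
```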
References
[1] Shuang Liu, Liang Bai, Yanli Hu and Haoran Wang, "Image Captioning Based on Deep
Neural Networks", MATEC Web of Conferences, Vol. 232, EITCE 2018.
[3] Vinyals, Oriol, et al., "Show and Tell: A Neural Image Caption Generator", IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2015.
[4] Farhadi, Ali, et al., "Every Picture Tells a Story: Generating Sentences from Images",
European Conference on Computer Vision (ECCV), Springer, Berlin, Heidelberg, 2010.
[5] He, Kaiming, et al., "Deep Residual Learning for Image Recognition", IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[7] Fang, H., et al., "From Captions to Visual Concepts and Back", IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 1473-1482, 2015.
[8] Gu, J., Cai, J., Wang, G. and Chen, T., "Stack-Captioning: Coarse-to-Fine Learning for Image
Captioning", AAAI Conference on Artificial Intelligence (AAAI-18), 2018.