Image Captioning Using Deep Learning
Guided by
Mr. Saurabh Rastogi
The explosion of digital media has led to a massive amount of visual content being created and
shared online. Images and videos have become an integral part of social media platforms, with
people sharing and consuming them more than ever before. However, this trend has posed
significant challenges for individuals who are visually impaired or blind, as they are unable to
perceive and understand the visual content. For them, the online world is mostly textual, which
limits their access to information and entertainment. Additionally, even for individuals with
normal vision, there are times when it is challenging to comprehend the context or identify
relevant information within an image quickly. Thus, there is a need to develop a system that can
accurately describe the visual content of images using natural language. This will enable
individuals who are visually impaired to understand and appreciate visual content, and also help
people who do not have time to carefully inspect images to quickly identify the relevant
information they need. The development of such a system is critical for creating a more inclusive
and accessible online world.
In this project, we explore the application of deep learning techniques to generate captions for
images. Deep learning is a subfield of machine learning that enables machines to learn and
improve from experience by feeding large amounts of data into a neural network. The neural
network consists of layers of interconnected nodes that process input data and produce output
predictions. The main goal of this project is to develop an image captioning system that can
understand the content of images and generate captions that accurately describe what is depicted
in the image. The system will utilize deep learning techniques such as convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), which are the current state-of-the-art in
image captioning. The use of deep learning techniques in image captioning has significantly
improved the accuracy of image description. CNNs are used to extract features from the image,
while RNNs generate the corresponding natural language descriptions. The CNNs and RNNs are
trained on large datasets of images and their corresponding captions, which allows the system to
learn to generate accurate captions based on the content of the image.
The development of an accurate image captioning system has numerous applications in various
fields, including healthcare, education, content creation on social media, story narration and
security. In healthcare, the system can be used to assist doctors in diagnosing medical conditions
by analyzing medical images and generating accurate captions. In education, the system can be
used to provide visually impaired students with access to visual content in textbooks and other
educational materials. In entertainment, the system can be used to automatically generate
captions for videos and images shared on social media platforms. Image captioning can also be
used in content creation on social media to assist creators in generating creative, unique and
appropriate captions for their images, videos and posts in order to attract more viewers and
improve their engagement with their audience. In security, the system can be used to detect
violence, robbery and other similar scenarios by generating captions for frames from a live video
feed and comparing them with keywords from the training dataset.
In conclusion, the development of an accurate image captioning system using deep learning
techniques has the potential to transform the way we perceive visual content. The system will
enable individuals who are visually impaired to access and understand visual content in a more
meaningful way and provide quick access to relevant information within images for everyone.
The project will contribute significantly to the field of computer vision and natural language
processing and has a wide range of applications in various fields.
Objective
● Exploring different techniques for incorporating contextual and semantic information into
the image captioning models. This may include leveraging pre-trained models such as
BERT or GPT or using our own model.
● Developing and training deep learning models that can accurately generate captions for a
given image.
● Evaluating the performance of the image captioning models using evaluation metrics such
as BLEU. This involves comparing the generated captions to the reference captions and
assessing the quality of the generated captions in terms of accuracy.
Literature Survey
In 2015, Vinyals et al. introduced a neural network-based model for image captioning, called
Show and Tell. The model used a convolutional neural network (CNN) to extract image features,
and a long short-term memory (LSTM) network to generate captions. The model was trained on
the COCO dataset and achieved state-of-the-art performance.
In 2015, Xu et al. proposed an attention-based model for image captioning, called Show, Attend
and Tell. The model used a soft attention mechanism to selectively focus on different regions of the
image while generating the caption. The model was trained on the Flickr30k dataset and
outperformed the previous state-of-the-art methods.
In 2017, Anderson et al. introduced a bottom-up and top-down attention mechanism for image
captioning, called Up-Down. The model first generated a set of image features using a bottom-up
approach, and then used a top-down attention mechanism to focus on different parts of the image
while generating the caption. The model was trained on the COCO dataset and achieved
state-of-the-art performance.
In 2018, Lu et al. proposed a dual attention network for image captioning, called DA-Net. The
model used both spatial and channel-wise attention mechanisms to selectively focus on different
regions and features of the image while generating the caption. The model was trained on the
COCO dataset and outperformed the previous state-of-the-art methods.
Feasibility Study
Image captioning involves generating a natural language description of an image, which has
many potential applications in areas such as assistive technology, image search, and content
generation. The goal of this project is to develop an image captioning system that can accurately
describe a wide range of images. There has been significant research in the field of image
captioning in recent years, with many deep learning-based models achieving impressive results.
However, there is still room for improvement, particularly in accurately describing complex
scenes and generating captions that are both informative and natural-sounding. The development
of a high-quality image captioning system could have many potential benefits, including
improving accessibility for visually impaired individuals and enhancing the search capabilities of
image-based platforms.
Based on our review of the literature, there is a significant need for continued research in the
field of image captioning, particularly in accurately describing complex scenes and generating
natural-sounding captions. There are several existing datasets that can be used for training an
image captioning model, although it may be necessary to augment these datasets with additional
images and captions to ensure sufficient coverage of different types of images. The
computational resources required for training an image captioning model are significant,
although they can be obtained through cloud computing services or other means.
There are several potential limitations and challenges of an image captioning system that should
be considered, including the need for human input to evaluate the quality of generated captions
and the difficulty of accurately describing complex scenes. However, these challenges can be
addressed through the use of human evaluation metrics and the development of more
sophisticated deep learning models.
The potential applications of an image captioning system are numerous, including aiding visually
impaired individuals, improving image search capabilities, and generating captions for social
media or other platforms. These applications can be evaluated through user testing and other
methods.
Methodology
Semantic segmentation in the context of image analytics enables us to identify the objects in an
image, but it falls short of describing the relationships between those objects using verbs or
contextual information. For instance, a security camera may pick up a person and a car but fail to
indicate that the person is breaking into the car. Automatic caption generation lets us recognise
such events, and the generated captions can be used to prompt users to view the photos or videos
and take appropriate action. In this study, we constructed the system using a CNN-LSTM
architecture and compared our findings with GPT-2-generated captions. Similar work in this field
has used state-of-the-art transformers to generate pertinent captions, which inspired our group to
attempt a solution that addresses this surveillance business problem. When using surveillance
footage systems, a user must frequently keep an eye on several screens at once and is expected to
act appropriately upon noticing anything questionable. This demands reliable and accurate
multitasking, and it is unrealistic to expect a single individual to reliably watch multiple screens at
once while also keeping an eye out for odd behaviour. To solve this issue, an image captioning
system that looks at such frames and assigns a caption to each of them can be used. The generated
captions can then be used to alert the concerned parties.
Data Collection
We selected the following Kaggle dataset to use for training our model:
https://www.kaggle.com/datasets/kunalgupta2616/flickr-8k-images-with-captions. There are
8092 photos in this dataset that were taken from Flickr. Each photograph is paired with five
captions, which serve as the target labels and are stored in a CSV file. This CSV file is well
structured, with columns for the image filename and its captions.
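As a rough illustration, the captions file can be loaded and the five captions grouped per image with pandas; the column names used below (image and caption) are assumptions and may need to be adjusted to match the actual file:

```python
import pandas as pd

# Load the captions CSV; the column names "image" and "caption" are assumed
# and may need adjusting to match the actual Kaggle file.
df = pd.read_csv("captions.csv")

# Group the five captions belonging to each image filename.
captions_per_image = df.groupby("image")["caption"].apply(list).to_dict()

print(len(captions_per_image))                 # expected: roughly 8092 images
print(next(iter(captions_per_image.items())))  # one image with its captions
```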
Although the dataset's size was sufficient to train a reasonably accurate image captioning model,
it did not contain enough surveillance-related images. To address this class imbalance and to make
the model learn about dangers such as armed robbery, guns and knives, we incorporated an
additional 568 surveillance photos covering themes like weapons, knives and crime. These images
had to be captioned manually. This exercise ensured we had a sufficient number of photographs to
train the model on dangers and emergency situations.
Data Pre-Processing
The captions for each image first had to be cleaned and prepared. During this process, all captions
were lowercased, and punctuation and other special characters were removed. The next step was to
tokenize the captions and build a vocabulary. A vocabulary is a list of key-value pairs, each
containing a word and its corresponding token index. We also recorded the number of times each
word appeared in the captions; a word was added to the vocabulary only if it appeared more than
five times.
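A minimal sketch of this cleaning and vocabulary-building step is shown below; the function names, special tokens, and exact regular expression are illustrative choices rather than the project's actual code:

```python
import re
from collections import Counter

def clean_caption(caption):
    # Lowercase the caption and strip punctuation/special characters,
    # keeping only letters and spaces, then split into word tokens.
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", "", caption)
    return caption.split()

def build_vocab(all_captions, min_count=5):
    # Count how often each word appears across every cleaned caption.
    counts = Counter(word for cap in all_captions for word in clean_caption(cap))
    # Keep only words that occur more than `min_count` times, as described above.
    frequent = sorted(w for w, c in counts.items() if c > min_count)
    # Map each word to a token index, reserving a few special tokens.
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word in frequent:
        vocab[word] = len(vocab)
    return vocab
```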
Model Architecture
Although it is a difficult task, the ability of a machine to automatically describe the objects in a
picture, together with their relationships or the activity being performed, using a learnt language
model is crucial in many fields. Besides naming the objects in the picture, the generated description
should also capture their attributes, connections, and functions. The generated caption must also be
written in a language that is common to humans, such as English.
Convolutional Neural Networks (CNNs), a class of deep learning algorithms, take in an input
image and learn to weight its various features and objects so that images can be distinguished from
one another; here, a CNN is used to extract features from the image. Long Short-Term Memory
(LSTM) networks are a type of Recurrent Neural Network (RNN) capable of learning order
dependence in sequence prediction problems. LSTM is chosen over a plain RNN because of the
vanishing and exploding gradient problems in RNNs: generating text requires remembering a large
amount of historical information, which makes LSTM better suited for this purpose. Since the
phrases in a caption are simply sequences of words, the LSTM is used to predict the next word.
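One common way to realise this CNN-LSTM design is the "merge" architecture sketched below in Keras, where precomputed CNN features and the partial caption are combined to predict the next word. The layer sizes, vocabulary size, caption length, and the use of a 2048-dimensional feature vector (as produced by, for example, a pretrained InceptionV3 with its classification head removed) are illustrative assumptions rather than the exact configuration used in this project:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length (in tokens)

# Image branch: a 2048-d feature vector extracted beforehand by a pretrained CNN.
image_input = Input(shape=(2048,))
image_features = Dropout(0.5)(image_input)
image_features = Dense(256, activation="relu")(image_features)

# Text branch: the partial caption so far, passed through an embedding and an LSTM.
caption_input = Input(shape=(max_length,))
caption_embedding = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = Dropout(0.5)(caption_embedding)
caption_features = LSTM(256)(caption_features)

# Merge both branches and predict the next word of the caption.
decoder = add([image_features, caption_features])
decoder = Dense(256, activation="relu")(decoder)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time, the caption is generated word by word: the model is fed the image features together with the words predicted so far, and the process repeats until an end token is produced or the maximum length is reached.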
Model Training
The next step is to train the model on the preprocessed dataset. This involves feeding the
preprocessed images and captions to the model and adjusting the weights to minimize the loss
function. As the loss decreases, we can observe the model beginning to learn the word sequences
and how they relate to the CNN output. Training can take a few hours to complete because it is a
computationally demanding task.
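Continuing the architecture sketch above, the training step might look as follows; the placeholder arrays stand in for the real preprocessed image features, caption token sequences, and one-hot next-word targets, and the epoch count and batch size are assumptions:

```python
import numpy as np

# Placeholder data standing in for the preprocessed dataset.
num_samples = 1000
X_image = np.random.rand(num_samples, 2048).astype("float32")             # CNN features
X_seq = np.random.randint(0, vocab_size, size=(num_samples, max_length))  # padded token ids
next_word = np.random.randint(0, vocab_size, size=num_samples)
y_word = np.zeros((num_samples, vocab_size), dtype="float32")
y_word[np.arange(num_samples), next_word] = 1.0                           # one-hot targets

# Fit the merged CNN-LSTM model defined above; the loss should fall steadily
# as the model learns to predict the next caption word.
model.fit([X_image, X_seq], y_word, epochs=20, batch_size=64, validation_split=0.1)
```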
Evaluation Metrics
Once the model is trained, it needs to be evaluated on a separate test dataset to check its
performance. This can be done by calculating metrics such as BLEU. The BLEU (Bilingual
Evaluation Understudy) score is a measure originally devised to assess machine-translated text;
more generally, it compares a sentence produced by a machine against one or more reference
sentences. We chose this metric because it is fast, computationally cheap, simple to understand,
and widely used to evaluate image captioning models. The BLEU score is a number between 0 and
1: a score of 0 means the machine-generated text has no overlap with the reference text, while a
score of 1 means it overlaps perfectly with the reference text.
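For illustration, the BLEU score between a generated caption and its reference captions can be computed with NLTK as sketched below; the sentences are made up purely for demonstration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference captions and a machine-generated caption for one image.
references = [
    "a dog is running across the grass".split(),
    "a brown dog runs through a green field".split(),
]
candidate = "a dog runs across the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")
```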
References
[1] Shuang Liu, Liang Bai, Yanli Hu and Haoran Wang, "Image Captioning Based on Deep
Neural Networks", MATEC Web of Conferences, Vol. 232, EITCE 2018.
[3] Vinyals, Oriol, et al., "Show and Tell: A Neural Image Caption Generator", IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2015.
[4] Farhadi, Ali, et al., "Every Picture Tells a Story: Generating Sentences from Images",
European Conference on Computer Vision (ECCV), Springer, Berlin, Heidelberg, 2010.
[5] He, Kaiming, et al., "Deep Residual Learning for Image Recognition", IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[7] Fang, H., et al., "From Captions to Visual Concepts and Back", IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 1473-1482, 2015.
[8] Gu, J., Cai, J., Wang, G. and Chen, T., "Stack-Captioning: Coarse-to-Fine Learning for Image
Captioning", AAAI Conference on Artificial Intelligence (AAAI-18), 2018.