KARNATAKA LAW SOCIETY’S

GOGTE INSTITUTE OF TECHNOLOGY


UDYAMBAG, BELAGAVI-590008
(An Autonomous Institution under Visvesvaraya Technological University, Belagavi)
(APPROVED BY AICTE, NEW DELHI)

Department of Computer Science & Engineering

A Project Report on
IMAGE CAPTIONING USING DEEP LEARNING TECHNIQUES
LIKE CNN AND LSTM
Submitted in partial fulfillment for the award of the degree of
Bachelor of Engineering
In
Computer Science & Engineering

Submitted by
NAME USN
Manish Bhojedar 2GI20CS061

Paritosh Kumar 2GI20CS082

Omkar Patil 2GI20CS084

Shivam Kumar 2GI20CS138

Guide
Dr. Ranjana Battur
Asst. Prof., Dept. of CSE

2023 – 2024

KARNATAKA LAW SOCIETY’S

GOGTE INSTITUTE OF TECHNOLOGY


UDYAMBAG, BELAGAVI-590008
(An Autonomous Institution under Visvesvaraya Technological University, Belagavi)
(APPROVED BY AICTE, NEW DELHI)

Department of Computer Science & Engineering

CERTIFICATE

Certified that the project entitled “IMAGE CAPTIONING USING DEEP LEARNING
TECHNIQUES LIKE CNN AND LSTM” carried out by MANISH BHOJEDAR
(2GI20CS061), PARITOSH KUMAR (2GI20CS082), OMKAR PATIL (2GI20CS084),
SHIVAM KUMAR (2GI20CS138), students of KLS Gogte Institute of Technology, Belagavi,
can be considered as a bonafide work for partial fulfillment for the award of Bachelor of
Engineering in Computer Science and Engineering of the Visvesvaraya Technological
University, Belagavi, during the year 2023-2024. It is certified that all corrections/suggestions
indicated have been incorporated in the report. The project report has been approved as it
satisfies the academic requirements prescribed for the said Degree.

Guide Co-Guide HOD Principal

Final Viva-Voce

Name of the examiners Date of Viva -voce Signature


1.
2.


DECLARATION BY THE STUDENT

We, MANISH BHOJEDAR (2GI20CS061), PARITOSH KUMAR (2GI20CS082),


OMKAR PATIL (2GI20CS084), SHIVAM KUMAR (2GI20CS138), hereby declare
that the project report entitled “IMAGE CAPTIONING USING DEEP
LEARNING TECHNIQUES LIKE CNN AND LSTM” submitted by us to KLS
Gogte Institute of Technology, Belagavi, in partial fulfillment of the Degree of Bachelor of
Engineering in Computer Science and Engineering, is a record of the project carried out
at Gogte Institute of Technology. This report is for academic purposes.

We further declare that the report has not been submitted and will not be submitted, either
in part or full, to any other institution and University for the award of any diploma or
degree.

NAME USN SIGNATURE


Manish Bhojedar 2GI20CS061
Paritosh Kumar 2GI20CS082
Omkar Patil 2GI20CS084
Shivam Kumar 2GI20CS138

Place: Belgaum

Date:


ACKNOWLEDGEMENT

We take this opportunity to express our gratitude to all those people who have been
instrumental in making this project successful.

We feel honoured to offer our warm salutations to K.L.S. GOGTE INSTITUTE OF
TECHNOLOGY, Belagavi, which gave us the opportunity to study B.E. and strengthen our
knowledge base.

We would like to express sincere thanks to Dr. M. S. Patil, Principal, G.I.T, Belagavi, for his
warm support throughout the B.E. program.

We are extremely thankful to Dr. Sanjeev Sannakki, Professor & Head, Dept. of CSE, G.I.T,
Belagavi, for the constant cooperation and support throughout this project.

We hereby express our thanks to Dr. Ranjana Battur, Dept. of CSE, G.I.T, Belagavi, for being
the guide for this project. She has provided us with incessant support and has been a constant
source of inspiration throughout the project.

We thank all our family members, friends, and all the Teaching, Non-Teaching and
Technical staff of the Computer Science and Engineering Department, K.L.S. GOGTE
INSTITUTE OF TECHNOLOGY, Belagavi, for their invaluable support and guidance.

Manish Bhojedar (2GI20CS061)

Paritosh Kumar (2GI20CS082)

Omkar Patil (2GI20CS084)

Shivam Kumar (2GI20CS138)


INDEX

SL. NO   CONTENT

         ABSTRACT
         LIST OF FIGURES
         ABBREVIATIONS
1        INTRODUCTION
2        FEASIBILITY STUDY
           2.1 Technical Feasibility
           2.2 Economical Feasibility
           2.3 Social Feasibility
3        SYSTEM ANALYSIS
           3.1 CNN
           3.1.2 How does a CNN work?
           3.1.3 Layer Composition in CNN Models
           3.2 Origin of LSTM
           3.3 Challenges with Traditional RNNs
           3.4 Vanishing Gradient Problem
           3.4.1 Addressing the Vanishing Gradient Problem
           3.5 Architecture of LSTM Networks
           3.5.2 Applications of LSTM Networks
           3.5.3 Further Insights into LSTM Architecture
           3.6 Hardware Requirements
           3.6.2 Software Used
4        SYSTEM DESIGN
           4.1 Image Caption Generator Model
           4.2 UML Diagrams
           4.2.1 Use Case Diagrams
           4.2.2 Class Diagram
           4.2.3 Dataflow Diagram
           4.2.4 Sequence Diagram
           4.2.5 Activity Diagram
5        IMPLEMENTATION
           5.1.1 Object Detection
           5.2 Source Code
6        RESULTS
           6.1 BLEU Score Comparison
           6.2 Epoch vs Loss Function Graph
           6.3 Dataset Contents
           6.4 Snapshots
7        CONCLUSION
8        REFERENCES


ABSTRACT

This project proposes an image caption generator utilizing a hybrid architecture combining
Convolutional Neural Networks (CNNs) for image feature extraction and Recurrent Neural
Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, for sequential
language generation. The goal is to generate descriptive captions for images automatically.

The CNN component processes input images to extract high-level features, capturing spatial
information effectively. These features are then fed into the LSTM network, which generates
captions word by word, taking into account the context of the image. The LSTM network learns to
associate visual features with corresponding linguistic descriptions, enabling it to generate coherent
and contextually relevant captions.

To train the model, a large dataset of images paired with corresponding captions is utilized. The
CNN part is typically pretrained on a large-scale image dataset like ImageNet, while the LSTM
network is trained end-to-end along with the captioning task.

During the inference stage, the trained model takes an image as input, extracts its features using the
CNN, and then generates a caption using the LSTM network. Beam search or other decoding
strategies can be employed to generate diverse and high-quality captions.

The proposed model aims to overcome the limitations of purely statistical approaches by leveraging
both visual and semantic information present in the images, resulting in more accurate and
meaningful captions. Additionally, by employing an LSTM network, the model can capture long-
range dependencies in language, enabling it to generate fluent and contextually appropriate
captions.

Experimental results demonstrate the effectiveness of the proposed approach in generating captions
that are both descriptive and semantically meaningful, showcasing its potential for various
applications such as image indexing, retrieval, and accessibility for visually impaired individuals.


LIST OF FIGURES

Fig. No   Figure
1         Working of our model
2         Show, attend and tell
3         CNN Architecture
4         Working of CNN
5         CNN Architecture
6         Feature map of CNN picture
7         Layers of the scanned picture
8         LSTM memory cell
9         Working of LSTM
10        Block diagram of our working model
11        Use Case diagram
12        Class Diagram
13        Dataflow Diagram
14        Sequence Diagram
15        Activity Diagram
16        BLEU Scores
17        Epoch Vs Loss function graph
18        Flickr8k Dataset
19        Captions.txt
20        VGG16 Snapshot 1
21        VGG16 Snapshot 2
22        InceptionV3 Snapshot 1
23        InceptionV3 Snapshot 2
24        Testing Image


ABBREVIATIONS

CNN  - Convolutional Neural Network
RNN  - Recurrent Neural Network
LSTM - Long Short-Term Memory
NLTK - Natural Language Toolkit
NLP  - Natural Language Processing
TF   - TensorFlow


1. Introduction

Automatically describing the content of images using natural language is a fundamental and
challenging task, and it has great potential impact. For example, it could help visually impaired people
better understand the content of images on the web. It could also provide more accurate and
compact information about images and videos in scenarios such as image sharing on social networks or video
surveillance systems. This project accomplishes this task using deep neural networks. By learning
knowledge from image and caption pairs, the method can generate image captions that are usually
semantically descriptive and grammatically correct.

Human beings usually describe a scene using natural language, which is concise and compact.
Machine vision systems, however, describe the scene through an image, which is a two-dimensional
array. From this perspective, Vinyals et al. model the image captioning problem as a language
translation problem in their Neural Image Caption (NIC) generator system. The idea is to map the
image and the captions to the same space and to learn a mapping from the image to the sentences.
Donahue et al. proposed a more general Long-term Recurrent Convolutional Network (LRCN) method.
The LRCN method not only models one-to-many (words) image captioning, but also models
many-to-one action generation and many-to-many video description. They also provide a publicly
available implementation based on the Caffe framework (Jia et al., 2014), which further boosts
research on image captioning. This work is based on the LRCN method.

Although all the mappings are learned in an end-to-end framework, we believe there is benefit in
better understanding the system by analyzing its components separately. Fig. 1 shows the pipeline.
The model has three components. The first component is a CNN, which is used to understand the
content of the image. Image understanding answers the typical questions in computer vision such as
“What are the objects?”, “Where are the objects?” and “How do the objects interact?”. For example,
the CNN has to recognize the “teddy bear”, “table” and their relative locations in the image. The
second component is an RNN, which is used to generate a sentence given the visual features. For
example, the RNN has to generate a sequence of probabilities of words given the two words “teddy
bear, table”. The third component is used to generate a sentence by exploring the combination of
these probabilities. This component is less studied in the reference paper (Donahue et al.). This
project aims at understanding the impact of different components of the LRCN method (Donahue et
al.). We make the following contributions:

• understand the LRCN method at the implementation level;

• analyze the influence of the CNN component by replacing three CNN architectures (two
from the authors' and one from our implementation);

• analyze the influence of the RNN component by replacing two RNN architectures (one
from the authors' and one from our implementation);

• analyze the influence of the sentence generation method by comparing two methods (one from
the authors' and one from our implementation).

Our approach is based on two basic models: a CNN (Convolutional Neural Network) and an LSTM
(Long Short-Term Memory) network. The CNN is utilized as an encoder in the derived application to extract
features from the snapshot or image, and the LSTM is used as a decoder to organize the words and
generate captions. Image captioning can help with a variety of things, such as assisting visually
impaired people via text-to-speech with real-time information about the scene from a camera feed, and
enriching social media by generating captions for photos in social feeds as well as for spoken
messages. Assisting children in recognizing objects is a step toward learning the language.
Captions for every photograph on the internet can result in faster and more accurate
image search and indexing. Image captioning is used in a variety of sectors, including
biology, business and the internet, and in applications such as self-driving cars, where it could describe
the scene around the car, and CCTV cameras, where alarms could be raised if any malicious
activity is observed. The main purpose of this project is to gain a basic understanding of
deep learning methodologies.

Fig 1: Working of our model

Image caption generation pipeline. The framework consists of a convolutional neural network
(CNN) followed by a recurrent neural network (RNN); it generates an English sentence from an
input image.

(Left) Our CNN-LSTM architecture, modelled after the NIC architecture described in
[6]. We use a deep convolutional neural network to create a semantic representation of an
image, which we then decode using an LSTM network. (Right) An unrolled LSTM network for
our CNN-LSTM model. All LSTMs share the same parameters. The vectorized image


representation is fed into the network, followed by a special start-of-sentence token. The hidden
state produced is then used by the LSTM to predict/generate the caption for the given image
(figure adapted from [6]). A known weakness of this basic setup manifests itself in the
memorization of inputs and the use of similar-sounding captions for images which differ in their
specific details. For example, an image of a man on a skateboard on a ramp may receive the same
caption as an image of a man on a skateboard on a table.

Fig 2: Show, attend and tell visualization (adapted from [12])

To cope with this, recent advances in the field of image captioning have innovated at the
architecture level, with the most successful model to date on the Microsoft Common Objects in
Context competition using the basic architecture in Figure 1 augmented with an attention
mechanism [7]. This allows it to deal with the main challenge of top-down approaches, i.e. the
inability to focus the caption on small and specific details in the image. In this project, we
approach the problem via thorough hyper-parameter experimentation on the basic architecture
in Figure 1.

For most computer vision researchers, the classification task has always been dominant in the
field. Whether it was scene understanding in the pioneering 1960s or traffic sign detection in the
modern day, the task has been rooted in the soil of computer vision. It is not surprising that one of
the most significant competitions in the field comprises the image classification task among others.
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) annually awards the algorithm
that is most successful at predicting the class of an image within its five estimates (known as the
top-5 error). For the record, the lowest top-5 classification error reached 28.2% at the ILSVRC 2010
and 25.8% a year later, respectively [1]. Nonetheless, an unexpected breakthrough came in the year
2012 when Krizhevsky et al. [2] presented decades-old algorithms [3, 4] enhanced by novel training
techniques, achieving results not seen before. In particular, the top-5 classification error was pushed
down to 16.4%. At the latest contest in 2015, the lowest top-5 error was brought down to 3.5%,
drawing on the work of Krizhevsky et al. After this success, neural networks have revolutionised the
field and brought in new challenges that had not been considered feasible before. One of those newly feasible


techniques – image captioning – is discussed in this report. In fact, as an emerging discipline with
promising potential, image captioning is still an active area of research, striving to
answer unsolved questions. Consequently, since the field has not been entirely established yet,
one must rely mainly on recently published papers and online lectures. Considering
recent work, we define image captioning as a task in which an algorithm describes a particular
image with a statement. However, it is expected that the statement is meaningful, self-contained
and grammatically and semantically correct. In other words, the caption shall
describe the image concretely, shall not require or rely on additional information and, last but
not least, shall consist of a grammatically correct sentence that semantically corresponds to the
image.


2. FEASIBILITY STUDY

The preliminary investigation examines project feasibility, i.e. the likelihood that the system will be
useful to the organization. The main objective of the feasibility study is to test the technical,
operational and economical feasibility of adding new modules and debugging the previously running
system. Any system is feasible if it has unlimited resources and infinite time. The following aspects
are considered in the feasibility study portion of the preliminary investigation:

• Technical Feasibility
• Economical Feasibility
• Social Feasibility

2.1 Technical Feasibility

The technical issues typically raised during the feasibility stage of the investigation include the
following:

• Does the necessary technology exist to do what is suggested?
• Does the planned equipment have the technical capacity to hold the data required to use the
new system?
• Will the planned system provide adequate responses to inquiries, regardless of the number or
location of users?
• Can the system be upgraded if developed?
• Are there technical guarantees of accuracy, reliability, ease of access and data security?

Earlier, no system existed to cater to the requirements of the 'Secure Infrastructure
Implementation System'. The system developed is technically feasible. It is a web-based interface
for audit workflow at NIC-CSD, and therefore it provides easy access to the users.


The database's purpose is to create, establish and maintain a workflow among the various
entities in order to facilitate all concerned users in their various capacities or roles.
Permission would be granted to the users based on the roles assigned. Therefore, it
provides the technical guarantee of accuracy, reliability and security. The software
and hardware requirements for the development of this project are not many and are either already
available in-house at NIC or available for free as open source.

The work for the project is done with the current equipment and existing software
technology.

The necessary bandwidth exists for providing fast feedback to the users regardless of the
number of users using the system.

2.2 Economical Feasibility

A system that is developed technically and is used, once installed, must still be a good
investment for the organization. In the economical feasibility study, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.

The system is economically feasible. It does not require any additional hardware or software.
Since the interface for this system is developed using the existing resources and technologies
available at NIC, there is nominal expenditure and economical feasibility is assured.

2.3 Social Feasibility

Proposed projects are beneficial only if they can be turned into an information system
that meets the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of
the important issues raised to test the operational feasibility of a project include the following:


• Is there sufficient support for the management from the users?
• Will the system be used and work properly if it is developed and implemented?
• Will there be any resistance from the users that may undermine the potential
application benefits?

This system is targeted to be in accordance with the above-mentioned issues. Beforehand,
the management issues and user requirements have been taken into consideration. Therefore
there is no question of resistance from the users that may undermine the potential
application benefits.

The well-planned design would ensure the optimum utilization of the computer resources and
would facilitate the improvement of performance status.


3. SYSTEM ANALYSIS

3.1 CNN

Convolutional Neural Networks (CNNs) are specialized forms of neural networks that are
particularly adept at processing data with a grid-like topology, such as two-dimensional image
matrices. A CNN analyzes an image by methodically examining it from the top left corner to the
bottom right corner, efficiently extracting critical features and progressively integrating
them. Notably, CNNs are equipped to manage images that are translated, rotated, scaled, or
distorted, showcasing their robustness in handling variations in visual data.

Fig.3 CNN Architecture

The preprocessing requirements for Convolutional Networks are relatively minimal compared to
other classification algorithms. While traditional methods might rely on manually designed filters,
CNNs, given adequate training, are capable of autonomously learning these feature detectors. The
architecture of CNNs mirrors the organization of the human visual cortex, drawing inspiration from
the biological processes observed in the human brain. In the visual cortex, individual neurons
respond exclusively to stimuli within a restricted region of the visual field, a concept referred to as a
receptive field. The collective arrangement of these fields comprehensively covers the entire visual
area.

This ability of CNNs to perform feature extraction with minimal preprocessing and their biologically
inspired architecture makes them exceptionally effective for tasks involving image recognition and
classification, positioning them as a fundamental component in the field of deep learning applied to
visual data processing.

CNN: Architecture - Efficient Processing through Layered Structuring


Traditional neural networks connect every neuron in one layer to every neuron in the next, a method
that becomes inefficient for analyzing large images with millions of pixels in three color channels


(RGB). This extensive interconnectivity typically leads to overfitting, where the model learns the
noise in the training data rather than generalizing from it.

To address this issue and reduce the number of parameters, Convolutional Neural Networks (CNNs)
employ a structured approach where each neuron processes only a small, localized region of the
image. This setup allows neurons to specialize in detecting specific image features, such as edges or
textures. Unlike fully connected networks, CNNs apply the same filters across the entire image,
which not only reduces the parameters but also helps in identifying the same features regardless of
their position in the image.

This architecture results in a condensed feature map that captures essential aspects of the input,
making CNNs highly effective for tasks that require detailed visual understanding, like image
captioning. The strategic configuration of neurons and the shared weights across layers significantly
enhance the network's efficiency and its ability to generalize, positioning CNNs as a fundamental
technology in computer vision.

Fig.4 Working of CNN

3.1.2 How does a CNN work?


As discussed, a fully connected neural network, wherein each neuron in a layer is connected to every
neuron in the subsequent layer, may seem suitable for certain tasks. However, Convolutional Neural
Networks (CNNs) adopt a more nuanced approach by connecting neurons to only a specific
localized area of the preceding layer, rather than universally across all neurons. This targeted
connectivity reduces the overall complexity of the network and lessens the computational demand.

Fig. 5 CNN Architecture

In traditional methods, image comparison typically involves examining the pixel values of each
pixel in two images. This approach is effective for comparing identical images but fails when the
images vary. CNNs address this limitation by segmenting the image comparison process, analyzing
piece by piece.

Fig.6 Feature map of a CNN (here, a picture of a dog)

The principal advantage of utilizing the CNN algorithm lies in its capability to process images
directly as inputs. Based on these inputs, the CNN algorithm constructs a feature map by classifying
each pixel according to observed similarities and differences. This feature map, essentially a matrix
of categorized similar pixels, is critical in delineating the core characteristics of the input image.
These matrices are instrumental in extracting and highlighting the essential features of the objects

within the images, thereby facilitating a more refined and accurate analysis.
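To make the feature map idea concrete, the short sketch below (purely illustrative, not part of the project's implementation) slides a 3x3 vertical-edge filter over a small grayscale image with NumPy and prints the resulting feature map:

import numpy as np

def convolve2d(image, kernel):
    # valid 2D convolution (cross-correlation, as used in CNN layers)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply the local patch by the filter and sum the result
            feature_map[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return feature_map

# a tiny 6x6 "image" with a bright region on the right half
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)

# a 3x3 vertical-edge filter: responds strongly where intensity changes left-to-right
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

print(convolve2d(image, kernel))  # high values mark the vertical edge in the image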

3.1.3 Layer Composition in CNN Models


Convolutional Neural Networks (CNNs) are structured with three principal types of layers, each
contributing uniquely to the process of image analysis:

Convolutional Layer: This is the initial layer where the input image is introduced into the CNN.
The primary function of this layer is to create a feature map by applying filters to the input image.
These filters help in detecting specific features such as edges, colors, and textures.

Pooling Layer: Following the convolutional layer, the feature map undergoes processing in the
pooling layer. This layer simplifies the feature map by summarizing the features within small
receptive fields, a process known as downsampling. The objective is to reduce the spatial size of the
feature map, making the output more compact and emphasizing the most essential features of the
image.

Fully Connected Layer: After repeated application of convolutional and pooling layers, which
serve to intensify the feature detection, the resultant dense feature map is fed into the fully
connected layer. This final layer performs the classification task by analyzing the processed features
to differentiate and categorize distinct elements within the image. The classification is executed with
a high degree of precision to capture the essence of the image, which is critical for accurate
identification of objects, persons, and other entities.


Fig.7 Layers of the scanned picture

These layers collectively enable the CNN to accurately identify and localize features within an
image. By transforming the varied-length inputs of raw images into fixed-size outputs, CNNs
efficiently extract crucial visual features for further analysis and interpretation.
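As an illustration of this layer composition (a minimal sketch with an assumed 224x224 input size and 10 output classes, not the network used in this project), a small Keras classifier stacking convolutional, pooling and fully connected layers could look like this:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # convolutional layer: 32 filters of size 3x3 build feature maps from the raw image
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    # pooling layer: downsample each feature map, keeping the strongest responses
    MaxPooling2D((2, 2)),
    # a second convolution/pooling stage intensifies the feature detection
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # flatten the final feature maps and classify with fully connected layers
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()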

CNN techniques are widely used, for example:

Computer vision — in the medical sciences, image analysis is largely done through CNNs; the
inner structure of the body can be examined effortlessly with their help. In mobile phones they are
used for many things, for instance to estimate a person's age or to unlock the phone by
examining the picture from the camera.

In industry, they are used for creating patents or copyrights of specific captured pictures.
Pharmaceutical discovery — CNNs are broadly used for discovering drugs/pharmaceuticals, by
analyzing the chemical features and finding the best drug to cure a particular problem.

3.2 Origin of LSTM

Long Short-Term Memory (LSTM) networks were initially developed by two German researchers,
Sepp Hochreiter and Jürgen Schmidhuber, in 1997. As a subtype of recurrent neural networks
(RNNs), LSTMs play a pivotal role within the realm of deep learning. The defining feature of LSTM
networks is their ability to not only store information for extended periods but also to make
predictions about future datasets based on the stored data. This capability distinguishes LSTMs from
traditional RNNs and underpins their widespread application in sequences where context from the
past significantly informs future outcomes.

3.3 Challenges with Traditional RNNs


Recurrent Neural Networks (RNNs) are utilized across a spectrum of complex computational tasks,
including object classification and speech recognition. These networks are specifically designed to
handle sequential data, where the relevance of each piece of data is contingent on its predecessors. In
practice, RNNs are ideal for managing long data sequences with extensive dependencies, making
them suitable for applications such as inventory forecasting and advanced speech recognition
systems. However, the practical deployment of RNNs in solving real-world problems is often
hindered by the vanishing gradient problem, where the gradient signal becomes too weak to make

significant adjustments in the network's parameters, thereby stalling the learning process. This issue
limits the effectiveness of RNNs in applications requiring the learning of long-term dependencies.

3.4 Vanishing Gradient Problem

The vanishing gradient problem is the main reason why training RNNs is challenging. In
general, RNNs are engineered such that they store data only for a short period of time and only for
limited sequences of data. It is not possible for an RNN to remember all the data values over a long
period; it can only retain some of the data for a short time. Consequently, the memory of
an RNN is only useful for shorter sequences of data and short time periods.

The vanishing gradient problem becomes very prominent in traditional RNNs: to solve
a particular problem, many time steps are added, which results in losing information when we use
backpropagation. With many time steps, the RNN has to store data values for each time step, which
means storing more and more values, and that is not feasible for an RNN. This is how the
vanishing gradient problem arises.
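A toy numerical sketch of the effect (purely illustrative, not taken from the report's model): backpropagating through many time steps repeatedly multiplies the gradient by the recurrent weight, so a weight below 1 shrinks the gradient toward zero.

# illustrative only: repeated multiplication mimics backpropagation through time
recurrent_weight = 0.5          # any |w| < 1 shrinks the signal
gradient = 1.0
for step in range(1, 31):
    gradient *= recurrent_weight
    if step in (5, 10, 20, 30):
        print(f"after {step:2d} time steps the gradient factor is {gradient:.2e}")
# after 30 steps the factor is about 9.3e-10, far too small to update early-layer weights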

3.4.1 Addressing the Vanishing Gradient Problem through Long Short-Term Memory
Networks
The vanishing gradient problem is a significant challenge in training traditional Recurrent Neural
Networks (RNNs), impacting the network's ability to learn long-range dependencies within the input
data. To mitigate this issue, Long Short-Term Memory (LSTM) networks, a specialized subset of
RNNs, have been developed specifically to address the vanishing gradient problem by maintaining
data across extended time intervals.
LSTMs are uniquely designed to persist information for long durations which inherently aids in
overcoming the problem of vanishing gradients. This is accomplished through the network's
architecture, which integrates several gates that manage the flow of information. Unlike standard
RNNs, which pass data directly through each recurrent unit without modification, LSTMs process
and filter information via these gates. Each gate within an LSTM unit is capable of making
independent decisions on what data to store, discard, or pass through, based on the learned data
dependencies.
In practice, LSTMs maintain a constant error flow through internal structures, which they use to
regulate the updating and forgetting processes. This error handling ensures that LSTMs can learn
from data values repeatedly over time steps, simplifying the backpropagation process across layers
and time, thus effectively mitigating the risk of vanishing gradients.

The gates—often referred to as the input, forget, and output gates—each play a pivotal role in the
LSTM's ability to shape and control the flow of data. These gates independently evaluate the
necessity of maintaining or modifying information, allowing the LSTM to make refined judgements
about the data it retains over time.
Overall, the architecture of LSTMs provides substantial improvements over traditional RNNs,
particularly in tasks that require learning from long input sequences. The ability of LSTMs to retain
information over prolonged periods and their robustness to vanishing gradients make them superior
for handling complex sequence prediction problems.

3.5 Architecture of LSTM Networks


The architecture of Long Short-Term Memory (LSTM) networks is elegantly designed to address
the shortcomings found in traditional Recurrent Neural Networks (RNNs), particularly in handling
long-term dependencies. At the core of LSTM architecture are three significant gates that regulate
the flow of information, each serving a distinct but crucial function that contributes to the model's
capability to retain information over extended periods and to selectively forget information that is no
longer useful.

1. Forget Gate: This gate plays a pivotal role in the LSTM's functionality by filtering out
unnecessary information. It decides what information is non-essential and should be discarded,
thus optimizing the memory utilization of the network. The effectiveness of the LSTM in
managing its memory component is largely attributable to the operations of the forget gate.

2. Input Gate: The operation of the LSTM begins at the input gate, where it receives and
processes the incoming data. This gate is critical as it determines which values from the input

data should be updated in the cell state, thereby allowing the network to preserve relevant
information throughout the operation of the model.

3. Output Gate: The output gate is responsible for determining what the next output should be. It
does this by filtering the information from the cell state based on the current input and the
memory of the previous cell state, producing the output that is used for further processing or as
the final prediction.

Fig.8 LSTM memory cell
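For concreteness, the sketch below shows the standard equations of a single LSTM step in NumPy (illustrative only; the project itself relies on Keras' built-in LSTM layer, and the 4-dimensional input and 3-dimensional hidden state are arbitrary toy sizes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # one LSTM time step; W, U, b hold the parameters of the
    # forget (f), input (i), output (o) gates and the candidate cell (g)
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate: what to discard
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate: what to store
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate: what to emit
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell values
    c_t = f * c_prev + i * g                                # updated cell state
    h_t = o * np.tanh(c_t)                                  # updated hidden state
    return h_t, c_t

# toy sizes: 4-dimensional input, 3-dimensional hidden state
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in 'fiog'}
U = {k: rng.normal(size=(3, 3)) for k in 'fiog'}
b = {k: np.zeros(3) for k in 'fiog'}
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(rng.normal(size=4), h, c, W, U, b)
print(h, c)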

3.5.2 Applications of LSTM Networks


LSTM networks are extensively utilized in a variety of deep learning applications that require
predictions based on historical data. These applications range from natural language processing tasks
such as text prediction to more complex time series prediction tasks like stock market forecasting.

-Text Prediction: LSTMs are particularly effective in text prediction due to their ability to
remember and utilize past information, such as previously encountered words and their contexts.
This capability allows them to predict subsequent words in a sentence with a higher degree of
accuracy, which is immensely beneficial in applications like chatbots used by e-commerce sites and
mobile applications.

- Stock Market Prediction: In financial applications, LSTMs can analyze and remember patterns in
historical stock market data, enabling them to predict future market trends. This task is challenging
due to the inherent unpredictability of the market, requiring the LSTM to be trained on extensive and

varied datasets to achieve reliable predictions.

3.5.3 Further Insights into LSTM Architecture


LSTMs are an advanced variant of RNNs, designed to hold larger amounts of data for more
extended periods without the risk of vanishing gradients, a common problem in standard RNNs. The
basic structural diagram of an LSTM typically highlights the three gates—forget, input, and output
—which are instrumental in the network’s ability to store relevant information and provide desired
outputs effectively.

Fig.9 Working of LSTM

3.6 Hardware Requirements

Processor : AMD Ryzen 3 / Intel i3 (min)
Speed : 1.6 GHz
RAM : 8 GB (min)
Hard Disk : 500 GB

3.6.2 Software Used

Code Editor : VS Code
Operating System : Windows 11


4. SYSTEM DESIGN

4.1 Image Caption Generator Model

Here we combine the two independent architectures described above to develop the image caption
generator, also known as the CNN-LSTM model, and apply them to the input image to obtain its
caption. We consider two pre-trained models, InceptionV3 and VGG16: the CNN is used to extract
features from the image data, these features are fed to an LSTM along with the input text data, and
the LSTM processes them to generate more accurate and interesting captions for the image.

Fig.10 Block diagram of our working model
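The source code in Chapter 5 extracts features with VGG16; the InceptionV3 variant reported in the results could be set up analogously, for example as in the hedged sketch below (note the different input size of 299x299 and the 2048-dimensional feature vector, which would also change the Input shape of the encoder):

from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# InceptionV3 without its classification head: the pooled 2048-d vector is the image feature
base = InceptionV3(weights='imagenet')
inception_extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_inception_feature(img_path):
    # InceptionV3 expects 299x299 inputs (VGG16 uses 224x224)
    image = load_img(img_path, target_size=(299, 299))
    image = img_to_array(image)
    image = image.reshape((1,) + image.shape)
    image = preprocess_input(image)        # Inception-specific scaling to [-1, 1]
    return inception_extractor.predict(image, verbose=0)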


4.2 UML Diagrams

Unified Modeling Language:

The Unified Modeling Language allows the engineer to express an analysis model using a
modeling notation that is governed by a set of syntactic, semantic and pragmatic rules.

A UML system is represented using five different views that describe the system from
distinctly different perspectives. Each view is defined by a set of diagrams, as follows.

The behavioral view represents the dynamic aspects of the system, portraying the
interactions among the various structural elements described in the user model and
structural model views.

Use case diagrams represent the functionality of the system from a user's point of view.
Use cases are used during requirements elicitation and analysis to represent the functionality of
the system. Use cases focus on the behavior of the system from an external point of view.

Actors are external entities that interact with the system. Examples of actors include users
such as an administrator or a bank customer, or another system such as a central database.


4.2.1 Use Case Diagrams:

Use case diagrams model the functionality of a system using actors and use cases.

Use cases are services or functions provided by the system to its users.

Fig 11: Use Case diagram


4.2.2 Class Diagram:

Class diagrams are the backbone of virtually every object-oriented methodology, including
UML. They describe the static structure of a system. Classes represent an abstraction of
entities with common characteristics, and associations represent the relationships between
classes.

Fig 12: Class Diagram


4.2.3 Dataflow Diagram:
A data-flow diagram is a way of representing the flow of data through a process or a system. The
DFD also provides information about the outputs and inputs of each entity and of the process
itself. A data-flow diagram has no control flow: there are no decision rules and no loops.

Fig 13: Data Flow Diagram


4.2.4 Sequence Diagram:
A sequence diagram shows object interactions arranged in time sequence. It depicts the
objects and classes involved in the scenario and the sequence of messages exchanged
between the objects needed to carry out the functionality of the scenario.

Fig 14: Sequence Diagram


4.2.5 Activity Diagram:

An activity diagram illustrates the dynamic nature of a system by modeling the flow of
control from activity to activity. An activity represents an operation on some class in the
system that results in a change in the state of the system. Typically, activity diagrams are
used to model workflows, business processes and internal operations. Because an activity
diagram is a special kind of state chart diagram, it uses some of the same modeling
conventions.

Fig 15: Activity Diagram


5 IMPLEMENTATION

5.1.1 Object Detection:

In this module, a Convolutional Neural Network performs the task of object detection from
the images. In this phase, transfer learning is used to reuse previously learned knowledge.
We have used the pre-trained models VGG16 and InceptionV3, which contain the functionality
of a convolutional neural network, to detect the objects in the image.

5.2 Source Code

MODULES

import os
import pickle
import re  # needed for the regex-based caption cleaning below
import numpy as np
from tqdm.notebook import tqdm
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.utils import to_categorical, plot_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add

PATH
BASE_DIR = 'C:\\Users\\Manish\\Downloads\\ICG\\Flickr8k_Dataset'
WORKING_DIR = 'C:\\Users\\Manish\\Downloads\\ICG\\Working'

PRE TRAINED MODEL


# load vgg16 model
model = VGG16()
# restructure the model
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
# summarize

print(model.summary())

EXTRACTING IMAGE FEATURES


# extract features from image
features = {}
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # load the image from file
    img_path = directory + '/' + img_name
    image = load_img(img_path, target_size=(224, 224))
    # convert image pixels to numpy array
    image = img_to_array(image)
    # reshape data for model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # preprocess image for vgg
    image = preprocess_input(image)
    # extract features
    feature = model.predict(image, verbose=0)
    # get image ID
    image_id = img_name.split('.')[0]
    # store feature
    features[image_id] = feature

# store features in pickle
pickle.dump(features, open(os.path.join(WORKING_DIR, 'features.pkl'), 'wb'))

# load features from pickle
with open(os.path.join(WORKING_DIR, 'features.pkl'), 'rb') as f:
    features = pickle.load(f)

LOADING THE CAPTIONS DATA

with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)
    captions_doc = f.read()

# create mapping of image to captions

mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    tokens = line.split(',')
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    image_id = image_id.split('.')[0]
    # convert caption list to string
    caption = " ".join(caption)
    # create list if needed
    if image_id not in mapping:
        mapping[image_id] = []
    # store the caption
    mapping[image_id].append(caption)

len(mapping)

def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # preprocessing steps
            # convert to lowercase
            caption = caption.lower()
            # delete digits, special chars, etc. (re.sub is used because str.replace does not take a regex)
            caption = re.sub(r'[^A-Za-z ]', '', caption)
            # delete additional spaces
            caption = re.sub(r'\s+', ' ', caption)
            # add start and end tags to the caption
            caption = 'startseq ' + " ".join([word for word in caption.split() if len(word) > 1]) + ' endseq'
            captions[i] = caption


# before preprocess of text
mapping['1052358063_eae6744153']
# preprocess the text
clean(mapping)
# after preprocess of text
mapping['1052358063_eae6744153']

all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)
len(all_captions)
all_captions[:10]

# tokenize the text


tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
vocab_size
# get maximum length of the caption available
max_length = max(len(caption.split()) for caption in all_captions)
max_length

TRAIN TEST SPLIT


image_ids = list(mapping.keys())
split = int(len(image_ids) * 0.90)
train = image_ids[:split]
test = image_ids[split:]

def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # loop over images
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:
        for key in data_keys:
            n += 1

            captions = mapping[key]
            # process each caption
            for caption in captions:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split the sequence into X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pairs
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store the sequences
                    X1.append(features[key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
            # yield a full batch (>= rather than == so a batch is emitted even when the count overshoots)
            if len(X1) >= batch_size:
                X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                yield [X1, X2], y
                X1, X2, y = list(), list(), list()
                n = 0
        # yield any remaining partial batch at the end of a pass over the keys
        if n < batch_size and len(X1) > 0:
            X1, X2, y = np.array(X1), np.array(X2), np.array(y)
            yield [X1, X2], y
            X1, X2, y = list(), list(), list()

MODEL CREATION
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)

se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)


model.compile(loss='categorical_crossentropy', optimizer='adam')

# plot the model


plot_model(model, show_shapes=True)

epochs = 10
batch_size = 32
steps = len(train) // batch_size

for i in range(epochs):
    # create data generator
    generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
    # fit for one epoch
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

# save the model


model.save(WORKING_DIR+'/best_model.h5')

GENERATE CAPTIONS FOR IMAGES


def idx_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate caption for an image
def predict_caption(model, image, tokenizer, max_length):

    # add start tag for generation process
    in_text = 'startseq'
    # iterate over the max length of sequence
    for i in range(max_length):
        # encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad the sequence
        sequence = pad_sequences([sequence], max_length)
        # predict next word
        yhat = model.predict([image, sequence], verbose=0)
        # get index with high probability
        yhat = np.argmax(yhat)
        # convert index to word
        word = idx_to_word(yhat, tokenizer)
        # stop if word not found
        if word is None:
            break
        # append word as input for generating next word
        in_text += " " + word
        # stop if we reach end tag
        if word == 'endseq':
            break

    return in_text
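The predict_caption function above decodes greedily, always taking the most probable next word. The abstract and conclusion mention beam search as an alternative decoding strategy; it is not part of the code above, but a possible sketch of how it could be added on top of the same model is shown below (beam_width and the function name are illustrative):

def predict_caption_beam(model, image, tokenizer, max_length, beam_width=3):
    # illustrative beam search decoder; keeps the beam_width most probable partial captions
    start = tokenizer.texts_to_sequences(['startseq'])[0]
    # each beam entry is (token sequence, cumulative log-probability)
    beams = [(start, 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            # stop expanding a beam that already produced the end token
            if idx_to_word(seq[-1], tokenizer) == 'endseq':
                candidates.append((seq, score))
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([image, padded], verbose=0)[0]
            # expand with the beam_width most probable next words
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        # keep only the best beam_width candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq = beams[0][0]
    words = [idx_to_word(idx, tokenizer) for idx in best_seq]
    return ' '.join(w for w in words if w is not None)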

BLEU SCORE DETERMINATION


from nltk.translate.bleu_score import corpus_bleu
# validate with test data
actual, predicted = list(), list()

for key in tqdm(test):
    # get actual caption
    captions = mapping[key]
    # predict the caption for image
    y_pred = predict_caption(model, features[key], tokenizer, max_length)
    # split into words

    actual_captions = [caption.split() for caption in captions]
    y_pred = y_pred.split()
    # append to the list
    actual.append(actual_captions)
    predicted.append(y_pred)

# calculate BLEU score
print("BLEU-1: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))

VISUALIZE THE RESULTS


from PIL import Image
import matplotlib.pyplot as plt

def generate_caption(image_name):
    # load the image
    # image_name = "1001773457_577c3a7d70.jpg"
    image_id = image_name.split('.')[0]
    img_path = os.path.join(BASE_DIR, "Images", image_name)
    image = Image.open(img_path)
    captions = mapping[image_id]
    print('---------------------Actual---------------------')
    for caption in captions:
        print(caption)
    # predict the caption
    y_pred = predict_caption(model, features[image_id], tokenizer, max_length)
    print('--------------------Predicted--------------------')
    print(y_pred)
    plt.imshow(image)

generate_caption("1001773457_577c3a7d70.jpg")
generate_caption("1002674143_1b742ab4b8.jpg")

TEST WITH REAL IMAGE


# rebuild the VGG16 feature extractor (the variable `model` now holds the trained caption model)
vgg_model = VGG16()
vgg_model = Model(inputs=vgg_model.inputs, outputs=vgg_model.layers[-2].output)

image_path = 'C:/Users/Manish/Downloads/ICG/Flickr8k_Dataset/Images/1000268201_693b08cb0e.jpg'
# load image

image = load_img(image_path, target_size=(224, 224))


# convert image pixels to numpy array
image = img_to_array(image)
# reshape data for model
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
# preprocess image for vgg
image = preprocess_input(image)
# extract features
feature = vgg_model.predict(image, verbose=0)
# predict from the trained model
predict_caption(model, feature, tokenizer, max_length)


6 RESULTS

6.1 BLEU Score Comparison

[Bar chart comparing BLEU-1 and BLEU-2 scores (y-axis: BLEU Score) for the two CNN models, VGG16 and InceptionV3]

Fig.16 BLEU Scores

As per the implementation of our project using two different models, we found that the
InceptionV3 model yields better results compared to the VGG16 model.

VGG16: BLEU-1 = 0.516511, BLEU-2 = 0.295842

InceptionV3: BLEU-1 = 0.541477 , BLEU-2 = 0.314663

6.2 Epoch Vs Loss Function Graph

Fig.17 Epoch Vs Loss function graph


As per our study, the training loss decreases steadily with each epoch, but it does not become very
low; to address this, the dataset can be enlarged and attention models can be incorporated as
required.

6.3 DATASET CONTENTS

Fig 18: Flickr8k Dataset


Fig 19: Captions.txt


6.4 SNAPSHOTS

VGG16

Fig 20: VGG16 Snapshot 1

Fig 21: VGG16 Snapshot 2


INCEPTIONV3

Fig 22: InceptionV3 Snapshot 1

Fig 23: InceptionV3 Snapshot 2



TESTING WITH REAL IMAGES (other than the dataset)

Fig 24: Testing image


7 CONCLUSION & FUTURE SCOPE

We examined and adjusted the CNN-based image captioning technique. We broke the process down
into the CNN, the RNN-based LSTM and sentence generation in order to fully grasp it, and we
changed or swapped out each component to observe how it affected the outcome. The Flickr8k and
Flickr30k datasets are used to test the updated approach. The experiment's findings indicate that
InceptionV3 performs better in BLEU score measurement than VGGNet; we also tried testing with
images outside the dataset; and increasing the beam size generally raises the BLEU score but does
not always improve the quality of the description as evaluated by humans.

We would like to train our model further and integrate it with text readout, i.e. converting the output
caption into audio so that it aids visually impaired people. We would also like to train it with
larger datasets like Flickr40k, the COCO dataset, etc.


8 REFERENCES

• Visual Image Caption Generator Using Deep Learning.
• Reshmi Sasibhooshan, Suresh Kumaraswamy & Santhoshkumar Sasidharan (2023): Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction.
• Janvi Jambhale, Payal Vairagade, Aarti Avhad, Jameer Kotwal, Shreeya Sangale (2022): Image Caption Generator using Convolution Neural Networks and Long Short-Term Memory.
• Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2017): Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998.
• V. Kesavan, V. Muley and M. Kolhekar, "Deep Learning based Automatic Image Caption Generation," 2019 Global Conference for Advancement in Technology (GCAT), Bengaluru, India, 2019, pp. 1-6, doi: 10.1109/GCAT47503.2019.8978293.
• Wang W, Hu H (2019): Image captioning using region-based attention joint with time-varying attention. Neural Processing Letters, 1-13.
• Shuang Bai and Shan An (2018): A survey on automatic image caption generation.
• Step by Step Guide to Build Image Caption Generator using Deep Learning: https://www.analyticsvidhya.com/blog/2021/12/step-by-step-guide-to-build-image-caption-generator-using-deep-learning/

