Final Year Project Report
A Project Report on
IMAGE CAPTIONING USING DEEP LEARNING TECHNIQUES
LIKE CNN AND LSTM
Submitted in the partial fulfillment for the award of the degree of
Bachelor of Engineering
In
Computer Science & Engineering
Submitted by
NAME USN
Manish Bhojedar 2GI20CS061
Guide
Dr. Ranjana Battur
Asst. Prof., Dept. of CSE
2023 – 2024
CERTIFICATE
Certified that the project entitled “IMAGE CAPTIONING USING DEEP LEARNING
TECHNIQUES LIKE CNN AND LSTM ” carried out by MANISH BHOJEDAR
(2GI20CS061), PARITOSH KUMAR (2GI20CS082), OMKAR PATIL (2GI20CS084),
SHIVAM KUMAR (2GI20CS138) students of KLS Gogte Institute of Technology, Belagavi,
can be considered as a bonafide work for partial fulfillment for the award of Bachelor of
Engineering in Computer Science and Engineering of the Visvesvaraya Technological
University, Belagavi during the year 2023-2024. It is certified that all corrections/suggestions
indicated have been incorporated in the report. The project report has been approved as it
satisfies the academic requirements prescribed for the said Degree.
Final Viva-Voce
We further declare that the report has not been submitted and will not be submitted, either
in part or full, to any other institution and University for the award of any diploma or
degree.
Place: Belgaum
Date:
ACKNOWLEDGEMENT
We take this opportunity to express our gratitude to all those people who have been
instrumental in making this project successful.
We would like to express our sincere thanks to Dr. M. S. Patil, Principal, G.I.T., Belagavi for his
warm support throughout the B.E. program.
We are extremely thankful to Dr. Sanjeev Sannakki, Professor & Head, Dept. of CSE, G.I.T.,
Belagavi for his constant cooperation and support throughout this project.
We hereby express our thanks to Dr. Ranjana Battur, Dept. of CSE, G.I.T., Belagavi for being the
guide for this project. She has provided us with incessant support and has been a constant
source of inspiration throughout the project.
We thank all our family members, friends, and all the Teaching, Non-Teaching and
Technical staff of the Computer Science and Engineering Department, K.L.S. GOGTE
INSTITUTE OF TECHNOLOGY, Belagavi for their invaluable support and guidance.
INDEX
ABSTRACT
LIST OF FIGURES
ABBREVIATIONS
1 INTRODUCTION
2 FEASIBILITY STUDY
3 SYSTEM ANALYSIS
3.1 CNN
4 SYSTEM DESIGN
4.2 UML Diagrams
4.2.2 Class Diagram
4.2.3 Data Flow Diagram
4.2.4 Sequence Diagram
4.2.5 Activity Diagram
5 IMPLEMENTATION
6 RESULTS
6.4 Snapshots
7 CONCLUSION
8 REFERENCES
ABSTRACT
This project proposes an image caption generator utilizing a hybrid architecture combining
Convolutional Neural Networks (CNNs) for image feature extraction and Recurrent Neural
Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, for sequential
language generation. The goal is to generate descriptive captions for images automatically.
The CNN component processes input images to extract high-level features, capturing spatial
information effectively. These features are then fed into the LSTM network, which generates
captions word by word, taking into account the context of the image. The LSTM network learns to
associate visual features with corresponding linguistic descriptions, enabling it to generate coherent
and contextually relevant captions.
To train the model, a large dataset of images paired with corresponding captions is utilized. The
CNN part is typically pretrained on a large-scale image dataset like ImageNet, while the LSTM
network is trained end-to-end along with the captioning task.
During the inference stage, the trained model takes an image as input, extracts its features using the
CNN, and then generates a caption using the LSTM network. Beam search or other decoding
strategies can be employed to generate diverse and high-quality captions.
The proposed model aims to overcome the limitations of purely statistical approaches by leveraging
both visual and semantic information present in the images, resulting in more accurate and
meaningful captions. Additionally, by employing an LSTM network, the model can capture
long-range dependencies in language, enabling it to generate fluent and contextually appropriate
captions.
Experimental results demonstrate the effectiveness of the proposed approach in generating captions
that are both descriptive and semantically meaningful, showcasing its potential for various
applications such as image indexing, retrieval, and accessibility for visually impaired individuals.
LIST OF FIGURES
ABBREVIATIONS
TF - TensorFlow
1. Introduction
Automatically describing the content of images using natural language is a fundamental and
challenging task with great potential impact. For example, it could help visually impaired people
better understand the content of images on the web. It could also provide more accurate and
compact information about images and videos in scenarios such as image sharing on social
networks or video surveillance systems. This project accomplishes this task using deep neural
networks. By learning from image and caption pairs, the method can generate image captions that
are usually semantically descriptive and grammatically correct. Human beings usually describe a
scene using natural language, which is concise and compact. However, a machine vision system
describes the scene from an image, which is a two-dimensional array. From this perspective,
Vinyals et al. (Vinyals et al.) model the image captioning problem as a language translation
problem in their Neural Image Caption (NIC) generator system. The idea is to map the image and
the captions to the same space and to learn a mapping from the image to the sentences.
Donahue et al. (Donahue et al.) proposed a more general Long-term Recurrent Convolutional
Network (LRCN) method. The LRCN method not only models one-to-many (words) image
captioning, but also models many-to-one action generation and many-to-many video description.
They also provide a publicly available implementation based on the Caffe framework (Jia et al., 2014),
which further boosts research on image captioning. This work is based on the LRCN method.
Although all the mappings are learned in an end-to-end framework, we believe there is a benefit in
better understanding the system by analyzing its components separately. Fig. 1 shows the
pipeline. The model has three components. The first component is a CNN which is used to
understand the content of the image. Image understanding answers the typical questions in computer
vision such as "What are the objects?", "Where are the objects?" and "How do the objects
interact?". For example, the CNN has to recognize the "teddy bear", the "table" and their relative
locations in the image. The second component is an RNN which is used to generate a sentence given
the visual features. For example, the RNN has to generate a sequence of probabilities of words given
the two words "teddy bear, table". The third component is used to generate a sentence by exploring the
combination of the probabilities. This component is less studied in the reference paper (Donahue et
al.). This project aims at understanding the impact of the different components of the LRCN method
(Donahue et al.). We make the following contributions:
• understand the LRCN method at the implementation level;
• analyze the influence of the CNN component by trying three CNN architectures (two
from the authors' implementation and one from ours);
• analyze the influence of the RNN component by trying two RNN architectures (one
from the authors' implementation and one from ours);
• analyze the influence of the sentence generation method by comparing two methods (one
from the authors' implementation and one from ours).
Our approach is based on two basic models: a CNN (Convolutional Neural Network) and an LSTM
(Long Short-Term Memory) network. The CNN is utilized as an encoder to extract
features from the snapshot or image, and the LSTM is used as a decoder to organize the words and
generate captions. Image captioning can help with a variety of things, such as assisting visually
impaired people with text-to-speech through real-time descriptions of the scene from a camera feed,
and enriching social media by supplying captions for photos in social feeds as well as spoken
messages. Assisting children in recognizing objects is a step toward learning the language.
Captions for every photograph on the internet can result in faster and more accurate image
search and indexing. Image captioning is used in a variety of sectors, including
biology, business and the internet, and in applications such as self-driving cars, where it could describe
the scene around the car, and CCTV cameras, where alarms could be raised if any malicious
activity is observed. The main purpose of this project is to gain a basic understanding of
deep learning methodologies.
Image caption generation pipeline: the framework consists of a convolutional neural network
(CNN) followed by a recurrent neural network (RNN). It generates an English sentence from an
input image.
Figure 1: (Left) Our CNN-LSTM architecture, modelled after the NIC architecture described in
[6]. We use a deep convolutional neural network to create a semantic representation of an
image, which we then decode using an LSTM network. (Right) An unrolled LSTM network for
our CNN-LSTM model. All LSTMs share the same parameters. The vectorized image
representation is fed into the network, followed by a special start-of-sentence token. The hidden
state produced is then used by the LSTM to predict/generate the caption for the given image.
Figures taken from [6].
A known weakness of this basic architecture manifests itself in the memorization of inputs and
the use of similar-sounding captions for images which differ in their specific details. For
example, an image of a man on a skateboard on a ramp may receive the same caption as an
image of a man on a skateboard on a table.
Fig. 2: Visualization from Show, Attend and Tell (adapted from [12]).
To cope with this, recent advances in the field of image captioning have innovated at the
architecture level, with the most successful model to date on the Microsoft Common Objects in
Context competition using the basic architecture in Figure 1 augmented with an attention
mechanism [7]. This allows it to deal with the main challenge of top-down approaches, i.e. the
inability to focus the caption on small and specific details in the image. In this project, we
approach the problem via thorough hyper-parameter experimentation on the basic architecture
in Figure 1.
For most computer vision researchers the classification task has always been dominant in the
field. Whether it was scene understanding in the pioneering 1960s or traffic sign detection in
the modern day, the task has been rooted in the soil of computer vision. It is not surprising that
one of the most significant competitions in the field comprises the image classification task
among others. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) annually
awards the algorithm which is most successful at predicting the class of an image within its
five estimates (known as the top-5 error). For the record, the lowest top-5 classification error
reached 28.2% at the ILSVRC 2010 and 25.8% a year later, respectively [1]. Nonetheless, an
unexpected breakthrough came in 2012 when Krizhevsky et al. [2] presented decades-old
algorithms [3, 4] enhanced by novel training techniques, achieving results not seen before. In
particular, the top-5 classification error was pushed down to 16.4%. At the latest contest in 2015,
the lowest top-5 error was brought down to 3.5%, drawing on the work of Krizhevsky et al.
After this success, neural networks have revolutionised the field and brought in new challenges
that had previously not even seemed feasible. One of those newly feasible
techniques – image captioning – is discussed in this report. In fact, as an emerging discipline with
promising potential, image captioning is still an active area of research, striving to
answer unsolved questions. Consequently, since the field has not been entirely established yet,
one must rely mainly on recently published papers and online lectures. Considering
recent work, we define image captioning as a task in which an algorithm describes a particular
image with a statement. However, it is expected that the statement is meaningful, self-
contained and grammatically and semantically correct. In other words, the caption shall
describe the image concretely, shall not require or rely on additional information and, last but
not least, shall consist of a grammatically correct sentence that semantically corresponds to the
image.
2. FEASIBILITY STUDY
The preliminary investigation examines project feasibility, i.e. the likelihood that the system will be
useful to the organization. The main objective of the feasibility study is to test the
technical, operational and economical feasibility of adding new modules and
debugging the existing running system. Any system is feasible given unlimited resources
and infinite time. The aspects considered in the feasibility study portion of the
preliminary investigation are:
Technical Feasibility
Economical Feasibility
Social Feasibility
2.1 Technical Feasibility
The technical issues usually raised during the feasibility stage of the investigation include the
following:
The database's purpose is to create, establish and maintain a workflow among
numerous entities so as to facilitate all involved users in their numerous capacities or roles.
Permission would be granted to users based on the roles specified. Therefore, it
provides the technical guarantee of accuracy, reliability and security. The software
and hardware requirements for the development of this project are modest and are either
already available in-house or available free as open source.
The work for the project can be done with the current equipment and existing software
technology.
The necessary bandwidth exists for providing fast feedback to the users irrespective of the
number of users using the system.
2.2 Economical Feasibility
A system that can be developed technically, and that will be used if installed, must still be a good
investment for the organization. In the economical feasibility study, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or software.
Since the interface for this system is developed using existing resources
and technologies, there is only nominal expenditure and economical
feasibility is assured.
2.3 Social Feasibility
Proposed projects are beneficial only if they can be turned into information systems
that meet the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some
of the important issues raised to check the operational feasibility
of a project include the following:
A well-planned design would ensure the optimal utilization of the computer resources and
would help in the improvement of performance.
3. SYSTEM ANALYSIS
3.1 CNN
Convolutional Neural Networks (CNNs) are specialized forms of neural networks that are
particularly adept at processing data with a grid-like topology, such as two-dimensional image
matrices. A CNN analyzes an image by methodically examining it from the top-left corner to the
bottom-right corner, efficiently extracting critical features and progressively integrating
them. Notably, CNNs can manage images that are translated, rotated, scaled, or
distorted, showcasing their robustness in handling variations in visual data.
The preprocessing requirements for Convolutional Networks are relatively minimal compared to
other classification algorithms. While traditional methods might rely on manually designed filters,
CNNs, given adequate training, are capable of autonomously learning these feature detectors. The
architecture of CNNs mirrors the organization of the human visual cortex, drawing inspiration from
the biological processes observed in the human brain. In the visual cortex, individual neurons
respond exclusively to stimuli within a restricted region of the visual field, a concept referred to as a
receptive field. The collective arrangement of these fields comprehensively covers the entire visual
area.
This ability of CNNs to perform feature extraction with minimal preprocessing and their biologically
inspired architecture makes them exceptionally effective for tasks involving image recognition and
classification, positioning them as a fundamental component in the field of deep learning applied to
visual data processing.
In a fully connected network, every neuron would be connected to every pixel of the input image
across all three colour channels (RGB). This extensive interconnectivity typically leads to
overfitting, where the model learns the noise in the training data rather than generalizing from it.
To address this issue and reduce the number of parameters, Convolutional Neural Networks (CNNs)
employ a structured approach where each neuron processes only a small, localized region of the
image. This setup allows neurons to specialize in detecting specific image features, such as edges or
textures. Unlike fully connected networks, CNNs apply the same filters across the entire image,
which not only reduces the number of parameters but also helps in identifying the same features
regardless of their position in the image.
This architecture results in a condensed feature map that captures essential aspects of the input,
making CNNs highly effective for tasks that require detailed visual understanding, like image
captioning. The strategic configuration of neurons and the shared weights across layers significantly
enhance the network's efficiency and its ability to generalize, positioning CNNs as a fundamental
technology in computer vision.
In traditional methods, image comparison typically involves examining the pixel values of each
pixel in two images. This approach is effective for comparing identical images but fails when the
images vary. CNNs address this limitation by segmenting the image comparison process, analyzing
the image piece by piece.
The principal advantage of utilizing the CNN algorithm lies in its capability to process images
directly as inputs. Based on these inputs, the CNN algorithm constructs a feature map by classifying
each pixel according to observed similarities and differences. This feature map, essentially a matrix
of categorized similar pixels, is critical in delineating the core characteristics of the input image.
These matrices are instrumental in extracting and highlighting the essential features of the objects
within the images, thereby facilitating a more refined and accurate analysis.
Convolutional Layer: This is the initial layer where the input image is introduced into the CNN.
The primary function of this layer is to create a feature map by applying filters to the input image.
These filters help in detecting specific features such as edges, colors, and textures.
Pooling Layer: Following the convolutional layer, the feature map undergoes processing in the
pooling layer. This layer simplifies the feature map by summarizing the features within small
receptive fields, a process known as downsampling. The objective is to reduce the spatial size of the
feature map, making the output more compact and emphasizing the most essential features of the
image.
Fully Connected Layer: After repeated application of convolutional and pooling layers, which
serves to intensify the feature detection, the resultant dense feature map is fed into the fully
connected layer. This final layer performs the classification task by analyzing the processed features
to differentiate and categorize distinct elements within the image. The classification is executed with
a high degree of precision to capture the essence of the image, which is critical for accurate
identification of objects, persons, and other entities.
These layers collectively enable the CNN to accurately identify and localize features within an
image. By transforming the varied-length inputs of raw images into fixed-size outputs, CNNs
efficiently extract crucial visual features for further analysis and interpretation.
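To make the three layer types concrete, the following minimal Keras sketch (an illustration only, not the project's captioning network; the layer sizes and the ten output classes are arbitrary assumptions) stacks convolution, pooling and fully connected layers into a small classifier:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# toy classifier: convolution -> pooling -> convolution -> pooling -> fully connected layers
toy_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # convolutional layer: builds feature maps
    MaxPooling2D(pool_size=(2, 2)),                                  # pooling layer: downsamples the feature maps
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                                                       # flatten the final feature map
    Dense(128, activation='relu'),                                   # fully connected layer
    Dense(10, activation='softmax')                                  # class scores for 10 hypothetical classes
])
toy_model.summary()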
In industry, CNNs are used for patenting or copyrighting specific captured images. In
pharmaceutical discovery, they have been widely used to help discover drugs by analyzing
chemical features and finding the best drug to treat a particular problem.
Long Short-Term Memory (LSTM) networks were initially developed by two German researchers,
Sepp Hochreiter and Jürgen Schmidhuber, in 1997. As a subtype of recurrent neural networks
(RNNs), LSTMs play a pivotal role within the realm of deep learning. The defining feature of LSTM
networks is their ability to not only store information for extended periods but also to make
predictions about future datasets based on the stored data. This capability distinguishes LSTMs from
traditional RNNs and underpins their widespread application in sequences where context from the
past significantly informs future outcomes.
Traditional RNNs, however, suffer from the vanishing gradient problem: as the error is propagated
back through many time steps, the gradients shrink until they can no longer make significant
adjustments in the network's parameters, thereby stalling the learning process. This issue
limits the effectiveness of RNNs in applications requiring the learning of long-term dependencies.
The problem becomes very prominent in traditional RNNs because solving a task may involve many
time steps: the network has to carry the values of every step, propagating more and more information,
which is not feasible for a plain RNN, and the information is lost when backpropagating through all
of those steps. This is how the vanishing gradient problem arises.
3.4.1 Addressing the Vanishing Gradient Problem through Long Short-Term Memory
Networks
The vanishing gradient problem is a significant challenge in training traditional Recurrent Neural
Networks (RNNs), impacting the network's ability to learn long-range dependencies within the input
data. To mitigate this issue, Long Short-Term Memory (LSTM) networks, a specialized subset of
RNNs, have been developed specifically to address the vanishing gradient problem by maintaining
data across extended time intervals.
LSTMs are uniquely designed to persist information for long durations which inherently aids in
overcoming the problem of vanishing gradients. This is accomplished through the network's
architecture, which integrates several gates that manage the flow of information. Unlike standard
RNNs, which pass data directly through each recurrent unit without modification, LSTMs process
and filter information via these gates. Each gate within an LSTM unit is capable of making
independent decisions on what data to store, discard, or pass through, based on the learned data
dependencies.
In practice, LSTMs maintain a constant error flow through internal structures, which they use to
regulate the updating and forgetting processes. This error handling ensures that LSTMs can learn
from data values repeatedly over time steps, simplifying the backpropagation process across layers
and time, thus effectively mitigating the risk of vanishing gradients.
The gates—often referred to as the input, forget, and output gates—each play a pivotal role in the
LSTM's ability to shape and control the flow of data. These gates independently evaluate the
necessity of maintaining or modifying information, allowing the LSTM to make refined judgements
about the data it retains over time.
Overall, the architecture of LSTMs provides substantial improvements over traditional RNNs,
particularly in tasks that require learning from long input sequences. The ability of LSTMs to retain
information over prolonged periods and their robustness to vanishing gradients make them superior
for handling complex sequence prediction problems.
1. Forget Gate: This gate plays a pivotal role in the LSTM's functionality by filtering out
unnecessary information. It decides what information is non-essential and should be discarded,
thus optimizing the memory utilization of the network. The effectiveness of the LSTM in
managing its memory component is largely attributable to the operations of the forget gate.
2. Input Gate: The operation of the LSTM begins at the input gate, where it receives and
processes the incoming data. This gate is critical as it determines which values from the input
data should be updated in the cell state, thereby allowing the network to preserve relevant
information throughout the operation of the model.
3. Output Gate: The output gate is responsible for determining what the next output should be. It
does this by filtering the information from the cell state based on the current input and the
memory of the previous cell state, producing the output that is used for further processing or as
the final prediction.
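For reference, the three gates can be written in the standard LSTM formulation (this notation follows the common literature and is not taken from this report's implementation), where \sigma is the sigmoid function, x_t the current input, h_{t-1} the previous hidden state and C_t the cell state:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)             % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)             % input gate
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)      % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t    % cell state update
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)             % output gate
h_t = o_t \odot \tanh(C_t)                         % hidden state / output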
-Text Prediction: LSTMs are particularly effective in text prediction due to their ability to
remember and utilize past information, such as previously encountered words and their contexts.
This capability allows them to predict subsequent words in a sentence with a higher degree of
accuracy, which is immensely beneficial in applications like chatbots used by e-commerce sites and
mobile applications.
- Stock Market Prediction: In financial applications, LSTMs can analyze and remember patterns in
historical stock market data, enabling them to predict future market trends. This task is challenging
due to the inherent unpredictability of the market, requiring the LSTM to be trained on extensive and
representative historical data.
Hardware Requirements:
RAM: 8 GB (min)
Hard Disk: 500 GB
4. SYSTEM DESIGN
Here we combine the two independent architectures described above to develop the image caption
generator, also known as the CNN-LSTM model. For an input image, the two architectures are used
together as follows to obtain its caption. We consider two pre-trained CNN models, InceptionV3 and
VGG16: the CNN is used to extract features from the image data, these features are passed to an
LSTM together with the input text data, and the LSTM processes them to generate more accurate
and interesting captions for the image.
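As an illustration of how a pre-trained CNN is used purely as an image encoder in this design (a sketch only; the actual project code appears in Chapter 5, and 'example.jpg' is a placeholder file name):

from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model
import numpy as np

# InceptionV3 without its final softmax layer: the pooled 2048-d vector serves as the image encoding
base = InceptionV3(weights='imagenet')
encoder = Model(inputs=base.inputs, outputs=base.layers[-2].output)

# encode one image; InceptionV3 expects 299x299 RGB inputs
img = load_img('example.jpg', target_size=(299, 299))
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
feature = encoder.predict(x)   # shape (1, 2048), later consumed by the LSTM decoder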
4.2 UML Diagrams
Use case diagrams represent the functionality of the system from a user's point of view.
Use cases are used during requirements elicitation and analysis to represent the functionality of
the system. Use cases focus on the behaviour of the system from an external point of
view.
Actors are external entities that interact with the system. Examples of actors include users
such as an administrator, a bank client, etc., or another system such as a central database.
4.2.1 Use Case Diagram:
Use case diagrams model the functionality of a system using actors and use cases.
Use cases are services or functions provided by the system to its users.
4.2.2 Class Diagram:
4.2.3 Data Flow Diagram:
A data-flow diagram is a way of representing the flow of data through a process or a system. The
DFD also provides information about the outputs and inputs of each entity and of the process
itself. A data-flow diagram has no control flow: there are no decision rules and no loops.
4.2.4 Sequence Diagram:
A sequence diagram shows object interactions arranged in time sequence. It depicts the
objects and classes involved in the scenario and the sequence of messages exchanged
between the objects needed to carry out the functionality of the scenario.
4.2.5 Activity Diagram:
5 IMPLEMENTATION
5.1.1 Object Detection:
In this module, the Convolutional Neural Network performs the task of object detection from
the images. In this phase, a transfer learning methodology is used to reuse previously
learned knowledge. We have used the pre-trained models VGG16 and InceptionV3, which
contain the functionality of a convolutional neural network, to detect the objects in the image.
MODULES
import os
import pickle
import re  # used below when cleaning the captions
import numpy as np
from tqdm.notebook import tqdm
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.utils import to_categorical, plot_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
PATH
BASE_DIR = 'C:\\Users\\Manish\\Downloads\\ICG\\Flickr8k_Dataset'
WORKING_DIR = 'C:\\Users\\Manish\\Downloads\\ICG\\Working'
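The listing next calls model.summary() without showing how the VGG16 encoder is set up; the following is a minimal sketch of that step, assuming the Flickr8k images live in an 'Images' sub-folder of BASE_DIR (the folder name is an assumption):

FEATURE EXTRACTION
# load VGG16 and drop its final classification layer so the 4096-d fc2 output serves as the image feature
model = VGG16()
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# extract and cache the feature vector of every image, keyed by image id
features = {}
directory = os.path.join(BASE_DIR, 'Images')
for img_name in tqdm(os.listdir(directory)):
    image = load_img(os.path.join(directory, img_name), target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)
    feature = model.predict(image, verbose=0)
    image_id = img_name.split('.')[0]
    features[image_id] = feature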
print(model.summary())
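The captions_doc string iterated over below is assumed to have been read from the Flickr8k captions file; a minimal sketch with an assumed file name:

# read the whole captions file into one string ('captions.txt' name and location are assumptions)
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)  # skip the header line, if the file has one
    captions_doc = f.read()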
mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    tokens = line.split(',')
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    image_id = image_id.split('.')[0]
    # convert caption list to string
    caption = " ".join(caption)
    # create list if needed
    if image_id not in mapping:
        mapping[image_id] = []
    # store the caption
    mapping[image_id].append(caption)
len(mapping)
def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # preprocessing steps
            # convert to lowercase
            caption = caption.lower()
            # delete digits, special chars, etc. (re.sub is needed here; str.replace would not apply the pattern)
            caption = re.sub(r'[^A-Za-z ]', '', caption)
            # collapse additional spaces
            caption = re.sub(r'\s+', ' ', caption)
            # add start and end tags to the caption
            caption = 'startseq ' + " ".join([word for word in caption.split() if len(word) > 1]) + ' endseq'
            captions[i] = caption
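The tokenizer, vocabulary size, maximum caption length and train split used by the data generator and training loop below are not shown in the listing; a minimal sketch of these steps, assuming the mapping built above (the 90/10 split ratio is an assumption):

# clean all captions in place, then collect them for fitting the tokenizer
clean(mapping)
all_captions = [caption for key in mapping for caption in mapping[key]]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(caption.split()) for caption in all_captions)

# simple split of the image ids into train and test sets
image_ids = list(mapping.keys())
split = int(len(image_ids) * 0.90)
train = image_ids[:split]
test = image_ids[split:]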
DATA GENERATOR
# data generator: yields batches of ([image feature, partial caption sequence], next-word target)
# (the function header and surrounding loops are reconstructed; the original listing was partial)
def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    X1, X2, y = list(), list(), list()
    while True:
        for key in data_keys:
            captions = mapping[key]
            # process each caption
            for caption in captions:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split the sequence into X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pairs
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store the sequences
                    X1.append(features[key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
                    if len(X1) == batch_size:
                        X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                        yield [X1, X2], y
                        X1, X2, y = list(), list(), list()
        # yield any remaining partial batch before starting the next pass over the data
        if len(X1) > 0:
            X1, X2, y = np.array(X1), np.array(X2), np.array(y)
            yield [X1, X2], y
            X1, X2, y = list(), list(), list()
MODEL CREATION
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)
# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

# combine the image and text branches into one model and compile it (these two lines complete the partial listing)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
epochs = 10
batch_size = 32
steps = len(train) // batch_size

for i in range(epochs):
    # create data generator
    generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
    # fit for one epoch
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
CAPTION GENERATION
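The body of the caption-generation routine is missing from the listing (only its final return statement survives); the following is a minimal greedy-decoding sketch, assuming the trained model, tokenizer, features dictionary and max_length defined earlier, with idx_to_word as a helper reconstructed here:

def idx_to_word(integer, tokenizer):
    # map a token index back to its word
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_caption(image_name):
    # look up the cached feature vector for this image
    image_id = image_name.split('.')[0]
    feature = features[image_id]
    # start with the start token and greedily predict one word at a time
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([feature, sequence], verbose=0)
        word = idx_to_word(np.argmax(yhat), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    print(in_text)
    return in_text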
generate_caption("1001773457_577c3a7d70.jpg")
generate_caption("1002674143_1b742ab4b8.jpg")
6 RESULTS
6.1 BLEU Score Comparison
[Figure: bar chart comparing BLEU-1 and BLEU-2 scores of the VGG16 and InceptionV3 CNN models]
As per the Implementation of our project using two different models, we found out
that the InceptionV3 model yields better results compared to the VGG16 model.
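For reference, BLEU scores of the kind compared above can be computed with NLTK's corpus_bleu; a sketch, assuming actual_captions (lists of tokenized reference captions) and predicted_captions (tokenized generated captions) have been collected over the test split:

from nltk.translate.bleu_score import corpus_bleu

# BLEU-1 weights only unigram precision; BLEU-2 averages unigram and bigram precision
print("BLEU-1: %f" % corpus_bleu(actual_captions, predicted_captions, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual_captions, predicted_captions, weights=(0.5, 0.5, 0, 0)))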
6.2 Training Loss
[Figure: training loss per epoch]
As per our study, the training loss decreases steadily with each epoch, but it does not become very
low; to address this, the dataset can be enlarged and attention models can be incorporated as
needed.
6.4 SNAPSHOTS
VGG16
INCEPTIONV3
7 CONCLUSION
We examined and adjusted the CNN-based image captioning technique. We broke down the
process into sentence generation, the CNN, and the RNN-based LSTM in order to fully grasp it. We
changed or swapped out each component to observe how it affected the outcome. The Flickr8k and
Flickr30k datasets are used to test the updated approach. The experiment's findings indicate that
InceptionV3 performs better in BLEU score measurement than VGGNet (we also tried testing with
images outside the dataset), and that increasing the beam size generally raises the BLEU score but
does not always improve the quality of the description as evaluated by humans.
We would like to train our model further and integrate it with text readout, i.e. the output caption
gets converted into audio so that it aids visually impaired people. We would also like to train it with
larger datasets like Flickr40k, the COCO dataset, etc.
8 REFERENCES