Image Captioning - A Deep Learning Approach Using CNN and LSTM Network
Preeti Voditel, Aparna Gurjar, Aakansha Pandey, Akrati Jain, Nandita Sharma, Nisha Dubey
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Department of Computer Applications,
Shri Ramdeobaba College of Engineering and Management,
Nagpur, India
that memory to inform its current processing. The memory cell within the LSTM model is responsible for keeping track of this information, and the input and forget gates control the flow of information into and out of the cell. This allows the LSTM to selectively keep or discard information, depending on its relevance to the task at hand. The LSTM model has a special structure called gates, which determine what information should be kept or forgotten from one moment to the next. The gates can be opened or closed, and if they are closed, the information stored in the memory cell will not change. This allows the model to retain important information over a longer period of time and use it to make predictions or generate captions for images. The LSTM model is designed to address the vanishing gradient problem, which occurs in many Recurrent Neural Network models [7]. The vanishing gradient problem refers to the difficulty in training deep networks due to the rapidly decreasing gradient magnitude with increasing network depth. The LSTM model is also adept at learning vision and language interactions by keeping track of past context and predicting future context information [14]. This behaviour of the LSTM helps during textual caption generation of visual information. Image captioning requires the generation of a word sequence, and the LSTM model helps predict the caption words based on the input image.

A Long Short Term Memory (LSTM) network is composed of four distinct gates that serve specific functions. These gates work together to regulate the flow of information into and out of the memory cell and to control the stability of the gradient during training. They are:

• Forget Gate (f): The forget gate controls the amount of information to be retained from the previous state by calculating a value between 0 and 1, which is then used to update the current state. The previous output and the current input are used to determine the proportion of information that should be kept and the proportion that should be discarded. This mechanism keeps only the important information and discards the irrelevant information, leading to better performance of the model.

• Input Gate (i): The input gate determines the amount of new information to be stored in the cell state, based on the current input and the previous output. It combines these two signals to produce a scalar between 0 and 1, which acts as a weighting factor. This weighting factor is then applied to the output of the tanh activation function to determine the new information to be stored in the cell state. This new information is added to the previous state, allowing the LSTM to maintain its context over time.

• Input Modulation Gate (g): The input modulation gate is a part of the input gate that adjusts the incoming information before it is stored in the internal cell state. It modulates the input data through a non-linear transformation, which helps to standardize the data and ensure that it has zero mean. This gate helps to ensure that only relevant information is incorporated into the internal state, and the zero-mean non-linearity speeds up learning by reducing the time it takes for the network to converge. While this gate's actions are not as crucial as those of the other gates, it is considered optimal to incorporate it within the design of the LSTM cell.

• Output Gate (o): The output gate uses the current input and the previous state to determine how much of the current state should be output. The state is passed through a tanh function to add non-linearity and make it zero-mean before being scaled by the output gate's fraction. This final output is then used as the input for the next LSTM block and is also fed back into the current LSTM block as part of its state.

The LSTM also uses an additional internal state, called the cell state, to pass information from one time-step to the next. This allows the LSTM to retain relevant information for longer periods of time, making it well suited for tasks such as sequence prediction. The LSTM structure is shown in Fig. 3.

Fig. 3. LSTM structure
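For reference, the gate behaviour described above is usually summarized by the standard LSTM update equations; this notation is conventional and is not reproduced from the paper. With $x_t$ the current input, $h_{t-1}$ the previous output, and $c_t$ the cell state:

\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}

where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and $f_t$, $i_t$, $g_t$, $o_t$ are the forget, input, input modulation and output gates respectively.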
In our image caption generator model, we combine these two network architectures (the CNN and the LSTM) into what is commonly referred to as a CNN-RNN hybrid model, shown in Fig. 4.

Fig. 4. CNN-LSTM model

A pre-trained CNN is required to extract important features from an input image, which are then used as input for an LSTM network that has been trained to model language. The feature representation obtained from the CNN is altered to fit the input specifications of the LSTM network, enabling it to produce output based on the characteristics of the image.
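As an illustration of this hybrid design, the sketch below wires up a merge-style CNN-LSTM decoder in Keras. It is only a sketch of the architecture described here, not the authors' code; the layer sizes (4096-dimensional image features, 256-unit embedding and LSTM), the 0.5 dropout, and the names vocab_size and max_length are illustrative assumptions.

# Minimal sketch of a merge-style CNN-LSTM captioning model (illustrative sizes).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8764   # illustrative: number of words in the vocabulary + 1 for padding
max_length = 35     # illustrative: maximum caption length in words

# Image-feature branch: a 4096-dim CNN feature vector squeezed to 256 dims.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Caption branch: the partial word sequence fed through an embedding and an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge both branches and predict the next word of the caption.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In this merge variant the image feature conditions the decoder once rather than being fed into the LSTM at every time step; either wiring is consistent with the description above.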
The label and target text for training the LSTM model would be the text description that needs to be generated. For example, if there is a picture containing an old man who is wearing a hat, the label and target text would be:

Label: [<start>, An, old, man, is, wearing, a, hat]
Target: [An, old, man, is, wearing, a, hat, ., <end>]
This is done so that the model can identify the beginning and end of the caption and understand how the words in the caption relate to each other.
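In practice this wrapping and shifting takes only a few lines of Python. The helper below is illustrative; the function name and the exact marker strings are not taken from the paper.

# Wrap a caption with start/end markers and build the shifted (label, target) pair.
def make_label_and_target(caption):
    words = caption.strip().split()
    label = ['<start>'] + words   # sequence fed to the LSTM
    target = words + ['<end>']    # same sequence shifted by one position
    return label, target

label, target = make_label_and_target("An old man is wearing a hat")
print(label)   # ['<start>', 'An', 'old', 'man', 'is', 'wearing', 'a', 'hat']
print(target)  # ['An', 'old', 'man', 'is', 'wearing', 'a', 'hat', '<end>']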
C. VGG16

VGG16, whose architecture is shown in Fig. 5, is a popular variant of CNN and is widely considered to be one of the most advanced models for computer vision tasks. The VGG16 model is a highly advanced convolutional neural network that was created with the goal of improving upon previous computer vision models. Its creators experimented with different architectural designs and ultimately decided to use a deep network with small convolution filters. This resulted in a model with a large number of layers and trainable parameters, and ultimately led to VGG16 becoming one of the best-performing models in the field of computer vision. It is used for classification and identification of images belonging to diverse categories.

Fig. 5. VGG16 Architecture

Loading VGG16 Model

After loading the VGG16 model, the output produced by the final dense layer, which is usually of size 4096 as seen in Fig. 6, is passed to the encoder and decoder modules. The encoder module takes the image embeddings and encodes them into a tensor, whereas the decoder generates output (captions) at each step. The encoder-decoder model, which combines the CNN and LSTM, is very efficient in generating rich text captions for input images [8][15].
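A common way to obtain this 4096-dimensional representation in Keras is to load the pre-trained VGG16 network and take the output of its second-to-last fully connected layer, discarding the final classification layer. The sketch below follows that standard pattern and is an illustration of the step described here, not the authors' exact code.

# Build a VGG16 feature extractor: keep the network up to the last 4096-unit
# dense layer and drop the final 1000-way classification layer.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Model

vgg = VGG16()  # ImageNet weights; expects 224x224 RGB input
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)
print(feature_extractor.output_shape)  # (None, 4096)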
IV. DATA SET

The Flickr8k dataset is a collection of 8092 photographs in JPEG format, along with accompanying text descriptions of the images. The dataset is organized into two main directories, one containing the images themselves and the other containing text files with different sources of descriptions for the photographs. The main file of the dataset is called Flickr8k.token and it is located in the Flickr8k_text folder. This file contains the names of the images and their corresponding captions, with each entry separated by a newline character ("\n"). Overall, the dataset includes 8091 images, each with 5 English captions. Flickr8k is a 1 GB dataset comprised of images and accompanying text descriptions, organized into three subsets: the training set contains 6000 images, the testing set has 1000, and the validation set also has 1000 images. In order to prepare the dataset for use, the text descriptions undergo a process of cleaning and formatting. This includes removing punctuation, converting all words to lowercase, and removing any numerical values. Additionally, the dataset is processed in a sequential pattern, using a tool called tqdm to track progress.
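Assuming each line of Flickr8k.token holds an image identifier followed by one caption (the usual image_name.jpg#n, tab, caption layout; the paper does not spell out the delimiter), the file can be read into an image-to-captions dictionary along these lines:

# Read Flickr8k.token into {image_name: [caption1, ..., caption5]}.
# The "name#index<TAB>caption" line layout is an assumption, not quoted from the paper.
def load_captions(token_path):
    mapping = {}
    with open(token_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_field, caption = line.split('\t', 1)
            image_name = image_field.split('#')[0]   # drop the "#0".."#4" suffix
            mapping.setdefault(image_name, []).append(caption)
    return mapping

captions = load_captions('Flickr8k_text/Flickr8k.token')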
V. METHODOLOGY

A. Pre-requisites

To work on this project, it is important to have a strong understanding of various technologies and tools, including deep learning, Python programming, working with Kaggle notebooks, and using the Keras library. Additionally, experience with NumPy and natural language processing is necessary. It is also important to ensure that all of the necessary libraries are installed, in order to effectively run the project.

• TensorFlow: TensorFlow [11] is an open-source platform that provides extensive support for machine learning and artificial intelligence. With TensorFlow, developers can take advantage of its compatibility with multiple hardware platforms, user-friendly APIs, and the support of an active community.

• Keras: Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. One of the key strengths of Keras is its simplicity and ease of use. Keras also provides a number of pre-trained models, making it easy to get started with transfer learning and fine-tuning existing models.

• Relevant Python libraries like NumPy; Pickle, for serializing and deserializing Python objects; tqdm, for progress feedback through a progress-bar widget during long-running tasks or iterations over large datasets; and libraries
to extract more meaningful information, such as word embeddings or n-grams.

B. Pre-processing the Image

In our project, we use a pre-trained model named VGG16, from the Visual Geometry Group, for image recognition. This model is available within the Keras library and thus does not require additional installation or setup. For this project, the image features are extracted after resizing the images to 224x224 pixels. The extraction is performed at the layer just before the last layer of the classification model. This location is chosen because the last layer is used for predicting the classification of a photo; since we do not need to classify the images, we exclude the last layer during the feature extraction process.
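Concretely, the resizing and feature extraction described here can be sketched with the Keras image utilities as follows; the file path is a placeholder and the code is illustrative.

# Resize an image to 224x224, apply VGG16 preprocessing, and extract the
# 4096-dim feature from the layer just before the classification layer.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

vgg = VGG16()
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)  # as in the earlier sketch

img = load_img('path/to/image.jpg', target_size=(224, 224))   # placeholder path
x = img_to_array(img)                                          # (224, 224, 3)
x = preprocess_input(np.expand_dims(x, axis=0))                # (1, 224, 224, 3)
feature = feature_extractor.predict(x)                         # (1, 4096)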
C. Creating vocabulary for the images

Before using text data in a machine learning or deep learning model, it is necessary to clean and prepare it for the model. This process includes splitting the text into individual words and handling issues with punctuation and case sensitivity. Additionally, since computers do not understand English words, we need to represent them with numbers. This is done by creating a vocabulary, mapping each word to a unique index value, and then encoding each word into a fixed-size vector. Only after this process does the text become readable by the machine and usable for generating captions for images. We plan to reduce the size of our vocabulary by processing the text through the cleaning steps described below.

In order to accomplish the goal of reducing the size of the vocabulary, we have outlined and defined five functions, listed below (a sketch of the cleaning step follows this list):

• Retrieving the data.
• Establishing a mapping between images and their descriptions using a dictionary.
• Purifying the descriptions by eliminating punctuation marks, transforming them to lowercase letters, and removing any words containing numbers.
• Creating a list of all the distinct words found in the descriptions and making a vocabulary out of them.
• Making a document in which all the cleaned captions are saved.
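The third step, cleaning each description, can be sketched as below; the function name and the use of str.maketrans are illustrative choices rather than the authors' code.

# Clean one caption: lowercase it, strip punctuation, and drop tokens that
# contain numbers, as described in the list above.
import string

def clean_caption(caption):
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().split()
    words = [w.translate(table) for w in words]   # remove punctuation
    words = [w for w in words if w.isalpha()]     # drop words containing numbers
    return ' '.join(words)

print(clean_caption("A child in a pink dress is climbing up a set of stairs ."))
# -> "a child in a pink dress is climbing up a set of stairs"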
In order to establish the vocabulary for the project, all the unique words from the training dataset are tokenized. This process results in 8763 unique words being defined as the vocabulary.
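With Keras, this tokenization step is typically handled by the Tokenizer class, which assigns each unique word an integer index; the snippet below is a sketch in which all_captions stands in for the list of cleaned training captions.

# Fit a tokenizer on the cleaned training captions and derive the vocabulary size.
from tensorflow.keras.preprocessing.text import Tokenizer

all_captions = ['a child in a pink dress is climbing up a set of stairs']  # placeholder list
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1            # +1: index 0 is reserved for padding
max_length = max(len(c.split()) for c in all_captions)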
D. Training the model

The dataset includes a file called "Flickr_8k.trainImages.txt", which compiles the list of 8000 image names that will be utilized during the training phase of the project. The first step is to load the features extracted from the pre-trained CNN model. To train the model, we use the training images, breaking them into smaller chunks called batches and using them to generate input and output sequences. These sequences are then used to fit the model. The training process is set to run for 20 cycles, known as epochs.
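Because the full set of image-feature/word-sequence pairs does not fit in memory at once, the batching described here is usually driven by a generator that yields one batch at a time. The sketch below shows the general shape of such a loop; the names (data_generator, features, captions) and the batch-size and steps-per-epoch values are assumptions carried over from the earlier sketches, not the authors' code.

# Yield batches of ((image_feature, padded_input_sequence), next_word) samples.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_length, vocab_size, batch_size=64):
    X1, X2, y = [], [], []
    while True:
        for image_name, caps in captions.items():
            for cap in caps:
                seq = tokenizer.texts_to_sequences([cap])[0]
                for i in range(1, len(seq)):          # one sample per caption prefix
                    X1.append(features[image_name][0])
                    X2.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                    y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
                    if len(y) == batch_size:
                        yield (np.array(X1), np.array(X2)), np.array(y)
                        X1, X2, y = [], [], []

steps = 6000  # illustrative steps per epoch
model.fit(data_generator(captions, features, tokenizer, max_length, vocab_size),
          epochs=20, steps_per_epoch=steps)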
E. Model evaluation – Bilingual Evaluation Understudy (BLEU) score

BLEU is a method to measure the similarity between a generated sentence and a reference sentence. It is mostly used to evaluate the performance of machine translation systems. The score ranges from 0.0 to 1.0, where 1.0 represents a perfect match and 0.0 represents a complete mismatch.

To evaluate the performance of the model, the generated captions are compared with the actual captions, and the similarity between these two sets is measured using the BLEU score. This score is calculated for the entire set of captions, and it provides a summary of how well the generated captions match the expected captions.
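The corpus-level BLEU-1 and BLEU-2 values reported in the next section can be computed with NLTK's corpus_bleu; in the sketch below, actual holds one list of tokenized reference captions per image and predicted holds the corresponding generated caption, both placeholders.

# Corpus-level BLEU-1 and BLEU-2 over all generated captions.
from nltk.translate.bleu_score import corpus_bleu

actual = [[['a', 'dog', 'runs', 'on', 'the', 'grass']]]       # placeholder references
predicted = [['a', 'dog', 'is', 'running', 'on', 'grass']]    # placeholder hypotheses

print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))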
VI. RESULTS

Fig. 7 (a). Generated caption-1
Fig. 7 (b). Generated caption-2
Fig. 7 (c). Generated caption-3

EXPERIMENTAL EVALUATION

Fig. 7 (a), (b), and (c) show the captions generated for 3 sample images. The BLEU score was calculated for the entire set of captions, for single-word and word-pair matches respectively, as:

BLEU-1: 0.569902
BLEU-2: 0.376622

Our model has achieved a BLEU-1 score of 0.56 for the 1-gram (single word) match and a BLEU-2 score of 0.37 for the 2-gram (word pair) match. The generated captions were evaluated against 5 reference sentences. There was almost a 60% single-word match with the reference sentences, as visible in the BLEU-1 score, which is considered good. We also observed how different values of epochs affected the accuracy of our model: after some epochs, the accuracy of the model started decreasing due to overfitting. The dropout technique used during training ignores some of the input information in selected layers to ensure that the model does not overfit. The dropout rate we used is 0.5, and it avoided overfitting of the model.
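The captions in Fig. 7 are produced at inference time by feeding the start marker and repeatedly predicting the next word until the end marker (or the maximum length) is reached. The paper does not spell out the decoding loop, so the greedy sketch below is only an illustration; it reuses the model, tokenizer and max_length assumed in the earlier sketches, and the marker words are placeholders.

# Greedy decoding sketch: generate a caption for one extracted image feature.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length,
                     start_token='start', end_token='end'):
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    text = start_token
    for _ in range(max_length):
        seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=max_length)
        probs = model.predict([photo_feature, seq], verbose=0)
        word = index_to_word.get(int(np.argmax(probs)))
        if word is None or word == end_token:
            break
        text += ' ' + word
    return text.replace(start_token, '', 1).strip()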
VII. CONCLUSION

We have built an Image Caption Generator using a combination of a CNN and an LSTM model. This architecture, known as CNN-LSTM, can be used in a variety of areas such as computer vision and natural language processing. Our specific implementation uses an encoder-decoder approach to generate grammatically correct captions for images. The model we propose uses a CNN as an encoder and an LSTM as a decoder. We have evaluated the model using the standard metric, BLEU, on the Flickr8k captions dataset. Our findings indicate that the model performs comparably with other leading techniques, as evaluated by the BLEU metric. Our proposed model has displayed positive results in terms of BLEU scores, though there is still scope for enhancement. In the future, we plan to improve the semantic relevance of the generated captions by implementing the attention mechanism, where attention scores are used to change the attention strength on various image attributes.

REFERENCES

[1] Md. Z. Hossain, F. Sohel, Md. F. Shiratuddin and H. Laga, "A Comprehensive Survey of Deep Learning for Image Captioning," arXiv:1810.04020v2 [cs.CV], 14 Oct. 2018.
[2] Q. You, H. Jin, Z. Wang, C. Fang and J. Luo, "Image Captioning with Semantic Attention," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4651-4659.
[3] C. Park, B. Kim and G. Kim, "Attend to You: Personalized Image Captioning With Context Sequence Memory Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 895-903.
[4] J. Johnson, A. Karpathy and L. Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4565-4574.
[5] J. Aneja, A. Deshpande and A. G. Schwing, "Convolutional Image Captioning," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5561-5570.
[6] G. Srivastava and R. Srivastava, "A Survey on Automatic Image Captioning," Mathematics and Computing (ICMC 2018), Communications in Computer and Information Science, vol. 834, Springer.
[7] H. Wang, Y. Zhang and X. Yu, "An Overview of Image Caption Generation Methods," Computational Intelligence and Neuroscience, vol. 2020, Article ID 3062706, 13 pages, 2020.
[8] P. Sharma, N. Ding, S. Goodman and R. Soricut, "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556-2565, Melbourne, Australia, 2018.
[9] G. Sairam, M. Mandha, P. Prashanth and P. Swetha, "Image Captioning using CNN and LSTM," 4th Smart Cities Symposium (SCS 2021), Online Conference, Bahrain, 2021, pp. 274-277.
[10] V. Agrawal, S. Dhekane, N. Tuniya and V. Vyas, "Image Caption Generator Using Attention Mechanism," 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2021, pp. 1-6.
[11] Online: https://www.tensorflow.org
[12] Z. Shi, X. Zhou, X. Qiu and X. Zhu, "Improving Image Captioning with Better Use of Captions," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7454-7464, July 2020.
[13] G. Geetha, T. Kirthigadevi, G. Godwin Ponsam, T. Karthik and M. Safa, "Image Captioning Using Deep Convolutional Neural Networks (CNNs)," Journal of Physics: Conference Series, vol. 1712, 012015, 2020.
[14] C. Wang, H. Yang, C. Bartz and C. Meinel, "Image Captioning with Deep Bidirectional LSTMs," MM '16: Proceedings of the 24th ACM International Conference on Multimedia, 2016.
[15] A. K. Poddar and R. Rani, "Hybrid Architecture using CNN and LSTM for Image Captioning in Hindi Language," International Conference on Machine Learning and Data Engineering, Procedia Computer Science, vol. 218, pp. 686-696, 2023.