Image Caption Generation Using Deep Neural Networks
Abstract: In recent years, computer vision has made significant progress, primarily in the fields of image classification and object detection and recognition. Describing image content automatically in natural language is challenging and has tremendous potential impact. Here, the idea is to extract features from an image, generate captions, and convert the generated captions to speech. This work

implementation of the models we used (VGG16 and ResNet50) with comparison.
2022 International Conference for Advancement in Technology (ICONAT) | 978-1-6654-2577-3/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICONAT53423.2022.9726074
Authorized licensed use limited to: INDIAN INST OF INFO TECH AND MANAGEMENT. Downloaded on March 19,2025 at 06:29:29 UTC from IEEE Xplore. Restrictions apply.
shown in Fig. 4 with image size 224 × 224. Using VGG16 as a model [10], the training accuracy on the Flickr8k dataset was an estimated 29 percent. The image is passed through a stack of convolutional layers with a kernel size of 3 × 3. The convolutional layers are followed by three fully connected layers (the first two have 4096 channels, and the third has 1000 channels).

Fig. 4. VGG16 Architecture layers

B. Training procedure using ResNet50
ResNet50, a residual network (ResNet), was used as the CNN to train on the dataset [11]. With ResNet50 as the model (Fig. 5), an accuracy of approximately 45% was obtained after training for 20 epochs, and 73% after training for 50 epochs.

Fig. 5. Feature extraction in (a) ResNet50 network (b) VGG-16 network

On generating the captions of the images, the accuracy is tabulated in Table 1. Table 1 shows that ResNet50 achieves an average accuracy of 79%, which is higher than that of VGG16 (29%).

V. CONCLUSION AND FUTURE WORK
Image captioning is a challenging and demanding problem in many real-time scenarios. This paper focuses on captioning images from the Flickr8k dataset using ResNet50 as the convolutional neural network and LSTM as the recurrent neural network. Training, testing, and experimental analysis were carried out for both the VGG16 and ResNet50 models. The results show that ResNet50 performs better than VGG16, with an accuracy of 73% for ResNet50 and 29% for VGG16. The generated caption is then converted from text to speech using gTTS.
Future work will focus on training on a larger number of images and datasets to improve the model's overall accuracy.

REFERENCES
[1] Alex Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," International Conference on Neural Information Processing Systems, Curran Associates Inc., pp. 1097-1105 (2012).
[2] Sandeep Kumar Dash, Shantanu Acharya, Partha Pakray, Ranjita Das, Alexander Gelbukh, "Topic Based Image Caption Generation," Arabian Journal for Science and Engineering (2019).
[3] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator," IEEE (2015).
[4] Chang Liu, Changhu Wang, Fuchun Sun, Yong Rui, "Image2Text: A Multimodal Caption Generator," ACM (2016).
[5] Sepp Hochreiter, "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions."
[6] Vaidehi Muley, Varsha Kesavan, Megha Kolhekar, "Deep Learning based Automatic Image Caption Generation," Institute of Electrical and Electronics Engineers (2020).
[7] Nivetha Vijayaraju, "Image Retrieval Using Image Captioning," San Jose State University (2019).
[8] Zhengkui Wang, Xiao Yue, Yan Chu, Lei Yu, Mikhailov Sergei, "Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention" (2020).
[9] Xiangyu Zhang, Kaiming He, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition," Microsoft Research (2015).
[10] Liang Bai, Shuang Liu, Yanli Hua, Haoran Wang, "Image Captioning Based on Deep Neural Networks" (2018).
[11] San Pa Pa Aung, Win Pa Pa, Tin Lay Nwe, "Automatic Image Captioning using CNN and LSTM-Based Language Model" (2020).
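
As a rough check on the VGG16 description in the training procedure (224 × 224 input, 3 × 3 kernels, three fully connected layers of 4096, 4096, and 1000 channels), the following Python sketch traces the feature-map size through the network and counts its parameters. It is only an illustration under stated assumptions: the per-layer channel widths, the "same" padding, and the 2 × 2 max-pooling after each block follow the standard published VGG16 configuration, which the text above does not spell out, and the names `VGG16_LAYOUT`, `FC_WIDTHS`, and `vgg16_summary` are hypothetical helpers, not part of the authors' implementation.

```python
# Sketch: trace feature-map shapes and count parameters for a VGG16-style network.
# Assumption: the standard VGG16 layout (13 conv layers in 5 blocks, "same" padding,
# 2x2 max-pooling after each block). The paper itself states only the 224x224 input,
# the 3x3 kernels, and the three fully connected layers (4096, 4096, 1000).

# Output channels of each conv layer; "M" marks a 2x2 max-pool with stride 2.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_WIDTHS = [4096, 4096, 1000]


def vgg16_summary(image_size=224, in_channels=3, kernel=3):
    """Return (final feature-map shape, conv parameter count, FC parameter count)."""
    size, channels = image_size, in_channels
    conv_params = 0
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2  # pooling halves height and width
        else:
            # weights (kernel * kernel * in * out) plus one bias per output channel;
            # "same" padding leaves the spatial size unchanged
            conv_params += kernel * kernel * channels * item + item
            channels = item

    fc_params = 0
    fan_in = size * size * channels  # length of the flattened feature vector
    for width in FC_WIDTHS:
        fc_params += fan_in * width + width
        fan_in = width
    return (size, size, channels), conv_params, fc_params


if __name__ == "__main__":
    shape, conv_params, fc_params = vgg16_summary()
    print("feature map before the FC layers:", shape)
    print("convolutional parameters:", conv_params)
    print("fully connected parameters:", fc_params)
    print("total parameters:", conv_params + fc_params)
```

Under these assumptions the feature map entering the fully connected layers is 7 × 7 × 512, and the network has about 138 million parameters, most of them in the three fully connected layers.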