VCAS 2022 Paper 632
Abstract. When we look at an image, our brain can describe it effortlessly; can a machine do the same? With the growth of deep learning techniques and the availability of massive datasets, a model that generates image captions can be built automatically. In this work, a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) are combined to construct a model that automatically creates captions for images. The FLICKR 8K dataset is used as the benchmark dataset. According to the results, PCA (Principal Component Analysis) performs best among the feature selection techniques in terms of BLEU score, and the categorical cross-entropy loss function gives the best results.
1 Introduction
1.1 Motivation
It is worth understanding how important this problem is in the real world. Automatically generated captions help solve a number of practical problems:
– Medical use - Taking a snapshot of an affected area of the skin and generating a caption can help identify diseases.
– CCTV cameras - Besides recording the world, captions can be generated for the video feed, which helps reduce crime and accidents.
– Visually impaired - Captions help visually impaired persons obtain information about images.
– Petroleum exploration - Generating captions for images of reservoir rocks in the subsurface of the earth helps determine the properties of the reservoir.
2 Methodology
The FLICKR 8K dataset [7], which contains 8000 images with 5 captions each, is used for training and testing. It is divided into three parts: training, validation, and test sets.
The captions are English sentences that contain special symbols (such as full stops and question marks). During preprocessing, these symbols and single-letter words are eliminated. After cleaning, the description dataset looks as shown in Fig. 2.
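As an illustration, the cleaning step described above could look roughly like the following minimal Python sketch (not the authors' exact code; the dictionary structure and example caption are assumptions):

import string

def clean_caption(caption):
    """Lower-case a caption, strip special symbols, and drop single-letter words."""
    caption = caption.lower()
    # Remove punctuation such as full stops and question marks.
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    # Keep only alphabetic tokens longer than one character.
    words = [w for w in caption.split() if len(w) > 1 and w.isalpha()]
    return " ".join(words)

# Hypothetical structure: image id -> list of its 5 raw captions.
descriptions = {"1000268201": ["A child in a pink dress is climbing up a set of stairs ."]}
cleaned = {img: [clean_caption(c) for c in caps] for img, caps in descriptions.items()}
print(cleaned["1000268201"][0])  # child in pink dress is climbing up set of stairs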
3 Feature selection
Feature selection is the process of selecting relevant features, i.e. reducing the number of input variables. Reducing the number of input variables both shortens the model's training time and can improve its performance, and it decreases redundancy among the features. Many feature selection techniques are available; among them are PCA (Principal Component Analysis), KPCA (Kernel Principal Component Analysis), SVD (Singular Value Decomposition), and MDS (Multidimensional Scaling).
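For illustration, a hedged sketch of PCA-based reduction of the image features using scikit-learn is given below; the 2048-dimensional CNN features, the dummy data, and the variable names are assumptions rather than the authors' code:

import numpy as np
from sklearn.decomposition import PCA

# image_features: one row per image, one column per CNN feature (dummy data here).
image_features = np.random.rand(8000, 2048).astype("float32")

pca = PCA(n_components=256)              # keep, e.g., 256 principal components
reduced_features = pca.fit_transform(image_features)
print(reduced_features.shape)            # (8000, 256)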
Captions are what the model is going to predict; they are the target of our model. All words in the captions therefore need to be tokenized and encoded as fixed-size vectors. The model maps every token to a 200-dimensional fixed-size vector using a pre-trained GloVe [9] model.
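A minimal sketch of how such an embedding matrix could be built from the pre-trained 200-dimensional GloVe vectors follows; the file name and the tokenizer outputs word_index and vocab_size are assumptions for illustration:

import numpy as np

embedding_dim = 200
glove = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:   # pre-trained GloVe vectors
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# word_index: token -> integer id from the tokenizer; vocab_size: number of tokens + 1.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_index.items():
    vector = glove.get(word)
    if vector is not None:        # words missing from GloVe remain all-zero
        embedding_matrix[idx] = vector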
5 Model Architecture
The model takes two inputs for training: the image and the partial caption. This is achieved using the functional API provided by the Keras [10] library for Python. The functional API allows creating a merge model. The model summary is shown in Fig. 4.
Fig. 4. Model summary.
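A hedged sketch of such a merge model with the Keras functional API is shown below; the layer sizes, caption length, and vocabulary size are illustrative assumptions rather than the exact configuration used in the paper:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

feature_dim = 256     # size of the selected image feature vector (assumed)
max_length = 34       # maximum caption length in tokens (assumed)
vocab_size = 8000     # vocabulary size after cleaning (assumed)

# Image feature branch.
image_input = Input(shape=(feature_dim,))
image_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Partial-caption branch with a 200-dimensional (GloVe-initialised) embedding.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 200, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(Dropout(0.5)(caption_embed))

# Merge the two branches and predict the next word of the caption.
merged = add([image_dense, caption_lstm])
decoder = Dense(256, activation="relu")(merged)
outputs = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()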
6 Evaluation
Evaluation techniques are needed to measure how good the captions predicted by the model are. The BLEU score [12] is used to evaluate our model. The BLEU score is a number between zero and one that measures the similarity between machine-generated text and a set of good-quality reference translations. It must be noted that the images used for testing must be similar to those used for training the model. No machine learning model will produce relevant captions if the test image is totally different from the training images.
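For example, the BLEU score can be computed with NLTK as in the minimal sketch below; the use of NLTK and the toy tokenized captions are assumptions, since the paper only states that the BLEU score [12] is used:

from nltk.translate.bleu_score import corpus_bleu

# references: for each test image, the list of its tokenized reference captions.
# hypotheses: the tokenized caption generated by the model for that image.
references = [[["dog", "runs", "on", "the", "grass"],
               ["a", "dog", "running", "in", "a", "field"]]]
hypotheses = [["dog", "runs", "in", "the", "grass"]]

score = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))  # BLEU-2, for example
print(f"Average BLEU score: {score:.4f}")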
Feature selection method   PCA      KPCA     SVD      MDS
Average BLEU score         0.3593   0.3435   0.3524   0.3321
Number of selected features   Average BLEU score
512                           0.3593
256                           0.3603
128                           0.3716
 64                           0.3745
 32                           0.3800
  2                           -
As the second table shows, reducing the number of selected features improves the performance of the model. However, reducing it to a very small size, e.g. 2 components in the case of PCA, leads to overfitting of the model. Machines still cannot produce captions as good as humans: the generated caption is irrelevant in the last case, which shows that the model needs further improvement. This becomes the future scope of this work. Some of the results produced by our model are shown in Fig. 6, Fig. 7, Fig. 8 and Fig. 9.
References
1. Srivastava, G., Srivastava, R.: A survey on automatic image captioning. In: Inter-
national Conference on Mathematics and Computing, pp. 74–83. Springer (2018)
2. Wang, J.: Analysis and design of a recurrent neural network for linear program-
ming. IEEE Transactions on Circuits and Systems I: Fundamental Theory and
Applications 40(9), 613–618 (1993)
3. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep-
tion architecture for computer vision. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2818–2826 (2016)
4. Rao, C.R.: The use and interpretation of principal component analysis in applied
research. Sankhyā: The Indian Journal of Statistics, Series A pp. 329–358 (1964)
5. Schölkopf, B., Smola, A., Müller, K.R.: Kernel principal component analysis. In:
International conference on artificial neural networks, pp. 583–588. Springer (1997)
6. Golub, G.H., Reinsch, C.: Singular value decomposition and least squares solutions.
In: Linear algebra, pp. 134–151. Springer (1971)
7. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a rank-
ing task: Data, models and evaluation metrics. Journal of Artificial Intelligence
Research 47, 853–899 (2013)
8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE conference on computer vision
and pattern recognition, pp. 248–255. IEEE (2009)
9. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
sentation. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pp. 1532–1543 (2014)
10. Chollet, F., et al.: Keras (2015). URL https://github.com/fchollet/keras
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
9(8), 1735–1780 (1997)
12. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th annual meeting of
the Association for Computational Linguistics, pp. 311–318 (2002)
13. Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training deep neural
networks with noisy labels. Advances in neural information processing systems 31
(2018)
14. Liu, L., Qi, H.: Learning effective binary descriptors via cross entropy. In: 2017
IEEE winter conference on applications of computer vision (WACV), pp. 1251–
1258. IEEE (2017)
15. Brigo, D., Pallavicini, A., Torresetti, R.: Calibration of CDO tranches with the
dynamical generalized-Poisson loss model. Available at SSRN 900549 (2007)