
Analyzing PCA for Automatic Image Captioning

Prashant and Gargi Srivastava

Rajiv Gandhi Institute of Petroleum Technology, Jais, Amethi 229304, Uttar Pradesh, India
{20cs3045,gsrivastava}@rgipt.ac.in

Abstract. When you see an image, your brain can easily describe it, but can a machine do the same? With the growth of deep learning techniques and the massive datasets now available, a model can be built to generate image captions automatically. In this work, CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) are used to construct a model that automatically creates captions for images. The Flickr8k dataset is used as a benchmark dataset. The results show that PCA (Principal Component Analysis) performs best for selecting features according to the BLEU score, and that the categorical cross-entropy loss function performs best.

Keywords: deep learning, image captions, CNN

1 Introduction

Automatic image captioning is the generation of a textual description of an image by an artificial system. It involves natural language processing concepts to describe images in a natural language such as English [1]. This work implements a caption generator using a CNN and an RNN [2] together. Image features are extracted using the InceptionV3 [3] model, a CNN pre-trained on the ImageNet dataset. Features are then selected using feature selection algorithms such as PCA [4], KPCA [5], and SVD [6]. Feature selection means selecting the relevant features, which decreases the training time and increases the quality of the generated captions. The selected features are fed to an RNN, which generates captions in English. The overall pipeline is shown in Fig. 1.

Fig. 1. Automatic Image Captioning



1.1 Motivation

It is important to understand how relevant this problem is to the real world; automatically generated captions help solve many problems.

– Medical use - Taking a snapshot of an affected area of the skin and generating a caption can help identify diseases.
– CCTV cameras - Along with viewing the world, captions can be generated for video footage, which can help reduce crime and accidents.
– Visually impaired - Captioning helps visually impaired persons get information about images.
– Petroleum exploration - Generating captions of reservoir rocks in the subsurface of the earth helps determine the properties of the reservoir.

2 Methodology

2.1 Collecting Dataset

The Flickr8k dataset [7], which contains 8000 images with 5 captions each, is used for training and testing. It is divided into three parts:

1. Training images - 6000
2. Testing images - 1000
3. Validation images - 1000

2.2 Cleaning Description

The dataset contains 5 captions for each image. Captions are English sentences that contain special symbols (such as full stops and question marks). For preprocessing, these symbols and single-letter words are eliminated. After cleaning, the description dataset looks as shown in Fig. 2.

Fig. 2. Post-processed captions
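As an illustration, a minimal cleaning step could look like the following sketch (the helper name and the example caption are hypothetical, not from the paper):

```python
import string

def clean_caption(caption):
    # Lowercase, strip punctuation, and drop single-letter and
    # non-alphabetic tokens, as described above.
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

print(clean_caption("A dog is running across the field ."))
# -> "dog is running across the field"
```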



2.3 Extracting feature vector from image


The input to the model is images, which need to be converted into fixed-size vectors. For this purpose the InceptionV3 convolutional neural network, pre-trained on the ImageNet [8] dataset, is used as shown in Fig. 3. InceptionV3 takes an image as input and converts it into a fixed-size vector of length 2048.

Fig. 3. InceptionV3 Model
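A minimal sketch of this step, assuming TensorFlow's bundled Keras (the helper name encode_image is hypothetical):

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import (InceptionV3,
                                                        preprocess_input)
from tensorflow.keras.preprocessing import image

# Pre-trained InceptionV3 without the classification head; global average
# pooling yields one 2048-dimensional vector per image.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def encode_image(path):
    img = image.load_img(path, target_size=(299, 299))  # InceptionV3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0).reshape(2048)   # fixed-size vector
```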

3 Feature selection
Feature selection is the process of selecting relevant features, i.e., reducing the number of input variables. Reducing the number of input variables is desirable both to decrease the model's training time and to increase its performance, and it reduces redundancy among the features. Many feature selection techniques are available, such as PCA (Principal Component Analysis) and KPCA (Kernel Principal Component Analysis).

3.1 Principal Component Analysis (PCA)

Principal Component Analysis (PCA), generally known as a data reduction technique, is a helpful feature selection technique because it uses linear algebra to transform the dataset into a compressed form. This work implements it with the scikit-learn Python library, which allows choosing the number of features in the output.
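For example, with scikit-learn the reduction from 2048 to 512 features (the number used in Sec. 7.2) is a single call; the feature matrix below is a random placeholder for the real InceptionV3 features:

```python
import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(6000, 2048)  # placeholder: one vector per training image

pca = PCA(n_components=512)            # choose the number of output features
reduced = pca.fit_transform(features)
print(reduced.shape)                           # (6000, 512)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```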

3.2 Kernel Principal Component Analysis (KPCA)

Kernel Principal Component Analysis (KPCA) is a non-linear dimensionality reduction technique. It is an extension of Principal Component Analysis (PCA), which is a linear transformation technique.

3.3 SVD (Singular Value Decomposition)

SVD is a data decomposition approach similar to Principal Component Analysis (PCA). For the decomposition, SVD exploits linear combinations of the rows and columns of the matrix.
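Both techniques are also available in scikit-learn; a sketch follows (the RBF kernel for KPCA is an assumption, as the paper does not state which kernel was used):

```python
import numpy as np
from sklearn.decomposition import KernelPCA, TruncatedSVD

features = np.random.rand(6000, 2048)  # placeholder InceptionV3 features

# Non-linear reduction; the kernel choice is an assumption, not from the paper.
kpca = KernelPCA(n_components=512, kernel="rbf")
reduced_kpca = kpca.fit_transform(features)

# SVD-based reduction via linear combinations of rows and columns.
svd = TruncatedSVD(n_components=512)
reduced_svd = svd.fit_transform(features)
```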

4 Data preprocessing - Captions

The captions are what the model predicts; they are the target of the model. All words in the captions therefore need to be tokenized and encoded as fixed-size vectors. The model maps every token to a 200-dimensional fixed-size vector using the pre-trained GloVe [9] embeddings.
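A sketch of the tokenization and embedding lookup, assuming the standard glove.6B.200d.txt distribution file (the paper does not name the exact GloVe file):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

captions = ["dog is running across the field"]  # placeholder: cleaned captions from Sec. 2.2

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1      # index 0 is reserved for padding

# Load the 200-dimensional GloVe vectors into a dictionary.
glove = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# One 200-length row per vocabulary word; unknown words stay all-zero.
embedding_matrix = np.zeros((vocab_size, 200))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]
```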

5 Model Architecture

The model takes two inputs for training: the image and the partial caption. This is achieved using the functional API provided by the Keras [10] library for Python. The functional API allows the creation of a merge model; the model summary is shown in Fig. 4.

Fig. 4. Model summary

LSTM (Long Short-Term Memory) [11] is a specialized Recurrent Neural Network (RNN). The model is compiled with the Adam optimizer from the Keras library.
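A minimal sketch of such a merge model with the Keras functional API follows; the layer sizes, dropout rates, and maximum caption length are illustrative assumptions, not values reported in the paper:

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000                    # placeholder: size of the fitted tokenizer
max_length = 34                      # placeholder: longest cleaned caption
embedding_matrix = np.zeros((vocab_size, 200))  # GloVe matrix from Sec. 4

# Image branch: selected CNN features projected to a common size.
img_in = Input(shape=(512,))         # 512 PCA-selected features
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: partial caption -> frozen GloVe embedding -> LSTM.
txt_in = Input(shape=(max_length,))
txt_emb = Embedding(vocab_size, 200,
                    embeddings_initializer=Constant(embedding_matrix),
                    trainable=False)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge the two branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```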

6 Evaluation

Evaluation techniques are needed to know how good the captions predicted by the model are. The BLEU score [12] is used to evaluate the model. The BLEU score is a number between zero and one that measures the similarity between machine-generated text and a set of good-quality reference texts. It must be noted that the images used for testing must be similar to those used for training; no machine learning model will give relevant captions if the testing image is totally different from the training images.
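A sketch of computing BLEU-1 (unigram-only weights, matching the scores reported in Figs. 6-9) with NLTK; the tokenized captions below are made-up examples:

```python
from nltk.translate.bleu_score import corpus_bleu

# Each test image has a list of reference captions and one hypothesis.
references = [[["dog", "runs", "across", "the", "grass"],
               ["dog", "running", "on", "grass"]]]
hypotheses = [["dog", "runs", "on", "the", "grass"]]

# Weights (1, 0, 0, 0) restrict the score to unigram precision (BLEU-1).
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
print(round(bleu1, 4))
```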

Fig. 5. Caption

7 Analysis and results


7.1 Loss Function
Three different loss functions, i.e., categorical cross-entropy [13], binary cross-entropy [14], and Poisson loss [15], are compared.

Table 1. Model accuracy against loss functions

Loss function   Categorical cross-entropy   Binary cross-entropy   Poisson
Accuracy        0.3021                      0.2875                 0.2904

The results in Table 1 show that categorical cross-entropy performs best for this task.

7.2 Feature Selection


Four feature selection methods have been analyzed, with 512 of the 2048 features selected in each case. As Table 2 shows, PCA performs best for this model.

Table 2. Average BLEU score against feature selection methods

Feature selection method   PCA      KPCA     SVD      MDS
Average BLEU score         0.3593   0.3435   0.3524   0.3321

Table 3 reports the effect of selecting different numbers of features using PCA.

Table 3. Effect of the number of selected features on the average BLEU score

Number of selected features   Average BLEU score
512                           0.3593
256                           0.3603
128                           0.3716
64                            0.3745
32                            0.3800
2                             -

These results show that reducing the dimensionality increases the average BLEU score, i.e., it increases the performance of the model. However, reducing it to a very small size, e.g., 2 features with PCA, leads to overfitting of the model, and the generated caption becomes irrelevant in that case. Since machines still cannot produce captions as well as humans, the model needs further improvement; this forms the future scope of this work. Some results produced by the model are shown in Fig. 6, Fig. 7, Fig. 8 and Fig. 9.

Fig. 6. The BLEU-1 score in this case is 0.5555.

Fig. 7. The BLEU-1 score in this case is 0.875, which is quite good.

Fig. 8. The BLEU-1 score is 0.7.

Fig. 9. The BLEU-1 score in this case is 0.33.

8 Conclusion and Future Scope

Though a good level of accuracy is obtained, many modifications can be made to improve this work:

– Use a larger dataset.
– Use different feature selection methods, e.g., supervised selection methods.
– Change the architecture of the model.
– Use different metrics to evaluate the model instead of BLEU.
– Use a cross-validation set to check for overfitting of the model.

References
1. Srivastava, G., Srivastava, R.: A survey on automatic image captioning. In: Inter-
national Conference on Mathematics and Computing, pp. 74–83. Springer (2018)
2. Wang, J.: Analysis and design of a recurrent neural network for linear program-
ming. IEEE Transactions on Circuits and Systems I: Fundamental Theory and
Applications 40(9), 613–618 (1993)
3. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep-
tion architecture for computer vision. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2818–2826 (2016)

4. Rao, C.R.: The use and interpretation of principal component analysis in applied
research. Sankhyā: The Indian Journal of Statistics, Series A pp. 329–358 (1964)
5. Schölkopf, B., Smola, A., Müller, K.R.: Kernel principal component analysis. In:
International conference on artificial neural networks, pp. 583–588. Springer (1997)
6. Golub, G.H., Reinsch, C.: Singular value decomposition and least squares solutions.
In: Linear algebra, pp. 134–151. Springer (1971)
7. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a rank-
ing task: Data, models and evaluation metrics. Journal of Artificial Intelligence
Research 47, 853–899 (2013)
8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE conference on computer vision
and pattern recognition, pp. 248–255. IEEE (2009)
9. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
sentation. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pp. 1532–1543 (2014)
10. Chollet, F., et al.: Keras (2015). URL https://github.com/fchollet/keras
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
9(8), 1735–1780 (1997)
12. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th annual meeting of
the Association for Computational Linguistics, pp. 311–318 (2002)
13. Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training deep neural
networks with noisy labels. Advances in neural information processing systems 31
(2018)
14. Liu, L., Qi, H.: Learning effective binary descriptors via cross entropy. In: 2017
IEEE winter conference on applications of computer vision (WACV), pp. 1251–
1258. IEEE (2017)
15. Brigo, D., Pallavicini, A., Torresetti, R.: Calibration of CDO tranches with the
dynamical generalized-Poisson loss model. Available at SSRN 900549 (2007)
