
Narrative Paragraph Generation for Photo Stream Using Neural Networks
This paper was downloaded from TechRxiv (https://www.techrxiv.org).

LICENSE

CC BY 4.0

SUBMISSION DATE / POSTED DATE

28-11-2022 / 06-12-2022

CITATION

N, Anjali M; More, Tejash; Misa, Kumari; Nath, Keshab (2022): Narrative Paragraph Generation for Photo Stream Using Neural Networks. TechRxiv. Preprint. https://doi.org/10.36227/techrxiv.21629720.v1

DOI

10.36227/techrxiv.21629720.v1
Narrative Paragraph Generation for Photo Stream Using Neural Networks

Anjali M N, Tejash More, Kumari Misa, and Keshab Nath*

Department of Computer Science and Engineering,


Indian Institute of Information Technology, Kottayam, Kerala, India, 686635,
[email protected]

Abstract. Humans have the innate ability to perceive an image just by looking at it; for us, images are not just a collection of objects but a network of interconnected object relationships. The problem arises when a machine tries to interpret an image, hence we convert image data into textual data. Despite major achievements in the image captioning field, there is a lack of models that provide concise captions for a given image; moreover, existing models are often so large that the number of learnable parameters is very high. The objective of this paper is to fill that gap: we provide an image captioning model built on a compact yet effective CNN architecture that is relatively new and still little used in this field. Our model incorporates an advanced deep convolutional neural network to extract image features and an Attention GRU with a local attention network to generate captions. We also identified a class imbalance problem in the popular Flickr dataset and tried to rectify it by adding images of some specific classes, thereby improving the dataset as well. The model has been trained on this improved Flickr dataset.

Keywords: Image Captioning, Visual Narration, Deep Learning, Natural Language Generation, Generative Adversarial Network.

1 Introduction
A single image can display many different ideas. We humans have an outstanding ability to summarize those ideas with respect to different parts of the image, and we can link each image region with our description. While this seems easy for humans, it is a tedious task for machines, as it requires the machine to express the image in natural language. Image caption generation has recently seen rapid progress with the advances in Neural Machine Translation (NMT) [1] and larger datasets [2], [3]. An encoder-decoder pipeline is used by many image captioning models, and many such encoder-decoder frameworks for sequence-to-sequence learning are based on Recurrent Neural Networks. Recurrent Neural Networks (RNNs) [4], as well as Long Short-Term Memory (LSTM) networks, can serve as sequence learners. RNNs can only remember earlier states for a few time steps because of the vanishing gradient problem [5]. To solve this, a special type of RNN architecture called the LSTM network was designed. It introduces a memory cell; each memory cell consists of three gates and a neuron along with a self-recurrent connection. Thanks to these gates, memory cells can keep and access data over long periods of time, making the LSTM network capable of learning long-term dependencies. LSTM memory cells aim to remember old data for the long term, but they are still bounded to a limited number of time steps, because long-term information is gradually diluted at every time step. To enhance the hierarchical structure of traditional image captioning models, we add a deep convolutional neural network (CNN) as an image encoder to extract features from the images. We adopted the Xception [6] model, a CNN built on depth-wise separable convolutions, for the encoder part. Unlike existing image captioning models, we use this more recent CNN together with an Attention GRU instead of an LSTM.
To summarize, our primary contribution lies in bringing in a deep CNN architecture, Xception, which is relatively small in size and has fewer parameters, to extract features from the images, while caption generation is handled by a recurrent decoder into which we incorporate an Attention GRU. We aim to provide a more accurate yet smaller image captioning model trained on the Flickr8k dataset. We also tweaked the dataset to our needs and provide a custom dataset in which we have attempted to balance the human classes of Flickr8k by adding about 700 more images.

2 Related Work

In the computer vision field, the problem of creating natural language descriptions for images has become a prominent topic. The traditional approach frames the problem as a retrieval and ranking task [7]. The most important drawback of retrieval-based techniques is that they are time-consuming and cannot come up with appropriate descriptions for a new set of image objects [5]. Inspired by the success of deep neural networks in artificial intelligence, researchers have applied the machine-translation encoder-decoder framework to images, generating rather than translating: the purpose of image captioning is to understand an image and describe it in a sentence. Vinyals et al. [5] introduced the first neural network approach for image captioning, an encoder-decoder system trained to maximize the log-likelihood of the target picture descriptions. Similarly, a multimodal fusion layer is used by Mao et al. [8] and Donahue et al. [9] to fuse picture features and word representations at each time step. In both cases, i.e., the models in [8] and [9], the captions are derived from the whole image, whereas the image captioning model proposed by Karpathy [10] provides descriptions based on regions. Following this work, Johnson et al. [10] developed a method to jointly locate regions and describe each with a caption.
Recent image captioning models incorporate convolutional neural networks into the architecture for image feature extraction and use a recurrent neural network to generate the captions. According to [11], a convolutional neural network (CNN) is a kind of feed-forward neural network that extracts features from data through convolution structures; unlike traditional feature extraction methods, it does not need features to be extracted manually. That paper presents a comprehensive study of CNN architectures, covering both classic and advanced CNNs.
That survey was helpful in identifying the best CNN model for our architecture. Some researchers have looked at the topology of networks to describe the link between visuals and descriptions directly or implicitly [12], [13], [14]. Xu et al. [13] use "hard" and "soft" attention algorithms to incorporate spatial attention over image convolutional features into the encoder-decoder architecture. Yang et al. [15] use a review network to improve the attention mechanism, and Liu et al. [14] aim at improving the correctness of visual attention. Shudong Yang et al. [16] compared the performance of two recurrent deep learning models, the LSTM and the Gated Recurrent Unit (GRU), along two dimensions, dataset size and quantitative evaluation, and found that in terms of training speed the GRU is 29.29% faster than the LSTM on the same dataset. Hence we utilize an Attention GRU instead of an LSTM for caption generation in our architecture. Chollet [6] gives a new perspective on depth-wise separable convolutions with the Xception architecture: the data first goes through the entry flow, then through the middle flow, which is repeated eight times, and finally through the exit flow. All Convolution and Separable Convolution layers are followed by batch normalization, and all Separable Convolution layers use a depth multiplier of 1.

3 Proposed Architecture

The proposed architecture, "Visual Narration", adopts the Xception convolutional network to extract features from the images, encodes them, and passes them to a sequence decoder with an attention mechanism, i.e., the Attention GRU, to generate captions for the given image. Fig. 1 shows the architecture of our model. It includes a deep CNN for image encoding, which mainly extracts the features from the images; these are then passed on to the sequence decoder, a GRU with an attention mechanism, which generates the image captions. Next, we go over the stages of our proposed model.

Fig. 1. Proposed architecture with one deep CNN layer (Xception) to extract image features and a sequence decoder (GRU) with an attention mechanism to generate captions.

3.1 Phase 1: CNN Layer

Our first step is to extract features from the images; for that we use the Xception network, which encodes these features. According to Chollet [6], the Xception architecture has 36 convolutional layers forming the feature-extraction base of the network. These 36 convolutional layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. In short, the Xception architecture is a linear stack of depth-wise separable convolution layers with residual connections, which makes the architecture very easy to define and modify. The Xception model consists of a convolutional base followed by a logistic-regression layer for image classification; since our model does not classify images but generates captions for them, we have dropped that block, as shown in Fig. 2.

Fig. 2. The model architecture of Xception [6].
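A minimal sketch of this feature-extraction step, assuming a Keras/TensorFlow pipeline (the paper does not publish code), is shown below; it loads the pre-trained Xception base with the classification block dropped via include_top=False.

import tensorflow as tf

# Xception pre-trained on ImageNet, without the classification block,
# used purely as a fixed feature extractor.
cnn = tf.keras.applications.Xception(include_top=False, weights="imagenet")
cnn.trainable = False

def extract_features(image_path):
    """Return the Xception feature map for a single image."""
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = tf.keras.applications.xception.preprocess_input(x)
    x = tf.expand_dims(x, axis=0)            # add batch dimension
    feats = cnn(x)                           # shape (1, 10, 10, 2048)
    return tf.reshape(feats, (1, -1, 2048))  # flatten the spatial grid for attention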

3.2 Phase 2: Recurrent Network with Attention

An attention model allows the decoder, for each new word, to focus on a part of the image; that is, it gives attention to a specific region of the image rather than the whole image. Hence, to generate captions we incorporate an Attention GRU with Minh-Thang Luong's [17] local attention mechanism. There are two main types of attention networks, local attention and global attention. The local attention mechanism attends to only a small subset of positions; it is also called window-based attention because it selects a window of input tokens over which the attention distribution is computed. We prefer it over global attention because the local attention network is easier to train and implement and is computationally simpler.
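A minimal sketch of such a decoder is given below, again assuming Keras/TensorFlow. For brevity it uses the built-in dot-product Attention layer over the projected image regions rather than a faithful reproduction of Luong's windowed local attention; all layer sizes and names are illustrative.

import tensorflow as tf

class AttentionGRUDecoder(tf.keras.Model):
    """One decoding step: attend over image regions, then update the GRU state."""
    def __init__(self, vocab_size, embed_dim=200, units=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.feat_proj = tf.keras.layers.Dense(units)   # project image regions to GRU size
        self.attention = tf.keras.layers.Attention()    # dot-product attention
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, image_feats, state):
        # word_ids: (batch, 1); image_feats: (batch, regions, 2048); state: (batch, units)
        x = self.embedding(word_ids)                     # (batch, 1, embed_dim)
        query = tf.expand_dims(state, 1)                 # previous GRU state as the query
        keys = self.feat_proj(image_feats)               # (batch, regions, units)
        context = self.attention([query, keys])          # (batch, 1, units)
        x = tf.concat([context, x], axis=-1)             # context vector + current word
        output, state = self.gru(x, initial_state=state)
        return self.fc(output), state                    # logits over the vocabulary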

3.3 Phase 3: BLEU Score and Beam Search

BLEU is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high-quality reference translations. Beam search is an algorithm used in many NLP and speech recognition models as a final decision-making layer to choose the best output given target variables such as maximum probability or the next output character. The beam search algorithm selects multiple tokens for a position in a given sequence based on conditional probability; it can keep any number of N best alternatives through a hyperparameter known as the beam width. With beam search, we keep the N best output sequences and, at each position being decoded, look at the preceding words and their probabilities. In greedy search, by contrast, we take the best word for each position in the sequence in isolation: a word is chosen based on the highest probability, and we continue down the rest of the sentence without going back to earlier choices.
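The decoding loop can be illustrated with the following beam-search sketch; decode_step, begin_id, and end_id are hypothetical stand-ins for the trained decoder and the 'beginseq'/'endseq' token ids.

import numpy as np

def beam_search(decode_step, begin_id, end_id, beam_width=3, max_len=40):
    # Each hypothesis is a (token sequence, cumulative log-probability) pair.
    beams = [([begin_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:             # finished captions are carried over unchanged
                candidates.append((seq, score))
                continue
            log_probs = decode_step(seq)      # (vocab_size,) log-probabilities for the next word
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep only the best `beam_width` partial captions for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                        # token ids of the highest-scoring caption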

4 Datasets Used

In this experiment, we used the Flickr8k dataset plus our custom additions, around 8,776 images in total, to carry out the implementation. We trained our model with three different parameter settings, changing the number of units in the dense network as well as the number of RNN units, namely 64, 128, and 256 units.
The dataset considered for training as well as testing is the Flickr8k dataset, a subset of the bigger Flickr30k dataset [2], [3]. Flickr30k Entities has 31k images, with 5 captions for each image, and has become a standard benchmark for sentence-based image description. It augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. The Flickr8k dataset has become a standard dataset for training models on smaller devices with fewer resources. However, it has a class imbalance problem, especially across different genders within the human class (see Fig. 3, left). To correct this, we added around 700 images, with 5 captions each, from the boy, girl, and men classes, as illustrated in Fig. 3 (right). We focused on adding images that contain multiple objects in a single image, so that the additions do not greatly increase the vocabulary of the Flickr8k dataset.

Fig. 3. (left) The human class is not balanced in the original Flickr8k images. (right) Customised Flickr8k data with 700 added images of the girl, boy, and men classes.

GloVe (Global Vectors for Word Representation) [18] is used for the word embeddings. It is an unsupervised learning algorithm for obtaining vector representations of words: training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Many versions and sizes of GloVe word embeddings are available; we used the Wikipedia 2014 version, which has 6B tokens in 200-dimensional vectors (https://nlp.stanford.edu/projects/glove/).
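Loading the pre-trained vectors into an embedding matrix for the decoder can be sketched as follows; word_index is assumed to be the vocabulary-to-index mapping built in Section 4.1, and the file name follows the standard GloVe 6B release.

import numpy as np

def build_embedding_matrix(word_index, glove_path="glove.6B.200d.txt", dim=200):
    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
    # Index 0 is reserved for padding; words missing from GloVe keep a zero vector.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        if word in glove:
            matrix[idx] = glove[word]
    return matrix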

4.1 Caption Preparation

Data preprocessing is the most important part of any data science project and, for natural language processing in particular, text preparation is the critical step. Since machines cannot understand words the way humans do, unrelated noise in the text makes it much harder for the model to learn what is to be predicted, so removing it is a necessary step. For that, a series of filters was applied to the raw caption data to make it less confusing to the machine.

Fig. 4. The step-by-step process of caption preparation.

As shown in Fig. 4, contractions are expanded first. Shortcut words like can't, shouldn't, and it's are understandable to us humans, but after removing punctuation they would become cant, shouldnt, and its; although these shortcuts mean the same thing, such mangled forms should be avoided. Instead of removing these words, we expand them: can't becomes cannot, shouldn't becomes should not, and it's becomes it is. Second, the whole dataset is converted to lowercase so as to avoid duplicates that differ only in case. Hyperlinks are removed, if any, as they are not useful in a caption. Punctuation is removed, as these special characters hinder learning, and digits are removed as they are not necessary. Any remaining non-alphanumeric characters are removed along with extra whitespace. After processing, every caption gets two extra tokens, 'beginseq' and 'endseq', at the start and the end respectively. All the captions are stored in a dictionary with the image name as key and a list of captions as value. The final prepared captions are shown in Fig. 5.

Fig. 5. Preparation of Caption dictionary.
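The cleaning filters described above can be sketched as follows; the contraction table here is only a small illustrative subset, not the full list used in the project.

import re
import string

CONTRACTIONS = {"can't": "cannot", "shouldn't": "should not", "it's": "it is"}

def clean_caption(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():       # expand contractions first
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", "", text)        # drop hyperlinks
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)                 # drop digits
    text = re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace
    return "beginseq " + text + " endseq"           # add the start/end tokens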

A vocabulary of words has been generated from the captions with a threshold of 8: any word that occurs more than 8 times in the whole dataset is added to the vocabulary. The resulting vocabulary contains 2,169 words. Every word in the vocabulary is assigned a unique index, and that index, rather than the word itself, is fed to the model as input.
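A sketch of the vocabulary construction, assuming captions is the caption dictionary of Fig. 5 (image name mapped to a list of cleaned captions):

from collections import Counter

def build_vocab(captions, threshold=8):
    counts = Counter(word for caps in captions.values()
                     for cap in caps for word in cap.split())
    # Keep only words seen more than `threshold` times; index 0 is reserved for padding.
    vocab = [w for w, c in counts.items() if c > threshold]
    return {w: i + 1 for i, w in enumerate(vocab)}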
There are 8,776 images in all, each with five captions. This data has been split into three parts: training (6,776 images), validation (1,000 images), and testing (1,000 images).

Fig. 6. Caption data. (left) Vocabulary of size 2169. (right) Dictionary with each word in the vocabulary assigned a unique index.

The data fed into the model consists of image features and captions. First, we attempt to predict the second word using the image vector and the first word as input (i.e., Input = Image1 + 'beginseq'; Output = 'a'). Next, we try to predict the third word using the image vector and the first two words (i.e., Input = Image1 + 'beginseq a'; Output = 'dog'), and so on. Table 1 summarizes the data matrix for one image and its corresponding caption.

Table 1. Input data points corresponding to one image and its caption.

Image Feature Vector | Sliced Caption             | Target Word
Image 1              | beginseq                   | a
Image 1              | beginseq a                 | dog
Image 1              | beginseq a dog             | is
Image 1              | beginseq a dog is          | running
Image 1              | beginseq a dog is running  | endseq

In practice, the sliced caption is not fed in as a list of words: each word is replaced by its index and the sequence is padded, as shown in Fig. 7. The maximum caption length is taken as 40.
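The expansion of one caption into (image feature, sliced caption, target word) triples, with post-padding to length 40 as in Fig. 7, can be sketched as below; word_index and features are assumed to be the vocabulary mapping and the pre-computed Xception features.

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 40

def make_pairs(image_id, caption, word_index, features):
    seq = [word_index[w] for w in caption.split() if w in word_index]
    inputs, targets = [], []
    for i in range(1, len(seq)):
        sliced = pad_sequences([seq[:i]], maxlen=MAX_LEN, padding="post")[0]
        inputs.append((features[image_id], sliced))   # image feature + padded sliced caption
        targets.append(seq[i])                        # index of the next word
    return inputs, targets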

4.2 Model Parameters

Experimentation with different model parameters has been carried out in this project. The Adam optimizer was chosen, with sparse categorical cross-entropy as the loss. Training was done for 51 epochs with a batch size of 8. The number of neurons in the dense network and the number of RNN units were each chosen from 64, 128, and 256 units. The model was trained and compared both with and without the attention mechanism.
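Under the stated settings, the training configuration would look roughly like the following; model is the assembled encoder-decoder from the earlier sketches and the arrays come from the data-preparation step, so all names here are illustrative.

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    [train_image_feats, train_sliced_caps], train_targets,
    validation_data=([val_image_feats, val_sliced_caps], val_targets),
    epochs=51, batch_size=8,
)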

Fig. 7. Padded caption: each word in the sliced caption is replaced by its index, and if the caption is shorter than the maximum length of 40, it is post-padded with 0.

5 Experimentation and Results

The model was first trained without the attention mechanism for 51 epochs with a batch size of 8, using three different settings of 64, 128, and 256 units, as shown in Table 2. Both sentence decoders, beam search and greedy search, are used and compared. The model is evaluated using BLEU scores with four different weight settings. Similarly, the model was trained with the local attention mechanism for 51 epochs with a batch size of 8; in this experiment we used two settings, 128 and 256 units, as shown in Table 3. The outputs produced by our proposed model are presented in Fig. 8.
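The four BLEU weight settings can be computed, for instance, with NLTK's corpus_bleu; references and hypotheses are assumed to hold the tokenised reference captions and generated captions for the test split.

from nltk.translate.bleu_score import corpus_bleu

weights = {
    "BLEU-1": (1.0, 0, 0, 0),
    "BLEU-2": (0.5, 0.5, 0, 0),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    # references: list (per test image) of lists of reference token lists
    # hypotheses: list of generated token lists
    print(name, corpus_bleu(references, hypotheses, weights=w))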

Table 2. Non-Attention based visual narration model with captions generated by Beam & Greedy search.

Algorithm     | RNN Units | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Loss
Beam Search   | 64        | 0.2668 | 0.1565 | 0.1064 | 0.0417 | 0.745
Beam Search   | 128       | 0.2694 | 0.1537 | 0.1023 | 0.0360 | 0.618
Beam Search   | 256       | 0.2309 | 0.1273 | 0.0843 | 0.0305 | 0.487
Greedy Search | 64        | 0.2596 | 0.1570 | 0.1118 | 0.0480 | 0.745
Greedy Search | 128       | 0.2610 | 0.1556 | 0.1079 | 0.0422 | 0.618
Greedy Search | 256       | 0.2271 | 0.1318 | 0.0924 | 0.0357 | 0.487

5.1 Performance Analysis of Deep CNNs


To better understand deep CNN architectures and to pick the right CNN for our model, we conducted a performance analysis of several deep convolutional neural networks, ranging from depth-wise convolutions to width-based multi-connection convolutions, comparing the size, accuracy, and number of parameters of each network.

Table 3. Attention based visual narration model with captions generated by Beam & Greedy search.

Algorithm     | RNN Units | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Loss
Beam Search   | 128       | 0.2809 | 0.1288 | 0.0700 | 0.0206 | 0.419
Beam Search   | 256       | 0.2594 | 0.1193 | 0.0724 | 0.0250 | 0.250
Greedy Search | 128       | 0.3315 | 0.1831 | 0.1253 | 0.0522 | 0.304
Greedy Search | 256       | 0.3301 | 0.1874 | 0.1324 | 0.0607 | 0.208

Fig. 8. Outputs generated by the proposed model with attention (both greedy and beam search shown).

From this analysis, it was evident that Xception is the smallest and one of the most accurate deep convolutional neural networks considered, with a size of 88 MB and a top-5 accuracy of 0.945, which led us to choose Xception as the CNN layer of our model, as shown in Table 4.

Table 4. Performance analysis of Xception against other Deep Convolution Networks.

Model             | Size (MB) | Top-1 Accuracy | Top-5 Accuracy | Parameters  | Depth
VGG16             | 528       | 0.713          | 0.901          | 138,357,544 | 23
InceptionV3       | 92        | 0.779          | 0.937          | 23,851,784  | 159
ResNet50          | 98        | 0.749          | 0.921          | 25,636,712  | -
InceptionResNetV2 | 215       | 0.803          | 0.953          | 55,873,736  | 572
ResNeXt50         | 96        | 0.777          | 0.938          | 25,097,128  | -
Xception          | 88        | 0.790          | 0.945          | 22,910,980  | 126

6 Conclusion and Future Work

We have presented an image captioning model called Visual Narration, in which we predominantly leverage the deep convolutional neural network Xception for extracting the image features and an Attention GRU with local attention for generating the captions, significantly reducing the size of the model compared to state-of-the-art architectures while enhancing its accuracy. We also tried to balance the Flickr8k dataset by adding images of some specific classes, slightly reducing the class imbalance of the original dataset. Our future work includes incorporating Transformers for the sentence generation part instead of the Attention GRU, as transformer models like BERT and GPT can provide more human-like captions.

References
1. I. Sutskever, O. Vinyals, Q.V. Le, Advances in Neural Information Processing Systems 27 (2014)
2. B.A. Plummer, L. Wang, C.M. Cervantes, J.C. Caicedo, J. Hockenmaier, S. Lazebnik, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 2641–2649
3. T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, in European Conference on Computer Vision (Springer, 2014), pp. 740–755
4. A. Sherstinsky, Physica D: Nonlinear Phenomena 404, 132306 (2020)
5. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3156–3164
6. F. Chollet, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1251–1258
7. M. Hodosh, P. Young, J. Hockenmaier, Journal of Artificial Intelligence Research 47, 853 (2013)
8. J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, arXiv preprint arXiv:1412.6632 (2014)
9. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 2625–2634
10. J. Johnson, A. Karpathy, L. Fei-Fei, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4565–4574
11. Z. Li, F. Liu, W. Yang, S. Peng, J. Zhou, IEEE Transactions on Neural Networks and Learning Systems (2021)
12. J. Lu, C. Xiong, D. Parikh, R. Socher, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 375–383
13. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, in International Conference on Machine Learning (PMLR, 2015), pp. 2048–2057
14. C. Liu, J. Mao, F. Sha, A. Yuille, in Thirty-First AAAI Conference on Artificial Intelligence (2017)
15. Z. Yang, Y. Yuan, Y. Wu, W.W. Cohen, R.R. Salakhutdinov, Advances in Neural Information Processing Systems 29 (2016)
16. S. Yang, X. Yu, Y. Zhou, in 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI) (IEEE, 2020), pp. 98–101
17. M.T. Luong, H. Pham, C.D. Manning, arXiv preprint arXiv:1508.04025 (2015)
18. J. Pennington, R. Socher, C.D. Manning, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532–1543
