Narrative Paragraph Generation for Photo
Stream Using Neural Networks
1 Introduction
A single image can convey many different ideas. We humans have an outstanding ability to summarize those ideas with respect to different parts of an image, and we can link each image region to our description. While this comes easily to humans, it is a tedious task for machines, as it requires them to express the image in natural language. Image caption generation has recently been on a surge with the advances in Neural Machine Translation (NMT) [1] and larger data sets [2], [3]. An encoder-decoder pipeline is used by many image captioning models, and many encoder-decoder frameworks based on Recurrent Neural Networks have been introduced for sequence-to-sequence learning. Recurrent Neural Networks (RNNs) [4], as well as Long Short-Term Memory (LSTM) networks, can act as sequence learners. However, RNNs can only remember earlier states for a few time steps because of the vanishing gradient problem [5].
To solve the vanishing gradient problem, a special type of RNN architecture called the LSTM network was designed. It introduces a memory cell: each memory cell consists of three gates and a neuron with a self-recurrent connection. Thanks to these gates, memory cells can keep and access data over a long period of time, making the LSTM network capable of learning long-term dependencies. Even so, the memory cells of LSTM models are in practice bounded to a limited number of time steps, because at every time step long-term information is gradually diluted.
To enhance the hierarchical structure of traditional image captioning models, we use a deep convolutional neural network (CNN) as the image encoder to extract features from the images. We adopted the Xception [6] model, a width-based multi-connection CNN, for the encoder part. Unlike existing image captioning models, we use this more advanced CNN model together with an Attention GRU instead of an LSTM.
To summarize, our primary contribution lies in bringing in a deep CNN architecture, Xception, which is relatively small in size and has fewer parameters, to extract features from the images, while the caption generation part is handled by a recurrent decoder into which we incorporate an Attention GRU. We aim to provide a small, high-accuracy image captioning model trained on the Flickr8k data set. We also tweaked the data set to our needs and provide a custom data set in which we attempt to balance the human classes of Flickr8k by adding about 700 more images.
2 Related Work
In the computer vision field, the problem of creating natural language descriptions for images has become a prominent topic. The traditional method of employing neural networks for producing descriptions is to frame the task as a retrieval and ranking problem [7]. The most important drawback of retrieval-based techniques is that they are time-consuming and cannot come up with appropriate descriptions for a new set of image objects [5]. Inspired by the success of deep neural networks in artificial intelligence, researchers have adapted the encoder-decoder framework of machine translation to label images, generating descriptions rather than translating them; hence the purpose of image captioning is to understand an image and provide its description in a sentence. Vinyals et al. [5] introduced the first neural network approach for image captioning, an encoder-decoder system trained to maximize the log-likelihood of the target picture descriptions. Similarly, a multimodal fusion layer is used by Mao et al. [8] and Donahue et al. [9] to fuse picture features and word representations at each time step. In both cases, i.e., the models in [8] and [9], the captions are derived from whole images, whereas the image captioning model proposed by Karpathy [10] provides descriptions based on regions. Following this work, Johnson et al. [10] developed a method to jointly locate regions and describe each of them with a caption.
Recent image captioning models have incorporated convolutional neural networks into the architecture for image feature extraction and use a recurrent neural network as the language model that generates the caption.
3 Proposed Architecture
Fig. 1. Proposed architecture with one deep CNN layer (Xception) to extract image features and a sequence decoder (GRU) with an attention mechanism to generate captions.

Our first step is to extract features from the images; for this we use the Xception network, which encodes these features. According to Chollet [6], the Xception architecture has 36 convolutional layers forming the feature-extraction base of the network. These 36 layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. In short, the Xception architecture is a linear stack of
depth-wise separable convolution layers with residual connections. This makes the architecture very easy to define and modify. The Xception model consists of a convolutional base followed by a logistic regression layer for image classification problems. Since in our model we are not classifying the images but generating captions for them, we have dropped that block, as shown in Fig. 2.
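As a rough illustration of this encoder step, the feature extractor can be set up as follows (a minimal TensorFlow/Keras sketch, not the authors' code; keeping the 10x10x2048 spatial map so that attention can be applied later is our assumption):

```python
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Xception without its classification block, used purely as an image encoder.
encoder = Xception(weights="imagenet", include_top=False)

def extract_features(image_path):
    # Xception expects 299x299 RGB inputs scaled to [-1, 1] by preprocess_input.
    img = img_to_array(load_img(image_path, target_size=(299, 299)))
    img = preprocess_input(np.expand_dims(img, axis=0))
    return encoder.predict(img)  # spatial feature map of shape (1, 10, 10, 2048)
```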
An attention model allows the decoder, for each new word, to focus on a part of the image; that is, it attends to a specific region of the image rather than the whole image. Hence, to generate captions we incorporate an Attention GRU with Luong's [17] local attention mechanism. There are mainly two types of attention networks: local attention and global attention. The local attention mechanism attends to only a small subset of positions; it is also called window-based attention because it selects a window of input tokens over which the attention distribution is computed. We prefer it to global attention because the local attention network is easier to train and implement, and it is computationally simpler than the global attention network.
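A decoder of this kind can be sketched roughly as follows (a simplified TensorFlow/Keras example, not the authors' code; for brevity it scores all spatial positions with Luong's "general" formulation rather than the windowed local attention of [17], and the layer sizes and names are our assumptions):

```python
import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    """Luong-style 'general' scoring: score(h_t, f_i) = h_t^T W f_i."""
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units)

    def call(self, hidden, features):
        # hidden: (batch, units); features: (batch, locations, feature_dim)
        scores = tf.matmul(self.W(features), tf.expand_dims(hidden, -1))  # (batch, loc, 1)
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)               # (batch, feature_dim)
        return context, weights

class AttentionGRUDecoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim=200, units=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.attention = LuongAttention(units)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, state):
        # Attend over the encoder's spatial features using the previous GRU state.
        context, _ = self.attention(state, features)
        x = self.embedding(word_ids)                                # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)     # (batch, 1, embed+feat)
        output, state = self.gru(x, initial_state=state)
        return self.fc(output), state                               # logits over the vocabulary
```

In this sketch, `features` would be the Xception output reshaped to (batch, 100, 2048), and the decoder is called one time step at a time, starting from the 'startseq' token and feeding back its own predictions at inference.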
BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. The BLEU score is a number between zero and one that measures the similarity of the machine-generated text to a set of good-quality reference texts; here, it measures how close a generated caption is to the reference captions.
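For illustration, a BLEU score can be computed with NLTK as follows (using NLTK is our assumption; the paper does not state which implementation was used, and the captions below are made up):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "is", "running", "through", "the", "grass"],
              ["a", "brown", "dog", "runs", "in", "the", "grass"]]
candidate = ["a", "dog", "runs", "through", "the", "grass"]

# BLEU-4 with smoothing so that short sentences do not collapse to zero.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```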
4 Datasets Used
In this experiment, we used the Flickr8k dataset together with our custom dataset, around 8776 images in total, to carry out the implementation. We trained our model with three different settings, varying the number of units in the dense layers as well as in the RNN, namely 64, 128, and 256 units.
The dataset which we considered for training as well as testing is Flickr8k, a subset of the larger Flickr30k dataset [2], [3]. Flickr30k Entities has 31k images, with 5 captions for each image, and Flickr30k has become a standard benchmark for sentence-based image description. Flickr30k Entities augments the 158k captions of Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. Flickr8k, in turn, has become a standard dataset for training models on smaller devices with fewer resources. However, Flickr8k has a class imbalance problem, especially across the different genders of the human class (see Fig. 3 (left)). To correct this, we added around 700 images with 5 captions each from the boy, girl, and men classes, as illustrated in Fig. 3 (right). We focused on adding images that contain multiple objects in a single image, so that they do not greatly increase the vocabulary of the Flickr8k dataset.
Fig. 3. (left) Human class was not balanced in the Flickr8k images. (right) Customised Flickr8k data with 700 added images of girl, boy, and men.
GloVe: Global Vectors for Word Representation [18] is used for the word feature representations. It is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. There are many versions and sizes of GloVe word embeddings available; we used the Wikipedia 2014 version, which has 6B tokens, with 200-dimensional vectors (https://nlp.stanford.edu/projects/glove/).
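A minimal sketch of loading these vectors into an embedding matrix is shown below (the file name follows the standard GloVe release; the word-to-index mapping comes from the vocabulary described later, and the helper name is ours):

```python
import numpy as np

def build_embedding_matrix(word_index, glove_path="glove.6B.200d.txt", dim=200):
    """Map each vocabulary word to its pretrained GloVe vector (zeros if unseen)."""
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

    matrix = np.zeros((len(word_index) + 1, dim))  # index 0 reserved for padding
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix
```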
The most important part of any data science project is data preprocessing and, similarly, the most important part of any natural language processing project is its text preparation. Since machines cannot understand words as we humans do, making a model handle textual language is a complicated task. It is therefore necessary to remove unrelated data from the text, as such noise makes it much harder for the model to determine what is to be predicted. For that reason, a number of cleaning steps, or filters, were applied to the raw caption dataset, as sketched below.
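The paper describes these filters only in general terms; a typical version of such a cleaning step might look like the following sketch (the exact filters, and the placement of the startseq/endseq boundary tokens, are our assumptions based on the rest of the paper):

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, and drop single-character or non-alphabetic tokens."""
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in caption.split() if len(w) > 1 and w.isalpha()]
    # Sentence boundary tokens used later when slicing captions for training.
    return "startseq " + " ".join(words) + " endseq"
```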
A vocabulary of words has been generated from the captions with a threshold of 8: any word that occurs more than 8 times in the whole dataset is added to the vocabulary. The length of this vocabulary is 2169. Every word in the vocabulary is assigned a unique index, and that index, rather than the word itself, is fed to the model as input.
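A sketch of this frequency-threshold filtering (variable and function names are ours):

```python
from collections import Counter

def build_vocabulary(captions, threshold=8):
    """Keep words that appear more than `threshold` times and index them from 1."""
    counts = Counter(word for caption in captions for word in caption.split())
    vocab = [w for w, c in counts.items() if c > threshold]
    word_to_index = {word: idx + 1 for idx, word in enumerate(vocab)}  # 0 = padding
    return vocab, word_to_index
```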
There are 8776 images in all, each with five captions. This data has been split
into 3 parts which are for Training (6776 images), Validation (1000 images), and
Testing (1000 images). The data fed into the model consists of image features and captions.

Fig. 6. Caption data. (left) Vocabulary of size 2169. (right) Dictionary with each word in the vocabulary assigned a unique index.

At the first step, we attempt to predict the second word
using the input of the image vector and the first word, (i.e. Input = Image1 +
‘startseq’; Output = ‘a’ ). Next, we try to predict the third word using the input
of the image vector and the first two words (i.e. Input = Image1 + ‘startseq a’;
Output = ‘dog’) and so on. Table 1 summarizes the data matrix for one image and its corresponding caption.
Table 1. Input data points corresponding to one image and its caption.
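The expansion of one caption into several such training points can be sketched as follows (a simplified helper; the names and exact data layout are our assumptions):

```python
def make_training_pairs(image_feature, caption, word_to_index):
    """Turn one (image, caption) pair into (image + partial caption, next word) samples."""
    tokens = [word_to_index[w] for w in caption.split() if w in word_to_index]
    pairs = []
    for i in range(1, len(tokens)):
        pairs.append((image_feature, tokens[:i], tokens[i]))  # input prefix -> next word
    return pairs
```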
Instead of the list of words in the sliced caption, padded captions are fed to the model, with each textual word replaced by its index, as in Fig. 7. The maximum caption sequence length is taken as 40.
Fig. 7. Padded caption: each word in the sliced caption is replaced by its index, and if the length of the caption is shorter than the maximum length of 40, it is post-padded with 0.
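This corresponds to Keras-style post-padding to the maximum length of 40, for example (the index values shown are made up):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 40
sliced = [[1, 12], [1, 12, 7], [1, 12, 7, 45]]  # index-encoded caption prefixes
padded = pad_sequences(sliced, maxlen=MAX_LEN, padding="post", value=0)
print(padded.shape)  # (3, 40)
```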
Table 2. Non-Attention based visual narration model with captions generated by Beam
& Greedy search.
Table 3. Attention based visual narration model with captions generated by Beam &
Greedy search.
Fig. 8. Output generated by the proposed model with Attention (shows both greedy
and beam)
The analysis summarized in Table 4 compares the size, accuracy, and number of parameters for each convolutional neural network. From it, it was evident that Xception was the smallest and most advanced deep convolutional neural network, with a size of 88 MB and an accuracy of 0.945, which led us to choose Xception as the CNN layer of our model.
References
1. I. Sutskever, O. Vinyals, Q.V. Le, Advances in neural information processing sys-
tems 27 (2014)
2. B.A. Plummer, L. Wang, C.M. Cervantes, J.C. Caicedo, J. Hockenmaier, S. Lazeb-
nik, in Proceedings of the IEEE international conference on computer vision (2015),
pp. 2641–2649
3. T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L.
Zitnick, in European conference on computer vision (Springer, 2014), pp. 740–755
4. A. Sherstinsky, Physica D: Nonlinear Phenomena 404, 132306 (2020)
5. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, in Proceedings of the IEEE conference
on computer vision and pattern recognition (2015), pp. 3156–3164
6. F. Chollet, in Proceedings of the IEEE conference on computer vision and pattern
recognition (2017), pp. 1251–1258
7. M. Hodosh, P. Young, J. Hockenmaier, Journal of Artificial Intelligence Research
47, 853 (2013)
8. J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, arXiv preprint
arXiv:1412.6632 (2014)
9. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, T. Darrell, in Proceedings of the IEEE conference on computer vision
and pattern recognition (2015), pp. 2625–2634
10. J. Johnson, A. Karpathy, L. Fei-Fei, in Proceedings of the IEEE conference on
computer vision and pattern recognition (2016), pp. 4565–4574
11. Z. Li, F. Liu, W. Yang, S. Peng, J. Zhou, IEEE transactions on neural networks
and learning systems (2021)
12. J. Lu, C. Xiong, D. Parikh, R. Socher, in Proceedings of the IEEE conference on
computer vision and pattern recognition (2017), pp. 375–383
13. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio,
in International conference on machine learning (PMLR, 2015), pp. 2048–2057
14. C. Liu, J. Mao, F. Sha, A. Yuille, in Thirty-first AAAI conference on artificial
intelligence (2017)
15. Z. Yang, Y. Yuan, Y. Wu, W.W. Cohen, R.R. Salakhutdinov, Advances in neural
information processing systems 29 (2016)
16. S. Yang, X. Yu, Y. Zhou, in 2020 International workshop on electronic communi-
cation and artificial intelligence (IWECAI) (IEEE, 2020), pp. 98–101
17. M.T. Luong, H. Pham, C.D. Manning, arXiv preprint arXiv:1508.04025 (2015)
18. J. Pennington, R. Socher, C.D. Manning, in Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP) (2014), pp. 1532–1543