Image-Text Summarization
2Aston University; 3Guangzhou University
4Key Laboratory of Intelligent Information Processing, ICT,
University of Chinese Academy of Sciences, Chinese Academy of Sciences
[email protected], [email protected]
Abstract
Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in the summaries. A multi-modal attentional mechanism is proposed to attend to the original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show that our model outperforms neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.

Figure 1: An example of multi-modal news taken from the DailyMail corpora.
1 Introduction
Summarizing multi-modal documents to obtain multi-modal summaries is becoming an urgent need with the rapid growth of multi-modal documents on the Internet. Text-image summarization is to summarize a document with text and images to generate a summary with text and images. This approach is different from pure text summarization. It is also different from image summarization, which summarizes an image set to obtain a subset of images.

An image is worth a thousand words (Rossiter et al., 2012). Images play an important role in information transmission.

Figure 2: The manually generated text-image summary.
Incorporating images into text to generate text-image summaries can help people better understand, memorize, and express information. Most recent research focuses on pure text summarization or on image summarization; little has been done on text-image summarization. Figure 1 and Figure 2 show an example of text-image summarization. Figure 1 is the original multi-modal news with text and images. The news has 17 sentences (with 322 words) and 4 images, each of which has a caption. Figure 2 is the manually generated multi-modal summary. In the summary, the news is distilled to 3 sentences (with 36 words) and 2 images, and each summary sentence is aligned with an image.

To generate such a text-image summary, the following problems should be considered: How to generate the text part? How to measure the importance of images and extract important images to form the image summary? How to align sentences with images?

In this paper, we propose a neural text-image summarization model based on the attentional hierarchical Encoder-Decoder model to solve the above problems. The attentional Encoder-Decoder model has been successfully used in sequence-to-sequence applications such as machine translation (Luong et al., 2015), text summarization (Cheng and Lapata, 2016; Tan et al., 2017), image captioning (Liu et al., 2017a), and machine reading comprehension (Cui et al., 2016).

At the encoding stage, we use a hierarchical bi-directional RNN to encode the sentences and the text document, and use an RNN and a CNN to encode the image set. At the decoding stage, we combine the text encoding and the image encoding as the initial state, and use an attentional hierarchical decoder that attends to the original sentences, images, and captions to generate the text summary. Each generated sentence is aligned with a sentence, an image, or a caption in the original document. Based on the alignment scores, images are selected and aligned with the generated sentences. At the inference stage, we adopt a multi-modal beam search algorithm that scores beams based on bigram overlaps between the generated sentences and the attended captions.

The main contributions are as follows:
1) We propose the text-image summarization task, and extend the standard DailyMail corpora by collecting the images and captions of each news story from the Web for the task.
2) We propose an RNN model to encode the ordered image set of the multi-modal document as one of the initial states of the decoder (the other is the text encoding).
3) We propose three multi-modal attentional mechanisms that attend to the text and the images simultaneously when decoding.
4) Experiments show that attending to images when decoding can improve text summarization, and that our model can generate informative image summaries.

2 Related Work

Recent research on text summarization focuses on neural methods. The attentional Encoder-Decoder model was first proposed in (Bahdanau et al., 2014) and (Luong et al., 2015) to align the original text and the translated text in machine translation. The attention model is applied to sentence summarization by combining a neural language model and the attention model when generating the next word (Rush et al., 2015). A selective Encoder-Decoder model that uses a selective gate network to control information flow from the encoder to the decoder for sentence summarization is proposed in (Zhou et al., 2017).

A neural document summarization model that extracts sentences and words is proposed in (Cheng and Lapata, 2016). They use a CNN to encode sentences and an RNN to encode documents. The model extracts sentences by computing the probability of each sentence belonging to the summary based on an RNN model, and extracts words from the original document based on an attentional decoder. SummaRuNNer, an RNN-based extractive summarization model treating summarization as a sentence classification problem, is proposed in (Nallapati et al., 2016); a logistic classifier is applied using features computed from the RNN model. A hierarchical Encoder-Decoder model that preserves the hierarchical structure of documents is proposed in (Li et al., 2015). A graph-based attentional Encoder-Decoder model that uses a PageRank algorithm to compute the attention is proposed in (Tan et al., 2017).
Image captioning generates a caption for an image. Text-image summarization is similar to image captioning in that both utilize image information to generate text. Images are encoded with CNN models such as VGGNet (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al., 2012), and GoogLeNet (Szegedy et al., 2014) by extracting the last fully-connected layers. An attentional model is used in image captioning by splitting an image into multiple parts which are attended to in the decoding process (Xu et al., 2015). Image tags are used as additional information in a semantic attention model that attends to image tags when decoding (You et al., 2016). The attention-based alignment of image parts and text is studied in (Liu et al., 2017a), and the results show that the alignments are in high accordance with manual alignments. An image can also be encoded as an ordered set of recognized objects, with an attentional decoder applied to generate captions (Liu et al., 2017b).

Multi-modal summarization summarizes text, images, videos, etc. It is an important branch of automatic summarization. Traditional multi-modal summarization takes multi-modal documents or pure text documents as input and outputs multi-modal documents (Wu and Carberry, 2011; Greenbacker, 2011; Yan et al., 2012; Agrawal et al., 2011; Zhu et al., 2007; UzZaman et al., 2011). For example, Yan et al. (2012) generate multi-modal timeline summaries for news sets by constructing a bi-graph between text and images and applying a heterogeneous reinforcement ranking algorithm. Strategies for summarizing texts with images and the notion of summarization of things are proposed in (Zhuge, 2016). The deep learning related work (Wang et al., 2016) treats text summarization as a recommendation task and applies a matrix factorization algorithm. They first retrieve images from Yahoo!, use a CNN to extract image features as additional information for sentences, and use ROUGE maximization as the training objective, trained with SGD. At test time, sentences are extracted based on the model and images are retrieved from the search engine.

3 Method

Figure 3 shows the framework, a multi-modal attentional hierarchical encoder-decoder model. The hierarchical encoder-decoder was proposed in (Li et al., 2015) and extended by (Tan et al., 2017) for document summarization by bringing in a graph-based attentional model. Our model consists of three parts: a hierarchical RNN to encode the original sentences and the captions, a CNN+RNN encoder to encode the image set, and a multi-modal attentional hierarchical RNN decoder.

The input of our model is a multi-modal document MD = {D, PicSet}, where D is the main text of the multi-modal document and PicSet is the image-caption set ordered by the occurring order of the images in the document.

3.1 Main Text Encoder

The main text D consists of sentences, each of which consists of words. Let D = [s_1, s_2, ..., s_{|D|}] and s_i = [x_{i,1}, x_{i,2}, ..., x_{i,|s_i|}], where x_{i,j} is the word embedding of the j-th word in s_i. We use word2vec (Mikolov et al., 2013) to create word embeddings. GRU is used as the RNN cell (Cho et al., 2014).

We use a hierarchical RNN encoder to encode the main text D into a vector representation. A sentence encoder encodes sentences into vector representations. An <eos> token is appended to the end of each sentence. A bi-directional RNN is used as the sentence encoder:

\overrightarrow{h}_{i,j} = \overrightarrow{\mathrm{GRU}}_s(\overrightarrow{h}_{i,j-1}, x_{i,j})    (1)
\overleftarrow{h}_{i,j} = \overleftarrow{\mathrm{GRU}}_s(\overleftarrow{h}_{i,j+1}, x_{i,j})    (2)
encsent_i = [\overrightarrow{h}_{i,|s_i|}, \overleftarrow{h}_{i,1}]    (3)

where encsent_i denotes the vector representation of sentence s_i; it is the concatenation of the final forward and backward hidden states.

We use encsent_i as the inputs of the document encoder to encode the main text into a vector representation. A bi-directional RNN is adopted as the document encoder:

\overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}_d(\overrightarrow{h}_{i-1}, encsent_i)    (4)
\overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}_d(\overleftarrow{h}_{i+1}, encsent_i)    (5)
h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]    (6)
encdoc = [\overrightarrow{h}_{|D|}, \overleftarrow{h}_1]    (7)

where encdoc denotes the vector representation of D, and h_i is the concatenated hidden state of s_i.
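To make the hierarchical text encoder concrete, the following is a minimal sketch, assuming PyTorch (the paper's own implementation uses TensorFlow's seq2seq code); class and variable names are illustrative, and the dimensions follow Section 4.2.

import torch
import torch.nn as nn

class HierTextEncoder(nn.Module):
    """Sentence-level and document-level bi-GRU encoders, cf. Eq. (1)-(7)."""
    def __init__(self, emb_dim=128, hid_dim=256):
        super().__init__()
        self.sent_gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.doc_gru = nn.GRU(2 * hid_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, doc_emb):
        # doc_emb: (num_sents, max_words, emb_dim), word embeddings of one document
        _, sent_final = self.sent_gru(doc_emb)                       # Eq. (1)-(2)
        # encsent_i: concatenation of the final forward and backward states, Eq. (3)
        enc_sent = torch.cat([sent_final[0], sent_final[1]], dim=-1)
        doc_states, doc_final = self.doc_gru(enc_sent.unsqueeze(0))  # Eq. (4)-(5)
        h = doc_states.squeeze(0)                                    # h_i, Eq. (6)
        enc_doc = torch.cat([doc_final[0], doc_final[1]], dim=-1).squeeze(0)  # Eq. (7)
        return h, enc_doc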
Figure 3: The framework of our neural text-image summarization model.
3.2 CaptionSet and ImageSet Encoder

The ordered image-caption set PicSet consists of an ordered image set and an ordered caption set, which are ordered by the order in which the images occur in the multi-modal document. The image order makes sense because images are often placed near the most related sentences, and the sentences have a strict order in the document.

We treat the ordered caption set as a document, and apply the sentence encoder and the document encoder to this caption document. We thus obtain the hidden states hcap_i and the vector representation enccap of the caption document.

We use a CNN model to extract the vector representation of each image, and then use an RNN model to encode the ordered image set into a vector representation. The CNN model we adopt is the 19-layer VGGNet (Simonyan and Zisserman, 2014). We drop the last dropout layer and keep the last fully-connected layer as the image's vector representation, the dimension of which is 4096.

We then use a bi-directional RNN to encode the ordered image set, with the image features used as the inputs of the RNN:

\overrightarrow{h}^{img}_i = \overrightarrow{\mathrm{GRU}}_{img}(\overrightarrow{h}^{img}_{i-1}, imgfea_i)    (8)
\overleftarrow{h}^{img}_i = \overleftarrow{\mathrm{GRU}}_{img}(\overleftarrow{h}^{img}_{i+1}, imgfea_i)    (9)
himg_i = [\overrightarrow{h}^{img}_i, \overleftarrow{h}^{img}_i]    (10)
encimg = [\overrightarrow{h}^{img}_{|PicSet|}, \overleftarrow{h}^{img}_1]    (11)

where imgfea_i is the vector representation of img_i, encimg is the vector representation of the image set, and himg_i is the hidden state of img_i when encoding the image set.

To the best of our knowledge, we are the first to adopt an RNN model to encode the image set.
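Under the same assumptions (PyTorch rather than the paper's TensorFlow implementation), a minimal sketch of the image-set encoder takes the pre-extracted 4096-dimensional VGG-19 features as inputs:

import torch
import torch.nn as nn

class ImageSetEncoder(nn.Module):
    """Bi-GRU over the ordered image features, cf. Eq. (8)-(11)."""
    def __init__(self, fea_dim=4096, hid_dim=256):
        super().__init__()
        self.img_gru = nn.GRU(fea_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, img_feas):
        # img_feas: (1, num_images, fea_dim), VGG features in document order
        h_img, final = self.img_gru(img_feas)                         # Eq. (8)-(9)
        h_img = h_img.squeeze(0)                                      # himg_i, Eq. (10)
        enc_img = torch.cat([final[0], final[1]], dim=-1).squeeze(0)  # Eq. (11)
        return h_img, enc_img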
3.3 Decoder

In the decoding stage, we adopt a hierarchical RNN decoder to generate text summaries.

h_0 = \tanh(W_{dec\_doc}\, encdoc + V_{dec\_img}\, encimg + V_{dec\_cap}\, enccap)    (12)
\hat{h}_i = \mathrm{GRU}_{dec\_sent1}(h_{i-1}, h_{i-1,1})    (13)
h_i = \mathrm{GRU}_{dec\_sent2}(\hat{h}_i, c_i)    (14)
h_{i,j} = \mathrm{GRU}_{dec\_word}(h_{i,j-1}, y_{i,j-1})    (15)
y_{i,j} = \mathrm{softmax}(W_{softmax}\, h_{i,j} + b)    (16)
Equations (12) to (16) are the equations of the hierarchical decoder, which consists of a sentence decoder and a word decoder.

Equation (12) computes the initial state h_0 of the sentence decoder by combining the encoding of the main text and the encoding of the image information of the multi-modal document. To represent the image information, we can use both the image set encoding and the caption set encoding, or only one of them, depending on the multi-modal attention mechanism introduced in the next subsection.

The sentence decoder uses a two-level hidden output model (Luong et al., 2015) to generate the sentence-level hidden states.

Traditional attention mechanisms for text summarization compute the importance score of a sentence s_j in the original document based on the relationship between the decoding hidden state h_i and the original sentence encoding hidden state h_j. We call this traditional attention model Text Attention (attT for short), which is computed by Equations (17), (18), and (19):

att_T(h_i, h_j) = v_T^{\top} \tanh(W_T h_i + U_T h_j)    (17)
\alpha_T(h_i, h_j) = \exp(att_T(h_i, h_j)) / \sum_{j'=1}^{|D|} \exp(att_T(h_i, h_{j'}))    (18)
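A minimal sketch of the additive scoring and normalization of Equations (17) and (18), again assuming PyTorch; the parameter names mirror W_T, U_T, and v_T.

import torch
import torch.nn as nn

class TextAttention(nn.Module):
    """Additive attention over the original sentence encodings, cf. Eq. (17)-(18)."""
    def __init__(self, dec_dim=512, enc_dim=512, att_dim=256):
        super().__init__()
        self.w = nn.Linear(dec_dim, att_dim, bias=False)   # W_T
        self.u = nn.Linear(enc_dim, att_dim, bias=False)   # U_T
        self.v = nn.Linear(att_dim, 1, bias=False)         # v_T

    def forward(self, h_dec, h_sents):
        # h_dec: (dec_dim,) current sentence-decoder state; h_sents: (|D|, enc_dim)
        scores = self.v(torch.tanh(self.w(h_dec) + self.u(h_sents))).squeeze(-1)  # Eq. (17)
        alpha = torch.softmax(scores, dim=0)                                      # Eq. (18)
        return scores, alpha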
Text-Image Attention (attTI for short). This attention model uses the images, but not the captions, to represent the image information. The normalized attention scores and the context of the decoding hidden state h_i are computed by Equations (23) and (24):

\alpha_{TI}(h_i, himg_j) = \exp(att_{TI}(h_i, himg_j)) / (\sum_{j'=1}^{|D|} \exp(att_{TI}(h_i, h_{j'})) + \sum_{j'=1}^{|PicSet|} \exp(att_{TI}(h_i, himg_{j'})))    (23)

c_{TI}(h_i) = \sum_{j=1}^{|D|} \alpha_{TI}(h_i, h_j)\, h_j + \sum_{j=1}^{|PicSet|} \alpha_{TI}(h_i, himg_j)\, himg_j    (24)

Text-Image-Caption Attention (attTIC for short). This attention model uses both captions and images to represent the image information. attTIC computes the importance score of the caption cap_j and the importance score of the image img_j simultaneously, and then computes the context of the decoding hidden state h_i using Equation (25):

c_{TIC}(h_i) = \sum_{j=1}^{|D|} \alpha_{TIC}(h_i, h_j)\, h_j + \sum_{j=1}^{|PicSet|} \alpha_{TIC}(h_i, hcap_j)\, hcap_j + \sum_{j=1}^{|PicSet|} \alpha_{TIC}(h_i, himg_j)\, himg_j    (25)

In these attention mechanisms, \alpha(h_i, h_j) is the normalized attention score of h_j, \alpha(h_i, hcap_j) is the normalized attention score of hcap_j, \alpha(h_i, himg_j) is the normalized attention score of himg_j, and c(h_i) is the context.

The initial state of the decoder is computed by Equation (12), which can be adjusted according to the attention model in use.
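The joint normalization over sentences and images in Equations (23) and (24) can be sketched as follows, assuming PyTorch and an additive scoring function (such as a variant of the TextAttention sketch above) that returns unnormalized scores.

import torch

def text_image_context(score_fn, h_dec, h_sents, h_imgs):
    # score_fn(h_dec, H) -> unnormalized att_TI scores of shape (len(H),)
    scores = torch.cat([score_fn(h_dec, h_sents), score_fn(h_dec, h_imgs)], dim=0)
    alpha = torch.softmax(scores, dim=0)           # one softmax over sentences and images, Eq. (23)
    memory = torch.cat([h_sents, h_imgs], dim=0)   # sentence states followed by image states
    context = alpha @ memory                       # weighted sum, Eq. (24)
    return context, alpha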
3.5 Model Training

Since there are no existing manual text-image summaries, and most of the existing training and testing data have pure text summaries, we decide to use pure text summaries as training data to train our models. The sentence-image alignment relationships can be discovered through training the multi-modal attention models.

The loss function L of our summarization model is the negative log likelihood of generating the text summaries over the training multi-modal document set MDS:

L = -\sum_{(D, PicSet, Y) \in MDS} \log P(Y \mid D, PicSet)    (26)

where Y = [y_1, y_2, ..., y_{|Y|}] is the word sequence of the summary corresponding to the main text D and the ordered image set PicSet, including the tokens <eos>, <neod> and <eod>.

\log P(Y \mid D, PicSet) = \sum_{t=1}^{|Y|} \log P(y_t \mid \{y_1, ..., y_{t-1}\}, c; \theta)    (27)

where \log P(y_t \mid \{y_1, ..., y_{t-1}\}, c; \theta) is modeled by the multi-modal encoder-decoder model. We use the Adam (Kingma and Ba, 2014) gradient-based optimization method to optimize the model parameters.
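A minimal sketch of the training objective of Equations (26) and (27) for a single summary, assuming PyTorch and teacher forcing; padding positions are ignored.

import torch.nn.functional as F

def summary_nll(word_logits, target_ids, pad_id=0):
    # word_logits: (T, vocab_size) decoder outputs for one summary (teacher-forced)
    # target_ids:  (T,) reference summary token ids, including <eos>, <neod> and <eod>
    # returns -log P(Y | D, PicSet) summed over tokens, cf. Eq. (26)-(27)
    return F.cross_entropy(word_logits, target_ids, ignore_index=pad_id, reduction="sum")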
3.6 Multi-Modal Beam Search Algorithm

There are two major problems in the generation of summaries: one is the out-of-vocabulary (OOV) problem, and the other is the low quality of the generated texts, including information incorrectness and repetition.

For the OOV problem, we use the words in the attended sentences or captions of the original document to replace OOV tokens in the generated summary. Previous research uses the attended words to replace OOV tokens in a flat encoder-decoder model that attends the words of the original word sequence (Jean et al., 2015). Our model is hierarchical and multi-modal, and attends sentences, images, and captions when decoding. We use the following algorithm to find the replacement for the jth OOV token in a generated sentence:

Step 1: Order the original sentences and captions by their attention scores in descending order.
Step 2: Return the jth OOV word in the ordered sentences and captions as the replacement.

For the attTI mechanism, which attends images but neglects captions, we use the captions instead of the attended images in this algorithm.

For the low-quality generated text problem, we adopt the hierarchical beam search algorithm (Tan et al., 2017). We extend the algorithm by adding caption-level and image-level beam search. The multi-modal hierarchical beam search algorithm comprises a K-best word-level beam search and an N-best sentence-caption-level beam search. In particular, we use the corresponding captions instead of images in the beam search algorithm for the attTI mechanism, which attends images.

At the word level, we compute the score of generating word y_t using Equation (28):

score(y_t) = p(y_t) + \gamma\,(ref(Y_{t-1} \cup \{y_t\}, s^*) - ref(Y_{t-1}, s^*))    (28)

where ref is a function calculating the ratio of bigram overlap between two texts, s^* is the attended sentence or caption, and \gamma is the weighting factor. The added term aims to increase the overlap between the generated summary and the original text.

At the sentence level and the caption level, we set the sentence beam width to N, and keep the N-best previously un-referred sentences or captions with the highest attention scores. For each sentence beam, we try M sentences or captions and keep the one achieving the best word-level score.
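A minimal sketch of the word-level score of Equation (28) in plain Python; the exact form of the ref function is an assumption based on its description as a bigram-overlap ratio.

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def ref(candidate_tokens, attended_tokens):
    # assumed form: ratio of the candidate's bigrams that also occur in the attended text
    cand = bigrams(candidate_tokens)
    if not cand:
        return 0.0
    return len(cand & bigrams(attended_tokens)) / len(cand)

def word_score(p_yt, prefix_tokens, yt, attended_tokens, gamma=3.0):
    # score(y_t) = p(y_t) + gamma * (ref(Y_{t-1} + y_t, s*) - ref(Y_{t-1}, s*)), Eq. (28)
    # p_yt is the decoder's (log-)probability of y_t; attended_tokens is the attended sentence or caption s*
    gain = ref(prefix_tokens + [yt], attended_tokens) - ref(prefix_tokens, attended_tokens)
    return p_yt + gamma * gain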
3.7 Image Selection and Alignment

We rank the images, select the most important images as the image summary, and align each summary sentence with an image in the image summary. The score of image img_j is computed by Equation (29):

score(img_j) = \sum_{i=1}^{|TextSum|} \alpha_{i,j}    (29)

where \alpha_{i,j} is the attention score of the jth image when generating the ith sentence of the text summary, and |TextSum| is the number of summary sentences.

The images are ranked by these scores in descending order, and the top K images are selected to form the image summary ImgSum. We align each sentence i in TextSum to the image j in ImgSum such that \alpha_{i,j} is the largest.
vector representation of images. We set the
4 Experiments parameters of Adam to those provided in (Kingma
and Ba, 2014). The batch size is set to 5.
4.1 Data preparation
Convergence is reached within 800k training steps.
We extend the standard DailyMail corpora It takes about one day for training 40k ~ 50k steps
through extracting the images and the captions depending on the models on a GTX-1080 TI GPU
from the html-formatted documents. We call the card. The sentence beam width and the word beam
corpora as E-DailyMail. The standard DailyMail width are set as 2 and 5 respectively. M is set as 3.
and CNN datasets are two widely used datasets The parameter γ is set as 3 or 300 tuned on the
for neural document summarization, which are validation set.
originally built in (Hermann et al., 2015) by To train the multi-modal attention mechanism
collecting human generated highlights and news such as attTIC, we concatenate the matrix of text
stories from the news websites. We only extend representations, image representations, and caption
the DailyMail dataset because it has more images representations to one matrix M = [h1, h2, ... h|D|,
and is easier to collect than the CNN dataset does. hcap1, hcap2, …, hcap|PicSet|, himg1, himg2, …, himg|PicSet|].
We find that the text documents provided by the The parameters of the attention mechanisms are
original DailyMail corpora contain captions. This trained simultaneously. This way the model training
is due to that all related texts are extracted from can converge faster.
the html-formatted news when the corpora are
created. We keep the original text documents 4.3 Evaluation of Text Summarization
unchanged in E-DailyMail. The split and statistics The widely used ROUGE (Lin, 2004) is adopted
of E-DailyMail are shown in Table 1. to evaluate text summaries.
We compare four attention models. HNNattTC-
4.2 Implementation
3, HNNattTIC-3, HNNattTI-3, and HNNattT-3 are
We preprocess the text of the E-DailyMail our hierarchical RNN summarization models with
corpora by tokenizing the text and replacing the the attTC, attIC, attTI, and attT attention
digits with the <NUM> token. The 40k most mechanisms respectively, and 3 is the γ value.
frequent words in the corpora are kept and other HNNattT is similar to the model introduced in (Tan
words are replaced with OOV. et al., 2017) without the graph-based attention. We
Our model is implemented by using Google’s compare our models with HNNattT to show the
open-source seq2seq-master project written with influence of multi-modal attentions. The first 4 lines
Tensorflow. We use one layer of the GRU cell. The in Table 2 are the results with summary length of 75
dimension of the hidden state of the RNN decoder bytes. The results show that HNNattTI has
is 512. The dimension of the word embedding considerable improvement over HNNattT. An
vector is 128. The dimension of the hidden state of interesting observation is that HNNattTC and
the bi-directional RNN encoder is 256. We initialize HNNattTIC are not better than HNNattT. One of the
4052
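A minimal sketch of building this joint attention memory, assuming PyTorch tensors with a shared hidden dimension.

import torch

def attention_memory(h_sents, h_caps, h_imgs):
    # M = [h_1..h_|D|, hcap_1..hcap_|PicSet|, himg_1..himg_|PicSet|]
    return torch.cat([h_sents, h_caps, h_imgs], dim=0)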
4.3 Evaluation of Text Summarization

The widely used ROUGE (Lin, 2004) is adopted to evaluate the text summaries.

We compare four attention models. HNNattTC-3, HNNattTIC-3, HNNattTI-3, and HNNattT-3 are our hierarchical RNN summarization models with the attTC, attTIC, attTI, and attT attention mechanisms respectively, where 3 is the γ value. HNNattT is similar to the model introduced in (Tan et al., 2017) without the graph-based attention. We compare our models with HNNattT to show the influence of the multi-modal attentions. The first 4 rows in Table 2 are the results with a summary length of 75 bytes. The results show that HNNattTI has a considerable improvement over HNNattT. An interesting observation is that HNNattTC and HNNattTIC are not better than HNNattT. One of the reasons is that the text documents provided by the DailyMail corpora contain captions; captions are already parts of the text documents. The other reason is that captions distract attention and cannot attract sufficient attention away from the original sentences, which will be discussed in the next subsection.

Method            Rouge-1   Rouge-2   Rouge-L
HNNattTI-3        24.84     8.7       16.99
HNNattTC-3        18.61     6.7       13.44
HNNattTIC-3       21.17     8.1       15.24
HNNattT-3         22.09     7.9       15.97
Lead              21.9      7.2       11.6
NN-SE             22.7      8.5       12.5
SummaRuNNer-abs   23.8      9.6       13.3
LREG(500)         18.5      6.9       10.2
NN-ABS(500)       7.8       1.7       7.1
NN-WE(500)        15.7      6.4       9.8

Table 2: Comparison results on the DailyMail test set using Rouge recall at 75 bytes.

We compare our methods with state-of-the-art neural summarization methods reported in recent papers on the DailyMail corpora. Extractive models include Lead, a strong baseline that uses the leading 3 sentences as the summary, NN-SE (Cheng and Lapata, 2016), and SummaRuNNer-abs (Nallapati et al., 2017), which is trained on the abstractive summaries. Abstractive models include NN-ABS, NN-WE, and LREG, though they are tested on 500 samples of the test set. LREG is a feature-based method using linear regression. NN-ABS is a simple hierarchical extension of (Rush et al., 2015). NN-WE is an abstractive model restricting the generation of words to those in the original document. The results are shown in the last 6 rows of Table 2. Our method HNNattTI outperforms the three extractive models and the three abstractive models.

We also compare our models under the full-length F1 metric by setting the γ value to 300. According to (Tan et al., 2017), a large γ makes the generated summary have more overlap with the attended texts, and thus partly overcomes the repeated-sentences problem in the generated summary. We do not incorporate the attention distraction mechanism (Chen et al., 2016) into our model, because we want to focus on our own model to see whether considering images improves text summarization. Results in Table 3 also show that HNNattTI performs better than HNNattT, HNNattTC, and HNNattTIC.

Method          Rouge-1   Rouge-2   Rouge-L
HNNattTI-300    32.64     12.02     23.88
HNNattTC-300    26.75     10.12     19.42
HNNattTIC-300   30.52     11.04     21.81
HNNattT-300     31.34     11.81     22.93

Table 3: Comparison results on the DailyMail test set using the full-length F1 metric.

To show the influence of our OOV replacement mechanism, we eliminate the mechanism from our models and show the evaluation results in Table 4 and Table 5. We can see from the two tables that the scores are lower than the corresponding scores in Table 2 and Table 3. Our OOV replacement mechanism improves the summarization models, though the mechanism is relatively simple.

Method             Rouge-1   Rouge-2   Rouge-L
HNNattTI-3-OOV     24.03     8.2       16.52
HNNattTC-3-OOV     18.18     6.53      12.87
HNNattTIC-3-OOV    20.50     7.67      14.36
HNNattT-3-OOV      21.60     7.82      15.05

Table 4: Comparison results using Rouge recall at 75 bytes without OOV replacement. HNNattTI-3-OOV is the version of HNNattTI-3 without the OOV replacement mechanism.

Method               Rouge-1   Rouge-2   Rouge-L
HNNattTI-300-OOV     32.03     11.52     22.67
HNNattTC-300-OOV     26.13     9.87      19.03
HNNattTIC-300-OOV    30.11     10.87     21.12
HNNattT-300-OOV      30.74     11.21     22.28

Table 5: Comparison results using the full-length F1 metric without OOV replacement. HNNattTI-300-OOV is the version of HNNattTI-300 without the OOV replacement mechanism.

In short, combining and attending images in the neural summarization model improves document summarization.
4.4 Evaluation of Image Summarization

To evaluate the image summarization, the gold-standard image summary is generated based on a greedy algorithm over the captions as follows: at each step i, choose img_k to maximize Rouge({cap_1, ..., cap_{i-1}, cap_k}, Abs_Sum) − Rouge({cap_1, ..., cap_{i-1}}, Abs_Sum), where Abs_Sum is the ground-truth text summary and cap_k is the caption of img_k. The average number of images in the summaries is 2.15. The average Rouge-1, Rouge-2, and Rouge-L scores of the caption summaries with respect to the ground-truth summaries are 43.85, 19.70, and 36.30 respectively.

We use randomly selected 1-image and 2-image summaries as the baselines against which we compare our models. The top 1 or 2 images ranked by our model are selected to form the summaries. Results in Table 6 show that HNNattTI outperforms the random baseline, while HNNattTC and HNNattTIC perform worse. This implies that attending images can generate better sentence-image alignments in the multi-modal summaries than attending captions does, and it can also partly explain why our summarization model attending images when decoding generates better text summaries than the one attending captions does.

num   HNNattTI   HNNattTC   HNNattTIC   Random
1     0.4978     0.4137     0.4362      0.4721
2     0.4783     0.3998     0.4230      0.4517

Table 6: Image summarization results using the recall metric for the 1-image and 2-image summaries. γ is set to 300.
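A minimal sketch of the greedy construction of the gold-standard image summary described above; the rouge scorer is assumed to be supplied externally (e.g., a wrapper around a ROUGE package) and is not shown.

def greedy_gold_images(captions, abs_sum, rouge, num_images):
    # captions: list of caption strings; abs_sum: ground-truth text summary
    chosen = []  # indices of the selected images, in selection order
    for _ in range(min(num_images, len(captions))):
        base = rouge([captions[i] for i in chosen], abs_sum)
        # pick the image whose caption yields the largest ROUGE gain
        best = max((i for i in range(len(captions)) if i not in chosen),
                   key=lambda i: rouge([captions[j] for j in chosen] + [captions[i]], abs_sum) - base)
        chosen.append(best)
    return chosen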
4.5 Instance

Figure 4 shows the text-image summary generated by the HNNattTI model for the example shown in Figure 1. The summary contains 2 images and 3 generated sentences, and each sentence is aligned with an image. The image summary has one image in common with Figure 2.

Figure 4: The generated text-image summary of the example in Figure 1.

Table 7 shows the sentence-image alignment scores of the generated summary. The four images in the original document are numbered from top to bottom and left to right as IMG1, IMG2, IMG3, and IMG4. The sum of the alignment scores for a summary sentence is less than 1, because the sentence is also aligned with the sentences in the original document.

      IMG1     IMG2     IMG3     IMG4
S1    0.0947   0.1089   0.1157   0.1194
S2    0.0893   0.1020   0.1070   0.1052
S3    0.0853   0.0769   0.0946   0.0969

Table 7: The sentence-image alignment scores of the generated summary for the news in Figure 1. The summary sentences are named S1, S2, and S3 respectively.

5 Conclusions

This paper proposes the text-image summarization task to summarize and align texts and images simultaneously. Most previous research summarizes texts and images separately, and little has been done on text-image summarization. We propose multi-modal attentional mechanisms that attend to the original sentences, images, and captions simultaneously in a hierarchical encoder-decoder model, use an RNN model to encode the ordered image set as the initial state of the decoder, and propose a multi-modal beam search algorithm that scores beams using the bigram overlaps of the generated sentences and the captions. The model is trained using abstractive text summaries as the targets, and the attention scores of images are used to score images. The original DailyMail dataset is extended by collecting images and captions from the Web. Experiments show that our model attending images outperforms the models not attending images, three existing neural abstractive models, and three existing extractive models. Experiments also show that our model can generate informative summaries of images.

Acknowledgments

The research was sponsored by the National Natural Science Foundation of China (No. 61806101, No. 61876048) and the Natural Science Foundation of Jiangsu Province (BK20150862). We thank the anonymous reviewers for helpful comments. Professor Hai Zhuge is the corresponding author.
References

Agrawal, R., Gollapudi, S., Kannan, A., and Kenthapadi, K. (2011). Enriching textbooks with images. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 1847-1856). ACM.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. Computer Science.

Chen, Q., Zhu, X., Ling, Z., Wei, S., and Jiang, H. (2016). Distraction-based neural networks for modeling documents. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (pp. 2754-2760). AAAI Press.

Cheng, J. and Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T., and Hu, G. (2016). Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423.

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 248-255). IEEE.

Gu, J., Lu, Z., Li, H., and Li, V.O. (2016). Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1631-1640).

Greenbacker, C. F. (2011). Towards a framework for abstractive summarization of multimodal documents. In Proceedings of the ACL 2011 Student Session (pp. 75-80). Association for Computational Linguistics.

Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (pp. 1693-1701).

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.

Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2013, 60(2):2012.

Li, J., Luong, M. T., and Jurafsky, D. (2015). A Hierarchical Neural Autoencoder for Paragraphs and Documents. Computer Science.

Lin, C.Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Liu, C., Mao, J., Sha, F., and Yuille, A.L. (2017a). Attention Correctness in Neural Image Captioning. In AAAI 2017 (pp. 4176-4182).

Liu, C., Sun, F., Wang, C., Wang, F., and Yuille, A. (2017b). MAT: A multimodal attentive translator for image captioning. arXiv preprint arXiv:1702.05658.

Luong, M. T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Mikolov, T., Sutskever, I., Chen, K., et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. 26:3111-3119.

Nallapati, R., Zhai, F., and Zhou, B. (2016). SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents.

Rossiter, M. J., Derwing, T. M., and Jones, V. M. L. O. (2012). Is a Picture Worth a Thousand Words? TESOL Quarterly, 42(2):325-329.

Rush, A. M., Chopra, S., and Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. Computer Science.

Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. Computer Science.

Szegedy, C., Liu, W., Jia, Y., et al. (2014). Going deeper with convolutions. CoRR, abs/1409.4842.

Tan, J., Wan, X., and Xiao, J. (2017). Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1171-1181).

UzZaman, N., Bigham, J. P., and Allen, J. F. (2011). Multimodal summarization of complex sentences. In Proceedings of the 16th International Conference on Intelligent User Interfaces (pp. 43-52). ACM.

Wang, W. Y., Mehdad, Y., Radev, D. R., et al. (2016). A Low-Rank Approximation Approach to Learning Joint Embeddings of News Stories and Images for Timeline Summarization. NAACL, pp. 58-68.

Wu, P. and Carberry, S. (2011). Toward extractive summarization of multimodal documents. In Proceedings of the Workshop on Text Summarization at the Canadian Conference on Artificial Intelligence (pp. 53-61).

Xu, K., Ba, J., Kiros, R., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Computer Science, pp. 2048-2057.

Yan, R., Wan, X., et al. (2012). Visualizing timelines: evolutionary summarization via iterative reinforcement between text and image streams. CIKM 2012. ACM, pp. 275-284.

You, Q., Jin, H., Wang, Z., et al. (2016). Image Captioning with Semantic Attention. IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, pp. 4651-4659.

Zhou, Q., Yang, N., Wei, F., et al. (2017). Selective Encoding for Abstractive Sentence Summarization. Meeting of the Association for Computational Linguistics, pp. 1095-1104.

Zhu, X., Goldberg, A. B., et al. (2007). A text-to-picture synthesis system for augmenting communication. In AAAI (Vol. 7, pp. 1590-1595).

Zhuge, H. (2016). Multi-Dimensional Summarization in Cyber-Physical Society. Morgan Kaufmann.