0% found this document useful (0 votes)

171 views

Image-Text Summarization PDF

This paper proposes a neural model for abstractive text-image summarization that uses an attentional hierarchical encoder-decoder approach. It encodes text documents and accompanying image sets, then generates summaries with both text and aligned images. The model incorporates a multi-modal attentional mechanism to attend to original sentences, images, and captions during decoding. Experiments on an extended DailyMail dataset show the model outperforms text-only summarization methods.

Uploaded by

Bicpes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

171 views

Image-Text Summarization PDF

Uploaded by

Bicpes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Abstractive Text-Image Summarization

Using Multi-Modal Attentional Hierarchical RNN

Jingqiang Chen1, Hai Zhuge1,2,3,4
Nanjing University of Posts and Telecommunications
1

2
Aston University; 3Guangzhou University
4
Key Laboratory of Intelligent Information Processing, ICT,
University of Chinese Academy of Sciences, Chinese Academy of Sciences
[email protected], [email protected]

Abstract
Rapid growth of multi-modal documents
on the Internet makes multi-modal
summarization research necessary. Most
previous research summarizes texts or
images separately. Recent neural
summarization research shows the
strength of the Encoder-Decoder model in
text summarization. This paper proposes
an abstractive text-image summarization
model using the attentional hierarchical
Encoder-Decoder model to summarize a
text document and its accompanying
images simultaneously, and then to align
the sentences and images in summaries. A
multi-modal attentional mechanism is
proposed to attend original sentences,
images, and captions when decoding. The
DailyMail dataset is extended by
collecting images and captions from the
Web. Experiments show our model
outperforms the neural abstractive and
extractive text summarization methods
that do not consider images. In addition,
our model can generate informative Figure 1: An example of multi-modal news taken
summaries of images. from the DailyMail corpora.

1 Introduction
Summarizing multi-modal documents to get
multi-modal summaries is becoming an urgent
need with rapid growth of multi-modal
documents on the Internet. Text-Image
summarization is to summarize a document with
text and images to generate a summary with text
and images. The summarization approach is
different from pure text summarization. It is also
different from image summarization which
summarizes an image set to get a subset of
images.
An image worths thousands of words
(Rossiter, et al., 2012). Image plays an important
role in information transmission. Incorporating
images into text to generate text-image Figure 2: The manually generated text-image summary.

4046
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4046–4056
Brussels, Belgium, October 31 - November 4, 2018. 2018
c Association for Computational Linguistics
summaries can help people better understand, 2) We propose an RNN model to encode the
memorize, and express information. Most of ordered image set of the multi-model
recent research focuses on pure text document as one of the initial states (the
summarization, or image summarization. Little other is the text encoding) of the decoder.
has been done on text-image summarization. 3) We propose three multi-modal attentional
Figure 1 and Figure 2 show an example of text- mechanisms which attend the text and the
image summarization. Figure 1 is the original images simultaneously when decoding.
multi-modal news with text and images. The 4) Experiments show that attending images
news has 17 sentences (with 322 words) and 4 when decoding can improve text
images each of which has a caption. Figure 2 is summarization, and that our model can
the manually generated multi-modal summary. In generate informative image summaries.
the summary, the news is distilled to 3 sentences
(with 36 words) and 2 images, and each summary 2 Related Work
sentence is aligned with an image.
To generate such a text-image summary, the Recent research on text summarization focuses on
following problems should be considered: How to neural methods. Attentional Encoder-Decoder
generate the text part? How to measure the model is first proposed in (Bahdanau et al., 2014)
importance of images, and extract important and (Luond et al., 2015) to align the original text
images to form the image summary? How to align and the translated text in machine translation. The
sentences with images? attention model is applied to sentence
In this paper, we propose a neural text-image summarization by considering the neural
summarization model based on the attentional language model and the attention model when
hierarchical Encoder-Decoder model to solve the generating next words (Rush et al., 2015). A
above problems. The attentional Encoder- selective Encoder-Decoder model that uses a
Decoder model has been successfully used in selective gate network to control information
sequence-to-sequence applications such as from the encoder to the decoder for sentence
machine translation (Luong et al., 2015), text summarization is proposed (Zhou et al., 2017).
summarization (Cheng and Lapata, 2016; Tan et A neural document summarization model by
al., 2017), image captioning (Liu et al., 2017a), extracting sentences and words is proposed
and machine reading comprehension (Cui et al., (Cheng and Lapata, 2016). They use a CNN
2016). model to encode sentences, and then use a RNN
At the encoding stage, we use the hierarchical model to encode documents. The model extracts
bi-directional RNN to encode the sentences and sentences by computing the probability of
the text document, use the RNN and the CNN to sentences belonging to the summary based on an
encode the image set. In the decoding stage, we RNN model. The model extracts words from the
combine text encoding and image encoding as the original document based on an attentional
initial state, and use the attentional hierarchical decoder. An RNN-based extractive
decoder which attends original sentences, images summarization named SummaRuNNer, treating
and captions to generate the text summary. Each summarization as a sentence classification
generated sentence is aligned with a sentence, an problem is proposed (Nallapati et al., 2016). A
image, or a caption in the original document. logistic classifier is then applied using features
Based on the alignment scores, images are computed based on the RNN model. A
selected and aligned with the generated sentences. hierarchical Encoder-Decoder model, conserving
In the inference stage, we adopt the multi-modal the hierarchical structure of documents is
beam search algorithm which scores beams based proposed (Li et al., 2015). A graph-based
on bigram overlaps of the generated sentences attentional Encoder-Decoder model using a
and the attended captions. PageRank algorithm to compute the attention is
The main contributions are as follows: proposed (Tan et al., 2017).
1) We propose the text-image summarization Image captioning generates a caption for an
task, and extend the standard DailyMail image. Text-image summarization is similar to
corpora by collecting images and captions image captioning in that both utilize image
of each news from the Web for the task. information to generate text. Images are encoded
with CNN models such as VGGNet (Simonyan

4047
and Zisserman, 2014), AlexNet (Krizhevsky et al., and the captions, a CNN+RNN encoder to encode
2012) and GoogleNet (Szegedy et al., 2014) by the image set, and a multi-modal attentional
extracting the last full-connected layers. An hierarchical RNN decoder.
attentional model is used in image captioning by The input of our model is a multi-modal
splitting an image into multiple parts which is document MD = {D, PicSet}, where D is the main
attended in the decoding process (Xu et al., 2015). text of the multi-modal document and PicSet is
Image tags was used as additional information, the image-caption set ordered by the occurring
and semantic attention model which attends order of images in the document.
image tags when decoding was proposed (You et
3.1 Main Text Encoder
al., 2016). The attention-based alignment of
image parts and text is studied (Liu et al., 2017a), The main text D consists of sentences, each of
and the results show that the alignments is in high which consists of words. Let D=[s1, s2, …, s|d|],
accordance with manual alignments. An image and si  [ xi,1 , xi,2 , , xi, s ] where xi,j is the word
i

to an ordered recognized object set is encoded, embedding of the jth word in the si. We use
and the attentional decoder is applied to generate word2vec (Mikolov et al., 2013) to create word
captions (Liu et al., 2017b). embeddings. GRU is used as the RNN cell (Cho
Multi-modal summarization summarizes text, et al., 2014).
images, videos, and etc. It is an important branch We use a hierarchical RNN encoder to encode
of automatic summarization. Traditional multi- the main text D to vector representation. The
modal summarization inputs multi-modal sentence encoder is adopted to encode sentences
documents or pure text documents, and outputs to vector representations. An <eos> token is
multi-modal documents (Wu, 2011; Greenbacker, appended to the end of each sentence. A bi-
2011; Yan, 2012; Agrawal, 2011; Zhu, 2007; directional RNN is used as the sentence encoder:
UzZaman, 2011). For example, Yan et al., (2012)  
generate multi-modal timeline summaries for hi , j  GRU s (hi, j 1 , xi, j ) (1)
news sets by constructing a bi-graph between text   
and images, and apply a heterogeneous hi , j  GRU s (hi , j , xi, j 1 ) (2)
reinforcement ranking algorithm. Strategies to
 
summarizing texts with images and the notion of encsent i  [hi,1 , hi, 1 ] (3)
summarization of things are proposed in (Zhuge,
2016). The deep learning related work (Wang et where encsenti denotes the vector representation of
 
al. 2016) treats text summarization as a sentence si. It is the concatenation of hi ,1 and hi ,1 .
recommendation task and applies matrix We use encsenti as inputs to the document
factorization algorithm. They first retrieve images encoder to encode the main text to vector
from Yahoo!, use the CNN to extract image representations. A bi-directional RNN is adopted
features as the additional information of as the document encoder:
sentences, use Rouge maximization as the  

training object function which are trained with hi  GRU d ( hi 1 , enc sent i ) (4)
SGD. In test time, sentences are extracted based
  
on the model and images are retrieved from the hi  GRU d (hi 1 , enc sent i ) (5)
Search Engine.
 
hi  [ h i , hi ] (6)
3 Method
 
Figure 3 shows the framework, a multi-modal encdoc d  [h1 , h 1 ] (7)
attentional hierarchical encoder-decoder model.
The hierarchical encoder-decoder is proposed in where encdoc denotes the vector representation of
(Li et al., 2015) and extended by (Tan et al. , the D, and hi is the concatenated hidden state of si.
2017) for document summarization through 3.2 CaptionSet and ImageSet Encoder
bringing in the graph-based attentional model.
Our model consists of three parts: a The ordered image-caption set PicSet consists of
hierarchical RNN to encode the original sentences an ordered image set and an ordered caption set

4048
Figure 3: The framework of our neural text-image summarization model.
 img  img
which are ordered by the occurring order in the enc img  [ h 1 , h 1 ] (11)
multi-modal document. The image occurring
order makes sense because images are often put where imgfeai is the vector representation of imgi,
near the most related sentences, and the sentences encimg is the vector representation of the image
have strict order in the document. set, and himgi is the hidden state of imgi when
We treat the ordered caption set as a document, encoding the image set.
and apply the sentence encoder and the document To our best knowledge, we are the first to adopt
encoder to the caption document. Then, we get the RNN model to encode the image set.
the hidden state hcapi and the vector representation
enccap of the caption document. 3.3 Decoder
We use the CNN model to extract the vector In the decoding state, we adopt the hierarchical
representation of each image, and then use the RNN decoder to generate text summaries.
RNN model to encode the ordered image set to
vector representation. The CNN model we h 0  tanh(W dec _ doc  enc doc
adopted is 19-layer VGGNet (Simonyan and V dec _ img  enc img (12)
Zisserman, 2014). We drop the last dropout layer
and keep the last full-connected layer as the V dec _ cap  enc cap )
image’s vector representation, the dimension of
which is 4096. h i  GRU dec _ sent1 (hi 1 , hi 1,1 ) (13)
We then use a bi-directional RNN model to
encode the ordered image set and the image hi  GRU dec _ sent 2 (h i , ci ) (14)
features are used as inputs of the RNN model.
 img   img hi , j  GRU dec _ word (hi , j 1 , yi , j 1 ) (15)
h i  GRU img (h i 1 , img fea i ) (8)
 img   img yi , j  soft max(W soft max hi , j  b) (16)
h i  GRU img ( h i 1 , img fea i ) (9)
 img  img
himg i  [h i , h i ] (10)

4049
Equation (12) to (16) are the equations of the Traditional attention mechanisms for text
hierarchical decoder which consists of a sentence summarization computes the importance score of
decoder and a word decoder. the sentence sj in the original document based on
Equation (12) computes the initial state h 0 for the relationship between the decoding hidden
the sentence decoder by combining the decoding of state h i and the original sentence encoding
the main text information and the decoding of the hidden state h j . We call the traditional attention
image information of the multi-modal document.
model as Text Attention (attT for short), which is
To represent image information, we can use both of
computed by equation (17), (18) and (19):
the image set decoding and the caption set
decoding, or only use one of them, depending on att T ( h i , h j )  v T tanh(W T h i  U T h j ) (17)
the multi-modal attention mechanism introduced in
the next subsection. exp(att T (h i , h j )) (18)
 T (h i , h j )  | D|
The sentence decoder uses a two-level hidden
output model (Luong et al., 2015) to generate the  exp(att T (h i , h j ))
j 1

representation of the next sentence through

| D|
equation (13) and equation (14). The two-level cT ( h i )    T (h i , h j ) h j (19)
hidden output model consistently improves the j 1

summarization performance on different datasets

(Chen et al., 2016). In equation (14), the two-level where att T (h i , h j ) is the attention (Banahama et
model computes h i by capturing a direct al., 2014),  T (h i , h j ) is the normalized attention,
interaction between h i and c . h i is computed by
i and cT (h i ) is the context.
equation (13) using the preceding sentence decoder The problem is that the multi-modal document
hidden state h i 1 and the word decoder hidden state has images and captions besides the main text.
Therefore, we propose three multi-modal attention
hi 1,1 which is the last hidden state of the mechanisms which take images and captions into
preceding word decoder. And ci is the context of consideration.
the sentence decoder computed based on the multi- Text-Caption Attention (attTC for short). This
modal attention model. attention model uses captions to represent the image
The word decoder uses the sentence information. attTC computes the attention score of
representation generated by the sentence decoder as the caption capj based on the relationship between
the initial state, and use the <sos> (start of sentence) the caption encoding hidden state hcapi and the
token as the initial input. Equation (15) and decoding hidden state h j .
equation (16) generate the next hidden state and the
next word. The output of the word decoder in the exp(att TC (h i , h j )) (20)
 TC ( h i , h j ) 
first step is a switch sign which is either <neod> | D| | PicSet |

token or <eod> token. The token <neod> means  exp(att

j 1
TC
(h i , h j ))   exp(att
j 1
TC
(h i , hcap j ))

“not end of document”, and the token <eod> means (21)

exp( att TC ( h i , hcap j ))
“end of document”. If the first output is <eod>, the  TC ( h i , h cap j )  | D| |PicSet |

whole decoding process is finished. If the first  exp(att

j 1
TC
(h i , h j ))   exp(att
j 1
TC
( h i , h cap j ))

output is <neod>, the token is used as the next input

of the word decoder. The word decoding process is | D| | PicSet|
cTC (h i )    TC (h i , h j )h j   TC
(h i , h cap j ) hcap j (22)
finished when it generates the <eos> token. The last j 1 j 1

hidden state of the word decoder is treated as the

vector representation of the generated sentence and Text-Image Attention (attTI for short). This
is used as next input of the sentence decoder. attention model only uses images to represent the
image information neglecting the captions. attTI
3.4 Multi-Modal Attention computes the importance score of the image imgj
We propose three multi-modal attention based on the relationship between the image
mechanisms to compute the sentence decoding encoding hidden state himgi and the decoding
context ci . hidden state h j .

4050
exp( att TI ( h i , himg j ))
(23) We use the Adam (Kingma and Ba, 2014)
 TI (h i , himg j )  |D| |PicSet | gradient-based optimization method to optimize the
 exp(att
j 1
TI
( h i , h j ))   exp(att
j 1
TI
( h i , himg j ))
model parameters.
|D | | PicSet | 3.6 Multi-Modal Beam Search Algorithm
cTI (h i )   TI (h i , h j )h j   TI
(h i , himg j )himg j (24)
j 1 j 1 There are two major problems of the generation
Text-Image-Caption Attention (attTIC for of summaries: one is the out-of-vocabulary
short). This attention model uses both captions and problem, and the other is the low quality of the
images to represent the image information. attTIC generated texts including information
computes the importance score of the caption capj incorrectness and repetitions.
and the importance score of the image imgj For the OOV problem, we use the words in the
simultaneously, and then compute the context of the attended sentences or captions in the original
decoding hidden state h i using equation (25). document to replace OOV tokens in the generated
|D | summary. Previous research uses the attended
cTIC (h i )    TIC (h i , h j )h j words to replace OOV tokens in the flatten
j 1 (25)
|PicSet | encoder-decoder model which attends the words of
 
j 1
TIC
(h i , h cap j )hcap j   TIC (h i , himg j )himg j the original word sequence (Jean, et al., 2015). Our
model is hierarchical and multi-modal, and attends
In the attention mechanisms,  (h i , h j ) is the sentences, images, and captions when decoding. We
normalized attention score of h j ,  (h i , hcap j ) is use the following algorithm to find the replacement
for the jth OOV in a generated sentence:
cap
the normalized attention score of h j , Step 1: Order the original sentences and captions
by the attending scores in descending order.
 (h i , himg j ) is the normalized attention score of
Step 2: Return the jth OOV word in the ordered
h img , and c(h i ) is the context.
j
sentences and captions as the replacement.
The initial state of the decoder is computed by For the attTI mechanism that attends images
Equation (12) which can be adjusted according to neglecting captions, we use captions instead of the
different attention models. attended images in the algorithm.
For the low-quality generated text problem, we
3.5 Model Training adopt the hierarchical beam search algorithm (Tan
el al., 2017). We extend the algorithm by adding
Since there are no existing manual text-image
caption-level and image-level beam search. The
summaries, and most of the existing training and
multi-modal hierarchical beam search algorithm
testing data have pure text summaries, we decide
comprises K-best word-level beam search and N-
to use pure text summaries as training data to train
best sentence-caption-level beam search. In
our models. The sentence-image alignment
particular, we use the corresponding captions
relationships can be discovered through
instead of images in beam search algorithm for the
training the multi-modal attention models.
attTI mechanism which attends images.
The loss function L of our summarization
model is the negative log likelihood of generating score( yt )  p ( yt )   (ref (Yt 1  yt , s* )  ref (Yt 1 , s* )) (28)
text summaries over the training multi-modal At the word-level search algorithm, we compute
document set MDS. the score of generating word yt using equation (28)
L 
 log P (Y | D, PicSet ) (26)
( D , PicSet ,Y )MDS
where ref is a function calculating the ratio of
bigram overlap between two texts, s* is the attended
where Y=[y1, y2, …, y|Y|] is the word sequences of sentence or caption, and γ is the weighting factor.
the summary corresponding to the main text D and The added term aims to increase the overlap of the
the ordered image set PicSet, including the tokens generated summary and the original text.
<eos>, <neod> and <eod>. At the sentence level and the caption level, we
|Y | set the sentence beam width as N, and keep N-best
log P(Y | D, PicSet )   log P( yt |{ y1 ,..., yt 1}, c;  ) (27) previously un-referred sentences or captions which
t 1
have highest attending scores. For each sentence
where log P ( yt | { y1 ,..., yt 1}, c;  ) is modeled by
beam, we try M sentences or captions and keep the
the multi-modal encoder-decoder model. one achieving best word-level scores.

4051
3.7 Image Selection and Alignment Train Dev Test
We rank the images, select several most important 196557 12147 10396
images as the image summary, and align each D.L. S.L. I.N Sent.L Cap.L
sentence with an image in the image summary. 26.0 3.84 5.42 26.86 24.75
The score of images is computed by equation (29).
|TextSum| Table 1: The split and statistics of the E-DailyMail
score(img j )  
i 1
 i , j (29) corpora. D.L and S.L indicate the average number of
sentences in the document and summary. I.N
where αi,j is the attention score of the jth image when indicates the average number of images in the story.
generating the ith sentence of the text summary, and Sent.L and Cap.L indicates the average number of
|TextSum| is the number of summary sentences. word in the sentence and the caption respectively.
The images are ranked by the scores in the word embeddings with Google’s word2vec
descending order, and the top K images are selected tools (Mikolov et al., 2013) trained in the whole
to form the image summary ImgSum. We align each text of DailyMail/CNN corpora. We extract the
sentence i in TextSum to the image j in ImgSum 4096-dimension full-connected layer of 19-layer
such that αi,j is the biggest. VGGNet (Simonyan and Zisserman, 2014) as the
vector representation of images. We set the
4 Experiments parameters of Adam to those provided in (Kingma
and Ba, 2014). The batch size is set to 5.
4.1 Data preparation
Convergence is reached within 800k training steps.
We extend the standard DailyMail corpora It takes about one day for training 40k ~ 50k steps
through extracting the images and the captions depending on the models on a GTX-1080 TI GPU
from the html-formatted documents. We call the card. The sentence beam width and the word beam
corpora as E-DailyMail. The standard DailyMail width are set as 2 and 5 respectively. M is set as 3.
and CNN datasets are two widely used datasets The parameter γ is set as 3 or 300 tuned on the
for neural document summarization, which are validation set.
originally built in (Hermann et al., 2015) by To train the multi-modal attention mechanism
collecting human generated highlights and news such as attTIC, we concatenate the matrix of text
stories from the news websites. We only extend representations, image representations, and caption
the DailyMail dataset because it has more images representations to one matrix M = [h1, h2, ... h|D|,
and is easier to collect than the CNN dataset does. hcap1, hcap2, …, hcap|PicSet|, himg1, himg2, …, himg|PicSet|].
We find that the text documents provided by the The parameters of the attention mechanisms are
original DailyMail corpora contain captions. This trained simultaneously. This way the model training
is due to that all related texts are extracted from can converge faster.
the html-formatted news when the corpora are
created. We keep the original text documents 4.3 Evaluation of Text Summarization
unchanged in E-DailyMail. The split and statistics The widely used ROUGE (Lin, 2004) is adopted
of E-DailyMail are shown in Table 1. to evaluate text summaries.
We compare four attention models. HNNattTC-
4.2 Implementation
3, HNNattTIC-3, HNNattTI-3, and HNNattT-3 are
We preprocess the text of the E-DailyMail our hierarchical RNN summarization models with
corpora by tokenizing the text and replacing the the attTC, attIC, attTI, and attT attention
digits with the <NUM> token. The 40k most mechanisms respectively, and 3 is the γ value.
frequent words in the corpora are kept and other HNNattT is similar to the model introduced in (Tan
words are replaced with OOV. et al., 2017) without the graph-based attention. We
Our model is implemented by using Google’s compare our models with HNNattT to show the
open-source seq2seq-master project written with influence of multi-modal attentions. The first 4 lines
Tensorflow. We use one layer of the GRU cell. The in Table 2 are the results with summary length of 75
dimension of the hidden state of the RNN decoder bytes. The results show that HNNattTI has
is 512. The dimension of the word embedding considerable improvement over HNNattT. An
vector is 128. The dimension of the hidden state of interesting observation is that HNNattTC and
the bi-directional RNN encoder is 256. We initialize HNNattTIC are not better than HNNattT. One of the

4052
Method Rouge-1 Rouge-2 Rouge-L Method Rouge-1 Rouge-2 Rouge-L
HNNattTI-3 24.84 8.7 16.99 HNNattTI-3-OOV 24.03 8.2 16.52
HNNattTC-3 18.61 6.7 13.44 HNNattTC-3-OOV 18.18 6.53 12.87
HNNTattTIC-3 21.17 8.1 15.24 HNNTattTIC-3-OOV 20.50 7.67 14.36
HNNattT-3 22.09 7.9 15.97 HNNattT-3-OOV 21.60 7.82 15.05
Lead 21.9 7.2 11.6
NN-SE 22.7 8.5 12.5 Table 4: Comparison results using Rouge recall at 75
SummaRuNNer- bytes without OOV replacement. HNNattTI-3-OOV
23.8 9.6 13.3
abs is the version of HNNattTI-3 without the OOV
LREG(500) 18.5 6.9 10.2 replacement mechanism.
NN-ABS(500) 7.8 1.7 7.1 Method Rouge-1 Rouge-2 Rouge-L
NN-WE(500) 15.7 6.4 9.8
HNNattTI-300-OOV 32.03 11.52 22.67
Table 2: Comparison results on the DailyMail test HNNattTC-300-OOV 26.13 9.87 19.03
set using Rouge recall at 75 bytes. HNNattTIC-300-OOV 30.11 10.87 21.12
HNNattT-300-OOV 30.74 11.21 22.28
Method Rouge-1 Rouge-2 Rouge-L
HNNattTI-300 32.64 12.02 23.88 Table 5: Comparison results using full-length F1
HNNattTC-300 26.75 10.12 19.42 metric without OOV replacement. HNNattTI-300-
HNNattTIC-300 30.52 11.04 21.81 OOV is the version of HNNattTI-300 without the
HNNattT-300 31.34 11.81 22.93 OOV replacement mechanism.
do not incorporate the attention distraction
Table 3: Comparison results on the DailyMail test mechanism (Chen et al., 2016) into our model,
set using full-length F1metric. because we want to focus on our own model to see
reasons is that the text documents provided by the whether considering images improves text
DailyMail corpora contain captions. Captions are summarization. Results in Table 3 also show that
already parts of the text documents. The other HNNattTI performs better than HNNattT,
reason is that captions distract attentions and cannot HNNattTC, and HNNattTIC.
attract sufficient attentions from the original To show the influence of our OOV replacement
sentences, which will be discussed in the next mechanism, we eliminate the mechanism from our
subsection. models, and show the evaluation results in Table 4
We compare our methods with state-of-the-art and Table 5. We can see from the two tables that the
neural summarization methods reported in recent scores are lower than the corresponding scores in
papers on the DailyMail corpora. Extractive models Table 2 and Table 3. Our OOV replacement
include Lead which is a strong baseline using the mechanism improves the summarization models,
leading 3 sentences as the summary, NN-SE though the mechanism is relatively simple.
(Cheng and Lapata, 2016), and SummaRuNNer-abs In short, combining and attending images in the
(Nallapati et al., 2017) which is trained on the neural summarization model improves document
abstractive summaries. Abstractive models include summarization.
NN-ABS, NN-WE, LREG, though they are tested on
500 samples of the test set. LREG is a feature-based 4.4 Evaluation of Image Summarization
method using linear regression. NN-ABS is a simple To evaluation the image summarization, the gold
hierarchical extension of (Rush et al., 2015). NN- standard image summary is generated based on a
WE is the abstractive model restricting the greedy algorithm on the captions as follows: at
generation of words from the original document. each time i, choose imgk to maximize
The results are shown in the last 6 rows in Table 2. Rouge({cap1,…capi-1,capk}, Abs_Sum) ˗
Our method HNNattTI outperforms the three Rouge({cap1,…capi-1}, Abs_Sum)) where
extractive models and the three abstractive models. Abs_Sum is the ground truth text summary and
We compare our models under the full-length F1
metric by setting the γ value as 300. According to num HNNattTI HNNattTC HNNattTIC Random
(Tan et al., 2017), a large γ makes the generated 1 0.4978 0. 4137 0. 4362 0. 4721
summary has more overlaps with the attended 2 0.4783 0. 3998 0. 4230 0. 4517
texts, and thus partly overcome the repeated
Table 6: Image summarization using the recall metric
sentences problem in the generated summary. We for the 1-image or 2-images summary. γ is set as 300.

4053
capk is the caption of imgk. The average number IMG1 IMG2 IMG3 IMG4
of images in summaries is 2.15. The average S1 0.0947 0.1089 0.1157 0.1194
Rouge-1, Rouge-2, and Rouge-L scores of the S2 0.0893 0.1020 0.1070 0.1052
caption summaries with respect to the ground S3 0.0853 0.0769 0.0946 0.0969
truth summaries are 43.85, 19.70, and 36.30
respectively. Table 7: The sentence-image alignment scores of
We use the 1-image and 2-image random the generated summary for the news in Figure 1.
selected image summaries as the baselines which The sentences are named by S1, S2, and S3
we compare our models with. The top 1 or 2 respectively.
images ranked by our model are selected out to Table 7 shows the sentence-image alignment
form the summaries. Results in Table 4 show that scores. The four images in the original document
HNNattTI outperforms the random baseline, are numbered from top to bottom and left to right
while HNNattTC and HNNattTIC perform worse. by IMG1, IMG2, IMG3, and IMG4. The
This implies that attending images can generate summation of alignment scores for a summary
better sentence-image alignment in the multi- sentence is less than 1, because the sentence is also
modal summaries than the model attending aligned with the sentences in the original document.
captions does. And this can also partly explain
why our summarization model attending images 5 Conclusions
when decoding can generate better text
summaries than the one attending captions does. This paper proposes the text-image
summarization task to summarize and align texts
4.5 Instance and images simultaneously. Most previous
Figure 4 shows the text-image summary of the research summarizes texts and images separately,
example demonstrated in Figure 1 generated by and few has been done on text-image
the HNNattTI model. In the summary, there are 2 summarization. We propose the multi-modal
images and 3 generated sentences, and each attentional mechanism which attends original
sentence is aligned with an image. The image sentences, images, captions simultaneously in the
summary has one common image with Figure 2. hierarchical encoder-decoder model, use the RNN
model to encode the ordered image set as the
initial state of the decoder, and propose the multi-
modal beam search algorithm which scores beams
using the bigram overlaps of the generated
sentences and the captions. The model is trained
by using abstractive text summaries as the targets,
and the attention scores of images are used to
score images. The original DailyMail dataset is
extended by collecting images and captions from
the Web. Experiments show that our model
attending images outperforms the models not
attending images, three existing neural abstractive
models and three existing extractive models.
Experiments also show our model can generate
informative summaries of images.

Acknowledgments
The research was sponsored by the National
Natural Science Foundation of China
(No.61806101, No.61876048) and the Natural
Science Foundation of Jiangsu Province
(BK20150862). We thank the anonymous
Figure 4: The generated text-image summary of the reviewers for helpful comments. Professor Hai
example in Figure 1. Zhuge is the corresponding author.

4054
References Krizhevsky A., Sutskever I., Hinton G. E. ImageNet
classification with deep convolutional neural
Agrawal, R., Gollapudi, S., Kannan, A., & networks. Communications of the Acm, 2013,
Kenthapadi, K. (2011). Enriching textbooks with 60(2):2012.
images. In Proceedings of the 20th ACM
international conference on Information and Li J., Luong M. T., Jurafsky D. A Hierarchical Neural
knowledge management (pp. 1847-1856). ACM. Autoencoder for Paragraphs and Documents.
Computer Science, 2015.
Bahdanau D., Cho K., Bengio Y. Neural Machine
Translation by Jointly Learning to Align and Lin, C.Y., 2004. Rouge: A package for automatic
Translate. Computer Science, 2014. evaluation of summaries. Text Summarization
Branches Out.
Chen, Q., Zhu, X., Ling, Z., Wei, S., and Jiang, H.
(2016, July). Distraction-based neural networks for Liu C., Mao J., Sha F., Yuille AL. 2017a. Attention
modeling documents. In Proceedings of the Correctness in Neural Image Captioning. InAAAI
Twenty-Fifth International Joint Conference on 2017 Feb 4 (pp. 4176-4182).
Artificial Intelligence (pp. 2754-2760). AAAI Liu, C., Sun, F., Wang, C., Wang, F. and Yuille, A.,
Press. 2017b. MAT: A multimodal attentive translator for
Cheng, J. and Lapata, M., 2016. Neural image captioning. arXiv preprint
summarization by extracting sentences and arXiv:1702.05658.
words. arXiv preprint arXiv:1603.07252. Luong, M. T., Hieu P., and Christopher D. M..
Cho, K., Van Merriënboer, B., Bahdanau, D. and "Effective approaches to attention-based neural
Bengio, Y., 2014. On the properties of neural machine translation." arXiv preprint
machine translation: Encoder-decoder arXiv:1508.04025 (2015).
approaches. arXiv preprint arXiv:1409.1259. Mikolov T., Sutskever I., Chen K., et al. Distributed
Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T. and Hu, Representations of Words and Phrases and their
G., 2016. Attention-over-attention neural networks Compositionality. 2013, 26:3111-3119.
for reading comprehension. arXiv preprint Nallapati R., Zhai F., Zhou B. SummaRuNNer: A
arXiv:1607.04423. Recurrent Neural Network based Sequence Model
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K. and for Extractive Summarization of Documents. 2016.
Fei-Fei, L., 2009, June. Imagenet: A large-scale Rossiter M. J., Derwing T. M., Jones V. M. L. O. Is a
hierarchical image database. In Computer Vision Picture Worth a Thousand Words?. Tesol
and Pattern Recognition, 2009. CVPR 2009. IEEE Quarterly, 2012, 42(2):325-329.
Conference on (pp. 248-255). IEEE.
Rush A. M., Chopra S., Weston J. A Neural Attention
Gu, J., Lu, Z., Li, H. and Li, V.O., 2016. Model for Abstractive Sentence Summarization.
Incorporating Copying Mechanism in Sequence- Computer Science, 2015.
to-Sequence Learning. In Proceedings of the 54th
Annual Meeting of the Association for Simonyan K., Zisserman A. Very Deep Convolutional
Computational Linguistics (Vol. 1, pp. 1631-1640). Networks for Large-Scale Image Recognition.
Computer Science, 2014.
Greenbacker, C. F., 2011. Towards a framework for
abstractive summarization of multimodal Szegedy C., Liu W., Jia Y., et al. Going deeper with
documents. In Proceedings of the ACL 2011 convolutions. CoRR, abs/1409.4842, 2014.
Student Session (pp. 75-80). Association for Tan, J., Wan, X. and Xiao, J., 2017. Abstractive
Computational Linguistics. document summarization with a graph-based
Hermann, K.M., Kocisky, T., Grefenstette, E., attentional neural model. In Proceedings of the
Espeholt, L., Kay, W., Suleyman, M. and Blunsom, 55th Annual Meeting of the Association for
P., 2015. Teaching machines to read and Computational Linguistics (Volume 1: Long
comprehend. In Advances in Neural Information Papers) (Vol. 1, pp. 1171-1181).
Processing Systems (pp. 1693-1701). UzZaman, N., Bigham, J. P., & Allen, J. F. (2011).
Jean, S., Cho, K., Memisevic, R. and Bengio, Y., Multimodal summarization of complex sentences.
2014. On using very large target vocabulary for In Proceedings of the 16th international
neural machine translation. arXiv preprint conference on Intelligent user interfaces (pp. 43-
arXiv:1412.2007. 52). ACM.
Kingma, D.P. and Ba, J., 2014. Adam: A method for Wang W. Y., Mehdad Y., Radev D. R., et al. A Low-
stochastic optimization. arXiv preprint Rank Approximation Approach to Learning Joint
arXiv:1412.6980.

4055
Embeddings of News Stories and Images for
Timeline Summarization. NAACL. 2016:58-68.
Wu, P., & Carberry, S. (2011). Toward extractive
summarization of multimodal documents.
In Proceedings of the Workshop on Text
Summarization at the Canadian Conference on
Artificial Intelligence (pp. 53-61).
Xu K., Ba J., Kiros R., et al. Show, Attend and Tell:
Neural Image Caption Generation with Visual
Attention. Computer Science, 2015:2048-2057.
Yan R., Wan X., et al. Visualizing timelines:
evolutionary summarization via iterative
reinforcement between text and image streams.
CIKM2012. ACM, 2012:275-284.
You Q., Jin H., Wang Z., et al. Image Captioning with
Semantic Attention. IEEE Conference on
Computer Vision and Pattern Recognition. IEEE
Computer Society, 2016:4651-4659.
Zhou Q., Yang N., Wei F., et al. Selective Encoding
for Abstractive Sentence Summarization. Meeting
of the Association for Computational Linguistics.
2017:1095-1104.
Zhu, X., Goldberg, A. B., et al. (2007). A text-to-
picture synthesis system for augmenting
communication. In AAAI (Vol. 7, pp. 1590-1595).
Zhuge, H. Multi-Dimensional Summarization in
Cyber-Physical Society, Morgan Kaufmann, 2016.

4056

Abstractive Text Summarization of Multimedia News Content Using RNN
No ratings yet
Abstractive Text Summarization of Multimedia News Content Using RNN
10 pages
Short Updates-Machine Learning Based News Summarizer
No ratings yet
Short Updates-Machine Learning Based News Summarizer
11 pages
Abstractive Text Summary Generation With Knowledge Graph Representation
No ratings yet
Abstractive Text Summary Generation With Knowledge Graph Representation
9 pages
IEEE Xplore Reference Download 2024.10.8.14.47.26
No ratings yet
IEEE Xplore Reference Download 2024.10.8.14.47.26
2 pages
Inlg 19 TL DR Writeup 4
No ratings yet
Inlg 19 TL DR Writeup 4
7 pages
Data Science Interview Questions (#Day27)
No ratings yet
Data Science Interview Questions (#Day27)
18 pages
Final4 W18-2706
No ratings yet
Final4 W18-2706
10 pages
2 - Hierarchical LSTMs With Adaptive Attention For
No ratings yet
2 - Hierarchical LSTMs With Adaptive Attention For
18 pages
10 1142@S0218194019500086
No ratings yet
10 1142@S0218194019500086
20 pages
Extracting Sentences and Words
No ratings yet
Extracting Sentences and Words
11 pages
nlp
No ratings yet
nlp
8 pages
Unsupervised Text Summarization Using Sentence Embeddings: Aishwarya Padmakumar Akanksha Saran
No ratings yet
Unsupervised Text Summarization Using Sentence Embeddings: Aishwarya Padmakumar Akanksha Saran
9 pages
Don't Give Me The Details, Just The Summary! Topic-Aware Convolutional Neural Networks For Extreme Summarization
No ratings yet
Don't Give Me The Details, Just The Summary! Topic-Aware Convolutional Neural Networks For Extreme Summarization
11 pages
Knowing When To Look-Adaptive Attention Via A Visual Sentinel For Image Captioning
No ratings yet
Knowing When To Look-Adaptive Attention Via A Visual Sentinel For Image Captioning
12 pages
1805 03616 ReinforcedTopicAwareConvS2S PDF
No ratings yet
1805 03616 ReinforcedTopicAwareConvS2S PDF
8 pages
Comparative Analysis of T5 Model For Abstractive Text Summarization On Different Datasets
No ratings yet
Comparative Analysis of T5 Model For Abstractive Text Summarization On Different Datasets
7 pages
Patent 10
No ratings yet
Patent 10
50 pages
Sachin Kumar Report
No ratings yet
Sachin Kumar Report
15 pages
Image Captioning Using CNN & RNN
No ratings yet
Image Captioning Using CNN & RNN
4 pages
Rare Words in Text Summarization
No ratings yet
Rare Words in Text Summarization
11 pages
Paper for reference
No ratings yet
Paper for reference
47 pages
A Multimodal Text Block Segmentation Framework For Photo Translation
No ratings yet
A Multimodal Text Block Segmentation Framework For Photo Translation
12 pages
Show Attend and Tell
No ratings yet
Show Attend and Tell
10 pages
Deep Visual-Semantic Alignments For Generating Image Descriptions
No ratings yet
Deep Visual-Semantic Alignments For Generating Image Descriptions
17 pages
Data Representation for Deep Learning - Based Arabic Text Summarization Performance Using Python Results
No ratings yet
Data Representation for Deep Learning - Based Arabic Text Summarization Performance Using Python Results
18 pages
Show and Tell: A Neural Image Caption Generator
No ratings yet
Show and Tell: A Neural Image Caption Generator
9 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Chen, Bansal - 2018 - Fast Abstractive Summarization With Reinforce-Selected Sentence Rewriting-Annotated
No ratings yet
Chen, Bansal - 2018 - Fast Abstractive Summarization With Reinforce-Selected Sentence Rewriting-Annotated
12 pages
Text Paraphrasing With Large Language Models-3
No ratings yet
Text Paraphrasing With Large Language Models-3
6 pages
Semantic Sentence Embeddings For Paraphrasing and Text Summarization
No ratings yet
Semantic Sentence Embeddings For Paraphrasing and Text Summarization
5 pages
Abstractive Sentence Summarization With Attentive Recurrent Neural Networks
No ratings yet
Abstractive Sentence Summarization With Attentive Recurrent Neural Networks
6 pages
EAES Effective Augmented Embedding Spaces For Text-Based Image Captioning
No ratings yet
EAES Effective Augmented Embedding Spaces For Text-Based Image Captioning
10 pages
1331-4786-1-PB
No ratings yet
1331-4786-1-PB
14 pages
Attention: Sharad Jones
No ratings yet
Attention: Sharad Jones
25 pages
TCS Ocr
No ratings yet
TCS Ocr
39 pages
J173 Tech-Talk-Sum - Fine-Tuning Extractive Summarization and Enhancing BERT Text Contextualization For Technological Talk Videos
No ratings yet
J173 Tech-Talk-Sum - Fine-Tuning Extractive Summarization and Enhancing BERT Text Contextualization For Technological Talk Videos
18 pages
MD Adil Irshad
No ratings yet
MD Adil Irshad
37 pages
Pami Im2Show and Tell: Lessons Learned From The 2015 MSCOCO Image Captioning Challenge
No ratings yet
Pami Im2Show and Tell: Lessons Learned From The 2015 MSCOCO Image Captioning Challenge
12 pages
21MCEC01 Prashant MP Review 3
No ratings yet
21MCEC01 Prashant MP Review 3
19 pages
T-BERTSum Topic-Aware Text Summarization Based on BERT
No ratings yet
T-BERTSum Topic-Aware Text Summarization Based on BERT
12 pages
Constrained LSTM and Residual Attention For Image Captioning
No ratings yet
Constrained LSTM and Residual Attention For Image Captioning
18 pages
Paper 3
No ratings yet
Paper 3
6 pages
Ranking Sentences For Extractive Summarization With Reinforcement Learning
No ratings yet
Ranking Sentences For Extractive Summarization With Reinforcement Learning
13 pages
Biomedical Text Summarization Using Conditional Generative Adversarial Network (CGAN)
No ratings yet
Biomedical Text Summarization Using Conditional Generative Adversarial Network (CGAN)
12 pages
AIML - Final Report _ version1
No ratings yet
AIML - Final Report _ version1
24 pages
Meshed-Memory Transformer For Image Captioning
No ratings yet
Meshed-Memory Transformer For Image Captioning
15 pages
T5-Based Model For Abstractive Summarization A Semi-Supervised Learning Approach With Consistency Loss Functions
No ratings yet
T5-Based Model For Abstractive Summarization A Semi-Supervised Learning Approach With Consistency Loss Functions
16 pages
Group 13 Sem 2 Review 1
No ratings yet
Group 13 Sem 2 Review 1
20 pages
9
No ratings yet
9
9 pages
Multi Task Learning For Abstractive and Extractive Summarization
No ratings yet
Multi Task Learning For Abstractive and Extractive Summarization
10 pages
Seminar Text Summarization 1
No ratings yet
Seminar Text Summarization 1
21 pages
lec16b-Attention-13-Feb-18
No ratings yet
lec16b-Attention-13-Feb-18
53 pages
I Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
No ratings yet
I Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
10 pages
Adaptive_Feature_Abstraction_for_Translating_Video
No ratings yet
Adaptive_Feature_Abstraction_for_Translating_Video
16 pages
Performance Evaluation of Medical Image Captioning Using
No ratings yet
Performance Evaluation of Medical Image Captioning Using
10 pages
CSCI 5922 Neural Networks and Deep Learning: Image Captioning
No ratings yet
CSCI 5922 Neural Networks and Deep Learning: Image Captioning
26 pages
Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning
No ratings yet
Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning
51 pages
Reinforced Generative Adversarial Network For Abstractive Text Sum-Marization
No ratings yet
Reinforced Generative Adversarial Network For Abstractive Text Sum-Marization
9 pages
An Empirical Study of Language CNN For Image Captioning
No ratings yet
An Empirical Study of Language CNN For Image Captioning
10 pages
Gu An Empirical Study ICCV 2017 Paper PDF
No ratings yet
Gu An Empirical Study ICCV 2017 Paper PDF
10 pages
SNI 19-6724-2002 Horizontal Control Network
No ratings yet
SNI 19-6724-2002 Horizontal Control Network
88 pages
ECLYPSE UG 21 EN Edited TJM
No ratings yet
ECLYPSE UG 21 EN Edited TJM
174 pages
Power Theft Detection
No ratings yet
Power Theft Detection
25 pages
Salesforce Migration Guide
No ratings yet
Salesforce Migration Guide
44 pages
ISPFL9 Module1
100% (1)
ISPFL9 Module1
22 pages
Avital Oliver - CV
No ratings yet
Avital Oliver - CV
4 pages
A Novel Triple-Band Microstrip Branch-Line Coupler With Arbitrary Operating Frequencies
No ratings yet
A Novel Triple-Band Microstrip Branch-Line Coupler With Arbitrary Operating Frequencies
3 pages
Acmer p1 Lasergrbl User Manual
No ratings yet
Acmer p1 Lasergrbl User Manual
10 pages
(RHSA 124) : Monitoring and Managing Linux Processes
No ratings yet
(RHSA 124) : Monitoring and Managing Linux Processes
56 pages
SF - Experiment 2 Non-Preemptive CPU Scheduling Algorithm
No ratings yet
SF - Experiment 2 Non-Preemptive CPU Scheduling Algorithm
4 pages
Microsoft Excel 2013 Plain - Simple
No ratings yet
Microsoft Excel 2013 Plain - Simple
3 pages
Sherpa 2
No ratings yet
Sherpa 2
4 pages
Mod Menu Log - Com - Miniclip.eightballpool
No ratings yet
Mod Menu Log - Com - Miniclip.eightballpool
3 pages
Simplifying Radicals (2) - A Complete Course in Algebra
No ratings yet
Simplifying Radicals (2) - A Complete Course in Algebra
6 pages
Asset Properties - FlexNet Manager Suite
No ratings yet
Asset Properties - FlexNet Manager Suite
1 page
Biniyam C. V
No ratings yet
Biniyam C. V
6 pages
Looking Back at The Evolution of The Internet
No ratings yet
Looking Back at The Evolution of The Internet
4 pages
Autodesk 3ds Max Design 2013 Fundamentals: Better Textbooks. Lower Prices
No ratings yet
Autodesk 3ds Max Design 2013 Fundamentals: Better Textbooks. Lower Prices
66 pages
Presentation Slides - Music Video
No ratings yet
Presentation Slides - Music Video
56 pages
Chapter 11 & 12 - Display Files and CL Programs: Final Product
No ratings yet
Chapter 11 & 12 - Display Files and CL Programs: Final Product
10 pages
Multiswitches For: Choose Your Solution
No ratings yet
Multiswitches For: Choose Your Solution
8 pages
Acti 9 iEM3000 - A9MEM3255 PDF
No ratings yet
Acti 9 iEM3000 - A9MEM3255 PDF
3 pages
ISSMP Exam Outline Effective May 2018
No ratings yet
ISSMP Exam Outline Effective May 2018
11 pages
Characteristics of New Media
No ratings yet
Characteristics of New Media
1 page
Additive Manufacturing Notes
No ratings yet
Additive Manufacturing Notes
2 pages
Hardware/Software Co-Design For Data Flow Dominated Embedded Systems
No ratings yet
Hardware/Software Co-Design For Data Flow Dominated Embedded Systems
1 page
1-Main-AdSense Simplified PDF
No ratings yet
1-Main-AdSense Simplified PDF
52 pages
Spring Break HW
No ratings yet
Spring Break HW
3 pages
Eilago 2.0 Master-Features
No ratings yet
Eilago 2.0 Master-Features
4 pages
Challenging Tools On Research Issues in Big Data Analytics: Althaf Rahaman - SK, Sai Rajesh.K .Girija Rani K
No ratings yet
Challenging Tools On Research Issues in Big Data Analytics: Althaf Rahaman - SK, Sai Rajesh.K .Girija Rani K
8 pages

Image-Text Summarization PDF

Uploaded by

Image-Text Summarization PDF

Uploaded by

Abstractive Text-Image Summarization

Using Multi-Modal Attentional Hierarchical RNN

representation of the next sentence through

summarization performance on different datasets

token or <eod> token. The token <neod> means  exp(att

“not end of document”, and the token <eod> means (21)

whole decoding process is finished. If the first  exp(att

output is <neod>, the token is used as the next input

hidden state of the word decoder is treated as the

You might also like