Deep Recurrent Generative Decoder For Abstractive Text Summarization

Abstract

We propose a new framework for abstractive text summarization based on a sequence-to-sequence oriented encoder-decoder model equipped with a deep recurrent generative decoder (DRGD). Latent structure information implied in the target summaries is learned based on a recurrent latent random model to improve the summarization quality. Neural variational inference is employed to address the intractable posterior inference for the recurrent latent variables. Abstractive summaries are generated based on both the generative latent variables and the discriminative deterministic states. Extensive experiments on several benchmark datasets in different languages show that DRGD achieves improvements over the state-of-the-art methods.

1 Introduction

Automatic summarization is the process of automatically generating a summary that retains the most important content of the original text document (Edmundson, 1969; Luhn, 1958; Nenkova and McKeown, 2012). Different from the common extraction-based and compression-based methods, abstraction-based methods aim at constructing new sentences as summaries. They therefore require a deeper understanding of the text and the capability of generating new sentences, which provides an obvious advantage in improving the focus of a summary, reducing redundancy, and keeping a good compression rate (Bing et al., 2015; Rush et al., 2015; Nallapati et al., 2016).

[Figure 1: Headlines of the top stories from the channel "Technology" of CNN. The listed headlines include "Apple sues Qualcomm for nearly $1 billion", "Twitter fixes botched @POTUS account transfer", "Track Trump's 100-day promises, Silicon Valley-style", "The emergence of the 'cyber cold war'", "Tesla Autopilot not defective in fatal crash", "Twitter mostly meets modest diversity goals", and "Uber to pay $20 million for misleading drivers".]

Some previous research works show that human-written summaries are more abstractive (Jing and McKeown, 2000). Moreover, our investigation reveals that people may naturally follow some inherent structures when they write abstractive summaries. To illustrate this observation, we show some examples in Figure 1, which are top story summaries or headlines from the channel "Technology" of CNN. After analyzing the summaries carefully, we can find some common structures among them, such as "What", "What-Happened", and "Who Action What". For example, the summary "Apple sues Qualcomm for nearly $1 billion" can be structured as "Who (Apple) Action (sues) What (Qualcomm)". Similarly, the summaries "[Twitter] [fixes] [botched @POTUS account transfer]", "[Uber] [to pay] [$20 million] for misleading drivers", and "[Bipartisan bill] aims to [reform] [H-1B visa system]" also follow the structure of "Who Action What". The summary "The emergence of the 'cyber cold war'" matches the structure of "What", and the summary "St. Louis' public library computers hacked" follows the structure of "What-Happened".

∗ The work described in this paper is supported by a grant from the Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14203414).
Intuitively, if we can incorporate the latent structure information of summaries into the abstractive summarization model, it will improve the quality of the generated summaries. However, very few existing works specifically consider the latent structure information of summaries in their summarization models. Although a very popular neural network based sequence-to-sequence (seq2seq) framework has been proposed to tackle the abstractive summarization problem (Lopyrev, 2015; Rush et al., 2015; Nallapati et al., 2016), the calculation of the internal decoding states is entirely deterministic. The deterministic transformations in these discriminative models lead to limitations on the representation ability of the latent structure information. Miao and Blunsom (2016) extended the seq2seq framework and proposed a generative model to capture the latent summary information, but they did not consider the recurrent dependencies in their generative model, leading to limited representation ability.

To tackle the above mentioned problems, we design a new framework based on a sequence-to-sequence oriented encoder-decoder model equipped with a latent structure modeling component. We employ Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014) as the base model for our generative framework, which can handle the inference problem associated with complex generative modeling. However, the standard framework of VAEs is not designed for sequence modeling related tasks. Inspired by Chung et al. (2015), we add historical dependencies on the latent variables of VAEs and propose a deep recurrent generative decoder (DRGD) for latent structure modeling. Then the standard discriminative deterministic decoder and the recurrent generative decoder are integrated into a unified decoding framework. The target summaries are decoded based on both the discriminative deterministic variables and the generative latent structural information. All the neural parameters are learned by back-propagation in an end-to-end training paradigm.

The main contributions of our framework are summarized as follows: (1) We propose a sequence-to-sequence oriented encoder-decoder model equipped with a deep recurrent generative decoder (DRGD) to model and learn the latent structure information implied in the target summaries of the training data. Neural variational inference is employed to address the intractable posterior inference for the recurrent latent variables. (2) Both the generative latent structural information and the discriminative deterministic variables are jointly considered in the generation process of the abstractive summaries. (3) Experimental results on several benchmark datasets in different languages show that our framework achieves better performance than the state-of-the-art models.

2 Related Works

Automatic summarization is the process of automatically generating a summary that retains the most important content of the original text document (Nenkova and McKeown, 2012). Traditionally, summarization methods can be classified into three categories: extraction-based methods (Erkan and Radev, 2004; Goldstein et al., 2000; Wan et al., 2007; Min et al., 2012; Nallapati et al., 2017; Cheng and Lapata, 2016; Cao et al., 2016; Song et al., 2017), compression-based methods (Li et al., 2013; Wang et al., 2013; Li et al., 2015, 2017), and abstraction-based methods. In fact, previous investigations show that human-written summaries are more abstractive (Barzilay and McKeown, 2005; Bing et al., 2015). Abstraction-based approaches can generate new sentences based on the facts from different source sentences. Barzilay and McKeown (2005) employed sentence fusion to generate a new sentence. Bing et al. (2015) proposed a more fine-grained fusion framework, where new sentences are generated by selecting and merging salient phrases. These methods can be regarded as a kind of indirect abstractive summarization, and complicated constraints are used to guarantee the linguistic quality.
Recently, some researchers have employed neural network based frameworks to tackle the abstractive summarization problem. Rush et al. (2015) proposed a neural network based model with local attention modeling, which is trained on the Gigaword corpus, but combined with an additional log-linear extractive summarization model with handcrafted features. Gu et al. (2016) integrated a copying mechanism into a seq2seq framework to improve the quality of the generated summaries. Chen et al. (2016) proposed a new attention mechanism that not only considers the important source segments, but also distracts them in the decoding step in order to better grasp the overall meaning of the input documents. Nallapati et al. (2016) utilized a trick to control the vocabulary size to improve the training efficiency. The calculations in these methods are all deterministic and the representation ability is limited. Miao and Blunsom (2016) extended the seq2seq framework and proposed a generative model to capture the latent summary information, but they did not consider the recurrent dependencies in their generative model, leading to limited representation ability.

Some research works employ topic models to capture the latent information from source documents or sentences. Wang et al. (2009) proposed a new Bayesian sentence-based topic model that makes use of both the term-document and term-sentence associations to improve the performance of sentence selection. Celikyilmaz and Hakkani-Tur (2010) estimated scores for sentences based on their latent characteristics using a hierarchical topic model, and trained a regression model to extract sentences. However, they only use the latent topic information to conduct the sentence salience estimation for extractive summarization. In contrast, our purpose is to model and learn the latent structure information from the target summaries and use it to enhance the performance of abstractive summarization.

3 Framework Description

3.1 Overview

As shown in Figure 2, the basic framework of our approach is a neural network based encoder-decoder framework for sequence-to-sequence learning. The input is a variable-length sequence X = {x_1, x_2, ..., x_m} representing the source text. The word embedding x_t is initialized randomly and learned during the optimization process. The output is also a sequence Y = {y_1, y_2, ..., y_n}, which represents the generated abstractive summary. The Gated Recurrent Unit (GRU) (Cho et al., 2014) is employed as the basic sequence modeling component for the encoder and the decoder. For latent structure modeling, we add historical dependencies on the latent variables of Variational Auto-Encoders (VAEs) and propose a deep recurrent generative decoder (DRGD) to distill the complex latent structures implied in the target summaries of the training data. Finally, the abstractive summaries are decoded based on both the discriminative deterministic variables H and the generative latent structural information Z.

3.2 Recurrent Generative Decoder

Assume that we have obtained the source text representation h^e ∈ R^{k_h}. The purpose of the decoder is to translate this source code h^e into a series of hidden states {h^d_1, h^d_2, ..., h^d_n}, and then revert these hidden states to an actual word sequence to generate the summary.

For standard recurrent decoders, at each time step t, the hidden state h^d_t ∈ R^{k_h} is calculated using the input symbol y_{t-1} ∈ R^{k_w} and the previous hidden state h^d_{t-1}:

h^d_t = f(y_{t-1}, h^d_{t-1})    (1)

where f(·) is a recurrent neural network such as a vanilla RNN, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), or the Gated Recurrent Unit (GRU) (Cho et al., 2014). No matter which one we use for f(·), the common transformation operation is as follows:

h^d_t = g(W^d_{yh} y_{t-1} + W^d_{hh} h^d_{t-1} + b^d_h)    (2)

where W^d_{yh} ∈ R^{k_h × k_w} and W^d_{hh} ∈ R^{k_h × k_h} are the linear transformation matrices, and b^d_h is the bias. k_h is the dimension of the hidden layers, and k_w is the dimension of the word embeddings. g(·) is a non-linear activation function. From Equation 2, we can see that all the transformations are deterministic, which leads to a deterministic recurrent hidden state h^d_t. From our investigations, we find that the representational power of such deterministic variables is limited. Some more complex latent structures in the target summaries, such as high-level syntactic features and latent topics, cannot be modeled effectively by the deterministic operations and variables.
text. The word embedding xt is initialized ran- latent structures in the target summaries, such as
domly and learned during the optimization pro- the high-level syntactic features and latent topics,
cess. The output is also a sequence Y = cannot be modeled effectively by the deterministic
{y1 , y2 , . . . , yn }, which represents the generated operations and variables.
abstractive summaries. Gated Recurrent Unit Recently, a generative model called Variational
(GRU) (Cho et al., 2014) is employed as the ba- Auto-Encoders (VAEs) (Kingma and Welling,
sic sequence modeling component for the encoder 2013; Rezende et al., 2014) shows strong capa-
and the decoder. For latent structure modeling, we bility in modeling latent random variables and
add historical dependencies on the latent variables improves the performance of tasks in different
of Variational Auto-Encoders (VAEs) and propose fields such as sentence generation (Bowman et al.,
a deep recurrent generative decoder (DRGD) to 2016) and image generation (Gregor et al., 2015).
distill the complex latent structures implied in the However, the standard VAEs is not designed for
target summaries of the training data. Finally, the modeling sequence directly. Inspired by (Chung
abstractive summaries will be decoded out based et al., 2015), we extend the standard VAEs by
[Figure 2: Our deep recurrent generative decoder (DRGD) for latent structure modeling. The figure shows the input words x_1, ..., x_4, <eos>, the attention component, the latent variables z_1, z_2, z_3 with Gaussian parameters µ and log σ^2, the noise ε, the KL term D_KL[N(µ, σ^2) || N(0, I)], and the output words y_1, y_2, <eos>.]
Our proposed latent structure modeling framework can be viewed as a sequence generative model which can be divided into two parts: inference (variational-encoder) and generation (variational-decoder). As shown in the decoder component of Figure 2, the input of the original VAE only contains the observed variable y_t, and the variational-encoder can map it to a latent variable z ∈ R^{k_z}, which can be used to reconstruct the original input. For the task of summarization, in the sequence decoder component, the previous latent structure information needs to be considered for constructing more effective representations for the generation of the next state.

For the inference stage, the variational-encoder can map the observed variables y_{<t} and the previous latent structure information z_{<t} to the posterior probability distribution of the latent structure variable p_θ(z_t | y_{<t}, z_{<t}). It is obvious that this is a recurrent inference process in which z_t contains the historical dynamic latent structure information. Compared with the variational inference process p_θ(z_t | y_t) of the typical VAE model, the recurrent framework can extract more complex and effective latent structure features implied in the sequence data.

For the generation process, based on the latent structure variable z_t, the target word y_t at time step t is drawn from a conditional probability distribution p_θ(y_t | z_t). The target is to maximize the probability of each generated summary y = {y_1, y_2, ..., y_T} based on the generation process according to:

p_θ(y) = ∏_{t=1}^{T} ∫ p_θ(y_t | z_t) p_θ(z_t) dz_t    (3)

For the purpose of solving the intractable integral of the marginal likelihood shown in Equation 3, a recognition model q_φ(z_t | y_{<t}, z_{<t}) is introduced as an approximation to the intractable true posterior p_θ(z_t | y_{<t}, z_{<t}). The recognition model parameters φ and the generative model parameters θ can be learned jointly. The aim is to reduce the Kullback-Leibler (KL) divergence between q_φ(z_t | y_{<t}, z_{<t}) and p_θ(z_t | y_{<t}, z_{<t}):

D_KL[q_φ(z_t | y_{<t}, z_{<t}) || p_θ(z_t | y_{<t}, z_{<t})]
  = ∫_z q_φ(z_t | y_{<t}, z_{<t}) log [q_φ(z_t | y_{<t}, z_{<t}) / p_θ(z_t | y_{<t}, z_{<t})] dz
  = E_{q_φ(z_t | y_{<t}, z_{<t})}[log q_φ(z_t | ·) − log p_θ(z_t | ·)]

where · denotes the conditioning variables y_{<t} and z_{<t}. Bayes' rule is applied to p_θ(z_t | y_{<t}, z_{<t}), and we can extract log p_θ(z_t) from the expectation, transfer the expectation term E_{q_φ(z_t | y_{<t}, z_{<t})} back to a KL-divergence, and rearrange all the terms. Consequently the following holds:

log p_θ(y_{<t}) = D_KL[q_φ(z_t | y_{<t}, z_{<t}) || p_θ(z_t | y_{<t}, z_{<t})]
               + E_{q_φ(z_t | y_{<t}, z_{<t})}[log p_θ(y_{<t} | z_t)]
               − D_KL[q_φ(z_t | y_{<t}, z_{<t}) || p_θ(z_t)]    (4)

Let L(θ, φ; y) represent the last two terms on the right side of Equation 4:

L(θ, φ; y) = ∑_{t=1}^{T} { E_{q_φ(z_t | y_{<t}, z_{<t})}[log p_θ(y_t | z_t)] − D_KL[q_φ(z_t | y_{<t}, z_{<t}) || p_θ(z_t)] }    (5)

Since the first KL-divergence term of Equation 4 is non-negative, we have log p_θ(y_{<t}) ≥ L(θ, φ; y), meaning that L(θ, φ; y) is a lower bound (the objective to be maximized) on the marginal likelihood. In order to differentiate and optimize the lower bound L(θ, φ; y), following the core idea of VAEs, we use a neural network framework for the probabilistic encoder q_φ(z_t | y_{<t}, z_{<t}) for better approximation.
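For reference, the rearrangement described above is the standard evidence-lower-bound decomposition; a compact restatement in the paper's notation, with the conditioning on y_{<t} and z_{<t} suppressed for brevity and writing q for q_φ and p for p_θ, is:

```latex
\begin{aligned}
\log p(y)
  &= \mathbb{E}_{q(z)}\!\left[\log \frac{p(y, z)}{q(z)}\right]
     + \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z \mid y)}\right] \\
  &= \underbrace{\mathbb{E}_{q(z)}\big[\log p(y \mid z)\big]
     - D_{\mathrm{KL}}\!\left[q(z) \,\|\, p(z)\right]}_{\mathcal{L}(\theta, \phi;\, y)}
     + D_{\mathrm{KL}}\!\left[q(z) \,\|\, p(z \mid y)\right]
\end{aligned}
```

so the lower bound follows directly from the non-negativity of the final KL term.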
3.3 Abstractive Summary Generation

We also design a neural network based framework to conduct the variational inference and generation for the recurrent generative decoder component, similar to some designs in previous works (Kingma and Welling, 2013; Rezende et al., 2014; Gregor et al., 2015). The encoder component and the decoder component are integrated into a unified abstractive summarization framework. Considering that the GRU has comparable performance but with fewer parameters and more efficient computation, we employ the GRU as the basic recurrent model, which updates the variables according to the following operations:

r_t = σ(W_{xr} x_t + W_{hr} h_{t-1} + b_r)
z_t = σ(W_{xz} x_t + W_{hz} h_{t-1} + b_z)
g_t = tanh(W_{xh} x_t + W_{hh} (r_t ⊙ h_{t-1}) + b_h)
h_t = z_t ⊙ h_{t-1} + (1 − z_t) ⊙ g_t

where r_t is the reset gate and z_t is the update gate. ⊙ denotes element-wise multiplication, and tanh is the hyperbolic tangent activation function.
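As a reference point, a minimal numpy implementation of the GRU update above might look as follows; the gate names mirror the equations, the weights are random placeholders, and this sketch is not the Theano code used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU update following the reset/update-gate equations above."""
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)          # reset gate
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev + b_z)          # update gate
    g_t = np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev) + b_h)  # candidate state
    return z_t * h_prev + (1.0 - z_t) * g_t                  # interpolated new state

k_h, k_w = 4, 3  # toy dimensions
params = (np.random.randn(k_h, k_w), np.random.randn(k_h, k_h), np.zeros(k_h),
          np.random.randn(k_h, k_w), np.random.randn(k_h, k_h), np.zeros(k_h),
          np.random.randn(k_h, k_w), np.random.randn(k_h, k_h), np.zeros(k_h))
h = gru_step(np.ones(k_w), np.zeros(k_h), params)
```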
As shown in the left block of Figure 2, the encoder is designed based on bidirectional recurrent neural networks. Let x_t be the word embedding vector of the t-th word in the source sequence. The GRU maps x_t and the previous hidden state h_{t-1} to the current hidden state h_t in the forward and backward directions respectively:

→h_t = GRU(x_t, →h_{t-1})
←h_t = GRU(x_t, ←h_{t-1})    (6)

Then the final hidden state h^e_t ∈ R^{2k_h} is the concatenation of the hidden states from the two directions: h^e_t = →h_t ‖ ←h_t. As shown in the middle block of Figure 2, the decoder consists of two components: discriminative deterministic decoding and generative latent structure modeling.
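A sketch of the bidirectional encoding in Equation 6 is given below. A generic tanh recurrence stands in for the GRU cell, and both directions share one set of placeholder weights purely for brevity; a real bidirectional encoder uses a GRU with separate parameters per direction.

```python
import numpy as np

k_h, k_w, T = 4, 3, 5  # toy hidden size, embedding size, and source length

# A generic recurrent step stands in here for the GRU cell of Equation 6.
W_x, W_h = np.random.randn(k_h, k_w), np.random.randn(k_h, k_h)
def step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev)

X = [np.random.randn(k_w) for _ in range(T)]   # stand-ins for word embeddings x_1..x_T

fwd, h = [], np.zeros(k_h)
for x_t in X:                                   # left-to-right pass
    h = step(x_t, h)
    fwd.append(h)

bwd, h = [], np.zeros(k_h)
for x_t in reversed(X):                         # right-to-left pass
    h = step(x_t, h)
    bwd.append(h)
bwd = bwd[::-1]                                 # re-align with the source positions

# h^e_t is the concatenation of the two directions, giving a 2*k_h vector per position.
H_e = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```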
The discriminative deterministic decoding is an improved attention-based recurrent sequence decoder. The first hidden state h^d_1 is initialized using the average of all the source input states: h^d_1 = (1/T^e) ∑_{t=1}^{T^e} h^e_t, where h^e_t is the source input hidden state and T^e is the input sequence length. The deterministic decoder hidden state h^d_t is calculated using two layers of GRUs. On the first layer, the hidden state is calculated only using the current input word embedding y_{t-1} and the previous hidden state h^{d_1}_{t-1}:

h^{d_1}_t = GRU_1(y_{t-1}, h^{d_1}_{t-1})    (7)

where the superscript d_1 denotes the first decoder GRU layer. Then the attention weights at time step t are calculated based on the relationship between h^{d_1}_t and all the source hidden states {h^e_t}. Let a_{i,j} be the attention weight between h^{d_1}_i and h^e_j, which can be calculated using the following formulation:

a_{i,j} = exp(e_{i,j}) / ∑_{j'=1}^{T^e} exp(e_{i,j'})
e_{i,j} = v^T tanh(W^d_{hh} h^{d_1}_i + W^e_{hh} h^e_j + b_a)

where W^d_{hh} ∈ R^{k_h × k_h}, W^e_{hh} ∈ R^{k_h × 2k_h}, b_a ∈ R^{k_h}, and v ∈ R^{k_h}. The attention context is obtained by the weighted linear combination of all the source hidden states:

c_t = ∑_{j'=1}^{T^e} a_{t,j'} h^e_{j'}    (8)

The final deterministic hidden state h^{d_2}_t is the output of the second decoder GRU layer, jointly considering the word y_{t-1}, the previous hidden state h^{d_2}_{t-1}, and the attention context c_t:

h^{d_2}_t = GRU_2(y_{t-1}, h^{d_2}_{t-1}, c_t)    (9)
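The attention computation in Equations 7-9 can be sketched in numpy as below; only the scoring, softmax, and context steps are shown, and the source states H_e and first-layer decoder state h_d1 are random placeholders rather than outputs of the GRU layers.

```python
import numpy as np

k_h, T_e = 4, 5  # toy hidden size and source length

# Placeholders: source hidden states h^e_j (dimension 2*k_h) and a first-layer decoder state.
H_e = np.random.randn(T_e, 2 * k_h)
h_d1 = np.random.randn(k_h)

# Attention parameters W^d_hh, W^e_hh, b_a, and v from the scoring function e_{t,j}.
W_d = np.random.randn(k_h, k_h)
W_e = np.random.randn(k_h, 2 * k_h)
b_a = np.zeros(k_h)
v = np.random.randn(k_h)

# e_{t,j} = v^T tanh(W^d_hh h^{d_1}_t + W^e_hh h^e_j + b_a) for every source position j.
scores = np.array([v @ np.tanh(W_d @ h_d1 + W_e @ h_e_j + b_a) for h_e_j in H_e])

# a_{t,j}: softmax over the source positions; c_t: attention-weighted context (Equation 8).
weights = np.exp(scores - scores.max())
weights /= weights.sum()
c_t = weights @ H_e   # c_t has dimension 2*k_h and feeds the second decoder GRU layer
print(weights.sum())  # sums to 1
```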
For the recurrent generative model component, inspired by some ideas in previous works (Kingma and Welling, 2013; Rezende et al., 2014; Gregor et al., 2015), we assume that both the prior and the posterior of the latent variables are Gaussian, i.e., p_θ(z_t) = N(0, I) and q_φ(z_t | y_{<t}, z_{<t}) = N(z_t; µ, σ^2 I), where µ and σ denote the variational mean and standard deviation respectively, which can be calculated via a multilayer perceptron. Precisely, given the word embedding y_{t-1}, the previous latent structure variable z_{t-1}, and the previous deterministic hidden state h^d_{t-1}, we first project them to a new hidden space:

h^{ez}_t = g(W^{ez}_{yh} y_{t-1} + W^{ez}_{zh} z_{t-1} + W^{ez}_{hh} h^d_{t-1} + b^{ez}_h)

where W^{ez}_{yh} ∈ R^{k_h × k_w}, W^{ez}_{zh} ∈ R^{k_h × k_z}, W^{ez}_{hh} ∈ R^{k_h × k_h}, and b^{ez}_h ∈ R^{k_h}. g is the sigmoid activation function: σ(x) = 1/(1 + e^{−x}).
Then the Gaussian parameters µ_t ∈ R^{k_z} and σ_t ∈ R^{k_z} can be obtained via a linear transformation based on h^{ez}_t:

µ_t = W^{ez}_{hµ} h^{ez}_t + b^{ez}_µ
log(σ_t^2) = W^{ez}_{hσ} h^{ez}_t + b^{ez}_σ    (10)

The latent structure variable z_t ∈ R^{k_z} can then be calculated using the reparameterization trick:

ε ∼ N(0, I),  z_t = µ_t + σ_t ⊗ ε    (11)

where ε ∈ R^{k_z} is an auxiliary noise variable. The process of inference for finding z_t based on neural networks can be treated as a variational encoding process.
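A minimal sketch of Equations 10-11 with random placeholder weights: the mean and log-variance are linear functions of h^{ez}_t, and sampling z_t stays differentiable with respect to µ_t and σ_t because the randomness is isolated in ε.

```python
import numpy as np

k_h, k_z = 4, 2  # toy dimensions for h^{ez}_t and z_t
rng = np.random.default_rng(0)

h_ez = rng.standard_normal(k_h)          # placeholder for h^{ez}_t
W_mu, b_mu = rng.standard_normal((k_z, k_h)), np.zeros(k_z)
W_sigma, b_sigma = rng.standard_normal((k_z, k_h)), np.zeros(k_z)

mu = W_mu @ h_ez + b_mu                  # Equation 10: variational mean
log_var = W_sigma @ h_ez + b_sigma       # Equation 10: log sigma^2
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(k_z)           # epsilon ~ N(0, I)
z_t = mu + sigma * eps                   # Equation 11: reparameterized sample
```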
To generate summaries precisely, we first integrate the recurrent generative decoding component with the discriminative deterministic decoding component, and map the latent structure variable z_t and the deterministic decoding hidden state h^{d_2}_t to a new hidden variable:

h^{dy}_t = tanh(W^{dz}_{zhy} z_t + W^{d_2}_{hh} h^{d_2}_t + b^d_{hy})    (12)

Given the combined decoding state h^{dy}_t at time t, the probability of generating any target word y_t is given as follows:

y_t = ς(W^d_{hy} h^{dy}_t + b^d_{hy})    (13)

where W^d_{hy} ∈ R^{k_y × k_h} and b^d_{hy} ∈ R^{k_y}, and ς(·) is the softmax function. Finally, we use a beam search algorithm (Koehn, 2004) for decoding and generating the best summary.
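A toy numpy rendering of Equations 12-13 over a small placeholder vocabulary; in the real system the resulting distribution would feed a beam search rather than the greedy argmax shown here.

```python
import numpy as np

k_h, k_z, k_y = 4, 2, 6  # toy hidden size, latent size, and vocabulary size
rng = np.random.default_rng(1)

z_t = rng.standard_normal(k_z)        # latent structure variable from Equation 11
h_d2 = rng.standard_normal(k_h)       # second-layer deterministic decoder state

W_z, W_h, b_dy = rng.standard_normal((k_h, k_z)), rng.standard_normal((k_h, k_h)), np.zeros(k_h)
h_dy = np.tanh(W_z @ z_t + W_h @ h_d2 + b_dy)           # Equation 12: combined state

W_hy, b_hy = rng.standard_normal((k_y, k_h)), np.zeros(k_y)
logits = W_hy @ h_dy + b_hy
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # Equation 13: softmax over the vocabulary

next_word_id = int(np.argmax(probs))  # greedy stand-in for one beam search step
```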
3.4 Learning

Although the proposed model contains a recurrent generative decoder, the whole framework is fully differentiable. As shown in Section 3.3, both the recurrent deterministic decoder and the recurrent generative decoder are designed based on neural networks. Therefore, all the parameters in our model can be optimized in an end-to-end paradigm using back-propagation. We use {X}_N and {Y}_N to denote the training source and target sequences. Generally, the objective of our framework consists of two terms. One term is the negative log-likelihood of the generated summaries, and the other is the variational lower bound L(θ, φ; Y) mentioned in Equation 5. Since the variational lower bound L(θ, φ; Y) also contains a likelihood term, we can merge it with the likelihood term of the summaries. The final objective function, which needs to be minimized, is formulated as follows:

J = (1/N) ∑_{n=1}^{N} ∑_{t=1}^{T} { −log p(y_t^{(n)} | y_{<t}^{(n)}, X^{(n)}) + D_KL[q_φ(z_t^{(n)} | y_{<t}^{(n)}, z_{<t}^{(n)}) || p_θ(z_t^{(n)})] }    (14)
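Because both the prior and the approximate posterior are diagonal Gaussians, the KL term in Equation 14 has the usual closed form, so no sampling is needed inside the regularizer. A sketch of the per-timestep loss under this assumption, with placeholder values rather than the actual training code:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form D_KL[N(mu, sigma^2 I) || N(0, I)] for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def timestep_loss(target_prob, mu, log_var):
    """Negative log-likelihood of the gold word plus the KL regularizer of Equation 14."""
    return -np.log(target_prob) + gaussian_kl(mu, log_var)

# Toy example: the model assigns probability 0.4 to the gold word at this step.
mu = np.array([0.1, -0.2])
log_var = np.array([-0.05, 0.03])
print(timestep_loss(0.4, mu, log_var))
```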
4 Experimental Setup

4.1 Datasets

We train and evaluate our framework on three popular datasets. Gigawords is an English sentence summarization dataset prepared from Annotated Gigaword¹ by extracting the first sentence of each article together with the headline to form a source-summary pair. We directly download the prepared dataset used in Rush et al. (2015). It roughly contains 3.8M training pairs, 190K validation pairs, and 2,000 test pairs. DUC-2004² is another English dataset used only for testing in our experiments. It contains 500 documents, each with 4 model summaries written by experts, and the length of the summary is limited to 75 bytes. LCSTS is a large-scale Chinese short text summarization dataset consisting of (short text, summary) pairs collected from Sina Weibo³ (Hu et al., 2015). We take Part-I as the training set, Part-II as the development set, and Part-III as the test set. Each pair carries a human-labeled score in the range 1-5 indicating how relevant an article and its summary are, and we only keep pairs with scores no less than 3. The sizes of the three sets are 2.4M, 8.7K, and 725 respectively. In our experiments, we only take the Chinese character sequence as input, without performing word segmentation.

¹ https://catalog.ldc.upenn.edu/ldc2012t21
² http://duc.nist.gov/duc2004
³ http://www.weibo.com

4.2 Evaluation Metrics

We use the ROUGE score (Lin, 2004) as our evaluation metric with standard options. The basic idea of ROUGE is to count the number of overlapping units between generated summaries and the reference summaries, such as overlapping n-grams, word sequences, and word pairs. F-measures of ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), and ROUGE-SU4 (R-SU4) are reported.
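The reported numbers come from the official ROUGE toolkit with standard options; purely as a simplified illustration of the overlap idea behind the metric, a unigram-level F-measure can be computed as follows (the example strings are taken from Table 5).

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Simplified ROUGE-1 F-measure: clipped unigram overlap between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if not candidate or not reference or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f("wuhan wins men 's soccer title at chinese city games".split(),
                "hosts wuhan wins men 's soccer title at chinese city games".split()))
```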
4.3 Comparative Methods

We compare our model with several baselines and state-of-the-art methods. Because the datasets are quite standard, we extract the results directly from the corresponding papers; therefore the baseline methods on different datasets may be slightly different.

• TOPIARY (Zajic et al., 2004) is the best system on DUC-2004 Task-1 for compressive text summarization. It combines a system using linguistic based transformations with an unsupervised topic detection algorithm for compressive text summarization.

• MOSES+ (Rush et al., 2015) uses a phrase-based statistical machine translation system trained on Gigaword to produce summaries. It also augments the phrase table with "deletion" rules to improve the baseline performance, and MERT is used to improve the quality of generated summaries.

• ABS and ABS+ (Rush et al., 2015) are both neural network based models with local attention modeling for abstractive sentence summarization. ABS+ is trained on the Gigaword corpus, but combined with an additional log-linear extractive summarization model with handcrafted features.

• RNN and RNN-context (Hu et al., 2015) are two seq2seq architectures. RNN-context integrates an attention mechanism to model the context.

• CopyNet (Gu et al., 2016) integrates a copying mechanism into the sequence-to-sequence framework.

• RNN-distract (Chen et al., 2016) uses a new attention mechanism that distracts the historical attention in the decoding steps.

• RAS-LSTM and RAS-Elman (Chopra et al., 2016) both consider words and word positions as input and use convolutional encoders to handle the source information. For the attention-based sequence decoding process, RAS-Elman selects an Elman RNN (Elman, 1990) as the decoder, and RAS-LSTM selects the Long Short-Term Memory architecture (Hochreiter and Schmidhuber, 1997).

• LenEmb (Kikuchi et al., 2016) uses a mechanism to control the summary length by taking a length embedding vector as input.

• ASC+FSC1 (Miao and Blunsom, 2016) uses a generative model with an attention mechanism to conduct sentence compression. The model first draws a latent summary sentence from a background language model, and then draws the observed sentence conditioned on this latent summary.

• lvt2k-1sent and lvt5k-1sent (Nallapati et al., 2016) utilize a trick to control the vocabulary size to improve the training efficiency.

4.4 Experimental Settings

For the experiments on the English dataset Gigawords, we set the dimension of the word embeddings to 300, and the dimension of the hidden states and latent variables to 500. The maximum lengths of documents and summaries are 100 and 50 respectively. The batch size for mini-batch training is 256. For DUC-2004, the maximum length of summaries is 75 bytes. For the LCSTS dataset, the dimension of the word embeddings is 350, and we also set the dimension of the hidden states and latent variables to 500. The maximum lengths of documents and summaries are 120 and 25 respectively, and the batch size is also 256. The beam size of the decoder is set to 10. Adadelta (Schmidhuber, 2015) with hyperparameters ρ = 0.95 and ε = 1e−6 is used for gradient based optimization. Our neural network based framework is implemented using Theano (Theano Development Team, 2016).
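The reported hyperparameters can be collected into a compact configuration sketch; the dictionary layout below is purely illustrative and is not the actual Theano configuration used in the experiments.

```python
# Hyperparameters reported in Section 4.4, grouped per dataset for reference.
CONFIG = {
    "gigawords": {"emb_dim": 300, "hidden_dim": 500, "latent_dim": 500,
                  "max_doc_len": 100, "max_sum_len": 50, "batch_size": 256},
    "lcsts":     {"emb_dim": 350, "hidden_dim": 500, "latent_dim": 500,
                  "max_doc_len": 120, "max_sum_len": 25, "batch_size": 256},
    "decoding":  {"beam_size": 10},
    "optimizer": {"name": "adadelta", "rho": 0.95, "epsilon": 1e-6},
}
```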
5 Results and Discussions

5.1 ROUGE Evaluation

Table 1: ROUGE-F1 on the validation sets
Dataset   System   R-1     R-2     R-L
GIGA      StanD    32.69   15.29   30.60
GIGA      DRGD     36.25   17.61   33.55
LCSTS     StanD    33.88   21.49   31.05
LCSTS     DRGD     36.71   24.00   34.10

We first examine the performance of our model DRGD by comparing it to the standard decoder (StanD) of our own implementation. The comparison results on the validation sets of Gigawords and LCSTS are shown in Table 1. From the results we can see that our proposed generative decoder DRGD obtains obvious improvements on abstractive summarization over the standard decoder. The performance of the standard decoder is in fact similar to that of the popular baseline methods mentioned above.
Table 2: ROUGE-F1 on Gigawords
System        R-1     R-2     R-L
ABS           29.55   11.32   26.42
ABS+          29.78   11.89   26.97
RAS-LSTM      32.55   14.70   30.03
RAS-Elman     33.78   15.97   31.15
ASC+FSC1      34.17   15.94   31.92
lvt2k-1sent   32.67   15.59   30.64
lvt5k-1sent   35.30   16.64   32.62
DRGD          36.27   17.57   33.62

Table 3: ROUGE-Recall on DUC-2004
System        R-1     R-2     R-L
TOPIARY       25.12   6.46    20.12
MOSES+        26.50   8.13    22.85
ABS           26.55   7.06    22.05
ABS+          28.18   8.49    23.81
RAS-Elman     28.97   8.26    24.06
RAS-LSTM      27.41   7.69    23.06
LenEmb        26.73   8.39    23.88
lvt2k-1sen    28.35   9.46    24.59
lvt5k-1sen    28.61   9.42    25.24
DRGD          31.79   10.75   27.48

Table 4: ROUGE-F1 on LCSTS
System        R-1     R-2     R-L
RNN           21.50   8.90    18.60
RNN-context   29.90   17.40   27.20
CopyNet       34.40   21.60   31.30
RNN-distract  35.20   22.60   32.50
DRGD          36.99   24.15   34.21

The results on the English datasets Gigawords and DUC-2004 are shown in Table 2 and Table 3 respectively. Our model DRGD achieves the best summarization performance on all the ROUGE metrics. Although ASC+FSC1 also uses a generative method to model the latent summary variables, its representation ability is limited and it does not bring in noticeable improvements. It is worth noting that lvt2k-1sent and lvt5k-1sent (Nallapati et al., 2016) utilize linguistic features such as part-of-speech tags, named-entity tags, and TF and IDF statistics of the words as part of the document representation. In fact, extracting all such features is time consuming, especially on large-scale datasets such as Gigawords. lvt2k and lvt5k are also not end-to-end models and are more complicated than our model in practical applications.

The results on the Chinese dataset LCSTS are shown in Table 4. Our model DRGD also achieves the best performance. Although CopyNet employs a copying mechanism to improve the summary quality and RNN-distract considers attention information diversity in its decoder, our model is still better than those two methods, demonstrating that the latent structure information learned from the target summaries indeed plays a role in abstractive summarization. We also believe that integrating the copying mechanism and coverage diversity into our framework would further improve the summarization performance.

5.2 Summary Case Analysis

In order to analyze the reasons for the improved performance, we compare the summaries generated by DRGD and by the standard decoder StanD used in some other works such as Chopra et al. (2016). The source texts, golden summaries, and generated summaries are shown in Table 5. From the cases we can observe that DRGD can indeed capture some latent structures which are consistent with the golden summaries. For example, our result for S(1), "Wuhan wins men's soccer title at Chinese city games", matches the "Who Action What" structure. In contrast, the standard decoder StanD ignores the latent structures and generates some loose sentences; its result for S(1), "Results of men's volleyball at Chinese city games", does not catch the main points. The reason is that the recurrent variational auto-encoders used in our framework have better representation ability and can capture more effective and complicated latent structures from the sequence data. Therefore, the summaries generated by DRGD have latent structures consistent with the ground truth, leading to a better ROUGE evaluation.

6 Conclusions

We propose a deep recurrent generative decoder (DRGD) to improve abstractive summarization performance. The model is a sequence-to-sequence oriented encoder-decoder framework equipped with a latent structure modeling component. Abstractive summaries are generated based on both the latent variables and the deterministic states. Extensive experiments on benchmark datasets show that DRGD achieves improvements over the state-of-the-art methods.
Table 5: Examples of the generated summaries.

S(1): hosts wuhan won the men 's soccer title by beating beijing shunyi #-# here at the #th chinese city games on friday.
Golden: hosts wuhan wins men 's soccer title at chinese city games.
StanD: results of men 's volleyball at chinese city games.
DRGD: wuhan wins men 's soccer title at chinese city games.

S(2): UNK and the china meteorological administration tuesday signed an agreement here on long - and short-term cooperation in projects involving meteorological satellites and satellite meteorology.
Golden: UNK china to cooperate in meteorology.
StanD: weather forecast for major chinese cities.
DRGD: china to cooperate in meteorological satellites.

S(3): the rand gained ground against the dollar at the opening here wednesday , to #.# to the greenback from #.# at the close tuesday.
Golden: rand gains ground.
StanD: rand slightly higher against dollar.
DRGD: rand gains ground against dollar.

S(4): new zealand women are having more children and the country 's birth rate reached its highest level in ## years , statistics new zealand said on wednesday.
Golden: new zealand birth rate reaches ##-year high.
StanD: new zealand women are having more children birth rate hits highest level in ## years.
DRGD: new zealand 's birth rate hits ##-year high.

References

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In ACL, pages 1587–1597.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL, pages 10–21.

Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. 2016. AttSum: Joint learning of focusing and summarization with neural attention. In COLING, pages 547–556.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A hybrid hierarchical model for multi-document summarization. In ACL, pages 815–824.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for document summarization. In IJCAI, pages 2754–2760.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In ACL, pages 484–494.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL-HLT, pages 93–98.

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988.

Harold P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM, 16(2):264–285.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In NAACL-ANLP Workshop, pages 40–48.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICML, pages 1462–1471.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, pages 1631–1640.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale Chinese short text summarization dataset. In EMNLP, pages 1962–1972.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In NAACL, pages 178–185.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In EMNLP, pages 1328–1338.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas, pages 115–124. Springer.

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In EMNLP, pages 490–500.

Piji Li, Lidong Bing, Wai Lam, Hang Li, and Yi Liao. 2015. Reader-aware multi-document summarization via sparse coding. In IJCAI, pages 1270–1276.

Piji Li, Zihao Wang, Wai Lam, Zhaochun Ren, and Lidong Bing. 2017. Salience estimation via variational auto-encoders for multi-document summarization. In AAAI, pages 3497–3503.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8.

Konstantin Lopyrev. 2015. Generating news headlines with recurrent neural networks. arXiv preprint arXiv:1512.01712.

Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165.

Ziheng Lin Min, Yen Kan Chew, and Lim Tan. 2012. Exploiting category-specific information for multi-document summarization. In COLING, pages 2903–2108.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Ani Nenkova and Kathleen McKeown. 2012. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP, pages 379–389.

Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks, 61:85–117.

Hongya Song, Zhaochun Ren, Piji Li, Shangsong Liang, Jun Ma, and Maarten de Rijke. 2017. Summarizing answers in non-factoid community question-answering. In WSDM, pages 405–414.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.

Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In IJCAI, volume 7, pages 2903–2908.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. In ACL-IJCNLP, pages 297–300.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In ACL, pages 1384–1394.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In HLT-NAACL, pages 112–119.