
Neural Summarization by Extracting Sentences and Words

Jianpeng Cheng Mirella Lapata


ILCC, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected] [email protected]

Abstract

Traditional approaches to extractive summarization rely heavily on human-engineered features. In this work we propose a data-driven approach based on neural networks and continuous sentence features. We develop a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor. This architecture allows us to develop different classes of summarization models which can extract sentences or words. We train our models on large scale corpora containing hundreds of thousands of document-summary pairs¹. Experimental results on two summarization datasets demonstrate that our models obtain results comparable to the state of the art without any access to linguistic annotation.

1 Introduction

The need to access and digest large amounts of textual data has provided strong impetus to develop automatic summarization systems aiming to create shorter versions of one or more documents, whilst preserving their information content. Much effort in automatic summarization has been devoted to sentence extraction, where a summary is created by identifying and subsequently concatenating the most salient text units in a document.

Most extractive methods to date identify sentences based on human-engineered features. These include surface features such as sentence position and length (Radev et al., 2004), the words in the title, the presence of proper nouns, content features such as word frequency (Nenkova et al., 2006), and event features such as action nouns (Filatova and Hatzivassiloglou, 2004). Sentences are typically assigned a score indicating the strength of presence of these features. Several methods have been used in order to select the summary sentences, ranging from binary classifiers (Kupiec et al., 1995), to hidden Markov models (Conroy and O'Leary, 2001), graph-based algorithms (Erkan and Radev, 2004; Mihalcea, 2005), and integer linear programming (Woodsend and Lapata, 2010).

In this work we propose a data-driven approach to summarization based on neural networks and continuous sentence features. There has been a surge of interest recently in repurposing sequence transduction neural network architectures for NLP tasks such as machine translation (Sutskever et al., 2014), question answering (Hermann et al., 2015), and sentence compression (Rush et al., 2015). Central to these approaches is an encoder-decoder architecture modeled by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. An attention mechanism (Bahdanau et al., 2015) is often used to locate the region of focus during decoding.

We develop a general framework for single-document summarization which can be used to extract sentences or words. Our model includes a neural network-based hierarchical document reader or encoder and an attention-based content extractor. The role of the reader is to derive the meaning representation of a document based on its sentences and their constituent words. Our models adopt a variant of neural attention to extract sentences or words. Contrary to previous work where attention is an intermediate step used to blend hidden units of an encoder to a vector propagating additional information to the decoder, our model applies attention directly to select sentences or words of the input document as the output summary. Similar neural attention architectures have been previously used for geometry reasoning (Vinyals et al., 2015), under the name Pointer Networks.

¹ Resources are available for download at http://homepages.inf.ed.ac.uk/s1537177/resources.html
One stumbling block to applying neural network models to extractive summarization is the lack of training data, i.e., documents with sentences (and words) labeled as summary-worthy. Inspired by previous work on summarization (Woodsend and Lapata, 2010; Svore et al., 2007) and reading comprehension (Hermann et al., 2015), we retrieve hundreds of thousands of news articles and corresponding highlights from the DailyMail website. Highlights usually appear as bullet points giving a brief overview of the information contained in the article (see Figure 1 for an example). Using a number of transformation and scoring algorithms, we are able to match highlights to document content and construct two large scale training datasets, one for sentence extraction and the other for word extraction. Previous approaches have used small scale training data in the range of a few hundred examples.

AFL star blames vomiting cat for speeding

Adelaide Crows defender Daniel Talia has kept his driving license, telling a court he was speeding 36km over the limit because he was distracted by his sick cat. [underlined]
The 22-year-old AFL star, who drove 96km/h in a 60km/h road works zone on the South Eastern expressway in February, said he didn't see the reduced speed sign because he was so distracted by his cat vomiting violently in the back seat of his car.
In the Adelaide magistrates court on Wednesday, Magistrate Bob Harrap fined Talia $824 for exceeding the speed limit by more than 30km/h. [underlined]
He lost four demerit points, instead of seven, because of his significant training commitments.

• Adelaide Crows defender Daniel Talia admits to speeding but says he didn't see road signs because his cat was vomiting in his car.
• 22-year-old Talia was fined $824 and four demerit points, instead of seven, because of his 'significant' training commitments.

Figure 1: DailyMail news article with highlights. Underlined sentences bear label 1, and 0 otherwise.

Our work touches on several strands of research within summarization and neural sequence modeling. The idea of creating a summary by extracting words from the source document was pioneered in Banko et al. (2000), who view summarization as a problem analogous to statistical machine translation and generate headlines using statistical models for selecting and ordering the summary words. Our word-based model is similar in spirit; however, it operates over continuous representations, produces multi-sentence output, and jointly selects summary words and organizes them into sentences. A few recent studies (Kobayashi et al., 2015; Yogatama et al., 2015) perform sentence extraction based on pre-trained sentence embeddings following an unsupervised optimization paradigm. Our work also uses continuous representations to express the meaning of sentences and documents, but importantly employs neural networks more directly to perform the actual summarization task.

Rush et al. (2015) propose a neural attention model for abstractive sentence compression which is trained on pairs of headlines and first sentences in an article. In contrast, our model summarizes documents rather than individual sentences, producing multi-sentential discourse. A major architectural difference is that our decoder selects output symbols from the document of interest rather than the entire vocabulary. This effectively helps us sidestep the difficulty of searching for the next output symbol under a large vocabulary, with low-frequency words and named entities whose representations can be challenging to learn. Gu et al. (2016) and Gulcehre et al. (2016) propose a similar "copy" mechanism in sentence compression and other tasks; their model can accommodate both generation and extraction by selecting which sub-sequences in the input sequence to copy in the output.

We evaluate our models both automatically (in terms of ROUGE) and by humans on two datasets: the benchmark DUC 2002 document summarization corpus and our own DailyMail news highlights corpus. Experimental results show that our summarizers achieve performance comparable to state-of-the-art systems employing hand-engineered features and sophisticated linguistic constraints.

2 Problem Formulation

In this section we formally define the summarization tasks considered in this paper. Given a document D consisting of a sequence of sentences {s_1, ..., s_m} and a word set {w_1, ..., w_n}, we are interested in obtaining summaries at two levels of granularity, namely sentences and words.

Sentence extraction aims to create a summary from D by selecting a subset of j sentences (where j < m). We do this by scoring each sentence within D and predicting a label y_L ∈ {0, 1} indicating whether the sentence should be included in the summary. As we apply supervised training, the objective is to maximize the likelihood of all sentence labels y_L = (y_L^1, ..., y_L^m) given the input document D and model parameters θ:

    log p(y_L | D; θ) = ∑_{i=1}^{m} log p(y_L^i | D; θ)    (1)

Although extractive methods yield naturally grammatical summaries and require relatively little linguistic analysis, the selected sentences make for long summaries containing much redundant information. For this reason, we also develop a model based on word extraction which seeks to find a subset of words² in D and their optimal ordering so as to form a summary y_s = (w'_1, ..., w'_k), w'_i ∈ D. Compared to sentence extraction, which is a sequence labeling problem, this task occupies the middle ground between full abstractive summarization, which can exhibit a wide range of rewrite operations, and extractive summarization, which exhibits none. We formulate word extraction as a language generation task with an output vocabulary restricted to the original document. In our supervised setting, the training goal is to maximize the likelihood of the generated sentences, which can be further decomposed by enforcing conditional dependencies among their constituent words:

    log p(y_s | D; θ) = ∑_{i=1}^{k} log p(w'_i | D, w'_1, ..., w'_{i−1}; θ)    (2)

In the following section, we discuss the data elicitation methods which allow us to train neural networks based on the above defined objectives.

² The vocabulary can also be extended to include a small set of commonly-used (high-frequency) words.
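To make the two objectives concrete, the sketch below (ours, not the authors' released code) computes the corresponding negative log-likelihood losses for one document; `probs`, `step_logits`, and `target_ids` are hypothetical tensors standing in for model outputs and gold targets.

```python
# Illustrative sketch: the objectives in Eqs. (1)-(2) reduce to standard
# negative log-likelihoods over sentence labels and summary words.
import torch

def sentence_extraction_nll(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative of Eq. (1): sum of per-sentence Bernoulli log-likelihoods.
    probs: (m,) predicted p(y_L^i = 1 | D); labels: (m,) gold 0/1 labels."""
    ll = labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)
    return -ll.sum()

def word_extraction_nll(step_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Negative of Eq. (2): chain-rule decomposition over the k summary words.
    step_logits: (k, V_doc) scores over the document vocabulary per step."""
    log_probs = torch.log_softmax(step_logits, dim=-1)
    return -log_probs[torch.arange(target_ids.size(0)), target_ids].sum()
```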

3 Training Data for Summarization

Data-driven neural summarization models require a large training corpus of documents with labels indicating which sentences (or words) should be in the summary. Until now such corpora have been limited to hundreds of examples (e.g., the DUC 2002 single document summarization corpus) and thus used mostly for testing (Woodsend and Lapata, 2010). To overcome the paucity of annotated data for training, we adopt a methodology similar to Hermann et al. (2015) and create two large-scale datasets, one for sentence extraction and another one for word extraction.

In a nutshell, we retrieved³ hundreds of thousands of news articles and their corresponding highlights from DailyMail (see Figure 1 for an example). The highlights (created by news editors) are genuinely abstractive summaries and therefore not readily suited to supervised training. To create the training data for sentence extraction, we reverse-approximated the gold standard label of each document sentence given the summary based on their semantic correspondence (Woodsend and Lapata, 2010). Specifically, we designed a rule-based system that determines whether a document sentence matches a highlight and should be labeled with 1 (must be in the summary), and 0 otherwise. The rules take into account the position of the sentence in the document, the unigram and bigram overlap between document sentences and highlights, and the number of entities appearing in the highlight and in the document sentence. We adjusted the weights of the rules on 9,000 documents with manual sentence labels created by Woodsend and Lapata (2010). The method obtained an accuracy of 85% when evaluated on a held-out set of 216 documents coming from the same dataset and was subsequently used to label 200K documents. Approximately 30% of the sentences in each document were deemed summary-worthy.

For the creation of the word extraction dataset, we examine the lexical overlap between the highlights and the news article. In cases where all highlight words (after stemming) come from the original document, the document-highlight pair constitutes a valid training example and is added to the word extraction dataset. For out-of-vocabulary (OOV) words, we try to find a semantically equivalent replacement present in the news article. Specifically, we check if a neighbor, represented by pre-trained⁴ embeddings, is in the original document and therefore constitutes a valid substitution. If we cannot find any substitutes, we discard the document-highlight pair. Following this procedure, we obtained a word extraction dataset containing 170K articles, again from the DailyMail.

³ The script for constructing our datasets is modified from the one released in Hermann et al. (2015).
⁴ We used the Python Gensim library and the 300-dimensional GoogleNews vectors.
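A minimal sketch of this filtering step (our reconstruction; helper names, the neighbour count `topn`, and the use of NLTK stemming are assumptions, paired with the Gensim vectors mentioned in footnote 4):

```python
# Illustrative word-extraction data filter: keep a document-highlight pair
# only if every stemmed highlight word occurs in the document, allowing an
# embedding-neighbour substitution for OOV words.
from gensim.models import KeyedVectors
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Assumed: the 300-dimensional GoogleNews vectors of footnote 4.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def valid_pair(document_words, highlight_words, topn=10):
    doc_stems = {stemmer.stem(w) for w in document_words}
    for word in highlight_words:
        if stemmer.stem(word) in doc_stems:
            continue
        # Word missing from the document: look for a close substitute.
        if word in vectors:
            neighbours = [w for w, _ in vectors.most_similar(word, topn=topn)]
            if any(stemmer.stem(n) in doc_stems for n in neighbours):
                continue
        return False  # no substitute found: discard the pair
    return True
```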
4 Neural Summarization Model

The key components of our summarization model include a neural network-based hierarchical document reader and an attention-based hierarchical content extractor. The hierarchical nature of our model reflects the intuition that documents are generated compositionally from words, sentences, paragraphs, or even larger units. We therefore employ a representation framework which reflects the same architecture, with global information being discovered and local information being preserved. Such a representation yields minimum information loss and is flexible, allowing us to apply neural attention for selecting salient sentences and words within a larger context. In the following, we first describe the document reader, and then present the details of our sentence and word extractors.

4.1 Document Reader

The role of the reader is to derive the meaning representation of the document from its constituent sentences, each of which is treated as a sequence of words. We first obtain representation vectors at the sentence level using a single-layer convolutional neural network (CNN) with a max-over-time pooling operation (Kalchbrenner and Blunsom, 2013; Zhang and Lapata, 2014; Kim et al., 2016). Next, we build representations for documents using a standard recurrent neural network (RNN) that recursively composes sentences. The CNN operates at the word level, leading to the acquisition of sentence-level representations that are then used as inputs to the RNN that acquires document-level representations, in a hierarchical fashion. We describe these two sub-components of the text reader below.

Convolutional Sentence Encoder We opted for a convolutional neural network model for representing sentences for two reasons. Firstly, single-layer CNNs can be trained effectively (without any long-term dependencies in the model) and secondly, they have been successfully used for sentence-level classification tasks such as sentiment analysis (Kim, 2014). Let d denote the dimension of word embeddings, and s a document sentence consisting of a sequence of n words (w_1, ..., w_n) which can be represented by a dense column matrix W ∈ R^{n×d}. We apply a temporal narrow convolution between W and a kernel K ∈ R^{c×d} of width c as follows:

    f_j^i = tanh(W_{j:j+c−1} ⊗ K + b)    (3)

where ⊗ equates to the Hadamard product followed by a sum over all elements. f_j^i denotes the j-th element of the i-th feature map f^i and b is the bias. We perform max pooling over time to obtain a single feature (the i-th feature) representing the sentence under the kernel K with width c:

    s_{i,K} = max_j f_j^i    (4)

In practice, we use multiple feature maps to compute a list of features that match the dimensionality of a sentence under each kernel width. In addition, we apply multiple kernels with different widths to obtain a set of different sentence vectors. Finally, we sum these sentence vectors to obtain the final sentence representation. The CNN model is schematically illustrated in Figure 2 (bottom). In the example, the sentence embeddings have six dimensions, so six feature maps are used under each kernel width. The blue feature maps have width two and the red feature maps have width three. The sentence embeddings obtained under each kernel width are summed to get the final sentence representation (denoted by green).
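As a concrete illustration of Eqs. (3)-(4) and the multi-kernel summation, the following is a minimal PyTorch sketch (our illustration, not the released implementation), using the embedding sizes and kernel widths reported in Section 5:

```python
# Illustrative convolutional sentence encoder: temporal narrow convolution
# with tanh (Eq. 3), max pooling over time (Eq. 4), summed over kernel widths.
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):
    def __init__(self, emb_dim=150, sent_dim=300, widths=(1, 2, 3, 4, 5, 6, 7)):
        super().__init__()
        # One bank of `sent_dim` feature maps per kernel width c.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, sent_dim, kernel_size=c) for c in widths)

    def forward(self, words):            # words: (batch, n, emb_dim)
        x = words.transpose(1, 2)        # Conv1d expects (batch, emb_dim, n)
        vectors = []
        for conv in self.convs:
            if x.size(2) < conv.kernel_size[0]:
                continue                 # sentence shorter than this kernel
            f = torch.tanh(conv(x))      # Eq. (3): narrow convolution
            s, _ = f.max(dim=2)          # Eq. (4): max pooling over time
            vectors.append(s)
        # Sum the per-width sentence vectors into the final representation.
        return torch.stack(vectors).sum(dim=0)
```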
Recurrent Document Encoder At the document level, a recurrent neural network composes a sequence of sentence vectors into a document vector. Note that this is a somewhat simplistic attempt at capturing document organization at the level of sentence-to-sentence transitions. One might view the hidden states of the recurrent neural network as a list of partial representations, with each focusing mostly on the corresponding input sentence given the previous context. These representations altogether constitute the document representation, which captures local and global sentential information with minimum compression.

The RNN we used has a Long Short-Term Memory (LSTM) activation unit for ameliorating the vanishing gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997). Given a document d = (s_1, ..., s_m), the hidden state at time step t, denoted by h_t, is updated as:

    [i_t, f_t, o_t, ĉ_t] = [σ, σ, σ, tanh] (W · [h_{t−1}, s_t])    (5)

    c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t    (6)

    h_t = o_t ⊙ tanh(c_t)    (7)

where W is a learnable weight matrix and ⊙ denotes element-wise multiplication. Next, we discuss a special attention mechanism for extracting sentences and words given the recurrent document encoder just described, starting from the sentence extractor.

Figure 2: A recurrent convolutional document reader with a neural sentence extractor.

4.2 Sentence Extractor

In the standard neural sequence-to-sequence modeling paradigm (Bahdanau et al., 2015), an attention mechanism is used as an intermediate step to decide which input region to focus on in order to generate the next output. In contrast, our sentence extractor applies attention to directly extract salient sentences after reading them.

The extractor is another recurrent neural network that labels sentences sequentially, taking into account not only whether they are individually relevant but also mutually redundant. The complete architecture for the document encoder and the sentence extractor is shown in Figure 2. As can be seen, the next labeling decision is made with both the encoded document and the previously labeled sentences in mind. Given encoder hidden states (h_1, ..., h_m) and extractor hidden states (h̄_1, ..., h̄_m) at time step t, the decoder attends the t-th sentence by relating its current decoding state to the corresponding encoding state:

    h̄_t = LSTM(p_{t−1} s_{t−1}, h̄_{t−1})    (8)

    p(y_L(t) = 1 | D) = σ(MLP(h̄_t : h_t))    (9)

where MLP is a multi-layer neural network taking as input the concatenation of h̄_t and h_t. p_{t−1} represents the degree to which the extractor believes the previous sentence should be extracted and memorized (p_{t−1} = 1 if the system is certain; 0 otherwise). In practice, there is a discrepancy between training and testing such a model. During training we know the true label p_{t−1} of the previous sentence, whereas at test time p_{t−1} is unknown and has to be predicted by the model. The discrepancy can lead to quickly accumulating prediction errors, especially when mistakes are made early in the sequence labeling process. To mitigate this, we adopt a curriculum learning strategy (Bengio et al., 2015): at the beginning of training when p_{t−1} cannot be predicted accurately, we set it to the true label of the previous sentence; as training goes on, we gradually shift its value to the predicted label p(y_L(t − 1) = 1 | d).
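One extractor step (Eqs. 8-9) and the curriculum schedule can be sketched as follows (our illustration with hypothetical names; the paper describes a gradual shift from gold to predicted labels without prescribing an exact mixing schedule, so the linear mix below is one simple choice):

```python
# Illustrative sentence-extractor step with curriculum learning.
import torch
import torch.nn as nn

class SentenceExtractorStep(nn.Module):
    def __init__(self, sent_dim=300, hidden=750):
        super().__init__()
        self.cell = nn.LSTMCell(sent_dim, hidden)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, s_prev, p_prev, state, h_t):
        # Eq. (8): the previous sentence vector is weighted by the belief p_{t-1}.
        h_bar, c_bar = self.cell(p_prev.unsqueeze(-1) * s_prev, state)
        # Eq. (9): score sentence t from the concatenated extractor/encoder states.
        prob = torch.sigmoid(self.mlp(torch.cat([h_bar, h_t], dim=-1))).squeeze(-1)
        return prob, (h_bar, c_bar)

def curriculum_p(gold_label, predicted_prob, epsilon):
    # Early in training epsilon is close to 1 (use the gold label); it is
    # annealed towards 0 so the model's own prediction gradually takes over.
    return epsilon * gold_label + (1 - epsilon) * predicted_prob
```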
4.3 Word Extractor

Compared to sentence extraction, which is a purely sequence labeling task, word extraction is closer to a generation task where relevant content must be selected and then rendered fluently and grammatically. A small extension to the structure of the sequential labeling model makes it suitable for generation: instead of predicting a label for the next sentence at each time step, the model directly outputs the next word in the summary. The model uses a hierarchical attention architecture: at time step t, the decoder softly⁵ attends each document sentence and subsequently attends each word in the document and computes the probability of the next word to be included in the summary, p(w'_t = w_i | d, w'_1, ..., w'_{t−1}), with a softmax classifier:

    h̄_t = LSTM(w'_{t−1}, h̄_{t−1})⁶    (10)

    a_j^t = z^T tanh(W_e h̄_t + W_r h_j), h_j ∈ D    (11)

    b_j^t = softmax(a_j^t)    (12)

    h̃_t = ∑_{j=1}^{m} b_j^t h_j    (13)

    u_i^t = v^T tanh(W'_e h̃_t + W'_r w_i), w_i ∈ D    (14)

    p(w'_t = w_i | D, w'_1, ..., w'_{t−1}) = softmax(u_i^t)    (15)

In the above equations, w_i corresponds to the vector of the i-th word in the input document, whereas z, W_e, W_r, v, W'_e, and W'_r are model weights. The model architecture is shown in Figure 3.

Figure 3: Neural attention mechanism for word extraction.

The word extractor can be viewed as a conditional language model with a vocabulary constraint. In practice, it is not powerful enough to enforce grammaticality due to the lexical diversity and sparsity of the document highlights. A possible enhancement would be to pair the extractor with a neural language model, which can be pre-trained on a large amount of unlabeled documents and then jointly tuned with the extractor during decoding (Gulcehre et al., 2015). A simpler alternative, which we adopt, is to use n-gram features collected from the document to rerank candidate summaries obtained via beam decoding. We incorporate the features in a log-linear reranker whose feature weights are optimized with minimum error rate training (Och, 2003).

⁵ A simpler model would use hard attention to select a sentence first and then a few words from it as a summary, but this would render the system non-differentiable for training. Although hard attention can be trained with the REINFORCE algorithm (Williams, 1992), it requires sampling of discrete actions and could lead to high variance.
⁶ We empirically found that feeding the previous sentence-level attention vector as additional input to the LSTM would lead to small performance improvements. This is not shown in the equation.
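The sketch below (ours; module names are our own and shapes assume a single document processed with batch size 1) transcribes one decoding step of Eqs. (10)-(15):

```python
# Illustrative hierarchical-attention word extractor step.
import torch
import torch.nn as nn

class WordExtractorStep(nn.Module):
    def __init__(self, emb_dim=150, hidden=750):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim, hidden)
        self.We = nn.Linear(hidden, hidden, bias=False)    # W_e
        self.Wr = nn.Linear(hidden, hidden, bias=False)    # W_r
        self.z = nn.Linear(hidden, 1, bias=False)          # z
        self.We2 = nn.Linear(hidden, hidden, bias=False)   # W'_e
        self.Wr2 = nn.Linear(emb_dim, hidden, bias=False)  # W'_r
        self.v = nn.Linear(hidden, 1, bias=False)          # v

    def forward(self, w_prev, state, H, W_doc):
        # w_prev: (1, emb_dim) previously emitted word vector;
        # H: (m, hidden) sentence encodings; W_doc: (n, emb_dim) word vectors.
        h_bar, c_bar = self.cell(w_prev, state)               # Eq. (10)
        a = self.z(torch.tanh(self.We(h_bar) + self.Wr(H)))   # Eq. (11)
        b = torch.softmax(a.squeeze(-1), dim=0)               # Eq. (12)
        h_tilde = b @ H                                       # Eq. (13)
        u = self.v(torch.tanh(self.We2(h_tilde)
                              + self.Wr2(W_doc)))             # Eq. (14)
        return torch.softmax(u.squeeze(-1), dim=0), (h_bar, c_bar)  # Eq. (15)
```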
5 Experimental Setup

In this section we present our experimental setup for assessing the performance of our summarization models. We discuss the datasets used for training and evaluation, give implementation details, briefly introduce comparison models, and explain how system output was evaluated.

Datasets We trained our sentence- and word-based summarization models on the two datasets created from DailyMail news. Each dataset was split into approximately 90% for training, 5% for validation, and 5% for testing. We evaluated the models on the DUC-2002 single document summarization task. In total, there are 567 documents belonging to 59 different clusters of various news topics. Each document is associated with two versions of 100-word⁷ manual summaries produced by human annotators. We also evaluated our models on 500 articles from the DailyMail test set (with the human authored highlights as gold standard). We sampled article-highlight pairs so that the highlights include a minimum of 3 sentences. The average byte count for each document is 278. As there is no established evaluation standard for this task, we also report ROUGE evaluation on the entire DailyMail test set with varying limits. Please refer to the appendix for more information.

Implementation Details We trained our models with Adam (Kingma and Ba, 2014) with initial learning rate 0.001. The two momentum parameters were set to 0.99 and 0.999, respectively. We performed mini-batch training with a batch size of 20 documents. All input documents were padded to the same length, with an additional mask variable storing the real length of each document. The sizes of word, sentence, and document embeddings were set to 150, 300, and 750, respectively. For the convolutional sentence model, we followed Kim et al. (2016)⁸ and used a list of kernel sizes {1, 2, 3, 4, 5, 6, 7}. For the recurrent document model and the sentence extractor, we used as regularization dropout with probability 0.5 on the LSTM input-to-hidden layers and the scoring layer. The depth of each LSTM module was 1. All LSTM parameters were randomly initialized over a uniform distribution within [-0.05, 0.05]. The word vectors were initialized with 150-dimensional pre-trained embeddings.⁹

⁷ According to the DUC 2002 guidelines (http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html), the generated summary should be within 100 words.
⁸ The CNN-LSTM architecture is publicly available at https://github.com/yoonkim/lstm-char-cnn.
⁹ We used the word2vec (Mikolov et al., 2013) skip-gram model with context window size 6, negative sampling size 10, and hierarchical softmax 1. The model was trained on the Google 1-billion word benchmark (Chelba et al., 2014).
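For reference, the stated optimization settings translate into the following sketch (ours; `model` is a placeholder for the full encoder-extractor network, and matching LSTM parameters by name is an assumption):

```python
# Illustrative training configuration with the hyper-parameters stated above.
import torch
import torch.nn as nn

def configure_training(model: nn.Module):
    # Adam with initial learning rate 0.001 and momentum parameters 0.99/0.999.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                 betas=(0.99, 0.999))
    # LSTM parameters initialised uniformly within [-0.05, 0.05].
    for name, param in model.named_parameters():
        if "lstm" in name.lower() or "cell" in name.lower():
            nn.init.uniform_(param, -0.05, 0.05)
    return optimizer
```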
Proper nouns pose a problem for embedding-based approaches, especially when these are rare or unknown (e.g., at test time). Rush et al. (2015) address this issue by adding a new set of features and a log-linear model component to their system. As our model enjoys the advantage of generation by extraction, we can force the model to inspect the context surrounding an entity and its relative position in the sentence in order to discover extractive patterns, placing less emphasis on the meaning representation of the entity itself. Specifically, we perform named entity recognition with the package provided by Hermann et al. (2015) and maintain a set of randomly initialized entity embeddings. During training, the index of the entities is permuted to introduce some noise but also robustness in the data. A similar data augmentation approach has been used for reading comprehension (Hermann et al., 2015).

A common problem with extractive methods based on sentence labeling is that there is no constraint on the number of sentences being selected at test time. We address this by reranking the positively labeled sentences with the probability scores obtained from the softmax layer (rather than the label itself). In other words, we are more interested in the relative ranking of each sentence than in their exact scores. This suggests that an alternative to training the network would be to employ a ranking-based objective or a learning-to-rank algorithm. However, we leave this to future work. We use the three sentences with the highest scores as the summary (also subject to the word or byte limit of the evaluation protocol).
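A minimal sketch of this test-time selection procedure (our illustration; function and variable names are ours):

```python
# Rank sentences by the extractor's probability scores and keep the top
# three, restoring document order; length limits are enforced by the
# evaluation protocol and are omitted here.
def select_summary(sentences, probs, k=3):
    ranked = sorted(range(len(sentences)), key=lambda i: probs[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore original document order
    return [sentences[i] for i in chosen]
```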
Another issue relates to the word extraction model, which is challenging to batch since each document possesses a distinct vocabulary. We sidestep this during training by performing negative sampling (Mikolov et al., 2013), which trims the vocabulary of different documents to the same length. At each decoding step the model is trained to differentiate the true target word from 20 noise samples. At test time we still loop through the words in the input document (and a stop-word list) to decide which word to output next.
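The paper does not spell out the exact form of this loss; one common instantiation consistent with the description (our sketch) treats each step as a (1+20)-way classification between the true word and the sampled noise words:

```python
# Illustrative negative-sampling loss for one decoding step of the word
# extractor; score_fn maps a word vector to the unnormalised score u_i^t
# of Eq. (14), and noise_vecs holds the 20 sampled noise word vectors.
import torch

def negative_sampling_loss(score_fn, true_vec, noise_vecs):
    scores = torch.stack([score_fn(true_vec)]
                         + [score_fn(v) for v in noise_vecs])  # (21,)
    # Train the model to rank the true word (index 0) above the noise words.
    return -torch.log_softmax(scores, dim=0)[0]
```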
System Comparisons We compared the output of our models to various summarization methods. These included the standard baseline of simply selecting the "leading" three sentences from each document as the summary. We also built a sentence extraction baseline classifier using logistic regression and human-engineered features. The classifier was trained on the same datasets as our neural network models with the following features: sentence length, sentence position, number of entities in the sentence, sentence-to-sentence cohesion, and sentence-to-document relevance. Sentence-to-sentence cohesion was computed by calculating for every document sentence its embedding similarity with every other sentence in the same document. The feature was the normalized sum of these similarity scores. Sentence embeddings were obtained by averaging the constituent word embeddings. Sentence-to-document relevance was computed similarly. We calculated for each sentence its embedding similarity with the document (represented as a bag of words), and normalized the score. The word embeddings used in this baseline are the same as the pre-trained ones used for our neural models.

In addition, we included a neural abstractive summarization baseline. This system has a similar architecture to our word extraction model except that it uses an open vocabulary during decoding. It can also be viewed as a hierarchical document-level extension of the abstractive sentence summarizer proposed by Rush et al. (2015). We trained this model with negative sampling to avoid the excessive computation of the normalization constant.

Finally, we compared our models to three previously published systems which have shown competitive performance on the DUC 2002 single document summarization task. The first approach is the phrase-based extraction model of Woodsend and Lapata (2010). Their system learns to produce highlights from parsed input (phrase structure trees and dependency graphs); it selects salient phrases and recombines them subject to length, coverage, and grammar constraints enforced via integer linear programming (ILP). Like ours, this model is trained on document-highlight pairs, and produces telegraphic-style bullet points rather than full-blown summaries. The other two systems, TGRAPH (Parveen et al., 2015) and URANK (Wan, 2010), produce more typical summaries and represent the state of the art. TGRAPH is a graph-based sentence extraction model, where the graph is constructed from topic models and the optimization is performed by constrained ILP. URANK adopts a unified ranking system for both single- and multi-document summarization.
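For concreteness, here is a minimal sketch of the two similarity features of the LREG baseline described above (ours; the paper does not specify the exact normalization, so the max-normalization below is an assumption):

```python
# Illustrative sentence-to-sentence cohesion and sentence-to-document
# relevance features, with sentences represented as averaged word embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cohesion_and_relevance(sent_vecs, doc_vec):
    m = len(sent_vecs)
    # Cohesion: normalized sum of similarities to every other sentence.
    cohesion = np.array([sum(cosine(sent_vecs[i], sent_vecs[j])
                             for j in range(m) if j != i) for i in range(m)])
    cohesion /= cohesion.max() + 1e-8
    # Relevance: normalized similarity to the whole-document vector.
    relevance = np.array([cosine(s, doc_vec) for s in sent_vecs])
    relevance /= relevance.max() + 1e-8
    return cohesion, relevance
```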
Evaluation We evaluated the quality of the summaries automatically using ROUGE (Lin and Hovy, 2003). We report unigram and bigram overlap (ROUGE-1,2) as a means of assessing informativeness and the longest common subsequence (ROUGE-L) as a means of assessing fluency.

In addition, we evaluated the generated summaries by eliciting human judgments for 20 randomly sampled DUC 2002 test documents. Participants were presented with a news article and summaries generated by a list of systems. These include two neural network systems (sentence- and word-based extraction), the neural abstractive system described earlier, the LEAD baseline, the phrase-based ILP model¹⁰ of Woodsend and Lapata (2010), and the human authored summary. Subjects were asked to rank the summaries from best to worst (with ties allowed) in order of informativeness (does the summary capture important information in the article?) and fluency (is the summary written in well-formed English?). We elicited human judgments using Amazon's Mechanical Turk crowdsourcing platform. Participants (self-reported native English speakers) saw 2 random articles per session. We collected 5 responses per document.

¹⁰ We are grateful to Kristian Woodsend for giving us access to the output of his system. Unfortunately, we do not have access to the output of TGRAPH or URANK for inclusion in the human evaluation.

DUC 2002   ROUGE-1  ROUGE-2  ROUGE-L
LEAD       43.6     21.0     40.2
LREG       43.8     20.7     40.3
ILP        45.4     21.3     42.8
NN-ABS     15.8      5.2     13.8
TGRAPH     48.1     24.3     —
URANK      48.5     21.5     —
NN-SE      47.4     23.0     43.5
NN-WE      27.0      7.9     22.8

DailyMail  ROUGE-1  ROUGE-2  ROUGE-L
LEAD       20.4      7.7     11.4
LREG       18.5      6.9     10.2
NN-ABS      7.8      1.7      7.1
NN-SE      21.2      8.3     12.0
NN-WE      15.7      6.4      9.8

Table 1: ROUGE evaluation (%) on the DUC 2002 and 500 DailyMail samples.

Models   1st   2nd   3rd   4th   5th   6th   MeanR
LEAD     0.10  0.17  0.37  0.15  0.16  0.05  3.27
ILP      0.19  0.38  0.13  0.13  0.11  0.06  2.77
NN-SE    0.22  0.28  0.21  0.14  0.12  0.03  2.74
NN-WE    0.00  0.04  0.03  0.21  0.51  0.20  4.79
NN-ABS   0.00  0.01  0.05  0.16  0.23  0.54  5.24
Human    0.27  0.23  0.29  0.17  0.03  0.01  2.51

Table 2: Rankings (shown as proportions) and mean ranks given to systems by human participants (lower is better).

6 Results

Table 1 (upper half) summarizes our results on the DUC 2002 test dataset using ROUGE. NN-SE represents our neural sentence extraction model, NN-WE our word extraction model, and NN-ABS the neural abstractive baseline. The table also includes results for the LEAD baseline, the logistic regression classifier (LREG), and three previously published systems (ILP, TGRAPH, and URANK).

NN-SE outperforms the LEAD and LREG baselines by a significant margin, while performing slightly better than the ILP model. This is an encouraging result since our model has access only to embedding features obtained from raw text. In comparison, LREG uses a set of manually selected features, while the ILP system takes advantage of syntactic information and extracts summaries subject to well-engineered linguistic constraints, which are not available to our models. Overall, our sentence extraction model achieves performance comparable to the state of the art without sophisticated constraint optimization (ILP, TGRAPH) or sentence ranking mechanisms (URANK). We visualize the sentence weights of the NN-SE model in the top half of Figure 4. As can be seen, the model is able to locate text portions which contribute most to the overall meaning of the document.

ROUGE scores for the word extraction model are less promising. This is somewhat expected given that ROUGE is n-gram based and not very well suited to measuring summaries which contain a significant amount of paraphrasing and may deviate from the reference even though they express similar meaning. However, a meaningful comparison can be carried out between NN-WE and NN-ABS, which are similar in spirit. We observe that NN-WE consistently outperforms the purely abstractive model. As NN-WE generates summaries by picking words from the original document, decoding is easier for this model compared to NN-ABS, which deals with an open vocabulary.
sentence extraction:
a gang of at least three people poured gasoline on a car that stopped to fill up at entity5 gas station early on Saturday morning and set the vehicle on fire
the driver of the car, who has not been identified, said he got into an argument with the suspects while he was pumping gas at a entity13 in entity14
the group covered his white entity16 in gasoline and lit it ablaze while there were two passengers inside
at least three people poured gasoline on a car and lit it on fire at a entity14 gas station explosive situation
the passengers and the driver were not hurt during the incident but the car was completely ruined
the man’s grandmother said the fire was lit after the suspects attempted to carjack her grandson, entity33 reported
she said:’ he said he was pumping gas and some guys came up and asked for the car
’ they pulled out a gun and he took off running
’ they took the gas tank and started spraying
’ no one was injured during the fire , but the car ’s entire front end was torched , according to entity52
the entity53 is investigating the incident as an arson and the suspects remain at large
surveillance video of the incident is being used in the investigation
before the fire , which occurred at 12:15am on Saturday , the suspects tried to carjack the man hot case
the entity53 is investigating the incident at the entity67 station as an arson
word extraction:
gang poured gasoline in the car, entity5 Saturday morning. the driver argued with the suspects. his grandmother said the fire was lit by the suspects attempted to
carjack her grandson.
entities:
entity5:California entity13:76-Station entity14: South LA entity16:Dodge Charger entity33:ABC entity52:NBC entity53:LACFD entity67:LA76

Figure 4: Visualization of the summaries for a DailyMail article. The top half shows the relative attention
weights given by the sentence extraction model. Darkness indicates sentence importance. The lower half
shows the summary generated by the word extraction.

The extraction-based generation approach is more robust for proper nouns and rare words, which pose a serious problem to open vocabulary models. An example of the summaries generated by NN-WE is shown in the lower half of Figure 4.

Table 1 (lower half) shows system results on the 500 DailyMail news articles (test set). In general, we observe similar trends to DUC 2002, with NN-SE performing the best in terms of all ROUGE metrics. Note that scores here are generally lower compared to DUC 2002. This is due to the fact that the gold standard summaries (aka highlights) tend to be more laconic and as a result involve a substantial amount of paraphrasing. More experimental results on this dataset are provided in the appendix.

The results of our human evaluation study are shown in Table 2. Specifically, we show, proportionally, how often our participants ranked each system 1st, 2nd, and so on. Perhaps unsurprisingly, the human-written descriptions were considered best and ranked 1st 27% of the time, however closely followed by our NN-SE model which was ranked 1st 22% of the time. The ILP system was mostly ranked in 2nd place (38% of the time). The rest of the systems occupied lower ranks. We further converted the ranks to ratings on a scale of 1 to 6 (assigning ratings 6...1 to rank placements 1...6). This allowed us to perform Analysis of Variance (ANOVA), which revealed a reliable effect of system type. Specifically, post-hoc Tukey tests showed that NN-SE and ILP are significantly (p < 0.01) better than LEAD, NN-WE, and NN-ABS, but do not differ significantly from each other or the human gold standard.

7 Conclusions

In this work we presented a data-driven summarization framework based on an encoder-extractor architecture. We developed two classes of models based on sentence and word extraction. Our models can be trained on large scale datasets and learn informativeness features based on continuous representations without recourse to linguistic annotations. Two important ideas behind our work are the creation of hierarchical neural structures that reflect the nature of the summarization task and generation by extraction. The latter effectively enables us to sidestep the difficulties of generating under a large vocabulary, essentially covering the entire dataset, with many low-frequency words and named entities.

Directions for future work are many and varied. One way to improve the word-based model would be to take structural information into account during generation, e.g., by combining it with a tree-based algorithm (Cohn and Lapata, 2009). It would also be interesting to apply the neural models presented here in a phrase-based setting similar to Lebret et al. (2015). A third direction would be to adopt an information theoretic perspective and devise a purely unsupervised approach that selects summary sentences and words so as to minimize information loss, a task possibly achievable with the dataset created in this work.
DM 75b    ROUGE-1  ROUGE-2  ROUGE-L
LEAD      21.9      7.2     11.6
NN-SE     22.7      8.5     12.5
NN-WE     16.0      6.4     10.2

DM 275b   ROUGE-1  ROUGE-2  ROUGE-L
LEAD      40.5     14.9     32.6
NN-SE     42.2     17.3     34.8
NN-WE     33.9     10.2     23.5

DM full   ROUGE-1  ROUGE-2  ROUGE-L
LEAD      53.5     21.7     48.5
NN-SE     56.0     24.9     50.2
NN-WE     —        —        —

Table 3: ROUGE evaluation (%) on the entire DailyMail test set, with different length limits.

Acknowledgments

We would like to thank three anonymous reviewers and members of the ILCC at the School of Informatics for their valuable feedback. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

8 Appendix

In addition to the DUC 2002 and 500 DailyMail samples, we additionally report results on the entire DailyMail test set (Table 3). Since there is no established evaluation standard for this task, we experimented with three different ROUGE limits: 75 bytes, 275 bytes, and full length.

References

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015, San Diego, California.

[Banko et al.2000] Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th ACL, pages 318–325, Hong Kong.

[Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, pages 1171–1179. Curran Associates, Inc.

[Chelba et al.2014] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

[Cohn and Lapata2009] Trevor Anthony Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research, pages 637–674.

[Conroy and O'Leary2001] Conroy and O'Leary. 2001. Text summarization via hidden Markov models. In Proceedings of the 34th Annual ACL SIGIR, pages 406–407, New Orleans, Louisiana.

[Erkan and Radev2004] Güneş Erkan and Dragomir R. Radev. 2004. Lexpagerank: Prestige in multi-document text summarization. In Proceedings of the 2004 EMNLP, pages 365–371, Barcelona, Spain.

[Filatova and Hatzivassiloglou2004] Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Stan Szpakowicz Marie-Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 104–111, Barcelona, Spain.

[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th ACL, Berlin, Germany. To appear.

[Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

[Gulcehre et al.2016] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th ACL, Berlin, Germany. To appear.

[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1684–1692. Curran Associates, Inc.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, Sofia, Bulgaria.

[Kim et al.2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the 30th AAAI, Phoenix, Arizona. To appear.

[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 EMNLP, pages 1746–1751, Doha, Qatar.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kobayashi et al.2015] Hayato Kobayashi, Masaki Noguchi, and Taichi Yatsuka. 2015. Summarization based on embedding distributions. In Proceedings of the 2015 EMNLP, pages 1984–1989, Lisbon, Portugal.

[Kupiec et al.1995] Julian Kupiec, Jan O. Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR, pages 68–73, Seattle, Washington.

[Lebret et al.2015] Rémi Lebret, Pedro O. Pinheiro, and Ronan Collobert. 2015. Phrase-based image captioning. In Proceedings of the 32nd ICML, Lille, France.

[Lin and Hovy2003] Chin-Yew Lin and Eduard H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, pages 71–78, Edmonton, Canada.

[Mihalcea2005] Rada Mihalcea. 2005. Language independent extractive summarization. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 49–52, Ann Arbor, Michigan.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

[Nenkova et al.2006] Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proceedings of the 29th Annual ACM SIGIR, pages 573–580, Seattle, Washington.

[Och2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st ACL, pages 160–167, Sapporo, Japan.

[Parveen et al.2015] Daraksha Parveen, Hans-Martin Ramsl, and Michael Strube. 2015. Topical coherence for graph-based extractive summarization. In Proceedings of the 2015 EMNLP, pages 1949–1954, Lisbon, Portugal.

[Radev et al.2004] Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, et al. 2004. MEAD - a platform for multidocument multilingual text summarization. Technical report, Columbia University Academic Commons.

[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 EMNLP, pages 379–389, Lisbon, Portugal.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

[Svore et al.2007] Krysta Svore, Lucy Vanderwende, and Christopher Burges. 2007. Enhancing single-document summarization by combining RankNet and third-party sources. In Proceedings of the 2007 EMNLP-CoNLL, pages 448–457, Prague, Czech Republic.

[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, pages 2674–2682. Curran Associates, Inc.

[Wan2010] Xiaojun Wan. 2010. Towards a unified approach to simultaneous single-document and multi-document summarizations. In Proceedings of the 23rd COLING, pages 1137–1145.

[Williams1992] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

[Woodsend and Lapata2010] Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th ACL, pages 565–574, Uppsala, Sweden.

[Yogatama et al.2015] Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the 2015 EMNLP, pages 1961–1966, Lisbon, Portugal.

[Zhang and Lapata2014] Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 EMNLP, pages 670–680, Doha, Qatar.
