Figure 1: DailyMail news article with highlights. Underlined sentences bear label 1, and 0 otherwise.
summarization which exhibits none. We formulate word extraction as a language generation task with an output vocabulary restricted to the original document. In our supervised setting, the training goal is to maximize the likelihood of the generated sentences, which can be further decomposed by enforcing conditional dependencies among their constituent words:

log p(y_s | D; θ) = ∑_{i=1}^{k} log p(w′_i | D, w′_1, · · · , w′_{i−1}; θ)    (2)
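As a concrete illustration of Equation (2), the sketch below (plain NumPy, not the authors' code) accumulates the per-step log-probabilities of the gold summary words over an output vocabulary restricted to the document; how the unnormalized scores are produced from the document D and the generation history is left abstract, and all names here are illustrative.

import numpy as np

def sequence_log_likelihood(step_scores, target_ids):
    # step_scores: array of shape (k, V_doc) holding unnormalized scores over
    #   the words of the document (the restricted output vocabulary), one row
    #   per generation step; producing these scores is left abstract here.
    # target_ids: the k indices of the gold summary words w'_1 ... w'_k.
    total = 0.0
    for scores, target in zip(step_scores, target_ids):
        m = scores.max()
        log_z = m + np.log(np.sum(np.exp(scores - m)))  # stable log-partition
        total += scores[target] - log_z                 # log p(w'_i | D, w'_<i)
    return total  # training maximizes this quantity (or minimizes its negative)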
In the following section, we discuss the data elicitation methods which allow us to train neural networks based on the above defined objectives.
3 Training Data for Summarization

Data-driven neural summarization models require a large training corpus of documents with labels indicating which sentences (or words) should be in the summary. Until now such corpora have been limited to hundreds of examples (e.g., the DUC 2002 single-document summarization corpus) and have thus been used mostly for testing (Woodsend and Lapata, 2010). To overcome the paucity of annotated data for training, we adopt a methodology similar to Hermann et al. (2015) and create two large-scale datasets, one for sentence extraction and another one for word extraction.

In a nutshell, we retrieved3 hundreds of thousands of news articles and their corresponding highlights from DailyMail (see Figure 1 for an example). The highlights (created by news editors) are genuinely abstractive summaries and therefore not readily suited to supervised training. To create the training data for sentence extraction, we reverse approximated the gold standard label of each document sentence given the summary based on their semantic correspondence (Woodsend and Lapata, 2010). Specifically, we designed a rule-based system that determines whether a document sentence matches a highlight and should be labeled with 1 (must be in the summary), and 0 otherwise. The rules take into account the position of the sentence in the document, the unigram and bigram overlap between document sentences and highlights, and the number of entities appearing in the highlight and in the document sentence. We adjusted the weights of the rules on 9,000 documents with manual sentence labels created by Woodsend and Lapata (2010). The method obtained an accuracy of 85% when evaluated on a held-out set of 216 documents coming from the same dataset and was subsequently used to label 200K documents. Approximately 30% of the sentences in each document were deemed summary-worthy.

3 The script for constructing our datasets is modified from the one released in Hermann et al. (2015).
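A rough sketch of such a rule-based labeler is given below. The paper specifies only the feature types (sentence position, unigram/bigram overlap with the highlights, shared entities) and that the rule weights were adjusted on 9,000 manually labeled documents, so the weights, threshold, and scoring form used here are hypothetical placeholders rather than the tuned values.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(sent_grams, high_grams):
    # Fraction of the highlight's n-grams (or entities) that also occur in the sentence.
    return len(sent_grams & high_grams) / max(len(high_grams), 1)

def label_sentence(sent_tokens, sent_entities, position,
                   high_tokens, high_entities,
                   weights=(0.2, 0.4, 0.3, 0.1), threshold=0.5):
    # weights and threshold are illustrative, not the values tuned in the paper
    w_pos, w_uni, w_bi, w_ent = weights
    position_score = 1.0 / (1.0 + position)  # earlier sentences score higher
    unigram_score = coverage(ngrams(sent_tokens, 1), ngrams(high_tokens, 1))
    bigram_score = coverage(ngrams(sent_tokens, 2), ngrams(high_tokens, 2))
    entity_score = coverage(set(sent_entities), set(high_entities))
    score = (w_pos * position_score + w_uni * unigram_score
             + w_bi * bigram_score + w_ent * entity_score)
    return 1 if score >= threshold else 0  # 1 = should be in the summary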
For the creation of the word extraction dataset, we examine the lexical overlap between the highlights and the news article. In cases where all highlight words (after stemming) come from the original document, the document-highlight pair constitutes a valid training example and is added to the word extraction dataset. For out-of-vocabulary (OOV) words, we try to find a semantically equivalent replacement present in the news article. Specifically, we check if a neighbor, represented by pre-trained4 embeddings, is in the original document and therefore constitutes a valid substitution. If we cannot find any substitutes, we discard the document-highlight pair. Following this procedure, we obtained a word extraction dataset containing 170K articles, again from the DailyMail.

4 We used the Python Gensim library and the 300-dimensional GoogleNews vectors.
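The OOV substitution step can be sketched as follows, assuming the Gensim library and the pre-trained GoogleNews vectors mentioned in the footnote; the neighborhood size and the acceptance criterion (take the first nearest neighbor that occurs in the document) are assumptions rather than details given in the paper, and the file path is illustrative.

from gensim.models import KeyedVectors

# Pre-trained 300-dimensional GoogleNews embeddings (path is illustrative).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def substitute_oov(word, document_vocab, topn=10):
    # Replace a highlight word that is absent from the document with a
    # semantically similar word that does occur in it; return None if no
    # in-document neighbor is found (the pair would then be discarded).
    if word in document_vocab:
        return word
    if word not in vectors:
        return None
    for neighbor, _similarity in vectors.most_similar(word, topn=topn):
        if neighbor in document_vocab:
            return neighbor
    return None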
4 Neural Summarization Model

The key components of our summarization model include a neural network-based hierarchical document reader and an attention-based hierarchical content extractor. The hierarchical nature of our model reflects the intuition that documents are generated compositionally from words, sentences, paragraphs, or even larger units. We therefore employ a representation framework which reflects the same architecture, with global information being discovered and local information being preserved. Such a representation yields minimum information loss and is flexible, allowing us to apply neural attention for selecting salient sentences and words within a larger context. In the following, we first describe the document reader, and then present the details of our sentence and word extractors.

4.1 Document Reader

The role of the reader is to derive the meaning representation of the document from its constituent sentences, each of which is treated as a sequence of words. We first obtain representation vectors at the sentence level using a single-layer convolutional neural network (CNN) with a max-over-time pooling operation (Kalchbrenner and Blunsom, 2013; Zhang and Lapata, 2014; Kim et al., 2016). Next, we build representations for documents using a standard recurrent neural network (RNN) that recursively composes sentences. The CNN operates at the word level, leading to the acquisition of sentence-level representations that are then used as inputs to the RNN that acquires document-level representations, in a hierarchical fashion. We describe these two sub-components of the text reader below.

Convolutional Sentence Encoder  We opted for a convolutional neural network model for representing sentences for two reasons. Firstly, single-layer CNNs can be trained effectively (without any long-term dependencies in the model) and secondly, they have been successfully used for sentence-level classification tasks such as sentiment analysis (Kim, 2014). Let d denote the dimension of word embeddings, and s a document sentence consisting of a sequence of n words (w_1, · · · , w_n) which can be represented by a dense column matrix W ∈ R^{n×d}. We apply a temporal narrow convolution between W and a kernel K ∈ R^{c×d} of width c as follows:

f^i_j = tanh(W_{j:j+c−1} ⊗ K + b)    (3)

where ⊗ equates to the Hadamard product followed by a sum over all elements, f^i_j denotes the j-th element of the i-th feature map f^i, and b is the bias. We perform max pooling over time to obtain a single feature (the i-th feature) representing the sentence under the kernel K with width c:

s_{i,K} = max_j f^i_j    (4)

In practice, we use multiple feature maps to compute a list of features that match the dimensionality of a sentence under each kernel width. In addition, we apply multiple kernels with different widths to obtain a set of different sentence vectors. Finally, we sum these sentence vectors to obtain the final sentence representation. The CNN model is schematically illustrated in Figure 2 (bottom). In the example, the sentence embeddings have six dimensions, so six feature maps are used under each kernel width. The blue feature maps have width two and the red feature maps have width three. The sentence embeddings obtained under each kernel width are summed to get the final sentence representation (denoted by green).
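Equations (3) and (4) can be realized in a few lines of NumPy. The sketch below mirrors the description above (temporal narrow convolution, max-over-time pooling, several kernel widths, and a sum over the per-width vectors); the kernels are randomly initialized purely for illustration and do not correspond to the trained encoder.

import numpy as np

def conv_sentence_encoder(W, kernels_by_width):
    # W: sentence matrix of shape (n words, d dimensions).
    # kernels_by_width: maps a kernel width c to a list of (K, b) pairs,
    #   one per feature map, with K of shape (c, d) and scalar bias b.
    n, d = W.shape
    sentence_vector = None
    for c, feature_maps in kernels_by_width.items():
        features = []
        for K, b in feature_maps:
            # Eq. (3): f^i_j = tanh(W[j:j+c] (*) K + b) for every valid window j
            f = np.array([np.tanh(np.sum(W[j:j + c] * K) + b)
                          for j in range(n - c + 1)])
            features.append(f.max())  # Eq. (4): max pooling over time
        width_vector = np.array(features)
        sentence_vector = (width_vector if sentence_vector is None
                           else sentence_vector + width_vector)
    return sentence_vector  # summed over kernel widths

# Toy usage mirroring Figure 2: six feature maps per width, widths 2 and 3.
rng = np.random.default_rng(0)
n, d, dim = 12, 50, 6
W = rng.normal(size=(n, d))  # embeddings of a 12-word sentence
kernels = {c: [(rng.normal(size=(c, d)), 0.0) for _ in range(dim)] for c in (2, 3)}
print(conv_sentence_encoder(W, kernels).shape)  # (6,)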
Recurrent Document Encoder  At the document level, a recurrent neural network composes a sequence of sentence vectors into a document vector. Note that this is a somewhat simplistic attempt at capturing document organization at the level of sentence-to-sentence transitions. One might view the hidden states of the recurrent neural network as a list of partial representations, with each focusing mostly on the corresponding input sentence given the previous context. These representations altogether constitute the document representation, which captures local and global sentential information with minimum compression.

The RNN we used has a Long Short-Term Memory (LSTM) activation unit for ameliorating the vanishing gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997).
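A minimal document encoder along these lines is sketched below using standard LSTM gating over pre-computed sentence vectors; the parameters are randomly initialized for illustration and the exact parameterization used in the paper may differ.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMDocumentEncoder:
    # Composes a sequence of sentence vectors into hidden states h_1 ... h_m;
    # the list of states serves as the document representation described above.

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and bias per gate, acting on [h_{t-1}; x_t].
        self.W = {g: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                  for g in ("i", "f", "o", "c")}
        self.b = {g: np.zeros(hidden_dim) for g in ("i", "f", "o", "c")}
        self.hidden_dim = hidden_dim

    def encode(self, sentence_vectors):
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        states = []
        for x in sentence_vectors:
            z = np.concatenate([h, x])
            i = sigmoid(self.W["i"] @ z + self.b["i"])      # input gate
            f = sigmoid(self.W["f"] @ z + self.b["f"])      # forget gate
            o = sigmoid(self.W["o"] @ z + self.b["o"])      # output gate
            c_tilde = np.tanh(self.W["c"] @ z + self.b["c"])
            c = f * c + i * c_tilde
            h = o * np.tanh(c)
            states.append(h)
        return states  # one partial document representation per sentence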
Figure 3: Neural attention mechanism for word extraction.
Figure 4: Visualization of the summaries for a DailyMail article. The top half shows the relative attention weights given by the sentence extraction model. Darkness indicates sentence importance. The lower half shows the summary generated by the word extraction model.
to NN-ABS which deals with an open vocabulary. The extraction-based generation approach is more robust for proper nouns and rare words, which pose a serious problem to open vocabulary models. An example of the summaries generated by NN-WE is shown in the lower half of Figure 4.

Table 1 (lower half) shows system results on the 500 DailyMail news articles (test set). In general, we observe similar trends to DUC 2002, with NN-SE performing the best in terms of all ROUGE metrics. Note that scores here are generally lower compared to DUC 2002. This is due to the fact that the gold standard summaries (aka highlights) tend to be more laconic and as a result involve a substantial amount of paraphrasing. More experimental results on this dataset are provided in the appendix.

The results of our human evaluation study are shown in Table 2. Specifically, we show, proportionally, how often our participants ranked each system 1st, 2nd, and so on. Perhaps unsurprisingly, the human-written descriptions were considered best and ranked 1st 27% of the time, although they were closely followed by our NN-SE model, which was ranked 1st 22% of the time. The ILP system was mostly ranked in 2nd place (38% of the time). The rest of the systems occupied lower ranks. We further converted the ranks to ratings on a scale of 1 to 6 (assigning ratings 6 through 1 to rank placements 1 through 6). This allowed us to perform an Analysis of Variance (ANOVA), which revealed a reliable effect of system type. Specifically, post-hoc Tukey tests showed that NN-SE and ILP are significantly (p < 0.01) better than LEAD, NN-WE, and NN-ABS but do not differ significantly from each other or from the human gold standard.

7 Conclusions

In this work we presented a data-driven summarization framework based on an encoder-extractor architecture. We developed two classes of models based on sentence and word extraction. Our models can be trained on large-scale datasets and learn informativeness features based on continuous representations without recourse to linguistic annotations. Two important ideas behind our work are the creation of hierarchical neural structures that reflect the nature of the summarization task and generation by extraction. The latter effectively enables us to sidestep the difficulties of generating under a large vocabulary, essentially covering the entire dataset, with many low-frequency words and named entities.

Directions for future work are many and varied. One way to improve the word-based model would be to take structural information into account during generation, e.g., by combining it with a tree-based algorithm (Cohn and Lapata, 2009). It would also be interesting to apply the neural models presented here in a phrase-based setting similar to Lebret et al. (2015). A third direction would be to adopt an information-theoretic perspective and devise a purely unsupervised approach that selects summary sentences and words so as to minimize information loss, a task possibly achievable with the dataset created in this work.
DM 75b     ROUGE-1   ROUGE-2   ROUGE-L
LEAD        21.9       7.2      11.6
NN-SE       22.7       8.5      12.5
NN-WE       16.0       6.4      10.2

DM 275b    ROUGE-1   ROUGE-2   ROUGE-L
LEAD        40.5      14.9      32.6
NN-SE       42.2      17.3      34.8
NN-WE       33.9      10.2      23.5

DM full    ROUGE-1   ROUGE-2   ROUGE-L
LEAD        53.5      21.7      48.5
NN-SE       56.0      24.9      50.2
NN-WE         -         -         -

Table 3: ROUGE evaluation (%) on the entire 500 DailyMail samples, with different length limits.

Acknowledgments

We would like to thank three anonymous reviewers and members of the ILCC at the School of Informatics for their valuable feedback. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

8 Appendix

In addition to the DUC 2002 and 500 DailyMail samples, we report results on the entire DailyMail test set (Table 3). Since there is no established evaluation standard for this task, we experimented with three different ROUGE limits: 75 bytes, 275 bytes, and full length.
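Because the paper does not spell out the evaluation tooling, the sketch below is a deliberately simplified, toy ROUGE-N recall with an optional byte limit, meant only to make the three settings (75 bytes, 275 bytes, full length) concrete; it is not the official ROUGE package used to produce the reported numbers, and the variable names in the usage comment are illustrative.

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n, byte_limit=None):
    # Fraction of reference n-grams (with clipped counts) that also appear in
    # the candidate summary, optionally truncated to byte_limit bytes.
    if byte_limit is not None:
        candidate = candidate.encode("utf-8")[:byte_limit].decode("utf-8", errors="ignore")
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# e.g. rouge_n_recall(system_summary, highlight, n=1, byte_limit=75)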
References

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015, San Diego, California.
[Banko et al.2000] Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th ACL, pages 318–325, Hong Kong.
[Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, pages 1171–1179. Curran Associates, Inc.
[Chelba et al.2014] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
[Cohn and Lapata2009] Trevor Anthony Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research, pages 637–674.
[Conroy and O’Leary2001] John M. Conroy and Dianne P. O’Leary. 2001. Text summarization via hidden Markov models. In Proceedings of the 34th Annual ACL SIGIR, pages 406–407, New Orleans, Louisiana.
[Erkan and Radev2004] Güneş Erkan and Dragomir R. Radev. 2004. LexPageRank: Prestige in multi-document text summarization. In Proceedings of the 2004 EMNLP, pages 365–371, Barcelona, Spain.
[Filatova and Hatzivassiloglou2004] Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Stan Szpakowicz and Marie-Francine Moens, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 104–111, Barcelona, Spain.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th ACL, Berlin, Germany. To appear.
[Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
[Gulcehre et al.2016] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th ACL, Berlin, Germany. To appear.
[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1684–1692. Curran Associates, Inc.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, Sofia, Bulgaria.
[Kim et al.2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the 30th AAAI, Phoenix, Arizona. To appear.
[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 EMNLP, pages 1746–1751, Doha, Qatar.
[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kobayashi et al.2015] Hayato Kobayashi, Masaki Noguchi, and Taichi Yatsuka. 2015. Summarization based on embedding distributions. In Proceedings of the 2015 EMNLP, pages 1984–1989, Lisbon, Portugal.
[Kupiec et al.1995] Julian Kupiec, Jan O. Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR, pages 68–73, Seattle, Washington.
[Lebret et al.2015] Rémi Lebret, Pedro O. Pinheiro, and Ronan Collobert. 2015. Phrase-based image captioning. In Proceedings of the 32nd ICML, Lille, France.
[Lin and Hovy2003] Chin-Yew Lin and Eduard H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, pages 71–78, Edmonton, Canada.
[Mihalcea2005] Rada Mihalcea. 2005. Language independent extractive summarization. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 49–52, Ann Arbor, Michigan.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
[Nenkova et al.2006] Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization. In Proceedings of the 29th Annual ACM SIGIR, pages 573–580, Seattle, Washington.
[Och2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st ACL, pages 160–167, Sapporo, Japan.
[Radev et al.2004] Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, et al. 2004. MEAD - a platform for multidocument multilingual text summarization. Technical report, Columbia University Academic Commons.
[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 EMNLP, pages 379–389, Lisbon, Portugal.
[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.
[Svore et al.2007] Krysta Svore, Lucy Vanderwende, and Christopher Burges. 2007. Enhancing single-document summarization by combining RankNet and third-party sources. In Proceedings of the 2007 EMNLP-CoNLL, pages 448–457, Prague, Czech Republic.
[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, pages 2674–2682. Curran Associates, Inc.
[Wan2010] Xiaojun Wan. 2010. Towards a unified approach to simultaneous single-document and multi-document summarizations. In Proceedings of the 23rd COLING, pages 1137–1145.
[Williams1992] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
[Woodsend and Lapata2010] Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th ACL, pages 565–574, Uppsala, Sweden.
[Yogatama et al.2015] Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the 2015 EMNLP, pages 1961–1966, Lisbon, Portugal.
[Zhang and Lapata2014] Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 EMNLP, pages 670–680, Doha, Qatar.