NSTM: Real-Time Query-Driven News Overview Composition at Bloomberg
Joshua Bambrick¹, Minjie Xu¹, Andy Almonte¹, Igor Malioutov¹,
Guim Perarnau¹, Vittorio Selo¹, Iat Chong Chan²,*

¹Bloomberg, London, United Kingdom
²OakNorth, London, United Kingdom

¹{jbambrick7,mxu161,aalmonte2,imalioutov,gperarnau,vselo}@bloomberg.net
²[email protected]
Abstract

insurmountable challenge. For example, a reader searching on Bloomberg's system for news about the U.K. would find 10,000 articles on a typical day. Apple Inc., the world's most journalistically covered company, garners around 1,800 news articles a day.

We realized that a new kind of summarization engine was needed, one that would condense large volumes of news into short, easy to absorb points. The system would filter out noise and duplicates to identify and summarize key news about companies, countries or markets. When given a user query, Bloomberg's solution, Key News Themes (or NSTM), leverages state-of-the-art semantic clustering techniques and novel summarization methods to produce comprehensive, yet concise, digests to dramatically simplify the news consumption process. NSTM is available to hundreds of thousands of readers around the world and serves thousands of requests daily with sub-second latency. At ACL 2020, we will present a demo of NSTM.

1 Introduction

With a traditional system, they would search for news on the company and wade through many stories (307 in this case¹), often with duplicate information or unhelpful headlines, to slowly build up a full picture of what the key events were.

By contrast, using NSTM (Key News Themes), this same analyst can search for 'Amazon.com', over a given time horizon, and promptly receive a concise and comprehensive overview of the news, as shown in Fig. 1. We tackle the challenges involved with consuming vast quantities of news by leveraging modern techniques to semantically cluster stories, as well as innovative summarization methods to extract succinct, informational summaries for each cluster. A handful of key stories are then selected from each cluster. We define a (story cluster, summary, key stories) triple as one theme and an ordered list of themes as an overview.

NSTM works at web scale but responds to arbitrary user queries with sub-second latency. It is deployed to hundreds of thousands of users around the globe and serves thousands of requests per day.
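To make the theme and overview definitions above concrete, the following is a minimal, purely illustrative sketch of the data model; the class and field names are our own assumptions, not NSTM's internal schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the data model described above; names are our own,
# not NSTM's production schema.

@dataclass
class Story:
    story_id: str
    headline: str
    body: str

@dataclass
class Theme:
    cluster: List[Story]      # the full story cluster
    summary: str              # concise (<= 50 character) summary of the cluster
    key_stories: List[Story]  # a few representative stories from the cluster

# An overview is an ordered (ranked) list of themes for a given query.
Overview = List[Theme]
```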
Figure 1: A query-based UI for NSTM showing two themes, with annotations highlighting the key stories, feedback buttons, source and publication date. The un-cropped screenshot is in Appendix C.
(up to 50 characters, or roughly 6 tokens) summary for each cluster. It needs to be short enough to be understandable to humans with a single glance, but also rich enough to retain critical details from a minimal 'who-does-what' stub, so the most popular noun phrase or entity alone will not suffice. Such conciseness also helps when screen space is limited (for context-driven applications or mobile devices).

From each cluster, NSTM must surface a few key stories to provide a sample of its contents. The clusters themselves should also be ranked to highlight the most important few in limited screen space.

Finally, the system must be fast. It may only take up to a few seconds for the slowest queries.

Main technical challenges: 1) There is no public dataset corresponding to this overview composition problem with all the requirements set above, so we were required to either define new (sub-)tasks and collect new annotations, or select techniques by intuition, implement them, and iterate on feedback; 2) Generating summaries which are simultaneously accurate, informational, fluent, and highly concise necessitates careful and innovative choices of summarization techniques; 3) Supporting arbitrary user searches in real-time places significant performance requirements on the system whilst also setting a high bar for its robustness.

3 Related Work

A comparable system is Google News' 'Full Coverage' feature², which groups stories from different sources, akin to our clustering approach. However, it doesn't offer summarization and its clustered view is unavailable for arbitrary search queries.

SUMMA (Liepins et al., 2017) is another comparable system which integrates a variety of NLP components and provides support for numerous media and languages, to simultaneously monitor several media broadcasts. SUMMA applies the online clustering algorithm by Aggarwal and Yu (2006) and the extractive summarization algorithm by Almeida and Martins (2013). In contrast to NSTM, SUMMA focuses on scenarios with continuous multimedia and multilingual data streams and produces much longer summaries.

² https://www.blog.google/products/news/new-google-news-ai-meets-human-intelligence/

4 Approach

4.1 Architecture

The functionality of NSTM can be formulated as: given a search query, generate a ranked list (overview) of the key themes, or (news cluster, summary, key stories) triples, that concisely represent the most important matching news events.

Fig. 2 depicts the system's architecture. The story ingestion service processes millions of published news stories each day, stores them in a search index, and applies online clustering to them. When a search query is submitted via a user interface (① in the diagram), the overview composition service retrieves matching stories and their associated online cluster IDs from the search index (②). The system then further clusters the retrieved online clusters into the final clusters, each corresponding to one theme (③). For each such cluster, the system extracts a concise summary and a handful of key stories to reflect the cluster's contents (④). This creates a set of themes, which NSTM ranks to create the final overview. Lastly, the system caches the overview for a limited time to support future reuse (⑤) before returning it to the UI (⑥).

Figure 2: The architecture of NSTM, comprising the story ingestion service (which indexes and clusters the real-time stream of news stories into the search index), the overview composition service, the cache, and the user interface. The digits indicate the order of execution whenever a new request is made.

4.2 News Search

The first step in the NSTM pipeline is to retrieve relevant news stories (① in Fig. 2), for which we leverage a customized in-house news search engine based on Apache Solr.³

³ http://lucene.apache.org/solr/

This supports searches based on keywords, metadata (such as news source
and time of ingestion), and tags generated during ingestion (such as topics, regions, securities, and people). For example, TOPIC:ECOM AND NOT COMPANY:AMZN⁴ will retrieve all news about 'E-commerce' but exclude Amazon.com.

⁴ This is Bloomberg's internal news search query syntax, which maps closely to the final query submitted to Solr.

NSTM uses Solr's facet functionality to surface the largest k online clusters (detailed in Sec. 4.3.2) in the search results, before returning n stories from each. This tiered approach offers better coverage and scalability than direct story retrieval.

4.3 Clustering

4.3.1 News Embedding and Similarity

At the core of any clustering system is a similarity metric. In NSTM, we define the similarity between two articles as the cosine similarity between their embeddings as computed by NVDM (Miao et al., 2016), i.e., τ(d₁, d₂) = 0.5(cos(z₁, z₂) + 1), where z ∈ R^n denotes the NVDM embedding. Our choice is motivated by two observations: 1) The generative model of NVDM is based on bag-of-words (BoW) and P(w|z) = σ(W^⊤ z), where σ is the softmax function, W ∈ R^{n×V} is the word embedding matrix in the decoder and V is the size of the vocabulary. This resembles the latent topic structure popularized by LDA (Blei et al., 2003), which has proven effective in capturing textual semantics. Additionally, the use of cosine similarities is naturally motivated by the fact that the generative model is directly defined by the dot-product between the story embedding (z) and a shared vocabulary embedding (W). 2) NVDM's Variational Autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) framework makes the inference procedure much simpler than LDA and it also supports decoder customizations. For example, it allows us to easily integrate the idea of introducing a learnable common background word distribution into the generative model (Arora et al., 2017). We trained the model on an internal corpus of 1.85M news articles, using a vocabulary of size about 200k and a latent dimension n of 128.

4.3.2 Clustering Stages

We divide clustering into two stages in the pipeline, 1) online incremental clustering at story ingestion time, and 2) hierarchical agglomerative clustering (HAC) at query time (③ in Fig. 2). The former is used to produce query-agnostic online clusters at a relatively low cost to handle the daily influx of millions of news stories. These clusters reduce the computational cost at query time. However, due to its online nature, over-fragmentation, among other quality issues, occurs in the resulting clusters. This necessitates further refinement at query time when an offline HAC step is performed on top of the retrieved online clusters. A similar, but more complicated, design was adopted in Vadrevu et al. (2011) for clustering real-time news search results.

At both stages, we compute the cluster embedding z_c ∈ R^n as the mean of all the story embeddings therein, and evaluate similarities between clusters (individual stories are taken as singleton clusters) using the metric τ defined in Sec. 4.3.1.

For online clustering, we apply an in-house implementation which uses a distributed pool of workers to reduce latency and increase throughput. It merges each incoming story with the closest cluster if the similarity is within a parameterized threshold and otherwise creates a new singleton cluster.

For HAC, we apply fastcluster⁵ (Müllner, 2013) to construct the dendrogram. We use complete linkage to encourage more congruent clusters and then form flat clusters by cutting the dendrogram at the same (height) threshold. To further reduce fragmentation where similar clusters are left un-clustered, we apply HAC twice recursively.

⁵ https://www.jstatsoft.org/article/view/v053i09

To find a reasonable similarity threshold, we manually annotated just over 1k pairs of news articles. Each annotator indicated whether they would expect to see the articles grouped together or not in an overview. We then selected the threshold which achieved the highest F1 score on this binary classification task, which was 0.86.

4.4 Summary Extraction

Clustering search results (Vadrevu et al., 2011) is a meaningful step towards creating a useful overview. With NSTM, we push this one step further by additionally generating a concise, yet still human-readable, summary for each cluster (④ in Fig. 2).

Due to the unique style of the summary explained in Sec. 2, the scarcity of training data makes it hard to train an end-to-end seq2seq (Sutskever et al., 2014) model, as is typical for abstractive summarization. Also, this technique would only offer limited control over the output. Hence, we opt for an extractive method, leveraging OpenIE (Banko et al., 2007) and a BERT-based (Devlin et al., 2019) sentence compressor (both illustrated in Fig. 3) to surface a pool of sub-sentence-level candidate summaries from the headline and the body, which are then scored by a ranker.

4.4.2 BERT-based Sentence Compression

In addition to the rule-based OpenIE system, we apply a Transfer Learning-based solution, using a novel in-house dataset specific to our sub-task. In particular, we model candidate summary extraction as a 'sentence compression' task (Filippova et al., 2015), where each story is split into sentences and tokens are classified as keep or delete to make each sentence shorter, while retaining the key message.

We oversaw the manual annotation of a dataset which maps sentences to compressed equivalents that correspond to summaries. When presented with a news story, annotators selected one sentence and deleted words to create a high quality summary. This rendered 10k annotations which we randomly partitioned into train (80%) and test (20%) sets.

The task is formulated as sequence tagging, whereby each sub-token (① in Fig. 3), defined using the BERT vocabulary, is classified as keep or delete (②). We implement this using a feedforward layer on top of a Bloomberg-internal pre-trained neural network, akin to the uncased English BERT-Base model, applying an adapted implementation.

To create a compression, we stitch sub-tokens labelled keep together (③). Lastly, we use postprocessing rules to improve formatting (④), such as titlecasing and fixing partial-entity deletion (where only some sub-tokens of a token/entity are deleted).
favorably than c′ for a given common article a, and the model s_θ(a, c) was then trained to match such preferences using pairwise margin loss, i.e., max(0, 1 − s_θ(a, c) + s_θ(a, c′)).

We considered a few models, including a parameter-free baseline which scores candidate-article pairs as the dot-product of their NVDM (Sec. 4.3.1) embeddings, i.e., s = z_a^⊤ z_c. We also considered this model's bilinear extension s = z_a^⊤ W z_c, where W is the learnable weight matrix. Lastly, we tried neural network models, such as DecAtt (Parikh et al., 2016). We evaluated these models on a held-out test set with metrics such as pairwise ranking accuracy and NDCG. We opted to productionize the baseline model, since it was the simplest and performed on par with the others.⁷

⁷ E.g., with NDCG@5, the (untrained) NVDM dot-product yields 0.61, while the bilinear model and DecAtt yield 0.64.

Because NVDM uses a bag-of-words model, this ranker ignores syntax entirely. We believe that its empirical success owes to both the well-formedness of the majority of the candidates and the averaging effect that amplifies the 'signal-noise ratio' when the scores are averaged over the cluster.

Empirically, this approach tends to surface 'informational' summaries, in contrast to headlines which are often 'sensational'. We posit that this is because high-ranked summaries must also be representative of story bodies, not just headlines.

4.4.4 Combining Summary Candidates

OpenIE and sentence compression offer distinct ways to extract candidates, and we experimented with each as the sole source of summary candidates in our pipeline. On the basis of ROUGE scores (Lin and Hovy, 2003; Lin, 2004) (details in Appendix B), the latter provides superior results.

However, in a production system which informs business decisions, we must consider factors which aren't readily captured by metrics which compare generated and 'gold' outputs. For example, changing a single word can reverse the meaning of a summary, with only a small change in such scores. Hence, we consider a range of pros and cons.

The sentence compression method is supervised and is trained to produce summaries which can take advantage of news-specific grammatical styles. However, the OpenIE system is much faster and offers greater interpretability and controllability.

Since the neural and symbolic systems provide different advantages, we apply both. This renders a diverse pool of candidate summaries from which the ranker's task is to select the best. At the pooling stage we also impose a length constraint of 50 characters and exclude any longer candidates.

4.5 Key Story Selection

As a sample from the full story cluster, NSTM selects an ordered list of key stories which are deemed to be representative. We select these using a heuristic based on intuition and client feedback.

Our approach is to re-cluster all stories in the cluster using HAC (see Sec. 4.3.2), to create a parameterized number of sub-clusters. For each sub-cluster, we select the story that has maximum average similarity τ (as per Sec. 4.3.1) to the other sub-cluster stories. This strategy is intended to select stories which represent each cluster's diversity.

We sort the key stories by sub-cluster size and time of ingestion, in that order of precedence.
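To make this heuristic concrete, the following is a minimal sketch of the sub-cluster-and-pick-representative step. It assumes pre-computed, L2-normalised story embeddings and the similarity from Sec. 4.3.1, and it uses SciPy's standard HAC routines rather than NSTM's internal implementation; the function name and parameters are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def select_key_stories(embeddings: np.ndarray, max_key_stories: int = 3) -> list[int]:
    """Illustrative sketch: pick one representative story per sub-cluster.

    `embeddings` holds one L2-normalised embedding per story in the cluster.
    We re-cluster with complete-linkage HAC, then, within each sub-cluster,
    choose the story with maximum average similarity to the other members.
    """
    n = len(embeddings)
    if n <= max_key_stories:
        return list(range(n))

    # Similarity as in Sec. 4.3.1: tau = 0.5 * (cos + 1), converted to a distance.
    sims = 0.5 * (embeddings @ embeddings.T + 1.0)
    dists = 1.0 - sims[np.triu_indices(n, k=1)]  # condensed distance matrix

    dendrogram = linkage(dists, method="complete")
    labels = fcluster(dendrogram, t=max_key_stories, criterion="maxclust")

    picks = []
    for label in np.unique(labels):
        members = np.where(labels == label)[0]
        # Average similarity of each member to the others in its sub-cluster
        # (subtracting the self-similarity of 1.0 on the diagonal).
        avg_sim = (sims[np.ix_(members, members)].sum(axis=1) - 1.0) / max(len(members) - 1, 1)
        picks.append((len(members), int(members[np.argmax(avg_sim)])))

    # Order representatives by sub-cluster size, largest first.
    return [idx for _, idx in sorted(picks, reverse=True)]
```

This sketch only sorts by sub-cluster size; the production heuristic additionally breaks ties by time of ingestion, as described above.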
4.6 Theme Ranking

We have described how (story cluster, summary, key stories) triples, or themes, are created. However, some themes are considered to be more important than others since they are more useful to readers. It is tricky to define this concept concretely but we apply proxy metrics in order to estimate an importance score for each theme. We rank themes by this score and, in order to save screen space, return only the top few ('key') themes as an overview.

The main factor considered in the importance score is the size of the story cluster – the larger the cluster, the larger the score. This heuristic corresponds to the observation that more important themes tend to be reported on more frequently. Additionally, we consider the entropy of the news sources in the cluster, which corresponds to the observation that more important themes are reported on by a larger number of publishers and reduces the impact of a source publishing duplicate stories.

4.7 Caching

Since many user requests are the same or use similar data, caching is useful to minimize response times. When NSTM receives a request, it checks whether there is a corresponding overview in the cache, and immediately returns it if so. 99.6% of requests hit the cache and 99% of requests are handled within 215ms.⁸ In the event of a cache miss, NSTM responds in a median time of 723ms.⁹

We apply two mechanisms to ensure cache freshness. Firstly, we preemptively invoke NSTM using requests that are likely to be queried by users (e.g., most read topics) and re-compose them from scratch at fixed intervals (e.g., every 30 min). Once computed, they are cached. The second mechanism is user-driven: every time a user requests an overview which is not cached, it will be created and added to the cache. The system will subsequently preemptively invoke NSTM using this request for a fixed period of time (e.g., 24 hours).

⁸ Computed for all requests over a 90-day period.
⁹ Computed for the top 50 searches over a 7-day period.

5 Demonstration

NSTM was deployed to our clients in 2019. Using the UI depicted in Fig. 1, users can find overviews for customized queries to help support their work. From this screen, the user can enter a search query using any combination of Boolean logic with tag- or keyword-based terms. They may also alter the period that the overview is calculated over (this UI offers 1 hour, 8 hour, 1 day, and 2 day options).

This interface also allows users to provide feedback via the 'thumb' icons or plain-text comments. Of several hundred per-overview feedback submissions, over three quarters have been positive.

Tables 1 and 2 show example theme summaries generated for the queries 'Facebook' and 'U.K.'. Note that the summaries are quite different from what has previously been studied by the NLP community (in terms of brevity and grammatical style) and that they accurately represent distinct events.

   Summary                                            Size
1  Facebook to Settle Recognition Privacy Lawsuit       90
2  Facebook Warns Revenue Growth Slowing                79
3  Facebook Stock Drops 7% Despite Earnings Beat        70
4  Facebook to Remove Coronavirus Misinformation        49
5  Mark Zuckerberg to Launch WhatsApp Payments          19

Table 1: Ranked theme summaries and cluster sizes for 'Facebook' (1,176 matching stories) from 31 Jan. 2020.

   Summary                                            Size
1  Britain to Leave the EU                             459
2  Bank of England Would Keep Interest Rate Unchanged  141
3  Sturgeon Demands Scottish Independence Vote          71
4  Pompeo in UK for Trade Talks                         45
5  Boris Johnson Hails 'Beginning' on Brexit Day        63

Table 2: Ranked theme summaries and cluster sizes for 'U.K.' (13,858 matching stories) from 31 Jan. 2020.

In addition to user-driven settings, NSTM can be used to supplement context-driven applications. One example, demonstrated in Appendix D, uses themes provided by NSTM to help explain why companies or topics are 'trending'.

6 Conclusion

We presented NSTM, a novel and production-ready system that composes concise and human-readable news overviews given arbitrary user search queries. NSTM is the first of its kind; it is query-driven, it offers unique news overviews which leverage clustering and succinct summarization, and it has been released to hundreds of thousands of users. We also demonstrated effective adoption of modern NLP techniques and advances in the design and implementation of the system, which we believe will be of interest to the community.

There are many open questions which we intend to research, such as whether autoregressivity in neural sentence compression can be exploited and how to compose themes over longer time periods.
References

Charu C. Aggarwal and Philip S. Yu. 2006. A framework for clustering massive text and categorical data streams. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 479–483. SIAM.

Miguel Almeida and André Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 196–206, Sofia, Bulgaria. Association for Computational Linguistics.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations, ICLR'17. OpenReview.net.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2670–2676, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368, Lisbon, Portugal. Association for Computational Linguistics.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Renars Liepins, Ulrich Germann, Guntis Barzdins, Alexandra Birch, Steve Renals, Susanne Weber, Peggy van der Kreeft, Hervé Bourlard, João Prieto, Ondřej Klejch, Peter Bell, Alexandros Lazaridis, Alfonso Mendes, Sebastian Riedel, Mariana S. C. Almeida, Pedro Balage, Shay B. Cohen, Tomasz Dwojak, Philip N. Garner, Andreas Giefer, Marcin Junczys-Dowmunt, Hina Imran, David Nogueira, Ahmed Ali, Sebastião Miranda, Andrei Popescu-Belis, Lesly Miculicich Werlen, Nikos Papasarantopoulos, Abiola Obamuyide, Clive Jones, Fahim Dalvi, Andreas Vlachos, Yang Wang, Sibo Tong, Rico Sennrich, Nikolaos Pappas, Shashi Narayan, Marco Damonte, Nadir Durrani, Sameer Khurana, Ahmed Abdelali, Hassan Sajjad, Stephan Vogel, David Sheppey, Chris Hernon, and Jeff Mitchell. 2017. The SUMMA platform prototype. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 116–119, Valencia, Spain. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 1727–1736. JMLR.org.

Daniel Müllner. 2013. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, Articles, 53(9):1–18.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278–1286.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, and Zhaohui Zheng. 2011. Scalable clustering of news search results. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM'11, pages 675–684, New York, NY, USA. ACM.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal Decompositional Semantics on Universal Dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.
A Acknowledgements

This has been a multi-year project, involving contributions from many people at different stages. In particular, we thank Miles Osborne, Marco Ponza, Amanda Stent, Mohamed Yahya, Christoph Teichmann, Prabhanjan Kambadur, Umut Topkara, Ted Merz, Sam Brody, and Adrian Benton for reviewing and commenting on the manuscript; we further thank Adela Quinones, Shaun Waters, Mark Dimont, Ted Merz and other colleagues from the News Product group for helping to shape the vision of the system; we also thank José Abarca and his team for developing the user interface; we thank Hady Elsahar for helping to improve summary ranking during his internship; finally, we thank all colleagues (especially those in the Global Data department) who helped to produce high quality in-house annotations and all others who contributed valuable thoughts and time to this work.
B End-To-End Evaluation

We evaluate the end-to-end NSTM system when using the OpenIE (Sec. 4.4.1) and the BERT-based sentence compression (Sec. 4.4.2) algorithms as the sole source of candidate summaries. We also conducted one experiment where both were used to create a shared pool of candidates (as per Sec. 4.4.4).

We test the system end-to-end using the manually-annotated Single Document Summarization (SDS) test set described in Sec. 4.4.2. To implement SDS, our experimental setup assumes that only one story was returned by a search request (as per Sec. 4.2). We evaluate the output from each system with ROUGE (Lin and Hovy, 2003; Lin, 2004)¹⁰. The results are presented in Table 3.

¹⁰ https://github.com/google/seq2seq/blob/master/seq2seq/metrics/rouge.py
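As a rough illustration of this evaluation loop (not the exact script used above, which relies on the seq2seq ROUGE implementation referenced in the footnote), the following sketch scores system outputs against the gold SDS compressions using the open-source rouge-score package; the file name and field names are assumptions.

```python
import json
from rouge_score import rouge_scorer

# Illustrative sketch of the scoring loop. We assume a JSONL file in which
# each record holds a gold SDS compression and a system output; the field
# names ("gold", "system") are our own, not NSTM's internal format.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
count = 0
with open("sds_test_predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # score(target, prediction): gold compression first, system output second.
        scores = scorer.score(record["gold"], record["system"])
        for name in totals:
            totals[name] += scores[name].fmeasure
        count += 1

for name, total in totals.items():
    print(f"{name} F1: {total / count:.3f}")
```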
C Screenshots of a Query-Driven User Interface
Figure 4: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for
the company ‘Amazon.com’.
Figure 5: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for
the topic ‘Electric Vehicles’.
Figure 6: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for
the region ‘Canada’.
Figure 7: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for
a complex query, including a keyword.
D Screenshots of a Context-Driven User Interface
Figure 8: Screenshot (taken on 29 January 2020) of a context-driven application of NSTM. In the ‘Security’ column
are the companies that have seen the largest increase in news readership over the last day. Each entry in the ‘News
Summary’ column is the summary of the top theme provided by NSTM for the adjacent company.
Figure 9: Screenshot (taken on 29 January 2020) of a context-driven application of NSTM. In the ‘News Topic’
column are the topics that have seen the largest volume of news readership over the past 8 hours. Each entry in the
‘News Summary’ column is the summary of the top theme provided by NSTM for the adjacent topic.