Sentence Fusion for Multidocument News Summarization
Regina Barzilay
Kathleen R. McKeown
A system that can produce informative summaries, highlighting common information found in
1. Introduction
Redundancy in large text collections, such as the Web, creates both problems and
opportunities for natural language systems. On the one hand, the presence of numer-
ous sources conveying the same information causes difficulties for end users of search
engines and news providers; they must read the same information over and over again.
On the other hand, redundancy can be exploited to identify important and accurate
information for applications such as summarization and question answering (Mani
and Bloedorn 1997; Radev and McKeown 1998; Radev, Prager, and Samn 2000; Clarke,
Cormack, and Lynam 2001; Dumais et al. 2002; Chu-Carroll et al. 2003). Clearly, it would
be highly desirable to have a mechanism that could identify common information
among multiple related documents and fuse it into a coherent text. In this article, we
present a method for sentence fusion that exploits redundancy to achieve this task in
the context of multidocument summarization.
A straightforward approach for approximating sentence fusion can be found in the
use of sentence extraction for multidocument summarization (Carbonell and Goldstein
1998; Radev, Jing, and Budzikowska 2000; Marcu and Gerber 2001; Lin and Hovy
2002). Once a system finds a set of sentences that convey similar information (e.g.,
by clustering), one of these sentences is selected to represent the set. This is a robust
approach that is always guaranteed to output a grammatical sentence. However, ex-
traction is only a coarse approximation of fusion. An extracted sentence may include
not only common information, but additional information specific to the article from
which it came, leading to source bias and aggravating fluency problems in the extracted
summary. Attempting to solve this problem by including more sentences to restore the
original context might lead to a verbose and repetitive summary.
Instead, we want a fine-grained approach that can identify only those pieces of
sentences that are common. Language generation offers an appealing approach to the
problem, but the use of generation in this context raises significant research challenges.
In particular, generation for sentence fusion must be able to operate in a domain-
independent fashion, scalable to handle a large variety of input documents with various
degrees of overlap. In the past, generation systems were developed for limited domains
and required a rich semantic representation as input. In contrast, for this task we require
text-to-text generation, the ability to produce a new text given a set of related texts as
1 In addition to MultiGen, Newsblaster utilizes another summarizer, DEMS (Schiffman, Nenkova, and
McKeown 2002), to summarize heterogeneous sets of articles.
our fusion algorithm and detail on its main steps: identification of common infor-
mation (Section 3.1), fusion lattice computation (Section 3.2), and lattice linearization
(Section 3.3). Evaluation results and their analysis are presented in Section 4. Analy-
sis of the system’s output reveals the capabilities and the weaknesses of our text-
to-text generation method and identifies interesting challenges that will require new
insights. An overview of related work and a discussion of future directions conclude
the article.
Sentence fusion is the central technique used within the MultiGen summarization
Figure 1
An example of MultiGen summary as shown in the Columbia Newsblaster Interface. Summary
phrases are followed by parenthetical numbers indicating their source articles. The last sentence
is extracted because it was repeated verbatim in several input articles.
Figure 2
MultiGen architecture.
length (Section 2.2). The selected groups are passed to the ordering component, which
Table 1
Theme with corresponding fusion sentence.
1. IDF Spokeswoman did not confirm this, but said the Palestinians fired an antitank missile at
a bulldozer.
2. The clash erupted when Palestinian militants fired machine guns and antitank missiles at a
bulldozer that was building an embankment in the area to better protect Israeli forces.
3. The army expressed “regret at the loss of innocent lives” but a senior commander said troops
had shot in self-defense after being fired at while using bulldozers to build a new embankment
at an army base in the area.
Fusion sentence: Palestinians fired an antitank missile at a bulldozer.
Once we filter out the themes that have a low rank, the next task is to order the selected
themes into coherent text. Our ordering strategy aims to capture chronological order
of the main events and ensure coherence. To implement this strategy in MultiGen, we
select for each theme the sentence which has the earliest publication time (theme time
stamp). To increase the coherence of the output text, we identify blocks of topically
related themes and then apply chronological ordering on blocks of themes using theme
time stamps (Barzilay, Elhadad, and McKeown 2002). These stages produce a sorted set
of themes which are passed as input to the sentence fusion component, described in the
next section.
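As a rough illustration of this ordering step, the sketch below assumes each theme is given as a list of (sentence, publication time) pairs and that topical blocks have already been identified; the function names, the data layout, and the within-block ordering are illustrative assumptions rather than MultiGen's actual interface.

    def theme_time_stamp(theme):
        # A theme's time stamp is the earliest publication time among its sentences.
        return min(pub_time for _, pub_time in theme)

    def order_themes(blocks):
        # Order blocks of topically related themes chronologically by their earliest
        # theme time stamp; within a block, themes are also ordered chronologically
        # (the within-block order is our assumption, not stated in the text).
        ordered = []
        for block in sorted(blocks, key=lambda b: min(theme_time_stamp(t) for t in b)):
            ordered.extend(sorted(block, key=theme_time_stamp))
        return ordered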
3. Sentence Fusion
2 Typically, Simfinder produces at least 20 themes given an average Newsblaster cluster of nine articles.
The length of a generated summary typically does not exceed seven sentences.
from the theme shown in Table 1 is (the, fired, antitank, at, a, bulldozer). Besides being ungrammatical, this intersection gives no indication of what event it describes. The inadequacy of the bag-of-words method for the fusion task demonstrates the need for a more linguistically motivated approach. At the other extreme, previous ap-
proaches (Radev and McKeown 1998) have demonstrated that this task is feasible when
a detailed semantic representation of the input sentences is available. However, these
approaches operate in a limited domain (e.g., terrorist events), where information ex-
traction systems can be used to interpret the source text. The task of mapping input text
into a semantic representation in a domain-independent setting extends well beyond
the ability of current analysis methods. These considerations suggest that we need a
new method for the sentence fusion task. Ideally, such a method would not require a
Content selection occurs primarily in the first phase, in which our algorithm uses local
alignment across pairs of parsed sentences, from which we select fragments to be
included in the fusion sentence. Instead of examining all possible ways to combine these
fragments, we select a sentence in the input which contains most of the fragments and
transform its parsed tree into the fusion lattice by eliminating nonessential information
and augmenting it with information from other input sentences. This construction of the
fusion lattice targets content selection, but in the process, alternative verbalizations are
selected, and thus some aspects of realization are also carried out in this phase. Finally,
we generate a sentence from this representation based on a language model derived
from a large body of texts.
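Read as a pipeline, the three phases might be organized as in the following stub; the function names are placeholders for the steps detailed in Sections 3.1-3.3, not the system's actual code.

    def sentence_fusion(theme_sentences, paraphrase_dict, language_model):
        # Phase 1 (Section 3.1): parse each sentence and align the dependency trees pairwise.
        trees = [parse_dependency_tree(s) for s in theme_sentences]
        alignments = {(i, j): align(trees[i], trees[j], paraphrase_dict)
                      for i in range(len(trees)) for j in range(i + 1, len(trees))}
        # Phase 2 (Section 3.2): choose the centroid sentence as the basis tree, prune
        # nonessential subtrees, and augment it with subtrees and alternative
        # verbalizations from the other theme sentences.
        basis = select_centroid(trees, alignments)
        lattice = build_fusion_lattice(basis, trees, alignments)
        # Phase 3 (Section 3.3): linearize the lattice, ranking candidate sentences
        # with a language model derived from a large news corpus.
        return linearize(lattice, language_model)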
3 Two paraphrasing sentences which differ in word order may have significantly different trees in
phrase-based format. For instance, this phenomenon occurs when an adverbial is moved from a position
in the middle of a sentence to the beginning of a sentence. In contrast, dependency representations of
these sentences are very similar.
3.1.2 Alignment. Our alignment of dependency trees is driven by two sources of in-
formation: the similarity between the structure of the dependency trees and the similar-
ity between lexical items. In determining the structural similarity between two trees, we
take into account the types of edges (which indicate the relationships between nodes).
An edge is labeled by the syntactic function of the two nodes it connects (e.g., subject–
verb). It is unlikely that an edge connecting a subject and verb in one sentence, for
example, corresponds to an edge connecting a verb and an adjective in another sentence.
The word similarity measures take into account more than word identity: They
also identify pairs of paraphrases, using WordNet and a paraphrasing dictionary. We
automatically constructed the paraphrasing dictionary from a large comparable news
corpus using the co-training method described in Barzilay and McKeown (2001). The
dictionary contains pairs of word-level paraphrases as well as phrase-level para-
phrases.4 Several examples of automatically extracted paraphrases are given in Table 2.
During alignment, each pair of nonidentical words that do not comprise a synset in
Table 2
Lexical paraphrases extracted by the algorithm from the comparable news corpus.
(auto, automobile), (closing, settling), (rejected, does not accept), (military, army), (IWC,
International Whaling Commission), (Japan, country), (researching, examining), (harvesting,
killing), (mission-control office, control centers), (father, pastor), (past 50 years, four decades),
(Wangler, Wanger), (teacher, pastor), (fondling, groping), (Kalkilya, Qalqilya), (accused,
suspected), (language, terms), (head, president), (U.N., United Nations), (Islamabad, Kabul),
(goes, travels), (said, testified), (article, report), (chaos, upheaval), (Gore, Lieberman), (revolt,
uprising), (more restrictive local measures, stronger local regulations) (countries, nations),
(barred, suspended), (alert, warning), (declined, refused), (anthrax, infection), (expelled,
removed), (White House, White House spokesman Ari Fleischer), (gunmen, militants)
Figure 4
Tree alignment computation. In the first case two tops are aligned, while in the second case the
top of one tree is aligned with a child of another tree.
Given two trees T and T′ with root nodes v and v′, respectively, the similarity Sim(T, T′) between the trees is defined to be the maximum of the three expressions NodeCompare(T, T′), max_{s ∈ c(T)} Sim(T_s, T′), and max_{s′ ∈ c(T′)} Sim(T, T_{s′}). The upper part of Figure 4 depicts the computation of NodeCompare(T, T′), in which two top nodes are aligned with each other. The remaining expressions, max_{s ∈ c(T)} Sim(T_s, T′) and max_{s′ ∈ c(T′)} Sim(T, T_{s′}), capture mappings in which the top of one tree is aligned with one of the children of the top node of the other tree (the bottom of Figure 4).
The maximization in the NodeCompare formula searches for the best possible alignment for the child nodes of the given pair of nodes and is defined by

    NodeCompare(T, T′) = NodeSimilarity(v, v′) + max_{m ∈ M(c(T), c(T′))} Σ_{(s, s′) ∈ m} Sim(T_s, T_{s′})

where M(A, A′) is the set of all possible matchings between A and A′, and a matching (between A and A′) is a subset m of A × A′ such that for any two distinct elements (a, a′), (b, b′) ∈ m, both a ≠ b and a′ ≠ b′. In the base case, when one of the trees has depth one, NodeCompare(T, T′) is defined to be NodeSimilarity(v, v′).
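A small Python sketch of this recursion is given below. It assumes tree objects with top and children attributes and a node_similarity function (Section 3.1.2); the brute-force enumeration of child matchings and the memoization keyed on node pairs are our simplifications of the dynamic program described below, not the system's implementation.

    from itertools import combinations, permutations

    def sim(t1, t2, node_similarity, memo=None):
        # Sim(T, T'): the best of aligning the two tops (NodeCompare) or aligning
        # the top of one tree with a child of the other tree's top.
        memo = {} if memo is None else memo
        key = (id(t1), id(t2))
        if key not in memo:
            scores = [node_compare(t1, t2, node_similarity, memo)]
            scores += [sim(c, t2, node_similarity, memo) for c in t1.children]
            scores += [sim(t1, c, node_similarity, memo) for c in t2.children]
            memo[key] = max(scores)
        return memo[key]

    def node_compare(t1, t2, node_similarity, memo):
        # NodeCompare(T, T'): similarity of the two top nodes plus the best
        # one-to-one matching between their children subtrees.
        score = node_similarity(t1.top, t2.top)
        c1, c2 = t1.children, t2.children
        if not c1 or not c2:          # base case: one of the trees has depth one
            return score
        k = min(len(c1), len(c2))
        best = 0.0
        # Enumerate one-to-one child matchings; cheap because the branching
        # factor of a parse tree is small.
        for left in combinations(c1, k):
            for right in permutations(c2, k):
                best = max(best, sum(sim(a, b, node_similarity, memo)
                                     for a, b in zip(left, right)))
        return score + best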
The similarity score NodeSimilarity(v, v′) of atomic nodes depends on whether the
corresponding words are identical, paraphrases, or unrelated. The similarity scores for
pairs of identical words, pairs of synonyms, pairs of paraphrases, and edges (given in
Table 3) are manually derived using a small development corpus. While learning the similarity scores automatically is an appealing alternative, its application in the fu-
sion context is challenging because of the absence of a large training corpus and the lack
of an automatic evaluation function.5 The similarity of nodes containing flattened
subtrees,6 such as noun phrases, is computed as the score of their intersection nor-
malized by the length of the longest phrase. For instance, the similarity score of the
noun phrases antitank missile and machine gun and antitank missile is computed as a ratio
between the score of their intersection antitank missile (2), divided by the length of the
latter phrase (5).
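The node score itself could be computed along the following lines. The numeric constants stand in for the manually derived values in Table 3 and are placeholders only, as is the whitespace tokenization of flattened phrases.

    # Placeholder stand-ins for the manually tuned scores of Table 3.
    IDENTICAL, SYNONYM, PARAPHRASE = 1.0, 0.9, 0.8

    def word_score(w1, w2, synonyms, paraphrases):
        if w1 == w2:
            return IDENTICAL
        if frozenset((w1, w2)) in synonyms:        # WordNet synset membership
            return SYNONYM
        if frozenset((w1, w2)) in paraphrases:     # automatically built paraphrase dictionary
            return PARAPHRASE
        return 0.0

    def node_similarity(n1, n2, synonyms, paraphrases):
        # Flattened subtrees (e.g., noun phrases) are scored by their word
        # intersection, normalized by the length of the longer phrase.
        w1, w2 = n1.split(), n2.split()
        if len(w1) == 1 and len(w2) == 1:
            return word_score(n1, n2, synonyms, paraphrases)
        overlap = sum(IDENTICAL for w in set(w1) if w in w2)
        return overlap / max(len(w1), len(w2))

    # Example from the text: "antitank missile" vs. "machine gun and antitank missile"
    # gives 2 / 5.
    print(node_similarity("antitank missile", "machine gun and antitank missile", set(), set()))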
The similarity function Sim is computed using bottom-up dynamic programming,
in which the shortest subtrees are processed first. The alignment algorithm returns
the similarity score of the trees as well as the optimal mapping between the subtrees
of input trees. The pseudocode of this function is presented in the Appendix. In the
resulting tree mapping, the pairs of nodes whose NodeSimilarity positively contributed
to the alignment are considered parallel. Figure 5 shows two dependency trees and their
alignment.
As is evident from the Sim definition, we are considering only one-to-one node
“matchings”: Every node in one tree is mapped to at most one node in another tree. This
restriction is necessary because the problem of optimizing many-to-many alignments
5 Our preliminary experiments with n-gram-based overlap measures, such as BLEU (Papineni et al. 2002)
and ROUGE (Lin and Hovy 2003), show that these metrics do not correlate with human judgments on the
fusion task, when tested against two reference outputs. This is to be expected: As lexical variability across
input sentences grows, the number of possible ways to fuse them, by machine as well as by human, also
grows. The accuracy of match between the system output and the reference sentences largely depends on
the features of the input sentences, rather than on the underlying fusion method.
6 Pairs of phrases that form an entry in the paraphrasing dictionary are compared as pairs of atomic entries.
Table 3
Node and edge similarity scores used by the alignment algorithm.
is NP-hard.7 The subtree flattening performed during the preprocessing stage aims to
7 The complexity of our algorithm is polynomial in the number of nodes. Let n1 denote the number of nodes in the first tree, and n2 the number of nodes in the second tree. We assume that the branching factor of a parse tree is bounded above by a constant. The function NodeCompare is evaluated only once on each node pair; therefore, it is evaluated n1 × n2 times in total. Each evaluation is computed in constant time, assuming that the values of the function for the node's children are known. Since we use memoization, the total running time of the procedure is O(n1 × n2).
the extraneous subtrees. Alignment is essential for all the steps. The selection of the
basis tree is guided by the number of intersection subtrees it includes; in the best case,
it contains all such subtrees. The basis tree is the centroid of the input sentences—
the sentence which is the most similar to the other sentences in the input. Using the
alignment-based similarity score described in Section 3.1.2, we identify the centroid
by computing for each sentence the average similarity score between the sentence and
the rest of the input sentences, then selecting the sentence with the highest score.
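Under these definitions, centroid selection reduces to a few lines; sim below is the alignment-based similarity of Section 3.1.2, and the rest is a straightforward illustration rather than the system's code.

    def select_basis(trees, sim):
        # Return the index of the centroid: the tree with the highest average
        # similarity to all other trees in the theme.
        def avg_sim(i):
            scores = [sim(trees[i], trees[j]) for j in range(len(trees)) if j != i]
            return sum(scores) / len(scores)
        return max(range(len(trees)), key=avg_sim)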
Next, we augment the basis tree with information present in the other input
sentences. More specifically, we add alternative verbalizations for the nodes in the basis
tree and the intersection subtrees which are not part of the basis tree. The alternative
verbalizations are readily available from the pairwise alignments of the basis tree with
other trees in the input computed in the previous section. For each node of the basis tree,
we record all verbalizations from the nodes of the other input trees aligned with a given
node. A verbalization can be a single word, or it can be a phrase, if a node represents
a noun compound or a verb with a particle. An example of a fusion lattice, augmented
Figure 6
A basis lattice before and after augmentation. Solid lines represent aligned edges of the basis
tree. Dashed lines represent unaligned edges of the basis tree, and dotted lines represent
insertions from other theme sentences. Added subtrees correspond to sentences from Table 1.
with alternative verbalizations, is given in Figure 6. Even after this augmentation, the fu-
sion lattice may not include all of the intersection subtrees. The main difficulty in subtree
insertion is finding an acceptable placement; this is often determined by syntactic, se-
mantic, and idiosyncratic knowledge. Therefore, we follow a conservative insertion pol-
icy. Among all the possible aligned sentences, we insert only subtrees whose top node
aligns with one of the nodes in a basis tree.8 We further constrain the insertion procedure
by inserting only trees that appear in at least half of the sentences of a theme. These two
8 Our experimental results show that the algorithm inserts a sufficient number of new subtrees despite this
limitation.
constituent-level restrictions prevent the algorithm from generating overly long, un-
readable sentences.9
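Expressed as code, the insertion policy might look like the sketch below; the helper functions (candidate subtree enumeration, equivalence testing, attachment) and the data structures are hypothetical, introduced only to make the two constraints explicit.

    def insert_subtrees(basis, theme_trees, aligned_basis_node):
        # Insert a subtree only if (a) its top node is aligned with some node of
        # the basis tree and (b) it occurs in at least half of the theme sentences.
        n = len(theme_trees)
        for subtree in candidate_intersection_subtrees(theme_trees):
            anchor = aligned_basis_node(subtree.top)   # None if no basis node aligns
            support = sum(1 for t in theme_trees if contains_equivalent(t, subtree))
            if anchor is not None and support >= n / 2:
                attach(basis, anchor, subtree)
        return basis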
Finally, subtrees which are not part of the intersection are pruned off the basis
tree. However, removing all such subtrees may result in an ungrammatical or seman-
tically flawed sentence; for example, we might create a sentence without a subject.
This overpruning may happen if either the input to the fusion algorithm is noisy
or the alignment has failed to recognize similar subtrees. Therefore, we perform
a more conservative pruning, deleting only the self-contained components which
can be removed without leaving ungrammatical sentences. As previously observed
in the literature (Mani, Gates, and Bloedorn 1999; Jing and McKeown 2000), such com-
ponents include a clause in a clause conjunction, relative clauses, and some ele-
3.3 Generation
The final stage in sentence fusion is linearization of the fusion lattice. Sentence
generation includes selection of a tree traversal order, lexical choice among avail-
able alternatives, and placement of auxiliaries, such as determiners. Our generation
method utilizes information given in the input sentences to restrict the search space
and then chooses among remaining alternatives using a language model derived from
a large text collection. We first motivate the need for reordering and rephrasing, then
discuss our implementation.
For the word-ordering task, we do not have to consider all the possible travers-
als, since the number of valid traversals is limited by ordering constraints encoded
in the fusion lattice. However, the basis lattice does not uniquely determine the
ordering: The placement of trees inserted in the basis lattice from other theme sen-
tences is not restricted by the original basis tree. While the ordering of many sentence
constituents is determined by their syntactic roles, some constituents, such as time,
location and manner circumstantials, are free to move (Elhadad et al. 2001). Therefore,
the algorithm still has to select an appropriate order from among different orders of
the inserted trees.
The process so far produces a sentence that can be quite different from the ex-
tracted sentence; although the basis sentence provides guidance for the generation
process, constituents may be removed, added in, or reordered. Wording can also be
modified during this process. Although the selection of words and phrases which
appear in the basis tree is a safe choice, enriching the fusion sentence with alternative
verbalizations has several benefits. In applications such as summarization, in which
the length of the produced sentence is a factor, a shorter alternative is desirable. This
goal can be achieved by selecting the shortest paraphrase among available alternatives.
Alternate verbalizations can also be used to replace anaphoric expressions, for instance,
9 Furthermore, the preference for shorter fusion sentences is reinforced during the linearization stage
because our scoring function monotonically decreases with the length of a sentence.
when the basis tree contains a noun phrase with anaphoric expressions (e.g., his visit)
and one of the other verbalizations is anaphora-free. Substitution of the latter for the
anaphoric expression may increase the clarity of the produced sentence, since frequently
the antecedent of the anaphoric expression is not present in a summary. Moreover,
in some cases substitution is mandatory. As a result of subtree insertions and dele-
tions, the words used in the basis tree may not be a good choice after the transfor-
mations, and the best verbalization might be achieved by using a paraphrase of them
from another theme sentence. As an example, consider the case of two paraphras-
ing verbs with different subcategorization frames, such as tell and say. If the phrase
our correspondent is removed from the sentence Sharon told our correspondent that the
elections were delayed . . . , a replacement of the verb told with said yields a more readable
sentence.
The task of auxiliary placement is alleviated by the presence of features stored
in the input nodes. In most cases, aligned words stored in the same node have
the same feature values, which uniquely determine an auxiliary selection and con-
jugation. However, in some cases, aligned words have different grammatical
features, in which case the linearization algorithm needs to select among avail-
able alternatives.
Linearization of the fusion sentence involves the selection of the best phrasing
and placement of auxiliaries as well as the determination of optimal ordering. Since
we do not have sufficient semantic information to perform such selection, our algo-
rithm is driven by corpus-derived knowledge. We generate all possible sentences10
from the valid traversals of the fusion lattice and score their likelihood according to
statistics derived from a corpus. This approach, originally proposed by Knight and
Hatzivassiloglou (1995) and Langkilde and Knight (1998), is a standard method used
in statistical generation. We trained a trigram model with Good–Turing smoothing
over 60 megabytes of news articles collected by Newsblaster, using the second version of the CMU–Cambridge Statistical Language Modeling toolkit (Clarkson and Rosenfeld 1997).
The sentence with the lowest length-normalized entropy (the best score) is selected as
10 Because of the efficiency constraints imposed by Newsblaster, we sample only a subset of 20,000 paths.
The sample is selected randomly.
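A self-contained sketch of this scoring step is shown below. It substitutes a simple add-one-smoothed trigram model for the Good–Turing-smoothed model built with the CMU–Cambridge toolkit, so the numbers differ from the system's, but the selection criterion (lowest length-normalized entropy over the sampled traversals) is the same.

    import math
    from collections import Counter

    def train_trigram_lm(sentences):
        # Toy trigram model with add-one smoothing (the actual system used
        # Good-Turing smoothing via the CMU-Cambridge toolkit).
        tri, bi, vocab = Counter(), Counter(), set()
        for sent in sentences:
            toks = ["<s>", "<s>"] + sent.lower().split() + ["</s>"]
            vocab.update(toks)
            for i in range(2, len(toks)):
                tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
                bi[(toks[i - 2], toks[i - 1])] += 1
        V = len(vocab) + 1
        return lambda w, h1, h2: math.log2((tri[(h1, h2, w)] + 1) / (bi[(h1, h2)] + V))

    def normalized_entropy(sentence, logprob):
        toks = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        lp = sum(logprob(toks[i], toks[i - 2], toks[i - 1]) for i in range(2, len(toks)))
        return -lp / (len(toks) - 2)

    def best_linearization(candidates, logprob):
        # The traversal with the lowest length-normalized entropy is selected.
        return min(candidates, key=lambda s: normalized_entropy(s, logprob))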
Table 4
Alternative linearizations of the fusion lattice with corresponding entropy values.
Sentence Entropy
4.1 Methods
4.1.1 Construction of a Reference Sentence. We evaluated content selection by com-
paring an automatically generated sentence with a reference sentence. The reference
sentence was produced by a human (hereafter the RFA), who was instructed to gener-
ate a sentence conveying information common to many sentences in a theme. The RFA
was not familiar with the fusion algorithm. The RFA was provided with the list of
theme sentences; the original documents were not included. The instructions given to
the RFA included several examples of themes with fusion sentences generated by the
authors. Even though the RFA was not instructed to use phrases from input sentences,
the sentences presented as examples reused many phrases from the input sentences.
We believe that phrase reuse elucidates the connection between input sentences and
a resulting fusion sentence. Two examples of themes, reference sentences, and system
outputs are shown in Table 5.
4.1.2 Data Selection. We wanted to test the performance of the fusion component on
automatically computed inputs which reflect the accuracy of the existing preprocessing
tools. For this reason, the test data were selected randomly from material collected by
Newsblaster. To remove themes irrelevant for fusion evaluation, we introduced two
Table 5
Examples from the test set. Each example contains a theme, a reference sentence generated by
the RFA, and a sentence generated by the system. Subscripts in the system-generated sentence
represent a theme sentence from which a word was extracted.
additional filters. First, we excluded themes that contained identical or nearly identical
sentences (with cosine similarity higher than 0.8). When processing such sentences,
our algorithm reduces to sentence extraction, which does not allow us to evaluate the
generation abilities of our algorithm. Second, themes for which the RFA was unable to
create a reference sentence were also removed from the test set. As mentioned above,
Simfinder does not always produce accurate themes,12 and therefore, the RFA could
choose not to generate a reference sentence if the theme sentences had too little in
common. An example of a theme for which no sentence was generated is shown in
Table 6. As a result of this filtering, 34% of the sentences were removed.
12 To mitigate the effects of Simfinder noise in MultiGen, we induced a similarity threshold on input
trees—trees which are not similar to the basis tree are not used in the fusion process.
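The near-duplicate filter amounts to a pairwise cosine test over the theme's sentences; the bag-of-words representation and tokenization below are assumptions, while the 0.8 threshold is the one stated above.

    import math
    from collections import Counter

    def cosine(s1, s2):
        v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
        dot = sum(v1[w] * v2[w] for w in v1)
        norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def has_near_duplicates(theme_sentences, threshold=0.8):
        # Themes containing (nearly) identical sentence pairs are excluded, since
        # fusion on such input reduces to sentence extraction.
        return any(cosine(a, b) > threshold
                   for i, a in enumerate(theme_sentences)
                   for b in theme_sentences[i + 1:])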
Table 6
An example of noisy Simfinder output.
4.2 Data
To evaluate our sentence fusion algorithm, we selected 100 themes following the proce-
dure described in the previous section. Each set varied from three to seven sentences,
13 We were unable to develop a set of rules which works in most cases. Punctuation placement is
determined by a variety of features; considering all possible interactions of these features is hard. We
believe that corpus-based algorithms for automatic restoration of punctuation developed for speech
recognition applications (Beeferman, Berger, and Lafferty 1998; Shieber and Tao 2003) could help in our
task, and we plan to experiment with them in the future.
with 4.22 sentences on average. The generated fusion sentences consisted of 1.91 clauses
on average. None of the sentences in the test set were fully extracted; on average, each
sentence fused fragments from 2.14 theme sentences. Out of 100 sentences, 57 sentences
produced by the algorithm combined phrases from several sentences, while the rest
of the sentences comprised subsequences of one of the theme sentences. (Note that
compression is different from sentence extraction.) We included these sentences in the
evaluation, because they reflect both content selection and realization capacities of the
algorithm.
Table 5 shows two sentences from the test corpus, along with input sentences. The
examples are chosen so as to reflect good- and bad-performance cases. Note that the
first example results in inclusion of the essential information (the fact that bodies were
4.3 Results
Table 7 shows the length ratio, precision, recall, F-measure, and grammaticality score
for each algorithm. The length ratio of a sentence was computed as the ratio of its
output length to the average length of the theme input sentences.
4.4 Discussion
The results in Table 7 demonstrate that sentences manually generated by the second
human participant (RFA2) not only are the shortest, but are also closest to the reference
sentence in terms of selected information. The tight connection14 between sentences
generated by the RFAs establishes a high upper bound for the fusion task. While
neither our system nor the baselines were able to reach this level of performance, the
fusion algorithm clearly outperforms all the baselines in terms of content selection,
at a reasonable level of compression. The performance of baseline 1 and baseline 2
demonstrates that neither the shortest sentence nor the basis sentence is an adequate
substitution for fusion in terms of content selection. The gap in recall between our
system and baseline 3 confirms our hypothesis about the importance of paraphrasing
information for the fusion process. Omission of paraphrases causes an 8% drop in
recall due to the inability to match equivalent phrases with different wording.
Table 7 also reveals a downside of the fusion algorithm: Automatically generated
sentences contain grammatical errors, unlike fully extracted, human-written sentences.
Given the high sensitivity of humans to processing ungrammatical sentences, one
has to weigh the benefits of flexible information selection against the decrease in
readability of the generated sentences. Sentence fusion may not be a worthy direction
to pursue if low grammaticality is intrinsic to the algorithm and its correction requires
14 We cannot apply kappa statistics (Siegel and Castellan 1988) for measuring agreement in the content
selection task since the event space is not well-defined. This prevents us from computing the probability
of random agreement.
Table 7
Evaluation results for a human-crafted fusion sentence (RFA2), our system output, the shortest
sentence in the theme (baseline 1), the basis sentence (baseline 2), and a simplified version of our
algorithm without paraphrasing information (baseline 3).
4.4.1 Error Analysis. In this section, we discuss the results of our manual analysis of
mistakes in content selection and surface realization. Note that in some cases multiple
errors are entwined in one sentence, which makes it hard to distinguish between a
sequence of independent mistakes and a cause-and-effect chain. Therefore, the pre-
sented counts should be viewed as approximations, rather than precise numbers.
We start with the analysis of the test set and continue with the description of some
interesting mistakes that we encountered during system development.
Mistakes in Content Selection. Most of the mistakes in content selection can be attributed
to problems with alignment. In most cases (17), erroneous alignments missed relevant
word mappings as a result of the lack of a corresponding entry in our paraphrasing
resources. At the same time, mapping of unrelated words (as shown in Table 5) was
quite rare (two cases). This performance level is quite predictable given the accuracy
of an automatically constructed dictionary and limited coverage of WordNet. Even
in the presence of accurate lexical information, the algorithm occasionally produced
suboptimal alignments (four cases) because of the simplicity of our weighting scheme,
which supports limited forms of mapping typology and also uses manually assigned
weights.
Another source of errors (two cases) was the algorithm’s inability to handle
many-to-many alignments. Namely, two trees conveying the same meaning may not
be decomposable into the node-level mappings which our algorithm aims to compute.
For example, the mapping between the sentences in Table 8 expressed by the rule
X denied claims by Y ↔ X said that Y’s claim was untrue cannot be decomposed into
smaller matching units. At least two mistakes resulted from noisy preprocessing
(tokenization and parsing).
In addition to alignment, overcutting during lattice pruning caused the omission of
three clauses that were present in the corresponding reference sentences. The sentence
Conservatives were cheering language is an example of an incomplete sentence derived
from the following input sentence: Conservatives were cheering language in the final version
that ensures that one-third of all funds for prevention programs be used to promote abstinence. The omission of a relative clause was possible because some sentences in the input theme contained the noun language without any relative clauses.
Table 8
A pair of sentences which cannot be fully decomposed.
4.4.2 Further Analysis. In addition to analyzing errors found in this particular study,
we also regularly track the quality of generated summaries on Newsblaster’s Web
page. We have noted a number of interesting errors that crop up from time to time
that seem to require information about the full syntactic parse, semantics, or even
discourse. Consider, for example, the last sentence from a summary entitled Estrogen-
Progestin Supplements Now Linked to Dementia, which is shown in Table 9. This sentence
was created by sentence fusion and clearly, there is a problem. Certainly, there was a
study finding the risk of dementia in women who took one type of combined hormone pill, but
it was not the government study which was abruptly halted last summer. In looking
at the two sentences from which this summary sentence was drawn, we can see that
there is a good amount of overlap between the two, but the component does not have
enough information about the referents of the different terms to know that two different
Table 9
An example of wrong reference selection. Subscripts in the generated sentence indicate the
theme sentence from which the words were extracted.
#1 Last summer, a government study was abruptly halted after finding an increased risk
of breast cancer, heart attacks, and strokes in women who took one type of combined
hormone pill.
#2 The most common form of hormone replacement therapy, already linked to breast
cancer, stroke, and heart disease, does not improve mental functioning as some earlier
studies suggested and may increase the risk of dementia, researchers said on Tuesday.
System Last1 summer1 a1 government1 study1 abruptly1 was1 halted1 after1 finding1 the2 risk2
of2 dementia2 in1 women1 who1 took1 one1 type1 of1 combined1 hormone1 pill1 .
Table 10
An example of incorrect reference selection. Subscripts in the generated sentence indicate the theme sentence from which the words were extracted.
#1 The segments will revive the "Point-Counterpoint" segments popular until they stopped airing in 1979, but will instead be called "Clinton/Dole" one week and "Dole/Clinton" the next week.
#2 Clinton and Dole have signed up to do the segment for the next 10 weeks, Hewitt said.
#3 The segments will be called "Clinton Dole" one week and "Dole Clinton" the next.
System The1 segments1 will1 revive1 the3 segments3 until1 they1 stopped1 airing1 in1 19791 but1 instead1 will1 be1 called1 Clinton2 and2 Dole2.

5. Related Work
peripheral to the central point of the document can be removed from a sentence without
significantly distorting its meaning. While earlier approaches for text compression were
based on symbolic reduction rules (Grefenstette 1998; Mani, Gates, and Bloedorn 1999),
more recent approaches use an aligned corpus of documents and their human written
summaries to determine which constituents can be reduced (Knight and Marcu 2002;
Jing and McKeown 2000; Riezler et al. 2003). The summary sentences, which have
been manually compressed, are aligned with the original sentences from which they
were drawn.
Knight and Marcu (2000) treat reduction as a translation process using a noisy-
channel model (Brown et al. 1993). In this model, a short (compressed) string is treated
as a source, and additions to this string are considered to be noise. The probability of a
The alignment method described in Section 3 falls into a class of tree comparison
algorithms extensively studied in theoretical computer science (Sankoff 1975; Finden
and Gordon 1985; Amir and Keselman 1994; Farach, Przytycka, and Thorup 1995)
and widely applied in many areas of computer science, primarily computational bi-
ology (Gusfield 1997). These algorithms aim to find an overlap subtree that captures
structural commonality across a set of related trees. A typical tree similarity measure
In this article, we have presented sentence fusion, a novel method for text-to-text
generation which, given a set of similar sentences, produces a new sentence contain-
ing the information common to most sentences. Unlike traditional generation methods,
15 See Gusfield (1997) and Durbin et al. (1998) for an overview of multisequence alignment.
sentence fusion does not require an elaborate semantic representation of the input
but instead relies on the shallow linguistic representation automatically derived from
the input documents and knowledge acquired from a large text corpus. Generation is
performed by reusing and altering phrases from input sentences.
As the evaluation described in Section 4 shows, our method accurately identifies
common information and in most cases generates a well-formed fusion sentence. Our
algorithm outperforms the shortest-sentence baseline in terms of content selection,
without a significant drop in grammaticality. We also show that augmenting the fu-
sion process with paraphrasing knowledge improves the output by both measures.
However, there is still a gap between the performance of our system and human
performance.
maps two top nodes of the tree one to another. The function returns the score
of the alignment and the mapping itself.
begin
  node-sim ← NodeSim(tree1.top, tree2.top);
  /* If one of the trees is of height one, return the NodeSim score between the two tops */
  if is_leaf(tree1) or is_leaf(tree2) then
    return node-sim, tree1, tree2;
  else
    /* Find an optimal alignment of the children nodes */
    res ← MapChildren(tree1, tree2);
    /* The alignment score is computed as a sum of the similarity of top nodes and
Morris, Jane and Graeme Hirst. 1991. Lexical cohesion, the thesaurus, and the structure of text. Computational Linguistics, 17(1):21–48.
Nenkova, Ani and Kathleen R. McKeown. 2003. References to named entities: A corpus study. In Proceedings of the Human Language Technology Conference, Companion Volume, pages 70–73, Edmonton, Alberta.
Pang, Bo, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In
predictive annotation. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP), pages 150–157, Philadelphia, PA.
Riezler, Stefan, Tracy H. King, Richard Crouch, and Annie Zaenen. 2003. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical-functional grammar. In Proceedings of HLT-NAACL, pages 197–204, Edmonton, Alberta.
Robin, Jacques and Kathleen McKeown. 1996. Empirically designing and