Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization
Abstract
This paper presents an innovative unsupervised method for automatic sentence extraction using graph-based ranking algorithms. We evaluate the method in the context of a text summarization task, and show that the results obtained compare favorably with previously published results on established benchmarks.
1 Introduction
Graph-based ranking algorithms, such as Kleinberg's HITS algorithm (Kleinberg, 1999) or Google's PageRank (Brin and Page, 1998), have been traditionally and successfully used in citation analysis, social
networks, and the analysis of the link-structure of the
World Wide Web. In short, a graph-based ranking algorithm is a way of deciding on the importance of a
vertex within a graph, by taking into account global information recursively computed from the entire graph,
rather than relying only on local vertex-specific information.
A similar line of thinking can be applied to lexical
or semantic graphs extracted from natural language
documents, resulting in a graph-based ranking model
called TextRank (Mihalcea and Tarau, 2004), which
can be used for a variety of natural language processing applications where knowledge drawn from an entire text is used in making local ranking/selection decisions. Such text-oriented ranking methods can be
applied to tasks ranging from automated extraction
of keyphrases, to extractive summarization and word
sense disambiguation (Mihalcea et al., 2004).
In this paper, we investigate a range of graph-based ranking algorithms, and evaluate their application to automatic unsupervised sentence extraction in
the context of a text summarization task. We show
that the results obtained with this new unsupervised
method are competitive with previously developed
state-of-the-art systems.
2 Graph-Based Ranking Algorithms
2.1 HITS
HITS (Hyperlink-Induced Topic Search) (Kleinberg, 1999) is an iterative algorithm that was designed
for ranking Web pages according to their degree of
authority. The HITS algorithm makes a distinction
between authorities (pages with a large number of
incoming links) and hubs (pages with a large number of outgoing links). For each vertex, HITS produces two sets of scores: an authority score and a hub score:
$$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j) \quad (1)$$
$$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j) \quad (2)$$
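For illustration, a minimal sketch of the HITS iteration in Python; the dictionary-of-successors graph encoding, score normalization, and convergence threshold are our own implementation choices, not part of equations (1)-(2):

```python
# Minimal HITS sketch: iterate authority/hub updates on a directed graph
# given as {vertex: set of successors}. Normalization and the convergence
# threshold are implementation choices, not part of equations (1)-(2).

def hits(graph, max_iter=100, tol=1.0e-6):
    vertices = set(graph) | {v for succ in graph.values() for v in succ}
    auth = {v: 1.0 for v in vertices}
    hub = {v: 1.0 for v in vertices}
    # Precompute incoming links: In(Vi)
    incoming = {v: set() for v in vertices}
    for u, succ in graph.items():
        for v in succ:
            incoming[v].add(u)
    for _ in range(max_iter):
        # Equation (1): authority score sums hub scores of predecessors
        new_auth = {v: sum(hub[u] for u in incoming[v]) for v in vertices}
        # Equation (2): hub score sums authority scores of successors
        new_hub = {v: sum(new_auth[u] for u in graph.get(v, ())) for v in vertices}
        # Normalize so scores stay bounded across iterations
        na = sum(new_auth.values()) or 1.0
        nh = sum(new_hub.values()) or 1.0
        new_auth = {v: s / na for v, s in new_auth.items()}
        new_hub = {v: s / nh for v, s in new_hub.items()}
        converged = max(abs(new_auth[v] - auth[v]) for v in vertices) < tol
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

auth, hub = hits({1: {2, 3}, 2: {3}, 3: {1}})
```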
2.2 Positional Function
The positional function (Herings et al., 2001) ranks the vertices of a graph through two scores: a positional power, determined based on a vertex's successors, and a positional weakness, determined based on its predecessors:
$$POS_P(V_i) = \frac{1}{|V|} \sum_{V_j \in Out(V_i)} \left(1 + POS_P(V_j)\right) \quad (3)$$
$$POS_W(V_i) = \frac{1}{|V|} \sum_{V_j \in In(V_i)} \left(1 + POS_W(V_j)\right) \quad (4)$$
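A corresponding sketch for the positional scores, under the same assumptions about graph encoding; the fixed iteration count is likewise an arbitrary choice:

```python
# Positional power/weakness sketch following equations (3)-(4); the fixed
# iteration count is an assumption, chosen for simplicity.

def positional_scores(graph, iterations=50):
    vertices = set(graph) | {v for succ in graph.values() for v in succ}
    n = len(vertices)
    incoming = {v: set() for v in vertices}
    for u, succ in graph.items():
        for v in succ:
            incoming[v].add(u)
    pos_p = {v: 1.0 for v in vertices}  # power: defined over successors
    pos_w = {v: 1.0 for v in vertices}  # weakness: defined over predecessors
    for _ in range(iterations):
        pos_p = {v: sum(1.0 + pos_p[u] for u in graph.get(v, ())) / n
                 for v in vertices}
        pos_w = {v: sum(1.0 + pos_w[u] for u in incoming[v]) / n
                 for v in vertices}
    return pos_p, pos_w
```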
2.3 PageRank
PageRank (Brin and Page, 1998) is perhaps one of the
most popular ranking algorithms, and was designed as
a method for Web link analysis. Unlike other ranking
algorithms, PageRank integrates the impact of both incoming and outgoing links into one single model, and
therefore it produces only one set of scores:
$$PR(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|} \quad (5)$$
where d is a damping factor between 0 and 1, typically set to 0.85 (Brin and Page, 1998).
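A sketch of this iteration, assuming the conventional damping value d = 0.85 and a convergence threshold of our choosing:

```python
# PageRank sketch implementing equation (5); d = 0.85 follows common
# practice (Brin and Page, 1998), and the convergence threshold is ours.

def pagerank(graph, d=0.85, max_iter=100, tol=1.0e-6):
    vertices = set(graph) | {v for succ in graph.values() for v in succ}
    incoming = {v: set() for v in vertices}
    out_degree = {v: len(graph.get(v, ())) for v in vertices}
    for u, succ in graph.items():
        for v in succ:
            incoming[v].add(u)
    pr = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Equation (5): each vertex collects rank from its predecessors,
        # divided by the predecessor's out-degree
        new_pr = {
            v: (1 - d) + d * sum(pr[u] / out_degree[u]
                                 for u in incoming[v] if out_degree[u])
            for v in vertices
        }
        if max(abs(new_pr[v] - pr[v]) for v in vertices) < tol:
            return new_pr
        pr = new_pr
    return pr
```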
When the edges of the graph carry weights w_{ij}, the ranking algorithms above can be adapted to take the weights into account, resulting in the following weighted counterparts:
$$HITS_A^W(V_i) = \sum_{V_j \in In(V_i)} w_{ji}\, HITS_H^W(V_j) \quad (6)$$
$$HITS_H^W(V_i) = \sum_{V_j \in Out(V_i)} w_{ij}\, HITS_A^W(V_j) \quad (7)$$
$$POS_P^W(V_i) = \frac{1}{|V|} \sum_{V_j \in Out(V_i)} \left(1 + w_{ij}\, POS_P^W(V_j)\right) \quad (8)$$
$$POS_W^W(V_i) = \frac{1}{|V|} \sum_{V_j \in In(V_i)} \left(1 + w_{ji}\, POS_W^W(V_j)\right) \quad (9)$$
$$PR^W(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{kj}}\, PR^W(V_j) \quad (10)$$
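As one example, equation (10) can be sketched as a small modification of the unweighted PageRank iteration; the graph encoding (vertex to successor-weight map) and the illustrative edge weights in the usage line are assumptions:

```python
# Weighted PageRank sketch following equation (10). The graph maps each
# vertex to {successor: edge weight}; d = 0.85 is the usual choice.

def weighted_pagerank(graph, d=0.85, max_iter=100, tol=1.0e-6):
    vertices = set(graph) | {v for succ in graph.values() for v in succ}
    incoming = {v: [] for v in vertices}
    for u, succ in graph.items():
        for v, w in succ.items():
            incoming[v].append((u, w))
    # Denominator of equation (10): total outgoing weight of each vertex
    out_weight = {v: sum(graph.get(v, {}).values()) for v in vertices}
    pr = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        new_pr = {
            v: (1 - d) + d * sum(w * pr[u] / out_weight[u]
                                 for u, w in incoming[v] if out_weight[u])
            for v in vertices
        }
        if max(abs(new_pr[v] - pr[v]) for v in vertices) < tol:
            return new_pr
        pr = new_pr
    return pr

scores = weighted_pagerank({1: {2: 0.5}, 2: {3: 0.3, 1: 0.2}, 3: {1: 0.7}})
```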
While the final vertex scores (and therefore rankings) for weighted graphs differ significantly from their unweighted alternatives, the number of iterations to convergence and the shape of the convergence curves are almost identical for weighted and unweighted graphs.
3 Sentence Extraction
To enable the application of graph-based ranking algorithms to natural language texts, TextRank starts by
building a graph that represents the text, and interconnects words or other text entities with meaningful relations. For the task of sentence extraction, the goal
is to rank entire sentences, and therefore a vertex is
added to the graph for each sentence in the text.
To establish connections (edges) between sentences, we define a similarity relation, where similarity is measured as a function of content overlap. Such a relation between two sentences can be seen as a process of recommendation: a sentence that addresses certain concepts in a text gives the reader a recommendation to refer to other sentences in the text that address the same concepts, and therefore a link can be drawn between any two such sentences that share common content.
The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences, or it can be run through syntactic filters, which only count words of a certain syntactic category. Moreover, to avoid promoting long sentences, we apply a normalization factor, dividing the content overlap of two sentences by the length of each sentence.
Formally, given two sentences S_i and S_j, with a sentence being represented by the set of N_i words that appear in the sentence, S_i = w_1^i, w_2^i, ..., w_{N_i}^i, the similarity of S_i and S_j is defined as:
$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$
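As a sketch, this similarity measure and the construction of the undirected weighted sentence graph might look as follows; whitespace tokenization, lowercasing, and the absence of syntactic filters are simplifying assumptions:

```python
import math

# Sentence similarity as defined above: token overlap normalized by the
# log-lengths of the two sentences.
def similarity(s_i, s_j):
    words_i, words_j = set(s_i.lower().split()), set(s_j.lower().split())
    overlap = len(words_i & words_j)
    # Guard against empty overlap and a zero denominator (one-word sentences)
    if overlap == 0 or len(words_i) < 2 or len(words_j) < 2:
        return 0.0
    return overlap / (math.log(len(words_i)) + math.log(len(words_j)))

# Build the weighted sentence graph: one vertex per sentence, one weighted
# edge per pair of sentences with non-zero content overlap.
def build_sentence_graph(sentences):
    graph = {i: {} for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = similarity(sentences[i], sentences[j])
            if w > 0.0:
                graph[i][j] = w
                graph[j][i] = w  # undirected: store the edge both ways
    return graph
```

Running a weighted ranking algorithm such as the one in equation (10) over this graph, and selecting the top-ranked vertices, then yields the extractive summary.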
[Figure: sample weighted graph built for sentence extraction from a newspaper article; vertices are sentence indices with their final scores in brackets, and edges are labeled with pairwise similarity weights.]
4 Evaluation
The TextRank sentence extraction algorithm is evaluated in the context of a single-document summarization task, using 567 news articles provided during the Document Understanding Evaluations 2002
(DUC, 2002). For each article, TextRank generates a 100-word summary, the same task undertaken by the other systems participating in this single-document summarization task.
For evaluation, we use the ROUGE evaluation toolkit, a method based on n-gram statistics found to be highly correlated with human evaluations (Lin and Hovy, 2003a). Two manually produced reference summaries are provided and used in the evaluation process.
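For intuition only, ROUGE-1 recall reduces to clipped unigram overlap against a reference summary; the toy sketch below omits the stemming, stopword handling, and multiple n-gram sizes of the actual toolkit:

```python
from collections import Counter

# Toy ROUGE-1 recall: fraction of reference unigrams covered by the
# candidate summary, with clipped counts. Illustrative only; the real
# ROUGE toolkit (Lin and Hovy, 2003a) is more elaborate.
def rouge_1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cnt, cand[tok]) for tok, cnt in ref.items())
    return matched / max(sum(ref.values()), 1)
```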
Algorithm   | Undirected | Dir. forward | Dir. backward
HITS_A^W    | 0.4912     | 0.4584       | 0.5023
HITS_H^W    | 0.4912     | 0.5023       | 0.4584
POS_P^W     | 0.4878     | 0.4538       | 0.3910
POS_W^W     | 0.4878     | 0.3910       | 0.4538
PageRank    | 0.4904     | 0.4202       | 0.5008
Table 1: Results for text summarization using TextRank sentence extraction. Graph-based ranking algorithms: HITS, Positional Function, PageRank.
Graphs: undirected, directed forward, directed backward.
System   | ROUGE score
S27      | 0.5011
S29      | 0.4681
Baseline | 0.4799
Table 2: ROUGE scores of top-performing systems in the DUC 2002 single-document summarization task, and baseline.
5 Related Work
Sentence extraction is considered to be an important first step for automatic text summarization. As a consequence, there is a large body of work on algorithms for sentence extraction and summarization, including supervised approaches that cast sentence extraction as a classification task (Teufel and Moens, 1997) and methods based on text structure (Salton et al., 1997); see also (Lin and Hovy, 2003b) for an analysis of the potential and limitations of sentence extraction for summarization.
6 Conclusions
In this paper, we presented an unsupervised graph-based method for sentence extraction, and showed that the results obtained in a single-document summarization task are competitive with those of previously developed state-of-the-art systems.
5 Notice that rows two and four in Table 1 are in fact redundant, since the hub (weakness) variations of the HITS (Positional) algorithms can be derived from their authority (power) counterparts by reversing the edge orientation in the graphs.
6 Only seven edges are incident with vertex 15, fewer than e.g. the eleven edges incident with vertex 14, which was not selected as important by TextRank.
References
S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
DUC. 2002. Document understanding conference 2002. http://www-nlpir.nist.gov/projects/duc/.
P.J. Herings, G. van der Laan, and D. Talman. 2001. Measuring the power of nodes in digraphs. Technical report, Tinbergen Institute.
J.M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632.
C.Y. Lin and E.H. Hovy. 2003a. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May.
C.Y. Lin and E.H. Hovy. 2003b. The potential and limitations of sentence extraction for summarization. In Proceedings of the HLT/NAACL Workshop on Automatic Summarization, Edmonton, Canada, May.
R. Mihalcea and P. Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, July.
R. Mihalcea, P. Tarau, and E. Figa. 2004. PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, August.
G. Salton, A. Singhal, M. Mitra, and C. Buckley. 1997. Automatic text structuring and summarization. Information Processing and Management, 33(2).
S. Teufel and M. Moens. 1997. Sentence extraction as a classification task. In Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization, pages 58-65, Madrid, Spain.