7. Residual IDF [7]: Instead of simple IDF, RIDF can be a better criterion for selecting sentences. RIDF is the residual IDF of a term in a given document, i.e. the difference between the observed IDF and the IDF expected under the assumption that the term follows an independence model such as the Poisson distribution.
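As a concrete illustration, the minimal Python sketch below computes RIDF under the Poisson assumption of [7]: with average rate λ = cf/N, the probability that a document contains the term at least once is 1 − e^(−λ), so the expected IDF is −log2(1 − e^(−λ)). Function and variable names are ours, not the paper's.

```python
import math

def ridf(doc_freq: int, coll_freq: int, num_docs: int) -> float:
    """Residual IDF = observed IDF - IDF expected under a Poisson model.

    doc_freq  -- documents containing the term (df)
    coll_freq -- total occurrences of the term in the collection (cf)
    num_docs  -- documents in the collection (N)
    """
    observed_idf = -math.log2(doc_freq / num_docs)
    lam = coll_freq / num_docs                       # mean occurrences per document
    expected_idf = -math.log2(1.0 - math.exp(-lam))  # -log2 P(term occurs at least once)
    return observed_idf - expected_idf

# A "bursty" term that clumps into few documents scores high; a term spread
# over as many documents as it has occurrences scores near zero.
print(ridf(doc_freq=5, coll_freq=50, num_docs=1000))   # ~3.3
print(ridf(doc_freq=50, coll_freq=50, num_docs=1000))  # ~0.0
```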
8. Gain [8]: Luhn stated in his first paper that words of medium frequency are the most important, and Gain is a feature that builds on this hypothesis. Under simple IDF, words that occur very rarely in the corpus get very high scores even though their importance is low. Gain overcomes this weakness of IDF by introducing a new formula that treats the optimal gain associated with the feature as the word's importance.
9. Term Co-occurrence [9]: This method is somewhat similar to the keyword frequency feature, but instead of assigning weights to particular terms, clusters of important words are located within sentences and weights are assigned accordingly. According to the paper, in a document of 25-30 sentences a reasonable occurrence level for establishing a term as important is 7 sentences.

10. Query Score [9]: It is believed that the more query terms a sentence contains, the more important the sentence is. Scores are computed so that sentences containing more terms from the query receive higher scores, as in the sketch below.
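A minimal sketch of this scoring, assuming simple lowercased whitespace tokenization (stemming and stop-word removal could be layered on in the same way):

```python
def query_score(sentence: str, query: str) -> int:
    """Query score: number of distinct query terms present in the sentence."""
    sent_terms = set(sentence.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in sent_terms)

print(query_score("Text summarization extracts key sentences.",
                  "extractive text summarization"))  # -> 2
```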
11. Word Co-occurrence [10]: The same word appearing in different text units. There are two variations of this feature: one without stemming and one after applying stemming. In documents where it is unlikely that a single word is used with different meanings, it is better to apply stemming.

12. Matching Noun Phrases [10]: Tools can be used to identify simplex noun phrases and to match the ones with the same head.
13. Wordnet Synonyms [10]: Synonyms are matched with the help of the synsets provided by WordNet. Using synonyms increases the similarity between sentences that contain the same words expressed as synonyms.
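A small sketch of synonym matching via NLTK's WordNet interface (the choice of toolkit is our assumption; the paper does not prescribe one). Two words are treated as matching when their synsets overlap:

```python
# Requires NLTK with the WordNet corpus installed: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def share_synset(word1: str, word2: str) -> bool:
    """True if the two words belong to at least one common WordNet synset,
    i.e. they can act as synonyms in some sense."""
    return bool(set(wn.synsets(word1)) & set(wn.synsets(word2)))

print(share_synset("car", "automobile"))  # True: same synset
print(share_synset("car", "banana"))      # False
```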
14. Word Significance Score [11]: This feature calculates the relative significance of words in a given document. It is somewhat similar to the TF-IDF feature.
15. Title Similarity Revised [11]: In this method the sentence score is calculated as the square of the number of terms common to the sentence and the title of the document.
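Since the score is just a squared overlap count, a short sketch suffices (lowercased whitespace tokenization is our assumption):

```python
def title_similarity_revised(sentence: str, title: str) -> int:
    """Sentence score = (number of terms shared with the title) squared [11]."""
    shared = set(sentence.lower().split()) & set(title.lower().split())
    return len(shared) ** 2

print(title_similarity_revised(
    "Feature sets for extractive summarization are compared",
    "Optimal features set for extractive automatic text summarization"))  # -> 9
```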
B. Sentence Level Features

1. Sentence Location [4]: This hypothesis suggests that the most important sentences of a document occur very early or very late in the text. Weights are assigned according to the position of the sentence in the text.

2. Semantic Structure [12]: This feature suggests that the sentence related to the largest number of other sentences is the most important one. Sentences are treated as nodes of a graph, and related sentences (those that have common terms) are connected by edges. The sentence with the maximum number of edges is the most important sentence, and so on. A degree-centrality sketch of this graph follows below.
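The following sketch scores each sentence by its degree in such a graph. It uses raw word overlap to decide relatedness; a real implementation would first remove stop words (our simplification):

```python
def semantic_structure_scores(sentences):
    """Sentences are nodes; an edge joins two sentences sharing at least
    one term; a sentence's score is its number of edges (its degree)."""
    token_sets = [set(s.lower().split()) for s in sentences]
    scores = []
    for i, terms in enumerate(token_sets):
        degree = sum(1 for j, other in enumerate(token_sets)
                     if i != j and terms & other)
        scores.append(degree)
    return scores

print(semantic_structure_scores(
    ["the cat sat", "the dog ran", "a bird flew"]))  # [1, 1, 0]
```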
3. Sentence Length Cut-Off [5]: Sentences that are too short or too long are not included in the summary: too-short sentences are not significant, and too-long sentences increase the length of the summary unnecessarily. A threshold can be fixed (e.g. not fewer than 5 words and not more than 40 words).

4. Fixed Phrase [5]: Sentences that contain certain fixed phrases are given priority. These phrases, also known as indicator phrases, are usually two words long (e.g. "In conclusion", "My opinion").

5. Concept Signature [2]: A concept signature uses the word co-occurrence feature to extract topic words together with a list of associated (keyword, weight) pairs. It is based on the hypothesis that when a concept is important in a document, a set of words will co-occur fairly predictably. This feature is also used in IR systems for query expansion.

6. Concept Count [13]: This feature suggests counting the occurrence of concepts rather than individual verbs and nouns. It is used alongside the counts of verbs and nouns: verbs and nouns that carry the same meaning form a single concept, called a concept set. WordNet synonyms, hypernyms and hyponyms can be used for this purpose.

7. Maximum Marginal Relevance (MMR) [14]: This feature is used in query-based summarizers. A summarized document should contain minimum redundancy, and MMR helps attain this goal. In query-based summarizers the sentences are ranked by maximum similarity to the query; MMR instead scores sentences with a linear combination of relevance and novelty. Under MMR a sentence is important if it is both similar to the query and dissimilar to the other sentences in the document, as sketched below.
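A greedy selection sketch following the λ-weighted formulation of [14], MMR(s) = λ·Sim(s, query) − (1−λ)·max Sim(s, selected). The similarities are passed in as precomputed dictionaries, which is our implementation choice:

```python
def mmr_select(candidates, query_sim, pair_sim, k, lam=0.7):
    """Greedily pick k sentences by Maximal Marginal Relevance [14].

    candidates -- list of sentence ids
    query_sim  -- dict: id -> similarity to the query
    pair_sim   -- dict: (id, id) -> sentence-sentence similarity (both orders)
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(s):
            redundancy = max((pair_sim[(s, t)] for t in selected), default=0.0)
            return lam * query_sim[s] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

query_sim = {"s1": 0.9, "s2": 0.8, "s3": 0.3}
pair_sim = {(a, b): (0.9 if {a, b} == {"s1", "s2"} else 0.1)
            for a in query_sim for b in query_sim if a != b}
# s2 is relevant but nearly duplicates s1, so the novel s3 is chosen instead.
print(mmr_select(["s1", "s2", "s3"], query_sim, pair_sim, k=2, lam=0.5))
# -> ['s1', 's3']
```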
8. PageRank [15]: This feature is based on one of the most popular ranking algorithms used for web link analysis. Graph edge weights are calculated using cosine similarity between sentences, and the PageRank algorithm is applied to this graph, which in turn yields a ranking of the most important sentences. The algorithm starts by assigning arbitrary values to each node, and the computation iterates until convergence below a given threshold is achieved.
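A power-iteration sketch over a sentence-similarity matrix (diagonal assumed zero). The damping factor d = 0.85 is the conventional choice, not a value stated in this paper:

```python
def sentence_pagerank(sim, d=0.85, tol=1e-6):
    """Weighted PageRank over a sentence graph.

    sim -- symmetric matrix (list of lists) of cosine similarities
           between sentences, with zeros on the diagonal.
    """
    n = len(sim)
    scores = [1.0 / n] * n                         # arbitrary starting values
    out_weight = [sum(row) or 1.0 for row in sim]  # total edge weight per node
    while True:
        new = [(1 - d) / n
               + d * sum(scores[j] * sim[j][i] / out_weight[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:  # converged?
            return new
        scores = new

sim = [[0.0, 0.6, 0.1],
       [0.6, 0.0, 0.4],
       [0.1, 0.4, 0.0]]
print(sentence_pagerank(sim))  # highest score -> most central sentence
```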
C. Paragraph Level Features

1. Paragraph Feature [5]: This feature records information for the initial ten paragraphs and the last five paragraphs. Sentences are scored according to their position within the paragraph: paragraph-initial, paragraph-final or paragraph-medial.

2. Optimal Position Policy [2]: This feature is somewhat similar to the sentence location feature, but it is more systematic. After evaluating 13,000 newspaper articles, the authors found a characteristic sequence of the most important sentence positions. The title is the most important sentence and is most likely to bear the main topic.

IV. PROPOSED WORK

In our proposed work we have tried all combinations of the features available to us. We used six important features for the initial work: TF-IDF (F1), Co-occurrence (F2), Sentence Centrality (F3), Sentence Location (F4), Named Entity (F5) and Proper Noun (F6). These features are described in the previous section. Together they generate the 63 combinations given in Table II.
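Enumerating the 63 non-empty feature subsets is straightforward. How the active features are combined into one sentence score is not restated in this section, so the additive rule below is an assumption for illustration:

```python
from itertools import combinations

FEATURES = ["F1", "F2", "F3", "F4", "F5", "F6"]

# All non-empty subsets of six features: 2**6 - 1 = 63 combinations (Table II).
subsets = [c for r in range(1, len(FEATURES) + 1)
           for c in combinations(FEATURES, r)]
print(len(subsets))  # 63

def combined_score(feature_scores, subset):
    """Hypothetical additive combination of a sentence's per-feature scores."""
    return sum(feature_scores[f] for f in subset)

print(combined_score({"F1": 0.2, "F2": 0.5, "F3": 0.0,
                      "F4": 0.3, "F5": 0.0, "F6": 0.0},
                     ("F2", "F4")))  # 0.8
```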
TABLE II. INDIVIDUAL COMBINATIONS OF FEATURES

S. No.   F1  F2  F3  F4  F5  F6
 1.       0   0   0   0   0   1
 2.       0   0   0   0   1   0
 3.       0   0   0   0   1   1
 4.       0   0   0   1   0   0
 5.       0   0   0   1   0   1
 6.       0   0   0   1   1   0
 7.       0   0   0   1   1   1
 8.       0   0   1   0   0   0
 9.       0   0   1   0   0   1
10.       0   0   1   0   1   0
11.       0   0   1   0   1   1
12.       0   0   1   1   0   0
13.       0   0   1   1   0   1
14.       0   0   1   1   1   0
15.       0   0   1   1   1   1
16.       0   1   0   0   0   0
17.       0   1   0   0   0   1
18.       0   1   0   0   1   0
19.       0   1   0   0   1   1
20.       0   1   0   1   0   0
21.       0   1   0   1   0   1
22.       0   1   0   1   1   0
23.       0   1   0   1   1   1
24.       0   1   1   0   0   0
25.       0   1   1   0   0   1
26.       0   1   1   0   1   0
27.       0   1   1   0   1   1
28.       0   1   1   1   0   0
29.       0   1   1   1   0   1
30.       0   1   1   1   1   0
31.       0   1   1   1   1   1
32.       1   0   0   0   0   0
33.       1   0   0   0   0   1
34.       1   0   0   0   1   0
35.       1   0   0   0   1   1
36.       1   0   0   1   0   0
37.       1   0   0   1   0   1
38.       1   0   0   1   1   0
39.       1   0   0   1   1   1
40.       1   0   1   0   0   0
41.       1   0   1   0   0   1
42.       1   0   1   0   1   0
43.       1   0   1   0   1   1
44.       1   0   1   1   0   0
45.       1   0   1   1   0   1
46.       1   0   1   1   1   0
47.       1   0   1   1   1   1
48.       1   1   0   0   0   0
49.       1   1   0   0   0   1
50.       1   1   0   0   1   0
51.       1   1   0   0   1   1
52.       1   1   0   1   0   0
53.       1   1   0   1   0   1
54.       1   1   0   1   1   0
55.       1   1   0   1   1   1
56.       1   1   1   0   0   0
57.       1   1   1   0   0   1
58.       1   1   1   0   1   0
59.       1   1   1   0   1   1
60.       1   1   1   1   0   0
61.       1   1   1   1   0   1
62.       1   1   1   1   1   0
63.       1   1   1   1   1   1

TABLE III. RESULTS WITH BEST SET FOR PRECISION

Document   Feature Set      Precision Value
 1.        F2               0.51
 2.        F2+F5            0.40
 3.        F1               0.42
 4.        F3               0.42
 5.        F5               0.49
 6.        F2+F5            0.65
 7.        F2+F4            0.40
 8.        F3+F6            0.47
 9.        F2+F3+F4+F5      0.30
10.        F4+F5            0.64

TABLE IV. RESULTS WITH BEST SET FOR RECALL

Document   Feature Set      Recall Value
 1.        F4               0.56
 2.        F1               0.42
 3.        F5               0.41
 4.        F3               0.43
 5.        F1               0.46
 6.        F2+F5            0.71
 7.        F5               0.51
 8.        F2+F4            0.50
 9.        F4               0.32
10.        F2+F4+F5         0.72

TABLE V. RESULTS WITH BEST SET FOR F-MEASURE

Document   Feature Set      F-Measure Value
 1.        F4               0.53
 2.        F1               0.41
 3.        F2+F4            0.34
 4.        F3+F5            0.42
 5.        F1+F3            0.45
 6.        F2+F5            0.68
 7.        F5               0.43
 8.        F3+F6            0.46
 9.        F4               0.30
10.        F2+F4+F5         0.67

For document 10, word co-occurrence, sentence location and named entity give the best results. The results for F-Measure are given in Table V. Here too, different combinations give the best results. For documents 1 and 9, sentence location gives the best result. TF-IDF gives the best result for document 2. Word co-occurrence and sentence location are best for document 3. Sentence centrality and named entity are best for document 4. TF-IDF and sentence centrality are best for document 5. For document 6, word co-occurrence and named entity are best. Named entity has the highest value for document 7. For document 8, sentence centrality and named entity give the best results. Word co-occurrence, sentence location and named entity give the best result for document 10.
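For reference, precision, recall and F-measure over extracted sentences can be computed as below. This sketch assumes the scores compare the set of extracted sentence ids against a reference extract; the paper's exact evaluation protocol is not restated in this section:

```python
def prf(selected, reference):
    """Precision, recall and F-measure of an extract vs. a reference extract,
    both given as collections of sentence ids."""
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    p = hits / len(selected) if selected else 0.0
    r = hits / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf(selected={1, 3, 5, 8}, reference={1, 2, 5, 9}))  # (0.5, 0.5, 0.5)
```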
VI. CONCLUSION

This paper suggests that it is not easy to find an efficient extractive summary of a text using different feature combinations. We tried all combinations of several features and found that a specific combination can give higher efficiency, but only on one or a few documents. On news documents in particular, our algorithm performs consistently across all documents, as it takes advantage of the sentence location feature. TF/ISF, Named Entity and Proper Nouns are good indicators for including sentences in the summary. The proposed feature set may be extended with semantic features and more filtering levels.

REFERENCES
[1] H. P. Luhn, "The automatic creation of literature abstracts," IBM J. Res. Dev., vol. 2, no. 2, pp. 159-165, Apr. 1958.
[2] E. Hovy and C.-Y. Lin, "Automated text summarization and the SUMMARIST system," in Proceedings of a Workshop Held at Baltimore, Maryland, ser. TIPSTER '98. Stroudsburg, PA, USA: Association for Computational Linguistics, 1998, pp. 197-214.
[3] R. Brandow, K. Mitze, and L. F. Rau, "Automatic condensation of electronic publications by sentence selection," Inf. Process. Manage., vol. 31, no. 5, pp. 675-685, Sep. 1995.
[4] H. P. Edmundson, "New methods in automatic extracting," J. ACM, vol. 16, no. 2, pp. 264-285, Apr. 1969.
[5] J. Kupiec, J. Pedersen, and F. Chen, "A trainable document summarizer," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '95. New York, NY, USA: ACM, 1995, pp. 68-73.
[6] M. A. Fattah and F. Ren, "GA, MR, FFNN, PNN and GMM based models for automatic text summarization," Computer Speech and Language, vol. 23, no. 1, pp. 126-144, 2009.
[7] K. Church and W. A. Gale, "Inverse document frequency (IDF): A measure of deviations from Poisson," in Proceedings of the Third Workshop on Very Large Corpora, 1995, pp. 121-130.
[8] K. Papineni, "Why inverse document frequency?" in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, ser. NAACL '01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1-8.
[9] A. Tombros and M. Sanderson, "Advantages of query biased summaries in information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '98. New York, NY, USA: ACM, 1998, pp. 2-10.
[10] K. R. McKeown, J. L. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin, "Towards multidocument summarization by reformulation: Progress and prospects," in Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, ser. AAAI '99/IAAI '99. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1999, pp. 453-460.
[11] C. Hori, S. Furui, R. Malkin, H. Yu, and A. Waibel, "Automatic summarization of English broadcast news speech," in Proceedings of the Second International Conference on Human Language Technology Research, ser. HLT '02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 241-246.
[12] E. F. Skorochod'ko, "Adaptive method of automatic abstracting and indexing," in IFIP Congress (2) '71, 1971, pp. 1179-1182.
[13] B. Schiffman, A. Nenkova, and K. McKeown, "Experiments in multidocument summarization," in Proceedings of the Second International Conference on Human Language Technology Research, ser. HLT '02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 52-58.
[14] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '98. New York, NY, USA: ACM, 1998, pp. 335-336.
[15] R. Mihalcea and P. Tarau, "A language independent algorithm for single and multiple document summarization," in Proceedings of IJCNLP, 2005.
[16] J. M. Conroy and D. P. O'Leary, "Text summarization via hidden Markov models," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '01. New York, NY, USA: ACM, 2001, pp. 406-407.
[17] J. E. Rush, R. Salvador, and A. Zamora, "Automatic abstracting and indexing. II. Production of indicative abstracts by application of contextual inference and syntactic coherence criteria," Journal of the American Society for Information Science, vol. 22, no. 4, pp. 260-274, 1971.
[18] C.-Y. Lin and E. Hovy, "Identifying topics by position," in Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLC '97). Stroudsburg, PA, USA: Association for Computational Linguistics, 1997, pp. 283-290.
[19] D. Radev, S. Blair-Goldensohn, and Z. Zhang, "Experiments in single and multi-document summarization using MEAD," in First Document Understanding Conference, New Orleans, LA, 2001.
[20] A. Abuobieda, N. Salim, A. Albaham, A. Osman, and Y. Kumar, "Text summarization features selection method using pseudo genetic-based model," in International Conference on Information Retrieval & Knowledge Management (CAMP), Mar. 2012, pp. 193-197.
[21] J. Jagadeesh, P. Pingali, and V. Varma, "Sentence extraction based single document summarization," in Workshop on Document Summarization, 2005.
[22] M. Mendoza, S. Bonilla, C. Noguera, C. Cobos, and E. Leon, "Extractive single-document summarization based on genetic operators and guided local search," Expert Syst. Appl., vol. 41, no. 9, pp. 4158-4169, Jul. 2014.
[23] C. Nobata, S. Sekine, M. Murata, K. Uchimoto, M. Utiyama, and H. Isahara, "Sentence extraction system assembling multiple evidence," in Proceedings of the Second NTCIR Workshop Meeting, 2001, pp. 5-213.
[24] J. J. Pollock and A. Zamora, "Automatic abstracting research at Chemical Abstracts Service," Journal of Chemical Information and Computer Sciences, vol. 15, no. 4, pp. 226-232, Nov. 1975.
[25] P. R. Shardanand and U. Kulkarni, "Implementation and evaluation of evolutionary connectionist approaches to automated text summarization," Journal of Computer Science, pp. 1366-1376, Feb. 2010.
[26] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. Pereira e Silva, F. Freitas, G. D. C. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, "Assessing sentence scoring techniques for extractive text summarization," Expert Systems with Applications, vol. 40, no. 14, pp. 5755-5764, Oct. 2013.
[27] R. Ferreira, F. Freitas, L. de Souza Cabral, R. D. Lins, R. Lima, G. Franca, and S. J. Simske, and L. Favaro, "A context based text summarization system," in 2014 11th IAPR International Workshop on Document Analysis Systems (DAS), Apr. 2014, pp. 66-70.
[28] Y. K. Meena and D. Gopalani, "Analysis of sentence scoring methods for extractive automatic text summarization," in Proceedings of the International Conference on Information and Communication Technology for Competitive Strategies (ICTCS-2014), Udaipur, Rajasthan, India, Nov. 2014.
[29] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, Jul. 2004.