Literature Study on Multi-document Text Summarization Techniques

Conference Paper · August 2016


DOI: 10.1007/978-981-10-3433-6_53


LITERATURE STUDY ON MULTI-DOCUMENT
TEXT SUMMARIZATION TECHNIQUES
Chintan Shah1, Anjali Jivani2
1Shankersinh Vaghela Bapu Institute of Technology, Gandhinagar
2The Maharaja Sayajirao University of Baroda
[email protected], [email protected]

Abstract. Text summarization is a method that generates a shorter, precise form of one or more text documents. Automatic text summarization plays an essential role in finding information in a large text corpus or on the internet. What started as single-document text summarization has now evolved into multi-document summarization. There are a number of approaches to multi-document summarization, such as Graph, Cluster, Term-Frequency and Latent Semantic Analysis (LSA) based approaches. In this paper we start with an introduction to multi-document summarization and then compare and analyse the various approaches that fall under it. The paper also details the benefits and problems of the existing methods. This should be especially helpful for researchers working in the field of text data mining, who can use this material to build new or hybrid approaches for multi-document summarization.

Keywords: text summarization, cluster, multi-document summarization, graph, LSA, term-frequency based.

1 Introduction

People widely use internet search engines such as Google, Yahoo and Bing to retrieve information. Since the amount of material on the internet is growing rapidly, it is not easy for users to find relevant and appropriate information. When a user sends a query to a search engine, the response is most often thousands of documents, and the user faces the tedious task of finding the appropriate information in this sea of responses. This problem is called “Data Overloading” [1]. Automatic text summarization produces a shorter version of a source text that retains the main features of the content and helps the user quickly understand a large volume of information. A number of authors have proposed techniques for automatic text summarization, which can be broadly classified as extractive summarization and abstractive summarization. Extractive summarization selects the sentences that have the highest weight in the retrieved document and puts them together to generate a summary of the original document without altering the text, whereas abstractive summarization converts the original text into another semantic form with the help of linguistic methods to get a shorter summary of the original document [2].
The primary goal of multi-document summarization is to build a summary that has maximum coverage, minimal redundancy and maximum cohesion between sentences [2]. In other words, the main sentences are extracted from each document and then rearranged to form a multi-document summary. The multi-document summarization flow is shown in Fig. 1.

Fig.1 Multi-Document Process Flow

This survey paper covers the following aspects:
1. Several Graph, Cluster, Term Frequency and Latent Semantic Analysis approaches for multi-document summarization
2. Issues and problems reported by different researchers, as scope for improvement in this area
3. Evaluation criteria for comparing automatic summaries with human summaries
Section 2 of this paper describes related work on multi-document text summarization using Graph, Cluster, Term-Frequency and Latent Semantic Analysis methods. Section 3 presents an analysis and comparison of all methods with their scope of improvement, Section 4 contains the evaluation criteria and Section 5 contains the conclusion.

2 Related work

Research on multi-document summarization is a present-day need, with information retrieval and internet surfing being among the most popular applications. Many methods and approaches are available for information retrieval from various sources [3], and many techniques have been developed to date for multi-document summarization. In this paper, the methods are grouped into categories according to their implementation criteria.
2.1 Graph Based Methods
Rada Mihalcea [4] (2004) proposed the TextRank method, a graph-based ranking method that takes into account local vertex-specific information as well as global statistics of the full graph, computed iteratively, to determine the significance of a vertex. The following steps are involved in summary generation:
a. Identify the text units that best describe the given task and add them as vertices of the graph
b. Draw edges between text units on the basis of their common content and compute a relationship weight for each edge
c. Edges may be weighted or unweighted, and the graph may be directed or undirected
d. Apply the ranking algorithm to the graph and iterate until convergence
e. Sort the vertices by their final scores and use these scores for selection
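The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [4]: sentences are the text units, the edge weight is a simplified word-overlap ratio chosen here for brevity, and the names `build_graph`, `rank` and `summarize` as well as the damping factor of 0.85 are assumptions of this sketch.

```python
import re
from itertools import combinations


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def build_graph(sentences):
    """Steps a-c: sentences are vertices, edges are undirected and weighted.

    The weight (shared words / combined length) is an assumed, simplified
    similarity measure used only for illustration.
    """
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        overlap = len(tokens[i] & tokens[j])
        if overlap:
            w = overlap / (len(tokens[i]) + len(tokens[j]))
            weights[i][j] = weights[j][i] = w
    return weights


def rank(weights, damping=0.85, tol=1e-6, max_iter=100):
    """Step d: iterate a weighted PageRank-style update until convergence."""
    n = len(weights)
    scores = [1.0] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new.append((1 - damping) + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def summarize(sentences, k=2):
    """Step e: keep the k best-scored sentences, in their original order."""
    scores = rank(build_graph(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other sentence receives only the baseline score of `1 - damping`, so it is the last to be selected.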
Julin Zhang [5] (2005) proposed a Hub/Authority framework based on graph theory. In this method, content features are merged with surface features such as the location and length of a sentence, cue phrases, etc. For sentence selection, significant sub-topic features are extracted under the Hub/Authority framework. Sentences are ranked and the final summary is generated on the basis of the hub and authority scores of each sentence.
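To make the hub and authority scores concrete, the following is a minimal sketch of the generic HITS iteration on a directed graph. It is not the cue-based model of [5]; the graph layout in the usage note (feature nodes pointing to the sentences that contain them) and the function name `hits` are assumptions made for illustration.

```python
import math


def hits(edges, n, iterations=50):
    """Return (hub, authority) score lists for a directed graph.

    edges is a list of (source, target) pairs over nodes 0..n-1.
    """
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of v: sum of the hub scores of the nodes pointing to v.
        auth = [0.0] * n
        for u, v in edges:
            auth[v] += hub[u]
        # Hub of u: sum of the authority scores of the nodes u points to.
        new_hub = [0.0] * n
        for u, v in edges:
            new_hub[u] += auth[v]
        # Normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in new_hub)) or 1.0
        auth = [a / a_norm for a in auth]
        hub = [h / h_norm for h in new_hub]
    return hub, auth
```

For example, with two hypothetical feature nodes (0 and 1) and three sentence nodes (2, 3 and 4), `hits([(0, 2), (0, 3), (1, 3)], 5)` gives sentence 3 the highest authority score because both features point to it, while the unconnected sentence 4 scores zero.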
Shanmugasudaran Hariharan [6] (2009) proposed two primary methods, each in two variants: with and without omitting the already selected sentences. The paper concentrates on summarization of news articles using graph-based methods. The first step of this graph-based approach is to represent the documents as an adjacency matrix built from similarity measures between sentences. Two techniques for assessing the adjacency matrix are discussed: the first uses the cumulative sum and the second the degree of centrality. Precision and recall are used as metrics for the extractive summaries. The paper also presents two metrics, Effectiveness 1 and Effectiveness 2, for evaluating system summaries against human summaries. Testing with the discounting method on single and multi-document summaries shows that the second method performs better than the first, but there is still scope for improvement in this area.
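The two ways of scoring the adjacency matrix can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation of [6]: similarity is taken to be the cosine of word-count vectors, the cumulative sum is read as the row sum of similarities, degree centrality as the number of neighbours above a threshold, and the threshold value of 0.1 is arbitrary.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def adjacency_matrix(sentences):
    """Pairwise sentence similarities, with a zero diagonal."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def cumulative_sum(matrix):
    """Score each sentence by the total similarity to all other sentences."""
    return [sum(row) for row in matrix]


def degree_centrality(matrix, threshold=0.1):
    """Score each sentence by its number of sufficiently similar neighbours."""
    return [sum(1 for v in row if v > threshold) for row in matrix]
```

Either score list can then be sorted to pick the top sentences; the cumulative sum keeps the strength of each link, whereas degree centrality only counts links.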
Khushboo [7] (2010) introduced a variant of the TextRank method that uses a shortest path algorithm for generating summaries. Instead of choosing the top-ranking sentences as in TextRank, sentences are selected along a shortest path, so that each sentence is similar to the previous one. The first step builds a graph model to represent the text. Text units can be words, phrases, collocations, sentences or others, and they are added as vertices of the graph. Next, the score of each vertex is calculated with a graph-based ranking algorithm such as HITS or PageRank. Finally, the shortest path algorithm is applied to generate the summary.
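A minimal sketch of the shortest path idea is given below, assuming that the summary is the cheapest path from the first to the last sentence and that edges only run forward in the text. The cost function `edge_cost` (cheap when two sentences share many words) is a hypothetical choice for illustration; [7] may weight edges differently, for example by also using the vertex ranking scores.

```python
import heapq
import re


def words(sentence):
    """Return the set of lowercase word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def edge_cost(a, b):
    """Hypothetical cost: small when the two sentences share many words."""
    shared = len(words(a) & words(b))
    return 1.0 / (1 + shared) ** 2


def shortest_path_summary(sentences):
    """Dijkstra from the first sentence to the last over forward edges."""
    if not sentences:
        return []
    n = len(sentences)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale queue entry
        for j in range(i + 1, n):
            nd = d + edge_cost(sentences[i], sentences[j])
            if nd < dist[j]:
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))
    # Walk back from the last sentence to recover the path.
    path, node = [], n - 1
    while node is not None:
        path.append(node)
        node = prev[node]
    return [sentences[i] for i in reversed(path)]
```

Because every hop adds cost, the path skips sentences that are unrelated to their neighbours and keeps a chain in which each sentence is similar to the previous one.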
Shuzhi Sam Ge [8] (2010) proposed a hybrid approach based on a weighted graph model that combines two concepts, sentence clustering and sentence ranking, for text summarization. In other words, the method depends on cluster-based as well as graph-based approaches for generating summaries. The approach has the following characteristics:
a. It has two parts: a graph model for sentence ranking and clustering for merging similar sentences
b. Clustering of sentences can be done on the basis of non-negative matrix factorization, so there is also the possibility of using Latent Semantic Analysis, which has gained popularity for text summarization
c. The weighted graph model reflects the discourse relationship between sentences in order to cluster and rank the sentences in a document
Tu-Anh Nguyen-Hoang [9] (2012) proposed a method with three steps. In the first step, a specific structure, an undirected weighted graph, is built for every document in the data set; the title and the sentences play the major role in constructing the graph. In the second step, weighted PageRank, a graph-based ranking algorithm, is used to calculate the score of each sentence of the document, and the top-ranked sentences are extracted to build a summary of each document on the basis of its relevant features. In the last step, the individual summaries are merged into a single summary, and the MMR (Maximal Marginal Relevance) algorithm is used to form the final extractive summary.
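The MMR step can be sketched as a greedy selection that trades relevance against redundancy. This is a generic illustration, not the system of [9]: Jaccard word overlap stands in for the similarity measure, relevance is measured against a query string, and the trade-off parameter `lam=0.7` is an arbitrary assumption.

```python
import re


def words(text):
    """Return the set of lowercase word tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard word overlap, used here as a stand-in similarity measure."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mmr_select(candidates, query, k, lam=0.7):
    """Greedily pick k sentences that are relevant but not redundant."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(sentence):
            relevance = jaccard(sentence, query)
            # Redundancy: similarity to the closest already selected sentence.
            redundancy = max((jaccard(sentence, s) for s in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the candidates "floods hit the city", "floods hit the city centre" and "rescue teams arrive" and the query "floods city rescue", the second pick is "rescue teams arrive": the near-duplicate of the first pick is penalised for redundancy even though it is more relevant to the query.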

2.2 Cluster Based Methods

Judith D. Schlesinger [10] (2008) presented CLASSY for multi-document summarization. CLASSY (Clustering, Linguistics, and Statistics for Summarization) is an extractive automatic summarization model that handles both single and multi-document summarization. It can produce topic-focused or generic summaries. It uses linguistic methods for trimming and statistical methods for scoring, hence the name CLASSY. The technique includes trimming rules that reduce the length of sentences in the document and identifies the sentences that, on the basis of importance, are likely to be included in the summary. A summary is generated for each individual document, and these summaries are then rearranged and merged to form the final combined summary. The CLASSY architecture consists of five steps: document preparation, sentence trimming (using stop word removal and stemming), sentence scoring, redundancy removal and selection of sentences based on score.
Xiao-Chem Ma [11] (2009) proposed a summarization model with three parts: pre-processing, soft clustering and summary generation. The most important part of the system is clustering. The clustering algorithm has four stages: constructing the Vector Space Model (VSM), preparing the relationship matrix, setting the initial parameters and, finally, building clusters recursively. For summary generation, Maximal Marginal Relevance (MMR) is used so that the summary sentences represent the core content of the document set and remain relevant to the query.
Virendra Gupta [12] (2012) introduced a simple approach for multi-document summarization that combines single-document summaries using sentence clustering. For clustering, both syntactic and semantic similarity between sentences are used. Document features, sentence reference index, location and concept similarity features are all used for generating the single-document summary. The sentences of the single-document summaries are clustered and the best sentences from each cluster are used to generate the multi-document summary.

2.3 Term Frequency Based Methods

Salton [13] (2005) proposed the term frequency-inverse document frequency (TF-IDF) model, in which the score of a term in a document relates the number of occurrences of the term in that document to the number of documents that contain the term. The importance of a term $i$ is given by $\mathrm{TF}_i \times \mathrm{IDF}_i$, where $\mathrm{TF}_i$ is the frequency of term $i$ in the document and $\mathrm{IDF}_i$ is the inverse document frequency of term $i$, which decreases as the number of documents containing the term grows. Sentences can therefore be scored by computing the relevance of the terms they contain.
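A minimal sketch of TF-IDF based sentence scoring follows. It assumes the common variant IDF = log(N / df), where N is the number of documents and df the number of documents containing the term, and it scores a sentence by the average weight of its terms; both choices, as well as the function names, are assumptions of this sketch rather than details taken from [13].

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the lowercase word tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    bags = [Counter(tokenize(d)) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for bag in bags for term in bag)
    return [{t: tf * math.log(n / df[t]) for t, tf in bag.items()}
            for bag in bags]


def score_sentence(sentence, weights):
    """Average TF-IDF weight of the sentence's terms (0.0 if it has none)."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)
```

A term that occurs in every document gets weight zero under this variant, so sentences made up of such common terms score low.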
Jun’ichi Fukumoto [14] (2004) proposed a technique for multi-document summarization that uses a simple strategy of building the abstract through TF-IDF based extraction. Summaries are generated for individual documents and these are then used to generate the multi-document summary. The system automatically classifies a document set into three types, single-topic, multi-topic and others, using information on high-frequency nouns and named entities. To summarize, sentences are first extracted from each document based on TF-IDF, the position of the sentence and the weight of the sentence. In the next step, unnecessary parts of sentences are discarded. The extracted sentences are then sorted in their original order in the document to generate the summary of each single document. In the next stage, all extracted sentences are grouped into clusters and repeated clauses are removed. The remaining clauses are sorted to generate the final summary.

2.4 Latent Semantic Analysis Methods


Shuchu Xiong [15] (2014) proposed an LSA-based method in which a sentence extraction summarizer evaluates a set of summary sentences by its projection similarity to the full sentence set on the top latent singular vector. A few steps are required to build a summary with Latent Semantic Analysis. The first step applies singular value decomposition (SVD) to the document. The second step chooses sentences by their projection similarity. Finally, an LSA-based forward sentence selection algorithm is applied to build the summary. In their experiments, the authors used the centroid-based MEAD and MMR (Maximal Marginal Relevance) methods.
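The role of the top latent singular vector can be illustrated without a full SVD. The sketch below builds a term-by-sentence count matrix and approximates its top right singular vector by power iteration on the sentence Gram matrix; the magnitude of each component shows how strongly a sentence loads on the dominant latent topic. This is a simplified stand-in, not the forward sentence selection algorithm of [15], and all function names are assumptions. The input is assumed to contain at least one sentence.

```python
import math
import re
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences, entries are raw counts."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    vocab = sorted(set().union(*bags))
    return [[bag[t] for bag in bags] for t in vocab]


def top_right_singular_vector(matrix, iterations=200):
    """Approximate the top right singular vector by power iteration."""
    n = len(matrix[0])
    # Gram matrix A^T A; its top eigenvector is the top right singular vector.
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def lsa_rank(sentences):
    """Sentence indices ordered by their weight on the top latent topic."""
    v = top_right_singular_vector(term_sentence_matrix(sentences))
    return sorted(range(len(sentences)), key=lambda i: abs(v[i]), reverse=True)
```

Sentences that share vocabulary with many others dominate the top singular vector, while a sentence on an unrelated topic ends up with a weight close to zero and is ranked last.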
Josef Steinberger [16] (2004) shows that basic LSA has two main disadvantages. The first is that it uses the same number of dimensions as the number of sentences wanted in the summary. The second is that a sentence with large index values, but not the largest in any dimension, will not be chosen even when it is needed for the summary. The authors propose a modification of the existing SVD-based summarization, in which the SVD of a term-by-sentence matrix is computed. For summary evaluation, the paper presents techniques such as similarity of the main topic and term significance.

3 Analysis and Comparative Study of Various Methods

| Approach | Author, Year | Description | Benefits | Problems | Scope of Improvement |
|---|---|---|---|---|---|
| Graph based | Rada Mihalcea, 2004 [4] | Builds the summary with TextRank by choosing the top-ranked sentences | Considers text units for local information | Calculation of the vertex score is complex | Score calculation for summary generation can be improved |
| Graph based | Julin Zhang, 2005 [5] | Uses the Hub/Authority framework and merges surface features with content features | Effective graph-ranking method for text summarization | Sub-topics are not easy to find | Only limited surface features are used; more can be added to find sub-topics |
| Graph based | Shanmugasudaran Hariharan, 2009 [6] | Two main techniques: cumulative sum and degree of centrality | Works for both single and multi-document summarization | Only precision and recall are used for the results, no other standard formula | An additional method needs to be developed for better results |
| Graph based | Khushboo, 2010 [7] | Uses a shortest path algorithm to select the sentences for the summary | Generates a better summary than the other methods compared | Considers only the shortest path for sentence selection | Another feature should be added alongside the shortest path algorithm |
| Graph based | Shuzhi Sam Ge, 2010 [8] | Finds sentences in the document through sentence ranking and clustering | Uses a hybrid approach for summarization | The weight balance parameter and the sparseness degree may influence system performance | System performance needs to be improved |
| Graph based | Tu-Anh Nguyen-Hoang, 2012 [9] | Pre-processing, graph construction and sentence ranking, with summary generation via MMR | Unsupervised method, so no training data is needed | Information may be lost during graph building | Information loss needs to be reduced |
| Cluster based | Judith D. Schlesinger, 2008 [10] | Uses the CLASSY architecture | Works for multi-lingual summarization | Machine translation is a difficult task | CLASSY should work for all languages |
| Cluster based | Xiao-Chem Ma, 2009 [11] | Builds clusters and extracts the summary based on a modified MMR | MMR is used for sentence extraction | Considers only query sentences | Readability of the summary can be improved |
| Cluster based | Virendra Gupta, 2012 [12] | Merges single-document summaries into a multi-document summary | Uses syntactic and semantic similarity for sentence clustering | The single-document summaries must be considered for the final summary | Syntactic similarity is based on word order, which can be replaced by other structural comparison measures |
| Term frequency based | Salton, 2005 [13] | Uses TF-IDF for the summary | Easy and fast summary generation | No major drawbacks | Can be combined with other features to remove duplication |
| Term frequency based | Jun’ichi Fukumoto, 2004 [14] | Builds the multi-document abstract with the help of single-document summarization | Classifies document sets into three types: single-topic, multi-topic and others | No major drawbacks | Results can be improved |
| Latent Semantic Analysis | Shuchu Xiong, 2014 [15] | LSA-based sentence extraction summarizer that assesses a set of summary sentences by its projection similarity to the full sentence set on the top latent singular vector | Applies SVD; uses centroid-based MEAD and MMR (Maximal Marginal Relevance) in the experiments | Only LSA-based algorithms are used for summary generation | Another method is needed to build better summaries |
| Latent Semantic Analysis | Josef Steinberger, 2004 [16] | Recalculates the SVD of a term-by-sentence matrix | Presents evaluation techniques: similarity of the main topic and term significance | No standard method is used for evaluating the generated summary | A solid method for summary evaluation is needed |

4 Evaluation Measures

Evaluation of a summary is typically based on its readability and information content. The primary purpose of text summarization is to find non-redundant text that contains the significant information of the original corpus. There is no fixed parameter on which we can rely for evaluating text summarization. There are two approaches to the evaluation of summarization: intrinsic and extrinsic. Intrinsic methods measure the information content of the summary itself, comparing it with a human summary or with the full source document. Extrinsic methods evaluate the summary via task-based performance, e.g. on information retrieval-oriented tasks.
The ROUGE toolkit can help us to check the performance of a generated summary. ROUGE is a software package that measures a summary in terms of the number of word overlaps between the machine-generated summary and a human reference summary [17]. The toolkit takes two types of summaries as input: the standard summary, which serves as the reference, and the summaries generated by the methods under evaluation, which are compared against it. The ROUGE toolkit has five evaluation metrics, ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU, based on word co-occurrence statistics [18].
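As an illustration of the word-overlap idea, the following is a minimal sketch of recall-oriented ROUGE-N for a single reference summary. It is not the ROUGE toolkit itself: stemming, stop word handling and multiple references are left out, and the tokenizer and the function name `rouge_n` are assumptions of this sketch.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams of a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Share of the reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram to its smaller count.
    overlap = sum((cand & ref).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of the six reference unigrams are matched, so `rouge_n(..., n=1)` is 5/6, and three of the five reference bigrams are matched, so `rouge_n(..., n=2)` is 0.6.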
Another toolkit, MEAD, is a publicly available toolkit for multi-lingual summarization and evaluation. It implements several summarization algorithms, including position-based, centroid, TF-IDF and query-based methods. Its methods for evaluating the quality of summaries include co-selection measures (precision/recall, kappa and relative utility) and content-based measures (cosine, word overlap and bigram overlap).
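Co-selection precision and recall can be sketched directly from their definitions. The snippet below is a generic illustration and does not use the MEAD toolkit; it assumes that both the system summary and the human summary are given as collections of selected sentence indices, and the function name `co_selection` is hypothetical.

```python
def co_selection(system, reference):
    """Return (precision, recall, F1) over selected sentence indices."""
    system, reference = set(system), set(reference)
    common = len(system & reference)
    # Precision: share of system-selected sentences the human also chose.
    precision = common / len(system) if system else 0.0
    # Recall: share of human-selected sentences the system recovered.
    recall = common / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For a system that selects sentences {0, 2, 5} against a human selection of {0, 1, 2, 3}, two sentences are shared, giving a precision of 2/3 and a recall of 1/2.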

5 Conclusion

This literature survey covers various methods for multi-document text summarization. Several techniques have been explored, namely Graph based, Cluster based, Term-Frequency based and Latent Semantic Analysis (LSA) based, and they have been compared in this paper. Researchers can focus on specific approaches among the existing techniques and improve them to produce new or hybrid approaches that build better summaries with less effort. A new or hybrid approach can be developed with the help of natural language processing and linguistic techniques, which can help generate better multi-document summaries.

6 References
1. M.-Y. Kan and J. L. Klavans, "Using librarian techniques in automatic text summarization for information retrieval," in Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 36-45, ACM, 2002.
2. Y. K. Meena, A. Jain and D. Gopalani, "Survey on Graph and Cluster Based approaches in Multi-document Text Summarization," Recent Advances and Innovations in Engineering (ICRAIE), 2014, Jaipur, 2014, pp. 1-5. doi: 10.1109/ICRAIE.2014.6909126
3. M. Haque, S. Pervin, Z. Begum, et al., "Literature review of automatic multiple documents text summarization," International Journal of Innovation and Applied Studies, vol. 3, no. 1, pp. 121-129, 2013.
4. R. Mihalcea and P. Tarau, "TextRank: Bringing order into texts," in Proceedings of EMNLP, vol. 4, Barcelona, Spain, 2004.
5. J. Zhang, L. Sun, and Q. Zhou, "A cue-based hub-authority approach for multi-document text summarization," in Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE'05. Proceedings of 2005 IEEE International Conference on, pp. 642-645, IEEE, 2005.
6. S. Hariharan and R. Srinivasan, "Studies on graph based approaches for single and multi-document summarizations," Int. J. Comput. Theory Eng, vol. 1, pp. 1793-8201, 2009.
7. K. S. Thakkar, R. V. Dharaskar, and M. Chandak, "Graph-based algorithms for text summarization," in Emerging Trends in Engineering and Technology (ICETET), 2010 3rd International Conference on, pp. 516-519, IEEE, 2010.
8. S. S. Ge, Z. Zhang, and H. He, "Weighted graph model based sentence clustering and ranking for document summarization," in Interaction Sciences (ICIS), 2011 4th International Conference on, pp. 90-95, IEEE, 2011.
9. T.-A. Nguyen-Hoang, K. Nguyen, and Q.-V. Tran, "Tsgvi: a graph-based summarization system for Vietnamese documents," Journal of Ambient Intelligence and Humanized Computing, vol. 3, no. 4, pp. 305-313, 2012.
10. J. D. Schlesinger, D. P. O'Leary, and J. M. Conroy, "Arabic/English multi-document summarization with CLASSY: the past and the future," in Computational Linguistics and Intelligent Text Processing, pp. 568-581, Springer, 2008.
11. X.-C. Ma, G.-B. Yu, and L. Ma, "Multi-document summarization using clustering algorithm," in Intelligent Systems and Applications, 2009. ISA 2009. International Workshop on, pp. 1-4, IEEE, 2009.
12. V. K. Gupta and T. J. Siddiqui, "Multi-document summarization using sentence clustering," in Intelligent Human Computer Interaction (IHCI), 2012 4th International Conference on, pp. 1-5, IEEE, 2012.
13. G. Salton, "Automatic Text Processing: the transformation, analysis, and retrieval of information by computer," Addison-Wesley Publishing Company, USA, 1989.
14. Jun'ichi Fukumoto, "Multi-Document Summarization Using Document Set Type Classification," Proceedings of NTCIR-4, Tokyo, pp. 412-416, 2004.
15. S. Xiong and Y. Luo, "A New Approach for Multi-document Summarization Based on Latent Semantic Analysis," Computational Intelligence and Design (ISCID), 2014 Seventh International Symposium on, Hangzhou, 2014, pp. 177-180.
16. J. Steinberger and K. Jezek, "Using latent semantic analysis in text summarization and summary evaluation," in Proc. ISIM '04, 2004, pp. 93-100.
17. E. Lloret and M. Palomar, "Text summarization in progress: a literature review," Artificial Intelligence Review, vol. 37, no. 1, pp. 1-41, 2012.
18. D. Das and A. F. Martins, "A survey on automatic text summarization," Literature Survey for the Language and Statistics II course at CMU, vol. 4, pp. 192-195, 2007.
