Comparative Study of Text Summarization Methods
ABSTRACT
Text summarization is one application of natural language processing and is becoming more popular for information condensation. Text summarization is the process of reducing the size of an original document and producing a summary that retains the important information of the original document. This paper gives a comparative study of various text summarization methods for different types of applications. It discusses in detail the two main categories of text summarization methods, extractive and abstractive summarization, and also presents a taxonomy of summarization systems and the statistical and linguistic approaches used for summarization.

... and its summary is less than half of the main text. Summarization can be viewed as a two-step process. The first step is the extraction of important concepts from the source text by building an intermediate representation of some sort. The second step uses this intermediate representation to generate a summary. Newsblaster is a good example of a text summarizer that helps users find the news that is of most interest to them: the system automatically collects, clusters, categorizes and summarizes news from several sites on the web on a daily basis. A summarization machine can be viewed as a system which accepts a single document, multiple documents, or a query as input and produces an abstract or an extract as a summary.
H. Gregory Silber and McCoy [10] developed a linear-time algorithm for lexical chain computation. The authors follow Barzilay and Elhadad [6] in employing lexical chains to extract important concepts from the source text by building an intermediate representation. The paper [10] discusses an algorithm for creating lexical chains which builds an array of meta-chains whose size is the number of noun senses in WordNet and in the document. Some problems with the algorithm, such as proper nouns and anaphora resolution, remained to be addressed.

There is another method for summarization using graph theory [12]. The authors proposed a method based on subject-object-predicate (SOP) triples from individual sentences to create a semantic graph of the original document. The relevant concepts carrying the meaning are scattered across clauses, and the authors [12] suggested that identifying and exploiting links among them could be useful for extracting relevant text.

One of the researchers, Pushpak Bhattacharyya [11] from IIT Bombay, introduced a WordNet-based approach for summarization. The document is summarized by generating a sub-graph from WordNet, and weights are assigned to the nodes of the sub-graph with respect to their synsets using WordNet. The most common text summarization techniques use either a statistical approach, a linguistic approach, or a combination of both.

An indicative summary presents only the main idea of the text; for example, before purchasing a novel, a buyer reads the summary provided at the back of the novel.

An informative summary serves as a substitution for the original document. It provides concise information about the original document to the user.

3. Based on type of content
This classification is based on the type of content in the original document [1]. Generic summarization is a system which can be used by any type of user, and the summary does not depend on the subject of the document. All the information is at the same level of importance and is not user specific.

Query-based summarization [1] is a question-answer type where the summary is the result of a query. It reflects the user's view and cannot be used by every type of user.

4. Based on limitation
A summary can be classified based on the limitation of the input text [1]. Genre-specific systems accept only a special type of input, such as newspaper articles, stories or manuals, and are limited to the type of input they can accept.

Domain-independent systems can accept different types of text. They are not dependent on the domain and can be used by any type of user. There are few systems that are domain dependent.
Table 1. Comparison of summarization systems

Extractive
  Description: consists of selecting important sentences from the original document based on statistical features.
  Advantages: easy to compute, since it does not deal with semantics, and more successful in practice.
  Disadvantages: suffers from inconsistencies, lack of balance and lengthy summaries.
  Existing system: Summ-applet, designed by Surrey University [15].

2. Details
Indicative
  Description: presents only the main idea of the text to the user; can be used to quickly decide whether a text is worth reading. Typical length: 5 to 10% of the original text.
  Advantages: encourages the user to read the main document in depth; used for quick categorization; easier to produce.
  Disadvantages: detailed information is not present.
  Example: the information presented on the back of a movie pack or a novel.

Informative
  Description: serves as a substitution for the main document. Typical length: 20 to 30% of the original text.
  Advantages: gives concise information about the main text.
  Disadvantages: does not provide a quick overview.
  Existing system: SumUM [3].

3. Content
Generic
  Description: generalized summary irrespective of the type of user; all information is at the same level of importance.
  Advantages: can be used by any type of user.
  Disadvantages: provides the author's view and is not user specific.
  Existing system: SUMMARIST [8].

Query based
  Description: the user states the topic of the original text in the form of a query and the system extracts only that information.
  Advantages: specific information can be searched; it reflects the user's interest.
  Disadvantages: cannot be used by every type of user; depends on the type of user.
  Existing system: Mitre's WebSumm [17].

4. Limitation
Domain dependent
  Description: summarizes text whose subject can be defined within a fixed domain.
  Advantages: aware of the special subject of the documents on which it depends.
  Disadvantages: limited to that domain.
  Existing system: TRESTLE [15].

Genre specific
  Description: accepts only a special type of text as input.
  Advantages: overcomes the problem of summarizing heterogeneous documents.
  Disadvantages: limited by the template of the text.
  Existing system: Newsblaster.

Domain independent
  Description: can accept any type of text; it is not domain dependent.
  Advantages: any type of text input is accepted.
  Disadvantages: difficult to implement.
  Existing system: Copy and Paste system [15].

5. Number of input documents
Single document
  Description: can accept only one input document.
  Advantages: less overhead.
  Disadvantages: cannot summarize multiple documents on related topics.
  Existing system: Copy and Paste system [15].

Multi-document
  Description: multiple documents on the same topic can be summarized into a single document.
  Advantages: can accept multiple input documents.
  Disadvantages: difficult to implement.
  Existing system: SUMMONS, designed by Columbia University [15].

6. Language
Mono-lingual
  Description: accepts input only in a specific language; the output is in that language.
  Advantages: needs to work with only one language.
  Disadvantages: cannot handle documents in different languages.
  Existing system: FarsiSum [18].

Multi-lingual
  Description: can accept documents in different languages.
  Advantages: can deal with multiple languages.
  Disadvantages: difficult to implement.
  Existing system: SUMMARIST (English, Japanese, Spanish) [14].

Text summarization methods can be classified mainly into two categories: extractive and abstractive. Text summarization by extraction simply extracts a few sentences from the original document as they are and adds them to the summary.

4.1 Extractive Summarization
Extractive text summarization works by selecting a subset of the existing words, phrases or sentences of the original text to form the summary. Extractive summarization uses a statistical approach for selecting the important sentences or keywords from the document; the various statistical methods are discussed in the sections below. Extracted sentences tend to be longer than average, and conflicting information may not be presented accurately.
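A minimal sketch of such an extractive pipeline is shown below. It is only an illustration of the general idea (frequency-based sentence scoring followed by selection); the sentence splitter, the stop-word list and the summarize helper are assumptions made for this example, not components of any of the systems cited above.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "that"}

def split_sentences(text):
    # Naive sentence splitter; a real system would use a trained tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_frequencies(sentences):
    words = [w for s in sentences for w in re.findall(r"[a-z']+", s.lower())
             if w not in STOPWORDS]
    return Counter(words)

def summarize(text, ratio=0.3):
    """Score each sentence by the frequency of its content words and
    return the top-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = word_frequencies(sentences)
    scores = {i: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
              for i, s in enumerate(sentences)}
    k = max(1, int(len(sentences) * ratio))
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)
```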
4.2 Abstractive Summarization
The abstractive text summarization method generates sentences from a semantic representation and then uses natural language generation techniques to create a summary that is closer to
what a human might generate. Such a summary might contain
words not explicitly present in the original. It consists of
understanding the original text and re-telling it in fewer
words. It uses a linguistic approach to understand the original text and then generates the summary.
Abstractive summaries are more accurate compared to extractive summaries but are difficult to generate because abstractive summarization
needs a deep understanding of NLP tasks. Both abstractive and extractive summarization use either statistical or linguistic approaches, or a combination of both, to generate the summary.

4.3 Statistical Approaches
Statistical approaches [1] summarize a document using statistical features of its sentences, such as the title, location and term frequency, assigning weights to keywords, calculating a score for each sentence and selecting the highest scored sentences for the summary. The importance of a sentence can be decided by several methods, such as:

4.3.1 Title method [16]
This method [16][4] states that sentences containing words that appear in the title are considered more important and are more likely to be included in the summary. The score of a sentence is calculated as the number of words it has in common with the title. The title method cannot be effective if the document does not include any title information.

4.3.2 Location method [16]
Weights are assigned to text based on its location: whether it appears in the lead, medial or final position of a paragraph, or in a prominent section of the document such as the introduction or conclusion. The leading sentences of a document, the last few sentences and the conclusion are considered more important and are included in the summary. Hovy & Lin [14] and Edmundson [16] used this method. The location method relies on the intuition that headings, sentences at the beginning and end of the text, and text formatted in bold contain information that is important for the summary.
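The two heuristics just described can be sketched as simple scoring functions. The snippet below is illustrative only: the function names and the positional weights (2.0, 1.5, 1.0) are arbitrary choices, not values prescribed by [4], [14] or [16].

```python
def title_score(sentence, title):
    # Count the words the sentence shares with the title.
    title_words = set(title.lower().split())
    return sum(1 for w in set(sentence.lower().split()) if w in title_words)

def location_score(index, total):
    # Give extra weight to sentences that open or close the document;
    # the weights here are arbitrary illustrative values.
    if index == 0 or index == total - 1:
        return 2.0
    if index < 3 or index >= total - 3:
        return 1.5
    return 1.0
```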
4.3.3 tf-idf method [7]
Term frequency-inverse document frequency (tf-idf) is a numerical statistic which reflects how important a word is to a document. It is often used as a weighting factor in information retrieval and text mining, and is widely used for stop-word filtering in text summarization and categorization applications. The tf-idf value increases proportionally with the number of times a word appears in the document. tf-idf weighting schemes are often used by search engines as a central tool for scoring and ranking a document's relevance given a user query.

The term frequency f(t, d) is the raw frequency of a term in a document, that is, the number of times that term t occurs in document d. The inverse document frequency is a measure of whether the term is common or rare across all documents; it is obtained by dividing the total number of documents by the number of documents containing the term.
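In the commonly used log-scaled variant, the weight of term t in document d is tf-idf(t, d) = f(t, d) x log(N / n_t), where N is the total number of documents and n_t is the number of documents containing t. The sketch below computes these weights for a toy corpus; it is a minimal illustration, not the exact weighting used in [7].

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {term: tf-idf weight} dictionary per document,
    using raw term frequency and idf = log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats play"]
print(tf_idf(docs)[0])
```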
4.3.4 Cue word method [16]
Weights are assigned to text based on the significance of cue words, with positive weights for phrases such as "verified", "significant", "best" and "this paper", and negative weights for words such as "hardly" and "impossible". Cue phrases are usually genre dependent. Sentences containing such cue phrases can be included in the summary. The cue phrase method is based on the assumption that such phrases provide a "rhetorical" context for identifying important sentences; the source abstraction in this case is a set of cue phrases and the sentences that contain them.
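A toy version of cue-phrase weighting is shown below; the phrase lists and weights are invented for illustration and would in practice be tuned per genre.

```python
# Illustrative cue-phrase weights; real lists are genre dependent.
BONUS_PHRASES = {"significant": 1.0, "in conclusion": 1.5, "this paper": 1.0, "best": 0.5}
STIGMA_PHRASES = {"hardly": -1.0, "impossible": -1.0, "for example": -0.5}

def cue_score(sentence):
    # Sum the weights of every bonus or stigma phrase found in the sentence.
    s = sentence.lower()
    score = 0.0
    for phrase, weight in {**BONUS_PHRASES, **STIGMA_PHRASES}.items():
        if phrase in s:
            score += weight
    return score
```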
All of the statistical features described above are used by extractive text summarization.

4.4 Linguistic Approaches
Linguistics is the scientific study of language, which includes the study of semantics and pragmatics. Semantics concerns how meaning is inferred from words and concepts, and pragmatics concerns how meaning is inferred from context. Linguistic approaches consider the connections between words and try to find the main concept by analyzing them. Abstractive text summarization is based on linguistic methods, which involve semantic processing for summarization.

Linguistic approaches have some difficulties in using high-quality linguistic analysis tools (a discourse parser, etc.) and linguistic resources (WordNet, lexical chains, context vector space, etc.). Barzilay and Elhadad [6] and Miller et al. proposed and developed strong concepts with the help of linguistic features, but these require much memory for storing the linguistic information, such as WordNet, and processor capacity, because of the additional linguistic knowledge and complex linguistic processing.

4.4.1 Lexical chain [6][10]
The concept of lexical chains was first introduced by Morris and Hirst [9]. Basically, lexical chains exploit the cohesion among an arbitrary number of related words. Lexical chains can be computed in a source document by grouping (chaining) sets of words that are semantically related. Identity, synonymy and hypernymy/hyponymy are the relations among words that may cause them to be grouped into the same lexical chain. Lexical chains have also been used for information retrieval and grammatical error correction [6][10]. In computing lexical chains, noun instances must be grouped according to the above relations, but each noun instance must belong to exactly one lexical chain, and there are several difficulties in determining which lexical chain a particular word instance should join. Words must be grouped such that they create the strongest and longest lexical chains.
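A highly simplified, greedy sketch of chaining nouns through WordNet relations is given below. It assumes NLTK and its WordNet corpus are available, takes already extracted nouns as input, and ignores the sense-disambiguation and chain-strength issues discussed above, which the algorithms in [6] and [10] handle explicitly.

```python
# Requires: pip install nltk ; python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def related(word_a, word_b):
    """True if two nouns share a synset or stand in a direct
    hypernym/hyponym relation in WordNet."""
    for sa in wn.synsets(word_a, pos=wn.NOUN):
        for sb in wn.synsets(word_b, pos=wn.NOUN):
            if sa == sb or sb in sa.hypernyms() or sa in sb.hypernyms():
                return True
    return False

def build_chains(nouns):
    """Greedy chaining: attach each noun to the first chain it relates to,
    otherwise start a new chain. Real algorithms also score alternative
    interpretations instead of committing greedily."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

print(build_chains(["car", "automobile", "wheel", "person", "driver"]))
```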
4.4.2 WordNet [19]
WordNet is an on-line lexical database for the English language. It groups English words into sets of synonyms called synsets, and provides a short meaning of each synset as well as the semantic relations between synsets. WordNet also serves as a thesaurus and an on-line dictionary, and is used by many systems for determining the relationships between words. A thesaurus is a reference work that contains a list of words grouped together according to similarity of meaning. Semantic relations between words are represented by synonym sets and hyponym trees, and WordNet is used for building lexical chains according to these relations. WordNet contains more than 118,000 different word forms [19]. LexSum is a summarization system which uses WordNet for generating lexical chains.
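For illustration, this is how a system might query WordNet for synsets, glosses, synonyms and hypernyms through NLTK (assuming the WordNet corpus has been downloaded):

```python
# Requires the NLTK WordNet corpus: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

for synset in wn.synsets("summary"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", synset.lemma_names())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```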
4.4.3 Graph theory [11]
Graph theory can be applied to represent the structure of a text as well as the relationships between the sentences of a document. Sentences in the document are represented as nodes, and the edges between nodes are considered connections between sentences. These connections are related by a similarity relation: by developing different similarity criteria, the similarity between two sentences is calculated and each sentence is scored. Whenever a summary is to be produced, the sentences with the highest scores are chosen for the summary. In graph ranking algorithms, the importance of a vertex within the graph is iteratively computed from the entire graph.

The TextRank algorithm is a graph-based algorithm applied in summarization. A graph is constructed by adding a vertex for each sentence in the text, and edges between vertices are established using sentence inter-connections.
These connections are defined using a similarity relation, where similarity is measured as a function of content overlap. The overlap of two sentences can be determined as the number of common tokens between the lexical representations of the two sentences. The iterative part of the algorithm is then applied to the graph of sentences; when processing is finished, the vertices (sentences) are sorted by their scores and the top-ranked sentences are included in the result.
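A compact sketch of this graph-ranking idea is shown below: sentences become nodes, overlap-based similarity provides edge weights, and a PageRank-style iteration scores the nodes. The damping factor and length normalization follow the common TextRank formulation, but the code is an illustration rather than a faithful reimplementation of the published algorithm.

```python
import numpy as np

def overlap_similarity(s1, s2):
    """Content overlap normalized by sentence lengths (TextRank-style)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (np.log(len(w1)) + np.log(len(w2)))

def rank_sentences(sentences, d=0.85, iterations=50):
    n = len(sentences)
    sim = np.array([[overlap_similarity(a, b) if i != j else 0.0
                     for j, b in enumerate(sentences)]
                    for i, a in enumerate(sentences)])
    # Normalize each row so outgoing edge weights sum to one.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    transition = sim / row_sums
    scores = np.ones(n)
    for _ in range(iterations):
        scores = (1 - d) + d * transition.T @ scores
    return scores

sents = ["Graph methods rank sentences.", "Sentences become graph nodes.",
         "Edges carry overlap weights.", "Top ranked sentences form the summary."]
scores = rank_sentences(sents)
summary = [sents[i] for i in np.argsort(scores)[::-1][:2]]
```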
Extracting a summary by semantic graph generation [12] is a method which uses subject-object-predicate (SOP) triples from individual sentences to create a semantic graph of the original document. Using the Support Vector Machines learning algorithm, it trains a classifier to identify the SOP triples of the document semantic graph that belong to the summary. The main functional elements of sentences and clauses are usually subjects, objects and predicates, so identifying and exploiting links among them can facilitate the extraction of relevant text. The method thus creates a semantic graph of a document based on logical-form subject-predicate-object (SPO) triples and learns a relevant sub-graph that can be used for creating summaries.
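As a rough illustration of extracting subject-predicate-object triples, the sketch below uses spaCy dependency parses (an assumed tool, not the one used in [12]); the cited method builds its triples from logical-form analysis and then trains an SVM over graph features, which is not reproduced here.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def sop_triples(text):
    """Collect (subject, predicate, object) triples from dependency parses.
    Only a rough approximation of the logical-form triples used in [12]."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

print(sop_triples("The system builds a semantic graph. A classifier selects the triples."))
```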
4.4.4 Clustering [5]
Clustering is used to summarize a document by grouping similar data or sentences into clusters. The method observes that the summarization result depends not only on the sentence features but also on the sentence similarity measure. MultiGen is a multi-document system of this kind in the news domain. A sentence clustering method developed by Zhang Pei-ying and Li Cun-he is discussed in [5]: the K-means algorithm is used to form the clusters, the sentences of the document are clustered, and the topic sentences are extracted to generate an extractive summary of the document.
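A minimal sketch of this clustering idea, assuming scikit-learn, is given below: sentences are embedded as tf-idf vectors, grouped with K-means, and the sentence closest to each cluster centre is kept as a topic sentence. Choosing the number of clusters and the exact selection rule are simplifications of the method described in [5].

```python
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_summary(sentences, n_clusters=3):
    """Cluster sentences with K-means over tf-idf vectors and keep the sentence
    closest to each cluster centre, in original document order."""
    n_clusters = min(n_clusters, len(sentences))
    vectors = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        centre = km.cluster_centers_[c]
        distances = np.linalg.norm(vectors[members].toarray() - centre, axis=1)
        chosen.append(members[int(np.argmin(distances))])
    return [sentences[i] for i in sorted(chosen)]
```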
In this way sentences are clustered and selected for summarization. Linguistic approaches are harder to implement, whereas statistical approaches are more successful but have a few limitations.

5. CONCLUSION
As natural language understanding improves, computers will be able to learn from the information available on-line and apply what they have learned in the real world. Combined with natural language generation, computers will become more and more capable of receiving and giving instructions.

Due to the rapid growth of technology and the use of the Internet, there is information overload. This problem can be addressed by strong text summarizers which produce a summary of a document to help the user, so there is a need to develop systems with which a user can efficiently retrieve a summarized document. One possible solution is to summarize a document using either extractive or abstractive methods. Text summarization by extraction is easier to build, while summarization by the abstractive technique is stronger because it produces a summary that is semantically coherent, although such summaries are more difficult to produce. This paper discussed the different types of summarization methods used for summarizing a document and the advantages and disadvantages of each method.

6. REFERENCES
[1] Saeedeh Gholamrezazadeh, Mohsen Amini Salehi, Bahareh Gholamzadeh. A Comprehensive Survey on Text Summarization Systems. In: Proc. 2nd International Conference on Computer Science and its Applications, 2009.
[2] Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J., 1999. Summarizing text documents: Sentence selection and evaluation metrics. In: Proc. ACM-SIGIR'99, pp. 121-128.
[3] Saggion, H., Lapalme, G. Generating indicative-informative summaries with SumUM. Computational Linguistics, Vol. 28, No. 4, 2002, pp. 497-526.
[4] Luhn, H.P., 1958. The automatic creation of literature abstracts. IBM J. Res. Develop., 159-165.
[5] Zhang Pei-ying, Li Cun-he. Automatic text summarization based on sentences clustering and extraction.
[6] Barzilay, R., Elhadad, M. Using Lexical Chains for Text Summarization. In: Proc. ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997, pp. 10-17.
[7] Youngjoong Ko, Jungyun Seo, 2008. An effective sentence-extraction technique using contextual information and statistical approaches for text summarization.
[8] Eduard Hovy and Chin-Yew Lin. Automated text summarization in SUMMARIST. MIT Press, 1999, pp. 81-94.
[9] Morris, J., Hirst, G., 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics 17(1), 21-43.
[10] Silber, H.G., McCoy, K.F. Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization. Computational Linguistics 28(4), 487-496, 2002.
[11] Kedar Bellare, Anish Das Sharma, Atish Das Sharma, Navneet Loiwal and Pushpak Bhattacharyya. Generic Text Summarization Using WordNet. In: Language Resources Engineering Conference (LREC 2004), Barcelona, May 2004.
[12] J. Leskovec, M. Grobelnik, N. Milic-Frayling. Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research, 2005.
[13] D. Radev, E. Hovy, K. McKeown. Introduction to the Special Issue on Summarization. Computational Linguistics, Vol. 28, No. 4, pp. 399-408, 2002.
[14] Eduard Hovy and Chin-Yew Lin. Automated text summarization in SUMMARIST. MIT Press, 1999, pp. 81-94.
[15] Http://Web.science.mq.edu.au/swan/summarization/projects_full.html
[16] Edmundson, H.P., 1969. New methods in automatic extracting. J. ACM 16(2), 264-285.
[17] Kupiec, Julian M., Schuetze, Hinrich. System for genre-specific summarization of documents. Xerox Corporation, 2004.
[18] Martin Hassel, Nima Mazdak. A Persian text summarizer. International Conference on Computational Linguistics, 2004.
[19] William P. Doran, Nicola Stokes, John Dunnion, Joe Carthy. Comparing lexical chain-based summarization approaches using an extrinsic evaluation. In: Proc. Global WordNet Conference (GWC 2004), 2004.