Text Summarization: An Overview
Samrat Babar, Sanjeevan Engineering and Technology Institute, Panhala
October 2013
1.Abstract:
In this new era, where tremendous information is available on the internet, it is important to provide improved mechanisms for extracting information quickly and efficiently. It is very difficult for human beings to manually extract the summary of a large document of text. With the vast amount of textual material available on the internet, two problems arise: searching for relevant documents among the large number available, and absorbing the relevant information from them. Automatic text summarization is needed to solve both problems. Text summarization is the process of identifying the most important, meaningful information in a document or set of related documents and compressing it into a shorter version that preserves its overall meaning.
2.Introduction:
Before turning to text summarization, we first have to know what a summary is. A summary is a text that is produced from one or more texts, conveys the important information of the original text, and is shorter than it. The goal of automatic text summarization is to present the source text as a shorter version that preserves its semantics. The most important advantage of a summary is that it reduces reading time.
Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form. Abstractive summarization consists of understanding the main concepts in a document and then expressing those concepts in clear natural language.
There are two different groups of text summaries: indicative and informative. An indicative summary only represents the main idea of the text to the user; the typical length of this type of summary is 5 to 10 percent of the main text. Informative summarization systems, on the other hand, give concise information about the main text; the length of an informative summary is 20 to 30 percent of the main text.
3. Abstractive text summarization : This process typically involves three steps:
3.1. Topic Identification : The most prominent information in the text is identified. Different techniques are used for topic identification, such as position, cue phrases, and word frequency (a word-frequency sketch follows this list). Methods based on the position of phrases are the most useful for topic identification.
3.2. Interpretation : Abstractive summaries need to go through an interpretation step, in which different subjects are fused in order to form a general content.
3.3. Summary Generation : In this step, the system uses a text generation method to produce the summary.
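As a minimal sketch of word-frequency topic identification (step 3.1): the function below ranks content words by frequency as a rough topic signature. The stop-word list and all names here are illustrative assumptions, not from the paper.

    from collections import Counter
    import re

    # Tiny illustrative stop-word list; a real system would use a fuller one.
    STOPWORDS = {"the", "a", "an", "is", "of", "in", "to", "and", "it", "that"}

    def top_topic_words(text, k=5):
        # Rank non-stop-words by frequency as a rough topic signature.
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS)
        return [w for w, _ in counts.most_common(k)]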
4. Extractive text summarization : This process can be divided into two steps: a pre-processing step and a processing step. Pre-processing produces a structured representation of the original text. It usually includes: a) Sentence boundary identification: in English, a sentence boundary is identified by the presence of a dot at the end of the sentence. b) Stop-word elimination: common words with no semantics are removed. c) Stemming: the purpose of stemming is to obtain the stem or radix of each word, which emphasizes its semantics.
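A minimal sketch of these three pre-processing steps using the NLTK library; it assumes NLTK and its "punkt" and "stopwords" data are installed, and the function name preprocess is an illustrative choice, not from the paper.

    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text):
        stop = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        sentences = sent_tokenize(text)  # a) sentence boundary identification
        processed = []
        for s in sentences:
            tokens = [t.lower() for t in word_tokenize(s) if t.isalpha()]
            tokens = [t for t in tokens if t not in stop]        # b) stop-word elimination
            processed.append([stemmer.stem(t) for t in tokens])  # c) stemming
        return sentences, processed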
In the processing step, the features influencing the relevance of sentences are decided and calculated, and weights are assigned to these features using a weight-learning method. The final score of each sentence is determined using a feature-weight equation, and the top-ranked sentences are selected for the final summary.
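A minimal sketch of such a feature-weight equation, score(s) = sum over i of w_i * f_i(s); the feature names and weight values below are assumptions for illustration only:

    def sentence_score(features, weights):
        # Weighted sum of feature values for one sentence.
        return sum(weights[name] * value for name, value in features.items())

    weights = {"position": 0.3, "title_words": 0.4, "length": 0.3}   # assumed weights
    features = {"position": 1.0, "title_words": 0.5, "length": 0.7}  # f_i values in [0, 1]
    print(sentence_score(features, weights))  # ~0.71

The sentences are then sorted by this score, and the highest-scoring ones form the summary.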
7.1 TF-IDF Method : TF-IDF is a numerical statistic which reflects how important a word is in a given document. The TF-IDF value of a word increases proportionally with the number of times the word appears in the document. This method works in the weighted term-frequency and inverse sentence frequency paradigm, where sentence frequency is the number of sentences in the document that contain a given term. The resulting sentence vectors are scored by their similarity to the query, and the highest-scoring sentences are picked to be part of the summary; the summarization is therefore query-specific.
The hypothesis assumed by this approach is that if there are more "specific words" in a given sentence, then the sentence is relatively more important. The target words are usually nouns. The method compares the term frequency (tf) in a document (in this case each sentence is treated as a document) with the document frequency (df), which is the number of documents in which the word occurs. The TF-IDF score of a word w is calculated as follows (the standard formulation):

    tfidf(w) = tf(w) × log(N / df(w))

where N is the total number of documents (here, sentences).
Example:
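A minimal tf-isf scoring sketch (an illustrative reconstruction, not the paper's own code): each sentence is scored by the summed tf × log(N / sf) weights of its words, with sf the sentence frequency defined above.

    import math
    from collections import Counter

    def tf_isf_scores(sentences):
        # sentences: list of tokenized sentences (lists of words).
        n = len(sentences)
        sf = Counter()  # sentence frequency of each word
        for sent in sentences:
            sf.update(set(sent))
        scores = []
        for sent in sentences:
            tf = Counter(sent)
            scores.append(sum(tf[w] * math.log(n / sf[w]) for w in tf))
        return scores

    print(tf_isf_scores([["cat", "sat"], ["cat", "ran"], ["dog", "barked"]]))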
7.3 Graph Theoretic Approach : In this technique, there is a node for every sentence. Two sentences are connected with an edge if they share some common words; in other words, if their similarity is above some threshold. This representation yields two results. First, the partitions contained in the graph (those sub-graphs that are unconnected to the other sub-graphs) form the distinct topics covered in the documents. The second result of the graph-theoretic method is the identification of the important sentences in the document: nodes with high cardinality (the number of edges connected to a node) are the important sentences of their partition, and hence are given higher preference for inclusion in the summary.
The figure shows an example graph for a document. It can be seen that there are about 3-4 topics in the document; the encircled nodes can be seen to be informative sentences, since they share information with many other sentences in the document. The graph-theoretic method may also be adapted easily for visualization of inter-document and intra-document similarity.
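A minimal sketch of this technique (all names and the word-overlap similarity are illustrative assumptions): build the sentence graph, then rank nodes by degree.

    def build_graph(sentences, threshold=0.2):
        # sentences: list of sets of words; returns an adjacency map by index.
        graph = {i: set() for i in range(len(sentences))}
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                overlap = len(sentences[i] & sentences[j])
                sim = overlap / max(1, min(len(sentences[i]), len(sentences[j])))
                if sim > threshold:  # edge if similarity is above the threshold
                    graph[i].add(j)
                    graph[j].add(i)
        return graph

    def rank_by_degree(graph):
        # Nodes with high cardinality are the candidate summary sentences.
        return sorted(graph, key=lambda i: len(graph[i]), reverse=True)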
7.4 Machine Learning approach :
In this method, a training dataset is used for reference, and the summarization process is modeled as a classification problem: sentences are classified as summary sentences or non-summary sentences based on the features that they possess. The classification probabilities are learnt statistically from the training data using Bayes' rule:

    P(s ∈ S | F1, F2, ..., FN) = P(F1, F2, ..., FN | s ∈ S) · P(s ∈ S) / P(F1, F2, ..., FN)

where s is a sentence from the document collection, F1, F2, ..., FN are the features used in classification, S is the summary to be generated, and P(s ∈ S | F1, F2, ..., FN) is the probability that sentence s will be chosen to form the summary given that it possesses features F1, F2, ..., FN.
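A minimal sketch of this classifier under a naive independence assumption across features (binary features; all probability values below are assumed, standing in for statistics learnt from training data):

    def posterior_summary(features, stats):
        # Bayes' rule with naive feature independence, normalized over both classes.
        score = {}
        for label in ("summary", "non_summary"):
            p = stats["prior"][label]
            for i, f in enumerate(features):
                likelihood = stats["likelihood"][label][i]  # P(Fi = 1 | label)
                p *= likelihood if f else (1.0 - likelihood)
            score[label] = p
        return score["summary"] / (score["summary"] + score["non_summary"])

    stats = {
        "prior": {"summary": 0.2, "non_summary": 0.8},  # assumed values
        "likelihood": {"summary": [0.7, 0.6], "non_summary": [0.2, 0.3]},
    }
    print(posterior_summary([1, 1], stats))  # ~0.64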
7.6 Neural Network based Approach : In this approach, each sentence is represented by a vector of features, such as f5 (sentence length).
The first phase of the process involves training the neural networks to learn the types of sentences
that should be included in the summary. Once the network has learned the features that must exist in
summary sentences, we need to discover the trends and relationships among the features that are inherent in
the majority of sentences. This is accomplished by the feature fusion phase, which consists of two steps: 1)
eliminating uncommon features; and 2) collapsing the effects of common features.
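A rough sketch of the two feature-fusion steps (everything here, including the thresholds and binning, is an assumption for illustration, not the paper's exact procedure):

    def eliminate_uncommon(feature_columns, min_nonzero_fraction=0.1):
        # 1) Drop features that occur in too few training sentences.
        kept = {}
        for name, values in feature_columns.items():
            nonzero = sum(1 for v in values if v != 0)
            if nonzero / len(values) >= min_nonzero_fraction:
                kept[name] = values
        return kept

    def collapse_effects(values, step=0.25):
        # 2) Collapse nearby feature values into common levels (coarse binning).
        return [round(v / step) * step for v in values]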
8. Summary Evaluation : Summary evaluation is a very important aspect of text summarization. Generally, summaries can be evaluated using intrinsic or extrinsic measures: intrinsic methods attempt to measure summary quality using human evaluation, while extrinsic methods measure it through a task-based performance measure, such as an information-retrieval task.
Evaluation methods are useful for assessing the usefulness and trustworthiness of a summary. Evaluating qualities like comprehensibility, coherence, and readability is genuinely difficult. System evaluation may be performed manually by experts against a gold standard. Qualitative evaluation is done by counting the number of sentences selected by the system that match the human gold standard. For quantitative assessment of the summary, the ROUGE evaluation tool is used, which reports precision, recall, and F-measure.
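A minimal ROUGE-1 sketch (unigram overlap; an illustration, not the official ROUGE toolkit), computing precision, recall, and F-measure of a system summary against one human reference:

    from collections import Counter

    def rouge_1(system_tokens, reference_tokens):
        sys_counts = Counter(system_tokens)
        ref_counts = Counter(reference_tokens)
        overlap = sum(min(sys_counts[w], ref_counts[w]) for w in sys_counts)
        precision = overlap / max(1, len(system_tokens))
        recall = overlap / max(1, len(reference_tokens))
        f = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f

    print(rouge_1("the cat sat".split(), "the cat sat down".split()))
    # (1.0, 0.75, 0.857...)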
9.Conclusion:
Automatic text summarization is an old challenge, but current research is moving towards emerging areas such as biomedicine, product reviews, education, emails, and blogs, because of the information overload in these areas, especially on the World Wide Web. Automated summarization is an important area of NLP (Natural Language Processing) research; it consists of automatically creating a summary of one or more texts. The purpose of extractive document summarization is to automatically select a number of indicative sentences, passages, or paragraphs from the original document. Text summarization approaches based on neural networks, graph theory, fuzzy logic, and clustering have, to an extent, succeeded in producing effective summaries of documents. Both extractive and abstractive methods have been researched; most summarization techniques are based on extractive methods. Abstractive methods produce summaries closer to those made by humans, but abstractive summarization currently requires heavy machinery for language generation and is difficult to adapt to domain-specific areas.