Text Summarization Using Machine Learning LSTM
All content following this page was uploaded by Neeraj Kumar Sirohi on 08 July 2022.
Abstract
A massive amount of online textual data is generated by social media, the web, and other information-centric applications. Selecting the vital information from a large text requires reading the full article and producing a summary without losing the critical information of the document; this process is called summarization. Text summarization can be done by a human, which requires expertise in the area and is tedious and time consuming, or by a system, known as automatic text summarization, which generates the summary automatically. There are two main categories of automatic text summarization: extractive and abstractive. An extractive summary is produced by picking important, highly ranked sentences and words from the text document, whereas the sentences and words in a summary generated by an abstractive method may not be present in the original text.
This article focuses on the different ATS (Automatic Text Summarization) techniques that have been investigated to date. The paper begins with a concise introduction to automatic text summarization, then discusses recent developments in extractive and abstractive summarization methods, moves on to a literature survey, and finally sums up with a proposed technique using an LSTM encoder-decoder for abstractive text summarization, along with some directions for future work.
Key-words: ATS, Text Summarization, Abstractive, Extractive, Neural Network, LSTM, Encoder, Decoder.
1. Introduction
Extracting valuable information from gigantic texts is a challenging task nowadays, because large amounts of unstructured information are available on the net in the form of articles, blogs, and reports.
There are various groupings for TS classifications, as demonstrated in Fig. 2. TS schemes can be categorised according to any of the criteria described below.
Summary based on input size: Depending on the documents given as input, a summary can be generated from a single text document or from numerous documents [7].
Input size refers to the total number of input documents from which the target summary is generated. As described in Fig. 1, in Single-Source-Document Summarization (SSDS) the user supplies a single document and the system produces a summary (a shortened form of the source document) while preserving the critical information [8].
Summary based on the nature of the output: summaries are categorised as query-based or generic. A summary generated by the generic method is based on extracting the critical information from one or more texts, and gives a general idea of their contents [9]. A query-based summarization, by contrast, typically deals with multiple documents, where homogeneous documents are retrieved from a large corpus of documents based on a particular query [10]. A query-based summary considers the data that is most relevant to the query.
Basically, there are two main approaches to automatic text summarization (ATS), extractive and abstractive, and each approach can be implemented with any of several techniques. This section provides an overview of the techniques used for each approach in the literature.
The architecture of an extractive summarization scheme, shown in Fig. 3, has the following components:
1) Pre-processing of the input document (i.e. tokenization, lowercasing, normalization, etc.).
2) Post-processing, such as restructuring the extracted sentences, substituting pronouns with their root forms, and replacing relative temporal expressions with actual dates [15].
Between these stages, the processing steps are as follows:
1. Generate an intermediate representation: produce an appropriate representation of the input document in a simplified form (e.g. graphs, bi-grams, bag-of-words, etc.) [8].
2. Sentence scoring: assign a score, and hence a rank, to every sentence based on the input document [17].
3. Selection of the highest-scored sentences: pick the most significant sentences from the input text(s) and combine them to produce the summary [17], [18]. The length of the final summary depends on a chosen threshold value or a cut-off limit on the maximum length, and the selected sentences keep the same ordering as in the input document [19].
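The three processing steps above can be sketched as a minimal frequency-based extractive pipeline. The word-frequency scoring and the top-k cut-off are illustrative choices, not prescribed by the surveyed papers:

```python
from collections import Counter

def extractive_summary(text, k=2):
    """Score sentences by averaged word frequency and keep the top-k,
    preserving the original sentence order (the cut-off step)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # 1) intermediate representation: bag-of-words over the whole document
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # 2) sentence scoring: sum of word frequencies, normalised by length
    scores = [sum(freq[w.lower()] for w in s.split()) / len(s.split())
              for s in sentences]
    # 3) selection: indices of the k highest-scored sentences, in document order
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return ". ".join(sentences[i] for i in top) + "."

doc = ("Text summarization shortens a document. "
       "Extractive summarization picks sentences from the document. "
       "The weather was nice yesterday.")
summary = extractive_summary(doc, k=2)
```

The off-topic third sentence shares few words with the rest of the document, so it scores lowest and is dropped.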
These approaches extract the most significant words and sentences from the input text based on a statistical analysis of feature sets. "Most favourably located" and "most frequent" are the common criteria for marking sentences or words as the "most vital" ones in the document [15]. Sentence scoring in this approach involves the following steps [9]:
1) Calculate and assign a weight to every sentence by applying some linguistic and statistical features [15].
2) Assign a final weight to each sentence in the text [9], calculated via a feature-score equation [15] (i.e. summing up all the nominated features' scores to determine the final score of each sentence).
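The feature-score equation in step 2 is typically just a weighted sum over per-sentence features. The two features and the weights below are illustrative assumptions, not values from the cited work:

```python
def sentence_score(position, num_sentences, term_weight,
                   w_pos=0.4, w_term=0.6):
    """Weighted feature-score: earlier sentences and heavier terms score higher.
    position is 0-based; term_weight is assumed already normalised to [0, 1]."""
    position_feature = 1.0 - position / num_sentences  # "favourably located"
    return w_pos * position_feature + w_term * term_weight

# an early sentence with a strong term weight outranks a late, weak one
first = sentence_score(position=0, num_sentences=10, term_weight=0.9)
last = sentence_score(position=9, num_sentences=10, term_weight=0.2)
```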
These approaches depend on recognizing the topic of the document, i.e. its prime theme (what the text is all about). TF-IDF (Term Frequency-Inverse Document Frequency), topic-word selection, and lexical chains are the most common techniques for topic identification; the identified topics and their corresponding weights are kept in a simple table [17]. The basic steps involved in this process include [17]:
1) An intermediate representation of the input text is generated which holds the key topics of the document.
2) According to this representation, a weight is assigned to each sentence.
This method is used when there are multiple documents to summarize. It collects all key sentences that describe the main theme (key subject) of the documents and generates clusters of these sentences. Sentence selection in this approach is based on sentence centrality, which is calculated from word centrality using the TF-IDF approach; all sentences whose TF-IDF score is greater than or equal to a defined threshold are then selected [20]. Sentence scoring is performed through the following steps: 1) a centroid is constructed by calculating the TF-IDF of each sentence in the text (Mehta & Majumder, 2018) [21], and 2) the sentences whose words are closest to a particular cluster are selected.
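A pure-Python sketch of the TF-IDF and centroid steps, treating each sentence as one "document" for the IDF computation (a simplification of the cited approach, which operates over a multi-document corpus):

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """One TF-IDF vector per sentence; each sentence acts as a 'document'."""
    tokenised = [s.lower().split() for s in sentences]
    df = Counter(w for toks in tokenised for w in set(toks))
    n = len(sentences)
    vecs = []
    for toks in tokenised:
        tf = Counter(toks)
        vecs.append({w: (c / len(toks)) * math.log(n / df[w])
                     for w, c in tf.items()})
    return vecs

def centroid_score(vec, centroid):
    """Dot product with the centroid: how close a sentence is to the theme."""
    return sum(wt * centroid.get(w, 0.0) for w, wt in vec.items())

sentences = ["the cat sat", "the cat ran", "stocks fell sharply"]
vecs = tfidf_vectors(sentences)
# centroid: average weight of each term across all sentence vectors
centroid = {}
for v in vecs:
    for w, wt in v.items():
        centroid[w] = centroid.get(w, 0.0) + wt / len(vecs)
scores = [centroid_score(v, centroid) for v in vecs]
```

In a real system, sentences with a score at or above the chosen threshold would then be selected.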
Kobayashi, Noguchi, and Yatsuka (2015) [26] proposed a method in which text-level similarity depends on embeddings (i.e. distributed representations of terms). A text is treated as a set of sentences, and a sentence as a collection of terms (words). The task is formulated as the problem of maximizing a sub-modular objective defined through the negative sum of nearest-neighbour distances over the embedding distributions (i.e. the collection of word embeddings in a text) [26]. Working at the sentence level smooths out the subtler meaning lost when similarity is compressed to the text level. [27] recommend a summarization process for a single text that applies an RNN (Recurrent Neural Network) and a reinforcement-learning-based algorithm with a hierarchical encoder-selection network architecture. The significant features are selected by a sentence-level selection encoding method, and then the sentences that will form the summary are identified and picked out from the document.
A fuzzy-logic system for summarization is an efficient way to capture the similarity of human cognitive classifications of a document, and provides a well-organized technique for representing the sentence-feature values of the document [28]. Scoring of sentences is done through the following steps [22]:
1) Features such as term weight and sentence length are extracted from every sentence.
2) By applying the fuzzy-logic method (i.e. after introducing the essential rules into the knowledge base of the system), a score reflecting its importance is assigned to every sentence. Based on the rules defined in the knowledge base and on the sentence features, a value between 0 and 1 is assigned to each sentence.
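A toy version of the fuzzy scoring in step 2, using triangular membership functions and a single hand-written rule. The features, membership shapes, and rule below are invented for illustration; a real system would use a full fuzzy inference engine and a larger rule base:

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 at a and c, peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_sentence_score(term_weight, sentence_length):
    """Rule: IF term weight is HIGH AND length is MEDIUM THEN importance is HIGH.
    Fuzzy AND is min(); the rule strength is used directly as a score in [0, 1]."""
    high_weight = tri(term_weight, 0.4, 1.0, 1.6)    # peaks at weight 1.0
    medium_length = tri(sentence_length, 5, 15, 30)  # peaks at 15 words
    return min(high_weight, medium_length)

good = fuzzy_sentence_score(term_weight=0.9, sentence_length=14)
bad = fuzzy_sentence_score(term_weight=0.2, sentence_length=45)
```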
In conclusion, a better summary is produced when different approaches are used together, because the combination exploits the advantages of each approach and removes its inadequacies. Many summarization systems therefore combine various approaches to benefit from the merits of the different techniques [29], [30], [25]. [31] recommend an extractive summarization method that uses Fuzzy C-Means, TextRank, and combined sentence-scoring approaches to summarize Bengali text. [30] suggest a summarizer that generates an extractive summary and uses a distributional semantic system to capture the key ideas of the document; to create clusters of semantically equivalent sentences, the K-means
Advantages
Extractive techniques are simple, quicker, and easier to implement in comparison to abstractive techniques. This approach provides higher precision because, in extractive summarization, sentences are chosen directly from the original text, so the user gets a summary in the same vocabulary as the original document [32].
Disadvantages
This technique is far from the way human experts create summaries (Hou, Hu, & Bei, 2017) [51]. The main disadvantages of extractive summarization are as follows:
1. Some sentences in the summary are redundant [33].
2. Summary sentences may be longer than normal sentences [15].
3. Because a summary can be generated from various documents in the multi-document setting, conflicts between varying terminologies can arise [15].
4. Because vital information is spread among the sentences, conflicting evidence might not be covered [15]. If the source document contains several topics, the generated summary might be partial [25]. To overcome this problem, the user needs to cover all the topics, which causes a summary that is longer than expected.
Abstractive summarization techniques are classified into the following categories [38]:
1) Structure-based: these methods use predefined frameworks (e.g. trees, ontologies, templates, and graphs). The method identifies the most significant data in the source document, then uses one of the frameworks mentioned above to generate the summary [38].
2) Semantic-based: these methods use a semantic representation of the text together with natural language generation (NLG) schemes (e.g. predicate-argument structures, data items, and semantic graphs). The method creates a semantic picture of the source document through data items, semantic graphs, or predicate arguments, then uses an NLG scheme to generate the abstractive summary [38].
Ganesan et al. (2010) recommend the graph-based model known as "Opinosis", in which words are represented by nodes, topical data is linked to the nodes, and the sentence structure is represented through directed edges. The following steps are involved in processing the data under the graph-based approach described by Ganesan et al. (2010):
1) Graph creation: a word-based graph is generated to describe the source document, and 2) Summary creation: the process of creating the final abstractive summary. Numerous sub-paths of the graph are discovered and ranked as follows:
1. A score is assigned to every path, and the paths are sorted by score in descending order. Unused path scores are also included in the sorting process.
2. By applying a similarity measure (e.g. Jaccard), repeated (or very similar) paths are removed.
3. After step two, the topmost paths among those remaining are selected to produce the summary; the length of the summary depends on the number of paths selected, which is controlled by a constraint.
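Step 2's pruning of near-duplicate paths can be sketched with token-level Jaccard similarity; the threshold of 0.5 and the sample paths are illustrative choices, not values from the Opinosis paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def prune_paths(paths, threshold=0.5):
    """Keep a path only if it is not too similar to an already-kept path.
    Assumes paths arrive sorted by descending score (step 1)."""
    kept = []
    for p in paths:
        if all(jaccard(p, q) < threshold for q in kept):
            kept.append(p)
    return kept

paths = [
    ["battery", "life", "is", "great"],
    ["battery", "life", "is", "very", "great"],  # near-duplicate, pruned
    ["screen", "is", "too", "dim"],
]
kept = prune_paths(paths)
```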
These approaches recognize comparable sentences that share information, then gather those sentences and produce the summary [38]. Equivalent sentences are denoted by a tree-like structure. The dependency tree is constructed through parsing, and the tree-based approach is commonly used to describe a text document in the form of a tree. To produce the summary, tasks such as tree pruning and linearization (i.e. translating trees to strings) are carried out [38]. An abstractive summarizer for multiple documents was proposed by Kurisinkel, Zhang, and Varma [41]; the highlights of this technique are as follows:
1) The input documents of the corpus are parsed to find the set of all syntactic dependency trees.
2) From all the dependencies extracted in step 1, sets of partial dependency trees (with flexible sizes) are selected.
3) The selected partial dependency trees are clustered to assure topical coverage.
This approach is based on defining rules and classes to determine the vital ideas in the input document; the summary is then generated from these ideas. The steps involved in this approach are [38]:
1) The input document is classified based on the relationships and ideas present in it.
2) A query is formulated according to the domain of the input.
3) The queries are answered by discovering the relationships and ideas of the document, and finally,
4) these answers are passed into predefined output rules to create the abstractive summary.
Genest and Lapalme (2012) [42] suggest a style based on abstraction schemes. Each abstraction scheme is designed to answer a small group of themes; it involves content-selection heuristics, rules for IE (Information Extraction), and simple generation patterns, and a pattern is constructed for every scheme. All of these rules are created manually. An abstraction scheme answers one or more aspects that can be linked to the same theme. The Information Extraction rules may detect numerous candidates for every aspect, and the content-selection component chooses the best one, which is passed directly to the summary-generation unit.
These approaches use a semantic representation (e.g. predicate-argument structures, data objects, or semantic graphs) of the input text(s), then pass this information to an NLG (natural language generation) scheme where noun and verb phrases are used to produce the final abstractive summary [38]. A multi-document abstractive summarizer was suggested by Khan et al. [43], which recommends that:
1) Using SRL, the input text is represented as predicate-argument structures.
2) Using a semantic similarity measure, clusters are generated from semantically equivalent predicate-argument structures across the documents.
3) The predicate-argument structures are ranked based on weighted, optimized features using a Genetic Algorithm.
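The abstractive techniques above increasingly rely on neural sequence models; the LSTM encoder-decoder proposed in this paper is built from LSTM cells like the minimal pure-Python sketch below. The scalar state size and the shared toy weights are illustrative assumptions, not the paper's configuration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step for scalar input and hidden state (toy sizes).
    W maps each gate name to its (input weight, recurrent weight, bias)."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])   # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])   # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2]) # candidate
    c = f * c_prev + i * g        # new cell state
    h = o * math.tanh(c)          # new hidden state
    return h, c

# encode a toy input sequence: the final (h, c) summarizes the whole input,
# and is what the decoder half of an encoder-decoder would be initialised with
W = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W)
```

In practice the states are vectors and the weights are learned matrices; frameworks such as Keras or PyTorch provide this cell ready-made.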
3. Conclusion
Summarization is an exciting research area nowadays, and almost all current methods that produce abstractive summaries focus mainly on deep learning, particularly for short documents [48]. It is suggested that combining different approaches and methods can be exploited to produce improved summaries with abstractive methods. Different summarization techniques generate different summaries from the same text, so combining and comparing these techniques is encouraged.
References
Maybury, M.T. (1995). Generating summaries from event data. Information Processing
& Management, 31(5), 735–751. https://doi.org/10.1016/0306-4573(95)00025-C
Dragomir R Radev, Eduard Hovy, and Kathleen McKeown. 2002. “Introduction to the special issue on
summarization”. Computational linguistics 28, 4 (2002), 399–408.
Hovy, E., 2005. Text Summarization. In: The Oxford Handbook of Computational Linguistics, Mitkov,
R. (Ed.), OUP Oxford, Oxford, ISBN-10: 019927634X, pp: 583-598.
Chen, J.; Zhuge, H. Abstractive Text-Image Summarization Using Multi-Modal Attentional
Hierarchical RNN. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 4046–4056.
Li, P.; Lam,W.; Bing, L.; Wang, Z. Deep Recurrent Generative Decoder for Abstractive Text
Summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 2091–2100.
Gupta, V. K. & Siddiqui, T. J. (2012). Multi-document summarization using sentence clustering. Paper
presented at the 2012 4th international conference on intelligent human computer interaction (IHCI).
Kumar, Y. J., Goh, O. S., Basiron, H., Choon, N. H. & Suppiah, P. C. (2016). A Review on Automatic
Text Summarization Approaches. Journal of Computer Science, vol. 4, no. 12, pp. 178-190.
Joshi, M., Wang, H. & McClean, S. (2018). Dense semantic graph and its application in single
document summarisation. In C. Lai, A. Giuliani & G. Semeraro (Eds.), Emerging ideas on information
filtering and retrieval: DART 2013: Revised and invited papers (pp. 55–67). Springer International
Publishing.
Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: A survey.
Artificial Intelligence Review, 47(1), 1–66. https://doi.org/10.1007/s10462-016-9475-9.