Text Summarizer Using NLP (Natural Language Processing) : © JUL 2022 - IRE Journals - Volume 6 Issue 1 - ISSN: 2456-8880


© JUL 2022 | IRE Journals | Volume 6 Issue 1 | ISSN: 2456-8880

Text Summarizer Using NLP (Natural Language Processing)
AAKASH SRIVASTAVA1, KAMAL CHAUHAN2, HIMANSHU DAHARWAL3, NIKHIL MUKATI4,
PRANOTI SHRIKANT KAVIMANDAN5
1, 2, 3, 4, 5 Department of Computer Science and Business System, Bharati Vidyapeeth Deemed University College of Engineering, Pune

Abstract- Enormous amounts of information are available online on the World Wide Web. Search engines such as Google and Yahoo were created to access this information, but because the amount of electronic information grows every day, they have not delivered the desired results. As a result, automated summarization is in high demand. An automatic summarizer takes one or more documents as input and outputs a condensed version, preserving the key information while saving time. Summarization has been studied for both single-document and multi-document input. This report focuses on the frequency-based approach for text summarization.

Indexed Terms- Automatic summarization, Extractive, frequency-based, Natural Language Processing.

I. INTRODUCTION

Text summarization is the process of selecting the important points from a given article or document so that a program can reduce it to a shorter form. As the data overload problem has grown, so has interest in summarizing text automatically. Summarizing a large document manually is challenging since it requires a lot of human effort and is time-consuming.

There are two main methods for summarizing a text document: extractive and abstractive techniques.

Extractive summaries concentrate on selecting important passages, sentences, words, etc. from the primary text and connecting them into a concise form. The importance of each sentence is determined on the basis of its analytical and semantic features. Such systems are usually based on sentence extraction methods: they aim to understand the whole document properly and to extract its important sentences.

Abstractive summarization is the technique of generating a brief description, comprising a few phrases, that captures the key concepts of an article or section. It can be framed naturally as mapping the input sequence of words in a source document to a target sequence of words called the summary.

II. LITERATURE SURVEY

The Internet is a vast source of electronic information, but acquiring the relevant information from it is a tedious task for people. Automated summarization therefore emerged as a way to retrieve data from documents automatically and save our precious time. H.P. Luhn was the first to propose automatic text summarization, in 1958.

There are two ways to produce a summary: extraction and abstraction. Extraction is independent of the domain: it takes key sentences and presents them as a summary. Abstraction, on the other hand, depends on the domain: it gathers the essential information by understanding the entire text and rewording it to produce a summary. Several methods exist for obtaining a summary of a text.

A. Abstractive Summarization Approach
Summarization using abstractive techniques is broadly classified into two categories: the structured based approach and the semantic based approach.

1) Structured Based Approach:

IRE 1703633 ICONIC RESEARCH AND ENGINEERING JOURNALS 211



The structured based approach encodes the most important information from the document through cognitive schemes such as templates, extraction rules and other structures such as tree, ontology, and lead and body phrase structure.

Tree Based Method
• It uses a dependency tree to represent the text of a document.
• It uses either a language generator or an algorithm to generate the summary.
• It operates on units of the given document, which are easy to read and summarize.

Template Based Method
• It uses a template to represent a whole document.
• Linguistic patterns or extraction rules are matched to identify text snippets that will be mapped into template slots.
• The generated summary is highly coherent because it relies on relevant information identified by the information extraction (IE) system.

Ontology Based Method
• It uses an ontology (knowledge base) to improve the process of summarization.
• It exploits a fuzzy ontology to handle uncertain data that a simple domain ontology cannot.
• Drawing relations or context is easy thanks to the ontology.
• It handles uncertainty to a reasonable degree.

2) Semantic Based Approach:
In the semantic based approach, a semantic representation of the document is fed into a natural language generation (NLG) system. This method focuses on identifying noun phrases and verb phrases by processing linguistic data. A brief overview of the techniques under the semantic based approach follows.

Multimodal semantic model
A semantic model, which captures concepts and relationships among concepts, is built to represent the contents of multimodal documents.

An important advantage of this framework is that it produces an abstract summary whose coverage is excellent, because it includes salient textual and graphical content from the entire document.

Information Item Based Method
• The contents of the summary are generated from an abstract representation of the source documents, rather than from sentences of the source documents.
• The abstract representation is the Information Item, which is the smallest element of coherent information in a text.
• The major strength of this approach is that it produces a short, coherent, information-rich and less redundant summary.

B. Extractive Summarization Techniques
An extractive summarization method consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into a shorter form. The importance of sentences is decided based on statistical and linguistic features of the sentences.

Term Frequency Inverse Document Frequency Method
• Sentence frequency is defined as the number of sentences in the document that contain a given term.
• The sentence vectors are then scored by similarity to the query, and the highest scoring sentences are picked to be part of the summary.

Cluster Based Method
• It is intuitive to think that summaries should address the different “themes” appearing in the documents.
• If the document collection being summarized covers totally different topics, document clustering becomes almost essential to generate a meaningful summary.
• Sentence selection is based on the similarity of the sentence to the theme of the cluster (Ci). The next factor is the location of the sentence in the document (Li). The last factor is its similarity to the first sentence in the document to which it belongs (Fi).

Si = W1 * Ci + W2 * Fi + W3 * Li
where W1, W2, W3 are the weights for inclusion in the summary.

• The k-means clustering algorithm is applied.
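
The weighted score Si = W1 * Ci + W2 * Fi + W3 * Li used by the cluster based method can be illustrated with a minimal Python sketch. The paper does not say how Ci, Fi and Li are computed, so everything beyond the formula is an assumption made for illustration: cosine similarity over word counts for Ci and Fi, a location score of 1 - i/n for Li, the default weights, and the helper names `bag_of_words`, `cosine` and `cluster_sentence_scores`.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sentence_scores(sentences, theme, w1=0.5, w2=0.3, w3=0.2):
    """Return S_i = W1*C_i + W2*F_i + W3*L_i for each sentence of one document."""
    if not sentences:
        return []
    theme_vec = bag_of_words(theme)
    first_vec = bag_of_words(sentences[0])
    n = len(sentences)
    scores = []
    for i, sentence in enumerate(sentences):
        vec = bag_of_words(sentence)
        c_i = cosine(vec, theme_vec)  # similarity to the cluster theme
        f_i = cosine(vec, first_vec)  # similarity to the first sentence
        l_i = 1.0 - i / n             # assumed location score: earlier ranks higher
        scores.append(w1 * c_i + w2 * f_i + w3 * l_i)
    return scores
```

In this sketch the theme is passed in as plain text; in a full system it would presumably come from the cluster produced by k-means, which is not shown here.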


Graph Theoretic Approach
• A graph theoretic representation of passages provides a method for identifying themes.
• After the common pre-processing steps, namely stemming and stop word removal, sentences in the documents are represented as nodes in an undirected graph.

2.1 Frequency based approach:
• Term frequency (TF):
TF measures how often a word appears in a text document and is considered an important factor. The paragraphs in the document are divided into sentences based on the punctuation marks that appear at the end of every sentence.

• Keyword frequency:
The high-frequency words in a sentence are known as keywords. The frequency of every word is measured once the content has been cleaned. Keywords are the terms with the highest frequency. Based on this feature, a sentence is given a fixed number of points for each keyword found in it.

• Stop words filtering:
Any document contains many words that appear regularly but add little to its meaning. Words like 'on', 'the', 'is' and 'and' appear frequently in English across many texts. While searching, these words do not add value to the information when users submit a query.

2.2 Clustering approach
• K-means clustering:
This approach aims to partition n observations into k groups, where each observation belongs to the cluster with the nearest mean, which acts as a prototype of the cluster. k-means can be applied to data that is small in size, numerical, and continuous. Applications that can benefit from the k-means algorithm include public transport data analysis, targeting crime hotspots, insurance fraud detection, customer segmentation, grouping of document collections, etc.

In our project we have used the extractive approach for text summarization; specifically, we have used the TF-IDF method. From these discussions, we have observed that many techniques suffer from various challenges. For example, graph-based methods have limitations in data size, clustering-based methods require prior knowledge of the number of clusters, and maximal marginal relevance (MMR) approaches have uncertainty regarding the coverage and non-redundancy aspects of the summary. The Tree Based Method lacks a complete model which would include an abstract representation for content selection. The Template Based Method requires designing templates, and generalizing a template is too difficult. The Ontology Based Method is limited to Chinese news only, and creating a rule based system for handling uncertainty is a complex task.

III. PROBLEM STATEMENT

We have used NLP to address this problem with the help of an extractive summarizer, which seeks to summarize articles by picking a collection of words that hold the most essential information. This approach takes the significant portions of sentences and uses them to create a summary. A variety of algorithms and approaches are used to define sentence weights and subsequently rank the sentences in terms of significance and similarity.

There is a great need for text summarization techniques to cope with the amount of text data available online, helping people find the right information and use it quickly. In addition, text summarization reduces reading time, speeds up the process of researching information, and increases the amount of information that can fit in a given area.

This research paper focuses on the frequency-based approach for text summarization.

The steps involved in the text summarizer are sentence and word tokenization, followed by calculating a sentence score on the basis of the TF-IDF score, which is used to select the most important sentences that retain the information and merge them to form a summary.
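
The pre-processing ideas described in this section (sentence and word tokenization, stop-word filtering, and keyword-frequency scoring) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library instead of NLTK so that it runs without downloads, the tiny `STOP_WORDS` set stands in for a real stop-word list, and the function names and the `top_k` and `points` parameters are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens without stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def keyword_scores(text, top_k=3, points=1):
    """Give each sentence fixed points for every top-k keyword it contains."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    keywords = {w for w, _ in freq.most_common(top_k)}
    return [(s, points * sum(1 for w in tokenize(s) if w in keywords))
            for s in sentences]
```

Splitting on end-of-sentence punctuation mirrors the rule given under term frequency; abbreviations such as "H.P." would need a more careful tokenizer.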


STEP-1: Import all necessary libraries
NLTK (Natural Language Toolkit) is a widely used library for working with text in Python. Its stop words corpus contains a list of English stop words, which need to be removed during the pre-processing step.

STEP-2: Generate clean sentences

Text pre-processing is the most important step in achieving consistent and good results. The processing step removes special characters, digits, and unwanted words.

STEP-3: Calculate TF-IDF and generate a matrix

We find the TF and IDF for each word in a paragraph.

TF(t) = (number of times t appears in the document) / (total number of terms in the document)

IDF(t) = log_e(total number of documents / number of documents with t in it) [4]

Now, we generate a new matrix by multiplying the calculated TF and IDF values.

STEP-4: Score the sentences

Here, we use the TF-IDF scores of the words in a sentence to weight each sentence in the paragraph. However, sentence scoring varies between algorithms.

STEP-5: Generate the summary

This is the last stage of text summarization. The top sentences, selected according to their scores and the retention rate specified by the user, are included in the summary, and finally the summary is created.

IV. TEST RESULTS

• Short Input - When testing with small inputs, we get a minimum-value error, indicating that the word frequencies are not high enough to calculate a summary.

• Foreign Language - When input is given in any language, the summarization process completes successfully and a meaningful summary is obtained.

• Improper URL - If the given URL does not contain well-defined, sequential text that can be summarized, an error is displayed, since the web scraper cannot extract the data from which the summary would be generated.

• Illogical Text - If illogical or meaningless text is given as input, no summary is produced, as it does not make sense to generate a summary of punctuation marks or stop words. The output indicates that the given text may consist of stop words, which are eliminated in the pre-processing phase of summarization.

• Repeated Text - If repeated text is given as input, a summary is obtained, but it is also repetitive: because the text repeats, the program cannot differentiate between the sentences by meaning. The summary is therefore generated from the repeated input as is.
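
Steps 2 to 5 can be sketched end to end as follows. This is a minimal illustration rather than the authors' implementation. It assumes that each sentence is treated as one "document" when computing IDF and that a sentence's score is the average TF-IDF of its distinct words. The small `STOP_WORDS` set (used in place of NLTK's list), the `summarize` function and its `retention` parameter are hypothetical names.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption standing in for NLTK's list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def summarize(text, retention=0.5):
    """Return the top-scoring sentences of `text`, kept in their original order."""
    # STEP-2: split into sentences, then clean and tokenize each one.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return ""
    token_lists = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]

    # STEP-3: document frequency, with each sentence treated as one document.
    n_docs = len(sentences)
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))

    # STEP-4: score each sentence by the average TF-IDF of its distinct words.
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = sum(
            (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        )
        scores.append(total / len(counts) if counts else 0.0)

    # STEP-5: keep the best sentences according to the retention rate.
    keep = min(n_docs, max(1, math.ceil(retention * n_docs)))
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))
```

With this scoring, words that occur in every sentence get an IDF of zero, so heavily repeated input yields low, undifferentiated scores, which is consistent with the repeated-text behaviour reported in the test results.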


CONCLUSION

Text summaries have been shown to be useful for natural language processing tasks such as question answering, and for related fields of computer science such as text classification and information retrieval. Access time for information search is also improved. At the same time, automatic summarization improves the effectiveness of indexing, and its algorithms are less biased than human summarizers. Using a text summarization system, commercial abstract services can increase the number of texts they are able to process.

FUTURE SCOPE

In this section, we list some future extensions of this study. In this article, we focused on summarizing news articles in the sports and technology domains. The strategies proposed here are flexible across some domains. One future plan is to use a summarization framework that focuses on the topic of news articles or blogs, and to increase work on machine-dependent methods. Summaries focused on the headline of an article can be very accurate and very important for users. It would be even more interesting to work on topic modeling and summarization in the media domain in the future.

REFERENCES

[1] Adhika Widyassari, S. R. (2020). Review of automatic text summarization techniques & methods. Journal of King Saud University - Computer and Information Sciences, 18.
[2] Amigó E, G. J. (2005). A framework for the evaluation of text summarization systems. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. ACL ’05.
[3] Antiqueira L, O. O. (2007). A complex network approach to text summarization. Information Sciences.
[4] B. Cretu, Z. C. (2002). Automatic summarization based on sentence extraction. International Journal of Applied Electromagnetic and Mechanics.
[5] Brownlee, J. (2019, August 7). A Gentle Introduction to Text Summarization. Retrieved from https://machinelearningmastery.com/: https://machinelearningmastery.com/gentle-introduction-text-summarization/
[6] Changjian Fanga, D. M. (2016, March 5). Word-sentence co-ranking for automatic extractive text summarization. Retrieved from https://www.sciencedirect.com/: https://www.sciencedirect.com/science/article/abs/pii/S0957417416306959?via%3Dihub
[7] Conroy, J. M. (2001). Text summarization via hidden Markov models. Proceedings of SIGIR '01.
[8] D. Gillick, K. R. (2009). A global optimization network for meeting summarization. Proc. IEEE Int. Conf. Acoust, 1-4.
[9] Darji, H. (2020, January 8). Text Summarization-Key Concepts. Retrieved from https://medium.com/: https://medium.com/@harshdarji_15896/text-summarization-key-concepts-23df617bfb3e
[10] Evans, D. K. (2005). Similarity-based multilingual multidocument summarization. Technical Report CUCS-014-05.
[11] Gupta V, L. G. (2010). A survey of text summarization extractive techniques. J Emerg Technol Web Intell, 258-268.


[12] J. Patel, P. (2015). https://machinelearningmastery.com/gentle-introduction-text-summarization/. International Journal of Engineering and Computer Science, 5.
[13] Jain, A. (2019, April 1). Automatic Extractive Text Summarization using TF-IDF. Retrieved from Medium.com: https://medium.com/voice-tech-podcast/automatic-extractive-text-summarization-using-tfidf-3fc9a7b26f5
[14] KS, J. (2007). Automatic summarising: the state of the art. Inf Process Manag 43, 1449-1487.
[15] Kumar, T. (2014). Automatic Text Summarization. Rourkela.
[16] Mayo, M. (2019, November). Getting Started with Automated Text Summarization. Retrieved from https://www.kdnuggets.com/: https://www.kdnuggets.com/2019/11/getting-started-automated-text-summarization.html
[17] Mr. Vikrant Gupta, M. P. (2012). An Statistical Tool for Multi-Document Summarization. International Journal of Scientific and Research (ISSN 2250-3153).
[18] Neelima Bhatia, A. J. (2015). Literature Review on Automatic Text Summarization: Single and Multiple Summarizations. International Journal of Computer Applications, 1-5.
[19] Okumura, H. T. (2009). Text Summarization Model based on the budgeted median problem. Proc. 18th ACM Conf. Inf. Knowledge, 1-4.
[20] Opidi, A. (2019, April 15). A Gentle Introduction to Text Summarization in Machine Learning. Retrieved from https://blog.floydhub.com/: https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/
[21] Panchal, A. (2019, June 10). NLP — Text Summarization using NLTK: TF-IDF Algorithm. Retrieved from https://towardsdatascience.com/: https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3
[22] Recent automatic text summarization techniques: a survey. (2019, March 29). Retrieved from https://link.springer.com/: https://link.springer.com/article/10.1007/s10462-016-9475-9
[23] S, S. (2011). Automatic Text Summarization: The current state of the art. International Journal of.
[24] YLLIAS CHALI, S. A. (2011). Query-focused multi-document summarization: automatic data annotations and supervised learning approaches. Cambridge University Press.
