Text Summarizer Using NLP (Natural Language Processing) : © JUL 2022 - IRE Journals - Volume 6 Issue 1 - ISSN: 2456-8880
Abstract- Enormous amounts of information are available online on the World Wide Web. To access this information, search engines such as Google and Yahoo were created. But because the amount of electronic information grows every day, these tools alone have not delivered the desired results. As a result, automatic summarization is in high demand. Automatic summarization takes one or more documents as input and outputs a condensed version, saving both information and time. Studies have been conducted on single-document summarization, resulting in numerous publications. This report focuses on the frequency-based approach to text summarization.

Indexed Terms- Automatic summarization, Extractive, Frequency-based, Natural Language Processing.

I. INTRODUCTION

Text summarization is the process of selecting the important points from a provided article or document so that it can be reduced by a program. As the data-overload problem grew with the amount of available data, so did the interest in automatically condensing text. Summarizing a large document manually is challenging, since it requires a lot of human effort and is time-consuming.

There are mainly two methods for summarizing a text document: extractive and abstractive techniques.

Extractive summaries concentrate on selecting important passages, sentences, words, etc. from the primary text and connecting them into a concise form. The importance of critical sentences is determined on the basis of analytical and semantic features of the sentences. Such summarization systems are usually built on sentence-selection methods, both for understanding the whole document properly and for extracting the important sentences from it.

The technique of generating a brief description comprising a few phrases that describe the key concepts of an article or section is known as abstractive summarization. Such a system learns to map the order of words in a source document to a target sequence of words called the summary.

II. LITERATURE SURVEY

The Internet is a vast source of electronic information, but acquiring the right information from it has become a tedious task. Automatic summarization therefore arose from the search for automatic retrieval of data from documents without wasting our precious time. H.P. Luhn was the first to propose automatic summarization of text, in 1958.

There are two main ways to produce a summary: extraction and abstraction. Extraction is domain-independent: it takes the key sentences of a text and joins them into a summary. Abstraction, on the other hand, is domain-dependent: it builds an internal representation by understanding the entire text and then generates new sentences to produce the summary. There are several methods that use different techniques to obtain a summary of a text.

A. Abstractive Summarization Approach
Summarization using abstractive techniques is broadly classified into two categories: the Structured Based Approach and the Semantic Based Approach.
1) Structured Based Approach:
The structured based approach encodes the most important information from the document through cognitive schemes such as templates, extraction rules, and other structures such as tree, ontology, and lead-and-body phrase structure.

Tree Based Method
• It uses a dependency tree to represent the text of a document.
• It uses either a language generator or an algorithm for generation of the summary.
• It works on units of the given document, making the resulting summary readable and easy to follow.

Template Based Method
• It uses a template to represent a whole document.
• Linguistic patterns or extraction rules are matched to identify text snippets that will be mapped into template slots.
• The generated summary is highly coherent, because it relies on relevant information identified by an IE (Information Extraction) system.

Ontology Based Method
• It uses an ontology (knowledge base) to improve the process of summarization.
• It exploits fuzzy ontology to handle uncertain data that a simple domain ontology cannot.
• Drawing a relation or context is easy due to the ontology.
• It handles uncertainty to a reasonable extent.

2) Semantic Based Approach:
In the semantic based approach, a semantic representation of the document is used to feed a natural language generation (NLG) system. This method focuses on identifying noun phrases and verb phrases by processing linguistic data. A brief abstract of the techniques under the semantic based approach is provided below.

Multimodal Semantic Model
A semantic model, which captures concepts and the relationships among concepts, is built to represent the contents of multimodal documents. An important advantage of this framework is that it produces an abstract summary whose coverage is excellent, because it includes salient textual and graphical content from the entire document.

Information Item Based Method
• The contents of the summary are generated from an abstract representation of the source documents, rather than from the sentences of the source documents. The abstract representation is the Information Item, the smallest element of coherent information in a text.
• The major strength of this approach is that it produces a short, coherent, information-rich, and less redundant summary.

B. Extractive Summarization Techniques
An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form. The importance of sentences is decided based on statistical and linguistic features of the sentences.

Term Frequency-Inverse Document Frequency (TF-IDF) Method
• Sentence frequency is defined as the number of sentences in the document that contain a given term.
• The sentence vectors are then scored by similarity to the query, and the highest-scoring sentences are picked to be part of the summary.

Cluster Based Method
• It is intuitive to think that summaries should address the different "themes" appearing in the documents.
• If the document collection for which the summary is being produced covers totally different topics, document clustering becomes almost essential to generate a meaningful summary.
• Sentence selection is based on the similarity of a sentence to the theme of its cluster (Ci), the location of the sentence in the document (Li), and its similarity to the first sentence of the document to which it belongs (Fi). The overall score is

Si = W1 * Ci + W2 * Fi + W3 * Li

where W1, W2, and W3 are the weights that decide inclusion in the summary.
• The k-means clustering algorithm is applied.
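The cluster-based score Si = W1 * Ci + W2 * Fi + W3 * Li can be sketched in Python. This is an illustrative sketch, not the paper's implementation: cosine similarity over bags of words stands in for the unspecified similarity measure, and the default weights and the positional formula for Li are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def score_sentences(sentences, theme_words, w1=0.5, w2=0.3, w3=0.2):
    """Score each sentence as Si = W1*Ci + W2*Fi + W3*Li.

    Ci: similarity to the cluster theme, Fi: similarity to the first
    sentence, Li: positional weight (earlier sentences score higher).
    The weights and the positional formula are illustrative choices.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    theme = Counter(theme_words)
    first = bags[0]
    n = len(sentences)
    scores = []
    for i, bag in enumerate(bags):
        ci = cosine(bag, theme)
        fi = cosine(bag, first)
        li = (n - i) / n  # 1.0 for the first sentence, decreasing after
        scores.append(w1 * ci + w2 * fi + w3 * li)
    return scores
```

The highest-scoring sentences would then be selected, one or more per cluster, to form the summary.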
STEP-1: Import all necessary libraries
NLTK (Natural Language Toolkit) is a widely used library for working with text in Python. Its stop-words corpus contains a list of English stop words, which need to be removed during the pre-processing step.

STEP-2: Generate clean sentences
Text pre-processing is the most important step in achieving a consistent and accurate result. The processing step removes special characters, digits, and unwanted words.

STEP-3: Calculate TF-IDF and generate a matrix
We find the TF and IDF for each word in a paragraph:

TF(t) = (number of occurrences of t in the document) / (total number of terms in the document)

IDF(t) = log_e(total number of documents / number of documents containing t) [4]

A new matrix is then generated by multiplying the calculated TF and IDF values.

The summarizer fails, or degrades, on the following kinds of input:

• Improper URL - If the given URL does not contain well-defined, sequential text that can be summarized, the program displays an error, since the web scraper cannot get the required data from the URL from which the summary would be generated.
• Illogical Text - If illogical or meaningless text is given as input, no summary is produced, since it makes no sense to generate a summary of punctuation marks or stop words; such input is eliminated in the pre-processing phase of summarization.
• Repeated Text - If repeated text is given as input, a summary is obtained, but it is itself repetitive: because the text repeats, the program cannot differentiate between the meanings of the sentences, so the summary is generated based on the repeated input.
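The three steps above can be sketched as a minimal frequency-based summarizer. This is an illustrative sketch, not the paper's code: a simple regex stands in for NLTK tokenization, the stop-word list is an abbreviated stand-in for NLTK's English list, and each sentence is treated as one "document" when computing IDF.

```python
import math
import re
from collections import Counter

# STEP-1: abbreviated stand-in for NLTK's English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to",
              "in", "on", "at", "and", "or", "it", "this", "that", "for"}

def clean_words(sentence):
    """STEP-2: lower-case, strip non-letters, drop stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def summarize(text, num_sentences=2):
    """STEP-3: score sentences by the summed TF-IDF of their words."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    sent_words = [clean_words(s) for s in sentences]
    n = len(sentences)

    # IDF(t) = log_e(total sentences / sentences containing t)
    df = Counter(w for words in sent_words for w in set(words))
    idf = {w: math.log(n / df[w]) for w in df}

    scores = []
    for words in sent_words:
        tf = Counter(words)
        total = len(words) or 1
        # TF(t) = occurrences of t in sentence / total terms in sentence
        scores.append(sum((tf[w] / total) * idf[w] for w in tf))

    # Keep the top-scoring sentences, restored to their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in top)
```

Applied to a paragraph, the function returns the num_sentences highest-scoring sentences in their original order, which is the condensed version described in the abstract.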
REFERENCES