
NLP BASED CONTENT SUMMARIZER WEBSITE

Dr. Nazneen Pendhari, Govindaraju Akarsh Rao, Sahil Ashok Kamble, Abhishek Ashok Pokharkar, Pratik Balkrishna Rane
Department of Computer Engineering,
M. H. Saboo Siddik College of Engineering,
Mumbai, India
[email protected]

Abstract - A vast quantity of information is accessible on the internet through the World Wide Web. Search engines such as Google and Yahoo were developed for retrieving information from databases. However, due to the continuous growth of electronic information, it has become challenging to achieve the desired outcomes. As a solution, there is an increasing demand for automated summarization that can save time and extract essential information. Automatic summarization processes one or more documents and produces a compressed version. This study was conducted on a single document, drawing on multiple prior publications. The report emphasizes the frequency-based technique for text summarization.

Key words: Automatic summarization, Extractive, Frequency-based, Natural Language Processing.

I. INTRODUCTION

Text summarization is the process of selecting essential information from a document so that it can be condensed by a program. With the growth of data overload, there is increasing interest in condensing text as the amount of data grows. However, summarizing a large document manually is time-consuming and requires significant human effort. Extractive and abstractive techniques are the two primary methods for summarizing a text document. Extractive summaries involve selecting important passages, sentences, or words from the original text and connecting them to create a brief summary. The importance of these sentences is based on their analytical and semantic features.

To understand the entire document correctly and extract the important sentences, summary systems are usually based on sentence-scoring methods. Abstractive summarization is a technique for generating a brief description of the key concepts in an article or section using a few phrases. This technique involves mapping the input sequence of words in a source document to a target sequence of words called the summary.

The study framework consists of seven stages: (a) data preprocessing, (b) text cleaning, (c) sentence tokenization, (d) word tokenization, (e) TF-IDF value calculation, (f) vectorization (cosine similarity), and (g) summary generation. The end outcome of the example experiment is the ability to summarize an uploaded text or document. The structure of this article follows the steps of this procedure.

II. LITERATURE REVIEW

[1] Akash Ajampura Natesh, Somaiah Thimmaiah Balekuttira, Annapurna P Patil et al. conducted a study in 2016 on a Graph Based Approach for Automatic Text Summarization. In this research, they describe an automatic text summarization method based on graphs and propose a strategy with five steps. The text is first broken down into sentences, and these sentences are further broken down into words. Part-of-speech tagging is then applied, which takes a list of tokens as input and assigns the best part-of-speech tag to each; it also generates a unique list of nouns for each sentence. Pronoun resolution follows, in which each pronoun that appears in the text is replaced with the noun it refers to. Finally, a connection is created to link every word in each sentence, and a graph of the complete text is constructed. The authors report that their approach performs well with news articles, Wikipedia searches, and technical documents, as assessed by the ROUGE-1 metric.

[2] Senthamizh Selvan R. et al. conducted a study in 2022 on Automatic Text Summarization using Document Clustering and Named Entity Recognition. The authors present a technique for generating literary information that addresses the problem of repetition and error in text summaries. The proposed framework involves entity ranking and sentence similarity calculation to extract unique sentences from multiple documents. The extracted named entities are then passed to document clustering methods, using k-means for cluster calculation and higher-level cluster calculations for better results. The proposed EASDC technique showed improvement over the TextRank and LexRank algorithms. In future work, the authors aim to investigate more strategies for enhancing summary quality in the context of multi-document summarization, such as reinforcement learning. They also plan to use their approach for additional tasks, such as multi-document question answering.

[3] Jishma Mohan M, Sunitha C, Amal Ganesha, Dr. Jaya A et al. conducted a study in 2016 on Ontology-based Abstractive Summarization. The paper discusses the challenges of abstractive summarization arising from the complexity of natural language processing and highlights the use of ontology-based approaches in various domains. The authors suggest the need for a single platform that accommodates all these domains and builds a robust and extensible summarization system. The paper reviews various ontology-based abstractive summarization methods and their importance in different domains, and also provides insights into evaluating ontologies. It aims to give new researchers in the field a better understanding of ontology-based approaches to text summarization.

[4] Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and Raymond Ng et al. conducted a study in 2014 on Template-based Abstractive Meeting Summarization: Leveraging Summary and Source Text Relationships. The paper describes a system for generating abstractive summaries of meetings, which the authors claim outperforms both human-authored extractive summaries and state-of-the-art meeting summarization systems. The system has three main components: a novel approach for generating templates, an effective template selection method, and a comprehensive evaluation. The first contribution is a template-generation approach that uses a multi-sentence fusion algorithm and lexico-semantic information to produce summaries that capture the most important information in a meeting. The second contribution is a template selection method that uses the relationship between human-authored summaries and their source transcripts to select the best template for a given meeting. Finally, a comprehensive evaluation shows that the system's summaries are preferred over both human-authored extractive summaries and those created by state-of-the-art meeting summarization systems. The authors note that the current version of the system uses only hypernym information in WordNet to label phrases, and that future work could extend the system to a richer knowledge base such as YAGO. Additionally, they plan to apply the system to other multi-party conversational domains such as chat logs and forum discussions.

[5] Muhammad Adeel Abid et al. conducted a study in 2023 on clustering tweet data from Twitter using keyword extraction methods. The study compared the effectiveness of two techniques, Term Frequency-Inverse Document Frequency (TF-IDF) and log-likelihood, in identifying relevant keywords from the large volume of data on the social networking site. The extracted keywords were found to be a valuable asset for identifying patterns and generating insights for advertisement, trend analysis, and future business policies. The study found that TF-IDF was the more efficient method for keyword extraction, based on the results of a relevancy test. The authors suggested that further NLP techniques could be applied to refine the results and improve the quality of clustering of the large volume of data.

[6] Akuma Stephen et al. conducted a study in 2022 using NLP techniques to track hateful posts or tweets on Twitter. The study evaluated the effectiveness of four supervised machine learning algorithms and compared the performance of the TF-IDF and Bag-of-Words (BoW) approaches. The results of the experimental study showed that the machine learning models achieved significantly better results when tested using the TF-IDF approach, with the Decision Tree algorithm outperforming the other classifiers with an accuracy of 92.43%. Logistic Regression obtained the highest accuracy among the four classifiers tested with BoW, achieving an accuracy of 74.79%. The developed model used the highest-accuracy technique to detect the presence or absence of hateful connotations in a given tweet. The study aimed to identify hateful language manifested in intolerance, religion, gender, racism, and misinformation. The authors suggested that future research could explore the use of emojis, optical character recognition, and video images to detect hate speech, and use larger datasets and other feature extraction techniques with machine learning methods for optimal results.

[7] Nidhi Patel and Prof. Nikhita Mangaokar conducted a study in 2020 to compare extractive and abstractive approaches to text summarization. While extractive techniques like TextRank are generally easier to implement and provide high-quality summaries for long texts, abstractive summarization can generate summaries that more closely resemble those produced by humans. The study found that adding TF-IDF weighted vector values improved both extractive and abstractive models when used alongside word embedding vectors. However, TF-IDF was found to be particularly effective for extractive models, which were able to produce acceptable summaries at a faster rate for longer texts. The study also found that the nature of the domain did not have a significant impact on the effectiveness of the models. The researchers suggest that the choice of dataset is crucial for generating effective, high-quality summaries, and that both algorithms can be run to determine which produces the larger number of satisfactory summaries. The models can also be adapted for cross-document summarization and for summarizing text in various languages, which would be an interesting area for future research.

[8] From these discussions we have learned that many techniques face a variety of difficulties, such as the graph-based methods' limitation on data size, the clustering-based methods' requirement for prior knowledge of the number of clusters, and the uncertainty in the coverage and non-redundancy aspects of the MMR approaches' summaries. There is no comprehensive model for the tree-based method that would provide an abstract representation for content selection. The template-based method requires the creation of templates, and it is challenging to generalise a template. The ontology-based method is applicable only to Chinese news. Additionally, developing rule-based systems to deal with uncertainty is a difficult process.

III. PROPOSED METHODOLOGY

Fig.1 Flow Chart

A. Data Preprocessing
Data preprocessing is a critical step in text summarization using NLP (Natural Language Processing) that involves cleaning and transforming raw textual data into a format that can be efficiently processed by machine learning algorithms. The main goal of data preprocessing in text summarization is to extract the most relevant and useful information from the text while removing irrelevant or redundant data.
Steps involved in data preprocessing for text summarization:
1. Text cleaning: This step involves removing unwanted characters such as punctuation, numbers, special symbols, and stop words (common words that do not carry much meaning, such as "the", "and", "or", etc.). This can be done using regular expressions or pre-built libraries in NLP frameworks.
2. Text normalization: In this step, text is converted into a uniform format by applying techniques such as stemming or lemmatization. Stemming reduces words to their root form, while lemmatization converts words to their base form. This helps reduce the vocabulary size and improve the quality of the summary.
3. Sentence segmentation: The input text is split into sentences, which are then analyzed individually. This helps identify the most important sentences that contain the key information.
4. Text vectorization: In this step, the processed text is converted into a numerical representation that can be easily analyzed by machine learning algorithms. One common technique for text vectorization is the Bag-of-Words (BoW) model, where each word in the text is treated as a feature and the count or frequency of each word is used as its value.
5. Feature selection: This step involves selecting the most relevant features from the vectorized text based on their importance in summarizing the text. This can be done using techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), which calculates the importance of each word based on its frequency in the text and its rarity in the corpus.
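As a small illustration of the normalization step described above, the following Python sketch contrasts stemming and lemmatization using NLTK; the word list and output formatting are illustrative and not part of the system described in this paper.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of the lexical resources used by the lemmatizer
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "summarization"]:
    # e.g. "studies" -> stem "studi", lemma "study"
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```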
B. Text Cleaning
Text cleaning is a critical step in text summarization using NLP that involves removing irrelevant or unnecessary data from the raw text. The goal of text cleaning is to extract the most useful and relevant information from the text while removing any noise or irrelevant data that could negatively impact the accuracy and quality of the summary.
Techniques involved in text cleaning include:
1. Removing punctuation: Punctuation marks such as periods, commas, and quotation marks do not add much value to the summary. Hence, they are often removed from the text.
2. Removing numerical data: Numerical data such as dates, phone numbers, and addresses are not typically relevant to text summarization. Therefore, they are often removed from the text.
3. Removing stop words: Stop words are common words such as "the," "and," "or," and "in" that do not carry much meaning in the text. They can be removed using pre-built libraries such as NLTK or spaCy.
4. Correcting spelling and grammar errors: Spelling and grammar errors can negatively impact the quality of the summary. Therefore, they are often corrected using tools such as LanguageTool or Grammarly.
5. Removing HTML tags: If the text is extracted from the web, it may contain HTML tags. These tags are not relevant to text summarization and hence are removed.
Algorithms used for text cleaning:
1. Regular Expressions (Regex): Regular expressions are a powerful tool for text cleaning that allow pattern matching and replacement. They are commonly used for removing punctuation, special characters, and numbers from text.
2. NLTK (Natural Language Toolkit): NLTK is a popular Python library for natural language processing that provides a wide range of tools and algorithms for text cleaning, including tokenization, stop-word removal, and stemming.
These algorithms play a critical role in the text summarization process by preparing the text for tokenization and vectorization.
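To make the cleaning step concrete, a minimal Python sketch combining regular expressions with NLTK's stop-word list is given below; the function name and the exact cleaning rules are illustrative rather than the exact implementation used in the system.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stop-word list

def clean_text(raw_text: str) -> str:
    """Remove HTML tags, punctuation, numbers, and stop words from raw text."""
    text = re.sub(r"<[^>]+>", " ", raw_text)           # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # drop punctuation and numerical data
    words = text.lower().split()
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]   # remove stop words
    return " ".join(words)

print(clean_text("<p>Is it weird I don't like coffee?</p>"))
```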
C. Sentence Tokenization processed and analyzed by computer algorithms.
The "Sentence Tokenization" step in the TF-IDF model Tokenization algorithms typically work by separating words
involves breaking down the cleaned text data into individual based on spaces, punctuation, or special characters
sentences, which serves as a prerequisite for word Example:
tokenization. This step is critical as it helps to segment the “Is it weird I don’t like coffee?”
text data into more manageable units that can be further By performing word-based tokenization with space as a
processed. delimiter, we get:
Sentence tokenization is performed by analyzing the structure [“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]
of the text and identifying the boundaries between sentences. If we look at the tokens “don’t” and “coffee?”, we will notice
There are several methods for sentence tokenization, that these words have punctuation attached to them. What if
including rule-based methods, statistical methods, and there is another raw text (sentence) in our corpora like this —
machine learning methods. “I love coffee.” This time there will be a token “coffee.”
Rule-based methods: Rule-based methods use a set of which can lead the model to learn different representations of
predefined rules to identify sentence boundaries. These rules the word coffee (“coffee?” and “coffee.”) and will make the
may include identifying punctuation marks such as periods, representation of words (tokens) suboptimal.
question marks, and exclamation marks as sentence
boundaries. However, this approach may not always be
E. TF-IDF VALUE CALCULATION
reliable, as certain punctuation marks may also appear within
sentences (such as in abbreviations or quotations). From the preprocessed list of words, the TF-IDF value of
Statistical methods: Statistical methods use probability each noun and verb can then be
models to identify sentence boundaries. These models are calculated. The equation of TF-IDF can be seen below.
trained on large datasets of text and learn to recognize
patterns that indicate the end of a sentence, such as the
appearance of a period or a combination of punctuation
marks.
Machine learning methods: Machine learning methods use
algorithms to identify sentence boundaries based on patterns
learned from labeled examples. This approach requires a
large amount of labeled data to train the algorithm effectively.
Example: TF-IDF between d1 and d2
Algorithm for Sentence Tokenization:
Step 1: Initialize an empty list to hold the sentences.
d1: the best Italian restaurant enjoy the best pasta
Step 2:. Loop through the cleaned text data and identify
d2: American restaurant enjoy the best hamburger
potential sentence boundaries.
d3: Korean restaurant enjoy the best bibimbap
Step 3: Check if the potential sentence boundary is valid
d4: the best the best American restaurant
(such as a period, question mark, or exclamation mark
followed by a space), and if so, add the sentence to the list of
TF-IDF value calculation:
sentences.
Step 4:Continue looping through the text until all potential
sentence boundaries have been evaluated. Word TF IDF TF*IDF
Step 5: Return the list of sentences. d1 d2 d3 d4 d1 d2 d3 d4
sentence tokenization is a crucial step in the TF-IDF model
Italian 1/8 0/6 0/6 0/6 log(4/1)=0.6 0.075 0 0 0
as it allows the text data to be segmented into individual
sentences for further processing. The mechanism of sentence Restaurant 1/8 1/6 1/6 1/6 log(4/4)=0 0 0 0 0
tokenization involves analyzing the structure of the text to enjoy 1/8 1/6 1/6 0/6 log(4/3)=0.13 0.016 0.02 0.02 0
identify potential sentence boundaries using rule-based, the 2/8 1/6 1/6 2/6 log(4/4)=0 0 0 0 0
statistical, or machine learning methods, and the resulting
algorithm involves checking for valid sentence boundaries best 2/8 1/6 1/6 2/6 log(4/4)=0 0 0 0 0
and constructing a list of sentences. pasta 1/8 0/6 0/6 0/6 log(4/1)=0.6 0.075 0 0 0
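A minimal Python sketch of the rule-based variant of this algorithm is shown below; in practice a library tokenizer such as nltk.sent_tokenize or a spaCy pipeline would usually be preferred, and the function name here is an illustrative assumption.

```python
import re

def tokenize_sentences(cleaned_text: str) -> list[str]:
    """Rule-based sentence tokenization following Steps 1-5 above."""
    sentences = []  # Step 1: empty list to hold the sentences
    start = 0
    # Steps 2-3: scan for '.', '?' or '!' followed by whitespace (or end of text)
    for match in re.finditer(r"[.?!](?=\s|$)", cleaned_text):
        candidate = cleaned_text[start:match.end()].strip()
        if candidate:
            sentences.append(candidate)  # Step 3: add the sentence to the list
        start = match.end()              # Step 4: keep scanning the remaining text
    tail = cleaned_text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences                     # Step 5: return the list of sentences

# Note: abbreviations such as "Dr." are split incorrectly, which is the
# limitation of rule-based methods mentioned above.
print(tokenize_sentences("Text summarization saves time. Is it useful? Yes!"))
```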
D. Word Tokenization
Tokenization is an important step in various natural language processing tasks such as text classification, sentiment analysis, machine translation, and speech recognition. It helps to reduce the complexity of text analysis by breaking the text down into smaller units that can be processed more easily.
Word tokenization is the process of breaking a large piece of text into smaller units called tokens, which are usually words or sub-words. It is a fundamental step in natural language processing and text analysis. Tokenization helps to convert unstructured text into a structured format that can be easily processed and analyzed by computer algorithms. Tokenization algorithms typically work by separating words based on spaces, punctuation, or special characters.
Example:
"Is it weird I don't like coffee?"
By performing word-based tokenization with space as a delimiter, we get:
["Is", "it", "weird", "I", "don't", "like", "coffee?"]
If we look at the tokens "don't" and "coffee?", we will notice that these words have punctuation attached to them. What if there is another raw text (sentence) in our corpora like this: "I love coffee."? This time there will be a token "coffee.", which can lead the model to learn different representations of the word coffee ("coffee?" and "coffee.") and will make the representation of words (tokens) suboptimal.
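The behaviour described in this example can be reproduced in a few lines of Python; the comparison with NLTK's word_tokenize is added here only as an illustration and is not part of the original experiment.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

sentence = "Is it weird I don't like coffee?"

# Naive space-based tokenization keeps punctuation attached to tokens
print(sentence.split())
# ['Is', 'it', 'weird', 'I', "don't", 'like', 'coffee?']

# A punctuation-aware tokenizer separates punctuation and contractions,
# so "coffee?" and "coffee." both map to the single token "coffee"
print(word_tokenize(sentence))
# ['Is', 'it', 'weird', 'I', 'do', "n't", 'like', 'coffee', '?']
```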
E. TF-IDF Value Calculation
From the preprocessed list of words, the TF-IDF value of each noun and verb can then be calculated. The equation of TF-IDF is applied as in the worked example below.

Example: TF-IDF for documents d1 to d4
d1: the best Italian restaurant enjoy the best pasta
d2: American restaurant enjoy the best hamburger
d3: Korean restaurant enjoy the best bibimbap
d4: the best the best American restaurant

TF-IDF value calculation:

Word | TF (d1, d2, d3, d4) | IDF | TF*IDF (d1, d2, d3, d4)
Italian | 1/8, 0/6, 0/6, 0/6 | log(4/1)=0.6 | 0.075, 0, 0, 0
Restaurant | 1/8, 1/6, 1/6, 1/6 | log(4/4)=0 | 0, 0, 0, 0
enjoy | 1/8, 1/6, 1/6, 0/6 | log(4/3)=0.13 | 0.016, 0.02, 0.02, 0
the | 2/8, 1/6, 1/6, 2/6 | log(4/4)=0 | 0, 0, 0, 0
best | 2/8, 1/6, 1/6, 2/6 | log(4/4)=0 | 0, 0, 0, 0
pasta | 1/8, 0/6, 0/6, 0/6 | log(4/1)=0.6 | 0.075, 0, 0, 0
American | 0/8, 1/6, 0/6, 1/6 | log(4/2)=0.3 | 0, 0.05, 0, 0.05
hamburger | 0/8, 1/6, 0/6, 0/6 | log(4/1)=0.6 | 0, 0.1, 0, 0
Korean | 0/8, 0/6, 1/6, 0/6 | log(4/1)=0.6 | 0, 0, 0.1, 0
bibimbap | 0/8, 0/6, 1/6, 0/6 | log(4/1)=0.6 | 0, 0, 0.1, 0

Fig.2 TF-IDF Calculation

The value of TF-IDF ranges from zero to one with ten-digit precision. After being calculated, the words are sorted in descending order by value and compiled into a new dictionary of words and their values. This sorting is important for analyzing the rank of the TF-IDF values of all words when checking the output summary. Once the TF-IDF value of each word is known, the importance value of a sentence can be calculated as the sum of the values of every noun and verb in the sentence. Every sentence in the document is then sorted in descending order.
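The equation figure from the original layout is not reproduced here; the standard formulation consistent with Fig. 2 is TF-IDF(t, d) = TF(t, d) x IDF(t), where TF(t, d) is the number of occurrences of term t in document d divided by the number of words in d, and IDF(t) = log10(N / df(t)), with N the number of documents and df(t) the number of documents containing t. A short Python sketch reproducing the worked example (variable names are illustrative) follows.

```python
import math

docs = {
    "d1": "the best italian restaurant enjoy the best pasta",
    "d2": "american restaurant enjoy the best hamburger",
    "d3": "korean restaurant enjoy the best bibimbap",
    "d4": "the best the best american restaurant",
}

tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokenized.values() for w in words})
n_docs = len(docs)

def tf(word, words):
    # term frequency: occurrences of the word divided by document length
    return words.count(word) / len(words)

def idf(word):
    # inverse document frequency with a base-10 log, as in Fig. 2
    df = sum(1 for words in tokenized.values() if word in words)
    return math.log10(n_docs / df)

tfidf = {
    name: {w: tf(w, words) * idf(w) for w in vocab}
    for name, words in tokenized.items()
}
print(round(tfidf["d1"]["italian"], 3))  # 0.075, matching Fig. 2
```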
F. Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors in a high-dimensional space, commonly used in natural language processing (NLP) for comparing the similarity of documents or sentences based on the occurrence of the words or phrases they contain.
In NLP, documents or sentences are typically represented as vectors, where each dimension represents a particular term or word in a given vocabulary, and the value in each dimension represents the frequency or weight of the corresponding term in the document or sentence. The cosine similarity between two vectors A and B is defined as the cosine of the angle between them in the high-dimensional space, which is calculated as the dot product of A and B divided by the product of their magnitudes:
cosine similarity = A.B / (||A|| x ||B||)
The resulting similarity score ranges from 0 (no similarity) to 1 (identical). Higher values indicate greater similarity between the two documents or sentences.
Cosine similarity is a popular similarity measure in NLP because it is efficient to calculate and provides a simple and intuitive way to compare the similarity of two texts. It is widely used in various applications such as information retrieval, document clustering, text classification, and recommendation systems.

Cosine similarity calculation:


Document | TF-IDF Bag-of-Words vector | Cosine similarity with d4
d1: The best Italian restaurant enjoy the best pasta | [0.075, 0, 0.016, 0, 0, 0.075, 0, 0, 0, 0] | 0
d2: American restaurant enjoy the best hamburger | [0, 0, 0.02, 0, 0, 0, 0.05, 0.1, 0, 0] | 0.5
d3: Korean restaurant enjoy the best bibimbap | [0, 0, 0.02, 0, 0, 0, 0, 0, 0.1, 0.1] | 0
d4: The best the best American restaurant | [0, 0, 0, 0, 0, 0, 0.05, 0, 0, 0] | 1

Fig.3 Cosine Calculation
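A small Python sketch of the cosine computation used above is given below; the vector values are taken from Fig. 2 and are rounded, so the result differs slightly from the 0.5 reported in Fig. 3.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(A, B) / (||A|| * ||B||); returns 0.0 for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Rounded TF-IDF vectors for d2 and d4 over the vocabulary of Fig. 2
d2 = [0, 0, 0.02, 0, 0, 0, 0.05, 0.1, 0, 0]
d4 = [0, 0, 0,    0, 0, 0, 0.05, 0,   0, 0]
print(round(cosine_similarity(d2, d4), 2))  # ~0.44 with these rounded inputs
```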

G. Summary Generation
Finally, after finding the cosine similarity for all vectorized pairs, we average the weights of each vector and return the indexes of the vectors with the highest averages. These indexes are then used to pull the corresponding sentences out of the original text for the summary. The sentences with the highest average weights capture the unique and important sentences of the original text.
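One way to read this procedure as code is sketched below: each sentence vector is scored by its average cosine similarity to the other sentence vectors, and the top-scoring sentences are returned in their original order. The function names and the default of two output sentences are illustrative assumptions, not fixed by the paper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def summarize(sentences, vectors, num_sentences=2):
    """Score each sentence by its average cosine similarity to the other
    sentence vectors and return the top-scoring sentences in original order."""
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine_similarity(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

The number of sentences to keep is a tunable parameter; a web front end could expose it as a summary-length control.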
H. Model Deployment
GUI of project: (screenshot)

Input as Text: (screenshot)

Summarized Output: (screenshot)

IV. CONCLUSION
Crop ailments are amongst the largest and most important
agrarian concerns, inflicting annual losses of up to 25% in
agricultural production. In this paper, a diagnostic and
detection system for corn leaf diseases is proposed, which is
based on the technology of Convolutional Neural Network
and bespoke model framework. The suggested system's
primary goal is to detect infections in plants early to reduce
plant production losses caused by plant diseases such as
Northern Leaf blight and Common rust of corn crops. The
dataset used to train and evaluate the model consisted of 3000
colored corn leaf images. It was split into three parts for
model validation: 70% for training, 25% for validation, and
5% for testing the model. When the dataset was lowered from 1500 to 300 images, the accuracy decreased by 10%, whereas when the dataset was increased from 1500 to 3000 images, the accuracy remained nearly the same. In comparison to other approaches, the suggested model architecture achieves 99.52% accuracy. Using the same dataset, the employed approach is compared with the VGG16 methodology, and the comparison findings reveal that the proposed method outperforms the accuracy produced by that technique.

References

[1] A. A. Natesh, S. T. Balekuttira and A. P. Patil, "Graph Based Approach for Automatic Text Summarization," M S Ramaiah Institute of Technology, Bangalore, pp. 84-90, 2009.
[2] S. R. Selvan and D. K. Arutchelvan, "Automatic Text Summarization using Document Clustering Named Entity Recognition," International Journal of Advanced Computer Science and Applications, pp. 123-129, 2022.
[3] M. Jishma, S. C, G. Amal and D. A, "A Study on Ontology based Abstractive Summarization," Procedia Computer Science, pp. 62-69, 2016.
[4] T. Oya, Y. Mehdad, G. Carenini and R. Ng, "A Template-based Abstractive Meeting Summarization: Leveraging Summary and Source Text Relationships," pp. 100-110, 2014.
[5] M. A. Abid, M. F. Mushtaq, U. Akram and M. A. Abbasi, "Comparative analysis of TF-IDF and loglikelihood method for keywords extraction of twitter data," Mehran University Research Journal of Engineering and Technology, vol. 42, no. 1, p. 88, 2023.
[6] A. Stephen, T. Lubem and I. T. Adom, "Comparing Bag of Words and TF-IDF with different models for hate speech detection from live tweets," International Journal of Information Technology, vol. 14, no. 1, 2022.
[7] N. Patel and N. Mangaokar, "Abstractive vs Extractive Text Summarization (Output based approach) - A Comparative Study," in 2020 IEEE International Conference for Innovation in Technology (INOCON), Bangluru, India, 2020.
[8]