Dr. Nazneen Pendhari, Govindaraju Akarsh Rao, Sahil Ashok Kamble
Department of Computer Engineering,
M. H. Saboo Siddik College of Engineering,
Mumbai, India
[email protected], [email protected], [email protected]
A. Data Preprocessing
Data preprocessing is a critical step in text summarization
using NLP (Natural Language Processing) that involves
cleaning and transforming raw textual data into a format that
can be efficiently processed by machine learning algorithms.
The main goal of data preprocessing in text summarization is
to extract the most relevant and useful information from the
text while removing irrelevant or redundant data.
The main steps involved in data preprocessing for text summarization are:
1. Text cleaning: This step involves removing unwanted
characters, such as punctuation, numbers, special symbols,
and stop words (common words that do not carry much
meaning such as "the", "and", "or", etc.). This process can be
done using regular expressions or pre-built libraries in NLP
frameworks.
2. Text normalization: In this step, text is converted into a
uniform format by applying techniques such as stemming or
lemmatization. Stemming involves reducing words to their
root form, while lemmatization involves converting words to
their base form. This helps in reducing the vocabulary size
and improving the quality of the summary.
3. Sentence segmentation: The input text is split into
sentences, which are then analyzed individually. This helps
in identifying the most important sentences that contain the
key information.
4. Text vectorization: In this step, the processed text is
converted into a numerical representation that can be easily
analyzed by machine learning algorithms. One common
technique for text vectorization is the Bag-of-Words (BoW)
model, where each word in the text is treated as a feature, and
the count or frequency of each word is used as its value.
5. Feature selection: This step involves selecting the most
relevant features from the vectorized text based on their
importance in summarizing the text. This can be done using
techniques such as Term Frequency-Inverse Document
Frequency (TF-IDF), which calculates the importance of each
word based on its frequency in the text and its rarity in the
corpus.

Fig. 1. Flow Chart
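To make steps 1-4 concrete, the following sketch chains simple
cleaning, normalization, sentence segmentation, and Bag-of-Words
vectorization. It assumes NLTK and scikit-learn are available; the
specific library calls are illustrative choices, not ones prescribed
by this paper.

# Minimal preprocessing sketch (assumes nltk and scikit-learn are installed).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(raw_text):
    # 1. Text cleaning: drop numbers and special symbols, keep end-of-sentence marks.
    cleaned = re.sub(r"[^A-Za-z\s.!?]", " ", raw_text)
    # 3. Sentence segmentation.
    sentences = nltk.sent_tokenize(cleaned)
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    normalized = []
    for sent in sentences:
        words = nltk.word_tokenize(sent.lower())
        # 1-2. Stop-word removal and stemming (normalization).
        words = [stemmer.stem(w) for w in words if w.isalpha() and w not in stop_words]
        normalized.append(" ".join(words))
    # 4. Text vectorization with a Bag-of-Words model.
    bow_matrix = CountVectorizer().fit_transform(normalized)
    return sentences, normalized, bow_matrix

sentences, normalized, bow = preprocess(
    "The quick brown fox jumps over the lazy dog. It was not amused!")
print(normalized)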
B. Text Cleaning
Text cleaning is a critical step in text summarization using NLP
(Natural Language Processing) that involves removing irrelevant or
unnecessary data from the raw text. The goal of text cleaning is to
extract the most useful and relevant information from the text,
while removing any noise or irrelevant data that could negatively
impact the accuracy and quality of the summary.
Some of the techniques involved in text cleaning include:
1. Removing stop words and other unwanted tokens, typically using
libraries such as NLTK or spaCy.
2. Correcting spelling and grammar errors: Spelling and grammar
errors can negatively impact the quality of the summary. Therefore,
they are often corrected using tools such as LanguageTool or
Grammarly.
3. Removing HTML tags: If the text is extracted from the web, it may
contain HTML tags. These tags are not relevant to text summarization
and hence are removed.
Algorithms used for text cleaning:
1. Regular Expressions (Regex): Regular expressions are a powerful
tool for text cleaning that allows for pattern matching and
replacement. They are commonly used for removing punctuation,
special characters, and numbers from text.
2. NLTK (Natural Language Toolkit): NLTK is a popular Python library
for natural language processing that provides a wide range of tools
and algorithms for text cleaning, including tokenization, stop-word
removal, and stemming.
These algorithms play a critical role in the text summarization
process by preparing the text for tokenization and vectorization.
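A minimal sketch of the two cleaning tools named above, combining a
regular expression pass with NLTK stop-word removal and stemming;
the exact patterns and the order of the steps are illustrative
assumptions.

# Text cleaning sketch: regex for unwanted characters, NLTK for stop words and stemming.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove punctuation, numbers, symbols
    tokens = nltk.word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(clean_text("<p>The 2 cats are running, and they are fast!</p>"))
# ['cat', 'run', 'fast']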
C. Sentence Tokenization
The "Sentence Tokenization" step in the TF-IDF model involves
breaking down the cleaned text data into individual sentences, which
serves as a prerequisite for word tokenization. This step is
critical as it helps to segment the text data into more manageable
units that can be further processed.
Sentence tokenization is performed by analyzing the structure of the
text and identifying the boundaries between sentences. There are
several methods for sentence tokenization, including rule-based
methods, statistical methods, and machine learning methods.
Rule-based methods: Rule-based methods use a set of predefined rules
to identify sentence boundaries. These rules may include identifying
punctuation marks such as periods, question marks, and exclamation
marks as sentence boundaries. However, this approach may not always
be reliable, as certain punctuation marks may also appear within
sentences (such as in abbreviations or quotations).
Statistical methods: Statistical methods use probability models to
identify sentence boundaries. These models are trained on large
datasets of text and learn to recognize patterns that indicate the
end of a sentence, such as the appearance of a period or a
combination of punctuation marks.
Machine learning methods: Machine learning methods use algorithms to
identify sentence boundaries based on patterns learned from labeled
examples. This approach requires a large amount of labeled data to
train the algorithm effectively.
Algorithm for Sentence Tokenization:
Step 1: Initialize an empty list to hold the sentences.
Step 2: Loop through the cleaned text data and identify potential
sentence boundaries.
Step 3: Check if the potential sentence boundary is valid (such as a
period, question mark, or exclamation mark followed by a space), and
if so, add the sentence to the list of sentences.
Step 4: Continue looping through the text until all potential
sentence boundaries have been evaluated.
Step 5: Return the list of sentences.
Sentence tokenization is thus a crucial step in the TF-IDF model, as
it allows the text data to be segmented into individual sentences
for further processing. The mechanism of sentence tokenization
involves analyzing the structure of the text to identify potential
sentence boundaries using rule-based, statistical, or machine
learning methods, and the resulting algorithm involves checking for
valid sentence boundaries and constructing a list of sentences.
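A direct, rule-based implementation of Steps 1-5 might look like the
sketch below. It applies only the simplified boundary rule stated in
Step 3, so, as noted above, it will still split incorrectly on
abbreviations.

# Rule-based sentence tokenizer following Steps 1-5.
def sentence_tokenize(text):
    sentences = []                                  # Step 1: empty list of sentences
    start = 0
    for i, ch in enumerate(text):                   # Step 2: scan for potential boundaries
        at_end = (i + 1 == len(text)) or text[i + 1] == " "
        if ch in ".!?" and at_end:                  # Step 3: validate the boundary
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1                           # Step 4: continue until text is exhausted
    leftover = text[start:].strip()
    if leftover:
        sentences.append(leftover)
    return sentences                                # Step 5: return the list of sentences

print(sentence_tokenize("NLP is fun. Is it weird I don't like coffee? Yes!"))
# ['NLP is fun.', "Is it weird I don't like coffee?", 'Yes!']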
D. Word Tokenization
Tokenization is an important step in various natural language
processing tasks such as text classification, sentiment analysis,
machine translation, and speech recognition. It helps to reduce the
complexity of text analysis by breaking down the text into smaller
units that can be processed more easily.
Word tokenization is the process of breaking a large piece of text
into smaller units called tokens, which are usually words or
sub-words. It is a fundamental step in natural language processing
and text analysis. Tokenization helps to convert unstructured text
into a structured format that can be easily processed and analyzed
by computer algorithms. Tokenization algorithms typically work by
separating words based on spaces, punctuation, or special
characters.
Example:
"Is it weird I don't like coffee?"
By performing word-based tokenization with space as a delimiter, we
get:
["Is", "it", "weird", "I", "don't", "like", "coffee?"]
If we look at the tokens "don't" and "coffee?", we will notice that
these words have punctuation attached to them. What if there is
another raw text (sentence) in our corpora like this: "I love
coffee." This time there will be a token "coffee.", which can lead
the model to learn different representations of the word coffee
("coffee?" and "coffee.") and will make the representation of words
(tokens) suboptimal.
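The sketch below contrasts the naive space-delimited split from the
example with a punctuation-aware tokenizer; NLTK's word_tokenize is
used here as one readily available option.

# Space-delimited tokenization vs. punctuation-aware tokenization.
import nltk
nltk.download("punkt", quiet=True)

text = "Is it weird I don't like coffee?"

print(text.split(" "))
# ['Is', 'it', 'weird', 'I', "don't", 'like', 'coffee?']   <- punctuation stays attached

print(nltk.word_tokenize(text))
# ['Is', 'it', 'weird', 'I', 'do', "n't", 'like', 'coffee', '?']   <- punctuation split off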
E. TF-IDF Value Calculation
From the preprocessed list of words, the TF-IDF value of each noun
and verb can then be calculated. The TF-IDF value of a word w in a
document d is the product of its term frequency and its inverse
document frequency:
TF-IDF(w, d) = TF(w, d) x IDF(w),
where TF(w, d) is the number of occurrences of w in d divided by the
total number of words in d, and IDF(w) = log(N / n_w), with N the
total number of documents and n_w the number of documents that
contain w.
Example: TF-IDF calculation for documents d1-d4
d1: the best Italian restaurant enjoy the best pasta
d2: American restaurant enjoy the best hamburger
d3: Korean restaurant enjoy the best bibimbap
d4: the best the best American restaurant
TF-IDF value calculation:

Word        TF (d1, d2, d3, d4)    IDF              TF*IDF (d1, d2, d3, d4)
Italian     1/8, 0/6, 0/6, 0/6     log(4/1) = 0.6   0.075, 0, 0, 0
restaurant  1/8, 1/6, 1/6, 1/6     log(4/4) = 0     0, 0, 0, 0
enjoy       1/8, 1/6, 1/6, 0/6     log(4/3) = 0.13  0.016, 0.02, 0.02, 0
the         2/8, 1/6, 1/6, 2/6     log(4/4) = 0     0, 0, 0, 0
best        2/8, 1/6, 1/6, 2/6     log(4/4) = 0     0, 0, 0, 0
pasta       1/8, 0/6, 0/6, 0/6     log(4/1) = 0.6   0.075, 0, 0, 0
American    0/8, 1/6, 0/6, 1/6     log(4/2) = 0.3   0, 0.05, 0, 0.05
hamburger   0/8, 1/6, 0/6, 0/6     log(4/1) = 0.6   0, 0.1, 0, 0
Korean      0/8, 0/6, 1/6, 0/6     log(4/1) = 0.6   0, 0, 0.1, 0
bibimbap    0/8, 0/6, 1/6, 0/6     log(4/1) = 0.6   0, 0, 0.1, 0

Fig. 2. TF-IDF Calculation

The value of TF-IDF ranges from zero to one with ten-digit
precision. After being calculated, the words are sorted in
descending order by their values and compiled into a new dictionary
of words and values. This sorting is important for analyzing the
rank of the TF-IDF values of all the words when checking the output
summary. Once the TF-IDF value of each word is known, the importance
value of a sentence can be calculated: it is the sum of the values
of every noun and verb in the sentence. Every sentence in the
document is then sorted in descending order of importance.
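The table in Fig. 2 can be reproduced with a short script. The
sketch below uses raw counts divided by document length for TF and a
base-10 logarithm for IDF, which matches the (rounded) values shown.

# Reproduce the TF-IDF values of Fig. 2 (TF = count / doc length, IDF = log10(N / df)).
import math

docs = {
    "d1": "the best italian restaurant enjoy the best pasta",
    "d2": "american restaurant enjoy the best hamburger",
    "d3": "korean restaurant enjoy the best bibimbap",
    "d4": "the best the best american restaurant",
}
tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokenized.values() for w in words})
N = len(docs)

for word in vocab:
    df = sum(word in words for words in tokenized.values())   # document frequency
    idf = math.log10(N / df)
    tf_idf = {name: round(words.count(word) / len(words) * idf, 3)
              for name, words in tokenized.items()}
    print(f"{word:<11} idf={idf:.2f}  {tf_idf}")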
F. Cosine Similarity
Cosine similarity is a measure of similarity between two non-
zero vectors in a high-dimensional space, commonly used in
natural language processing (NLP) for comparing the
similarity of documents or sentences based on the occurrence
of words or phrases they contain.
In NLP, documents or sentences are typically represented as
vectors, where each dimension represents a particular term or
word in a given vocabulary, and the value in each dimension
represents the frequency or weight of the corresponding term
in the document or sentence. The cosine similarity between
two vectors A and B is defined as the cosine of the angle
between them in the high-dimensional space, which is
calculated as the dot product of A and B divided by the
product of their magnitudes:
cosine similarity = (A · B) / (||A|| × ||B||)
The resulting similarity score ranges from 0 (no similarity) to
1 (identical). Higher values indicate greater similarity
between the two documents or sentences.
Cosine similarity is a popular similarity measure in NLP
because it is efficient to calculate and provides a simple and
intuitive way to compare the similarity of two texts. It is
widely used in various applications such as information
retrieval, document clustering, text classification, and
recommendation systems.
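A small sketch of the cosine-similarity computation on term-count
vectors, using NumPy; representing the sentences as plain count
vectors is one possible choice of weighting.

# Cosine similarity between two term-count vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def count_vector(sentence, vocab):
    words = sentence.lower().split()
    return [words.count(term) for term in vocab]

s1 = "the best Italian restaurant enjoy the best pasta"
s2 = "American restaurant enjoy the best hamburger"
vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))

print(round(cosine_similarity(count_vector(s1, vocab), count_vector(s2, vocab)), 3))
# 0.707 -- the two sentences share several terms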
G. Summary Generation
Finally, after finding the cosine-similarity for all vectorized
pairs, we average the weights of each vector, and return the
indexes of the vectors with the highest averages. These
indexes are then used to pull out the sentences from the
original text for the summarization. The sentences with the
highest average weights will capture the unique and
important sentences from the original text.
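The selection step described above can be sketched as follows: build
a pairwise cosine-similarity matrix over the sentence vectors,
average each row, and return the highest-scoring sentences in their
original order. Keeping the top k sentences (rather than using a
score threshold) is an illustrative assumption.

# Summary generation sketch: average pairwise cosine similarity, keep the top-k sentences.
import numpy as np

def summarize(sentences, vectors, k=2):
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                        # pairwise cosine similarities
    avg_weight = sim.mean(axis=1)              # average weight of each sentence vector
    top = sorted(np.argsort(avg_weight)[-k:])  # indexes with the highest averages, in order
    return " ".join(sentences[i] for i in top)

sentences = ["Cats purr.", "Dogs bark loudly.", "Cats and dogs are pets."]
vectors = [[1, 0, 0, 1, 0],                    # toy count vectors over a small vocabulary
           [0, 1, 1, 0, 0],
           [1, 1, 0, 0, 1]]
print(summarize(sentences, vectors, k=2))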
H. Model Deployment
GUI of the project: the deployed interface accepts the input as
plain text and displays the summarized output (screenshots: GUI of
project, Input as Text, Summarized Output).
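The paper does not name a deployment stack, so the following is a
purely hypothetical sketch of exposing the summarizer behind a small
web endpoint using Flask (the framework choice, route name, and the
placeholder summarize() function are all assumptions).

# Hypothetical deployment sketch; Flask and this endpoint are assumptions, not the paper's stack.
from flask import Flask, request, jsonify

app = Flask(__name__)

def summarize(text):
    # Placeholder: the real system would call the TF-IDF / cosine-similarity pipeline above.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:2]) + ("." if sentences else "")

@app.route("/summarize", methods=["POST"])
def summarize_endpoint():
    text = request.get_json(force=True).get("text", "")
    return jsonify({"summary": summarize(text)})

if __name__ == "__main__":
    app.run(debug=True)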
References