NLP Report
By
Supervisor
Prof. Mahesh Maurya
University of Mumbai
(BE 2023-24)
CERTIFICATE
This is to certify that the report entitled “Text Summarization using NLP” is a bona fide work of Aneesh Vinod Panchal (Roll No: 07), Gaurang Rajam (Roll No: 10) and Sachin Satam (Roll No: 15), submitted to the University of Mumbai in partial fulfillment of the requirements for the degree.
Supervisor
Prof. Mahesh Maurya
This report by Aneesh Vinod Panchal (Roll No: 07), Gaurang Rajam (Roll No: 10) and Sachin Satam (Roll No: 15) is approved for the degree.
Examiners
1………………………………………
(Internal Examiner Name & Sign)
2…………………………………………
(External Examiner name & Sign)
Date:
Place: Thane
Contents
Abstract
Acknowledgements
List of Abbreviations
List of Figures
List of Tables
1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
3 Proposed System
3.1 Introduction
3.2 Algorithm
3.3 Details of Software and Hardware Requirements
3.4 Experiments and Results
3.5 Conclusion and Future Work
4 References
Abstract
Text summarization is the process of creating a condensed form of a text document that retains the significant information and general meaning of the source text. Automatic text summarization has become an important way of finding relevant information precisely in large texts, in a short time and with little effort. Text summarization approaches are classified into two categories, extractive and abstractive, and this report presents a survey of both. The main motivation is the challenge of making a computer understand a document of any extension and generate its summary; reducing the time and effort a user spends reading through an entire document to learn what it is about is also a driving force behind this work, since summarizing large text documents manually is difficult for human beings. An extractive summarization method concatenates important sentences or paragraphs without understanding their meaning, whereas an abstractive summarization method generates a meaningful summary in new words. The system described here combines statistical and linguistic analysis of the text document, so the generated summary is better than that of mere statistical summarizers based on word-frequency counts alone. The addition of plural resolution and abbreviation resolution adds further precision to the summary. The concept of normalization introduced here lets sentences receive weights based purely on the value of their content words rather than on the number of words they contain; therefore, even a short but important sentence earns its place based on the value of its words. Adding linguistic features to the algorithm fine-tunes the summary to a higher level.
Acknowledgement
No project is ever complete without the guidance of experts who have trodden the path before, become masters of it and, as a result, our mentors. We would like to take this opportunity to thank all the individuals who have helped us in visualizing this project. The guidance of Keerti Kharatmol played a great role in our research work and helped us find relevant information about our topic. We are grateful for the opportunity to present our work. We would like to express our gratitude to K.C. College of Engineering and Management Studies & Research, as well as to our Head of Department, Prof. Mahesh Maurya, for encouraging students to express their ideas and research. Our sincere thanks go to our college Principal, Dr. Vilas Nitnaware, for believing in the work of the students and pushing our limits to do better in our field of study.
List of Abbreviations
List of Tables
List of Symbols
1. Introduction
1.1 Introduction
To reduce the length and complexity of a document while retaining its essential qualities, we turn to a summarizer. Titles, keywords, tables of contents and abstracts may all be considered forms of summary. In a full-text document, the abstract plays the role of a summary of that particular document. Such summaries are intermediates between a document’s title and its full text, and are useful for rapid relevance judgments and quick assessment of the document. Auto-summarization is a technique that generates a summary of any document, provides briefs of large documents, and so on. There is an abundance of text material available on the Internet; however, the Internet usually provides more information than is needed, and it is very difficult for human beings to manually summarize large documents of text. A twofold problem is therefore encountered: searching for relevant documents through an overwhelming number of available documents, and absorbing a large quantity of relevant information. The goal of automatic text summarization is to address this problem by condensing the source text into a shorter version that preserves its essential information.
Microsoft Word’s AutoSummarize function is a simple example of text summarization. Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization [1] method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form; the importance of sentences is decided based on their statistical and linguistic features. An abstractive summarization [2] method attempts to develop an understanding of the main concepts in a document and then express those concepts in clear natural language. It uses linguistic methods [3] to examine and interpret the text, and then finds new concepts and expressions to best describe it, generating a new, shorter text that conveys the most important information from the original document.
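As an illustration of the extractive approach described above, a minimal frequency-based extractive summarizer might look like the following sketch. The sentence splitter and scoring scheme here are simplified assumptions for illustration, not the system's actual implementation:

```python
from collections import Counter
import re

def extractive_summary(text, num_sentences=2):
    """Score sentences by the frequency of their words and keep the top ones."""
    # Naive sentence split on terminal punctuation (illustration only)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # A sentence's score is the sum of the corpus frequencies of its words
    scores = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scores, reverse=True)[:num_sentences]
    # Restore original sentence order for readability
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

An abstractive system, by contrast, would paraphrase rather than copy whole sentences, which is why it requires deeper linguistic interpretation.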
1.2 Motivation
Every machine learning pipeline is a set of operations executed to produce a model. An ML model is roughly defined as a mathematical representation of a real-world process; we can think of it as a function that takes some input data and produces an output (a classification, a sentiment, a recommendation, or clusters). The performance of each model is evaluated using evaluation metrics such as precision and recall, or accuracy.
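The precision and recall metrics mentioned above can be computed directly from binary predictions; the following is a generic illustration, not tied to any particular model in this report:

```python
def precision_recall(predicted, actual):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))  # true positives
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))  # false positives
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```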
A. Kukkar [9] introduced an effective approach to produce flexible and useful bug-report summaries and to reduce developers' workload. The authors used particle swarm optimization (PSO) to search for effective semantic text, addressing four central points: extractive bug-report summarization, increasing the ROUGE score by selecting effective semantic text, data sparsity, and information reduction. The proposed methodology used a collection of comments and several feature-extraction methods to produce the bug-report summary. Multiple candidate summary subsets were produced, and the optimal subset was selected by the PSO optimization technique. The authors compared the proposed approach with the existing Email Classifier (EC) and Bug Report Classifier (BRC); the ROUGE score was selected as one of the evaluation criteria and was calculated for all approaches. The ROUGE score was also compared against three human-generated summaries of 10 bug reports from the Rastkar dataset. As a result, the PSO summary subset was less redundant and included all the important points that need to be present in a bug report.
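Since ROUGE is used as the evaluation criterion above, a minimal ROUGE-1 recall computation can be sketched as follows; real ROUGE implementations add stemming, stopword options and further variants (ROUGE-2, ROUGE-L), so this is a simplified illustration:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each reference word counts at most as often as it appears
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())
```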
Beibei Huai and team [10] proposed a new intention-based bug-report summarization approach, called IBRS, which is built on an intention taxonomy. This work considered sentence intentions in order to generate the summary report. Sentence intentions were classified into seven categories: bug description, fix solution, opinion expressed, information seeking, information giving, meta/code, and emotion expressed. Sentences are assigned to a specific intention with the help of pattern matching and a machine-learning model, and the bug-report summary is then produced. This summary was compared with BRC (Bug Report Classifier) and found better in terms of precision (5% improvement), recall (3% improvement), F-score (3% improvement) and pyramid precision (5% improvement).
Creating a summary involves selecting the important topics of sentences as well as recognizing the relevant relationships among the concepts mentioned in the text. The key problem is generalization, as identified in the automatic text summarization (ATS) task: for example, summarizing financial or medical reports is conceptually different from summarizing news articles. To address this issue and achieve more relevant summaries, Hernández-Castañeda, Ángel [11] proposes EATS, which is based on a clustering technique driven by a genetic algorithm (GA) to find the relevant topics in the given document. To identify the key sentences in the clusters, the method includes a topic-modeling algorithm (LDA) based on automatically generated keywords. The clustering technique uses LDA and Doc2Vec, along with tf-idf and n-grams, to map text to numeric vectors. The method was tested on the DUC02 dataset, with the goal of producing summaries as close as possible to human-generated ones.
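The tf-idf mapping from text to numeric vectors mentioned above can be sketched in a few lines. This is a simplified, smoothed variant for illustration; EATS itself combines such features with LDA and Doc2Vec representations:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {word: tf-idf} dict (smoothed idf, illustration only)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log((1 + n) / (1 + df[w]))
                        for w in tf})
    return vectors
```

Words that appear in every document (here, "a") get an idf of zero and thus carry no weight, while rarer words dominate the vector.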
3. Proposed System
3.1 Introduction
As proposed earlier, we have used the NLTK, NumPy and NetworkX libraries for machine learning.
A. NLTK :
The Natural Language Toolkit (NLTK) is a platform used for building Python programs that work with human
language data for applying in statistical natural language processing (NLP). It contains text processing
libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning. It also includes
graphical demonstrations and sample data sets as well as accompanied by a cook book and a book which
explains the principles behind the underlying language processing tasks that NLTK supports. The Natural
Language Toolkit is an open source library for the Python programming language originally written by Steven
Bird, Edward Loper and Ewan Klein for use in development and education. It comes with a hands-on guide
that introduces topics in computational linguistics as well as programming fundamentals for Python which
makes it suitable for linguists who have no deep knowledge in programming, engineers and researchers that
need to delve into computational linguistics, students and educators. NLTK includes more than 50 corpora and
lexical sources such as the Penn Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and
Lin’s Dependency Thesaurus. Natural Language Processing with Python provides a practical introduction to
programming for language processing. Written by the creators of NLTK, it guides the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic
structure, and more. The online version of the book has been updated for Python 3 and NLTK 3.
B. Numpy :
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a
multidimensional array object, various derived objects (such as masked arrays and matrices), and an
assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation,
sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random
simulation and much more. At the core of the NumPy package, is the ndarray object. This encapsulates n-
dimensional arrays of homogeneous data types, with many operations being performed in compiled code for
performance. There are several important differences between NumPy arrays and the standard Python
sequences: NumPy arrays have a fixed size at creation, all of their elements must be of the same data type, and operations on them execute far more efficiently than the equivalent operations on Python lists.
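These differences can be seen in a short example: arrays are homogeneous, support element-wise arithmetic without explicit loops, and change shape cheaply:

```python
import numpy as np

# ndarray: homogeneous, fixed-size, with fast vectorized operations
a = np.array([1, 2, 3, 4], dtype=np.float64)
b = a * 2 + 1            # element-wise arithmetic, no explicit Python loop
m = a.reshape(2, 2)      # shape manipulation without copying the data
print(b)                 # [3. 5. 7. 9.]
print(m.sum(axis=0))     # column sums: [4. 6.]
print(a.dtype, a.shape)
```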
C. NetworkX :
NetworkX is a Python language software package for the creation, manipulation, and study of the structure,
dynamics, and function of complex networks. It is used to study large complex networks represented as
graphs with nodes and edges. Using NetworkX we can load and store complex networks. We can generate
many types of random and classic networks, analyze network structure, build network models, design new
network algorithms and draw networks. NetworkX is suitable for operation on large real-world graphs,
e.g., graphs of more than 10 million nodes and 100 million edges. Due to its reliance on a pure-Python
"dictionary of dictionaries" data structure, NetworkX is a reasonably efficient, very scalable, highly
portable framework for network and social-network analysis.
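A small NetworkX example showing graph construction and PageRank, the kind of ranking a TextRank-style summarizer applies to a sentence-similarity graph; the node names here are arbitrary:

```python
import networkx as nx

# Build a small undirected graph of four "sentences"
G = nx.Graph()
G.add_edges_from([("s1", "s2"), ("s1", "s3"), ("s2", "s3"), ("s3", "s4")])

# PageRank assigns higher scores to better-connected nodes
scores = nx.pagerank(G)
best = max(scores, key=scores.get)   # s3 has the most connections
print(best, scores[best])
```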
3.3 Details of Software and Hardware Requirements
Software Requirements:
1. Operating system: Windows 10 or any browser-compatible OS
2. Web browser
3. Any mobile phone with a compatible Android/iOS version
Hardware Requirements:
1. All the hardware required to connect to the Internet, e.g., modem, WAN/LAN, Ethernet cable
2. Storage: size of the web browser
3. RAM: 4 GB
4. Processor: Intel Core i3
5. Any browser-compatible mobile phone with Internet access
3.4 Experiments and Results
Program
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

def sentence_similarity(sent1, sent2, stop_words):
    # Bag-of-words cosine similarity between two tokenized sentences
    sent1 = [w.lower() for w in sent1 if w.lower() not in stop_words]
    sent2 = [w.lower() for w in sent2 if w.lower() not in stop_words]
    all_words = list(set(sent1 + sent2))
    vector1 = [sent1.count(w) for w in all_words]
    vector2 = [sent2.count(w) for w in all_words]
    return 1 - cosine_distance(vector1, vector2)

def generate_summary(file_name, num_sentences=5):
    stop_words = set(stopwords.words("english"))
    with open(file_name) as f:
        sentences = sent_tokenize(f.read())
    word_tokens = [word_tokenize(s) for s in sentences]
    # Build the pairwise sentence-similarity matrix
    sentence_similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sentence_similarity_matrix[i][j] = sentence_similarity(
                    word_tokens[i], word_tokens[j], stop_words)
    # Rank sentences with PageRank over the similarity graph (TextRank)
    scores = nx.pagerank(nx.from_numpy_array(sentence_similarity_matrix))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top sentences, restored to their original order
    summary = [sentences[i] for i in sorted(ranked[:num_sentences])]
    return "\n".join(summary)

# Example usage
if __name__ == "__main__":
    input_text_file = "/content/input_text5.txt"
    num_summary_sentences = 5
    summary = generate_summary(input_text_file, num_sentences=num_summary_sentences)
    print("Let's summarize the given file.....")
    print()
    print(summary)
    print()
Input Text :
As the G20 concluded on Sunday and Prime Minister Narendra Modi handed over the presidency to Brazil,
there was recognition of the efforts India made to arrive at a consensus on a joint communique. The theme
for India’s G20 presidency was Vasudhaiva Kutumbakam: One Earth, One Family, One Future. In his
speeches over the years, Prime Minister Narendra Modi has spoken about India taking on a leadership role
in global affairs as “vishwa guru”, given its population and scale of economy, and this was on display in the
last two days. In his opening remarks at the Summit, the PM said, At the place where we are gathered today,
just a few kilometres away from here, stands a pillar that is nearly two-and-a-half thousand years old.
Inscribed on this pillar in the Prakrit language are the words: ‘Hevam loksa hitmukhe ti, atha iyam natisu
hevam’. Meaning, the welfare and happiness of humanity should always be ensured. Two-and-a-half
thousand years ago, the land of India gave this message to the entire world. Let us begin this G20 Summit
by remembering this message.
Output Text :
Two-and-a-half thousand years ago, the land of India gave this message to the entire world. As the G20
concluded on Sunday and Prime Minister Narendra Modi handed over the presidency to Brazil, there was
recognition of the efforts India made to arrive at a consensus on a joint communique. In his speeches over
the years, Prime Minister Narendra Modi has spoken about India taking on a leadership role in global affairs
as “vishwa guru”, given its population and scale of economy, and this was on display in the last two days. In
his opening remarks at the Summit, the PM said, At the place where we are gathered today, just a few
kilometres away from here, stands a pillar that is nearly two-and-a-half thousand years old. Let us begin this
G20 Summit by remembering this message.
3.5.1 Conclusion
A text summarizer is a valuable tool that helps condense lengthy or complex documents into concise and coherent
summaries. It can save time and effort for readers who need to quickly grasp the main points of a text, making it
especially useful in fields like journalism, research, and education. While text summarizers have their advantages,
it's important to remember that they are not infallible and may not always capture the nuances of a text. Therefore,
human judgment and editing are often necessary to ensure the accuracy and clarity of the summary. As technology
continues to advance, text summarizers are likely to become even more sophisticated and play an increasingly
important role in information processing and knowledge dissemination.
3.5.2 Future Work
1. Text summarization algorithms will continue to advance, leading to greater accuracy in capturing the essence of
a text. Machine learning models, such as transformer-based models like GPT-4, are likely to further enhance
summarization capabilities.
2. An option to switch between dark mode and light mode will be provided in the application interface.
3. Future text summarizers will become more proficient at summarizing content in multiple languages, breaking
down language barriers and enabling more accessible information sharing globally.
4. Text summarization will integrate with visual data, creating summaries that include images, charts, and graphs.
This will be especially useful for summarizing data-rich documents.
4 References
• https://iopscience.iop.org/article/10.1088/1742-6596/2040/1/012044/pdf
• https://www.ijert.org/research/text-summarizer-using-abstractive-and-extractive-method-IJERTV3IS050821.pdf
• https://journals.sagepub.com/home/thr
• https://www.freecodecamp.org/learn/scientific-computing-with-python
• https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623462