
Excelsior Education Society’s

K.C. College Of Engineering and Management Studies and Research


(Affiliated to the University of Mumbai)
Mith Bunder Road, Near Hume Pipe, Kopri, Thane (E)-400603

Text Summarization using NLP

Submitted in partial fulfillment of the requirements of the degree


BACHELOR OF ENGINEERING IN COMPUTER
ENGINEERING

By

Aneesh Panchal (B-07)


Gaurang Rajam (B-10)
Sachin Satam (B-15)

Supervisor
Prof. Mahesh Maurya

Department of Computer Engineering


K.C. College of Engineering and Management Studies and
Research
Mith Bunder Road, Kopri, Thane (E)-400603

University of Mumbai
(BE 2023-24)
CERTIFICATE

This is to certify that the Report entitled “Text Summarization using NLP” is a bonafide

work of Aneesh Vinod Panchal (Roll No:07), Gaurang Rajam (Roll No:10), Sachin Satam

(Roll No:15) submitted to the University of Mumbai in partial fulfillment of the requirement

for the award of the degree of “Bachelor of Engineering” in “Computer Engineering”.

Supervisor

Mahesh Maurya

Head of The Department Principal


Mahesh Maurya Vilas Nitnaware
Report Approval
This Report entitled “Text Summarization using NLP” is a bonafide work of Aneesh Vinod Panchal

(Roll No:07), Gaurang Rajam (Roll No:10), Sachin Satam (Roll No:15) is approved for the degree

of “ Bachelor of Engineering” in “Computer Engineering”.

Examiners

1………………………………………
(Internal Examiner Name & Sign)

2…………………………………………
(External Examiner name & Sign)

Date:

Place: Thane
Contents

Abstract

Acknowledgement

List of Abbreviations

List of Tables

List of Figures

1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement

2 Literature Survey
2.1 Survey of Existing Projects
2.2 Limitations of Existing Systems
2.3 Mini Project Contribution

3 Proposed System
3.1 Introduction
3.2 Algorithm
3.3 Details of Software and Hardware Requirements
3.4 Experiments and Results
3.5 Conclusion and Future Work

4 References
Abstract

Text summarization is the process of creating a condensed form of a text document that preserves the significant information and general meaning of the source text. Automatic text summarization has become an important way of finding relevant information precisely in large texts, in a short time and with little effort. Text summarization approaches are classified into two categories, extractive and abstractive, and this report presents a survey of both. The main motivation is the challenge of making a computer understand a document with any extension and generate a summary of it; reducing the time and effort a user spends reading through an entire document to learn what it is about is a further driving force behind this work. Summarizing large text documents manually is difficult for human beings. An extractive summarization method concatenates important sentences or paragraphs without understanding their meaning, whereas an abstractive summarization method generates a meaningful summary in new words. The system used here combines statistical and linguistic analysis of the text document, so the summary it generates is better than that of purely statistical summarizers based on word-frequency counts. The addition of plural resolution and abbreviation resolution adds precision to the summary. The concept of normalization introduced here lets sentences receive weights based purely on the value of their content words rather than on the number of words they contain, so even a short but important sentence earns its place. Adding linguistic features to the algorithm fine-tunes the summary to a higher level.
Acknowledgement

No project is ever complete without the guidance of experts who have walked the path before, mastered it, and thereby become our mentors. We would like to take this opportunity to thank all the individuals who helped us in visualizing this project. The guidance of Keerti Kharatmol played a great role in our research work and helped us find relevant information about our topic. We are grateful for the opportunity to present our work. We would like to express our gratitude to K.C. College of Engineering and Management Studies & Research, as well as to our Head of Department, Prof. Mahesh Maurya, for encouraging students to express their ideas and research. Our sincere thanks go to our college Principal, Dr. Vilas Nitnaware, for believing in the work of his students and pushing us to do better in our field of study.

List of Abbreviations

NLTK - Natural Language Toolkit
OS - Operating System
RAM - Random Access Memory
API - Application Programming Interface

List of Tables

Table 2.1 Literature Survey

List of Figures

Fig 3.2.1 Workflow of the program
Fig 3.3.1.1 Main Program
Fig 3.3.1.2 Test Program
Fig 3.3.2.1 Program Output
Fig 3.3.2.2 Program Output 2
1. Introduction
1.1 Introduction

To reduce the length and complexity of a document while retaining its essential qualities, we turn to a summarizer. Titles, keywords, tables of contents and abstracts can all be considered forms of summary. In a full-text document, the abstract acts as a summary of that particular document: it sits between the document’s title and its full text and is useful for rapid assessment of the document’s relevance. Auto-summarization is a technique that generates a summary of any document, provides briefs of large documents, and so on. There is an abundance of text material available on the Internet, but the Internet usually provides more information than is needed, and it is very difficult for human beings to manually summarize large documents of text. A twofold problem is therefore encountered: searching for relevant documents among an overwhelming number of available documents, and absorbing a large quantity of relevant information. The goal of automatic text summarization is to address both problems by condensing the source into a shorter form that preserves its key information.

Microsoft Word’s AutoSummarize function is a simple example of text summarization. Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization [1] method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form; the importance of sentences is decided based on their statistical and linguistic features. An abstractive summarization [2] method attempts to develop an understanding of the main concepts in a document and then express those concepts in clear natural language. It uses linguistic methods [3] to examine and interpret the text, and then finds new concepts and expressions to best describe it, generating a new, shorter text that conveys the most important information from the original document.

1.2 Motivation

Every machine learning pipeline is a set of operations executed to produce a model. An ML model is, roughly, a mathematical representation of a real-world process: a function that takes some input data and produces an output (a classification, a sentiment, a recommendation, or clusters). The performance of each
model is evaluated using metrics such as precision and recall, or accuracy.
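To make these metrics concrete, the following is a minimal sketch of how precision, recall and accuracy are computed from a classifier's predictions; the labels used are hypothetical examples, not project data.

```python
# Minimal sketch: precision, recall and accuracy for a binary classifier.
# The label lists below are hypothetical, for illustration only.

def precision_recall_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(y_true)
    return precision, recall, accuracy

p, r, a = precision_recall_accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
# tp=2, fp=1, fn=1, correct=3 -> precision=2/3, recall=2/3, accuracy=0.6
```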

1.3 Problem Statement


To implement text summarizer algorithm using Machine Learning Libraries.
2. Literature Survey

Author A. Kukkar [9] introduced an effective approach to produce a flexible and productive bug-report summary and to minimize the load on developers. The author used a particle swarm optimization (PSO) approach to search for effective semantic text, addressing four central points: extractive bug-report summarization, increasing the ROUGE score by selecting effective semantic text, data sparsity, and information reduction. The proposed methodology used a collection of comments and several feature-extraction methods to produce the bug-report summary. Multiple summary subsets were produced and the optimal subset was selected by the PSO optimization technique. The author compared the proposed approach with an existing Email Classifier (EC) and Bug Report Classifier (BRC), using the ROUGE score as one of the evaluation criteria for all approaches. The ROUGE score was also compared against three human-generated summaries of 10 bug reports from the Rastkar dataset. As a result, the PSO summary subset was less redundant and included all the important points that need to be present in a bug report.

Author Beibei Huai and team [10] proposed a new intention-based bug-report summarization approach, IBRS, which is built on an intention taxonomy. This work considers sentence intentions in order to generate the summary report. Sentence intentions are classified into seven categories: bug description, fix solution, opinion expressed, information seeking, information giving, meta/code, and emotion expressed. Sentences are assigned to a specific intention with the help of pattern matching and a machine learning model, and the bug-report summary is then produced. This summary was compared with BRC (Bug Report Classifier) and found better in terms of precision (5% improvement), recall (3%), F-score (3%) and pyramid precision (5%).

Creating a summary involves selecting the important topics of sentences as well as recognizing the relevant relationships among the concepts mentioned in the text. The key problem identified in the ATS task is generalization: for example, summarizing financial or medical reports is conceptually different from summarizing news articles. To address this issue and achieve more relevant summaries, author Ángel Hernández-Castañeda [11] proposes EATS. EATS is based on a clustering technique supported by a Genetic Algorithm (GA) to find the relevant topics in a document. To identify key sentences within the clusters, the method includes a topic-modeling algorithm (LDA) based on automatically generated keywords. The clustering technique uses LDA and Doc2Vec to map text to numeric vectors, along with tf-idf and n-grams. The method was tested on the DUC02 dataset with the goal of producing summaries as close as possible to human-generated ones.
3. Proposed System
3.1 Introduction

As proposed earlier, we have used the NLTK, NumPy and NetworkX machine learning libraries.

A. NLTK :

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human
language data for statistical natural language processing (NLP). It contains text-processing libraries
for tokenization, parsing, classification, stemming, tagging and semantic reasoning. It also includes
graphical demonstrations and sample data sets, and is accompanied by a book that explains the principles
behind the underlying language-processing tasks that NLTK supports. NLTK is an open-source library for
the Python programming language, originally written by Steven Bird, Edward Loper and Ewan Klein for use
in development and education. It comes with a hands-on guide that introduces topics in computational
linguistics as well as programming fundamentals for Python, which makes it suitable for linguists with
no deep programming knowledge, for engineers and researchers who need to delve into computational
linguistics, and for students and educators. NLTK includes more than 50 corpora and lexical resources
such as the Penn Treebank Corpus, Open Multilingual Wordnet, the Problem Report Corpus, and Lin’s
Dependency Thesaurus. Natural Language Processing with Python, written by the creators of NLTK, provides
a practical introduction to programming for language processing, guiding the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic
structure, and more. The online version of the book has been updated for Python 3 and NLTK 3.

B. Numpy :

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a
multidimensional array object, various derived objects (such as masked arrays and matrices), and an
assortment of routines for fast operations on arrays, including mathematical, logical, shape-manipulation,
sorting, selecting, I/O, discrete Fourier transform, basic linear algebra, basic statistical and
random-simulation operations, and much more. At the core of the NumPy package is the ndarray object, which
encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled
code for performance. There are several important differences between NumPy arrays and the standard Python
sequences: NumPy arrays have a fixed size at creation, all of their elements must share the same data type,
and they support efficient vectorized operations on large amounts of data.
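A minimal sketch of these properties in practice, with illustrative values only:

```python
# Minimal sketch of NumPy ndarrays vs. plain Python lists.
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # homogeneous float64 ndarray
b = a * 2                        # vectorized: doubles every element at once
m = np.zeros((2, 3))             # 2x3 matrix of zeros, as the summarizer
                                 # below uses for its similarity matrix
```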
C. NetworkX :

NetworkX is a Python software package for the creation, manipulation, and study of the structure,
dynamics, and function of complex networks. It is used to study large complex networks represented as
graphs with nodes and edges. Using NetworkX we can load and store complex networks, generate many types
of random and classic networks, analyze network structure, build network models, design new network
algorithms and draw networks. NetworkX is suitable for operation on large real-world graphs. Because it
relies on a pure-Python "dictionary of dictionaries" data structure, NetworkX is a reasonably efficient,
highly scalable and highly portable framework for network and social network analysis.
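A minimal sketch of the NetworkX workflow used later in this report: build a graph from a toy, made-up similarity matrix and rank its nodes with PageRank, just as the summarizer in Section 3.4 ranks sentences.

```python
# Minimal sketch: graph from a similarity matrix, ranked by PageRank.
import numpy as np
import networkx as nx

# Toy symmetric similarity matrix (values are made up for illustration)
sim = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.2],
                [0.1, 0.2, 0.0]])

g = nx.from_numpy_array(sim)        # weighted undirected graph
scores = nx.pagerank(g)             # dict mapping node -> PageRank score
best = max(scores, key=scores.get)  # the most central node
```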

3.3 Details of Software and Hardware Requirements

Software Requirements:
1. Operating system: Windows 10 or any browser-compatible OS
2. Web browser
3. Any mobile phone with a compatible Android/iOS version

Hardware Requirements:
1. All the hardware required to connect to the Internet, e.g. modem, WAN/LAN, Ethernet cable
2. Storage: size of the web browser
3. RAM: 4 GB
4. Processor: Intel Core i3
5. Any browser-compatible mobile phone with Internet access
3.4 Experiments and Results

Program

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

# Download the NLTK resources the program needs (first run only)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Read the text file, split it into sentences, and tokenize each
# sentence into lowercase words with stopwords and punctuation removed
def read_and_preprocess_text(text_file):
    with open(text_file, 'r') as file:
        text = file.read()

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Tokenize sentences into words and remove stopwords
    stop_words = set(stopwords.words("english"))
    word_tokens = [word_tokenize(sentence.lower()) for sentence in sentences]
    word_tokens = [[word for word in words if word.isalnum() and word not in stop_words]
                   for words in word_tokens]

    return sentences, word_tokens

# Calculate sentence similarity as the cosine similarity of
# word-count vectors built over the document vocabulary
def sentence_similarity(sent1, sent2):
    vector1 = np.zeros(len(unique_words), dtype=float)
    vector2 = np.zeros(len(unique_words), dtype=float)

    for word in sent1:
        vector1[word_index[word]] += 1
    for word in sent2:
        vector2[word_index[word]] += 1

    return 1 - cosine_distance(vector1, vector2)

# Main function for text summarization
def generate_summary(text_file, num_sentences=5):
    sentences, word_tokens = read_and_preprocess_text(text_file)

    # Create the vocabulary shared with sentence_similarity()
    global unique_words, word_index
    unique_words = list(set(word for sentence in word_tokens for word in sentence))
    word_index = {word: index for index, word in enumerate(unique_words)}

    # Create a sentence similarity matrix
    sentence_similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sentence_similarity_matrix[i][j] = sentence_similarity(
                    word_tokens[i], word_tokens[j])

    # Create a graph from the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)

    # Generate ranked sentences using PageRank
    scores = nx.pagerank(sentence_similarity_graph)

    # Sort sentences by their scores, highest first
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)),
                              reverse=True)

    # Select the top 'num_sentences' sentences as the summary
    summary = [sentence for score, sentence in ranked_sentences[:num_sentences]]

    return "\n".join(summary)

# Example usage
if __name__ == "__main__":
    input_text_file = "/content/input_text5.txt"
    num_summary_sentences = 5

    summary = generate_summary(input_text_file,
                               num_sentences=num_summary_sentences)
    print("Let's summarize the given file.....")
    print("\n")
    print(summary)
    print("\n")
Input Text :

As the G20 concluded on Sunday and Prime Minister Narendra Modi handed over the presidency to Brazil,
there was recognition of the efforts India made to arrive at a consensus on a joint communique. The theme
for India’s G20 presidency was Vasudhaiva Kutumbakam: One Earth, One Family, One Future. In his
speeches over the years, Prime Minister Narendra Modi has spoken about India taking on a leadership role
in global affairs as “vishwa guru”, given its population and scale of economy, and this was on display in the
last two days. In his opening remarks at the Summit, the PM said, At the place where we are gathered today,
just a few kilometres away from here, stands a pillar that is nearly two-and-a-half thousand years old.
Inscribed on this pillar in the Prakrit language are the words: ‘Hevam loksa hitmukhe ti, atha iyam natisu
hevam’. Meaning, the welfare and happiness of humanity should always be ensured. Two-and-a-half
thousand years ago, the land of India gave this message to the entire world. Let us begin this G20 Summit
by remembering this message.
Output Text :
Two-and-a-half thousand years ago, the land of India gave this message to the entire world. As the G20
concluded on Sunday and Prime Minister Narendra Modi handed over the presidency to Brazil, there was
recognition of the efforts India made to arrive at a consensus on a joint communique. In his speeches over
the years, Prime Minister Narendra Modi has spoken about India taking on a leadership role in global affairs
as “vishwa guru”, given its population and scale of economy, and this was on display in the last two days. In
his opening remarks at the Summit, the PM said, At the place where we are gathered today, just a few
kilometres away from here, stands a pillar that is nearly two-and-a-half thousand years old. Let us begin this
G20 Summit by remembering this message.

3.5 Conclusion and Future Work

3.5.1 Conclusion

A text summarizer is a valuable tool that helps condense lengthy or complex documents into concise and coherent
summaries. It can save time and effort for readers who need to quickly grasp the main points of a text, making it
especially useful in fields like journalism, research, and education. While text summarizers have their advantages,
it's important to remember that they are not infallible and may not always capture the nuances of a text. Therefore,
human judgment and editing are often necessary to ensure the accuracy and clarity of the summary. As technology
continues to advance, text summarizers are likely to become even more sophisticated and play an increasingly
important role in information processing and knowledge dissemination.
3.5.2 Future Work

1. Text summarization algorithms will continue to advance, leading to greater accuracy in capturing the essence of
a text. Machine learning models, such as transformer-based models like GPT-4, are likely to further enhance
summarization capabilities.

2. Future text summarizers will become more proficient in summarizing content in multiple languages, breaking
down language barriers and enabling more accessible information sharing globally.

3. Text summarization will integrate with visual data, creating summaries that include images, charts, and graphs.
This will be especially useful for summarizing data-rich documents.

4 References
• https://iopscience.iop.org/article/10.1088/1742-6596/2040/1/012044/pdf
• https://www.ijert.org/research/text-summarizer-using-abstractive-and-extractive-method-IJERTV3IS050821.pdf
• https://journals.sagepub.com/home/thr
• https://www.freecodecamp.org/learn/scientific-computing-with-python
• https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623462
