
Extractive Text Summarization

A Report Submitted
in Partial Fulfillment of the Requirements
for the Degree of
Bachelor of Technology
in
Information Technology

by
Ashish Kumar Mishra, Ashish Dubey, Shivanshu Singh, Barun Sarraf, Ashish Kumar Thakur

to the
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY
ALLAHABAD
April, 2018
UNDERTAKING

We declare that the work presented in this report titled “Extractive Text Summarization”, submitted to the Computer Science and Engineering Department, Motilal Nehru National Institute of Technology, Allahabad, for the award of the Bachelor of Technology degree in Information Technology, is our original work. We have not plagiarized or submitted the same work for the award of any other degree. In case this undertaking is found incorrect, we accept that our degree may be unconditionally withdrawn.

April, 2018
Allahabad
(Ashish Kumar Mishra, Ashish Dubey, Shivanshu Singh, Barun Sarraf, Ashish Kumar Thakur)

CERTIFICATE

Certified that the work contained in the report titled “Extractive Text Summarization”, by Ashish Kumar Mishra, Ashish Dubey, Shivanshu Singh, Barun Sarraf, and Ashish Kumar Thakur, has been carried out under my supervision and that this work has not been submitted elsewhere for a degree.

(PROF. R. S. YADAV)
Computer Science and Engineering Dept.
M.N.N.I.T., Allahabad

April, 2018

Preface

The volume of textual data is growing while people have less time to read it. Newspaper articles often run to 1000-1200 words. As wearable devices leap to prominence (Google Glass and the Apple Watch, to name a few), content must adapt to the limited screen space available on these devices. Generating intelligent and accurate summaries for long pieces of text has become a popular research and industry problem. Text summarization came into being to reduce these difficulties for the user.

Text summarization is the process of reducing the large amount of textual data available on the Internet or in other sources of information, such as newspapers and books, into a summarized document that retains the most important points and the overall meaning of the original. Text summarization is further divided into extractive text summarization and abstractive text summarization.

Our project aims at achieving extractive text summarization.

iv
Acknowledgements

“The acknowledgement of a single possibility can change everything.”


— Aberjhani

We would like to take this opportunity to express our deep sense of gratitude to all who helped us, directly or indirectly, with our project work. Firstly, we would like to thank our supervisor, Prof. R. S. Yadav, for being such a supportive mentor. His advice, encouragement, and criticism were the primary sources of inspiration and innovation for us, and were the real reason behind the successful completion of this project. In addition, he was always accessible and willing to help us at each step of the project's development. It has truly been a privilege working under him during this semester.

Also, we wish to express our sincere gratitude to Prof. Rajeev Tripathi, Director, MNNIT Allahabad, and Prof. Neeraj Tyagi, Head of the Computer Science and Engineering Department (CSED), for providing us with all the resources and facilities needed during the completion of this project work. We would also like to thank our friends for their constant motivation, advice, and kind support.

Contents

Preface

Acknowledgements

1 Introduction
  1.1 Motivation
    1.1.1 Some Wonderful Minds

2 Related Work
  2.1 Types of Summarization
    2.1.1 Extraction-based summarization
    2.1.2 Abstraction-based summarization
    2.1.3 Aided summarization
  2.2 Some Common Techniques
    2.2.1 Heuristic techniques
    2.2.2 Semantics-based techniques
    2.2.3 Query-oriented techniques
    2.2.4 Cluster-based techniques
  2.3 Other Works

3 Proposed Work
  3.1 Why textrank?
  3.2 Approach

4 Experimental Setup and Results Analysis
  4.1 Document Preprocessing
  4.2 Document-Term Matrix and normalized similarity matrix as per TF-IDF values
  4.3 Applying textrank algorithm on the normalized similarity matrix
  4.4 Storing summary
  4.5 Result analysis

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

References
Chapter 1

Introduction

This report presents the details of the project “Extractive Text Summarization”. Text summarization is condensing the source text into a shorter version while preserving its information content and overall meaning. It is very difficult for human beings to manually summarize large documents of text. Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form; the importance of sentences is decided based on their statistical and linguistic features. An abstractive summarization method consists of understanding the original text and re-telling it in fewer words: it uses linguistic methods to examine and interpret the text, and then finds new concepts and expressions that best describe it by generating a new, shorter text conveying the most important information from the original document. In this report, an extractive text summarization technique is presented.

1.1 Motivation
To take appropriate action, we need the latest information, yet the amount of information keeps growing. There are many categories of information (economy, sports, health, technology, and so on) and many sources (news sites, blogs, social networks, etc.). An automatic and accurate summarization feature therefore helps us understand the topics while shortening the time needed to do so. Automatic text summarization helps the user quickly understand large volumes of information. We present a language- and domain-independent statistics-based method for single-document extractive summarization, i.e., producing a text summary by extracting some sentences from the given text.

1.1.1 Some Wonderful Minds


In 1950, Alan Turing published an article titled “Computing Machinery and Intelligence” which proposed what is now called the Turing test as a criterion of intelligence [8]. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. One notably successful natural-language processing system developed in the 1960s was SHRDLU [9], a natural-language system working in restricted “blocks worlds” with restricted vocabularies.

Chapter 2

Related Work

2.1 Types of Summarization

2.1.1 Extraction-based summarization


In this summarization task, the automatic system extracts objects from the entire collection without modifying the objects themselves. Examples include keyphrase extraction, where the goal is to select individual words or phrases to “tag” a document, and document summarization, where the goal is to select whole sentences (without modifying them) to create a short paragraph summary. Similarly, in image collection summarization, the system extracts images from the collection without modifying the images themselves [3].

2.1.2 Abstraction-based summarization


Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences, or paragraphs), while abstraction involves paraphrasing sections of the source document. In general, abstraction can condense a text more strongly than extraction, but the programs that can do this are harder to develop, as they require natural language generation technology, which is itself a growing field [7].

While some work has been done in abstractive summarization (creating an abstract synopsis like that of a human), the majority of summarization systems are extractive (selecting a subset of sentences to place in a summary).

2.1.3 Aided summarization


Machine learning techniques from closely related fields such as information retrieval or text mining have been successfully adapted to help automatic summarization [1].

Apart from Fully Automated Summarizers (FAS), there are systems that aid users with the task of summarization (MAHS = Machine Aided Human Summarization), for example by highlighting candidate passages to be included in the summary, and there are systems that depend on post-processing by a human (HAMS = Human Aided Machine Summarization).

2.2 Some Common Techniques


Several techniques for automatic text summarization have been developed by researchers. Broadly, these summarization techniques fall into the following categories:

2.2.1 Heuristic techniques


Heuristic techniques rely on features such as statistical features (frequency analysis, local context analysis, pseudo-relevance feedback), linguistic features (names, places, quotations, thematic phrases), title words, lexical cues, and location/position.

2.2.2 Semantics-based techniques


Semantics-based techniques treat a sentence as more than an unordered collection of words. Each word carries meaning, and a summary is constructed by incorporating relations between words, such as synonymy, hyponymy, and antonymy, into the system. For example, WordNet is used to build lexical chains of word synonyms for sentence filtering when generating an extractive multi-document summary.

2.2.3 Query-oriented techniques
Query-based summaries are generated according to the user's needs, based on their query keywords, for applications such as question answering systems or browsing on search engines.

2.2.4 Cluster-based techniques


Cluster-based summarization techniques generate a summary by creating clusters of sentences/paragraphs using various similarity measures. The number of clusters is roughly treated as equal to the number of topics covered in the text.

2.3 Other Works


Wang and Yang in 2006 proposed a fractal summarization technique based on fractal theory to produce a summary by exploring the hierarchical structure and salient features of the document. The technique is based on a statistical approach and hence can be applied to documents in any language without major modifications. However, each language has its own unique features, so language differences may significantly affect the summarization process.

Binwahlan et al.'s intelligent model for automatic text summarization produces the summary in two phases. In the first phase, a summary obtained using a fuzzy swarm module is given as input to the second phase, which uses a swarm diversity method to find clusters, treating sentences as initial centroids, to produce the final summary.

Dalal and Zaveri in 2011 proposed a heuristics-based approach to generate a generic, informative, coherent summary from unstructured documents. The topic signature file used to score the sentences of the documents is generated from the title words, cue words, keywords, and synonyms of the title words and keywords found in the lexical resource WordNet.

Chapter 3

Proposed Work

The proposed system for text summarization is built on the textrank algorithm [2], which is essentially a modification of the famous pagerank algorithm used by Google Search to rank websites in its search engine results [4].

3.1 Why textrank?


• If the model is trained on some previously used dataset, the following major problems are encountered:

1. Training a model on a large dataset requires great computational power. One could argue that the dataset size can simply be reduced, but training text against its summary set is usually done with an RNN model using LSTMs with multiple hidden layers, so even a small dataset of, say, 10,000 entries takes considerable computational power. Also, if the dataset is significantly smaller, the results may be inaccurate.
2. Human summaries are generally biased. People tend to put into the summary those sentences they feel are important, and something important to one person may not be important to another. This way, the very basis for training our model to produce the summary will itself be inaccurate.

6
Therefore the textrank algorithm, with its low computational requirements and no bias introduced by training, is considered a great choice for extractive summarization.

3.2 Approach
Our algorithm for summarization is as follows:

1. Document for processing:

• Take the input file. Our implementation supports .txt and .pdf files.
• As an additional feature, the input file can also be a text file generated by speech recognition.

2. Tokenization of the document into sentences:

• Initially the document is simply one large string.
• Tokenization is the process by which the document is broken up into discrete bits, or tokens (sentences here). It omits certain characters between the words, such as punctuation, spaces, and special symbols [5]. We have used the PunktSentenceTokenizer class from the file punkt.py in the nltk.tokenize package for tokenizing.
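As a minimal sketch of this step (the input filename and variable names here are illustrative, not from our implementation), the pre-trained Punkt model shipped with nltk can be loaded and applied as follows:

    # Sketch: sentence tokenization with nltk's pre-trained Punkt model.
    import nltk

    nltk.download('punkt')  # one-time download of the pre-trained model

    with open('input.txt') as f:  # hypothetical input file
        document = f.read()       # the document as one large string

    # nltk.data.load returns a PunktSentenceTokenizer instance
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(document)  # list of sentence strings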

3. Generate the Document-Term Matrix:

• A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
• Here a document refers to each sentence, and a term refers to each distinct significant word.

Figure 1: Flow Chart for the algorithm

     I   like   hate   databases
D1   1    1      0        1
D2   1    0      1        1

Table 1: Example of a document-term matrix

• We have used the CountVectorizer class from the sklearn library in Python. The fit_transform() method on a CountVectorizer object learns the vocabulary dictionary and returns the document-term matrix.

Example:
D1 = “I like databases”
D2 = “I hate databases”
The corresponding document-term matrix is given in Table 1.
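A minimal sketch of this step on the example sentences (the variable names are ours):

    # Sketch: building the document-term matrix with sklearn.
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["I like databases", "I hate databases"]

    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(sentences)  # sparse document-term matrix

    print(vectorizer.vocabulary_)  # term -> column index mapping
    print(dtm.toarray())           # rows = sentences, columns = term counts

Note that CountVectorizer's default token pattern ignores one-letter words, so the “I” column of Table 1 would not appear unless the pattern is adjusted.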

4. Generate the normalized similarity matrix from the document-term matrix according to TF-IDF [6]:

• TF (Term Frequency): how often a term (a word here) appears in the current document (a single sentence here).
• IDF (Inverse Document Frequency): a measure of how rare a term (a word here) is across the entire corpus (the set of all sentences).

Formulas:

TF-IDF(x, y) = TF(x, y) * IDF(x)    (1)

which gives the TF-IDF weight of term x in document y.

TF(x, y) = f / n    (2)

where f is the number of times term x appears in sentence y, and n is the total number of terms in the sentence.

IDF(x) = log(N / D)    (3)

where N is the total number of documents, and D is the number of documents containing term x.

Then we find the similarity matrix by multiplying this matrix with its trans-
pose.
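A sketch of this step, continuing from the document-term matrix dtm above (our variable names):

    # Sketch: TF-IDF weighting followed by the similarity matrix.
    from sklearn.feature_extraction.text import TfidfTransformer

    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(dtm)  # rows are L2-normalized TF-IDF vectors

    # Since each row has unit length, multiplying the matrix by its
    # transpose yields pairwise cosine similarities between sentences.
    similarity = (tfidf * tfidf.T).toarray()  # entry (i, j): similarity of sentences i and j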

5. Draw a graph for applying the textrank algorithm (a code sketch appears under step 7 below):

• nx_graph is a graph built using the networkx library.
• Each node represents a sentence.
• Each edge indicates that the two sentences it connects have words in common, weighted by their similarity.

6. Textrank algorithm:

Input: a similarity graph of the sentences.

Formula:

R(v_i) = (1 - d) + d * Σ_{v_j ∈ adj(v_i)} R(v_j) / deg(v_j)    (4)

where:
R(v_i) = rank of node i
d = damping factor: the probability of jumping from one node to another
R(v_j) = rank of a node j connected to i
adj(v_i) = set of all the neighbours of node i
deg(v_j) = degree of node j

It generally takes quite a few iterations for the results to converge, depending on the number of edges.

Output: rank of each sentence.
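As a sketch, equation (4) can be implemented with naive power iteration over an adjacency list (an unweighted toy version, purely to illustrate the formula; our implementation relies on networkx's pagerank() instead, as described in the next step):

    # Sketch: naive iteration of equation (4) on an undirected adjacency list.
    def textrank(adj, d=0.85, iterations=50):
        ranks = {v: 1.0 for v in adj}  # initial rank of every node
        for _ in range(iterations):
            new_ranks = {}
            for v in adj:
                # R(v) = (1 - d) + d * sum over neighbours u of R(u) / deg(u)
                new_ranks[v] = (1 - d) + d * sum(ranks[u] / len(adj[u]) for u in adj[v])
            ranks = new_ranks
        return ranks

    # Toy usage: node 0 is connected to 1 and 2, so it ends up ranked highest.
    print(textrank({0: [1, 2], 1: [0], 2: [0]}))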

7. Getting the rank of every sentence using textrank:

Figure 2: The graph of 3.pdf, containing 19 sentences and hence 19 nodes

• Using the pagerank() method on the graph created above, we get the rank of each node (here, each sentence).
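In code, steps 5-7 reduce to a couple of lines (a sketch; similarity is the normalized similarity matrix from step 4):

    # Sketch: build the sentence graph and rank nodes with pagerank.
    import networkx as nx

    # from_numpy_array builds a weighted graph with one node per matrix row
    # (older networkx releases call this from_numpy_matrix).
    nx_graph = nx.from_numpy_array(similarity)

    ranks = nx.pagerank(nx_graph)  # dict: sentence index -> textrank score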

8. Generate a threshold value of rank for choosing the most relevant sentences

• Save normalized ranks first


• Formula for normalization:

rank_normal = (rank_current - rank_min) / (rank_max - rank_min)    (5)

This way, rank_max will have a rank_normal value of 1, rank_min will have a rank_normal value of 0, and all other ranks will have values between 0 and 1.

Figure 3: Example of the textrank algorithm

Figure 4: The rank of each sentence of 3.pdf
• In our model we have used the threshold rank:

rank_th = mean(rank_normal) + 0.2    (6)

Another approach is to simply sort the sentences by rank in descending order and take the top n sentences, where n is input by the user.

9. Find all sentences with rank_normal > rank_th and store them as the summary.

10. Write the summary output to a file
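A sketch of steps 8-10 put together (our variable names; ranks and sentences come from the earlier steps, and the output filename is hypothetical):

    # Sketch: normalize ranks, threshold them, and write the summary to a file.
    values = ranks.values()
    rank_min, rank_max = min(values), max(values)  # assumes not all ranks are equal

    # Equation (5): min-max normalization of the ranks.
    rank_normal = {i: (r - rank_min) / (rank_max - rank_min) for i, r in ranks.items()}

    # Equation (6): threshold = mean of the normalized ranks plus a seed factor.
    rank_th = sum(rank_normal.values()) / len(rank_normal) + 0.2

    # Keep sentences above the threshold, preserving their original order.
    summary = [sentences[i] for i in sorted(rank_normal) if rank_normal[i] > rank_th]

    with open('summary.txt', 'w') as f:
        f.write('\n'.join(summary))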

Chapter 4

Experimental Setup and Results Analysis

We used the Google Speech Recognition API and the sklearn, nltk, and networkx libraries to implement our work.

4.1 Document Preprocessing


1. Input from an existing file or from voice recognition via the command line:

• We have used Python file handling for reading existing .txt files. For handling PDF files, we have used the PyPDF2 library.
• For voice recognition, we used the SpeechRecognition library, which provides access to the Google Speech Recognition API. The user's voice is captured through the microphone, converted to text online using the API, and the resulting text is appended to a text file until the user asks the program to stop.
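A sketch of this loop (the output filename and the spoken stop command are our assumptions; Recognizer, Microphone, and recognize_google() come from the SpeechRecognition package):

    # Sketch: append recognized speech to a text file until the user says "stop".
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source, open('speech.txt', 'a') as out:
        while True:
            audio = recognizer.listen(source)  # capture one utterance
            try:
                text = recognizer.recognize_google(audio)  # online Google Speech API
            except sr.UnknownValueError:
                continue  # speech was unintelligible; try again
            if text.lower() == 'stop':  # hypothetical stop command
                break
            out.write(text + '\n')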

2. Tokenizing the document:

• nltk library: We have used the PunktSentenceTokenizer class from the file punkt.py in the nltk.tokenize package for tokenizing.

4.2 Document-Term Matrix and normalized similarity matrix as per TF-IDF values
1. We have used the CountVectorizer class from the sklearn library in Python. The fit_transform() method on a CountVectorizer object learns the vocabulary dictionary and returns the document-term matrix.

2. In our implementation, TfidfTransformer is then used; its fit_transform() method outputs the document-term matrix normalized (values from 0 to 1) according to TF-IDF.

4.3 Applying textrank algorithm on the normalized similarity matrix
1. We have used the networkx library to create a graph from the provided normalized similarity matrix.

2. To apply the textrank algorithm on this graph, we call the pagerank() method on the networkx graph created in the previous step.

4.4 Storing summary


We tried the algorithm on documents of various lengths. On a short story running to 104 lines, we got a summary of 23 lines by adjusting the threshold rank to:

rank_th = mean(rank_normal) + seed_factor    (7)

We choose only positive seed factors because with a threshold below the mean value, a lot of sentences would be included in the summary.

Thus the threshold value we choose depends on the user's requirements. If the summary is desired to be about 20% of the document, the ideal choice is seed_factor = 0.2.

seed_factor   story lines   summary lines   ratio
0.10          101           30              0.2970
0.10           29           10              0.3448
0.15          101           24              0.2376
0.15           29            7              0.2413
0.20          101           21              0.2079
0.20           29            6              0.2068
0.25          101           15              0.1485
0.30          101            8              0.0792

Table 2: Variation of summary length with change in threshold rank

So we stick with rank_th = mean(rank_normal) + seed_factor for generating summaries of up to 20%.

Once the rank of each sentence is calculated, the summary can easily be generated as explained in the Proposed Work chapter. When generated, the summary is just a list of sentence strings. We print the summary into a text file using Python's file handling mechanism.

4.5 Result analysis


The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for comparing a generated sentence to a reference sentence. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but it offers five compelling benefits:
1. It is quick and inexpensive to calculate.
2. It is easy to understand.
3. It is language independent.
4. It correlates highly with human evaluation.
5. It has been widely adopted.

The BLEU score was proposed by Kishore Papineni et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation”.

Our model, modified to output 5 lines, was tested on a dataset of 50 documents, each of 50 or 100 lines and accompanied by a 5-line human-made summary. An average BLEU score of 0.8 was achieved.
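For reference, a minimal sketch of computing a BLEU score with nltk's implementation (the sentences below are placeholders, not our test data; nltk is one of several libraries providing BLEU):

    # Sketch: BLEU score of a candidate sentence against one reference.
    from nltk.translate.bleu_score import sentence_bleu

    reference = 'the cat sat on the mat'.split()  # placeholder human summary sentence
    candidate = 'the cat is on the mat'.split()   # placeholder generated sentence

    # sentence_bleu takes a list of reference token lists and a candidate token list.
    score = sentence_bleu([reference], candidate)
    print(score)  # 1.0 for a perfect match, approaching 0.0 for no overlap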

Chapter 5

Conclusion and Future Work

“Our imagination is the only limit to what we can hope to have in the future.” — Charles F. Kettering

5.1 Conclusion
In this report we presented extractive text summarization using the textrank algorithm. An extractive summary deals with classifying the important sentences and putting them together to form a summary. The main aim was to classify the sentences according to their relevance and importance in the given corpus (a collection of sentences).

5.2 Future Work


In the future, this project could be extended to a model based on abstraction. An abstractive summarizer first understands the text and then builds the summary.

List of Figures

1 Flow Chart for the algorithm
2 The graph of 3.pdf, containing 19 sentences and hence 19 nodes
3 Example of the textrank algorithm
4 The rank of each sentence of 3.pdf

List of Tables

1 Example of a document-term matrix
2 Variation of summary length with change in threshold rank

References

[1] Huang, H.-H., Kuo, Y.-H., and Yang, H.-C. Fuzzy-rough set aided sentence extraction summarization. In First International Conference on Innovative Computing, Information and Control (ICICIC '06) (2006), vol. 1, IEEE, pp. 450–453.

[2] Mihalcea, R., and Tarau, P. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004).

[3] Mittal, N., Agarwal, B., Mantri, H., Goyal, R. K., and Jain, M. K. Extractive text summarization.

[4] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab, 1999.

[5] Pentheroudakis, J. E., Bradlee, D. G., and Knoll, S. S. Tokenizer for a natural language processing system, Aug. 15, 2006. US Patent 7,092,871.

[6] Ramos, J., et al. Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning (2003), vol. 242, pp. 133–142.

[7] Singhal, S., and Bhattacharya, A. Abstractive text summarization.

[8] Turing, A. M. Computing machinery and intelligence. Creative Computing 6, 1 (1980), 44–53.

[9] Ward, N. SHRDLU. Wiley Online Library, 2003.
