Extractive Text Summarization: Motilal Nehru National Institute of Technology Allahabad
A Report Submitted
in Partial Fulfillment of the Requirements
for the Degree of
Bachelor of Technology
in
Information Technology
by
Ashish Kumar Mishra, Ashish Dubey, Shivanshu Singh, Barun Sarraf, Ashish Kumar Thakur
to the
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY
ALLAHABAD
April, 2018
UNDERTAKING
April, 2018
Allahabad
(Ashish Kumar Mishra, Ashish Dubey, Shivanshu Singh, Barun Sarraf, Ashish Kumar Thakur)
CERTIFICATE
(PROF. R.S.YADAV)
Computer Science and Engineering Dept.
M.N.N.I.T, Allahabad
April, 2018
Preface
The volume of textual data is growing while readers have less and less time. Newspaper articles often run to a long text of, say, 1000-1200 words. As wearable devices leap to prominence (Google Glass and Apple Watch, to name a few), content must adapt to the limited screen space available on these devices. Generating intelligent and accurate summaries for long pieces of text has therefore become a popular research as well as industry problem. Text summarization came into being in order to reduce these difficulties for the user.
Text summarization is the process of reducing the large amount of textual data available on the Internet or in other sources of information, such as newspapers and books, into condensed documents, so that the resulting summary retains the most important points and the overall meaning of the original document. Text summarization is further divided into extractive text summarization and abstractive text summarization.
Our project aims at achieving extractive text summarization.
Acknowledgements
We would like to take this opportunity to express our deep sense of gratitude to
all who helped us directly or indirectly for our project work. Firstly, we would like
to thank our supervisor, Prof. R. S. Yadav, for being such a supportive mentor. His advice, encouragement and criticism were the primary sources of inspiration and innovation for us, and were the real reason behind the successful completion of this project. In addition, he was always accessible and willing to help us at each step of our project development. It has truly been a privilege working under him during this semester.
Also, we wish to express our sincere gratitude to Prof. Rajeev Tripathi, Director,
MNNIT Allahabad and Prof. Neeraj Tyagi, Head of the Department, Computer
Science and Engineering Department (CSED), for providing us with all the needed
resources and facilities during the process of completion of this project work. We
would also like to thank our friends for their constant motivation, advice and kind support.
Contents
Preface
Acknowledgements
1 Introduction
1.1 Motivation
1.1.1 Some Wonderful Minds
2 Related Work
2.1 Types of Summarization
2.1.1 Extraction-based summarization
2.1.2 Abstraction-based summarization
2.1.3 Aided summarization
2.2 Some Common Techniques
2.2.1 Heuristic techniques
2.2.2 Semantics-based techniques
2.2.3 Query-oriented techniques
2.2.4 Cluster-based techniques
2.3 Other Works
3 Proposed Work
3.1 Why textrank?
3.2 Approach
4 Experimental Setup and Results Analysis
4.1 Document Preprocessing
4.2 Document-Term Matrix and normalized similarity matrix as per TF-IDF values
4.3 Applying textrank algorithm on the normalized similarity matrix
4.4 Storing summary
4.5 Result analysis
References
Chapter 1
Introduction
This thesis presents the details of the project "Extractive Text Summarization". Text summarization is condensing the source text into a shorter version while preserving its information content and overall meaning. It is very difficult for human beings to manually summarize large documents of text. Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form. The importance of sentences is decided based on statistical and linguistic features of the sentences. An abstractive summarization method consists of understanding the original text and re-telling it in fewer words. It uses linguistic methods to examine and interpret the text and then finds new concepts and expressions to best describe it, generating a new, shorter text that conveys the most important information from the original document. In this thesis, a report on an extractive text summarization technique is presented.
1.1 Motivation
To take appropriate action, we need the latest information, yet the amount of available information keeps growing. Information spans many categories (economy, sports, health, technology, ...) and many sources (news sites, blogs, social networks, etc.). A feature that produces automatic and accurate summaries therefore helps us understand the topics while shortening the time needed to do so. Automatic text summarization helps the user to quickly understand large volumes of information. We present a language- and domain-independent statistical method for single-document extractive summarization, i.e., producing a text summary by extracting some sentences from the given text.
Chapter 2
Related Work
While some work has been done in abstractive summarization (creating an abstract synopsis like that of a human), the majority of summarization systems are extractive (selecting a subset of sentences to place in a summary).
2.2.3 Query-oriented techniques
Query-based summaries are generated according to user needs, based on the user's query keywords, for applications such as question answering systems or browsing on search engines.
Chapter 3
Proposed Work
The proposed system for text summarization is built on the textrank algorithm [2]. It is essentially a modification of the famous pagerank algorithm [4], used by Google Search to rank websites in its search engine results.

3.1 Why textrank?
1. It requires great computation power to train a model on a large dataset. One could argue that the dataset size can simply be reduced. However, training text against its reference summaries is usually done with an RNN model with LSTM units and multiple hidden layers, so even a small dataset of, say, 10,000 entries demands considerable computation power. Moreover, if the dataset is significantly smaller, the results may be inaccurate.

2. Human summaries are generally biased. A person tends to put into the summary those sentences which they feel are important, and something important to one reader may not be important to another. The very basis on which we would train a model to produce summaries would therefore be inaccurate.
Therefore the textrank algorithm, with its low computation requirements and no 'bias' introduced by any kind of training, is considered a great choice for the purpose of extractive summarization.
3.2 Approach
Our algorithm for the purpose of summarization is as follows:
• Take the input file. Our implementation supports .txt and .pdf files.
• As an additional feature, the input file can also be a text file generated by speech recognition.
Figure 1: Flow Chart for the algorithm
Example:
D1 = "I like databases"
D2 = "I hate databases"

Its document-term matrix is given in Table 1:

        I   like   hate   databases
D1      1    1      0        1
D2      1    0      1        1
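The document-term matrix for the two example sentences can be reproduced in a few lines of plain Python (a minimal sketch; the actual implementation, described later, uses sklearn's CountVectorizer):

```python
# Build the document-term matrix for the two example sentences.
docs = {"D1": "I like databases", "D2": "I hate databases"}
vocab = ["I", "like", "hate", "databases"]

# Each row counts how often each vocabulary term occurs in the document.
dtm = {name: [text.split().count(term) for term in vocab]
       for name, text in docs.items()}
# dtm["D1"] == [1, 1, 0, 1] and dtm["D2"] == [1, 0, 1, 1], matching Table 1
```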
Formula:

• TF-IDF(term x in document y) = TF(x, y) * IDF(x)    (1)

• TF(t, d) = f(t, d) / n    (2)

• IDF(t) = log(N / D)    (3)

where f(t, d) is the number of occurrences of term t in document d, n is the total number of terms in d, N is the total number of documents, and D is the number of documents containing t.
Then we find the similarity matrix by multiplying this matrix with its transpose.
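Formulas (1)-(3) and the transpose product can be sketched in plain Python (a stdlib illustration of the computation; the project itself relies on sklearn's vectorizers):

```python
import math

# TF-IDF matrix for the two example documents, then the similarity
# matrix S = M * M^T (rows are documents, columns are vocabulary terms).
docs = [["i", "like", "databases"], ["i", "hate", "databases"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

def tf(t, d):        # Eq. (2): occurrences of t in d over document length
    return d.count(t) / len(d)

def idf(t):          # Eq. (3): log(N / number of documents containing t)
    return math.log(N / sum(t in d for d in docs))

M = [[tf(t, d) * idf(t) for t in vocab] for d in docs]       # Eq. (1)

# Similarity of documents i and j is the dot product of rows i and j.
S = [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]
```

Note that with only two documents, terms shared by both get IDF = log(2/2) = 0, so the off-diagonal similarity here is zero; on a real corpus of many sentences the similarities are non-trivial.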
The rank of each node is computed iteratively using the pagerank recurrence:

R(vi) = (1 - d) + d * Σ_{vj ∈ adj(vi)} R(vj) / |adj(vj)|

where:
R(vi) = rank of node i
d = damping factor: measure of the probability of jumping from one node to another
R(vj) = rank of a node j connected to i
adj(vi) = set of all the neighbors of node i

It generally takes quite a few iterations for the results to converge, depending upon the number of edges.
Output: rank of each sentence
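The iteration above can be sketched as a plain power iteration (a minimal stand-in for the networkx pagerank() call used in the implementation; the 4-node graph below is invented for illustration):

```python
# Undirected toy graph: adj maps each node to its neighbours.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
d = 0.85                              # damping factor
ranks = {v: 1.0 for v in adj}

for _ in range(50):                   # iterate until (practically) converged
    ranks = {v: (1 - d) + d * sum(ranks[u] / len(adj[u]) for u in adj[v])
             for v in adj}

# Node 1 has the most neighbours and accumulates the highest rank.
best = max(ranks, key=ranks.get)
```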
Figure 2: The graph of 3.pdf, containing 19 sentences and hence 19 nodes
• Using the pagerank() method on the graph created above, we can get the rank of each node, i.e., of each sentence.
8. Generate a threshold value of rank for choosing the most relevant sentences. This way, rank_max will have a rank_normal value of 1, rank_min will have a rank_normal value of 0, and all other ranks will have values between 0 and 1.

Figure 3: Example of the textrank algorithm

Figure 4: The rank of each sentence of 3.pdf
• In our model we have used threshold rank as:
Another approach is to simply sort the sentences by rank in descending order and take the top n sentences, where n is input by the user.
9. Find all sentences with rank_normal > rank_th and store them as the summary.
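Steps 8 and 9 can be sketched as follows (the rank values and the threshold of 0.5 are invented for illustration; the threshold actually used in the project is derived from a seed factor):

```python
# Step 8: min-max normalise the ranks so that rank_max -> 1, rank_min -> 0.
ranks = {0: 0.9, 1: 2.1, 2: 1.5, 3: 0.3, 4: 1.8}   # hypothetical ranks
lo, hi = min(ranks.values()), max(ranks.values())
rank_normal = {s: (r - lo) / (hi - lo) for s, r in ranks.items()}

# Step 9: keep every sentence whose normalised rank exceeds the threshold,
# preserving the original sentence order in the document.
rank_th = 0.5                                      # hypothetical threshold
summary = sorted(s for s, r in rank_normal.items() if r > rank_th)
```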
Chapter 4
Experimental Setup and Results Analysis
4.1 Document Preprocessing
We used the Google Speech Recognition, sklearn, nltk and networkx libraries in the implementation of our work.
• We have used Python file handling for reading existing .txt files. For handling PDF files, we have used the PyPDF2 library.
• For voice recognition, we used the SpeechRecognition library, which provides access to the Google Speech Recognition API. The user's voice is input through the microphone, converted to text online using the Google Speech Recognition API, and the resulting text is appended to a text file until the user asks the program to stop.
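The .txt path of the pipeline can be sketched with plain file handling and a naive regex sentence split (a simplification: the actual implementation uses nltk's tokenizer, and PyPDF2 / SpeechRecognition for the other input types):

```python
import os
import re
import tempfile

# Write a small sample document, then read it back and split it into
# sentences on end-of-sentence punctuation.
text = "Text summarization is useful. It saves time! Does it work well?"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

with open(path) as f:
    content = f.read()
os.unlink(path)

sentences = [s for s in re.split(r"(?<=[.!?])\s+", content) if s]
```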
4.2 Document-Term Matrix and normalized similarity matrix as per TF-IDF values
1. We have used the CountVectorizer() class from the sklearn library in Python. The fit_transform() method on a CountVectorizer object learns the vocabulary dictionary and returns the term-document matrix.

2. For applying the textrank algorithm on this graph, we call the pagerank() method on the networkx graph created in the previous step.
We choose only positive seeds because with a threshold below the mean threshold value, too many sentences would be included in the summary. Thus the threshold value we choose depends on the user's requirements: if the summary is desired to be 20% of the original, the ideal choice is seed_factor = 0.2.
seed_factor   story lines   summary lines   ratio
   0.1            101            30         0.2970
   0.1             29            10         0.3448
   0.15           101            24         0.2376
   0.15            29             7         0.2413
   0.2            101            21         0.2079
   0.2             29             6         0.2068
   0.25           101            15         0.1485
   0.3            101             8         0.0792
The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation".
Our model, modified to give an output of 5 lines, was tested on a dataset of 50 documents, each with a 5-line human-made summary of a 50- or 100-line document. An average BLEU score of 0.8 was achieved.
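For reference, the simplest member of the BLEU family (clipped unigram precision with a brevity penalty, i.e. BLEU-1) can be sketched as below; the full metric from the paper combines precisions up to 4-grams:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU-1: clipped unigram precision times brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("the cat sat on the mat", "the cat is on the mat")
# clipped matches: the(2) + cat(1) + on(1) + mat(1) = 5 of 6 tokens
```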
Chapter 5
"Our imagination is the only limit to what we can hope to have in the future." — Charles F. Kettering
5.1 Conclusion
In this thesis we presented extractive text summarization using the textrank algorithm. An extractive summary involves classifying the important sentences and putting them together to form a summary. The main aim was to classify the sentences according to their relevance and importance in the given corpus (a collection of sentences). In the future, this project can be extended to a model based on abstraction: an abstractive summarizer first understands the text and then builds the summary.
References
[1] Huang, H.-H., Kuo, Y.-H., and Yang, H.-C. Fuzzy-rough set aided sentence extraction summarization. In Innovative Computing, Information and Control, 2006. ICICIC'06. First International Conference on (2006), vol. 1, IEEE, pp. 450–453.

[2] Mihalcea, R., and Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing (2004).

[3] Mittal, N., Agarwal, B., Mantri, H., Goyal, R. K., and Jain, M. K. Extractive text summarization.

[4] Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab, 1999.

[6] Ramos, J., et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (2003), vol. 242, pp. 133–142.