A thesis submitted for the award of the degree
of
B. Tech
by
SHUBHAM AJMERA
Date :
Faculty Advisor
Mandi, 175001
CERTIFICATE
This is to certify that the thesis titled AUTOMATIC TEXT SUMMARIZATION, submitted by SHUBHAM AJMERA, to the Indian Institute of Technology Mandi, is a record of bonafide work under my (our) supervision and is worthy of consideration for the award of the degree of B. Tech of the Institute.
Date :
Supervisor(s)
Mandi, 175001
DECLARATION BY THE STUDENT
This is to certify that the thesis titled AUTOMATIC TEXT SUMMARIZATION, submitted by me to the Indian Institute of Technology Mandi for the award of the degree of B. Tech, is a bonafide record of work carried out by me under the supervision of DR. ARTI KASHYAP. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.
Date :
SHUBHAM AJMERA
Mandi, 175001
Acknowledgments
I would like to take this opportunity to thank my guide, Dr. Arti Kashyap, for supporting me throughout the course. She wholeheartedly supported me at all stages of the project, and I would like to express my heartiest gratitude to her.
I would also like to thank Dr. Dileep A. D., who offered me his invaluable experience and knowledge to guide me throughout the project and accomplish the project goals.
Further, I am grateful to my teammates, Saurabh Jain and Rohit Shukla, for all the contributions they have made to this project.
Shubham Ajmera
Abstract
With the increasing amount of available information, it has become difficult to extract concise information from documents. It is therefore necessary to build a system that can produce human-quality summaries. Automatic text summarization is a tool that provides summaries of a given document. In this project, three different approaches to text summarization have been implemented. In all three summarizers, sentences are represented as feature vectors. In the first approach, features such as sentence position, vocabulary intersection, resemblance to the title, and sentence inclusion of numerical data are used, and the model is trained using a Genetic Algorithm. In the second approach, apart from the features used in the first approach, the structure of the document and the popularity of its content are also used as features. In the third approach, the word2vec algorithm is used to generate the summary.
Keywords: summarization, text, natural language processing, nlp, text mining, wikipedia,
google search, feature extraction, wordnet
Contents
Abstract
Abbreviations
List of Symbols
1 Introduction
1.1 Objective
1.2 Methods and Techniques
3.2.4.2 Summary Page
3.2.4.3 Webpage Interface for non-Wikipedia Context
3.2.5 Features
4 Experimental Results
References
Abbreviations
NN - Neural Network
FV - Feature Vector
Symbols
∑ - Summation over all values
List of Tables
List of Figures
4.5 Comparison of summaries generated between the GA summarizer and Microsoft Word.
Chapter 1
Introduction
The purpose of the update summary is to identify new pieces of information from the document [1].
1.1 Objective
The objective of the project is to provide summaries of documents using different summarization techniques and to compare the quality of the summaries generated. Different learning models and data modelling techniques, such as Neural Networks (NN) and Genetic Algorithms (GA), are also tested.
sentence inclusion of numerical data, and sentence relative length [2]. The model is trained using Genetic Algorithms [2]. In the second approach, we chose to work on documents which are well structured and in which sentences are less connected. For summarization using this technique, we have used Wikipedia articles, as they provide structured content and their sentences are less connected. In this approach, Google Search result counts, WordNet word similarity [3], TF-IDF, and sentence position features are used. In the third approach, continuous bag-of-words and skip-gram architectures are implemented using the Word2Vec toolkit [4], and neural networks are used to train the summarizer.
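As an illustration of the third approach, the sketch below trains both architectures using the gensim library rather than the original Word2Vec toolkit; the toy corpus, the parameters, and the averaged sentence representation are illustrative assumptions, not the project's actual configuration.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (assumption; any tokenized text works).
sentences = [
    ["automatic", "text", "summarization", "extracts", "key", "sentences"],
    ["word2vec", "learns", "vector", "representations", "of", "words"],
]

# sg=0 selects the continuous bag-of-words (CBOW) architecture,
# sg=1 selects skip-gram; both train a shallow neural network.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def sentence_vector(model, tokens):
    # One common sentence representation: the average of its word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)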
Chapter 2
Fattah and Ren [2] have proposed a simple approach for text summarization. They consider features such as sentence position, length, named entities, numerical data, bushy paths, and vocabulary overlap to generate the summary. In this approach, sentences are modelled as vectors of features. Sentences are marked as correct if they are to be included in the summary and as incorrect otherwise. While making the final choice of sentences, each sentence is given a value between 0 and 1 by a machine learning model, and the sentences are selected using those scores.
Various other papers have also been published that are useful for a particular class of documents. Kamal Sarkar et al. [5] built a summarizer for medical documents using a machine learning approach, with features that are common in and specific to medical documents. It uses the concept of cue phrases, such that if a sentence contains n cue phrases, it gets n as its score for that feature. It also uses the position of cue phrases, such that if a cue phrase appears at the beginning or at the end of the sentence, the sentence gets an additional point. Acronyms are also used as a feature, and sentences containing them get extra points. In some papers [6], sentences are also broken down by special cue markers and represented as feature vectors.
Ryang and Abekawa [7] proposed a method of automatic text summarization using reinforcement learning. Research has also been done on summarization of Wikipedia articles. Hingu et al. [8] implemented various methods for summarizing Wikipedia pages. In one of their methods, sentences containing citations are given higher weightage. In another approach, the frequency of words is adjusted based on the root form of the word: the words are stemmed with the objective of assigning equal weightage to words with the same root.
Edmundson (1969) [9] proposed an approach to extraction-based summarization using features such as sentence position, word frequency, cue words, and the skeleton of an article, manually assigning a weight to each of these features. The system was tested using manually generated summaries of 400 technical documents. The results were good, with 44% of the generated summaries matching the manual summaries.
Chapter 3
1. f1 = Sentence Position: Sentences appearing at the beginning and at the end of the document are given higher weightage [2].
2. f2 = Positive keyword in the sentence: The keywords that are likely to appear in the summary [2].

Score_{f_2}(s) = \frac{1}{Length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \in S \mid keyword_i)    (3.1)

P(s \in S \mid keyword_i) = \frac{P(keyword_i \mid s \in S) \cdot P(s \in S)}{P(keyword_i)}    (3.2)
P(s \in S) = \frac{\#(\text{sentences in the training corpus that are also in the summary})}{\#(\text{sentences in the training corpus})}    (3.4)
3. f3 = Negative keyword in the sentence: The keywords that are unlikely to appear in the summary [2]. A counting sketch for estimating these keyword probabilities follows the list.
Score_{f_3}(s) = \frac{1}{Length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \notin S \mid keyword_i)    (3.6)
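For illustration, one way to estimate these keyword probabilities from a labelled training corpus is by direct counting, which is equivalent to applying the Bayes rule of equation 3.2. This is a sketch under that assumption, with hypothetical names, not the thesis's actual implementation:

def p_in_summary_given_keyword(keyword, training_sentences):
    # training_sentences: list of (words, in_summary) pairs from a labelled corpus.
    # Counting only the sentences that contain the keyword gives
    # P(s in S | keyword) directly, which equation 3.2 computes via Bayes' rule.
    flags = [in_s for words, in_s in training_sentences if keyword in words]
    return sum(flags) / len(flags) if flags else 0.0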
3.1.1 Architecture
The project is built in Java. The process of summarization begins with processing the input document, which is broken down into sentences and subsequently into words. The summarizer maintains a list of the document's sentences, and each sentence is responsible for storing the words contained in it. A UML class diagram of this design is shown in Figure 3.1.
Fig. 3.1: UML Class diagram of Summarizer
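As a rough illustration of this containment hierarchy, a minimal sketch follows (the project itself is written in Java; the Python classes and field names below are assumptions for illustration, not the project's real identifiers):

from dataclasses import dataclass, field

@dataclass
class Sentence:
    text: str
    words: list = field(default_factory=list)    # each sentence stores its own words
    scores: dict = field(default_factory=dict)   # per-feature scores, filled in later

@dataclass
class Summarizer:
    sentences: list = field(default_factory=list)

    def load(self, document: str) -> None:
        # Naive sentence splitting for illustration only.
        for raw in document.split("."):
            raw = raw.strip()
            if raw:
                self.sentences.append(Sentence(text=raw, words=raw.split()))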
The summarizer calculates the score of each feature for every sentence. It uses Apache OpenNLP for name finding (dataset: en-ner-person.bin), location finding (dataset: en-ner-location.bin), and date/time finding (datasets: en-ner-date.bin, en-ner-time.bin). After computing the scores, it adds them up, giving equal weightage to each feature value as per equation 3.9, and returns a final score, based upon which the summary is generated.
Score(s) = \sum_{i} Score_{f_i}(s)    (3.9)
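Continuing the illustrative sketch above, the equal-weight combination of equation 3.9 amounts to:

def total_score(sentence: Sentence) -> float:
    # Equation 3.9: unweighted sum of all per-feature scores of the sentence.
    return sum(sentence.scores.values())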
similarity using WordNet, sentence position, etc. A custom summary option is also provided, so that the user can give keywords which he wants or does not want in the summary.
3.2.1 Methodology
The summarizer fetches pages from Wikipedia and filters out tags, references, and other meta content. The processed input page is then broken down into sections, and each section maintains a separate list of the sentences contained in it. After processing and storing the page content, the summarizer calculates feature values for each sentence. The features used are as follows:
1. Sentence Position: This is a traditional method for scoring sentences [2]. Each sentence in a section is given a score based on its relative position in the section. Sentences appearing at the beginning of the section are given higher weightage, as shown in equation 3.10.

Score(s) = \frac{1}{\text{position in the section}}    (3.10)
2. TF-IDF: Each sentence is scored by how rare its words are in the corpus, as shown in equation 3.11 (a code sketch of this computation appears after the list).

Score(s) = \frac{\sum_{word_i \in W} \frac{1}{\text{occurrences of } word_i \text{ in the corpus}}}{\text{number of words contained in the sentence}}    (3.11)

where W is the list of words contained in the sentence. An illustration is shown in figure 3.2.
Fig. 3.2: TF-IDF Illustration
3. Google Search Result Count: This is one of the most important features, in which the importance of a section is computed by comparing Google Search result counts. A search is performed for each section using a query formed by concatenating the section heading and the title of the document. Scores are given relative to the section which has the highest count, as shown in figure 3.3.
4. Keyword Similarity using WordNet: The user can select keywords that he wants or does not want in the summary. The summarizer then compares the similarity of each such keyword with every word in the sentence using the WordNet dataset. If a sentence contains similar words, it is scored depending on the user's preference: for positive keywords, the score of sentences containing similar words is incremented, while for negative keywords it is decremented. Some examples of word similarity are shown in Table 3.1. Sketches of these feature computations are given after the list.
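For illustration, minimal Python sketches of the position, TF-IDF, and WordNet-similarity feature computations follow. The corpus is assumed to be a simple word-to-occurrence-count dictionary (matching the database of figure 3.7), Wu-Palmer similarity is an assumed choice since the thesis does not name its similarity measure, and all function names are illustrative.

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

def position_score(position_in_section: int) -> float:
    # Equation 3.10: sentences earlier in a section score higher (1-based position).
    return 1.0 / position_in_section

def tfidf_score(words, corpus_counts) -> float:
    # Equation 3.11: rarer words in the corpus contribute more to the score.
    if not words:
        return 0.0
    rarity = sum(1.0 / corpus_counts.get(w, 1) for w in words)
    return rarity / len(words)

def max_wordnet_similarity(word_a: str, word_b: str) -> float:
    # Best Wu-Palmer similarity over all synset pairs of the two words;
    # used to decide whether a sentence word matches a user keyword.
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            sim = syn_a.wup_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best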
Fig. 3.4: Call flow of summarizer.
1. Extractor: It extracts content from Wikipedia pages using urllib2. The extractor is provided with the title of the Wikipedia article, from which it forms the URL used to fetch the page.
2. Text Processor: After receiving data from the extractor, it removes various content, including references, tags, tables, and unwanted sections. It then breaks the article down into sections using regular expressions and returns a JSON object to the summarizer, as shown in Figure 3.5.
Fig. 3.5: Structure of JSON object returned after Extraction and Processing
3. Image Extractor: It provides the image to the front-end. It extracts the image from the infobox of the Wikipedia page using BeautifulSoup.
4. Summarizer: After receiving the processed data, it evaluates the scores of the sentences for the different features. It calculates the sentences' scores concurrently for faster processing and eventually computes the score of each sentence.
5. Google Search Result Count: It uses the Selenium web driver and BeautifulSoup for extracting the Google Search result count. The Selenium web driver uses Mozilla Firefox to open the Google results page for the input query. The text of the page is then processed by BeautifulSoup, and the count is fetched from it using regular expressions. It takes a pause of two seconds before making a new query. A sketch of this component is shown after the list.
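A rough sketch of such a result-count fetcher is given below; the query URL, the "About N results" phrase, and the regular expression are assumptions about Google's page layout (which changes often), not the project's exact code.

import re
import time
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium import webdriver

def google_result_count(driver: webdriver.Firefox, query: str) -> int:
    # Open the Google results page for the query in Firefox.
    driver.get("https://www.google.com/search?q=" + quote_plus(query))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # Look for a phrase like "About 1,230,000 results" in the page text.
    match = re.search(r"About ([\d,]+) results", soup.get_text())
    time.sleep(2)  # pause before the next query, as described above
    return int(match.group(1).replace(",", "")) if match else 0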
Fig. 3.7: The database for TF-IDF: stemmed words are stored with their corresponding occurrence counts.
The final score of each sentence is calculated by multiplying the values of all its features, as shown in equation 3.14.
Score(s) = \prod_{f_i \in F} f_i(s)    (3.14)
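As a one-line sketch of this multiplicative combination (contrast with the additive combination of equation 3.9; the names are illustrative):

import math

def final_score(feature_values: dict) -> float:
    # Equation 3.14: multiply all feature values computed for the sentence.
    return math.prod(feature_values.values())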
In the webpage, three boxes are provided: one for the Wikipedia page for which the summary has to be generated, one for important keywords, and one for negative keywords, as shown in Figure 3.8. A mobile interface has also been developed for the same, as shown in Figure 3.9.
Fig. 3.9: Mobile version of querying page.
3.2.4.2 Summary Page
This page is built using Bootstrap and JavaScript. It receives GET parameters from the querying page, using which it fetches the relevant summary, image, and title of the document. It also has a slider with which a user can select the percentage shrinkage according to his requirement, as shown in figures 3.10 and 3.11.
Fig. 3.10: Web version of summary page.
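One way such a percentage-shrinkage slider can map to sentence selection is sketched below; the cut-off rule is an illustrative assumption, not the page's actual logic.

def shrink(sentences, shrink_pct: int):
    # sentences: list of (score, text) pairs in document order.
    # Keep the best-scoring sentences until only (100 - shrink_pct)% remain,
    # then emit them in their original order.
    keep = max(1, round(len(sentences) * (100 - shrink_pct) / 100))
    best = sorted(range(len(sentences)), key=lambda i: sentences[i][0], reverse=True)[:keep]
    return [sentences[i][1] for i in sorted(best)]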
3.2.4.3 Webpage Interface for non-Wikipedia Context
An interface for non-Wikipedia context has also been developed, in which data can be entered manually, including the document's title, section headings, and their content. Sections can be added dynamically according to the need.
3.2.5 Features
Chapter 4
Experimental Results
Fig. 4.1: Comparison between the summarizer using the GA approach and the Wikipedia summarizer, with a compression rate of 30%, against the hand-written summary.
From figure 4.1, we find that the average rate of intersection for GA comes out between 0.4 and 0.5 with high deviation, while the Wikipedia summaries show more consistency, with values ranging from 0.7 to 0.8.
Fig. 4.2: Comparison between the summarizer using the GA approach and the Wikipedia summarizer, with a compression rate of 20%, against the hand-written summary.
Fig. 4.3: Comparison between the GA summarizer and the Word2Vec summarizer.
4. Wikipedia Summarizer vs Microsoft Word Summarizer
For this comparison, the Wikipedia articles were fed to Microsoft Word and the summaries were compared using equation 4.1. In contrast to the comparison with the hand-written summaries, where the Wikipedia summarizer showed very little deviation with a high precision value, here it shows a wide variation in the range 0.38 to 0.6, as shown in the graph in figure 4.4.
Fig. 4.4: Comparison of summaries generated by the Wikipedia summarizer and Microsoft Word.
Fig. 4.5: Comparison of summaries generated by the GA summarizer and Microsoft Word.
Chapter 5
In this project, we have implemented various techniques for extraction-based text summarization and analyzed their performance on various datasets.
5.1 Conclusion
1. The summarizer built using genetic algorithms has shown satisfactory results for a wide variety of documents.
2. The Wikipedia summarizer, though not quite generic, has shown excellent results in generating summaries for Wikipedia pages.
3. The Word2Vec summarizer is very dependent on the training data, but shows good results compared to the other two for data that is closely knit.
5.2 Future Work
2. Improving the accuracy of features like the named entity identifier, location finder, etc.
3. Extending the domains of the Word2Vec summarizer by training it on various other datasets in diverse domains.
References
[2] M. A. Fattah and F. Ren, "Automatic text summarization," World Academy of Science, Engineering and Technology, vol. 37, 2008.
[5] J. Ramos, "Using tf-idf to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, 2003.
[6] W. T. Chuang and J. Yang, "Extracting sentence segments for text summarization: a machine learning approach," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000, pp. 152–159.
[7] S. Ryang and T. Abekawa, "Framework of automatic text summarization using reinforcement learning," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 256–265.
[10] "Wikipedia - Wikipedia, the free encyclopedia," http://en.wikipedia.org/wiki/Wikipedia, accessed: 2015-05-15.
Curriculum Vitae
Education qualifications:
Permanent address:
‘Keluka Bhawan’,
3/219, Near CTS Bus Stand, Sanganer
Jaipur, Rajasthan, India, 302029.
Ph: +918894849404