A thesis submitted for the award of the degree
of
B. Tech
by
SHUBHAM AJMERA
Date :
Faculty Advisor
Mandi, 175001
CERTIFICATE
This is to certify that the thesis titled AUTOMATIC TEXT SUMMARIZATION, submitted by SHUBHAM AJMERA, to the Indian Institute of Technology Mandi, is a record of bonafide work under my (our) supervision and is worthy of consideration for the award of the degree of B. Tech of the Institute.
Date :
Supervisor(s)
Mandi, 175001
DECLARATION BY THE STUDENT
This is to certify that the thesis titled AUTOMATIC TEXT SUMMARIZATION, submitted by me to the Indian Institute of Technology Mandi for the award of the degree of B. Tech, is a bonafide record of work carried out by me under the supervision of DR. ARTI KASHYAP. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.
Date :
SHUBHAM AJMERA
Mandi, 175001
Acknowledgments
I would like to take this opportunity to thank my guide, Dr. Arti Kashyap, for supporting me throughout the course. She wholeheartedly supported me at all stages of the project, and I would like to express my heartiest gratitude to her.
I would also like to thank Dr. Dileep A. D., who offered me his invaluable experience and knowledge to guide me throughout the project and accomplish the project goals.
Further, I am grateful to my teammates, Saurabh Jain and Rohit Shukla, for all the contributions they have made to this project.
Shubham Ajmera
Abstract
With the increasing amount of available information, it has become difficult to extract concise information from documents. It is therefore necessary to build a system that can produce human-quality summaries. Automatic text summarization is a tool that provides summaries of a given document. In this project, three different approaches to text summarization have been implemented. In all three summarizers, sentences are represented as feature vectors. In the first approach, features such as sentence position, vocabulary intersection, resemblance to the title, and sentence inclusion of numerical data are used, and the model is trained using a Genetic Algorithm. In the second approach, apart from the features used in the first approach, the structure of the document and the popularity of its content are also used as features. In the third approach, the word2vec algorithm is used to generate the summary.
Keywords: summarization, text, natural language processing, nlp, text mining, wikipedia,
google search, feature extraction, wordnet
Contents
Abstract
Abbreviations
List of Symbols
1 Introduction
1.1 Objective
1.2 Methods and Techniques
3.2.4.2 Summary Page
3.2.4.3 Webpage Interface for non-Wikipedia Context
3.2.5 Features
4 Experimental Results
References
Abbreviations
NN - Neural Network
FV - Feature Vector
Symbols
∑ - Summation over all values
List of Tables
List of Figures
4.5 Comparison of summaries generated between the GA summarizer and Microsoft Word.
Chapter 1
Introduction
The purpose of the update summary is to identify new pieces of information from the document [1].
1.1 Objective
The objective of the project is to provide summaries of documents using different summarization techniques and to compare the quality of the summaries generated. Different learning models and data modelling techniques, such as Neural Networks (NN) and Genetic Algorithms (GA), are also tested.
sentence inclusion of numerical data, and sentence relative length [2]. The model is trained using Genetic Algorithms [2]. In the second approach, we chose to work on documents which are well structured and in which sentences are less connected. For summarization using this technique, we have used Wikipedia articles, as they provide structured content and their sentences are less connected. In this approach, Google Search result counts, WordNet word similarity [3], TF-IDF, and sentence position features are used. In the third approach, continuous bag-of-words and skip-gram architectures are implemented using the Word2Vec toolkit [4], and neural networks are used to train the summarizer.
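As an illustration of the third approach, the sketch below trains both architectures using the gensim library rather than the original Word2Vec toolkit; the toy corpus, the parameters, and the averaged sentence representation are illustrative assumptions, not the project's actual configuration.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (assumption; any tokenized text works).
sentences = [
    ["automatic", "text", "summarization", "extracts", "key", "sentences"],
    ["word2vec", "learns", "vector", "representations", "of", "words"],
]

# sg=0 selects the continuous bag-of-words (CBOW) architecture,
# sg=1 selects skip-gram; both train a shallow neural network.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def sentence_vector(model, tokens):
    # One common sentence representation: the average of its word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)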
Chapter 2
Fattah and Ren [2] have proposed a simple approach for text summarization. They consider features such as sentence position, length, named entities, numerical data, bushy paths, and vocabulary overlap to generate the summary. In this approach, sentences are modelled as vectors of features. Sentences are marked as correct if they are to be included in the summary and as incorrect otherwise. While making the final choice of sentences, each sentence is given a value between 0 and 1 by a machine learning model, and the sentences are selected using those scores.
Various other papers have also been published that are useful for a particular class of documents. Kamal Sarkar et al. [5] built a summarizer for medical documents using a machine learning approach, with features that are common in and specific to medical documents. It uses the concept of cue phrases, such that if a sentence contains n cue phrases, it gets n as its score for that feature. It also uses the position of cue phrases, such that if a cue phrase appears at the beginning or at the end of the sentence, the sentence gets an additional point. Acronyms are also used as a feature, and sentences containing them get extra points. In some papers [6], sentences are also broken down by special cue markers and represented as feature vectors.
Ryang and Abekawa [7] proposed a method of automatic text summarization using reinforcement learning. Research has also been done on summarization of Wikipedia articles. Hingu et al. [8] implemented various methods for summarizing Wikipedia pages. In one of their methods, sentences containing citations are given higher weightage. In another approach, the frequency of words is adjusted based on the root form of the word: the words are stemmed with the objective of assigning equal weightage to words with the same root.
Edmundson (1969) [9] proposed an approach to extraction-based summarization using features such as sentence position, word frequency, cue words, and the skeleton of an article, manually assigning a weight to each of these features. The system was tested using manually generated summaries of 400 technical documents. The results were good, with 44% of the generated summaries matching the manual summaries.
Chapter 3
1. f1 = Sentence Position: Sentences appearing at the beginning and at the end of the document are given higher weightage [2].
2. f2 = Positive keyword in the sentence: The keywords that are likely to appear in the summary [2].

Score_{f_2}(s) = \frac{1}{Length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \in S \mid keyword_i)    (3.1)

P(s \in S \mid keyword_i) = \frac{P(keyword_i \mid s \in S) \cdot P(s \in S)}{P(keyword_i)}    (3.2)
P(s \in S) = \frac{\#(\text{sentences in the training corpus that are also in the summary})}{\#(\text{sentences in the training corpus})}    (3.4)
3. f3 = Negative keyword in the sentence: The keywords that are unlikely to appear in the summary [2]. A counting sketch for estimating these keyword probabilities follows the list.
Score_{f_3}(s) = \frac{1}{Length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \notin S \mid keyword_i)    (3.6)
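For illustration, one way to estimate these keyword probabilities from a labelled training corpus is by direct counting, which is equivalent to applying the Bayes rule of equation 3.2. This is a sketch under that assumption, with hypothetical names, not the thesis's actual implementation:

def p_in_summary_given_keyword(keyword, training_sentences):
    # training_sentences: list of (words, in_summary) pairs from a labelled corpus.
    # Counting only the sentences that contain the keyword gives
    # P(s in S | keyword) directly, which equation 3.2 computes via Bayes' rule.
    flags = [in_s for words, in_s in training_sentences if keyword in words]
    return sum(flags) / len(flags) if flags else 0.0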
3.1.1 Architecture
The project is built in Java. The process of summarization begins with processing the input document, which is broken down into sentences and subsequently into words. The summarizer maintains a list of the document's sentences, and each sentence is responsible for storing the words contained in it. A UML class diagram of this design is shown in Figure 3.1.
Fig. 3.1: UML Class diagram of Summarizer
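As a rough illustration of this containment hierarchy, a minimal sketch follows (the project itself is written in Java; the Python classes and field names below are assumptions for illustration, not the project's real identifiers):

from dataclasses import dataclass, field

@dataclass
class Sentence:
    text: str
    words: list = field(default_factory=list)    # each sentence stores its own words
    scores: dict = field(default_factory=dict)   # per-feature scores, filled in later

@dataclass
class Summarizer:
    sentences: list = field(default_factory=list)

    def load(self, document: str) -> None:
        # Naive sentence splitting for illustration only.
        for raw in document.split("."):
            raw = raw.strip()
            if raw:
                self.sentences.append(Sentence(text=raw, words=raw.split()))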
The summarizer calculates the score of each feature for every sentence. It uses Apache OpenNLP for name finding (dataset: en-ner-person.bin), location finding (dataset: en-ner-location.bin), and date/time finding (datasets: en-ner-date.bin, en-ner-time.bin). After computing the scores, it adds them up, giving equal weightage to each feature value as per equation 3.9, and returns a final score, based upon which the summary is generated.
Score(s) = \sum_{i} Score_{f_i}(s)    (3.9)
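Continuing the illustrative sketch above, the equal-weight combination of equation 3.9 amounts to:

def total_score(sentence: Sentence) -> float:
    # Equation 3.9: unweighted sum of all per-feature scores of the sentence.
    return sum(sentence.scores.values())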
similarity using WordNet, sentence position, etc. A custom summary option is also provided, so that the user can give keywords which he wants or does not want in the summary.
3.2.1 Methodology
The summarizer fetches pages from Wikipedia and filters out tags, references, and other meta content. The processed input page is then broken down into sections, and each section maintains a separate list of the sentences contained in it. After processing and storing the page content, the summarizer calculates feature values for each sentence. The features used are as follows:
1. Sentence Position: This is a traditional method for scoring sentences [2]. Each sentence in a section is given a score based on its relative position in the section. Sentences appearing at the beginning of the section are given higher weightage, as shown in equation 3.10.

Score(s) = \frac{1}{\text{position in the section}}    (3.10)
2. TF-IDF: Each sentence is scored by how rare its words are in the corpus, as shown in equation 3.11 (a code sketch of this computation appears after the list).

Score(s) = \frac{\sum_{word_i \in W} \frac{1}{\text{occurrences of } word_i \text{ in the corpus}}}{\text{number of words contained in the sentence}}    (3.11)

where W is the list of words contained in the sentence. An illustration is shown in figure 3.2.
Fig. 3.2: TF-IDF Illustration
3. Google Search Result Count: This is one of the most important features, in which the importance of a section is computed by comparing Google Search result counts. A search is performed for each section using a query formed by concatenating the section heading and the title of the document. Scores are given relative to the section which has the highest count, as shown in figure 3.3.
4. Keyword Similarity using WordNet: The user can select keywords that he wants or does not want in the summary. The summarizer then compares the similarity of each such keyword with every word in the sentence using the WordNet dataset. If a sentence contains similar words, it is scored depending on the user's preference: for positive keywords, the score of sentences containing similar words is incremented, while for negative keywords it is decremented. Some examples of word similarity are shown in Table 3.1. Sketches of these feature computations are given after the list.
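For illustration, minimal Python sketches of the position, TF-IDF, and WordNet-similarity feature computations follow. The corpus is assumed to be a simple word-to-occurrence-count dictionary (matching the database of figure 3.7), Wu-Palmer similarity is an assumed choice since the thesis does not name its similarity measure, and all function names are illustrative.

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

def position_score(position_in_section: int) -> float:
    # Equation 3.10: sentences earlier in a section score higher (1-based position).
    return 1.0 / position_in_section

def tfidf_score(words, corpus_counts) -> float:
    # Equation 3.11: rarer words in the corpus contribute more to the score.
    if not words:
        return 0.0
    rarity = sum(1.0 / corpus_counts.get(w, 1) for w in words)
    return rarity / len(words)

def max_wordnet_similarity(word_a: str, word_b: str) -> float:
    # Best Wu-Palmer similarity over all synset pairs of the two words;
    # used to decide whether a sentence word matches a user keyword.
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            sim = syn_a.wup_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best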
Fig. 3.4: Call flow of summarizer.
1. Extractor: It extracts content from Wikipedia pages using urllib2. The extractor is provided with the title of the Wikipedia article, from which it forms the URL used to fetch the page.
2. Text Processor: After receiving data from the extractor, it removes various content, including references, tags, tables, and unwanted sections. It then breaks the article down into sections using regular expressions and returns a JSON object to the summarizer, as shown in Figure 3.5.
Fig. 3.5: Structure of JSON object returned after Extraction and Processing
3. Image Extractor: It provides the image to the front-end. It extracts the image from the infobox of the Wikipedia page using BeautifulSoup.
4. Summarizer: After receiving the processed data, it evaluates the scores of the sentences for the different features. It calculates the sentences' scores concurrently for faster processing and eventually computes the score of each sentence.
5. Google Search Result Count: It uses the Selenium web driver and BeautifulSoup for extracting the Google Search result count. The Selenium web driver uses Mozilla Firefox to open the Google results page for the input query. The text of the page is then processed by BeautifulSoup, and the count is fetched from it using regular expressions. It takes a pause of two seconds before making a new query. A sketch of this component is shown after the list.
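A rough sketch of such a result-count fetcher is given below; the query URL, the "About N results" phrase, and the regular expression are assumptions about Google's page layout (which changes often), not the project's exact code.

import re
import time
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium import webdriver

def google_result_count(driver: webdriver.Firefox, query: str) -> int:
    # Open the Google results page for the query in Firefox.
    driver.get("https://www.google.com/search?q=" + quote_plus(query))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # Look for a phrase like "About 1,230,000 results" in the page text.
    match = re.search(r"About ([\d,]+) results", soup.get_text())
    time.sleep(2)  # pause before the next query, as described above
    return int(match.group(1).replace(",", "")) if match else 0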
Fig. 3.7: The database for TF-IDF: stemmed words are stored with their corresponding occurrence counts.
The final score of each sentence is calculated by multiplying the values of all its features, as shown in equation 3.14.
Score(s) = \prod_{f_i \in F} f_i(s)    (3.14)
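As a one-line sketch of this multiplicative combination (contrast with the additive combination of equation 3.9; the names are illustrative):

import math

def final_score(feature_values: dict) -> float:
    # Equation 3.14: multiply all feature values computed for the sentence.
    return math.prod(feature_values.values())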
In the webpage, three boxes are provided: one for the Wikipedia page for which the summary has to be generated, one for important keywords, and one for negative keywords, as shown in Figure 3.8. A mobile interface has also been developed for the same, as shown in Figure 3.9.
Fig. 3.9: Mobile version of querying page.
3.2.4.2 Summary Page
This page is built using Bootstrap and JavaScript. It receives GET parameters from the querying page, using which it fetches the relevant summary, image, and title of the document. It also has a slider with which a user can select the percentage shrinkage according to his requirement, as shown in figures 3.10 and 3.11.
Fig. 3.10: Web version of summary page.
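One way such a percentage-shrinkage slider can map to sentence selection is sketched below; the cut-off rule is an illustrative assumption, not the page's actual logic.

def shrink(sentences, shrink_pct: int):
    # sentences: list of (score, text) pairs in document order.
    # Keep the best-scoring sentences until only (100 - shrink_pct)% remain,
    # then emit them in their original order.
    keep = max(1, round(len(sentences) * (100 - shrink_pct) / 100))
    best = sorted(range(len(sentences)), key=lambda i: sentences[i][0], reverse=True)[:keep]
    return [sentences[i][1] for i in sorted(best)]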
3.2.4.3 Webpage Interface for non-Wikipedia Context
An interface for non-Wikipedia context has also been developed, in which data can be entered manually, including the document's title, section headings, and their content. Sections can be added dynamically according to the need.
3.2.5 Features
Chapter 4
Experimental Results
Fig. 4.1: Comparison between the summarizer using the GA approach and the Wikipedia summarizer, with a compression rate of 30%, against the hand-written summary.
From figure 4.1, we find that the average rate of intersection for GA comes out between 0.4 and 0.5 with high deviation, while the Wikipedia summaries show more consistency, with values ranging from 0.7 to 0.8.
Fig. 4.2: Comparison between the summarizer using the GA approach and the Wikipedia summarizer, with a compression rate of 20%, against the hand-written summary.
Fig. 4.3: Comparison between the GA summarizer and the Word2Vec summarizer.
4. Wikipedia Summarizer vs Microsoft Word Summarizer
For this comparison, the Wikipedia articles were fed to Microsoft Word and the summaries were compared using equation 4.1. In contrast to the comparison with the hand-written summaries, where the Wikipedia summarizer showed very little deviation with a high precision value, here it shows a wide variation in the range 0.38 to 0.6, as shown in the graph in figure 4.4.
Fig. 4.4: Comparison of summaries generated by the Wikipedia summarizer and Microsoft Word.
Fig. 4.5: Comparison of summaries generated by the GA summarizer and Microsoft Word.
Chapter 5
In this project, we have implemented various techniques for extraction-based text summarization and analyzed their performance on various datasets.
5.1 Conclusion
1. The summarizer built using genetic algorithms has shown satisfactory results for a wide variety of documents.
2. The Wikipedia summarizer, though not quite generic, has shown excellent results in generating summaries for Wikipedia pages.
3. The Word2Vec summarizer is very dependent on the training data, but shows good results compared to the other two for data that is closely knit.
5.2 Future Work
2. Improving the accuracy of features like the named entity identifier, location finder, etc.
3. Extending the domains of the Word2Vec summarizer by training it on various other datasets in diverse domains.
References
[2] M. A. Fattah and F. Ren, "Automatic text summarization," World Academy of Science, Engineering and Technology, vol. 37, 2008.
[5] J. Ramos, "Using tf-idf to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, 2003.
[6] W. T. Chuang and J. Yang, "Extracting sentence segments for text summarization: a machine learning approach," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000, pp. 152–159.
[7] S. Ryang and T. Abekawa, "Framework of automatic text summarization using reinforcement learning," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 256–265.
[10] "Wikipedia - Wikipedia, the free encyclopedia," http://en.wikipedia.org/wiki/Wikipedia, accessed: 2015-05-15.
Curriculum Vitae
Education qualifications:
Permanent address:
‘Keluka Bhawan’,
3/219, Near CTS Bus Stand, Sanganer
Jaipur, Rajasthan, India, 302029.
Ph: +918894849404