
2024 3rd International Conference for Advancement in Technology (ICONAT)

Goa, India. Sep 13-14, 2024

Automatic Text Summarization using Text Rank Algorithm

Rakhi Dumne
Department of Comp. Science and Engineering
Walchand College of Engineering
Sangli, India
[email protected]

Nitin L. Gavankar
Department of Comp. Science and Engineering
Walchand College of Engineering
Sangli, India
[email protected]

Madhav M. Bokare
Department of Computer
Institute of Technology and Management
Nanded, India
[email protected]

Vivek N. Waghmare
Department of Computer Engineering
PVG's College of Engineering & S.S. Dhamankar Institute of Management
Nashik
[email protected]

979-8-3503-5417-1/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICONAT61936.2024.10775241

Abstract—Automatic text summarization has emerged as a valuable tool for quickly locating the significant information in vast amounts of text with minimal effort. Text summarization is the practice of constructing a condensed form of a text document that retains the significant information and preserves the overall meaning of the source text. This study concentrates on the extractive text summarization technique, in which summarization is carried out on single-document as well as multi-document text. The application is further extended by implementing document summarization and URL summarization. The sentence extraction method applied to the input text forms the basis of the proposed technique: during extraction, weights are assigned to the sentences using the PageRank algorithm, and these weights act as the ranks of the sentences. The extraction technique then selects the sentences with the higher ranks from the given input text. The study focuses on obtaining high-rank sentences from the given document in order to generate a high-quality summary of the input text.

Keywords—Extractive summarization, natural language processing (NLP), Text Rank algorithm, text summarization

I. INTRODUCTION

In the present scenario, a large amount of data is generated on the internet every day. As a result, a better mechanism is required for extracting important information quickly and effectively, which has been a major challenge. Text summarization is one of the strategies for finding the most significant and meaningful information in a document, or in a set of linked documents, and condensing it into a shorter version while preserving the overall meaning [1][6]. Text summarization reduces the time needed to read a lengthy document and addresses the storage problems usually encountered with large amounts of data. Extractive text summarization and abstractive text summarization are the two main categories of text summarization techniques [1].

Automatic text summarization using the extractive technique is typically broken down into two sub-problems: single-document and multi-document text summarization. Typically, one document is used as the input for single-document summarization, and by applying a suitable technique a summary is generated from the given input while holding its overall meaning. In multi-document text summarization, multiple documents related to a similar subject are given as input and a combined summary is generated [9].

Various algorithms have been discussed in the literature for text summarization. Some of the challenges in text summarization are word embedding, extraction of key sentences, etc. In order to address these challenges, this study proposes the Text Rank Algorithm for automatic text summarization, which is an extractive approach. The Text Rank Algorithm uses a graph-based approach for sentence ranking: after preprocessing, each sentence is converted into a vertex, as shown in Fig. 1, where vertices 1-7 represent the sentences and each edge (a line connecting two vertices) represents the similarity weight between the corresponding sentences [2].

Fig. 1. Sample graph representing similarity [8]

II. LITERATURE SURVEY

The text summarization techniques used in earlier studies are described in this section. Text summarization is one of the most significant application areas of natural language processing. There are two categories of text summarization techniques: abstractive and extractive [1][3]. Extractive text summarization is the process of extracting pertinent sentences from the supplied text; in order to extract the important content from the input text, linguistic and statistical features of paragraphs are typically exploited [3][4]. Abstractive text summarization, in contrast, comprehends the document's primary notion and its meaning [4]. It further interprets the text to find new notions in the document using linguistic methods, and the resultant output is the most condensed version of the input document [3].

In a few studies, summarization of scientific texts has been done based on important attributes such as phrase frequency, key words, important phrases, and the location of the text [1]. Most of the earlier studies focus on generating a summary of a single input document. A few studies have also discussed extractive text summarization in which essential sentences are extracted by measuring word and phrase frequency, which provides a useful indicator of their significance [2]. In 1958, Baxendale began his study on extractive summarization at IBM; using the location of text within the document, he extracted crucial sentences [1].

However, accuracy verification has been a major challenge in text summarization. The accuracy of a generated summary is typically verified manually or against a human-generated summary, which is highly subjective. In the proposed system, accuracy verification has been done using online software that applies the cosine similarity method to calculate the similarity between the summaries under comparison.

In addition, many researchers have worked on text summarization using various other algorithms such as the Term Frequency Algorithm, the SumBasic Algorithm, the PageRank algorithm, and latent semantic indexing, all of which fall under unsupervised learning [13].

A. Term Frequency Algorithm

The abbreviation TF-IDF stands for Term Frequency-Inverse Document Frequency. This technique can be used to score how important each word is within a group of texts: every word is given a score that represents its weight within the corpus and the document. The method is commonly used in text mining and information retrieval [13]. The algorithm has the advantage of easy computation for generating a summary. However, because TF-IDF is built on the bag-of-words paradigm, it is unable to account for semantics, co-occurrences across many documents, etc. In contrast, the algorithm used in the present study considers the semantic as well as the syntactic meaning of words during word embedding.

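As a point of reference, the sketch below shows one common way a TF-IDF scorer can be turned into a simple extractive summarizer. It is illustrative only and is not the implementation evaluated in this paper; the helper name tfidf_summary and the use of scikit-learn and NLTK are assumptions.

# Illustrative TF-IDF sentence scoring (not the authors' implementation).
# Assumes scikit-learn and NLTK are installed and the NLTK 'punkt' data is available.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text: str, n: int = 3) -> str:
    """Return the n sentences with the highest mean TF-IDF weight."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n:
        return text
    # Treat each sentence as a "document" so IDF reflects the input text itself.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.mean(axis=1).A1  # average weight of the terms in each sentence
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n])
    return " ".join(sentences[i] for i in top)

Because the scoring is purely frequency-based, such a baseline ignores word meaning, which is exactly the limitation noted above.
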
B. SumBasic Algorithm

SumBasic is a multi-document text summarization algorithm [10]; on its own it does not support single-document summarization. In the proposed study, single-document as well as multi-document text summarization has been implemented within one application. The idea behind SumBasic is to prefer terms that appear more frequently in a document over less frequently occurring words, in order to produce a summary that is more likely to resemble human abstracts. It generates n-sentence summaries, where n is the number of sentences the user specifies [10].

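A minimal, simplified sketch of the SumBasic idea (word-probability scoring with down-weighting of already-used words) is given below; it is not the authors' code, and the function name sumbasic_summary is assumed.

# A minimal SumBasic-style sketch (illustrative; not the authors' code).
# Assumes NLTK with the 'punkt' tokenizer data is available.
from collections import Counter
import nltk

def sumbasic_summary(text: str, n: int = 3) -> str:
    """Pick n sentences, preferring frequent words and down-weighting reused ones."""
    originals = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(s.lower()) for s in originals]
    words = [w for s in sentences for w in s if w.isalpha()]
    prob = {w: c / len(words) for w, c in Counter(words).items()}

    chosen = []
    while len(chosen) < min(n, len(originals)):
        # Score each remaining sentence by the average probability of its words.
        best = max((i for i in range(len(originals)) if i not in chosen),
                   key=lambda i: sum(prob.get(w, 0) for w in sentences[i]) / max(len(sentences[i]), 1))
        chosen.append(best)
        for w in sentences[best]:  # squaring the probabilities reduces redundancy across picks
            if w in prob:
                prob[w] **= 2
    return " ".join(originals[i] for i in sorted(chosen))
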
C. PageRank Algorithm

Websites are ranked in search-engine results using the PageRank (PR) algorithm developed by Google. Larry Page, one of Google's founders, is credited with the creation of PageRank [15], and the importance of web pages is measured by using it. Google uses other algorithms as well, but this is its initial and most well-known one. The PageRank algorithm defines a probability distribution describing how likely a user who clicks links at random is to arrive at a particular page.

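For concreteness, a small power-iteration sketch of PageRank over a weighted adjacency matrix is shown below; the damping value and variable names are conventional assumptions, not taken from the paper.

# Power-iteration PageRank over a weighted adjacency matrix (illustrative sketch).
import numpy as np

def pagerank(weights, damping=0.85, iters=100, tol=1e-6):
    """weights[i][j] is the weight of the edge from node i to node j."""
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]
    row_sums = weights.sum(axis=1, keepdims=True)
    # Rows with no outgoing edges are treated as linking to every node uniformly.
    transition = np.divide(weights, row_sums,
                           out=np.full_like(weights, 1.0 / n),
                           where=row_sums != 0)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_rank = (1 - damping) / n + damping * transition.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

The same iteration is what ranks sentence vertices in the Text Rank pipeline described in Section III; only the interpretation of the nodes changes.
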
D. Latent Semantic Indexing

The relationships between a group of documents and the terms they contain are analyzed using latent semantic indexing. It uses a mathematical technique known as Singular Value Decomposition (SVD) to compute a collection of matrices that represent document similarity. The SVD is used in Latent Semantic Analysis to detect patterns of association between phrases and concepts, and it rests on the idea that words which appear in similar contexts have comparable meanings [16].

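A compact sketch of this idea, assuming scikit-learn for the term-document matrix and the truncated SVD (the example documents are invented for illustration):

# Latent-semantic-indexing sketch (illustrative; library choice is an assumption).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
]
# Term-document counts, then an SVD projection into a small "concept" space.
counts = CountVectorizer().fit_transform(docs)
concepts = TruncatedSVD(n_components=2).fit_transform(counts)
print(cosine_similarity(concepts))  # the two cat documents typically come out far more similar than the third
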
Depending upon the input type, the output type, and the purpose, text summarization has been classified into different types, as shown in Fig. 2.

Fig. 2. Types of text summarization [17]

In this study, the proposed methodology uses the Text Rank Algorithm for text summarization, which also falls under the unsupervised learning methods. In most of the earlier studies only single-paragraph text summarization was implemented, whereas in the present study single-paragraph as well as multi-paragraph text summarization has been implemented effectively and an easy interface has been provided to the user. Due to its graph-based approach, the Text Rank Algorithm produces good outcomes for text summarization and is also language independent [2][4]. In traditional studies the Word2Vec method has been used for word embeddings, which relies on local information about words in the text. The pipeline used here instead relies on a pre-trained word embedding dataset trained on a large corpus; to obtain the word vectors it incorporates global statistics (word co-occurrence) in addition to local information [3], and it considers the semantic as well as the syntactic meaning of words during word embedding. Considering all these important features, the proposed methodology uses the Text Rank Algorithm for text summarization in order to improve the performance of the proposed system.

III. PROPOSED METHODOLOGY

The Text Rank Algorithm is an extractive and unsupervised text summarization technique. It is a graph-based text ranking model for determining the most relevant phrases and keywords in an input text, which helps to improve the accuracy. The overall methodology of the proposed system is shown in Fig. 3.

Authentication is the process of verifying the identity of a user. Here, a user can see all the available services after logging in to the system; in order to restrict unauthenticated access, every user must register with the system.

The proposed methodology works in five different modules:

• Paragraph Summarizer
• Document Summarizer
• URL Summarizer
• Web Scraper
• Word Dictionary

Fig. 3. Proposed methodology for text summarization

A. Paragraph Summarizer

In paragraph summarization, the user can generate the summary of a single-paragraph as well as a multi-paragraph text. This helps the user read the summary and quickly grasp the key points. In most of the earlier studies only a single-paragraph summarization interface was available to the user, whereas in the present study a multi-paragraph summary interface has been provided, which facilitates multi-paragraph summarization.

B. Document Summarizer

Existing document summarizers generally restrict the input document to a single format. In this study, the user can directly upload files in different formats such as .docx, .pdf, and .txt, and can specify the number of sentences, n, required in the summary. The summary is generated using the Text Rank algorithm and the top n highest-ranked sentences are displayed. Furthermore, the module provides an email facility through which the user can send the summary by mail; this email feature also applies to the URL summarizer and addresses the problem of keeping a backup of the summary. A sketch of how the different file formats can be read is given below.

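The paper does not name the libraries used for file handling; the following is one plausible way to load the three supported formats, assuming the python-docx and pypdf packages (the function name read_document is hypothetical).

# One plausible way to read .txt, .docx, and .pdf input (an assumption, not the authors' code).
from pathlib import Path
from docx import Document      # python-docx
from pypdf import PdfReader    # pypdf

def read_document(path: str) -> str:
    """Return the plain text of a .txt, .docx, or .pdf file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    raise ValueError(f"Unsupported file format: {suffix}")
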
C. URL Summarizer

The URL summarizer module provides the facility of directly pasting the desired URL into the given text box; the information is then fetched from the URL automatically and a summary of n sentences is displayed. Here, the Sumy Python library has been used for extracting text from HTML pages or plain text. The Sumy library helps to skip the images, videos, etc. available on the web pages and extracts only the text present on the respective page.

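For reference, the usual way Sumy is driven from Python looks like the sketch below; the exact summarizer class and tokenizer language used in the paper's implementation are assumptions.

# Fetching a page and summarizing it with Sumy (a sketch; configuration details are assumptions).
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def summarize_url(url: str, n: int = 5) -> str:
    """Return an n-sentence summary of the main text of the page at `url`."""
    parser = HtmlParser.from_url(url, Tokenizer("english"))  # strips markup, keeps text
    summarizer = TextRankSummarizer()
    return " ".join(str(sentence) for sentence in summarizer(parser.document, n))

# Example: print(summarize_url("https://en.wikipedia.org/wiki/Automatic_summarization", 3))
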
D. Web Scraper

The web scraper is the module wherein the summary is generated using a web-scraping tool. Web scraping is a means of automatically extracting information, in the form of text, from a given URL. In the showcase model, after the summary of all six technologies has been generated, multi-document text summarization is implemented by giving the input texts of all six technologies as discussed above; the data are appended one after the other and then serve as the input text to the Text Rank Algorithm.

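The specific scraping tool is not named in the paper; a minimal sketch of the idea, assuming the requests and beautifulsoup4 packages, is:

# Minimal web-scraping sketch (assumes requests and beautifulsoup4; not the authors' tool).
from typing import List
import requests
from bs4 import BeautifulSoup

def scrape_text(urls: List[str]) -> str:
    """Fetch each page, keep only its visible text, and append the results."""
    parts = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):  # drop non-content elements
            tag.decompose()
        parts.append(soup.get_text(separator=" ", strip=True))
    return "\n".join(parts)  # the combined text can then be fed to the Text Rank pipeline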

E. Word Dictionary

Word definitions play a very important role in text summarization. As a summary is much shorter than its original text, it is important to know the meaning of every word that occurs in the summary. Here, a dictionary feature is provided within the application itself for a better understanding of the summary and to get the best out of it.

The Text Rank algorithm receives the input text and ranks the sentences of the complete text. Figure 4 illustrates the overall text summarization procedure using the Text Rank Algorithm.

Fig. 4. Text summarization process

• Preprocessing: This is the first step in the text summarization process. Cleaning the data from the document is necessary, since the input data contains a large amount of information alongside the useful or required information [12]. It is also important to format the data properly in order to achieve better outcomes from the proposed technique. The preprocessing procedure carried out in this study is shown in Fig. 5; a small sketch follows the figure caption below.

Fig. 5. Preprocessing flow diagram [3]

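The sketch below shows typical NLTK-based preprocessing (sentence splitting, lowercasing, stop-word and punctuation removal); it is an approximation of the flow in Fig. 5 [3], not a copy of it, and the helper name preprocess is assumed.

# Typical NLTK preprocessing for extractive summarization (an approximation of Fig. 5).
# Assumes the NLTK 'punkt' and 'stopwords' data have been downloaded.
import nltk
from nltk.corpus import stopwords

def preprocess(text: str):
    """Split the text into sentences and return cleaned token lists for each sentence."""
    stop_words = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(text)
    cleaned = []
    for sentence in sentences:
        tokens = [w.lower() for w in nltk.word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in stop_words]
        cleaned.append(tokens)
    return sentences, cleaned  # original sentences for output, cleaned tokens for scoring
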
• Word Embeddings: Words are usually represented as real-valued vectors, referred to as word embeddings. The vectors for the input sentences are calculated using a word embedding technique. The most significant use of word embeddings is in encoding a word's meaning and predicting words that lie close to each other in the vector space and may therefore have similar meanings [3]. In this study, the GloVe dataset available on Kaggle has been used for the vector representation of words [11]. The use of global statistics to obtain the word vectors is a key feature of this dataset; individual words are stored in the vector space as real-valued vectors.

• Similarity Matrix: The similarity matrix is formed from these vectors: the similarity between every pair of sentences is calculated using the cosine similarity approach, and the values of the matrix are filled with the cosine similarity scores [3][12]. Here, the similarity between two sentence vectors x and y is calculated using the following relation:

Cos(x, y) = (x · y) / (||x|| * ||y||)    (1)

• Formation of Graph: Once the similarity matrix is derived, it is converted into a graph. A graph is defined as a collection of nodes (vertices) and identifiable pairs of nodes known as edges or links. The nodes of this graph represent the sentences, and its edges show how similar the sentences are to one another [3].

• Sentence Ranking: In order to rank the sentences, the graph is provided as input to the PageRank algorithm. Although it is not the only algorithm employed by Google to order search-engine results, it was the first and is the most well known [18]; it is employed to determine a web page's weight. In order to choose the appropriate sentences from the list of tokenized sentences, the scores are listed in descending order with their key value used as an index. A sketch that ties these steps together is given after this list.

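A compact end-to-end sketch of the steps above, under stated assumptions (GloVe 100-dimensional vectors in a local text file, networkx for PageRank; the file path and helper names are illustrative, not taken from the paper). It can be fed from the preprocessing sketch shown earlier.

# End-to-end sketch of the ranking pipeline described above (illustrative, with assumed names).
import numpy as np
import networkx as nx

def load_glove(path="glove.6B.100d.txt"):
    """Load GloVe word vectors from a whitespace-separated text file."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def rank_sentences(sentences, cleaned_tokens, glove, dim=100, top_n=5):
    """Embed sentences, build the cosine-similarity graph, and rank it with PageRank."""
    # Sentence vector = average of the GloVe vectors of its (cleaned) words.
    vecs = [np.mean([glove[w] for w in toks if w in glove] or [np.zeros(dim)], axis=0)
            for toks in cleaned_tokens]
    # Similarity matrix using Eq. (1).
    sim = np.zeros((len(vecs), len(vecs)))
    for i in range(len(vecs)):
        for j in range(len(vecs)):
            if i != j and np.linalg.norm(vecs[i]) and np.linalg.norm(vecs[j]):
                sim[i, j] = vecs[i] @ vecs[j] / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
    # Graph whose weighted edges are the similarities, ranked with PageRank.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    best = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(best))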

IV. RESULT AND DISCUSSION

In the field of text summarization, verifying accuracy has been a major challenge. Techniques studied earlier include comparing the size of the input text with the size of the output summary, manual checks, or comparison against an expert-generated summary, which is very subjective in nature. For the proposed text summarization, the model is tested on various documents having different contents and sizes; the number of lines in the input documents varies from 20 to 100, in order to test the accuracy of the proposed module on small as well as large inputs.

The obtained text summary is compared with the summary generated by a standard tool named SMMRY, an online tool used to summarize articles and text. Using an online software program that makes use of cosine similarity to determine the degree of similarity between two documents, the similarity between the summary generated by the proposed system and the SMMRY-generated summary is computed after both have been generated; cosine similarity is the most effective similarity measure for text summaries. The similarity between the system-generated and SMMRY-generated summaries is shown in Table I, and for better visualization a graph of the similarity percentage against the input text number is plotted in Fig. 6.

TABLE I. AVERAGE SIMILARITY

Input Text   Similarity (%)    Input Text   Similarity (%)
 1           95.1              26           86.6
 2           86.7              27           90.2
 3           84.3              28           91.7
 4           95.8              29           90.0
 5           92.9              30           88.9
 6           91.8              31           92.6
 7           84.3              32           94.7
 8           93.0              33           86.5
 9           92.5              34           92.8
10           89.4              35           96.9
11           87.1              36           86.7
12           88.2              37           91.7
13           88.3              38           90.4
14           93.5              39           84.1
15           89.0              40           87.3
16           86.1              41           92.4
17           89.6              42           88.1
18           96.7              43           84.9
19           89.4              44           95.5
20           85.4              45           89.2
21           90.0              46           90.8
22           89.0              47           86.3
23           87.1              48           90.1
24           89.3              49           96.0
25           91.2              50           92.6

Fig. 6. Line graph representing the similarity values

Once the similarity of each of the 50 sample input texts has been calculated, the next task is to calculate the average over these sample documents. This step helps to derive the approximate accuracy of the proposed system; the results show that the accuracy of the proposed technique is around 90.00%.

Based on the ranking of the sentences, the present model produces a summary of n sentences. The user may choose n, the flexible number of sentences in the generated summary, taking into account the significance and length of the input content. Here, extractive text summarization is implemented using Python 3.7 and NLTK.

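The paper relies on an unnamed online cosine-similarity tool for this comparison; a rough offline stand-in, assuming scikit-learn, could look like the following (it is an approximation of the evaluation, not the tool actually used).

# Rough offline stand-in for the online similarity check used in the paper (an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summary_similarity(system_summary: str, reference_summary: str) -> float:
    """Cosine similarity (in percent) between two summaries of the same source text."""
    tfidf = TfidfVectorizer().fit_transform([system_summary, reference_summary])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0]) * 100.0

# Averaging summary_similarity(...) over the 50 test documents yields the kind of
# figure reported in Table I (around 90% in the authors' experiments).
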
Further, the user interface (UI) is an important aspect that facilitates the user; a simple and effective UI encourages the user to use the proposed system. A snapshot of the UI is shown in Fig. 7.

Fig. 7. Snapshot of the UI: document summarization

The proposed system is further integrated to generate summaries from 1) a given URL (URL summarization) and 2) the web scraper, which is the multi-document text summarization module.

1. URL Summarization: The URL summarizer module provides the facility of directly pasting the desired URL into the given text box; the information is automatically fetched from the URL and a summary of n sentences is displayed. The Sumy Python library is used for extracting text from HTML pages or plain text, skipping the images, videos, etc. available on the web pages. In this module the user supplies the URL along with n, the number of sentences required in the generated summary, and the summary is generated using the proposed text summarization model.

2. Web Scraper: The web scraper is the module wherein the summary is generated using web scraping. The web scraper model provides a user interface where web scraping is used to extract information from web pages; the extracted information is given to the proposed text summarization algorithm as input data, and the summary is then generated.

The proposed text summarization algorithm is thus also found effective for URL summarization and, via web scraping, for multi-document summarization.

V. CONCLUSION AND FUTURE SCOPE

The main objective of this study is to develop a technique for obtaining the summary of an input text, which is useful in various applications such as understanding very long technical and non-technical articles. Current research focuses on improving the accuracy of the summarized text using modern tools and techniques; here, the similarity between the documents has been calculated using an automated tool. The proposed text summarization technique using the Text Rank algorithm generates a summary of any input text and comprises five modules: paragraph, document, URL, and web-scraper summarization, together with the word dictionary. All five services are available to the user in one application with an improved user interface. The experimental results show that the accuracy of the proposed technique is around 90.00%.

In future, the accuracy of the proposed system may be tested on different kinds of articles such as news and blogs on social media. The accuracy of the technique may be further improved by integrating machine learning approaches.

REFERENCES

[1] Madhuri, J.N. and Kumar, R.G., 2019, March. Extractive text summarization using sentence ranking. In 2019 International Conference on Data Science and Communication (IconDSC) (pp. 1-3). IEEE.
[2] Rahimi, S.R., Mozhdehi, A.T. and Abdolahi, M., 2017, December. An overview on extractive text summarization. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI) (pp. 0054-0062). IEEE.
[3] Tanwi, Ghosh, S., Kumar, V., Jain, Y.S. and Avinash, 2019, April. Automatic text summarization using Text Rank. International Research Journal of Engineering and Technology (IRJET).
[4] Gunawan, D., Harahap, S.H. and Rahmat, R.F., 2019, November. Multi-document summarization by using TextRank and maximal marginal relevance for text in Bahasa Indonesia. In 2019 International Conference on ICT for Smart Society (ICISS) (Vol. 7, pp. 1-5). IEEE.
[5] Jain, A., 2019, April. Automatic extractive text summarization using TF-IDF.
[6] Janjanam, P. and Reddy, C.P., 2019, February. Text summarization: an essential study. In 2019 International Conference on Computational Intelligence in Data Science (ICCIDS) (pp. 1-6). IEEE.
[7] Zaware, S., Patadiya, D., Gaikwad, A., Gulhane, S. and Thakare, A., 2021, June. Text summarization using TF-IDF and TextRank algorithm. In 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 1399-1407). IEEE.
[8] Rahimi, S.R., Mozhdehi, A.T. and Abdolahi, M., 2017, December. An overview on extractive text summarization. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI) (pp. 0054-0062). IEEE.
[9] Fakhrezi, M.F., Bijaksana, M.A. and Huda, A.F., 2021. Implementation of automatic text summarization with the Text Rank method in the development of an Al-Qur'an vocabulary encyclopedia. Procedia Computer Science, 179, pp. 391-398.
[10] Kulkarni, A.R. and Apte, M.S., 2002. An automatic text summarization using feature terms for relevance measure. IOSR Journal of Computer Engineering, 9, pp. 62-66.
[11] Saggion, H. and Poibeau, T., 2013. Automatic text summarization: past, present and future. In Multi-source, Multilingual Information Extraction and Summarization (pp. 3-21). Springer, Berlin, Heidelberg.
[12] Patil, A.P., Dalmia, S., Ansari, S.A.A., Aul, T. and Bhatnagar, V., 2014. Automatic text summarizer. In International Conference on Advances in Computing, Communications and Informatics. IEEE.

[13] Jayashree, R. and Shreekantha Murthy, K., 2012. Categorized text document summarization in the Kannada language by sentence ranking. In 12th International Conference on Intelligent Systems Design and Applications (ISDA). IEEE.
[14] Sethi, P., Sonawane, S., Khanwalker, S. and Keskar, R.B., 2017, December. Automatic text summarization of news articles. In International Conference on Big Data, IoT and Data Science (BID), Vishwakarma Institute of Technology, Pune. IEEE.
[15] Shah, P. and Desai, N.P., 2016. A survey of automatic text summarization techniques for Indian and foreign languages. In International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT).
[16] Gunawan, D. and Amalia, A., 2018, October. Review of the recent research on automatic text summarization in Bahasa Indonesia. In 2018 Third International Conference on Informatics and Computing (ICIC) (pp. 1-6).
[17] Day, M.-Y. and Chen, C.Y., 2018. Artificial intelligence for automatic text summarization. In International Conference on Information Reuse and Integration for Data Science. IEEE.
[18] Shah, P. and Desai, N.P., 2016. A survey of automatic text summarization techniques for Indian and foreign languages. In International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT).
[19] Online resource, multi-document summarization. Available at: https://paperswithcode.com/task/multi-document-summarization
[20] Online resource, dataset glove.6B.100d, downloaded from Kaggle. Available at: https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation
[21] Online resource, Page Rank Algorithm. Available at: https://www.geeksforgeeks.org/page-rank-algorithm-implementation/
[22] Online resource, latent semantic indexing. Available at: https://en.wikipedia.org/wiki/Latent_semantic_analysis
[23] Online resource, Page Rank Algorithm. Available at: https://devopedia.org/text-summarization
[24] Srinath, K.R., 2017. Page ranking algorithms: a comparison. International Research Journal of Engineering and Technology (IRJET).

