Automatic Text Summarization using Text Rank Algorithm
Vivek N. Waghmare
Department of Computer Engineering
PVG’s College of Engineering & S.S.
Dhamankar Institute of Management
Nashik
[email protected]
Abstract—Automatic text summarization has emerged as a valuable tool for quickly locating the significant information in vast text with minimal effort. The practice of constructing a summarized form of a text document that retains the significant information as well as the overall meaning of the source text is known as text summarization. This study mainly concentrates on the extractive text summarization technique, wherein summarization is carried out on single-document as well as multi-document text. Further, the application is extended by implementing document summarization and URL summarization. Here, sentence extraction from the input text forms the basis of the proposed text summarization technique. During sentence extraction, weights are assigned to sentences, which act as the rank of these sentences, using the page rank algorithm. In this study, the extraction technique has been implemented to extract sentences having a higher rank from the given input text. The study mainly focuses on obtaining high-rank sentences from the given document in order to generate a high-quality summary of the input text.

I. INTRODUCTION

In the present scenario, a large amount of data is being generated on the internet every day. As a result, a better mechanism is required for extracting important information quickly and effectively, which has been a major challenge. Text summarization is one of the strategies for finding the most significant and meaningful information in a document or set of linked documents and summarizing it into a shorter version while preserving the overall meaning [1][6]. Text summarization reduces the amount of time needed to read a lengthy document and takes care of the space problems that are usually encountered with big amounts of data. Extractive text summarization and abstractive text summarization are the two main categories of text summarization techniques [1]. Automatic text summarization using the extractive technique is a challenge that is typically broken down into two sub-problems: single-document and multi-document text summarization. Typically, one document is used as the input for single-document text summarization. Further, by applying a suitable technique, summary data is generated from the given input while holding its overall meaning. However, in the case of multi-document text summarization, multiple documents related to the same subject are given as input and a related summary is generated [9].

Various algorithms have been discussed in the literature for text summarization. Some of the challenges in text summarization are word embedding, extraction of key sentences, etc. In order to address these challenges, in this study the Text Rank Algorithm has been proposed for automatic text summarization, which is an extractive text summarization approach. The Text Rank Algorithm uses a graph-based approach for sentence ranking. Here, after preprocessing, each sentence is converted into a vertex as shown in Fig. 1, where the vertices 1-7 represent the sentences and the edges, lines connecting two vertices, represent the similarity weight of these sentences [2].

Fig. 1. Sample graph representing similarity [8]

II. LITERATURE SURVEY

The text summarization techniques used in earlier studies are described in this section. One of the most significant areas of application for natural language processing is text summarization. There are two categories of text summarization techniques: abstractive and extractive summarization [1][3]. Extractive text summarization is the process of extracting pertinent sentences from the supplied text.
accuracy. The overall methodology of the proposed system is shown in Fig. 3.

Authentication is the process of verifying the identity of a user. Here, the user can see all the available services after logging in to the system. In order to restrict unauthenticated access, every user must register with the system.

The proposed methodology works in five different modules:
• Paragraph Summarizer
• Document Summarizer
• URL Summarizer
• Web Scraper
• Word Dictionary

The Text Rank algorithm receives the input text and ranks the sentences in the complete text. Figure 4 illustrates the overall text summarization procedure using the Text Rank Algorithm.
A. Paragraph Summarizer
In paragraph summarization, the user can generate the summary of a single-paragraph as well as a multi-paragraph text. This helps the user to read the summary and quickly grasp the key knowledge from it. In most of the earlier studies, only a single-paragraph summarization interface was available to the user, whereas in the present study a multi-paragraph summary interface has been provided which facilitates multi-paragraph summarization.
B. Document Summarizer
In a document summarizer, the format of the document is generally restricted to one particular format. In this study, the user can directly upload files in different formats such as .docx, .pdf and .txt. The user can also specify the number of sentences, 'n', required in the summary. The summary is generated by using the Text Rank algorithm, and the top 'n' high-ranked sentences are displayed in the summary. Furthermore, an email facility is provided through which the user can send the summary via mail. This email feature is also available for the URL summarizer and addresses the challenge of keeping a backup of the summary.
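A minimal sketch of the multi-format upload step is given below. The paper does not name the parsing libraries, so python-docx and PyPDF2 are assumptions here; only the .docx/.pdf/.txt handling described above is illustrated.

import os
from docx import Document          # python-docx, assumed parser for .docx files
from PyPDF2 import PdfReader       # PyPDF2, assumed parser for .pdf files

def load_document_text(path):
    """Return the plain text of an uploaded .docx, .pdf or .txt file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if ext == ".txt":
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    raise ValueError("Unsupported format: " + ext)

# The returned text is then passed to the Text Rank pipeline together with the
# user-chosen number of sentences n.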
C. URL Summarizer
The URL summarizer module provides the facility of directly copying the desired URL into the given text box; the information is automatically fetched from the URL and a summary of 'n' sentences is displayed. Here, the Sumy Python library has been used for extracting summary text from HTML pages or plain texts. The Sumy library helps to skip the images, videos, etc. available on the web pages and extracts only the text present on the respective page.
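A minimal sketch of the Sumy-based URL summarization follows. The paper states that Sumy is used to pull the text out of the HTML page; the choice of Sumy's TextRankSummarizer and of the English tokenizer below are assumptions.

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def summarize_url(url, n):
    """Fetch a web page, keep only its text and return an n-sentence extract."""
    parser = HtmlParser.from_url(url, Tokenizer("english"))  # images/videos are ignored
    summarizer = TextRankSummarizer()                         # assumed summarizer choice
    return [str(sentence) for sentence in summarizer(parser.document, n)]

# Example: print("\n".join(summarize_url("https://example.com/article", 5)))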
D. Web Scraper

The Web Scraper is the module wherein the summary is generated by using a web scraping tool. Web scraping is a means of extracting information, in the form of text, from a given URL automatically. In the showcase model, after the summary of all six technologies is generated, multi-document text summarization is implemented by giving the input texts of all six technologies as discussed above; the data are appended one after the other and then serve as the input text to the Text Rank Algorithm.
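The scraping and appending step can be sketched as follows; requests and BeautifulSoup are assumptions, since the paper does not name the web scraping tool it uses.

import requests
from bs4 import BeautifulSoup

def scrape_text(url):
    """Download one page and keep only its visible text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                      # drop non-content elements
    return soup.get_text(separator=" ", strip=True)

def build_multi_document_input(urls):
    """Append the scraped texts one after the other, as described above."""
    return "\n".join(scrape_text(url) for url in urls)

# The concatenated text is then given to the Text Rank Algorithm as a single input.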
E. Word Dictionary

The meaning of words plays a very important role in text summarization. As the summary is much shorter than its original text, it is important to know the meaning of every word occurring in the summary. Here, a dictionary feature is provided in the application itself for a better understanding of the summary and to get the best out of it.
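As a hypothetical illustration of the dictionary feature, the lookup below uses WordNet through NLTK; the paper does not say which dictionary source backs the module.

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def define(word):
    """Return the first WordNet gloss for a word from the summary, if any."""
    synsets = wordnet.synsets(word)
    return synsets[0].definition() if synsets else "No definition found for '%s'" % word

# Example: print(define("document"))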
Fig. 4. Text summarization process

• Preprocessing: This is the first step in the text summarization process. Cleaning the data from the document is necessary, since the input data consists of vast information along with the useful or required information [12]. Further, it is also important to format the data properly in order to achieve better outcomes from the proposed technique. The preprocessing procedure carried out in this study is shown in Fig. 5.

Fig. 5. Pre Processing Flow Diagram [3]
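A minimal sketch of this cleaning step with NLTK is shown below; the exact flow used in the paper is the one in Fig. 5, so the particular steps here (sentence splitting, lowercasing, removal of non-alphabetic characters and stop words) are assumptions about typical practice.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Split the input into sentences and return cleaned token lists per sentence."""
    sentences = sent_tokenize(text)
    cleaned = []
    for sentence in sentences:
        tokens = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
        cleaned.append([t for t in tokens if t not in STOP_WORDS])
    return sentences, cleaned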
• Word Embeddings: The representation of words is usually in the form of a real-valued vector, which is referred to as a word embedding. The vectors for the input sentences are calculated by using the word embedding technique. The most significant use of word embeddings is in encoding a word's meaning and predicting words that are next to each other in the vector space and may have similar meanings [3]. In this study, the GloVe data set available at Kaggle has been used for the vector representation of words [11]. The use of global statistics to obtain the word vectors is a key component of the dataset. Individual words are stored in a vector space as real-valued vectors.
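A sketch of the sentence vectorisation with the glove.6B.100d file [20] follows; averaging the word vectors of a sentence is an assumption, as the paper does not spell out the exact composition rule.

import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Load the GloVe word vectors from the plain-text embedding file."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def sentence_vector(tokens, embeddings, dim=100):
    """Average the GloVe vectors of one cleaned sentence (zero vector if no hit)."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)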
• Similarity Matrix: The similarity matrix is formed from these vectors; the similarities between the sentences are calculated by applying the cosine similarity approach. In order to determine how similar two sentences are, cosine similarity has been used, wherein the values of the matrix are filled using cosine similarity scores [3][12]. Here, the similarity between two sentences x and y is calculated using the following relation:

cos(x, y) = (x · y) / (||x|| ||y||)    (1)
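The similarity matrix of Eq. (1) can be sketched as follows; keeping the diagonal at zero (a sentence is not compared with itself) is an assumption.

import numpy as np

def cosine_similarity(x, y):
    """Eq. (1): cos(x, y) = (x . y) / (||x|| ||y||)."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else 0.0

def build_similarity_matrix(sentence_vectors):
    """Fill an N x N matrix with pairwise cosine similarity scores."""
    n = len(sentence_vectors)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i, j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])
    return matrix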
• Formation of Graph: Once the similarity matrix is derived, it is further converted into a graph. A graph is defined as a collection of nodes, i.e. vertices, and identifiable pairs of nodes, known as edges or links. The nodes of this graph represent the sentences, and its edges show how similar the sentences are to one another [3].
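A sketch of graph formation and sentence ranking is given below: the similarity matrix becomes a weighted graph whose vertices are sentences, and the page rank algorithm [21] scores the vertices. The use of networkx (with its default damping factor) is an assumption; the paper does not name a graph library.

import networkx as nx
import numpy as np

def rank_sentences(similarity_matrix, sentences, n):
    """Score the sentence vertices with PageRank and return the top-n sentences."""
    graph = nx.from_numpy_array(np.asarray(similarity_matrix))  # weighted, undirected graph
    scores = nx.pagerank(graph, weight="weight")                 # rank of each sentence vertex
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]                   # keep original document order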
IV. RESULT AND DISCUSSION
In the field of text summarization, verifying accuracy has been a major challenge. Techniques such as comparing the size of the input text with that of the output summary, or a manual check against an expert-generated summary, which is very subjective in nature, have been studied earlier. In the proposed text summarization, the model is tested on various documents having different contents. All these texts differ in size; the number of lines in the input documents varies from 20 to 100 lines in order to test the accuracy of the proposed module on small as well as large input documents.

The obtained text summary is compared with the summary generated by the standard tool named SMMRY. SMMRY is an online tool used to summarize articles and text. Using an online software program which makes use of cosine similarity to determine the degree of similarity between two documents, the similarity between the summary generated by the proposed system and the SMMRY-generated summary is determined after both summaries have been generated. Cosine similarity is the most effective similarity measurement technique for text summaries. The similarity between the system-generated and SMMRY-generated summaries is shown in Table I. Further, for better visualization, a graph of the input-text number against the similarity percentage is plotted in Fig. 6.

Once the similarity of the 50 sample input texts has been calculated, the next task is to calculate the average over these sample documents. This step helps to derive the approximate accuracy of the proposed system. The results show that the accuracy of the proposed technique is around 90.00%.

Based on the ranking of the sentences, the present model produces a summary of n sentences. The user may choose n, the number of sentences in the generated summary, taking into account the significance and length of the input content. Here, extractive text summarization is implemented using Python 3.7 and NLTK.

TABLE I. AVERAGE SIMILARITY

Input Text | Similarity (%) | Input Text | Similarity (%)
1  | 95.1 | 26 | 86.6
2  | 86.7 | 27 | 90.2
3  | 84.3 | 28 | 91.7
4  | 95.8 | 29 | 90.0
5  | 92.9 | 30 | 88.9
6  | 91.8 | 31 | 92.6
7  | 84.3 | 32 | 94.7
…  | …    | …  | …
15 | 89.0 | 40 | 87.3
16 | 86.1 | 41 | 92.4
17 | 89.6 | 42 | 88.1
18 | 96.7 | 43 | 84.9
19 | 89.4 | 44 | 95.5
20 | 85.4 | 45 | 89.2
21 | 90.0 | 46 | 90.8
22 | 89.0 | 47 | 86.3
23 | 87.1 | 48 | 90.1
24 | 89.3 | 49 | 96.0
25 | 91.2 | 50 | 92.6
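The comparison against SMMRY can be sketched as below; the paper relies on an unnamed online cosine-similarity tool, so reproducing the measurement with scikit-learn term-count vectors is an assumption.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summary_similarity(system_summary, smmry_summary):
    """Cosine similarity between two summaries, expressed as a percentage."""
    vectors = CountVectorizer().fit_transform([system_summary, smmry_summary])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0]) * 100.0

def average_similarity(summary_pairs):
    """Average the per-document similarities, as done over the 50 samples of Table I."""
    return sum(summary_similarity(a, b) for a, b in summary_pairs) / len(summary_pairs)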
Further, the User Interface (UI) is an important aspect which facilitates the user. A simple and effective UI encourages the user to use the proposed system. A snapshot of the UI is shown in Fig. 7.
Fig. 6. Line graph representing the similarity values

Fig. 7. Snapshot of UI - Document Summarization

The proposed system is further integrated to generate a summary from 1) a given URL (URL summarization) and 2) the Web Scraper, which is the multi-document text summarization module.

1. URL Summarization: The URL summarizer module provides the facility of directly copying the desired URL into the given text box; the information is automatically fetched from the URL and a summary of 'n' sentences is displayed. Here, the Sumy Python library has been used for extracting summary text from HTML pages or plain texts. The Sumy library helps to skip the images, videos, etc. available on the web pages and extracts only the text. In this module, the user provides the URL along with the number of sentences 'n' required in the generated summary, and the summary is generated using the proposed text summarization model.

2. Web Scraper: The Web Scraper is the module wherein the summary is generated by using web scraping. The Web Scraper model provides the user interface where web scraping is used for information extraction from the web pages; the extracted information is given to the proposed text summarization algorithm as input data and the summary is then generated.

The proposed text summarization algorithm is thus also found effective for URL summarization and, using web scraping, for multi-document summarization.

V. CONCLUSION AND FUTURE SCOPE

The main objective of this study is to develop a technique for obtaining the summary of an input text, which is useful in various applications such as understanding very long technical and non-technical articles. Current research focuses on improving the accuracy of the summarized text using modern tools and techniques. Here, by using the automated tool, the similarity between the documents has been calculated. The proposed text summarization technique using the Text Rank algorithm generates a summary of any input text and comprises five different modules: paragraph, document and URL summarization, the web scraper and the word dictionary. All five services are available to the user in one application providing a better user interface. The experimental results show that the accuracy of the proposed technique is around 90.00%.

In future, the accuracy of the proposed system may be tested by considering different articles such as news, blogs on social media, etc. The accuracy of the technique may further be improved by integrating machine learning approaches.

REFERENCES

[1] Madhuri, J.N. and Kumar, R.G., 2019, March. Extractive text summarization using sentence ranking. In 2019 International Conference on Data Science and Communication (IconDSC) (pp. 1-3). IEEE.
[2] Rahimi, S.R., Mozhdehi, A.T. and Abdolahi, M., 2017, December. An overview on extractive text summarization. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI) (pp. 0054-0062). IEEE.
[3] Tanwi, Satanik Ghosh, Viplav Kumar, Yashika S. Jain and Avinash, 2019, April. Automatic text summarization using Text Rank. International Research Journal of Engineering and Technology (IRJET).
[4] Gunawan, D., Harahap, S.H. and Rahmat, R.F., 2019, November. Multi-document summarization by using textrank and maximal marginal relevance for text in bahasa indonesia. In 2019 International Conference on ICT for Smart Society (ICISS) (Vol. 7, pp. 1-5). IEEE.
[5] Ashna Jain, 2019, April. Automatic extractive text summarization using TF-IDF.
[6] Janjanam, P. and Reddy, C.P., 2019, February. Text summarization: an essential study. In 2019 International Conference on Computational Intelligence in Data Science (ICCIDS) (pp. 1-6). IEEE.
[7] Zaware, S., Patadiya, D., Gaikwad, A., Gulhane, S. and Thakare, A., 2021, June. Text summarization using tf-idf and textrank algorithm. In 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 1399-1407). IEEE.
[8] Rahimi, S.R., Mozhdehi, A.T. and Abdolahi, M., 2017, December. An overview on extractive text summarization. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI) (pp. 0054-0062). IEEE.
[9] Fakhrezi, M.F., Bijaksana, M.A. and Huda, A.F., 2021. Implementation of automatic text summarization with Text Rank method in the development of Al-Qur'an vocabulary encyclopedia. Procedia Computer Science, 179, pp. 391-398.
[10] Kulkarni, A.R. and Apte, M.S., 2002. An automatic text summarization using feature terms for relevance measure. IOSR J. Comput. Eng., 9, pp. 62-66.
[11] Saggion, H. and Poibeau, T., 2013. Automatic text summarization: past, present and future. In Multi-source, Multilingual Information Extraction and Summarization (pp. 3-21). Springer, Berlin, Heidelberg.
[12] Annapurna P. Patil, Shivam Dalmia, Syed Abu Ayub Ansari, Tanay Aul and Varun Bhatnagar, "Automatic Text Summarizer", International Conference on Advances in Computing, Communications and Informatics, IEEE (2014).
[13] Jayashree R. and Shreekantha Murthy K., "Categorized Text Document Summarization in the Kannada Language by Sentence Ranking", 12th International Conference on Intelligent Systems Design and Applications (ISDA), IEEE (2012).
[14] Prakhar Sethi, Sameer Sonawane, Saumitra Khanwalker and R. B. Keskar, "Automatic Text Summarization of News Articles", International Conference on Big Data, IoT and Data Science (BID), Vishwakarma Institute of Technology, Pune, Dec 20-22, IEEE (2017).
[15] Prachi Shah and Nikitha P. Desai, "A Survey of Automatic Text Summarization Techniques for Indian and Foreign Languages", International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (2016).
[16] D. Gunawan and A. Amalia, "Review of the recent research on automatic text summarization in bahasa indonesia", in 2018 Third International Conference on Informatics and Computing (ICIC), Oct 2018, pp. 1-6.
[17] Min-Yuh Day and Chao Yu Chen, "Artificial Intelligence for Automatic Text Summarization", International Conference on Information Reuse and Integration for Data Science, IEEE (2018).
[18] Prachi Shah and Nikhita P. Desai, "A Survey of Automatic Text Summarization Techniques for Indian and Foreign Languages", International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (2016).
[19] Online resource, Multi-document summarization. Available at: https://paperswithcode.com/task/multi-document-summarization
[20] Online resource, Dataset: glove.6B.100d, downloaded from Kaggle. Available at: https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation
[21] Online resource, Page Rank Algorithm. Available at: https://www.geeksforgeeks.org/page-rank-algorithm-implementation/
[22] Online resource, Latent semantic indexing. Available at: https://en.wikipedia.org/wiki/Latent_semantic_analysis
[23] Online resource, Text summarization. Available at: https://devopedia.org/text-summarization
[24] Srinath, K.R., 2017. Page Ranking Algorithms - A Comparison. International Research Journal of Engineering and Technology (IRJET).