Keyphrase Extraction From Document Using Rake and Textrank Algorithms
ISSN 2320–088X
I. INTRODUCTION
Keyphrase extraction is a fundamental task in natural language processing that maps a document to a set of representative phrases [1], [2]. It provides a concise understanding of a text and helps the reader grasp its central theme [3], [4], so a large amount of reading time can be avoided and information can be extracted more efficiently than with traditional extraction techniques.
Today, with the vast amount of textual information on the internet, keyphrase generation has assumed much wider application and importance [5]. With the growing abundance of online resource materials, information retrieval calls for automatic tagging of a text or document so that relevant information can be extracted for a particular user query. Without any doubt, manually tagging or summarizing such texts would be a herculean task, which calls for automation in this field to reduce time and effort and, of course, to cope with the unprecedented volume of information exchanged today. The rise of 'Big Data' analysis will also play a prominent role in phrase extraction.
Any keyphrase model aims to generate words and phrases that summarize the given text. The remainder of this paper is organized as follows: Section 1 is the introduction, Section 2 covers background work, Section 3 discusses various approaches to phrase detection, Section 4 is divided into two subsections explaining the Rapid Automatic Keyphrase Extraction (RAKE) and TextRank algorithms, Section 5 presents the performance analysis, and Section 6 concludes the paper.
NLTK-POS Tagging
NLTK POS tagging is a supervised learning solution that uses features such as the previous word, the next word, and whether the first letter is capitalized. NLTK provides a function to obtain POS tags, which is applied after the tokenization process [15]. The dataset has to be pre-processed before tagging. The following are the steps to implement POS tagging.
Parsing of Text/ Sentence Segmentation:
Text parsing is a common programming task that splits a given sequence of characters or values (text) into smaller parts based on a set of rules.
Storing the segmented words/Sentence in List:
The segmented words are then stored in a list. The sequence is further analyzed, tokenized, and its grammar determined.
Tokenization:
"Tokens" are usually individual words, and "tokenization" is the process of breaking a text or set of texts up into its individual words. These tokens are then used as input for other types of analysis or tasks, such as parsing (automatically tagging the syntactic relationships between words).
Part-of-Speech (POS) Tagging:
A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token), although computational applications generally use more fine-grained POS tags such as 'noun-plural'.
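The segmentation, tokenization, and tagging steps above can be sketched as follows. This is a toy, self-contained illustration: the regular expressions and the tiny rule-based tagger (standing in for NLTK's `pos_tag`, which uses features such as the previous word, the next word, and capitalization) are illustrative assumptions, not NLTK's actual behavior.

```python
import re

def segment_sentences(text):
    # Steps 1-2: split the text into sentences and store them in a list
    # (nltk.sent_tokenize does this more robustly).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Step 3: break a sentence into word tokens (cf. nltk.word_tokenize).
    return re.findall(r"[A-Za-z']+", sentence)

def pos_tag(tokens):
    # Step 4: toy rule-based tagger; a real tagger such as nltk.pos_tag is
    # trained on labeled data and handles the full Penn Treebank tagset.
    verbs = {"is", "are", "reads", "assigns", "extracts"}
    determiners = {"the", "a", "an"}
    tags = []
    for tok in tokens:
        if tok.lower() in verbs:
            tags.append((tok, "VB"))
        elif tok.lower() in determiners:
            tags.append((tok, "DT"))
        else:
            tags.append((tok, "NN"))  # default: treat unknown words as nouns
    return tags

text = "The tagger reads text. It assigns a tag."
for sent in segment_sentences(text):
    print(pos_tag(tokenize(sent)))
```

In practice the same pipeline is three NLTK calls (`sent_tokenize`, `word_tokenize`, `pos_tag`); the sketch only makes the intermediate steps visible.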
Listing the Candidate Keyphrase:
Candidate keyphrases are listed based on the tags, and co-occurring keyphrases are identified.
Scoring the potential candidate Keyphrase:
The potential candidate keyphrases are scored, and the best-scoring keyphrases are selected. From these scores the model generates the final keyphrases.
Adjoining keywords are included if they occur more than twice in the document and score high enough; an adjoining keyword is two keyword phrases with a stop word between them [20], [21]. The top T keywords are then extracted from the content, where T is one third of the number of words in the graph. Below we visualize the text corpus created after pre-processing, to gain insight into the most frequently used words under the RAKE algorithm.
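A minimal sketch of the RAKE scoring just described: candidate phrases are split at stop words and punctuation, each word is scored by its degree divided by its frequency, and a phrase's score is the sum of its member word scores. The tiny stop-word set is an illustrative assumption; a real RAKE implementation uses a full stop-word list.

```python
import re
from collections import defaultdict

# Illustrative subset; RAKE normally uses a complete stop-word list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for"}

def rake_scores(text):
    # Split the text into candidate phrases at stop words.
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # Score each word as degree / frequency: the degree counts
    # co-occurrences of the word with every word in its phrases.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    # A phrase's score is the sum of its member word scores.
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores("Keyphrase extraction maps a document to representative phrases")
top = sorted(scores, key=scores.get, reverse=True)
```

Note how the longest candidate phrase accumulates the highest score, which is why RAKE tends to return multi-word phrases.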
B. TextRank Algorithm
In general, TextRank builds a graph of the words in a document and the relationships between them, then identifies the most important vertices of the graph (words) based on importance scores calculated recursively over the entire graph [22].
Candidates are extracted from the text via sentence and then word parsing to produce a list of words to be evaluated. The words are annotated with part-of-speech tags (noun, verb, etc.) to better differentiate their syntactic use. Each word is then added to the graph, and relationships are added between the word and the others in a sliding window around it [23]. A ranking algorithm is run on each vertex for several iterations, updating all word scores based on the scores of related words, until the scores stabilize; the original paper notes this typically takes 20-30 iterations. The words are sorted and the top N are kept (N is typically one third of the words) [24].
A post-processing step loops back through the initial candidate list, identifies words that appear next to one another, and merges the two entries from the scored results into a single multi-word entry [25]. Below we visualize the text corpus created after pre-processing, to gain insight into the most frequently used words under the TextRank algorithm.
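The graph construction and iterative ranking described above can be sketched as follows. This is a simplified illustration: candidates are not POS-filtered, and the window size, damping factor, and fixed iteration count are illustrative assumptions rather than tuned values.

```python
import re
from collections import defaultdict

def textrank(text, window=2, damping=0.85, iterations=30):
    # Candidate words; a full implementation keeps only nouns and
    # adjectives after POS tagging, here every word is kept for brevity.
    words = re.findall(r"[a-z']+", text.lower())

    # Add an edge between every pair of words inside a sliding window.
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])

    # Iteratively update each vertex score from its neighbors' scores until
    # they stabilize (the TextRank paper reports 20-30 iterations suffice).
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w]
            )
            for w in neighbors
        }

    # Keep the top N words (N is typically a third of the graph's vertices).
    n = max(1, len(scores) // 3)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Because each vertex redistributes its score over its neighbors, frequently co-occurring words accumulate the highest scores, which is the recursive notion of importance the algorithm relies on.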
V. PERFORMANCE ANALYSIS
Fig. 2 below shows a sample literature abstract extracted from arXiv NLP papers with GitHub links. This abstract was chosen randomly for keyphrase evaluation with both the RAKE and TextRank keyphrase extraction algorithms.
Finally, we apply the RAKE and TextRank algorithms to a corpus of research papers and define metrics for evaluating the exclusivity, essentiality, and generality of the extracted keyphrases, enabling a system to identify keyphrases that are essential or general to a document in the absence of manual annotations. Table 1 shows that RAKE is more computationally efficient than TextRank (Table 2) while achieving higher precision and comparable recall, which allows RAKE to be configured for specific domains and corpora. The most frequently occurring n-grams (unigrams, bi-grams, and trigrams) for the RAKE and TextRank algorithms, together with the scores of the keyphrases obtained, are shown below in Graph 1 and Graph 2.
Graph 1. Most frequently occurring unigrams, bi-grams and trigrams using the RAKE algorithm.
Graph 2. Most frequently occurring unigrams, bi-grams and trigrams using the TextRank algorithm.
We visualize the text corpus created after pre-processing to gain insight into the most frequently used words under both the RAKE and TextRank algorithms. The most important thing to notice here is that TextRank gives us mostly single-word keyphrases (only one entry has two words), while RAKE gives us multi-word phrases.
VI. CONCLUSION
The approach proposed above was implemented in Python 3.7, using the NLTK toolkit to preprocess the text. Keyphrase extraction techniques save time and resources by allowing large collections of information to be analyzed automatically within seconds. Keyphrase extraction automatically extracts and classifies information from documents, providing a robust solution that makes it possible to process text at a huge scale and obtain fast, accurate results. In this paper we implemented the Rapid Automatic Keyphrase Extraction (RAKE) and TextRank algorithms on data-driven text and analyzed their predictions and accuracy, with the resulting scores reported in Tables 1 and 2. The top keywords from the contents are displayed to the user. We infer that the RAKE algorithm gives the best results: RAKE produces a list of candidate keywords or phrases, with a score calculated for each phrase based on features of its words and the correlation among them. Adjoining keywords are included if they occur more than twice in the text, and they receive higher scores than with the TextRank algorithm.
REFERENCES
[1]. Lima Subramanian and R.S Karthik, “Keyword Extraction: A Comparative Study Using Graph Based
Model And Rake” March 2017.
[2]. Ambar Dutta, Department of Computer Science and Engineering, Birla Institute of Technology, Mesra,
Jharkhand, India, “A Novel Extension for Automatic Keyword Extraction”, Volume 6, Issue 5, May 2016.
[3]. M. Uma Maheswari, Dr. J. G. R. Sathiaseelan. “Text Mining: Survey on Techniques and Applications”,
International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064, Volume 6 Issue 6, June
2017.
[4]. Said A. Salloum, Mostafa Al-Emran, Azza Abdel Monem, and Khaled Shaalan, "Using Text Mining Techniques for Extracting Information from Research Articles", Chapter in Studies in Computational Intelligence, DOI: 10.1007/978-3-319-67056-0_18, January 2018.
[5]. Tayfun Pay, Stephen Lucci, James L. Cox, “An Ensemble of Automatic Keyword Extractors: TextRank,
RAKE and TAKE”, Computación y Sistemas, Vol. 23, No. 3, 2019.
[6]. Alzaidy. R., Caragea, C., Giles, C.L.: “Bi-LSTM-CRF sequence labeling for keyphrase extraction from
scholarly documents”. In: Proceedings of The World Wide Web Conference, pp. 2551–2557. ACM,
2019.
[7]. Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, and Roger Zimmermann, "Key2Vec: Automatic Ranked Keyphrase Extraction from Scientific Articles using Phrase Embeddings".
[8]. Howard, Jeremy, & Ruder, Sebastian, Universal language model fine-tuning for text classification. arXiv
preprint arXiv:1801.06146, 2018.
[9]. Sifatullah Siddiqi, Aditi Sharan, "Keyword and Keyphrase Extraction Techniques: A Literature Review", International Journal of Computer Applications (0975 – 8887), Volume 109, No. 2, January 2015.
[10]. Meng, Rui, Yuan, Xingdi, Wang, Tong, Brusilovsky, Peter, Trischler, Adam, & He, Daqing. “Does Order
Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence”, arXiv preprint
arXiv:1909.03590, 2019.
[11]. Isabella Gagliardi and Maria Teresa Artese, “ Semantic Unsupervised Automatic Keyphrases Extraction by
Integrating Word Embedding with Clustering Methods”, June 2020.
[12]. Gollum Rabby, Saiful Azad, Mufti Mahmud, Kamal Z. Zamli, and Mohammed Mostafizur Rahman, "TeKET: a Tree-Based Unsupervised Keyphrase Extraction Technique", Cognitive Computation, published online March 2020.
[13]. Beltagy, I., Cohan, A., Lo, K.: "SciBERT: pretrained contextualized embeddings for scientific text", 2019.
[14]. Sang-Woon Kim and Joon-Min Gil, "Research paper classification systems based on TF-IDF and LDA schemes", https://doi.org/10.1186/s13673-019-0192-7, August 2019.
[15]. Aparna Bulusu, Sucharita V, “Research on Machine Learning Techniques for POS Tagging in NLP”,
International Journal of Recent Technology and Engineering,(IJRTE), ISSN: 2277-3878, Volume-8, Issue-
1S4, June 2019.
[16]. Teng-Fei Li, Liang Hu, Jian-Feng Chu, Hong-Tu Li, and Chi, “An Unsupervised Approach for Keyphrase
Extraction Using Within-Collection Resources” 2017.
[17]. Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, and Martin Jaggi, "Simple Unsupervised Keyphrase Extraction using Sentence Embeddings", October 2018.
[18]. S. Anjali Nair, M. Meera, M.G. Thushara, “A Graph-Based Approach for keyword extraction from
documents”, Second International Conference on Advance Computational and Communication
Paradigms”, ICACCP 2019.
[19]. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. “Language models are
unsupervised multitask learners. OpenAI Blog”, 2019.
[20]. Yan Ying, Tan Qingping, Xie Qinzheng, Zeng Ping, and Li Panpan, "A graph-based approach of automatic keyphrase extraction", Procedia Computer Science, vol. 107, pp. 248-255, 2017.
[21]. Gollapalli, S.D., & Caragea, C. “Extracting keyphrases from research papers using citation networks”,
2014.
[22]. Rada Mihalcea and Paul Tarau Department of Computer Science University of North Texas “TextRank:
Bringing Order into Texts”.
[23]. Jinzhang Zhou, “Keyword extraction method based on word vector and TextRank”, Application Research
of Computers, 36, 5, 2019.
[24]. Suhan Pan, Zhiqiang Li, Juan Dai, "An improved TextRank keywords extraction algorithm", ACM TURC '19: Proceedings of the ACM Turing Celebration Conference – China, May 2019.
[25]. Florescu, C., Caragea, C.: "PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1105–1115, 2017.