Retraction
Retracted: TextRank Keyword Extraction Algorithm Using Word
Vector Clustering Based on Rough Data-Deduction
Copyright © 2023 Computational Intelligence and Neuroscience. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
References
[1] N. Zhou, W. Shi, R. Liang, and N. Zhong, “TextRank Keyword
Extraction Algorithm Using Word Vector Clustering Based on
Rough Data-Deduction,” Computational Intelligence and
Neuroscience, vol. 2022, Article ID 5649994, 19 pages, 2022.
Hindawi
Computational Intelligence and Neuroscience
Volume 2022, Article ID 5649994, 19 pages
https://doi.org/10.1155/2022/5649994
Research Article
TextRank Keyword Extraction Algorithm Using Word Vector
Clustering Based on Rough Data-Deduction
Ning Zhou, Wenqian Shi, Renyu Liang, and Na Zhong
School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Received 3 August 2021; Revised 2 November 2021; Accepted 3 January 2022; Published 25 January 2022
Copyright © 2022 Ning Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
When the TextRank algorithm based on the graph model constructs graph associative edges, the co-occurrence window rules consider only the relationships between local terms, and the information that can be drawn from the document itself is limited. To solve these problems, an improved TextRank keyword extraction algorithm based on rough data-deduction combined with word vector clustering, RDD-WRank, is proposed. Firstly, the algorithm uses rough data-deduction to mine the associations between candidate keywords, expanding the search scope and making the results more comprehensive. Then, based on the Wikipedia online open knowledge base, word embedding technology is used to integrate Word2Vec into the improved algorithm, and the word vectors of the TextRank lexical graph nodes are clustered to adjust the voting importance of the nodes within each cluster. Compared with the traditional TextRank algorithm and with Word2Vec combined with TextRank, the experimental results show that the improved algorithm significantly improves extraction accuracy, which demonstrates that rough data-deduction can effectively improve the performance of keyword extraction.
1. Introduction
In this information age, people's lives are flooded with information. Faced with such a huge amount of data, it is particularly important to obtain the content that is valuable and of interest quickly and accurately. As a high-level summary of the text content, keywords help readers quickly understand the main ideas of a document. Keyword extraction also plays an important role in information retrieval and text classification. This article mainly discusses methods of using TextRank to extract keywords.
The traditional TextRank algorithm uses the co-occurrence window principle to establish the associations between nodes when constructing the candidate keyword graph. That is, an edge is constructed between two nodes that appear in the same window, so the co-occurrence relationship can be used to easily obtain the required word graph. However, judging the correlation between nodes by this principle considers only local relationships, which is relatively limited and may lead to extraction results that are not comprehensive or accurate enough. In addition, the algorithm only uses the information of the document itself. If external knowledge can be introduced into the keyword extraction process, the extraction effect can, in theory, be improved.
To solve the above problems and obtain more accurate extraction results, this paper introduces rough data-deduction theory into the field of text mining for the first time and improves the TextRank algorithm on this basis. Because rough data-deduction has the characteristics of upper approximation and its deduction objects are data [1], the theory has important application significance for problem modeling and algorithm simulation when it is applied to problems with potential associations. However, there are few application studies of the theory at present; it has only been applied to image inpainting [2] and has not been used in research on text processing. Therefore, improving the TextRank algorithm based on rough data-deduction has both theoretical and practical significance. The algorithm in this paper uses rough data-deduction theory to infer the association relationships between nodes, determines whether there is a potential association between two nodes, and then obtains the transition probability of coverage influence between nodes. At the same time, to make the algorithm consider the influence of external knowledge on keyword extraction, this paper uses the Word2Vec model to train word vectors and clusters them. The TextRank word graph nodes are then weighted nonuniformly according to the clustering distribution of the words. In this way, world knowledge external to a single document is integrated into the algorithm, and the extraction effect of the algorithm is improved. Different from existing methods that introduce external knowledge through topic weighting or inverse document frequency weighting, the training data of Word2Vec is independent of the documents to be processed, so using the word vectors it generates to improve the algorithm can, in theory, yield a more stable extraction result.
2. Related Work
The Materials and Methods should be described with sufficient details to allow others to replicate and build on the published results. Please note that the publication of your manuscript implicates that you must make all materials, data, computer code, and protocols associated with the publication available to readers. Please disclose at the submission stage any restrictions on the availability of materials or information. New methods and protocols should be described in detail while well-established methods can be briefly described and appropriately cited.
Research manuscripts reporting large datasets that are deposited in a publicly available database should specify where the data have been deposited and provide the relevant accession numbers. If the accession numbers have not yet been obtained at the time of submission, please state that they will be provided during review. They must be provided prior to publication.
The research on keyword extraction methods began at the end of the last century. According to whether a tagged corpus must be provided, keyword extraction is divided in this paper into supervised and unsupervised methods. The supervised extraction method [3] regards keyword extraction as a binary classification problem: a binary judgment is made on each word in the text to decide whether it is a keyword, and this method requires a tagged corpus. The unsupervised extraction method does not need a tagged corpus; it uses statistical properties to rank the candidate words and takes the most important words as keywords. With the continuous improvement of unsupervised extraction methods, their performance is gradually approaching that of supervised methods [4], and they are highly adaptable, so they are widely used. This paper focuses on unsupervised extraction algorithms, whose mainstream methods can be summarized into three categories: keyword extraction algorithms based on word frequency statistics [5–8], topic models [9–12], and graph models [13–17].
There is a big difference between Chinese and English keyword research, and the graph-based algorithm is a more effective method for keyword extraction from Chinese text. It can make fuller use of the relationships between text elements than methods based on word frequency statistics and has a good keyword extraction effect. The TextRank algorithm, as a typical representative of the word graph model, has received wide attention from researchers.
Based on Google's PageRank algorithm, Mihalcea and Tarau proposed the graph-based voting algorithm TextRank. In recent years, in order to further improve the keyword extraction effect of the TextRank algorithm, Literature [18] proposed PositionRank, an unsupervised model for extracting keywords from academic documents, which combines information from all positions where a word appears to bias PageRank. Literature [19] integrates LDA into the algorithm, taking into account the influence of the subject matter of the whole document set and thereby improving extraction accuracy. Literature [20] added the time dimension to the algorithm, which better adapts to changing themes and improves extraction effectiveness. Literature [21] introduced word relevance and a document language network into the document graph model to improve keyword extraction performance. Literature [22] improved the algorithm based on the theory of basic level categories. Literature [23] integrated the positional information of the words in the document into the algorithm and improved its keyword extraction effect. Literature [24] integrated the Doc2Vec model and the K-means algorithm into the algorithm to improve the quality of extraction. In summary, the improvements of the existing related algorithms all work at the level of combining external features and fail to start from the inside of the algorithm to improve its accuracy.
With the continuous development of various technologies in the field of artificial intelligence, the neural network tool Word2Vec has come into wide use, and keyword extraction based on the Word2Vec model is one of its important applications. Literature [25] used Word2Vec to obtain K-dimensional vector representations of all the words in the training document set, calculated the similarity between words based on the word vectors, and implemented word clustering to obtain the keywords of the document. Literature [26] combined the LDA topic model with Word2Vec to propose a keyword extraction method that combines topic word embedding and network structure analysis. Literature [27] uses TF-IDF-weighted GloVe word vectors for word embedding representation. Literature [28] proposed a cuckoo search and k-means supervised hybrid clustering algorithm that divides data samples into clusters so as to provide training subsets with high diversity. Literature [29] merged the Word2Vec model into the traditional TextRank algorithm by using word embedding technology to improve the accuracy of keyword extraction.
3.1. TextRank Algorithm. The TextRank algorithm is derived from the PageRank [31, 32] algorithm and is now widely used in the field of keyword extraction [33, 34]. The basic idea of the algorithm is the voting principle. First, the target text is divided into several meaningful words, and the local connections between the words, given by the co-occurrence window, are used to determine the associations between the candidate words and construct the candidate word graph. Then, the algorithm uses the voting mechanism to rank the candidate words and achieve keyword extraction. The main steps are as follows:
(1) Sentence segmentation: Segment the target text T according to sentence boundaries, that is, T = [S1, S2, ..., Sm].
(2) Word segmentation and filtering: Perform word segmentation for each sentence Si ∈ T, tag parts of speech, and then filter out stop words and words not belonging to the specified parts of speech, that is, Si = [t_{i,1}, t_{i,2}, ..., t_{i,n}], where t_{i,j} is a candidate keyword after filtering.
(3) Graph construction: Construct the candidate word graph G = (V, E), where V is the vertex set composed of the candidate words obtained in (2) and E ⊆ V × V is the edge set. The traditional algorithm uses the co-occurrence window to construct the edges between nodes: an edge exists between two nodes only when the corresponding candidate words appear within a window of length K, where K is the window size and determines the maximum number of words that can co-occur.
(4) Iterative calculation: Iteratively calculate the weight of each node according to formula (1) [30] until the calculation converges:

$$WS(v_i) = (1 - d) + d \times \sum_{v_j \in In(v_i)} \frac{w_{ji}}{\sum_{v_k \in Out(v_j)} w_{jk}} \times WS(v_j). \tag{1}$$

In the formula, In(v_i) is the set of nodes pointing to node v_i, and d ∈ [0, 1] is the damping factor, originally the random-walk probability of the PageRank algorithm. It was introduced to prevent pages without outgoing links from swallowing the probability of users browsing onward; there are likewise nodes without any outgoing edges in the text graph model, and the value is generally set to 0.85. If the change of a node's score in the candidate word graph between iterations is less than a given threshold, the node is considered to have converged; this threshold is usually set to 0.0001. p(v_j → v_i) is the jump probability from node v_j to node v_i, which is calculated by formula (2) in the traditional TextRank algorithm:

$$p(v_j \rightarrow v_i) = \begin{cases} \dfrac{w_{ji}}{\sum_{v_k \in Out(v_j)} w_{jk}}, & v_i \in Out(v_j), \\[2mm] 0, & \text{otherwise}. \end{cases} \tag{2}$$

In the formula, Out(v_j) is the set of nodes pointed to by v_j, and w_{ji} is the weight of the edge from node v_j to node v_i, which is determined by the co-occurrence of the two words in the traditional algorithm.
(5) Sorting: Sort the node weights in descending order and take the first K words as the keywords of the target text.
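To make the voting procedure concrete, the following is a minimal Python sketch of the traditional TextRank keyword step described above (a co-occurrence graph built within a sliding window and then iteration of formula (1)); all function and variable names are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def textrank_keywords(candidates, window=6, d=0.85, tol=1e-4, max_iter=100, top_k=10):
    """candidates: list of filtered candidate words in document order."""
    # Build an undirected co-occurrence graph: an edge links two words
    # appearing together inside the sliding window (edge weights of formula (2)).
    weights = defaultdict(float)
    neighbors = defaultdict(set)
    for i, w in enumerate(candidates):
        for u in candidates[i + 1:i + window]:
            if u != w:
                weights[(w, u)] += 1.0
                weights[(u, w)] += 1.0
                neighbors[w].add(u)
                neighbors[u].add(w)

    # Iterate formula (1) until the largest score change is below tol.
    score = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_score = {}
        for vi in neighbors:
            s = 0.0
            for vj in neighbors[vi]:
                out_sum = sum(weights[(vj, vk)] for vk in neighbors[vj])
                s += weights[(vj, vi)] / out_sum * score[vj]
            new_score[vi] = (1 - d) + d * s
        if max(abs(new_score[w] - score[w]) for w in score) < tol:
            score = new_score
            break
        score = new_score

    # Step (5): sort by weight and return the top K words.
    return sorted(score, key=score.get, reverse=True)[:top_k]
```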
3.2. Rough Data-Deduction

3.2.1. Rough Set Theory. The original application of rough set theory in text processing was to classify texts in order to speed up classification and improve its accuracy [35]. The idea of rough data-deduction is based on rough set theory and integrates the approximate information of the upper approximation concept into the process of data reasoning. Therefore, introducing the concepts of rough set theory helps in understanding rough data-deduction. A brief introduction to the relevant notions of rough sets follows.
Let U be a dataset and R an equivalence relation on U. The structure composed of U and R is called an approximation space, denoted by M = (U, R), and U is the domain of discourse. Let U/R = {[a]_R | a ∈ U} be the partition of U relative to R, where [a]_R is the R-equivalence class determined by a. For any subset X ⊆ U, in the approximation space M, the upper approximation R^{*}(X) and lower approximation R_{*}(X) of X are defined as follows [36]:

$$R^{*}(X) = \bigcup \left\{ [a]_R \mid [a]_R \in U/R \text{ and } [a]_R \cap X \neq \varnothing \right\}, \qquad R_{*}(X) = \bigcup \left\{ [a]_R \mid [a]_R \in U/R \text{ and } [a]_R \subseteq X \right\}. \tag{3}$$

That is, the upper approximation of the subset X is the union of all R-equivalence classes whose intersection with X is not empty, and the lower approximation of X is the union of all R-equivalence classes contained in X.
The lower approximation R_{*}(X) approaches X from the inside of X, and the upper approximation R^{*}(X) approaches it from the outside. If X is considered to contain precise information, then R_{*}(X), which is contained within X, is often even more precise than that information, while R^{*}(X) expands the scope of the precise information to include external information; from this the concept of the rough set is derived. That is, when R^{*}(X) ≠ R_{*}(X), X is called a rough set, and when R^{*}(X) = R_{*}(X), X is called a definite set [36]. Since the information of R_{*}(X) is overly precise, whereas the information in R^{*}(X) covers X and is an extension of the precise information, incorporating R^{*}(X) into rough data-deduction can increase the deduction data and expand the deduction range, and the results obtained will be more accurate.

3.2.2. Rough Deduction-Space. A rough deduction-space is the structural space that rough data-deduction depends on; it is an expansion of an approximation space M = (U, R) in both content and structure. Let K = {R1, R2, ..., Rn} (n ≥ 1), where each Ri (i = 1, 2, ..., n) is an equivalence relation on U. Given a binary relation S ⊆ U × U, S is referred to as a deduction relation. The structure composed of U, K, and S is called a rough deduction-space, denoted by W = (U, K, S) [1].

3.3. Rough Data-Deduction. Rough data-deduction accomplishes deduction from data to data, which is different from logical deduction in mathematical logic. Since most things and objects in real life can be abstracted as data, data-oriented reasoning is widely applicable. Let W = (U, K, S) be a rough deduction-space, a ∈ U, and R ∈ K; rough data-deductions are then defined as follows:
(1) Let b ∈ U. If b ∈ R^{*}([a − R]), then a directly roughly deduces b with respect to R, denoted by a ⇒_R b, where [a − R] = {x | x ∈ U, and there is a datum z ∈ [a]_R such that ⟨z, x⟩ ∈ S}. The subset [a − R] is called the S-predecessor set of [a]_R.
(2) Let b1, b2, ..., bn, b ∈ U. If a ⇒_R b1, b1 ⇒_R b2, ..., b_{n−1} ⇒_R b_n, and b_n ⇒_R b (n ≥ 0), then a roughly deduces b with respect to R, denoted by a |=_R b.
(3) For R ∈ K and a, b ∈ U, the process of deciding whether a roughly deduces b according to (1) or (2) is called the rough data-deduction with respect to R in W = (U, K, S), or rough data-deduction for short [1].
Rough data-deduction can expand the association scope and increase the associated data. If this theory is applied to the TextRank keyword extraction algorithm, the association between two word nodes can be obtained through deduction from the overall situation. Constructing the candidate keyword graph on this basis and then extracting keywords should make the extraction results more comprehensive.
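The deduction rules above can be followed mechanically; the sketch below is a minimal Python illustration of the definitions (the helper names, the set-of-sets encoding of the partition, and the pair encoding of S are assumptions for illustration, and the example data anticipate the w1–w9 example of Section 4).

```python
def upper_approximation(block, partition):
    """R*(block): union of equivalence classes that intersect the given set."""
    return set().union(*(cls for cls in partition if cls & block)) if block else set()

def predecessor_set(a, partition, S):
    """[a - R]: elements reachable through S from any member of a's class."""
    a_class = next(cls for cls in partition if a in cls)
    return {x for (z, x) in S if z in a_class}

def directly_deduces(a, b, partition, S):
    """a =>_R b  iff  b lies in R*([a - R])."""
    return b in upper_approximation(predecessor_set(a, partition, S), partition)

def roughly_deduces(a, b, partition, S):
    """a |=_R b: reach b through a chain of direct rough deductions."""
    frontier, seen = {a}, set()
    while frontier:
        x = frontier.pop()
        if directly_deduces(x, b, partition, S):
            return True
        seen.add(x)
        # expand to every element directly deducible from x
        frontier |= {y for cls in partition for y in cls
                     if y not in seen and directly_deduces(x, y, partition, S)} - seen
    return False

# Example: U = {w1,...,w9}, classes by word-meaning similarity, S from PMI associations.
partition = [{"w1", "w2", "w3"}, {"w4", "w5", "w6"}, {"w7", "w8", "w9"}]
S = {("w1", "w4"), ("w3", "w4"), ("w3", "w6"), ("w5", "w8")}
print(directly_deduces("w1", "w5", partition, S))   # True: w5 in R*({w4, w6})
print(roughly_deduces("w1", "w9", partition, S))    # True: w1 =>_R w5 =>_R w9
```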
3.4. Word2Vec. Word2Vec is a tool for training word vectors that was open-sourced by Google. It can vectorize all words so as to measure the relationships between words quantitatively and explore the relationships between candidate words. It uses a shallow neural network model to learn the occurrence of words in the corpus automatically and embeds the words into a space of moderate dimension, that is, words → R^n, and the representation of a word in the new space R^n is its word vector [37].
The idea of Word2Vec [26, 36] comes from Bayesian estimation of occurrence probabilities. Let T = w1 w2 ... wn be a sentence consisting of n words; the probability of occurrence of the sentence T is

$$P(T) = \prod_{i=1}^{n} p\left(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1}\right), \tag{4}$$

and the Bayesian estimate of the probability of occurrence of w_i is

$$p\left(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1}\right) = \frac{C\left(w_1 w_2 \cdots w_n\right)}{C\left(w_1 w_2 \cdots w_{n-1}\right)}, \tag{5}$$

where C(w1 w2 ... wn) is the frequency of the word sequence w1 w2 ... wn in the corpus.
The Word2Vec tool mainly includes two training modes, Continuous Bag-of-Words (CBOW) and Skip-Gram, both of which are three-layer neural networks (input layer, projection layer, and output layer). The CBOW model [25, 36] predicts the current word from its context; that is, it takes a known context as input and outputs a prediction of the current word, as shown in Figure 1. What the CBOW model in the figure predicts is p(w_t | w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}), with a window of 2. Assuming that k words are taken before and after the target word w_t, that is, the window size is k, the prediction of the CBOW model is

$$p\left(w_t \mid w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+(k-1)}, w_{t+k}\right), \tag{6}$$

and the learning goal of this model is to maximize the function L1:

$$L_1 = \sum_{t=1}^{V} \log p\left(w_t \mid w_{t-k}, w_{t-(k-1)}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+(k-1)}, w_{t+k}\right). \tag{7}$$

The Skip-Gram model [25, 36] has the opposite characteristics of the CBOW model. Its input is the word vector of a specific word, and its output is the context word vectors corresponding to that word, as shown in Figure 1. Similarly, if k words are taken before and after the word w_t, that is, the window size is k, the prediction of the Skip-Gram model is

$$p\left(w_{t+p} \mid w_t\right) \quad (-k \leq p \leq k,\ p \neq 0), \tag{8}$$

and the learning goal of this model is to maximize the function L2:

$$L_2 = \sum_{t=1}^{V} \sum_{\substack{p=-k \\ p \neq 0}}^{k} \log p\left(w_{t+p} \mid w_t\right). \tag{9}$$

Figure 1: CBOW and Skip-Gram model structures.

CBOW and Skip-Gram are the two important models in Word2Vec; they describe the association between surrounding words and the current word from different angles. Comparing the two, the Skip-Gram model can generate more training samples and capture more semantic details between words; under ideal conditions where the corpus is good enough, Skip-Gram is superior to CBOW. However, when the corpus is small, it is difficult for Skip-Gram to capture fine-grained relations between words, whereas the averaging behaviour of the CBOW model makes its training more robust; this study considers both. At the same time, two optimization methods, negative sampling [38] and hierarchical softmax [39], are used to reduce training complexity and speed up the training process.
Compared with traditional text representations, Word2Vec generates word vectors of lower dimension, and the semantic and syntactic relationships between words are well reflected in the vector space, because words with similar semantics are close to each other in that space. It can be said that the word vectors learned by Word2Vec training contain the semantic information of the words in the dataset. Pretrained language models such as GPT and BERT achieve better training effects, but their data scale is large. Therefore, this paper weights the jump probability between TextRank word graph nodes based on the relationships between the text word vectors obtained by Word2Vec training.
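As a concrete illustration of how such vectors can be obtained, the sketch below trains CBOW and Skip-Gram models with the Gensim library (assuming Gensim 4.x; the corpus file name, parameter values, and example word are illustrative, not the exact settings of the original experiments).

```python
from gensim.models import Word2Vec

# Each training sentence is a list of already-segmented tokens,
# e.g. produced by jieba from the cleaned Wikipedia dump.
sentences = [line.split() for line in open("zhwiki_segmented.txt", encoding="utf-8")]

# sg=0 selects CBOW, sg=1 selects Skip-Gram; negative sampling and
# hierarchical softmax are the two speed-up options mentioned above.
cbow = Word2Vec(sentences, vector_size=200, window=5, min_count=5,
                sg=0, negative=5, workers=4)
skipgram = Word2Vec(sentences, vector_size=200, window=5, min_count=5,
                    sg=1, hs=1, negative=0, workers=4)

vector = skipgram.wv["算法"]                      # word vector of a candidate word
print(skipgram.wv.most_similar("算法", topn=5))   # semantically close words
```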
4. Improved Algorithm Using Word Vector Based on Rough Data-Deduction

The classic TextRank algorithm constructs the graph model of candidate keywords through co-occurrence relationships and then iteratively calculates the weight of each node through the average transition probability matrix until it converges. This approach is relatively simple and effective, but it has certain limitations. The co-occurrence window rule considers only the correlation between local words, so some words that are merely locally associated with certain keywords may be extracted. However, the keywords of a document are not limited to the words surrounding important words. When extracting text keywords, both the words in the text and potentially associated words must be fully considered. Words with potential associations have an important impact on the whole iterative ranking process, and these potential associations can be explored through the theory of rough data-deduction. At the same time, considering the influence of external knowledge on keyword extraction, the improved algorithm introduces Word2Vec to quantify the candidate word nodes. Unlike existing methods that introduce external knowledge through topic weighting or inverse document frequency weighting, the training data of the Word2Vec model is independent of the text to be processed, so using the word vectors it produces to improve the algorithm can, in theory, yield a more stable extraction result [40]. The word vectors reflect external knowledge, and the candidate keywords can be clustered into several clusters according to the similarity between their word vectors. The farther a word is from the centroid of its cluster, the better it reflects aspects of the cluster different from the words near the centroid; when it is used as a word node in TextRank, the higher the importance of its vote and the higher the probability of jumping between adjacent nodes.
First, since words with similar meanings may describe the same important content in a document, the weight of such a group of words should be increased accordingly to improve the accuracy of the extraction results. The classic TextRank algorithm does not consider this aspect; it only considers each word by itself and thereby ignores the contribution of words with similar meanings. The improved algorithm takes word meaning into account and groups the candidate words by meaning before the subsequent association deduction, which allows keywords to be extracted more effectively.
Second, the rough deduction-space W = (U, K, S) is introduced to describe the structure of keyword extraction, where U is the dataset composed of candidate keywords, K is the set of equivalence relations with R ∈ K such that, for a, b ∈ U, ⟨a, b⟩ ∈ R if and only if a and b have similar meanings, and S ⊆ U × U is defined as S = {⟨u, v⟩ | u, v ∈ U and there is an association between u and v}.
At the same time, using rough data-deduction, it is assumed that the deduction relation is

$$S = \{\langle w_1, w_4\rangle, \langle w_3, w_4\rangle, \langle w_3, w_6\rangle, \langle w_5, w_8\rangle\}, \tag{10}$$

where w1 ~ w9 are candidate keywords selected from the text after word segmentation and filtering, and this deduction relation is determined by the association degree of the association rules in the deduction, that is, pointwise mutual information. For the equivalence relation R ∈ K, it is assumed that the division of U with respect to R is

$$U/R = \{\{w_1, w_2, w_3\}, \{w_4, w_5, w_6\}, \{w_7, w_8, w_9\}\}. \tag{11}$$

The equivalence division here is based on the similarity of word meanings between candidate words. Combining the above information yields the rough data-deduction diagram for keyword extraction shown in Figure 2.

Figure 2: Diagram of rough data-deduction in keyword extraction.

As shown in Figure 2, in the process of rough data-deduction, for the candidate word w1 the algorithm obtains w2 and w3 based on the word-meaning similarity rule, so w1, w2, and w3 are placed in one dataset; similarly, w4 ~ w9 are divided. Based on the association degree of the association rules, that is, pointwise mutual information, the algorithm deduces from w1 to w4 and obtains w5 ~ w9. According to the definition of rough data-deduction, for w1 we have [w1]_R = {w1, w2, w3}, [w1 − R] = {w4, w6}, and R^{*}([w1 − R]) = {w4, w5, w6}, so w1 ⇒_R w5. For the candidate word w5: [w5]_R = {w4, w5, w6}, [w5 − R] = {w8}, and R^{*}([w5 − R]) = {w7, w8, w9}, so w5 ⇒_R w9. Finally, w1 |=_R w9 is obtained from w1 ⇒_R w5 and w5 ⇒_R w9. It can be seen that there is also a potential correlation between w1 and w9, which can contribute to the calculation. The associations between candidate keywords are established by the above rules, and the association weights can be added as contribution rates to the iterative calculation process to improve the accuracy of extraction.
For any two nodes v_j and v_i, the influence of node v_j on v_i is transmitted through the directed edge ⟨v_j, v_i⟩, and the weight of the edge determines how much of the influence of v_j is finally received by v_i. Therefore, let the association weight between v_j and v_i obtained by rough data-deduction be the weight of the coverage influence transmitted from node v_j to node v_i, recorded as w′_{ji}. With reference to formula (2), the transition probability of coverage influence between candidate keyword nodes v_j and v_i is

$$p_{cov}\left(v_j \rightarrow v_i\right) = \frac{w_{ji}'}{\sum_{v_k \in Out(v_j)} w_{jk}'}. \tag{12}$$

Then, for the text T with candidate keyword set {w1, w2, ..., wn} and the Word2Vec word vector model obtained by training, let $\vec{w}_i$ denote the word vector corresponding to the word w_i, and let C1, C2, ..., Ck denote the clusters obtained by K-means clustering of the word vector set of the text. Formula (13) is proposed to calculate the voting importance of any word v_i within its cluster C_{v_i}:

$$c\_weight\left(v_i\right) = \frac{d\left(\vec{v}_i, \vec{c}_{v_i}\right)}{\sum_{v_j \in C_{v_i}} d\left(\vec{v}_j, \vec{c}_{v_i}\right)} \times \left|C_{v_i}\right|. \tag{13}$$

In this formula, $\vec{c}_{v_i}$ is the vector of the centroid of cluster C_{v_i}, $d(\vec{v}_i, \vec{c}_{v_i})$ is the Euclidean distance from vector $\vec{v}_i$ to vector $\vec{c}_{v_i}$ in the word vector space, and |C_{v_i}| is the number of words in cluster C_{v_i}. The total voting score of a cluster is the number of nodes in the cluster, and the voting weight of each node in the cluster is distributed in proportion to its Euclidean distance from the centroid: the farther from the centroid, the higher the voting importance. When the semantic association of two nodes in the word vector space is expressed as the clustering-weighted influence between the nodes, the voting importance of each word obtained through cluster analysis gives the transition probability of clustering influence between nodes v_j and v_i:

$$p_{clu}\left(v_j \rightarrow v_i\right) = \frac{c\_weight\left(v_i\right)}{\sum_{v_k \in Out(v_j)} c\_weight\left(v_k\right)}. \tag{14}$$
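The following is a minimal sketch of formulas (13) and (14) using scikit-learn's K-means and NumPy; the cluster count, helper names, and data structures are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_vote_weights(words, word_vectors, n_clusters=5):
    """Formula (13): distance-weighted voting importance inside each cluster."""
    X = np.array([word_vectors[w] for w in words])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    weights = {}
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(km.labels_) if lab == c]
        dists = {i: np.linalg.norm(X[i] - km.cluster_centers_[c]) for i in members}
        total = sum(dists.values()) or 1.0
        for i in members:
            # farther from the centroid -> larger share of the cluster's |C| votes
            weights[words[i]] = dists[i] / total * len(members)
    return weights

def p_clu(vj, vi, out_nodes, c_weight):
    """Formula (14): clustering-influence transition probability vj -> vi."""
    denom = sum(c_weight[vk] for vk in out_nodes[vj]) or 1.0
    return c_weight[vi] / denom
```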
Combining the two kinds of influence, formula (15) gives the jump probability between candidate keyword nodes v_i and v_j:

$$p\left(v_j \rightarrow v_i\right) = \alpha \, p_{cov}\left(v_j \rightarrow v_i\right) + \beta \, p_{clu}\left(v_j \rightarrow v_i\right). \tag{15}$$

In this formula, α and β are the weight coefficients of the two influences, respectively, and α + β = 1. This paper takes α = 0.7 and β = 0.3 according to the experimental results.
According to the theory of link analysis, as long as the jump probability transition matrix between the nodes of a graph is given, the importance of the nodes can be calculated iteratively by formula (1).
The main steps of the improved algorithm are as follows:

Step 1. Preprocess the target text as in the classic TextRank algorithm to obtain the candidate keywords by sentence segmentation, word segmentation, and part-of-speech filtering.

Step 2. Divide the candidate keywords into different equivalence classes according to the similarity of their word meanings. In this paper, words are divided based on HowNet and Cilin. For any two words w1 and w2, the division rule [41] is

$$s = \lambda_1 s_1 + \lambda_2 s_2. \tag{16}$$

In this formula, s1 and s2 are the similarities calculated from HowNet and Cilin, respectively, λ1 and λ2 are the weights given to s1 and s2, and λ1 + λ2 = 1 is required. The values of λ1 and λ2 are determined by the distribution of the words w1 and w2 in HowNet and Cilin shown in Figure 3.

Figure 3: The distribution of words in HowNet and Cilin.

The strategies for choosing the values of λ1 and λ2 are as follows [41]:
(1) When w1 ∈ C and w2 ∈ C, calculate the similarity between w1 and w2 based on HowNet and on Cilin, respectively, and denote them as s1 and s2. λ1 = λ2 = 0.5 is taken in the experiments of this paper.
(2) When w1 ∈ A and w2 ∈ A, or w1 ∈ B and w2 ∈ B, calculate the similarity between w1 and w2 based on HowNet or Cilin and denote it as s1 or s2. Here, λ1 is 1 and λ2 is 0.
(3) When w1 ∈ A and w2 ∈ B, find the synonym set of w2 in Cilin, calculate the similarity of these synonyms with w1 based on HowNet, and take the maximum value as s1. If w2 has no synonyms in Cilin, take s1 = 0.2. Here λ1 = 1 and λ2 = 0.
(4) When w1 ∈ A and w2 ∈ C, first calculate the similarity between w1 and w2 based on HowNet and denote it as s1. Then find the synonym set of w2 in Cilin, calculate its similarity with w1 based on HowNet, and take the maximum value as s2. If w2 has no synonyms in Cilin, take s2 = s1. Here λ1 > λ2; λ1 = 0.6 and λ2 = 0.4 are taken in the experiments of this paper.
(5) When w1 ∈ B and w2 ∈ C, first calculate the similarity between w1 and w2 based on Cilin and denote it as s2. Then find the synonym set of w1 in Cilin, calculate its similarity with w2 based on HowNet, and take the maximum value as s1. If w1 has no synonyms in Cilin, take s1 = s2. Here λ2 > λ1; λ1 = 0.4 and λ2 = 0.6 are taken in the experiments of this paper.
The calculation of word similarity based on HowNet is as follows [41]:

$$\text{sim}\left(C_1, C_2\right) = \sum_{i=1}^{3} \beta_i \prod_{j=1}^{i} \text{sim}_j\left(C_1, C_2\right), \tag{17}$$

$$\text{sim}\left(W_1, W_2\right) = \max_{i=1,\ldots,m;\ j=1,\ldots,n} \text{sim}\left(C_{1i}, C_{2j}\right). \tag{18}$$

In formula (17), sim1(C1, C2) is the similarity calculated from the set of independent sememes, sim2(C1, C2) is the similarity of the feature structure of the relational sememes, and sim3(C1, C2) is the similarity of the feature structure of the relation symbols. The parameters βi (1 ≤ i ≤ 3) are adjustable and satisfy β1 + β2 + β3 = 1; after experiments, β1, β2, and β3 are taken as 0.7, 0.17, and 0.13, respectively, in the algorithm of this paper. Formula (17) gives the similarity of senses. When a word has multiple senses, formula (18) is used to take the maximum similarity over all sense combinations as the similarity of the two words, where m is the number of senses of the word W1 and n is the number of senses of the word W2.
The calculation of word similarity based on Cilin is as follows [40]:

$$\text{sim}\left(C_1, C_2\right) = \left(1.05 - 0.05 \times \text{dis}\left(C_1, C_2\right)\right) \times e^{-k/2n}. \tag{19}$$

In formula (19), dis(C1, C2) is the distance between the word codes C1 and C2 in the tree structure; n is the total number of nodes in the branch layer, that is, the number of direct children of the nearest common parent node of the two words; and k is the separation distance between the branches, under that nearest common parent node, in which the two words are located. Similarly, when a word corresponds to multiple codes, formula (18) is used to calculate the similarity of the words.
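A sketch of how the five cases above can be combined into formula (16) is given below. It assumes w1 and w2 appear in the listed order (mirror cases are omitted), treats Cilin as a mapping from a word to its synonym set, and uses stand-in functions hownet_sim and cilin_sim for formulas (17)–(19); in case (2) the resource that actually contains both words is used, which is one reading of the rule.

```python
def combined_similarity(w1, w2, hownet, cilin, hownet_sim, cilin_sim):
    """Formula (16): s = l1*s1 + l2*s2, with l1, l2 chosen by where w1, w2 occur.
    hownet: set of HowNet words; cilin: dict word -> synonym list."""
    in_both = lambda w: w in hownet and w in cilin
    only_h = lambda w: w in hownet and w not in cilin
    only_c = lambda w: w in cilin and w not in hownet

    if in_both(w1) and in_both(w2):                          # case (1)
        l1, l2, s1, s2 = 0.5, 0.5, hownet_sim(w1, w2), cilin_sim(w1, w2)
    elif (only_h(w1) and only_h(w2)) or (only_c(w1) and only_c(w2)):  # case (2)
        if only_h(w1):
            l1, l2, s1, s2 = 1.0, 0.0, hownet_sim(w1, w2), 0.0
        else:
            l1, l2, s1, s2 = 0.0, 1.0, 0.0, cilin_sim(w1, w2)
    elif only_h(w1) and only_c(w2):                          # case (3)
        s1 = max((hownet_sim(w1, s) for s in cilin.get(w2, [])), default=0.2)
        l1, l2, s2 = 1.0, 0.0, 0.0
    elif only_h(w1) and in_both(w2):                         # case (4)
        s1 = hownet_sim(w1, w2)
        s2 = max((hownet_sim(w1, s) for s in cilin.get(w2, [])), default=s1)
        l1, l2 = 0.6, 0.4
    else:                                                    # case (5): w1 in B, w2 in C
        s2 = cilin_sim(w1, w2)
        s1 = max((hownet_sim(s, w2) for s in cilin.get(w1, [])), default=s2)
        l1, l2 = 0.4, 0.6
    return l1 * s1 + l2 * s2
```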
Step 3. The association degree of the association rules in rough reasoning is defined as [42]

$$\mathrm{PMI}(A, B) = \frac{p(A, B)}{p(A)\,p(B)}. \tag{20}$$

In this formula, A and B are two candidate keywords in the text, p(A, B) is the frequency with which A and B occur in the same sentence, p(A) is the frequency of occurrence of A, and p(B) is the frequency of occurrence of B. The larger the PMI value, the more relevant the two words.
According to this degree of association, the candidate keywords that are directly associated are determined. That is, when PMI(w1, w2) ≠ 0, there is a direct association between w1 and w2, and w1, w2, and their association weight w′_{ji} are stored in the association set. At the same time, the deduction relation S for rough data-deduction can be established according to this association weight. Next, the rules of rough data-deduction are used to obtain the associations between the remaining candidate keywords across all the different equivalence classes, and these words and their association weights w′_{ji} are also stored in the association set. The transition probability of coverage influence between candidate keyword nodes is then obtained by formula (12).
candidate keyword nodes is obtained by formula (12). portation,” “Fundamental Science of Agriculture,” “Plant
Protection,” “Paediatrics,” “Cardiovascular System Disease,”
“Geography,” “Biography,” “Military Affairs,” “Chinese
Step 4. (e popular Python software package Gensim is Communist Party,” “Ideological & Political Education,”
used to train and construct the Word2vec model, and the “Computer Hardware Technology,” “Internet Technology”
R
largest Wikipedia online open knowledge base is selected as and “Market Research and Information.” From the result set,
the training corpus, which can ensure that the model has we selected the titles, abstracts, and keywords of the journal
better generalization ability. (e Word2Vec model is trained texts in the period from 2014 to 2020 and graded in CSCD/
to generate word vectors, then the K-means clustering is CSSCI and above. And, we exclude texts whose abstract
performed on the word vectors of the candidate words, and
T
length is less than 150 words and documents whose number
the transition probability of clustering influence between the of manually marked keywords is less than or equal to 1. (e
candidate keyword nodes is obtained from formulas (14) and final test dataset contains 17514 data and 65310 keywords
(15). provided by the author, and each paper contains 3.73
keywords.
E
Step 5. (e jump probability between word nodes is ob- Dataset 3: (e Python web crawler is used to capture user
tained by formula (16). Finally, the weight of each candidate comment data of some restaurants in the Taiyuan area of
keyword is iteratively iterated to convergence using formula Dianping, including 400 restaurants and 120,000 user
(1). (e flow chart of the improved algorithm is shown in comments. However, some of the restaurants only have a
R
Figure 4. very small number of user reviews, which will affect sub-
sequent experiments, so they are excluded from the dataset.
In addition, since many users only score the merchants
5. Experimental Results and Analysis without writing specific review content, user reviews are
5.1. Experimental Data and Evaluation Criteria empty. (ese kinds of data are also not conducive to sub-
sequent experiments, so it is cleaned from the dataset. (e
5.1.1. Experimental Data. (e experiment selected the final test dataset contains 17,309 valid user reviews of 178
Wikipedia Chinese corpus released in February 2020 merchants. At the same time, teachers and students who
“zhwiki-20200201-pages-articles-multistream.xml.bz2” to have manually labeled keywords for dataset 1 are asked to
train Chinese word vectors [43, 44], which contains a main label valid keywords for this dataset [47].
file of 1.9CB. First, the experiment uses the Python software
package Gensim to convert the downloaded xml compressed
file to txt format. Second, it uses opencc to simplify the wiki 5.1.2. Evaluation Criteria. In addition, the article uses three
content and remove other characters except Chinese char- evaluation indicators commonly used in the field of infor-
acters. Finally, after using the jieba word segmentation tool mation retrieval and classification to compare the quality of
[45] to segment the Chinese corpus obtained above, the the experimental results. It contains the precision (P), which
word vector is trained using the Word2Vec tool [46]. And, represents the accuracy of the extraction results; the recall
the following datasets were used in the experiment to test the (R), which represents the degree of coverage of the ex-
extraction results of each algorithm. traction results for the correct keywords; and the F-Measure
Figure 4: Flow chart of the improved algorithm.

5.1.2. Evaluation Criteria. The article uses three evaluation indicators commonly used in the fields of information retrieval and classification to compare the quality of the experimental results: the precision (P), which represents the accuracy of the extraction results; the recall (R), which represents the degree to which the extraction results cover the correct keywords; and the F-Measure (F), which is a comprehensive evaluation index combining P and R. The specific calculation formulas of the three indicators are as follows [48–50]:

$$P = \frac{\left|K_A \cap K_B\right|}{\left|K_B\right|} \times 100\%, \qquad R = \frac{\left|K_A \cap K_B\right|}{\left|K_A\right|} \times 100\%, \qquad F = \frac{2 \times P \times R}{P + R}. \tag{21}$$
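As a small worked example of formula (21), the three indicators can be computed from the keyword sets as follows (here K_A is taken as the manually labeled reference set and K_B as the algorithm output, an assumption based on the definitions of P and R above; the function name is illustrative).

```python
def precision_recall_f(reference, extracted):
    """reference, extracted: sets of keywords (K_A and K_B in formula (21))."""
    hit = len(reference & extracted)
    p = hit / len(extracted) * 100 if extracted else 0.0
    r = hit / len(reference) * 100 if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# e.g. 10 extracted keywords of which 6 match the 10 manual labels:
# precision = 60.00 %, recall = 60.00 %, F-Measure = 60.00
```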
The operating system of the experimental environment is Windows 7 (64-bit). The algorithm proposed in this paper is implemented in Python. Word segmentation and part-of-speech tagging use the open-source Jieba tools, and the remaining comparison algorithms involved in the experiment are also implemented in Python.

5.2. Experimental Results and Analysis
To determine the weight coefficients α and β, the number of extracted keywords is set to 3–10, and the following 11 groups of (α, β) values are tested: E1 (1.0, 0), E2 (0.9, 0.1), E3 (0.8, 0.2), E4 (0.7, 0.3), E5 (0.6, 0.4), E6 (0.5, 0.5), E7 (0.4, 0.6), E8 (0.3, 0.7), E9 (0.2, 0.8), E10 (0.1, 0.9), and E11 (0, 1.0). The F-Measure of the extraction results of the improved algorithm for each group is shown in Figure 5.
It can be seen from Figure 5 that the extraction effect of the algorithm in this paper differs under different values of α and β. The weighting coefficients are compared on the same test set, and the best extraction effect is obtained with the group E4 (0.7, 0.3). Therefore, the algorithm takes α = 0.7 and β = 0.3 in this paper.

Figure 5: The F-Measure of extraction results under different α and β.

5.2.2. Comparative Algorithm. Based on the same test set, this paper compares the following algorithms with the experimental results of this algorithm.

6. Experimental Results
The values of two important parameters in the experiment affect the extraction results of the TextRank algorithm: the co-occurrence window size ω and the number of keywords k. The implementations of the TF-IDF algorithm based on statistical features and of the algorithm in this paper are not affected by the parameter ω. To determine ω, we set the number of extracted keywords to k = 10 on dataset 1 and compare the F-Measure of the extraction results for window values within [6, 12]. The comparison results are shown in Figure 6; the extraction effect of the TextRank algorithm differs under different values of ω. Comparing the extraction effects of different values of ω on the same test set shows that the TextRank algorithm performs best when ω = 6. Therefore, in order to ensure the effectiveness of the comparison, the initial value of ω for the TextRank algorithm in the comparative experiments is set to 6. At the same time, the other parameters involved in the comparison algorithms are taken from the optimal values used in the respective literature. For numbers of keywords within [3, 10], we calculate the precision (P), recall (R), and F-Measure (F) of the following nine algorithms. The experimental results (retaining two decimal places) are shown in Table 1.

Figure 6: The F-Measure of extraction results under different ω values.

At the same time, in order to observe the differences between the keyword extraction methods comprehensively, the overall changes of the P, R, and F-Measure of the nine methods for top-N values within [3, 10] are further given in the form of line charts, as shown in Figures 7–9.
Figure 7 describes the changing trend of the accuracy of each algorithm when extracting different numbers of keywords. It can be seen from the figure that, as the number of extracted keywords increases, the accuracy of each algorithm decreases to some extent.
Table 1: Experimental results of the nine algorithms based on dataset 1.
Keywords (pcs) Algorithm P (%) R (%) F (%)
T2 64.22 19.27 29.64
T3 66.36 19.91 30.63
T4 28.13 8.44 12.99
3 T5 67.28 20.18 31.05
T6 61.16 18.35 28.23
T7 62.08 18.62 28.65
T8 55.96 16.79 25.83
RDD-WRank 69.75 20.93 32.19
T1 44.04 17.61 25.16
T2 56.19 22.48 32.11
T3 61.24 24.50 34.99
T4 28.21 11.28 16.12
4 T5 62.61 25.05 35.78
T6 57.11 22.84 32.63
T7 58.26 23.30 33.29
T8 51.83 20.73 29.62
RDD-WRank 63.43 25.37 36.24
T1 41.65 20.83 27.77
T2 53.21 26.61 35.47
T3 56.15 28.07 37.43
T4 27.52 13.76 18.35
5 T5 57.61 28.81 38.41
T6 53.94 26.97 35.96
T7 55.41 27.71 36.94
T8 47.34 23.67 31.56
RDD-WRank 59.26 29.63 39.51
T1 39.30 23.58 29.47
T2 49.69 29.82 37.27
T3 52.29 31.38 39.22
T4 26.91 16.15 20.18
6 T5 54.74 32.84 41.06
T6 51.07 30.64 38.30
T7 51.22 30.73 38.42
T8 46.18 27.71 34.63
RDD-WRank 55.86 33.52 41.90
T1 37.79 27.16 31.95
T2 46.66 32.66 38.42
T3 49.93 34.95 41.12
T4 27.00 18.90 22.23
7 T5 50.98 35.69 41.99
T6 49.02 34.31 40.37
T7 48.10 33.67 39.61
T8 44.43 31.10 36.59
RDD-WRank 54.10 37.87 44.55
T1 36.99 28.26 32.66
T2 44.15 35.32 39.25
T3 46.44 37.16 41.28
T4 26.61 21.28 23.65
8 T5 47.48 37.98 42.20
T6 46.56 37.25 41.39
T7 45.76 36.61 40.67
T8 43.35 34.68 38.53
RDD-WRank 50.93 40.74 45.27
Table 1: Continued.
Keywords (pcs) Algorithm P (%) R (%) F (%)
T1 35.12 30.31 33.12
T2 42.00 37.80 39.79
T3 44.34 39.91 42.01
T4 25.99 23.39 24.63
9 T5 45.16 40.64 42.78
T6 44.14 39.72 41.82
T7 43.63 39.27 41.33
T8 42.41 38.17 40.17
RDD-WRank 48.97 44.07 46.39
T1 32.98 32.98 32.98
T2 39.72 39.72 39.72
T3 42.20 42.20 42.20
T4 25.05 25.05 25.05
10 T5 42.66 42.66 42.66
T6 42.57 42.57 42.57
T7 41.83 41.83 41.83
T8 41.19 41.19 41.19
RDD-WRank 47.27 47.27 47.27
Figure 7: Comparison of P values of various algorithms.
Figure 8: Comparison of R values of various algorithms.
However, the accuracy of the RDD-WRank algorithm in this paper is higher than that of the other algorithms. Because the rough data-deduction rules adopted by the algorithm incorporate approximate information into the process of data deduction, the mutual inference between data can exhibit approximate implication or imprecise association, and potential associations between candidate keywords can be explored. When these potential associations are added to the iterative calculation of the weight of each candidate keyword, a more accurate extraction result is obtained. Therefore, the accuracy of the algorithm in this paper will in theory be higher than that of algorithms that calculate the associations between words according to fixed association rules or that rely on statistical word frequency, and its P value is higher than that of the other algorithms.
Figure 8 describes the changes in the recall of each algorithm when extracting different numbers of keywords. The recall of the RDD-WRank algorithm in this paper is higher than that of the other algorithms, and as the number of keywords increases, its relative gain in recall becomes more obvious. This is because the TF-IDF algorithm depends too heavily on word frequency and does not consider the associations between words at all. Although the improved algorithms that retain the co-occurrence window principle of the traditional TextRank algorithm do consider the relationships between words, they are more inclined to propose frequent words because of the limitations of the association rules they adopt, which may overlook important words that have low frequency but describe the subject of the text. The rough data-deduction used in the RDD-WRank algorithm can expand the scope of association and increase the associated data, which enhances the coverage of the extraction results over the standard keywords and improves the recall of the algorithm.
Figure 9: Comparison of F-Measure of various algorithms.

In particular, as the number of keywords increases, the influence of word frequency decreases, and the recall advantage of the algorithm in this paper becomes more obvious.
Figure 9 describes the F-Measure of each algorithm when extracting different numbers of keywords. When evaluating the experimental results, higher P and R values are both desirable, but in most cases the two are in tension, so the F-Measure should be used to consider the two indicators together; it reflects the effectiveness of the algorithm as a whole. For the F-Measure of the extraction results of the algorithms in the figure, the following analysis can be made.
(1) T8 in the figure is the Word2Vec word vector clustering method, and its extraction effect when used alone is not good, which is consistent with the conclusion in reference [40]. Reference [40] notes that when the Word2Vec word vector clustering method is applied directly to a single document, selecting the cluster centers as the keywords of the text is not very accurate, and the N words closest to the centers are not necessarily keywords. Therefore, the extraction effect obtained by this method alone is mediocre, but the method is often used in combination with other keyword extraction algorithms.
(2) The T6 and T7 methods incorporate information such as word position into the TextRank algorithm to improve extraction accuracy, but the effect is worse than that of the T5 method because T6 and T7 ignore the influence of external knowledge on keyword extraction. The comparison between T5 and T6/T7 shows that an improved algorithm that introduces external knowledge through Word2Vec can better improve keyword extraction.
(3) Comparing the T5 method with the other two Word2Vec-based methods, it is found that the three methods all fuse the Word2Vec model with the TextRank model. The difference is that T5 adds the statistical characteristics of words to the algorithm on the basis of considering the influence of external knowledge, which remedies the deficiency of obtaining keywords only through word vector calculation.
(4) Comparing the T5 method with the RDD-WRank algorithm in this paper, it is found that the RDD-WRank algorithm has a better extraction effect. This is because this paper uses rough data-deduction theory to further improve the algorithm on the basis of fusing the two models. Rough data-deduction can explore the potential associations between candidate keywords and increase the associated candidate words and the association scope; when the potential associations are added to the iterative calculation of the weight of each candidate keyword, the extraction result is more accurate and the algorithm is more effective.
At the same time, in order to prevent over-fitting of the experimental results of the improved algorithm, the experiment also compares the extraction results of each comparison algorithm on datasets 2 and 3. In the experiment, the weight coefficients of the improved algorithm are set to α = 0.7 and β = 0.3, and the parameters of the other comparison algorithms still take the optimal values in their respective references. Partial results for the P, R, and F-Measure of each algorithm (retaining two decimal places) are shown in Tables 2 and 3, and line charts of the P, R, and F-Measure of each algorithm are shown in Figures 10–12.
Figures 10–12 describe the comparison results of each algorithm based on the P, R, and F-Measure for datasets 2 and 3. It can be seen from these figures that the RDD-WRank algorithm in this paper still has a good extraction effect on the two datasets, and its three evaluation indicators are higher than those of the other methods. For dataset 2, however, the precision, recall, and F-Measure of every method are lower than the results on dataset 1. This is because the keywords provided in some journal articles are key phrases newly coined by the authors themselves, which the existing word segmentation technology cannot segment accurately, leading to inaccurate extraction results; this is also a direction for future study. It is also found that when the number of keywords is greater than 8, the extraction effect of the T6 method is better. Because of the influence of the text type, when the number of extracted keywords is small, the influence of word position is not dominant in the extraction process. However, as the number of extracted words increases, keywords in professional texts such as academic paper abstracts frequently appear at the beginning and end of the abstract; at this point, the advantage of the T6 method, which weights words by their position distribution, becomes more prominent and its extraction effect is better.
through Word2Vec can better improve the keyword method based on word position distribution weighting will
14 Computational Intelligence and Neuroscience
T2 9.79 6.82 8.03
T3 7.61 5.32 6.25
T4 6.39 4.51 5.29
3 T5 15.42 10.86 12.73
T6 15.42 10.89 12.75
T7 16.10 11.35 13.29
T8 16.41 11.57 13.56
RDD-WRank 16.44 11.67 13.63
T1 12.90 15.28 13.97
T2 7.96 9.27 8.55
T3 6.53 7.62 7.02
T4 5.96 6.99 6.43
5 T5 12.76 15.01 13.78
T6 13.57 16.02 14.68
T7 13.46 15.82 14.53
T8 13.06 15.33 14.09
RDD-WRank 14.41 17.06 15.60
T1 11.06 18.31 13.77
T2 6.88 11.24 8.53
T3 5.51 9.02 6.83
T4 5.74 9.46 7.14
7 T5 10.60 17.45 13.17
T6 11.78 19.49 14.66
T7 11.19 18.43 13.91
T8 10.90 17.95 13.55
RDD-WRank 12.32 20.43 15.35
T1 8.89 20.99 12.47
T2 5.68 13.24 7.94
T3 4.70 10.98 6.57
T4 4.82 11.32 6.75
10 T5 8.22 19.35 11.53
T6 10.36 24.55 14.55
T7 8.38 19.71 11.75
T8 8.49 19.98 11.90
RDD-WRank 9.75 23.09 13.69
Table 3: Experimental results of the nine algorithms based on dataset 3 (partial).
Keywords (pcs) Algorithm P (%) R (%) F (%)
T1 38.39 13.82 20.33
T2 44.38 15.98 23.50
T3 51.87 18.68 27.47
T4 5.24 1.89 2.78
3 T5 57.87 20.84 30.64
T6 52.81 19.02 27.96
T7 47.19 16.99 24.99
T8 31.84 11.46 16.86
RDD-WRank 57.87 20.84 30.64
T1 39.44 23.67 29.58
T2 42.25 25.35 31.69
T3 49.55 29.73 37.17
T4 5.73 3.44 4.30
5 T5 51.69 31.02 38.77
T6 47.30 28.39 35.48
T7 43.37 26.03 32.53
T8 35.62 21.38 26.72
RDD-WRank 52.47 31.49 39.36
Table 3: Continued.
Keywords (pcs) Algorithm P (%) R (%) F (%)
T1 40.29 33.85 36.79
T2 41.01 34.46 37.45
T3 44.38 37.29 40.53
T4 6.50 5.46 5.94
7 T5 46.71 39.24 42.65
T6 43.02 36.14 39.28
T7 42.46 35.67 38.77
T8 38.44 32.30 35.10
RDD-WRank 47.75 40.12 43.61
T1 37.42 44.91 40.82
T2 37.70 45.25 41.13
T3 37.58 45.11 41.01
T4 6.91 8.29 7.54
10 T5 39.04 46.86 42.60
T6 38.20 45.85 41.68
T7 37.98 45.58 41.43
T8 36.24 43.49 39.53
RDD-WRank 42.02 50.44 45.85
Figure 10: P value of each algorithm. (a) Results based on dataset 2. (b) Results based on dataset 3.
For each journal article, the number of keywords provided by the author generally remains around 3–6, so once the number of extracted keywords exceeds 6, the F-Measure of every comparison algorithm shows a downward trend; even so, the extraction effect of this algorithm on this dataset is still better than that of the other comparison algorithms.
Compared with dataset 2, the extraction results of each algorithm on dataset 3 are better. This is because the effective keywords proposed in reference [47] were used as a reference when manually labeling the keywords of dataset 3. Effective keywords here refer to the information in the comments that is valuable to users and businesses, and most of such key information is common vocabulary.
Figure 11: R value of each algorithm. (a) Results based on dataset 2. (b) Results based on dataset 3.
Figure 12: F-Measure of each algorithm. (a) Results based on dataset 2. (b) Results based on dataset 3.
Based on the above analysis, the precision (P), recall (R), and comprehensive evaluation index F-Measure of the algorithm in this paper are higher than those of the other comparison algorithms, which shows that the TextRank algorithm using word vector clustering based on rough data-deduction and fused with the Word2Vec model extracts keywords more effectively. The TF-IDF algorithm based on statistical characteristics and the other comparison algorithms essentially rely on word frequency and may preferentially extract frequently occurring words. However, in a document, and especially in Chinese text, the subject words do not always occur frequently. Therefore, the TextRank algorithm based on rough data-deduction starts from the text as a whole, expands the scope of association, increases the associated data, and establishes associations between words through rough data-deduction, which further improves the accuracy of the algorithm.
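To make the idea of vector clustering concrete, the following sketch illustrates word-vector clustering used to bias a TextRank-style ranking. It is only an illustration under assumed tooling (gensim, scikit-learn, and networkx), uses a toy corpus, and deliberately omits the rough data-deduction step that distinguishes RDD-WRank.

# Simplified sketch: cluster word vectors and use the weights to bias PageRank.
# NOT the authors' RDD-WRank implementation; corpus, window size, and cluster
# count are arbitrary choices for illustration.
import itertools
import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

docs = [["keyword", "extraction", "graph", "ranking", "cluster"],
        ["word", "vector", "cluster", "graph", "keyword"]]      # toy tokenized corpus

w2v = Word2Vec(sentences=docs, vector_size=32, window=2, min_count=1, seed=1)

tokens = docs[0]                                                # candidate words of one document
vectors = np.array([w2v.wv[t] for t in tokens])

# Cluster the word vectors; words close to their cluster centroid get a larger
# "voting" weight, which is fed to PageRank as a personalization vector.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(vectors)
dists = np.linalg.norm(vectors - km.cluster_centers_[km.labels_], axis=1)
weights = {t: 1.0 / (1.0 + d) for t, d in zip(tokens, dists)}

# Co-occurrence graph over a small sliding window of adjacent candidates.
g = nx.Graph()
g.add_nodes_from(tokens)
for a, b in itertools.combinations(range(len(tokens)), 2):
    if abs(a - b) <= 2:
        g.add_edge(tokens[a], tokens[b])

scores = nx.pagerank(g, alpha=0.85, personalization=weights)
print(sorted(scores, key=scores.get, reverse=True)[:3])         # top-3 candidate keywords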
7. Conclusions

Through research on text keyword extraction, it is found that the potential associations between words and external knowledge have a direct impact on keyword extraction results. Therefore, based on rough data-deduction, this paper proposes a TextRank keyword extraction algorithm combined with the Word2Vec model. The algorithm uses rough data-deduction to explore potential associations between candidate keywords and uses word embedding technology to integrate Word2Vec into the ranking, so that more external knowledge can be exploited. The experimental results show that the improved word vector clustering algorithm based on rough data-deduction takes into account the potential associations between candidate words and external knowledge, which further improves the accuracy of keyword extraction. In the next step, we will further refine and improve the rough data-deduction rules to obtain better extraction results.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported in part by the Tianyou Innovation Team of Lanzhou Jiaotong University (TY202003).

References
[1] S. Yan, L. Yan, and J. Wu, "Rough data-deduction based on the upper approximation," Information Sciences, vol. 373, pp. 308–320, 2016.
[2] Z. Ning and Z. Zhaozhao, "Criminisi image inpainting algorithm based on rough data-deduction," Laser and Optoelectronics Progress, vol. 56, no. 2, Article ID 021005, 2019.
[3] P. D. Turney, "Learning algorithms for keyphrase extraction," Information Retrieval, vol. 2, pp. 303–336, 2000.
[4] C. Florescu and C. Caragea, "A position-biased PageRank algorithm for keyphrase extraction," in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 4923–4924, San Francisco, CA, USA, February 2017.
[5] K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 60, no. 5, pp. 493–502, 2004.
[6] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[7] H. Jiaul, "A novel TF-IDF weighting scheme for effective ranking," in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 343–352, Dublin, Ireland, July 2013.
[8] G. Salton and C. T. Yu, "On the construction of effective vocabularies for information retrieval," ACM SIGPLAN Notices, vol. 10, pp. 48–60, 1973.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[10] T. H. Haveliwala, "Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 784–796, 2003.
[11] X. Ao and Q. Guo, "Chinese news keyword extraction algorithm based on TextRank and topic model," in Proceedings of the International Conference on Artificial Intelligence for Communications and Networks, pp. 334–341, Harbin, China, May 2019.
[12] M.-h. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, "Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery," Computer Speech and Language, vol. 28, no. 1, pp. 210–223, 2014.
[13] X. Wan and J. Xiao, "Single document keyphrase extraction using neighborhood knowledge," in Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI), pp. 855–860, Chicago, IL, USA, July 2008.
[14] W. D. Abilhoa and L. N. de Castro, "A keyword extraction method from twitter messages represented as graphs," Applied Mathematics and Computation, vol. 240, pp. 308–325, 2014.
[15] F. Boudin, "A comparison of centrality measures for graph-based keyphrase extraction," in Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 834–838, Nagoya, Japan, October 2013.
[16] A. Bougouin and F. Boudin, "TopicRank: graph-based topic ranking for keyphrase extraction," in Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 543–551, Nagoya, Japan, October 2013.
[17] Y. Fan, Y. S. Zhu, and Y.-J. Ma, "WS-rank: bringing sentences into graph for keyword extraction," in Proceedings of the Asia-Pacific Web Conference, pp. 474–477, Suzhou, China, September 2016.
[18] C. Florescu and C. Caragea, "PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents," 2017, https://aclanthology.org/P17-1102.
[19] Y. Gu and X. Tian, "Study on keyword extraction with LDA and TextRank combination," Data Analysis, Machine Learning and Knowledge Discovery, vol. 30, pp. 41–47, 2014.
[20] …, Amsterdam, The Netherlands, July 2007.
[21] J. Cao, Z. Jiang, M. Huang, and K. Wang, "A way to improve graph-based keyword extraction," in Proceedings of the 2015 IEEE International Conference on Computer and Communications (ICCC), pp. 166–170, Chengdu, China, October 2015.
[22] X. Xiao, Improvement of TextRank Algorithm Based on Basic-Level Category to Chinese Keyword Extraction, Central China Normal University, Wuhan, Hubei, China, 2017.
[23] Z. Liu and J. Xia, "Extracting keywords with TextRank and weighted word positions," Data Analysis, Machine Learning and Knowledge Discovery, vol. 2, pp. 74–79, 2018.
[24] X. Xu, X. Chai, B. Xie, S. Chen, and J. Wang, "Extraction of Chinese text summarization based on improved TextRank algorithm," Computer Engineering, vol. 45, pp. 273–277, 2019.
[25] J. Liu, D. Zou, X. Xing, and Y. Li, "Keyphrase extraction based on topic feature," Application Research of Computers, vol. 29, pp. 4224–4227, 2012.
[26] Q. Zeng, X. Hu, and L. Chao, "Extracting keywords with topic embedding and network structure analysis," Data Analysis, Machine Learning and Knowledge Discovery, vol. 3, pp. 52–60, 2019.
[27] A. Onan, "Sentiment analysis on product reviews based on weighted word embeddings and deep neural networks," Concurrency and Computation: Practice and Experience, vol. 33, no. 23, Article ID e5909, 2021.
[28] A. Onan, K. Serdar, and H. Bulut, "A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification," Information Processing and Management, vol. 53, no. 4, pp. 814–833, 2017.
[29] X. Zuo, S. Zhang, and J. Xia, "The enhancement of TextRank algorithm by using word2vec and its application on topic extraction," Journal of Physics: Conference Series, vol. 887, Article ID 012028, 2017.
[30] R. Mihalcea and P. Tarau, "TextRank: bringing order into text," in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411, Barcelona, Spain, July 2004.
[31] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford InfoLab, Stanford, CA, USA, 1999.
[32] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
[33] S. Sheetal and P. A. Kulkarni, "Graph based representation and analysis of text document: a survey of techniques," International Journal of Computing and Applications, vol. 96, p. 19, 2014.
[34] J.-Y. Chang and I.-M. Kim, "Analysis and evaluation of current graph-based text mining researches," Advanced Science and Technology Letters, vol. 42, pp. 100–103, 2013.
[35] Y. Li, S. C. K. Shiu, S. K. Pal, and J. N. K. Liu, "A rough set-based case-based reasoner for text categorization," International Journal of Approximate Reasoning, vol. 41, no. 2, pp. 229–255, 2006.
[36] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Springer Science & Business Media, 2012.
[37] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, AZ, USA, May 2013.
[38] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems, pp. 3111–3119, Lake Tahoe, NV, USA, December 2013.
[39] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling, "Fast collapsed Gibbs sampling for latent Dirichlet allocation," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 569–577, Las Vegas, NV, USA, August 2008.
[40] X. Tian, "Extracting keywords with modified TextRank model," Data Analysis, Machine Learning and Knowledge Discovery, vol. 1, no. 2, pp. 28–34, 2017.
[41] X. Zhu, S. Zhang, and J. Liu, "Word semantic similarity computation based on HowNet and CiLin," Journal of Chinese Information Processing, vol. 30, pp. 29–36, 2016.
[42] M. L. Littman, "Unsupervised learning of semantic orientation from a hundred-billion-word corpus," 2002, http://arxiv.org/abs/cs/0212012.
[43] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 1419–1424, Boston, MA, USA, July 2006.
[44] R. Qu, Y. Fang, W. Bai, and Y. Jiang, "Computing semantic similarity based on novel models of semantic representation using Wikipedia," Information Processing & Management, vol. 54, no. 6, pp. 1002–1021, 2018.
[45] C. Che, H. Zhao, X. Wu, D. Zhou, and Q. Zhang, "A word segmentation method of ancient Chinese based on word alignment," in Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing, pp. 761–772, Dunhuang, China, October 2019.
[46] X. Jin, S. Zhang, and J. Liu, "Word semantic similarity calculation based on word2vec," in Proceedings of the 2018 International Conference on Control, Automation and Information Sciences (ICCAIS), pp. 12–16, Hangzhou, China, October 2018.
[47] Z. Zhang and Z. Jin, "Extracting keywords from user comments: case study of Meituan," Data Analysis, Machine Learning and Knowledge Discovery, vol. 3, pp. 36–44, 2019.
[48] J. Zhao, Q. M. Zhu, G. D. Zhou, and L. Zhang, "Review of research in automatic keyword extraction," Journal of Software, vol. 28, pp. 2431–2449, 2017.
[49] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," The International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157–169, 2004.
[50] S. K. Biswas, M. Bordoloi, and J. Shreya, "A graph based keyword extraction model using collective node weight," Expert Systems with Applications, vol. 97, pp. 51–59, 2018.
[52] …, "… weighted TextRank," Data Analysis, Machine Learning and Knowledge Discovery, vol. 29, pp. 30–34, 2013.
[53] L. Yuepeng, J. Cui, and J. Junchuan, "A keyword extraction algorithm based on Word2vec," E-Science Technology and Application, vol. 6, pp. 54–59, 2015.