Network-Based Bag-Of-Words Model For Text Classification
ABSTRACT The rapidly developing internet and other media have produced a tremendous amount of text data, making it a challenging and valuable task to find more effective ways to analyze text data by machine. Text representation is the first step for a machine to understand text, and the most commonly used text representation method is the Bag-of-Words (BoW) model. To form the vector representation of a document, the BoW model matches and counts each element of the document separately, neglecting much of the correlation information among words. In this paper, we propose a network-based bag-of-words model, which captures high-level structural and semantic meaning of the words. Because the structural and semantic information of a network reflects the relationships between nodes, the proposed model can distinguish the relations among words. We apply the proposed model to text classification and compare its performance with different text representation methods on four document datasets. The results show that the proposed method achieves the best performance with high efficiency, and using the Eccentricity property of the network as the feature yields the highest accuracy. We also investigate the influence of different network structures in the proposed method. Experimental results reveal that, for text classification, the dynamic network is more suitable than the static network and the hybrid network.
I. INTRODUCTION
During the last decades, people have witnessed the impact of the advancement of information technology. The rapid development of social media on the internet has been producing more and more information, in which text information plays a significant role. Meanwhile, a typical scenario is how to classify text data into topic sets by computer so that people can conveniently search the data they want. The text classification task, which assigns documents to the best-suited topic, has drawn much attention from researchers.

A typical text classification work includes text preprocessing, feature selection, feature extraction, similarity computation, and classifier determination [1]. Although it is natural for people, with their advantage in understanding human language, to judge whether a document belongs to a particular topic simply by reading it, this process is not practical for a computer. So machine text classification starts with text representation, which transfers text data into a form that is convenient for computer processing. The most commonly used text representation method is the bag-of-words (BoW) model [2]–[4]. This model maps a document into a vector v = [x_1, x_2, ..., x_n], where x_i denotes the occurrence of the i-th word in the basic terms. The basic terms are collected from the datasets and are usually the top n highest-frequency words. The value of the occurrence feature can be binary, the term frequency, or TF-IDF. A binary value denotes whether the i-th word is present in a document, which ignores the weight of words. The term frequency is the number of occurrences of each word. Generally, a word with high frequency in a document carries the representative idea of this document, with the exception that some words may have high frequency across all documents. TF-IDF (term frequency-inverse document frequency) balances the weight of the words that always have a high frequency. It assumes that the importance of a word increases proportionally to its frequency in a document but is offset by its frequency in the whole corpus [5], [6].
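To make the three occurrence features concrete, the following is a minimal sketch of binary, term-frequency, and TF-IDF vectors for a toy corpus. It assumes scikit-learn's CountVectorizer and TfidfVectorizer; the library choice and the toy sentences are illustrative and not part of the paper's own pipeline description.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: each document becomes one row of the document-term matrix.
docs = ["a cat ate a small white mouse",
        "a small white mouse ate a cat"]

# Binary occurrence: 1 if the word appears in the document, 0 otherwise.
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(docs).toarray())

# Term frequency: raw count of each word in the document.
tf_vec = CountVectorizer()
print(tf_vec.fit_transform(docs).toarray())

# TF-IDF: term frequency weighted down by document frequency in the corpus.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(docs).toarray())
print(tfidf_vec.get_feature_names_out())  # the collected basic terms
```

Note that the two toy sentences receive identical rows under all three weightings, which is exactly the limitation discussed next.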
Though the BoW model is a useful and straightforward method for text representation, there are still some problems.
The value of x_i, whether in binary, term frequency, or TF-IDF form, is matched and counted without considering the influence of other words. So the processing of text data may lose much context information when correlated words are not dealt with. To illustrate this limitation, we provide two simple sentences as a toy example: Sen 1, "a cat ate a small white mouse;" Sen 2, "a small white mouse ate a cat." The basic terms are (cat, eat, small, white, mouse), and for both sentences, each word in the basic terms occurs once. The BoW model will project Sen 1 and Sen 2 to the same vector, i.e., v_1 = v_2 = [1, 1, 1, 1, 1], although the two sentences have opposite meanings.

In this paper, we adopt the network model to overcome the limitation of the BoW model mentioned above. The complex network is now attracting much attention in the study of real-world systems [7] (such as social systems, biological systems, and author systems). The advantage of the network model for analyzing text data is that, through network tools, one can gain insight into several features of texts, e.g., complexity [8] and symmetry [9]. By using the network model, we can take more context information of the text into account. To extend the application of the network model to BoW, we come up with a network-based strategy: Attribute of Network Extended to BoW (AEBoW). AEBoW maps documents to vectors in which the value of the corresponding word is replaced by the weight of the network node attribute. The main difference between AEBoW and BoW is that the value of x_i not only matches the frequency of the i-th basic term but also the role it plays in high-level features of the text, e.g., the structural and semantic differences. By using the Degree of the network model, the AEBoW model will project Sen 1 and Sen 2 to v_1 = [1, 2, 2, 2, 1] and v_2 = [1, 2, 1, 2, 2], which captures the difference in meaning of the two sentences (see details in section IV.F).

We summarize the main contributions of this paper as follows:
• We propose the AEBoW model to maintain correlated information among the words in the text.
• We demonstrate the efficiency of the AEBoW model by applying it to text classification. We also verify the performance of the proposed model by comparing it with seven text representation methods and the word embedding model (a deep learning method) on four different datasets.
• We present the results of the AEBoW model based on three kinds of network tools: the dynamic network, the static network, and the hybrid network.
• By comparing the performance of the AEBoW model based on different kinds of networks, we observe that the dynamic network is more suitable for the AEBoW model.

This paper is organized as follows. In Section II, we introduce some related works, including the studies on text representation and text complex networks. The proposed AEBoW model is presented in Section III. In Section IV, we give the experimental results on the performance of the proposed model and the comparison with different representation methods. We extend the proposed model to more possible applications in Section V. And, at last, we provide the concluding remarks in Section VI.

II. RELATED WORK
Because our work aims to incorporate the network model into BoW, in this section, we give a brief review of these two associated lines of work, respectively.

A. TEXT REPRESENTATION METHODS
In the field of text data mining, text representation is the keystone for the computer to understand text. Though the BoW model is simple and commonly used, it suffers from sparsity with high dimensionality and the loss of relations among words. To improve the BoW model, researchers have proposed methods such as latent semantic analysis (LSA) [10] and the topic model [11]. LSA applies the singular value decomposition (SVD) to transfer the original BoW representation to vectors with a lower dimension. If the original vectors are frequency-based, the transferred vectors are also approximately linearly related to the term frequency. The topic model, which attaches the probability distribution of words to the topic probability distribution, has a more mature mathematical foundation than LSA, but it is still a frequency-based method, which may not be able to capture the genuine semantic relations. Different from the BoW model, word embedding maps the words into dense and low-dimensional vectors through machine learning methods [12]–[14], e.g., multilayer neural networks. This kind of method can capture relations of words like "king + woman ≈ queen." Nevertheless, the mapped vectors are learned from a large corpus of text data, making this training process very time-consuming and highly dependent on the quality of the training corpus. There is also a representation model that combines the word embedding model and deep learning with BoW [15], which uses pre-trained word embeddings to obtain a fuzzy matching for the BoW model. The matching process is based on the whole set of basic terms, which is sometimes redundant (we will explain this in section IV). In this paper, the proposed AEBoW model is a combined method, which adopts the simplicity of the BoW model while considering the inner correlation of words through a network tool.

B. THE NETWORK MODEL FOR TEXT ANALYSIS
In recent years, more and more works study the network model for analyzing human language. A network is constructed from a series of nodes connected by their interrelations. The network model has been used for different complex systems because of its simplicity and generality. Without loss of generality, the networks of text share the same properties that are unveiled in other complex systems, like the small-world structure and scale-free phenomena [16]–[19]. Moreover, the network properties have been proved to be a powerful tool to capture the features of texts. The out-degrees, clustering coefficient, and deviation of network growth are
related to the text quality [20], while the community structures and weighted edges of the network can be used to detect the key segments [21], [22]. The topological properties of networks help enhance the performance of several tasks (author recognition [9], [23], text similarity [24], text summarization [25], text classification [26], and short text analysis [27], [28]). In recent years, an image analysis approach based on the network model has been proposed as a supplement to semantic-based applications, as the mesoscopic structure can reveal the visual "calligraphy" of a document [29]. The network model, when applied to text analysis, can capture subtle interactions among words, which provides richer information than the occurrence feature.

III. THE PROPOSED MODEL
In this section, the AEBoW model is presented. It should be noted that the underlying assumption is that the node properties of the complex network can reflect a node's specific relevancy among the other nodes. The linguistic units and the relations chosen to form edges have a particular influence on the structure of the network: they determine the topological configuration, which in turn affects the corresponding relevancy of nodes [18]. We introduce three different sub-structures of text complex networks: the static semantic network, the co-occurrence network, and the hybrid network. Moreover, based on the same text representation model, we compare the performance of these sub-structures in practical use in section IV.

Before going into details about the proposed model, the general steps to deal with specific problems using this model are summarized as follows:

STEP 1: Lemmatize all the words in the training data, and eliminate the stop-words. Lemmatization transfers the words into their original forms, e.g., nouns are converted to the singular form, and verbs are converted to the infinitive form. The stop-words are words that occur with high frequency but carry little useful semantic content.

STEP 2: For each text sample in the training data and test data, represent the text as a network (the type of network is a hyper-parameter). Then obtain the value of a particular network property at all nodes. Each node in a network is bound to a word in the corresponding text sample.

STEP 3: Represent each sample as a column vector, in which the value of each element is the network property of the corresponding node obtained in step 2. The value of a node that is not included in a text sample is replaced by '0' in the corresponding column vector. Note that the full word bag of a big training dataset is considerably large, which makes the column vector high-dimensional and sparse. One optional solution is to adopt the most used words in the datasets, called basic terms, to reduce the dimensionality.

STEP 4: Train the classifier using the vectors of the training data obtained from step 3 as inputs.

The above steps are presented as a flowchart in figure 3. The following part of this section goes into detail about the proposed model.

A. REPRESENTING TEXTS AS NETWORKS
Generally, the network model can be described as a graph with graph theory [16]. The undirected network that we adopt to represent text is represented as G = (N, E), where N = {n_1, n_2, ..., n_l} denotes the set of nodes (or vertices) and E = {e_1, e_2, ..., e_k} denotes the set of links between particular pairs of nodes. We can use an adjacency matrix A = (a_ij)_{l×l} to represent graph G, in which the element a_ij is defined as follows:

a_{ij} = \begin{cases} 1, & \text{if } (n_i, n_j) \in E \\ 0, & \text{if } (n_i, n_j) \notin E \end{cases}    (1)

Texts appropriately represented as networks are inventories of text units with organized relations among them. For example, when the text units are words, the relations among them may be semantic relations or their positional relations in actual language use [18]. Different organized relations may lead to different network structures for the same text. If the text network is modeled with words as nodes and the words' semantic relations as edges, this kind of network, called a static linguistic network, contains relatively fixed node relationships. Another kind of network, named the dynamic linguistic network, is modeled with the links being the natural co-occurrence of words in texts, reflecting much information on the actual language style.

This paper introduces the co-occurrence network as the sub-network of dynamic linguistic networks and the static semantic network as the sub-network of static linguistic networks. The co-occurrence network describes a text as a network in which the nodes (words) are joined when they co-occur within a distance [17]. Moreover, the static semantic network, describing texts as inventories of semantic relations, is constructed following the rule that two nodes (words) are connected when they are organized in the same class of a dictionary [18]—in this paper, this relationship is captured through the WordNet [30]. Based on the WordNet, words as nodes are linked when they are in the same word set with a hypernymy, meronymy (including the entailment of verbs), or synonymy relationship. As a combination of the above two kinds of networks, we propose the hybrid network, which contains the relations of both the static semantic network and the co-occurrence network. The hybrid network holds the information of both the dynamic network and the static network, making it more helpful in text classification work.

The process of text network construction starts with text preprocessing. Firstly, lemmatize the words [8] (e.g., nouns are converted to the singular form, and verbs are converted to the infinitive form). Then, eliminate words with little useful semantic content, named stop-words, because in some text processing tasks like classification these words are unhelpful and sometimes misleading [24]. Figure 1 shows three text networks of the following documents. A more detailed process to construct these network models is described in the Supplementary Information.
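As a concrete illustration of the co-occurrence construction described above, the following is a minimal sketch, assuming networkx and a co-occurrence window of one (adjacent words). The tokenization and the window size are illustrative assumptions, not the paper's exact settings; the window is a hyper-parameter of the dynamic network.

```python
import networkx as nx

def cooccurrence_network(tokens, window=1):
    """Undirected co-occurrence network: words are nodes, and an edge joins
    two words that appear within `window` positions of each other."""
    g = nx.Graph()
    g.add_nodes_from(tokens)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if word != tokens[j]:          # no self-loops for repeated words
                g.add_edge(word, tokens[j])
    return g

# Sen 1 from the introduction, after lemmatization and stop-word removal.
doc = "cat eat small white mouse".split()
g = cooccurrence_network(doc)
# Edges: cat-eat, eat-small, small-white, white-mouse.
print([g.degree(w) for w in ["cat", "eat", "small", "white", "mouse"]])
# -> [1, 2, 2, 2, 1], i.e., the Degree-based vector v_1 of Sen 1
```

With the window set to 1, only adjacent words are linked; larger windows densify the network and change the node properties, which is why the network type and its parameters are treated as hyper-parameters in STEP 2.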
To build the AEBoW representation, the kind of network (static, dynamic, or hybrid) should be pre-defined. The idea of the BoW model is used to collect the words among all the documents in binary form. Then the extracted properties are located at the corresponding places. We also list the procedure of AEBoW in Algorithm 1. An illustration of AEBoW by a toy example is shown in figure 4. The pseudo samples – d1, "A cat is sitting on the table while a dog is running towards it," and d2, "A cat and a dog were both sitting on the table, and the dog ran away later" – are represented as vectors of the AEBoW model. The vector mapped from d1 is [1, 2, 2, 2, 1, 0, 0] because the Degree of the nodes 'cat,' 'dog,' 'sit,' 'table,' and 'run' is 1, 2, 2, 2, 1, respectively, while 'away' and 'later' do not occur in d1. Similarly, the vector of d2 is projected.

Algorithm 1 AEBoW Framework
Inputs: Text corpus T including v documents, network property a, and network type g.
Outputs: Text vectors Z of T.
1. Collect the basic terms B based on the frequency with which words occur in T.
2. for d in T:
       construct the network g_d of d
       for node in g_d:
           if node in B:
               get the index i of node in B
               Z_{i,d} = f_a(node; g_d), the value of property a of node in g_d
           end if
       end for
       (entries of column d for basic terms not present in g_d remain 0)
   end for
3. return Z

FIGURE 4. A toy example of text representation based on the AEBoW model (for the Degree property) through the co-occurrence network.

The development of complex networks has induced various indexes for the observed properties of real networks, e.g., node degree, betweenness, and clustering [16]. Though there are various property measures, the experimental results show that not all of them are suitable for text classification. The following part of this sub-section introduces the network properties that perform well in the experimental results.

Degree: The degree k_i of a node i is the number of its neighbor nodes, or the edges incident with it, in the complex network. The Degree denotes the connectivity of a node, which shows its ability to integrate with other nodes. For an undirected graph, given the adjacency matrix A, the degree k_i of node i is defined as

k_i = \sum_{j=1}^{N} a_{ij},    (3)

where N is the size of matrix A, i.e., the number of nodes in the complex network. In the matrix A, the element a_ij is a binary value denoting whether node i and node j are connected through an edge.

Eccentricity: The eccentricity ec_i of a node i is the maximum distance from node i to the other nodes in the complex network. For a network G, the eccentricity ec_i is defined as

ec_i = \max_{j \in N \setminus \{n_i\}} l_{ij},    (4)

where l_ij is the shortest distance from node i to node j. In some cases, the text network may be disconnected, which means that the network contains more than one part without links between each other. In this paper, for convenience, we assume that the eccentricity ec_i in a network that is not connected is the maximum distance from node i to its reachable nodes.

PageRank: PageRank was initially designed for ranking web pages based on the directed graph [31]. The idea is that the more web pages point to a page, and the more critical the pointing pages are, the more weighted this pointed page is. The definition is a voting process, which needs recursive computing. The rank of a given node (page) i is defined to be

r(i) = \sum_{j \in P_i} \frac{r(j)}{num(j)},    (5)

where P_i is the set of nodes that point to i, and num(j) is the number of links that point out from j in graph G. For a start, we can arbitrarily assign the rankings to all the nodes of graph G, e.g., r_0(i) = 1/l, i ∈ N, and successively update the ranks of the nodes by (5). In this paper, we adapt this method to the undirected graph by assuming that each undirected edge (i, j) is equal to two directed edges i → j and j → i.

Accessibility: This concept measures the ability of a node to reach other nodes after h steps of self-avoiding random walks [38]. It is mathematically defined as

\alpha_h(i) = \exp\left( -\sum_{j} P^{(h)}(i, j) \log P^{(h)}(i, j) \right),    (6)

where P^{(h)}(i, j) denotes the probability that node i reaches node j after h steps. The accessibility measures the influence of a node in the complex network, i.e., the nodes playing more critical roles can usually access more neighbors.

IV. EXPERIMENTAL RESULTS
In this section, we apply the AEBoW model to text classification. The proposed method is compared with seven
text representation methods on four datasets. Furthermore, we also compare AEBoW with the deep learning algorithm at the end of this section.

A. DATASETS DESCRIPTION
There are four datasets used in the experiments.

20Newsgroups is a group of news posts with nearly 20000 documents and 20 news topics. This dataset is kindly preprocessed in [32], [33].

WebKB is collected from webpages by the World Wide Knowledge Base project [32]. The training data and testing data of these documents were predesignated in [33], [34]: 2803 documents for training and 1396 documents for testing.

Reuters 52 is extracted from Reuters 21578 by [32]. This dataset includes 52 categories, deleting some categories of Reuters 21578 that contain only a few documents.

Amazon Reviews contains 10000 labeled reviews with 2 categories. The original dataset can be found in [35].

We list the details of these datasets in table 1. Note that all the datasets are preprocessed by removing the stop-words and lemmatizing.

TABLE 1. The description of four datasets.

B. EXPERIMENTAL SETUP
The classification work is done by the KNN measure [36], and the similarity distance is computed through the cosine similarity [24]. Classification accuracy [37] is used to evaluate performance. Firstly, we briefly describe the KNN measure, cosine similarity, and classification accuracy.

FIGURE 5. An illustration of KNN. For k = 5, most of the nearest neighbors of the circle node Q belong to topic 2. So node Q is more likely to belong to topic 2.

KNN: The KNN (k-Nearest-Neighbors) is a simple and effective non-parametric classification method. The idea of KNN, as shown in figure 5, is that a node in space is more likely to be of the same type as the majority of its k nearest neighbors, which are found based on a particular similarity distance. Because this method is parameter-free except for k, and it is a lazy learning method, it is used in many applications.

Cosine similarity: The cosine similarity computes the similarity distance of two vectors in space. For vectors v_i = [v_{i1}, v_{i2}, ..., v_{il}] and v_j = [v_{j1}, v_{j2}, ..., v_{jl}], the cosine similarity is defined as

c_{ij} = \frac{v_i \cdot v_j}{|v_i||v_j|} = \frac{\sum_{k=1}^{l} v_{ik} v_{jk}}{\sqrt{\sum_{k=1}^{l} v_{ik}^2}\,\sqrt{\sum_{k=1}^{l} v_{jk}^2}}.    (7)

Classification Accuracy: The classification accuracy (CA) is defined as (8), denoting the accuracy of the predicted labels compared with the labels given in the test data. In (8), T is the document set of the test data and |T| is the number of documents in set T. E(p_i, g_i) = 1 if p_i = g_i (p_i denotes the predicted label of document i, while g_i is the given label in the test data corresponding to i), and E(p_i, g_i) = 0 if p_i ≠ g_i.

CA = \frac{\sum_{i \in T} E(p_i, g_i)}{|T|}    (8)

Train & Test: The training and testing processes both pre-compute the cosine similarity of documents using (7). Next, the similarity matrix is used as the input for nearest-neighbor searching. After the training step, the test data are all labeled with the trained model. Then the CA is obtained using (8).

We use the following seven text representation methods to compare the performance of the AEBoW model.

BoW: The BoW model is described in section I.

LSA: Latent Semantic Analysis [10] is a method to reduce the dimensionality based on BoW.

LDA: Latent Dirichlet Allocation [39].

Net-Local: A complex network method for text classification [26]. We label this method as Net-local, where "local" denotes the local strategy. We only choose the local strategy because the global strategy performs weakly in the experiments, which may be because the dimensionality of its representation vector is too low for big datasets.

AE: The average embedding for text representation [15]. AE represents a document as the average of all embeddings of the words in the document.

FBoW & FBoWC: FBoW is a fuzzy bag-of-words model [15], which conducts a fuzzy matching through word embeddings. This method is a word embedding based method. FBoWC is an extension of FBoW, which matches the clusters of word embeddings instead.

The word embedding based methods, including AE, FBoW, and FBoWC, use the data that are not lemmatized because the learning of word embeddings can distinguish all word types. The other methods use the data after lemmatization.
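A minimal sketch of the train & test procedure just described — cosine similarity as the distance, a KNN classifier, and classification accuracy (CA) as the score — is given below. It assumes scikit-learn; the function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_classify(train_vectors, train_labels, test_vectors, test_labels, k=15):
    """KNN with cosine distance, scored by classification accuracy (CA), Eq. (8)."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_vectors, train_labels)
    predicted = clf.predict(test_vectors)
    return accuracy_score(test_labels, predicted)

# Toy usage with random document vectors (replace with AEBoW/BoW/LSA/... vectors).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 3000)), rng.integers(0, 2, 100)
X_test, y_test = rng.random((40, 3000)), rng.integers(0, 2, 40)
print(knn_classify(X_train, y_train, X_test, y_test, k=15))
```

The value k = 15 is only a placeholder; as noted below, the paper searches the best k over {3, 6, ..., 30} for each method.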
The implementation of all the methods mentioned above is based on Python 3.7 in a Windows 10 environment. The configuration of the machine we used is an Intel Core i7-8565U CPU @ 1.80 GHz with 16.0 GB of memory. LSA, LDA, and BoW are based on the sklearn module. The word embeddings of AE are looked up from the pre-trained word embedding dictionary [40], and the words that are not in the word embedding dictionary are discarded. The AEBoW, FBoW, FBoWC, and Net-local methods all run with multiple threads within the limits of the memory.

The dimensionality of the representation vectors is set to 3000 for AEBoW, BoW, LSA, LDA, FBoW, and FBoWC. For the Net model, because the number of chosen properties is 8, we set the dimensionality of each property to 3000, so the concatenated vector has a dimensionality of 24000. The vector projected from AE has a dimensionality equal to that of the word embedding, which is set to 300 in this paper. Note that we search for the best k of KNN for each method, and the searching range is {3, 6, 9, 12, 15, 18, 21, 24, 27, 30}.

C. PERFORMANCE ANALYSIS
Based on the properties of the complex network, including the Degree (D), Eccentricity (E), PageRank (P), and Accessibility (A), we analyze the performance of the AEBoW model. The classification accuracy (CA) is obtained from the dynamic network (co-occurrence network), static network (static semantic network), and hybrid network, respectively. Then the best result for each property is selected. The obtained CA is shown in table 2.

First, we can observe that the BoW model is the fastest method, though its CA is relatively low. The increase in performance achieved by the other methods shows that time has to be sacrificed for accuracy. The other methods all considerably increase the time costs of text representation while increasing the performance. AEBoW gets the highest CA on 20Newsgroups, WebKB, and Reuters 52, while FBoWC gets the highest CA on Amazon Reviews.

TABLE 3. Time costs (s).
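The node properties compared here are extracted per document once its network is built. The following is a minimal sketch of that feature extraction (and of Algorithm 1) for Degree, Eccentricity, and PageRank, assuming networkx, an adjacent-word co-occurrence network, and the reachable-nodes convention for the eccentricity of disconnected networks stated in Section III; Accessibility is omitted because it needs the self-avoiding random-walk probabilities of Eq. (6). Function names are illustrative, not from the authors' code.

```python
import networkx as nx

def node_properties(g, prop="degree"):
    """Return {node: value} of a chosen property for one text network g."""
    if prop == "degree":                 # Eq. (3)
        return dict(g.degree())
    if prop == "pagerank":               # Eq. (5); undirected edges act as two directed ones
        return nx.pagerank(g)
    if prop == "eccentricity":           # Eq. (4), max distance to *reachable* nodes
        return {n: max(nx.shortest_path_length(g, n).values()) for n in g}
    raise ValueError(f"unknown property: {prop}")

def aebow_vector(tokens, basic_terms, prop="degree"):
    """Sketch of Algorithm 1 for a single preprocessed document."""
    g = nx.Graph(list(zip(tokens, tokens[1:])))   # adjacent-word co-occurrence network
    values = node_properties(g, prop)
    # Basic terms that do not occur in the document keep the value 0.
    return [values.get(term, 0) for term in basic_terms]

basic_terms = ["cat", "eat", "small", "white", "mouse"]
print(aebow_vector("cat eat small white mouse".split(), basic_terms, "degree"))
# -> [1, 2, 2, 2, 1], the Degree-based AEBoW vector of Sen 1 from the introduction
```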
TABLE 6. CA of AEBoW and the word embedding model.

TABLE 7. Time costs (s) of AEBoW and the word embedding model.

We list the time costs of both methods in table 7. From table 6 and table 7, we can observe that the AEBoW model outperforms the word embedding model on all four datasets. At the same time, the time costs of AEBoW are much smaller than those of the word embedding model. However, the word embedding model can accelerate its speed by running on more powerful GPUs, while AEBoW cannot. The results further certify that AEBoW can capture more information from text data.

V. DISCUSSION
From the experimental results, it can be observed that the AEBoW model gets good results with high efficiency in text classification. We believe that the application of AEBoW is not limited to text classification. There are some possible application scenarios for this model, including text interpretation, text clustering, text summarization, and identification of authorship. Next, we briefly describe each application. Furthermore, we also give some ideas for text interpretation.

Text Interpretation is the process of extracting high-level semantics from raw text data. The high-level semantics are the structured indexes for the raw text data.

Text Clustering is an unsupervised machine learning method that clusters documents with high similarity into categories. The AEBoW outputs can be directly used for clustering.

Text Summarization is to catch the key phrases of a document. A key phrase is always a bunch of words from the original document with complete syntax and content.

Identification of Authorship. Each author has his (her) own style of work. The author's style is reflected in the structure, words, or tone of his (her) work. This high-level information can be captured through the AEBoW model.

The following are some ideas about applying AEBoW to text interpretation. Text interpretation includes processing the unstructured text and extracting the high-level semantics. In the first step, the computer interprets a free text correctly into the surface-level form. The free text is analyzed through its syntactic structure and lexical meaning, and then the subsequent computation takes place. By using AEBoW, the surface level of raw text data can be preprocessed with a network tool, and AEBoW is applied to obtain extra structural and semantic information. In the second step, a series of indexes and complicated relations are derived from the surface-level information. The network model may further explain the patterns of the surface-level information, and the AEBoW model will produce the inputs of the instance object model, which maps the patterns from the surface-level meaning into high-level instance assertions.

It should be noted that the AEBoW model is only a complement to existing methods of text interpretation because there are limitations of AEBoW in grammar parsing and abduction. AEBoW will not capture the grammar and proper word meanings, so it is necessary to introduce a grammar parser and background knowledge.

The AEBoW model is a powerful network-based tool for text analysis, which can possibly be applied to different application scenarios. The introduction of the network model makes AEBoW capture high-level structural and semantic meaning of the text. The application of AEBoW may also need other state-of-the-art studies as a complement.

VI. CONCLUSION
In this paper, we have proposed the AEBoW model based on the complex network to represent text. The AEBoW is an improvement on the BoW model, taking the correlation of words reflected in the text network into consideration. The structure of a text network varies when different relations of words are considered to form an edge. We have introduced the dynamic network (co-occurrence network) and the static network (static semantic network). We have also proposed the hybrid network, which contains the relations of both the dynamic network and the static network. We have compared the performance of AEBoW with seven text representation methods in text classification.

Experimental results revealed that the proposed AEBoW could get the best performance with high efficiency. The best feature in AEBoW was the Eccentricity, which is a shortest-path-based property of the text network. Further analysis showed that, for most methods, the performance reaches its best when k is around 15 with KNN as the classifier; for the Eccentricity of AEBoW, the best accuracy exists at k = 21. The comparison of the three kinds of networks showed that the dynamic network is more suitable for text classification.

We have also investigated the performance of AEBoW against the deep learning algorithm. By comparing it with the
word embedding model, we certified the high efficiency and excellent performance of AEBoW.

The application of AEBoW is not limited to text classification. Future investigations will concentrate on using AEBoW in more text analysis tasks, e.g., text interpretation, text clustering, text summarization, and identification of authorship.
REFERENCES
[1] X. Deng, Y. Li, J. Weng, and J. Zhang, "Feature selection for text classification: A review," Multimedia Tools Appl., vol. 78, no. 3, pp. 3797–3816, Feb. 2019.
[2] H. Kim, P. Howland, and H. Park, "Dimension reduction in text classification with support vector machines," J. Mach. Learn. Res., vol. 6, no. 1, pp. 37–53, Jan. 2005.
[3] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in Proc. 14th Int. Conf. Mach. Learn., San Francisco, CA, USA, 1997, pp. 143–151.
[4] M. Lan, C. Lim Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 721–735, Apr. 2009.
[5] B. Trstenjak, S. Mikac, and D. Donko, "KNN with TF-IDF based framework for text categorization," in Proc. DAAAM, vol. 69, Zadar, Croatia: Univ. Zadar, Oct. 2014, pp. 1356–1364.
[6] D. Kim, D. Seo, S. Cho, and P. Kang, "Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec," Inf. Sci., vol. 477, pp. 15–29, Mar. 2019.
[7] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, "Critical phenomena in complex networks," Rev. Modern Phys., vol. 80, no. 4, pp. 1275–1335, Oct. 2008.
[8] D. R. Amancio, S. M. Aluisio, O. N. Oliveira, and L. D. F. Costa, "Complex networks analysis of language complexity," Europhys. Lett., vol. 100, no. 5, p. 58002, Dec. 2012.
[9] D. R. Amancio, F. N. Silva, and L. D. F. Costa, "Concentric network symmetry grasps authors' styles in word adjacency networks," Europhys. Lett., vol. 110, no. 6, p. 68001, Jun. 2015.
[10] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," J. Amer. Soc. Inf. Sci., vol. 41, no. 6, pp. 391–407, Sep. 1990.
[11] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 42, no. 1, pp. 177–196, Jan. 2001.
[12] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. EMNLP, Oct. 2014, pp. 1746–1751.
[13] X.-W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514–525, 2014.
[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Aug. 2011.
[15] R. Zhao and K. Mao, "Fuzzy bag-of-words model for document representation," IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 794–804, Apr. 2018.
[16] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, "Complex networks: Structure and dynamics," Phys. Rep., vol. 424, nos. 4–5, pp. 175–308, Feb. 2006.
[17] R. F. I. Cancho and R. V. Solé, "The small world of human language," Proc. R. Soc. London B, vol. 268, pp. 2261–2265, Nov. 2001.
[18] J. Cong and H. Liu, "Approaching human language with complex networks," Phys. Life Rev., vol. 11, no. 4, pp. 598–618, Dec. 2014.
[19] S. M. G. Caldeira, T. C. P. Lobao, R. F. S. Andrade, A. Neme, and J. G. V. Miranda, "The network of concepts in written texts," Eur. Phys. J. B, vol. 49, no. 4, pp. 523–529, Aug. 2005.
[20] L. Antiqueira, M. G. V. Nunes, O. N. Oliveira, Jr., and L. D. F. Costa, "Strong correlations between text quality and complex networks features," Phys. A, Stat. Mech. Appl., vol. 373, pp. 811–820, Jan. 2007.
[21] H. F. de Arruda, L. D. F. Costa, and D. R. Amancio, "Topic segmentation via community detection in complex networks," Chaos, Interdiscipl. J. Nonlinear Sci., vol. 26, no. 6, Jun. 2016, Art. no. 063120.
[22] M. Garg and M. Kumar, "Identifying influential segments from word co-occurrence networks using AHP," Cognit. Syst. Res., vol. 47, pp. 28–41, Jan. 2018.
[23] C. Akimushkin, D. R. Amancio, and O. N. Oliveira, "Text authorship identified using the dynamics of word co-occurrence networks," PLoS ONE, vol. 12, no. 1, 2017, Art. no. e0170527.
[24] D. R. Amancio, O. N. Oliveira, Jr., and L. D. F. Costa, "Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts," Phys. A, Stat. Mech. Appl., vol. 391, no. 18, pp. 4406–4419, Sep. 2012.
[25] L. Antiqueira, O. N. Oliveira, L. D. F. Costa, and M. D. G. V. Nunes, "A complex network approach to text summarization," Inf. Sci., vol. 179, no. 5, pp. 584–599, Feb. 2009.
[26] H. F. de Arruda, L. D. F. Costa, and D. R. Amancio, "Using complex networks for text classification: Discriminating informative and imaginative documents," Europhys. Lett., vol. 113, no. 2, p. 28007, Jan. 2016.
[27] D. R. Amancio, "Probing the topological properties of complex networks modeling short written texts," PLoS One, vol. 10, no. 2, 2015, Art. no. e0118394.
[28] D. Yan, K. Li, and J. Ye, "Correlation analysis of short based on network model," Phys. A, Stat. Mech. Appl., vol. 531, Oct. 2019, Art. no. 121728.
[29] H. F. de Arruda, V. Q. Marinho, T. S. Lima, D. R. Amancio, and L. D. F. Costa, "An image analysis approach to text analytics based on complex networks," Phys. A, Stat. Mech. Appl., vol. 510, pp. 110–120, Nov. 2018.
[30] M. Sigman and G. A. Cecchi, "Global organization of the Wordnet lexicon," Proc. Nat. Acad. Sci. USA, vol. 99, no. 3, pp. 1742–1747, Feb. 2002.
[31] A. N. Langville and C. D. Meyer, "A survey of eigenvector methods for Web information retrieval," SIAM Rev., vol. 47, no. 1, pp. 135–161, Jan. 2005.
[32] M. Craven, D. Freitag, A. McCallum, and T. Mitchell, "Learning to extract symbolic knowledge from the World Wide Web," in A Comprehensive Survey of Text Mining, M. W. Berry, Ed. Heidelberg, Germany: Springer, 2003.
[33] [Online]. Available: http://ana.cachopo.org/datasets-for-single-label-text-categorization
[34] A. Cardoso-Cachopo, "Improving methods for single-label text categorization," Ph.D. dissertation, Instituto Superior Técnico, Lisboa, Portugal, Oct. 2007.
[35] [Online]. Available: https://gist.github.com/kunalj101
[36] G. D. Gao, H. Wang, D. Bell, Y. X. Bi, and K. Greer, "KNN model-based approach in classification," in Proc. OTM Int. Conf. CoopIS DOA ODBASE, Catania, Italy, Nov. 2003, pp. 986–996.
[37] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 26, no. 7, pp. 1575–1590, Jul. 2014.
[38] G. F. de Arruda, A. L. Barbieri, P. M. Rodríguez, F. A. Rodrigues, Y. Moreno, and L. D. F. Costa, "Role of centrality for the identification of influential spreaders in complex networks," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 90, no. 3, Sep. 2014, Art. no. 032812.
[39] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[40] [Online]. Available: https://nlp.stanford.edu/projects/glove/

DONGYANG YAN was born in Xuchang, Henan, in 1993. He is currently pursuing the Ph.D. degree in system science with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. His main research interest is natural language processing with complex networks and other methods.

KEPING LI is currently a Professor with the State Key Laboratory of Rail Traffic Control and Safety. He was elected in the New Century Talent Supporting Project by the Education Ministry. His main research interests include modeling of the complex network system and modeling, analysis, and optimization of rail transit systems.

SHUANG GU was born in Yichun, Heilongjiang, in 1994. She is currently pursuing the Ph.D. degree with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. Her main research interest is complex networks.

LIU YANG was born in Guizhou, in 1990. She received the B.Sc. degree in information and computing science and the master's degree in logistics engineering from Guizhou University. She is currently pursuing the Ph.D. degree with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. Her main research interest is complex networks and risk analysis in transportation.