
Received April 11, 2020, accepted April 23, 2020, date of publication April 28, 2020, date of current version May 15, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.2991074

Network-Based Bag-of-Words Model for Text Classification

DONGYANG YAN, KEPING LI, SHUANG GU, AND LIU YANG
State Key Laboratory of Rail Traffic Control and Safety, Beijing 100044, China
Corresponding author: Keping Li ([email protected])
This work was supported in part by the Beijing Natural Science Foundation under Grant 8202039, and in part by the National Key Research and Development Program of China under Grant 2017YFB1201105.
The associate editor coordinating the review of this manuscript and approving it for publication was Dominik Strzalka.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

ABSTRACT The rapidly developing internet and other media have produced a tremendous amount of
text data, making it a challenging and valuable task to find a more effective way to analyze text data by
machine. Text representation is the first step for a machine to understand the text, and the commonly used text
representation method is the Bag-of-Words (BoW) model. To form the vector representation of a document,
the BoW model separately matches and counts each element in the document, neglecting much correlation
information among words. In this paper, we propose a network-based bag-of-words model, which collects
high-level structural and semantic meaning of the words. Because the structural and semantic information of
a network reflects the relationship between nodes, the proposed model can distinguish the relation of words.
We apply the proposed model to text classification and compare the performance of the proposed model
with different text representation methods on four document datasets. The results show that the proposed
method achieves the best performance with high efficiency. Using the Eccentricity property of the network as the feature yields the highest accuracy. We also investigate the influence of different network structures in
the proposed method. Experimental results reveal that, for text classification, the dynamic network is more
suitable than the static network and the hybrid network.

INDEX TERMS Bag-of-words, classification, complex network, text correlation, KNN.

I. INTRODUCTION
During the last decades, people have witnessed the impact of the advancement of information technology. The rapid development of social media on the internet has been producing more and more information, in which text information plays a significant role. A typical scenario is how to classify text data into topic sets by computer so that people can conveniently search for the data they want. The text classification task, which assigns documents to the best-suited topic, has drawn much attention from researchers.

A typical text classification workflow includes text preprocessing, feature selection, feature extraction, similarity computation, and classifier determination [1]. Owing to their advantage in understanding human language, it is natural for people to judge whether a document belongs to a particular topic directly by reading and understanding it, but this process is not practical for a computer. So text classification by a computer starts with text representation, which transfers text data into a form that is convenient for computer processing. The commonly used text representation method is the bag-of-words (BoW) model [2]-[4]. This model maps a document into a vector v = [x1, x2, ..., xn], where xi denotes the occurrence of the ith word in the basic terms. The basic terms are collected from the datasets and are usually the top n highest-frequency words. The value of the occurrence feature can be binary, term frequency, or TF-IDF. A binary value denotes whether the ith word is present in a document, which disregards the weight of words. The term frequency is the number of occurrences of each word. Generally, a word with high frequency in a document carries a representative idea of that document, with the exception that some words may have high frequency across all documents. TF-IDF (term frequency-inverse document frequency) balances the weight of words that always have a high frequency. It assumes that the importance of a word increases proportionally to its frequency in a document but is offset by its frequency in the whole corpus [5], [6].
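To make the occurrence features above concrete, the following is a minimal sketch (ours, not the authors' code) of how binary, term-frequency, and TF-IDF versions of a BoW vector can be produced with scikit-learn; the two-sentence corpus and the choice of five basic terms are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["a cat ate a small white mouse",
            "a small white mouse ate a cat"]

    # Term-frequency BoW: keep only the top-n most frequent words as basic terms.
    tf_vec = CountVectorizer(max_features=5)        # n = 5 basic terms
    tf = tf_vec.fit_transform(docs)                 # sparse document-term matrix

    # Binary BoW: 1 if the ith basic term is present in the document, 0 otherwise.
    bin_vec = CountVectorizer(max_features=5, binary=True)
    binary = bin_vec.fit_transform(docs)

    # TF-IDF: down-weights words that are frequent across the whole corpus.
    tfidf_vec = TfidfVectorizer(max_features=5)
    tfidf = tfidf_vec.fit_transform(docs)

    print(tf_vec.vocabulary_)   # the basic terms and their column indices
    print(tf.toarray())         # both rows come out identical here, a point discussed below

Note that both sentences yield the same term-frequency vector, which previews the limitation of the plain BoW model discussed in the next paragraphs.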


Though the BoW model is a useful and straightforward method for text representation, there are still some problems. The value of xi, whether in binary, term-frequency, or TF-IDF form, is matched and counted without considering the influence of other words. So the processing of text data may lose much context information when correlated words are not dealt with. To illustrate this limitation, we provide two simple sentences as a toy example: Sen 1, "a cat ate a small white mouse;" Sen 2, "a small white mouse ate a cat." The basic terms are (cat, eat, small, white, mouse), and for both sentences, each word in the basic terms occurs once. The BoW model will project Sen 1 and Sen 2 to the same vector, i.e., v1 = v2 = [1, 1, 1, 1, 1], though the two sentences have the opposite meaning.

In this paper, we adopt the network model to overcome the limitation of the BoW model mentioned above. The complex network is now attracting much attention in the study of real-world systems [7] (such as social systems, biological systems, and author systems). The advantage of the network model for analyzing text data is that, through network tools, one can gain an inside view of several features of texts, e.g., complexity [8] and symmetry [9]. By using the network model, we can take more context information of the text into account. To extend the application of the network model to BoW, we come up with a network-based strategy: Attribute of Network Extended to BoW (AEBoW). AEBoW maps documents to vectors in which the value of the corresponding word is replaced by the weight of a network node attribute. The main difference between AEBoW and BoW is that the value of xi will not only reflect the frequency of the ith basic term but also the role it plays in high-level features of the text, e.g., structural and semantic differences. By using the Degree property of the network model, the AEBoW model will project Sen 1 and Sen 2 to v1 = [1, 2, 2, 2, 1] and v2 = [1, 2, 1, 2, 2], which captures the meaning difference between the two sentences (see details in Section IV.F).

We summarize the main contributions of this paper as follows:
- We propose the AEBoW model to maintain correlated information among the words in the text.
- We demonstrate the efficiency of the AEBoW model by applying it to text classification. We also verify the performance of the proposed model by comparing it with seven text representation methods and the word embedding model (deep learning method) on four different datasets.
- We present the results of the AEBoW model based on three kinds of network tools: the dynamic network, the static network, and the hybrid network.
- By comparing the performance of the AEBoW model based on different kinds of networks, we observe that the dynamic network is more suitable for the AEBoW model.

This paper is organized as follows. In Section II, we introduce some related works, including studies on text representation and text complex networks. The proposed AEBoW model is presented in Section III. In Section IV, we give the experimental results on the performance of the proposed model and the comparison with different representation methods. We extend the proposed model to more possible applications in Section V. Finally, we provide the concluding remarks in Section VI.

II. RELATED WORK
Because our work aims to incorporate the network model into BoW, in this section we give a brief review of these two associated lines of work.

A. TEXT REPRESENTATION METHODS
In the field of text data mining, text representation is the keystone for the computer to understand text. Though the BoW model is simple and commonly used, it suffers from sparsity with high dimensionality and from the loss of relations among words. To improve the BoW model, researchers have proposed methods such as latent semantic analysis (LSA) [10] and the topic model [11]. LSA applies singular value decomposition (SVD) to transfer the original BoW representation into vectors with a lower dimension. If the original vectors are frequency-based, the transferred vectors are also approximately linearly related to the term frequency. The topic model, which attaches the probability distribution of words to a topic probability distribution, has a more mature mathematical foundation than LSA, but it is still a frequency-based method, which may not be able to capture the genuine semantic relations. Different from the BoW model, word embedding maps words into dense and low-dimensional vectors through machine learning methods [12]-[14], e.g., multilayer neural networks. This kind of method can capture relations of words like "king + woman ≈ queen." Nevertheless, the mapped vectors are learned from a large corpus of text data, making the training process very time-consuming and highly dependent on the quality of the training corpus. There is also a representation model that combines the word embedding model and deep learning with BoW [15], which uses pre-trained word embeddings to obtain a fuzzy matching for the BoW model. The matching process is based on the whole set of basic terms, which is sometimes redundant (we will explain this in Section IV). In this paper, the proposed AEBoW model is a combined method, which adopts the simplicity of the BoW model while considering the inner-correlation of words through a network tool.
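As a rough illustration of the LSA step mentioned above, the following sketch (an assumption on our part, not the authors' implementation) applies truncated SVD to a BoW matrix with scikit-learn to obtain lower-dimensional document vectors; the toy corpus and the two latent dimensions are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the table",
            "a dog ran towards the cat",
            "the company hired a new chief"]

    bow = CountVectorizer().fit_transform(docs)    # sparse BoW matrix (documents x terms)

    # LSA: truncated SVD of the BoW matrix, keeping only k latent dimensions.
    lsa = TruncatedSVD(n_components=2, random_state=0)
    doc_vectors = lsa.fit_transform(bow)           # dense (documents x 2) representation
    print(doc_vectors)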


B. THE NETWORK MODEL FOR TEXT ANALYSIS
In recent years, more and more works have studied the network model for analyzing human language. A network is constructed from a series of nodes connected by their interrelations. The network model has been used for many different complex systems because of its simplicity and generality. Without loss of generality, the networks of text share the same properties unveiled in other complex systems, such as the small-world structure and scale-free phenomena [16]-[19]. Moreover, the network properties have been proved to be a powerful tool to capture the features of texts. The out-degrees, clustering coefficient, and deviation of network growth are related to text quality [20], while the community structures and weighted edges of the network can be used to detect key segments [21], [22]. The topological properties of networks help enhance the performance of several tasks (author recognition [9], [23], text similarity [24], text summarization [25], text classification [26], and short text analysis [27], [28]). In recent years, an image analysis approach based on the network model has been proposed as a supplement to semantic-based applications, as the mesoscopic structure can reveal the visual "calligraphy" of a document [29]. The network model, when applied to text analysis, can capture subtle interactions among words, which provides richer information than the occurrence feature.

III. THE PROPOSED MODEL
In this section, the AEBoW model is presented. The underlying assumption is that the node properties of the complex network can reflect a node's specific relevancy among the other nodes. Of particular influence on the structure of the network, the linguistic units and the relations that form edges determine the topological configuration, which affects the corresponding relevancy of nodes [18]. We introduce three different sub-structures of text complex networks: the static semantic network, the co-occurrence network, and the hybrid network. Moreover, based on the same text representation model, we compare the performance of these sub-structures in practical use in Section IV.

Before going into details about the proposed model, the general steps to deal with specific problems using this model are summarized as follows:
STEP 1: Lemmatize all the words in the training data, and eliminate the stop-words. Lemmatization transfers the words into their original forms, e.g., nouns are converted to the singular form, and verbs are converted to the infinitive form. The stop-words are words that occur with high frequency but carry little useful semantic content.
STEP 2: For each text sample in the training data and test data, represent the text as a network (the type of network is a hyperparameter). Then get the value of a particular network property at all nodes. Each node in a network is bound to a word in the corresponding text sample.
STEP 3: Represent each sample as a column vector, in which the value of each element is the network property of the corresponding node obtained in STEP 2. The value of a node that is not included in a text sample is replaced by '0' in the corresponding column vector. Note that the full word bag of a big training dataset is considerably large, which makes the column vectors high-dimensional and sparse. One optional solution is to adopt the most used words in the datasets, called the basic terms, to reduce the dimensionality.
STEP 4: Train the classifier using the vectors of the training data obtained from STEP 3 as inputs.
The above steps are presented as a flowchart in figure 3. The following part of this section goes into detail about the proposed model.

A. REPRESENTING TEXTS AS NETWORKS
Generally, the network model can be described as a graph in terms of graph theory [16]. The undirected network that we adopt to represent text is generally represented as G = (N, E), where N = {n1, n2, ..., nl} denotes the set of nodes (or vertices) and E = {e1, e2, ..., ek} denotes the set of links between particular pairs of nodes. We can use an adjacency matrix A = (aij) of size l x l to represent graph G, in which the element aij is defined as follows:

    a_{ij} = \begin{cases} 1 & \text{if } (n_i, n_j) \in E \\ 0 & \text{if } (n_i, n_j) \notin E \end{cases}    (1)

Texts appropriately represented as networks are inventories of text units with organized relations among them. For example, when the text units are words, the relations among them may be semantic relations or their positional relations in actual language use [18]. Different organized relations may lead to different network structures for the same text. If the text network is modeled with the words as nodes and the words' semantic relations as edges, this kind of network, called a static linguistic network, contains relatively fixed node relationships. Another kind of network, named a dynamic linguistic network, is modeled with the links being the naturally occurring co-occurrences of words in texts, reflecting much information on actual language style.

This paper introduces the co-occurrence network as a sub-network of dynamic linguistic networks and the static semantic network as a sub-network of static linguistic networks. The co-occurrence network describes a text as a network in which the nodes (words) are joined when they co-occur within a distance [17]. Moreover, the static semantic network, describing a text as an inventory of semantic relations, is constructed following the rule that two nodes (words) are connected when they are organized in the same class of a dictionary [18] (in this paper, this relationship is captured through WordNet [30]). Based on WordNet, words (as nodes) are linked when they are in the same word set with a hypernymy, meronymy (including the entailment of verbs), or synonymy relationship. As a combination of the above two kinds of networks, we propose the hybrid network, which contains the relations of both the static semantic network and the co-occurrence network. The hybrid network holds the information of both the dynamic network and the static network, making it more helpful in text classification work.

The process of text network construction starts with text preprocessing. Firstly, lemmatize the words [8] (e.g., nouns are converted to the singular form, and verbs are converted to the infinitive form). Then, eliminate words with little useful semantic content, which are named stop-words, because in some text processing tasks like classification these words are unhelpful and sometimes misleading [24]. Figure 1 shows the three text networks of the following documents. A more detailed process to construct these network models is described in the Supplementary Information.
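For concreteness, a minimal sketch of how a co-occurrence network of the kind described above could be built; networkx is our assumption (the paper does not name a library), and a co-occurrence distance of one position (adjacent words after preprocessing) is an illustrative choice.

    import networkx as nx

    def cooccurrence_network(tokens, window=1):
        """Join two words with an edge when they co-occur within `window` positions."""
        g = nx.Graph()
        g.add_nodes_from(set(tokens))
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if w != tokens[j]:          # avoid self-loops for repeated words
                    g.add_edge(w, tokens[j])
        return g

    # Already lemmatized and stop-word filtered (STEP 1 of the framework).
    doc = ["cat", "sit", "table", "dog", "run"]
    g = cooccurrence_network(doc)
    print(dict(g.degree()))   # {'cat': 1, 'sit': 2, 'table': 2, 'dog': 2, 'run': 1}

A static semantic network would instead add edges from WordNet relations (synonymy, hypernymy, meronymy), and the hybrid network would take the union of both edge sets.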


FIGURE 1. Three networks modeled by various unit relationships. (a) Co-occurrence network; (b) Static semantic network; (c) Hybrid network.

(1) This handsome man has a beautiful wife.
(2) He owns a medicine factory and a dog.
(3) This beautiful woman likes her poodle.
(4) The pretty girl is the chief of this company.

Figure 1(a) is a co-occurrence network, and figure 1(b) is a static semantic network. In figure 1(b), "handsome" and "beautiful" are synonyms; "dog" has a hypernymy relationship with "poodle." As a mixed form of both types of networks, figure 1(c) shows a hybrid network with static and dynamic relations, which to some extent contains complementary information.

B. TO AVOID ISOLATED NODES IN THE STATIC SEMANTIC NETWORK
The above-mentioned static network of text is an ideal model for the static property: from the view of the formation process, the edges between the words have already been pre-defined in the corpus (WordNet). However, in some short texts, this kind of static network contains many isolated nodes, e.g., "factory," "medicine," and "own" in figure 1. Not only are these isolated nodes not helpful in text analysis, but they cause computing problems in a network model; e.g., the calculation of some network properties requires that the network be connected. To deal with this problem, we make the following assumptions:
1. The static semantic network is not allowed to contain isolated nodes.
2. If the semantic relevancy in WordNet is not enough to avoid the existence of isolated nodes, the nodes with no edges are randomly connected to form a circle, i.e., the isolated nodes form a sub-network in which every node has two neighbors.
3. The isolated nodes are connected to the other nodes following the law that nodes are more likely to link to the nodes with more neighbors.
With the above assumptions, we construct the static semantic network used in this paper, as shown in figure 2. Note that assumptions 2 and 3 do not have a complete theoretical explanation but are only made to avoid isolated nodes without losing the unique information of the other nodes. Assumption 2 guarantees that the isolated nodes are homogeneous (the nodes in the sub-network of isolated nodes all have two neighbors). Assumption 3 retains the disassortativity of text networks [18], which means the weakly linked nodes are more likely to attach to nodes with a large degree. The nodes connected by edges formed from semantic relevancy are the same as figure 1(b) shows, and the other nodes, which have no neighbors, are connected to the network with the laws described in assumptions 2 and 3.

FIGURE 2. Example of the static semantic network with laws to avoid isolated nodes.

C. AEBoW: A REPRESENTATION OF THE INTER-CORRELATION AMONG WORDS
The AEBoW (Attribute of Network Extended to BoW) model is a simple extension of the BoW model, where the mapped vectors contain elements whose value is a particular attribute of the network. The attributes of the network, also named the properties, are the fundamental quantities used to describe the structural properties (or topology) of a network.

For a document (denoted as d, with the corresponding network model gd), the representation by AEBoW is zd = [z1^d, z2^d, ..., zn^d], where zi^d is defined as

    z_i^d = \begin{cases} f_{g_d}^{a}(w_i) & \text{if } w_i \in d \\ 0 & \text{otherwise} \end{cases}    (2)

In (2), f_{g_d}^{a} is the function that returns the value of an individual node for the property a under the network model gd; wi denotes the ith word in the basic terms.

FIGURE 3. AEBoW framework steps.
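A minimal sketch of the mapping in (2), reusing the co-occurrence builder sketched earlier and the Degree property (networkx and the helper names are our assumptions, not the authors' code):

    def aebow_vector(doc_tokens, basic_terms, build_network, node_property):
        """Eq. (2): z_i = f^a_{g_d}(w_i) if basic term w_i occurs in document d, else 0."""
        g = build_network(doc_tokens)        # g_d: the network of document d
        values = node_property(g)            # {node: property value} over g_d
        return [values.get(w, 0) for w in basic_terms]

    # Degree as the chosen property a; other properties plug in the same way.
    def degree(g):
        return dict(g.degree())

    basic_terms = ["cat", "dog", "sit", "table", "run", "away", "later"]
    d1 = ["cat", "sit", "table", "dog", "run"]
    z1 = aebow_vector(d1, basic_terms, cooccurrence_network, degree)
    print(z1)   # [1, 2, 2, 2, 1, 0, 0], matching the toy example in figure 4

Words absent from the document get 0, as in STEP 3 of the framework; swapping the property function is all that is needed to use Eccentricity, PageRank, or Accessibility instead of Degree.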


We show the process of the AEBoW model in figure 3. Firstly, the documents are transformed into networks, and the kind of network (static, dynamic, or hybrid) should be pre-defined. The idea of the BoW model is used to collect the words among all the documents in binary form. Then the extracted properties are located at the corresponding places. We also list the procedure of AEBoW in Algorithm 1. An illustration of AEBoW by a toy example is shown in figure 4. The pseudo samples d1 ("A cat is sitting on the table while a dog is running towards it") and d2 ("A cat and a dog were both sitting on the table, and the dog ran away later") are represented as vectors of the AEBoW model. The vector mapped from d1 is [1, 2, 2, 2, 1, 0, 0] because the Degree of the nodes 'cat,' 'dog,' 'sit,' 'table' and 'run' is 1, 2, 2, 2, 1, respectively, while 'away' and 'later' do not occur in d1. Similarly, the vector of d2 is projected.

Algorithm 1 AEBoW Framework
Inputs: Text corpus T including v documents, network property a, and the network type g.
Outputs: Text vectors Z of T.
1. Collect the basic terms B based on the frequency that words occur in T.
2. for d in T:
       Construct the network gd of d
       for node in gd:
           Get the index i of node in Z
           if node in B:
               Z_i^d = f_{g_d}^{a}(node)
           else:
               Z_i^d = 0
           end if
       end for
   end for
3. return Z

FIGURE 4. A toy example of text representation based on the AEBoW model (for the Degree property) through the co-occurrence network.

The development of complex networks has induced various indexes for the observed properties of real networks, e.g., node degree, betweenness, and clustering [16]. Though there are various property measures, the experimental results show that not all of them are suitable for text classification. The following part of this sub-section introduces the network properties that perform well in the experimental results.

Degree: The degree ki of a node i is the number of its neighbor nodes, or equivalently the number of edges incident with it, in the complex network. The Degree denotes the connectivity of a node, which shows its ability to integrate with other nodes. For an undirected graph, given the adjacency matrix A, the degree ki of node i is defined as

    k_i = \sum_{j=1}^{N} a_{ij},    (3)

where N is the size of matrix A, i.e., the number of nodes in the complex network. In the matrix A, the element aij is a binary value denoting whether node i and node j are connected through an edge.

Eccentricity: The eccentricity eci of a node i is the maximum distance from node i to the other nodes in the complex network. For a network G, the eccentricity eci is defined as

    ec_i = \max_{j \in N \setminus n_i} l_{ij},    (4)

where lij is the shortest distance from node i to node j. In some cases, the text network may be disconnected, which means that the network contains more than one part without links between them. In this paper, for convenience, we assume that the eccentricity eci in a network that is not connected is the maximum distance from node i to its reachable nodes.

PageRank: PageRank was initially designed for ranking web pages based on the directed graph [31]. The idea is that the more web pages point to a page and the more critical the pointing pages are, the more weighted this pointed page is. The definition is a voting process, which needs recursive computing. The rank of a given node (page) i is defined to be

    r(i) = \sum_{j \in P_i} \frac{r(j)}{num(j)},    (5)

where Pi is the set of nodes that point to i, and num(j) is the number of links that point out from j in graph G. For a start, we can arbitrarily assign the ranking to all the nodes of graph G, e.g., r0(i) = 1/l, i ∈ N, and successively update the ranks of the nodes by (5). In this paper, we adapt this method to the undirected graph by assuming that each undirected edge (i, j) is equal to two directed edges i → j and j → i.

Accessibility: This concept measures the number of nodes that a node can effectively reach after h steps of self-avoiding random walks [38]. It is mathematically defined as

    \alpha_h(i) = \exp\left( -\sum_{j} P^{(h)}(i, j) \log P^{(h)}(i, j) \right),    (6)

where P^{(h)}(i, j) denotes the probability that node i reaches node j after h steps. The accessibility measures the influence of a node in the complex network, i.e., the nodes playing more critical roles usually can access more neighbors.
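For reference, a sketch of how these properties could be obtained in practice. The Degree, Eccentricity, and PageRank calls use networkx (our assumption); the accessibility below follows the entropy formula in (6) but uses ordinary rather than self-avoiding random walks, which is a simplification of the measure in [38].

    import numpy as np
    import networkx as nx

    def node_properties(g, h=2):
        # Assumes no isolated nodes (see Section III-B).
        degree = dict(g.degree())
        # Eccentricity per connected component, matching the paper's convention
        # for disconnected networks (maximum distance to reachable nodes only).
        ecc = {}
        for comp in nx.connected_components(g):
            ecc.update(nx.eccentricity(g.subgraph(comp)))
        # networkx treats each undirected edge as two directed edges, as in the text.
        pagerank = nx.pagerank(g)
        # Accessibility, eq. (6): entropy of the h-step transition probabilities.
        nodes = list(g.nodes())
        A = nx.to_numpy_array(g, nodelist=nodes)
        P = A / A.sum(axis=1, keepdims=True)   # one-step transition matrix
        Ph = np.linalg.matrix_power(P, h)
        acc = {}
        for idx, n in enumerate(nodes):
            p = Ph[idx][Ph[idx] > 0]
            acc[n] = float(np.exp(-(p * np.log(p)).sum()))
        return degree, ecc, pagerank, acc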


IV. EXPERIMENTAL RESULTS
In this section, we apply the AEBoW model to text classification. The proposed method is compared with seven text representation methods on four datasets. Furthermore, we also compare AEBoW with the deep learning algorithm at the end of this section.

A. DATASETS DESCRIPTION
There are four datasets used in the experiments.
20Newsgroups is a group of news with nearly 20000 documents and 20 news topics. This dataset is kindly preprocessed in [32], [33].
WebKB is collected from webpages by the World Wide Knowledge Base project [32]. The training data and testing data of these documents were predesignated in [33], [34]: 2803 documents for training and 1396 documents for testing.
Reuters 52 is extracted from Reuters 21578 by [32]. This dataset includes 52 categories, deleting some categories of Reuters 21578 that contain only a few documents.
Amazon Reviews contains 10000 labeled reviews with 2 categories. The original dataset can be found in [35].
We list the details of these datasets in table 1. Note that all the datasets are preprocessed by removing the stop words and lemmatizing.

TABLE 1. The description of four datasets.

B. EXPERIMENTAL SETUP
The classification work is done by the KNN measure [36], and the similarity distance is computed through the cosine similarity [24]. Classification accuracy [37] is used to evaluate performance. Firstly, we briefly describe the KNN measure, cosine similarity, and classification accuracy.

FIGURE 5. An illustration of KNN. For k = 5, most of the nearest neighbors of the circle node Q belong to topic 2. So node Q is more likely to belong to topic 2.

KNN: The KNN (k-Nearest-Neighbors) is a simple and effective non-parametric classification method. The idea of KNN, as shown in figure 5, is that a node in space is more likely to be of the same type as the nodes that occur most among its k nearest neighbors, which are found based on a particular similarity distance. Because this method is parameter-free except for k, making it a lazy learning method, it is used in many applications.

Cosine similarity: The cosine similarity computes the similarity distance of two vectors in space. For vectors vi = [vi1, vi2, ..., vil] and vj = [vj1, vj2, ..., vjl], the cosine similarity is defined as

    c_{ij} = \frac{v_i \cdot v_j}{|v_i||v_j|} = \frac{\sum_{k=1}^{l} v_{ik} v_{jk}}{\sqrt{\sum_{k=1}^{l} v_{ik}^2}\,\sqrt{\sum_{k=1}^{l} v_{jk}^2}}.    (7)

Classification Accuracy: The classification accuracy (CA) is defined as (8), denoting the accuracy of the predicted labels compared with the labels given in the test data. In (8), T is the document set of the test data and |T| is the number of documents in set T. E(pi, gi) = 1 if pi = gi (pi denotes the predicted label of document i, while gi is the given label in the test data corresponding to i), and E(pi, gi) = 0 if pi ≠ gi.

    CA = \frac{\sum_{i \in T} E(p_i, g_i)}{|T|}.    (8)

Train & Test: The training and testing processes both pre-compute the cosine similarity of documents using (7). Next, the similarity matrix is used as the input for nearest-neighbor searching. After the training step, the test data are all labeled with the trained model. Then the CA is obtained using (8).

We use the following seven text representation methods to compare the performance of the AEBoW model.
BoW: The BoW model is described in Section I.
LSA: Latent Semantic Analysis [10] is a method to reduce the dimensionality based on BoW.
LDA: Latent Dirichlet Allocation [39].
Net-Local: A complex network method for text classification [26]. We label this method as Net-local, where "local" denotes the local strategy. We only choose the local strategy because the global strategy performs weakly in the experiments, which may be because the dimensionality of its representation vector is too low for big datasets.
AE: The average embedding for text representation [15]. AE represents a document as the average of all embeddings of the words in the document.
FBoW & FBoWC: FBoW is a fuzzy bag-of-words model [15], which conducts a fuzzy matching through word embeddings. This method is a word-embedding-based method. FBoWC is an extension of FBoW, which matches the clusters of word embeddings instead.
The word-embedding-based methods, including AE, FBoW, and FBoWC, use the data that are not lemmatized because the learning of word embeddings can distinguish all word types. The other methods use the data after lemmatization.
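The classification pipeline implied by (7) and (8) can be sketched roughly as follows; using scikit-learn's KNeighborsClassifier with a precomputed cosine-distance matrix is our assumption about one convenient implementation, not a description of the authors' code.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.neighbors import KNeighborsClassifier

    def knn_classify(train_vecs, train_labels, test_vecs, k=15):
        # Eq. (7): cosine similarity, converted to a distance for the KNN search.
        dist_train = 1.0 - cosine_similarity(train_vecs)
        dist_test = 1.0 - cosine_similarity(test_vecs, train_vecs)
        knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
        knn.fit(dist_train, train_labels)
        return knn.predict(dist_test)

    def classification_accuracy(pred, gold):
        # Eq. (8): fraction of test documents whose predicted label matches the given label.
        return float(np.mean(np.asarray(pred) == np.asarray(gold)))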


The implementation of all the methods mentioned above is based on Python 3.7 in a Windows 10 environment. The configuration of the machine we used is an Intel Core i7-8565U CPU @ 1.80 GHz with 16.0 GB of memory. LSA, LDA, and BoW are based on the sklearn module. The word embeddings of AE are looked up from the pre-trained word embedding dictionary [40], and the words that are not in the word embedding dictionary are discarded. The AEBoW, FBoW, FBoWC, and Net-local methods all run with multiple threads within the limits of the memory.

The dimensionality of the representation vectors is set to 3000 for AEBoW, BoW, LSA, LDA, FBoW, and FBoWC. For the Net model, because the number of chosen properties is 8, we set the dimensionality of each property to 3000, so the concatenated vector has a dimensionality of 24000. The vector projected from AE has a dimensionality equal to that of the word embedding, which is set to 300 in this paper. Note that we search for the best k of KNN for each method, and the searching range is {3, 6, 9, 12, 15, 18, 21, 24, 27, 30}.

C. PERFORMANCE ANALYSIS
Based on the properties of the complex network, including the Degree (D), Eccentricity (E), PageRank (P), and Accessibility (A), we analyze the performance of the AEBoW model. The classification accuracy (CA) is obtained from the dynamic network (co-occurrence network), static network (static semantic network), and hybrid network, respectively. Then the best result for each property is selected. The obtained CA is shown in table 2.

TABLE 2. Classification accuracy (CA).

With the same environment, we also record the running time of every method. The results are listed in table 3. Note that the time costs are only counted for the vector projecting process, i.e., the counted period is after the data preprocessing and before the classification.

First, we can observe that the BoW model is the fastest method, though its CA is relatively low. The increase in performance by the other methods shows that time needs to be sacrificed for accuracy: the other methods all considerably increase the time costs of text representation while increasing the performance. AEBoW gets the highest CA on 20Newsgroups, WebKB, and Reuters 52, while FBoWC gets the highest CA on Amazon Reviews.

TABLE 3. Time costs (s).

LSA, LDA, FBoW, and FBoWC are all dimensionality reduction methods. Among these methods, LSA has the lowest time consumption; its accuracy, however, is not competitive. LDA is an iterative approach, the time cost of which is counted within 100 iterations. It can be observed that the CA of LDA can outperform LSA on specific datasets, though its time consumption is always much higher than LSA. The FBoW model and FBoWC model get better accuracy than LSA and LDA. Though FBoWC is better than FBoW, the increase in CA is not acceptable when considering the sharp increase in time cost. The time consumption of FBoWC includes two parts. The first part is the operation of k-means clustering (FBOWC-c in table 3), which rapidly increases with the explosion of the vocabulary size of the datasets.
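Since the reported time costs cover only the vector-projection step, a simple way to measure that window is sketched below; the project_vectors helper is a hypothetical stand-in for any of the representation methods being compared.

    import time

    def timed_projection(project_vectors, documents):
        """Measure only the representation step: after preprocessing, before classification."""
        start = time.perf_counter()
        vectors = project_vectors(documents)   # e.g. the AEBoW or BoW projector
        elapsed = time.perf_counter() - start
        return vectors, elapsed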


The second part is the similarity computation between the clusters and the words of a document (mean, max, and min in table 3). This process makes the similarity calculation repeat thousands of times (the number of word embedding clusters) more than FBoW, in small batches, which causes the increase in time costs. Note that the time costs of FBoWC are counted in cases where four threads are used (the other methods use eight threads) to avoid running out of memory.

AE, FBoW, and FBoWC are word-embedding-based methods, which all use the pre-trained word embeddings during word matching. AE is a simple application of word embedding, which represents a document by simply summing up the embeddings of the words in the document. This simple operation loses much high-level information. The results show that, in some cases, the CA of AE is worse than BoW.

AEBoW and Net-local are network-based methods. The main difference between AEBoW and Net-local is that AEBoW uses an individual property as the features and uses the BoW idea to collect them. In contrast, Net-local uses different properties that reflect the symmetry of the network and concatenates them as features. Net-local can be seen as a particular case of AEBoW in which several properties are chosen and the top-k features are concatenated. However, using too many local properties cannot always improve the performance of text classification, and on the contrary it reduces the efficiency. The results show that the CA of AEBoW is better than Net-local, and the time costs of AEBoW are much smaller than Net-local.

AEBoW, FBoW, and FBoWC are based on the BoW model. The difference is that AEBoW is still a sparse representation like BoW, while FBoW and FBoWC avoid this limitation by fuzzy matching. However, from the results, we see that the dense representation may not always entirely reflect the right discriminative information for text classification. On the other hand, the dense representation only shows its advantage when it is used for dimensionality reduction. If the dimensionality is set to be equal in the experiments, the sparse characteristic can reduce memory consumption by converting the representation into sparse form (in Python 3.7, we can use the scipy.sparse module). In contrast, the dense representation cannot use such tools to reduce memory needs. We also tried to use a lower dimensionality for FBoW and FBoWC, but this causes a performance reduction. We can also observe that FBoW and FBoWC need more time to process data than AEBoW. We can ascribe this to the difference in the matching approach. The properties of the network model are calculated by matching only the words contained in a document, which sometimes only requires matching the neighbors, e.g., for Degree and Accessibility. On the contrary, FBoW needs to calculate the similarity between each word in a document and all the basic terms. Because the basic terms always contain many more words than a document, the time costs are much higher than AEBoW.

D. COMPARISON AMONG THREE KINDS OF NETWORKS
Next, we compare the performance of AEBoW based on the three kinds of networks. Figure 6 lists the CA on the four datasets.

FIGURE 6. CA of three kinds of networks based on Degree, Eccentricity, PageRank, and Accessibility.

First, figure 6 shows that the Eccentricity property can always perform well on all the datasets. It is the only property that produces high CA on all three kinds of networks. The other properties all behave poorly on the static network. We can also observe that the hybrid network can perform a little better on the WebKB and Amazon Reviews datasets, which indicates that combining the relations of both the static network and the dynamic network can improve the performance of AEBoW in some instances. However, there is no such thing as a free lunch: the hybrid network cannot always perform the best.

As table 3 shows, AEBoW on the dynamic network has the best efficiency compared with the hybrid network and the static network. At the same time, the dynamic network produces competitive results on all four datasets. So the dynamic network is more suitable than the static network and the hybrid network for the AEBoW model.

E. THE INFLUENCE OF K OF KNN IN TEXT CLASSIFICATION
Figure 7 shows the CA of every method in the searching range of k. The results are obtained from the WebKB dataset.
As shown in figure 7, the accuracy reaches its best at a different k for each method, which is the reason that we adopt a searching range of k to select the best results. Most methods reach their best performance when k is around 15, while AEBoW is an exception: the Eccentricity gets the highest CA at k = 21.
From figure 7, we can also observe that the results of LSA and BoW follow nearly the same trends, which indicates that LSA is a linear mapping of BoW with a dimensionality reduction approach. Among the four dimensionality reduction methods (LSA, LDA, FBoW, FBoWC), only FBoW and FBoWC get a satisfactory improvement compared with BoW.
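The sparse-form memory saving mentioned in the performance analysis above can be realized with scipy.sparse; the sketch below is illustrative only, and the matrix sizes are placeholders rather than the experimental data.

    import numpy as np
    from scipy import sparse

    dense = np.zeros((1000, 3000))          # 1000 documents x 3000 basic terms
    dense[::50, ::30] = 1.0                 # only a few non-zero property values per row

    sparse_rep = sparse.csr_matrix(dense)   # keep only the non-zero entries
    dense_bytes = dense.nbytes
    sparse_bytes = (sparse_rep.data.nbytes + sparse_rep.indices.nbytes
                    + sparse_rep.indptr.nbytes)
    print(dense_bytes, sparse_bytes)        # the sparse form is far smaller here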


FIGURE 7. Accuracy varying with k of KNN (based on the WebKB dataset).

The accuracy of the Eccentricity keeps the best among the three kinds of text networks, and PageRank follows. The results show that some features in AEBoW have a steady performance regardless of the kind of network, and that the correlation of words based on the network model can reflect more information in text classification than that not based on the network model. The Degree and Accessibility properties are both local structural properties, and their CA is relatively low, indicating that the high-level information of words needs a non-local strategy to extract.

F. HOW DOES AEBoW WORK
The experimental results show that the AEBoW model can outperform the BoW model in specific tasks. In this section, we discuss part of the reason that the properties of the complex network can perform better.
In the complex network, the nodes affect each other through the links between them. Even two nodes that are not directly linked can influence each other through a particular path. The addition or deletion of an edge in the complex network will affect a series of nodes. This character gives the complex network the ability to capture text structure and semantic change in various ways, and therefore makes it suitable for processing text data.

TABLE 4. Sentence representation of BoW and AEBoW.

To further explain this characteristic without complex math symbols, we use Sen 1 and Sen 2 mentioned in Section I as a toy example. The different vector forms of these two sentences are listed in table 4. With the BoW model, one gets the same vector to represent the two sentences because they contain the same words from the basic terms. However, the two sentences have the opposite meaning. For the AEBoW model, all four properties of the complex network capture the difference between the two sentences.

G. COMPARISON WITH DEEP LEARNING ALGORITHM
The above experiments are all based on KNN. Next, we also compare the performance of AEBoW with the word embedding model based on the deep learning algorithm. The deep learning algorithm is deployed on TensorFlow 2.0.

TABLE 5. Deep learning structure of the word embedding and AEBoW.

In this experiment, the AEBoW and word embedding models are both applied with the deep learning algorithm. Note that we use different deep learning algorithms for the two models because the word embedding model has a corresponding algorithm in deep learning [12] that AEBoW does not fit. The structure of the deep learning algorithm for the two models is listed in table 5. For AEBoW, the inputs are the vectors, and three dense layers follow. Dense layer 1 and dense layer 2 activate their outputs with the Rectified Linear Unit (relu). For the word embedding model, the inputs are the documents after labeling and padding (symbolize the words and pad all documents to the same length). The embedding layer transfers each word to a vector, the dimensionality of which is 300. The outputs of the embedding layer are convolved by a 1D convolution layer, of which the filter size is 5. The convolution layer produces 300 filters with relu activation, and the max-pooling layer downsamples the outputs. After downsampling and flattening, the dense layer is used for classification. Note that the dense layers (except the output layer) and the convolution layer all use biases. Dropout is used before the output layer with a rate of 0.5.
We use the Adam optimization algorithm to update the parameters, with the mini-batch size set to 32 and the learning rate set to 1e-03. The number of training epochs is set to 5, and 10% of the training data are selected for cross-validation. The results are listed in table 6. The AEBoW model is based on the dynamic network.
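Following the description above, the two architectures might look roughly like this in TensorFlow 2.0 / Keras. Only the details stated in the text (embedding dimension 300, filter size 5, 300 filters, relu, dropout 0.5, Adam at 1e-03, batch size 32, 5 epochs, 10% validation) are taken from the paper; the dense layer widths, vocabulary size, sequence length, and class count are placeholders.

    import tensorflow as tf
    from tensorflow.keras import layers

    def aebow_model(input_dim=3000, num_classes=20):
        # Three dense layers on top of the projected AEBoW vectors; relu on the first two.
        return tf.keras.Sequential([
            layers.Dense(256, activation="relu", input_shape=(input_dim,)),
            layers.Dense(128, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ])

    def word_embedding_model(vocab_size=20000, seq_len=400, num_classes=20):
        # Embedding (dim 300) -> 1D convolution (300 filters, size 5, relu) -> max pooling
        # -> flatten -> dense classifier, with dropout before the output layer.
        return tf.keras.Sequential([
            layers.Embedding(vocab_size, 300, input_length=seq_len),
            layers.Conv1D(300, 5, activation="relu"),
            layers.MaxPooling1D(),
            layers.Flatten(),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ])

    model = aebow_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)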


TABLE 6. CA of AEBoW and the word embedding model.

TABLE 7. Time costs (s) of AEBoW and the word embedding model.

The main part of the time costs differs between AEBoW and the word embedding model. For AEBoW, projecting vectors is done before training the deep learning model. On the contrary, the two steps are finished simultaneously for the word embedding model. Thus the training for AEBoW is much faster than for the word embedding model. We list the time costs of both methods in table 7.
From table 6 and table 7, we can observe that the AEBoW model outperforms the word embedding model on all four datasets. At the same time, the time costs of AEBoW are much smaller than those of the word embedding model. However, the word embedding model can accelerate its speed by running on more powerful GPUs, while AEBoW cannot. The results further certify that AEBoW can capture more information from the text data.

V. DISCUSSION
From the experimental results, it can be observed that the AEBoW model gets good results with high efficiency in text classification. We believe that the application of AEBoW will not be limited to text classification. There are some possible application scenarios of this model, including text interpretation, text clustering, text summarization, and identification of authorship. Next, we briefly describe each application. Furthermore, we also give some ideas for text interpretation.
Text Interpretation is the process of extracting high-level semantics from the raw text data. The high-level semantics are the structured indexes for the raw text data.
Text Clustering is an unsupervised machine learning method to cluster documents with high similarity into categories. The AEBoW outputs can be directly used for clustering.
Text Summarization is to catch the key phrases of a document. A key phrase is always a bunch of words from the original document with complete syntax and content.
Identification of Authorship. Each author has his (her) own style in their work. The author's style is reflected in the structure, words, or tone of his (her) work. This high-level information can be captured through the AEBoW model.

The following are some ideas about applying AEBoW to text interpretation. Text interpretation includes processing the unstructured text and extracting the high-level semantics. In the first step, the computer interprets a free text correctly into a surface-level form. The free text is analyzed through its syntactic structure and lexical meaning, and then the subsequent computation takes place. By using AEBoW, the surface level of the raw text data can be preprocessed with a network tool, and AEBoW is applied to obtain extra structural and semantic information. In the second step, a series of indexes and complicated relations are derived from the surface-level information. The network model may further explain the patterns of the surface-level information, and the AEBoW model will produce the inputs of the instance object model, which maps the patterns from the surface-level meaning into high-level instance assertions.
It should be noted that the AEBoW model is only a complement to existing methods of text interpretation because there are limitations for AEBoW in grammar parsing and abduction. AEBoW will not capture grammar and proper word meanings, so it is necessary to introduce a grammar parser and background knowledge.
The AEBoW model is a powerful network-based tool for text analysis, which can possibly be applied to different application scenarios. The introduction of the network model makes AEBoW capture high-level structural and semantic meaning of the text. The application of AEBoW may also need other state-of-the-art studies as a complement.

VI. CONCLUSION
In this paper, we have proposed the AEBoW model based on the complex network to represent text. AEBoW is an improvement on the BoW model, taking the correlation of words reflected in the text network into consideration. The structure of a text network varies when different relations of words are considered to form an edge. We have introduced the dynamic network (co-occurrence network) and the static network (static semantic network). We have also proposed the hybrid network that contains the relations of both the dynamic network and the static network. We have compared the performance of AEBoW with seven text representation methods in text classification.
Experimental results revealed that the proposed AEBoW could get the best performance with high efficiency. The best feature in AEBoW was the Eccentricity, which is a shortest-path-based property of the text network. Further analysis showed that, for most methods, the performance reaches its best when k is around 15 with KNN as the classifier; for the Eccentricity of AEBoW, the best accuracy exists at k = 21. The comparison of the three kinds of networks showed that the dynamic network is more suitable for text classification.


We have also investigated the performance of AEBoW with the deep learning algorithm. By comparing it with the word embedding model, we certified the high efficiency and excellent performance of AEBoW.
The application of AEBoW is not limited to text classification. Future investigations will be concentrated on using AEBoW in more text analysis tasks, e.g., text interpretation, text clustering, text summarization, and identification of authorship.

REFERENCES
[1] X. Deng, Y. Li, J. Weng, and J. Zhang, "Feature selection for text classification: A review," Multimedia Tools Appl., vol. 78, no. 3, pp. 3797-3816, Feb. 2019.
[2] H. Kim, P. Howland, and H. Park, "Dimension reduction in text classification with support vector machines," J. Mach. Learn. Res., vol. 6, no. 1, pp. 37-53, Jan. 2005.
[3] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in Proc. 14th Int. Conf. Mach. Learn., San Francisco, CA, USA, 1997, pp. 143-151.
[4] M. Lan, C. Lim Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 721-735, Apr. 2009.
[5] B. Trstenjak, S. Mikac, and D. Donko, "KNN with TF-IDF based framework for text categorization," in Proc. DAAAM, vol. 69, Zadar, Croatia: Univ. Zadar, Oct. 2014, pp. 1356-1364.
[6] D. Kim, D. Seo, S. Cho, and P. Kang, "Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec," Inf. Sci., vol. 477, pp. 15-29, Mar. 2019.
[7] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, "Critical phenomena in complex networks," Rev. Modern Phys., vol. 80, no. 4, pp. 1275-1335, Oct. 2008.
[8] D. R. Amancio, S. M. Aluisio, O. N. Oliveira, and L. D. F. Costa, "Complex networks analysis of language complexity," Europhys. Lett., vol. 100, no. 5, p. 58002, Dec. 2012.
[9] D. R. Amancio, F. N. Silva, and L. D. F. Costa, "Concentric network symmetry grasps authors' styles in word adjacency networks," Europhys. Lett., vol. 110, no. 6, p. 68001, Jun. 2015.
[10] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," J. Amer. Soc. Inf. Sci., vol. 41, no. 6, pp. 391-407, Sep. 1990.
[11] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 42, no. 1, pp. 177-196, Jan. 2001.
[12] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. EMNLP, Oct. 2014, pp. 1746-1751.
[13] X.-W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514-525, 2014.
[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493-2537, Aug. 2011.
[15] R. Zhao and K. Mao, "Fuzzy bag-of-words model for document representation," IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 794-804, Apr. 2018.
[16] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, "Complex networks: Structure and dynamics," Phys. Rep., vol. 424, nos. 4-5, pp. 175-308, Feb. 2006.
[17] R. F. I. Cancho and R. V. Solé, "The small world of human language," Proc. R. Soc. London B, vol. 268, pp. 2261-2265, Nov. 2001.
[18] J. Cong and H. Liu, "Approaching human language with complex networks," Phys. Life Rev., vol. 11, no. 4, pp. 598-618, Dec. 2014.
[19] S. M. G. Caldeira, T. C. P. Lobao, R. F. S. Andrade, A. Neme, and J. G. V. Miranda, "The network of concepts in written texts," Eur. Phys. J. B, vol. 49, no. 4, pp. 523-529, Aug. 2005.
[20] L. Antiqueira, M. G. V. Nunes, O. N. Oliveira, Jr., and L. D. F. Costa, "Strong correlations between text quality and complex networks features," Phys. A, Stat. Mech. Appl., vol. 373, pp. 811-820, Jan. 2007.
[21] H. F. de Arruda, L. D. F. Costa, and D. R. Amancio, "Topic segmentation via community detection in complex networks," Chaos, Interdiscipl. J. Nonlinear Sci., vol. 26, no. 6, Jun. 2016, Art. no. 063120.
[22] M. Garg and M. Kumar, "Identifying influential segments from word co-occurrence networks using AHP," Cognit. Syst. Res., vol. 47, pp. 28-41, Jan. 2018.
[23] C. Akimushkin, D. R. Amancio, and O. N. Oliveira, "Text authorship identified using the dynamics of word co-occurrence networks," PLoS ONE, vol. 12, no. 1, 2017, Art. no. e0170527.
[24] D. R. Amancio, O. N. Oliveira, Jr., and L. D. F. Costa, "Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts," Phys. A, Stat. Mech. Appl., vol. 391, no. 18, pp. 4406-4419, Sep. 2012.
[25] L. Antiqueira, O. N. Oliveira, L. D. F. Costa, and M. D. G. V. Nunes, "A complex network approach to text summarization," Inf. Sci., vol. 179, no. 5, pp. 584-599, Feb. 2009.
[26] H. F. de Arruda, L. D. F. Costa, and D. R. Amancio, "Using complex networks for text classification: Discriminating informative and imaginative documents," Europhys. Lett., vol. 113, no. 2, p. 28007, Jan. 2016.
[27] D. R. Amancio, "Probing the topological properties of complex networks modeling short written texts," PLoS One, vol. 10, no. 2, 2015, Art. no. e0118394.
[28] D. Yan, K. Li, and J. Ye, "Correlation analysis of short text based on network model," Phys. A, Stat. Mech. Appl., vol. 531, Oct. 2019, Art. no. 121728.
[29] H. F. de Arruda, V. Q. Marinho, T. S. Lima, D. R. Amancio, and L. D. F. Costa, "An image analysis approach to text analytics based on complex networks," Phys. A, Stat. Mech. Appl., vol. 510, pp. 110-120, Nov. 2018.
[30] M. Sigman and G. A. Cecchi, "Global organization of the Wordnet lexicon," Proc. Nat. Acad. Sci. USA, vol. 99, no. 3, pp. 1742-1747, Feb. 2002.
[31] A. N. Langville and C. D. Meyer, "A survey of eigenvector methods for Web information retrieval," SIAM Rev., vol. 47, no. 1, pp. 135-161, Jan. 2005.
[32] M. Craven, D. Freitag, A. McCallum, and T. Mitchell, "Learning to extract symbolic knowledge from the World Wide Web," in A Comprehensive Survey of Text Mining, M. W. Berry, Ed. Heidelberg, Germany: Springer, 2003.
[33] [Online]. Available: http://ana.cachopo.org/datasets-for-single-label-text-categorization
[34] A. Cardoso-Cachopo, "Improving methods for single-label text categorization," Ph.D. dissertation, Instituto Superior Técnico, Lisboa, Portugal, Oct. 2007.
[35] [Online]. Available: https://gist.github.com/kunalj101
[36] G. D. Gao, H. Wang, D. Bell, Y. X. Bi, and K. Greer, "KNN model-based approach in classification," in Proc. OTM Int. Conf. CoopIS DOA ODBASE, Catania, Italy, Nov. 2003, pp. 986-996.
[37] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 26, no. 7, pp. 1575-1590, Jul. 2014.
[38] G. F. de Arruda, A. L. Barbieri, P. M. Rodríguez, F. A. Rodrigues, Y. Moreno, and L. D. F. Costa, "Role of centrality for the identification of influential spreaders in complex networks," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 90, no. 3, Sep. 2014, Art. no. 032812.
[39] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, Mar. 2003.
[40] [Online]. Available: https://nlp.stanford.edu/projects/glove/

DONGYANG YAN was born in Xuchang, Henan, in 1993. He is currently pursuing the Ph.D. degree in system science with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. His main research interest is natural language processing with complex networks and other methods.

KEPING LI is currently a Professor with the State Key Laboratory of Rail Traffic Control and Safety. He was elected to the New Century Talent Supporting Project by the Education Ministry. His main research interests include modeling of complex network systems and the modeling, analysis, and optimization of rail transit systems.


SHUANG GU was born in Yichun, Heilongjiang, in 1994. She is currently pursuing the Ph.D. degree with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. Her main research interest is complex networks.

LIU YANG was born in Guizhou, in 1990. She received the B.Sc. degree in information and computing science and the master's degree in logistics engineering from Guizhou University. She is currently pursuing the Ph.D. degree with the Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. Her main research interests are complex networks and risk analysis in transportation.

