
Received July 21, 2020, accepted August 5, 2020, date of publication August 11, 2020, date of current version August 24, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3015770

An Integration Model Based on Graph Convolutional Network for Text Classification

HENGLIANG TANG 1,2, YUAN MI 1, FEI XUE 1, AND YANG CAO 1
1 School of Information, Beijing Wuzi University, Beijing 101149, China
2 Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing 100124, China

Corresponding author: Yuan Mi ([email protected])


This work was supported in part by the Beijing Key Laboratory of Intelligent Logistics System under Grant BZ0211, in part by the Beijing
Intelligent Logistics System Collaborative Innovation Center, in part by the Beijing Municipal Natural Science Foundation, in part by the
Beijing Youth Top-notch Talent Plan of High-Creation Plan under Grant 2017000026833ZK25, in part by the Canal Plan-Leading Talent
Project of Beijing Tongzhou District under Grant YHLB2017038, and in part by the Grass-roots Academic Team Building Project of
Beijing Wuzi University under Grant 2019XJJCTD04.

ABSTRACT Graph Convolutional Network (GCN) is widely used in text classification and performs well on non-Euclidean structured data. GCN is usually implemented with a spatial-based method, such as the Graph Attention Network (GAT). However, current GCN-based methods still lack a reasonable mechanism for handling the problems of contextual dependency and lexical polysemy. Therefore, an improved GCN (IGCN) is proposed to address these problems by introducing the Bidirectional Long Short-Term Memory (BiLSTM) network, the Part-of-Speech (POS) information, and the dependency relationship. The idea behind IGCN is general and straightforward: the short-range contextual dependency and the long-range contextual dependency captured through the dependency relationship are used together to address the problem of contextual dependency, and the richer semantic information provided by BiLSTM and the POS information is used to address the problem of lexical polysemy. Notably, the dependency relationship is transplanted from relation extraction tasks to text classification tasks to provide the graph required by IGCN. Experiments on three benchmark datasets show that IGCN achieves competitive results compared with seven baseline models.

INDEX TERMS Bidirectional long short-term memory network, dependency relationship, graph convolutional network, part-of-speech information, text classification.

I. INTRODUCTION
Text classification is a long-standing topic in Natural Language Processing (NLP) and is widely applied for text recognition and opinion extraction [1]–[4]. Currently, a large amount of non-Euclidean structured data that can be quantified and analyzed is generated by social media every day, such as social network reviews, interview records, product reviews, and email records. Over the past few decades, such data has been studied mainly with traditional classification methods (e.g., Support Vector Machine (SVM) [5]), typical neural network methods (e.g., Convolutional Neural Network (CNN) [6], Recurrent Neural Network (RNN) [7], Capsule Networks [8]), and their variants (e.g., Bidirectional Recurrent Neural Network (BRNN) [9], Long Short-Term Memory (LSTM) [10], Gated Recurrent Unit (GRU) [11], Recurrent Convolutional Neural Network (RCNN) [12], Convolutional Recurrent Neural Network (CRNN) [13]). However, these methods are greatly challenged by graph-structured data. For instance, graph-structured data cannot be directly processed by CNN, because CNN cannot maintain translation invariance on it; besides, the fixed size of the convolution kernel limits the range of the dependency it can capture. Therefore, methods based on the Graph Convolutional Network (GCN) [14] receive growing attention from researchers and engineers. By regarding the graph as a spectral graph, GCN can realize end-to-end learning of node features and structure features. Moreover, GCN is applicable to graphs of arbitrary topology. Although GCN is gradually becoming a good choice for graph-based text classification, there are still defects in current studies that should not be neglected.

The associate editor coordinating the review of this manuscript and approving it for publication was Ali Shariq Imran.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. The difference of capturing the dependency between the original GCN and IGCN.

The original GCN cannot capture the short-range contextual dependency and the long-range contextual dependency together. Consider the example "the movie, which is the product of an unknown French director, is wonderful.". Because GCN only aggregates the information of the direct neighbor nodes, it can only capture short-range contextual dependency information. This can only be remedied by increasing the number of GCN layers to capture the long-range contextual dependency, such as the dependency between the words "movie" and "wonderful". However, current research reveals that a multi-layer GCN for text classification gives rise to high spatial complexity [15]. Meanwhile, increasing the number of network layers also causes over-smoothing of the node features, which makes local features converge to similar values.

In addition to the problem of contextual dependency, the problem of lexical polysemy also exists in GCN: the same word may express different semantics in the same or different positions. In the sentences "I bought an apple." and "I bought an apple X.", the meaning of the word "apple" differs because of the difference in context. Meanwhile, in the sentences "Our object is to further cement trade relations." and "Their voters do not object much.", the meaning of the word "object" also differs because of the difference in part-of-speech. Although related studies have claimed that the problem of lexical polysemy can be solved without relying on syntactic and semantic information, their results unfortunately fall short of expectations [16], [17].

To overcome the problems of contextual dependency and lexical polysemy, an improved GCN (IGCN) is proposed for text classification in this paper. Based on the original GCN, IGCN introduces the Bidirectional Long Short-Term Memory (BiLSTM) [18], [19] network, the Part-of-Speech (POS) information, and the dependency relationship.1 In this paper, the text feature and the POS feature sequentially obtained through BiLSTM are applied to solve the problem of lexical polysemy. By constructing the dependency relationship, IGCN can exploit the short-range contextual dependency and the long-range contextual dependency together. Meanwhile, the adjacency matrix based on the dependency relationship is also generated to provide syntactic constraints. Subsequently, the different attentions of the neighbor nodes to the central node can be learned during the aggregation process; namely, the weights of the features are calculated by the attention mechanism in the propagation process. Notably, to provide the graph required by IGCN, the dependency relationship is transplanted from relation extraction tasks to text classification tasks. The difference in capturing the dependency between the original GCN and IGCN is shown in Fig. 1.

1 The POS information and the dependency relationship are generated with the spaCy toolkit: the POS information of each word is produced by POS tagging, and the dependency relationship between words is produced by dependency parsing.

Experiments on three benchmarking datasets demonstrate that the problems in current GCN-based methods can be effectively addressed by IGCN, which has some advantages over other approaches. The main contributions of this paper are as follows:
• The text information and the POS information can be effectively applied to generate the initial features by BiLSTM. These features not only make up for the deficiency of content-level context effectively, but also provide a new idea to address the problem of lexical polysemy.


• The dependency relationship and the attention mechanism are fully integrated into IGCN. They can effectively deal with the problem of contextual dependency and partly reduce the number of GCN layers; namely, they indirectly mitigate the high space complexity and over-smoothing caused by a multi-layer GCN. Such a cross-task study is useful for meeting the challenges in NLP, and further demonstrates the significance of the dependency relationship.
• A large number of experimental results not only prove the reasonability of integrating the BiLSTM, the POS information, and the dependency relationship, but also prove the efficiency of IGCN for text classification. IGCN will contribute to the continuous development of research on non-Euclidean structured data.

II. RELATED WORK
Unlike traditional classification methods based on manually extracted text features, current deep learning methods can directly output the category of a text by training a neural network. For example, Tang et al. [20] adopted two LSTMs for text classification, which effectively integrate sentiment words and contextual information. Zhang et al. [21] classified texts by introducing sentiment word information into BiLSTM. Yang et al. [22] integrated common sense into deep neural networks based on BiLSTM to enhance the accuracy of text classification. Xue and Li [23] utilized CNN and the gate mechanism to achieve higher accuracy, breaking away from network structures built on RNN and the attention mechanism. Huang and Carley [24] achieved impressive results by designing parameterized filters and a gate mechanism on CNN to capture text features. Li et al. [25] proposed a feature transformation component and a context retention mechanism to learn contextual information and combined contextual features with their transformed contextual features to obtain local salient features. Dong et al. [26] proposed a CNN with multiple non-linear transformations that achieves good results. Akhter et al. [27] achieved great performance by introducing a single-layer CNN with multiple filters into document-level text classification.

At the same time, attention-based models have also been proposed to capture the weight of each word within a sentence. Wang et al. [28] introduced the attention mechanism into LSTM, which provided a new idea for text classification. Chen et al. [29] introduced product and user information at different semantic levels to classify texts through the attention mechanism. Liu and Zhang [30] proposed a method that introduces three attention mechanisms to determine the contribution of each word in the context in text classification. Gu et al. [31] proposed a position-aware bidirectional attention network based on the Bidirectional Gated Recurrent Unit (BiGRU) for text classification. Especially, Vaswani et al. [32] proposed the self-attention mechanism, which is good at capturing the internal relevance between features and depends less on external information. The self-attention mechanism not only overcame the shortcoming that RNN cannot be computed in parallel, but also solved the difficulty CNN has in capturing long-range dependency information. Dong et al. [33] obtained text representations containing more comprehensive semantics for text classification by introducing the Bidirectional Encoder Representation from Transformers (BERT) [34] and a self-interaction attention mechanism.

In parallel with attention-based models for text classification, many breakthroughs in node classification and edge prediction have been made with GCN in recent years. Hamilton et al. [35] proposed a variety of aggregation functions to learn the feature representation of each node to enhance the effect of GCN. Chen et al. [36] proposed a random training method that greatly reduces the time complexity by randomly selecting two neighbor nodes for the convolution operation; with different sampling sizes, the features converge to a local optimum. Li et al. [37] proposed a method that can adaptively construct new Laplacian matrices based on tasks and generate different task-driven convolution kernels; this method is superior to GCN in processing multitask datasets. Velickovic et al. [38] used the attention mechanism to calculate the correlation between nodes dynamically and achieved good results on many public datasets. Yao et al. [39] introduced GCN for text classification and modeled the whole corpus as a heterogeneous network, which can learn word embeddings and document embeddings simultaneously. Cavallari et al. [40] introduced a new setting for graph embedding, which considers embedding communities instead of individual nodes. Although GCN performs well in text classification, it still fails to solve the problems of contextual dependency and lexical polysemy in text classification tasks. With these two problems as a starting point, an improved model based on GCN is proposed accordingly.

III. IGCN
Based on an in-depth study of neural text classification, IGCN is proposed in this paper, which builds three components (BiLSTM, the POS information, and the dependency relationship) on top of GCN. The whole process of IGCN can be divided into the steps of extracting the text feature, concatenating the POS feature, constructing the adjacency matrix based on the dependency relationship, training the neural network, and making a final prediction. The architecture of IGCN is shown in Fig. 2. Firstly, the text features and their corresponding POS features are successively obtained by BiLSTM, which utilizes their respective contextual information effectively. Then, the two kinds of features are concatenated together to form the feature required by IGCN. Meanwhile, the dependency relationship is generated to confront the problem of contextual dependency and to construct the adjacency matrix required by IGCN. After that, the feature and the adjacency matrix are input to train the neural network.


FIGURE 2. The architecture of IGCN.

With the hidden state vectors of the last layer obtained, the weights between the features and the hidden state vectors can be calculated to determine the contribution of each word by the attention mechanism. Subsequently, the final features are generated to predict the category of the given sentence.

A. THE BASICS OF IGCN: GCN
As is known, the original GCN takes the pre-processed words as nodes and the relationships between words as edges. The original GCN can be divided into the input layer, the hidden layer, and the output layer. The architecture of the original GCN is shown in Fig. 3. A graph G = (V, E) is given, where V is the set of nodes and E is the set of edges between the nodes.

FIGURE 3. The architecture of the original GCN.

1) THE INPUT LAYER OF THE ORIGINAL GCN
The input layer of the original GCN consists of the input feature matrix and the adjacency matrix of the graph. The adjacency matrix expresses the reference relationship between the nodes. Let X ∈ R^{N×D} be the input feature matrix, where N is the size of V and D is the size of the dictionary set of V. When the word in the i-th node is at the m-th position of the dictionary set, the i-th row of X is the one-hot vector X_i = (0_1, ..., 0_{m-1}, 1_m, 0_{m+1}, ..., 0_D).


Let A ∈ R^{N×N} be the self-loop adjacency matrix of the graph G. The adjacency matrix of an undirected graph can be expressed as follows:

A_{ij} = A_{ji} = \begin{cases} 0, & i \nrightarrow j \\ 1, & i \rightarrow j \\ 1, & i = j \end{cases}    (1)

where A_{ij} indicates whether there is a connection between the i-th node and the j-th node.

2) THE HIDDEN LAYER OF THE ORIGINAL GCN
The hidden layer of the original GCN aggregates the node information of the current layer through the propagation rules and transmits the features to the next layer. The features become more abstract as they propagate through successive hidden layers. The layer-wise propagation rule for the i-th node can be expressed as follows:

h_i^l = \sigma\Big(\sum_{j=1}^{N} \bar{A}_{ij}\, W^l\, h_j^{l-1} + b^l\Big), \qquad \bar{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}, \qquad D_{ii} = \sum_{j=1}^{N} A_{ij}    (2)

where W^l is a trainable linear transformation weight, which can be obtained by minimizing the cross-entropy loss function on all labeled samples, and b^l is a bias term. \bar{A} ∈ R^{N×N} is the normalized adjacency matrix of the graph G, D ∈ R^{N×N} is the degree matrix of the graph G, and σ denotes a nonlinear activation function (e.g., ReLU). h_i^l is the feature of the i-th node in the l-th hidden layer; initially, h_i^0 = X_i.
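To make the propagation rule of Eq. (2) concrete, the following minimal NumPy sketch (not the authors' code; the toy graph, sizes, and variable names are illustrative) adds self-loops, applies the symmetric normalization D^{-1/2} A D^{-1/2}, and aggregates neighbor features through a ReLU activation.

```python
import numpy as np

def gcn_layer(A_selfloop, H, W, b):
    """One GCN layer: H' = ReLU(D^{-1/2} A D^{-1/2} H W + b), as in Eq. (2)."""
    degrees = A_selfloop.sum(axis=1)               # D_ii = sum_j A_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    A_norm = D_inv_sqrt @ A_selfloop @ D_inv_sqrt  # normalized adjacency
    return np.maximum(A_norm @ H @ W + b, 0.0)     # ReLU activation

# Toy example: 4 nodes, one-hot input features X, undirected edges 0-1, 1-2, 2-3.
N, D_dict, hidden = 4, 5, 3
X = np.eye(N, D_dict)                              # h^0 = X (one-hot rows)
A = np.eye(N)                                      # self-loops (A_ii = 1)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0                        # symmetric edges, per Eq. (1)
W1 = np.random.randn(D_dict, hidden) * 0.1
H1 = gcn_layer(A, X, W1, np.zeros(hidden))         # node features after one layer
print(H1.shape)                                    # (4, 3)
```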
3) THE OUTPUT LAYER OF THE ORIGINAL GCN
After obtaining the final features of the hidden layer, the output layer of the original GCN generates the probability of each category through the softmax function and assigns the text to the category with the maximum probability.

B. THE INPUT LAYER OF IGCN
The quality of the initial features directly affects the performance of a text classification model. In terms of feature extraction, the original GCN generates one-hot encoded features by building a global text dictionary, which are shallow features. Therefore, to obtain more advanced and abstract features, BiLSTM is introduced to implement a deeper extraction of text features. On one hand, text is a kind of non-Euclidean structured data, and BiLSTM can retain the position information of text and capture its serialized features. On the other hand, general social media text is typical short text, and the gate mechanism of BiLSTM can effectively alleviate the limited context and ambiguous semantics of such short text. Besides, the bi-directional mechanism of BiLSTM ensures that each word obtains more semantic information with full consideration of the context. Such a bi-directional model can provide a deeper text feature representation to the neural network. The architecture of BiLSTM is shown in Fig. 4.

FIGURE 4. The architecture of BiLSTM.

Given a k-word sentence S = {W_1, W_2, ..., W_k}, the model embeds the initial input text through the pre-trained embedding matrix. The text feature representation matrix M ∈ R^{k×d} is then obtained, where k is the vocabulary size of the sentence S and d is the dimension of the word embedding. Meanwhile, the text feature representations F = {f_1, f_2, ..., f_k} of the sentence S with contextual information can also be obtained by BiLSTM, where f_i ∈ R^{k×d_f} represents the text feature of the i-th word and d_f is the dimension of the hidden state vector of BiLSTM.

Since the feature extraction described above is not sufficient to deal with the problem of lexical polysemy, the POS information of the text is introduced to further mitigate it. Unlike other attributes, the POS information is a basic grammatical attribute of a word, which does not vary much across fields. As is well known, short text lacks a certain degree of grammar and does not contain a fully rigorous structure. Hence, the POS feature provides additional usable information for short text classification.

Accordingly, the POS information is input to BiLSTM to generate the POS feature representations P = {p_1, p_2, ..., p_k}, where p_i ∈ R^{k×d_f} is the POS feature of the i-th word.

Through the above feature extraction, the text feature representations F and the POS feature representations P have been obtained by BiLSTM. On this basis, the concatenation operation is chosen in this paper to effectively utilize the POS information to solve the problem of lexical polysemy. The text feature representations F and their corresponding POS feature representations P are concatenated by (3) to generate the deeper feature representations FP = {fp_1, fp_2, ..., fp_k}, which are also the feature representations required by IGCN. The concatenation operation is shown in Fig. 5.

fp_i = \mathrm{Concat}(f_i, p_i)    (3)

where fp_i ∈ R^{k×2d_f} represents the feature of the i-th word.


FIGURE 5. The operation of concatenation.
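As a concrete illustration of this input layer, the sketch below is a simplified PyTorch reading of the description above, not the authors' implementation; the class name, embedding sizes, and the projection of POS embeddings are assumptions. It runs one shared BiLSTM first over word embeddings and then over POS embeddings, and concatenates the two outputs per word as in Eq. (3).

```python
import torch
import torch.nn as nn

class TextPosEncoder(nn.Module):
    """BiLSTM features for words and POS tags, concatenated per word (Eq. 3)."""
    def __init__(self, vocab_size, pos_size, emb_dim=300, pos_dim=50, d_f=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)   # could hold pre-trained vectors
        self.pos_emb = nn.Embedding(pos_size, pos_dim)
        # One BiLSTM reused for both feature types; since the input sizes differ,
        # the POS embeddings are projected to the word-embedding dimension first.
        self.pos_proj = nn.Linear(pos_dim, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, d_f // 2, bidirectional=True, batch_first=True)

    def forward(self, word_ids, pos_ids):
        f, _ = self.bilstm(self.word_emb(word_ids))               # text features F
        p, _ = self.bilstm(self.pos_proj(self.pos_emb(pos_ids)))  # POS features P
        return torch.cat([f, p], dim=-1)                          # FP, 2*d_f per word

enc = TextPosEncoder(vocab_size=10000, pos_size=20)
word_ids = torch.randint(0, 10000, (1, 7))   # one sentence of k = 7 words
pos_ids = torch.randint(0, 20, (1, 7))
fp = enc(word_ids, pos_ids)
print(fp.shape)                              # torch.Size([1, 7, 128]) -> 2*d_f features
```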
Given the loose syntactic constraints and vague dependencies in raw text, the dependency relationship is introduced to reveal a clear syntactic structure for text classification. The so-called dependency relationship is a more fine-grained attribute which can recognize the grammatical components in a sentence. To a certain extent, this paper thereby also pays attention to non-notional words (e.g., prepositions) in the structure analysis. As shown in Fig. 6, the dependency relationship graph is applied to construct the adjacency matrix A ∈ R^{k×k} of a given sentence.

FIGURE 6. Adjacency matrix based on the dependency relationship.
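Footnote 1 states that the POS information and the dependency relationship are produced with spaCy. The sketch below is one plausible way to build the k x k self-loop adjacency matrix of Fig. 6 from a dependency parse; the pipeline name and the symmetric (undirected) treatment of the edges are assumptions rather than details given in the paper.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed English pipeline; any spaCy model works

def dependency_adjacency(sentence):
    """Self-loop adjacency matrix A (k x k) from the dependency parse of a sentence."""
    doc = nlp(sentence)
    k = len(doc)
    A = np.eye(k)                                 # A_ii = 1 (self-loops)
    for token in doc:
        if token.head.i != token.i:               # skip the root, whose head is itself
            A[token.i, token.head.i] = 1.0        # child -> head dependency edge
            A[token.head.i, token.i] = 1.0        # treated as undirected, as in Eq. (1)
    pos_tags = [token.pos_ for token in doc]      # POS information from the same parse
    return A, pos_tags

A, pos = dependency_adjacency("I like this movie so much.")
print(pos)              # e.g. ['PRON', 'VERB', 'DET', 'NOUN', 'ADV', 'ADV', 'PUNCT']
print(A.astype(int))    # dependency-based adjacency matrix with self-loops
```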
C. THE HIDDEN LAYER OF IGCN
Compared with an adjacency matrix constructed only from the dependencies of each node on its two neighboring nodes, the adjacency matrix based on the dependency relationship provides the short-range contextual dependency and the long-range contextual dependency together, which is more effective for solving the problem of contextual dependency. Furthermore, an example is used in Fig. 7 to explain how to construct the adjacency matrix with dependency relations.

FIGURE 7. An example to explain how to construct the adjacency matrix with dependency relations.

In the original GCN, the weights of edges are uniformly set to 1, so the influence of each node on its neighbor nodes is the same. However, not all edges have the same weight in reality, and some of them must be more important to text classification. For example, in the sentence "I like this movie so much.", the edge between the nodes "like" and "much" should contribute much more to text classification than the edge between the nodes "this" and "movie". Therefore, it is necessary to capture which neighbor nodes contribute more to the central node. To learn the contributions of neighbor nodes to the central node, an attention mechanism based on the dependency relationship is also introduced. The attention mechanism acts between the central node and its neighbor nodes and improves the performance of the model by quantifying the contributions between nodes. As shown in Fig. 8, the adjacency matrix Ã ∈ R^{k×k} based on the attention mechanism is also obtained.

FIGURE 8. Adjacency matrix based on the attention mechanism.

In this paper, the similarity between nodes is selected to measure the dependency relationship. Based on the principle of not destroying the integrity of the network, the topological information of the graph is quantitatively analyzed from the perspective of local attributes and global attributes.

\tilde{A}_{ij} = A_{ij} \cdot \alpha_{ij}, \qquad \alpha_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j=1}^{k} \exp(e_{ij})}, \qquad e_{ij} = fp_i \cdot fp_j    (4)

where (4) is used to calculate the influences between the i-th node and its neighbor nodes with the dependency relationship. The softmax function ensures that the sum of each row of the matrix is 1; in other words, the sum of the influences of all nodes in the sentence on the i-th node is 1 during the feature aggregation process. α_{ij} is the influence of the j-th node on the i-th node, and e_{ij} is the similarity of the j-th node to the i-th node. A dot product operation is used to calculate the similarity between nodes: the larger the value of e_{ij}, the more similar the i-th node and the j-th node are.

After the above data processing, the feature representations FP required by IGCN and the adjacency matrix Ã based on the attention mechanism are both obtained. Subsequently, they are fed into a single-layer GCN for training, and the hidden state vectors H = {h_1, h_2, ..., h_k} are obtained.

h_i = \sigma(\tilde{A}_i \cdot FP \cdot W_h + b_h)    (5)

where h_i ∈ R^{k×2d_f} represents the hidden state vector of the i-th word, W_h is a trainable weight matrix, and b_h is a bias term.
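A minimal sketch of Eqs. (4) and (5) is given below. It is an illustrative reconstruction under stated assumptions (toy shapes, NumPy instead of the authors' framework, LeakyReLU taken from the experimental setup): dot-product similarities between the concatenated features give row-normalized weights, which are multiplied element-wise by the dependency adjacency and then used in one graph convolution.

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=1, keepdims=True)          # numerical stability
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def igcn_hidden_layer(FP, A_dep, W_h, b_h, neg_slope=0.2):
    """Eq. (4)-(5): attention-weighted adjacency, then a single graph convolution."""
    E = FP @ FP.T                                  # e_ij = fp_i . fp_j (dot-product similarity)
    alpha = softmax_rows(E)                        # row-normalized influences alpha_ij
    A_tilde = A_dep * alpha                        # A~_ij = A_ij * alpha_ij  (Eq. 4)
    Z = A_tilde @ FP @ W_h + b_h                   # aggregate neighbor features (Eq. 5)
    return np.where(Z > 0, Z, neg_slope * Z)       # LeakyReLU, as used in the experiments

k, feat = 7, 128                                   # toy sizes: k words, 2*d_f features
FP = np.random.randn(k, feat)
A_dep = np.eye(k); A_dep[0, 1] = A_dep[1, 0] = 1   # tiny dependency adjacency with self-loops
W_h = np.random.randn(feat, feat) * 0.05
H = igcn_hidden_layer(FP, A_dep, W_h, np.zeros(feat))
print(H.shape)                                     # (7, 128) hidden state vectors
```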
D. THE OUTPUT LAYER OF IGCN
With the feature representations FP and the hidden state vectors H obtained, the attention mechanism is introduced once more to find the important, relevant semantic features. Unlike a general attention network, the contribution µ = {µ_1, µ_2, ..., µ_k} of each word to text classification is obtained by linking the feature representations FP with the hidden state vectors H in the output layer.


Meanwhile, the final feature representation y of the sentence is obtained as follows.

y = \mu \cdot FP, \qquad \mu_i = \mathrm{softmax}(\lambda_i) = \frac{\exp(\lambda_i)}{\sum_{i=1}^{k} \exp(\lambda_i)}, \qquad \lambda_i = \sum_{j=1}^{k} h_i \cdot fp_j^{T}    (6)

where µ_i is the contribution of the i-th node to text classification and λ_i is the sum of the similarities between the i-th node and all nodes in the sentence.

Subsequently, the final feature representation y is input into the fully-connected layer, and the category of the text is predicted by the softmax function.

Y = \mathrm{softmax}(y \cdot W + b)    (7)

where W is a trainable weight matrix and b is a bias term.
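The output layer can be sketched as follows; this is an illustrative reading of Eqs. (6) and (7) with assumed shapes, not the authors' code. The word-level contributions µ come from a softmax over the similarities λ between hidden states and the concatenated features, the sentence vector y is the µ-weighted sum of FP, and a fully-connected softmax layer predicts the class.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def igcn_output_layer(FP, H, W, b):
    """Eq. (6)-(7): attention pooling over words, then softmax classification."""
    lam = (H @ FP.T).sum(axis=1)      # lambda_i = sum_j h_i . fp_j^T
    mu = softmax(lam)                 # contribution of each word (sums to 1)
    y = mu @ FP                       # final sentence representation
    return softmax(y @ W + b)         # class probabilities Y

k, feat, n_classes = 7, 128, 3
FP = np.random.randn(k, feat)
H = np.random.randn(k, feat)
W = np.random.randn(feat, n_classes) * 0.05
probs = igcn_output_layer(FP, H, W, np.zeros(n_classes))
print(probs, probs.argmax())          # predicted category = argmax of the probabilities
```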

IV. EXPERIMENTS
This section first describes how the experiments are set up and analyzes the experimental results of the different models. Subsequently, an ablation study is implemented to further verify the contribution of each component.

A. DATASETS AND PARAMETER SETTINGS
As shown in Table 1, the following three datasets are adopted in the experiments.
• IMDB (Internet Movie DataBase) dataset, a binary-classification English dataset, contains 10,662 internet movie reviews provided by the data competition platform Kaggle. The training set contains 6,000 samples and the test set contains 4,662 samples; positive and negative comments account for 50% each in both the training set and the test set.
• AAR (Amazon Alexa Reviews) dataset, a multiclass-classification English dataset, contains 3,150 Amazon Alexa product reviews collected in 2018. The training set contains 1,575 samples and the test set contains 1,575 samples.
• TUA (Twitter US Airline) dataset, a multiclass-classification English dataset, contains 14,640 Twitter user comments on American airlines collected in 2015. The training set contains 7,320 samples and the test set contains 7,320 samples. Negative, neutral, and positive comments account for 63%, 21%, and 16% respectively in both the training set and the test set.

TABLE 1. Statistics of dataset.


Two pre-trained vector matrices, from Yelp and Sogou News, are used in this paper to obtain the initial word embeddings. The random initialization of the weight matrices adopts three methods: uniform initialization, normal initialization, and orthogonal initialization. All the significant parameters of the experiments are recorded in Table 2.

TABLE 2. Experimental parameter settings.

In the training process, dropout is used to introduce randomness and prevent overfitting, and the nonlinear function LeakyReLU is selected as the activation function of the GCN to overcome the vanishing-gradient problem and to speed up training. The cross-entropy loss function is used to train the model.

\mathrm{LeakyReLU}(x) = \begin{cases} 0.2x, & x < 0 \\ x, & x \ge 0 \end{cases}    (8)

Besides, L2 regularization is also used to reduce overfitting. The adaptive moment estimation (Adam) optimizer is selected as the model optimizer. The stopping condition of the training process is that the loss function has not decreased for 10 consecutive iteration cycles.
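As a compact illustration of the training setup described above (dropout, LeakyReLU from Eq. (8), L2 regularization, Adam, and early stopping after 10 stagnant cycles), a PyTorch-style sketch might look as follows; everything except the values stated in the text is a placeholder, and the stand-in model is not the IGCN architecture itself.

```python
import torch
import torch.nn as nn

# Stand-in for the IGCN classifier head; replace with the full model assembled above.
model = nn.Sequential(nn.Dropout(p=0.5),            # dropout for regularization (p is a placeholder)
                      nn.Linear(128, 3))
activation = nn.LeakyReLU(negative_slope=0.2)        # Eq. (8): 0.2x for x < 0, x otherwise
criterion = nn.CrossEntropyLoss()                    # cross-entropy training loss
optimizer = torch.optim.Adam(model.parameters(),     # adaptive moment estimation (Adam)
                             lr=1e-3,                # placeholder learning rate
                             weight_decay=1e-4)      # L2 regularization via weight decay

best_loss, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    epoch_loss = 0.0   # placeholder; accumulate criterion(model(x), y) over the training set
    if epoch_loss < best_loss:
        best_loss, bad_epochs = epoch_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                        # stop when the loss has not decreased
        break                                         # for 10 consecutive iteration cycles
```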
To comprehensively evaluate the model, accuracy is selected as the evaluation metric, and IGCN is compared with the following baseline models.
• SVM is a binary-classification model of supervised learning. It is commonly applied to text classification, pattern recognition, and regression analysis. Moreover, SVM is mainly oriented to small samples, nonlinear data, and high-dimensional data.
• TextCNN is a specific form of CNN. It obtains a shallow feature representation of the sentence effectively with the one-dimensional convolution operation. Moreover, it performs well in short text classification, such as in the search and dialogue fields.
• LSTM is a specific form of RNN, which achieves good results in both short text classification and long text classification.
• BiLSTM is a specific form of LSTM, which further captures bidirectional semantic dependencies.
• GCN is a classification model based on non-Euclidean structured data, which is gradually emerging in the field of text classification.
• GAT is a classification model which brings the attention mechanism into GCN.
• TextGCN is a text classification model utilizing the document-word relation and the word-word relation, which gains a variety of improvements over many state-of-the-art models.

B. MAIN EXPERIMENTAL RESULTS
As shown in Table 3, the experimental results on the three benchmarking datasets confirm that the performance of IGCN is better than that of the other baseline models, which further proves the effectiveness and robustness of IGCN in text classification.

TABLE 3. Experimental results of different classification models.

As shown in Fig. 9, the accuracy of SVM on the IMDB dataset reaches 77.75%, which is higher than the accuracies of the six models based on neural networks (TextCNN, LSTM, BiLSTM, GCN, GAT, and TextGCN). On the AAR dataset and the TUA dataset, the accuracies of SVM only reach 64.66% and 69.4%, lower than the accuracies of the four neural network models TextCNN, LSTM, BiLSTM, and GAT, although still higher than the accuracy of GCN. This shows that SVM is weak at tackling large-scale training samples and multi-classification problems.


Besides, the accuracies of TextCNN on the three benchmarking datasets are surprisingly higher than the accuracies of LSTM, BiLSTM, GCN, GAT, and TextGCN. This means that the text classification models may pay more attention to the short-range dependency information on these three benchmarking datasets.

FIGURE 9. Histogram of experimental results of different classification models.

As shown in Fig. 10 and Fig. 11, the loss value of IGCN on the IMDB dataset decreases faster than that of GCN, and the accuracy of IGCN on the IMDB dataset increases faster than that of GCN. The results show that IGCN improves effectiveness and achieves better results in fewer iterations.

FIGURE 10. The loss curve of GCN and IGCN.

FIGURE 11. The accuracy curve of GCN and IGCN.

Compared with the other models based on neural networks, the accuracies of the graph-based GCN are the lowest on the three benchmarking datasets, reaching only 73.9%, 63.58%, and 68.36% respectively. The reason is that GCN cannot make full use of contextual dependency information in short text classification due to the sparse adjacency matrix it constructs. At the same time, compared with GCN, IGCN boosts the performance on the IMDB, AAR, and TUA datasets. As shown in Table 3, IGCN increases the accuracies by 6.95%, 4.49%, and 5.46% respectively, which fully proves the effectiveness of IGCN. There are two reasons for this result.
• The multiple effects of the pre-trained word embedding, the BiLSTM, and the POS information are fully utilized in this paper to extract more advanced and abstract feature representations and thereby further address the problem of lexical polysemy. At the same time, the dependency relationship introduced in this paper makes reasonable use of the short-range contextual dependency and the long-range contextual dependency together.
• The attention mechanism introduced in this paper captures not only the importance of different neighbor nodes, but also the importance of different types of nodes. It can reduce the weight of noise nodes to a certain extent.

C. ADDITIONAL EXPERIMENTAL RESULTS
Whether the same BiLSTM should be used to generate the different kinds of feature representations is studied on the IMDB dataset. To analyze its impact on the performance of IGCN, the following three cases are considered; a small sketch of the weight-sharing options follows this list.
• In the first case, the same BiLSTM is used to train the text features F and the POS features P successively; this is the setting used in IGCN.
• In the second case, the same BiLSTM is used to train the POS features P and the text features F successively; this variant is named MPF.
• In the last case, two separate BiLSTMs are used to train the features; this variant is named M2Bi.
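The three settings differ only in how the encoder weights are shared and in the processing order. The sketch below expresses that difference; the names IGCN, MPF, and M2Bi follow the text, while the function, its parameters, and the dimensions are illustrative assumptions.

```python
import torch.nn as nn

def build_encoders(emb_dim=300, d_f=64, share=True, pos_first=False):
    """Return (text_lstm, pos_lstm, order) for the three settings studied here.

    share=True,  pos_first=False -> IGCN (one BiLSTM, text then POS)
    share=True,  pos_first=True  -> MPF  (one BiLSTM, POS then text)
    share=False                  -> M2Bi (two separate BiLSTMs)
    """
    text_lstm = nn.LSTM(emb_dim, d_f // 2, bidirectional=True, batch_first=True)
    pos_lstm = text_lstm if share else nn.LSTM(emb_dim, d_f // 2,
                                               bidirectional=True, batch_first=True)
    order = ("pos", "text") if pos_first else ("text", "pos")
    return text_lstm, pos_lstm, order
```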
As shown in Table 4, the experimental results on the three datasets illustrate that the accuracies of IGCN increase by 0.9%, 1.02%, and 1.33% respectively compared with M2Bi. Fig. 12 confirms that using the same BiLSTM to generate the text feature and the POS feature improves the overall performance of IGCN.

TABLE 4. Experimental results of whether to use the same BiLSTM.

FIGURE 12. Histogram of experimental results of whether to use the same BiLSTM.

Moreover, the effect of the number of layers on the performance of IGCN is also studied on the IMDB dataset. To analyze this impact, the number of layers is varied as shown in Fig. 13. The results show that IGCN achieves the best performance when the number of layers is 1; the accuracy decreases as the number of layers increases.

FIGURE 13. The accuracy curve of different layer numbers.

D. ABLATION STUDY
To further examine the benefit brought by each component of IGCN, an ablation study is implemented in this paper. The results of IGCN on the three datasets are presented as the baseline.


As shown in Table 5, the experimental results of IGCN and the three variant models prove that the three components applied in this paper promote the performance of IGCN.

TABLE 5. Experimental results of ablation study.

After the dependency relationship is removed, the resulting model, named MRD, is constructed only by learning the dependency information between each node and its two neighboring nodes. As shown in Fig. 14, the accuracies of MRD on the IMDB, AAR, and TUA datasets decrease by 18.17%, 15.63%, and 17.65% respectively compared with the accuracies of IGCN, an average decrease of 17.15%. This shows that the dependency relationship is significantly relevant to graph-based text classification. The weights of neighbor nodes with respect to the central node can be deeply learned by the attention mechanism on the dependency relationship to further boost the performance of the model.

Subsequently, after the POS feature is removed, the performance of the model, named MRP, also drops. As shown in Fig. 14, the accuracies of MRP on the three datasets are lower than those of IGCN, decreasing by 5.3%, 6.29%, and 4.8% respectively. The experimental results demonstrate that the introduction of the POS information helps solve the problem of lexical polysemy. With the average decrease in accuracy reaching 5.46%, the POS information clearly contributes to text classification.

FIGURE 14. Histogram of experimental results of ablation study.

Finally, after BiLSTM is removed, the performance of the model, named MRBi, also drops. As shown in Fig. 14, compared with the accuracies of IGCN, the accuracies of MRBi on the IMDB, AAR, and TUA datasets decrease by 16.13%, 13.92%, and 16.17% respectively; the average decrease in accuracy reaches 15.41%. This decrease proves that BiLSTM enhances the performance of IGCN with contextual dependency information.

To sum up, the decreases in the accuracies of MRD, MRP, and MRBi prove that each component of IGCN proposed in this paper is beneficial for addressing the problems in text classification.

V. CONCLUSION AND FUTURE WORK
Based on a review of the existing challenges faced in text classification, the applicability of the GCN-based model is discussed. In this paper, an improved integration model based on GCN is formulated, analyzed, and evaluated to handle the problems of contextual dependency and lexical polysemy. Experiments on three datasets show that IGCN performs better than the other models, especially GCN. The success of IGCN can be attributed to the integration of three components. On the one hand, the BiLSTM and the POS information can generate deeper feature representations for text classification. On the other hand, the dependency relationship provides the graph required by GCN, which can be used to capture the short-range contextual dependency and the long-range contextual dependency together. Given the good performance obtained from the reasonable combination of the dependency relationship and GCN, a further exploration of the dependency relationship will certainly follow in the future. What is more, the transplantation of the dependency relationship from relation extraction tasks to text classification tasks can also give a new idea for processing non-Euclidean structured data.

There is also a domain problem in text classification. Some text in different fields has the same literal meaning, but its importance for text classification is different. For example, the word "cancer" has a neutral tendency in the medical field but a negative tendency in other fields. The difference in tendency may cause a wrong prediction. Therefore, an in-depth study will be carried out on domain knowledge transfer.


REFERENCES
[1] W. Zhao, G. Zhang, G. Yuan, J. Liu, H. Shan, and S. Zhang, "The study on the text classification for financial news based on partial information," IEEE Access, vol. 8, pp. 100426–100437, 2020.
[2] K. Liu and L. Chen, "Medical social media text classification integrating consumer health terminology," IEEE Access, vol. 7, pp. 78185–78193, 2019.
[3] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, "Deep learning based text classification: A comprehensive review," 2020, arXiv:2004.03705. [Online]. Available: http://arxiv.org/abs/2004.03705
[4] S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu, "A survey on knowledge graphs: Representation, acquisition and applications," 2020, arXiv:2002.00388. [Online]. Available: http://arxiv.org/abs/2002.00388
[5] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[6] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1746–1751.
[7] H. Zhang, L. Xiao, Y. Wang, and Y. Jin, "A generalized recurrent neural architecture for text classification with multi-task learning," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2873–2879.
[8] W. Zhao, H. Peng, S. Eger, E. Cambria, and M. Yang, "Towards scalable and reliable capsule networks for challenging NLP applications," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 1549–1559.
[9] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[11] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734.
[12] R. Wang, Z. Li, J. Cao, T. Chen, and L. Wang, "Recurrent convolutional neural networks for text classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 2267–2273.
[13] R. Wang, Z. Li, J. Cao, T. Chen, and L. Wang, "Convolutional recurrent neural networks for text classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–6.
[14] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. 5th Int. Conf. Learn. Represent. (ICLR), Apr. 2017, pp. 1–14.
[15] G. Li, M. Muller, A. Thabet, and B. Ghanem, "DeepGCNs: Can GCNs go as deep as CNNs?" in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9266–9275.
[16] J. Liu, F. Meng, Y. Zhou, and B. Liu, "Character-level neural networks for short text classification," in Proc. Int. Smart Cities Conf. (ISC2), Sep. 2017, pp. 560–567.
[17] X. Zhang, J. J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2015, pp. 649–657.
[18] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, "Attention-based bidirectional long short-term memory networks for relation classification," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (Short Papers), vol. 2, 2016, pp. 207–212.
[19] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol. (Long Papers), vol. 1, 2018, pp. 2227–2237.
[20] D. Tang, B. Qin, X. Feng, and T. Liu, "Effective LSTMs for target-dependent sentiment classification," in Proc. 26th Int. Conf. Comput. Linguistics (COLING), Dec. 2016, pp. 3298–3307.
[21] M. Zhang, Y. Zhang, and D. Vo, "Gated neural networks for targeted sentiment analysis," in Proc. 30th AAAI Conf., Feb. 2016, pp. 3087–3093.
[22] M. Yang, Q. Qu, X. Chen, C. Guo, Y. Shen, and K. Lei, "Feature-enhanced attention network for target-dependent sentiment classification," Neurocomputing, vol. 307, pp. 91–97, Sep. 2018.
[23] W. Xue and T. Li, "Aspect based sentiment analysis with gated convolutional networks," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, 2018, pp. 2514–2523.
[24] B. Huang and K. Carley, "Parameterized convolutional neural networks for aspect level sentiment classification," in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 1091–1096.
[25] X. Li, L. Bing, W. Lam, and B. Shi, "Transformation networks for target-oriented sentiment classification," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, 2018, pp. 946–956.
[26] M. Dong, Y. Li, X. Tang, J. Xu, S. Bi, and Y. Cai, "Variable convolution and pooling convolutional neural network for text sentiment classification," IEEE Access, vol. 8, pp. 16174–16186, 2020.
[27] M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood, and M. T. Sadiq, "Document-level text classification using single-layer multisize filters convolutional neural network," IEEE Access, vol. 8, pp. 42689–42707, 2020.
[28] Y. Wang, M. Huang, X. Zhu, and L. Zhao, "Attention-based LSTM for aspect-level sentiment classification," in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 606–615.
[29] H. Chen, M. Sun, C. Tu, Y. Lin, and Z. Liu, "Neural sentiment classification with user and product attention," in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 1650–1659.
[30] J. Liu and Y. Zhang, "Attention modeling for targeted sentiment," in Proc. 15th Conf. Eur. Chapter Assoc. Comput. Linguistics (Short Papers), vol. 2, Apr. 2017, pp. 572–577.
[31] S. Gu, L. Zhang, Y. Hou, and Y. Song, "A position-aware bidirectional attention network for aspect-level sentiment analysis," in Proc. 27th Int. Conf. Comput. Linguistics (COLING), Aug. 2018, pp. 774–784.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2017, pp. 5998–6008.
[33] Y. Dong, P. Liu, Z. Zhu, Q. Wang, and Q. Zhang, "A fusion model-based label embedding and self-interaction attention for text classification," IEEE Access, vol. 8, pp. 30548–30559, 2020.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[35] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2017, pp. 1024–1034.
[36] J. Chen, J. Zhu, and L. Song, "Stochastic training of graph convolutional networks with variance reduction," in Proc. 35th Int. Conf. Mach. Learn. (ICML), Jul. 2018, pp. 941–949.
[37] R. Li, S. Wang, F. Zhu, and J. Huang, "Adaptive graph convolutional neural networks," in Proc. 32nd AAAI Conf., Feb. 2018, pp. 3546–3553.
[38] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. 6th Int. Conf. Learn. Represent. (ICLR), Apr. 2018, pp. 1–12.
[39] L. Yao, C. Mao, and Y. Luo, "Graph convolutional networks for text classification," in Proc. 31st AAAI Conf., Jan. 2019, pp. 7370–7377.
[40] S. Cavallari, E. Cambria, H. Cai, K. C.-C. Chang, and V. W. Zheng, "Embedding both finite and infinite communities on graphs [application notes]," IEEE Comput. Intell. Mag., vol. 14, no. 3, pp. 39–50, Aug. 2019.

HENGLIANG TANG received the B.Sc. and Ph.D. degrees from the Beijing University of Technology, in 2005 and 2011, respectively. He is currently a Professor with Beijing Wuzi University. His main research interests include computer vision and the IoT information technology.


YUAN MI received the B.Sc. degree from Northwest A&F University, in 2015. He is currently pursuing the M.Sc. degree with Beijing Wuzi University. His main research interests include natural language processing and the IoT information technology.

FEI XUE received the B.Sc. degree from the University of Jinan, in 2006, the M.Sc. degree from the Taiyuan University of Science and Technology, in 2011, and the Ph.D. degree from the Beijing University of Technology, in 2016. He is currently an Associate Professor with Beijing Wuzi University. His main research interests include computational complexity theory and optimization.

YANG CAO received the B.Sc. and M.Sc. degrees from the Taiyuan University of Science and Technology, in 2011 and 2015, respectively, and the Ph.D. degree from the Beijing University of Technology, in 2019. He is currently a Lecturer with Beijing Wuzi University. His main research interests include machine learning and big data analysis.
