by Word2Vec to convert the text data into word vectors; finally, the word vectors are fed into the classifier for processing. In this part, the BiGRU model and a GCN network whose nodes are initialized from the dependency syntactic tree are used: the BiGRU model learns the data and extracts the important features, the GCN network extracts the spatial feature information in the text, and the classification results are finally obtained through a global pooling layer.

3.1 Embedding Layer
The embedding layer is the word embedding layer. Before the text can be analyzed, the words in it must be converted into word vectors so that they can serve as input to the neural network. In this paper, the CBOW model of Word2vec [11] is chosen to implement text vectorization, converting each word into a 512-dimensional word vector. The essence of the CBOW model is to predict a target word with a neural network given its surrounding context words. The CBOW model consists of an input layer, a hidden layer and an output layer. The input layer holds the vectors of the words neighboring the current position; the hidden layer maps the input matrix to a single vector; and the output layer is a Huffman tree whose leaf nodes are the individual words, so that the path from a high-frequency word to the root is relatively short and every word has exactly one path to the root. Each intermediate node is a sigmoid unit, and the path from the root to a specified word passes through several intermediate nodes; passing each of them is, in effect, a binary classification task. Every intermediate node on the path from the root to a word has a corresponding weight vector. Training the model finds the weight vectors that maximize the probability of reaching the specified word from the root; once the weight vectors of the intermediate nodes are found, the vector of the original word is obtained.
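As an illustration, the snippet below is a minimal sketch (not the authors' code) of training such embeddings with the gensim library: sg=0 selects the CBOW architecture, hs=1 enables the hierarchical-softmax (Huffman tree) output layer, and vector_size=512 matches the dimensionality used in this paper. The toy corpus and the window size are assumptions for the example only.

```python
# Sketch of 512-dimensional CBOW training with hierarchical softmax (gensim).
from gensim.models import Word2Vec

# Each training sample is one segmented headline (a list of words);
# these two headlines are placeholders, not data from THUCNews.
corpus = [
    ["股市", "今日", "大涨"],
    ["教育", "部门", "发布", "新规"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=512,  # 512-dimensional word vectors (Section 3.1 / Table 2)
    sg=0,             # sg=0: CBOW (predict the target word from its context)
    hs=1,             # hierarchical softmax, i.e. the Huffman-tree output layer
    window=5,         # context window size (assumed, not stated in the paper)
    min_count=1,
)

vector = model.wv["股市"]   # the learned 512-dimensional vector for one word
```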
information. The workflow of the GCN layer in this paper is de-
scribed below. Firstly, this paper generates the dependent syntactic
3.2 BiGRU Layer graph by using the LTP tool of HIT, and generates the undirected
Although the traditional LSTM algorithm has good results in deal- graph G with n The output of BiGRU layer is then used as the
ing with sequential problems, it suffers from long training time, input of GCN layer to extract the spatial feature information and
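The sketch below (an assumption, not the authors' implementation) shows how such a bidirectional GRU can be built with tf.keras: two GRUs run in opposite temporal directions and their hidden states are concatenated at every time step, giving the per-token representation s_t = [forward state; backward state] that is passed on to the GCN layer. The sequence length and number of GRU units are illustrative values.

```python
# Sketch of a BiGRU layer producing per-token hidden states.
import tensorflow as tf

seq_len, emb_dim, gru_units = 30, 512, 128   # seq_len and gru_units assumed

inputs = tf.keras.Input(shape=(seq_len, emb_dim))      # Word2Vec word vectors
bigru_states = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(gru_units, return_sequences=True),
    merge_mode="concat",                               # [forward; backward]
)(inputs)                                              # (batch, seq_len, 2*gru_units)

bigru = tf.keras.Model(inputs, bigru_states)
print(bigru.output_shape)                              # (None, 30, 256)
```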
3.3 GCN Layer
GCN is a convolutional neural network that operates on graph structures and expands its receptive field by receiving neighborhood information. The workflow of the GCN layer in this paper is described below. First, the dependency syntactic graph is generated with the LTP tool of HIT, giving the undirected graph G with n nodes. The output of the BiGRU layer is then used as the input of the GCN layer, which extracts the spatial feature information and the nonlinear, complex semantic relations of the text: the adjacency matrix A is multiplied with the feature matrix S so that each vertex aggregates the features of its neighbors, the result is multiplied by a parameter matrix W, and an activation function σ applies a nonlinear transformation, yielding the matrix H of aggregated neighbor features. An identity matrix I_N is added to the adjacency matrix A so that each vertex also preserves its own features while propagating information, and the normalization D̃^{-1/2} Ã D̃^{-1/2} of the neighbor matrix Ã keeps the scale of the feature matrix H stable. The computational procedure of the GCN model is shown below.

H = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} S W\right)

where Ã = A + I_N; A and I_N denote the adjacency matrix and the identity matrix of the undirected graph G, respectively; D̃ denotes the degree matrix of Ã; S denotes the output of the BiGRU layer; W denotes the parameter matrix; σ denotes the activation function; and H denotes the output of the GCN layer.
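As a concrete illustration, the following numpy sketch (an assumption, not the authors' implementation) performs one such propagation step: self-loops are added, the adjacency matrix is symmetrically normalized, and the BiGRU features S are aggregated and linearly transformed. ReLU is used here as a stand-in for the activation σ, and all matrix sizes are toy values.

```python
# One GCN propagation step: H = sigma(D~^{-1/2} A~ D~^{-1/2} S W).
import numpy as np

def gcn_layer(A, S, W):
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # A~ = A + I_N (add self-loops)
    d = A_tilde.sum(axis=1)                      # degree of each vertex
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_hat @ S @ W, 0.0)        # ReLU as the activation sigma

# Toy dependency graph with 4 tokens; S comes from the BiGRU (dim 256 assumed).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)        # undirected dependency edges
S = np.random.randn(4, 256)                      # BiGRU output features
W = np.random.randn(256, 64)                     # learnable parameter matrix
H = gcn_layer(A, S, W)                           # aggregated features, (4, 64)
```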
4 EXPERIMENTATION AND ANALYSIS
4.1 Data Pre-processing
This paper uses the THUCNews dataset: 120,000 news headlines with lengths between 20 and 30 words were extracted from THUCNews, covering 10 categories (finance, property, stock, education, technology, society, current affairs, sports, games and entertainment) with 20,000 items per category, for a total of 120,000 items. For better text classification, the dataset was preprocessed: special characters such as line breaks were removed, the data were segmented with the jieba word segmentation tool, and the word embedding information of the segmented text was then initialized with Word2Vec. The dataset was then randomly divided into a training set and a test set in a ratio of 8:2, used for model training and performance evaluation, respectively. The structure of the dataset is shown in Table 1.
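A minimal sketch of this preprocessing pipeline (assumed, not the authors' script) is shown below; the example headlines, labels and the character-filtering pattern are placeholders.

```python
# Sketch: clean, segment with jieba, and split the data 8:2.
import re
import jieba
from sklearn.model_selection import train_test_split

def preprocess(headline):
    headline = re.sub(r"[\r\n\t]", "", headline)     # drop line breaks etc.
    return list(jieba.cut(headline))                 # jieba word segmentation

headlines = ["央行宣布下调存款准备金率", "世界杯小组赛今晚开战"]   # placeholders
labels = ["finance", "sports"]

tokens = [preprocess(h) for h in headlines]
X_train, X_test, y_train, y_test = train_test_split(
    tokens, labels, test_size=0.2, random_state=42)  # random 8:2 split
```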
4.2 Evaluation Indicators
In this paper, precision, recall and the F1-score are used as evaluation indicators; they are calculated as follows.

\mathrm{precision} = \frac{TP}{TP + FP}

\mathrm{recall} = \frac{TP}{TP + FN}

F1 = \frac{2 \times \mathrm{recall} \times \mathrm{precision}}{\mathrm{precision} + \mathrm{recall}}

where TP denotes the number of samples predicted to be positive and correctly classified, TN denotes the number of samples predicted to be negative and correctly classified, FP denotes the number of samples that are actually negative but incorrectly classified, and FN denotes the number of samples that are actually positive but incorrectly classified.
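In practice these indicators can be computed directly with scikit-learn, as in the brief sketch below; the label lists are illustrative, and macro averaging over the 10 categories is an assumption rather than something stated in the paper.

```python
# Sketch: precision, recall and F1 on predicted category labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["sports", "finance", "sports", "education"]   # placeholder labels
y_pred = ["sports", "finance", "education", "education"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(precision, recall, f1)
```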
4.3 Comparison Experiments and Parameter Settings
In this experiment, the BiGRU_GCN model is compared experimentally with the following three models.
1. The RNN neural network model proposed in the literature [16].
2. The BiLSTM-Attention neural network model proposed in the literature [17].
3. The BiGRU-Attention neural network model proposed in the literature [18].
All experiments are divided into three parts: data preprocessing, feature extraction and text classification; the preprocessing procedure and the experimental hyperparameters are kept consistent across models. A dynamically adjusted learning rate is used: the experiments run for 30 epochs, with one learning rate for the first 10 epochs, another for epochs 10 to 20, and a third for the last 10 epochs. The parameter settings of this experimental model are shown in Table 2; a minimal training-configuration sketch based on these settings is given after the table.

Table 2: Model parameter setting table

Parameter name                   Parameter value
Word vector dimension            512
Learning rate (epochs 1-10)      1e-2
Learning rate (epochs 10-20)     1e-3
Learning rate (epochs 20-30)     1e-4
Loss function                    categorical_crossentropy
Epochs                           30
Batch size                       64
Dropout                          0.1
L2 regularization parameter      5e-4
Optimizer                        Adam
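The snippet below is a minimal sketch, assuming a tf.keras setup, of how the settings in Table 2 can be wired together; the model and the data are random stand-ins for the fused BiGRU-GCN classifier and the preprocessed THUCNews inputs, and the dropout and L2 regularization entries are omitted for brevity.

```python
# Sketch: Adam, categorical cross-entropy, batch size 64, 30 epochs,
# and a stepped learning rate of 1e-2 / 1e-3 / 1e-4.
import numpy as np
import tensorflow as tf

# Random stand-ins for the real classifier and data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(256,)),
])
X_train = np.random.randn(64, 256).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 10, 64), 10)

def step_lr(epoch, lr):
    # Learning-rate schedule from Table 2 (epochs are 0-indexed here).
    if epoch < 10:
        return 1e-2
    if epoch < 20:
        return 1e-3
    return 1e-4

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30, batch_size=64,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(step_lr)],
          verbose=0)
```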
4.4 Analysis of Experimental Results for Text Classification
In this paper, experiments were conducted on the RNN model, the BiLSTM-Attention model, the BiGRU-Attention model and the improved model proposed in this paper on the Chinese text classification dataset; the experimental results are shown in Figure 4.
As can be seen from Figure 4, there is not much difference between the models in terms of precision, recall and F1 values, which indicates that the performance of the GRU model and the LSTM model is relatively close when the dataset is large; on balance, the BiGRU-Attention model works better on the text classification task than the BiLSTM-Attention model. The model that