Article
Bert-Enhanced Text Graph Neural Network for Classification
Yiping Yang and Xiaohui Cui *
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education,
Wuhan University, Wuhan 430000, China; [email protected]
* Correspondence: [email protected]
Abstract: Text classification is a fundamental research direction that aims to assign tags to text units.
Recently, graph neural networks (GNN) have exhibited some excellent properties in textual infor-
mation processing. Furthermore, the pre-trained language model also realized promising effects in
many tasks. However, many text processing methods either cannot model a single text unit's structure or ignore its semantic features. To solve these problems and comprehensively utilize the text's structural
information and semantic information, we propose a Bert-Enhanced text Graph Neural Network
model (BEGNN). For each text, we construct a text graph separately according to the co-occurrence
relationship of words and use GNN to extract text features. Moreover, we employ Bert to extract
semantic features. The former part can take into account the structural information, and the latter can
focus on modeling the semantic information. Finally, we interact and aggregate these two features
of different granularity to get a more effective representation. Experiments on standard datasets
demonstrate the effectiveness of BEGNN.
homogeneous graphs or heterogeneous graphs from text data and perform graph neural
network propagation such as convolution operations on the graphs [9,10]. In this way, the
model can take into account the structural information, which is of great significance for
understanding the meaning of the text. However, some methods build text graphs on the
entire dataset, weakening the individual features of each document.
Based on the above analysis, the existing text classification methods have some limita-
tions in text feature extraction. First, most models use RNN, LSTM [5] and other methods
to process serialized data, which cannot take into account the text structure information.
Secondly, some methods based on graph neural networks extract the representation of text
by building a heterogeneous graph structure for the entire dataset, but it is hard for them to
consider a single text's semantic features. In addition, some methods have combined the
structural and semantic features of sequences, but they either cannot consider single-text
features alone or do not consider the interaction between features, which limits their
representation ability.
To solve the problems of these algorithms, we construct the BEGNN model. Specifically,
we first construct a graph structure for each document separately. Moreover, we propose
to aggregate the features extracted by Bert and the features extracted from the graph
structures. The former represents the semantic information of the documents, and the
latter is a representation that considers the structural features of the text. Compared with
other work, we also add a co-attention module to handle the interaction between features,
and perform a variety of experiments on integrating the features, which helps maximize
the representation ability of the extracted features.
Our contributions are as follows:
(1) Our model extracts features of different granularities from a pre-trained language
model and graph neural networks for text representation. It takes into account not only
the semantic information but also the structural information, which improves the quality
of the learned text representation.
(2) In order to prevent the two features from being separated during the prediction
process, we have designed and performed experiments on co-attention modules as well as
different aggregation methods, which can consider the interaction of the two representa-
tions and make full use of them to achieve better classification capabilities.
(3) The experiment results and analysis on four datasets demonstrate the effectiveness
of BEGNN.
The remainder of this paper is organized as follows: Section 2 introduces research on text
classification methods related to our work, Section 3 illustrates the proposed model, Section 4
presents the experimental results, and Section 5 concludes the paper.
2. Related Work
2.1. Traditional Feature Engineering Method
Traditional text classification methods need to extract manually defined features, and
they are often combined with machine learning algorithms for training and prediction.
For a specific task, some early studies classify sentences or documents by analyzing the
text data and extracting statistical features of the text, then training on a pre-specified
training set. Bag-Of-Words (BOW) [11] and n-grams [12] are commonly used word-based
representation methods, which represent a sentence as a collection of the words or
n-gram sequences that occur in it. These features are usually combined with models
such as SVM [11] and have achieved good results. However, machine learning requires
extensive feature engineering and relies on domain knowledge, which makes it difficult for
features designed for a single task to be generalized to other tasks.
solutions to natural language tasks. Word2Vec [4] and GloVe [15] have been drawing great
attention in NLP tasks. Mikolov et al. have shown these pre-trained embeddings can
capture meaningful semantic features [16]. In addition, RNN models [17] have shown
advantages in processing sequence data. TextCNN [6] performs convolution operations
on text features and has achieved good results. Tan et al. [18] use a structure based on a
dynamic convolutional gated neural network, making it possible to selectively control how
much context information is contained in each specific location.
Recently, pre-trained language models have attracted a great deal of research interest. Mod-
els such as Bert [7] are pre-trained on a large corpus and can be simply transferred to
downstream NLP tasks with fine-tuning, setting new records on multiple NLP
tasks. Bert [7] takes advantage of the self-attention mechanism and builds a multi-layer
self-attention network, which can also realize parallel computing. The attention mecha-
nism is applied to various models, greatly improving the performance of various NLP
tasks [19]. There have also been some studies exploring how to efficiently use Bert for
natural language processing tasks [20–22]. These models have proved effective in
extracting features, but they cannot fully utilize the text's structural features, whereas graph
structures have natural advantages in modeling structural information.
There has been some research that models text as a graph structure for feature ex-
traction. GNN [23] can capture the features of the nodes and the structural features in
the graph, which can learn more effective representations for the nodes or the whole
graph. GatedGNN [10] and GCN [9] have been applied to the task of text classification.
TextGCN [9] constructs a heterogeneous graph of words and documents, and
uses co-occurrence features and TF-IDF to measure the relationships between words and
documents. For a new document, it needs to update the whole graph structure to perform
prediction. Additionally, it cannot take into account the structural characteristics of a
single document well. TextING [10] builds a graph structure for each single text, which can
learn the fine-grained word representation of the local structure. Lei et al. [24] designed
a structure that can integrate the graph convolutional features of multi-layer neighbors,
alleviating the problem of over-fitting to a certain extent. However, semantic features used
in the models rely on pre-trained word embeddings, which limits the effect of the model.
Parcheta et al. [25] studied the influence of embeddings extracted by combining different
methods on text classification models.
There are also some methods that combine the pre-trained language model with graph
neural networks to extract features. VGCN-Bert [26] builds a graph of the whole dataset,
and uses the features extracted by GCN [27] to enhance the effect of Bert [7]. However, as
in GCN, the unique structural characteristics of each text cannot be fully taken into account.
Jeong et al. [28] simply concatenate the features of Bert and GCN for the recommendation
task, but this method cannot consider the features’ interactive relationship, which reduces
the representation ability. We show the methods of some related works in Table 1.
Considering the above problems, we propose to combine the features extracted by
Bert [7] and graph neural networks, which can take into account the semantic and structural
information of a single text. Different from previous work, we first build a graph
structure for each text separately and combine the graph neural network and Bert to extract
features of different granularity, whereas most existing studies build a graph on the entire
dataset or do not combine characteristics of different granularity. In addition, we
employ a co-attention module to integrate the features. As far as we know, we are the first
to employ a co-attention module to combine the features of graph networks and Bert for
text classification, so that we can take advantage of feature representations with
different granularity.
[24] Builds a graph for the entire dataset and integrates the graph convolution features of multi-hop neighbors.
[28] The outputs of GNN and Bert are concatenated for the recommendation task.
[18] Dynamically gated convolutional neural network.
3. Method
In this part, we describe the structure of BEGNN in detail.
Figure 1. The architecture of BEGNN. (a) The input document. (b) Graph construction and graph neural network based
feature extraction. (c) Bert based feature extraction. (d) Interactive feature aggregation. (e) Fully-connected layer.
between the nodes. By stacking such layers for T times, the nodes are able to receive the
information of their T-hop neighbors. The formulas of the propagation recurrence in the
t-th layer are as follows.
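These instantiate the gated graph neural network update of Li et al. [29]; the following is a sketch consistent with the symbol descriptions below, with W_a, W_z, W_r, W_h, U_z, U_r, U_h and the bias terms as the trainable weights:

a^{t+1} = A H^t W_a (1)
z^{t+1} = σ(W_z a^{t+1} + U_z H^t + b_z) (2)
r^{t+1} = σ(W_r a^{t+1} + U_r H^t + b_r) (3)
H̃^{t+1} = tanh(W_h a^{t+1} + U_h (r^{t+1} ⊙ H^t) + b_h) (4)
H^{t+1} = (1 − z^{t+1}) ⊙ H^t + z^{t+1} ⊙ H̃^{t+1} (5)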
where A ∈ R^{|V|×|V|} is the adjacency matrix and a^{t+1} represents the result of the interaction
between the nodes and their adjacent nodes through the edges. Formulas (2)–(4) are similar
to the calculation process of the GRU: z^{t+1} controls the forgotten information,
and r^{t+1} controls the newly generated information. H^{t+1} is the final updated node state
of the (t+1)-th layer, σ is the sigmoid function, and W, U and b are trainable weight matrices.
To simplify, we can write such a message passing process as

H^{t+1} = GGNN(H^t, A; Θ^t) (6)

where Θ^t is the parameter set of the gated graph neural network of the t-th layer. After
message passing through T layers, we get the final representation H_0^T.
Attention(Q, K, V) = softmax(QK^T / √d_k) V (7)
Q, K and V are the matrices of queries, keys and values, respectively, and d_k is their
dimension. Furthermore, multi-head attention can be defined as:
MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O (8)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) (9)
After the multi-layer Transformer module, we eventually get the final word feature
representation H_0^{bert}.
Then we compute the attention representation of the GNN output conditioned on the Bert
output, and the attention representation of the Bert output conditioned on the GNN output.
In this way, we obtain mutually conditioned attention features between the two representations.
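A minimal sketch of this co-attention step is given below; the scaling factor and the absence of learned projection layers are simplifying assumptions for illustration, not the exact formulation used in BEGNN.

```python
import torch
import torch.nn.functional as F

def co_attention(h_gnn: torch.Tensor, h_bert: torch.Tensor):
    """Mutually conditioned attention between aligned GNN and Bert word features.

    Both inputs are assumed to have shape (seq_len, dim) and to be aligned
    token by token (the graph nodes follow the Bert tokenization).
    """
    # Affinity between every GNN token feature and every Bert token feature.
    affinity = h_gnn @ h_bert.T / h_gnn.size(-1) ** 0.5   # (seq_len, seq_len)
    # GNN representation attended over, conditioned on the Bert output.
    gnn_given_bert = F.softmax(affinity.T, dim=-1) @ h_gnn
    # Bert representation attended over, conditioned on the GNN output.
    bert_given_gnn = F.softmax(affinity, dim=-1) @ h_bert
    return gnn_given_bert, bert_given_gnn
```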
ŷ_i = softmax(W H_i + b) (13)
L = −∑_i y_i · log ŷ_i (14)
W, b are trainable parameters. ŷi and yi are the predicted and true label for the document,
respectively.
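A minimal sketch of this classification step (Equations (13) and (14)) follows; the mean pooling and concatenation used here to form the document vector H_i are illustrative assumptions, since the paper compares several aggregation strategies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    """Fully-connected classifier over the aggregated document representation."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(2 * dim, num_classes)

    def forward(self, gnn_feat, bert_feat, label=None):
        # Pool the word-level features of each branch and concatenate them
        # into a single document vector (an illustrative aggregation choice).
        doc = torch.cat([gnn_feat.mean(dim=0), bert_feat.mean(dim=0)])
        logits = self.fc(doc)                 # W H_i + b
        y_hat = F.softmax(logits, dim=-1)     # Equation (13)
        loss = None
        if label is not None:
            # Cross-entropy over the true class, corresponding to Equation (14).
            loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
        return y_hat, loss
```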
4. Experiments
Here, we evaluated the effect of BEGNN and compared it with baseline models on
four publicly available datasets.
4.1. Datasets
We adopted four widely used datasets for text classification:
MR [30]. It is a sentiment classification dataset in which each review is classified as positive
or negative.
SST-2 [31]. It is the Stanford Sentiment Treebank dataset, which includes sentences
from movie reviews. Each sample is labeled as negative or positive.
R8 [32]. It is a subset of the Reuters-21578 dataset and has been manually classified
into eight categories.
Ohsumed [33]. It is from the MEDLINE database, which is a bibliographic database.
Each document is classified into one of 23 cardiovascular disease categories.
The statistics are in Table 2.
For each dataset, we use 10% of the training data for validation to assist in model
training. For each document in the dataset, we proceed as follows. First,
the BertTokenizer is used to segment the document. Second, in the Bert-based feature
extraction module, we directly use the resulting tokens as the input. Third, in the graph
neural network based module, to ensure that the two modules are aligned, we build the
text graph over the same Bert tokens and use GloVe word vectors as the words'
initial representations.
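The sketch below illustrates this preprocessing step; the sliding-window size used to define word co-occurrence and the GloVe lookup table (a plain token-to-vector dictionary) are illustrative assumptions rather than values reported in the paper.

```python
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_text_graph(text: str, glove: dict, window: int = 3, dim: int = 300):
    """Tokenize one document with the BertTokenizer and build its word graph."""
    tokens = tokenizer.tokenize(text)          # shared with the Bert-based module
    vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
    adj = np.zeros((len(vocab), len(vocab)))
    # Connect words that co-occur within a sliding window over the token sequence.
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            adj[vocab[tok], vocab[other]] = adj[vocab[other], vocab[tok]] = 1.0
    # Initialize node features with GloVe vectors (zeros for out-of-vocabulary tokens).
    feats = np.stack([glove.get(tok, np.zeros(dim)) for tok in vocab])
    return tokens, adj, feats
```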
thesaurus and documents together. The difference is that TextING builds a text graph of
words in each document. By comparing these methods, we can analyze which feature is
more important to the model.
Table 3. Classification performance of BEGNN and the baseline models on the four datasets.
(1) BEGNN outperforms all the baselines. We use a Bert-based feature extraction module
and a GNN-based feature extraction module, and the co-attention module is employed
to combine the two features interactively. This suggests that the combination of the
GNN-based method and the pre-trained language method benefits text processing.
(2) The longer the text, the more obvious the improvement our model brings. According
to the statistics of the datasets, the texts of R8 and Ohsumed are longer; on the Ohsumed
dataset in particular, the average text length is 79. On the datasets where the average text
length is less than 20, the performance improvement of our model is relatively smaller
than on the other two datasets with longer texts. This shows that our model can better
process longer texts. Our feature extraction module based on graph neural networks passes
messages through multiple layers and can mine the information of multi-hop neighbors.
In contrast to RNN-based models, the self-attention module in Bert can also attend to
words that are farther away.
(3) The RNN-based model outperforms Fasttext and TextGCN on two datasets and shows
comparable capability on R8, which demonstrates its advantages in processing sequential
data. However, it does not perform well on Ohsumed. The texts of this dataset are long,
which causes difficulties in processing long-distance context: RNN-based models have no
advantage when dealing with longer text data, because information is lost after long-distance
propagation. LSTM adds a memory module to alleviate the long-distance dependence
problem of the traditional RNN architecture, but when the average text length exceeds 70,
as in the Ohsumed dataset, some problems remain.
(4) TextGCN and TextING are graph-based models. When they are used in text
classification tasks, TextING achieves better results on each dataset. This is because
TextGCN constructs a graph of the entire corpus, which is low-density, whereas TextING
constructs a graph structure for each document separately; this takes into account the
different structural information of each text and is not as sparse as the graph in TextGCN.
(5) The performance of VGCN-Bert surpasses the other models besides our proposed
model. It takes the features extracted from graph neural networks and word embedding
features as the input of the attention module. However, it builds a graph structure on the
entire dataset; compared with our approach of building a graph structure from a single
text, it cannot fully consider the unique structural characteristics of each text. Furthermore,
it chooses to concatenate the two representations and send them to the attention module.
Different from it, we interact and aggregate the features from the GNN module and the
Bert-based module, which avoids the separation of the two representations and utilizes
their correlation.
Compared to other related models, first of all, the experimental results demonstrate
the superiority of BEGNN. Secondly, our model shows a more obvious advantage in
processing long texts and can extract features that span longer distances. In addition,
our model can take into account both the semantic and the structural information of the given
documents. The Transformer module in Bert uses the attention mechanism to perform
parallel calculations and extracts semantic features, while the GNN-based module can
extract the structural information of the text well. The interactive aggregation of these two
features combines their advantages to the greatest extent. This ensures that BEGNN attains
a better effect than the baseline models.
Figure 4. Ablation study of the text graph and the co-attention modules of the model.
Compared with using Bert only for training and testing, our original model with the graph
neural network achieves significant improvements on all four datasets. This confirms the
necessity of adding the text graph neural network to our proposed model. Among the four
datasets, the model with graph structure features achieves the most significant effect on
Ohsumed, showing the advantages of BEGNN in processing longer texts. Compared with the
model without the graph neural network feature extraction module, even without feature
interaction, the model containing the two granular features still achieves better results than
the original Bert model. This also illustrates the importance of adding structural features:
besides semantic features, adding structural features can improve the representation
ability of the extracted joint features.
5. Conclusions
In this article, we conduct research on text classification algorithms. The application
scenarios of text classification are very extensive, and it is important in public opinion
analysis and news classification. We propose a Bert-enhanced graph neural network
(BEGNN) to improve the representation ability of text. Although it is designed for text
classification, its ideas can be applied to other research fields, such as information retrieval.
We build a text graph structure for each document and extract the structural features of
the text. Furthermore, Bert is used to extract semantic features. In addition, we add an
interaction module and aggregate the semantic and structural features of the text. Different
from other studies, we take into account the two granular text features in an innovative
way and employ the co-attention module to interact and aggregate them. Experimental
results demonstrate the effectiveness of BEGNN.
In future research, we will further study what algorithms and features will have a
positive impact on the deep learning model when using Bert and graph neural network
for feature extraction. At the same time, we will study how to use this analysis result to
further optimize the model, increase the interpretability of the model and produce more
fine-grained and reasonable interpretations. We will also consider further research on
lightweight optimization to reduce the cost of computation and inference while maintaining
the effectiveness of the model.
Author Contributions: Conceptualization, Y.Y. and X.C.; methodology, Y.Y.; software, Y.Y.; validation,
Y.Y.; formal analysis, Y.Y. and X.C.; investigation, Y.Y.; resources, Y.Y.; data curation, Y.Y.; writing—
original draft preparation, Y.Y.; writing—review and editing, Y.Y. and X.C.; visualization, Y.Y.;
supervision, Y.Y.; project administration, Y.Y. All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The source code and the datasets used in the experiments are available
at https://github.com/pingpingand/BEGNN, accessed on 24 July 2021.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning–based Text Classification: A
Comprehensive Review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
2. Chen, W.; Yu, W.; He, G.; Jiang, N. Coarse-to-Fine Attention Network via Opinion Approximate Representation for Aspect-
Level Sentiment Classification. In International Conference on Neural Information Processing; Springer: Cham, Switzerland, 2020;
pp. 704–715.
3. Wang, P.; Hu, J.; Zeng, H.J.; Chen, Z. Using Wikipedia knowledge to improve text classification. Knowl. Inf. Syst. 2009, 19, 265–281.
[CrossRef]
4. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv Prepr. 2013,
arXiv:1301.3781.
5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
6. Guo, B.; Zhang, C.; Liu, J.; Ma, X. Improving text classification with weighted word embeddings via a multi-channel TextCNN
model. Neurocomputing 2019, 363, 366–374. [CrossRef]
7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv Prepr. 2018, arXiv:1810.04805.
8. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv Prepr. 2014,
arXiv:1409.0473.
9. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial
Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7370–7377.
10. Zhang, Y.; Yu, X.; Cui, Z.; Wu, S.; Wen, Z.; Wang, L. Every document owns its structure: Inductive text classification via graph
neural networks. arXiv Prepr. 2020, arXiv:2004.13826.
11. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference on
Machine Learning; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142.
12. Ayed, R.; Labidi, M.; Maraoui, M. Arabic text classification: New study. In Proceedings of the 2017 International Conference on
Engineering & MIS (ICEMIS), Monastir, Tunisia, 8–10 May 2017; pp. 1–7.
13. Ma, C.; Shi, X.; Zhu, W.; Li, W.; Cui, X.; Gui, H. An Approach to Time Series Classification Using Binary Distribution Tree.
In Proceedings of the 2019 15th International Conference on Mobile Ad-Hoc and Sensor Networks, Shenzhen, China, 11–13
December 2019; pp. 399–404.
14. Li, W.; Liu, X.; Liu, J.; Chen, P.; Wan, S.; Cui, X. On improving the accuracy with auto-encoder on conjunctivitis. Appl. Soft Comput.
2019, 81, 105489. [CrossRef]
15. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
16. Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Atlanta, GA, USA, 9–15 June 2013; pp. 746–751.
17. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for
relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin,
Germany, 7–12 August 2016; pp. 207–212.
18. Tan, Z.; Chen, J.; Kang, Q.; Zhou, M.; Abusorrah, A.; Sedraoui, K. Dynamic embedding projection-gated convolutional neural
networks for text classification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–10. [CrossRef] [PubMed]
19. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 606–615.
20. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune bert for text classification? In China National Conference on Chinese Computational
Linguistics; Springer: Cham, Switzerland, 2019; pp. 194–206.
21. González-Carvajal, S.; Garrido-Merchán, E.C. Comparing BERT against traditional machine learning text classification. arXiv Prepr.
2020, arXiv:2005.13012.
22. Jin, D.; Jin, Z.; Zhou, J.T.; Szolovits, P. Is bert really robust? a strong baseline for natural language attack on text classification
and entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020;
pp. 8018–8025.
23. Cai, H.; Zheng, V.W.; Chang, K. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Trans.
Knowl. Data Eng. 2018, 30, 1616–1637. [CrossRef]
24. Lei, F.; Liu, X.; Li, Z.; Dai, Q.; Wang, S. Multihop Neighbor Information Fusion Graph Convolutional Network for Text
Classification. Math. Probl. Eng. 2021. [CrossRef]
25. Parcheta, Z.; Sanchis-Trilles, G.; Casacuberta, F.; Rendahl, R. Combining Embeddings of Input Data for Text Classification. Neural
Process. Lett. 2020, 53, 1–29. [CrossRef]
26. Lu, Z.; Du, P.; Nie, J.Y. VGCN-BERT: Augmenting BERT with graph embedding for text classification. In European Conference on
Information Retrieval; Springer: Cham, Switzerland, 2020; pp. 369–382.
27. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv Prepr. 2016, arXiv:1609.02907.
28. Jeong, C.; Jang, S.; Shin, H.; Park, E.; Choi, S. A context-aware citation recommendation model with BERT and graph convolutional
networks. arXiv Prepr. 2019, arXiv:1903.06464.
29. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv Prepr. 2015, arXiv:1511.05493.
30. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv Prepr.
2005, arXiv:cs/0506075.
31. Socher, R.; Perelygin, A.; Wu, J.Y.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive deep models for semantic composition-
ality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
Seattle, WA, USA, 18–21 October 2013.
32. Cardoso-Cachopo, A.; Oliveira, A.L. Semi-supervised single-label text categorization using centroid-based classifiers. In Proceed-
ings of the 2007 ACM Symposium on Applied Computing, Seoul, Korea, 11–15 March 2007; pp. 844–851.
33. Hersh, W.; Buckley, C.; Leone, T.J.; Hickam, D. OHSUMED: An interactive retrieval evaluation and new large test collection for
research. In SIGIR’94; Springer: London, UK, 1994; pp. 192–201.
34. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv Prepr. 2016, arXiv:1607.01759.
35. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the Acoustics,
Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
36. Kaselimi, M.; Doulamis, N.; Voulodimos, A.; Protopapadakis, E.; Doulamis, A. Context aware energy disaggregation using
adaptive bidirectional LSTM models. IEEE Trans. Smart Grid 2020, 11, 3054–3067. [CrossRef]
37. Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.; Sculley, D. Google vizier: A service for black-box optimization.
In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS,
Canada, 13–17 August 2017; pp. 1487–1495.