Table 1 presents the test accuracy of each model. We can see that BertGCN and RoBERTaGCN perform the best across all datasets. Using only BERT or RoBERTa generally performs better than the GCN variants on all datasets except 20NG, owing to the benefits brought by large-scale pretraining. Compared with BERT and RoBERTa, the performance boost from BertGCN and RoBERTaGCN is significant on the 20NG and Ohsumed datasets. This is because the average document length in 20NG and Ohsumed is much longer than that in the other datasets, and the graph is constructed using word-document statistics.

4.3 The Effect of λ

λ controls the trade-off between training BertGCN and BERT, and its optimal value can differ across tasks. Fig. 1 shows the accuracy of RoBERTaGCN with different values of λ. On 20NG, the accuracy is consistently higher with larger λ, which can be explained by the strong performance of graph-based methods on 20NG. The model performs best at λ = 0.7, slightly better than using only the GCN prediction (λ = 1).

4 https://www.cs.umb.edu/~smimarog/textmining/datasets/
5 http://disi.unitn.it/moschitti/corpora.htm
6 http://www.cs.cornell.edu/people/pabo/movie-review-data/
7 The original training/test split of 20NG is based on post date, but the development set is randomly sampled from the original training set. The accuracy on the test set is thus much lower than that on the development set.
8 Experiments without a smaller learning rate failed to converge.
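Since λ = 1 corresponds to using only the GCN prediction, the final output in Section 4.3 can be read as a weighted combination of the two branches. The following is a minimal PyTorch-style sketch of that interpolation; it is illustrative rather than the authors' implementation, and the choice to combine softmax outputs rather than raw logits, along with the function and argument names, is an assumption.

```python
import torch

def interpolate_predictions(gcn_logits: torch.Tensor,
                            bert_logits: torch.Tensor,
                            lam: float = 0.7) -> torch.Tensor:
    """Weight the GCN branch by lam and the BERT branch by (1 - lam).

    lam = 1.0 keeps only the GCN prediction and lam = 0.0 only the BERT
    prediction; lam = 0.7 is the best-performing value reported on 20NG.
    """
    gcn_probs = torch.softmax(gcn_logits, dim=-1)
    bert_probs = torch.softmax(bert_logits, dim=-1)
    return lam * gcn_probs + (1.0 - lam) * bert_probs

# Example: combined class probabilities for a batch of 2 documents, 4 classes.
combined = interpolate_predictions(torch.randn(2, 4), torch.randn(2, 4), lam=0.7)
```

Sweeping lam over [0, 1] with a helper like this reproduces the kind of trade-off curve shown in Fig. 1.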
4.4 The Effect of Strategies in Joint Training

To overcome the inconsistency of embeddings in the memory bank, we set a smaller learning rate for the BERT module and use a fine-tuned BERT model for initialization. We evaluate the effect of these two strategies. Table 2 shows the results of RoBERTaGCN on 20NG with and without them. With the same learning rate for RoBERTa and GCN, the model cannot be trained due to inconsistency in the memory bank, regardless of whether the fine-tuned RoBERTa is used. Models can be successfully trained when we set a smaller learning rate for the RoBERTa module, and additionally using the fine-tuned RoBERTa for initialization leads to the best performance.
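The training structure described above can be sketched as follows. This is a self-contained toy example, not the released code: the encoder and classifier are stand-ins (the real model uses a pretrained RoBERTa encoder and a GCN over the word-document graph), and the learning-rate values are placeholders. Only the structure, namely separate optimizer parameter groups with a smaller learning rate for the encoder and refreshing just the sampled rows of the memory bank at each step, mirrors the description in the text.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the RoBERTa document encoder."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class ToyGraphHead(nn.Module):
    """Stand-in for the GCN classifier over document embeddings."""
    def __init__(self, dim: int = 16, num_classes: int = 4):
        super().__init__()
        self.out = nn.Linear(dim, num_classes)

    def forward(self, doc_embeddings):
        return self.out(doc_embeddings)

encoder, graph_head = ToyEncoder(), ToyGraphHead()

# Strategy 1: a smaller learning rate for the encoder than for the graph
# module, configured through optimizer parameter groups (placeholder values).
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": graph_head.parameters(), "lr": 1e-3},
])

# Memory bank holding one embedding per document in the corpus.
num_docs, dim = 100, 16
memory_bank = torch.zeros(num_docs, dim)

def training_step(doc_ids, doc_features, labels):
    """One step: fresh embeddings for the sampled docs, stale ones elsewhere."""
    optimizer.zero_grad()
    fresh = encoder(doc_features)            # gradients flow through these rows
    graph_input = memory_bank.clone()        # other docs keep older embeddings
    graph_input[doc_ids] = fresh
    logits = graph_head(graph_input)[doc_ids]
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                    # refresh only the sampled rows
        memory_bank[doc_ids] = fresh.detach()
    return loss.item()

# Example call with random data (shapes only, for illustration).
loss = training_step(torch.arange(8), torch.randn(8, dim), torch.randint(0, 4, (8,)))
```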
5 Conclusion and Future Work

In this work, we propose BertGCN, which combines the advantages of both large-scale pretraining and transductive learning for text classification. We train BertGCN efficiently by using a memory bank that stores all document embeddings and updates only the part corresponding to the sampled mini-batch. The BertGCN framework can be built on top of any document encoder and any graph model. Experiments demonstrate the effectiveness of the proposed BertGCN model. However, in this work we only use document statistics to build the graph, which might be sub-optimal compared to models that can automatically construct edges between nodes. We leave this to future work.
References

Rushlene Kaur Bakshi, Navneet Kaur, Ravneet Kaur, and Gurpreet Kaur. 2016. Opinion mining and sentiment analysis. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pages 452–455. IEEE.
Jasmijn Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, Copenhagen, Denmark. Association for Computational Linguistics.
Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. 2020. Description based text classification with reinforcement learning. In International Conference on Machine Learning, pages 1371–1382. PMLR.
Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.
Nicola De Cao and Thomas Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020a. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.
Qi He, Han Wang, and Yue Zhang. 2020b. Enhancing generalization in natural language inference by syntax. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4973–4978.
Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2019. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018a. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. 2018b. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.
Xien Liu, Xinxin You, Xiao Zhang, Ji Wu, and Ping Lv. 2020. Tensor graph convolutional networks for text classification.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Zhibin Lu, Pan Du, and Jian-Yun Nie. 2020. VGCN-BERT: Augmenting BERT with graph embedding for text classification. In European Conference on Information Retrieval, pages 369–382. Springer.
Diego Marcheggiani and Laura Perez-Beltrachini. 2018. Deep graph convolutional encoders for structured data to text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 1–9, Tilburg University, The Netherlands. Association for Computational Linguistics.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.
Federico Monti, Michael M Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3700–3710.
Subhabrata Mukherjee and Ahmed Hassan Awadallah. 2020. Uncertainty-aware self-training for text classification with few labels.
Nikolaos Pappas and James Henderson. 2019. GILE: A generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics, 7:139–155.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.
Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
Hanna M Wallach. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984.
Alex Hai Wang. 2010. Don’t follow me: Spam detection in Twitter. In 2010 International Conference on Security and Cryptography (SECRYPT), pages 1–10. IEEE.
Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174.
Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. 2019. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.
Zhiquan Ye, Yuxia Geng, Jiaoyan Chen, Jingmin Chen, Xiaoxiao Xu, SuHang Zheng, Feng Wang, Jun Zhang, and Huajun Chen. 2020. Zero-shot text classification via reinforced self-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3014–3024, Online. Association for Computational Linguistics.
Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3634–3640.
Haopeng Zhang and Jiawei Zhang. 2020. Text graph transformer for document classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8322–8327.
Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit Yan Yeung. 2018a. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018.
Shengyu Zhang, Ziqi Tan, Zhou Zhao, Jin Yu, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, and Fei Wu. 2020. Comprehensive information integration modeling framework for video titling. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2744–2754.
Yuhao Zhang, Peng Qi, and Christopher D Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.
Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.