
BertGCN: Transductive Text Classification by Combining GCN and BERT


Yuxiao Lin♠ , Yuxian Meng♣ , Xiaofei Sun♣
Qinghong Han♣ , Kun Kuang♠ , Jiwei Li♠♣ and Fei Wu♠

♠ Computer Science Department, Zhejiang University
♣ ShannonAI
{yuxiaolinling, kunkuang, jiwei_li, wufei}@zju.edu.cn
{yuxian_meng, xiaofei_sun, qinghong_han}@shannonai.com

Abstract

In this work, we propose BertGCN, a model that combines large-scale pretraining and transductive learning for text classification. BertGCN constructs a heterogeneous graph over the dataset and represents documents as nodes using BERT representations. By jointly training the BERT and GCN modules within BertGCN, the proposed model is able to leverage the advantages of both worlds: large-scale pretraining, which takes advantage of the massive amount of raw data, and transductive learning, which jointly learns representations for both training data and unlabeled test data by propagating label influence through graph convolution. Experiments show that BertGCN achieves SOTA performances on a wide range of text classification datasets.¹

¹ Code available at https://github.com/ZeroRin/BertGCN.

1 Introduction

Text classification is a core task in natural language processing (NLP) and has been used in many real-world applications such as spam detection (Wang, 2010) and opinion mining (Bakshi et al., 2016). Transductive learning (Vapnik, 1998) is a particular approach to text classification which makes use of both labeled and unlabeled examples in the training process. Graph neural networks (GNNs) serve as an effective approach for transductive learning (Yao et al., 2019; Liu et al., 2020). In these works, a graph is constructed to model the relationships between documents. Nodes in the graph represent text units such as words and documents, while edges are constructed based on the semantic similarity between nodes. GNNs are then applied to the graph to perform node classification. The merits of GNNs and transductive learning are as follows: (1) the decision for an instance (both training and test) does not depend merely on itself, but also on its neighbors, which makes the model more immune to data outliers; (2) at training time, since the model propagates influence from supervised labels across both training and test instances through graph edges, unlabeled data also contributes to representation learning, and consequently to higher performances.

Large-scale pretraining has recently demonstrated its effectiveness on a variety of NLP tasks (Devlin et al., 2018; Liu et al., 2019). Trained on large-scale unlabeled corpora in an unsupervised manner, large-scale pretrained models are able to learn implicit but rich text semantics at scale. Intuitively, large-scale pretrained models have the potential to benefit transductive learning. However, existing models for transductive text classification (Yao et al., 2019; Liu et al., 2020) did not take large-scale pretraining into consideration, and its effectiveness in this setting remains unclear.

In this work, we propose BertGCN, a model that combines the advantages of both large-scale pretraining and transductive learning for text classification. BertGCN constructs a heterogeneous graph for the corpus whose nodes are words and documents, initializes node embeddings with pretrained BERT representations, and uses graph convolutional networks (GCN) for classification. By jointly training the BERT and GCN modules, the proposed model is able to leverage the advantages of both worlds: large-scale pretraining, which takes advantage of the massive amount of raw data, and transductive learning, which jointly learns representations for both training data and unlabeled test data by propagating label influence through graph edges. The proposed BertGCN model successfully combines the powers of large-scale pretraining and graph networks, and achieves new state-of-the-art performances on a wide range of text classification datasets.
2 Related Work

Graph neural networks (GNNs) are connectionist models that capture dependencies and relations between graph nodes via message passing through the edges that connect them (Scarselli et al., 2008; Hamilton et al., 2017; Xu et al., 2018). GNNs are practically categorized into (Wu et al., 2020): graph convolutional networks (Kipf and Welling, 2016a; Wu et al., 2019), graph attention networks (Veličković et al., 2017; Zhang et al., 2018a), graph auto-encoders (Cao et al., 2016; Kipf and Welling, 2016b), graph generative networks (De Cao and Kipf, 2018; Li et al., 2018b) and graph spatial-temporal networks (Li et al., 2017; Yu et al., 2017). GNNs serve as powerful tools to utilize the relationships between different objects, and have been applied to various domains such as traffic prediction (Yu et al., 2018; Zhang et al., 2018a) and recommendation (Zhang et al., 2020; Monti et al., 2017). In the context of NLP, GNNs have achieved remarkable successes across a wide range of end tasks such as relation extraction (Zhang et al., 2018b), semantic role labeling (Marcheggiani and Titov, 2017), data-to-text generation (Marcheggiani and Perez-Beltrachini, 2018), machine translation (Bastings et al., 2017) and question answering (Song et al., 2018; De Cao et al., 2018).

The prevalence of neural networks has motivated a diverse array of works on developing neural models for text classification. Different neural architectures (Kim, 2014; Zhou et al., 2015; Radford et al., 2018; Chai et al., 2020) have demonstrated their effectiveness against traditional statistical feature-based methods (Wallach, 2006). Other works leverage label embeddings and jointly train them along with the input texts (Wang et al., 2018; Pappas and Henderson, 2019). More recently, the success achieved by large-scale pretrained models has spurred great interest in adapting the large-scale pretraining framework (Devlin et al., 2018) to text classification (Reimers and Gurevych, 2019), leading to remarkable progress on few-shot (Mukherjee and Awadallah, 2020) and zero-shot (Ye et al., 2020) learning.

Our work is inspired by work using graph neural networks for text classification (Yao et al., 2019; Huang et al., 2019; Zhang and Zhang, 2020). But different from these works, we focus on combining large-scale pretrained models and GNNs, and show that GNNs can significantly benefit from large-scale pretraining. Existing works that combine BERT and GNNs use a graph to model relationships between tokens within a single document (Lu et al., 2020; He et al., 2020b), which falls into the category of inductive learning. Different from these works, we use a graph to model relationships between different samples from the whole corpus, so as to utilize the similarity between labeled and unlabeled documents, and use GNNs to learn their relationships.

3 Method

3.1 BertGCN

In the proposed BertGCN model, we initialize representations for document nodes in a text graph using a BERT-style model (e.g., BERT, RoBERTa). These representations are used as inputs to GCN. Document representations are then iteratively updated based on the graph structure using GCN, the outputs of which are treated as final representations for document nodes and are sent to a softmax classifier for prediction. In this way, we are able to leverage the complementary strengths of pretrained models and graph models.

Specifically, we construct a heterogeneous graph containing both word nodes and document nodes following TextGCN (Yao et al., 2019). We define word-document edges and word-word edges based on term frequency-inverse document frequency (TF-IDF) and positive point-wise mutual information (PPMI), respectively. The weight of the edge between two nodes i and j is defined as:

A_{i,j} = \begin{cases} \mathrm{PPMI}(i, j), & i, j \text{ are words and } i \neq j \\ \text{TF-IDF}(i, j), & i \text{ is a document and } j \text{ is a word} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}   (1)

In TextGCN, an identity matrix X = I_{n_doc + n_word} is used as the initial node feature matrix, where n_doc is the number of document nodes and n_word is the number of word nodes (covering both training and test documents). In BertGCN, we instead use a BERT-style model to obtain document embeddings and treat them as the input representations of document nodes. Document node embeddings are denoted by X_doc ∈ R^{n_doc × d}, where d is the embedding dimensionality. Overall, the initial node feature matrix is given by:

X = \begin{pmatrix} X_{doc} \\ 0 \end{pmatrix}_{(n_doc + n_word) \times d}   (2)
We feed X into a GCN model (Kipf and Welling, 2016a), which iteratively propagates messages across training and test examples. Specifically, the output feature matrix L^{(i)} of the i-th GCN layer is computed as

L^{(i)} = \rho(\tilde{A} L^{(i-1)} W^{(i)})   (3)

where ρ is an activation function, \tilde{A} is the normalized adjacency matrix and W^{(i)} ∈ R^{d_{i-1} × d_i} is the weight matrix of the layer. L^{(0)} = X is the input feature matrix of the model. The outputs of GCN are treated as final representations for documents, which are then fed to a softmax layer for classification:

Z_{GCN} = \mathrm{softmax}(g(X, A))   (4)

where g represents the GCN model. We use the cross-entropy loss over labeled document nodes to jointly optimize the parameters of BERT and GCN.

3.2 Interpolating BERT and GCN Predictions

Practically, we find that optimizing BertGCN with an auxiliary classifier that directly operates on the BERT embeddings leads to faster convergence and better performances. Specifically, we construct the auxiliary classifier by directly feeding the document embeddings (denoted by X) to a dense layer with softmax activation:

Z_{BERT} = \mathrm{softmax}(W X)   (5)

The final training objective is the linear interpolation of the prediction from BertGCN and the prediction from BERT, which is given by:

Z = \lambda Z_{GCN} + (1 - \lambda) Z_{BERT}   (6)

where λ controls the tradeoff between the two objectives: λ = 1 means we use the full BertGCN model, and λ = 0 means we only use the BERT module. When λ ∈ (0, 1), we are able to balance the predictions from both models, and the BertGCN model can be better optimized.

The explanation for the better performances achieved by the interpolation is as follows: Z_{BERT} directly operates on the input of GCN, making sure that the inputs to GCN are regulated and optimized towards the objective. This helps the multi-layer GCN model overcome intrinsic drawbacks such as gradient vanishing and over-smoothing (Li et al., 2018a), and thus leads to better performances.
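The following is a minimal PyTorch sketch of Equations (3)-(6), assuming the normalized adjacency Ã and the node feature matrix are available as dense tensors; the class names (TwoLayerGCN, BertGCNSketch), the hidden size of 256, and the default λ = 0.7 are illustrative choices rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Two propagation steps of Eq. (3): L^(i) = rho(A_hat @ L^(i-1) @ W^(i))."""
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, n_classes, bias=False)

    def forward(self, a_hat, x):
        h = F.relu(a_hat @ self.w1(x))   # first layer, rho = ReLU
        return a_hat @ self.w2(h)        # second layer outputs per-node logits

class BertGCNSketch(nn.Module):
    """Interpolates the GCN and BERT predictions as in Eq. (6)."""
    def __init__(self, bert, hidden_size, n_classes, lam=0.7):
        super().__init__()
        self.bert = bert                                   # kept so a training loop can refresh embeddings
        self.gcn = TwoLayerGCN(hidden_size, 256, n_classes)
        self.aux_head = nn.Linear(hidden_size, n_classes)  # dense layer of Eq. (5)
        self.lam = lam

    def forward(self, a_hat, node_feats, doc_index):
        z_gcn = F.softmax(self.gcn(a_hat, node_feats)[doc_index], dim=-1)  # Eq. (4)
        z_bert = F.softmax(self.aux_head(node_feats[doc_index]), dim=-1)   # Eq. (5)
        return self.lam * z_gcn + (1.0 - self.lam) * z_bert                # Eq. (6)
```

Since Z is already a probability distribution, the cross-entropy over labeled documents can be computed as F.nll_loss(torch.log(Z + 1e-10), labels).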
3.3 Optimization using Memory Bank

The original GCN model uses full-batch gradient descent for training, which is intractable for the proposed BertGCN model, since the full-batch method cannot be applied to BERT due to memory limitations. Inspired by techniques in contrastive learning that decouple the dictionary size from the mini-batch size (Wu et al., 2018; He et al., 2020a), we introduce a memory bank that stores all document embeddings, decoupling the training batch size from the total number of nodes in the graph.

Specifically, during training, we maintain a memory bank M that tracks the input features of all document nodes. At the beginning of each epoch, we first compute all document embeddings using the current BERT module and store them in M. During each iteration, we sample a mini-batch from both labeled and unlabeled document nodes with index set B = {b_0, b_1, ..., b_n}, where n is the mini-batch size. We then compute their document embeddings M_B, also using the current BERT module, and update the corresponding entries in M.² Next, we use the updated M as input to derive the GCN output and compute the loss for the current mini-batch. For back-propagation, M is treated as constant except for the entries in B.

With the memory bank, we are able to efficiently train the BertGCN model, including the BERT module. However, during training, the embeddings in the memory bank are computed by the BERT module at different steps within an epoch and are thus inconsistent. To overcome this issue, we set a small learning rate for the BERT module to improve the consistency of the stored embeddings. With a low learning rate, training takes more time; to speed it up, we fine-tune a BERT model on the target dataset before training begins and use it to initialize the BERT parameters in BertGCN.

² Note that the BERT module used to compute M_B is the one that finished training in the last iteration, which is different from the BERT module used to compute the initial M.
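A sketch of one training epoch with the memory bank, under several assumptions: the BertGCNSketch module above, a Hugging Face-style encoder whose output exposes last_hidden_state, and a doc_loader that yields (node indices, input_ids, attention_mask); names such as train_epoch and labeled_mask are hypothetical, and the per-step bookkeeping of footnote 2 is simplified.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, a_hat, memory_bank, doc_loader, labels, labeled_mask, optimizer):
    """memory_bank: tensor [n_doc + n_word, d]; word rows stay zero, document rows
    cache BERT [CLS] embeddings so the GCN sees the whole graph at every step."""
    model.train()
    # Refresh all document embeddings with the current BERT module at epoch start.
    with torch.no_grad():
        for batch_idx, input_ids, attention_mask in doc_loader:
            cls = model.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
            memory_bank[batch_idx] = cls

    for batch_idx, input_ids, attention_mask in doc_loader:
        optimizer.zero_grad()
        # Recompute embeddings for the sampled documents so gradients reach BERT.
        cls = model.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        feats = memory_bank.detach().clone()     # everything outside the batch is constant
        feats[batch_idx] = cls                   # only rows in B carry gradients
        memory_bank[batch_idx] = cls.detach()    # keep the bank up to date

        z = model(a_hat, feats, batch_idx)       # interpolated prediction (Eq. 6)
        keep = labeled_mask[batch_idx]           # loss only over labeled documents
        if keep.any():
            loss = F.nll_loss(torch.log(z[keep] + 1e-10), labels[batch_idx][keep])
            loss.backward()
            optimizer.step()
```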
4 Experiments

4.1 Experiment Setups

We run experiments on five widely-used text classification benchmarks: 20 Newsgroups (20NG)³, R8 and R52⁴, Ohsumed⁵ and Movie Review (MR)⁶. We compare BertGCN to current state-of-the-art pretrained and GCN models: TextGCN (Yao et al., 2019), SGC (Wu et al., 2019), BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). Details on the datasets and baselines are left to the supplementary material.

³ http://qwone.com/~jason/20Newsgroups/
⁴ https://www.cs.umb.edu/~smimarog/textmining/datasets/
⁵ http://disi.unitn.it/moschitti/corpora.htm
⁶ http://www.cs.cornell.edu/people/pabo/movie-review-data/

We follow the protocols in TextGCN to preprocess the data. For BERT and RoBERTa, we use the output feature of the [CLS] token as the document embedding, followed by a feedforward layer to derive the final prediction. We use BERT-base and a two-layer GCN to implement BertGCN. We initialize the learning rate to 1e-3 for the GCN module and 1e-5 for the fine-tuned BERT module. We also implement our model with RoBERTa and GAT (Veličković et al., 2017). The GAT variants are trained over the same graph as the GCN variants, but learn edge weights through an attention mechanism instead of using the predefined weight matrix.
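As an illustration of the optimization setup above, the sketch below builds an optimizer with separate learning rates for the GCN and BERT modules; it assumes the BertGCNSketch module from the earlier sketches, so the attribute names (bert, gcn, aux_head) and the helper build_optimizer are illustrative rather than taken from the released code.

```python
import torch

def build_optimizer(model, bert_lr=1e-5, gcn_lr=1e-3):
    """Adam with per-module learning rates: a small lr for BERT keeps the
    memory-bank embeddings consistent, while the GCN can train faster."""
    return torch.optim.Adam([
        {"params": model.bert.parameters(), "lr": bert_lr},
        {"params": model.gcn.parameters(), "lr": gcn_lr},
        {"params": model.aux_head.parameters(), "lr": gcn_lr},
    ])
```

The returned optimizer can be passed directly to the train_epoch sketch from Section 3.3.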

4.2 Main Results

Model        20NG  R8    R52   Ohsumed  MR
TextGCN      86.3  97.1  93.6  68.4     76.7
SGC          88.5  97.2  94.0  68.5     75.9
BERT         85.3  97.8  96.4  70.5     85.7
RoBERTa      83.8  97.8  96.2  70.7     89.4
BertGCN      89.3  98.1  96.6  72.8     86.0
RoBERTaGCN   89.5  98.2  96.1  72.8     89.7
BertGAT      87.4  97.8  96.5  71.2     86.5
RoBERTaGAT   86.5  98.0  96.1  71.2     89.2

Table 1: Results for different models on transductive text classification datasets. We run all models 10 times and report the mean test accuracy.

Table 1 presents the test accuracy of each model. We can see that BertGCN and RoBERTaGCN perform the best across all datasets. Only using BERT or RoBERTa generally performs better than the GCN variants except on 20NG, which is due to the great merits brought by large-scale pretraining. Compared with BERT and RoBERTa, the performance boost from BertGCN and RoBERTaGCN is significant on the 20NG and Ohsumed datasets. This is because the average document length in 20NG and Ohsumed is much longer than that in the other datasets: the graph is constructed using word-document statistics, which means that long texts may produce more document connections transited via an intermediate word node; this potentially benefits message passing through the graph, leading to better performances when combined with GCN. This may also explain why GCN models perform better than BERT models on 20NG. For datasets with shorter documents such as R52 and MR, the power of the graph structure is limited, and thus the performance boost is smaller relative to 20NG. BertGAT and RoBERTaGAT can also benefit from the graph structure, but their performances are not as good as those of the GCN variants due to the lack of edge weight information.

4.3 The Effect of λ

λ controls the trade-off between training BertGCN and BERT. The optimal value of λ can be different for different tasks. Figure 1 shows the accuracy of RoBERTaGCN with different λ. On 20NG, the accuracy is consistently higher with larger λ values. This can be explained by the high performance of graph-based methods on 20NG. The model reaches its best when λ = 0.7, performing slightly better than only using the GCN prediction (λ = 1).

Figure 1: Accuracy of RoBERTaGCN when varying λ on the 20NG development set. The dotted line indicates the corresponding RoBERTa baseline.⁷

⁷ The original training/test split of 20NG is based on post date, but the development set is randomly sampled from the original training set. The accuracy on the test set is thus much lower than that on the development set.

4.4 The Effect of Strategies in Joint Training

Strategy   w/ both  w/o finetune  w/o small lr.  w/o both
Accuracy   94.7     93.8          10.3⁸          10.3⁸

Table 2: Accuracy on the 20NG development set for different strategies. "finetune" means we use the finetuned RoBERTa as initialization, and "small lr." means we use a smaller learning rate for the RoBERTa module.

⁸ Experiments without a small lr. failed to converge.
To overcome the inconsistency of embeddings in the memory bank, we set a smaller learning rate for the BERT module and use a finetuned BERT model for initialization. We evaluate the effect of these two strategies. Table 2 shows the results of RoBERTaGCN on 20NG with and without them. With the same learning rate for RoBERTa and GCN, the model cannot be trained due to inconsistency in the memory bank, regardless of whether the fine-tuned RoBERTa is used. Models can be successfully trained when we set a smaller learning rate for the RoBERTa module, and additionally using the finetuned RoBERTa leads to the best performance.

5 Conclusion and Future Work

In this work, we propose BertGCN, which combines the advantages of large-scale pretrained models and transductive learning for text classification. We efficiently train BertGCN by using a memory bank that stores all document embeddings and updates part of them with respect to the sampled mini-batch. The framework of BertGCN can be built on top of any document encoder and any graph model. Experiments demonstrate the power of the proposed BertGCN model. However, in this work, we only use document statistics to build the graph, which might be sub-optimal compared to models that are able to automatically construct edges between nodes. We leave this to future work.

References

Rushlene Kaur Bakshi, Navneet Kaur, Ravneet Kaur, and Gurpreet Kaur. 2016. Opinion mining and sentiment analysis. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pages 452–455. IEEE.

Jasmijn Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima'an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, Copenhagen, Denmark. Association for Computational Linguistics.

Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. 2020. Description based text classification with reinforcement learning. In International Conference on Machine Learning, pages 1371–1382. PMLR.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.

Nicola De Cao and Thomas Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020a. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Qi He, Han Wang, and Yue Zhang. 2020b. Enhancing generalization in natural language inference by syntax. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4973–4978.

Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2019. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Thomas N. Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Thomas N. Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018a. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.

Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. 2018b. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.

Xien Liu, Xinxin You, Xiao Zhang, Ji Wu, and Ping Lv. 2020. Tensor graph convolutional networks for text classification.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhibin Lu, Pan Du, and Jian-Yun Nie. 2020. VGCN-BERT: Augmenting BERT with graph embedding for text classification. In European Conference on Information Retrieval, pages 369–382. Springer.

Diego Marcheggiani and Laura Perez-Beltrachini. 2018. Deep graph convolutional encoders for structured data to text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 1–9, Tilburg University, The Netherlands. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.

Federico Monti, Michael M. Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3700–3710.

Subhabrata Mukherjee and Ahmed Hassan Awadallah. 2020. Uncertainty-aware self-training for text classification with few labels.

Nikolaos Pappas and James Henderson. 2019. GILE: A generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics, 7:139–155.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984.

Alex Hai Wang. 2010. Don't follow me: Spam detection in Twitter. In 2010 International Conference on Security and Cryptography (SECRYPT), pages 1–10. IEEE.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174.

Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153.

Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Zhiquan Ye, Yuxia Geng, Jiaoyan Chen, Jingmin Chen, Xiaoxiao Xu, SuHang Zheng, Feng Wang, Jun Zhang, and Huajun Chen. 2020. Zero-shot text classification via reinforced self-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3014–3024, Online. Association for Computational Linguistics.

Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.

Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3634–3640.

Haopeng Zhang and Jiawei Zhang. 2020. Text graph transformer for document classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8322–8327.

Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018a. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018.

Shengyu Zhang, Ziqi Tan, Zhou Zhao, Jin Yu, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, and Fei Wu. 2020. Comprehensive information integration modeling framework for video titling. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2744–2754.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.
