Robust Graph Representation Learning via Neural Sparsification

Cheng Zheng 1  Bo Zong 2  Wei Cheng 2  Dongjin Song 2  Jingchao Ni 2  Wenchao Yu 2  Haifeng Chen 2  Wei Wang 1
Figure 1. Top row: small samples of (sparsified) graphs for illustration. Bottom row: visualization of (learned) node representations. (a) Node Features: node representations are the input two-dimensional node features. (b) With Task-irrelevant Edges: node representations are learned by a two-layer GCN on top of graphs with task-irrelevant edges. (c) By DropEdge: node representations are learned by DropEdge (with a two-layer GCN). (d) By NeuralSparse: node representations are learned by NeuralSparse (with a two-layer GCN).
As the predefined distributions may not adapt well to subsequent tasks, such methods could suffer from sub-optimal prediction performance. Multiple recent efforts strive to make use of supervision signals to remove noisy edges (Wang et al., 2019). However, the proposed methods are either transductive and difficult to scale (Franceschi et al., 2019) or suffer from high gradient variance, which increases training difficulty (Rong et al., 2020).

Present work. We propose Neural Sparsification (NeuralSparse), a general framework that simultaneously learns to select task-relevant edges and to learn graph representations from feedback signals of downstream tasks. NeuralSparse consists of two major components: the sparsification network and the GNN. For the sparsification network, we utilize a deep neural network to parameterize the sparsification process: how to select edges from the one-hop neighborhood given a fixed budget. In the training phase, the network learns to optimize a sparsification strategy that favors downstream tasks. In the testing phase, the network sparsifies input graphs following the learned strategy, instead of sampling subgraphs from a predefined distribution. Unlike conventional sparsification techniques, our technique takes both structural and non-structural information as input and optimizes the sparsification strategy by feedback from downstream tasks, instead of using (possibly irrelevant) heuristics. For the GNN component, NeuralSparse feeds the sparsified graphs to GNNs and learns graph representations for subsequent prediction tasks. Under the NeuralSparse framework, graph sparsification and representations can be optimized simultaneously with standard stochastic gradient descent and backpropagation. As shown in Figure 1(d), with task-irrelevant edges automatically excluded, the node representations learned by NeuralSparse suggest a clearer boundary between Blue and Red with promising generalization power, and the sparsification learned by NeuralSparse can be more effective than the regularization provided by layer-wise random edge dropping (Rong et al., 2020), shown in Figure 1(c).

Experimental results on both public and private datasets demonstrate that NeuralSparse consistently provides improved performance for existing GNNs on node classification tasks, yielding up to 7.2% improvement.

2. Related Work

Our work is related to two lines of research: graph sparsification and graph representation learning.

Graph sparsification. The goal of graph sparsification is to find small subgraphs of large input graphs that best preserve desired properties. Existing techniques are mainly unsupervised and deal with simple graphs without node/edge features, preserving predefined graph metrics (Hübler et al., 2008), information propagation traces (Mathioudakis et al., 2011), graph spectrum (Calandriello et al., 2018; Chakeri et al., 2016; Adhikari et al., 2018), node degree distribution (Eden et al., 2018; Voudigari et al., 2016), node distance distribution (Leskovec & Faloutsos, 2006), or clustering coefficient (Maiya & Berger-Wolf, 2010). Importance-based edge sampling has also been studied in scenarios where edge importance can be predefined (Zhao, 2015; Chen et al., 2018).
Unlike existing methods that mainly work with simple graphs without node/edge features in an unsupervised manner, our method takes node/edge features as part of the input and optimizes graph sparsification with supervision signals from errors made in downstream tasks.

Graph representation learning. Graph neural networks (GNNs) are the most popular techniques that enable vector representation learning for large graphs with complex node/edge features. All existing GNNs share a common spirit: extracting local structural features by neighborhood aggregation. Scarselli et al. (2009) explore how to extract multi-hop features by iterative neighborhood aggregation. Inspired by the success of convolutional neural networks, multiple studies (Defferrard et al., 2016; Bruna et al., 2014) investigate how to learn convolutional filters in the graph spectral domain under transductive settings. To enable inductive learning, convolutional filters in the graph domain have been proposed (Simonovsky & Komodakis, 2017; Niepert et al., 2016; Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2018), and a few studies (Hamilton et al., 2017; Lee et al., 2018) explore how to differentiate neighborhood filtering by sequential models. Multiple recent works (Xu et al., 2019; Abu-El-Haija et al., 2019) investigate the expressive power of GNNs, and Ying et al. (2019) propose to identify critical subgraph structures with trained GNNs. In addition, Franceschi et al. (2019) study how to sample high-quality subgraphs in a transductive setting by learning Bernoulli variables on individual edges. Recent efforts also attempt to sample subgraphs from predefined distributions (Zeng et al., 2020; Hamilton et al., 2017) and to regularize graph learning by random edge dropping (Rong et al., 2020).

Our work contributes from a unique angle: by inductively learning to select task-relevant edges from downstream supervision signals, our technique can further boost the generalization performance of existing GNNs.

3. Proposed Method: NeuralSparse

In this section, we introduce the core idea of our method. We start with the notation that is frequently used in this paper. We then describe the theoretical justification behind NeuralSparse and our architecture for tackling the supervised node classification problem.

Notations. We represent an input graph of n nodes as G = (V, E, A): (1) V ∈ R^{n×d_n} includes node features with dimensionality d_n; (2) E ∈ R^{n×n} is a binary matrix where E(u, v) = 1 if there is an edge between node u and node v; (3) A ∈ R^{n×n×d_e} encodes input edge features of dimensionality d_e. Besides, we use Y to denote the prediction target in downstream tasks (e.g., Y ∈ R^{n×d_l} if we are dealing with a node classification problem with d_l classes).

Theoretical justification. From the perspective of statistical learning, the key to a defined prediction task is to learn P(Y | G), where Y is the prediction target and G is an input graph. Instead of directly working with original graphs, we would like to leverage sparsified subgraphs to remove task-irrelevant information. In other words, we are interested in the following variant,

    P(Y | G) ≈ Σ_{g ∈ S_G} P(Y | g) P(g | G),    (1)

where g is a sparsified subgraph and S_G is a class of sparsified subgraphs of G.

In general, because of the combinatorial complexity in graphs, it is intractable to enumerate all possible g, as well as to estimate the exact values of P(Y | g) and P(g | G). Therefore, we approximate the distributions by tractable functions,

    Σ_{g ∈ S_G} P(Y | g) P(g | G) ≈ Σ_{g ∈ S_G} Q_θ(Y | g) Q_φ(g | G),    (2)

where Q_θ and Q_φ are approximation functions for P(Y | g) and P(g | G), parameterized by θ and φ, respectively.

Moreover, to make the above graph sparsification process differentiable, we employ the reparameterization trick (Jang et al., 2017) to let Q_φ(g | G) directly generate differentiable samples, such that

    Σ_{g ∈ S_G} Q_θ(Y | g) Q_φ(g | G) ∝ Σ_{g′ ∼ Q_φ(g | G)} Q_θ(Y | g′),    (3)

where g′ ∼ Q_φ(g | G) means g′ is a random sample drawn from Q_φ(g | G).

To this end, the key is to find appropriate approximation functions Q_φ(g | G) and Q_θ(Y | g).

Architecture. In this paper, we propose Neural Sparsification (NeuralSparse) to implement the theoretical framework discussed in Equation 3. As shown in Figure 2, NeuralSparse consists of two major components: the sparsification network and GNNs.

• The sparsification network is a multi-layer neural network that implements Q_φ(g | G): taking G as input, it generates a random sparsified subgraph of G drawn from a learned distribution.

• GNNs implement Q_θ(Y | g): taking the sparsified subgraph as input, they extract node representations and make predictions for downstream tasks.

As the sparsified subgraph samples are differentiable, the two components can be jointly trained with gradient-descent-based backpropagation from a supervised loss function, as illustrated in Algorithm 1.
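Before detailing the two components, the end-to-end training step implied by Equation 3 can be sketched as follows. This is a minimal sketch, not the authors' implementation: sparsifier and gnn are assumed to be PyTorch modules implementing Q_φ(g | G) and Q_θ(Y | g), and the cross-entropy node-classification loss and the interfaces are our assumptions.

    import torch
    import torch.nn.functional as F

    def training_step(sparsifier, gnn, optimizer, V, E, A, labels, train_mask, l=1):
        """One joint update of phi (sparsifier) and theta (GNN) from the task loss."""
        optimizer.zero_grad()
        loss = 0.0
        for _ in range(l):                      # draw l sparsified subgraphs g ~ Q_phi(g | G)
            g = sparsifier(V, E, A)             # differentiable (relaxed) edge selections
            logits = gnn(V, g)                  # Q_theta(Y | g) on the sparsified graph
            loss = loss + F.cross_entropy(logits[train_mask], labels[train_mask])
        loss = loss / l
        loss.backward()                         # gradients reach both theta and phi
        optimizer.step()
        return float(loss.detach())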
(Figure 2: schematic of NeuralSparse — an MLP-based sparsification network produces a sparsified subgraph that is fed into GNNs; the loss L backpropagates gradients ∂L/∂θ to the GNNs and ∂L/∂φ to the sparsification network.)
While GNNs have been widely investigated in recent works (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018), we focus on the practical implementation of the sparsification network in the remainder of this paper.

Algorithm 1 Training algorithm for NeuralSparse
1: Input: graph G = (V, E, A), integer l, and training labels Y.
2: while stop criterion is not met do
3:   Generate sparsified subgraphs {g_1, g_2, · · · , g_l} by the sparsification network (Section 4);
4:   Produce predictions {Ŷ_1, Ŷ_2, · · · , Ŷ_l} by feeding {g_1, g_2, · · · , g_l} into GNNs;
5:   Calculate the loss function J;
6:   Update φ and θ by descending J;
7: end while

4. Sparsification Network

The sparsification network sparsifies input graphs into k-neighbor subgraphs, in which each node selects no more than k edges from its one-hop neighborhood. We focus on k-neighbor subgraphs for the following reasons.

• We are able to adjust the estimation of the amount of task-relevant graph data by tuning the hyper-parameter k. Intuitively, when k is an under-estimate, the amount of task-relevant graph data accessed by GNNs could be inadequate, leading to inferior performance. When k is an over-estimate, the downstream GNNs may overfit the introduced noise or irrelevant graph data, resulting in sub-optimal performance. It could be difficult to set a golden hyper-parameter that works all the time, but one has the freedom to choose the k that best fits a specific task.

• k-neighbor subgraphs are friendly to parallel computation. As each node selects its edges independently from its neighborhood, we can utilize tensor operations in existing deep learning frameworks, such as TensorFlow (Abadi et al., 2016), to speed up the sparsification process for k-neighbor subgraphs (see the sketch below).
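To make the tensor-operation claim in the second bullet concrete, here is a minimal sketch of scoring every candidate edge of every node in parallel with dense neighbor padding and a masked softmax. The padded [n, max_deg, d] feature layout, the small MLP, and the use of PyTorch (rather than TensorFlow) are our assumptions, not the paper's implementation.

    import torch

    def neighbor_softmax(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # scores: [n, max_deg] raw edge scores; mask: [n, max_deg] marks real neighbors
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1)       # per-node edge probabilities, in parallel

    n, max_deg, d = 4, 3, 8
    nbr_feats = torch.randn(n, max_deg, d)         # concatenated (node, neighbor, edge) features
    mask = torch.tensor([[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 1]], dtype=torch.bool)
    mlp = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
    z = mlp(nbr_feats).squeeze(-1)                 # [n, max_deg] scores for all nodes at once
    pi = neighbor_softmax(z, mask)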
Note that the above process performs sampling without replacement: given a node u, each of its adjacent edges is selected at most once. Moreover, the sampling function f_φ(·) is shared among nodes; therefore, the number of parameters φ is independent of the input graph size.

Making samples differentiable. While conventional methods are able to generate discrete samples (Sadhanala et al., 2016), such samples are not differentiable, which makes it difficult to utilize them to optimize sample generation. To make samples differentiable, we propose a Gumbel-Softmax based multi-layer neural network to implement the sampling function f_φ(·) discussed above.

To make the discussion self-contained, we briefly discuss the idea of Gumbel-Softmax. Gumbel-Softmax is a reparameterization trick used to generate differentiable discrete samples (Jang et al., 2017; Maddison et al., 2017). Under appropriate hyper-parameter settings, Gumbel-Softmax is able to generate continuous vectors that are as "sharp" as the one-hot vectors widely used to encode discrete data.

Without loss of generality, we focus on a specific node u in a graph G = (V, E, A). Let N_u be the set of one-hop neighbors of the node u. We implement f_φ(·) as follows.

1. ∀v ∈ N_u, compute an edge score

    z_{u,v} = MLP_φ(V(u), V(v), A(u, v)),    (4)

where MLP_φ is a multi-layer neural network with parameters φ.

2. ∀v ∈ N_u, we employ a softmax function to compute the probability of sampling the edge,

    π_{u,v} = exp(z_{u,v}) / Σ_{w ∈ N_u} exp(z_{u,w}).    (5)

3. Using Gumbel-Softmax, we generate differentiable samples

    x_{u,v} = exp((log(π_{u,v}) + ε_v)/τ) / Σ_{w ∈ N_u} exp((log(π_{u,w}) + ε_w)/τ),    (6)

where ε_v are i.i.d. samples drawn from the Gumbel(0, 1) distribution and τ is a hyper-parameter called temperature. In general, when τ is small, the Gumbel-Softmax distribution resembles the discrete distribution, which induces strong sparsity; however, a small τ also introduces high-variance gradients that block effective backpropagation. A high value of τ cannot produce the expected sparsification effect. Following the practice in Jang et al. (2017), we adopt the strategy of starting the training with a high temperature and annealing it to a small value with a guided schedule.

Sparsification algorithm and its complexity. As shown in Algorithm 2, given hyper-parameter k, the sparsification network visits each node's one-hop neighbors k times. Let m be the total number of edges in the graph. The complexity of sampling subgraphs by the sparsification network is O(km). When k is small in practice, the overall complexity is O(m).

Algorithm 2 Sampling subgraphs by the sparsification network
1: Input: graph G = (V, E, A) and integer k.
2: Edge set H = ∅
3: for u ∈ V do
4:   for v ∈ N_u do
5:     z_{u,v} ← MLP_φ(V(u), V(v), A(u, v))
6:   end for
7:   for v ∈ N_u do
8:     π_{u,v} ← exp(z_{u,v}) / Σ_{w ∈ N_u} exp(z_{u,w})
9:   end for
10:  for j = 1, · · · , k do
11:    for v ∈ N_u do
12:      x_{u,v} ← exp((log(π_{u,v}) + ε_v)/τ) / Σ_{w ∈ N_u} exp((log(π_{u,w}) + ε_w)/τ)
13:    end for
14:    Add the edge represented by the vector [x_{u,v}] into H
15:  end for
16: end for
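As a companion to Algorithm 2, the following is a minimal sketch of the per-node Gumbel-Softmax sampling step (Equations 5 and 6) for a single node; the tensor shapes, the clamping constant, and the PyTorch interface are our choices, not the authors' code.

    import torch

    def sample_k_edges(z: torch.Tensor, k: int, tau: float) -> torch.Tensor:
        """z: [deg(u)] edge scores z_{u,v}; returns [k, deg(u)] relaxed one-hot samples x_{u,v}."""
        log_pi = torch.log_softmax(z, dim=-1)                   # log pi_{u,v} (Eq. 5)
        u = torch.rand(k, z.shape[-1]).clamp_min(1e-10)         # uniform noise, kept away from 0
        gumbel = -torch.log(-torch.log(u))                      # epsilon ~ Gumbel(0, 1)
        return torch.softmax((log_pi + gumbel) / tau, dim=-1)   # Eq. 6, k draws in parallel

    x = sample_k_edges(torch.randn(7), k=3, tau=0.5)            # a node with 7 neighbors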
5. Experimental Study

In this section, we evaluate our proposed NeuralSparse on the node classification task in both inductive and transductive settings. The experimental results demonstrate that NeuralSparse achieves superior classification performance over state-of-the-art GNN models. Moreover, we provide a case study to demonstrate how the sparsified subgraphs generated by NeuralSparse can improve classification compared against other sparsification baselines. The supplementary material contains more experimental details.

5.1. Datasets

We employ five datasets from various domains and conduct the node classification task following the settings described in Hamilton et al. (2017) and Kipf & Welling (2017). The dataset statistics are summarized in Table 1.

Inductive datasets. We utilize the Reddit and PPI datasets and follow the same setting as in Hamilton et al. (2017). The Reddit dataset contains a post-to-post graph with word vectors as node features. The node labels represent which community Reddit posts belong to. The protein-protein interaction (PPI) dataset contains graphs corresponding to different human tissues. The node features are positional gene sets, motif gene sets, and immunological signatures. The nodes are multi-labeled by gene ontology.

Graphs in the Transaction dataset contain real transactions between organizations over two years, with the first year used for training and the second year for validation/testing. Each node represents an organization and each edge indicates a transaction between two organizations. Node attributes are side information about the organizations, such as account balance, cash reserve, etc. On this dataset, the objective is to classify organizations into two categories: promising or others for investment in the near future. More details on the Transaction dataset can be found in Supplementary S1.

In the inductive setting, models can only access training nodes' attributes, edges, and labels during training. In the PPI and Transaction datasets, the models have to generalize to completely unseen graphs.

Transductive datasets. We use two citation benchmark datasets from Yang et al. (2016) and Kipf & Welling (2017) with the transductive experimental setting. The citation graphs contain nodes corresponding to documents and edges corresponding to citations. Node features are the sparse bag-of-words representations of the documents, and node labels indicate the topic class of the documents. In transductive learning, the training methods have access to all node features and edges, with a limited subset of node labels.

5.2. Experimental Setup

Baseline models. We incorporate four state-of-the-art methods as the base GNN components: GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). Besides evaluating the effectiveness and efficiency of NeuralSparse against the base GNNs, we leverage three other categories of methods in the experiments: (1) We incorporate two unsupervised graph sparsification models, the spectral sparsifier (SS, Sadhanala et al., 2016) and Rank Degree (RD, Voudigari et al., 2016); the input graphs are sparsified before being sent to the base GNNs for node classification. (2) We compare against the random layer-wise sampler DropEdge (Rong et al., 2020); similar to the Dropout trick (Hinton et al., 2012), DropEdge randomly removes connections in each node's neighborhood in each GNN layer. (3) We also incorporate LDS (Franceschi et al., 2019), which works under a transductive setting and learns Bernoulli variables associated with individual edges.

Temperature tuning. We anneal the temperature with the schedule τ = max(0.05, exp(−rp)), where p is the training epoch and r ∈ 10^{−5, −4, −3, −2, −1}. τ is updated every N steps, with N ∈ {50, 100, . . . , 500}. Compared with the MNIST VAE model in Jang et al. (2017), a smaller hyper-parameter τ fits NeuralSparse better in practice.
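For concreteness, the annealing schedule above can be written as a small helper; the default r = 1e-3 below is just one value from the stated grid, not a recommended setting.

    import math

    def temperature(epoch: int, r: float = 1e-3) -> float:
        # tau = max(0.05, exp(-r * p)), with p the training epoch; in practice tau is
        # recomputed every N steps, with N chosen from {50, 100, ..., 500}.
        return max(0.05, math.exp(-r * epoch))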
More details on the experimental settings and implementation can be found in Supplementary S2 and S3.

Metrics. We evaluate performance on the transductive datasets with accuracy (Kipf & Welling, 2017). For inductive tasks on the Reddit and PPI datasets, we report micro-averaged F1 scores (Hamilton et al., 2017). Due to the imbalanced classes in the Transaction dataset, models are evaluated with the AUC value (Huang & Ling, 2005). The results show the average of 10 runs.

5.3. Classification Performance

Table 2 summarizes the classification performance of NeuralSparse and the baseline methods on all datasets. For Reddit, PPI, Transaction, Cora, and Citeseer, the hyper-parameter k is set to 30, 15, 10, 5, and 3, respectively. The hyper-parameter l is set to 1. Note that the result of GAT on Reddit is missing due to an out-of-memory error, and LDS only works under the transductive setting. For simplicity, we only report the better performance obtained with the SS or RD sparsifier.

Overall, NeuralSparse is able to help GNN techniques achieve competitive generalization performance with sparsified graph data. We make the following observations. (1) Compared with basic GNN models, NeuralSparse can enhance the generalization performance on node classification tasks by utilizing the sparsified subgraphs from the sparsification network, especially in the inductive setting. Indeed, a large neighborhood size in the original graphs could increase the chance of introducing noise into the aggregation operations, leading to sub-optimal performance. (2) With different GNN options, NeuralSparse can consistently achieve comparable or superior performance, while other sparsification approaches tend to favor a certain GNN structure. (3) Compared with DropEdge, NeuralSparse achieves up to 13% improvement in accuracy with lower variance. In addition, a comparison between NeuralSparse and DropEdge in terms of convergence speed can be found in Supplementary S4. (4) In comparison with the two NeuralSparse variants SS-GNN and RD-GNN, NeuralSparse performs better because it can effectively leverage the guidance from downstream tasks.

Table 3. Node classification performance with κ-NN graphs

Dataset(κ)    | LDS             | NeuralSparse
Cora(10)      | 0.715 ± 0.035   | 0.723 ± 0.025
Cora(20)      | 0.703 ± 0.029   | 0.719 ± 0.021
Citeseer(10)  | 0.691 ± 0.031   | 0.723 ± 0.016
Citeseer(20)  | 0.715 ± 0.026   | 0.725 ± 0.019

In the following, we discuss the comparison between NeuralSparse and LDS (Franceschi et al., 2019) on the Cora and Citeseer datasets. Note that the row labeled LDS in Table 2 presents the classification results on the original input graphs. In addition, we adopt the κ-nearest neighbor (κ-NN) graphs suggested in Franceschi et al. (2019) for a more comprehensive evaluation. In particular, κ-NN graphs are constructed by connecting individual nodes with their top-κ similar neighbors in terms of node features, and κ is selected from {10, 20}.
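For reference, a minimal sketch of such a κ-NN graph construction; the cosine similarity measure and the dense binary adjacency output are our assumptions, since the exact construction used by Franceschi et al. (2019) is not specified here.

    import torch
    import torch.nn.functional as F

    def knn_graph(X: torch.Tensor, kappa: int) -> torch.Tensor:
        """X: [n, d] node features; returns a binary [n, n] adjacency with top-kappa neighbors per node."""
        Xn = F.normalize(X, dim=1)
        sim = Xn @ Xn.t()                        # pairwise cosine similarity
        sim.fill_diagonal_(float("-inf"))        # exclude self-loops
        idx = sim.topk(kappa, dim=1).indices     # top-kappa most similar nodes per row
        E = torch.zeros(X.shape[0], X.shape[0])
        E.scatter_(1, idx, 1.0)
        return E

    E = knn_graph(torch.randn(100, 16), kappa=10)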
In Table 3, we summarize the classification accuracy of LDS (with GCN) and NeuralSparse (with GCN). On both the original and the κ-NN graphs, NeuralSparse outperforms LDS in terms of classification accuracy. As each edge is associated with a Bernoulli variable, the large number of parameters for graph sparsification could impact the generalization power of LDS. More comparison results between NeuralSparse and LDS can be found in Supplementary S5.

5.4. Sensitivity to Hyper-parameters and the Sparsified Subgraphs

We also study how the hyper-parameter l impacts classification performance on the Transaction dataset. When l increases from 1 to 5, we observe a relatively small improvement in the classification AUC score. As the parameters in the sparsification network are shared by all edges in the graph, the estimation variance from random sampling could already be mitigated to some extent by the number of sampled edges in a sparsified subgraph. Thus, when we increase the number of sparsified subgraphs, the incremental gain could be small.

Figure 3. (a) Original graph from the Transaction dataset and sparsified subgraphs by (b) NeuralSparse, (c) Spectral Sparsifier, and (d) RD Sparsifier. Nodes are promising organizations and other organizations.

In Figure 3(a), we present a sample graph from the Transaction dataset, which consists of 38 nodes (promising organizations and other organizations) with an average node degree of 15 and node feature dimension of 120. As shown in Figure 3(b), the graph sparsified by NeuralSparse has lower complexity, with an average node degree around 5. In Figure 3(c, d), we also present the sparsified graphs output by the two baseline methods, SS and RD. More quantitative evaluations of the sparsified graphs produced by different approaches can be found in the supplementary material.
Acknowledgement

We thank the anonymous reviewers for their careful reading and insightful comments on our manuscript. The work was partially supported by NSF (DGE-1829071, IIS-2031187).
References

Adhikari, B., Zhang, Y., Amiri, S. E., Bharadwaj, A., and Prakash, B. A. Propagation-based temporal network summarization. TKDE, 2018.

Akoglu, L., Tong, H., and Koutra, D. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 2015.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

Calandriello, D., Koutis, I., Lazaric, A., and Valko, M. Improved large-scale graph learning through ridge spectral sparsification. In ICML, 2018.

Chakeri, A., Farhidzadeh, H., and Hall, L. O. Spectral sparsification in spectral clustering. In ICPR, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.

Eden, T., Jain, S., Pinar, A., Ron, D., and Seshadhri, C. Provable and practical approximations for the degree distribution using sublinear graph samples. In WWW, 2018.

Franceschi, L., Niepert, M., Pontil, M., and He, X. Learning discrete structures for graph neural networks. In ICML, 2019.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, 2017.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Huang, J. and Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. TKDE, 2005.

Leskovec, J. and Faloutsos, C. Sampling from large graphs. In KDD, 2006.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. In ICLR, 2016.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.

Liu, Y., Safavi, T., Dighe, A., and Koutra, D. Graph summarization methods and applications: A survey. ACM Computing Surveys, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

Maiya, A. S. and Berger-Wolf, T. Y. Sampling community structure. In WWW, 2010.

Mathioudakis, M., Bonchi, F., Castillo, C., Gionis, A., and Ukkonen, A. Sparsification of influence networks. In KDD, 2011.

Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In ICML, 2016.

Rong, Y., Huang, W., Xu, T., and Huang, J. DropEdge: Towards deep graph convolutional networks on node classification. In ICLR, 2020.

Sadhanala, V., Wang, Y.-X., and Tibshirani, R. Graph sparsification approaches for Laplacian smoothing. In AISTATS, 2016.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 2009.

Simonovsky, M. and Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2019.

Yu, W., Zheng, C., Cheng, W., Aggarwal, C., Song, D., Zong, B., Chen, H., and Wang, W. Learning deep network representations with adversarially regularized autoencoders. In KDD, 2018.

Zhang, Y., Zhang, F., Yao, P., and Tang, J. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop. In KDD, 2018.

Zhao, P. gSparsify: Graph motif based sparsification for graph clustering. In CIKM, 2015.