
Robust Graph Representation Learning via Neural Sparsification

Cheng Zheng 1 Bo Zong 2 Wei Cheng 2 Dongjin Song 2 Jingchao Ni 2 Wenchao Yu 2 Haifeng Chen 2
Wei Wang 1

Abstract

Graph representation learning serves as the core of important prediction tasks, ranging from product recommendation to fraud detection. Real-life graphs usually have complex information in the local neighborhood, where each node is described by a rich set of features and connects to dozens or even hundreds of neighbors. Despite the success of neighborhood aggregation in graph neural networks, task-irrelevant information is mixed into nodes' neighborhood, making learned models suffer from sub-optimal generalization performance. In this paper, we present NeuralSparse, a supervised graph sparsification technique that improves generalization power by learning to remove potentially task-irrelevant edges from input graphs. Our method takes both structural and non-structural information as input, utilizes deep neural networks to parameterize sparsification processes, and optimizes the parameters by feedback signals from downstream tasks. Under the NeuralSparse framework, supervised graph sparsification could seamlessly connect with existing graph neural networks for more robust performance. Experimental results on both benchmark and private datasets show that NeuralSparse can yield up to 7.2% improvement in testing accuracy when working with existing graph neural networks on node classification tasks.

1 Department of Computer Science, University of California, Los Angeles, CA, USA. 2 NEC Laboratories America, Princeton, NJ, USA. Correspondence to: Cheng Zheng <[email protected]>, Wei Wang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Representation learning has been at the center of many machine learning tasks on graphs, such as name disambiguation in citation networks (Zhang et al., 2018), spam detection in social networks (Akoglu et al., 2015), recommendations in online marketing (Ying et al., 2018), and many others (Yu et al., 2018; Li et al., 2018). As a class of models that can simultaneously utilize non-structural (e.g., node and edge features) and structural information in graphs, Graph Neural Networks (GNNs) construct effective representations for downstream tasks by iteratively aggregating neighborhood information (Li et al., 2016; Hamilton et al., 2017; Kipf & Welling, 2017). Such methods have demonstrated state-of-the-art performance in classification and prediction tasks on graph data (Veličković et al., 2018; Chen et al., 2018; Xu et al., 2019; Ying et al., 2019).

Meanwhile, the underlying motivation why two nodes get connected may have no relation to a target downstream task, and such task-irrelevant edges could hurt neighborhood aggregation as well as the performance of GNNs. Consider the example shown in Figure 1. Blue and Red are two classes of nodes, whose two-dimensional features are generated from two independent Gaussian distributions, respectively. As shown in Figure 1(a), the overlap between their feature distributions makes it difficult to find a good boundary that separates the Blue and Red nodes by node features only. Blue and Red nodes are also inter-connected, forming a graph: each node (either Blue or Red) randomly selects 10 nodes as its one-hop neighbors, and the resulting edges may not be related to node labels. On such a graph, we train a two-layer GCN (Kipf & Welling, 2017), and the node representations output from the two-layer GCN are illustrated in Figure 1(b). When task-irrelevant edges are mixed into neighborhood aggregation, the trained GCN fails to learn better representations, and it becomes difficult to learn a subsequent classifier with strong generalization power.

In this paper, we study how to utilize supervision signals to remove task-irrelevant edges in an inductive manner to achieve robust graph representation learning. Conventional methods, such as graph sparsification (Liu et al., 2018; Zhang & Patone, 2017; Leskovec & Faloutsos, 2006; Sadhanala et al., 2016), are unsupervised, so the resulting sparsified graphs may not favor downstream tasks. Several works (Zeng et al., 2020; Hamilton et al., 2017; Chen et al., 2018) focus on downsampling under predefined distributions.

(a) Node Features (b) With Task-irrelevant Edges (c) By DropEdge (d) By NeuralSparse

Figure 1. Top row: Small samples of (sparsified) graphs for illustration. Bottom row: Visualization of (learned) node representations. (a) Node representations are the input two-dimensional node features. (b) Node representations are learned from a two-layer GCN on top of graphs with task-irrelevant edges. (c) Node representations are learned from DropEdge (with a two-layer GCN). (d) Node representations are learned from NeuralSparse (with a two-layer GCN).

As the predefined distributions may not well adapt to subsequent tasks, such methods could suffer sub-optimal prediction performance. Multiple recent efforts strive to make use of supervision signals to remove noise edges (Wang et al., 2019). However, the proposed methods are either transductive with difficulty to scale (Franceschi et al., 2019) or of high gradient variance, bringing increased training difficulty (Rong et al., 2020).

Present work. We propose Neural Sparsification (NeuralSparse), a general framework that simultaneously learns to select task-relevant edges and graph representations by feedback signals from downstream tasks. NeuralSparse consists of two major components: the sparsification network and GNNs. For the sparsification network, we utilize a deep neural network to parameterize the sparsification process: how to select edges from the one-hop neighborhood given a fixed budget. In the training phase, the network learns to optimize a sparsification strategy that favors downstream tasks. In the testing phase, the network sparsifies input graphs following the learned strategy, instead of sampling subgraphs from a predefined distribution. Unlike conventional sparsification techniques, our technique takes both structural and non-structural information as input and optimizes the sparsification strategy by feedback from downstream tasks, instead of using (possibly irrelevant) heuristics. For the GNN component, NeuralSparse feeds the sparsified graphs to GNNs and learns graph representations for subsequent prediction tasks. Under the NeuralSparse framework, by standard stochastic gradient descent and backpropagation techniques, we can simultaneously optimize graph sparsification and representations. As shown in Figure 1(d), with task-irrelevant edges automatically excluded, the node representations learned from NeuralSparse suggest a clearer boundary between Blue and Red with promising generalization power, and the sparsification learned by NeuralSparse could be more effective than the regularization provided by layer-wise random edge dropping (Rong et al., 2020) shown in Figure 1(c).

Experimental results on both public and private datasets demonstrate that NeuralSparse consistently provides improved performance for existing GNNs on node classification tasks, yielding up to 7.2% improvement.

2. Related Work

Our work is related to two lines of research: graph sparsification and graph representation learning.

Graph sparsification. The goal of graph sparsification is to find small subgraphs from input large graphs that best preserve desired properties. Existing techniques are mainly unsupervised and deal with simple graphs without node/edge features, preserving predefined graph metrics (Hübler et al., 2008), information propagation traces (Mathioudakis et al., 2011), graph spectrum (Calandriello et al., 2018; Chakeri et al., 2016; Adhikari et al., 2018), node degree distribution (Eden et al., 2018; Voudigari et al., 2016), node distance distribution (Leskovec & Faloutsos, 2006), or clustering coefficient (Maiya & Berger-Wolf, 2010). Importance-based edge sampling has also been studied in scenarios where edge importance can be predefined (Zhao, 2015; Chen et al., 2018).

Unlike existing methods that mainly work with simple graphs without node/edge features in an unsupervised manner, our method takes node/edge features as part of the input and optimizes graph sparsification by supervision signals from errors made in downstream tasks.
Graph representation learning. Graph neural networks (GNNs) are the most popular techniques that enable vector representation learning for large graphs with complex node/edge features. All existing GNNs share a common spirit: extracting local structural features by neighborhood aggregation. Scarselli et al. (2009) explore how to extract multi-hop features by iterative neighborhood aggregation. Inspired by the success of convolutional neural networks, multiple studies (Defferrard et al., 2016; Bruna et al., 2014) investigate how to learn convolutional filters in the graph spectral domain under transductive settings. To enable inductive learning, convolutional filters in the graph domain are proposed (Simonovsky & Komodakis, 2017; Niepert et al., 2016; Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2018), and a few studies (Hamilton et al., 2017; Lee et al., 2018) explore how to differentiate neighborhood filtering by sequential models. Multiple recent works (Xu et al., 2019; Abu-El-Haija et al., 2019) investigate the expressive power of GNNs, and Ying et al. (2019) propose to identify critical subgraph structures with trained GNNs. In addition, Franceschi et al. (2019) study how to sample high-quality subgraphs in a transductive setting by learning Bernoulli variables on individual edges. Recent efforts also attempt to sample subgraphs from predefined distributions (Zeng et al., 2020; Hamilton et al., 2017) and regularize graph learning by random edge dropping (Rong et al., 2020).

Our work contributes from a unique angle: by inductively learning to select task-relevant edges from downstream supervision signals, our technique can further boost generalization performance for existing GNNs.

3. Proposed Method: NeuralSparse

In this section, we introduce the core idea of our method. We start with the notation that is frequently used in this paper. We then describe the theoretical justification behind NeuralSparse and our architecture to tackle the supervised node classification problem.

Notations. We represent an input graph of n nodes as G = (V, E, A): (1) V ∈ R^{n×dn} includes node features with dimensionality dn; (2) E ∈ R^{n×n} is a binary matrix where E(u, v) = 1 if there is an edge between node u and node v; (3) A ∈ R^{n×n×de} encodes input edge features of dimensionality de. Besides, we use Y to denote the prediction target in downstream tasks (e.g., Y ∈ R^{n×dl} if we are dealing with a node classification problem with dl classes).

Theoretical justification. From the perspective of statistical learning, the key of a defined prediction task is to learn P(Y | G), where Y is the prediction target and G is an input graph. Instead of directly working with original graphs, we would like to leverage sparsified subgraphs to remove task-irrelevant information. In other words, we are interested in the following variant,

    P(Y | G) ≈ Σ_{g ∈ S_G} P(Y | g) P(g | G),    (1)

where g is a sparsified subgraph, and S_G is a class of sparsified subgraphs of G.

In general, because of the combinatorial complexity in graphs, it is intractable to enumerate all possible g as well as estimate the exact values of P(Y | g) and P(g | G). Therefore, we approximate the distributions by tractable functions,

    Σ_{g ∈ S_G} P(Y | g) P(g | G) ≈ Σ_{g ∈ S_G} Q_θ(Y | g) Q_φ(g | G),    (2)

where Q_θ and Q_φ are approximation functions for P(Y | g) and P(g | G) parameterized by θ and φ, respectively.

Moreover, to make the above graph sparsification process differentiable, we employ reparameterization tricks (Jang et al., 2017) to make Q_φ(g | G) directly generate differentiable samples, such that

    Σ_{g ∈ S_G} Q_θ(Y | g) Q_φ(g | G) ∝ Σ_{g' ∼ Q_φ(g | G)} Q_θ(Y | g'),    (3)

where g' ∼ Q_φ(g | G) means g' is a random sample drawn from Q_φ(g | G).

To this end, the key is how to find appropriate approximation functions Q_φ(g | G) and Q_θ(Y | g).

Architecture. In this paper, we propose Neural Sparsification (NeuralSparse) to implement the theoretical framework discussed in Equation 3. As shown in Figure 2, NeuralSparse consists of two major components: the sparsification network and GNNs.

• The sparsification network is a multi-layer neural network that implements Q_φ(g | G): taking G as input, it generates a random sparsified subgraph of G drawn from a learned distribution.

• GNNs implement Q_θ(Y | g): taking the sparsified subgraph as input, they extract node representations and make predictions for downstream tasks.

As the sparsified subgraph samples are differentiable, the two components can be jointly trained using gradient-descent-based backpropagation from a supervised loss function, as illustrated in Algorithm 1.
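To make this composition concrete, here is a self-contained PyTorch sketch (our illustration under assumed module shapes and hyper-parameters, not the authors' released code) that wires a toy sparsification network to a one-layer GNN and trains both jointly from a supervised loss, in the spirit of Equation 3 and Algorithm 1 below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparsificationNetwork(nn.Module):
    """Toy stand-in for Q_phi(g | G): scores candidate edges from endpoint
    features and draws a differentiable k-neighbor sample for every node."""
    def __init__(self, d_node, d_hidden, k):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * d_node, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, V, E, tau):
        n = V.size(0)
        pair = torch.cat([V.unsqueeze(1).expand(n, n, -1),       # features of u
                          V.unsqueeze(0).expand(n, n, -1)], -1)  # features of v
        z = self.mlp(pair).squeeze(-1)                 # edge scores z_{u,v}
        z = z.masked_fill(E == 0, -1e9)                # restrict to one-hop neighbors
        # k Gumbel-Softmax draws per row; element-wise max keeps the soft edges.
        draws = [F.gumbel_softmax(z, tau=tau, dim=-1) for _ in range(self.k)]
        return torch.stack(draws).max(dim=0).values    # soft sparsified adjacency

class SimpleGNN(nn.Module):
    """One-layer mean-aggregation classifier standing in for Q_theta(Y | g)."""
    def __init__(self, d_node, n_classes):
        super().__init__()
        self.lin = nn.Linear(d_node, n_classes)

    def forward(self, V, adj):
        deg = adj.sum(-1, keepdim=True).clamp(min=1e-6)   # avoid division by zero
        return self.lin(adj @ V / deg)

# Toy data: 20 nodes, 8 features, 2 classes, random one-hop neighborhoods.
V = torch.randn(20, 8)
E = (torch.rand(20, 20) < 0.3).float().fill_diagonal_(0)
Y = torch.randint(0, 2, (20,))

sparsifier, gnn = SparsificationNetwork(8, 16, k=5), SimpleGNN(8, 2)
opt = torch.optim.Adam(list(sparsifier.parameters()) + list(gnn.parameters()), lr=1e-2)

for epoch in range(10):                  # Algorithm 1 with l = 1 sample per step
    g_soft = sparsifier(V, E, tau=1.0)   # differentiable sample g' ~ Q_phi(g | G)
    loss = F.cross_entropy(gnn(V, g_soft), Y)   # supervised loss through Q_theta
    opt.zero_grad()
    loss.backward()                      # gradients reach both theta and phi
    opt.step()
```

In this sketch, F.gumbel_softmax is applied directly to the edge scores z; because softmax is invariant to a per-row constant shift, this matches applying Gumbel noise to log π as in the sparsification network detailed in Section 4.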
[Figure 2 here: the input graph G is fed to the sparsification network (an MLP), which produces a sparsified graph g; GNNs then produce classification results Ŷ and loss L, whose gradients ∂L/∂φ and ∂L/∂θ flow back to both components.]

Figure 2. The overview of NeuralSparse

While GNNs have been widely investigated in recent works (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018), we focus on the practical implementation of the sparsification network in the remainder of this paper.

Algorithm 1 Training algorithm for NeuralSparse
1: Input: graph G = (V, E, A), integer l, and training labels Y.
2: while stop criterion is not met do
3:   Generate sparsified subgraphs {g_1, g_2, ..., g_l} by the sparsification network (Section 4);
4:   Produce predictions {Ŷ_1, Ŷ_2, ..., Ŷ_l} by feeding {g_1, g_2, ..., g_l} into GNNs;
5:   Calculate loss function J;
6:   Update φ and θ by descending J;
7: end while

4. Sparsification Network

Following the theory discussed above, the goal of the sparsification network is to generate sparsified subgraphs for input graphs, serving as the approximation function Q_φ(g | G). Therefore, we need to answer the following three questions in the sparsification network. (i) What is S_G in Equation 1, the class of subgraphs we focus on? (ii) How do we sample sparsified subgraphs? (iii) How do we make the sparsified subgraph sampling process differentiable for end-to-end training? In the following, we address the questions one by one.

k-neighbor subgraphs. We focus on k-neighbor subgraphs for S_G (Sadhanala et al., 2016): given an input graph, a k-neighbor subgraph shares the same set of nodes with the input graph, and each node in the subgraph can select no more than k edges from its one-hop neighborhood. Although the concept of the sparsification network is not limited to a specific class of subgraphs, we choose k-neighbor subgraphs for the following reasons.

• We are able to adjust the estimation of the amount of task-relevant graph data by tuning the hyper-parameter k. Intuitively, when k is an under-estimate, the amount of task-relevant graph data accessed by GNNs could be inadequate, leading to inferior performance. When k is an over-estimate, the downstream GNNs may overfit the introduced noise or irrelevant graph data, resulting in sub-optimal performance. It could be difficult to set a golden hyper-parameter that works all the time, but one has the freedom to choose the k that best fits a specific task.

• k-neighbor subgraphs are friendly to parallel computation. As each node selects its edges independently from its neighborhood, we can utilize tensor operations in existing deep learning frameworks, such as TensorFlow (Abadi et al., 2016), to speed up the sparsification process for k-neighbor subgraphs.

Sampling k-neighbor subgraphs. Given k and an input graph G = (V, E, A), we obtain a k-neighbor subgraph by repeatedly sampling edges for each node in the original graph. Without loss of generality, we sketch this sampling process by focusing on a specific node u in graph G. Let N_u be the set of one-hop neighbors of the node u. A sketch of this procedure in code is shown after the following steps.

1. v ∼ f_φ(V(u), V(N_u), A(u)), where f_φ(·) is a function that generates a one-hop neighbor v from the learned distribution based on the node u's attributes, the node attributes of u's neighbors V(N_u), and their edge attributes A(u). In particular, the learned distribution is encoded by parameters φ.

2. Edge E(u, v) is selected for the node u.

3. The above two steps are repeated k times.
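As a plain (non-differentiable) illustration of these three steps for a single node u, the NumPy sketch below samples up to k distinct neighbors from a softmax over hypothetical learned scores; the scores here are random placeholders for f_φ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_k_neighbors(u, neighbors, scores, k):
    """Steps 1-3 for one node u: draw up to k distinct one-hop neighbors of u
    from a learned distribution over N_u, then keep the corresponding edges."""
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()                  # softmax over u's neighborhood
    k_eff = min(k, len(neighbors))               # a node may have fewer than k neighbors
    chosen = rng.choice(neighbors, size=k_eff, replace=False, p=probs)
    return [(u, int(v)) for v in chosen]         # selected edges E(u, v)

# Toy example: node 0 has 6 neighbors with placeholder scores standing in for f_phi.
neighbors = np.array([3, 7, 8, 12, 15, 21])
scores = rng.normal(size=len(neighbors))
print(sample_k_neighbors(0, neighbors, scores, k=3))
```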
Note that the above process performs sampling without replacement: given a node u, each of its adjacent edges is selected at most once. Moreover, the sampling function f_φ(·) is shared among nodes; therefore, the number of parameters φ is independent of the input graph size.

Making samples differentiable. While conventional methods are able to generate discrete samples (Sadhanala et al., 2016), these samples are not differentiable, which makes it difficult to utilize them to optimize sample generation. To make samples differentiable, we propose a Gumbel-Softmax based multi-layer neural network to implement the sampling function f_φ(·) discussed above.

To make the discussion self-contained, we briefly discuss the idea of Gumbel-Softmax. Gumbel-Softmax is a reparameterization trick used to generate differentiable discrete samples (Jang et al., 2017; Maddison et al., 2017). Under appropriate hyper-parameter settings, Gumbel-Softmax is able to generate continuous vectors that are as "sharp" as the one-hot vectors widely used to encode discrete data.

Without loss of generality, we focus on a specific node u in a graph G = (V, E, A). Let N_u be the set of one-hop neighbors of the node u. We implement f_φ(·) as follows.

1. ∀v ∈ N_u,

       z_{u,v} = MLP_φ(V(u), V(v), A(u, v)),    (4)

   where MLP_φ is a multi-layer neural network with parameters φ.

2. ∀v ∈ N_u, we employ a softmax function to compute the probability to sample the edge,

       π_{u,v} = exp(z_{u,v}) / Σ_{w ∈ N_u} exp(z_{u,w}).    (5)

3. Using Gumbel-Softmax, we generate differentiable samples

       x_{u,v} = exp((log(π_{u,v}) + ε_v)/τ) / Σ_{w ∈ N_u} exp((log(π_{u,w}) + ε_w)/τ),    (6)

   where x_{u,v} is a scalar, ε_v = −log(−log(s)) with s randomly drawn from Uniform(0, 1), and τ is a hyper-parameter called temperature, which controls the interpolation between the discrete distribution and continuous categorical densities.

Note that when we sample k edges, the computation for z_{u,v} and π_{u,v} only needs to be performed once. For the hyper-parameter τ, we discuss how to tune it as follows.

Discussion on temperature tuning. The behavior of Gumbel-Softmax is governed by the hyper-parameter τ called temperature. In general, when τ is small, the Gumbel-Softmax distribution resembles the discrete distribution, which induces strong sparsity; however, small τ also introduces high-variance gradients that block effective backpropagation. A high value of τ cannot produce the expected sparsification effect. Following the practice in Jang et al. (2017), we adopt the strategy of starting the training with a high temperature and annealing to a small value with a guided schedule.

Sparsification algorithm and its complexity. As shown in Algorithm 2, given hyper-parameter k, the sparsification network visits each node's one-hop neighbors k times. Let m be the total number of edges in the graph. The complexity of sampling subgraphs by the sparsification network is O(km). When k is small in practice, the overall complexity is O(m).

Algorithm 2 Sampling subgraphs by the sparsification network
1: Input: graph G = (V, E, A) and integer k.
2: Edge set H = ∅
3: for u ∈ V do
4:   for v ∈ N_u do
5:     z_{u,v} ← MLP_φ(V(u), V(v), A(u, v))
6:   end for
7:   for v ∈ N_u do
8:     π_{u,v} ← exp(z_{u,v}) / Σ_{w ∈ N_u} exp(z_{u,w})
9:   end for
10:  for j = 1, ..., k do
11:    for v ∈ N_u do
12:      x_{u,v} ← exp((log(π_{u,v}) + ε_v)/τ) / Σ_{w ∈ N_u} exp((log(π_{u,w}) + ε_w)/τ)
13:    end for
14:    Add the edge represented by vector [x_{u,v}] into H
15:  end for
16: end for
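For concreteness, a minimal PyTorch sketch of this per-node computation (Equations 4-6, the inner loops of Algorithm 2) is given below; the MLP width, the feature dimensions, and the way node and edge attributes are concatenated are our assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_node, d_edge, d_hidden, k, tau = 8, 4, 16, 3, 0.5

# MLP_phi scoring one candidate edge (u, v) from node and edge attributes.
mlp_phi = nn.Sequential(nn.Linear(2 * d_node + d_edge, d_hidden), nn.ReLU(),
                        nn.Linear(d_hidden, 1))

def sample_soft_edges(v_u, V_nbrs, A_u, k, tau):
    """Equations 4-6 for one node u with |N_u| neighbors.
    Returns k differentiable selection vectors x over N_u."""
    n_nbrs = V_nbrs.size(0)
    inp = torch.cat([v_u.expand(n_nbrs, -1), V_nbrs, A_u], dim=-1)
    z = mlp_phi(inp).squeeze(-1)                 # Eq. 4: z_{u,v}
    log_pi = torch.log_softmax(z, dim=0)         # log of Eq. 5: pi_{u,v}
    xs = []
    for _ in range(k):                           # lines 10-15 of Algorithm 2
        s = torch.rand(n_nbrs)
        eps = -torch.log(-torch.log(s))          # Gumbel(0, 1) noise
        xs.append(torch.softmax((log_pi + eps) / tau, dim=0))  # Eq. 6: x_{u,v}
    return torch.stack(xs)                       # (k, |N_u|); gradients flow to phi

v_u = torch.randn(d_node)            # V(u)
V_nbrs = torch.randn(5, d_node)      # V(N_u), here |N_u| = 5
A_u = torch.randn(5, d_edge)         # A(u, v) for each neighbor v
x = sample_soft_edges(v_u, V_nbrs, A_u, k, tau)
print(x.sum(dim=1))                  # each draw sums to 1 and sharpens as tau decreases
```

In a full implementation the per-node loops would be replaced by batched tensor operations over all nodes, as noted in the discussion of parallel computation above.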
Comparison with multiple related methods. Unlike FastGCN (Chen et al., 2018), GraphSAINT (Zeng et al., 2020), and DropEdge (Rong et al., 2020), which incorporate layer-wise node samplers to reduce the complexity of GNNs, NeuralSparse samples subgraphs before applying GNNs. As for computation complexity, the sparsification in NeuralSparse is more friendly to parallel computation than layer-conditioned approaches such as AS-GCN. Compared with graph attentional models (Veličković et al., 2018), NeuralSparse can produce sparser neighborhoods, which effectively remove task-irrelevant information from original graphs. Unlike LDS (Franceschi et al., 2019), NeuralSparse learns graph sparsification under the inductive setting, and its graph sampling is constrained by the input graph topology.

Table 1. Dataset statistics


Reddit PPI Transaction Cora Citeseer
Task Inductive Inductive Inductive Transductive Transductive
Nodes 232,965 56,944 95,544 2,708 3,327
Edges 11,606,919 818,716 963,468 5,429 4,732
Features 602 50 120 1,433 3,703
Classes 41 121 2 7 6
Training Nodes 152,410 44,906 47,772 140 120
Validation Nodes 23,699 6,514 9,554 500 500
Testing Nodes 55,334 5,524 38,218 1,000 1,000

5. Experimental Study

In this section, we evaluate our proposed NeuralSparse on the node classification task with both inductive and transductive settings. The experimental results demonstrate that NeuralSparse achieves superior classification performance over state-of-the-art GNN models. Moreover, we provide a case study to demonstrate how the sparsified subgraphs generated by NeuralSparse could improve classification compared against other sparsification baselines. The supplementary material contains more experimental details.

5.1. Datasets

We employ five datasets from various domains and conduct the node classification task following the settings described in Hamilton et al. (2017) and Kipf & Welling (2017). The dataset statistics are summarized in Table 1.

Inductive datasets. We utilize the Reddit and PPI datasets and follow the same setting as in Hamilton et al. (2017). The Reddit dataset contains a post-to-post graph with word vectors as node features. The node labels represent which community Reddit posts belong to. The protein-protein interaction (PPI) dataset contains graphs corresponding to different human tissues. The node features are positional gene sets, motif gene sets, and immunological signatures. The nodes are multi-labeled by gene ontology.

Graphs in the Transaction dataset contain real transactions between organizations in two years, with the first year for training and the second year for validation/testing. Each node represents an organization and each edge indicates a transaction between two organizations. Node attributes are side information about the organizations, such as account balance, cash reserve, etc. On this dataset, the objective is to classify organizations into two categories: promising or others, for investment in the near future. More details on the Transaction dataset can be found in Supplementary S1.

In the inductive setting, models can only access training nodes' attributes, edges, and labels during training. In the PPI and Transaction datasets, the models have to generalize to completely unseen graphs.

Transductive datasets. We use two citation benchmark datasets from Yang et al. (2016) and Kipf & Welling (2017) with the transductive experimental setting. The citation graphs contain nodes corresponding to documents and edges corresponding to citations. Node features are the sparse bag-of-words representations of the documents, and node labels indicate the topic class of the documents. In transductive learning, the training methods have access to all node features and edges, with a limited subset of node labels.

5.2. Experimental Setup

Baseline models. We incorporate four state-of-the-art methods as the base GNN components, including GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). Besides evaluating the effectiveness and efficiency of NeuralSparse against base GNNs, we leverage three other categories of methods in the experiments: (1) We incorporate two unsupervised graph sparsification models, the spectral sparsifier (SS, Sadhanala et al., 2016) and Rank Degree (RD, Voudigari et al., 2016); the input graphs are sparsified before being sent to the base GNNs for node classification. (2) We compare against the random layer-wise sampler DropEdge (Rong et al., 2020); similar to the Dropout trick (Hinton et al., 2012), DropEdge randomly removes connections in node neighborhoods in each GNN layer. (3) We also incorporate LDS (Franceschi et al., 2019), which works under a transductive setting and learns Bernoulli variables associated with individual edges.

Temperature tuning. We anneal the temperature with the schedule τ = max(0.05, exp(−rp)), where p is the training epoch and r ∈ {10^−5, 10^−4, 10^−3, 10^−2, 10^−1}. τ is updated every N steps, with N ∈ {50, 100, ..., 500}. Compared with the MNIST VAE model in Jang et al. (2017), a smaller hyper-parameter τ fits NeuralSparse better in practice. More details on the experimental settings and implementation can be found in Supplementary S2 and S3.
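As a small illustration, with an assumed r and update interval chosen from the ranges above, the schedule can be computed as follows:

```python
import math

def anneal_tau(epoch, r=1e-3, tau_min=0.05):
    """Temperature schedule tau = max(0.05, exp(-r * p)) from Section 5.2."""
    return max(tau_min, math.exp(-r * epoch))

# Example: update tau every N = 100 epochs, decaying from 1.0 toward 0.05.
for p in range(0, 1001, 100):
    print(p, round(anneal_tau(p), 3))
```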
Table 2. Node classification performance (Reddit and PPI: Micro-F1; Transaction: AUC; Cora and Citeseer: Accuracy)

Sparsifier    Method      Reddit           PPI              Transaction      Cora             Citeseer
N/A           GCN         0.922 ± 0.041    0.532 ± 0.024    0.564 ± 0.018    0.810 ± 0.027    0.694 ± 0.020
N/A           GraphSAGE   0.938 ± 0.029    0.600 ± 0.027    0.574 ± 0.029    0.825 ± 0.033    0.710 ± 0.020
N/A           GAT         -                0.973 ± 0.030    0.616 ± 0.022    0.821 ± 0.043    0.721 ± 0.037
N/A           GIN         0.928 ± 0.022    0.703 ± 0.028    0.607 ± 0.031    0.816 ± 0.020    0.709 ± 0.037
SS/RD         GCN         0.912 ± 0.022    0.521 ± 0.024    0.562 ± 0.035    0.780 ± 0.045    0.684 ± 0.033
SS/RD         GraphSAGE   0.907 ± 0.018    0.576 ± 0.022    0.565 ± 0.042    0.806 ± 0.032    0.701 ± 0.027
SS/RD         GAT         -                0.889 ± 0.034    0.614 ± 0.044    0.807 ± 0.047    0.686 ± 0.034
SS/RD         GIN         0.901 ± 0.021    0.693 ± 0.019    0.593 ± 0.038    0.785 ± 0.041    0.706 ± 0.043
DropEdge      GCN         0.961 ± 0.040    0.548 ± 0.041    0.591 ± 0.040    0.828 ± 0.035    0.723 ± 0.043
DropEdge      GraphSAGE   0.963 ± 0.043    0.632 ± 0.031    0.598 ± 0.043    0.821 ± 0.048    0.712 ± 0.032
DropEdge      GAT         -                0.851 ± 0.030    0.604 ± 0.043    0.789 ± 0.039    0.691 ± 0.039
DropEdge      GIN         0.931 ± 0.031    0.783 ± 0.037    0.625 ± 0.035    0.818 ± 0.044    0.715 ± 0.039
LDS           GCN         -                -                -                0.831 ± 0.017    0.727 ± 0.021
NeuralSparse  GCN         0.966 ± 0.020    0.651 ± 0.014    0.610 ± 0.022    0.837 ± 0.014    0.741 ± 0.014
NeuralSparse  GraphSAGE   0.967 ± 0.015    0.696 ± 0.023    0.649 ± 0.018    0.841 ± 0.024    0.736 ± 0.013
NeuralSparse  GAT         -                0.986 ± 0.015    0.671 ± 0.018    0.842 ± 0.015    0.736 ± 0.026
NeuralSparse  GIN         0.959 ± 0.027    0.892 ± 0.015    0.634 ± 0.023    0.838 ± 0.027    0.738 ± 0.015

Metrics. We evaluate the performance on the transductive datasets with accuracy (Kipf & Welling, 2017). For inductive tasks on the Reddit and PPI datasets, we report micro-averaged F1 scores (Hamilton et al., 2017). Due to the imbalanced classes in the Transaction dataset, models are evaluated on it with AUC (Huang & Ling, 2005). The reported results are averages over 10 runs.
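For reference, these metrics can be computed with scikit-learn; the arrays below are random placeholders rather than actual model predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Transductive datasets (Cora, Citeseer): single-label accuracy.
y_true = rng.integers(0, 7, size=1000)
y_pred = rng.integers(0, 7, size=1000)
print("accuracy:", accuracy_score(y_true, y_pred))

# Inductive datasets (Reddit, PPI) use micro-averaged F1; PPI is multi-label,
# so predictions form a binary indicator matrix over its 121 labels.
y_true_ml = rng.integers(0, 2, size=(1000, 121))
y_pred_ml = rng.integers(0, 2, size=(1000, 121))
print("micro-F1:", f1_score(y_true_ml, y_pred_ml, average="micro"))

# Transaction dataset: AUC computed from scores for the positive class.
y_true_bin = rng.integers(0, 2, size=1000)
y_score = rng.random(1000)
print("AUC:", roc_auc_score(y_true_bin, y_score))
```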
5.3. Classification Performance

Table 2 summarizes the classification performance of NeuralSparse and the baseline methods on all datasets. For Reddit, PPI, Transaction, Cora, and Citeseer, the hyper-parameter k is set as 30, 15, 10, 5, and 3, respectively. The hyper-parameter l is set as 1. Note that the result of GAT on Reddit is missing due to an out-of-memory error, and LDS only works under the transductive setting. For simplicity, we only report the better performance with the SS or RD sparsifier.

Overall, NeuralSparse is able to help GNN techniques achieve competitive generalization performance with sparsified graph data. We make the following observations. (1) Compared with basic GNN models, NeuralSparse can enhance the generalization performance on node classification tasks by utilizing the sparsified subgraphs from the sparsification network, especially in the inductive setting. Indeed, a large neighborhood size in the original graphs could increase the chance of introducing noise into the aggregation operations, leading to sub-optimal performance. (2) With different GNN options, NeuralSparse can consistently achieve comparable or superior performance, while other sparsification approaches tend to favor a certain GNN structure. (3) Compared with DropEdge, NeuralSparse achieves up to 13% improvement in terms of accuracy with lower variance. In addition, the comparison between NeuralSparse and DropEdge in terms of convergence speed can be found in Supplementary S4. (4) In comparison with the two NeuralSparse variants SS-GNN and RD-GNN, NeuralSparse outperforms because it can effectively leverage the guidance from downstream tasks.

Table 3. Node classification performance with κ-NN graphs

Dataset(κ)     LDS               NeuralSparse
Cora(10)       0.715 ± 0.035     0.723 ± 0.025
Cora(20)       0.703 ± 0.029     0.719 ± 0.021
Citeseer(10)   0.691 ± 0.031     0.723 ± 0.016
Citeseer(20)   0.715 ± 0.026     0.725 ± 0.019

In the following, we discuss the comparison between NeuralSparse and LDS (Franceschi et al., 2019) on the Cora and Citeseer datasets. Note that the row labeled with LDS in Table 2 presents the classification results on original input graphs. In addition, we adopt κ-nearest neighbor (κ-NN) graphs suggested in Franceschi et al. (2019) for more comprehensive evaluation.

(a) Original (b) NeuralSparse (c) Spectral Sparsifier (d) RD Sparsifier

Figure 3. (a) Original graph from the Transaction dataset and sparsified subgraphs by (b) NeuralSparse, (c) Spectral Sparsifier, and (d)
RD Sparsifier.

In particular, κ-NN graphs are constructed by connecting individual nodes with their top-κ most similar neighbors in terms of node features, with κ selected from {10, 20}. In Table 3, we summarize the classification accuracy of LDS (with GCN) and NeuralSparse (with GCN). On both original and κ-NN graphs, NeuralSparse outperforms LDS in terms of classification accuracy. As each edge is associated with a Bernoulli variable, the large number of parameters for graph sparsification could impact the generalization power of LDS. More comparison results between NeuralSparse and LDS can be found in Supplementary S5.

5.4. Sensitivity to Hyper-parameters and the Sparsified Subgraphs

[Figure 4 here: classification AUC of NeuralSparse-GAT and NeuralSparse-GraphSAGE on the Transaction dataset, (a) as hyper-parameter k varies from 5 to 15 and (b) as hyper-parameter l varies from 1 to 5.]

(a) Hyperparameter k (b) Hyperparameter l

Figure 4. Performance vs hyper-parameters

Figure 4(a) demonstrates how classification performance responds when k increases on the Transaction dataset. There exists an optimal k that delivers the best classification AUC score. A similar trend on the validation set is also observed, as shown in Supplementary S6. When k is small, NeuralSparse can only make use of little relevant structural information in feature aggregation, which leads to inferior performance. When k increases, the aggregation involves more complex neighborhood aggregation with a higher chance of overfitting noise data, which negatively impacts the classification performance for unseen testing data. Figure 4(b) shows how the hyper-parameter l impacts classification performance on the Transaction dataset. When l increases from 1 to 5, we observe a relatively small improvement in classification AUC score. As the parameters in the sparsification network are shared by all edges in the graph, the estimation variance from random sampling could already be mitigated to some extent by the number of sampled edges in a sparsified subgraph. Thus, when we increase the number of sparsified subgraphs, the incremental gain could be small.

In Figure 3(a), we present a sample graph from the Transaction dataset which consists of 38 nodes (promising organizations and other organizations) with an average node degree of 15 and node feature dimension 120. As shown in Figure 3(b), the graph sparsified by NeuralSparse has lower complexity, with an average node degree around 5. In Figure 3(c, d), we also present the sparsified graphs output by the two baseline methods, SS and RD. More quantitative evaluations over sparsified graphs from different approaches can be found in Supplementary S7.

By comparing the four plots in Figure 3, we make the following observations. First, the NeuralSparse sparsified graph tends to select edges that connect nodes of identical labels, which favors the downstream classification task. The observed clustering effect could further boost the confidence in decision making. Second, instead of exploring all the neighbors, we can focus on the selected connections/edges, which could make it easier for human experts to perform model interpretation and result visualization.

6. Conclusion

In this paper, we propose Neural Sparsification (NeuralSparse) to address the noise brought by task-irrelevant information on real-life large graphs. NeuralSparse consists of two major components: (1) the sparsification network sparsifies input graphs by sampling edges following a learned distribution; (2) GNNs take sparsified subgraphs as input and extract node representations for downstream tasks.
The two components in NeuralSparse can be jointly trained with supervised loss, gradient descent, and backpropagation techniques. The experimental study on real-life datasets shows that NeuralSparse consistently renders more robust graph representations, and yields up to 7.2% improvement in accuracy over state-of-the-art GNN models.

Acknowledgement
We thank the anonymous reviewers for their careful reading
and insightful comments on our manuscript. The work was
partially supported by NSF (DGE-1829071, IIS-2031187).

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.

Abu-El-Haija, S., Perozzi, B., Kapoor, A., Harutyunyan, H., Alipourfard, N., Lerman, K., Steeg, G. V., and Galstyan, A. Mixhop: Higher-order graph convolution architectures via sparsified neighborhood mixing. In ICML, 2019.

Adhikari, B., Zhang, Y., Amiri, S. E., Bharadwaj, A., and Prakash, B. A. Propagation-based temporal network summarization. TKDE, 2018.

Akoglu, L., Tong, H., and Koutra, D. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 2015.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

Calandriello, D., Koutis, I., Lazaric, A., and Valko, M. Improved large-scale graph learning through ridge spectral sparsification. In ICML, 2018.

Chakeri, A., Farhidzadeh, H., and Hall, L. O. Spectral sparsification in spectral clustering. In ICPR, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.

Eden, T., Jain, S., Pinar, A., Ron, D., and Seshadhri, C. Provable and practical approximations for the degree distribution using sublinear graph samples. In WWW, 2018.

Franceschi, L., Niepert, M., Pontil, M., and He, X. Learning discrete structures for graph neural networks. In ICML, 2019.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, 2017.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Huang, J. and Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. TKDE, 2005.

Hübler, C., Kriegel, H.-P., Borgwardt, K., and Ghahramani, Z. Metropolis algorithms for representative subgraph sampling. In ICDM, 2008.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Lee, J. B., Rossi, R., and Kong, X. Graph classification using structural attention. In KDD, 2018.

Leskovec, J. and Faloutsos, C. Sampling from large graphs. In KDD, 2006.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. In ICLR, 2016.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.

Liu, Y., Safavi, T., Dighe, A., and Koutra, D. Graph summarization methods and applications: A survey. ACM Computing Surveys, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

Maiya, A. S. and Berger-Wolf, T. Y. Sampling community structure. In WWW, 2010.

Mathioudakis, M., Bonchi, F., Castillo, C., Gionis, A., and Ukkonen, A. Sparsification of influence networks. In KDD, 2011.

Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In ICML, 2016.

Rong, Y., Huang, W., Xu, T., and Huang, J. DropEdge: Towards deep graph convolutional networks on node classification. In ICLR, 2020.

Sadhanala, V., Wang, Y.-X., and Tibshirani, R. Graph sparsification approaches for Laplacian smoothing. In AISTATS, 2016.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 2009.

Simonovsky, M. and Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Voudigari, E., Salamanos, N., Papageorgiou, T., and Yannakoudakis, E. J. Rank degree: An efficient algorithm for graph sampling. In ASONAM, 2016.

Wang, L., Yu, W., Wang, W., Cheng, W., Zhang, W., Zha, H., He, X., and Chen, H. Learning robust representations with graph denoising policy network. In ICDM, 2019.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2019.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018.

Ying, R., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. GNNExplainer: A tool for post-hoc explanation of graph neural networks. In NIPS, 2019.

Yu, W., Zheng, C., Cheng, W., Aggarwal, C., Song, D., Zong, B., Chen, H., and Wang, W. Learning deep network representations with adversarially regularized autoencoders. In KDD, 2018.

Zeng, H., Zhou, H., Srivastava, A., Kannan, R., and Prasanna, V. GraphSAINT: Graph sampling based inductive learning method. In ICLR, 2020.

Zhang, L.-C. and Patone, M. Graph sampling. Metron, 2017.

Zhang, Y., Zhang, F., Yao, P., and Tang, J. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop. In KDD, 2018.

Zhao, P. gSparsify: Graph motif based sparsification for graph clustering. In CIKM, 2015.
