Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

ABSTRACT

Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M dataset with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs. 1961 seconds) and uses much less memory (2.2GB vs. 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes while all the existing GCN training algorithms fail to train due to out-of-memory issues. Finally, Cluster-GCN allows us to train much deeper GCNs without much time and memory overhead, which leads to improved prediction accuracy—using a 5-layer Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset, while the previous best result was 98.71 by [16].

ACM Reference Format:
Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3292500.3330925

∗ This work was done during the first and the second author's internship at Google Research.

1 INTRODUCTION

Graph convolutional network (GCN) [9] has become increasingly popular in addressing many graph-based applications, including semi-supervised node classification [9], link prediction [17] and recommender systems [15]. Given a graph, GCN uses a graph convolution operation to obtain node embeddings layer by layer—at each layer, the embedding of a node is obtained by gathering the embeddings of its neighbors, followed by one or a few layers of linear transformations and nonlinear activations. The final-layer embedding is then used for some end task. For instance, in node classification problems, the final-layer embedding is passed to a classifier to predict node labels, and thus the parameters of the GCN can be trained in an end-to-end manner.

Since the graph convolution operator in GCN needs to propagate embeddings using the interactions between nodes in the graph, training becomes quite challenging. Unlike other neural networks, whose training loss can be perfectly decomposed into individual terms over the samples, a loss term in GCN (e.g., the classification loss on a single node) depends on a huge number of other nodes, especially when the GCN goes deep. Due to this node dependence, GCN training is very slow and requires a lot of memory—back-propagation needs to store all the embeddings in the computation graph in GPU memory.

Previous GCN Training Algorithms: To demonstrate the need for developing a scalable GCN training algorithm, we first discuss the pros and cons of existing approaches in terms of 1) memory requirement¹, 2) time per epoch², and 3) convergence speed (loss reduction) per epoch. These three factors are crucial for evaluating a training algorithm. Note that the memory requirement directly restricts the scalability of an algorithm, and the latter two factors combined determine the training speed. In the following discussion, we denote by N the number of nodes in the graph, by F the embedding dimension, and by L the number of layers, and use them to analyze classic GCN training algorithms.

¹ Here we consider the memory for storing node embeddings, which is dense and usually dominates the overall memory usage for deep GCN.
² An epoch means a complete data pass.
• Full-batch gradient descent is proposed in the first GCN paper [9]. To compute the full gradient, it requires storing all the intermediate embeddings, leading to an O(NFL) memory requirement, which is not scalable. Furthermore, although the time per epoch is efficient, the convergence of gradient descent is slow since the parameters are updated only once per epoch. [memory: bad; time per epoch: good; convergence: bad]
• Mini-batch SGD is proposed in [5]. Since each update is based only on a mini-batch gradient, it reduces the memory requirement and conducts many updates per epoch, leading to faster convergence. However, mini-batch SGD introduces a significant computational overhead due to the neighborhood expansion problem—to compute the loss on a single node at layer L, it requires that node's neighbors' embeddings at layer L-1, which again require their neighbors' embeddings at layer L-2, and so on recursively through the lower layers. This leads to a time complexity exponential in the GCN depth. GraphSAGE [5] proposed to use a fixed-size neighborhood sample during back-propagation through layers, and FastGCN [1] proposed importance sampling, but the overhead of these methods is still large and becomes worse when the GCN goes deep. [memory: good; time per epoch: bad; convergence: good]
• VR-GCN [2] proposes to use a variance reduction technique to reduce the size of the sampled neighborhood. Despite successfully reducing the sample size (in our experiments VR-GCN with only 2 samples per node works quite well), it requires storing all the intermediate embeddings of all the nodes in memory, leading to an O(NFL) memory requirement. If the number of nodes in the graph increases to millions, the memory requirement for VR-GCN may be too high to fit into GPU memory. [memory: bad; time per epoch: good; convergence: good]
In this paper, we propose a novel GCN training algorithm that exploits the graph clustering structure. We find that the efficiency of a mini-batch algorithm can be characterized by the notion of "embedding utilization", which is proportional to the number of links between nodes in one batch (within-batch links). This finding motivates us to design the batches using graph clustering algorithms, which aim to construct partitions of nodes such that there are more graph links between nodes in the same partition than between nodes in different partitions. Based on the graph clustering idea, we propose Cluster-GCN, an algorithm that designs the batches based on efficient graph clustering algorithms (e.g., METIS [8]). We take this idea further by proposing a stochastic multi-clustering framework to improve the convergence of Cluster-GCN. Our strategy leads to huge memory and computational benefits. In terms of memory, we only need to store the node embeddings within the current batch, which is O(bFL) with batch size b. This is significantly better than VR-GCN and full gradient descent, and slightly better than other SGD-based approaches. In terms of computational complexity, our algorithm achieves the same time cost per epoch as gradient descent and is much faster than neighborhood-searching approaches. In terms of convergence speed, our algorithm is competitive with other SGD-based approaches. Finally, our algorithm is simple to implement since we only compute matrix multiplications and no neighborhood sampling is needed. Therefore, for Cluster-GCN, we have [memory: good; time per epoch: good; convergence: good]. We conducted comprehensive experiments on several large-scale graph datasets and made the following contributions:

• Cluster-GCN achieves the best memory usage on large-scale graphs, especially for deep GCNs. For example, Cluster-GCN uses 5x less memory than VR-GCN for a 3-layer GCN model on Amazon2M. Amazon2M is a new graph dataset that we construct to demonstrate the scalability of GCN algorithms. This dataset contains an Amazon product co-purchase graph with more than 2 million nodes and 61 million edges.
• Cluster-GCN achieves a training speed similar to VR-GCN for shallow networks (e.g., 2 layers) but can be faster than VR-GCN when the network goes deeper (e.g., 4 layers), since our complexity is linear in the number of layers L while VR-GCN's complexity is exponential in L.
• Cluster-GCN is able to train very deep networks with a large embedding size. Although several previous works show that deep GCNs do not give better performance, we found that with proper optimization, deeper GCNs can help accuracy. For example, with a 5-layer GCN, we obtain a new benchmark accuracy of 99.36 on the PPI dataset, compared with the highest previously reported result of 98.71 by [16].

2 BACKGROUND

Suppose we are given a graph G = (V, E, A), which consists of N = |V| vertices and |E| edges such that an edge between any two vertices i and j represents their similarity. The corresponding adjacency matrix A is an N × N sparse matrix with (i, j) entry equal to 1 if there is an edge between i and j and 0 otherwise. Also, each node is associated with an F-dimensional feature vector, and X ∈ R^{N×F} denotes the feature matrix for all N nodes. An L-layer GCN [9] consists of L graph convolution layers, each of which constructs the embedding of every node by mixing the embeddings of the node's neighbors in the graph from the previous layer:

    Z^{(l+1)} = A' X^{(l)} W^{(l)}, \quad X^{(l+1)} = \sigma(Z^{(l+1)}),        (1)

where X^{(l)} ∈ R^{N×F_l} is the embedding at the l-th layer for all N nodes and X^{(0)} = X; A' is the normalized and regularized adjacency matrix, and W^{(l)} ∈ R^{F_l×F_{l+1}} is the feature transformation matrix which will be learnt for the downstream tasks. Note that for simplicity we assume the feature dimensions are the same for all layers (F_1 = · · · = F_L = F). The activation function σ(·) is usually set to be the element-wise ReLU.
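To make (1) concrete, the following is a minimal PyTorch sketch of a single graph convolution layer (our illustration, not the authors' released code); adj_norm is assumed to be the normalized adjacency A' stored as a sparse tensor.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN layer: X^{(l+1)} = sigma(A' X^{(l)} W^{(l)}), cf. Eq. (1)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}

    def forward(self, adj_norm, x):
        # adj_norm: sparse (N x N) normalized adjacency A'; x: dense (N x F_l) embeddings X^{(l)}
        return torch.relu(torch.sparse.mm(adj_norm, self.linear(x)))
```

Stacking L such layers and feeding the final output to a classifier yields the model whose parameters are learned by minimizing the loss below.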
Semi-supervised node classification is a popular application of GCN. When using GCN for this application, the goal is to learn the weight matrices in (1) by minimizing the loss function

    \mathcal{L} = \frac{1}{|\mathcal{Y}_L|} \sum_{i \in \mathcal{Y}_L} \mathrm{loss}(y_i, z_i^{(L)}),        (2)

where Y_L contains the labels of all labeled nodes, and z_i^{(L)} is the i-th row of Z^{(L)}, indicating the final-layer prediction of node i, whose ground-truth label is y_i. In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.

3 PROPOSED ALGORITHM

We first discuss the bottlenecks of previous training methods to motivate the proposed algorithm.
Table 1: Time and space complexity of GCN training algorithms. L is the number of layers, N is the number of nodes, ∥A∥_0 is the number of nonzeros in the adjacency matrix, and F is the number of features. For simplicity we assume the number of features is fixed for all layers. For SGD-based approaches, b is the batch size and r is the number of sampled neighbors per node. Note that due to the variance reduction technique, VR-GCN can work with a smaller r than GraphSAGE and FastGCN. For memory complexity, LF^2 is for storing {W^{(l)}}_{l=1}^L and the other term is for storing embeddings. For simplicity we omit the memory for storing the graph (GCN) or sub-graphs (other approaches) since they are fixed and usually not the main bottleneck.

                   Time complexity                        Memory complexity
GCN [9]            O(L∥A∥_0 F + L N F^2)                  O(L N F + L F^2)
Vanilla SGD        O(d^L N F^2)                           O(b d^L F + L F^2)
GraphSAGE [5]      O(r^L N F^2)                           O(b r^L F + L F^2)
FastGCN [1]        O(r L N F^2)                           O(b r L F + L F^2)
VR-GCN [2]         O(L∥A∥_0 F + L N F^2 + r^L N F^2)      O(L N F + L F^2)
Cluster-GCN        O(L∥A∥_0 F + L N F^2)                  O(b L F + L F^2)
In the original paper [9], full gradient descent is used to train the GCN, but it suffers from high computational and memory costs. In terms of memory, computing the full gradient of (2) by back-propagation requires storing all the embedding matrices {Z^{(l)}}_{l=1}^L, which needs O(NFL) space. In terms of convergence speed, since the model is updated only once per epoch, the training requires many epochs to converge.

It has been shown in some recent works [1, 2, 5] that mini-batch SGD can improve the training speed and memory requirement of GCN. Instead of computing the full gradient, SGD only needs to calculate a gradient estimate based on a mini-batch for each update. In this paper, we use B ⊆ [N] with size b = |B| to denote a batch of node indices, and each SGD step computes the gradient estimate

    \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla \mathrm{loss}(y_i, z_i^{(L)})        (3)

to perform an update. Despite converging faster in terms of epochs, SGD introduces another computational overhead in GCN training (as explained in the following), which makes its per-epoch time much slower than that of full gradient descent.
Why does vanilla mini-batch SGD have slow per-epoch time? We consider the computation of the gradient associated with one node i: ∇loss(y_i, z_i^{(L)}). Clearly, this requires the embedding of node i, which depends on its neighbors' embeddings in the previous layer. To fetch each of node i's neighbors' embeddings, we need to further aggregate each neighbor's neighbors' embeddings, and so on. Suppose a GCN has L + 1 layers and each node has an average degree of d; to get the gradient for node i, we need to aggregate features from O(d^L) nodes in the graph. That is, we need to fetch information from a node's hop-k (k = 1, · · · , L) neighbors in the graph to perform one update. Computing each embedding requires O(F^2) time due to the multiplication with W^{(l)}, so on average computing the gradient associated with one node requires O(d^L F^2) time.

Embedding utilization can reflect computational efficiency. If a batch has more than one node, the time complexity is less straightforward, since different nodes can have overlapping hop-k neighbors, and the number of embedding computations can be less than the worst case O(bd^L). To reflect the computational efficiency of mini-batch SGD, we define the concept of "embedding utilization". During the algorithm, if node i's embedding at the l-th layer, z_i^{(l)}, is computed and reused u times for the embedding computations at layer l + 1, then we say the embedding utilization of z_i^{(l)} is u. For mini-batch SGD with random sampling, u is very small, since the graph is usually large and sparse. Assuming u is a small constant (almost no overlap between hop-k neighbors), mini-batch SGD needs to compute O(bd^L) embeddings per batch, which leads to O(bd^L F^2) time per update and O(Nd^L F^2) time per epoch.
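As a concrete (and entirely hypothetical) illustration of this blow-up, the sketch below expands the L-hop support set that vanilla mini-batch SGD would need for one random batch; the graph size, degree, and batch size are made-up numbers.

```python
import random
import networkx as nx

def l_hop_support(graph, batch, num_layers):
    """All nodes whose embeddings must be computed to obtain the loss on `batch`."""
    support, frontier = set(batch), set(batch)
    for _ in range(num_layers):
        frontier = {nbr for u in frontier for nbr in graph.neighbors(u)}
        support |= frontier
    return support

g = nx.gnm_random_graph(10_000, 50_000, seed=0)   # sparse graph, average degree ~10
batch = random.sample(list(g.nodes), 64)
for L in (1, 2, 3):
    print(L, len(l_hop_support(g, batch, L)))     # support grows roughly like b * d^L
```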
We illustrate the neighborhood expansion problem in the left panel of Fig. 1. In contrast, full-batch gradient descent has the maximal embedding utilization—each embedding is reused d (the average degree) times in the upper layer. As a consequence, the original full gradient descent [9] only needs to compute O(NL) embeddings per epoch, which means on average only O(L) embedding computations are needed to obtain the gradient of one node.

To make mini-batch SGD work, previous approaches try to restrict the neighborhood expansion size, which however does not improve embedding utilization. GraphSAGE [5] uniformly samples a fixed-size set of neighbors instead of using the full neighborhood. We denote the sample size by r. This leads to O(r^L) embedding computations for each loss term but also makes the gradient estimate less accurate. FastGCN [1] proposed an importance sampling strategy to improve the gradient estimate. VR-GCN [2] proposed a strategy that stores the previously computed embeddings of all N nodes at all L layers and reuses them for unsampled neighbors. Despite the high memory usage for storing all NL embeddings, we find their strategy very useful, and in practice even a small r (e.g., 2) can lead to good convergence.

We summarize the time and space complexity in Table 1. Clearly, all the SGD-based algorithms suffer from exponential complexity with respect to the number of layers, and VR-GCN, even though r can be small, incurs a huge space complexity that can go beyond a GPU's memory capacity. In the following, we introduce our Cluster-GCN algorithm, which achieves the best of both worlds—the same time complexity per epoch as full gradient descent and the same memory complexity as vanilla SGD.

3.1 Vanilla Cluster-GCN

Our Cluster-GCN technique is motivated by the following question: in mini-batch SGD updates, can we design a batch and the corresponding computation subgraph to maximize the embedding utilization? We answer this in the affirmative by connecting the concept of embedding utilization to a clustering objective.

Consider the case where in each batch we compute the embeddings for a set of nodes B from layer 1 to L. Since the same subgraph A_{B,B} (the links within B) is used for each layer of computation, the embedding utilization is the number of edges within this batch, ∥A_{B,B}∥_0. Therefore, to maximize embedding utilization, we should design a batch B that maximizes the within-batch edges, which connects exactly to the objective of graph clustering algorithms.
With the nodes partitioned into clusters V_1, …, V_c, the loss decomposes over the clusters as

    \mathcal{L}_{\bar{A}'} = \sum_{t} \frac{|\mathcal{V}_t|}{N} \mathcal{L}_{\bar{A}'_{tt}} \quad\text{and}\quad \mathcal{L}_{\bar{A}'_{tt}} = \frac{1}{|\mathcal{V}_t|} \sum_{i \in \mathcal{V}_t} \mathrm{loss}(y_i, z_i^{(L)}).        (7)
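As a minimal sketch of the cluster-wise batching (our own illustration under stated assumptions, not the authors' released code): the nodes are partitioned with the METIS Python wrapper mentioned in Section 6.1 (its part_graph API is assumed here), and each batch uses only the within-cluster links; the simple D^{-1}(A + I) normalization is also an assumption for illustration.

```python
import numpy as np
import scipy.sparse as sp
import metis  # Python wrapper around METIS (see Section 6.1); part_graph API assumed

def build_clusters(adj, num_parts):
    """Partition the graph with METIS; return one array of node ids per cluster V_t."""
    adj = adj.tocsr()
    adjlist = [adj.indices[adj.indptr[i]:adj.indptr[i + 1]].tolist()
               for i in range(adj.shape[0])]
    _, membership = metis.part_graph(adjlist, nparts=num_parts)
    membership = np.asarray(membership)
    return [np.where(membership == t)[0] for t in range(num_parts)]

def cluster_batches(adj, features, labels, clusters):
    """Yield one (normalized subgraph, features, labels) batch per cluster, cf. Eq. (7)."""
    for nodes in clusters:
        sub = adj[nodes][:, nodes]                      # within-cluster links A_{V_t,V_t} only
        deg = np.asarray(sub.sum(axis=1)).ravel() + 1.0
        sub_norm = sp.diags(1.0 / deg) @ (sub + sp.eye(len(nodes)))  # assumed D^-1(A+I)
        yield sub_norm, features[nodes], labels[nodes]
```

Each SGD step then runs the GCN of (1) on one such subgraph and averages the per-node losses, matching the per-cluster terms in (7).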
Table 5: Comparison of memory usage on different datasets. Numbers in brackets indicate the size of hidden units used in the model.

               2-layer                            3-layer                            4-layer
               VRGCN    Cluster-GCN  GraphSAGE    VRGCN    Cluster-GCN  GraphSAGE    VRGCN    Cluster-GCN  GraphSAGE
PPI (512)      258 MB   39 MB        51 MB        373 MB   46 MB        71 MB        522 MB   55 MB        85 MB
Reddit (128)   259 MB   284 MB       1074 MB      372 MB   285 MB       1075 MB      515 MB   285 MB       1076 MB
Reddit (512)   1031 MB  292 MB       1099 MB      1491 MB  300 MB       1115 MB      2064 MB  308 MB       1131 MB
Amazon (128)   1188 MB  703 MB       N/A          1351 MB  704 MB       N/A          1515 MB  705 MB       N/A
For the Amazon data, since node features are not available, an identity matrix is used as the feature matrix X. Under this setting, the shape of the parameter matrix W^{(0)} becomes 334863×128. Therefore, the computation is dominated by sparse matrix operations such as AW^{(0)}. Our method is still faster than VRGCN for the 3-layer case, but slower for the 2-layer and 4-layer ones. The reason may lie in the speed of sparse matrix operations in different frameworks: VRGCN is implemented in TensorFlow, while Cluster-GCN is implemented in PyTorch, whose sparse tensor support was still at a very early stage. In Table 6, we show the time for TensorFlow and PyTorch to perform forward/backward operations on the Amazon data; a simple two-layer network is used for benchmarking both frameworks. We can clearly see that TensorFlow is faster than PyTorch, and the difference becomes more significant when the number of hidden units increases. This may explain why Cluster-GCN has a longer training time on the Amazon dataset.

Table 6: Benchmarking the sparse tensor operations in PyTorch and TensorFlow. A network with two linear layers is used and the timing includes forward and backward operations. Numbers in brackets indicate the size of hidden units in the first layer. Amazon data is used.

                             PyTorch   TensorFlow
Avg. time per epoch (128)    8.81s     2.53s
Avg. time per epoch (512)    45.08s    7.13s
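The kind of micro-benchmark behind Table 6 is easy to reproduce; below is a sketch that times one forward plus backward pass of a sparse aggregation followed by two linear layers in PyTorch (the sizes are placeholders, not the exact benchmark setup, and the TensorFlow counterpart is omitted).

```python
import time
import torch

N, F, H = 334_863, 128, 128        # placeholder sizes for illustration only
idx = torch.randint(0, N, (2, 2_000_000))
adj = torch.sparse_coo_tensor(idx, torch.ones(idx.shape[1]), (N, N)).coalesce()
x = torch.randn(N, F)
net = torch.nn.Sequential(torch.nn.Linear(F, H), torch.nn.ReLU(), torch.nn.Linear(H, 64))

start = time.time()
out = net(torch.sparse.mm(adj, x))  # forward: sparse aggregation + two linear layers
out.sum().backward()                # backward through the dense layers
print(f"forward+backward: {time.time() - start:.2f}s")
```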
Memory usage comparison: For training large-scale GCNs, besides training time, the memory needed for training is often more important and directly restricts scalability. The memory usage includes the memory needed to train the GCN for many epochs. As discussed in Section 3, to speed up training, VRGCN needs to save historical embeddings during training, so it needs much more memory for training than Cluster-GCN. GraphSAGE also has a higher memory requirement than Cluster-GCN due to the exponential neighborhood-growth problem. In Table 5, we compare our memory usage with VRGCN's for GCNs with different numbers of layers. When increasing the number of layers, Cluster-GCN's memory usage does not increase much. The reason is that when adding one layer, the only extra variable introduced is the weight matrix W^{(L)}, which is relatively small compared to the sub-graph and node features. In contrast, VRGCN needs to save each layer's historical embeddings, which are usually dense and soon dominate the memory usage. We can see from Table 5 that Cluster-GCN is much more memory efficient than VRGCN. For instance, on the Reddit data, to train a 4-layer GCN with hidden dimension 512, VRGCN needs 2064MB of memory, while Cluster-GCN only uses 308MB.

4.2 Experimental results on Amazon2M

A new GCN dataset: Amazon2M. By far the largest public dataset for testing GCN is the Reddit dataset, with the statistics shown in Table 3; it contains about 200K nodes. As shown in Figure 6, GCN training on this data can be finished within a few hundred seconds. To test the scalability of GCN training algorithms, we constructed a much larger graph with over 2 million nodes and 61 million edges based on Amazon co-purchasing networks [11, 12]. The raw co-purchase data is from Amazon-3M⁶. In the graph, each node is a product, and a graph link indicates whether two products are purchased together. Each node feature is generated by extracting bag-of-words features from the product descriptions, followed by Principal Component Analysis [7] to reduce the dimension to 100. In addition, we use the top-level categories as the labels for each product/node (see Table 7 for the most common categories). The detailed statistics of the dataset are listed in Table 3.

Table 7: The most common categories in Amazon2M.

Categories      Number of products
Books           668,950
CDs & Vinyl     172,199
Toys & Games    158,771
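A hedged sketch of that feature pipeline: bag-of-words from the raw descriptions followed by a 100-dimensional reduction. We use scikit-learn's TruncatedSVD as a practical stand-in for PCA on sparse text counts; `descriptions` is a hypothetical list of product description strings, not part of the released data pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

def build_node_features(descriptions, dim=100):
    """Bag-of-words features from product descriptions, reduced to `dim` dimensions."""
    bow = CountVectorizer(max_features=50_000).fit_transform(descriptions)
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(bow)

# descriptions = [...]                      # one raw text description per product/node
# X = build_node_features(descriptions)     # shape: (num_nodes, 100)
```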
In Table 8, we compare with VRGCN for GCNs with different numbers of layers in terms of training time, memory usage, and test accuracy (F1 score). As can be seen from the table, 1) VRGCN is faster than Cluster-GCN for a 2-layer GCN but slower than Cluster-GCN when adding one more layer, while achieving similar accuracy; and 2) in terms of memory usage, VRGCN uses much more memory than Cluster-GCN (5 times more for the 3-layer case) and runs out of memory when training a 4-layer GCN, while Cluster-GCN does not need much additional memory when increasing the number of layers and achieves the best accuracy on this data when training a 4-layer GCN.

4.3 Training Deeper GCN

In this section we consider GCNs with more layers. We first show the timing comparison of Cluster-GCN and VRGCN in Table 9. PPI is used for benchmarking and we run 200 epochs for both methods. We observe that the running time of VRGCN grows exponentially because of its expensive neighborhood finding, while the running time of Cluster-GCN only grows linearly.

Next we investigate whether using deeper GCNs obtains better accuracy. In Section 3.3, we discuss different strategies for modifying the adjacency matrix A to facilitate the training of deep GCNs. We apply the diagonal enhancement techniques to deep GCNs and run experiments on PPI; the results are shown in Table 11.

⁶ http://manikvarma.org/downloads/XC/XMLRepository.html
Table 8: Comparison of running time, memory, and test accuracy (F1 score) on Amazon2M.

                       Time                   Memory                   Test F1 score
                       VRGCN    Cluster-GCN   VRGCN      Cluster-GCN   VRGCN   Cluster-GCN
Amazon2M (2-layer)     337s     1223s         7476 MB    2228 MB       89.03   89.00
Amazon2M (3-layer)     1961s    1523s         11218 MB   2235 MB       90.21   90.21
Amazon2M (4-layer)     N/A      2289s         OOM        2241 MB       N/A     90.41
Table 11: Comparison of different diagonal enhancement techniques. For all methods, we present the best validation accuracy achieved in 200 epochs. PPI is used and the dropout rate is 0.1 in this experiment. Other settings are the same as in Section 4.1. The numbers marked red indicate poor convergence.

                                          2-layer  3-layer  4-layer  5-layer  6-layer  7-layer  8-layer
Cluster-GCN with (1)                      90.3     97.6     98.2     98.3     94.1     65.4     43.1
Cluster-GCN with (10)                     90.2     97.7     98.1     98.4     42.4     42.4     42.4
Cluster-GCN with (10) + (9)               84.9     96.0     97.1     97.6     97.3     43.9     43.8
Cluster-GCN with (10) + (11), λ = 1       89.6     97.5     98.2     98.3     98.0     97.4     96.2
Figure 6: Comparisons of different GCN training methods. We present the relation between training time in seconds (x-axis)
and the validation F1 score (y-axis).
6 MORE DETAILS ABOUT THE EXPERIMENTS

In this section we describe the experimental settings in more detail to help reproducibility.

6.1 Datasets and software versions

We describe more details about the datasets in Table 12. We download the PPI and Reddit datasets from the website⁷ and Amazon from the website⁸. Note that for Amazon, we consider GCN in an inductive setting, meaning that the model only learns from the training data; in [3] a transductive setting is considered. Regarding software versions, we install CUDA 10.0 and cuDNN 7.0. TensorFlow 1.12.0 and PyTorch 1.0.0 are used. We download METIS 5.1.0 from the official website⁹ and use a Python wrapper¹⁰ for the METIS library.

Table 12: The training, validation, and test splits used in the experiments. Note that for the two Amazon datasets we only split into training and test sets.

Datasets    Task        Data splits (Tr./Val./Te.)
PPI         Inductive   44906/6514/5524
Reddit      Inductive   153932/23699/55334
Amazon      Inductive   91973/242890
Amazon2M    Inductive   1709997/739032

Table 13: The running time of the graph clustering algorithm (METIS) and of data preprocessing before the training of GCN.

Datasets    #Partitions   Clustering   Preprocessing
PPI         50            1.6s         20.3s
Reddit      1500          33s          286s
Amazon      200           0.3s         67.5s
Amazon2M    15000         148s         2160s

The clustering only needs to be performed once to form the node partitions, which can be re-used for later training processes.
6.2 Implementation details

Previous works [1, 2] propose to pre-compute the multiplication AX in the first GCN layer. We also adopt this strategy in our implementation. By precomputing AX, we essentially use the exact 1-hop neighborhood of each node, and the expensive neighbor search in the first layer can be avoided.
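A one-line sketch of this trick (our illustration): A'X is computed once on CPU with scipy, so the first layer starts from pre-aggregated features and never needs to look up neighbors again.

```python
def precompute_first_layer_input(adj_norm, features):
    """Compute A'X once; the first GCN layer then only applies W^{(0)} and the activation.

    adj_norm: normalized adjacency as a scipy.sparse matrix (N x N)
    features: node feature matrix as a numpy ndarray (N x F)
    """
    return adj_norm @ features   # sparse (N x N) times dense (N x F) -> dense (N x F)
```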
Another implementation detail concerns the technique mentioned in Section 3.2: when multiple clusters are selected, some between-cluster links are added back. The new combined adjacency matrix should therefore be re-normalized to maintain the numerical range of the resulting embedding matrix. From experiments we find this renormalization helpful.
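A sketch of this step under the same simple D^{-1}(A + I) normalization assumption used earlier (the exact normalization in the paper may differ): a few clusters are merged into one batch, the between-cluster links inside the batch are kept, and the merged adjacency is re-normalized.

```python
import numpy as np
import scipy.sparse as sp

def combine_and_renormalize(adj, clusters, chosen):
    """Merge the chosen clusters into one batch and re-normalize its adjacency."""
    nodes = np.concatenate([clusters[t] for t in chosen])
    sub = adj[nodes][:, nodes]                   # keeps between-cluster links inside the batch
    deg = np.asarray(sub.sum(axis=1)).ravel() + 1.0
    return sp.diags(1.0 / deg) @ (sub + sp.eye(len(nodes))), nodes

# Each epoch, sample a few clusters at random and train on the merged, re-normalized subgraph:
# sub_adj, nodes = combine_and_renormalize(adj, clusters,
#                                          np.random.choice(len(clusters), size=5, replace=False))
```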
As for the inductive setting, the test nodes are not visible during the training process. We therefore construct one adjacency matrix containing only the training nodes and another one containing all nodes. Graph partitioning is applied to the former, and the partitioned adjacency matrix is then re-normalized. Note that feature normalization is also conducted. To calculate the memory usage, we use tf.contrib.memory_stats.BytesInUse() for TensorFlow and torch.cuda.memory_allocated() for PyTorch.
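For reference, a small sketch of the two measurement calls named above (the TensorFlow 1.x op is shown as a comment since it must be run inside a session).

```python
import torch

def gpu_memory_mb():
    """Currently allocated GPU memory in MB, as reported by PyTorch."""
    return torch.cuda.memory_allocated() / (1024 ** 2)

# TensorFlow 1.x counterpart:
#   mem_op = tf.contrib.memory_stats.BytesInUse()
#   print(sess.run(mem_op))
```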