
Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Wei-Lin Chiang* (National Taiwan University), Xuanqing Liu* (University of California, Los Angeles), Si Si (Google Research), Yang Li (Google Research), Samy Bengio (Google Research), Cho-Jui Hsieh (University of California, Los Angeles)

*This work was done during the first and the second author's internship at Google Research.

ABSTRACT

Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of every node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M dataset with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs. 1961 seconds) and uses much less memory (2.2GB vs. 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes while all the existing GCN training algorithms fail to train due to out-of-memory issues. Finally, Cluster-GCN allows us to train much deeper GCNs without much time and memory overhead, which leads to improved prediction accuracy—using a 5-layer Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset, while the previous best result was 98.71 by [16].

ACM Reference Format:
Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3292500.3330925

1 INTRODUCTION

Graph convolutional network (GCN) [9] has become increasingly popular for addressing many graph-based applications, including semi-supervised node classification [9], link prediction [17], and recommender systems [15]. Given a graph, GCN uses a graph convolution operation to obtain node embeddings layer by layer—at each layer, the embedding of a node is obtained by gathering the embeddings of its neighbors, followed by one or a few layers of linear transformations and nonlinear activations. The final-layer embedding is then used for some end task. For instance, in node classification problems, the final-layer embedding is passed to a classifier to predict node labels, and thus the parameters of the GCN can be trained in an end-to-end manner.

Since the graph convolution operator in GCN needs to propagate embeddings using the interactions between nodes in the graph, training is quite challenging. Unlike other neural networks, whose training loss can be perfectly decomposed into individual terms on each sample, the loss term in GCN (e.g., the classification loss on a single node) depends on a huge number of other nodes, especially when the GCN goes deep. Due to this node dependence, GCN training is very slow and requires lots of memory—back-propagation needs to store all the embeddings in the computation graph in GPU memory.

Previous GCN Training Algorithms: To demonstrate the need for developing a scalable GCN training algorithm, we first discuss the pros and cons of existing approaches, in terms of 1) memory requirement (here we consider the memory for storing node embeddings, which is dense and usually dominates the overall memory usage for deep GCNs), 2) time per epoch (an epoch means a complete pass over the data), and 3) convergence speed (loss reduction) per epoch. These three factors are crucial for evaluating a training algorithm. Note that the memory requirement directly restricts the scalability of an algorithm, and the latter two factors combined determine the training speed. In the following discussion we denote by N the number of nodes in the graph, F the embedding dimension, and L the number of layers, and use them to analyze the classic GCN training algorithms.


• Full-batch gradient descent is proposed in the first GCN paper [9]. To compute the full gradient, it requires storing all the intermediate embeddings, leading to an O(NFL) memory requirement, which is not scalable. Furthermore, although the time per epoch is efficient, the convergence of gradient descent is slow since the parameters are updated only once per epoch. [memory: bad; time per epoch: good; convergence: bad]
• Mini-batch SGD is proposed in [5]. Since each update is based only on a mini-batch gradient, it can reduce the memory requirement and conduct many updates per epoch, leading to faster convergence. However, mini-batch SGD introduces a significant computational overhead due to the neighborhood expansion problem—to compute the loss on a single node at layer L, it requires that node's neighbors' embeddings at layer L−1, which in turn require their neighbors' embeddings at layer L−2, and so on recursively through the earlier layers. This leads to time complexity exponential in the GCN depth. GraphSAGE [5] proposed to use a fixed number of neighborhood samples during back-propagation through the layers and FastGCN [1] proposed importance sampling, but the overhead of these methods is still large and becomes worse when the GCN goes deep. [memory: good; time per epoch: bad; convergence: good]
• VR-GCN [2] proposes to use a variance reduction technique to reduce the number of sampled neighbors. Despite successfully reducing the sample size (in our experiments VR-GCN with only 2 samples per node works quite well), it requires storing all the intermediate embeddings of all the nodes in memory, leading to an O(NFL) memory requirement. If the number of nodes in the graph increases to millions, the memory requirement of VR-GCN may be too high to fit into a GPU. [memory: bad; time per epoch: good; convergence: good]

In this paper, we propose a novel GCN training algorithm that exploits the graph clustering structure. We find that the efficiency of a mini-batch algorithm can be characterized by the notion of "embedding utilization", which is proportional to the number of links between nodes in one batch, i.e., the within-batch links. This finding motivates us to design the batches using graph clustering algorithms that aim to construct partitions of nodes such that there are more graph links between nodes in the same partition than between nodes in different partitions. Based on the graph clustering idea, we propose Cluster-GCN, an algorithm that designs the batches based on efficient graph clustering algorithms (e.g., METIS [8]). We take this idea further by proposing a stochastic multi-clustering framework to improve the convergence of Cluster-GCN. Our strategy leads to huge memory and computational benefits. In terms of memory, we only need to store the node embeddings within the current batch, which is O(bFL) with batch size b. This is significantly better than VR-GCN and full gradient descent, and slightly better than other SGD-based approaches. In terms of computational complexity, our algorithm achieves the same time cost per epoch as gradient descent and is much faster than neighborhood-searching approaches. In terms of convergence speed, our algorithm is competitive with other SGD-based approaches. Finally, our algorithm is simple to implement since we only compute matrix multiplications and no neighborhood sampling is needed. Therefore, for Cluster-GCN we have [memory: good; time per epoch: good; convergence: good].

We conducted comprehensive experiments on several large-scale graph datasets and made the following contributions:
• Cluster-GCN achieves the best memory usage on large-scale graphs, especially for deep GCNs. For example, Cluster-GCN uses 5x less memory than VRGCN for a 3-layer GCN model on Amazon2M. Amazon2M is a new graph dataset that we construct to demonstrate the scalability of GCN algorithms; it contains an Amazon product co-purchase graph with more than 2 million nodes and 61 million edges.
• Cluster-GCN achieves a training speed similar to VR-GCN for shallow networks (e.g., 2 layers) but can be faster than VR-GCN when the network goes deeper (e.g., 4 layers), since our complexity is linear in the number of layers L while VR-GCN's complexity is exponential in L.
• Cluster-GCN is able to train a very deep network with a large embedding size. Although several previous works show that deep GCNs do not give better performance, we found that with proper optimization, deeper GCNs can improve accuracy. For example, with a 5-layer GCN, we obtain a new benchmark accuracy of 99.36 on the PPI dataset, compared with the highest previously reported result of 98.71 by [16].

2 BACKGROUND

Suppose we are given a graph G = (V, E, A), which consists of N = |V| vertices and |E| edges such that an edge between any two vertices i and j represents their similarity. The corresponding adjacency matrix A is an N × N sparse matrix with (i, j) entry equal to 1 if there is an edge between i and j and 0 otherwise. Each node is also associated with an F-dimensional feature vector, and X ∈ R^{N×F} denotes the feature matrix for all N nodes. An L-layer GCN [9] consists of L graph convolution layers, each of which constructs embeddings for every node by mixing the embeddings of the node's neighbors in the graph from the previous layer:

    Z^{(l+1)} = A′ X^{(l)} W^{(l)},   X^{(l+1)} = σ(Z^{(l+1)}),   (1)

where X^{(l)} ∈ R^{N×F_l} is the embedding at the l-th layer for all N nodes and X^{(0)} = X; A′ is the normalized and regularized adjacency matrix and W^{(l)} ∈ R^{F_l×F_{l+1}} is the feature transformation matrix, which will be learnt for the downstream tasks. Note that for simplicity we assume the feature dimensions are the same for all layers (F_1 = · · · = F_L = F). The activation function σ(·) is usually set to be the element-wise ReLU.

Semi-supervised node classification is a popular application of GCN. When using GCN for this application, the goal is to learn the weight matrices in (1) by minimizing the loss function

    L = (1/|Y_L|) Σ_{i∈Y_L} loss(y_i, z_i^{(L)}),   (2)

where Y_L contains all the labels of the labeled nodes and z_i^{(L)} is the i-th row of Z^{(L)} with ground-truth label y_i, indicating the final-layer prediction of node i. In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.
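To make the propagation rule in (1) and the objective in (2) concrete, the following is a minimal PyTorch sketch of an L-layer GCN forward pass. It is not the authors' implementation; it assumes a precomputed normalized sparse adjacency matrix A′ (here `a_norm`) and uses a standard cross-entropy loss over the labeled nodes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """L-layer GCN following Eq. (1): Z^{(l+1)} = A' X^{(l)} W^{(l)}, X^{(l+1)} = sigma(Z^{(l+1)})."""
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [num_classes]
        self.weights = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(num_layers)]
        )

    def forward(self, a_norm, x):
        # a_norm: sparse normalized adjacency A' (N x N); x: node features X (N x F)
        for l, lin in enumerate(self.weights):
            x = torch.sparse.mm(a_norm, lin(x))   # A' X^{(l)} W^{(l)}
            if l < len(self.weights) - 1:
                x = F.relu(x)                      # element-wise ReLU as in Eq. (1)
        return x                                   # Z^{(L)}: final-layer predictions

# Loss of Eq. (2): average the per-node loss over the labeled nodes Y_L only.
def semi_supervised_loss(logits, labels, labeled_idx):
    return F.cross_entropy(logits[labeled_idx], labels[labeled_idx])
```

In full-batch training this forward pass touches all N nodes at every layer, which is exactly the O(NFL) embedding storage discussed in the next section.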
3 PROPOSED ALGORITHM

We first discuss the bottleneck of previous training methods to motivate the proposed algorithm.


Table 1: Time and space complexity of GCN training algorithms. L is the number of layers, N is the number of nodes, ∥A∥₀ is the number of nonzeros in the adjacency matrix, and F is the number of features. For simplicity we assume the number of features is fixed for all layers. For SGD-based approaches, b is the batch size and r is the number of sampled neighbors per node. Note that due to the variance reduction technique, VR-GCN can work with a smaller r than GraphSAGE and FastGCN. For memory complexity, LF² is for storing {W^{(l)}}_{l=1}^{L} and the other term is for storing embeddings. For simplicity we omit the memory for storing the graph (GCN) or sub-graphs (other approaches) since they are fixed and usually not the main bottleneck.

                   Time complexity                    Memory complexity
  GCN [9]          O(L∥A∥₀F + LNF²)                   O(LNF + LF²)
  Vanilla SGD      O(d^L N F²)                        O(b d^L F + LF²)
  GraphSAGE [5]    O(r^L N F²)                        O(b r^L F + LF²)
  FastGCN [1]      O(r L N F²)                        O(b r L F + LF²)
  VR-GCN [2]       O(L∥A∥₀F + LNF² + r^L N F²)        O(LNF + LF²)
  Cluster-GCN      O(L∥A∥₀F + LNF²)                   O(bLF + LF²)

In the original paper [9], full gradient descent is used for training GCN, but it suffers from high computational and memory cost. In terms of memory, computing the full gradient of (2) by back-propagation requires storing all the embedding matrices {Z^{(l)}}_{l=1}^{L}, which needs O(NFL) space. In terms of convergence speed, since the model is updated only once per epoch, training requires more epochs to converge.

It has been shown in some recent works [1, 2, 5] that mini-batch SGD can improve the training speed and memory requirement of GCN. Instead of computing the full gradient, SGD only needs to calculate the gradient based on a mini-batch for each update. In this paper, we use B ⊆ [N] with size b = |B| to denote a batch of node indices, and each SGD step computes the gradient estimate

    (1/|B|) Σ_{i∈B} ∇loss(y_i, z_i^{(L)})   (3)

to perform an update. Despite faster convergence in terms of epochs, SGD introduces another computational overhead on GCN training (as explained below), which gives it much slower per-epoch time than full gradient descent.

Why does vanilla mini-batch SGD have slow per-epoch time? We consider the computation of the gradient associated with one node i: ∇loss(y_i, z_i^{(L)}). Clearly, this requires the embedding of node i, which depends on its neighbors' embeddings at the previous layer. To fetch the embeddings of node i's neighbors, we need to further aggregate each neighbor's neighbors' embeddings as well. Suppose a GCN has L+1 layers and each node has an average degree of d; then to get the gradient for node i, we need to aggregate features from O(d^L) nodes in the graph for this single node. That is, we need to fetch information from a node's hop-k (k = 1, …, L) neighbors in the graph to perform one update. Computing each embedding requires O(F²) time due to the multiplication with W^{(l)}, so on average computing the gradient associated with one node requires O(d^L F²) time.
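The blow-up described above can be made concrete with a small sketch that collects the set of nodes whose embeddings are needed to compute one node's loss in an L-layer GCN; for a sparse graph with average degree d this set grows roughly like d^L. The adjacency-list format and function names below are illustrative and not taken from the paper.

```python
def nodes_needed_for_one_loss(adj_list, node, num_layers):
    """Return, per layer, the set of nodes whose embeddings must be computed
    to evaluate the loss of `node` in an L-layer GCN (its L-hop neighborhood)."""
    frontier = {node}
    needed_per_layer = [frontier]
    for _ in range(num_layers):
        # Embeddings at the previous layer are needed for every neighbor of the frontier.
        frontier = frontier | {v for u in frontier for v in adj_list[u]}
        needed_per_layer.append(frontier)
    return needed_per_layer

# Toy example: with average degree d, len(needed_per_layer[-1]) grows on the
# order of d^L, which is exactly the neighborhood expansion problem.
adj_list = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}
print([len(s) for s in nodes_needed_for_one_loss(adj_list, 0, num_layers=2)])
```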
Embedding utilization can reflect computational efficiency. If a batch has more than one node, the time complexity is less straightforward since different nodes can have overlapping hop-k neighbors, and the number of embedding computations can be less than the worst case O(bd^L). To reflect the computational efficiency of mini-batch SGD, we define the concept of "embedding utilization". During the algorithm, if node i's embedding at the l-th layer, z_i^{(l)}, is computed and is reused u times for the embedding computations at layer l+1, then we say the embedding utilization of z_i^{(l)} is u. For mini-batch SGD with random sampling, u is very small since the graph is usually large and sparse. Assuming u is a small constant (almost no overlap between hop-k neighbors), mini-batch SGD needs to compute O(bd^L) embeddings per batch, which leads to O(bd^L F²) time per update and O(Nd^L F²) time per epoch.

We illustrate the neighborhood expansion problem in the left panel of Fig. 1. In contrast, full-batch gradient descent has the maximal embedding utilization—each embedding is reused d (the average degree) times in the upper layer. As a consequence, the original full gradient descent [9] only needs to compute O(NL) embeddings per epoch, which means that on average only O(L) embedding computations are needed to acquire the gradient of one node.

To make mini-batch SGD work, previous approaches try to restrict the neighborhood expansion size, which however does not improve embedding utilization. GraphSAGE [5] uniformly samples a fixed-size set of neighbors instead of using the full neighborhood set; we denote the sample size by r. This leads to O(r^L) embedding computations for each loss term but also makes the gradient estimate less accurate. FastGCN [1] proposed an importance sampling strategy to improve the gradient estimation. VR-GCN [2] proposed a strategy to store the previously computed embeddings of all N nodes at all L layers and reuse them for unsampled neighbors. Despite the high memory usage for storing all NL embeddings, we find their strategy very useful, and in practice even a small r (e.g., 2) can lead to good convergence.

We summarize the time and space complexity in Table 1. Clearly, all the SGD-based algorithms suffer from exponential complexity with respect to the number of layers, and for VR-GCN, even though r can be small, it incurs a huge space complexity that can go beyond a GPU's memory capacity. In the following, we introduce our Cluster-GCN algorithm, which achieves the best of both worlds—the same time complexity per epoch as full gradient descent and the same memory complexity as vanilla SGD.

3.1 Vanilla Cluster-GCN

Our Cluster-GCN technique is motivated by the following question: in mini-batch SGD updates, can we design a batch and the corresponding computation subgraph to maximize the embedding utilization? We answer this affirmatively by connecting the concept of embedding utilization to a clustering objective.

Consider the case where in each batch we compute the embeddings for a set of nodes B from layer 1 to L. Since the same subgraph A_{B,B} (the links within B) is used at each layer of computation, the embedding utilization is then the number of edges within this batch, ∥A_{B,B}∥₀. Therefore, to maximize embedding utilization, we should design a batch B to maximize the within-batch edges, by which we connect the efficiency of SGD updates with graph clustering algorithms.
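As a quick illustration of this connection, the sketch below counts the within-batch edges ∥A_{B,B}∥₀ for a given batch of nodes, which (per the argument above) is the embedding utilization obtained when the same subgraph A_{B,B} is reused at every layer. It assumes a SciPy CSR adjacency matrix and is not taken from the paper's code.

```python
import numpy as np
import scipy.sparse as sp

def within_batch_edges(adj, batch_nodes):
    """Count ||A_{B,B}||_0, the number of links whose endpoints are both in the batch."""
    sub = adj[batch_nodes, :][:, batch_nodes]   # A_{B,B}
    return sub.nnz

# Toy comparison: a random batch and a batch taken from one dense cluster typically
# differ a lot in within-batch edges, and hence in embedding utilization.
adj = sp.random(1000, 1000, density=0.01, format="csr")
adj = ((adj + adj.T) > 0).astype(np.float32)                  # make it symmetric / undirected
random_batch = np.random.choice(1000, size=100, replace=False)
print("within-batch edges of a random batch:", within_batch_edges(adj, random_batch))
```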


Figure 1: The neighborhood expansion difference between traditional graph convolution and our proposed cluster approach. The red node is the starting node for neighborhood expansion. Traditional graph convolution suffers from exponential neighborhood expansion, while our method can avoid expensive neighborhood expansion.

Now we formally introduce Cluster-GCN. For a graph G, we partition its nodes into c groups: V = [V_1, …, V_c], where V_t consists of the nodes in the t-th partition. Thus we have c subgraphs

    Ḡ = [G_1, …, G_c] = [{V_1, E_1}, …, {V_c, E_c}],

where each E_t only consists of the links between nodes in V_t. After reorganizing the nodes, the adjacency matrix is partitioned into c² submatrices:

    A = Ā + Δ = [ A_{11} ⋯ A_{1c} ; ⋮ ⋱ ⋮ ; A_{c1} ⋯ A_{cc} ],   (4)

and

    Ā = [ A_{11} ⋯ 0 ; ⋮ ⋱ ⋮ ; 0 ⋯ A_{cc} ],   Δ = [ 0 ⋯ A_{1c} ; ⋮ ⋱ ⋮ ; A_{c1} ⋯ 0 ],   (5)

where each diagonal block A_{tt} is a |V_t| × |V_t| adjacency matrix containing the links within G_t; Ā is the adjacency matrix of the graph Ḡ; A_{st} contains the links between the two partitions V_s and V_t; and Δ is the matrix consisting of all off-diagonal blocks of A. Similarly, we can partition the feature matrix X and the training labels Y according to the partition [V_1, …, V_c] as [X_1, …, X_c] and [Y_1, …, Y_c], where X_t and Y_t consist of the features and labels of the nodes in V_t, respectively.

The benefit of this block-diagonal approximation Ḡ is that the objective function of GCN becomes decomposable into different batches (clusters). Let Ā′ denote the normalized version of Ā; the final embedding matrix becomes

    Z^{(L)} = Ā′ σ(Ā′ σ(⋯ σ(Ā′ X W^{(0)}) W^{(1)}) ⋯) W^{(L−1)}
            = [ Ā′_{11} σ(Ā′_{11} σ(⋯ σ(Ā′_{11} X_1 W^{(0)}) W^{(1)}) ⋯) W^{(L−1)} ; ⋮ ; Ā′_{cc} σ(Ā′_{cc} σ(⋯ σ(Ā′_{cc} X_c W^{(0)}) W^{(1)}) ⋯) W^{(L−1)} ]   (6)

due to the block-diagonal form of Ā (note that Ā′_{tt} is the corresponding diagonal block of Ā′). The loss function can also be decomposed into

    L_{Ā′} = Σ_t (|V_t| / N) L_{Ā′_{tt}},   where   L_{Ā′_{tt}} = (1/|V_t|) Σ_{i∈V_t} loss(y_i, z_i^{(L)}).   (7)

Cluster-GCN is then based on the decomposition in (6) and (7). At each step, we sample a cluster V_t and then conduct an SGD update based on the gradient of L_{Ā′_{tt}}; this only requires the sub-graph A_{tt}, the X_t and Y_t of the current batch, and the models {W^{(l)}}_{l=1}^{L}. The implementation only requires the forward and backward propagation of matrix products (one block of (6)), which is much easier to implement than the neighborhood search procedure used in previous SGD-based training methods.

We use graph clustering algorithms to partition the graph. Graph clustering methods such as METIS [8] and Graclus [4] aim to construct partitions over the vertices of the graph such that within-cluster links are much more numerous than between-cluster links, in order to better capture the clustering and community structure of the graph. This is exactly what we need because: 1) as mentioned above, the embedding utilization is equivalent to the number of within-cluster links for each batch—intuitively, each node and its neighbors are usually located in the same cluster, so after a few hops, neighboring nodes are still in the same cluster with high probability; 2) since we replace A by its block-diagonal approximation Ā and the error is proportional to the between-cluster links Δ, we need to find a partition that minimizes the number of between-cluster links.

In Figure 1, we illustrate the neighborhood expansion with the full graph G and with the clustering partition Ḡ. We can see that Cluster-GCN can avoid heavy neighborhood search and focus on the neighbors within each cluster. In Table 2, we compare two different node partition strategies: random partition versus clustering partition. We partition the graph into 10 parts using random partitioning and METIS, and then use one partition as a batch to perform an SGD update. We can see that with the same number of epochs, the clustering partition achieves higher accuracy. This shows that using graph clustering is important and that partitions should not be formed randomly.

Time and space complexity. Since each node in V_t only links to nodes inside V_t, no node needs to perform a neighborhood search outside A_{tt}. The computation for each batch is purely the matrix products Ā′_{tt} X_t W^{(l)} and some element-wise operations, so the overall time complexity per batch is O(∥A_{tt}∥₀F + bF²), and the overall time complexity per epoch becomes O(∥A∥₀F + NF²). On average, each batch only requires computing O(bL) embeddings, which is linear rather than exponential in L. In terms of space complexity, in each batch we only need to load b samples and store their embeddings at each layer, resulting in O(bLF) memory for storing embeddings. Therefore our algorithm is also more memory efficient than all the previous algorithms. Moreover, our algorithm only requires loading a subgraph into GPU memory instead of the full graph (though the graph is usually not the memory bottleneck). The detailed time and memory complexity are summarized in Table 1.
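The following sketch shows how the vanilla Cluster-GCN step described above might be wired together: partition the nodes with the METIS Python wrapper, extract one diagonal block A_{tt}, normalize it, and run one update on that block only. The `GCN` class is the earlier sketch; `metis.part_graph` is the entry point of the Python METIS wrapper referenced in the appendix, and the normalization here is the usual D^{-1}(A+I) form, so treat the details as assumptions rather than the authors' exact code.

```python
import numpy as np
import scipy.sparse as sp
import torch
import metis  # Python wrapper around METIS (see the appendix for the library used)

def partition_nodes(adj_csr, num_clusters):
    """Cluster nodes with METIS; returns a list of node-index arrays, one per cluster."""
    adjacency = [adj_csr.indices[adj_csr.indptr[i]:adj_csr.indptr[i + 1]].tolist()
                 for i in range(adj_csr.shape[0])]
    _, parts = metis.part_graph(adjacency, nparts=num_clusters)
    parts = np.asarray(parts)
    return [np.where(parts == t)[0] for t in range(num_clusters)]

def normalized_block(adj_csr, nodes):
    """Build a normalized diagonal block A'_tt = D^{-1}(A_tt + I) as a torch sparse tensor."""
    block = adj_csr[nodes, :][:, nodes] + sp.eye(len(nodes), format="csr")
    deg_inv = 1.0 / np.asarray(block.sum(axis=1)).flatten()
    block = sp.diags(deg_inv) @ block
    coo = block.tocoo()
    idx = torch.tensor(np.vstack([coo.row, coo.col]), dtype=torch.long)
    val = torch.tensor(coo.data, dtype=torch.float32)
    return torch.sparse_coo_tensor(idx, val, coo.shape)

def train_one_cluster(model, optimizer, adj_csr, X, y, nodes):
    """One Cluster-GCN update: only the subgraph A_tt, X_t and Y_t are touched."""
    a_tt = normalized_block(adj_csr, nodes)
    x_t = torch.tensor(X[nodes], dtype=torch.float32)
    y_t = torch.tensor(y[nodes], dtype=torch.long)
    optimizer.zero_grad()
    logits = model(a_tt, x_t)                        # forward pass restricted to the cluster
    loss = torch.nn.functional.cross_entropy(logits, y_t)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the forward pass only ever sees the block A_{tt}, the per-batch cost is exactly the O(∥A_{tt}∥₀F + bF²) term discussed above.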


Table 2: Random partition versus clustering partition of the graph (trained with mini-batch SGD). Clustering partition leads to better performance (in terms of test F1 score) since it removes fewer between-partition links. These three datasets are all public GCN datasets. We explain the PPI data in the experiment section. Cora has 2,708 nodes and 13,264 edges, and Pubmed has 19,717 nodes and 108,365 edges.

  Dataset   Random partition   Clustering partition
  Cora      78.4               82.5
  Pubmed    78.9               79.9
  PPI       68.1               92.9

3.2 Stochastic Multiple Partitions

Although vanilla Cluster-GCN achieves good computational and memory complexity, there are still two potential issues:
• After the graph is partitioned, some links (the Δ part in Eq. (4)) are removed. Thus the performance could be affected.
• Graph clustering algorithms tend to bring similar nodes together. Hence the distribution of a cluster could be different from the original data set, leading to a biased estimate of the full gradient while performing SGD updates.

In Figure 2, we demonstrate an example of unbalanced label distribution using the Reddit data with clusters formed by METIS. We calculate the entropy value of each cluster based on its label distribution. Compared with random partitioning, we clearly see that the entropies of most clusters are smaller, indicating that the label distributions of the clusters are biased toward some specific labels. This increases the variance across different batches and may affect the convergence of SGD.

Figure 2: Histograms of entropy values based on the label distribution within each batch, using random partition versus clustering partition. Most clustering-partitioned batches have low label entropy, indicating a skewed label distribution within each batch. In comparison, random partition leads to larger label entropy within a batch, although it is less efficient as discussed earlier. We partition the Reddit dataset into 300 clusters in this example.

To address the above issues, we propose a stochastic multiple clustering approach to incorporate between-cluster links and reduce the variance across batches. We first partition the graph into p clusters V_1, …, V_p with a relatively large p. When constructing a batch B for an SGD update, instead of considering only one cluster, we randomly choose q clusters, denoted t_1, …, t_q, and include their nodes {V_{t_1} ∪ ⋯ ∪ V_{t_q}} in the batch. Furthermore, the links between the chosen clusters,

    {A_{ij} | i, j ∈ {t_1, …, t_q}},

are added back. In this way, the between-cluster links are re-incorporated and the combinations of clusters make the variance across batches smaller. Figure 3 illustrates our algorithm—in each epoch, different combinations of clusters are chosen as a batch. We conduct an experiment on Reddit to demonstrate the effectiveness of the proposed approach. In Figure 4, we can observe that using multiple clusters as one batch improves the convergence. Our final Cluster-GCN algorithm is presented in Algorithm 1.

Figure 3: The proposed stochastic multiple partitions scheme. In each epoch, we randomly sample q clusters (q = 2 is used in this example) and their between-cluster links to form a new batch. Blocks of the same color are in the same batch.

Figure 4: Comparisons of choosing one cluster versus multiple clusters. The former uses 300 partitions. The latter uses 1500 partitions and randomly selects 5 to form one batch. We present epoch (x-axis) versus F1 score (y-axis).

Algorithm 1: Cluster-GCN
  Input: graph A, features X, labels Y
  Output: node representations X̄
  1: Partition the graph nodes into c clusters V_1, V_2, …, V_c by METIS
  2: for iter = 1, …, max_iter do
  3:   Randomly choose q clusters t_1, …, t_q from V without replacement
  4:   Form the subgraph Ḡ with nodes V̄ = [V_{t_1}, V_{t_2}, …, V_{t_q}] and links A_{V̄,V̄}
  5:   Compute g ← ∇L_{A_{V̄,V̄}} (the loss on the subgraph A_{V̄,V̄})
  6:   Conduct an Adam update using the gradient estimator g
  7: Output: {W^{(l)}}_{l=1}^{L}
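A minimal sketch of the batch construction in steps 3 and 4 of Algorithm 1: q clusters are drawn without replacement, their node sets are merged, and the subgraph on the union (which automatically re-includes the between-cluster links among the chosen clusters) is extracted and re-normalized. Helper names reuse the earlier sketches and are assumptions, not the paper's code.

```python
import numpy as np

def sample_multi_cluster_batch(clusters, q, rng=np.random):
    """Randomly choose q clusters without replacement and merge their nodes into one batch.
    Extracting A_{V,V} on the merged node set keeps the links between the chosen clusters."""
    chosen = rng.choice(len(clusters), size=q, replace=False)
    batch_nodes = np.concatenate([clusters[t] for t in chosen])
    return chosen, batch_nodes

def epoch_batches(clusters, q, rng=np.random):
    """One epoch: iterate over disjoint groups of q clusters so every node is seen once."""
    order = rng.permutation(len(clusters))
    for start in range(0, len(order), q):
        group = order[start:start + q]
        yield np.concatenate([clusters[t] for t in group])

# Usage with the earlier (assumed) helpers:
#   clusters = partition_nodes(adj_csr, num_clusters=1500)
#   for batch_nodes in epoch_batches(clusters, q=20):
#       train_one_cluster(model, optimizer, adj_csr, X, y, batch_nodes)
# normalized_block() applied to the merged node set re-normalizes the combined
# adjacency matrix, as noted in the implementation details of the appendix.
```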
3.3 Issues of training deeper GCNs

Previous attempts to train deeper GCNs [9] seem to suggest that adding more layers is not helpful. However, the datasets used in those experiments may be too small to make a proper justification. For example, [9] considered a graph with only a few hundred training nodes, for which overfitting can be an issue. Moreover, we observe that the optimization of deep GCN models becomes difficult, as it may impede the information from the first few layers from being passed through. In [9], they adopt a technique similar to residual connections [6] to enable the model to carry information from a previous layer to the next layer. Specifically, they modify (1) to add the hidden representation of layer l into the next layer:

    X^{(l+1)} = σ(A′ X^{(l)} W^{(l)}) + X^{(l)}.   (8)


Here we propose another simple technique to improve the training of deep GCNs. In the original GCN setting, each node aggregates the representations of its neighbors from the previous layer. However, in the setting of deep GCNs, this strategy may not be suitable as it does not take the number of layers into account. Intuitively, nearby neighbors should contribute more than distant nodes. We thus propose a technique to better address this issue. The idea is to amplify the diagonal part of the adjacency matrix A used in each GCN layer. In this way, we put more weight on the representation from the previous layer in the aggregation of each GCN layer. An example is to add an identity to Ā, as follows:

    X^{(l+1)} = σ((A′ + I) X^{(l)} W^{(l)}).   (9)

While (9) seems reasonable, using the same weight for all nodes regardless of their numbers of neighbors may not be suitable. Moreover, it may suffer from numerical instability, as values can grow exponentially when more layers are used. Hence we propose a modified version of (9) to better maintain the neighborhood information and the numerical range. We first add an identity to the original A and perform the normalization step

    Ã = (D + I)^{-1}(A + I),   (10)

and then consider

    X^{(l+1)} = σ((Ã + λ diag(Ã)) X^{(l)} W^{(l)}).   (11)

Experimental results of adopting this "diagonal enhancement" technique are presented in Section 4.3, where we show that the new normalization strategy helps to build deep GCNs and achieve state-of-the-art (SOTA) performance.
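A sketch of the diagonal-enhancement normalization of (10) and (11) on a SciPy sparse adjacency matrix; λ and the propagation mirror the equations above, while the variable names and surrounding code are assumptions rather than the authors' implementation.

```python
import numpy as np
import scipy.sparse as sp

def diag_enhanced_adjacency(adj, lam=1.0):
    """Eq. (10)-(11): A_tilde = (D+I)^{-1}(A+I), then use A_tilde + lam * diag(A_tilde)."""
    n = adj.shape[0]
    a_plus_i = adj + sp.eye(n, format="csr")
    deg_inv = 1.0 / np.asarray(a_plus_i.sum(axis=1)).flatten()   # (D + I)^{-1}
    a_tilde = sp.diags(deg_inv) @ a_plus_i                       # Eq. (10)
    return a_tilde + lam * sp.diags(a_tilde.diagonal())          # matrix used in Eq. (11)

# In each layer, Eq. (11) then reads X_next = relu((A_tilde + lam*diag(A_tilde)) @ X @ W);
# i.e., the enhanced matrix simply replaces A' in the forward pass of the earlier GCN sketch.
```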
4 EXPERIMENTS

We evaluate our proposed method for training GCNs on two tasks: multi-label and multi-class classification on four public datasets. The statistics of the datasets are shown in Table 3. Note that the Reddit dataset is the largest public dataset we have seen so far for GCN, and the Amazon2M dataset is collected by ourselves and is much larger than Reddit (see more details in Section 4.2).

Table 3: Data statistics.

  Datasets   Task         #Nodes      #Edges       #Labels   #Features
  PPI        multi-label  56,944      818,716      121       50
  Reddit     multi-class  232,965     11,606,919   41        602
  Amazon     multi-label  334,863     925,872      58        N/A
  Amazon2M   multi-class  2,449,029   61,859,140   47        100

We include the following state-of-the-art GCN training algorithms in our comparisons:
• Cluster-GCN (our proposed algorithm): the proposed fast GCN training method.
• VRGCN [2] (https://github.com/thu-ml/stochastic_gcn): it maintains the historical embeddings of all the nodes in the graph and expands to only a few neighbors to speed up training. The number of sampled neighbors is set to 2 as suggested in [2] (we also tried the default sample size of 20 in the VRGCN package, but it performs much worse than sample size 2).
• GraphSAGE [5] (https://github.com/williamleif/GraphSAGE): it samples a fixed number of neighbors per node. We use the default sample sizes for each layer (S_1 = 25, S_2 = 10) in GraphSAGE.

We implement our method in PyTorch [13]. For the other methods, we use the original papers' code from their GitHub pages. Since [9] has difficulty scaling to large graphs, we do not compare with it here. Also, as shown in [2], VRGCN is faster than FastGCN, so we do not compare with FastGCN here. For all methods we use the Adam optimizer with learning rate 0.01, dropout rate 20%, and zero weight decay. The mean aggregator proposed by [5] is adopted and the number of hidden units is the same for all methods. Note that techniques such as (11) are not considered here. In each experiment, we use the same GCN architecture for all methods. For VRGCN and GraphSAGE, we follow the settings provided by the original papers and set the batch size to 512. For Cluster-GCN, the number of partitions and the number of clusters per batch for each dataset are listed in Table 4. Note that clustering is treated as a preprocessing step and its running time is not counted as training time; in Section 6, we show that graph clustering only takes a small portion of the preprocessing time. All the experiments are conducted on a machine with an NVIDIA Tesla V100 GPU (16 GB memory), a 20-core Intel Xeon CPU (2.20 GHz), and 192 GB of RAM.

Table 4: The parameters used in the experiments.

  Datasets   #Hidden units   #Partitions   #Clusters per batch
  PPI        512             50            1
  Reddit     128             1500          20
  Amazon     128             200           1
  Amazon2M   400             15000         10

4.1 Training Performance for Medium-Size Datasets

Training time vs. accuracy: First we compare our proposed method with the other methods in terms of training speed. In Figure 6, the x-axis shows the training time in seconds and the y-axis shows the accuracy (F1 score) on the validation sets. We plot training time versus accuracy for three datasets with 2-, 3-, and 4-layer GCNs. Since GraphSAGE is slower than VRGCN and our method, the curves for GraphSAGE only appear for the PPI and Reddit datasets. We can see that our method is the fastest on both PPI and Reddit for GCNs with different numbers of layers.


For the Amazon data, since node features are not available, an identity matrix is used as the feature matrix X. Under this setting, the shape of the parameter matrix W^{(0)} becomes 334,863 × 128. Therefore, the computation is dominated by sparse matrix operations such as AW^{(0)}. Our method is still faster than VRGCN for the 3-layer case, but slower for the 2-layer and 4-layer ones. The reason may come from the speed of sparse matrix operations in different frameworks: VRGCN is implemented in TensorFlow, while Cluster-GCN is implemented in PyTorch, whose sparse tensor support was still at a very early stage. In Table 6, we show the time needed by TensorFlow and PyTorch to perform forward/backward operations on the Amazon data; a simple two-layer network is used for benchmarking both frameworks. We can clearly see that TensorFlow is faster than PyTorch, and the difference is more significant when the number of hidden units increases. This may explain why Cluster-GCN has a longer training time on the Amazon dataset.

Table 6: Benchmarking the sparse tensor operations in PyTorch and TensorFlow. A network with two linear layers is used and the timing includes forward and backward operations. Numbers in brackets indicate the size of the hidden units in the first layer. The Amazon data is used.

                               PyTorch   TensorFlow
  Avg. time per epoch (128)    8.81s     2.53s
  Avg. time per epoch (512)    45.08s    7.13s
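For reference, the PyTorch side of a benchmark of the kind summarized in Table 6 could look like the sketch below: time the forward and backward pass of a two-layer network whose first multiplication is a sparse matrix times a dense weight. The shapes, repetition count, and synthetic sparse input are assumptions; the numbers in Table 6 come from the authors' own setup.

```python
import time
import torch

def time_sparse_two_layer(n_nodes=50_000, hidden=128, out=64, steps=10, density=1e-4):
    # Synthetic sparse input standing in for the Amazon setting, where identity
    # features make the first layer a 334,863 x 128 sparse-times-dense product.
    nnz = int(n_nodes * n_nodes * density)
    idx = torch.randint(0, n_nodes, (2, nnz))
    val = torch.rand(nnz)
    x_sparse = torch.sparse_coo_tensor(idx, val, (n_nodes, n_nodes)).coalesce()

    w0 = torch.randn(n_nodes, hidden, requires_grad=True)   # first linear layer
    w1 = torch.randn(hidden, out, requires_grad=True)       # second linear layer

    start = time.time()
    for _ in range(steps):
        h = torch.sparse.mm(x_sparse, w0).relu()            # sparse x dense forward
        loss = (h @ w1).sum()
        loss.backward()                                      # include backward, as in Table 6
        w0.grad = None
        w1.grad = None
    return (time.time() - start) / steps

print("avg time per step:", time_sparse_two_layer())
```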
Memory usage comparison: For training large-scale GCNs, besides training time, the memory needed for training is often more important and directly restricts scalability. The memory usage includes the memory needed for training the GCN over many epochs. As discussed in Section 3, to speed up training, VRGCN needs to save historical embeddings during training, so it needs much more memory than Cluster-GCN. GraphSAGE also has a higher memory requirement than Cluster-GCN due to the exponential neighborhood growth problem. In Table 5, we compare our memory usage with VRGCN's for GCNs with different numbers of layers. When increasing the number of layers, Cluster-GCN's memory usage does not increase by much. The reason is that when adding one layer, the extra variable introduced is the weight matrix W^{(L)}, which is relatively small compared with the sub-graph and node features. In contrast, VRGCN needs to save each layer's historical embeddings, which are usually dense and soon dominate the memory usage. We can see from Table 5 that Cluster-GCN is much more memory efficient than VRGCN. For instance, on the Reddit data, to train a 4-layer GCN with hidden dimension 512, VRGCN needs 2064 MB of memory, while Cluster-GCN only uses 308 MB.

Table 5: Comparisons of memory usage on different datasets. Numbers in brackets indicate the size of the hidden units used in the model.

                 2-layer                           3-layer                           4-layer
                 VRGCN    Cluster-GCN  GraphSAGE   VRGCN    Cluster-GCN  GraphSAGE   VRGCN    Cluster-GCN  GraphSAGE
  PPI (512)      258 MB   39 MB        51 MB       373 MB   46 MB        71 MB       522 MB   55 MB        85 MB
  Reddit (128)   259 MB   284 MB       1074 MB     372 MB   285 MB       1075 MB     515 MB   285 MB       1076 MB
  Reddit (512)   1031 MB  292 MB       1099 MB     1491 MB  300 MB       1115 MB     2064 MB  308 MB       1131 MB
  Amazon (128)   1188 MB  703 MB       N/A         1351 MB  704 MB       N/A         1515 MB  705 MB       N/A

4.2 Experimental results on Amazon2M

A new GCN dataset: Amazon2M. By far the largest public dataset for testing GCN is the Reddit dataset, with the statistics shown in Table 3; it contains about 200K nodes, and as shown in Figure 6, GCN training on this data can be finished within a few hundred seconds. To test the scalability of GCN training algorithms, we constructed a much larger graph with over 2 million nodes and 61 million edges based on Amazon co-purchasing networks [11, 12]. The raw co-purchase data is from Amazon-3M (http://manikvarma.org/downloads/XC/XMLRepository.html). In the graph, each node is a product, and a link indicates whether two products are purchased together. Each node feature is generated by extracting bag-of-words features from the product description, followed by Principal Component Analysis [7] to reduce the dimension to 100. In addition, we use the top-level categories as the labels for each product/node (see Table 7 for the most common categories). The detailed statistics of the dataset are listed in Table 3.

Table 7: The most common categories in Amazon2M.

  Categories     Number of products
  Books          668,950
  CDs & Vinyl    172,199
  Toys & Games   158,771

In Table 8, we compare with VRGCN for GCNs with different numbers of layers in terms of training time, memory usage, and test accuracy (F1 score). As can be seen from the table: 1) VRGCN is faster than Cluster-GCN with a 2-layer GCN but slower than Cluster-GCN when one more layer is added, while achieving similar accuracy. 2) In terms of memory usage, VRGCN uses much more memory than Cluster-GCN (5 times more for the 3-layer case), and it runs out of memory when training a 4-layer GCN, while Cluster-GCN does not need much additional memory when increasing the number of layers and achieves the best accuracy on this data when training a 4-layer GCN.

Table 8: Comparisons of running time, memory, and test accuracy (F1 score) for Amazon2M.

                        Time                  Memory                    Test F1 score
                        VRGCN   Cluster-GCN   VRGCN     Cluster-GCN     VRGCN   Cluster-GCN
  Amazon2M (2-layer)    337s    1223s         7476 MB   2228 MB         89.03   89.00
  Amazon2M (3-layer)    1961s   1523s         11218 MB  2235 MB         90.21   90.21
  Amazon2M (4-layer)    N/A     2289s         OOM       2241 MB         N/A     90.41

4.3 Training Deeper GCNs

In this section we consider GCNs with more layers. We first show the timing comparisons of Cluster-GCN and VRGCN in Table 9. PPI is used for benchmarking and we run 200 epochs for both methods. We observe that the running time of VRGCN grows exponentially because of its expensive neighborhood finding, while the running time of Cluster-GCN only grows linearly.

Table 9: Comparisons of running time when using different numbers of GCN layers. We use PPI and run both methods for 200 epochs.

               2-layer   3-layer   4-layer   5-layer   6-layer
  Cluster-GCN  52.9s     82.5s     109.4s    137.8s    157.3s
  VRGCN        103.6s    229.0s    521.2s    1054s     1956s

Next we investigate whether using deeper GCNs obtains better accuracy. In Section 3.3, we discussed different strategies of modifying the adjacency matrix A to facilitate the training of deep GCNs. We apply the diagonal enhancement technique to deep GCNs and run experiments on PPI.


The results are shown in Table 11. For the cases of 2 to 5 layers, the accuracy of all methods increases with more layers, suggesting that deeper GCNs may be useful. However, when 7 or 8 GCN layers are used, the first three methods fail to converge within 200 epochs and suffer a dramatic loss of accuracy. A possible reason is that the optimization of deeper GCNs becomes more difficult. We show the detailed convergence of an 8-layer GCN in Figure 5. With the proposed diagonal enhancement technique (11), the convergence can be improved significantly and similar accuracy can be achieved.

Figure 5: Convergence of an 8-layer GCN. We present the number of epochs (x-axis) versus validation accuracy (y-axis). All methods except the one using (11) fail to converge.

State-of-the-art results by training deeper GCNs. With the design of Cluster-GCN and the proposed normalization approach, we now have the ability to train much deeper GCNs to achieve better accuracy (F1 score). We compare the test accuracy with other existing methods in Table 10. For PPI, Cluster-GCN achieves the state-of-the-art result by training a 5-layer GCN with 2048 hidden units. For Reddit, a 4-layer GCN with 128 hidden units is used.

Table 10: State-of-the-art test accuracy reported in recent papers.

                  PPI     Reddit
  FastGCN [1]     N/A     93.7
  GraphSAGE [5]   61.2    95.4
  VR-GCN [2]      97.8    96.3
  GaAN [16]       98.71   96.36
  GAT [14]        97.3    N/A
  GeniePath [10]  98.5    N/A
  Cluster-GCN     99.36   96.60

5 CONCLUSION

We present Cluster-GCN, a new GCN training algorithm that is fast and memory efficient. Experimental results show that this method can train very deep GCNs on large-scale graphs; for instance, on a graph with over 2 million nodes, the training time is less than an hour, around 2 GB of memory is used, and an accuracy of 90.41 (F1 score) is achieved. Using the proposed approach, we are able to successfully train much deeper GCNs, which achieve state-of-the-art test F1 scores on the PPI and Reddit datasets.

REFERENCES
[1] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In ICLR.
[2] Jianfei Chen, Jun Zhu, and Le Song. 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In ICML.
[3] Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. 2018. Learning Steady-States of Iterative Algorithms over Graphs. In ICML. 1114–1122.
[4] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. 2007. Weighted Graph Cuts Without Eigenvectors: A Multilevel Approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 11 (2007), 1944–1957.
[5] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770–778.
[7] H. Hotelling. 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology 24, 6 (1933), 417–441.
[8] George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359–392.
[9] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[10] Ziqi Liu, Chaochao Chen, Longfei Li, Jun Zhou, Xiaolong Li, Le Song, and Yuan Qi. 2019. GeniePath: Graph Neural Networks with Adaptive Receptive Paths. In AAAI.
[11] Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring Networks of Substitutable and Complementary Products. In KDD.
[12] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In SIGIR.
[13] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS-W.
[14] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[15] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD.
[16] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. In UAI.
[17] Muhan Zhang and Yixin Chen. 2018. Link Prediction Based on Graph Neural Networks. In NIPS.


Table 11: Comparisons of different diagonal enhancement techniques. For all methods, we present the best validation accuracy achieved within 200 epochs. PPI is used and the dropout rate is 0.1 in this experiment; other settings are the same as in Section 4.1. The entries showing a dramatic accuracy drop (e.g., the 7- and 8-layer columns of the first three variants) indicate poor convergence.

                                         2-layer  3-layer  4-layer  5-layer  6-layer  7-layer  8-layer
  Cluster-GCN with (1)                   90.3     97.6     98.2     98.3     94.1     65.4     43.1
  Cluster-GCN with (10)                  90.2     97.7     98.1     98.4     42.4     42.4     42.4
  Cluster-GCN with (10) + (9)            84.9     96.0     97.1     97.6     97.3     43.9     43.8
  Cluster-GCN with (10) + (11), λ = 1    89.6     97.5     98.2     98.3     98.0     97.4     96.2

Figure 6: Comparisons of different GCN training methods. We present the relation between training time in seconds (x-axis) and the validation F1 score (y-axis). Panels: (a) PPI (2 layers), (b) PPI (3 layers), (c) PPI (4 layers), (d) Reddit (2 layers), (e) Reddit (3 layers), (f) Reddit (4 layers), (g) Amazon (2 layers), (h) Amazon (3 layers), (i) Amazon (4 layers).


6 MORE DETAILS ABOUT THE EXPERIMENTS

In this section we describe more detailed settings of the experiments to help reproducibility.

6.1 Datasets and software versions

We describe more details about the datasets in Table 12. We download the PPI and Reddit datasets from http://snap.stanford.edu/graphsage/ and the Amazon dataset from https://github.com/Hanjun-Dai/steady_state_embedding. Note that for Amazon we consider GCN in an inductive setting, meaning that the model only learns from the training data; in [3] a transductive setting is considered. Regarding software versions, we install CUDA 10.0 and cuDNN 7.0; TensorFlow 1.12.0 and PyTorch 1.0.0 are used. We download METIS 5.1.0 via the official website (http://glaros.dtc.umn.edu/gkhome/metis/metis/download) and use a Python wrapper (https://metis.readthedocs.io/en/latest/) for the METIS library.

Table 12: The training, validation, and test splits used in the experiments. Note that for the two Amazon datasets we only split into training and test sets.

  Datasets   Task        Data splits (Tr./Val./Te.)
  PPI        Inductive   44,906 / 6,514 / 5,524
  Reddit     Inductive   153,932 / 23,699 / 55,334
  Amazon     Inductive   91,973 / 242,890
  Amazon2M   Inductive   1,709,997 / 739,032

6.2 Implementation details

Previous works [1, 2] propose to pre-compute the multiplication AX used in the first GCN layer. We also adopt this strategy in our implementation. By precomputing AX, we essentially use the exact 1-hop neighborhood of each node, and the expensive neighborhood search in the first layer can be saved.

Another implementation detail concerns the technique mentioned in Section 3.2. When multiple clusters are selected, some between-cluster links are added back. Thus the new combined adjacency matrix should be re-normalized to maintain the numerical range of the resulting embedding matrix. From experiments we find that this renormalization is helpful.

As for the inductive setting, the test nodes are not visible during the training process. We therefore construct one adjacency matrix containing only the training nodes and another one containing all nodes. Graph partitioning is applied to the former, and the partitioned adjacency matrix is then re-normalized. Note that feature normalization is also conducted. To calculate the memory usage, we use tf.contrib.memory_stats.BytesInUse() for TensorFlow and torch.cuda.memory_allocated() for PyTorch.
6.3 The running time of the graph clustering algorithm and data preprocessing

The experiments comparing different GCN training methods in Section 4 report the running time of training only; the preprocessing time for each method is not presented in the tables and figures. While some of these preprocessing steps, such as data loading or parsing, are shared across different methods, some steps are algorithm specific. For instance, our method needs to run the graph clustering algorithm during the preprocessing stage.

In Table 13, we present more details about the preprocessing time of Cluster-GCN on the four GCN datasets. For graph clustering, we adopt METIS, which is a fast and scalable graph clustering library. We observe that the graph clustering algorithm only takes a small portion of the preprocessing time, showing that applying such algorithms adds only a small extra cost and scales to large datasets. In addition, graph clustering only needs to be conducted once to form the node partitions, which can be re-used for later training processes.

Table 13: The running time of the graph clustering algorithm (METIS) and of data preprocessing before the training of GCN.

  Datasets   #Partitions   Clustering   Preprocessing
  PPI        50            1.6s         20.3s
  Reddit     1500          33s          286s
  Amazon     200           0.3s         67.5s
  Amazon2M   15000         148s         2160s

