CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction

Shuai Zhang MeituanBeijingChina [email protected] , Zite Jiang SKL Computer Architecture, Institute of Computing Technology, Chinese Academy of SciencesBeijingChina [email protected] and Haihang You SKL Computer Architecture, Institute of Computing Technology, Chinese Academy of SciencesBeijingChina [email protected]

Abstract.

Graph neural network training is mainly categorized into mini-batch and full-batch training methods. The mini-batch training method samples subgraphs from the original graph in each iteration. This sampling operation introduces extra computation overhead and reduces the training accuracy. Meanwhile, the full-batch training method calculates the features and corresponding gradients of all vertices in each iteration, and therefore has higher convergence accuracy. However, in the distributed cluster, frequent remote accesses of vertex features and gradients lead to huge communication overhead, thus restricting the overall training efficiency.

In this paper, we introduce the cached-based distributed full-batch graph neural network training framework (CDFGNN). We propose the adaptive cache mechanism to reduce the remote vertex access by caching the historical features and gradients of neighbor vertices. Besides, we further optimize the communication overhead by quantifying the messages and designing the graph partition algorithm for the hierarchical communication architecture. Experiments show that the adaptive cache mechanism reduces remote vertex accesses by $63.14\%$ on average. Combined with communication quantization and hierarchical GP algorithm, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks by $30.39\%$ in our experiments. Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks.

Graph Neural Network, Distributed Training, Machine Learning System

^†^†copyright: acmcopyright

1. Introduction

With the rise of large-scale pre-training models, the demand for distributed training based on heterogeneous architecture is also increasing. As an important deep learning structure, graph neural network (GNN) (kipf2016semi, ) has been applied in natural language processing, computer vision, knowledge graphs, etc. Compared with traditional graph algorithms, the graph neural network often requires computation on heterogeneous devices. Besides, the graph neural network needs to send the features and gradients of vertices across devices in each iteration, which brings huge communication overhead. Therefore, designing an efficient heterogeneous distributed graph neural network training framework is a challenging and engaging research area.

The training of distributed graph neural network can be categorized into full-batch training (kipf2016semi, ; velickovic2018yoshua, ) and mini-batch training (hamilton2017inductive, ; chen2018fastgcn, ; huang2018adaptive, ; zeng2019graphsaint, ; dong2021global, ). The main difference between them is whether the entire graph data is involved in each iteration. For the full-batch training method, an iteration contains the model computation phase (including forward propagation and back propagation) and the parameter update phase. For mini-batch training, an additional sampling phase needs to be added. The sampling phase needs to be performed before the model computation phase and the parameter update phase. In the sampling phase, subgraphs are sampled from the entire graph for the current training iteration. Therefore, for the full-batch training, one training epoch is equivalent to one iteration. For the mini-batch training, one training epoch often consists of multiple iterations.

Many mini-batch (sample-based) distributed GNN training methods have been proposed recently. However, these mini-batch training methods lead to problems such as information loss (cai2021dgcl, ; jia2020improving, ; tripathy2020reducing, ), additional sampling overhead (jia2020improving, ), and unable to guarantee convergence (chen2017stochastic, ). Therefore, in this paper, we focus on another distributed training strategy: full-batch training.

Compared with traditional graph algorithms or deep learning algorithms, distributed full-batch graph neural network training brings new system-level problems. The GNN training process has irregular neighbor vertex access and iterative computation at the same time. Therefore, graph neural network training is also characterized by both memory access intensive and computing intensive tasks (thorpe2021dorylus, ; wang2020gnn, ). In the distributed environment, there is also a problem of intensive communication for the full-batch training methods. During the full-batch GNN training, both the model parameters and neighbor vertex data (features and gradients) need to be transmitted across the device. Due to the huge communication volume of vertex features and gradients, efficient full-batch GNN training is extremely difficult.

In this paper, we focus on reducing the communication overhead during distributed full-batch graph neural network training. Considering that the changes of model parameters during GNN training are usually very slight, we cache historical features and gradients of vertices to reduce the cross-device neighbor vertex access. In addition, we adopt the quantization method to compress communication messages. We further design the hierarchical graph partition algorithm to reduce the number of communication messages across physical nodes (at the expense of the extra messages across different GPUs within the same physical node).

Specifically, our main contributions are as follows:

•

We propose the cache-based distributed full-batch graph neural network training method CDFGNN. By adaptively caching vertex-level historical features and gradients, we can greatly reduce the communication overhead without affecting the convergence accuracy and the number of iterations required for convergence.
•

We quantify the vertex features and gradients during communication in CDFGNN to further reduce communication overhead.
•

We design the graph partition algorithm to adapt to the communication characteristics of the hierarchical hardware architecture.
•

Experiments show that CDFGNN can greatly reduce the communication overhead during distributed full-batch graph neural network training and thus improve the overall training efficiency.

This paper is organized as follows: Section 2 discusses the challenges of distributed GNN training and explains our motivation. Section 3 introduces the computation and communication architecture of CDFGNN. Section 4 proposes the adaptive cache mechanism for vertex features and gradients and theoretically proves the convergence of this mechanism. Section 5 and section 6 describes the quantization method and the hierarchical graph partition algorithm. Section 7 presents and analyzes several experiments, which demonstrate the characteristics and capabilities of CDFGNN. Finally, we review the related work, conclude our approach, and preview the future project in Section 8 and Section 9.

2. Background and Motivation

2.1. Background

Refer to caption — Figure 1. Distributed full-batch GNN Training.

The distributed full-batch GNN training methods require the original graph to be partitioned into several subgraphs, and each computing device (CPU or GPU) only keeps its own subgraph. The corresponding vertex features are also split and assigned to each device. Thus, the computation of the entire graph can be completed in just one iteration.

During the training process, each computing device saves a copy of the current model parameters to enable local computation. Therefore, for the full-batch GNN training, the model parameter synchronization is also needed after each iteration.

For both GCN (kipf2016semi, ) and GAT (velivckovic2017graph, ) models, the vertex features and gradients of all neighbor vertices are required to calculate the features and gradients of the certain vertex during the forward and backward propagation in each layer. In distributed clusters, such large-scale cross-device data access brings serious communication overhead and becomes a bottleneck of the overall computation. Besides, load balancing among the various devices is also important. This is because load imbalance not only results in computational load imbalance, but also communication imbalance.

Figure 1 shows the training process of a distributed graph neural network with $6$ vertices. Vertices on the same device (GPU) are represented by the same color, and red edges identify edges across GPUs.

The right side is the computational graph of the two-layer graph neural network for vertex “B” and vertex “D”. In order to obtain the final vertex features, $7$ and $6$ cross-device communication messages are required for “B” and “D” respectively. Each message contains high-dimensional vertex features. When performing backward propagation, the same number of vertex gradients is also required. Therefore, cross-device communication becomes an important bottleneck for efficient training. The overall communication overhead may even account for about $80\%$ of the total training time (cai2021dgcl, ; gandhi2021p3, ; tripathy2020reducing, ).

For the distributed mini-batch GNN training, we need to sample graphs before model computation. Thus, an iteration of distributed mini-batch training consists of three stages: sampling, model computation, and model parameter synchronization.

These mini-batches can be sampled by the computing device itself, or sampled by a dedicated sampling device. Each computing device independently executes forward propagation and backward propagation on its corresponding subgraph. After the computation stage is completed, these computing devices synchronize and accumulate the gradients to update the model parameters.

Figure 2 shows a $2$ -hop sampling process on the original graph. For the $L$ -layers graph neural network, in order to calculate the vertex features, (at least part of) L-hop neighbor vertices need to be included in the sampled subgraph. In figure 2, for calculating vertex $1$ , we additionally add parts of its $2$ -hop neighbor vertices to the subgraph. For graphs with high connectivity and small diameter (such as power-law graphs), even few vertices sampled will generate a large subgraph. This phenomenon results in significant extra computational overhead. Although we can restrict the maximum number of sampled neighbor vertices as in figure 2, it will directly reduce the model accuracy.

Compared with the full-batch distributed GNN training, the computation stage of mini-batch training is executed independently on sampled subgraphs, thus avoiding the remote vertex access. However, the sampling process also incurs additional computational overhead, including the sampling itself and extra vertex calculations. In addition, the mini-batch GNN training often reduces the model accuracy.

2.2. Motivation

The frequent and expensive remote neighbor vertex access restricts the scalability of distributed full-batch GNN training. To overcome this challenge, we can optimize it from the following perspectives:

•

Frequency: Cache neighbor vertex data instead of executing remote access in each iteration,
•

Expensive: Compress the message size,
•

Remote: Make full use of the hierarchical communication architecture.

For GNN training tasks, the model parameters tend to stabilize after several training epochs. Besides, the training process does not require high-precision vertex features and gradients before the model converges. Therefore, we cache and reuse historical vertex features and gradients during training to reduce communication overhead, especially in the middle stage of the training process.

In order to compress the message size, we quantify the communication messages. These messages include the model parameter gradients and remote neighbor vertex features and gradients. The scale of vertex features and gradients in the GNN training is much larger than the model parameters. Meanwhile, when there are small errors in the vertex features and gradients, the final convergence performance will not be significantly reduced, and sometimes it can even prevent the training process from falling into a local optimal solution. Therefore, we compress the vertex features and gradients during communication by quantifying.

Finally, we analyze the communication characteristics of heterogeneous clusters and find that using the PCIe to communicate between different GPUs in the same physical node is more efficient (higher bandwidth and lower latency) than network communication (InfiniBand) across physical nodes. Therefore, we propose a graph partition algorithm to reduce the number of messages across physical nodes at the cost of increasing communication within physical nodes.

3. CDFGNN Architecture

In this section, we take the graph convolutional network (GCN) as an example to describe the computation and communication stage of CDFGNN.

Figure 3 shows the overall computing and communication workflow of CDFGNN. CDFGNN first needs to perform the graph partitioning (GP) algorithm to partition the graph (and corresponding input features) into subgraphs equal to the number of computing devices (GPUs). Different from the traditional full-batch graph neural network training framework, we adopt the vertex-cut GP algorithm. The vertex-cut GP is considered a better approach to handle power-law graphs common in the real world (gonzalez2012powergraph, ; chen2019powerlyra, ). Figure 4 demonstrates partition results of the vertex-cut GP algorithm. In this example, vertex “B” exists in all $3$ subgraphs and we choose one of these replicas as the master vertex while others as mirror vertices.

We describe the single iteration distributed training in the algorithm 1. $L$ refers to the number of layers of the GCN network, and the model parameters of each layer are represented as $W^{(0)},\cdots,W^{(L-1)}$ . Next, we describe the computation and communication stage in detail.

Input: Graph

G(V,E)

, Sparse Matrix

\hat{A}_{i}

, Input feature

H^{(0)}_{i}

, Current Model Parameter

W

Output: Output Feature

H^{(L)}_{i}

2for all process $P(i)$ parallel do

//

Layer-by-layer forward propagation:

4 for $l=1,\cdots,L$ do

\ddot{Z}^{(l)}_{i}\leftarrow\hat{A}_{i}H^{(l-1)}_{i}W^{(l-1)}

6 Synchronize by communication to get

Z^{(l)}_{i}

H^{(l)}_{i}\leftarrow\sigma\left(Z^{(l)}_{i}\right)

//

Layer-by-layer backward propagation:

11 Compute Loss Function

\mathcal{L}_{i}

and

\ddot{\delta}_{i}^{(L)}

12 for $l=L,\cdots,1$ do

13 Synchronize by communication to get

\delta_{i}^{(l)}

\ddot{\delta}^{(l-1)}_{i}\leftarrow\delta_{i}^{(l)}\hat{A}_{i}\left(W^{(l-1)}% \right)^{\text{T}}\cdot\sigma^{\prime}\left(Z_{i}^{(l-1)}\right)

\nabla_{W^{(l-1)}}\mathcal{L}_{i}\leftarrow\delta_{i}^{(l)}\hat{A}_{i}\left(H_% {i}^{(l-1)}\right)^{\text{T}}

16 Parameter Server aggregate

\nabla_{W^{(l-1)}}\mathcal{L}_{i}

, update and broadcast parameters:

W^{(l-1)}=W^{(l-1)}-\eta\sum\limits_{i=1}^{p}\nabla_{W^{(l-1)}}\mathcal{L}_{i}

Algorithm 1 CDFGNN Workflow

3.1. Computation Stage of CDFGNN

In the computation stage, each GPU independently performs graph neural network computation tasks on its corresponding subgraph. We use the BSP model (valiant1990bridging, ) to achieve synchronization of vertex features through communication.

Let $A_{i}$ be the adjacency matrix of subgraph $i$ and $D_{i}$ be the corresponding submatrix in the original degree matrix. $\hat{A}_{i}=D_{i}^{-1/2}A_{i}D_{i}^{-1/2}$ is the normalized adjacency matrix of the subgraph in the computing device $i$ . We use superscript $\ddot{}$ to represent the intermediate matrix values ( $\ddot{Z}^{(l)}_{i}$ and $\ddot{\delta}^{(l-1)}_{i}$ ) calculated only from local subgraphs, and the corresponding expressions without this superscript indicate the value ( $Z^{(l)}_{i}$ and $\delta^{(l-1)}_{i}$ ) after communication synchronization.

During the forward propagation of GCN, we calculate the vertex feature $H^{(l)}_{i}$ of the $l$ -th layer in the subgraph $i$ as

(1)

\ddot{Z}^{(l)}_{i}=\hat{A}_{i}H^{(l-1)}_{i}W^{(l-1)},

(2)

H^{(l)}_{i}=\sigma\left(Z^{(l)}_{i}\right).

We calculate $\ddot{Z}^{(l)}_{i}$ with the local vertex feature $H^{(l-1)}_{i}$ , local normalized adjacency matrix $\hat{A}_{i}$ and the global model parameter $W^{(l-1)}$ . For restoring the “real” $Z_{i}^{(l)}$ (the same as the value during the sequential training), we need to synchronize and aggregate $\ddot{Z}^{(l)}_{i}$ from each device through communication. The communication stage will be introduced in section 3.2.

According to $Z^{(l)}_{i}$ , we can calculate the input $H^{(l)}_{i}$ of the next layer. $H^{(l)}_{i}\in\mathbb{R}^{|V_{i}|\times F_{i}}$ , where $F_{i}$ refers to the vertex feature dimension of the $i$ -th layer. By iteratively executing equations 1 and 2, we can complete the calculation of forward propagation layer by layer.

During the backward propagation, we only calculate the loss value of the master vertices when calculating the loss function $\mathcal{L}$ . Thus, we can avoid repeated calculations of gradients on multiple replicas.

We use $\mathcal{L}$ to represent the loss function in the global and $\mathcal{L}_{i}$ to represent its component on subgraph $i$ , while $\sum\limits_{i=1}^{p}\mathcal{L}_{i}=\mathcal{L}$ . When calculating the gradient, we define $\delta^{(l)}_{i}=\nabla_{Z^{(l)}_{i}}\mathcal{L}$ to represent the gradient of the global loss function $\mathcal{L}$ with respect to the global variable $Z^{(l)}_{i}$ , and $\ddot{\delta}^{(l)}_{i}=\nabla_{\ddot{Z}^{(l)}_{i}}\mathcal{L}_{i}$ to represent the gradient of the local loss function $\mathcal{L}_{i}$ with respect to the local variable $\ddot{Z}^{(l)}_{i}$ . For calculating $\ddot{\delta}^{(l-1)}_{i}$ , we have

(3)	$\displaystyle\ddot{\delta}^{(l-1)}_{i}=$	$\displaystyle\frac{\partial\mathcal{L}_{i}}{\partial\ddot{Z}_{i}^{(l-1)}}=% \frac{\partial\mathcal{L}_{i}}{\partial Z_{i}^{(l-1)}}$
	$\displaystyle=$	$\displaystyle\frac{\partial\mathcal{L}_{i}}{\partial Z_{i}^{(l)}}\frac{% \partial Z_{i}^{(l)}}{\partial\ddot{Z}_{i}^{(l)}}\frac{\partial\ddot{Z}_{i}^{(% l)}}{\partial H^{(l-1)}_{i}}\frac{\partial H^{(l-1)}_{i}}{\partial Z_{i}^{(l-1% )}}$
	$\displaystyle=$	$\displaystyle\delta_{i}^{(l)}\hat{A}_{i}\left(W^{(l-1)}\right)^{\text{T}}\cdot% \sigma^{\prime}\left(Z_{i}^{(l-1)}\right).$

Note that $Z_{i}^{(l)}$ is calculated with the sum aggregation of $\ddot{Z}_{j}^{(l)},j\in[1,p]$ , thus we have $\frac{\partial Z_{i}^{(l)}}{\partial\ddot{Z}_{i}^{(l)}}=1$ exists for all subgraph $i$ . Similar with $\ddot{Z}_{i}^{(l)}$ , we can also get $\delta_{i}^{(l-1)}$ by aggregating $\ddot{\delta}_{i}^{(l-1)}$ from each device through communication.

With $\delta_{i}^{(l)}$ , the gradient of the model parameter $W^{(l-1)}$ can be calculated as

(4)

\nabla_{W^{(l-1)}}\mathcal{L}_{i}=\frac{\partial\mathcal{L}_{i}}{\partial W^{(% l-1)}}=\frac{\partial\mathcal{L}_{i}}{\partial Z_{i}^{(l)}}\frac{\partial\ddot% {Z}_{i}^{(l)}}{\partial W^{(l-1)}}=\delta_{i}^{(l)}\hat{A}_{i}\left(H_{i}^{(l-% 1)}\right)^{\text{T}}.

When performing parameter updates, we need to summarize the gradients calculated on all subgraphs as

(5)

W^{(l)}=W^{(l)}-\eta\sum\limits_{i=1}^{p}\nabla_{W^{(l)}}\mathcal{L}_{i}.

This process also needs to be implemented through communication. However, the data size of model parameters is usually much smaller than the data size of neighbor vertex features and gradients. Thus, the communication overhead of aggregating model parameters is not the performance bottleneck.

In summary, during one iteration (forward + backward) of one GCN layer, there are two communication synchronizations for vertex values (features and gradients). This communication is to obtain the global intermediate value $Z^{(l)}$ in the forward propagation and to obtain the $\delta^{(l)}$ in the backward propagation. Through these communication synchronizations, the calculated model parameter gradients are theoretically consistent with the single-device full-batch training method.

3.2. Communication Stage of CDFGNN

In the real world, most of the data graphs processed by graph neural network algorithms are power-law graphs (albert2002statistical, ), such as social networks, citation graphs, etc. We adopt the vertex-cut GP algorithm, which is more efficient for power-law graphs. In figure 4, we demonstrate the communication pattern for vertex “B”. We use the gray vertex in subgraph $1$ to mark this vertex “B” as a master vertex, while others are mirror vertices. In the computation stage, these replicas compute their intermediate values $\ddot{Z}^{(l)}_{i}$ and $\ddot{\delta}^{(l)}_{i}$ independently. We need to aggregate these values through communication to achieve the same value as when executing on a single device.

CDFGNN takes each vertex as the minimum communication unit. The communication stage can be divided into two phases: gather and scatter. In the gather phase, the mirror vertex sends its values to the corresponding master vertex (with the same vertex ID). When the master vertex receives these messages, it should collect them and sum them with its own values. In the scatter phase, the master vertex sends its aggregated values back to all corresponding mirror vertices. The mirror vertex uses the received values to replace the original values. In figure 4, we list the values of vertex “B” at different communication phases in all subgraphs.

This communication pattern requires the mirror vertex to store the location of its master vertex, and the master vertex to store the locations of all its mirrors. By executing the communication stage, we can ensure that the states of the vertex replicas are consistent with the sequential GNN training.

4. Adaptive Vertex Feature Cache

In this section, we introduce the adaptive cache mechanism of CDFGNN and prove its convergence.

4.1. Adaptive Cache Mechanism

In order to reduce the expensive vertex feature and gradient communication during the CDFGNN training process, we propose an adaptive vertex-level caching mechanism. Specifically, we cache the intermediate variables $Z_{i}^{(l)}$ and $\delta^{(l)}_{i}$ during the training process.

For $Z_{i}^{(l)}$ and $\delta^{(l)}_{i}$ , we adopt the same cache mechanism. For convenience, we take $Z^{(l)}_{i}$ as the example to introduce the caching mechanism in detail. Firstly, we denote $\ddot{Z}^{(l)}_{i}=\{z_{i,1},\cdots,z_{i,|V_{i}|}\}$ , where $z_{i,j}$ represents the feature vector corresponding to the $j$ -th vertex of subgraph $i$ in $\ddot{Z}^{(l)}_{i}$ . For each subgraph, we renumber the vertices with a local ID for continuous memory access. The $j$ -th vertex here refers to the vertex with local ID $j$ of subgraph $i$ .

Let $\tilde{z}_{i,j}$ be the cached value of $z_{i,j}$ , and $\tilde{z}_{\cdot,j}$ be the corresponding cached value in $Z^{(l)}_{i}$ . For each computing device, it should keep the cached value $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$ for all vertices in their own subgraphs.

Input: current value

z_{i,u}

, cached value

\tilde{z}_{i,u}

and

\tilde{z}_{\cdot,u}

, threshold

\epsilon

Output: cached value

\tilde{z}_{i,u}

and

\tilde{z}_{\cdot,u}

2for all process $P(i)$ parallel do

//

Traverse mirror vertices:

4 for $u\in getMirrorVertices()$ do

5 if $\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}>\epsilon\|\tilde{z}_{i,u}\|_{\infty}$ then

6 Send the difference value

\Delta_{z_{i,u}}=z_{i,u}-\tilde{z}_{i,u}

to the corresponding master vertex

\tilde{z}_{i,u}\leftarrow z_{i,u}

11 Bulk Synchronize! Wait for messages from all processes to be sent!

//

Traverse messages and master vertices:

13 for $(u,\Delta_{z_{i,u}})\in messages$ do

\tilde{z}_{\cdot,u}\leftarrow\tilde{z}_{\cdot,u}+\Delta_{z_{i,u}}

15 active vertex

u

18 for $u\in getMaster()$ do

19 if $\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}>\epsilon\|\tilde{z}_{i,u}\|_{\infty}$ then

\tilde{z}_{\cdot,u}\leftarrow\tilde{z}_{\cdot,u}+z_{i,u}-\tilde{z}_{i,u}

\tilde{z}_{i,u}\leftarrow z_{i,u}

22 active vertex

u

25 for $u\in$ active vertices do

26 Send the cached value

\tilde{z}_{\cdot,u}

to the corresponding mirror vertices.

Algorithm 2 Adaptive Vertex Cache Mechanism

The algorithm 2 describes the update strategy of the cached values $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$ . In each forward propagation of the GNN layer, we need to perform this algorithm once. After the update process is completed, we generate the matrix $Z^{(l)}_{i}$ by directly combining the cached value $\tilde{z}_{\cdot,j}$ .

For the cache mechanism, $\tilde{z}_{i,j}$ keeps the values used by computing device $i$ when building the cached value $\tilde{z}_{\cdot,j}$ . When the difference between $\tilde{z}_{i,j}$ and the real value $z_{i,j}$ calculated in current iteration is too large, we need to update $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$ for avoiding the large error. We use $\frac{\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}}{\|\tilde{z}_{i,u}\|_{\infty}}$ to measure the error. $\|\cdot\|_{\infty}$ is the $L_{\infty}$ norm, which can be used to represent the maximum absolute value of all elements in it.

We expect $\tilde{z}_{\cdot,j}$ to be consistent across all relevant computing devices. Thus, when the $\tilde{z}_{i,j}$ of any computing device changed, we need to synchronize it to all other replicas.

In order to increase the proportion of cached values as much as possible without reducing the convergence accuracy or increasing the number of iterations for convergence, we design an adaptive caching mechanism by dynamically adjusting the threshold $\epsilon$ . We update $\epsilon$ by

(6)

\epsilon=\left\{\begin{aligned} &\min(\lambda_{1}\epsilon,\epsilon+\xi),&&acc<% mean_{acc}-\mu_{1},\epsilon<\nu_{1}\cr&\max(\lambda_{2}\epsilon,\epsilon-\xi),% &&acc>mean_{acc}+\mu_{2},\epsilon>\nu_{2}\cr&\epsilon,&&otherwise\cr\end{% aligned}\right.

After each iteration, the value of $\epsilon$ is updated. Where $acc$ is the model accuracy on the train set in the current epoch, and $mean_{acc}$ is the exponential moving average of $acc$ :

(7)

mean_{acc}=0.8\times mean_{acc}+0.2\times acc.

For the remaining hyperparameters, they are set by default to $\mu_{1}=0.001$ , $\mu_{2}=0.02$ , $\nu_{1}=0.3$ , $\nu_{2}=0.001$ , $\xi=0.01$ , $\lambda_{1}=1.05$ and $\lambda_{2}=0.9$ in our experiments.

Among these hyperparameters, we set $\mu_{1}$ to be much larger than $\mu_{2}$ . This is because in the early stage of training, the accuracy on the training set increases rapidly. Only when there is a large enough accuracy increment (larger than $\mu_{2}$ ) can we consider that the current cache threshold should be relaxed. After the model parameters are stabilized, the accuracy of the model on the training set changes slightly. Therefore, even for small accuracy decreases, the threshold should be set smaller to reduce the cache error. In addition, we also use $\xi=0.02$ to define the maximum step size when $\epsilon$ changes to avoid the error threshold changing too quickly. We also use $\nu_{1}$ and $\nu_{2}$ to limit the value range of $\epsilon$ to $[\nu_{2},\nu_{1}]$ . The settings of these hyperparameters ensure that the training accuracy of the model will not be greatly reduced.

4.2. Proof of Convergence

Next, we prove the convergence of the training process when employing the adaptive cache mechanism. Specifically, we will prove that after a finite number of iterations, the model parameters $W$ will converge to the local optimal solution $W^{*}$ . We use the superscript $\tilde{}$ to represent the value obtained in this layer after communication synchronization when the cache mechanism is used. The values without superscripts represent the values obtained by current model parameters and input features without cache mechanism in all layers.

We first lay out the necessary and basic inequality required for the theoretical analysis.

Lemma 0.

Denote $\|A\|_{\infty}=\max_{i,j}|A_{i,j}|$ , $col(A)$ is the column number of matrix $A$ . We have $\|A+B\|_{\infty}\leq\|A\|_{\infty}+\|B\|_{\infty}$ , $\|A\cdot B\|_{\infty}\leq\|A\|_{\infty}\|B\|_{\infty}$ and $\|AB\|_{\infty}\leq col(A)\|A\|_{\infty}\|B\|_{\infty}$ .

Proof.

These three inequalities can be proved as follows:

(8)	$\displaystyle\\|A+B\\|_{\infty}$	$\displaystyle=\max_{i,j}\|A_{i,j}+B_{i,j}\|$
		$\displaystyle\leq\max_{i,j}\|A_{i,j}\|+\max_{i,j}\|B_{i,j}\|$
		$\displaystyle=\\|A\\|_{\infty}+\\|B\\|_{\infty},$

(9)	$\displaystyle\\|A\cdot B\\|_{\infty}$	$\displaystyle=\max_{i,j}\|A_{i,j}\times B_{i,j}\|$
		$\displaystyle\leq\max_{i,j}\|A_{i,j}\|\times\max_{i,j}\|B_{i,j}\|$
		$\displaystyle=\\|A\\|_{\infty}\\|B\\|_{\infty},$

(10)	$\displaystyle\\|AB\\|_{\infty}$	$\displaystyle=\max_{i,j}\|\sum\limits_{k=1}^{col(A)}A_{i,k}\times B_{k,j}\|$
		$\displaystyle\leq col(A)\max\limits_{i,j,k}\|A_{i,k}\times B_{k,j}\|$
		$\displaystyle\leq col(A)\max\limits_{i,j}\|A_{i,j}\|\max\limits_{i,j}\|B_{i,j}\|$
		$\displaystyle=col(A)\\|A\\|_{\infty}\\|B\\|_{\infty}.$

∎

Next, we state that with bounded staleness on the embeddings, the approximations of the intermediate matrix results are close to the exact ones in the forward propagation.

Lemma 0.

For the forward propagation of CDFGNN with the cache mechanism, if (a) we have $\|\tilde{Z}^{(l-1)}-Z^{(l-1)}\|_{\infty}\leq\epsilon_{Z^{(l-1)}}$ , while $\tilde{Z}^{(l-1)}$ and $Z^{(l-1)}$ represent the intermediate values with or without cache mechanism, (b) the function $\sigma(\cdot)$ is $\rho$ -Lipschitz continuous, (c) the elements in $Z^{(l)}$ , $\hat{A}$ and $W^{(l-1)}$ are bounded, while the absolute values are less than $B$ and the number of columns is less than $C$ . Then we have $\|\tilde{H}^{(l-1)}-H^{(l-1)}\|_{\infty}\leq\rho\epsilon_{Z^{(l-1)}}$ and $\|\tilde{Z}^{(l)}-Z^{(l)}\|_{\infty}\leq p\nu_{1}B+C^{2}B^{2}\rho\epsilon_{Z^{% (l-1)}}$ .

Proof.

We denote $\hat{Z}^{(l)}$ as the intermediate value when the caching mechanism is used in the previous $l-1$ layers, but not used in the $l$ -th layer. Considering that each element in $\hat{Z}^{(l)}$ is the sum from at most $p$ device, the upper bound error of $\tilde{Z}^{(l)}$ for using the cache mechanism in layer $l$ , is

(11)

\|\tilde{Z}^{(l)}-\hat{Z}^{(l)}\|_{\infty}\leq p\nu_{1}B.

Where $\nu_{1}$ is the upper bound of $\epsilon$ defined in the equation (6).

Therefore, we have

(12)		$\displaystyle\\|\tilde{H}^{(l-1)}-H^{(l-1)}\\|_{\infty}=$	$\displaystyle\\|\sigma(\tilde{Z}^{(l-1)})-\sigma(Z^{(l-1)})\\|_{\infty}$
(12)		$\displaystyle\leq$	$\displaystyle\rho\epsilon_{Z^{(l-1)}}$

(13)	$\displaystyle\\|$	$\displaystyle\tilde{Z}^{(l)}-Z^{(l)}\\|_{\infty}=\\|(\tilde{Z}^{(l)}-\hat{Z}^{(l% )})+(\hat{Z}^{(l)}-Z^{(l)})\\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle p\nu_{1}B+\\|\hat{A}\sigma(\tilde{Z}^{(l-1)})W^{(l-1)}-\hat{A}% \sigma(Z^{(l-1)})W^{(l-1)}\\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle p\nu_{1}B+C^{2}B^{2}\rho\epsilon_{Z^{(l-1)}}$

The equation ( 12) is obtained from the definition of Lipschitz condition. ∎

Next, we will prove that the intermediate gradient $\tilde{\delta}^{(l)}=\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}$ with cache mechanism is also close to the exact gradient $\delta^{(l)}=\nabla_{Z^{(l)}}\mathcal{L}$ .

Lemma 0.

For the backward propagation of CDFGNN, if (a) we have $\|\tilde{Z}^{(l-1)}-Z^{(l-1)}\|_{\infty}\leq\epsilon_{Z^{(l-1)}}$ , while $\tilde{Z}^{(l-1)}$ and $Z^{(l-1)}$ represent the intermediate values with or without cache mechanism, (b) the function $\sigma(\cdot)$ and the derivative of loss function $\nabla\mathcal{L}$ are $\rho$ -Lipschitz continuous, (c) the elements in $\delta^{(l)}$ , $\hat{A}$ , $\sigma^{\prime}(Z^{(l)})$ and $W^{(l-1)}$ are bounded, and their absolute values are less than $B$ and the number of columns is less than $C$ . Then we have $\|\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}-\nabla_{Z^{(l)}}\mathcal{L}\|_{\infty}$ and $\|\nabla_{W^{(l)}}\tilde{\mathcal{L}}-\nabla_{W^{(l-1)}}\mathcal{L}\|_{\infty}$ are also bounded.

Proof.

First, we prove that $\|\tilde{\delta}^{(l)}-\delta^{(l)}\|_{\infty}$ is bounded based on the previous lemma.

For the last layer $L$ , we have

(14)

\|\nabla_{\tilde{Z}^{(L)}}\tilde{\mathcal{L}}-\nabla_{Z^{(L)}}\mathcal{L}\|_{% \infty}\leq\rho\epsilon_{Z^{(L)}}.

Next, we use mathematical induction to complete the proof. For $l^{\prime}>l$ , if it satisfies $\|\nabla_{\tilde{Z}^{(l^{\prime})}}\tilde{\mathcal{L}}-\nabla_{Z^{(l^{\prime})% }}\mathcal{L}\|_{\infty}\leq K^{(l^{\prime})}$ , then for the $l$ -th layer, we have

(15)		$\displaystyle\\|\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}-\nabla_{Z^{(l)}}% \mathcal{L}\\|_{\infty}$
	$\displaystyle=$	$\displaystyle\\|\tilde{\delta}^{(l+1)}\hat{A}\left(W^{(l)}\right)^{\text{T}}% \cdot\sigma^{\prime}\left(\tilde{Z}^{(l)}\right)-\delta^{(l+1)}\hat{A}\left(W^% {(l)}\right)^{\text{T}}\cdot\sigma^{\prime}\left(Z^{(l)}\right)\\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle C^{2}\{\\|\tilde{\delta}^{(l+1)}\\|_{\infty}\\|\hat{A}\\|_{\infty}\\|% \left(W^{(l)}\right)^{\text{T}}\\|_{\infty}\\|\sigma^{\prime}\left(\tilde{Z}^{(l% )}\right)-\sigma^{\prime}\left(Z^{(l)}\right)\\|_{\infty}$
	$\displaystyle+$	$\displaystyle\\|\tilde{\delta}^{(l+1)}-\delta^{(l+1)}\\|_{\infty}\\|\hat{A}\\|_{% \infty}\\|\left(W^{(l)}\right)^{\text{T}}\\|_{\infty}\\|\sigma^{\prime}\left(Z^{(% l)}\right)\\|_{\infty}\}$
	$\displaystyle\leq$	$\displaystyle C^{2}(B^{3}\rho\epsilon_{Z^{(l)}}+K^{(l+1)}B^{3})=C^{2}B^{3}(% \rho\epsilon_{Z^{(l)}}+K^{(l+1)})$

Denote $K^{l}=C^{2}B^{3}(\rho\epsilon_{Z^{(l)}}+K^{(l+1)})$ , then we can find that the assumption holds for the $l$ -th layer. Therefore, we can complete the proof according to mathematical induction.

For $\|\nabla_{W^{(l)}}\tilde{\mathcal{L}}-\nabla_{W^{(l)}}\mathcal{L}\|_{\infty}$ , we can get it according to the equation (4):

(16)	$\displaystyle\\|\nabla_{W^{(l)}}$	$\displaystyle\tilde{\mathcal{L}}-\nabla_{W^{(l)}}\mathcal{L}\\|_{\infty}$
	$\displaystyle=$	$\displaystyle\\|\tilde{\delta}_{i}^{(l+1)}\hat{A}_{i}\left(\tilde{H}_{i}^{(l)}% \right)^{\text{T}}-\delta_{i}^{(l+1)}\hat{A}_{i}\left(H_{i}^{(l)}\right)^{% \text{T}}\\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle C^{2}\{\\|\tilde{\delta}_{i}^{(l+1)}\\|_{\infty}\\|\hat{A}\\|_{% \infty}\\|\left(\tilde{H}_{i}^{(l)}\right)^{\text{T}}-\left(H_{i}^{(l)}\right)^% {\text{T}}\\|_{\infty}$
	$\displaystyle+$	$\displaystyle\\|\tilde{\delta}_{i}^{(l+1)}-\delta_{i}^{(l+1)}\\|_{\infty}\\|\hat{% A}\\|_{\infty}\\|\left(H_{i}^{(l)}\right)^{\text{T}}\\|_{\infty}\}$
	$\displaystyle\leq$	$\displaystyle C^{2}(B^{2}\rho\epsilon_{Z^{(l)}}+K^{l+1}B^{2})=C^{2}B^{2}(\rho% \epsilon_{Z^{(l)}}+K^{(l+1)})$

∎

Finally, we will prove that CDFGNN can converge to the local optimal solution under the premise that the error is bounded. For the parameter matrix $W$ , we use the subscript $i$ to identify that the value is obtained of the $i$ -th iteration.

Theorem 4.

For the $L$ layer graph neural network training based on the CDFGNN cache mechanism, given the local optimal parameters $W_{(*)}$ and the initial parameters $W_{(1)}$ . Assuming that (a) the activation function $\sigma(\cdot)$ and the derivative of loss function $\nabla\mathcal{L}$ are $\rho$ -Lipschitz continuous, (b) the matrix $\hat{A}$ , $H$ and $W$ , and the corresponding gradients on them are bounded, where the maximum absolute value of the element is $B$ , (c) the function $\mathcal{L}(W)$ is $\rho$ -smooth. We can prove that there is a constant $K>0$ such that for $\forall N>L_{\epsilon}$ , if the GNN is trained based on the cache mechanism $R$ iterations ( $R\in[1,N]$ and is sampled from $[1,\dots,N]$ uniformly) and the learning rate $\eta=\min\left(\frac{1}{\rho},\frac{1}{\sqrt{N}}\right)$ , we have

(17)

\mathrm{E}_{R}\|\nabla_{W_{(R)}}\mathcal{L}\|^{2}_{F}\leq 2\frac{\mathcal{L}(W% _{(1)})-\mathcal{L}(W_{(*)})+\frac{\rho K}{2}}{\sqrt{N}}.

Proof.

For the convenience, we denote $\Delta_{(i)}=\nabla_{W_{(i)}}\tilde{\mathcal{L}}-\nabla_{W_{(i)}}\mathcal{L}$ . Considering that the model parameter $W$ is updated under the cache mechanism, we have $W_{(i+1)}=W_{(i)}+\eta\nabla_{W_{(i)}}\tilde{\mathcal{L}}$ . According to lemma 3 and the $\rho$ -smooth property of the function $\mathcal{L}(W)$ , we have

(18)	$\displaystyle\mathcal{L}$	$\displaystyle(W_{(i+1)})=\mathcal{L}(W_{(i)}+\eta\nabla_{W_{(i)}}\tilde{% \mathcal{L}})$
	$\displaystyle\leq$	$\displaystyle\mathcal{L}(W_{(i)})-\eta\langle\nabla_{W_{(i)}}\mathcal{L},% \nabla_{W_{(i)}}\tilde{\mathcal{L}}\rangle+\frac{\rho}{2}\eta^{2}\\|\nabla_{W_{% (i)}}\tilde{\mathcal{L}}\\|_{F}^{2}$
	$\displaystyle=$	$\displaystyle\mathcal{L}(W_{(i)})-\eta\langle\nabla_{W_{(i)}}\mathcal{L},% \Delta_{(i)}\rangle-\eta\\|\nabla_{W_{(i)}}\mathcal{L}\\|_{F}^{2}$
	$\displaystyle+$	$\displaystyle\frac{\rho}{2}\eta^{2}\left(\\|\Delta_{(i)}\\|^{2}_{F}+\\|\nabla_{W_% {(i)}}\mathcal{L}\\|^{2}_{F}+2\langle\Delta_{(i)},\nabla_{W_{(i)}}\mathcal{L}% \rangle\right)$
	$\displaystyle\leq$	$\displaystyle\mathcal{L}(W_{(i)})-(\eta-\frac{\rho}{2}\eta^{2})\\|\nabla_{W_{(i% )}}\mathcal{L}\\|^{2}_{F}+\frac{\rho}{2}\eta^{2}\\|\Delta_{(i)}\\|^{2}_{F}.$

The scaling in the last step is based on the value of the learning rate $\eta$ . According to lemma 3, we have $\|\Delta_{(i)}\|^{2}_{F}\leq\|\nabla_{W_{(}i)}\tilde{\mathcal{L}}\ |_{\infty}^% {2}+\|\nabla\mathcal{L}(W_{(i)})\|_{\infty}^{2}\leq 2B^{2}\leq K$ . Therefore, we have

(19)

\mathcal{L}(W_{(i+1)})\leq\mathcal{L}(W_{(i)})-(\eta-\frac{\rho}{2}\eta^{2})\|% \nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F}+\frac{\rho}{2}\eta^{2}K.

Sum up the equation (19) for $i$ from $1$ to $N$ , we can get

(20)

(\eta-\frac{\rho}{2}\eta^{2})\sum\limits_{i=1}^{N}\|\nabla_{W_{(i)}}\mathcal{L% }\|^{2}_{F}\leq\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho}{2}\eta^{2}KN.

Considering $\eta=\min\left(\frac{1}{\rho},\frac{1}{\sqrt{N}}\right)$ , we divide both side of equation (20) by $N(\eta-\frac{\rho}{2}\eta^{2})$ , then we have

(21)	$\displaystyle\mathrm{E}_{R}\\|\nabla_{W_{(R)}}\mathcal{L}\\|^{2}_{F}=$	$\displaystyle\frac{1}{N}\sum\limits_{i=1}^{N}\\|\nabla_{W_{(i)}}\mathcal{L}\\|^{% 2}_{F}$
	$\displaystyle\leq$	$\displaystyle 2\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho}{2}% \eta^{2}KN}{N\eta(2-\rho\eta)}$
	$\displaystyle\leq$	$\displaystyle 2\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})}{N\eta}+\rho\eta K$
	$\displaystyle\leq$	$\displaystyle 2\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho K}{2}}% {\sqrt{N}}.$

When $N\to\infty$ , we can find that the expectation of parameter gradient $\mathrm{E}_{R}\to 0$ . Therefore, we show that convergence of parameters can be achieved in finite iterations.

∎

5. Communication Quantization

In this section, we propose the communication quantization mechanism of CDFGNN. There are many quantization methods, including linear quantization and logarithmic quantification (daisuke2016convolutional, ), exponential quantification (li2019additive, ), differentiable quantization (gong2019differentiable, ; yang2019quantization, ), etc. Considering that when we adopt the adaptive cache mechanism, the message sent is the difference value instead of the original value. Thus, the message data usually follows an uniform distribution. For this reason, we adopt the simplest linear quantization method to quantify the difference of vertex features and gradients. We do not quantify the model parameters when communicating with the parameter server.

Specifically, for the calculated difference $\mathbf{m}$ of features or gradients for the vertex $v_{i}$ , it is represented in the form of a 32-bit floating point format in the GPU memory. In order to quantify it into the $B$ -bit unsigned integer format, we need to calculate the maximum element value $\max(\mathbf{m})$ and the minimum element value $\min(\mathbf{m})$ at first. Therefore, we can get the quantified value as

(22)

q_{i}=\left\lfloor\frac{2^{B}(m_{i}-\min(\mathbf{m}))}{\max(\mathbf{m})-\min(% \mathbf{m})}+0.5\right\rfloor.

When sending the message, the original message size is $T*L$ , and the quantified message size is $B*L+2T$ (including the maximum and minimum value). Where $L$ refers to the number of elements in $\mathbf{m}$ , and $T$ refers to the number of bits of the original data format.

During the recovery, for the quantization value $q_{i}$ , we can restore it to

(23)

\tilde{m}_{i}=\frac{\max(\mathbf{m})-\min(\mathbf{m})}{2^{B}}q_{i}+\min(% \mathbf{m}).

By the definition, we have $\left\lfloor\frac{2^{B}(m_{i}-\min(\mathbf{m}))}{\max(\mathbf{m})-\min(\mathbf% {m})}-0.5\right\rfloor<q_{i}\leq\left\lfloor\frac{2^{B}(m_{i}-\min(\mathbf{m})% )}{\max(\mathbf{m})-\min(\mathbf{m})}+0.5\right\rfloor$ . Therefore, the upper bound of the quantization error is $\frac{\max(\mathbf{m})-\min(\mathbf{m})}{2^{B+1}}$ .

6. Hierarchical Graph Partition Algorithm

Considering that in the heterogeneous multi-node multi-GPU environment, the communication overhead within a single node and across physical nodes is different. We demonstrate the communication architecture in figure 3. The GPU is viewed as the basic computing device.

We propose our vertex-cut graph partition algorithm based on the EBV (zhang2021efficient, ) algorithm. To adapt to the hierarchical communication architecture, we rewrite its evaluation function

(24)	$\displaystyle Eva_{(u,v)}(i)=$	$\displaystyle(1-\gamma)(\mathbb{I}(i\notin d\_rep_{u})+\mathbb{I}(i\notin d\_% rep_{v}))$
	$\displaystyle+$	$\displaystyle\gamma(\mathbb{I}(host_{i}\notin h\_rep_{u})+\mathbb{I}(host_{i}% \notin h\_rep_{v}))$
	$\displaystyle+$	$\displaystyle\alpha\frac{e_{count}[i]}{\|E\|/p}+\beta\frac{v_{count}[i]}{\|V\|/p}.$

$d\_rep_{u}$ and $h\_rep_{u}$ represent the GPU IDs and host (CPU) IDs that vertex $u$ has been assigned. As long as the vertex $u$ has been assigned to any GPU corresponding to the host, the host ID will be added to $h\_rep_{u}$ . We use $host_{i}$ to represent the host ID to which the $i$ -th GPU belongs. Besides, $e_{count}[i]$ and $v_{count}[i]$ mean the number of edges and vertices that have been assigned to subgraph $i$ .

When partitioning the graph, we assign it edge by edge. For each edge, we select the GPU ID that minimizes the evaluation function as the subgraph ID this edge assigned.

From equation (24), we can found that the term $\mathbb{I}(host_{i}\notin host\_rep_{u})+\mathbb{I}(host_{i}\notin host\_rep_{% v})$ we design can reduce the number of cut vertices between hosts. Usually, we set $\gamma\ll 1$ . Therefore, this term is mainly worked to select a more reasonable host when the other terms are close. In our experiment, we set $\gamma$ to $0.1$ by default.

For the other terms, $\mathbb{I}(i\notin d\_rep_{u})+\mathbb{I}(i\notin d\_rep_{v})$ is related to the replication factor among GPUs, while $\alpha\frac{e_{count}[i]}{|E|/p}$ and $\beta\frac{v_{count}[i]}{|V|/p}$ restrict the edge and vertex imbalance factor respectively. The replication factor is defined as $\frac{\sum_{i=1}^{p}|V_{i}|}{|V|}$ , that represents the average number of replicas for a vertex. The edge imbalance factor is defined as $\frac{\max_{i=1,...,p}|E_{i}|}{|E|/p}$ , while the vertex imbalance factor is defined as $\frac{\max_{i=1,...,p}|V_{i}|}{\sum_{i=1}^{p}|V_{i}|/p}$ . Both of them are used to measure the balance of partition results.

7. Experiments and Analysis

Table 1. Statistics of GNN dataset graphs

Dataset	$\|V\|$	$\|E\|$	Input Dim	Output Dim
Reddit	$232,965$	$11,606,919$	$602$	$41$
ogbn-products	$2,449,029$	$61,859,140$	$100$	$47$
ogbn-papers100M	$111,059,956$	$1,615,685,872$	$200$	$172$
Friendster	$65,608,366$	$1,806,067,135$	$64$	$32$

In this section, we test CDFGNN in a heterogeneous environment with multiple physical nodes and multiple GPUs per node. We compare CDFGNN with the state-of-the-art distributed full-batch graph neural network training frameworks on several datasets. In addition, we select some representative graph partition algorithms to analyze the influence of different graph partition algorithms on the distributed full-batch GNN training. Finally, we conduct the ablation study to demonstrate the effectiveness of each component.

7.1. Experiment Setup and Datasets

In the experiment, we compare CDFGNN with the state-of-the-art distributed full-batch graph neural network training frameworks SANCUS (peng2022sancus, ) and CAGNET (tripathy2020reducing, ). We select four datasets: Reddit (hamilton2017inductive, ), ogbn-products (chiang2019cluster, ), ogbn-papers100M (wang2020microsoft, ) and Friendster (Friend, ) for comparing their performance. The statistics of these graphs are listed in table 1. The Friendster does not provide input features and output categories. We randomly generate these data to test the training efficiency of different frameworks on the large graph.

Our experiment platform is a 2-node cluster, with each node has $8$ Nvidia A800 80G GPU. The communication within the physical nodes is based on the 16-channel PCIe 4.0, and the communication across the physical nodes is based on the InfiniBand. We use the NCCL for communication, and list the communication performance in Table 2.

Table 2. Communication Performance between GPUs

Environment	Pattern	Bandwidth
PCIe	Peer2Peer	22.70 GB/s
InfiniBand	Peer2Peer	8.27 GB/s
PCIe	Broadcast	19.47 GB/s
InfiniBand	Broadcast	11.98 GB/s

We adopt the simple $2$ -layer graph convolutional network as our test model. The dimensions of the input and output features are determined by the datasets, while the dimension of the hidden layer is set to $64$ by default. We adopt the cross-entropy function as the loss function, and the Adam optimizer (kingma2014adam, ) to update the model parameters. The initial learning rate is set to $0.01$ by default.

7.2. Distributed Training Efficiency Comparison

First, we compare CDFGNN with the current state-of-the-art distributed full-batch GNN training frameworks SANCUS and CAGNET. During the training process, we adopted the same GNN model. Meanwhile, we implement CDFGNN with $2$ famous vertex-cut GP algorithms: HEP (mayer2021hybrid, ) and DNE (hanai2019distributed, ). Thus, we can analysis the influence of different graph partition algorithms on the training efficiency. We also set the $\gamma$ to $0.1$ and $0.0$ respectively for testing the performance of our hierarchical GP algorithm, and represent them as EBV_γ=0.1 and EBV_γ=0.0. The EBV_γ=0.0 is equivalent to the original EBV algorithm.

Table 3. The Statistics of differnet graph partition algorithms

Dataset	GP algorithm	Nodes	GPU per Node	Inner	Outer	RF	Edge IF
reddit	EBV_γ=0.0	2	2	104217	138583	2.9027	1.0054
reddit	EBV_γ=0.1	2	2	105412	117879	3.0860	1.0022
reddit	HEP	2	2	36662	52886	1.6084	1.2693
reddit	DNE	2	2	65578	118788	2.1025	1.2558
ogbn-products	EBV_γ=0.0	2	4	695905	639459	3.1788	1.0002
ogbn-products	EBV_γ=0.1	2	4	952727	481147	3.3379	1.0008
ogbn-products	HEP	2	4	143711	127261	1.3304	1.2323
ogbn-products	DNE	2	4	367460	406408	1.9363	1.1527
friend	EBV_γ=0.0	2	8	12395102	9988776	3.7237	1.0002
friend	EBV_γ=0.1	2	8	19586785	5465465	4.0322	1.0011
friend	HEP	2	8	5794009	4675810	1.7048	1.776
friend	DNE	2	8	8737134	11670478	2.3546	1.7455
papers100M	EBV_γ=0.0	2	8	20438528	16362561	3.6503	1.0000
papers100M	EBV_γ=0.1	2	8	32241760	9924817	4.0347	1.0001
papers100M	HEP	2	8	5826661	3085959	1.3144	2.0204
papers100M	DNE	2	8	10021186	13819120	2.1475	1.3866

Figure 5 presents the average training time per epoch for different GNN training frameworks on four datasets. The GPUs we use are evenly distributed on two physical nodes. We use EBV_γ=0.1, EBV_γ=0.0, HEP and DNE to represent the training efficiency when combined with CDFGNN.

From figure 5, we can find that EBV_γ=0.1 achieves the best performance in almost all cases and reduces the training time by $30.39\%$ compared to Sancus on average. Sancus performs better than CAGNET and even outperforms EBV_γ=0.1 in the smallest case (2 GPUS, Reddit). However, the performance of Sancus is limited for larger cases. Comparing EBV_γ=0.1 and EBV_γ=0.0, setting $\gamma$ to $0.1$ can achieve better training efficiency on our cluster. It is worth noting that when there are only $2$ GPUs, the partition results of EBV_γ=0.1 and EBV_γ=0.0 are equivalent. The HEP algorithm also performs well in the smallest dataset (Reddit). But EBV_γ=0.1 leads HEP by a larger margin on other datasets. Therefore, we believe that the CDFGNN framework combined with the EBV_γ=0.1 can achieve the best training efficiency on graph neural network datasets of different sizes.

7.3. Ablation Study

Next, we study the reasons for different performances when different graph partition algorithms are combined with CDFGNN. We compare graph partition results generated by different GP algorithms in Table 3. The “Inner” and “Outer” columns mean the maximum number of inner and outer connections on a single subgraph. The number of inner connections refers to the number of messages within the physical node that need to be sent from this device, and outer connections refer to the messages across the physical nodes. We also present the replication factor (RF) and edge imbalance factor (Edge IF) defined in Section 6 for analyzing. Since all GP algorithms compared here are vertex-cut algorithms, we do not give the vertex imbalance factor.

Table 3 shows the characteristics of all GP algorithms on $4$ datasets. Setting $\gamma$ to $0.1$ can greatly reduce the number of outer connections ( $31.08\%$ on average) at the expense of more inner connections. Considering the inter and outer communication bandwidth comparison in Table 2, the overall communication overhead can be greatly reduced, thereby improving training efficiency. The HEP algorithm achieves the smallest inner and outer connections. However, the graph partition results are significantly imbalanced. Thus, it leads to imbalanced computing and communication overhead and reduces the overall training efficiency.

We decompose the computation and communication time of different GP algorithms and communication optimization methods based on CDFGNN for further analysis. We list the computation and communication time per epoch of each GPUs in Figure 6. We also provide the corresponding average training time with the dashed lines. When comparing these GP algorithms EBV_γ=0.1, EBV_γ=0.0, HEP and DNE, all communication optimization methods are used by default. When comparing the communication optimization methods, the GP algorithm used is EBV_γ=0.1. The “Cache” means only the adaptive cache mechanism is used, while “Quantify” means only the communication quantization is used. “Baseline” means that no communication optimization methods are used.

As shown in Figure 6, comparing with EBV_γ=0.1, the computation time of EBV_γ=0.0 is roughly the same. However, the communication time of EBV_γ=0.0 is longer. HEP and DNE have significant workload imbalances, thus restricting their training performance.

Meanwhile, both the adaptive cache mechanism and communication quantization can greatly reduce the communication overhead without affecting the computation overhead. We include the extra calculation time (quantization and dequantization for communication quantization, caching comparison for adaptive cache mechanism) into the communication time for a fair comparison. Therefore, the communication time is not directly proportional to the number of communication messages. On ogbn-products, the adaptive cache mechanism achieves better communication optimization, while on Reddit the communication quantification is more efficient. When combining both methods (EBV_γ=0.1), we achieve the best performance.

We also analysis the message sending percentage of each layer with the adaptive cache mechanism in Figure 7. To better understand the cache mechanism during different training epochs, we further provide the cache threshold $\epsilon$ . Figure 7 shows the sending percentage and cache threshold on ogbn-products and Reddit with $4$ and $8$ GPUs respectively. It can be found that in the middle stage of training, only few messages are sent, thus greatly reducing communication overhead. This phenomenon is consistent with our hypothesis. Furthermore, at about $50-100$ training epochs on ogbn-products, almost no vertex features are sent during the forward propagation. Meanwhile, the cache threshold is dynamically adjusted to a larger value in the middle of training and smaller at other times.

Finally, we verify the convergence of evaluate accuracy of CDFGNN in Figure 8. In addition to the distributed training approaches of CDFGNN, we also implement the full-batch and mini-batch training methods on the single GPU for comparison.

The results in figure 8 show that using the adaptive cache mechanism and communication quantification method has almost no impact on the convergence of accuracy. Due to the small random errors when distributed training, the accuracy in some epochs is even higher than that of single GPU full-batch training. Besides, the mini-batch training method significantly reduce the accuracy, especially on Reddit. That is because we limit the maximum number of neighbors when sampling, and the average degree of Reddit is very large.

8. Related Work

The research on distributed graph neural network training is still in the early stages (abadal2021computing, ), and only a few these works are based on GPU. Compared with traditional distributed large-scale graph computing frameworks (chen2019powerlyra, ; fan2017grape, ; malewicz2010pregel, ), the communication overhead of distributed GNN training tasks is more serious. This is because the distributed training of each GCN layer or GAT layer requires sending/receiving features and gradients of neighbor vertices, where the dimension of vertex features and gradients is usually very large.

Many existing distributed graph neural network training frameworks adopt the centralized architecture. For example, NeuGraph (ma2019neugraph, ) proposed a GNN training framework in a single-node multi-GPU environment. They use METIS (karypis1998fast, ) as the graph partitioning algorithm, and introduce graph computation optimizations into the management of data partitioning, scheduling, and parallelism. However, their work is not open source. RoC (jia2020improving, ) dynamically partitions the graph through an online regression model and proposes a inter-process memory management method, but it also leads to a complex execution workflow. PaGraph (lin2020pagraph, ) implements static caching of vertices with high degree in GPU memory, and use a special graph partitioning algorithm to balance workload and reduce cross-device data access. G³ (liu2020g3, ) utilizes parallel graph optimization to improve graph operations in GPU systems, Grain (zhang2021grain, ) selects GNN data by focusing on maximizing social influence, and RDD (zhang2020reliable, ) uses unlabeled data. AliGraph (zhu2019aligraph, ) also uses static caching technology, but only supports CPU clusters. AGL (zhang2020agl, ) uses MapReduce operations to simultaneously optimize the training and inference phases. In order to reduce and balance communication, DistDGL (zheng2020distdgl, ) uses a load-balanced graph partitioning algorithm. Most of these systems suffer from heavy communication overhead and therefore cannot scale to large-scale applications. Besides, we should notice that except for NeuGraph and Roc, which support full-batch graph neural network training, other frameworks are mini-batch training methods that require sampling.

9. Conclusion and Future Work

In this paper, we propose a cache-based distributed full-batch graph neural network training framework CDFGNN. To address the problem of excessive communication in existing full-batch training frameworks, we design three optimizations: adaptive cache mechanism, communication quantization, and hierarchical graph partition. With these improvements, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks. Besides, we theoretically and experimentally prove that the convergence accuracy of CDFGNN is not degraded. Therefore, we believe that CDFGNN can greatly improve the distributed training efficiency for large-scale graphs.

In the future, we want to make full use of the high-speed communication equipment such as NVLink to further reduce the communication overhead.

References

(1) T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
(2) P. Velickovic, G. Cucurull, A. Casanova, A. Romero, and P. Liò, “Yoshua 391 bengio. graph attention networks,” in International Conference on Learning Representations, vol. 392, 2018, p. 393.
(3) W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
(4) J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph convolutional networks via importance sampling,” arXiv preprint arXiv:1801.10247, 2018.
(5) W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling towards fast graph representation learning,” Advances in neural information processing systems, vol. 31, 2018.
(6) H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. Prasanna, “Graphsaint: Graph sampling based inductive learning method,” arXiv preprint arXiv:1907.04931, 2019.
(7) J. Dong, D. Zheng, L. F. Yang, and G. Karypis, “Global neighbor sampling for mixed cpu-gpu training on giant graphs,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 289–299.
(8) Z. Cai, X. Yan, Y. Wu, K. Ma, J. Cheng, and F. Yu, “Dgcl: an efficient communication library for distributed gnn training,” in Proceedings of the Sixteenth European Conference on Computer Systems, 2021, pp. 130–144.
(9) Z. Jia, S. Lin, M. Gao, M. Zaharia, and A. Aiken, “Improving the accuracy, scalability, and performance of graph neural networks with roc,” Proceedings of Machine Learning and Systems, vol. 2, pp. 187–198, 2020.
(10) A. Tripathy, K. Yelick, and A. Buluç, “Reducing communication in graph neural network training,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14.
(11) J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” arXiv preprint arXiv:1710.10568, 2017.
(12) J. Thorpe, Y. Qiao, J. Eyolfson, S. Teng, G. Hu, Z. Jia, J. Wei, K. Vora, R. Netravali, M. Kim et al., “Dorylus: Affordable, scalable, and accurate $\{$ GNN $\}$ training with distributed $\{$ CPU $\}$ servers and serverless threads,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 495–514.
(13) Z. Wang, Y. Guan, G. Sun, D. Niu, Y. Wang, H. Zheng, and Y. Han, “Gnn-pim: A processing-in-memory architecture for graph neural networks,” in Advanced Computer Architecture: 13th Conference, ACA 2020, Kunming, China, August 13–15, 2020, Proceedings 13. Springer, 2020, pp. 73–86.
(14) P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
(15) S. Gandhi and A. P. Iyer, “P3: Distributed deep graph learning at scale,” in 15th $\{$ USENIX $\}$ Symposium on Operating Systems Design and Implementation ( $\{$ OSDI $\}$ 21), 2021, pp. 551–568.
(16) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
(17) R. Chen, J. Shi, Y. Chen, B. Zang, H. Guan, and H. Chen, “Powerlyra: Differentiated graph computation and partitioning on skewed graphs,” ACM Transactions on Parallel Computing (TOPC), vol. 5, no. 3, pp. 1–39, 2019.
(18) L. G. Valiant, “A bridging model for parallel computation,” Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
(19) R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of modern physics, vol. 74, no. 1, p. 47, 2002.
(20) M. Daisuke, H. L. Edward, and B. Murmann, “Convolutional neural networks using logarithmic data representation,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2016.
(21) Y. Li, X. Dong, and W. Wang, “Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks,” arXiv preprint arXiv:1909.13144, 2019.
(22) R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4852–4861.
(23) J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-s. Hua, “Quantization networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
(24) S. Zhang, Z. Jiang, X. Hou, Z. Guan, M. Yuan, and H. You, “An efficient and balanced graph partition algorithm for the subgraph-centric programming model on large-scale power-law graphs,” in 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE, 2021, pp. 68–78.
(25) J. Peng, Z. Chen, Y. Shao, Y. Shen, L. Chen, and J. Cao, “Sancus: sta le n ess-aware c omm u nication-avoiding full-graph decentralized training in large-scale graph neural networks,” Proceedings of the VLDB Endowment, vol. 15, no. 9, pp. 1937–1950, 2022.
(26) W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh, “Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 257–266.
(27) K. Wang, Z. Shen, C. Huang, C.-H. Wu, Y. Dong, and A. Kanakia, “Microsoft academic graph: When experts are not enough,” Quantitative Science Studies, vol. 1, no. 1, pp. 396–413, 2020.
(28) “Friendster,” https://snap.stanford.edu/data/com-Friendster.html.
(29) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
(30) R. Mayer and H.-A. Jacobsen, “Hybrid edge partitioner: Partitioning large power-law graphs under memory constraints,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 1289–1302.
(31) M. Hanai, T. Suzumura, W. J. Tan, E. Liu, G. Theodoropoulos, and W. Cai, “Distributed edge partitioning for trillion-edge graphs,” arXiv preprint arXiv:1908.05855, 2019.
(32) S. Abadal, A. Jain, R. Guirado, J. López-Alonso, and E. Alarcón, “Computing graph neural networks: A survey from algorithms to accelerators,” ACM Computing Surveys (CSUR), vol. 54, no. 9, pp. 1–38, 2021.
(33) W. Fan, J. Xu, Y. Wu, W. Yu, and J. Jiang, “Grape: Parallelizing sequential graph computations,” Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1889–1892, 2017.
(34) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 135–146.
(35) L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai, “ $\{$ NeuGraph $\}$ : Parallel deep neural network computation on large graphs,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 443–458.
(36) G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
(37) Z. Lin, C. Li, Y. Miao, Y. Liu, and Y. Xu, “Pagraph: Scaling gnn training on large graphs via computation-aware caching,” in Proceedings of the 11th ACM Symposium on Cloud Computing, 2020, pp. 401–415.
(38) H. Liu, S. Lu, X. Chen, and B. He, “G3: when graph neural networks meet parallel graph processing systems on gpus,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 2813–2816, 2020.
(39) W. Zhang, Z. Yang, Y. Wang, Y. Shen, Y. Li, L. Wang, and B. Cui, “Grain: Improving data efficiency of graph neural networks via diversified influence maximization,” arXiv preprint arXiv:2108.00219, 2021.
(40) W. Zhang, X. Miao, Y. Shao, J. Jiang, L. Chen, O. Ruas, and B. Cui, “Reliable data distillation on graph convolutional network,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 1399–1414.
(41) R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou, “Aligraph: A comprehensive graph neural network platform,” arXiv preprint arXiv:1902.08730, 2019.
(42) D. Zhang, X. Huang, Z. Liu, Z. Hu, X. Song, Z. Ge, Z. Zhang, L. Wang, J. Zhou, Y. Shuang et al., “Agl: a scalable system for industrial-purpose graph machine learning,” arXiv preprint arXiv:2003.02454, 2020.
(43) D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, and G. Karypis, “Distdgl: distributed graph neural network training for billion-scale graphs,” in 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 2020, pp. 36–44.

(8)	$\displaystyle\\|A+B\\|_{\infty}$	$\displaystyle=\max_{i,j}\|A_{i,j}+B_{i,j}\|$
		$\displaystyle\leq\max_{i,j}\|A_{i,j}\|+\max_{i,j}\|B_{i,j}\|$
		$\displaystyle=\\|A\\|_{\infty}+\\|B\\|_{\infty},$

(9)	$\displaystyle\\|A\cdot B\\|_{\infty}$	$\displaystyle=\max_{i,j}\|A_{i,j}\times B_{i,j}\|$
		$\displaystyle\leq\max_{i,j}\|A_{i,j}\|\times\max_{i,j}\|B_{i,j}\|$
		$\displaystyle=\\|A\\|_{\infty}\\|B\\|_{\infty},$

(10)	$\displaystyle\\|AB\\|_{\infty}$	$\displaystyle=\max_{i,j}\|\sum\limits_{k=1}^{col(A)}A_{i,k}\times B_{k,j}\|$
		$\displaystyle\leq col(A)\max\limits_{i,j,k}\|A_{i,k}\times B_{k,j}\|$
		$\displaystyle\leq col(A)\max\limits_{i,j}\|A_{i,j}\|\max\limits_{i,j}\|B_{i,j}\|$
		$\displaystyle=col(A)\\|A\\|_{\infty}\\|B\\|_{\infty}.$