CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction

Shuai Zhang (Meituan, Beijing, China), Zite Jiang (SKL Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China), and Haihang You (SKL Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
Abstract.

Graph neural network training is mainly categorized into mini-batch and full-batch training methods. The mini-batch training method samples subgraphs from the original graph in each iteration. This sampling operation introduces extra computation overhead and reduces the training accuracy. Meanwhile, the full-batch training method calculates the features and corresponding gradients of all vertices in each iteration, and therefore has higher convergence accuracy. However, in the distributed cluster, frequent remote accesses of vertex features and gradients lead to huge communication overhead, thus restricting the overall training efficiency.

In this paper, we introduce CDFGNN, a cache-based distributed full-batch graph neural network training framework. We propose an adaptive cache mechanism that reduces remote vertex accesses by caching the historical features and gradients of neighbor vertices. We further reduce the communication overhead by quantizing the messages and designing a graph partition algorithm for the hierarchical communication architecture. Experiments show that the adaptive cache mechanism reduces remote vertex accesses by 63.14% on average. Combined with communication quantization and the hierarchical GP algorithm, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks by 30.39% in our experiments. Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks.

Graph Neural Network, Distributed Training, Machine Learning System

1. Introduction

With the rise of large-scale pre-training models, the demand for distributed training based on heterogeneous architecture is also increasing. As an important deep learning structure, graph neural network (GNN) (kipf2016semi, ) has been applied in natural language processing, computer vision, knowledge graphs, etc. Compared with traditional graph algorithms, the graph neural network often requires computation on heterogeneous devices. Besides, the graph neural network needs to send the features and gradients of vertices across devices in each iteration, which brings huge communication overhead. Therefore, designing an efficient heterogeneous distributed graph neural network training framework is a challenging and engaging research area.

The training of distributed graph neural network can be categorized into full-batch training (kipf2016semi, ; velickovic2018yoshua, ) and mini-batch training (hamilton2017inductive, ; chen2018fastgcn, ; huang2018adaptive, ; zeng2019graphsaint, ; dong2021global, ). The main difference between them is whether the entire graph data is involved in each iteration. For the full-batch training method, an iteration contains the model computation phase (including forward propagation and back propagation) and the parameter update phase. For mini-batch training, an additional sampling phase needs to be added. The sampling phase needs to be performed before the model computation phase and the parameter update phase. In the sampling phase, subgraphs are sampled from the entire graph for the current training iteration. Therefore, for the full-batch training, one training epoch is equivalent to one iteration. For the mini-batch training, one training epoch often consists of multiple iterations.
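The relation between epochs and iterations described above can be made concrete with a small sketch. The function name and the simple "split the training vertices into batches" model are illustrative assumptions, not part of CDFGNN.

```python
import math
from typing import Optional


def iterations_per_epoch(num_train_vertices: int, batch_size: Optional[int]) -> int:
    """Number of training iterations needed to visit every training vertex once.

    batch_size=None models full-batch training: the whole graph is used in a
    single iteration, so one epoch equals one iteration.
    """
    if batch_size is None:
        return 1
    return math.ceil(num_train_vertices / batch_size)
```

For example, under these assumptions 1000 training vertices with a batch size of 128 give 8 iterations per epoch, whereas full-batch training always gives 1.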

Many mini-batch (sample-based) distributed GNN training methods have been proposed recently. However, these mini-batch training methods lead to problems such as information loss (cai2021dgcl, ; jia2020improving, ; tripathy2020reducing, ), additional sampling overhead (jia2020improving, ), and the lack of a convergence guarantee (chen2017stochastic, ). Therefore, in this paper, we focus on another distributed training strategy: full-batch training.

Compared with traditional graph algorithms or deep learning algorithms, distributed full-batch graph neural network training brings new system-level problems. The GNN training process combines irregular neighbor vertex access with iterative computation. Therefore, graph neural network training is both memory-access intensive and compute intensive (thorpe2021dorylus, ; wang2020gnn, ). In the distributed environment, full-batch training methods are also communication intensive. During full-batch GNN training, both the model parameters and the neighbor vertex data (features and gradients) need to be transmitted across devices. Due to the huge communication volume of vertex features and gradients, efficient full-batch GNN training is extremely difficult.

In this paper, we focus on reducing the communication overhead during distributed full-batch graph neural network training. Considering that the changes of model parameters during GNN training are usually very slight, we cache historical features and gradients of vertices to reduce the cross-device neighbor vertex access. In addition, we adopt the quantization method to compress communication messages. We further design the hierarchical graph partition algorithm to reduce the number of communication messages across physical nodes (at the expense of the extra messages across different GPUs within the same physical node).

Specifically, our main contributions are as follows:

  • We propose the cache-based distributed full-batch graph neural network training method CDFGNN. By adaptively caching vertex-level historical features and gradients, we can greatly reduce the communication overhead without affecting the convergence accuracy and the number of iterations required for convergence.

  • We quantize the vertex features and gradients during communication in CDFGNN to further reduce communication overhead.

  • We design the graph partition algorithm to adapt to the communication characteristics of the hierarchical hardware architecture.

  • Experiments show that CDFGNN can greatly reduce the communication overhead during distributed full-batch graph neural network training and thus improve the overall training efficiency.

This paper is organized as follows: Section 2 discusses the challenges of distributed GNN training and explains our motivation. Section 3 introduces the computation and communication architecture of CDFGNN. Section 4 proposes the adaptive cache mechanism for vertex features and gradients and theoretically proves the convergence of this mechanism. Sections 5 and 6 describe the quantization method and the hierarchical graph partition algorithm. Section 7 presents and analyzes several experiments, which demonstrate the characteristics and capabilities of CDFGNN. Finally, we review the related work, conclude our approach, and preview future work in Section 8 and Section 9.

2. Background and Motivation

2.1. Background

Figure 1. Distributed full-batch GNN Training.

The distributed full-batch GNN training methods require the original graph to be partitioned into several subgraphs, and each computing device (CPU or GPU) only keeps its own subgraph. The corresponding vertex features are also split and assigned to each device. Thus, the computation of the entire graph can be completed in just one iteration.

During the training process, each computing device saves a copy of the current model parameters to enable local computation. Therefore, for the full-batch GNN training, the model parameter synchronization is also needed after each iteration.

For both GCN (kipf2016semi, ) and GAT (velivckovic2017graph, ) models, the features and gradients of all neighbor vertices are required to calculate the features and gradients of a given vertex during the forward and backward propagation in each layer. In distributed clusters, such large-scale cross-device data access brings serious communication overhead and becomes a bottleneck of the overall computation. Load balancing among the devices is also important, because an unbalanced partition causes not only computational imbalance but also communication imbalance.

Figure 1 shows the training process of a distributed graph neural network with 6 vertices. Vertices on the same device (GPU) are represented by the same color, and red edges identify edges across GPUs.

The right side is the computational graph of the two-layer graph neural network for vertex “B” and vertex “D”. To obtain the final vertex features, 7 and 6 cross-device communication messages are required for “B” and “D”, respectively. Each message contains high-dimensional vertex features. When performing backward propagation, the same number of vertex gradients is also required. Therefore, cross-device communication becomes an important bottleneck for efficient training. The overall communication overhead may even account for about 80% of the total training time (cai2021dgcl, ; gandhi2021p3, ; tripathy2020reducing, ).
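To illustrate how such message counts arise, the following sketch unrolls the computation tree of one vertex and counts every neighbor feature that has to cross a device boundary. The graph, the device assignment, and the function name are hypothetical; they are not the graph of Figure 1, so the counts differ from the 7 and 6 quoted above.

```python
from typing import Dict, Hashable, List


def count_remote_messages(
    adj: Dict[Hashable, List[Hashable]],
    owner: Dict[Hashable, int],
    root: Hashable,
    num_layers: int,
) -> int:
    """Count cross-device feature messages in the unrolled computation tree of `root`."""
    messages = 0
    frontier = [root]
    for _ in range(num_layers):
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if owner[u] != owner[v]:
                    messages += 1  # u's feature must travel to v's device
                next_frontier.append(u)
        frontier = next_frontier
    return messages


# Hypothetical path graph A - B - C, with A on device 0 and B, C on device 1.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
owner = {"A": 0, "B": 1, "C": 1}
remote_for_b = count_remote_messages(adj, owner, "B", num_layers=2)
```

Because the tree is unrolled, a vertex reached along several paths is counted once per path, which is why the number of messages grows quickly with the number of layers.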

For the distributed mini-batch GNN training, we need to sample graphs before model computation. Thus, an iteration of distributed mini-batch training consists of three stages: sampling, model computation, and model parameter synchronization.

These mini-batches can be sampled by the computing device itself, or sampled by a dedicated sampling device. Each computing device independently executes forward propagation and backward propagation on its corresponding subgraph. After the computation stage is completed, these computing devices synchronize and accumulate the gradients to update the model parameters.

Figure 2. The sampling process of mini-batch training: (a) the original graph; (b) the sampled graph.

Figure 2 shows a 2-hop sampling process on the original graph. For an $L$-layer graph neural network, (at least part of) the $L$-hop neighbor vertices need to be included in the sampled subgraph in order to calculate the vertex features. In Figure 2, to calculate vertex 1, we additionally add part of its 2-hop neighbor vertices to the subgraph. For graphs with high connectivity and small diameter (such as power-law graphs), even a few sampled vertices generate a large subgraph. This phenomenon results in significant extra computational overhead. Although we can restrict the maximum number of sampled neighbor vertices as in Figure 2, doing so directly reduces the model accuracy.
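The growth of the sampled subgraph, and the effect of capping the number of sampled neighbors, can be sketched as follows. This is a generic breadth-first neighbor sampler written for illustration; the function name, the `fanout` parameter, and the toy graph are assumptions and do not reproduce any specific sampler cited in this paper.

```python
import random
from typing import Dict, Iterable, List, Optional, Set


def sample_k_hop(
    adj: Dict[int, List[int]],
    seeds: Iterable[int],
    num_hops: int,
    fanout: Optional[int],
    rng: random.Random,
) -> Set[int]:
    """Return the vertex set of the sampled subgraph (seeds plus sampled neighbors)."""
    visited = set(seeds)
    frontier = list(visited)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            neighbors = adj[v]
            if fanout is not None and len(neighbors) > fanout:
                neighbors = rng.sample(neighbors, fanout)  # cap the fan-out
            for u in neighbors:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return visited


# Hypothetical graph: hub 0 with 5 neighbors, each of which has 2 extra leaves.
adj = {0: [1, 2, 3, 4, 5]}
for k in range(1, 6):
    leaves = [2 * k + 4, 2 * k + 5]
    adj[k] = [0] + leaves
    for leaf in leaves:
        adj[leaf] = [k]

full = sample_k_hop(adj, [0], num_hops=2, fanout=None, rng=random.Random(0))
capped = sample_k_hop(adj, [0], num_hops=2, fanout=2, rng=random.Random(0))
```

Without a cap, a single seed pulls in all 16 vertices of this toy graph after 2 hops. With a fan-out of 2 the subgraph stays small, but most neighbors no longer contribute to the seed's feature, which is the source of the accuracy loss discussed above.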

Compared with the full-batch distributed GNN training, the computation stage of mini-batch training is executed independently on sampled subgraphs, thus avoiding the remote vertex access. However, the sampling process also incurs additional computational overhead, including the sampling itself and extra vertex calculations. In addition, the mini-batch GNN training often reduces the model accuracy.

2.2. Motivation

The frequent and expensive remote neighbor vertex access restricts the scalability of distributed full-batch GNN training. To overcome this challenge, we can optimize it from the following perspectives:

  • Frequency: Cache neighbor vertex data instead of executing remote access in each iteration,

  • Expensive: Compress the message size,

  • Remote: Make full use of the hierarchical communication architecture.

For GNN training tasks, the model parameters tend to stabilize after several training epochs. Besides, the training process does not require high-precision vertex features and gradients before the model converges. Therefore, we cache and reuse historical vertex features and gradients during training to reduce communication overhead, especially in the middle stage of the training process.
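As a rough illustration of the caching idea, the sketch below keeps the last transmitted copy of each vertex row on the sender and resends a row only when it has drifted by more than a threshold. This is a generic threshold rule, not the adaptive mechanism of Section 4; the class name, the `eps` threshold, and the use of the Euclidean norm are all assumptions made for the example.

```python
import numpy as np


class SendCache:
    """Sender-side cache that suppresses messages for slowly changing vertices."""

    def __init__(self, num_vertices: int, dim: int, eps: float):
        self.last_sent = np.zeros((num_vertices, dim))
        self.valid = np.zeros(num_vertices, dtype=bool)
        self.eps = eps  # hypothetical staleness threshold

    def select(self, features: np.ndarray) -> np.ndarray:
        """Return indices of vertices whose rows must be (re)sent this iteration."""
        drift = np.linalg.norm(features - self.last_sent, axis=1)
        stale = ~self.valid | (drift > self.eps)
        self.last_sent[stale] = features[stale]  # receiver will now hold these rows
        self.valid[stale] = True
        return np.flatnonzero(stale)
```

In this sketch every row is sent in the first iteration; afterwards, rows are compared against the last transmitted value rather than the previous iteration, so small changes cannot accumulate unnoticed beyond `eps`.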

In order to compress the message size, we quantize the communication messages. These messages include the model parameter gradients and the remote neighbor vertex features and gradients. The volume of vertex features and gradients in GNN training is much larger than that of the model parameters. Meanwhile, small errors in the vertex features and gradients do not significantly reduce the final convergence performance, and can sometimes even prevent the training process from falling into a local optimum. Therefore, we compress the vertex features and gradients during communication by quantization.
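To give a sense of the trade-off between message size and precision, the sketch below applies uniform min-max quantization to 8 bits. It is a textbook scheme chosen for illustration, not necessarily the quantization method of Section 5; the function names and the 8-bit width are assumptions.

```python
import numpy as np

LEVELS = 255  # 8-bit codes: 0..255


def quantize_uint8(x: np.ndarray):
    """Uniform min-max quantization of a float32 message to 8-bit codes."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, LEVELS).astype(np.uint8)
    return codes, lo, scale


def dequantize_uint8(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Map 8-bit codes back to approximate float32 values."""
    return codes.astype(np.float32) * np.float32(scale) + np.float32(lo)
```

Under this scheme the payload shrinks by a factor of 4 relative to float32 (plus two scalars per message), and the reconstruction error of each element is bounded by about half a quantization step, `scale / 2`.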

Finally, we analyze the communication characteristics of heterogeneous clusters and find that using the PCIe to communicate between different GPUs in the same physical node is more efficient (higher bandwidth and lower latency) than network communication (InfiniBand) across physical nodes. Therefore, we propose a graph partition algorithm to reduce the number of messages across physical nodes at the cost of increasing communication within physical nodes.
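The quantity that such a partition algorithm trades off can be sketched by classifying each message by the link it uses. The function below is illustrative only: it assumes that GPUs are numbered consecutively within each physical node, and both the function name and the `(src_gpu, dst_gpu)` message format are hypothetical.

```python
from typing import Iterable, Tuple


def split_traffic(messages: Iterable[Tuple[int, int]], gpus_per_node: int) -> Tuple[int, int]:
    """Count (intra-node, inter-node) messages for (src_gpu, dst_gpu) pairs."""
    intra = inter = 0
    for src, dst in messages:
        if src == dst:
            continue  # same GPU: no communication
        if src // gpus_per_node == dst // gpus_per_node:
            intra += 1  # same physical node, e.g. over PCIe
        else:
            inter += 1  # different physical nodes, e.g. over InfiniBand
    return intra, inter
```

A hierarchy-aware partition aims to lower the second count, accepting a higher first count in exchange.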

3. CDFGNN Architecture

In this section, we take the graph convolutional network (GCN) as an example to describe the computation and communication stage of CDFGNN.

Figure 3. The workflow of CDFGNN.

Figure 3 shows the overall computing and communication workflow of CDFGNN. CDFGNN first runs the graph partitioning (GP) algorithm to partition the graph (and the corresponding input features) into as many subgraphs as there are computing devices (GPUs). Different from traditional full-batch graph neural network training frameworks, we adopt the vertex-cut GP algorithm. Vertex-cut GP is considered a better approach for handling the power-law graphs that are common in the real world (gonzalez2012powergraph, ; chen2019powerlyra, ). Figure 4 demonstrates the partition results of the vertex-cut GP algorithm. In this example, vertex “B” exists in all 3 subgraphs; we choose one of these replicas as the master vertex and treat the others as mirror vertices.
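The master and mirror bookkeeping of a vertex-cut partition can be sketched as follows. The edge placement rule (round-robin) and the master selection rule (lowest partition id) are placeholders chosen to keep the example short; they are not the partition algorithm proposed in this paper, and the toy graph is not the one in Figure 4.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def vertex_cut(edges: List[Edge], num_parts: int):
    """Assign each edge to exactly one partition and derive vertex replicas."""
    parts: List[List[Edge]] = [[] for _ in range(num_parts)]
    replicas: Dict[Hashable, Set[int]] = defaultdict(set)
    for k, (u, v) in enumerate(edges):
        p = k % num_parts  # placeholder rule: round-robin over edges
        parts[p].append((u, v))
        replicas[u].add(p)
        replicas[v].add(p)
    # Placeholder rule: the replica on the lowest-numbered partition is the master.
    master = {v: min(ps) for v, ps in replicas.items()}
    mirrors = {v: ps - {master[v]} for v, ps in replicas.items()}
    return parts, master, mirrors


# Hypothetical star graph centered on "B" (not the graph of Figure 4).
parts, master, mirrors = vertex_cut([("A", "B"), ("B", "C"), ("B", "D")], num_parts=3)
```

In this toy case every edge lands on a different partition, so "B" is replicated on all three, with one master and two mirrors, while each leaf vertex has a single replica and needs no synchronization.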

We describe a single iteration of distributed training in Algorithm 1. $L$ refers to the number of layers of the GCN network, and the model parameters of each layer are represented as $W^{(0)}, \cdots, W^{(L-1)}$. Next, we describe the computation and communication stages in detail.

Input: Graph $G(V,E)$, sparse matrix $\hat{A}_i$, input feature $H^{(0)}_i$, current model parameter $W$.
Output: Output feature $H^{(L)}_i$.
for all processes $P(i)$ in parallel do
    // Layer-by-layer forward propagation:
    for $l = 1, \cdots, L$ do
        $\ddot{Z}^{(l)}_i \leftarrow \hat{A}_i H^{(l-1)}_i W^{(l-1)}$
        Synchronize by communication to get $Z^{(l)}_i$
        $H^{(l)}_i \leftarrow \sigma\left(Z^{(l)}_i\right)$
    // Layer-by-layer backward propagation:
    Compute loss function $\mathcal{L}_i$ and $\ddot{\delta}_i^{(L)}$
    for $l = L, \cdots, 1$ do
        Synchronize by communication to get $\delta_i^{(l)}$
        $\ddot{\delta}^{(l-1)}_i \leftarrow \delta_i^{(l)} \hat{A}_i \left(W^{(l-1)}\right)^{\text{T}} \cdot \sigma^{\prime}\left(Z_i^{(l-1)}\right)$
        $\nabla_{W^{(l-1)}} \mathcal{L}_i \leftarrow \delta_i^{(l)} \hat{A}_i \left(H_i^{(l-1)}\right)^{\text{T}}$
        Parameter server aggregates $\nabla_{W^{(l-1)}} \mathcal{L}_i$, then updates and broadcasts the parameters:
        $W^{(l-1)} = W^{(l-1)} - \eta \sum\limits_{i=1}^{p} \nabla_{W^{(l-1)}} \mathcal{L}_i$
Algorithm 1 CDFGNN Workflow

3.1. Computation Stage of CDFGNN

In the computation stage, each GPU independently performs graph neural network computation tasks on its corresponding subgraph. We use the BSP model (valiant1990bridging, ) to achieve synchronization of vertex features through communication.

Let $A_i$ be the adjacency matrix of subgraph $i$ and $D_i$ be the corresponding submatrix of the original degree matrix. $\hat{A}_i = D_i^{-1/2} A_i D_i^{-1/2}$ is the normalized adjacency matrix of the subgraph on computing device $i$. We use the double-dot accent $\ddot{\ }$ to mark the intermediate matrix values ($\ddot{Z}^{(l)}_i$ and $\ddot{\delta}^{(l-1)}_i$) calculated only from the local subgraph, and the corresponding expressions without this accent ($Z^{(l)}_i$ and $\delta^{(l-1)}_i$) denote the values after communication synchronization.

During the forward propagation of GCN, we calculate the vertex feature $H^{(l)}_i$ of the $l$-th layer in the subgraph $i$ as

(1) $\ddot{Z}^{(l)}_i = \hat{A}_i H^{(l-1)}_i W^{(l-1)},$
(2) $H^{(l)}_i = \sigma\left(Z^{(l)}_i\right).$

We calculate $\ddot{Z}^{(l)}_i$ with the local vertex feature $H^{(l-1)}_i$, the local normalized adjacency matrix $\hat{A}_i$, and the global model parameter $W^{(l-1)}$. To restore the “real” $Z_i^{(l)}$ (the same as the value during sequential training), we need to synchronize and aggregate $\ddot{Z}^{(l)}_i$ from each device through communication. The communication stage is introduced in Section 3.2.

From $Z^{(l)}_i$, we can calculate the input $H^{(l)}_i$ of the next layer. $H^{(l)}_i \in \mathbb{R}^{|V_i| \times F_l}$, where $F_l$ refers to the vertex feature dimension of the $l$-th layer. By iteratively executing equations 1 and 2, we complete the forward propagation layer by layer.
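The claim that summing the local results restores the sequential value can be checked numerically. The sketch below simulates three devices in a single process: each device holds the normalized adjacency of its own edges (a vertex-cut split), computes equation 1 locally, and the synchronization step is modeled as a plain sum. Several simplifications are assumptions of this example rather than details from the paper: matrices are dense, every device holds the full feature matrix, $\sigma$ is taken to be ReLU, and the graph and sizes are made up.

```python
import numpy as np

n, p = 6, 3  # vertices and partitions (hypothetical sizes)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5)]

# Degrees come from the ORIGINAL graph, as in the definition of D_i.
deg = np.zeros(n)
for u, v in edges:
    deg[u] += 1
    deg[v] += 1


def normalized_adjacency(edge_list):
    """Dense D^{-1/2} A D^{-1/2} restricted to `edge_list`, using global degrees."""
    a_hat = np.zeros((n, n))
    for u, v in edge_list:
        a_hat[u, v] = a_hat[v, u] = 1.0 / np.sqrt(deg[u] * deg[v])
    return a_hat


rng = np.random.default_rng(0)
H = rng.standard_normal((n, 4))  # H^{(l-1)}, kept whole on every device for simplicity
W = rng.standard_normal((4, 3))  # W^{(l-1)}

a_hat_parts = [normalized_adjacency(edges[i::p]) for i in range(p)]  # split by edges
z_local = [a_i @ H @ W for a_i in a_hat_parts]  # equation 1 on each device
Z = np.sum(z_local, axis=0)                     # synchronization modeled as a sum
H_next = np.maximum(Z, 0.0)                     # equation 2 with sigma = ReLU

Z_reference = normalized_adjacency(edges) @ H @ W  # single-device result
```

The equality holds because every edge belongs to exactly one partition and the normalization uses the degrees of the original graph, so the local normalized adjacency matrices add up to the global one.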

During the backward propagation, we only calculate the loss value of the master vertices when calculating the loss function $\mathcal{L}$. Thus, we can avoid repeated calculations of gradients on multiple replicas.

We use $\mathcal{L}$ to represent the global loss function and $\mathcal{L}_i$ to represent its component on subgraph $i$, with $\sum\limits_{i=1}^{p} \mathcal{L}_i = \mathcal{L}$. When calculating the gradient, we define $\delta^{(l)}_i = \nabla_{Z^{(l)}_i} \mathcal{L}$ to represent the gradient of the global loss function $\mathcal{L}$ with respect to the global variable $Z^{(l)}_i$, and $\ddot{\delta}^{(l)}_i = \nabla_{\ddot{Z}^{(l)}_i} \mathcal{L}_i$ to represent the gradient of the local loss function $\mathcal{L}_i$ with respect to the local variable $\ddot{Z}^{(l)}_i$. For calculating $\ddot{\delta}^{(l-1)}_i$, we have

(3) $\begin{aligned}
\ddot{\delta}^{(l-1)}_i &= \frac{\partial \mathcal{L}_i}{\partial \ddot{Z}_i^{(l-1)}} = \frac{\partial \mathcal{L}_i}{\partial Z_i^{(l-1)}} \\
&= \frac{\partial \mathcal{L}_i}{\partial Z_i^{(l)}} \frac{\partial Z_i^{(l)}}{\partial \ddot{Z}_i^{(l)}} \frac{\partial \ddot{Z}_i^{(l)}}{\partial H^{(l-1)}_i} \frac{\partial H^{(l-1)}_i}{\partial Z_i^{(l-1)}} \\
&= \delta_i^{(l)} \hat{A}_i \left(W^{(l-1)}\right)^{\text{T}} \cdot \sigma^{\prime}\left(Z_i^{(l-1)}\right).
\end{aligned}$

Note that $Z_i^{(l)}$ is calculated with the sum aggregation of $\ddot{Z}_j^{(l)}, j \in [1, p]$, thus $\frac{\partial Z_i^{(l)}}{\partial \ddot{Z}_i^{(l)}} = 1$ holds for every subgraph $i$. Similar to $\ddot{Z}_i^{(l)}$, we can also get $\delta_i^{(l-1)}$ by aggregating $\ddot{\delta}_i^{(l-1)}$ from each device through communication.

With $\delta_{i}^{(l)}$, the gradient of the model parameter $W^{(l-1)}$ can be calculated as

(4) $$\nabla_{W^{(l-1)}}\mathcal{L}_{i}=\frac{\partial\mathcal{L}_{i}}{\partial W^{(l-1)}}=\frac{\partial\mathcal{L}_{i}}{\partial Z_{i}^{(l)}}\frac{\partial\ddot{Z}_{i}^{(l)}}{\partial W^{(l-1)}}=\delta_{i}^{(l)}\hat{A}_{i}\left(H_{i}^{(l-1)}\right)^{\text{T}}.$$

When performing parameter updates, we need to sum the gradients calculated on all subgraphs as

(5) $$W^{(l)}=W^{(l)}-\eta\sum\limits_{i=1}^{p}\nabla_{W^{(l)}}\mathcal{L}_{i}.$$

This process also needs to be implemented through communication. However, the data size of model parameters is usually much smaller than the data size of neighbor vertex features and gradients. Thus, the communication overhead of aggregating model parameters is not the performance bottleneck.
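As a minimal sketch of the update in Equation (5), the following Python snippet sums per-subgraph gradients and applies one step. The function name `aggregate_and_update`, the nested-list matrices, and the toy values are assumptions made for illustration; on a cluster the summation would be a collective operation (typically an all-reduce, although the text does not name one).

```python
from typing import List

Matrix = List[List[float]]


def aggregate_and_update(weight: Matrix, local_grads: List[Matrix], lr: float) -> Matrix:
    """Apply W <- W - lr * sum_i grad_i, as in Equation (5)."""
    rows, cols = len(weight), len(weight[0])
    # Sum the gradients computed on each subgraph, element by element.
    total = [[sum(g[r][c] for g in local_grads) for c in range(cols)]
             for r in range(rows)]
    # Take one gradient step with the summed gradient.
    return [[weight[r][c] - lr * total[r][c] for c in range(cols)]
            for r in range(rows)]


W = [[1.0, 2.0], [3.0, 4.0]]
grads = [
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical gradient from subgraph 1
    [[1.0, 2.0], [2.0, 1.0]],  # hypothetical gradient from subgraph 2
]
W_new = aggregate_and_update(W, grads, lr=0.5)
```

Only the weight-shaped gradient matrices cross the network in this step, which is why its cost is small compared with exchanging per-vertex features.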

In summary, during one iteration (forward + backward) of one GCN layer, there are two communication synchronizations for vertex values (features and gradients): one obtains the global intermediate value $Z^{(l)}$ in forward propagation, and the other obtains $\delta^{(l)}$ in backward propagation. Through these synchronizations, the calculated model parameter gradients are theoretically consistent with the single-device full-batch training method.

3.2. Communication Stage of CDFGNN

In the real world, most of the graphs processed by graph neural network algorithms are power-law graphs (albert2002statistical, ), such as social networks and citation graphs. We adopt the vertex-cut GP algorithm, which is more efficient for power-law graphs. Figure 4 demonstrates the communication pattern for vertex “B”. The gray vertex in subgraph $1$ marks the replica of “B” that serves as the master vertex, while the other replicas are mirror vertices. In the computation stage, these replicas compute their intermediate values $\ddot{Z}^{(l)}_{i}$ and $\ddot{\delta}^{(l)}_{i}$ independently. We need to aggregate these values through communication to obtain the same value as when executing on a single device.

Figure 4. The communication pattern of CDFGNN.

CDFGNN takes each vertex as the minimum communication unit. The communication stage can be divided into two phases: gather and scatter. In the gather phase, each mirror vertex sends its values to the corresponding master vertex (with the same vertex ID). When the master vertex receives these messages, it sums them with its own values. In the scatter phase, the master vertex sends its aggregated values back to all corresponding mirror vertices, and each mirror vertex replaces its original values with the received ones. In Figure 4, we list the values of vertex “B” at the different communication phases in all subgraphs.

This communication pattern requires the mirror vertex to store the location of its master vertex, and the master vertex to store the locations of all its mirrors. By executing the communication stage, we can ensure that the states of the vertex replicas are consistent with the sequential GNN training.
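The gather and scatter phases can be sketched for a single replicated vertex as follows. This is a single-process illustration only: the function `gather_scatter`, the dictionary layout, and the numeric values are assumptions, not part of CDFGNN's implementation.

```python
from typing import Dict, List


def gather_scatter(replica_values: Dict[int, List[float]], master: int) -> Dict[int, List[float]]:
    """Synchronize the partial values of one vertex across its replicas.

    replica_values maps a subgraph ID to the partial value held by that
    replica; master is the subgraph ID that holds the master vertex.
    """
    # Gather: the master sums the values sent by the mirrors with its own.
    total = list(replica_values[master])
    for part, vec in replica_values.items():
        if part != master:
            total = [t + v for t, v in zip(total, vec)]
    # Scatter: every replica replaces its value with the aggregate.
    return {part: list(total) for part in replica_values}


# Vertex "B" replicated on three subgraphs, master on subgraph 1 (made-up values).
partial = {1: [1.0, 2.0], 2: [0.5, 0.5], 3: [2.0, 1.0]}
synced = gather_scatter(partial, master=1)
```

After the call every replica holds the same vector, which matches the requirement that the replicas agree with the value a single device would compute.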

4. Adaptive Vertex Feature Cache

In this section, we introduce the adaptive cache mechanism of CDFGNN and prove its convergence.

4.1. Adaptive Cache Mechanism

To reduce the expensive vertex feature and gradient communication during CDFGNN training, we propose an adaptive vertex-level caching mechanism. Specifically, we cache the intermediate variables $Z_{i}^{(l)}$ and $\delta^{(l)}_{i}$ during the training process.

We adopt the same cache mechanism for $Z_{i}^{(l)}$ and $\delta^{(l)}_{i}$. For convenience, we take $Z^{(l)}_{i}$ as the example to introduce the caching mechanism in detail. First, we denote $\ddot{Z}^{(l)}_{i}=\{z_{i,1},\cdots,z_{i,|V_{i}|}\}$, where $z_{i,j}$ represents the feature vector corresponding to the $j$-th vertex of subgraph $i$ in $\ddot{Z}^{(l)}_{i}$. For each subgraph, we renumber the vertices with local IDs for contiguous memory access. The $j$-th vertex here refers to the vertex with local ID $j$ in subgraph $i$.

Let $\tilde{z}_{i,j}$ be the cached value of $z_{i,j}$, and $\tilde{z}_{\cdot,j}$ be the corresponding cached value in $Z^{(l)}_{i}$. Each computing device keeps the cached values $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$ for all vertices in its own subgraph.

Input: current value $z_{i,u}$, cached values $\tilde{z}_{i,u}$ and $\tilde{z}_{\cdot,u}$, threshold $\epsilon$.
Output: cached values $\tilde{z}_{i,u}$ and $\tilde{z}_{\cdot,u}$.
for all processes $P(i)$ in parallel do
    // Traverse mirror vertices:
    for $u\in getMirrorVertices()$ do
        if $\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}>\epsilon\|\tilde{z}_{i,u}\|_{\infty}$ then
            Send the difference value $\Delta_{z_{i,u}}=z_{i,u}-\tilde{z}_{i,u}$ to the corresponding master vertex
            $\tilde{z}_{i,u}\leftarrow z_{i,u}$
    Bulk synchronize: wait for the messages from all processes to be sent.
    // Traverse messages and master vertices:
    for $(u,\Delta_{z_{i,u}})\in messages$ do
        $\tilde{z}_{\cdot,u}\leftarrow\tilde{z}_{\cdot,u}+\Delta_{z_{i,u}}$
        Mark vertex $u$ as active.
    for $u\in getMaster()$ do
        if $\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}>\epsilon\|\tilde{z}_{i,u}\|_{\infty}$ then
            $\tilde{z}_{\cdot,u}\leftarrow\tilde{z}_{\cdot,u}+z_{i,u}-\tilde{z}_{i,u}$
            $\tilde{z}_{i,u}\leftarrow z_{i,u}$
            Mark vertex $u$ as active.
    for $u\in$ active vertices do
        Send the cached value $\tilde{z}_{\cdot,u}$ to the corresponding mirror vertices.
Algorithm 2 Adaptive Vertex Cache Mechanism
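The following Python sketch simulates one pass of Algorithm 2 in a single process, to make the bookkeeping of the two caches concrete. All names (`cache_step`, `is_stale`, the dictionary layout) and the numeric values are assumptions for illustration; real message passing and the bulk synchronization are replaced by an in-memory list.

```python
from typing import Dict, List, Set, Tuple

Vec = List[float]
Key = Tuple[int, str]  # (device ID, vertex ID)


def linf(v: Vec) -> float:
    """L-infinity norm: the largest absolute entry."""
    return max(abs(x) for x in v)


def is_stale(z: Vec, cached: Vec, eps: float) -> bool:
    """Relative-error test used by Algorithm 2."""
    diff = [a - b for a, b in zip(z, cached)]
    return linf(diff) > eps * linf(cached)


def cache_step(
    z: Dict[Key, Vec],          # current partial values z_{i,u}
    z_local: Dict[Key, Vec],    # cached partial values (z tilde)_{i,u}
    z_global: Dict[str, Vec],   # cached aggregates (z tilde)_{.,u}
    master_of: Dict[str, int],  # vertex -> device holding its master
    eps: float,
) -> Tuple[Set[str], int]:
    """Run one cache update; return (active vertices, gather messages sent)."""
    messages: List[Tuple[str, Vec]] = []
    active: Set[str] = set()
    # Mirror vertices send a difference only when their cached value is stale.
    for (dev, u), val in z.items():
        if dev != master_of[u] and is_stale(val, z_local[(dev, u)], eps):
            old = z_local[(dev, u)]
            messages.append((u, [a - b for a, b in zip(val, old)]))
            z_local[(dev, u)] = list(val)
    # After the (simulated) bulk synchronization, apply received differences.
    for u, delta in messages:
        z_global[u] = [g + d for g, d in zip(z_global[u], delta)]
        active.add(u)
    # Master vertices also check their own partial values.
    for (dev, u), val in z.items():
        if dev == master_of[u] and is_stale(val, z_local[(dev, u)], eps):
            old = z_local[(dev, u)]
            z_global[u] = [g + a - b for g, a, b in zip(z_global[u], val, old)]
            z_local[(dev, u)] = list(val)
            active.add(u)
    # Scatter is implicit: z_global[u] is what each mirror of an active
    # vertex would receive.
    return active, len(messages)


# Vertex "B": master on device 0, mirrors on devices 1 and 2 (made-up values).
master_of = {"B": 0}
z_local = {(0, "B"): [1.0], (1, "B"): [2.0], (2, "B"): [4.0]}
z_global = {"B": [7.0]}
z_now = {(0, "B"): [1.05], (1, "B"): [3.0], (2, "B"): [4.1]}
active, sent = cache_step(z_now, z_local, z_global, master_of, eps=0.1)
```

In this example only the mirror on device 1 exceeds the relative threshold, so a single difference message updates the aggregate while the other two replicas keep using their cached contributions.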

Algorithm 2 describes the update strategy of the cached values $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$. We perform this algorithm once in each forward propagation of a GNN layer. After the update process is completed, we generate the matrix $Z^{(l)}_{i}$ by directly combining the cached values $\tilde{z}_{\cdot,j}$.

In the cache mechanism, $\tilde{z}_{i,j}$ keeps the value that computing device $i$ contributed when building the cached value $\tilde{z}_{\cdot,j}$. When the difference between $\tilde{z}_{i,j}$ and the real value $z_{i,j}$ calculated in the current iteration is too large, we update $\tilde{z}_{i,j}$ and $\tilde{z}_{\cdot,j}$ to avoid a large error. We use $\frac{\|z_{i,u}-\tilde{z}_{i,u}\|_{\infty}}{\|\tilde{z}_{i,u}\|_{\infty}}$ to measure the error, where $\|\cdot\|_{\infty}$ is the $L_{\infty}$ norm, i.e., the maximum absolute value over all elements of its argument.

We expect $\tilde{z}_{\cdot,j}$ to be consistent across all relevant computing devices. Thus, when the $\tilde{z}_{i,j}$ of any computing device changes, we need to synchronize it to all other replicas.

To increase the proportion of cached values as much as possible without reducing the convergence accuracy or increasing the number of iterations needed for convergence, we design an adaptive caching mechanism that dynamically adjusts the threshold $\epsilon$. We update $\epsilon$ by

(6) $$\epsilon=\left\{\begin{aligned}
&\min(\lambda_{1}\epsilon,\epsilon+\xi),&&acc<mean_{acc}-\mu_{1},\ \epsilon<\nu_{1}\\
&\max(\lambda_{2}\epsilon,\epsilon-\xi),&&acc>mean_{acc}+\mu_{2},\ \epsilon>\nu_{2}\\
&\epsilon,&&otherwise
\end{aligned}\right.$$

The value of $\epsilon$ is updated after each iteration. Here, $acc$ is the model accuracy on the training set in the current epoch, and $mean_{acc}$ is the exponential moving average of $acc$:

(7) $$mean_{acc}=0.8\times mean_{acc}+0.2\times acc.$$

The remaining hyperparameters are set by default to $\mu_{1}=0.001$, $\mu_{2}=0.02$, $\nu_{1}=0.3$, $\nu_{2}=0.001$, $\xi=0.01$, $\lambda_{1}=1.05$ and $\lambda_{2}=0.9$ in our experiments.

Among these hyperparameters, we set $\mu_{1}$ to be much smaller than $\mu_{2}$. This is because, in the early stage of training, the accuracy on the training set increases rapidly. Only when there is a large enough accuracy increment (larger than $\mu_{2}$) can we consider that the current cache threshold should be relaxed. After the model parameters stabilize, the accuracy of the model on the training set changes only slightly. Therefore, even for small accuracy decreases, the threshold should be set smaller to reduce the cache error. In addition, we use $\xi$ to define the maximum step size when $\epsilon$ changes, which prevents the error threshold from changing too quickly. We also use $\nu_{1}$ and $\nu_{2}$ to limit the value range of $\epsilon$ to $[\nu_{2},\nu_{1}]$. These hyperparameter settings ensure that the training accuracy of the model is not greatly reduced.
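To illustrate how the step bound $\xi$ and the limits $\nu_{1}$, $\nu_{2}$ interact, here is a small Python sketch of the two step rules in Equation (6) and the moving average in Equation (7). It deliberately covers only the step rules; which accuracy condition selects which rule is left to the caller. The names `CacheThresholdConfig`, `relax`, `tighten` and `update_mean_acc` are hypothetical, and the defaults are taken from the hyperparameter list given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheThresholdConfig:
    """Default hyperparameters as listed in the text."""
    nu1: float = 0.3    # no further growth once eps reaches this value
    nu2: float = 0.001  # no further shrinking once eps reaches this value
    xi: float = 0.01    # maximum additive step
    lam1: float = 1.05  # multiplicative growth factor
    lam2: float = 0.9   # multiplicative shrink factor


def relax(eps: float, cfg: CacheThresholdConfig) -> float:
    """Growth rule min(lam1 * eps, eps + xi), applied only while eps < nu1."""
    if eps < cfg.nu1:
        return min(cfg.lam1 * eps, eps + cfg.xi)
    return eps


def tighten(eps: float, cfg: CacheThresholdConfig) -> float:
    """Shrink rule max(lam2 * eps, eps - xi), applied only while eps > nu2."""
    if eps > cfg.nu2:
        return max(cfg.lam2 * eps, eps - cfg.xi)
    return eps


def update_mean_acc(mean_acc: float, acc: float) -> float:
    """Exponential moving average of the training accuracy, Equation (7)."""
    return 0.8 * mean_acc + 0.2 * acc


cfg = CacheThresholdConfig()
small_step = relax(0.1, cfg)   # multiplicative bound is the tighter one
large_step = relax(0.25, cfg)  # additive bound xi is the tighter one
```

For small thresholds the multiplicative factor limits the change, while for larger thresholds the additive bound $\xi$ takes over, so the threshold never moves by more than $\xi$ in one update.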

4.2. Proof of Convergence

Next, we prove the convergence of the training process when employing the adaptive cache mechanism. Specifically, we will prove that after a finite number of iterations, the model parameters $W$ converge to the local optimal solution $W^{*}$. We use the tilde accent $\tilde{\ }$ to denote a value obtained in a layer after communication synchronization when the cache mechanism is used. Values without the accent denote the values obtained from the current model parameters and input features without the cache mechanism in any layer.

We first establish the basic inequalities required for the theoretical analysis.

Lemma 0.

Denote $\|A\|_{\infty}=\max_{i,j}|A_{i,j}|$, and let $col(A)$ be the number of columns of matrix $A$. We have $\|A+B\|_{\infty}\leq\|A\|_{\infty}+\|B\|_{\infty}$, $\|A\cdot B\|_{\infty}\leq\|A\|_{\infty}\|B\|_{\infty}$ and $\|AB\|_{\infty}\leq col(A)\|A\|_{\infty}\|B\|_{\infty}$, where $A\cdot B$ denotes the element-wise product.

Proof.

These three inequalities can be proved as follows:

(8) $$\begin{aligned}
\|A+B\|_{\infty}&=\max_{i,j}|A_{i,j}+B_{i,j}|\\
&\leq\max_{i,j}|A_{i,j}|+\max_{i,j}|B_{i,j}|\\
&=\|A\|_{\infty}+\|B\|_{\infty},
\end{aligned}$$
(9) $$\begin{aligned}
\|A\cdot B\|_{\infty}&=\max_{i,j}|A_{i,j}\times B_{i,j}|\\
&\leq\max_{i,j}|A_{i,j}|\times\max_{i,j}|B_{i,j}|\\
&=\|A\|_{\infty}\|B\|_{\infty},
\end{aligned}$$
(10) $$\begin{aligned}
\|AB\|_{\infty}&=\max_{i,j}|\sum\limits_{k=1}^{col(A)}A_{i,k}\times B_{k,j}|\\
&\leq col(A)\max\limits_{i,j,k}|A_{i,k}\times B_{k,j}|\\
&\leq col(A)\max\limits_{i,j}|A_{i,j}|\max\limits_{i,j}|B_{i,j}|\\
&=col(A)\|A\|_{\infty}\|B\|_{\infty}.
\end{aligned}$$
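As a quick sanity check of the three inequalities, the sketch below evaluates them on random matrices in plain Python. It is a numerical spot check under assumed shapes and value ranges, not a substitute for the proof; the helper names and the small tolerance `TOL` for floating-point rounding are illustrative choices.

```python
import random
from typing import List

Matrix = List[List[float]]


def linf(m: Matrix) -> float:
    """Entry-wise max norm: the largest absolute entry of the matrix."""
    return max(abs(x) for row in m for x in row)


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product, written A . B in the lemma."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    inner = len(a[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def rand_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-2.0, 2.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
TOL = 1e-12  # slack for floating-point rounding
all_hold = True
for _ in range(200):
    a = rand_matrix(rng, 3, 4)
    b = rand_matrix(rng, 3, 4)  # same shape as a, for sum and element-wise product
    c = rand_matrix(rng, 4, 2)  # compatible shape for the matrix product a @ c
    na, nb, nc = linf(a), linf(b), linf(c)
    all_hold &= linf(add(a, b)) <= na + nb + TOL                # inequality (8)
    all_hold &= linf(hadamard(a, b)) <= na * nb + TOL           # inequality (9)
    all_hold &= linf(matmul(a, c)) <= len(a[0]) * na * nc + TOL  # inequality (10)
```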

Next, we state that with bounded staleness on the embeddings, the approximations of the intermediate matrix results are close to the exact ones in the forward propagation.

Lemma 0.

For the forward propagation of CDFGNN with the cache mechanism, if (a) we have $\|\tilde{Z}^{(l-1)}-Z^{(l-1)}\|_{\infty}\leq\epsilon_{Z^{(l-1)}}$, where $\tilde{Z}^{(l-1)}$ and $Z^{(l-1)}$ represent the intermediate values with and without the cache mechanism, respectively, (b) the function $\sigma(\cdot)$ is $\rho$-Lipschitz continuous, and (c) the elements in $Z^{(l)}$, $\hat{A}$ and $W^{(l-1)}$ are bounded, with absolute values less than $B$ and numbers of columns less than $C$,
then we have $\|\tilde{H}^{(l-1)}-H^{(l-1)}\|_{\infty}\leq\rho\epsilon_{Z^{(l-1)}}$ and $\|\tilde{Z}^{(l)}-Z^{(l)}\|_{\infty}\leq p\nu_{1}B+C^{2}B^{2}\rho\epsilon_{Z^{(l-1)}}$.

Proof.

We denote $\hat{Z}^{(l)}$ as the intermediate value when the caching mechanism is used in the previous $l-1$ layers but not in the $l$-th layer. Since each element in $\hat{Z}^{(l)}$ is the sum of values from at most $p$ devices, the error introduced into $\tilde{Z}^{(l)}$ by using the cache mechanism in layer $l$ is bounded by

(11) $$\|\tilde{Z}^{(l)}-\hat{Z}^{(l)}\|_{\infty}\leq p\nu_{1}B.$$

Here, $\nu_{1}$ is the upper bound of $\epsilon$ defined in Equation (6).

Therefore, we have

$$
\begin{aligned}
\|\tilde{H}^{(l-1)}-H^{(l-1)}\|_{\infty} &= \|\sigma(\tilde{Z}^{(l-1)})-\sigma(Z^{(l-1)})\|_{\infty} \\
&\leq \rho\epsilon_{Z^{(l-1)}}
\end{aligned} \tag{12}
$$

$$
\begin{aligned}
\|\tilde{Z}^{(l)}-Z^{(l)}\|_{\infty} &= \|(\tilde{Z}^{(l)}-\hat{Z}^{(l)})+(\hat{Z}^{(l)}-Z^{(l)})\|_{\infty} \\
&\leq p\nu_{1}B+\|\hat{A}\sigma(\tilde{Z}^{(l-1)})W^{(l-1)}-\hat{A}\sigma(Z^{(l-1)})W^{(l-1)}\|_{\infty} \\
&\leq p\nu_{1}B+C^{2}B^{2}\rho\epsilon_{Z^{(l-1)}}
\end{aligned} \tag{13}
$$

The inequality in (12) follows from the definition of the Lipschitz condition. ∎
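To see how the per-layer bound compounds across layers, the recursion in (13) can be unrolled. The shorthand $a=C^{2}B^{2}\rho$ and $b=p\nu_{1}B$, and the starting assumption $\epsilon_{Z^{(0)}}=0$ (exact inputs to the first layer), are introduced here only for illustration and are not part of the original statement. Taking $\epsilon_{Z^{(l)}}=b+a\,\epsilon_{Z^{(l-1)}}$ as the bound at layer $l$ gives

$$
\epsilon_{Z^{(l)}} = b\sum_{k=0}^{l-1}a^{k} = p\nu_{1}B\,\frac{a^{l}-1}{a-1}\quad (a\neq 1).
$$

Under these assumptions the forward error bound is finite for any fixed number of layers and scales linearly with the cache threshold bound $\nu_{1}$.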

Next, we prove that the intermediate gradient $\tilde{\delta}^{(l)}=\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}$ computed with the cache mechanism is also close to the exact gradient $\delta^{(l)}=\nabla_{Z^{(l)}}\mathcal{L}$.

Lemma 3.

For the backward propagation of CDFGNN, suppose that (a) $\|\tilde{Z}^{(l-1)}-Z^{(l-1)}\|_{\infty}\leq\epsilon_{Z^{(l-1)}}$, where $\tilde{Z}^{(l-1)}$ and $Z^{(l-1)}$ denote the intermediate values with and without the cache mechanism, respectively, (b) the function $\sigma(\cdot)$ and the derivative of the loss function $\nabla\mathcal{L}$ are $\rho$-Lipschitz continuous, and (c) the elements of $\delta^{(l)}$, $\hat{A}$, $\sigma'(Z^{(l)})$ and $W^{(l-1)}$ are bounded, with absolute values less than $B$, and the number of columns is less than $C$.
Then $\|\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}-\nabla_{Z^{(l)}}\mathcal{L}\|_{\infty}$ and $\|\nabla_{W^{(l)}}\tilde{\mathcal{L}}-\nabla_{W^{(l)}}\mathcal{L}\|_{\infty}$ are also bounded.

Proof.

First, we prove that $\|\tilde{\delta}^{(l)}-\delta^{(l)}\|_{\infty}$ is bounded based on the previous lemma.

For the last layer $L$, we have

$$
\|\nabla_{\tilde{Z}^{(L)}}\tilde{\mathcal{L}}-\nabla_{Z^{(L)}}\mathcal{L}\|_{\infty}\leq\rho\epsilon_{Z^{(L)}}. \tag{14}
$$

Next, we use mathematical induction to complete the proof. Suppose that every layer $l'>l$ satisfies $\|\nabla_{\tilde{Z}^{(l')}}\tilde{\mathcal{L}}-\nabla_{Z^{(l')}}\mathcal{L}\|_{\infty}\leq K^{(l')}$. Then for the $l$-th layer, we have

$$
\begin{aligned}
&\|\nabla_{\tilde{Z}^{(l)}}\tilde{\mathcal{L}}-\nabla_{Z^{(l)}}\mathcal{L}\|_{\infty} \\
={}& \|\tilde{\delta}^{(l+1)}\hat{A}\left(W^{(l)}\right)^{\text{T}}\cdot\sigma'\left(\tilde{Z}^{(l)}\right)-\delta^{(l+1)}\hat{A}\left(W^{(l)}\right)^{\text{T}}\cdot\sigma'\left(Z^{(l)}\right)\|_{\infty} \\
\leq{}& C^{2}\Big\{\|\tilde{\delta}^{(l+1)}\|_{\infty}\|\hat{A}\|_{\infty}\|\left(W^{(l)}\right)^{\text{T}}\|_{\infty}\|\sigma'\left(\tilde{Z}^{(l)}\right)-\sigma'\left(Z^{(l)}\right)\|_{\infty} \\
&\quad+\|\tilde{\delta}^{(l+1)}-\delta^{(l+1)}\|_{\infty}\|\hat{A}\|_{\infty}\|\left(W^{(l)}\right)^{\text{T}}\|_{\infty}\|\sigma'\left(Z^{(l)}\right)\|_{\infty}\Big\} \\
\leq{}& C^{2}(B^{3}\rho\epsilon_{Z^{(l)}}+K^{(l+1)}B^{3})=C^{2}B^{3}(\rho\epsilon_{Z^{(l)}}+K^{(l+1)})
\end{aligned} \tag{15}
$$

Denoting $K^{(l)}=C^{2}B^{3}(\rho\epsilon_{Z^{(l)}}+K^{(l+1)})$, the assumption also holds for the $l$-th layer, which completes the induction.

The bound on $\|\nabla_{W^{(l)}}\tilde{\mathcal{L}}-\nabla_{W^{(l)}}\mathcal{L}\|_{\infty}$ follows from equation (4):

$$
\begin{aligned}
&\|\nabla_{W^{(l)}}\tilde{\mathcal{L}}-\nabla_{W^{(l)}}\mathcal{L}\|_{\infty} \\
={}& \|\tilde{\delta}_{i}^{(l+1)}\hat{A}_{i}\left(\tilde{H}_{i}^{(l)}\right)^{\text{T}}-\delta_{i}^{(l+1)}\hat{A}_{i}\left(H_{i}^{(l)}\right)^{\text{T}}\|_{\infty} \\
\leq{}& C^{2}\Big\{\|\tilde{\delta}_{i}^{(l+1)}\|_{\infty}\|\hat{A}\|_{\infty}\|\left(\tilde{H}_{i}^{(l)}\right)^{\text{T}}-\left(H_{i}^{(l)}\right)^{\text{T}}\|_{\infty} \\
&\quad+\|\tilde{\delta}_{i}^{(l+1)}-\delta_{i}^{(l+1)}\|_{\infty}\|\hat{A}\|_{\infty}\|\left(H_{i}^{(l)}\right)^{\text{T}}\|_{\infty}\Big\} \\
\leq{}& C^{2}(B^{2}\rho\epsilon_{Z^{(l)}}+K^{(l+1)}B^{2})=C^{2}B^{2}(\rho\epsilon_{Z^{(l)}}+K^{(l+1)})
\end{aligned} \tag{16}
$$
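The backward bounds can be evaluated layer by layer, starting from the base case (14) and applying the recursion for $K^{(l)}$ obtained from (15), followed by the weight-gradient bound of (16). The following is a minimal sketch rather than part of CDFGNN itself: the function names, the list layout of the per-layer forward errors, and the numeric constants in the usage note are hypothetical.

```python
def backward_error_bounds(eps_z, rho, B, C):
    """Return [K^(1), ..., K^(L)] for eps_z = [eps_{Z^(1)}, ..., eps_{Z^(L)}].

    Index l in the lists corresponds to layer l + 1 (hypothetical layout).
    """
    num_layers = len(eps_z)
    K = [0.0] * num_layers
    # Base case, Eq. (14): the last layer is bounded by rho * eps_{Z^(L)}.
    K[num_layers - 1] = rho * eps_z[num_layers - 1]
    # Recursion from Eq. (15), applied from layer L-1 down to layer 1.
    for l in range(num_layers - 2, -1, -1):
        K[l] = C**2 * B**3 * (rho * eps_z[l] + K[l + 1])
    return K


def weight_grad_bounds(eps_z, K, rho, B, C):
    """Bound of Eq. (16) for every layer that has a successor layer."""
    return [
        C**2 * B**2 * (rho * eps_z[l] + K[l + 1])
        for l in range(len(eps_z) - 1)
    ]
```

For example, `backward_error_bounds([0.01, 0.02], rho=1.0, B=1.0, C=2)` starts from $K^{(2)}=0.02$ and propagates it to the first layer. Because each step multiplies by $C^{2}B^{3}$, the bound can grow quickly with depth when that factor exceeds one.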

Finally, we prove that CDFGNN converges to a local optimal solution under the premise that the error is bounded. For the parameter matrix $W$, we use the subscript $(i)$, as in $W_{(i)}$, to denote the value obtained in the $i$-th iteration.

Theorem 4.

Consider the training of an $L$-layer graph neural network based on the CDFGNN cache mechanism, with local optimal parameters $W_{(*)}$ and initial parameters $W_{(1)}$. Assume that (a) the activation function $\sigma(\cdot)$ and the derivative of the loss function $\nabla\mathcal{L}$ are $\rho$-Lipschitz continuous, (b) the matrices $\hat{A}$, $H$ and $W$, and the corresponding gradients, are bounded, with maximum absolute element value $B$, and (c) the function $\mathcal{L}(W)$ is $\rho$-smooth. Then there exists a constant $K>0$ such that for all $N>L_{\epsilon}$, if the GNN is trained with the cache mechanism for $R$ iterations (where $R$ is sampled uniformly from $[1,\dots,N]$) with learning rate $\eta=\min\left(\frac{1}{\rho},\frac{1}{\sqrt{N}}\right)$, we have

$$
\mathrm{E}_{R}\|\nabla_{W_{(R)}}\mathcal{L}\|^{2}_{F}\leq 2\,\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W_{(*)})+\frac{\rho K}{2}}{\sqrt{N}}. \tag{17}
$$
Proof.

For convenience, we denote $\Delta_{(i)}=\nabla_{W_{(i)}}\tilde{\mathcal{L}}-\nabla_{W_{(i)}}\mathcal{L}$. Since the model parameter $W$ is updated by gradient descent under the cache mechanism, we have $W_{(i+1)}=W_{(i)}-\eta\nabla_{W_{(i)}}\tilde{\mathcal{L}}$. According to Lemma 3 and the $\rho$-smoothness of the function $\mathcal{L}(W)$, we have

$$
\begin{aligned}
\mathcal{L}(W_{(i+1)}) &= \mathcal{L}(W_{(i)}-\eta\nabla_{W_{(i)}}\tilde{\mathcal{L}}) \\
&\leq \mathcal{L}(W_{(i)})-\eta\langle\nabla_{W_{(i)}}\mathcal{L},\nabla_{W_{(i)}}\tilde{\mathcal{L}}\rangle+\frac{\rho}{2}\eta^{2}\|\nabla_{W_{(i)}}\tilde{\mathcal{L}}\|_{F}^{2} \\
&= \mathcal{L}(W_{(i)})-\eta\langle\nabla_{W_{(i)}}\mathcal{L},\Delta_{(i)}\rangle-\eta\|\nabla_{W_{(i)}}\mathcal{L}\|_{F}^{2} \\
&\quad+\frac{\rho}{2}\eta^{2}\left(\|\Delta_{(i)}\|^{2}_{F}+\|\nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F}+2\langle\Delta_{(i)},\nabla_{W_{(i)}}\mathcal{L}\rangle\right) \\
&\leq \mathcal{L}(W_{(i)})-\left(\eta-\frac{\rho}{2}\eta^{2}\right)\|\nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F}+\frac{\rho}{2}\eta^{2}\|\Delta_{(i)}\|^{2}_{F}.
\end{aligned} \tag{18}
$$

The scaling in the last step is based on the value of the learning rate $\eta$. According to Lemma 3, we have $\|\Delta_{(i)}\|^{2}_{F}\leq\|\nabla_{W_{(i)}}\tilde{\mathcal{L}}\|_{\infty}^{2}+\|\nabla\mathcal{L}(W_{(i)})\|_{\infty}^{2}\leq 2B^{2}\leq K$. Therefore, we have

$$
\mathcal{L}(W_{(i+1)})\leq\mathcal{L}(W_{(i)})-\left(\eta-\frac{\rho}{2}\eta^{2}\right)\|\nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F}+\frac{\rho}{2}\eta^{2}K. \tag{19}
$$

Summing inequality (19) over $i$ from $1$ to $N$, we get

$$
\left(\eta-\frac{\rho}{2}\eta^{2}\right)\sum_{i=1}^{N}\|\nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F}\leq\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho}{2}\eta^{2}KN. \tag{20}
$$

Considering $\eta=\min\left(\frac{1}{\rho},\frac{1}{\sqrt{N}}\right)$, we divide both sides of inequality (20) by $N\left(\eta-\frac{\rho}{2}\eta^{2}\right)$ and obtain

$$
\begin{aligned}
\mathrm{E}_{R}\|\nabla_{W_{(R)}}\mathcal{L}\|^{2}_{F} &= \frac{1}{N}\sum_{i=1}^{N}\|\nabla_{W_{(i)}}\mathcal{L}\|^{2}_{F} \\
&\leq 2\,\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho}{2}\eta^{2}KN}{N\eta(2-\rho\eta)} \\
&\leq 2\,\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})}{N\eta}+\rho\eta K \\
&\leq 2\,\frac{\mathcal{L}(W_{(1)})-\mathcal{L}(W^{*})+\frac{\rho K}{2}}{\sqrt{N}}.
\end{aligned} \tag{21}
$$

When $N\to\infty$, the expected squared gradient norm $\mathrm{E}_{R}\|\nabla_{W_{(R)}}\mathcal{L}\|^{2}_{F}\to 0$. Therefore, convergence of the parameters can be achieved within a finite number of iterations.
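The $1/\sqrt{N}$ decay of the bound can be illustrated numerically. The sketch below only evaluates the learning rate and the right-hand side of (17). The function names and the sample values for the initial loss gap, $\rho$ and $K$ are hypothetical and chosen purely for illustration.

```python
import math


def learning_rate(rho, n_iters):
    """Learning rate eta = min(1 / rho, 1 / sqrt(N)) used in the theorem."""
    return min(1.0 / rho, 1.0 / math.sqrt(n_iters))


def gradient_bound(loss_gap, rho, K, n_iters):
    """Right-hand side of Eq. (17).

    loss_gap stands for L(W_(1)) - L(W_(*)); its value is an assumption here.
    """
    return 2.0 * (loss_gap + rho * K / 2.0) / math.sqrt(n_iters)


# Hypothetical constants: loss gap 1.0, rho = 2.0, K = 1.0.
bounds = [gradient_bound(1.0, 2.0, 1.0, n) for n in (100, 10_000, 1_000_000)]
```

Each hundredfold increase in $N$ shrinks the bound by a factor of ten, which is the sublinear rate stated by the theorem.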

5. Communication Quantization

In this section, we propose the communication quantization mechanism of CDFGNN. There are many quantization methods, including linear quantization and logarithmic quantization (daisuke2016convolutional), exponential quantization (li2019additive), differentiable quantization (gong2019differentiable; yang2019quantization), etc. When we adopt the adaptive cache mechanism, the message sent is the difference value instead of the original value, so the message data usually follows a uniform distribution. For this reason, we adopt the simplest linear quantization method to quantize the differences of vertex features and gradients. We do not quantize the model parameters when communicating with the parameter server.

Specifically, the calculated difference $\mathbf{m}$ of the features or gradients of vertex $v_{i}$ is stored in 32-bit floating-point format in GPU memory. To quantize it into $B$-bit unsigned integers, we first compute the maximum element $\max(\mathbf{m})$ and the minimum element $\min(\mathbf{m})$. The quantized value of each element $m_{i}$ is then

(22) $$q_{i}=\left\lfloor\frac{2^{B}(m_{i}-\min(\mathbf{m}))}{\max(\mathbf{m})-\min(\mathbf{m})}+0.5\right\rfloor.$$

When sending the message, the original message size is $T*L$ bits and the quantized message size is $B*L+2T$ bits (the extra $2T$ bits carry the maximum and minimum values), where $L$ is the number of elements in $\mathbf{m}$ and $T$ is the number of bits per element in the original data format.
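As a quick illustration of the size formula, the following sketch compares the two message sizes. The concrete values ($L=64$, $T=32$, $B=8$) and the function names are assumptions chosen for illustration, not settings reported in the paper.

```python
def message_bits(num_elements, bits_per_element):
    """Size in bits of an unquantized message: T * L."""
    return num_elements * bits_per_element


def quantized_message_bits(num_elements, quant_bits, orig_bits):
    """Size in bits after quantization: B * L payload plus max and min (2T)."""
    return quant_bits * num_elements + 2 * orig_bits


# Illustrative values (assumptions): 64 elements, float32 source, 8-bit codes.
L, T, B = 64, 32, 8
original = message_bits(L, T)                 # 64 * 32 = 2048 bits
quantized = quantized_message_bits(L, B, T)   # 8 * 64 + 2 * 32 = 576 bits
ratio = quantized / original                  # 0.28125
```

For long vectors the $2T$ overhead becomes negligible and the ratio approaches $B/T$.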

During recovery, the quantized value $q_{i}$ is restored to

(23) $$\tilde{m}_{i}=\frac{\max(\mathbf{m})-\min(\mathbf{m})}{2^{B}}q_{i}+\min(\mathbf{m}).$$

By the definition of the floor function, $\frac{2^{B}(m_{i}-\min(\mathbf{m}))}{\max(\mathbf{m})-\min(\mathbf{m})}-0.5<q_{i}\leq\frac{2^{B}(m_{i}-\min(\mathbf{m}))}{\max(\mathbf{m})-\min(\mathbf{m})}+0.5$. Multiplying by $\frac{\max(\mathbf{m})-\min(\mathbf{m})}{2^{B}}$ shows that $|\tilde{m}_{i}-m_{i}|$ is at most $\frac{\max(\mathbf{m})-\min(\mathbf{m})}{2^{B+1}}$, which is the upper bound of the quantization error.
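The quantize and restore steps of equations (22) and (23) can be sketched in plain Python as below. This is a minimal illustration, not the CDFGNN GPU implementation. One deliberate deviation, which is our assumption: the sketch uses $2^{B}-1$ steps instead of $2^{B}$ so that the largest code fits in a $B$-bit unsigned integer, which changes the error bound to half of that slightly larger step. The function names and the sample vector are hypothetical.

```python
import math


def quantize(m, bits):
    """Linearly quantize a list of floats to integer codes in [0, 2**bits - 1].

    Assumption: 2**bits - 1 steps are used (not 2**bits as in equation (22))
    so that the largest code fits in a `bits`-bit unsigned integer.
    """
    lo, hi = min(m), max(m)
    steps = (1 << bits) - 1
    if hi == lo:
        # Degenerate case: all elements are equal, so every code is 0.
        return [0] * len(m), lo, hi
    scale = steps / (hi - lo)
    codes = [math.floor((x - lo) * scale + 0.5) for x in m]
    return codes, lo, hi


def dequantize(codes, lo, hi, bits):
    """Map integer codes back to floats, mirroring equation (23)."""
    steps = (1 << bits) - 1
    if hi == lo:
        return [lo] * len(codes)
    step = (hi - lo) / steps
    return [q * step + lo for q in codes]


# Hypothetical difference vector for one vertex.
diff = [-0.30, -0.05, 0.0, 0.12, 0.45]
codes, lo, hi = quantize(diff, bits=8)
restored = dequantize(codes, lo, hi, bits=8)
max_error = max(abs(a - b) for a, b in zip(diff, restored))
half_step = (hi - lo) / (2 * ((1 << 8) - 1))
```

The receiver needs only the codes plus `lo` and `hi`, which matches the $B*L+2T$ message layout described above.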

6. Hierarchical Graph Partition Algorithm

In a heterogeneous multi-node multi-GPU environment, the communication overhead within a single node differs from that across physical nodes. We illustrate the communication architecture in Figure 3. Each GPU is viewed as the basic computing device.

We propose our vertex-cut graph partition algorithm based on the EBV (zhang2021efficient, ) algorithm. To adapt to the hierarchical communication architecture, we rewrite its evaluation function

(24) $$\begin{aligned}Eva_{(u,v)}(i)={}&(1-\gamma)\left(\mathbb{I}(i\notin d\_rep_{u})+\mathbb{I}(i\notin d\_rep_{v})\right)\\
&+\gamma\left(\mathbb{I}(host_{i}\notin h\_rep_{u})+\mathbb{I}(host_{i}\notin h\_rep_{v})\right)\\
&+\alpha\frac{e_{count}[i]}{|E|/p}+\beta\frac{v_{count}[i]}{|V|/p}.\end{aligned}$$

$d\_rep_{u}$ and $h\_rep_{u}$ denote the sets of GPU IDs and host (CPU) IDs to which vertex $u$ has been assigned. Once vertex $u$ has been assigned to any GPU of a host, that host's ID is added to $h\_rep_{u}$. We use $host_{i}$ to denote the ID of the host to which the $i$-th GPU belongs. In addition, $e_{count}[i]$ and $v_{count}[i]$ are the numbers of edges and vertices that have been assigned to subgraph $i$.

We partition the graph edge by edge. For each edge $(u,v)$, we evaluate the function for every GPU $i$ and assign the edge to the subgraph of the GPU that minimizes it.
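The edge-by-edge assignment driven by equation (24) can be sketched as follows. This is an illustrative sketch under several assumptions of ours, not the authors' implementation: `alpha=1.0` and `beta=1.0` are placeholder weights, edges are processed in the order given, ties go to the lowest GPU index, and the function and variable names are hypothetical.

```python
def evaluate(i, u, v, d_rep, h_rep, host, e_count, v_count,
             num_edges, num_vertices, alpha, beta, gamma):
    """Score of placing edge (u, v) on GPU i, following equation (24)."""
    p = len(host)
    gpu_term = (i not in d_rep[u]) + (i not in d_rep[v])
    host_term = (host[i] not in h_rep[u]) + (host[i] not in h_rep[v])
    balance = (alpha * e_count[i] / (num_edges / p)
               + beta * v_count[i] / (num_vertices / p))
    return (1 - gamma) * gpu_term + gamma * host_term + balance


def partition(edges, num_vertices, host, alpha=1.0, beta=1.0, gamma=0.1):
    """Assign each edge to the GPU with the smallest score, one edge at a time."""
    p = len(host)
    d_rep = [set() for _ in range(num_vertices)]  # GPUs holding each vertex
    h_rep = [set() for _ in range(num_vertices)]  # hosts holding each vertex
    e_count = [0] * p
    v_count = [0] * p
    assignment = []
    for u, v in edges:
        best = min(range(p), key=lambda i: evaluate(
            i, u, v, d_rep, h_rep, host, e_count, v_count,
            len(edges), num_vertices, alpha, beta, gamma))
        assignment.append(best)
        e_count[best] += 1
        for x in (u, v):
            if best not in d_rep[x]:
                d_rep[x].add(best)
                v_count[best] += 1
            h_rep[x].add(host[best])
    return assignment, d_rep, h_rep


# Toy example (assumption): 4 GPUs on 2 hosts, a 6-vertex ring.
host = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
assignment, d_rep, h_rep = partition(edges, num_vertices=6, host=host)
```

In this toy run the second edge goes to GPU 1 rather than GPU 2: both would create the same number of new GPU replicas, but GPU 1 shares a host with the existing replica of vertex 1, so the $\gamma$ term breaks the tie in its favor.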

From equation (24), we can see that the term $\mathbb{I}(host_{i}\notin h\_rep_{u})+\mathbb{I}(host_{i}\notin h\_rep_{v})$ reduces the number of cut vertices between hosts. We usually set $\gamma\ll 1$, so this term mainly serves to select a more suitable host when the other terms are close. In our experiments, we set $\gamma$ to $0.1$ by default.

For the other terms, $\mathbb{I}(i\notin d\_rep_{u})+\mathbb{I}(i\notin d\_rep_{v})$ is related to the replication factor among GPUs, while $\alpha\frac{e_{count}[i]}{|E|/p}$ and $\beta\frac{v_{count}[i]}{|V|/p}$ restrict the edge and vertex imbalance factors, respectively. The replication factor is defined as $\frac{\sum_{i=1}^{p}|V_{i}|}{|V|}$, which is the average number of replicas of a vertex.
The edge imbalance factor is defined as $\frac{\max_{i=1,...,p}|E_{i}|}{|E|/p}$, while the vertex imbalance factor is defined as $\frac{\max_{i=1,...,p}|V_{i}|}{\sum_{i=1}^{p}|V_{i}|/p}$. Both imbalance factors measure how balanced the partition results are.
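The three metrics defined above can be computed from an edge-to-subgraph assignment as in the following sketch. The function name and the toy partition are hypothetical, and the sketch assumes that every vertex appears in at least one edge.

```python
def partition_metrics(edges, assignment, num_vertices, p):
    """Return (replication factor, edge imbalance factor, vertex imbalance factor)."""
    vertex_sets = [set() for _ in range(p)]   # V_i: vertices present in subgraph i
    edge_counts = [0] * p                     # |E_i|
    for (u, v), i in zip(edges, assignment):
        vertex_sets[i].update((u, v))
        edge_counts[i] += 1
    total_replicas = sum(len(s) for s in vertex_sets)
    rf = total_replicas / num_vertices
    edge_if = max(edge_counts) / (len(edges) / p)
    vertex_if = max(len(s) for s in vertex_sets) / (total_replicas / p)
    return rf, edge_if, vertex_if


# Hypothetical 2-way partition of the 4-vertex path 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
assignment = [0, 0, 1]
rf, edge_if, vertex_if = partition_metrics(edges, assignment, num_vertices=4, p=2)
```

Here vertex 2 is the only cut vertex, so five replicas cover four vertices and the replication factor is 1.25. A perfectly balanced partition would give both imbalance factors the value 1.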

7. Experiments and Analysis

Table 1. Statistics of GNN dataset graphs
| Dataset | $\lvert V\rvert$ | $\lvert E\rvert$ | Input Dim | Output Dim |
| --- | --- | --- | --- | --- |
| Reddit | 232,965 | 11,606,919 | 602 | 41 |
| ogbn-products | 2,449,029 | 61,859,140 | 100 | 47 |
| ogbn-papers100M | 111,059,956 | 1,615,685,872 | 200 | 172 |
| Friendster | 65,608,366 | 1,806,067,135 | 64 | 32 |

In this section, we test CDFGNN in a heterogeneous environment with multiple physical nodes and multiple GPUs per node. We compare CDFGNN with the state-of-the-art distributed full-batch graph neural network training frameworks on several datasets. In addition, we select some representative graph partition algorithms to analyze the influence of different graph partition algorithms on the distributed full-batch GNN training. Finally, we conduct the ablation study to demonstrate the effectiveness of each component.

7.1. Experiment Setup and Datasets

In the experiments, we compare CDFGNN with the state-of-the-art distributed full-batch graph neural network training frameworks Sancus (peng2022sancus, ) and CAGNET (tripathy2020reducing, ). We select four datasets for the comparison: Reddit (hamilton2017inductive, ), ogbn-products (chiang2019cluster, ), ogbn-papers100M (wang2020microsoft, ) and Friendster (Friend, ). The statistics of these graphs are listed in Table 1. Friendster does not provide input features or output categories, so we randomly generate them in order to test the training efficiency of the frameworks on a large graph.

Our experiment platform is a 2-node cluster in which each node has $8$ NVIDIA A800 80 GB GPUs. Communication within a physical node uses 16-lane PCIe 4.0, and communication across physical nodes uses InfiniBand. We use NCCL for communication and list the communication performance in Table 2.

Table 2. Communication Performance between GPUs
| Environment | Pattern | Bandwidth |
| --- | --- | --- |
| PCIe | Peer2Peer | 22.70 GB/s |
| InfiniBand | Peer2Peer | 8.27 GB/s |
| PCIe | Broadcast | 19.47 GB/s |
| InfiniBand | Broadcast | 11.98 GB/s |

We adopt a simple $2$-layer graph convolutional network as our test model. The dimensions of the input and output features are determined by the datasets, while the hidden-layer dimension is set to $64$ by default. We use the cross-entropy loss function and the Adam optimizer (kingma2014adam, ) to update the model parameters. The initial learning rate is set to $0.01$ by default.

7.2. Distributed Training Efficiency Comparison

First, we compare CDFGNN with the current state-of-the-art distributed full-batch GNN training frameworks Sancus and CAGNET, using the same GNN model for all frameworks. We also run CDFGNN with $2$ well-known vertex-cut GP algorithms, HEP (mayer2021hybrid, ) and DNE (hanai2019distributed, ), so that we can analyze the influence of different graph partition algorithms on training efficiency. To test our hierarchical GP algorithm, we set $\gamma$ to $0.1$ and $0.0$ and denote the two configurations as EBV$_{\gamma=0.1}$ and EBV$_{\gamma=0.0}$. EBV$_{\gamma=0.0}$ is equivalent to the original EBV algorithm.

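Table 3. The statistics of different graph partition algorithms
| Dataset | GP algorithm | Nodes | GPUs per Node | Inner | Outer | RF | Edge IF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | EBV$_{\gamma=0.0}$ | 2 | 2 | 104217 | 138583 | 2.9027 | 1.0054 |
| Reddit | EBV$_{\gamma=0.1}$ | 2 | 2 | 105412 | 117879 | 3.0860 | 1.0022 |
| Reddit | HEP | 2 | 2 | 36662 | 52886 | 1.6084 | 1.2693 |
| Reddit | DNE | 2 | 2 | 65578 | 118788 | 2.1025 | 1.2558 |
| ogbn-products | EBV$_{\gamma=0.0}$ | 2 | 4 | 695905 | 639459 | 3.1788 | 1.0002 |
| ogbn-products | EBV$_{\gamma=0.1}$ | 2 | 4 | 952727 | 481147 | 3.3379 | 1.0008 |
| ogbn-products | HEP | 2 | 4 | 143711 | 127261 | 1.3304 | 1.2323 |
| ogbn-products | DNE | 2 | 4 | 367460 | 406408 | 1.9363 | 1.1527 |
| Friendster | EBV$_{\gamma=0.0}$ | 2 | 8 | 12395102 | 9988776 | 3.7237 | 1.0002 |
| Friendster | EBV$_{\gamma=0.1}$ | 2 | 8 | 19586785 | 5465465 | 4.0322 | 1.0011 |
| Friendster | HEP | 2 | 8 | 5794009 | 4675810 | 1.7048 | 1.776 |
| Friendster | DNE | 2 | 8 | 8737134 | 11670478 | 2.3546 | 1.7455 |
| ogbn-papers100M | EBV$_{\gamma=0.0}$ | 2 | 8 | 20438528 | 16362561 | 3.6503 | 1.0000 |
| ogbn-papers100M | EBV$_{\gamma=0.1}$ | 2 | 8 | 32241760 | 9924817 | 4.0347 | 1.0001 |
| ogbn-papers100M | HEP | 2 | 8 | 5826661 | 3085959 | 1.3144 | 2.0204 |
| ogbn-papers100M | DNE | 2 | 8 | 10021186 | 13819120 | 2.1475 | 1.3866 |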

Figure 5. Comparison of average training time per epoch. Panels: (a) Reddit, (b) ogbn-products, (c) ogbn-papers100M, (d) Friendster.

Figure 5 presents the average training time per epoch of the different GNN training frameworks on the four datasets. The GPUs we use are evenly distributed across the two physical nodes. The labels EBV$_{\gamma=0.1}$, EBV$_{\gamma=0.0}$, HEP and DNE denote CDFGNN combined with the corresponding graph partition algorithm.

From Figure 5, we can see that EBV$_{\gamma=0.1}$ achieves the best performance in almost all cases and reduces the training time by $30.39\%$ on average compared to Sancus. Sancus performs better than CAGNET and even outperforms EBV$_{\gamma=0.1}$ in the smallest case ($2$ GPUs, Reddit), but its performance is limited in larger cases. Comparing EBV$_{\gamma=0.1}$ with EBV$_{\gamma=0.0}$, setting $\gamma$ to $0.1$ achieves better training efficiency on our cluster. Note that with only $2$ GPUs, the partition results of EBV$_{\gamma=0.1}$ and EBV$_{\gamma=0.0}$ are equivalent, since each physical node then holds a single GPU and the host term in equation (24) coincides with the GPU term. The HEP algorithm also performs well on the smallest dataset (Reddit), but EBV$_{\gamma=0.1}$ leads HEP by a larger margin on the other datasets. Therefore, we believe that the CDFGNN framework combined with EBV$_{\gamma=0.1}$ can achieve the best training efficiency on graph neural network datasets of different sizes.

7.3. Ablation Study

Next, we study why the graph partition algorithms perform differently when combined with CDFGNN. We compare the partition results generated by the different GP algorithms in Table 3. The “Inner” and “Outer” columns give the maximum number of inner and outer connections on a single subgraph. Inner connections are the messages that a device must send within its physical node, and outer connections are the messages it must send across physical nodes. We also report the replication factor (RF) and edge imbalance factor (Edge IF) defined in Section 6. Since all GP algorithms compared here are vertex-cut algorithms, we do not give the vertex imbalance factor.

Table 3 shows the characteristics of all GP algorithms on the $4$ datasets. Setting $\gamma$ to $0.1$ greatly reduces the number of outer connections (by $31.08\%$ on average) at the expense of more inner connections. Given the gap between inner and outer communication bandwidth in Table 2, the overall communication overhead can be greatly reduced, thereby improving training efficiency. The HEP algorithm achieves the fewest inner and outer connections, but its partition results are significantly imbalanced (its Edge IF ranges from 1.2323 to 2.0204 in Table 3), which leads to imbalanced computing and communication overhead and reduces the overall training efficiency.
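The average reduction in outer connections can be reproduced from the “Outer” column of Table 3. The sketch below takes the unweighted mean of the per-dataset relative reductions; that averaging convention is our reading, adopted because it matches the reported $31.08\%$, and the variable names are ours.

```python
# "Outer" column of Table 3: (EBV with gamma = 0.0, EBV with gamma = 0.1).
outer = {
    "Reddit":          (138583, 117879),
    "ogbn-products":   (639459, 481147),
    "Friendster":      (9988776, 5465465),
    "ogbn-papers100M": (16362561, 9924817),
}

# Relative reduction per dataset, then the unweighted mean across datasets.
reductions = {name: (base - hier) / base for name, (base, hier) in outer.items()}
average_reduction = sum(reductions.values()) / len(reductions)
```

The per-dataset values grow with the number of GPUs per node, from roughly 15% on Reddit (2 GPUs per node) to roughly 45% on Friendster (8 GPUs per node).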

We break down the computation and communication time of the different GP algorithms and communication optimization methods in CDFGNN for further analysis. Figure 6 lists the computation and communication time per epoch on each GPU, with dashed lines marking the corresponding average training time. When comparing the GP algorithms EBV$_{\gamma=0.1}$, EBV$_{\gamma=0.0}$, HEP and DNE, all communication optimization methods are enabled by default. When comparing the communication optimization methods, the GP algorithm is EBV$_{\gamma=0.1}$. “Cache” means that only the adaptive cache mechanism is used, “Quantify” means that only communication quantization is used, and “Baseline” means that no communication optimization method is used.

Figure 6. Time breakdown of different GP algorithms and communication optimization methods. Panels: (a) ogbn-products, (b) Reddit.

As shown in Figure 6, the computation time of EBV$_{\gamma=0.0}$ is roughly the same as that of EBV$_{\gamma=0.1}$, but its communication time is longer. HEP and DNE suffer from significant workload imbalance, which restricts their training performance.

Meanwhile, both the adaptive cache mechanism and communication quantization greatly reduce the communication overhead without affecting the computation overhead. For a fair comparison, we count the extra calculation time (quantization and dequantization for communication quantization, cache comparison for the adaptive cache mechanism) as communication time. Therefore, the communication time is not directly proportional to the number of communication messages. On ogbn-products, the adaptive cache mechanism achieves the larger communication reduction, while on Reddit communication quantization is more effective. Combining both methods (EBV$_{\gamma=0.1}$) achieves the best performance.

We also analyze the percentage of messages sent in each layer under the adaptive cache mechanism in Figure 7. To better understand the cache mechanism across training epochs, we also plot the cache threshold $\epsilon$. Figure 7 shows the sending percentage and cache threshold on ogbn-products and Reddit with $4$ and $8$ GPUs, respectively. In the middle stage of training, only a few messages are sent, which greatly reduces the communication overhead. This phenomenon is consistent with our hypothesis. Furthermore, at about $50$-$100$ training epochs on ogbn-products, almost no vertex features are sent during forward propagation. Meanwhile, the cache threshold is dynamically adjusted to a larger value in the middle of training and to a smaller value at other times.

Figure 7. Percentage of cache threshold and sending messages. Panels: (a) ogbn-products, (b) Reddit.

Figure 8. The convergence curve of evaluation accuracy. Panels: (a) ogbn-products, (b) Reddit.

Finally, we verify the convergence of the evaluation accuracy of CDFGNN in Figure 8. In addition to the distributed training approaches of CDFGNN, we also implement full-batch and mini-batch training on a single GPU for comparison.

The results in Figure 8 show that the adaptive cache mechanism and communication quantization have almost no impact on the convergence of accuracy. Because distributed training introduces small random errors, the accuracy in some epochs is even higher than that of single-GPU full-batch training. In contrast, mini-batch training significantly reduces the accuracy, especially on Reddit. This is because we limit the maximum number of neighbors when sampling, and the average degree of Reddit is very large (about $50$ edges per vertex according to Table 1).

8. Related Work

Research on distributed graph neural network training is still in its early stages (abadal2021computing, ), and only a few of these works target GPUs. Compared with traditional distributed large-scale graph computing frameworks (chen2019powerlyra, ; fan2017grape, ; malewicz2010pregel, ), the communication overhead of distributed GNN training is more severe, because the distributed training of each GCN or GAT layer requires sending and receiving the features and gradients of neighbor vertices, whose dimension is usually very large.

Many existing distributed graph neural network training frameworks adopt a centralized architecture. For example, NeuGraph (ma2019neugraph, ) proposes a GNN training framework for a single-node multi-GPU environment. It uses METIS (karypis1998fast, ) as the graph partitioning algorithm and introduces graph computation optimizations into the management of data partitioning, scheduling, and parallelism. However, it is not open source. RoC (jia2020improving, ) dynamically partitions the graph through an online regression model and proposes an inter-process memory management method, but this also leads to a complex execution workflow. PaGraph (lin2020pagraph, ) statically caches high-degree vertices in GPU memory and uses a special graph partitioning algorithm to balance the workload and reduce cross-device data access. G3 (liu2020g3, ) utilizes parallel graph optimization to improve graph operations in GPU systems, Grain (zhang2021grain, ) selects GNN data by maximizing social influence, and RDD (zhang2020reliable, ) uses unlabeled data. AliGraph (zhu2019aligraph, ) also uses static caching, but only supports CPU clusters. AGL (zhang2020agl, ) uses MapReduce operations to optimize the training and inference phases simultaneously. To reduce and balance communication, DistDGL (zheng2020distdgl, ) uses a load-balanced graph partitioning algorithm. Most of these systems suffer from heavy communication overhead and therefore cannot scale to large-scale applications. Moreover, except for NeuGraph and RoC, which support full-batch graph neural network training, these frameworks use mini-batch training, which requires sampling.

9. Conclusion and Future Work

In this paper, we propose CDFGNN, a cache-based distributed full-batch graph neural network training framework. To address the problem of excessive communication in existing full-batch training frameworks, we design three optimizations: an adaptive cache mechanism, communication quantization, and hierarchical graph partitioning. With these improvements, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks. We also show, theoretically and experimentally, that the convergence accuracy of CDFGNN is not degraded. Therefore, we believe that CDFGNN can greatly improve distributed training efficiency on large-scale graphs.

In the future, we plan to make full use of high-speed interconnects such as NVLink to further reduce the communication overhead.

References

  • (1) T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • (2) P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • (3) W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • (4) J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph convolutional networks via importance sampling,” arXiv preprint arXiv:1801.10247, 2018.
  • (5) W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling towards fast graph representation learning,” Advances in neural information processing systems, vol. 31, 2018.
  • (6) H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. Prasanna, “Graphsaint: Graph sampling based inductive learning method,” arXiv preprint arXiv:1907.04931, 2019.
  • (7) J. Dong, D. Zheng, L. F. Yang, and G. Karypis, “Global neighbor sampling for mixed cpu-gpu training on giant graphs,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 289–299.
  • (8) Z. Cai, X. Yan, Y. Wu, K. Ma, J. Cheng, and F. Yu, “Dgcl: an efficient communication library for distributed gnn training,” in Proceedings of the Sixteenth European Conference on Computer Systems, 2021, pp. 130–144.
  • (9) Z. Jia, S. Lin, M. Gao, M. Zaharia, and A. Aiken, “Improving the accuracy, scalability, and performance of graph neural networks with roc,” Proceedings of Machine Learning and Systems, vol. 2, pp. 187–198, 2020.
  • (10) A. Tripathy, K. Yelick, and A. Buluç, “Reducing communication in graph neural network training,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14.
  • (11) J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” arXiv preprint arXiv:1710.10568, 2017.
  • (12) J. Thorpe, Y. Qiao, J. Eyolfson, S. Teng, G. Hu, Z. Jia, J. Wei, K. Vora, R. Netravali, M. Kim et al., “Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 495–514.
  • (13) Z. Wang, Y. Guan, G. Sun, D. Niu, Y. Wang, H. Zheng, and Y. Han, “Gnn-pim: A processing-in-memory architecture for graph neural networks,” in Advanced Computer Architecture: 13th Conference, ACA 2020, Kunming, China, August 13–15, 2020, Proceedings 13. Springer, 2020, pp. 73–86.
  • (14) P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • (15) S. Gandhi and A. P. Iyer, “P3: Distributed deep graph learning at scale,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 551–568.
  • (16) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (17) R. Chen, J. Shi, Y. Chen, B. Zang, H. Guan, and H. Chen, “Powerlyra: Differentiated graph computation and partitioning on skewed graphs,” ACM Transactions on Parallel Computing (TOPC), vol. 5, no. 3, pp. 1–39, 2019.
  • (18) L. G. Valiant, “A bridging model for parallel computation,” Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
  • (19) R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of modern physics, vol. 74, no. 1, p. 47, 2002.
  • (20) M. Daisuke, H. L. Edward, and B. Murmann, “Convolutional neural networks using logarithmic data representation,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2016.
  • (21) Y. Li, X. Dong, and W. Wang, “Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks,” arXiv preprint arXiv:1909.13144, 2019.
  • (22) R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4852–4861.
  • (23) J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-s. Hua, “Quantization networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
  • (24) S. Zhang, Z. Jiang, X. Hou, Z. Guan, M. Yuan, and H. You, “An efficient and balanced graph partition algorithm for the subgraph-centric programming model on large-scale power-law graphs,” in 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE, 2021, pp. 68–78.
  • (25) J. Peng, Z. Chen, Y. Shao, Y. Shen, L. Chen, and J. Cao, “Sancus: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks,” Proceedings of the VLDB Endowment, vol. 15, no. 9, pp. 1937–1950, 2022.
  • (26) W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh, “Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 257–266.
  • (27) K. Wang, Z. Shen, C. Huang, C.-H. Wu, Y. Dong, and A. Kanakia, “Microsoft academic graph: When experts are not enough,” Quantitative Science Studies, vol. 1, no. 1, pp. 396–413, 2020.
  • (28) “Friendster,” https://snap.stanford.edu/data/com-Friendster.html.
  • (29) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • (30) R. Mayer and H.-A. Jacobsen, “Hybrid edge partitioner: Partitioning large power-law graphs under memory constraints,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 1289–1302.
  • (31) M. Hanai, T. Suzumura, W. J. Tan, E. Liu, G. Theodoropoulos, and W. Cai, “Distributed edge partitioning for trillion-edge graphs,” arXiv preprint arXiv:1908.05855, 2019.
  • (32) S. Abadal, A. Jain, R. Guirado, J. López-Alonso, and E. Alarcón, “Computing graph neural networks: A survey from algorithms to accelerators,” ACM Computing Surveys (CSUR), vol. 54, no. 9, pp. 1–38, 2021.
  • (33) W. Fan, J. Xu, Y. Wu, W. Yu, and J. Jiang, “Grape: Parallelizing sequential graph computations,” Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1889–1892, 2017.
  • (34) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 135–146.
  • (35) L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai, “NeuGraph: Parallel deep neural network computation on large graphs,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 443–458.
  • (36) G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
  • (37) Z. Lin, C. Li, Y. Miao, Y. Liu, and Y. Xu, “Pagraph: Scaling gnn training on large graphs via computation-aware caching,” in Proceedings of the 11th ACM Symposium on Cloud Computing, 2020, pp. 401–415.
  • (38) H. Liu, S. Lu, X. Chen, and B. He, “G3: when graph neural networks meet parallel graph processing systems on gpus,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 2813–2816, 2020.
  • (39) W. Zhang, Z. Yang, Y. Wang, Y. Shen, Y. Li, L. Wang, and B. Cui, “Grain: Improving data efficiency of graph neural networks via diversified influence maximization,” arXiv preprint arXiv:2108.00219, 2021.
  • (40) W. Zhang, X. Miao, Y. Shao, J. Jiang, L. Chen, O. Ruas, and B. Cui, “Reliable data distillation on graph convolutional network,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 1399–1414.
  • (41) R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou, “Aligraph: A comprehensive graph neural network platform,” arXiv preprint arXiv:1902.08730, 2019.
  • (42) D. Zhang, X. Huang, Z. Liu, Z. Hu, X. Song, Z. Ge, Z. Zhang, L. Wang, J. Zhou, Y. Shuang et al., “Agl: a scalable system for industrial-purpose graph machine learning,” arXiv preprint arXiv:2003.02454, 2020.
  • (43) D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, and G. Karypis, “Distdgl: distributed graph neural network training for billion-scale graphs,” in 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 2020, pp. 36–44.