
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling

Hongkang Li 1   Meng Wang 1   Sijia Liu 2 3   Pin-Yu Chen 4   Jinjun Xiong 5

1 Department of Electrical, Computer, and System Engineering, Rensselaer Polytechnic Institute, NY, USA. 2 Department of Computer Science and Engineering, Michigan State University, MI, USA. 3 MIT-IBM Watson AI Lab, IBM Research, MA, USA. 4 IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. 5 Department of Computer Science and Engineering, University at Buffalo, NY, USA. Correspondence to: Hongkang Li <[email protected]>, Meng Wang <[email protected]>, Sijia Liu <[email protected]>, Pin-Yu Chen <[email protected]>, Jinjun Xiong <[email protected]>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Graph convolutional networks (GCNs) have recently achieved great empirical success in learning graph-structured data. To address its scalability issue due to the recursive embedding of neighboring features, graph topology sampling has been proposed to reduce the memory and computational cost of training GCNs, and it has achieved comparable test performance to those without topology sampling in many empirical studies. To the best of our knowledge, this paper provides the first theoretical justification of graph topology sampling in training (up to) three-layer GCNs for semi-supervised node classification. We formally characterize some sufficient conditions on graph topology sampling such that GCN training leads to a diminishing generalization error. Moreover, our method tackles the non-convex interaction of weights across layers, which is under-explored in the existing theoretical analyses of GCNs. This paper characterizes the impact of graph structures and topology sampling on the generalization performance and sample complexity explicitly, and the theoretical findings are also justified through numerical experiments.

1. Introduction

Graph convolutional neural networks (GCNs) aggregate the embedding of each node with the embedding of its neighboring nodes in each layer. GCNs can model graph-structured data more accurately and compactly than conventional neural networks and have demonstrated great empirical advantage in text analysis (Hamilton et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018; Peng et al., 2017), computer vision (Satorras & Estrach, 2018; Wang et al., 2018; Hu et al., 2018), recommendation systems (Ying et al., 2018; Van den Berg et al., 2018), physical reasoning (Battaglia et al., 2016; Sanchez-Gonzalez et al., 2018), and biological science (Duvenaud et al., 2015). Such empirical success is often achieved at a cost of higher computational and memory costs, especially for large graphs, because the embedding of one node depends recursively on its neighbors. To alleviate the exponential increase of computational cost in training deep GCNs, various graph topology sampling methods have been proposed to only aggregate the embeddings of a selected subset of neighbors in training GCNs. Node-wise neighbor-sampling methods such as GraphSAGE (Hamilton et al., 2017), VRGCN (Chen et al., 2018b), and Cluster-GCN (Chiang et al., 2019) sample a subset of neighbors for each node. Layer-wise importance sampling methods such as FastGCN (Chen et al., 2018a) and LADIES (Zou et al., 2019) sample a fixed number of nodes for each layer based on the estimate of node importance. Another line of works such as (Zheng et al., 2020; Li et al., 2020; Chen et al., 2021) employ graph sparsification or pruning to reduce the computational and memory cost. Surprisingly, these sampling methods often have comparable or even better testing performance compared to training with the original graph in many empirical studies (Chen et al., 2018a; 2021).

In contrast to the empirical success, the theoretical foundation of training GCNs with graph sampling is much less investigated. Only Cong et al. (2021) analyzes the convergence rate of graph sampling, but no generalization analysis is provided. One fundamental question about training GCNs is still vastly open, which is:

Under what conditions does a GCN learned with graph topology sampling achieve satisfactory generalization?

Our contributions: To the best of our knowledge, this paper provides the first generalization analysis of training GCNs with graph topology sampling. We focus on semi-supervised node classification problems where, with all node features and partial node labels, the objective is to predict the unknown node labels. We summarize our contributions from the following dimensions.

First, this paper proposes a training framework that implements both stochastic gradient descent (SGD) and graph topology sampling, and the learned GCN model with Rectified Linear Unit (ReLU) activation is guaranteed to approach the best generalization performance of a large class of target functions. Moreover, as the number of labeled nodes and the number of neurons increase, the class of target functions enlarges, indicating improved generalization.

Second, this paper explicitly characterizes the impact of graph topology sampling on the generalization performance through the proposed effective adjacency matrix A∗ of a directed graph that models the node correlations. A∗ depends on both the given normalized graph adjacency matrix in GCNs and the graph sampling strategy. We provide the general insights that (1) if a node is sampled with a low frequency, its impact on other nodes is reduced in A∗ compared with A; (2) graph sampling on a highly-unbalanced A, where some nodes have a dominating impact in the graph, results in a more balanced A∗. Moreover, these insights apply to other graph sampling methods such as FastGCN (Chen et al., 2018a).

We show that learning with topology sampling has the same generalization performance as training GCNs using A∗. Therefore, a satisfactory generalization can still be achieved even when the number of sampled nodes is small, provided that the resulting A∗ still characterizes the data correlations properly. This is the first theoretical explanation of the empirical success of graph topology sampling.

Third, this paper shows that the required number of labeled nodes, referred to as the sample complexity, is a polynomial of ‖A∗‖∞ and the maximum node degree, where ‖·‖∞ measures the maximum absolute row sum. Moreover, our sample complexity is only logarithmic in the number of neurons m and consistent with the practical over-parameterization of GCNs, in contrast to the loose bound of poly(m) in (Zhang et al., 2020) in the restrictive setting of two-layer (one-hidden-layer) GCNs without graph topology sampling.

1.1. Related Works

Generalization analyses of GCNs without graph sampling. Some recent works analyze GCNs trained on the original graph. Xu et al. (2019); Cong et al. (2021) characterize the expressive power of GCNs. Xu et al. (2021) analyzes the convergence of gradient descent in training linear GCNs. Lv (2021); Liao et al. (2021); Garg et al. (2020); Oono & Suzuki (2020) characterize the generalization gap, which is the difference between the training error and testing error, through Rademacher complexity. Verma & Zhang (2019); Cong et al. (2021); Zhou & Wang (2021) analyze the generalization gap of training GCNs using SGD via the notion of algorithmic stability.

To analyze the training error and generalization performance simultaneously, Du et al. (2019) uses the neural tangent kernel (NTK) approach, where the neural network width is infinite and the step size is infinitesimal, shows that the training error is zero, and characterizes the generalization bound. Zhang et al. (2020) proves that gradient descent can learn a model with zero population risk, provided that all data are generated by an unknown target model. The result in (Zhang et al., 2020) is limited to two-layer GCNs and requires a proper initialization in the local convex region of the optimal solution.

Generalization analyses of feed-forward neural networks. The NTK approach was first developed to analyze fully connected neural networks (FCNNs), see, e.g., (Jacot et al., 2018). The works of Zhong et al. (2017); Fu et al. (2020); Li et al. (2022) analyze one-hidden-layer neural networks with Gaussian input data. Daniely (2017) analyzes multi-layer FCNNs but focuses on training the last layer only, while the changes in the hidden layers are negligible. Allen-Zhu et al. (2019) provides the optimization and generalization of three-layer FCNNs. Our proof framework is built upon (Allen-Zhu et al., 2019) but makes two important technical contributions. First, this paper provides the first generalization analysis of graph topology sampling in training GCNs, while Allen-Zhu et al. (2019) considers FCNNs with neither graph topology nor graph sampling. Second, Allen-Zhu et al. (2019) considers i.i.d. training samples, while this paper considers semi-supervised GCNs where the training data are correlated through graph convolution.

1.2. Notations

Vectors are in bold lowercase, and matrices and tensors are in bold uppercase. Scalars are in normal fonts. For instance, Z is a matrix, and z is a vector. zi denotes the i-th entry of z, and Zi,j denotes the (i, j)-th entry of Z. [K] (K > 0) denotes the set of integers from 1 to K. Id ∈ R^(d×d) and ei represent the identity matrix in R^(d×d) and the i-th standard basis vector, respectively. We denote the column ℓp norm for W ∈ R^(d×N) (for p ≥ 1) as

‖W‖_{2,p} = ( Σ_{i∈[m]} ‖wi‖₂^p )^{1/p}.    (1)

Hence, ‖W‖_{2,2} = ‖W‖_F is the Frobenius norm of W. We use wi (w̃i) to denote the i-th column (row) vector of W. We follow the convention that f(x) = O(g(x)) (or Ω(g(x)), Θ(g(x))) means that f(x) increases at most (or at least, or in the same, respectively) order of g(x). With high probability (w.h.p.) means with probability 1 − e^(−c log²(m1, m2)) for a sufficiently large constant c, where m1 and m2 are the numbers of neurons in the two hidden layers.
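A quick numerical illustration of the column ℓ_{2,p} norm in (1) (a minimal numpy sketch; the helper name is ours, not the paper's):

```python
import numpy as np

def col_norm_2p(W: np.ndarray, p: float) -> float:
    """Column l_{2,p} norm from (1): (sum_i ||w_i||_2^p)^(1/p), w_i = i-th column of W."""
    col_norms = np.linalg.norm(W, axis=0)          # ||w_i||_2 for every column i
    return float(np.sum(col_norms ** p) ** (1.0 / p))

W = np.random.randn(5, 7)
# p = 2 recovers the Frobenius norm, as noted after (1)
assert np.isclose(col_norm_2p(W, 2.0), np.linalg.norm(W, "fro"))
print(col_norm_2p(W, 4.0))   # the ||.||_{2,4} norm that later appears in the regularizer (11)
```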

Function complexity. For any smooth function φ(z) with its power series representation φ(z) = Σ_{i=0}^∞ ci z^i, define two useful parameters as follows,

Cε(φ, R) = Σ_{i=0}^∞ ( (C∗R)^i + ( √(log(1/ε)/i) · C∗R )^i ) |ci|,    (2)

Cs(φ, R) = C∗ Σ_{i=0}^∞ (i + 1)^1.75 R^i |ci|,    (3)

where R ≥ 0 and C∗ is a sufficiently large constant. These two quantities are used in the model complexity and sample complexity, which represent the required number of model parameters and training samples to learn φ up to ε error, respectively. Many population functions have bounded complexity. For instance, if φ(z) is exp(z), sin(z), cos(z) or a polynomial of z, then Cε(φ, O(1)) ≤ O(poly(1/ε)) and Cs(φ, O(1)) ≤ O(1).

The main notations are summarized in Table 2 in the Appendix.

2. Training GCNs with Topology Sampling: Formulation and Main Components

GCN setup. Let G = {V, E} denote an un-directed graph, where V is the set of nodes with size |V| = N and E is the set of edges. Let Ã ∈ {0, 1}^(N×N) be the adjacency matrix of G with added self-connections. Let D be the degree matrix with diagonal elements Di,i = Σj Ãi,j and zero entries otherwise. A denotes the normalized adjacency matrix with A = D^(−1/2) Ã D^(−1/2). Let X ∈ R^(N×d) denote the matrix of the features of the N nodes, where the n-th row of X, denoted by x̃n ∈ R^(1×d), represents the feature of node n. Assume ‖x̃n‖ = 1 for all n without loss of generality. yn ∈ Y represents the label of node n, where Y is the set of all labels. yn depends on not only xn but also the neighbors. Let Ω ⊂ V denote the set of labeled nodes. Given X and the labels in Ω, the objective of semi-supervised node classification is to predict the unknown labels in V/Ω.

Learner network. We consider the setting of training a three-layer GCN F : R^N × R^(N×d) → R^(1×K) with

FA(eg, X; W, V) = eg⊤ A σ(r + B2) C  and  r = A σ(A X W + B1) V,    (4)

where σ(x) = max(x, 0) is the ReLU activation function, W ∈ R^(d×m1) and V ∈ R^(m1×m2) represent the weights of the m1 and m2 hidden nodes in the first and second layers, respectively. B1 ∈ R^(N×m1) and B2 ∈ R^(m1×m2) represent the bias matrices. C ∈ R^(m×K) is the output weight vector. eg ∈ R^N belongs to {ei}_{i=1}^N and selects the index of the node label. We write F as FA(eg, X; W, V), because we only update W and V in training, and A represents the graph topology. Note that in conventional GCNs such as (Kipf & Welling, 2017), C is a learnable parameter, and B1 and B2 can be zero. Here, for the analytical purpose, we consider a slightly different model where C, B1 and B2 are fixed as randomly selected values.

Consider a loss function L : R^(1×k) × Y → R such that for every y ∈ Y, the function L(·, y) is nonnegative, convex, 1-Lipschitz continuous and 1-Lipschitz smooth, and L(0, y) ∈ [0, 1]. This includes both the cross-entropy loss and the ℓ2-regression loss (for bounded Y). The learning problem solves the following empirical risk minimization problem:

min_{W,V} LΩ(W, V) = (1/|Ω|) Σ_{i∈Ω} L( FA(ei, X; W, V), yi ),    (5)

where LΩ is the empirical risk of the labeled nodes in Ω. The trained weights are used to estimate the unknown labels on V/Ω. Note that the results in this paper are distribution-free, and no assumption is made on the distributions of x̃n and yn.

Training with SGD. In practice, (5) is often solved by gradient-type methods, where in iteration t, the current estimates are updated by subtracting the product of a positive step size and the gradient of LΩ evaluated at the current estimate. To reduce the computational complexity of estimating the gradient, an SGD method is often employed to compute the gradient of the risk of a randomly selected subset of Ω rather than using the whole set Ω.

However, due to the recursive embedding of neighboring features in GCNs, see the concatenations of A in (4), the computation and memory cost of computing the gradient can be high. Thus, graph topology sampling methods have been proposed to further reduce the computational cost.

Graph topology sampling. A node sampling method randomly removes a subset of nodes and the incident edges from G in each iteration independently, and the embedding aggregation is based on the reduced graph. Mathematically, in iteration s, replace A in (4) with¹ As = A Ps, where Ps is a diagonal matrix, and the ith diagonal entry is 0 if node i is removed in iteration s. The non-zero diagonal entries of Ps are selected differently based on different sampling methods. Because As is much sparser than A, the computation and memory cost of embedding neighboring features is significantly reduced.

¹ Here we use the same sampled matrix As in all three layers in (4) to simplify the presentation. Our analysis applies to the more general setting that each layer uses a different sampled adjacency matrix, i.e., the three A matrices in (4) are replaced with As(1) = A Ps(1), As(2) = A Ps(2), As(3) = A Ps(3), respectively, as in (Zou et al., 2019; Ramezani et al., 2020), where Ps(1), Ps(2), and Ps(3) are independently sampled following the same sampling strategy.

This paper will analyze the generalization performance, i.e., the prediction accuracy of unknown labels, of our algorithmic framework that implements both SGD and graph topology sampling to solve (5). The details of our algorithm are discussed in Sections 3.2-3.3, and the generalization performance is presented in Section 3.4.

3. Main Algorithmic and Theoretical Results

3.1. Informal Key Theoretical Findings

We first summarize the main insights of our results before presenting them formally.

1. A provable generalization guarantee of GCNs beyond two layers and with graph topology sampling. The GCN learned by our Algorithm 1 can approach the best performance of label prediction using a large class of target functions. Moreover, the prediction performance improves when the number of labeled nodes and the numbers of neurons m1 and m2 increase. This is the first generalization performance guarantee of training GCNs with graph topology sampling.

2. The explicit characterization of the impact of graph sampling through the effective adjacency matrix A∗. We show that training with graph sampling returns a model that has the same label prediction performance as that of a model trained by replacing A with A∗ in (4), where A∗ depends on both A and the graph sampling strategy. As long as A∗ can characterize the correlation among nodes properly, the learned GCN maintains a desirable prediction performance. This explains the empirical success of graph topology sampling in many datasets.

3. The explicit sample complexity bound on graph properties. We provide explicit bounds on the sample complexity and the required number of neurons, both of which grow as the node correlation increases. Moreover, the sample complexity depends on the number of neurons only logarithmically, which is consistent with the practical over-parameterization. To the best of our knowledge, (Zhang et al., 2020) is the only existing work that provides a sample complexity bound based on the graph topology, but in the non-practical and restrictive setting of two-layer GCNs. Moreover, the sample complexity bound by (Zhang et al., 2020) is polynomial in the number of neurons.

4. Tackling the non-convex interaction of weights between different layers. Convexity plays a critical role in many existing analyses of GCNs. For instance, the analyses in (Zhang et al., 2020) require a special initialization in the local convex region of the global minimum, and the results only apply to two-layer GCNs. The NTK approach in (Du et al., 2019) considers the limiting case that the interactions across layers are negligible. Here, we directly address the non-convex interaction of the weights W and V in both algorithmic design and theoretical analyses.

3.2. Graph Topology Sampling Strategy

Here we describe our graph topology sampling strategy using As, which we randomly generate to replace A in the sth SGD iteration. Although our method is motivated for analysis and differs from the existing graph sampling strategies, our insights generalize to other sampling methods like FastGCN (Chen et al., 2018a). The outline of our algorithmic framework of training GCNs with graph sampling is deferred to Section 3.3.

Suppose the node degrees in G can be divided into L groups with L ≥ 1, where the degrees of nodes in group l are in the order of dl, i.e., between c·dl and C·dl for some constants c ≤ C, and dl is order-wise smaller than dl+1, i.e., dl = o(dl+1). Let Nl denote the number of nodes in group l.

Graph sampling strategy². We consider a group-wise uniform sampling strategy, where Sl out of Nl nodes are sampled uniformly from each group l. For all unsampled nodes, we set the corresponding diagonal entries of a diagonal matrix Ps to be zero. If node i is sampled in this iteration and belongs to group l, for any i and l, the ith diagonal entry of Ps is set as p∗l Nl/Sl for some non-negative constant p∗l. Then As = A Ps. Nl/Sl can be viewed as the scaling to compensate for the unsampled nodes in group l. p∗l can be viewed as the scaling to reflect the impact of sampling on nodes with different importance, which will be discussed in detail soon.

Effective adjacency matrix A∗ by graph sampling. To analyze the impact of graph topology sampling on the learning performance, we define the effective adjacency matrix as follows:

A∗ = A P∗,    (6)

where P∗ is a diagonal matrix defined as

P∗ii = p∗l  if node i belongs to degree group l.    (7)

Therefore, compared with A, all the columns with indices corresponding to group l are scaled by a factor of p∗l. We will formally analyze the impact of graph topology sampling on the generalization performance in Section 3.4, but an intuitive understanding is that our graph sampling strategy effectively changes the normalized adjacency matrix A in the GCN model (4) to A∗.

² Here we discuss asymmetric sampling as a general case. The special case of symmetric sampling is introduced in Section A.1.
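As a sanity check on the statement that the sampling strategy effectively changes A to A∗, note that each node in group l is kept with probability Sl/Nl and rescaled by p∗l·Nl/Sl, so E[Ps] = P∗ and E[As] = A P∗ = A∗. A small numpy sketch (the group assignment, group sizes, and scalings below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 12, 2
group = np.array([0] * 4 + [1] * 8)           # group id of each node (toy choice)
S = np.array([2, 6])                          # S_l nodes sampled per group
p_star = np.array([0.7, 0.3])                 # scalings p*_l

A = rng.random((N, N)) / N                    # stand-in for a normalized adjacency matrix

def sample_Ps_diag():
    diag = np.zeros(N)
    for l in range(L):
        members = np.where(group == l)[0]
        kept = rng.choice(members, size=S[l], replace=False)
        diag[kept] = p_star[l] * len(members) / S[l]    # p*_l N_l / S_l
    return diag

P_star = p_star[group]                        # diagonal of P*, from (7)
A_star = A * P_star[None, :]                  # A* = A P*  (column-wise scaling, (6))

avg_As = np.mean([A * sample_Ps_diag()[None, :] for _ in range(10000)], axis=0)
print(np.max(np.abs(avg_As - A_star)))        # small: the sampled A_s is unbiased for A*
```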

A∗ can be viewed as an adjacency matrix of a weighted directed graph G′ that reflects the node correlations, where each un-directed edge in G corresponds to two directed edges in G′ with possibly different weights. A∗ji measures the impact of the feature of node i on the label of node j. If p∗l is in the range of (0, 1), the corresponding entries of the columns with indices in group l in A∗ are smaller than those in A. That means the impact of a node in group l on all other nodes is reduced from that in A. Conversely, if p∗l > 1, then the impact of nodes in group l in A∗ is enhanced from that in A.

Parameter selection and insights

(1) The scaling factor p∗l should satisfy

0 ≤ p∗l ≤ c1/(L ψl),  ∀l,    (8)

for a positive constant c1 that can be sufficiently large. ψl is defined as follows,

ψl := √(dL dl) Nl / ( Σ_{i=1}^L di Ni ),  ∀l ∈ [L].    (9)

Note that (8) is a minor requirement for most graphs. To see this, suppose L is a constant, and every Nl is in the order of N. Then ψl is less than O(1) for all l. Thus, all constant values of p∗l satisfy (8) with ψl from (9). A special example is that the p∗l are all equal, i.e., A∗ = c2 A for some constant c2. Because one can scale W and V by 1/c2 in (4) without changing the results, A∗ is equivalent to A in this case.

The upper bound in (8) only becomes active in highly unbalanced graphs where there exists a dominating group l̂ such that √(dl̂) Nl̂ ≫ √(dl) Nl for all other l. Then the upper bound of p∗l̂ is much smaller than those for the other p∗l. Therefore, the columns of A∗ that correspond to group l̂ are scaled down more significantly than other columns, indicating that the impact of group l̂ is reduced more significantly than that of other groups in A∗. Therefore, the takeaway is that graph topology sampling reduces the impact of dominating nodes more than other nodes, resulting in a more balanced A∗ compared with A.

(2) The number of sampled nodes shall satisfy

Sl/Nl ≥ ( 1 + c1 poly(ε)/(L p∗l ψl) )^(−1),  ∀l ∈ [L],    (10)

where ε is a small positive value. The sampling requirement in (10) has two takeaways. First, the higher-degree groups shall be sampled more frequently than the lower-degree groups. To see this, consider a special case that p∗l = 1 and Nl = N/L for all l. Then (10) indicates that Sl is larger in a group l with a larger dl. This intuition is the same as FastGCN (Chen et al., 2018a), which also samples high-degree nodes with a higher probability in many cases. Therefore, the insights from our graph sampling method also apply to other sampling methods such as FastGCN. We will show the connection to FastGCN empirically in Section 4.2. Second, reducing the number of samples in group l corresponds to reducing the impact of group l in A∗. To see this, note that decreasing p∗l reduces the right-hand side of (10).

Algorithm 1 Training with SGD and graph topology sampling
1: Input: Normalized adjacency matrix A, node features X, known node labels in Ω, the step size η, the number of inner iterations Tw, the number of outer iterations T, σw, σv, λw, λv.
2: Initialize W(0), V(0), B1, B2, C.
3: W0 = 0, V0 = 0.
4: for t = 0, 1, · · ·, T − 1 do
5:   Apply noisy SGD with step size η on the stochastic objective L̂Ω(λt; W, V) in (11) for Tw steps. To generate the stochastic objective in each step s, randomly sample a batch of labeled nodes Ωs from Ω; generate As using graph sampling; randomly generate Wρ, Vρ and Σ. Let the starting point be W = Wt, V = Vt and suppose it reaches Wt+1 and Vt+1.
6:   λt+1 = λt · (1 − η).
7: end for
8: Output: W(out) = √λT−1 (W(0) + Wρ + WT Σ),  V(out) = √λT−1 (V(0) + Vρ + Σ VT).

3.3. The Algorithmic Framework of Training GCNs

Because (5) is non-convex, solving it directly using SGD can get stuck at a bad local minimum in theory. The main idea in the theoretical analysis to address this non-convexity is to add weight decay and regularization in the objective of (5) such that, with a proper regularization, any second-order critical point is almost a global minimum.

Specifically, for initialization, entries of W(0) are i.i.d. from N(0, 1/m1), and entries of V(0) are i.i.d. from N(0, 1/m2). B1 (or B2) is initialized to be an all-one vector multiplying a row vector with i.i.d. samples from N(0, 1/m1) (or N(0, 1/m2)). Entries of C are drawn i.i.d. from N(0, 1).

In each outer loop t = 0, ..., T − 1, we use noisy SGD³ with step size η for Tw iterations to minimize the stochastic objective function L̂Ω in (11) with some fixed λt−1, where λ0 = 1, and the weight decays with λt+1 = (1 − η)λt:

L̂Ω(λt; W, V) = LΩ( √λt (W(0) + Wρ + W Σ), √λt (V(0) + Vρ + Σ V) ) + λw ‖√λt W‖^4_{2,4} + λv ‖√λt V‖²_F.    (11)

³ Noisy SGD is vanilla SGD plus Gaussian perturbation. It is a common trick in the theoretical analyses of non-convex optimization (Ge et al., 2015) and is not needed in practice.

L̂Ω(λt; W, V) is stochastic because in each inner iteration s, (1) we randomly sample a subset Ωs of labeled nodes; (2) we randomly sample As from the graph topology sampling method in Section 3.2; (3) Wρ and Vρ are small perturbation matrices with entries i.i.d. drawn from N(0, σw²) and N(0, σv²), respectively; and (4) Σ ∈ R^(m1×m1) is a random diagonal matrix with diagonal entries uniformly drawn from {1, −1}. Wρ and Vρ are standard Gaussian smoothing in the literature of theoretical analyses of non-convex optimization, see, e.g., (Ge et al., 2015), and are not needed in practice. Σ is similar to the practical Dropout (Srivastava et al., 2014) technique that randomly masks out neurons and is also introduced for the theoretical analysis only.

The last two terms in (11) are additional regularization terms for some positive λw and λv. As shown in (Allen-Zhu et al., 2019), ‖·‖_{2,4} is used in the analysis to drive the weights to be evenly distributed among neurons. The practical regularization ‖·‖_F has the same effect in empirical results, while the theoretical justification is open.

Algorithm 1 summarizes the algorithm with the parameter selections in Table 1. Let W(out) and V(out) denote the returned weights. We use FA∗(ei, X; W(out), V(out)) to predict the label of node i. This might sound different from the conventional practice, which uses A in predicting unknown labels. However, note that A∗ only differs from A by a column-wise scaling, as from (6). Moreover, A∗ can be set as A in many practical datasets based on our discussion after (9). Here we use the general form of A∗ for the purpose of analysis.

We remark that our framework of algorithm and analysis can be easily applied to the simplified setup of two-layer GCNs. The resulting algorithm is much simplified to vanilla SGD plus graph topology sampling. All the additional components above are introduced to address the non-convex interaction of W and V theoretically and may not be needed for practical implementation. We skip the discussion of two-layer GCNs in this paper.

Table 1. Parameter choices for Algorithm 1
λv = 2ε0 m2 / m1^(1−0.01)                    σv = 1/m2^(1/2+0.01)
λw = 2ε0^4 m1^(3−0.002) / C0^4               σw = 1/m1^(1−0.01)
C = Cε(φ, ‖A∗‖∞) √(‖A∗‖²∞ + 1)               C0 = 10 C √p2
C′′ = Cε(Φ, C0) √(‖A∗‖²∞ + 1)                C0 = Õ(p1² p2 K² C C′′)

3.4. Generalization Guarantee

Our formal generalization analysis shows that our learning method returns a GCN model that approaches the minimum prediction error that can be achieved by the best function in a large concept class of target functions, which has two important properties: (1) the prediction error decreases as the size of the function class increases; and (2) the concept class uses A∗ in (6) as the adjacency matrix of the graph topology. Therefore, the result implies that if A∗ accurately captures the correlations among node features and labels, the learned GCN model can achieve a small prediction error on unknown labels. Moreover, no other function in a large concept class can perform better than the learned GCN model. To formalize the results, we first define the target functions as follows.

Concept class and target function F∗. Consider a concept class consisting of target functions F∗ : R^N × R^(N×d) → R^(1×K):

F∗_{A∗}(eg, X) = eg⊤ A∗ ( Φ(r1) ⊙ r2 ) C∗,
r1 = A∗ φ1(A∗ X W∗1) V∗1,    (12)
r2 = A∗ φ2(A∗ X W∗2) V∗2,

where ⊙ denotes the entry-wise product and φ1, φ2, Φ : R → R are all infinite-order smooth⁴. The parameters W∗1, W∗2 ∈ R^(d×p2), V∗1, V∗2 ∈ R^(p2×p1), C∗ ∈ R^(p1×k) satisfy that every column of W∗1, W∗2, V∗1, V∗2 has unit norm, and the maximum absolute value of C∗ is at most 1. The effective adjacency matrix A∗ is defined in (6). Define

Cε(φ, R) = max{ Cε(φ1, R), Cε(φ2, R) },    (13)
Cs(φ, R) = max{ Cs(φ1, R), Cs(φ2, R) }.    (14)

We focus on target functions where the function complexities Cε(Φ, R), Cs(Φ, R), Cε(φ, R), Cs(φ, R), defined in (2)-(3) and (13)-(14), as well as p1 and p2, are all bounded.

(12) is more general than GCNs. If r2 is a constant matrix, (12) models a GCN, where W∗1 and V∗1 are the weight matrices in the first and second layers, respectively, and φ1 and Φ are the activation functions in each layer.

Modeling the prediction error of unknown labels. We will show that the GCN learned by our method performs almost the same as the best function in the concept class in (12) in predicting unknown labels. Because practical datasets usually contain noise in features and labels, we employ a probabilistic model to model the data. Note that our result is distribution-free, and the following distributions are introduced for the presentation of the results.

⁴ When Φ is operated on a matrix r1, Φ(r1) means applying Φ to each entry of r1. In fact, our results still hold for a more general case where a different function Φj is applied to every entry of the jth column of r1, j ∈ [p2]. We keep the simpler model to have a more compact representation. Similar arguments hold for φ1, φ2.
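A short numpy sketch of evaluating one member of the concept class (12), with ⊙ read as the entry-wise product as reconstructed above; the particular smooth functions and the random parameters below are only examples:

```python
import numpy as np

def target_class_output(A_star, X, W1, V1, W2, V2, C_star,
                        phi1=np.sin, phi2=np.tanh, Phi=np.cos):
    """Evaluate F*_{A*}(e_g, X) of (12) for all g at once (row g of the result)."""
    r1 = A_star @ phi1(A_star @ X @ W1) @ V1     # N x p1
    r2 = A_star @ phi2(A_star @ X @ W2) @ V2     # N x p1
    return A_star @ (Phi(r1) * r2) @ C_star      # N x K, entry-wise product then C*

rng = np.random.default_rng(1)
N, d, p1, p2, K = 10, 6, 3, 4, 2
A_star = rng.random((N, N)) / N
X = rng.normal(size=(N, d))
W1, W2 = rng.normal(size=(d, p2)), rng.normal(size=(d, p2))
V1, V2 = rng.normal(size=(p2, p1)), rng.normal(size=(p2, p1))
C_star = rng.uniform(-1, 1, size=(p1, K))        # entries bounded by 1 in magnitude
print(target_class_output(A_star, X, W1, V1, W2, V2, C_star).shape)   # (N, K)
```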

Specifically, let Dx̃n denote the distribution from which the feature x̃n of node n is drawn. For example, when the noise level is low, Dx̃n can be a distribution centered at the observed feature of node n with a small variance. Similarly, let Dyn denote the distribution from which the label yn of node n is drawn. Let eg be uniformly selected from {ei}_{i=1}^N ∈ R^N. Let D denote the concatenation of these distributions of a data point

z = (eg, X, y) ∈ R^N × R^(N×d) × Y.    (15)

Then the given feature matrix X and the partial labels in Ω can be viewed as |Ω| identically distributed but correlated samples from D. The correlation results from the fact that the label of node i depends on not only the feature of node i but also the neighboring features. This model of correlated samples is different from the conventional assumption of i.i.d. samples in supervised learning and makes our analyses more involved.

Let

OPT_{A∗} = min_{W∗1, W∗2, V∗1, V∗2, C∗} E_{(eg,X,y)∼D} L( F∗_{A∗}(eg, X), y )    (16)

be the smallest population risk achieved by the best target function (over the choices of W∗1, W∗2, V∗1, V∗2, C∗) in the concept class F∗_{A∗} in (12). OPT_{A∗} measures the average loss of predicting the unknown labels if the estimates are computed using the best target function in (12). Clearly, OPT_{A∗} decreases as the size of the concept class increases, i.e., when p1 and p2 increase. Moreover, if A∗ indeed models the node correlations accurately, OPT_{A∗} can be very small, indicating a desired generalization performance. We next show that the population risk of the GCN model learned by our method can be arbitrarily close to OPT_{A∗}.

Theorem 3.1. For every γ ∈ (0, 1/4], every ε0 ∈ (0, 1/100], and every ε ∈ (0, (K p1 p2 Cs(Φ, √p2 Cs(φ, O(1))) Cs(φ, O(1)) ‖A∗‖²∞)^(−1) ε0), as long as

m1 = m2 = m ≥ poly( Cε(Φ, Cε(φ, O(1))), p2, ‖A∗‖∞, 1/ε ),    (17)

|Ω| ≥ Θ( ε0^(−2) ‖A∗‖∞^8 K^6 ( 1 + p1^4 p2^5 Cε(Φ, √p2 Cε(φ, O(1))) Cε(φ, O(1))^4 (‖A∗‖∞ + 1)^4 ) (1 + δ) log N log m ),    (18)

and (8) and (10) hold, there is a choice η = 1/poly(‖A∗‖∞, K, m) and T = poly(‖A∗‖∞, K, m) such that with probability at least 0.99,

E_{(eg,X,y)∼D} L( FA∗(eg, X; W(out), V(out)), y ) ≤ (1 + γ) OPT_{A∗} + ε0,    (19)

where A∗ is the effective adjacency matrix in (12).

Theorem 3.1 shows that the required sample complexity is polynomial in ‖A∗‖∞ and δ, where δ is the maximum node degree without self-connections in A. Note that condition (8) implies that ‖A∗‖∞ is O(1). Then, as long as δ is O(N^α) for some small α in (0, 1), say α = 1/5, one can accurately infer the unknown labels from a small percentage of labeled nodes. Moreover, our sample complexity is sufficient but not necessary. It is possible to achieve a desirable generalization performance if the number of labeled nodes is less than the bound in (18).

Graph topology sampling affects the generalization performance through A∗. From the discussion in Section 3.2, graph sampling reduces the node correlation in A∗, especially for dominating nodes. The generalization performance does not degrade when OPT_{A∗} is small, i.e., the resulting A∗ is sufficient to characterize the node correlation in a given dataset. That explains the empirical success of graph sampling in many datasets.

4. Numerical Results

To unveil how our theoretical results are aligned with GCNs' generalization performance in experiments, we focus on numerical evaluations on synthetic data where we can control the target functions and compare with A∗ explicitly. We also evaluate both our graph sampling method and FastGCN (Chen et al., 2018a) to validate that the insights for our graph sampling method also apply to FastGCN.

We generate a graph G with N = 2000 nodes. G has two degree groups. Group 1 has N1 nodes, and every node degree approximately equals d1. Group 2 has N2 nodes, and every node degree approximately equals d2. The edges between nodes are randomly selected. A is the normalized adjacency matrix of G.

The node labels are generated by the target function

y = ( sin(Â X W∗) ⊙ tanh(Â X W∗) ) C∗,    (20)

where Â ∈ R^(N×N), X ∈ R^(N×d), W∗ ∈ R^(d×p) and C∗ ∈ R^(p×K). The feature dimension d = 10, p = 10, and K = 2. X, W∗ and C∗ are all randomly generated with each entry i.i.d. from N(0, 1).

We consider a regression task with the ℓ2-regression loss function. A three-layer GCN as defined in (4) with m neurons in each hidden layer is trained on a randomly selected set Ω of labeled nodes. The rest N − |Ω| labels are used for testing. The learning rate is η = 10^(−3). The mini-batch size is 5, and the dropout rate is 0.4. The total number of iterations is T·Tw = 4|Ω|. Our graph topology sampling method samples S1 = 0.9N1 and S2 = 0.9N2 nodes for both groups in each iteration.
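The synthetic setup can be sketched in a few lines. The paper only states that edges are randomly selected, so the expected-degree (Chung-Lu style) random graph below is our own choice for matching the target degrees d1 and d2; everything else follows (20) and the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, p, K = 2000, 10, 10, 2
N1, N2, d1, d2 = 100, 1900, 10, 1
group = np.array([0] * N1 + [1] * N2)

# random graph whose expected degrees roughly match d1 (group 1) and d2 (group 2):
# P(edge i-j) = d_i d_j / sum_k d_k, a standard expected-degree construction
deg_target = np.where(group == 0, d1, d2).astype(float)
prob = np.clip(np.outer(deg_target, deg_target) / deg_target.sum(), 0.0, 1.0)
adj = (rng.random((N, N)) < prob).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T                  # symmetric, no self-loops yet

a_tilde = adj + np.eye(N)                                 # add self-connections
d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
A = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]   # A = D^{-1/2} A~ D^{-1/2}

# labels from the target function (20); here A_hat = A for simplicity
X = rng.normal(size=(N, d))
W_star = rng.normal(size=(d, p))
C_star = rng.normal(size=(p, K))
A_hat = A
Z = A_hat @ X @ W_star
Y = (np.sin(Z) * np.tanh(Z)) @ C_star                     # N x K regression targets
```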

4.1. Sample Complexity and Neural Network Width with respect to ‖A∗‖∞

We fix N1 = 100, N2 = 1900 and vary A by changing the node degrees d1 and d2. In the graph topology sampling method, p∗1 = 0.7 and p∗2 = 0.3. For every fixed A, the effective adjacency matrix A∗ is computed based on (6) using p∗1 and p∗2. Synthetic labels are generated based on (20) using A∗ as Â.

Figure 1 shows that the testing error decreases as the number of labeled nodes |Ω| increases, when the number of neurons per layer m is fixed at 500. Moreover, as ‖A∗‖∞ increases, the required number of labeled nodes increases to achieve the same level of testing error. This verifies our sample complexity bound in (18).

Figure 1. The testing error when |Ω| and ‖A∗‖∞ change. m = 500.

Figure 2 shows that the testing error decreases as m increases when |Ω| is fixed at 1500. Moreover, as ‖A∗‖∞ increases, a larger m is needed to achieve the same level of testing error. This verifies our bound on the number of neurons in (17).

Figure 2. The testing error when m and ‖A∗‖∞ change. |Ω| = 1500.

4.2. Graph Sampling Affects A∗

Here we fix A and the graph sampling strategy, and evaluate the prediction performance on datasets generated by (20) using different Â. We generate Â from Â = A P̂, where P̂ is a diagonal matrix with P̂ii = p̂1 for nodes i in group 1 and P̂ii = p̂2 for nodes i in group 2. We vary p̂1 and p̂2 to generate three different datasets from (20). We consider both our graph sampling method in Section 3.2 and FastGCN (Chen et al., 2018a).

In Figure 3, N1 = 100, N2 = 1900, d1 = 10, and d2 = 1. Figure 3(a) shows the testing performance of a GCN learned by Algorithm 1, where p∗1 = 0.9 and p∗2 = 0.1. The method indeed performs the best on Dataset 1, where Â is generated using p̂1 = 0.9 and p̂2 = 0.1, in which case A∗ = Â. This verifies our theoretical result that graph sampling affects A∗ in the target functions, i.e., it achieves the best performance if A∗ is the same as Â in the target function.

Figure 3. Generalization performance of learned GCNs on datasets generated from different Â by (a) our graph sampling strategy and (b) FastGCN. A is very unbalanced.

Figure 3(b) shows the performance on the same three datasets where, in each iteration of Algorithm 1, the graph sampling strategy is replaced with FastGCN (Chen et al., 2018a). The method also performs the best on Dataset 1, where Â is generated using p̂1 = 0.9 and p̂2 = 0.1. The reason is that the graph topology is highly unbalanced in the sense that √d2·N2 ≫ √d1·N1, which means group 2 has a much higher impact on other nodes than group 1 in A. Graph sampling reduces the impact of group 2 nodes more significantly than group 1 nodes, as discussed in Section 3.2.

To further illustrate this, in Figure 4 we change the graph topology by setting N1 = 1000 and N2 = 1000, and all the other settings remain the same. In this case, the graph is balanced because √d2·N2 and √d1·N1 are of the same order. We generate different datasets using the new A following the same method and evaluate the performance of both our graph sampling method and FastGCN. Both methods perform the best on Dataset 3, where Â is generated using p̂1 = 0.5 and p̂2 = 0.5. That is because, on a balanced graph, graph sampling reduces the impact of both groups equally.
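Both experiments are parameterized by ‖A∗‖∞ (the maximum absolute row sum) and, in the bound (18), by the maximum node degree δ. A tiny numpy sketch of how these quantities are computed for a given A and group-wise scalings (the toy graph below is ours):

```python
import numpy as np

def effective_adjacency(A, group, p_star):
    """A* = A P* from (6)-(7): column i of A is scaled by p*_{group[i]}."""
    return A * np.asarray(p_star)[np.asarray(group)][None, :]

def inf_norm(M):
    """||M||_inf: maximum absolute row sum, as used in Theorem 3.1."""
    return np.abs(M).sum(axis=1).max()

def max_degree_without_self_loops(adj):
    """delta in (18): maximum node degree of the original graph (no self-connections)."""
    return int(adj.sum(axis=1).max())

rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) < 0.5).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T
a_tilde = adj + np.eye(6)
dinv = 1.0 / np.sqrt(a_tilde.sum(axis=1))
A = dinv[:, None] * a_tilde * dinv[None, :]
group = [0, 0, 0, 1, 1, 1]
print(inf_norm(effective_adjacency(A, group, [0.9, 0.1])),
      max_degree_without_self_loops(adj))
```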

Figure 4. Generalization performance of learned GCNs on datasets generated from different A∗ by (a) our graph sampling strategy and (b) FastGCN. A is balanced.

5. Conclusion

This paper provides a new theoretical framework for explaining the empirical success of graph sampling in training GCNs. It quantifies the impact of graph sampling explicitly through the effective adjacency matrix and provides generalization and sample complexity analyses. One future direction is to develop active graph sampling strategies based on the presented insights and analyze their generalization performance. Other potential extensions include the construction of statistical-model-based characterizations of A∗ and their fitness to real-world data, and the generalization analysis of deep GCNs, graph auto-encoders, and jumping knowledge networks.

Acknowledgements

This work was supported by AFOSR FA9550-20-1-0122, ARO W911NF-21-1-0255, NSF 1932196 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons). We thank Ruisi Jian and Haolin Xiong at Rensselaer Polytechnic Institute for the help in formulating numerical experiments. We thank all anonymous reviewers for their constructive comments.

References

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6158–6169, 2019.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018a.

Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, pp. 942–950. PMLR, 2018b.

Chen, T., Sui, Y., Chen, X., Zhang, A., and Wang, Z. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning, pp. 1695–1706. PMLR, 2021.

Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, 2019.

Cong, W., Ramezani, M., and Mahdavi, M. On provable benefits of depth in training graph convolutional networks. Advances in Neural Information Processing Systems, 34, 2021.

Daniely, A. SGD learns the conjugate kernel class of the network. Advances in Neural Information Processing Systems, 30:2422–2430, 2017.

Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.

Fu, H., Chi, Y., and Liang, Y. Guaranteed recovery of one-hidden-layer neural networks via cross entropy. IEEE Transactions on Signal Processing, 68:3225–3235, 2020.

Garg, V., Jegelka, S., and Jaakkola, T. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pp. 3419–3430. PMLR, 2020.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842. PMLR, 2015.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, 2018.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. International Conference on Learning Representations (ICLR), 2017.

Li, H., Zhang, S., and Wang, M. Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. In 2022 56th Annual Conference on Information Sciences and Systems (CISS), pp. 37–42. IEEE, 2022.

Li, J., Zhang, T., Tian, H., Jin, S., Fardad, M., and Zafarani, R. SGCN: A graph sparsifier based on graph convolutional networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 275–287. Springer, 2020.

Liao, R., Urtasun, R., and Zemel, R. A PAC-Bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2021.

Lv, S. Generalization bounds for graph convolutional neural networks via Rademacher complexity. arXiv preprint arXiv:2102.10234, 2021.

Oono, K. and Suzuki, T. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. Advances in Neural Information Processing Systems, 33, 2020.

Peng, N., Poon, H., Quirk, C., Toutanova, K., and Yih, W.-t. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115, 2017.

Ramezani, M., Cong, W., Mahdavi, M., Sivasubramaniam, A., and Kandemir, M. GCN meets GPU: Decoupling "when to sample" from "how to sample". Advances in Neural Information Processing Systems, 33:18482–18492, 2020.

Sanchez-Gonzalez, A., Heess, N., Springenberg, J. T., Merel, J., Riedmiller, M., Hadsell, R., and Battaglia, P. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning, pp. 4470–4479. PMLR, 2018.

Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Van den Berg, R., Kipf, T. N., and Welling, M. Graph convolutional matrix completion. In KDD, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations (ICLR), 2018.

Verma, S. and Zhang, Z.-L. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1539–1548, 2019.

Wang, X., Ye, Y., and Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857–6866, 2018.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations (ICLR), 2019.

Xu, K., Zhang, M., Jegelka, S., and Kawaguchi, K. Optimization of graph neural networks: Implicit acceleration by skip connections and more depth. In International Conference on Machine Learning. PMLR, 2021.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983, 2018.

Zhang, S., Wang, M., Liu, S., Chen, P.-Y., and Xiong, J. Fast learning of graph neural networks with guaranteed generalizability: One-hidden-layer case. arXiv preprint arXiv:2006.14117, 2020.

Zheng, C., Zong, B., Cheng, W., Song, D., Ni, J., Yu, W., Chen, H., and Wang, W. Robust graph representation learning via neural sparsification. In International Conference on Machine Learning, pp. 11458–11468. PMLR, 2020.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 4140–4149. JMLR.org, https://arxiv.org/abs/1706.03175, 2017.
Zhou, X. and Wang, H. The generalization error of graph
convolutional networks may enlarge with more layers.
Neurocomputing, 424:97–106, 2021.

Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y., and Gu,
Q. Layer-dependent importance sampling for training
deep and large graph convolutional networks. Advances
in Neural Information Processing Systems, 32:11249–
11259, 2019.

A. Preliminaries

Lemma A.1. ‖ãn X‖ ≤ ‖A‖∞.

Proof:

‖ãn X‖ = ‖ Σ_{k=1}^N an,k x̃k ‖
       = ‖ Σ_{k=1}^N ( an,k / Σ_{k=1}^N an,k ) x̃k ‖ · Σ_{k=1}^N an,k
       ≤ Σ_{k=1}^N ( an,k / Σ_{k=1}^N an,k ) ‖x̃k‖ · ‖A‖∞    (21)
       = ‖A‖∞,

where the second-to-last step is by the convexity of ‖·‖.

Lemma A.2. Given a graph G with L (≥ 1) groups of nodes, where group i with node degree di is denoted as Ni. Suppose that in iteration t, At (or any of At(1), At(2), At(3) in the general setting) is generated from the sampling strategy in Section 3.2. If the number of sampled nodes satisfies li ≥ |Ni|/(1 + c1 poly(ε)/(L p∗i Ψi)), we have

‖At − A∗‖∞ ≤ poly(ε).    (22)

Proof:

From Section 3.2, we can rewrite

ãt_{n,j} = (|Nk|/lk) p∗k An,j, if the nodes n, j are connected, j is selected, and j ∈ Nk; and ãt_{n,j} = 0 otherwise.    (23)

ã∗_{n,j} = p∗k An,j, if the nodes n, j are connected and j ∈ Nk; and ã∗_{n,j} = 0 otherwise.    (24)

Let A∗ = (ã∗1⊤, ã∗2⊤, · · ·, ã∗N⊤)⊤. Since we need Σ_{j=1}^N A∗n,j ≤ O(1), we require

Σ_{j∈Ni} p∗i An,j ≤ O(1/L)  for any i ∈ [L], n ∈ [N].    (25)

We first roughly compute the ratio of edges by which one node is connected to the nodes in another group. A node with degree deg(i) has deg(i) − 1 open edges except the self-connection. Hence, the group with degree deg(j) has (deg(j) − 1)|Nj| open edges except self-connections in total. Therefore, the ratio of the edges connected to group j to those of all groups is

(deg(j) − 1)|Nj| / Σ_{l=1}^L (deg(l) − 1)|Nl| ≈ dj|Nj| / Σ_{l=1}^L dl|Nl|.    (26)

Define

Ψ(n, i) = √(dn/di) · di|Ni| / Σ_{l=1}^L dl|Nl|.    (27)

Then, as long as

Σ_{j∈Ni} p∗i An,j ≈ p∗i · (1/√(di dn)) · ( di|Ni| / Σ_{l=1}^L dl|Nl| ) · dn ≲ p∗i Ψ(n, i) ≤ O(1/L),    (28)

i.e.,

p∗i ≤ c1 / ( L · max_{n∈[L]} {Ψ(n, i)} ) = c1 / ( L · Ψ(L, i) ) = (c1/L) √(di/dL) · Σ_{l=1}^L dl|Nl| / ( di|Ni| ),    (29)

for some constant c1 > 0, we can obtain that ‖A∗‖∞ ≤ O(1).
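A quick numerical illustration of the quantity bounded in Lemma A.2, ‖At − A∗‖∞ (the maximum absolute row sum of the difference), under the group-wise sampling of Section 3.2; the sizes and scalings below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
group = np.array([0] * 40 + [1] * 160)
N_l = np.array([40, 160])
S_l = np.array([36, 144])                       # sample 90% of each group
p_star = np.array([0.7, 0.3])

A = rng.random((N, N)) / N                      # stand-in normalized adjacency
A_star = A * p_star[group][None, :]             # A* = A P*

def sampled_A():
    diag = np.zeros(N)
    for l in range(2):
        members = np.where(group == l)[0]
        kept = rng.choice(members, size=S_l[l], replace=False)
        diag[kept] = p_star[l] * N_l[l] / S_l[l]
    return A * diag[None, :]

gaps = [np.abs(sampled_A() - A_star).sum(axis=1).max() for _ in range(200)]
print(np.mean(gaps))   # stays small when S_l/N_l is large enough, cf. (22)
```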

Since

Σ_{j∈Sk} An,j ≈ (1/√(di dn)) · ( di|Ni| / Σ_{l=1}^L dl|Nl| ) · dn · ( lk/|Nk| ) ≈ ( lk/|Nk| ) Σ_{j∈Nk} An,j,    (30)

Σ_{j∉Sk} An,j ≈ (1/√(di dn)) · ( di|Ni| / Σ_{l=1}^L dl|Nl| ) · dn · ( 1 − lk/|Nk| ) ≈ ( 1 − lk/|Nk| ) Σ_{j∈Nk} An,j,    (31)

the difference between ãtn and ã∗n can then be derived as

‖ãtn − ã∗n‖1 = Σ_{k=1}^L Σ_{j∈Sk} An,j p∗k ( |Nk|/lk − 1 ) + Σ_{k=1}^L Σ_{j∉Sk} An,j p∗k
             ≲ Σ_{k=1}^L ( p∗k ( |Nk|/lk − 1 ) ( lk/|Nk| ) Σ_{j∈Nk} An,j + ( 1 − lk/|Nk| ) p∗k Σ_{j∈Nk} An,j )
             ≲ poly(ε) Σ_{k=1}^L ( 1/(L Ψ(L, k)) ) Σ_{j∈Nk} An,j
             := poly(ε) Γ(A∗),    (32)

where the first inequality is by (30), (31), and the second inequality holds as long as li ≥ |Ni|/(1 + c1 poly(ε)/(L p∗i Ψ(L, i))). Combining (29), we have

Σ_{i=1}^L p∗i Σ_{j∈Ni} An,j ≲ Σ_{i=1}^L ( 1/(L Ψ(L, i)) ) Σ_{j∈Ni} An,j = Γ(A∗) ≤ O(1).    (33)

Hence, (32) can be bounded by poly(ε).

A.1. Symmetric graph sampling method

We provide and discuss a symmetric graph sampling method in this section. The insights behind this version of the sampling strategy are the same as in Section 3.2.

Similar to the asymmetric construction in Section 3.2, we consider a group-wise uniform sampling strategy, where Sl nodes are sampled uniformly from Nl nodes. For all unsampled nodes, we set the corresponding diagonal entries of a diagonal matrix Ps to be zero. If node i is sampled in this iteration and belongs to group l, for any i and l, the ith diagonal entry of Ps is set as √(p∗l Nl/Sl) for some non-negative constant p∗l. Then As = Ps A Ps.

Based on this symmetric graph sampling method, we define the effective adjacency matrix as

A∗ = P∗ A P∗,    (34)

where P∗ is a diagonal matrix defined as

P∗ii = √(p∗l)  if node i belongs to degree group l.    (35)

The scaling factor p∗l should satisfy

0 ≤ p∗l ≤ c2 / ( L² ψl² ),  ∀l,    (36)

for a positive constant c2 that can be sufficiently large. ψl is defined in (9). The number of sampled nodes shall satisfy

Sl/Nl ≥ ( 1 + c2 poly(ε)/(L √(p∗l) ψl) )^(−2),  ∀l ∈ [L],    (37)

where ε is a small positive value.
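The symmetric variant can be sketched in the same way; the snippet below builds As = Ps A Ps and A∗ = P∗ A P∗ from (34)-(35) and prints the ∞-norm gap that Lemma A.3 bounds (group ids and constants are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
group = np.array([0] * 4 + [1] * 8)
S = np.array([3, 6])
p_star = np.array([0.7, 0.3])
A = rng.random((N, N)) / N                      # stand-in normalized adjacency

def symmetric_sampled_A():
    diag = np.zeros(N)
    for l in range(2):
        members = np.where(group == l)[0]
        kept = rng.choice(members, size=S[l], replace=False)
        diag[kept] = np.sqrt(p_star[l] * len(members) / S[l])
    return diag[:, None] * A * diag[None, :]    # A_s = P_s A P_s

P_diag = np.sqrt(p_star[group])
A_star = P_diag[:, None] * A * P_diag[None, :]  # A* = P* A P*, (34)-(35)
print(np.abs(symmetric_sampled_A() - A_star).sum(axis=1).max())   # gap in (38)
```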

Lemma A.3. Given a graph G with L (≥ 1) groups of nodes, where group i with node degree di is denoted as Ni. Suppose At (or any of At(1), At(2), At(3) in the general setting) is generated from the sampling strategy in Section A.1. If the number of sampled nodes satisfies li ≥ |Ni|/(1 + c2 poly(ε)/(L √(p∗i) Ψi))², then we have

‖At − A∗‖∞ ≤ poly(ε).    (38)

Proof:

From Section A.1, we can rewrite

ãt_{n,j} = √( |Nk||Nu| / (lk lu) ) √(p∗k p∗u) An,j, if the nodes n, j are connected, j is selected, and n ∈ Nu, j ∈ Nk; and ãt_{n,j} = 0 otherwise.    (39)

ã∗_{n,j} = √(p∗k p∗u) An,j, if the nodes n, j are connected and n ∈ Nu, j ∈ Nk; and ã∗_{n,j} = 0 otherwise.    (40)

Then A∗ = (ã∗1⊤, ã∗2⊤, · · ·, ã∗N⊤)⊤. Then, for n ∈ Nu, as long as

Σ_{j∈Ni} √(p∗i p∗u) An,j ≈ √(p∗u p∗i) · (1/√(di dn)) · ( di|Ni| / Σ_{l=1}^L dl|Nl| ) · dn ≲ √(p∗u p∗i) Ψ(n, i) ≤ √(p∗i) Ψ(n, i) ≤ O(1/L),    (41)

i.e.,

√(p∗i) ≤ c2 / ( L · max_{n∈[L]} {Ψ(n, i)} ) = c2 / ( L · Ψ(L, i) ) = (c2/L) √(di/dL) · Σ_{l=1}^L dl|Nl| / ( di|Ni| ),    (42)

for some constant c2 > 0, we can obtain that ‖A∗‖∞ ≤ O(1).

The difference between ãtn and ã∗n can then be derived as

‖ãtn − ã∗n‖1 = Σ_{k=1}^L Σ_{j∈Sk} An,j √(p∗u p∗k) ( √( |Nk||Nu| / (lk lu) ) − 1 ) + Σ_{k=1}^L Σ_{j∉Sk} An,j √(p∗u p∗k)
             ≈ Σ_{k=1}^L Σ_{j∈Nk} An,j √(p∗u p∗k) ( √( |Nk||Nu| / (lk lu) ) − 1 ) ( lk/|Nk| ) + Σ_{k=1}^L Σ_{j∈Nk} An,j √(p∗u p∗k) ( 1 − lk/|Nk| )
             ≲ poly(ε),    (43)

as long as li ≥ |Ni|/(1 + c2 poly(ε)/(L √(p∗i) Ψ(L, i)))².

B. Node classification for three layers

In the whole proof, we consider a more general target function compared to (12). We write F∗ : R^N × R^(N×d) → R^K:

F∗_{A∗} = (f∗1, f∗2, · · ·, f∗K),

f∗r(eg, X) = eg⊤ Σ_{k∈[p1]} c∗_{k,r} Φ( A∗ Σ_{j∈[p2]} v∗_{1,k,j} φ1,j(A∗ X w∗_{1,j}) ) ⊙ ( A∗ Σ_{l∈[p2]} v∗_{2,k,l} φ2,l(A∗ X w∗_{2,l}) ),  ∀r ∈ [K],    (44)

where ⊙ denotes the entry-wise product, and each φ1,j, φ2,j, Φi : R → R is infinite-order smooth.

Table 2 shows some important notations used in our theorem and algorithm. Table 3 gives the full parameter choices for the three-layer GCN. poly(log(m1 m2)) in the following analysis.

Table 2. Summary of notations

G = {V, E}: G is an undirected graph consisting of a set of nodes V and a set of edges E.
N: The total number of nodes in the graph.
A = D^{-1/2} Ã D^{-1/2}: A ∈ R^{N×N} is the normalized adjacency matrix computed from the degree matrix D and the initial adjacency matrix Ã.
A^*: The effective adjacency matrix.
A^t: The sampled adjacency matrix obtained with our sampling strategy in Section 3.2 at the t-th iteration.
e_g, X, y_n: e_g belongs to {e_i}_{i=1}^N and selects the index of the labeled node. X ∈ R^{N×d} is the feature matrix. y_n is the label of the n-th node.
m_1, m_2: The numbers of neurons in the first and second hidden layers, respectively.
W, V, B_1, B_2: W and V are the weight matrices of the first and second hidden layers, respectively. B_1 and B_2 are the corresponding bias matrices.
W^{(0)}, V^{(0)}: Random initializations of W and V, respectively.
W^ρ, V^ρ: Two random matrices used for Gaussian smoothing.
Σ: The random diagonal ±1 matrix used for the Dropout technique.
Ω, Ω^t: Ω is the set of labeled nodes and Ω^t is the batch of labeled nodes at the t-th iteration.
T, T_w, η, λ_t: In Algorithm 1, T is the number of outer iterations for the weight-decay step, while T_w is the number of inner iterations for the SGD steps. η is the step size and λ_t is the weight-decay coefficient at the t-th iteration.
L, d_l, N_l: L is the number of node groups in the graph. d_l is the order-wise degree in the l-th group. N_l is the number of nodes in group l.
S_l: The number of nodes we sample from group l.

Table 3. Full parameter choices for the three-layer GCN

$\tau_v^0$ : $m_1^{1/2-0.005}/(\sqrt{\epsilon_0}\, m_2^{1/2})$
$\tau_w^0$ : $C_0/(\epsilon_0^{1/4}\, m_1^{3/4-0.005})$
$\tau_v$ : $m_1^{1/2-0.001}/m_2^{1/2} \gg \tau_v^0$
$\tau_w$ : $1/m_1^{3/4-0.01} \gg \tau_w^0$
$\lambda_v$ : $\epsilon^2/(\tau_v^0)^2$
$\lambda_w$ : $\epsilon^2/(\tau_w^0)^4$
$\sigma_v$ : $1/m_2^{1/2+0.01}$
$\sigma_w$ : $1/m_1^{1-0.01}$
$C_\epsilon$ : $C_\epsilon(\phi, \|A\|_\infty)\sqrt{\|A\|_\infty^2+1}$
$C'$ : $10\, C_\epsilon\sqrt{p_2}$
$C''$ : $C_\epsilon(\Phi, C')\sqrt{\|A\|_\infty^2+1}$
$C_0$ : $\tilde O(p_1^2 p_2 K^2 C_\epsilon C'')$
$c_\epsilon$ : $\mathrm{poly}(\log(m_1 m_2))$

B.1. Lemmas
B.1.1. FUNCTION APPROXIMATION
To show that the target function can be learned by the learner network with the ReLU activation, a good approach is to first find a function $h(\cdot)$ such that the $\phi$ functions in the target function can be approximated by $h(\cdot)$ combined with an indicator function. In this section, Lemma B.1 provides the existence of such an $h(\cdot)$. Lemmas B.2 and B.3 are two supporting lemmas used to prove Lemma B.1.
Lemma B.1. For every smooth function $\phi$ and every $\epsilon \in \big(0, \frac{1}{C_\epsilon(\phi,a)\sqrt{a^2+1}}\big)$, there exists a function $h: \mathbb{R}^2 \to [-C_\epsilon(\phi,a)\sqrt{a^2+1},\, C_\epsilon(\phi,a)\sqrt{a^2+1}]$ that is also $C_\epsilon(\phi,a)\sqrt{a^2+1}$-Lipschitz continuous in its first coordinate, with the following (equivalent) properties:

(a) For every $x_1 \in [-a, a]$ where $a > 0$:
$$\Big|\mathbb{E}\big[\mathbf{1}_{\alpha_1 x_1+\beta_1\sqrt{a^2-x_1^2}+b_0\ge 0}\, h(\alpha_1, b_0)\big] - \phi(x_1)\Big| \le \epsilon,$$
where $\alpha_1, \beta_1, b_0 \sim N(0,1)$ are independent random variables.

(b) For every $w^*, x \in \mathbb{R}^d$ with $\|w^*\|_2 = 1$ and $\|x\| \le a$:
$$\Big|\mathbb{E}\big[\mathbf{1}_{w^\top x+b_0\ge 0}\, h(w^\top w^*, b_0)\big] - \phi(w^{*\top} x)\Big| \le \epsilon,$$
where $w \sim N(0, I)$ is a $d$-dimensional Gaussian and $b_0 \sim N(0,1)$.
Furthermore, we have $\mathbb{E}_{\alpha_1, b_0\sim N(0,1)}[h(\alpha_1, b_0)^2] \le (C_s(\phi,a))^2(a^2+1)$.

(c) For every $w^*, x \in \mathbb{R}^d$ with $\|w^*\|_2 = 1$, let $\tilde w = (w, b_0) \in \mathbb{R}^{d+1}$ and $\tilde x = (x, 1) \in \mathbb{R}^{d+1}$ with $\|\tilde x\| \le \sqrt{a^2+1}$. Then we have
$$\Big|\mathbb{E}\big[\mathbf{1}_{\tilde w^\top \tilde x\ge 0}\, h(\tilde w[1:d]^\top w^*, \tilde w[d+1])\big] - \phi(w^{*\top} \tilde x[1:d])\Big| \le \epsilon,$$
where $\tilde w \sim N(0, I_{d+1})$ is a $(d+1)$-dimensional Gaussian.
We also have $\mathbb{E}_{\tilde w\sim N(0,I_{d+1})}[h(\tilde w[1:d]^\top w^*, \tilde w[d+1])^2] \le (C_s(\phi,a))^2(a^2+1)$.
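The existence statement in Lemma B.1 can be probed numerically. The sketch below is my own illustration under extra assumptions: $h$ is parameterized as a degree-4 polynomial in its two arguments and fitted by least squares, instead of the Hermite construction used in the proof, and $\phi=\tanh$, $a=1$ are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, a, M, deg = 5, 1.0, 100_000, 4
phi = np.tanh                                  # smooth target phi
w_star = np.zeros(d); w_star[0] = 1.0          # ||w*||_2 = 1

W = rng.standard_normal((M, d))                # w ~ N(0, I_d)
b0 = rng.standard_normal(M)                    # b_0 ~ N(0, 1)
s = W @ w_star                                 # first argument of h: w^T w*
basis = np.stack([s ** p * b0 ** q for p in range(deg + 1) for q in range(deg + 1)], axis=1)

def indicator_features(x):
    """Empirical E[1_{w^T x + b_0 >= 0} * s^p * b_0^q] over the monomial basis of h."""
    act = (W @ x + b0 >= 0).astype(float)
    return act @ basis / M

def rand_x():
    v = rng.standard_normal(d)
    return a * rng.random() * v / np.linalg.norm(v)      # ||x||_2 <= a

xs = [rand_x() for _ in range(80)]
F = np.stack([indicator_features(x) for x in xs])
y = np.array([phi(w_star @ x) for x in xs])
coef, *_ = np.linalg.lstsq(F, y, rcond=None)             # coefficients of the polynomial h

x_test = rand_x()
print("phi(w*.x):", phi(w_star @ x_test),
      " indicator-based approximation:", indicator_features(x_test) @ coef)
```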

Proof:
Firstly, since we can assume w∗ = (1, 0, · · · , 0) without loss of generality by rotating x and w, it can be derived that x, w,
w∗ are equivalent to that they
p are two-dimensional. Therefore, proving Lemma B.1b suffices in showing Lemma B.1a.
Let w0 = (α, β), x = (x1 , t2 − x21 ) where α and β are independent. Following the idea of Lemma 6.3 in (Allen-Zhu et al.,
p ⊥
2019), we use another randomness as an alternative, i.e., we write x⊥ = ( t2 − x21 , −x1 ), w0 = α xt + β xt ∼ N (0, I).
q
x2
Then w0 X = tα. Let α1 = w01 = α xt1 + β 1 − t21 , where α, β ∼ N (0, 1). Hence, α1 ∼ N (0, 1).
We first use Lemma B.2 to fit φ(x1 ). By Taylor expansion, we have

X ∞
X
φ(x1 ) = c0 + ci xi1 + ci xi1
i=1, odd i i=2, even i

(45)
X
= c0 + c0i Eα,β∼N (0,1) [hi (α1 )1[qi (b0 )]1[w0 X + b0 ≥ 0]]
i=1

where hi (·) is the Hermite polynomial defined in Definition A.5 in (Allen-Zhu et al., 2019), and
√ (
0 ci 0 200i2 |ci | t2 + 1 |b0 | ≤ t/(2i), i is odd
ci = 0 , |ci | ≤ 1−i
and qi (b0 ) = (46)
pi (i − 1)!! t 0 < −b0 ≤ t/(2i), i is even

1
q √
2 +1
Let Bi = 100i 2 + 10 log( 1 tt1−i ). Define ĥi (α1 ) = hi (α1 ) · 1[|α1 | ≤ Bi ] + hi (sign(α1 )Bi ) · 1[|α1 | > Bi ] as the
truncated version of the Hermite polynomial. Then we have

X
φ(x1 ) = c0 + R(x1 ) + c0i Eα,β∼N (0,1) [ĥi (α1 )1[qi (b0 )]1[w0 X + b0 ≥ 0]],
i=1

where

X h i
c0i Eα,β∼N (0,1) hi (α1 ) · 1[|α1 | > Bi ] − hi (sign(α1 )Bi · 1[|α| > Bi ]) 1[qi (b0 )]1[w0 X + b0 ≥ 0]

R(x1 ) =
i=1

Define

X
h(α1 , b0 ) = c0 + c0i · ĥi (α1 ) · 1[qi (b0 )]
i=1

Then by Lemma B.3, we have



|Eα,β,b0 ∼N (0,1) [1[w0 X + b0 ≥ 0] · h(α1 , b0 ) − φ(x1 )| ≤ |R(x1 )| ≤
4
We also have √

2 2
X i! · |ci |2 i3 t2 + 1 2
Eα1 ,b0 ∼N (0,1) [h(α1 , b0 ) ] ≤ ( + c20 )
+ O(1) · · ( )
i=1
((i − 1)!!)2 t1−i
∞ √
2 2
X t2 + 1
≤ ( + c0 ) + i · |ci | · ( 1−i )2
3.5 2

i=1
t (47)

X p 2
≤ (2 + c20 ) + (i + 1)1.75 · |ci | · ti t2 + 1
i=0
2 2
≤ Cs (φ, t) (t + 1)
Lemma B.2. Denote by $h_i(x)$ the degree-$i$ Hermite polynomial as in Definition A.5 in (Allen-Zhu et al., 2019). For every integer $i\ge 1$, there exists a constant $p_i'$ with $|p_i'| \ge \frac{t^{1-i}}{\sqrt{t^2+1}}\cdot\frac{(i-1)!!}{100 i^2}$ such that, for $\|x\|\le t$,
$$\text{for even } i:\quad x_1^i = \frac{1}{p_i'}\,\mathbb{E}_{w_0\sim N(0,I),\,b_0\sim N(0,1)}\Big[h_i(\alpha_1)\,\mathbf{1}\big[\alpha \ge -\tfrac{b_0}{t}\big]\,\mathbf{1}\big[0 < -b_0 \le \tfrac{t}{2i}\big]\Big], \qquad (48)$$
$$\text{for odd } i:\quad x_1^i = \frac{1}{p_i'}\,\mathbb{E}_{w_0\sim N(0,I),\,b_0\sim N(0,1)}\Big[h_i(\alpha_1)\,\mathbf{1}\big[\alpha \ge -\tfrac{b_0}{t}\big]\,\mathbf{1}\big[|b_0| \le \tfrac{t}{2i}\big]\Big]. \qquad (49)$$
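For readers unfamiliar with this basis: assuming Definition A.5 of (Allen-Zhu et al., 2019) refers to the probabilists' Hermite polynomials (an assumption here), their defining orthogonality under the standard Gaussian can be checked directly with NumPy's `hermite_e` module.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000)             # samples of a standard Gaussian

def he(i, x):
    """Probabilists' Hermite polynomial He_i evaluated at x."""
    c = np.zeros(i + 1); c[i] = 1.0
    return He.hermeval(x, c)

# E[He_i(g) He_j(g)] = i! if i == j, else 0
for i in range(4):
    row = [np.mean(he(i, g) * he(j, g)) for j in range(4)]
    print([f"{v: .2f}" for v in row], " expected diagonal:", math.factorial(i))
```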

Proof:
For even i, by Lemma A.6 in (Allen-Zhu et al., 2019), we have
b0 t t xi
Ew0 ∼N (0,I),b0 ∼N (0,1) [hi (α1 )1[α ≥ − ]1[0 < −b0 ≤ ]] = Eb0 ∼N (0,1) [pi · 1[0 < −b0 ≤ ]] · i1
t 2i 2i t
, where
i−1 i−1−r 
exp(−b20 /(2t2 )) X (−1) 2

i/2 − 1
pi = (i − 1)!! √ (−b0 /t)r
2π r!! (r − 1)/2
r=1,r odd
i−1−r
(−1) 2 i/2−1

Define cr = r!! (r−1)/2
. Then sign(cr ) = −sign(cr+2 ). We can derive

cr (−b0 /t)r b0 i + 1 − r 1 1
= ( )2 ≤ ≤
cr−2 (−b0 /t)r−2 t r(r − 1) 4i 4
Therefore,
i−1
X 3
cr (−b0 /t)r ≥ |b0 /t|
4
r=1,r odd

|Eb0 ∼N (0,1) [pi · 1[0 ≤ −b0 /t ≤ 1/(2i)]]| · t−i


exp(−b20 /2t2 ) 3
≥|Eb0 ∼N (0,1) [(i − 1)!! √ · |b0 /t| · 1[0 ≤ −b0 /t ≤ 1/(2i)]]| · t−i
2π 4
0 b2 1
exp(− 20 (1 + t2 ))
Z
−i 3 b0
=t · (i − 1)!! · (− )db0
t
− 2i 2π 4 t
(50)
t b2 1 3 0
=t−i · 2
exp(− 0 (1 + 2 ))(i − 1)!!
t +1 2 t 8π − 2it
t 3  t2 + 1 
=t−i 2 (i − 1)!! 1 − exp(− )
t +1 8π 8i2
(i − 1)!!
≥t1−i
100i2

For odd i, similarly by Lemma A.6 in (Allen-Zhu et al., 2019), we can obtain

b0 t t xi
Ew0 ∼N (0,I),b0 ∼N (0,1) [h(α1 )1[α ≥ − ]1[|b0 | ≤ ]] = Eb0 ∼N (0,1) [pi · 1[|b0 | ≤ ]] · i1
t 2i 2i t
, where
i−1 i−1−r 
exp(−b20 /(2t2 )) X (−1) 2

i/2 − 1
pi = (i − 1)!! √ (−b0 /t)r
2π r=1,r even
r!! (r − 1)/2

Then we also have


cr (−b0 /t)r b0 i + 1 − r 1 1
r−2
= ( )2 ≤ ≤
cr−2 (−b0 /t) t r(r − 1) 4i 4
Therefore,
i−1
X 3 3 ( 2i − 1)! 3 1 3
cr (−b0 /t)r ≥ |c0 | = 1 3 i−1
≥ i−1

4 4 π( 2 · 2 · · · 2 ) 4π 2 2πi
r=1,r odd

|Eb0 ∼N (0,1) [pi · 1[|b0 |/t ≤ 1/(2i)]]| · t−i


exp(−b20 /2t2 ) 3
≥t−i · |Eb0 ∼N (0,1) [(i − 1)!! √ · · 1[|b0 |/t ≤ 1/(2i)]]|
2π 2πi
t b2 1
exp(− 20 (1 + t2 ))
Z 2i 3
=t−i · (i − 1)!! db0 ·
t
− 2i 2π 2πi
√ (51)
3 t √  t2 + 1 
=t−i · (i − 1)!! · √ · 2π · 2Φ( )−1
4π 2 i t2 + 1 2i

2 +1
3 t √ t
2Φ( 2 ) − 1
=t−i · (i − 1)!!
· √ · 2π ·
4π 2 i 2
t +1 i
t1−i (i − 1)!!
≥√
t2 + 1 100i2
Lemma B.3. For $B_i = 100 i^{1/2} + 10\sqrt{\log\big(t^{i-1}\sqrt{t^2+1}/\epsilon_i^2\big)}$, where $\epsilon_i^2 = t^{i-1}\sqrt{t^2+1}\,\epsilon^2$, we have

1. $\sum_{i=1}^\infty |c_i'|\cdot\big|\mathbb{E}_{x\sim N(0,1)}[|h_i(x)|\cdot\mathbf{1}[|x|\ge b]]\big| \le \frac{\epsilon}{8}\sqrt{t^2+1}$
2. $\sum_{i=1}^\infty |c_i'|\cdot\big|\mathbb{E}_{x\sim N(0,1)}[|h_i(b)|\cdot\mathbf{1}[|x|\ge b]]\big| \le \frac{\epsilon}{8}\sqrt{t^2+1}$
3. $\sum_{i=1}^\infty |c_i'|\,\mathbb{E}_{z\sim N(0,1)}[|h_i(z)|\,\mathbf{1}[|z|\le B_i]] \le C_\epsilon(\phi,t)\sqrt{t^2+1}$
4. $\sum_{i=1}^\infty |c_i'|\,\mathbb{E}_{z\sim N(0,1)}\big[\big|\tfrac{d}{dz}h_i(z)\big|\,\mathbf{1}[|z|\le B_i]\big] \le C_\epsilon(\phi,t)\sqrt{t^2+1}$

Proof:
By the definition of Hermite polynomial in Definition A.5 in (Allen-Zhu et al., 2019), we have

bi/2c
X |x|i−2j i2j
hi (x) ≤
j=1
j!

Combining (46), we can obtain


√ bi/2c
t2 + 1 i4 X |x|i−2j i2j
|c0i hi (x)| ≤ O(1)|ci | (52)
t1−i i!! j=1 j!
r √
1 t2 +1
log( ) √
1 2 t1−i t2 +1 2
(1) Let b = 100i θi and θi = 1 +
2 i√
10 i
for 2i = t1−i  where i ≥ 1, then we have

v √ √
t2 +1
q u
2
u 1 t +1 2
log( 12 tt1−i
+1
) log( 2 ) log( 12
t
 t2−i )
2 2
 2 i  t2−i i
−102 −2·10 √ 2 i
(θi · e−10 θi )i = 1 + √
i
·e e 10 i ·e −10 100i
10 i

v √
t2 +1
q u
t2 +1 tlog( 1
u
1 2 t1−i ) (53)
t1−i 2
log( 2 t1−i ) 2

i
−2·10 √
=2i √ · e−10 i · (1 + i
√ )e 10 i
t2 + 1 10 i
2i t1−i
≤ √
100000i t2 + 1
4
where the second step comes from that (1 + s) · e−2·10 ·s
≤ 1 for any s > 0. Combining the equation C.6, C.7 in (Allen-Zhu
et al., 2019) and (53), we can derive

X
|c0i | · Ex∼N (0,1) [|hi (z)| · 1[|x| ≥ b]]
i=1
∞ √
X t2 + 1 i4 i 2 2 (54)
≤ O(1)|ci | · i 2 · 1200i · (θi · e−10 θi )i
i=1
t1−i i!!
p 2
≤ t +1
8
for any  > 0 and t ≤ O(1).
(b) Similarly, following (53) and (54), we have
∞ ∞ √
X X t2 + 1 i4 b2 p 2
|c0i | · |Ex∼N (0,1) [|hi (b)| · 1[|x| ≥ b]]| ≤ O(1) 1−i
|ci | · e− 2 (3b)i ≤ t +1
i=1 i=1
t i!! 8

(c) Similar to (52),


∞ bi/2c

X i4 X Bii−2j i2j i−1 p 2
X
|c0i |Ez∈N (0,1) [|hi (z)|1[|z| ≤ Bi ]] ≤ O(1) |ci | t t +1
i=1 i=1
i!! j=0 j!

X p (55)
≤ |ci |(O(1)θi )i ti−1 t2 + 1
i=1
p
≤ C (φ, t) t2 + 1,
where the step follows from Claim C.2 (c) in (Allen-Zhu et al., 2019).
(d) Since we have
bi/2c
d X
| hi (x)| ≤ |x|i−2j i2j (56)
dx j=0

by Definition A.5 in (Allen-Zhu et al., 2019), we can derive



X d p
|c0i |Ez∈N (0,1) [| hi (z)|1[|z| ≤ Bi ]] ≤ C (φ, t) t2 + 1 (57)
i=1
dz

B.1.2. EXISTENCE OF A GOOD PSEUDO NETWORK

We hope to find a good pseudo network that can approximate the target network. In such a pseudo network, the activation $\mathbf{1}_{x\ge 0}$ is replaced by $\mathbf{1}_{x^{(0)}\ge 0}$, where $x^{(0)}$ is the value at the random initialization. We define a pseudo network without bias as
$$g_r^{(0)}(q, A, W, V, B) = q^\top\sum_{n=1}^N a_n\sum_{i\in[m_2]} c_{i,r}\,\mathbf{1}_{r_{n,i}+B_{2(n,i)}\ge 0}\sum_{j=1}^N a_{n,j}\sum_{l\in[m_1]} v_{i,l}\,\mathbf{1}_{a_jXw_l+B_{1(j,l)}\ge 0}\;a_jXw_l. \qquad (58)$$
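A vectorized reading of (58) is given below as a sketch. The gates are frozen at a random initialization $(W^{(0)}, V^{(0)})$ while the values stay linear in the current $(W, V)$; the exact shapes and the way the biases enter only the gates are my reading of the notation, not the authors' code.

```python
import numpy as np

def pseudo_g0(q, A, X, W, V, C, B1, B2, W0, V0):
    """Pseudo network g^{(0)} of (58): ReLU gates fixed at initialization (W0, V0),
    linear part evaluated at the current (W, V).
    Shapes: A (N,N), X (N,d), W,W0 (d,m1), V,V0 (m1,m2), C (m2,K), B1 (N,m1), B2 (N,m2), q (N,)."""
    pre1_init = A @ X @ W0 + B1                 # a_j X w_l^{(0)} + B_{1(j,l)}
    D_w = (pre1_init >= 0).astype(float)        # first-layer gates at initialization
    H = (A @ X @ W) * D_w                       # gated, bias-free, linear in W
    r_init = A @ (pre1_init * D_w) @ V0 + B2    # r_{n,i} + B_{2(n,i)} at initialization
    D_v = (r_init >= 0).astype(float)           # second-layer gates at initialization
    out_nodes = (A @ H @ V) * D_v               # N x m2, linear in (W, V)
    return q @ A @ out_nodes @ C                # one output per label r in [K]
```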

Lemma B.4 shows that the target function can be approximated by the pseudo network with suitable parameters. Lemmas B.5 to B.8 show how the existence of such a pseudo network is developed step by step.
Lemma B.4. For every $\epsilon \in \Big(0, \frac{1}{K\|q\|_1 p_1 p_2^2\, C_s(\Phi,\, p_2C_s(\phi,\|A\|_\infty))\,C_s(\phi,\|A\|_\infty)\sqrt{\|A\|_\infty^2+1}}\Big)$, let
$$M = \mathrm{poly}\big(C_\epsilon(\Phi,\, p_2 C_\epsilon(\phi, \|A\|_\infty)\sqrt{\|A\|_\infty^2+1}),\; 1/\epsilon\big),$$
$$C_\epsilon = C_\epsilon(\phi, \|A\|_\infty)\sqrt{\|A\|_\infty^2+1}, \qquad (59)$$
$$C' = 10\, C_\epsilon\sqrt{p_2}, \qquad (60)$$
$$C'' = C_\epsilon(\Phi, C')\sqrt{\|A\|_\infty^2+1}, \qquad (61)$$
$$C_0 = \tilde{O}(p_1^2 p_2 K^2 C_\epsilon C''). \qquad (62)$$
Then with high probability, for $m_1, m_2 \ge M$ there exist $\widehat{W}, \widehat{V}$ with
$$\|\widehat{W}\|_{2,\infty} \le \frac{C_0}{m_1}, \qquad \|\widehat{V}\|_{2,\infty} \le \frac{m_1}{m_2}$$
such that
$$\mathbb{E}_{(X,y)\in\mathcal{D}}\Big[\sum_{r=1}^K \big|f_r^*(q, A, X) - g_r^{(0)}(q, A, X, \widehat{W}, \widehat{V})\big|\Big] \le \epsilon,$$
$$\mathbb{E}_{(X,y)\in\mathcal{D}}\big[L(G^{(0)}(q, A, X, \widehat{W}, \widehat{V}), y)\big] \le OPT + \epsilon.$$

Proof: p
For each φ2,j , we can construct hφ,j : R2 → [−C, C] where C = C (φ, kAk∞ ) kAk2∞ + 1 using Lemma B.1 satisfying

E[hφ,j (w∗2,j > wi , B1(n,i) )1ãn Xw(0) +B


(0) (0)
] = φ2,j (ãn Xw2,j ) ±  (63)
i 1(n,i)≥0

for i ∈ [m1 ]. Consider any arbitrary b ∈ Rm1 with vi ∈ {−1, 1}. Define
1
00
c = (C0 C /C) (vi
2
hφ,j (w∗2,j > wi , B1(i) )ed )i∈[m1 ]
(0) (0)
X

W 2
v2,j (64)
c m 1
j∈[p2 ]

X c∗ K
1 √ X (0)
X
Vb = (C0 C 00 /C)− 2 k
(vh( m2 ∗
v1,j αi,j , B2(i) ) ci,r )i∈[m2 ] (65)
m2 r=1
k∈[p1 ] j∈[p2 ]

Then,
gr(0) (q, A, W
c , Vb , B)
N
X X N
X X
= q > an ci,r 1rn,i +B2(n,i) ≥0 an,j 1aj Xw(0) +B aj X W
c i Vbi,i0
i 1(j,i) ≥0
n=1 i∈[m1 ] i0 ∈[m2 ] j=1
N N
X c∗k X X √ X (0)
X X
= q > an c2i,r 1rn,i +B2(n,i) ≥0 h( m2 ∗
v1,j αi,j , B2(i) ) an,j ∗
v2,l φ2,l (aj Xw∗2,l )
m2 2c n=1 j=1
k∈[p1 ] i∈[m1 ] j∈[p2 ] l∈[p2 ]
N
X X X N
X N
X X (66)
= q > an c∗k Φ( ∗
v1,j am,n φ1,j (am Xw∗1,j )) an,j ∗
v2,l φ2,l (aj Xw∗2,l )
k∈[p1 ] n=1 j∈[p2 ] m=1 j=1 l∈[p2 ]
p
± O(p1 p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)
N
X X X X
= q > an c∗k Φ(ãn ∗
v1,j φ1,j (AXw∗1,j ))ãn ∗
v2,l φ2,l (AXw∗2,l )
n=1 k∈[p1 ] j∈[p2 ] l∈[p2 ]
p
± O(kqk1 p1 p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)

where the first step comes from definition of g (0) , the second step is derived from (64) and (65) and the second to last step is
by Lemma B.8.
Lemma B.5. For every smooth function φ, every w∗ ∈ Rd with kw∗ k = 1, for every  ∈ (0, 1√
),
Cs (φ,kAk∞ ) kAk2∞ +1
(0) (0) (0) (0) (0) (0)
there exists real-valued functions ρ(v 1 , W (0) , B 1(n) ), J(ãn X, v 1 , W (0) , B 1(n) ), R(ãn X, v 1 , W (0) , B 1(n) ) and
φ (ãn X) such that for every X
N
(0) (0) (0) (0) (0) (0)
X
rn,1 (X) = ρ(v 1 , W (0) , B 1(n) ) aj,n φ (aj Xw∗ ) + J(X, v 1 , W (0) , B 1(n) ) + R(X, v 1 , W (0) , B 1(n) )
j=1

p (0) (0)
Moreover, letting C = C (φ, kAk∞ ) kAk2∞ + 1 be the complexity of φ, and if v1,i ∼ N (0, m12 ) and wi,j , B 1(n) ∼
N (0, m11 ) are at random initialization, then we have
(0) (0) (0) (0)
1. for every fixed ãn X, ρ(v 1 , W (0) , B 1(n) ) is independent of J(ãn X, v 1 , W (0) , B 1(n) ).
(0) (0)
2. ρ(v 1 , W (0) , B 1(n) ) ∼ N (0, 100C12 m2 ).
3. |φ (ãn Xw∗i ) − φ(ãn Xw∗i )| ≤ 
4. with high probability, |R(X, v 1 , W (0) , B 1(n) )| ≤ Õ( √kAk , B 1(n) )| ≤ Õ( kAk∞√
(0) (0) ∞ (0) (0) (0) (1+kAk∞ )
m1 m2 ), |J(X, v 1 , W m2 ) and
(0) (0)
E[J(X, v 1 , W (0) , B 1(n) )] = 0.
With high probability, we also have
(0) τ
ρ̃(v1 ) ∼ N (0, )
C 2 m2
1
W2 (ρ|W (0) ,B (0) , ρ̃) ≤ Õ( √ )
1(n) C m2

Proof:
By Lemma B.1, we have

√ (0) > ∗ (0) φ (ãn Xw∗i )


Ew(0) ∼N (0, I
),b1(n,i) ∼N (0, m1 )
[h( m 1 w i w , b1(n,i) )1[ãn Xw i + b 1(n,i) ≥ 0]] =
i m1 1 C
with
|φ (ãn Xw∗ ) − φ(ãn Xw∗ )| ≤ 
√ (0) >
and |h( m1 wi w∗ , b1(n,i) )| ∈ [0, 1]. Note that here the h function is rescaled by 1/C.
Then, applying Lemma A.4 of (Allen-Zhu et al., 2019), we define
√ (0) >
Ii = I(h( m1 wi w∗ , B1(n,i) )) ⊂ [−2, 2]
√ (0)
S = {i ∈ [m1 ] : m2 v1,i ∈ Ii }
√ (0) > √ (0)
si = s(h( m1 wi w∗ , B1(n,i) )), m2 v1,i )
( s
√ i , if i ∈ S
ui = |S|
0, if s ∈
/S

where ui , i ∈ [m1 ] is independent of W (0) . We can write

W (0) = αed u> + β,


(0)
where α = u> e> dW ∼ N (0, 1/m1 ) and β ∈ Rd×m1 are two independent random variables given u. We know α is
independent of u. Since each i ∈ S with probability τ , we know with high probability,

|S| = Θ̃(τ m1 ) (67)



(0) (0)
p
Since α = i∈S ui [e> ]i and |ui [e>
P
dW dW ]i | ≤ Õ(1/ m1 |S|), by (67) and the Wasserstein distance bound of central
limit theorem we know there exists g ∼ N (0, m11 ) such that
1
W2 (α|W (0) ,B (0) , g) ≤ Õ( √ )
1(n) τ m1
Then,
N m1
(0) (0) (0)
X X
rn,1 (X) = aj,n vi,1 σ(aj Xw1 + B1(n,i) )
j=1 i=1
N N
(0) (0) (0) (0) (0) (0)
X X X X
= aj,n vi,1 σ(aj Xw1 + B1(n,i) ) + aj,n vi,1 σ(aj Xw1 + B1(n,i) ) (68)
j=1 i∈S
/ j=1 i∈S
N
(0) (0) (0)
X X
= J1 + aj,n vi,1 σ(aj Xw1 + B1(n,i) )
j=1 i∈S

rn,1 (X) − J1
N N
X X (0) (0) (0) si X X (0) (0) (0) (0)
= aj,n vi,1 1[aj Xw1 + B1(n,i) ] p α + aj,n vi,1 1[aj Xw1 + B1(n,i) ](aj Xβ i + B1(n,i) ) (69)
j=1 i∈S
2 |S| j=1 i∈S

=P1 + P2
Here, we know that since
(0) (0) (0) (0) (0) (0)
E[vi,1 σ(aj Xw1 + B1(n,i) )] = E[vi,1 ] · E[σ(aj Xw1 + B1(n,i) )] = 0 (70)
Hence,
N X (0) (0) (0)
X
E[J1 ] = E[ aj,n vi,1 σ(aj Xw1 + B1(n,i) )] = 0 (71)
j=1 i∈S
/

Then we can derive


N
X X (0) (0) α √ (0) >
P1 = aj,n 1[aj Xw1 + B1(n,j) ] p h( m1 wi w∗ , B1(n,i) ) + R1 (72)
j=1 i∈S
2 |S|m2
q
where |R1 | ≤ Õ( m|S|
1 m2
). We write P3 = P1 −R1
α . Then,
p N
|S| X 1
|P3 − √ aj,n φ (aj Xw∗ )| ≤ Õ(kAk∞ √ )
m2 C j=1 m2

√ N
C m2 X C
|√ P1 − aj,n φ (aj Xw∗ )| ≤ Õ(kAk∞ √ )
τ m1 j=1
τ m1
Define √
(0) (0) τ m1 τ
ρ(v 1 , W (0) , B 1(n) ) = √ α ∼ N (0, 2 )
C m2 C m2
Then,
N
(0) (0) (0) (0)
X
P1 = ρ(v 1 , W (0) , B 1(n) ) · aj,n φ (aj Xw∗ ) + R1 + R2 (X, v 1 , W (0) , B 1(n) )
j=1

where |R2 | ≤ Õ( √m11 m2 ).


We can also define √
(0) τ m1 τ
ρ̃(v1 ) = √ g ∼ N (0, 2 )
C m2 C m2

Therefore,
1
W2 (ρ|W (0) ,B (0) , ρ̃) ≤ Õ( √ )
1(n) C m2
Meanwhile,
(0) si (0) (0) 1
aj Xwi = α p aj Xed + aj Xβ i + B1(n,i) = aj Xβ i + B1(n,i) ± Õ( p )
|S| |S|m1
we have
N
(0) (0) (0)
X X
P2 = aj,n vi,1 1[aj Xβ i + b1(n,j) ](aj Xβ i + b1(n,i) ) + R3 = J2 + R3
j=1 i∈S

E[J2 ] = 0 (73)

with |R3 | ≤ Õ( √kAk ∞


m1 m2 ).
Let J = J1 + J2 , R = R1 + R2 + R3 . Then, w.h.p., E[J] = 0, |J| ≤ Õ( kAk∞√
(1+kAk∞ )
m2 ), |R| ≤ Õ( √kAk ∞
m1 m2 ).
1√
Lemma B.6. For every  ∈ (0, ), there exists real-valued functions φ1,j. (·) such that
Cs (φ,kAk∞ ) kAk2∞ +1

|φ1,j, (ãn Xw∗1,j ) − φ1,j (ãn Xw∗1,j )| ≤ 

for j ∈ [p2 ]. Denote by


p √ 1
C = C (φ, kAk∞ ) kAk2∞ + 1, C 0 = 10C p2 , φ1,j, (aj Xw∗1,i ) = 0 φ1,j, (aj Xw∗1,i )
C
For every i ∈ [m2 ], there exist independent Gaussians

1 1
αi,j ∼ N (0, ), βi (X) ∼ N (0, ),
m2 m2
satisfying
2
N
X X p23
W2 (rn,i (X), αi,j am,n φ1,j, (am Xw∗1,i ) + Ci βi (X)) ≤ Õ( 1√ )
j∈[p2 ] m=1 m1 6
m2

Proof:
Define p2 S many chunks of the first layer with each chunk corresponding to a set Sj,l , where |Sj,l | = m1 /(p2 S) for j ∈ [p2 ]
and l ∈ [S], such that
m1 m1 m1
Sj,l = {(j − 1) + (l − 1) S + k|k ∈ [ } ⊂ [m1 ]
p2 p2 p2 S]
By Lemma B.5, we have
N
(0) (0)
X X
rn,i (X) = ρ(v i [j, l], W (0) [j, l], B 1(n) [j, l]) am,n φ (am Xw∗1,j )
j∈[p2 ],l∈[S] m=1
(74)
(0) (0) (0) (0)
X
+ Jj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l]) + Rj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l]),
j∈[p2 ],l∈[S]

(0) (0)
where ρ(v i [j, l], W (0) [j, l], B 1(n) [j, l]) ∼ N (0, 100C 21m2 p2 S ). Then ρj = 1 0
P
l∈[S] ρj,l ∼ N (0, C 02 m2 ) for C =

10C p2 . Define
(0) (0)
X
JjS (X) = Jj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l])
l∈[S]

(0) (0)
X
RjS (X) = Rj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l])
l∈[S]

Then there exists Gaussian random variables βj (X) and β 0 (X) =


P
i∈[p2 ] βj (X) such that

kAk∞ (1 + kAk∞ )
W2 (JjS (X), βj (X)) ≤ √
m2 pS
N √
X X Sp2 p2 kAk∞ (1 + kAk∞ )
W2 (rn,i (X), ρj am,n φ1,j, (am Xw∗1,j ) 0
+ β (X)) ≤ Õ( √ + )
m=1
m1 m2 m2 S
j∈[p2 ]

We know there exists a positive constant Ci such that β 0 /Ci ∼ N (0, m12 ). Let αi,j = C 0 ρj , βi0 = β 0 /Ci . Notice that
P 2 (0) (0) (0) 
E l∈[S],i∈[p2 ] [Jj (X, v i [j, l], W [j, l], b1 [j, l])] = Õ(kAk2∞ (1 + kAk∞ )2 /m2 ). Hence, we have

Ci ≤ Õ(kAk∞ (1 + kAk∞ )
1
Let S = (m1 /p2 ) 3 , we can obtain
2
N
X X p23
W2 (rn,i (X), αi,j am,n φ1,j, (am Xw∗1,i ) + Ci βi (X)) ≤ Õ( 1√ )
j∈[p2 ] m=1 m1 6
m2
p
Lemma B.7. There exists function h : R2 → [−C 00 , C 00 ] for C 00 = C (Φ, C 0 ) kAk2∞ + 1 such that
√ X
∗ (0)
X

E[1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ] j∈[p2 ]

X N
X X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (ãn Xw∗2,j )) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)
j∈[p2 ] m=1 j∈[p2 ]
(75)

Proof: PN PN
Choose w = (αi,1 , · · · , αi,p2 , βi ), x = ( m=1 am,n φ1,1, , · · · , m=1 am,n φ1,p2 , , Ci ) and w∗ = (v1,1
∗ ∗
, · · · , v1,p2
, 0).
2 2 00 00 00 0
p
Then, kxk ≤ O(kAk∞ + kAk∞ ). By Lemma B.1, there exists h : R → [−C , C ] for C = Cs (Φ, C ) kAk∞ + 1 2

such that
√ (0)
X
E[1wX+b(0) ≥0 h( m2 w> w∗ , b2(n,i) )( ∗
v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ]
√ (0)
X
=Eαi ,βi [1P PN ∗ 0 (0) h( m2 w> w∗ , b2(n,i) )( ∗
v2,j φ2,j (ãn Xw∗2,j ))]
] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
j∈[p2
j∈[p2 ]
(76)
X N
X X
=Φ(C 0 ∗
v1,j am,n φ1,j, ) ∗
v2,j φ2,j (ãn Xw∗2,j )) ± C 000
j∈[p2 ] m=1 j∈[p2 ]

where X p
C 000 = sup | ∗
v2,j φ2,j (ãn Xw∗2,j )| ≤ p2 Cs (φ, kAk∞ ) kAk2∞ + 1
j∈[p2 ]

By Lemma B.6, we know


2
N
X X p23
W2 (rn,i (X), αi,j am,n φ1,j, (am Xw∗1,i ) + Ci βi (X)) ≤ Õ( 1√ )
j∈[p2 ] m=1 m1 6
m2
2
P PN ∗ 0 2p23
Denote H = {i ∈ [m1 ] : | j∈[p2 ] αi,j m=1 am,n φ1,j, (ãn Xw 1,i ) + Ci βi | ≥ Õ( 1√ )}. Then, for every i ∈ [H],
m16 m2
we have that
1rn,i (X)+b(0) ≥0
= 1P PN
am,n φ1,j, (ãn Xw∗ 0 (0) (77)
2(n,i) j∈[p2 ] αi,j m=1 1,i )+Ci βi +b2(n,i) ≥0

2 2 2
N
 X X 2p23  2p23 √ 2p23
Pr αi,j am,n φ1,j, (ãn Xw∗1,i ) + Ci βi0 ≤ Õ( 1 √ ) ≤ Õ( 1 √ ) · m2 = Õ( 1 ), (78)
j∈[p2 ] m=1 m16 m2 m16 m2 m16
2/3 1/6
which implies with probability at least 1 − 2p2 /m1 , (77) holds. Therefore,
√ X
∗ (0)
X

E[1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ] j∈[p2 ]
√ X
∗ (0)
X

=E[1P PN ∗ 0 (0) h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
j∈[p2 ] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
j∈[p2 ] j∈[p2 ]

± E[1rn,i (X)+b(0) ≥0
6= 1P PN ∗ 0 (0) ]O(C 000 C 00 )
2(n,i) j∈[p2 ] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
√ X
∗ (0)
X

=E[1P αi,j
PN
am,n φ1,j, (ãn Xw∗ 0 (0)
≥0
h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
j∈[p2 ] m=1 1,i )+Ci βi +b2(n,i)
j∈[p2 ] j∈[p2 ]
2/3
2p2
± Õ( 1/6 C 000 C 00 )
m1
X XN X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (x)) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1 · ),
j∈[p2 ] m=1 j∈[p2 ]
(79)
where the first step is by Lemma B.6, the second step is by (77) and (78) and the last step comes from (76) and m1 ≥ M .
Lemma B.8.
m2 2
1 X ci,l √ X
∗ (0)
X

E[ 2
1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
m2 i=1 c 2(n,i)
j∈[p2 ] j∈[p2 ]
N
X

X X

(80)
=Φ( v1,j am,n am Xδφ1,j ) v2,j φ2,j (ãn Xw∗2,j ))
j∈[p2 ] m=1 j∈[p2 ]
p
± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1 · )

Proof:
(0) (0)
Recall ρ̃(v 1 ) ∼ N (0, C 2τm2 ). Define ρ̃j,l = ρ̃(v 1 [j, l]). Therefore,

1
W2 (ρj,l |W (0) ,B (0) , ρ̃j,l ) ≤ Õ( √ ) (81)
1(n) C 0 m2 S

1
W2 (ρj |W (0) ,B (0) , ρ̃j ) ≤ Õ( √ ) (82)
1(n) C0 m2
where ρ̃j = l∈[S] ρj,l . We then define α̃i,j = C 0 ρ̃j
P

Next modify rn,i (X). Define


PN P (0) (0) (0)
m=1 am,n j∈[m1 ] vj,i σ(am Xwi + b1(n,i) )
r̃n,i (X) = E[kuk]
kuk
(0) (0)
where u = (σ(ãn Xwj + b1(n,j) ))j∈[m1 ] . By definition, we know

kãn k2∞
r̃n,i ∼ N (0, E[kuk]2 )
m2
Then we have p
kAk∞ kAk2∞ + 1
W2 (rn,i (X), r̃n,i (X)) ≤ Õ( √ ) (83)
m2

Combining (81), (82), (83) and Lemma B.7, we have


m2 2
1 X ci,l √ X
∗ (0)
X

E[ 2
1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
m2 i=1 c 2(n,i)
j∈[p2 ] j∈[p2 ]

X N
X X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (ãn Xw∗2,j )) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ )( kAk2∞ + 1 · )
j∈[p2 ] m=1 j∈[p2 ]
(84)

B.1.3. COUPLING
This section illustrates the coupling between the real and pseudo networks. We first define diagonal matrices $D_{n,w}$, $D_{n,w}+D''_{n,w}$, $D_{n,w}+D'_{n,w}$ for node $n$ as the signs of the ReLUs in the first layer at weights $W^{(0)}$, $W^{(0)}+W^\rho$ and $W^{(0)}+W^\rho+W'$, respectively. We also define diagonal matrices $D_{n,v}$, $D_{n,v}+D''_{n,v}$, $D_{n,v}+D'_{n,v}$ for node $n$ as the signs of the ReLUs in the second layer at weights $\{W^{(0)}, V^{(0)}\}$, $\{W^{(0)}+W^\rho, V^{(0)}+V^\rho\}$ and $\{W^{(0)}+W^\rho+W', V^{(0)}+V^\rho+V'\}$, respectively. For every $l\in[K]$, we then introduce the pseudo network and its semi-bias and bias-free versions as
$$g_l(q,A,X,W,V) = q^\top A\big((A(AXW+B_1)\odot(D_w+D'_w)V + B_2)\odot(D_v+D'_v)\big)c_l, \qquad (85)$$
$$g_l^{(b)}(q,A,X,W,V) = q^\top A\big((A(AXW+B_1)\odot(D_w+D'_w)V)\odot(D_v+D'_v)\big)c_l, \qquad (86)$$
$$g_l^{(b,b)}(q,A,X,W,V) = q^\top A\big((A(AXW)\odot(D_w+D'_w)V)\odot(D_v+D'_v)\big)c_l. \qquad (87)$$
Lemma B.9 gives the final result of the coupling with added dropout noise. Lemma B.10 states the sparse sign changes of the ReLUs and the change in the pseudo network's value caused by an update. More specifically, Lemma B.11 shows that the sign pattern can be viewed as fixed for the smoothed objective when a small update is introduced to the current weights. Lemma B.12 proves that the bias-free pseudo network can also approximate the target function.
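The sign matrices above are easy to instrument in code. The short sketch below (illustrative only; random data, an identity aggregation matrix and the perturbation scale are assumptions for the demonstration) counts how many first-layer ReLU gates flip after a small update of $W^{(0)}$, which is exactly the quantity $\|D'_{n,w}\|_0$ that Lemma B.10 shows is sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m1 = 100, 20, 4096
A = np.eye(N)                                     # placeholder aggregation for the demo
X = rng.standard_normal((N, d))
W0 = rng.standard_normal((d, m1)) / np.sqrt(m1)   # entries ~ N(0, 1/m1)
tau = m1 ** -0.75                                 # per-neuron perturbation scale for the demo
W_prime = tau * rng.standard_normal((d, m1)) / np.sqrt(d)

D_w = (A @ X @ W0 >= 0)                           # first-layer signs at W^(0)
D_w_new = (A @ X @ (W0 + W_prime) >= 0)           # signs at W^(0) + W'
flips_per_node = (D_w ^ D_w_new).sum(axis=1)      # ||D'_{n,w}||_0 for each node n
print("average fraction of flipped gates per node:", flips_per_node.mean() / m1)
```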
Lemma B.9. Let FA = (f1 , f2 , · · · , fK ). With high probability, we have for any kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv , such that

fl (q, A, X, W (0) + W 0 Σ, V (0) + ΣV 0 )


(0) (0) 0 0
=q > A(A((AXW (0) + B 1 ) D (0)w,x V
(0)
+ B 2 ) D (0) >
v,x )c + q A(A((AXW ) D (0)
w,x V ) D (0)
v,x )cl (88)
√ 16 √
m2 9 8 9
± Õ(τv √ + m15 τw5 m2 + τw5 m110 ) · kq > Ak1 kAk2∞ ,
m1

0
where we use D (0) (0)
w,x and D v,x to denote the sign matrices at random initialization W
(0)
, V (0) and we let D (0)
w,x + D w,x ,
0 0 0
D (0)
v,x + D v,x be the sign matrices at W + W Σ, V + ΣV .

Proof:
(0) (0) (0) (0) (0) (0)
Since ãn Xwi + B1(n,i) = ãn X̃ w̃i where w̃i = (wi , B1(n,i) ) ∈ Rd+1 and X̃ = (X, 1) ∈ RN ×(d+1) , we can
ignore the bias term for simplicity. Define

Z = A(AXW (0) ) D (0)


w,x

Z 1 = A(AXW 0 Σ) D (0)
w,x

Z 2 = A(AX(W (0) + W 0 )Σ) D 0w,x


Then by Fact C.9 in (Allen-Zhu et al., 2019) we have
m2
X m2
X
kZ n ΣV 0 k22 ≤ (Z n ΣV 0i )2 ≤ Õ(kZ n k2∞ · kV 0i k22 )
i=1 i=1 (89)
≤ Õ(kAk2∞ m−1 2
1 τv )

−1
Therefore, we have kZ n ΣV 0 k2 ≤ Õ(kAk∞ m1 2 τv ).
Let s be the total number of sign changes in the first layer caused by adding W 0 . Note that the total number of coordinated
3
(0)
i such that |ãn Xwi | ≤ s00 = 2τ1w is at most s00 m12 with high probability. Since kW 0 k2,4 ≤ τw , we must have
s4
3 3 4 6
00 τw
s ≤ Õ(s m ) = Õ(
2
1 m1 ). Therefore, kZ 2,n k0 ≤ s = Õ(τw5 m15 ). Then,
2

s4

kZ 2,n k2 =k(A(AX(W (0) + W 0 Σ)) D 0w,x )n k


 X  14
≤ s· (AAXW (0) + AAXW 0 Σ)4n,i
(D 0w,x )n 6=0
  41
≤ s·
X
(AAXW 0 Σ)4n,i (90)
(D 0w,x )n 6=0
1
≤s 4 kAk∞ τw
6 3
≤Õ(τw5 m110 kAk∞ )

Then we have 6 3
kZ 2,n ΣV 0 k2 ≤ Õ(τv τw5 m110 kAk∞ )
With high probability, we have
N m2
X X √
q > an 0
ci,l (σ(rn,i + rn,i 0
) − σ(rn,i )) ≤ Õ(kqk m2 )krn,i k
n=1 i=1

fl (q, A, X, W (0) + W 0 Σ, V (0) + ΣV 0 )


XN m2
X  
0
= q > an ci,l σ (Z + Z 1 + Z 2 )>
n (V i + (Σ)i V )
n=1 i=1
N m2 √
X X   m2 √ 6 3
= q > an ci,l σ (Z n + Z 1,n + Z 2,n )> V i + Z >
1,n (ΣV 0
) i ± Õ(kqkkAk∞ √ τv + m2 kqkkAk∞ τw5 m110 )
n=1 i=1
m1
(91)
We consider the difference between

A1 = q > A(((Z + Z 1 + Z 2 )V (0) + Z 1 ΣV 0 ) (D (0) 00


v,x + D v,x ))cl

A2 = q > A(((Z + Z 1 + Z 2 )V (0) + Z 1 ΣV 0 ) D (0)


v,x )cl

where D 00v,x is the diagonal sign change matrix from ZV (0) to (Z + Z 1 + Z 2 )V (0) + Z 1 ΣV 0 . The difference includes
three terms. 1
−1
kZ 1,n V (0) k∞ ≤ Õ(kAk∞ m14 τw m2 2 ) (92)

− 12 √ 9 8
−1
kZ 2,n V (0) k∞ ≤ Õ(kZ 2,n km2 s) ≤ Õ(kAk∞ m110 τw5 m2 2 ) (93)

1
kZ 1,n ΣV 0 k ≤ Õ(τv kAk∞ τw m14 ) (94)
where (92) is by Fact C.9 in (Allen-Zhu et al., 2019). Then we have
3 1
− 12 9 8
−1 1 4 4 4 1
|A1 − A2 | ≤ kq > Ak1 · Õ(m22 kAk2∞ (m14 τw m2 + m110 τw5 m2 2 )2 + m22 τv3 kAk∞
3
τw3 m13 )

From A2 to our goal


0 0
A3 = q > A(ZV (0) D (0) >
v,x )cl + q A(A(AXW D (0)
w,x V ) D (0)
v,x )cl

There are two more terms.


√ 8 9
|q > A(Z 2 V (0) D (0) >
v,x )cl | ≤ Õ(kq Ak1 kZ 2,n k s) ≤ Õ(kqk1 kAk∞ τw m1 )
5 10

1
|q > A(Z 1 V (0) D (0) >
v,x )cl | ≤ Õ(kq Ak1 kAk∞ τw m1 )
4

Therefore, we have
8 9 1 1
|A2 − A3 | ≤ Õ(kq > Ak1 kAk∞ τw5 m110 + kqk1 kAk∞ τw m14 + kqk1 τv kAk∞ τw m14 )

Finally, we have
fl (q, A, X, W (0) + W 0 Σ, V (0) + ΣV 0 )
0 0
=q > A(ZV (0) D (0) >
v,x )cl + q A(A(AXW D (0)
w,x V ) D (0)
v,x )cl
√ (95)
m2 9 16 √ 8 9
± Õ(τv √ + m15 τw5 m2 + τw5 m110 ) · kqk1 kAk∞
m1
1 1 1 τw 1
Lemma B.10. Suppose τv ∈ (0, 1], τw ∈ [ 3 , 1 ], σw ∈ [ 3 , 1 ], σv ∈ (0, 1 )]. The perturbation matrices satisfies
m12 m12 m12 m14 m22
kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv , kW 00 k2,4 ≤ τw , kV 00 kF ≤ τv and random diagonal matrix Σ has each diagonal entry i.i.d.
drawn from {±1}. Then with high probability, we have
(1) Sparse sign change
4 6
kD 0n,w k0 ≤ Õ(τw5 m15 )
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk∞ + kAk∞ τw m14 ) + m2 kAk∞
3
(kAk∞ τv + kAk∞ τw m14 (1 + τv )) 3 )
(2) Cross term vanish

gr (q, A, X, W (0) + W ρ + W 0 + ηW 00 Σ, V (0) + V ρ + V 0 + ηΣV 00 )


(96)
=gr (q, A, X, W (0) + W ρ + W 0 , V (0) + V ρ + V 0 ) + gr(b,b) (q, A, X, ηW 00 Σ, ηΣV 00 ) + gr0

for every r ∈ [K], where EΣ [gr0 ] = 0 and |gr0 | ≤ ηkq > Ak1 kAk2∞ τv .

Proof:
(0) (0) (0) (0) (0) (0)
(1) We first consider the sign changes by W ρ . Since ãn Xwi +B1(n,i) = ãn X̃ w̃i where w̃i = (wi , B1(n,i) ) ∈ Rd+1
and X̃ = (X, 1) ∈ RN ×(d+1) , we can ignore the bias term for simplicity. We have

(0) kãn X̃k2


ãn X̃ w̃i ∼ N (0, )
m1

ãn X̃ w̃ρi ∼ N (0, kãn X̃k2 σw


2
)
Therefore,
(0)
ãn X̃ w̃i 1
ρ ∼ p(z) = √ 1
ãn X̃ w̃i π(σw m1 z 2 + σw

m1 )

(0)
Pr[|ãn X̃ w̃i | ≤ |ãn X̃ w̃ρi |] = Pr[|z| ≤ 1]
Z 1
1
= √ 2+ 1 dz
−1 π(σ w m 1 z σw

m1 )
2 1
Z (σw m1 ) 2
1 (97)
= dt
π(t2 + 1)
1
2 m )2
−(σw 1

2 √
= arctan σw m1
π

≤ Õ(σw m1 )

Then, we have
3
kD 00n,w k0 ≤ Õ(σw m12 )
(0) 3 3
kãn X̃ W̃ D 00n,w k2 ≤ Õ(kãn X̃kσw2 m14 )
We then consider the sign changes by W 0 . Let s = kD 0n,w − D 00n,w k0 be the total number of sign changes in the first layer
(0) ρ
caused by adding W 0 . Note that the total number of coordinated i such that |ãn X̃(W̃ + W̃ )i | ≤ s00 = 2τw
1 is at most
s4
3
s00 m1 with high probability. Since kW 0 k2,4 ≤ τw , we must have
2

3 τw 3
s ≤ Õ(s00 m 2 ) = Õ( 1 m12 )
s 4

4 6
kD 0n,w − D 00n,w k0 = s ≤ Õ(τw5 m15 )
(0) ρ 1 6 3
kãn X̃(W̃ + W̃ )(D 0n,w − D 00n,w )k2 ≤ Õ(s 4 τw ) ≤ Õ(τw5 m110 )
To sum up, we have
3 4 6 4 6
kD 0n,w k0 ≤ Õ(σw m12 + τw5 m15 ) ≤ Õ(τw5 m15 )
(0) (0) ρ (0)
Denote z n,0 = ãn X̃ W̃ D n,w and z n,2 = ãn X̃(W̃ + W̃ + W 0 )(D n,w + D 0n,w ) − ãn X̃ W̃ D n,w . With high
probability, we know
ρ
kz n,2 k ≤kãn X̃W 0 k + kãn X̃ W̃ k
1 1
≤Õ(m14 τw kAk∞ + kAk∞ σw m12 ) (98)
1
≤Õ(kAk∞ τw m1 ) 4

Denote Z 0 = (z > > >


1,0 , · · · , z N,0 ) ∈ R
N ×m1
, Z 2 = (z > > >
1,2 , · · · , z N,2 ) ∈ R
N ×m1
. The sign change in the second layer is
(0) (0) ρ 0
from ãn Z 0 V to ãn (Z 0 + Z 2 )(V + V + V ). We have

kãn (Z 0 + Z 2 )V ρ k∞ ≤ Õ(σv kAk∞ (kz 1,0 k + kz 1,2 k))

kãn (Z 0 + Z 2 )V 0 + ãn Z 2 V (0) k2 ≤ Õ(kAk∞ ((kz 1,0 k + kz 1,2 k)τv + kz 1,2 k))
Combining kz 1,0 k ≤ Õ(kAk∞ ), by Claim C.8 in (Allen-Zhu et al., 2019) we have
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk2∞ + kAk2∞ τw m14 ) + m2 kAk∞
3
(kAk∞ τv + kAk∞ τw m14 (1 + τv )) 3 )

(2) Diagonal Cross terms.


Denote D w = (diag(D 1,w )> , · · · , diag(D N,m1 )> )> ∈ RN ×m1 and define D 0w , D 00w , D v , D 0v , D 00v accordingly.
Recall
gr (q, A, X, W , V ) = q > A((A(AXW + B 1 ) (D w + D 0w )V + B 2 ) (D v + D 0v ))cr
gr(b) (q, A, X, W , V ) = q > A((A(AXW + B 1 ) (D w + D 0w )V ) (D v + D 0v ))cr
gr(b,b) (q, A, X, W , V ) = q > A((A(AXW ) (D w + D 0w )V ) (D v + D 0v ))cr
Then
gr (q, A, X, W (0) + W ρ + W 0 + ηW 00 Σ, V (0) + V ρ + V 0 + ηΣV 00 )
=gr (q, A, X, W (0) + W ρ + W 0 , V (0) + V ρ + V 0 ) + gr(b,b) (q, A, X, ηW 00 Σ, ηΣV 00 ) (99)
0 00
+ gr(b) (q, A, X, W (0) ρ
+ W + W , ηΣV ) + gr(b,b) (q, A, X, ηW 00 Σ, V (0) +V +V ) ρ 0

where the last two terms are the error terms. We know that
v v
u m1 u m1
(0) > (0)
uX (0)
uX 1
kW k ≤ max ka W k ≤ max t (a> wi )2 ≤ max t ( √ )2 = 1
kak=1 kak=1
i=1
kak=1
i=1
m1

Therefore,

|gr(b) (q, A, X, W (0) + W ρ + W 0 , ηΣV 00 )|


m2
N X N m1
00
X X X
=η >
q an ci,r Dn,v i an,k (ΣV )l,i Dk,w l (ak X(W (0) + W ρ + W 0 )l + B 1(k,l) )
n=1 i=1 k=1 l=1
N N
X X (100)
=η q > an an,k ((ak X(W (0) + W ρ + W 0 ) + B 1k ) D k,w )ΣV 00 D n,v cr
n=1 k=1
1 1 1
≤ηkq > Ak1 kAk∞ (kAk∞ (m− 2 + σw + τw ) + m− 2 )τv m22
1
−1
≤Õ(ηkq > Ak1 kAk2∞ τv m22 m1 2 ),

where the last step is by the value selection of σw , τw and τv .

|gr(b,b) (q, A, ηW 00 Σ, X, V (0) + V ρ + V 0 )|


N
X N
X
=|η q > an an,k (ak XW 00 Σ (D k,w + D k,w 0 ))(V (0) + V ρ + V 00 ) (D n,v + D n,v 0 )cr |
n=1 k=1
N
X N
X
≤|η q > an an,k (ak XW 00 Σ (D k,w + D k,w 0 ))V 0 (D n,v + D n,v 0 )cr |
n=1 k=1
N N
X X (101)
+ |η q > an an,k (ak XW 00 Σ (D k,w + D k,w 0 ))(V (0) + V ρ ) D n,v cr |
n=1 k=1
N
X N
X
+ |η q > an an,k (ak XW 00 Σ (D k,w + D k,w 0 ))(V (0) + V ρ ) D n,v 0 cr |
n=1 k=1
1 1
>
≤|ηkq Ak1 kAk2∞ τw τv m22 | + 2|ηkq > Ak1 kAk2∞ τw m12 |
1
≤Õ(|ηkq > Ak1 kAk2∞ τw m12 |)

Lemma B.11. Denote


Pρ,η =FA (q, X, W + W ρ + ηW 00 Σ, V + V ρ + ηΣV 00 )
(102)
=q > A(A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ,η (V + V ρ + ηΣV 00 ) D v,ρ,η )cr
0
Pρ,η =G(q, A, X, W + W ρ + ηW 00 Σ, V + V ρ + ηΣV 00 )
(103)
=q > A(A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ (V + V ρ + ηΣV 00 ) D v,ρ )cr
There exists η0 = 1
poly(m1 ,m2 ) such that for every η ≤ η0 , for every W 00 , V 00 that satisfies kW 00 k2,∞ ≤ τw,∞ , kV 00 k2,∞ ≤
τv,∞ , we have
0
|Pρ,η − Pρ,η | >
τ2
4 w,∞
2
(τw,∞ 2
+ τv,∞ m−1
1 )
EW ρ ,V ρ [ ] = Õ(q A1kAk ( m1 + m2 )) + Op (η),
η2 σw σv
where Op hides polynomial factor of m1 and m2 .

Proof:

0
Pρ,η − Pρ,η = q > A(A(AX(W + W ρ + ηW 00 Σ) + B 1 ) (D w,ρ,η − D w,ρ )(V + V ρ + ηΣV 00 ) D v,ρ )cr
> ρ 00 ρ 00
+ q A(A(AX(W + W + ηW Σ) + B 1 ) D w,ρ,η (V + V + ηΣV ) (D v,ρ,η − D v,ρ ))cr
(104)
We write
Z = A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ

Z + Z 0 = A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ,η
Since for all n ∈ [N ], kη(AAXW 00 Σ)n k∞ ≤ ηkAk∞ τw,∞ , we have

kZ 0n k∞ ≤ ηkAk∞ τw,∞

0 ηkAk∞ τw,∞
Prρ [Zn,i 6= 0] ≤ Õ( ), i ∈ [m1 ]
W σw
Then we have
Pr[kZ 0n k0 ≥ 2] ≤ Op (η 2 )
Then we only need to consider the case kZ 0n k0 = 1. Let Zn,n 0
i
6= 0. Then the first term in (104), q > A(Z 0 (V + V ρ +
00
ηΣV ) D v,ρ )cr should be dealt with separately.
The term q > A(Z 0 ηΣV 00 D v,ρ )cr ) contributes to Op (η 3 ) to the whole term.
Then we have
kq > A(Z 0 η(V + V ρ ) D v,ρ )cr k ≤ Õ(ηkkq > Ak1 kkAk∞ τw,∞ )
We also have that
ηkAk∞ τw,∞ ηkAk∞ τw,∞
Õ(( m1 )N ) ≤ Õ( m1 ) ≤ 1
σw σw
2
τw,∞
Therefore, the contribution to the first term is Õ(η 2 kq > Ak1 kAk2∞ σw m1 ) + Op (η 3 ).
Denote
δ =A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ,η (V + V ρ + ηΣV 00 )
ρ (105)
− A(AX(W + W ) + B 1 ) D w,ρ (V + V ρ )
δ ∈ Rm2 has the following terms:
1. Z 0 (V + V ρ + ηΣV 00 ). We have its n-th row norm bounded by Op (η).
−1
2. ZηΣV 00 . We have its n-th row infinity norm bounded by Õ(kAk∞ ητv,∞ m1 2 ).
3. A(AXηW 00 Σ D w,ρ )(V + V ρ ), of which the n-th row infinity is bounded by Õ(kAk∞ ητw,∞ ).
4. A(AXη 2 W 00 Σ D w,ρ,η ΣV 00 ). Bounded by Op (η 2 ).
Therefore,
− 12
kδ n k∞ ≤ Õ(kAk∞ η(τv,∞ m1 + τw,∞ )) + Op (η 2 )
2
(τw,∞ 2
+τv,∞ m−1
1 )
Similarly, we can derive that the contribution to the second term is Õ(η 2 kq > Ak1 kAk2∞ σv m2 ) + Op (η 3 ).
Lemma B.12. Let F ∗A = (f1∗ , · · · , fK

). Perturbation matrices W 0 , V 0 satisfy

kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv

There exists W
c and Vb such that

C0 K m1
kW
c k2,∞ ≤ , kVb k2,∞ ≤
m1 m2

XK
E[ |fr∗ (q, A, X, W
c , Vb ) − g (b,b) (q, A, X, W
r
c , Vb )|] ≤ 
r=1

E[G(b,b) (q, A, X, W
c , Vb )] ≤ OP T + 

Proof:
By Lemma B.10, we have
4 6
kD 0n,w k0 ≤ Õ(τw5 m15 )  Õ(m1 )
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk2∞ +kAk2∞ τw m14 )+m2 kAk∞
3
(kAk∞ τv +kAk∞ τw m14 (1+τv )) 3 ) ≤ Õ(m2 kAk2∞ (/C0 )Θ(1) )

Applying Lemma B.9, we know

q > A((A(AX W
c) (D 0w )Vb ) (D v ))cr
N
X N
X
= q > an an,k ((ak X W
c) D k,w 0 )Vb D n,v cr
n=1 k=1 (106)

3 C0 K m1
≤kq > Ak1 kAk2∞ m1 10
· m2
m1 m2
≤

q > A((A(AX W
c) (D w )Vb ) (D 0v ))cr
N
X N
X
= q > an an,k ((ak X W
c) D k,w )Vb D 0n,v cr
n=1 k=1 (107)

C0 K m1 1 
≤kq > Ak1 kAk2∞ m1 · m2 · kAk2∞ ( )Θ(1)
2

m1 m2 C0
≤
Then, the conclusion can be derived.

B.1.4. OPTIMIZATION
This section states the optimization process and convergence performance of the algorithm. Lemma B.13 shows that during
the optimization, either there exists an updating direction that decreases the objective, or weight decay decreases the
objective. Lemma B.14 provides the convergence result of the algorithm.
Define
$$L_0(A^*, A^*, A^*, \lambda_t, W_t, V_t) = \frac{1}{|\Omega^t|}\sum_{i=1}^{|\Omega^t|}\mathbb{E}_{W^\rho, V^\rho, \Sigma'}\Big[L\big(\lambda_t F_{A^*}(e_g, X; W^{(0)}+W^\rho+W_t\Sigma', V^{(0)}+V^\rho+\Sigma'V_t), y_i\big)\Big] + R(\sqrt{\lambda_t}W_t, \sqrt{\lambda_t}V_t), \qquad (108)$$
where
$$R(\sqrt{\lambda}W_t, \sqrt{\lambda}V_t) = \lambda_v\|\sqrt{\lambda}V_t\|_F^2 + \lambda_w\|\sqrt{\lambda}W_t\|_{2,4}^4.$$
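A reading aid for (108): the sketch below estimates the smoothed, regularized objective by Monte Carlo over the Gaussian smoothing matrices and the dropout sign matrix. The `forward` and `loss` callables, the number of Monte Carlo samples, and the argument layout are assumptions introduced for the illustration, not the authors' implementation.

```python
import numpy as np

def norm_2_4(W):
    """||W||_{2,4}: the 4-norm of the column-wise 2-norms of W."""
    return (np.sum(np.linalg.norm(W, axis=0) ** 4)) ** 0.25

def smoothed_objective(forward, loss, A_star, X, labels, batch,
                       W0, V0, Wt, Vt, lam, lam_w, lam_v,
                       sigma_w, sigma_v, n_mc=8, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of L0 in (108): Gaussian smoothing (W^rho, V^rho),
    dropout sign matrix Sigma', and the weight-decay regularizer R."""
    total = 0.0
    for _ in range(n_mc):
        W_rho = sigma_w * rng.standard_normal(W0.shape)
        V_rho = sigma_v * rng.standard_normal(V0.shape)
        Sig = np.diag(rng.choice([-1.0, 1.0], size=Wt.shape[1]))      # dropout signs
        W = W0 + W_rho + Wt @ Sig
        V = V0 + V_rho + Sig @ Vt
        preds = lam * forward(A_star, X, W, V)                        # lambda_t * F_{A*}, shape (N, K)
        total += np.mean([loss(preds[g], labels[g]) for g in batch])  # batch Omega^t of labeled nodes
    reg = lam_v * lam * np.linalg.norm(Vt) ** 2 + lam_w * lam ** 2 * norm_2_4(Wt) ** 4
    return total / n_mc + reg
```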
Lemma B.13. For every $\epsilon_0\in(0,1)$, every $\epsilon\in\Big(0, \frac{\epsilon_0}{K\|A\|_\infty p_1 p_2^2\, C_s(\Phi,\, p_2C_s(\phi,\|A\|_\infty))\,C_s(\phi,\|A\|_\infty)\sqrt{\|A\|_\infty^2+1}}\Big)$ and $\gamma\in(0,\frac{1}{4}]$, consider any $W_t, V_t$ with
$$L_0(A^*, A^*, A^*, \lambda_t, W_t, V_t)\in\Big[(1+\gamma)OPT + \Omega\big(\|q^\top A^*\|_1\|A^*\|_\infty^4\,\epsilon_0/\gamma\big),\; \tilde O(1)\Big].$$
With high probability over the random initialization, there exist $\widehat W, \widehat V$ with $\|\widehat W\|_F\le 1$, $\|\widehat V\|_F\le 1$ such that for every $\eta\in(0, \frac{1}{\mathrm{poly}(m_1,m_2)}]$,
$$\min\Big\{\mathbb{E}_\Sigma\big[L_0(A^*, A^*, A^*, \lambda_t, W_t+\sqrt{\eta}\,\widehat W\Sigma, V_t+\sqrt{\eta}\,\Sigma\widehat V)\big],\; L_0(A^*, A^*, A^*, (1-\eta)\lambda_t, W_t, V_t)\Big\}\le(1-\eta\gamma/4)\,L_0(A^*, A^*, A^*, \lambda_t, W_t, V_t). \qquad (109)$$

Proof:
Recall the pseudo network and the real network for every r ∈ [K] as

gr (q, A∗ , X, W 0 , V 0 ) = q > A∗ (A∗ (A∗ X(W (0) + W ρ + W 0 ) + B 1 ) D w,ρ,t (V (0) + V ρ + V 0 ) D v,ρ,t )cr

fr (q, A∗ , X, W 0 , V 0 ) = q > A∗ (A∗ (A∗ X(W (0) + W ρ + W 0 ) + B 1 ) D w,ρ,W 0 (V (0) + V ρ + V 0 ) D v,ρ,V 0 )cr
where D w,ρ,t and D v,ρ,t are the diagonal matrices at weights W (0) + W ρ + W t and V (0) + V ρ + V t . D w,ρ,W 0 and
D v,ρ,V 0 are the diagonal matrices at weights W (0) + W ρ + W 0 and V (0) + V ρ + V 0 .

Denote G(q, A∗ , X, W 0 , V 0 ) = (g1 , · · · , gK ), FA∗ (q, X, W 0 , V 0 ) = (f1 , · · · , fK ).


As long as L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ≤ Õ(1), according to C.32 to C.34 in (Allen-Zhu et al., 2019), we have
p
λw k λt Wc k 4 ≤ 0
2,4
p
λv k λt Vb k2F ≤ 0
kW
c kF  1

kVb kF  1
Then we need to study the update direction
f = W t + √η W
W cΣ

Ve = V t + ηΣVb
Changes in Regularizer. Note that here W t ∈ Rd×m1 , V t ∈ Rm1 ×m2 , Σ ∈ Rm1 ×m1 . We know that

EΣ [kV t + ηΣVb k2F ] = kV t k2F + ηkVb k2F
√ c 4 X √ c
EΣ [kW t + η W Σk2,4 ] = E[kwt,i + η W Σi k42 ]
i∈[m1 ]

For each term i ∈ [m1 ], we can bound


√ c c Σi k2 + 2√ηwt,i > W
kwt,i + η W Σi k22 = kwt,i k22 + ηkW 2
c Σi

√ c c Σi k4 + 4ηkwt,i > W
kwt,i + η W Σi k42 = kwt,i k42 + η 2 kW 2
c Σi k2 + 2ηkwt,i k2 kW
2
c Σi k2
2
(110)
≤ kwt,i k42 + 6ηkwt,i k22 kW
c Σi k2 + Op (η 2 )
2

Therefore, by Cauchy-Schwarz inequality, we have


√ c 4
EΣ [kW t + η W Σk2,4 ] ≤ kW t k42,4 + 6ηkW t k22,4 kW
c k2 + Op (η 2 )
2,4
√ √ √
Therefore, by λw k λt W t k42,4 ≤ R( λt W t , λt V t ), we have

f , λt Ve )] ≤R( λt W t , λt V t ) + 6η √0 R( λt W t , λt V t ) + η0


p p p p q p p
E[R( λt W
(111)
p p 1 p p
≤R( λt W t , λt V t ) + ηR( λt W t , λt V t ) + 143η0
4
c and Vb satisfy τw,∞ ≤ 1 1
Changes in Objective. Recall that here W 999 and τv,∞ ≤ 999 . By Lemma B.11, we have for
m11000 m22000
every r ∈ [K]

EW ρ ,V ρ [|fr (q, A∗ , X, W + W ρ + W
f Σ, V + V ρ + Σ0 Ve ) − gr (q, A∗ , X, W + W ρ + W
f Σ, V + V ρ + ΣVe )|]
≤ Õ(kq > A∗ k1 kA∗ k2∞ 0 η) + Op (η 1.5 )
(112)
By Lemma B.10, we have

G(q, A∗ , X, W c Σ, ΣVb ) + √ηG0


f , Ve ) = G(q, A∗ , X, W t , V t ) + ηG(b,b) (q, A∗ , X, W
c Σ, ΣVb ) + √ηG0
= FA∗ (q, X, W t , V t ) + ηG(b,b) (q, A∗ , X, W (113)
c , Vb ) + √ηG0
= FA∗ (q, X, W t , V t ) + ηG(b,b) (q, A∗ , X, W

where EΣ [G0 ] = 0 and |G0 | ≤  with high probability. By C.38 in (Allen-Zhu et al., 2019), we have

EW ρ ,V ρ ,Σ [L(λt FA∗ (q, X, W


f , Ve ), y)]
(114)
f , Ve ) + ηF ∗ ∗ (q, A∗ , X, W
≤EW ρ ,V ρ [L(λt FA∗ (q, X, W f , Ve ), y)] + O(kq > A∗ k1 kA∗ k2 0 η) + Op (η 1.5 )
A ∞

Following C.40 in (Allen-Zhu et al., 2019), we have


∗ ∗
EW ρ ,V ρ [L(λt FA∗ (q, X, W t , V t ) + ηFA ∗ (q, A , X, W t , V t ), y)]

∗ 2
(115)
≤(1 − η)(2L(λt FA∗ (q, X, W t , V t ), y) − L((1 − η)λt FA∗ (q, X, W t , V t ), y)) + ηL(FA ∗ , y) + Op (η )

Putting all of them together. Denote


|Ω|
1 X (0) f Σ0 , V (0) + V ρ + Σ0 Ve ), yi )]
c1 = E ρ ρ 0 [L(λt FA∗ (eg , X, W + Wρ + W (116)
|Ωt | i=1 W ,V ,Σ,Σ
p √
c01 = EΣ [L0 (A∗ , A∗ , A∗ , λt , W
f , +Ve )] = c1 + EΣ [R( λt W
f , λVe )] (117)

|Ω|
1 X
c2 = EW ρ ,V ρ [L((1 − η)λt FA∗ (eg , X, W (0) + W ρ + W t Σ0 , V (0) + V ρ + Σ0 V t ), yi )] (118)
|Ωt | i=1

c02 = L0 (A∗ , A∗ , A∗ , (1 − η)λt , W t , V t ) = c2 + R( (1 − η)λt W t , (1 − η)λt V t )


p p
(119)

|Ω|
1 X
c3 = EW ρ ,V ρ [L(λt FA∗ (eg , X, W (0) + W ρ + W t Σ0 , V (0) + V ρ + Σ0 V t ), yi )] (120)
|Ωt | i=1
p √
c03 = L0 (A∗ , A∗ , A∗ , λt , W t , V t ) = c3 + R( λt W t , λV t ) (121)
Then following from C.38 to C.42 in (Allen-Zhu et al., 2019), we have
ηγ 0
c01 ≤ (1 − η)(2c03 − c02 ) + c + η(OP T + O(kq > A∗ k1 kA∗ k4∞ 0 /γ)) + Op (η 1.5 ), (122)
4 3
which implies
1 ηγ 0 1
min{c01 , c02 } ≤ (1 − η + )c + η OP T + O(kq > A∗ k1 kA∗ k2∞ 0 η/γ) + Op (η 1.5 )
2 8 3 2
As long as c03 ≥ (1 + γ)OP T + Ω(kq > A∗ k1 kA∗ k2∞ 0 /γ) and γ ∈ [0, 1], we have
γ
min{c01 , c02 } ≤ (1 − η )c03
4
Lemma B.14. Note that the three sampled aggregation matrices in a three-layer learner network can be different. We denote them as $A^{t(1)}$, $A^{t(2)}$ and $A^{t(3)}$. Let $W_t, V_t$ be the updated weights trained using $A^*$ and let $W'_t, V'_t$ be the updated weights trained using $A^{t(i)}$, $i\in[3]$. With probability at least $99/100$, the algorithm with $\eta\in\big(0, \frac{1}{\mathrm{poly}(m_1, m_2, \|A^*\|_\infty, K)}\big)$ converges in $T\,T_w = \mathrm{poly}(m_1, m_2)$ iterations to a point with
$$L_0(A^*, A^*, A^*, \lambda_t, W_t, V_t) \le (1+\gamma)OPT + \epsilon_0.$$
If
$$L_0(A^{t(1)}, A^{t(2)}, A^{t(3)}, \lambda_t, W'_t, V'_t) = \frac{1}{|\Omega^t|}\sum_{i=1}^{|\Omega^t|}\mathbb{E}_{W^\rho, V^\rho, \Sigma'}\Big[L\big(\lambda_t F_{A^{t(1)}, A^{t(2)}, A^{t(3)}}(q, X_i, W^{(0)}+W^\rho+W'_t\Sigma', V^{(0)}+V^\rho+\Sigma'V'_t), y_i\big)\Big] + R(\sqrt{\lambda_t}W'_t, \sqrt{\lambda_t}V'_t), \qquad (123)$$
where
$$F_{A^{t(1)}, A^{t(2)}, A^{t(3)}}(q, X, W, V) = q^\top A^{t(3)}\sigma\big(A^{t(2)}\sigma(A^{t(1)}XW + B_1)V + B_2\big)C, \qquad (124)$$
we also have
$$L_0(A^{t(1)}, A^{t(2)}, A^{t(3)}, \lambda_{T-1}, W'_T, V'_T) \le L_0(A^*, A^*, A^*, \lambda_T, W_T, V_T) + \lambda_{T-1}\cdot O(\mathrm{poly}(\epsilon)) \le (1+\gamma)OPT + \epsilon_0. \qquad (125)$$
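For reference, the learner network $F_{A^{t(1)},A^{t(2)},A^{t(3)}}$ in (124) is the plain three-layer GCN forward pass with a possibly different sampled aggregation matrix per layer. A direct NumPy transcription might look as follows (shapes and the readout convention are assumptions made to match the notation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(q, A1, A2, A3, X, W, V, C, B1, B2):
    """F_{A^{t(1)},A^{t(2)},A^{t(3)}}(q, X, W, V) = q^T A3 sigma(A2 sigma(A1 X W + B1) V + B2) C.
    Shapes: A1,A2,A3 (N,N), X (N,d), W (d,m1), V (m1,m2), C (m2,K), B1 (N,m1), B2 (N,m2), q (N,)."""
    H1 = relu(A1 @ X @ W + B1)      # first aggregation and hidden layer
    H2 = relu(A2 @ H1 @ V + B2)     # second aggregation and hidden layer
    return q @ (A3 @ H2 @ C)        # readout at the node selected by q, e.g. q = e_g
```

Training with topology sampling simply redraws $A^{t(1)}, A^{t(2)}, A^{t(3)}$ at every iteration from the strategy of Section A.1, while $A^*$ is the matrix used in the analysis.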

Proof:
By Lemma B.13, we know that as long as L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ∈ [(1 + γ)OP T + Ω(q > A∗ 1kA∗ k4∞ 0 /γ), Õ(1)],
then there exists kW
c kF ≤ 1, kVb kF ≤ 1 such that either
√ c √
EΣ,Σ0 [L0 (A∗ , A∗ , A∗ , λt , W t Σ0 + η W ΣΣ0 , Σ0 V t + ηΣ0 ΣVb )] ≤ (1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t ) (126)

or
L0 (A∗ , A∗ , A∗ , (1 − η)λt , W t , V t ) ≤ (1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t ) (127)
√ c √
Denote W = W (0) + W ρ + W t Σ0 + η W ΣΣ0 , V = V (0) + V ρ + Σ0 V t + ηΣ0 ΣVb . Note that
K
∂L X ∂L ∂fi
= (128)
∂wj i=1
∂fi ∂wj

∂ √ c √
fr (q, A∗ , X, W (0) + W ρ + W t Σ0 + η W ΣΣ0 , V (0) + V ρ + Σ0 V t + ηΣ0 ΣVb )
∂wj
N
X m2
X N
X (129)
>
= q an ci,r 1rn,i +B2(n,i) ≥0 an,k vj,i 1a˜∗ n Xwk +B1(n,k) ≥0 (a˜∗ n X)> ,
n=1 i=1 k=1
∂2F ∂3F
which implies ∂F
are summations of 1, δ, δ 0 functions and their multiplications. It can be found that no
∂wt , ∂w2t , ∂w3t
R∞ R∞
δ(x)δ 0 (x), δ(x) or δ (x) exist in these terms. Therefore, by −∞ δ(t)f (t)dt = f (0) and −∞ δ 0 (t)f (t)dt = −f 0 (0),
2 02

we can obtain that the value of the third-order derivative w.r.t. W ρ of EW ρ ,V ρ ,Σ [L(λt FA∗ (eg , X, W (0) + W ρ +
W t Σ, V (0) + V ρ + ΣV t ), y)] is proportional to poly(kA∗ k∞ , K), some certain value of the probability density
function of W ρ and its derivative, i.e., poly(σw −1
). Similarly, the value of the third-order derivative w.r.t. W ρ of
EW ρ ,V ρ ,Σ [L(λt FA∗ (eg , X, W + W + W t Σ, V (0) + V ρ + ΣV t ), y)] is polynomially depend on σv−1 and kA∗ k∞ .
(0) ρ

By the value selection of σw and σv , we can conclude that L0 is B = poly(m1 , m2 , kA∗ k∞ , K) second-order smooth.
By Fact A.8 in (Allen-Zhu et al., 2019), it satisfies with η ∈ (0, poly(m1 ,m21,kA∗ k∞ ,K) )

1
λmin (∇2 L0 (A∗ , A∗ , A∗ , λt−1 , W t , V t )) < − (130)
(m1 m2 )8
Meanwhile, for t ≥ 1, by the escape saddle point theorem of Lemma A.9 in (Allen-Zhu et al., 2019), we know with probability
at least 1 − p, λmin (∇2 L0 (A∗ , A∗ , A∗ , λt−1 , W t , V t )) > − (m1 1m2 )8 holds. Choosing p = 100T 1
, then this holds for
t = 1, 2, · · · , T with probability at least 0.999.Therefore, for t = 1, 2, · · · , T , the first case cannot happen, i.e., as long as
L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ≥ (1 + γ)OP T + Ω(q > A∗ 1kA∗ k4∞ 0 /γ),

L0 (A∗ , A∗ , A∗ , (1 − η)λt , W t , V t ) ≤ (1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t ) (131)

On the other hand, for t = 1, 2, · · · , T − 1, as long as L0 ≤ Õ(1), by Lemma A.9 in (Allen-Zhu et al., 2019), we have

L0 (A∗ , A∗ , A∗ , λt , W t+1 , V t+1 ) ≤ L0 (A∗ , A∗ , A∗ , λt , W t , V t ) + (m1 m2 )−1 (132)

By L0 (A∗ , A∗ , A∗ , λ1 , W 0 , V 0 ) ≤ Õ(1) with high probability, we have L0 (A∗ , A∗ , A∗ , λt , W t , , V t ) ≤ Õ(1) with


high probability for t = 1, 2, · · · , T . Therefore, after T = Θ̃(η −1 log log0m ) rounds of weight decay, we have
L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ≤ (1 + γ)OP T + Ω(q > A∗ 1kA∗ k4∞ 0 /γ). Rescale down 0 and we can obtain our fi-
nal result.
Consider L0 (At(1) , At(2) , At(3) , λt , W 0t , V 0t ). Let wi , v i be the output weights updated with all the aggregation matrices
equal to A∗ , and let w0i , v 0i be the output weights updated with our sampling strategy in Section 3.2. We know that
T
X −1 w −1 X
TX N m2
X N
X
kwi − wi 0 k . kη q > an ∗ ci,r 1[a∗ n σ(A∗ XW )v i ≥ 0] an,k ∗ vj,i 1[a∗ k Xwj ≥ 0](a∗ k − ak t(1) )Xk
t=0 l=0 n=1 i=1 k=1
1 1
≤ · poly(m1 , m2 )c kA∗ k∞ · poly() = O()
poly(m1 , m2 ) poly()
(133)

T
X −1 w −1 X
TX N m2
X
kv i − v 0i k . kη q > a∗n ci,r 1[a∗ n σ(A∗ XW )v i ≥ 0](a∗ n σ(A∗ XW ) − at(2)
n σ(A
t(1)
XW 0 ))k
t=0 l=0 n=1 i=1
1 1
≤ · poly(m1 , m2 )c kA∗ k∞ · poly() = O()
poly(m1 , m2 ) poly()
(134)
With a slight abuse of notation, for r ∈ [K], we denote

fr (q, At(1) , At(2) , At(3) , X, W 0t , V 0t ) = q > At(3) σ(At(2) σ(At(1) XW + B 1 )V + B 2 )cr (135)

The difference between fr (q, A∗ , X, W t , V t ) and fr (q, At(1) , At(2) , At(3) , X, W 0t , V 0t ) is caused by kA∗ − At(1) k∞ ,
(t) (t) 0 (t) (t) 0
kA∗ − At(2) k∞ , kA∗ − At(3) k∞ , wi − wi and v i − v i . Following the proof in Lemma A.2, we can easily obtain
·poly()
that if |pl − p∗l | ≤ p∗l · O(poly()) and li ≥ |Ni |/(1 + pc∗1LΦ(L,i) ), it can be derived that kA∗ − A(1) k∞ ≤ O(poly()),
l

kA∗ − A(2) k∞ ≤ O(poly()) and kA∗ − A(3) k∞ ≤ O(poly()). Then, by (133) and (134), we have

|q > A∗ σ(A∗ σ(A∗ XW )V )cr − q > A∗ σ(A(2) σ(A(1) XW 0 )V 0 )cr |


N
X m2
X
≤ q > a∗ n ci,r |σ(a∗ n σ(A∗ XW )v i ) − σ(a(2)
n σ(A
(1)
XW 0 )v 0i )|
n=1 i=1
N
X m2
X
≤ q > a∗ n ci,r |a∗ n σ(A∗ XW )v i − a(2)
n σ(A
(1)
XW 0 )v 0i |
n=1 i=1
N
X m2
X N
X m2
X
∗ ∗
≤ > ∗
q a n ci,r |(a ∗
n − a(2)
n )σ(A XW )v i | + > ∗
q a n ci,r |a(2)
n (σ(A XW )v i − σ(A
(1)
XW 0 )v 0i |)
n=1 i=1 n=1 i=1
N
X m2
X

≤ q > a∗ n ci,r |(a∗ n − a(2)
n )σ(A XW )v i |
n=1 i=1
N
X m2
X

+ q > a∗ n ci,r |a(2)
n ((σ(A XW ) − σ(A
(1)
XW 0 ))v i + σ(A(1) XW 0 )(v i − v 0i ))|
n=1 i=1
N
X m2
X N
X m2
X

≤ q > a∗ n ci,r |(a∗ n − a(2)
n )σ(A XW )v i | + q > a∗ n ci,r |a(2)
n σ(A
(1)
XW 0 )(v i − v 0i )|
n=1 i=1 n=1 i=1
N
X m2
X N
X m1
X
+ q > a∗ n ci,r a00n,k vi,l |(a∗ k − a0k )Xwl + a0k X(wl − w0l )|
n=1 i=1 k=1 l=1
≤O(poly()).
(136)
Hence,
∗ ∗ ∗
|e> > (3)
g A σ(A σ(A XW )V )cr − eg A σ(A(2) σ(A(1) XW 0 )V 0 )cr |
∗ ∗ ∗ > ∗
≤|e>
g A σ(A σ(A XW )V )cr − eg A σ(A
(2)
σ(A(1) XW 0 )V 0 )cr | + |e> ∗
g (A − A
(3)
)σ(A(2) σ(A(1) XW )V )cr |
≤O(poly()).
(137)
which implies

L0 (At(1) , At(2) , At(3) , λT −1 , W 0T , V 0T ) ≤ L0 (A∗ , A∗ , A∗ , λT , W T , V T ) + λT −1 · O(poly()) ≤ (1 + γ)OP T + 0


(138)
Proof of Theorem 3.1:
By Lemma B.14, we have that the algorithm converges in T Tw iterations to a point

L0 (At(1) , At(2) , At(3) , λt , W t , V t ) ≤ (1 + γ)OP T + 0



We know w.h.p., among Õ(1/20 ) choices of j,

min{EW ρ ,V ρ ,Σ, z∈Ω L(λT −1 FA∗ (eg , X, W (0) + W ρ,j + W T Σ, V (0) + V ρ,j + ΣV T )} ≤ (1 + γ)OP T + 0 (139)
j

Then we have 1
kW T k2,4 ≤ 04 τw0 (140)

1
kV T kF ≤ 02 τv0 (141)
By Lemma B.9, we know that

fr (eg , A∗ , X i , W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T , B)


 (142)
=fr (eg , A∗ , X i , W (0) + W ρ , V (0) + V ρ , B) + gr(b,b) (eg , A∗ , X i , W T , V T , B) ±
K

Denote r 0 = A∗ σ(A∗ X(W (0) + W ρ ) + B 1 )(V (0) + V ρ ). Then,

kr 0 k ≤ kA∗ k(kA∗ k∞ · Õ(1)) · Õ(1) ≤ kA∗ k∞ (143)

Therefore,
|fr (eg , A∗ , X i , W (0) + W ρ , V (0) + V ρ , B)|

=|e> 0
g A σ(r + B 2 )cr | (144)
∗ ∗
≤Õ(kA k∞ (kA k∞ + 1)c )
We also have
|gr(b,b) (eg , A∗ , X i , W T , V T , B)|
∗ ∗ ∗
≤|e>
g A A (A XW T D (0)
w,x V T ) D (0)
v,x cr |
1√ (145)
≤kA∗ k2∞ τv0 τw0 m14 m2 c
≤C0 kA∗ k2∞
Hence,
fr (eg , A∗ , X i , W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T , B) ≤ Õ(kA∗ k2∞ (c + C0 )) (146)
Combining (135, 137), we can obtain
(1) (2) (3)
fr (eg , At , At , At , X i , W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T , B) ≤ Õ(kA∗ k2∞ (c + C0 )) (147)
(1) (2) (3)
as long as kA∗ − At k∞ ≤ poly(), kA∗ − At k∞ ≤ poly() and kA∗ − At k∞ ≤ poly().
For any given $\{X_i, y_i\}_{i=1}^{|\Omega|}$, the dependency between $y_i$ and $y_j$, where $i, j \in [|\Omega|]$, can be considered in two steps. Figure 5(a) shows that $a_iX$ is dependent on at most $(1+\delta)^2$ of the $a_jX$'s. This is because each $a_iX$ is determined by at most $(1+\delta)$ row vectors $\tilde x_l$, while each $\tilde x_l$ is contained in at most $(1+\delta)$ of the $a_pX$'s. Similarly, $y_i$ is determined by at most $(1+\delta)$ of the $a_pX$'s, and by Figure 5(b) we find that $y_i$ is dependent on at most $(1+\delta)^4$ of the $y_j$'s (including $y_i$ itself). Since the matrix $A^*$ shares the same non-zero entries with $A$, the output with $A^*$ exhibits the same dependence.
Denote $u_i = L(\lambda_{T-1}F_{A^*}(e_g, X, W^{(0)}+W^\rho+W_T\Sigma, V^{(0)}+V^\rho+\Sigma V_T), y_i) - \mathbb{E}_{(e_g,X,y)\in\mathcal{D}}[L(\lambda_{T-1}F_{A^*}(e_g, X, W^{(0)}+W^\rho+W_T\Sigma, V^{(0)}+V^\rho+\Sigma V_T), y)]$. Then $\mathbb{E}[u_i] = 0$. Since $L$ is 1-Lipschitz smooth and $L(0_K, y)\in[0,1]$, we have

|L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi ) − L(0K , yi )|


≤k(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi ) − (0K , yi )k (148)

≤Õ( KkA∗ k2∞ (c + C0 ))

Then, √
|ui | ≤ 2 KkA∗ k2∞ (c + C0 )

Figure 5. (a) Dependency between $a_sX$ and $a_pX$. (b) Dependency between $y_i$ and $y_j$.

t2
P(|ui | ≥ t) ≤ 1 ≤ exp(1 − ) (149)
4KkA∗ k4∞ (c + C0 )2
∗ 4 2 2
Then, ui is a sub-Gaussian random variable. We have Eesui ≤ ekA k∞ (c +C0 ) s . By Lemma 7 in (Zhang et al., 2020), we
have P|Ω| 4 ∗ 4 2 2
Ees i=1 ui ≤ e(1+δ) KkA k∞ (c +C0 ) |Ω|s
Therefore,
|Ω|
 X 1 
P ui ≥ k ≤ exp(kA∗ k4∞ (c + C0 )2 K(1 + δ)4 |Ω|s2 − |Ω|ks) (150)
i=1
|Ω|
q
k ∗ 4 2 (1+δ)4 log N
for any s > 0. Let s = 2kA∗ k4 (c +C )2 K(1+δ)4 , k = kA k∞ (c + C0 ) K |Ω| , we can obtain
∞ 0

|Ω|
 X 1 
P ui ≥ k ≤ exp(−kA∗ k4∞ (c + C0 )2 K log N ) ≤ N −K (151)
i=1
|Ω|

Therefore, with probability at least 1 − N −K , we have

E(eg ,X,y)∼D [L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi )]


|Ωt |
1 X (152)
− t L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi )
|Ω | i=1
≤0
∗ 8 ∗ √ ∗ ∗
as long as |Ω| ≥ Θ̃(−2 4 5 4 6 4
0 kA k∞ (1 + p1 p2 C (φ, kA k∞ )C (Φ, p2 C (φ, kA k∞ ))(kA k∞ + 1) K (1 + δ) log N ), i.e.,

E(eg ,X,y)∼D [L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi )] ≤ OP T +  (153)
