Generalization Guarantee of Training Graph Convolutional Networks With Graph Topology Sampling
Our contributions: To the best of our knowledge, this paper provides the first generalization analysis of training GCNs with graph topology sampling. We focus on semi-supervised node classification problems where, with all node features and partial node labels, the objective is to predict the unknown node labels. We summarize our contributions from the following dimensions.

First, this paper proposes a training framework that implements both stochastic gradient descent (SGD) and graph topology sampling, and the learned GCN model with Rectified Linear Unit (ReLU) activation is guaranteed to approach the best generalization performance of a large class of target functions. Moreover, as the number of labeled nodes and the number of neurons increase, the class of target functions enlarges, indicating improved generalization.

Second, this paper explicitly characterizes the impact of graph topology sampling on the generalization performance through the proposed effective adjacency matrix A∗ of a directed graph that models the node correlations. A∗ depends on both the given normalized graph adjacency matrix in GCNs and the graph sampling strategy. We provide the general insights that (1) if a node is sampled with a low frequency, its impact on other nodes is reduced in A∗ compared with A; and (2) graph sampling on a highly unbalanced A, where some nodes have a dominating impact in the graph, results in a more balanced A∗. Moreover, these insights apply to other graph sampling methods such as FastGCN (Chen et al., 2018a).

We show that learning with topology sampling has the same generalization performance as training GCNs using A∗. Therefore, a satisfactory generalization can still be achieved even when the number of sampled nodes is small, provided that the resulting A∗ still characterizes the data correlations properly. This is the first theoretical explanation of the empirical success of graph topology sampling.

Third, this paper shows that the required number of labeled nodes, referred to as the sample complexity, is a polynomial of ‖A∗‖_∞ and the maximum node degree, where ‖·‖_∞ measures the maximum absolute row sum. Moreover, our sample complexity is only logarithmic in the number of neurons m and consistent with the practical over-parameterization of GCNs, in contrast to the loose bound of poly(m) in (Zhang et al., 2020) in the restrictive setting of two-layer (one-hidden-layer) GCNs without graph topology sampling.

1.1. Related Works

Generalization analyses of GCNs without graph sampling. Some recent works analyze GCNs trained on the original graph. Xu et al. (2019); Cong et al. (2021) characterize the expressive power of GCNs. Xu et al. (2021) analyzes the convergence of gradient descent in training linear GCNs. Lv (2021); Liao et al. (2021); Garg et al. (2020); Oono & Suzuki (2020) characterize the generalization gap, which is the difference between the training error and the testing error, through Rademacher complexity. Verma & Zhang (2019); Cong et al. (2021); Zhou & Wang (2021) analyze the generalization gap of training GCNs using SGD via the notion of algorithmic stability.

To analyze the training error and generalization performance simultaneously, Du et al. (2019) uses the neural tangent kernel (NTK) approach, where the neural network width is infinite and the step size is infinitesimal, shows that the training error is zero, and characterizes the generalization bound. Zhang et al. (2020) proves that gradient descent can learn a model with zero population risk, provided that all data are generated by an unknown target model. The result in (Zhang et al., 2020) is limited to two-layer GCNs and requires a proper initialization in the local convex region of the optimal solution.

Generalization analyses of feed-forward neural networks. The NTK approach was first developed to analyze fully connected neural networks (FCNNs), see, e.g., (Jacot et al., 2018). The works of Zhong et al. (2017); Fu et al. (2020); Li et al. (2022) analyze one-hidden-layer neural networks with Gaussian input data. Daniely (2017) analyzes multi-layer FCNNs but focuses on training the last layer only, while the changes in the hidden layers are negligible. Allen-Zhu et al. (2019) provides the optimization and generalization of three-layer FCNNs. Our proof framework is built upon (Allen-Zhu et al., 2019) but makes two important technical contributions. First, this paper provides the first generalization analysis of graph topology sampling in training GCNs, while Allen-Zhu et al. (2019) considers FCNNs with neither graph topology nor graph sampling. Second, Allen-Zhu et al. (2019) considers i.i.d. training samples, while this paper considers semi-supervised GCNs where the training data are correlated through graph convolution.

1.2. Notations

Vectors are in bold lowercase; matrices and tensors are in bold uppercase. Scalars are in normal fonts. For instance, Z is a matrix, and z is a vector. z_i denotes the i-th entry of z, and Z_{i,j} denotes the (i,j)-th entry of Z. [K] (K > 0) denotes the set of integers from 1 to K. I_d ∈ R^{d×d} and e_i represent the identity matrix in R^{d×d} and the i-th standard basis vector, respectively. We denote the column ℓ_p norm for W ∈ R^{d×N} (for p ≥ 1) as

‖W‖_{2,p} = (∑_{i∈[m]} ‖w_i‖_2^p)^{1/p}   (1)

Hence, ‖W‖_{2,2} = ‖W‖_F is the Frobenius norm of W. We use w_i (w̃_i) to denote the i-th column (row) vector of W. We follow the convention that f(x) = O(g(x)) (or Ω(g(x)), Θ(g(x))) means that f(x) increases at most (or at least, or in the same, respectively) order of g(x). With high probability (w.h.p.) means with probability 1 − e^{−c log²(m_1, m_2)} for a sufficiently large constant c, where m_1 and m_2 are the numbers of neurons in the two hidden layers.
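To make the norm in (1) concrete, the following minimal NumPy sketch computes ‖W‖_{2,p}; the random matrix W and the choice p = 4 are illustrative assumptions only (the (2,4)-norm reappears later as a regularizer in (11)).

```python
import numpy as np

def col_norm_2p(W: np.ndarray, p: float) -> float:
    """Column (2,p)-norm of W as in Eq. (1): the l_p norm of the vector of column l_2 norms."""
    col_l2 = np.linalg.norm(W, ord=2, axis=0)      # ||w_i||_2 for each column i
    return float(np.sum(col_l2 ** p) ** (1.0 / p))

W = np.random.randn(6, 8)
print(col_norm_2p(W, 2))                 # equals the Frobenius norm ||W||_F
print(np.linalg.norm(W, ord="fro"))      # sanity check
print(col_norm_2p(W, 4))                 # the (2,4)-norm used as a regularizer in (11)
```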
Function complexity. For any smooth function φ(z) with its power series representation φ(z) = ∑_{i=0}^∞ c_i z^i, define two useful parameters as follows,

C_ε(φ, R) = ∑_{i=0}^∞ ((C∗ R)^i + (√(log(1/ε))/√i · C∗ R)^i) |c_i|   (2)

C_s(φ, R) = C∗ ∑_{i=0}^∞ (i + 1)^{1.75} R^i |c_i|   (3)

where R ≥ 0 and C∗ is a sufficiently large constant. These two quantities are used in the model complexity and the sample complexity, which represent the required number of model parameters and training samples to learn φ up to error ε, respectively. Many population functions have bounded complexity. For instance, if φ(z) is exp(z), sin(z), cos(z) or a polynomial of z, then C_ε(φ, O(1)) ≤ O(poly(1/ε)) and C_s(φ, O(1)) ≤ O(1).

The main notations are summarized in Table 2 in the Appendix.

2. Training GCNs with Topology Sampling: Formulation and Main Components

GCN setup. Let G = {V, E} denote an un-directed graph, where V is the set of nodes with size |V| = N and E is the set of edges. Let Ã ∈ {0, 1}^{N×N} be the adjacency matrix of G with added self-connections. Let D be the degree matrix with diagonal elements D_{i,i} = ∑_j Ã_{i,j} and zero entries otherwise. A denotes the normalized adjacency matrix with A = D^{−1/2} Ã D^{−1/2}. Let X ∈ R^{N×d} denote the matrix of the features of the N nodes, where the n-th row of X, denoted by x̃_n ∈ R^{1×d}, represents the feature of node n. Assume ‖x̃_n‖ = 1 for all n without loss of generality. y_n ∈ Y represents the label of node n, where Y is the set of all labels. y_n depends not only on x̃_n but also on the neighbors. Let Ω ⊂ V denote the set of labeled nodes. Given X and the labels in Ω, the objective of semi-supervised node classification is to predict the unknown labels in V\Ω.

Learner network. We consider the setting of training a three-layer GCN F : R^N × R^{N×d} → R^{1×K} with

F_A(e_g, X; W, V) = e_g^⊤ A σ(r + B_2) C  and  r = A σ(A X W + B_1) V   (4)

where σ(·) is the ReLU activation. We only update W and V in training, and A represents the graph topology. Note that in conventional GCNs such as (Kipf & Welling, 2017), C is a learnable parameter, and B_1 and B_2 can be zero. Here, for the analytical purpose, we consider a slightly different model where C, B_1 and B_2 are fixed as randomly selected values.

Consider a loss function L : R^{1×K} × Y → R such that for every y ∈ Y, the function L(·, y) is nonnegative, convex, 1-Lipschitz continuous and 1-Lipschitz smooth, and L(0, y) ∈ [0, 1]. This includes both the cross-entropy loss and the ℓ_2-regression loss (for bounded Y). The learning problem solves the following empirical risk minimization problem:

min_{W,V} L_Ω(W, V) = (1/|Ω|) ∑_{i∈Ω} L(F_A(e_i, X; W, V), y_i)   (5)

where L_Ω is the empirical risk of the labeled nodes in Ω. The trained weights are used to estimate the unknown labels on V\Ω. Note that the results in this paper are distribution-free, and no assumption is made on the distributions of x̃_n and y_n.

Training with SGD. In practice, (5) is often solved by gradient-type methods, where in iteration t, the current estimates are updated by subtracting the product of a positive step size and the gradient of L_Ω evaluated at the current estimate. To reduce the computational complexity of estimating the gradient, an SGD method is often employed to compute the gradient of the risk of a randomly selected subset of Ω rather than using the whole set Ω.

However, due to the recursive embedding of neighboring features in GCNs, see the concatenations of A in (4), the computation and memory cost of computing the gradient can be high. Thus, graph topology sampling methods have been proposed to further reduce the computational cost.

Graph topology sampling. A node sampling method randomly removes a subset of nodes and the incident edges from G in each iteration independently, and the embedding aggregation is based on the reduced graph. Mathematically, in iteration s, replace A in (4) with A_s = A P_s, where P_s is a diagonal matrix, and the i-th diagonal entry is 0 if node i is removed in iteration s. The non-zero diagonal entries of P_s are selected differently by different sampling methods. Because A_s is much sparser than A, the computation and memory cost of embedding neighboring features is significantly reduced.
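For concreteness, here is a minimal NumPy sketch of the forward pass in (4) and the empirical risk in (5). The random B_1, B_2, C and the initialization scales follow the description of the learner network, while the specific sizes, the identity placeholder for A, and the ℓ_2 loss are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_forward(A, X, W, V, B1, B2, C):
    """Three-layer GCN of Eq. (4): returns the N x K matrix whose g-th row is F_A(e_g, X; W, V)."""
    H1 = relu(A @ X @ W + B1)        # first aggregation + ReLU,   N x m1
    r = A @ H1 @ V                   # second aggregation,         N x m2
    H2 = relu(r + B2)                # second ReLU,                N x m2
    return A @ H2 @ C                # third aggregation + output, N x K

# illustrative sizes (assumptions, not taken from the paper)
N, d, m1, m2, K = 50, 16, 32, 32, 3
rng = np.random.default_rng(0)
A = np.eye(N)                                        # placeholder for the normalized adjacency matrix
X = rng.normal(size=(N, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_n|| = 1
W = rng.normal(size=(d, m1)) / np.sqrt(m1)
V = rng.normal(size=(m1, m2)) / np.sqrt(m2)
B1 = np.ones((N, 1)) @ rng.normal(size=(1, m1)) / np.sqrt(m1)   # all-one vector times a random row
B2 = np.ones((N, 1)) @ rng.normal(size=(1, m2)) / np.sqrt(m2)
C = rng.normal(size=(m2, K))

Y = rng.normal(size=(N, K))                          # placeholder labels for an l2-regression loss
labeled = rng.choice(N, size=20, replace=False)      # the labeled set Omega
out = gcn_forward(A, X, W, V, B1, B2, C)
risk = 0.5 * np.mean(np.sum((out[labeled] - Y[labeled]) ** 2, axis=1))   # empirical risk (5)
print(risk)
```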
This paper will analyze the generalization performance, i.e., the prediction accuracy of unknown labels, of our algorithm framework that implements both SGD and graph topology sampling to solve (5). The details of our algorithm are discussed in Sections 3.2-3.3, and the generalization performance is presented in Section 3.4.

3. Main Algorithmic and Theoretical Results

3.1. Informal Key Theoretical Findings

We first summarize the main insights of our results before presenting them formally.

1. A provable generalization guarantee of GCNs beyond two layers and with graph topology sampling. The GCN learned by our Algorithm 1 can approach the best performance of label prediction using a large class of target functions. Moreover, the prediction performance improves when the number of labeled nodes and the numbers of neurons m_1 and m_2 increase. This is the first generalization performance guarantee of training GCNs with graph topology sampling.

2. The explicit characterization of the impact of graph sampling through the effective adjacency matrix A∗. We show that training with graph sampling returns a model that has the same label prediction performance as that of a model trained by replacing A with A∗ in (4), where A∗ depends on both A and the graph sampling strategy. As long as A∗ can characterize the correlation among nodes properly, the learned GCN maintains a desirable prediction performance. This explains the empirical success of graph topology sampling on many datasets.

3. The explicit sample complexity bound on graph properties. We provide explicit bounds on the sample complexity and the required number of neurons, both of which grow as the node correlations increase. Moreover, the sample complexity depends on the number of neurons only logarithmically, which is consistent with the practical over-parameterization. To the best of our knowledge, (Zhang et al., 2020) is the only existing work that provides a sample complexity bound based on the graph topology, but in the non-practical and restrictive setting of two-layer GCNs. Moreover, the sample complexity bound by (Zhang et al., 2020) is polynomial in the number of neurons.

4. Tackling the non-convex interaction of weights between different layers. Convexity plays a critical role in many existing analyses of GCNs. For instance, the analyses in (Zhang et al., 2020) require a special initialization in the local convex region of the global minimum, and the results only apply to two-layer GCNs. The NTK approach in (Du et al., 2019) considers the limiting case that the interactions across layers are negligible. Here, we directly address the non-convex interaction of the weights W and V in both the algorithmic design and the theoretical analyses.

3.2. Graph Topology Sampling Strategy

Here we describe our graph topology sampling strategy using A_s, which we randomly generate to replace A in the s-th SGD iteration. Although our method is motivated by the analysis and differs from the existing graph sampling strategies, our insights generalize to other sampling methods like FastGCN (Chen et al., 2018a). The outline of our algorithmic framework of training GCNs with graph sampling is deferred to Section 3.3.

Suppose the node degrees in G can be divided into L groups with L ≥ 1, where the degrees of nodes in group l are in the order of d_l, i.e., between c·d_l and C·d_l for some constants c ≤ C, and d_l is order-wise smaller than d_{l+1}, i.e., d_l = o(d_{l+1}). Let N_l denote the number of nodes in group l.

Graph sampling strategy.² We consider a group-wise uniform sampling strategy, where S_l out of the N_l nodes are sampled uniformly from each group l. For all unsampled nodes, we set the corresponding diagonal entries of a diagonal matrix P_s to zero. If node i is sampled in this iteration and belongs to group l, for any i and l, the i-th diagonal entry of P_s is set as p*_l N_l/S_l for some non-negative constant p*_l. Then A_s = A P_s. N_l/S_l can be viewed as the scaling that compensates for the unsampled nodes in group l. p*_l can be viewed as the scaling that reflects the impact of sampling on nodes with different importance, which will be discussed in detail soon.

Effective adjacency matrix A∗ by graph sampling. To analyze the impact of graph topology sampling on the learning performance, we define the effective adjacency matrix as follows:

A∗ = A P∗   (6)

where P∗ is a diagonal matrix defined as

P∗_{ii} = p*_l  if node i belongs to degree group l   (7)

Therefore, compared with A, all the columns with indices corresponding to group l are scaled by a factor of p*_l. We will formally analyze the impact of graph topology sampling on the generalization performance in Section 3.4, but an intuitive understanding is that our graph sampling strategy effectively changes the normalized adjacency matrix A in the GCN model (4) to A∗.

A∗ can be viewed as the adjacency matrix of a weighted directed graph G′ that reflects the node correlations, where each un-directed edge in G corresponds to two directed edges in G′ with possibly different weights. A∗_{ji} measures the impact of the feature of node i on the label of node j.

²Here we discuss asymmetric sampling as a general case. The special case of symmetric sampling is introduced in Section A.1.
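The following sketch implements the group-wise uniform sampling described above: it draws S_l of the N_l nodes in each degree group, builds the diagonal matrix P_s with entries p*_l N_l/S_l on the sampled nodes, and forms A_s = A P_s; the effective adjacency matrix A∗ = A P∗ of (6)-(7) is computed alongside. The group assignment, the values of p*_l and S_l, and the random stand-in for A are illustrative assumptions.

```python
import numpy as np

def sample_topology(A, groups, S, p_star, rng):
    """One iteration of group-wise uniform graph sampling: returns (A_s, A_star).

    groups : list of index arrays, one per degree group l
    S      : list with S_l, the number of nodes sampled from group l
    p_star : list with the scaling constants p*_l
    """
    N = A.shape[0]
    Ps_diag = np.zeros(N)            # diagonal of P_s (zero for unsampled nodes)
    Pstar_diag = np.zeros(N)         # diagonal of P*, Eq. (7)
    for idx, S_l, p_l in zip(groups, S, p_star):
        sampled = rng.choice(idx, size=S_l, replace=False)
        Ps_diag[sampled] = p_l * len(idx) / S_l    # p*_l N_l / S_l compensates for unsampled nodes
        Pstar_diag[idx] = p_l
    A_s = A * Ps_diag[None, :]          # A_s = A P_s : column-wise scaling
    A_star = A * Pstar_diag[None, :]    # A* = A P*, Eq. (6)
    return A_s, A_star

# illustrative example with two degree groups
rng = np.random.default_rng(0)
N = 10
A = np.abs(rng.normal(size=(N, N)))             # stand-in for a normalized adjacency matrix
groups = [np.arange(0, 3), np.arange(3, N)]     # group 1: 3 high-degree nodes; group 2: the rest
A_s, A_star = sample_topology(A, groups, S=[2, 4], p_star=[0.7, 0.3], rng=rng)
print(np.count_nonzero(A_s), np.count_nonzero(A_star))
```

Note that, under this construction, each diagonal entry of P_s equals p*_l N_l/S_l with probability S_l/N_l and zero otherwise, so A_s equals A∗ in expectation over the sampling; this is the sense in which A∗ acts as the adjacency matrix that training effectively uses.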
If p*_l is in the range (0, 1), the corresponding entries of the columns with indices in group l in A∗ are smaller than those in A. That means the impact of a node in group l on all other nodes is reduced from that in A. Conversely, if p*_l > 1, then the impact of nodes in group l in A∗ is enhanced from that in A.

Parameter selection and insights

(1) The scaling factor p*_l should satisfy

0 ≤ p*_l ≤ c_1 / (L ψ_l),  ∀l   (8)

for a positive constant c_1 that can be sufficiently large. ψ_l is defined as follows,

ψ_l := √(d_L d_l) N_l / (∑_{i=1}^L d_i N_i),  ∀l ∈ [L]   (9)

Note that (8) is a minor requirement for most graphs. To see this, suppose L is a constant and every N_l is in the order of N. Then ψ_l is less than O(1) for all l. Thus, all constant values of p*_l satisfy (8) with ψ_l from (9). A special example is that the p*_l are all equal, i.e., A∗ = c_2 A for some constant c_2. Because one can scale W and V by 1/c_2 in (4) without changing the results, A∗ is equivalent to A in this case.

The upper bound in (8) only becomes active in highly unbalanced graphs where there exists a dominating group l̂ such that √(d_l̂) N_l̂ ≫ √(d_l) N_l for all other l. Then the upper bound of p*_l̂ is much smaller than those for the other p*_l. Therefore, the columns of A∗ that correspond to group l̂ are scaled down more significantly than other columns, indicating that the impact of group l̂ is reduced more significantly than that of other groups in A∗. Therefore, the takeaway is that graph topology sampling reduces the impact of dominating nodes more than other nodes, resulting in a more balanced A∗ compared with A.

(2) The number of sampled nodes shall satisfy

S_l / N_l ≥ (1 + c_1 poly(ε) / (L p*_l ψ_l))^{−1},  ∀l ∈ [L]   (10)

where ε is a small positive value. The sampling requirement in (10) has two takeaways. First, the higher-degree groups shall be sampled more frequently than lower-degree groups. To see this, consider a special case where p*_l = 1 and N_l = N/L for all l. Then (10) indicates that S_l is larger in a group l with a larger d_l. This intuition is the same as in FastGCN (Chen et al., 2018a), which also samples high-degree nodes with a higher probability in many cases. Therefore, the insights from our graph sampling method also apply to other sampling methods such as FastGCN. We will show the connection to FastGCN empirically in Section 4.2. Second, reducing the number of samples in group l corresponds to reducing the impact of group l in A∗. To see this, note that decreasing p*_l reduces the right-hand side of (10).

Algorithm 1 Training with SGD and graph topology sampling
1: Input: Normalized adjacency matrix A, node features X, known node labels in Ω, the step size η, the number of inner iterations T_w, the number of outer iterations T, σ_w, σ_v, λ_w, λ_v.
2: Initialize W^(0), V^(0), B_1, B_2, C.
3: W_0 = 0, V_0 = 0.
4: for t = 0, 1, ..., T − 1 do
5:   Apply noisy SGD with step size η on the stochastic objective L̂_Ω(λ_t; W, V) in (11) for T_w steps. To generate the stochastic objective in each step s, randomly sample a batch of labeled nodes Ω_s from Ω; generate A_s using graph sampling; randomly generate W^ρ, V^ρ and Σ. Let the starting point be W = W_t, V = V_t, and suppose it reaches W_{t+1} and V_{t+1}.
6:   λ_{t+1} = λ_t · (1 − η).
7: end for
8: Output: W^(out) = √(λ_{T−1}) (W^(0) + W^ρ + W_T Σ), V^(out) = √(λ_{T−1}) (V^(0) + V^ρ + Σ V_T).

3.3. The Algorithmic Framework of Training GCNs

Because (5) is non-convex, solving it directly using SGD can get stuck at a bad local minimum in theory. The main idea in the theoretical analysis to address this non-convexity is to add weight decay and regularization to the objective of (5) such that, with a proper regularization, any second-order critical point is almost a global minimum.

Specifically, for initialization, entries of W^(0) are i.i.d. from N(0, 1/m_1), and entries of V^(0) are i.i.d. from N(0, 1/m_2). B_1 (or B_2) is initialized to be an all-one vector multiplying a row vector with i.i.d. samples from N(0, 1/m_1) (or N(0, 1/m_2)). Entries of C are drawn i.i.d. from N(0, 1).

In each outer loop t = 0, ..., T − 1, we use noisy SGD³ with step size η for T_w iterations to minimize the stochastic objective function L̂_Ω in (11) with some fixed λ_{t−1}, where λ_0 = 1 and the weight decays with λ_{t+1} = (1 − η) λ_t:

L̂_Ω(λ_t; W, V) = L_Ω(√λ_t (W^(0) + W^ρ + W Σ), √λ_t (V^(0) + V^ρ + Σ V)) + λ_w ‖√λ_t W‖_{2,4}^4 + λ_v ‖√λ_t V‖_F^2   (11)

³Noisy SGD is vanilla SGD plus a Gaussian perturbation. It is a common trick in the theoretical analyses of non-convex optimization (Ge et al., 2015) and is not needed in practice.
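For readers who prefer code, a minimal scaffold of this outer loop is sketched below, assuming the inner noisy-SGD routine is supplied as a callback (run_noisy_sgd, a hypothetical helper). The step size, iteration count, and the once-only draw of W^ρ, V^ρ and Σ are simplifications for brevity; Algorithm 1 re-draws them, together with Ω_s and A_s, inside every inner step.

```python
import numpy as np

def algorithm1_outer_loop(W0, V0, run_noisy_sgd, eta=0.01, T=50,
                          sigma_w=1e-3, sigma_v=1e-3, rng=None):
    """Outer loop of Algorithm 1: weight decay lambda_{t+1} = (1 - eta) lambda_t wrapped
    around an inner noisy-SGD solver for the stochastic objective (11)."""
    rng = rng or np.random.default_rng(0)
    m1 = W0.shape[1]
    W_t, V_t = np.zeros_like(W0), np.zeros_like(V0)        # W_0 = 0, V_0 = 0
    lam = 1.0                                               # lambda_0 = 1
    W_rho = rng.normal(scale=sigma_w, size=W0.shape)        # Gaussian smoothing (analysis only)
    V_rho = rng.normal(scale=sigma_v, size=V0.shape)
    Sigma = np.diag(rng.choice([-1.0, 1.0], size=m1))       # random +/-1 diagonal (Dropout-like)
    for t in range(T):
        # T_w steps of noisy SGD on L_hat(lambda_t; W, V); the callback is expected to
        # resample the labeled batch Omega_s and the sampled topology A_s at every step.
        W_t, V_t = run_noisy_sgd(lam, W_t, V_t, W_rho, V_rho, Sigma)
        if t < T - 1:
            lam *= (1.0 - eta)                              # lambda_{t+1} = lambda_t (1 - eta)
    W_out = np.sqrt(lam) * (W0 + W_rho + W_t @ Sigma)       # output scaling with sqrt(lambda_{T-1})
    V_out = np.sqrt(lam) * (V0 + V_rho + Sigma @ V_t)
    return W_out, V_out
```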
L̂_Ω(λ_t; W, V) is stochastic because in each inner iteration s, (1) we randomly sample a subset Ω_s of the labeled nodes; (2) we randomly sample A_s from the graph topology sampling method in Section 3.2; (3) W^ρ and V^ρ are small perturbation matrices with entries i.i.d. drawn from N(0, σ_w²) and N(0, σ_v²), respectively; and (4) Σ ∈ R^{m_1×m_1} is a random diagonal matrix with diagonal entries uniformly drawn from {1, −1}. W^ρ and V^ρ are standard Gaussian smoothing in the literature of theoretical analyses of non-convex optimization, see, e.g., (Ge et al., 2015), and are not needed in practice. Σ is similar to the practical Dropout (Srivastava et al., 2014) technique that randomly masks out neurons and is also introduced for the theoretical analysis only.

The last two terms in (11) are additional regularization terms for some positive λ_w and λ_v. As shown in (Allen-Zhu et al., 2019), ‖·‖_{2,4} is used in the analysis to drive the weights to be evenly distributed among neurons. The practical regularization ‖·‖_F has the same effect in empirical results, while the theoretical justification is open.

Algorithm 1 summarizes the algorithm with the parameter selections in Table 1. Let W^out and V^out denote the returned weights. We use F_{A∗}(e_i, X; W^out, V^out) to predict the label of node i. This might sound different from the conventional practice, which uses A in predicting unknown labels. However, note that A∗ only differs from A by a column-wise scaling as in (6). Moreover, A∗ can be set as A in many practical datasets based on our discussion after (9). Here we use the general form of A∗ for the purpose of analysis.

We remark that our framework of algorithm and analysis can be easily applied to the simplified setup of two-layer GCNs. The resulting algorithm is much simplified, to a vanilla SGD plus graph topology sampling. All the additional components above are introduced to address the non-convex interaction of W and V theoretically and may not be needed for practical implementation. We skip the discussion of two-layer GCNs in this paper.

Table 1. Parameter choices for Algorithm 1
λ_v = 20 m_2^{1/2+0.01} / m_1^{1−0.01}        σ_v = 1/m_2
λ_w = 20 m_1^{3−0.002} / C_0^4               σ_w = 1/m_1^{1−0.01}
C = C_ε(φ, ‖A∗‖_∞) √(‖A∗‖_∞² + 1)            C′ = 10 C √p_2
C″ = C_ε(Φ, C′) √(‖A∗‖_∞² + 1)               C_0 = Õ(p_1² p_2 K² C C″)

3.4. Generalization Guarantee

Our formal generalization analysis shows that our learning method returns a GCN model that approaches the minimum prediction error that can be achieved by the best function in a large concept class of target functions. The concept class has two important properties: (1) the prediction error decreases as the size of the function class increases; and (2) the concept class uses A∗ in (6) as the adjacency matrix of the graph topology. Therefore, the result implies that if A∗ accurately captures the correlations among node features and labels, the learned GCN model can achieve a small prediction error on unknown labels. Moreover, no other function in a large concept class can perform better than the learned GCN model. To formalize the results, we first define the target functions as follows.

Concept class and target function F∗. Consider a concept class consisting of target functions F∗ : R^N × R^{N×d} → R^{1×K}:

F∗_{A∗}(e_g, X) = e_g^⊤ A∗ (Φ(r_1) ⊙ r_2) C∗,
r_1 = A∗ φ_1(A∗ X W∗_1) V∗_1,   (12)
r_2 = A∗ φ_2(A∗ X W∗_2) V∗_2,

where ⊙ denotes the entry-wise product and φ_1, φ_2, Φ : R → R are all infinite-order smooth.⁴ The parameters W∗_1, W∗_2 ∈ R^{d×p_2}, V∗_1, V∗_2 ∈ R^{p_2×p_1}, C∗ ∈ R^{p_1×K} satisfy that every column of W∗_1, W∗_2, V∗_1, V∗_2 has unit norm, and the maximum absolute value of C∗ is at most 1. The effective adjacency matrix A∗ is defined in (6). Define

C_ε(φ, R) = max{C_ε(φ_1, R), C_ε(φ_2, R)},   (13)

C_s(φ, R) = max{C_s(φ_1, R), C_s(φ_2, R)}.   (14)

We focus on target functions where the function complexities C_ε(Φ, R), C_s(Φ, R), C_ε(φ, R), C_s(φ, R), defined in (2)-(3) and (13)-(14), as well as p_1 and p_2, are all bounded.

(12) is more general than GCNs. If r_2 is a constant matrix, (12) models a GCN, where W∗_1 and V∗_1 are the weight matrices in the first and second layers, respectively, and φ_1 and Φ are the activation functions in each layer.

Modeling the prediction error of unknown labels. We will show that the GCN learned by our method performs almost the same as the best function in the concept class (12) in predicting unknown labels. Because practical datasets usually contain noise in features and labels, we employ a probabilistic model to model the data. Note that our result is distribution-free, and the following distributions are introduced only for the presentation of the results.

⁴When Φ is operated on a matrix r_1, Φ(r_1) means applying Φ to each entry of r_1. In fact, our results still hold for a more general case where a different function Φ_j is applied to every entry of the j-th column of r_1, j ∈ [p_2]. We keep the simpler model to have a more compact representation. Similar arguments hold for φ_1, φ_2.
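To make the concept class concrete, the sketch below evaluates a target function of the form (12), taking Φ = tanh, φ_1 = sin, φ_2 = cos as illustrative smooth choices; the identity placeholder for A∗ and all dimensions are assumptions.

```python
import numpy as np

def target_function(A_star, X, W1, V1, W2, V2, C_star,
                    Phi=np.tanh, phi1=np.sin, phi2=np.cos):
    """Target function of Eq. (12): rows are F*_{A*}(e_g, X) for every node g."""
    r1 = A_star @ phi1(A_star @ X @ W1) @ V1      # N x p1
    r2 = A_star @ phi2(A_star @ X @ W2) @ V2      # N x p1
    return (Phi(r1) * r2) @ C_star                # entry-wise product, then N x K output

# illustrative dimensions (assumptions)
rng = np.random.default_rng(1)
N, d, p2, p1, K = 30, 8, 5, 4, 3
A_star = np.eye(N)                                 # placeholder effective adjacency matrix
X = rng.normal(size=(N, d))
W1 = rng.normal(size=(d, p2)); W1 /= np.linalg.norm(W1, axis=0)   # unit-norm columns
W2 = rng.normal(size=(d, p2)); W2 /= np.linalg.norm(W2, axis=0)
V1 = rng.normal(size=(p2, p1)); V1 /= np.linalg.norm(V1, axis=0)
V2 = rng.normal(size=(p2, p1)); V2 /= np.linalg.norm(V2, axis=0)
C_star = rng.uniform(-1, 1, size=(p1, K))          # entries bounded by 1 in absolute value
print(target_function(A_star, X, W1, V1, W2, V2, C_star).shape)   # (N, K)
```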
Specifically, let D_{x̃_n} denote the distribution from which the feature x̃_n of node n is drawn. For example, when the noise level is low, D_{x̃_n} can be a distribution centered at the observed feature of node n with a small variance. Similarly, let D_{y_n} denote the distribution from which the label y_n of node n is drawn. Let e_g be uniformly selected from {e_i}_{i=1}^N ⊂ R^N. Let D denote the concatenation of these distributions of a data point

z = (e_g, X, y) ∈ R^N × R^{N×d} × Y.   (15)

Then the given feature matrix X and the partial labels in Ω can be viewed as |Ω| identically distributed but correlated samples from D. The correlation results from the fact that the label of node i depends not only on the feature of node i but also on neighboring features. This model of correlated samples is different from the conventional assumption of i.i.d. samples in supervised learning and makes our analyses more involved.

Let

OPT_{A∗} = min_{W∗_1, W∗_2, V∗_1, V∗_2, C∗} E_{(e_g, X, y)∼D} L(F∗_{A∗}(e_g, X), y)   (16)

be the smallest population risk achieved by the best target function (over the choices of W∗_1, W∗_2, V∗_1, V∗_2, C∗) in the concept class F∗_{A∗} in (12). OPT_{A∗} measures the average loss of predicting the unknown labels if the estimates are computed using the best target function in (12). Clearly, OPT_{A∗} decreases as the size of the concept class increases, i.e., when p_1 and p_2 increase. Moreover, if A∗ indeed models the node correlations accurately, OPT_{A∗} can be very small, indicating a desired generalization performance. We next show that the population risk of the GCN model learned by our method can be arbitrarily close to OPT_{A∗}.

Theorem 3.1. For every γ ∈ (0, 1/4], every ε_0 ∈ (0, 1/100], and every ε ∈ (0, (K p_1 p_2 C_s(Φ, p_2 C_s(φ, O(1))) C_s(φ, O(1)) · ‖A∗‖_∞²)^{−1} ε_0), as long as the numbers of neurons and labeled nodes satisfy (17) and (18), respectively, the population risk of the learned GCN model approaches OPT_{A∗}, where A∗ is the effective adjacency matrix in (12).

Theorem 3.1 shows that the required sample complexity is polynomial in ‖A∗‖_∞ and δ, where δ is the maximum node degree without self-connections in A. Note that condition (8) implies that ‖A∗‖_∞ is O(1). Then, as long as δ is O(N^α) for some small α in (0, 1), say α = 1/5, one can accurately infer the unknown labels from a small percentage of labeled nodes. Moreover, our sample complexity is sufficient but not necessary. It is possible to achieve a desirable generalization performance if the number of labeled nodes is less than the bound in (18).

Graph topology sampling affects the generalization performance through A∗. From the discussion in Section 3.2, graph sampling reduces the node correlation in A∗, especially for dominating nodes. The generalization performance does not degrade when OPT_{A∗} is small, i.e., when the resulting A∗ is sufficient to characterize the node correlation in a given dataset. That explains the empirical success of graph sampling in many datasets.

4. Numerical Results

To unveil how our theoretical results align with GCNs' generalization performance in experiments, we focus on numerical evaluations on synthetic data where we can control the target functions and compare with A∗ explicitly. We also evaluate both our graph sampling method and FastGCN (Chen et al., 2018a) to validate that the insights for our graph sampling method also apply to FastGCN.

We generate a graph G with N = 2000 nodes. G has two degree groups. Group 1 has N_1 nodes, and every node degree approximately equals d_1. Group 2 has N_2 nodes, and every node degree approximately equals d_2. The edges between nodes are randomly selected. A is the normalized adjacency matrix of G.

The node labels are generated by the target function

y = (sin(Â X W∗) ⊙ tanh(Â X W∗)) C∗,   (20)

and different choices of Â generate three different datasets from (20). We consider both our graph sampling method in Section 3.2 and FastGCN (Chen et al., 2018a).

4.1. Sample Complexity and Neural Network Width with respect to ‖A∗‖_∞

We fix N_1 = 100, N_2 = 1900 and vary A by changing the node degrees d_1 and d_2. In the graph topology sampling method, p*_1 = 0.7 and p*_2 = 0.3. For every fixed A, the effective adjacency matrix A∗ is computed based on (6) using p*_1 and p*_2. Synthetic labels are generated based on (20) using A∗ as Â.

Figure 1 shows that the testing error decreases as the number of labeled nodes |Ω| increases, when the number of neurons per layer m is fixed as 500. Moreover, as ‖A∗‖_∞ increases, the required number of labeled nodes increases to achieve the same level of testing error. This verifies our sample complexity bound in (18).

Figure 2 shows that the testing error decreases as m increases when |Ω| is fixed as 1500. Moreover, as ‖A∗‖_∞ increases, a larger m is needed to achieve the same level of testing error. This verifies our bound on the number of neurons in (17).

In Figure 3, N_1 = 100 and N_2 = 1900, with d_1 = 10 and d_2 = 1. Figure 3(a) shows the testing performance of a GCN learned by Algorithm 1, where p*_1 = 0.9 and p*_2 = 0.1. The method indeed performs the best on Dataset 1, where Â is generated using p̂_1 = 0.9 and p̂_2 = 0.1, in which case A∗ = Â. This verifies our theoretical result that graph sampling affects A∗ in the target functions, i.e., it achieves the best performance if A∗ is the same as Â in the target function.
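A sketch of the synthetic setup used in this section is given below: a random graph with two degree groups, its normalized adjacency matrix A = D^{−1/2} Ã D^{−1/2}, and labels drawn from the target function (20). The edge-generation rule and the shapes of W∗ and C∗ are illustrative assumptions, and Â stands for whichever (possibly scaled) adjacency matrix is used inside the target function.

```python
import numpy as np

def two_group_graph(N1, N2, d1, d2, rng):
    """Random undirected graph whose first N1 nodes have degree about d1 and the rest about d2."""
    N = N1 + N2
    A_tilde = np.eye(N)                                  # self-connections
    deg = np.array([d1] * N1 + [d2] * N2)
    for i in range(N):
        k = max(int(deg[i]) - int(A_tilde[i].sum() - 1), 0)
        if k > 0:
            nbrs = rng.choice(np.delete(np.arange(N), i), size=k, replace=False)
            A_tilde[i, nbrs] = A_tilde[nbrs, i] = 1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt             # A = D^{-1/2} A_tilde D^{-1/2}

rng = np.random.default_rng(0)
A = two_group_graph(N1=100, N2=1900, d1=10, d2=1, rng=rng)

# labels from the target function (20): y = (sin(A_hat X W*) ⊙ tanh(A_hat X W*)) C*
d, p, K = 16, 6, 3                                       # illustrative sizes
X = rng.normal(size=(A.shape[0], d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W_star = rng.normal(size=(d, p)); W_star /= np.linalg.norm(W_star, axis=0)
C_star = rng.uniform(-1, 1, size=(p, K))
A_hat = A                                                # e.g., set A_hat = A* for the sampled datasets
Z = A_hat @ X @ W_star
Y = (np.sin(Z) * np.tanh(Z)) @ C_star                    # noiseless synthetic labels
print(Y.shape)
```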
Figure 4. Generalization performance of learned GCNs on datasets generated from different A∗ by (a) our graph sampling strategy and (b) FastGCN. A is balanced.

5. Conclusion

This paper provides a new theoretical framework for explaining the empirical success of graph sampling in training GCNs. It quantifies the impact of graph sampling explicitly through the effective adjacency matrix and provides generalization and sample complexity analyses. One future direction is to develop active graph sampling strategies based on the presented insights and to analyze their generalization performance. Other potential extensions include the construction of statistical-model-based characterizations of A∗ and their fitness to real-world data, and the generalization analysis of deep GCNs, graph auto-encoders, and jumping knowledge networks.

Acknowledgements

This work was supported by AFOSR FA9550-20-1-0122, ARO W911NF-21-1-0255, NSF 1932196 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons). We thank Ruisi Jian and Haolin Xiong at Rensselaer Polytechnic Institute for their help in formulating the numerical experiments. We thank all anonymous reviewers for their constructive comments.

References

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6158–6169, 2019.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018a.

Chen, T., Sui, Y., Chen, X., Zhang, A., and Wang, Z. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning, pp. 1695–1706. PMLR, 2021.

Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, 2019.

Cong, W., Ramezani, M., and Mahdavi, M. On provable benefits of depth in training graph convolutional networks. Advances in Neural Information Processing Systems, 34, 2021.

Daniely, A. SGD learns the conjugate kernel class of the network. Advances in Neural Information Processing Systems, 30:2422–2430, 2017.

Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.

Fu, H., Chi, Y., and Liang, Y. Guaranteed recovery of one-hidden-layer neural networks via cross entropy. IEEE Transactions on Signal Processing, 68:3225–3235, 2020.

Garg, V., Jegelka, S., and Jaakkola, T. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pp. 3419–3430. PMLR, 2020.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842. PMLR, 2015.
Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, 2018.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. International Conference on Learning Representations (ICLR), 2017.

Li, H., Zhang, S., and Wang, M. Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. In 2022 56th Annual Conference on Information Sciences and Systems (CISS), pp. 37–42. IEEE, 2022.

Li, J., Zhang, T., Tian, H., Jin, S., Fardad, M., and Zafarani, R. SGCN: A graph sparsifier based on graph convolutional networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 275–287. Springer, 2020.

Liao, R., Urtasun, R., and Zemel, R. A PAC-Bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2021.

Lv, S. Generalization bounds for graph convolutional neural networks via Rademacher complexity. arXiv preprint arXiv:2102.10234, 2021.

Oono, K. and Suzuki, T. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. Advances in Neural Information Processing Systems, 33, 2020.

Peng, N., Poon, H., Quirk, C., Toutanova, K., and Yih, W.-t. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115, 2017.

Ramezani, M., Cong, W., Mahdavi, M., Sivasubramaniam, A., and Kandemir, M. GCN meets GPU: Decoupling "when to sample" from "how to sample". Advances in Neural Information Processing Systems, 33:18482–18492, 2020.

Sanchez-Gonzalez, A., Heess, N., Springenberg, J. T., Merel, J., Riedmiller, M., Hadsell, R., and Battaglia, P. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning, pp. 4470–4479. PMLR, 2018.

Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Van den Berg, R., Kipf, T. N., and Welling, M. Graph convolutional matrix completion. In KDD, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations (ICLR), 2018.

Verma, S. and Zhang, Z.-L. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1539–1548, 2019.

Wang, X., Ye, Y., and Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857–6866, 2018.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations (ICLR), 2019.

Xu, K., Zhang, M., Jegelka, S., and Kawaguchi, K. Optimization of graph neural networks: Implicit acceleration by skip connections and more depth. In International Conference on Machine Learning. PMLR, 2021.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983, 2018.

Zhang, S., Wang, M., Liu, S., Chen, P.-Y., and Xiong, J. Fast learning of graph neural networks with guaranteed generalizability: One-hidden-layer case. arXiv preprint arXiv:2006.14117, 2020.

Zheng, C., Zong, B., Cheng, W., Song, D., Ni, J., Yu, W., Chen, H., and Wang, W. Robust graph representation learning via neural sparsification. In International Conference on Machine Learning, pp. 11458–11468. PMLR, 2020.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning. PMLR, 2017.

Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y., and Gu, Q. Layer-dependent importance sampling for training deep and large graph convolutional networks. Advances in Neural Information Processing Systems, 32:11249–11259, 2019.
A. Preliminaries
Lemma A.1. ‖ã_n X‖ ≤ ‖A‖_∞.

Proof:

‖ã_n X‖ = ‖∑_{k=1}^N a_{n,k} x̃_k‖ = ‖∑_{k=1}^N (a_{n,k} / ∑_{k=1}^N a_{n,k}) x̃_k‖ · ∑_{k=1}^N a_{n,k} ≤ (∑_{k=1}^N (a_{n,k} / ∑_{k=1}^N a_{n,k}) ‖x̃_k‖) · ‖A‖_∞ = ‖A‖_∞   (21)

where the second-to-last step is by the convexity of ‖·‖ (together with ∑_{k} a_{n,k} ≤ ‖A‖_∞), and the last step uses ‖x̃_k‖ = 1 for every k.
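A quick numerical sanity check of Lemma A.1 under the paper's normalization ‖x̃_n‖ = 1 and a non-negative normalized adjacency matrix; the random graph used here is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 16
A_tilde = (rng.random((N, N)) < 0.05).astype(float)
A_tilde = np.maximum(A_tilde, A_tilde.T); np.fill_diagonal(A_tilde, 1.0)   # undirected + self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A = D_inv_sqrt @ A_tilde @ D_inv_sqrt

X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # ||x_n|| = 1 for every node

lhs = np.linalg.norm(A @ X, axis=1).max()          # max_n ||a_n X||
rhs = np.abs(A).sum(axis=1).max()                  # ||A||_inf, the maximum absolute row sum
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```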
Lemma A.2. Given a graph G with L (≥ 1) groups of nodes, where group i, whose node degrees are of order d_i, is denoted as N_i. Suppose that in iteration t, A_t (or any of A_t^(1), A_t^(2), A_t^(3) in the general setting) is generated from the sampling strategy in Section 3.2. If the number of sampled nodes satisfies l_i ≥ |N_i| / (1 + c_1 poly(ε) / (L p*_i Ψ_i)), we have
Proof:
From Section 3.2, we can rewrite that
(
|Nk | ∗
t
ãn = lk pk An,j , if the nodes n, j are connected and j is selected and j ∈ Nk (23)
0, else
(
p∗k An,j , if the nodes n, j are connected and j ∈ Nk
a˜∗ n = (24)
0, else
> > PN
Let A∗ = (a˜∗ 1 , a˜∗ 2 , · · · , a˜∗ n )> . Since that we need that j=1 A∗n,j ≤ O(1), we require
X
p∗i An,j ≤ O(1/L), holds for any i ∈ [L], n ∈ [N ] (25)
j∈Ni
We first roughly compute the ratio of edges that one node is connected to the nodes in another group. For the node with degree
deg(i), it has deg(i) − 1 open edges except the self-connection. Hence, the group with degree deg(j) has (deg(j) − 1)|Nj |
open edges except self-connections in total. Therefore, the ratio of the edges connected to the group j to all groups is
(deg(j) − 1)|Nj | dj |Nj |
PL ≈ PL (26)
l=1 (deg(l) − 1)|Nl | l=1 dl |Nl |
Define r
dn di |Ni |
Ψ(n, i) = · PL (27)
di l=1 dl |Nl |
Then, as long as
X 1 di |Ni |
p∗i An,j ≈ p∗i √ · PL dn . p∗i Ψ(n, i) ≤ O(1/L) (28)
j∈|Ni |
di dn l=1 ld |N l |
i.e., r PL
c1 c1 c1 di dl |Nl |
p∗i ≤ = = l=1
(29)
L · maxn∈[L] {Ψ(n, i)} L · Ψ(L, i) L dL di |Ni |
for some constant c1 > 0, we can obtain that kA∗ k∞ ≤ O(1). Since that
X 1 di |Ni | lk X lk
An,j ≈ √ · PL dn ≈ An,j (30)
di dn l=1 dl |Nl |
|Nk | |Nk |
j∈Sk j∈N k
X 1 di |Ni | lk X lk
An,j ≈ √ · PL dn (1 − )≈ An,j (1 − ), (31)
di dn l=1 dl |Nl |
|N k | |N k|
j ∈S
/ k j∈N k
kãtn − a˜∗ n k1
L X L X
X |Nk | X
= An,j p∗k ( − 1) + An,j p∗k
lk
k=1 j∈Sk k=1 j ∈S
/ k
L
X |Nk | lk X lk X
. (p∗k ( − 1) An,j + (1 − )p∗k An,j ) (32)
lk |Nk | |Nk |
k=1 j∈Nk j∈Nk
L
X 1 X
.poly() An,j
LΨ(L, k)
k=1 j∈Nk
:=poly()Γ(A∗ )
c1 poly()
where the first inequality is by (30, 31) and the second inequality holds as long as li ≥ |Ni |/(1 + Lp∗ ). Combining
i Ψ(L,i)
(41), we have
L L
X X X 1 X
p∗i An,j . An,j = Γ(A∗ ) ≤ O(1) (33)
i=1 i=1
LΨ(L, i)
j∈Ni j∈Ni
Lemma A.3. Given a graph G with L(≥ 1) groups of nodes, where the group i with node degree di is denoted as Ni .
Suppose At (or any of At(1) , At(2) , At(3) in the general setting) is generated from the sampling strategy in Section A.1, if
the number of sampled nodes satisfies li ≥ |Ni |/(1 + cLp
2 poly()
∗ Ψ ), then we have
i i
Proof:
i.e.,
r PL
p c2 c1 c2 di l=1 dl |Nl |
p∗i ≤ = = (42)
L · maxn∈[L] {Ψ(n, i)} L · Ψ(L, i) L dL di |Ni |
for some constant c2 > 0, we can obtain that kA∗ k∞ ≤ O(1).
The difference between ãtn and a˜∗ n can then be derived as
kãtn − a˜∗ n k1
s
L X L X
X p |Nk ||Nu | X p
= An,j p∗u p∗k ( − 1) + An,j p∗u p∗k
lk lu
k=1 j∈Sk k=1 j ∈S
/ k
s (43)
L L X
X X p ∗ |Nk ||Nu | lk X p lk
≈ ∗
An,j pu pk ( − 1) + An,j p∗u p∗k (1 − )
lk lu |Nk | |Nk |
k=1 j∈Nk k=1 j∈Nk
.poly()
c2 poly()
as long as li ≥ |Ni |/(1 + √ ∗
)2 .
L pi Ψ(L,i)
B.1. Lemmas
B.1.1. F UNCTION APPROXIMATION
To show that the target function can be learnt by the learner network with the Relu function, a good approach is to firstly
find a function h(·) such that the φ functions in the target function can be approximated by h(·) with an indicator function.
In this section, Lemma B.1 provides the existence of such h(·) function. Lemma B.2 and B.3 are two supporting lemmas to
prove Lemma B.1.
Lemma B.1. For every smooth function φ, every ∈ (0, C(φ,a)1√a2 +1 ), there exists a function h : R2 →
√ √ √
[−C (φ, a) a2 + 1, C (φ, a) a2 + 1] that is also C (φ, a) a2 + 1-Lipschitz continuous on its first coordinate with the
following two (equivalent) properties:
Proof:
Firstly, since we can assume w∗ = (1, 0, · · · , 0) without loss of generality by rotating x and w, it can be derived that x, w,
w∗ are equivalent to that they
p are two-dimensional. Therefore, proving Lemma B.1b suffices in showing Lemma B.1a.
Let w0 = (α, β), x = (x1 , t2 − x21 ) where α and β are independent. Following the idea of Lemma 6.3 in (Allen-Zhu et al.,
p ⊥
2019), we use another randomness as an alternative, i.e., we write x⊥ = ( t2 − x21 , −x1 ), w0 = α xt + β xt ∼ N (0, I).
q
x2
Then w0 X = tα. Let α1 = w01 = α xt1 + β 1 − t21 , where α, β ∼ N (0, 1). Hence, α1 ∼ N (0, 1).
We first use Lemma B.2 to fit φ(x1 ). By Taylor expansion, we have
∞
X ∞
X
φ(x1 ) = c0 + ci xi1 + ci xi1
i=1, odd i i=2, even i
∞
(45)
X
= c0 + c0i Eα,β∼N (0,1) [hi (α1 )1[qi (b0 )]1[w0 X + b0 ≥ 0]]
i=1
where hi (·) is the Hermite polynomial defined in Definition A.5 in (Allen-Zhu et al., 2019), and
√ (
0 ci 0 200i2 |ci | t2 + 1 |b0 | ≤ t/(2i), i is odd
ci = 0 , |ci | ≤ 1−i
and qi (b0 ) = (46)
pi (i − 1)!! t 0 < −b0 ≤ t/(2i), i is even
1
q √
2 +1
Let Bi = 100i 2 + 10 log( 1 tt1−i ). Define ĥi (α1 ) = hi (α1 ) · 1[|α1 | ≤ Bi ] + hi (sign(α1 )Bi ) · 1[|α1 | > Bi ] as the
truncated version of the Hermite polynomial. Then we have
∞
X
φ(x1 ) = c0 + R(x1 ) + c0i Eα,β∼N (0,1) [ĥi (α1 )1[qi (b0 )]1[w0 X + b0 ≥ 0]],
i=1
where
∞
X h i
c0i Eα,β∼N (0,1) hi (α1 ) · 1[|α1 | > Bi ] − hi (sign(α1 )Bi · 1[|α| > Bi ]) 1[qi (b0 )]1[w0 X + b0 ≥ 0]
R(x1 ) =
i=1
Define
∞
X
h(α1 , b0 ) = c0 + c0i · ĥi (α1 ) · 1[qi (b0 )]
i=1
i=1
t (47)
∞
X p 2
≤ (2 + c20 ) + (i + 1)1.75 · |ci | · ti t2 + 1
i=0
2 2
≤ Cs (φ, t) (t + 1)
Lemma B.2. Denote hi (x) as the degree-i Hermite polynomial as in Definition A.5 in (Allen-Zhu et al., 2019). For every
1−i
integer i ≥ 1, there exists constant p0i with |p0i | ≥ √tt2 +1 (i−1)!!
100i2 such that
1 b0 t
for even i : xi1 = 0 Ew0 ∼N (0,I),b0 ∼N (0,1) [hi (α1 )1[α ≥ − ]1[0 < −b0 ≤ ]] (48)
pi t 2i
1 b0 t
for odd i : xi1 = 0 Ew0 ∼N (0,I),b0 ∼N (0,1) [h(α1 )1[α ≥ − ]1[|b0 | ≤ ]] (49)
pi t 2i
for kxk ≤ t.
Proof:
For even i, by Lemma A.6 in (Allen-Zhu et al., 2019), we have
b0 t t xi
Ew0 ∼N (0,I),b0 ∼N (0,1) [hi (α1 )1[α ≥ − ]1[0 < −b0 ≤ ]] = Eb0 ∼N (0,1) [pi · 1[0 < −b0 ≤ ]] · i1
t 2i 2i t
, where
i−1 i−1−r
exp(−b20 /(2t2 )) X (−1) 2
i/2 − 1
pi = (i − 1)!! √ (−b0 /t)r
2π r!! (r − 1)/2
r=1,r odd
i−1−r
(−1) 2 i/2−1
Define cr = r!! (r−1)/2
. Then sign(cr ) = −sign(cr+2 ). We can derive
cr (−b0 /t)r b0 i + 1 − r 1 1
= ( )2 ≤ ≤
cr−2 (−b0 /t)r−2 t r(r − 1) 4i 4
Therefore,
i−1
X 3
cr (−b0 /t)r ≥ |b0 /t|
4
r=1,r odd
For odd i, similarly by Lemma A.6 in (Allen-Zhu et al., 2019), we can obtain
b0 t t xi
Ew0 ∼N (0,I),b0 ∼N (0,1) [h(α1 )1[α ≥ − ]1[|b0 | ≤ ]] = Eb0 ∼N (0,1) [pi · 1[|b0 | ≤ ]] · i1
t 2i 2i t
, where
i−1 i−1−r
exp(−b20 /(2t2 )) X (−1) 2
i/2 − 1
pi = (i − 1)!! √ (−b0 /t)r
2π r=1,r even
r!! (r − 1)/2
P∞ √
1. i=1 |c0i | · |Ex∼N (0,1) [|hi (x)| · 1[|x| ≥ b]]| ≤
8 t2 + 1
P∞ √
2. i=1 |c0i | · |Ex∼N (0,1) [|hi (b)| · 1[|x| ≥ b]]| ≤
8 t2 + 1
P∞ √
3. i=1 |c0i |Ez∈N (0,1) [|hi (z)|1[|z| ≤ Bi ]] ≤ C (φ, t) t2 + 1
P∞ √
4. i=1 |c0i |Ez∈N (0,1) [| dz
d
hi (z)|1[|z| ≤ Bi ]] ≤ C (φ, t) t2 + 1
Proof:
By the definition of Hermite polynomial in Definition A.5 in (Allen-Zhu et al., 2019), we have
bi/2c
X |x|i−2j i2j
hi (x) ≤
j=1
j!
Lemma B.4 shows the target function can be approximated by the pseudo network with some parameters. Lemma B.5 to B.8
provides how the existence of such a pseudo network is developed step by step.
1 √
Lemma B.4. For every ∈ (0, 2 2
), there exists
Kkqk1 p1 p2 Cs (Φ,p2 Cs (φ,kAk∞ ))Cs (φ,kAk∞ ) kAk∞ +1
√ p
M = poly(C (Φ, p2 C (φ, kAk∞ ) kAk2∞ + 1), 1/)
p
C = C (φ, kAk∞ ) kAk2∞ + 1 (59)
√
C 0 = 10C p2 (60)
p
C 00 = C (Φ, C 0 ) kAk2∞ + 1 (61)
C0 = Õ(p21 p2 K 2 CC 00 ) (62)
c , Vb with m1 , m2 ≥ M ,
such that with high probability, there exists W
√
C0 m1
kW
c k2,∞ ≤ , kVb k2,∞ ≤
m1 m2
such that
K
hX i
E(X,y)∈D |fr∗ (q, A, X) − gr(0) (q, A, X, W
c .Vb )| ≤
r=1
Proof: p
For each φ2,j , we can construct hφ,j : R2 → [−C, C] where C = C (φ, kAk∞ ) kAk2∞ + 1 using Lemma B.1 satisfying
for i ∈ [m1 ]. Consider any arbitrary b ∈ Rm1 with vi ∈ {−1, 1}. Define
1
00
c = (C0 C /C) (vi
2
hφ,j (w∗2,j > wi , B1(i) )ed )i∈[m1 ]
(0) (0)
X
∗
W 2
v2,j (64)
c m 1
j∈[p2 ]
X c∗ K
1 √ X (0)
X
Vb = (C0 C 00 /C)− 2 k
(vh( m2 ∗
v1,j αi,j , B2(i) ) ci,r )i∈[m2 ] (65)
m2 r=1
k∈[p1 ] j∈[p2 ]
Then,
gr(0) (q, A, W
c , Vb , B)
N
X X N
X X
= q > an ci,r 1rn,i +B2(n,i) ≥0 an,j 1aj Xw(0) +B aj X W
c i Vbi,i0
i 1(j,i) ≥0
n=1 i∈[m1 ] i0 ∈[m2 ] j=1
N N
X c∗k X X √ X (0)
X X
= q > an c2i,r 1rn,i +B2(n,i) ≥0 h( m2 ∗
v1,j αi,j , B2(i) ) an,j ∗
v2,l φ2,l (aj Xw∗2,l )
m2 2c n=1 j=1
k∈[p1 ] i∈[m1 ] j∈[p2 ] l∈[p2 ]
N
X X X N
X N
X X (66)
= q > an c∗k Φ( ∗
v1,j am,n φ1,j (am Xw∗1,j )) an,j ∗
v2,l φ2,l (aj Xw∗2,l )
k∈[p1 ] n=1 j∈[p2 ] m=1 j=1 l∈[p2 ]
p
± O(p1 p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)
N
X X X X
= q > an c∗k Φ(ãn ∗
v1,j φ1,j (AXw∗1,j ))ãn ∗
v2,l φ2,l (AXw∗2,l )
n=1 k∈[p1 ] j∈[p2 ] l∈[p2 ]
p
± O(kqk1 p1 p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
where the first step comes from definition of g (0) , the second step is derived from (64) and (65) and the second to last step is
by Lemma B.8.
Lemma B.5. For every smooth function φ, every w∗ ∈ Rd with kw∗ k = 1, for every ∈ (0, 1√
),
Cs (φ,kAk∞ ) kAk2∞ +1
(0) (0) (0) (0) (0) (0)
there exists real-valued functions ρ(v 1 , W (0) , B 1(n) ), J(ãn X, v 1 , W (0) , B 1(n) ), R(ãn X, v 1 , W (0) , B 1(n) ) and
φ (ãn X) such that for every X
N
(0) (0) (0) (0) (0) (0)
X
rn,1 (X) = ρ(v 1 , W (0) , B 1(n) ) aj,n φ (aj Xw∗ ) + J(X, v 1 , W (0) , B 1(n) ) + R(X, v 1 , W (0) , B 1(n) )
j=1
p (0) (0)
Moreover, letting C = C (φ, kAk∞ ) kAk2∞ + 1 be the complexity of φ, and if v1,i ∼ N (0, m12 ) and wi,j , B 1(n) ∼
N (0, m11 ) are at random initialization, then we have
(0) (0) (0) (0)
1. for every fixed ãn X, ρ(v 1 , W (0) , B 1(n) ) is independent of J(ãn X, v 1 , W (0) , B 1(n) ).
(0) (0)
2. ρ(v 1 , W (0) , B 1(n) ) ∼ N (0, 100C12 m2 ).
3. |φ (ãn Xw∗i ) − φ(ãn Xw∗i )| ≤
4. with high probability, |R(X, v 1 , W (0) , B 1(n) )| ≤ Õ( √kAk , B 1(n) )| ≤ Õ( kAk∞√
(0) (0) ∞ (0) (0) (0) (1+kAk∞ )
m1 m2 ), |J(X, v 1 , W m2 ) and
(0) (0)
E[J(X, v 1 , W (0) , B 1(n) )] = 0.
With high probability, we also have
(0) τ
ρ̃(v1 ) ∼ N (0, )
C 2 m2
1
W2 (ρ|W (0) ,B (0) , ρ̃) ≤ Õ( √ )
1(n) C m2
Proof:
By Lemma B.1, we have
(0) (0)
p
Since α = i∈S ui [e> ]i and |ui [e>
P
dW dW ]i | ≤ Õ(1/ m1 |S|), by (67) and the Wasserstein distance bound of central
limit theorem we know there exists g ∼ N (0, m11 ) such that
1
W2 (α|W (0) ,B (0) , g) ≤ Õ( √ )
1(n) τ m1
Then,
N m1
(0) (0) (0)
X X
rn,1 (X) = aj,n vi,1 σ(aj Xw1 + B1(n,i) )
j=1 i=1
N N
(0) (0) (0) (0) (0) (0)
X X X X
= aj,n vi,1 σ(aj Xw1 + B1(n,i) ) + aj,n vi,1 σ(aj Xw1 + B1(n,i) ) (68)
j=1 i∈S
/ j=1 i∈S
N
(0) (0) (0)
X X
= J1 + aj,n vi,1 σ(aj Xw1 + B1(n,i) )
j=1 i∈S
rn,1 (X) − J1
N N
X X (0) (0) (0) si X X (0) (0) (0) (0)
= aj,n vi,1 1[aj Xw1 + B1(n,i) ] p α + aj,n vi,1 1[aj Xw1 + B1(n,i) ](aj Xβ i + B1(n,i) ) (69)
j=1 i∈S
2 |S| j=1 i∈S
=P1 + P2
Here, we know that since
(0) (0) (0) (0) (0) (0)
E[vi,1 σ(aj Xw1 + B1(n,i) )] = E[vi,1 ] · E[σ(aj Xw1 + B1(n,i) )] = 0 (70)
Hence,
N X (0) (0) (0)
X
E[J1 ] = E[ aj,n vi,1 σ(aj Xw1 + B1(n,i) )] = 0 (71)
j=1 i∈S
/
√ N
C m2 X C
|√ P1 − aj,n φ (aj Xw∗ )| ≤ Õ(kAk∞ √ )
τ m1 j=1
τ m1
Define √
(0) (0) τ m1 τ
ρ(v 1 , W (0) , B 1(n) ) = √ α ∼ N (0, 2 )
C m2 C m2
Then,
N
(0) (0) (0) (0)
X
P1 = ρ(v 1 , W (0) , B 1(n) ) · aj,n φ (aj Xw∗ ) + R1 + R2 (X, v 1 , W (0) , B 1(n) )
j=1
Therefore,
1
W2 (ρ|W (0) ,B (0) , ρ̃) ≤ Õ( √ )
1(n) C m2
Meanwhile,
(0) si (0) (0) 1
aj Xwi = α p aj Xed + aj Xβ i + B1(n,i) = aj Xβ i + B1(n,i) ± Õ( p )
|S| |S|m1
we have
N
(0) (0) (0)
X X
P2 = aj,n vi,1 1[aj Xβ i + b1(n,j) ](aj Xβ i + b1(n,i) ) + R3 = J2 + R3
j=1 i∈S
E[J2 ] = 0 (73)
1 1
αi,j ∼ N (0, ), βi (X) ∼ N (0, ),
m2 m2
satisfying
2
N
X X p23
W2 (rn,i (X), αi,j am,n φ1,j, (am Xw∗1,i ) + Ci βi (X)) ≤ Õ( 1√ )
j∈[p2 ] m=1 m1 6
m2
Proof:
Define p2 S many chunks of the first layer with each chunk corresponding to a set Sj,l , where |Sj,l | = m1 /(p2 S) for j ∈ [p2 ]
and l ∈ [S], such that
m1 m1 m1
Sj,l = {(j − 1) + (l − 1) S + k|k ∈ [ } ⊂ [m1 ]
p2 p2 p2 S]
By Lemma B.5, we have
N
(0) (0)
X X
rn,i (X) = ρ(v i [j, l], W (0) [j, l], B 1(n) [j, l]) am,n φ (am Xw∗1,j )
j∈[p2 ],l∈[S] m=1
(74)
(0) (0) (0) (0)
X
+ Jj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l]) + Rj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l]),
j∈[p2 ],l∈[S]
(0) (0)
where ρ(v i [j, l], W (0) [j, l], B 1(n) [j, l]) ∼ N (0, 100C 21m2 p2 S ). Then ρj = 1 0
P
l∈[S] ρj,l ∼ N (0, C 02 m2 ) for C =
√
10C p2 . Define
(0) (0)
X
JjS (X) = Jj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l])
l∈[S]
(0) (0)
X
RjS (X) = Rj (X, v i [j, l], W (0) [j, l], B 1(n) [j, l])
l∈[S]
kAk∞ (1 + kAk∞ )
W2 (JjS (X), βj (X)) ≤ √
m2 pS
N √
X X Sp2 p2 kAk∞ (1 + kAk∞ )
W2 (rn,i (X), ρj am,n φ1,j, (am Xw∗1,j ) 0
+ β (X)) ≤ Õ( √ + )
m=1
m1 m2 m2 S
j∈[p2 ]
We know there exists a positive constant Ci such that β 0 /Ci ∼ N (0, m12 ). Let αi,j = C 0 ρj , βi0 = β 0 /Ci . Notice that
P 2 (0) (0) (0)
E l∈[S],i∈[p2 ] [Jj (X, v i [j, l], W [j, l], b1 [j, l])] = Õ(kAk2∞ (1 + kAk∞ )2 /m2 ). Hence, we have
Ci ≤ Õ(kAk∞ (1 + kAk∞ )
1
Let S = (m1 /p2 ) 3 , we can obtain
2
N
X X p23
W2 (rn,i (X), αi,j am,n φ1,j, (am Xw∗1,i ) + Ci βi (X)) ≤ Õ( 1√ )
j∈[p2 ] m=1 m1 6
m2
p
Lemma B.7. There exists function h : R2 → [−C 00 , C 00 ] for C 00 = C (Φ, C 0 ) kAk2∞ + 1 such that
√ X
∗ (0)
X
∗
E[1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ] j∈[p2 ]
X N
X X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (ãn Xw∗2,j )) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1)
j∈[p2 ] m=1 j∈[p2 ]
(75)
Proof: PN PN
Choose w = (αi,1 , · · · , αi,p2 , βi ), x = ( m=1 am,n φ1,1, , · · · , m=1 am,n φ1,p2 , , Ci ) and w∗ = (v1,1
∗ ∗
, · · · , v1,p2
, 0).
2 2 00 00 00 0
p
Then, kxk ≤ O(kAk∞ + kAk∞ ). By Lemma B.1, there exists h : R → [−C , C ] for C = Cs (Φ, C ) kAk∞ + 1 2
such that
√ (0)
X
E[1wX+b(0) ≥0 h( m2 w> w∗ , b2(n,i) )( ∗
v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ]
√ (0)
X
=Eαi ,βi [1P PN ∗ 0 (0) h( m2 w> w∗ , b2(n,i) )( ∗
v2,j φ2,j (ãn Xw∗2,j ))]
] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
j∈[p2
j∈[p2 ]
(76)
X N
X X
=Φ(C 0 ∗
v1,j am,n φ1,j, ) ∗
v2,j φ2,j (ãn Xw∗2,j )) ± C 000
j∈[p2 ] m=1 j∈[p2 ]
where X p
C 000 = sup | ∗
v2,j φ2,j (ãn Xw∗2,j )| ≤ p2 Cs (φ, kAk∞ ) kAk2∞ + 1
j∈[p2 ]
2 2 2
N
X X 2p23 2p23 √ 2p23
Pr αi,j am,n φ1,j, (ãn Xw∗1,i ) + Ci βi0 ≤ Õ( 1 √ ) ≤ Õ( 1 √ ) · m2 = Õ( 1 ), (78)
j∈[p2 ] m=1 m16 m2 m16 m2 m16
2/3 1/6
which implies with probability at least 1 − 2p2 /m1 , (77) holds. Therefore,
√ X
∗ (0)
X
∗
E[1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
2(n,i)
j∈[p2 ] j∈[p2 ]
√ X
∗ (0)
X
∗
=E[1P PN ∗ 0 (0) h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
j∈[p2 ] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
j∈[p2 ] j∈[p2 ]
± E[1rn,i (X)+b(0) ≥0
6= 1P PN ∗ 0 (0) ]O(C 000 C 00 )
2(n,i) j∈[p2 ] αi,j m=1 am,n φ1,j, (ãn Xw 1,i )+Ci βi +b2(n,i) ≥0
√ X
∗ (0)
X
∗
=E[1P αi,j
PN
am,n φ1,j, (ãn Xw∗ 0 (0)
≥0
h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
j∈[p2 ] m=1 1,i )+Ci βi +b2(n,i)
j∈[p2 ] j∈[p2 ]
2/3
2p2
± Õ( 1/6 C 000 C 00 )
m1
X XN X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (x)) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1 · ),
j∈[p2 ] m=1 j∈[p2 ]
(79)
where the first step is by Lemma B.6, the second step is by (77) and (78) and the last step comes from (76) and m1 ≥ M .
Lemma B.8.
m2 2
1 X ci,l √ X
∗ (0)
X
∗
E[ 2
1rn,i (X)+b(0) ≥0 h( m2 v1,j αi,j , b2(n,i) )( v2,j φ2,j (ãn Xw∗2,j ))]
m2 i=1 c 2(n,i)
j∈[p2 ] j∈[p2 ]
N
X
∗
X X
∗
(80)
=Φ( v1,j am,n am Xδφ1,j ) v2,j φ2,j (ãn Xw∗2,j ))
j∈[p2 ] m=1 j∈[p2 ]
p
± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ ) kAk2∞ + 1 · )
Proof:
(0) (0)
Recall ρ̃(v 1 ) ∼ N (0, C 2τm2 ). Define ρ̃j,l = ρ̃(v 1 [j, l]). Therefore,
1
W2 (ρj,l |W (0) ,B (0) , ρ̃j,l ) ≤ Õ( √ ) (81)
1(n) C 0 m2 S
1
W2 (ρj |W (0) ,B (0) , ρ̃j ) ≤ Õ( √ ) (82)
1(n) C0 m2
where ρ̃j = l∈[S] ρj,l . We then define α̃i,j = C 0 ρ̃j
P
kãn k2∞
r̃n,i ∼ N (0, E[kuk]2 )
m2
Then we have p
kAk∞ kAk2∞ + 1
W2 (rn,i (X), r̃n,i (X)) ≤ Õ( √ ) (83)
m2
X N
X X p
∗ ∗
=Φ( v1,j am,n φ1,j ) v2,j φ2,j (ãn Xw∗2,j )) ± Õ(p22 Cs (Φ, p2 Cs (φ, kAk∞ ))Cs (φ, kAk∞ )( kAk2∞ + 1 · )
j∈[p2 ] m=1 j∈[p2 ]
(84)
B.1.3. C OUPLING
This section illustrates the coupling between the real and pseudo networks. We first define diagonal matrices D n,w ,
D n,w + D 00n,w , D n,w + D 0n,w for node n as the sign of Relu’s in the first layer at weights W (0) , W (0) + W ρ and
W (0) +W ρ +W 0 , respectively. We also define diagonal matrices D n,v , D n,v +D 00n,v , D n,v +D 0n,v for node n as the sign
of Relu’s in the second layer at weights {W (0) , V (0) }, {W (0) +W ρ , V (0) +V ρ } and {W (0) +W ρ +W 0 , V (0) +V ρ +V 0 },
respectively. For every l ∈ [K], we then introduce the pseudo network and its semi-bias, bias-free version as
(b)
gl (q, A, X, W , V ) = q > A((A(AXW + B 1 ) (D w + D 0w )V ) (D v + D 0v ))cl (86)
(b,b)
gl (q, A, X, W , V ) = q > A((A(AXW ) (D w + D 0w )V ) (D v + D 0v ))cl (87)
Lemma B.9 gives the final result of coupling with added Drop-out noise. Lemma B.10 states the sparse sign change in Relu
and the function value changes of pseudo network by some update. To be more specific, Lemma B.11 shows that the sign
pattern can be viewed as fixed for the smoothed objective when a small update is introduced to the current weights. Lemma
B.12 proves the bias-free pseudo network can also approximate the target function.
Lemma B.9. Let FA = (f1 , f2 , · · · , fK ). With high probability, we have for any kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv , such that
0
where we use D (0) (0)
w,x and D v,x to denote the sign matrices at random initialization W
(0)
, V (0) and we let D (0)
w,x + D w,x ,
0 0 0
D (0)
v,x + D v,x be the sign matrices at W + W Σ, V + ΣV .
Proof:
(0) (0) (0) (0) (0) (0)
Since ãn Xwi + B1(n,i) = ãn X̃ w̃i where w̃i = (wi , B1(n,i) ) ∈ Rd+1 and X̃ = (X, 1) ∈ RN ×(d+1) , we can
ignore the bias term for simplicity. Define
Z 1 = A(AXW 0 Σ) D (0)
w,x
−1
Therefore, we have kZ n ΣV 0 k2 ≤ Õ(kAk∞ m1 2 τv ).
Let s be the total number of sign changes in the first layer caused by adding W 0 . Note that the total number of coordinated
3
(0)
i such that |ãn Xwi | ≤ s00 = 2τ1w is at most s00 m12 with high probability. Since kW 0 k2,4 ≤ τw , we must have
s4
3 3 4 6
00 τw
s ≤ Õ(s m ) = Õ(
2
1 m1 ). Therefore, kZ 2,n k0 ≤ s = Õ(τw5 m15 ). Then,
2
s4
Then we have 6 3
kZ 2,n ΣV 0 k2 ≤ Õ(τv τw5 m110 kAk∞ )
With high probability, we have
N m2
X X √
q > an 0
ci,l (σ(rn,i + rn,i 0
) − σ(rn,i )) ≤ Õ(kqk m2 )krn,i k
n=1 i=1
where D 00v,x is the diagonal sign change matrix from ZV (0) to (Z + Z 1 + Z 2 )V (0) + Z 1 ΣV 0 . The difference includes
three terms. 1
−1
kZ 1,n V (0) k∞ ≤ Õ(kAk∞ m14 τw m2 2 ) (92)
− 12 √ 9 8
−1
kZ 2,n V (0) k∞ ≤ Õ(kZ 2,n km2 s) ≤ Õ(kAk∞ m110 τw5 m2 2 ) (93)
1
kZ 1,n ΣV 0 k ≤ Õ(τv kAk∞ τw m14 ) (94)
where (92) is by Fact C.9 in (Allen-Zhu et al., 2019). Then we have
3 1
− 12 9 8
−1 1 4 4 4 1
|A1 − A2 | ≤ kq > Ak1 · Õ(m22 kAk2∞ (m14 τw m2 + m110 τw5 m2 2 )2 + m22 τv3 kAk∞
3
τw3 m13 )
1
|q > A(Z 1 V (0) D (0) >
v,x )cl | ≤ Õ(kq Ak1 kAk∞ τw m1 )
4
Therefore, we have
8 9 1 1
|A2 − A3 | ≤ Õ(kq > Ak1 kAk∞ τw5 m110 + kqk1 kAk∞ τw m14 + kqk1 τv kAk∞ τw m14 )
Finally, we have
fl (q, A, X, W (0) + W 0 Σ, V (0) + ΣV 0 )
0 0
=q > A(ZV (0) D (0) >
v,x )cl + q A(A(AXW D (0)
w,x V ) D (0)
v,x )cl
√ (95)
m2 9 16 √ 8 9
± Õ(τv √ + m15 τw5 m2 + τw5 m110 ) · kqk1 kAk∞
m1
1 1 1 τw 1
Lemma B.10. Suppose τv ∈ (0, 1], τw ∈ [ 3 , 1 ], σw ∈ [ 3 , 1 ], σv ∈ (0, 1 )]. The perturbation matrices satisfies
m12 m12 m12 m14 m22
kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv , kW 00 k2,4 ≤ τw , kV 00 kF ≤ τv and random diagonal matrix Σ has each diagonal entry i.i.d.
drawn from {±1}. Then with high probability, we have
(1) Sparse sign change
4 6
kD 0n,w k0 ≤ Õ(τw5 m15 )
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk∞ + kAk∞ τw m14 ) + m2 kAk∞
3
(kAk∞ τv + kAk∞ τw m14 (1 + τv )) 3 )
(2) Cross term vanish
for every r ∈ [K], where EΣ [gr0 ] = 0 and |gr0 | ≤ ηkq > Ak1 kAk2∞ τv .
Proof:
(0) (0) (0) (0) (0) (0)
(1) We first consider the sign changes by W ρ . Since ãn Xwi +B1(n,i) = ãn X̃ w̃i where w̃i = (wi , B1(n,i) ) ∈ Rd+1
and X̃ = (X, 1) ∈ RN ×(d+1) , we can ignore the bias term for simplicity. We have
(0)
Pr[|ãn X̃ w̃i | ≤ |ãn X̃ w̃ρi |] = Pr[|z| ≤ 1]
Z 1
1
= √ 2+ 1 dz
−1 π(σ w m 1 z σw
√
m1 )
2 1
Z (σw m1 ) 2
1 (97)
= dt
π(t2 + 1)
1
2 m )2
−(σw 1
2 √
= arctan σw m1
π
√
≤ Õ(σw m1 )
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
Then, we have
3
kD 00n,w k0 ≤ Õ(σw m12 )
(0) 3 3
kãn X̃ W̃ D 00n,w k2 ≤ Õ(kãn X̃kσw2 m14 )
We then consider the sign changes by W 0 . Let s = kD 0n,w − D 00n,w k0 be the total number of sign changes in the first layer
(0) ρ
caused by adding W 0 . Note that the total number of coordinated i such that |ãn X̃(W̃ + W̃ )i | ≤ s00 = 2τw
1 is at most
s4
3
s00 m1 with high probability. Since kW 0 k2,4 ≤ τw , we must have
2
3 τw 3
s ≤ Õ(s00 m 2 ) = Õ( 1 m12 )
s 4
4 6
kD 0n,w − D 00n,w k0 = s ≤ Õ(τw5 m15 )
(0) ρ 1 6 3
kãn X̃(W̃ + W̃ )(D 0n,w − D 00n,w )k2 ≤ Õ(s 4 τw ) ≤ Õ(τw5 m110 )
To sum up, we have
3 4 6 4 6
kD 0n,w k0 ≤ Õ(σw m12 + τw5 m15 ) ≤ Õ(τw5 m15 )
(0) (0) ρ (0)
Denote z n,0 = ãn X̃ W̃ D n,w and z n,2 = ãn X̃(W̃ + W̃ + W 0 )(D n,w + D 0n,w ) − ãn X̃ W̃ D n,w . With high
probability, we know
ρ
kz n,2 k ≤kãn X̃W 0 k + kãn X̃ W̃ k
1 1
≤Õ(m14 τw kAk∞ + kAk∞ σw m12 ) (98)
1
≤Õ(kAk∞ τw m1 ) 4
kãn (Z 0 + Z 2 )V 0 + ãn Z 2 V (0) k2 ≤ Õ(kAk∞ ((kz 1,0 k + kz 1,2 k)τv + kz 1,2 k))
Combining kz 1,0 k ≤ Õ(kAk∞ ), by Claim C.8 in (Allen-Zhu et al., 2019) we have
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk2∞ + kAk2∞ τw m14 ) + m2 kAk∞
3
(kAk∞ τv + kAk∞ τw m14 (1 + τv )) 3 )
where the last two terms are the error terms. We know that
v v
u m1 u m1
(0) > (0)
uX (0)
uX 1
kW k ≤ max ka W k ≤ max t (a> wi )2 ≤ max t ( √ )2 = 1
kak=1 kak=1
i=1
kak=1
i=1
m1
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
Therefore,
Proof:
0
Pρ,η − Pρ,η = q > A(A(AX(W + W ρ + ηW 00 Σ) + B 1 ) (D w,ρ,η − D w,ρ )(V + V ρ + ηΣV 00 ) D v,ρ )cr
> ρ 00 ρ 00
+ q A(A(AX(W + W + ηW Σ) + B 1 ) D w,ρ,η (V + V + ηΣV ) (D v,ρ,η − D v,ρ ))cr
(104)
We write
Z = A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
Z + Z 0 = A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ,η
Since for all n ∈ [N ], kη(AAXW 00 Σ)n k∞ ≤ ηkAk∞ τw,∞ , we have
kZ 0n k∞ ≤ ηkAk∞ τw,∞
0 ηkAk∞ τw,∞
Prρ [Zn,i 6= 0] ≤ Õ( ), i ∈ [m1 ]
W σw
Then we have
Pr[kZ 0n k0 ≥ 2] ≤ Op (η 2 )
Then we only need to consider the case kZ 0n k0 = 1. Let Zn,n 0
i
6= 0. Then the first term in (104), q > A(Z 0 (V + V ρ +
00
ηΣV ) D v,ρ )cr should be dealt with separately.
The term q > A(Z 0 ηΣV 00 D v,ρ )cr ) contributes to Op (η 3 ) to the whole term.
Then we have
kq > A(Z 0 η(V + V ρ ) D v,ρ )cr k ≤ Õ(ηkkq > Ak1 kkAk∞ τw,∞ )
We also have that
ηkAk∞ τw,∞ ηkAk∞ τw,∞
Õ(( m1 )N ) ≤ Õ( m1 ) ≤ 1
σw σw
2
τw,∞
Therefore, the contribution to the first term is Õ(η 2 kq > Ak1 kAk2∞ σw m1 ) + Op (η 3 ).
Denote
δ =A(AX(W + W ρ + ηW 00 Σ) + B 1 ) D w,ρ,η (V + V ρ + ηΣV 00 )
ρ (105)
− A(AX(W + W ) + B 1 ) D w,ρ (V + V ρ )
δ ∈ Rm2 has the following terms:
1. Z 0 (V + V ρ + ηΣV 00 ). We have its n-th row norm bounded by Op (η).
−1
2. ZηΣV 00 . We have its n-th row infinity norm bounded by Õ(kAk∞ ητv,∞ m1 2 ).
3. A(AXηW 00 Σ D w,ρ )(V + V ρ ), of which the n-th row infinity is bounded by Õ(kAk∞ ητw,∞ ).
4. A(AXη 2 W 00 Σ D w,ρ,η ΣV 00 ). Bounded by Op (η 2 ).
Therefore,
− 12
kδ n k∞ ≤ Õ(kAk∞ η(τv,∞ m1 + τw,∞ )) + Op (η 2 )
2
(τw,∞ 2
+τv,∞ m−1
1 )
Similarly, we can derive that the contribution to the second term is Õ(η 2 kq > Ak1 kAk2∞ σv m2 ) + Op (η 3 ).
Lemma B.12. Let F ∗A = (f1∗ , · · · , fK
∗
). Perturbation matrices W 0 , V 0 satisfy
kW 0 k2,4 ≤ τw , kV 0 kF ≤ τv
There exists W
c and Vb such that
√
C0 K m1
kW
c k2,∞ ≤ , kVb k2,∞ ≤
m1 m2
XK
E[ |fr∗ (q, A, X, W
c , Vb ) − g (b,b) (q, A, X, W
r
c , Vb )|] ≤
r=1
E[G(b,b) (q, A, X, W
c , Vb )] ≤ OP T +
Proof:
By Lemma B.10, we have
4 6
kD 0n,w k0 ≤ Õ(τw5 m15 ) Õ(m1 )
3 1 2 1 2
kD 0n,v k0 ≤ Õ(m22 σv (kAk2∞ +kAk2∞ τw m14 )+m2 kAk∞
3
(kAk∞ τv +kAk∞ τw m14 (1+τv )) 3 ) ≤ Õ(m2 kAk2∞ (/C0 )Θ(1) )
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
q > A((A(AX W
c) (D 0w )Vb ) (D v ))cr
N
X N
X
= q > an an,k ((ak X W
c) D k,w 0 )Vb D n,v cr
n=1 k=1 (106)
√
3 C0 K m1
≤kq > Ak1 kAk2∞ m1 10
· m2
m1 m2
≤
q > A((A(AX W
c) (D w )Vb ) (D 0v ))cr
N
X N
X
= q > an an,k ((ak X W
c) D k,w )Vb D 0n,v cr
n=1 k=1 (107)
√
C0 K m1 1
≤kq > Ak1 kAk2∞ m1 · m2 · kAk2∞ ( )Θ(1)
2
m1 m2 C0
≤
Then, the conclusion can be derived.
B.1.4. O PTIMIZATION
This section states the optimization process and convergence performance of the algorithm. Lemma B.13 shows that during
the optimization, either there exists an updating direction that decreases the objective, or weight decay decreases the
objective. Lemma B.14 provides the convergence result of the algorithm.
Define
L0 (A∗ , A∗ , A∗ , λt , W t , V t )
t
|Ω |
1 X p p
= t EW ρ ,V ρ ,Σ0 [L(λt FA∗ (eg , X; W (0) + W ρ + W t Σ0 , V (0) + V ρ + Σ0 V t ), yi )] + R( λt W t , λt V t )
Ω i=1
(108)
where √ √ √ √
R( λW t , λV t ) = λv k λV t k2F + λw k λW t k22,4
0
Lemma B.13. For every 0 ∈ (0, 1) and ∈ (0, √ ) and γ ∈ (0, 14 ],
KkAk∞ p1 p22 Cs (Φ,p2 Cs (φ,kAk∞ ))Cs (φ,kAk∞ ) kAk2∞ +1
consider any W t , V t with
c , Vb with kW
With high probability on random initialization, there exists W c kF ≤ 1, kVb kF ≤ 1 such that for every
1
η ∈ (0, poly(m1 ,m2 ) ],
√ c √
min{EΣ [L0 (A∗ , A∗ , A∗ , λt , W t + η W Σ, V t + ηΣVb )], L0 (A∗ , A∗ , A∗ , (1 − η)λt , W t , V t )}
(109)
≤(1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t )
Proof:
Recall the pseudo network and the real network for every r ∈ [K] as
gr (q, A∗ , X, W 0 , V 0 ) = q > A∗ (A∗ (A∗ X(W (0) + W ρ + W 0 ) + B 1 ) D w,ρ,t (V (0) + V ρ + V 0 ) D v,ρ,t )cr
fr (q, A∗ , X, W 0 , V 0 ) = q > A∗ (A∗ (A∗ X(W (0) + W ρ + W 0 ) + B 1 ) D w,ρ,W 0 (V (0) + V ρ + V 0 ) D v,ρ,V 0 )cr
where D w,ρ,t and D v,ρ,t are the diagonal matrices at weights W (0) + W ρ + W t and V (0) + V ρ + V t . D w,ρ,W 0 and
D v,ρ,V 0 are the diagonal matrices at weights W (0) + W ρ + W 0 and V (0) + V ρ + V 0 .
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
kVb kF 1
The we need to study an update direction
f = W t + √η W
W cΣ
√
Ve = V t + ηΣVb
Changes in Regularizer. Note that here W t ∈ Rd×m1 , V t ∈ Rm1 ×m2 , Σ ∈ Rm1 ×m1 . We know that
√
EΣ [kV t + ηΣVb k2F ] = kV t k2F + ηkVb k2F
√ c 4 X √ c
EΣ [kW t + η W Σk2,4 ] = E[kwt,i + η W Σi k42 ]
i∈[m1 ]
√ c c Σi k4 + 4ηkwt,i > W
kwt,i + η W Σi k42 = kwt,i k42 + η 2 kW 2
c Σi k2 + 2ηkwt,i k2 kW
2
c Σi k2
2
(110)
≤ kwt,i k42 + 6ηkwt,i k22 kW
c Σi k2 + Op (η 2 )
2
EW ρ ,V ρ [|fr (q, A∗ , X, W + W ρ + W
f Σ, V + V ρ + Σ0 Ve ) − gr (q, A∗ , X, W + W ρ + W
f Σ, V + V ρ + ΣVe )|]
≤ Õ(kq > A∗ k1 kA∗ k2∞ 0 η) + Op (η 1.5 )
(112)
By Lemma B.10, we have
where EΣ [G0 ] = 0 and |G0 | ≤ with high probability. By C.38 in (Allen-Zhu et al., 2019), we have
∗ 2
(115)
≤(1 − η)(2L(λt FA∗ (q, X, W t , V t ), y) − L((1 − η)λt FA∗ (q, X, W t , V t ), y)) + ηL(FA ∗ , y) + Op (η )
|Ω|
1 X
c2 = EW ρ ,V ρ [L((1 − η)λt FA∗ (eg , X, W (0) + W ρ + W t Σ0 , V (0) + V ρ + Σ0 V t ), yi )] (118)
|Ωt | i=1
|Ω|
1 X
c3 = EW ρ ,V ρ [L(λt FA∗ (eg , X, W (0) + W ρ + W t Σ0 , V (0) + V ρ + Σ0 V t ), yi )] (120)
|Ωt | i=1
p √
c03 = L0 (A∗ , A∗ , A∗ , λt , W t , V t ) = c3 + R( λt W t , λV t ) (121)
Then following from C.38 to C.42 in (Allen-Zhu et al., 2019), we have
ηγ 0
c01 ≤ (1 − η)(2c03 − c02 ) + c + η(OP T + O(kq > A∗ k1 kA∗ k4∞ 0 /γ)) + Op (η 1.5 ), (122)
4 3
which implies
1 ηγ 0 1
min{c01 , c02 } ≤ (1 − η + )c + η OP T + O(kq > A∗ k1 kA∗ k2∞ 0 η/γ) + Op (η 1.5 )
2 8 3 2
As long as c03 ≥ (1 + γ)OP T + Ω(kq > A∗ k1 kA∗ k2∞ 0 /γ) and γ ∈ [0, 1], we have
γ
min{c01 , c02 } ≤ (1 − η )c03
4
Lemma B.14. Note that the three sampled aggregation matrices in a three-layer learner network can be be different. We
denote them as At(1) , At(2) and At(3) . Let W t , V t be the updated weights trained using A∗ and let W 0t , V 0t be the updated
weights trained using At(i) , i ∈ [3]. With probability at least 99/100, the algorithm converges in T Tw = poly(m1 , m2 )
iterations to a point with η ∈ (0, poly(m1 ,m21,kA∗ k∞ ,K) )
L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ≤ (1 + γ)OP T + 0
If
L0 (At(1) , At(2) , At(3) , λt , W 0t , V 0t )
|Ωt |
1 X
= t EW ρ ,V ρ ,Σ [L(λt FA(1) ,A(2) ,A(3) (q, X i , W (0) + W ρ + W 0t Σ0 , V (0) + V ρ + Σ0 V 0t ), yi )] (123)
|Ω | i=1
p p
+ R( λt W 0t , λt V 0t ),
where
FAt(1) ,At(2) ,At(3) (q, X, W , V ) = q > At(3) σ(At(2) σ(At(1) XW + B 1 )V + B 2 )C, (124)
we also have
Proof:
By Lemma B.13, we know that as long as L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ∈ [(1 + γ)OP T + Ω(q > A∗ 1kA∗ k4∞ 0 /γ), Õ(1)],
then there exists kW
c kF ≤ 1, kVb kF ≤ 1 such that either
√ c √
EΣ,Σ0 [L0 (A∗ , A∗ , A∗ , λt , W t Σ0 + η W ΣΣ0 , Σ0 V t + ηΣ0 ΣVb )] ≤ (1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t ) (126)
or
L0 (A∗ , A∗ , A∗ , (1 − η)λt , W t , V t ) ≤ (1 − ηγ/4)L0 (A∗ , A∗ , A∗ , λt , W t , V t ) (127)
√ c √
Denote W = W (0) + W ρ + W t Σ0 + η W ΣΣ0 , V = V (0) + V ρ + Σ0 V t + ηΣ0 ΣVb . Note that
K
∂L X ∂L ∂fi
= (128)
∂wj i=1
∂fi ∂wj
∂ √ c √
fr (q, A∗ , X, W (0) + W ρ + W t Σ0 + η W ΣΣ0 , V (0) + V ρ + Σ0 V t + ηΣ0 ΣVb )
∂wj
N
X m2
X N
X (129)
>
= q an ci,r 1rn,i +B2(n,i) ≥0 an,k vj,i 1a˜∗ n Xwk +B1(n,k) ≥0 (a˜∗ n X)> ,
n=1 i=1 k=1
∂2F ∂3F
which implies ∂F
are summations of 1, δ, δ 0 functions and their multiplications. It can be found that no
∂wt , ∂w2t , ∂w3t
R∞ R∞
δ(x)δ 0 (x), δ(x) or δ (x) exist in these terms. Therefore, by −∞ δ(t)f (t)dt = f (0) and −∞ δ 0 (t)f (t)dt = −f 0 (0),
2 02
we can obtain that the value of the third-order derivative w.r.t. W ρ of EW ρ ,V ρ ,Σ [L(λt FA∗ (eg , X, W (0) + W ρ +
W t Σ, V (0) + V ρ + ΣV t ), y)] is proportional to poly(kA∗ k∞ , K), some certain value of the probability density
function of W ρ and its derivative, i.e., poly(σw −1
). Similarly, the value of the third-order derivative w.r.t. W ρ of
EW ρ ,V ρ ,Σ [L(λt FA∗ (eg , X, W + W + W t Σ, V (0) + V ρ + ΣV t ), y)] is polynomially depend on σv−1 and kA∗ k∞ .
(0) ρ
By the value selection of σw and σv , we can conclude that L0 is B = poly(m1 , m2 , kA∗ k∞ , K) second-order smooth.
By Fact A.8 in (Allen-Zhu et al., 2019), it satisfies with η ∈ (0, poly(m1 ,m21,kA∗ k∞ ,K) )
1
λmin (∇2 L0 (A∗ , A∗ , A∗ , λt−1 , W t , V t )) < − (130)
(m1 m2 )8
Meanwhile, for t ≥ 1, by the escape saddle point theorem of Lemma A.9 in (Allen-Zhu et al., 2019), we know with probability
at least 1 − p, λmin (∇2 L0 (A∗ , A∗ , A∗ , λt−1 , W t , V t )) > − (m1 1m2 )8 holds. Choosing p = 100T 1
, then this holds for
t = 1, 2, · · · , T with probability at least 0.999.Therefore, for t = 1, 2, · · · , T , the first case cannot happen, i.e., as long as
L0 (A∗ , A∗ , A∗ , λt , W t , V t ) ≥ (1 + γ)OP T + Ω(q > A∗ 1kA∗ k4∞ 0 /γ),
On the other hand, for t = 1, 2, · · · , T − 1, as long as L0 ≤ Õ(1), by Lemma A.9 in (Allen-Zhu et al., 2019), we have
T
X −1 w −1 X
TX N m2
X
kv i − v 0i k . kη q > a∗n ci,r 1[a∗ n σ(A∗ XW )v i ≥ 0](a∗ n σ(A∗ XW ) − at(2)
n σ(A
t(1)
XW 0 ))k
t=0 l=0 n=1 i=1
1 1
≤ · poly(m1 , m2 )c kA∗ k∞ · poly() = O()
poly(m1 , m2 ) poly()
(134)
With a slight abuse of notation, for r ∈ [K], we denote
fr (q, At(1) , At(2) , At(3) , X, W 0t , V 0t ) = q > At(3) σ(At(2) σ(At(1) XW + B 1 )V + B 2 )cr (135)
The difference between fr (q, A∗ , X, W t , V t ) and fr (q, At(1) , At(2) , At(3) , X, W 0t , V 0t ) is caused by kA∗ − At(1) k∞ ,
(t) (t) 0 (t) (t) 0
kA∗ − At(2) k∞ , kA∗ − At(3) k∞ , wi − wi and v i − v i . Following the proof in Lemma A.2, we can easily obtain
·poly()
that if |pl − p∗l | ≤ p∗l · O(poly()) and li ≥ |Ni |/(1 + pc∗1LΦ(L,i) ), it can be derived that kA∗ − A(1) k∞ ≤ O(poly()),
l
kA∗ − A(2) k∞ ≤ O(poly()) and kA∗ − A(3) k∞ ≤ O(poly()). Then, by (133) and (134), we have
min{EW ρ ,V ρ ,Σ, z∈Ω L(λT −1 FA∗ (eg , X, W (0) + W ρ,j + W T Σ, V (0) + V ρ,j + ΣV T )} ≤ (1 + γ)OP T + 0 (139)
j
Then we have 1
kW T k2,4 ≤ 04 τw0 (140)
1
kV T kF ≤ 02 τv0 (141)
By Lemma B.9, we know that
Therefore,
|fr (eg , A∗ , X i , W (0) + W ρ , V (0) + V ρ , B)|
∗
=|e> 0
g A σ(r + B 2 )cr | (144)
∗ ∗
≤Õ(kA k∞ (kA k∞ + 1)c )
We also have
|gr(b,b) (eg , A∗ , X i , W T , V T , B)|
∗ ∗ ∗
≤|e>
g A A (A XW T D (0)
w,x V T ) D (0)
v,x cr |
1√ (145)
≤kA∗ k2∞ τv0 τw0 m14 m2 c
≤C0 kA∗ k2∞
Hence,
fr (eg , A∗ , X i , W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T , B) ≤ Õ(kA∗ k2∞ (c + C0 )) (146)
Combining (135, 137), we can obtain
(1) (2) (3)
fr (eg , At , At , At , X i , W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T , B) ≤ Õ(kA∗ k2∞ (c + C0 )) (147)
(1) (2) (3)
as long as kA∗ − At k∞ ≤ poly(), kA∗ − At k∞ ≤ poly() and kA∗ − At k∞ ≤ poly().
|Ω|
For any given {X i , yi }i=1 , the dependency between yi , yj , where i, j ∈ |Ω| can be considered in two steps. Figure 5(a)
shows ai X is dependent with at most (1 + δ)2 aj X 0 s. This is because each ai X is determined by at most (1 + δ) row
vector x̃0l s, while each x̃l is contained by at most (1 + δ) ap X 0 s. Similarly, yi is determined by at most (1 + δ) ap X 0 s and
by Figure 5(b) we can find yi is dependent with at most (1 + δ)4 yj (including yi ). Since the matrix A∗ shares the same
non-zero entries with A, the output with A∗ indicates the same dependence.
P|Ωt |
Denote ui = 1/|Ωt | i=1 |L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi ) −
E(eg ,X,y)∈D [L(λT −1 FA∗ (eg , X, W (0) + W ρ + W T Σ, V (0) + V ρ + ΣV T ), yi )]. Then, E[ui ] = 0. Since that
L is 1-lipschitz smooth and L(0K , y) ∈ [0, 1], we have
Then, √
|ui | ≤ 2 KkA∗ k2∞ (c + C0 )
Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling
t2
P(|ui | ≥ t) ≤ 1 ≤ exp(1 − ) (149)
4KkA∗ k4∞ (c + C0 )2
∗ 4 2 2
Then, ui is a sub-Gaussian random variable. We have Eesui ≤ ekA k∞ (c +C0 ) s . By Lemma 7 in (Zhang et al., 2020), we
have P|Ω| 4 ∗ 4 2 2
Ees i=1 ui ≤ e(1+δ) KkA k∞ (c +C0 ) |Ω|s
Therefore,
|Ω|
X 1
P ui ≥ k ≤ exp(kA∗ k4∞ (c + C0 )2 K(1 + δ)4 |Ω|s2 − |Ω|ks) (150)
i=1
|Ω|
q
k ∗ 4 2 (1+δ)4 log N
for any s > 0. Let s = 2kA∗ k4 (c +C )2 K(1+δ)4 , k = kA k∞ (c + C0 ) K |Ω| , we can obtain
∞ 0
|Ω|
X 1
P ui ≥ k ≤ exp(−kA∗ k4∞ (c + C0 )2 K log N ) ≤ N −K (151)
i=1
|Ω|