(AAAI 2022) Regularizing Graph Neural Networks via Consistency-Diversity Graph Augmentations

This paper addresses the limitations of graph neural networks (GNNs) in semi-supervised learning, specifically their underutilization of unlabeled data and susceptibility to overfitting. The authors propose two metrics, Consistency and Diversity, to evaluate graph augmentations and introduce a new method called NASA that enhances GNN performance by balancing these metrics through neighbor-based augmentations. Extensive experiments demonstrate the effectiveness of NASA in improving both accuracy and generalization across multiple datasets.

Regularizing Graph Neural Networks via Consistency-Diversity Graph Augmentations

Deyu Bo1*, BinBin Hu2, Xiao Wang1, Zhiqiang Zhang2, Chuan Shi1†, Jun Zhou2
1 Beijing University of Posts and Telecommunications
2 Ant Financial Services Group, Hangzhou, China
{bodeyu, xiaowang, shichuan}@bupt.edu.cn, {bin.hbb, lingyao.zzq, jun.zhoujun}@antfin.com

* Work done during Deyu's internship at Ant Group.
† Corresponding Author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Despite the remarkable performance of graph neural networks (GNNs) in semi-supervised learning, they are criticized for not making full use of unlabeled data and for suffering from over-fitting. Recently, graph data augmentation, used to improve both the accuracy and generalization of GNNs, has received considerable attention. However, one fundamental question is how to evaluate the quality of graph augmentations in principle. In this paper, we propose two metrics, Consistency and Diversity, from the aspects of augmentation correctness and generalization. Moreover, we discover that existing augmentations fall into a dilemma between these two metrics. Can we find a graph augmentation satisfying both consistency and diversity? A well-informed answer can help us understand the mechanism behind graph augmentation and improve the performance of GNNs. To tackle this challenge, we analyze two representative semi-supervised learning algorithms: label propagation (LP) and consistency regularization (CR). We find that LP utilizes the prior knowledge of graphs to improve consistency and CR adopts variable augmentations to promote diversity. Based on this discovery, we treat neighbors as augmentations to capture the prior knowledge embodying the homophily assumption, which promises a high consistency of augmentations. To further promote diversity, we randomly replace the immediate neighbors of each node with its remote neighbors. After that, a neighbor-constrained regularization is proposed to enforce the predictions of the augmented neighbors to be consistent with each other. Extensive experiments on five real-world graphs validate the superiority of our method in improving the accuracy and generalization of GNNs.

[Figure 1, two panels. (a) Dilemma of augmentations: scatter plot with Diversity on the x-axis and Consistency on the y-axis; legend: Raw data, dropout, dropnode, dropedge, LP, NASA. (b) A toy semi-supervised dataset.]

Figure 1: (a) Consistency and diversity of different augmentations on the Cora dataset. The dotted line represents the consistency of the raw data. Circles with different colors indicate different graph augmentations. Black and red crosses show the consistency and diversity of immediate neighbors and our proposed augmentation. (b) Toy example of data augmentations in SSL. Blue and red circles are labeled data, gray circles are unlabeled data and rectangles are augmentations.

1 Introduction

Graph neural networks (GNNs), as a typical graph-based semi-supervised learning (SSL) method, have achieved state-of-the-art performance (Kipf and Welling 2017; Velickovic et al. 2018). Despite their success, GNNs have been criticized for not making full use of unlabeled data (Wang et al. 2020; Feng et al. 2020), which is an essential requirement of SSL (Yang et al. 2021). Previous methods tend to use pseudo labels to overcome this limitation (Sun, Lin, and Zhu 2020; Li, Han, and Wu 2018), but suffer from poor calibration (Guo et al. 2017). Recently, graph data augmentation has been used to improve both the accuracy and generalization of GNNs (Rong et al. 2020; Verma et al. 2021; Feng et al. 2019, 2020; Wang et al. 2020).

Although there are some augmentation strategies on graphs, such as DropEdge (Rong et al. 2020) and DropNode (Feng et al. 2020), it is still unknown which augmentation is better for GNNs. Generally, an easy augmentation contributes less to the generalization of the model and a hard augmentation may bring additional noise (Yin et al. 2019). Therefore, a natural question is how to evaluate the quality of graph augmentations in principle. To this end, as the first contribution of this paper, we propose two metrics of graph augmentation in SSL: Consistency and Diversity. Consistency indicates whether the augmented data belong to the same class as the raw data, and diversity reveals how different the distribution captured by the augmented data is from that of the raw data. Detailed descriptions can be found in Sec. 2. If the augmentation and the original data are in different classes, the augmentation will hurt the accuracy of the model. If the augmentation is too similar to the original data, it may contribute less to the generalization of the model. Therefore, a good augmentation should not only ensure correctness but also provide sufficient generalization.

Based on the two evaluations, we test three commonly used graph augmentations, i.e., Dropout (Srivastava et al.
2014), DropEdge and DropNode, with different dropping rates. The results are shown in Fig. 1(a), where there is a dilemma between consistency and diversity: an augmentation with high consistency may have less diversity and vice versa. Since the dilemma of existing graph augmentations is identified, a natural question is: can we find a graph augmentation satisfying both consistency and diversity at the same time? This is not a trivial task because we need to quantitatively define consistency and diversity for graph data and strike a delicate balance between them.

To solve the dilemma, we need to know the factors that affect consistency and diversity. We analyze two representative SSL methods, label propagation (LP) and consistency regularization (CR), and find that LP uses neighbors as augmentations, which naturally captures the prior knowledge of graphs and improves consistency, while CR employs variable augmentations to promote diversity. Based on this discovery, in this paper, we propose NASA, short for Neighbors Are Special Augmentations, to augment and regularize GNNs. NASA consists of two parts: augmentation and regularization. In the augmentation, we treat neighbors as special augmentations and propose to disturb nodes by replacing their immediate neighbors with remote neighbors. Generally, neighbors can capture the prior knowledge of graphs, i.e., the homophily assumption, and replacing neighbors can improve the variability, so we can preserve high consistency and diversity simultaneously. In the regularization, we propose a neighbor-constrained regularization, which enforces the predictions of neighbors to be consistent with each other, so that a large number of unlabeled nodes can be used in training. Moreover, we show that the proposed regularization can be used as a supplement to the traditional graph regularization.

The contributions of this paper are summarized as follows:

• We propose consistency and diversity to evaluate the quality of existing graph augmentations, and find that they cannot satisfy the two metrics at the same time. To the best of our knowledge, this is the first exploration of metrics of graph augmentations.
• We propose NASA, which generates graph augmentations with high consistency and diversity through replacing immediate neighbors with remote neighbors, and constrains the predictions of augmented neighbors to be consistent.
• We validate the effectiveness of NASA by comparing with state-of-the-art methods on five real-world datasets. We also conduct a generalization test to verify the superiority of NASA in improving the generalization of GNNs.

2 Evaluation of Augmentation

In this section, we describe the two metrics, i.e., Consistency and Diversity, in detail. Before that, we first explain the motivation for designing the two metrics.

Let's take the "two moons" data as an example (Verma et al. 2019), as shown in Fig. 1(b), where the blue and red circles are labeled data, and gray circles are unlabeled data. We can see that the number of labeled data is relatively small and cannot reflect the distribution of the entire data. In this situation, we consider three types of augmentations, i.e., A, B, C. Although A lies in the correct class, it contributes little information because it is close to the raw data (high consistency, low diversity); B is different from the raw data, but it lies in the wrong class, which brings additional noise (low consistency, high diversity); C benefits the classification a lot because it not only has correct labels but also brings additional generalization (high consistency, high diversity). The aforementioned discussion shows that a good augmentation should generalize to the distribution beyond the training data. Therefore, only using labeled data cannot comprehensively evaluate the quality of augmentations. To better measure the correctness and generalization of the augmentations, we need to introduce additional data, e.g., a validation set, for evaluation. The main idea is as follows:

We first train two models $F_\theta, \tilde{F}_\theta : \mathbb{R}^d \to \mathbb{R}^C$ on the training data $D_{\mathrm{train}}$ and its augmentations $\tilde{D}_{\mathrm{train}}$, respectively, where $d$ is the dimension of the input features, $C$ is the number of classes and $\theta$ denotes the parameters. After that, we use the two models to predict on the validation set $D_{\mathrm{val}}$. If the augmentations have better correctness and generalization, the model $\tilde{F}_\theta$ should have higher accuracy on the validation set and establish a decision boundary that differs more from that of $F_\theta$. This leads to the metrics of consistency and diversity:

Metric of Consistency. We use the accuracy of the augmented model on the validation set to represent the level of consistency:

$$C = \mathrm{Acc}(\tilde{F}_\theta(D_{\mathrm{val}}), Y_{\mathrm{val}}), \tag{1}$$

where $Y_{\mathrm{val}}$ denotes the labels of the validation data. A lower value of $C$ means that the augmentations are inconsistent with the raw data, which may hurt the accuracy of the model. However, a higher value of $C$ does not mean that the quality of the augmentation is necessarily good, because it may contribute less to the generalization of the model, which leads to the metric of diversity.

Metric of Diversity. We use the difference between the predictions of the original model $F_\theta$ and the augmented model $\tilde{F}_\theta$ to represent the level of diversity:

$$D = \|\tilde{F}_\theta(D_{\mathrm{val}}) - F_\theta(D_{\mathrm{val}})\|_F^2, \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm. A lower value of $D$ indicates that the augmentations have a distribution similar to that of the original data, which cannot benefit the generalization of models (Yin et al. 2019). But a higher value of $D$ cannot ensure the correctness of augmentations. Therefore, the combination of the two metrics is necessary for the evaluation.

Note that the metrics of consistency and diversity are not limited to graph data. Instead, they can be used to evaluate the quality of data augmentations in other semi-supervised fields, such as computer vision (Berthelot et al. 2019; Xie et al. 2020). In the next section, we will introduce our method and explain how these two metrics guide the model design.

3 Methodology

Let $G = (V, E)$ denote a graph, where $V$ is the set of nodes with $|V| = N$ and $E$ is the set of edges. Each graph $G$ has an adjacency matrix $A \in \{0, 1\}^{N \times N}$, where $A_{ij} = 1$ means there is an edge between $v_i$ and $v_j$, otherwise 0.
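As a rough illustration of the two metrics from Sec. 2 (Eqs. 1 and 2), the sketch below computes them from prediction matrices on a validation set. The function names and the toy arrays are hypothetical; it assumes both models output one row of class probabilities per validation node and that accuracy is taken over argmax predictions.

```python
import numpy as np


def consistency(aug_probs: np.ndarray, val_labels: np.ndarray) -> float:
    """Eq. 1: validation accuracy of the model trained on augmented data."""
    return float(np.mean(aug_probs.argmax(axis=1) == val_labels))


def diversity(aug_probs: np.ndarray, raw_probs: np.ndarray) -> float:
    """Eq. 2: squared Frobenius norm of the gap between the two models' predictions."""
    return float(np.sum((aug_probs - raw_probs) ** 2))


# Hypothetical predictions on three validation nodes with two classes.
raw_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # model trained on raw data
aug_probs = np.array([[0.7, 0.3], [0.4, 0.6], [0.3, 0.7]])  # model trained on augmented data
val_labels = np.array([0, 1, 0])

print("consistency:", consistency(aug_probs, val_labels))
print("diversity:", diversity(aug_probs, raw_probs))
```

In this toy case the augmented model gets two of three validation nodes right, and the diversity is the sum of squared entry-wise differences between the two prediction matrices.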
$X \in \mathbb{R}^{N \times d}$ are the node features and $H \in \mathbb{R}^{N \times C}$ are the node representations learned by GNNs. Generally, most existing GNNs can be summarized as a message passing architecture (Gilmer et al. 2017), which can be formulated as $H = \mathrm{Trans}(\mathrm{Agg}\{A, X; \Phi\}; \Theta)$. Agg means aggregating information from neighbors in the graph and Trans transforms the aggregated information into new node representations. The parameters $\Phi$ and $\Theta$ are used for aggregation and transformation, respectively. In a graph augmentation, the perturbation may occur in both node features and structures. Therefore, the augmented node representations can be calculated as $\tilde{H} = \mathrm{Trans}(\mathrm{Agg}\{\tilde{A}, \tilde{X}; \Phi\}; \Theta)$, where $\tilde{A}$ and $\tilde{X}$ are the augmented structures and features, respectively.

3.1 Connection Between Consistency Regularization and Label Propagation

A basic requirement of SSL is to make good use of the unlabeled data (van Engelen and Hoos 2020; Chong et al. 2020). Here we review two representative SSL algorithms and discuss how they use augmentations to assist unlabeled nodes.

Label propagation is a traditional graph-based semi-supervised algorithm, which propagates labels to unlabeled nodes along the graph topology (Zhou et al. 2003). The objective function can be defined as:

$$\mathcal{L}_{LP} = \sum_{i \in V_L} \|h_i - y_i\|_2^2 + \alpha \sum_{i \in V} \sum_{j \in N_i} \|h_i - h_j\|_2^2, \tag{3}$$

where $h_i$ is the $i$-th row of $H$, $V_L$ represents the labeled nodes, $\alpha$ is a hyper-parameter, $y_i$ is a one-hot vector denoting the label of $v_i$ and $N_i$ denotes the neighbors of $v_i$. The first term is a classification loss; here we take the mean square loss as an example. The second term is a graph Laplacian regularization, which enforces the representations of neighbors to be consistent. Note that the closed-form solution of Eq. 3 is $H = (I + \alpha L)^{-1} Y$, where $L$ is the Laplacian matrix of $A$.

Consistency regularization is an emerging semi-supervised model, which enforces the model to have similar predictions between raw data and random augmentations, so that the model will be robust to small data perturbations (Xie et al. 2020). The objective function can be formulated as:

$$\mathcal{L}_{CR} = \sum_{i \in V_L} \|h_i - y_i\|_2^2 + \alpha \sum_{i \in V} \sum_{k=1}^{K} \|h_i - \tilde{h}_i^{(k)}\|_2^2, \tag{4}$$

where $K$ is the number of random augmentations and $\tilde{h}_i^{(k)}$ is the representation of the $k$-th augmentation. The first term of CR is the same as that of LP and the second term is a regularization, which uses the prediction of $v_i$ as a pseudo label to supervise the output of its augmentations.

Remark 1. (Two perspectives of LP and CR) Comparing Eq. 3 and Eq. 4, we can find that the difference between LP and CR is the regularization. From the perspective of LP, using neighbors as augmentations explicitly utilizes the prior knowledge of graphs, i.e., the homophily assumption. Therefore, the consistency of neighbors is higher than that of random augmentations. From the perspective of CR, the features and structures of neighbors $h_j$ are fixed during training, while random augmentations $\tilde{h}_i^{(k)}$ will change dynamically, e.g., dropedge will drop different edges in each epoch, which improves the generalization of GNNs implicitly.

The aforementioned discussion reveals that a good augmentation should not only utilize the prior knowledge of data (for consistency), but also provide variable augmentations (for diversity). This motivates the design of our model.

3.2 Our Proposed Model: NASA

We introduce the details of our proposed model, which consists of two components: augmentation and regularization. In the augmentation, we propose to use remote neighbors to replace immediate neighbors to promote diversity. In the regularization, we propose two techniques to constrain the predictions of augmentations.

Augmentation on Neighbors. Inspired by the design of LP, we aim to use neighbors as augmentations to improve the consistency. However, this approach lacks variability and may be affected by noise. Therefore, an effective augmentation strategy is to change the neighbors during training.

To determine which neighbors we should use as substitutes, we make an empirical study to identify their quality. Specifically, we divide the neighbors into different groups according to their distances to the training nodes. We then calculate the consistency and diversity through Eq. 1 and Eq. 2, where we use graph convolutional networks (GCNs) as the test model $F_\theta$, the training nodes are $D_{\mathrm{train}}$ and their neighbors are $\tilde{D}_{\mathrm{train}}$. The results are shown in Fig. 2. It can be seen that the farther the neighbors are from the training data, the lower the consistency and the higher the diversity. In particular, comparing the 2-hop neighbors with the 3-hop neighbors, we can find that the consistency of 2-hop neighbors decreases slightly, but 3-hop neighbors hurt the consistency heavily and do not add much diversity.

[Figure 2, two panels: (a) Neighbors on Cora, (b) Neighbors on Citeseer. Each panel plots Consistency (y-axis) against Diversity (x-axis) for the points Raw, 1-hop, 2-hop and 3-hop.]

Figure 2: Empirical study of different neighbors' consistency and diversity. "Raw" represents the original training nodes, and "k-hop" indicates the neighbors that are k hops away from the training nodes, where $k \in \{1, 2, 3\}$.

Based on the results, we propose Neighbor Replace (NR) to randomly replace the 1-hop neighbors by the 2-hop neighbors. Specifically, for node $v_i$, we use a Bernoulli distribution to sample its neighbors randomly, i.e., $\forall v_j \in N_i$, $\epsilon_j \sim \mathrm{Bern}(p)$. For each sampled neighbor $v_j$ with $\epsilon_j = 1$, we drop the edge between $v_j$ and $v_i$, and randomly choose a neighbor of $v_j$ as the new neighbor of $v_i$, i.e., $N_i^{\mathrm{new}} = \{v_k \sim N_j, \epsilon_j = 1\}$. For the neighbors with $\epsilon_j = 0$, we do not change them and denote them as $N_i^{\mathrm{old}} = \{v_j \in N_i, \epsilon_j = 0\}$.
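The Neighbor Replace sampling just described can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `neighbor_replace`, the dictionary-of-lists graph representation and the toy path graph are assumptions. The text does not say whether the sampled replacement may be $v_i$ itself, so the sketch does not exclude it, and duplicate neighbors are merged because the result is treated as a set.

```python
import random
from typing import Dict, List


def neighbor_replace(adj: Dict[int, List[int]], p: float,
                     rng: random.Random) -> Dict[int, List[int]]:
    """Return augmented neighbor lists for every node.

    Each neighbor v_j of v_i is selected with probability p (Bernoulli draw).
    A selected neighbor is dropped and replaced by a random neighbor of v_j;
    an unselected neighbor is kept unchanged.
    """
    augmented = {}
    for i, neighbors in adj.items():
        new_neighbors = set()
        for j in neighbors:
            if rng.random() < p and adj[j]:
                # Indicator = 1: drop edge (i, j) and sample v_k from N_j.
                new_neighbors.add(rng.choice(adj[j]))
            else:
                # Indicator = 0: keep the original neighbor v_j.
                new_neighbors.add(j)
        augmented[i] = sorted(new_neighbors)
    return augmented


# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighbor_replace(adj, p=0.5, rng=random.Random(0)))
```

Calling the function again with a fresh random state in every epoch would give a different augmented topology each time, while `p = 0` leaves the graph unchanged.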
Therefore, the augmented neighbors of $v_i$ are defined as $\tilde{N}_i = N_i^{\mathrm{new}} \cup N_i^{\mathrm{old}}$. The benefits of NR are two-fold: first, the exchange between 1-hop neighbors and 2-hop neighbors perturbs graph structures, but does not seriously hurt the correctness. Second, the supervision signals can be propagated to more unlabeled nodes so that the generalization can be promoted.

Table 1: Statistics of datasets.

Dataset     # Nodes   # Edges   # Features   # Classes
Cora          2,708     5,429        1,433           7
Citeseer      3,327     4,732        3,703           6
Pubmed       19,717    44,338          500           3
Computer     13,381   245,778          767          10
Photo         7,487   119,043          745           8

Although graph structures contain the consistency information, the inter-edges (Zhao et al. 2021) and NR augmentations may introduce some noise. Here we propose two techniques, i.e., neighbor-constrained regularization and dynamic training, to prevent pseudo labels from being heavily disturbed.

Neighbor-constrained Regularization. After perturbing the neighbors of each node, we feed the augmented graph topology $\tilde{A}$ and the original node features $X$ into an arbitrary GNN to learn the node representations: $\tilde{H} = \mathrm{Trans}(\mathrm{Agg}\{\tilde{A}, X; \Phi\}; \Theta)$. For the labeled nodes, a cross-entropy loss is used to supervise the predictions of GNNs:

$$\mathcal{L}_{CE} = -\frac{1}{N_L} \sum_{i \in V_L} y_i \log \tilde{h}_i. \tag{5}$$

Note that here we use labels to supervise the augmented representations $\tilde{h}_i$ because we find that this approach can reduce the risk of over-fitting. For the unlabeled nodes, we design a novel neighbor-constrained regularization to enforce the predictions of neighbors to be consistent with each other. Specifically, we first fuse the predictions of neighbors as the pseudo label of the center node: $\tilde{y}_i = \frac{1}{|\tilde{N}_i|} \sum_{j \in \tilde{N}_i} \tilde{h}_j$. The average of the neighbors' predictions is similar to a voting result, which can effectively prevent the pseudo labels from being affected by noisy neighbors.

Before using the averaged pseudo labels to supervise the predictions of neighbors, we utilize the sharpening trick to encourage the classifier to output a low-entropy prediction:

$$\tilde{p}_{ij} = \tilde{y}_{ij}^{1/T} \Big/ \sum_{c=0}^{C-1} \tilde{y}_{ic}^{1/T}, \tag{6}$$

where $T \in (0, 1]$ is a scaling factor controlling the sharpness of the prediction, $i$ is the index of nodes, and $j$ and $c$ indicate the specific dimensions of the representation ($0 \le j \le C - 1$). Then we use the sharpened pseudo labels to supervise the predictions of augmented neighbors:

$$\mathcal{L}_{CR} = \frac{1}{N} \sum_{i \in V} \sum_{j \in \tilde{N}_i} \mathrm{KL}(\tilde{p}_i \,\|\, \tilde{h}_j), \tag{7}$$

where KL is the Kullback-Leibler divergence (Joyce 2011), measuring the distance between two distributions. Besides, we will not use the gradient of the pseudo label $\tilde{p}_i$ to update the parameters $\Phi$ and $\Theta$, as suggested by (Miyato et al. 2019). Through this regularization, unlabeled nodes can be used in training to prevent the model from over-fitting. The final loss function is the combination of classification and neighbor-constrained regularization:

$$\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{CR}, \tag{8}$$

where $\alpha$ is a hyper-parameter for balancing.

Finally, we give a further explanation to show why this regularization is called "neighbor-constrained". In addition, we analyze its connection to the traditional graph regularization (Belkin and Niyogi 2003). We can rewrite Eq. 7 as:

$$\mathcal{L}_{CR} = \frac{1}{N} \sum_{i \in V} \sum_{j \in \tilde{N}_i} \left( \tilde{p}_i \log \tilde{p}_i - \tilde{p}_i \log \tilde{h}_j \right), \tag{9}$$

where the first term can be removed because of the gradient truncation. Therefore, if we ignore the sharpening trick, the second term can be rewritten as:

$$\mathcal{L}_{CR} = -\frac{1}{N^2} \sum_{p, q \in \tilde{N}_i} \tilde{h}_p \log \tilde{h}_q, \tag{10}$$

which can be seen as the cross-entropy loss between the augmented neighbors. Eq. 10 requires the predictions of neighbors to be consistent with each other. That is why we call this regularization "neighbor-constrained".

Connection with manifold learning. Similar to Eq. 3 and Eq. 4, the objective function of NASA can be rewritten as:

$$\mathcal{L} = \sum_{i \in V_L} \|\tilde{h}_i - y_i\|_2^2 + \alpha \sum_{i \in V} \sum_{j \in N_i} \Big\| \tilde{h}_j - \sum_{j} \tilde{h}_j \Big\|_2^2. \tag{11}$$

The second term of Eq. 11 is similar to the local linear embedding (LLE) (Roweis and Saul 2000) algorithm, which uses the weighted sum of neighbors to reconstruct the target nodes. In this way, the manifold of high-dimensional data can be preserved in the low-dimensional space.

Dynamic Training. During training, we perform NR on each node in each epoch, that is to say, the augmented graph topology $\tilde{A}$ is different in each epoch. We call this dynamic training, and the alternative (a fixed augmented topology) static training. The dynamic training of NASA makes the model more robust. On the one hand, in each epoch, different neighbors are used for training, which makes the model invariant to the change of neighbors. On the other hand, there may exist some neighbors that do not belong to the same class. Using dynamic training can prevent the model from over-fitting the unsatisfactory augmentations. Ablation studies can be found in Sec. 4.3.

Complexity. The time complexity consists of two parts. One is the complexity of GNNs. Here we take GCNs (Kipf and Welling 2017) as an example, whose complexity is $O(L|E|d^2)$, where $L$ is the number of layers. The other is the complexity of the regularization, which is $O(|E|d)$. Therefore, the overall complexity is $O(|E|(Ld^2 + d))$, which is linear in the number of edges.
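To make the neighbor-constrained regularization more concrete, the following sketch evaluates the pseudo-label averaging, the sharpening of Eq. 6 and the KL term of Eq. 7 on plain NumPy arrays. The function names, the `eps` smoothing constant and the toy inputs are assumptions made for illustration. NumPy has no automatic differentiation, so the gradient truncation on the pseudo label is only noted in a comment; in a training framework the pseudo label would be treated as a constant.

```python
import numpy as np


def sharpen(y: np.ndarray, T: float) -> np.ndarray:
    """Eq. 6: raise each class probability to the power 1/T and renormalise per row."""
    powered = y ** (1.0 / T)
    return powered / powered.sum(axis=1, keepdims=True)


def neighbor_constrained_loss(probs: np.ndarray, aug_neighbors: dict,
                              T: float, eps: float = 1e-12) -> float:
    """Eq. 7: (1/N) * sum over nodes i and augmented neighbors j of KL(p_i || h_j)."""
    n = probs.shape[0]
    total = 0.0
    for i, nbrs in aug_neighbors.items():
        if not nbrs:
            continue  # a node without neighbors contributes nothing
        h = probs[nbrs]                    # predictions of the augmented neighbors
        y = h.mean(axis=0, keepdims=True)  # fused pseudo label of node i
        p = sharpen(y, T)                  # sharpened pseudo label (constant, no gradient)
        total += float(np.sum(p * (np.log(p + eps) - np.log(h + eps))))
    return total / n


# Hypothetical class probabilities for three nodes and their augmented neighbors.
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
aug_neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(neighbor_constrained_loss(probs, aug_neighbors, T=0.5))
```

With `T = 1` the sharpening step leaves the averaged pseudo label unchanged, and the loss vanishes when all augmented neighbors already agree.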
Table 2: Node classification results under different label splits (%). A higher value indicates a better performance. Bold for the best. (-) means the standard deviation is too large to have a stable result.

            Standard Split                    Less Label Split                  Random Split
Method      Cora       Citeseer   Pubmed      Cora       Citeseer   Pubmed      Computer   Photo
LP          70.4±0.0   50.6±0.0   71.8±0.0    64.9±3.3   41.8±4.2   71.4±3.8    79.8±3.4   79.0±4.8
GLP         80.3±0.2   71.7±0.6   78.8±0.4    70.1±2.8   60.7±5.5   73.2±4.0    81.9±1.1   89.6±0.7
GCN-LPA     82.8±0.1   72.3±0.2   78.6±0.2    68.8±3.3   53.2±4.7   71.5±3.6    80.4±2.4   89.4±1.5
PTA         83.0±0.5   71.6±0.4   80.1±0.1    67.7±2.8   58.5±4.9   71.5±3.2    82.3±0.9   90.7±2.1
GCN         81.5±0.3   70.3±0.9   79.0±0.2    70.1±2.7   58.4±5.2   71.8±4.4    82.3±1.5   90.4±0.7
GAT         83.0±0.7   72.5±0.7   79.0±0.3    71.4±3.7   62.2±6.5   72.5±4.0    -          -
MixHop      81.9±0.4   71.4±0.8   80.8±0.6    67.9±3.0   59.0±5.5   71.3±3.1    -          -
GMNN        83.7±0.3   72.9±0.5   80.3±0.4    71.4±2.1   60.5±3.2   72.8±3.1    82.7±1.3   91.0±2.9
APPNP       83.8±0.3   71.6±0.5   79.7±0.3    69.9±2.1   59.3±2.8   71.4±3.5    82.1±1.9   90.6±2.0
GAUG        83.6±0.5   73.3±1.1   80.2±0.3    72.5±2.8   62.2±5.8   73.2±2.7    -          -
DropEdge    82.8±0.9   72.3±1.3   79.6±0.8    71.4±3.0   62.0±6.6   72.2±3.9    81.5±1.4   89.4±1.7
GraphVAT    82.9±0.5   73.8±0.9   79.5±0.3    70.6±4.2   61.2±5.1   73.4±3.3    82.3±3.1   90.5±2.6
GraphMix    83.9±0.6   74.7±0.6   81.0±0.5    72.3±6.1   61.0±4.5   74.6±3.2    84.2±2.5   91.3±1.9
GRAND       84.5±0.3   74.2±0.3   80.0±4.3    73.4±2.4   62.6±4.2   74.0±2.7    84.8±1.5   91.7±2.2
NodeAug     84.3±0.5   74.9±0.5   81.5±0.5    74.2±3.2   62.4±4.1   74.4±3.5    84.5±2.2   92.3±2.5
NASA        85.1±0.3   75.5±0.4   80.2±0.3    75.2±4.0   63.4±4.8   74.0±2.3    85.5±3.3   92.7±2.9
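Sec. 4.2 quotes relative improvements of NASA over its GCN backbone. As a small worked check, these figures can be recomputed from the mean accuracies in Table 2, assuming "relative improvement" means (NASA − GCN) / GCN expressed in percent. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Mean accuracies (%) copied from the GCN and NASA rows of Table 2.
gcn = {"standard": {"Cora": 81.5, "Citeseer": 70.3, "Pubmed": 79.0},
       "less label": {"Cora": 70.1, "Citeseer": 58.4, "Pubmed": 71.8}}
nasa = {"standard": {"Cora": 85.1, "Citeseer": 75.5, "Pubmed": 80.2},
        "less label": {"Cora": 75.2, "Citeseer": 63.4, "Pubmed": 74.0}}

for split in gcn:
    for dataset in gcn[split]:
        gain = relative_improvement(nasa[split][dataset], gcn[split][dataset])
        print(f"{split:10s} {dataset:8s} {gain:.1f}%")
```

Under that assumption the standard split gives 4.4%, 7.4% and 1.5%, and the less label split gives 7.3%, 8.6% and 3.1%, for Cora, Citeseer and Pubmed respectively.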

4 Experiments

4.1 Experimental Setup

We test the performance of different methods in the semi-supervised node classification task. Specifically, we use five different datasets: three citation datasets, i.e., Cora, Citeseer and Pubmed from (Kipf and Welling 2017), and two co-purchase datasets, i.e., Amazon Computers and Amazon Photo from (Shchur et al. 2018). The statistics of these datasets are shown in Table 1. Besides, we consider three different data splits to evaluate these methods more comprehensively. The first is the standard split of citation networks, provided by (Kipf and Welling 2017), which is widely used in the node classification task (Velickovic et al. 2018). In the standard split, each class has 20 labeled nodes, with 500 nodes for validation and 1000 nodes for testing. The second is a less label split of citation networks, where each class has 5 labeled nodes, and the validation and testing nodes are the same as in the standard split. The less label split poses a greater challenge to the model's generalization. The third split is the random split of co-purchase datasets, where 20 nodes per class are randomly sampled for training, 30 nodes for validation and the others for testing, as suggested by (Shchur et al. 2018). All the data splits are widely used in previous works (Feng et al. 2020; Wang et al. 2020).

Benchmarks. We choose three kinds of methods as benchmarks: LP-based methods, GNNs-based methods and regularization-based methods. A detailed description and discussion of these methods can be found in Sec. 5.

• LP-based methods: Original LP (Zhou et al. 2003), GLP (Li et al. 2019), GCN-LPA (Wang and Leskovec 2020) and PTA (Dong et al. 2021).
• GNNs-based methods: GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018), MixHop (Abu-El-Haija et al. 2019), GMNN (Qu, Bengio, and Tang 2019) and APPNP (Klicpera, Bojchevski, and Günnemann 2019).
• Regularization-based methods: GAUG (Zhao et al. 2021), DropEdge (Rong et al. 2020), GraphVAT (Feng et al. 2019), GraphMix (Verma et al. 2021), GRAND (Feng et al. 2020) and NodeAug (Wang et al. 2020).

Implementation. The hyper-parameters are set as follows: learning rate=0.01, weight decay=1e-3, hidden unit=32 and the Adam optimizer (Kingma and Ba 2015) for all methods. For the benchmarks, if the original papers provide the hyper-parameters, we set them as the authors suggested. For NASA, the dropout rate is searched in {0.1, ..., 0.9}, the temperature of sharpening is searched in {0.1, ..., 1.0} and α is searched in {0.1, ..., 1.0} for all datasets. We run NASA for 1000 epochs and select the model with the lowest validation loss for testing. For the less label split and random split, we make 5 random splits with seeds {0, 1, 2, 3, 4}, and for each method, we run 10 times and report the mean accuracy and standard deviation. Note that for a fair comparison, we use the standard two-layer GCN as the backbone for the regularization-based methods and NASA, because we want to ensure that the improvement comes from the regularization term itself instead of from a more advanced GNN.

4.2 Performance on Node Classification

The performance of different methods is summarized in Table 2. From top to bottom, we show the results of the three types of baselines, from which we can draw the following conclusions: First, the accuracy of LP-based methods is usually lower than that of the other two types of methods, indicating that only using the dependency of labels cannot achieve satisfactory results. Besides, the performance of regularization-based methods is significantly higher than that of the GNNs-based methods, which shows the effectiveness of the regularization term. In particular, NASA relatively improves the performance of GCNs by 4.4%, 7.4% and 1.5% on standard Cora, Citeseer and Pubmed, respectively. As for the less label split, NASA makes larger improvements, i.e., 7.3%, 8.6% and 3.1%, which supports the superiority of our proposed regularization in utilizing large amounts of unlabeled data. In the random split, NASA also achieves state-of-the-art performance. Finally, we notice that the performance of NASA on Pubmed is weaker than that of GraphMix and NodeAug. We guess this is because in Pubmed, the neighbors do not contribute much to classification.

4.3 Ablation Study

In order to verify the effectiveness of different components in NASA, we conduct two ablation studies on two datasets: Cora and Citeseer. Specifically, we validate the effectiveness of the graph augmentation strategy and the regularization term of NASA, respectively. The results are shown in Tables 3 and 4.

Table 3: Ablation study on augmentation (%)

NASA                  Cora       Citeseer
w/o augmentation      84.5±0.1   75.0±0.1
w/ NR                 85.1±0.3   75.5±0.4
w/ dropedge           84.7±0.5   75.1±0.2
w/ dropnode           84.6±0.3   74.9±0.3
w/ dropout            84.5±0.2   74.7±0.2

In Table 3, we test how different augmentation strategies influence the performance of NASA. First, we can find that without augmentation, the result is more stable but the accuracy drops, which indicates that augmentations can help to improve the performance of the model. Besides, the augmentations on graph structures, i.e., NR and dropedge, are more useful than the augmentations on node features, i.e., dropnode and dropout. This phenomenon is also observed by (You, Ying, and Leskovec 2020). Therefore, future work on graph augmentations can pay more attention to perturbing the topology of the graphs.

Table 4: Ablation study on regularization (%)

NASA                  Cora       Citeseer
w/ dynamic training   85.1±0.3   75.5±0.4
w/ static training    84.7±0.9   70.7±12.6
w/o augmentation      84.5±0.1   75.0±0.1
w/o neighbor          83.4±0.4   73.1±0.7
w/o sharpening        83.7±0.5   72.7±0.5

In Table 4, we list the results of different variants of the regularization term in NASA. The first two rows reveal the advantage of dynamic training in regularization. We can find that the accuracy of static training is lower than that of dynamic training, and the standard deviation is much higher, especially on Citeseer. This shows that static training is easily affected by extreme augmentations, while dynamic training is more stable. The middle two rows validate the effectiveness of augmentation and neighbor. Without either of them, the performance of NASA decreases, which reflects the observations in Fig. 1(a). The last row demonstrates the usefulness of sharpening.

4.4 Generalization Analysis

We design this experiment to validate the superiority of NASA in improving the generalization of GNNs. Specifically, we use the generalization gap (GP) to measure the generalization of different models. GP is a commonly used metric of model generalization (Jiang et al. 2019), which is defined as the difference between the training loss and the validation loss. Note that a smaller value of GP indicates a better generalization. In the experiment, we first jointly optimize the classification and regularization losses in the training process. In inference, the regularization term is removed, and the training and validation losses are calculated by the backbone GNN only. In this situation, augmentations can only affect the models in the training stage, which requires the regularization term to make full use of the unlabeled data.

[Figure 3, six panels plotting Loss (y-axis) against Epoch (x-axis, 0 to 1000), each showing training and validation curves for NASA and one baseline: (a) NASA vs. GCN, (b) NASA vs. LP, (c) NASA vs. GRAND, (d) NASA vs. GCN, (e) NASA vs. LP, (f) NASA vs. GRAND.]

Figure 3: Curves of training and validation loss on Cora (a-c) and Citeseer (d-f). A smaller gap between the training and validation loss indicates a better generalization.

From Fig. 3, we can find that the gap of CR-based methods, i.e., NASA and GRAND, is always smaller than that of GCN and LP, indicating that CR has an advantage in improving the generalization of GNNs. Besides, compared with GRAND, the gap of NASA shrinks by 12.5% and 25% on Citeseer and Cora, respectively. This observation shows that the regularization of NASA is more effective than the state-of-the-art regularization method on GNNs. It is worth noting that the shrinking of NASA's gap benefits from the decrease of validation loss
rather than the increase of training loss, which shows that NASA can make good use of the unlabeled data. Finally, we find an interesting phenomenon: the loss curve of NASA increases at the beginning of training. We think this is because the model tends to optimize the regularization term at first.

(a) Overall visualization  (b) Neighbor  (c) DropNode  (d) Neighbor Replace

Figure 4: (a) Visualization of the node representations in Cora. Colors denote different classes. We zoom in on the red class to show the augmentations of (b) Neighbors for LP, (c) DropNode for CR and (d) Neighbor Replace for NASA.

4.5 Case Visualization
In Fig. 1(a), we introduced the dilemma between the consistency and diversity of graph augmentations. Here, we give a closer visualization of different augmentations. We consider three graph augmentation strategies: immediate neighbors, DropNode and NR, which correspond to LP, CR and NASA, respectively. For DropNode, the drop probability is set to 0.5, as suggested by (Feng et al. 2020). We take one node in the training set as an example and zoom in on its representation and augmentations together. The visualizations are shown in Fig. 4(a).
In Fig. 4(b), we find that, except for one neighbor, the others (black circles) are close to the original node (red circle), which indicates that the consistency of neighbors is good, but the diversity is poor. In Fig. 4(c), the augmentations are far from the original node and some of them are outside the cluster. This shows that although DropNode can provide better diversity, its consistency cannot be guaranteed. Fig. 4(d) shows the augmentations of NR. We can see that the augmentations are spread over different locations of the cluster, which exhibits better consistency and diversity than LP and CR. The reason why NR performs well is that it uses the neighbors within two hops as augmentations, which have more diversity than the immediate neighbors and better consistency than random augmentations.

5 Related Work
Label Propagation. LP (Zhou et al. 2003) is a simple yet effective algorithm in graph-based SSL, which propagates labels to the unlabeled nodes along network structures. The major shortcoming of LP is that it cannot utilize node features, so its performance heavily depends on the network structures and initialization. Some methods have been proposed to deal with this problem. Generalized Label Propagation (GLP) (Li et al. 2019) generalizes LP by extending the graph filter of LP to node features. GCN-LPA (Wang and Leskovec 2020) combines GNNs with LP, where the objective function of LP is used to learn the weights of edges for graph convolution. Besides, Dong et al. (2021) prove that decoupled GCNs, e.g., APPNP (Klicpera, Bojchevski, and Günnemann 2019), are equivalent to a two-step label propagation.
Graph Neural Networks. GNNs have made a breakthrough in the field of semi-supervised node classification. Currently, GNNs can be divided into two categories: spectral methods and spatial methods. Spectral methods aim to utilize the theory of graph signal processing to design graph filters, such as GCN (Kipf and Welling 2017) and GraphHeat (Xu et al. 2019). Spatial methods focus on designing the message passing of GNNs. For example, GAT (Velickovic et al. 2018) uses an attention mechanism to learn the importance of neighbors, and MixHop (Abu-El-Haija et al. 2019) concatenates the representations of neighbors with different orders. However, none of them explicitly utilizes the unlabeled nodes for training, so they easily overfit the scarce training data.
Regularization on GNNs. The use of CR in SSL was first adopted in the field of computer vision (Berthelot et al. 2019; Sohn et al. 2020; Xie et al. 2020) and then drew attention in graph data. CR provides an explicit way to use unlabeled data, which significantly improves the generalization of models. Data augmentation is an important component of CR. In order to apply CR to GNNs, many graph augmentations have been proposed. For example, GRAND (Feng et al. 2020) proposes DropNode, GraphVAT (Feng et al. 2019) designs graph virtual adversarial training, GAUG (Zhao et al. 2021) proposes a learnable augmentation strategy and GraphMix (Verma et al. 2021) uses linear interpolation. They prefer to perform random perturbations on graph structures, node features, or both. Different from them, we use prior knowledge to augment graphs, thus guaranteeing the consistency of the augmentations.

6 Conclusions
In this paper, we study how to use graph augmentation to regularize GNNs and improve their performance and generalization ability. We find that existing graph augmentations fall into a dilemma between consistency and diversity. To solve this problem, we propose a new regularization, NASA, which utilizes augmented neighbors with high consistency and diversity to regularize GNNs. Experimental results validate the superiority of NASA in improving the performance and generalization of GNNs. An important direction for future work is to prevent NASA from being affected by noisy neighbors and to generalize the method to heterophilic graphs.
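As a rough illustration of the three candidate sources compared in Section 4.5 (immediate neighbors, DropNode, and neighbors within two hops), the sketch below builds each of them on a toy graph. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the adjacency-list representation and the toy graph are hypothetical, the DropNode rescaling by 1 / (1 - p) follows the usual dropout-style convention and is assumed here, and the full Neighbor Replace procedure defined earlier in the paper is not reproduced.

```python
import random


def one_hop(adj, v):
    """Immediate neighbors of v (the candidates associated with LP)."""
    return set(adj[v])


def within_two_hops(adj, v):
    """Nodes reachable from v in at most two hops, excluding v itself."""
    first = set(adj[v])
    second = {w for u in first for w in adj[u]}
    return (first | second) - {v}


def drop_node(features, p=0.5, rng=None):
    """DropNode-style perturbation: zero whole feature rows with probability p.

    Kept rows are rescaled by 1 / (1 - p) (an assumed, dropout-style choice)
    so that the expected feature value is unchanged. Requires 0 <= p < 1.
    """
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [
        [0.0] * len(row) if rng.random() < p else [x * scale for x in row]
        for row in features
    ]


# Hypothetical toy graph: path 0-1-2-3 with an extra branch 1-4.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

print(sorted(one_hop(adj, 0)))          # [1]
print(sorted(within_two_hops(adj, 0)))  # [1, 2, 4]
```

On this toy graph the two-hop set is strictly larger than the one-hop set, which mirrors the diversity argument above, while every candidate is still an actual nearby node rather than a randomly perturbed copy.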
7 Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (No. U20B2045, 61772082, 61702296, 62002029, 62172052), the Fundamental Research Funds for the Central Universities 2021RC28, and BUPT Excellent Ph.D. Students Foundation (No. CX2020115).

References
Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Steeg, G. V.; and Galstyan, A. 2019. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In ICML.
Belkin, M.; and Niyogi, P. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput., 15(6): 1373–1396.
Berthelot, D.; Carlini, N.; Goodfellow, I. J.; Papernot, N.; Oliver, A.; and Raffel, C. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. In NeurIPS.
Chong, Y.; Ding, Y.; Yan, Q.; and Pan, S. 2020. Graph-based semi-supervised learning: A review. Neurocomputing, 408: 216–230.
Dong, H.; Chen, J.; Feng, F.; He, X.; Bi, S.; Ding, Z.; and Cui, P. 2021. On the Equivalence of Decoupled Graph Convolution Network and Label Propagation. In WWW.
Feng, F.; He, X.; Tang, J.; and Chua, T.-S. 2019. Graph Adversarial Training: Dynamically Regularizing Based on Graph Structure. IEEE Trans. Knowl. Data Eng.
Feng, W.; Zhang, J.; Dong, Y.; Han, Y.; Luan, H.; Xu, Q.; Yang, Q.; Kharlamov, E.; and Tang, J. 2020. Graph Random Neural Networks for Semi-Supervised Learning on Graphs. In NeurIPS.
Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural Message Passing for Quantum Chemistry. In ICML.
Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On Calibration of Modern Neural Networks. In ICML.
Jiang, Y.; Krishnan, D.; Mobahi, H.; and Bengio, S. 2019. Predicting the Generalization Gap in Deep Networks with Margin Distributions. In ICLR.
Joyce, J. M. 2011. Kullback-Leibler Divergence. In International Encyclopedia of Statistical Science, 720–722. Springer.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
Klicpera, J.; Bojchevski, A.; and Günnemann, S. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In ICLR.
Li, Q.; Han, Z.; and Wu, X. 2018. Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning. In AAAI.
Li, Q.; Wu, X.; Liu, H.; Zhang, X.; and Guan, Z. 2019. Label Efficient Semi-Supervised Learning via Graph Filtering. In CVPR.
Miyato, T.; Maeda, S.; Koyama, M.; and Ishii, S. 2019. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(8): 1979–1993.
Qu, M.; Bengio, Y.; and Tang, J. 2019. GMNN: Graph Markov Neural Networks. In ICML.
Rong, Y.; Huang, W.; Xu, T.; and Huang, J. 2020. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In ICLR.
Roweis, S. T.; and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500): 2323–2326.
Shchur, O.; Mumme, M.; Bojchevski, A.; and Günnemann, S. 2018. Pitfalls of Graph Neural Network Evaluation. CoRR, abs/1811.05868.
Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.; Cubuk, E. D.; Kurakin, A.; and Li, C. 2020. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In NeurIPS.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1): 1929–1958.
Sun, K.; Lin, Z.; and Zhu, Z. 2020. Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labeled Nodes. In AAAI.
van Engelen, J. E.; and Hoos, H. H. 2020. A survey on semi-supervised learning. Mach. Learn., 109(2): 373–440.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In ICLR.
Verma, V.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-Paz, D. 2019. Interpolation Consistency Training for Semi-supervised Learning. In IJCAI.
Verma, V.; Qu, M.; Lamb, A.; Bengio, Y.; Kannala, J.; and Tang, J. 2021. GraphMix: Regularized Training of Graph Neural Networks for Semi-Supervised Learning. In AAAI.
Wang, H.; and Leskovec, J. 2020. Unifying Graph Convolutional Neural Networks and Label Propagation. CoRR, abs/2002.06755.
Wang, Y.; Wang, W.; Liang, Y.; Cai, Y.; Liu, J.; and Hooi, B. 2020. NodeAug: Semi-Supervised Node Classification with Data Augmentation. In KDD.
Xie, Q.; Dai, Z.; Hovy, E. H.; Luong, T.; and Le, Q. 2020. Unsupervised Data Augmentation for Consistency Training. In NeurIPS.
Xu, B.; Shen, H.; Cao, Q.; Cen, K.; and Cheng, X. 2019. Graph Convolutional Networks using Heat Kernel for Semi-supervised Learning. In IJCAI.
Yang, X.; Song, Z.; King, I.; and Xu, Z. 2021. A Survey on Deep Semi-supervised Learning. CoRR, abs/2103.00550.
Yin, D.; Lopes, R. G.; Shlens, J.; Cubuk, E. D.; and Gilmer, J. 2019. A Fourier Perspective on Model Robustness in Computer Vision. In NeurIPS, 13255–13265.
You, J.; Ying, Z.; and Leskovec, J. 2020. Design Space for Graph Neural Networks. In NeurIPS.
Zhao, T.; Liu, Y.; Neves, L.; Woodford, O. J.; Jiang, M.; and Shah, N. 2021. Data Augmentation for Graph Neural Networks. In AAAI.
Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Schölkopf, B. 2003. Learning with Local and Global Consistency. In NeurIPS.
A Experimental Investigation

[Figure: Consistency (vertical axis) versus Diversity (horizontal axis) for Raw data, dropout, dropnode, dropedge, LP and NASA. Panels: (a) Cora, (b) Citeseer, (c) Pubmed.]

B Detailed Information of Datasets and Environment
The environment where the code runs is as follows:
• Operating system: Linux 4.9.151-015.x86_64.
• CPU information: Intel(R) Xeon(R) CPU E5-2682 v4 @2.50GHz.
• GPU information: NVIDIA® Tesla™ M40 GPU Computing Accelerator - 12G.

C Detailed Parameters of NASA

Table 1: Hyper-parameters of NASA.

Split              Dataset     Dropout   Balance (α)   Scaling (T)
Standard Split     Cora        0.7       1.0           0.5
Standard Split     Citeseer    0.1       1.0           0.5
Standard Split     Pubmed      0.5       0.5           0.2
Less Label Split   Cora        0.8       1.0           0.7
Less Label Split   Citeseer    0.8       1.0           1.0
Less Label Split   Pubmed      0.5       0.5           0.5
Random Split       Computer    0.3       0.7           0.5
Random Split       Photo       0.5       1.0           0.3

D Source Code of Benchmarks
We make sure that the code and data we use are public and do not contain any information about the authors of this paper. The acquisition of code and data complies with the providers’ licenses, and none of them contains any offensive content. The addresses of the benchmarks’ data and code are listed as follows:
Cora, Citeseer, Pubmed, Amazon-Computer & Amazon-Photo (Apache-2.0 License): https://docs.dgl.ai/en/latest/api/python/dgl.data.html#node-prediction-datasets
LP & GLP (MIT License): https://github.com/liqimai/Efficient-SSL
GCN-LPA (MIT License): https://github.com/hwwang55/GCN-LPA
PTA (MIT License): https://github.com/DongHande/PT_propagation_then_training
GCN, GAT, MixHop, GMNN, APPNP, GRAND & DropEdge (Apache-2.0 License): https://github.com/dmlc/dgl/tree/master/examples/pytorch
GraphVAT (without License): https://github.com/fulifeng/GraphAT
GraphMix (without License): https://github.com/vikasverma1077/GraphMix
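When re-running the benchmarks, the settings of Table 1 (Appendix C) can be kept in a small lookup table so that each split and dataset pair resolves to one configuration. The following is only an illustrative sketch: the values are transcribed from Table 1, while the dictionary layout, the key names (`dropout`, `alpha`, `T`), the split identifiers and the helper `get_hyper_params` are hypothetical and not part of any released code.

```python
# Values transcribed from Table 1; structure and names are illustrative only.
HYPER_PARAMS = {
    ("standard", "Cora"): {"dropout": 0.7, "alpha": 1.0, "T": 0.5},
    ("standard", "Citeseer"): {"dropout": 0.1, "alpha": 1.0, "T": 0.5},
    ("standard", "Pubmed"): {"dropout": 0.5, "alpha": 0.5, "T": 0.2},
    ("less_label", "Cora"): {"dropout": 0.8, "alpha": 1.0, "T": 0.7},
    ("less_label", "Citeseer"): {"dropout": 0.8, "alpha": 1.0, "T": 1.0},
    ("less_label", "Pubmed"): {"dropout": 0.5, "alpha": 0.5, "T": 0.5},
    ("random", "Computer"): {"dropout": 0.3, "alpha": 0.7, "T": 0.5},
    ("random", "Photo"): {"dropout": 0.5, "alpha": 1.0, "T": 0.3},
}


def get_hyper_params(split, dataset):
    """Return the Table 1 settings for a (split, dataset) pair.

    Raises KeyError with a readable message when the pair is not in Table 1.
    """
    try:
        return HYPER_PARAMS[(split, dataset)]
    except KeyError:
        raise KeyError(
            f"no Table 1 entry for split={split!r}, dataset={dataset!r}"
        ) from None


params = get_hyper_params("standard", "Citeseer")
print(params["dropout"], params["alpha"], params["T"])  # 0.1 1.0 0.5
```

Keying on the (split, dataset) pair makes it explicit that the same dataset uses different settings across splits, e.g. Cora under the standard and less label splits.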
