(AAAI 2022) Regularizing Graph Neural Networks via Consistency-Diversity Graph Augmentations
Deyu Bo1*, BinBin Hu2, Xiao Wang1, Zhiqiang Zhang2, Chuan Shi1†, Jun Zhou2
1 Beijing University of Posts and Telecommunications
2 Ant Financial Services Group, Hangzhou, China
{bodeyu, xiaowang, shichuan}@bupt.edu.cn, {bin.hbb,lingyao.zzq,jun.zhoujun}@antfin.com
* Work done during Deyu's internship at Ant Group.
† Corresponding Author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Despite the remarkable performance of graph neural networks (GNNs) in semi-supervised learning, they are criticized for not making full use of unlabeled data and for suffering from overfitting. Recently, graph data augmentation, used to improve both the accuracy and the generalization of GNNs, has received considerable attention. However, one fundamental question is how to evaluate the quality of graph augmentations in principle. In this paper, we propose two metrics, Consistency and Diversity, from the aspects of augmentation correctness and generalization. Moreover, we discover that existing augmentations fall into a dilemma between these two metrics. Can we find a graph augmentation satisfying both consistency and diversity? A well-informed answer can help us understand the mechanism behind graph augmentation and improve the performance of GNNs. To tackle this challenge, we analyze two representative semi-supervised learning algorithms: label propagation (LP) and consistency regularization (CR). We find that LP utilizes the prior knowledge of graphs to improve consistency, while CR adopts variable augmentations to promote diversity. Based on this discovery, we treat neighbors as augmentations to capture the prior knowledge embodying the homophily assumption, which promises a high consistency of augmentations. To further promote diversity, we randomly replace the immediate neighbors of each node with its remote neighbors. After that, a neighbor-constrained regularization is proposed to enforce the predictions of the augmented neighbors to be consistent with each other. Extensive experiments on five real-world graphs validate the superiority of our method in improving the accuracy and generalization of GNNs.

(a) Dilemma of augmentations (b) A toy semi-supervised dataset
Figure 1: (a) Consistency and diversity of different augmentations on the Cora dataset. The dotted line represents the consistency of the raw data. Circles with different colors indicate different graph augmentations. Black and red crosses show the consistency and diversity of immediate neighbors and our proposed augmentation. (b) Toy example of data augmentations in SSL. Blue and red circles are labeled data, gray circles are unlabeled data, and rectangles are augmentations.
1 Introduction

Graph neural networks (GNNs), as a typical graph-based semi-supervised learning (SSL) method, have achieved state-of-the-art performance (Kipf and Welling 2017; Velickovic et al. 2018). Despite this success, GNNs have been criticized for not making full use of unlabeled data (Wang et al. 2020; Feng et al. 2020), which is an essential requirement of SSL (Yang et al. 2021). Previous methods tend to use pseudo labels to overcome this limitation (Sun, Lin, and Zhu 2020; Li, Han, and Wu 2018), but suffer from poor calibration (Guo et al. 2017). Recently, graph data augmentation has been used to improve both the accuracy and the generalization of GNNs (Rong et al. 2020; Verma et al. 2021; Feng et al. 2019, 2020; Wang et al. 2020).

Although there are several augmentation strategies for graphs, such as DropEdge (Rong et al. 2020) and DropNode (Feng et al. 2020), it is still unknown which augmentation is better for GNNs. Generally, an easy augmentation contributes little to the generalization of the model, while a hard augmentation may bring additional noise (Yin et al. 2019). Therefore, a natural question is how to evaluate the quality of graph augmentations in principle. To this end, as the first contribution of this paper, we propose two metrics of graph augmentation in SSL: Consistency and Diversity. Consistency indicates whether the augmented data belong to the same class as the raw data, and diversity reveals how different the distribution captured by the augmented data is from that of the raw data. Detailed descriptions can be found in Sec. 2. If the augmentation and the original data are in different classes, it will hurt the accuracy of the model; if the augmentation is similar to the original data, it may contribute little to the generalization of the model. Therefore, a good augmentation should not only ensure correctness but also provide sufficient generalization.

Based on the two evaluations, we test three commonly used graph augmentations, i.e., Dropout (Srivastava et al. 2014), DropEdge and DropNode, with different dropping rates. The results are shown in Fig. 1(a), where there is a dilemma between consistency and diversity: an augmentation with high consistency may have less diversity, and vice versa. Since the dilemma of existing graph augmentations is identified, a natural question is: can we find a graph augmentation satisfying both consistency and diversity at the same time? This is not a trivial task, because we need to quantitatively define consistency and diversity for graph data and make a delicate balance between them.

To solve the dilemma, we need to know the factors that affect consistency and diversity. We analyze two representative SSL methods, label propagation (LP) and consistency regularization (CR), and find that LP uses neighbors as augmentations, which naturally captures the prior knowledge of graphs and improves consistency, while CR employs variable augmentations to promote diversity. Based on this discovery, in this paper we propose NASA, short for Neighbors Are Special Augmentations, to augment and regularize GNNs. NASA consists of two parts: augmentation and regularization. In the augmentation, we treat neighbors as special augmentations and propose to disturb nodes by replacing their immediate neighbors with remote neighbors. Generally, neighbors capture the prior knowledge of graphs, i.e., the homophily assumption, and replacing neighbors improves variability, so we can preserve high consistency and diversity simultaneously. In the regularization, we propose a neighbor-constrained regularization, which enforces the predictions of neighbors to be consistent with each other, so that a large number of unlabeled nodes can be used in training. Moreover, we show that the proposed regularization can be used as a supplement to the traditional graph regularization.

The contribution of this paper is summarized as follows:
• We propose consistency and diversity to evaluate the quality of existing graph augmentations, and find that they cannot satisfy the two metrics at the same time. To the best of our knowledge, this is the first exploration of metrics for graph augmentations.
• We propose NASA, which generates graph augmentations with high consistency and diversity by replacing immediate neighbors with remote neighbors, and constrains the predictions of the augmented neighbors to be consistent.
• We validate the effectiveness of NASA by comparing with state-of-the-art methods on five real-world datasets. We also conduct a generalization test to verify the superiority of NASA in improving the generalization of GNNs.
2 Evaluation of Augmentation

In this section, we introduce the detailed description of the two metrics, i.e., Consistency and Diversity. Before that, we first explain the motivation for designing the two metrics.

Let us take the "two moons" data as an example (Verma et al. 2019), as shown in Fig. 1(b), where the blue and red circles are labeled data and the gray circles are unlabeled data. We can see that the number of labeled data is relatively small and cannot reflect the distribution of the entire dataset. In this situation, we consider three types of augmentations, i.e., A, B, and C. Although A lies in the correct class, it contributes little information because it is close to the raw data (high consistency, low diversity); B is different from the raw data, but it is located in the wrong class, which brings additional noise (low consistency, high diversity); C benefits the classification a lot because it not only has the correct label but also brings additional generalization (high consistency, high diversity). The above discussion shows that a good augmentation should generalize to the distribution beyond the training data. Therefore, only using labeled data cannot comprehensively evaluate the quality of augmentations. To better measure the correctness and generalization of augmentations, we need to introduce additional data, e.g., a validation set, for evaluation. The main idea is as follows.

We first train two models Fθ, F̃θ : R^d → R^C on the training data D_train and its augmentations D̃_train, respectively, where d is the dimension of the input features, C is the number of classes, and θ denotes the parameters. After that, we use the two models to predict on the validation set D_val. If the augmentations have better correctness and generalization, the model F̃θ should have higher accuracy on the validation set and establish a decision boundary that differs more from that of Fθ. This leads to the metrics of consistency and diversity.

Metric of Consistency. We use the accuracy of the augmented model on the validation set to represent the level of consistency:

    C = Acc(F̃θ(D_val), Y_val),   (1)

where Y_val denotes the labels of the validation data. A lower value of C means that the augmentations are inconsistent with the raw data, which may hurt the accuracy of the model. However, a higher value of C does not mean that the quality of the augmentation is necessarily good, because it may contribute little to the generalization of the model, which leads to the metric of diversity.

Metric of Diversity. We use the difference between the predictions of the original model Fθ and the augmented model F̃θ to represent the level of diversity:

    D = ||F̃θ(D_val) − Fθ(D_val)||²_F,   (2)

where ||·||_F is the Frobenius norm. A lower value of D indicates that the augmentations have a similar distribution to the original data, which cannot benefit the generalization of models (Yin et al. 2019). But a higher value of D cannot ensure the correctness of the augmentations. Therefore, the combination of the two metrics is necessary for the evaluation.

Note that the metrics of consistency and diversity are not limited to graph data. Instead, they can be used to evaluate the quality of data augmentations in other semi-supervised fields, such as computer vision (Berthelot et al. 2019; Xie et al. 2020). In the next section, we will introduce our method and explain how these two metrics guide the model design.
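As an illustration of how the two metrics are computed in practice, the following sketch evaluates Eq. 1 and Eq. 2 given the two trained models. The interface (callables returning class probabilities) and all names are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def consistency_and_diversity(f_orig, f_aug, x_val, y_val):
    """Eq. 1 and Eq. 2: evaluate an augmentation via a model trained on the
    raw training data (f_orig) and one trained on its augmentations (f_aug).

    f_orig, f_aug : callables mapping validation features to class
                    probabilities of shape (num_val_nodes, num_classes)
                    -- assumed interface, not defined in the paper
    x_val, y_val  : validation features and integer class labels
    """
    p_orig = f_orig(x_val)                       # F_theta(D_val)
    p_aug = f_aug(x_val)                         # F~_theta(D_val)

    # Consistency (Eq. 1): accuracy of the augmented model on the validation set.
    consistency = np.mean(np.argmax(p_aug, axis=1) == y_val)

    # Diversity (Eq. 2): squared Frobenius norm between the two prediction matrices.
    diversity = np.sum((p_aug - p_orig) ** 2)
    return consistency, diversity
```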
3 Methodology

Let G = (V, E) denote a graph, where V is the set of nodes with |V| = N and E is the set of edges. Each graph G has an adjacency matrix A ∈ {0, 1}^{N×N}, where A_ij = 1 means there is an edge between v_i and v_j, and A_ij = 0 otherwise. X ∈ R^{N×d} are the node features and H ∈ R^{N×C} are the node representations learned by GNNs. Generally, most existing GNNs can be summarized as a message passing architecture (Gilmer et al. 2017), which can be formulated as H = Trans(Agg{A, X; Φ}; Θ), where Agg aggregates information from neighbors in the graph and Trans transforms the aggregated information into new node representations. The parameters Φ and Θ are used for aggregation and transformation, respectively. In a graph augmentation, the perturbation may occur in both node features and structures. Therefore, the augmented node representations can be calculated as H̃ = Trans(Agg{Ã, X̃; Φ}; Θ), where Ã and X̃ are the augmented structures and features, respectively.

(a) Neighbors on Cora (b) Neighbors on Citeseer
Figure 2: Empirical study of different neighbors' consistency and diversity. "Raw" represents the original training nodes, and "k-hop" indicates the neighbors that are k hops away from the training nodes, where k ∈ {1, 2, 3}.
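To make the Agg/Trans template above concrete, a minimal single-layer sketch is given below. The mean aggregation and ReLU transform are illustrative choices of ours, since the paper does not commit to a particular GNN.

```python
import numpy as np

def message_passing_layer(adj, x, w):
    """One layer of the Agg/Trans template: H = Trans(Agg{A, X; Phi}; Theta).

    adj : (N, N) binary adjacency matrix (self-loops, if desired, added by the caller)
    x   : (N, d) node features
    w   : (d, d_out) weight matrix playing the role of Theta
    Mean aggregation and a ReLU transform are illustrative choices only.
    """
    deg = adj.sum(axis=1, keepdims=True)         # node degrees
    agg = adj @ x / np.clip(deg, 1, None)        # Agg: average over neighbors
    return np.maximum(agg @ w, 0.0)              # Trans: linear map + ReLU
```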
3.1 Connection Between Consistency Regularization and Label Propagation

A basic requirement of SSL is to make good use of the unlabeled data (van Engelen and Hoos 2020; Chong et al. 2020). Here we review two representative SSL algorithms and discuss how they use augmentations to assist unlabeled nodes.

Label propagation is a traditional graph-based semi-supervised algorithm, which propagates labels to unlabeled nodes along the graph topology (Zhou et al. 2003). The objective function can be defined as:

    L_LP = Σ_{i∈V_L} ||h_i − y_i||²₂ + α Σ_{i∈V} Σ_{j∈N_i} ||h_i − h_j||²₂,   (3)

where h_i is the i-th row of H, V_L represents the labeled nodes, α is a hyper-parameter, y_i is a one-hot vector denoting the label of v_i, and N_i denotes the neighbors of v_i. The first term is a classification loss; here we take the mean squared loss as an example. The second term is a graph Laplacian regularization, which enforces the representations of neighbors to be consistent. Note that the closed-form solution of Eq. 3 is H = (I + αL)^{-1} Y, where L is the Laplacian matrix of A.

Consistency regularization is an emerging semi-supervised technique, which enforces the model to make similar predictions on the raw data and its random augmentations, so that the model is robust to small data perturbations (Xie et al. 2020). The objective function can be formulated as:

    L_CR = Σ_{i∈V_L} ||h_i − y_i||²₂ + α Σ_{i∈V} Σ_{k=1}^{K} ||h_i − h̃_i^(k)||²₂,   (4)

where K is the number of random augmentations and h̃_i^(k) is the representation of the k-th augmentation. The first term of CR is the same as that of LP, and the second term is a regularization, which uses the prediction of v_i as a pseudo label to supervise the outputs of its augmentations.
do not add much diversity.
Remark 1. (Two perspectives of LP and CR) Comparing Based on the results, we propose Neighbor Replace (NR)
Eq. 3 and Eq. 4, we can find that the difference between LP to randomly replace the 1-hop neighbors by the 2-hop neigh-
and CR is the regularization. From the perspective of LP, bors. Specifically, for node vi , we use a Bernoulli distri-
using neighbors as augmentations explicitly utilizes the prior bution to sample its neighbors randomly, i.e., ∀vj ∈ Ni ,
knowledge of graphs, i.e., the homophily assumption. There- j ∼ Bern(p). For each sampled neighbor vj with j = 1,
fore, the consistency of neighbors is higher than random we drop the edges between vj and vi , and randomly choose
augmentations. From the perspective of CR, the features and a neighbor of vj as the new neighbor of vi , i.e., Ninew =
structures of neighbors hj are fixed during training, while {vk ∼ Nj , j = 1}. For the neighbors with j = 0, we do not
random augmentations h e (k) will change dynamically, e.g. change them and denote them as Niold = {vj ∈ Ni , j = 0}.
i
Therefore, the augmented neighbors of vi is defined as Table 1: Statistics of datasets.
Nei = N new ∪ N old . The benefits of NR are two-fold: first,
i i
the exchange between 1-hop neighbors and 2-hop neighbors Dataset # Nodes # Edges # Features # Classes
perturbs graph structures, but does not seriously hurt the cor-
rectness. Second, the supervision signals can be propagated Cora 2,708 5,429 1,433 7
to more unlabeled nodes so that the generalization can be Citeseer 3,327 4,732 3,703 6
promoted. Pubmed 19,717 44,338 500 3
Computer 13,381 245,778 767 10
Although graph structures contain the consistency informa-
Photo 7,487 119,043 745 8
tion, the inter-edges (Zhao et al. 2021) and NR augmentations
may introduce some noise. Here we propose two techniques,
i.e., neighbor-constrained regularization and dynamic train-
ing, to prevent pseudo labels from being heavily disturbed. where α is a hyper-parameter for balancing.
Finally, we give a further explanation to show why this reg-
Neighbor-constrained Regularization. After perturbing ularization is called ”neighbor-constrained”. In addition, we
the neighbors of each node, we feed the augmented graph analyze its connection to the traditional graph regularization
topology A e and original node features X into an ar- (Belkin and Niyogi 2003). We can rewrite Eq. 7 as:
bitrary GNNs to learn the node representations: H e = 1 XX
LCR = p ei − p
e i log p e i log h
ej , (9)
Trans(Agg{A, X; Φ}; Θ). For the labeled nodes, a cross-
e N
i∈V j∈N
entropy loss is used to supervise the predictions of GNNs:
ei
Loss
Loss
Loss
Loss
Loss
1.0 1.00 1.00 1.00
1.0 1.0
0.75 0.75 0.75
0.5 0.5 0.5 0.50 0.50 0.50
0.25 0.25 0.25
0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000
Epoch Epoch Epoch Epoch Epoch Epoch
(a) NASA vs. GCN (b) NASA vs. LP (c) NASA vs. GRAND (d) NASA vs. GCN (e) NASA vs. LP (f) NASA vs. GRAND
Figure 3: Curves of training and validation loss on Cora (a-c) and Citeseer (d-f). A smaller gap between the training and validation
loss indicates a better generalization.
Table 3: Ablation study on augmentation (%).

NASA                 Cora        Citeseer
w/o augmentation     84.5±0.1    75.0±0.1
w/ NR                85.1±0.3    75.5±0.4
w/ dropedge          84.7±0.5    75.1±0.2
w/ dropnode          84.6±0.3    74.9±0.3
w/ dropout           84.5±0.2    74.7±0.2

Table 3 shows how different augmentations influence the performance of NASA. First, we find that without augmentation the result is more stable but the accuracy drops, which indicates that augmentations help to improve the performance of the model. Besides, the augmentations on graph structures, i.e., NR and dropedge, are more useful than the augmentations on node features, i.e., dropnode and dropout. This phenomenon is also observed by (You, Ying, and Leskovec 2020). Therefore, future work on graph augmentations can pay more attention to perturbing the topology of graphs.
Table 4: Ablation study on regularization (%).

NASA                   Cora        Citeseer
w/ dynamic training    85.1±0.3    75.5±0.4
w/ static training     84.7±0.9    70.7±12.6
w/o augmentation       84.5±0.1    75.0±0.1
w/o neighbor           83.4±0.4    73.1±0.7
w/o sharpening         83.7±0.5    72.7±0.5

In Table 4, we list the results of different variants of the regularization term in NASA. The first two rows reveal the advantage of dynamic training in the regularization. We find that the accuracy of static training is lower than that of dynamic training, and the standard deviation is much higher, especially on Citeseer. This shows that static training is easily affected by extreme augmentations, while dynamic training is more stable. The middle two rows validate the effectiveness of the augmentation and the neighbors: without either of them, the performance of NASA decreases, which reflects the observations in Fig. 1(a). The last row demonstrates the usefulness of sharpening.
Consistency-diversity plots of the raw data and different augmentations (dropout, dropnode, dropedge, LP, NASA) on three datasets (axes: Diversity vs. Consistency).