Graph Neural Networks: Self-Supervised Learning

ABSTRACT
Although deep learning has achieved state-of-the-art performance across numerous domains, these models generally require large annotated datasets to reach their full potential and avoid overfitting. However, obtaining such datasets can have high associated costs or even be impossible to procure. Self-supervised learning (SSL) seeks to create and utilize specific pretext tasks on unlabeled data to aid in alleviating this fundamental limitation of deep learning models. Although initially applied in the image and text domains, recent interest has been in leveraging SSL in the graph domain to improve the performance of graph neural networks (GNNs). For node-level tasks, GNNs can inherently incorporate unlabeled node data through neighborhood aggregation, unlike in the image or text domains; but they can still benefit from applying novel pretext tasks to encode richer information, and numerous such methods have recently been developed. For GNNs solving graph-level tasks, applying SSL methods is more aligned with other traditional domains, but still presents unique challenges and has been the focus of a few works. In this chapter, we summarize recent developments in applying SSL to GNNs, categorizing them via the different training strategies and types of data used to construct their pretext tasks, and finally discuss open challenges for future directions.

1 INTRODUCTION
Recent years have witnessed the great success of applying deep learning in numerous fields. However, the superior performance of deep learning heavily depends on the quality of the supervision provided by the labeled data, and collecting a large amount of high-quality labeled data tends to be time-intensive and resource-expensive [22, 79]. Therefore, to alleviate the demand for massive labeled data and provide sufficient supervision, self-supervised learning (SSL) has been introduced. Specifically, SSL designs domain-specific pretext tasks that leverage extra supervision from unlabeled data to train deep learning models and learn better representations for downstream tasks. In computer vision, various pretext tasks have been studied, e.g., predicting relative locations of image patches [43] and identifying augmented images generated from image processing techniques such as cropping, rotating and resizing [53]. In natural language processing, self-supervised learning has also been heavily utilized, e.g., predicting the masked word in BERT [9].

Simultaneously, graph representation learning has emerged as a powerful strategy for analyzing graph-structured data over the past few years [18]. As the generalization of deep learning to the graph domain, Graph Neural Networks (GNNs) have become one promising paradigm due to their efficiency and strong performance in real-world applications [67, 78]. However, the vanilla GNN model (i.e., the Graph Convolutional Network [33]) and even more advanced existing GNNs [19, 58, 65, 66] are mostly established in a semi-supervised or supervised manner, which still requires high-cost label annotation. Additionally, these GNN models may not take full advantage of the abundant information in unlabeled data, such as the graph topology and node attributes. Hence, SSL can be naturally harnessed for GNNs to gain additional supervision and thoroughly exploit the information in the unlabeled data.

Compared with grid-based data such as images or text [73], graph-structured data is far more complex due to its highly irregular topology, involved intrinsic interactions and abundant domain-specific semantics [63]. Different from images and text, where the entire structure represents a single entity or expresses a single semantic meaning, each node in a graph is an individual instance with its own features, positioned in its own local context. Furthermore, these individual instances are inherently related to each other, which forms diverse local structures that encode even more complex information to be discovered and analyzed. While such complexity engenders tremendous challenges in analyzing graph-structured data, the substantial and diverse information contained in the node features, node labels, local/global graph structures, and their interactions and combinations provides golden opportunities to design self-supervised pretext tasks.

Embracing all the challenges and opportunities to study self-supervised learning in GNNs, the works [22, 24, 28, 70] have been the first to systematically design and compare different self-supervised pretext tasks in GNNs. For example, the works [24, 70] design pretext tasks to encode the topological properties of a node, such as its centrality, clustering coefficient, and graph partitioning assignment, or to encode the attributes of a node, such as its individual features and clustering assignments, in the embeddings output by GNNs. The work [28] designs pretext tasks to align the pairwise feature similarity or the topological distance between two nodes in the graph with the closeness of the two nodes in the embedding space. Apart from the supervision information employed in creating pretext tasks, designing effective training strategies and selecting reasonable loss functions are other crucial components in incorporating SSL into GNNs. Two frequently used training strategies that equip GNNs with SSL are 1) pre-training GNNs through completing pretext task(s) and then fine-tuning the GNNs on downstream task(s), and 2) jointly training GNNs on both pretext and downstream tasks [28, 70]. There are also a few works [5, 55] applying the idea of self-training to incorporate SSL into GNNs. In addition, loss functions are selected to be tailored to the purposes of specific pretext tasks, which include classification-based tasks (cross-entropy loss), regression-based tasks (mean squared error loss) and contrastive-based tasks (contrastive loss).

In view of the substantial progress made in the field of graph neural networks and the significant potential of self-supervised
learning, this chapter aims to present a systematic and comprehensive review on applying self-supervised learning to graph neural networks. The rest of the chapter is organized as follows. Section 2 first introduces self-supervised learning and pretext tasks, and then summarizes frequently used self-supervised methods from the image and text domains. In Section 3, we introduce the training strategies that are used to incorporate SSL into GNNs and categorize the pretext tasks that have been developed for GNNs. Sections 4 and 5 present detailed summaries of numerous representative SSL methods that have been developed for node-level and graph-level pretext tasks. Thereafter, in Section 6 we discuss representative SSL methods that are developed using both node-level and graph-level supervision, which we refer to as node-graph-level pretext tasks. Section 7 collects and reinforces the major results and the insightful discoveries in prior sections. Concluding remarks and future forecasts on the development of SSL in GNNs are provided in Section 8.

2 SELF-SUPERVISED LEARNING
Supervised learning is the machine learning task of training a model that maps an input to an output based on the ground-truth input-output pairs provided by a labeled dataset. Good performance of supervised learning requires a decent amount of labeled data (especially when using deep learning models), which is expensive to collect manually. Conversely, self-supervised learning generates supervisory signals from unlabeled data and then trains the model based on the generated supervisory signals. The task used for training the model based on the generated signal is referred to as the pretext task. In comparison, the task whose ultimate performance we care about the most and expect our model to solve is referred to as the downstream task. To guarantee that self-supervised learning yields performance benefits, pretext tasks should be carefully designed such that completing them encourages the model to develop a similar or complementary understanding to that required by the downstream tasks. Self-supervised learning originated to solve tasks in the image and text domains. The following part focuses on introducing self-supervised learning in these two fields with a specific emphasis on the different pretext tasks.

In computer vision (CV), many ideas have been proposed for self-supervised representation learning on image data. A common example is that we expect a small distortion of an image not to affect its original semantic meaning or geometric form. The idea of creating surrogate training datasets with unlabeled image patches, by first sampling patches from different images at varying positions and then distorting the patches through a variety of random transformations, is proposed in [12]. The pretext task is to discriminate between patches distorted from the same image and those from different images. Rotation of an entire image is another effective and inexpensive way to modify an input image without changing its semantic content [16]. Each input image is first rotated by a multiple of 90 degrees at random, and the model is then trained to predict which rotation has been applied. However, instead of performing pretext tasks on an entire image, local patches can also be extracted to construct the pretext tasks. Examples of methods using this technique include predicting the relative position between two random patches from one image [10] and designing a jigsaw puzzle game to place nine shuffled patches back in their original locations [43]. More pretext tasks such as colorization, autoencoders, and contrastive predictive coding have also been introduced and effectively utilized [45, 60, 71].

While computer vision has achieved amazing progress on self-supervised learning in recent years, self-supervised learning has been heavily utilized in natural language processing (NLP) research for quite a while. Word2vec [40] is the first work that popularized the SSL idea in the NLP field. Center word prediction and neighbor word prediction are the two pretext tasks in Word2vec, where the model is given a small chunk of text and asked to predict the center word in that text, or vice versa. BERT [9] is another famous pre-trained model in NLP, where the two pretext tasks are to recover randomly masked words in a text and to classify whether two sentences can come one after another or not. Similar works have also been introduced, such as having the pretext task classify whether a pair of sentences are in the correct order [34], or a pretext task that first randomly shuffles the ordering of sentences and then seeks to recover the original ordering [36].

Compared with the difficulty of data acquisition encountered in the image and text domains, machine learning in the graph domain faces even more challenges in acquiring high-quality labeled data. For example, for molecular graphs it can be extremely expensive to perform the necessary laboratory experiments to label some molecules [52], and in a social network obtaining ground-truth labels for individual users may require large-scale surveys or be unable to be released due to privacy agreements/concerns [3]. Therefore, the success achieved by applying SSL in CV and NLP naturally leads to the question of whether SSL can be effectively applied in the graph domain. Given that graph neural networks are among the most powerful paradigms for graph representation learning, in the following sections we will mainly focus on introducing self-supervised learning within the framework of graph neural networks and highlighting/summarizing these recent advancements.

3 APPLYING SSL TO GNNS: CATEGORIZING TRAINING STRATEGIES, LOSS FUNCTIONS AND PRETEXT TASKS
When seeking to apply self-supervised learning to GNNs, the major decisions to be made are how to construct the pretext tasks, which includes what information to leverage from the unlabeled data, what loss function to use, and what training strategy to use for effectively improving the GNN's performance. Hence, in this section we will first mathematically formalize the graph neural network with self-supervised learning and then discuss each of the above. More specifically, we will introduce three training strategies and three loss functions that are frequently employed in the current literature, and categorize current state-of-the-art pretext tasks for GNNs based on the type of information they leverage for constructing the pretext task.

Given an undirected attributed graph 𝒢 = {𝒱, ℰ, X}, where 𝒱 = {v_1, ..., v_{|𝒱|}} represents the vertex set with |𝒱| vertices, ℰ represents the edge set and e_{ij} = (v_i, v_j) is an edge between nodes v_i and v_j, X ∈ ℝ^{|𝒱|×d} represents the feature matrix and x_i = X[i, :]^⊤ ∈ ℝ^d is the d-dimensional feature vector of node v_i. A ∈ ℝ^{|𝒱|×|𝒱|} is the adjacency matrix, where A_{ij} = 1 if e_{ij} ∈ ℰ and A_{ij} = 0 if e_{ij} ∉ ℰ. We denote any GNN-based feature extractor as f_θ : ℝ^{|𝒱|×d} × ℝ^{|𝒱|×|𝒱|} → ℝ^{|𝒱|×d'}, parametrized by θ, which takes any node feature matrix X and the graph adjacency matrix A and outputs the d'-dimensional representation of each node, Z_GNN = f_θ(X, A) ∈ ℝ^{|𝒱|×d'}, which can be further fed into any permutation invariant function READOUT : ℝ^{|𝒱|×d'} → ℝ^{d'} to obtain the graph embedding z_{GNN,𝒢} = READOUT(f_θ(X, A)) ∈ ℝ^{d'}.
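To make this notation concrete, here is a minimal sketch of a GNN-based extractor f_θ together with a permutation-invariant READOUT (assuming PyTorch; the single GCN-style propagation layer and the mean-pooling readout are illustrative choices rather than an architecture prescribed by the chapter).

```python
import torch
import torch.nn as nn

class SimpleGNNExtractor(nn.Module):
    """A minimal f_theta: takes X in R^{|V| x d} and A in R^{|V| x |V|},
    and returns node embeddings Z_GNN in R^{|V| x d'}."""
    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.weight = nn.Linear(d, d_prime)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalize A with self-loops (GCN-style propagation).
        A_hat = A + torch.eye(A.size(0))
        deg_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = deg_inv_sqrt.unsqueeze(1) * A_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(A_norm @ self.weight(X))

def readout(Z_gnn: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant READOUT: mean over nodes gives z_{GNN, G}."""
    return Z_gnn.mean(dim=0)

# Toy usage: 5 nodes with 4-dimensional features and a random undirected graph.
X = torch.randn(5, 4)
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.t()) > 0).float()           # symmetrize to model an undirected graph
f_theta = SimpleGNNExtractor(d=4, d_prime=8)
Z_gnn = f_theta(X, A)                    # node embeddings, shape (5, 8)
z_graph = readout(Z_gnn)                 # graph embedding, shape (8,)
```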
More specifically, we note that here θ represents the parameters encoded in the corresponding network architecture of the GNN [19, 58, 65, 66]. Considering the transductive semi-supervised setting, where we are provided with the labeled node set 𝒱_l ⊂ 𝒱, the labeled graph 𝒢, the associated node label matrix Y_sup ∈ ℝ^{|𝒱_l|×l}, and the graph label y_{sup,𝒢} ∈ ℝ^l with label dimension l, we aim to classify nodes and graphs. The node and graph representations output by GNNs are first processed by the extra adaptation layer h_{θ_sup}, parametrized by the supervised adaptation parameters θ_sup, to obtain the predicted l-dimensional node labels Z_sup ∈ ℝ^{|𝒱|×l} and graph label z_{sup,𝒢} ∈ ℝ^l by Eq. (1)-(2). Then the model parameters θ in the GNN-based extractor f_θ and the parameters θ_sup in the adaptation layer h_{θ_sup} are learned by optimizing the supervised loss calculated between the output/predicted label and the true label for labeled nodes and the labeled graph, which can be formulated as:

Z_sup = h_{θ_sup}(f_θ(X, A))   (1)

… integrated with semi-supervised learning.

The model's capability of extracting features for completing pretext and downstream tasks is improved through optimizing the model parameters θ, θ_ssl, and θ_sup, where θ_ssl denotes the parameters of the adaptation layer for the pretext task. Inspired by relevant discussions [24, 28, 55, 69, 70], we summarize three possible training strategies that are popular in the literature for training GNNs in the self-supervised setting: self-training, pre-training with fine-tuning, and joint training.

3.1.1 Self-training. Self-training is a strategy that leverages supervision information generated by the model itself during the training process [37, 51]. A typical self-training pipeline begins by first training the model over the labeled data, then generating pseudo labels for unlabeled samples that have highly confident predictions, and including them in the labeled data for the next round of training. In this way, the pretext task is the same as the downstream task, utilizing the pseudo labels of some of the originally unlabeled data. A detailed overview is presented in Fig. 1, where the prediction results are re-utilized to augment the training data in the next iteration as done in [55].
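A minimal sketch of such a self-training loop is shown below (assuming PyTorch; `model` is assumed to be the composition of a GNN extractor and an adaptation layer that outputs class logits, and the confidence threshold and number of rounds are illustrative hyperparameters rather than values prescribed by [55]).

```python
import torch
import torch.nn.functional as F

def self_training(model, X, A, y, labeled_mask, rounds=3, threshold=0.9, epochs=100):
    """Repeatedly train on the labeled set, then promote highly confident
    predictions on unlabeled nodes to pseudo-labels for the next round."""
    labeled_mask = labeled_mask.clone()
    y = y.clone()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(rounds):
        for _ in range(epochs):                      # supervised training phase
            optimizer.zero_grad()
            logits = model(X, A)                     # (X, A) -> class logits per node
            loss = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
            loss.backward()
            optimizer.step()
        with torch.no_grad():                        # pseudo-labeling phase
            probs = F.softmax(model(X, A), dim=1)
            conf, pred = probs.max(dim=1)
            new = (~labeled_mask) & (conf > threshold)
            y[new] = pred[new]                       # adopt confident predictions
            labeled_mask |= new                      # grow the "labeled" set
    return model
```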
… where θ_ssl denotes the parameters of the adaptation layer h_{θ_ssl} for the pretext tasks, ℓ_ssl is the self-supervised loss function for each example, and ℒ_ssl is the total loss function of completing the self-supervised task. In node pretext tasks, z_{ssl,i} = Z_ssl[i, :]^⊤ and y_{ssl,i} = Y_ssl[i, :]^⊤, which are the self-supervised predicted and true label(s) for node v_i, respectively. In graph pretext tasks, z_{ssl,𝒢} and y_{ssl,𝒢} are the self-supervised predicted and true label(s) for the graph 𝒢, respectively. Then, in the fine-tuning process, the feature extractor f_θ is trained by completing downstream tasks in Eq. (1)-(3) with the pre-trained θ* as the initialization. Note that to utilize the pre-trained node/graph representations, the fine-tuning process can also be replaced by training a linear classifier (e.g., Logistic Regression [49, 57, 69, 75]).

Figure 2: An overview of GNNs with SSL using pre-training and fine-tuning.
Figure 3: An overview of GNNs with SSL using joint training.

3.1.3 Joint Training. Another natural idea to harness self-supervised learning for graph neural networks is to combine the losses of completing pretext task(s) and downstream task(s) and jointly train the model. An overview of joint training is shown in Fig. 3.

Joint training consists of two components: feature extraction by a GNN and adaptation processes for both the pretext and downstream tasks. In the feature extraction process, a GNN takes the graph adjacency matrix A and the feature matrix X as input and outputs the node embeddings Z_GNN and/or the graph embedding z_{GNN,𝒢}. In the adaptation procedure, the extracted node and graph embeddings are further transformed to complete pretext and downstream tasks via h_{θ_ssl} and h_{θ_sup}, respectively. We then jointly optimize the pretext and downstream task losses as:

Z_sup = h_{θ_sup}(f_θ(X, A)),   Z_ssl = h_{θ_ssl}(f_θ(X, A)),   (7)
z_{sup,𝒢} = h_{θ_sup}(READOUT(f_θ(X, A))),   (8)
z_{ssl,𝒢} = h_{θ_ssl}(READOUT(f_θ(X, A))),   (9)

θ*, θ*_sup, θ*_ssl = arg min_{θ, θ_sup, θ_ssl} (1/|𝒱|) ∑_{v_i∈𝒱} ( α_1 ℓ_sup(z_{sup,i}, y_{sup,i}) + α_2 ℓ_ssl(z_{ssl,i}, y_{ssl,i}) )   (node pretext tasks), or
θ*, θ*_sup, θ*_ssl = arg min_{θ, θ_sup, θ_ssl} α_1 ℓ_sup(z_{sup,𝒢}, y_{sup,𝒢}) + α_2 ℓ_ssl(z_{ssl,𝒢}, y_{ssl,𝒢})   (graph pretext tasks),   (10)

where α_1, α_2 ∈ ℝ > 0 are the weights for combining the supervised loss ℓ_sup and the self-supervised loss ℓ_ssl.

3.2 Loss Functions
A loss function is used to evaluate how well the algorithm models the data. Generally, in GNNs with self-supervised learning, the loss function for the pretext task takes one of three forms: classification loss, regression loss, and contrastive learning loss. Note that the loss functions we discuss here are only for the pretext tasks rather than the downstream tasks.

3.2.1 Classification and Regression Loss. In completing classification-based pretext tasks, such as node clustering where the node embeddings are expected to encode the cluster assignment information, the objective of the pretext task is to minimize the following loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_{ssl,i}) = −(1/|𝒱|) ∑_{v_i∈𝒱} ∑_{j=1}^{L} 1(y_{ssl,ij} = 1) log(z̃_{ssl,ij})   (node pretext tasks), or
ℒ_ssl = ℓ_CE(z_{ssl,𝒢}, y_{ssl,𝒢}) = −∑_{j=1}^{L} 1(y_{ssl,𝒢j} = 1) log(z̃_{ssl,𝒢j})   (graph pretext tasks),   (11)

where ℓ_CE indicates the cross-entropy function, z_{ssl,i} and z_{ssl,𝒢} represent the predicted label distributions of node v_i and graph 𝒢 for the pretext task, and their corresponding class probability distributions z̃_{ssl,i} and z̃_{ssl,𝒢} are calculated by softmax normalization, respectively. For example, z̃_{ssl,ij} is the probability of node v_i belonging to class j. Since every node v_i has its own pseudo label (i.e., y_{ssl,i}) in completing pretext tasks, we can consider all the nodes 𝒱 in the graph, compared to only the labeled set of nodes 𝒱_l as before in downstream tasks.

In completing regression-based pretext tasks, such as feature completion, the mean squared error loss is typically used as the loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(z_{ssl,i}, y_{ssl,i}) = (1/|𝒱|) ∑_{v_i∈𝒱} ||z_{ssl,i} − y_{ssl,i}||^2   (node pretext tasks), or
ℒ_ssl = ℓ_MSE(z_{ssl,𝒢}, y_{ssl,𝒢}) = ||z_{ssl,𝒢} − y_{ssl,𝒢}||^2   (graph pretext tasks).   (12)
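The sketch below illustrates the joint objective of Eq. (10) with a cross-entropy downstream loss and a regression-style pretext loss as in Eq. (11)-(12) (assuming PyTorch; the two heads, the weights alpha1/alpha2, and the pretext target y_ssl are placeholders to be supplied by a concrete pretext task).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSSLModel(nn.Module):
    """Shared GNN extractor f_theta with two adaptation heads:
    h_sup for downstream labels and h_ssl for pretext targets."""
    def __init__(self, extractor, d_prime, n_classes, ssl_dim):
        super().__init__()
        self.extractor = extractor                  # any f_theta(X, A) -> Z_GNN
        self.h_sup = nn.Linear(d_prime, n_classes)  # adaptation layer h_{theta_sup}
        self.h_ssl = nn.Linear(d_prime, ssl_dim)    # adaptation layer h_{theta_ssl}

    def forward(self, X, A):
        Z = self.extractor(X, A)
        return self.h_sup(Z), self.h_ssl(Z)

def joint_loss(model, X, A, y_sup, labeled_mask, y_ssl, alpha1=1.0, alpha2=0.5):
    z_sup, z_ssl = model(X, A)
    sup = F.cross_entropy(z_sup[labeled_mask], y_sup[labeled_mask])  # downstream, labeled nodes only
    ssl = F.mse_loss(z_ssl, y_ssl)                                   # regression pretext over all nodes
    return alpha1 * sup + alpha2 * ssl
```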
… centrality score s estimates its rank score by S_v^s = D_rank^s(z_{GNN,v}). The probability of the estimated rank order is defined by the sigmoid function R̃_{u,v}^s = exp(S_u^s − S_v^s) / (1 + exp(S_u^s − S_v^s)). Then predicting the relative order between pairs of nodes can be formalized as a binary classification problem with the loss:

ℒ_ssl = −∑_s ∑_{u,v∈𝒱} ( R_{u,v}^s log R̃_{u,v}^s + (1 − R_{u,v}^s) log(1 − R̃_{u,v}^s) ).   (16)

Different from peer works, [24] does not consider any node features but instead extracts the node features directly from the graph topology, which includes: (1) degree, which defines the local importance of a node; (2) core number, which defines the connectivity of the subgraph around a node; (3) collective influence, which defines the neighborhood importance of a node; and (4) local clustering coefficient, which defines the connectivity of the 1-hop neighborhood of a node. Then, the four features (after min-max normalization) are concatenated with a nonlinear transformation and fed into the GNN, where [24] uses the pretext tasks of centrality ranking, clustering recovery and edge prediction. Another innovative idea in [24] is to choose a fix-tune boundary in the middle layers of the GNN. The GNN blocks below this boundary are fixed, while the ones above the boundary are fine-tuned. For downstream tasks that are closely related to the pre-trained tasks, a higher boundary is used.

Another important node-level structure property is the partition each node belongs to after performing a graph partitioning method. In [70], the pretext task is to train the GNN to encode the node partition information. Graph partitioning divides the nodes of a graph into different groups such that the number of edges between groups is minimized. Given the node set 𝒱, the edge set ℰ, and a preset number of partitions p ∈ [1, |𝒱|], a graph partitioning algorithm (e.g., [30] as used in [70]) will output a set of node sets {𝒱_par_1, ..., 𝒱_par_p | 𝒱_par_i ⊂ 𝒱, i = 1, ..., p}. Then the classification loss is set exactly the same as:

ℒ_ssl = −(1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_{ssl,i}),   (17)
where z_{ssl,i} denotes the embedding of node v_i, and the partitioning label is assumed to be a one-hot encoding y_{ssl,i} ∈ ℝ^p whose k-th entry is 1 and all other entries are 0 if v_i ∈ 𝒱_par_k, i = 1, ..., |𝒱|, ∃k ∈ [1, p].

… where f_w is a function mapping the difference between two node embeddings from GNNs to a scalar representing the similarity between them.

¹ Additional summary details and the corresponding code links for these methods can be found at https://github.com/NDS-VU/GNN-SSL-Chapter.
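A minimal sketch of this partition-prediction pretext loss, Eq. (17), is given below (assuming PyTorch; the partition assignments are taken as precomputed input, e.g., from a METIS-style partitioner [30], and `model` is assumed to be the GNN extractor followed by the h_{θ_ssl} head with p output logits).

```python
import torch
import torch.nn.functional as F

def partition_pretext_loss(model, X, A, partition_id):
    """partition_id[i] in {0, ..., p-1} is the group that node v_i was assigned to
    by an external graph-partitioning algorithm; the pretext task trains the GNN
    head to predict this assignment, matching Eq. (17)."""
    logits = model(X, A)                   # h_{theta_ssl}(f_theta(X, A)), shape (|V|, p)
    return F.cross_entropy(logits, partition_id)

# Toy usage with an arbitrary 2-way partition of 6 nodes.
partition_id = torch.tensor([0, 0, 0, 1, 1, 1])
# loss = partition_pretext_loss(model, X, A, partition_id); loss.backward()
```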
… appearing in influential nodes are seen as important and so are masked with lower probability.

The observation made in [4] that nodes with a further topological distance to the labeled nodes are more likely to be misclassified indicates an uneven distribution of the ability of GNNs to embed node features across the whole graph. However, existing graph contrastive learning methods ignore this uneven distribution, which motivates Chen et al. [4] to propose the distance-wise graph contrastive learning (DwGCL) method that can adaptively augment the graph topology, sample the positive and negative pairs, and maximize the mutual information. The topology information gain (TIG) is calculated based on Group PageRank and node features to describe the task information effectiveness that a node obtains from the labeled nodes along the graph topology. By ranking the performance of GNNs on nodes according to their TIG values with/without contrastive learning, it is found that contrastive learning mainly improves the performance on nodes that are topologically far away from the labeled nodes. Based on the above finding, Chen et al. [4] propose to: 1) perturb the graph topology by augmenting nodes according to their TIG value; 2) sample the positive and negative pairs considering local/global topology distance and node embedding distance; and 3) assign different weights to nodes in the self-supervised loss based on their TIG rankings. Results demonstrate the performance improvement of this distance-wise graph contrastive learning over the typical contrastive learning approach.

Another special source of supervision information to exploit is the prediction results of the model itself. Sun et al. [55] leverage a multi-stage training framework to utilize the information of the pseudo labels generated by predictions in the next rounds of training. The multi-stage training algorithm repeatedly adds the most confident predictions of each class to the label set and re-utilizes these pseudo-labeled data to train the GNNs. Furthermore, a self-checking mechanism based on DeepCluster [2] is proposed to guarantee the precision of the labeled data. Assume that the cluster assignment for node v_i is c_i ∈ {0, 1}^p (here the number of clusters is assumed to equal the number of predefined classes p in the downstream classification task) and the centroid matrix C ∈ ℝ^{d'×p} represents … data. However, directly checking in this naïve way is very time consuming.

5 GRAPH-LEVEL SSL PRETEXT TASKS
Having just presented the node-level SSL pretext tasks, in this section we focus on the graph-level SSL pretext tasks, where we desire the node embeddings coming from the GNNs to encode information about graph-level properties.

5.1 Structure-based Pretext Tasks
As the counterpart of the nodes in the graph, the edges encode abundant information about the graph, which can also be leveraged as extra supervision to design pretext tasks. The pretext task in [74] is to recover the graph topology, i.e., predict edges, after randomly removing edges from the graph. After the node embedding z_{GNN,i} is obtained for each node v_i, the probability of an edge between any pair of nodes v_i, v_j is calculated by their feature similarity as follows:

A'_{ij} = sigmoid(z_{GNN,i} (z_{GNN,j})^⊤),   (26)

and the weighted cross-entropy loss is used during training, which is defined as:

ℒ_ssl = −∑_{v_i,v_j∈𝒱} ( W (A_{ij} log A'_{ij}) + (1 − A_{ij}) log(1 − A'_{ij}) ),   (27)

where W is the weight hyperparameter used for balancing the two classes, which are node pairs having an edge and node pairs without an edge between them.
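A sketch of this edge-prediction loss, Eq. (26)-(27), is given below (assuming PyTorch; the dense adjacency format, the summation over all pairs, and the value of the class weight W are illustrative simplifications of [74]).

```python
import torch

def edge_reconstruction_loss(Z_gnn: torch.Tensor, A: torch.Tensor, W: float = 5.0):
    """Score every node pair by sigmoid(z_i z_j^T) and apply a weighted binary
    cross-entropy against the original adjacency A (the embeddings Z_gnn would
    come from the graph with some edges randomly removed)."""
    A_prob = torch.sigmoid(Z_gnn @ Z_gnn.t())              # predicted edge probabilities
    eps = 1e-8
    pos = W * A * torch.log(A_prob + eps)                  # existing edges, up-weighted by W
    neg = (1.0 - A) * torch.log(1.0 - A_prob + eps)        # non-edges
    return -(pos + neg).sum()
```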
Since it is known that an unclean graph structure usually impedes the applicability of GNNs [8, 26], a method that trains the GNNs on downstream supervised tasks based on a cleaned graph structure, reconstructed from completing a self-supervised pretext task, is introduced in [13]. The self-supervised pretext task aims to train a separate GNN to denoise the corrupted node features X̂, generated either by randomly zeroing some dimensions of the original node features X when the features are binary or by adding independent Gaussian noise when X is continuous. Two methods are used to generate the initial graph adjacency matrix Ã. The first method, Full …
The parameters in FP and MLP-kNN used for generating the initial adjacency matrix à are optimized by:

ℒ_ssl = (1/|𝒱_m|) ∑_{v_i∈𝒱_m} ℓ_MSE(x_i, ẑ_i),   (29)

where ẑ_i = Ẑ[i, :]^⊤ is the noisy embedding vector of node v_i obtained by the separate GNN-based encoder. The optimized parameters in FP and MLP-kNN lead to the generation of a cleaner graph adjacency matrix, which in turn results in better performance on the downstream tasks.

In addition to the graph edges and the adjacency matrix, the topological distance between nodes is another important global structure property of a graph. The pretext task in Peng et al. [49] is to recover the topological distance between nodes. More specifically, they leverage the shortest path, denoted as p_{ij}, between nodes v_i and v_j, but this could be replaced with any other distance measure. Then, they define the set 𝒞_i^k as all the nodes having a shortest path of length k from node v_i. More formally, this is defined as:

𝒞_i = 𝒞_i^1 ∪ 𝒞_i^2 ∪ ··· ∪ 𝒞_i^{δ_i},   𝒞_i^k = {v_j | d_{ij} = k},   k = 1, 2, ···, δ_i,   (30)

where δ_i is the upper bound of the hop count from other nodes to v_i, d_{ij} is the length of the path p_{ij}, and 𝒞_i is the union of all the k-hop shortest path neighbor sets 𝒞_i^k. Based on these sets, one-hot encodings d_{ij} ∈ ℝ^{δ_i} are created for pairs of nodes v_i, v_j, where v_j ∈ 𝒞_i, according to their distance d_{ij}. Then, the GNN model is guided to extract node embeddings that encode the node topological distance as follows:

ℒ_ssl = ∑_{v_i∈𝒱} ∑_{v_j∈𝒞_i} ℓ_CE(f_w(|z_{GNN,i} − z_{GNN,j}|), d_{ij}),   (31)

where f_w is a function mapping the difference between two node embeddings to the probabilities of the pair of nodes belonging to the corresponding category of topological distance. Since the number of categories depends on the upper bound of the hop count (topological distance), but precisely determining this upper bound is time-consuming for a big graph, it is assumed that the number of hops (distance) is under control based on the small-world phenomenon [42], and it is further divided into several major categories that clearly discriminate dissimilarity while partly tolerating similarity. Experiments demonstrate that dividing the topological distance into four categories, 𝒞_i^1, 𝒞_i^2, 𝒞_i^3, 𝒞_i^k (k ≥ 4), achieves the best performance (i.e., δ_i = 4). Another problem is that the number of nodes close to the focal node v_i is much smaller than the number of nodes further away (i.e., the magnitude of 𝒞_i^{δ_i} will be significantly larger than that of the other sets). To circumvent this imbalance problem, node pairs are sampled with an adaptive ratio.
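A minimal sketch of this distance-prediction pretext task is given below (assuming PyTorch; distances are computed with a plain BFS, the four-way bucketing follows the categories described above, and f_w is modeled as a small linear classifier, which is an illustrative choice rather than the exact setup of [49]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

def bfs_hops(A: torch.Tensor, source: int):
    """Unweighted shortest-path lengths from `source` (-1 if unreachable)."""
    n = A.size(0)
    dist = [-1] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u, v] > 0 and dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_pretext_loss(Z_gnn: torch.Tensor, A: torch.Tensor, f_w: nn.Module):
    """Classify |z_i - z_j| into the distance buckets {1, 2, 3, >=4}, Eq. (31)."""
    pairs, labels = [], []
    for i in range(A.size(0)):
        for j, d in enumerate(bfs_hops(A, i)):
            if d >= 1:                                # skip self and unreachable pairs
                pairs.append(torch.abs(Z_gnn[i] - Z_gnn[j]))
                labels.append(min(d, 4) - 1)          # bucket all distances >= 4 together
    logits = f_w(torch.stack(pairs))                  # f_w maps the difference to 4 logits
    return F.cross_entropy(logits, torch.tensor(labels))

# f_w = nn.Linear(d_prime, 4)   # hypothetical classifier head over the 4 categories
```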
Network motifs are recurrent and statistically significant subgraphs of a larger graph, and Zhang et al. [72] design a pretext task to train a GNN encoder that can automatically extract graph motifs. The learned motifs are further leveraged to generate informative subgraphs used in graph-subgraph contrastive learning. First, a GNN-based encoder f_θ and an m-slot embedding table {m_1, ..., m_m}, denoting the m cluster centers of m motifs, are initialized. Then, a node affinity matrix U ∈ ℝ^{|𝒱|×|𝒱|} is calculated by softmax normalization of the embedding similarity 𝒟(z_{GNN,i}, z_{GNN,j}) between nodes i, j, as in Eq. (14). Afterwards, spectral clustering [61] is performed on U to generate different groups, within which the n_𝒢 connected components that have more than three nodes are collected as the sampled subgraphs from the graph 𝒢, and their embeddings are calculated by applying the READOUT function. For each subgraph, its cosine similarity to each of the m motifs is calculated to obtain a similarity matrix S ∈ ℝ^{m×n_𝒢}. To produce semantically meaningful subgraphs that are close to motifs, the top 10% most similar subgraphs to each motif are selected based on the similarity matrix S and are collected into a set 𝒢_top. The affinity values in U between pairs of nodes in each of these subgraphs are increased by optimizing the loss:

ℒ_1 = −(1/|𝒢_top|) ∑_{i=1}^{|𝒢_top|} ∑_{(v_j,v_k)∈𝒢_i^top} U[j, k].   (32)

The optimization of the above loss forces nodes in motif-like subgraphs to be more likely to be grouped together in spectral clustering, which leads to more subgraph samples aligned with the motifs. Next, the embedding table of motifs is optimized based on the sampled subgraphs. The assignment matrix Q ∈ ℝ^{m×n_𝒢} is found by maximizing the similarities between embeddings and their assigned motifs:

max_Q Tr(Q^⊤ S) − (1/λ) ∑_{i,j} Q[i, j] log Q[i, j],   (33)

where the second term, controlled by the hyperparameter λ, is to avoid all representations collapsing into a single cluster center. After the cluster assignment matrix Q is obtained, the GNN-based encoder and the motif embedding table are trained, which is equivalent to a supervised m-class classification problem with labels Q and the prediction distribution S̃ obtained by applying a column-wise softmax normalization with temperature τ:

ℒ_2 = −(1/n_𝒢) ∑_{i=1}^{n_𝒢} ℓ_CE(q_i, s̃_i),   (34)

where q_i = Q[:, i] and s̃_i = S̃[:, i] denote the assignment distribution and the predicted distribution for subgraph i, respectively. Optimizing Eq. (34) jointly enhances the ability of the GNN encoder to extract subgraphs that are similar to motifs and improves the embeddings of the motifs. The last step is to train the GNN-based encoder by a classification task where subgraphs are reassigned back to their corresponding graphs. Note that the subgraphs are generated by the motif-guided extractor, and so are more likely to capture higher-level semantic information compared with randomly sampled subgraphs. The whole framework is trained jointly with a weighted combination of ℒ_1, ℒ_2 and the contrastive loss.

Aside from network motifs, other subgraph structures can be leveraged to provide extra supervision in designing pretext tasks. In [50], an r-ego network for a certain vertex is defined as the subgraph induced by the nodes whose shortest path to that vertex is shorter than r. Then a random walk with restart is initiated at the ego vertex v_i, and the subgraph induced by the nodes visited during the random walk starting at v_i is used as an augmented version of the r-ego network. First, two augmented r-ego networks centered around vertex v_i are obtained by performing the random walk twice (i.e., 𝒢_i and 𝒢_i^+), which are defined as a positive pair
since they come from the same r-ego network. In comparison, a negative pair corresponds to two subgraphs augmented from different r-ego networks (e.g., one coming from v_i and another coming from v_j, resulting in random-walk-induced subgraphs 𝒢_i and 𝒢_j, respectively). Based on the above defined positive and negative subgraph pairs, a contrastive loss is set up to optimize the GNNs as follows:

ℒ_ssl = (1/|𝒫^+|) ∑_{(𝒢_i,𝒢_i^+)∈𝒫^+} ℓ_NT-Xent(Z^1_ssl, Z^2_ssl, 𝒫^−),   (35)

where Z^1_ssl, Z^2_ssl denote the GNN-based graph embeddings, and specifically here the two different views are the same, Z^1_ssl = Z^2_ssl. 𝒫^+ contains positive pairs of subgraphs (𝒢_i, 𝒢_i^+) sampled by random walks starting at the same ego vertex v_i in the same graph, while 𝒫^− = ∪_{(𝒢_i,𝒢_i^+)∈𝒫^+} 𝒫^−_{𝒢_i} represents all sets of negative samples. Specifically, 𝒫^−_{𝒢_i} represents subgraphs sampled by random walks starting either at a different ego vertex from v_i in 𝒢 or directly in graphs different from 𝒢.

Although the Graph Attention Network (GAT) [58] achieves performance improvements over the original GCN [33], there is little understanding of what graph attention learns. To this end, Kim and Oh [32] propose a specific pretext task that leverages the edge information to supervise what graph attention learns:

ℒ_ssl = (1/|ℰ ∪ ℰ^−|) ∑_{(j,i)∈ℰ∪ℰ^−} ( 1((j, i) ∈ ℰ) · log χ_{ij} + 1((j, i) ∈ ℰ^−) · log(1 − χ_{ij}) ),   (36)-(37)

where ℰ is the set of edges, ℰ^− is a sampled set of node pairs without edges, and χ_{ij} is the edge probability between nodes i, j calculated from their embeddings. Based on the two primary edge attentions, the GAT attention (shortly, GO) [58] and the dot-product attention (shortly, DP) [38], two advanced attention mechanisms, SuperGAT_SD (Scaled Dot-product, shortly SD) and SuperGAT_MX (Mixed GO and DP, shortly MX), are proposed:

e_{ij,SD} = e_{ij,DP} / √F,   χ_{ij,SD} = σ(e_{ij,SD}),   (38)
e_{ij,MX} = e_{ij,GO} · σ(e_{ij,DP}),   χ_{ij,MX} = σ(e_{ij,DP}),   (39)

where σ denotes the sigmoid function taking the edge weight e_{ij} and calculating the edge probability χ_{ij}. SuperGAT_SD divides the dot-product of edge e_{ij,DP} by the square root of the dimension, as in the Transformer [56], to prevent some large values from dominating the entire attention after the softmax. SuperGAT_MX multiplies the GO and DP attention with the sigmoid, which is motivated by the gating mechanism of Gated Recurrent Units (GRUs) [7]. Since the DP attention with the sigmoid denotes the edge probability, multiplying by σ(e_{ij,DP}) in calculating e_{ij,MX} can softly drop neighbors that are not likely to be linked while implicitly assigning importance to the remaining nodes. e_{ij,DP} and e_{ij,GO} are the weights of edge (i, j) used to calculate the DP and GO attention. The results disclose several insightful discoveries, including that the GO attention learns label-agreement better than DP, whereas DP predicts edge presence better than GO, and that the performance of the attention mechanism is not fixed but depends on the homophily and average degree of the specific graph.

The topological information can also be generated manually for designing pretext tasks. Gao et al. [15] propose to encode the transformation information between two different graph topologies in the representations of nodes obtained by GNNs. First, they transform the original graph adjacency matrix A into  by randomly adding or removing edges from the original edge set. Then, by feeding the original and transformed graph topology along with the node feature matrix into any GNN-based encoder, the feature representations Z_GNN, Ẑ_GNN before and after the topology transformation are calculated, and their difference ΔZ ∈ ℝ^{N×F'} is defined as:

ΔZ = Ẑ_GNN − Z_GNN = [Δz_{GNN,1}, ..., Δz_{GNN,N}]^⊤ = [ẑ_{GNN,1} − z_{GNN,1}, ..., ẑ_{GNN,N} − z_{GNN,N}]^⊤.   (40)-(41)

Next they predict the topology transformation between nodes v_i and v_j from the node-wise feature difference ΔZ by constructing the edge representation as:

e_{ij} = exp(−(Δz_i − Δz_j) ⊙ (Δz_i − Δz_j)) / ||exp(−(Δz_i − Δz_j) ⊙ (Δz_i − Δz_j))||,   (42)

where ⊙ denotes the Hadamard product. This edge representation e_{ij} is then fed into an MLP for the prediction of the topological transformation, which includes four classes: edge addition, edge deletion, keeping disconnection, and keeping connection between each pair of nodes. Thus, the GNN-based encoder is trained by:

ℒ_ssl = (1/|𝒱|^2) ∑_{v_i,v_j∈𝒱} ℓ_CE(MLP(e_{ij}), t_{ij}),   (43)

where we denote the topological transformation category between nodes v_i and v_j as a one-hot encoding t_{ij} ∈ ℝ^4.

5.2 Feature-based Pretext Tasks
Typically, graphs do not come with any graph-level feature information, and here the graph-level features refer to the graph embeddings obtained after applying a pooling layer over all node embeddings from the GNNs.

GraphCL [69] designs the pretext task to first augment graphs by four different augmentations, including node dropping, edge perturbation, attribute masking and subgraph extraction, and then maximize the mutual information of the graph embeddings between different augmented views generated from the same original graph, while also minimizing the mutual information of the graph embeddings between augmented views generated from different graphs. The graph embeddings Z_ssl are obtained through any permutation-invariant READOUT function on the node embeddings, followed by an adaptation layer. Then the mutual information is maximized by optimizing the following NT-Xent contrastive loss:

ℒ_ssl = (1/|𝒫^+|) ∑_{(𝒢_i,𝒢_j)∈𝒫^+} ℓ_NT-Xent(Z^1_ssl, Z^2_ssl, 𝒫^−),   (44)

where Z^1_ssl, Z^2_ssl represent graph embeddings under two different views. A view could be the original view without any augmentation or one generated by applying the four different augmentations. 𝒫^+ contains positive pairs of graphs (𝒢_i, 𝒢_j) augmented from the same original graph, while 𝒫^− = ∪_{(𝒢_i,𝒢_j)∈𝒫^+} 𝒫^−_{𝒢_i} represents all sets of negative samples. Specifically, 𝒫^−_{𝒢_i} contains graphs augmented from graphs different from 𝒢_i.
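A compact sketch of the NT-Xent loss used in Eq. (44), computed over a batch of graph embeddings from two augmented views, is shown below (assuming PyTorch; using only in-batch negatives and a fixed temperature is a standard simplification rather than the exact GraphCL [69] implementation).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    """z1[k] and z2[k] are embeddings of two augmented views of graph k; all other
    graphs in the batch serve as negatives for graph k."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                   # cosine similarities between the two views
    targets = torch.arange(z1.size(0))        # the positive pair sits on the diagonal
    return F.cross_entropy(sim, targets)
```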
Numerical results demonstrate that the augmentation of edge perturbation benefits social networks but hurts biochemical molecules. Applying attribute masking achieves better performance on denser graphs. Node dropping and subgraph extraction are generally beneficial across all datasets.

5.3 Hybrid Pretext Tasks
One way to use the information of the training nodes in designing pretext tasks is developed in [22], where the concept of context is raised. The goal of this work is to pre-train a GNN so that it maps nodes appearing in similar graph structure contexts to nearby embeddings. For every node v_i, the r-hop neighborhood of v_i contains all nodes and edges that are at most r hops away from v_i in the graph. The context graph of v_i is the subgraph between r_1 hops and r_2 hops away from node v_i. It is required that r_1 < r so that some nodes are shared between the neighborhood and the context graph, and these are referred to as context anchor nodes. Examples of neighborhood and context graphs are shown in Fig. 6. Two GNN encoders are set up: the main GNN encoder obtains the node embedding z^r_{GNN,i} based on the node features of the r-hop neighborhood, and the context GNN obtains the node embeddings of the nodes in the context anchor node set, which are then averaged to get the node context embedding c_i. Then [22] uses negative sampling to jointly learn the main GNN and the context GNN. In the optimization process, positive samples refer to the situation where the center node of the context and the neighborhood graphs is the same, while negative samples refer to the situation where the center nodes of the context and the neighborhood graphs are different. The learning objective is a binary classification of whether a particular neighborhood and a particular context graph have the same center node, and the negative likelihood loss is used as follows:

ℒ_ssl = −(1/|𝒦|) ∑_{(v_i,v_j)∈𝒦} ( y_i log(σ((z^r_{GNN,i})^⊤ c_j)) + (1 − y_i) log(1 − σ((z^r_{GNN,i})^⊤ c_j)) ),   (45)-(46)

where y_i = 1 for a positive sample, where i = j, and y_i = 0 for a negative sample, where i ≠ j, with 𝒦 denoting the set of positive and negative pairs, and σ being the sigmoid function computing the probability.

Figure 6: An example of a context and r-neighborhood graph.

A similar idea employing the context concept in completing pretext tasks is also proposed in [28]. Specifically, the context here is defined as:

y_{ic} = ( |Γ_{𝒱_l}(v_i, c)| + |Γ_{𝒱_u}(v_i, c)| ) / ( |Γ_{𝒱_l}(v_i)| + |Γ_{𝒱_u}(v_i)| ),   c = 1, ..., l,   (47)

where 𝒱_u and 𝒱_l denote the unlabeled and labeled node sets, Γ_{𝒱_u}(v_i) denotes the unlabeled nodes that are adjacent to node v_i, Γ_{𝒱_u}(v_i, c) denotes the unlabeled nodes that have been assigned class c and are adjacent to node v_i, Γ_{𝒱_l}(v_i) denotes the labeled nodes that are adjacent to node v_i, and Γ_{𝒱_l}(v_i, c) denotes the labeled nodes that are adjacent to node v_i and of class c. To generate labels for the unlabeled nodes so as to calculate the context vector y_i for each node v_i, label propagation (LP) [77] or the iterative classification algorithm (ICA) [41] is used to construct pseudo labels for the unlabeled nodes in 𝒱_u. Then the pretext task is approached by optimizing the following loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_i).   (48)

The main issue of the above pretext task is the error caused by generating labels with LP or ICA. The paper [28] further proposes two methods to improve the above pretext task. The first method is to replace the procedure of assigning labels to unlabeled nodes based on only one method, such as LP or ICA, with assigning labels by ensembling the results from multiple different methods. The second method treats the initial labeling from LP or ICA as noisy labels and then leverages an iterative approach [20] to improve the context vectors, which leads to significant improvements based on this correction phase.

One previous pretext task is to recover the topological distance between nodes. However, calculating the shortest path distance for all pairs of nodes, even after sampling, is time-consuming. Therefore, Jin et al. [28] replace the pairwise distance between nodes with the distance between nodes and their corresponding clusters. For each cluster, a fixed set of anchor/center nodes is established, and for each node, its distance to this set of anchor nodes is calculated. The pretext task is to extract node features that encode the information of this node2cluster distance. Suppose k clusters are obtained by applying the METIS graph partitioning algorithm [31] and the node with the highest degree is assumed to be the center of the corresponding cluster; then each node v_i will have a cluster distance vector d_i ∈ ℝ^k, and the distance-to-cluster pretext task is completed by optimizing:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(z_{ssl,i}, d_i).   (49)

Aside from the graph topology and the node features, the distribution of the training nodes and their training labels is another valuable source of information for designing pretext tasks. One of the pretext tasks in [28] is to require the node embeddings output by GNNs to encode the information of the topological distance from any node to the training nodes. Assuming that the total number of classes is p, then for each class c ∈ {1, ..., p} and node v_i ∈ 𝒱, the average, minimum and maximum shortest path lengths from v_i to all labeled nodes in class c are calculated and denoted as d_i ∈ ℝ^{3p}; the objective is then to optimize the same regression loss as defined in Eq. (49).
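Both of these tasks reduce to regressing a per-node vector of distances; a minimal sketch is given below (assuming PyTorch and the bfs_hops helper sketched earlier; the anchor sets stand in for either the cluster centers or the labeled nodes of each class, and `model` is assumed to output a vector of matching dimension).

```python
import torch
import torch.nn.functional as F

def distance_targets(A: torch.Tensor, anchor_sets):
    """Build d_i: for each node, the average/min/max hop distance to each set of
    anchor nodes (cluster centers in the distance2cluster task, or the labeled
    nodes of each class in the distance-to-labeled-nodes task)."""
    n = A.size(0)
    hops = [bfs_hops(A, s) for s in range(n)]          # all-pairs unweighted distances
    targets = []
    for i in range(n):
        row = []
        for anchors in anchor_sets:
            ds = [hops[i][a] for a in anchors if hops[i][a] >= 0]
            ds = ds or [0]                             # fall back if no anchor is reachable
            row += [sum(ds) / len(ds), min(ds), max(ds)]
        targets.append(row)
    return torch.tensor(targets, dtype=torch.float)

def distance_regression_loss(model, X, A, anchor_sets):
    d = distance_targets(A, anchor_sets)               # shape (|V|, 3 * num_sets)
    return F.mse_loss(model(X, A), d)                  # Eq. (49)-style regression
```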
The generating process of networks encodes abundant information for designing pretext tasks. Hu et al. [23] propose the GPT-GNN framework for generative pre-training of GNNs. This framework performs attribute and edge generation to enable the pre-trained model to capture the inherent dependency between node attributes and graph structure. Assuming that the likelihood over this graph under the GNN model is p(𝒢; θ), which represents how the nodes in 𝒢 are attributed and connected, GPT-GNN aims to pre-train the GNN model by maximizing the graph likelihood, i.e., θ* = max_θ p(𝒢; θ). Given a permutated order, the log likelihood is factorized autoregressively, generating one node per iteration, as:

log p_θ(X, ℰ) = ∑_{i=1}^{|𝒱|} log p_θ(x_i, ℰ_i | X_{<i}, ℰ_{<i}).   (50)

For all nodes that are generated before node i, their attributes X_{<i} and the edges ℰ_{<i} between these nodes are used to generate a new node v_i, including both its attributes x_i and its connections with existing nodes ℰ_i. Instead of directly assuming that x_i and ℰ_i are independent, they devise a dependency-aware factorization mechanism to maintain the dependency between node attributes and edge existence. The generation process can be decomposed into two coupled parts: (1) generating node attributes given the observed edges, and (2) generating the remaining edges given the observed edges and the generated node attributes. For computing the loss of attribute generation, the generated node feature matrix X is corrupted by masking some dimensions to obtain the corrupted version X̂^Attr, which is fed together with the generated edges into GNNs to get the embeddings Ẑ^Attr_GNN. Then a decoder Dec^Attr(·) is specified, which takes Ẑ^Attr_GNN as input and outputs the predicted attributes Dec^Attr(Ẑ^Attr_GNN). The attribute generation loss is:

ℒ^Attr_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(Dec^Attr(ẑ^Attr_{GNN,i}), x_i),   (51)

where ẑ^Attr_{GNN,i} = Ẑ^Attr_GNN[i, :]^⊤ denotes the decoded embedding of node v_i. For computing the loss of edge reconstruction, the originally generated node feature matrix X is directly fed together with the generated edges into GNNs to get the embeddings Z^Edge_GNN. Then the contrastive NT-Xent loss is calculated:

ℒ^Edge_ssl = (1/|𝒫^+|) ∑_{(v_i,v_j)∈𝒫^+} ℓ_NT-Xent(Z^Edge_GNN, Z^Edge_GNN, 𝒫^−),   (52)

where 𝒫^+ contains positive pairs of connected nodes (v_i, v_j), while 𝒫^− = ∪_{(v_i,v_j)∈𝒫^+} 𝒫^−_{v_i} represents all sets of negative samples and 𝒫^−_{v_i} contains all nodes that are not directly linked with node v_i. Note that here the two views are set equal, i.e., Z^1 = Z^2 = Z^Edge_GNN.

6 NODE-GRAPH-LEVEL SSL PRETEXT TASKS
All the above pretext tasks are designed based on either node-level or graph-level supervision. However, there is another final line of research combining these two sources of supervision to design pretext tasks, which we summarize in this section.

Velickovic et al. [57] proposed to maximize the mutual information between representations of high-level graphs and low-level patches. In each iteration, a negative sample X̂, Â is generated by corrupting the graph through shuffling node features and removing edges. Then a GNN-based encoder is applied to extract the node representations Z_GNN and Ẑ_GNN, which are also named the local patch representations. The local patch representations are further fed into an injective readout function to get the global graph representation z_{GNN,𝒢} = READOUT(Z_GNN). Then the mutual information between Z_GNN and z_{GNN,𝒢} is maximized by optimizing the following objective:

ℒ_ssl = (1/(|𝒫^+| + |𝒫^−|)) ( ∑_{i=1}^{|𝒫^+|} 𝔼_{(X,A)}[log σ(z^⊤_{GNN,i} W z_{GNN,𝒢})] + ∑_{j=1}^{|𝒫^−|} 𝔼_{(X̂,Â)}[log(1 − σ(z̃^⊤_{GNN,j} W z_{GNN,𝒢}))] ),   (53)

where |𝒫^+| and |𝒫^−| are the numbers of positive and negative pairs, σ stands for any nonlinear activation function (PReLU is used in [57]), and z^⊤_{GNN,i} W z_{GNN,𝒢} calculates the weighted similarity between the patch representation centered at node v_i and the graph representation. A linear classifier is used afterwards to classify nodes following the above contrastive pretext task.

Similar to Velickovic et al. [57], where the mutual information between the patch representations and the graph representation is maximized, Hassani et al. [21] proposed another framework that contrasts the node representations of one view with the graph representation of another view. The first view is the original graph and the second view is generated by a graph diffusion matrix. The heat and personalized PageRank (PPR) diffusion matrices are considered, which are:

S^heat = exp(t A D^{−1} − t),   (54)
S^PPR = α ( I_n − (1 − β) D^{−1/2} A D^{−1/2} )^{−1},   (55)

where β denotes the teleport probability, t is the diffusion time, and D is the diagonal degree matrix. After the diffusion matrix is obtained, two different GNN encoders followed by a shared projection head are applied to the nodes of the original graph adjacency matrix and of the generated diffusion matrix to get two different node embeddings Z^1_GNN and Z^2_GNN. Two different graph embeddings z^1_{GNN,𝒢} and z^2_{GNN,𝒢} are further obtained by applying a graph pooling function to the node representations (before the projection head), followed by another shared projection head. The mutual information between nodes and graphs in different views is maximized through:

ℒ_ssl = −(1/|𝒱|) ∑_{v_i∈𝒱} ( MI(z^1_{GNN,i}, z^2_{GNN,𝒢}) + MI(z^2_{GNN,i}, z^1_{GNN,𝒢}) ),   (56)

where MI represents the mutual information estimator; four estimators are explored: the noise-contrastive estimator, the Jensen-Shannon estimator, the normalized temperature-scaled cross-entropy, and the Donsker-Varadhan representation of the KL-divergence. Note that the mutual information in Eq. (56) is averaged over all graphs in the original work [21]. Additionally, their results demonstrate that the Jensen-Shannon estimator achieves better results across all graph classification tasks, whereas in the node classification task, noise-contrastive estimation achieves better results. They also discover that increasing the number of views does not increase the performance on downstream tasks.
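A minimal sketch of the bilinear patch-graph scoring behind Eq. (53) is shown below (assuming PyTorch; the feature-shuffling corruption, the sigmoid-of-mean readout, and the binary cross-entropy form follow the DGI-style construction of [57], written here in a simplified, non-batched way).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchGraphDiscriminator(nn.Module):
    """Scores the agreement between node (patch) embeddings and the graph summary
    via the bilinear form z_i^T W z_G used in Eq. (53)."""
    def __init__(self, d_prime: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_prime, d_prime) * 0.01)

    def forward(self, Z_nodes, z_graph):
        return Z_nodes @ self.W @ z_graph               # one score per node

def dgi_loss(extractor, disc, X, A):
    Z_pos = extractor(X, A)                             # real patch representations
    X_corrupt = X[torch.randperm(X.size(0))]            # corruption: shuffle node features
    Z_neg = extractor(X_corrupt, A)
    z_graph = torch.sigmoid(Z_pos.mean(dim=0))          # summary of the real graph
    pos_scores = disc(Z_pos, z_graph)
    neg_scores = disc(Z_neg, z_graph)
    labels = torch.cat([torch.ones_like(pos_scores), torch.zeros_like(neg_scores)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos_scores, neg_scores]), labels)
```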
REFERENCES
[21] Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive Multi-View Representation Learning on Graphs. In International Conference on Machine Learning. PMLR, 4116–4126.
[22] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In International Conference on Learning Representations.
[23] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 1857–1867.
[24] Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, and Yizhou Sun. 2019. Pre-training graph neural networks for generic structural feature extraction. arXiv preprint arXiv:1905.13728 (2019).
[25] Dasol Hwang, Jinyoung Park, Sunyoung Kwon, KyungMin Kim, Jung-Woo Ha, and Hyunwoo J Kim. 2020. Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs. In Advances in Neural Information Processing Systems, Vol. 33. 10294–10305.
[26] Soobeom Jang, Seong-Eun Moon, and Jong-Seok Lee. 2019. Brain signal classification via learning connectivity structure. arXiv preprint arXiv:1905.11678 (2019).
[27] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning. In 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, and Xindong Wu (Eds.). IEEE, 222–231.
[28] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. 2020. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141 (2020).
[29] Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2021. Node Similarity Preserving Graph Convolutional Networks (WSDM '21). ACM, 148–156.
[30] George Karypis and Vipin Kumar. 1995. Multilevel graph partitioning schemes. In ICPP (3). 113–122.
[31] George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.
[32] Dongkwan Kim and Alice Oh. 2021. How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision. In International Conference on Learning Representations.
[33] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR.
[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
[35] Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. 2020. Contrastive representation learning: A framework and review. IEEE Access (2020).
[36] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 7871–7880.
[37] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[38] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. ACL, 1412–1421.
[39] Franco Manessi and Alessandro Rozza. 2020. Graph-Based Neural Network Models with Multiple Self-Supervised Auxiliary Tasks. arXiv preprint arXiv:2011.07267 (2020).
[40] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings.
[41] Jennifer Neville and David Jensen. 2000. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. 13–20.
[42] Mark Newman. 2018. Networks. Oxford University Press.
[43] Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 69–84.
[44] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Advances in Neural Information Processing Systems, Vol. 29.
[45] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[46] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. 2018. Adversarially Regularized Graph Autoencoder for Graph Embedding. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI. 2609–2615.
[47] Liam Paninski. 2003. Estimation of entropy and mutual information. Neural Computation 15, 6 (2003), 1191–1253.
[48] Jiwoong Park, Minsik Lee, Hyung Jin Chang, Kyuewang Lee, and Jin Young Choi. 2019. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6519–6528.
[49] Zhen Peng, Yixiang Dong, Minnan Luo, Xiao-Ming Wu, and Qinghua Zheng. 2020. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604 (2020).
[50] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1150–1160.
[51] Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence. 1044–1049.
[52] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data. Advances in Neural Information Processing Systems 33 (2020).
[53] Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
[54] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In 8th International Conference on Learning Representations, ICLR.
[55] Ke Sun, Zhouchen Lin, and Zhanxing Zhu. 2020. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5892–5899.
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 5998–6008.
[57] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In International Conference on Learning Representations (Poster).
[58] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
[59] Nicolas Vercheval, Hendrik De Bie, and Aleksandra Pizurica. 2020. Variational Auto-Encoders Without Graph Coarsening For Fine Mesh Learning. In IEEE International Conference on Image Processing, ICIP. IEEE, 2681–2685.
[60] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. 1096–1103.
[61] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[62] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. 2017. MGAE: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 889–898.
[63] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
[64] Yaochen Xie, Zhao Xu, Zhengyang Wang, and Shuiwang Ji. 2021. Self-Supervised Learning of Graph Neural Networks: A Unified Review. arXiv preprint arXiv:2102.10757 (2021).
[65] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[66] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning. PMLR, 5453–5462.
[67] Jiaxuan You, Jonathan Gomes-Selman, Rex Ying, and Jure Leskovec. 2021. Identity-aware Graph Neural Networks. arXiv preprint arXiv:2101.10320 (2021).
[68] Jiaxuan You, Zhitao Ying, and Jure Leskovec. 2020. Design Space for Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 33. 17009–17021.
[69] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems, Vol. 33. 5812–5823.
[70] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. 2020. When does self-supervision help graph convolutional networks?. In International Conference on Machine Learning. PMLR, 10871–10880.
[71] Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In European Conference on Computer Vision. Springer, 649–666.
[72] Shichang Zhang, Ziniu Hu, Arjun Subramonian, and Yizhou Sun. 2020. Motif-
Driven Contrastive Learning of Graph Representations. arXiv preprint
arXiv:2012.12533 (2020).
[73] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A
survey. IEEE Transactions on Knowledge and Data Engineering (2020).
[74] Qikui Zhu, Bo Du, and Pingkun Yan. 2020. Self-supervised training of graph
convolutional networks. arXiv preprint arXiv:2006.02380 (2020).
[75] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020.
Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131
(2020).
[76] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021.
Graph Contrastive Learning with Adaptive Augmentation. In Proceedings of The
Web Conference 2021 (WWW ’21). ACM, 12 pages.
[77] Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled
data with label propagation. (2002).
[78] Marinka Zitnik and Jure Leskovec. 2017. Predicting multicellular function through
multi-layer tissue networks. Bioinformatics 33, 14 (2017), i190–i198.
[79] Marinka Zitnik, Jure Leskovec, et al. 2018. Prioritizing network communities.
Nature communications 9, 1 (2018), 1–9.