Graph Neural Networks: Self-Supervised Learning

ABSTRACT
Although deep learning has achieved state-of-the-art performance across numerous domains, these models generally require large annotated datasets to reach their full potential and avoid overfitting. However, obtaining such datasets can have high associated costs or even be impossible to procure. Self-supervised learning (SSL) seeks to create and utilize specific pretext tasks on unlabeled data to aid in alleviating this fundamental limitation of deep learning models. Although initially applied in the image and text domains, recent interest has been in leveraging SSL in the graph domain to improve the performance of graph neural networks (GNNs). For node-level tasks, GNNs can inherently incorporate unlabeled node data through neighborhood aggregation, unlike in the image or text domains; but they can still benefit from applying novel pretext tasks to encode richer information, and numerous such methods have recently been developed. For GNNs solving graph-level tasks, applying SSL methods is more aligned with other traditional domains, but still presents unique challenges and has been the focus of a few works. In this chapter, we summarize recent developments in applying SSL to GNNs, categorizing them via the different training strategies and types of data used to construct their pretext tasks, and finally discuss open challenges for future directions.

1 INTRODUCTION
Recent years have witnessed the great success of applying deep learning in numerous fields. However, the superior performance of deep learning heavily depends on the quality of the supervision provided by the labeled data, and collecting a large amount of high-quality labeled data tends to be time-intensive and resource-expensive [22, 79]. Therefore, to alleviate the demand for massive labeled data and provide sufficient supervision, self-supervised learning (SSL) has been introduced. Specifically, SSL designs domain-specific pretext tasks that leverage extra supervision from unlabeled data to train deep learning models and learn better representations for downstream tasks. In computer vision, various pretext tasks have been studied, e.g., predicting relative locations of image patches [43] and identifying augmented images generated from image processing techniques such as cropping, rotating and resizing [53]. In natural language processing, self-supervised learning has also been heavily utilized, e.g., predicting the masked word in BERT [9].

Simultaneously, graph representation learning has emerged as a powerful strategy for analyzing graph-structured data over the past few years [18]. As the generalization of deep learning to the graph domain, Graph Neural Networks (GNNs) have become one promising paradigm due to their efficiency and strong performance in real-world applications [67, 78]. However, the vanilla GNN model (i.e., the Graph Convolutional Network [33]) and even more advanced existing GNNs [19, 58, 65, 66] are mostly established in a semi-supervised or supervised manner, which still requires high-cost label annotation. Additionally, these GNN models may not take full advantage of the abundant information in unlabeled data, such as the graph topology and node attributes. Hence, SSL can be naturally harnessed for GNNs to gain additional supervision and thoroughly exploit the information in the unlabeled data.

Compared with grid-based data such as images or text [73], graph-structured data is far more complex due to its highly irregular topology, involved intrinsic interactions and abundant domain-specific semantics [63]. Different from images and text, where the entire structure represents a single entity or expresses a single semantic meaning, each node in a graph is an individual instance with its own features, positioned in its own local context. Furthermore, these individual instances are inherently related to each other, which forms diverse local structures that encode even more complex information to be discovered and analyzed. While such complexity engenders tremendous challenges in analyzing graph-structured data, the substantial and diverse information contained in the node features, node labels, local/global graph structures, and their interactions and combinations provides golden opportunities to design self-supervised pretext tasks.

Embracing all the challenges and opportunities to study self-supervised learning in GNNs, the works [22, 24, 28, 70] have been the first to systematically design and compare different self-supervised pretext tasks in GNNs. For example, the works [24, 70] design pretext tasks to encode the topological properties of a node, such as its centrality, clustering coefficient, and graph partitioning assignment, or to encode the attributes of a node, such as its individual features and clustering assignments, in the embeddings output by GNNs. The work [28] designs pretext tasks to align the pairwise feature similarity or the topological distance between two nodes in the graph with the closeness of the two nodes in the embedding space. Apart from the supervision information employed in creating pretext tasks, designing effective training strategies and selecting reasonable loss functions are other crucial components in incorporating SSL into GNNs. Two frequently used training strategies that equip GNNs with SSL are 1) pre-training GNNs through completing pretext task(s) and then fine-tuning the GNNs on downstream task(s), and 2) jointly training GNNs on both pretext and downstream tasks [28, 70]. There are also a few works [5, 55] applying the idea of self-training to incorporate SSL into GNNs. In addition, loss functions are selected to be tailored to the purposes of specific pretext tasks, which include classification-based tasks (cross-entropy loss), regression-based tasks (mean squared error loss) and contrastive-based tasks (contrastive loss).

In view of the substantial progress made in the field of graph neural networks and the significant potential of self-supervised
learning, this chapter aims to present a systematic and comprehensive review on applying self-supervised learning to graph neural networks. The rest of the chapter is organized as follows. Section 2 first introduces self-supervised learning and pretext tasks, and then summarizes frequently used self-supervised methods from the image and text domains. In Section 3, we introduce the training strategies that are used to incorporate SSL into GNNs and categorize the pretext tasks that have been developed for GNNs. Sections 4 and 5 present detailed summaries of numerous representative SSL methods that have been developed for node-level and graph-level pretext tasks. Thereafter, in Section 6 we discuss representative SSL methods that are developed using both node-level and graph-level supervision, which we refer to as node-graph-level pretext tasks. Section 7 collects and reinforces the major results and the insightful discoveries in prior sections. Concluding remarks and future forecasts on the development of SSL in GNNs are provided in Section 8.

2 SELF-SUPERVISED LEARNING
Supervised learning is the machine learning task of training a model that maps an input to an output based on the ground-truth input-output pairs provided by a labeled dataset. Good performance of supervised learning requires a decent amount of labeled data (especially when using deep learning models), which is expensive to collect manually. Conversely, self-supervised learning generates supervisory signals from unlabeled data and then trains the model based on the generated supervisory signals. The task used for training the model based on the generated signal is referred to as the pretext task. In comparison, the task whose ultimate performance we care about the most and expect our model to solve is referred to as the downstream task. To guarantee that self-supervised learning yields performance benefits, pretext tasks should be carefully designed such that completing them encourages the model to develop a similar or complementary understanding to that required by the downstream tasks. Self-supervised learning originated to solve tasks in the image and text domains. The following part focuses on introducing self-supervised learning in these two fields with a specific emphasis on the different pretext tasks.

In computer vision (CV), many ideas have been proposed for self-supervised representation learning on image data. A common example is that we expect a small distortion of an image not to affect its original semantic meaning or geometric form. The idea of creating surrogate training datasets with unlabeled image patches, by first sampling patches from different images at varying positions and then distorting the patches through a variety of random transformations, is proposed in [12]. The pretext task is to discriminate between patches distorted from the same image and those from different images. Rotation of an entire image is another effective and inexpensive way to modify an input image without changing its semantic content [16]. Each input image is first rotated by a multiple of 90 degrees at random, and the model is then trained to predict which rotation has been applied. However, instead of performing pretext tasks on an entire image, local patches can also be extracted to construct the pretext tasks. Examples of methods using this technique include predicting the relative position between two random patches from one image [10] and designing a jigsaw puzzle game to place nine shuffled patches back in their original locations [43]. More pretext tasks such as colorization, autoencoders, and contrastive predictive coding have also been introduced and effectively utilized [45, 60, 71].

While computer vision has achieved amazing progress on self-supervised learning in recent years, self-supervised learning has been heavily utilized in natural language processing (NLP) research for quite a while. Word2vec [40] is the first work that popularized the SSL idea in the NLP field. Center word prediction and neighbor word prediction are the two pretext tasks in Word2vec, where the model is given a small chunk of text and asked to predict the center word in that text, or vice versa. BERT [9] is another famous pre-trained model in NLP, where the two pretext tasks are to recover randomly masked words in a text and to classify whether two sentences can come one after another or not. Similar works have also been introduced, such as having the pretext task classify whether a pair of sentences are in the correct order [34], or a pretext task that first randomly shuffles the ordering of sentences and then seeks to recover the original ordering [36].

Compared with the difficulty of data acquisition encountered in the image and text domains, machine learning in the graph domain faces even more challenges in acquiring high-quality labeled data. For example, for molecular graphs it can be extremely expensive to perform the necessary laboratory experiments to label some molecules [52], and in a social network obtaining ground-truth labels for individual users may require large-scale surveys or be unable to be released due to privacy agreements/concerns [3]. Therefore, the success achieved by applying SSL in CV and NLP naturally leads to the question of whether SSL can be effectively applied in the graph domain. Given that graph neural networks are among the most powerful paradigms for graph representation learning, in the following sections we will mainly focus on introducing self-supervised learning within the framework of graph neural networks and highlighting/summarizing these recent advancements.

3 APPLYING SSL TO GNNS: CATEGORIZING TRAINING STRATEGIES, LOSS FUNCTIONS AND PRETEXT TASKS
When seeking to apply self-supervised learning to GNNs, the major decisions to be made are how to construct the pretext tasks, which includes what information to leverage from the unlabeled data, what loss function to use, and what training strategy to use for effectively improving the GNN's performance. Hence, in this section we will first mathematically formalize the graph neural network with self-supervised learning and then discuss each of the above. More specifically, we will introduce three training strategies and three loss functions that are frequently employed in the current literature, and categorize current state-of-the-art pretext tasks for GNNs based on the type of information they leverage for constructing the pretext task.

Given an undirected attributed graph 𝒢 = {𝒱, ℰ, X}, where 𝒱 = {v_1, ..., v_{|𝒱|}} represents the vertex set with |𝒱| vertices, ℰ represents the edge set and e_{ij} = (v_i, v_j) is an edge between nodes v_i and v_j, X ∈ ℝ^{|𝒱|×d} represents the feature matrix and x_i = X[i, :]^⊤ ∈ ℝ^d is the d-dimensional feature vector of node v_i. A ∈ ℝ^{|𝒱|×|𝒱|} is the adjacency matrix, where A_{ij} = 1 if e_{ij} ∈ ℰ and A_{ij} = 0 if e_{ij} ∉ ℰ. We denote any GNN-based feature extractor as f_θ : ℝ^{|𝒱|×d} × ℝ^{|𝒱|×|𝒱|} → ℝ^{|𝒱|×d'}, parametrized by θ, which takes any node feature matrix X and the graph adjacency matrix A and outputs the d'-dimensional representation of each node, Z_GNN = f_θ(X, A) ∈ ℝ^{|𝒱|×d'}, which can be further fed into any permutation invariant function READOUT : ℝ^{|𝒱|×d'} → ℝ^{d'} to obtain the graph embedding z_{GNN,𝒢} = READOUT(f_θ(X, A)) ∈ ℝ^{d'}.
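To make this notation concrete, here is a minimal sketch of a GNN-based extractor f_θ together with a permutation-invariant READOUT (assuming PyTorch; the single GCN-style propagation layer and the mean-pooling readout are illustrative choices rather than an architecture prescribed by the chapter).

```python
import torch
import torch.nn as nn

class SimpleGNNExtractor(nn.Module):
    """A minimal f_theta: takes X in R^{|V| x d} and A in R^{|V| x |V|},
    and returns node embeddings Z_GNN in R^{|V| x d'}."""
    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.weight = nn.Linear(d, d_prime)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalize A with self-loops (GCN-style propagation).
        A_hat = A + torch.eye(A.size(0))
        deg_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = deg_inv_sqrt.unsqueeze(1) * A_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(A_norm @ self.weight(X))

def readout(Z_gnn: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant READOUT: mean over nodes gives z_{GNN, G}."""
    return Z_gnn.mean(dim=0)

# Toy usage: 5 nodes with 4-dimensional features and a random undirected graph.
X = torch.randn(5, 4)
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.t()) > 0).float()           # symmetrize to model an undirected graph
f_theta = SimpleGNNExtractor(d=4, d_prime=8)
Z_gnn = f_theta(X, A)                    # node embeddings, shape (5, 8)
z_graph = readout(Z_gnn)                 # graph embedding, shape (8,)
```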
More specifically, we note that here θ represents the parameters encoded in the corresponding network architecture of the GNN [19, 58, 65, 66]. Considering the transductive semi-supervised setting, where we are provided with the labeled node set 𝒱_l ⊂ 𝒱, the labeled graph 𝒢, the associated node label matrix Y_sup ∈ ℝ^{|𝒱_l|×l}, and the graph label y_{sup,𝒢} ∈ ℝ^l with label dimension l, we aim to classify nodes and graphs. The node and graph representations output by GNNs are first processed by the extra adaptation layer h_{θ_sup}, parametrized by the supervised adaptation parameters θ_sup, to obtain the predicted l-dimensional node labels Z_sup ∈ ℝ^{|𝒱|×l} and graph label z_{sup,𝒢} ∈ ℝ^l by Eq. (1)-(2). Then the model parameters θ in the GNN-based extractor f_θ and the parameters θ_sup in the adaptation layer h_{θ_sup} are learned by optimizing the supervised loss calculated between the output/predicted label and the true label for labeled nodes and the labeled graph, which can be formulated as:

Z_sup = h_{θ_sup}(f_θ(X, A))   (1)

… integrated with semi-supervised learning.

The model's capability of extracting features for completing pretext and downstream tasks is improved through optimizing the model parameters θ, θ_ssl, and θ_sup, where θ_ssl denotes the parameters of the adaptation layer for the pretext task. Inspired by relevant discussions [24, 28, 55, 69, 70], we summarize three possible training strategies that are popular in the literature for training GNNs in the self-supervised setting: self-training, pre-training with fine-tuning, and joint training.

3.1.1 Self-training. Self-training is a strategy that leverages supervision information generated by the model itself during the training process [37, 51]. A typical self-training pipeline begins by first training the model over the labeled data, then generating pseudo labels for unlabeled samples that have highly confident predictions, and including them in the labeled data for the next round of training. In this way, the pretext task is the same as the downstream task, utilizing the pseudo labels of some of the originally unlabeled data. A detailed overview is presented in Fig. 1, where the prediction results are re-utilized to augment the training data in the next iteration as done in [55].
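A minimal sketch of such a self-training loop is shown below (assuming PyTorch; `model` is assumed to be the composition of a GNN extractor and an adaptation layer that outputs class logits, and the confidence threshold and number of rounds are illustrative hyperparameters rather than values prescribed by [55]).

```python
import torch
import torch.nn.functional as F

def self_training(model, X, A, y, labeled_mask, rounds=3, threshold=0.9, epochs=100):
    """Repeatedly train on the labeled set, then promote highly confident
    predictions on unlabeled nodes to pseudo-labels for the next round."""
    labeled_mask = labeled_mask.clone()
    y = y.clone()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(rounds):
        for _ in range(epochs):                      # supervised training phase
            optimizer.zero_grad()
            logits = model(X, A)                     # (X, A) -> class logits per node
            loss = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
            loss.backward()
            optimizer.step()
        with torch.no_grad():                        # pseudo-labeling phase
            probs = F.softmax(model(X, A), dim=1)
            conf, pred = probs.max(dim=1)
            new = (~labeled_mask) & (conf > threshold)
            y[new] = pred[new]                       # adopt confident predictions
            labeled_mask |= new                      # grow the "labeled" set
    return model
```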
… where θ_ssl denotes the parameters of the adaptation layer h_{θ_ssl} for the pretext tasks, ℓ_ssl is the self-supervised loss function for each example, and ℒ_ssl is the total loss function of completing the self-supervised task. In node pretext tasks, z_{ssl,i} = Z_ssl[i, :]^⊤ and y_{ssl,i} = Y_ssl[i, :]^⊤, which are the self-supervised predicted and true label(s) for node v_i, respectively. In graph pretext tasks, z_{ssl,𝒢} and y_{ssl,𝒢} are the self-supervised predicted and true label(s) for the graph 𝒢, respectively. Then, in the fine-tuning process, the feature extractor f_θ is trained by completing downstream tasks in Eq. (1)-(3) with the pre-trained θ* as the initialization. Note that to utilize the pre-trained node/graph representations, the fine-tuning process can also be replaced by training a linear classifier (e.g., Logistic Regression [49, 57, 69, 75]).

Figure 2: An overview of GNNs with SSL using pre-training and fine-tuning.
Figure 3: An overview of GNNs with SSL using joint training.

3.1.3 Joint Training. Another natural idea to harness self-supervised learning for graph neural networks is to combine the losses of completing pretext task(s) and downstream task(s) and jointly train the model. An overview of joint training is shown in Fig. 3.

Joint training consists of two components: feature extraction by a GNN and adaptation processes for both the pretext and downstream tasks. In the feature extraction process, a GNN takes the graph adjacency matrix A and the feature matrix X as input and outputs the node embeddings Z_GNN and/or the graph embedding z_{GNN,𝒢}. In the adaptation procedure, the extracted node and graph embeddings are further transformed to complete pretext and downstream tasks via h_{θ_ssl} and h_{θ_sup}, respectively. We then jointly optimize the pretext and downstream task losses as:

Z_sup = h_{θ_sup}(f_θ(X, A)),   Z_ssl = h_{θ_ssl}(f_θ(X, A)),   (7)
z_{sup,𝒢} = h_{θ_sup}(READOUT(f_θ(X, A))),   (8)
z_{ssl,𝒢} = h_{θ_ssl}(READOUT(f_θ(X, A))),   (9)

θ*, θ*_sup, θ*_ssl = arg min_{θ, θ_sup, θ_ssl} (1/|𝒱|) ∑_{v_i∈𝒱} ( α_1 ℓ_sup(z_{sup,i}, y_{sup,i}) + α_2 ℓ_ssl(z_{ssl,i}, y_{ssl,i}) )   (node pretext tasks), or
θ*, θ*_sup, θ*_ssl = arg min_{θ, θ_sup, θ_ssl} α_1 ℓ_sup(z_{sup,𝒢}, y_{sup,𝒢}) + α_2 ℓ_ssl(z_{ssl,𝒢}, y_{ssl,𝒢})   (graph pretext tasks),   (10)

where α_1, α_2 ∈ ℝ > 0 are the weights for combining the supervised loss ℓ_sup and the self-supervised loss ℓ_ssl.

3.2 Loss Functions
A loss function is used to evaluate how well the algorithm models the data. Generally, in GNNs with self-supervised learning, the loss function for the pretext task takes one of three forms: classification loss, regression loss, and contrastive learning loss. Note that the loss functions we discuss here are only for the pretext tasks rather than the downstream tasks.

3.2.1 Classification and Regression Loss. In completing classification-based pretext tasks, such as node clustering where the node embeddings are expected to encode the cluster assignment information, the objective of the pretext task is to minimize the following loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_{ssl,i}) = −(1/|𝒱|) ∑_{v_i∈𝒱} ∑_{j=1}^{L} 1(y_{ssl,ij} = 1) log(z̃_{ssl,ij})   (node pretext tasks), or
ℒ_ssl = ℓ_CE(z_{ssl,𝒢}, y_{ssl,𝒢}) = −∑_{j=1}^{L} 1(y_{ssl,𝒢j} = 1) log(z̃_{ssl,𝒢j})   (graph pretext tasks),   (11)

where ℓ_CE indicates the cross-entropy function, z_{ssl,i} and z_{ssl,𝒢} represent the predicted label distributions of node v_i and graph 𝒢 for the pretext task, and their corresponding class probability distributions z̃_{ssl,i} and z̃_{ssl,𝒢} are calculated by softmax normalization, respectively. For example, z̃_{ssl,ij} is the probability of node v_i belonging to class j. Since every node v_i has its own pseudo label (i.e., y_{ssl,i}) in completing pretext tasks, we can consider all the nodes 𝒱 in the graph, compared to only the labeled set of nodes 𝒱_l as before in downstream tasks.

In completing regression-based pretext tasks, such as feature completion, the mean squared error loss is typically used as the loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(z_{ssl,i}, y_{ssl,i}) = (1/|𝒱|) ∑_{v_i∈𝒱} ||z_{ssl,i} − y_{ssl,i}||^2   (node pretext tasks), or
ℒ_ssl = ℓ_MSE(z_{ssl,𝒢}, y_{ssl,𝒢}) = ||z_{ssl,𝒢} − y_{ssl,𝒢}||^2   (graph pretext tasks).   (12)
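The sketch below illustrates the joint objective of Eq. (10) with a cross-entropy downstream loss and a regression-style pretext loss as in Eq. (11)-(12) (assuming PyTorch; the two heads, the weights alpha1/alpha2, and the pretext target y_ssl are placeholders to be supplied by a concrete pretext task).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSSLModel(nn.Module):
    """Shared GNN extractor f_theta with two adaptation heads:
    h_sup for downstream labels and h_ssl for pretext targets."""
    def __init__(self, extractor, d_prime, n_classes, ssl_dim):
        super().__init__()
        self.extractor = extractor                  # any f_theta(X, A) -> Z_GNN
        self.h_sup = nn.Linear(d_prime, n_classes)  # adaptation layer h_{theta_sup}
        self.h_ssl = nn.Linear(d_prime, ssl_dim)    # adaptation layer h_{theta_ssl}

    def forward(self, X, A):
        Z = self.extractor(X, A)
        return self.h_sup(Z), self.h_ssl(Z)

def joint_loss(model, X, A, y_sup, labeled_mask, y_ssl, alpha1=1.0, alpha2=0.5):
    z_sup, z_ssl = model(X, A)
    sup = F.cross_entropy(z_sup[labeled_mask], y_sup[labeled_mask])  # downstream, labeled nodes only
    ssl = F.mse_loss(z_ssl, y_ssl)                                   # regression pretext over all nodes
    return alpha1 * sup + alpha2 * ssl
```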
… centrality score s estimates its rank score by S_v^s = D_rank^s(z_{GNN,v}). The probability of the estimated rank order is defined by the sigmoid function R̃_{u,v}^s = exp(S_u^s − S_v^s) / (1 + exp(S_u^s − S_v^s)). Then predicting the relative order between pairs of nodes can be formalized as a binary classification problem with the loss:

ℒ_ssl = −∑_s ∑_{u,v∈𝒱} ( R_{u,v}^s log R̃_{u,v}^s + (1 − R_{u,v}^s) log(1 − R̃_{u,v}^s) ).   (16)

Different from peer works, [24] does not consider any node features but instead extracts the node features directly from the graph topology, which includes: (1) degree, which defines the local importance of a node; (2) core number, which defines the connectivity of the subgraph around a node; (3) collective influence, which defines the neighborhood importance of a node; and (4) local clustering coefficient, which defines the connectivity of the 1-hop neighborhood of a node. Then, the four features (after min-max normalization) are concatenated with a nonlinear transformation and fed into the GNN, where [24] uses the pretext tasks of centrality ranking, clustering recovery and edge prediction. Another innovative idea in [24] is to choose a fix-tune boundary in the middle layers of the GNN. The GNN blocks below this boundary are fixed, while the ones above the boundary are fine-tuned. For downstream tasks that are closely related to the pre-trained tasks, a higher boundary is used.

Another important node-level structure property is the partition each node belongs to after performing a graph partitioning method. In [70], the pretext task is to train the GNN to encode the node partition information. Graph partitioning divides the nodes of a graph into different groups such that the number of edges between groups is minimized. Given the node set 𝒱, the edge set ℰ, and a preset number of partitions p ∈ [1, |𝒱|], a graph partitioning algorithm (e.g., [30] as used in [70]) will output a set of node sets {𝒱_par_1, ..., 𝒱_par_p | 𝒱_par_i ⊂ 𝒱, i = 1, ..., p}. Then the classification loss is set exactly the same as:

ℒ_ssl = −(1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_{ssl,i}),   (17)
where z_{ssl,i} denotes the embedding of node v_i, and the partitioning label is assumed to be a one-hot encoding y_{ssl,i} ∈ ℝ^p whose k-th entry is 1 and all other entries are 0 if v_i ∈ 𝒱_par_k, i = 1, ..., |𝒱|, ∃k ∈ [1, p].

… where f_w is a function mapping the difference between two node embeddings from GNNs to a scalar representing the similarity between them.

¹ Additional summary details and the corresponding code links for these methods can be found at https://github.com/NDS-VU/GNN-SSL-Chapter.
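A minimal sketch of this partition-prediction pretext loss, Eq. (17), is given below (assuming PyTorch; the partition assignments are taken as precomputed input, e.g., from a METIS-style partitioner [30], and `model` is assumed to be the GNN extractor followed by the h_{θ_ssl} head with p output logits).

```python
import torch
import torch.nn.functional as F

def partition_pretext_loss(model, X, A, partition_id):
    """partition_id[i] in {0, ..., p-1} is the group that node v_i was assigned to
    by an external graph-partitioning algorithm; the pretext task trains the GNN
    head to predict this assignment, matching Eq. (17)."""
    logits = model(X, A)                   # h_{theta_ssl}(f_theta(X, A)), shape (|V|, p)
    return F.cross_entropy(logits, partition_id)

# Toy usage with an arbitrary 2-way partition of 6 nodes.
partition_id = torch.tensor([0, 0, 0, 1, 1, 1])
# loss = partition_pretext_loss(model, X, A, partition_id); loss.backward()
```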
… appearing in influential nodes are seen as important and so are masked with lower probability.

The observation made in [4] that nodes with a further topological distance to the labeled nodes are more likely to be misclassified indicates an uneven distribution of the ability of GNNs to embed node features across the whole graph. However, existing graph contrastive learning methods ignore this uneven distribution, which motivates Chen et al. [4] to propose the distance-wise graph contrastive learning (DwGCL) method that can adaptively augment the graph topology, sample the positive and negative pairs, and maximize the mutual information. The topology information gain (TIG) is calculated based on Group PageRank and node features to describe the task information effectiveness that a node obtains from the labeled nodes along the graph topology. By ranking the performance of GNNs on nodes according to their TIG values with/without contrastive learning, it is found that contrastive learning mainly improves the performance on nodes that are topologically far away from the labeled nodes. Based on the above finding, Chen et al. [4] propose to: 1) perturb the graph topology by augmenting nodes according to their TIG value; 2) sample the positive and negative pairs considering local/global topology distance and node embedding distance; and 3) assign different weights to nodes in the self-supervised loss based on their TIG rankings. Results demonstrate the performance improvement of this distance-wise graph contrastive learning over the typical contrastive learning approach.

Another special source of supervision information to exploit is the prediction results of the model itself. Sun et al. [55] leverage a multi-stage training framework to utilize the information of the pseudo labels generated by predictions in the next rounds of training. The multi-stage training algorithm repeatedly adds the most confident predictions of each class to the label set and re-utilizes these pseudo-labeled data to train the GNNs. Furthermore, a self-checking mechanism based on DeepCluster [2] is proposed to guarantee the precision of the labeled data. Assume that the cluster assignment for node v_i is c_i ∈ {0, 1}^p (here the number of clusters is assumed to equal the number of predefined classes p in the downstream classification task) and the centroid matrix C ∈ ℝ^{d'×p} represents … data. However, directly checking in this naïve way is very time consuming.

5 GRAPH-LEVEL SSL PRETEXT TASKS
Having just presented the node-level SSL pretext tasks, in this section we focus on the graph-level SSL pretext tasks, where we desire the node embeddings coming from the GNNs to encode information about graph-level properties.

5.1 Structure-based Pretext Tasks
As the counterpart of the nodes in the graph, the edges encode abundant information about the graph, which can also be leveraged as extra supervision to design pretext tasks. The pretext task in [74] is to recover the graph topology, i.e., predict edges, after randomly removing edges from the graph. After the node embedding z_{GNN,i} is obtained for each node v_i, the probability of an edge between any pair of nodes v_i, v_j is calculated by their feature similarity as follows:

A'_{ij} = sigmoid(z_{GNN,i} (z_{GNN,j})^⊤),   (26)

and the weighted cross-entropy loss is used during training, which is defined as:

ℒ_ssl = −∑_{v_i,v_j∈𝒱} ( W (A_{ij} log A'_{ij}) + (1 − A_{ij}) log(1 − A'_{ij}) ),   (27)

where W is the weight hyperparameter used for balancing the two classes, which are node pairs having an edge and node pairs without an edge between them.
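A sketch of this edge-prediction loss, Eq. (26)-(27), is given below (assuming PyTorch; the dense adjacency format, the summation over all pairs, and the value of the class weight W are illustrative simplifications of [74]).

```python
import torch

def edge_reconstruction_loss(Z_gnn: torch.Tensor, A: torch.Tensor, W: float = 5.0):
    """Score every node pair by sigmoid(z_i z_j^T) and apply a weighted binary
    cross-entropy against the original adjacency A (the embeddings Z_gnn would
    come from the graph with some edges randomly removed)."""
    A_prob = torch.sigmoid(Z_gnn @ Z_gnn.t())              # predicted edge probabilities
    eps = 1e-8
    pos = W * A * torch.log(A_prob + eps)                  # existing edges, up-weighted by W
    neg = (1.0 - A) * torch.log(1.0 - A_prob + eps)        # non-edges
    return -(pos + neg).sum()
```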
Since it is known that an unclean graph structure usually impedes the applicability of GNNs [8, 26], a method that trains the GNNs on downstream supervised tasks based on a cleaned graph structure, reconstructed from completing a self-supervised pretext task, is introduced in [13]. The self-supervised pretext task aims to train a separate GNN to denoise the corrupted node features X̂, generated either by randomly zeroing some dimensions of the original node features X when the features are binary or by adding independent Gaussian noise when X is continuous. Two methods are used to generate the initial graph adjacency matrix Ã. The first method, Full …
The parameters in FP and MLP-kNN used for generating the initial adjacency matrix à are optimized by:

ℒ_ssl = (1/|𝒱_m|) ∑_{v_i∈𝒱_m} ℓ_MSE(x_i, ẑ_i),   (29)

where ẑ_i = Ẑ[i, :]^⊤ is the noisy embedding vector of node v_i obtained by the separate GNN-based encoder. The optimized parameters in FP and MLP-kNN lead to the generation of a cleaner graph adjacency matrix, which in turn results in better performance on the downstream tasks.

In addition to the graph edges and the adjacency matrix, the topological distance between nodes is another important global structure property of a graph. The pretext task in Peng et al. [49] is to recover the topological distance between nodes. More specifically, they leverage the shortest path, denoted as p_{ij}, between nodes v_i and v_j, but this could be replaced with any other distance measure. Then, they define the set 𝒞_i^k as all the nodes having a shortest path of length k from node v_i. More formally, this is defined as:

𝒞_i = 𝒞_i^1 ∪ 𝒞_i^2 ∪ ··· ∪ 𝒞_i^{δ_i},   𝒞_i^k = {v_j | d_{ij} = k},   k = 1, 2, ···, δ_i,   (30)

where δ_i is the upper bound of the hop count from other nodes to v_i, d_{ij} is the length of the path p_{ij}, and 𝒞_i is the union of all the k-hop shortest path neighbor sets 𝒞_i^k. Based on these sets, one-hot encodings d_{ij} ∈ ℝ^{δ_i} are created for pairs of nodes v_i, v_j, where v_j ∈ 𝒞_i, according to their distance d_{ij}. Then, the GNN model is guided to extract node embeddings that encode the node topological distance as follows:

ℒ_ssl = ∑_{v_i∈𝒱} ∑_{v_j∈𝒞_i} ℓ_CE(f_w(|z_{GNN,i} − z_{GNN,j}|), d_{ij}),   (31)

where f_w is a function mapping the difference between two node embeddings to the probabilities of the pair of nodes belonging to the corresponding category of topological distance. Since the number of categories depends on the upper bound of the hop count (topological distance), but precisely determining this upper bound is time-consuming for a big graph, it is assumed that the number of hops (distance) is under control based on the small-world phenomenon [42], and it is further divided into several major categories that clearly discriminate dissimilarity while partly tolerating similarity. Experiments demonstrate that dividing the topological distance into four categories, 𝒞_i^1, 𝒞_i^2, 𝒞_i^3, 𝒞_i^k (k ≥ 4), achieves the best performance (i.e., δ_i = 4). Another problem is that the number of nodes close to the focal node v_i is much smaller than the number of nodes further away (i.e., the magnitude of 𝒞_i^{δ_i} will be significantly larger than that of the other sets). To circumvent this imbalance problem, node pairs are sampled with an adaptive ratio.
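A minimal sketch of this distance-prediction pretext task is given below (assuming PyTorch; distances are computed with a plain BFS, the four-way bucketing follows the categories described above, and f_w is modeled as a small linear classifier, which is an illustrative choice rather than the exact setup of [49]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

def bfs_hops(A: torch.Tensor, source: int):
    """Unweighted shortest-path lengths from `source` (-1 if unreachable)."""
    n = A.size(0)
    dist = [-1] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u, v] > 0 and dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_pretext_loss(Z_gnn: torch.Tensor, A: torch.Tensor, f_w: nn.Module):
    """Classify |z_i - z_j| into the distance buckets {1, 2, 3, >=4}, Eq. (31)."""
    pairs, labels = [], []
    for i in range(A.size(0)):
        for j, d in enumerate(bfs_hops(A, i)):
            if d >= 1:                                # skip self and unreachable pairs
                pairs.append(torch.abs(Z_gnn[i] - Z_gnn[j]))
                labels.append(min(d, 4) - 1)          # bucket all distances >= 4 together
    logits = f_w(torch.stack(pairs))                  # f_w maps the difference to 4 logits
    return F.cross_entropy(logits, torch.tensor(labels))

# f_w = nn.Linear(d_prime, 4)   # hypothetical classifier head over the 4 categories
```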
Network motifs are recurrent and statistically significant subgraphs of a larger graph, and Zhang et al. [72] design a pretext task to train a GNN encoder that can automatically extract graph motifs. The learned motifs are further leveraged to generate informative subgraphs used in graph-subgraph contrastive learning. First, a GNN-based encoder f_θ and an m-slot embedding table {m_1, ..., m_m}, denoting the m cluster centers of m motifs, are initialized. Then, a node affinity matrix U ∈ ℝ^{|𝒱|×|𝒱|} is calculated by softmax normalization of the embedding similarity 𝒟(z_{GNN,i}, z_{GNN,j}) between nodes i, j, as in Eq. (14). Afterwards, spectral clustering [61] is performed on U to generate different groups, within which the n_𝒢 connected components that have more than three nodes are collected as the sampled subgraphs from the graph 𝒢, and their embeddings are calculated by applying the READOUT function. For each subgraph, its cosine similarity to each of the m motifs is calculated to obtain a similarity matrix S ∈ ℝ^{m×n_𝒢}. To produce semantically meaningful subgraphs that are close to motifs, the top 10% most similar subgraphs to each motif are selected based on the similarity matrix S and are collected into a set 𝒢_top. The affinity values in U between pairs of nodes in each of these subgraphs are increased by optimizing the loss:

ℒ_1 = −(1/|𝒢_top|) ∑_{i=1}^{|𝒢_top|} ∑_{(v_j,v_k)∈𝒢_i^top} U[j, k].   (32)

The optimization of the above loss forces nodes in motif-like subgraphs to be more likely to be grouped together in spectral clustering, which leads to more subgraph samples aligned with the motifs. Next, the embedding table of motifs is optimized based on the sampled subgraphs. The assignment matrix Q ∈ ℝ^{m×n_𝒢} is found by maximizing the similarities between embeddings and their assigned motifs:

max_Q Tr(Q^⊤ S) − (1/λ) ∑_{i,j} Q[i, j] log Q[i, j],   (33)

where the second term, controlled by the hyperparameter λ, is to avoid all representations collapsing into a single cluster center. After the cluster assignment matrix Q is obtained, the GNN-based encoder and the motif embedding table are trained, which is equivalent to a supervised m-class classification problem with labels Q and the prediction distribution S̃ obtained by applying a column-wise softmax normalization with temperature τ:

ℒ_2 = −(1/n_𝒢) ∑_{i=1}^{n_𝒢} ℓ_CE(q_i, s̃_i),   (34)

where q_i = Q[:, i] and s̃_i = S̃[:, i] denote the assignment distribution and the predicted distribution for subgraph i, respectively. Optimizing Eq. (34) jointly enhances the ability of the GNN encoder to extract subgraphs that are similar to motifs and improves the embeddings of the motifs. The last step is to train the GNN-based encoder by a classification task where subgraphs are reassigned back to their corresponding graphs. Note that the subgraphs are generated by the motif-guided extractor, and so are more likely to capture higher-level semantic information compared with randomly sampled subgraphs. The whole framework is trained jointly with a weighted combination of ℒ_1, ℒ_2 and the contrastive loss.

Aside from network motifs, other subgraph structures can be leveraged to provide extra supervision in designing pretext tasks. In [50], an r-ego network for a certain vertex is defined as the subgraph induced by the nodes whose shortest path to that vertex is shorter than r. Then a random walk with restart is initiated at the ego vertex v_i, and the subgraph induced by the nodes visited during the random walk starting at v_i is used as an augmented version of the r-ego network. First, two augmented r-ego networks centered around vertex v_i are obtained by performing the random walk twice (i.e., 𝒢_i and 𝒢_i^+), which are defined as a positive pair
since they come from the same r-ego network. In comparison, a negative pair corresponds to two subgraphs augmented from different r-ego networks (e.g., one coming from v_i and another coming from v_j, resulting in random-walk-induced subgraphs 𝒢_i and 𝒢_j, respectively). Based on the above defined positive and negative subgraph pairs, a contrastive loss is set up to optimize the GNNs as follows:

ℒ_ssl = (1/|𝒫^+|) ∑_{(𝒢_i,𝒢_i^+)∈𝒫^+} ℓ_NT-Xent(Z^1_ssl, Z^2_ssl, 𝒫^−),   (35)

where Z^1_ssl, Z^2_ssl denote the GNN-based graph embeddings, and specifically here the two different views are the same, Z^1_ssl = Z^2_ssl. 𝒫^+ contains positive pairs of subgraphs (𝒢_i, 𝒢_i^+) sampled by random walks starting at the same ego vertex v_i in the same graph, while 𝒫^− = ∪_{(𝒢_i,𝒢_i^+)∈𝒫^+} 𝒫^−_{𝒢_i} represents all sets of negative samples. Specifically, 𝒫^−_{𝒢_i} represents subgraphs sampled by random walks starting either at a different ego vertex from v_i in 𝒢 or directly in graphs different from 𝒢.

Although the Graph Attention Network (GAT) [58] achieves performance improvements over the original GCN [33], there is little understanding of what graph attention learns. To this end, Kim and Oh [32] propose a specific pretext task that leverages the edge information to supervise what graph attention learns:

ℒ_ssl = (1/|ℰ ∪ ℰ^−|) ∑_{(j,i)∈ℰ∪ℰ^−} ( 1((j, i) ∈ ℰ) · log χ_{ij} + 1((j, i) ∈ ℰ^−) · log(1 − χ_{ij}) ),   (36)-(37)

where ℰ is the set of edges, ℰ^− is a sampled set of node pairs without edges, and χ_{ij} is the edge probability between nodes i, j calculated from their embeddings. Based on the two primary edge attentions, the GAT attention (shortly, GO) [58] and the dot-product attention (shortly, DP) [38], two advanced attention mechanisms, SuperGAT_SD (Scaled Dot-product, shortly SD) and SuperGAT_MX (Mixed GO and DP, shortly MX), are proposed:

e_{ij,SD} = e_{ij,DP} / √F,   χ_{ij,SD} = σ(e_{ij,SD}),   (38)
e_{ij,MX} = e_{ij,GO} · σ(e_{ij,DP}),   χ_{ij,MX} = σ(e_{ij,DP}),   (39)

where σ denotes the sigmoid function taking the edge weight e_{ij} and calculating the edge probability χ_{ij}. SuperGAT_SD divides the dot-product of edge e_{ij,DP} by the square root of the dimension, as in the Transformer [56], to prevent some large values from dominating the entire attention after the softmax. SuperGAT_MX multiplies the GO and DP attention with the sigmoid, which is motivated by the gating mechanism of Gated Recurrent Units (GRUs) [7]. Since the DP attention with the sigmoid denotes the edge probability, multiplying by σ(e_{ij,DP}) in calculating e_{ij,MX} can softly drop neighbors that are not likely to be linked while implicitly assigning importance to the remaining nodes. e_{ij,DP} and e_{ij,GO} are the weights of edge (i, j) used to calculate the DP and GO attention. The results disclose several insightful discoveries, including that the GO attention learns label-agreement better than DP, whereas DP predicts edge presence better than GO, and that the performance of the attention mechanism is not fixed but depends on the homophily and average degree of the specific graph.

The topological information can also be generated manually for designing pretext tasks. Gao et al. [15] propose to encode the transformation information between two different graph topologies in the representations of nodes obtained by GNNs. First, they transform the original graph adjacency matrix A into  by randomly adding or removing edges from the original edge set. Then, by feeding the original and transformed graph topology along with the node feature matrix into any GNN-based encoder, the feature representations Z_GNN, Ẑ_GNN before and after the topology transformation are calculated, and their difference ΔZ ∈ ℝ^{N×F'} is defined as:

ΔZ = Ẑ_GNN − Z_GNN = [Δz_{GNN,1}, ..., Δz_{GNN,N}]^⊤ = [ẑ_{GNN,1} − z_{GNN,1}, ..., ẑ_{GNN,N} − z_{GNN,N}]^⊤.   (40)-(41)

Next they predict the topology transformation between nodes v_i and v_j from the node-wise feature difference ΔZ by constructing the edge representation as:

e_{ij} = exp(−(Δz_i − Δz_j) ⊙ (Δz_i − Δz_j)) / ||exp(−(Δz_i − Δz_j) ⊙ (Δz_i − Δz_j))||,   (42)

where ⊙ denotes the Hadamard product. This edge representation e_{ij} is then fed into an MLP for the prediction of the topological transformation, which includes four classes: edge addition, edge deletion, keeping disconnection, and keeping connection between each pair of nodes. Thus, the GNN-based encoder is trained by:

ℒ_ssl = (1/|𝒱|^2) ∑_{v_i,v_j∈𝒱} ℓ_CE(MLP(e_{ij}), t_{ij}),   (43)

where we denote the topological transformation category between nodes v_i and v_j as a one-hot encoding t_{ij} ∈ ℝ^4.

5.2 Feature-based Pretext Tasks
Typically, graphs do not come with any graph-level feature information, and here the graph-level features refer to the graph embeddings obtained after applying a pooling layer over all node embeddings from the GNNs.

GraphCL [69] designs the pretext task to first augment graphs by four different augmentations, including node dropping, edge perturbation, attribute masking and subgraph extraction, and then maximize the mutual information of the graph embeddings between different augmented views generated from the same original graph, while also minimizing the mutual information of the graph embeddings between augmented views generated from different graphs. The graph embeddings Z_ssl are obtained through any permutation-invariant READOUT function on the node embeddings, followed by an adaptation layer. Then the mutual information is maximized by optimizing the following NT-Xent contrastive loss:

ℒ_ssl = (1/|𝒫^+|) ∑_{(𝒢_i,𝒢_j)∈𝒫^+} ℓ_NT-Xent(Z^1_ssl, Z^2_ssl, 𝒫^−),   (44)

where Z^1_ssl, Z^2_ssl represent graph embeddings under two different views. A view could be the original view without any augmentation or one generated by applying the four different augmentations. 𝒫^+ contains positive pairs of graphs (𝒢_i, 𝒢_j) augmented from the same original graph, while 𝒫^− = ∪_{(𝒢_i,𝒢_j)∈𝒫^+} 𝒫^−_{𝒢_i} represents all sets of negative samples. Specifically, 𝒫^−_{𝒢_i} contains graphs augmented from graphs different from 𝒢_i.
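A compact sketch of the NT-Xent loss used in Eq. (44), computed over a batch of graph embeddings from two augmented views, is shown below (assuming PyTorch; using only in-batch negatives and a fixed temperature is a standard simplification rather than the exact GraphCL [69] implementation).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    """z1[k] and z2[k] are embeddings of two augmented views of graph k; all other
    graphs in the batch serve as negatives for graph k."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                   # cosine similarities between the two views
    targets = torch.arange(z1.size(0))        # the positive pair sits on the diagonal
    return F.cross_entropy(sim, targets)
```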
Numerical results demonstrate that the augmentation of edge perturbation benefits social networks but hurts biochemical molecules. Applying attribute masking achieves better performance on denser graphs. Node dropping and subgraph extraction are generally beneficial across all datasets.

5.3 Hybrid Pretext Tasks
One way to use the information of the training nodes in designing pretext tasks is developed in [22], where the concept of context is raised. The goal of this work is to pre-train a GNN so that it maps nodes appearing in similar graph structure contexts to nearby embeddings. For every node v_i, the r-hop neighborhood of v_i contains all nodes and edges that are at most r hops away from v_i in the graph. The context graph of v_i is the subgraph between r_1 hops and r_2 hops away from node v_i. It is required that r_1 < r so that some nodes are shared between the neighborhood and the context graph, and these are referred to as context anchor nodes. Examples of neighborhood and context graphs are shown in Fig. 6. Two GNN encoders are set up: the main GNN encoder obtains the node embedding z^r_{GNN,i} based on the node features of the r-hop neighborhood, and the context GNN obtains the node embeddings of the nodes in the context anchor node set, which are then averaged to get the node context embedding c_i. Then [22] uses negative sampling to jointly learn the main GNN and the context GNN. In the optimization process, positive samples refer to the situation where the center node of the context and the neighborhood graphs is the same, while negative samples refer to the situation where the center nodes of the context and the neighborhood graphs are different. The learning objective is a binary classification of whether a particular neighborhood and a particular context graph have the same center node, and the negative likelihood loss is used as follows:

ℒ_ssl = −(1/|𝒦|) ∑_{(v_i,v_j)∈𝒦} ( y_i log(σ((z^r_{GNN,i})^⊤ c_j)) + (1 − y_i) log(1 − σ((z^r_{GNN,i})^⊤ c_j)) ),   (45)-(46)

where y_i = 1 for a positive sample, where i = j, and y_i = 0 for a negative sample, where i ≠ j, with 𝒦 denoting the set of positive and negative pairs, and σ being the sigmoid function computing the probability.

Figure 6: An example of a context and r-neighborhood graph.

A similar idea employing the context concept in completing pretext tasks is also proposed in [28]. Specifically, the context here is defined as:

y_{ic} = ( |Γ_{𝒱_l}(v_i, c)| + |Γ_{𝒱_u}(v_i, c)| ) / ( |Γ_{𝒱_l}(v_i)| + |Γ_{𝒱_u}(v_i)| ),   c = 1, ..., l,   (47)

where 𝒱_u and 𝒱_l denote the unlabeled and labeled node sets, Γ_{𝒱_u}(v_i) denotes the unlabeled nodes that are adjacent to node v_i, Γ_{𝒱_u}(v_i, c) denotes the unlabeled nodes that have been assigned class c and are adjacent to node v_i, Γ_{𝒱_l}(v_i) denotes the labeled nodes that are adjacent to node v_i, and Γ_{𝒱_l}(v_i, c) denotes the labeled nodes that are adjacent to node v_i and of class c. To generate labels for the unlabeled nodes so as to calculate the context vector y_i for each node v_i, label propagation (LP) [77] or the iterative classification algorithm (ICA) [41] is used to construct pseudo labels for the unlabeled nodes in 𝒱_u. Then the pretext task is approached by optimizing the following loss function:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_CE(z_{ssl,i}, y_i).   (48)

The main issue of the above pretext task is the error caused by generating labels with LP or ICA. The paper [28] further proposes two methods to improve the above pretext task. The first method is to replace the procedure of assigning labels to unlabeled nodes based on only one method, such as LP or ICA, with assigning labels by ensembling the results from multiple different methods. The second method treats the initial labeling from LP or ICA as noisy labels and then leverages an iterative approach [20] to improve the context vectors, which leads to significant improvements based on this correction phase.

One previous pretext task is to recover the topological distance between nodes. However, calculating the shortest path distance for all pairs of nodes, even after sampling, is time-consuming. Therefore, Jin et al. [28] replace the pairwise distance between nodes with the distance between nodes and their corresponding clusters. For each cluster, a fixed set of anchor/center nodes is established, and for each node, its distance to this set of anchor nodes is calculated. The pretext task is to extract node features that encode the information of this node2cluster distance. Suppose k clusters are obtained by applying the METIS graph partitioning algorithm [31] and the node with the highest degree is assumed to be the center of the corresponding cluster; then each node v_i will have a cluster distance vector d_i ∈ ℝ^k, and the distance-to-cluster pretext task is completed by optimizing:

ℒ_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(z_{ssl,i}, d_i).   (49)

Aside from the graph topology and the node features, the distribution of the training nodes and their training labels is another valuable source of information for designing pretext tasks. One of the pretext tasks in [28] is to require the node embeddings output by GNNs to encode the information of the topological distance from any node to the training nodes. Assuming that the total number of classes is p, then for each class c ∈ {1, ..., p} and node v_i ∈ 𝒱, the average, minimum and maximum shortest path lengths from v_i to all labeled nodes in class c are calculated and denoted as d_i ∈ ℝ^{3p}; the objective is then to optimize the same regression loss as defined in Eq. (49).
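Both of these tasks reduce to regressing a per-node vector of distances; a minimal sketch is given below (assuming PyTorch and the bfs_hops helper sketched earlier; the anchor sets stand in for either the cluster centers or the labeled nodes of each class, and `model` is assumed to output a vector of matching dimension).

```python
import torch
import torch.nn.functional as F

def distance_targets(A: torch.Tensor, anchor_sets):
    """Build d_i: for each node, the average/min/max hop distance to each set of
    anchor nodes (cluster centers in the distance2cluster task, or the labeled
    nodes of each class in the distance-to-labeled-nodes task)."""
    n = A.size(0)
    hops = [bfs_hops(A, s) for s in range(n)]          # all-pairs unweighted distances
    targets = []
    for i in range(n):
        row = []
        for anchors in anchor_sets:
            ds = [hops[i][a] for a in anchors if hops[i][a] >= 0]
            ds = ds or [0]                             # fall back if no anchor is reachable
            row += [sum(ds) / len(ds), min(ds), max(ds)]
        targets.append(row)
    return torch.tensor(targets, dtype=torch.float)

def distance_regression_loss(model, X, A, anchor_sets):
    d = distance_targets(A, anchor_sets)               # shape (|V|, 3 * num_sets)
    return F.mse_loss(model(X, A), d)                  # Eq. (49)-style regression
```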
The generating process of networks encodes abundant information for designing pretext tasks. Hu et al. [23] propose the GPT-GNN framework for generative pre-training of GNNs. This framework performs attribute and edge generation to enable the pre-trained model to capture the inherent dependency between node attributes and graph structure. Assuming that the likelihood over this graph under the GNN model is p(𝒢; θ), which represents how the nodes in 𝒢 are attributed and connected, GPT-GNN aims to pre-train the GNN model by maximizing the graph likelihood, i.e., θ* = max_θ p(𝒢; θ). Given a permutated order, the log likelihood is factorized autoregressively, generating one node per iteration, as:

log p_θ(X, ℰ) = ∑_{i=1}^{|𝒱|} log p_θ(x_i, ℰ_i | X_{<i}, ℰ_{<i}).   (50)

For all nodes that are generated before node i, their attributes X_{<i} and the edges ℰ_{<i} between these nodes are used to generate a new node v_i, including both its attributes x_i and its connections with existing nodes ℰ_i. Instead of directly assuming that x_i and ℰ_i are independent, they devise a dependency-aware factorization mechanism to maintain the dependency between node attributes and edge existence. The generation process can be decomposed into two coupled parts: (1) generating node attributes given the observed edges, and (2) generating the remaining edges given the observed edges and the generated node attributes. For computing the loss of attribute generation, the generated node feature matrix X is corrupted by masking some dimensions to obtain the corrupted version X̂^Attr, which is fed together with the generated edges into GNNs to get the embeddings Ẑ^Attr_GNN. Then a decoder Dec^Attr(·) is specified, which takes Ẑ^Attr_GNN as input and outputs the predicted attributes Dec^Attr(Ẑ^Attr_GNN). The attribute generation loss is:

ℒ^Attr_ssl = (1/|𝒱|) ∑_{v_i∈𝒱} ℓ_MSE(Dec^Attr(ẑ^Attr_{GNN,i}), x_i),   (51)

where ẑ^Attr_{GNN,i} = Ẑ^Attr_GNN[i, :]^⊤ denotes the decoded embedding of node v_i. For computing the loss of edge reconstruction, the originally generated node feature matrix X is directly fed together with the generated edges into GNNs to get the embeddings Z^Edge_GNN. Then the contrastive NT-Xent loss is calculated:

ℒ^Edge_ssl = (1/|𝒫^+|) ∑_{(v_i,v_j)∈𝒫^+} ℓ_NT-Xent(Z^Edge_GNN, Z^Edge_GNN, 𝒫^−),   (52)

where 𝒫^+ contains positive pairs of connected nodes (v_i, v_j), while 𝒫^− = ∪_{(v_i,v_j)∈𝒫^+} 𝒫^−_{v_i} represents all sets of negative samples and 𝒫^−_{v_i} contains all nodes that are not directly linked with node v_i. Note that here the two views are set equal, i.e., Z^1 = Z^2 = Z^Edge_GNN.

6 NODE-GRAPH-LEVEL SSL PRETEXT TASKS
All the above pretext tasks are designed based on either node-level or graph-level supervision. However, there is another final line of research combining these two sources of supervision to design pretext tasks, which we summarize in this section.

Velickovic et al. [57] proposed to maximize the mutual information between representations of high-level graphs and low-level patches. In each iteration, a negative sample X̂, Â is generated by corrupting the graph through shuffling node features and removing edges. Then a GNN-based encoder is applied to extract the node representations Z_GNN and Ẑ_GNN, which are also named the local patch representations. The local patch representations are further fed into an injective readout function to get the global graph representation z_{GNN,𝒢} = READOUT(Z_GNN). Then the mutual information between Z_GNN and z_{GNN,𝒢} is maximized by optimizing the following objective:

ℒ_ssl = (1/(|𝒫^+| + |𝒫^−|)) ( ∑_{i=1}^{|𝒫^+|} 𝔼_{(X,A)}[log σ(z^⊤_{GNN,i} W z_{GNN,𝒢})] + ∑_{j=1}^{|𝒫^−|} 𝔼_{(X̂,Â)}[log(1 − σ(z̃^⊤_{GNN,j} W z_{GNN,𝒢}))] ),   (53)

where |𝒫^+| and |𝒫^−| are the numbers of positive and negative pairs, σ stands for any nonlinear activation function (PReLU is used in [57]), and z^⊤_{GNN,i} W z_{GNN,𝒢} calculates the weighted similarity between the patch representation centered at node v_i and the graph representation. A linear classifier is used afterwards to classify nodes following the above contrastive pretext task.

Similar to Velickovic et al. [57], where the mutual information between the patch representations and the graph representation is maximized, Hassani et al. [21] proposed another framework that contrasts the node representations of one view with the graph representation of another view. The first view is the original graph and the second view is generated by a graph diffusion matrix. The heat and personalized PageRank (PPR) diffusion matrices are considered, which are:

S^heat = exp(t A D^{−1} − t),   (54)
S^PPR = α ( I_n − (1 − β) D^{−1/2} A D^{−1/2} )^{−1},   (55)

where β denotes the teleport probability, t is the diffusion time, and D is the diagonal degree matrix. After the diffusion matrix is obtained, two different GNN encoders followed by a shared projection head are applied to the nodes of the original graph adjacency matrix and of the generated diffusion matrix to get two different node embeddings Z^1_GNN and Z^2_GNN. Two different graph embeddings z^1_{GNN,𝒢} and z^2_{GNN,𝒢} are further obtained by applying a graph pooling function to the node representations (before the projection head), followed by another shared projection head. The mutual information between nodes and graphs in different views is maximized through:

ℒ_ssl = −(1/|𝒱|) ∑_{v_i∈𝒱} ( MI(z^1_{GNN,i}, z^2_{GNN,𝒢}) + MI(z^2_{GNN,i}, z^1_{GNN,𝒢}) ),   (56)

where MI represents the mutual information estimator; four estimators are explored: the noise-contrastive estimator, the Jensen-Shannon estimator, the normalized temperature-scaled cross-entropy, and the Donsker-Varadhan representation of the KL-divergence. Note that the mutual information in Eq. (56) is averaged over all graphs in the original work [21]. Additionally, their results demonstrate that the Jensen-Shannon estimator achieves better results across all graph classification tasks, whereas in the node classification task, noise-contrastive estimation achieves better results. They also discover that increasing the number of views does not increase the performance on downstream tasks.
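A minimal sketch of the bilinear patch-graph scoring behind Eq. (53) is shown below (assuming PyTorch; the feature-shuffling corruption, the sigmoid-of-mean readout, and the binary cross-entropy form follow the DGI-style construction of [57], written here in a simplified, non-batched way).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchGraphDiscriminator(nn.Module):
    """Scores the agreement between node (patch) embeddings and the graph summary
    via the bilinear form z_i^T W z_G used in Eq. (53)."""
    def __init__(self, d_prime: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_prime, d_prime) * 0.01)

    def forward(self, Z_nodes, z_graph):
        return Z_nodes @ self.W @ z_graph               # one score per node

def dgi_loss(extractor, disc, X, A):
    Z_pos = extractor(X, A)                             # real patch representations
    X_corrupt = X[torch.randperm(X.size(0))]            # corruption: shuffle node features
    Z_neg = extractor(X_corrupt, A)
    z_graph = torch.sigmoid(Z_pos.mean(dim=0))          # summary of the real graph
    pos_scores = disc(Z_pos, z_graph)
    neg_scores = disc(Z_neg, z_graph)
    labels = torch.cat([torch.ones_like(pos_scores), torch.zeros_like(neg_scores)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos_scores, neg_scores]), labels)
```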
REFERENCES
[21] Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive Multi-View Representation Learning on Graphs. In International Conference on Machine Learning. PMLR, 4116–4126.
[22] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In International Conference on Learning Representations.
[23] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 1857–1867.
[24] Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, and Yizhou Sun. 2019. Pre-training graph neural networks for generic structural feature extraction. arXiv preprint arXiv:1905.13728 (2019).
[25] Dasol Hwang, Jinyoung Park, Sunyoung Kwon, KyungMin Kim, Jung-Woo Ha, and Hyunwoo J Kim. 2020. Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs. In Advances in Neural Information Processing Systems, Vol. 33. 10294–10305.
[26] Soobeom Jang, Seong-Eun Moon, and Jong-Seok Lee. 2019. Brain signal classification via learning connectivity structure. arXiv preprint arXiv:1905.11678 (2019).
[27] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning. In 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, and Xindong Wu (Eds.). IEEE, 222–231.
[28] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. 2020. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141 (2020).
[29] Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2021. Node Similarity Preserving Graph Convolutional Networks (WSDM '21). ACM, 148–156.
[30] George Karypis and Vipin Kumar. 1995. Multilevel graph partitioning schemes. In ICPP (3). 113–122.
[31] George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.
[32] Dongkwan Kim and Alice Oh. 2021. How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision. In International Conference on Learning Representations.
[33] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR.
[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
[35] Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. 2020. Contrastive representation learning: A framework and review. IEEE Access (2020).
[36] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 7871–7880.
[37] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[38] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. ACL, 1412–1421.
[39] Franco Manessi and Alessandro Rozza. 2020. Graph-Based Neural Network Models with Multiple Self-Supervised Auxiliary Tasks. arXiv preprint arXiv:2011.07267 (2020).
[40] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings.
[41] Jennifer Neville and David Jensen. 2000. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. 13–20.
[42] Mark Newman. 2018. Networks. Oxford University Press.
[43] Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 69–84.
[44] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Advances in Neural Information Processing Systems, Vol. 29.
[45] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[46] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. 2018. Adversarially Regularized Graph Autoencoder for Graph Embedding. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI. 2609–2615.
[47] Liam Paninski. 2003. Estimation of entropy and mutual information. Neural Computation 15, 6 (2003), 1191–1253.
[48] Jiwoong Park, Minsik Lee, Hyung Jin Chang, Kyuewang Lee, and Jin Young Choi. 2019. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6519–6528.
[49] Zhen Peng, Yixiang Dong, Minnan Luo, Xiao-Ming Wu, and Qinghua Zheng. 2020. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604 (2020).
[50] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1150–1160.
[51] Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence. 1044–1049.
[52] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data. Advances in Neural Information Processing Systems 33 (2020).
[53] Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
[54] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In 8th International Conference on Learning Representations, ICLR.
[55] Ke Sun, Zhouchen Lin, and Zhanxing Zhu. 2020. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5892–5899.
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 5998–6008.
[57] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In International Conference on Learning Representations (Poster).
[58] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
[59] Nicolas Vercheval, Hendrik De Bie, and Aleksandra Pizurica. 2020. Variational Auto-Encoders Without Graph Coarsening For Fine Mesh Learning. In IEEE International Conference on Image Processing, ICIP. IEEE, 2681–2685.
[60] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. 1096–1103.
[61] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[62] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. 2017. MGAE: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 889–898.
[63] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
[64] Yaochen Xie, Zhao Xu, Zhengyang Wang, and Shuiwang Ji. 2021. Self-Supervised Learning of Graph Neural Networks: A Unified Review. arXiv preprint arXiv:2102.10757 (2021).
[65] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[66] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning. PMLR, 5453–5462.
[67] Jiaxuan You, Jonathan Gomes-Selman, Rex Ying, and Jure Leskovec. 2021. Identity-aware Graph Neural Networks. arXiv preprint arXiv:2101.10320 (2021).
[68] Jiaxuan You, Zhitao Ying, and Jure Leskovec. 2020. Design Space for Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 33. 17009–17021.
[69] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems, Vol. 33. 5812–5823.
[70] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. 2020. When does self-supervision help graph convolutional networks?. In International Conference on Machine Learning. PMLR, 10871–10880.
[71] Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In European Conference on Computer Vision. Springer, 649–666.
[72] Shichang Zhang, Ziniu Hu, Arjun Subramonian, and Yizhou Sun. 2020. Motif-
Driven Contrastive Learning of Graph Representations. arXiv preprint
arXiv:2012.12533 (2020).
[73] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A
survey. IEEE Transactions on Knowledge and Data Engineering (2020).
[74] Qikui Zhu, Bo Du, and Pingkun Yan. 2020. Self-supervised training of graph
convolutional networks. arXiv preprint arXiv:2006.02380 (2020).
[75] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020.
Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131
(2020).
[76] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021.
Graph Contrastive Learning with Adaptive Augmentation. In Proceedings of The
Web Conference 2021 (WWW ’21). ACM, 12 pages.
[77] Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled
data with label propagation. (2002).
[78] Marinka Zitnik and Jure Leskovec. 2017. Predicting multicellular function through
multi-layer tissue networks. Bioinformatics 33, 14 (2017), i190–i198.
[79] Marinka Zitnik, Jure Leskovec, et al. 2018. Prioritizing network communities.
Nature communications 9, 1 (2018), 1–9.