0% found this document useful (0 votes)
80 views20 pages

Self-Supervised Learning Generative or Contrastive

This document provides a survey of self-supervised learning methods. It discusses three main categories of self-supervised learning based on their objectives: generative methods, contrastive methods, and generative-contrastive (adversarial) methods. The document reviews empirical self-supervised learning models in computer vision, natural language processing, and graph learning. It also discusses related theoretical analysis on why self-supervised learning works and outlines open problems and future directions.

Uploaded by

Nick Nikzad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views20 pages

Self-Supervised Learning Generative or Contrastive

This document provides a survey of self-supervised learning methods. It discusses three main categories of self-supervised learning based on their objectives: generative methods, contrastive methods, and generative-contrastive (adversarial) methods. The document reviews empirical self-supervised learning models in computer vision, natural language processing, and graph learning. It also discusses related theoretical analysis on why self-supervised learning works and outlines open problems and future directions.

Uploaded by

Nick Nikzad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO.

1, JANUARY 2023 857

Self-Supervised Learning:
Generative or Contrastive
Xiao Liu , Fanjin Zhang , Zhenyu Hou, Li Mian, Zhaoyu Wang,
Jing Zhang , and Jie Tang , Fellow, IEEE

Abstract—Deep supervised learning has achieved great success in the last decade. However, its defects of heavy dependence on
manual labels and vulnerability to attacks have driven people to find other paradigms. As an alternative, self-supervised learning (SSL)
attracts many researchers for its soaring performance on representation learning in the last several years. Self-supervised
representation learning leverages input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we
take a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph
learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their
objectives: generative, contrastive, and generative-contrastive (adversarial). We further collect related theoretical analysis on self-
supervised learning to provide deeper thoughts on why self-supervised learning works. Finally, we briefly discuss open problems and
future directions for self-supervised learning. An outline slide for the survey is provided1 .

Index Terms—Self-supervised learning, generative model, contrastive learning, deep learning

1 INTRODUCTION with fewer labels, fewer samples, and fewer trials. As a


promising alternative, self-supervised learning has drawn
EEP neural networks [71] have shown outstanding per-
D formance on various machine learning tasks, especially
on supervised learning in computer vision (image classifica-
massive attention for its data efficiency and generalization
ability, and many state-of-the-art models have been follow-
ing this paradigm. This survey will take a comprehensive
tion [26], [48], [54], semantic segmentation [39], [76]), natu-
look at the recent developing self-supervised learning
ral language processing (pre-trained language models [27],
models and discuss their theoretical soundness, including
[68], [75], [138], sentiment analysis [74], question answer-
frameworks such as Pre-trained Language Models (PTM),
ing [5], [29], [100], [139] etc.) and graph learning (node clas-
Generative Adversarial Networks (GAN), autoencoders
sification [53], [64], [95], [126], graph classification [7], [110],
and their extensions, Deep Infomax, and Contrastive Cod-
[146] etc.). Generally, the supervised learning is trained on a
ing. An outline slide is also provided.1
specific task with a large labeled dataset, which is randomly
The term “self-supervised learning” is first introduced in
divided for training, validation and test.
robotics, where training data is automatically labeled by
However, supervised learning is meeting its bottleneck.
leveraging the relations between different input sensor sig-
It relies heavily on expensive manual labeling and suffers
nals. Afterwards, machine learning community further
from generalization error, spurious correlations, and adver-
develops the idea. In the invited speech on AAAI 2020, The
sarial attacks. We expect the neural networks to learn more
Turing award winner Yann LeCun described self-super-
vised learning as ”the machine predicts any parts of its
input for any observed part”.2 Combining self-supervised
 Xiao Liu, Fanjin Zhang, and Zhenyu Hou are with the Department of
Computer Science and Technology, Tsinghua University, Beijing 100190, learning’s traditional definition and LeCun’s definition, we
China. E-mail: {liuxiao17, zfj17, hzy17}@mails.tsinghua.edu.cn. can further summarize its features as:
 Li Mian is with the Beijing Institute of Technonlogy, Beijing 100811,
China. E-mail: [email protected].  Obtain “labels” from the data itself by using a “semi-
 Zhaoyu Wang is with Anhui University, Anhui 230093, China. automatic” process.
E-mail: [email protected].
 Jing Zhang is with the Renmin University of China, Beijing 100872,  Predict part of the data from other parts.
China. E-mail: [email protected]. Specifically, the “other part” could be incomplete, trans-
 Jie Tang is with the Department of Computer Science and Technology, formed, distorted, or corrupted (i.e., data augmentation
Tsinghua University, Beijing 100190, China, and also with the Tsinghua
technique). In other words, the machine learns to ’recover’
National Laboratory for Information Science and Technology (TNList),
Beijing 100084, China. E-mail: [email protected]. whole, or parts of, or merely some features of its original
Manuscript received 16 June 2020; revised 17 March 2021; accepted 3 June 2021. input.
Date of publication 22 June 2021; date of current version 7 December 2022. People are often confused by the concepts of unsuper-
The work was supported in part by the National Key R&D Program of China vised learning and self-supervised learning. Self-supervised
under Grant 2018YFB1402600, in part by NSFC for Distinguished Young
Scholar under Grant 61825602, and in part by NSFC under Grant 61836013.
(Corresponding author: Jie Tang.)
Recommended for acceptance by L. Chen. 1. Slides at https://www.aminer.cn/pub/5ee8986f91e011e66831c59b/
Digital Object Identifier no. 10.1109/TKDE.2021.3090866 2. https://aaai.org/Conferences/AAAI-20/invited-speakers/

1041-4347 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
858 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

Fig. 1. An illustration to distinguish the supervised, unsupervised and Fig. 2. Number of publications and citations on self-supervised learning
self-supervised learning framework. In self-supervised learning, the during 2012-2020, from Microsoft Academic [108], [144]. Self-super-
“related information” could be another modality, parts of inputs, or vised learning is drawing tremendous attention in recent years.
another form of the inputs. Repainted from [25].
based methods, deep learning usually follows the so-called
learning can be viewed as a branch of unsupervised learn- “end-to-end” fashion (raw-data in, prediction out). It makes
ing since there is no manual label involved. However, nar- very few prior assumptions, which leads to over-fitting and
rowly speaking, unsupervised learning concentrates on biases in scenarios with little supervised data. Literature
detecting specific data patterns, such as clustering, commu- has shown that simple multi-layer perceptrons have a very
nity discovery, or anomaly detection, while self-supervised poor generalization ability (always assume a linear relation-
learning aims at recovering, which is still in the paradigm ship for out-of-distribution (OOD) samples) [135], which
of supervised settings. Fig. 1 provides a vivid explanation results in over-confident (and wrong) predictions.
of the differences between them. To conquer the fundamental OOD and generalization
There exist several comprehensive reviews related to problem, while numerous works focus on designing new
Pre-trained Language Models [96], Generative Adversarial architectures for neural networks, another simple yet effec-
Networks [130], autoencoders, and contrastive learning for tive solution is to enlarge the training dataset to make as
visual representation [57]. However, none of them concen- many samples “in-distribution”. However, the fact is,
trates on the inspiring idea of self-supervised learning itself. despite massive available unlabeled web data in this big
In this work, we collect studies from natural language proc- data era, high-quality data with human labeling could be
essing, computer vision, and graph learning in recent years costly. For example, Scale.ai,3 a data labeling company,
to present an up-to-date and comprehensive retrospective charges $6.4 per image for the image segmentation labeling.
on the frontier of self-supervised learning. To sum up, our An image segmentation dataset containing 10k+ high-qual-
contributions are: ity samples could cost up to a million-dollar.
The most crucial point for self-supervised learning’s suc-
We provide a detailed and up-to-date review of self-
cess is that it figures out a way to leverage the tremendous
supervised learning for representation. We introduce
amounts of unlabeled data that becomes available in the big
the background knowledge, models with variants,
data era. It a time for deep learning algorithms to get rid of
and important frameworks. One can easily grasp the
human supervision and turn back to data’s self-supervision.
frontier ideas of self-supervised learning.
The intuition of self-supervised learning is to leverage the
 We categorize self-supervised learning models into
data’s inherent co-occurrence relationships as the self-
generative, contrastive, and generative-contrastive
supervision, which could be versatile. For example, in the
(adversarial), with particular genres inner each one.
incomplete sentence “I like ____ apples”, a well-trained lan-
We demonstrate the pros and cons of each category.
guage model would predict “eating” for the blank (i.e., the
 We identify several open problems in this field, ana-
famous Cloze Test [117]) because it frequently co-occurs
lyze the limitations and boundaries, and discuss the
with the context in the corpora. We can summarize the
future direction for self-supervised representation
mainstream self-supervision into three general categories
learning.
(see Fig. 3) and detailed subsidiaries:
We organize the survey as follows. In Section 2, we intro-
duce the motivation of self-supervised learning. We also  Generative: train an encoder to encode input x into
present our categorization of self-supervised learning and a an explicit vector z and a decoder to reconstruct x
conceptual comparison between them. From Sections 3, 4, from z (e.g., the cloze test, graph generation)
and 5, we will introduce the empirical self-supervised learn-  Contrastive: train an encoder to encode input x into an
ing methods utilizing generative, contrastive and generative- explicit vector z to measure similarity (e.g., mutual
contrastive objectives. Finally, in Sections 6 and 7, we discuss information maximization, instance discrimination)
the open problems, future directions and our conclusions.  Generative-Contrastive (Adversarial): train an
encoder-decoder to generate fake samples and a dis-
criminator to distinguish them from real samples
2 MOTIVATION OF SELF-SUPERVISED LEARNING (e.g., GAN)
It is universally acknowledged that deep learning algo-
rithms are data-hungry. Compared to traditional feature- 3. https://scale.com/pricing

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 859

Fig. 3. Categorization of Self-supervised learning (SSL): Generative,


Contrastive and Generative-Contrastive (Adversarial). Fig. 4. Conceptual comparison between Generative, Contrastive, and
Generative-Contrastive methods.

Their main difference lies in model architectures and can be factorized as a product of conditionals
objectives. A detailed conceptual comparison is shown in
Fig. 4. Their architectures can be unified into two general X
T

components: the generator and the discriminator, and the max pu ðxÞ ¼ log pu ðxt jx1:t1 Þ; (1)
u
t¼1
generator can be further decomposed into an encoder and a
decoder. Different things are: where the probability of each variable is dependent on the
1) For latent distribution z: in generative and contras- previous variables.
tive methods, z is explicit and is often leveraged by In NLP, the objective of auto-regressive language model-
downstream tasks; while in GAN, z is implicitly ing is usually maximizing the likelihood under the forward
modeled. autoregressive factorization [138]. GPT [98] and GPT-2 [99]
2) For discriminator: the generative method does not use Transformer decoder architecture [125] for language
have a discriminator while GAN and contrastive model. Different from GPT, GPT-2 removes the fine-tuning
have. Contrastive discriminator has comparatively processes of different tasks. To learn unified representations
that generalize across different tasks, GPT-2 models
fewer parameters (e.g., a multi-layer perceptron with
pðoutputjinput; taskÞ, which means given different tasks, the
2-3 layers) than GAN (e.g., a standard ResNet [48]).
3) For objectives: the generative methods use a recon- same inputs can have different outputs.
struction loss, the contrastive ones use a contrastive The auto-regressive models have also been employed in
similarity metric (e.g., InfoNCE), and the generative- computer vision, such as PixelRNN [124] and Pix-
contrastive ones leverage distributional divergence elCNN [122]. The general idea is to use auto-regressive
as the loss (e.g., JS-divergence, Wasserstein Distance). methods to model images pixel by pixel. For example, the
A properly designed training objective related to down- lower (right) pixels are generated by conditioning on the
stream tasks could turn our randomly initialized models upper (left) pixels. The pixel distributions of PixelRNN and
into excellent pre-trained feature extractors. For example, PixelCNN are modeled by RNN and CNN, respectively.
contrastive learning is found to be useful for almost all For 2D images, auto-regressive models can only factorize
probabilities according to specific directions (such as right
visual classification tasks. This is probably because the con-
and down). Therefore, masked filters are employed in CNN
trastive object is modeling the class-invariance between dif-
ferent image instances. The contrastive loss makes images architecture. Furthermore, two convolutional networks are
containing the same object class more similar. It makes combined to remove the blind spot in images. Based on Pix-
those containing different classes less similar, essentially elCNN, WaveNet [121] – a generative model for raw audio
accords with the downstream image classification, object was proposed. To deal with long-range temporal dependen-
detection, and other classification-based tasks. The art of cies, the authors develop dilated causal convolutions to
self-supervised learning primarily lies in defining proper improve the receptive field. Moreover, Gated Residual
objectives for unlabeled data. blocks and skip connections are employed to empower bet-
ter expressivity.
The auto-regressive models can also be applied to graph
3 GENERATIVE SELF-SUPERVISED LEARNING domain problems, such as graph generation. You et al. [141]
propose GraphRNN to generate realistic graphs with deep
This section will introduce important self-supervised learn-
ing methods based on generative models, including auto- auto-regressive models. They decompose the graph genera-
regressive (AR) models, flow-based models, auto-encoding tion process into a sequence generation of nodes and edges
(AE) models, and hybrid generative models. conditioned on the graph generated so far. The objective of
GraphRNN is defined as the likelihood of the observed
graph generation sequences. GraphRNN can be viewed as a
3.1 Auto-Regressive (AR) Model hierarchical model, where a graph-level RNN maintains the
Auto-regressive (AR) models can be viewed as “Bayes net state of the graph and generates new nodes, while an edge-
structure” (directed graph model). The joint distribution level RNN generates new edges based on the current graph

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
860 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

TABLE 1
An Overview of Recent Self-Supervised Representation Learning

Model FOS TypeGenerator Self-supervision Pretext Task Hard Hard NS strategy


NS PS
GPT/GPT-2 [98], [99] NLP G AR Following words Next word prediction - - -
PixelCNN [122], [124] CV G AR Following pixels Next pixel prediction - - -
NICE [30] CV G Flow Whole image Image reconstruction - - -
RealNVP [31] CV G based - - -
Glow [62] CV G - - -
word2vec [79], [80] NLP G AE Context words CBOW & SkipGram   End-to-end
FastText [10] NLP G AE CBOW   End-to-end
DeepWalk-based [43], Graph G AE Graph edges Link prediction   End-to-end
[92], [115]
VGAE [65] Graph G AE   End-to-end
BERT [27] NLP G AE Masked words Sentence Masked language model, Next - - -
topic senetence prediction
SpanBERT [58] NLP G AE Masked words Masked language model - - -
ALBERT [68] NLP G AE Masked words Sentence Masked language model, Sentence - - -
order order prediction
ERNIE [114], [149] NLP G AE Masked words Sentence Masked language model, Next - - -
topic senetence prediction
GPT-GNN [52] Graph G AE Attribute & Edge Masked graph generation - - -
VQ-VAE 2 [101] CV G AE Whole image Image reconstruction - - -
XLNet [138] NLP G AE+AR Masked words Permutation language model - - -
GraphAF [107] Graph G Flow+AR Attribute & Edge Sequential graph generation - - -
RelativePosition [32] CV C - Spatial relations (Context- Relative postion prediction - - -
CDJP [61] CV C - Instance) Jigsaw + Inpainting + Colorization   End-to-end
PIRL [81] CV C - Jigsaw  ✓ Memory bank
RotNet [38] CV C - Rotation Prediction - - -
Deep InfoMax [49] CV C - Belonging (Context- MI Maximization   End-to-end
AMDIM [6] CV C - Instance)  ✓ End-to-end
CPC [88] CV C -   End-to-end
InfoWord [66] NLP C -   End-to-end
DGI [127] Graph C - ✓  End-to-end
InfoGraph [110] Graph C -   End-to-end
(batch-wise)
CMC-Graph [46] Graph C -  ✓ End-to-end
S2 GRL [90] Graph C -   End-to-end
Pre-trained GNN [51] Graph C - Belonging Node attributes MI maximization, Masked attribute   End-to-end
prediction
DeepCluster [13] CV C - Similarity (Instance- Cluster discrimination - - -
Local Aggregation [152] CV C - Instance) - - -
ClusterFit [136] CV C - - - -
SwAV [14] CV C - - ✓ End-to-end
SEER [41] CV C - - ✓ End-to-end
M3S [113] Graph C - - - -
InstDisc [132] CV C - Identity (Instance- Instance discrimination   Memory bank
CMC [118] CV C - Instance)  ✓ End-to-end
MoCo [47] CV C -   Momentum
MoCo v2 [18] CV C -  ✓ Momentum
SimCLR [15] CV C -  ✓ End-to-end
InfoMin [119] CV C -  ✓ End-to-end
BYOL [42] CV C - no NS ✓ End-to-end
ReLIC [82] CV C -  ✓ End-to-end
SimSiam [19] CV C - no NS ✓ End-to-end
SimCLR v2 (semi) [16] CV C -  ✓ End-to-end
GCC [94] Graph C -  ✓ Momentum
GraphCL [142] Graph C -  ✓ End-to-end
GAN [40] CV G-C AE Whole image Image reconstruction - - -
Adversarial AE [77] CV G-C AE - - -
BiGAN/ALI [33], [36] CV G-C AE - - -
BigBiGAN [34] CV G-C AE - - -
Colorization [69] CV G-C AE Image color Colorization - - -
Inpainting [89] CV G-C AE Parts of images Inpainting - - -
Super-resolution [72] CV G-C AE Details of images Super-resolution - - -
ELECTRA [21] NLP G-C AE Masked words Replaced token detection ✓  End-to-end

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 861

TABLE 1
(Continued )

Model FOS TypeGenerator Self-supervision Pretext Task Hard Hard NS strategy


NS PS
WKLM [134] NLP G-C AE Masked entities Replaced entity detection ✓  End-to-end
ANE [23] Graph G-C AE Graph edges Link prediction - - -
GraphGAN [128] Graph G-C AE - - -
GraphSGAN [28] Graph G-C AE Graph nodes Node classification - - -

For acronyms used, “FOS” refers to fields of study; “NS” refers to negative samples; “PS” refers to positive samples; “MI” refers to mutual information. For
alphabets in “Type”: G Generative; C Contrastive; G-C Generative-Contrastive (Adversarial). For symbols in “Hard NS” and “Hard PS”, “-” means not applica-
ble, “” means not adopted, “✓”’ means adopted; “no NS” particularly means not using negative samples in instance-instance contrast.

state. After that, MRNN [93] and GCPN [140] are proposed 3.3 Auto-Encoding (AE) Model
as auto-regressive approaches. MRNN and GCPN both use The auto-encoding model’s goal is to reconstruct (part of)
a reinforcement learning framework to generate molecule inputs from (corrupted) inputs. Due to its flexibility, the AE
graphs through optimizing domain-specific rewards. How- model is probably the most popular generative model with
ever, MRNN mainly uses RNN-based networks for state many variants.
representations, but GCPN employs GCN-based encoder
networks. 3.3.1 Basic AE Model
The advantage of auto-regressive models is that they can
Autoencoder (AE) was first introduced in [8] for pre-train-
model the context dependency well. However, one short-
ing artificial neural networks. Before autoencoder,
coming of the AR model is that the token at each position
Restricted Boltzmann Machine (RBM) [109] can also be
can only access its context from one direction.
viewed as a special “autoencoder”. RBM is an undirected
graphical model, and it only contains two layers: the visible
3.2 Flow-Based Model
layer and the hidden layer. The objective of RBM is to mini-
The goal of flow-based models is to estimate complex high- mize the difference between the marginal distribution of
dimensional densities pðxÞ from data. Intuitively, directly models and data distributions. In contrast, an autoencoder
formalizing the densities is difficult. To obtain a compli- can be regarded as a directed graphical model, and it can be
cated densities, we hope to generate it “step by step” by trained more efficiently. Autoencoder is typically for
stacking a series of transforming functions that describing dimensionality reduction. Generally, the autoencoder is a
different data characteristics respectively. Generally, flow- feed-forward neural network trained to produce its input at
based models first define a latent variable z which follows a the output layer. The AE is comprised of an encoder network
known distribution pZ ðzÞ. Then define z ¼ fu ðxÞ, where fu is 0
h ¼ fenc ðxÞ and a decoder network x ¼ fdec ðhÞ. The objective
an invertible and differentiable function. The goal is to learn 0
of AE is to make x and x as similar as possible (such as
the transformation between x and z so that the density of x through mean-square error). It can be proved that the linear
can be depicted. According to the integral rule, pu ðxÞdx ¼ autoencoder corresponds to the PCA method. Sometimes
pðzÞdz. Therefore, the densities of x and z satisfies the number of hidden units is greater than the number of
  input units, and some interesting structures can be discov-
 @fu ðxÞ 

pu ðxÞ ¼ pðfu ðxÞÞ ; (2) ered by imposing sparsity constraints on the hidden
@x  units [85].
and the objective is to maximize the likelihood
3.3.2 Context Prediction Model (CPM)
X X
max log pu ðxðiÞ Þ ¼ max log pZ ðfu ðxðiÞ ÞÞ The idea of the Context Prediction Model (CPM) is to pre-
u u
i i dict contextual information based on inputs.
 
 @fu ðiÞ  In NLP, when it comes to self-supervised learning on

þ log  ðx Þ: (3)
@x word embedding, CBOW and Skip-Gram [80] are pioneer-
ing works. CBOW aims to predict the input tokens based on
The advantage of flow-based models is that the mapping context tokens. In contrast, Skip-Gram aims to predict con-
between x and z is invertible. However, it also requires that text tokens based on input tokens. Usually, negative sam-
x and z must have the same dimension. fu needs to be care- pling is employed to ensure computational efficiency and
fully designed since it should be invertible and the Jacobian scalability. Following CBOW architecture, FastText [10] is
determinant in Eq. (2) should also be calculated easily. proposed by utilizing subword information.
NICE [30] and RealNVP [31] design affine coupling layer to Inspired by the progress of word embedding models in
parameterize fu . The core idea is to split x into two blocks NLP, many network embedding models are proposed
ðx1 ; x2 Þ and apply a transformation from ðx1 ; x2 Þ to ðz1 ; z2 Þ based on a similar context prediction objective. Deep-
in an auto-regressive manner, that is z1 ¼ x1 and z2 ¼ walk [92] samples truncated random walks to learn latent
x2 þ mðx1 Þ. More recently, Glow [62] was proposed and it node embedding based on the Skip-Gram model. It treats
introduces invertible 1  1 convolutions and simplifies random walks as the equivalent of sentences. However,
RealNVP. another network embedding approach LINE [115] aims to

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
862 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

generate neighbors rather than nodes on a path based on log-likelihood of data is maximized during training.
current nodes
X log pðxÞ  DKL ðqðzjxÞjjpðzÞÞ þ EqðzjxÞ ½log pðxjzÞ; (5)
O¼ wij log pðvj jvi Þ; (4)
ði;jÞ2E
where pðxÞ is evidence probability, pðzÞ is prior and pðxjzÞ is
likelihood probability. The right-hand side of the above
where E denotes edge set, v denotes the node, wij represents
equation is called ELBO. From the auto-encoding perspec-
the weight of edge ðvi ; vj Þ. LINE also uses negative sampling
tive, the first term of ELBO is a regularizer forcing the poste-
to sample multiple negative edges to approximate the
rior to approximate the prior. The second term is the
objective.
likelihood of reconstructing the original input data based
on latent variables.
3.3.3 Denoising AE Model Variational Autoencoders (VAE) [63] is one important
The intuition of denoising autoencoder models is that repre- example where variational inference is utilized. VAE assumes
sentation should be robust to the introduction of noise. The the prior pðzÞ and the approximate posterior qðzjxÞ both fol-
masked language model (MLM), one of the most successful low Gaussian distributions. Specifically, let pðzÞ  N ð0; 1Þ.
architectures in natural language processing, can be Furthermore, reparameterization trick is utilized for modeling
regarded as a denoising AE model. To model text sequence, approximate posterior qðzjxÞ. Assume z  N ðm; s 2 Þ, z ¼
the masked language model (MLM) randomly masks some m þ s where   N ð0; 1Þ. Both m and s are parameterized by
of the tokens from the input and then predicts them based neural networks. Based on calculated latent variable z,
on their context information, which is similar to the Cloze decoder network is utilized to reconstruct the input data.
task [117]. BERT [27] is the most representative work in this Recently, a novel and powerful variational AE model
field. Specifically, in BERT, a unique token [MASK] is intro- called VQ-VAE [123] was proposed. VQ-VAE aims to learn
duced in the training process to mask some tokens. How- discrete latent variables motivated by the fact that many
ever, one shortcoming of this method is that there are no modalities are inherently discrete, such as language, speech,
input [MASK] tokens for down-stream tasks. To mitigate and images. VQ-VAE relies on vector quantization (VQ) to
this, the authors do not always replace the predicted tokens learn the posterior distribution of discrete latent variables.
with [MASK] in training. Instead, they replace them with The discrete latent variables are calculated by the nearest
original words or random words with a small probability. neighbor lookup using a shared, learnable embedding table.
Following BERT, many extensions of MLM emerge. In training, the gradients are approximated through
SpanBERT [58] chooses to mask continuous random spans straight-through estimator [9] as
rather than random tokens adopted by BERT. Moreover, it
trains the span boundary representations to predict the Lðx; DðeÞÞ ¼ kx  DðeÞk22 þ ksg½EðxÞ  ek22 þ bksg½e  EðxÞk22 ;
masked spans, inspired by ideas in coreference resolution.
(6)
ERNIE (Baidu) [114] masks entities or phrases to learn
entity-level and phrase-level knowledge, which obtains where e refers to the codebook, the operator sg refers to a
good results in Chinese natural language processing tasks. stop-gradient operation that blocks gradients from flowing
ERNIE (Tsinghua) [149] further integrates knowledge (enti- into its argument, and b is a hyperparameter which controls
ties and relations) in knowledge graphs into language the reluctance to change the code corresponding to the
models. encoder output.
Compared with the AR model, in denoising AE for lan- More recently, researchers propose VQ-VAE-2 [101],
guage modeling, the predicted tokens have access to contex- which can generate versatile high-fidelity images that rival
tual information from both sides. However, the fact that BigGAN [11] on ImageNet [26], the state-of-the-art GAN
MLM assumes the predicted tokens are independent if the model. First, the authors enlarge the scale and enhance the
unmasked tokens are given (which does not hold in reality) autoregressive priors by a powerful PixelCNN [122] prior.
has long been considered as its inherent drawback. Additionally, they adopt a multi-scale hierarchical organi-
In graph learning, Hu et al. [52] proposes GPT-GNN, a zation of VQ-VAE, which enables learning local information
generative pre-training method for graph neural networks. and global information of images separately. Nowadays,
It also leverages the graph masking techniques and then VAE and its variants have been widely used in the com-
asks the graph neural network to generate masked edges puter vision area, such as image representation learning,
and attributes. GPT-GNN’s wide range of experiments on image generation, video generation.
OAG [108], [116], [144], the largest public, academic graph Variational auto-encoding models have also been
with 100 million nodes and 2 billion edges, shows impres- employed in node representation learning on graphs. For
sive improvements on various graph learning tasks. example, Variational graph auto-encoder (VGAE) [65] uses
the same variational inference technique as VAE with graph
3.3.4 Variational AE Model convolutional networks (GCN) [64] as the encoder. Due to
The variational auto-encoding model assumes that data are the uniqueness of graph-structured data, the objective of
generated from underlying latent (unobserved) representa- VGAE is to reconstruct the adjacency matrix of the graph by
tion. The posterior distribution over a set of unobserved measuring node proximity. Zhu et al. [150] propose DVNE, a
variables Z ¼ fz1 ; z2 ; . . . ; zn g given some data X is approxi- deep variational network embedding model in Wasserstein
mated by a variational distribution qðzjxÞ  pðzjxÞ. In varia- space. It learns Gaussian node embedding to model the
tional inference, the evidence lower bound (ELBO) on the uncertainty of nodes. 2-Wasserstein distance is used to

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 863

measure the similarity between the distributions for its effec- is utilized to convert discrete data (including node types and
tiveness in preserving network transitivity. vGraph [111] can edge types) into continuous data.
perform node representation learning and community detec-
tion collaboratively through a generative variational infer- 3.5 Pros and Cons
ence framework. It assumes that each node can be generated A reason for the generative self-supervised learning’s suc-
from a mixture of communities, and each community is cess in self-supervised learning is its ability to recover the
defined as a multinomial distribution over nodes. original data distribution without assumptions for down-
stream tasks, which enables generative models’ wide appli-
3.4 Hybrid Generative Models cations in both classification and generation. Notably, all
3.4.1 Combining AR and AE Model the existing generation tasks (including text, image, and
audio) rely heavily on generative self-supervised learning.
Some researchers propose to combine the advantages of
Nevertheless, two shortcomings restrict its performance.
both AR and AE. MADE [78] makes a simple modification
First, despite its central status in generation tasks, gener-
to autoencoder. It masks the autoencoder’s parameters to
ative self-supervised learning is recently found far less com-
respect auto-regressive constraints. Specifically, for the orig-
petitive than contrastive self-supervised learning in some
inal autoencoder, neurons between two adjacent layers are
classification scenarios because contrastive learning’s goal
fully-connected through MLPs. However, in MADE, some
naturally conforms the classification objective. Works
connections between adjacent layers are masked to ensure
including MoCo [47], SimCLR [15], BYOL [42] and
that each input dimension is reconstructed solely from its
SwAV [14] have presented overwhelming performances on
dimensions. MADE can be easily parallelized on condi-
various CV benchmarks. Nevertheless, in the NLP domain,
tional computations, and it can get direct and cheap esti-
researchers still depend on generative language models to
mates of high-dimensional joint probabilities by combining
conduct text classification.
AE and AR models.
Second, the point-wise nature of the generative objective
In NLP, Permutation Language Model (PLM) [138] is a
has some inherent defects. This objective is usually P formu-
representative model that combines the advantage of auto-
lated as a maximum likelihood function LMLE ¼  x log
regressive model and auto-encoding model. XLNet [138],
pðxjcÞ where x is all the samples we hope to model, and c is a
which introduces PLM, is a generalized auto-regressive pre-
conditional constraint such as context information. Consider-
training method. XLNet enables learning bidirectional con-
ing its form, MLE has two fatal problems:
texts by maximizing the expected likelihood over all
permutations of the factorization order. To formalize the 1) Sensitive and Conservative Distribution. When pðxjcÞ !
idea, let ZT denotes the set of all possible permutations of 0, LMLE becomes super large, making generative
the length-T index sequence ½1; 2; . . . ; T , the objective of model extremely sensitive to rare samples. It directly
PLM can be expressed as follows: leads to a conservative distribution, which has a low
" # performance.
XT
max EzZT log pu ðxzt jxz < t Þ : (7) 2) Low-level Abstraction Objective. In MLE, the represen-
u tation distribution is modeled at x’s level (i.e., point-
t¼1
wise level), such as pixels in images, words in texts,
Actually, for each text sequence, different factorization and nodes in graphs. However, most of the classifi-
orders are sampled. Therefore, each token can see its contex- cation tasks target at high-level abstraction, such as
tual information from both sides. Based on the permuted object detection, long paragraph understanding, and
order, XLNet also conducts reparameterization with posi- molecule classification.
tions to let the model know which position is needed to pre- and as an opposite approach, generative-contrastive self-
dict. Then a special two-stream self-attention is introduced supervised learning abandons the point-wise objective. It
for target-aware prediction. turns to distributional matching objectives that are more
Furthermore, different from BERT, inspired by the latest robust and better handle the high-level abstraction chal-
advancements in the AR model, XLNet integrates the seg- lenge in the data manifold.
ment recurrence mechanism and relative encoding scheme
of Transformer-XL [24] into pre-training, which can model 4 CONTRASTIVE SELF-SUPERVISED LEARNING
long-range dependency better than Transformer [125]. From a statistical perspective, machine learning models are
categorized into generative and discriminative models.
3.4.2 Combining AE and Flow-Based Models Given the joint distribution P ðX; Y Þ of the input X and tar-
In the graph domain, GraphAF [107] is a flow-based auto- get Y , the generative model calculates the pðXjY ¼ yÞ while
regressive model for molecule graph generation. It can gen- the discriminative model tries to model the P ðY jX ¼ xÞ.
erate molecules in an iterative process and also calculate the Because most of the representation learning tasks hope to
exact likelihood in parallel. GraphAF formalizes molecule model relationships between x, for a long time, people
generation as a sequential decision process. It incorporates believe that the generative model is the only choice for
detailed domain knowledge into the reward design, such as representation learning.
valency check. Inspired by the recent progress of flow-based However, recent breakthroughs in contrastive learning,
models, it defines an invertible transformation from a base such as Deep InfoMax, MoCo and SimCLR, shed light on
distribution (e.g., multivariate Gaussian) to a molecular the potential of discriminative models for representation.
graph structure. Additionally, Dequantization technique [50] Contrastive learning aims at ”learn to compare” through a

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
864 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

Fig. 5. Architecture of VQ-VAE [123]. Compared to VAE, the orginal hid-


den distribution is replaced with a quantized vector dictionary. In addition,
the prior distribution is replaced with a pre-trained PixelCNN that models
the hierarchical features of images. Taken from [123].

Noise Contrastive Estimation (NCE) [44] objective format-


ted as
" T þ
#
efðxÞ fðx Þ Fig. 7. Self-supervised representation learning performance on Image-
L ¼ Ex;xþ ;x logð ; (8)
efðxÞ fðxþ Þ þ efðxÞ fðx Þ
T T Net top-1 accuracy in March, 2021, under linear classification protocol.
The self-supervised learning’s ability on feature extraction is rapidly
approaching the supervised method (ResNet50). Except for BigBiGAN,
where xþ is similar to x, x is dissimilar to x and f is an
all the models above are contrastive self-supervised learning methods.
encoder (representation function). The similarity measure
and encoder may vary from task to task, but the framework
(such as understanding what an elephant looks like
remains the same. With more dissimilar pairs involved, we
is critical for predicting relative position between its
have the InfoNCE [88] formulated as
" # head and tail).
þ
T
efðxÞ fðx Þ  MI focuses on learning the direct belonging relation-
L ¼ Ex;xþ ;xk logð P : (9) ships between local parts and global context. The rel-
efðxÞ fðxþ Þ þ K
T fðxÞT fðxk Þ
k¼1 e
ative positions between local parts are ignored.
Here we divide recent contrastive learning frameworks
into 2 types: context-instance contrast and instance-instance 4.1.1 Predict Relative Position
contrast. Both of them achieve amazing performance in
downstream tasks, especially on classification problems Many data contain rich spatial or sequential relations
under the linear protocol. between parts of it. For example, in image data such as
Fig. 8, the elephant’s head is on the right of its tail. In text
data, a sentence like ”Nice to meet you.” would probably be
4.1 Context-Instance Contrast
ahead of ”Nice to meet you, too.”. Various models regard
The context-instance contrast, or so-called global-local con-
recognizing relative positions between parts of it as the pre-
trast, focuses on modeling the belonging relationship
text task [57]. It could be to predict relative positions of two
between the local feature of a sample and its global context
patches from a sample [32], or to recover positions of shuf-
representation. When we learn the representation for a local
fled segments of an image (solve jigsaw) [61], [86], [131], or
feature, we hope it is associative to the representation of the
to infer the rotation angle’s degree of an image [38]. PRP
global content, such as stripes to tigers, sentences to its para-
may also serve as tools to create hard positive samples. For
graph, and nodes to their neighborhoods.
instance, the jigsaw technique is applied in PIRL [81] to aug-
There are two main types of Context-Instance Contrast:
ment the positive sample, but PIRL does not regard solving
Predict Relative Position (PRP) and Maximize Mutual Infor-
jigsaw and recovering spatial relation as its objective.
mation (MI). The differences between them are:
In the pre-trained language model, similar ideas such as
 PRP focuses on learning relative positions between Next Sentence Prediction (NSP) are also adopted. NSP loss
local components. The global context serves as an is initially introduced by BERT [27], where for a sentence,
implicit requirement for predicting these relations the model is asked to distinguish the following and a ran-
domly sampled one. However, some later work empirically
proves that NSP helps little, even harm the performance. So
in RoBERTa [75], the NSP loss is removed.

Fig. 6. Illustration for permutation language modeling [138] objective for


predicting x3 given the same input sequence x but with different factori- Fig. 8. Three typical methods for spatial relation contrast: predict relative
zation orders. Adapted from [138]. position [32], rotation [38] and solve jigsaw [61], [81], [86], [131].

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 865

To replace NSP, ALBERT [68] proposes Sentence Order In language pre-training, InfoWord [66] proposes to max-
Prediction (SOP) task. That is because, in NSP, the negative imize the mutual information between a global representa-
next sentence is sampled from other passages that may tion of a sentence and n-grams in it. The context is induced
have different topics from the current one, turning the NSP from the sentence with selected n-grams being masked, and
into a far easier topic model problem. In SOP, two sentences the negative contexts are randomly picked out from the
that exchange their position are regarded as a negative sam- corpus.
ple, making the model concentrate on the semantic mean- In graph learning, Deep Graph InfoMax (DGI) [127]
ing’s coherence. regards a node’s representation as the local feature and the
average of randomly samples 2-hop neighbors as the con-
4.1.2 Maximize Mutual Information text. However, it is hard to generate negative contexts on a
This kind of method derives from mutual information (MI) single graph. To solve this problem, DGI proposes to corrupt
– a fundamental concept in statistics. Mutual information the original context by keeping the sub-graph structure and
targets modeling the association between two variables, permuting the node features. DGI is followed by many
and our objective is to maximize it. Generally, this kind of works, such as InfoGraph [110], which targets learning
models optimize graph-level representation rather than node level, maximiz-
ing the mutual information between graph-level representa-
max Iðg1 ðx1 Þ; g2 ðx2 ÞÞ; (10) tion and substructures at different levels. As what CMC has
g1 2G1 ;g2 2G1
done to improve Deep InfoMax, in [46] authors propose a
where gi is the representation encoder, Gi is a class of contrastive multi-view representation learning method for
encoders with some constraints, and Ið ; Þ is a sample-based the graph. They also discover that graph diffusion is the
estimator for the accurate mutual information. In applica- most effective way to yield augmented positive sample
tions, MI is notorious for its complex computation. A com- pairs in graph learning.
mon practice is to alternatively maximize I’s lower bound As an attempt to unify graph pre-training, in [51], the
with an NCE objective. authors systematically analysis the pre-training strategies
Deep InfoMax [49] is the first one to explicitly model for graph neural networks from two dimensions: attribute/
mutual information through a contrastive learning task, structural and node-level/graph-level. For structural pre-
which maximize the MI between a local patch and its global diction at node-level, they propose Context Prediction to
context. For real practices, take image classification as an maximize the MI between the k-hop neighborhood’s repre-
example, we can encode a cat image x into fðxÞ 2 RMMd , sentations and its context graph. For attributes in the chemi-
and take out a local feature vector v 2 Rd . To conduct con- cal domain, they propose Attribute Mask to predict a
trast between instance and context, we need two other node’s attribute according to its neighborhood, which is a
things: generative objective similar to token masks in BERT.
S2 GRL [91] further separates nodes in the context graph
 a summary function g:RMMd ! Rd to generate the into k-hop context subgraphs and maximizes their MI with
context vector s ¼ gðfðxÞÞ 2 Rd target node, respectively. However, a fundamental problem
 another cat image x and its context vector s ¼ of graph pre-training is about learning inductive biases
gðfðx ÞÞ. across graphs, and existing graph pre-training work is only
and the contrastive objective is then formulated as applicable for a specific domain.
" T
!#
ev s
L ¼ Ev;x log vT s : (11)
e þ evT s 4.2 Instance-Instance Contrast
Though MI-based contrastive learning achieves great suc-
Deep InfoMax provides us with a new paradigm and cess, some recent studies [15], [18], [47], [120] cast doubt on
boosts the development of self-supervised learning. The the actual improvement brought by MI.
first influential follower is Contrastive Predictive Coding The [120] provides empirical evidence that the success of
(CPC) [88] for speech recognition. CPC maximizes the asso- the models mentioned above is only loosely connected to
ciation between a segment of audio and its context audio. MI by showing that an upper bound MI estimator leads to
To improve data efficiency, it takes several negative context ill-conditioned and lower performance representations.
vectors at the same time. Later on, CPC has also been Instead, more should be attributed to encoder architecture
applied in image classification. and a negative sampling strategy related to metric learning.
AMDIM [6] enhances the positive association between a A significant focus in metric learning is to perform hard
local feature and its context. It randomly samples two differ- positive sampling while increasing the negative sampling
ent views of an image (truncated, discolored, and so forth) efficiency. They probably play a more critical role in MI-
to generate the local feature vector and context vector, based models’ success.
respectively. CMC [118] extends it into several different As an alternative, instance-instance contrastive learning
views for one image and samples another irrelevant image discards MI and directly studies the relationships between
as the negative. However, CMC is fundamentally different different samples’ instance-level local representations as
from Deep InfoMax and AMDIM because it proposes to what metric learning does. Instance-level representation,
measure the instance-instance similarity rather than con- rather than context-level, is more crucial for a wide range of
text-instance similarity. We will discuss it in the following classification tasks. For example, in an image classified as
subsection. “dog”, while there must be dog instances, some other

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
866 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

irrelevant context objects such as grass might appear. But stages: one for pre-training and the other for evaluation.
what matters for the image classification is the dog rather ClusterFit [136] introduces a cluster prediction fine-tuning
than the context. Another example would be sentence emo- stage similar to DeepCluster between the above two stages,
tional classification, which primarily relies on few but which improves the representation’s performance on down-
important keywords. stream classification evaluation.
In the early stage of instance-instance contrastive Despite the previous success of cluster discrimination-
learning’s development, researchers borrow ideas from based contrastive learning, the two-stage training paradigm
semi-supervised learning to produce pseudo labels via clus- is time-consuming and poor performing compared to
ter-based discrimination and achieve rather good perfor- later instance discrimination-based methods, including
mance on representations. More recently, CMC [118], CMC [118], MoCo [47] and SimCLR [15]. These instance dis-
MoCo [47], SimCLR [15], and BYOL [42] further support the crimination-based methods have got rid of the slow cluster-
above conclusion by outperforming the context-instance ing stage and introduced efficient data augmentation (i.e.,
contrastive methods and achieve a competitive result to multi-view) strategies to boost the performance. In light of
supervised methods under the linear classification protocol. these problems, authors in SwAV [14] bring online cluster-
We will start with cluster-based discrimination proposed ing ideas and multi-view data augmentation strategies into
earlier and then turn to instance-based discrimination. the cluster discrimination approach. SwAV proposes a
swapped prediction contrastive objectives to deal with
multi-view augmentation. The intuition is that, given some
4.2.1 Cluster Discrimination (clustered) prototypes, different views of the same images
Instance-instance contrast is first studied in clustering-based should be assigned into the same prototypes. SwAV names
methods [13], [73], [87], [137], especially the DeepClus- this “assignment” as “codes”. To accelerate code comput-
ter [13] which first achieves competitive performance to the ing, the authors of SwAV design an online computing strat-
supervised model AlexNet [67]. egy. SwAV outperforms instance discrimination-based
Image classification asks the model to categorize images methods when model size is small and is more computa-
correctly, and the representation of images in the same cate- tionally efficient. Based on SwAV, a 1.3-billion-parameter
gory should be similar. Therefore, the motivation is to pull SEER [41] is trained on 1 billion web images collected from
similar images near in the embedding space. In supervised Instagram.
learning, this pulling-near process is accomplished via label In graph learning, M3S [113] adopts a similar idea to
supervision; in self-supervised learning, however, we do perform DeepCluster-style self-supervised pre-training for
not have such labels. To solve the label problem, Deep Clus- better semi-supervised prediction. Given little labeled
ter [13] proposes to leverage clustering to yield pseudo data and many unlabeled data, for every stage, M3S first
labels and asks a discriminator to predict images’ labels. pre-train itself to produce pseudo labels on unlabeled data
The training could be formulated in two steps. In the first as DeepCluster does and then compares these pseudo
step, DeepCluster uses K-means to cluster encoded repre- labels with those predicted by the model being supervised
sentation and produces pseudo labels for each sample. trained on labeled data. Only top-k confident labels are
Then in the second step, the discriminator predicts whether added into a labeled set for the next stage of semi-super-
two samples are from the same cluster and back-propagates vised training. In [143], this idea is further developed into
to the encoder. These two steps are performed iteratively. three pre-training tasks: topology partitioning (similar to
Recently, Local Aggregation (LA) [152] has pushed for- spectral clustering), node feature clustering, and graph
ward the cluster-based method’s boundary. It points out completion.
several drawbacks of DeepCluster and makes the corre-
sponding optimization. First, in DeepCluster, samples are
assigned to mutual-exclusive clusters, but LA identifies 4.2.2 Instance Discrimination
neighbors separately for each example. Second, DeepClus- The prototype of leveraging instance discrimination as a
ter optimizes a cross-entropy discriminative loss, while LA pretext task is InstDisc [132]. Based on InstDisc, CMC [118]
employs an objective function that directly optimizes a local proposes to adopt multiple different views of an image as
soft-clustering metric. These two changes substantially positive samples and take another one as the negative.
boost the performance of LA representation on downstream CMC draws near multiple views of an image in the embed-
tasks. ding space and pulls away from other samples. However, it
A similar work to LA would be VQ-VAE [101], [123] that is somehow constrained by the idea of Deep InfoMax,
we introduce in Section 3. To conquer the traditional defi- which only samples one negative sample for each positive
ciency for VAE to generate high-fidelity images, VQ-VAE one.
proposes quantizing vectors. For the feature matrix encoded In MoCo [47], researchers further develop the idea of
from an image, VQ-VAE substitutes each 1-dimensional leveraging instance discrimination via momentum contrast,
vector in the matrix to the nearest one in an embedding dic- which substantially increases the amount of negative sam-
tionary. This process is somehow the same as what LA is ples. For example, given an input image x, our intuition is
doing. to learn a instinct representation q ¼ fq ðxÞ by a query
Clustering-based discrimination may also help in the encoder fq ð Þ that can distinguish x from any other images.
generalization of other pre-trained models, transferring Therefore, for a set of other images xi , we employ an asyn-
models from pretext objectives to downstream tasks better. chronously updated key encoder fk ð Þ to yield kþ ¼ fk ðxÞ
Traditional representation learning models have only two and ki ¼ fk ðxi Þ, and optimize the following objective

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 867

Fig. 11. Cluster-based instance-instance contrastive emthods: Deep-


Cluster [13] and Local Aggregation [152]. In the embedding space,
DeepCluster uses clustering to yield pseudo labels for discrimination to
draw near similar samples. However, Local Aggregation shows that a
egocentric soft-clustering objective would be more effective.

making the positive pair far too easy to distinguish.


Fig. 9. Two representatives for mutual information’s application in con- PIRL [81] adds jigsaw augmentation as described in Sec-
trastive learning. Deep InfoMax (DIM) [49] first encodes an image into
feature maps, and leverage a read-out function (or so-called summary
tion 4.1.1. PIRL asks the encoder to regard an image and its
function) to produce a summary vector. AMDIM [6] enhances the DIM jigsawed one as similar pairs to produce a pretext-invariant
through randomly choosing another view of the image to produce the representation.
summary vector. In SimCLR [15], the authors further illustrate the impor-
tance of a hard positive sample strategy by introducing data
expðq kþ =tÞ
L ¼ log PK ; (12) augmentation in 10 forms. This data augmentation is similar
i¼0 expðq ki =tÞ to CMC [118], which leverages several different views to
augment the positive pairs. SimCLR follows the end-to-end
where K is the number of negative samples. This formula is training framework instead of momentum contrast from
in the form of InfoNCE. MoCo, and to handle the large-scale negative samples prob-
Besides, MoCo presents two other critical ideas in deal- lem, SimCLR chooses a batch size of N as large as 8196.
ing with negative sampling efficiency. The details are as follows. A minibatch of N samples is
 First, it abandons the traditional end-to-end training augmented to be 2N samples x^j ðj ¼ 1; 2; . . . ; 2NÞ. For a pair
framework. It designs the momentum contrast learn- of a positive sample x^i and x^j (derive from one original sam-
ing with two encoders (query and key), which pre- ple), other 2ðN  1Þ are treated as negative ones. A pairwise
vents the fluctuation of loss convergence in the contrastive loss NT-Xent loss [17] is defined as
beginning period.
 Second, to enlarge negative samples’ capacity, MoCo xi ; x^j Þ=tÞ
expðsimð^
li;j ¼ log P2N ; (13)
employs a queue (with K as large as 65536) to save xi ; x^k Þ=tÞ
k¼1 I½k6¼i expðsimð^
the recently encoded batches as negative samples.
This significantly improves the negative sampling noted that li;j is asymmetrical, and the simð ; Þ function here
efficiency. is a cosine similarity function that can normalize the repre-
There are some other auxiliary techniques to ensure the sentations. The summed up loss is
training convergence, such as batch shuffling to avoid trivial
solutions and temperature hyper-parameter t to adjust the 1 XN
L¼ ½l2i1;2i þ l2i;2i1 : (14)
scale. 2N k¼1
However, MoCo adopts a too simple positive sample
strategy: a pair of positive representations come from the SimCLR also provides some other practical techniques,
same sample without any transformation or augmentation, including a learnable nonlinear transformation between the
representation and the contrastive loss, more training steps,
and deeper neural networks. [18] conducts ablation studies
to show that techniques in SimCLR can also further improve
MoCo’s performance.
More investigation into augmenting positive samples is
made in InfoMin [119]. The authors claim that we should
select those views with less mutual information for better-
augmented views in contrastive learning. In the optimal sit-
uation, the views should only share the label information.
To produce such optimal views, the authors first propose an
Fig. 10. Deep Graph InfoMax [127] uses a readout function to generate unsupervised method to minimize mutual information
summary vector s1 , and puts it into a discriminator with node 1’s embed- between views. However, this may result in a loss of infor-
ding x1 and corrupted embedding x e1 respectively to identify which
embedding is the real embedding. The corruption is to shuffle the posi- mation for predicting labels (such as a pure blank view).
tions of nodes. Therefore, a semi-supervised method is then proposed to

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
868 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

Fig. 15. Illustration of typical ”recovering with partial input” methods: col-
Fig. 12. Conceptual comparison of three contrastive loss mechanisms. orization, inpainting and super-resolution. Given the original input on the
left, models are asked to recover it with different partial inputs given on
Taken from MoCo [47].
the right.

Fig. 16. The architecture of ELECTRA [21]. It follows GAN’s framework


but uses a two-stage training paradigm to avoid using policy gradient.
The MLM is Masked Language Model.

Fig. 13. Ten different views adopted by SIMCLR [15]. The enhancement
of positive samples substantially improves the self-supervised learning
performance. Taken from [15].

Fig. 17. The architecture of GraphSGAN [28], which investigates density


gaps in embedding space for classification problems. Taken from [28].

Fig. 14. The architecture of BYOL [42]. Noted that the online encoder using cross-entropy loss, they follow the regression para-
has an additional layer qu compared to the target one, which gives the digm in which mean square error is used as
representations some flexibility to be improved during the training. Taken
from [42].
0 2
hqu ðzu Þ; z0 i
LBYOL
u , jjqu ðzu Þ  z jj2 ¼ 2  2 : (15)
jjqu ðzu Þjj2 jjz0 jj2
find views sharing only label information. This technique
leads to an improve about 2% over MoCo v2. This not only makes the model better-performed in
A more radical step is made by BYOL [42], which dis- downstream tasks, but also more robust to smaller batch
cards negative sampling in self-supervised learning but size. In MoCo and SimCLR, a drop in batch size results in a
achieves an even better result over InfoMin. For contrastive significant decline in performance. However, in BYOL,
learning methods we mentioned above, they learn represen- although batch size still matters, it is far less critical. The
tations by predicting different views of the same image and ablation study shows that a batch size of 512 only causes a
cast the prediction problem directly in representation space. drop of 0.3% compared to a standard batch size of 4,096,
However, predicting directly in representation space can while SimCLR shows a drop of 1.4%.
lead to collapsed representations because multi-views are In SimSiam [19], researchers further study how necessary
generally too predictive for each other. Without negative is negative sampling, and even batch normalization in con-
samples, it would be too easy for the neural networks to dis- trastive representation learning. They show that the most
tinguish those positive views. critical component in BYOL is the stop gradient operation,
In BYOL, researchers argue that negative samples may which makes the target representation stable. SimSiam is
not be necessary in this process. They show that, if we use a proved to converge faster than MoCo, SimCLR, and BYOL
fixed randomly initialized network (which would not col- with even smaller batch sizes, while the performance only
lapse because it is not trained) to serve as the key encoder, slightly decreases.
the representation produced by query encoder would still Some other works are inspired by theoretical analysis
be improved during training. If then we set the target into the contrastive objective. ReLIC [82] argues that con-
encoder to be the trained query encoder and iterate this pro- trastive pre-training teaches the encoder to causally disen-
cedure, we would progressively achieve better perfor- tangle the invariant content (i.e., main objects) and style
mance. Therefore, BYOL proposes an architecture (Fig. 14) (i.e., environments) in an image. To better enforce this
with an exponential moving average strategy to update the observation in the data augmentation, they propose to add
target encoder just as MoCo does. Additionally, instead of an extra KL-divergence regularizer between prediction

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 869

logits of an image’s different views. The results show that unlabeled images. We then train a larger EfficientNet as a
this can enhance the models’ generalization ability and student model based on labeled and pseudo labeled images.
robustness and improve the performance. We iterate this process by putting back the student as the
In graph learning, Graph Contrastive Coding (GCC) [94] teacher. During the pseudo labels generation, the teacher is
is a pioneer to leverage instance discrimination as the pre- not noised so that the pseudo labels are as accurate as possi-
text task for structural information pre-training. For each ble. However, during the student’s learning, we inject noise
node, we sample two subgraphs independently by random such as dropout, stochastic depth, and data augmentation
walks with restart and use top eigenvectors from their nor- via RandAugment to the student to generalize better than
malized graph Laplacian matrices as nodes’ initial represen- the teacher.
tations. Then we use GNN to encode them and calculate the In light of semi-supervised self-training’s success, it is
InfoNCE loss as what MoCo and SimCLR do, where the natural to rethink its relationship with the self-supervised
node embeddings from the same node (in different sub- methods, especially with the successful contrastive pre-
graphs) are viewed as similar. Results show that GCC learns trained methods. In Section 4.2.1, we have introduced
better transferable structural knowledge than previous M3S [112] that attempts to combine cluster-based contras-
work such as struc2vec [103], GraphWave [35] and tive pre-training and downstream semi-supervised learn-
ProNE [145]. GraphCL [142] studies the data augmentation ing. For computer vision tasks, Zoph et al. [153] study the
strategies in graph learning. They propose four different MoCo pre-training and a self-training method in which a
augmentation methods based on edge perturbation and teacher is first trained on a downstream dataset (e.g.,
node dropping. It further demonstrates that the appropriate COCO) and then yield pseudo labels on unlabeled data
combination of these strategies can yield even better (e.g., ImageNet), and finally a student learns jointly over
performance. real labels on the downstream dataset and pseudo labels on
unlabeled data. They surprisingly find that pre-training’s
performance hurts while self-training still benefits from
4.3 Self-Supervised Contrastive Pre-Training for strong data augmentation. Besides, more labeled data
Semi-Supervised Self-Training diminishes the value of pre-training, while semi-supervised
While contrastive learning-based self-supervised learning self-training always improves. They also discover that the
continues to push the boundaries on various benchmarks, improvements from pre-training and self-training are
labels are still important because there is a gap between orthogonal to each other, i.e., contributing to the perfor-
training objectives of self-supervised learning and super- mance from different perspectives. The model with joint
vised learning. In other words, no matter how self-super- pre-training and self-training is the best.
vised learning models improve, they are still the only Chen et al. [16]’s SimCLR v2 supports the conclusion
powerful feature extractor, and to transfer to the down- mentioned above by showing that with only 10% of the orig-
stream task, we still need labels more or less. As a result, to inal ImageNet labels, the ResNet-50 can surpass the super-
bridge the gap between self-supervised pre-training and vised one with joint pre-training and self-training. They
downstream tasks, semi-supervised learning is what we are propose a 3-step framework:
looking for.
Recall the MoCo [47] that have topped the ImageNet 1) Do self-supervised pre-training as SimCLR v1, with
leader-board. Although it is proved beneficial for many some minor architecture modification and a deeper
other downstream vision tasks, it fails to improve the ResNet.
COCO object detection task. Some following work [84], 2) Fine-tune the last few layers with only 1% or 10% of
[153] investigates this problem and attributes it to the gap original ImageNet labels.
between the instance discrimination and object detection. In 3) Use the fine-tuned network as teacher to yield labels
such a situation, while pure self-supervised pre-training on unlabeled data to train a smaller student ResNet-
fails to help, semi-supervised-based self-training can con- 50.
tribute a lot to it. The success in combining self-supervised contrastive
First, we will clarify the definitions of semi-supervised pre-training and semi-supervised self-training opens up
learning and self-training. Semi-supervised learning is an our eyes for a future data-efficient deep learning paradigm.
approach to machine learning that combines a small More work is expected for investigating their latent
amount of labeled data with many unlabeled data during mechanisms.
training. Various methods derive from several different
assumptions made on the data distribution, with self-train- 4.4 Pros and Cons
ing (or self-labeling) being the oldest. In self-training, a Because contrastive learning has assumed the downstream
model is trained on the small amount of labeled data and applications to be classifications, it only employs the
then yield labels on unlabeled data. Only those data with encoder and discards the decoder in the architecture com-
highly confident labels are combined with original labeled pared to generative models. Therefore, contrastive models
data to train a new model. We iterate this procedure to find are usually light-weighted and perform better in discrimi-
the best model. native downstream applications.
The current state-of-the-art supervised model [133] on Contrastive learning is closely related to metric learning,
ImageNet follows the self-training paradigm, where we first a discipline that has been long studied. However, self-
train an EfficientNet model on labeled ImageNet images supervised contrastive learning is still an emerging field,
and use it as a teacher to generate pseudo labels on 300M and many problems remain to be solved, including:

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
870 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

1) Scale to natural language pre-training. Despite its suc- several subsections, we will discuss its various applications
cess in computer vision, contrastive pre-training on representation learning.
does not present a convincing result in the NLP
benchmarks. Most contrastive learning in NLP now 5.1 Generate With Complete Input
lies in BERT’s supervised fine-tuning, such as This section introduces GAN and its variants for representa-
improving BERT’s sentence-level representa- tion learning, focusing on capturing the sample’s complete
tion [102], information retrieval [59]. Few algorithms information.
have been proposed to apply contrastive learning in The inception of adversarial representation learning
the pre-training stage. As most language under- should be attributed to Generative Adversarial Networks
standing tasks are classifications, a contrastive lan- (GAN) [97], which proposes the adversarial training frame-
guage pre-training approach should be better than work. Follow GAN, many variants [11], [55], [56], [60], [72],
the current generative language models. [89] emerge and reshape people’s understanding of deep
2) Sampling efficiency. Negative sampling is a must for learning’s potential. GAN’s training process could be
most contrastive learning, but this process is often viewed as two players play a game; one generates fake sam-
tricky, biased, and time-consuming. BYOL [42] and ples while another tries to distinguish them from real ones.
SimSiam [19] are the pioneers to get contrastive To formulate this problem, we define G as the generator, D
learning rid of negative samples, but it can be as the discriminator, pdata ðxÞ as the real sample distribution,
improved. It is also not clear enough that what role pz ðzÞ as the learned latent sample distribution, we want to
negative sampling plays in contrastive learning. optimize this min-max game
3) Data augmentation. Researchers have proved that
data augmentation can boost contrastive learning’s min max Expdata ðxÞ ½logDðxÞ þ Ezpz ðzÞ ½logð1  DðGðzÞÞÞ:
G D
performance, but the theory for why and how it (16)
helps is still quite ambiguous. This hinders its
application into other domains, such as NLP and Before VQ-VAE2, GAN maintains dominating perfor-
graph learning, where the data is discrete and mance on image generation tasks over purely generative
abstract. models, such as autoregressive PixelCNN and autoencoder
VAE. It is natural to think about how this framework could
benefit representation learning.
5 GENERATIVE-CONTRASTIVE (ADVERSARIAL)
However, there is a gap between generation and represen-
SELF-SUPERVISED LEARNING tation. Compared to autoencoder’s explicit latent sample dis-
Generative-contrastive representation learning, or in a more tribution pz ðzÞ, GAN’s latent distribution pz ðzÞ is implicitly
familiar name adversarial representation learning, leverage dis- modeled. We need to extract this implicit distribution out.
criminative loss function as the objective. Yann Lecun com- To bridge this gap, AAE [77] first proposes a solution to fol-
ments on adversarial learning as ”the most interesting idea low the autoencoder’s natural idea. The generator in GAN
in the last ten years in machine learning.”. Its application in could be viewed as an implicit autoencoder. We can replace
learning representation is also booming. the generator with an explicit variational autoencoder (VAE)
The idea of adversarial learning derives from generative to extract the representation out. Recall the objective of VAE
learning, where researchers have observed some inherent
shortcomings of point-wise generative reconstruction (See LVAE ¼ EqðzjxÞ ðlogðpðxjzÞÞ þ KLðqðzjxÞkpðzÞÞ: (17)
Section 3.5). As an alternative, adversarial learning learns to
reconstruct the original data distribution rather than the As we mentioned before, compared to l2 loss of autoen-
samples by minimizing the distributional divergence. coder, discriminative loss in GAN better models the high-
In terms of contrastive learning, adversarial methods still level abstraction. To alleviate the problem, AAE substitutes
preserve the generator structure consisting of an encoder the KL divergence function for a discriminative loss
and a decoder. In contrast, the contrastive abandons the
LDisc ¼ CrossEntropyðqðzÞ; pðzÞÞ; (18)
decoder component (as shown in Fig. 4). It is critical
because, on the one hand, the generator endows adversarial that asks the discriminator to distinguish representation
learning with strong expressiveness that is peculiar to gen- from the encoder and a prior distribution.
erative models; on the other hand, it also makes the objec- However, AAE still preserves the reconstruction error,
tive of adversarial methods far more challenging to learn which contradicts GAN’s core idea. Based on AAE,
than that of contrastive methods, leading to unstable con- BiGAN [33] and ALI [36] argue to embrace adversarial
vergence. In the adversarial setting, the decoder’s existence learning without reservation and put forward a new frame-
asks the representation to be ”reconstructive,” in other work. Given an actual sample x
words, it contains all the necessary information for con-
structing the inputs. However, in the contrastive setting, we  Generator G: the generator here virtually acts as the
only need to learn ”distinguishable” information to discrim- decoder, generates fake samples x0 ¼ GðzÞ by z from
inate different samples. a prior latent distribution (e.g., [uniform(-1,1)]d , d
To sum up, the adversarial methods absorb merits from refers to dimension).
both generative and contrastive methods together with  Encoder E: a newly added component, mapping real
some drawbacks. In a situation where we need to fit on an sample x to representation z0 ¼ EðxÞ. This is also
implicit distribution, it is a better choice. In the following exactly what we want to train.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 871

 Discriminator D: given two inputs [z, GðzÞ] and generator G is a small Masked Language Model (MLM),
[EðxÞ, x], decide which one is from the real sample which replaces masked tokens in a sentence to words. The
distribution. discriminator D is asked to predict which words are
It is easy to see that their training goal is E ¼ G1 . In replaced. Notice that replaced means not the same with origi-
other words, encoder E should learn to ”convert” generator nal unmasked inputs. The training is conducted in two
G. This goal could be rewritten as a l0 loss for autoen- stages:
coder [33], but it is not the same as a traditional autoencoder
because the distribution does not make any assumption 1) Warm-up the generator: train the G with MLM pre-
about the data itself. The distribution is shaped by the dis- text task LMLM ðxx; uG Þ for some steps to warm up the
criminator, which captures the semantic-level difference. parameters.
Based on BiGAN and ALI, later studies [20], [34] discover 2) Trained with the discriminator: D’s parameters is
that GAN with deeper and larger networks and modified initialized with G’s and then trained with the dis-
architectures can produce even better results on down- criminative objective LDisc ðxx; uD Þ (a cross-entropy
stream task. loss). During this period, G’s parameter is frozen.
The final objective could be written as

5.2 Recover With Partial Input X


min LMLM ðxx; uG Þ þ LDisc ðxx; uD Þ: (19)
As we mentioned above, GAN’s architecture is not born for uG ;uD
x 2X
representation learning, and modification is needed to
apply its framework. While BiGAN and ALI choose to Though ELECTRA is structured as GAN, it is not trained
extract the implicit distribution directly, some other meth- in the GAN setting. Compared to image data, which is con-
ods such as colorization [69], [70], [147], [148], inpaint- tinuous, word tokens are discrete, which stops the gradient
ing [55], [89] and super-resolution [72] apply the adversarial backpropagation. A possible substitution is to leverage pol-
learning via in a different way. Instead of asking models to icy gradient, but ELECTRA experiments show that perfor-
reconstruct the whole input, they provide models with par- mance is slightly lower. Theoretically speaking, LDisc ðxx; uD Þ
tial input and ask them to recover the rest parts. This is simi- is actually turning the conventional k-class softmax classifi-
lar to denoising autoencoder (DAE) such as BERT’s family cation into a binary classification. This substantially saves
in natural language processing but conducted in an adver- the computation effort but may somehow harm the repre-
sarial manner. sentation quality due to the early degeneration of embed-
Colorization is first proposed by [147]. The problem can ding space. In summary, ELECTRA is still an inspiring
be described as given one color channel L in an image and pioneer work in leveraging discriminative objectives.
predicting the value of two other channels A, B. The At the same time, WKLM [134] proposes to perform RTD
encoder and decoder networks can be set to any form of at the entity-level. For entities in Wikipedia paragraphs,
convolutional neural network. Interestingly, to avoid the WKLM replaced them with similar entities and trained the
uncertainty brought by traditional generative methods such language model to distinguish them in a similar discrimina-
as VAE, the author transforms the generation task into a tive objective as ELECTRA, performing exceptionally well
classification one. The first figure out the common locating in downstream tasks like question answering. Similar work
area of ðA; BÞ and then split it into 313 categories. The classi- is REALM [45], which conducts higher article-level retrieval
fication is performed through a softmax layer with hyper- augmentation to the language model. However, REALM is
parameter T as an adjustment. Based on [147], a range of not using the discriminative objective.
colorization-based representation methods [69], [70], [148]
are proposed to benefit downstream tasks. 5.4 Graph Learning
Inpainting [55], [89] is more straight forward. We will ask
There are also attempts to utilize adversarial learning ( [23],
the model to predict an arbitrary part of an image given the
[28], [128]). Interestingly, their ideas are quite different from
rest of it. Then a discriminator is employed to distinguish
each other.
the inpainted image from the original one. Super-resolution
The most natural idea is to follow BiGAN [33] and
method SRGAN [72] follows the same idea to recover high-
ALI [36]’s a practice that asks discriminator to distinguish
resolution images from blurred low-resolution ones in the
representation from generated and prior distribution.
adversarial setting.
Adversarial Network Embedding (ANE) [23] designs a gen-
erator G that is updated in two stages: 1) G encodes sam-
5.3 Pre-Trained Language Model pled graph into target embedding and computes traditional
For a long time, the pre-trained language model (PTM) NCE with a context encoder F like Skip-gram, 2) discrimi-
focuses on maximum likelihood estimation based pretext nator D is asked to distinguish embedding from G and a
task because discriminative objectives are thought to be sampled one from a prior distribution. The optimized objec-
helpless due to languages’ vibrant patterns. However, tive is a sum of the above two objectives, and the generator
recently some work shows excellent performance and sheds G could yield better node representation for the classifica-
light on contrastive objectives’ potential in PTM. tion task.
The pioneering work is ELECTRA [21], surpassing BERT GraphGAN [128] considers to model the link prediction
given at the same computation budget. ELECTRA proposes task and follow the original GAN style discriminative objec-
Replaced Token Detection (RTD) and leverages GAN’s tive to distinguish directly at node-level rather than repre-
structure to pre-train a language model. In this setting, the sentation-level. The model first selects nodes from the target

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
872 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

node’s subgraph vc according to embedding encoded by the learning has soon outperformed them with fewer
generator G. Then some neighbor nodes to vc selected from parameters.
the subgraph, together with those selected by G, are put Despite the challenges, however, it is still promising
into a binary classifier D to decide whether they are linked because it overcomes some inherent deficits of the point-
to vc . Because this framework involves a discrete selection wise generative objective. Maybe we still need to wait for a
procedure, while gradient descents could update the dis- better future implementation of this idea.
criminator D, the generator G is updated via policy
gradients.
GraphSGAN [28] applies the adversarial method in semi- 6 DISCUSSIONS AND FUTURE DIRECTIONS
supervised graph learning with the motivation that mar- In this section, we will discuss several open problems
ginal nodes cause most classification errors in the graph. and future directions in self-supervised learning for
Consider samples in the same category; they are usually representation.
clustered in the embedding space. Between clusters, there Theoretical Foundation. Though self-supervised learning
are density gaps where few samples exist. The author pro- has achieved great success, few works investigate the mech-
vides rigorous mathematical proof that we can perform anisms behind it. In this survey, we have listed several
complete classification theoretically if we generate enough recent works on this topic and show that theoretical analysis
fake samples in density gaps. GraphSGAN leverages a gen- is significant to avoid misleading empirical conclusions.
erator G to generate fake nodes in density gaps during the In [4], researchers present a conceptual framework to
training and asks the discriminator D to classify nodes into analyze the contrastive objective’s function in generaliza-
their original categories and a category for those fake ones. tion ability. [120] empirically proves that mutual infor-
In the test period, fake samples are removed, and classifica- mation is only loosely related to the success of several
tion results on original categories could be improved MI-based methods, in which the sampling strategies and
substantially. architecture design may count more. This type of works
is crucial for self-supervised learning to form a solid
foundation, and more work related to theory analysis is
5.5 Domain Adaptation and Multi-Modality urgently needed.
Representation
Transferring to Downstream Tasks. There is an essential
Essentially, the discriminator in adversarial learning gap between pre-training and downstream tasks. Research-
serves to match the discrepancy between latent representa- ers design elaborate pretext tasks to help models learn some
tion distribution and data distribution. This function natu- critical features of the dataset that can transfer to other jobs,
rally relates to domain adaptation and multi-modality but sometimes this may fail to realize. Besides, the process
representation problems, aiming to align different repre- of selecting pretext tasks seems to be too heuristic and tricky
sentation distribution. [1], [2], [37], [105] studies how without patterns to follow.
GAN can help on domain adaptation. [12], [129] leverage A typical example is the selection of pre-training tasks in
adversarial sampling to improve the negative samples’ BERT and ALBERT. BERT uses Next Sentence Prediction
quality. For multi-modality representation, [151]’s image (NSP) to enhance its ability for sentence-level understand-
to image translation, [106]’s text style transfer, [22]’s word ing. However, ALBERT shows that NSP equals a naive topic
to word translation and [104] image to text translation model, which is far too easy for language model pre-train-
show great power of adversarial representation learning. ing and even decreases BERT’s performance.
For the pre-training task selection problem, a probably
5.6 Pros and Cons exciting direction would be to design pre-training tasks for
Generative-contrastive (adversarial) self-supervised learn- a specific downstream task automatically, just as what Neu-
ing is particularly successful in image generation, transfor- ral Architecture Search [154] does for neural network
mation and manipulation, but there are also some architecture.
challenges for its future development: Transferring Across Datasets. This problem is also known
as how to learn inductive biases or inductive learning. Tra-
 Limited applications in NLP and graph. Due to the dis- ditionally, we split a dataset into the training used for learn-
crete nature of languages and graphs, the adversarial ing the model parameters and the testing part for
methods do not perform as well as they do in com- evaluation. An essential prerequisite for this learning para-
puter vision. Furthermore, GAN-based language digm is that data in the real world conform to our dataset’s
generation has been found to be much worse than distribution. Nevertheless, this assumption frequently fails
unidirectional language models such as GPTs. in experiments.
 Easy to collapse. It is also notorious that adversarial Self-supervised representation learning solves part of
models are prone to collapse during the training, this problem, especially in the field of natural language
with numerous techniques developed to stabilize its processing. Vast amounts of corpora used in the language
training, such as spectral normalization [83], W- model pre-training help cover most language patterns and,
GAN [3] and so on. therefore, contribute to the success of PTMs in various lan-
 Not for feature extraction. Although works such as guage tasks. However, this is based on the fact that text in
BiGAN [33] and BigBiGAN [34] have explored some the same language shares the same embedding space. For
ways to leverage GAN’s learned latent representa- other tasks like machine translation and fields like graph
tion and achieve good performance, contrastive learning where embedding spaces are different for different

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 873

datasets, learning the transferable inductive biases effi- [5] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong,
“Learning to retrieve reasoning paths over wikipedia graph for
ciently is still an open problem. question answering,” 2019, arXiv:1911.10470.
Exploring Potential of Sampling Strategies. In [120], the [6] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning repre-
authors attribute one of the reasons for the success of mutual sentations by maximizing mutual information across views,” in
information-based methods to better sampling strategies. Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, pp. 15509–
15519.
MoCo [47], SimCLR [15], and a series of other contrastive [7] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang,
methods may also support this conclusion. They propose to “SimGNN: A neural network approach to fast graph similarity
leverage super large amounts of negative samples and aug- computation,” in Proc. 12th ACM Int. Conf. Web Search Data Min-
ing, 2019, pp. 384–392.
mented positive samples, whose effects are studied in deep [8] D. H. Ballard, “Modular learning in neural networks,” in Proc.
metric learning. How to further release the power of sam- 6th Nat. Conf. Artif. Intell., 1987, pp. 279–284.
pling is still an unsolved and attractive problem. [9] Y. Bengio, N. Leonard, and A. Courville, “Estimating or propa-
Early Degeneration for Contrastive Learning. Contrastive gating gradients through stochastic neurons for conditional
computation,” 2013, arXiv:1308.3432.
learning methods such as MoCo [47] and SimCLR [15] [10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching
are rapidly approaching the performance of supervised word vectors with subword information,” Trans. Assoc. Comput.
learning for computer vision. However, their incredible Linguistics, vol. 5, pp. 135–146, 2017.
performances are generally limited to the classification [11] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN train-
ing for high fidelity natural image synthesis,” 2018,
problem. Meanwhile, the generative-contrastive method arXiv:1809.11096.
ELETRA [21] for language model pre-training is also out- [12] L. Cai and W. Y. Wang, “KBGAN: Adversarial learning for
performing other generative methods on several stan- knowledge graph embeddings,” 2017, arXiv:1711.04071.
[13] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clus-
dard NLP benchmarks with fewer model parameters. tering for unsupervised learning of visual features,” in Proc. Eur.
However, some remarks indicate that ELETRA’s perfor- Conf. Comput. Vis., 2018, pp. 132–149.
mance on language generation and neural entity extrac- [14] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Jou-
tion is not up to expectations. lin, “Unsupervised learning of visual features by contrasting
cluster assignments,” 2020, arXiv:2006.09882.
Problems above are probably because the contrastive [15] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple
objectives often get trapped into embedding spaces’ early framework for contrastive learning of visual representations,”
degeneration problem, which means that the model over- 2020, arXiv:2002.05709.
[16] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton,
fits to the discriminative pretext task too early, and therefore “Big self-supervised models are strong semi-supervised
lost the ability to generalize. We expect that there would be learners,” 2020, arXiv:2006.10029.
techniques or new paradigms to solve the early degenera- [17] T. Chen, Y. Sun, Y. Shi, and L. Hong, “On sampling strategies
tion problem while preserving contrastive learning’s for neural network-based collaborative filtering,” in Proc. 23rd
ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017,
advantages. pp. 767–776.
[18] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines
with momentum contrastive learning,” 2020, arXiv:2003.04297.
7 CONCLUSION [19] X. Chen and K. He, “Exploring simple siamese representation
learning,” 2020, arXiv:2011.10566.
This survey comprehensively reviews the existing self- [20] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang, “Triple generative
supervised representation learning approaches in natural adversarial nets,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst.,
language processing (NLP), computer vision (CV), graph 2017, pp. 4088–4098.
[21] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA:
learning, and beyond. Self-supervised learning is the pres- Pre-training text encoders as discriminators rather than gener-
ent and future of deep learning due to its supreme ability to ators,” 2020, arXiv:2003.10555.
utilize Web-scale unlabeled data to train feature extractors [22] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jegou,
and context generators efficiently. Despite the diversity of “Word translation without parallel data,” 2017, arXiv:1710.04087.
[23] Q. Dai, Q. Li, J. Tang, and D. Wang, “Adversarial network
algorithms, we categorize all self-supervised methods into embedding,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
three classes: generative, contrastive, and generative con- pp. 2167–2174.
trastive according to their essential training objectives. We [24] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhut-
dinov, “Transformer-XL: Attentive language models beyond a
introduce typical and representative methods in each cate- fixed-length context,” in Proc. 57th Annu. Meeting Assoc. Comput.
gory and sub-categories. Moreover, we discuss the pros and Linguistics, 2019, pp. 2978–2988.
cons of each category and their unique application scenar- [25] V. R. de Sa, “Learning classification with unlabeled data,” in
ios. Finally, fundamental problems and future directions of Proc. 6th Int. Conf. Neural Inf. Process. Syst., 1994, pp. 112–119.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
self-supervised learning are listed. “ImageNet: A large-scale hierarchical image database,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–
REFERENCES 255.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
[1] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and training of deep bidirectional transformers for language under-
M. Marchand, “Domain-adversarial neural networks,” 2014, standing,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Lin-
arXiv:1412.4446. guistics, Hum. Lang. Technol., 2019, pp. 4171–4186.
[2] F. Alam, S. Joty, and M. Imran, “Domain adaptation with adver- [28] M. Ding, J. Tang, and J. Zhang, “Semi-supervised learning on
sarial training and graph embeddings,” 2018, arXiv:1805.05151. graphs with generative adversarial nets,” in Proc. 27th ACM Int.
[3] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative Conf. Inf. Knowl. Manage., 2018, pp. 913–922.
adversarial networks,” in Proc. 34th Int. Conf. Mach. Learn., 2017, [29] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang, “Cognitive
pp. 214–223. graph for multi-hop reading comprehension at scale,” 2019,
[4] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and arXiv:1905.05460.
N. Saunshi, “A theoretical analysis of contrastive unsupervised [30] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear indepen-
representation learning,” 2019, arXiv:1902.09229. dent components estimation,” 2014, arXiv:1410.8516.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
874 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

[31] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation [59] V. Karpukhin et al., “Dense passage retrieval for open-domain
using real NVP,” 2016, arXiv:1605.08803. question answering,” 2020, arXiv:2004.04906.
[32] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual [60] T. Karras, S. Laine, and T. Aila, “A style-based generator archi-
representation learning by context prediction,” in Proc. IEEE Int. tecture for generative adversarial networks,” in Proc. IEEE Conf.
Conf. Comput. Vis., 2015, pp. 1422–1430. Comput. Vis. Pattern Recognit., 2019, pp. 4401–4410.
[33] J. Donahue, P. Kr€ ahenb€uhl, and T. Darrell, “Adversarial feature [61] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image repre-
learning,” 2016, arXiv:1605.09782. sentations by completing damaged jigsaw puzzles,” in Proc.
[34] J. Donahue and K. Simonyan, “Large scale adversarial represen- IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 793–802.
tation learning,” in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., [62] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with
2019, pp. 10541–10551. invertible 1x1 convolutions,” in Proc. 32nd Int. Conf. Neural Inf.
[35] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec, “Learning Process. Syst., 2018, pp. 10215–10224.
structural node embeddings via diffusion wavelets,” in Proc. [63] D. P. Kingma and M. Welling, “Auto-encoding variational
24th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2018, bayes,” 2013, arXiv:1312.6114.
pp. 1320–1329. [64] T. N. Kipf and M. Welling, “Semi-supervised classification with
[36] V. Dumoulin et al., “Adversarially learned inference,” 2016, graph convolutional networks,” 2016, arXiv:1609.02907.
arXiv:1606.00704. [65] T. N. Kipf and M. Welling, “Variational graph auto-encoders,”
[37] Y. Ganin et al., “Domain-adversarial training of neural networks,” 2016, arXiv:1611.07308.
The J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096–2030, 2016. [66] L. Kong, C. D. M. d’Autume, W. Ling, L. Yu, Z. Dai, and D. Yoga-
[38] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised repre- tama, “A mutual information maximization perspective of lan-
sentation learning by predicting image rotations,” 2018, guage representation learning,” 2019, arXiv:1910.08350.
arXiv:1803.07728. [67] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classi-
[39] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature fication with deep convolutional neural networks,” in Proc. 25th
hierarchies for accurate object detection and semantic Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [68] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Sori-
2014, pp. 580–587. cut, “ALBERT: A lite bert for self-supervised learning of lan-
[40] I. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th guage representations,” 2019, arXiv:1909.11942.
Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680. [69] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning repre-
[41] P. Goyal et al., “Self-supervised pretraining of visual features in sentations for automatic colorization,” in Proc. Eur. Conf. Comput.
the wild,” 2021, arXiv:2103.01988. Vis., 2016, pp. 577–593.
[42] J.-B. Grill et al., “Bootstrap your own latent: A new approach to [70] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a
self-supervised learning,” 2020, arXiv:2006.07733. proxy task for visual understanding,” in Proc. IEEE Conf. Comput.
[43] A. Grover and J. Leskovec, “node2vec: Scalable feature learning Vis. Pattern Recognit., 2017, pp. 6874–6883.
for networks,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Dis- [71] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nat., vol.
cov. Data Mining, 2016, pp. 855–864. 521, no. 7553, pp. 436–444, 2015.
[44] M. Gutmann and A. Hyv€arinen, “Noise-contrastive estima- [72] C. Ledig et al., “Photo-realistic single image super-resolution
tion: A new estimation principle for unnormalized statistical using a generative adversarial network,” in Proc. IEEE Conf. Com-
models,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, put. Vis. Pattern Recognit., 2017, pp. 4681–4690.
pp. 297–304. [73] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H.
[45] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “REALM: Yang, “Unsupervised visual representation learning by graph-
Retrieval-augmented language model pre-training,” 2020, based consistent constraints,” in Proc. Eur. Conf. Comput. Vis.,
arXiv:2002.08909. 2016, pp. 678–694.
[46] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view [74] B. Liu, “Sentiment analysis and opinion mining,” Synth. Lectures
representation learning on graphs,” 2020, arXiv:2006.05582. Hum. Lang. Technol., vol. 5, no. 1, pp. 1–167, 2012.
[47] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum con- [75] Y. Liu et al., “RoBERTa: A robustly optimized bert pretraining
trast for unsupervised visual representation learning,” 2019, approach,” 2019, arXiv:1907.11692.
arXiv:1911.05722. [76] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for works for semantic segmentation,” in Proc. IEEE Conf. Comput.
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- Vis. Pattern Recognit., 2015, pp. 3431–3440.
ognit., 2016, pp. 770–778. [77] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey,
[49] R. D. Hjelm et al., “Learning deep representations by mutual infor- “Adversarial autoencoders,” 2015, arXiv:1511.05644.
mation estimation and maximization,” 2018, arXiv:1808.06670. [78] M. Mathieu, “Masked autoencoder for distribution estimation,”
[50] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel, “Flow++: 2015.
Improving flow-based generative models with variational [79] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, “Efficient esti-
dequantization and architecture design,” in Proc. 36th Int. Conf. mation of word representations in vector space,” CoRR, 2013.
Mach. Learn., 2019, pp. 2722–2730. [80] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
[51] W. Hu et al., “Strategies for pre-training graph neural networks,” “Distributed representations of words and phrases and their
in Proc. Int. Conf. Learn. Representations, 2019. compositionality,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst.,
[52] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “GPT-GNN: 2013, pp. 3111–3119.
Generative pre-training of graph neural networks,” 2020, [81] I. Misra and L. van der Maaten, “Self-supervised learning of pre-
arXiv:2006.15437. text-invariant representations,” 2019, arXiv:1912.01991.
[53] Z. Hu, Y. Dong, K. Wang, and Y. Sun, “Heterogeneous graph [82] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blun-
transformer,” 2020, arXiv:2003.01332. dell, “Representation learning via invariant causal mechanisms,”
[54] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected 2020, arXiv:2010.07922.
convolutional networks,” Proc. IEEE Conf. Comput. Vis. Pattern [83] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral
Recognit., 2017, pp. 2261–2269. normalization for generative adversarial networks,” 2018,
[55] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally arXiv:1802.05957.
consistent image completion,” ACM Trans. Graph., vol. 36, no. 4, [84] A. Newell and J. Deng, “How useful is self-supervised pretrain-
pp. 1–14, 2017. ing for visual tasks?” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
[56] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image ognit., 2020, pp. 7345–7354.
translation with conditional adversarial networks,” in Proc. IEEE [85] A. Ng et al., “Sparse autoencoder,” CS294A Lecture Notes, vol. 72,
Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1125–1134. no. 2011, pp. 1–19, 2011.
[57] L. Jing and Y. Tian, “Self-supervised visual feature learning with [86] M. Noroozi and P. Favaro, “Unsupervised learning of visual rep-
deep neural networks: A survey,” 2019, arXiv:1902.06162. resentations by solving jigsaw puzzles,” in Proc. Eur. Conf. Com-
[58] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. put. Vis., 2016, pp. 69–84.
Levy, “SpanBERT: Improving pre-training by representing and [87] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting
predicting spans,” Trans. Assoc. Comput. Linguistics, vol. 8, self-supervised learning via knowledge transfer,” in Proc. IEEE
pp. 64–77, 2020. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9359–9367.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
LIU ET AL.: SELF-SUPERVISED LEARNING: GENERATIVE OR CONTRASTIVE 875

[88] A. V. D. Oord, Y. Li, and O. Vinyals, “Representation learning [114] Y. Sun et al., “ERNIE: Enhanced representation through knowl-
with contrastive predictive coding,” 2018, arXiv:1807.03748. edge integration,” 2019, arXiv:1904.09223.
[89] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, [115] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE:
“Context encoders: Feature learning by inpainting,” in Proc. IEEE Large-scale information network embedding,” in Proc. 24th Int.
Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544. Conf. World Wide Web, 2015, pp. 1067–1077.
[90] Z. Peng, Y. Dong, M. Luo, X. M. Wu, and Q. Zheng, “Self-super- [116] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “ArnetMiner:
vised graph representation learning via global context pre- Extraction and mining of academic social networks,” in Proc.
diction,” 2020, arXiv:2003.01604. 14th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2008,
[91] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng, “Self-super- pp. 990–998.
vised graph representation learning via global context pre- [117] W. L. Taylor, ““Cloze procedure”: A new tool for measuring
diction,” 2020, arXiv:2003.01604. readability,” Journalism Quart., vol. 30, no. 4, pp. 415–433, 1953.
[92] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learn- [118] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview
ing of social representations,” in Proc. 20th ACM SIGKDD Int. coding,” 2019, arXiv:1906.05849.
Conf. Knowl. Discov. Data Mining, 2014, pp. 701–710. [119] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola,
[93] M. Popova, M. Shvets, J. Oliva, and O. Isayev, “MolecularRNN: “What makes for good views for contrastive learning,” 2020,
Generating realistic molecular graphs with optimized proper- arXiv:2005.10243.
ties,” 2019, arXiv:1905.13372. [120] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and
[94] J. Qiu et al., “GCC: Graph contrastive coding for graph neural M. Lucic, “On mutual information maximization for representa-
network pre-training,” 2020, arXiv:2006.09963. tion learning,” 2019, arXiv:1907.13625.
[95] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, “DeepInf: [121] A. van den Oord et al., “WaveNet: A generative model for raw
Social influence prediction with deep learning,” in Proc. 24th audio,” in Proc. 9th ISCA Speech Synth. Workshop, 2016, pp. 125–
ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2018, pp. 125.
2110–2119. [122] A. Van den Oord et al., “Conditional image generation with pix-
[96] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained elcnn decoders,” in Proc. 30th Int. Conf. Neural Inf. Process. Syst.,
models for natural language processing: A survey,” 2020, 2016, pp. 4790–4798.
arXiv:2003.08271. [123] A. van den Oord et al., “Neural discrete representation learning,”
[97] A. Radford, L. Metz, and S. Chintala, “Unsupervised representa- in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6306–
tion learning with deep convolutional generative adversarial 6315.
networks,” 2015, arXiv:1511.06434. [124] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel
[98] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, recurrent neural networks,” in Proc. 33rd Int. Conf. Mach. Learn.,
“Improving language understanding by generative pre-training. 2016, pp. 1747–1756.
[99] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutsk- [125] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Int.
ever, “Language models are unsupervised multitask learners,” Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
OpenAI Blog, vol. 1, no. 8, 2019, Art. no. 9. [126] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and
[100] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: Y. Bengio, “Graph attention networks,” 2017, arXiv:1710.10903.
100,000+ questions for machine comprehension of text,” 2016, [127] P. Velickovic, W. Fedus, W. L. Hamilton, P. Li o, Y. Bengio, and R.
arXiv:1606.05250. D. Hjelm, “Deep graph infomax,” 2018, arXiv:1809.10341.
[101] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse [128] H. Wang et al., “GraphGAN: Graph representation learning with
high-fidelity images with VQ-VAE-2,” in Proc. 33rd Int. Conf. generative adversarial nets,” in Proc. 32nd AAAI Conf. Artif.
Neural Inf. Process. Syst., 2019, pp. 14837–14847. Intell., 2018, pp. 2508–2515.
[102] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embed- [129] P. Wang, S. Li, and R. Pan, “Incorporating GAN for negative
dings using siamese BERT-networks,” 2019, arXiv:1908.10084. sampling in knowledge representation learning,” in Proc. 32nd
[103] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, “struc2vec: AAAI Conf. Artif. Intell., 2018, pp. 2005–2012.
Learning node representations from structural identity,” in Proc. [130] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial net-
23rd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017, works: A survey and taxonomy,” 2019, arXiv:1906.01529.
pp. 385–394. [131] C. Wei et al., “Iterative reorganization with weak spatial con-
[104] N. Sarafianos, X. Xu, and I. A. Kakadiaris, “Adversarial represen- straints: Solving arbitrary jigsaw puzzles for unsupervised repre-
tation learning for text-to-image matching,” in Proc. IEEE Int. sentation learning,” in Proc. IEEE Conf. Comput. Vis. Pattern
Conf. Comput. Vis., 2019, pp. 5814–5824. Recognit., 2019, pp. 1910–1919.
[105] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Adversarial represen- [132] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature
tation learning for domain adaptation,” Stat, vol. 1050, 2017, learning via non-parametric instance discrimination,” in Proc.
Art. no. 5. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3733–3742.
[106] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola, “Style transfer from [133] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with
non-parallel text by cross-alignment,” in Proc. 31st Int. Conf. Neu- noisy student improves ImageNet classification,” in Proc. IEEE
ral Inf. Process. Syst., 2017, pp. 6830–6841. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10687–10698.
[107] C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang, [134] W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov, “Pretrained ency-
“GraphAF: A flow-based autoregressive model for molecular clopedia: Weakly supervised knowledge-pretrained language
graph generation,” 2020, arXiv:2001.09382. model,” 2019, arXiv:1912.09637.
[108] A. Sinha et al., “An overview of microsoft academic service [135] K. Xu, J. Li, M. Zhang, S. S. Du, K.-I. Kawarabayashi, and S.
(MAS) and applications,” in Proc. 24th Int. Conf. World Wide Web, Jegelka, “How neural networks extrapolate: From feedforward
2015, pp. 243–246. to graph neural networks,” 2020, arXiv:2009.11848.
[109] P. Smolensky, “Information processing in dynamical systems: [136] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan,
Foundations of harmony theory,” Dept. Comput. Sci., Colorado “ClusterFit: Improving generalization of visual representations,”
Univ. Boulder, Boulder, CO, 1986. 2019, arXiv:1912.03330.
[110] F.-Y. Sun, J. Hoffmann, and J. Tang, “InfoGraph: Unsupervised [137] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of
and semi-supervised graph-level representation learning via deep representations and image clusters,” in Proc. IEEE Conf.
mutual information maximization,” 2019, arXiv:1908.01000. Comput. Vis. Pattern Recognit., 2016, pp. 5147–5156.
[111] F.-Y. Sun, M. Qu, J. Hoffmann, C.-W. Huang, and J. Tang, [138] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
“vGraph: A generative model for joint community detection and Q. V. Le, “XLNet: Generalized autoregressive pretraining for lan-
node representation learning,” in Proc. 33rd Int. Conf. Neural Inf. guage understanding,” in Proc. 33rd Int. Conf. Neural Inf. Process.
Process. Syst., 2019, pp. 512–522. Syst., 2019, pp. 5754–5764.
[112] K. Sun, Z. Lin, and Z. Zhu, “Multi-stage self-supervised learning [139] Z. Yang et al., “HotpotQA: A dataset for diverse, explainable
for graph convolutional networks on graphs with few labeled multi-hop question answering,” 2018, arXiv:1809.09600.
nodes,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, pp. 5892– [140] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec, “Graph convo-
5899. lutional policy network for goal-directed molecular graph gener-
[113] K. Sun, Z. Zhu, and Z. Lin, “Multi-stage self-supervised learning ation,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018,
for graph convolutional networks,” 2019, arXiv:1902.11038. pp. 6410–6421.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.
876 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 1, JANUARY 2023

[141] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec, Zhenyu Hou is currently working toward the
“GraphRNN: Generating realistic graphs with deep auto-regres- undergraduate degree with the Department of
sive models,” in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 5708– Computer Science and Technology, Tsinghua
5717. University, China. His main research interests
[142] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph con- include graph representation learning and
trastive learning with augmentations,” 2020, arXiv:2010.13902. reasoning.
[143] Y. You, T. Chen, Z. Wang, and Y. Shen, “When does self-supervi-
sion help graph convolutional networks?” 2020, arXiv:2006.09136.
[144] F. Zhang et al., “OAG: Toward linking large-scale heterogeneous
entity graphs,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Dis-
cov. Data Mining, 2019, pp. 2585–2595.
[145] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding, “ProNE: Fast Li Mian received the bachelor’s degree from the
and scalable network representation learning,” in Proc. 28th Int. Department of Computer Science, Beijing Insti-
Joint Conf. Artif. Intell., 2019, pp. 4278–4284. tute of Technology, China, in 2020. She is now
[146] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end admitted into a graduate program in Georgia
deep learning architecture for graph classification,” in Proc. 32nd Institute of Technology, Atlanta, Georgia. Her
AAAI Conf. Artif. Intell., 2018, pp. 4438–4445. research interests focus on data mining, natural
[147] R. Zhang, P. Isola, and A. A. Efros, “Colorful image color- language processing and machine learning.
ization,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 649–666.
[148] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders:
Unsupervised learning by cross-channel prediction,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1058–1067.
[149] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “ERNIE:
Enhanced language representation with informative entities,” Zhaoyu Wang is currently working toward the
graduate degree with the Department of Com-
2019, arXiv:1905.07129.
puter Science and Technology, Anhui University,
[150] D. Zhu, P. Cui, D. Wang, and W. Zhu, “Deep variational network
embedding in wasserstein space,” in Proc. 24th ACM SIGKDD China. His research interests include data min-
Int. Conf. Knowl. Discov. Data Mining, 2018, pp. 2827–2836. ing, natural language processing and their appli-
[151] J.-Y. Zhu et al., “Toward multimodal image-to-image trans- cations in recommender systems.
lation,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017,
pp. 465–476.
[152] C. Zhuang, A. L. Zhai, and D. Yamins, “Local aggregation for
unsupervised learning of visual embeddings,” in Proc. IEEE Int.
Conf. Comput. Vis., 2019, pp. 6002–6012. Jing Zhang received the master’s and PhD
[153] B. Zoph et al., “Rethinking pre-training and self-training,” 2020, degrees from the Department of Computer Sci-
arXiv:2006.06882. ence and Technology, Tsinghua University, China.
[154] B. Zoph and Q. V. Le, “Neural architecture search with reinforce- She is an assistant professor in Information
ment learning,” 2016, arXiv:1611.01578. School, Renmin University of China, China. Her
research interests include social network mining
and deep learning.
Xiao Liu is currently working toward the senior
undergraduate degree with the Department of
Computer Science and Technology, Tsinghua
University, China. His main research interests
include data mining, machine learning and knowl-
edge graph. He has published a paper on KDD. Jie Tang (Fellow, IEEE) received the PhD degree
from Tsinghua University, China. He is full profes-
sor with the Department of Computer Science
and Technology, Tsinghua University, China. His
main research interests include data mining,
social network, and machine learning. He has
published more than 200 research papers in top
Fanjin Zhang received the bachelor’s degree international journals and conferences.
from the Department of Computer Science and
Technology, Nanjing Unviersity, China. She is cur-
rently working toward the PhD degree with the
Department of Computer Science and Technol-
ogy, Tsinghua University, China. Her research " For more information on this or any other computing topic,
interests include data mining and social network.
please visit our Digital Library at www.computer.org/csdl.

Authorized licensed use limited to: Griffith University. Downloaded on January 18,2023 at 07:08:32 UTC from IEEE Xplore. Restrictions apply.

You might also like