A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.
8, AUGUST 2015 1
A Survey on Self-supervised Learning:

Algorithms, Applications, and Future Trends
Jie Gui, Senior Member, IEEE, Tuo Chen, Jing Zhang, Senior Member, IEEE, Qiong Cao,
Zhenan Sun, Senior Member, IEEE, Hao Luo, Dacheng Tao, Fellow, IEEE
Abstract—Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance.
However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL),
a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated
labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a
dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of
arXiv:2301.05712v3 [cs.LG] 17 Sep 2023
diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly,
we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences.
Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language
processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A
curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.
Index Terms—Self-supervised learning, Contrastive learning, Generative model, Representation learning, Transfer learning
1 I NTRODUCTION
D EEP supervised learning algorithms have demon-

strated impressive performance in various domains,
including computer vision (CV) and natural language pro-
of labeled examples is frequently costly, arduous, or time-
consuming due to the requirement of skilled human annota-
tors with sufficient domain expertise [12], [13]. To illustrate,
cessing (NLP). To address this, models pre-trained on large- consider the analysis of web user profiles, where a substan-
scale datasets like ImageNet [1] are commonly employed tial amount of data can be readily collected. However, the
as a starting point and subsequently fine-tuned for specific labeling of non-profitable or profitable users necessitates
downstream tasks (Table 1). This practice is motivated by thorough scrutiny, judgment, and sometimes even time-
two primary reasons. Firstly, the parameters acquired from intensive tracing tasks performed by experienced human
large-scale datasets offer a favorable initialization, enabling assessors, resulting in significant expenses. Another instance
faster convergence of models trained on other tasks [2]. pertains to the medical field, where unlabeled examples
Secondly, a network trained on a large-scale dataset has can be easily obtained through routine medical exami-
already learned discriminative features, which can be easily nations. Nevertheless, assigning diagnoses individually to
transferred to downstream tasks and mitigate the overfitting such a large number of cases places a substantial burden
issue arising from limited training data in such tasks [3], [4]. on medical experts. For example, in the case of breast
Unfortunately, numerous real-world data mining and cancer diagnosis, radiologists must label each focus in a
machine learning applications face a common challenge vast collection of easily attainable, high-resolution mammo-
where an abundance of unlabeled training instances coexists grams. This process often proves to be highly inefficient and
with a limited number of labeled ones. The acquisition time-consuming. Additionally, supervised learning methods
are susceptible to spurious correlations and generalization
errors, and vulnerable to adversarial attacks.
• J. Gui is with the School of Cyber Science and Engineering, Southeast
University and with Purple Mountain Laboratories, Nanjing 210000,
To address the aforementioned limitations of super-
China (e-mail: [email protected]). vised learning, various machine learning paradigms have
been introduced, including active learning, semi-supervised
• T. Chen is with the School of Cyber Science and Engineering, Southeast learning, and self-supervised learning (SSL). This paper
University (e-mail: [email protected]).
specifically emphasizes SSL. SSL algorithms aim to learn
• J. Zhang and D. Tao are with with the School of Computer Science in discriminative features from vast quantities of unlabeled in-
the University of Sydney, Australia. E-mail: [email protected], stances without relying on human annotations. The overall
[email protected].
framework of SSL is depicted in Fig. 1. In the self-supervised
• Q. Cao is with JD Explore Academy (e-mail: [email protected]). pre-training phase, a pre-defined pretext task is formulated
for the deep learning algorithm to solve. Pseudo-labels
• Z. Sun is with the Center for Research on Intelligent Perception and for the pretext task are automatically generated based on
Computing, Chinese Academy of Sciences, Beijing 100190, China (e-mail:
[email protected]). specific attributes of the input data. Once the self-supervised
pre-training process is completed, the acquired model can
• H. Luo is with Alibaba Group, Hangzhou 310052, China (e-mail: haolu- be transferred to downstream tasks.
[email protected]).
One notable advantage of SSL algorithms is their abil-
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2
Pre-training Data Pre-training Tasks Downstream Tasks
detection / segmentation /
image categorization [5]
Supervised extensive labeled data pose estimation / depth estimation, etc
video action categorization [6] action recognition / object tracking, etc
detection / segmentation /
Image: rotation [7], jigsaw [8], etc
pose estimation / depth estimation, etc
SSL extensive unlabeled data Video: the order of frames [9], playing direction [10], etc action recognition / object tracking, etc
question answering / textual entailment recognition /
NLP: masked language modeling [11]
natural language inference, etc.
TABLE 1: Comparison between supervised and self-supervised pre-training and fine-tuning.
Unlabeled Labeled day or more than two papers per hour (Fig. 2). To assist
Data Data
researchers in navigating this vast number of SSL papers
and to consolidate the latest research findings, we aim to
provide a timely and comprehensive survey on this subject.
Differences from previous work: Previous works have
provided reviews on SSL that cater to specific applications
Initialization Transfer such as recommender systems [26], graphs [27], [28], se-
quential transfer learning [29], videos [30], adversarial pre-
training of self-supervised deep networks [31], and visual
SSL Downstream
feature learning [32]. Besides, Liu et al. [4] primarily focused
Tasks Tasks on papers published before 2020, lacking the latest advance-
ments. Jaiswal et al. [33] centered their survey on contrastive
Fig. 1: The overall framework of SSL. learning (CL). Notably, recent breakthroughs in SSL research
within the CV domain are of significant importance. Thus,
this review predominantly encompasses recent SSL research
_ X 104
35 derived from the CV community, particularly those influ-
ential and classic findings. The primary objectives of this
3
review are to elucidate the concept of SSL, its categories
2.5
and subcategories, its differentiation and relationship with
other machine learning paradigms, as well as its theoretical
2 foundations. We present an extensive and up-to-date review
of the frontiers of visual SSL, dividing it into four key areas:
1.5 context-based, CL, generative, and contrastive generative
algorithms, aiming to outline prominent research trends for
1
scholars.
0.5
2 A LGORITHMS
This section begins by providing an introduction to SSL, fol-
lowed by an explanation of the pretext tasks associated with
Fig. 2: Google Scholar search results for “self-supervised SSL and their integration with other learning paradigms.
learning”. The vertical and horizontal axes denote the num-
ber of SSL publications and the year, respectively. 2.1 What is SSL?
The introduction of SSL is attributed to [34] (Fig. 3), who
employed this architecture to learn in natural environments
ity to leverage extensive unlabeled data since the gener- featuring diverse modalities. For instance, the sight of a cow
ation of pseudo-labels does not necessitate human anno- and the sound of its characteristic “mooing” are frequently
tations. By utilizing these pseudo-labels during training, observed together. Therefore, although the cow image may
self-supervised algorithms have demonstrated promising not warrant a cow label, it is frequently associated with a
outcomes, resulting in a reduced performance disparity “moo” instance. The crux lies in processing the cow image
compared to supervised algorithms in downstream tasks. to derive a self-supervised label for the network, enabling it
Asano et al. [14] demonstrated that SSL can produce gen- to process the corresponding “moo” sound, and vice versa.
eralizable features that exhibit robust generalization even Subsequently, the machine learning community has ad-
when applied to a single image. vanced the concept of SSL, which falls within the realm
The advancement of SSL [3], [4], [15]–[24] has exhib- of unsupervised learning. SSL involves generating output
ited rapid progress, capturing significant attention within labels “intrinsically” from input data examples by revealing
the research community (Fig. 2), and is recognized as a the relationships between data components or various views
crucial element for achieving human-level intelligence [25]. of the data. These output labels are derived directly from the
Google Scholar reports a substantial volume of SSL-related data examples. According to this definition, an autoencoder
publications, with approximately 18,900 papers published (AE) can be perceived as a type of SSL algorithms, where
in 2021 alone. This accounts for an average of 52 papers per the output labels correspond to the data itself. AEs have
supervised unsupervised self-supervised

θ=?
derives label from
“COW”
co-ocurring input
Rotation Jigsaw Colorization
Fig. 4: Illustration of three common context-based methods:

rotation, jigsaw, and colorization.
moo
2.2.1 Context-based methods
Context-based methods Fig. rely
4: Illustration of three common
on the inherent contextual re-
Fig. 3: The differences among supervised learning, unsuper- lationships among the context-based methods: encompassing
provided examples,
vised learning, and SSL. The image is reproduced from [34]. aspects such as spatialrotation, jigsaw,and
structures and the
colorization.
preservation of
both local and global consistency. We illustrate the concept
of context-based pretext tasks using rotation as a simple
gained extensive usage across multiple domains, including
example [36]. Subsequently, we progressively introduce ad-
dimensionality reduction and anomaly detection.
ditional tasks (Fig. 4).
In the keynote talk at ICLR 2020, Yann LeCun elucidated Rotation: The approach of utilizing rotation involves
the concept of SSL as an analogous process to completing training deep neural networks (DNNs) to learn image rep-
missing information (reconstruction). He presented multiple resentations by recognizing the geometric transformations
variations as follows: 1) Predict any part of the input from applied to the original image. In their work, Gidaris et al.
any other part; 2) Predict the future from the past; 3) Predict [7] generated three rotated images (90o , 180o , and 270o
the invisible from the visible; and 4) Predict any occluded, rotations) for each original image (”0o rotation”). These
masked, or corrupted part from all available parts. In sum- images were classified into four classes corresponding to
mary, a portion of the input is unknown in SSL, and the the rotation angles (0o , 90o , 180o , and 270o ), serving as the
objective is to predict that particular segment. output labels derived from the images themselves. Specif-
Jing et al. [32] expanded the definition of SSL to encom- ically, a set of K = 4 discrete geometric transformations
pass methods that operate without human-annotated labels. K
G = g (·|y)y=1 was employed, where g (·|y) represents the
Consequently, any approach devoid of such labels can be
operator that applies a geometric transformation labeled
categorized under SSL, effectively equating SSL with unsu-
as y to the image X , resulting in the transformed image
pervised learning. This categorization includes generative
X y = g (X|y).
adversarial networks (GANs) [35], thereby positioning them
To predict rotation, Gidaris et al. employed a deep con-
within the realm of SSL.
volutional neural network (CNN) denoted as F (·), which
Pretext tasks, also referred to as surrogate or proxy tackled a four-class categorization task. The CNN F (·) takes
tasks, are a fundamental concept in the field of SSL. The ∗
an input image X y (where y ∗ is unknown to F (·)) and
term “pretext” denotes that the task being solved is not generates a probability distribution over potential geometric
the primary objective but serves as a means to generate a transformations, expressed as
robust pre-trained model. Prominent examples of pretext
∗ n ∗ oK
tasks include rotation prediction and instance discrimina- F X y |θ = F y X y |θ . (1)
tion, among others. Each pretext task necessitates the use of y=1
distinct loss functions to achieve its intended goal. Given the ∗
significance of pretext tasks in SSL, we proceed to introduce Here, F y X y |θ represents the predicted probability for
them in further detail. the geometric transformation labeled as y , while θ denotes
the learnable parameters of F (·).
In order to accurately classify the K = 4 classes of
2.2 Pretext tasks natural images, a proficient CNN should possess the capa-
This section provides a comprehensive overview of the bility to do so. Therefore, when provided with a set of N
pretext tasks employed in SSL. A prevalent approach in training instances D = {Xi }N i=1 , the self-supervised training
SSL involves devising pretext tasks for networks to solve, objective of F (·) can be formulated as
where the networks are trained by optimizing the objective 1 X
N
functions associated with these tasks. Pretext tasks typi- min loss(Xi , θ). (2)
cally exhibit two key characteristics. Firstly, deep learning
θ N i=1
methods are employed to learn features that facilitate the Here, the loss function loss(·) is defined as
resolution of pretext tasks. Secondly, supervised signals
K
are derived from the data itself, a process known as self- 1 X
supervision. Commonly employed techniques encompass loss(Xi , θ) = − log(F y (g (Xi |y) |θ)). (3)
K y=1
four categories of pretext tasks: context-based methods,
CL, generative algorithms, and contrastive generative meth- In [37], the relative rotation angle was confined to the
ods. In our paper, generative algorithms primarily refer to interval of [−30o , 30o ]. These rotations were discretized into
masked image modeling (MIM) methods. bins of 3o each, leading to a total of 20 classes (or bins).
Colorization: The concept of colorization was initially learning, thus rendering SSL highly pertinent for large-
introduced in [38], and subsequent studies [39]–[41] demon- scale applications. Early CL approaches were built upon
strated its effectiveness as a pretext task for SSL. Color pre- the concept of utilizing negative examples. However, as
diction offers the advantageous feature of requiring freely CL has progressed, a range of methods have emerged that
available training data. In this context, a model can utilize eliminate the need for negative examples. These methods
the L channel of any color image as input and utilize the embrace distinct ideas such as self-distillation and feature
corresponding ab color channels in the CIE Lab color space decorrelation, yet all adhere to the principle of maintaining
as self-supervised signals. The objective is to predict the positive example consistency. The following section outlines
ab color channels Y ∈ RH×W ×2 given an input lightness the various CL methods currently available (Fig. 5).
channel X ∈ RH×W ×1 , where H and W represent the 2.2.2.1 Negative example-based CL: Negative
height and width of the image, respectively. In this context, examples-based CL adheres to a pretext task known as
Y and Ŷ denote the ground truth and predicted values, instance discrimination, which involves generating distinct
respectively. A commonly employed objective function aims views of an instance. In negative examples-based CL,
to minimize the Frobenius norm between Y and Ŷ , as views originating from the same instance are treated as
expressed by positive examples for an anchor sample, while views
from different instances serve as negative examples. The
2
L = Ŷ − Y . (4) underlying principle is to promote proximity between
F positive examples and maximize the separation between
Besides, [38] utilized the multinomial cross-entropy loss negative examples within the latent space. The definition
instead of (4) to enhance robustness. Upon completing the of positive and negative examples varies depending on
training process, it becomes possible to predict the ab color factors such as the modality being considered and specific
channels for any grayscale image. Consequently, the L chan- requirements, including spatial and temporal consistency
nel and the ab color channels can be concatenated to restore in video understanding or the co-occurrence of modalities
the original grayscale image to a colorful representation. in multi-modal learning scenarios. In the context of
Jigsaw: The jigsaw approach leverages jigsaw puzzles conventional 2D image CL, image augmentation techniques
as surrogate tasks, operating under the assumption that a are utilized to generate diverse views from a single image.
model accomplishes these tasks by comprehending the con- MoCo: He et al. [50] framed CL as a dictionary look-
textual information embedded within the examples. Specif- up task. In this framework, a query q and a set of encoded
ically, images are fragmented into discrete patches, and examples {k0 , k1 , k2 , · · ·} serve as the keys in a dictionary.
their positions are randomly rearranged, with the objective Assuming a single key, denoted as k+ in the dictionary,
of reconstructing the original order. In [42], the impact of matches the query q , a contrastive loss [55] function is
scaling two self-supervised methods, namely jigsaw [8], [43] employed. The value of this function is low when q is similar
and colorization, was investigated along three dimensions: to its positive key k+ and dissimilar to all other negative
data size, model capacity, and problem complexity. The keys. In the MoCo v1 [50] framework, the InfoNCE loss
results indicated that transfer performance exhibits a log- function [56], a form of contrastive loss, is utilized, i.e.,
linear growth pattern in relation to data size. Furthermore, exp(q · k+ /τ )
representation quality was found to improve with higher- Lq = − log PK , (5)
i=0 exp(q · ki /τ )
capacity models and increased problem complexity.
Others: The pretext task employed in [44], [45] involved where τ represents the temperature hyper-parameter. The
a conditional motion propagation problem. To enforce a summation is computed over one positive example and
specific constraint on the feature representation process, K negative examples. InfoNCE is derived from noise con-
Noroozi et al. [46] introduced an additional requirement trastive estimation (NCE) [57], which is characterized by the
where the sum of feature representations of all image following objective:
patches should approximate the feature representation of exp(q · k+ /τ )
the entire image. While many pretext tasks yield represen- Lq = − log , (6)
exp(q · k+ /τ )+ exp(q · k− /τ )
tations that exhibit covariance with image transformations,
[47] argued for the importance of semantic representations where q exhibits similarity to the positive example k+ and
being invariant to such transformations. In response, they dissimilarity to the negative example k− .
proposed a pretext-invariant representation learning ap- MoCo v2 [51] builds upon the foundation established
proach that enables the learning of invariant representations by MoCo v1 [50] and SimCLR v1 [53], incorporating an
through pretext tasks. multilayer perceptron (MLP) projection head and more data
augmentations.
SimCLR: SimCLR v1 [53] employs a mini-batch sam-
2.2.2 Contrastive Learning pling strategy with N instances, wherein a contrastive pre-
Numerous SSL methods based on CL have emerged, build- diction task is formulated on pairs of augmented instances
ing upon the foundation of simple instance discrimination from the mini-batch, generating a total of 2N instances.
tasks [48], [49]. Notable examples include MoCo v1 [50], Notably, SimCLR v1 does not explicitly select negative
MoCo v2 [51], MoCo v3 [52], SimCLR v1 [53] and SimCLR instances. Instead, for a given positive pair, the remaining
v2 [54]. Pioneering algorithms, such as MoCo, have signif- 2(N − 1) augmented instances in the mini-batch are treated
icantly enhanced the performance of self-supervised pre- as negatives. Let sim(u, v) = uT v (∥u∥ ∥v∥) represent the
training, reaching a level comparable to that of supervised cosine similarity between two instances u and v . The loss
similarity & similarity &

similarity
dissimilarity decorrelation
encoder ..................... encoder encoder ..................... encoder encoder ..................... encoder
image image image

negative samples self-distillation feature decorrelation
Fig. 5: Illustration of different CL methods: CL based on negative examples (left), CL based on self-distillation (middle),
Fig. 5: Illustration of different CL
and CL based on feature decorrelation (right).
methods: CL based on
negative examples (left), CL based on
self-distillation (middle),
function of SimCLR v1 for a positive instance pair (i,
and CL based on j) is tasks such as object detection and semantic segmentation.
feature
defined as decorrelation (right). Consequently, the design of data augmentation methods
exp(sim(zi , zj )/τ ) tailored to specific downstream tasks has emerged as a
li,j = − log P2N , (7) significant area of exploration.
k=1 1[k̸=i] exp(sim(zi , zk )/τ )
Given the observed benefits of strong data augmenta-
where 1[k̸=i] ∈ {0, 1} is an indicator function equal to 1 if tion in enhancing CL performance [53], there has been a
k ̸= i, and τ denotes the temperature hyper-parameter. The growing interest in leveraging more robust augmentation
overall loss is computed across all positive pairs, including techniques. However, it is worth noting that solely relying
both (i, j) and (j, i), within the mini-batch. on strong data augmentation can actually lead to a decline
In MoCo, the features generated by the momentum in performance [62]. The distortions introduced by strong
encoder are stored in a feature queue as negative examples. data augmentation can alter the image structure, resulting
These negative examples do not undergo gradient updates in a distribution that differs from that of weakly augmented
during backpropagation. Conversely, SimCLR utilizes neg- images. This discrepancy poses optimization challenges. To
ative examples from the current mini-batch, and all of them address the overfitting issue arising from strong data aug-
are subjected to gradient updates during backpropagation. mentation, [64] proposes an alternative approach. Instead
Both MoCo and SimCLR rely on data augmentation tech- of employing a one-hot distribution, they suggest using
niques, including cropping, resizing, and color distortion. the distribution generated by weak data augmentation as
Notably, SimCLR made a significant contribution by high- a mimic. This mitigates the negative impact of strong data
lighting the crucial role of robust data augmentation in CL, augmentation by aligning the distribution of augmented
a finding subsequently confirmed by MoCo v2. Additional examples with that of weakly augmented examples.
augmentation methods have also been explored [58]. For 2.2.2.2 Self-distillation-based CL: Bootstrap Your
instance, in [59], foreground saliency levels were estimated Own Latent (BYOL) [65] is a prominent self-distillation
in images, and augmentations were created by selectively algorithm designed specifically for self-supervised image
copying and pasting image foregrounds onto diverse back- representation learning, eliminating the need for negative
grounds, such as grayscale images with random grayscale pairs. This approach employs two identical DNNs, known
levels, texture images, and ImageNet images. Furthermore, as Siamese networks, with the same architecture but differ-
views can be derived from various sources, including dif- ent weights. One serves as the online network, while the
ferent modalities such as photos and sounds [60], as well as other is the target network. Similar to MoCo [50], BYOL
coherence among different image channels [61]. enhances the target network through a gradual averaging
Minimizing the contrastive loss is known to effec- of the online network. Siamese networks have emerged
tively maximize a lower bound of the mutual information as prevalent architectures in contemporary self-supervised
I(x1 ; x2 ) between the variables x1 and x2 [56]. Building visual representation learning models, including SimCLR,
upon this understanding, [62] proposes principles for de- BYOL, and SwAV [66]. These models aim to maximize
signing diverse views based on information theory. These the similarity between two augmented versions of a single
principles suggest that the views should aim to maximize image while incorporating specific conditions to mitigate
I(v1 ; y) and I(v2 ; y) (v1 , v2 , and y denoting the first view, the risk of collapsing solutions.
the second view, and the label, respectively), representing Simple Siamese (SimSiam) networks, introduced by [67],
the amount of information contained about the task label, offers a straightforward approach to learning effective rep-
while simultaneously minimizing I(v1 ; v2 ), indicating the resentations in SSL without the need for negative example
shared information between inputs encompassing both task- pairs, large batches, or momentum encoders. Given a data
relevant and irrelevant details. Consequently, the optimal point x and two randomly augmented views x1 and x2 ,
data augmentation method is contingent on the specific an encoder f and an MLP prediction head h process these
downstream task. In the context of dense prediction tasks, views. The resulting outputs are denoted as p1 = h (f (x1 ))
[63] introduces a novel approach for generating different and z2 = f (x2 ). The objective of [67] is to minimize their
views. This study reveals that commonly employed data negative cosine similarity:
augmentation methods, as utilized in SimCLR, are more p1 z2
suitable for classification tasks rather than dense prediction D (p1 , z2 ) = − . (8)
∥p1 ∥2 ∥z2 ∥2
grad
grad similarity & grad
similarity identical networks along the batch dimension:
dissimilarity
P A B
predictor b zb,i zb,j
Cij = r 2 rP 2 . (12)
share moving
A B
P
b zb,i b zb,j
weights average
momentum
encoder ..................... encoder encoder encoder
In the above equation, b represents the batch example index

image image and i, j denotes the vector dimension indices of the network
SimCLR BYOL outputs. The matrix C has a square shape and its size is
grad grad
equal to the dimensionality of the network output.
similarity similarity
Variance-Invariance-Covariance Regularization:
Sinkhorn-
Knopp
predictor Variance-invariance-covariance regularization (VICReg)
[70] has been proposed for SSL, similar to Barlow twins
encoder encoder encoder encoder
[69]. While Barlow twins focus on a cross-correlation
matrix, VICReg considers variance, invariance, and
covariance simultaneously. Let d, n, and zjA represent the
image image dimensionality of the vectors in Z A , the batch size, and the
SwAV SimSiam vector consisting of the values at dimension j among all
examples in Z A , respectively. The variance regularization
Fig. 6: Comparison among different Siamese architectures. term v in VICReg is defined as a hinge loss function applied
The image is reproduced from [67]. to the standard deviation of the embeddings along the
batch dimension:
d
1X
Here, ∥∥2 represents the l2 -norm. Similar to [65], a symmet- v ZA = max(0, γ − S zjA , ε ). (13)
ric loss [67] is defined as d j=1
1 Here, S represents the regularized standard deviation, de-

L= (D (p1 , z2 ) + D (p2 , z1 )) . (9) fined as
2 q
This loss is defined based on the example x, and the overall S(y, ε) = Var(y) + ε. (14)
loss is the average of all examples. Notably, [67] employs
The constant γ determines the standard deviation and is
a stop-gradient (stopgrad) operation by modifying Eq. (8)
set to 1 in the experiments, while ε is a small scalar used to
as D (p1 , stopgrad (z2 )). This implies that z2 is treated as a
prevent numerical instabilities. This criterion encourages the
constant. Similarly, Eq. (9) is revised as
variance within the current batch to be equal to or greater
1 than γ for every dimension, thereby preventing collapse
L= (D (p1 , stopgrad (z2 )) + D (p2 , stopgrad (z1 ))) . (10) scenarios where all data are mapped to the same vector.
2
The invariance criterion s in VICReg, which captures
Figure 6 illustrates the distinctions among SimCLR, the similarity between Z A and Z B , is defined as the mean-
BYOL, SwAV, and SimSiam. The categorization of BYOL and squared Euclidean distance between each pair of data with-
SimSiam as CL methods is a subject of debate due to their out any normalization:
exclusion of negative examples. However, to be consistent n
1X A 2
with [68], this paper considers BYOL and SimSiam to belong s Z A, Z B = z − zbB . (15)
to CL methods. n b=1 b 2
2.2.2.3 Feature decorrelation-based CL: The objec-

In addition, the covariance criterion c(Z) in VICReg is
tive of feature decorrelation is to learn features that are
defined as
decorrelated.
Barlow twins: Barlow twins [69] introduced a novel
1X 2
c (Z) = [C(Z)]i,j , (16)
loss function that encourages the similarity of embedding d i̸=j
vectors from distorted versions of an example while min-
imizing redundancy between their components. Similar to where C(Z) represents the covariance matrix of Z . The
other SSL methods such as MoCo [50] and SimCLR [53], overall loss of VICReg is a weighted sum of the variance,
Barlow twins generate two views for each image in a batch invariance, and covariance:
X sampled from a dataset, resulting in batches of embed- l Z A , Z B = s Z A , Z B+ α v Z A + v Z B

dings Z A and Z B . The objective function of Barlow twins is (17)
+β C Z A + C Z B ,
defined as
where α and β are two hyper-parameters.
2
X XX
2 2.2.2.4 Analysis of CL: Despite the impressive re-
LBT = (1 − Cii ) + λ Cij . (11)
i i j̸=i sults achieved by contrastive SSL, the underlying mecha-
nisms responsible for its success remain somewhat obscure
Here, λ is a hyper-parameter, and C represents the cross- and not fully understood. Several studies have delved into
correlation matrix computed between the outputs of two this area [71]–[82]. Theoretical investigations by [72], [77],
[80] have provided support for the value of feature represen- Newell et al. [83] conducted a comprehensive investi-
tations generated through CL. Their findings demonstrate gation into the potential effects of pre-training on model
that these representations offer significant utility for down- performance. Their study explored three key hypotheses as
stream tasks. follows. Firstly, whether pre-training consistently leads to
Connection to principal component analysis: Tian [82] performance improvements. Secondly, whether pre-training
demonstrated that CL with loss functions like InfoNCE can achieves higher accuracy when faced with limited labeled
be formulated as a max-min problem. The max function data, but eventually levels off at a performance comparable
aims to maximize the contrast between feature represen- to the baseline when sufficient labeled data is available.
tations, while the min function assigns weights to pairs Thirdly, whether pre-training converges to baseline perfor-
of examples with similar representations. In the context of mance before reaching its plateau in accuracy. To address
deep linear networks, Tian showed that the max function these hypotheses, the authors conducted experiments on the
in representation learning is equivalent to principal com- synthetic COCO dataset with rendering, allowing for the
ponent analysis (PCA), and most local minima correspond availability of a large number of labels. The results revealed
to global minima, thus recovering optimal PCA solutions. that self-supervised pre-training adheres to the assumption
Experimental results revealed that this formulation, when outlined in the third hypothesis. This suggests that SSL
extended to include new contrastive losses beyond In- does not surpass supervised learning in terms of learning
foNCE, achieves comparable or even superior performance capability, but does perform effectively when dealing with
on datasets like STL-10 and CIFAR10. Furthermore, Tian ex- limited labeled data.
tended his theoretical analysis to 2-layer rectified linear unit 2.2.2.5 Others: Besides the aforementioned works,
(ReLU) networks, emphasizing the substantial differences several other approaches have employed CL. Among them,
between linear and nonlinear scenarios and highlighting [52], [84] investigated the utilization of vision transform-
the essential role of data augmentation during the training ers (ViTs) as the backbone for contrastive SSL, employ-
process. It is noteworthy that PCA aims to maximize the ing multi-crop and cross-entropy losses [84]. Notably, [84]
inter-example distances within a low-dimensional subspace, discovered that the resultant features exhibited exceptional
making it a specific instance of instance discrimination. performance as K -nearest neighbors (K -NN) classifiers and
Connection to spectral clustering: Chen et al. [79] es- effectively encoded explicit information regarding the se-
tablished a connection between CL and spectral clustering, mantic segmentation of images. These desirable properties
showing that the representations obtained from CL corre- have also motivated specific downstream tasks [85]. In a
spond to embeddings of a positive pair graph in spectral different study, [86] adopted patches extracted from the
clustering. Specifically, the authors introduced a population same image as a positive pair, while patches from different
augmentation graph, where nodes represent augmented images served as negative pairs. A mixing operation is fur-
data from the population distribution, and the presence of ther explored in RegionCL [87] to diversify the contrastive
an edge between nodes is determined by whether they orig- pairs. Yang et al. [88] integrated CL and MIM in the context
inate from the same original example. Their key assump- of text recognition, utilizing a weighted objective function.
tion is that different classes exhibit only a limited number Numerous CL-based methods are available in the literature
of connections, resulting in a sparser partition for such a [89]–[109]. It should be noted that CL is not restricted solely
graph. Empirical evidence has confirmed this characteristic, to SSL, as it can also be used in supervised learning [110].
illustrating the data continuity within the same class [81].
Specifically, spectral decomposition is employed on the 2.2.3 Generative algorithms
adjacency matrix to construct a matrix, where each row
denotes the representation of an example. Through a linear For the category of generative algorithms, this study pri-
transformation, they demonstrated that the corresponding marily focuses on MIM methods. MIM methods [111] (Fig.
feature extractor could be retrieved by minimizing an un- 7)—namely, bidirectional encoder representation from im-
conventional contrastive loss given as follows: age transformers (BEiT) [112], masked AE (MAE) [68], con-
text AE (CAE) [113], and a simple framework for MIM
L(f ) =h−2 · Ex,x+ f (x)i⊤ f (x+ )

(SimMIM) [114]—have gained significant popularity and
2 (18)
+Ex,x′ f (x)⊤ f (x′ ) pose a considerable challenge to the prevailing dominance

.
of CL. MIM leverages co-occurrence relationships among
It is worth noting that in cases where the dimensionality image patches as supervision signals.
of the representation surpasses the maximum count of dis- MIM represents a variant of the denoising AE (DAE)
joint subgraphs, the utilization of learned representations in [16], emphasizing the importance of a robust representation
linear classification is guaranteed to yield minimal error. that remains resilient to input noise. Notably, the bidirec-
Connection to supervised learning: Recent research has tional encoder representations from Transformers (BERT)
highlighted the remarkable efficacy of self-supervised pre- [11] have emerged as a renowned variant of the DAE and
training using CL for downstream tasks involving classifi- achieved remarkable success in NLP. Researchers aspire to
cation. However, its effectiveness may vary when applied extend this success to CV by employing BERT-like pre-
to other task domains. Thus, there is a compelling need training strategies. However, it is crucial to acknowledge
to investigate the potential of contrastive pre-training in that BERT’s success in NLP can be attributed not only to
augmenting supervised learning, particularly in terms of its large-scale self-supervised pre-training but also to its
surpassing the accuracy achieved through traditional super- scalable network architecture. A notable distinction between
vised learning. the NLP and CV communities is their use of different
joint where E denotes the encoder, D denotes the decoder, T1

recover
embedding represents the transformation applied to the input before it
decoder is fed into the network, and T2 represents the transformation
used to derive the target label. It is noteworthy that this
representation is provided for the sake of clarity and ease of
understanding rather than serving as a strict definition.
encoder ..................... encoder encoder The primary distinction between BEiT and MAE lies in
their choice of T . While BEiT employs the token output from
view 1 view 2 masked view
the pre-trained tokenizer as its target, MAE directly uses
the original pixels as its target. BEiT adopts a two-stage
Fig. 7: The broad differences between CL and MIM. Note
approach, initially training a tokenizer to convert images
that the actual differences between their pipelines are not
into visual tokens, followed by BERT-style training. On the
limited to what is shown.
other hand, MAE is a one-stage end-to-end approach, incor-
Fig. 7: The broad differences between
porating a decoder to decode the encoder-derived represen-
CL and MIM. Note tation into the original pixels. The two representative MIM
that the actual
primary models, with transformers differences
being prevalent between
in NLP approaches BEiT and MAE, showcase different architectural
and CNNs being widely adopted their pipelines
in CV. are not designs, with subsequent MIM methods often following one
The landscape changed significantly
limited to whatwith the introduc- of these techniques. A central challenge in MIM lies in the
is shown
tion of the original ViT [5], which marked a pivotal mo- selection of the target representation T2 , which leads to the
ment. Alexey Dosovitskiy et al. conducted pioneering re- categorization of MIM methods, as presented in Table 2.
search on applying MIM to CV, drawing inspiration from Following the introduction of BEiT and MAE, several
BERT’s masked image prediction paradigm. Their smaller variants have been proposed. iBOT [111] is an “online
ViT-B/16 model achieved 79.9% accuracy on ImageNet [1] tokenizer” adaptation of BEiT, aiming to address the limita-
through self-supervised pre-training, an impressive 2% im- tion of dVAE in capturing only low-level semantics within
provement over training from scratch. However, it still fell local details. The CAE introduces an alignment constraint
short of the accuracy attained by supervised pre-training. to encourage masked patch representations (predicted by
Beyond ViTs, a separate early investigation adopted context a “latent contextual regressor”) to lie in the encoded rep-
encoders [115], employing a concept akin to MAE, i.e., image resentation space. This decoupling of the representation
inpainting. learning task and pretext task enhances the model’s capacity
Despite their structural alignment, MAE did not find for representation learning. Furthermore, MAE has been
significant application in vision research until the emer- extended to other modalities beyond images [124]–[126].
gence of BEiT. Prior to BEiT, there was another endeavor MIM has demonstrated significant potential in pre-
known as image generative pre-training (iGPT) [116], but it training vision transformers [127]–[130]. Leveraging ver-
received limited attention due to its subpar accuracy and satile ViT backbones, MIM acquires self-supervised visual
computational efficiency. In the footsteps of BERT’s success representations by masking certain patches of the original
in NLP, BEiT introduced a tailored MIM task for visual pre- image and subsequently recovering the masked informa-
training, i.e., a tokenization approach which breaks down tion. However, in prior works, the random masking of
the input image into visual tokens, and then randomly image patches led to an underutilization of valuable seman-
masks a subset of the image patches. Similar to BERT, tic information essential for effective visual representation
both masked and unmasked image tokens are fed into the learning. Furthermore, the considerable size of the backbone
ViT, aiming to recover the masked visual tokens based on resulted in extensive pre-training time for most previously
the information from unmasked patches. To address the developed methods. To address these issues, Liu et al. [131]
challenge of predicting raw pixels, the authors leveraged introduced an attention-driven masking and throwing strat-
a discrete variational autoencoder (dVAE) [117] to create a egy, effectively tackling both aforementioned challenges.
predefined visual vocabulary. An interesting aspect is that
the downstream task does not entail explicit [mask] labels, 2.2.4 Contrastive Generative Methods
leading researchers to develop diverse algorithms to miti- As stated in [132], contrastive models tend to be data-
gate this issue. For instance, in the original BERT, methods hungry and vulnerable to overfitting issues, whereas gen-
like random words were used to alleviate inconsistencies erative models encounter data-filling challenges and ex-
between upstream and downstream tasks. In CV, both BEiT hibit inferior data scaling capabilities when compared to
and SimMIM adopt paradigms akin to BERT, involving the contrastive models. While contrastive models often fo-
inclusion of special [mask] tokens into the network. cus on global views [84], overlooking internal structures
In contrast to BEiT, MAE does not utilize special mask within images, MIM primarily models local relationships.
tokens and treats the task as a regression problem. It ad- The divergent characteristics and challenges encountered
dresses the challenges of applying the DAE in CV from three in contrastive self-supervised learning and generative self-
key aspects: architecture, information density, and decoder supervised learning have motivated researchers to explore
design. MAE’s simplicity and effectiveness have established the combination of these two kinds of approaches.
it as a crucial baseline within the MIM domain. To elaborate further, let us compare the challenges faced
Here, we define by contrastive self-supervised methods and generative self-
supervised methods. Generative self-supervised methods
MIM := L (D (E (T1 (I))) , T2 (I)) , (19) are characterized as data-filling approaches [133]. For a
Low-Level Targets High-Level Targets Self-Distillation Contrastive / Multi-modal Teacher
Algorithm ViT [5] MAE [68] SimMIM [114] Maskfeat [118] BEiT [112] CAE [113] PeCo [119] data2vec [120] SdAE [121] MimCo [122] BEiT v2 [123]
Target Raw Pixel HOG VQ-VAE VQ-GAN self MoCo v3 CLIP
TABLE 2: Classifications of MIM methods based on the reconstruction target. The second and third rows denote MIM
methods and reconstructing targets, respectively.
SSL
Context Masked Image

Contrastive
Based Modeling
learning
Negative Contrastive/Multi-
Clustering Self-Distillation Feature Decorrelation Low-level targets High-level targets Self-Distillation
Samples modal Teacher
Barlow ViT MAE BEiT PeCo data2vec SdAE MaskFeat BEiT v2

Rotation Jigsaw Colorization MoCos SimCLR SwAV BYOL DINO VICReg
Twins
Fig. 8: Several representative pretext tasks of SSL.
model of a certain size, when the dataset reaches a certain 2.2.5 Summary
magnitude, further scaling of the data does not lead to As described above, numerous pretext tasks for SSL have
significant performance gains in generative self-supervised been devised, with several significant milestone variants
methods. In contrast, recent studies have revealed the po- depicted in Fig. 8. Several other pretext tasks are avail-
tential of data scaling to enhance the performance of CL able [138], [139], encompassing diverse approaches such
[134]. As data increases, CL shows substantial performance as relative patch location [140], noise prediction [141], fea-
improvements, demonstrating remarkable generalization ture clustering [142]–[144], cross-channel prediction [145],
without additional fine-tuning on downstream tasks. How- and combining different cues [146]. Kolesnikov et al. [147]
ever, the scenario differs in low-data regimes. Contrastive conducted a comprehensive investigation of previously
models may find shortcuts with trivial representations that proposed SSL pretext tasks, yielding significant insights.
overfit the limited data [50], thus leading to inconsistent Besides, Krähenbühl et al. [148] proposed an alternative
improvements in generalization performance for down- approach to pretext tasks and demonstrated the ease of
stream tasks using pre-trained models with contrastive self- obtaining data from video games.
supervised methods [132]. On the other hand, generative It has been observed that context-based approaches ex-
methods are more adept at handling low-data scenarios and hibit limited applicability due to their inferior performance.
can even achieve notable performance improvements when In the realm of visual SSL, two dominant types of algorithms
data is extremely scarce, such as with only 10 images [135]. are CL and MIM. While visual CL may encounter overfit-
ting issues, CL algorithms that incorporate multi-modality,
Several endeavors have sought to integrate both types exemplified by CLIP [2], have gained popularity.
of algorithms [132], [136]. In [136], GANs are employed
for online data augmentation in CL. The study devises a 2.3 Combinations with other learning paradigms
contrastive module that learns view-invariant features for
generation and introduces a view-invariant loss function to It is essential to acknowledge that the advancements in
facilitate learning between original and generated views. On SSL did not occur in isolation; instead, they have been
the other hand, [111] draws inspiration from both BEiT and the result of continuous development over time. In this
DINO [84]. It modifies the tokenizer of BEiT to an online dissection, we provide a comprehensive list of relevant learning
tilled teacher while integrating cross-view distillation from paradigms that, when combined with SSL, contribute to a
the DINO framework. As a result, iBOT [111] significantly clearer understanding of their collective impact.
enhances linear probing accuracy compared to the MIM 2.3.1 GANs
method.
GANs represent classical unsupervised learning methods
Despite attempts to combine both types of approaches, and were among the most successful approaches in this do-
naive combinations may not always yield performance main before the surge of SSL techniques. The integration of
gains and can even perform worse than the generative GANs with SSL offers various avenues, with self-supervised
model baseline, thereby exacerbating the issue of repre- GANs (SS-GAN) serving as one such example. The GANs’
sentation over-fitting [132]. The performance degradation objective function [35], [149] is given as
could be attributed to the disparate properties of CL and min max V (G, D) = Ex∼pdata (x) [log D (x)]
generative methods. For instance, CL methods typically G D (20)
+Ez∼pz (z) [log (1 − D (G (z)))] .
exhibit longer attention distances, whereas generative meth-
ods tend to favor local attention [137]. In light of this The SS-GAN [150] is defined by combining the objective
challenge, RECON [132] emerges as a solution by training functions of GANs with the concept of rotation [7]:
generative modeling to guide CL, thereby leveraging the LG = −V (G, D) − αEx∼pG Er∼R [log QD (R = r|xr )],
benefits of both paradigms. (21)
LD = V (G, D) − βEx∼pdata Er∼R [log QD (R = r|xr )],
where V (G, D) represents the objective function of GANs as 2.3.4 Multi-view/multi-modal(ality) learning
given in Eq. (20), and r ∼ R refers to a rotation selected from Observation plays a vital role in infants’ acquisition of
a set of possible rotations, similar to the concept presented knowledge about the world. Notably, they can grasp the
in [7]. Here, xr denotes an image x rotated by r degrees, concept of apples through observational and comparative
and Q (R|xr ) corresponds to the discriminator’s predictive processes, which distinguishes their learning approach from
distribution over the angles of rotation for a given example traditional supervised algorithms that rely on extensive
x. Notably, rotation [7] serves as a classical SSL method. The labeled apple data. This phenomenon was demonstrated by
SS-GAN incorporates rotation invariance into the GANs’ Orhan et al. [22], who gathered perceptual data from infants
generation process by integrating the rotation prediction and employed an SSL algorithm to model how infants learn
task during training. the concept of “apple”. Moreover, infants’ learning about the
world extends to multi-view and multi-modal(ality) learn-
2.3.2 Semi-supervised learning ing [2], encompassing various sensory inputs such as video
SSL and semi-supervised learning are contrasting and audio. Hence, SSL and multi-view/multi-modal(ality)
paradigms that can be effectively combined. One notable learning converge naturally in infants’ learning mechanisms
example of this combination is self-supervised semi- as they explore and comprehend the workings of the world.
supervised learning (S4 L) [151]. In S4 L, the objective 2.3.4.1 Multiview CL: The objective function in
function is given by standard multiview CL, as proposed by Tian et al. [62], is
given by
min Ll (Dl , θ) + wLu (Du , θ) , (22)
θ LN CE = E [Lq ] , (25)
where Dl represents the labeled training dataset, Du is the where Lq corresponds to Eq. (5). Additionally, it holds
unlabeled training dataset, Ll denotes the classification loss that LN CE + IN CE (v1 , v2 ) = log(K), with v1 and v2
computed on all labeled examples, Lu stands for the self- representing two views of the data point x. Tian et al.
supervised loss (e.g., rotation task in Eq. (3)) utilizing both [62] conducted a study to identify effective views for CL
Dl and Du , w is a free parameter used for balancing the and introduced both unsupervised view learning and semi-
contributions of Ll and Lu , and θ represents the parameters supervised view learning. To split an image X over its
of the learning model. channels, the operation is represented as {X1 , X2:3 }. Let
Incorporating SSL as an auxiliary task is a well- X̂ denote g(X), i.e., X̂ = g(X), where g represents a
established approach in semi-supervised learning. Another flow-based model. For both unsupervised view learning
classical method to leverage SSL within this context involves and semi-supervised view learning, adversarial training
implementing SSL on unlabeled data, followed by fine- was employed. Two encoders, f1 and f2 , were trained to
tuning the resultant model on labeled data, as demonstrated maximize IN CE (X̂1 , X̂2:3 ) as stated in Eq. (25), while g was
in the SimCLR framework. trained to minimize IN CE (X̂1 , X̂2:3 ). Formally, the objective
To demonstrate the robustness of self-supervision function for unsupervised view learning can be expressed as
against adversarial perturbations, Hendrycks et al. [152]
proposed an overall loss function as a linear combination min max INf1CE
,f2
(g(X)1 , g(X)2:3 ). (26)
g f1 ,f2
of supervised and self-supervised losses:
In the context of semi-supervised view learning, when
L(x, y, θ) = LCE (y, p (y|P GD(x)) , θ) several labeled examples are available, the objective func-
(23)
+λLSS (P GD(x), θ) , tion is formulated as
where x represents an example, y is the ground-truth label, min max INf1CE
,f2
(g(X)1 , g(X)2:3 )
g,c1 ,c2 f1 ,f2 (27)
θ denotes the model parameters, LCE refers to the cross- +Lce (c1 (g(X)1 ) , y) + Lce (c2 (g(X)2:3 ) , y) ,
entropy loss, and P GD stands for projected gradient de-
scent. The first and second terms in (23) correspond to the where y represents the labels, c1 and c2 are classifiers, and
supervised learning loss and the SSL loss, respectively. Lce denotes the cross-entropy loss. Further relevant works
can be found in [61], [62], [153]. Table 3 summarizes different
2.3.3 Multi-instance learning (MIL) SSL losses.
2.3.4.2 Images and text: In the study conducted by
Miech et al. [13] introduced an extension of the InfoNCE Gomez et al. [154], the authors employed a topic mod-
loss (5) for MIL and termed it MIL-NCE: eling framework to project the text of an article into the
 P f (x)T g(y)
 topic probability space. This semantic-level representation
n
e
X  (x,y)∈Pi  was then utilized as the self-supervised signal for train-
max log  T T
, (24)ing CNN models on images. On a similar note, CLIP
ef (x) g(y) + ef (x′ ) g(y′ ) 
P P
f,g 
i=1 [2] leverages a CL-style pre-training task to predict the
(x,y)∈Pi (x′ ,y ′ )∈Ni
correspondence between captions and images. Benefiting
where x and y represent a video clip and a narration, from the CL paradigm, CLIP is capable of training models
respectively. The functions f and g generate embeddings from scratch on an extensive dataset comprising 400 million
of x and y , respectively. For a specific example indexed by image-text pairs collected from the internet. Consequently,
i, Pi denotes the set of positive video/narration pairs, while CLIP’s advancements have significantly propelled multi-
Ni corresponds to the set of negative video/narration pairs. modal learning to the forefront of research attention.
TABLE 3: Different losses of SSL.
Category Method Loss Equation

K
1
log(F y (g (Xi |y) |θ))
P
Context-Based Rotation [7] loss(Xi , θ) = − K (3)
y=1
exp(q·k+ /τ )
MoCo v1 [50] Lq = − log P K exp(q·k /τ ) (5)
Pretext i=0 i
exp(sim(zi ,zj )/τ )
SimCLR v1 [53] li,j = − log P2N (7)
CL k=1 1[k̸=i] exp(sim(zi ,zk )/τ )
1
SimSiam [67] L= (D (p1 , stopgrad (z2 )) + D (p2 , stopgrad (z1 )))
2
(10)
(1 − Cii )2 + λ
P PP 2
Barlow twins [69] LBT = Cij (11)
i i j̸=i
A B A B A + v ZB

l Z ,Z =s Z , Z + α v Z
VICReg [70] (17)
+β C Z A + C Z B ,
LG = −V (G, D) − αEx∼pG Er∼R [log QD (R = r|xr )],
SS-GAN [150] (21)
LD = V (G, D) − βEx∼pdata Er∼R [log QD (R = r|xr )]
Combinations
S4 L [151] min Ll (Dl , θ) + wLu (Du , θ) (22)
with θ
Other SSL improving robustness [152] L(x, y, θ) = LCE (y, p (y|P GD(x)) , θ) + λLSS (P GD(x), θ) (23)
Learning unsupervised view learning [62] min max IN f1 ,f2 (g(X) , g(X) (26)
g f1 ,f2 CE 1 2:3 )
Paradigms
min max I f1 ,f2 (g(X)1 , g(X)2:3 )
semi-supervised view learning [62] g,c1 ,c2 f1 ,f2 N CE (27)

+Lce c1 g(X)1 , y + Lce c2 g(X)2:3 , y
2.3.4.3 Point clouds and other modalities: Several heavily depends on training data that may not accurately
SSL methods have been proposed for joint learning of 3D represent the new test distribution, leading to bias. On the
point cloud features and 2D image features by leverag- other hand, training a new model from scratch for each
ing cross-modality and cross-view correspondences through test input, ignoring all training data, is undesirable. This
triplet and cross-entropy losses [155]. Additionally, there are approach results in an unbiased representation for each test
efforts to jointly learn view-invariant and mode-invariant input but exhibits high variance due to its singularity.
characteristics from diverse modalities, such as images,
point clouds, and meshes, using heterogeneous networks 2.3.6 Summary
for 3D data [156]. SSL has also been employed for point The evolution of SSL is characterized by its dynamic and
cloud datasets, with approaches including CL and cluster- interconnected nature. Analyzing the amalgamation of var-
ing based on graph CNNs [157]. Furthermore, AEs have ious methods allows for a clearer grasp of SSL’s devel-
been used for point clouds in works like [125], [126], [158], opmental trajectory. An exemplar of this success is evi-
[159], while capsule networks have been applied to point dent in CLIP, which effectively combines CL with multi-
cloud data in [160]. modal learning, leading to remarkable achievements. SSL
has been extensively integrated with various machine learn-
2.3.5 Test time training ing tasks, showcasing its versatility and potential. It has
Sun et al. [161] introduced “test time training (TTT) with been combined with clustering [66], semi-supervised learn-
self-supervision” to enhance the performance of predictive ing [151], multi-task learning [163]–[166], transfer learning
models when the training and test data come from distinct [167]–[169], graph NNs [153], [170]–[175], reinforcement
distributions. TTT converts an individual unlabeled test learning [176]–[178], few-shot learning [179], [180], neural
example into an SSL problem, enabling model parameter architecture search [181], robust learning [152], [182]–[184],
updates before making predictions. Recently, Gandelsman and meta-learning [185], [186]. This diverse integration un-
et al. [162] combined TTT with MAE for improved perfor- derscores the widespread applicability and impact of SSL in
mance. They argued that by treating TTT as a one-sample the machine learning domain.
learning problem, optimizing a model for each test input
could be addressed using the MAE as 3 A PPLICATIONS
n SSL initially emerged in the context of vowel class recogni-
1 X
h0 = arg min lm (h ◦ f0 (xi ) , yi ) , (28) tion [187], and subsequently, it was extended to encompass
h n i=1 object extraction tasks [188]. SSL has found widespread ap-
fx , gx = arg min ls (g ◦ f (mask(x)), x). (29) plications in diverse domains, including CV, NLP, medical
f,g image analysis, and remote sensing (RS).
Here, f and g refer to the encoder and decoder of MAE, and
h denotes the main task head, respectively. 3.1 CV
In contrast to the classic paradigm, during training, the Sharma et al. [189] introduced a fully convolutional volu-
main task head utilizes features acquired from the MAE metric AE for unsupervised deep embeddings learning of
encoder rather than the original examples. Consequently, object shapes. In addition, SSL has been extensively applied
a singular example suffices for training f during prediction. to various aspects of image processing and CV: image
Moreover, this paper offers an intuitive rationale for the inpainting [115], human parsing [190], [191], scene deocclu-
efficacy of TTT. Specifically, TTT achieves an improved bias- sion [192], semantic image segmentation [193], [194], monoc-
variance tradeoff under distribution shifts. A static model ular vision [195], person reidentification (re-ID) [196]–[198],
visual odometry [199], scene flow estimation [200], knowl- 3.1.1.2 Motions of objects in videos: Diba et al.
edge distillation [201], optical flow prediction [202], vision- [231] focused on SSL of motions in videos by employing
language navigation (VLN) [203], physiological signal es- dynamic motion filters to enhance motion representations,
timation [204], [205], image denoising [206], [207], object particularly for improving human action recognition. The
detection [208]–[210], super-resolution [211], [212], voxel concept of SSL with videos (CoCLR) [232] bears similarities
prediction from 2D images [213], ego-motion [214], [215], to SimCLR [53].
and mask prediction [216]. These applications highlight the 3.1.1.3 Multi-modal(ality) data in videos: The au-
broad impact and relevance of SSL in the realm of image ditory and visual components in a video are intrinsically
processing and CV. interconnected. Leveraging this correlation, Korbar et al.
[233] employed a self-supervised temporal synchronization
3.1.1 SSL models for videos approach to learn comprehensive and effective models for
SSL has garnered widespread usage across various appli- both video and audio analysis. Similarly, other methodolo-
cations, including video representation learning [217]–[219] gies [60], [234] are also founded on joint video and audio
and video retrieval [220]. Wang et al. [221] employed a vast modalities while certain studies [235]–[237] incorporated
collection of unlabeled web videos to learn visual repre- both video and text modalities. Moreover, Alayrac et al.
sentations. The central concept revolves around utilizing [238] explored a tri-modal approach involving vision, audio,
visual tracking as a self-supervised signal. Consequently, and language in videos. On a different note, Sermanet et al.
two patches connected by a track are expected to possess [239] proposed a self-supervised technique for learning rep-
similar visual representations, as they likely correspond to resentations and robotic behaviors from unlabeled videos
the same object or belong to the same object part. Srivastava captured from various viewpoints.
et al. [222] proposed a composite self-supervised model by 3.1.1.4 Spatial-temporal coherence of objects in
integrating two distinct models: a long short-term memory videos: Wang et al. [240] introduced a self-supervised al-
(LSTM) AE and an LSTM-based future prediction model. gorithm for learning visual correspondence in unlabeled
This composite model served the dual purpose of input videos by utilizing cycle consistency in time as a self-
reconstruction and future prediction. supervised signal. Extensions of this work have been ex-
3.1.1.1 Temporal information in videos: Various plored by Li et al. [241] and Jabri et al. [242]. Lai et al. [243]
forms of temporal information in videos can be employed, presented a memory-augmented self-supervised method
encompassing frame order, video playback direction, video that enables generalizable and accurate pixel-level tracking.
playback speed, and future prediction information [223], Zhang et al. [244] employed spatial-temporal consistency
[224]. 1) The order of the frames. Several studies have of depth maps to mitigate forgetting during the learning
explored the significance of frame order in videos. Misra et process. Zhao et al. [245] proposed a novel self-supervised
al. [9] introduced a method for learning visual representa- algorithm named the “video cloze procedure (VCP),” which
tions from raw spatiotemporal signals and determining the facilitates learning rich spatial-temporal representations for
correct temporal sequence of frames extracted from videos. videos.
Fernando et al. [225] proposed a novel self-supervised CNN
pre-training approach called “odd-one-out learning,” where 3.1.2 Universal sequential SSL models for image process-
the objective is to identify the unrelated or odd element ing and CV
within a set of related elements. This odd element corre- Contrastive predictive coding (CPC) [56] operates on the
sponds to a video subsequence with an incorrect temporal fundamental concept of acquiring informative representa-
frame order, while the related elements maintain the cor- tions through latent space predictions of future data using
rect temporal order. Lee et al. [226] employed temporally robust autoregressive models. While initially applied to
shuffled frames, presented in a non-chronological order, as sequential data like speech and text, CPC has also found
inputs to train a CNN for predicting the correct order of applicability to images [246].
the shuffled sequences, effectively using temporal coher- Drawing inspiration from the accomplishments of GPT
ence as a self-supervised signal. Building upon this work, [247], [248] in NLP, iGPT [116] investigates whether similar
Xu et al. [227] utilized temporally shuffled clips as inputs models can effectively learn representations for images.
instead of individual frames, training 3D CNNs to sort iGPT explores two training objectives, namely autoregres-
these shuffled clips. 2) Video playback direction. Temporal sive prediction and a denoising objective, thereby sharing
direction analysis in videos, as studied by Wei et al. [10], similarities with BERT [11]. In high-resolution scenarios,
involves discerning the arrow of time to determine if a this approach [116] competes favorably with other self-
video sequence progresses in the forward or backward supervised methods on ImageNet [1]. Similar to iGPT, ViT
direction. 3) Video playback speed. Video playback speed [5] also adopts a transformer architecture for vision tasks. By
has been a subject of investigation in several studies. Benaim applying a pure transformer to sequences of image patches,
et al. [228] focused on predicting the speeds of moving ViT has demonstrated outstanding performance in image
objects in videos, determining whether they moved faster classification tasks. The transformer architecture has been
or slower than the normal speed. Yao et al. [229] leveraged further extended to various vision-related applications, as
playback rates and their corresponding video content as evidenced by [52], [68], [84], [112], [249], [250].
self-supervision signals for video representation learning.
Additionally, Wang et al. [230] addressed the challenge of 3.2 NLP
self-supervised video representation learning through the In the realm of NLP, pioneering works for performing
lens of video pace prediction. SSL on word embeddings include the continuous bag-of-
words model and the continuous skip-gram model [251]. It is important to note that due to the aim of comparing
SSL methods, notably BERT [11] and GPT, have found a wide array of algorithms, the experimental setups are not
widespread application in NLP [252]–[256]. Moreover, SSL strictly standardized. Nevertheless, we make efforts to align
has been employed for other sequential data, including crucial hyper-parameters, while certain parameters such
sound data [257]. as the number of training epochs may not be completely
aligned. The experimental results are uniformly obtained
using the default backbone specified in the original papers,
3.3 Other fields
such as ResNet-50 or ViT-B. In instances where certain
Within the medical field [258], the availability of labeled experimental results lack corresponding ResNet-50 or ViT-B
data is typically limited, while a vast amount of unlabeled implementations, we provide results based on other back-
data exists. This natural scenario makes SSL a compelling bones, suitably marked with subscripts.
approach, which has been effectively employed for vari- Setup. The pre-training process utilizes ImageNet-1k [1]
ous tasks like medical image segmentation [259] and 3D as the primary dataset. Subsequently, following a standard
medical image analysis [260]. Recently, SSL has also found procedure [50], [53] outlined in Table 4, a comparative anal-
applications in the remote sensing domain, benefiting from ysis of these methods is conducted through linear classifica-
the abundance of large-scale unlabeled data that remains tion of frozen features. This entails training a linear classifier,
largely unexplored. For example, SeCo [261] leverages sea- which consists of a fully connected layer followed by the
sonal changes in RS images to construct positive pairs and softmax function, using features obtained from the pre-
perform CL. On the other hand, RVSA [262] introduces a trained model. “Fine-tuning” denotes fine-tuning the entire
novel rotated varied-size window attention mechanism that model. The reported results indicate the top-1 classification
advances the plain vision transformer to serve as a funda- accuracy obtained on the ImageNet validation set.
mental model for various remote sensing tasks. Notably, it We also present our findings for the object detection and
is pre-trained using the generative SSL method MAE [68] on semantic segmentation tasks on widely recognized datasets,
the large-scale MillionAID dataset. including PASCAL VOC [264], COCO [265], and ADE20k
[266], [267]. The evaluation of object detection results on the
PASCAL VOC dataset employs the default mean average
4 P ERFORMANCE COMPARISON precision (mAP), specifically AP50 . By default, the object
Once a pre-trained model is obtained through SSL, the as- detection task on PASCAL VOC employs VOC2007 for train-
sessment of its performance becomes necessary. The conven- ing. However, certain methods employ the combined 07+12
tional approach involves gauging the achieved performance dataset instead of VOC2007 for training, and the results are
on downstream tasks to ascertain the quality of the extracted annotated with a superscript “e”. As for the object detection
features. However, this evaluation metric does not provide and instance segmentation tasks on COCO, we adopt the
insights into what the network has specifically learned bounding-box AP (APbb ) and mask AP (APmk ) metrics, in
during self-supervised pre-training. To delve into the inter- accordance with [50].
pretability of self-supervised features, alternative evaluation
metrics, such as network dissection [263], can be employed.
4.2 Summary
Recently, a plethora of MIM methods have emerged, show-
casing distinct focuses compared to previous approaches. In Firstly, the linear probe accuracy of the self-supervised
this section, we aim to present a clear demonstration of the algorithm based on CL consistently surpasses that of the
performance exhibited by various methods. We summarize other algorithms. This superiority can be attributed to the
the classification and transfer learning efficacy of typical algorithm’s ability to generate well-structured latent spaces,
SSL methods on well-established datasets. It is important to wherein distinct categories are effectively separated, and
note that SSL techniques can theoretically be applied to data similar categories are appropriately clustered.
with diverse modalities. However, for the sake of simplicity, Secondly, it is observed that pre-trained models using
we narrow our focus to SSL in the image domain. Within MIM can be fine-tuned to achieve superior performance in
this domain, we compare the achieved performance across most cases. Conversely, pre-trained models based on CL
several downstream tasks, primarily encompassing image lack this property. One primary reason for this discrepancy
classification, object detection, and semantic segmentation. lies in the increased susceptibility of CL-based models to
overfitting [64], [269], [270]. This observation also extends
to the fine-tuning of pre-trained models for downstream
4.1 Comprehensive comparison tasks. MIM-based approaches consistently exhibit substan-
In this section, we present the results obtained from diverse tial performance enhancements in downstream tasks, while
algorithms tested on respective datasets as summarized in CL-based methods offer comparatively limited assistance.
Table 4. The experimental results are drawn either directly Thirdly, CL-based methods tend to employ resource-
from the original papers or from other sources with anno- intensive techniques like momentum encoders, memory
tations. When results are sourced from original papers, no queues, and multi-crop, significantly increasing the de-
specific indication is provided; however, for results from al- mands on computing, storage, and communication re-
ternative works, the data source is indicated. In cases where sources. In contrast, MIM-based methods have a more effi-
a method replicated from another work achieves superior cient resource utilization, possibly attributed to the absence
accuracy compared to the original paper, we consistently of example interactions. This advantageous property allows
report the results with higher accuracy. MIM-based algorithms to easily scale up models and data,
Methods Linear Probe Fine-Tuning VOC det VOC seg COCO det COCO seg ADE20K seg DB
Random: 17.1A [8] - 60.2eR [67] 19.8A [8] 36.7R [50] 33.7R [50] - -
R50 Sup 76.5 [66] 76.5 [66] 81.3e [67] 74.4 [65] 40.6 [50] 36.8 [50] - -
ViT-B Sup 82.3 [68] 82.3 [68] - - 47.9 [68] 42.9 [68] 47.4 [68] -
Context-Based:
Jigsaw [8] 45.7R [66] 54.7 61.4R [42] 37.6 - - - 256
Colorization [38] 39.6R [66] 40.7 [7] 46.9 35.6 - - - -
Rotation [7] 38.7 50.0 54.4 39.1 - - - 128
CL Based on Negative Examples:
Examplar [138] 31.5 [48] - - - - - - -
Instdisc [48] 54.0 - 65.4 - - - - 256
MoCo v1 [50] 60.6 - 74.9 - 40.8 36.9 - 256
SimCLR [53] 73.9V [52] - 81.8e [67] - 37.9 [67] 33.3 [67] - 4096
MoCo v2 [51] 72.2 [67] - 82.5e - 39.8 [70] 36.1 [70] - 256
MoCo v3 [52] 76.7 83.2 - - 47.9 [68] 42.7 [68] 47.3 [68] 4096
CL Based on Clustering:
SwAV [66] 75.3 - 82.6e [70] - 41.6 37.8 [70] - 4096
CL Based on Self-distillation:
BYOL [65] 74.3 - 81.4e [67] 76.3 40.4 [70] 37.0 [70] - 4096
SimSiam [67] 71.3 - 82.4e [67] - 39.2 34.4 - 512
DINO [84] 78.2 83.6 [111] - - 46.8 [113] 41.5 [113] 44.1 [112] 1024
CL Based on Feature Decorrelation:
Barlow Twins [69] 73.2 - 82.6e [70] - 39.2 34.3 - 2048
VICReg [70] 73.2 - 82.4e - 39.4 36.4 - 2048
Masked Image Modeling (ViT-B by default):
Context Encoder [115] 21.0A [7] - 44.5A [7] 30.0A - - - -
BEiT v1 [112] 56.7 [123] 83.4 [111] - - 49.8 [68] 44.4 [68] 47.1 [68] 2000
MAE [68] 67.8 83.6 - - 50.3 44.9 48.1 4096
SimMIM [114] 56.7 83.8 - - 52.3Swin−B [268] - 52.8Swin−B [268] 2048
PeCo [119] - 84.5 - - 43.9 39.8 46.7 2048
iBOT [111] 79.5 84.0 - - 51.2 44.2 50.0 1024
MimCo [122] - 83.9 - - 44.9 40.7 48.91 2048
CAE [113] 70.4 83.9 - - 50 44 50.2 2048
data2vec [120] - 84.2 - - - - - 2048
SdAE [121] 64.9 84.1 - - 48.9 43.0 48.6 768
BEiT v2 [123] 80.1 85.5 - - - - 53.1 2048
TABLE 4: Experimental results of the tested algorithms for linear classification and transfer learning tasks. DB denotes the
default batch size. The symbol “-” indicates the absence or unavailability of the data point in the respective paper. The
subscripts A, R, and V represent AlexNet, ResNet-50, and ViT-B, respectively. The superscript “e” indicates the utilization
of extra data, specifically VOC2012.
efficiently leveraging modern GPUs for high parallel com- comparable or even superior performance when compared
puting. to traditional CL-based methods. This also urgently calls for
a theoretical explanation.
5 C ONCLUSIONS , F UTURE T RENDS , AND O PEN Secondly, a crucial question arises concerning the au-
Q UESTIONS tomatic design of an optimal pretext task to enhance the
performance of a fixed downstream task. Various methods
In summary, this comprehensive review offers essential in-
have been proposed to address this challenge, including
sights into contemporary SSL research, providing newcom-
the pixel-to-propagation consistency method [63] and dense
ers with an overall picture of the field. The paper presents a
contrastive learning [272]. However, this problem remains
thorough survey of SSL from three main perspectives: algo-
insufficiently resolved, and further theoretical investigations
rithms, applications, and future trends. We focus on main-
are warranted in this direction.
stream visual SSL algorithms, classifying them into four
major types: context-based methods, generative methods, Thirdly, there is a pressing need for a unified SSL
contrastive methods, and contrastive generative methods. paradigm that encompasses multiple modalities. MIM has
Furthermore, we investigate the correlation between SSL demonstrated remarkable progress in vision tasks, akin to
and other learning paradigms while comparing early SSL the success of masked language model in NLP, suggesting
algorithms with current mainstream ones. Lastly, we will the possibility of unifying learning paradigms. Additionally,
delve into future trends and open problems as outlined the ViT architecture bridges the gap between visual and
below. verbal modalities, enabling the construction of a unified
Main trends: Firstly, while practical developments in transformer model for both CV and NLP tasks. Recent en-
SSL have progressed significantly, its theoretical analysis deavors [120], [273] have sought to unify SSL models, yield-
lags behind. For instance, investigations into why BYOL and ing impressive results in downstream tasks and showing
SimSiam [67] do not collapse [271] have been conducted, but broad applicability. Nevertheless, NLP has advanced further
the fundamental reason remains elusive. Further theoretical in leveraging SSL models, prompting the CV community
explorations are necessary to unravel this mystery and po- to draw inspiration from NLP approaches to effectively
tentially uncover more effective solutions. Moreover, recent harness the potential of pre-trained models.
research has shown that MIM-based methods can attain Open problems: Firstly, can SSL harness the advantages
of almost limitless data? Considering the abundance of [10] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning
unlabeled data, can SSL consistently benefit from additional and using the arrow of time,” in IEEE Conf. Comput. Vis. Pattern
Recognit., pp. 8052–8060, 2018.
unlabeled data, and can the theoretical inflection point be [11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-
determined? training of deep bidirectional transformers for language under-
Secondly, it is pertinent to explore the interconnection standing,” arXiv preprint arXiv:1810.04805, 2018.
between SSL and multi-modality learning, as both method- [12] X. Zeng, Y. Pan, M. Wang, J. Zhang, and Y. Liu, “Realistic face
reenactment via self-supervised disentangling of identity and
ologies share resemblances with the cognitive processes pose,” in AAAI Conf.Artif. Intell., pp. 12154–12163, 2020.
observed in infants. Consequently, a critical inquiry arises: [13] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zis-
how can these two approaches be synergistically integrated serman, “End-to-end learning of visual representations from un-
to forge a robust and comprehensive learning model? curated instructional videos,” in IEEE Conf. Comput. Vis. Pattern
Recognit., pp. 9879–9889, 2020.
Thirdly, determining the most optimal or recommended [14] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis
SSL algorithm poses a challenge as there is no universally of self-supervision, or what we can learn from a single image,”
applicable solution. The ideal selection of an algorithm in Int. Conf. Learn. Represent., 2020.
should align with the specific problem structure, yet prac- [15] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimension-
ality of data with neural networks,” Science, vol. 313, no. 5786,
tical situations often complicate this process. Consequently, pp. 504–507, 2006.
the development of a checklist to aid users in identifying the [16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Ex-
most suitable method under particular circumstances war- tracting and composing robust features with denoising autoen-
rants investigation and should be pursued as a promising coders,” in Int. Conf. Mach. Learn., pp. 1096–1103, 2008.
[17] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning
avenue for future research. to grasp from 50k tries and 700 robot hours,” in IEEE Int. Conf.
Fourthly, the assumption that unlabeled data invariably Robot. Autom., pp. 3406–3413, 2016.
leads to improved outcomes warrants scrutiny. Our hy- [18] Y. Li, M. Paluri, J. M. Rehg, and P. Dollár, “Unsupervised learning
of edges,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1619–
pothesis challenges this notion, especially concerning semi-
1627, 2016.
supervised learning methods, as the no free lunch theorem [19] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H.
comes into play. Performance degradation can arise when Yang, “Unsupervised visual representation learning by graph-
model assumptions fail to align effectively with the un- based consistent constraints,” in Eur. Conf. Comput. Vis., pp. 678–
694, 2016.
derlying problem structure. For instance, if a model as-
[20] H. Lee, S. J. Hwang, and J. Shin, “Rethinking data aug-
sumes a substantial separation between decision boundaries mentation: Self-supervision and self-distillation,” arXiv preprint
and regions of high data density, it may perform poorly arXiv:1910.05872, 2019.
when faced with data originating from heavily overlapping [21] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and
Q. Le, “Rethinking pre-training and self-training,” in Neural Inf.
Cauchy distributions, as the decision boundary would tra- Process. Syst., pp. 1–13, 2020.
verse through dense areas. However, preemptively identi- [22] A. E. Orhan, V. V. Gupta, and B. M. Lake, “Self-supervised
fying such mismatches remains intricate and an unresolved learning through the eyes of a child,” in Neural Inf. Process. Syst.,
matter. Consequently, this topic merits further research to pp. 9960–9971, 2020.
shed light on the matter. [23] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell,
“Representation learning via invariant causal mechanisms,” in
Int. Conf. Learn. Represent., pp. 1–19, 2021.
[24] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, “On
R EFERENCES feature decorrelation in self-supervised learning,” in IEEE Int.
Conf. Comput. Vis., pp. 9598–9608, 2021.
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, [25] VentureBeat, “Yann LeCun, Yoshua Bengio: Self-
“Imagenet: A large-scale hierarchical image database,” in IEEE supervised learning is key to human-level intelligence.”
Conf. Comput. Vis. Pattern Recognit., pp. 248–255, 2009. https://cacm.acm.org/news/244720-yann-lecun-yoshua-
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- bengio-self-supervised-learning-is-key-to-human-level-
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning intelligence/fulltext.
transferable visual models from natural language supervision,” [26] J. Yu, H. Yin, X. Xia, T. Chen, J. Li, and Z. Huang, “Self-supervised
in Int. Conf. Mach. Learn., pp. 8748–8763, 2021. learning for recommender systems: A survey,” arXiv preprint
[3] L. Ericsson, H. Gouk, and T. M. Hospedales, “How well do self- arXiv:2203.15876, 2022.
supervised models transfer?,” in IEEE Conf. Comput. Vis. Pattern [27] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, “Graph
Recognit., pp. 5414–5423, 2021. self-supervised learning: A survey,” IEEE T. Knowl. Data Eng.,
[4] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, 2022.
“Self-supervised learning: Generative or contrastive,” IEEE T. [28] L. Wu, H. Lin, C. Tan, Z. Gao, and S. Z. Li, “Self-supervised
Knowl. Data Eng., 2022. learning on graphs: Contrastive, generative, or predictive,” IEEE
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Knowl. Data Eng., 2022.
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, [29] H. H. Mao, “A survey on self-supervised pre-training for se-
et al., “An image is worth 16x16 words: Transformers for image quential transfer learning in neural networks,” arXiv preprint
recognition at scale,” in Int. Conf. Learn. Represent., 2021. arXiv:2007.00800, 2020.
[6] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, [30] M. C. Schiappa, Y. S. Rawat, and M. Shah, “Self-supervised
“Learning spatiotemporal features with 3d convolutional net- learning for videos: A survey,” arXiv preprint arXiv:2207.00419,
works,” in IEEE Int. Conf. Comput. Vis., pp. 4489–4497, 2015. 2022.
[7] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised represen- [31] G.-J. Qi and M. Shah, “Adversarial pretraining of self-supervised
tation learning by predicting image rotations,” in Int. Conf. Learn. deep networks: Past, present and future,” arXiv preprint
Represent., pp. 1–14, 2018. arXiv:2210.13463, 2022.
[8] M. Noroozi and P. Favaro, “Unsupervised learning of visual [32] L. Jing and Y. Tian, “Self-supervised visual feature learning with
representations by solving jigsaw puzzles,” in Eur. Conf. Comput. deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach.
Vis., pp. 69–84, 2016. Intell., vol. 43, no. 11, pp. 4037–4058, 2021.
[9] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: un- [33] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon,
supervised learning using temporal order verification,” in Eur. “A survey on contrastive self-supervised learning,” Technologies,
Conf. Comput. Vis., pp. 527–544, 2016. vol. 9, no. 1, pp. 1–22, 2020.
[34] V. R. de Sa, “Learning classification with unlabeled data,” in [61] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview cod-
Neural Inf. Process. Syst., pp. 112–119, 1994. ing,” in Eur. Conf. Comput. Vis., pp. 776–794, 2020.
[35] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on genera- [62] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola,
tive adversarial networks: Algorithms, theory, and applications,” “What makes for good views for contrastive learning,” in Neural
IEEE T. Knowl. Data Eng., 2022. Inf. Process. Syst., pp. 1–13, 2020.
[36] T. Nathan Mundhenk, D. Ho, and B. Y. Chen, “Improvements [63] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate
to context based self-supervised learning,” in IEEE Conf. Comput. yourself: Exploring pixel-level consistency for unsupervised vi-
Vis. Pattern Recognit., pp. 9339–9348, 2018. sual representation learning,” in IEEE Conf. Comput. Vis. Pattern
[37] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by mov- Recognit., pp. 16684–16693, 2021.
ing,” in IEEE Int. Conf. Comput. Vis., pp. 37–45, 2015. [64] X. Wang and G.-J. Qi, “Contrastive learning with stronger aug-
[38] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” mentations,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–12,
in Eur. Conf. Comput. Vis., pp. 649–666, 2016. 2022.
[39] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning repre- [65] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond,
sentations for automatic colorization,” in Eur. Conf. Comput. Vis., E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar,
pp. 577–593, 2016. et al., “Bootstrap your own latent: A new approach to self-
[40] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. supervised learning,” in Neural Inf. Process. Syst., pp. 1–14, 2020.
Efros, “Real-time user-guided image colorization with learned [66] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and
deep priors,” arXiv preprint arXiv:1705.02999, 2017. A. Joulin, “Unsupervised learning of visual features by contrast-
[41] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a ing cluster assignments,” in Neural Inf. Process. Syst., 2020.
proxy task for visual understanding,” in IEEE Conf. Comput. Vis. [67] X. Chen and K. He, “Exploring simple siamese representation
Pattern Recognit., pp. 6874–6883, 2017. learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 15750–
[42] P. Goyal, D. Mahajan, A. Gupta, and I. Misra, “Scaling and 15758, 2021.
benchmarking self-supervised visual representation learning,” in [68] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked
IEEE Int. Conf. Comput. Vis., pp. 6391–6400, 2019. autoencoders are scalable vision learners,” in IEEE Conf. Comput.
[43] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised Vis. Pattern Recognit., pp. 16000–16009, 2022.
learning of spatiotemporal context for video action recognition,” [69] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins:
in Proc. Winter Conf. Appl. Comput. Vis., pp. 179–189, 2019. Self-supervised learning via redundancy reduction,” in Int. Conf.
[44] X. Zhan, X. Pan, Z. Liu, D. Lin, and C. C. Loy, “Self-supervised Mach. Learn., 2021.
learning via conditional motion propagation,” in IEEE Conf. [70] A. Bardes, J. Ponce, and Y. LeCun, “Vicreg: Variance-invariance-
Comput. Vis. Pattern Recognit., pp. 1881–1889, 2019. covariance regularization for self-supervised learning,” in Int.
[45] K. Wang, L. Lin, C. Jiang, C. Qian, and P. Wei, “3d human Conf. Learn. Represent., pp. 1–12, 2022.
pose machines with self-supervised learning,” IEEE Trans. Pattern [71] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lu-
Anal. Mach. Intell., vol. 42, no. 5, pp. 1069–1082, 2019. cic, “On mutual information maximization for representation
[46] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learn- learning,” in Int. Conf. Learn. Represent., pp. 1–12, 2020.
ing by learning to count,” in IEEE Int. Conf. Comput. Vis., [72] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khan-
pp. 5898–5906, 2017. deparkar, “A theoretical analysis of contrastive unsupervised
[47] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext- representation learning,” in Int. Conf. Mach. Learn., pp. 5628–5637,
invariant representations,” in IEEE Conf. Comput. Vis. Pattern 2019.
Recognit., pp. 6707–6717, 2020. [73] Y. Yang and Z. Xu, “Rethinking the value of labels for improving
[48] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature class-imbalanced learning,” in Neural Inf. Process. Syst., 2020.
learning via non-parametric instance discrimination,” in IEEE [74] Y.-H. H. Tsai, Y. Wu, R. Salakhutdinov, and L.-P. Morency,
Conf. Comput. Vis. Pattern Recognit., pp. 3733–3742, 2018. “Self-supervised learning from a multi-view perspective,” arXiv
[49] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “What makes instance preprint arXiv:2006.05576, 2020.
discrimination good for transfer learning?,” in Int. Conf. Learn. [75] T. Wang and P. Isola, “Understanding contrastive representation
Represent., pp. 1–11, 2021. learning through alignment and uniformity on the hypersphere,”
[50] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum con- in Int. Conf. Mach. Learn., pp. 9929–9939, 2020.
trast for unsupervised visual representation learning,” in IEEE [76] C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and
Conf. Comput. Vis. Pattern Recognit., pp. 9729–9738, 2020. S. Jegelka, “Debiased contrastive learning,” in Int. Conf. Learn.
[51] X. Chen, H. Fan, R. Girshick, and K. He, “Improved base- Represent., 2020.
lines with momentum contrastive learning,” arXiv preprint [77] J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo, “Predicting what you
arXiv:2003.04297, 2020. already know helps: Provable self-supervised learning,” arXiv
[52] X. Chen, S. Xie, and K. He, “An empirical study of training self- preprint arXiv:2008.01064, 2020.
supervised visual transformers,” in IEEE Int. Conf. Comput. Vis., [78] S. Chen, G. Niu, C. Gong, J. Li, J. Yang, and M. Sugiyama,
pp. 9640–9649, 2021. “Large-margin contrastive learning with distance polarization
[53] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple regularizer,” in Int. Conf. Mach. Learn., pp. 1673–1683, 2021.
framework for contrastive learning of visual representations,” in [79] J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma, “Provable guaran-
Int. Conf. Mach. Learn., pp. 1597–1607, 2020. tees for self-supervised deep learning with spectral contrastive
[54] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, loss,” in Neural Inf. Process. Syst., pp. 5000–5011, 2021.
“Big self-supervised models are strong semi-supervised learn- [80] C. Tosh, A. Krishnamurthy, and D. Hsu, “Contrastive learning,
ers,” in Neural Inf. Process. Syst., pp. 1–13, 2020. multi-view redundancy, and linear models,” in Algorithmic Learn-
[55] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction ing Theory, pp. 1179–1206, 2021.
by learning an invariant mapping,” in IEEE Conf. Comput. Vis. [81] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of
Pattern Recognit., pp. 1735–1742, 2006. self-training with deep networks on unlabeled data,” in Int. Conf.
[56] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with Learn. Represent., pp. 1–15, 2021.
contrastive predictive coding,” arXiv preprint arXiv:1807.03748, [82] Y. Tian, “Deep contrastive learning is provably (almost) principal
2019. component analysis,” arXiv preprint arXiv:2201.12680, 2022.
[57] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: [83] A. Newell and J. Deng, “How useful is self-supervised pretrain-
A new estimation principle for unnormalized statistical models,” ing for visual tasks?,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
in Int. Conf. Artif. Intell. Statist., pp. 297–304, 2010. pp. 7345–7354, 2020.
[58] M. Zheng, S. You, F. Wang, C. Qian, C. Zhang, X. Wang, and [84] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski,
C. Xu, “Ressl: Relational self-supervised learning with weak and A. Joulin, “Emerging properties in self-supervised vision
augmentation,” arXiv preprint arXiv:2107.09282, 2021. transformers,” in IEEE Int. Conf. Comput. Vis., pp. 9650–9660,
[59] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “Distilling localization 2021.
for self-supervised representation learning,” in AAAI Conf.Artif. [85] Y. Wang, X. Shen, S. X. Hu, Y. Yuan, J. L. Crowley, and D. Vaufrey-
Intell., pp. 10990–10998, 2021. daz, “Self-supervised transformers for unsupervised object dis-
[60] R. Arandjelovic and A. Zisserman, “Objects that sound,” in Eur. covery using normalized cut,” in IEEE Conf. Comput. Vis. Pattern
Conf. Comput. Vis., pp. 435–451, 2018. Recognit., pp. 14543–14553, 2022.
[86] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learn- [110] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola,
ing through spatial contrasting,” arXiv preprint arXiv:1610.00243, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive
2016. learning,” in Neural Inf. Process. Syst., pp. 18661–18673, 2020.
[87] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “Regioncl: exploring con- [111] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong,
trastive region pairs for self-supervised representation learning,” “ibot: Image bert pre-training with online tokenizer,” in Int. Conf.
in European Conference on Computer Vision, pp. 477–494, Springer, Learn. Represent., pp. 1–12, 2022.
2022. [112] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of
[88] M. Yang, M. Liao, P. Lu, J. Wang, S. Zhu, H. Luo, Q. Tian, image transformers,” in Int. Conf. Learn. Represent., pp. 1–13, 2022.
and X. Bai, “Reading and writing: Discriminative and generative [113] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo,
modeling for self-supervised text recognition,” arXiv preprint G. Zeng, and J. Wang, “Context autoencoder for self-supervised
arXiv:2207.00193, 2022. representation learning,” arXiv preprint arXiv:2202.03026, 2022.
[89] H. Wu, Y. Qu, S. Lin, J. Zhou, R. Qiao, Z. Zhang, Y. Xie, and L. Ma, [114] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu,
“Contrastive learning for compact single image dehazing,” in “Simmim: A simple framework for masked image modeling,” in
IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10551–10560, 2021. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9653–9663, 2022.
[90] R. Zhu, B. Zhao, J. Liu, Z. Sun, and C. W. Chen, “Improving [115] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros,
contrastive learning by visualizing feature transformation,” in “Context encoders: Feature learning by inpainting,” in IEEE Conf.
IEEE Int. Conf. Comput. Vis., pp. 10306–10315, 2021. Comput. Vis. Pattern Recognit., pp. 2536–2544, 2016.
[91] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng, “Par- [116] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, P. Dhariwal,
tially view-aligned representation learning with noise-robust D. Luan, and I. Sutskever, “Generative pretraining from pixels,”
contrastive loss,” in IEEE Conf. Comput. Vis. Pattern Recognit., in Int. Conf. Mach. Learn., pp. 1691–1703, 2020.
pp. 1134–1143, 2021. [117] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford,
[92] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,”
and A. Overwijk, “Approximate nearest neighbor negative con- in Int. Conf. Mach. Learn., pp. 8821–8831, 2021.
trastive learning for dense text retrieval,” in Int. Conf. Learn. [118] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichten-
Represent., pp. 1–12, 2021. hofer, “Masked feature prediction for self-supervised visual pre-
[93] J. Li, P. Zhou, C. Xiong, and S. C. Hoi, “Prototypical contrastive training,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14668–
learning of unsupervised representations,” in Int. Conf. Learn. 14678, 2022.
Represent., pp. 1–12, 2021. [119] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen,
[94] K. Kotar, G. Ilharco, L. Schmidt, K. Ehsani, and R. Mottaghi, F. Wen, and N. Yu, “Peco: Perceptual codebook for bert pre-
“Contrasting contrastive self-supervised representation learning training of vision transformers,” arXiv preprint arXiv:2111.12710,
pipelines,” in IEEE Int. Conf. Comput. Vis., pp. 9949–9959, 2021. 2021.
[95] S. Liu, H. Fan, S. Qian, Y. Chen, W. Ding, and Z. Wang, “Hit: [120] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli,
Hierarchical transformer with momentum contrast for video-text “Data2vec: A general framework for self-supervised learning in
retrieval,” in IEEE Int. Conf. Comput. Vis., pp. 11915–11925, 2021. speech, vision and language,” arXiv preprint arXiv:2202.03555,
2022.
[96] A. Islam, C.-F. Chen, R. Panda, L. Karlinsky, R. Radke, and
R. Feris, “A broad study on the transferability of visual repre- [121] Y. Chen, Y. Liu, D. Jiang, X. Zhang, W. Dai, H. Xiong, and Q. Tian,
sentations with contrastive learning,” in IEEE Int. Conf. Comput. “Sdae: Self-distillated masked autoencoder,” in Eur. Conf. Comput.
Vis., pp. 8845–8855, 2021. Vis., pp. 108–124, 2022.
[122] Q. Zhou, C. Yu, H. Luo, Z. Wang, and H. Li, “Mimco: Masked
[97] J. Li, C. Xiong, and S. C. Hoi, “Learning from noisy data with
image modeling pre-training with contrastive teacher,” in ACM
robust representation learning,” in IEEE Int. Conf. Comput. Vis.,
Int. Conf. Multimedia, pp. 4487–4495, 2022.
pp. 9485–9494, 2021.
[123] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked
[98] H. Cha, J. Lee, and J. Shin, “Co2l: Contrastive continual learning,”
image modeling with vector-quantized visual tokenizers,” arXiv
in IEEE Int. Conf. Comput. Vis., pp. 9516–9525, 2021.
preprint arXiv:2208.06366, 2022.
[99] O. J. Hénaff, S. Koppula, J.-B. Alayrac, A. v. d. Oord, O. Vinyals,
[124] C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders
and J. Carreira, “Efficient visual pretraining with contrastive
as spatiotemporal learners,” arXiv preprint arXiv:2205.09113, 2022.
detection,” in IEEE Int. Conf. Comput. Vis., pp. 10086–10096, 2021.
[125] Y. Liang, S. Zhao, B. Yu, J. Zhang, and F. He, “Meshmae: Masked
[100] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, autoencoders for 3d mesh data analysis,” in Eur. Conf. Comput.
“With a little help from my friends: Nearest-neighbor contrastive Vis., pp. 37–54, 2022.
learning of visual representations,” in IEEE Int. Conf. Comput. Vis.,
[126] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked
pp. 9588–9597, 2021.
autoencoders for point cloud self-supervised learning,” in Eur.
[101] J. Cui, Z. Zhong, S. Liu, B. Yu, and J. Jia, “Parametric contrastive Conf. Comput. Vis., pp. 604–621, 2022.
learning,” in IEEE Int. Conf. Comput. Vis., pp. 715–724, 2021. [127] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao,
[102] A. Shah, S. Sra, R. Chellappa, and A. Cherian, “Max-margin Z. Zhang, L. Dong, et al., “Swin transformer v2: Scaling up capac-
contrastive learning,” in AAAI Conf.Artif. Intell., 2022. ity and resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
[103] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding di- pp. 12009–12019, 2022.
mensional collapse in contrastive self-supervised learning,” in [128] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “Vitaev2: Vision trans-
Int. Conf. Learn. Represent., pp. 1–11, 2022. former advanced by exploring inductive bias for image recogni-
[104] J. Zhang, X. Xu, F. Shen, Y. Yao, J. Shao, and X. Zhu, “Video tion and beyond,” Int. J. Comput. Vis., pp. 1–22, 2023.
representation learning with graph contrastive augmentation,” [129] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
in ACM Int. Conf. Multimedia, pp. 3043–3051, 2021. transformer backbones for object detection,” in Eur. Conf. Comput.
[105] S. Lal, M. Prabhudesai, I. Mediratta, A. W. Harley, and K. Fragki- Vis., pp. 280–296, 2022.
adaki, “Coconets: Continuous contrastive 3d scene representa- [130] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision
tions,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021. transformer baselines for human pose estimation,” in Neural Inf.
[106] Q. Hu, X. Wang, W. Hu, and G.-J. Qi, “Adco: Adversarial contrast Process. Syst., pp. 38571–38584, 2022.
for efficient learning of unsupervised representations from self- [131] Z. Liu, J. Gui, and H. Luo, “Good helper is around you: Attention-
trained negative adversaries,” in IEEE Conf. Comput. Vis. Pattern driven masked image modeling,” in AAAI Conf.Artif. Intell.,
Recognit., 2021. pp. 1799–1807, 2023.
[107] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and [132] Z. Qi, R. Dong, G. Fan, Z. Ge, X. Zhang, K. Ma, and
D. Larlus, “Hard negative mixing for contrastive learning,” in L. Yi, “Contrast with reconstruct: Contrastive 3d representa-
Neural Inf. Process. Syst., pp. 1–12, 2020. tion learning guided by generative pretraining,” arXiv preprint
[108] G. Bukchin, E. Schwartz, K. Saenko, O. Shahar, R. Feris, R. Giryes, arXiv:2302.02318, 2023.
and L. Karlinsky, “Fine-grained angular contrastive learning with [133] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, Y. Wei, Q. Dai, and H. Hu, “On
coarse labels,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021. data scaling in masked image modeling,” in IEEE Conf. Comput.
[109] S. Purushwalkam and A. Gupta, “Demystifying contrastive self- Vis. Pattern Recognit., pp. 10365–10374, 2023.
supervised learning: Invariances, augmentations and dataset bi- [134] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec,
ases,” in Neural Inf. Process. Syst., pp. 1–12, 2020. V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al.,
“Dinov2: Learning robust visual features without supervision,” [159] M. Gadelha, R. Wang, and S. Maji, “Multiresolution tree networks
arXiv preprint arXiv:2304.07193, 2023. for 3d point cloud processing,” in Eur. Conf. Comput. Vis., pp. 103–
[135] X. Kong and X. Zhang, “Understanding masked image modeling 118, 2018.
via learning occlusion invariant feature,” in IEEE Conf. Comput. [160] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3d point capsule
Vis. Pattern Recognit., pp. 6241–6251, 2023. networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1009–
[136] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, 1018, 2019.
“Joint generative and contrastive learning for unsupervised per- [161] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt,
son re-identification,” in IEEE Conf. Comput. Vis. Pattern Recognit., “Test-time training with self-supervision for generalization under
pp. 2004–2013, 2021. distribution shifts,” in Int. Conf. Mach. Learn., 2020.
[137] Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing [162] Y. Gandelsman, Y. Sun, X. Chen, and A. A. Efros, “Test-time train-
the dark secrets of masked image modeling,” in IEEE Conf. ing with masked autoencoders,” arXiv preprint arXiv:2209.07522,
Comput. Vis. Pattern Recognit., pp. 14475–14485, 2023. 2022.
[138] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, [163] J. J. Sun, A. Kennedy, E. Zhan, D. J. Anderson, Y. Yue, and
“Discriminative unsupervised feature learning with convolu- P. Perona, “Task programming: Learning data efficient behavior
tional neural networks,” in Neural Inf. Process. Syst., pp. 766–774, representations,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
2014. pp. 2876–2885, 2021.
[139] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, [164] Z. Ren and Y. Jae Lee, “Cross-domain self-supervised multi-task
and T. Brox, “Discriminative unsupervised feature learning with feature learning using synthetic imagery,” in IEEE Conf. Comput.
exemplar convolutional neural networks,” IEEE Trans. Pattern Vis. Pattern Recognit., pp. 762–771, 2018.
Anal. Mach. Intell., vol. 38, no. 9, pp. 1734–1747, 2015. [165] K. Hassani and M. Haley, “Unsupervised multi-task feature
[140] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual learning on point clouds,” in IEEE Int. Conf. Comput. Vis.,
representation learning by context prediction,” in IEEE Int. Conf. pp. 8160–8171, 2019.
Comput. Vis., pp. 1422–1430, 2015. [166] A. Piergiovanni, A. Angelova, and M. S. Ryoo, “Evolving losses
[141] P. Bojanowski and A. Joulin, “Unsupervised learning by predict- for unsupervised video representation learning,” in IEEE Conf.
ing noise,” in Int. Conf. Mach. Learn., 2017. Comput. Vis. Pattern Recognit., pp. 133–142, 2020.
[142] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embed- [167] K. Saito, D. Kim, S. Sclaroff, and K. Saenko, “Universal domain
ding for clustering analysis,” in Int. Conf. Mach. Learn., pp. 478– adaptation through self supervision,” in Neural Inf. Process. Syst.,
487, 2016. pp. 1–11, 2020.
[143] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of [168] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros, “Unsupervised
deep representations and image clusters,” in IEEE Conf. Comput. domain adaptation through self-supervision,” arXiv preprint
Vis. Pattern Recognit., pp. 5147–5156, 2016. arXiv:1909.11825, 2019.
[144] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clus- [169] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting
tering for unsupervised learning of visual features,” in Eur. Conf. self-supervised learning via knowledge transfer,” in IEEE Conf.
Comput. Vis., pp. 132–149, 2018. Comput. Vis. Pattern Recognit., pp. 9359–9367, 2018.
[145] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: [170] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and
Unsupervised learning by cross-channel prediction,” in IEEE J. Leskovec, “Strategies for pre-training graph neural networks,”
Conf. Comput. Vis. Pattern Recognit., pp. 1058–1067, 2017. arXiv preprint arXiv:1905.12265, 2019.
[146] X. Wang, K. He, and A. Gupta, “Transitive invariance for self- [171] Y. You, T. Chen, Z. Wang, and Y. Shen, “When does self-
supervised visual representation learning,” in IEEE Int. Conf. supervision help graph convolutional networks?,” arXiv preprint
Comput. Vis., pp. 1329–1338, 2017. arXiv:2006.09136, 2020.
[147] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised [172] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang,
visual representation learning,” in IEEE Conf. Comput. Vis. Pattern and J. Tang, “Gcc: Graph contrastive coding for graph neural
Recognit., pp. 1920–1929, 2019. network pre-training,” in ACM SIGKDD International Conference
[148] P. Krähenbühl, “Free supervision from video games,” in IEEE on Knowledge Discovery and Data Mining, pp. 1150–1160, 2020.
Conf. Comput. Vis. Pattern Recognit., pp. 2955–2964, 2018. [173] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “Gpt-
[149] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- gnn: Generative pre-training of graph neural networks,” in ACM
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- SIGKDD International Conference on Knowledge Discovery and Data
sarial nets,” in Neural Inf. Process. Syst., pp. 2672–2680, 2014. Mining, pp. 1857–1867, 2020.
[150] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, “Self- [174] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang,
supervised gans via auxiliary rotation loss,” in IEEE Conf. Com- “Self-supervised graph transformer on large-scale molecular
put. Vis. Pattern Recognit., pp. 12154–12163, 2019. data,” in Neural Inf. Process. Syst., 2020.
[151] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self- [175] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Deep graph con-
supervised semi-supervised learning,” in IEEE Int. Conf. Comput. trastive representation learning,” arXiv preprint arXiv:2006.04131,
Vis., pp. 1476–1485, 2019. 2020.
[152] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using [176] U. Buchler, B. Brattoli, and B. Ommer, “Improving spatiotempo-
self-supervised learning can improve model robustness and un- ral self-supervision by deep reinforcement learning,” in Eur. Conf.
certainty,” in Neural Inf. Process. Syst., pp. 15663–15674, 2019. Comput. Vis., pp. 770–786, 2018.
[153] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view [177] D. Guo, B. A. Pires, B. Piot, J.-b. Grill, F. Altché, R. Munos, and
representation learning on graphs,” in Int. Conf. Mach. Learn., M. G. Azar, “Bootstrap latent-predictive representations for mul-
2020. titask reinforcement learning,” arXiv preprint arXiv:2004.14646,
[154] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar, 2020.
“Self-supervised learning of visual features through embedding [178] N. Hansen, Y. Sun, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang,
images into text topic spaces,” in IEEE Conf. Comput. Vis. Pattern “Self-supervised policy adaptation during deployment,” arXiv
Recognit., pp. 4230–4239, 2017. preprint arXiv:2007.04309, 2020.
[155] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised [179] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord,
feature learning by cross-modality and cross-view correspon- “Boosting few-shot visual learning with self-supervision,” in
dences,” arXiv preprint arXiv:2004.05749, 2020. IEEE Int. Conf. Comput. Vis., pp. 8059–8068, 2019.
[156] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised [180] J.-C. Su, S. Maji, and B. Hariharan, “Boosting supervision
modal and view invariant feature learning,” arXiv preprint with self-supervision for few-shot learning,” arXiv preprint
arXiv:2005.14169, 2020. arXiv:1906.07079, 2019.
[157] L. Zhang and Z. Zhu, “Unsupervised feature learning for point [181] C. Li, T. Tang, G. Wang, J. Peng, B. Wang, X. Liang, and X. Chang,
cloud understanding by contrasting and clustering using graph “Bossnas: Exploring hybrid cnn-transformers with block-wisely
convolutional neural networks,” in International Conference on 3D self-supervised neural architecture search,” in IEEE Int. Conf.
Vision, pp. 395–404, 2019. Comput. Vis., 2021.
[158] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud [182] L. Fan, S. Liu, P.-Y. Chen, G. Zhang, and C. Gan, “When does
auto-encoder via deep grid deformation,” in IEEE Conf. Comput. contrastive learning preserve adversarial robustness from pre-
Vis. Pattern Recognit., pp. 206–215, 2018. training to finetuning?,” in Neural Inf. Process. Syst., 2021.
[183] M. Kim, J. Tack, and S. J. Hwang, “Adversarial self-supervised [207] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor:
contrastive learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020. Self-supervised denoising from single noisy images,” in IEEE
[184] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang, Conf. Comput. Vis. Pattern Recognit., 2021.
“Adversarial robustness: From self-supervised pre-training to [208] C. Yang, Z. Wu, B. Zhou, and S. Lin, “Instance localization for
fine-tuning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 699– self-supervised detection pretraining,” in IEEE Conf. Comput. Vis.
708, 2020. Pattern Recognit., pp. 3987–3996, 2021.
[185] Y. Lin, X. Guo, and Y. Lu, “Self-supervised video representation [209] I. Croitoru, S.-V. Bogolin, and M. Leordeanu, “Unsupervised
learning with meta-contrastive network,” in IEEE Int. Conf. Com- learning from video to detect foreground objects in single im-
put. Vis., pp. 8239–8249, 2021. ages,” in IEEE Int. Conf. Comput. Vis., pp. 4335–4343, 2017.
[186] Y. An, H. Xue, X. Zhao, and L. Zhang, “Conditional self- [210] E. Xie, J. Ding, W. Wang, X. Zhan, H. Xu, Z. Li, and P. Luo,
supervised learning for few-shot classification,” in Int. Joint Conf. “Detco: Unsupervised contrastive learning for object detection,”
Artif. Intell., pp. 2140–2146, 2021. arXiv preprint arXiv:2102.04803, 2021.
[187] S. Pal, A. Datta, and D. D. Majumder, “Computer recognition [211] G. Wu, J. Jiang, X. Liu, and J. Ma, “A practical contrastive
of vowel sounds using a self-supervised learning algorithm,” learning framework for single image super-resolution,” arXiv
Journal of the Anatomical Society of India, pp. 117–123, 1978. preprint arXiv:2111.13924, 2021.
[188] A. Ghosh, N. R. Pal, and S. K. Pal, “Self-organization for object [212] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “Pulse:
extraction using a multilayer neural network and fuzziness mear- Self-supervised photo upsampling via latent space exploration of
sures,” IEEE Transactions on Fuzzy Systems, pp. 54–68, 1993. generative models,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
[189] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric pp. 2437–2445, 2020.
shape learning without object labels,” in Eur. Conf. Comput. Vis., [213] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning
pp. 236–250, 2016. a predictable and generative vector representation for objects,” in
[190] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin, “Look into Eur. Conf. Comput. Vis., pp. 484–499, 2016.
person: Self-supervised structure-sensitive learning and a new [214] D. Jayaraman and K. Grauman, “Learning image representations
benchmark for human parsing,” in IEEE Conf. Comput. Vis. Pat- tied to ego-motion,” in IEEE Int. Conf. Comput. Vis., pp. 1413–
tern Recognit., pp. 932–940, 2017. 1421, 2015.
[191] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint [215] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth,
body parsing & pose estimation network and a new benchmark,” optical flow and camera pose,” in IEEE Conf. Comput. Vis. Pattern
IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 871–885, Recognit., pp. 1983–1992, 2018.
2018.
[216] Y. Zhao, G. Wang, C. Luo, W. Zeng, and Z.-J. Zha, “Self-
[192] X. Zhan, X. Pan, B. Dai, Z. Liu, D. Lin, and C. C. Loy, “Self- supervised visual representations learning by contrastive mask
supervised scene de-occlusion,” in IEEE Conf. Comput. Vis. Pattern prediction,” in IEEE Int. Conf. Comput. Vis., 2021.
Recognit., pp. 3784–3792, 2020.
[217] L. Huang, Y. Liu, B. Wang, P. Pan, Y. Xu, and R. Jin, “Self-
[193] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan,
supervised video representation learning by context and mo-
“Learning features by watching objects move,” in IEEE Conf.
tion decoupling,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
Comput. Vis. Pattern Recognit., pp. 2701–2710, 2017.
pp. 13886–13895, 2021.
[194] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised
[218] K. Hu, J. Shao, Y. Liu, B. Raj, M. Savvides, and Z. Shen, “Contrast
equivariant attention mechanism for weakly supervised seman-
and order representations for video self-supervised learning,” in
tic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
IEEE Int. Conf. Comput. Vis., pp. 7939–7949, 2021.
pp. 12275–12284, 2020.
[195] Z. Chen, X. Ye, L. Du, W. Yang, L. Huang, X. Tan, Z. Shi, [219] M. Tschannen, J. Djolonga, M. Ritter, A. Mahendran, N. Houlsby,
F. Shen, and E. Ding, “Aggnet for self-supervised monocular S. Gelly, and M. Lucic, “Self-supervised learning of video-
depth estimation: Go an aggressive step furthe,” in ACM Int. induced visual invariances,” in IEEE Conf. Comput. Vis. Pattern
Conf. Multimedia, pp. 1526–1534, 2021. Recognit., pp. 13806–13815, 2020.
[196] H. Chen, B. Lagadec, and F. Bremond, “Ice: Inter-instance con- [220] X. He, Y. Pan, M. Tang, Y. Lv, and Y. Peng, “Learn from unlabeled
trastive encoding for unsupervised person re-identification,” in videos for near-duplicate video retrieval,” in International Confer-
IEEE Int. Conf. Comput. Vis., pp. 14960–14969, 2021. ence on Research on Development in Information Retrieval, pp. 1–10,
[197] T. Isobe, D. Li, L. Tian, W. Chen, Y. Shan, and S. Wang, “Towards 2022.
discriminative representation learning for unsupervised person [221] X. Wang and A. Gupta, “Unsupervised learning of visual repre-
re-identification,” in IEEE Int. Conf. Comput. Vis., pp. 8526–8536, sentations using videos,” in IEEE Int. Conf. Comput. Vis., pp. 2794–
2021. 2802, 2015.
[198] Z. Wang, J. Zhang, L. Zheng, Y. Liu, Y. Sun, Y. Li, and [222] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsuper-
S. Wang, “Cycas: Self-supervised cycle association for learning vised learning of video representations using lstms,” in Int. Conf.
re-identifiable descriptions,” in Eur. Conf. Comput. Vis., 2020. Mach. Learn., pp. 843–852, 2015.
[199] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, “Self- [223] T. Han, W. Xie, and A. Zisserman, “Video representation learning
supervised deep visual odometry with online adaptation,” in by dense predictive coding,” in ICCV Workshops, 2019.
IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6339–6348, 2020. [224] T. Han, W. Xie, and A. Zisserman, “Memory-augmented dense
[200] W. Wu, Z. Y. Wang, Z. Li, W. Liu, and L. Fuxin, “Pointpwc-net: predictive coding for video representation learning,” in Eur. Conf.
Cost volume on point clouds for (self-) supervised scene flow Comput. Vis., 2020.
estimation,” in Eur. Conf. Comput. Vis., 2020. [225] B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervised
[201] G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets video representation learning with odd-one-out networks,” in
self-supervision,” arXiv preprint arXiv:2006.07114, 2020. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3636–3645, 2017.
[202] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow predic- [226] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised
tion from a static image,” in IEEE Int. Conf. Comput. Vis., pp. 2443– representation learning by sorting sequences,” in IEEE Int. Conf.
2451, 2015. Comput. Vis., pp. 667–676, 2017.
[203] F. Zhu, Y. Zhu, X. Chang, and X. Liang, “Vision-language nav- [227] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, “Self-
igation with self-supervised auxiliary reasoning tasks,” in IEEE supervised spatiotemporal learning via video clip order predic-
Conf. Comput. Vis. Pattern Recognit., pp. 10012–10022, 2020. tion,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10334–
[204] X. Niu, S. Shan, H. Han, and X. Chen, “Rhythmnet: End-to-end 10343, 2019.
heart rate estimation from face via spatial-temporal representa- [228] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman,
tion,” IEEE Trans. Image Process., vol. 29, pp. 2409–2423, 2020. M. Rubinstein, M. Irani, and T. Dekel, “Speednet: Learning the
[205] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, and G. Zhao, “Video-based speediness in videos,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
remote physiological measurement via cross-verified feature dis- pp. 9922–9931, 2020.
entangling,” in Eur. Conf. Comput. Vis., 2020. [229] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye, “Video playback rate
[206] Y. Xie, Z. Wang, and S. Ji, “Noise2same: Optimizing a self- perception for self-supervised spatio-temporal representation
supervised bound for image denoising,” in Neural Inf. Process. learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6548–
Syst., 2020. 6557, 2020.
[230] J. Wang, J. Jiao, and Y.-H. Liu, “Self-supervised video represen- [254] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-
tation learning by pace prediction,” in Eur. Conf. Comput. Vis., training text encoders as discriminators rather than generators,”
2020. in Int. Conf. Learn. Represent., 2020.
[231] A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen, “Dynamonet: [255] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained
Dynamic action and motion network,” in IEEE Int. Conf. Comput. models for natural language processing: A survey,” arXiv preprint
Vis., pp. 6192–6201, 2019. arXiv:2003.08271, 2020.
[232] T. Han, W. Xie, and A. Zisserman, “Self-supervised co-training [256] H. Wang, X. Wang, W. Xiong, M. Yu, X. Guo, S. Chang, and
for video representation learning,” in Neural Inf. Process. Syst., W. Y. Wang, “Self-supervised learning for contextualized ex-
pp. 1–12, 2020. tractive summarization,” in Annual Meeting of the Association for
[233] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of Computational Linguistics, pp. 2221–2227, 2019.
audio and video models from self-supervised synchronization,” [257] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning
in Neural Inf. Process. Syst., pp. 7763–7774, 2018. sound representations from unlabeled video,” in Neural Inf. Pro-
[234] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in cess. Syst., pp. 892–900, 2016.
IEEE Int. Conf. Comput. Vis., pp. 609–617, 2017. [258] H.-Y. Zhou, C. Lu, S. Yang, X. Han, and Y. Yu, “Preservational
[235] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, learning improves self-supervised medical image models by re-
“Videobert: A joint model for video and language representation constructing diverse contexts,” in IEEE Int. Conf. Comput. Vis.,
learning,” in IEEE Int. Conf. Comput. Vis., pp. 7464–7473, 2019. pp. 3499–3509, 2021.
[236] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and [259] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Contrastive
A. Zisserman, “Speech2action: Cross-modal supervision for ac- learning of global and local features for medical image segmenta-
tion recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., tion with limited annotations,” in Neural Inf. Process. Syst., 2020.
pp. 10317–10326, 2020. [260] J. Zhu, Y. Li, Y. Hu, K. Ma, S. K. Zhou, and Y. Zheng, “Rubik’s
[237] J. C. Stroud, D. A. Ross, C. Sun, J. Deng, R. Sukthankar, and cube+: A self-supervised feature learning framework for 3d med-
C. Schmid, “Learning video representations from textual web ical image analysis,” Medical Image Analysis, p. 101746, 2020.
supervision,” arXiv preprint arXiv:2007.14937, 2020. [261] O. Manas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Ro-
[238] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Rama- driguez, “Seasonal contrast: Unsupervised pre-training from un-
puram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman, curated remote sensing data,” in IEEE Int. Conf. Comput. Vis.,
“Self-supervised multimodal versatile networks,” arXiv preprint pp. 9414–9423, 2021.
arXiv:2006.16228, 2020. [262] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang,
[239] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and “Advancing plain vision transformer toward remote sensing
S. Levine, “Time-contrastive networks: Self-supervised learning foundation model,” IEEE Trans. Geoscience and Remote Sensing,
from video,” in IEEE Int. Conf. Robot. Autom., pp. 1134–1141, 2018. vol. 61, pp. 1–15, 2022.
[240] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence [263] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network
from the cycle-consistency of time,” in IEEE Conf. Comput. Vis. dissection: Quantifying interpretability of deep visual representa-
Pattern Recognit., pp. 2566–2576, 2019. tions,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6541–6549,
2017.
[241] X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M.-H.
[264] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
Yang, “Joint-task self-supervised learning for temporal corre-
A. Zisserman, “The pascal visual object classes (voc) challenge,”
spondence,” in Neural Inf. Process. Syst., pp. 318–328, 2019.
Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[242] A. Jabri, A. Owens, and A. A. Efros, “Space-time correspondence
[265] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays,
as a contrastive random walk,” in Neural Inf. Process. Syst.,
P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft
pp. 19545–19560, 2020.
coco: Common objects in context,” 2015.
[243] Z. Lai, E. Lu, and W. Xie, “Mast: A memory-augmented self-
[266] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
supervised tracker,” in IEEE Conf. Comput. Vis. Pattern Recognit.,
“Scene parsing through ade20k dataset,” in IEEE Conf. Comput.
pp. 6479–6488, 2020.
Vis. Pattern Recognit., 2017.
[244] Z. Zhang, S. Lathuiliere, E. Ricci, N. Sebe, Y. Yan, and J. Yang, [267] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and
“Online depth learning against forgetting in monocular videos,” A. Torralba, “Semantic understanding of scenes through the
in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4494–4503, 2020. ade20k dataset,” Int. J. Comput. Vis., vol. 127, no. 3, pp. 302–321,
[245] D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang, 2019.
“Video cloze procedure for self-supervised spatio-temporal [268] J. Liu, X. Huang, Y. Liu, and H. Li, “Mixmim: Mixed and masked
learning,” in AAAI Conf.Artif. Intell., pp. 11701–11708, 2020. image modeling for efficient visual representation learning,”
[246] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, arXiv preprint arXiv:2205.13137, 2022.
S. Eslami, and A. v. d. Oord, “Data-efficient image recognition [269] J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, and
with contrastive predictive coding,” in Int. Conf. Mach. Learn., S. Sra, “Can contrastive learning avoid shortcut solutions?,” in
2020. Neural Inf. Process. Syst., pp. 4974–4986, 2021.
[247] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Im- [270] Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen,
proving language understanding by generative pre-training,” and B. Guo, “Contrastive learning rivals masked image mod-
2018. eling in fine-tuning via feature distillation,” arXiv preprint
[248] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, arXiv:2205.14141, 2022.
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, [271] Y. Tian, X. Chen, and S. Ganguli, “Understanding self-supervised
et al., “Language models are few-shot learners,” arXiv preprint learning dynamics without contrastive pairs,” in Int. Conf. Mach.
arXiv:2005.14165, 2020. Learn., pp. 10268–10278, 2021.
[249] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, [272] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense con-
and J. Gao, “Efficient self-supervised vision transformers for trastive learning for self-supervised visual pre-training,” in IEEE
representation learning,” arXiv preprint arXiv:2106.09785, 2021. Conf. Comput. Vis. Pattern Recognit., pp. 3024–3033, 2021.
[250] Z. Li, Z. Chen, F. Yang, W. Li, Y. Zhu, C. Zhao, R. Deng, L. Wu, [273] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal,
R. Zhao, M. Tang, and J. Wang, “Mst: Masked self-supervised O. K. Mohammed, S. Singhal, S. Som, et al., “Image as a foreign
transformer for visual representation,” in Neural Inf. Process. Syst., language: Beit pretraining for all vision and vision-language
pp. 1–12, 2021. tasks,” arXiv preprint arXiv:2208.10442, 2022.
[251] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
“Distributed representations of words and phrases and their
compositionality,” in Neural Inf. Process. Syst., pp. 3111–3119,
2013.
[252] L. Kong, C. d. M. d’Autume, W. Ling, L. Yu, Z. Dai, and D. Yo-
gatama, “A mutual information maximization perspective of lan-
guage representation learning,” arXiv preprint arXiv:1910.08350,
2019.
[253] J. Wu, X. Wang, and W. Y. Wang, “Self-supervised dialogue
learning,” arXiv preprint arXiv:1907.00448, 2019.

A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends

Uploaded by

Copyright:

Available Formats

A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends

Uploaded by

Copyright:

Available Formats

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

A Survey on Self-supervised Learning:

D EEP supervised learning algorithms have demon-

TABLE 1: Comparison between supervised and self-supervised pre-training and fine-tuning.

supervised unsupervised self-supervised

Rotation Jigsaw Colorization

Fig. 4: Illustration of three common context-based methods:

similarity & similarity &

encoder ..................... encoder encoder ..................... encoder encoder ..................... encoder

image image image

In the above equation, b represents the batch example index

1 Here, S represents the regularized standard deviation, de-

2.2.2.3 Feature decorrelation-based CL: The objec-

joint where E denotes the encoder, D denotes the decoder, T1

Context Masked Image

Barlow ViT MAE BEiT PeCo data2vec SdAE MaskFeat BEiT v2

Fig. 8: Several representative pretext tasks of SSL.

TABLE 3: Different losses of SSL.

Category Method Loss Equation

You might also like