A Survey On Semi-, Self - and Unsupervised Learning For Image Classification
ized datasets like medical imaging [10]. This might be a practical workaround for some applications but the fundamental issue remains: Unlike humans, supervised learning needs enormous amounts of labeled data.

For a given problem we often have access to a large dataset of unlabeled data. How this unsupervised data could be used for neural networks has been of research interest for many years [11]. Xie et al. were among the first in 2016 to investigate unsupervised deep learning image clustering strategies to leverage this data [12]. Since then, the usage of unlabeled data has been researched in numerous ways and has created research fields like unsupervised, semi-supervised, self-supervised, weakly-supervised, or metric learning [13]. Generally speaking, unsupervised learning uses no labeled data, semi-supervised learning uses unlabeled and labeled data, while self-supervised learning generates labeled data on its own. Other research directions are even more different because weakly-supervised learning uses only partial information about the label and metric learning aims at learning a good distance metric. The idea that unifies these approaches is that using unlabeled data is beneficial during the training process (see Figure 1 for an illustration). It either makes the training with fewer labels more robust or in some rare cases even surpasses the supervised cases [14].

Due to this benefit, many researchers and companies work in the field of semi-, self-, and unsupervised learning. The main goal is to close the gap between semi-supervised and supervised learning or even surpass these results. Considering presented methods like [15, 16] we believe that research is at the breaking point of achieving this goal. Hence, there is a lot of research ongoing in this field. This survey provides an overview to keep track of the major and recent developments in semi-, self-, and unsupervised learning.

Most investigated research topics share a variety of common ideas while differing in goal, application contexts, and implementation details. This survey gives an overview of this wide range of research topics. The focus of this survey is on describing the similarities and differences between the methods. Whereas we look at a broad range of learning strategies, we compare these methods only based on the image classification task. The addressed audience of this survey consists of deep learning researchers or interested people with comparable preliminary knowledge who want to keep track of recent developments in the field of semi-, self- and unsupervised learning.

1.1. Related Work

In this subsection, we give a quick overview of previous works and reference topics we will not address further to maintain the focus of this survey.

The research of semi- and unsupervised techniques in computer vision has a long history. A variety of research, surveys, and books has been published on this topic [17, 18, 19, 20, 21]. Unsupervised cluster algorithms were researched before the breakthrough of deep learning and are still widely used [22]. There are already extensive surveys that describe unsupervised and semi-supervised strategies without deep learning [18, 23]. We will focus only on techniques including deep neural networks.

Many newer surveys focus only on self-, semi- or unsupervised learning [24, 19, 20]. Min et al. wrote an overview of unsupervised deep learning strategies [24]. They presented the beginning in this field of research from a network architecture perspective. The authors looked at a broad range of architectures. We focus on only one architecture which Min et al. refer to as "Clustering deep neural network (CDNN)-based deep clustering" [24]. Even though the work was published in 2018, it already misses the recent and major developments in deep learning of the last years. We look at these more recent developments and show the connections to other research fields that Min et al. did not include.

Van Engelen and Hoos give a broad overview of general and recent semi-supervised methods [20]. They cover some recent developments but deep learning strategies such as [25, 26, 27, 14, 28] are not covered. Furthermore, the authors do not explicitly compare the presented methods based on their structure or performance.

Jing and Tian concentrated their survey on recent developments in self-supervised learning [19]. Like us, the authors provide a performance comparison and a taxonomy. Their taxonomy distinguishes between different kinds of pretext tasks. We look at pretext tasks as one common idea and compare the methods based on these underlying ideas. Jing and Tian look at different tasks apart from classification but do not include semi- and unsupervised methods without a pretext task.
Figure 2: Overview of the structure of this survey – The learning strategies unsupervised, semi-supervised and supervised are commonly used in the literature. Because semi-supervised learning incorporates many methods, we defined training strategies which subdivide semi-supervised learning. For details about the training and learning strategies (including self-supervised learning) see subsection 2.1. Each method belongs to one training strategy and uses several common ideas. A common idea can be a concept such as a pretext task or a loss such as cross-entropy. The definition of methods and common ideas is given in section 2. Details about the common ideas are defined in subsection 2.2. All methods in this survey are briefly described and categorized in section 3. Based on this information, the methods are compared with each other concerning their used common ideas and their performance in subsection 4.3. The results of the comparisons and three resulting trends are discussed in subsection 4.4.
Qi and Luo are one of the few who look at self-, semi- and unsupervised learning in one survey [29]. However, they look at the different learning strategies separately and give comparisons only inside the respective learning strategy. We show that bridging these gaps leads to new insights, improved performance, and future research approaches.

Some surveys focus not on general overviews of semi-, self-, and unsupervised learning but on special details. In their survey, Cheplygina et al. present a variety of methods in the context of medical image analysis [30]. They include deep learning and older machine learning approaches but look at different strategies from a medical perspective. Mey and Loog focused on the underlying theoretical assumptions in semi-supervised learning [31]. We keep our survey limited to general image classification tasks and focus on their practical application.

In this survey, we will focus on deep learning approaches for image classification. We will investigate the different learning strategies with a spotlight on loss functions. We concentrate on recent methods because older ones are already adequately addressed in previous literature [17, 18, 19, 20, 21]. Keeping the above-mentioned limitations in mind, the topic of self-, semi-, and unsupervised learning still includes a broad range of research fields. We have to exclude some related topics from this survey to keep the focus of this work, for example because other research has a different aim or is evaluated on different datasets. Therefore, topics like metric learning [13] and meta learning such as [32] will be excluded. More specific networks like generative adversarial networks [33] and graph networks such as [34] will be excluded.
Also, other applications like pose estimation [35] and segmentation [36] or other image sources like videos or sketches [37] are excluded. Topics like few-shot or zero-shot learning methods such as [38] are excluded in this survey. However, we will see in subsection 4.4 that topics like few-shot learning and semi-supervised learning can learn from each other in the future like in [39].

1.2. Outline

The rest of the paper is structured in the following way. We define and explain the terms which are used in this survey such as method, training strategy and common idea in section 2. A visual representation of the terms and their dependencies can be seen before the analysis part in Figure 2. All methods are presented with a short description, their training strategy and common idea in section 3. In section 4, we compare the methods based on their used ideas and their performance across four common image classification datasets. This section also includes a description of the datasets and evaluation metrics. Finally, we discuss the results of the comparisons in subsection 4.4 and identify three trends and research opportunities. In Figure 2, a complete overview of the structure of this survey can be seen.

2. Underlying Concepts

Throughout this survey, we use the terms training strategy, common idea, and method in a specific meaning. The training strategy is the general type/approach for using the unsupervised data during training. The training strategies are similar to the terms semi-supervised, self-supervised, or unsupervised learning but provide a definition for corner cases that the other terms do not. We will explain the differences and similarities in detail in subsection 2.1. The papers we discuss in detail in this survey propose different elements like an algorithm, a general idea, or an extension of previous work. To be consistent in this survey, we call the main algorithm, idea, or extension in each paper a method. All methods are briefly described in section 3. A method follows a training strategy and is based on several common ideas. We use the term common idea, or in short idea, for concepts and approaches that are shared between different methods. We roughly sort the methods based on their training strategy but compare them in detail based on the used common ideas. See subsection 2.2 for further information about common ideas.

In the rest of this chapter, we will use a shared definition for the following variables. For an arbitrary set of images X we define Xl and Xu with X = Xl ∪ Xu and Xl ∩ Xu = ∅ as the labeled and unlabeled images, respectively. For an image x ∈ Xl the corresponding label is defined as zx ∈ Z. An image x ∈ Xu has no label, otherwise it would belong to Xl. For the distinction between Xu and Xl, only the usage of the label information during training is important. For example, an image x ∈ X might have a label that can be used during evaluation but as long as the label is not used during training we define x ∈ Xu. The learning strategy LS_X for a dataset X is either unsupervised (X = Xu), supervised (X = Xl) or semi-supervised (Xu ≠ ∅ and Xl ≠ ∅). During different phases of the training, different image datasets X1, X2, ..., Xn with n ∈ N could be used. Two consecutive datasets Xi and Xi+1 with i ≤ n and i ∈ N are different as long as different images (Xi ≠ Xi+1) or different labels (XLi ≠ XLi+1) are used. The learning strategy LSi up to the dataset Xi during the training is calculated based on Xu = ∪_{j=1}^{i} Xuj and Xl = ∪_{j=1}^{i} Xlj. Consecutive phases of the training are grouped into stages. The stage changes between consecutive datasets Xi and Xi+1 iff the learning strategy is different (LS_Xi ≠ LS_Xi+1) and the overall learning strategy changes (LSi ≠ LSi+1). Due to this definition, only two stages can occur during training and the seven possible combinations are visualized in Figure 4. For more details see subsection 2.1. Let C be the number of classes for the labels Z. For a given neural network f and input x ∈ X the output of the neural network is f(x). For the below-defined formulations, f is an arbitrary network with arbitrary weights and parameters.

2.1. Training strategies

Terms like semi-supervised, self-supervised, and unsupervised learning are often used in literature but have overlapping definitions for certain methods. We will summarize the general understanding and definition of these terms and highlight borderline cases that are difficult to classify. Due to these borderline cases, we will define a new taxonomy based on the stages during training for a precise distinction of the methods.
(a) Supervised (b) One-Stage-Semi-Supervised (c) One-Stage-Unsupervised (d) Multi-Stage-Semi-Supervised
Figure 3: Illustrations of supervised learning (a) and the three presented reduced training strategies (b-d) - The red
and dark blue circles represent labeled data points of different classes. The light grey circles represent unlabeled
data points. The black lines define the underlying decision boundaries between the classes. The striped circles
represent data points that do not use the label information in the first stage and can access this information in a
second stage. For more details on stages and the different learning strategies see subsection 2.1.
In subsection 4.3, we will see that this taxonomy leads to a clear clustering of the methods regarding the common ideas which further justifies this taxonomy. A visual comparison between the learning strategies semi-supervised and unsupervised learning and the training strategies can be found in Figure 4.

Unsupervised learning describes the training without any labels. However, the goal can be a clustering (e.g. [14, 27]) or a good representation (e.g. [25, 40]) of the data. Some methods combine several unsupervised steps to achieve firstly a good representation and then a clustering (e.g. [41]). In most cases, this unsupervised training is achieved by generating its own labels, and therefore the methods are called self-supervised. A counterexample for an unsupervised method without self-supervision would be k-means [22]. Often, self-supervision is achieved on a pretext task on the same or a different dataset and then the pretrained network is fine-tuned on a downstream task [19]. Many methods that follow this paradigm say their method is a form of representation learning [25, 42, 43, 44, 40]. In this survey, we focus on image classification, and therefore most self-supervised or representation learning methods need to fine-tune on labeled data. The combination of pretraining and fine-tuning can neither be called unsupervised nor self-supervised as external labeled information is used. Semi-supervised learning describes methods that use labeled and unlabeled data. However, semi-supervised methods like [26, 45, 46, 16, 47, 48, 49] use the labeled and unlabeled data from the beginning, in comparison to representation learning methods like [25, 43, 40, 44, 42] which use them in different stages of their training. Some methods combine ideas from self-supervised learning, semi-supervised learning and unsupervised learning [15, 27] and are even more difficult to classify.

From the above explanation, we see that most methods are either unsupervised or semi-supervised in the context of image classification. The usage of labeled and unlabeled data in semi-supervised methods varies and a clear distinction in the common taxonomy is not obvious. Nevertheless, we need to structure the methods in some way to keep an overview, allow comparisons and acknowledge the difference of research foci. We decided against providing a fine-grained taxonomy as in previous literature [29] because we believe future research will come up with new combinations that were not thought of before. We separate the methods only based on a rough distinction of when the labeled or unlabeled data is used during the training. For detailed comparisons, we distinguish the methods based on their common ideas that are defined above and described in detail in subsection 2.2. We call all semi-, self-, and unsupervised (learning) strategies together reduced supervised (learning) strategies.

We defined stages above (see section 2) as the different phases/time intervals during training when the different learning strategies supervised (X = Xl), unsupervised (X = Xu) or semi-supervised (Xu ≠ ∅ and Xl ≠ ∅) are used. For example, a method that uses a self-supervised pretraining on Xu and then fine-tunes on the same images with labels has two stages.
A method that uses different algorithms, losses, or datasets during the training but only uses unsupervised data Xu has one stage (e.g. [41]). A method which uses Xu and Xl during the complete training has one stage (e.g. [26]). Based on the definition of stages during training, we classify reduced supervised methods into the training strategies: One-Stage-Semi-Supervised, One-Stage-Unsupervised, and Multi-Stage-Semi-Supervised. An overview of the stage combinations and the corresponding training strategy is given in Figure 4. As we concentrate on reduced supervised learning in this survey, we will not discuss any methods which are completely supervised.

Figure 4: Illustration of the different training strategies – Each row stands for a different combination of data usage during the first and second stage (defined in section 2). The first column states the common learning strategy name in the literature for this usage whereas the last column states the training strategy name used in this survey. The second column represents the used data overall. The third and fourth column represent the used data in stage one or two. The blue and grey (half-) circles represent the usage of the labeled data Xl and the unlabeled data Xu respectively in each stage or overall. A minus means that no further stage is used. The dashed half circle in the last row represents that this dashed part of the data can be used.

Due to the above definition of stages a fifth combination of data usage between the stages exists. This combination would use only labeled data in the first stage and unlabeled data in the second stage. In the rest of the survey, we will exclude this training strategy for the following reasons. The case that a stage of complete supervision is followed by a stage of partial or no supervision is an unusual training strategy. Due to this unusual usage, we only know of weight initialization followed by other reduced supervised training steps where this combination could occur. We see the initialization of a network with pretrained weights from a supervised training on a different dataset (e.g. ImageNet [1]) as an architectural decision. It is not part of the reduced supervised training process because it is used mainly as a more sophisticated weight initialization. If we exclude weight initialization for this reason, we know of no method which belongs to this stage.

In the following paragraphs, we will describe all other training strategies in detail and they are illustrated in Figure 3.

2.1.1 Supervised Learning

Supervised learning is the most common strategy in image classification with deep neural networks. These methods only use labeled data Xl and its corresponding labels Z. The goal is to minimize a loss function between the output of the network f(x) and the expected label zx ∈ Z for all x ∈ Xl.

2.1.2 One-Stage-Semi-Supervised Training

All methods which follow the one-stage-semi-supervised training strategy are trained in one stage with the usage of Xl, Xu, and Z. The main difference to all supervised learning strategies is the usage of the additional unlabeled data Xu. A common way to integrate the unlabeled data is to add one or more unsupervised losses to the supervised loss, as sketched below.
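As a concrete illustration, the following minimal PyTorch sketch (ours, not taken from any of the surveyed papers; `augment` is only a stand-in perturbation) combines a supervised cross-entropy loss with an unsupervised consistency loss in a single update step:

```python
import torch
import torch.nn.functional as F

def augment(x):
    # Stand-in augmentation: a small random perturbation of the input.
    # Real methods use flips, crops, color jitter, etc.
    return x + 0.1 * torch.randn_like(x)

def semi_supervised_step(model, x_l, z_l, x_u, optimizer, w_u=1.0):
    """One combined update on a labeled batch (x_l, z_l) and an
    unlabeled batch x_u: supervised cross-entropy plus an
    unsupervised consistency loss (here MSE between two views)."""
    optimizer.zero_grad()
    # Supervised part: cross-entropy between prediction and label.
    loss_sup = F.cross_entropy(model(x_l), z_l)
    # Unsupervised part: predictions for two augmented views of the
    # same unlabeled images should agree.
    p1 = torch.softmax(model(augment(x_u)), dim=1)
    p2 = torch.softmax(model(augment(x_u)), dim=1)
    loss_unsup = F.mse_loss(p1, p2)
    # Total loss is the weighted sum of both parts.
    loss = loss_sup + w_u * loss_unsup
    loss.backward()
    optimizer.step()
    return float(loss)
```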
2.1.3 One-Stage-Unsupervised Training

All methods which follow the one-stage-unsupervised training strategy are trained in one stage with the usage of only the unlabeled samples Xu. Therefore, many authors in this training strategy call their method unsupervised. A variety of loss functions exist for unsupervised learning [50, 14, 12]. In most cases, the problem is rephrased in such a way that all inputs for the loss can be generated, e.g. the reconstruction loss in autoencoders [12]. Due to this self-supervision, some also call these methods self-supervised. We want to point out one major difference to many self-supervised methods following the multi-stage-semi-supervised training strategy below. One-Stage-Unsupervised methods give image classifications without any further usage of labeled data.
2.1.4 Multi-Stage-Semi-Supervised Training

All methods which follow the multi-stage-semi-supervised training strategy are trained in two stages with the usage of Xu in the first stage and Xl and maybe Xu in the second stage. Many methods that are called self-supervised by their authors fall into this strategy. Commonly a pretext task is used to learn representations on unlabeled data Xu. In the second stage, these representations are fine-tuned to image classification on Xl. An important difference to a one-stage method is that these methods return useable classifications only after an additional training stage.

2.2. Common Ideas

Loss Functions

Cross-entropy (CE)

A common loss function for image classification is cross-entropy [51]. It is commonly used to measure the difference between f(x) and the corresponding label zx for a given x ∈ Xl. The loss is defined in Equation 1 and the goal is to minimize the difference.

CE(zx, f(x)) = − Σ_{c=1}^{C} P(c|zx) · log(P(c|f(x)))
             = − Σ_{c=1}^{C} P(c|zx) · log(P(c|zx)) − Σ_{c=1}^{C} P(c|zx) · log(P(c|f(x)) / P(c|zx))    (1)
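The decomposition in Equation 1 states that cross-entropy equals the entropy of the label distribution plus the Kullback-Leibler divergence between label and prediction. A small numeric check of this identity (hypothetical values, NumPy only):

```python
import numpy as np

def cross_entropy(q, p):
    """CE between a target distribution q (e.g. a one-hot or soft
    label P(.|z_x)) and a predicted distribution p = P(.|f(x))."""
    return -np.sum(q * np.log(p))

def entropy(q):
    return -np.sum(q * np.log(q))

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.7, 0.2, 0.1])   # soft label over C = 3 classes
p = np.array([0.5, 0.3, 0.2])   # network prediction
# Equation 1: CE(q, p) = H(q) + KL(q || p)
assert np.isclose(cross_entropy(q, p), entropy(q) + kl(q, p))
```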
(a) VAT (b) Mixup (c) Overclustering (d) Pseudo-Label
Figure 5: Illustration of four selected common ideas – (a) The blue and red circles represent two different classes.
The line is the decision boundary between these classes. The spheres around the circles define the area of possible
transformations. The arrows represent the adversarial change vector r which pushes the decision boundary away
from any data point. (b) The images of a cat and a dog are combined with a parametrized blending. The labels
are also combined with the same parameterization. The shown images are taken from the dataset STL-10 [52].
(c) Each circle represents a data point and the coloring of the circle the ground-truth label. In this example, the
images in the middle have fuzzy ground-truth labels. Classification can only draw one arbitrary decision boundary
(dashed line) in the datapoints whereas overclustering can create multiple subregions. This method could also be
applied to outliers rather than fuzzy labels. (d) This loop represents one version of Pseudo-Labeling. A neural
network predicts an output distribution. This distribution is cast into a hard Pseudo-Label which is then used for
further training the neural network.
[25, 54, 55, 56, 57]. Examples of contrastive loss functions are NT-Xent [25] and InfoNCE [55] and both are based on Cross-Entropy. The loss NT-Xent is computed across all positive pairs (xi, xj) in a fixed subset of X with N elements, e.g. a batch during training. The definition of the loss for a positive pair is given in Equation 2. The similarity sim between the outputs is measured with a normalized dot product, τ is a temperature parameter and the batch consists of N image pairs.

l(xi, xj) = −log( exp(sim(f(xi), f(xj)) / τ) / Σ_{k=1}^{2N} 1_{[k≠i]} · exp(sim(f(xi), f(xk)) / τ) )    (2)

Chen and Li generalize the loss NT-Xent into a broader family of loss functions with an alignment and a distribution part [58]. The alignment part encourages representations of positive pairs to be similar whereas the distribution part "encourages representations to match a prior distribution" [58]. The loss InfoNCE is motivated like other contrastive losses by maximizing the agreement / mutual information between different views. Van den Oord et al. showed that InfoNCE is a lower bound for the mutual information between the views [55]. More details and different bounds for other losses can be found in [59]. However, Tschannen et al. show evidence that these lower bounds might not be the main reason for the successes of these methods [60]. Due to this fact, we count losses like InfoNCE as a mixture of the common ideas contrastive loss and mutual information.
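A minimal sketch of the NT-Xent loss from Equation 2 for a batch of N positive pairs (our simplification of the loss described in [25], not the reference implementation):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent (Equation 2). z1, z2: (N, d) network outputs for two
    augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N x d, unit norm
    sim = z @ z.t() / tau                               # cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # mask the k = i terms
    n = z1.size(0)
    # index of the positive partner for each of the 2N samples
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    # cross-entropy over the remaining candidates selects the positive pair
    return F.cross_entropy(sim, pos)
```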
Entropy Minimization (EM)

Grandvalet and Bengio noticed that the distributions of predictions in semi-supervised learning tend to be distributed over many or all classes instead of being sharp for one or few classes [61]. They proposed to sharpen the output predictions, or in other words to force the network to make more confident predictions, by minimizing entropy [61]. They minimized the entropy H(P(·|f(x))) for a probability distribution P(·|f(x)) based on a certain neural output f(x) and an image x ∈ X. This minimization leads to sharper / more confident predictions. If this loss is used as the only loss the network/predictions would degenerate to a trivial minimization.
Kullback-Leibler divergence (KL)

The Kullback-Leibler divergence is also commonly used in image classification since it can be interpreted as a part of cross-entropy. In general, KL measures the difference between two given distributions [62] and is therefore often used to define an auxiliary loss between the output f(x) for an image x ∈ X and a given secondary discrete probability distribution Q over the classes C. The definition is given in Equation 3. The second distribution could be another network output distribution, a prior known distribution, or a ground-truth distribution depending on the goal of the minimization.

KL(Q || P(·|f(x))) = − Σ_{c=1}^{C} Q(c) · log(P(c|f(x)) / Q(c))    (3)

Mean Squared Error (MSE)

MSE measures the Euclidean distance between two vectors, e.g. two neural network outputs f(x), f(y) for the images x, y ∈ X. In contrast to the loss CE or KL, MSE is not a probability measure and therefore the vectors can be in an arbitrary Euclidean feature space (see Equation 4). The minimization of the MSE will pull the two vectors, or as in the example the network outputs, together. Similar to the minimization of entropy, this would lead to a degeneration of the network if this loss is used as the only loss on the network outputs.

MSE(f(x), f(y)) = ||f(x) − f(y)||₂²    (4)

Mutual Information (MI)

MI is defined for two probability distributions P, Q as the Kullback-Leibler (KL) divergence between the joint distribution and the marginal distributions [63]. In many reduced supervised methods, the goal is to maximize the mutual information between the distributions. These distributions could be based on the input, the output, or an intermediate step of a neural network. In most cases, the conditional distribution between P and Q and therefore the joint distribution is not known. For example, we could use the outputs of a neural network f(x), f(y) for two augmented views x, y of the same image as the distributions P, Q. In general, the distributions could be dependent as x, y could be identical or very similar, and the distributions could be independent if x, y are crops of distinct classes, e.g. the background sky and the foreground object. Therefore, the mutual information needs to be approximated. The used approximation varies depending on the method and the definition of the distributions P, Q. For further theoretical insights and several approximations see [59, 64].

We show the definition of the mutual information between two network outputs f(x), f(y) for images x, y ∈ X as an example in Equation 5. This equation also shows an alternative representation of mutual information: the separation into entropy H(P(·|f(x))) and conditional entropy H(P(·|f(x)) | P(·|f(y))). Ji et al. argue that this representation illustrates the benefits of using MI over CE in unsupervised cases [14]. A degeneration is avoided because MI balances the effects of maximizing the entropy with a uniform distribution for P(·|f(x)) and minimizing the conditional entropy by equalizing P(·|f(x)) and P(·|f(y)). Both cases lead to a degeneration of the neural network on their own.

I(P(·|f(x)), P(·|f(y)))
 = KL(P(·|f(x), f(y)) || P(·|f(x)) · P(·|f(y)))
 = Σ_{c=1}^{C} Σ_{c'=1}^{C} P(c, c'|f(x), f(y)) · log( P(c, c'|f(x), f(y)) / (P(c|f(x)) · P(c'|f(y))) )
 = H(P(·|f(x))) − H(P(·|f(x)) | P(·|f(y)))    (5)
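A minimal sketch of Equation 5 for two batches of softmax outputs, where the joint distribution is approximated over the batch as done by IIC [14] (our simplification, not the reference implementation):

```python
import torch

def mutual_information(p1, p2, eps=1e-8):
    """MI between two batches of softmax outputs p1, p2 of shape (N, C),
    e.g. predictions for two augmented views of the same N images."""
    joint = p1.t() @ p2 / p1.size(0)        # C x C joint distribution
    joint = (joint + joint.t()) / 2         # symmetrize
    joint = joint / joint.sum()             # renormalize
    px = joint.sum(dim=1, keepdim=True)     # marginal P(c)
    py = joint.sum(dim=0, keepdim=True)     # marginal P(c')
    # Equation 5: expectation of the log ratio under the joint
    return (joint * (torch.log(joint + eps)
                     - torch.log(px + eps)
                     - torch.log(py + eps))).sum()
```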
Virtual Adversarial Training (VAT)

VAT [65] tries to make predictions invariant to small transformations by minimizing the distance between an image and a transformed version of the image. Miyato et al. showed how a transformation can be chosen and approximated in an adversarial way. This adversarial transformation maximizes the distance between an image and a transformed version of it over all possible transformations.
(a) Main image (b) Different image (c) Jigsaw (d) Jigsaw++
Figure 6: Illustrations of 8 selected pretext tasks – (a) Example image for the pretext task (b) Negative/different
example image in the dataset or batch (c) The Jigsaw pretext task consists of solving a simple Jigsaw puzzle
generated from the main image. (d) Jigsaw++ augments the Jigsaw puzzle by adding in parts of a different image.
(e) In the exemplar pretext task, the distributions of a weakly augmented image (upper right corner) and several
strongly augmented images should be aligned. (f) An image is rotated around a fixed set of rotations e.g. 0, 90,
180, and 270 degrees. The network should predict the rotation which has been applied. (g) A central patch and an
adjacent patch from the same image are given. The task is to predict one of the 8 possible relative positions of the
second patch to the first one. In the example, the correct answer is upper center. (h) The network receives a list of
pairs and should predict the positive pairs. In this example, a positive pair consists of augmented views from the
same image. Some illustrations are inspired by [44, 42, 40].
Overclustering (OC)

Normally, if we have k classes in the supervised case we also use k clusters in the unsupervised case. Research showed that it can be beneficial to use more clusters than the k actual classes [67, 14, 27]. We call this idea overclustering. Overclustering can be beneficial in semi-supervised or unsupervised cases due to the effect that neural networks can decide 'on their own' how to split the data. This separation can be helpful for noisy/fuzzy data or with intermediate classes that were sorted into adjacent classes randomly [27]. An illustration of this idea is presented in Figure 5c.
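A minimal sketch of how an auxiliary overclustering head can be attached next to the normal classification head (all names and sizes here are illustrative, not taken from a specific paper):

```python
import torch.nn as nn

class OverclusteringModel(nn.Module):
    """Backbone with two output heads: a normal head with k classes
    and an auxiliary overclustering head with m > k clusters."""
    def __init__(self, backbone, features, k, m):
        super().__init__()
        self.backbone = backbone                 # any feature extractor
        self.head = nn.Linear(features, k)       # normal classification head
        self.over_head = nn.Linear(features, m)  # overclustering head, m > k

    def forward(self, x):
        h = self.backbone(x)
        return self.head(h), self.over_head(h)
```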
3. Methods

This section shortly summarizes all methods in the survey in roughly chronological order and separated by their training strategy. Each summary states the used common ideas, explains their usage, and highlights special cases. The abbreviations for the common ideas are defined in subsection 2.2. We include a large number of recent methods but we do not claim this list to be complete.

3.1. One-Stage-Semi-Supervised

Pseudo-Labels
(a) π-model (b) Temporal Ensembling (c) Mean Teacher (d) UDA
Figure 7: Illustration of four selected one-stage-semi-supervised methods – The used method is given below each
image. The input including label information is given in the blue box on the left side. On the right side, an
illustration of the method is provided. In general, the process is organized from top to bottom. At first, the input
images are preprocessed by none or two different random transformations t. Special augmentation techniques like
Autoaugment [69] are represented by a red box. The following neural network uses these preprocessed images
(x, y) as input. The calculation of the loss (dotted line) is different for each method but shares common parts.
All methods use the cross-entropy (CE) between label and predicted distribution P (·|f (x)) on labeled examples.
Details about the methods can be found in the corresponding entry in section 3 whereas abbreviations for common
methods are defined in subsection 2.2. EMA stands for the exponential moving average.
Mean Teacher

[48]. They develop their approach based on the π-model and Temporal Ensembling [49]. Therefore, they also use MSE as a consistency loss between two predictions but create these predictions differently. They argue that Temporal Ensembling incorporates new information too slowly into predictions. The reason for this is that the exponential moving average (EMA) is only updated once per epoch. Therefore, they propose to use a teacher based on the average weights of a student in each update step. Tarvainen & Valpola show for their model that the KL-divergence is an inferior consistency loss compared to MSE. An illustration of this method is given in Figure 7. Common ideas: CE, MSE
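A minimal sketch of the per-step exponential moving average update of the teacher weights described above (our simplification; alpha is a typical smoothing value, not the exact one from [48]):

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.999):
    """EMA update of the teacher weights from the student weights,
    executed after every optimizer step as in Mean Teacher [48]."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        # p_t <- alpha * p_t + (1 - alpha) * p_s
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```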
VAT

VAT [65] is not just the name for a common idea but it is also a one-stage-semi-supervised method. Miyato et al. used a combination of VAT on unlabeled data and CE on labeled data [65]. They showed that the adversarial transformation leads to a lower error on image classification than random transformations. Furthermore, they showed that adding EntMin [61] to the loss increased accuracy even more. Common ideas: CE, (EM), VAT

Interpolation Consistency Training (ICT)

ICT [70] uses linear interpolations of unlabeled data points to regularize the consistency between images. Verma et al. use a combination of the supervised loss CE and the unsupervised loss MSE. The unsupervised loss is measured between the prediction of the interpolation of two images and the interpolation of their Pseudo-Labels. The interpolation is generated with the mixup [66] algorithm from two unlabeled data points. For these unlabeled data points, the Pseudo-Labels are predicted by a Mean Teacher [48] network. Common ideas: CE, MSE, MU, PL

fast-SWA

In contrast to other semi-supervised methods, Athiwaratkun et al. do not change the loss but the optimization algorithm [71]. They analyzed the learning process based on ideas and concepts of SWA [72], π-model [49] and Mean Teacher [48]. Athiwaratkun et al. show that averaging and cycling learning rates are beneficial in semi-supervised learning by stabilizing the training. They call their improved version of SWA fast-SWA due to faster convergence and lower performance variance [71].
The architecture and loss are either copied from π-model [49] or Mean Teacher [48]. Common ideas: CE, MSE

MixMatch

MixMatch [46] uses a combination of a supervised and an unsupervised loss. Berthelot et al. use CE as the supervised loss and MSE between predictions and generated Pseudo-Labels as their unsupervised loss. These Pseudo-Labels are created from previous predictions of augmented images. They propose a novel sharpening method over multiple predictions to improve the quality of the Pseudo-Labels. This sharpening also implicitly enforces a minimization of the entropy on the unlabeled data. Furthermore, they extend the algorithm mixup [66] to semi-supervised learning by incorporating the generated labels. Common ideas: CE, (EM), MSE, MU, PL
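Minimal sketches of the two ingredients named above, sharpening and mixup (our simplifications with illustrative default parameters, not the reference implementation of [46]):

```python
import torch

def sharpen(p, T=0.5):
    """Sharpen a distribution p (N, C) by lowering its temperature,
    which implicitly minimizes entropy (used to build Pseudo-Labels)."""
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def mixup(x1, z1, x2, z2, alpha=0.75):
    """mixup [66]: blend two images and their (pseudo-)labels with the
    same random factor lam drawn from a Beta distribution."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1.0 - lam)  # MixMatch keeps lam >= 0.5
    return lam * x1 + (1 - lam) * x2, lam * z1 + (1 - lam) * z2
```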
Ensemble AutoEncoding Transformations (EnAET)

EnAET [73] combines the self-supervised pretext task AutoEncoding Transformations [74] with MixMatch [46]. Wang et al. apply spatial transformations, such as translations and rotations, and non-spatial transformations, such as color distortions, on input images in the pretext task. The transformations are then estimated with the original and augmented image given. This is a difference to other pretext tasks where the estimation is often based on the augmented image only [40]. The loss is used together with the loss of MixMatch and is extended with the Kullback-Leibler divergence between the predictions of the original and the augmented image. Common ideas: CE, (EM), KL, MSE, MU, PL, PT

Unsupervised Data Augmentation (UDA)

Xie et al. present with UDA a semi-supervised learning algorithm that concentrates on the usage of state-of-the-art augmentation [16]. They use a supervised and an unsupervised loss. The supervised loss is CE whereas the unsupervised loss is the Kullback-Leibler divergence between output predictions. These output predictions are based on an image and an augmented version of this image. For image classification, they propose to use the augmentation scheme generated by AutoAugment [69] in combination with Cutout [75]. AutoAugment uses reinforcement learning to create useful augmentations automatically. Cutout is an augmentation scheme where randomly selected regions of the image are masked out. Xie et al. show that this combined augmentation method achieves higher performance in comparison to previous methods on their own like Cutout, Cropping, or Flipping. In addition to the different augmentation, they propose to use a variety of other regularization methods. They proposed Training Signal Annealing which restricts the influence of labeled examples during the training process to prevent overfitting. They use EntMin [61] and a kind of Pseudo-Labeling [47]. We use the term kind of Pseudo-Labeling because they do not use the predictions as labels but they use them to filter unsupervised data for outliers. An illustration of this method is given in Figure 7. Common ideas: CE, EM, KL, (PL)

Self-paced Multi-view Co-training (SpamCo)

Ma et al. propose a general framework for co-training across multiple views [76]. In the context of image classification, different neural networks can be used as different views. The main idea of the co-training between different views is similar to using Pseudo-Labels. The main differences in SpamCo are that the Pseudo-Labels are not used for all samples and they influence each other across views. Each unlabeled image has a weight value for each view. Based on an age parameter, more unlabeled images are considered in each iteration. At first only confident Pseudo-Labels are used and over time also less confident ones are allowed. The proposed hard or soft co-regularizers also influence the weighting of the unlabeled images. The regularizers encourage selecting unlabeled images for training across views. Without this regularization the training would degenerate to an independent training of the different views/models. CE is used as loss on the labels and Pseudo-Labels with additional L2 regularization. Ma et al. show further applications including text classification and object detection. Common ideas: CE, CE*, MSE, PL
(a) MixMatch (b) ReMixMatch (c) FixMatch (d) FOC
Figure 8: Illustration of four selected methods – The used method is given below each image. The input including
label information is given in the blue box on the left side. On the right side, an illustration of the method is pro-
vided. For FOC the second stage is represented. In general, the process is organized from top to bottom. At first,
the input images are preprocessed by none or two different random transformations t. Special augmentation tech-
niques like CTAugment [45] are represented by a red box. The following neural network uses these preprocessed
images (e.g. x, y) as input. The calculation of the loss (dotted line) is different for each method but shares com-
mon parts. All methods use the cross-entropy (CE) between label and predicted distribution P (·|f (x)) on labeled
examples. Details about the methods can be found in the corresponding entry in section 3 whereas abbreviations
for common methods are defined in subsection 2.2.
ReMixMatch

ReMixMatch [45] is an extension of MixMatch with distribution alignment and augmentation anchoring. Berthelot et al. motivate the distribution alignment with an analysis of mutual information. They use entropy minimization via "sharpening" but they do not use any prediction equalization like in mutual information. They argue that an equal distribution is also not desirable since the distribution of the unlabeled data could be skewed. Therefore, they align the predictions of the unlabeled data with a marginal class distribution over the seen examples. Berthelot et al. exchange the augmentation scheme of MixMatch with augmentation anchoring. Instead of averaging the prediction over different slight augmentations of an image they only use stronger augmentations as regularization. All augmented predictions of an image are encouraged to result in the same distribution with CE instead of MSE. Furthermore, a self-supervised loss based on the rotation pretext task [40] was added. Common ideas: CE, CE*, (EM), (MI), MU, PL, PT

FixMatch

FixMatch [26] builds on the ideas of ReMixMatch but drops several ideas to make the framework simpler while achieving a better performance. FixMatch uses the cross-entropy loss on the supervised and the unsupervised data. For each image in the unlabeled data, one weakly- and one strongly-augmented version is created. The Pseudo-Label of the weakly-augmented version is used if a confidence threshold is surpassed by the network. If a Pseudo-Label is calculated, the network output of the strongly-augmented version is compared with this hard label via cross-entropy which implicitly encourages low-entropy predictions on the unlabeled data [26]. Sohn et al. do not use ideas like Mixup, VAT, or distribution alignment but they state that they can be used and provide ablations for some of these extensions. Common ideas: CE, CE*, (EM), PL
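A minimal sketch of the unsupervised FixMatch loss described above (our simplification; the augmentations are stand-ins and 0.95 is only a typical threshold value):

```python
import torch
import torch.nn.functional as F

def weak_augment(x):    # stand-in: small perturbation (e.g. flip/shift)
    return x + 0.01 * torch.randn_like(x)

def strong_augment(x):  # stand-in: heavier perturbation (e.g. RandAugment)
    return x + 0.1 * torch.randn_like(x)

def fixmatch_unsup_loss(model, x_u, threshold=0.95):
    """Hard Pseudo-Labels from the weakly augmented view supervise the
    strongly augmented view, but only where the confidence threshold
    is surpassed."""
    with torch.no_grad():
        p_weak = torch.softmax(model(weak_augment(x_u)), dim=1)
        conf, pseudo = p_weak.max(dim=1)    # confidence + hard label
        mask = (conf >= threshold).float()  # keep confident samples only
    logits_strong = model(strong_augment(x_u))
    loss = F.cross_entropy(logits_strong, pseudo, reduction='none')
    return (mask * loss).mean()
```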
3.2. Multi-Stage-Semi-Supervised

Exemplar

Dosovitskiy et al. proposed a self-supervised pretext task with additional fine-tuning [68]. They randomly sample patches from different images and augment these patches heavily. Augmentations can be for example rotations, translations, color changes, or contrast adjustments. The classification task is to map all augmented versions of a patch to the correct original patch using cross-entropy loss. Common ideas: CE, CE*, PT
(a) AMDIM (b) CPC (c) DeepCluster (d) IIC
Figure 9: Illustration of four selected multi-stage-semi-supervised methods – The used method is given below
each image. The input is given in the red box on the left side. On the right side, an illustration of the method
is provided. The fine-tuning part is excluded and only the first stage/pretext task is represented. In general, the
process is organized from top to bottom. At first, the input images are either preprocessed by one or two random
transformations t or are split up. The following neural network uses these preprocessed images (x, y) as input. The
calculation of the loss (dotted line) is different for each method. AMDIM and CPC use internal elements of the
network to calculate the loss. DeepCluster and IIC use the predicted output distributions (P (·|f (x)), P (·|f (y)))
to calculate a loss. Details about the methods can be found in the corresponding entry in section 3 whereas
abbreviations for common methods are defined in subsection 2.2.
Rotation

numbers of rotations but four rotations score the best result. For image classification, they fine-tune on labeled data. Common ideas: CE, CE*, PT

Contrastive Predictive Coding (CPC)

CPC [55, 56] is a self-supervised method that predicts representations of local image regions based on previous image regions. The authors determine the quality of these predictions with a contrastive loss which identifies the correct prediction out of randomly sampled negative ones. They call their loss InfoNCE which is cross-entropy for the prediction of positive examples [55]. Van den Oord et al. showed that minimizing InfoNCE maximizes the lower bound for MI between the previous image regions and the predicted image region [55]. An illustration of this method is given in Figure 9. The representations of the pretext task are then fine-tuned. Common ideas: CE, (CE*), CL, (MI), PT

Contrastive Multiview Coding (CMC)

CMC [54] generalizes CPC [55] to an arbitrary collection of views. Tian et al. try to learn an embedding that is different for contrastive samples and equal for similar images. Like van den Oord et al. they train their network by identifying the correct prediction out of multiple negative ones [55]. However, Tian et al. take different views of the same image such as color channels, depth, and segmentation as similar images. For common image classification datasets like STL-10, they use patch-based similarity. After this pretext task, the representations are fine-tuned to the desired dataset. Common ideas: CE, (CE*), CL, (MI), PT

Deep InfoMax (DIM)

DIM [77] maximizes the MI between local input regions and output representations. Hjelm et al. show that maximizing over local input regions rather than the complete image is beneficial for image classification. Also, they use a discriminator to match the output representations to a given prior distribution. In the end, they fine-tune the network with an additional small fully-connected neural network. Common ideas: CE, MI, PT

Augmented Multiscale Deep InfoMax (AMDIM)

AMDIM [78] maximizes the MI between inputs and outputs of a network. It is an extension of the method DIM [77]. DIM usually maximizes MI between local regions of an image and a representation of the image. AMDIM extends the idea of DIM in several ways. Firstly, the authors sample the local regions and representations from different augmentations of the same source image. Secondly, they maximize MI between multiple scales of the local region and the representation. They use a more powerful encoder and define mixture-based representations to achieve higher accuracies. Bachman et al. fine-tune the representations on labeled data to measure their quality. An illustration of this method is given in Figure 9. Common ideas: CE, MI, PT

Deep Metric Transfer (DMT)

DMT [79] learns a metric as a pretext task and then propagates labels onto unlabeled data with this metric. Liu et al. use self-supervised image colorization [80] or unsupervised instance discrimination [81] to calculate a metric. In the second stage, they propagate labels to unlabeled data with spectral clustering and then fine-tune the network with the new Pseudo-Labels. Additionally, they show that their approach is complementary to previous methods. If they use the most confident Pseudo-Labels for methods such as Mean Teacher [48] or VAT [65], they can improve the accuracy with very few labels by about 30%. Common ideas: CE, CE*, PL, PT

Invariant Information Clustering (IIC)

IIC [14] maximizes the MI between augmented views of an image. The idea is that images should belong to the same class regardless of the augmentation. The augmentation has to be a transformation to which the neural network should be invariant. The authors do not maximize directly over the output distributions but over the class distribution which is approximated for every batch. Ji et al. use auxiliary overclustering on a different output head to increase their performance in the unsupervised case. This idea allows the network to learn subclasses and handle noisy data. Ji et al. use Sobel filtered images as input instead of the original RGB images. Additionally, they show how to extend IIC to image segmentation. Up to this point, the method is completely unsupervised. To be comparable to other semi-supervised methods they fine-tune their models on a subset of available labels. An illustration of this method is given in Figure 9. The first unsupervised stage can be seen as a self-supervised pretext task. In contrast to other pretext tasks, this task already predicts representations which can be seen as classifications. Common ideas: CE, MI, OC, PT
Self-Supervised Semi-Supervised Learning (S4L)

S4L [15] is, as the name suggests, a combination of self-supervised and semi-supervised methods. Zhai et al. split the loss into a supervised and an unsupervised part. The supervised loss is CE whereas the unsupervised loss is based on the self-supervised techniques using rotation and exemplar prediction [40, 68]. The authors show that their method performs better than other self-supervised and semi-supervised techniques [68, 40, 65, 61, 47]. In their Mix Of All Models (MOAM) they combine self-supervised rotation prediction, VAT, entropy minimization, Pseudo-Labels, and fine-tuning into a single model with multiple training steps. Since we discuss the results of their MOAM we identify S4L as a multi-stage-semi-supervised method. Common ideas: CE, CE*, EM, PL, PT, VAT

Simple Framework for Contrastive Learning of Visual Representation (SimCLR)

SimCLR [25] maximizes the agreement between two different augmentations of the same image. The method is similar to CPC [55] and IIC [14]. In comparison to CPC, Chen et al. do not use the different inner representations. Contrary to IIC they use normalized temperature-scaled cross-entropy (NT-Xent) as their loss. Based on the cosine similarity of the predictions, NT-Xent measures whether positive pairs are similar and negative pairs are dissimilar. Augmented versions of the same image are treated as positive pairs and pairs with any other image as negative pairs. The system is trained with large batch sizes of up to 8192 instead of a memory bank to create enough negative examples. Common ideas: CE, (CE*), CL, PT

Fuzzy Overclustering (FOC)

Fuzzy Overclustering [27] is an extension of IIC [14]. FOC focuses on using overclustering to subdivide fuzzy labels in real-world datasets. Therefore, it unifies the used data and losses proposed by IIC between the different stages and extends it with new ideas such as the novel loss Inverse Cross-Entropy (CE−1). This loss is inspired by Cross-Entropy but can be used on the overclustering results of the network where no ground truth labels are known. FOC is not achieving state-of-the-art results on a common image classification dataset. However, on a real-world plankton dataset with fuzzy labels, it surpasses FixMatch and shows that 5-10% more consistent predictions can be achieved. In general, FOC is trained in one unsupervised and one semi-supervised stage and can be seen as a multi-stage-semi-supervised method. Like IIC, it produces classifications already in the unsupervised stage and can therefore also be seen as a one-stage-unsupervised method. Common ideas: CE, (CE*), MI, OC, PT

Momentum Contrast (MoCo)

He et al. propose to use a momentum encoder for contrastive learning [82]. In other methods [25, 57, 55, 56], the negative examples for the contrastive loss are sampled from the same mini-batch as the positive pair. A large batch size is needed to ensure a great variety of negative examples. He et al. sample their negative examples from a queue encoded by another network whose weights are updated with an exponential moving average of the main network. They solve the pretext task proposed by [81] with negative examples sampled from their queue and fine-tune in a second stage on labeled data. Chen et al. provide further ablations and baselines for the MoCo framework, e.g. by using an MLP head for fine-tuning [83]. Common ideas: CE, CL, PT

Bootstrap Your Own Latent (BYOL)

Grill et al. use an online and a target network. In the proposed pretext task, the online network predicts the image representation of the target network for an image [28]. The difference between the predictions is measured with MSE. Normally, this approach would lead to a degeneration of the network as a constant prediction over all images would also achieve the goal. In contrastive learning, this degeneration is avoided by selecting a positive pair of examples from multiple negative ones [25, 57, 55, 56, 82, 83]. By using a slow-moving average of the weights between the online and target network, Grill et al. show empirically that the degeneration to a constant prediction can be avoided. This approach has the positive effect that the BYOL performance depends less on hyperparameters like augmentation and batch size [28]. In a follow-up work, Richemond et al. show that BYOL even works when batch normalization, which might have introduced a kind of contrastive learning effect within the batches, is not used [84]. Common ideas: MSE, PT
(a) SimCLR (b) SimCLRv2 (c) MoCo (d) BYOL
Figure 10: Illustration of four selected multi-stage-semi-supervised methods – The used method is given below
each image. The input is given in the red (not using labels) or blue (using labels) box on the left side. On the right
side, an illustration of the method is provided. The fine-tuning part is excluded and only the first stage/pretext
task is represented. For SimCLRv2 the second stage or distillation step is illustrated. In general, the process is
organized from top to bottom. At first, the input images are either preprocessed by one or two random transforma-
tions t or are split up. The following neural network uses these preprocessed images (x, y) as input. Details about
the methods can be found in the corresponding entry in section 3 whereas abbreviations for common methods are
defined in subsection 2.2. EMA stands for the exponential moving average.
Simple Framework for Contrastive Learning of Visual Representation (SimCLRv2)

Chen et al. extend the framework SimCLR by using larger and deeper networks and by incorporating the memory mechanism from MoCo [57]. Moreover, they propose to use this framework in three steps. The first is training a contrastive learning pretext task with a deep neural network and the SimCLRv2 method. The second step is fine-tuning this large network with a small amount of labeled data. The third step is self-training or distillation. The large pretrained network is used to predict Pseudo-Labels on the complete (unlabeled) data. These (soft) Pseudo-Labels are then used to train a smaller neural network with CE. The distillation step could also be performed on the same network as in the pretext task. Chen et al. show that even this self-distillation leads to performance improvements [57]. Common ideas: CE, (CE*), CL, PL, PT
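A minimal sketch of the distillation step with soft Pseudo-Labels (our simplification; the temperature T is an optional knob, and T=1 corresponds to plain CE on the soft labels):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    """Distillation: the soft Pseudo-Labels of the large pretrained
    teacher supervise the smaller student via cross-entropy."""
    soft_labels = torch.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return -(soft_labels * log_probs).sum(dim=1).mean()
```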
3.3. One-Stage-Unsupervised

Deep Adaptive Image Clustering (DAC)

DAC [50] reformulates unsupervised clustering as a pairwise classification. Similar to the idea of Pseudo-Labels, Chang et al. predict clusters and use these to retrain the network. The twist is that they calculate the cosine distance between all cluster predictions. This distance is used to determine whether the input images are similar or dissimilar with a given certainty. The network is then trained with binary CE on these certain similar and dissimilar input images. One can interpret these similarities and dissimilarities as Pseudo-Labels for the similarity classification task. During the training process, they lower the needed certainty to include more images. As input Chang et al. use a combination of RGB and extracted HOG features. Common ideas: PL
Information Maximizing Self-Augmented Training (IMSAT)

IMSAT [85] maximizes MI between the input and output of the model. As a consistency regularization, Hu et al. use CE between an image prediction and an augmented image prediction. They show that the best augmentation of the prediction can be calculated with VAT [65]. The maximization of MI directly on the image input leads to a problem. For datasets like CIFAR-10, CIFAR-100 [86] and STL-10 [52] the color information is too dominant in comparison to the actual content or shape. As a workaround, Hu et al. use the features generated by a pretrained CNN on ImageNet [1] as input. Common ideas: MI, VAT

Invariant Information Clustering (IIC)

IIC [14] is described above as a multi-stage-semi-supervised method. In comparison to other presented methods, IIC creates usable classifications without fine-tuning the model on labeled data. The reason for this is that the pretext task is constructed in such a way that label predictions can be extracted directly from the model. This leads to the conclusion that IIC can also be interpreted as an unsupervised learning method. Common ideas: MI, OC

Fuzzy Overclustering (FOC)

FOC [27] is described above as a multi-stage-semi-supervised method. Like IIC, FOC can also be seen as a one-stage-unsupervised method because the first stage yields cluster predictions. Common ideas: MI, OC

Semantic Clustering by Adopting Nearest Neighbors (SCAN)

Gansbeke et al. calculate clustering assignments building on a self-supervised pretext task by mining the nearest neighbors and using self-labeling. They propose to use SimCLR [25] as a pretext task but show that other pretext tasks [81, 40] could also be used for this step. For each sample, the k nearest neighbors are selected in the gained feature space. The novel semantic clustering loss encourages these samples to be in the same cluster. Gansbeke et al. noticed that the wrong nearest neighbors have a lower confidence and propose to create Pseudo-Labels on only confident examples for further fine-tuning. They also show that Overclustering can be successfully used if the number of clusters is not known before. Common ideas: OC, PL, PT
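A minimal sketch in the spirit of the semantic clustering loss described above (our simplification of [41]; the entropy weight is illustrative):

```python
import torch

def scan_loss(p, p_neighbors, weight=2.0, eps=1e-8):
    """p (N, C): softmax predictions of N samples; p_neighbors (N, C):
    predictions of one mined nearest neighbor each. The dot-product
    term pulls neighbors into the same cluster; the entropy term
    spreads predictions over all clusters to avoid degeneration."""
    consistency = -torch.log((p * p_neighbors).sum(dim=1) + eps).mean()
    mean_p = p.mean(dim=0)  # average cluster assignment over the batch
    entropy = -(mean_p * torch.log(mean_p + eps)).sum()
    return consistency - weight * entropy  # entropy is maximized
```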
et al. use CE between an image prediction and an aug-
mented image prediction. They show that the best aug- 4. Analysis
mentation of the prediction can be calculated with VAT
In this chapter, we will analyze which common
[65]. The maximization of MI directly on the image
ideas are shared or differ between methods. We will
input leads to a problem. For datasets like CIFAR-10,
compare the performance of all methods with each
CIFAR-100 [86] and STL-10 [52] the color informa-
other on common deep learning datasets.
tion is too dominant in comparison to the actual con-
tent or shape. As a workaround, Hu et al. use the fea- 4.1. Datasets
tures generated by a pretrained CNN on ImageNet [1]
as input. Common ideas: MI, VAT In this survey, we compare the presented methods
Invariant Information Clustering (IIC)

IIC [14] is described above as a multi-stage-semi-supervised method. In comparison to other presented methods, IIC creates usable classifications without fine-tuning the model on labeled data. The reason for this is that the pretext task is constructed in such a way that label predictions can be extracted directly from the model. This leads to the conclusion that IIC can also be interpreted as an unsupervised learning method. Common ideas: MI, OC
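The underlying pretext objective can be sketched as follows: the joint distribution of cluster assignments over two views of each image is estimated from a batch, and its mutual information is maximized. This is a minimal illustration, not the reference implementation:

```python
import torch

def iic_loss(p1, p2, eps=1e-8):
    """Sketch of the IIC objective: maximize the MI of the joint distribution
    of cluster assignments over two transformed views of the same images.
    p1, p2: softmax predictions for the two views, shape (N, C)."""
    joint = p1.t() @ p2 / p1.shape[0]                  # (C, C) joint distribution
    joint = ((joint + joint.t()) / 2).clamp(min=eps)   # symmetrize the two views
    m1 = joint.sum(dim=1, keepdim=True)                # marginal over rows
    m2 = joint.sum(dim=0, keepdim=True)                # marginal over columns
    mi = (joint * (joint.log() - m1.log() - m2.log())).sum()
    return -mi                                         # minimize negative MI
```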
Fuzzy Overclustering (FOC)

FOC [27] is described above as a multi-stage-semi-supervised method. Like IIC, FOC can also be seen as a one-stage-unsupervised method because the first stage yields cluster predictions. Common ideas: MI, OC
Semantic Clustering by Adopting Nearest Neighbors (SCAN)

SCAN [41] calculates clustering assignments by building on a self-supervised pretext task, mining the nearest neighbors and using self-labeling. Gansbeke et al. propose to use SimCLR [25] as the pretext task but show that other pretext tasks [81, 40] could also be used for this step. For each sample, the k nearest neighbors are selected in the gained feature space. The novel semantic clustering loss encourages these samples to be in the same cluster. Gansbeke et al. noticed that wrong nearest neighbors have a lower confidence and propose to create Pseudo-Labels on only confident examples for further fine-tuning. They also show that Overclustering can be used successfully if the number of clusters is not known beforehand. Common ideas: OC, PL, PT
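A minimal sketch of this semantic clustering loss is given below. The entropy weight is a hyperparameter of the method; the value used here is only illustrative:

```python
import torch

def scan_loss(probs, neighbor_probs, entropy_weight=5.0, eps=1e-8):
    """Sketch of the SCAN semantic clustering loss. probs: predictions for a
    batch, shape (N, C); neighbor_probs: predictions for one mined nearest
    neighbor of each sample, shape (N, C)."""
    # Consistency: the dot product is large when a sample and its neighbor
    # are confidently assigned to the same cluster.
    dot = (probs * neighbor_probs).sum(dim=1)
    consistency = -(dot + eps).log().mean()
    # Entropy term: spread assignments over all clusters to avoid collapse.
    p_mean = probs.mean(dim=0)
    entropy = -(p_mean * (p_mean + eps).log()).sum()
    return consistency - entropy_weight * entropy
```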
4. Analysis

In this chapter, we will analyze which common ideas are shared between methods and where they differ. We will compare the performance of all methods with each other on common deep learning datasets.

4.1. Datasets

In this survey, we compare the presented methods on a variety of datasets. We selected four datasets that were used in multiple papers to allow a fair comparison. An overview of example images is given in Figure 11.

CIFAR-10 and CIFAR-100

CIFAR-10 and CIFAR-100 are large datasets of tiny 32x32 color images [86]. Both datasets contain 60,000 images belonging to 10 or 100 classes, respectively. The 100 classes in CIFAR-100 can be combined into 20 superclasses. Both sets provide 50,000 training examples and 10,000 validation examples (image + label). The presented results are only trained with 4,000 labels for CIFAR-10 and 10,000 labels for CIFAR-100 to represent a semi-supervised case. If a method uses all labels, this is marked independently.

STL-10

STL-10 is a dataset designed for unsupervised and semi-supervised learning [52]. The dataset is inspired by CIFAR-10 [86] but provides fewer labels. It only consists of 5,000 training labels and 8,000 validation labels. However, 100,000 unlabeled example images are also provided. These unlabeled examples belong to the training classes and to some different classes. The images are 96x96 color images and were acquired, together with their labels, from ImageNet [1].
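As an illustration of this evaluation protocol, a semi-supervised split can be created by keeping a class-balanced subset of labels and treating the remaining images as unlabeled. The sketch below uses torchvision and assumes a class-balanced selection, which is a common convention but not prescribed by every paper:

```python
import numpy as np
from torchvision.datasets import CIFAR10

def semi_supervised_split(labels, n_labeled=4000, n_classes=10, seed=0):
    """Select a class-balanced labeled subset; all remaining indices are
    treated as unlabeled. Returns (labeled_idx, unlabeled_idx)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    labeled = []
    for c in range(n_classes):
        class_idx = np.flatnonzero(labels == c)
        labeled.extend(rng.choice(class_idx, n_labeled // n_classes, replace=False))
    labeled = np.sort(np.array(labeled))
    unlabeled = np.setdiff1d(np.arange(len(labels)), labeled)
    return labeled, unlabeled

train = CIFAR10(root="./data", train=True, download=True)
labeled_idx, unlabeled_idx = semi_supervised_split(train.targets)  # 4,000 labels
```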
Figure 11: Examples of four random cats in the different datasets to illustrate the difference in quality — (a) CIFAR-10, (b) STL-10, (c) ILSVRC-2012
Table 1: Overview of the methods and their used common ideas — On the left-hand side, the reviewed methods
from section 3 are sorted by the training strategy. The top row lists the common ideas. Details about the ideas
and their abbreviations are given in subsection 2.2. The last column and some rows sum up the usage of ideas per
method or training strategy. Legend: (X) The idea is only used indirectly. The individual explanations are given
in section 3.
CE CE* EM CL KL MSE MU MI OC PT PL VAT Overall Sum
One-Stage-Semi-Supervised
Pseudo-Labels [47] X X X 3
π model [49] X X 2
Temporal Ensembling [49] X X 2
Mean Teacher [48] X X 2
VAT [65] X X 2
VAT + EntMin [65] X X X 3
ICT [70] X X X X 4
fast-SWA [71] X X 2
MixMatch [46] X (X) X X X 5
EnAET [73] X (X) X X X AET X 7
UDA [16] X X X (X) 4
SPamCO [76] X X X X 4
ReMixMatch [45] X X (X) X (X) Rotation X 7
FixMatch [26] X X (X) X 4
Sum 14 4 6 0 2 8 4 1 0 2 8 2 47
Multi-Stage-Semi-Supervised
Exemplar [68] X X Augmentation 3
Context [42] X X Context 3
Jigsaw [43] X X Jigsaw 3
DeepCluster [67] X X X Clustering X 5
Rotation [40] X X Rotation 3
CPC [55, 56] X (X) X (X) CL 5
CMC [54] X (X) X (X) CL 5
DIM [77] X X MI 3
AMDIM [78] X X MI 3
DMT [79] X X X Metric X 5
IIC [14] X X X MI 4
S4L [15] X X X Rotation X X 6
SimCLR [25] X (X) CL 3
MoCo [82] X X Metric 3
BYOL [28] X X Bootstrap 3
FOC [27] X (X) X X MI 5
SimCLRv2 [57] X (X) X CL X 5
Sum 17 11 1 5 0 1 0 6 3 17 4 1 66
One-Stage-Unsupervised
DAC [50] X 1
IMSAT [85] X X 2
IIC [14] X X MI 3
FOC [27] X X MI 3
SCAN [41] X CL X 3
Sum 0 0 0 0 0 0 0 3 3 3 2 1 12
Overall Sum 31 15 7 5 2 9 4 10 6 22 14 4 125
Table 2: Overview of the reported accuracies — The first column states the used method. For the supervised
baseline, we used the best-reported results which were considered as baselines in the referenced papers. The
original paper is given in brackets after the score. The architecture is given in the second column. The last four
columns report the Top-1 accuracy score in % for the respective dataset (See subsection 4.2 for further details). If
the results are not reported in the original paper, the reference is given after the result. A blank entry represents
the fact that no result was reported. Be aware that different architectures and frameworks are used which might
impact the results. Please see subsection 4.3 for a detailed explanation. Legend: † 100% of the labels are used
instead of the default value defined in subsection 4.1. ‡ Multilayer perceptron is used for fine-tuning instead
of one fully connected layer. Remarks on special architectures and evaluations: 1 Architecture includes Shake-
Shake regularization. 2 Network uses wider hidden layers. 3 Method uses ten random classes out of the default
1000 classes. 4 Network only predicts 20 superclasses instead of the default 100 classes. 5 Inputs are pretrained
ImageNet features. 6 Method uses different copies of the network for each input. 7 The network uses selective
kernels [87].
ments.

For the CIFAR-10 dataset, almost all multi- or one-stage-semi-supervised methods reach about or over 90% accuracy. The best methods, MixMatch and FixMatch, reach an accuracy of more than 95% and are roughly three percent worse than the fully supervised baseline. For the CIFAR-100 dataset, fewer results are reported. With about 77%, FixMatch is the best method on this dataset, compared to the fully supervised baseline of about 80%. Newer methods also provide results for 1,000 or even 250 labels instead of 4,000 labels. Especially EnAET, ReMixMatch, and FixMatch stick out since they achieve only 1-2% worse results with 250 labels than with 4,000 labels.

For the STL-10 dataset, most methods report a better result than the supervised baseline. These results are possible due to the unlabeled part of the dataset. The unlabeled data can only be utilized by semi-, self-, or unsupervised methods. EnAET achieves the best results with more than 95%. FixMatch reports an accuracy of nearly 95% with only 1,000 labels. This is more than most methods achieve with 5,000 labels.

The ILSVRC-2012 dataset is the most difficult dataset based on the reported Top-1 accuracies. Most methods only achieve a Top-1 accuracy which is roughly 20% worse than the reported supervised baseline of around 86%. Only the methods SimCLR, BYOL, and SimCLRv2 achieve an accuracy that is less than 10% worse than the baseline. SimCLRv2 achieves the best accuracy with a Top-1 accuracy of 80.9% and a Top-5 accuracy of around 96%. For fewer labels, SimCLR, BYOL, and SimCLRv2 also achieve the best results.

The unsupervised methods are separated from the supervised baseline by a clear margin of up to 10%. SCAN achieves the best results in comparison to the other methods as it builds on the strong pretext task of SimCLR. This also illustrates the reason for including the unsupervised methods in a comparison with semi-supervised methods. Unsupervised methods do not use labeled examples and are therefore expected to be worse. However, the data show that the gap of 10% is not large and that unsupervised methods can benefit from ideas of self-supervised learning. Some papers report results for even fewer labels, as shown in Table 3, which closes the gap to unsupervised learning further. IMSAT reports an accuracy of about 94% on STL-10. Since IMSAT uses pretrained ImageNet features, and ImageNet is a superset of STL-10, the results are not directly comparable.

4.4. Discussion

In this subsection, we discuss the presented results of the previous subsection. We divide our discussion into three major trends that we identified. All these trends lead to possible future research opportunities.

1. Trend: Real World Applications?

Previous methods were not scalable to real-world images and applications and used workarounds, e.g. extracted features [85], to process real-world images. Many methods can report a result of over 90% on CIFAR-10, a simple low-resolution dataset, but only five methods achieve a Top-5 accuracy of over 90% on ILSVRC-2012, a high-resolution dataset. We conclude that most methods are not scalable to high-resolution and complex image classification problems. However, the best-reported methods like FixMatch and SimCLRv2 seem to have surpassed the point of only scientific usage and could be applied to real-world classification tasks.

This conclusion applies to real-world image classification tasks with balanced and clearly separated classes. It also indicates which real-world issues need to be solved in future research. Class imbalance [93, 94] and noisy labels [95, 27] are not treated by the presented methods, and datasets with only few unlabeled data points are not considered either. We see that good performance on well-structured datasets does not always transfer completely to real-world datasets [27]. We assume that these issues arise due to assumptions that do not hold on real-world datasets, such as a clear distinction between datapoints [27], and due to non-robust hyperparameters like augmentations and batch size [28]. Future research has to address these issues so that reduced supervised learning methods can be applied to any real-world dataset.

2. Trend: How much supervision is needed?

We see that the gap between reduced supervised and supervised methods is shrinking. For CIFAR-10, CIFAR-100 and ILSVRC-2012, a gap of less than 5% is left between totally supervised and reduced supervised learning. For STL-10, the reduced supervised methods even surpass the totally supervised case by about 20% due to the additional set of unlabeled data. We conclude that reduced supervised learning reaches comparable results while using only roughly 10% of the labels.
Table 3: Overview of the reported accuracies with fewer labels — The first column states the used method. The last seven columns report the Top-1 accuracy score in % for the respective dataset and amount of labels. The number of labels is given either as an absolute number or in percent. A blank entry represents the fact that no result was reported.
In general, we considered a reduction from 100% to 10% of all labels. However, we see that methods like FixMatch and SimCLRv2 achieve comparable results with even fewer labels, such as the usage of 1% of all labels. For ILSVRC-2012, this is equivalent to about 13 images per class. FixMatch even achieves a median accuracy of around 65% with one label per class on the CIFAR-10 dataset [26].

The trend that results improve over time is expected. But the results indicate that we are near the point where semi-supervised learning needs very few to almost no labels per class (e.g. 10 labels for CIFAR-10). In practice, the labeling cost for unsupervised and semi-supervised learning will then be almost the same for common classification datasets. Unsupervised methods would need to bridge the performance gap on these classification datasets to remain useful. It is questionable whether an unsupervised method can achieve this because it would need to guess what a human wants to have classified even when competing features are available.

We already see that on datasets like ImageNet, additional data such as JFT-300M is used to further improve the supervised training [96, 97, 98]. These large amounts of data can only be collected without any or with weak labels because the collection process has to be automated. It will be interesting to investigate whether the methods discussed in this survey can also scale to such datasets while using only few labels per class.

We conclude that on datasets with few and a fixed number of classes, semi-supervised methods will be more important than unsupervised methods. However, if we have a lot of classes, or if new classes should be detected as in few- or zero-shot learning [99, 100, 38, 94], unsupervised methods will still have a lower labeling cost and be of high importance. This means future research has to investigate how the semi-supervised ideas can be transferred to unsupervised methods as in [14, 41] and to settings with many, an unknown, or a rising number of classes as in [39, 96].

3. Trend: Combination of common ideas

In the comparison, we identified that few common ideas are shared by one-stage-semi-supervised and multi-stage-semi-supervised methods.

We believe there is only a little overlap between these methods due to the different aims of the respective authors. Many multi-stage-semi-supervised papers focus on creating good representations; they fine-tune their results only to be comparable.
One-stage-semi-supervised papers aim for the best accuracy scores with as few labels as possible.

If we look at methods like SimCLRv2, EnAET, ReMixMatch, or S4L, we see that it can be beneficial to combine different ideas and mindsets. These methods used a broad range of ideas, including ideas uncommon for their respective training strategy. S4L even calls its combined approach a "Mix of all models" [15], and SimCLRv2 states that "Big Self-Supervised Models are Strong Semi-Supervised Learners" [57].

We assume that this combination is one reason for their superior performance. This assumption is supported by the comparisons included in the original papers. For example, S4L showed the impact of each method separately as well as the combination of all of them [15].

Methods like FixMatch illustrate that it does not take a lot of common ideas to achieve state-of-the-art performance but rather that selecting the correct ideas and combining them in a meaningful way is important. We identified that some common ideas are not often combined and that combining a broad range of ideas, including unusual ones, can be beneficial. We believe that the combination of the different common ideas is a promising future research field because many reasonable combinations are not yet explored.

5. Conclusion

In this paper, we provided an overview of semi-, self-, and unsupervised methods. We analyzed their differences, similarities, and combinations based on 34 different methods. This analysis led to the identification of several trends and possible research fields.

We based our analysis on the definition of the different training strategies and common ideas in these strategies. We showed how the methods work in general and which ideas they use, and we provided a simple classification. Despite the difficult comparison of the methods' performances due to different architectures and implementations, we identified three major trends.

Results of over 90% Top-5 accuracy on ILSVRC-2012 with only 10% of the labels indicate that semi-supervised methods could be applied to real-world problems. However, issues like class imbalance and noisy or fuzzy labels are not considered. More robust methods need to be researched before semi-supervised learning can be applied to real-world issues. The performance gap between supervised and semi- or self-supervised methods is closing, and the number of labels needed to achieve results comparable to fully supervised learning is decreasing. Due to these developments, unsupervised methods will in the future have almost no labeling cost benefit in comparison to semi-supervised methods. We conclude that, in combination with the fact that semi-supervised methods have the benefit of using labels as guidance, unsupervised methods will lose importance. However, for a large or increasing number of classes, the ideas of unsupervised learning are still of high importance, and ideas from semi-supervised and self-supervised learning need to be transferred to this setting.

We concluded that one-stage-semi-supervised and multi-stage-semi-supervised training mainly use a different set of common ideas. Both strategies use a combination of different ideas, but there are few overlaps in these techniques. We identified the trend that a combination of different techniques is beneficial to the overall performance. In combination with the small overlap between the ideas, we identified possible future research opportunities.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, volume 60, pages 1097–1105. Association for Computing Machinery, 2012. 1, 6, 19, 20, 23

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015. 1, 22

[3] J Brünger, S Dippel, R Koch, and C Veit. 'Tailception': using neural networks for assessing tail lesions on pictures of pig carcasses. Animal, 13(5):1030–1036, 2019. 1

[4] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018. 1
[5] Sascha Clausen, Claudius Zelenka, Tobias Schwede, and Reinhard Koch. Parcel Tracking by Detection in Large Camera Networks. In Thomas Brox, Andrés Bruhn, and Mario Fritz, editors, GCPR 2018: Pattern Recognition, pages 89–104. Springer International Publishing, 2019. 1

[6] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015. 1

[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1

[8] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006. 1

[9] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the Limits of Weakly Supervised Pretraining. Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018. 1

[10] Lars Schmarje, Claudius Zelenka, Ulf Geisen, Claus-C. Glüer, and Reinhard Koch. 2D and 3D Segmentation of uncertain local collagen fiber orientations in SHG microscopy. DAGM German Conference of Pattern Recognition, 11824 LNCS:374–386, 2019. 2

[11] Geoffrey E Hinton, Terrence Joseph Sejnowski, and Others. Unsupervised learning: foundations of neural computation. MIT press, 1999. 2

[12] Junyuan Xie, Ross B Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In 33rd International Conference on Machine Learning, volume 1, pages 740–749. International Machine Learning Society (IMLS), 2016. 2, 6

[13] Mahmut Kaya and Hasan Sakir Bilge. Deep Metric Learning: A Survey. Symmetry, 11(9):1066, 2019. 2, 3

[14] Xu Ji, João F. Henriques, and Andrea Vedaldi. Invariant Information Clustering for Unsupervised Image Classification and Segmentation. Proceedings of the IEEE International Conference on Computer Vision, pages 9865–9874, 2019. 2, 5, 6, 9, 11, 16, 17, 19, 21, 22, 25

[15] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-Supervised Semi-Supervised Learning. In Proceedings of the IEEE international conference on computer vision, pages 1476–1485, 2019. 2, 5, 10, 17, 21, 22, 26

[16] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised Data Augmentation for Consistency Training. Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), 2020. 2, 5, 13, 21, 22, 25

[17] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542, 2006. 2, 3

[18] Rui Xu and Donald C Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16:645–678, 2005. 2, 3

[19] Longlong Jing and Yingli Tian. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 2, 3, 5

[20] Jesper E. van Engelen and Holger H. Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2019. 2, 3

[21] Gianluigi Ciocca, Claudio Cusano, Simone Santini, and Raimondo Schettini. On the use of supervised features for unsupervised image categorization: An evaluation. Computer Vision and Image Understanding, 122:155–171, 2014. 2, 3
[22] James MacQueen and Others. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967. 2, 5

[23] Xiaojin Zhu. Semi-Supervised Learning Literature Survey. Comput Sci, University of Wisconsin-Madison, 2, 2008. 2

[24] Erxue Min, Xifeng Guo, Qiang Liu, Gen Zhang, Jianjing Cui, and Jun Long. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6:39501–39514, 2018. 2

[25] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. International conference on machine learning, (PMLR):1597–1607, 2020. 2, 5, 7, 8, 11, 17, 18, 19, 21, 22, 23, 25

[26] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), 2020. 2, 5, 6, 10, 14, 21, 22, 25

[27] Lars Schmarje, Johannes Brünger, Monty Santarossa, Simon-Martin Schröder, Rainer Kiko, and Reinhard Koch. Beyond Cats and Dogs: Semi-supervised Classification of fuzzy labels with overclustering. arXiv preprint arXiv:2012.01768, 2020. 2, 5, 11, 17, 19, 21, 22, 23, 24

[28] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised Learning. Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), 2020. 2, 18, 21, 22, 24, 25

[29] Guo-Jun Qi and Jiebo Luo. Small Data Challenges in Big Data Era: A Survey of Recent Progress on Unsupervised and Semi-Supervised Methods. arXiv preprint arXiv:1903.11260, 2019. 3, 5

[30] Veronika Cheplygina, Marleen de Bruijne, and Josien P W Pluim. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical Image Analysis, 54:280–296, 2019. 3

[31] Alexander Mey and Marco Loog. Improvability Through Semi-Supervised Learning: A Survey of Theoretical Results. arXiv preprint arXiv:1908.09574, pages 1–28, 2019. 3

[32] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. 34th International Conference on Machine Learning, ICML 2017, 3:1856–1868, mar 2017. 3

[33] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. Proceedings of the International Conference on Neural Information Processing Systems, pages 2672–2680, jun 2014. 3

[34] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. Prototype Propagation Networks (PPN) for Weakly-supervised Few-shot Learning on Category Graph. IJCAI International Joint Conference on Artificial Intelligence, 2019-August:3015–3022, may 2019. 4

[35] Norimichi Ukita and Yusuke Uematsu. Semi- and weakly-supervised human pose estimation. Computer Vision and Image Understanding, 170:67–78, 2018. 4
[36] Dwarikanath Mahapatra. Combining multiple expert annotations using semi-supervised learning and graph cuts for medical image segmentation. Computer Vision and Image Understanding, 151:114–123, oct 2016. 4

[37] Peng Xu, Zeyu Song, Qiyue Yin, Yi-Zhe Song, and Liang Wang. Deep Self-Supervised Representation Learning for Free-Hand Sketch. IEEE Transactions on Circuits and Systems for Video Technology, 2020. 4

[38] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, Xuanyi Dong, and Chengqi Zhang. Isometric Propagation Network for Generalized Zero-shot Learning. International Conference on Learning Representations, feb 2021. 4, 25

[39] Zhongjie Yu, Lin Chen, Zhongwei Cheng, and Jiebo Luo. Transmatch: A transfer-learning scheme for semi-supervised few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12856–12864, 2020. 4, 25

[43] Mehdi Noroozi and Paolo Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. European Conference on Computer Vision, pages 69–84, 2016. 5, 11, 15, 21, 22

[44] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Conference on Computer Vision and Pattern Recognition, pages 9359–9367, 2018. 5, 10, 11, 15

[45] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. International Conference on Learning Representations, 2020. 5, 10, 11, 13, 14, 21, 22, 25

[46] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2019. 5, 10, 11, 13, 21, 22, 25
[50] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep Adaptive Image Clustering. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5880–5888, 2017. 6, 18, 21, 22

[51] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. 7

[52] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011. 8, 19

[53] R Hadsell, S Chopra, and Y LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742, 2006. 7

[54] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Multiview Coding. European conference on computer vision, 2019. 8, 16, 21, 22

[55] Aaron Van Den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018. 8, 11, 16, 17, 18, 21, 23

[56] Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord. Data-Efficient Image Recognition with Contrastive Predictive Coding. Proceedings of the 37th International Conference on Machine Learning, PMLR:4182–4192, 2020. 8, 16, 17, 18, 21, 22

[57] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big Self-Supervised Models are Strong Semi-Supervised Learners. Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), 2020. 8, 17, 18, 21, 22, 25, 26

[58] Ting Chen and Lala Li. Intriguing Properties of Contrastive Losses. arXiv preprint arXiv:2011.02803, 2020. 8

[59] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On Variational Bounds of Mutual Information. International Conference on Machine Learning, 2019. 8, 9

[60] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On Mutual Information Maximization for Representation Learning. International Conference on Learning Representations, 2020. 8

[61] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005. 8, 12, 13, 17

[62] S Kullback and R A Leibler. On Information and Sufficiency. Ann. Math. Statist., 22(1):79–86, 1951. 9

[63] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 1991. 9

[64] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. Mutual Information Neural Estimation. In International Conference on Machine Learning, pages 531–540, 2018. 9

[65] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, pages 1–16, 2018. 9, 10, 12, 16, 17, 19, 21, 22

[66] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. 10, 12, 13

[67] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep Clustering for Unsupervised Learning of Visual Features. Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018. 11, 15, 21, 22
[68] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2015. 11, 14, 17, 21, 22

[69] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019. 12, 13

[70] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, David Lopez-Paz, and Kenji Kawaguchi. Interpolation Consistency Training for Semi-Supervised Learning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. 12, 21, 22, 25

[71] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. In International Conference on Learning Representations, pages 1–22, 2019. 12, 13, 21, 22

[72] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. In Conference on Uncertainty in Artificial Intelligence, 2018. 12

[73] Xiao Wang, Daisuke Kihara, Jiebo Luo, and Guo-Jun Qi. EnAET: Self-Trained Ensemble AutoEncoding Transformations for Semi-Supervised Learning. arXiv preprint arXiv:1911.09265, 2019. 13, 21, 22, 25

[74] Liheng Zhang, Guo Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 2542–2550. IEEE, 2019. 13

[75] Terrance Devries and Graham W Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552, 2017. 13

[76] Fan Ma, Deyu Meng, Xuanyi Dong, and Yi Yang. Self-paced Multi-view Co-training. Journal of Machine Learning Research, 21(57):1–38, 2020. 13, 21, 22

[77] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, pages 1–24, 2019. 16, 21, 22

[78] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning Representations by Maximizing Mutual Information Across Views. In Advances in Neural Information Processing Systems, pages 15509–15519, 2019. 16, 21, 22

[79] Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep Metric Transfer for Label Propagation with Limited Annotated Data. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019. 16, 21, 22, 25

[80] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful Image Colorization. European conference on computer vision, pages 649–666, 2016. 16

[81] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3733–3742, may 2018. 16, 17, 19
[82] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 17, 18, 21, 22

[83] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297, 2020. 17, 18, 22

[84] Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, and Michal Valko. BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241, 2020. 18

[85] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning Discrete Representations via Information Maximizing Self-Augmented Training. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1558–1567, 2017. 19, 21, 22, 24

[86] Alex Krizhevsky, Geoffrey Hinton, and Others. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. 19

[87] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective Kernel Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 510–519, 2019. 22

[88] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237, 2020. 22

[89] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting Self-Supervised Visual Representation Learning. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1920–1929, 2019. 22, 23

[90] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, pages 87.1–87.12. British Machine Vision Association, 2016. 23

[91] Xavier Gastaldi. Shake-Shake regularization. arXiv preprint arXiv:1705.07485, 2017. 23

[92] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian J Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018. 23

[93] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 24

[94] Simon-Martin Schröder, Rainer Kiko, and Reinhard Koch. MorphoCluster: Efficient Annotation of Plankton images by Clustering. Sensors, 20, 2020. 24, 25

[95] Qing Li, Xiaojiang Peng, Liangliang Cao, Wenbin Du, Hao Xing, Yu Qiao, and Qiang Peng. Product image recognition with guidance learning and noisy supervision. Computer Vision and Image Understanding, 196:102963, 2020. 24

[96] Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V. Le. Meta Pseudo Labels. 2020. 25

[97] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General Visual Representation Learning. In Lecture Notes in Computer Science, pages 491–507. 2020. 25

[98] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-Training With Noisy Student Improves ImageNet Classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695. IEEE, jun 2020. 25
[99] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020. 25