
A Survey on Semi-, Self- and Unsupervised Learning in Image Classification

Lars Schmarje*, Monty Santarossa, Simon-Martin Schröder, Reinhard Koch

Multimedia Information Processing Group, Kiel University, Germany
{las,msa,sms,rk}@informatik.uni-kiel.de
* Corresponding author
arXiv:2002.08721v5 [cs.CV] 25 May 2021

Abstract

While deep learning strategies achieve outstanding results in computer vision tasks, one issue remains:
The current strategies rely heavily on a huge amount
of labeled data. In many real-world problems, it is not
feasible to create such an amount of labeled training
data. Therefore, it is common to incorporate unlabeled
data into the training process to reach equal results
with fewer labels. Due to a lot of concurrent research,
it is difficult to keep track of recent developments. In
this survey, we provide an overview of often used ideas
and methods in image classification with fewer labels.
We compare 34 methods in detail based on their performance and their commonly used ideas rather than a fine-grained taxonomy. In our analysis, we identify three major trends that lead to future research opportunities. 1. State-of-the-art methods are scalable to real-world applications in theory, but issues like class imbalance, robustness, or fuzzy labels are not considered. 2. The degree of supervision which is needed to achieve comparable results to the usage of all labels is decreasing and therefore methods need to be extended to settings with a variable number of classes. 3. All methods share some common ideas but we identify clusters of methods that do not share many ideas. We show that combining ideas from different clusters can lead to better performance.

Figure 1: This image illustrates and simplifies the benefit of using unlabeled data during deep learning training. The red and dark blue circles represent labeled data points of different classes. The light grey circles represent unlabeled data points. If we have only a small number of labeled samples available we can only make assumptions (dotted line) over the underlying true distribution (solid line). This true distribution can only be determined if we also consider the unlabeled data points and clarify the decision boundary.

1. Introduction

Deep learning strategies achieve outstanding successes in computer vision tasks. They reach the best performance in a diverse range of tasks such as image classification [1, 2, 3], object detection [4, 5] or semantic segmentation [6, 7].

The quality of a deep neural network is strongly influenced by the number of labeled/supervised images [8]. ImageNet [1] is a huge labeled dataset with over one million images which allows the training of networks with impressive performance. Recent research shows that even larger datasets than ImageNet can improve these results [9]. However, in many real-world applications it is not possible to create labeled datasets with millions of images. A common strategy for dealing with this problem is transfer learning. This strategy improves results even on small and specialized datasets like medical imaging [10].
This might be a practical workaround for some applications but the fundamental issue remains: Unlike humans, supervised learning needs enormous amounts of labeled data.

For a given problem we often have access to a large dataset of unlabeled data. How this unsupervised data could be used for neural networks has been of research interest for many years [11]. Xie et al. were among the first in 2016 to investigate unsupervised deep learning image clustering strategies to leverage this data [12]. Since then, the usage of unlabeled data has been researched in numerous ways and has created research fields like unsupervised, semi-supervised, self-supervised, weakly-supervised, or metric learning [13]. Generally speaking, unsupervised learning uses no labeled data, semi-supervised learning uses unlabeled and labeled data, while self-supervised learning generates labeled data on its own. Other research directions are even more different because weakly-supervised learning uses only partial information about the label and metric learning aims at learning a good distance metric. The idea that unifies these approaches is that using unlabeled data is beneficial during the training process (see Figure 1 for an illustration). It either makes the training with fewer labels more robust or in some rare cases even surpasses the supervised cases [14].

Due to this benefit, many researchers and companies work in the field of semi-, self-, and unsupervised learning. The main goal is to close the gap between semi-supervised and supervised learning or even surpass these results. Considering presented methods like [15, 16] we believe that research is at the breaking point of achieving this goal. Hence, there is a lot of research ongoing in this field. This survey provides an overview to keep track of the major and recent developments in semi-, self-, and unsupervised learning.

Most investigated research topics share a variety of common ideas while differing in goal, application contexts, and implementation details. This survey gives an overview of this wide range of research topics. The focus of this survey is on describing the similarities and differences between the methods.

Whereas we look at a broad range of learning strategies, we compare these methods only based on the image classification task. The addressed audience of this survey consists of deep learning researchers or interested people with comparable preliminary knowledge who want to keep track of recent developments in the field of semi-, self- and unsupervised learning.

1.1. Related Work

In this subsection, we give a quick overview of previous works and reference topics we will not address further to maintain the focus of this survey.

The research of semi- and unsupervised techniques in computer vision has a long history. A variety of research, surveys, and books has been published on this topic [17, 18, 19, 20, 21]. Unsupervised cluster algorithms were researched before the breakthrough of deep learning and are still widely used [22]. There are already extensive surveys that describe unsupervised and semi-supervised strategies without deep learning [18, 23]. We will focus only on techniques including deep neural networks.

Many newer surveys focus only on self-, semi- or unsupervised learning [24, 19, 20]. Min et al. wrote an overview of unsupervised deep learning strategies [24]. They presented the beginning of this field of research from a network architecture perspective. The authors looked at a broad range of architectures. We focus on only one architecture which Min et al. refer to as "Clustering deep neural network (CDNN)-based deep clustering" [24]. Even though the work was published in 2018, it already misses the recent and major developments in deep learning of the last years. We look at these more recent developments and show the connections to other research fields that Min et al. did not include.

Van Engelen and Hoos give a broad overview of general and recent semi-supervised methods [20]. They cover some recent developments but deep learning strategies such as [25, 26, 27, 14, 28] are not covered. Furthermore, the authors do not explicitly compare the presented methods based on their structure or performance.

Jing and Tian concentrated their survey on recent developments in self-supervised learning [19]. Like us, the authors provide a performance comparison and a taxonomy. Their taxonomy distinguishes between different kinds of pretext tasks. We look at pretext tasks as one common idea and compare the methods based on these underlying ideas. Jing and Tian look at different tasks apart from classification but do not include semi- and unsupervised methods without a pretext task.
Figure 2: Overview of the structure of this survey – The learning strategies unsupervised, semi-supervised and
supervised are commonly used in the literature. Because semi-supervised learning incorporates many methods, we defined training strategies which subdivide semi-supervised learning. For details about the training and
learning strategies (including self-supervised learning) see subsection 2.1. Each method belongs to one training
strategy and uses several common ideas. A common idea can be a concept such as a pretext task or a loss such
as cross-entropy. The definition of methods and common ideas is given in section 2. Details about the common
ideas are defined in subsection 2.2. All methods in this survey are shortly described and categorized in section 3.
The methods are compared with each other based on this information concerning their used common ideas and
their performance in subsection 4.3. The results of the comparisons and three resulting trends are discussed in
subsection 4.4.

Qi and Luo are one of the few who look at self-, semi- and unsupervised learning in one survey [29]. However, they look at the different learning strategies separately and give comparisons only inside the respective learning strategy. We show that bridging these gaps leads to new insights, improved performance, and future research approaches.

Some surveys focus not on general overviews of semi-, self-, and unsupervised learning but on special details. In their survey, Cheplygina et al. present a variety of methods in the context of medical image analysis [30]. They include deep learning and older machine learning approaches but look at different strategies from a medical perspective. Mey and Loog focused on the underlying theoretical assumptions in semi-supervised learning [31]. We keep our survey limited to general image classification tasks and focus on their practical application.

In this survey, we will focus on deep learning approaches for image classification. We will investigate the different learning strategies with a spotlight on loss functions. We concentrate on recent methods because older ones are already adequately addressed in previous literature [17, 18, 19, 20, 21]. Keeping the above-mentioned limitations in mind, the topic of self-, semi-, and unsupervised learning still includes a broad range of research fields. We have to exclude some related topics from this survey to keep the focus of this work, for example because other research has a different aim or is evaluated on different datasets. Therefore, topics like metric learning [13] and meta learning such as [32] will be excluded. More specific networks like generative adversarial networks [33] and graph networks such as [34] will be excluded.
Also, other applications like pose estimation [35] and segmentation [36] or other image sources like videos or sketches [37] are excluded. Topics like few-shot or zero-shot learning methods such as [38] are excluded in this survey. However, we will see in subsection 4.4 that topics like few-shot learning and semi-supervised learning can learn from each other in the future like in [39].

1.2. Outline

The rest of the paper is structured in the following way. We define and explain the terms which are used in this survey such as method, training strategy and common idea in section 2. A visual representation of the terms and their dependencies can be seen before the analysis part in Figure 2. All methods are presented with a short description, their training strategy and common ideas in section 3. In section 4, we compare the methods based on their used ideas and their performance across four common image classification datasets. This section also includes a description of the datasets and evaluation metrics. Finally, we discuss the results of the comparisons in subsection 4.4 and identify three trends and research opportunities. In Figure 2, a complete overview of the structure of this survey can be seen.

2. Underlying Concepts

Throughout this survey, we use the terms training strategy, common idea, and method in a specific meaning. The training strategy is the general type/approach for using the unsupervised data during training. The training strategies are similar to the terms semi-supervised, self-supervised, or unsupervised learning but provide a definition for corner cases that the other terms do not. We will explain the differences and similarities in detail in subsection 2.1. The papers we discuss in detail in this survey propose different elements like an algorithm, a general idea, or an extension of previous work. To be consistent in this survey, we call the main algorithm, idea, or extension in each paper a method. All methods are briefly described in section 3. A method follows a training strategy and is based on several common ideas. We use the term common idea, or in short idea, for concepts and approaches that are shared between different methods. We roughly sort the methods based on their training strategy but compare them in detail based on the used common ideas. See subsection 2.2 for further information about common ideas.

In the rest of this chapter, we will use a shared definition for the following variables. For an arbitrary set of images X we define Xl and Xu with X = Xl ∪̇ Xu as the labeled and unlabeled images, respectively. For an image x ∈ Xl the corresponding label is defined as zx ∈ Z. An image x ∈ Xu has no label, otherwise it would belong to Xl. For the distinction between Xu and Xl, only the usage of the label information during training is important. For example, an image x ∈ X might have a label that can be used during evaluation but as long as the label is not used during training we define x ∈ Xu. The learning strategy LS_X for a dataset X is either unsupervised (X = Xu), supervised (X = Xl) or semi-supervised (Xu ∩ Xl ≠ ∅). During different phases of the training, different image datasets X1, X2, ..., Xn with n ∈ N could be used. Two consecutive datasets Xi and Xi+1 with i ≤ n and i ∈ N are different as long as different images (Xi ≠ Xi+1) or different labels (XLi ≠ XLi+1) are used. The learning strategy LS_i up to the dataset Xi during the training is calculated based on Xu = ∪_{j=1}^{i} Xu_j and Xl = ∪_{j=1}^{i} Xl_j. Consecutive phases of the training are grouped into stages. The stage changes during consecutive datasets Xi and Xi+1 iff the learning strategy is different (LS_{Xi} ≠ LS_{Xi+1}) and the overall learning strategy changes (LS_i ≠ LS_{i+1}). Due to this definition, only two stages can occur during training and the seven possible combinations are visualized in Figure 4. For more details see subsection 2.1. Let C be the number of classes for the labels Z. For a given neural network f and input x ∈ X the output of the neural network is f(x). For the below-defined formulations, f is an arbitrary network with arbitrary weights and parameters.

2.1. Training strategies

Terms like semi-supervised, self-supervised, and unsupervised learning are often used in literature but have overlapping definitions for certain methods. We will summarize the general understanding and definition of these terms and highlight borderline cases that are difficult to classify. Due to these borderline cases, we will define a new taxonomy based on the stages during training for a precise distinction of the methods.
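To make the preceding definitions concrete, the following minimal Python sketch (our own illustration, not code from the surveyed paper; the names labeled, unlabeled, and learning_strategy are hypothetical) shows how the learning strategy of a training phase follows from which images contribute label information, and how a change of learning strategy between phases creates a second stage.

def learning_strategy(labeled, unlabeled):
    # Return the learning strategy for one phase of the training.
    if labeled and not unlabeled:
        return "supervised"        # X = Xl
    if unlabeled and not labeled:
        return "unsupervised"      # X = Xu
    if labeled and unlabeled:
        return "semi-supervised"   # labeled and unlabeled images are both used
    raise ValueError("empty dataset")

# Example: a self-supervised pretext task on unlabeled images followed by
# fine-tuning with labels uses two stages, because the strategy changes.
stage1 = learning_strategy(labeled=[], unlabeled=["img1.png", "img2.png"])
stage2 = learning_strategy(labeled=[("img1.png", "cat")], unlabeled=[])
print(stage1, "->", stage2)  # unsupervised -> supervised (two stages)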
(a) Supervised (b) One-Stage-Semi-Supervised (c) One-Stage-Unsupervised (d) Multi-Stage-Semi-Supervised

Figure 3: Illustrations of supervised learning (a) and the three presented reduced training strategies (b-d) - The red
and dark blue circles represent labeled data points of different classes. The light grey circles represent unlabeled
data points. The black lines define the underlying decision boundaries between the classes. The striped circles
represent data points that do not use the label information in the first stage and can access this information in a
second stage. For more details on stages and the different learning strategies see subsection 2.1.

In subsection 4.3, we will see that this taxonomy leads to a clear clustering of the methods regarding the common ideas which further justifies this taxonomy. A visual comparison between the learning strategies semi-supervised and unsupervised learning and the training strategies can be found in Figure 4.

Unsupervised learning describes the training without any labels. However, the goal can be a clustering (e.g. [14, 27]) or a good representation (e.g. [25, 40]) of the data. Some methods combine several unsupervised steps to achieve firstly a good representation and then a clustering (e.g. [41]). In most cases, this unsupervised training is achieved by generating its own labels, and therefore the methods are called self-supervised. A counterexample for an unsupervised method without self-supervision would be k-means [22]. Often, self-supervision is achieved on a pretext task on the same or a different dataset and then the pretrained network is fine-tuned on a downstream task [19]. Many methods that follow this paradigm say their method is a form of representation learning [25, 42, 43, 44, 40]. In this survey, we focus on image classification, and therefore most self-supervised or representation learning methods need to fine-tune on labeled data. The combination of pretraining and fine-tuning can neither be called unsupervised nor self-supervised as external labeled information is used. Semi-supervised learning describes methods that use labeled and unlabeled data. However, semi-supervised methods like [26, 45, 46, 16, 47, 48, 49] use the labeled and unlabeled data from the beginning, in comparison to representation learning methods like [25, 43, 40, 44, 42] which use them in different stages of their training. Some methods combine ideas from self-supervised learning, semi-supervised learning and unsupervised learning [15, 27] and are even more difficult to classify.

From the above explanation, we see that most methods are either unsupervised or semi-supervised in the context of image classification. The usage of labeled and unlabeled data in semi-supervised methods varies and a clear distinction in the common taxonomy is not obvious. Nevertheless, we need to structure the methods in some way to keep an overview, allow comparisons and acknowledge the difference of research foci. We decided against providing a fine-grained taxonomy as in previous literature [29] because we believe future research will come up with new combinations that were not thought of before. We separate the methods only based on a rough distinction of when the labeled or unlabeled data is used during the training. For detailed comparisons, we distinguish the methods based on their common ideas that are defined above and described in detail in subsection 2.2. We call all semi-, self-, and unsupervised (learning) strategies together reduced supervised (learning) strategies.

We defined stages above (see section 2) as the different phases/time intervals during training when the different learning strategies supervised (X = Xl), unsupervised (X = Xu) or semi-supervised (Xu ∩ Xl ≠ ∅) are used.
For example, a method that uses a self-supervised pretraining on Xu and then fine-tunes on the same images with labels has two stages. A method that uses different algorithms, losses, or datasets during the training but only uses unsupervised data Xu has one stage (e.g. [41]). A method which uses Xu and Xl during the complete training has one stage (e.g. [26]). Based on the definition of stages during training, we classify reduced supervised methods into the training strategies: One-Stage-Semi-Supervised, One-Stage-Unsupervised, and Multi-Stage-Semi-Supervised. An overview of the stage combinations and the corresponding training strategy is given in Figure 4. As we concentrate on reduced supervised learning in this survey, we will not discuss any methods which are completely supervised.

Due to the above definition of stages a fifth combination of data usage between the stages exists. This combination would use only labeled data in the first stage and unlabeled data in the second stage. In the rest of the survey, we will exclude this training strategy for the following reasons. The case that a stage of complete supervision is followed by a stage of partial or no supervision is an unusual training strategy. Due to this unusual usage, we only know of weight initialization followed by other reduced supervised training steps where this combination could occur. We see the initialization of a network with pretrained weights from a supervised training on a different dataset (e.g. ImageNet [1]) as an architectural decision. It is not part of the reduced supervised training process because it is used mainly as a more sophisticated weight initialization. If we exclude weight initialization for this reason, we know of no method which belongs to this stage.

In the following paragraphs, we will describe all other training strategies in detail and they are illustrated in Figure 3.

Figure 4: Illustration of the different training strategies – Each row stands for a different combination of data usage during the first and second stage (defined in section 2). The first column states the common learning strategy name in the literature for this usage whereas the last column states the training strategy name used in this survey. The second column represents the used data overall. The third and fourth column represent the used data in stage one or two. The blue and grey (half-)circles represent the usage of the labeled data Xl and the unlabeled data Xu respectively in each stage or overall. A minus means that no further stage is used. The dashed half circle in the last row represents that this dashed part of the data can be used.

2.1.1 Supervised Learning

Supervised learning is the most common strategy in image classification with deep neural networks. These methods only use labeled data Xl and its corresponding labels Z. The goal is to minimize a loss function between the output of the network f(x) and the expected label zx ∈ Z for all x ∈ Xl.

2.1.2 One-Stage-Semi-Supervised Training

All methods which follow the one-stage-semi-supervised training strategy are trained in one stage with the usage of Xl, Xu, and Z. The main difference to all supervised learning strategies is the usage of the additional unlabeled data Xu. A common way to integrate the unlabeled data is to add one or more unsupervised losses to the supervised loss.

2.1.3 One-Stage-Unsupervised Training

All methods which follow the one-stage-unsupervised training strategy are trained in one stage with the usage of only the unlabeled samples Xu. Therefore, many authors in this training strategy call their method unsupervised. A variety of loss functions exist for unsupervised learning [50, 14, 12]. In most cases, the problem is rephrased in such a way that all inputs for the loss can be generated, e.g. the reconstruction loss in autoencoders [12]. Due to this self-supervision, some also call these methods self-supervised. We want to point out one major difference to many self-supervised methods following the multi-stage-semi-supervised training strategy below. One-Stage-Unsupervised methods give image classifications without any further usage of labeled data.
2.1.4 Multi-Stage-Semi-Supervised Training

All methods which follow the multi-stage-semi-supervised training strategy are trained in two stages with the usage of Xu in the first stage and Xl and maybe Xu in the second stage. Many methods that are called self-supervised by their authors fall into this strategy. Commonly a pretext task is used to learn representations on unlabeled data Xu. In the second stage, these representations are fine-tuned to image classification on Xl. An important difference to a one-stage method is that these methods return usable classifications only after an additional training stage.

2.2. Common ideas

Different common ideas are used to train models in semi-, self-, and unsupervised learning. In this section, we present a selection of these ideas that are used across multiple methods in the literature.

It is important to notice that our usage of common ideas is fuzzy and incomplete by definition. A common idea should not be an identical implementation or approximation but the underlying motivation. This fuzziness is needed for two reasons. Firstly, a comparison would not be possible due to so many small differences in the exact implementations. Secondly, they allow us to abstract some core elements of a method and therefore similarities can be detected. Also, not all details, concepts, and motivations are captured by common ideas. We will limit ourselves to the common ideas described below since we believe they are enough to characterize all recent methods. At the same time, we know that these ideas need to be extended in the future as new common ideas will arise, old ones will disappear, and focus will shift to other ideas. In contrast to detailed taxonomies, these new ideas can easily be integrated as new tags.

We sorted the ideas in alphabetical order and distinguish loss functions and general concepts. Since ideas might reference each other, you may have to jump to the corresponding entry if you would like to know more.

Loss Functions

Cross-entropy (CE)

A common loss function for image classification is cross-entropy [51]. It is commonly used to measure the difference between f(x) and the corresponding label zx for a given x ∈ Xl. The loss is defined in Equation 1 and the goal is to minimize the difference.

CE(zx, f(x)) = − Σ_{c=1}^{C} P(c|zx) log(P(c|f(x)))
             = − Σ_{c=1}^{C} P(c|zx) log(P(c|zx)) − Σ_{c=1}^{C} P(c|zx) log(P(c|f(x)) / P(c|zx))
             = H(P(·|zx)) + KL(P(·|zx) || P(·|f(x)))    (1)

P is a probability distribution over all classes and is approximated with the (softmax-)output of the neural network f(x) or the given label zx. H is the entropy of a probability distribution and KL is the Kullback-Leibler divergence. It is important to note that cross-entropy is the sum of the entropy over zx and a Kullback-Leibler divergence between f(x) and zx. In general, the entropy H(P(·|zx)) is zero due to the one-hot encoded label zx.

The loss function CE could also be used with a different probability distribution than P based on the ground-truth label. These distributions could for example be based on Pseudo-Labels or other targets in a self-supervised pretext task. We abbreviate the used common idea with CE* if not the ground-truth labels are used, to highlight this specialty.
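The decomposition in Equation 1 can be checked numerically. The following short Python sketch (our own illustration with hypothetical helper names, not part of any surveyed method) computes the cross-entropy between a softmax output and a target distribution and verifies that it equals entropy plus Kullback-Leibler divergence.

import numpy as np

def cross_entropy(target, pred):
    # CE(z_x, f(x)) = - sum_c P(c|z_x) * log(P(c|f(x)))
    return -np.sum(target * np.log(pred))

def entropy(p):
    # H(P) = - sum_c P(c) * log(P(c)); 0 * log(0) is treated as 0
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0))

def kl_divergence(p, q):
    # KL(P || Q) = sum_c P(c) * log(P(c) / Q(c))
    return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

target = np.array([0.0, 1.0, 0.0])   # one-hot label z_x, so H(target) = 0
pred = np.array([0.1, 0.7, 0.2])     # softmax output f(x)
assert np.isclose(cross_entropy(target, pred),
                  entropy(target) + kl_divergence(target, pred))

With a soft target distribution, for example a sharpened Pseudo-Label, the same identity holds but the entropy term is no longer zero.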
(a) VAT (b) Mixup (c) Overclustering (d) Pseudo-Label

Figure 5: Illustration of four selected common ideas – (a) The blue and red circles represent two different classes.
The line is the decision boundary between these classes. The ε-spheres around the circles define the area of possible
transformations. The arrows represent the adversarial change vector r which pushes the decision boundary away
from any data point. (b) The images of a cat and a dog are combined with a parametrized blending. The labels
are also combined with the same parameterization. The shown images are taken from the dataset STL-10 [52]
(c) Each circle represents a data point and the coloring of the circle the ground-truth label. In this example, the
images in the middle have fuzzy ground-truth labels. Classification can only draw one arbitrary decision boundary
(dashed line) in the datapoints whereas overclustering can create multiple subregions. This method could also be
applied to outliers rather than fuzzy labels. (d) This loop represents one version of Pseudo-Labeling. A neural
network predicts an output distribution. This distribution is cast into a hard Pseudo-Label which is then used for
further training the neural network.

Contrastive Loss (CL)

A contrastive loss tries to distinguish positive and negative pairs. The positive pair could be different views of the same image and the negative pairs could be all other pairwise combinations in a batch [25]. Hadsell et al. proposed to learn representations based on contrasting [53]. In recent years, the idea has been extended by self-supervised visual representation learning methods [25, 54, 55, 56, 57]. Examples of contrastive loss functions are NT-Xent [25] and InfoNCE [55] and both are based on Cross-Entropy. The loss NT-Xent is computed across all positive pairs (xi, xj) in a fixed subset of X with N elements, e.g. a batch during training. The definition of the loss for a positive pair is given in Equation 2. The similarity sim between the outputs is measured with a normalized dot product, τ is a temperature parameter and the batch consists of N image pairs.

l_{xi,xj} = −log( exp(sim(f(xi), f(xj))/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(f(xi), f(xk))/τ) )    (2)

Chen and Li generalize the loss NT-Xent into a broader family of loss functions with an alignment and a distribution part [58]. The alignment part encourages representations of positive pairs to be similar whereas the distribution part "encourages representations to match a prior distribution" [58]. The loss InfoNCE is motivated like other contrastive losses by maximizing the agreement / mutual information between different views. Van den Oord et al. showed that InfoNCE is a lower bound for the mutual information between the views [55]. More details and different bounds for other losses can be found in [59]. However, Tschannen et al. show evidence that these lower bounds might not be the main reason for the successes of these methods [60]. Due to this fact, we count losses like InfoNCE as a mixture of the common ideas contrastive loss and mutual information.
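As an illustration of Equation 2, the following Python sketch (an assumption-laden simplification of the loss from [25], not the reference implementation; the function name nt_xent is our own) computes the NT-Xent loss for a batch of 2N embeddings in which rows i and i+N form a positive pair.

import numpy as np

def nt_xent(z, temperature=0.5):
    # z: array of shape (2N, d); rows i and i+N are embeddings of two views of the same image
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # normalize so the dot product is the cosine similarity
    sim = z @ z.T / temperature                         # pairwise similarities sim(f(x_i), f(x_k)) / tau
    np.fill_diagonal(sim, -np.inf)                      # the indicator 1_{k != i} removes the self-similarity
    n = len(z) // 2
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # index of the positive partner of each row
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()      # average of l_{x_i, x_j} over all positive pairs

rng = np.random.default_rng(0)
print(nt_xent(rng.normal(size=(8, 16))))                # 2N = 8 embeddings of dimension 16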
Entropy Minimization (EM)

Grandvalet and Bengio noticed that the distributions of predictions in semi-supervised learning tend to be distributed over many or all classes instead of being sharp for one or few classes [61]. They proposed to sharpen the output predictions or in other words to force the network to make more confident predictions by minimizing entropy [61]. They minimized the entropy H(P(·|f(x))) for a probability distribution P(·|f(x)) based on a certain neural output f(x) and an image x ∈ X. This minimization leads to sharper / more confident predictions. If this loss is used as the only loss the network/predictions would degenerate to a trivial minimization.

Kullback-Leibler divergence (KL)

The Kullback-Leibler divergence is also commonly used in image classification since it can be interpreted as a part of cross-entropy. In general, KL measures the difference between two given distributions [62] and is therefore often used to define an auxiliary loss between the output f(x) for an image x ∈ X and a given secondary discrete probability distribution Q over the classes C. The definition is given in Equation 3. The second distribution could be another network output distribution, a prior known distribution, or a ground-truth distribution depending on the goal of the minimization.

KL(Q || P(·|f(x))) = − Σ_{c=1}^{C} Q(c) log(P(c|f(x)) / Q(c))    (3)

Mean Squared Error (MSE)

MSE measures the Euclidean distance between two vectors, e.g. two neural network outputs f(x), f(y) for the images x, y ∈ X. In contrast to the loss CE or KL, MSE is not a probability measure and therefore the vectors can be in an arbitrary Euclidean feature space (see Equation 4). The minimization of the MSE will pull the two vectors, or as in the example the network outputs, together. Similar to the minimization of entropy, this would lead to a degeneration of the network if this loss is used as the only loss on the network outputs.

MSE(f(x), f(y)) = ||f(x) − f(y)||_2^2    (4)

Mutual Information (MI)

MI is defined for two probability distributions P, Q as the Kullback-Leibler (KL) divergence between the joint distribution and the marginal distributions [63]. In many reduced supervised methods, the goal is to maximize the mutual information between the distributions. These distributions could be based on the input, the output, or an intermediate step of a neural network. In most cases, the conditional distribution between P and Q and therefore the joint distribution is not known. For example, we could use the outputs of a neural network f(x), f(y) for two augmented views x, y of the same image as the distributions P, Q. In general, the distributions could be dependent as x, y could be identical or very similar, and the distributions could be independent if x, y are crops of distinct classes, e.g. the background sky and the foreground object. Therefore, the mutual information needs to be approximated. The used approximation varies depending on the method and the definition of the distributions P, Q. For further theoretical insights and several approximations see [59, 64].

We show the definition of the mutual information between two network outputs f(x), f(y) for images x, y ∈ X as an example in Equation 5. This equation also shows an alternative representation of mutual information: the separation into entropy H(P(·|f(x))) and conditional entropy H(P(·|f(x)) | P(·|f(y))). Ji et al. argue that this representation illustrates the benefits of using MI over CE in unsupervised cases [14]. A degeneration is avoided because MI balances the effects of maximizing the entropy with a uniform distribution for P(·|f(x)) and minimizing the conditional entropy by equalizing P(·|f(x)) and P(·|f(y)). Both cases lead to a degeneration of the neural network on their own.

I(P(·|f(x)), P(·|f(y))) = KL(P(·|f(x), f(y)) || P(·|f(x)) ∗ P(·|f(y)))
                        = Σ_{c=1}^{C} Σ_{c'=1}^{C} P(c, c'|f(x), f(y)) log( P(c, c'|f(x), f(y)) / (P(c|f(x)) ∗ P(c'|f(y))) )
                        = H(P(·|f(x))) − H(P(·|f(x)) | P(·|f(y)))    (5)
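For clustering-style outputs, the joint distribution in Equation 5 is often estimated directly from the paired softmax outputs of a batch, as in IIC [14]. The following Python sketch (our own simplified illustration of this estimation, not the authors' code; mutual_information is a hypothetical helper) builds the joint distribution from paired predictions and evaluates Equation 5.

import numpy as np

def mutual_information(p_x, p_y, eps=1e-12):
    # p_x, p_y: softmax outputs of shape (batch, C) for two augmented views of the same images
    joint = p_x.T @ p_y / len(p_x)             # estimate of P(c, c'|f(x), f(y)) over the batch
    joint = (joint + joint.T) / 2              # symmetrize the joint distribution
    marg_x = joint.sum(axis=1, keepdims=True)  # marginal P(c|f(x))
    marg_y = joint.sum(axis=0, keepdims=True)  # marginal P(c'|f(y))
    return np.sum(joint * (np.log(joint + eps) - np.log(marg_x + eps) - np.log(marg_y + eps)))

batch, classes = 64, 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch, classes))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(mutual_information(p, p))  # identical views give a positive MI; it is largest for confident, balanced predictions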
(a) Main image (b) Different image (c) Jigsaw (d) Jigsaw++

(e) Exemplar (f) Rotation (g) Context (h) Contrastive Learning

Figure 6: Illustrations of 8 selected pretext tasks – (a) Example image for the pretext task (b) Negative/different
example image in the dataset or batch (c) The Jigsaw pretext task consists of solving a simple Jigsaw puzzle
generated from the main image. (d) Jigsaw++ augments the Jigsaw puzzle by adding in parts of a different image.
(e) In the exemplar pretext task, the distributions of a weakly augmented image (upper right corner) and several
strongly augmented images should be aligned. (f) An image is rotated around a fixed set of rotations e.g. 0, 90,
180, and 270 degrees. The network should predict the rotation which has been applied. (g) A central patch and an
adjacent patch from the same image are given. The task is to predict one of the 8 possible relative positions of the
second patch to the first one. In the example, the correct answer is upper center. (h) The network receives a list of
pairs and should predict the positive pairs. In this example, a positive pair consists of augmented views from the
same image. Some illustrations are inspired by [44, 42, 40].

Virtual Adversarial Training (VAT)

VAT [65] tries to make predictions invariant to small transformations by minimizing the distance between an image and a transformed version of the image. Miyato et al. showed how a transformation can be chosen and approximated in an adversarial way. This adversarial transformation maximizes the distance between an image and a transformed version of it over all possible transformations. The loss is defined in Equation 6 with an image x ∈ X and the output of a given neural network f(x).

VAT(f(x)) = D(P(·|f(x)), P(·|f(x + r_adv)))
r_adv = argmax_{r; ||r|| ≤ ε} D(P(·|f(x)), P(·|f(x + r)))    (6)

P is the probability distribution over the outputs of the neural network and D is a non-negative function that measures the distance. As illustrated in Figure 5a, r is a vector and ε the maximum length of this vector. Two examples of used distance measures are cross-entropy [65] and Kullback-Leibler divergence [15].

Concepts

Mixup (MU)

Mixup creates convex combinations of images by blending them into each other. An illustration of the concept is given in Figure 5b. The prediction of the convex combination of the corresponding labels turned out to be beneficial because the network needs to create consistent predictions for intermediate interpolations of the image. This approach has been beneficial for supervised learning in general [66] and is therefore also used in several semi-supervised learning algorithms [46, 45, 26].
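A minimal Python sketch of the mixup idea (our own illustration following [66]; the function name mixup and the coefficient lam are our own, with lam drawn from a Beta distribution) shows how both the images and the label distributions are combined with the same parameterization.

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75):
    # x1, x2: images as float arrays; y1, y2: label distributions (e.g. one-hot labels or Pseudo-Labels)
    lam = np.random.beta(alpha, alpha)   # blending coefficient in [0, 1]
    x_mix = lam * x1 + (1.0 - lam) * x2  # convex combination of the images
    y_mix = lam * y1 + (1.0 - lam) * y2  # the labels are combined with the same parameterization
    return x_mix, y_mix

x_mix, y_mix = mixup(np.ones((32, 32, 3)) * 0.2, np.array([1.0, 0.0]),
                     np.ones((32, 32, 3)) * 0.8, np.array([0.0, 1.0]))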
Overclustering (OC)

Normally, if we have k classes in the supervised case we also use k clusters in the unsupervised case. Research showed that it can be beneficial to use more clusters than the k actual classes [67, 14, 27]. We call this idea overclustering. Overclustering can be beneficial in semi-supervised or unsupervised cases due to the effect that neural networks can decide 'on their own' how to split the data. This separation can be helpful in noisy/fuzzy data or with intermediate classes that were sorted into adjacent classes randomly [27]. An illustration of this idea is presented in Figure 5c.

Pretext Task (PT)

A pretext task is a broad-ranged description of self-supervised training of a neural network on a different task than the target task. This task can be for example predicting the rotation of an image [40], solving a jigsaw puzzle [43], using a contrastive loss [25, 55] or maximizing mutual information [14, 27]. An overview of most pretext tasks in this survey is given in Figure 6 and a complete overview is given in Table 1. In most cases the self-supervised pretext task is used to learn representations which can then be fine-tuned for image classification [25, 55, 68, 42, 43, 44, 40]. In a semi-supervised context, some methods use this pretext task to define an additional loss during training [45].

Pseudo-Labels (PL)

A simple approach for estimating labels of unknown data is using Pseudo-Labels [47]. Lee proposed to classify unseen data with a neural network and use the predictions as labels. This process is illustrated in Figure 5d. What sounds at first like a self-fulfilling assumption works reasonably well in real-world image classification tasks. It is important to notice that the network needs additional information to prevent totally random predictions. This additional information could be some known labels or a weight initialization from other supervised data or from unsupervised training on a pretext task. Several modern methods are based on the same core idea of creating labels by predicting them on their own [48, 46].

3. Methods

This section shortly summarizes all methods in the survey in roughly chronological order and separated by their training strategy. Each summary states the used common ideas, explains their usage, and highlights special cases. The abbreviations for the common ideas are defined in subsection 2.2. We include a large number of recent methods but we do not claim this list to be complete.

3.1. One-Stage-Semi-Supervised

Pseudo-Labels

Pseudo-Labels [47] describes a common idea in deep learning and a learning method on its own. For the description of the common idea see above in subsection 2.2. In contrast to many other semi-supervised methods, Pseudo-Labels does not use a combination of an unsupervised and a supervised loss. The Pseudo-Labels approach uses the predictions of a neural network as labels for unknown data as described in the common idea. Therefore, the labeled and unlabeled data are used in parallel to minimize the CE loss. Common ideas: CE, CE*, PL

π-model and Temporal Ensembling

Laine & Aila present two similar learning methods with the names π-model and Temporal Ensembling [49]. Both methods use a combination of the supervised CE loss and the unsupervised consistency loss MSE. The first input for the consistency loss in both cases is the output of their network from a randomly augmented input image. The second input is different for each method. In the π-model an augmentation of the same image is used. In Temporal Ensembling an exponential moving average of previous predictions is evaluated. Laine & Aila show that Temporal Ensembling is up to two times faster and more stable in comparison to the π-model [49]. Illustrations of these methods are given in Figure 7. Common ideas: CE, MSE
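The following PyTorch-style sketch (our own simplified illustration, not the original implementation of [49]; semi_supervised_loss and augment are hypothetical names) shows the shared structure of the two methods: a supervised cross-entropy term on labeled images plus an MSE consistency term, where the consistency target is either a second augmented forward pass (π-model) or an exponential moving average of earlier predictions (Temporal Ensembling).

import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labeled, y, x_unlabeled, augment, ema_predictions=None, w=1.0):
    # Supervised part: cross-entropy on the labeled batch.
    sup = F.cross_entropy(model(augment(x_labeled)), y)
    # Unsupervised part: MSE consistency between two predictions for the same unlabeled images.
    p1 = torch.softmax(model(augment(x_unlabeled)), dim=1)
    if ema_predictions is None:
        # pi-model: the target is a second, differently augmented forward pass.
        target = torch.softmax(model(augment(x_unlabeled)), dim=1)
    else:
        # Temporal Ensembling: the target is an exponential moving average of previous predictions.
        target = ema_predictions
    unsup = F.mse_loss(p1, target.detach())
    return sup + w * unsup  # the weight w is usually ramped up over the training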
(a) π-model (b) Temporal Ensembling (c) Mean Teacher (d) UDA

Figure 7: Illustration of four selected one-stage-semi-supervised methods – The used method is given below each
image. The input including label information is given in the blue box on the left side. On the right side, an
illustration of the method is provided. In general, the process is organized from top to bottom. At first, the input
images are preprocessed by none or two different random transformations t. Special augmentation techniques like
Autoaugment [69] are represented by a red box. The following neural network uses these preprocessed images
(x, y) as input. The calculation of the loss (dotted line) is different for each method but shares common parts.
All methods use the cross-entropy (CE) between label and predicted distribution P (·|f (x)) on labeled examples.
Details about the methods can be found in the corresponding entry in section 3 whereas abbreviations for common
methods are defined in subsection 2.2. EMA stands for the exponential moving average.

Mean Teacher

With Mean Teacher, Tarvainen & Valpola present a student-teacher approach for semi-supervised learning [48]. They develop their approach based on the π-model and Temporal Ensembling [49]. Therefore, they also use MSE as a consistency loss between two predictions but create these predictions differently. They argue that Temporal Ensembling incorporates new information too slowly into the predictions. The reason for this is that the exponential moving average (EMA) is only updated once per epoch. Therefore, they propose to use a teacher based on the average weights of the student in each update step. Tarvainen & Valpola show for their model that the KL-divergence is an inferior consistency loss compared to MSE. An illustration of this method is given in Figure 7. Common ideas: CE, MSE

Virtual Adversarial Training (VAT)

VAT [65] is not just the name for a common idea but it is also a one-stage-semi-supervised method. Miyato et al. used a combination of VAT on unlabeled data and CE on labeled data [65]. They showed that the adversarial transformation leads to a lower error on image classification than random transformations. Furthermore, they showed that adding EntMin [61] to the loss increased accuracy even more. Common ideas: CE, (EM), VAT

Interpolation Consistency Training (ICT)

ICT [70] uses linear interpolations of unlabeled data points to regularize the consistency between images. Verma et al. use a combination of the supervised loss CE and the unsupervised loss MSE. The unsupervised loss is measured between the prediction of the interpolation of two images and the interpolation of their Pseudo-Labels. The interpolation is generated with the mixup [66] algorithm from two unlabeled data points. For these unlabeled data points, the Pseudo-Labels are predicted by a Mean Teacher [48] network. Common ideas: CE, MSE, MU, PL
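The ICT consistency term combines two of the ideas above: an EMA teacher (as in Mean Teacher) provides Pseudo-Labels, and mixup interpolates both the unlabeled images and those Pseudo-Labels. A rough PyTorch-style sketch of this term (our own reading of [70], with hypothetical helper names such as update_ema and ict_unsupervised_loss) could look as follows.

import torch
import torch.nn.functional as F

def update_ema(teacher, student, decay=0.99):
    # Mean Teacher: the teacher weights are an exponential moving average of the student weights.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

def ict_unsupervised_loss(student, teacher, u1, u2, alpha=1.0):
    # u1, u2: two batches of unlabeled images
    with torch.no_grad():
        p1 = torch.softmax(teacher(u1), dim=1)  # Pseudo-Labels from the teacher
        p2 = torch.softmax(teacher(u2), dim=1)
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    x_mix = lam * u1 + (1.0 - lam) * u2         # mixup of the unlabeled images
    y_mix = lam * p1 + (1.0 - lam) * p2         # mixup of their Pseudo-Labels
    return F.mse_loss(torch.softmax(student(x_mix), dim=1), y_mix)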
Fast-Stochastic Weight Averaging (fast-SWA)

In contrast to other semi-supervised methods, Athiwaratkun et al. do not change the loss but the optimization algorithm [71]. They analyzed the learning process based on ideas and concepts of SWA [72], π-model [49] and Mean Teacher [48]. Athiwaratkun et al. show that averaging and cycling learning rates are beneficial in semi-supervised learning by stabilizing the training. They call their improved version of SWA fast-SWA due to faster convergence and lower performance variance [71]. The architecture and loss are either copied from the π-model [49] or Mean Teacher [48]. Common ideas: CE, MSE

MixMatch

MixMatch [46] uses a combination of a supervised and an unsupervised loss. Berthelot et al. use CE as the supervised loss and MSE between predictions and generated Pseudo-Labels as their unsupervised loss. These Pseudo-Labels are created from previous predictions of augmented images. They propose a novel sharpening method over multiple predictions to improve the quality of the Pseudo-Labels. This sharpening also implicitly enforces a minimization of the entropy on the unlabeled data. Furthermore, they extend the algorithm mixup [66] to semi-supervised learning by incorporating the generated labels. Common ideas: CE, (EM), MSE, MU, PL

Ensemble AutoEncoding Transformations (EnAET)

EnAET [73] combines the self-supervised pretext task AutoEncoding Transformations [74] with MixMatch [46]. Wang et al. apply spatial transformations, such as translations and rotations, and non-spatial transformations, such as color distortions, on input images in the pretext task. The transformations are then estimated with the original and augmented image given. This is a difference to other pretext tasks where the estimation is often based on the augmented image only [40]. The loss is used together with the loss of MixMatch and is extended with the Kullback-Leibler divergence between the predictions of the original and the augmented image. Common ideas: CE, (EM), KL, MSE, MU, PL, PT

Unsupervised Data Augmentation (UDA)

Xie et al. present with UDA a semi-supervised learning algorithm that concentrates on the usage of state-of-the-art augmentation [16]. They use a supervised and an unsupervised loss. The supervised loss is CE whereas the unsupervised loss is the Kullback-Leibler divergence between output predictions. These output predictions are based on an image and an augmented version of this image. For image classification, they propose to use the augmentation scheme generated by AutoAugment [69] in combination with Cutout [75]. AutoAugment uses reinforcement learning to create useful augmentations automatically. Cutout is an augmentation scheme where randomly selected regions of the image are masked out. Xie et al. show that this combined augmentation method achieves higher performance in comparison to previous methods on their own like Cutout, Cropping, or Flipping. In addition to the different augmentation, they propose to use a variety of other regularization methods. They proposed Training Signal Annealing which restricts the influence of labeled examples during the training process to prevent overfitting. They use EntMin [61] and a kind of Pseudo-Labeling [47]. We use the term kind of Pseudo-Labeling because they do not use the predictions as labels but they use them to filter unsupervised data for outliers. An illustration of this method is given in Figure 7. Common ideas: CE, EM, KL, (PL)

Self-paced Multi-view Co-training (SpamCo)

Ma et al. propose a general framework for co-training across multiple views [76]. In the context of image classification, different neural networks can be used as different views. The main idea of the co-training between different views is similar to using Pseudo-Labels. The main differences in SpamCo are that the Pseudo-Labels are not used for all samples and they influence each other across views. Each unlabeled image has a weight value for each view. Based on an age parameter, more unlabeled images are considered in each iteration. At first only confident Pseudo-Labels are used and over time also less confident ones are allowed. The proposed hard or soft co-regularizers also influence the weighting of the unlabeled images. The regularizers encourage the selection of unlabeled images for training across views. Without this regularization the training would degenerate to an independent training of the different views/models. CE is used as loss on the labels and Pseudo-Labels with additional L2 regularization. Ma et al. show further applications including text classification and object detection. Common ideas: CE, CE*, MSE, PL
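Several of the methods above sharpen averaged predictions into Pseudo-Labels. A short Python sketch of such a temperature-based sharpening step (our own illustration of the operation described for MixMatch [46], not the authors' code) is given below.

import numpy as np

def sharpen(p, T=0.5):
    # p: averaged predicted distribution over C classes for one unlabeled image.
    # Lower temperatures T push the distribution towards a one-hot Pseudo-Label,
    # which implicitly minimizes the entropy of the prediction.
    p_pow = p ** (1.0 / T)
    return p_pow / p_pow.sum()

avg_prediction = np.array([0.5, 0.3, 0.2])  # e.g. the mean prediction over several augmentations
print(sharpen(avg_prediction))              # a more confident distribution used as the target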
(a) MixMatch (b) ReMixMatch (c) FixMatch (d) FOC

Figure 8: Illustration of four selected methods – The used method is given below each image. The input including
label information is given in the blue box on the left side. On the right side, an illustration of the method is pro-
vided. For FOC the second stage is represented. In general, the process is organized from top to bottom. At first,
the input images are preprocessed by none or two different random transformations t. Special augmentation tech-
niques like CTAugment [45] are represented by a red box. The following neural network uses these preprocessed
images (e.g. x, y) as input. The calculation of the loss (dotted line) is different for each method but shares com-
mon parts. All methods use the cross-entropy (CE) between label and predicted distribution P (·|f (x)) on labeled
examples. Details about the methods can be found in the corresponding entry in section 3 whereas abbreviations
for common methods are defined in subsection 2.2.

ReMixMatch

ReMixMatch [45] is an extension of MixMatch with distribution alignment and augmentation anchoring. Berthelot et al. motivate the distribution alignment with an analysis of mutual information. They use entropy minimization via "sharpening" but they do not use any prediction equalization like in mutual information. They argue that an equal distribution is also not desirable since the distribution of the unlabeled data could be skewed. Therefore, they align the predictions of the unlabeled data with a marginal class distribution over the seen examples. Berthelot et al. exchange the augmentation scheme of MixMatch with augmentation anchoring. Instead of averaging the prediction over different slight augmentations of an image they only use stronger augmentations as regularization. All augmented predictions of an image are encouraged to result in the same distribution with CE instead of MSE. Furthermore, a self-supervised loss based on the rotation pretext task [40] was added. Common ideas: CE, CE*, (EM), (MI), MU, PL, PT

FixMatch

FixMatch [26] builds on the ideas of ReMixMatch but drops several ideas to make the framework simpler while achieving a better performance. FixMatch uses the cross-entropy loss on the supervised and the unsupervised data. For each image in the unlabeled data, one weakly- and one strongly-augmented version is created. The Pseudo-Label of the weakly-augmented version is used if a confidence threshold is surpassed by the network. If a Pseudo-Label is calculated, the network output of the strongly-augmented version is compared with this hard label via cross-entropy, which implicitly encourages low-entropy predictions on the unlabeled data [26]. Sohn et al. do not use ideas like Mixup, VAT, or distribution alignment but they state that they can be used and provide ablations for some of these extensions. Common ideas: CE, CE*, (EM), PL
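The core FixMatch step can be summarized in a few lines. The PyTorch-style sketch below (our own condensed reading of [26], not the reference implementation; weak_augment and strong_augment are hypothetical augmentation functions) computes the unsupervised loss with a fixed confidence threshold.

import torch
import torch.nn.functional as F

def fixmatch_unsupervised_loss(model, x_unlabeled, weak_augment, strong_augment, threshold=0.95):
    with torch.no_grad():
        weak_probs = torch.softmax(model(weak_augment(x_unlabeled)), dim=1)
        confidence, pseudo_label = weak_probs.max(dim=1)  # hard Pseudo-Label per image
        mask = (confidence >= threshold).float()          # keep only confident predictions
    strong_logits = model(strong_augment(x_unlabeled))
    loss = F.cross_entropy(strong_logits, pseudo_label, reduction="none")
    return (loss * mask).mean()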
(a) AMDIM (b) CPC (c) DeepCluster (d) IIC

Figure 9: Illustration of four selected multi-stage-semi-supervised methods – The used method is given below
each image. The input is given in the red box on the left side. On the right side, an illustration of the method
is provided. The fine-tuning part is excluded and only the first stage/pretext task is represented. In general, the
process is organized from top to bottom. At first, the input images are either preprocessed by one or two random
transformations t or are split up. The following neural network uses these preprocessed images (x, y) as input. The
calculation of the loss (dotted line) is different for each method. AMDIM and CPC use internal elements of the
network to calculate the loss. DeepCluster and IIC use the predicted output distributions (P (·|f (x)), P (·|f (y)))
to calculate a loss. Details about the methods can be found in the corresponding entry in section 3 whereas
abbreviations for common methods are defined in subsection 2.2.

3.2. Multi-Stage-Semi-Supervised

Exemplar

Dosovitskiy et al. proposed a self-supervised pretext task with additional fine-tuning [68]. They randomly sample patches from different images and augment these patches heavily. Augmentations can be for example rotations, translations, color changes, or contrast adjustments. The classification task is to map all augmented versions of a patch to the correct original patch using cross-entropy loss. Common ideas: CE, CE*, PT

Context

Doersch et al. propose to use context prediction as a pretext task for visual representation learning [42]. A central patch and an adjacent patch from an image are used as input. The task is to predict one of the 8 possible relative positions of the second patch to the first one using cross-entropy loss. An illustration of the pretext task is given in Figure 6. Doersch et al. argue that this task becomes easier if you recognize the content of these patches. The authors fine-tune their representations for other tasks and show their superiority in comparison to the random initialization. Aside from fine-tuning, Doersch et al. show how their method could be used for Visual Data Mining. Common ideas: CE, CE*, PT

Jigsaw

Noroozi and Favaro propose to solve Jigsaw puzzles as a pretext task [43]. The idea is that a network has to understand the concept of a presented object to solve the puzzle using the classification loss cross-entropy. They prevent simple solutions that only look at edges or corners by including small random margins between the puzzle patches. They fine-tune on supervised data for image classification tasks. Noroozi et al. extended the Jigsaw task by adding image parts of a different image [44]. They call the extension Jigsaw++. Examples for a Jigsaw or Jigsaw++ puzzle are given in Figure 6. Common ideas: CE, CE*, PT

DeepCluster

DeepCluster [67] is a self-supervised method that generates labels by k-means clustering. Caron et al. iterate between clustering of predicted labels to generate Pseudo-Labels and training with cross-entropy on these labels. They show that it is beneficial to use overclustering in the pretext task. After the pretext task, they fine-tune the network on all labels. An illustration of this method is given in Figure 9. Common ideas: CE, OC, PL, PT
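The alternation at the heart of DeepCluster can be sketched in a few lines of Python (our own simplified illustration, not the original implementation of [67]; extract_features and train_classification are hypothetical helpers).

from sklearn.cluster import KMeans

def deepcluster_epoch(model, images, n_clusters, extract_features, train_classification):
    # 1. Compute features for all unlabeled images with the current network.
    features = extract_features(model, images)
    # 2. Cluster the features; using more clusters than real classes (overclustering) is beneficial.
    pseudo_labels = KMeans(n_clusters=n_clusters).fit_predict(features)
    # 3. Train the network with cross-entropy on the generated Pseudo-Labels.
    train_classification(model, images, pseudo_labels)
    return pseudo_labels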
numbers of rotations but four rotations score the best Augmented Multiscale Deep InfoMax (AMDIM)
result. For image classification, they fine-tune on la-
AMDIM [78] maximizes the MI between inputs and
beled data. Common ideas: CE, CE*, PT
Contrastive Predictive Coding (CPC)

CPC [55, 56] is a self-supervised method that predicts representations of local image regions based on previous image regions. The authors determine the quality of these predictions with a contrastive loss which identifies the correct prediction out of randomly sampled negative ones. They call their loss InfoNCE which is cross-entropy for the prediction of positive examples [55]. Van den Oord et al. showed that minimizing InfoNCE maximizes the lower bound for MI between the previous image regions and the predicted image region [55]. An illustration of this method is given in Figure 9. The representations of the pretext task are then fine-tuned. Common ideas: CE, (CE*), CL, (MI), PT

Contrastive Multiview Coding (CMC)

CMC [54] generalizes CPC [55] to an arbitrary collection of views. Tian et al. try to learn an embedding that is different for contrastive samples and equal for similar images. Like Oord et al. they train their network by identifying the correct prediction out of multiple negative ones [55]. However, Tian et al. take different views of the same image such as color channels, depth, and segmentation as similar images. For common image classification datasets like STL-10, they use patch-based similarity. After this pretext task, the representations are fine-tuned to the desired dataset. Common ideas: CE, (CE*), CL, (MI), PT

Deep InfoMax (DIM)

DIM [77] maximizes the MI between local input regions and output representations. Hjelm et al. show that maximizing over local input regions rather than the complete image is beneficial for image classification. Also, they use a discriminator to match the output representations to a given prior distribution. In the end, they fine-tune the network with an additional small fully-connected neural network. Common ideas: CE, MI, PT

Augmented Multiscale Deep InfoMax (AMDIM)

AMDIM [78] maximizes the MI between inputs and outputs of a network. It is an extension of the method DIM [77]. DIM usually maximizes MI between local regions of an image and a representation of the image. AMDIM extends the idea of DIM in several ways. Firstly, the authors sample the local regions and representations from different augmentations of the same source image. Secondly, they maximize MI between multiple scales of the local region and the representation. They use a more powerful encoder and define mixture-based representations to achieve higher accuracies. Bachman et al. fine-tune the representations on labeled data to measure their quality. An illustration of this method is given in Figure 9. Common ideas: CE, MI, PT

Deep Metric Transfer (DMT)

DMT [79] learns a metric as a pretext task and then propagates labels onto unlabeled data with this metric. Liu et al. use self-supervised image colorization [80] or unsupervised instance discrimination [81] to calculate a metric. In the second stage, they propagate labels to unlabeled data with spectral clustering and then fine-tune the network with the new Pseudo-Labels. Additionally, they show that their approach is complementary to previous methods. If they use the most confident Pseudo-Labels for methods such as Mean Teacher [48] or VAT [65], they can improve the accuracy with very few labels by about 30%. Common ideas: CE, CE*, PL, PT
Invariant Information Clustering (IIC)

IIC [14] maximizes the MI between augmented views of an image. The idea is that images should belong to the same class regardless of the augmentation. The augmentation has to be a transformation to which the neural network should be invariant. The authors do not maximize directly over the output distributions but over the class distribution which is approximated for every batch. Ji et al. use auxiliary overclustering on a different output head to increase their performance in the unsupervised case. This idea allows the network to learn subclasses and handle noisy data. Ji et al. use Sobel filtered images as input instead of the original RGB images. Additionally, they show how to extend IIC to image segmentation. Up to this point, the method is completely unsupervised. To be comparable to other semi-supervised methods they fine-tune their models on a subset of available labels. An illustration of this method is given in Figure 9. The first unsupervised stage can be seen as a self-supervised pretext task. In contrast to other pretext tasks, this task already predicts representations which can be seen as classifications. Common ideas: CE, MI, OC, PT
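A compact sketch of such a mutual information objective over the cluster predictions of two augmented views could look as follows (a simplified PyTorch-style version; the symmetrization and the clamping constant are implementation details assumed here):

```python
import torch

def mutual_information_clustering_loss(p_x, p_y, eps=1e-8):
    """p_x, p_y: softmax cluster predictions of two views, shape (batch, n_clusters).

    Returns the negative mutual information of the approximated joint distribution,
    so minimizing this loss maximizes the MI between the two views.
    """
    # Approximate the joint distribution over the batch and symmetrize it.
    joint = p_x.t() @ p_y / p_x.size(0)          # (n_clusters, n_clusters)
    joint = (joint + joint.t()) / 2
    joint = joint.clamp(min=eps)

    marg_x = joint.sum(dim=1, keepdim=True)       # marginal of the first view
    marg_y = joint.sum(dim=0, keepdim=True)       # marginal of the second view

    mi = (joint * (joint.log() - marg_x.log() - marg_y.log())).sum()
    return -mi
```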
Self-Supervised Semi-Supervised Learning (S4L)

S4L [15] is, as the name suggests, a combination of self-supervised and semi-supervised methods. Zhai et al. split the loss into a supervised and an unsupervised part. The supervised loss is CE whereas the unsupervised loss is based on the self-supervised techniques using rotation and exemplar prediction [40, 68]. The authors show that their method performs better than other self-supervised and semi-supervised techniques [68, 40, 65, 61, 47]. In their Mix Of All Models (MOAM) they combine self-supervised rotation prediction, VAT, entropy minimization, Pseudo-Labels, and fine-tuning into a single model with multiple training steps. Since we discuss the results of their MOAM we identify S4L as a multi-stage-semi-supervised method. Common ideas: CE, CE*, EM, PL, PT, VAT

Simple Framework for Contrastive Learning of Visual Representation (SimCLR)

SimCLR [25] maximizes the agreement between two different augmentations of the same image. The method is similar to CPC [55] and IIC [14]. In comparison to CPC, Chen et al. do not use the different inner representations. Contrary to IIC they use normalized temperature-scaled cross-entropy (NT-Xent) as their loss. Based on the cosine similarity of the predictions, NT-Xent measures whether positive pairs are similar and negative pairs are dissimilar. Augmented versions of the same image are treated as positive pairs and pairs with any other image as negative pairs. The system is trained with large batch sizes of up to 8192 instead of a memory bank to create enough negative examples. Common ideas: CE, (CE*), CL, PT

Fuzzy Overclustering (FOC)

Fuzzy Overclustering [27] is an extension of IIC [14]. FOC focuses on using overclustering to subdivide fuzzy labels in real-world datasets. Therefore, it unifies the used data and losses proposed by IIC between the different stages and extends it with new ideas such as the novel loss Inverse Cross-Entropy (CE−1). This loss is inspired by Cross-Entropy but can be used on the overclustering results of the network where no ground truth labels are known. FOC is not achieving state-of-the-art results on a common image classification dataset. However, on a real-world plankton dataset with fuzzy labels, it surpasses FixMatch and shows that 5-10% more consistent predictions can be achieved. Like IIC, FOC can be viewed as a multi-stage-semi-supervised and a one-stage-unsupervised method. In general, FOC is trained in one unsupervised and one semi-supervised stage and can be seen as a multi-stage-semi-supervised method. Like IIC, it produces classifications already in the unsupervised stage and can therefore also be seen as a one-stage-unsupervised method. Common ideas: CE, (CE*), MI, OC, PT

Momentum Contrast (MoCo)

He et al. propose to use a momentum encoder for contrastive learning [82]. In other methods [25, 57, 55, 56], the negative examples for the contrastive loss are sampled from the same mini-batch as the positive pair. A large batch size is needed to ensure a great variety of negative examples. He et al. sample their negative examples from a queue encoded by another network whose weights are updated with an exponential moving average of the main network. They solve the pretext task proposed by [81] with negative examples sampled from their queue and fine-tune in a second stage on labeled data. Chen et al. provide further ablations and baselines for the MoCo framework, e.g. by using an MLP head for fine-tuning [83]. Common ideas: CE, CL, PT
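As an illustration of the contrastive losses used by CPC, SimCLR and MoCo, a condensed sketch of an NT-Xent-style loss for a batch of positive pairs could look as follows (a simplified PyTorch-style sketch that omits details such as the projection head, memory queues and large-batch handling; the temperature is an assumed value):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: embeddings of two augmented views of the same images, shape (batch, dim)."""
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2*batch, dim)
    sim = z @ z.t() / temperature                            # cosine similarities

    # Mask out self-similarities so an example cannot be its own negative.
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))

    # The positive for index i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(0, batch)]).to(z.device)
    return F.cross_entropy(sim, targets)
```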

(a) SimCLR (b) SimCLRv2 (c) MoCo (d) BYOL

Figure 10: Illustration of four selected multi-stage-semi-supervised methods – The used method is given below
each image. The input is given in the red (not using labels) or blue (using labels) box on the left side. On the right
side, an illustration of the method is provided. The fine-tuning part is excluded and only the first stage/pretext
task is represented. For SimCLRv2 the second stage or distillation step is illustrated. In general, the process is
organized from top to bottom. At first, the input images are either preprocessed by one or two random transforma-
tions t or are split up. The following neural network uses these preprocessed images (x, y) as input. Details about
the methods can be found in the corresponding entry in section 3 whereas abbreviations for common methods are
defined in subsection 2.2. EMA stands for the exponential moving average.

Bootstrap your own latent (BYOL)

Grill et al. use an online and a target network. In the proposed pretext task, the online network predicts the image representation of the target network for an image [28]. The difference between the predictions is measured with MSE. Normally, this approach would lead to a degeneration of the network as a constant prediction over all images would also achieve the goal. In contrastive learning, this degeneration is avoided by selecting a positive pair of examples from multiple negative ones [25, 57, 55, 56, 82, 83]. By using a slow-moving average of the weights between the online and target network, Grill et al. show empirically that the degeneration to a constant prediction can be avoided. This approach has the positive effect that BYOL performance depends less on hyperparameters like augmentation and batch size [28]. In a follow-up work, Richemond et al. show that BYOL even works without batch normalization, which might have introduced a kind of contrastive learning effect within the batches [84]. Common ideas: MSE, PT

Simple Framework for Contrastive Learning of Visual Representation (SimCLRv2)

Chen et al. extend the framework SimCLR by using larger and deeper networks and by incorporating the memory mechanism from MoCo [57]. Moreover, they propose to use this framework in three steps. The first is training a contrastive learning pretext task with a deep neural network and the SimCLRv2 method. The second step is fine-tuning this large network with a small amount of labeled data. The third step is self-training or distillation. The large pretrained network is used to predict Pseudo-Labels on the complete (unlabeled) data. These (soft) Pseudo-Labels are then used to train a smaller neural network with CE. The distillation step could also be performed on the same network as in the pretext task. Chen et al. show that even this self-distillation leads to performance improvements [57]. Common ideas: CE, (CE*), CL, PL, PT
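Both MoCo and BYOL maintain a second network whose weights follow the trained network with an exponential moving average instead of being updated by gradients. A minimal sketch of this update could look as follows (the momentum value is a typical choice and an assumption of this sketch; buffers such as batch normalization statistics are ignored here):

```python
import torch

@torch.no_grad()
def update_target_network(online_net, target_net, momentum=0.99):
    """Exponential moving average of the online weights into the target network."""
    for online_param, target_param in zip(online_net.parameters(),
                                          target_net.parameters()):
        target_param.mul_(momentum).add_(online_param, alpha=1.0 - momentum)
```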

3.3. One-Stage-Unsupervised

Deep Adaptive Image Clustering (DAC)

DAC [50] reformulates unsupervised clustering as a pairwise classification. Similar to the idea of Pseudo-Labels, Chang et al. predict clusters and use these to retrain the network. The twist is that they calculate the cosine distance between all cluster predictions. This distance is used to determine whether the input images are similar or dissimilar with a given certainty. The network is then trained with binary CE on these certain similar and dissimilar input images. One can interpret these similarities and dissimilarities as Pseudo-Labels for the similarity classification task. During the training process, they lower the needed certainty to include more images. As input Chang et al. use a combination of RGB and extracted HOG features. Common ideas: PL

Information Maximizing Self-Augmented Training (IMSAT)

IMSAT [85] maximizes MI between the input and output of the model. As a consistency regularization, Hu et al. use CE between an image prediction and an augmented image prediction. They show that the best augmentation of the prediction can be calculated with VAT [65]. The maximization of MI directly on the image input leads to a problem. For datasets like CIFAR-10, CIFAR-100 [86] and STL-10 [52] the color information is too dominant in comparison to the actual content or shape. As a workaround, Hu et al. use the features generated by a pretrained CNN on ImageNet [1] as input. Common ideas: MI, VAT

Invariant Information Clustering (IIC)

IIC [14] is described above as a multi-stage-semi-supervised method. In comparison to other presented methods, IIC creates usable classifications without fine-tuning the model on labeled data. The reason for this is that the pretext task is constructed in such a way that label predictions can be extracted directly from the model. This leads to the conclusion that IIC can also be interpreted as an unsupervised learning method. Common ideas: MI, OC

Fuzzy Overclustering (FOC)

FOC [27] is described above as a multi-stage-semi-supervised method. Like IIC, FOC can also be seen as a one-stage-unsupervised method because the first stage yields cluster predictions. Common ideas: MI, OC

Semantic Clustering by Adopting Nearest Neighbors (SCAN)

Gansbeke et al. calculate clustering assignments building on a self-supervised pretext task by mining the nearest neighbors and using self-labeling. They propose to use SimCLR [25] as a pretext task but show that other pretext tasks [81, 40] could also be used for this step. For each sample, the k nearest neighbors are selected in the gained feature space. The novel semantic clustering loss encourages these samples to be in the same cluster. Gansbeke et al. noticed that the wrong nearest neighbors have a lower confidence and propose to create Pseudo-Labels on only confident examples for further fine-tuning. They also show that Overclustering can be successfully used if the number of clusters is not known before. Common ideas: OC, PL, PT
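A condensed sketch of such a semantic clustering loss for a sample and one of its mined nearest neighbors could look as follows (following the publicly known formulation; the entropy weight is an assumed hyperparameter of this sketch):

```python
import torch

def semantic_clustering_loss(prob_anchor, prob_neighbor, entropy_weight=5.0, eps=1e-8):
    """prob_anchor, prob_neighbor: softmax outputs for a sample and one of its
    mined nearest neighbors, shape (batch, n_clusters)."""
    # Consistency: anchor and neighbor should end up in the same cluster.
    agreement = (prob_anchor * prob_neighbor).sum(dim=1).clamp(min=eps)
    consistency = -agreement.log().mean()

    # Entropy of the mean prediction avoids collapsing onto a single cluster.
    mean_prob = prob_anchor.mean(dim=0).clamp(min=eps)
    entropy = -(mean_prob * mean_prob.log()).sum()

    return consistency - entropy_weight * entropy
```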
4. Analysis

In this chapter, we will analyze which common ideas are shared or differ between methods. We will compare the performance of all methods with each other on common deep learning datasets.

4.1. Datasets

In this survey, we compare the presented methods on a variety of datasets. We selected four datasets that were used in multiple papers to allow a fair comparison. An overview of example images is given in Figure 11.

CIFAR-10 and CIFAR-100

are large datasets of tiny color images with size 32x32 [86]. Both datasets contain 60,000 images belonging to 10 or 100 classes respectively. The 100 classes in CIFAR-100 can be combined into 20 superclasses. Both sets provide 50,000 training examples and 10,000 validation examples (image + label). The presented results are only trained with 4,000 labels for CIFAR-10 and 10,000 labels for CIFAR-100 to represent a semi-supervised case. If a method uses all labels this is marked independently.

STL-10

is a dataset designed for unsupervised and semi-supervised learning [52]. The dataset is inspired by CIFAR-10 [86] but provides fewer labels. It only consists of 5,000 training labels and 8,000 validation labels. However, 100,000 unlabeled example images are also provided. These unlabeled examples belong to the training classes and some different classes. The images are 96x96 color images and were acquired in combination with their labels from ImageNet [1].
(a) CIFAR-10 (b) STL-10 (c) ILSVRC-2012

Figure 11: Examples of four random cats in the different datasets to illustrate the difference in quality

ILSVRC-2012

is a subset of ImageNet [1]. The training set consists of 1.2 million images whereas the validation and the test set include 150,000 images. These images belong to 1000 object categories. Due to this large number of categories, it is common to report Top-5 and Top-1 accuracy. Top-1 accuracy is the classical accuracy where one prediction is compared to one ground-truth label. Top-5 accuracy checks if a ground truth label is in a set of at most five predictions. For further details on accuracy see subsection 4.2. The presented results are only trained with 10% of labels to represent a semi-supervised case. If a method uses all labels this is marked independently.

4.2. Evaluation metrics

We compare the performance of all methods based on their classification score. This score is defined differently for unsupervised and all other settings. We follow standard protocol and use the classification accuracy in most cases. For unsupervised learning, we use cluster accuracy because we need to handle the missing labels during the training. We need to find the best one-to-one permutation ($\sigma$) from the network cluster predictions to the ground-truth classes. For $N$ images $x_1, \dots, x_N \in X_l$ with labels $z_{x_i}$ and predictions $f(x_i) \in \mathbb{R}^C$ the accuracy is defined in Equation 7 whereas the cluster accuracy is defined in Equation 8.

$$ACC(x_1, \dots, x_N) = \frac{\sum_{i=1}^{N} \mathbb{1}_{z_{x_i} = \operatorname{argmax}_{1 \leq j \leq C} f(x_i)_j}}{N} \qquad (7)$$

$$ACC(x_1, \dots, x_N) = \max_{\sigma} \frac{\sum_{i=1}^{N} \mathbb{1}_{z_{x_i} = \sigma(\operatorname{argmax}_{1 \leq j \leq C} f(x_i)_j)}}{N} \qquad (8)$$
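In practice, the maximization over the permutations σ in Equation 8 is commonly solved with the Hungarian algorithm on the confusion matrix between predicted clusters and ground-truth classes. A small sketch could look as follows (using scipy; this mirrors the common evaluation protocol and assumes that the number of clusters equals the number of classes, it is not taken from a specific implementation of the compared papers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred, n_classes):
    """y_true, y_pred: integer arrays of shape (N,) with class labels / cluster ids."""
    # Confusion matrix between predicted clusters and ground-truth classes.
    confusion = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        confusion[p, t] += 1

    # One-to-one mapping of clusters to classes that maximizes the overlap.
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(y_true)
```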


4.3. Comparison of methods

In this subsection, we will compare the methods concerning their used common ideas and performance. We will summarize the presented results and discuss the underlying trends in the next subsection.

Comparison concerning used common ideas

In Table 1 we present all methods and their used common ideas. Following our definition of common ideas in subsection 2.2, we evaluate only ideas that were used frequently in different papers. Special details such as the different optimizer for fast-SWA or the used approximation for MI are excluded. Please see section 3 for further details.

One might expect that common ideas are used equally between methods and training strategies. We rather see a tendency that common ideas differ between training strategies. We will step through all common ideas based on the significance of differentiating the training strategies.

Table 1: Overview of the methods and their used common ideas — On the left-hand side, the reviewed methods
from section 3 are sorted by the training strategy. The top row lists the common ideas. Details about the ideas
and their abbreviations are given in subsection 2.2. The last column and some rows sum up the usage of ideas per
method or training strategy. Legend: (X) The idea is only used indirectly. The individual explanations are given
in section 3.

Overall
CE CE* EM CL KL MSE MU MI OC PT PL VAT
Sum
One-Stage-Semi-Supervised
Pseudo-Labels [47] X X X 3
π model [49] X X 2
Temporal Ensembling [49] X X 2
Mean Teacher [48] X X 2
VAT [65] X X 2
VAT + EntMin [65] X X X 3
ICT [70] X X X X 4
fast-SWA [71] X X 2
MixMatch [46] X (X) X X X 5
EnAET [73] X (X) X X X AET X 7
UDA [16] X X X (X) 4
SPamCO [76] X X X X 4
ReMixMatch [45] X X (X) X (X) Rotation X 7
FixMatch [26] X X (X) X 4
Sum 14 4 6 0 2 8 4 1 0 2 8 2 47
Multi-Stage-Semi-Supervised
Exemplar [68] X X Augmentation 3
Context [42] X X Context 3
Jigsaw [43] X X Jigsaw 3
DeepCluster [67] X X X Clustering X 5
Rotation [40] X X Rotation 3
CPC [55, 56] X (X) X (X) CL 5
CMC [54] X (X) X (X) CL 5
DIM [77] X X MI 3
AMDIM [78] X X MI 3
DMT [79] X X X Metric X 5
IIC [14] X X X MI 4
S4 L [15] X X X Rotation X X 6
SimCLR [25] X (X) CL 3
MoCo [82] X X Metric 3
BYOL [28] X X Bootstrap 3
FOC [27] X (X) X X MI 5
SimCLRv2 [57] X (X) X CL X 5
Sum 17 11 1 5 0 1 0 6 3 17 4 1 66
One-Stage-Unsupervised
DAC [50] X 1
IMSAT [85] X X 2
IIC [14] X X MI 3
FOC [27] X X MI 3
SCAN [41] X CL X 3
Sum 0 0 0 0 0 0 0 3 3 3 2 1 12
Overall Sum 31 54 7 5 2 9 4 10 6 22 14 4 125

Table 2: Overview of the reported accuracies — The first column states the used method. For the supervised
baseline, we used the best-reported results which were considered as baselines in the referenced papers. The
original paper is given in brackets after the score. The architecture is given in the second column. The last four
columns report the Top-1 accuracy score in % for the respective dataset (See subsection 4.2 for further details). If
the results are not reported in the original paper, the reference is given after the result. A blank entry represents
the fact that no result was reported. Be aware that different architectures and frameworks are used which might
impact the results. Please see subsection 4.3 for a detailed explanation. Legend: † 100% of the labels are used
instead of the default value defined in subsection 4.1. ‡ Multilayer perceptron is used for fine-tuning instead
of one fully connected layer. Remarks on special architectures and evaluations: 1 Architecture includes Shake-
Shake regularization. 2 Network uses wider hidden layers. 3 Method uses ten random classes out of the default
1000 classes. 4 Network only predicts 20 superclasses instead of the default 100 classes. 5 Inputs are pretrained
ImageNet features. 6 Method uses different copies of the network for each input. 7 The network uses selective
kernels [87].

Architecture Publication CIFAR-10 CIFAR-100 STL-10 ILSVRC-2012 ILSVRC-2012 (Top-5)


Supervised (100% labels) Best reported - 98.01[73] 79.82[78] 68.7 [77] 85.7 [88] 97.6 [88]
One-Stage-Semi-Supervised
Pseudo-Label [47] ResNet50v2 [2] 2013 82.41 [15]
π model [49] CONV-13 2017 87.64
Temporal Ensembling [49] CONV-13 2017 87.84
Mean Teacher [48] CONV-13 2017 87.69
Mean Teacher [48] Wide ResNet-28 2017 89.64 90.9[57]
VAT [65] CONV-13 2018 88.64
VAT [65] ResNet50v2 2018 82.78 [15]
VAT + EntMin [65] CONV-13 2018 89.45
VAT + EntMin[65] ResNet50v2 2018 86.41 [15] 83.3 [15]
ICT [70] Wide ResNet-28 2019 92.34
ICT [70] CONV-13 2019 92.71
fast-SWA [71] CONV-13 2019 90.95 66.38
fast-SWA [71] ResNet-261 2019 93.72
MixMatch [46] Wide ResNet-28 2019 95.05 74.12 94.41
EnAET [73] Wide ResNet-28 2019 94.65 73.07 95.48
UDA [16] Wide ResNet-28 2019 94.7 68.66 88.52
SPamCo [76] Wide ResNet-28 2020 92.95
ReMixMatch [45] Wide ResNet-28 2020 94.86 76.97[26]
FixMatch [26] Wide ResNet-28 2020 95.74 77.40
FixMatch [26] ResNet-50 2020 71.46 89.13
Multi-Stage-Semi-Supervised
Exemplar [68] ResNet50 2015 46.0† [89] 81.01 [15]
Context [42] ResNet50 2015 51.4† [89]
Jigsaw [43] AlexNet 2016 44.6† [89]
DeepCluster [67] AlexNet 2018 73.4 [14] 41†
Rotation [40] AlexNet 2018 55.4† [89]
Rotation [40] ResNet50v2 2018 78.53 [15]
CPC [56] ResNet-170 2020 77.45† [77] 77.81† [77] 61.0 84.88
CMC [54] AlexNet 2019 86.88‡
CMC [54] ResNet-506 2019 70.6 89.7?
DIM [77] AlexNet 2019 72.57‡
DIM [77] GAN Discriminator 2019 75.21†‡ 49.74†‡
AMDIM [78] ResNet18 2019 91.3† / 93.6†‡ 70.2† / 73.8†‡ 93.6 / 93.8‡ 60.2† / 60.9†‡
DMT [79] Wide ResNet-28 2019 88.70
IIC [14] ResNet34 2019 85.76 [27] / 88.8‡
S4 L [15] ResNet50v22 2019 73.21 91.23?
SimCLR [25] ResNet50v22 2020 74.4 [57] / 76.5† 92.6 / 93.2†
MOCO [82] ResNet502 2020 68.6
MOCO [82] ResNet50 2020 60.6† / 71.1†‡ [83]
BYOL [28] ResNet2002 2020 77.7 93.7
FOC [27] ResNet34 2020 86.49
SimCLRv2 [57] ResNet-1522,7 2020 80.9‡ 95.5‡
One-Stage-Unsupervised
DAC [50] All-ConvNet 2017 52.18 23.75 46.99 52.723
IMSAT [85] Autoencoder5 2017 45.6 27.5 94.1
IIC [14] ResNet34 2019 61.7 25.74 59.6
FOC [27] ResNet34 2020 60.45
SCAN [41] ResNet18 2020 88.3 50.74 80.9
A major separation between the training strategies can be based on CE and pretext tasks. All one-stage-semi-supervised methods use a cross-entropy loss during training whereas only two use additional losses based on pretext tasks. All multi-stage-semi-supervised methods use a pretext task and use CE for fine-tuning. All one-stage-unsupervised methods use no CE and often use a pretext task. Due to our definition of the training strategies this grouping is expected.

However, further clusters of the common ideas are visible. We notice that some common ideas are (almost) solely used by one of the two semi-supervised strategies. These common ideas are EM, KL, MSE, and MU for one-stage-semi-supervised methods and CL, MI, and OC for multi-stage-semi-supervised methods. We hypothesize that this shared and different usage of ideas exists due to the different usage of unlabeled data. For example, one-stage-semi-supervised methods use the unlabeled and labeled data in the same stage and therefore might need to regularize the training with MSE.

If we compare multi-stage-semi-supervised and one-stage-unsupervised training we notice that MI, OC, and PT are often used in both. All three of them are not often used with one-stage-semi-supervised training as stated above. We hypothesize that this similarity arises because most multi-stage-semi-supervised methods have an unsupervised stage followed by a supervised stage. For the method IIC the authors even proposed to fine-tune the unsupervised method to surpass purely supervised results. CE*, PL, and VAT are used in several different methods. Due to their simple and complementary idea, they can be used in a variety of different methods. UDA for example uses PL to filter the unlabeled data for useful images. CE* seems to be more often used by multi-stage-semi-supervised methods. The parentheses in Table 1 indicate that they often also motivate another idea like CE−1 [27] or the CL loss [25, 55]. All in all, we see that the defined training strategies share common ideas inside each strategy and differ in the usage of ideas between them. We conclude that the definition of the training strategies is not only logical but is also supported by their usage of common ideas.

Comparison concerning performance

We compare the performance of the different methods based on their respective reported results or cross-references in other papers. For better comparability, we would have liked to recreate every method in a unified setup but this was not feasible. Whereas using reported values might be the only possible approach, it leads to drawbacks in the analysis.

Kolesnikov et al. showed that changes in the architecture can lead to significant performance boosts or drops [89]. They state that 'neither [...] the ranking of architectures [is] consistent across different methods, nor is the ranking of methods consistent across architectures' [89]. Most methods try to achieve comparability with previous ones by a similar setup but over time small differences still aggregate and lead to a variety of used architectures. Some methods use only early convolutional networks such as AlexNet [1] but others use more modern architectures like the Wide ResNet architecture [90] or Shake-Shake regularization [91].

Oliver et al. proposed guidelines to ensure more comparable evaluations in semi-supervised learning [92]. They showed that not following these guidelines may lead to changes in the performance [92]. Whereas some methods try to follow these guidelines, we cannot guarantee that all methods do so. This impacts comparability further. Considering the above-mentioned limitations, we do not focus on small differences but look for general trends and specialties instead.

Table 2 shows the collected results for all presented methods. We also provide results for the respective supervised baselines reported by the authors. To keep fair comparability we did not add state-of-the-art baselines with more complex architectures. Table 3 shows the results for even fewer labels as normally defined in subsection 4.1.

In general, the used architectures become more complex and the accuracies rise over time. This behavior is expected as new results are often improvements of earlier works. The changes in architecture may have led to these improvements. However, many papers include ablation studies and comparisons to only supervised methods to show the impact of their method. We believe that a combination of more modern architectures and more advanced methods leads to improvements.

For the CIFAR-10 dataset, almost all multi- or one-stage-semi-supervised methods reach about or over 90% accuracy. The best methods MixMatch and FixMatch reach an accuracy of more than 95% and are roughly three percent worse than the fully supervised baseline. For the CIFAR-100 dataset, fewer results are reported. FixMatch is with about 77% on this dataset the best method in comparison to the fully supervised baseline of about 80%. Newer methods also provide results for 1000 or even 250 labels instead of 4000 labels. Especially EnAET, ReMixMatch, and FixMatch stick out since they achieve only 1-2% worse results with 250 labels instead of with 4000 labels.

For the STL-10 dataset, most methods report a better result than the supervised baseline. These results are possible due to the unlabeled part of the dataset. The unlabeled data can only be utilized by semi-, self-, or unsupervised methods. EnAET achieves the best results with more than 95%. FixMatch reports an accuracy of nearly 95% with only 1000 labels. This is more than most methods achieve with 5000 labels.

The ILSVRC-2012 dataset is the most difficult dataset based on the reported Top-1 accuracies. Most methods only achieve a Top-1 accuracy which is roughly 20% worse than the reported supervised baseline with around 86%. Only the methods SimCLR, BYOL, and SimCLRv2 achieve an accuracy that is less than 10% worse than the baseline. SimCLRv2 achieves the best accuracy with a Top-1 accuracy of 80.9% and a Top-5 accuracy of around 96%. For fewer labels also SimCLR, BYOL and SimCLRv2 achieve the best results.

The unsupervised methods are separated from the supervised baseline by a clear margin of up to 10%. SCAN achieves the best results in comparison to the other methods as it builds on the strong pretext task of SimCLR. This also illustrates the reason for including the unsupervised methods in a comparison with semi-supervised methods. Unsupervised methods do not use labeled examples and therefore are expected to be worse. However, the data show that the gap of 10% is not large and that unsupervised methods can benefit from ideas of self-supervised learning. Some papers report results for even fewer labels as shown in Table 3 which closes the gap to unsupervised learning further. IMSAT reports an accuracy of about 94% on STL-10. Since IMSAT uses pretrained ImageNet features, a superset of STL-10, the results are not directly comparable.

4.4. Discussion

In this subsection, we discuss the presented results of the previous subsection. We divide our discussion into three major trends that we identified. All these trends lead to possible future research opportunities.

1. Trend: Real World Applications?

Previous methods were not scalable to real-world images and applications and used workarounds, e.g. extracted features [85], to process real-world images. Many methods can report a result of over 90% on CIFAR-10, a simple low-resolution dataset. Only five methods can achieve a Top-5 accuracy of over 90% on ILSVRC-2012, a high-resolution dataset. We conclude that most methods are not scalable to high-resolution and complex image classification problems. However, the best-reported methods like FixMatch and SimCLRv2 seem to have surpassed the point of only scientific usage and could be applied to real-world classification tasks.

This conclusion applies to real-world image classification tasks with balanced and clearly separated classes. This conclusion also implicates which real-world issues need to be solved in future research. Class imbalance [93, 94] or noisy labels [95, 27] are not treated by the presented methods. Datasets with few unlabeled data points are also not considered. We see that good performance on well-structured datasets does not always transfer completely to real-world datasets [27]. We assume that these issues arise due to assumptions that do not hold on real-world datasets, like a clear distinction between datapoints [27], and non-robust hyperparameters like augmentations and batch size [28]. Future research has to address these issues so that reduced supervised learning methods can be applied to any real-world dataset.

2. Trend: How much supervision is needed?

We see that the gap between reduced supervised and supervised methods is shrinking. For CIFAR-10, CIFAR-100 and ILSVRC-2012 we have a gap of less than 5% left between total supervised and reduced supervised learning.

Table 3: Overview of the reported accuracies with fewer labels - The first column states the used method. The last
seven columns report the Top-1 accuracy score in % for the respective dataset and amount of labels. The number
is either given as an absolute number or in percent. A blank entry represents the fact that no result was reported.

CIFAR-10 STL-10 ILSVRC-2012 ILSVRC-2012 (Top-5)


4000 1000 250 5000 1000 10% 1% 10% 1%
One-Stage-Semi-Supervised
Mean Teacher [48] 89.64 82.68 52.68
ICT [70] 92.71 84.52 61.4 [46]
MixMatch [46] 93.76 92.25 88.92 94.41 89.82
EnAET [73] 94.65 93.05 92.4 95.48 91.96
UDA [16] 95.12[26] 91.18[26] 92.34[26] 68.66 88.52
ReMixMatch [45] 94.86 94.27 93.73 93.82
FixMatch [26] 95.74 94.93 94.83 71.46 89.13
Multi-Stage-Semi-Supervised
DMT [79] 88.70 80.3 58.6
SimCLR [25] 74.4[57] 63.0[57] 92.6 85.8
BYOL [28] 77.7 71.2 93.7 89.5
SimCLRv2 [57] 80.9 76.6 95.5 93.4

For STL-10 the reduced supervised methods even surpass the total supervised case by about 20% due to the additional set of unlabeled data. We conclude that reduced supervised learning reaches comparable results while using only roughly 10% of the labels.

In general, we considered a reduction from 100% to 10% of all labels. However, we see that methods like FixMatch and SimCLRv2 achieve comparable results with even fewer labels such as the usage of 1% of all labels. For ILSVRC-2012 this is equivalent to about 13 images per class. FixMatch even achieves a median accuracy of around 65% for one label per class for the CIFAR-10 dataset [26].

The trend that results improve over time is expected. But the results indicate that we are near the point where semi-supervised learning needs very few to almost no labels per class (e.g. 10 labels for CIFAR-10). In practice, the labeling cost for unsupervised and semi-supervised methods will almost be the same for common classification datasets. Unsupervised methods would need to bridge the performance gap on these classification datasets to be useful anymore. It is questionable if an unsupervised method can achieve this because it would need to guess what a human wants to have classified even when competing features are available.

We already see that on datasets like ImageNet additional data such as JFT-300M is used to further improve the supervised training [96, 97, 98]. These large amounts of data can only be collected without any or weak labels as the collection process has to be automated. It will be interesting to investigate if the discussed methods in this survey can also scale to such datasets while using only few labels per class.

We conclude that on datasets with few and a fixed number of classes semi-supervised methods will be more important than unsupervised methods. However, if we have a lot of classes or new classes should be detected like in few- or zero-shot learning [99, 100, 38, 94] unsupervised methods will still have a lower labeling cost and be of high importance. This means future research has to investigate how the semi-supervised ideas can be transferred to unsupervised methods as in [14, 41] and to settings with many, an unknown or rising amount of classes like in [39, 96].

3. Trend: Combination of common ideas

In the comparison, we identified that few common ideas are shared by one-stage-semi-supervised and multi-stage-semi-supervised methods.

We believe there is only a little overlap between these methods due to the different aims of the respective authors. Many multi-stage-semi-supervised papers focus on creating good representations. They fine-tune their results only to be comparable.

One-stage-semi-supervised papers aim for the best accuracy scores with as few labels as possible.

If we look at methods like SimCLRv2, EnAET, ReMixMatch, or S4L we see that it can be beneficial to combine different ideas and mindsets. These methods used a broad range of ideas and also ideas uncommon for their respective training strategy. S4L calls their combined approach even "Mix of all models" [15] and SimCLRv2 states that "Self-Supervised Methods are Strong Semi-Supervised Learners" [57].

We assume that this combination is one reason for their superior performance. This assumption is supported by the included comparisons in the original papers. For example, S4L showed the impact of each method separately as well as the combination of all [15].

Methods like FixMatch illustrate that it does not need a lot of common ideas to achieve state-of-the-art performance but rather that the selection of the correct ideas and combining them in a meaningful way is important. We identified that some common ideas are not often combined and that the combination of a broad range and unusual ideas can be beneficial. We believe that the combination of the different common ideas is a promising future research field because many reasonable combinations are not yet explored.

5. Conclusion

In this paper, we provided an overview of semi-, self-, and unsupervised methods. We analyzed their differences, similarities, and combinations based on 34 different methods. This analysis led to the identification of several trends and possible research fields.

We based our analysis on the definition of the different training strategies and common ideas in these strategies. We showed how the methods work in general, which ideas they use and provide a simple classification. Despite the difficult comparison of the methods' performances due to different architectures and implementations, we identified three major trends.

Results of over 90% Top-5 accuracy on ILSVRC-2012 with only 10% of the labels indicate that semi-supervised methods could be applied to real-world problems. However, issues like class imbalance and noisy or fuzzy labels are not considered. More robust methods need to be researched before semi-supervised learning can be applied to real-world issues. The performance gap between supervised and semi- or self-supervised methods is closing and the number of labels to get comparable results to fully supervised learning is decreasing. In the future, the unsupervised methods will have almost no labeling cost benefit in comparison to the semi-supervised methods due to these developments. We conclude that, in combination with the fact that semi-supervised methods have the benefit of using labels as guidance, unsupervised methods will lose importance. However, for a large number of classes or an increasing number of classes the ideas of unsupervised learning are still of high importance and ideas from semi-supervised and self-supervised learning need to be transferred to this setting.

We concluded that one-stage-semi-supervised and multi-stage-semi-supervised training mainly use a different set of common ideas. Both strategies use a combination of different ideas but there are few overlaps in these techniques. We identified the trend that a combination of different techniques is beneficial to the overall performance. In combination with the small overlap between the ideas, we identified possible future research opportunities.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, volume 60, pages 1097–1105. Association for Computing Machinery, 2012. 1, 6, 19, 20, 23

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015. 1, 22

[3] J Brünger, S Dippel, R Koch, and C Veit. 'Tailception': using neural networks for assessing tail lesions on pictures of pig carcasses. Animal, 13(5):1030–1036, 2019. 1

[4] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018. 1

[5] Sascha Clausen, Claudius Zelenka, Tobias [13] Mahmut Kaya and Hasan Sakir Bilge. Deep
Schwede, and Reinhard Koch. Parcel Track- Metric Learning : A Survey. Symmetry,
ing by Detection in Large Camera Networks. 11(9):1066, 2019. 2, 3
In Thomas Brox, Andrés Bruhn, and Mario
Fritz, editors, GCPR 2018:Pattern Recognition, [14] Andrea Vedaldi, Xu Ji, João F. Henriques, and
pages 89–104. Springer International Publish- Andrea Vedaldi. Invariant Information Clus-
ing, 2019. 1 tering for Unsupervised Image Classification
and Segmentation. Proceedings of the IEEE
[6] Jonathan Long, Evan Shelhamer, and Trevor International Conference on Computer Vision,
Darrell. Fully convolutional networks for se- (Iic):9865–9874, 2019. 2, 5, 6, 9, 11, 16, 17,
mantic segmentation. In Proceedings of the 19, 21, 22, 25
IEEE conference on computer vision and pat-
tern recognition, pages 3431–3440, 2015. 1 [15] Xiaohua Zhai, Avital Oliver, Alexander
Kolesnikov, and Lucas Beyer. S4L: Self-
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1
[16] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-
[8] Christopher M Bishop. Pattern recognition and Thang Luong, and Quoc V Le. Unsupervised
machine learning. Springer, 2006. 1 Data Augmentation for Consistency Training.
Advances in Neural Information Processing
[9] Dhruv Mahajan, Ross Girshick, Vignesh Ra-
Systems 33 pre-proceedings (NeurIPS 2020),
manathan, Kaiming He, Manohar Paluri, Yix-
2020. 2, 5, 13, 21, 22, 25
uan Li, Ashwin Bharambe, and Laurens van der
Maaten. Exploring the Limits of Weakly Super- [17] Olivier Chapelle, Bernhard Scholkopf, Alexan-
vised Pretraining. Proceedings of the European der Zien, Bernhard Schölkopf, and Alexander
Conference on Computer Vision (ECCV), pages Zien. Semi-supervised learning. IEEE Trans-
181–196, 2018. 1 actions on Neural Networks, 20(3):542, 2006.
2, 3
[10] Lars Schmarje, Claudius Zelenka, Ulf Geisen,
Claus-C. Glüer, and Reinhard Koch. 2D and 3D [18] Rui Xu and Donald C Wunsch. Survey of clus-
Segmentation of uncertain local collagen fiber tering algorithms. IEEE Transactions on Neural
orientations in SHG microscopy. DAGM Ger- Networks, 16:645–678, 2005. 2, 3
man Conference of Pattern Regocnition, 11824
LNCS:374–386, 2019. 2 [19] Longlong Jing and Yingli Tian. Self-supervised
Visual Feature Learning with Deep Neural Net-
[11] Geoffrey E Hinton, Terrence Joseph Sejnowski, works: A Survey. IEEE Transactions on Pattern
and Others. Unsupervised learning: founda- Analysis and Machine Intelligence, 2019. 2, 3,
tions of neural computation. MIT press, 1999. 5
2
[20] Jesper E. van Engelen and Holger H. Hoos. A
[12] Junyuan Xie, Ross B Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In 33rd International Conference on Machine Learning, volume 1, pages 740–749. International Machine Learning Society (IMLS), 2016. 2, 6

27
and Image Understanding, 122:155–171, 2014. Kavukcuoglu, Rémi Munos, and Michal Valko.
2, 3 Bootstrap your own latent: A new approach
to self-supervised Learning. Advances in
[22] James MacQueen and Others. Some methods Neural Information Processing Systems 33
for classification and analysis of multivariate pre-proceedings (NeurIPS 2020), 2020. 2, 18,
observations. In Proceedings of the fifth Berke- 21, 22, 24, 25
ley symposium on mathematical statistics and
probability, volume 1, pages 281–297. Oak- [29] Guo-Jun Qi and Jiebo Luo. Small Data Chal-
land, CA, USA, 1967. 2, 5 lenges in Big Data Era: A Survey of Recent
Progress on Unsupervised and Semi-Supervised
[23] Xiaojin Zhu. Semi-Supervised Learning Lit- Methods. arXiv preprint arXiv:1903.11260,
erature Survey. Comput Sci, University of 2019. 3, 5
Wisconsin-Madison, 2, 2008. 2
[30] Veronika Cheplygina, Marleen de Bruijne,
[24] Erxue Min, Xifeng Guo, Qiang Liu, Gen Zhang, Josien P W Pluim, Marleen De Bruijne, Josien
Jianjing Cui, and Jun Long. A survey of cluster- P W Pluim, Marleen de Bruijne, and Josien P W
ing with deep learning: From the perspective of Pluim. Not-so-supervised: A survey of semi-
network architecture. IEEE Access, 6:39501– supervised, multi-instance, and transfer learn-
39514, 2018. 2 ing in medical image analysis. Medical Image
[25] Ting Chen, Simon Kornblith, Mohammad Analysis, 54:280–296, 2019. 3
Norouzi, and Geoffrey Hinton. A simple frame- [31] Alexander Mey and Marco Loog. Improv-
work for contrastive learning of visual represen- ability Through Semi-Supervised Learning: A
tations. International conference on machine Survey of Theoretical Results. arXiv preprint
learning, (PMLR):1597–1607, 2020. 2, 5, 7, arXiv:1908.09574, pages 1–28, 2019. 3
8, 11, 17, 18, 19, 21, 22, 23, 25
[32] Chelsea Finn, Pieter Abbeel, and Sergey
[26] Kihyuk Sohn, David Berthelot, Chun-Liang Levine. Model-Agnostic Meta-Learning for
Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Fast Adaptation of Deep Networks. 34th In-
Cubuk, Alex Kurakin, Han Zhang, and ternational Conference on Machine Learning,
Colin Raffel. FixMatch: Simplifying Semi- ICML 2017, 3:1856–1868, mar 2017. 3
Supervised Learning with Consistency and
Confidence. Advances in Neural Informa- [33] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi
tion Processing Systems 33 pre-proceedings Mirza, Bing Xu, David Warde-Farley, Sherjil
(NeurIPS 2020), 2020. 2, 5, 6, 10, 14, 21, 22, Ozair, Aaron Courville, and Yoshua Bengio.
25 Generative Adversarial Networks. Proceed-
ings of the International Conference on Neural
[27] Lars Schmarje, Johannes Brünger, Monty San- Information Processing Systems, pages 2672–
tarossa, Simon-Martin Schröder, Rainer Kiko, 2680, jun 2014. 3
and Reinhard Koch. Beyond Cats and Dogs:
Semi-supervised Classification of fuzzy la- [34] Lu Liu, Tianyi Zhou, Guodong Long, Jing
bels with overclustering. arXiv preprint Jiang, Lina Yao, and Chengqi Zhang. Proto-
arXiv:2012.01768, 2020. 2, 5, 11, 17, 19, 21, type Propagation Networks (PPN) for Weakly-
22, 23, 24 supervised Few-shot Learning on Category
Graph. IJCAI International Joint Conference on
[28] Jean-Bastien Grill, Florian Strub, Flo- Artificial Intelligence, 2019-Augus:3015–3022,
rent Altché, Corentin Tallec, Pierre H. may 2019. 4
Richemond, Elena Buchatskaya, Carl Doersch,
Bernardo Avila Pires, Zhaohan Daniel Guo, [35] Norimichi Ukita and Yusuke Uematsu. Semi-
Mohammad Gheshlaghi Azar, Bilal Piot, Koray and weakly-supervised human pose estimation.

28
Computer Vision and Image Understanding, [43] Mehdi Noroozi and Paolo Favaro. Unsuper-
170:67–78, 2018. 4 vised Learning of Visual Representations by
Solving Jigsaw Puzzles. European Conference
[36] Dwarikanath Mahapatra. Combining multiple on Computer Vision, pages 69–84, 2016. 5, 11,
expert annotations using semi-supervised learn- 15, 21, 22
ing and graph cuts for medical image segmenta-
tion. Computer Vision and Image Understand- [44] Mehdi Noroozi, Ananth Vinjimoor, Paolo
ing, 151:114–123, oct 2016. 4 Favaro, and Hamed Pirsiavash. Boosting self-
supervised learning via knowledge transfer. In
[37] Peng Xu, Zeyu Song, Qiyue Yin, Yi-Zhe Song, Conference on Computer Vision and Pattern
and Liang Wang. Deep Self-Supervised Repre- Recognition, pages 9359–9367, 2018. 5, 10, 11,
sentation Learning for Free-Hand Sketch. IEEE 15
Transactions on Circuits and Systems for Video
Technology, 2020. 4 [45] David Berthelot, Nicholas Carlini, Ekin D.
Cubuk, Alex Kurakin, Kihyuk Sohn, Han
[38] Lu Liu, Tianyi Zhou, Guodong Long, Jing Zhang, and Colin Raffel. ReMixMatch: Semi-
Jiang, Xuanyi Dong, and Chengqi Zhang. Iso- Supervised Learning with Distribution Align-
metric Propagation Network for Generalized ment and Augmentation Anchoring. Inter-
Zero-shot Learning. International Conference national Conference on Learning Representa-
on Learning Representations, feb 2021. 4, 25 tions, 2020. 5, 10, 11, 13, 14, 21, 22, 25

[39] Zhongjie Yu, Lin Chen, Zhongwei Cheng, and [46] David Berthelot, Nicholas Carlini, Ian Good-
Jiebo Luo. Transmatch: A transfer-learning fellow, Nicolas Papernot, Avital Oliver, and
scheme for semi-supervised few-shot learning. Colin A Raffel. Mixmatch: A holistic approach
In Proceedings of the IEEE/CVF Conference to semi-supervised learning. In Advances in
on Computer Vision and Pattern Recognition, Neural Information Processing Systems, pages
pages 12856–12864, 2020. 4, 25 5050–5060, 2019. 5, 10, 11, 13, 21, 22, 25

[47] Dong-Hyun Lee. Pseudo-label: The simple and


[40] Spyros Gidaris, Praveer Singh, and Nikos Ko-
efficient semi-supervised learning method for
modakis. Unsupervised Representation Learn-
deep neural networks. In Workshop on chal-
ing by Predicting Image Rotations. In Inter-
lenges in representation learning, ICML, vol-
national Conference on Learning Representa-
ume 3, page 2, 2013. 5, 11, 13, 17, 21, 22
tions, number 2016, pages 1–16, 2018. 5, 10,
11, 13, 14, 15, 17, 19, 21, 22 [48] Antti Tarvainen and Harri Valpola. Mean teach-
ers are better role models: Weight-averaged
[41] Wouter Van Gansbeke, Simon Vandenhende, consistency targets improve semi-supervised
Stamatios Georgoulis, Marc Proesmans, and deep learning results. In International Confer-
Luc Van Gool. SCAN: Learning to Classify Im- ence on Learning Representations, 2017. 5, 11,
ages without Labels. In Proceedings of the Eu- 12, 13, 16, 21, 22, 25
ropean Conference on Computer Vision, pages
268–285, 2020. 5, 6, 21, 22, 25 [49] Samuli Laine and Timo Aila. Temporal ensem-
bling for semi-supervised learning. In Inter-
[42] Carl Doersch, Abhinav Gupta, and Alexei A national Conference on Learning Representa-
Efros. Unsupervised Visual Representation tions, 2017. 5, 11, 12, 13, 21, 22
Learning by Context Prediction. In IEEE
International Conference on Computer Vision [50] Jianlong Chang, Lingfeng Wang, Gaofeng
(ICCV), pages 1422–1430. IEEE, 2015. 5, 10, Meng, Shiming Xiang, and Chunhong Pan.
11, 15, 21, 22 Deep Adaptive Image Clustering. 2017 IEEE

29
International Conference on Computer Vision [59] Ben Poole, Sherjil Ozair, Aaron van den Oord,
(ICCV), pages 5880–5888, 2017. 6, 18, 21, 22 Alexander A. Alemi, and George Tucker. On
Variational Bounds of Mutual Information. In-
[51] Ian Goodfellow, Yoshua Bengio, and Aaron
ternational Conference on Machine Learning,
Courville. Deep Learning. MIT Press, 2016.
2019. 8, 9
7
[52] Adam Coates, Andrew Ng, and Honglak Lee. [60] Michael Tschannen, Josip Djolonga, Paul K.
An analysis of single-layer networks in unsu- Rubenstein, Sylvain Gelly, and Mario Lucic.
pervised feature learning. In Proceedings of On Mutual Information Maximization for Rep-
the fourteenth international conference on arti- resentation Learning. International Conference
ficial intelligence and statistics, pages 215–223, on Learning Representations, 2020. 8
2011. 8, 19
[61] Yves Grandvalet and Yoshua Bengio. Semi-
[53] R Hadsell, S Chopra, and Y LeCun. Dimension- supervised learning by entropy minimization.
ality Reduction by Learning an Invariant Map- In Advances in neural information processing
ping. In 2006 IEEE Computer Society Confer- systems, pages 529–536, 2005. 8, 12, 13, 17
ence on Computer Vision and Pattern Recogni-
tion (CVPR’06), volume 2, pages 1735–1742, [62] S Kullback and R A Leibler. On Information
2006. 7 and Sufficiency. Ann. Math. Statist., 22(1):79–
86, 1951. 9
[54] Yonglong Tian, Dilip Krishnan, and Phillip
Isola. Contrastive Multiview Coding. European [63] Thomas M Cover and Joy A Thomas. Elements
conference on computer vision, 2019. 8, 16, 21, of information theory. John Wiley & Sons,
22 1991. 9
[55] Aaron Van Den Oord, Yazhe Li, and Oriol [64] Mohamed Ishmael Belghazi, Aristide Baratin,
Vinyals. Representation Learning with Con- Sai Rajeswar, Sherjil Ozair, Yoshua Bengio,
trastive Predictive Coding. arXiv preprint Aaron Courville, and R. Devon Hjelm. Mu-
arXiv:1807.03748., 2018. 8, 11, 16, 17, 18, 21, tual Information Neural Estimation. In Interna-
23 tional Conference on Machine Learning, pages
[56] Olivier J Hénaff, Aravind Srinivas, Jef- 531–540, 2018. 9
frey De Fauw, Ali Razavi, Carl Doersch,
[65] Semi-supervised Learning, Takeru Miyato,
S. M. Ali Eslami, and Aaron van den Oord.
Shin-ichi Maeda, Masanori Koyama, Shin Ishii,
Data-Efficient Image Recognition with Con-
and Masanori Koyama. Virtual adversarial
trastive Predictive Coding. Proceedings of
training: a regularization method for supervised
the37thInternational Conference on Machine-
and semi-supervised learning. IEEE transac-
Learning, PMLR:4182–4192, 2020. 8, 16, 17,
tions on pattern analysis and machine intelli-
18, 21, 22
gence, pages 1–16, 2018. 9, 10, 12, 16, 17, 19,
[57] Ting Chen, Simon Kornblith, Kevin Swer- 21, 22
sky, Mohammad Norouzi, and Geoffrey Hinton.
Big Self-Supervised Models are Strong Semi- [66] Hongyi Zhang, Moustapha Cisse, Yann N
Supervised Learners. Advances in Neural Infor- Dauphin, and David Lopez-Paz. mixup: Be-
mation Processing Systems 33 pre-proceedings yond empirical risk minimization. In Inter-
(NeurIPS 2020), 2020. 8, 17, 18, 21, 22, 25, 26 national Conference on Learning Representa-
tions, 2018. 10, 12, 13
[58] Ting Chen and Lala Li. Intriguing Proper-
ties of Contrastive Losses. arXiv preprint [67] Mathilde Caron, Piotr Bojanowski, Armand
arXiv:2011.02803, 2020. 8 Joulin, and Matthijs Douze. Deep Clustering for

30
[67] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep Clustering for Unsupervised Learning of Visual Features. Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018. 11, 15, 21, 22

[68] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2015. 11, 14, 17, 21, 22

[69] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, number Section 3, pages 113–123, 2019. 12, 13

[70] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, Kenji Kawaguchi, and David Lopez-Paz. Interpolation Consistency Training for Semi-Supervised Learning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. 12, 21, 22, 25

[71] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. In International Conference on Learning Representations, pages 1–22, 2019. 12, 13, 21, 22

[72] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. In Conference on Uncertainty in Artificial Intelligence, 2018. 12

[73] Xiao Wang, Daisuke Kihara, Jiebo Luo, and Guo-Jun Qi. EnAET: Self-Trained Ensemble AutoEncoding Transformations for Semi-Supervised Learning. arXiv preprint arXiv:1911.09265, 2019. 13, 21, 22, 25

[74] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 2542–2550. IEEE, 2019. 13

[75] Terrance Devries and Graham W Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552, 2017. 13

[76] Fan Ma, Deyu Meng, Xuanyi Dong, and Yi Yang. Self-paced Multi-view Co-training. Journal of Machine Learning Research, 21(57):1–38, 2020. 13, 21, 22

[77] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, pages 1–24, 2019. 16, 21, 22

[78] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning Representations by Maximizing Mutual Information Across Views. In Advances in Neural Information Processing Systems, pages 15509–15519, 2019. 16, 21, 22

[79] Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep Metric Transfer for Label Propagation with Limited Annotated Data. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019. 16, 21, 22, 25

[80] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful Image Colorization. European conference on computer vision, pages 649–666, 2016. 16

[81] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3733–3742, May 2018. 16, 17, 19

[82] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 17, 18, 21, 22

[83] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297, 2020. 17, 18, 22

[84] Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, and Michal Valko. BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241, 2020. 18

[85] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning Discrete Representations via Information Maximizing Self-Augmented Training. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1558–1567, 2017. 19, 21, 22, 24

[86] Alex Krizhevsky, Geoffrey Hinton, and Others. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. 19

[87] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective Kernel Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 510–519, 2019. 22

[88] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237, 2020. 22

[89] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting Self-Supervised Visual Representation Learning. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1920–1929, 2019. 22, 23

[90] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, pages 87.1–87.12. British Machine Vision Association, 2016. 23

[91] Xavier Gastaldi. Shake-Shake regularization. arXiv preprint arXiv:1705.07485, 2017. 23

[92] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian J Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, number NeurIPS, pages 3235–3246, 2018. 23

[93] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 24

[94] Simon-Martin Schröder, Rainer Kiko, and Reinhard Koch. MorphoCluster: Efficient Annotation of Plankton images by Clustering. Sensors, 20, 2020. 24, 25

[95] Qing Li, Xiaojiang Peng, Liangliang Cao, Wenbin Du, Hao Xing, Yu Qiao, and Qiang Peng. Product image recognition with guidance learning and noisy supervision. Computer Vision and Image Understanding, 196:102963, 2020. 24

[96] Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V. Le. Meta Pseudo Labels. 2020. 25

[97] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General Visual Representation Learning. In Lecture Notes in Computer Science, pages 491–507. 2020. 25

[98] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-Training With Noisy Student Improves ImageNet Classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695. IEEE, June 2020. 25

[99] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020. 25

[100] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shot learning: Settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–37, 2019. 25

