Label Propagation For Deep Semi-Supervised Learning
data with unsupervised objectives on all data, where the latter act as regularization [41, 38]. Or, an existing classifier can be used to assign pseudo-labels [24, 35], which is another form of algorithmic supervision. Using a powerful classifier trained on carefully annotated data can provide high-quality pseudo-labels, opening the door to learning from real unlabeled, large scale data. In such omni-supervised learning [31], the fully supervised performance on the labeled part is actually the lower bound. This only refreshes the interest in inductive semi-supervised methods.

In this paper, we use efficient transductive label propagation [43] to infer pseudo-labels for unlabeled data, which are used to train the classifier. Label propagation is a graph-based method, and in this work the graph is constructed by exploiting the embeddings obtained by the classification network itself. Thus, the proposed method alternates between two steps. First, the network is trained from labeled and pseudo-labeled data. The second step uses the embeddings of the network trained in the previous step to construct a nearest neighbor graph. Label propagation is then used to infer pseudo-labels for unlabeled images, as well as a certainty score per image and per class. Training is performed on all data, using certainty-based weights.

We experimentally show on standard datasets that the proposed method outperforms other semi-supervised approaches. The less labeled data is available, the more pronounced the advantage of the proposed approach is.

2. Related work

The literature is rich in the problem of semi-supervised learning (SSL); the reader is referred to [3] for an extensive overview. The same holds for SSL in image classification [10, 16, 4, 37]. In this section, we mostly restrict the discussion to approaches that use deep learning for SSL and perform the training on a large image collection with mini-batch optimization.

Prior work on semi-supervised deep learning for image classification is divided into two main categories. The first consists of methods, e.g. [15, 23, 34, 38], that add an unsupervised loss term (often called a regularizer) into the loss function. This term is applied to either all images or only the unlabeled ones. Methods in the second category, e.g. [24, 36], assign pseudo-labels to the unlabeled examples. The pseudo-labeled data are then used in training with a supervised loss, such as cross entropy. Both categories use a standard loss term that is trained with supervision from labeled images. A thorough evaluation of SSL for deep image classification can be found in Oliver et al. [27].

Our contribution belongs to the second category, and is conceptually and implementation-wise orthogonal to the first. It is therefore straightforward to combine the proposed method with any method from the first category. We do combine it with [38] as shown in Section 5.

Unsupervised loss in deep SSL. Assuming that every training image, labeled or not, belongs to a single category, a natural requirement on the classifier is to make a confident prediction on the training set. This idea was formalized by Sajjadi et al. [35], where the regularizer is designed to minimize the entropy of the network output. Such a loss term is easily combined with other terms. A similar combination is performed for denoising auto-encoders that are applied on all images in an unsupervised manner [32].

A direction attracting a lot of attention is that of consistency loss, where two related cases, e.g. coming from two similar images, or produced by two networks with related parameters, are encouraged to have similar network outputs. Sajjadi et al. [34] are the first, to our knowledge, to use a consistency loss between the outputs of a network on random perturbations of the same image. Laine and Aila [23] rather apply consistency between the output of the current network and the temporal average of outputs during training. The state-of-the-art mean teacher (MT) method [38] replaces output averaging by averaging of network parameters. Consistency loss is commonly measured by the squared Euclidean distance. The Jensen-Shannon divergence is used instead by Qiao et al. [29], while complementarity of the two networks is enforced via adversarial examples. A similar idea is proposed by Miyato et al. [26].

Pseudo-labeling in deep SSL. Lee [24] uses the current network to infer pseudo-labels of unlabeled examples, by choosing the most confident class. These pseudo-labels are treated like human-provided labels in the cross entropy loss. Its impact is similar to that of entropy minimization [35]; in both cases the network is forced to make more confident predictions. The same principle is adopted by Shi et al. [36], where the authors further add a contrastive loss to the consistency loss. Our method is different from all such prior work in that pseudo-labels are inferred by label propagation rather than network predictions.

Label propagation has been extensively used in a transductive setup (see chapter 11 of [3]). Recently, Douze et al. [7] perform label propagation on a large image dataset with CNN descriptors for few-shot learning. Unseen images are classified via online label propagation, which requires storing the entire dataset, while the network is trained in advance and descriptors are fixed. Our work is different in that we perform label propagation on the training set offline while training the network, such that inference is possible without accessing the original training set. Learning by association [17] can be seen as two steps of propagation on a constrained bi-partite graph between labeled and unlabeled examples. Graph transduction game (GTG) [9], a form of label propagation, has been used to obtain pseudo-labels [8] as in our work, but in this case the network is pre-trained, the graph remains fixed and there is no weighting mechanism. We compare to this approach in Section 5.
3. Preliminaries

In this section we formulate the semi-supervised learning problem and then we discuss the classifier, different loss functions that are commonly used in prior work, and finally the transductive learning approach that our method is based on. In our experiments we use a convolutional neural network (CNN) to perform image classification, but this formulation applies to any network architecture in any domain.

Problem formulation. We assume a collection of n examples X := (x_1, ..., x_l, x_{l+1}, ..., x_n) with x_i ∈ 𝒳. The first l examples x_i for i ∈ L := {1, ..., l}, denoted by X_L, are labeled according to Y_L := (y_1, ..., y_l) with y_i ∈ C, where C := {1, ..., c} is a discrete label set for c classes. The remaining u := n − l examples x_i for i ∈ U := {l+1, ..., n}, denoted by X_U, are unlabeled. The goal in SSL is to use all examples X and labels Y_L to train a classifier that maps previously unseen samples to class labels.

Classifier. The network takes an input example from 𝒳 and produces a vector of class confidence scores. We denote it by f_θ : 𝒳 → R^c, where θ are the network parameters. It is conceptually divided in two parts. The first is a feature extraction network φ_θ : 𝒳 → R^d mapping the input to a feature vector, or descriptor. We denote the descriptor of the i-th example by v_i := φ_θ(x_i). The second typically consists of a fully connected (FC) layer applied on top of φ_θ and followed by softmax, producing a vector of confidence scores. Function f_θ is the mapping from input space directly to confidence scores. The output of the network for the i-th example is f_θ(x_i) and the prediction is the one of maximum confidence score

ŷ_i := arg max_j f_θ(x_i)_j,    (1)

where subscript j denotes the j-th dimension of the vector.

Supervised loss. In supervised learning, the network is trained by minimizing a supervised loss term of the form

L_s(X_L, Y_L; θ) := Σ_{i=1}^{l} ℓ_s(f_θ(x_i), y_i),    (2)

which applies only to the labeled examples in X_L. Such a term is part of the total loss when training a network in a semi-supervised setup [36, 38, 29]. A standard choice for the loss function ℓ_s in classification is cross-entropy, given by ℓ_s(s, y) := − log s_y for s ∈ R^c and y ∈ C.

Pseudo-labeling is the process of assigning a pseudo-label ŷ_i to each example x_i for i ∈ U. Denoting by Ŷ_U := (ŷ_{l+1}, ..., ŷ_n) the collection of pseudo-labels for X_U, the following additional pseudo-label loss term applies

L_p(X_U, Ŷ_U; θ) := Σ_{i=l+1}^{n} ℓ_s(f_θ(x_i), ŷ_i),    (3)

where again ℓ_s is any supervised loss function like cross-entropy. An example is the approach proposed by Lee [24], who first trains network f_θ with (2) and then assigns pseudo-labels according to (1) for i ∈ U.

Unsupervised loss is another common alternative, where the loss function applies to both labeled and unlabeled examples and encourages consistency under different transformations of the data or the network. The so-called consistency loss [36, 38] is defined as

L_u(X; θ) := Σ_{i=1}^{n} ℓ_u(f_θ(x_i), f_θ̃(x̃_i)),    (4)

where x̃_i refers to a different transformation of example x_i. Note that according to the standard practice of data augmentation, every forward pass of x_i during training is performed under some random transformation. Parameter set θ̃ is either equal to θ or any other transformation of it, such as a moving average over the sequence of network updates [38]. A simple choice of ℓ_u is the squared Euclidean distance, i.e. ℓ_u(s, s̃) := ‖s − s̃‖² for s, s̃ ∈ R^c, forcing the two outputs to be as close as possible.
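For illustration, a consistency term in the spirit of (4) can be written compactly in PyTorch; the sketch below is ours, not part of our actual implementation: the augment callable and the stop-gradient on the second pass are illustrative choices, and in mean teacher [38] the second output would come from an exponential-moving-average copy of the network.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, augment):
    # Two forward passes of the same batch under different random transformations,
    # compared with the squared Euclidean distance, in the spirit of Eq. (4).
    p1 = F.softmax(model(augment(x)), dim=1)        # f_theta(x_i)
    with torch.no_grad():                           # treat the second pass as a target
        p2 = F.softmax(model(augment(x)), dim=1)    # f_theta~(x~_i)
    return ((p1 - p2) ** 2).sum(dim=1).mean()
```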
Transductive learning solves a more specific problem. Instead of training a generic classifier able to classify new, yet unseen, examples, the goal is to use X and Y_L to infer labels for the examples in X_U. In this work, we adopt the graph-based approach of Zhou et al. [43] for transductive learning by diffusion.¹

Diffusion for transductive learning [43]. Let V = (v_1, ..., v_l, v_{l+1}, ..., v_n) be the descriptor set, where v_i corresponds to x_i as defined earlier. A symmetric adjacency matrix W ∈ R^{n×n} with zero diagonal is constructed, whose elements w_{ij} are non-negative pairwise similarities between v_i and v_j. Its symmetrically normalized counterpart is given by 𝒲 = D^{−1/2} W D^{−1/2}, where D := diag(W 1_n) is the degree matrix and 1_n is the all-ones n-vector. An n × c label matrix Y is defined with elements

Y_{ij} := 1 if i ∈ L ∧ y_i = j, and 0 otherwise,    (5)

that is, the rows of Y corresponding to labeled examples are one-hot encoded labels and the rest are zero. Diffusion amounts to computing the n × c matrix

Z := (I − α𝒲)^{−1} Y,    (6)

where α ∈ [0, 1) is a parameter. Finally, the class prediction for an unlabeled example x_i is

ŷ_i := arg max_j z_{ij},    (7)

where z_{ij} is the (i, j) element of matrix Z.

¹We first present the original approach and discuss our design choices in the following section.
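To make (5)-(7) concrete, a minimal NumPy sketch of the diffusion on a toy dense graph could look as follows (the helper name, the α value and the small floor on the degrees are placeholders rather than details of our implementation; the closed-form solve is only viable at toy scale).

```python
import numpy as np

def diffusion_toy(W, labels, alpha=0.99):
    # W: (n, n) symmetric non-negative affinity matrix with zero diagonal.
    # labels: length-n integer array, class index for labeled points and -1 otherwise.
    n, c = W.shape[0], labels.max() + 1
    # Symmetric normalization D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    W_norm = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One-hot label matrix Y of Eq. (5); rows of unlabeled examples stay zero
    Y = np.zeros((n, c))
    Y[labels >= 0, labels[labels >= 0]] = 1.0
    # Closed-form diffusion Z = (I - alpha * W_norm)^{-1} Y of Eq. (6)
    Z = np.linalg.solve(np.eye(n) - alpha * W_norm, Y)
    return Z.argmax(axis=1)  # class predictions of Eq. (7)
```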
It is interesting to observe that matrix Z as defined by (6) is the minimizer of the following quadratic cost function

J(Z) := (α/2) Σ_{i,j=1}^{n} w_{ij} ‖ z_i/√d_{ii} − z_j/√d_{jj} ‖² + (1−α) ‖Y − Z‖_F²,    (8)

where z_i is the i-th row of matrix Z, d_{ii} is the i-th diagonal element of D and ‖·‖_F is the Frobenius norm. The first term encourages smoothness such that nearby examples get the same predictions, while the second attempts to maintain the predictions for the labeled examples [43].

4. Method

In the following, we begin by providing an overview of our approach. We then develop the main elements of our solution, put everything together in a concrete algorithm, and discuss how our approach is complementary to approaches using an unsupervised loss for SSL [38, 36]. Finally, we discuss the relation to prior work that encourages smoothness in deep networks.

Overview. We introduce a new iterative process for semi-supervised learning that can be summarized as follows. First, we construct a nearest neighbor graph and perform label propagation by transductive learning on the training set. Then, we estimate a weight reflecting the uncertainty of label propagation for each unlabeled example. Finally, we inject the obtained labels into the network training process. These ideas are developed below, while a graphical overview of the proposed approach is shown in Figure 2.

[Figure 2. Overview of the proposed approach. The network f_θ consists of a feature extractor φ_θ followed by FC + softmax. Phase 1: train for T epochs with L_s(X_L, Y_L; θ) (labeled examples only). Then, using φ_θ, perform label propagation by solving (10), and train for one epoch with L_w(X, Y_L, Ŷ_U; θ) (all examples).]

Nearest neighbor graph. Given a network with parameters θ, we construct the descriptor set V = (v_1, ..., v_l, v_{l+1}, ..., v_n), where v_i := φ_θ(x_i). A sparse affinity matrix A ∈ R^{n×n} with elements

a_{ij} := [v_i^⊤ v_j]_+^γ if i ≠ j ∧ v_i ∈ NN_k(v_j), and 0 otherwise,    (9)

is constructed, where NN_k denotes the set of k nearest neighbors in X, and γ is a parameter following recent work on manifold-based search [20]. Note that constructing the affinity matrix of the nearest neighbor graph is efficient even for large n [20], while constructing the full affinity matrix as in Zhou et al. is not tractable. Then, let W := A + A^⊤, which is indeed a symmetric non-negative adjacency matrix with zero diagonal.
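A hedged sketch of this graph construction follows (an illustration, not our actual implementation, which relies on a more efficient nearest-neighbor search): scikit-learn's NearestNeighbors stands in for the k-NN search, descriptors are assumed L2-normalized, and the values of k and γ are placeholders.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

def knn_affinity(V, k=50, gamma=3):
    # V: (n, d) L2-normalized descriptors from phi_theta.
    n = V.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(V)   # +1 because each point retrieves itself
    _, idx = nbrs.kneighbors(V)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].reshape(-1)                       # drop the self neighbor
    sims = np.sum(V[rows] * V[cols], axis=1)            # v_i^T v_j (cosine similarity for unit norm)
    vals = np.maximum(sims, 0.0) ** gamma               # [.]_+^gamma as in Eq. (9)
    A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    W = A + A.T                                         # symmetric, non-negative, zero diagonal
    # Symmetric normalization D^{-1/2} W D^{-1/2}, as used by the diffusion
    d = np.asarray(W.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return d_inv_sqrt @ W @ d_inv_sqrt
```

Whether a_{ij} is defined over the neighbors of v_i or of v_j is immaterial here, since the symmetrization W = A + A^⊤ yields the same matrix either way.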
Label propagation. Estimating matrix Z by (6) is impractical for large n because the inverse matrix (I − α𝒲)^{−1} is not sparse. We rather use the conjugate gradient (CG) method to solve the linear system

(I − α𝒲) Z = Y,    (10)

which applies because matrix (I − α𝒲) is positive-definite. This solution is known to be faster than the iterative solution of Zhou et al. [43], and has been used in semi-supervised learning [44], interactive image segmentation [14], image retrieval [20] and semantic image segmentation [2]. Finally, we infer the pseudo-labels Ŷ_U = (ŷ_{l+1}, ..., ŷ_n), where ŷ_i is given by (7).

Pseudo-label certainty and class balancing. Inferring pseudo-labels from matrix Z by hard assignment has two undesired effects. First, we define pseudo-labels on all unlabeled examples, while clearly we do not have the same certainty for each example. Second, pseudo-labels may not be balanced over classes, which will impede learning.

To deal with the former issue, we associate with each pseudo-label a weight reflecting the certainty of the prediction. We use entropy, as a measure of uncertainty, to assign weight ω_i to example x_i, defined by

ω_i := 1 − H(ẑ_i) / log(c),    (11)

where Ẑ is the row-wise normalized counterpart of Z, i.e. ẑ_{ij} = z_{ij} / Σ_k z_{ik}, and function H : R^c → R is the entropy function. Weight ω_i is normalized in [0, 1] because log(c) is the maximum possible entropy in R^c.

To deal with the latter issue of class imbalance, we assign a weight ζ_j to class j that is inversely proportional to the class population, defined as ζ_j := (|L_j| + |U_j|)^{−1}, where L_j (resp. U_j) are the examples labeled (resp. pseudo-labeled) as class j.
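The following SciPy sketch shows how (10), (11) and the class weights fit together; it is an illustration only, with placeholder values for α and the number of CG iterations, and a pragmatic clipping of small negative values produced by the truncated solve.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def propagate_labels(W_norm, labels, n_classes, alpha=0.99, cg_iters=20):
    # W_norm: symmetrically normalized sparse affinity; labels: -1 marks unlabeled examples.
    n = W_norm.shape[0]
    A = sp.eye(n, format='csr') - alpha * W_norm          # positive-definite system matrix of Eq. (10)
    Z = np.zeros((n, n_classes))
    for j in range(n_classes):
        y_j = (labels == j).astype(np.float64)            # j-th column of Y, Eq. (5)
        Z[:, j], _ = cg(A, y_j, maxiter=cg_iters)         # conjugate-gradient solve
    Z = np.maximum(Z, 0.0)                                # clip tiny negatives before normalizing
    Z_hat = Z / np.maximum(Z.sum(axis=1, keepdims=True), 1e-12)
    pseudo = Z_hat.argmax(axis=1)                         # pseudo-labels, Eq. (7)
    # Certainty weight of Eq. (11): 1 - normalized entropy of the row
    entropy = -np.sum(Z_hat * np.log(np.maximum(Z_hat, 1e-12)), axis=1)
    omega = 1.0 - entropy / np.log(n_classes)
    # Class weights inversely proportional to the (pseudo-)label population
    assigned = np.where(labels >= 0, labels, pseudo)
    zeta = 1.0 / np.maximum(np.bincount(assigned, minlength=n_classes), 1)
    return pseudo, omega, zeta
```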
Given the above definitions of per-example and per-class weights, we associate the following weighted loss to the labeled and pseudo-labeled examples

L_w(X, Y_L, Ŷ_U; θ) := Σ_{i=1}^{l} ζ_{y_i} ℓ_s(f_θ(x_i), y_i) + Σ_{i=l+1}^{n} ω_i ζ_{ŷ_i} ℓ_s(f_θ(x_i), ŷ_i),    (12)

which is the sum of weighted versions of L_s (2) and L_p (3). In contrast to (3), pseudo-labels originate in diffusion rather than network predictions.

A toy example showing the result of label propagation and the estimated weights is shown in Figure 3.
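In a mini-batch setting, (12) amounts to a per-example weighted cross-entropy. A minimal PyTorch sketch (ours; the names, the mean reduction and the handling of the labeled/unlabeled split are assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_label_loss(logits, targets, is_labeled, omega, zeta):
    # logits: (B, c); targets: ground-truth labels for labeled examples, pseudo-labels otherwise;
    # is_labeled: (B,) boolean mask; omega: (B,) certainty weights; zeta: (c,) class weights.
    per_example = F.cross_entropy(logits, targets, reduction='none')   # l_s terms of Eq. (12)
    w = zeta[targets]                                                  # zeta_{y_i} or zeta_{y^_i}
    w = torch.where(is_labeled, w, w * omega)                          # omega_i only on pseudo-labels
    return (w * per_example).mean()
```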
Iterative training. Given the above definitions of the nearest neighbor graph, label propagation, example/class weighting and pseudo-label loss, we plug these components into an iterative learning process. We begin by randomly initializing the network parameters θ and we train the network for T epochs in a fully supervised manner on the l labeled examples X_L using the supervised loss term (2). The trained network then provides the starting point for the following iterative process. First, we extract descriptors V on the entire training set X and compute nearest neighbors to construct the adjacency matrix W. Second, we perform label propagation by solving the linear system (10) and assign pseudo-labels to the unlabeled examples X_U by (7). Finally, we train the network for one epoch on the entire training set X using the weighted loss L_w (12). We repeat this iterative process for T′ epochs. The above is summarized in Algorithm 1. Procedure OPTIMIZE() refers to the mini-batch optimization of the corresponding loss term for one epoch, i.e. all examples are fed to the network once. More details about batch construction are given in the implementation details.

Algorithm 1 Label propagation for deep SSL
1:  procedure LPDSSL(Training examples X, labels Y_L)
2:    θ ← initialize randomly
3:    for epoch ∈ [1, ..., T] do
4:      θ ← OPTIMIZE(L_s(X_L, Y_L; θ))               ▷ mini-batch optimization
5:    end for
6:    for epoch ∈ [1, ..., T′] do
7:      for i ∈ {1, ..., n} do v_i ← φ_θ(x_i)        ▷ extract descriptors
8:      for (i, j) ∈ {1, ..., n}² do a_{ij} ← affinity values (9)
9:      W ← A + A^⊤                                  ▷ symmetric affinity
10:     W ← D^{−1/2} W D^{−1/2}                      ▷ symmetrically normalized affinity
11:     Z ← solve (10) with CG                       ▷ diffusion
12:     for (i, j) ∈ U × C do ẑ_{ij} ← z_{ij} / Σ_k z_{ik}   ▷ normalize Z
13:     for i ∈ U do ŷ_i ← arg max_j ẑ_{ij}          ▷ pseudo-label
14:     for i ∈ U do ω_i ← certainty of ŷ_i (11)     ▷ pseudo-label weight
15:     for j ∈ C do ζ_j ← (|L_j| + |U_j|)^{−1}      ▷ class weight/balancing
16:     θ ← OPTIMIZE(L_w(X, Y_L, Ŷ_U; θ))            ▷ mini-batch optimization
17:   end for
18: end procedure
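For readers who prefer code to pseudocode, a rough PyTorch-style skeleton of Algorithm 1 is given below. It is a sketch only: the data loaders, the convention of marking unlabeled examples with -1 and the index-aware full_loader are hypothetical, the epoch counts are placeholders, and the graph, propagation and loss helpers refer to the illustrative sketches given earlier.

```python
import numpy as np
import torch
import torch.nn.functional as F

def lpdssl(model, phi, optimizer, labeled_loader, full_loader, labels, n_classes,
           T=10, T_prime=170, device='cuda'):
    # labels: length-n array over the full training set, with -1 marking unlabeled examples.
    # full_loader is assumed to yield (images, dataset_indices); T and T_prime are placeholders.

    # Phase 1: supervised warm-up on the labeled examples only (lines 3-5, loss (2)).
    for _ in range(T):
        for x, y in labeled_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x.to(device)), y.to(device)).backward()
            optimizer.step()

    for _ in range(T_prime):                                   # lines 6-17
        # Descriptor extraction, graph construction and label propagation (lines 7-15),
        # using the knn_affinity / propagate_labels sketches above.
        model.eval()
        with torch.no_grad():
            V = torch.cat([phi(x.to(device)).cpu() for x, _ in full_loader]).numpy()
        pseudo, omega, zeta = propagate_labels(knn_affinity(V), labels, n_classes)
        zeta_t = torch.as_tensor(zeta, dtype=torch.float32, device=device)

        # One epoch of mini-batch optimization with the weighted loss (12) (line 16).
        model.train()
        for x, idx in full_loader:
            i = idx.numpy()
            targets = torch.as_tensor(np.where(labels[i] >= 0, labels[i], pseudo[i]), device=device)
            is_labeled = torch.as_tensor(labels[i] >= 0, device=device)
            w = torch.as_tensor(omega[i], dtype=torch.float32, device=device)
            optimizer.zero_grad()
            loss = weighted_pseudo_label_loss(model(x.to(device)), targets, is_labeled, w, zeta_t)
            loss.backward()
            optimizer.step()
```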
Combination with other approaches. Our contribution falls in the case of a pseudo-label loss in the form of (3). It is orthogonal to approaches that use an unsupervised loss, for instance (4), applied to both labeled and unlabeled examples. Combination of the two comes in a straightforward way by adding term (4) to the total loss optimized in lines 4 and 16 of Algorithm 1. This is exactly the way we combine the proposed approach with the state-of-the-art Mean Teacher approach [38] in our experiments.
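In code, the combination is simply a sum of the two terms. The sketch below reuses the earlier illustrative helpers; the weighting coefficient lambda_u is a placeholder, not a value prescribed here.

```python
def combined_loss(model, x, augment, logits, targets, is_labeled, omega, zeta, lambda_u=1.0):
    # Pseudo-label term (12) plus a consistency term in the spirit of (4),
    # added inside the optimization steps of lines 4 and 16 of Algorithm 1.
    return (weighted_pseudo_label_loss(logits, targets, is_labeled, omega, zeta)
            + lambda_u * consistency_loss(model, x, augment))
```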
Discussion. In an inductive framework, if z_i/√d_{ii} is replaced by the network output f_θ(x_i) in the smoothness term of (8), then this becomes an unsupervised loss term, e.g. like (4), only now it encourages consistency between the predictions of nearby examples. Indeed, such a solution is adopted e.g. by Weston et al. [41]. This is not very efficient because the adjacency matrix is typically sparse, with non-zero elements only on nearest neighbors, and then the gradient of the smoothness term propagates from each example only to its neighbors at each iteration.
Our main idea therefore is that, instead of just encouraging nearby examples to get the same predictions, we encourage all examples to get predictions equal to the ones we would obtain by transductive learning according to the quadratic cost (8) and its solution Z (6). Computing Z is efficient because it is performed outside our main optimization process, i.e. it does not need iterating on mini-batches of data and backpropagating through the network. Then, given Z, the main optimization process drives all examples directly to that solution, as if they were all labeled.

5. Experiments

We present the datasets used in our experiments and the SSL setup that is followed. Then, we discuss the training details of our method and of the methods reproduced for fair comparison. Finally, we perform experiments to show the impact of the different components involved in the proposed method and to compare with the state of the art. All error rates reported are produced by our own implementation unless otherwise stated.

5.1. Datasets

We use three image classification datasets, namely CIFAR-10 [22], CIFAR-100 [22] and Mini-ImageNet [39]. Each dataset is used in an SSL setup where part of the training images are labeled and the rest are unlabeled. We evaluate the performance on an independent test set. Unless otherwise specified, error rate is reported in our experiments.

CIFAR-10. The training set consists of 50k images coming from 10 classes, while the test set consists of 10k images from the same 10 classes. All images have resolution 32 × 32. Evaluation is performed with 50, 100, 200, and 400 labeled images per class, corresponding to l = 500, 1k, 2k, and 4k labeled images in total. We use the same random selection of labeled images that is used in Mean Teacher [38] when available (1k, 2k and 4k labels). The selection process is repeated 10 times, resulting in 10 different dataset splits for SSL on CIFAR-10. We follow the common practice, which is to use each of them and report the mean error and standard deviation.

CIFAR-100. Similarly to CIFAR-10, CIFAR-100 has 50k training and 10k test images of resolution 32 × 32, coming from 100 classes. We follow a protocol equivalent to the one of CIFAR-10. We evaluate with 40 and 100 labeled images per class, corresponding to 4k and 10k labeled images in total. There are 3 such dataset splits; mean error and standard deviation are reported.

Mini-ImageNet. We introduce an SSL evaluation setup for Mini-ImageNet [39], which is a subset of the well-known ImageNet [6] dataset and has been previously used for few-shot learning [11]. We use the train/test splits created in the work of Ravi and Larochelle [33]. It consists of 100 classes with 600 images per class, of resolution 84 × 84. We randomly assign 500 images from each class to the training set, and 100 images to the test set. The result is a train and test set of 50k and 10k images, respectively. We create three dataset splits for the cases of 40 and 100 labeled images per class, which correspond to 4k and 10k labeled images in total. Mean error and standard deviation over the three dataset splits are reported.

5.2. Training

We list the reproduced baselines and provide training details per algorithm and dataset.

Implementation. We build our implementation on top of the publicly available PyTorch code for the Mean Teacher (MT) approach [38] (https://github.com/CuriousAI/mean-teacher/tree/master/pytorch). The fully supervised baseline and MT are reproduced identically to the original implementation. In all our experiments SGD optimization is used.

Networks. Experiments on CIFAR-10 and CIFAR-100 are performed with the "13-layer" network that is used in prior work [23, 38], while on Mini-ImageNet, ResNet-18 [18] is used. Both networks consist of a feature extractor φ_θ followed by an FC layer and softmax. We add an ℓ2-normalization layer right after φ_θ (before the FC layer), providing unit-norm descriptors for the graph construction. The same choice is also adopted in the fully supervised baseline. One exception is all variants of MT, as we observed that the ℓ2-normalization layer slightly harms performance. We normalize images to have channel-wise zero mean and unit variance over the entire training set. Unlike prior work [38], we do not normalize the input images with ZCA, nor add Gaussian noise to the input layer, which result in worse performance according to our experiments.

Hyper-parameters and training choices are adapted from the MT method and implementation. These are fixed
for all approaches (re)produced by this work. The training is performed for 180 epochs in total. The initial learning rate l_0 is decayed with cosine annealing [25] so that it would reach zero after 210 epochs, with l_0 = 0.05 on CIFAR-10, and l_0 = 0.2 on CIFAR-100 and Mini-ImageNet. Random data augmentation is performed by 4×4 random translations [38] followed by horizontal flip on CIFAR-10 and CIFAR-100. On Mini-ImageNet, each image is randomly rotated by 10 degrees before a random horizontal flip. The batch size is 100 for CIFAR-10 and 128 for CIFAR-100 and Mini-ImageNet. All other learning parameters remain unchanged from the MT implementation.

The fully supervised approach corresponds to training with (2) and labeled images only. MT uses the additional dual output trick with coefficient 0.01. Both these approaches are reproduced.

Our approach is performed with mini-batch size B = B_U + B_L, where B_L images are labeled and B_U images are originally unlabeled. We set B_L = 50 for CIFAR-10.

We additionally reproduce a variant of our approach where the pseudo-labels are not provided by diffusion but derived from the network with (1) or from GTG propagation [8] instead. Training is performed with (12), as with our method. This is in the spirit of pseudo-labeling in prior work [36, 24].
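As a small worked example of the schedule above (a sketch; whether the decay is applied per epoch or per iteration is left unspecified here), the learning rate at a given epoch is:

```python
import math

def cosine_lr(epoch, l0, horizon=210):
    # Cosine annealing that would reach zero at `horizon` = 210 epochs,
    # although training stops earlier, at 180 epochs.
    return l0 * 0.5 * (1.0 + math.cos(math.pi * epoch / horizon))

# e.g. cosine_lr(0, 0.05) = 0.05, cosine_lr(90, 0.05) ~ 0.031, cosine_lr(180, 0.05) ~ 0.0025
```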
Pseudo-labeling | ω_i | ζ_j | CIFAR-10
Diffusion (7)   |     |     | 36.53 ± 1.42
Diffusion (7)   |     |  ✓  | 36.17 ± 1.98
Diffusion (7)   |  ✓  |     | 33.32 ± 1.53
Diffusion (7)   |  ✓  |  ✓  | 32.40 ± 1.80
GTG [8]         |  ✓  |  ✓  | 35.20 ± 2.23
Network (1)     |  ✓  |  ✓  | 35.17 ± 2.46

Table 1. Impact of weights ω_i, class weights ζ_j, and pseudo-labeling by diffusion prediction (7) or network prediction (1). Error rate is reported on CIFAR-10 with 500 labels.

[Figure 4. Accuracy of predicted pseudo-labels according to ground-truth on CIFAR-10 with 500 labeled images, over training epochs. Diffusion predictions (7) are compared against network predictions (1).]

[Figure 6. Error rate versus number of labeled images on CIFAR-10 using different methods (ours, MT [38], MT + ours).]

[Figure 7. Examples of incorrectly pseudo-labeled images with highest ω_i in CIFAR-10. Predicted class and ω_i are shown below each image.]

Dataset: CIFAR-10
Nb. labeled images                        | 500          | 1000         | 2000         | 4000
Fully supervised                          | 49.08 ± 0.83 | 40.03 ± 1.11 | 29.58 ± 0.93 | 21.63 ± 0.38
TDCNN [36]†                               | -            | 32.67 ± 1.93 | 22.99 ± 0.79 | 16.17 ± 0.37
Network prediction (1) + weights          | 35.17 ± 2.46 | 23.79 ± 1.31 | 16.64 ± 0.48 | 13.21 ± 0.61
Ours: Diffusion prediction (7) + weights  | 32.40 ± 1.80 | 22.02 ± 0.88 | 15.66 ± 0.35 | 12.69 ± 0.29
VAT [26]†                                 | -            | -            | -            | 11.36
Π model [23]†                             | -            | -            | -            | 12.36 ± 0.31
Temporal Ensemble [23]†                   | -            | -            | -            | 12.16 ± 0.24
MT [38]†                                  | -            | 27.36 ± 1.30 | 15.73 ± 0.31 | 12.31 ± 0.28
MT [38]                                   | 27.45 ± 2.64 | 19.04 ± 0.51 | 14.35 ± 0.31 | 11.41 ± 0.25
MT + Ours                                 | 24.02 ± 2.44 | 16.93 ± 0.70 | 13.22 ± 0.29 | 10.61 ± 0.28

Table 2. Comparison with the state of the art on CIFAR-10. Error rate is reported. The "13-layer" network is used. The top part of the table corresponds to training with pseudo-labels, while the bottom part includes methods that are complementary to ours, as shown by the combination of our method with MT. † denotes scores reported in prior work.

5.3. Ablation Study

We investigate the impact of the different components of our method. First, we study the effectiveness of the weights introduced in the loss function (12). Table 1 shows the classification performance on the CIFAR-10 test set when using only 500 labeled examples for training, with the rest of the training set considered unlabeled. Different weighting schemes are evaluated by setting all ω_i to one, all ζ_j to one, or both to one. It is shown that both weights have positive contributions. We also show the benefit of predicting with diffusion over predicting with the trained network or GTG propagation. Pseudo-labeling by the network predictions uses examples
that the network can already classify, while diffusion allows for accurate predictions beyond those examples. In Figure 4, we report the progress of the pseudo-label accuracy on the unlabeled images X_U throughout the training. Diffusion predictions are consistently better than network predictions.

Figure 5 demonstrates how ω_i accurately estimates the certainty of the prediction. From the plots we observe that predictions become more accurate as the training evolves, while at the beginning most examples are misclassified. The proposed weighting mechanism is robust to incorrect pseudo-labels and prevents model collapse. Figure 7 shows some of the incorrectly pseudo-labeled images with high certainty ω_i. Most of the incorrect labels come from trucks labeled as automobiles or birds labeled as frogs.

5.4. Comparison with the state-of-the-art

We present a comparison with the state of the art on all 3 datasets in Tables 2 and 3. The comparison includes performance reported in prior work and our reproduced results. In the case of the work by Shi et al. [36], we only compare with their TDCNN variant, which refers to pseudo-labeling for network training. The other loss terms in their work are complementary to ours, similarly to MT. We additionally compare with our implementation of pseudo-labeling with network predictions combined with the proposed weights.

The proposed approach performs the best out of the pseudo-label based approaches on CIFAR-10. Results in Figure 6 show that our benefit is larger when the number of labels is reduced. The results on CIFAR-10 show that our approach is complementary to unsupervised loss, such as the one used by MT. This combination achieves the best performance on this dataset. The same holds for CIFAR-100 and Mini-ImageNet with 10k available labels. Our method also achieves a lower error rate than temporal ensembling (38.65 ± 0.51) and the Π-model (39.19 ± 0.36) on CIFAR-100 [23] with 10k labels. On Mini-ImageNet with 4k available labels, the best performance is achieved when using our method without combining with Mean Teacher.

6. Conclusions

Most recent approaches for deep SSL rely on training with an unsupervised loss on both labeled and unlabeled images. We have proposed an approach that relies on graph-based label propagation to infer pseudo-labels for the unlabeled images. An additional training set is formed with these pseudo-labels, which are shown to be more valuable than the pseudo-labels inferred by the network itself. Our method is in principle complementary to unsupervised loss terms, which is experimentally shown in this work.

Acknowledgments. This work is supported by the GAČR grant 19-23165S and the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics".
References

[1] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[2] Siddhartha Chandra and Iasonas Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
[3] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.
[4] Dengxin Dai and Luc Van Gool. Ensemble projection for semi-supervised image classification. In ICCV, 2013.
[5] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[6] Wei Dong, Richard Socher, Li Li-Jia, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. Low-shot learning with large-scale diffusion. In CVPR, 2018.
[8] Ismail Elezi, Alessandro Torcinovich, Sebastiano Vascon, and Marcello Pelillo. Transductive label augmentation for improved deep network learning. arXiv preprint arXiv:1805.10546, 2018.
[9] Aykut Erdem and Marcello Pelillo. Graph transduction as a noncooperative game. Neural Computation, 24, 2012.
[10] Rob Fergus, Yair Weiss, and Antonio Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[11] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[13] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. IJCV, 124(2), 2017.
[14] Leo Grady. Random walks for image segmentation. IEEE Trans. PAMI, 28(11):1768–1783, 2006.
[15] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2005.
[16] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[17] Philip Haeusser, Alexander Mordvintsev, and Daniel Cremers. Learning by association – a versatile semi-supervised training method for neural networks. In CVPR, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Mining on manifolds: Metric learning without labels. In CVPR, 2018.
[20] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondrej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In CVPR, 2017.
[21] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
[22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[23] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
[24] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICMLW, 2013.
[25] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[26] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. PAMI, 2018.
[27] Avital Oliver, Augustus Odena, Colin Raffel, Ekin D Cubuk, and Ian J Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In ICLRW, 2018.
[28] Deepak Pathak, Ross B Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
[29] Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan Yuille. Deep co-training for semi-supervised image recognition. In ECCV, 2018.
[30] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. PAMI, 2018.
[31] Ilija Radosavovic, Piotr Dollar, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
[32] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
[33] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2016.
[34] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Mutual exclusivity loss for semi-supervised deep learning. In ICIP, 2016.
[35] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, 2016.
[36] Weiwei Shi, Yihong Gong, Chris Ding, Zhiheng Ma, Xiaoyu Tao, and Nanning Zheng. Transductive semi-supervised deep learning using min-max features. In ECCV, 2018.
[37] Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[38] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
[39] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[40] Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, 2017.
[41] Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In ICML, 2008.
[42] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. In CVPR, 2018.
[43] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In NIPS, 2003.
[44] Xiaojin Zhu, John Lafferty, and Ronald Rosenfeld. Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science, Pittsburgh, PA, 2005.
[45] Xiaojin Zhu, John D Lafferty, and Zoubin Ghahramani. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical report, 2003.