
Tri-Net For Semi-Supervised Deep Learning

This document summarizes a research paper on tri-net, a deep neural network approach for semi-supervised learning that uses unlabeled data to improve performance when only limited labeled data is available. The tri-net model learns three initial modules and has them label unlabeled data for each other. It considers techniques for model initialization, diversity augmentation, and pseudo-label editing. Experiments on benchmark datasets show tri-net achieves state-of-the-art results, including an error rate of 8.30% on CIFAR-10 using only 4,000 labeled examples.


Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Tri-net for Semi-Supervised Deep Learning

Dong-Dong Chen, Wei Wang, Wei Gao, Zhi-Hua Zhou


National Key Laboratory for Novel Software Technology
Nanjing University, Nanjing 210023, China
{chendd, wangw, gaow, zhouzh}@lamda.nju.edu.cn

Abstract

Deep neural networks have achieved great success in various real applications, but they require a large amount of labeled data for training. In this paper, we propose tri-net, a deep neural network which is able to use massive unlabeled data to help learning with limited labeled data. We consider model initialization, diversity augmentation and pseudo-label editing simultaneously. In our work, we utilize output smearing to initialize modules, use fine-tuning on labeled data to augment diversity and eliminate unstable pseudo-labels to alleviate the influence of suspicious pseudo-labeled data. Experiments show that our method achieves the best performance in comparison with state-of-the-art semi-supervised deep learning methods. In particular, it achieves 8.30% error rate on CIFAR-10 by using only 4000 labeled examples.

1 Introduction

Deep neural networks (DNNs) have become a hot topic during the past few years, and great successes have been achieved in various real applications, such as image classification [Krizhevsky et al., 2012], object detection [Girshick et al., 2014], scene labeling [Shelhamer et al., 2017], etc. DNNs learn a large number of parameters and therefore require a large amount of labeled data to alleviate overfitting. It is well known that collecting a large amount of high-quality labeled data is expensive, yet we can easily collect abundant unlabeled data in many real applications. Hence, it is desirable to use unlabeled data to improve the performance of DNNs when training with limited labeled data.

A natural idea is to combine semi-supervised learning [Chapelle et al., 2006; Zhu, 2007; Zhou and Li, 2010] with deep learning. Disagreement-based learning [Zhou and Li, 2010] plays an important role in semi-supervised learning, in which co-training [Blum and Mitchell, 1998] and tri-training [Zhou and Li, 2005b] are two representatives. The basic idea of disagreement-based semi-supervised learning is to train multiple learners for the task and exploit the disagreements during the learning process. The disagreement in co-training is based on different views, while tri-training uses bootstrap sampling to get diverse training sets. Co-training has been combined with deep models for tasks which have two views [Cheng et al., 2016; Ardehaly and Culotta, 2017]. Nevertheless, in real applications, we often confront tasks with one-view data, and tri-training can be utilized no matter whether there are one or more views.

In this paper, we propose tri-net which combines tri-training with deep models. We first learn three initial modules, and each module is then used to predict a pool of unlabeled data, where two modules label some unlabeled instances for another module. Later, the three modules are refined by using the newly labeled examples. We consider three key techniques in tri-net, i.e., model initialization, diversity augmentation and pseudo-label editing, which can be summarized as follows: we use output smearing [Breiman, 2000] to help generate diverse and accurate initial modules; we fine-tune the modules in some specific rounds on labeled data to augment the diversity among them; we propose a data editing method named DES based on the intuition that stable pseudo-labels are more reliable. Experiments are conducted on three benchmark datasets, i.e., MNIST, SVHN and CIFAR-10, and the results demonstrate that our tri-net has good performance on all datasets. In particular, it achieves 8.45% error rate on CIFAR-10 by using only 4,000 labeled examples. With more sophisticated initialization methods, tri-net can get even better performance. For example, when we use the semi-supervised deep learning method Π model [Laine and Aila, 2016] to initialize our tri-net, we can achieve 8.30% error rate on CIFAR-10 by using only 4,000 labeled examples.

The rest of this paper is organized as follows: we introduce related work in Section 2 and present our tri-net in Section 3. Experimental results are given in Section 4. Finally, we conclude in Section 5.

2 Related Work

Many methods have been proposed to tackle semi-supervised learning; we only introduce the most related ones. For more information on semi-supervised learning, see [Chapelle et al., 2006; Zhu, 2007; Zhou and Li, 2010].

Disagreement-based semi-supervised learning started from the seminal paper of Blum and Mitchell [1998] on co-training. Co-training first learns two classifiers from two views and then lets them label unlabeled data for each other to improve performance. However, in most real applications the data sets have only one view rather than two. Some methods

[Figure 1 diagram: the labeled data trains the shared module $M_S$ and the three modules $M_1$, $M_2$ and $M_3$. The pseudo-labeled data of each module is drawn from the unlabeled data and labeled by the other two modules: $M_2$ and $M_3$ label for $M_1$, $M_1$ and $M_3$ label for $M_2$, and $M_1$ and $M_2$ label for $M_3$.]

Figure 1: Training process of tri-net.
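The teaching pattern in Figure 1 can be illustrated with a small sketch. It assumes each module's output on a pool of unlabeled instances is already available as a list of class-probability lists; the function names, the `posteriors` layout and the example numbers are hypothetical and not taken from the paper. The acceptance rule (the two teachers agree, and the average of their maximum posteriors exceeds a threshold `sigma`) follows the description of confident predictions in Section 3.1; the stability check of Section 3.4 is left out here.

```python
def argmax(probs):
    """Index of the largest entry in a list of probabilities."""
    return max(range(len(probs)), key=probs.__getitem__)


def label_for_module(v, posteriors, sigma):
    """Pseudo-label pool instances for module v using the other two modules.

    posteriors[m][i] is module m's class-probability list for pool instance i
    (an assumed layout for this sketch). Returns {instance index: pseudo-label}.
    """
    j, h = [m for m in range(3) if m != v]  # the two teaching modules
    pseudo = {}
    for i, (pj, ph) in enumerate(zip(posteriors[j], posteriors[h])):
        agree = argmax(pj) == argmax(ph)
        avg_confidence = (max(pj) + max(ph)) / 2.0
        if agree and avg_confidence > sigma:
            pseudo[i] = argmax(pj)
    return pseudo


# Toy posteriors for three modules on three pool instances (two classes).
posteriors = [
    [[0.90, 0.10], [0.60, 0.40], [0.20, 0.80]],  # module index 0
    [[0.98, 0.02], [0.30, 0.70], [0.10, 0.90]],  # module index 1
    [[0.97, 0.03], [0.40, 0.60], [0.05, 0.95]],  # module index 2
]
```

With a strict threshold only the instance on which both teachers are very confident is passed to the third module; lowering `sigma` admits more pseudo-labels at the cost of more label noise.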

employed different learning algorithms or different parameter configurations to learn two different classifiers [Goldman and Zhou, 2000; Zhou and Li, 2005a]. Although these methods do not rely on the existence of two views, they require special learning algorithms to construct classifiers. Zhou and Li [2005b] proposed tri-training, which utilizes bootstrap sampling to get three different training sets and generates three classifiers from these three training sets respectively. Tri-training requires neither the existence of multiple views nor special learning algorithms, thus it can be applied to more real applications. For these algorithms, there have been some theoretical studies to explain why unlabeled data can improve the learning performance [Blum and Mitchell, 1998; Balcan et al., 2004; Wang and Zhou, 2010; Balcan and Blum, 2010].

With the fast development of deep learning, disagreement-based semi-supervised learning has been combined with deep models for some applications. Cheng et al. [2016] developed a semi-supervised multimodal deep learning framework based on co-training to deal with the RGB-D object-recognition task. They utilized each view (i.e., RGB and depth) to learn a DNN, and the two DNNs labeled unlabeled data to augment the training set. Ardehaly and Culotta [2017] combined co-training with deep models to address the demographic classification task. They generated two DNNs from two views (i.e., image and text) respectively and let them provide pseudo-labels for each other. Nevertheless, many tasks have only one view in real applications. It is more desirable to develop disagreement-based deep models for one-view data.

There are many other methods in semi-supervised deep learning. Some of them are based on generative models. These methods aim to learn the input distribution $p(x)$. Variational auto-encoders (VAEs) combined variational methods with DNNs to help estimate $p(x)$ [Kingma et al., 2014; Maaløe et al., 2016], while generative adversarial networks (GANs) aimed to leverage a generator to detect the low-density boundaries [Salimans et al., 2016; Dai et al., 2017]. In contrast to these generative methods, our tri-net is a discriminative model and does not need to estimate $p(x)$. Some combined graph-based methods with deep neural networks [Weston et al., 2012; Luo et al., 2017]. They enforced smoothness of the predictions with respect to the graph structure, while we do not need to construct the graph. Some were perturbation-based discriminative methods. They utilized local variations of the input to regularize the output to be smooth [Bachman et al., 2014; Rasmus et al., 2015; Laine and Aila, 2016; Sajjadi et al., 2016]. VAT [Miyato et al., 2017] and VAdD [Park et al., 2018] introduced adversarial training [Goodfellow et al., 2014] into these methods, while temporal ensembling [Laine and Aila, 2016] and mean teacher [Tarvainen and Valpola, 2017] introduced ensemble learning [Zhou, 2012] into them. Compared with these state-of-the-art methods, our method can achieve better performance.

3 Our Approach

3.1 Overview

In semi-supervised learning, we have a small labeled data set $\mathcal{L} = \{(x_l, y_l) \mid l = 1, 2, \ldots, L\}$ with $L$ labeled examples and a large-scale unlabeled data set $\mathcal{U} = \{x_u \mid u = 1, 2, \ldots, U\}$ with $U$ unlabeled instances. Suppose the data have $C$ classes and $y_l = (y_{l1}, y_{l2}, \ldots, y_{lC})$, where $y_{lc} = 1$ if the example belongs to the $c$-th class and $y_{lc} = 0$ otherwise, for $c = 1, 2, \ldots, C$. Our goal is to learn a model from the training set $\mathcal{L} \cup \mathcal{U}$ to classify unseen instances. In this paper, we propose tri-net by combining tri-training with deep neural networks. Our tri-net has three phases which are described as follows.

Initialization. The first step in tri-net is to generate three accurate and diverse modules. Instead of training three networks separately, tri-net is one DNN which is composed of a shared module $M_S$ and three different modules $M_1$, $M_2$ and $M_3$. Here, $M_1$, $M_2$ and $M_3$ classify the shared features generated by $M_S$. This network structure is inspired by Saito et al. [2017] and is efficient for implementation. In order to get three accurate and diverse modules, we use output smearing (Section 3.2) to generate three different labeled data sets, i.e., $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$. We train $M_S$, $M_1$, $M_2$ and $M_3$ simultaneously on the three data sets. Specifically, $M_S$ and $M_v$ are trained on $\mathcal{L}^v_{os}$ ($v = 1, 2, 3$).

Training. In the training process, some unlabeled data will be labeled and added into the labeled training sets. In order not to change the distribution of labeled training sets, we assume that the unlabeled data are selected from a pool of $\mathcal{U}$. We use $N$ to denote the size of the pool. This strategy is widely used in semi-supervised learning [Blum and Mitchell, 1998; Zhou and Li, 2005a; Saito et al., 2017]. With three modules, if two modules agree on the prediction of an unlabeled instance from the pool and the prediction is confident and stable, the two modules will teach the third module on this instance. The instance with the pseudo-label predicted by the two modules is added into the training set of the third module. Then the third module is refined with the augmented training set. Here, confident prediction means that the average maximum posterior probability of the two modules is


larger than the threshold $\sigma$. Stable prediction means that the pseudo-label should not change much when the modules predict the instance repeatedly and the details will be presented in Section 3.4. Three modules will be more and more similar since they augment the training sets of one another [Wang and Zhou, 2017]. To tackle this problem, we fine-tune the modules on labeled data to augment the diversity among them in some specific rounds. The whole training process is shown in Algorithm 1.

Algorithm 1 Tri-net
Input:
    Labeled set L and unlabeled set U
    Labeling: the methods of labeling when the predictions of two classifiers are confident and agree with each other
    DES: the methods of pseudo-label editing
    σ_0: the initial threshold parameter for filtrating the unconfident pseudo-labels
    σ_os: the value to decrease σ if output smearing is used in this learning round
Output:
    Tri-net: the model composed of M_S, M_1, M_2 and M_3
 1: Initialization:
 2: Generate {L_os^1, L_os^2, L_os^3} by using output smearing on L
 3: Train M_S, M_1, M_2, M_3 with mini-batch from training set L_os^1, L_os^2, L_os^3
 4: flag_os = 1; σ = σ_0
 5: Training:
 6: for t = 1 → T do
 7:     N_t = min(1000 × 2^t, U)
 8:     if N_t = U then
 9:         if mod(t, 4) = 0 then
10:             Train M_S, M_1, M_2, M_3 with mini-batch from training set L_os^1, L_os^2, L_os^3
11:             flag_os = 1; σ = σ − 0.05
12:             continue
13:     if flag_os = 1 then
14:         flag_os = 0; σ_t = σ − σ_os
15:     else
16:         σ_t = σ
17:     for v = 1 → 3 do
18:         PL_v ← ∅
19:         PL_v ← Labeling(M_S, M_j, M_h, U, N_t, σ_t) (j, h ≠ v)
20:         PL_v ← DES(M_S, PL_v, M_j, M_h)
21:         L̂_v ← L ∪ PL_v
22:         if v = 1 then
23:             Train M_S, M_v with mini-batch from training set L̂_v
24:         else
25:             Train M_v with mini-batch from training set L̂_v
26: return M_S, M_1, M_2 and M_3

Inference. Given an unseen instance $x$, we use the average of the posterior probability of the three modules as the posterior probability of our method. The unseen instance $x$ is classified with maximum posterior probability shown in Eq. 1, where $M_S$ denotes the shared module and $M_v(M_S(x))$ denotes the label predicted by $M_v$ ($v = 1, 2, 3$) on $x$.

$$y = \arg\max_{c \in \{1, 2, \ldots, C\}} \Big\{ p\big(M_1(M_S(x)) = c \mid x\big) + p\big(M_2(M_S(x)) = c \mid x\big) + p\big(M_3(M_S(x)) = c \mid x\big) \Big\} \tag{1}$$

3.2 Output Smearing

Output smearing was proposed by Breiman [2000]. It constructs diverse training sets by injecting random noise into true labels and generates modules from the diverse training sets respectively. Injecting noise into true labels can also regularize the modules by smoothing the labels [Szegedy et al., 2016]. We apply this technique to initialize our modules $M_1$, $M_2$ and $M_3$. For an example $(x_l, y_l)$ ($l = 1, 2, \ldots, L$), where $y_l = (y_{l1}, y_{l2}, \ldots, y_{lC})$, $y_{lc} = 1$ if the example belongs to the $c$-th class otherwise $y_{lc} = 0$. In output smearing, we add noise into every component of $y_l$.

$$\hat{y}_{lc} = y_{lc} + \mathrm{ReLU}(z_{lc} \times \mathrm{std}) \tag{2}$$

where $z_{lc}$ is sampled independently from the standard normal distribution, std is the standard deviation, ReLU is a function

$$\mathrm{ReLU}(a) = \begin{cases} a, & a > 0, \\ 0, & a \le 0. \end{cases} \tag{3}$$

Here, we use ReLU function to ensure $\hat{y}_{lc}$ non-negative and normalize $\hat{y}_{lc}$ according to Eq. 4.

$$\hat{y}_l = (\hat{y}_{l1}, \hat{y}_{l2}, \ldots, \hat{y}_{lC}) \Big/ \sum_{c=1}^{C} \hat{y}_{lc} \tag{4}$$

With output smearing, we construct three diverse training sets $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$ from the initial labeled data set $\mathcal{L}$, where $\mathcal{L}^v_{os} = \{(x_l, \hat{y}_l^v) \mid 1 \le l \le L\}$ ($v = 1, 2, 3$) is constructed by output smearing and $\hat{y}_l^v$ is calculated according to Eq. 4. Then we initialize tri-net with $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$ by minimizing Loss shown in Eq. 5.

$$\mathrm{Loss} = \frac{1}{L} \sum_{l=1}^{L} \Big\{ L_y\big(M_1(M_S(x_l)), \hat{y}_l^1\big) + L_y\big(M_2(M_S(x_l)), \hat{y}_l^2\big) + L_y\big(M_3(M_S(x_l)), \hat{y}_l^3\big) \Big\} \tag{5}$$

Here, $L_y$ denotes the standard softmax cross-entropy loss function, $M_S$ denotes the shared module, $M_1$, $M_2$ and $M_3$ denote the three modules in tri-net, $M_v(M_S(x_l))$ denotes the output of $M_v$ on $x_l$ where $M_v$ classifies the features generated by $M_S$ on $x_l$ ($v = 1, 2, 3$).

3.3 Diversity Augmentation

Diversity among three modules in tri-net plays an important role in the training process. When three modules label unlabeled data to augment the training sets of one another, they become more and more similar. In order to maintain the diversity, we fine-tune three modules $M_1$, $M_2$ and $M_3$ on the diverse training sets $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$ in some specific rounds. In the experiments, the fine-tuning is executed every 3 rounds, which will be described in Section 4.

3.4 Pseudo-Label Editing

The pseudo-labels of the newly labeled examples may be incorrect, and these incorrect pseudo-labels will degenerate the performance. Data editing which can deal with the suspicious pseudo-labels is important and there have been some data-editing methods in semi-supervised learning [Zhang and


[Figure 2 diagram, layer listing as labeled in the figure (sizes given as a×a×h with padding n):
Conv block: three Conv a×a×h layers, pad n. Residual block: three Conv a×a×h layers, pad n.
$M_S$: Conv block 3x3x128 (pad 1), max-pooling 2x2 (2x2 stride).
$M_1$: Conv block 5x5x256 (pad 2), max-pooling 2x2 (2x2 stride), Conv 5x5x512 (pad 1), Conv 1x1x512 (pad 0), Conv 1x1x512 (pad 0), average-pooling, FC.
$M_2$: Conv block 3x3x256 (pad 1), max-pooling 2x2 (2x2 stride), Conv 3x3x512 (pad 0), Conv 1x1x512 (pad 0), Conv 1x1x512 (pad 0), average-pooling, FC.
$M_3$: Residual block 3x3x256 (pad 1), max-pooling 2x2 (2x2 stride), Residual block 3x3x512 (pad 1), max-pooling 2x2 (2x2 stride), Conv 1x1x512 (pad 0), Conv 1x1x512 (pad 0), average-pooling, FC.]

Figure 2: The architecture of tri-net. It is composed of a shared module $M_S$ and three different modules $M_1$, $M_2$ and $M_3$.
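Output smearing, which Section 3.2 defines through Eqs. 2 to 4, can be illustrated with a minimal sketch in plain Python. The function names and the seeded generator are assumptions made for this illustration; only the arithmetic (add the positive part of scaled Gaussian noise to each label component, then renormalize) comes from the equations, and the sketch assumes every input label is a one-hot vector.

```python
import random


def smear_label(y, std, rng):
    """Apply Eqs. 2-4 to one one-hot label vector y (assumed to contain a 1)."""
    # Eq. 2 with the ReLU of Eq. 3: only the positive part of the noise is added.
    noisy = [y_c + max(0.0, rng.gauss(0.0, 1.0) * std) for y_c in y]
    # Eq. 4: renormalize so that the components sum to one.
    total = sum(noisy)
    return [value / total for value in noisy]


def make_smeared_sets(labels, std, seed=0):
    """Build three independently smeared copies of the labels, one per module."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [[smear_label(y, std, rng) for y in labels] for _ in range(3)]


labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
smeared_sets = make_smeared_sets(labels, std=0.05)
```

Because each of the three copies draws its own noise, the three modules see slightly different soft targets for the same inputs, which is the source of diversity at initialization.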

Zhou, 2011]. However, these existing methods are usually graph-based and are difficult to use in DNNs due to the high dimension. Here, we propose a new data-editing method for DNNs with dropout [Srivastava et al., 2014]. Generally, dropout works in two modes: in training mode, the connections of the network are different in every forward pass; in test mode, the connections are fixed. This means that the prediction of a network with dropout working in training mode may change. For each $(x_i, y_i)$, $y_i$ is the pseudo-label predicted by the modules working in test mode. We use dropout working in training mode to measure the stability of the pseudo-labeled data, i.e., we use the modules to predict the label of $x_i$ $K$ times in training mode and record the frequency $k$ with which the prediction is different from $y_i$. If $k > K/3$, we regard the pseudo-label $y_i$ of $x_i$ as an unstable pseudo-label. We eliminate these unstable pseudo-labels. We set $K = 9$ in all experiments.

4 Experiments

4.1 Setup

Datasets. We run experiments on three widely used benchmark datasets, i.e., MNIST, SVHN, and CIFAR-10. We randomly sample 100, 1,000, and 4,000 labeled examples from MNIST, SVHN and CIFAR-10 respectively as the initial labeled data set $\mathcal{L}$ and use the standard data split for testing as in previous work.

Network Architectures. The network architecture of tri-net for CIFAR-10 is shown in Figure 2, which is derived from the popular architecture [Laine and Aila, 2016] used in semi-supervised deep learning. In order to get more diversity among three modules, we use different convolution kernel sizes, different network structures (with/without residual block) and different depths for $M_1$, $M_2$ and $M_3$. The network architectures for MNIST and SVHN are similar to that in Figure 2 but smaller.

Parameters. In order to prevent the network from over-fitting, we gradually increase the pool size $N = 1000 \times 2^t$ up to the size of unlabeled data $U$ [Saito et al., 2017], where $t$ denotes the learning round. The maximal learning round $T$ is set to be 30 in all experiments. We gradually decrease the confidence threshold $\sigma$ after $N = U$ so that more unlabeled data are labeled (line 11, Algorithm 1). In the training process, we respectively fine-tune three modules $M_1$, $M_2$ and $M_3$ on the diverse training sets $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$ every 3 rounds after $N = U$ to maintain the diversity (line 10, Algorithm 1). Since random noise is injected into $\mathcal{L}^1_{os}$, $\mathcal{L}^2_{os}$ and $\mathcal{L}^3_{os}$, the confidence threshold $\sigma$ is decreased by $\sigma_{os}$ (line 14, Algorithm 1). We set $\sigma_0 = 0.999$ and $\sigma_{os} = 0.01$ in MNIST; $\sigma_0 = 0.95$ and $\sigma_{os} = 0.25$ in SVHN and CIFAR-10. We use dropout ($p = 0.5$) after each max-pooling layer, use Leaky-ReLU ($\alpha = 0.1$) as the activation function except for the FC layer, and use soft-max for the FC layer. We also use Batch-Normalization [Ioffe and Szegedy, 2015] for all layers except the FC layer. We use SGD with a mini-batch size of 16. The learning rate starts from 0.1 in initialization (from 0.02 in training) and is divided by 10 when the error plateaus. In initialization, three modules $M_1$, $M_2$ and $M_3$ are trained for up to 300 epochs in SVHN and CIFAR-10 (100 in MNIST). In training, three modules $M_1$, $M_2$ and $M_3$ are trained for up to 90 epochs in SVHN and CIFAR-10 (60 in MNIST). We set std = 0.05 in SVHN and CIFAR-10 (0.001 in MNIST). We use a weight decay of 0.0001 and a momentum of 0.9. Following the setting in Laine and Aila [2016], we use ZCA, random crop and horizontal flipping for CIFAR-10, and zero-mean normalization and random crop for SVHN.

4.2 Results

We compare our tri-net with the state-of-the-art methods shown in Table 1. Recently, Abbasnejad et al. [2017] exploited a pre-trained model in their infinite Variational Autoencoder (infinite VAE) method; however, the other state-of-the-art methods did not use a pre-trained model. To make a fair comparison, we do not use a pre-trained model either. The results in Table 1 indicate that tri-net has good performance. It achieves an error rate of 0.53% on MNIST with 100 labeled examples and an error rate of 8.45% on CIFAR-10 with 4000 labeled examples, which are better than those of the state-of-the-art methods. Since tri-net exploits three modules while the state-of-the-art methods exploit one or two modules, the time cost of tri-net is higher than that of these methods.

Tri-net includes an initialization phase; with more sophisticated initialization methods, tri-net could have better performance. Π model [Laine and Aila, 2016] is a recent semi-supervised deep learning method. It evaluates each input twice based on


Methods                                       MNIST (L = 100)   SVHN (L = 1000)   CIFAR-10 (L = 4000)
Ladder network [Rasmus et al., 2015]          0.89 ± 0.50       -                 20.40 ± 0.47*
GoodSemiBadGan [Dai et al., 2017]             0.795 ± 0.098     4.25 ± 0.03*      14.41 ± 0.03*
Π model [Laine and Aila, 2016]                -                 4.82 ± 0.17       12.36 ± 0.31
Temporal ensembling [Laine and Aila, 2016]    -                 4.42 ± 0.16       12.16 ± 0.24
Mean teacher [Tarvainen and Valpola, 2017]    -                 3.95 ± 0.19       12.31 ± 0.28
VAT + EntMin [Miyato et al., 2017]            -                 3.86              10.55
Π + SNTG [Luo et al., 2017]                   0.66 ± 0.07       3.82 ± 0.25       11.00 ± 0.13
VAdD(KL)+VAT [Park et al., 2018]              -                 3.55 ± 0.05       9.22 ± 0.10
Tri-net                                       0.53 ± 0.10       3.71 ± 0.14       8.45 ± 0.22
Tri-net + Π model                             0.52 ± 0.05       3.45 ± 0.10       8.30 ± 0.15

Table 1: Error rates (%) of methods on MNIST, SVHN and CIFAR-10. * indicates that the method does not use data augmentation.

                          MNIST                        SVHN                          CIFAR-10
                          err           agr            err            agr            err            agr
without output smearing   8.55 ± 0.00   85.69 ± 0.50   12.47 ± 0.12   82.56 ± 0.88   16.51 ± 0.09   81.47 ± 0.40
with output smearing      7.85 ± 0.48   86.52 ± 0.55   12.20 ± 0.21   81.25 ± 0.22   15.42 ± 0.17   79.98 ± 0.89

Table 2: Results of tri-net with/without output smearing. err means the error rate of the ensemble of the three modules $M_1$, $M_2$ and $M_3$. agr means the ratio of the data on which modules $M_1$, $M_2$ and $M_3$ agree.
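The two quantities reported in Table 2 can be computed along the following lines. This is only a sketch: the ensemble prediction follows Eq. 1 (the class with the largest summed posterior over the three modules), while the agreement ratio is interpreted here as the share of instances on which all three modules predict the same class, which is an assumption about the exact definition. The function names, the `posteriors` layout and the toy numbers are hypothetical.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


def ensemble_predict(p1, p2, p3):
    """Eq. 1: pick the class with the largest summed posterior."""
    summed = [a + b + c for a, b, c in zip(p1, p2, p3)]
    return argmax(summed)


def error_and_agreement(posteriors, labels):
    """Return (ensemble error %, agreement %) for posteriors[m][i] and true labels."""
    errors = agreed = 0
    for i, y in enumerate(labels):
        p1, p2, p3 = (posteriors[m][i] for m in range(3))
        if ensemble_predict(p1, p2, p3) != y:
            errors += 1
        if argmax(p1) == argmax(p2) == argmax(p3):
            agreed += 1
    n = len(labels)
    return 100.0 * errors / n, 100.0 * agreed / n


# Toy posteriors for three modules on four instances (two classes).
posteriors = [
    [[0.90, 0.10], [0.60, 0.40], [0.20, 0.80], [0.70, 0.30]],
    [[0.98, 0.02], [0.30, 0.70], [0.10, 0.90], [0.80, 0.20]],
    [[0.97, 0.03], [0.40, 0.60], [0.05, 0.95], [0.60, 0.40]],
]
labels = [0, 0, 1, 0]
err, agr = error_and_agreement(posteriors, labels)
```

Under this reading, a lower agreement value means the modules disagree more often, which is how the discussion of Table 2 uses agreement as a proxy for diversity.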

the neural network and calculates the loss between the two predictions to regularize the neural network. We also use the Π model to initialize the three modules $M_1$, $M_2$ and $M_3$ in tri-net and call it tri-net + Π model. The results are also shown in Table 1. From Table 1, we can see that tri-net + Π model performs better than tri-net and achieves an error rate of 3.45% on SVHN with 1000 labeled examples.

Tri-net is a semi-supervised learning method which uses unlabeled data to improve learning performance. It has been reported that semi-supervised learning with the exploitation of unlabeled data might deteriorate learning performance [Balcan and Blum, 2010; Chapelle et al., 2006]. We now examine whether the performance of tri-net deteriorates when it keeps on using unlabeled data. As tri-net labels more and more unlabeled data, we depict the error rates of the three modules $M_1$, $M_2$, $M_3$ and tri-net in every learning round in Figure 3, which shows that, except for very few learning rounds, the performance is not deteriorated by keeping on using unlabeled data.

4.3 Further Discussion

In order to generate three accurate and diverse modules $M_1$, $M_2$ and $M_3$, we introduce output smearing in initialization. We record the error rates of the ensemble of the three modules $M_1$, $M_2$, $M_3$ and their agreement in the initialization with/without output smearing. The results are shown in Table 2. Table 2 indicates that on all three datasets the error rates of the ensemble of $M_1$, $M_2$ and $M_3$ in initialization with output smearing are lower than those without output smearing. Three modules $M_1$, $M_2$ and $M_3$ generated with output smearing also have large diversity (low agreement means large diversity). As tri-net goes on, $M_1$, $M_2$ and $M_3$ become similar, and then fine-tuning is introduced to augment the diversity among them. Since some pseudo-labels may be incorrect, pseudo-label editing is used to alleviate the influence of suspicious pseudo-labels. To show whether these techniques are helpful to tri-net, we run experiments with/without them in tri-net, and the results are shown in Figure 4. Figure 4 indicates that when all three techniques are used, tri-net has the best performance. It implies that these techniques are necessary for tri-net and each of them makes a contribution to the good performance of tri-net.

Since different network structures are used to get three diverse modules $M_1$, $M_2$ and $M_3$, we also conduct experiments with the same network structure for the three modules $M_1$, $M_2$ and $M_3$ as a comparison. The results shown in Table 3 indicate that different structures bring better performance. The parameter $\sigma_{os}$ controls the confidence threshold when output smearing is used in the training process. We conduct the experiments with different $\sigma_{os} \in [0.01, 0.25]$, and the results shown in Table 4 indicate that tri-net is not very sensitive to the parameter $\sigma_{os}$.

datasets                    MNIST   SVHN   CIFAR-10
with the same structure     0.60    3.95   9.05
with different structures   0.53    3.71   8.45

Table 3: Error rates (%) of tri-net with the same/different structures.

5 Conclusion

In this paper, we propose tri-net for semi-supervised deep learning, in which we generate three modules to exploit unlabeled data by considering model initialization, diversity augmentation and pseudo-label editing simultaneously. Experiments on several benchmarks demonstrate that our method is superior to state-of-the-art semi-supervised deep learning


[Figure 3 plots: three panels, (a) MNIST, (b) SVHN and (c) CIFAR-10. Each panel plots the error rate (%) against the learning round (1 to 31) for $M_1$, $M_2$, $M_3$ and tri-net.]

Figure 3: Error rates of tri-net and its three modules $M_1$, $M_2$ and $M_3$.
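The per-round bookkeeping behind the learning rounds plotted in Figure 3 can be traced with a short sketch of the pool-size and threshold schedule. It follows lines 4 to 16 of Algorithm 1 as printed (pool size $N_t = \min(1000 \times 2^t, U)$, a fine-tuning round when $N_t = U$ and $t$ is a multiple of 4, and a threshold lowered by $\sigma_{os}$ in the first labeling round after training on the smeared sets). The function name, the tuple layout and the value `U = 46000` used in the example are assumptions for illustration, not values stated in the paper.

```python
def schedule(T, U, sigma0, sigma_os):
    """List (t, N_t, action, sigma_t) per round, following lines 4-16 of Algorithm 1."""
    flag_os, sigma = True, sigma0
    rounds = []
    for t in range(1, T + 1):
        n_t = min(1000 * 2 ** t, U)  # pool size for this round
        if n_t == U and t % 4 == 0:
            # Fine-tune on the smeared sets; no pseudo-labeling in this round.
            flag_os, sigma = True, sigma - 0.05
            rounds.append((t, n_t, "fine-tune", None))
            continue
        if flag_os:
            # First labeling round after training on smeared labels.
            flag_os, sigma_t = False, sigma - sigma_os
        else:
            sigma_t = sigma
        rounds.append((t, n_t, "label", sigma_t))
    return rounds


# Example with the SVHN / CIFAR-10 thresholds from Section 4.1 and a hypothetical U.
rounds = schedule(T=9, U=46000, sigma0=0.95, sigma_os=0.25)
```

In this example the pool reaches the full unlabeled set at round 6, round 8 is a fine-tuning round, and the labeling round that follows uses the lowered threshold.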

[Figure 4 plots: three bar-chart panels, (a) MNIST, (b) SVHN and (c) CIFAR-10. Each panel shows the error rate (%) of Tri-net, w/o os, w/o ft and w/o DES. Tri-net has the lowest bar in each panel (0.53, 3.71 and 8.45); the other bar labels are 0.64, 0.72 and 0.86 for MNIST, 3.92, 4.05 and 4.25 for SVHN, and 9.05, 9.15 and 9.55 for CIFAR-10.]

Figure 4: Error rates of tri-net with/without three techniques. Specifically, “w/o os” means tri-net without output smearing, “w/o ft” means tri-net without fine-tuning, and “w/o DES” means tri-net without pseudo-label editing.
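The DES criterion ablated in Figure 4 (Section 3.4) can be sketched as follows. The rule itself comes from the text: predict each pseudo-labeled instance $K$ times with dropout active, count the number $k$ of predictions that differ from the pseudo-label, and discard the pseudo-label when $k > K/3$. The callable `predict_with_dropout` is a hypothetical stand-in for running the labeling modules in training mode; how the paper combines the two modules' stochastic outputs is not specified, so that part is left abstract.

```python
def is_stable(pseudo_label, predict_with_dropout, x, K=9):
    """Return True unless more than K/3 of K stochastic predictions differ."""
    k = sum(1 for _ in range(K) if predict_with_dropout(x) != pseudo_label)
    return k <= K / 3


def des(pseudo_labeled, predict_with_dropout, K=9):
    """Keep only the (x, pseudo-label) pairs whose pseudo-label is stable."""
    return [
        (x, y)
        for x, y in pseudo_labeled
        if is_stable(y, predict_with_dropout, x, K)
    ]
```

With $K = 9$ as in the experiments, a pseudo-label survives if at most three of the nine stochastic predictions disagree with it.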

σos        0.01   0.05   0.1    0.25
MNIST      0.53   0.55   0.58   0.60
SVHN       4.23   4.09   3.81   3.71
CIFAR-10   9.38   9.10   8.65   8.45

Table 4: Error rates (%) of tri-net with different σos.

methods. In particular, it can achieve the error rate of 8.30% on CIFAR-10 by using only 4000 labeled examples. Extending tri-net with more modules could exploit the power of ensemble in labeling the unlabeled data confidently. In this situation, one important issue is to maintain the diversity among these modules, which will be an interesting research direction in semi-supervised deep learning.

Acknowledgments

This work was supported by the NSFC (61751306, 61673202, 61503179), the Jiangsu Science Foundation (BK20150586) and the Fundamental Research Funds for the Central Universities.

References

[Abbasnejad et al., 2017] Ehsan Abbasnejad, Anthony R. Dick, and Anton van den Hengel. Infinite variational autoencoder for semi-supervised learning. In CVPR, pages 781–790, 2017.
[Ardehaly and Culotta, 2017] Ehsan Mohammady Ardehaly and Aron Culotta. Co-training for demographic classification using deep learning from label proportions. In ICDM Workshop, pages 1017–1024, 2017.
[Bachman et al., 2014] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In NIPS, pages 3365–3373, 2014.
[Balcan and Blum, 2010] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learning. Journal of the ACM, 57(3):19:1–19:46, 2010.
[Balcan et al., 2004] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In NIPS, pages 89–96, 2004.
[Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
[Breiman, 2000] Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3):229–242, 2000.
[Chapelle et al., 2006] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-supervised learning. MIT Press, 2006.


[Cheng et al., 2016] Yanhua Cheng, Xin Zhao, Rui Cai, Zhiwei Li, Kaiqi Huang, and Yong Rui. Semi-supervised multimodal deep learning for RGB-D object recognition. In IJCAI, pages 3345–3351, 2016.
[Dai et al., 2017] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In NIPS, pages 6513–6523, 2017.
[Girshick et al., 2014] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[Goldman and Zhou, 2000] Sally A. Goldman and Yan Zhou. Enhancing supervised learning with unlabeled data. In ICML, pages 327–334, 2000.
[Goodfellow et al., 2014] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[Laine and Aila, 2016] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2016.
[Sajjadi et al., 2016] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, pages 1163–1171, 2016.
[Salimans et al., 2016] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pages 2226–2234, 2016.
[Shelhamer et al., 2017] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
[Tarvainen and Valpola, 2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195–1204, 2017.
[Wang and Zhou, 2010] Wei Wang and Zhi-Hua Zhou. A new analysis of co-training. In ICML, pages 1135–1142, 2010.
[Wang and Zhou, 2017] Wei Wang and Zhi-Hua Zhou. Theoretical foundation of co-training and disagreement-based algorithms. CoRR, abs/1708.04403, 2017.
[Weston et al., 2012] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade - Second Edition, pages 639–655. 2012.
[Luo et al., 2017] Yucen Luo, Jun Zhu, Mengxi Li, Yong
Ren, and Bo Zhang. Smooth neighbors on teacher graphs
for semi-supervised learning. CoRR, abs/1711.00258, [Zhang and Zhou, 2011] Min-Ling Zhang and Zhi-Hua
2017. Zhou. CoTrade: Confident co-training with data editing.
IEEE Transactions on Systems, Man, and Cybernetics,
[Maaløe et al., 2016] Lars Maaløe, Casper Kaae Sønderby, Part B: Cybernetics, 41(6):1612–1626, 2011.
Søren Kaae Sønderby, and Ole Winther. Auxiliary deep
generative models. In ICML, pages 1445–1453, 2016. [Zhou and Li, 2005a] Zhi-Hua Zhou and Ming Li. Semi-
supervised regression with co-training. In IJCAI, pages
[Miyato et al., 2017] Takeru Miyato, Shin-ichi Maeda, 908–916, 2005.
Masanori Koyama, and Shin Ishii. Virtual adversarial
[Zhou and Li, 2005b] Zhi-Hua Zhou and Ming Li. Tri-
training: a regularization method for supervised and
semi-supervised learning. CoRR, abs/1704.03976, 2017. training: Exploiting unlabeled data using three classifiers.
IEEE Transactions on Knowledge and Data Engineering,
[Park et al., 2018] Sungrae Park, Jun-Keon Park, Su-Jin 17(11):1529–1541, 2005.
Shin, and Il-Chul Moon. Adversarial dropout for super-
[Zhou and Li, 2010] Zhi-Hua Zhou and Ming Li. Semi-
vised and semi-supervised learning. In AAAI, 2018.
supervised learning by disagreement. Knowledge and In-
[Rasmus et al., 2015] Antti Rasmus, Mathias Berglund, formation Systems, 24(3):415–439, 2010.
Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi- [Zhou, 2012] Zhi-Hua Zhou. Ensemble Methods: Founda-
supervised learning with ladder networks. In NIPS, pages
tions and Algorithms. Chapman & Hall/CRC, 2012.
3546–3554, 2015.
[Zhu, 2007] Xiaojin Zhu. Semi-supervised learning lit-
[Saito et al., 2017] Kuniaki Saito, Yoshitaka Ushiku, and
erature survey. Technical Report 1530, University of
Tatsuya Harada. Asymmetric tri-training for unsupervised
Wisconsin-Madison, 2007.
domain adaptation. In ICML, pages 2988–2997, 2017.
