
Decoupling “when to update” from “how to update”

Eran Malach
School of Computer Science, The Hebrew University, Israel
[email protected]

Shai Shalev-Shwartz
School of Computer Science, The Hebrew University, Israel
[email protected]

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract
Deep learning requires data. A useful approach to obtaining data is to be creative and mine data from various sources that were created for different purposes. Unfortunately, this approach often leads to noisy labels. In this paper, we propose a meta algorithm for tackling the noisy labels problem. The key idea is to decouple “when to update” from “how to update”. We demonstrate the effectiveness of our algorithm by mining data for gender classification, combining the Labeled Faces in the Wild (LFW) face recognition dataset with a textual genderizing service, which leads to a noisy dataset. While our approach is very simple to implement, it leads to state-of-the-art results. We analyze some convergence properties of the proposed algorithm.

1 Introduction
In recent years, deep learning has achieved state-of-the-art results in a wide variety of tasks. However, neural networks are mostly trained using supervised learning, which requires a massive amount of labeled data. While collecting unlabeled data is relatively easy given the amount of data available on the web, providing accurate labels is usually an expensive task. To overcome this problem, data science becomes an art of extracting labels out of thin air. Some popular approaches to labeling are crowdsourcing, where the labeling is not done by experts, and mining available meta-data, such as text that is linked to an image in a webpage. Unfortunately, this gives rise to a problem of abundant noisy labels: labels are often corrupted [19], which can deteriorate the performance of neural networks [12].
Let us start with an intuitive explanation as to why noisy labels are problematic. Common neural
network optimization algorithms start with a random guess of what the classifier should be, and
then iteratively update the classifier based on stochastically sampled examples from a given dataset,
optimizing a given loss function such as the hinge loss or the logistic loss. In this process, wrong
predictions lead to an update of the classifier that would hopefully result in better classification
performance. While at the beginning of the training process the predictions are likely to be wrong,
as the classifier improves it will fail on fewer and fewer examples, and thus make fewer and fewer updates. In the presence of noisy labels, on the other hand, the effect of the noise increases as the classifier improves: the classifier may give the correct prediction, yet it is still updated due to the wrong label. Thus, in an advanced stage of the training process the majority of the updates may actually be due to wrongly labeled examples, and these updates prevent the classifier from improving further.
To tackle this problem, we propose to decouple the decision of “when to update” from the decision
of “how to update”. As mentioned before, in the presence of noisy labels, if we update only when
the classifier’s prediction differs from the available label, then at the end of the optimization process,
these few updates will probably be mainly due to noisy labels. We would therefore like a different update criterion that lets us decide whether it is worthwhile to update the classifier based on a given example. We would like to preserve the behavior of performing many updates at the beginning of the training process but only a few updates when we approach convergence. To do so, we suggest training two predictors and performing update steps only in case of disagreement between them. This way, as the predictors get better, the “area” of their disagreement shrinks, and updates are performed only on examples that lie in the disagreement area, thereby preserving the desired behavior of the standard optimization process. On the other hand, since we do not trigger an update based on disagreement with the label (which may be due to a problem in the label rather than in the predictor), this method keeps the effective rate of noisy labels seen throughout the training process constant.
The idea of deciding “when to update” based on a disagreement between classifiers is closely related
to approaches for active learning and selective sampling - a setup in which the learner does not have
unlimited access to labeled examples, but rather has to query for each instance’s label, provided at
a given cost (see, for example, [34]). Specifically, the well-known query-by-committee algorithm maintains a version space of hypotheses and, at each iteration, decides whether to query the label of a given instance by sampling two hypotheses uniformly at random from the version space [35, 14]. Naturally, maintaining the version space of deep networks seems intractable. Our algorithm maintains only two deep networks, whose difference stems from their random initialization. Therefore, unlike the original query-by-committee algorithm, which samples from the version space at every iteration, we sample from the original hypothesis class only once (at initialization), and from there on we update these two hypotheses using the backpropagation rule whenever they disagree on the label. To the best of our knowledge, this algorithm has not been proposed or analyzed previously, neither in the active learning literature nor as a method for dealing with noisy labels.
To show that this method indeed improves the robustness of deep learning to noisy labels, we con-
duct an experiment that aims to study a real-world scenario of acquiring noisy labels for a given
dataset. We consider the task of gender classification based on images. We did not have a dedicated
dataset for this task. Instead, we relied on the Labeled Faces in the Wild (LFW) dataset, which con-
tains images of different people along with their names, but with no information about their gender.
To find the gender for each image, we use an online service to match a gender to a given name (as is
suggested by [25]), a method which is naturally prone to noisy labels (due to unisex names). Apply-
ing our algorithm to an existing neural network architecture reduces the effect of the noisy labels,
achieving better results than similar available approaches, when tested on a clean subset of the data.
We also performed a controlled experiment in which the base algorithm is the perceptron, and show that using our approach leads to a noise-resilient algorithm that can handle extremely high label noise rates of up to 40%. The controlled experiments are detailed in Appendix B.
In order to provide theoretical guarantees for our meta algorithm, we need to tackle two questions:
1. does this algorithm converge? and if so, how quickly? and 2. does it converge to an optimum? We
give a positive answer to the first question, when the base algorithm is the perceptron and the noise
is label flip with a constant probability. Specifically, we prove that the expected number of iterations
required by the resulting algorithm equals (up to a constant factor) that of the perceptron in the
noise-free setting. As for the second question, clearly, the convergence depends on the initialization
of the two predictors. For example, if we initialize the two predictors to be the same predictor, the
algorithm will not perform any updates. Furthermore, we derive lower bounds on the quality of the
solution even if we initialize the two predictors at random. In particular, we show that for some
distributions, the algorithm’s error will be bounded away from zero, even in the case of linearly
separable data. This raises the question of whether a better initialization procedure may be helpful.
Indeed, we show that for the same distribution mentioned above, even if we add random label noise, if we initialize the predictors by performing a few vanilla perceptron iterations, then the algorithm performs much better. Despite this worst-case pessimism, we show that empirically, when working
with natural data, the algorithm converges to a good solution. We leave a formal investigation of
distribution dependent upper bounds to future work.

2 Related Work

The effect of noisy labels has been studied extensively for many different learning algorithms (see, for example, the survey in [13]), and various solutions to this problem have been proposed, some of them with theoretically provable bounds, including methods based on statistical queries, boosting, bagging and more [21, 26, 7, 8, 29, 31, 23, 27, 3]. Our focus in this paper is on the problem of noisy labels in the context of deep learning. Recently, there have been several works aiming at improving the resilience of deep learning to noisy labels. To the best of our knowledge, there are four main approaches. The first
changes the loss function. The second adds a layer that tries to mimic the noise behavior. The third
groups examples into buckets. The fourth tries to clean the data as a preprocessing step. Beyond
these approaches, there are methods that assume a small clean data set and another large, noisy, or
even unlabeled, data set [30, 6, 38, 1]. We now list some specific algorithms from these families.
[33] proposed to change the cross entropy loss function by adding a regularization term that takes
into account the current prediction of the network. This method is inspired by a technique called
minimum entropy regularization, detailed in [17, 16]. It was also found to be effective by [12],
which suggested a further improvement of this method by effectively increasing the weight of the
regularization term during the training procedure.
[28] suggested using a probabilistic model that models the conditional probability of seeing a wrong label, where the correct label is a latent variable of the model. While [28] assume that the probability of label-flips between classes is known in advance, a follow-up work by [36] extends this method to the case where these probabilities are unknown. An improved method, which takes into account the
fact that some instances might be more likely to have a wrong label, has been proposed recently
in [15]. In particular, they add another softmax layer to the network, that can use the output of
the last hidden layer of the network in order to predict the probability of the label being flipped.
Unfortunately, their method involves optimizing the biases of the additional softmax layer by first
training it on a simpler setup (without using the last hidden layer), which implies two-phase training
that further complicates the optimization process. It is worth noting that there are some other works
that suggest methods that are very similar to [36, 15], with a slightly different objective or training
method [5, 20], or otherwise suggest a complicated process which involves estimation of the class-
dependent noise probabilities [32]. Another method from the same family is the one described in [37], which suggests differentiating between “confusing” noise, where some features of the example make it hard to label, and completely random label noise, where the mislabeling has no clear reason.
[39] suggested training the network to predict labels for a randomly selected group of images from the same class, instead of classifying each image individually. In their method, a group of images is fed as input to the network, which merges their inner representations at a deeper level of the network, adds an attention model to each image, and produces a single prediction.
Therefore, noisy labels may appear in groups with correctly labeled examples, thus diminishing
their impact. The final setup is rather complicated, involving many hyper-parameters, rather than
providing a simple plug-and-play solution to make an existing architecture robust to noisy labels.
From the family of preprocessing methods, we mention [4, 10], which try to eliminate instances that are suspected to be mislabeled. Our method shares a similar motivation of disregarding contaminated instances, but without the cost of complicating the training process with a preprocessing phase.
In our experiment we test the performance of our method against methods that are as simple as training a vanilla neural network. In particular, from the family of modified loss functions we chose the two variants of the regularized cross-entropy loss suggested by [33] (soft and hard bootstrapping). From the family of methods that add a layer modeling the noise, we chose to compare to one of the models suggested in [15] (which is very similar to the model proposed by [36]), because this model does not require any assumptions or complication of the training process. We find that our method outperforms all of these competing methods, while being extremely simple to implement.
Finally, as mentioned before, our “when to update” rule is closely related to approaches for active
learning and selective sampling, and in particular to the query-by-committee algorithm. In [14] a
thorough analysis is provided for various base algorithms implementing the query-by-committee
update rule, and particularly they analyze the perceptron base algorithm under some strong distribu-
tional assumptions. In other works, an ensemble of neural networks is trained in an active learning
setup to improve the generalization of neural networks [11, 2, 22]. Our method could be seen as a simplified member of the family of ensemble methods. As mentioned before, our motivation is very different from that of active learning, since our main goal is dealing with noisy labels rather than reducing the number of label queries. To the best of our knowledge, the algorithm we propose was
not used or analyzed in the past for the purpose of dealing with noisy labels in deep learning.

3 Method
As mentioned before, to tackle the problem of noisy labels, we suggest changing the update rule commonly used in deep learning optimization algorithms in order to decouple the decision of “when to update” from “how to update”. In our approach, the decision of “when to update” does not depend on the label. Instead, it depends on a disagreement between two different networks. This method can be thought of as a meta-algorithm that uses two base classifiers, performing updates according to a base learning algorithm, but only on examples for which the two classifiers disagree.
To put this formally, let $\mathcal{X}$ be an instance space and $\mathcal{Y}$ be the label space, and assume we sample examples from a distribution $\tilde D$ over $\mathcal{X} \times \mathcal{Y}$, with possibly noisy labels. We wish to train a classifier $h$ coming from a hypothesis class $\mathcal{H}$. We rely on an update rule, $U$, that updates $h$ based on its current value as well as a mini-batch of $b$ examples. The meta-algorithm receives as input a pair of classifiers $h_1, h_2 \in \mathcal{H}$, the update rule $U$, and a mini-batch size $b$. Pseudo-code is given in Algorithm 1.
Note that we do not specify how to initialize the two base classifiers, $h_1, h_2$. When using deep learning as the base algorithm, perhaps the easiest approach is to perform a random initialization.
Another approach is to first train the two classifiers while following the regular “when to update”
rule (which is based on the label y), possibly training each classifier on a different subset of the data,
and switching to the suggested update rule only in an advanced stage of the training process. We
later show that the second approach is preferable.
At the end of the optimization process, we can simply return one of the trained classifiers. If a small, accurately labeled test set is available, we can instead return the classifier with the better accuracy on this clean test data.

Algorithm 1 Update by Disagreement

input:
  an update rule U
  batch size b
  two initial predictors h1, h2 ∈ H
for t = 1, 2, . . . , N do
  draw a mini-batch (x1, y1), . . . , (xb, yb) ∼ D̃^b
  let S = {(xi, yi) : h1(xi) ≠ h2(xi)}
  h1 ← U(h1, S)
  h2 ← U(h2, S)
end for
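To make the control flow concrete, the following Python sketch implements Algorithm 1 for a generic base learner. It is a minimal illustration, not the paper's released code: the names `update_rule`, `data_sampler`, and the `predict` interface are assumptions introduced here, and the data are assumed to be NumPy arrays.

```python
def update_by_disagreement(h1, h2, update_rule, data_sampler, batch_size, num_iters):
    """Algorithm 1 (Update by Disagreement), sketched for any base learner.

    h1, h2          -- classifiers exposing predict(X) -> array of labels
    update_rule     -- U(h, S): applies one base update (e.g. an SGD step) on batch S
    data_sampler(b) -- draws b (possibly noisily labeled) examples from D-tilde
    """
    for _ in range(num_iters):
        X, y = data_sampler(batch_size)              # (x_1, y_1), ..., (x_b, y_b) ~ D-tilde
        disagree = h1.predict(X) != h2.predict(X)    # S = {(x_i, y_i) : h1(x_i) != h2(x_i)}
        if disagree.any():
            S = (X[disagree], y[disagree])
            h1 = update_rule(h1, S)                  # both predictors are updated
            h2 = update_rule(h2, S)                  # on the same disagreement set
    return h1, h2
```

Note that the label y_i is used only inside the base update U; the decision of whether to update at all never looks at it.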

4 Theoretical analysis
Since a convergence analysis for deep learning is beyond our reach even in the noise-free setting, we focus on analyzing properties of our algorithm for linearly separable data that is corrupted by random label noise, using the perceptron as the base algorithm.
Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, $\mathcal{Y} = \{\pm 1\}$, and let $D$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$ such that there exists $w^*$ for which $D(\{(x, y) : y\langle w^*, x \rangle < 1\}) = 0$. The distribution we observe, denoted $\tilde D$, is a noisy version of $D$. Specifically, to sample $(x, \tilde y) \sim \tilde D$ one should sample $(x, y) \sim D$ and output $(x, y)$ with probability $1 - \mu$ and $(x, -y)$ with probability $\mu$. Here, $\mu \in [0, 1/2)$. Finally, let $\mathcal{H}$ be the class of linear classifiers, namely, $\mathcal{H} = \{x \mapsto \operatorname{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$. We use the perceptron's update rule with a mini-batch size of 1. That is, given the classifier $w_t \in \mathbb{R}^d$, the update on example $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ is $w_{t+1} = U(w_t, (x_t, y_t)) := w_t + y_t x_t$.
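For concreteness, here is a small NumPy sketch (ours, not the paper's code) of this setup: sampling from $\tilde D$ by flipping each clean label with probability $\mu$, and running Update by Disagreement with the perceptron update rule and mini-batch size 1. The margin-1 separability condition is only approximated here, and all function names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy(w_star, n, mu):
    """Draw n examples with ||x|| <= 1, label them with w_star, then flip each
    label independently with probability mu (the label-flip noise model)."""
    X = rng.standard_normal((n, w_star.shape[0]))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x|| <= 1
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    flips = rng.random(n) < mu
    return X, np.where(flips, -y, y)

def disagreement_perceptron(X, y_noisy, w1, w2):
    """Update by Disagreement with the perceptron update rule, mini-batch size 1."""
    updates = 0
    for x, y in zip(X, y_noisy):
        if np.sign(w1 @ x) != np.sign(w2 @ x):   # update only when the predictors disagree
            w1, w2 = w1 + y * x, w2 + y * x      # perceptron update, applied to both
            updates += 1
    return w1, w2, updates
```

Counting `updates` in such a simulation gives a simple empirical check of the convergence behavior that Theorem 1 below quantifies.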
As mentioned in the introduction, to provide a full theoretical analysis of this algorithm, we need to
account for two questions:
1. does this algorithm converge? and if so, how quickly?
2. does it converge to an optimum?

4
Theorem 1 below provides a positive answer to the first question. It shows that the number of updates of our algorithm is larger only by a constant factor (which depends on the initial vectors and the amount of noise) relative to the bound for the vanilla perceptron in the noise-free case.
Theorem 1 Suppose that the “Update by Disagreement” algorithm is run on a sequence of $N$ random examples from $\tilde D$, with initial vectors $w_0^{(1)}, w_0^{(2)}$. Denote $K = \max_i \|w_0^{(i)}\|$, and let $T$ be the number of updates performed by the “Update by Disagreement” algorithm. Then,
$$\mathbb{E}[T] \le \frac{3\,(4K + 1)\,\|w^*\|^2}{(1 - 2\mu)^2},$$
where the expectation is with respect to the randomness of sampling from $\tilde D$.

Proof It will be more convenient to rewrite the algorithm as follows. We perform $N$ iterations, where at iteration $t$ we receive $(x_t, \tilde y_t)$ and update $w_{t+1}^{(i)} = w_t^{(i)} + \tau_t \tilde y_t x_t$, where
$$\tau_t = \begin{cases} 1 & \text{if } \operatorname{sign}(\langle w_t^{(1)}, x_t \rangle) \ne \operatorname{sign}(\langle w_t^{(2)}, x_t \rangle) \\ 0 & \text{otherwise.} \end{cases}$$
Observe that we can write $\tilde y_t = \theta_t y_t$, where $(x_t, y_t) \sim D$ and $\theta_t$ is a random variable with $\mathbb{P}[\theta_t = 1] = 1 - \mu$ and $\mathbb{P}[\theta_t = -1] = \mu$. We also use the notation $v_t = y_t \langle w^*, x_t \rangle$ and $\tilde v_t = \theta_t v_t$. Our goal is to upper bound $\bar T := \mathbb{E}[T] = \mathbb{E}[\sum_t \tau_t]$.

We start by showing that
$$\mathbb{E}\left[\sum_{t=1}^{N} \tau_t \tilde v_t\right] \ge (1 - 2\mu)\,\bar T. \qquad (1)$$
Indeed, since $\theta_t$ is independent of $\tau_t$ and $v_t$, we get that
$$\mathbb{E}[\tau_t \tilde v_t] = \mathbb{E}[\tau_t \theta_t v_t] = \mathbb{E}[\theta_t] \cdot \mathbb{E}[\tau_t v_t] = (1 - 2\mu)\,\mathbb{E}[\tau_t v_t] \ge (1 - 2\mu)\,\mathbb{E}[\tau_t],$$
where in the last inequality we used the fact that $v_t \ge 1$ with probability 1 and $\tau_t$ is non-negative. Summing over $t$ we obtain that Equation 1 holds.

Next, we show that for $i \in \{1, 2\}$,
$$\|w_N^{(i)}\|^2 \le \|w_0^{(i)}\|^2 + \sum_{t=1}^{N} \tau_t \left(2\|w_0^{(2)} - w_0^{(1)}\| + 1\right). \qquad (2)$$
Indeed, since the updates of $w_{t+1}^{(1)}$ and $w_{t+1}^{(2)}$ are identical, we have that $\|w_{t+1}^{(1)} - w_{t+1}^{(2)}\| = \|w_0^{(1)} - w_0^{(2)}\|$ for every $t$. Now, whenever $\tau_t = 1$ we have that either $y_t \langle w_{t-1}^{(1)}, x_t \rangle \le 0$ or $y_t \langle w_{t-1}^{(2)}, x_t \rangle \le 0$. Assume w.l.o.g. that $y_t \langle w_{t-1}^{(1)}, x_t \rangle \le 0$. Then,
$$\|w_t^{(1)}\|^2 = \|w_{t-1}^{(1)} + y_t x_t\|^2 = \|w_{t-1}^{(1)}\|^2 + 2 y_t \langle w_{t-1}^{(1)}, x_t \rangle + \|x_t\|^2 \le \|w_{t-1}^{(1)}\|^2 + 1.$$
Second,
$$\begin{aligned} \|w_t^{(2)}\|^2 &= \|w_{t-1}^{(2)} + y_t x_t\|^2 = \|w_{t-1}^{(2)}\|^2 + 2 y_t \langle w_{t-1}^{(2)}, x_t \rangle + \|x_t\|^2 \\ &\le \|w_{t-1}^{(2)}\|^2 + 2 y_t \langle w_{t-1}^{(2)} - w_{t-1}^{(1)}, x_t \rangle + \|x_t\|^2 \\ &\le \|w_{t-1}^{(2)}\|^2 + 2\|w_{t-1}^{(2)} - w_{t-1}^{(1)}\| + 1 = \|w_{t-1}^{(2)}\|^2 + 2\|w_0^{(2)} - w_0^{(1)}\| + 1. \end{aligned}$$
Therefore, the above two equations imply that whenever $\tau_t = 1$, for all $i \in \{1, 2\}$, $\|w_t^{(i)}\|^2 \le \|w_{t-1}^{(i)}\|^2 + 2\|w_0^{(2)} - w_0^{(1)}\| + 1$. Summing over $t$ we obtain that Equation 2 holds.

Equipped with Equation 1 and Equation 2 we are ready to prove the theorem. Denote $K = \max_i \|w_0^{(i)}\|$ and note that $\|w_0^{(2)} - w_0^{(1)}\| \le 2K$. We prove the theorem by providing upper and lower bounds on $\mathbb{E}[\langle w_N^{(i)}, w^* \rangle]$. Combining the update rule with Equation 1 we get:
$$\mathbb{E}[\langle w_N^{(i)}, w^* \rangle] = \langle w_0^{(i)}, w^* \rangle + \mathbb{E}\left[\sum_{t=1}^{N} \tau_t \tilde v_t\right] \ge \langle w_0^{(i)}, w^* \rangle + (1 - 2\mu)\bar T \ge -K\|w^*\| + (1 - 2\mu)\bar T.$$
To construct an upper bound, first note that Equation 2 implies that
$$\mathbb{E}[\|w_N^{(i)}\|^2] \le \|w_0^{(i)}\|^2 + \left(2\|w_0^{(2)} - w_0^{(1)}\| + 1\right)\bar T \le K^2 + (4K + 1)\,\bar T.$$
Using the above and Jensen's inequality, we get that
$$\mathbb{E}[\langle w_N^{(i)}, w^* \rangle] \le \mathbb{E}\big[\|w_N^{(i)}\|\,\|w^*\|\big] \le \|w^*\| \sqrt{\mathbb{E}[\|w_N^{(i)}\|^2]} \le \|w^*\| \sqrt{K^2 + (4K + 1)\bar T}.$$
Comparing the upper and lower bounds, we obtain that
$$-K\|w^*\| + (1 - 2\mu)\bar T \le \|w^*\| \sqrt{K^2 + (4K + 1)\bar T}.$$
Using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, the above implies that
$$(1 - 2\mu)\bar T - \|w^*\|\sqrt{4K + 1}\,\sqrt{\bar T} - 2K\|w^*\| \le 0.$$
Denote $\alpha = \|w^*\|\sqrt{4K + 1}$; then the above also implies that $(1 - 2\mu)\bar T - \alpha\sqrt{\bar T} - \alpha^2 \le 0$. Denote $\beta = \alpha/(1 - 2\mu)$. Using standard algebraic manipulations, the above implies that
$$\bar T \le \beta^2 + \beta^{1.5} + \beta \le 3\beta^2,$$
where we used the fact that $\|w^*\|$ must be at least 1 for the separability assumption to hold, hence $\beta \ge 1$. This concludes our proof.

The above theorem tells us that our algorithm converges quickly. We next address the second question, regarding the quality of the point to which the algorithm converges. As mentioned in the introduction, the convergence must depend on the initial predictors. Indeed, if $w_0^{(1)} = w_0^{(2)}$, then the algorithm will not make any updates. The next question is what happens if we initialize $w_0^{(1)}$ and $w_0^{(2)}$ at random. The lemma below shows that this does not suffice to ensure convergence to the optimum, even if the data is linearly separable without noise. The proof for this lemma is given in Appendix A.

Lemma 1 Fix some $\delta \in (0, 1)$ and let $d$ be an integer greater than $40 \log(1/\delta)$. There exists a distribution over $\mathbb{R}^d \times \{\pm 1\}$, separable by a weight vector $w^*$ for which $\|w^*\|^2 = d$, such that running the “Update by Disagreement” algorithm, with the perceptron as the underlying update rule, and with every coordinate of $w_0^{(1)}, w_0^{(2)}$ initialized according to any symmetric distribution over $\mathbb{R}$, will yield a solution whose error is at least $1/8$, with probability of at least $1 - \delta$.

Trying to circumvent the lower bound given in the above lemma, one may wonder what would happen if we initialize $w_0^{(1)}, w_0^{(2)}$ differently. Intuitively, maybe noisy labels are not such a big problem at the beginning of the learning process. Therefore, we can initialize $w_0^{(1)}, w_0^{(2)}$ by running the vanilla perceptron for several iterations, and only then switch to our algorithm. Trivially, for the distribution we constructed in the proof of Lemma 1, this approach will work, simply because in the noise-free setting both $w_0^{(1)}$ and $w_0^{(2)}$ will converge to vectors that give the same predictions as $w^*$. But what would happen in the noisy setting, when we flip the label of every example with probability $\mu$? The lemma below shows that the error of the resulting solution is likely to be of order $\mu^3$. Here again, the proof is given in Appendix A.

Lemma 2 Consider a vector $w^* \in \{\pm 1\}^d$ and the distribution $\tilde D$ over $\mathbb{R}^d \times \{\pm 1\}$ such that, to sample a pair $(x, \tilde y)$, we first choose $x$ uniformly at random from $\{e_1, \ldots, e_d\}$, set $y = \langle w^*, x \rangle$, and set $\tilde y = y$ with probability $1 - \mu$ and $\tilde y = -y$ with probability $\mu$. Let $w_0^{(1)}, w_0^{(2)}$ be the result of running the vanilla perceptron algorithm on random examples from $\tilde D$ for any number of iterations. Suppose that we then run the “Update by Disagreement” algorithm for an additional arbitrary number of iterations. Then, the error of the solution is likely to be $\Omega(\mu^3)$.

To summarize, we see that without making additional assumptions on the data distribution, it is
impossible to prove convergence of our algorithm to a good solution. In the next section we show
that for natural data distributions, our algorithm converges to a very good solution.

5 Experiments
We now demonstrate the merit of our suggested meta-algorithm using an empirical evaluation. Our main experiment uses our algorithm with deep networks in a real-world scenario of noisy labels.

In particular, we use a hypothesis class of deep networks and Stochastic Gradient Descent (SGD) with momentum as the base update rule. The task is classifying face images according to gender. As training data, we use the Labeled Faces in the Wild (LFW) dataset, which provides the name of the person in each image but no gender label. To construct gender labels, we used an external service that provides gender labels based on names. This process resulted in noisy labels. We show that our method leads to state-of-the-art results on this task, compared to competing noise-robustness
methods. We also performed controlled experiments to demonstrate our algorithm’s performance on
linear classification with varying levels of noise. These results are detailed in Appendix B.

5.1 Deep Learning

We applied our algorithm, with SGD with momentum as the base update rule, to the task of labeling images of faces according to gender. The images were taken
from the Labeled Faces in the Wild (LFW) benchmark [18]. This benchmark consists of 13,233
images of 5,749 different people collected from the web, labeled with the name of the person in the
picture. Since the gender of each subject is not provided, we follow the method of [25] and use a
service that determines a person’s gender by their name (if it is recognized), along with a confidence
level. This method gives rise to “natural” noisy labels due to “unisex” names, and therefore allows
us to experiment with a real-world setup of a dataset with noisy labels.

Figure 1: Images from the dataset tagged as female, showing correctly labeled and mislabeled examples for the names Kim (88% confidence), Morgan (64%), Joan (82%), and Leslie (88%). [Figure not reproduced.]

We have constructed train and test sets as follows. We first took all the individuals on which the
gender service gave 100% confidence. We divided this set at random into three subsets of equal
size, denoted N1 , N2 , N3 . We denote by N4 the individuals on which the confidence level is in
[90%, 100%), and by N5 the individuals on which the confidence level is in [0%, 90%). Needless to say, the sets N1, . . . , N5 are pairwise disjoint.
We repeated each experiment three times, each time using a different Ni as the test set, for i ∈ {1, 2, 3}. Suppose N1 is the test set; then for the training set we used two configurations (a sketch of this construction appears after the list):

1. A dataset consisting of all the images that belong to names in N2, N3, N4, N5, where unrecognized names were labeled as male (since the majority of subjects in LFW are males).
2. A dataset consisting of all the images that belong to names in N2, N3, N4.
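Purely as an illustration of this split (the gender service and its output format are not specified here), a hypothetical `name_confidence` mapping from LFW names to the service's confidence could be partitioned as follows:

```python
import random

def build_splits(name_confidence, seed=0):
    """Partition names by the gender service's confidence: N1-N3 are a random
    three-way split of the 100%-confidence names, N4 has confidence in
    [90, 100), and N5 has confidence below 90."""
    rng = random.Random(seed)
    certain = sorted(n for n, c in name_confidence.items() if c == 100)
    rng.shuffle(certain)
    third = len(certain) // 3
    n1, n2, n3 = certain[:third], certain[third:2 * third], certain[2 * third:]
    n4 = [n for n, c in name_confidence.items() if 90 <= c < 100]
    n5 = [n for n, c in name_confidence.items() if c < 90]
    return n1, n2, n3, n4, n5

# e.g., test on N1; training configuration 1 uses N2-N5, configuration 2 uses N2-N4.
```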

We use the network architecture suggested by [24], using an available TensorFlow implementation.¹


It should be noted that we did not change any parameters of the network architecture or the optimization process, and used the default parameters of the implementation. Since the number of male and female subjects in the dataset is not balanced, our objective is to maximize the balanced accuracy [9], namely, the average of the accuracies obtained on each of the two classes.
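A minimal sketch of this metric (our own helper, not taken from the paper's code):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class accuracies, so that the majority class
    (here, males) cannot dominate the score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))
```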
Training is done for 30,000 iterations with mini-batches of 128 examples. In order to make the networks’ disagreement meaningful, we initialize the two networks by training both of them normally (updating on all the examples) until iteration 5,000, at which point we switch to training with the “Update by Disagreement” rule. Since we no longer update on all examples, we down-weight batches that retain fewer than 10% of the examples in the original batch, in order to stabilize the gradients.²
¹ https://github.com/dpressel/rude-carnie
² Code is available online at https://github.com/emalach/UpdateByDisagreement.
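The following PyTorch-style sketch illustrates one way to implement the procedure just described (warm-up phase, disagreement filtering, and down-weighting of small disagreement batches). It is a minimal illustration under our own assumptions about the models, the data loader, and the exact down-weighting scheme; the authors' actual implementation is the one linked in footnote 2.

```python
import torch

def train_update_by_disagreement(net1, net2, opt1, opt2, loader, loss_fn,
                                 warmup_iters=5000, total_iters=30000, min_frac=0.1):
    """Train two networks normally for warmup_iters iterations, then update
    them only on examples on which their predictions disagree."""
    it = 0
    while it < total_iters:
        for x, y in loader:                              # y: possibly noisy labels
            it += 1
            weight = 1.0
            if it > warmup_iters:                        # switch to Update by Disagreement
                with torch.no_grad():
                    disagree = net1(x).argmax(dim=1) != net2(x).argmax(dim=1)
                if not disagree.any():
                    continue
                frac = disagree.float().mean().item()
                if frac < min_frac:                      # tiny disagreement batches are
                    weight = frac / min_frac             # down-weighted to stabilize gradients
                x, y = x[disagree], y[disagree]
            for net, opt in ((net1, opt1), (net2, opt2)):
                opt.zero_grad()
                (weight * loss_fn(net(x), y)).backward()
                opt.step()
            if it >= total_iters:
                break
    return net1, net2
```

At the end of training, either network can be returned; if a small clean validation set is available, the one with the higher balanced accuracy on it can be chosen, as described in Section 3.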

We inspect the balanced accuracy on our test data during the training process, comparing our method to vanilla neural network training, as well as to the soft and hard bootstrapping described in [33] and to the s-model described in [15], all of which use the same network architecture. We use the initialization parameters for [33, 15] that were suggested in the original papers. We show that while in the other methods the accuracy effectively decreases during training due to overfitting the noisy labels, in our method this effect is much less substantial, allowing the network to keep improving.
We study two different scenarios: one in which a small clean test set is available for model selection, so we can choose the iteration with the best test accuracy, and a more realistic scenario in which no clean test data is at hand. For the first scenario, we observe the balanced accuracy of the best
available iteration. For the second scenario, we observe the balanced accuracy of the last iteration.
As can be seen in Figure 2 and the supplementary results listed in Table 1 in Appendix B, our
method outperforms the other methods in both situations. This is true for both datasets, although, as
expected, the improvement in performance is less substantial on the cleaner dataset.
The second-best algorithm is the s-model described in [15]. Since our method can be applied to
any base algorithm, we also applied our method on top of the s-model. This yields even better
performance, especially when the data is less noisy, where we obtain a significant improvement.

Figure 2: Balanced accuracy of all methods on clean test data, trained on the two different datasets (left panel: Dataset #1, more noise; right panel: Dataset #2, less noise). [Figure not reproduced.]

6 Discussion
We have described an extremely simple approach for supervised learning in the presence of noisy
labels. The basic idea is to decouple the “when to update” rule from the “how to update” rule. We
achieve this by maintaining two predictors and updating them based on their disagreement. We have shown
that this simple approach leads to state-of-the-art results.
Our theoretical analysis shows that the approach leads to a fast convergence rate when the underlying update rule is the perceptron. We have also shown that proving that the method converges to an optimal solution must rely on distributional assumptions. There are several immediate open questions that we leave to future work. First, suggesting distributional assumptions that are likely to hold in practice and proving that the algorithm converges to an optimal solution under these assumptions. Second, extending the convergence proof beyond linear predictors. While obtaining absolute convergence guarantees seems beyond reach at the moment, coming up with oracle-based convergence guarantees may be feasible.

Acknowledgements: This research is supported by the European Research Council (TheoryDL project).

References
[1] Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised
learning. In Proceedings of the 24th international conference on Machine learning, pages
25–32. ACM, 2007.

[2] Les E Atlas, David A Cohn, Richard E Ladner, Mohamed A El-Sharkawi, Robert J Marks,
ME Aggoune, and DC Park. Training connectionist networks with queries and selective sam-
pling. In NIPS, pages 566–573, 1989.

[3] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for
efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Sym-
posium on Theory of Computing, pages 449–458. ACM, 2014.

[4] Ricardo Barandela and Eduardo Gasca. Decontamination of training samples for supervised
pattern recognition methods. In Joint IAPR International Workshops on Statistical Techniques
in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages
621–630. Springer, 2000.

[5] Alan Joseph Bekker and Jacob Goldberger. Training deep neural-networks based on unreli-
able labels. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International
Conference on, pages 2682–2686. IEEE, 2016.

[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In
Proceedings of the eleventh annual conference on Computational learning theory, pages 92–
100. ACM, 1998.

[7] Jakramate Bootkrajang and Ata Kabán. Label-noise robust logistic regression and its appli-
cations. In Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pages 143–158. Springer, 2012.

[8] Jakramate Bootkrajang and Ata Kabán. Boosting in the presence of label noise. arXiv preprint
arXiv:1309.6818, 2013.

[9] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M Buhmann.
The balanced accuracy and its posterior distribution. In Pattern recognition (ICPR), 2010 20th
international conference on, pages 3121–3124. IEEE, 2010.

[10] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial
Intelligence Research, 11:131–167, 1999.

[11] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning.
Machine learning, 15(2):201–221, 1994.

[12] David Flatow and Daniel Penner. On the robustness of convnets to training on noisy labels. http://cs231n.stanford.edu/reports/flatow_penner_report.pdf, 2017.

[13] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey.
IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.

[14] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the
query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.

[15] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural networks using a noise adap-
tation layer. Under review for ICLR, 2017.

[16] Yves Grandvalet and Yoshua Bengio. Entropy regularization. Semi-supervised learning, pages
151–168, 2006.

[17] Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. In
NIPS, volume 17, pages 529–536, 2004.

[18] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. Technical
report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[19] Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. Quality management on amazon me-
chanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages
64–67. ACM, 2010.

[20] Pravin Kakar and Alex Yong-Sang Chia. Probabilistic learning from mislabelled data for mul-
timedia content recognition. In Multimedia and Expo (ICME), 2015 IEEE International Con-
ference on, pages 1–6. IEEE, 2015.

[21] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM
(JACM), 45(6):983–1006, 1998.

[22] Anders Krogh, Jesper Vedelsby, et al. Neural network ensembles, cross validation, and active
learning. Advances in neural information processing systems, 7:231–238, 1995.

[23] Jan Larsen, L Nonboe, Mads Hintz-Madsen, and Lars Kai Hansen. Design of robust neural
network classifiers. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the
1998 IEEE International Conference on, volume 2, pages 1205–1208. IEEE, 1998.

[24] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, pages 34–42, 2015.

[25] Philip Masek and Magnus Thulin. Evaluation of face recognition apis and libraries. Master’s
thesis, University of Gothenburg, 2015.

[26] Ross A McDonald, David J Hand, and Idris A Eckley. An empirical comparison of three
boosting algorithms on real data sets with artificial class noise. In International Workshop on
Multiple Classifier Systems, pages 35–44. Springer, 2003.

[27] Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary
labels with instance-dependent corruption. arXiv preprint arXiv:1605.00751, 2016.

[28] Volodymyr Mnih and Geoffrey E Hinton. Learning to label aerial images from noisy data.
In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages
567–574, 2012.

[29] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning
with noisy labels. In Advances in neural information processing systems, pages 1196–1204,
2013.

[30] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In
Proceedings of the ninth international conference on Information and knowledge management,
pages 86–93. ACM, 2000.

[31] Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization,
weakly supervised learning and label noise robustness. arXiv preprint arXiv:1602.02450,
2016.

[32] Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. Mak-
ing neural networks robust to label noise: a loss correction approach. arXiv preprint
arXiv:1609.03683, 2016.

[33] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and An-
drew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv
preprint arXiv:1412.6596, 2014.

[34] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-
66):11, 2010.

[35] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceed-
ings of the fifth annual workshop on Computational learning theory, pages 287–294. ACM,
1992.
[36] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Train-
ing convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
[37] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive
noisy labeled data for image classification. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 2691–2699, 2015.
[38] Xiaojin Zhu. Semi-supervised learning literature survey. Computer Sciences TR 1530, 2005.
[39] Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian Reid. Attend in groups:
a weakly-supervised deep learning framework for learning from web data. arXiv preprint
arXiv:1611.09960, 2016.

