Learning From Noisy Labels With Deep Neural Networks: A Survey
Fig. 2. Categorization of recent deep learning methods for overcoming noisy labels.
A. Related Surveys
Frénay and Verleysen [13] discussed the potential negative consequences of learning from noisy labels and provided a comprehensive survey on noise-robust classification methods, focusing on conventional supervised approaches such as naïve Bayes and support vector machines. Furthermore, their survey included the definitions and sources of label noise as well as its taxonomy. Zhang et al. [28] discussed another aspect of label noise in crowdsourced data annotated by nonexperts and provided a thorough review of the expectation–maximization (EM) algorithms proposed to improve the quality of crowdsourced labels. Meanwhile, Nigam et al. [29] provided a brief introduction to deep learning algorithms proposed to manage noisy labels; however, the scope of these algorithms was limited to only two of the categories in Fig. 2, namely, the loss function and sample selection. Recently, Han et al. [30] summarized the essential components of robust learning with noisy labels, but their categorization differs from ours in philosophy: we mainly focus on systematic methodological differences, whereas they focused on more general views, such as input data, objective functions, and optimization policies. Furthermore, this survey is the first to present a comprehensive methodological comparison of existing robust training approaches.
B. Survey Scope

Robust training with DNNs has become critical for guaranteeing the reliability of machine learning algorithms. In addition to label noise, two other types of flawed training data have been actively studied by different communities [31], [32]. Adversarial learning is designed for small, worst-case perturbations of the inputs, so-called adversarial examples, which are maliciously constructed to deceive an already trained model into making errors [33]–[36]. Meanwhile, data imputation primarily deals with missing inputs in training data, where missing values are estimated from the observed ones [32], [37]. Adversarial learning and data imputation are closely related to robust learning, but handling feature noise is beyond the scope of this survey, which addresses learning from noisy labels.

II. PRELIMINARIES

In this section, the problem statement for supervised learning with noisy labels is provided along with the taxonomy of label noise. Managing noisy labels is a long-standing issue; therefore, we also review the basic conventional approaches and the theoretical foundations underlying robust deep learning. Table I summarizes the notation frequently used in this study.

A. Supervised Learning With Noisy Labels

Classification is a representative supervised learning task for learning a function that maps an input feature to a label [38]. In this article, we consider a $c$-class classification problem using a DNN with a softmax output layer. Let $\mathcal{X} \subset \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{0, 1\}^c$ be the ground-truth label space in a one-hot manner. In a typical classification problem, we are provided with a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ obtained from an unknown joint distribution $P_{\mathcal{D}}$ over $\mathcal{X} \times \mathcal{Y}$, where each $(x_i, y_i)$ is independent and identically distributed. The goal of the task is to learn the mapping function $f(\cdot\,; \Theta) : \mathcal{X} \rightarrow [0, 1]^c$ of the DNN parameterized by $\Theta$ such that $\Theta$ minimizes the empirical risk $R_{\mathcal{D}}(f)$,

$$R_{\mathcal{D}}(f) = \mathbb{E}_{\mathcal{D}}\big[\ell\big(f(x; \Theta), y\big)\big] = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \ell\big(f(x; \Theta), y\big) \tag{1}$$

where $\ell$ is a certain loss function.
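To make this procedure concrete, the following is a minimal PyTorch sketch of empirical risk minimization as in (1). It is a generic illustration under our own naming, not the training loop of any particular surveyed method; note that nn.CrossEntropyLoss expects integer class labels rather than the one-hot vectors used above.

```python
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of mini-batch training that minimizes the empirical risk in (1)."""
    criterion = nn.CrossEntropyLoss()    # a common instantiation of the loss
    model.train()
    for x, y in loader:                  # (x, y) drawn from the training set D
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)    # mini-batch estimate of R_D(f)
        loss.backward()                  # gradient of the empirical risk
        optimizer.step()                 # descend along the gradient direction
```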
As data labels are corrupted in various real-world scenarios, we aim to train the DNN from noisy labels. Specifically, we are provided with a noisy training dataset $\tilde{\mathcal{D}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$ obtained from a noisy joint distribution $P_{\tilde{\mathcal{D}}}$ over $\mathcal{X} \times \tilde{\mathcal{Y}}$, where $\tilde{y}$ is a noisy label that may not be true. Following the standard training procedure, a mini-batch $\mathcal{B}_t = \{(x_i, \tilde{y}_i)\}_{i=1}^{b}$ comprising $b$ examples is obtained randomly from the noisy training dataset $\tilde{\mathcal{D}}$ at time $t$. Subsequently, the DNN parameter $\Theta_t$ at time $t$ is updated along the descent direction of the empirical risk on the mini-batch $\mathcal{B}_t$,

$$\Theta_{t+1} = \Theta_t - \eta \nabla \bigg( \frac{1}{|\mathcal{B}_t|} \sum_{(x, \tilde{y}) \in \mathcal{B}_t} \ell\big(f(x; \Theta_t), \tilde{y}\big) \bigg) \tag{2}$$

where $\eta$ is a specified learning rate.

Here, the risk minimization process is no longer noise-tolerant because the loss is computed with the noisy labels. DNNs can easily memorize corrupted labels and correspondingly degenerate their generalization on unseen data [13], [28], [29]. Hence, mitigating the adverse effects of noisy labels is essential for enabling noise-tolerant training in deep learning.

B. Taxonomy of Label Noise

This section presents the types of label noise that have been adopted to design robust training algorithms. Even if data labels are corrupted from the ground-truth labels without any prior assumption, in essence, the corruption probability is affected by the dependency between data features or class labels. A detailed analysis of the taxonomy of label noise was provided by Frénay and Verleysen [13]. Most existing algorithms deal with instance-independent noise, whereas instance-dependent noise has not yet been extensively investigated owing to its complex modeling.

1) Instance-Independent Label Noise: A typical approach for modeling label noise assumes that the corruption process is conditionally independent of the data features when the true label is given [22], [39]. That is, the true label is corrupted by a noise transition matrix $T \in [0, 1]^{c \times c}$, where $T_{ij} := p(\tilde{y} = j \mid y = i)$ is the probability of the true label $i$ being flipped into a corrupted label $j$. In this approach, the noise is called symmetric (or uniform) noise with a noise rate $\tau \in [0, 1]$ if $\forall_{i = j}\, T_{ij} = 1 - \tau$ and $\forall_{i \neq j}\, T_{ij} = \tau/(c - 1)$, where a true label is flipped into each of the other labels with equal probability. In contrast to symmetric noise, the noise is called asymmetric (or label-dependent) noise if $\forall_{i = j}\, T_{ij} = 1 - \tau$ and $\exists_{i \neq j,\, i \neq k,\, j \neq k}\, T_{ij} > T_{ik}$, where a true label is more likely to be mislabeled into a particular label. For example, a "dog" is more likely to be confused with a "cat" than with a "fish." In the stricter case when $\forall_{i = j}\, T_{ij} = 1 - \tau$ and $\exists_{i \neq j}\, T_{ij} = \tau$, the noise is called pair noise, where a true label is flipped into only one certain label.
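These instance-independent noise models are easy to instantiate for synthetic benchmarks. The following NumPy sketch builds the two transition matrices defined above and samples corrupted labels from them; the function names and the paired class used for pair noise are our own illustrative choices.

```python
import numpy as np

def symmetric_T(c, tau):
    """Symmetric noise: T_ii = 1 - tau and T_ij = tau / (c - 1) for i != j."""
    T = np.full((c, c), tau / (c - 1))
    np.fill_diagonal(T, 1.0 - tau)
    return T

def pair_T(c, tau):
    """Pair noise: T_ii = 1 - tau and all the mass tau flips i to one fixed class."""
    T = np.eye(c) * (1.0 - tau)
    for i in range(c):
        T[i, (i + 1) % c] = tau          # arbitrary choice of the paired class
    return T

def corrupt_labels(y, T, seed=0):
    """Sample noisy labels with p(y_tilde = j | y = i) = T[i, j]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[i]) for i in y])
```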
2) Instance-Dependent Label Noise: For more realistic noise modeling, the corruption probability is assumed to depend on both the data features and the class labels [16], [40]. Accordingly, the corruption probability is defined as $\rho_{ij}(x) = p(\tilde{y} = j \mid y = i, x)$. Unlike the aforementioned noises, the data feature of an example $x$ also affects the chance of $x$ being mislabeled.

C. Nondeep Learning Approaches

For decades, numerous methods have been proposed to manage noisy labels using conventional machine learning techniques. These methods can be categorized into the following four groups [13], [29], [41].

1) Data Cleaning: Training data are cleaned by excluding examples whose labels are likely to be corrupted. Bagging and boosting are used to filter out false-labeled examples by removing examples with higher weights, because false-labeled examples tend to exhibit much higher weights than true-labeled examples [42], [43]. In addition, various methods, such as k-nearest neighbors, outlier detection, and anomaly detection, have been widely exploited to exclude false-labeled examples from noisy training data [44]–[46]. Nevertheless, this family of methods suffers from an over-cleaning issue, overly removing even true-labeled examples.

2) Surrogate Loss: Motivated by the noise tolerance of the 0–1 loss function [39], many researchers have attempted to resolve its inherent limitations, such as the computational hardness and nonconvexity that render gradient methods unusable. Hence, several convex surrogate loss functions that approximate the 0–1 loss have been proposed to train a specified classifier in the binary classification setting [47]–[51]. However, these loss functions cannot support the multiclass classification task.

3) Probabilistic Method: Under the assumption that the distribution of features is helpful in solving the problem of learning from noisy labels [52], the confidence of each label is estimated by clustering and then used in a weighted training scheme [53]. This confidence is also used to convert hard labels into soft labels that reflect the uncertainty of the labels [54]. In addition to these clustering approaches, several Bayesian methods have been proposed for graphical models such that they can benefit from any type of prior information in the learning process [55]. However, this family of methods may exacerbate the overfitting issue owing to the increased number of model parameters.

4) Model-Based Method: As conventional models, such as the SVM and the decision tree, are not robust to noisy labels, significant effort has been expended to improve their robustness. To develop a robust SVM model, examples misclassified during learning are penalized in the objective [56], [57]. In addition, several decision tree models have been extended with new split criteria to solve the overfitting issue when the training data are not fully reliable [58], [59]. However, it is infeasible to apply their design principles to deep learning.

Meanwhile, deep learning is more susceptible to label noise than traditional machine learning owing to its high expressive power, as proven by many researchers [21], [60], [61]. There has been significant effort to understand why noisy labels negatively affect the performance of DNNs [22], [61]–[63]. This theoretical understanding has led to algorithmic designs that achieve higher robustness than nondeep learning methods. A detailed analysis of the theoretical understanding behind robust deep learning was provided by Han et al. [30].
Fig. 3. High-level research overview of robust deep learning for noisy labels. The research directions that are actively contributed by the machine learning
community are categorized into five groups in blue italic.
D. Regression With Noisy Labels

In addition to classification, regression is another main topic of supervised machine learning; it aims to model the relationship between a number of features and a continuous target variable. Unlike the classification task with its discrete label space, the regression task considers a continuous variable as its target label [64]; thus, it learns the mapping function $f(\cdot\,; \Theta) : \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}$ is a continuous label space. Given the input feature $x$ and its ground-truth label $y$, two types of label noise are considered in the regression task. An additive noise [65] is formulated by $\tilde{y} := y + \epsilon$, where $\epsilon$ is drawn from a random distribution independent of the input feature; an instance-dependent noise [66] is formulated by $\tilde{y} := \rho(x)$, where $\rho : \mathcal{X} \rightarrow \mathcal{Y}$ is a noise function dependent on the input feature.

Although regression predicts continuous values, regression and classification share the same concept of learning a mapping function from the input feature $x$ to the output label $y$. Thus, many robust approaches for classification are easily extended to the regression problem with simple modifications [67]. Hence, in this survey, we focus on the classification setting, for which most robust methods are defined.

III. DEEP LEARNING APPROACHES

According to our comprehensive survey, the robustness of deep learning can be enhanced through numerous approaches [16], [25], [68]–[74]. Fig. 3 shows an overview of the recent research directions pursued by the machine learning community. All of them (i.e., Sections III-A–III-E) focus on making the supervised learning process more robust to label noise.

1) Robust Architecture (Section III-A): Adding a noise adaptation layer at the top of an underlying DNN to learn the label transition process, or developing a dedicated architecture to reliably support more diverse types of label noise.
2) Robust Regularization (Section III-B): Enforcing a DNN to overfit less to false-labeled examples, either explicitly or implicitly.
3) Robust Loss Function (Section III-C): Improving the loss function itself.
4) Loss Adjustment (Section III-D): Adjusting the loss value according to the confidence of a given loss (or label) by loss correction, loss reweighting, or label refurbishment.
5) Sample Selection (Section III-E): Identifying true-labeled examples from noisy training data via multinetwork or multiround learning.

Overall, we categorize all recent deep learning methods into five groups corresponding to popular research directions, as shown in Fig. 3. In Section III-D, meta-learning is also discussed because it finds the optimal hyperparameters for loss reweighting. In Section III-E, we discuss recent efforts to combine sample selection with other orthogonal directions or with semisupervised learning toward state-of-the-art performance. Fig. 2 shows the categorization of robust training methods into these five groups.

A. Robust Architecture

In numerous studies, architectural changes have been made to model the noise transition matrix of a noisy dataset [16], [75]–[82]. These changes include adding a noise adaptation layer at the top of the softmax layer and designing a new dedicated architecture. The resulting architectures yield improved generalization by modifying the DNN output based on the estimated label transition probability.

1) Noise Adaptation Layer: From the view of the training data, the noise process is modeled by discovering the underlying label transition pattern (i.e., the noise transition matrix $T$). Given an example $x$, its noisy class posterior probability is expressed by

$$p(\tilde{y} = j \mid x) = \sum_{i=1}^{c} p(\tilde{y} = j, y = i \mid x) = \sum_{i=1}^{c} T_{ij}\, p(y = i \mid x), \quad \text{where } T_{ij} = p(\tilde{y} = j \mid y = i, x). \tag{3}$$
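As an illustration of (3), the sketch below stacks a trainable transition matrix on top of a base network's softmax output. It is a generic PyTorch rendering in the spirit of a noise adaptation layer; the class name and near-identity initialization are our own assumptions rather than the design of any cited method.

```python
import torch
import torch.nn as nn

class NoiseAdaptationLayer(nn.Module):
    """Wraps a base classifier and outputs noisy class posteriors as in (3)."""

    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base = base_model                      # produces logits for p(y | x)
        # parameterize T row-wise via softmax so that each row is a distribution
        self.T_logits = nn.Parameter(5.0 * torch.eye(num_classes))

    def forward(self, x):
        clean = torch.softmax(self.base(x), dim=1)  # p(y = i | x), shape (B, c)
        T = torch.softmax(self.T_logits, dim=1)     # T[i, j] ~ p(y_tilde = j | y = i)
        return clean @ T                            # p(y_tilde = j | x), shape (B, c)
```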
In light of this, the noise adaptation layer is intended to mimic the label transition behavior when learning a DNN. Let $p(y \mid x; \Theta)$ be the output of the base DNN with a softmax output layer.
In contrast, pretraining [88] empirically proves that fine-tuning a pretrained model provides a significant improvement in robustness compared with models trained from scratch; the universal representations obtained by pretraining prevent the model parameters from being updated in the wrong direction by noisy labels. PHuber [89] proposes composite loss-based gradient clipping, a variation of standard gradient clipping aimed at label noise robustness. Robust early learning [90] partitions the parameters into critical and noncritical ones for fitting clean and noisy labels, respectively, and then penalizes only the noncritical ones with a different update rule. ODLN [91] leverages open-set auxiliary data and prevents overfitting to noisy labels by assigning random labels, uniformly sampled from the label set, to the open-set examples.

Remark: Explicit regularization often introduces sensitive model-dependent hyperparameters or requires deeper architectures to compensate for the reduced capacity, yet it can yield significant performance gains if these are optimally tuned.

2) Implicit Regularization: Regularization can also take an implicit form that produces the effect of stochasticity, e.g., data augmentation and mini-batch stochastic gradient descent.

Technical Detail: Adversarial training [92] enhances noise tolerance by encouraging the DNN to correctly classify both original inputs and hostilely perturbed ones. Label smoothing [93], [94] estimates the marginalized effect of label noise during training, thereby reducing overfitting by preventing the DNN from assigning full probability to noisy training examples. Instead of the one-hot label, the noisy label is mixed with a uniform mixture over all possible labels,

$$\bar{y} = \big[\bar{y}(1), \bar{y}(2), \ldots, \bar{y}(c)\big], \quad \text{where } \bar{y}(i) = (1 - \alpha) \cdot [\tilde{y} = i] + \alpha/c \text{ and } \alpha \in [0, 1]. \tag{5}$$

Here, $[\cdot]$ is the Iverson bracket and $\alpha$ is the smoothing degree. In contrast, mixup [95] regularizes the DNN to favor simple linear behaviors in between training examples. First, the mini-batch is constructed from virtual training examples, each formed by the linear interpolation of two noisy training examples $(x_i, \tilde{y}_i)$ and $(x_j, \tilde{y}_j)$ drawn at random from the noisy training data $\tilde{\mathcal{D}}$,

$$x_{\mathrm{mix}} = \lambda x_i + (1 - \lambda) x_j \quad \text{and} \quad y_{\mathrm{mix}} = \lambda \tilde{y}_i + (1 - \lambda) \tilde{y}_j \tag{6}$$

where $\lambda \in [0, 1]$ is the balance parameter between the two examples. Thus, mixup extends the training distribution by updating the DNN on the constructed mini-batch.

Remark: Implicit regularization improves the generalization capability of the DNN without reducing its representational capacity. It also does not introduce sensitive model-dependent hyperparameters because it is applied to the training data. However, the extended feature or label space slows down the convergence of training.
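The two regularizers in (5) and (6) reduce to a few lines of tensor arithmetic, as in the hedged PyTorch sketch below; in practice, the smoothing degree alpha is tuned and lambda is usually sampled from a beta distribution rather than fixed.

```python
import torch

def smooth_labels(y_onehot, alpha=0.1):
    """Label smoothing (5): mix the one-hot (noisy) label with a uniform mixture."""
    c = y_onehot.size(1)
    return (1.0 - alpha) * y_onehot + alpha / c

def mixup_batch(x, y_onehot, lam=0.7):
    """mixup (6): linearly interpolate random pairs of examples and their labels."""
    idx = torch.randperm(x.size(0))              # random pairing in the mini-batch
    x_mix = lam * x + (1.0 - lam) * x[idx]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[idx]
    return x_mix, y_mix
```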
C. Robust Loss Function

It was proven that a DNN learned with a suitably modified loss function for the noisy data $\tilde{\mathcal{D}}$ can approach the Bayes optimal classifier $f^{*}$, which achieves the optimal Bayes risk $R^{*} = R_{\mathcal{D}}(f^{*})$ for the clean data $\mathcal{D}$. Let $\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}_{\ell, \tilde{\mathcal{D}}}(f)$ be the classifier learned with the modified loss $\ell$ for the noisy data, where $\hat{R}_{\ell, \tilde{\mathcal{D}}}(f) = \mathbb{E}_{\tilde{\mathcal{D}}}[\ell(f(x; \Theta), \tilde{y})]$. If $\ell$ is $L$-Lipschitz and classification calibrated [50], then, with probability at least $1 - \delta$, there exists a nondecreasing function $\zeta_{\ell}$ with $\zeta_{\ell}(0) = 0$ [39] such that

$$R_{\mathcal{D}}(\hat{f}) - R^{*} \leq \zeta_{\ell}\Big(\min_{f \in \mathcal{F}} R_{\ell,\mathcal{D}}(f) - \min_{f} R_{\ell,\mathcal{D}}(f)\Big) + 4 L_{p}\, \mathrm{RC}(\mathcal{F}) + 2\sqrt{\frac{\log(1/\delta)}{2|\mathcal{D}|}} \tag{7}$$

where the first term bounds the approximation error and the remaining terms bound the estimation error; $L_{p}$ is the Lipschitz constant of $\ell$, and $\mathrm{RC}$ is the Rademacher complexity of the hypothesis class $\mathcal{F}$. Then, by the universal approximation theorem [96], the Bayes optimal classifier $f^{*}$ is guaranteed to be in the hypothesis class $\mathcal{F}$ with DNNs.

Based on this theoretical foundation, researchers have attempted to design robust loss functions that achieve a small risk for unseen clean data even when noisy labels exist in the training data [68], [97]–[101].

Technical Detail: Initially, Manwani and Sastry [48] theoretically proved a sufficient condition on the loss function such that risk minimization with that function becomes noise-tolerant for binary classification. Subsequently, the sufficient condition was extended to multiclass classification with deep learning [68]. Specifically, a loss function $\ell$ is defined to be noise-tolerant for $c$-class classification under symmetric noise if the noise rate satisfies $\tau < (c - 1)/c$ and

$$\sum_{j=1}^{c} \ell\big(f(x; \Theta),\, y = j\big) = C, \quad \forall x \in \mathcal{X},\ \forall f \tag{8}$$

where $C$ is a constant. This condition guarantees that the classifier trained on noisy data has the same misclassification probability as that trained on noise-free data under the specified assumption. An extension to multilabel classification was provided by Kumar et al. [102]. Moreover, if $R_{\mathcal{D}}(f^{*}) = 0$, then the loss function is also noise-tolerant under asymmetric noise, where $f^{*}$ is a global risk minimizer of $R_{\mathcal{D}}$.

For the classification task, the categorical cross entropy (CCE) loss is the most widely used loss function owing to its fast convergence and high generalization capability. However, in the presence of noisy labels, the robust MAE [68] showed that the mean absolute error (MAE) loss achieves better generalization than the CCE loss because only the MAE loss satisfies the aforementioned condition. A limitation of the MAE loss is that its generalization performance degrades significantly when complicated data are involved. Hence, the generalized cross entropy (GCE) [97] was proposed to achieve the advantages of both the MAE and CCE losses; the GCE loss is a more general class of noise-robust losses that encompasses both of them. Amid et al. [103] extended the GCE loss by introducing two temperatures based on the Tsallis divergence. Bitempered loss [104] introduces a properly unbiased generalization of the CE loss based on the Bregman divergence. In addition, inspired by the symmetricity of the Kullback–Leibler divergence, the symmetric cross entropy (SCE) [98] was proposed by combining a noise tolerance term, namely, the reverse cross entropy loss, with the standard CCE loss.
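As a concrete example of such a loss, the following is a minimal PyTorch sketch of the GCE loss [97], $\mathcal{L}_q(f(x;\Theta), y) = (1 - f_y(x;\Theta)^q)/q$, which approaches the CCE loss as $q \to 0$ and an MAE-like loss at $q = 1$; the default value of $q$ is only indicative.

```python
import torch

def gce_loss(logits, targets, q=0.7):
    """GCE loss [97]; logits: (B, c), targets: (B,) integer class labels."""
    probs = torch.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # f_y(x) per example
    return ((1.0 - p_y.pow(q)) / q).mean()
```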
Meanwhile, the curriculum loss (CL) [99] is a surrogate loss of the 0–1 loss function; it provides a tight upper bound and can easily be extended to multiclass classification. The active passive loss (APL) [105] is a combination of two types of robust loss functions: an active loss that maximizes the probability of belonging to the given class and a passive loss that minimizes the probability of belonging to the other classes.

Remark: The robustness of these methods is well supported theoretically. However, they perform well only in simple cases, when learning is easy or the number of classes is small [106]. Moreover, the modification of the loss function increases the training time needed for convergence [97].

D. Loss Adjustment

Loss adjustment is effective for reducing the negative impact of noisy labels by adjusting the loss of all training examples before updating the DNN [19], [62], [69], [107]–[111]. The associated methods can be categorized into four groups depending on their adjustment philosophy: 1) loss correction, which estimates the noise transition matrix to correct the forward or backward loss; 2) loss reweighting, which imposes a different importance on each example for a weighted training scheme; 3) label refurbishment, which adjusts the loss using a refurbished label obtained from a convex combination of the noisy and predicted labels; and 4) meta learning, which automatically infers the optimal rule for loss adjustment. Unlike the robust loss functions newly designed for robustness, this family of methods aims to make the traditional optimization process robust to label noise. Hence, in the middle of training, the update rule is adjusted such that the negative impact of label noise is minimized.

In general, loss adjustment allows for a full exploration of the training data by adjusting the loss of every example. However, the error incurred by false correction accumulates, especially when the number of classes or the number of mislabeled examples is large [112].

1) Loss Correction: Similar to the noise adaptation layer presented in Section III-A, this approach modifies the loss of each example by multiplying the estimated label transition probability with the loss. In backward correction [62], the loss computed on the softmax output is multiplied by the inverse of the estimated transition matrix $\hat{T}^{-1}$ during the backward propagation step. Conversely, forward correction [62] uses a linear combination of a DNN's softmax outputs before applying the loss function. Hence, forward correction is performed by multiplying the estimated transition probability with the softmax outputs during the forward propagation step,

$$\ell_{\text{forward}}\big(f(x; \Theta), \tilde{y}\big) = \ell\big(\big[\hat{p}(\tilde{y} \mid 1), \ldots, \hat{p}(\tilde{y} \mid c)\big]\, f(x; \Theta), \tilde{y}\big) = \ell\big(\hat{T}^{\top} f(x; \Theta), \tilde{y}\big). \tag{10}$$

Furthermore, gold loss correction [107] assumes the availability of clean validation data or anchor points for loss correction. Thus, a more accurate transition matrix is obtained by using them as additional information, which further improves the robustness of the loss correction. Recently, T-revision [113] provided a solution that can infer the transition matrix without anchor points, and dual T [114] factorizes the matrix into the product of two easy-to-estimate matrices to avoid directly estimating the noisy class posterior. Beyond the instance-independent noise assumption, Zhang and Sugiyama [115] introduced the instance-confidence embedding to model instance-dependent noise when estimating the transition matrix. On the other hand, Yang et al. [116] proposed to use the Bayes optimal transition matrix estimated from distilled examples for the instance-dependent noise transition matrix.

Remark: The robustness of these approaches is highly dependent on how precisely the transition matrix is estimated. To acquire such a transition matrix, they generally require prior knowledge, such as anchor points or clean validation data.
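A minimal PyTorch sketch of the forward correction in (10) follows; the estimated matrix T_hat is assumed to be given (e.g., obtained from anchor points), and the clamping constant is our own numerical-safety choice.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_targets, T_hat):
    """Forward correction (10); T_hat[i, j] estimates p(y_tilde = j | y = i)."""
    clean_probs = torch.softmax(logits, dim=1)   # f(x; Theta), shape (B, c)
    noisy_probs = clean_probs @ T_hat            # equals T_hat^T applied to f(x)
    noisy_probs = noisy_probs.clamp_min(1e-12)   # avoid log(0)
    return F.nll_loss(noisy_probs.log(), noisy_targets)
```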
2) Loss Reweighting: Inspired by the concept of importance reweighting [117], loss reweighting aims to assign smaller weights to the examples with false labels and greater weights to those with true labels. Accordingly, the reweighted loss on the mini-batch $\mathcal{B}_t$ is used to update the DNN,

$$\Theta_{t+1} = \Theta_t - \eta \nabla \bigg( \frac{1}{|\mathcal{B}_t|} \sum_{(x, \tilde{y}) \in \mathcal{B}_t} w(x, \tilde{y})\, \ell\big(f(x; \Theta_t), \tilde{y}\big) \bigg) \tag{11}$$

where $w(x, \tilde{y})$ is the weight of the example $(x, \tilde{y})$.
… variation of appropriate weighting schemes that rely on the noise type and training data.

3) Label Refurbishment: Refurbishing a noisy label $\tilde{y}$ effectively prevents overfitting to false labels. Let $\hat{y}$ be the current prediction of the DNN $f(x; \Theta)$. Then, the refurbished label $y^{\text{refurb}}$ can be obtained by a convex combination of the noisy label $\tilde{y}$ and the DNN prediction $\hat{y}$,

$$y^{\text{refurb}} = \alpha \tilde{y} + (1 - \alpha)\, \hat{y} \tag{12}$$

where $\alpha \in [0, 1]$ is the label confidence of $\tilde{y}$. To mitigate the damage of incorrect labeling, this approach backpropagates the loss for the refurbished label instead of the noisy one, thereby yielding substantial robustness to noisy labels.

Technical Detail: Bootstrapping [69] is the first method to propose the concept of label refurbishment for updating the target labels of training examples. It develops a more coherent network that improves its ability to evaluate the consistency of noisy labels, with the label confidence $\alpha$ obtained via cross validation. Dynamic bootstrapping [110] dynamically adjusts the confidence $\alpha$ of individual training examples; the confidence $\alpha$ is obtained by fitting a two-component, one-dimensional beta mixture model to the loss distribution of all training examples. Self-adaptive training [119] applies an exponential moving average to alleviate the instability of using the instantaneous prediction of the current DNN,

$$y^{\text{refurb}}_{t+1} = \alpha\, y^{\text{refurb}}_{t} + (1 - \alpha)\, \hat{y}, \quad \text{where } y^{\text{refurb}}_{0} = \tilde{y}. \tag{13}$$

D2L [111] trains a DNN using a dimensionality-driven learning strategy to avoid overfitting to false labels. A simple measure called the local intrinsic dimensionality [120] is adopted to evaluate the confidence $\alpha$, considering that overfitting is exacerbated by dimensional expansion. Hence, refurbished labels are generated to prevent the dimensionality of the representation subspace from expanding at a later stage of training. Recently, SELFIE [19] introduced the novel concept of refurbishable examples that can be corrected with high precision. The key idea is to consider an example with consistent label predictions as refurbishable because such consistent predictions correspond to its true label with high probability, owing to the learner's perceptual consistency. Accordingly, the labels of only the refurbishable examples are corrected to minimize the number of falsely corrected cases. Similarly, AdaCorr [121] selectively refurbishes the labels of noisy examples, but with a theoretical error bound. Alternatively, SEAL [122] averages the softmax output of a DNN on each example over the whole training process and then retrains the DNN using the averaged soft labels.

Remark: Differently from loss correction and reweighting, all the noisy labels are explicitly replaced with the expected clean labels (or a combination of the two). If there are not many confusing classes in the data, these methods work well by refurbishing the noisy labels with high precision. In the opposite case, the DNN could overfit to wrongly refurbished labels.
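In its simplest form, training on the refurbished label in (12) amounts to a soft-label cross entropy, as in the hedged sketch below; the fixed alpha is a placeholder, whereas the methods above estimate it per example via cross validation, mixture modeling, or prediction consistency.

```python
import torch

def refurbished_loss(logits, noisy_onehot, alpha=0.8):
    """Cross entropy against the refurbished label in (12)."""
    pred = torch.softmax(logits, dim=1).detach()            # y_hat, gradient detached
    y_refurb = alpha * noisy_onehot + (1.0 - alpha) * pred  # convex combination (12)
    log_probs = torch.log_softmax(logits, dim=1)
    return -(y_refurb * log_probs).sum(dim=1).mean()        # soft-label cross entropy
```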
4) Meta Learning: In recent years, meta-learning has become an important topic in the machine learning community and has been applied to improve noise robustness [123]–[125]. The key concept is learning to learn, which performs learning at a level higher than conventional learning, thus achieving data-agnostic and noise type-agnostic rules for better practical use. It is similar to loss reweighting and label refurbishment, but the adjustment is automated in a meta-learning manner.

Technical Detail: For the loss reweighting in (11), the goal is to learn the weight function $w(x, \tilde{y})$. Specifically, L2LWS [126] and CWS [127] are unified neural architectures composed of a target DNN and a meta-DNN. The meta-DNN is trained on a small clean validation dataset; it then provides guidance for evaluating the weight score for the target DNN. Here, part of the two DNNs is shared and jointly trained so that they benefit from each other. Automatic reweighting [106] is a meta-learning algorithm that learns the weights of training examples based on their gradient directions. It includes a small clean validation dataset in the training dataset and reweights the backward loss of the mini-batch examples such that the updated gradient minimizes the loss on this validation dataset. Meta-weight-net [124] parameterizes the weighting function as a multilayer perceptron with only one hidden layer. A meta-objective is defined to update its parameters such that they minimize the empirical risk on a small clean dataset. At each iteration, the parameter of the target network is guided by the weight function updated via the meta-objective. Likewise, data coefficients (i.e., exemplar weights and true labels) [128] are estimated by meta-optimization with a small clean set, only 0.2% of the entire training set, while refurbishing the examples that are probably mislabeled.

For the label refurbishment in (12), knowledge distillation [129] adopts the technique of transferring knowledge from an expert model to a target model. The prediction from an expert DNN trained on small clean validation data is used instead of the prediction $\hat{y}$ from the target DNN. MLC [130] updates the target model with corrected labels provided by a meta-model trained on clean validation data; the two models are trained concurrently via bilevel optimization.

Remark: By learning the update rule via meta-learning, the trained network easily adapts to various types of data and label noise. Nevertheless, unbiased clean validation data are essential to minimize the auxiliary objective, although they may not be available in real-world data.

E. Sample Selection

To avoid any false corrections, many recent studies [19], [70], [99], [112], [131]–[137] have adopted sample selection, which involves selecting true-labeled examples from a noisy training dataset. In this case, the update equation in (2) is modified to render the DNN more robust to noisy labels. Let $\mathcal{C}_t \subseteq \mathcal{B}_t$ be the identified clean examples at time $t$. Then, the DNN is updated only on the selected clean examples $\mathcal{C}_t$,

$$\Theta_{t+1} = \Theta_t - \eta \nabla \bigg( \frac{1}{|\mathcal{C}_t|} \sum_{(x, \tilde{y}) \in \mathcal{C}_t} \ell\big(f(x; \Theta_t), \tilde{y}\big) \bigg) \tag{14}$$

where the remaining mini-batch examples, which are likely to be false-labeled, are excluded to pursue robust learning.

The memorization nature of DNNs has been explored theoretically and empirically to identify clean examples from noisy training data [138]–[140].
Specifically, assuming clusterable data where the clusters are located on the unit Euclidean ball, Li et al. [61] bounded the distance from the initial weight $W_0$ to the weight $W_t$ after $t$ iterations,

$$\|W_t - W_0\|_F \lesssim \sqrt{K} + \big(K^2 \epsilon_0 / C^2\big)\, t \tag{15}$$

where $\|\cdot\|_F$ is the Frobenius norm, $K$ is the number of clusters, and $C$ is the set of cluster centers reaching all input examples within their $\epsilon_0$ neighborhood. Equation (15) demonstrates that the weights of DNNs start to stray far from the initial weights when overfitting to corrupted labels, while they are still in the vicinity of the initial weights at an early stage of training [30], [61]. In empirical studies [21], [141], the memorization effect is also observed: DNNs tend to first learn simple and generalized patterns and then gradually overfit to all noisy patterns. As such, favoring small-loss training examples as the clean ones is commonly employed to design robust training methods [112], [131], [134], [135], [142].

Learning with sample selection is well motivated and works well in general, but this approach suffers from accumulated error caused by incorrect selection, especially when there are many ambiguous classes in the training data. Hence, recent approaches often leverage multiple DNNs that cooperate with one another [112] or run multiple training rounds [133]. Moreover, to benefit from even the false-labeled examples, loss correction or semisupervised learning has recently been combined with the sample selection strategy [19], [142].

1) Multinetwork Learning: Collaborative learning and co-training are widely used for multinetwork training. Consequently, the sample selection process is guided by a mentor network in the case of collaborative learning or by a peer network in the case of co-training.

Technical Detail: Initially, decouple [70] proposed decoupling when to update from how to update. Hence, two DNNs are maintained simultaneously and updated using only the examples selected based on a disagreement between them. Subsequently, owing to the memorization effect of DNNs, many researchers have adopted another selection criterion, called the small-loss trick, which treats a certain number of small-loss training examples as true-labeled ones; many true-labeled examples tend to exhibit smaller losses than false-labeled examples, as shown in Fig. 5(a). In MentorNet [131], a pretrained mentor network guides the training of a student network in a collaborative learning manner. Based on the small-loss trick, the mentor network provides the student network with examples whose labels are likely to be correct. Co-teaching [112] and Co-teaching+ [132] also maintain two DNNs, but each DNN selects a certain number of small-loss examples and feeds them to its peer DNN for further training. Compared with co-teaching, Co-teaching+ further employs the disagreement strategy of decouple. In contrast, JoCoR [143] reduces the diversity of the two networks via co-regularization, making the predictions of the two networks closer.

Remark: The co-training methods help reduce the confirmation bias [112], which is the hazard of favoring the examples selected at the beginning of training, whereas the increase in the number of learnable parameters makes their learning pipeline inefficient. In addition, the small-loss trick does not work well when the loss distributions of true-labeled and false-labeled examples largely overlap, as with the asymmetric noise in Fig. 5(b).

Fig. 5. Loss distribution of training examples at the training accuracy of 50% on noisy CIFAR-100 (adapted from Song et al. [141]). (a) Symmetric noise 40%. (b) Asymmetric noise 40%.
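The small-loss trick behind (14) can be sketched as follows. This is a hedged, single-network simplification; in co-teaching [112], each network would instead pass its selected indices to the peer network, and the keep ratio would typically be scheduled from an estimated noise rate.

```python
import torch
import torch.nn.functional as F

def small_loss_step(model, optimizer, x, y_noisy, keep_ratio=0.7):
    """One update on the small-loss subset C_t of the mini-batch, as in (14)."""
    losses = F.cross_entropy(model(x), y_noisy, reduction="none")
    num_keep = max(1, int(keep_ratio * losses.numel()))
    clean_idx = torch.argsort(losses)[:num_keep]  # candidate clean set C_t
    optimizer.zero_grad()
    losses[clean_idx].mean().backward()           # update only on C_t
    optimizer.step()
```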
2) Multiround Learning: Without maintaining additional DNNs, multiround learning iteratively refines the selected set of clean examples by repeating the training round. Thus, the selected set keeps improving as the number of rounds increases.

Technical Detail: ITLM [134] iteratively minimizes the trimmed loss by alternating between selecting true-labeled examples at the current moment and retraining the DNN using them. At each training round, only a fraction of the small-loss examples obtained in the current round is used to retrain the DNN in the next round. INCV [135] randomly divides the noisy training data and then employs cross validation to identify true-labeled examples while removing large-loss examples at each training round. Here, co-teaching is adopted to train the DNN on the identified examples in the final round of training. Similarly, O2U-Net [144] repeats the whole training process with a cyclical learning rate until enough loss statistics of every example are gathered. Then, the DNN is retrained from scratch only on the clean data, where false-labeled examples have been detected and removed based on the statistics.

A number of variations have been proposed to achieve high performance using iterative refinement within only a single training round. Beyond the small-loss trick, iterative detection [133] detects false-labeled examples by employing the local outlier factor algorithm [145]. With a Siamese network, it gradually pulls false-labeled examples away from true-labeled ones in the deep feature space. MORPH [137] introduces the concept of memorized examples, which is used to iteratively expand an initial safe set into a maximal safe set via self-transitional learning. TopoFilter [146] utilizes the spatial topological pattern of learned representations to detect true-labeled examples, without relying on the predictions of the noisy classifier. NGC [147] iteratively constructs a nearest neighbor graph using latent representations and performs geometry-based sample selection by aggregating information from neighborhoods; soft pseudolabels are assigned to the examples not selected.

Remark: The selected clean set keeps being expanded and purified through iterative refinement, mainly via multiround learning. As a side effect, the computational cost of training increases linearly with the number of training rounds.

3) Hybrid Approach: An inherent limitation of sample selection is that it discards all the unselected training examples, thus resulting in a partial exploration of the training data.
TABLE II
Comparison of Proposed Robust Deep Learning Methods With Respect to the Following Six Properties: P1—Flexibility, P2—No Pretraining, P3—Full Exploration, P4—No Supervision, P5—Heavy Noise, and P6—Complex Noise
TABLE III
Comparison of Robust Deep Learning Categories for Overcoming Noisy Labels
3) (P3) Full Exploration: Excluding unreliable examples from the update is an effective method for robust deep learning; however, it eliminates hard yet useful training examples as well. "Full exploration" ensures that the proposed method can use all training examples without severe overfitting to false-labeled examples, by adjusting their training losses or applying semisupervised learning.
4) (P4) No Supervision: Learning with supervision, such as a clean validation set or a known noise rate, is often impractical because such supervision is difficult to obtain. Hence, it is preferable to avoid supervision to increase practicality in real-world scenarios. "No supervision" ensures that the proposed method can be trained without any supervision.
5) (P5) Heavy Noise: In real-world noisy data, the noise rate can vary from light to heavy. Hence, learning methods should achieve consistent noise robustness with respect to the noise rate. "Heavy noise" ensures that the proposed method can combat even heavy noise.
6) (P6) Complex Noise: The type of label noise significantly affects the performance of a learning method. To manage real-world noisy data, diverse types of label noise should be considered when designing a robust training method. "Complex noise" ensures that the proposed method can combat even complex label noise.

Table II shows a comparison of all robust deep learning methods, grouped according to the most appropriate category. In the first row, the aforementioned six properties are labeled P1–P6, and the availability of an open-source implementation is added in the last column. For each property, we assign "✓" if it is completely supported, "✕" if it is not supported, and "△" if it is supported but not completely. More specifically, "△" is assigned to P1 if the method can be flexible but requires additional effort, to P5 if the method can combat only moderate label noise, and to P6 if the method makes no strict assumption about the noise type but does not explicitly model instance-dependent noise. Thus, for P6, a method marked with "✕" deals only with instance-independent noise, while a method marked with "✓" deals with both instance-independent and instance-dependent noise. The remaining properties (i.e., P2–P4) are only assigned "✓" or "✕." Regarding the implementation, we assign "N/A" if a publicly available source code does not exist.

No existing method supports all the properties; each method achieves noise robustness by supporting a different combination of them. The supported properties are similar among the methods of the same (sub)category because those methods share the same methodological philosophy; however, they differ significantly across (sub)categories. Therefore, we investigate the properties generally supported in each (sub)category and summarize them in Table III. Here, a property of a (sub)category is marked according to the majority of the belonging methods. If no clear trend is observed among those methods, then the property is marked "△."

V. NOISE RATE ESTIMATION

The estimation of the noise rate is an imperative part of utilizing robust methods for better practical use, especially for the approaches belonging to loss adjustment and sample selection. The estimated noise rate is widely used to reweight examples for a robust classifier [97], [114], [117] or to determine how many examples should be selected as clean ones [19], [112], [135]. However, a detailed analysis has yet to be performed properly, even though many robust approaches rely heavily on the accuracy of noise rate estimation. The noise rate can be estimated by exploiting the inferred noise transition matrix [113], [114], [151], a Gaussian mixture model [110], [137], [152], or cross validation [19], [135].
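As a hedged sketch of the mixture-model route, the following fits a two-component Gaussian mixture to per-example training losses and reads the noise rate off the high-loss component; the exact modeling choices differ across methods (e.g., a beta mixture is used in [110]).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_noise_rate(losses, seed=0):
    """losses: 1-D array of per-example training losses from a warm-up model."""
    gmm = GaussianMixture(n_components=2, random_state=seed)
    comp = gmm.fit_predict(losses.reshape(-1, 1))
    noisy_comp = int(np.argmax(gmm.means_.ravel()))  # high-loss (noisy) component
    return float(np.mean(comp == noisy_comp))        # fraction flagged as noisy
```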
A. Noise Transition Matrix

The noise transition matrix has been used to build statistically consistent robust classifiers because it links the class posterior probabilities for noisy and clean data, as in (3). The first way to estimate the noise rate is to exploit this noise transition matrix, which can be inferred or trained accurately by using perfectly clean examples, i.e., anchor points [117], [153]; an example $x$ with label $i$ is defined as an anchor point if $p(y = i \mid x) = 1$ and $p(y = k \mid x) = 0$ for all $k \neq i$. Thus, let $\mathcal{A}_i$ be the set of anchor points with label $i$.
…ever, Pleiss et al. [152] recently pointed out that the training…
TABLE IV
Summary of Publicly Available Datasets Used for Studying Label Noise
tiny-ImageNet,55 an image database organized according to the WordNet hierarchy, and its small subset [1], [158]. Because the labels in these datasets are almost all true-labeled, their training labels should be artificially corrupted for the evaluation of synthetic noises, namely, symmetric noise and asymmetric noise.

2) Real-World Noisy Datasets: Unlike the clean datasets, real-world noisy datasets inherently contain many mislabeled examples annotated by nonexperts. According to the literature [16]–[19], six real-world noisy datasets are widely used: ANIMAL-10N,56 real-world noisy data of human-labeled online images for ten confusing animals [19]; CIFAR-10N57 and CIFAR-100N,57 variations of CIFAR-10 and CIFAR-100 with human-annotated real-world noisy labels collected from Amazon's Mechanical Turk [159], which provide human labels with different noise rates, as shown in Table IV; Food-101N,58 real-world noisy data of crawled food images annotated by their search keywords in the Food-101 taxonomy [18], [160]; Clothing1M,59 real-world noisy data of large-scale crawled clothing images from several online shopping websites [16]; and WebVision,60 real-world noisy data of large-scale web images crawled from Flickr and Google Images search [17]. To support sophisticated evaluation, most real-world noisy datasets contain their own clean validation set and provide an estimated noise rate for their training set.

54 http://www.image-net.org
55 https://www.kaggle.com/c/tiny-imagenet
56 https://dm.kaist.ac.kr/datasets/animal-10n
57 http://noisylabels.com/
58 https://kuanghuei.github.io/Food-101N
59 https://www.floydhub.com/lukasmyth/datasets/clothing1m
60 https://data.vision.ee.ethz.ch/cvl/webvision/download.html

B. Evaluation Metrics

A typical metric for assessing the robustness of a particular method is the prediction accuracy on unbiased and clean examples that are not used in training. The prediction accuracy degrades significantly if the DNN overfits to false-labeled examples [22]. Hence, test accuracy has generally been adopted for evaluation [13]. For a test set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$, let $\hat{y}_i$ be the predicted label of the $i$th example in $\mathcal{T}$. Then, the test accuracy is formalized by

$$\text{Test Accuracy} = \frac{\big|\{(x_i, y_i) \in \mathcal{T} : \hat{y}_i = y_i\}\big|}{|\mathcal{T}|}. \tag{21}$$

If the test data are not available, the validation accuracy can be used as an alternative by replacing $\mathcal{T}$ in (21) with the validation data $\mathcal{V} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{V}|}$,

$$\text{Validation Accuracy} = \frac{\big|\{(x_i, y_i) \in \mathcal{V} : \hat{y}_i = y_i\}\big|}{|\mathcal{V}|}. \tag{22}$$

Furthermore, if the specified method belongs to the "sample selection" category, the label precision and label recall [112], [135] can be used as metrics,

$$\text{Label Precision} = \frac{\big|\{(x_i, \tilde{y}_i) \in \mathcal{S}_t : \tilde{y}_i = y_i\}\big|}{|\mathcal{S}_t|}, \quad \text{Label Recall} = \frac{\big|\{(x_i, \tilde{y}_i) \in \mathcal{S}_t : \tilde{y}_i = y_i\}\big|}{\big|\{(x_i, \tilde{y}_i) \in \mathcal{B}_t : \tilde{y}_i = y_i\}\big|} \tag{23}$$

where $\mathcal{S}_t$ is the set of selected clean examples in a mini-batch $\mathcal{B}_t$. The two metrics are performance indicators for the examples selected from the mini-batch as true-labeled ones [112].

Meanwhile, if the specified method belongs to the "label refurbishment" category, the correction error [19] can be used as an indicator of how many examples are incorrectly refurbished,

$$\text{Correction Error} = \frac{\big|\{x_i \in \mathcal{R} : \operatorname{argmax}\, y^{\text{refurb}}_i \neq y_i\}\big|}{|\mathcal{R}|} \tag{24}$$

where $\mathcal{R}$ is the set of examples whose labels are refurbished by (12) and $y^{\text{refurb}}_i$ is the refurbished label of the $i$th example in $\mathcal{R}$.
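The selection metrics in (23) are straightforward to compute when ground-truth labels are available, as in controlled benchmarks. Below is a hedged NumPy sketch; the argument names are ours, and selected is a boolean mask over the mini-batch marking the chosen examples.

```python
import numpy as np

def label_precision_recall(y_true, y_noisy, selected):
    """Label precision and recall in (23) over one mini-batch."""
    correct = (y_true == y_noisy)                     # examples whose labels are true
    precision = correct[selected].mean()              # clean fraction of the selected set
    recall = correct[selected].sum() / correct.sum()  # clean examples that were recovered
    return float(precision), float(recall)
```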
VII. FUTURE RESEARCH DIRECTIONS

With the recent efforts of the machine learning community, the robustness of DNNs has evolved in several directions. Thus, the existing approaches covered in our survey face a variety of future challenges. This section provides a discussion of future research that can facilitate and envision the development of deep learning in the label noise area.
A. Instance-Dependent Label Noise

Existing theoretical and empirical studies on robust loss functions and loss correction are largely built upon the instance-independent noise assumption that the label noise is independent of the input features [76], [77], [113], [114]. However, this assumption may not be a good approximation of real-world label noise. In particular, Chen et al. [122] conducted a theoretical hypothesis test61 using a popular real-world dataset, Clothing1M, and proved that its label noise is statistically different from instance-independent noise. This test confirms that the label noise should depend on the instance.

Conversely, most methods in the other directions (especially sample selection) generally work well even under instance-dependent label noise since they do not rely on this assumption. Nevertheless, Song et al. [141] pointed out that their performance can worsen considerably under instance-dependent (or real-world) noise compared with symmetric noise, owing to the confusion between true-labeled and false-labeled examples. The loss distribution of true-labeled examples heavily overlaps that of false-labeled examples under asymmetric noise, which is similar to real-world noise, as in Fig. 5(b). Thus, identifying clean examples becomes more challenging when dealing with instance-dependent label noise.

Beyond instance-independent label noise, there have been a few recent studies on instance-dependent label noise. Mostly, they focus only on a binary classification task [66], [161] or on a restricted small-scale machine learning model, such as logistic regression [63]. Therefore, learning with instance-dependent label noise is an important topic that deserves more research attention.

B. Multilabel Data With Label Noise

Most of the existing methods are applicable only to the single-label multiclass classification problem, where each data example is assumed to have only one true label. However, in the case of multilabel learning, each data example can be associated with a set of multiple true class labels. In music categorization, a piece of music can belong to multiple categories [162]. In semantic scene classification, a scene may belong to multiple scene classes [163]. Thus, contrary to the single-label setup, a multilabel classifier aims to predict a set of target objects simultaneously. In this setup, a multilabel dataset of millions of examples was reported to contain over 26.6% false-positive labels and a significant number of omitted labels [164].

Even worse, the difference in occurrence between classes makes this problem more challenging; some minor class labels occur less often in the training data than other major class labels. Considering such aspects of multilabel classification, a simple extension of existing methods may not learn the proper correlations among multiple labels. Therefore, learning from noisy labels with multilabel data is another important topic for future research. We refer the readers to a recent study [165] that discusses the evaluation of multilabel classifiers trained with noisy labels.

C. Class Imbalance Data With Label Noise

Class imbalance in training data is commonly observed, where a few classes account for most of the data. Especially when working with large data in many real-world applications, this problem becomes more severe and is often associated with the problem of noisy labels [166]. Nevertheless, to ease the label noise problem, it is commonly assumed that training examples are equally distributed over all class labels. This assumption is quite strong when collecting large-scale data, and thus, we need to consider a more realistic scenario in which the two problems coexist.

Most of the existing robust methods may not work well under class imbalance, especially when they rely on the learning dynamics of DNNs, e.g., the small-loss trick or the memorization effect. In the presence of class imbalance, the training model converges to major classes faster than to minor classes, such that most examples in the major classes exhibit small losses (i.e., early memorization). That is, there is a risk of discarding most examples of the minor classes. Furthermore, in terms of example importance, high-loss examples are commonly favored for the class imbalance problem [124], while small-loss examples are favored for the label noise problem. This conceptual contradiction hinders the applicability of existing methods that neglect class imbalance. Therefore, these two problems should be considered simultaneously to deal with more general situations.

D. Robust and Fair Training

Machine learning classifiers can perpetuate and amplify existing systemic injustices in society [167]. Hence, fairness is becoming another important topic. Traditionally, robust training and fair training have been studied by separate communities; robust training with noisy labels has mostly focused on combating label noise without regard to data bias [13], [30], whereas fair training has focused on dealing with data bias, not necessarily noise [167], [168]. However, noisy labels and data bias, in fact, coexist in real-world data. Satisfying both robustness and fairness is more realistic but challenging, because the bias in the data is pertinent to the label noise.

In general, many fairness criteria are group-based, where a target metric is equalized or enforced over subpopulations in the data, also known as protected groups, such as race or gender [167]. Accordingly, the goal of fair training is to build a model that satisfies such fairness criteria for the true protected groups. However, if a noisy protected group is involved, such fairness criteria cannot be directly applied. Recently, mostly after 2020, a few pioneering studies have emerged that consider both robustness and fairness objectives at the same time under the binary classification setting [169], [170]. Therefore, more research attention is needed for the convergence of robust training and fair training.

E. Connection With Input Perturbation

There has been a lot of research on the robustness of deep learning under input perturbation, mainly in the field of adversarial training, where the input feature is maliciously perturbed to distort the output of the DNN [34], [36].
61 In Clothing1M, the result showed that instance-independent noise occurs with probability lower than $10^{-21250}$, which is statistically impossible.
of adversarial training where the input feature is maliciously supported properties varied depending on the category to
perturbed to distort the output of the DNN [34], [36]. Although which each method belonged. Several experimental guidelines
learning with noisy labels and learning with noisy inputs were also discussed, including noise rate estimation, publicly
have been regarded as separate research fields, their goals available datasets, and evaluation metrics. Finally, we provided
are similar in that they learn noise-robust representations from insights and directions for future research in this domain.
noisy data. Based on this common point of view, a few recent
studies have investigated the interaction of adversarial training
R EFERENCES
with noisy labels [171]–[173].
Interestingly, it turned out that adversarial training makes DNNs robust to label noise [171]. Based on this finding, Damodaran et al. [172] proposed a new regularization term, called Wasserstein adversarial regularization, to address the problem of learning with noisy labels. Zhu et al. [173] proposed to use the number of projected gradient descent (PGD) steps as a new criterion for sample selection, such that clean examples are sifted out of the noisy data. These approaches offer a new perspective on label noise compared with traditional work. Therefore, understanding the connection between input perturbation and label noise could be another future topic for better representation learning toward robustness.
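The selection criterion of Zhu et al. [173] can be sketched as follows; this is our minimal illustration under assumed hyperparameters (eps, alpha, max_steps), not the authors' implementation. The intuition is that a prediction supported by a clean label tends to survive more PGD steps before it flips.

import torch
import torch.nn.functional as F

def pgd_flip_steps(model, x, y, eps=8 / 255, alpha=2 / 255, max_steps=10):
    """Per example, count PGD steps until the prediction flips
    (max_steps if it never flips within the budget)."""
    model.eval()
    with torch.no_grad():
        base_pred = model(x).argmax(dim=1)
    steps = torch.full((x.size(0),), max_steps, dtype=torch.long, device=x.device)
    x_adv = x.clone().detach()
    for t in range(1, max_steps + 1):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
            flipped = model(x_adv).argmax(dim=1) != base_pred
        steps = torch.where(flipped & (steps == max_steps),
                            torch.full_like(steps, t), steps)
    return steps  # larger values -> more adversarially stable -> likely clean

A simple use is to keep, e.g., the half of each batch with the largest step counts as the presumed-clean subset for the next update.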
F. Efficient Learning Pipeline

The efficiency of the learning pipeline is another important aspect in designing deep learning approaches. However, for robust deep learning, most studies have neglected the efficiency of the algorithm because their main goal is to improve the robustness to label noise. For example, maintaining multiple DNNs or training a DNN in multiple rounds is frequently used, but these approaches significantly degrade the efficiency of the learning pipeline. On the other hand, the need for more efficient algorithms is increasing owing to the rapid increase in the amount of available data [174].

According to our literature survey, most work did not even report the efficiency (or time complexity) of their approaches. However, it is evident that saving training time is helpful under a restricted computation budget. Therefore, enhancing the efficiency will significantly increase the usability of robust deep learning in the big data era.
VIII. CONCLUSION

DNNs easily overfit false labels owing to their high capacity to memorize all noisy training samples. This overfitting issue remains even with various conventional regularization techniques, such as dropout and batch normalization, thereby significantly decreasing the generalization performance. Even worse, in real-world applications, the difficulty of labeling renders the overfitting issue more severe. Therefore, learning from noisy labels has recently become one of the most active research topics.

In this survey, we presented a comprehensive understanding of modern deep learning methods for addressing the negative consequences of learning from noisy labels. All the methods were grouped into five categories according to their underlying strategies and described along with their methodological weaknesses. Furthermore, a systematic comparison was conducted using six popular properties used for evaluation in the recent literature. According to the comparison results, there is no ideal method that supports all the required properties; the supported properties varied depending on the category to which each method belonged. Several experimental guidelines were also discussed, including noise rate estimation, publicly available datasets, and evaluation metrics. Finally, we provided insights and directions for future research in this domain.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788.
[3] W. Zhang, T. Du, and J. Wang, “Deep learning over multi-field categorical data,” in Proc. ECIR, 2016, pp. 45–57.
[4] L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng, “DeepRank: A new deep architecture for relevance ranking in information retrieval,” in Proc. CIKM, 2017, pp. 257–266.
[5] K. D. Onal et al., “Neural information retrieval: At the end of the early years,” Inf. Retr. J., vol. 21, nos. 2–3, pp. 111–182, 2018.
[6] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in Proc. ACL, 2018, pp. 328–339.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. ACL, 2019, pp. 4171–4186.
[8] A. Severyn and A. Moschitti, “Twitter sentiment analysis with deep convolutional neural networks,” in Proc. ACL, 2015, pp. 959–962.
[9] G. Paolacci, J. Chandler, and P. G. Ipeirotis, “Running experiments on Amazon mechanical Turk,” Judgment Decis. Making, vol. 5, no. 5, pp. 411–419, 2010.
[10] V. Cothey, “Web-crawling reliability,” J. Amer. Soc. Inf. Sci. Technol., vol. 55, no. 14, pp. 1228–1238, 2004.
[11] W. Mason and S. Suri, “Conducting behavioral research on Amazon’s mechanical Turk,” Behav. Res. Methods, vol. 44, no. 1, pp. 1–23, 2012.
[12] C. Scott, G. Blanchard, and G. Handy, “Classification with asymmetric label noise: Consistency and maximal denoising,” in Proc. COLT, 2013, pp. 489–511.
[13] B. Frénay and M. Verleysen, “Classification in the presence of label noise: A survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 845–869, May 2014.
[14] R. V. Lloyd et al., “Observer variation in the diagnosis of follicular variant of papillary thyroid carcinoma,” The Amer. J. Surgical Pathol., vol. 28, no. 10, pp. 1336–1340, 2004.
[15] H. Xiao, H. Xiao, and C. Eckert, “Adversarial label flips attack on support vector machines,” in Proc. ECAI, 2012, pp. 870–875.
[16] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proc. CVPR, 2015, pp. 2691–2699.
[17] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool, “WebVision database: Visual learning and understanding from web data,” 2017, arXiv:1708.02862.
[18] K.-H. Lee, X. He, L. Zhang, and L. Yang, “CleanNet: Transfer learning for scalable image classifier training with label noise,” in Proc. CVPR, 2018, pp. 5447–5456.
[19] H. Song, M. Kim, and J.-G. Lee, “SELFIE: Refurbishing unclean samples for robust deep learning,” in Proc. ICML, 2019, pp. 5907–5915.
[20] J. Krause et al., “The unreasonable effectiveness of noisy data for fine-grained recognition,” in Proc. ECCV, 2016, pp. 301–320.
[21] D. Arpit et al., “A closer look at memorization in deep networks,” in Proc. ICML, 2017, pp. 233–242.
[22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. ICLR, 2017.
[23] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data, vol. 6, p. 60, Dec. 2019.
[24] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Proc. NeurIPS, 1992, pp. 950–957.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
[27] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study,” Artif. Intell. Rev., vol. 22, no. 3, pp. 177–210, Nov. 2004.
[28] J. Zhang, X. Wu, and V. S. Sheng, “Learning from crowdsourced labeled data: A survey,” Artif. Intell. Rev., vol. 46, no. 4, pp. 543–576, Dec. 2016.
[29] N. Nigam, T. Dutta, and H. P. Gupta, “Impact of noisy labels in learning techniques: A survey,” in Proc. ICDIS, 2020, pp. 403–411.
[30] B. Han et al., “A survey of label-noise representation learning: Past, present and future,” 2020, arXiv:2011.04406.
[31] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, 2018.
[32] J. Yoon, J. Jordon, and M. Schaar, “GAIN: Missing data imputation using generative adversarial nets,” in Proc. ICML, 2018, pp. 5689–5698.
[33] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard, “Robustness of classifiers: From adversarial to random noise,” in Proc. NeurIPS, 2016, pp. 1632–1640.
[34] E. Dohmatob, “Generalized no free lunch theorem for adversarial robustness,” in Proc. ICML, 2019, pp. 1646–1654.
[35] J. Gilmer, N. Ford, N. Carlini, and E. Cubuk, “Adversarial examples are a natural consequence of test error in noise,” in Proc. ICML, 2019, pp. 2280–2289.
[36] S. Mahloujifar, D. I. Diochnos, and M. Mahmoody, “The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure,” in Proc. AAAI, 2019, vol. 33, no. 1, pp. 4536–4543.
[37] D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
[38] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[39] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Proc. NeurIPS, 2013, pp. 1196–1204.
[40] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” in Proc. ICLR, 2017, pp. 1–9.
[41] P. Sastry and N. Manwani, “Robust learning of classifiers in the presence of label noise,” in Proc. Pattern Recognit. Big Data, 2017, pp. 167–197.
[42] V. Wheway, “Using boosting to detect noisy data,” in Proc. PRICAI, 2000, pp. 123–130.
[43] B. Sluban, D. Gamberger, and N. Lavrač, “Ensemble-based noise detection: Noise ranking and visual performance evaluation,” Data Mining Knowl. Discovery, vol. 28, no. 2, pp. 265–303, 2012.
[44] S. J. Delany, N. Segata, and B. Mac Namee, “Profiling instances in noise reduction,” Knowl.-Based Syst., vol. 31, pp. 28–40, Jul. 2012.
[45] D. Gamberger, N. Lavrac, and S. Dzeroski, “Noise detection and elimination in data preprocessing: Experiments in medical domains,” Appl. Artif. Intell., vol. 14, no. 2, pp. 205–223, Nov. 2000.
[46] J. Thongkam, G. Xu, Y. Zhang, and F. Huang, “Support vector machine for outlier detection in breast cancer survivability prediction,” in Proc. APWeb, 2008, pp. 99–109.
[47] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy data,” in Proc. ICML, 2012, pp. 567–574.
[48] N. Manwani and P. Sastry, “Noise tolerance under risk minimization,” IEEE Trans. Cybern., vol. 43, no. 3, pp. 1146–1151, Jun. 2013.
[49] A. Ghosh, N. Manwani, and P. S. Sastry, “Making risk minimization tolerant to label noise,” Neurocomputing, vol. 160, pp. 93–107, Jul. 2015.
[50] B. Van Rooyen, A. Menon, and R. C. Williamson, “Learning with symmetric label noise: The importance of being unhinged,” in Proc. NeurIPS, 2015, pp. 10–18.
[51] G. Patrini, F. Nielsen, R. Nock, and M. Carioni, “Loss factorization, weakly supervised learning and label noise robustness,” in Proc. ICML, 2016, pp. 708–717.
[52] R. Xu and D. C. Wunsch, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, Jun. 2005.
[53] U. Rebbapragada and C. E. Brodley, “Class noise mitigation through instance weighting,” in Proc. ECML, 2007, pp. 708–715.
[54] T. Liu, K. Wang, B. Chang, and Z. Sui, “A soft-label method for noise-tolerant distantly supervised relation extraction,” in Proc. EMNLP, 2017, pp. 1790–1795.
[55] F. O. Kaster, B. H. Menze, M.-A. Weber, and F. A. Hamprecht, “Comparative validation of graphical models for learning tumor segmentations from noisy manual annotations,” in Proc. MICCAI, 2010, pp. 74–85.
[56] A. Ganapathiraju and J. Picone, “Support vector machines for automatic data cleanup,” in Proc. ICSLP, 2000, pp. 210–213.
[57] B. Biggio, B. Nelson, and P. Laskov, “Support vector machines under adversarial label noise,” in Proc. ACML, 2011, pp. 97–112.
[58] C. J. Mantas and J. Abellán, “Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data,” Expert Syst. Appl., vol. 41, no. 10, pp. 4625–4637, Aug. 2014.
[59] A. Ghosh, N. Manwani, and P. Sastry, “On the robustness of decision tree learning under label noise,” in Proc. PAKDD, 2017, pp. 685–697.
[60] S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-learning regularization prevents memorization of noisy labels,” in Proc. NeurIPS, vol. 33, 2020, pp. 20331–20342.
[61] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in Proc. AISTATS, 2020, pp. 4313–4324.
[62] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proc. CVPR, 2017, pp. 1944–1952.
[63] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, “Learning with bounded instance and label-dependent label noise,” in Proc. ICML, 2020, pp. 1789–1799.
[64] B. Garg and N. Manwani, “Robust deep ordinal regression under label noise,” in Proc. ACML, 2020, pp. 782–796.
[65] W. Hu, Z. Li, and D. Yu, “Simple and effective regularization methods for training on noisily labeled data with generalization guarantee,” in Proc. ICLR, 2020, pp. 1–18.
[66] A. K. Menon, B. Van Rooyen, and N. Natarajan, “Learning from binary labels with instance-dependent noise,” Mach. Learn., vol. 107, nos. 8–10, pp. 1561–1595, 2018.
[67] L. Torgo and J. Gama, “Regression using classification algorithms,” Intell. Data Anal., vol. 1, no. 4, pp. 275–292, Oct. 1997.
[68] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in Proc. AAAI, 2017, pp. 1919–1925.
[69] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in Proc. ICLR, 2015, pp. 1–11.
[70] E. Malach and S. Shalev-Shwartz, “Decoupling ‘when to update’ from ‘how to update’,” in Proc. NeurIPS, 2017, pp. 960–970.
[71] L. P. Garcia, A. C. de Carvalho, and A. C. Lorena, “Noise detection in the meta-learning level,” Neurocomputing, vol. 176, pp. 14–25, Dec. 2016.
[72] Y. Yan, Z. Xu, I. W. Tsang, G. Long, and Y. Yang, “Robust semi-supervised learning through label aggregation,” in Proc. AAAI, 2016, pp. 2244–2250.
[73] H. Harutyunyan, K. Reing, G. Ver Steeg, and A. Galstyan, “Improving generalization by controlling label-noise information in neural network weights,” in Proc. ICML, 2020, pp. 4071–4081.
[74] P. Chen, G. Chen, J. Ye, J. Zhao, and P.-A. Heng, “Noise against noise: Stochastic label noise helps combat inherent label noise,” in Proc. ICLR, 2021, pp. 1–20.
[75] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in Proc. ICCV, 2015, pp. 1431–1439.
[76] A. J. Bekker and J. Goldberger, “Training deep neural-networks based on unreliable labels,” in Proc. ICASSP, 2016, pp. 2682–2686.
[77] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” in Proc. ICLRW, 2015, pp. 1–11.
[78] I. Jindal, M. Nokleby, and X. Chen, “Learning deep networks from noisy labels with dropout regularization,” in Proc. ICDM, 2016, pp. 967–972.
[79] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” in Proc. ICLR, 2017, pp. 1–9.
[80] B. Han et al., “Masking: A new perspective of noisy supervision,” in Proc. NeurIPS, 2018, pp. 5836–5846.
[81] J. Yao et al., “Deep learning from noisy image labels with quality embedding,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1909–1922, Dec. 2018.
[82] L. Cheng et al., “Weakly supervised learning with side information for noisy labeled images,” in Proc. ECCV, 2020, pp. 306–321.
[83] X. Xia et al., “Extended T: Learning with mixed closed-set and open-set noisy labels,” 2020, arXiv:2012.00932.
[84] I. Goodfellow et al., “Generative adversarial nets,” in Proc. NeurIPS, 2014, pp. 2672–2680.
[85] K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference via generative classifiers for handling noisy labels,” in Proc. ICML, 2019, pp. 3763–3772.
[86] R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman, “Learning from noisy labels by regularized estimation of annotator confusion,” in Proc. CVPR, 2019, pp. 11244–11253.
[87] S. Jenni and P. Favaro, “Deep bilevel learning,” in Proc. ECCV, 2018, pp. 618–633.
[88] D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can improve model robustness and uncertainty,” in Proc. ICML, 2019, pp. 2712–2721.
[89] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar, “Can gradient clipping mitigate label noise?” in Proc. ICLR, 2020, pp. 1–26.
[90] X. Xia et al., “Robust early-learning: Hindering the memorization of noisy labels,” in Proc. ICLR, 2021, pp. 1–15.
[91] H. Wei, L. Tao, R. Xie, and B. An, “Open-set label noise can improve robustness against inherent label noise,” in Proc. NeurIPS, 2021, pp. 1–15.
[92] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. ICLR, 2014, pp. 1–11.
[93] S. Zhang, Y. Hou, B. Wang, and D. Song, “Regularizing neural networks via retaining confident connections,” Entropy, vol. 19, no. 7, p. 313, Jun. 2017.
[94] M. Lukasik, S. Bhojanapalli, A. Menon, and S. Kumar, “Does label smoothing mitigate label noise?” in Proc. ICLR, 2020, pp. 6448–6458.
[95] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” in Proc. ICLR, 2018, pp. 1–13.
[96] B. C. Csáji et al., “Approximation with artificial neural networks,” Fac. Sci., Eötvös Loránd Univ., Hungary, vol. 24, no. 48, p. 7, 2001.
[97] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8778–8788.
[98] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in Proc. ICCV, 2019, pp. 322–330.
[99] Y. Lyu and I. W. Tsang, “Curriculum loss: Robust learning and generalization against label corruption,” in Proc. ICLR, 2020, pp. 1–22.
[100] L. Feng, S. Shu, Z. Lin, F. Lv, L. Li, and B. An, “Can cross entropy loss be robust to label noise,” in Proc. IJCAI, 2020, pp. 2206–2212.
[101] Y. Liu and H. Guo, “Peer loss functions: Learning from noisy labels without knowing noise rates,” in Proc. ICML, 2020, pp. 6226–6236.
[102] H. Kumar, N. Manwani, and P. Sastry, “Robust learning of multi-label classifiers under label noise,” in Proc. CODS-COMAD, 2020, pp. 90–97.
[103] E. Amid, M. K. Warmuth, and S. Srinivasan, “Two-temperature logistic regression based on the Tsallis divergence,” in Proc. AISTATS, 2019, pp. 2388–2396.
[104] E. Amid, M. K. Warmuth, R. Anil, and T. Koren, “Robust bi-tempered logistic loss based on Bregman divergences,” in Proc. NeurIPS, 2019, pp. 14987–14996.
[105] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey, “Normalized loss functions for deep learning with noisy labels,” in Proc. ICML, 2020, pp. 6543–6553.
[106] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in Proc. ICML, 2018, pp. 4334–4343.
[107] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, “Using trusted data to train deep networks on labels corrupted by severe noise,” in Proc. NeurIPS, 2018, pp. 10456–10465.
[108] R. Wang, T. Liu, and D. Tao, “Multiclass learning with partially corrupted labels,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2568–2580, Jun. 2018.
[109] H.-S. Chang, E. Learned-Miller, and A. McCallum, “Active bias: Training more accurate neural networks by emphasizing high variance samples,” in Proc. NeurIPS, 2017, pp. 1002–1012.
[110] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Unsupervised label noise modeling and loss correction,” in Proc. ICML, 2019, pp. 312–321.
[111] X. Ma et al., “Dimensionality-driven learning with noisy labels,” in Proc. ICML, 2018, pp. 3355–3364.
[112] B. Han et al., “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8527–8537.
[113] X. Xia et al., “Are anchor points really indispensable in label-noise learning?” in Proc. NeurIPS, 2019, pp. 1–12.
[114] Y. Yao et al., “Dual T: Reducing estimation error for transition matrix in label-noise learning,” in Proc. NeurIPS, 2020, pp. 7260–7271.
[115] Y. Zhang and M. Sugiyama, “Approximating instance-dependent noise via instance-confidence embedding,” 2021, arXiv:2103.13569.
[116] S. Yang et al., “Estimating instance-dependent label-noise transition matrix using DNNs,” 2021, arXiv:2105.13001.
[117] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 447–461, Mar. 2016.
[118] H. Zhang, X. Xing, and L. Liu, “DualGraph: A graph-based method for reasoning about label noise,” in Proc. CVPR, 2021, pp. 9654–9663.
[119] L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: Beyond empirical risk minimization,” in Proc. NeurIPS, 2020, pp. 19365–19376.
[120] M. E. Houle, “Local intrinsic dimensionality I: An extreme-value-theoretic foundation for similarity applications,” in Proc. SISAP, 2017, pp. 64–79.
[121] S. Zheng, P. Wu, A. Goswami, M. Goswami, D. Metaxas, and C. Chen, “Error-bounded correction of noisy labels,” in Proc. ICML, 2020, pp. 11447–11457.
[122] P. Chen, J. Ye, G. Chen, J. Zhao, and P.-A. Heng, “Beyond class-conditional assumption: A primary attempt to combat instance-dependent label noise,” in Proc. AAAI, 2021, pp. 1–10.
[123] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
[124] J. Shu et al., “Meta-weight-net: Learning an explicit mapping for sample weighting,” in Proc. NeurIPS, 2019, pp. 1917–1928.
[125] Z. Wang, G. Hu, and Q. Hu, “Training noise-robust deep neural networks via meta-learning,” in Proc. CVPR, 2020, pp. 4524–4533.
[126] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps, “Learning to learn from weak supervision by full supervision,” in Proc. NeurIPSW, 2017, pp. 1–8.
[127] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps, “Avoiding your teacher’s mistakes: Training neural networks with controlled weak supervision,” 2017, arXiv:1711.00313.
[128] Z. Zhang, H. Zhang, S. O. Arik, H. Lee, and T. Pfister, “Distilling effective supervision from severe label noise,” in Proc. CVPR, 2020, pp. 9294–9303.
[129] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning from noisy labels with distillation,” in Proc. ICCV, 2017, pp. 1910–1918.
[130] G. Zheng, A. H. Awadallah, and S. Dumais, “Meta label correction for noisy label learning,” in Proc. AAAI, 2021, pp. 1–9.
[131] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in Proc. ICML, 2018, pp. 2304–2313.
[132] X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in Proc. ICML, 2019, pp. 7164–7173.
[133] Y. Wang et al., “Iterative learning with open-set noisy labels,” in Proc. CVPR, 2018, pp. 8688–8696.
[134] Y. Shen and S. Sanghavi, “Learning with bad training data via iterative trimmed loss minimization,” in Proc. ICML, 2019, pp. 5739–5748.
[135] P. Chen, B. Liao, G. Chen, and S. Zhang, “Understanding and utilizing deep neural networks trained with noisy labels,” in Proc. ICML, 2019, pp. 1062–1070.
[136] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “SELF: Learning to filter noisy labels with self-ensembling,” in Proc. ICLR, 2020, pp. 1–15.
[137] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Robust learning by self-transition for handling noisy labels,” in Proc. KDD, 2021, pp. 1490–1500.
[138] D. Krueger et al., “Deep nets don’t learn via memorization,” in Proc. ICLRW, 2017, pp. 1–4.
[139] C. Zhang, S. Bengio, M. Hardt, M. C. Mozer, and Y. Singer, “Identity crisis: Memorization and generalization under extreme overparameterization,” in Proc. ICLR, 2020, pp. 1–39.
[140] Q. Yao, H. Yang, B. Han, G. Niu, and J. T.-Y. Kwok, “Searching to exploit memorization effect in learning with noisy labels,” in Proc. ICML, 2020, pp. 10789–10798.
[141] H. Song, M. Kim, D. Park, and J.-G. Lee, “How does early stopping help generalization against label noise?” 2019, arXiv:1911.08059.
[142] J. Li, R. Socher, and S. C. Hoi, “DivideMix: Learning with noisy labels as semi-supervised learning,” in Proc. ICLR, 2020, pp. 1–14.
[143] H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement: A joint training method with co-regularization,” in Proc. CVPR, 2020, pp. 13726–13735.
[144] J. Huang, L. Qu, R. Jia, and B. Zhao, “O2U-Net: A simple noisy label detection approach for deep neural networks,” in Proc. ICCV, Oct. 2019, pp. 3326–3334.
[145] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” in Proc. Int. Conf. Manage. Data (SIGMOD), vol. 29, 2000, pp. 93–104.
[146] P. Wu, S. Zheng, M. Goswami, D. Metaxas, and C. Chen, “A topological filter for learning with label noise,” in Proc. NeurIPS, 2020, pp. 21382–21393.
[147] Z.-F. Wu, T. Wei, J. Jiang, C. Mao, M. Tang, and Y.-F. Li, “NGC: A unified framework for learning with open-world noisy data,” in Proc. ICCV, 2021, pp. 62–71.
[148] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1195–1204.
[149] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” in Proc. NeurIPS, 2019, pp. 5050–5060.
[150] T. Zhou, S. Wang, and J. Bilmes, “Robust curriculum learning: From clean label detection to noisy label self-correction,” in Proc. ICLR, 2021, pp. 1–18.
[151] X. Li, T. Liu, B. Han, G. Niu, and M. Sugiyama, “Provably end-to-end label-noise learning without anchor points,” in Proc. ICML, 2021, pp. 6403–6413.
[152] G. Pleiss, T. Zhang, E. R. Elenberg, and K. Q. Weinberger. (2020). Detecting Noisy Training Data With Loss Curves. [Online]. Available: https://openreview.net/forum?id=HyenUkrtDB
[153] C. Scott, “A rate of convergence for mixture proportion estimation, with application to learning from noisy labels,” in Proc. AISTATS, 2015, pp. 838–846.
[154] Y. LeCun, C. Cortes, and C. J. Burges. (1998). The MNIST Database of Handwritten Digits. [Online]. Available: http://yann.lecun.com/exdb/mnist
[155] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” 2017, arXiv:1708.07747.
[156] A. Krizhevsky, V. Nair, and G. Hinton. (2014). CIFAR-10 and CIFAR-100 Datasets. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[157] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in Proc. NeurIPSW, 2011, pp. 1–9.
[158] A. Karpathy et al., “CS231n convolutional neural networks for visual recognition,” Neural Netw., vol. 1, p. 1, Dec. 2016.
[159] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, and Y. Liu, “Learning with noisy labels revisited: A study using real-world human annotations,” 2021, arXiv:2110.12088.
[160] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – Mining discriminative components with random forests,” in Proc. ECCV, 2014, pp. 446–461.
[161] J. Bootkrajang and J. Chaijaruwanich, “Towards instance-dependent label noise-tolerant classification: A probabilistic approach,” Pattern Anal. Appl., vol. 23, no. 1, pp. 95–111, 2020.
[162] G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” Int. J. Data Warehousing Mining, vol. 3, no. 3, pp. 1–13, 2007.
[163] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, Sep. 2004.
[164] I. Krasin et al., “OpenImages: A public dataset for large-scale multi-label and multi-class image classification,” Dataset, vol. 2, no. 3, p. 18, 2017.
[165] W. Zhao and C. Gomes, “Evaluating multi-label classifiers with noisy labels,” 2021, arXiv:2102.08427.
[166] J. M. Johnson and T. M. Khoshgoftaar, “Survey on deep learning with class imbalance,” J. Big Data, vol. 6, no. 1, pp. 1–54, Dec. 2019.
[167] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Proc. NeurIPS, 2016, pp. 1–9.
[168] H. Jiang and O. Nachum, “Identifying and correcting label bias in machine learning,” in Proc. AISTATS, 2020, pp. 702–712.
[169] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan, “Robust optimization for fairness with noisy protected groups,” in Proc. NeurIPS, 2020, pp. 5190–5203.
[170] J. Wang, Y. Liu, and C. Levy, “Fair classification with group-dependent label noise,” in Proc. FAccT, 2021, pp. 526–536.
[171] J. Uesato, J.-B. Alayrac, P.-S. Huang, R. Stanforth, A. Fawzi, and P. Kohli, “Are labels required for improving adversarial robustness?” in Proc. NeurIPS, 2019, pp. 12192–12202.
[172] B. B. Damodaran, K. Fatras, S. Lobry, R. Flamary, D. Tuia, and N. Courty, “Wasserstein adversarial regularization (WAR) on label noise,” in Proc. ICLR, 2020, pp. 1–15.
[173] J. Zhu et al., “Understanding the interaction of adversarial training with noisy labels,” 2021, arXiv:2102.03482.
[174] G. Nguyen et al., “Machine learning and deep learning frameworks and libraries for large-scale data mining: A survey,” Artif. Intell. Rev., vol. 52, no. 1, pp. 77–124, 2019.

Hwanjun Song received the Ph.D. degree from the Graduate School of Knowledge Service Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2021.
He was a Research Intern with Google Research, Seoul, South Korea, in 2020. He is currently a Research Scientist with the NAVER AI Laboratory, Seongnam, South Korea. His research interests include designing advanced approaches to handle large-scale and noisy data, two main real-world challenges for the practical use of AI.

Minseok Kim (Student Member, IEEE) received the master’s degree from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2018, where he is currently pursuing the Ph.D. degree under the supervision of Prof. Jae-Gil Lee with the Graduate School of Knowledge Service Engineering.
His current research interests include robustness and uncertainty in machine learning, data augmentation, and personalized recommendation systems.

Dongmin Park received the master’s degree from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2020, where he is currently pursuing the Ph.D. degree under the supervision of Prof. Jae-Gil Lee with the Graduate School of Knowledge Service Engineering.
His current research interests include robust deep learning and representation learning in graph and time-series data.

Yooju Shin received the master’s degree from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2019, where he is currently pursuing the Ph.D. degree under the supervision of Prof. Jae-Gil Lee with the Graduate School of Knowledge Service Engineering.
His current research interests include out-of-distribution learning and label-efficient learning in time series.

Jae-Gil Lee (Member, IEEE) received the Ph.D. degree in computer science from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2005.
He was a Post-Doctoral Researcher with the IBM Research–Almaden Laboratory, San Jose, CA, USA, and a Post-Doctoral Research Associate with the University of Illinois at Urbana–Champaign, Champaign, IL, USA. He is currently an Associate Professor with KAIST, where he is also the Leader of the Data Mining Laboratory. His research interests include mobility and stream data mining, deep learning-based big data analysis, and distributed deep learning.