0% found this document useful (0 votes)

51 views19 pages

Learning From Noisy Labels With Deep Neural Networks Survey

This survey discusses the challenges and methods for training deep neural networks (DNNs) with noisy labels, which can significantly impair generalization performance. It reviews 62 state-of-the-art robust training methods categorized into five groups, along with a systematic comparison of their properties. The document also highlights the importance of noise rate estimation and outlines future research directions in the field of robust deep learning.

Uploaded by

wangyunbo09

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

51 views19 pages

Learning From Noisy Labels With Deep Neural Networks Survey

Uploaded by

wangyunbo09

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO.

11, NOVEMBER 2023 8135

Learning From Noisy Labels With Deep

Neural Networks: A Survey
Hwanjun Song , Minseok Kim, Student Member, IEEE, Dongmin Park ,
Yooju Shin, and Jae-Gil Lee , Member, IEEE

Abstract— Deep learning has achieved remarkable success in

numerous domains with help from large amounts of big data.
However, the quality of data labels is a concern because of
the lack of high-quality labels in many real-world scenarios.
As noisy labels severely degrade the generalization performance
of deep neural networks, learning from noisy labels (robust
training) is becoming an important task in modern deep learning
applications. In this survey, we first describe the problem of
learning with label noise from a supervised learning perspective.
Next, we provide a comprehensive review of 62 state-of-the-art
robust training methods, all of which are categorized into five Fig. 1. Convergence curves of training and test accuracy when training
groups according to their methodological difference, followed by WideResNet-16-8 using a standard training method on the CIFAR-100 dataset
a systematic comparison of six properties used to evaluate their with the symmetric noise of 40%: “Noisy w/o. Reg.” and “Noisy w. Reg.” are
superiority. Subsequently, we perform an in-depth analysis of the models trained on noisy data without and with regularization, respectively,
noise rate estimation and summarize the typically used evaluation and “Clean w. Reg.” is the model trained on clean data with regularization.
methodology, including public noisy datasets and evaluation
metrics. Finally, we present several promising research directions
that can serve as a guideline for future studies. by a label-flipping attack [15]. Such unreliable labels are called
noisy labels because they may be corrupted from ground-truth
Index Terms— Classification, deep learning, label noise, noisy
label, robust deep learning, robust optimization, survey. labels. The ratio of corrupted labels in real-world datasets is
reported to range from 8.0% to 38.5% [16]–[19].
I. I NTRODUCTION In the presence of noisy labels, training DNNs is known
to be susceptible to noisy labels because of the significant
W ITH the recent emergence of large-scale datasets, deep
neural networks (DNNs) have exhibited impressive
performance in numerous machine learning tasks, such as
number of model parameters that render DNNs overfit to
even corrupted labels with the capability of learning any
computer vision [1], [2], information retrieval [3]–[5], and complex function [20], [21]. Zhang et al. [22] demonstrated
language processing [6]–[8]. Their success is dependent on that DNNs can easily fit an entire training dataset with
the availability of massive but carefully labeled data, which any ratio of corrupted labels, which eventually resulted in
are expensive and time-consuming to obtain. Some nonexpert poor generalizability on a test dataset. Unfortunately, popular
sources, such as Amazon’s Mechanical Turk and the surround- regularization techniques, such as data augmentation [23],
ing text of collected data, have been widely used to mitigate weight decay [24], dropout [25], and batch normalization [26]
the high labeling cost; however, the use of these sources have been applied extensively, but they do not completely
often results in unreliable labels [9]–[12]. In addition, data overcome the overfitting issue by themselves. As shown in
labels can be extremely complex even for experienced domain Fig. 1, the gap in test accuracy between models trained on
experts [13], [14]; they can also be adversarially manipulated clean and noisy data remains significant even though all of
the aforementioned regularization techniques are activated.
Manuscript received 15 July 2020; revised 6 April 2021 and 10 November In addition, the accuracy drop with label noise is considered
2021; accepted 15 February 2022. Date of publication 7 March 2022; date to be more harmful than with other noises, such as input
of current version 30 October 2023. This work was supported by Institute
of Information & Communications Technology Planning & Evaluation (IITP) noise [27]. Hence, achieving a good generalization capability
grant funded by the Korea government (MSIT) (No. 2020-0-00862, DB4DL: in the presence of noisy labels is a key challenge.
High-Usability and Performance In-Memory Distributed DBMS for Deep Several studies have been conducted to investigate super-
Learning). (Corresponding author: Jae-Gil Lee.)
Hwanjun Song is with the NAVER AI Laboratory, Seongnam 13561, South vised learning under noisy labels. Beyond conventional
Korea (e-mail: [email protected]). machine learning techniques [13], [28], deep learning tech-
Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee are with niques have recently gained significant attention in the
the Graduate School of Knowledge Service Engineering, Korea Advanced
Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: machine learning community. In this survey, we present the
[email protected]; [email protected]; [email protected]; advances in recent deep learning techniques for overcoming
[email protected]). noisy labels. We surveyed 174 recent studies by recursively
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TNNLS.2022.3152527. tracking relevant bibliographies in papers published at premier
Digital Object Identifier 10.1109/TNNLS.2022.3152527 research conferences, such as CVPR, ICCV, NeurIPS, ICML,
2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8136 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

Fig. 2. Categorization of recent deep learning methods for overcoming noisy labels.

and ICLR. Although we attempted to comprehensively include TABLE I

all recent studies at the time of submission, some of them S UMMARY OF THE N OTATION
may not be included because of the quadratic increase in deep
learning papers. The studies included were grouped into five
categories, as shown in Fig. 2 (see Section III for details).

A. Related Surveys
Frénay and Verleysen [13] discussed the potential negative
consequence of learning from noisy labels and provided a
comprehensive survey on noise-robust classification meth-
ods, focusing on conventional supervised approaches, such
as naïve Bayes and support vector machines. Furthermore,
their survey included the definitions and sources of label
noise and the taxonomy of label noise. Zhang et al. [28]
discussed another aspect of label noise in crowdsourced data
annotated by nonexperts and provided a thorough review
of expectation–maximization (EM) algorithms that were pro-
posed to improve the quality of crowdsourced labels. Mean-
while, Nigam et al. [29] provided a brief introduction to deep
learning algorithms that were proposed to manage noisy labels; II. P RELIMINARIES
however, the scope of these algorithms was limited to only In this section, the problem statement for supervised learn-
two categories, i.e., the loss function and sample selection ing with noisy labels is provided along with the taxonomy
in Fig. 2. Recently, Han et al. [30] summarized the essential of label noise. Managing noisy labels is a long-standing
components of robust learning with noisy labels, but their issue; therefore, we review the basic conventional approaches
categorization is totally different from ours in philosophy; and theoretical foundations underlying robust deep learning.
we mainly focus on systematic methodological difference, Table I summarizes the notation frequently used in this study.
whereas they rather focused on more general views, such
as input data, objective functions, and optimization policies.
A. Supervised Learning With Noisy Labels
Furthermore, this survey is the first to present a comprehen-
sive methodological comparison of existing robust training Classification is a representative supervised learning task for
approaches. learning a function that maps an input feature to a label [38].
In this article, we consider a c-class classification problem
B. Survey Scope using a DNN with a softmax output layer. Let X ⊂ Rd be the
Robust training with DNNs becomes critical to guarantee feature space and Y = {0, 1}c be the ground-truth label space
the reliability of machine learning algorithms. In addition in a one-hot manner. In a typical classification problem, we are
to label noise, two types of flawed training data have been provided with a training dataset D = {(x i , yi )}i=1 N
obtained
actively studied by different communities [31], [32]. Adver- from an unknown joint distribution PD over X ×Y, where each
sarial learning is designed for small, worst case perturba- (x i , yi ) is independent and identically distributed. The goal of
tions of the inputs, so-called adversarial examples, which are the task is to learn the mapping function f (·; ) : X → [0, 1]c
maliciously constructed to deceive an already trained model of the DNN parameterized by such that the parameter
into making errors [33]–[36]. Meanwhile, data imputation minimizes the empirical risk RD ( f )
primarily deals with missing inputs in training data, where 1
RD ( f ) = ED f (x; ), y = f (x; ), y
missing values are estimated from the observed ones [32], [37]. |D|
(x,y)∈D
Adversarial learning and data imputation are closely related to
(1)
robust learning, but handling feature noise is beyond the scope
of this survey—i.e., learning from noisy labels. where is a certain loss function.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
SONG et al.: LEARNING FROM NOISY LABELS WITH DNNs: SURVEY 8137

As data labels are corrupted in various real-world scenarios, C. Nondeep Learning Approaches
we aim to train the DNN from noisy labels. Specifically,
we are provided with a noisy training dataset D̃ = {(x i , ỹi )}i=1
N For decades, numerous methods have been proposed to
obtained from a noisy joint distribution PD̃ over X × Ỹ, manage noisy labels using conventional machine learning
where ỹ is a noisy label which may not be true. Hence, techniques. These methods can be categorized into four
following the standard training procedure, a mini-batch Bt = groups [13], [29], [41], as follows.
{(x i , ỹi )}bi=1 comprising b examples is obtained randomly from 1) Data Cleaning: Training data are cleaned by exclud-
the noisy training dataset D̃ at time t. Subsequently, the DNN ing examples whose labels are likely to be corrupted.
parameter t at time t is updated along the descent direction Bagging and boosting are used to filter out false-labeled
of the empirical risk on mini-batch Bt examples to remove examples with higher weights
⎛ ⎞ because false-labeled examples tend to exhibit much
1 higher weights than true-labeled examples [42], [43].
t+1 = t − η∇ ⎝ f (x; t ), ỹ ⎠ (2) In addition, various methods, such as k-nearest neighbor,
|Bt | (x, ỹ)∈B
t outlier detection, and anomaly detection, have been
where η is a learning rate specified. widely exploited to exclude false-labeled examples from
Here, the risk minimization process is no longer noise- noisy training data [44]–[46]. Nevertheless, this family
tolerant because of the loss computed by the noisy labels. of methods suffers from an over-cleaning issue that
DNNs can easily memorize corrupted labels and correspond- overly removes even the true-labeled examples.
ingly degenerate their generalizations on unseen data [13], 2) Surrogate Loss: Motivated by the noise-tolerance of the
[28], [29]. Hence, mitigating the adverse effects of noisy labels 0–1 loss function [39], many researchers have attempted
is essential to enable noise-tolerant training for deep learning. to resolve its inherent limitations, such as computational
hardness and nonconvexity that render gradient methods
unusable. Hence, several convex surrogate loss func-
B. Taxonomy of Label Noise tions, which approximate the 0–1 loss function, have
This section presents the types of label noise that have been been proposed to train a specified classifier under the
adopted to design robust training algorithms. Even if data binary classification setting [47]–[51]. However, these
labels are corrupted from ground-truth labels without any prior loss functions cannot support the multiclass classifica-
assumption, in essence, the corruption probability is affected tion task.
by the dependency between data features or class labels. 3) Probabilistic Method: Under the assumption that the
A detailed analysis of the taxonomy of label noise was pro- distribution of features is helpful in solving the problem
vided by Frénay and Verleysen [13]. Most existing algorithms of learning from noisy labels [52], the confidence of
dealt with instance-independent noise, but instance-dependent each label is estimated by clustering and then used
noise has not yet been extensively investigated owing to its for a weighted training scheme [53]. This confidence
complex modeling. is also used to convert hard labels into soft labels to
1) Instance-Independent Label Noise: A typical approach reflect the uncertainty of labels [54]. In addition to
for modeling label noise assumes that the corruption process is these clustering approaches, several Bayesian methods
conditionally independent of data features when the true label have been proposed for graphical models such that they
is given [22], [39]. That is, the true label is corrupted by a noise can benefit from using any type of prior information
transition matrix T ∈ [0, 1]c×c , where Ti j := p( ỹ = j |y = i ) in the learning process [55]. However, this family of
is the probability of the true label i being flipped into a methods may exacerbate the overfitting issue owing to
corrupted label j . In this approach, the noise is called a the increased number of model parameters.
symmetric (or uniform) noise with a noise rate τ ∈ [0, 1] if 4) Model-Based Method: As conventional models, such as
∀i= j Ti j = 1 − τ ∧ ∀i= j Ti j = (τ/(c − 1)), where a true label is the SVM and decision tree, are not robust to noisy
flipped into other labels with equal probability. In contrast to labels, significant effort has been expended to improve
symmetric noise, the noise is called an asymmetric (or label- their robustness. To develop a robust SVM model, mis-
dependent) noise if ∀i= j Ti j = 1 − τ ∧ ∃i= j,i=k, j =k Ti j > Tik , classified examples during learning are penalized in the
where a true label is more likely to be mislabeled into a objective [56], [57]. In addition, several decision tree
particular label. For example, a “dog” is more likely to be models are extended using new split criteria to solve
confused with a “cat” than with a “fish.” In a stricter case the overfitting issue when the training data are not fully
when ∀i= j Ti j = 1 − τ ∧ ∃i= j Ti j = τ , the noise is called a pair reliable [58], [59]. However, it is infeasible to apply their
noise, where a true label is flipped into only a certain label. design principles to deep learning.
2) Instance-Dependent Label Noise: For more realistic Meanwhile, deep learning is more susceptible to label noises
noise modeling, the corruption probability is assumed to be than traditional machine learning owing to its high expressive
dependent on both the data features and class labels [16], [40]. power, as proven by many researchers [21], [60], [61]. There
Accordingly, the corruption probability is defined as ρi j (x) = has been significant effort to understand why noisy labels
p( ỹ = j |y = i, x). Unlike the aforementioned noises, the data negatively affect the performance of DNNs [22], [61]–[63].
feature of an example x also affects the chance of x being This theoretical understanding has led to the algorith-
mislabeled. mic design which achieves higher robustness than nondeep

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8138 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

Fig. 3. High-level research overview of robust deep learning for noisy labels. The research directions that are actively contributed by the machine learning
community are categorized into five groups in blue italic.

learning methods. A detailed analysis of theoretical under- 3) Robust Loss Function (Section III-C): Improving the
standing for robust deep learning was provided by loss function.
Han et al. [30]. 4) Loss Adjustment (Section III-D): Adjusting the loss
value according to the confidence of a given loss
D. Regression With Noisy Labels (or label) by loss correction, loss reweighting, or label
In addition to classification, regression is another main topic refurbishment.
of supervised machine learning, which aims to model the rela- 5) Sample Selection (Section III-E): Identifying true-
tionship between a number of features and a continuous target labeled examples from noisy training data via multinet-
variable. Unlike the classification task with a discrete label work or multiround learning.
space, the regression task considers the continuous variable as Overall, we categorize all recent deep learning methods
its target label [64], and thus, it learns the mapping function into five groups corresponding to popular research directions,
f ( · ; ) : X → Y, where Y ∈ R is a continuous label as shown in Fig. 3. In Section III-D, meta-learning is also
space. Given the input feature x and its ground-truth label y, discussed because it finds the optimal hyperparameters for
two types of label noise are considered in the regression task. loss reweighting. In Section III-E, we discuss the recent
An additive noise [65] is formulated by ỹ := y + , where efforts for combining sample selection with other orthogonal
is drawn from a random distribution independent of the directions or semisupervised learning toward the state-of-the-
input feature; an instance-dependent noise [66] is formulated art performance.
by ỹ := ρ(x) where ρ : X → Y is a noise function dependent Fig. 2 shows the categorization of robust training methods
on the input feature. using these five groups.
Although regression predicts continuous values, regression
and classification share the same concept of learning the A. Robust Architecture
mapping function from the input feature x to the output In numerous studies, architectural changes have been made
label y. Thus, many robust approaches for classification are to model the noise transition matrix of a noisy dataset [16],
easily extended to the regression problem with simple modifi- [75]–[82]. These changes include adding a noise adaptation
cation [67]. Thus, in this survey, we focus on the classification layer at the top of the softmax layer and designing a new ded-
setting for which the most robust methods are defined. icated architecture. The resulting architectures yield improved
generalization through the modification of the DNN output
III. D EEP L EARNING A PPROACHES
based on the estimated label transition probability.
According to our comprehensive survey, the robustness of 1) Noise Adaptation Layer: From the view of training data,
deep learning can be enhanced in numerous approaches [16], the noise process is modeled by discovering the underlying
[25], [68]–[74]. Fig. 3 shows an overview of recent research label transition pattern (i.e., the noise transition matrix T).
directions conducted by the machine learning community. Given an example x, the noisy class posterior probability for
All of them (i.e., Sections III-A–III-E) focused on making an example x is expressed by
a supervised learning process more robust to label noise.
1) Robust Architecture (Section III-A): Adding a noise
c
c
p( ỹ = j |x) = p( ỹ = j, y = i |x) = Ti j p(y = i |x)
adaptation layer at the top of an underlying DNN to
i=1 i=1
learn label transition process or developing a dedicated where Ti j = p( ỹ = j |y = i, x). (3)
architecture to reliably support more diverse types of
label noise. In light of this, the noise adaptation layer is intended
2) Robust Regularization (Section III-B): Enforcing a DNN to mimic the label transition behavior in learning a DNN.
to overfitting less to false-labeled examples explicitly or Let p(y|x; ) be the output of the base DNN with a soft-
implicitly. max output layer. Then, following (3), the probability of an

the reliability of estimating the label transition probability to

handle more complex and realistic label noise.
Technical Detail: Probabilistic noise modeling [16] man-
ages two independent networks, each of which is specialized to
predict the noise type and label transition probability. Because
an EM-based approach with random initialization is impracti-
cal for training the entire network, both networks are trained
with massive noisy labeled data after the pretraining step with
a small amount of clean data. Meanwhile, masking [80] is
Fig. 4. Noise modeling process using the noise adaptation layer. a human-assisted approach to convey the human cognition
of invalid label transitions. Considering that noisy labels are
mainly from the interaction between humans and tasks, the
example x being predicted as its noisy label ỹ is parameterized invalid transition investigated by humans was leveraged to
by constrain the noise modeling process. Owing to the difficulty
in specifying the explicit constraint, a variant of generative

c adversarial networks (GANs) [84] was employed in this study.
p( ỹ = j |x; , W) = p( ỹ = j, y = i |x; , W) Recently, the contrastive-additive noise network [81] was
i=1 proposed to adjust incorrectly estimated label transition prob-
c
abilities by introducing a new concept of quality embedding,
= p( ỹ = j |y = i ; W) p(y = i |x; ) .
which models the trustworthiness of noisy labels. RoG [85]
i=1 Noise Adaptation Layer Base Model builds a simple yet robust generative classifier on top of any
(4) discriminative DNN pretrained on noisy data.
Remark: Compared with the noise adaptation layer, this
Here, the noisy label ỹ is assumed to be conditionally inde- family of methods significantly improves the robustness to
pendent of the input x in general. Accordingly, as shown in more diverse types of label noise, but it cannot be easily
Fig. 4, the noisy adaptation layer is added at the top of the extended to other architectures in general.
base DNN to model the noise transition matrix parameterized
by W. This layer should be removed when test data are to be B. Robust Regularization
predicted. Regularization methods have been widely studied to
Technical Detail: Webly learning [75] first trains the base improve the generalizability of a learned model in the machine
DNN only for easy examples retrieved by search engines; learning community [23]–[26]. By avoiding overfitting in
subsequently, the confusion matrix for all training examples model training, the robustness to label noise improves with
is used as the initial weight W of the noise adaptation layer. widely-used regularization techniques, such as data augmenta-
It fine-tunes the entire model in an end-to-end manner for tion [23], weight decay [24], dropout [25], and batch normal-
hard training examples. In contrast, the noise model [77] ization [26]. These canonical regularization methods operate
initializes W to an identity matrix and adds a regularizer to well on moderately noisy data, but they alone do not suffi-
force W to diffuse during DNN training. The dropout noise ciently improve the test accuracy; poor generalization could be
model [25] applies dropout regularization to the adaptation obtained when the noise is heavy [86]. Thus, more advanced
layer, whose output is normalized by the softmax function regularization techniques have been recently proposed, which
to implicitly diffuse W. The s-model [79] is similar to further improved robustness to label noise when used along
the dropout noise model but dropout is not applied. The with the canonical methods. The main advantage of this family
c-model [79] is an extension of the s-model that models is its flexibility in collaborating with other directions because
the instance-dependent noise, which is more realistic than the it only requires simple modifications.
symmetric and asymmetric noises. Meanwhile, NLNN [76] 1) Explicit Regularization: The regularization can be an
adopts the EM algorithm to iterate the E-step to estimate explicit form that modifies the expected training loss, e.g.,
the noise transition matrix and the M-step to backpropagate weight decay and dropout.
the DNN. Technical Detail: Bilevel learning [87] uses a clean val-
Remark: A common drawback of this family is their idation dataset to regularize the overfitting of a model by
inability to identify false-labeled examples, treating all the introducing a bilevel optimization approach, which differs
examples equally. Thus, the estimation error for the transition from the conventional one in that its regularization constraint
matrix is generally large when only noisy training data is is also an optimization problem. Overfitting is controlled by
used or when the noise rate is high [83]. Meanwhile, for the adjusting the weights on each mini-batch and selecting their
EM-based method, becoming stuck in local optima is values such that they minimize the error on the validation
inevitable, and high computational costs are incurred [79]. dataset. Meanwhile, annotator confusion [86] assumes the
2) Dedicated Architecture: Beyond the label-dependent existence of multiple annotators and introduces a regularized
label noise, several studies have been conducted to support EM-based approach to model the label transition probability;
more complex noise, leading to the design of dedicated archi- its regularizer enables the estimated transition probability
tectures [16], [80], [81]. They typically aimed at increasing to converge to the true confusion matrix of the annotators.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8140 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

In contrast, pretraining [88] empirically proves that fine-tuning be the learned classifier with the modified loss for the
on a pretrained model provides a significant improvement noisy data, where R̂ ,D̃ ( f ) = ED̃ [( f (x; ), ỹ)]. If is
in robustness compared with models trained from scratch; L-Lipschitz and classification calibrated [50], with probability
the universal representations of pretraining prevent the model at least 1 − δ, there exists a nondecreasing function ζ with
parameters from being updated in the wrong direction by noisy ζ (0) = 0 [39] such that
labels. PHuber [89] proposes a composite loss-based gradient Approximation and Estimation Errors
clipping, which is a variation of standard gradient clipping for
∗
label noise robustness. Robust early learning [90] classifies RD ( f )−R ≤ ζ min f ∈F R,D ( f ) − min f R,D ( f )
ˆ
critical parameters and noncritical parameters for fitting clean
and noise labels, respectively. Then, it penalizes only the + 4L p RC(F ) + 2 log(1/δ) 2|D| . (7)
noncritical ones with a different update rule. ODLN [91]
leverages open-set auxiliary data and prevents the overfitting L p is the Lipschitz constant of and RC is the Rademacher
of noisy labels by assigning random labels to the open-set complexity of the hypothesis class F . Then, by the universal
examples, which are uniformly sampled from the label set. approximation theorem [96], the Bayes optimal classifier f ∗
Remark: The explicit regularization often introduces sen- is guaranteed to be in the hypothesis class F with DNNs.
sitive model-dependent hyperparameters or requires deeper Based on this theoretical foundation, researchers have
architectures to compensate for the reduced capacity, yet it attempted to design robust loss functions such that they
can lead to significant performance gain if they are optimally achieve a small risk for unseen clean data even when noisy
tuned. labels exist in the training data [68], [97]–[101].
2) Implicit Regularization: The regularization can also be Technical Detail: Initially, Manwani and Sastry [48] the-
an implicit form that gives the effect of stochasticity, e.g., data oretically proved a sufficient condition for the loss func-
augmentation and mini-batch stochastic gradient descent. tion such that risk minimization with that function becomes
Technical Detail: Adversarial training [92] enhances the noise-tolerant for binary classification. Subsequently, the suffi-
noise tolerance by encouraging the DNN to correctly clas- cient condition was extended for multiclass classification using
sify both original inputs and hostilely perturbed ones. Label deep learning [68]. Specifically, a loss function is defined to
smoothing [93], [94] estimates the marginalized effect of be noise-tolerant for a c-class classification under symmetric
label noise during training, thereby reducing overfitting by noise if the function satisfies the noise rate τ < ((c − 1)/c)
preventing the DNN from assigning a full probability to noisy and
training examples. Instead of the one-hot label, the noisy label c

is mixed with a uniform mixture over all possible labels f (x; ), y = j = C ∀x ∈ X ∀ f (8)
j =1
ȳ = ȳ(1), ȳ(2), . . . , ȳ(c)
where C is a constant. This condition guarantees that the
where ȳ(i ) = (1 − α) · [ ỹ = i ] + α/c and α ∈ [0, 1]. (5) classifier trained on noisy data has the same misclassification
Here, [·] is the Iverson bracket and α is the smoothing probability as that trained on noise-free data under the speci-
degree. In contrast, mixup [95] regularizes the DNN to favor fied assumption. An extension for multilabel classification was
simple linear behaviors in between training examples. First, the provided by Kumar et al. [102]. Moreover, if RD ( f ∗ ) = 0,
mini-batch is constructed using virtual training examples, each then the function is also noise-tolerant under an asymmetric
of which is formed by the linear interpolation of two noisy noise, where f ∗ is a global risk minimizer of RD .
training examples (x i , ỹi ) and (x j , ỹ j ) obtained at random For the classification task, the categorical cross entropy
from noisy training data D̃ (CCE) loss is the most widely used loss function owing
to its fast convergence and high generalization capability.
x mix = λx i + (1 − λ)x j and ymix = λ ỹi + (1 − λ) ỹ j (6) However, in the presence of noisy labels, the robust MAE [68]
where λ ∈ [0, 1] is the balance parameter between two showed that the mean absolute error (MAE) loss achieves
examples. Thus, mixup extends the training distribution by better generalization than the CCE loss because only the
updating the DNN for the constructed mini-batch. MAE loss satisfies the aforementioned condition. A limita-
Remark: The implicit regularization improves the gener- tion of the MAE loss is that its generalization performance
alization capability of the DNN without reducing the rep- degrades significantly when complicated data are involved.
resentational capacity. It also does not introduce sensitive Hence, the generalized cross entropy (GCE) [97] was proposed
model-dependent hyperparameters because it is applied to the to achieve the advantages of both MAE and CCE losses;
training data. However, the extended feature or label space the GCE loss is a more general class of noise-robust loss
slows down the convergence of training. that encompasses both of them. Amid et al. [103] extended
the GCE loss by introducing two temperatures based on the
Tsallis divergence. Bitempered loss [104] introduces a proper
C. Robust Loss Function unbiased generalization of the CE loss based on the Bregman
It was proven that a learned DNN with a suitably modified divergence. In addition, inspired by the symmetricity of the
loss function for noisy data D̃ can approach the Bayes Kullback–Leibler divergence, the symmetric cross entropy
optimal classifier f ∗ , which achieves the optimal Bayes risk (SCE) [98] was proposed by combining a noise tolerance term,
R∗ = RD ( f ∗ ) for clean data D. Let fˆ = argmin f ∈F R̂ ,D̃ ( f ) namely reverse cross entropy loss, with the standard CCE loss.

Meanwhile, the curriculum loss (CL) [99] is a surrogate Conversely, forward correction [62] uses a linear com-
loss of the 0–1 loss function; it provides a tight upper bound bination of a DNN’s softmax outputs before applying the
and can easily be extended to multiclass classification. The loss function. Hence, the forward correction is performed
active passive loss (APL) [105] is a combination of two types by multiplying the estimated transition probability with the
of robust loss functions, an active loss that maximizes the softmax outputs during the forward propagation step
probability of belonging to the given class and a passive loss
that minimizes the probability of belonging to other classes. f (x; ), ỹ = p̂( ỹ|1), . . . , p̂( ỹ|c) f (x; ) , ỹ
Remark: The robustness of these methods is theoretically
= T̂ f (x; ) , ỹ . (10)
supported well. However, they perform well only in simple
cases, when learning is easy or the number of classes is
small [106]. Moreover, the modification of the loss function Furthermore, gold loss correction [107] assumes the avail-
increases the training time for convergence [97]. ability of clean validation data or anchor points for loss
correction. Thus, a more accurate transition matrix is obtained
D. Loss Adjustment by using them as additional information, which further
Loss adjustment is effective for reducing the negative impact improves the robustness of the loss correction. Recently,
of noisy labels by adjusting the loss of all training examples T-revision [113] provides a solution that can infer the tran-
before updating the DNN [19], [62], [69], [107]–[111]. The sition matrix without anchor points, and dual T [114] fac-
methods associated with it can be categorized into three torizes the matrix into the product of two easy-to-estimate
groups depending on their adjustment philosophy: 1) loss matrices to avoid directly estimating the noisy class pos-
correction that estimates the noise transition matrix to correct terior. Beyond the instance-independent noise assumption,
the forward or backward loss; 2) loss reweighting that imposes Zhang and Sugiyama [115] introduced the instance-confidence
different importance to each example for a weighted training embedding to model instance-dependent noise in estimating
scheme; 3) label refurbishment that adjusts the loss using the the transition matrix. On the other hand, Yang et al. [116]
refurbished label obtained from a convex combination of noisy proposed to use the Bayes optimal transition matrix estimated
and predicted labels; and 4) meta learning that automatically from the distilled examples for the instance-dependent noise
infers the optimal rule for loss adjustment. Unlike the robust transition matrix.
loss function newly designed for robustness, this family of Remark: The robustness of these approaches is highly
methods aims to make the traditional optimization process dependent on how precisely the transition matrix is estimated.
robust to label noise. Hence, in the middle of training, the To acquire such a transition matrix, they require prior knowl-
update rule is adjusted such that the negative impact of label edge in general, such as anchor points or clean validation data.
noise is minimized. 2) Loss Reweighting: Inspired by the concept of importance
In general, loss adjustment allows for a full exploration reweighting [117], loss reweighting aims to assign smaller
of the training data by adjusting the loss of every example. weights to the examples with false labels and greater weights
However, the error incurred by false correction is accumulated, to those with true labels. Accordingly, the reweighted loss on
especially when the number of classes or the number of the mini-batch Bt is used to update the DNN
mislabeled examples is large [112]. ⎛ Reweighted Loss
⎞
1) Loss Correction: Similar to the noise adaptation layer ⎟
⎜ 1
presented in Section III-A, this approach modifies the loss t+1 = t − η∇ ⎝ w(x, ỹ) f (x; t ), ỹ ⎠
|Bt | (x, ỹ)∈B
of each example by multiplying the estimated label transition t

probability by the output of a specified DNN. The main (11)

difference is that the learning of the transition probability is
decoupled from that of the model. where w(x, ỹ) is the weight of an example x with its noisy
Technical Detail: Backward correction [62] initially approx- label ỹ. Hence, the examples with smaller weights do not
imates the noise transition matrix using the softmax output significantly affect the DNN learning.
of the DNN trained without loss correction. Subsequently, Technical Detail: In importance reweighting [108], the ratio
it retrains the DNN while correcting the original loss based on of two joint data distributions w(x, ỹ) = PD (x, ỹ)/PD̃ (x, ỹ)
the estimated matrix. The corrected loss of an example (x, ỹ) determines the contribution of the loss of each noisy example.
is computed by a linear combination of its loss values for An approximate solution to estimate the ratio was developed
observable labels, whose coefficient is the inverse transition because the two distributions are difficult to determine from
matrix T−1 to the observable label y ∈ {1, . . . , c}, given noisy data. Meanwhile, active bias [109] emphasizes uncertain
its target label ỹ. Therefore, the backward correction is examples with inconsistent label predictions by assigning
performed by multiplying the inverse transition matrix to the their prediction variances as the weights for training. Dual-
prediction for all the observable labels Graph [118] employs graph neural networks and reweights the
examples according to the structural relations among labels,
−1
f (x; ), ỹ = T̂ f (x; ), 1 , . . . , f (x; ), c eliminating the abnormal noise examples.
Remark: These approaches need to manually prespecify the
(9)
weighting function and additional hyperparameters, which is
where T̂ is the estimated noise transition matrix. fairly hard to be applied in practice due to the significant

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8142 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

variation of appropriate weighting schemes that rely on the and noise type-agnostic rules for better practical use. It is
noise type and training data. similar to loss reweighting and label refurbishment, but the
3) Label Refurbishment: Refurbishing a noisy label ỹ effec- adjustment is automated in a meta-learning manner.
tively prevents overfitting to false labels. Let ŷ be the current Technical Detail: For the loss reweighting in (11), the
prediction of the DNN f (x; ). Therefore, the refurbished goal is to learn the weight function w(x, ỹ). Specifically,
label y refurb can be obtained by a convex combination of the L2LWS [126] and CWS [127] are unified neural architectures
noisy label ỹ and the DNN prediction ŷ composed of a target DNN and a meta-DNN. The meta-DNN
is trained on a small clean validation dataset; it then provides
y refurb = α ỹ + (1 − α) ŷ (12)
guidance to evaluate the weight score for the target DNN.
where α ∈ [0, 1] is the label confidence of ỹ. To mitigate the Here, part of the two DNNs is shared and jointly trained
damage of incorrect labeling, this approach backpropagates the to benefit from each other. Automatic reweighting [106] is
loss for the refurbished label instead of the noisy one, thereby a meta-learning algorithm that learns the weights of training
yielding substantial robustness to noisy labels. examples based on their gradient directions. It includes a small
Technical Detail: Bootstrapping [69] is the first method clean validation dataset into the training dataset and reweights
that proposes the concept of label refurbishment to update the the backward loss of the mini-batch examples such that the
target label of training examples. It develops a more coherent updated gradient minimizes the loss of this validation dataset.
network that improves its ability to evaluate the consistency Meta-weight-net [124] parameterizes the weighting function as
of noisy labels, with the label confidence α obtained via a multilayer perceptron network with only one hidden layer.
cross validation. Dynamic bootstrapping [110] dynamically A meta-objective is defined to update its parameters such that
adjusts the confidence α of individual training examples. The they minimize the empirical risk of a small clean dataset.
confidence α is obtained by fitting a two-component and At each iteration, the parameter of the target network is
1-D beta mixture model to the loss distribution of all training guided by the weight function updated via the meta-objective.
examples. Self-adaptive training [119] applies the exponential Likewise, data coefficients (i.e., exemplar weights and true
moving average to alleviate the instability issue of using labels) [128] are estimated by meta-optimization with a small
instantaneous prediction of the current DNN clean set, which is only 0.2% of the entire training set, while
refurbishing the examples probably mislabeled.
refurb
yt+1 = αytrefurb + (1 − α) ŷ, where y0refurb = ỹ. (13)
For the label refurbishment in (12), knowledge distilla-
D2L [111] trains a DNN using a dimensionality-driven learn- tion [129] adopts the technique of transferring knowledge from
ing strategy to avoid overfitting to false labels. A simple one expert model to a target model. The prediction from the
measure called local intrinsic dimensionality [120] is adopted expert DNN trained on small clean validation data is used
to evaluate the confidence α in considering that the overfitting instead of the prediction ŷ from the target DNN. MLC [130]
is exacerbated by dimensional expansion. Hence, refurbished updates the target model with corrected labels provided by a
labels are generated to prevent the dimensionality of the meta-model trained on clean validation data. The two models
representation subspace from expanding at a later stage of are trained concurrently via a bilevel optimization.
training. Recently, SELFIE [19] introduces a novel concept Remark: By learning the update rule via meta-learning, the
of refurbishable examples that can be corrected with high trained network easily adapts to various types of data and label
precision. The key idea is to consider the example with noise. Nevertheless, unbiased clean validation data is essential
consistent label predictions as refurbishable because such to minimize the auxiliary objective, although it may not be
consistent predictions correspond to its true label with a available in real-world data.
high probability owing to the learner’s perceptual consistency.
Accordingly, the labels of only refurbishable examples are
E. Sample Selection
corrected to minimize the number of falsely corrected cases.
Similarly, AdaCorr [121] selectively refurbishes the label of To avoid any false corrections, many recent studies [19],
noisy examples, but a theoretical error bound is provided. [70], [99], [112], [131]–[137] have adopted sample selection
Alternatively, SEAL [122] averages the softmax output of a that involves selecting true-labeled examples from a noisy
DNN on each example over the whole training process, then training dataset. In this case, the update equation in (2) is
retrains the DNN using the averaged soft labels. modified to render a DNN more robust for noisy labels. Let
Remark: Differently from loss correction and reweighting, Ct ⊆ Bt be the identified clean examples at time t. Then, the
all the noisy labels are explicitly replaced with other expected DNN is updated only for the selected clean examples Ct
clean labels (or their combination). If there are not many con- ⎛ ⎞
fusing classes in data, these methods work well by refurbishing 1
t+1 = t − η∇ ⎝ f (x; t ), ỹ ⎠ (14)
the noisy labels with high precision. In the opposite case, the |Ct |
(x, ỹ)∈Ct
DNN could overfit wrongly refurbished labels.
4) Meta Learning: In recent years, meta-learning becomes where the rest mini-batch examples, which are likely to be
an important topic in the machine learning community and false-labeled, are excluded to pursue robust learning.
is applied to improve noise robustness [123]–[125]. The key The memorization nature of DNNs has been explored theo-
concept is learning to learn that performs learning at a level retically and empirically to identify clean examples from noisy
higher than conventional learning, thus achieving data-agnostic training data [138]–[140]. Specifically, assuming clusterable

data where the clusters are located on the unit Euclidean ball,
Li et al. [61] proved the distance from the initial weight W0
to the weight Wt after t iterations
√
Wt − W0 F K + (K 2 0 /C2 )t (15)
where · F is the Frobenius norm, K is the number of clusters,
and C is the set of cluster centers reaching all input examples
within their 0 neighborhood. Equation (15) demonstrates that Fig. 5. Loss distribution of training examples at the training accuracy of
the weights of DNNs start to stray far from the initial weights 50% on noisy CIFAR-100. (This figure is adapted from Song et al. [141].)
(a) Symmetric noise 40%. (b) Asymmetric noise 40%.
when overfitting to corrupted labels, while they are still in the
vicinity of the initial weights at an early stage of training [30],
[61]. In the empirical studies [21], [141], the memorization examples largely overlap, as in the asymmetric noise in
effect is also observed since DNNs tend to first learn simple Fig. 5(b).
and generalized patterns and then gradually overfit to all noisy 2) Multiround Learning: Without maintaining additional
patterns. As such, favoring small-loss training examples as the DNNs, multiround learning iteratively refines the selected set
clean ones are commonly employed to design robust training of clean examples by repeating the training round. Thus,
methods [112], [131], [134], [135], [142]. the selected set keeps improving as the number of rounds
Learning with sample selection is well motivated and works increases.
well in general, but this approach suffers from accumulated Technical Detail: ITLM [134] iteratively minimizes the
error caused by incorrect selection, especially when there trimmed loss by alternating between selecting true-labeled
are many ambiguous classes in training data. Hence, recent examples at the current moment and retraining the DNN using
approaches often leverage multiple DNNs to cooperate with them. At each training round, only a fraction of small-loss
one another [112] or run multiple training rounds [133]. examples obtained in the current round is used to retrain the
Moreover, to benefit from even false-labeled examples, loss DNN in the next round. INCV [135] randomly divides noisy
correction or semisupervised learning have been recently com- training data and then employs cross validation to classify
bined with the sample selection strategy [19], [142]. true-labeled examples while removing large-loss examples at
1) Multinetwork Learning: Collaborative learning and each training round. Here, co-teaching is adopted to train the
co-training are widely used for multinetwork training. Conse- DNN on the identified examples in the final round of training.
quently, the sample selection process is guided by the mentor Similarly, O2U-Net [144] repeats the whole training process
network in the case of collaborative learning or the peer with the cyclical learning rate until enough loss statistics
network in the case of co-training. of every example are gathered. Next, the DNN is retrained
Technical Detail: Initially, decouple [70] proposes the from scratch only for the clean data where false-labeled
decoupling of when to update from how to update. Hence, examples have been detected and removed based on
two DNNs are maintained simultaneously and updated only statistics.
the examples selected based on a disagreement between the A number of variations have been proposed to achieve
two DNNs. Next, due to the memorization effect of DNNs, high performance using iterative refinement only in a single-
many researchers have adopted another selection criterion, training round. Beyond the small-loss trick, iterative detec-
called a small-loss trick, which treats a certain number of tion [133] detects false-labeled examples by employing the
small-loss training examples as true-labeled examples; many local outlier factor algorithm [145]. With a Siamese network,
true-labeled examples tend to exhibit smaller losses than false- it gradually pulls away false-labeled examples from true-
labeled examples, as shown in Fig. 5(a). In MentorNet [131], labeled samples in the deep feature space. MORPH [137]
a pretrained mentor network guides the training of a student introduces the concept of memorized examples which is used
network in a collaborative learning manner. Based on the to iteratively expand an initial safe set into a maximal safe
small-loss trick, the mentor network provides the student set via self-transitional learning. TopoFilter [146] utilizes the
network with examples whose labels are likely to be correct. spatial topological pattern of learned representations to detect
Co-teaching [112] and Co-teaching+ [132] also maintain two true-labeled examples, not relying on the prediction of the
DNNs, but each DNN selects a certain number of small-loss noisy classifier. NGC [147] iteratively constructs the near-
examples and feeds them to its peer DNN for further training. est neighbor graph using latent representations and performs
Co-teaching+ further employs the disagreement strategy of geometry-based sample selection by aggregating information
decouple compared with co-teaching. In contrast, JoCoR [143] from neighborhoods. Soft pesudolabels are assigned to the
reduces the diversity of two networks via co-regularization, examples not selected.
making predictions of the two networks closer. Remark: The selected clean set keeps expanded and purified
Remark: The co-training methods help reduce the confirma- with iterative refinement, mainly through multiround learning.
tion bias [112], which is a hazard of favoring the examples As a side effect, the computational cost for training increases
selected at the beginning of training, while the increase in the linearly for the number of training rounds.
number of learnable parameters makes their learning pipeline 3) Hybrid Approach: An inherent limitation of sample
inefficient. In addition, the small-loss trick does not work well selection is to discard all the unselected training exam-
when the loss distribution of true-labeled and false-labeled ples, thus resulting in a partial exploration of training data.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8144 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

TABLE II
C OMPARISON OF P ROPOSED ROBUST D EEP L EARNING M ETHODS W ITH R ESPECT TO THE F OLLOWING S IX P ROPERTIES : P1—F LEXIBILITY,
P2—N O P RETRAINING , P3—F ULL E XPLORATION , P4—N O S UPERVISION , P5—H EAVY N OISE , AND P6—C OMPLEX N OISE

Meanwhile, SELFIE [19] is a hybrid approach of sam-

ple selection and loss correction. The loss of refurbishable
examples is corrected (i.e., loss correction) and then used
together with that of small-loss examples (i.e., sample selec-
tion). Consequently, more training examples are considered for
updating the DNN. The CL [99] is combined with the robust
loss function approach and used to extract the true-labeled
examples from noisy data.
Remark: Noise robustness is significantly improved by
Fig. 6. Procedures for semisupervised learning under label noise. (a) Noisy
data. (b) Transformed data. (c) SSL.
combining with other techniques. However, the hyperpara-
meters introduced by these techniques render a DNN more
To exploit all the noisy examples, researchers have attempted susceptible to changes in data and noise types, and an increase
to combine sample selection with other orthogonal ideas. in computational cost is inevitable.
Technical Detail: The most prominent method in this
direction is combining a specific sample selection strategy with IV. M ETHODOLOGICAL C OMPARISON
a specific semisupervised learning model. As shown in Fig. 6, In this section, we compare the 62 deep learning meth-
selected examples are treated as labeled clean data, whereas ods for overcoming noisy labels introduced in Section III
the remaining examples are treated as unlabeled. Subsequently, with respect to the following six properties. When selecting
semisupervised learning is performed using the transformed the properties, we refer to the properties that are typically
data. SELF [136] is combined with a semisupervised learning used to compare the performance of robust deep learn-
approach to progressively filter out false-labeled examples ing methods [19], [112]. To the best of our knowledge,
from noisy data. By maintaining the running average model this survey is the first to provide a systematic comparison
called the mean-teacher [148] as the backbone, it obtains the of robust training methods. This comprehensive comparison
self-ensemble predictions of all training examples and then will provide useful insights that can enlighten new future
progressively removes examples whose ensemble predictions directions.
do not agree with their annotated labels. This method further 1) (P1) Flexibility: With the rapid evolution of deep learn-
leverages unsupervised loss from the examples not included in ing research, a number of new network architectures
the selected clean set. DivideMix [142] uses two-component are constantly emerging and becoming available. Hence,
and 1-D Gaussian mixture models to transform noisy the ability to support any type of architecture is impor-
data into labeled (clean) and unlabeled (noisy) sets. Then, tant. “Flexibility” ensures that the proposed method can
it applies a semisupervised technique MixMatch [149]. quickly adapt to the state-of-the-art architecture.
Recently, RoCL [150] employs two-phase learning strategies: 2) (P2) No Pretraining: A typical approach to improve
supervised training on selected clean examples and then noise robustness is to use a pretrained network; how-
semisupervised learning on relabeled noisy examples with ever, this incurs an additional computational cost to
self-supervision. For selection and relabeling, it computes the the learning process. “No Pretraining” ensures that the
exponential moving average of the loss over training iterations. proposed method can be trained from scratch without
any pretraining.
1 https://github.com/endernewton/webly-supervised
2 https://github.com/delchiaro/training-cnn-noisy-labels-keras 26 https://github.com/songhwanjun/ActiveBias
3 https://github.com/ijindal/Noisy_Dropout_regularization 27 https://github.com/dr-darryl-wright/Noisy-Labels-with-Bootstrapping
4 https://github.com/udibr/noisy_labels 28 https://github.com/PaulAlbert31/LabelNoiseCorrection
5 https://github.com/Ryo-Ito/Noisy-Labels-Neural-Network 29 https://github.com/LayneH/self-adaptive-training
6 https://github.com/Cysu/noisy_label 30 https://github.com/xingjunm/dimensionality-driven-learning
7 https://github.com/bhanML/Masking 31 https://github.com/pingqingsheng/LRT
8 https://github.com/pokaxpoka/RoGNoisyLabel 32 https://github.com/chenpf1025/IDN
9 https://github.com/sjenni/DeepBilevel 33 https://github.com/krayush07/learn-by-weak-supervision
10 https://rt416.github.io/pdf/trace_codes.pdf 34 https://github.com/uber-research/learning-to-reweight-examples
11 github.com/hendrycks/pre-training 35 https://github.com/xjtushujun/meta-weight-net
12 https://github.com/dmizr/phuber 36 https://github.com/google-research/google-research/tree/master/ieg
13 37
https://github.com/xiaoboxia/CDR https://aka.ms/MLC
14 https://github.com/hongxin001/ODNL?ref=pythonrepo.com 38 https://github.com/emalach/UpdateByDisagreement
15 https://https://github.com/sarathknv/adversarial-examples-pytorch 39 https://github.com/google/mentornet
16 https://github.com/CoinCheung/pytorch-loss 40 https://github.com/bhanML/Co-teaching
17 https://github.com/facebookresearch/mixup-cifar10 41 https://github.com/bhanML/coteaching_plus
18 42
https://github.com/AlanChou/Truncated-Loss https://github.com/hongxin001/JoCoR
19 https://github.com/YisenWang/symmetric_cross_entropy 43 https://github.com/yanyao-shen/ITLM-simplecode
20 https://github.com/google/bi-tempered-loss 44 https://github.com/chenpf1025/noisy_label_understanding_utilizing
21 https://github.com/HanxunH/Active-Passive-Losses 45 https://github.com/hjimce/O2U-Net
22 https://github.com/giorgiop/loss-correction 46 https://github.com/YisenWang/Iterative_learning
23 47
https://github.com/mmazeika/glc https://github.com/pxiangwu/TopoFilter
24 https://github.com/xiaoboxia/T-Revision 48 https://github.com/kaist-dmlab/SELFIE
25 https://github.com/xiaoboxia/Classification-with-noisy-labels 49 https://github.com/LiJunnan1992/DivideMix

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8146 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

TABLE III
C OMPARISON OF ROBUST D EEP L EARNING C ATEGORIES FOR OVERCOMING N OISY L ABELS

3) (P3) Full Exploration: Excluding unreliable examples The remaining properties (i.e., P2–P4) are only assigned “”
from the update is an effective method for robust deep or “✕.” Regarding the implementation, we assign “N/A” if a
learning; however, it eliminates hard but useful training publicly available source code is not available.
examples as well. “Full exploration” ensures that the No existing method supports all the properties. Each method
proposed methods can use all training examples without achieves noise robustness by supporting a different combina-
severe overfitting to false-labeled examples by adjusting tion of the properties. The supported properties are similar
their training losses or applying semisupervised learning. among the methods of the same (sub)category because those
4) (P4) No Supervision: Learning with supervision, such methods share the same methodological philosophy; however,
as a clean validation set or a known noise rate, is often they differ significantly depending on the (sub)category. There-
impractical because they are difficult to obtain. Hence, fore, we investigate the properties generally supported in each
such supervision had better be avoided to increase practi- (sub)category and summarize them in Table III. Here, the
cality in real-world scenarios. “No supervision” ensures property of a (sub)category is marked as the majority of the
that the proposed methods can be trained without any belonging methods. If no clear trend is observed among those
supervision. methods, then the property is marked “.”
5) (P5) Heavy Noise: In real-world noisy data, the noise
rate can vary from light to heavy. Hence, learning V. N OISE R ATE E STIMATION
methods should achieve consistent noise robustness with
The estimation of a noise rate is an imperative part of
respect to the noise rate. “Heavy noise” ensures that the
utilizing robust methods for better practical use, especially
proposed methods can combat even the heavy noise.
with the approaches belonging to the loss adjustment and
6) (P6) Complex Noise: The type of label noise signif-
sample selection. The estimated noise rate is widely used to
icantly affects the performance of a learning method.
reweight examples for a robust classifier [97], [114], [117] or
To manage real-world noisy data, diverse types of label
to determine how many examples should be selected as clean
noise should be considered when designing a robust
ones [19], [112], [135]. However, detailed analysis has yet to
training method. “Complex noise” ensures that the pro-
be performed properly, though many robust approaches highly
posed method can combat even the complex label noise.
rely on the accuracy of noise rate estimation. The noise rate
Table II shows a comparison of all robust deep learning can be estimated by exploiting the inferred noise transition
methods, which are grouped according to the most appropriate matrix [113], [114], [151], the Gaussian mixture model [110],
category. In the first row, the aforementioned six properties [137], [152], or the cross validation [19], [135].
are labeled as P1–P6, and the availability of open-source
implementation is added in the last column. For each property,
we assign “” if it is completely supported, “✕” if it is not A. Noise Transition Matrix
supported, and “” if it is supported but not completely. More The noise transition matrix has been used to build a statisti-
specifically, “” is assigned to P1 if the method can be flexible cally consistent robust classifier because it represents the class
but requires additional effort, to P5 if the method can combat posterior probabilities for noisy and clean data, as in (3). The
only moderate label noise, and to P6 if the method does not first method to estimate the noise rate is exploiting this noise
make a strict assumption about the noise type but without transition matrix, which can be inferred or trained accurately
explicitly modeling instance-dependent noise. Thus, for P6, by using perfectly clean examples, i.e., anchor points [117],
the method marked with “✕” only deals with the instance- [153]; an example x with its label i is defined as an anchor
independent noise, while the method marked with “” point if p(y = i |x) = 1 and p(y = k|x) = 0 for k = i .
deals with both instance-independent and -dependent noises. Thus, let Ai be the set of anchor points with label i , then the

loss becomes less separable by the GMM as the training

progresses, and thus, proposed the area under the loss (AUL)
curve, which is the sum of the example’s training losses
obtained from all previous training epochs. Even after the
loss signal decays in later epochs, the distributions remain
separable. Therefore, the noise rate is finally estimated by

τ̂ = E(x, ỹ)∈D̃ p g|AULt (x, ỹ)

t

where AULt (x, ỹ) = f (x; t ), ỹ . (19)
i=1
Fig. 7. Training loss distributions of true-labeled and false-labeled examples
using the ground-truth label and the GMM on CIFAR-100 data with two
synthetic noises of 40%. (a) Symmetric noise. (b) Asymmetric noise.
C. Cross Validation
The third method is estimating the noise rate by apply-
element of the noise transition matrix Ti j is estimated by ing cross validation, which typically requires clean valida-
1 tion data [19], [112], [132]. However, such clean validation
c
T̂i j = p( ỹ = j |y = k) p(y = k|x) data are hard to acquire in real-world applications. Thus,
|Ai | x∈A k=1
i Chen et al. [135] leveraged two randomly divided noisy train-
1 ing datasets for cross validation. Under the assumption that the
= p( ỹ = j |x; ) (16)
|Ai | two datasets share exactly the same noise transition matrix,
x∈Ai
the noise rate quantifies the test accuracy of DNNs that are,
where p( ỹ = j |x; ) is the noisy class posterior probability respectively, trained and tested on the two divided sets
of the classifier trained on noisy training data for the anchor
point x (see the detailed proof in [107], [113], and [114]). (1 − τ̂ )2 + τ̂ 2 /(c − 1), if symmetric
TestAccuracy =
Next, based on the inferred noise transition matrix, the noise (1 − τ̂ )2 + τ̂ 2 , if asymmetric.
rate of a balanced training dataset is obtained by averaging (20)
the label transition probabilities between classes
Therefore, the noise rate is estimated from the test accuracy
1 1
c c c c
τ̂ = p( ỹ = j |y = i ) = T̂i j . (17) obtained by cross validation.
c i=1 j =i c i=1 j =i
However, since the anchor points are typically unknown in VI. E XPERIMENTAL D ESIGN
real-world data, they are identified from noisy training data by This section describes the typically used experimental
either theoretical derivations [117] or heuristics [62]. In addi- design for comparing robust training methods in the presence
tion, there have been recent efforts to learn the noise transition of label noise. We introduce publicly available image datasets
matrix without anchor points. T-revision [113] initializes a and then describe widely used evaluation metrics.
transition matrix by exploiting the examples with high noisy
class posterior probabilities and then refines the matrix by A. Publicly Available Datasets
adding a slack variable. Dual-T [114] introduces an inter-
To validate the robustness of the proposed algorithms,
mediate class that factorizes the transition matrix into two
easy-to-estimate matrices for better accuracy. VolMinNet [151] an image classification task was widely conducted on numer-
ous image benchmark datasets. Table IV summarizes pop-
realizes an end-to-end framework and relaxes the need for
ularly used public benchmark datasets, which are classified
anchor points under the sufficiently scattered assumption.
into two categories: 1) a “clean dataset” that consists of
mostly true-labeled examples annotated by human experts and
B. Gaussian Mixture Model
2) a “real-world noisy dataset” that comprises real-world noisy
The second method is exploiting a 1-D and two-component examples with varying numbers of false labels.
GMM to model the loss distribution of true-labeled and false- 1) Clean Datasets: According to the literature [19], [133],
labeled examples [110], [152]. As shown in Fig. 7, since the [142], seven clean datasets are widely used: MNIST50 ,
loss distribution tends to be bimodal, the two Gaussian com- classification of handwritten digits [154]; fashion-MNIST51,
ponents are fit to the training loss by using the EM algorithm; classification of various clothing [155]; CIFAR-1052 , and
the probability of an example being a false-labeled one is CIFAR-100,52 classification of a subset of 80 million cate-
obtained through its posterior probability. Hence, the noise rate gorical images [156]; SVHN,53 classification of house num-
is estimated at each epoch t by computing the expectation of bers in Google Street view images [157]; ImageNet54 , and
the posterior probability for all training examples
50 http://yann.lecun.com/exdb/mnist
τ̂ = E(x, ỹ)∈D̃ p g| f (x; t ), ỹ (18) 51 https://github.com/zalandoresearch/fashion-mnist
52
https://www.cs.toronto.edu/~kriz/cifar.html
where g is the Gaussian component with a larger loss. How- 53 http://ufldl.stanford.edu/housenumbers

ever, Pleiss et al. [152] recently pointed out that the training 54 http://www.image-net.org

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8148 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

TABLE IV
S UMMARY OF P UBLICLY AVAILABLE D ATASETS U SED FOR S TUDYING L ABEL N OISE

tiny-ImageNet,55 image database organized according to the ŷi be the predicted label of the i th example in T . Subsequently,
WordNet hierarchy and its small subset [1], [158]. Because the test accuracy is formalized by
the labels in these datasets are almost all true-labeled, their |{(x i , yi ) ∈ T : ŷi = yi }|
labels in the training data should be artificially corrupted for Test Accuracy = . (21)
|T |
the evaluation of synthetic noises, namely symmetric noise and
asymmetric noise. If the test data are not available, validation accuracy can
be used by replacing T in (21) with validation data V =
2) Real-World Noisy Datasets: Unlike the clean datasets, |V|
real-world noisy datasets inherently contain many mislabeled {(x i , yi )}i=1 as an alternative
examples annotated by nonexperts. According to the litera- |{(x i , yi ) ∈ V : ŷi = yi }|
Validation Accuracy = . (22)
ture [16]–[19], six real-world noisy datasets are widely used: |V|
ANIMAL-10N,56 real-world noisy data of human-labeled Furthermore, if the specified method belongs to the “sample
online images for 10 confusing animals [19]; CIFAR-10N57 selection” category, label precision and label recall [112],
and CIFAR-100N,57 variations of CIFAR-10 and CIFAR-100 [135] can be used as the metrics
with human-annotated real-world noisy labels collected from
|{(x i , ỹi ) ∈ St : ỹi = yi }|
Amazon’s Mechanical Turk [159]. They provide human labels Label Precision =
with different noise rates, as shown in Table IV; Food-101N,58 |St |
real-world noisy data of crawled food images annotated by |{(x i , ỹi ) ∈ St : ỹi = yi }|
Label Recall = (23)
their search keywords in the Food-101 taxonomy [18], [160]; |{(x i , ỹi ) ∈ Bt : ỹi = yi }|
Clothing1M,59 real-world noisy data of large-scale crawled where St is the set of selected clean examples in a mini-
clothing images from several online shopping websites [16]; batch Bt . The two metrics are performance indicators for
WebVision,60 real-world noisy data of large-scale web images the examples selected from the mini-batch as true-labeled
crawled from Flickr and Google Images search [17]. To sup- ones [112].
port sophisticated evaluation, most real-world noisy datasets Meanwhile, if the specified method belongs to the “label
contain their own clean validation set and provide the esti- refurbishment” category, correction error [19] can be used as
mated noise rate of their training set. an indicator of how many examples are incorrectly refurbished

x i ∈ R : argmax y refurb = yi
i
Correction Error =
B. Evaluation Metrics |R|
A typical metric to assess the robustness of a particular (24)
method is the prediction accuracy for unbiased and clean where R is the set of examples whose labels are refurbished
examples that are not used in training. The prediction accuracy by (12) and yirefurb is the refurbished label of the i th examples
degrades significantly if the DNN overfits to false-labeled in R.
examples [22]. Hence, test accuracy has generally been
|T |
adopted for evaluation [13]. For a test set T = {(x i , yi )}i=1 , let VII. F UTURE R ESEARCH D IRECTIONS
With recent efforts in the machine learning community, the
55 https://www.kaggle.com/c/tiny-imagenet
56 https://dm.kaist.ac.kr/datasets/animal-10n
robustness of DNNs becomes evolved in several directions.
57 http://noisylabels.com/ Thus, the existing approaches covered in our survey face a
58
https://kuanghuei.github.io/Food-101N variety of future challenges. This section provides a discus-
59 https://www.floydhub.com/lukasmyth/datasets/clothing1m sion for future research that can facilitate and envision the
60 https://data.vision.ee.ethz.ch/cvl/webvision/download.html
development of deep learning in the label noise area.

A. Instance-Dependent Label Noise recent study [165] that discusses the evaluation of multilabel
Existing theoretical and empirical studies for robust loss classifiers trained with noisy labels.
function and loss correction are largely built upon the C. Class Imbalance Data With Label Noise
instance-independent noise assumption that the label noise
The class imbalance in training data is commonly observed,
is independent of input features [76], [77], [113], [114].
where a few classes account for most of the data. Especially
However, this assumption may not be a good approximation
when working with large data in many real-world applications,
of the real-world label noise. In particular, Chen et al. [122]
this problem becomes more severe and is often associated with
conducted a theoretical hypothesis testing61 using a popular
the problem of noisy labels [166]. Nevertheless, to ease the
real-world dataset, Clothing1M, and proved that its label noise
label noise problem, it is commonly assumed that training
is statistically different from the instance-independent noise.
examples are equally distributed over all class labels in the
This testing confirms that the label noise should depend on
training data. This assumption is quite strong when collecting
the instance.
large-scale data, and thus, we need to consider a more realistic
Conversely, most methods for the other direction (especially,
scenario in which the two problems coexist.
sample selection) work well even under the instance-dependent
Most of the existing robust methods may not work well with
label noise in general since they do not rely on the assumption.
the class imbalance, especially when they rely on the learning
Nevertheless, Song et al. [141] pointed out that their perfor-
dynamics of DNNs, e.g., the small-loss trick or memorization
mance could considerably worsen in the instance-dependent
effect. Under the existence of the class imbalance, the training
(or real-world) noise compared with symmetric noise due to
model converges to major classes faster than minor classes
the confusion between true-labeled and false-labeled exam-
such that most examples in the major class exhibit small
ples. The loss distribution of true-labeled examples heavily
losses (i.e., early memorization). That is, there is a risk of
overlaps that of false-labeled samples in the asymmetric noise,
discarding most examples in the minor class. Furthermore,
which is similar to the real-world noise, in Fig. 5(b). Thus,
in terms of example importance, high-loss examples are com-
identifying clean examples becomes more challenging when
monly favored for the class imbalance problem [124], while
dealing with the instance-dependent label noise.
small-loss examples are favored for the label noise problem.
Beyond the instance-independent label noise, there have
This conceptual contradiction hinders the applicability of the
been a few recent studies for the instance-dependent label
existing methods that neglect the class imbalance. Therefore,
noise. Mostly, they only focus on a binary classification
these two problems should be considered simultaneously to
task [66], [161] or a restricted small-scale machine learning
deal with more general situations.
model, such as logistic regression [63]. Therefore, learning
with the instance-dependent label noise is an important topic D. Robust and Fair Training
that deserves more research attention. Machine learning classifiers can perpetuate and amplify the
existing systemic injustices in society [167]. Hence, fairness is
B. Multilabel Data With Label Noise becoming another important topic. Traditionally, robust train-
ing and fair training have been studied by separate communi-
Most of the existing methods are applicable only for a ties; robust training with noisy labels has mostly focused on
single-label multiclass classification problem, where each data combating label noise without regarding data bias [13], [30],
example is assumed to have only one true label. How- whereas fair training has focused on dealing with data bias,
ever, in the case of multilabel learning, each data example not necessarily noise [167], [168]. However, noisy labels
can be associated with a set of multiple true class labels. and data bias, in fact, coexist in real-world data. Satisfying
In music categorization, each music can belong to multiple both robustness and fairness is more realistic but challenging
categories [162]. In semantic scene classification, each scene because the bias in data is pertinent to label noise.
may belong to multiple scene classes [163]. Thus, contrary to In general, many fairness criteria are group-based, where
the single-label setup, the multilabel classifier aims to predict a a target metric is equalized or enforced over subpopulations
set of target objects simultaneously. In this setup, a multilabel in the data, also known as protected groups, such as race
dataset of millions of examples is reported to contain over or gender [167]. Accordingly, the goal of fair training is
26.6% false-positive labels and a significant number of omitted building a model that satisfies such fairness criteria for the
labels [164]. true protected groups. However, if the noisy protection group
Even worse, the difference in occurrence between classes is involved, such fairness criteria cannot be directly applied.
makes this problem more challenging; some minor class labels Recently, mostly after 2020, a few pioneering studies have
occur less in training data than other major class labels. emerged to consider both robustness and fairness objectives
Considering such aspects that can arise in multilabel classi- at the same time under the binary classification setting [169],
fication, the simple extension of existing methods may not [170]. Therefore, more research attention is needed for the
learn the proper correlations among multiple labels. Therefore, convergence of robust training and fair training.
learning from noisy labels with multilabel data is another
important topic for future research. We refer the readers to a E. Connection With Input Perturbation
61 InClothing1M, the result showed that the instance-independent noise hap- There has been a lot of research on the robustness of
pens with probability lower than 10−21250 , which is statistically impossible. deep learning under input perturbation, mainly in the field

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8150 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

of adversarial training where the input feature is maliciously supported properties varied depending on the category to
perturbed to distort the output of the DNN [34], [36]. Although which each method belonged. Several experimental guidelines
learning with noisy labels and learning with noisy inputs were also discussed, including noise rate estimation, publicly
have been regarded as separate research fields, their goals available datasets, and evaluation metrics. Finally, we provided
are similar in that they learn noise-robust representations from insights and directions for future research in this domain.
noisy data. Based on this common point of view, a few recent
studies have investigated the interaction of adversarial training
R EFERENCES
with noisy labels [171]–[173].
Interestingly, it was turned out that adversarial training [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
makes DNNs robust to label noise [171]. Based on this finding, with deep convolutional neural networks,” in Proc. Adv. Neural Inf.
Process. Syst. (NIPS), 2012, pp. 1097–1105.
Damodaran et al. [172] proposed a new regularization term, [2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
called Wasserstein adversarial regularization, to address the look once: Unified, real-time object detection,” in Proc. CVPR, 2016,
problem of learning with noisy labels. Zhu et al. [173] pro- pp. 779–788.
[3] W. Zhang, T. Du, and J. Wang, “Deep learning over multi-field
posed to use the number of projected gradient descent steps as categorical data,” in Proc. ECIR, 2016, pp. 45–57.
a new criterion for sample selection such that clean examples [4] L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng, “DeepRank:
are filtered out from noisy data. These approaches are regarded A new deep architecture for relevance ranking in information retrieval,”
in Proc. CIKM, 2017, pp. 257–266.
as a new perspective on label noise compared with traditional [5] K. D. Onal et al., “Neural information retrieval: At the end of the early
work. Therefore, understanding the connection between input years,” Inf. Retr. J., vol. 21, nos. 2–3, pp. 111–182, 2018.
perturbation and label noise could be another future topic for [6] J. Howard and S. Ruder, “Universal language model fine-tuning for
text classification,” in Proc. ACL, 2018, pp. 328–339.
better representation learning toward robustness. [7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
training of deep bidirectional transformers for language understanding,”
F. Efficient Learning Pipeline in Proc. ACL, 2019, pp. 4171–4186.
[8] A. Severyn and A. Moschitti, “Twitter sentiment analysis with deep
The efficiency of the learning pipeline is another important convolutional neural networks,” in Proc. ACL, 2015, pp. 959–962.
aspect to design deep learning approaches. However, for robust [9] G. Paolacci, J. Chandler, and P. G. Ipeirotis, “Running experiments
on Amazon mechanical Turk,” Judgment Decis. Making, vol. 5, no. 5,
deep learning, most studies have neglected the efficiency of pp. 411–419, 2010.
the algorithm because their main goal is to improve the [10] V. Cothey, “Web-crawling reliability,” J. Amer. Soc. Inf. Sci. Technol.,
robustness to label noise. For example, maintaining multiple vol. 55, no. 14, pp. 1228–1238, 2004.
[11] W. Mason and S. Suri, “Conducting behavioral research on Amazon’s
DNNs or training a DNN in multiple rounds is frequently mechanical Turk,” Behav. Res. Methods, vol. 44, no. 1, pp. 1–23, 2012.
used, but these approaches significantly degrade the efficiency [12] C. Scott, G. Blanchard, and G. Handy, “Classification with asymmetric
of the learning pipeline. On the other hand, the need for more label noise: Consistency and maximal denoising,” in Proc. COLT, 2013,
pp. 489–511.
efficient algorithms is increasing owing to the rapid increase [13] B. Frenay and M. Verleysen, “Classification in the presence of label
in the amount of available data [174]. noise: A survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5,
According to our literature survey, most work did not even pp. 845–869, May 2014.
[14] R. V. Lloyd et al., “Observer variation in the diagnosis of follicular
report the efficiency (or time complexity) of their approaches. variant of papillary thyroid carcinoma,” The Amer. J. Surgical Pathol.,
However, it is evident that saving the training time is help- vol. 28, no. 10, pp. 1336–1340, 2004.
ful under the restricted budget for computation. Therefore, [15] H. Xiao, H. Xiao, and C. Eckert, “Adversarial label flips attack on
support vector machines,” in Proc. ECAI, 2012, pp. 870–875.
enhancing the efficiency will significantly increase the usabil- [16] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from
ity of robust deep learning in the big data era. massive noisy labeled data for image classification,” in Proc. CVPR,
2015, pp. 2691–2699.
VIII. C ONCLUSION [17] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool, “Web vision
database: Visual learning and understanding from web data,” 2017,
DNNs easily overfit false labels owing to their high capacity arXiv:1708.02862.
in totally memorizing all noisy training samples. This overfit- [18] K.-H. Lee, X. He, L. Zhang, and L. Yang, “CleanNet: Transfer learning
for scalable image classifier training with label noise,” in Proc. CVPR,
ting issue still remains even with various conventional regular- 2018, pp. 5447–5456.
ization techniques, such as dropout and batch normalization, [19] H. Song, M. Kim, and J.-G. Lee, “SELFIE: Refurbishing unclean sam-
thereby significantly decreasing their generalization perfor- ples for robust deep learning,” in Proc. ICML, 2019, pp. 5907–5915.
[20] J. Krause et al., “The unreasonable effectiveness of noisy data for fine-
mance. Even worse, in real-world applications, the difficulty in grained recognition,” in Proc. ECCV, 2016, pp. 301–320.
labeling renders the overfitting issue more severe. Therefore, [21] D. Arpit et al., “A closer look at memorization in deep networks,” in
Proc. ICML, 2017, pp. 233–242.
learning from noisy labels has recently become one of the [22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand-
most active research topics. ing deep learning requires rethinking generalization,” in Proc. ICLR,
In this survey, we presented a comprehensive understanding 2017.
[23] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmen-
of modern deep learning methods to address the negative tation for deep learning,” J. Big Data, vol. 6, p. 60, Dec. 2019.
consequences of learning from noisy labels. All the methods [24] A. Krogh and J. A. Hertz, “A simple weight decay can improve
were grouped into five categories according to their underlying generalization,” in Proc. NeurIPS, 1992, pp. 950–957.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
strategies and described along with their methodological weak- R. Salakhutdinov, “DropOut: A simple way to prevent neural networks
nesses. Furthermore, a systematic comparison was conducted from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
using six popular properties used for evaluation in the recent 2014.
[26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
literature. According to the comparison results, there is no network training by reducing internal covariate shift,” in Proc. 32nd
ideal method that supports all the required properties; the Int. Conf. Mach. Learn., 2015, pp. 448–456.

[27] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative [57] B. Biggio, B. Nelson, and P. Laskov, “Support vector machines under
study,” Artif. Intell. Rev., vol. 22, no. 3, pp. 177–210, Nov. 2004. adversarial label noise,” in Proc. ACML, 2011, pp. 97–112.
[28] J. Zhang, X. Wu, and V. S. Sheng, “Learning from crowdsourced [58] C. J. Mantas and J. Abellán, “Credal-C4.5: Decision tree based on
labeled data: A survey,” Artif. Intell. Rev., vol. 46, no. 4, pp. 543–576, imprecise probabilities to classify noisy data,” Expert Syst. Appl.,
Dec. 2016. vol. 41, no. 10, pp. 4625–4637, Aug. 2014.
[29] N. Nigam, T. Dutta, and H. P. Gupta, “Impact of noisy labels in learning [59] A. Ghosh, N. Manwani, and P. Sastry, “On the robustness of decision
techniques: A survey,” in Proc. ICDIS, 2020, pp. 403–411. tree learning under label noise,” in Proc. PAKDD, 2017, pp. 685–697.
[30] B. Han et al., “A survey of label-noise representation learning: Past, [60] S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-
present and future,” 2020, arXiv:2011.04406. learning regularization prevents memorization of noisy labels,” in Proc.
[31] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning NeurIPS, vol. 33, 2020, pp. 20331–20342.
in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, [61] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early
2018. stopping is provably robust to label noise for overparameterized neural
[32] J. Yoon, J. Jordon, and M. Schaar, “GAIN: Missing data impu- networks,” in Proc. AISTATS, 2020, pp. 4313–4324.
tation using generative adversarial nets,” in Proc. ICML, 2018, [62] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making
pp. 5689–5698. deep neural networks robust to label noise: A loss correction approach,”
[33] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard, “Robustness of in Proc. CVPR, 2017, pp. 1944–1952.
classifiers: From adversarial to random noise,” in Proc. NeurIPS, 2016, [63] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, “Learning with
pp. 1632–1640. bounded instance and label-dependent label noise,” in Proc. ICML,
[34] E. Dohmatob, “Generalized no free lunch theorem for adversarial 2020, pp. 1789–1799.
robustness,” in Proc. ICML, 2019, pp. 1646–1654.
[64] B. Garg and N. Manwani, “Robust deep ordinal regression under label
[35] J. Gilmer, N. Ford, N. Carlini, and E. Cubuk, “Adversarial examples noise,” in Proc. ACML, 2020, pp. 782–796.
are a natural consequence of test error in noise,” in Proc. ICML, 2019,
[65] W. Hu, Z. Li, and D. Yu, “Simple and effective regularization methods
pp. 2280–2289.
for training on noisily labeled data with generalization guarantee,” in
[36] S. Mahloujifar, D. I. Diochnos, and M. Mahmoody, “The curse of
Proc. ICLR, 2020, pp. 1–18.
concentration in robust learning: Evasion and poisoning attacks from
concentration of measure,” in Proc. AAAI, 2019, vol. 33, no. 1, [66] A. K. Menon, B. Van Rooyen, and N. Natarajan, “Learning from
pp. 4536–4543. binary labels with instance-dependent noise,” Mach. Learn., vol. 107,
[37] D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, nos. 8–10, pp. 1561–1595, 2018.
pp. 581–592, 1976. [67] L. Torgo and J. Gama, “Regression using classification algorithms,”
[38] C. M. Bishop, Pattern Recognition Machine Learning. Springer, 2006. Intell. Data Anal., vol. 1, no. 4, pp. 275–292, Oct. 1997.
[39] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning [68] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label
with noisy labels,” in Proc. NeurIPS, 2013, pp. 1196–1204. noise for deep neural networks,” in Proc. AAAI, 2017, pp. 1919–1925.
[40] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks [69] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and
using a noise adaptation layer,” in Proc. ICLR, 2017, pp. 1–9. A. Rabinovich, “Training deep neural networks on noisy labels with
[41] P. Sastry and N. Manwani, “Robust learning of classifiers in the bootstrapping,” in Proc. ICLR, 2015, pp. 1–11.
presence of label noise,” in Proc. Pattern Recognit. Big Data, 2017, [70] E. Malach and S. Shalev-Shwartz, “Decoupling ‘when to update’ from
pp. 167–197. ‘how to update,”’ in Proc. NeurIPS, 2017, pp. 960–970.
[42] V. Wheway, “Using boosting to detect noisy data,” in Proc. PRICAI, [71] L. P. Garcia, A. C. de Carvalho, and A. C. Lorena, “Noise detection
2000, pp. 123–130. in the meta-learning level,” Neurocomputing, vol. 176, pp. 14–25,
[43] B. Sluban, D. Gamberger, and N. Lavrač, “Ensemble-based noise Dec. 2016.
detection: Noise ranking and visual performance evaluation,” Data [72] Y. Yan, Z. Xu, I. W. Tsang, G. Long, and Y. Yang, “Robust semi-
Mining Knowl. Discovery, vol. 28, no. 2, pp. 265–303, 2012. supervised learning through label aggregation,” in Proc. AAAI, 2016,
[44] S. J. Delany, N. Segata, and B. Mac Namee, “Profiling instances in pp. 2244–2250.
noise reduction,” Knowl.-Based Syst., vol. 31, pp. 28–40, Jul. 2012. [73] H. Harutyunyan, K. Reing, G. Ver Steeg, and A. Galstyan, “Improving
[45] D. Gamberger, N. Lavrac, and S. Dzeroski, “Noise detection and generalization by controlling label-noise information in neural network
elimination in data preprocessing: Experiments in medical domains,” weights,” in Proc. ICML, 2020, pp. 4071–4081.
Appl. Artif. Intell., vol. 14, no. 2, pp. 205–223, Nov. 2000. [74] P. Chen, G. Chen, J. Ye, J. Zhao, and P.-A. Heng, “Noise against
[46] J. Thongkam, G. Xu, Y. Zhang, and F. Huang, “Support vector machine noise: Stochastic label noise helps combat inherent label noise,” in
for outlier detection in breast cancer survivability prediction,” in Proc. Proc. ICLR, 2021, pp. 1–20.
APWeb, 2008, pp. 99–109. [75] X. Chen and A. Gupta, “Webly supervised learning of convolutional
[47] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy networks,” in Proc. ICCV, 2015, pp. 1431–1439.
data,” in Proc. ICML, 2012, pp. 567–574. [76] A. J. Bekker and J. Goldberger, “Training deep neural-networks based
[48] N. Manwani and P. Sastry, “Noise tolerance under risk minimization,” on unreliable labels,” in Proc. ICASSP, 2016, pp. 2682–2686.
IEEE Trans. Cybern., vol. 43, no. 3, pp. 1146–1151, Jun. 2013. [77] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus,
[49] A. Ghosh, N. Manwani, and P. S. Sastry, “Making risk minimiza- “Training convolutional networks with noisy labels,” in Proc. ICLRW,
tion tolerant to label noise,” Neurocomputing, vol. 160, pp. 93–107, 2015, pp. 1–11.
Jul. 2015. [78] I. Jindal, M. Nokleby, and X. Chen, “Learning deep networks from
[50] B. Van Rooyen, A. Menon, and R. C. Williamson, “Learning with noisy labels with dropout regularization,” in Proc. ICDM, 2016,
symmetric label noise: The importance of being unhinged,” in Proc. pp. 967–972.
NeurIPS, 2015, pp. 10–18. [79] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks
[51] G. Patrini, F. Nielsen, R. Nock, and M. Carioni, “Loss factorization, using a noise adaptation layer,” in Proc. ICLR, 2017, pp. 1–9.
weakly supervised learning and label noise robustness,” in Proc. ICML,
[80] B. Han et al., “Masking: A new perspective of noisy supervision,” in
2016, pp. 708–717.
Proc. NeurIPS, 2018, pp. 5836–5846.
[52] R. Xu and D. C. Wunsch, “Survey of clustering algorithms,” IEEE
Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, Jun. 2005. [81] J. Yao et al., “Deep learning from noisy image labels with qual-
ity embedding,” IEEE Trans. Image Process., vol. 28, no. 4,
[53] U. Rebbapragada and C. E. Brodley, “Class noise mitigation through
pp. 1909–1922, Dec. 2018.
instance weighting,” in Proc. ECML, 2007, pp. 708–715.
[54] T. Liu, K. Wang, B. Chang, and Z. Sui, “A soft-label method for noise- [82] L. Cheng et al., “Weakly supervised learning with side information for
tolerant distantly supervised relation extraction,” in Proc. EMNLP, noisy labeled images,” in Proc. ECCV, 2020, pp. 306–321.
2017, pp. 1790–1795. [83] X. Xia et al., “Extended T: Learning with mixed closed-set and open-
[55] F. O. Kaster, B. H. Menze, M.-A. Weber, and F. A. Hamprecht, set noisy labels,” 2020, arXiv:2012.00932.
“Comparative validation of graphical models for learning tumor seg- [84] I. Goodfellow et al., “Generative adversarial nets,” in Proc. NeurIPS,
mentations from noisy manual annotations,” in Proc. MICCAI, 2010, 2014, pp. 2672–2680.
pp. 74–85. [85] K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference
[56] A. Ganapathiraju and J. Picone, “Support vector machines for auto- via generative classifiers for handling noisy labels,” in Proc. ICML,
matic data cleanup,” in Proc. ICSLP, 2000, pp. 210–213. 2019, pp. 3763–3772.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.
8152 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 11, NOVEMBER 2023

[86] R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and [116] S. Yang et al., “Estimating instance-dependent label-noise transition
N. Silberman, “Learning from noisy labels by regularized estimation matrix using DNNs,” 2021, arXiv:2105.13001.
of annotator confusion,” in Proc. CVPR, 2019, pp. 11244–11253. [117] T. Liu and D. Tao, “Classification with noisy labels by importance
[87] S. Jenni and P. Favaro, “Deep bilevel learning,” in Proc. ECCV, 2018, reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3,
pp. 618–633. pp. 447–461, Mar. 2016.
[88] D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can [118] H. Zhang, X. Xing, and L. Liu, “DualGraph: A graph-based method
improve model robustness and uncertainty,” in Proc. ICML, 2019, for reasoning about label noise,” in Proc. CVPR, 2021, pp. 9654–9663.
pp. 2712–2721. [119] L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training:
[89] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar, “Can gradient Beyond empirical risk minimization,” in Proc. NeurIPS, 2020,
clipping mitigate label noise?” in Proc. ICLR, 2020, pp. 1–26. pp. 19365–19376.
[90] X. Xia et al., “Robust early-learning: Hindering the memorization of [120] M. E. Houle, “Local intrinsic dimensionality I: An extreme-value-
noisy labels,” in Proc. ICLR, 2021, pp. 1–15. theoretic foundation for similarity applications,” in Proc. SISAP, 2017,
[91] H. Wei, L. Tao, R. Xie, and B. An, “Open-set label noise can improve pp. 64–79.
robustness against inherent label noise,” in Proc. NeurIPS, 2021, [121] S. Zheng, P. Wu, A. Goswami, M. Goswami, D. Metaxas, and C. Chen,
pp. 1–15. “Error-bounded correction of noisy labels,” in Proc. ICML, 2020,
[92] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing pp. 11447–11457.
adversarial examples,” in Proc. ICLR, 2014, pp. 1–11. [122] P. Chen, J. Ye, G. Chen, J. Zhao, and P.-A. Heng, “Beyond
[93] S. Zhang, Y. Hou, B. Wang, and D. Song, “Regularizing neural class-conditional assumption: A primary attempt to combat instance-
networks via retaining confident connections,” Entropy, vol. 19, no. 7, dependent label noise,” in Proc. AAAI, 2021, pp. 1–10.
p. 313, Jun. 2017. [123] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning
[94] M. Lukasik, S. Bhojanapalli, A. Menon, and S. Kumar, “Does label for fast adaptation of deep networks,” in Proc. ICML, 2017,
smoothing mitigate label noise?” in Proc. ICLR, 2020, pp. 6448–6458. pp. 1126–1135.
[95] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: [124] J. Shu et al., “Meta-weight-net: Learning an explicit mapping for
Beyond empirical risk minimization,” in Proc. ICLR, 2018, pp. 1–13. sample weighting,” in Proc. NeurIPS, 2019, pp. 1917–1928.
[96] B. C. Csáji et al., “Approximation with artificial neural networks,” Fac. [125] Z. Wang, G. Hu, and Q. Hu, “Training noise-robust deep neural
Sci., Etvs Lornd Univ., Hung., vol. 24, no. 48, p. 7, 2001. networks via meta-learning,” in Proc. CVPR, 2020, pp. 4524–4533.
[97] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training [126] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps, “Learning to learn
deep neural networks with noisy labels,” in Proc. Adv. Neural Inf. from weak supervision by full supervision,” in Proc. NeurIPSW, 2017,
Process. Syst., 2018, pp. 8778–8788. pp. 1–8.
[98] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric [127] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps, “Avoiding your
cross entropy for robust learning with noisy labels,” in Proc. ICCV, teacher’s mistakes: Training neural networks with controlled weak
2019, pp. 322–330. supervision,” 2017, arXiv:1711.00313.
[99] Y. Lyu and I. W. Tsang, “Curriculum loss: Robust learning and [128] Z. Zhang, H. Zhang, S. O. Arik, H. Lee, and T. Pfister, “Distilling
generalization against label corruption,” in Proc. ICLR, 2020, pp. 1–22. effective supervision from severe label noise,” in Proc. CVPR, 2020,
[100] L. Feng, S. Shu, Z. Lin, F. Lv, L. Li, and B. An, “Can cross entropy pp. 9294–9303.
loss be robust to label noise,” in Proc. IJCAI, 2020, pp. 2206–2212. [129] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning from
[101] Y. Liu and H. Guo, “Peer loss functions: Learning from noisy labels noisy labels with distillation,” in Proc. ICCV, 2017, pp. 1910–1918.
without knowing noise rates,” in Proc. ICML, 2020, pp. 6226–6236. [130] G. Zheng, A. H. Awadallah, and S. Dumais, “Meta label correction for
[102] H. Kumar, N. Manwani, and P. Sastry, “Robust learning of multi- noisy label learning,” in Proc. AAAI, 2021, pp. 1–9.
label classifiers under label noise,” in Proc. CODS-COMAD, 2020, [131] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “MentorNet:
pp. 90–97. Learning data-driven curriculum for very deep neural networks on
[103] E. Amid, M. K. Warmuth, and S. Srinivasan, “Two-temperature logistic corrupted labels,” in Proc. ICML, 2018, pp. 2304–2313.
regression based on the tsallis divergence,” in Proc. AISTATS, 2019, [132] X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama, “How
pp. 2388–2396. does disagreement help generalization against label corruption?” in
[104] E. Amid, M. K. Warmuth, R. Anil, and T. Koren, “Robust bi-tempered Proc. ICML, 2019, pp. 7164–7173.
logistic loss based on Bregman divergences,” in Proc. NeurIPS, 2019, [133] Y. Wang et al., “Iterative learning with open-set noisy labels,” in Proc.
pp. 14987–14996. CVPR, 2018, pp. 8688–8696.
[105] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey, [134] Y. Shen and S. Sanghavi, “Learning with bad training data via iterative
“Normalized loss functions for deep learning with noisy labels,” in trimmed loss minimization,” in Proc. ICML, 2019, pp. 5739–5748.
Proc. ICML, 2020, pp. 6543–6553. [135] P. Chen, B. Liao, G. Chen, and S. Zhang, “Understanding and utilizing
[106] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to deep neural networks trained with noisy labels,” in Proc. ICML, 2019,
reweight examples for robust deep learning,” in Proc. ICML, 2018, pp. 1062–1070.
pp. 4334–4343. [136] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen,
[107] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, “Using trusted L. Beggel, and T. Brox, “SELF: Learning to filter noisy labels with
data to train deep networks on labels corrupted by severe noise,” in self-ensembling,” in Proc. ICLR, 2020, pp. 1–15.
Proc. NeurIPS, 2018, pp. 10456–10465. [137] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Robust learning
[108] R. Wang, T. Liu, and D. Tao, “Multiclass learning with partially by self-transition for handling noisy labels,” in Proc. KDD, 2021,
corrupted labels,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 1490–1500.
pp. 2568–2580, Jun. 2018. [138] D. Krueger et al., “Deep nets don’t learn via memorization,” in Proc.
[109] H.-S. Chang, E. Learned-Miller, and A. McCallum, “Active bias: ICLRW, 2017, pp. 1–4.
Training more accurate neural networks by emphasizing high variance [139] C. Zhang, S. Bengio, M. Hardt, M. C. Mozer, and Y. Singer, “Identity
samples,” in Proc. NeurIPS, 2017, pp. 1002–1012. Crisis: Memorization and generalization under extreme overparameter-
[110] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, ization,” in Proc. ICLR, 2020, pp. 1–39.
“Unsupervised label noise modeling and loss correction,” in Proc. [140] Q. Yao, H. Yang, B. Han, G. Niu, and J. T.-Y. Kwok, “Searching to
ICML, 2019, pp. 312–321. exploit memorization effect in learning with noisy labels,” in Proc.
[111] X. Ma et al., “Dimensionality-driven learning with noisy labels,” in ICML, 2020, pp. 10789–10798.
Proc. ICML, 2018, pp. 3355–3364. [141] H. Song, M. Kim, D. Park, and J.-G. Lee, “How does early stopping
[112] B. Han et al., “Co-teaching: Robust training of deep neural networks help generalization against label noise?” 2019, arXiv:1911.08059.
with extremely noisy labels,” in Proc. Adv. Neural Inf. Process. Syst., [142] J. Li, R. Socher, and S. C. Hoi, “DivideMix: Learning with noisy labels
2018, pp. 8527–8537. as semi-supervised learning,” in Proc. ICLR, 2020, pp. 1–14.
[113] X. Xia et al., “Are anchor points really indispensable in label-noise [143] H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by
learning?” in Proc. NeurIPS, 2019, pp. 1–12. agreement: A joint training method with co-regularization,” in Proc.
[114] Y. Yao et al., “Dual T: Reducing estimation error for transition matrix CVPR, 2020, pp. 13726–13735.
in label-noise learning,” in Proc. NeurIPS, 2020, pp. 7260–7271. [144] J. Huang, L. Qu, R. Jia, and B. Zhao, “O2U-Net: A simple noisy
[115] Y. Zhang and M. Sugiyama, “Approximating instance-dependent noise label detection approach for deep neural networks,” in Proc. ICCV,
via instance-confidence embedding,” 2021, arXiv:2103.13569. Oct. 2019, pp. 3326–3334.

[145] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: [172] B. B. Damodaran, K. Fatras, S. Lobry, R. Flamary, D. Tuia, and
Identifying density-based local outliers,” in Proc. Int. Conf. Manage. N. Courty, “Wasserstein adversarial regularization (WAR) on label
Data (SIGMOD), vol. 29, 2000, pp. 93–104. noise,” in Proc. ICLR, 2020, pp. 1–15.
[146] P. Wu, S. Zheng, M. Goswami, D. Metaxas, and C. Chen, “A topo- [173] J. Zhu et al., “Understanding the interaction of adversarial training with
logical filter for learning with label noise,” in Proc. NeurIPS, 2020, noisy labels,” 2021, arXiv:2102.03482.
pp. 21382–21393. [174] G. Nguyen et al., “Machine learning and deep learning frame-
[147] Z.-F. Wu, T. Wei, J. Jiang, C. Mao, M. Tang, and Y.-F. Li, “NGC: works and libraries for large-scale data mining: A survey,”
A unified framework for learning with open-world noisy data,” in Proc. Artif. Intell. Rev., vol. 52, no. 1, pp. 77–124, 2019.
ICCV, 2021, pp. 62–71.
[148] A. Tarvainen and H. Valpola, “Mean teachers are better role models:
Weight-averaged consistency targets improve semi-supervised deep
learning results,” in Proc. Adv. Neural Inf. Process. Syst., 2017, Hwanjun Song received the Ph.D. degree from the
pp. 1195–1204. Graduate School of Knowledge Service Engineering,
[149] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and Korea Advanced Institute of Science and Technol-
C. A. Raffel, “MixMatch: A holistic approach to semi-supervised ogy, Daejeon, South Korea, in 2021.
learning,” in Proc. NeurIPS, 2019, pp. 5050–5060. He was a Research Intern with Google Research,
[150] T. Zhou, S. Wang, and J. Bilmes, “Robust curriculum learning: From Seoul, South Korea, in 2020. He is currently a
clean label detection to noisy label self-correction,” in Proc. ICLR, Research Scientist with the NAVER AI Labora-
2021, pp. 1–18. tory, Seongnam, South Korea. His research inter-
[151] X. Li, T. Liu, B. Han, G. Niu, and M. Sugiyama, “Provably end-to- est includes designing advanced approaches to
end label-noise learning without anchor points,” in Proc. ICML, 2021, handle large-scale and noisy data, including two
pp. 6403–6413. main real-world challenges for the practical use of
[152] G. Pleiss, T. Zhang, E. R. Elenberg, and K. Q. Weinberger. (2020). AI approaches.
Detecting Noisy Training Data With Loss Curves. [Online]. Available:
https://openreview.net/forum?id=HyenUkrtDB
[153] C. Scott, “A rate of convergence for mixture proportion estimation, with Minseok Kim (Student Member, IEEE) received the
application to learning from noisy labels,” in Proc. AISTATS, 2015, master’s degree from the Korea Advanced Institute
pp. 838–846. of Science and Technology, Daejeon, South Korea,
[154] Y. LeCun, C. Cortes, and C. J. Burges. (1998). The MNIST Database in 2018, where he is currently pursuing the Ph.D.
of Handwritten Digits, 1998. [Online]. Available: http://yann.lecun. degree under the supervision of Prof. Jae-Gil Lee
com/exdb/mnist with the Graduate School of Knowledge Service
[155] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel Engineering.
image dataset for benchmarking machine learning algorithms,” 2017, His current research interests include robustness
arXiv:1708.07747. and uncertainty in machine learning, data augmen-
[156] A. Krizhevsky, V. Nair, and G. Hinton. (2014). CIFAR-10 and tation, and personalized recommendation systems.
CIFAR-100 Datasets. [Online]. Available: https://www.cs.toronto.
edu/~kriz/cifar.html
[157] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng,
“Reading digits in natural images with unsupervised feature learning,” Dongmin Park received the master’s degree from
in Proc. NeurIPSW, 2011, pp. 1–9. the Korea Advanced Institute of Science and Tech-
[158] A. Karpathy et al., “Cs231n convolutional neural networks for visual nology, Daejeon, South Korea, in 2020, where he
recognition,” Neural Netw., vol. 1, p. 1, Dec. 2016. is currently pursuing the Ph.D. degree under the
[159] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, and Y. Liu, “Learning with supervision of Prof. Jae-Gil Lee with the Graduate
noisy labels revisited: A study using real-world human annotations,” School of Knowledge Service Engineering.
2011, arXiv:2110.12088. His current research interests include robust deep
[160] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining learning and representation learning in graph and
discriminative components with random forests,” in Proc. ECCV, 2014, time–series data.
pp. 446–461.
[161] J. Bootkrajang and J. Chaijaruwanich, “Towards instance-dependent
label noise-tolerant classification: A probabilistic approach,” Pattern
Anal. Appl., vol. 23, no. 1, pp. 95–111, 2020. Yooju Shin received the master’s degree from the
[162] G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” Korea Advanced Institute of Science and Technol-
Int. J. Data Warehousing Mining, vol. 3, no. 3, pp. 1–13, 2007. ogy, Daejeon, South Korea, in 2019, where he is
[163] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label currently pursuing the Ph.D. degree under the super-
scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, vision of Prof. Jae-Gil Lee with the Graduate School
Sep. 2004. of Knowledge Service Engineering.
[164] I. Krasin et al., “OpenImages: A public dataset for large-scale multi- His current research interests include out-of-
label and multi-class image classification,” Dataset, vol. 2, no. 3, p. 18, distribution learning and label efficient learning in
2017. time series.
[165] W. Zhao and C. Gomes, “Evaluating multi-label classifiers with noisy
labels,” 2021, arXiv:2102.08427.
[166] J. M. Johnson and T. M. Khoshgoftaar, “Survey on deep learning with
class imbalance,” J. Big Data, vol. 6, no. 1, pp. 1–54, Dec. 2019.
Jae-Gil Lee (Member, IEEE) received the Ph.D.
[167] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in super- degree in computer science from the Korea
vised learning,” in Proc. NeurIPS, 2016, pp. 1–9. Advanced Institute of Science and Technology,
[168] H. Jiang and O. Nachum, “Identifying and correcting label bias in Daejeon, South Korea, in 2005.
machine learning,” in Proc. AISTATS, 2020, pp. 702–712. He was a Post-Doctoral Researcher with the
[169] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and IBM Research—Almaden Laboratory, San Jose, CA,
M. I. Jordan, “Robust optimization for fairness with noisy protected USA, and a Post-Doctoral Research Associate with
groups,” in Proc. NeurIPS, 2020, pp. 5190–5203. the University of Illinois at Urbana–Champaign,
[170] J. Wang, Y. Liu, and C. Levy, “Fair classification with group-dependent Champaign, IL, USA. He is currently an Associate
label noise,” in Proc. FAccT, 2021, pp. 526–536. Professor with KAIST, where he is also the Leader
[171] J. Uesato, J.-B. Alayrac, P.-S. Huang, R. Stanforth, A. Fawzi, and of the Data Mining Laboratory. His research interests
P. Kohli, “Are labels required for improving adversarial robustness?” include mobility and stream data mining, deep learning-based big data
in Proc. NeurIPS, 2019, pp. 12192–12202. analysis, and distributed deep learning.

Authorized licensed use limited to: Central South University. Downloaded on February 13,2025 at 05:10:10 UTC from IEEE Xplore. Restrictions apply.

CUP IB Bus Man Teacher Resource Starter Pack
100% (5)
CUP IB Bus Man Teacher Resource Starter Pack
110 pages
Motivation Letter
No ratings yet
Motivation Letter
1 page
Understanding and Utilizing Deep Neural Networks Trained With Noisy Labels
No ratings yet
Understanding and Utilizing Deep Neural Networks Trained With Noisy Labels
13 pages
Learning Discriminative Dynamics With Label Corruption For Noisy Label Detection
No ratings yet
Learning Discriminative Dynamics With Label Corruption For Noisy Label Detection
14 pages
Kim Learning Discriminative Dynamics With Label Corruption For Noisy Label Detection CVPR 2024 Paper
No ratings yet
Kim Learning Discriminative Dynamics With Label Corruption For Noisy Label Detection CVPR 2024 Paper
11 pages
Wang Symmetric Cross Entropy For Robust Learning With Noisy Labels ICCV 2019 Paper
No ratings yet
Wang Symmetric Cross Entropy For Robust Learning With Noisy Labels ICCV 2019 Paper
9 pages
Label Noise Types and Their Effects On Deep Learning
No ratings yet
Label Noise Types and Their Effects On Deep Learning
6 pages
Development of A Mathematical Model To A
No ratings yet
Development of A Mathematical Model To A
13 pages
Learning With Imbalanced Noisy
No ratings yet
Learning With Imbalanced Noisy
12 pages
NeurIPS 2018 Generalized Cross Entropy Loss For Training Deep Neural Networks With Noisy Labels Paper
No ratings yet
NeurIPS 2018 Generalized Cross Entropy Loss For Training Deep Neural Networks With Noisy Labels Paper
11 pages
Noisy-Label Learning With Sample Selection Based o
No ratings yet
Noisy-Label Learning With Sample Selection Based o
15 pages
A Comprehensive Introduction To Label Noise: Completely at Random (NCAR) at Random (NAR) Not at Random (NNAR)
No ratings yet
A Comprehensive Introduction To Label Noise: Completely at Random (NCAR) at Random (NAR) Not at Random (NNAR)
10 pages
Wu 等 - 2024 - A Time-Consistency Curriculum for Learning From Instance-Dependent Noisy Labels
No ratings yet
Wu 等 - 2024 - A Time-Consistency Curriculum for Learning From Instance-Dependent Noisy Labels
13 pages
Thulasidasan Washington 0250E 21330
No ratings yet
Thulasidasan Washington 0250E 21330
186 pages
NIPS 2017 Decoupling When To Update From How To Update Paper
No ratings yet
NIPS 2017 Decoupling When To Update From How To Update Paper
11 pages
LEARNING TO PURIFY NOISY LABELS VIA META SOFT LABEL CORRECTOR (Yichen Wu) - Arxiv 2020-08
No ratings yet
LEARNING TO PURIFY NOISY LABELS VIA META SOFT LABEL CORRECTOR (Yichen Wu) - Arxiv 2020-08
12 pages
Label Noise Detection in IoT Security Based On Decision Tree and Active Learning
No ratings yet
Label Noise Detection in IoT Security Based On Decision Tree and Active Learning
8 pages
Divide Mix
No ratings yet
Divide Mix
14 pages
Learning To Reweight
No ratings yet
Learning To Reweight
13 pages
Bootstrap PDF
No ratings yet
Bootstrap PDF
11 pages
10 1016@j JCP 2019 05 027
No ratings yet
10 1016@j JCP 2019 05 027
17 pages
Na 等 - 2024 - Label-noise robust diffusion models
No ratings yet
Na 等 - 2024 - Label-noise robust diffusion models
44 pages
Survey of FNN
No ratings yet
Survey of FNN
25 pages
Classification in The Presence of Label Noise A Survey
No ratings yet
Classification in The Presence of Label Noise A Survey
27 pages
Yu Et Al. - 2022 - Progressive Ensemble Kernel-Based Broad Learning System For Noisy Data Classification
No ratings yet
Yu Et Al. - 2022 - Progressive Ensemble Kernel-Based Broad Learning System For Noisy Data Classification
14 pages
An Introductory Note On Machine Learning. A V Narasimhadhan
No ratings yet
An Introductory Note On Machine Learning. A V Narasimhadhan
2 pages
2020 BO and Adversarial Attacks
No ratings yet
2020 BO and Adversarial Attacks
26 pages
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
2104.01493v1
No ratings yet
2104.01493v1
19 pages
Statistics Mechanic of Deep Learning
No ratings yet
Statistics Mechanic of Deep Learning
28 pages
Conmatphys 031119 050745
No ratings yet
Conmatphys 031119 050745
28 pages
9036 Provably Noise Resilient
No ratings yet
9036 Provably Noise Resilient
17 pages
JB 14
No ratings yet
JB 14
6 pages
Machine Learning: A Review of Classification and Combining Techniques
No ratings yet
Machine Learning: A Review of Classification and Combining Techniques
32 pages
LECTURE#9 EE258 F22 Part2 Draft v1
No ratings yet
LECTURE#9 EE258 F22 Part2 Draft v1
14 pages
407 A Decade S Battle On Datas
No ratings yet
407 A Decade S Battle On Datas
17 pages
On Truthing Issues in Supervised Classification: Jonathan K. Su
No ratings yet
On Truthing Issues in Supervised Classification: Jonathan K. Su
91 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
Statistical Mechanics of Deep Learning
No ratings yet
Statistical Mechanics of Deep Learning
30 pages
1710 11573 PDF
No ratings yet
1710 11573 PDF
14 pages
GHJGKJ
No ratings yet
GHJGKJ
16 pages
Snorkel: Rapid Training Data Creation With Weak Supervision
No ratings yet
Snorkel: Rapid Training Data Creation With Weak Supervision
17 pages
Usc 08
No ratings yet
Usc 08
46 pages
Data Programming: Creating Large Training Sets, Quickly
No ratings yet
Data Programming: Creating Large Training Sets, Quickly
27 pages
Fu Relaxing From Vocabulary ICCV 2015 Paper
No ratings yet
Fu Relaxing From Vocabulary ICCV 2015 Paper
9 pages
Noise Robust Classification Based On Spread Spectrum - ICDM - 08
No ratings yet
Noise Robust Classification Based On Spread Spectrum - ICDM - 08
10 pages
CVIS Label Errors Camera Ready
No ratings yet
CVIS Label Errors Camera Ready
4 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
No ratings yet
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
78 pages
20 StatMechDeep
No ratings yet
20 StatMechDeep
30 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
Pleiss 等 - Identifying Mislabeled Data using the Area Under t
No ratings yet
Pleiss 等 - Identifying Mislabeled Data using the Area Under t
13 pages
Exploratory Machine Learning With Unknown Unknowns
No ratings yet
Exploratory Machine Learning With Unknown Unknowns
41 pages
CMW Net
No ratings yet
CMW Net
41 pages
3 Non Linear Classifiers
No ratings yet
3 Non Linear Classifiers
74 pages
InTech-Types of Machine Learning Algorithms
No ratings yet
InTech-Types of Machine Learning Algorithms
30 pages
Lukasik 20 A
No ratings yet
Lukasik 20 A
11 pages
Unit-2 L2
No ratings yet
Unit-2 L2
22 pages
Chapter-2 Single Feed Forward Netwotk
No ratings yet
Chapter-2 Single Feed Forward Netwotk
132 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Research Trends in Machine Learning: Muhammad Kashif Hanif
No ratings yet
Research Trends in Machine Learning: Muhammad Kashif Hanif
80 pages
Deep Reinforcement Learning: An Essential Guide
From Everand
Deep Reinforcement Learning: An Essential Guide
Robert Johnson
No ratings yet
Mixtures Introduction
No ratings yet
Mixtures Introduction
11 pages
Comparison of Topic Modelling Approaches in The Banking Context
No ratings yet
Comparison of Topic Modelling Approaches in The Banking Context
14 pages
An Introduction To Syllabus Design and Evaluation
100% (1)
An Introduction To Syllabus Design and Evaluation
7 pages
Department of Education: Seidel, Reynaldo, Jr. C
No ratings yet
Department of Education: Seidel, Reynaldo, Jr. C
3 pages
7es Lesson Plan
No ratings yet
7es Lesson Plan
2 pages
Communicative Approach
No ratings yet
Communicative Approach
7 pages
Sow English Year 3 SJK 2025 2026
No ratings yet
Sow English Year 3 SJK 2025 2026
12 pages
Department of Education Region IX, Zamboanga Peninsula Division of Zamboanga City
No ratings yet
Department of Education Region IX, Zamboanga Peninsula Division of Zamboanga City
13 pages
The Five Step Approach Goals
No ratings yet
The Five Step Approach Goals
5 pages
1a Hello
No ratings yet
1a Hello
6 pages
Teaching of English (EDU516) Table of Contents
No ratings yet
Teaching of English (EDU516) Table of Contents
126 pages
English Handbook For Teacher
No ratings yet
English Handbook For Teacher
35 pages
GRADE 8 CRE CURRICULUM DESIGN - Compressed
No ratings yet
GRADE 8 CRE CURRICULUM DESIGN - Compressed
72 pages
Artificial Neural Networks - Short Answers
No ratings yet
Artificial Neural Networks - Short Answers
5 pages
Creative Approaches To Writing Materials
100% (3)
Creative Approaches To Writing Materials
2 pages
Classroom Program
No ratings yet
Classroom Program
4 pages
NRP ON CATCH UP FRIDAYS For DepEd
100% (4)
NRP ON CATCH UP FRIDAYS For DepEd
20 pages
Wentworth Mission
No ratings yet
Wentworth Mission
12 pages
Caitlin Sherlock Resume
No ratings yet
Caitlin Sherlock Resume
3 pages
Edtpa Lesson Plan 2 1
No ratings yet
Edtpa Lesson Plan 2 1
4 pages
Tests For Teachers, Specialist Teachers and Specialists
100% (1)
Tests For Teachers, Specialist Teachers and Specialists
6 pages
Honeycomb Paths: Difficulty Level
No ratings yet
Honeycomb Paths: Difficulty Level
3 pages
Child Assent and Parent's Consent
No ratings yet
Child Assent and Parent's Consent
4 pages
Studenttranscript
No ratings yet
Studenttranscript
2 pages
2017 Harvard DCE Branding MGMT E 6100 Syllabus
No ratings yet
2017 Harvard DCE Branding MGMT E 6100 Syllabus
5 pages
Developing Cross Cultural Competence For Leaders A Guide 1st Edition PDF
100% (14)
Developing Cross Cultural Competence For Leaders A Guide 1st Edition PDF
15 pages
Life Changing Impact - Arthur Dhanaraj
No ratings yet
Life Changing Impact - Arthur Dhanaraj
252 pages
III Q3 EXAM Answer Key
100% (1)
III Q3 EXAM Answer Key
2 pages

Learning From Noisy Labels With Deep Neural Networks Survey

Uploaded by

Learning From Noisy Labels With Deep Neural Networks Survey

Uploaded by

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO.

11, NOVEMBER 2023 8135

Learning From Noisy Labels With Deep

Abstract— Deep learning has achieved remarkable success in

and ICLR. Although we attempted to comprehensively include TABLE I

the reliability of estimating the label transition probability to

probability by the output of a specified DNN. The main (11)

Meanwhile, SELFIE [19] is a hybrid approach of sam-

loss becomes less separable by the GMM as the training

You might also like