✉ Linchao Zhu (1)
Chao Liang (1)
Humphrey Shi (2, 3)
Yi Yang (1)

1 ReLER Lab, CCAI, Zhejiang University
2 SHI Labs @ UIUC & Oregon
3 Picsart AI Research (PAIR)

Combating Label Noise With A General Surrogate Model For Sample Selection

Chao Liang Linchao Zhu Humphrey Shi Yi Yang
Abstract

Modern deep learning systems are data-hungry. Learning with web data is a feasible solution, but it inevitably introduces label noise, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods mostly rely on the small-loss criterion, where small-loss samples are regarded as clean. Nevertheless, such a strategy relies on the learning dynamics of each data instance, and some noisy samples are still memorized because corrupted learning patterns occur frequently. To tackle this problem, a training-free surrogate model is preferred, as it is free from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples through its text-image alignment ability. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without involving CLIP at inference time.

journal: Preprint

1 Introduction

With the emergence of deep neural networks (DNNs) and the boost of computation capability, current visual intelligence systems can excel in several tasks, e.g., image classification (Russakovsky et al., 2015; He et al., 2016; Dosovitskiy et al., 2021; Liu et al., 2021), object detection (Carion et al., 2020; Liu et al., 2020a), video understanding (Zhao et al., 2022; Ma et al., 2022), even surpassing human-level performance. These remarkable breakthroughs are closely related to the collection of high-quality annotated data. However, the labeling process is labor-intensive and expensive. For some specific domains like insect classification, it is much more difficult to annotate the data without expert knowledge.

Some researchers resort to a compromise and make use of large-scale, cheap, web-annotated data (Xiao et al., 2015; Kolesnikov et al., 2020). This inevitably introduces label noise. Supervised learning often assumes that the training and test data are independent and identically distributed (i.i.d.). Noisy labels create a discrepancy between the training and test distributions. As a consequence, learning with noisy labels leads to poor generalization on clean, unseen test data.

Figure 1: Some small-loss noisy samples that share similar visual patterns are memorized by DNNs. They are misidentified as clean samples by the small-loss criterion. With the help of the powerful open-vocabulary vision-language model CLIP, these samples can potentially be filtered out. With cleaner training samples, the classification performance is further boosted.

Many sophisticated algorithms (Han et al., 2018a; Li et al., 2020; Wei et al., 2020; Liu et al., 2020b; Ortego et al., 2021; Xia et al., 2021; Yao et al., 2021b; Zhu et al., 2022; Li et al., 2022a) have been proposed to alleviate the negative effect of noisy labels. One promising line of work is sample selection (Han et al., 2018a; Wei et al., 2020; Li et al., 2020; Yao et al., 2021b; Zhu et al., 2022). The main idea is to separate clean samples from all the training samples based on some rules or criteria. The cleaner samples help learn a less biased classifier. Previous methods (Han et al., 2018a; Wei et al., 2020; Li et al., 2020, 2022a) mostly adopt the small-loss strategy, which assumes that DNNs fit clean and simple samples first and memorize noisy samples later in training (Arpit et al., 2017). This strategy relies heavily on the learning dynamics of each data instance and suffers from undesirable learning bias in the training dataset. As shown in Figure 1, a proportion of noisy samples are identified as clean samples with small losses because they share similar corrupted visual patterns that occur frequently in the learning process. After sample selection, these out-of-distribution noisy samples remain mixed with the selected in-distribution clean samples. Learning with these noisy samples can mislead the classifier and distort the decision boundary. To avoid the memorization effect, a training-free surrogate model is a good choice for detecting noisy samples.

Recently, vision-language models pretrained on text-image pairs, especially CLIP (Radford et al., 2021), have shown promising zero-shot performance on downstream tasks. With its powerful zero-shot capability, CLIP can be easily adapted to score unseen objects without extra training. Given a text query, CLIP offers a flexible way to infer whether a data instance carries the correct label. Benefiting from pretraining on large-scale text-image web data, CLIP shows strong robustness to distribution shift and good domain generalization. CLIP can bring external knowledge to facilitate the selection of clean samples, which could potentially filter out those noisy samples that have been memorized by DNNs.

In this paper, we leverage the off-the-shelf vision-language surrogate model CLIP (Radford et al., 2021) to detect noisy samples automatically, which has not been explored yet. First, in contrast to the learning-centric small-loss criterion, our CLIP-based selection strategy is training-free. This property avoids the learning bias brought by noisy supervision. CLIP scores each data instance through its text-image alignment ability. Combined with the prompt technique, each training sample can be evaluated by CLIP and assigned a surrogate confidence corresponding to its noisy label. Naturally, we regard samples with high confidence as clean. Noisy samples with corrupted visual patterns can be filtered out with the help of external knowledge, which can further improve the learning of the classifier. Second, we propose a robust noise-aware balanced margin adaptive loss to regularize the selection bias brought by CLIP. On the one hand, CLIP remains biased towards certain classes, so it can be overconfident in them. On the other hand, existing methods often neglect a side effect of sample selection, namely that class imbalance might occur. Our noise-aware balanced margin adaptive loss modifies the logits directly, encouraging a relatively large margin for overconfident and dominant classes. This unified margin mechanism can mitigate the effect of noisy labels and imbalanced distribution for robust training.

We evaluate our method on both real-world and synthetic noisy datasets without CLIP involved during the inference stage. The significant improvement on several noisy benchmarks confirms the effectiveness of our proposed method.

Overall, our contributions can be summarized as follows:

  • We are the first to leverage the off-the-shelf vision-language surrogate model CLIP to help select clean samples automatically. This training-free method avoids the learning bias brought by the small-loss strategy, which supports the learning of a more robust classifier and alleviates the memorization effect.

  • We propose a noise-aware balanced margin adaptive loss to reduce the selection bias introduced by CLIP, providing greater robustness to label noise.

  • We demonstrate that our proposed method can achieve significant improvement on both real-world and synthetic noisy datasets without CLIP involved during the inference stage.

2 Related Works

Numerous approaches have been proposed to combat label noise in recent works (Li et al., 2020; Bai and Liu, 2021; Zhang and Pfister, 2021; Zhu et al., 2022; Huang and Chong, 2023). The common solutions can be typically categorized into three types: sample selection, sample reweighting, and label correction.

Sample selection focuses on identifying the clean samples among all the noisy training samples. The clean samples are then used to train the deep neural network. The key problem is to design a good criterion. There are several strategies (Han et al., 2018a; Arazo et al., 2019; Wei et al., 2020; Yao et al., 2021b; Ortego et al., 2021; Zhu et al., 2022) to detect noisy labels. Among them, the small-loss trick (Han et al., 2018a; Arazo et al., 2019; Wei et al., 2020; Yao et al., 2021b) plays an important role, since deep neural networks tend to learn clean and simple patterns faster (Arpit et al., 2017). Co-teaching (Han et al., 2018a) selects a pre-defined proportion of samples with small cross-entropy losses and discards the rest. Instead, JoCoR (Wei et al., 2020) selects samples with small joint losses composed of cross-entropy losses and co-regularization losses. However, JoSRC (Yao et al., 2021b) argues that prior methods neglect the different noise ratios in different mini-batches. It uses the Jensen-Shannon (JS) divergence as a measure of sample cleanness to separate clean samples in a global manner. Recently, several works (Ortego et al., 2021; Zhu et al., 2022) try to filter out noisy samples by leveraging neighborhood information, especially via the K-Nearest-Neighbors (KNN) algorithm. MOIT (Ortego et al., 2021) selects confident examples based on the representation similarity between neighbors. Zhu et al. (2022) employ KNN to re-label each sample and detect noisy labels by two simple criteria: local majority voting and global score-based ranking.
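As a rough illustration of the small-loss trick described above, the sketch below keeps a fixed proportion of the samples with the smallest losses. It is a minimal, dependency-free sketch under stated assumptions: the function name `select_small_loss`, the `keep_ratio` argument, and the loss values are hypothetical, and the two-network exchange used by Co-teaching is deliberately left out.

```python
def select_small_loss(losses, keep_ratio):
    """Return sorted indices of the keep_ratio fraction of samples with the smallest loss."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(losses) * keep_ratio))
    # Rank sample indices by ascending loss and keep the first num_keep.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:num_keep])


# Hypothetical per-sample cross-entropy losses from one mini-batch.
batch_losses = [0.12, 2.30, 0.45, 1.75, 0.08, 0.90]
kept = select_small_loss(batch_losses, keep_ratio=0.5)
```

Because the ranking depends only on the network's own losses, a noisy sample that the network has already memorized ends up with a small loss and survives this filter.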

Sample reweighting is a traditional and effective method to resist the memorization of noisy labels. It encourages larger weights for clean samples and smaller weights for noisy ones (Shu et al., 2019; Zhang and Pfister, 2021; Xu et al., 2021). Meta-Weight-Net (Shu et al., 2019) learns to reweight each sample following the meta-learning paradigm. However, this method requires a small, unbiased, and clean validation set, which might be difficult or expensive to collect in practice. To overcome this limitation, Zhang and Pfister (2021) propose to build proxy clean data from the training history. They maintain a memory that stores past losses and use the changes between the model and meta-model at different steps as the selection criterion.

Figure 2: The overall framework. We leverage the open-vocabulary vision-language surrogate model CLIP to select clean samples. The annotated confidence is predicted by CLIP for the sample's noisy label. Here, red denotes samples treated as noisy by CLIP and green denotes clean ones. Then, combined with the transition matrix and the class frequency prior, we propose a noise-aware balanced margin adaptive loss to mitigate the overconfidence effect and the class imbalance issue. The loss labeled in the figure is $\mathcal{L}_{\text{NABM}}$ (Eq. 7).

Label correction aims to assign correct pseudo labels to samples with wrong labels. The most popular way is to use the model's prediction (Arazo et al., 2019; Liang et al., 2023). In general, the generated pseudo label is a convex combination of the original noisy label and the current prediction of the model (Arazo et al., 2019). Some works utilize the prediction from class prototypes (Han et al., 2019) or obtain hard labels based on a threshold (Ortego et al., 2021).

Others combine several techniques to prevent overfitting to noisy labels, e.g., mix-up (Zhang et al., 2018), label smoothing (Ortego et al., 2021), consistency regularization (Iscen et al., 2022; Cheng et al., 2022b), semi-supervised learning (Li et al., 2020), and contrastive learning (Ortego et al., 2021; Wu et al., 2021). DivideMix (Li et al., 2020) first divides the training samples into a labeled set and an unlabeled set by fitting a Gaussian Mixture Model (GMM) on the loss distribution, and then performs semi-supervised learning. NCR (Iscen et al., 2022) proposes a consistency regularization term that encourages the output logit of a sample to be similar to those of its neighbors, based on the structure of the feature space.

LAION (Schuhmann et al., 2021) and DataComp (Gadre et al., 2024) employ CLIP to filter image-text pairs during the training of vision-language models, with the primary objective of enhancing data quality by eliminating noisy or irrelevant pairs. These approaches leverage CLIP’s multi-modal understanding to ensure that only the most semantically aligned image-text pairs contribute to the training process. While our proposed method shares the fundamental principle of data refinement, it introduces a novel application of CLIP in a different context: filtering noisy labels in classification tasks.

Unlike the prior works, which focus on the relational alignment between image and text pairs in multi-modal datasets, our method specifically addresses the issue of mislabeled data within a purely visual classification setting. Here, the noise is label-centric, where an image is incorrectly labeled, leading to inaccuracies in the training dataset. By applying CLIP to evaluate the consistency between images and their associated labels, we can effectively identify and remove incorrect labels, thereby improving the accuracy and reliability of the labeled dataset. This novel use of CLIP for label noise filtering in classification represents a significant departure from its traditional role in vision-language model training, marking a new frontier in its application for enhancing dataset quality.

3 Method

3.1 Preliminary

Problem formulation. In the image classification problem, we are given a training dataset $\mathcal{D}=\{(x_i,y_i) \mid i=1,2,3,\dots,N\}$ consisting of $N$ sample pairs, where $x_i$ is an image and $y_i\in\{1,2,3,\dots,C\}$ is the associated label for each sample pair. $C$ is the number of classes. In our task, an unknown number of labels are noisy, i.e., $y_i\neq\hat{y}_i$, where $y_i$ is the noisy label and $\hat{y}_i$ is its true class label. Note that $y_i$ is the correct label if and only if $y_i=\hat{y}_i$. Our goal is to train a deep neural network $\mathcal{F}_\theta$ on such a noisily labeled training dataset so that it generalizes well to clean, unseen test data. The network $\mathcal{F}_\theta$ is composed of three components: (1) a feature encoder $f$ that maps an image $x_i$ into a high-dimensional representation $v_i=f(x_i)$; (2) a classifier $h$ that takes $v_i$ as input and outputs the logit $z_i=h(v_i)$; (3) a softmax layer $\sigma$ that transforms the logit $z_i$ into the probability $p_i$.

Vision-language surrogate model. Recently, models pretrained on large-scale text-image supervision have become popular, e.g., CLIP (Radford et al., 2021). In the training stage, CLIP pretrains the image encoder and the text encoder with a contrastive loss. It pulls the image feature embedding and the paired text feature embedding closer in the shared embedding space by maximizing their cosine similarity. During inference, CLIP can predict the most probable pair given an image and a set of prompt-based texts like “a photo of a {CLASS}”, where {CLASS} is replaced by the class name. This framework naturally endows CLIP with the capability of open-vocabulary zero-shot classification, and it can be adapted to several downstream tasks. Building on this text-image alignment, we leverage the CLIP-like open-vocabulary vision-language surrogate model to select clean samples based on its prediction confidence. Note that the frozen CLIP is only used as a scorer in the training stage.

Overview. First, we pretrain the feature encoder $f$ to learn a robust representation with noisy labels. Then, we keep only the backbone $f$ and re-train the classifier $h$. We apply the vision-language pretrained model CLIP (Radford et al., 2021) to help select clean samples automatically. In order to mitigate the selection bias introduced by CLIP, we design a robust noise-aware balanced margin adaptive loss to regularize the effect of overconfidence and class imbalance. The overall framework is presented in Figure 2. We present the overall training algorithm in Algorithm 1.

3.2 Selecting clean samples with CLIP

Learning with noisy labels suffers from the adverse effect that deep neural networks can easily memorize noisy samples (Zhang et al., 2021). One effective solution to this problem is sample selection. Most prior research (Han et al., 2018a; Wei et al., 2020; Li et al., 2020, 2022a) relies on the small-loss criterion, based on the observation that deep networks fit clean samples first and noisy ones only gradually (Arpit et al., 2017). This strategy is a learning-centric selection metric obtained by fitting the data distribution. It can be affected by the learning bias in the training dataset, where noisy samples with repetitive corrupted visual patterns are identified as clean samples. As a result, the deep network accumulates prediction errors. To avoid this confirmation bias, we resort to a training-free surrogate model. We leverage the off-the-shelf pretrained surrogate model CLIP (Radford et al., 2021) to help detect clean samples automatically. CLIP offers several advantages in learning with noisy labels: (1) it provides a training-free selection strategy that does not rely on the memorization effect; (2) it transfers flexibly to downstream tasks without extra training, thanks to its powerful text-image alignment; (3) it allows customized prompt engineering, which can potentially filter out some noisy labels based on our prior knowledge.

We propose to select clean samples based on the predictions from CLIP. Given an image $x$, the image feature $V$ is extracted by the image encoder and the text features $\{T_1,\dots,T_C\}$ are generated by the text encoder from the prompt template $\mathcal{T}$, e.g., “a photo of a {CLASS}”. Then, the CLIP prediction for label $y=i$ is computed as follows:

$$q(y=i \mid x)=\frac{\exp(\cos(V,T_i)/\tau)}{\sum_{j=1}^{C}\exp(\cos(V,T_j)/\tau)}, \tag{1}$$

where $\cos(\cdot,\cdot)$ denotes the cosine similarity and $\tau$ is the temperature factor. We use $\tau=0.01$ in the experiments. Then, we consider two types of selection criteria.
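To make Eq. 1 concrete, the following minimal sketch computes the softmax over temperature-scaled cosine similarities. It assumes the image and text embeddings have already been extracted; the three-dimensional vectors are made-up stand-ins for real CLIP features, the helper names `cosine` and `clip_probabilities` are hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_probabilities(image_feature, text_features, tau=0.01):
    """Eq. 1: softmax over cosine similarities divided by the temperature tau."""
    scaled = [cosine(image_feature, t) / tau for t in text_features]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 3-d embeddings standing in for CLIP outputs (C = 3 classes).
V = [0.9, 0.1, 0.2]
T = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.1, 1.0]]
q = clip_probabilities(V, T)
```

With a temperature as small as 0.01, modest gaps in cosine similarity translate into a sharply peaked distribution, which is why the maximum is subtracted before exponentiation.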

Prediction Confidence: Naturally, we regard the CLIP prediction corresponding to the noisy label as the confidence of the sample and select the samples with high confidence. Specifically, given a sample $x_i$ with a label $y_i$, it is judged as a clean sample if $q_i(y=y_i \mid x_i)>\rho$, where $\rho$ is a pre-defined threshold. This criterion is simple and effective.
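The prediction-confidence rule can be sketched as a simple filter. This is an illustrative sketch only: the probability rows, the labels, and the threshold value 0.5 are invented for the example, the function name `select_by_confidence` is hypothetical, and labels are 0-indexed here whereas the text indexes classes from 1.

```python
def select_by_confidence(clip_probs, noisy_labels, rho):
    """Keep sample i when CLIP's probability for its given label exceeds rho."""
    return [
        i
        for i, (probs, label) in enumerate(zip(clip_probs, noisy_labels))
        if probs[label] > rho
    ]


# Hypothetical CLIP outputs for four samples over three classes.
clip_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.60, 0.10],
    [0.02, 0.03, 0.95],
]
noisy_labels = [0, 1, 1, 2]
clean_indices = select_by_confidence(clip_probs, noisy_labels, rho=0.5)
```

In this toy input the second sample is dropped because CLIP assigns only 0.20 to its given label, regardless of how confident CLIP is about some other class.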

Prompt Consistency: Domain-specific knowledge can be injected into the prompt, which helps detect out-of-domain noisy samples. Noisy web images are collected by keyword searching. However, class names can be ambiguous. For example, “stingray” can refer to a type of car or to an animal. If we target classifying animals, the car images are treated as out-of-domain data. It is difficult for the small-loss criterion to distinguish these noisy samples because the images share repetitive visual patterns, so models can easily memorize them. Prompts help specify the content of the images. The prediction for a clean sample should be consistent between two prompts whose only difference is the domain-specific context. For instance, we apply two prompt templates $\mathcal{T}_1$: “a photo of a {CLASS}” and $\mathcal{T}_2$: “a photo an animal {CLASS}” to get two predictions $q_i$ and $\tilde{q}_i$ for a given sample $x_i$. We utilize the Jensen-Shannon divergence to quantify the distance $d_i$ between the above two probability predictions:

$$\begin{aligned} d_i &= D_{JS}(q_i \,\|\, \tilde{q}_i) \\ &= \frac{1}{2} D_{KL}\!\left(q_i \,\Big\|\, \frac{q_i+\tilde{q}_i}{2}\right)+\frac{1}{2} D_{KL}\!\left(\tilde{q}_i \,\Big\|\, \frac{q_i+\tilde{q}_i}{2}\right), \end{aligned} \tag{2}$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence. Intuitively, we treat samples with small JS divergence as clean samples, i.e., $d_i<\mu$, where $\mu$ is a pre-defined threshold. This criterion allows us to make use of human knowledge to help detect noisy samples, but it may need sophisticated design.
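The prompt-consistency check of Eq. 2 can be sketched as follows. The two prediction vectors would come from CLIP under the two prompt templates; here they are invented numbers, the threshold value 0.05 is a hypothetical choice, and the helper names are not taken from the paper. Natural logarithms are assumed, so the divergence is bounded by ln 2.

```python
import math


def kl_divergence(p, q):
    """KL(p || q); terms with p_k = 0 contribute zero by convention."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def js_divergence(p, q):
    """Eq. 2: average KL divergence of p and q to their midpoint distribution."""
    mixture = [(pk + qk) / 2.0 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def is_prompt_consistent(q_plain, q_domain, mu):
    """A sample counts as clean when the two predictions differ by less than mu."""
    return js_divergence(q_plain, q_domain) < mu


# Invented predictions under the generic and the domain-specific prompt.
q_plain = [0.80, 0.10, 0.10]
q_domain_close = [0.75, 0.15, 0.10]
q_domain_far = [0.05, 0.90, 0.05]
mu = 0.05  # hypothetical threshold
consistent = is_prompt_consistent(q_plain, q_domain_close, mu)
inconsistent = is_prompt_consistent(q_plain, q_domain_far, mu)
```

The midpoint distribution is strictly positive wherever either input is positive, so the logarithm inside the KL terms never divides by zero.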

By introducing external knowledge from CLIP (Radford et al., 2021; Yang et al., 2021), the noisy samples that have been memorized by DNNs can potentially be identified. The selected cleaner samples can facilitate the learning of a more robust classifier and therefore improve the classification performance.

3.3 Noise-Aware Balanced Margin Adaptive Loss

CLIP (Radford et al., 2021) helps select clean samples; nevertheless, it can also introduce selection bias. On the one hand, CLIP is often biased towards some classes (Wang et al., 2022) and can provide overconfident scores for them. On the other hand, class imbalance occurs after sample selection, which is often neglected by existing methods. In order to regularize the selection bias, we take advantage of a margin adaptive mechanism with two priors, which encourages the overconfident and dominant classes to have relatively large margins.

Transition matrix. The transition matrix can be used to reflect the class-level confidence of the model. Each element $M_{ij}$ in the transition matrix $M\in\mathbb{R}^{C\times C}$ represents the probability of being flipped to a label $j$ when given an instance with a label $i$. Following GLC (Hendrycks et al., 2018), we estimate the class-dependent transition matrix by the average of the prediction $q(y=i \mid x)$ (Eq. 1) from the vision-language surrogate model CLIP:

$$M_{ij}=\frac{1}{N_i}\sum q(y=j \mid x, y=i), \tag{3}$$

where $N_i$ denotes the number of instances in class $i$. Note that we estimate the transition matrix by using all the training samples. Addressing noisy labels with the transition matrix has been extensively studied in the literature (Hendrycks et al., 2018; Li et al., 2022b; Cheng et al., 2022a). They mostly use the transition matrix to refine the output probability directly. By contrast, we regard it as a margin penalty to prevent the overconfidence effect.
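One way to read Eq. 3 is as a per-class average of CLIP's prediction vectors, grouped by the given (noisy) label. The sketch below follows that reading; the function name, the toy probabilities, the 0-indexed labels, and the all-zero row for an empty class are assumptions of this sketch rather than details stated in the text.

```python
def estimate_transition_matrix(clip_probs, noisy_labels, num_classes):
    """Eq. 3: row i is the mean CLIP prediction over samples whose label is i."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, label in zip(clip_probs, noisy_labels):
        counts[label] += 1
        for j in range(num_classes):
            sums[label][j] += probs[j]
    # Classes with no samples keep an all-zero row (an assumption of this sketch).
    return [
        [value / counts[i] if counts[i] else 0.0 for value in sums[i]]
        for i in range(num_classes)
    ]


# Toy CLIP predictions for four samples over two classes.
clip_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
noisy_labels = [0, 0, 1, 1]
M = estimate_transition_matrix(clip_probs, noisy_labels, num_classes=2)
```

Each non-empty row of the result sums to one because it is an average of probability vectors; a large diagonal entry indicates that CLIP tends to agree strongly with the given labels of that class.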

Class frequency prior. The class frequency prior measures the distribution of the training data, which is a common statistic used to address the long-tail problem (Menon et al., 2021). It is defined as $\pi_j=N_j'/N'$, where $N'$ and $N_j'$ are the number of training samples and the number of instances in class $j$, respectively.
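The class frequency prior amounts to normalized label counts. The sketch below is a minimal illustration with an invented label list and a hypothetical function name; it reads the counts as being taken over the samples that remain after selection, and uses 0-indexed labels.

```python
from collections import Counter


def class_frequency_prior(selected_labels, num_classes):
    """pi_j = (number of selected samples labeled j) / (number of selected samples)."""
    if not selected_labels:
        raise ValueError("at least one selected sample is required")
    counts = Counter(selected_labels)
    total = len(selected_labels)
    return [counts[j] / total for j in range(num_classes)]


# Invented labels of the samples kept after selection (three classes).
selected_labels = [0, 0, 0, 1, 2, 2]
prior = class_frequency_prior(selected_labels, num_classes=3)
```

A class that loses all of its samples during selection would receive a prior of zero, whose logarithm is undefined, so a practical implementation would need to guard against that case.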

With the transition matrix and class frequency prior, we propose a noise-aware margin adaptive loss to address the above-mentioned problems in a unified framework. After the selection of clean samples, we get the clean subset $\mathcal{D}_{clean}$ consisting of $N'$ training samples. For $(x_i,y_i)\in\mathcal{D}_{clean}$, we obtain the model’s output of the softmax probability $\hat{p}_i$ as:

$$\hat{p}_i=\frac{\exp\!\big((z_i^{y_i}+\delta M_{y_i y_i}+t\log\pi_{y_i})/s\big)}{\sum_{j=1}^{C}\exp\!\big((z_i^{j}+\delta M_{y_i j}+t\log\pi_j)/s\big)}, \tag{4}$$

where $\delta$ and $t$ control the noise-aware margin and balanced margin, respectively. Here, $s$ is the temperature factor.
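Eq. 4 can be sketched as a softmax over logits shifted by the two margin terms. The logits, the transition matrix, the prior, and the values chosen for the margin weights and the temperature below are all invented for illustration, and the function name is hypothetical; labels are 0-indexed.

```python
import math


def margin_adjusted_probability(logits, label, M, prior, delta, t, s):
    """Eq. 4: probability of the given label after adding both margin terms."""
    adjusted = [
        (logits[j] + delta * M[label][j] + t * math.log(prior[j])) / s
        for j in range(len(logits))
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    return exps[label] / sum(exps)


# Invented inputs for one sample with three classes.
logits = [2.0, 1.0, 0.5]
M = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]]
prior = [0.5, 0.3, 0.2]
p_hat = margin_adjusted_probability(logits, 0, M, prior, delta=1.0, t=1.0, s=1.0)
p_plain = margin_adjusted_probability(logits, 0, M, prior, delta=0.0, t=0.0, s=1.0)
```

Setting both margin weights to zero and the temperature to one recovers the ordinary softmax probability, which is a convenient sanity check for an implementation.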

Conventionally, the deep neural network is optimized by empirical risk minimization of the vanilla cross-entropy loss:

$$\mathcal{L}_{\text{ERM}}=\mathbb{E}_{\mathcal{D}_{clean}}[\ell_{\text{CE}}(x,y)]=\frac{1}{N'}\sum_{i=1}^{N'}\ell_{\text{CE}}(x_i,y_i), \tag{5}$$

$$\ell_{\text{CE}}(x_i,y_i)=-\log\frac{\exp(z_i^{y_i})}{\sum_{j=1}^{C}\exp(z_i^{j})}. \tag{6}$$

However, we find that this loss does not perform well in our experiments. We hypothesize that, after sample selection, each class contains two groups of data: many easy samples concentrated near the class center and a few hard samples near the class boundary. Cross-entropy loss assigns the same weight to every sample, so the many easy samples dominate the few hard ones and make the classifier difficult to optimize. To tackle this issue, we employ the focal loss (Lin et al., 2017).
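As a rough illustration of this reweighting, assume the unweighted form of the focal loss from Lin et al. (2017), $\ell_{\text{FL}}(p)=-(1-p)^{\gamma}\log p$, with $\gamma=1$. The two probabilities below are illustrative values chosen for the arithmetic, not measurements from the experiments:

\[ \text{easy sample } (p=0.9):\quad \ell_{\text{CE}}=-\log 0.9\approx 0.105,\qquad \ell_{\text{FL}}=0.1\times 0.105\approx 0.011, \]
\[ \text{hard sample } (p=0.3):\quad \ell_{\text{CE}}=-\log 0.3\approx 1.204,\qquad \ell_{\text{FL}}=0.7\times 1.204\approx 0.843. \]

Under these assumed values, the ratio of the hard-sample loss to the easy-sample loss grows from about 11 under cross-entropy to about 80 under the focal loss, so each hard sample contributes far more to the gradient relative to an easy one.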

Finally, combined with the focal loss $\ell_{\text{FL}}$ ($\gamma=1.0$), our noise-aware balanced margin adaptive loss is defined as:

\[ \mathcal{L}_{\text{NABM}}=\mathbb{E}_{\mathcal{D}_{clean}}[\ell_{\text{FL}}(\hat{p})]=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}\ell_{\text{FL}}(\hat{p}_{i}). \tag{7} \]

By modifying the logits with the transition matrix and the class frequency prior, this margin adaptive mechanism suppresses overconfidence on biased classes and mitigates the imbalanced distribution introduced by CLIP-based sample selection, which encourages the model to resist label noise better.
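The loss in Eq. 4 and Eq. 7 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nabm_loss` and its argument layout are hypothetical, the transition matrix `M` is taken as given (Eq. 3 is not reproduced here), and the focal loss is assumed to be the unweighted form $-(1-p)^{\gamma}\log p$.

```python
import math

def nabm_loss(logits, labels, M, prior, delta=0.5, t=1.0, s=1.0, gamma=1.0):
    """Sketch of Eq. 4 followed by Eq. 7 for a batch of selected samples.

    logits: list of per-sample logit lists z_i (length C each)
    labels: list of selected labels y_i
    M:      C x C transition matrix as nested lists (assumed given by Eq. 3)
    prior:  class frequency prior pi; entries must be strictly positive
    """
    total = 0.0
    for z, y in zip(logits, labels):
        # Adjusted logit for class j: (z_i^j + delta * M[y_i][j] + t * log(pi_j)) / s
        adj = [(z[j] + delta * M[y][j] + t * math.log(prior[j])) / s
               for j in range(len(z))]
        shift = max(adj)  # subtract the maximum for numerical stability
        log_norm = shift + math.log(sum(math.exp(a - shift) for a in adj))
        log_p = adj[y] - log_norm  # log of p_hat_i in Eq. 4
        p = math.exp(log_p)
        # Assumed unweighted focal loss: -(1 - p)^gamma * log(p)
        total += -(max(0.0, 1.0 - p) ** gamma) * log_p
    return total / len(labels)
```

With `delta=0`, `gamma=0`, and `s=1`, the sketch reduces to the cross-entropy loss of Eq. 6 whenever the prior is uniform, since a constant shift of all logits leaves the softmax unchanged.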

Input: training dataset with noisy labels $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$; pretrained feature encoder $f$; classifier $h$; open-vocabulary vision-language model CLIP; texts $T=\{T_{j}\}_{j=1}^{C}$ generated by the prompt template; threshold $\rho$; maximum epoch $E$
Output: deep neural network $\mathcal{F}_{\theta}=\{f,h\}$

// Selecting clean samples with CLIP
$\mathcal{D}_{clean}\leftarrow\varnothing$;
foreach $(x_{i},y_{i})\in\mathcal{D}$ do
    Compute $q(y=y_{i}|x_{i})$ by Eq. 1 with CLIP and $T$;
    if $q(y=y_{i}|x_{i})>\rho$ then
        $\mathcal{D}_{clean}\leftarrow\mathcal{D}_{clean}\cup\{(x_{i},y_{i})\}$;
    end if
end foreach

// Calculating the transition matrix
Compute the transition matrix $M$ by Eq. 3 on $\mathcal{D}$;

// Calculating the class frequency prior
Compute the class frequency prior $\pi$ on $\mathcal{D}_{clean}$;

// Fine-tuning the network
Re-initialize the classifier $h$;
for $e=1,\dots,E$ do
    for $k=1,\dots,\text{MaxIter}$ do
        Draw a mini-batch $\mathcal{X}_{e}^{k}=\{(x_{b},y_{b})\}_{b=1}^{B}$ from $\mathcal{D}_{clean}$;
        Compute the loss $\mathcal{L}_{\text{NABM}}$ by Eq. 7 with $M$ and $\pi$ on $\mathcal{X}_{e}^{k}$;
        Compute the gradients of $\mathcal{L}_{\text{NABM}}$ by backpropagation;
        Update $\theta$ with SGD;
    end for
end for

return $\mathcal{F}_{\theta}=\{f,h\}$
Algorithm 1: Pseudo-code for our method.
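The selection and prior steps of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the authors' code: Eq. 1 is not reproduced in this section, so the confidence is assumed to be a softmax over per-class image-text scores, and the scores are plain lists standing in for CLIP outputs. The helper names `label_confidence`, `select_clean`, and `class_prior` are hypothetical.

```python
import math
from collections import Counter

def label_confidence(scores, label):
    """Softmax probability of `label` given per-class image-text scores
    (assumed form of q(y = y_i | x_i); the scores stand in for CLIP outputs)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in scores]
    return exps[label] / sum(exps)

def select_clean(all_scores, labels, rho):
    """Return the indices of samples whose confidence on the given label exceeds rho."""
    return [i for i, (sc, y) in enumerate(zip(all_scores, labels))
            if label_confidence(sc, y) > rho]

def class_prior(labels, num_classes):
    """Empirical class frequencies of the selected subset."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in range(num_classes)]
```

A typical use would call `select_clean` once over the whole dataset and then `class_prior` on the labels of the kept indices. A class with no selected sample receives a zero prior in this sketch, so some smoothing would be needed before taking its logarithm in Eq. 4.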

4 Experiments

4.1 Experiment setup

Datasets. We evaluate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. For real-world datasets, we conduct experiments on five benchmarks with noisy labels: Clothing1M, WebVision, Red Mini-ImageNet, CIFAR-10N, and CIFAR-100N. Clothing1M (Xiao et al., 2015) consists of 1 million training images collected from online shopping websites, where labels are produced from the surrounding texts. The test set contains 10,526 images of 14 classes. WebVision (Li et al., 2017) is crawled from the web using the 1,000 concepts of ImageNet ILSVRC12 (Deng et al., 2009). Following (Li et al., 2020), we experiment with the first 50 classes of the Google image subset of WebVision 1.0. The training and validation sets contain 65,944 and 2,500 images, respectively. Red Mini-ImageNet (Jiang et al., 2020) is a benchmark of controlled real-world label noise from the web. The dataset contains 100 classes. We experiment with noise rates of 20%, 40%, 60%, and 80%. Images are resized to $32\times 32$ for a fair comparison (Garg et al., 2023; Kim et al., 2024). CIFAR-10N and CIFAR-100N (Wei et al., 2022) are two recently proposed benchmarks with real-world human-annotated noisy labels collected from Amazon Mechanical Turk. For CIFAR-10N, each image has 3 human-annotated labels. We study three types of noisy label sets: (1) Aggregate: the noisy label is aggregated by majority voting; (2) Random $i$ ($i=1,2,3$): the $i$-th annotated label is used as the noisy label; (3) Worst: a wrongly annotated label is used whenever one exists. For CIFAR-100N, each image is annotated with one noisy fine label and a coarse label. Please refer to (Wei et al., 2022) and the website (http://noisylabels.com/) for more details. For synthetic datasets, we manually corrupt the labels of CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009).
Both CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 test images, with 10 and 100 classes, respectively. The image size is $32\times 32$. We investigate three types of label noise: symmetric, asymmetric, and instance-dependent. Symmetric noise is generated by randomly replacing clean labels with other possible labels; in this case, some clean labels can be retained. We constrain label flipping to stay within the given label set. Asymmetric noise is injected by replacing labels only with those of similar classes, e.g., deer $\rightarrow$ horse and dog $\leftrightarrow$ cat, which is more common in practice. Instance-dependent noise depends on the image content. We follow (Xia et al., 2020) to simulate this setting.
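The symmetric and asymmetric corruption described above can be sketched as follows. The exact protocol is an assumption: for symmetric noise, a selected label is resampled uniformly over all classes (so it may land on the clean label again), and for asymmetric noise, a label is flipped to a fixed similar class with the given probability. The function names and the class-index mapping in the usage note are illustrative.

```python
import random

def symmetric_noise(labels, noise_rate, num_classes, rng):
    """With probability noise_rate, resample the label uniformly over all classes.
    The draw may return the original label, so some clean labels are retained."""
    return [rng.randrange(num_classes) if rng.random() < noise_rate else y
            for y in labels]

def asymmetric_noise(labels, noise_rate, mapping, rng):
    """With probability noise_rate, flip a label to its mapped similar class.
    Labels without an entry in `mapping` are left untouched."""
    return [mapping[y] if y in mapping and rng.random() < noise_rate else y
            for y in labels]
```

For example, with the standard CIFAR-10 class indices (cat = 3, deer = 4, dog = 5, horse = 7), the mapping `{4: 7, 3: 5, 5: 3}` encodes deer to horse and dog to and from cat, and `random.Random(0)` gives a reproducible generator.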

Evaluation metrics. We report top-1 test accuracy in all experiments. For the Clothing1M dataset, we select the model that performs best on the validation set. For WebVision, we also report top-5 accuracy. After training on WebVision, we evaluate on ImageNet without any fine-tuning.
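For reference, top-k accuracy can be computed as in the short sketch below. The function name `topk_accuracy` is hypothetical, and the inputs are plain lists of per-class scores and integer labels.

```python
def topk_accuracy(logits, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for z, y in zip(logits, labels):
        # Rank class indices by descending score and keep the first k.
        ranked = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
        hits += y in ranked[:k]
    return 100.0 * hits / len(labels)
```

Top-1 accuracy corresponds to `k=1` and top-5 accuracy to `k=5`.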

Implementation details. Following (Li et al., 2020, 2022a), we use the same experimental protocol to pretrain the backbone. For WebVision, we use Inception-ResNet V2 (Szegedy et al., 2017) for DivideMix (Li et al., 2020) and ResNet-18 for Sel-CL (Li et al., 2022a). We reinitialize the classifier and train for 10 epochs with a learning rate of 0.01 and 0.001, respectively. The optimizer is SGD with a weight decay of 0.0001 and the batch size is 64. For Clothing1M, we use ResNet-50 pretrained on ImageNet. Following previous work (Li et al., 2020), we sample 1000 mini-batches in each epoch. We train for 80 epochs with an initial learning rate of 0.02 and set $\rho=0.6$. For the Red Mini-ImageNet, CIFAR-10N, CIFAR-100N, CIFAR-10, and CIFAR-100 datasets, we train Pre-Act ResNet-18 for 10 epochs. The weight decay is set to 0.0005 and the batch size is 128. We set $\rho=0.5$ for Red Mini-ImageNet, CIFAR-10N, and CIFAR-10, and $\rho=0.1$ for CIFAR-100 and CIFAR-100N. We use $s=1.0$, $\delta=0.5$, $t=1.0$, and ViT-B/16 as the backbone of CLIP for filtering when training ResNet-18, and $s=0.1$, $\delta=0.1$, $t=0.01$, and ResNet-50 for the others. Empirically, we find that a smaller $s$ suits deeper networks.

4.2 Results

Method WebVision ILSVRC12
top-1 top-5 top-1 top-5
F-correction (Patrini et al., 2017) 61.12 82.68 57.36 82.36
Decoupling (Malach and Shalev-Shwartz, 2017) 62.54 84.74 58.26 82.26
MentorNet (Jiang et al., 2018) 63.00 81.40 57.80 79.92
Co-teaching (Han et al., 2018a) 63.58 85.20 61.48 84.70
Iterative-CV (Chen et al., 2019) 65.24 85.34 61.60 84.98
ELR (Liu et al., 2020b) 76.26 91.26 68.71 87.84
ELR+ (Liu et al., 2020b) 77.78 91.68 70.29 89.76
NGC (Wu et al., 2021) 79.16 91.84 74.44 91.04
TCL (Huang et al., 2023) 79.10 92.30 75.40 92.40
LSL (Kim et al., 2024) 81.40 93.00 77.00 91.84
CLIP zero-shot (RN50) (Radford et al., 2021) 71.20 95.04 74.04 96.80
CLIP zero-shot (ViT-B/16) (Radford et al., 2021) 75.44 96.92 78.36 97.92
DivideMix (Li et al., 2020) 77.32 91.64 75.20 90.84
Ours (DivideMix init.) 79.08 91.96 76.04 93.12
Sel-CL (Li et al., 2022a) 77.88 91.60 74.28 90.96
Ours (Sel-CL init.) 81.00 93.84 76.28 94.92
Table 1: We report top-1 and top-5 test accuracy (%) on WebVision 1.0 and ImageNet ILSVRC12. Our method achieves significant improvement over baseline methods. We show the best and the second best results in LNL methods.
Method Noise rate
20% 40% 60% 80%
CE (Yao et al., 2021a) 47.36 42.70 37.30 29.76
Mixup (Zhang et al., 2018) 49.10 46.40 40.58 33.58
MentorMix (Jiang et al., 2020) 51.02 47.14 43.80 33.46
DivideMix (Li et al., 2020) 50.96 46.72 43.14 34.50
SSR (Feng et al., 2021) 52.18 48.96 42.42 33.20
FaMUS (Xu et al., 2021) 51.42 48.06 45.10 35.50
LSL (Kim et al., 2024) 54.68 49.80 45.46 36.78
InstanceGM (Garg et al., 2023) 58.38 52.24 47.96 39.62
Ours 61.26 57.09 53.25 45.65
Table 2: Test accuracy (%) on Red Mini-ImageNet (CNWL). Based on InstanceGM, ours achieves significant improvement.
Method Test accuracy (%)
CE 68.94
Co-teaching (Han et al., 2018a) 69.21
ELR (Liu et al., 2020b) 72.87
NCR (Iscen et al., 2022) 74.6
DivideMix (Li et al., 2020) 74.76
Ours 74.84 $\pm$ 0.03
Table 3: Comparison between our method and existing baseline methods on the Clothing1M dataset. Test accuracy (%) is reported.
Dataset CIFAR-10N CIFAR-100N
Types Aggregate Random1 Random2 Random3 Worst Noisy
CE (Standard) 87.22 81.59 82.22 82.06 67.45 47.54
DivideMix (Li et al., 2020) 95.33 95.35 95.01 95.18 92.47 69.84
Ours 95.95 $\pm$ 0.05 96.17 $\pm$ 0.10 95.58 $\pm$ 0.10 95.95 $\pm$ 0.05 93.67 $\pm$ 0.10 72.46 $\pm$ 0.41
Table 4: Test accuracy (%) on CIFAR-10N and CIFAR-100N. All results use the PreAct ResNet-18 architecture. We reproduce all the baselines. Our method improves on DivideMix on every noisy label set.
Dataset CIFAR-10 CIFAR-100
Noise type Sym. Asym. Sym.
Noise rate 20% 50% 80% 90% 40% 20% 50% 80% 90%
Standard 83.9 58.5 25.9 17.3 77.3 61.5 37.4 10.4 4.1
F-correction (Patrini et al., 2017) 83.1 59.4 26.2 18.8 83.1 61.4 37.3 9.0 3.4
Co-teaching+ (Yu et al., 2019) 88.2 84.1 45.5 30.1 - 64.1 45.3 15.5 8.8
Mixup (Zhang et al., 2018) 92.3 77.6 46.7 43.9 - 66.0 46.6 17.6 8.1
P-correction (Yi and Wu, 2019) 92.0 88.7 76.5 58.2 88.1 68.1 56.4 20.7 8.8
M-correction (Arazo et al., 2019) 93.8 91.9 86.6 68.7 86.3 73.4 65.4 47.6 20.5
MOIT+ (Ortego et al., 2021) 94.1 - 75.8 - 93.2 75.9 - 51.4 -
ELR+ (Liu et al., 2020b) 94.9 93.9 90.9 74.5 88.9 76.3 72.0 57.2 30.9
NCR (Iscen et al., 2022) 95.2 94.3 91.6 75.1 90.7 76.6 72.5 58.0 30.8
NGC (Wu et al., 2021) 95.9 94.5 91.6 80.5 90.6 79.3 75.9 62.7 29.8
DivideMix (Li et al., 2020) 95.7 94.4 92.9 75.4 92.1 76.9 74.2 59.6 31.0
Ours 96.6 95.6 94.1 89.2 95.1 80.3 76.6 63.4 45.7
Table 5: Test accuracy (%) on CIFAR-10 and CIFAR-100 under different noise rates. Sym. denotes symmetric noise and Asym. denotes asymmetric noise. Built on DivideMix (Li et al., 2020), our method improves on it in every setting.

Real-world datasets. Table 1 shows the results on WebVision (Li et al., 2017). Our method improves clearly over the DivideMix baseline (Li et al., 2020): it reaches 79.08% top-1 and 91.96% top-5 accuracy on the WebVision validation set, gains of 1.76% and 0.32% over DivideMix. When transferred to ImageNet (Deng et al., 2009), the top-5 accuracy is boosted substantially: ours achieves 93.12%, surpassing DivideMix by 2.28%. With the more robust representation pretrained by contrastive learning (Li et al., 2022a), our method achieves the second-best top-1 accuracy of 81.00% and a better top-5 accuracy than the state-of-the-art method LSL (Kim et al., 2024). These results verify the effectiveness of our proposed sampling strategy and margin mechanism. Compared with other approaches, our method remains competitive and resists label noise effectively. Table 2 shows the test accuracy on Red Mini-ImageNet (Jiang et al., 2020) with controllable realistic label noise. Our method outperforms contemporary methods, surpassing the second best by 2.9%, 4.8%, 5.3%, and 6.0% under noise rates of 20%, 40%, 60%, and 80%, respectively. Table 3 compares our method with previous methods on Clothing1M (Xiao et al., 2015) with realistic noisy labels. Benefiting from training on the selected cleaner samples, our approach improves on DivideMix (Li et al., 2020) (74.84% vs. 74.76%), showing competitive performance. It also outperforms other methods such as NCR (Iscen et al., 2022) and ELR (Liu et al., 2020b). The evaluation results on CIFAR-10N and CIFAR-100N are shown in Table 4. DivideMix (Li et al., 2020) outperforms the basic CE baseline by a large margin, which confirms its robustness.
In addition, our proposed method improves on DivideMix further. These results indicate that our method effectively mitigates the negative effect of real-world noise.

Method Noise rate
20% 30% 40% 45% 50%
CE (Yao et al., 2021a) 30.42 24.15 21.45 15.23 14.42
Mixup (Zhang et al., 2018) 32.92 29.76 25.92 23.13 21.31
F-correction (Patrini et al., 2017) 36.38 33.17 26.75 21.93 19.27
Reweight (Liu and Tao, 2015) 36.73 31.91 28.39 24.12 20.23
Decoupling (Malach and Shalev-Shwartz, 2017) 36.53 30.93 27.85 23.81 19.59
Co-teaching (Han et al., 2018b) 37.96 33.43 28.04 25.60 23.97
MentorNet (Jiang et al., 2018) 38.91 34.23 31.89 27.53 24.15
DivideMix (Li et al., 2020) 77.07 76.33 70.80 57.78 58.61
SSR (Feng et al., 2021) 78.84 78.60 76.95 74.98 72.83
LSL (Kim et al., 2024) 80.94 79.90 78.60 78.08 77.95
InstanceGM (Garg et al., 2023) 79.69 79.21 78.47 77.49 77.19
Ours 80.97 80.42 79.68 79.39 78.74
Table 6: Test accuracy (%) on CIFAR-100 with instance-dependent noise. Based on InstanceGM, ours achieves significant improvement.

Synthetic datasets. Tables 5 and 6 compare our method with previous methods on the manually corrupted CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) under different noise rates and noise types. Our method outperforms DivideMix at every noise rate and achieves state-of-the-art results in almost all settings. The gain is largest under the high noise rate of 90%, where our method reaches 89.2% and 45.7% test accuracy on CIFAR-10 and CIFAR-100, respectively, exceeding DivideMix (Li et al., 2020) by 13.8% and 14.7%. This indicates that our method can identify cleaner samples even under an extreme noise rate. Under asymmetric noise, our method improves on DivideMix by 3.0% and even surpasses MOIT+ (Ortego et al., 2021) by 1.9%. Our method is robust to instance-dependent noise as well: built on InstanceGM (Garg et al., 2023), it achieves better test accuracy than the state-of-the-art method LSL (Kim et al., 2024), with a gap of 1.08% at a noise rate of 40% (79.68% vs. 78.60%). These promising results demonstrate that CLIP can be easily adapted to help select clean samples. Equipped with our noise-aware balanced margin adaptive loss, the classifier learning is more robust and less biased.

4.3 Ablation Study

Sampling strategy. We explore the effectiveness of our sampling strategy in this ablation analysis. Table 7 compares the accuracy of different sampling strategies under the same experimental setup, with the classifier trained with the focal loss. First, we compare with GMM, the commonly used method for selecting clean samples (Arazo et al., 2019; Li et al., 2020). When the classifier is re-trained on the clean samples divided by GMM, the top-1 accuracy on WebVision (Li et al., 2017) increases only marginally over the pretrained DivideMix (Li et al., 2020) model (77.36% vs. 77.32%). This indicates that the deep neural network has memorized some small-loss noisy samples. These noisy samples cannot be identified by GMM and still hinder the learning of a robust classifier, so the performance is hard to boost further. Second, our prediction confidence strategy significantly improves the classification accuracy, exceeding the baseline by 1.00% top-1 accuracy on WebVision.

Sampling Strategy WebVision ILSVRC12
top-1 top-5 top-1 top-5
DivideMix (pretrained) (Li et al., 2020) 77.32 91.64 75.20 90.84
Small Loss (GMM) 77.36 91.08 74.32 92.08
Prediction Confidence 78.32 91.48 75.28 92.92
Prompt Consistency 77.36 91.24 74.16 92.04
Table 7: Comparison between different sampling strategies. We report top-1 (top-5) accuracy (%) on WebVision and ImageNet.

The results suggest that CLIP (Radford et al., 2021) can help detect the memorized noisy samples and select cleaner training samples, thanks to its powerful zero-shot prediction capability and the external knowledge from large-scale pretraining data. Learning with cleaner samples contributes to a better decision boundary. Third, we inject prior knowledge into the prompt with the expectation that this strategy can identify some out-of-domain data. We find that our prompt consistency strategy can help detect noisy samples and select clean ones, as shown in Figure 3, but its overall performance is only comparable to GMM. As discussed in Section 3.2, we suspect that it requires more carefully designed prompt templates, which we leave for future work.

Cross-entropy loss vs. focal loss. The straightforward way to address image classification is to train a classifier with the vanilla cross-entropy loss. However, its performance is unsatisfactory in our experiments. We conduct the experiments on the clean samples selected by CLIP (Radford et al., 2021) and report the results in Table 8. We make three observations. (1) After the selection of clean samples, training with the cross-entropy loss further boosts the top-1 accuracy on the WebVision validation set (+0.44%) compared to the model pretrained with DivideMix (Li et al., 2020). This confirms the effectiveness of our sampling strategy. (2) In contrast, the top-1 accuracy on ImageNet drops by 0.40% (75.20% to 74.80%). As discussed in Section 3.3, we hypothesize that this is because the selected samples contain many clean samples with easy patterns. Cross-entropy loss treats each sample equally, so DNNs focus on the simple samples and fit the training data well, which can result in poor performance when transferred to other datasets. (3) Focal loss (Lin et al., 2017) shows significant improvement over cross-entropy loss on both WebVision and ImageNet, with increases of 0.56% (1.04%) and 0.48% (1.28%) in top-1 (top-5) accuracy, respectively. Hard samples play an important role in determining an accurate decision boundary. By assigning larger weights to hard samples, focal loss balances the contributions of easy and hard samples and prevents the easy samples from overwhelming the loss. Ultimately, it facilitates the optimization of the classifier and improves the prediction performance.

Loss function WebVision ILSVRC12
top-1 top-5 top-1 top-5
DivideMix (pretrained) (Li et al., 2020) 77.32 91.64 75.20 90.84
Cross-Entropy 77.76 90.44 74.80 91.64
Focal Loss 78.32 91.48 75.28 92.92
w/o Focal Loss 78.20 91.04 74.92 92.00
w/o balanced margin ($t=0$) 78.68 91.72 75.68 93.00
w/o noise-aware margin ($\delta=0$) 78.76 91.92 75.76 92.92
Ours 79.08 91.96 76.04 93.12
Table 8: Ablation analysis on the loss function and the contribution of each component. We report top-1 and top-5 test accuracy (%).
Figure 3: Examples of the images selected or filtered by the prompt consistency strategy. Red denotes that the annotated noisy label is wrong, and green denotes that it is consistent with the true label.
Figure 4: (a) Ablation study on the effect of the selection threshold. We plot the number of training samples and top-1 accuracy (%) on WebVision and ImageNet with different thresholds. (b) Ablation study on the effect of the noise-aware margin. We report top-1 and top-5 test accuracy (%) on both WebVision and ImageNet.

Ablation study on noise-aware balanced margin adaptive loss. We investigate the contribution of each component of our proposed loss function. For a fair comparison, we remove one component at a time and examine the effect, keeping all other configurations the same. The evaluation results are presented in Table 8. Removing the focal loss hurts the performance. Both the noise-aware margin and the balanced margin improve the performance. Measured by top-1 accuracy, the former outperforms the focal loss baseline by 0.36% on WebVision (Li et al., 2017) and 0.40% on ImageNet (Deng et al., 2009), while the latter obtains gains of 0.44% and 0.48%. These observations support the view that the CLIP (Radford et al., 2021) model can introduce selection bias. On the one hand, CLIP can be overconfident in certain classes, so some noisy samples remain even after sample selection. On the other hand, the selected samples often exhibit a class-imbalanced distribution, especially when asymmetric noise exists. Although sample selection brings cleaner data, the impact of data imbalance is amplified. Our margin adaptive loss addresses these issues in a unified framework: the noise-aware margin mitigates the overconfidence effect, and the balanced margin relieves the influence of the long-tailed distribution. The joint effect of these two margins fosters the learning of the classifier and further improves the performance.

Figure 5: Confidence distribution comparison between GMM and CLIP on WebVision. We divide confidence into 10 intervals, each with a range of 0.1, and count the image amount for each interval.
Figure 6: Confidence distribution comparison between GMM and CLIP on Clothing1M. We divide confidence into 10 intervals, each with a range of 0.1, and count the image amount for each interval.

Analysis of transition matrix. We calculate the error between our estimated transition matrix and the ground-truth one. Under symmetric noise with a rate of 0.5, the mean absolute error is 0.02 on CIFAR-10 and 0.006 on CIFAR-100.
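The error measure can be sketched as below. Two points are assumptions rather than details confirmed by the text: the ground-truth matrix follows the convention where a corrupted label is resampled uniformly over all classes, and the error is averaged over all entries of the matrix. The function names are hypothetical.

```python
def symmetric_transition(num_classes, noise_rate):
    """Ground-truth matrix when a noise_rate fraction of labels is resampled
    uniformly over all classes (assumed convention): row i gives P(noisy = j | clean = i)."""
    off = noise_rate / num_classes
    return [[off + (1.0 - noise_rate if i == j else 0.0) for j in range(num_classes)]
            for i in range(num_classes)]

def mean_absolute_error(estimated, truth):
    """Average of |estimated - truth| over all matrix entries."""
    diffs = [abs(e - g) for row_e, row_g in zip(estimated, truth)
             for e, g in zip(row_e, row_g)]
    return sum(diffs) / len(diffs)
```

Under this convention, a noise rate of 0.5 on 10 classes yields 0.55 on the diagonal and 0.05 elsewhere, and the estimated matrix from Eq. 3 would be passed as `estimated`.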

Effect of selection threshold $\rho$ in sample selection. In Figure 4(a), we show the number of selected training samples and the top-1 accuracy on both WebVision (Li et al., 2017) and ImageNet (Deng et al., 2009) under varied selection thresholds on confidence. As the threshold increases, the number of training samples decreases rapidly. The top-1 accuracy on WebVision also drops, especially when we set the threshold to 0.9. This is because the training samples remaining after selection are insufficient to learn a good decision boundary, especially for classes with few data points. Even though sample selection brings cleaner data, we cannot ignore the risk of a reduced amount of training data. Therefore, we choose a threshold that leaves enough instances for training the network. We notice that the performance on ImageNet is less affected. We conjecture that this is due to the robust representation learned from diverse web data.

Figure 7: Examples of the selected or filtered images by CLIP on WebVision. The first two rows show the selected clean samples. The last row shows those filtered noisy samples. The confidences from GMM and CLIP are presented at the bottom of the image. Here, red denotes the annotated noisy label is wrong and green represents the annotated noisy label is consistent with its true label.
Figure 8: More examples of the selected or filtered images by CLIP. The first two rows are from WebVision and the last row is from Clothing1M. The confidence from GMM and CLIP is presented at the bottom of the image, respectively. Here, red denotes the annotated noisy label is wrong and green represents the annotated noisy label is consistent with its true label.

Effect of $\delta$ in noise-aware margin regularization. To understand the influence of the noise-aware margin, we vary $\delta$ from 0.0 to 1.0 and keep the other hyperparameters fixed. As shown in Figure 4(b), a small $\delta$ (e.g., $\delta=0.1$) leads to higher top-1 and top-5 accuracy on both WebVision (Li et al., 2017) and ImageNet (Deng et al., 2009). The performance remains relatively stable when $\delta\leq 0.5$. However, when $\delta$ is large (e.g., $\delta=1.0$), the accuracy drops considerably: the top-1 accuracy on WebVision is only 31.0%. These results support our motivation that the noise-aware margin helps prevent the overconfidence effect, and this regularization provides robustness to label noise. Nevertheless, too large a $\delta$ hinders the optimization of the classifier and results in poor performance. Empirically, a small $\delta$ is recommended.

Training time analysis. We analyze the training time of our proposed method on WebVision to understand its efficiency. The experiment is conducted on a single NVIDIA 3090 GPU. For CLIP-based sample selection, it takes around 73.9 seconds with a batch size of 1000. After the pretraining stage, our model is trained for just under 40 minutes.

4.4 Visualization

Figures 5 and 6 compare the confidence distributions of GMM (Li et al., 2020) and CLIP (Radford et al., 2021) on WebVision and Clothing1M. The confidence from GMM is concentrated near 0 or 1 after training, while the distribution of CLIP is much more spread out. This indicates that some noisy samples have been memorized by DNNs, which makes the GMM confidence less discriminative for distinguishing clean and noisy samples. In Figure 7, we show several selected or filtered training images from the WebVision dataset and compare the selection between CLIP and GMM. CLIP can identify some clean samples that GMM regards as noisy. Meanwhile, based on the predicted confidences on noisy labels, CLIP can also filter out some noisy samples that GMM fails to find. These results support our assumption that some noisy samples are memorized by DNNs and cannot be filtered by the small-loss criterion. These samples often share similar visual patterns. For instance, the pillow image (Row 3, Column 2 in Figure 7) with a repeated bird pattern is mistakenly identified as an actual bird by GMM. In contrast, CLIP's prior knowledge helps mitigate this issue. Benefiting from pretraining on large-scale web data, the CLIP model can help detect more noisy samples with its powerful zero-shot capability. More examples are presented in Figure 8.
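The binning behind Figures 5 and 6 (10 intervals of width 0.1) can be sketched as follows. The function name `confidence_histogram` is hypothetical, and the input is assumed to be a list of confidences in [0, 1] from either GMM or CLIP.

```python
def confidence_histogram(confidences, num_bins=10):
    """Count how many confidences fall into each equal-width interval of [0, 1]."""
    counts = [0] * num_bins
    for c in confidences:
        # Clamp so that a confidence of exactly 1.0 falls in the last interval.
        idx = min(int(c * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts
```

A distribution concentrated near 0 or 1 puts most of the mass in the first and last entries of the returned list, whereas a more scattered distribution fills the intermediate entries.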

5 Conclusion

In this paper, we propose to leverage the vision-language pretrained surrogate model CLIP to help select clean samples when dealing with noisy labels. With its powerful zero-shot inference capability, CLIP can identify some noisy samples memorized by deep neural networks, based on the predicted confidences on noisy labels. Furthermore, our noise-aware balanced margin adaptive loss facilitates the learning of the classifier and mitigates the selection bias introduced by CLIP. The significant improvement on multiple noisy datasets verifies the effectiveness of our method, which does not involve CLIP at the inference stage.

Conflict of interest

The authors declare that they have no conflict of interest.

Data availability

References

  • Arazo et al. (2019) Arazo E, Ortego D, Albert P, O’Connor N, McGuinness K (2019) Unsupervised label noise modeling and loss correction. In: International conference on machine learning, PMLR, pp 312–321
  • Arpit et al. (2017) Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS, Maharaj T, Fischer A, Courville A, Bengio Y, et al. (2017) A closer look at memorization in deep networks. In: International conference on machine learning, PMLR, pp 233–242
  • Bai and Liu (2021) Bai Y, Liu T (2021) Me-momentum: Extracting hard confident examples from noisily labeled data. In: ICCV
  • Carion et al. (2020) Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, Springer, pp 213–229
  • Chen et al. (2019) Chen P, Liao BB, Chen G, Zhang S (2019) Understanding and utilizing deep neural networks trained with noisy labels. In: International Conference on Machine Learning, PMLR, pp 1062–1070
  • Cheng et al. (2022a) Cheng D, Liu T, Ning Y, Wang N, Han B, Niu G, Gao X, Sugiyama M (2022a) Instance-dependent label-noise learning with manifold-regularized transition matrix estimation. In: CVPR
  • Cheng et al. (2022b) Cheng D, Ning Y, Wang N, Gao X, Yang H, Du Y, Han B, Liu T (2022b) Class-dependent label-noise learning with cycle-consistency regularization. In: Oh AH, Agarwal A, Belgrave D, Cho K (eds) Advances in Neural Information Processing Systems, URL https://openreview.net/forum?id=IvnoGKQuXi
  • Deng et al. (2009) Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
  • Dosovitskiy et al. (2021) Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=YicbFdNTTy
  • Feng et al. (2021) Feng C, Tzimiropoulos G, Patras I (2021) Ssr: An efficient and robust framework for learning with unknown label noise. arXiv preprint arXiv:2111.11288
  • Gadre et al. (2024) Gadre SY, Ilharco G, Fang A, Hayase J, Smyrnis G, Nguyen T, Marten R, Wortsman M, Ghosh D, Zhang J, et al. (2024) Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems 36
  • Garg et al. (2023) Garg A, Nguyen C, Felix R, Do TT, Carneiro G (2023) Instance-dependent noisy label learning via graphical modelling. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2288–2298
  • Han et al. (2018a) Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I, Sugiyama M (2018a) Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 31, URL https://proceedings.neurips.cc/paper/2018/file/a19744e268754fb0148b017647355b7b-Paper.pdf
  • Han et al. (2018b) Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I, Sugiyama M (2018b) Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31
  • Han et al. (2019) Han J, Luo P, Wang X (2019) Deep self-learning from noisy labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5138–5147
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
  • Hendrycks et al. (2018) Hendrycks D, Mazeika M, Wilson D, Gimpel K (2018) Using trusted data to train deep networks on labels corrupted by severe noise. Advances in neural information processing systems 31
  • Huang and Chong (2023) Huang X, Chong KFE (2023) Genkl: An iterative framework for resolving label ambiguity and label non-conformity in web images via a new generalized kl divergence. International Journal of Computer Vision pp 1–25
  • Huang et al. (2023) Huang Z, Zhang J, Shan H (2023) Twin contrastive learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11661–11670
  • Iscen et al. (2022) Iscen A, Valmadre J, Arnab A, Schmid C (2022) Learning with neighbor consistency for noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4672–4681
  • Jiang et al. (2018) Jiang L, Zhou Z, Leung T, Li LJ, Fei-Fei L (2018) Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: International Conference on Machine Learning, PMLR, pp 2304–2313
  • Jiang et al. (2020) Jiang L, Huang D, Liu M, Yang W (2020) Beyond synthetic noise: Deep learning on controlled noisy labels. In: International conference on machine learning, PMLR, pp 4804–4815
  • Kim et al. (2024) Kim Nr, Lee JS, Lee JH (2024) Learning with structural labels for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 27610–27620
  • Kolesnikov et al. (2020) Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2020) Big transfer (bit): General visual representation learning. In: European conference on computer vision, Springer, pp 491–507
  • Krizhevsky et al. (2009) Krizhevsky A, Hinton G, et al. (2009) Learning multiple layers of features from tiny images
  • Li et al. (2020) Li J, Socher R, Hoi SC (2020) Dividemix: Learning with noisy labels as semi-supervised learning. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=HJgExaVtwr
  • Li et al. (2022a) Li S, Xia X, Ge S, Liu T (2022a) Selective-supervised contrastive learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 316–325
  • Li et al. (2022b) Li S, Xia X, Zhang H, Zhan Y, Ge S, Liu T (2022b) Estimating noise transition matrix with label correlations for noisy multi-label learning. In: Oh AH, Agarwal A, Belgrave D, Cho K (eds) Advances in Neural Information Processing Systems, URL https://openreview.net/forum?id=GwXrGy_vc8m
  • Li et al. (2017) Li W, Wang L, Li W, Agustsson E, Van Gool L (2017) Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862
  • Liang et al. (2023) Liang C, Yang Z, Zhu L, Yang Y (2023) Co-learning meets stitch-up for noisy multi-label visual recognition. IEEE Transactions on Image Processing 32:2508–2519, DOI 10.1109/TIP.2023.3270103
  • Lin et al. (2017) Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
  • Liu et al. (2020a) Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2020a) Deep learning for generic object detection: A survey. International journal of computer vision 128:261–318
  • Liu et al. (2020b) Liu S, Niles-Weed J, Razavian N, Fernandez-Granda C (2020b) Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems 33:20331–20342
  • Liu and Tao (2015) Liu T, Tao D (2015) Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38(3):447–461
  • Liu et al. (2021) Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
  • Ma et al. (2022) Ma F, Zhu L, Yang Y (2022) Weakly supervised moment localization with decoupled consistent concept prediction. International Journal of Computer Vision 130(5):1244–1258
  • Malach and Shalev-Shwartz (2017) Malach E, Shalev-Shwartz S (2017) Decoupling "when to update" from "how to update". Advances in neural information processing systems 30
  • Menon et al. (2021) Menon AK, Jayasumana S, Rawat AS, Jain H, Veit A, Kumar S (2021) Long-tail learning via logit adjustment. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=37nvvqkCo5
  • Ortego et al. (2021) Ortego D, Arazo E, Albert P, O’Connor NE, McGuinness K (2021) Multi-objective interpolation training for robustness to label noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6606–6615
  • Patrini et al. (2017) Patrini G, Rozza A, Krishna Menon A, Nock R, Qu L (2017) Making deep neural networks robust to label noise: A loss correction approach. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1944–1952
  • Radford et al. (2021) Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al. (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, PMLR, pp 8748–8763
  • Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115:211–252
  • Schuhmann et al. (2021) Schuhmann C, Vencu R, Beaumont R, Kaczmarczyk R, Mullis C, Katta A, Coombes T, Jitsev J, Komatsuzaki A (2021) Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114
  • Shu et al. (2019) Shu J, Xie Q, Yi L, Zhao Q, Zhou S, Xu Z, Meng D (2019) Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems 32
  • Szegedy et al. (2017) Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence
  • Wang et al. (2022) Wang X, Wu Z, Lian L, Yu SX (2022) Debiased learning from naturally imbalanced pseudo-labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 14647–14657
  • Wei et al. (2020) Wei H, Feng L, Chen X, An B (2020) Combating noisy labels by agreement: A joint training method with co-regularization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13726–13735
  • Wei et al. (2022) Wei J, Zhu Z, Cheng H, Liu T, Niu G, Liu Y (2022) Learning with noisy labels revisited: A study using real-world human annotations. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=TBWA6PLJZQm
  • Wu et al. (2021) Wu ZF, Wei T, Jiang J, Mao C, Tang M, Li YF (2021) Ngc: a unified framework for learning with open-world noisy data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 62–71
  • Xia et al. (2020) Xia X, Liu T, Han B, Wang N, Gong M, Liu H, Niu G, Tao D, Sugiyama M (2020) Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems 33:7597–7610
  • Xia et al. (2021) Xia X, Liu T, Han B, Gong M, Yu J, Niu G, Sugiyama M (2021) Sample selection with uncertainty of losses for learning with noisy labels. URL https://openreview.net/forum?id=zGsRcuoR5-0
  • Xiao et al. (2015) Xiao T, Xia T, Yang Y, Huang C, Wang X (2015) Learning from massive noisy labeled data for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2691–2699
  • Xu et al. (2021) Xu Y, Zhu L, Jiang L, Yang Y (2021) Faster meta update strategy for noise-robust deep learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 144–153
  • Yang et al. (2021) Yang Y, Zhuang Y, Pan Y (2021) Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering 22(12):1551–1558
  • Yao et al. (2021a) Yao Y, Liu T, Gong M, Han B, Niu G, Zhang K (2021a) Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems 34:4409–4420
  • Yao et al. (2021b) Yao Y, Sun Z, Zhang C, Shen F, Wu Q, Zhang J, Tang Z (2021b) Jo-src: A contrastive approach for combating noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5192–5201
  • Yi and Wu (2019) Yi K, Wu J (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7017–7025
  • Yu et al. (2019) Yu X, Han B, Yao J, Niu G, Tsang I, Sugiyama M (2019) How does disagreement help generalization against label corruption? In: International Conference on Machine Learning, PMLR, pp 7164–7173
  • Zhang et al. (2021) Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64(3):107–115
  • Zhang et al. (2018) Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2018) mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=r1Ddp1-Rb
  • Zhang and Pfister (2021) Zhang Z, Pfister T (2021) Learning fast sample re-weighting without reward data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 725–734
  • Zhao et al. (2022) Zhao S, Zhu L, Wang X, Yang Y (2022) Centerclip: Token clustering for efficient text-video retrieval. In: SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11–15, 2022, Madrid, Spain
  • Zhu et al. (2022) Zhu Z, Dong Z, Liu Y (2022) Detecting corrupted labels without training a model to predict. In: International Conference on Machine Learning, PMLR, pp 27412–27427