Linchao Zhu¹ · Chao Liang¹ · Humphrey Shi²,³ · Yi Yang¹

¹ ReLER Lab, CCAI, Zhejiang University
² SHI Labs @ UIUC & Oregon
³ Picsart AI Research (PAIR)

Combating Label Noise With A General Surrogate Model For Sample Selection

Chao Liang Linchao Zhu Humphrey Shi Yi Yang


Abstract

Modern deep learning systems are data-hungry. Learning with web data is one feasible solution, but it inevitably introduces label noise, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small-loss criterion, where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, as it is free from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples through its text-image alignment ability. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without involving CLIP during the inference stage.

Journal: Preprint

1 Introduction

With the emergence of deep neural networks (DNNs) and the boost of computation capability, current visual intelligence systems can excel in several tasks, e.g., image classification (Russakovsky et al., 2015; He et al., 2016; Dosovitskiy et al., 2021; Liu et al., 2021), object detection (Carion et al., 2020; Liu et al., 2020a), video understanding (Zhao et al., 2022; Ma et al., 2022), even surpassing human-level performance. These remarkable breakthroughs are closely related to the collection of high-quality annotated data. However, the labeling process is labor-intensive and expensive. For some specific domains like insect classification, it is much more difficult to annotate the data without expert knowledge.

Some researchers resort to a compromise and make use of large-scale, cheap, web-annotated data (Xiao et al., 2015; Kolesnikov et al., 2020). This inevitably introduces label noise. Supervised learning often assumes that the training and test data are independently sampled from the same distribution. The existence of noisy labels results in a discrepancy between the training and test distributions. As a consequence, learning with noisy labels leads to poor generalization on clean unseen test data.

Figure 1: Some small-loss noisy samples that share similar visual patterns are memorized by DNNs. They are misidentified as clean samples by the small-loss criterion. With the help of the powerful open-vocabulary vision-language model CLIP, these samples can potentially be filtered out. With cleaner training samples, the classification performance is further boosted.

Many sophisticated algorithms (Han et al., 2018a; Li et al., 2020; Wei et al., 2020; Liu et al., 2020b; Ortego et al., 2021; Xia et al., 2021; Yao et al., 2021b; Zhu et al., 2022; Li et al., 2022a) have been proposed to alleviate the negative effect of noisy labels. One promising line of work is sample selection (Han et al., 2018a; Wei et al., 2020; Li et al., 2020; Yao et al., 2021b; Zhu et al., 2022). The main idea is to separate clean samples from all the training samples based on some rules or criteria. The cleaner samples help learn a less biased classifier. Previous methods (Han et al., 2018a; Wei et al., 2020; Li et al., 2020, 2022a) mostly adopt the small-loss strategy. Clean and simple samples are assumed to be fitted first by DNNs, while noisy samples are memorized later in the training process (Arpit et al., 2017). This strategy relies heavily on the learning dynamics of each data instance and suffers from undesirable learning bias in the training dataset. As shown in Figure 1, a proportion of noisy samples are identified as clean samples with small losses because they share similar corrupted visual patterns that occur frequently in the learning process. After sample selection, these out-of-distribution noisy samples remain mixed with the selected in-distribution clean samples. Learning with these noisy samples can mislead the classifier and negatively affect the decision boundary. To get rid of the memorization effect, a training-free surrogate model is a good choice for detecting noisy samples.

Recently, vision-language models pretrained on text-image pairs, especially CLIP (Radford et al., 2021), have shown promising zero-shot performance on downstream tasks. With its powerful zero-shot capability, CLIP can be easily adapted to score unseen objects without extra training. Given a text query, CLIP can flexibly be used to infer whether a data instance carries the correct label. Benefiting from pretraining on large-scale text-image web data, CLIP shows great robustness to distribution shift and strong domain generalization. CLIP can bring external knowledge to facilitate the selection of clean samples, which could potentially filter out noisy samples that have been memorized by DNNs.

In this paper, we leverage the off-the-shelf vision-language surrogate model CLIP (Radford et al., 2021) to detect noisy samples automatically, which has not been explored yet. First, in contrast to the learning-centric small-loss criterion, our CLIP-based selection strategy is training-free. This property avoids the learning bias brought by noisy supervision. CLIP scores each data instance through its text-image alignment ability. Combined with the prompt technique, each training sample can be evaluated by CLIP and assigned a surrogate confidence corresponding to its noisy label. Naturally, we regard samples with high confidence as clean ones. Noisy samples with corrupted visual patterns can be filtered out with the help of external knowledge, which further improves the learning of the classifier. Second, we propose a robust noise-aware balanced margin adaptive loss to regularize the selection bias brought by CLIP. On the one hand, CLIP remains biased towards certain classes, so it can be overconfident in them. On the other hand, existing methods often neglect a side effect of sample selection, namely that class imbalance might occur. Our noise-aware balanced margin adaptive loss modifies the logits directly, encouraging a relatively large margin for overconfident and dominant classes. This unified margin mechanism can mitigate the effect of noisy labels and imbalanced distribution for robust training.

We evaluate our method on both real-world and synthetic noisy datasets without CLIP involved during the inference stage. The significant improvement on several noisy benchmarks confirms the effectiveness of our proposed method.

Overall, our contributions can be summarized as follows:

• We are the first to leverage the off-the-shelf vision-language surrogate model CLIP to help select clean samples automatically. This training-free method avoids the learning bias of the small-loss strategy, which improves the learning of a more robust classifier and alleviates the memorization effect.

• We propose a noise-aware balanced margin adaptive loss to reduce the selection bias introduced by CLIP, providing much more robustness to label noise.

• We demonstrate that our proposed method achieves significant improvement on both real-world and synthetic noisy datasets without involving CLIP during the inference stage.

2 Related Works

Numerous approaches have been proposed to combat label noise in recent works (Li et al., 2020; Bai and Liu, 2021; Zhang and Pfister, 2021; Zhu et al., 2022; Huang and Chong, 2023). The common solutions can be typically categorized into three types: sample selection, sample reweighting, and label correction.

Sample selection focuses on identifying the clean samples from all the noisy training samples. The clean samples are then used to train the deep neural network. The key problem is to design a good criterion. There are several strategies (Han et al., 2018a; Arazo et al., 2019; Wei et al., 2020; Yao et al., 2021b; Ortego et al., 2021; Zhu et al., 2022) to detect noisy labels. Among them, the small-loss trick (Han et al., 2018a; Arazo et al., 2019; Wei et al., 2020; Yao et al., 2021b) plays an important role. Deep neural networks tend to learn clean and simple patterns faster (Arpit et al., 2017). Co-teaching (Han et al., 2018a) selects a pre-defined proportion of samples with small cross-entropy losses and discards the remaining. Instead, JoCoR (Wei et al., 2020) selects samples with small joint losses composed of cross-entropy losses and co-regularization losses. However, JoSRC (Yao et al., 2021b) argues that prior methods neglect different noise ratios in different mini-batches. It exploits the Jensen-Shannon (JS) divergence which serves as the sample cleanness, to separate clean samples in a global manner. Recently, several works (Ortego et al., 2021; Zhu et al., 2022) try to filter noisy samples out by leveraging neighborhood information, especially via K-Nearest-Neighbors (KNN) algorithm. MOIT (Ortego et al., 2021) selects the confident examples based on the representation similarity between the neighbors. Zhu et al. (Zhu et al., 2022) employ KNN to re-label each sample and detect noisy labels by two simple criteria: local majority voting and global score-based ranking.
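The small-loss trick described above can be illustrated with a generic sketch: keep a pre-defined fraction of samples with the smallest per-sample loss and discard the rest. This is not the full two-network procedure of Co-teaching; the function name `small_loss_selection` and the loss values are hypothetical and only serve the illustration.

```python
def small_loss_selection(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    n_keep = int(keep_ratio * len(losses))
    # Rank sample indices from smallest to largest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])

# Hypothetical per-sample cross-entropy losses for six training samples.
losses = [0.05, 2.30, 0.40, 1.10, 0.02, 0.90]
selected = small_loss_selection(losses, keep_ratio=0.5)
```

Because the ranking depends only on the losses produced by the network being trained, a noisy sample that has already been memorized (and therefore has a small loss) passes this filter, which is the weakness the paper targets.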

Sample reweighting is a traditional and effective method to resist the memorization effect of noisy labels, which encourages larger weights for clean samples and smaller weights for noisy ones (Shu et al., 2019; Zhang and Pfister, 2021; Xu et al., 2021). Meta-Weight-Net (Shu et al., 2019) learns to reweight each sample following the meta-learning paradigm. However, this method requires a small, unbiased, and clean validation set, which might be difficult or expensive to collect in practice. To overcome this limitation, Zhang and Pfister (2021) propose to build proxy clean data from the training history. They maintain a memory to store the past losses and use the changes between the model and meta-model at different steps as the selection criterion.

Label correction aims to assign correct pseudo labels to those samples with wrong labels. The most popular way is to use the prediction from the model (Arazo et al., 2019; Liang et al., 2023). In general, the generated pseudo label is the convex combination between the original noisy label and the current prediction of the model (Arazo et al., 2019). Some works utilize the prediction from class prototypes (Han et al., 2019) or get hard labels based on threshold (Ortego et al., 2021).

Others combine several techniques to prevent overfitting to noisy labels, e.g., mix-up (Zhang et al., 2018), label smoothing (Ortego et al., 2021), consistency regularization (Iscen et al., 2022; Cheng et al., 2022b), semi-supervised learning (Li et al., 2020), and contrastive learning (Ortego et al., 2021; Wu et al., 2021). DivideMix (Li et al., 2020) first divides the training samples into a labeled and an unlabeled set by fitting a Gaussian Mixture Model (GMM) on the loss distribution, and then performs semi-supervised learning. NCR (Iscen et al., 2022) proposes a consistency regularization term that encourages the output logits of a sample to be similar to those of its neighbors in the feature space.

LAION (Schuhmann et al., 2021) and DataComp (Gadre et al., 2024) employ CLIP to filter image-text pairs during the training of vision-language models, with the primary objective of enhancing data quality by eliminating noisy or irrelevant pairs. These approaches leverage CLIP’s multi-modal understanding to ensure that only the most semantically aligned image-text pairs contribute to the training process. While our proposed method shares the fundamental principle of data refinement, it introduces a novel application of CLIP in a different context: filtering noisy labels in classification tasks.

Unlike the prior works, which focus on the relational alignment between image and text pairs in multi-modal datasets, our method specifically addresses the issue of mislabeled data within a purely visual classification setting. Here, the noise is label-centric, where an image is incorrectly labeled, leading to inaccuracies in the training dataset. By applying CLIP to evaluate the consistency between images and their associated labels, we can effectively identify and remove incorrect labels, thereby improving the accuracy and reliability of the labeled dataset. This novel use of CLIP for label noise filtering in classification represents a significant departure from its traditional role in vision-language model training, marking a new frontier in its application for enhancing dataset quality.

3 Method

3.1 Preliminary

Problem formulation. In the image classification problem, we are given a training dataset $\mathcal{D}=\{(x_{i},y_{i})|i=1,2,3,...,N\}$ consisting of $N$ sample pairs, where $x_{i}$ is an image and $y_{i}\in\{1,2,3,...,C\}$ is the associated label. $C$ is the number of classes. In our task, an unknown number of labels are noisy, i.e., $y_{i}\neq\hat{y}_{i}$, where $y_{i}$ is the noisy label and $\hat{y}_{i}$ is the true class label. Note that $y_{i}$ is the correct label if and only if $y_{i}=\hat{y}_{i}$. Our goal is to train a deep neural network $\mathcal{F}_{\theta}$ on such a noisily labeled training dataset so that it generalizes well to clean unseen test data. The network $\mathcal{F}_{\theta}$ is composed of three components: (1) a feature encoder $f$ that maps an image $x_{i}$ into a high-dimensional representation $v_{i}=f(x_{i})$; (2) a classifier $h$ that takes $v_{i}$ as input and outputs the logit $z_{i}=h(v_{i})$; (3) a softmax layer $\sigma$ that transforms the logit $z_{i}$ into the probability $p_{i}$.

Vision-language surrogate model. Recently, models pretrained on large-scale text-image supervision, e.g., CLIP (Radford et al., 2021), have become popular. In the training stage, CLIP pretrains the image encoder and text encoder with a contrastive loss. It pulls the image feature embedding and the paired text feature embedding closer in the shared embedding space by maximizing their cosine similarity. During inference, CLIP can predict the most probable pair given an image and a set of prompt-based texts like “a photo of a {CLASS}”, where {CLASS} is replaced by the class name. This framework naturally endows CLIP with the capability of open-vocabulary zero-shot classification, and it can be adapted to several downstream tasks. Using this text-image alignment ability, we leverage the CLIP-like open-vocabulary vision-language surrogate model to select clean samples based on the prediction confidences. Note that the frozen CLIP is only used as a scorer in the training stage.

Overview. First, we pretrain the feature encoder $f$ to learn the robust representation with noisy labels. Then, we only keep the backbone $f$ and re-train the classifier $h$ . We apply the vision-language pretrained model CLIP (Radford et al., 2021) to help select clean samples automatically. In order to mitigate the selection bias introduced by CLIP, we design a robust noise-aware balanced margin adaptive loss to regularize the effect of overconfidence and class imbalance. The overall framework is presented in Figure 2. We present the overall training algorithm in Algorithm 1.

3.2 Selecting clean samples with CLIP

Learning with noisy labels suffers from the adverse effect that deep neural networks can easily memorize noisy samples (Zhang et al., 2021). One effective solution to this problem is sample selection. Most prior research (Han et al., 2018a; Wei et al., 2020; Li et al., 2020, 2022a) relies on the small-loss criterion, based on the observation that deep networks fit clean samples first and then gradually fit noisy ones (Arpit et al., 2017). This strategy is a learning-centric selection metric obtained by fitting the data distribution. It can be affected by the learning bias in the training dataset, where noisy samples with repetitive corrupted visual patterns are identified as clean samples. As a result, the deep network accumulates prediction errors. To avoid this confirmation bias, we resort to a training-free surrogate model. We leverage the off-the-shelf pretrained surrogate model CLIP (Radford et al., 2021) to help detect clean samples automatically. CLIP offers several advantages in learning with noisy labels: (1) its selection strategy is training-free and does not rely on the memorization effect; (2) it transfers flexibly to downstream tasks through its text-image alignment ability, without extra training; (3) customized prompt engineering can inject prior knowledge, which might help filter out some noisy labels.

We propose to select clean samples based on the predictions from CLIP. Given an image $x$, the image feature $V$ is extracted by the image encoder and the text features $\{T_{1},...,T_{C}\}$ are generated by the text encoder from the prompt template $\mathcal{T}$, e.g., “a photo of a {CLASS}”. Then, the CLIP prediction for label $y=i$ is computed as follows:

$$q(y=i|x)=\frac{\exp(\cos(V,T_{i})/\tau)}{\sum_{j=1}^{C}\exp(\cos(V,T_{j})/\tau)}, \tag{1}$$

where $\cos(\cdot,\cdot)$ denotes the cosine similarity and $\tau$ is the temperature factor. We use $\tau=0.01$ in the experiments. Then, we consider two types of selection criteria.
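As a minimal sketch of Eq. 1, the snippet below converts the cosine similarities between one image feature and the class text features into probabilities using the temperature $\tau$. The feature vectors are made-up toy values rather than real CLIP embeddings, and the helper names `cosine` and `clip_prediction` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_prediction(image_feat, text_feats, tau=0.01):
    """Eq. 1: softmax over cos(V, T_j) / tau for j = 1..C."""
    logits = [cosine(image_feat, t) / tau for t in text_feats]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for CLIP features (assumed values: 3 classes, 2-D embeddings).
V = [1.0, 0.1]
T = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
q = clip_prediction(V, T)
```

With a temperature as small as $\tau=0.01$, even modest differences in cosine similarity yield a sharply peaked distribution, so in this toy case almost all probability mass falls on the first class.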

Prediction Confidence: Naturally, we regard the prediction corresponding to the noisy label from the CLIP as the confidence of the sample and select those with high confidences. Specifically, given a sample $x_{i}$ with a label $y_{i}$ , it is judged as a clean sample if $q_{i}(y=y_{i}|x_{i})>\rho$ , where $\rho$ is a pre-defined threshold. This criterion is simple and effective.
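The prediction-confidence criterion can be sketched as a simple threshold test on the probability that CLIP assigns to the given (possibly noisy) label. The probabilities below are hypothetical values, the function name `select_clean` is illustrative, and labels are 0-indexed here although the paper numbers classes from 1.

```python
def select_clean(probs, noisy_labels, rho=0.5):
    """Return the indices i for which q_i(y = y_i | x_i) > rho."""
    return [i for i, (q, y) in enumerate(zip(probs, noisy_labels)) if q[y] > rho]

# Hypothetical CLIP predictions for four samples over three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.15, 0.75],
    [0.34, 0.33, 0.33],
]
noisy_labels = [0, 0, 2, 1]  # 0-indexed labels; sample 1 looks mislabeled
clean_idx = select_clean(probs, noisy_labels, rho=0.5)
```

Samples 0 and 2 are kept because CLIP agrees with their labels, while samples 1 and 3 are rejected because the confidence on the given label stays below the threshold.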

Prompt Consistency: Domain-specific knowledge can be injected into the prompt, which helps detect out-of-domain noisy samples. Noisy web images are collected by keyword search. However, class names can be ambiguous. For example, “stingray” can refer to a type of car or an animal. If the target is to classify animals, these car images are treated as out-of-domain data. It is difficult for the small-loss criterion to distinguish these noisy samples because the images share repetitive visual patterns, and models can easily memorize them. Prompts help specify the content of the images. The prediction for a clean sample should be consistent between two prompts whose only difference is the domain-specific context. For instance, we apply two prompt templates $\mathcal{T}_{1}$: “a photo of a {CLASS}” and $\mathcal{T}_{2}$: “a photo an animal {CLASS}” to get two predictions $q_{i}$ and $\tilde{q}_{i}$ for a given sample $x_{i}$. We utilize the Jensen-Shannon divergence to quantify the distance $d_{i}$ between the two probability predictions:

$$\begin{aligned} d_{i} &= D_{JS}(q_{i}\,\|\,\tilde{q}_{i}) \\ &= \frac{1}{2}D_{KL}\left(q_{i}\,\Big\|\,\frac{q_{i}+\tilde{q}_{i}}{2}\right)+\frac{1}{2}D_{KL}\left(\tilde{q}_{i}\,\Big\|\,\frac{q_{i}+\tilde{q}_{i}}{2}\right), \end{aligned} \tag{2}$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence. Intuitively, we treat samples with small JS divergence as clean samples, i.e., $d_{i}<\mu$ , where $\mu$ is a pre-defined threshold. This criterion allows us to make use of human knowledge to help detect noisy samples but it may need sophisticated design.
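Eq. 2 can be sketched in a few lines. The two prediction vectors stand for the outputs under the generic and the domain-specific prompt; their values and the threshold `mu = 0.05` are assumptions made for illustration, since the text does not state the value of $\mu$.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q); terms with p_k = 0 contribute zero."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js_divergence(p, q):
    """Eq. 2: average KL divergence of p and q to their midpoint."""
    mid = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Hypothetical predictions under the generic prompt and the domain prompt.
q_generic = [0.7, 0.2, 0.1]
q_domain_same = [0.7, 0.2, 0.1]   # prediction unchanged by the added context
q_domain_shift = [0.1, 0.2, 0.7]  # prediction changes once context is added

d_consistent = js_divergence(q_generic, q_domain_same)
d_inconsistent = js_divergence(q_generic, q_domain_shift)

mu = 0.05  # assumed threshold, not taken from the paper
is_clean = [d < mu for d in (d_consistent, d_inconsistent)]
```

With the natural logarithm, the JS divergence lies between 0 and $\log 2$, which gives a bounded scale on which to choose $\mu$.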

By introducing external knowledge from CLIP (Radford et al., 2021; Yang et al., 2021), noisy samples that have been memorized by DNNs can potentially be identified. The selected cleaner samples facilitate the learning of a more robust classifier and therefore improve the classification performance.

3.3 Noise-Aware Balanced Margin Adaptive Loss

CLIP (Radford et al., 2021) helps select clean samples; nevertheless, it can also introduce selection bias. On the one hand, CLIP is often biased towards some classes (Wang et al., 2022) and can provide overconfident scores for them. On the other hand, class imbalance arises after sample selection, which is often neglected by existing methods. To regularize the selection bias, we adopt a margin adaptive mechanism with two priors, which encourages the overconfident and dominant classes to have relatively large margins.

Transition matrix. The transition matrix can be used to reflect the class-level confidence of the model. Each element $M_{ij}$ in the transition matrix $M\in\mathbb{R}^{C\times C}$ represents the probability of being flipped to a label $j$ when given an instance with a label $i$ . Following GLC (Hendrycks et al., 2018), we estimate the class-dependent transition matrix by the average of the prediction $q(y=i|x)$ (Eq. 1) from the vision-language surrogate model CLIP:

$$M_{ij}=\frac{1}{N_{i}}\sum q(y=j|x,y=i), \tag{3}$$

where $N_{i}$ denotes the number of instances in class $i$ . Note that we estimate the transition matrix by using all the training samples. Addressing noisy labels with the transition matrix has been extensively studied in the literature (Hendrycks et al., 2018; Li et al., 2022b; Cheng et al., 2022a). They mostly use the transition matrix to refine the output probability directly. By contrast, we regard it as a margin penalty to prevent the overconfidence effect.
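One way to read Eq. 3 is that row $i$ of $M$ is the average CLIP prediction over all training samples whose given label is $i$. The sketch below follows that reading. The probabilities are toy values, the function name is hypothetical, and leaving the row of an empty class at zero is an assumption that the text does not address.

```python
def estimate_transition_matrix(probs, noisy_labels, num_classes):
    """Eq. 3: row i is the mean CLIP prediction over samples labeled i."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for q, y in zip(probs, noisy_labels):
        counts[y] += 1
        for j in range(num_classes):
            sums[y][j] += q[j]
    # Rows of classes without any sample stay at zero (an assumption).
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_classes)
    ]

# Hypothetical CLIP predictions for three samples over two classes.
probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
noisy_labels = [0, 0, 1]  # 0-indexed given labels
M = estimate_transition_matrix(probs, noisy_labels, num_classes=2)
```

Each non-empty row is an average of probability vectors and therefore sums to one. A large diagonal entry $M_{ii}$ indicates that CLIP is, on average, confident about the samples labeled $i$.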

Class frequency prior. The class frequency prior measures the distribution of the training data and is a common statistic used to address the long-tail problem (Menon et al., 2021). It is defined as $\pi_{j}=N_{j}^{\prime}/N^{\prime}$, where $N^{\prime}$ and $N_{j}^{\prime}$ are the number of training samples and the number of instances in class $j$, respectively.

With the transition matrix and class frequency prior, we propose a noise-aware margin adaptive loss to address the above mentioned problems in a unified framework. After the selection of clean samples, we get the clean subset $\mathcal{D}_{clean}$ consisting of $N^{\prime}$ training samples. For $(x_{i},y_{i})\in\mathcal{D}_{clean}$ , we obtain the model’s output of the softmax probability $\hat{p}_{i}$ as:

$$\hat{p}_{i}=\frac{\exp((z_{i}^{y_{i}}+\delta M_{y_{i}y_{i}}+t\log\pi_{y_{i}})/s)}{\sum_{j=1}^{C}\exp((z_{i}^{j}+\delta M_{y_{i}j}+t\log\pi_{j})/s)}, \tag{4}$$

where $\delta,t$ control the noise-aware margin and balanced margin, respectively. Here, $s$ is the temperature factor.
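The following sketch evaluates Eq. 4 for a single sample, together with the class frequency prior $\pi$. The logits, the transition matrix, and the label counts are toy values, and the function names are hypothetical. The sketch assumes that every class appears at least once in the selected subset so that $\log\pi_{j}$ is finite.

```python
import math

def class_prior(labels, num_classes):
    """pi_j = N'_j / N', computed from the labels of the selected subset."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]

def adjusted_probability(z, y, M, pi, delta=0.5, t=1.0, s=1.0):
    """Eq. 4: probability of label y after adding both margins to the logits."""
    adj = [(z[j] + delta * M[y][j] + t * math.log(pi[j])) / s
           for j in range(len(z))]
    m = max(adj)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in adj]
    return exps[y] / sum(exps)

# Toy inputs (assumed values): two classes, class 0 dominates the subset.
z = [2.0, 1.0]
M = [[0.7, 0.3], [0.3, 0.7]]
pi = class_prior([0, 0, 0, 1], num_classes=2)

p_hat = adjusted_probability(z, 0, M, pi)                      # with margins
p_plain = adjusted_probability(z, 0, M, pi, delta=0.0, t=0.0)  # plain softmax
```

Setting $\delta=0$ and $t=0$ with $s=1$ recovers the ordinary softmax, which makes it easy to check how much each margin term shifts the probability for a given label.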

Conventionally, the deep neural network is optimized by empirical risk minimization of the vanilla cross-entropy loss:

$$\mathcal{L}_{\text{ERM}}=\mathbb{E}_{\mathcal{D}_{clean}}[\ell_{\text{CE}}(x,y)]=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}\ell_{\text{CE}}(x_{i},y_{i}), \tag{5}$$

$$\ell_{\text{CE}}(x_{i},y_{i})=-\log\frac{\exp({z_{i}^{y_{i}}})}{\sum_{j=1}^{C}\exp(z_{i}^{j})}. \tag{6}$$

However, we find that this loss does not perform well in the experiments. We hypothesize that there are two groups of data for each class after sample selection: one consists of many easy samples distributed at the center of the class, and the other consists of a few hard samples distributed near the class boundary. Cross-entropy loss assigns the same weight to each sample. The imbalance between the many easy samples and the few hard samples makes classifier optimization difficult. To tackle this issue, we employ the focal loss (Lin et al., 2017).

Finally, combined with the focal loss $\ell_{\text{FL}}$ ($\gamma=1.0$), our noise-aware balanced margin adaptive loss is defined as:

$$\mathcal{L}_{\text{NABM}}=\mathbb{E}_{\mathcal{D}_{clean}}[\ell_{\text{FL}}(\hat{p})]=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}\ell_{\text{FL}}(\hat{p}_{i}). \tag{7}$$

By modifying the logit with the transition matrix and class frequency prior, this margin adaptive mechanism can suppress overconfidence on biased classes and mitigate the negative effect of imbalanced distribution brought by sample selection from CLIP, which can encourage the model to resist label noise better.
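As a rough sketch of Eq. 7, the snippet below averages a focal loss over a batch of adjusted probabilities $\hat{p}_{i}$. It assumes the common form $\ell_{\text{FL}}(p)=-(1-p)^{\gamma}\log p$ without a class-weighting factor; the text does not spell out the exact variant, so this choice is an assumption, as are the function names and the probability values.

```python
import math

def focal_loss(p, gamma=1.0):
    """Focal loss for the probability p assigned to the given label."""
    return -((1.0 - p) ** gamma) * math.log(p)

def nabm_loss(p_hats, gamma=1.0):
    """Eq. 7: mean focal loss over the adjusted probabilities of a batch."""
    return sum(focal_loss(p, gamma) for p in p_hats) / len(p_hats)

# Hypothetical adjusted probabilities: one easy, one medium, one hard sample.
p_hats = [0.9, 0.6, 0.2]
loss = nabm_loss(p_hats)

# Plain cross-entropy on the same probabilities, for comparison.
ce = sum(-math.log(p) for p in p_hats) / len(p_hats)
```

The factor $(1-p)^{\gamma}$ shrinks the contribution of easy samples (those with $p$ close to 1) much more than that of hard ones, so the few boundary samples carry relatively more weight than under cross-entropy. With $\gamma=0$ the expression reduces to cross-entropy.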

Input: training dataset with noisy labels $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$; pretrained feature encoder $f$; classifier $h$; open-vocabulary vision-language model CLIP; texts $T=\{T_{j}\}_{j=1}^{C}$ generated by the prompt template; threshold $\rho$; maximum epoch $E$

Output: deep neural network $\mathcal{F}_{\theta}=\{f,h\}$

// Selecting clean samples with CLIP
$\mathcal{D}_{clean}\leftarrow\varnothing$;
foreach $(x_{i},y_{i})\in\mathcal{D}$ do
    Compute $q(y=y_{i}|x_{i})$ by Eq. 1 with CLIP and $T$;
    if $q(y=y_{i}|x_{i})>\rho$ then
        $\mathcal{D}_{clean}\leftarrow\mathcal{D}_{clean}\cup\{(x_{i},y_{i})\}$;
    end if
end foreach
// Calculating the transition matrix
Compute the transition matrix $M$ by Eq. 3 on $\mathcal{D}$;
// Calculating the class frequency prior
Compute the class frequency prior $\pi$ on $\mathcal{D}_{clean}$;
// Fine-tune the network
Re-initialize the classifier $h$;
for $e=1,...,E$ do
    while $k<\text{MaxIter}$ do
        Draw a mini-batch $\mathcal{X}_{e}^{k}=\{(x_{b},y_{b})\}_{b=1}^{B}$ from $\mathcal{D}_{clean}$;
        Compute the loss $\mathcal{L}_{\text{NABM}}$ by Eq. 7 with $M$ and $\pi$ on $\mathcal{X}_{e}^{k}$;
        Calculate the gradients of the loss $\mathcal{L}_{\text{NABM}}$ by backpropagation;
        Update the parameters by SGD;
    end while
end for
return $\mathcal{F}_{\theta}=\{f,h\}$

Algorithm 1: Pseudo-code for our method.

4 Experiments

4.1 Experiment setup

Datasets. We evaluate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. For real-world datasets, we conduct experiments on five benchmarks with noisy labels: Clothing1M, WebVision, Red Mini-ImageNet, CIFAR-10N, and CIFAR-100N. Clothing1M (Xiao et al., 2015) consists of 1 million training images collected from online shopping websites, where labels are produced from the surrounding texts. The test set contains 10,526 images of 14 classes. WebVision (Li et al., 2017) is crawled from the web using 1,000 concepts from ImageNet ILSVRC12 (Deng et al., 2009). Following (Li et al., 2020), we experiment with the first 50 classes of the Google image subset of WebVision 1.0. The training and validation sets contain 65,944 and 2,500 images, respectively. Red Mini-ImageNet (Jiang et al., 2020) is a benchmark of controlled real-world label noise from the web. The dataset contains 100 classes. We experiment with noise rates of 20%, 40%, 60%, and 80%. Images are resized to 32 $\times$ 32 for a fair comparison (Garg et al., 2023; Kim et al., 2024). CIFAR-10N and CIFAR-100N (Wei et al., 2022) are two recently proposed benchmarks with real-world human-annotated noisy labels. The noisy labels are collected from Amazon Mechanical Turk. For CIFAR-10N, each image is annotated with 3 human-annotated labels. We study three types of noisy label sets: (1) Aggregate: the noisy label is aggregated by majority voting; (2) Random $i$ ($i=1,2,3$): the $i$-th annotated label is used as the noisy label; (3) Worst: a wrongly annotated label is used if one exists. For CIFAR-100N, each image is annotated with one noisy fine label and a coarse label. Please refer to (Wei et al., 2022) and the website (http://noisylabels.com/) for more details. For synthetic datasets, we manually corrupt the labels of CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). Both CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 test images, with 10 and 100 classes, respectively. The image size is 32 $\times$ 32. We investigate three types of label noise: symmetric, asymmetric, and instance-dependent. Symmetric noise is generated by randomly replacing clean labels with other possible labels. In this case, some clean labels can be maintained. We constrain the label flipping to be closed under the given label set. Asymmetric noise is injected by replacing labels only within similar classes, e.g., deer $\rightarrow$ horse, dog $\leftrightarrow$ cat, which is more common in practice. Instance-dependent noise depends on image information. We simulate the experimental environment following (Xia et al., 2020).
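The symmetric and asymmetric corruptions can be sketched as follows. This is an illustrative reading of the description above rather than the exact protocol used in the experiments: the function names, the seed handling, and the choice to resample over all classes (so that a resampled label may coincide with the clean one) are assumptions. The class indices in `mapping` assume the standard CIFAR-10 class order and cover only the pairs named in the text.

```python
import random

def symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Resample a fraction of the labels uniformly over all classes.

    A resampled label may equal the original one, so the effective
    fraction of wrong labels can be lower than noise_rate.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.randrange(num_classes)
    return noisy

def asymmetric_noise(labels, mapping, noise_rate, seed=0):
    """Flip a label to its fixed similar class with probability noise_rate."""
    rng = random.Random(seed)
    return [mapping[y] if y in mapping and rng.random() < noise_rate else y
            for y in labels]

# Toy clean labels standing in for a 10-class dataset.
clean = [i % 10 for i in range(1000)]
noisy_sym = symmetric_noise(clean, num_classes=10, noise_rate=0.5)

# deer -> horse, dog <-> cat (indices assume the standard CIFAR-10 order).
mapping = {4: 7, 5: 3, 3: 5}
noisy_asym = asymmetric_noise(clean, mapping, noise_rate=0.4)
```

Classes that do not appear in `mapping` are never altered by the asymmetric variant, which mirrors the idea that confusion happens only between visually similar classes.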

Evaluation metrics. Top-1 test accuracy is reported in the experiments. For the Clothing1M dataset, we select the model that performs best on the validation set. For WebVision, we also report top-5 accuracy. After training on WebVision, we evaluate performance on ImageNet without any fine-tuning.
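For reference, top-$k$ accuracy counts a prediction as correct when the ground-truth label is among the $k$ highest-scoring classes. The sketch below uses hypothetical scores and a hypothetical function name; it is a generic definition, not the evaluation code behind the reported numbers.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose label is among the k highest-scoring classes."""
    hits = 0
    for z, y in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        topk = sorted(range(len(z)), key=lambda j: z[j], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

# Hypothetical class scores for three samples over three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

Top-$k$ accuracy can only grow with $k$, which is why the top-5 columns in the WebVision table are always at least as high as the top-1 columns.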

Implementation details. Following (Li et al., 2020, 2022a), we adopt the same experimental protocol to pretrain the backbone. For WebVision, we use Inception-ResNet V2 (Szegedy et al., 2017) for DivideMix (Li et al., 2020) and ResNet-18 for Sel-CL (Li et al., 2022a). We reinitialize the classifier and train for 10 epochs with a learning rate of 0.01 and 0.001, respectively. The optimizer is SGD with a weight decay of 0.0001, and the batch size is 64. For Clothing1M, we use ResNet-50 pretrained on ImageNet. Following previous work (Li et al., 2020), we sample 1000 mini-batches in each epoch. We train for 80 epochs with an initial learning rate of 0.02. We set $\rho=0.6$. For the Red Mini-ImageNet, CIFAR-10N, CIFAR-100N, CIFAR-10, and CIFAR-100 datasets, we train Pre-Act ResNet-18 for 10 epochs. The weight decay is set to 0.0005 and the batch size is 128. We set $\rho=0.5$ for Red Mini-ImageNet, CIFAR-10N, and CIFAR-10, and $\rho=0.1$ for CIFAR-100 and CIFAR-100N. When training ResNet-18, we use $s=1.0$, $\delta=0.5$, $t=1.0$ with ViT-B/16 as the CLIP backbone for filtering; for the other networks, we use $s=0.1$, $\delta=0.1$, $t=0.01$ with ResNet-50 as the CLIP backbone. Empirically, we find that a smaller $s$ is suitable for deeper networks.

4.2 Results

Method	WebVision top-1	WebVision top-5	ILSVRC12 top-1	ILSVRC12 top-5
F-correction (Patrini et al., 2017)	61.12	82.68	57.36	82.36
Decoupling (Malach and Shalev-Shwartz, 2017)	62.54	84.74	58.26	82.26
MentorNet (Jiang et al., 2018)	63.00	81.40	57.80	79.92
Co-teaching (Han et al., 2018a)	63.58	85.20	61.48	84.70
Iterative-CV (Chen et al., 2019)	65.24	85.34	61.60	84.98
ELR (Liu et al., 2020b)	76.26	91.26	68.71	87.84
ELR+ (Liu et al., 2020b)	77.78	91.68	70.29	89.76
NGC (Wu et al., 2021)	79.16	91.84	74.44	91.04
TCL (Huang et al., 2023)	79.10	92.30	75.40	92.40
LSL (Kim et al., 2024)	81.40	93.00	77.00	91.84
CLIP zero-shot (RN50) (Radford et al., 2021)	71.20	95.04	74.04	96.80
CLIP zero-shot (ViT-B/16) (Radford et al., 2021)	75.44	96.92	78.36	97.92
DivideMix (Li et al., 2020)	77.32	91.64	75.20	90.84
Ours (DivideMix init.)	79.08	91.96	76.04	93.12
Sel-CL (Li et al., 2022a)	77.88	91.60	74.28	90.96
Ours (Sel-CL init.)	81.00	93.84	76.28	94.92

Table 1: We report top-1 and top-5 test accuracy (%) on WebVision 1.0 and ImageNet ILSVRC12. Our method achieves significant improvement over baseline methods. We show the best and the second best results in LNL methods.

Method	Noise rate 20%	Noise rate 40%	Noise rate 60%	Noise rate 80%
CE (Yao et al., 2021a)	47.36	42.70	37.30	29.76
Mixup (Zhang et al., 2018)	49.10	46.40	40.58	33.58
MentorMix (Jiang et al., 2020)	51.02	47.14	43.80	33.46
DivideMix (Li et al., 2020)	50.96	46.72	43.14	34.50
SSR (Feng et al., 2021)	52.18	48.96	42.42	33.20
FaMUS (Xu et al., 2021)	51.42	48.06	45.10	35.50
LSL (Kim et al., 2024)	54.68	49.80	45.46	36.78
InstanceGM (Garg et al., 2023)	58.38	52.24	47.96	39.62
Ours^†	61.26	57.09	53.25	45.65

Table 2: Test accuracy (%) on Red Mini-ImageNet (CNWL). Based on InstanceGM, ours^† achieves significant improvement.

CE	Co-teaching (Han et al., 2018a)	ELR (Liu et al., 2020b)	NCR (Iscen et al., 2022)	DivideMix (Li et al., 2020)	Ours
68.94	69.21	72.87	74.6	74.76	74.84 $\pm$ 0.03

Table 3: Comparison between our method and existing baseline methods on the Clothing-1M dataset. Test accuracy (%) is reported.

Dataset	CIFAR-10N					CIFAR-100N
Types	Aggregate	Random1	Random2	Random3	Worst	Noisy
CE^∗ (Standard)	87.22	81.59	82.22	82.06	67.45	47.54
DivideMix^∗ (Li et al., 2020)	95.33	95.35	95.01	95.18	92.47	69.84
Ours^†	95.95 $\pm$ 0.05	96.17 $\pm$ 0.10	95.58 $\pm$ 0.10	95.95 $\pm$ 0.05	93.67 $\pm$ 0.10	72.46 $\pm$ 0.41

Table 4: Test accuracy (%) on CIFAR-10N and CIFAR-100N. All results use the PreAct ResNet-18 architecture. ^∗ marks baselines that we reproduce. Ours^† achieves significant improvement over DivideMix.

Dataset	CIFAR-10					CIFAR-100
Noise type	Sym.				Asym.	Sym.
Noise rate	20%	50%	80%	90%	40%	20%	50%	80%	90%
Standard	83.9	58.5	25.9	17.3	77.3	61.5	37.4	10.4	4.1
F-correction (Patrini et al., 2017)	83.1	59.4	26.2	18.8	83.1	61.4	37.3	9.0	3.4
Co-teaching+ (Yu et al., 2019)	88.2	84.1	45.5	30.1	-	64.1	45.3	15.5	8.8
Mixup (Zhang et al., 2018)	92.3	77.6	46.7	43.9	-	66.0	46.6	17.6	8.1
P-correction (Yi and Wu, 2019)	92.0	88.7	76.5	58.2	88.1	68.1	56.4	20.7	8.8
M-correction (Arazo et al., 2019)	93.8	91.9	86.6	68.7	86.3	73.4	65.4	47.6	20.5
MOIT+ (Ortego et al., 2021)	94.1	-	75.8	-	93.2	75.9	-	51.4	-
ELR+ (Liu et al., 2020b)	94.9	93.9	90.9	74.5	88.9	76.3	72.0	57.2	30.9
NCR (Iscen et al., 2022)	95.2	94.3	91.6	75.1	90.7	76.6	72.5	58.0	30.8
NGC (Wu et al., 2021)	95.9	94.5	91.6	80.5	90.6	79.3	75.9	62.7	29.8
DivideMix (Li et al., 2020)	95.7	94.4	92.9	75.4	92.1	76.9	74.2	59.6	31.0
Ours^†	96.6	95.6	94.1	89.2	95.1	80.3	76.6	63.4	45.7

Table 5: Test accuracy (%) on CIFAR-10 and CIFAR-100 under different noise rates. Sym. denotes symmetric noise and Asym. denotes asymmetric noise. Built on DivideMix (Li et al., 2020), ours^† achieves significant improvement over it.

Real-world datasets. Table 1 shows the results on WebVision (Li et al., 2017). Our method clearly improves over the DivideMix baseline (Li et al., 2020): it reaches 79.08% top-1 and 91.96% top-5 accuracy on the WebVision validation set, gains of 1.76% and 0.32%, respectively. When transferred to ImageNet (Deng et al., 2009), the top-5 accuracy is substantially boosted: ours achieves 93.12%, surpassing DivideMix by 2.28%. With the more robust representation pretrained with contrastive learning (Li et al., 2022a), our method achieves the second-best top-1 accuracy of 81.00% on WebVision and a better top-5 accuracy than the state-of-the-art method LSL (Kim et al., 2024). These results verify the effectiveness of our proposed sampling strategy and margin mechanism. Compared with other approaches, our method remains competitive and is effective at resisting label noise. In Table 2, we show the test accuracy on Red Mini-ImageNet (Jiang et al., 2020) with controllable realistic label noise. Our method outperforms contemporary methods, surpassing the second best (InstanceGM) by 2.88%, 4.85%, 5.29%, and 6.03% under noise rates of 20%, 40%, 60%, and 80%, respectively. In Table 3, we compare with previous methods on Clothing-1M (Xiao et al., 2015) with realistic noisy labels. Benefiting from training on the selected cleaner samples, our approach slightly improves over DivideMix (74.84% vs. 74.76%) and also outperforms other methods such as NCR (Iscen et al., 2022) and ELR (Liu et al., 2020b). The evaluation results on CIFAR-10N and CIFAR-100N are shown in Table 4. DivideMix (Li et al., 2020) outperforms the basic CE baseline by a large margin, which confirms its robustness, and our method improves on DivideMix further. These results indicate that our method effectively mitigates the negative effect of real-world noise.
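For reference, the top-1 and top-5 metrics used in these comparisons can be computed as in the following minimal sketch. The function name `topk_accuracy` and the toy scores are illustrative assumptions rather than the evaluation code used in the experiments.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


# Toy scores for 4 samples over 3 classes (illustrative values only).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 2, 0, 0]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

Top-k accuracy can only grow with k, which is why a model may lag in top-1 accuracy yet lead in top-5 accuracy.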

Method	Noise rate 20%	Noise rate 30%	Noise rate 40%	Noise rate 45%	Noise rate 50%
CE (Yao et al., 2021a)	30.42	24.15	21.45	15.23	14.42
Mixup (Zhang et al., 2018)	32.92	29.76	25.92	23.13	21.31
F-correction (Patrini et al., 2017)	36.38	33.17	26.75	21.93	19.27
Reweight (Liu and Tao, 2015)	36.73	31.91	28.39	24.12	20.23
Decoupling (Malach and Shalev-Shwartz, 2017)	36.53	30.93	27.85	23.81	19.59
Co-teaching (Han et al., 2018b)	37.96	33.43	28.04	25.60	23.97
MentorNet (Jiang et al., 2018)	38.91	34.23	31.89	27.53	24.15
DivideMix (Li et al., 2020)	77.07	76.33	70.80	57.78	58.61
SSR (Feng et al., 2021)	78.84	78.60	76.95	74.98	72.83
LSL (Kim et al., 2024)	80.94	79.90	78.60	78.08	77.95
InstanceGM (Garg et al., 2023)	79.69	79.21	78.47	77.49	77.19
Ours^†	80.97	80.42	79.68	79.39	78.74

Table 6: Test accuracy (%) on CIFAR-100 with instance-dependent noise. Based on InstanceGM, ours^† achieves significant improvement.

Synthetic datasets. Tables 5 and 6 compare our method with previous methods on the manually corrupted CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) under different noise rates and noise types. Our method outperforms DivideMix at every noise rate and achieves state-of-the-art results in almost all settings. The advantage is largest under the high noise rate of 90%, where our method reaches 89.2% and 45.7% test accuracy on CIFAR-10 and CIFAR-100, respectively, exceeding DivideMix (Li et al., 2020) by 13.8% and 14.7%. This indicates that our method can identify cleaner samples even under an extreme noise rate. Under asymmetric noise, our method improves on DivideMix by 3.0% and even surpasses MOIT+ (Ortego et al., 2021) by 1.9%. Our method is robust to instance-dependent noise as well. Built on InstanceGM (Garg et al., 2023), ours achieves better test accuracy than the state-of-the-art method LSL (Kim et al., 2024); the gap is 1.08% at a noise rate of 40% (79.68% vs. 78.60%). These results suggest that CLIP can be readily adapted to help select clean samples. Equipped with our noise-aware balanced margin adaptive loss, classifier learning becomes more robust and less biased.
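To make the symmetric-noise setting concrete, the sketch below corrupts a label list under one common convention, assumed here: with probability equal to the noise rate, a label is redrawn uniformly over all classes, so the draw may return the original class. The function name `inject_symmetric_noise` and the seed are hypothetical, and the exact corruption protocol of each benchmark may differ.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """With probability noise_rate, replace a label by a class drawn uniformly
    from all classes (assumed convention; the draw may return the original)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.randrange(num_classes))
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(10000)]
noisy = inject_symmetric_noise(clean, num_classes=10, noise_rate=0.5)

# Fraction of labels that actually changed; its expectation under this
# convention is noise_rate * (1 - 1 / num_classes) = 0.45.
flipped = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

Under this convention the fraction of truly wrong labels is slightly lower than the nominal noise rate, which matters when comparing results across papers that use different conventions.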

4.3 Ablation Study

Sampling strategy. We examine the effectiveness of our sampling strategy in this ablation. Table 7 compares the accuracy of different sampling strategies under the same experimental setup, with the classifier trained with the focal loss. First, we compare with GMM, the commonly used method for selecting clean samples (Arazo et al., 2019; Li et al., 2020). When the classifier is re-trained on the clean samples separated by GMM, the top-1 accuracy on WebVision (Li et al., 2017) increases only marginally (77.32% to 77.36%) over the pretrained DivideMix (Li et al., 2020) model. This indicates that the deep neural network has memorized some small-loss noisy samples. GMM cannot identify these samples, so they continue to hinder the learning of a robust classifier, and the performance is hard to improve further. Second, our prediction confidence strategy significantly improves classification accuracy, exceeding the baseline by 1% top-1 accuracy on WebVision.

Sampling Strategy	WebVision top-1	WebVision top-5	ILSVRC12 top-1	ILSVRC12 top-5
DivideMix (pretrained) (Li et al., 2020)	77.32	91.64	75.20	90.84
Small Loss (GMM)	77.36	91.08	74.32	92.08
Prediction Confidence	78.32	91.48	75.28	92.92
Prompt Consistency	77.36	91.24	74.16	92.04

Table 7: Comparison between different sampling strategies. We report top-1 (top-5) accuracy (%) on WebVision and ImageNet.

The results suggest that CLIP (Radford et al., 2021), with its strong zero-shot prediction capability and external knowledge from large-scale pretraining data, can help detect memorized noisy samples and select cleaner training samples. Learning with cleaner samples contributes to a better decision boundary. Third, we inject prior knowledge into the prompt, expecting this strategy to identify some out-of-domain data. Our prompt consistency strategy can help detect noisy samples and select clean ones, as shown in Figure 3, but its overall performance is only comparable to that of GMM. As discussed in Section 3.2, we suspect that it requires more carefully designed prompt templates, which we leave for future work.

Cross-entropy loss vs. focal loss. The straightforward way to address image classification is to train a classifier with the vanilla cross-entropy loss. However, its performance is unsatisfactory in our experiments. We conduct the experiments on the clean samples selected by CLIP (Radford et al., 2021) and report the results in Table 8. We make three observations. (1) After the selection of clean samples, training with the cross-entropy loss improves the top-1 accuracy on the WebVision validation set by 0.44% over the model pretrained with DivideMix (Li et al., 2020), which confirms the effectiveness of our sampling strategy. (2) In contrast, the top-1 accuracy on ImageNet drops by 0.40% (75.20% to 74.80%). As discussed in Section 3.3, we hypothesize that this is because the selected samples contain many clean samples with easy patterns. The cross-entropy loss treats each sample equally, so DNNs focus on the simple samples and fit the training data well, which can result in poor performance when transferred to other datasets. (3) Focal loss (Lin et al., 2017) improves on the cross-entropy loss on both WebVision and ImageNet, increasing top-1 (top-5) accuracy by 0.56% (1.04%) and 0.48% (1.28%), respectively. Hard samples play an important role in determining an accurate decision boundary. By balancing the contributions of easy and hard samples, focal loss assigns relatively larger weights to hard samples, which prevents easy samples from dominating training. Ultimately, it facilitates the optimization of the classifier and improves prediction performance.
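The down-weighting effect can be illustrated with a minimal sketch of the focal loss of Lin et al. (2017), $-(1-p_t)^{\gamma}\log p_t$, written here without the optional class weighting. The value $\gamma=2$ and the toy logits are assumptions for illustration; they are not the settings used in our experiments.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss: -(1 - p_t)^gamma * log(p_t); gamma = 0 gives cross-entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [4.0, 0.0, 0.0]  # confidently correct for class 0
hard = [0.5, 0.4, 0.0]  # barely correct for class 0

# The ratio focal / CE equals (1 - p_t)^gamma, i.e. the share of the
# cross-entropy loss that each sample keeps.
easy_keep = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_keep = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

In this toy case the easy sample keeps well under 1% of its cross-entropy loss while the hard sample keeps roughly a third, so gradients are dominated by hard samples.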

Loss function	WebVision top-1	WebVision top-5	ILSVRC12 top-1	ILSVRC12 top-5
DivideMix (pretrained) (Li et al., 2020)	77.32	91.64	75.20	90.84
Cross-Entropy	77.76	90.44	74.80	91.64
Focal Loss	78.32	91.48	75.28	92.92
w/o Focal Loss	78.20	91.04	74.92	92.00
w/o balanced margin ( $t=0$ )	78.68	91.72	75.68	93.00
w/o noise-aware margin ( $\delta=0$ )	78.76	91.92	75.76	92.92
Ours	79.08	91.96	76.04	93.12

Table 8: Ablation analysis on the loss function and the contribution of each component. We report top-1 and top-5 test accuracy (%).

Ablation study on noise-aware balanced margin adaptive loss. We investigate the contribution of each component of our proposed loss function. For a fair comparison, we remove one component at a time and keep all other configurations the same. The results are presented in Table 8. Removing the focal loss hurts performance. Both the noise-aware margin and the balanced margin improve performance: measured by top-1 accuracy, the former outperforms the focal loss baseline by 0.36% on WebVision (Li et al., 2017) and 0.40% on ImageNet (Deng et al., 2009), while the latter yields gains of 0.44% and 0.48%. These observations suggest that the CLIP (Radford et al., 2021) model may introduce selection bias. On the one hand, CLIP can be overconfident on certain classes, so some noisy samples remain after sample selection. On the other hand, the selected samples often exhibit a class-imbalanced distribution, especially when asymmetric noise exists. Although sample selection yields cleaner data, it amplifies the impact of data imbalance. Our margin adaptive loss addresses both issues in a unified framework: the noise-aware margin mitigates the overconfidence effect, and the balanced margin relieves the influence of the long-tailed distribution. The joint effect of the two margins benefits classifier learning and further improves performance.
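As a rough illustration of how a class-dependent margin counteracts imbalance, the sketch below applies logit adjustment (Menon et al., 2021), which shifts each logit by $t\log\pi_c$ with $\pi_c$ the class prior. This is a stand-in for intuition only and is not the exact loss defined in Section 3.3; the function name `logit_adjusted_ce`, the class counts, and the logits are hypothetical.

```python
import math


def logit_adjusted_ce(logits, target, class_counts, t=1.0):
    """Cross-entropy on logits shifted by t * log(prior), following logit
    adjustment (Menon et al., 2021). With t = 0 it is plain cross-entropy."""
    total = sum(class_counts)
    adjusted = [z + t * math.log(n / total) for z, n in zip(logits, class_counts)]
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[target]


counts = [900, 90, 10]    # hypothetical class counts after sample selection
logits = [1.0, 1.0, 1.0]  # a sample the model is undecided about

loss_head = logit_adjusted_ce(logits, 0, counts, t=1.0)
loss_tail = logit_adjusted_ce(logits, 2, counts, t=1.0)
loss_plain = logit_adjusted_ce(logits, 2, counts, t=0.0)
```

With equal raw logits, the tail-class sample incurs a much larger loss than the head-class sample, so the classifier must score rare classes with a larger margin before the loss becomes small.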

Analysis of transition matrix. We calculate the error between our estimated transition matrix and the ground-truth one. Under symmetric noise with a rate of 0.5, the mean absolute error is 0.02 on CIFAR-10 and 0.006 on CIFAR-100.
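The following sketch shows how such an error can be computed. It assumes the symmetric-noise convention in which a label is redrawn uniformly over all classes, and it estimates the matrix from row-normalised co-occurrence counts of reference and noisy labels. The estimator, the function names, and the toy labels are illustrative assumptions, not our estimation procedure.

```python
def symmetric_transition(num_classes, noise_rate):
    """T[i][j] = P(noisy = j | clean = i) when a label is redrawn uniformly
    over all classes with probability noise_rate (assumed convention)."""
    off = noise_rate / num_classes
    return [[(1.0 - noise_rate + off) if i == j else off
             for j in range(num_classes)] for i in range(num_classes)]


def estimate_transition(reference, noisy, num_classes):
    """Row-normalised co-occurrence counts of (reference, noisy) label pairs."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for c, n in zip(reference, noisy):
        counts[c][n] += 1
    return [[v / max(sum(row), 1) for v in row] for row in counts]


def mean_absolute_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


true_T = symmetric_transition(2, 0.5)          # [[0.75, 0.25], [0.25, 0.75]]
reference = [0] * 8 + [1] * 8
noisy = [0] * 6 + [1] * 2 + [1] * 7 + [0] * 1
est_T = estimate_transition(reference, noisy, 2)
mae = mean_absolute_error(true_T, est_T)
```

In practice the reference labels are unavailable, so a proxy such as the predictions of a surrogate model would take their place.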

Effect of selection threshold $\rho$ in sample selection. In Figure LABEL:fig:ab_threshold, we show the number of selected training samples and the top-1 accuracy on both WebVision (Li et al., 2017) and ImageNet (Deng et al., 2009) under varied confidence thresholds. As the threshold increases, the number of training samples decreases rapidly. The top-1 accuracy on WebVision also drops, especially when the threshold is set to 0.9. This is because, after selection, the training samples are insufficient to learn a good decision boundary, especially for classes with few data points. Although sample selection yields cleaner data, the risk of a reduced amount of training data cannot be ignored. Therefore, we choose a threshold that leaves enough instances for training the network. The performance on ImageNet is less affected, which we conjecture is due to the robust representation learned from diverse web data.
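The trade-off between label cleanliness and training-set size can be sketched as follows. The code keeps a sample when the softmax probability assigned to its given label reaches $\rho$; the logits stand in for image-text similarity scores and are hypothetical values, and the function name `select_by_confidence` is an assumption for illustration.

```python
import math
from collections import Counter


def select_by_confidence(logits, noisy_labels, rho):
    """Return indices of samples whose softmax probability on the given
    (possibly noisy) label is at least rho."""
    kept = []
    for i, (row, y) in enumerate(zip(logits, noisy_labels)):
        m = max(row)
        exps = [math.exp(z - m) for z in row]
        conf = exps[y] / sum(exps)
        if conf >= rho:
            kept.append(i)
    return kept


# Hypothetical similarity logits for 5 samples over 3 classes.
logits = [
    [5.0, 0.0, 0.0],  # label 0, very confident
    [2.0, 1.0, 0.0],  # label 0, moderately confident
    [0.0, 3.0, 0.0],  # label 1, confident
    [1.0, 1.2, 1.0],  # label 1, uncertain
    [3.0, 0.0, 0.5],  # label 2, likely mislabeled
]
noisy_labels = [0, 0, 1, 1, 2]

kept_by_rho = {rho: select_by_confidence(logits, noisy_labels, rho)
               for rho in (0.3, 0.6, 0.9)}
# Number of kept samples per class for each threshold.
per_class_by_rho = {rho: Counter(noisy_labels[i] for i in kept)
                    for rho, kept in kept_by_rho.items()}
```

Raising the threshold removes the likely mislabeled sample first, but it then starts to discard correctly labeled, less confident samples, and some classes may be left with very few or no samples.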

Effect of $\delta$ in noise-aware margin regularization. To understand the influence of the noise-aware margin, we vary $\delta$ from 0.0 to 1.0 and keep the other hyperparameters fixed. As shown in Figure LABEL:fig:ab_margin, a small $\delta$ (e.g., $\delta=0.1$ ) leads to higher top-1 and top-5 accuracy on both WebVision (Li et al., 2017) and ImageNet (Deng et al., 2009). The performance remains relatively stable when $\delta\leq 0.5$ . However, when $\delta$ is large (e.g., $\delta=1.0$ ), the accuracy drops considerably: the top-1 accuracy on WebVision is only 31.0%. These results support our motivation that the noise-aware margin helps prevent the overconfidence effect and that this regularization provides robustness to label noise. Nevertheless, too large a $\delta$ hinders the optimization of the classifier and results in poor performance. Empirically, a small $\delta$ is recommended.

Training time analysis. We analyze the training time of our proposed method on WebVision to understand its efficiency. The experiment is conducted on a single NVIDIA 3090 GPU. CLIP-based sample selection takes around 73.9 seconds with a batch size of 1000. After the pretraining stage, training our model takes just under 40 minutes.

4.4 Visualization

Figures 5 and 6 compare the confidence distributions of GMM (Li et al., 2020) and CLIP (Radford et al., 2021) on WebVision and Clothing1M. The confidence from GMM is concentrated near 0 or 1 after training, while the distribution from CLIP is much more spread out. This indicates that some noisy samples have been memorized by DNNs, which makes the GMM confidence less discriminative for distinguishing clean from noisy samples. In Figure 7, we show several selected or filtered training images from the WebVision dataset and compare the selections of CLIP and GMM. CLIP can identify some clean samples that GMM regards as noisy. Meanwhile, based on the predicted confidences on noisy labels, CLIP can also filter out some noisy samples that GMM fails to find. These results support our assumption that some noisy samples are memorized by DNNs and cannot be filtered by the small-loss criterion. These samples often share similar visual patterns. For instance, the pillow image (Row 3, Column 2 in Figure 7) with a repeated bird pattern is mistakenly identified as an actual bird by GMM. In contrast, CLIP’s prior knowledge helps mitigate this issue. Benefiting from pretraining on large-scale web data, the CLIP model can detect more noisy samples with its strong zero-shot capability. More examples are presented in Figure 8.

5 Conclusion

In this paper, we propose to leverage the vision-language pretrained surrogate model CLIP to help select clean samples when learning with noisy labels. With its strong zero-shot inference capability, CLIP can identify some noisy samples memorized by deep neural networks, based on the predicted confidences on noisy labels. Furthermore, our noise-aware balanced margin adaptive loss facilitates classifier learning and mitigates the selection bias introduced by CLIP. The significant improvement on multiple noisy datasets verifies the effectiveness of our method, which does not involve CLIP at the inference stage.

Conflict of interest

The authors declare that they have no conflict of interest.

Data availability

The datasets analyzed during the current study are available at https://www.image-net.org/, https://www.cs.toronto.edu/~kriz/cifar.html, http://noisylabels.com/, https://google.github.io/controlled-noisy-web-labels/ and https://data.vision.ee.ethz.ch/cvl/webvision/dataset2017.html. No new datasets were generated.

References

Arazo et al. (2019) Arazo E, Ortego D, Albert P, O’Connor N, McGuinness K (2019) Unsupervised label noise modeling and loss correction. In: International conference on machine learning, PMLR, pp 312–321
Arpit et al. (2017) Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS, Maharaj T, Fischer A, Courville A, Bengio Y, et al. (2017) A closer look at memorization in deep networks. In: International conference on machine learning, PMLR, pp 233–242
Bai and Liu (2021) Bai Y, Liu T (2021) Me-momentum: Extracting hard confident examples from noisily labeled data. In: ICCV
Carion et al. (2020) Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, Springer, pp 213–229
Chen et al. (2019) Chen P, Liao BB, Chen G, Zhang S (2019) Understanding and utilizing deep neural networks trained with noisy labels. In: International Conference on Machine Learning, PMLR, pp 1062–1070
Cheng et al. (2022a) Cheng D, Liu T, Ning Y, Wang N, Han B, Niu G, Gao X, Sugiyama M (2022a) Instance-dependent label-noise learning with manifold-regularized transition matrix estimation. In: CVPR
Cheng et al. (2022b) Cheng D, Ning Y, Wang N, Gao X, Yang H, Du Y, Han B, Liu T (2022b) Class-dependent label-noise learning with cycle-consistency regularization. In: Oh AH, Agarwal A, Belgrave D, Cho K (eds) Advances in Neural Information Processing Systems, URL https://openreview.net/forum?id=IvnoGKQuXi
Deng et al. (2009) Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
Dosovitskiy et al. (2021) Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=YicbFdNTTy
Feng et al. (2021) Feng C, Tzimiropoulos G, Patras I (2021) Ssr: An efficient and robust framework for learning with unknown label noise. arXiv preprint arXiv:2111.11288
Gadre et al. (2024) Gadre SY, Ilharco G, Fang A, Hayase J, Smyrnis G, Nguyen T, Marten R, Wortsman M, Ghosh D, Zhang J, et al. (2024) Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems 36
Garg et al. (2023) Garg A, Nguyen C, Felix R, Do TT, Carneiro G (2023) Instance-dependent noisy label learning via graphical modelling. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2288–2298
Han et al. (2018a) Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I, Sugiyama M (2018a) Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 31, URL https://proceedings.neurips.cc/paper/2018/file/a19744e268754fb0148b017647355b7b-Paper.pdf
Han et al. (2018b) Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I, Sugiyama M (2018b) Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31
Han et al. (2019) Han J, Luo P, Wang X (2019) Deep self-learning from noisy labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5138–5147
He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hendrycks et al. (2018) Hendrycks D, Mazeika M, Wilson D, Gimpel K (2018) Using trusted data to train deep networks on labels corrupted by severe noise. Advances in neural information processing systems 31
Huang and Chong (2023) Huang X, Chong KFE (2023) Genkl: An iterative framework for resolving label ambiguity and label non-conformity in web images via a new generalized kl divergence. International Journal of Computer Vision pp 1–25
Huang et al. (2023) Huang Z, Zhang J, Shan H (2023) Twin contrastive learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11661–11670
Iscen et al. (2022) Iscen A, Valmadre J, Arnab A, Schmid C (2022) Learning with neighbor consistency for noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4672–4681
Jiang et al. (2018) Jiang L, Zhou Z, Leung T, Li LJ, Fei-Fei L (2018) Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: International Conference on Machine Learning, PMLR, pp 2304–2313
Jiang et al. (2020) Jiang L, Huang D, Liu M, Yang W (2020) Beyond synthetic noise: Deep learning on controlled noisy labels. In: International conference on machine learning, PMLR, pp 4804–4815
Kim et al. (2024) Kim Nr, Lee JS, Lee JH (2024) Learning with structural labels for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 27610–27620
Kolesnikov et al. (2020) Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2020) Big transfer (bit): General visual representation learning. In: European conference on computer vision, Springer, pp 491–507
Krizhevsky et al. (2009) Krizhevsky A, Hinton G, et al. (2009) Learning multiple layers of features from tiny images
Li et al. (2020) Li J, Socher R, Hoi SC (2020) Dividemix: Learning with noisy labels as semi-supervised learning. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=HJgExaVtwr
Li et al. (2022a) Li S, Xia X, Ge S, Liu T (2022a) Selective-supervised contrastive learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 316–325
Li et al. (2022b) Li S, Xia X, Zhang H, Zhan Y, Ge S, Liu T (2022b) Estimating noise transition matrix with label correlations for noisy multi-label learning. In: Oh AH, Agarwal A, Belgrave D, Cho K (eds) Advances in Neural Information Processing Systems, URL https://openreview.net/forum?id=GwXrGy_vc8m
Li et al. (2017) Li W, Wang L, Li W, Agustsson E, Van Gool L (2017) Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862
Liang et al. (2023) Liang C, Yang Z, Zhu L, Yang Y (2023) Co-learning meets stitch-up for noisy multi-label visual recognition. IEEE Transactions on Image Processing 32:2508–2519, DOI 10.1109/TIP.2023.3270103
Lin et al. (2017) Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Liu et al. (2020a) Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2020a) Deep learning for generic object detection: A survey. International journal of computer vision 128:261–318
Liu et al. (2020b) Liu S, Niles-Weed J, Razavian N, Fernandez-Granda C (2020b) Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems 33:20331–20342
Liu and Tao (2015) Liu T, Tao D (2015) Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38(3):447–461
Liu et al. (2021) Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Ma et al. (2022) Ma F, Zhu L, Yang Y (2022) Weakly supervised moment localization with decoupled consistent concept prediction. International Journal of Computer Vision 130(5):1244–1258
Malach and Shalev-Shwartz (2017) Malach E, Shalev-Shwartz S (2017) Decoupling "when to update" from "how to update". Advances in neural information processing systems 30
Menon et al. (2021) Menon AK, Jayasumana S, Rawat AS, Jain H, Veit A, Kumar S (2021) Long-tail learning via logit adjustment. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=37nvvqkCo5
Ortego et al. (2021) Ortego D, Arazo E, Albert P, O’Connor NE, McGuinness K (2021) Multi-objective interpolation training for robustness to label noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6606–6615
Patrini et al. (2017) Patrini G, Rozza A, Krishna Menon A, Nock R, Qu L (2017) Making deep neural networks robust to label noise: A loss correction approach. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1944–1952
Radford et al. (2021) Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al. (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, PMLR, pp 8748–8763
Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115:211–252
Schuhmann et al. (2021) Schuhmann C, Vencu R, Beaumont R, Kaczmarczyk R, Mullis C, Katta A, Coombes T, Jitsev J, Komatsuzaki A (2021) Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114
Shu et al. (2019) Shu J, Xie Q, Yi L, Zhao Q, Zhou S, Xu Z, Meng D (2019) Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems 32
Szegedy et al. (2017) Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence
Wang et al. (2022) Wang X, Wu Z, Lian L, Yu SX (2022) Debiased learning from naturally imbalanced pseudo-labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 14647–14657
Wei et al. (2020) Wei H, Feng L, Chen X, An B (2020) Combating noisy labels by agreement: A joint training method with co-regularization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13726–13735
Wei et al. (2022) Wei J, Zhu Z, Cheng H, Liu T, Niu G, Liu Y (2022) Learning with noisy labels revisited: A study using real-world human annotations. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=TBWA6PLJZQm
Wu et al. (2021) Wu ZF, Wei T, Jiang J, Mao C, Tang M, Li YF (2021) Ngc: a unified framework for learning with open-world noisy data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 62–71
Xia et al. (2020) Xia X, Liu T, Han B, Wang N, Gong M, Liu H, Niu G, Tao D, Sugiyama M (2020) Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems 33:7597–7610
Xia et al. (2021) Xia X, Liu T, Han B, Gong M, Yu J, Niu G, Sugiyama M (2021) Sample selection with uncertainty of losses for learning with noisy labels. URL https://openreview.net/forum?id=zGsRcuoR5-0
Xiao et al. (2015) Xiao T, Xia T, Yang Y, Huang C, Wang X (2015) Learning from massive noisy labeled data for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2691–2699
Xu et al. (2021) Xu Y, Zhu L, Jiang L, Yang Y (2021) Faster meta update strategy for noise-robust deep learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 144–153
Yang et al. (2021) Yang Y, Zhuang Y, Pan Y (2021) Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering 22(12):1551–1558
Yao et al. (2021a) Yao Y, Liu T, Gong M, Han B, Niu G, Zhang K (2021a) Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems 34:4409–4420
Yao et al. (2021b) Yao Y, Sun Z, Zhang C, Shen F, Wu Q, Zhang J, Tang Z (2021b) Jo-src: A contrastive approach for combating noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5192–5201
Yi and Wu (2019) Yi K, Wu J (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7017–7025
Yu et al. (2019) Yu X, Han B, Yao J, Niu G, Tsang I, Sugiyama M (2019) How does disagreement help generalization against label corruption? In: International Conference on Machine Learning, PMLR, pp 7164–7173
Zhang et al. (2021) Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64(3):107–115
Zhang et al. (2018) Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2018) mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=r1Ddp1-Rb
Zhang and Pfister (2021) Zhang Z, Pfister T (2021) Learning fast sample re-weighting without reward data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 725–734
Zhao et al. (2022) Zhao S, Zhu L, Wang X, Yang Y (2022) Centerclip: Token clustering for efficient text-video retrieval. In: SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11–15, 2022, Madrid, Spain
Zhu et al. (2022) Zhu Z, Dong Z, Liu Y (2022) Detecting corrupted labels without training a model to predict. In: International Conference on Machine Learning, PMLR, pp 27412–27427