Probabilistic Contrastive Learning for Long-Tailed Visual Recognition
Abstract—Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such an imbalance considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. However, the performance of supervised contrastive learning is plagued by an inherent challenge: it necessitates sufficiently large batches of training data to construct contrastive pairs that cover all categories, yet this requirement is difficult to meet in the context of class-imbalanced data. To overcome this obstacle, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space, and samples contrastive pairs accordingly. In fact, estimating the distributions of all classes using features in a small batch, particularly for imbalanced data, is not feasible. Our key idea is to introduce a reasonable and simple assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions on the unit sphere, which brings two-fold benefits. First, the distribution parameters can be estimated using only the first sample moment, which can be efficiently computed in an online manner across different batches. Second, based on the estimated distribution, the vMF assumption allows us to sample an infinite number of contrastive pairs and to derive a closed form of the expected contrastive loss for efficient optimization. Beyond long-tailed problems, ProCo can be directly applied to semi-supervised learning by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution of the samples inversely. Theoretically, we analyze the error bound of ProCo. Empirically, extensive experimental results on supervised/semi-supervised visual recognition and object detection tasks demonstrate that ProCo consistently outperforms existing methods across various datasets. Our code is available at https://github.com/LeapLabTHU/ProCo.
Index Terms—Long-Tailed Visual Recognition, Contrastive Learning, Representation Learning, Semi-Supervised Learning.
1 INTRODUCTION
2 RELATED WORK

Long-tailed Recognition. To address the long-tailed recognition problem, early rebalancing methods can be classified into two categories: re-sampling and re-weighting. Re-sampling techniques aid in the acquisition of knowledge pertaining to tail classes by adjusting the imbalanced distribution of training data through either undersampling [30], [31] or oversampling [32]. Re-weighting methods adapt the loss function to promote a greater gradient contribution for tail classes [33], [34] and even individual samples [10]. Nevertheless, Kang et al. [11] demonstrated that strong long-tailed recognition performance can be attained by merely modifying the classifier, without rebalancing techniques. Furthermore, post-hoc normalisation of the classifier weights [11], [35], [36] and loss margin modification [12], [13], [37], [38] have been two effective and prevalent methods. Post-hoc normalisation is motivated by the observation that the classifier weight norm tends to correlate with the class distribution, which can be corrected by normalising the weights. Loss margin modification methods incorporate prior information about the class distribution into the loss function by adjusting the classifier's margin. Logit Adjustment [13] and Balanced Softmax [37] deduce, from a probabilistic perspective, that the classifier's decision boundary for each class corresponds to the log of the prior probability in the training data, which is demonstrated to be a straightforward and effective technique. Moreover, another common technique involves augmenting the minority classes with data augmentation techniques [2], [3], [23], [24], [39], [40], [41], [42], [43]. MetaSAug [24] employs meta-learning to estimate the variance of the feature distribution for each class and utilizes it as a semantic direction for augmenting a single sample, which is inspired by implicit semantic data augmentation (ISDA) [23]. ISDA employs a normal distribution to model unconstrained features for data augmentation and obtains an upper bound of the expected cross-entropy loss, which is related to our method. Nevertheless, feature normalization makes direct modeling with a normal distribution infeasible in contrastive learning. Hence, we adopt a mixture of von Mises-Fisher distributions on the unit sphere, allowing us to derive a closed form of the expected contrastive loss rather than an upper bound. The vMF distribution [44] is a fundamental probability distribution on the unit hyper-sphere $S^{p-1}$ in $\mathbb{R}^p$, which has been successfully used in deep metric learning [45], [46], supervised learning [47], [48], and unsupervised learning [49]. A recent study [50] introduces a classifier that utilizes the von Mises-Fisher distribution to address long-tailed recognition problems. Although this approach exhibits similarities to our method in terms of employing the vMF distribution, it specifically emphasizes the quality of the representation for classifiers and features, considering the distribution overlap coefficient.

Contrastive Learning for Long-tailed Recognition. Recently, researchers have employed contrastive learning to tackle the challenge of long-tailed recognition. Contrastive learning is a self-supervised learning approach that leverages a contrastive loss function to learn a more discriminative representation of the data by maximizing the similarity between positive and negative samples [51], [52], [53], [54]. Khosla et al. [14] extended contrastive learning to the supervised contrastive learning (SCL) paradigm by incorporating label information. However, due to the imbalance of positive and negative samples, contrastive learning also faces the problem that the model over-focuses on head categories in long-tailed recognition [9], [17], [19]. To balance the feature space, KCL [19] uses the same number of positive pairs for all classes. Recent studies [15], [16], [17], [18], [55], [56] have proposed to introduce class complements for constructing positive and negative pairs. These approaches ensure that all classes appear in every training iteration to re-balance the distribution of contrast samples. A comprehensive comparison is provided in Sec. 3.5.

Furthermore, recent advancements in multi-modal foundation models based on contrastive learning, such as CLIP [57], have demonstrated remarkable generalization capabilities across various downstream tasks. Inspired by this, researchers have begun to incorporate multi-modal foundation models into long-tailed recognition tasks. VL-LTR [58] develops a class-level visual-linguistic pre-training approach to associate images and textual descriptions at the class level and introduces a language-guided recognition head, effectively leveraging visual-linguistic representations for enhanced visual recognition.

Knowledge Distillation for Long-tailed Recognition. Knowledge distillation involves training a student model using the outputs of a well-trained teacher model [59]. This approach has been increasingly applied to long-tailed learning. For instance, LFME [60] trains multiple experts on various, less imbalanced sample subsets (e.g., head, middle, and tail sets), subsequently distilling these experts into a unified student model. In a similar vein, RIDE [61] introduces a knowledge distillation method to streamline the multi-expert model by developing a student network with fewer experts. Differing from the multi-expert paradigm, DiVE [62] demonstrates the efficacy of using a class-balanced model as the teacher for enhancing long-tailed learning. NCL [63] incorporates two main components: Nested Individual Learning (NIL) and Nested Balanced Online Distillation (NBOD). NIL focuses on the individual supervised learning for each expert, while NBOD facilitates knowledge transfer among multiple experts. Lastly, xERM [64] aims to develop an unbiased, test-agnostic model for long-tailed classification. Grounded in causal theory, xERM seeks to mitigate bias by minimizing cross-domain empirical risk.

Imbalanced Semi-supervised Learning (SSL). Semi-supervised learning is a subfield of machine learning that addresses scenarios where labeled training samples are limited, but an extensive amount of unlabeled data is available [26], [27], [28], [29], [65]. This scenario is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data. The main approach in SSL is to leverage labeled data to generate pseudo-labels for unlabeled data, and then train the model with both pseudo-labeled and labeled data [66]. In addition, consistency regularization or the cluster assumption can be combined to further constrain the distribution of unlabeled data. For long-tailed datasets, due to class imbalance, SSL methods will be biased towards head classes when generating pseudo-labels for unlabeled data. Recently, researchers have proposed some methods to address the problem of
pseudo-label generation in imbalanced SSL. DARP [67] is proposed to softly refine the pseudo-labels generated from a biased model by formulating a convex optimization problem. CReST [68] adopts an iterative approach to retrain the model by continually incorporating pseudo-labeled samples. DASO [69] focuses on the unknown distribution of the unlabeled data and blends the linear and semantic pseudo-labels in different proportions for each class to reduce the overall bias.

3 METHOD

3.1 Preliminaries

In this subsection, we start by presenting the preliminaries, laying the basis for introducing our method. Consider a standard image recognition problem. Given the training set $D = \{(x_i, y_i)\}_{i=1}^{N}$, the model is trained to map the images from the space $\mathcal{X}$ into the classes from the space $\mathcal{Y} = \{1, 2, \ldots, K\}$. Typically, the mapping function $\phi$ is modeled as a neural network, which consists of a backbone feature extractor $F: \mathcal{X} \to \mathcal{Z}$ and a linear classifier $G: \mathcal{Z} \to \mathcal{Y}$.

Logit Adjustment [13] is a loss margin modification method. It adopts the prior probability of each class as the margin during the training and inference process. The logit adjustment loss is defined as:
$$\mathcal{L}_{\mathrm{LA}}(x_i, y_i) = -\log \frac{\pi_{y_i} e^{\phi_{y_i}(x_i)}}{\sum_{y' \in \mathcal{Y}} \pi_{y'} e^{\phi_{y'}(x_i)}}, \quad (1)$$
where $\pi_y$ is the class frequency in the training or test set, and $\phi_y$ is the logit of class $y$.

Supervised Contrastive Learning (SCL) [14] is a generalization of the unsupervised contrastive learning method. SCL is designed to learn a feature extractor $F$ that can distinguish between positive pairs $(x_i, x_j)$ with the same label $y_i = y_j$ and negative pairs $(x_i, x_j)$ with different labels $y_i \neq y_j$. Given any batch of sample-label pairs $B = \{(x_i, y_i)\}_{i=1}^{N_B}$ and a temperature parameter $\tau$, two typical ways to define the SCL loss are [14]:
$$\mathcal{L}^{sup}_{out}(x_i, y_i) = \frac{-1}{N^B_{y_i}} \sum_{p \in A(y_i)} \log \frac{e^{z_i \cdot z_p/\tau}}{\sum_{j=1}^{K} \sum_{a \in A(j)} e^{z_i \cdot z_a/\tau}}, \quad (2)$$
$$\mathcal{L}^{sup}_{in}(x_i, y_i) = -\log \left\{ \frac{1}{N^B_{y_i}} \sum_{p \in A(y_i)} \frac{e^{z_i \cdot z_p/\tau}}{\sum_{j=1}^{K} \sum_{a \in A(j)} e^{z_i \cdot z_a/\tau}} \right\}, \quad (3)$$
where $A(j)$ is the set of indices of the instances in the batch $B \setminus \{(x_i, y_i)\}$ with the same label $j$, $N^B_{y_i} = |A(y_i)|$ is its cardinality, and $z$ denotes the normalized features of $x$ extracted by $F$:
$$z_i = \frac{F(x_i)}{\|F(x_i)\|}, \quad z_p = \frac{F(x_p)}{\|F(x_p)\|}, \quad z_a = \frac{F(x_a)}{\|F(x_a)\|}.$$
In addition, $\mathcal{L}^{sup}_{out}$ and $\mathcal{L}^{sup}_{in}$ differ in whether the sum over positive pairs is taken outside or inside the log. As demonstrated in [14], the two loss formulations are not equivalent, and Jensen's inequality [70] implies that $\mathcal{L}^{sup}_{in} \leq \mathcal{L}^{sup}_{out}$. Therefore, SCL adopts the latter as the loss function, since it is an upper bound of the former.

3.2 Probabilistic Contrastive Learning

As aforementioned, for any example in a batch, SCL considers other examples with the same labels as positive samples, while the rest are viewed as negative samples. Consequently, it is essential for the batch to contain an adequate amount of data to ensure each example receives appropriate supervision signals. Nevertheless, this requirement is inefficient, as a larger batch size often leads to significant computational and memory burdens. Furthermore, in practical machine learning scenarios, the data distribution typically exhibits a long-tail pattern, with infrequent sampling of the tail classes within the mini-batches. This particular characteristic necessitates further enlargement of the batches to effectively supervise the tail classes.

To address this issue, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the feature distribution and samples from it to construct contrastive pairs. Our method is inspired by [23], [24], which employ a normal distribution to model unconstrained features from the perspective of data augmentation and obtain an upper bound of the expected loss for optimization. However, the features in contrastive learning are constrained to the unit hypersphere, which makes it unsuitable to model them directly with a normal distribution. Moreover, due to the imbalanced distribution of training data, it is infeasible to estimate the distribution parameters of all classes in a small batch. Therefore, we introduce a simple and reasonable von Mises-Fisher distribution defined on the hypersphere, whose parameters can be efficiently estimated by maximum likelihood estimation across different batches. Furthermore, we rigorously derive a closed form of the expected SupCon loss rather than an upper bound for efficient optimization and apply it to semi-supervised learning.

Distribution Assumption. As previously mentioned, the features in contrastive learning are constrained to be on the unit hypersphere. Therefore, we assume that the features follow a mixture of von Mises–Fisher (vMF) distributions [44], which is often regarded as a generalization of the normal distribution to the hypersphere. The probability density function of the vMF distribution for a random $p$-dimensional unit vector $z$ is given by:
$$f_p(z; \mu, \kappa) = \frac{1}{C_p(\kappa)} e^{\kappa \mu^\top z}, \quad (4)$$
$$C_p(\kappa) = \frac{(2\pi)^{p/2} I_{p/2-1}(\kappa)}{\kappa^{p/2-1}}, \quad (5)$$
where $z$ is a $p$-dimensional unit vector, $\kappa \geq 0$, $\|\mu\|_2 = 1$, and $I_{p/2-1}$ denotes the modified Bessel function of the first kind at order $p/2-1$, which is defined as:
$$I_{p/2-1}(z) = \sum_{k=0}^{\infty} \frac{1}{k!\,\Gamma(p/2-1+k+1)} \left(\frac{z}{2}\right)^{2k+p/2-1}. \quad (6)$$
The parameters $\mu$ and $\kappa$ are referred to as the mean direction and concentration parameter, respectively. A higher concentration around the mean direction $\mu$ is observed with greater $\kappa$, and the distribution becomes uniform on the sphere when $\kappa = 0$.
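To make the distribution assumption concrete, the sketch below estimates the vMF parameters of a single class from its normalized features using only the first sample moment. It is an illustrative sketch rather than the released implementation: the paper's online estimation rules (Eqs. (8) and (10), referenced in Algorithm 1) are not reproduced in this excerpt, so the concentration is obtained with the standard closed-form approximation of Sra [71].

```python
import torch
import torch.nn.functional as F

def estimate_vmf_params(z: torch.Tensor, eps: float = 1e-6):
    """Moment-based vMF estimate for one class.

    z: (n, p) L2-normalized features belonging to a single class.
    Returns the mean direction mu (p,) and the concentration kappa (scalar).
    """
    p = z.shape[1]
    z_bar = z.mean(dim=0)                                   # first sample moment
    r_bar = z_bar.norm().clamp(min=eps, max=1.0 - eps)      # mean resultant length
    mu = F.normalize(z_bar, dim=0)                          # estimated mean direction
    kappa = r_bar * (p - r_bar ** 2) / (1.0 - r_bar ** 2)   # Sra's approximation of the MLE
    return mu, kappa

# Toy usage: 512 unit vectors in R^128 concentrated around a common direction.
feats = F.normalize(torch.randn(512, 128) + 2.0, dim=1)
mu_hat, kappa_hat = estimate_vmf_params(feats)
print(mu_hat.shape, float(kappa_hat))
```

In the online setting of Algorithm 1, the class-wise first moment would be maintained as a running statistic across batches rather than recomputed from a single batch.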
of the SCL, i.e., relying on large batch sizes (see Fig. 2). Furthermore, the assumption on the feature distribution and the estimation of its parameters can effectively capture the diversity of features among different classes, which enables our method to achieve stronger performance even without the sample-wise contrast of SCL (see Tab. 3).

Numerical Computation. Because PyTorch only provides GPU implementations of the zeroth- and first-order modified Bessel functions, one approach for efficiently computing the higher-order function in ProCo is to use the following recurrence relation:
$$I_{\nu+1}(\kappa) = I_{\nu-1}(\kappa) - \frac{2\nu}{\kappa} I_{\nu}(\kappa). \quad (21)$$
However, this method exhibits numerical instability when the value of $\kappa$ is not sufficiently large. Hence, we employ the Miller recurrence algorithm [72]. To compute $I_{p/2-1}(\kappa)$ in ProCo, we follow these steps: First, we assign the trial values 1 and 0 to $I_M(\kappa)$ and $I_{M+1}(\kappa)$, respectively. Here, $M$ is a chosen large positive integer, and in our experiments, we set $M = p$. Then, using the inverse recurrence relation:
$$I_{\nu-1}(\kappa) = \frac{2\nu}{\kappa} I_{\nu}(\kappa) + I_{\nu+1}(\kappa), \quad (22)$$
we can compute $I_{\nu}(\kappa)$ for $\nu = M-1, M-2, \cdots, 0$. The value of $I_{p/2-1}(\kappa)$ obtained from this process is denoted as $\tilde{I}_{p/2-1}(\kappa)$, and $I_0(\kappa)$ is denoted as $\tilde{I}_0(\kappa)$. Finally, we can compute $I_{p/2-1}(\kappa)$ as follows:
$$I_{p/2-1}(\kappa) = \frac{I_0(\kappa)}{\tilde{I}_0(\kappa)} \tilde{I}_{p/2-1}(\kappa). \quad (23)$$

Overall Objective. Following the common practice in long-tailed recognition [16], [17], [18], we adopt a two-branch design. The model consists of a classification branch based on a linear classifier $G(\cdot)$ and a representation branch based on a projection head $P(\cdot)$, which is an MLP that maps the representation to another feature space for decoupling from the classifier. Besides, a backbone network $F(\cdot)$ is shared by the two branches. For the classification branch and the representation branch, we adopt the simple and effective logit adjustment loss $\mathcal{L}_{\mathrm{LA}}$ and our proposed loss $\mathcal{L}_{\mathrm{ProCo}}$, respectively. Finally, the loss functions of the two branches are weighted and summed up as the overall loss function:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LA}} + \alpha \mathcal{L}_{\mathrm{ProCo}}, \quad (24)$$
where $\alpha$ is the weight of the representation branch.

In general, by introducing an additional feature branch during training, our method can be efficiently optimized with the stochastic gradient descent (SGD) algorithm along with the classification branch and does not introduce any extra overhead during inference.

Compatibility with Existing Methods. In particular, our approach is appealing in that it is a general and flexible framework. It can be easily combined with existing works applied to the classification branch, such as different loss functions, multi-expert frameworks, etc. (see Tab. 9).

The pseudo-code of our algorithm is shown in Algorithm 1.

Algorithm 1 The ProCo Algorithm.
1: Input: Training set $D$, loss weight $\alpha$
2: Randomly initialize the parameters $\Theta$ of backbone $F$, projection head $P$ and classifier $G$
3: for $t = 0$ to $T$ do
4:   Sample a mini-batch $\{x_i, y_i\}_{i=1}^{B}$ from $D$
5:   Compute $z_i = \frac{P(F(x_i))}{\|P(F(x_i))\|}$ and $G(F(x_i))$
6:   Estimate $\mu$ and $\kappa$ according to Eq. (8) and Eq. (10)
7:   Compute $\mathcal{L}$ according to Eq. (24)
8:   Update $\Theta$ with SGD
9: end for
10: Output: $\Theta$

3.3 Theoretical Error Analysis

To further explore the theoretical foundations of our approach, we establish an upper bound on the generalization error and excess risk for the ProCo loss, as defined in Eq. (20). For simplicity, our analysis focuses on the binary classification scenario, where the labels $y$ belong to the set $\{-1, +1\}$.

Assumption 1. $p\tau \gg 1$, with $\tau$ representing the temperature parameter and $p$ the dimensionality of the feature space.

Proposition 2 (Generalization Error Bound). Under Assumption 1, the following generalization bound is applicable with a probability of at least $1 - \delta/2$. For every class $y \in \{-1, 1\}$ and for estimated parameters $\hat{\mu}$ and $\hat{\kappa}$, the bound is expressed as:
$$\mathbb{E}_{z|y} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) - \frac{1}{N_y} \sum_i \mathcal{L}_{\mathrm{ProCo}}(y, z_i; \hat{\mu}, \hat{\kappa}) \leq \sqrt{\frac{2}{N_y} w^\top \Sigma_y w \ln \frac{2}{\delta}} + \frac{\ln(2/\delta)}{3 N_y} \log\!\left(1 + e^{\|w\|_2 - by}\right). \quad (25)$$
The generalization bound across all classes, with a probability of at least $1 - \delta$, is thus:
$$\mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) \leq \sum_{y \in \{-1,1\}} \frac{P(y)}{N_y} \sum_i \mathcal{L}_{\mathrm{ProCo}}(y, z_i; \hat{\mu}, \hat{\kappa}) + \sum_{y \in \{-1,1\}} P(y) \frac{\ln(2/\delta)}{3 N_y} \log\!\left(1 + e^{\|w\|_2 - by}\right) + \sum_{y \in \{-1,1\}} P(y) \sqrt{\frac{2}{N_y} w^\top \Sigma_y w \ln \frac{2}{\delta}}, \quad (26)$$
where $N_y$ denotes the number of samples in class $y$, $w = (\hat{\mu}_{+1} - \hat{\mu}_{-1})/\tau$, $b = \frac{1}{2\tau^2}\!\left(\frac{1}{\kappa_{+1}} - \frac{1}{\kappa_{-1}}\right) + \log \frac{\pi_{+1}}{\pi_{-1}}$, and $\Sigma_y$ is the covariance matrix of $z$ conditioned on $y$.

In our experimental setting, $\tau \approx 0.1$ and $p > 128$, thus Assumption 1 is reasonable in practice. Proposition 2 indicates that the generalization error gap is primarily controlled by the sample size and the data distribution variance. This finding corresponds to the insights from [13], [73], affirming that our method neither introduces extra factors into the error bound nor expands it. This theoretically assures the robust generalizability of our approach.

Furthermore, our approach relies on certain assumptions regarding the feature distribution and parameter estimation. To assess the influence of these parameters on model performance, we derive an excess risk bound, which is presented in Proposition 3.
Proposition 3 (Excess Risk Bound). Given Assumptions 1 and 2, the following excess risk bound holds:
$$\mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) - \mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \mu^\star, \kappa^\star) = O\!\left(\Delta_\mu + \Delta_{\frac{1}{\kappa}}\right), \quad (27)$$
where $\Delta_\mu = \hat{\mu} - \mu^\star$ and $\Delta_{\frac{1}{\kappa}} = \frac{1}{\hat{\kappa}} - \frac{1}{\kappa^\star}$.

Assumption 2 is the core assumption of our method. Building upon this, Proposition 3 demonstrates that the excess risk associated with our method is primarily governed by the first-order term of the estimation error in the parameters.

3.4 ProCo for Semi-supervised Learning

In order to further validate the effectiveness of our method, we also apply ProCo to semi-supervised learning. ProCo can be directly employed by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution inversely. In our implementation, we demonstrate that simply adopting a straightforward approach like FixMatch [66] to generate pseudo-labels results in superior performance. FixMatch's main concept lies in augmenting unlabeled data to produce two views and using the model's prediction on the weakly augmented view to generate a pseudo-label for the strongly augmented sample. Specifically, owing to the introduction of the feature distribution in our method, we can compute the ProCo loss of the weakly augmented view for each class to represent the posterior probability $P(y|z)$, thus enabling the generation of pseudo-labels.

3.5 Connection with Related Work

In the following, we discuss the connections between our method and related works on contrastive learning for long-tailed recognition. Recent studies proposed to incorporate class complements in the construction of positive and negative pairs. These methods ensure that all classes appear in every training iteration by introducing a learnable prototype $w_j$ for each class and contrasting each feature $z_i$ against all prototypes, where $w_j$ and $z_i$ are normalized to unit length and $\tau$ is the temperature parameter. This is equivalent to the cosine classifier (normalized linear activation) [75], a variant of the cross-entropy loss, which has been applied to long-tailed recognition algorithms [16], [76], [77]. Therefore, the sole introduction of the learnable parameter $w$ is analogous to the role played by the weight in the classification branch, which is further validated by the empirical results in Tab. 3.

The related works are summarized in Tab. 1. BatchFormer (BF) [74] is proposed to equip deep neural networks with the capability to investigate the sample relationships within each mini-batch. This is achieved by constructing a Transformer encoder network among the images, thereby uncovering the interrelationships among the images in the mini-batch. BatchFormer facilitates the propagation of gradients from each label to all images within the mini-batch, a concept that mirrors the approach used in contrastive learning.

Embedding and Logit Margins (ELM) [73] proposes to enforce both embedding and logit margins, analyzing in detail the benefit of introducing margin modification in contrastive learning. TSC [15] introduces class representations by pre-generating and online-matching uniformly distributed targets to each class during the training process. However, the targets do not have class semantics. Hybrid-PSC [18], PaCo [17], GPaCo [56] and BCL [16] all introduce learnable parameters as class representations. PaCo also enforces margin modification when constructing contrastive samples. DRO-LT [55] computes the centroid of each class and utilizes it as the class representation, which is the most relevant work to ours. The loss function and uncertainty radius in DRO-LT are devised by heuristic design from the metric learning and robustness perspectives. Moreover, DRO-LT considers a sample and its corresponding centroid as a positive pair. But in contrast to SCL, the other samples in the batch and the centroid are treated as negative pairs, disregarding the label information of the other samples, which is somewhat pessimistic.
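Before turning to the experiments, the pseudo-labeling step of Sec. 3.4 can be sketched as follows. This is our illustrative sketch, not the released implementation: the class scores use the asymptotic form of the ProCo loss established in Appendix A (a softmax over $\mu_k^\top z/\tau + 1/(2\tau^2\kappa_k) + \log\pi_k$), and the function name, confidence threshold, and toy tensors are assumptions.

```python
import torch
import torch.nn.functional as F

def proco_pseudo_labels(z_weak, mu, kappa, prior, tau=0.1, threshold=0.95):
    """FixMatch-style pseudo-labeling from the estimated class-wise vMF parameters.

    z_weak: (B, p) normalized features of weakly augmented views.
    mu: (K, p) mean directions, kappa: (K,) concentrations, prior: (K,) class frequencies.
    The softmax of the class scores approximates the posterior P(y|z).
    """
    logits = z_weak @ mu.t() / tau + 1.0 / (2 * tau ** 2 * kappa) + prior.log()
    probs = F.softmax(logits, dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = conf >= threshold          # only confident samples supervise the strong view
    return pseudo, mask

# Toy usage with K = 10 classes and 128-dimensional features.
K, p = 10, 128
mu = F.normalize(torch.randn(K, p), dim=1)
kappa = torch.full((K,), 30.0)
prior = torch.full((K,), 1.0 / K)
z = F.normalize(torch.randn(4, p), dim=1)
labels, mask = proco_pseudo_labels(z, mu, kappa, prior)
```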
4 EXPERIMENT

In this section, we validate the effectiveness of our method on supervised/semi-supervised learning. First, we conduct a range of analytical experiments to confirm our hypothesis and analyze each component of the method, including 1) performance of the representation branch, 2) comparison of more settings, 3) comparison between two formulations of the loss, 4) sensitivity analysis of hyper-parameters, and 5) data augmentation strategies. Subsequently, we compare our method with existing supervised learning methods on long-tailed datasets such as CIFAR-10/100-LT, ImageNet-LT, and iNaturalist 2018. Finally, experiments on balanced datasets, semi-supervised learning, and long-tailed object detection tasks are conducted to confirm the broad applicability of our method.

Fig. 2. Performance of the representation branch (top-1 accuracy of ProCo, SupCon, and BCL across batch sizes 8-256). We train the model for 200 epochs.

TABLE 2
Comparison of different class complements. EMA denotes exponential moving average.

Class Complement       Top-1 Acc.
EMA Prototype          51.6
Centroid Prototype     52.0
Normal Distribution    52.1
ProCo                  52.8

4.1 Dataset and Evaluation Protocol

We perform long-tailed image classification experiments on four prevalent long-tailed image classification datasets: CIFAR-10/100-LT, ImageNet-LT, and iNaturalist 2018. Following [11], [76], we partition all categories into three subsets based on the number of training samples: Many-shot (> 100 images), Medium-shot (20-100 images), and Few-shot (< 20 images). The top-1 accuracy is reported on the respective balanced validation sets. In addition, we conduct experiments on balanced image classification datasets and long-tailed object detection datasets to verify the broad applicability of our method. The effectiveness of instance segmentation is assessed using the mean Average Precision (APm) for mask predictions, calculated at varying Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 and aggregated across different categories. The AP values for rare, common, and frequent categories are represented as APr, APc, and APf, respectively, while the AP for detection boxes is denoted as APb.

CIFAR-10/100-LT. CIFAR-10-LT and CIFAR-100-LT are the long-tailed variants of the original CIFAR-10 and CIFAR-100 [78] datasets, which are derived by sampling the original training set. Following [10], [12], we sample the training sets of CIFAR-10 and CIFAR-100 with an exponential function $N_j = N \times \lambda^j$, where $\lambda \in (0, 1)$, $N$ is the size of the original training set, and $N_j$ is the sampling quantity for the $j$-th class. The original balanced validation sets of CIFAR-10 and CIFAR-100 are used for testing. The imbalance factor $\gamma = \max(N_j)/\min(N_j)$ is defined as the ratio of the number of samples in the most and the least frequent class. We set $\gamma$ to the typical values 10, 50, and 100 in our experiments.

ImageNet-LT. ImageNet-LT is proposed in [76] and is constructed by sampling a subset of ImageNet following the Pareto distribution with power value $\alpha_p = 6$. It consists of 115.8k images from 1000 categories with cardinality ranging from 1280 to 5.

iNaturalist 2018. iNaturalist 2018 [7] is a severely imbalanced large-scale dataset. It contains 437.5k images from 8142 classes, with an imbalance factor $\gamma = 500$ and cardinality ranging from 2 to 1000. In addition to long-tailed image classification, iNaturalist 2018 is also utilized for fine-grained image classification.

CUB-200-2011. The Caltech-UCSD Birds-200-2011 dataset [79] is a prominent resource for fine-grained visual categorization tasks. Comprising 11,788 images across 200 bird sub-categories, it is split into two sets: 5,994 images for training and 5,794 for testing.

LVIS v1. The Large Vocabulary Instance Segmentation (LVIS) dataset [80] is notable for its extensive categorization, encompassing 1,203 categories with high-quality instance mask annotations. LVIS v1 is divided into three splits: a training set with 100,000 images, a validation (val) set with 19,800 images, and a test-dev set, also with 19,800 images. Categories within the training set are classified based on their prevalence as rare (1-10 images), common (11-100 images), or frequent (over 100 images).

4.2 Implementation Details

For a fair comparison of long-tailed supervised image classification, we strictly follow the training setting of [16]. All models are trained using an SGD optimizer with momentum set to 0.9.

CIFAR-10/100-LT. We adopt ResNet-32 [2] as the backbone network. The representation branch has a projection head with an output dimension of 128 and a hidden layer dimension of 512. We set the temperature parameter $\tau$ for contrastive learning to 0.1. Following [16], [17], we apply AutoAug [85] and Cutout [86] as the data augmentation strategies.
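As a concrete illustration of the sampling protocol in Sec. 4.1, the sketch below builds the per-class counts $N_j = N\lambda^j$ for a given imbalance factor $\gamma$. It is our sketch: $N$ is interpreted here as the per-class size of the balanced training set (500 for CIFAR-100), which reproduces the familiar 500-to-5 head-to-tail profile at $\gamma = 100$; the rounding details of the released code may differ.

```python
def longtail_class_counts(n_per_class: int, num_classes: int, gamma: float):
    """Per-class sample counts N_j = N * lambda^j, with lambda chosen so that
    max(N_j) / min(N_j) equals the imbalance factor gamma."""
    lam = gamma ** (-1.0 / (num_classes - 1))
    return [max(1, round(n_per_class * lam ** j)) for j in range(num_classes)]

# CIFAR-100-LT with imbalance factor 100: head class keeps 500 images, tail class 5.
counts = longtail_class_counts(500, 100, 100)
print(counts[0], counts[-1], sum(counts))
```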
TABLE 3
Comparison of more settings. SC denotes supervised contrastive loss, CC denotes class complement, CR denotes calculable representation, and MM denotes margin modification.

TABLE 6
Top-1 accuracy of ResNet-32 on CIFAR-100-LT and CIFAR-10-LT. * denotes results borrowed from [81]. † denotes models trained in the same setting. We report the results of 200 epochs.

TABLE 8
Comparisons on ImageNet-LT and iNaturalist 2018 with different backbone networks. † denotes models trained in the same setting.

TABLE 9
Top-1 accuracy of ResNet-50 on ImageNet-LT and iNaturalist 2018. † and ‡ denote models trained in the same settings.

TABLE 10
Top-1 accuracy for ResNet-32 and ResNet-50 on balanced datasets. The ResNet-32 model is trained on CIFAR-100/10 for 200 epochs from scratch. For CUB-200-2011, the pre-trained ResNet-50 model is fine-tuned for 30 epochs.

               CIFAR-100   CIFAR-10   CUB-200-2011
CrossEntropy   71.5        93.4       81.6
ProCo          73.0        94.6       82.9

the effectiveness of our distribution-based class representation on datasets of large scale. Furthermore, Tab. 9 lists detailed results on more training settings for the ImageNet-LT dataset. ProCo has the most significant performance improvement on the tail categories. In addition to combining with typical classification branches, ProCo can also be combined with other methods to further improve model performance, such as different loss functions and model ensembling methods. We also report the results of combining with NCL [63]. NCL is a multi-expert method that utilizes distillation and hard category mining. ProCo demonstrates performance improvements for NCL.

iNaturalist 2018. Tab. 8 presents the experimental comparison of ProCo with existing methods on iNaturalist 2018 over 90 epochs. iNaturalist 2018 is a highly imbalanced large-scale dataset, thus making it ideal for studying the impact of imbalanced datasets on model performance. Under the same training setting, ProCo outperforms BCL by 1.7%. Furthermore, to facilitate a comparison with state-of-the-art methodologies, an extended training schedule of 400 epochs is conducted. The results in Tab. 9 indicate that ProCo is capable of effectively scaling to larger datasets and longer training schedules.

4.5 Balanced Supervised Image Classification

The foundational theory of our model is robust against data imbalance, meaning that the derivation of ProCo is unaffected by long-tailed distributions. In support of this, we also perform experiments on balanced datasets, as illustrated in Tab. 10. For the CIFAR-100/10 datasets, augmentation and training parameters identical to those used for CIFAR-100/10-LT are employed. Additionally, we expand our experimentation to the fine-grained classification dataset CUB-200-2011. These results demonstrate that while our method, primarily designed for imbalanced datasets, mitigates the inherent limitations of contrastive learning in such contexts, additional experiments also highlight its effectiveness on balanced training sets. This versatility underlines the strength of our method in addressing not only imbalances in $P(y)$ but also intra-class distribution variances in $P(z|y)$. These aspects correspond to the factors $N_y$ and $\Sigma_y$ in Proposition 2. Overall, the results imply the broad utility of our approach across diverse datasets.

4.6 Long-Tailed Semi-Supervised Image Classification

We present the experimental results of semi-supervised learning in Tab. 11. FixMatch [66] is employed as the foundational framework to generate pseudo-labels, and its effectiveness is assessed in comparison to other methods in long-tailed semi-supervised learning. We mainly follow the setting of DASO [69], except for substituting the semantic classifier based on the centroid prototype of each class with our representation branch for training. ProCo outperforms DASO across various levels of data imbalance and dataset sizes while maintaining the same training conditions. Specifically, in the case of higher data imbalance (γ = 20) and a smaller amount of labeled data (N1 = 50), our proposed method exhibits a significant performance enhancement (with LA) of up to 2.8% when compared to DASO.

4.7 Long-Tailed Object Detection

In addition to image classification, we extend ProCo to object detection tasks. Specifically, we utilize Faster R-CNN [4] and Mask R-CNN [93] as foundational frameworks, integrating our proposed ProCo loss into the box classification branch. This method was implemented using
TABLE 11
Comparison of accuracy (%) on CIFAR100-LT under the γl = γu setup. γl and γu are the imbalance factors for labeled and unlabeled data, respectively. N1 and M1 are the sizes of the most frequent class in the labeled data and unlabeled data, respectively. LA denotes the Logit Adjustment method [13]. † denotes models trained in the same setting.

                           γ = γl = γu = 10                 γ = γl = γu = 20
Method                     N1=50, M1=400  N1=150, M1=300    N1=50, M1=400  N1=150, M1=300
Supervised                 29.6           46.9              25.1           41.2
 w/ LA [13]                30.2           48.7              26.5           44.1
FixMatch [66]              45.2           56.5              40.0           50.7
 w/ DARP [67]              49.4           58.1              43.4           52.2
 w/ CReST+ [68]            44.5           57.4              40.1           52.1
 w/ DASO† [69]             49.8           59.2              43.6           52.9
 w/ ProCo†                 50.9           60.2              44.8           54.8
FixMatch [66] + LA [13]    47.3           58.6              41.4           53.4
 w/ DARP [67]              50.5           59.9              44.4           53.8
 w/ CReST+ [68]            44.0           57.1              40.6           52.3
 w/ DASO† [69]             50.7           60.6              44.1           55.1
 w/ ProCo†                 52.1           61.3              46.9           55.9
TABLE 12
Results on different frameworks with ResNet-50 backbone on LVIS v1. We conduct experiments with the 1x schedule.

                    ProCo   APb    APr    APc    APf    APm
Faster R-CNN [4]    ✗       22.1   9.0    21.0   29.2   –
                    ✓       24.7   15.5   24.2   29.3   –
Mask R-CNN [93]     ✗       22.5   9.1    21.1   30.1   21.7
                    ✓       25.2   16.1   24.5   30.0   24.7

mmdetection [94], adhering to the training settings of the original baselines. As depicted in Tab. 12, our approach yields noticeable enhancements on the LVIS v1 dataset, with both Faster R-CNN and Mask R-CNN demonstrating improved performance across various categories.

5 CONCLUSION

In this paper, we proposed a novel probabilistic contrastive (ProCo) learning algorithm for long-tailed distributions. Specifically, we employed a reasonable and straightforward von Mises-Fisher distribution to model the normalized feature space of samples in the context of contrastive learning. This choice offers two key advantages. First, it is efficient to estimate the distribution parameters across different batches by maximum likelihood estimation. Second, we derived a closed form of the expected supervised contrastive loss for efficient optimization, which eliminates the inherent limitation of supervised contrastive learning that requires a large number of samples to achieve satisfactory performance. Furthermore, ProCo can be directly applied to semi-supervised learning by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution inversely. We have proven the error bound of ProCo theoretically. Extensive experimental results on various classification and object detection datasets demonstrate the effectiveness of the proposed algorithm.

APPENDIX A
PROOF OF PROPOSITION 2

Before presenting the proof of Proposition 2, we introduce several lemmas essential for the subsequent argument. We begin by proving the asymptotic expansion of the ProCo loss.

Lemma 1 (Asymptotic expansion). The ProCo loss satisfies the following asymptotic expansion under Assumption 1:
$$\mathcal{L}_{\mathrm{ProCo}}(y, z) \sim -\log \frac{\pi_y e^{\mu_y^\top z/\tau + 1/(2\tau^2\kappa_y)}}{\sum_{j=1}^{K} \pi_j e^{\mu_j^\top z/\tau + 1/(2\tau^2\kappa_j)}}.$$

Proof. Recall that the ProCo loss is defined as:
$$\mathcal{L}_{\mathrm{ProCo}} = -\log \pi_{y_i} \frac{C_p(\tilde{\kappa}_{y_i})}{C_p(\kappa_{y_i})} + \log \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)},$$
where $C_p(\kappa)$ is the normalizing constant of the von Mises-Fisher distribution, which is given by
$$C_p(\kappa) = \frac{(2\pi)^{p/2} I_{p/2-1}(\kappa)}{\kappa^{p/2-1}}, \qquad \tilde{\kappa}_j = \|\kappa_j \mu_j + z_i/\tau\|_2.$$
Therefore, we aim to demonstrate that:
$$\frac{e^{\tilde{\kappa}_j} \kappa_j^{(p-1)/2}}{e^{\kappa_j} \tilde{\kappa}_j^{(p-1)/2}} \sim e^{\mu_j^\top z_i/\tau + 1/(2\tau^2\kappa_j)}.$$
According to the calculation formula, the parameter $\kappa$ is computed by $\frac{\bar{R}(p - \bar{R}^2)}{1 - \bar{R}^2}$. During the training process, it is observed that $\frac{\bar{R}}{1 - \bar{R}^2} \gg 1$. Consequently, this implies that $\kappa \gg p$. Referring to the asymptotic expansion of the modified Bessel function of the first kind for large $\kappa$ relative to $p$ [72], we have
$$I_{p/2-1}(\kappa) \sim \frac{e^{\kappa}}{\sqrt{2\pi\kappa}}$$
2πκ
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13
and
! We are now ready to demonstrate the validity of Propo-
(p−1)/2 (p−1)/2 (p − 1)µ⊤ j zi p−1 sition 2.
κ̃j ∼ κj 1+ + 2 2 .
2κj τ 4κj τ
Proof. First, we examine the class-conditional ProCo loss,
Given κ ≫ p, we have: denoted as Ez|y LProCo (y, z). For a class label y ∈ {−1, 1},
according to Lemma 2, we establish that with a probability
(p−1)/2
κj of at least 1 − 2δ , the following inequality holds:
(p−1)/2
∼ 1.
κ̃j 1 X
Ez|y LProCo (y, z) − LProCo (y, z)
Consequently, we establish that: Ny i
s
(p−1)/2 2Vz|y [LProCo (y, z)] ln 2/δ B ln(2/δ)
eκ̃j κj µ⊤ 2 ≤ + .
∼e j zi /τ +1/(2τ κj ) Ny 3Ny
eκj κ̃(p−1)/2
j
Incorporating Lemma 3 and Lemma 1, we obtain:
1 X
Ez|y LProCo (y, z) − LProCo (y, z)
Lemma 2 (Bennett’s inequality [95]). Let Z1 , . . . , Zn be i.i.d. Ny i
random variables with values in [0, B] and let δ > 0. Then, with s
2Vz|y [Llin (y, z)] ln 2/δ ln(2/δ)
probability at least 1 − δ in (Z1 , . . . , Zn ), ≤ + log(1 + e||w||2 −by )
s Ny 3Ny
n
1X 2VZ ln 1/δ B ln(1/δ)
EZ − Zi ≤ + ,
n i=1 n 3n
where Llin (y, z) is defined as
where VZ is the variance of Z . !
(µ+1 − µ−1 )⊤ z κ−1 − κ+1 π+1
Lemma 3 (Variance inequality). Let Llog (y, z) :=
−y + 2 + log .
τ 2τ κ−1 κ+1 π−1
−y(w⊤ z+b)
log 1 + e and Llin (y, z) := −y(w⊤ z + b). Then,
for any y ∈ {−1, 1}, Moreover, the variance Vz|y [Llin (y, z)] is computed as:
Vz|y [Llog (y, z)] ≤ Vz|y [Llin (y, z)]. Vz|y [Llin (y, z)] = Vz|y [(µ+1 − µ−1 )⊤ z/τ )]
= (µ+1 − µ−1 )⊤ Σy (µ+1 − µ−1 )/τ 2 ,
Proof. Consider the function fy (z) := log(1 + e−yz ),
where y ∈ {−1, 1}. The derivative fy′ (z) is given by where Σy represents the covariance matrix of z conditioned
−yz
ye
fy′ (z) = − 1+e ′
−yz , which implies that supz |fy (z)| ≤ 1.
on y . Consequently, We have thus completed the proof for
Consequently, fy is a 1-Lipschitz function which satisfies the conditional distribution’s error bound as follows:
the following inequality: 1 X
Ez|y LProCo (y, z) − LProCo (y, z)
Ny i
|fy (z) − fy (z ′ )| ≤ |z − z ′ |, ∀z, z ′ ∈ R. s
2 ⊤ 2 ln(2/δ)
Regarding the variance of any real-valued function h, it is ≤ w (Σy )w ln + log(1 + e||w||2 −by ),
Ny δ 3Ny
defined as follows:
where w = (µ+1 − µ−1 )/τ .
V[h(z)] = E[h2 (z)] − (E[h(z)])2
To extend this to the generalization bound across all
≤ E[h2 (z)]. classes, we apply the union bound. Consequently, with a
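To connect the loss recalled in Appendix A back to code, the following NumPy/SciPy sketch evaluates the exact ProCo loss of one sample from the class-wise vMF parameters; SciPy is used because it exposes the arbitrary-order, exponentially scaled Bessel function on CPU, complementing the Miller recurrence used for GPU computation. It is an illustrative sketch under the definitions above, not the released implementation, and the toy tensors are assumptions.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel I_v(x) * exp(-x)

def log_cp_ratio(p, kappa_tilde, kappa):
    """log [ C_p(kappa_tilde) / C_p(kappa) ] with C_p from Eq. (5), computed stably
    via the scaled Bessel function: log I_v(x) = log ive(v, x) + x."""
    v = p / 2.0 - 1.0
    log_iv_tilde = np.log(ive(v, kappa_tilde)) + kappa_tilde
    log_iv = np.log(ive(v, kappa)) + kappa
    return (log_iv_tilde - log_iv) - v * (np.log(kappa_tilde) - np.log(kappa))

def proco_loss_exact(z, y, mu, kappa, prior, tau=0.1):
    """ProCo loss of a single sample, following the definition recalled in Appendix A:
    L = -log pi_y C_p(k~_y)/C_p(k_y) + log sum_j pi_j C_p(k~_j)/C_p(k_j),
    with k~_j = || kappa_j mu_j + z / tau ||_2."""
    p = z.shape[0]
    kappa_tilde = np.linalg.norm(kappa[:, None] * mu + z[None, :] / tau, axis=1)
    scores = np.log(prior) + log_cp_ratio(p, kappa_tilde, kappa)   # per-class log-weights
    m = scores.max()                                               # stable log-sum-exp
    return -(scores[y] - (m + np.log(np.exp(scores - m).sum())))

# Toy usage: K = 5 classes, p = 16 dimensions.
rng = np.random.default_rng(0)
mu = rng.normal(size=(5, 16)); mu /= np.linalg.norm(mu, axis=1, keepdims=True)
kappa = np.full(5, 40.0); prior = np.full(5, 0.2)
z = rng.normal(size=16); z /= np.linalg.norm(z)
print(proco_loss_exact(z, 2, mu, kappa, prior))
```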
[27] T. Miyato, S.-I. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[28] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in NeurIPS, 2014.
[29] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, "Semi-supervised learning with ladder networks," in NeurIPS, 2015.
[30] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: one-sided selection," in ICML, 1997.
[31] B. C. Wallace, K. Small, C. E. Brodley, and T. A. Trikalinos, "Class imbalance, redux," in International Conference on Data Mining, 2011.
[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 2002.
[33] C. Huang, Y. Li, C. C. Loy, and X. Tang, "Learning deep representation for imbalanced classification," in CVPR, 2016.
[34] A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla, "On the statistical consistency of algorithms for binary classification under class imbalance," in ICML, 2013.
[35] B. Kim and J. Kim, "Adjusting decision boundary for class imbalanced learning," IEEE Access, 2020.
[36] J. Zhang, L. Liu, P. Wang, and C. Shen, "To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions," arXiv preprint, 2019.
[37] J. Ren, C. Yu, X. Ma, H. Zhao, and S. Yi, "Balanced meta-softmax for long-tailed visual recognition," in NeurIPS, 2020.
[38] J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, and J. Yan, "Equalization loss for long-tailed object recognition," in CVPR, 2020.
[39] P. Chu, X. Bian, S. Liu, and H. Ling, "Feature space augmentation for long-tailed data," in ECCV, 2020.
[40] Y. Zang, C. Huang, and C. C. Loy, "FASA: Feature augmentation and sampling adaptation for long-tailed instance segmentation," in ICCV, 2021.
[41] Y. Wang, Z. Ni, S. Song, L. Yang, and G. Huang, "Revisiting locally supervised learning: an alternative to end-to-end training," in ICLR, 2021.
[42] Y. Wang, Y. Yue, R. Lu, T. Liu, Z. Zhong, S. Song, and G. Huang, "EfficientTrain: Exploring generalized curriculum learning for training visual backbones," in ICCV, 2023.
[43] G. Huang, Y. Wang, K. Lv, H. Jiang, W. Huang, P. Qi, and S. Song, "Glance and focus networks for dynamic visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4605–4621, 2022.
[44] K. V. Mardia, P. E. Jupp, and K. Mardia, Directional statistics. Wiley Online Library, 2000.
[45] X. Zhe, S. Chen, and H. Yan, "Directional statistics-based deep metric learning for image classification and retrieval," Pattern Recognition, 2019.
[46] K. Roth, O. Vinyals, and Z. Akata, "Non-isotropy regularization for proxy-based deep metric learning," in CVPR, 2022.
[47] T. R. Scott, A. C. Gallagher, and M. C. Mozer, "von Mises-Fisher loss: An exploration of embedding geometries for supervised learning," in ICCV, 2021.
[48] S. Li, J. Xu, X. Xu, P. Shen, S. Li, and B. Hooi, "Spherical confidence learning for face recognition," in CVPR, 2021.
[49] A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, 2005.
[50] H. Wang, S. Fu, X. He, H. Fang, Z. Liu, and H. Hu, "Towards calibrated hyper-sphere representation via distribution overlap coefficient for long-tailed learning," in ECCV, 2022.
[51] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in ICML, 2020.
[52] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. D. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, "Bootstrap Your Own Latent - a new approach to self-supervised learning," in NeurIPS, 2020.
[53] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," in ECCV, 2020.
[54] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in CVPR, 2019.
[55] D. Samuel and G. Chechik, "Distributional robustness loss for long-tail learning," in ICCV, 2021.
[56] J. Cui, Z. Zhong, Z. Tian, S. Liu, B. Yu, and J. Jia, "Generalized parametric contrastive learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[57] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
[58] C. Tian, W. Wang, X. Zhu, J. Dai, and Y. Qiao, "VL-LTR: Learning class-wise visual-linguistic representation for long-tailed visual recognition," in ECCV, 2022.
[59] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint, 2015.
[60] L. Xiang, G. Ding, and J. Han, "Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification," in ECCV, 2020.
[61] X. Wang, L. Lian, Z. Miao, Z. Liu, and S. X. Yu, "Long-tailed recognition by routing diverse distribution-aware experts," in ICLR, 2020.
[62] Y.-Y. He, J. Wu, and X.-S. Wei, "Distilling virtual examples for long-tailed recognition," in ICCV, 2021, pp. 235–244.
[63] J. Li, Z. Tan, J. Wan, Z. Lei, and G. Guo, "Nested collaborative learning for long-tailed visual recognition," in CVPR, 2022.
[64] B. Zhu, Y. Niu, X.-S. Hua, and H. Zhang, "Cross-domain empirical risk minimization for unbiased long-tailed classification," in AAAI, 2022.
[65] G. Huang and C. Du, "The high separation probability assumption for semi-supervised learning," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022.
[66] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," in NeurIPS, 2020.
[67] J. Kim, Y. Hur, S. Park, E. Yang, S. J. Hwang, and J. Shin, "Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning," in NeurIPS, 2020.
[68] C. Wei, K. Sohn, C. Mellina, A. Yuille, and F. Yang, "CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning," in CVPR, 2021.
[69] Y. Oh, D.-J. Kim, and I. S. Kweon, "DASO: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning," in CVPR, 2022.
[70] J. L. W. V. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes," Acta Mathematica, 1906.
[71] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of Is(x)," Computational Statistics, 2012.
[72] M. Abramowitz, I. A. Stegun et al., Handbook of Mathematical Functions, 1964.
[73] W. Jitkrittum, A. K. Menon, A. S. Rawat, and S. Kumar, "ELM: Embedding and logit margins for long-tail learning," arXiv preprint, 2022.
[74] Z. Hou, B. Yu, and D. Tao, "BatchFormer: Learning to explore sample relationships for robust representation learning," in CVPR, 2022.
[75] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in CVPR, 2018.
[76] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu, "Open long-tailed recognition in a dynamic world," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[77] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, and D. Lin, "Seesaw loss for long-tailed instance segmentation," in CVPR, 2021.
[78] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009.
[79] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," 2011.
[80] A. Gupta, P. Dollar, and R. Girshick, "LVIS: A dataset for large vocabulary instance segmentation," in CVPR, 2019, pp. 5356–5364.
[81] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen, "BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition," in CVPR, 2020.
[82] Y. Yang and Z. Xu, "Rethinking the value of labels for improving class-imbalanced learning," in NeurIPS, 2020.
[83] K. Tang, J. Huang, and H. Zhang, "Long-tailed classification by keeping the good and removing the bad momentum causal effect," in NeurIPS, 2020.
[84] J. Cui, S. Liu, Z. Tian, Z. Zhong, and J. Jia, "ResLT: Residual learning for long-tailed recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[85] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "AutoAugment: Learning augmentation strategies from data," in CVPR, 2019.
[86] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint, 2017.
[87] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, "RandAugment: Practical automated data augmentation with a reduced search space," in CVPR Workshops, 2020.
[88] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017.
[89] S. Zhang, Z. Li, S. Yan, X. He, and J. Sun, "Distribution alignment: A unified framework for long-tail visual recognition," in CVPR, 2021.
[90] T. Li, L. Wang, and G. Wu, "Self supervision to distillation for long-tailed visual recognition," in ICCV, 2021.
[91] A. M. H. Tiong, J. Li, G. Lin, B. Li, C. Xiong, and S. C. Hoi, "Improving tail-class representation with centroid contrastive learning," Pattern Recognition Letters, 2023.
[92] M. Li, Y.-M. Cheung, and Y. Lu, "Long-tailed visual recognition via gaussian clouded logit adjustment," in CVPR, 2022.
[93] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017, pp. 2961–2969.
[94] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open MMLab detection toolbox and benchmark," arXiv preprint, 2019.
[95] A. Maurer and M. Pontil, "Empirical Bernstein bounds and sample-variance penalization," in Annual Conference on Computational Learning Theory, 2009.