
Probabilistic Contrastive Learning for Long-Tailed Visual Recognition
Chaoqun Du, Yulin Wang, Shiji Song, and Gao Huang

Abstract—Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such an imbalance considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. However, the performance of supervised contrastive learning is plagued by an inherent challenge: it necessitates sufficiently large batches of training data to construct contrastive pairs that cover all categories, yet this requirement is difficult to meet in the context of class-imbalanced data. To overcome this obstacle, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space and samples contrastive pairs accordingly. In fact, estimating the distributions of all classes using features in a small batch, particularly for imbalanced data, is not feasible. Our key idea is to introduce a reasonable and simple assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions on the unit sphere, which brings two-fold benefits. First, the distribution parameters can be estimated using only the first sample moment, which can be efficiently computed in an online manner across different batches. Second, the vMF distribution allows us to sample an infinite number of contrastive pairs from the estimated distribution and to derive a closed form of the expected contrastive loss for efficient optimization. Beyond long-tailed problems, ProCo can be directly applied to semi-supervised learning by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution of the samples inversely. Theoretically, we analyze the error bound of ProCo. Empirically, extensive experimental results on supervised/semi-supervised visual recognition and object detection tasks demonstrate that ProCo consistently outperforms existing methods across various datasets. Our code is available at https://github.com/LeapLabTHU/ProCo.

Index Terms—Long-Tailed Visual Recognition, Contrastive Learning, Representation Learning, Semi-Supervised Learning.

1 INTRODUCTION

With the accelerated progress in deep learning and the emergence of large, well-organized datasets, significant advancements have been made in computer vision tasks, including image classification [1], [2], [3], object detection [4], and semantic segmentation [5]. The meticulous annotation of these datasets ensures a balance among various categories during their development [6]. Nevertheless, in practical applications, acquiring comprehensive datasets that are both balanced and encompass all conceivable scenarios remains a challenge. Data distribution in real-world contexts often adheres to a long-tail pattern, characterized by an exponential decline in the number of samples per class from head to tail [7]. This data imbalance presents a considerable challenge for training deep models, as their ability to generalize to infrequent categories may be hindered due to the limited training data available for these classes. Furthermore, class imbalance can induce a bias towards dominant classes, leading to suboptimal performance on minority classes [8], [9]. Consequently, addressing the long-tail distribution issue is crucial for the successful application of computer vision tasks in real-world scenarios.

In tackling the long-tailed data conundrum, a multitude of algorithms have been developed by adapting the traditional cross-entropy learning objective [10], [11], [12], [13]. Recent studies, however, have unveiled that supervised contrastive learning (SCL) [14] may serve as a more suitable optimization target in terms of resilience against long-tail distribution [15], [16]. More precisely, SCL deliberately integrates label information into the formulation of positive and negative pairs for the contrastive loss function. Unlike self-supervised learning, which generates positive samples through data augmentation of the anchor, SCL constructs positive samples from the same class as the anchor. Notably, initial exploration of this approach has already yielded performance surpassing most competitive algorithms designed for long-tail distribution [17], [18], [19].

Despite its merits, SCL still suffers from an inherent limitation. To guarantee performance, SCL necessitates a considerable batch size for generating sufficient contrastive pairs [14], resulting in a substantial computational and memory overhead. Notably, this issue becomes more pronounced with long-tailed data in real-world settings, where tail classes are infrequently sampled within a mini-batch or memory bank. Consequently, the loss function's gradient is predominantly influenced by head classes, leading to a lack of information from tail classes and an inherent tendency for the model to concentrate on head classes while disregarding tail classes [16], [17]. As an example, in the ImageNet-LT dataset, a typical batch size of 4096 and memory size of 8192 yield an average of fewer than one sample per mini-batch or memory bank for 212 and 89 classes, respectively.

• C. Du, Y. Wang, S. Song, and G. Huang are with the Department of Automation, BNRist, Tsinghua University, Beijing 100084, China. Email: {dcq20, wang-yl19}@mails.tsinghua.edu.cn. Corresponding author: Gao Huang.
In this study, we address the aforementioned issue with a simple but effective solution. Our primary insight involves considering the sampling of an infinite number of contrastive pairs from the actual data distribution and solving the expected loss to determine the optimization objective. By directly estimating and minimizing the expectation, the need for maintaining a large batch size is obviated. Moreover, since all classes are theoretically equivalent in the expectation, the long-tail distribution problem in real-world data is naturally alleviated.

However, implementing our idea is not straightforward due to two obstacles: 1) the methodology for modeling the actual data distribution is typically complex, e.g., training deep generative models [20], [21], [22]; and 2) calculating the expected training loss in a closed form is difficult. In this paper, we simultaneously tackle both challenges by proposing a novel probabilistic contrastive learning algorithm, as illustrated in Fig. 1. Our method is inspired by the intriguing observation that deep features generally contain rich semantic information, enabling their statistics to represent the intra-class and inter-class variations of the data [23], [24], [25]. These methods model unconstrained features with a normal distribution from the perspective of data augmentation and obtain an upper bound of the expected cross-entropy loss for optimization. However, due to the normalization of features in contrastive learning, direct modeling with a normal distribution is not feasible. In addition, it is not possible to estimate the distributions of all classes in a small batch for long-tailed data. Therefore, we adopt a reasonable and simple von Mises-Fisher distribution on the unit sphere in R^n to model the feature distribution, which is commonly considered as an extension of the normal distribution to the hypersphere. It brings two advantages: 1) the distribution parameters can be estimated by maximum likelihood estimation using only the first sample moment, which can be efficiently computed across different batches during the training process; and 2) building upon this formulation, we theoretically demonstrate that a closed form of the expected loss can be rigorously derived as the sampling number approaches infinity rather than an upper bound, which we designate as the ProCo loss. This enables us to circumvent the necessity of explicitly sampling numerous contrastive pairs and instead minimize a surrogate loss function, which can be efficiently optimized and does not introduce any extra overhead during inference.

Fig. 1. Illustration of Probabilistic Contrastive Learning. ProCo estimates the distribution of samples based on the features from different batches and samples contrastive pairs from it. Moreover, a closed form of expected contrastive loss is derived by sampling an infinite number of contrastive pairs, which eliminates the inherent limitation of SCL on large batch sizes.

Furthermore, we extend the application of the proposed ProCo algorithm to more realistic imbalanced semi-supervised learning scenarios, where only a small portion of the entire training data set possesses labels [26], [27], [28], [29]. Semi-supervised algorithms typically employ a strategy that generates pseudo-labels for unlabeled data based on the model's predictions, using these labels to regularize the model's training process. Consequently, the ProCo algorithm can be directly applied to unlabeled samples by generating pseudo-labels grounded in the ProCo loss, which can subsequently be used to estimate the feature distribution inversely.

Despite its simplicity, the proposed ProCo algorithm demonstrates consistent effectiveness. We perform extensive empirical evaluations on supervised/semi-supervised image classification and object detection tasks with CIFAR-10/100-LT, ImageNet-LT, iNaturalist 2018, and LVIS v1. The results demonstrate that ProCo consistently improves the generalization performance of existing competitive long-tailed recognition methods. Furthermore, since ProCo is independent of imbalanced class distribution theoretically, experiments are also conducted on balanced datasets. The results indicate that ProCo achieves enhanced performance on balanced datasets as well.

The primary contributions of this study are outlined as follows:

• We propose a novel probabilistic contrastive learning (ProCo) algorithm for long-tailed recognition problems. By adopting a reasonable and simple von Mises-Fisher (vMF) distribution to model the feature distribution, we can estimate the parameters across different batches efficiently. ProCo eliminates the inherent limitation of SCL on large batch size by sampling contrastive pairs from the estimated distribution, particularly when dealing with imbalanced data (see Fig. 2).

• We derive a closed form of expected supervised contrastive loss based on the estimated vMF distribution when the sampling number tends to infinity and theoretically analyze the error bound of the ProCo loss (see Sec. 3.3). This approach eliminates the requirement of explicitly sampling numerous contrastive pairs and, instead, focuses on minimizing a surrogate loss function. The surrogate loss function can be efficiently optimized without introducing any additional overhead during inference.

• We employ the proposed ProCo algorithm to address imbalanced semi-supervised learning scenarios in a more realistic manner. This involves generating pseudo-labels based on the ProCo loss. Subsequently, we conduct comprehensive experiments on various classification datasets to showcase the efficacy of the ProCo algorithm.
2 RELATED WORK

Long-tailed Recognition. To address the long-tailed recognition problem, early rebalancing methods can be classified into two categories: re-sampling and re-weighting. Re-sampling techniques aid in the acquisition of knowledge pertaining to tail classes by adjusting the imbalanced distribution of training data through either undersampling [30], [31] or oversampling [32]. Re-weighting methods adapt the loss function to promote greater gradient contributions for tail classes [33], [34] and even individual samples [10]. Nevertheless, Kang et al. [11] demonstrated that strong long-tailed recognition performance can be attained by merely modifying the classifier, without rebalancing techniques. Furthermore, post-hoc normalisation of the classifier weights [11], [35], [36] and loss margin modification [12], [13], [37], [38] have been two effective and prevalent methods. Post-hoc normalisation is motivated by the observation that the classifier weight norm tends to correlate with the class distribution, which can be corrected by normalising the weights. Loss margin modification methods incorporate prior information of the class distribution into the loss function by adjusting the classifier's margin. Logit Adjustment [13] and Balanced Softmax [37] deduce, from a probabilistic perspective, that the classifier's decision boundary for each class corresponds to the log of the prior probability in the training data, which is demonstrated to be a straightforward and effective technique. Moreover, another common technique involves augmenting the minority classes by data augmentation techniques [2], [3], [23], [24], [39], [40], [41], [42], [43]. MetaSAug [24] employs meta-learning to estimate the variance of the feature distribution for each class and utilizes it as a semantic direction for augmenting a single sample, which is inspired by implicit semantic data augmentation (ISDA) [23]. ISDA employs a normal distribution to model unconstrained features for data augmentation and obtains an upper bound of the expected cross-entropy loss, which is related to our method. Nevertheless, the normalization of features makes such direct modeling with a normal distribution infeasible in contrastive learning. Hence, we adopt a mixture of von Mises-Fisher distributions on the unit sphere, allowing us to derive a closed form of the expected contrastive loss rather than an upper bound. The vMF distribution [44] is a fundamental probability distribution on the unit hypersphere S^{p-1} in R^p, which has been successfully used in deep metric learning [45], [46], supervised learning [47], [48], and unsupervised learning [49]. A recent study [50] introduces a classifier that utilizes the von Mises-Fisher distribution to address long-tailed recognition problems. Although this approach exhibits similarities to our method in terms of employing the vMF distribution, it specifically emphasizes the quality of representation for classifiers and features, considering the distribution overlap coefficient.

Contrastive Learning for Long-tailed Recognition. Recently, researchers have employed contrastive learning to tackle the challenge of long-tailed recognition. Contrastive learning is a self-supervised learning approach that leverages a contrastive loss function to learn a more discriminative representation of the data by maximizing the similarity between positive and negative samples [51], [52], [53], [54]. Khosla et al. [14] extended contrastive learning to the supervised contrastive learning (SCL) paradigm by incorporating label information. However, due to the imbalance of positive and negative samples, contrastive learning also faces the problem that the model over-focuses on head categories in long-tailed recognition [9], [17], [19]. To balance the feature space, KCL [19] uses the same number of positive pairs for all the classes. Recent studies [15], [16], [17], [18], [55], [56] have proposed to introduce class complements for constructing positive and negative pairs. These approaches ensure that all classes appear in every training iteration to re-balance the distribution of contrast samples. A comprehensive comparison is provided in Sec. 3.5.

Furthermore, recent advancements in multi-modal foundation models based on contrastive learning, such as CLIP [57], have demonstrated remarkable generalization capabilities across various downstream tasks. Inspired by this, researchers have begun to incorporate multi-modal foundation models into long-tail recognition tasks. VL-LTR [58] develops a class-level visual-linguistic pre-training approach to associate images and textual descriptions at the class level and introduces a language-guided recognition head, effectively leveraging visual-linguistic representations for enhanced visual recognition.

Knowledge Distillation for Long-tailed Recognition. Knowledge distillation involves training a student model using the outputs of a well-trained teacher model [59]. This approach has been increasingly applied to long-tailed learning. For instance, LFME [60] trains multiple experts on various, less imbalanced sample subsets (e.g., head, middle, and tail sets), subsequently distilling these experts into a unified student model. In a similar vein, RIDE [61] introduces a knowledge distillation method to streamline the multi-expert model by developing a student network with fewer experts. Differing from the multi-expert paradigm, DiVE [62] demonstrates the efficacy of using a class-balanced model as the teacher for enhancing long-tailed learning. NCL [63] incorporates two main components: Nested Individual Learning (NIL) and Nested Balanced Online Distillation (NBOD). NIL focuses on the individual supervised learning for each expert, while NBOD facilitates knowledge transfer among multiple experts. Lastly, xERM [64] aims to develop an unbiased, test-agnostic model for long-tailed classification. Grounded in causal theory, xERM seeks to mitigate bias by minimizing cross-domain empirical risk.

Imbalanced Semi-supervised Learning (SSL). Semi-supervised learning is a subfield of machine learning that addresses scenarios where labeled training samples are limited, but an extensive amount of unlabeled data is available [26], [27], [28], [29], [65]. This scenario is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data. The main approach in SSL is leveraging labeled data to generate pseudo-labels for unlabeled data, and then training the model with both pseudo-labeled and labeled data [66]. In addition, consistency regularization or the cluster assumption can be combined to further constrain the distribution of unlabeled data. For long-tailed datasets, due to class imbalance, SSL methods will be biased towards head classes when generating pseudo-labels for unlabeled data. Recently, researchers have proposed some methods to address the problem of pseudo-label generation in imbalanced SSL.
DARP [67] is proposed to softly refine the pseudo-labels generated from a biased model by formulating a convex optimization problem. CReST [68] adopts an iterative approach to retrain the model by continually incorporating pseudo-labeled samples. DASO [69] focuses on the unknown distribution of the unlabeled data and blends the linear and semantic pseudo-labels in different proportions for each class to reduce the overall bias.

3 METHOD

3.1 Preliminaries

In this subsection, we start by presenting the preliminaries, laying the basis for introducing our method. Consider a standard image recognition problem. Given the training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, the model is trained to map the images from the space $\mathcal{X}$ into the classes from the space $\mathcal{Y} = \{1, 2, \ldots, K\}$. Typically, the mapping function $\phi$ is modeled as a neural network, which consists of a backbone feature extractor $\mathcal{F}: \mathcal{X} \to \mathcal{Z}$ and a linear classifier $\mathcal{G}: \mathcal{Z} \to \mathcal{Y}$.

Logit Adjustment [13] is a loss margin modification method. It adopts the prior probability of each class as the margin during the training and inference process. The logit adjustment loss is defined as:

\mathcal{L}_{\mathrm{LA}}(x_i, y_i) = -\log \frac{\pi_{y_i} e^{\phi_{y_i}(x_i)}}{\sum_{y' \in \mathcal{Y}} \pi_{y'} e^{\phi_{y'}(x_i)}},  (1)

where $\pi_y$ is the class frequency in the training or test set, and $\phi_y$ is the logit of class $y$.

Supervised Contrastive Learning (SCL) [14] is a generalization of the unsupervised contrastive learning method. SCL is designed to learn a feature extractor $\mathcal{F}$ that can distinguish between positive pairs $(x_i, x_j)$ with the same label $y_i = y_j$ and negative pairs $(x_i, x_j)$ with different labels $y_i \neq y_j$. Given any batch of sample-label pairs $B = \{(x_i, y_i)\}_{i=1}^{N_B}$ and a temperature parameter $\tau$, two typical ways to define the SCL loss are [14]:

\mathcal{L}^{\mathrm{sup}}_{\mathrm{out}}(x_i, y_i) = \frac{-1}{N^B_{y_i}} \sum_{p \in A(y_i)} \log \frac{e^{z_i \cdot z_p/\tau}}{\sum_{j=1}^{K} \sum_{a \in A(j)} e^{z_i \cdot z_a/\tau}},  (2)

\mathcal{L}^{\mathrm{sup}}_{\mathrm{in}}(x_i, y_i) = -\log \left\{ \frac{1}{N^B_{y_i}} \sum_{p \in A(y_i)} \frac{e^{z_i \cdot z_p/\tau}}{\sum_{j=1}^{K} \sum_{a \in A(j)} e^{z_i \cdot z_a/\tau}} \right\},  (3)

where $A(j)$ is the set of indices of the instances in the batch $B \setminus \{(x_i, y_i)\}$ with the same label $j$, $N^B_{y_i} = |A(y_i)|$ is its cardinality, and $z$ denotes the normalized features of $x$ extracted by $\mathcal{F}$:

z_i = \frac{\mathcal{F}(x_i)}{\lVert \mathcal{F}(x_i) \rVert}, \quad z_p = \frac{\mathcal{F}(x_p)}{\lVert \mathcal{F}(x_p) \rVert}, \quad z_a = \frac{\mathcal{F}(x_a)}{\lVert \mathcal{F}(x_a) \rVert}.

In addition, $\mathcal{L}^{\mathrm{sup}}_{\mathrm{out}}$ and $\mathcal{L}^{\mathrm{sup}}_{\mathrm{in}}$ differ in whether the sum over positive pairs is taken outside or inside the log. As demonstrated in [14], the two loss formulations are not equivalent, and Jensen's inequality [70] implies that $\mathcal{L}^{\mathrm{sup}}_{\mathrm{in}} \leq \mathcal{L}^{\mathrm{sup}}_{\mathrm{out}}$. Therefore, SCL adopts the latter as the loss function, since it is an upper bound of the former.

3.2 Probabilistic Contrastive Learning

As aforementioned, for any example in a batch, SCL considers other examples with the same label as positive samples, while the rest are viewed as negative samples. Consequently, it is essential for the batch to contain an adequate amount of data to ensure each example receives appropriate supervision signals. Nevertheless, this requirement is inefficient, as a larger batch size often leads to significant computational and memory burdens. Furthermore, in practical machine learning scenarios, the data distribution typically exhibits a long-tail pattern, with infrequent sampling of the tail classes within the mini-batches. This particular characteristic necessitates further enlargement of the batches to effectively supervise the tail classes.

To address this issue, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the feature distribution and samples from it to construct contrastive pairs. Our method is inspired by [23], [24], which employ a normal distribution to model unconstrained features from the perspective of data augmentation and obtain an upper bound of the expected loss for optimization. However, the features in contrastive learning are constrained to the unit hypersphere, which makes it unsuitable to model them directly with a normal distribution. Moreover, due to the imbalanced distribution of training data, it is infeasible to estimate the distribution parameters of all classes in a small batch. Therefore, we introduce a simple and reasonable von Mises-Fisher distribution defined on the hypersphere, whose parameters can be efficiently estimated by maximum likelihood estimation across different batches. Furthermore, we rigorously derive a closed form of the expected SupCon loss rather than an upper bound for efficient optimization, and apply it to semi-supervised learning.

Distribution Assumption. As previously mentioned, the features in contrastive learning are constrained to lie on the unit hypersphere. Therefore, we assume that the features follow a mixture of von Mises-Fisher (vMF) distributions [44], which is often regarded as a generalization of the normal distribution to the hypersphere. The probability density function of the vMF distribution for a random p-dimensional unit vector $z$ is given by:

f_p(z; \mu, \kappa) = \frac{1}{C_p(\kappa)} e^{\kappa \mu^\top z},  (4)

C_p(\kappa) = \frac{(2\pi)^{p/2} I_{p/2-1}(\kappa)}{\kappa^{p/2-1}},  (5)

where $z$ is a p-dimensional unit vector, $\kappa \geq 0$, $\lVert \mu \rVert_2 = 1$, and $I_{p/2-1}$ denotes the modified Bessel function of the first kind at order $p/2-1$, which is defined as:

I_{p/2-1}(z) = \sum_{k=0}^{\infty} \frac{1}{k!\, \Gamma(p/2 - 1 + k + 1)} \left( \frac{z}{2} \right)^{2k + p/2 - 1}.  (6)

The parameters $\mu$ and $\kappa$ are referred to as the mean direction and concentration parameter, respectively. A higher concentration around the mean direction $\mu$ is observed with greater $\kappa$, and the distribution becomes uniform on the sphere when $\kappa = 0$.
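To make the distribution assumption concrete, here is a minimal NumPy/SciPy sketch of the log-normalizer and log-density in Eqs. (4)-(5). It is illustrative only and is not the paper's GPU implementation (which evaluates the Bessel function with the Miller recurrence of Sec. 3.2); the function names are ours, and κ > 0 is assumed.

```python
# Minimal sketch of the vMF density in Eqs. (4)-(5); assumes kappa > 0.
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: ive(v, x) = I_v(x) * exp(-x)

def log_bessel_iv(order: float, kappa: np.ndarray) -> np.ndarray:
    """log I_order(kappa), computed stably via the scaled Bessel function."""
    return np.log(ive(order, kappa)) + kappa

def log_vmf_normalizer(p: int, kappa: np.ndarray) -> np.ndarray:
    """log C_p(kappa) from Eq. (5)."""
    return (0.5 * p * np.log(2 * np.pi)
            + log_bessel_iv(p / 2 - 1, kappa)
            - (p / 2 - 1) * np.log(kappa))

def vmf_log_density(z: np.ndarray, mu: np.ndarray, kappa: float) -> np.ndarray:
    """log f_p(z; mu, kappa) from Eq. (4) for unit vectors z of shape [n, p]."""
    p = z.shape[-1]
    return kappa * (z @ mu) - log_vmf_normalizer(p, np.asarray(kappa, dtype=np.float64))
```

The same log C_p(κ) quantity reappears in the closed-form loss of Sec. 3.2 (Eq. (20)), where only ratios of normalizers are needed.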
Parameter Estimation. Under the above assumption, we employ a mixture of vMF distributions to model the feature distribution:

P(z) = \sum_{y=1}^{K} P(y) P(z \mid y) = \sum_{y=1}^{K} \pi_y \frac{\kappa_y^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1}(\kappa_y)} e^{\kappa_y \mu_y^\top z},  (7)

where the probability of a class $y$ is estimated as $\pi_y$, which corresponds to the frequency of class $y$ in the training set. The mean direction $\mu_y$ and concentration parameter $\kappa_y$ of the feature distribution are estimated by maximum likelihood estimation.

Suppose that a series of $N$ independent unit vectors $\{z_i\}_{i=1}^{N}$ on the unit hypersphere $S^{p-1}$ are drawn from a vMF distribution of class $y$. The maximum likelihood estimates of the mean direction $\mu_y$ and concentration parameter $\kappa_y$ satisfy the following equations:

\mu_y = \bar{z} / \bar{R},  (8)

A_p(\kappa_y) = \frac{I_{p/2}(\kappa_y)}{I_{p/2-1}(\kappa_y)} = \bar{R},  (9)

where $\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$ is the sample mean and $\bar{R} = \lVert \bar{z} \rVert_2$ is the length of the sample mean. A simple approximation [71] to $\kappa_y$ is:

\hat{\kappa}_y = \frac{\bar{R}(p - \bar{R}^2)}{1 - \bar{R}^2}.  (10)

Furthermore, the sample mean of each class is estimated in an online manner by aggregating statistics from the previous mini-batches and the current mini-batch. Specifically, we adopt the estimated sample mean of the previous epoch for maximum likelihood estimation, while maintaining a new sample mean from zero initialization in the current epoch through the following online estimation algorithm:

\bar{z}_j^{(t)} = \frac{n_j^{(t-1)} \bar{z}_j^{(t-1)} + m_j^{(t)} \bar{z}_j^{\prime(t)}}{n_j^{(t-1)} + m_j^{(t)}},  (11)

where $\bar{z}_j^{(t)}$ is the estimated sample mean of class $j$ at step $t$ and $\bar{z}_j^{\prime(t)}$ is the sample mean of class $j$ in the current mini-batch. $n_j^{(t-1)}$ and $m_j^{(t)}$ represent the numbers of samples in the previous mini-batches and the current mini-batch, respectively.

Loss Derivation. Built upon the estimated parameters, a straightforward approach may be sampling contrastive pairs from the mixture of vMF distributions. However, we note that sampling sufficient data from the vMF distributions at each training iteration is inefficient. To this end, we leverage mathematical analysis to extend the number of samples to infinity and rigorously derive a closed-form expression for the expected contrastive loss function.

Proposition 1. Suppose that the parameters of the mixture of vMF distributions are $\pi_y$, $\mu_y$, and $\kappa_y$, $y = 1, \cdots, K$, and let the sampling number $N \to \infty$. Then we have the expected contrastive loss functions, which are given by:

\mathcal{L}_{\mathrm{out}}(z_i, y_i) = \frac{-z_i \cdot A_p(\kappa_{y_i}) \mu_{y_i}}{\tau} + \log \left( \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \right),  (12)

\mathcal{L}_{\mathrm{in}}(z_i, y_i) = -\log \left( \pi_{y_i} \frac{C_p(\tilde{\kappa}_{y_i})}{C_p(\kappa_{y_i})} \right) + \log \left( \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \right),  (13)

where $\tilde{z}_j \sim \mathrm{vMF}(\mu_j, \kappa_j)$, $\tilde{\kappa}_j = \lVert \kappa_j \mu_j + z_i / \tau \rVert_2$, and $\tau$ is the temperature parameter.

Proof. According to the definition of the supervised contrastive loss in Eq. (2), we have

\mathcal{L}^{\mathrm{sup}}_{\mathrm{out}} = \frac{-1}{N_{y_i}} \sum_{p \in A(y_i)} z_i \cdot z_p / \tau + \log \left( \sum_{j=1}^{K} \frac{N_j}{N} \frac{1}{N_j} \sum_{a \in A(j)} N e^{z_i \cdot z_a / \tau} \right),  (14)

where $N_j$ is the sampling number of class $j$ and satisfies $\lim_{N \to \infty} N_j / N = \pi_j$.

Let $N \to \infty$ and omit the constant term $\log N$; we have the following loss function:

\mathcal{L}_{\mathrm{out}} = \frac{-z_i \cdot \mathbb{E}[\tilde{z}_{y_i}]}{\tau} + \log \left( \sum_{j=1}^{K} \pi_j \, \mathbb{E}\!\left[ e^{z_i \cdot \tilde{z}_j / \tau} \right] \right)  (15)

= \frac{-z_i \cdot A_p(\kappa_{y_i}) \mu_{y_i}}{\tau} + \log \left( \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \right).  (16)

Eq. (16) is obtained by leveraging the expectation and moment-generating function of the vMF distribution:

\mathbb{E}(z) = A_p(\kappa) \mu, \quad A_p(\kappa) = \frac{I_{p/2}(\kappa)}{I_{p/2-1}(\kappa)},  (17)

\mathbb{E}\!\left( e^{t^\top z} \right) = \frac{C_p(\tilde{\kappa})}{C_p(\kappa)}, \quad \tilde{\kappa} = \lVert \kappa \mu + t \rVert_2.  (18)

Similar to Eq. (12), we can obtain the other loss function from Eq. (3) as follows:

\mathcal{L}_{\mathrm{in}} = -\log \left( \pi_{y_i} \frac{C_p(\tilde{\kappa}_{y_i})}{C_p(\kappa_{y_i})} \right) + \log \left( \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \right).  (19)

Based on the above derivation, we obtain the expected formulations for the two SupCon loss functions. Since $\mathcal{L}_{\mathrm{in}}$ enforces the margin modification as shown in Eq. (1), we adopt it as the surrogate loss:

\mathcal{L}_{\mathrm{ProCo}} = -\log \left( \pi_{y_i} \frac{C_p(\tilde{\kappa}_{y_i})}{C_p(\kappa_{y_i})} \right) + \log \left( \sum_{j=1}^{K} \pi_j \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \right).  (20)

The empirical comparison of $\mathcal{L}_{\mathrm{in}}$ and $\mathcal{L}_{\mathrm{out}}$ is shown in Tab. 4. Instead of costly sampling operations, we implicitly achieve infinite contrastive samples through the surrogate loss and can optimize it in a much more efficient manner.
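Putting Eqs. (8)-(11) and the surrogate loss of Eq. (20) together, the following is a simplified PyTorch sketch of the representation-branch loss. It is not the released implementation (https://github.com/LeapLabTHU/ProCo): the class name `ProCoLoss`, the truncated-series evaluation of the Bessel function, and the CPU-side running statistics are our own simplifications, whereas the official code evaluates I_{p/2-1} with the Miller recurrence of Eqs. (21)-(23).

```python
# Simplified sketch of the ProCo branch: online (mu, kappa) estimation and Eq. (20).
import torch
import torch.nn.functional as F


def log_iv(order: float, x: torch.Tensor, num_terms: int = 200) -> torch.Tensor:
    """log I_order(x) via the truncated series of Eq. (6); differentiable in x."""
    k = torch.arange(num_terms, dtype=x.dtype, device=x.device)
    log_terms = ((2 * k + order) * torch.log(x.unsqueeze(-1) / 2)
                 - torch.lgamma(k + 1)
                 - torch.lgamma(k + order + 1))
    return torch.logsumexp(log_terms, dim=-1)


def log_vmf_normalizer(p: int, kappa: torch.Tensor) -> torch.Tensor:
    """log C_p(kappa) up to the (2*pi)^{p/2} constant, which cancels in Eq. (20)."""
    return log_iv(p / 2 - 1, kappa) - (p / 2 - 1) * torch.log(kappa)


class ProCoLoss:
    def __init__(self, num_classes: int, feat_dim: int, class_counts, tau: float = 0.1):
        self.K, self.p, self.tau = num_classes, feat_dim, tau
        counts = torch.as_tensor(class_counts, dtype=torch.float32)
        self.log_prior = torch.log(counts / counts.sum())      # log pi_y
        self.mean = torch.zeros(num_classes, feat_dim)          # running z_bar, Eq. (11)
        self.count = torch.zeros(num_classes)                   # n_j

    @torch.no_grad()
    def update(self, z: torch.Tensor, y: torch.Tensor) -> None:
        """Online update of the per-class sample mean (Eq. (11)); stats kept on CPU."""
        z, y = z.detach().cpu(), y.cpu()
        for j in y.unique():
            zj = z[y == j]
            m, n = zj.shape[0], self.count[j]
            self.mean[j] = (n * self.mean[j] + m * zj.mean(dim=0)) / (n + m)
            self.count[j] = n + m

    def __call__(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """ProCo surrogate loss (Eq. (20)) for L2-normalized features z of shape [B, p]."""
        mean = self.mean.to(z.device)
        r_bar = mean.norm(dim=1).clamp_min(1e-6)                 # R_bar
        mu = mean / r_bar.unsqueeze(1)                           # Eq. (8)
        kappa = r_bar * (self.p - r_bar ** 2) / (1 - r_bar ** 2) # Eq. (10)
        # kappa_tilde_{i,j} = || kappa_j * mu_j + z_i / tau ||_2
        kappa_tilde = (kappa.unsqueeze(0).unsqueeze(-1) * mu.unsqueeze(0)
                       + z.unsqueeze(1) / self.tau).norm(dim=-1)  # [B, K]
        logits = (self.log_prior.to(z.device).unsqueeze(0)
                  + log_vmf_normalizer(self.p, kappa_tilde)
                  - log_vmf_normalizer(self.p, kappa).unsqueeze(0))
        # Eq. (20) is exactly a cross-entropy over these per-class logits.
        return F.cross_entropy(logits, y)
```

Note that Eq. (20) reduces to a cross-entropy over the per-class logits log π_j + log C_p(κ̃_j) − log C_p(κ_j), which is why the surrogate loss can be optimized as cheaply as a standard classification loss.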
This design elegantly addresses the inherent limitation of SCL, i.e., relying on large batch sizes (see Fig. 2). Furthermore, the assumption of the feature distribution and the estimation of its parameters can effectively capture the diversity of features among different classes, which enables our method to achieve stronger performance even without the sample-wise contrast of SCL (see Tab. 3).

Numerical Computation. Since PyTorch only provides GPU implementations of the zeroth- and first-order modified Bessel functions, one approach for efficiently computing the high-order function in ProCo is to use the following recurrence relation:

I_{\nu+1}(\kappa) = I_{\nu-1}(\kappa) - \frac{2\nu}{\kappa} I_{\nu}(\kappa).  (21)

However, this method exhibits numerical instability when the value of κ is not sufficiently large. Hence, we employ the Miller recurrence algorithm [72]. To compute I_{p/2-1}(κ) in ProCo, we follow these steps: First, we assign the trial values 1 and 0 to I_M(κ) and I_{M+1}(κ), respectively. Here, M is a chosen large positive integer, and in our experiments, we set M = p. Then, using the inverse recurrence relation:

I_{\nu-1}(\kappa) = I_{\nu+1}(\kappa) + \frac{2\nu}{\kappa} I_{\nu}(\kappa),  (22)

we can compute I_ν(κ) for ν = M-1, M-2, ···, 0. The value of I_{p/2-1}(κ) obtained from this process is denoted as Ĩ_{p/2-1}(κ), and the value obtained for I_0(κ) is denoted as Ĩ_0(κ). Finally, we can then compute I_{p/2-1}(κ) as follows:

I_{p/2-1}(\kappa) = \frac{I_0(\kappa)}{\tilde{I}_0(\kappa)} \tilde{I}_{p/2-1}(\kappa).  (23)

Overall Objective. Following the common practice in long-tailed recognition [16], [17], [18], we adopt a two-branch design. The model consists of a classification branch based on a linear classifier G(·) and a representation branch based on a projection head P(·), which is an MLP that maps the representation to another feature space for decoupling from the classifier. Besides, a backbone network F(·) is shared by the two branches. For the classification branch and the representation branch, we adopt the simple and effective logit adjustment loss L_LA and our proposed loss L_ProCo, respectively. Finally, the loss functions of the two branches are weighted and summed up as the overall loss function:

\mathcal{L} = \mathcal{L}_{\mathrm{LA}} + \alpha \mathcal{L}_{\mathrm{ProCo}},  (24)

where α is the weight of the representation branch.

In general, by introducing an additional feature branch during training, our method can be efficiently optimized with the stochastic gradient descent (SGD) algorithm along with the classification branch and does not introduce any extra overhead during inference.

Compatibility with Existing Methods. In particular, our approach is appealing in that it is a general and flexible framework. It can be easily combined with existing works applied to the classification branch, such as different loss functions, multi-expert frameworks, etc. (see Tab. 9).

The pseudo-code of our algorithm is shown in Algorithm 1.

Algorithm 1 The ProCo Algorithm.
1: Input: Training set D, loss weight α
2: Randomly initialize the parameters Θ of backbone F, projection head P and classifier G
3: for t = 0 to T do
4:   Sample a mini-batch {x_i, y_i}_{i=1}^{B} from D
5:   Compute z_i = P(F(x_i)) / ∥P(F(x_i))∥ and G(F(x_i))
6:   Estimate µ and κ according to Eq. (8) and Eq. (10)
7:   Compute L according to Eq. (24)
8:   Update Θ with SGD
9: end for
10: Output: Θ
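To show how the two branches in Algorithm 1 fit together, here is a minimal sketch of one training step implementing Eq. (24) with the logit-adjusted classification loss of Eq. (1). The function and argument names are ours, and `proco_loss` is assumed to be a callable such as the ProCoLoss sketch above; this is an illustration, not the released training code.

```python
# One training step of Algorithm 1: L = L_LA + alpha * L_ProCo (Eq. (24)).
import torch
import torch.nn.functional as F_nn

def train_step(backbone, proj_head, classifier, proco_loss, log_prior,
               x, y, optimizer, alpha: float = 1.0):
    feat = backbone(x)
    # Representation branch: normalized feature for the ProCo loss.
    z = F_nn.normalize(proj_head(feat), dim=1)
    proco_loss.update(z, y)                      # online estimation, Eq. (11)
    loss_proco = proco_loss(z, y)                # Eq. (20)
    # Classification branch: logit adjustment (Eq. (1)) adds the log class priors.
    logits = classifier(feat) + log_prior.unsqueeze(0)
    loss_la = F_nn.cross_entropy(logits, y)
    loss = loss_la + alpha * loss_proco          # Eq. (24)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, only the classification branch is used, so the representation branch adds no extra cost, as stated above.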
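Returning to the Numerical Computation step, the sketch below spells out the Miller (downward) recurrence of Eqs. (21)-(23) for I_{p/2-1}(κ). The periodic rescaling and the restriction to even p are our own simplifications; a production implementation would typically work in log space to avoid overflow for very large κ.

```python
# Sketch of Miller's algorithm (Eqs. (21)-(23)) for I_{p/2-1}(kappa), even p.
import torch

def bessel_iv_miller(p: int, kappa: torch.Tensor) -> torch.Tensor:
    order = p // 2 - 1
    M = p                                    # starting order, as in the paper
    i_hi = torch.zeros_like(kappa)           # trial value for I_{M+1}
    i_lo = torch.ones_like(kappa)            # trial value for I_{M}
    i_target = None
    for nu in range(M, 0, -1):               # inverse recurrence, Eq. (22)
        i_prev = i_hi + (2 * nu / kappa) * i_lo      # ~I_{nu-1}
        i_hi, i_lo = i_lo, i_prev
        if nu - 1 == order:
            i_target = i_lo                  # unnormalized ~I_{p/2-1}
        # Rescale periodically: only the ratio in Eq. (23) matters.
        scale = i_lo.abs().clamp_min(1.0)
        i_hi, i_lo = i_hi / scale, i_lo / scale
        if i_target is not None:
            i_target = i_target / scale
    # Eq. (23): normalize with the exactly available I_0(kappa).
    return torch.special.i0(kappa) / i_lo * i_target
```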
3.3 Theoretical Error Analysis

To further explore the theoretical foundations of our approach, we establish an upper bound on the generalization error and excess risk for the ProCo loss, as defined in Eq. (20). For simplicity, our analysis focuses on the binary classification scenario, where the labels y belong to the set {-1, +1}.

Assumption 1. $p\tau \gg 1$, with $\tau$ representing the temperature parameter and $p$ the dimensionality of the feature space.

Proposition 2 (Generalization Error Bound). Under Assumption 1, the following generalization bound is applicable with a probability of at least $1 - \delta/2$. For every class $y \in \{-1, 1\}$ and for estimated parameters $\hat{\mu}$ and $\hat{\kappa}$, the bound is expressed as:

\mathbb{E}_{z|y} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) - \frac{1}{N_y} \sum_i \mathcal{L}_{\mathrm{ProCo}}(y, z_i; \hat{\mu}, \hat{\kappa}) \leq \sqrt{\frac{2}{N_y} w^\top \Sigma_y w \ln \frac{2}{\delta}} + \frac{\ln(2/\delta)}{3 N_y} \log\!\left(1 + e^{\lVert w \rVert_2 - b y}\right).  (25)

The generalization bound across all classes, with a probability of at least $1 - \delta$, is thus:

\mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) \leq \sum_{y \in \{-1,1\}} \frac{P(y)}{N_y} \sum_i \mathcal{L}_{\mathrm{ProCo}}(y, z_i; \hat{\mu}, \hat{\kappa}) + \sum_{y \in \{-1,1\}} \frac{P(y) \ln(2/\delta)}{3 N_y} \log\!\left(1 + e^{\lVert w \rVert_2 - b y}\right) + \sum_{y \in \{-1,1\}} P(y) \sqrt{\frac{2}{N_y} w^\top \Sigma_y w \ln \frac{2}{\delta}},  (26)

where $N_y$ denotes the number of samples in class $y$, $w = (\hat{\mu}_{+1} - \hat{\mu}_{-1})/\tau$, $b = \frac{1}{2\tau^2}\left(\frac{1}{\kappa_{+1}} - \frac{1}{\kappa_{-1}}\right) + \log \frac{\pi_{+1}}{\pi_{-1}}$, and $\Sigma_y$ is the covariance matrix of $z$ conditioned on $y$.

In our experimental setting, $\tau \approx 0.1$ and $p > 128$; thus Assumption 1 is reasonable in practice. Proposition 2 indicates that the generalization error gap is primarily controlled by the sample size and the data distribution variance. This finding corresponds to the insights from [13], [73], affirming that our method does not introduce extra factors in the error bound, nor does it expand the error bound. This theoretically assures the robust generalizability of our approach.

Furthermore, our approach relies on certain assumptions regarding the feature distribution and parameter estimation. To assess the influence of these parameters on model performance, we derive an excess risk bound. This bound measures the deviation between the expected risk using estimated parameters and the Bayesian optimal risk, which is the expected risk with the parameters of the ground-truth underlying distribution.

Assumption 2. The feature distribution of each class follows a von Mises-Fisher (vMF) distribution, characterized by parameters $\mu^\star$ and $\kappa^\star$.

Proposition 3 (Excess Risk Bound). Given Assumptions 1 and 2, the following excess risk bound holds:

\mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \hat{\mu}, \hat{\kappa}) - \mathbb{E}_{(z,y)} \mathcal{L}_{\mathrm{ProCo}}(y, z; \mu^\star, \kappa^\star) = \mathcal{O}\!\left(\Delta_\mu + \Delta_{\frac{1}{\kappa}}\right),  (27)

where $\Delta_\mu = \hat{\mu} - \mu^\star$ and $\Delta_{\frac{1}{\kappa}} = \frac{1}{\hat{\kappa}} - \frac{1}{\kappa^\star}$.

Assumption 2 is the core assumption of our method. Building upon this, Proposition 3 demonstrates that the excess risk associated with our method is primarily governed by the first-order term of the estimation error in the parameters.

3.4 ProCo for Semi-supervised Learning

In order to further validate the effectiveness of our method, we also apply ProCo to semi-supervised learning. ProCo can be directly employed by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution inversely. In our implementation, we demonstrate that simply adopting a straightforward approach like FixMatch [66] to generate pseudo-labels results in superior performance. FixMatch's main concept lies in augmenting unlabeled data to produce two views and using the model's prediction on the weakly augmented view to generate a pseudo-label for the strongly augmented sample. Specifically, owing to the introduction of the feature distribution in our method, we can compute the ProCo loss of the weakly augmented view for each class to represent the posterior probability P(y|z), thus enabling the generation of pseudo-labels.
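As a concrete reading of Sec. 3.4, the sketch below generates FixMatch-style pseudo-labels from the per-class ProCo logits of the weakly augmented view and feeds the confidently pseudo-labeled features back into the distribution estimate. The helper names and the 0.95 confidence threshold are our own assumptions, not values taken from the paper.

```python
# Sketch of pseudo-label generation for imbalanced SSL with ProCo (Sec. 3.4).
import torch

def pseudo_labels(proco_class_logits: torch.Tensor, threshold: float = 0.95):
    """proco_class_logits: [B, K] per-class logits of the weakly augmented view."""
    probs = proco_class_logits.softmax(dim=1)    # posterior P(y | z_weak)
    conf, y_hat = probs.max(dim=1)
    mask = conf >= threshold                     # keep confident predictions only
    return y_hat, mask

def unlabeled_loss(proco_loss, logits_weak, z_strong):
    """ProCo loss on the strongly augmented view, supervised by pseudo-labels."""
    y_hat, mask = pseudo_labels(logits_weak)
    if mask.sum() == 0:
        return z_strong.new_zeros(())
    # Pseudo-labeled features are also used to estimate the distribution inversely.
    proco_loss.update(z_strong[mask], y_hat[mask])
    return proco_loss(z_strong[mask], y_hat[mask])
```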
3.5 Connection with Related Work

In the following, we discuss the connections between our method and related works on contrastive learning for long-tailed recognition. Recent studies proposed to incorporate class complements in the construction of positive and negative pairs. These methods ensure that all classes appear in every iteration of training to rebalance the distribution of contrast samples. Moreover, researchers have also introduced margin modification in contrastive learning due to its effectiveness. Therefore, we mainly discuss three aspects, namely class complement, calculable representation, and margin modification. Class complement refers to introducing global representations of each class in contrastive learning for constructing positive and negative samples. Calculable representation means that the class complements are computed from the features rather than being learnable parameters. Margin modification refers to adjusting the contrastive loss according to the prior frequency of the different classes in the training set, as shown in Eq. (1).

In our method, we introduce the class complement based on the feature distribution estimated from the features, which is not a learnable parameter. If the class representation is a learnable parameter $w$ and we ignore the contrast between samples, then we have the following contrastive loss:

\mathcal{L}(z_i, y_i) = -\log \frac{e^{w_{y_i}^\top z_i / \tau}}{\sum_{j=1}^{K} e^{w_j^\top z_i / \tau}},  (28)

where $w_j$ and $z_i$ are normalized to unit length, and $\tau$ is the temperature parameter. This is equivalent to a cosine classifier (normalized linear activation) [75], a variant of the cross-entropy loss, which has been applied to long-tailed recognition algorithms [16], [76], [77]. Therefore, the sole introduction of the learnable parameter $w$ is analogous to the role played by the weight in the classification branch, which is further validated by the empirical results in Tab. 3.

TABLE 1
Comparison with contrastive learning methods for long-tailed recognition. CC denotes class complement, CR denotes calculable representation, and MM denotes margin modification.

Method                          CC   CR   MM
SCL [14] & BF [74] & KCL [19]   ✗    ✗    ✗
ELM [73]                        ✗    ✗    ✓
TSC [15]                        ✓    ✓    ✗
BCL [16] & Hybrid-PSC [18]      ✓    ✗    ✗
PaCo [17] & GPaCo [56]          ✓    ✗    ✓
DRO-LT [55]                     ✓    ✓    ✗
ProCo                           ✓    ✓    ✓

The related works are summarized in Tab. 1. BatchFormer (BF) [74] is proposed to equip deep neural networks with the capability to investigate the sample relationships within each mini-batch. This is achieved by constructing a Transformer Encoder Network among the images, thereby uncovering the interrelationships among the images in the mini-batch. BatchFormer facilitates the propagation of gradients from each label to all images within the mini-batch, a concept that mirrors the approach used in contrastive learning.

Embedding and Logit Margins (ELM) [73] proposes to enforce both embedding and logit margins, analyzing in detail the benefit of introducing margin modification in contrastive learning. TSC [15] introduces class representations by pre-generating and online-matching uniformly distributed targets to each class during the training process. However, the targets do not have class semantics. Hybrid-PSC [18], PaCo [17], GPaCo [56] and BCL [16] all introduce learnable parameters as class representations. PaCo also enforces margin modification when constructing contrastive samples. DRO-LT [55] computes the centroid of each class and utilizes it as the class representation, which is the most relevant work to ours. The loss function and uncertainty radius in DRO-LT are devised by heuristic design from the metric learning and robustness perspective. Moreover, DRO-LT considers a sample and its corresponding centroid as a positive pair. But in contrast to SCL, the other samples in the batch and the centroid are treated as negative pairs, disregarding the label information of the other samples, which is somewhat pessimistic.

Compared with the above methods, ProCo is derived rigorously from the contrastive loss and a simple assumption on the feature distribution. Furthermore, ProCo also enforces margin modification, which is a key component for long-tailed recognition. Sec. 4.4 demonstrates the superiority of ProCo over the above methods.

4 EXPERIMENT

In this section, we validate the effectiveness of our method on supervised/semi-supervised learning. First, we conduct a range of analytical experiments to confirm our hypothesis and analyze each component of the method, including 1) performance of the representation branch, 2) comparison of more settings, 3) comparison between two formulations of loss, 4) sensitivity analysis of hyper-parameters, and 5) data augmentation strategies. Subsequently, we compare our method with existing supervised learning methods on long-tailed datasets such as CIFAR-10/100-LT, ImageNet-LT, and iNaturalist 2018. Finally, experiments on balanced datasets, semi-supervised learning, and long-tailed object detection tasks are conducted to confirm the broad applicability of our method.

Fig. 2. Performance of the representation branch. We train the model for 200 epochs.

TABLE 2
Comparison of different class complements. EMA denotes exponential moving average.

Class Complement       Top-1 Acc.
EMA Prototype          51.6
Centroid Prototype     52.0
Normal Distribution    52.1
ProCo                  52.8

4.1 Dataset and Evaluation Protocol

We perform long-tailed image classification experiments on four prevalent long-tailed image classification datasets: CIFAR-10/100-LT, ImageNet-LT, and iNaturalist 2018. Following [11], [76], we partition all categories into three subsets based on the number of training samples: Many-shot (> 100 images), Medium-shot (20-100 images), and Few-shot (< 20 images). The top-1 accuracy is reported on the respective balanced validation sets. In addition, we conducted experiments on balanced image classification datasets and long-tailed object detection datasets to verify the broad applicability of our method. The effectiveness of instance segmentation was assessed using the mean Average Precision (AP^m) for mask predictions, calculated at varying Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 and aggregated across different categories. The AP values for rare, common, and frequent categories are represented as AP_r, AP_c, and AP_f, respectively, while the AP for detection boxes is denoted as AP^b.

CIFAR-10/100-LT. CIFAR-10-LT and CIFAR-100-LT are the long-tailed variants of the original CIFAR-10 and CIFAR-100 [78] datasets, which are derived by sampling the original training set. Following [10], [12], we sample the training sets of CIFAR-10 and CIFAR-100 with an exponential function N_j = N × λ^j, where λ ∈ (0, 1), N is the size of the original training set, and N_j is the sampling quantity for the j-th class. The original balanced validation sets of CIFAR-10 and CIFAR-100 are used for testing. The imbalance factor γ = max(N_j)/min(N_j) is defined as the ratio of the number of samples in the most and the least frequent class. We set γ to the typical values 10, 50, and 100 in our experiments.

ImageNet-LT. ImageNet-LT is proposed in [76], which is constructed by sampling a subset of ImageNet following the Pareto distribution with power value α_p = 6. It consists of 115.8k images from 1000 categories, with cardinality ranging from 1280 to 5.

iNaturalist 2018. iNaturalist 2018 [7] is a severely imbalanced large-scale dataset. It contains 8142 classes and 437.5k images, with an imbalance factor γ = 500 and cardinality ranging from 2 to 1000. In addition to long-tailed image classification, iNaturalist 2018 is also utilized in fine-grained image classification.

CUB-200-2011. The Caltech-UCSD Birds-200-2011 dataset [79] is a prominent resource for fine-grained visual categorization tasks. Comprising 11,788 images across 200 bird sub-categories, it is split into two sets: 5,994 images for training and 5,794 for testing.

LVIS v1. The Large Vocabulary Instance Segmentation (LVIS) dataset [80] is notable for its extensive categorization, encompassing 1,203 categories with high-quality instance mask annotations. LVIS v1 is divided into three splits: a training set with 100,000 images, a validation (val) set with 19,800 images, and a test-dev set, also with 19,800 images. Categories within the training set are classified based on their prevalence as rare (1-10 images), common (11-100 images), or frequent (over 100 images).
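The long-tailed split described above (N_j = N × λ^j with imbalance factor γ = max(N_j)/min(N_j)) can be built as follows; this is a schematic sketch in which λ is chosen so that the most and least frequent classes differ by exactly γ, and the helper names are ours.

```python
# Schematic construction of a CIFAR-style long-tailed split (Sec. 4.1).
import numpy as np

def long_tailed_counts(num_classes: int, n_per_class: int, gamma: float):
    """Per-class counts N_j = N * lambda^j with max(N_j)/min(N_j) = gamma."""
    lam = gamma ** (-1.0 / (num_classes - 1))   # lambda^(K-1) = 1 / gamma
    return [int(round(n_per_class * lam ** j)) for j in range(num_classes)]

def subsample_indices(labels, counts, seed: int = 0):
    """Randomly keep counts[j] indices of each class j from the label list."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for j, n_j in enumerate(counts):
        idx = np.flatnonzero(labels == j)
        keep.extend(rng.choice(idx, size=min(n_j, len(idx)), replace=False))
    return sorted(keep)

# Example: CIFAR-100-LT with imbalance factor 100 (500 training images per class).
# counts = long_tailed_counts(num_classes=100, n_per_class=500, gamma=100)
```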
4.2 Implementation Details

For a fair comparison of long-tailed supervised image classification, we strictly follow the training setting of [16]. All models are trained using an SGD optimizer with the momentum set to 0.9.

CIFAR-10/100-LT. We adopt ResNet-32 [2] as the backbone network. The representation branch has a projection head with an output dimension of 128 and a hidden layer dimension of 512. We set the temperature parameter τ for contrastive learning to 0.1. Following [16], [17], we apply AutoAug [85] and Cutout [86] as the data augmentation strategies for the classification branch, and SimAug [51] for the representation branch. The loss weights are assigned equally (α = 1) to both branches. We train the network for 200 epochs with a batch size of 256 and a weight decay of 4 × 10^-4. We gradually increase the learning rate to 0.3 in the first 5 epochs and reduce it by a factor of 10 at the 160th and 180th epochs. Unless otherwise specified, our components analysis follows the above training setting. For a more comprehensive comparison, we also train the model for 400 epochs with a similar learning rate schedule, except that we warm up the learning rate in the first 10 epochs and decrease it at the 360th and 380th epochs.

In order to evaluate the effectiveness of ProCo in semi-supervised learning tasks, we partition the training dataset into two subsets: a labeled set and an unlabeled set. The division is conducted through a random elimination of the labels. We predominantly adhere to the training paradigm established by DASO [69]. DASO builds its semantic classifier by computing the average feature vector of each class for generating pseudo-labels. This semantic classifier is substituted with our representation branch for training. The remaining hyperparameters are kept consistent with those in DASO.

ImageNet-LT & iNaturalist 2018. For both ImageNet-LT and iNaturalist 2018, we employ ResNet-50 [2] as the backbone network, following the majority of previous works. The representation branch consists of a projection head with an output dimension of 2048 and a hidden layer dimension of 1024. The classification branch adopts a cosine classifier. We set the temperature parameter τ for contrastive learning to 0.07. RandAug [87] is employed as the data augmentation strategy for the classification branch, and SimAug for the representation branch. We also assign equal loss weights (α = 1) to both branches. The model is trained for 90 epochs with a batch size of 256 and a cosine learning rate schedule. For ImageNet-LT, we set the initial learning rate to 0.1 and the weight decay to 5 × 10^-4. In addition, the model is trained for 90 epochs with ResNeXt-50-32x4d [88] as the backbone network and for 180 epochs with ResNet-50. For iNaturalist 2018, we set the initial learning rate to 0.2 and the weight decay to 1 × 10^-4.
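For quick reference, the CIFAR-100/10-LT recipe of Sec. 4.2 is collected below as a single configuration dictionary. The values are quoted from the text; the dictionary layout itself is our own and is not part of the released code.

```python
# CIFAR-100/10-LT training recipe (values quoted from Sec. 4.2).
CIFAR_LT_RECIPE = {
    "backbone": "ResNet-32",
    "projection_head": {"hidden_dim": 512, "output_dim": 128},
    "temperature": 0.1,
    "loss_weight_alpha": 1.0,   # equal weight for both branches, Eq. (24)
    "augmentation": {"classification": ["AutoAug", "Cutout"], "representation": "SimAug"},
    "optimizer": {"type": "SGD", "momentum": 0.9, "weight_decay": 4e-4},
    "batch_size": 256,
    "epochs": 200,              # the 400-epoch variant warms up 10 epochs, decays at 360/380
    "lr_schedule": {"peak_lr": 0.3, "warmup_epochs": 5,
                    "decay_epochs": [160, 180], "decay_factor": 0.1},
}
```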
4.3 Components Analysis

Performance of Representation Branch. Since ProCo is based on the representation branch, we first analyze the performance of a single branch. Fig. 2 presents the performance curves of different methods without a classification branch as the batch size changes. We follow the two-stage training strategy and hyper-parameters of SCL [14]. In the first stage, the model is trained only through the representation branch. In the second stage, we train a linear classifier to evaluate the model's performance. From the results, we can directly observe that the performance of BCL [16] and SupCon [14] is significantly limited by the batch size. However, ProCo effectively mitigates SupCon's limitation on the batch size by introducing the feature distribution of each class. Furthermore, a comparison of different class complements is presented in Tab. 2. Here, EMA and centroid prototype denote that the class representations are estimated by the exponential moving average and the centroid of the feature vectors, respectively. Normal distribution denotes that we employ normal distributions to model the feature distribution of each class, even though the features lie in the normalized feature space. Compared with merely estimating a class prototype, ProCo yields superior results by estimating the distribution of features for each class. The empirical results also confirm the analysis in Sec. 3.2 that the vMF distribution is better suited than the normal distribution for modeling the feature distribution.

TABLE 3
Comparison of more settings. SC denotes supervised contrastive loss, CC denotes class complement, CR denotes calculable representation, and MM denotes margin modification.

SC   CC   CR   MM   Top-1 Acc.
✗    ✗    ✗    ✗    50.5
✗    ✓    ✗    ✗    50.6
✓    ✗    ✗    ✗    51.8
✓    ✓    ✗    ✗    51.9
✓    ✓    ✓    ✓    52.6
✗    ✓    ✓    ✓    52.8

Comparison of More Settings. The results are summarized in Tab. 3, where we compare ProCo with several other settings to clarify the effectiveness of the relevant components. As demonstrated in the table, the logit adjustment method fails to achieve an improvement in performance when combined with a learnable class representation (50.5% vs 50.6%, 51.8% vs 51.9%). This empirical evidence suggests a certain degree of equivalence between the two approaches. Furthermore, incorporating ProCo with the SupCon loss [14] yields a slight performance degradation (52.6% vs 52.8%). This result emphasizes the effectiveness of the assumption on feature distribution and parameter estimation in capturing feature diversity.

TABLE 4
Comparison of different loss formulations.

Imbalance Factor   100    50     10
L_out              52.0   56.6   65.1
L_in               52.8   57.1   65.5

Loss Formulations. We analyze the impact of different loss formulations on the performance of ProCo. These two formulations are derived from different SupCon losses [14], as shown in Proposition 1. Tab. 4 shows the results of the different loss formulations on CIFAR-100-LT. We can observe that L_in outperforms L_out.

Sensitivity Test. To investigate the effect of the loss weight on model performance, we conducted experiments on the CIFAR-100-LT dataset with an imbalance factor of 100. The results are presented in Fig. 3. It is evident that the model exhibits optimal performance when α is set to 1.0. This suggests that the optimal performance can be achieved by setting the loss weights of both branches equally. Therefore, we opt not to conduct an exhaustive hyper-parameter search for each dataset. Instead, we extrapolate the same loss weight to the other datasets.

Fig. 3. Parameter analysis of the loss weight ratio α. LA denotes the logit adjustment method.

TABLE 5
Comparison of training the network with ('w/') and without ('w/o') employing AutoAugment.

Method            w/o AutoAug   w/ AutoAug
Logit Adj. [13]   49.7          50.5
BCL [16]          50.1          51.9
ProCo             50.5          52.8

Data Augmentation Strategies. Data augmentation is a pivotal factor for enhancing model performance. To investigate the impact of data augmentation, we conducted experiments on the CIFAR-100-LT dataset with an imbalance factor of 100. Tab. 5 demonstrates the influence of the AutoAug strategy. Under strong data augmentation, our method achieves a greater performance enhancement, suggesting the effectiveness of our approach in leveraging the benefits of data augmentation.

TABLE 6
Top-1 accuracy of ResNet-32 on CIFAR-100-LT and CIFAR-10-LT. ∗ denotes results borrowed from [81]. † denotes models trained in the same setting. We report the results of 200 epochs.

Dataset                    CIFAR-100-LT          CIFAR-10-LT
Imbalance Factor           100    50     10      100    50     10
CB-Focal [10]              39.6   45.2   58.0    74.6   79.3   87.5
LDAM-DRW∗ [12]             42.0   46.6   58.7    77.0   81.0   88.2
BBN [81]                   42.6   47.0   59.1    79.8   81.2   88.3
SSP [82]                   43.4   47.1   58.9    77.8   82.1   88.5
TSC [63]                   43.8   47.4   59.0    79.7   82.9   88.7
Causal Model [83]          44.1   50.3   59.6    80.6   83.6   88.5
Hybrid-SC [18]             46.7   51.9   63.1    81.4   85.4   91.1
MetaSAug-LDAM [24]         48.0   52.3   61.3    80.7   84.3   89.7
ResLT [84]                 48.2   52.7   62.0    82.4   85.2   89.7
Logit Adjustment† [13]     50.5   54.9   64.0    84.3   87.1   90.9
BCL† [16]                  51.9   56.4   64.6    84.5   87.2   91.1
ProCo†                     52.8   57.1   65.5    85.9   88.2   91.9

TABLE 7
Top-1 accuracy of ResNet-32 on CIFAR-100-LT with an imbalance factor of 100. † denotes models trained in the same setting.

Method             Many   Med    Few    All
200 epochs
DRO-LT [55]        64.7   50.0   23.8   47.3
RIDE [61]          68.1   49.2   23.9   48.0
Logit Adj.† [13]   67.2   51.9   29.5   50.5
BCL† [16]          67.2   53.1   32.9   51.9
ProCo†             69.0   52.7   34.1   52.8
400 epochs
PaCo† [17]         -      -      -      52.0
GPaCo† [56]        -      -      -      52.3
Logit Adj.† [13]   68.1   53.0   32.4   52.1
BCL† [16]          69.2   53.1   34.4   53.1
ProCo†             70.1   53.4   36.4   54.2

4.4 Long-Tailed Supervised Image Classification

CIFAR-10/100-LT. Tab. 6 presents the results of ProCo and existing methods on CIFAR-10/100-LT. ProCo demonstrates superior performance compared to other methods in handling varying levels of class imbalance, particularly in situations with higher imbalance factors. With the same training scheme, we mainly compare with Logit Adj. [13] and BCL [16]. Combining the representation branch based on contrastive learning with the classification branch can significantly improve the performance of the model. Our method further enhances the performance and effectively alleviates the problem of data imbalance. Moreover, we report longer and more detailed results on the CIFAR-100-LT dataset with an imbalance factor of 100 in Tab. 7. With 200 and 400 epochs, especially for tail classes, our method achieves a 1.2% and 2.0% improvement over BCL, respectively. Meanwhile, our method also maintains an improvement on head categories.
TABLE 8
Comparisons on ImageNet-LT and iNaturalist 2018 with different backbone networks. † denotes models trained in the same setting.

                      ImageNet-LT          iNaturalist 2018
Method                Res50    ResX50      Res50
90 epochs
τ-norm [11]           46.7     49.4        65.6
MetaSAug [24]         47.4     –           68.8
SSP [82]              51.3     –           68.1
KCL [11]              51.5     –           68.6
DisAlign [89]         52.9     53.4        69.5
vMF Classifier [50]   –        53.7        –
SSD [90]              –        53.8        69.3
ICCL [91]             –        54.0        70.5
ResLT [84]            –        56.1        70.2
GCL [92]              54.9     –           72.0
RIDE [61]             54.9     56.4        72.2
Logit Adj.† [13]      55.1     56.5        –
BCL† [16]             56.0     56.7        71.8
ProCo†                57.3     58.0        73.5

TABLE 9
Top-1 accuracy of ResNet-50 on ImageNet-LT and iNaturalist 2018. † and ‡ denote models trained in the same settings.

Method                       Many   Med    Few    All
180 epochs, ImageNet-LT
Logit Adj.† [13]             65.8   53.2   34.1   55.4
PaCo† [17]                   64.4   55.7   33.7   56.0
BCL† [16]                    67.6   54.6   36.8   57.2
NCL‡ [63]                    68.2   53.9   36.3   57.0
NCL (ensemble)‡ [63]         69.1   56.4   38.9   59.2
ProCo†                       68.2   55.1   38.1   57.8
ProCo+NCL‡                   68.4   54.9   38.6   57.9
ProCo+NCL (ensemble)‡        70.6   57.4   40.8   60.2
400 epochs, iNaturalist 2018
PaCo† [17]                   70.3   73.2   73.6   73.2
BatchFormer† [74]            72.8   75.3   75.3   75.1
GPaCo† [56]                  73.0   75.5   75.7   75.4
ProCo†                       74.0   76.0   76.0   75.8

TABLE 10
Top-1 accuracy for ResNet-32 and ResNet-50 on balanced datasets. The ResNet-32 model is trained on CIFAR-100/10 for 200 epochs from scratch. For CUB-200-2011, the pre-trained ResNet-50 model is fine-tuned for 30 epochs.

               CIFAR-100   CIFAR-10   CUB-200-2011
CrossEntropy   71.5        93.4       81.6
ProCo          73.0        94.6       82.9

ImageNet-LT. We report the performance of ProCo on the ImageNet-LT dataset in Tab. 8, compared with existing methods. ProCo surpasses BCL by 1.3% on ResNet-50 and ResNeXt-50 with 90 training epochs, which demonstrates the effectiveness of our distribution-based class representation on datasets of large scale. Furthermore, Tab. 9 lists detailed results on more training settings for the ImageNet-LT dataset. ProCo has the most significant performance improvement on tail categories. In addition to combining with typical classification branches, ProCo can also be combined with other methods to further improve model performance, such as different loss functions and model ensembling methods. We also report the results of combining with NCL [63]. NCL is a multi-expert method that utilizes distillation and hard category mining. ProCo demonstrates performance improvements for NCL.

iNaturalist 2018. Tab. 8 presents the experimental comparison of ProCo with existing methods on iNaturalist 2018 over 90 epochs. iNaturalist 2018 is a highly imbalanced large-scale dataset, thus making it ideal for studying the impact of imbalanced datasets on model performance. Under the same training setting, ProCo outperforms BCL by 1.7%. Furthermore, to facilitate a comparison with state-of-the-art methodologies, an extended training schedule of 400 epochs is conducted. The results in Tab. 9 indicate that ProCo is capable of effectively scaling to larger datasets and longer training schedules.

4.5 Balanced Supervised Image Classification

The foundational theory of our model is robust against data imbalance, meaning that the derivation of ProCo is unaffected by long-tailed distributions. In support of this, we also perform experiments on balanced datasets, as illustrated in Tab. 10. For the CIFAR-100/10 datasets, augmentation and training parameters identical to those used for CIFAR-100/10-LT are employed. Additionally, we expand our experimentation to the fine-grained classification dataset CUB-200-2011. These results demonstrate that while our method is primarily designed for imbalanced datasets and mitigates the inherent limitations of contrastive learning in such contexts, the additional experiments also highlight its effectiveness on balanced training sets. This versatility underlines the strength of our method in addressing not only imbalances in P(y) but also intra-class distribution variances in P(z|y). These aspects correspond to the factors N_y and Σ_y in Proposition 2. Overall, the results imply the broad utility of our approach across diverse datasets.

4.6 Long-Tailed Semi-Supervised Image Classification

We present the experimental results of semi-supervised learning in Tab. 11. FixMatch [66] is employed as the foundational framework to generate pseudo-labels, and its effectiveness is assessed in comparison to other methods in long-tailed semi-supervised learning. We mainly follow the setting of DASO [69], except for substituting the semantic classifier based on the centroid prototype of each class with our representation branch for training. ProCo outperforms DASO across various levels of data imbalance and dataset sizes while maintaining the same training conditions. Specifically, in cases of higher data imbalance (γ = 20) and a smaller labeled set (N_1 = 50), our proposed method exhibits a significant performance enhancement (with LA) of up to 2.8% when compared to DASO.

4.7 Long-Tailed Object Detection

In addition to image classification, we extend ProCo to object detection tasks. Specifically, we utilize Faster R-CNN [4] and Mask R-CNN [93] as foundational frameworks, integrating our proposed ProCo loss into the box classification branch. This method was implemented using

TABLE 11
Comparison of accuracy (%) on CIFAR100-LT under γl = γu setup. γl and γu are the imbalance factors for labeled and unlabeled data,
respectively. N1 and M1 are the size of the most frequent class in the labeled data and unlabeled data, respectively. LA denotes the Logit
Adjustment method [13]. † denotes models trained in the same setting.

Method | γ = γl = γu = 10, N1 = 50, M1 = 400 | γ = 10, N1 = 150, M1 = 300 | γ = 20, N1 = 50, M1 = 400 | γ = 20, N1 = 150, M1 = 300
Supervised 29.6 46.9 25.1 41.2
w/ LA [13] 30.2 48.7 26.5 44.1
FixMatch [66] 45.2 56.5 40.0 50.7
w/ DARP [67] 49.4 58.1 43.4 52.2
w/ CReST+ [68] 44.5 57.4 40.1 52.1
w/ DASO† [69] 49.8 59.2 43.6 52.9
w/ ProCo† 50.9 60.2 44.8 54.8
FixMatch [66] + LA [13] 47.3 58.6 41.4 53.4
w/ DARP [67] 50.5 59.9 44.4 53.8
w/ CReST+ [68] 44.0 57.1 40.6 52.3
w/ DASO† [69] 50.7 60.6 44.1 55.1
w/ ProCo† 52.1 61.3 46.9 55.9
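To make the role of the representation branch more concrete, the following is a minimal, self-contained sketch (PyTorch-style Python) of the computation it relies on: per-class von Mises-Fisher parameters are estimated from normalized features via the first sample moment, and the resulting class scores, μ_y^⊤z/τ + 1/(2τ²κ_y) + log π_y (the asymptotic form given in Lemma 1 of Appendix A, not the exact closed-form loss), are used both as a training signal and, e.g., for assigning pseudo-labels to unlabeled samples. The function names, the batch-wise (rather than online) moment estimation, and the hyperparameter values are illustrative assumptions and do not correspond to the released implementation.

```python
import torch
import torch.nn.functional as F

def estimate_vmf_parameters(features, labels, num_classes, eps=1e-6):
    """Estimate per-class vMF parameters (mean direction mu, concentration kappa)
    and class priors pi from L2-normalized features.

    Batch-wise maximum-likelihood approximation; the paper instead maintains
    these statistics online across batches (simplified here for illustration).
    """
    p = features.shape[1]                      # feature dimension
    counts = torch.zeros(num_classes)
    first_moment = torch.zeros(num_classes, p)
    for c in range(num_classes):
        mask = labels == c
        counts[c] = mask.sum()
        if counts[c] > 0:
            first_moment[c] = features[mask].mean(dim=0)
    # mean resultant length per class, clamped away from 0 and 1 for stability
    r_bar = first_moment.norm(dim=1).clamp(min=eps, max=1 - eps)
    mu = first_moment / first_moment.norm(dim=1, keepdim=True).clamp(min=eps)
    kappa = r_bar * (p - r_bar ** 2) / (1 - r_bar ** 2)   # standard vMF approximation
    pi = counts.clamp(min=1) / counts.clamp(min=1).sum()  # empirical class priors
    return mu, kappa, pi

def proco_scores(z, mu, kappa, pi, tau=0.1):
    """Class scores mu_y^T z / tau + 1/(2 tau^2 kappa_y) + log pi_y (asymptotic form)."""
    return z @ mu.t() / tau + 1.0 / (2 * tau ** 2 * kappa) + torch.log(pi)

# Usage sketch: a surrogate objective on labeled data and pseudo-labels for unlabeled data.
feats = F.normalize(torch.randn(32, 128), dim=1)     # stand-in for encoder outputs
labels = torch.randint(0, 10, (32,))
mu, kappa, pi = estimate_vmf_parameters(feats, labels, num_classes=10)

loss = F.cross_entropy(proco_scores(feats, mu, kappa, pi), labels)
unlabeled = F.normalize(torch.randn(16, 128), dim=1)
pseudo_labels = proco_scores(unlabeled, mu, kappa, pi).argmax(dim=1)
```

In the actual method, the per-class moments are maintained online across batches, so the estimate does not require every class to be present in a single batch.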

This method was implemented using mmdetection [94], adhering to the training settings of the original baselines. As depicted in Tab. 12, our approach yields noticeable enhancements on the LVIS v1 dataset, with both Faster R-CNN and Mask R-CNN demonstrating improved performance across various categories.

TABLE 12
Results on different frameworks with the ResNet-50 backbone on LVIS v1. We conduct experiments with the 1x schedule.

Framework | ProCo | APb | APr | APc | APf | APm
Faster R-CNN [4] | ✗ | 22.1 | 9.0 | 21.0 | 29.2 | –
Faster R-CNN [4] | ✓ | 24.7 | 15.5 | 24.2 | 29.3 | –
Mask R-CNN [93] | ✗ | 22.5 | 9.1 | 21.1 | 30.1 | 21.7
Mask R-CNN [93] | ✓ | 25.2 | 16.1 | 24.5 | 30.0 | 24.7

5 CONCLUSION
In this paper, we proposed a novel probabilistic contrastive (ProCo) learning algorithm for long-tailed distributions. Specifically, we employed a reasonable and straightforward von Mises-Fisher distribution to model the normalized feature space of samples in the context of contrastive learning. This choice offers two key advantages. First, it is efficient to estimate the distribution parameters across different batches by maximum likelihood estimation. Second, we derived a closed form of the expected supervised contrastive loss for optimization by sampling an infinite number of samples from the estimated distribution. This eliminates the inherent limitation of supervised contrastive learning that requires a large number of samples to achieve satisfactory performance. Furthermore, ProCo can be directly applied to semi-supervised learning by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution inversely. We have proven the error bound of ProCo theoretically. Extensive experimental results on various classification and object detection datasets demonstrate the effectiveness of the proposed algorithm.

APPENDIX A
PROOF OF PROPOSITION 2
Before presenting the proof of Proposition 2, we introduce several lemmas essential for the subsequent argument. We begin by proving the asymptotic expansion of the ProCo loss.

Lemma 1 (Asymptotic expansion). The ProCo loss satisfies the following asymptotic expansion under Assumption 1:

\[ \mathcal{L}_{\mathrm{ProCo}}(y,z) \sim -\log\frac{\pi_y\, e^{\mu_y^\top z/\tau + 1/(2\tau^2\kappa_y)}}{\sum_{j=1}^{K}\pi_j\, e^{\mu_j^\top z/\tau + 1/(2\tau^2\kappa_j)}}. \]

Proof. Recall that the ProCo loss is defined as:

\[ \mathcal{L}_{\mathrm{ProCo}} = -\log\left(\pi_{y_i}\frac{C_p(\tilde{\kappa}_{y_i})}{C_p(\kappa_{y_i})}\right) + \log\left(\sum_{j=1}^{K}\pi_j\frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)}\right), \]

where C_p(κ) is the normalizing constant of the von Mises-Fisher distribution, which is given by

\[ C_p(\kappa) = \frac{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)}{\kappa^{p/2-1}}, \qquad \tilde{\kappa}_j = \|\kappa_j\mu_j + z_i/\tau\|_2. \]

Therefore, we aim to demonstrate that:

\[ \frac{e^{\tilde{\kappa}_j}\,\kappa_j^{(p-1)/2}}{e^{\kappa_j}\,\tilde{\kappa}_j^{(p-1)/2}} \sim e^{\mu_j^\top z_i/\tau + 1/(2\tau^2\kappa_j)}. \]

According to the calculation formula, the parameter κ is computed as κ = R̄(p − R̄²)/(1 − R̄²). During the training process, it is observed that R̄/(1 − R̄²) ≫ 1. Consequently, this implies that κ ≫ p. Referring to the asymptotic expansion of the modified Bessel function of the first kind for large κ relative to p [72], we have

\[ I_{p/2-1}(\kappa) \sim \frac{e^{\kappa}}{\sqrt{2\pi\kappa}}. \]
Therefore, we have

\[ \frac{C_p(\tilde{\kappa}_j)}{C_p(\kappa_j)} \sim \frac{e^{\tilde{\kappa}_j}\,\kappa_j^{(p-1)/2}}{e^{\kappa_j}\,\tilde{\kappa}_j^{(p-1)/2}}. \]

Given Assumption 1 and κ ≫ p, it follows that 1/(κτ) ≪ 1. By employing the approximation (1 + x)^α ≈ 1 + αx, valid for x ≪ 1, we obtain:

\[ \tilde{\kappa}_j = \kappa_j\sqrt{1 + \frac{2\mu_j^\top z_i}{\kappa_j\tau} + \frac{1}{\kappa_j^2\tau^2}} \sim \kappa_j\left(1 + \frac{\mu_j^\top z_i}{\kappa_j\tau} + \frac{1}{2\kappa_j^2\tau^2}\right) \]

and

\[ \tilde{\kappa}_j^{(p-1)/2} \sim \kappa_j^{(p-1)/2}\left(1 + \frac{(p-1)\mu_j^\top z_i}{2\kappa_j\tau} + \frac{p-1}{4\kappa_j^2\tau^2}\right). \]

Given κ ≫ p, we have:

\[ \frac{\kappa_j^{(p-1)/2}}{\tilde{\kappa}_j^{(p-1)/2}} \sim 1. \]

Consequently, we establish that:

\[ \frac{e^{\tilde{\kappa}_j}\,\kappa_j^{(p-1)/2}}{e^{\kappa_j}\,\tilde{\kappa}_j^{(p-1)/2}} \sim e^{\mu_j^\top z_i/\tau + 1/(2\tau^2\kappa_j)}. \]

Lemma 2 (Bennett's inequality [95]). Let Z_1, ..., Z_n be i.i.d. random variables with values in [0, B] and let δ > 0. Then, with probability at least 1 − δ in (Z_1, ..., Z_n),

\[ \mathbb{E}Z - \frac{1}{n}\sum_{i=1}^{n} Z_i \le \sqrt{\frac{2\,\mathbb{V}_Z\ln(1/\delta)}{n}} + \frac{B\ln(1/\delta)}{3n}, \]

where V_Z is the variance of Z.

Lemma 3 (Variance inequality). Let L_log(y, z) := log(1 + e^{−y(w^⊤z+b)}) and L_lin(y, z) := −y(w^⊤z + b). Then, for any y ∈ {−1, 1},

\[ \mathbb{V}_{z|y}[\mathcal{L}_{\log}(y,z)] \le \mathbb{V}_{z|y}[\mathcal{L}_{\mathrm{lin}}(y,z)]. \]

Proof. Consider the function f_y(z) := log(1 + e^{−yz}), where y ∈ {−1, 1}. The derivative f_y'(z) is given by f_y'(z) = −y e^{−yz}/(1 + e^{−yz}), which implies that sup_z |f_y'(z)| ≤ 1. Consequently, f_y is a 1-Lipschitz function which satisfies the following inequality:

\[ |f_y(z) - f_y(z')| \le |z - z'|, \quad \forall z, z' \in \mathbb{R}. \]

Regarding the variance of any real-valued function h, it is defined as follows:

\[ \mathbb{V}[h(z)] = \mathbb{E}[h^2(z)] - (\mathbb{E}[h(z)])^2 \le \mathbb{E}[h^2(z)]. \]

The above inequalities lead to the conclusion that

\[ \begin{aligned} \mathbb{V}_{z|y}[\mathcal{L}_{\log}(y,z)] &= \mathbb{V}_{z|y}[f_y(w^\top z + b)] \\ &= \mathbb{V}_{z|y}\big[f_y(w^\top z + b) - f_y(\mathbb{E}_{z'|y}[w^\top z' + b])\big] \\ &\le \mathbb{E}_{z|y}\Big[\big(f_y(w^\top z + b) - f_y(\mathbb{E}_{z'|y}[w^\top z' + b])\big)^2\Big] \\ &\le \mathbb{E}_{z|y}\Big[\big(w^\top z + b - \mathbb{E}_{z'|y}[w^\top z' + b]\big)^2\Big] \\ &= \mathbb{E}_{z|y}\Big[\big(y(w^\top z + b) - \mathbb{E}_{z'|y}[y(w^\top z' + b)]\big)^2\Big] \\ &= \mathbb{V}_{z|y}[\mathcal{L}_{\mathrm{lin}}(y,z)]. \end{aligned} \]

We are now ready to demonstrate the validity of Proposition 2.

Proof. First, we examine the class-conditional ProCo loss, denoted as E_{z|y} L_ProCo(y, z). For a class label y ∈ {−1, 1}, according to Lemma 2, we establish that with a probability of at least 1 − δ/2, the following inequality holds:

\[ \mathbb{E}_{z|y}\mathcal{L}_{\mathrm{ProCo}}(y,z) - \frac{1}{N_y}\sum_i \mathcal{L}_{\mathrm{ProCo}}(y,z_i) \le \sqrt{\frac{2\,\mathbb{V}_{z|y}[\mathcal{L}_{\mathrm{ProCo}}(y,z)]\ln(2/\delta)}{N_y}} + \frac{B\ln(2/\delta)}{3N_y}. \]

Incorporating Lemma 3 and Lemma 1, we obtain:

\[ \mathbb{E}_{z|y}\mathcal{L}_{\mathrm{ProCo}}(y,z) - \frac{1}{N_y}\sum_i \mathcal{L}_{\mathrm{ProCo}}(y,z_i) \le \sqrt{\frac{2\,\mathbb{V}_{z|y}[\mathcal{L}_{\mathrm{lin}}(y,z)]\ln(2/\delta)}{N_y}} + \frac{\ln(2/\delta)}{3N_y}\log\big(1 + e^{\|w\|_2 - by}\big), \]

where L_lin(y, z) is defined as

\[ \mathcal{L}_{\mathrm{lin}}(y,z) = -y\left(\frac{(\mu_{+1}-\mu_{-1})^\top z}{\tau} + \frac{\kappa_{-1}-\kappa_{+1}}{2\tau^2\kappa_{-1}\kappa_{+1}} + \log\frac{\pi_{+1}}{\pi_{-1}}\right). \]

Moreover, the variance V_{z|y}[L_lin(y, z)] is computed as:

\[ \mathbb{V}_{z|y}[\mathcal{L}_{\mathrm{lin}}(y,z)] = \mathbb{V}_{z|y}\big[(\mu_{+1}-\mu_{-1})^\top z/\tau\big] = (\mu_{+1}-\mu_{-1})^\top \Sigma_y (\mu_{+1}-\mu_{-1})/\tau^2, \]

where Σ_y represents the covariance matrix of z conditioned on y. Consequently, we have completed the proof of the conditional error bound as follows:

\[ \mathbb{E}_{z|y}\mathcal{L}_{\mathrm{ProCo}}(y,z) - \frac{1}{N_y}\sum_i \mathcal{L}_{\mathrm{ProCo}}(y,z_i) \le \sqrt{\frac{2}{N_y}\,w^\top \Sigma_y w\,\ln\frac{2}{\delta}} + \frac{\ln(2/\delta)}{3N_y}\log\big(1 + e^{\|w\|_2 - by}\big), \]

where w = (μ_{+1} − μ_{−1})/τ.

To extend this to the generalization bound across all classes, we apply the union bound. Consequently, with a
probability of at least 1 − δ, the following inequality is satisfied:

\[ \mathbb{E}_{(z,y)}\mathcal{L}_{\mathrm{ProCo}}(y,z) \le \sum_{y\in\{-1,1\}}\frac{P(y)}{N_y}\sum_i \mathcal{L}_{\mathrm{ProCo}}(y,z_i) + \sum_{y\in\{-1,1\}} P(y)\,\frac{\ln(2/\delta)}{3N_y}\log\big(1 + e^{\|w\|_2 - by}\big) + \sum_{y\in\{-1,1\}} P(y)\sqrt{\frac{2}{N_y}\,w^\top \Sigma_y w\,\ln\frac{2}{\delta}}, \]

where w = (μ_{+1} − μ_{−1})/τ.

APPENDIX B
PROOF OF PROPOSITION 3
The primary approach to proving the excess risk bound involves utilizing the asymptotic expansion of the ProCo loss, as detailed in Lemma 1, and its compliance with the 1-Lipschitz property as outlined in Lemma 3.

Proof. Given Lemma 1, we have

\[ \mathbb{E}_{(z,y)}\mathcal{L}_{\mathrm{ProCo}}(y,z;\hat{\mu},\hat{\kappa}) - \mathbb{E}_{(z,y)}\mathcal{L}_{\mathrm{ProCo}}(y,z;\mu^\star,\kappa^\star) \sim \mathbb{E}_{(z,y)}\Big[\log\big(1 + e^{-y(\hat{w}^\top z + \hat{b})}\big) - \log\big(1 + e^{-y(w^{\star\top} z + b^\star)}\big)\Big], \]

where \hat{w} = (\hat{\mu}_{+1} - \hat{\mu}_{-1})/\tau and \hat{b} = \frac{\hat{\kappa}_{-1} - \hat{\kappa}_{+1}}{2\tau^2\hat{\kappa}_{-1}\hat{\kappa}_{+1}} + \log\frac{\pi_{+1}}{\pi_{-1}}, and analogously for w^\star and b^\star.

Leveraging the 1-Lipschitz property of f_y from Lemma 3, we obtain

\[ \mathbb{E}_{(z,y)}\Big[\log\big(1 + e^{-y(\hat{w}^\top z + \hat{b})}\big) - \log\big(1 + e^{-y(w^{\star\top} z + b^\star)}\big)\Big] = \mathbb{E}_{(z,y)}\Big[f_y(\hat{w}^\top z + \hat{b}) - f_y(w^{\star\top} z + b^\star)\Big] \le \mathbb{E}_{(z,y)}\big|\hat{w}^\top z + \hat{b} - w^{\star\top} z - b^\star\big|. \]

Considering the convexity of the absolute value under linear transformation and the integral inequality, we deduce

\[ \begin{aligned} \mathbb{E}_{(z,y)}\big|\hat{w}^\top z + \hat{b} - w^{\star\top} z - b^\star\big| &\le \mathbb{E}_{(z,y)}\max_z\big|\hat{w}^\top z + \hat{b} - w^{\star\top} z - b^\star\big| \\ &= \mathbb{E}_{(z,y)}\max\big\{\|\hat{w}-w^\star\|_2 + |\hat{b}-b^\star|,\ \|\hat{w}-w^\star\|_2 - |\hat{b}+b^\star|\big\} \\ &\le \mathbb{E}_{(z,y)}\|\hat{w}-w^\star\|_2 + |\hat{b}-b^\star| \\ &= \|\hat{w}-w^\star\|_2 + |\hat{b}-b^\star| \\ &= \frac{\|\Delta\mu_{+1}-\Delta\mu_{-1}\|_2}{\tau} + \frac{1}{2\tau^2}\left|\Delta\frac{1}{\kappa_{+1}} - \Delta\frac{1}{\kappa_{-1}}\right| \\ &= O\!\left(\Delta\mu + \Delta\frac{1}{\kappa}\right), \end{aligned} \]

where Δμ = μ̂ − μ⋆, Δ(1/κ) = 1/κ̂ − 1/κ⋆, Δμ_{+1} = μ̂_{+1} − μ⋆_{+1}, Δμ_{−1} = μ̂_{−1} − μ⋆_{−1}, Δ(1/κ_{+1}) = 1/κ̂_{+1} − 1/κ⋆_{+1}, and Δ(1/κ_{−1}) = 1/κ̂_{−1} − 1/κ⋆_{−1}. By connecting the above inequalities, the proof is completed.

ACKNOWLEDGMENTS
This work is supported in part by the National Key R&D Program of China under Grant 2021ZD0140407 and the National Natural Science Foundation of China under Grants 62276150 and 42327901. We also appreciate the generous donation of computing resources by High-Flyer AI.

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NeurIPS, 2017. 1
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016. 1, 3, 8, 9
[3] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017. 1, 3
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NeurIPS, 2015. 1, 11, 12
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in CVPR, 2017. 1
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, 2015. 1
[7] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, "The inaturalist species classification and detection dataset," in CVPR, 2018. 1, 8
[8] F. Graf, C. Hofer, M. Niethammer, and R. Kwitt, "Dissecting supervised contrastive learning," in ICML, 2021. 1
[9] C. Fang, H. He, Q. Long, and W. J. Su, "Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training," Proceedings of the National Academy of Sciences, 2021. 1, 3
[10] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, "Class-balanced loss based on effective number of samples," in CVPR, 2019. 1, 3, 8, 10
[11] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis, "Decoupling representation and classifier for long-tailed recognition," in ICLR, 2020. 1, 3, 8, 11
[12] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, "Learning imbalanced datasets with label-distribution-aware margin loss," in NeurIPS, 2019. 1, 3, 8, 10
[13] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar, "Long-tail learning via logit adjustment," in ICLR, 2021. 1, 3, 4, 6, 9, 10, 11, 12
[14] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, "Supervised contrastive learning," in NeurIPS, 2020. 1, 3, 4, 7, 9, 10
[15] T. Li, P. Cao, Y. Yuan, L. Fan, Y. Yang, R. S. Feris, P. Indyk, and D. Katabi, "Targeted supervised contrastive learning for long-tailed recognition," in CVPR, 2022. 1, 3, 7
[16] J. Zhu, Z. Wang, J. Chen, Y.-P. P. Chen, and Y.-G. Jiang, "Balanced contrastive learning for long-tailed visual recognition," in CVPR, 2022. 1, 3, 6, 7, 8, 9, 10, 11
[17] J. Cui, Z. Zhong, S. Liu, B. Yu, and J. Jia, "Parametric contrastive learning," in ICCV, 2021. 1, 3, 6, 7, 8, 10, 11
[18] P. Wang, K. Han, X.-S. Wei, L. Zhang, and L. Wang, "Contrastive learning based hybrid networks for long-tailed image classification," in CVPR, 2021. 1, 3, 6, 7, 10
[19] B. Kang, Y. Li, S. Xie, Z. Yuan, and J. Feng, "Exploring balanced feature spaces for representation learning," in ICLR, 2020. 1, 3, 7
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," in NeurIPS, 2014. 2
[21] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in ICML, 2015. 2
[22] J. Guo, C. Du, J. Wang, H. Huang, P. Wan, and G. Huang, "Assessing a single image in reference-guided image synthesis," in AAAI, 2022. 2
[23] Y. Wang, G. Huang, S. Song, X. Pan, Y. Xia, and C. Wu, "Regularizing deep networks with semantic data augmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 2, 3, 4
[24] S. Li, K. Gong, C. H. Liu, Y. Wang, F. Qiao, and X. Cheng, "MetaSAug: Meta semantic augmentation for long-tailed visual recognition," in CVPR, 2021. 2, 3, 4, 10, 11
[25] Q. Cai, Y. Wang, Y. Pan, T. Yao, and T. Mei, "Joint contrastive learning with infinite possibilities," in NeurIPS, 2020. 2
[26] X. Jia, X.-Y. Jing, X. Zhu, S. Chen, B. Du, Z. Cai, Z. He, and D. Yue, "Semi-supervised multi-view deep discriminant representation learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 2, 3
[27] T. Miyato, S.-I. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 2, 3
[28] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in NeurIPS, 2014. 2, 3
[29] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, "Semi-supervised learning with ladder networks," in NeurIPS, 2015. 2, 3
[30] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: one-sided selection," in ICML, 1997. 3
[31] B. C. Wallace, K. Small, C. E. Brodley, and T. A. Trikalinos, "Class imbalance, redux," in International Conference on Data Mining, 2011. 3
[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 2002. 3
[33] C. Huang, Y. Li, C. C. Loy, and X. Tang, "Learning deep representation for imbalanced classification," in CVPR, 2016. 3
[34] A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla, "On the statistical consistency of algorithms for binary classification under class imbalance," in ICML, 2013. 3
[35] B. Kim and J. Kim, "Adjusting decision boundary for class imbalanced learning," IEEE Access, 2020. 3
[36] J. Zhang, L. Liu, P. Wang, and C. Shen, "To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions," arXiv preprint, 2019. 3
[37] J. Ren, C. Yu, X. Ma, H. Zhao, and S. Yi, "Balanced meta-softmax for long-tailed visual recognition," in NeurIPS, 2020. 3
[38] J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, and J. Yan, "Equalization loss for long-tailed object recognition," in CVPR, 2020. 3
[39] P. Chu, X. Bian, S. Liu, and H. Ling, "Feature space augmentation for long-tailed data," in ECCV, 2020. 3
[40] Y. Zang, C. Huang, and C. C. Loy, "FASA: Feature augmentation and sampling adaptation for long-tailed instance segmentation," in ICCV, 2021. 3
[41] Y. Wang, Z. Ni, S. Song, L. Yang, and G. Huang, "Revisiting locally supervised learning: an alternative to end-to-end training," in ICLR, 2021. 3
[42] Y. Wang, Y. Yue, R. Lu, T. Liu, Z. Zhong, S. Song, and G. Huang, "EfficientTrain: Exploring generalized curriculum learning for training visual backbones," in ICCV, 2023. 3
[43] G. Huang, Y. Wang, K. Lv, H. Jiang, W. Huang, P. Qi, and S. Song, "Glance and focus networks for dynamic visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4605–4621, 2022. 3
[44] K. V. Mardia, P. E. Jupp, and K. Mardia, Directional Statistics. Wiley Online Library, 2000. 3, 4
[45] X. Zhe, S. Chen, and H. Yan, "Directional statistics-based deep metric learning for image classification and retrieval," Pattern Recognition, 2019. 3
[46] K. Roth, O. Vinyals, and Z. Akata, "Non-isotropy regularization for proxy-based deep metric learning," in CVPR, 2022. 3
[47] T. R. Scott, A. C. Gallagher, and M. C. Mozer, "von Mises-Fisher loss: An exploration of embedding geometries for supervised learning," in ICCV, 2021. 3
[48] S. Li, J. Xu, X. Xu, P. Shen, S. Li, and B. Hooi, "Spherical confidence learning for face recognition," in CVPR, 2021. 3
[49] A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, 2005. 3
[50] H. Wang, S. Fu, X. He, H. Fang, Z. Liu, and H. Hu, "Towards calibrated hyper-sphere representation via distribution overlap coefficient for long-tailed learning," in ECCV, 2022. 3, 11
[51] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in ICML, 2020. 3, 9
[52] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, "Bootstrap Your Own Latent: A new approach to self-supervised learning," in NeurIPS, 2020. 3
[53] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," in ECCV, 2020. 3
[54] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in CVPR, 2020. 3
[55] D. Samuel and G. Chechik, "Distributional robustness loss for long-tail learning," in ICCV, 2021. 3, 7, 10
[56] J. Cui, Z. Zhong, Z. Tian, S. Liu, B. Yu, and J. Jia, "Generalized parametric contrastive learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 3, 7, 10, 11
[57] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in ICML, 2021. 3
[58] C. Tian, W. Wang, X. Zhu, J. Dai, and Y. Qiao, "VL-LTR: Learning class-wise visual-linguistic representation for long-tailed visual recognition," in ECCV, 2022. 3
[59] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint, 2015. 3
[60] L. Xiang, G. Ding, and J. Han, "Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification," in ECCV, 2020. 3
[61] X. Wang, L. Lian, Z. Miao, Z. Liu, and S. X. Yu, "Long-tailed recognition by routing diverse distribution-aware experts," in ICLR, 2020. 3, 10, 11
[62] Y.-Y. He, J. Wu, and X.-S. Wei, "Distilling virtual examples for long-tailed recognition," in ICCV, 2021, pp. 235–244. 3
[63] J. Li, Z. Tan, J. Wan, Z. Lei, and G. Guo, "Nested collaborative learning for long-tailed visual recognition," in CVPR, 2022. 3, 10, 11
[64] B. Zhu, Y. Niu, X.-S. Hua, and H. Zhang, "Cross-domain empirical risk minimization for unbiased long-tailed classification," in AAAI, 2022. 3
[65] G. Huang and C. Du, "The high separation probability assumption for semi-supervised learning," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022. 3
[66] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," in NeurIPS, 2020. 3, 7, 11, 12
[67] J. Kim, Y. Hur, S. Park, E. Yang, S. J. Hwang, and J. Shin, "Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning," in NeurIPS, 2020. 4, 12
[68] C. Wei, K. Sohn, C. Mellina, A. Yuille, and F. Yang, "CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning," in CVPR, 2021. 4, 12
[69] Y. Oh, D.-J. Kim, and I. S. Kweon, "DASO: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning," in CVPR, 2022. 4, 9, 11, 12
[70] J. L. W. V. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes," Acta Mathematica, 1906. 4
[71] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, 2012. 5
[72] M. Abramowitz, I. A. Stegun et al., Handbook of Mathematical Functions, 1964. 6, 12
[73] W. Jitkrittum, A. K. Menon, A. S. Rawat, and S. Kumar, "ELM: Embedding and logit margins for long-tail learning," arXiv preprint, 2022. 6, 7
[74] Z. Hou, B. Yu, and D. Tao, "BatchFormer: Learning to explore sample relationships for robust representation learning," in CVPR, 2022. 7, 11
[75] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in CVPR, 2018. 7
[76] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu, "Open long-tailed recognition in a dynamic world," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 7, 8
[77] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, and D. Lin, "Seesaw loss for long-tailed instance segmentation," in CVPR, 2021. 7
[78] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009. 8
[79] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," 2011. 8
[80] A. Gupta, P. Dollar, and R. Girshick, "LVIS: A dataset for large vocabulary instance segmentation," in CVPR, 2019, pp. 5356–5364. 8

[81] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen, “BBN: Bilateral-
branch network with cumulative learning for long-tailed visual
recognition,” in CVPR, 2020. 10
[82] Y. Yang and Z. Xu, “Rethinking the value of labels for improving
class-imbalanced learning,” in NeurIPS, 2020. 10, 11
[83] K. Tang, J. Huang, and H. Zhang, “Long-tailed classification by
keeping the good and removing the bad momentum causal effect,”
in NeurIPS, 2020. 10
[84] J. Cui, S. Liu, Z. Tian, Z. Zhong, and J. Jia, “ResLT: Residual
learning for long-tailed recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2022. 10, 11
[85] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le,
“AutoAugment: Learning augmentation strategies from data,” in
CVPR, 2019. 8
[86] T. DeVries and G. W. Taylor, “Improved regularization of convo-
lutional neural networks with cutout,” arXiv preprint, 2017. 8
[87] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment:
Practical automated data augmentation with a reduced search
space,” in CVPR Workshops, 2020. 9
[88] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated
residual transformations for deep neural networks,” in CVPR,
2017. 9
[89] S. Zhang, Z. Li, S. Yan, X. He, and J. Sun, “Distribution alignment:
A unified framework for long-tail visual recognition,” in CVPR,
2021. 11
[90] T. Li, L. Wang, and G. Wu, “Self supervision to distillation for
long-tailed visual recognition,” in ICCV, 2021. 11
[91] A. M. H. Tiong, J. Li, G. Lin, B. Li, C. Xiong, and S. C. Hoi, “Im-
proving tail-class representation with centroid contrastive learn-
ing,” Pattern Recognition Letters, 2023. 11
[92] M. Li, Y.-M. Cheung, and Y. Lu, "Long-tailed visual recognition via gaussian clouded logit adjustment," in CVPR, 2022. 11
[93] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
ICCV, 2017, pp. 2961–2969. 11, 12
[94] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng,
Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li,
X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy,
and D. Lin, “MMDetection: Open mmlab detection toolbox and
benchmark,” arXiv preprint, 2019. 12
[95] A. Maurer and M. Pontil, “Empirical bernstein bounds and
sample-variance penalization,” in Annual Conference Computational
Learning Theory, 2009. 13
