
Bayesian Self-Supervised Contrastive Learning

Bin Liu¹   Bang Wang*¹

arXiv:2301.11673v3 [cs.LG] 23 May 2023

Abstract

Recent years have witnessed many successful applications of contrastive learning in diverse domains, yet its self-supervised version still presents many exciting challenges. As the negative samples are drawn from unlabeled datasets, a randomly selected sample may actually be a false negative to an anchor, leading to incorrect encoder training. This paper proposes a new self-supervised contrastive loss, called the BCL loss, that still uses random samples from the unlabeled data while correcting the resulting bias with importance weights. The key idea is to design the desired sampling distribution for sampling hard true negative samples under a Bayesian framework. The prominent advantage is that the desired sampling distribution has a parametric structure, with a location parameter for debiasing false negatives and a concentration parameter for mining hard negatives, respectively. Experiments validate the effectiveness and superiority of the BCL loss.¹

1. Introduction

Unsupervised learning has been extensively researched for its advantage of learning representations without human labelers manually labeling data. How to learn good representations without supervision, however, has been a long-standing problem in machine learning. Recently, contrastive learning that leverages a contrastive loss (Chopra et al., 2005; Hadsell et al., 2006) to train a representation encoder has been promoted as a promising solution to this problem (Oord et al., 2018; Tian et al., 2020; Liu et al., 2021; Chen et al., 2020b). Remarkable successes of contrastive learning have been observed for many applications in different domains (Alec et al., 2018; 2019; Misra & Maaten, 2020; He et al., 2020). Nonetheless, its potential can be further released with a better contrastive loss design.

We study the following self-supervised contrastive learning problem (Oord et al., 2018; Chuang et al., 2020; Robinson et al., 2021; Chen et al., 2020a; Liu et al., 2021): Consider an unlabeled dataset X and a class label set C, and let h : X → C be the classification function assigning a data point x ∈ X a class label c ∈ C. Assume that observing a class label is uniform, ρ(c) = τ+, and that τ− = 1 − τ+ is the probability of observing any different class. For a given data point x, let p+(x+) = p(x+ | h(x) = h(x+)) denote the probability of another point x+ having the same label as x; in such a case, x+ is called a positive sample specific to x. Likewise, let p−(x−) = p(x− | h(x) ≠ h(x−)) denote the probability of another point x− having a different label from x; in such a case, x− is called a negative sample specific to x. Let f : X → R^d denote the representation learning function (i.e., encoder) that maps a point x to an embedding f(x) on a d-dimensional hypersphere.

Self-supervised contrastive learning contrasts similar pairs (x, x+) and dissimilar pairs (x, x−) to train the encoder f (Wang & Isola, 2020; Wang & Liu, 2021; Chuang et al., 2020), where the objective is to encourage the representations of (x, x+) to be closer than those of (x, x−). In training the encoder, we randomly draw a point from the underlying data distribution pd on X, i.e., x ∼ pd, and its positive sample x+ can be easily obtained from some semantic-invariant operation on x like image masking, written as x+ ∼ p+. In practice, a negative sample x− is drawn from the unlabeled dataset, i.e., x− ∼ pd. However, the sample x− could potentially have the same label as x, i.e., it is a false negative to x. In such a case, the false negative sample, also called sampling bias (Chuang et al., 2020), would degrade the learned representations (Wang & Liu, 2021; Chuang et al., 2020).

In this work, we formalize the two crucial tasks of self-supervised contrastive learning, i.e., debiasing false negatives and mining hard negatives, in a Bayesian setting. These tasks are implicitly performed by jointly utilizing sample information (the empirical distribution function) and prior class information τ, which offers a flexible and principled self-supervised contrastive learning framework. It leverages the classical idea of Monte Carlo importance sampling to draw random samples from the unlabeled data while correcting for the resulting biases with importance weights, which avoids the high computational costs associated with explicit negative sampling.

¹ [email protected]; [email protected]; School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. Corresponding author: Bang Wang.
Contribution: (i) We propose the Bayesian self-supervised Contrastive Learning objective, a novel extension of the contrastive loss function that allows for the false negative debiasing task and the hard true negative mining task in a flexible and principled way. BCL is easy to implement, requiring only a small modification of the original contrastive loss function. (ii) We derive analytical expressions for the class conditional probability density of negative and positive examples, enabling us to estimate the posterior probability of an unlabeled sample being a true negative without making any parametric assumptions. This provides a new understanding of positive-unlabeled data and inspires future research.

2. Contrastive Loss and Analysis

2.1. Contrastive Loss

In the context of supervised contrastive learning, dissimilar pairs (x, x−) can be easily constructed by randomly drawing a true negative sample x− specific to x, i.e., x− ∼ p−. The contrastive predictive coding (CPC) (Oord et al., 2018) introduces the following InfoNCE loss (Gutmann & Hyvärinen, 2010; 2012):

    L_SUP = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p^-} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^{N} e^{f(x)^\top f(x_i^-)}} \Big]    (1)

to learn an encoder f : X → R^d that maps a data point x to the hypersphere of radius 1/t, where t is the temperature scaling. As in the CPC, we also set t = 1 in our theoretical analysis.

In the context of self-supervised contrastive learning, however, as samples' labels are not available, i.e., p−(x′) = p(x′ | h(x) ≠ h(x′)) is not accessible, the standard approach is to draw N samples from the data distribution pd, which are supposed to be negative samples to x, and to optimize the following InfoNCE self-supervised contrastive loss:

    L_BIASED = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p_d} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^{N} e^{f(x)^\top f(x_i^-)}} \Big]    (2)

Following the DCL (Chuang et al., 2020), it is also called the biased contrastive loss, since those supposedly negative samples x− drawn from pd might come from the same class as the data point x with probability τ+.

2.2. Sampling Bias Analysis

Let x− ∈ TN denote x− being a true negative (TN) sample specific to x. Let x− ∈ FN denote x− being a false negative (FN) sample specific to x, i.e., x− and x have the same ground truth class label. Note that whether x− is a TN or FN is specific to a particular anchor point x, and in what follows, we omit "specific to x" for brevity. It has been proven that for {x_i^- ∈ TN}_{i=1}^{N}, optimizing the InfoNCE loss L_SUP will result in the learning model estimating and optimizing the density ratio p+/p− (Oord et al., 2018; Poole et al., 2019). Denote x̂+ = e^{f(x)^T f(x+)}. The CPC (Oord et al., 2018) proves that minimizing L_SUP leads to

    x̂^+ ∝ p^+ / p^-.    (3)

As discussed by (Oord et al., 2018), p+/p− preserves the mutual information (MI) of future information and present signals, where MI maximization is a fundamental problem in science and engineering (Poole et al., 2019; Belghazi et al., 2018).

Now consider the InfoNCE loss L_BIASED, which can be regarded as the categorical cross-entropy of classifying one positive sample x+ among unlabeled samples. For analysis purposes, we rewrite x+ as x0. Given N + 1 unlabeled data points, the posterior probability of one data point x0 being a positive sample can be derived as

    P(x_0 \in \mathrm{pos} \mid \{x_i\}_{i=0}^{N}) = \frac{p^+(x_0) \prod_{i=1}^{N} p_d(x_i)}{\sum_{j=0}^{N} p^+(x_j) \prod_{i \neq j} p_d(x_i)} = \frac{p^+(x_0)/p_d(x_0)}{p^+(x_0)/p_d(x_0) + \sum_{j=1}^{N} p^+(x_j)/p_d(x_j)}    (4)

To minimize L_BIASED, the optimal value of this posterior probability is 1, which is achieved in the limit of p+(x0)/pd(x0) → +∞ or p+(xj)/pd(xj) → 0. Minimizing L_BIASED thus leads to

    x̂^+ ∝ p^+ / p_d.    (5)

Note that this is different from Eq. (3), since x_i^- may not be TN for lack of ground truth labels.

Denote x̂+ = m · p+/pd, m ≥ 0. We investigate the gap between optimizing x̂+ and the optimization objective p+/p−. Inserting pd = p−τ− + p+τ+ back into Eq. (5), we obtain

    x̂^+ = m \cdot \frac{p^+}{p^- τ^- + p^+ τ^+}.    (6)

Rearranging the above equation yields

    p^+ / p^- = \frac{x̂^+ \cdot τ^-}{m - x̂^+ \cdot τ^+}.    (7)

Fig. 1 illustrates the approximate shape of Eq. (7) as a fractional function, which reveals the inconsistency between InfoNCE L_BIASED optimization and MI optimization. That is, when optimizing the InfoNCE loss, an increase of x̂+ does not lead to a monotonic increase of p+/p−. Indeed, the existence of a jump discontinuity indicates that the optimization of L_BIASED does not necessarily lead to tractable MI optimization. The reason for the intractable MI optimization is the fact that not all {x_i^-}_{i=1}^{N} are TN samples, as they are randomly drawn from the data distribution pd. This leads to the inclusion of p+ in the denominator of Eq. (6) when decomposing the data distribution pd. Fig. 7 in the Appendix provides an intuitive explanation. The four sampled data points actually contain one FN sample. Such a FN sample should be pulled closer to the anchor x. However, as it is mistakenly treated as a negative sample, during model training it will be pushed further apart from the anchor, which breaks the semantic structure of embeddings (Wang & Liu, 2021).

Figure 1. Illustration of L_BIASED and mutual information optimization by Eq. (7).
3. The Proposed Method

In this paper, we consider randomly drawing negative samples {x_i^-}_{i=1}^{N} from the unlabeled dataset, i.e., x_i^- ∼ pd. As the class label is not accessible, x_i^- could be either a TN sample or a FN sample. We propose to include and compute an importance weight ωi in the InfoNCE contrastive loss for correcting the resulting bias of drawing negative samples from pd. The ideal situation is that we can set ω = 0 for each FN sample, so that only the hard true negative samples contribute to the calculation of the contrastive loss, which relies on the design of the desired sampling distribution.

We consider the following two design principles of the sampling distribution for drawing {x_i^-}_{i=1}^{N}. The true principle (Wang & Liu, 2021; Robinson et al., 2021) states that the FN samples should not be pushed apart from the anchor x in the embedding space. The hard principle (Yannis et al., 2020; Robinson et al., 2021; Florian et al., 2015; Hyun et al., 2016) states that the hard TN samples should be pushed further apart in the embedding space.

3.1. False Negative Debiasing

We first consider the true principle for the design of the sampling distribution. We denote the power exponent of similarity between an anchor x and another unlabeled sample x′ as x̂ = e^{f(x)^T f(x′)}. Assume that x̂ is independently and identically distributed with probability density function ϕ and cumulative distribution function Φ(x̂) = \int_{-\infty}^{x̂} ϕ(t) dt. As x′ can be either a TN sample or a FN sample, ϕ contains two populations, denoted as ϕTN and ϕFN. The problem is reduced to estimating the sum over x̂ ∼ ϕTN, i.e., \sum_{i=1}^{N} e^{f(x)^\top f(x_i^-)}, while using samples x̂ ∼ ϕ.

Existing approaches to this problem perform density estimation to fit ϕ (Xia et al., 2022), where ϕ is parameterized as a two-component mixture of ϕTN and ϕFN, such as the Gaussian Mixture Model (Lindsay, 1995) or the Beta Mixture Model (Xia et al., 2022). To make the analysis possible, ϕTN and ϕFN are postulated to follow a simple density function with fixed parameters, which is too strong an assumption. In addition, the learning algorithm for estimating ϕ is expensive, since the mixture coefficients that indicate the probability of x̂ ∈ TN or FN are hidden variables. The parameters of ϕTN and ϕFN can only be obtained through the iterative numerical computation of the EM algorithm (Dempster et al., 1977), which is sensitive to initial values.

In this paper, we propose an analytic method without explicitly estimating ϕ, also called a nonparametric method in statistical theory. Consider n random variables from ϕ arranged in ascending order according to their realizations. We write them as X(1) ≤ X(2) ≤ ··· ≤ X(n), where X(k) is called the k-th (k = 1, ..., n) order statistic (David & Nagaraja, 2004). The probability density function (PDF) of X(k) is given by:

    ϕ_{(k)}(x̂) = \frac{n!}{(k-1)!(n-k)!} Φ^{k-1}(x̂)\, ϕ(x̂)\, [1 - Φ(x̂)]^{n-k}

By conditioning on n = 2 we obtain:

    ϕ_{(1)}(x̂) = 2ϕ(x̂)[1 - Φ(x̂)]    (8)
    ϕ_{(2)}(x̂) = 2ϕ(x̂)Φ(x̂)    (9)

Next, we investigate the positions of positive and negative samples on the hypersphere, so as to gain deeper insight into ϕTN. Consider a (x, x+, x−) triple. There exists a closed ball B[f(x), d+] = {f(·) | d(f(x), f(·)) ≤ d+} with center f(x) and radius d+, where d+ = ∥f(x) − f(x+)∥ is the distance between the anchor embedding f(x) and the positive sample embedding f(x+). Two possible cases arise: f(x−) ∈ B[f(x), d+] or f(x−) ∉ B[f(x), d+], as illustrated by Fig. 2. We can describe the two cases with the Euclidean distance: Fig. 2(a) corresponds to d+ < d−, and Fig. 2(b) corresponds to d− ≤ d+, where d− = ∥f(x) − f(x−)∥. Note that the Euclidean distance d^± = \sqrt{2/t^2 - 2 f(x)^\top f(x^\pm)}, since all embeddings f(·) are on the surface of a hypersphere of radius 1/t, so we have x̂− < x̂+ for case (a) and x̂+ ≤ x̂− for case (b). Expressed in the notation of order statistics, x̂− (marked in blue in Fig. 2) is a realization of X(1) for case (a), or a realization of X(2) for case (b), respectively.
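As a sanity check on Eqs. (8)-(9) (our addition, not part of the paper's own experiments), the following sketch draws pairs of i.i.d. scores from an assumed standard normal ϕ, as in the Fig. 3 illustration, and compares histogram estimates of the pairwise minimum and maximum against 2ϕ(x̂)[1 − Φ(x̂)] and 2ϕ(x̂)Φ(x̂).

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of the n = 2 order-statistic densities in Eqs. (8)-(9),
# assuming phi = N(0, 1) (our choice for illustration).
rng = np.random.default_rng(0)
pairs = rng.standard_normal((200_000, 2))
x_min, x_max = pairs.min(axis=1), pairs.max(axis=1)   # realizations of X_(1) and X_(2)

edges = np.linspace(-3, 3, 31)
centers = 0.5 * (edges[:-1] + edges[1:])
emp_1, _ = np.histogram(x_min, bins=edges, density=True)
emp_2, _ = np.histogram(x_max, bins=edges, density=True)

ana_1 = 2 * norm.pdf(centers) * (1 - norm.cdf(centers))   # Eq. (8)
ana_2 = 2 * norm.pdf(centers) * norm.cdf(centers)         # Eq. (9)

print("max abs error, phi_(1):", np.abs(emp_1 - ana_1).max())
print("max abs error, phi_(2):", np.abs(emp_2 - ana_2).max())
```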
Figure 2. Two possible cases for the relative positions of anchor, positive, and negative triples.

The generation process of an observation x̂ from ϕTN can be described as follows: select case (a) with probability α and then generate an observation x̂ from ϕ(1); or select case (b) with probability 1 − α and then generate an observation x̂ from ϕ(2). That is, ϕTN is the mixture of X(1) and X(2) with mixture coefficient α:

    ϕ_TN(x̂) = α ϕ_{(1)}(x̂) + (1 - α) ϕ_{(2)}(x̂)    (10)

Similarly, ϕFN is the mixture of X(2) and X(1) with mixture coefficient α:

    ϕ_FN(x̂) = α ϕ_{(2)}(x̂) + (1 - α) ϕ_{(1)}(x̂)    (11)

Note that taking x̂− as a realization of X(1) for case (a) omits the situation x̂− = x̂+. The probability measure of x̂− for such a case is 0, as ϕ is a continuous density function dominated by the Lebesgue measure.

Proposition 3.1 (Class Conditional Density). If ϕ(x̂) is a continuous density function that satisfies ϕ(x̂) ≥ 0 and \int_{-\infty}^{+\infty} ϕ(x̂) dx̂ = 1, then ϕTN(x̂) and ϕFN(x̂) are probability density functions that satisfy ϕTN(x̂) ≥ 0, ϕFN(x̂) ≥ 0, and \int_{-\infty}^{+\infty} ϕTN(x̂) dx̂ = 1, \int_{-\infty}^{+\infty} ϕFN(x̂) dx̂ = 1.

Proof. See Appendix C.1.

With the class conditional densities of positive and negative samples, we can obtain the density of unlabeled data ϕUN by applying the total probability formula to relate marginal probabilities to conditional probabilities:

    ϕ_UN = τ^- ϕ_TN + τ^+ ϕ_FN    (12)

Table 1 presents an overview of the densities. By integrating both sides of Eq. (12) we can establish the transformation between the two cumulative distribution functions Φ(x̂) and ΦUN(x̂) as follows:

    Φ(x̂) = \frac{-b + \sqrt{b^2 + 4a Φ_UN(x̂)}}{2a}    (13)

where a = (1 − 2α)(τ− − τ+) and b = 2[ατ− + (1 − α)τ+]. The details are presented in Appendix B.1. Note that ΦUN(x̂) can be directly computed from N unlabeled data points using its empirical counterpart

    \hat{Φ}_UN(x̂) = \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \le x̂\}}.    (14)

The above empirical cumulative distribution function (C.D.F.) Φ̂UN(x̂) uniformly converges to the common C.D.F. ΦUN(x̂) = \int_{-\infty}^{x̂} ϕUN(t) dt, as stated by the Glivenko theorem (Glivenko, 1933).

A further understanding and clarification of the mixture coefficient α may be gained by reviewing Fig. 2. Intuitively, α is the probability that f(x−) falls out of B[f(x), d+]. For a worst encoder f that randomly guesses, α = 0.5, while for a perfect encoder α = 1. Therefore, the reasonable range of α is [0.5, 1]. In fact, α reflects the encoder's capability of scoring a positive sample higher than a negative sample, which admits the empirical macro-AUC metric over all anchors x in the training data set D:

    α = \int_{x \in X} \int_{0}^{+\infty} \int_{0}^{+\infty} I(x̂^+ \ge x̂^-)\, p(x̂^+, x̂^-)\, p(x)\, dx̂^+ dx̂^- dx
      ≃ \frac{1}{|D|} \sum_{x \in D} \frac{1}{|D^+||D^-|} \sum_{D^+} \sum_{D^-} I(x̂^+ \ge x̂^-) = \frac{1}{|D|} \mathrm{AUC}    (15)

Figure 3. ϕTN(x̂) and ϕFN(x̂) with different α settings.

By setting ϕ(x̂) as N(0, 1) in Eq. (8) and Eq. (9), we can get a quick snapshot of how α affects ϕTN(x̂) and ϕFN(x̂) in Eq. (10) and Eq. (11), as illustrated in Fig. 3. A larger value of α results in higher discrimination of ϕTN(x̂) from ϕFN(x̂), since a better encoder encodes dissimilar data points with different class labels more orthogonally (Chuang et al., 2020).
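For concreteness, a minimal sketch of how Eqs. (13)-(14) can be computed from a batch of similarity scores is given below. The function names empirical_cdf and cdf_trans and the [B, N] batch layout are our own choices for illustration (Algorithm 4 in Appendix B.3 uses the same quantities).

```python
import torch

def empirical_cdf(neg: torch.Tensor) -> torch.Tensor:
    """Eq. (14): empirical C.D.F. of the unlabeled scores, per anchor.

    neg: [B, N] scores x_hat = exp(f(x)^T f(x_i^-)) for each anchor.
    Returns Phi_hat_UN evaluated at each score, shape [B, N].
    """
    n = neg.size(1)
    # Fraction of scores x_hat_j <= x_hat_i within the same row.
    return (neg.unsqueeze(1) <= neg.unsqueeze(-1)).float().sum(dim=-1) / n

def cdf_trans(cdf_un: torch.Tensor, tau_plus: float, alpha: float) -> torch.Tensor:
    """Eq. (13): recover the anchor-specific C.D.F. Phi from Phi_UN."""
    tau_minus = 1.0 - tau_plus
    a = (1.0 - 2.0 * alpha) * (tau_minus - tau_plus)
    b = 2.0 * (alpha * tau_minus + (1.0 - alpha) * tau_plus)
    if abs(a) < 1e-12:       # alpha = 0.5 or tau+ = 0.5: Phi = Phi_UN (see Appendix B.1, Fig. 9)
        return cdf_un
    return (-b + torch.sqrt(b * b + 4.0 * a * cdf_un)) / (2.0 * a)
```

Because the weights of Section 3.3 depend on the scores only through these C.D.F. values, they are invariant to the strictly monotonic mapping x̂ → exp(x̂/t) discussed in Section 4.2.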
Table 1. Overview of densities.

Densities   | Meaning                                                                      | Description                                                        | Parametric Assumption
ϕ(x̂)       | Anchor-specific density of scores p(x̂; x, f).                               | Determined by the anchor x and encoder f.                          | No
ϕ(1)(x̂)    | Density of order statistic X(1).                                             | Determined by ϕ(x̂).                                               | No
ϕ(2)(x̂)    | Density of order statistic X(2).                                             | Determined by ϕ(x̂).                                               | No
ϕTN(x̂)     | Class conditional density of true negatives p(x̂ given x− ∈ TN).             | Determined by α and the order statistics ϕ(1)(x̂), ϕ(2)(x̂).       | No
ϕFN(x̂)     | Class conditional density of false negatives p(x̂ given x− ∈ FN).            | Determined by α and the order statistics ϕ(1)(x̂), ϕ(2)(x̂).       | No
ϕTHN(x̂)    | Class conditional density of hard true negatives p(x̂ given x− ∈ TN, HARD = β). | A component of ϕTN conditioned on a specific hardness level β.  | No
ϕUN(x̂)     | Density of unlabeled data, ϕUN = τ−ϕTN + τ+ϕFN.                              | Determined by class prior τ and the class conditional densities.   | No

3.2. Hard Negative Mining

We next consider the hard principle for the design of the sampling distribution. Note that the first term αϕ(1)(x̂) of ϕTN(x̂) in Eq. (10) is the easy negative sample component that has been correctly classified by the classifier (as shown in Fig. 2), and the second term (1 − α)ϕ(2)(x̂) is the hard negative sample component that has been incorrectly classified by the classifier. We decompose ϕTN(x̂) and select the component that corresponds to the hard negative samples as the target sampling distribution. To achieve this, we introduce a parameter β ∈ [0.5, 1] to decompose the class conditional density of true negatives ϕTN(x̂) as follows:

    ϕ_TN(x̂) = (1 - β + β)[α ϕ_{(1)}(x̂) + (1 - α) ϕ_{(2)}(x̂)]
             = (1 - β) α ϕ_{(1)}(x̂) + β (1 - α) ϕ_{(2)}(x̂)    (16)
             + β α ϕ_{(1)}(x̂) + (1 - β)(1 - α) ϕ_{(2)}(x̂)    (17)

In Eq. (16), the parameter 1 − β controls the proportion of the easy negative sample component αϕ(1)(x̂), while β controls the proportion of the hard negative sample component (1 − α)ϕ(2)(x̂) that has been incorrectly classified by the classifier. Thus, Eq. (16) can be interpreted as the density of hard true negative samples p(x̂, HARD | TN) with a hardness level of β: a larger value of β (approaching 1) leads p(x̂, HARD | TN) to contain a higher proportion of the hard negative sample component that has been incorrectly classified by the classifier. Similarly, Eq. (17), the mirrored counterpart of Eq. (16), represents the density of easy true negative samples p(x̂, EASY | TN) with an easiness level of β. The above algebraic transformation can be viewed as a total probability decomposition of the class conditional distribution ϕTN(x̂):

    ϕ_TN(x̂) ≜ p(x̂ | TN) = p(x̂, HARD = β | TN) + p(x̂, EASY = β | TN)    (18)

The distribution p(x̂, HARD | TN) in Eq. (16) is a newly defined distribution measured on a separate sample space consisting of hard negative examples, which is a subset of the population of true negative examples. So the desired distribution for drawing {x_i^-}_{i=1}^{N}, conditioned on both the true principle and the hard principle, can be derived as:

    ϕ_THN(x̂; α, β) ≜ p(x̂ | TN, HARD)
                    = p(x̂, HARD = β | TN) / p(HARD = β | TN)
                    = \frac{(1 - β) α ϕ_{(1)}(x̂) + β (1 - α) ϕ_{(2)}(x̂)}{(1 - β) α + β (1 - α)}    (19)

where p(HARD = β | TN) is the normalization constant and can be exactly calculated by the marginal integration p(HARD = β | TN) = \int_{-\infty}^{\infty} p(x̂, HARD = β | TN) dx̂ = (1 − β)α + β(1 − α).

When β = 0.5, ϕTHN(x̂; α, β) = ϕTN(x̂), and the target of the sampling distribution remains the original class-conditional density of true negatives. In contrast, when β = 1, ϕTHN(x̂; α, β) = ϕ(2)(x̂), and the target of the sampling distribution only includes the hard negative component that has been misclassified by the classifier.

3.3. Monte Carlo Importance Sampling

Now that we have a batch of N i.i.d. unlabeled samples {x̂i}_{i=1}^{N} ∼ ϕUN and the desired target sampling distribution ϕTHN, we can approximate the expectation of scores over hard true negative samples using classic Monte Carlo importance sampling (Hesterberg, 1988; Bengio & Senécal, 2008):

    \mathbb{E}_{x̂ \sim ϕ_THN}[x̂] = \int_{-\infty}^{+\infty} x̂ \frac{ϕ_THN(x̂; α, β)}{ϕ_UN(x̂)} ϕ_UN(x̂) dx̂
                                  = \mathbb{E}_{x̂ \sim ϕ_UN}\Big[ x̂ \frac{ϕ_THN(x̂; α, β)}{ϕ_UN(x̂)} \Big]
                                  ≃ \frac{1}{N} \sum_{i=1}^{N} ω_i x̂_i    (20)

where ωi is the density ratio between the desired sampling distribution ϕTHN(x̂; α, β) and the unlabeled data distribution ϕUN(x̂), which can be calculated by:

    ω_i(x̂_i; α, β) = \frac{ϕ_THN(x̂_i; α, β)}{ϕ_UN(x̂_i)}    (21)

ωi(x̂i; α, β) is a function of the C.D.F. Φ(x̂) only, as ϕ(x̂) is eliminated by the fractional form of ω. The importance weight ω(x̂; α, β) essentially assigns customized weights to the N unlabeled samples for correcting the bias due to sampling from the unlabeled data.
So far we have finished the estimation of the summation over hard true negative samples by using samples still randomly drawn from the unlabeled data. The key lies in the calculation of the importance weight for correcting the bias between the actual sampling distribution and the desired sampling distribution. By characterizing the encoder using the macro-AUC metric, the calculable empirical C.D.F. is introduced as the likelihood to construct the posterior estimator, which we name the Bayesian self-supervised Contrastive Learning objective:

    L_BCL = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p_d} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^{N} ω_i \cdot e^{f(x)^\top f(x_i^-)}} \Big]    (22)

Complexity: The steps of the L_BCL computation are presented in Algorithm 4. Compared with the InfoNCE loss, the additional computational complexity comes from first calculating the empirical C.D.F. Φ̂UN(x̂) from the unlabeled data, which is in the order of O(N) and can be neglected as it is far smaller than the cost of encoding a sample. Next, we provide a deeper understanding of L_BCL.

Lemma 3.2 (Posterior Probability Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any observation x̂i, the posterior probability estimate of x̂i being a true negative is given by p(TN | x̂i) = ωi τ−.

Proof. See Appendix C.2.

Lemma 3.3 (Asymptotic Unbiased Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any encoder f, as N → ∞, we have L_BCL → L_SUP.

Proof. See Appendix C.3.

Lemma 3.2 establishes the relationship between the weight ω and the posterior probability, and their calculation only involves (i) Φ̂UN(x̂) ∈ [0, 1], which represents the sample information from a Bayesian viewpoint and can be directly calculated from the N unlabeled data points. It carries a probabilistic interpretation of the unlabeled data x ∈ FN given its observation x̂: for larger values of x̂ (i.e., a closer embedding distance to the anchor), Φ̂UN(x̂) assigns a higher probability to the unlabeled sample x sharing the identical latent class label with the anchor. In other words, Φ̂UN(x̂) reflects the classification result of the current model on the sample x, that is, the likelihood. (ii) The class prior probability τ.

The parameter α ∈ [0.5, 1] corresponds to the encoder's macro-AUC metric and controls the confidence level of the sample information. It can be empirically estimated using a validation dataset during the training process. The parameter β ∈ [0.5, 1] corresponds to the proportion of the misclassified hard negative component in the desired sampling distribution and controls the hardness level of the distribution for sampling hard negatives. It can be manually selected. We refer to Fig. 4 for a detailed illustration of how the importance weight ω(x̂; α, β) changes with respect to Φ̂UN(x̂) under various settings of α and β.

4. Numerical Experiments

4.1. Experiment Objective

Recall that the core operation of a self-supervised contrastive loss is to estimate the expectation of x̂i (Chuang et al., 2020), where xi ∈ TN and x̂i ≜ e^{f(x)^T f(xi)}, to approximate the supervised loss L_SUP using N randomly selected unlabeled samples xi. For the supervised loss L_SUP, we define the mean of the true negative samples' observations by

    θ_SUP = \frac{\sum_{i=1}^{N} I(x_i) \cdot x̂_i}{\sum_{i=1}^{N} I(x_i)},    (23)

where I(xi) is the indicator function: I(xi) = 1 if xi ∈ TN and I(xi) = 0 otherwise, since the expectation is replaced by empirical estimates in practice (Oord et al., 2018; Chuang et al., 2020; Chen et al., 2020a). For the proposed self-supervised loss L_BCL, we define the BCL estimator by

    \hat{θ}_BCL = \frac{1}{N} \sum_{i=1}^{N} ω_i \cdot x̂_i.    (24)

The empirical counterpart of the unsupervised loss L_BCL equals the supervised loss L_SUP if θ̂_BCL = θ_SUP. We compare with the following two estimators: θ̂_BIASED by (Oord et al., 2018) and θ̂_DCL by (Chuang et al., 2020):

    \hat{θ}_BIASED = \frac{1}{N} \sum_{i=1}^{N} x̂_i    (25)

    \hat{θ}_DCL = \frac{1}{N τ^-} \Big( \sum_{i=1}^{N} x̂_i - N τ^+ \cdot \frac{\sum_{j=1}^{K} x̂_j^+}{K} \Big)    (26)

Eq. (26) can be understood as the summation of the TN samples' observations divided by the number of TN samples Nτ−. Specifically, Nτ+ is the number of FN samples, and \sum_{j=1}^{K} x̂_j^+ / K is the mean value of K FN samples. So the second term inside the parentheses corresponds to the summation of the FN samples' observations among the N samples, while subtracting it from \sum_{i=1}^{N} x̂_i corresponds to the summation of the TN samples among the N randomly selected samples.
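As an illustration of Eqs. (23)-(26) (our sketch; the omega argument is assumed to come from the importance-weight sketch of Section 3.3, and the TN mask is only available in simulation), the four estimators can be compared on one batch of scores:

```python
import torch

def estimators(scores, is_tn, pos_scores, omega, tau_plus):
    """Compare theta_SUP (Eq. 23) with the BCL, BIASED and DCL estimators (Eqs. 24-26).

    scores:     [N] exponentiated similarities x_hat_i of the unlabeled samples
    is_tn:      [N] boolean mask, True where x_i is a true negative (known in simulation only)
    pos_scores: [K] exponentiated similarities of K positive samples
    omega:      [N] importance weights from Eq. (21)
    """
    n = scores.numel()
    tau_minus = 1.0 - tau_plus
    theta_sup = scores[is_tn].mean()                                               # Eq. (23)
    theta_bcl = (omega * scores).mean()                                            # Eq. (24)
    theta_biased = scores.mean()                                                   # Eq. (25)
    theta_dcl = (scores.sum() - n * tau_plus * pos_scores.mean()) / (n * tau_minus)  # Eq. (26)
    return theta_sup, theta_bcl, theta_biased, theta_dcl
```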
(a) Fixed α with different β settings.    (b) Fixed β with different α settings.

Figure 4. Illustration of how the weight ω changes with respect to Φ̂UN(x̂) under various settings of α and β, where τ+ is fixed at 0.1. The x-axis is the empirical C.D.F. Φ̂UN of the unlabeled samples. A value of Φ̂UN(x̂) = 1 indicates that the instance is the nearest to the anchor, i.e., the highest-scored sample. The x-axis can thus be understood as the hardness level of the samples, with values closer to 1 indicating a higher hardness level. The y-axis represents the importance weight ω calculated by the BCL computation process. Since ω is calculated in terms of the empirical C.D.F. (relative ranking position) rather than an absolute similarity score, it is more robust to different temperature scaling settings and encoders. Different values of α and β provide flexible options for up-weighting or down-weighting negative samples.

4.2. Experiment Design

We design a stochastic process framework to simulate x̂, which depicts the numerical generative process of an observation x̂. In short, an observation x̂ is realized in the following sample function space.
Definition 4.1 (Sample Function Space). Consider a function of observations X(x, e) of two variables defined on X × Ω. For an anchor x ∈ X, X(x, e) is a random variable on a probability space (Ω, F, P) related to the randomly selected unlabeled samples xi; for a fixed e ∈ Ω, X(x, e) is a sample function related to different anchors. We call {X(x, e) : x ∈ X, e ∈ Ω} a sample function space.

In the sample function space, an anchor x determines the parameters of (Ω, F, P), where P is the anchor-specific proposal distribution ϕ. As ϕ is not required to be identical for different anchors, this simulates the situation that different anchors may result in different distributions of observations. Note that M anchors correspond to M sequences of random variables, each sequence with different parameters. For an anchor x, a brief description of generating an observation x̂ is as follows:

(i) Select a class label according to the class probability τ+, indicating that the observation comes from FN with probability τ+ or from TN with probability τ−.

(ii) Generate an observation x̂ from the class conditional density ϕFN (or ϕTN), dependent on the anchor-specific ϕ and the location parameter α.

(iii) Map x̂ to exp(x̂/t) as the final observation.

An illustrative example is presented in Fig. 8 of the Appendix. Repeating the process to generate N observations for one anchor, and repeating it for M anchors, we obtain {exp(x̂mi/t) : m = 1, ..., M, i = 1, ..., N}, corresponding to an empirical observation from the sample function space {X(x, e) : x ∈ X, e ∈ Ω}. The complete stochastic process depiction of the observations is presented in Algorithm 1 of Appendix B.2.

Note that in (ii), even if we set ϕ as a simple distribution, the corresponding class conditional density ϕFN (or ϕTN) is no longer a simple distribution, so we generate the observations from ϕFN (or ϕTN) using the accept-reject sampling technique (Casella et al., 2004) (see Algorithms 2 and 3 of Appendix B.2 for implementation details).

Note that in (iii), although the corresponding density expression of exp(x̂/t) is extremely complicated, x̂ → exp(x̂/t) is a strictly monotonic transformation, so the empirical distribution function remains the same: Φn(x̂) = Φn(exp(x̂/t)).

Fig. 5 plots the empirical distribution of exp(x̂/t) according to the above generative process, which corresponds to the distribution depicted in Fig. 3, where we set t = 2, M = 1 and N = 20000. We note that the empirical distributions in Fig. 5 exhibit structures similar to those obtained on real-world datasets as reported in (Robinson et al., 2021; Xia et al., 2022), indicating the effectiveness of the proposed stochastic process depiction framework for simulating x̂.

Figure 5. Empirical distribution of exp(x̂/t) with different α settings. The density is fitted using a Gaussian kernel.

4.3. Experiment Results

We evaluate the quality of the estimators in terms of mean squared error (MSE). For M anchors, we calculate MSE(θ̂_BCL) = \frac{1}{M} \sum_{m=1}^{M} (\hat{θ}_{BCL,m} - θ_{SUP,m})^2. Fig. 6 compares the MSE of the different estimators against different parameters. It can be observed that the estimator θ̂_BCL is superior to the other two estimators in terms of lower MSE under different parameter settings. We refer to Appendix B.2 for the detailed settings of our numerical experiments. A particular note here is that we set β = 0.5 in the ω computation, as β is designed in consideration of the requirement of the downstream task of pushing TN samples further apart, rather than the statistical quality of θ̂_BCL.
Figure 6. MSE of different estimators under various parameter settings: (a) performance of the encoder f; (b) number of negative samples; (c) class probability; (d) variation of the proposal distribution ϕ; (e) temperature scaling; (f) number of anchors.

5. Real Dataset Experiments

We conduct the real dataset experiments with the same settings as DCL and HCL for two vision tasks using the CIFAR10 (Krizhevsky & Hinton, 2009) and STL10 (Adam et al., 2011) datasets (detailed settings are also provided in Appendix B.3).

• SimCLR (Chen et al., 2020a): It provides the experiment framework for all competitors, which comprises a stochastic data augmentation module, ResNet-50 as the encoder, and a neural network projection head. It uses the contrastive loss L_BIASED.

• DCL (Chuang et al., 2020): It mainly considers the false negative debiasing task and uses the estimator of Eq. (26) to replace the second term in the denominator of L_BIASED.

• HCL (Robinson et al., 2021): Following the DCL debiasing framework, it also takes hard negative mining into consideration by up-weighting each randomly selected sample as follows:

    ω_i^{HCL} = \frac{x̂_i^{β}}{\frac{1}{N} \sum_{j=1}^{N} x̂_j^{β}}.    (27)

DCL is a particular case of HCL with β = 0.

Table 2. Classification accuracy on CIFAR10 and STL10.

Dataset   Methods   N=30    N=62    N=126   N=254   N=510
CIFAR10   SimCLR    80.21   84.82   87.58   89.87   91.12
          DCL       82.41   87.60   90.38   91.36   92.06
          HCL       83.42   88.45   90.53   91.57   91.92
          BCL       83.61   88.56   90.83   92.07   92.58
STL10     SimCLR    61.20   71.69   74.36   77.33   80.20
          DCL       63.91   71.48   76.69   81.48   84.26
          HCL       67.24   73.38   79.44   83.82   86.38
          BCL       67.45   73.36   80.23   84.68   86.51

For N = 510 (batch size 256) negative examples per data point, we observe absolute improvements of 1.4% and 6.3% over SimCLR on the CIFAR10 and STL10 datasets, respectively. On the CIFAR10 dataset, BCL achieves a top-1 accuracy of 92.58%, representing an absolute improvement of 0.5% over the best baseline (DCL). On the STL10 dataset, BCL achieves a top-1 accuracy of 86.51%, which is comparable to the performance of the best baseline (HCL).

Recall that α corresponds to the macro-AUC of the encoder, while β controls the hardness level of the desired sampling distribution and is manually chosen. Therefore, we estimate α empirically using a randomly sampled batch of data and examine the impact of different β values.

Table 3. Impact of the hardness level β.

Dataset   β=0.5   β=0.6   β=0.7   β=0.8   β=0.9   β=1.0
CIFAR10   91.39   82.41   91.89   92.02   92.58   92.12
STL10     80.32   81.79   83.58   83.83   84.85   86.51

Table 3 presents the impact of the hardness level β. On the STL10 dataset, the top-1 classification accuracy continues to increase with increasing β. However, on the CIFAR10 dataset, the top-1 classification accuracy gradually increases as β increases, reaching its optimal value at 0.9 before decreasing. This result indicates that a higher hardness level may not always lead to better performance. This is further supported by the fact that HCL, which includes a hard negative mining mechanism, performs worse than DCL (a particular case of HCL without the hard negative mining mechanism) on the CIFAR10 dataset.

Recent studies on the memorization effects of deep neural networks (Arpit et al., 2017) have provided insight into our understanding of this result. Specifically, deep models have been shown to prioritize memorizing training data with clean labels and to gradually adapt to those with noisy labels as the number of training epochs increases (Zhang et al., 2021; Han et al., 2018), regardless of the network architectures used. Self-supervised contrastive learning can be regarded as a noisy label learning problem with negative labels containing noise, where false negatives are included. For datasets that are difficult to fit, the hard negative principle dominates, encouraging the model to fit clean samples. However, for datasets that are relatively easy to fit, excessive hardness levels can lead to the model memorizing wrongly given labels, so the true negative principle should be balanced against the hard negative principle.
6. Conclusion

This paper proposes the BCL loss for self-supervised contrastive learning, which corrects the bias introduced by using random negative samples drawn from unlabeled data through importance sampling. The key idea is to design a desired sampling distribution under a Bayesian framework, which has a parametric structure enabling principled debiasing of false negatives and mining of hard negatives. In future work, we plan to adapt BCL to noisy label learning or other weak supervision settings and explore the optimal tradeoff between debiasing false negatives and mining hard negatives.

References

Adam, C., Andrew, N., and Honglak, L. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.

Alec, R., Karthik, N., Tim, S., and Ilya, S. Improving language understanding by generative pre-training. 2018.

Alec, R., Jeffrey, W., Rewon, C., David, L., Dario, A., and Ilya, S. Language models are unsupervised multitask learners. 2019.

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242, 2017.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. 2019.

Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. MINE: Mutual information neural estimation. In ICML, 2018.

Bengio, Y. and Senécal, J.-S. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

Casella, G., Robert, C. P., and Wells, M. T. Generalized accept-reject sampling schemes. Lecture Notes-Monograph Series, pp. 342–347, 2004.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607, 2020a.

Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. E. Big self-supervised models are strong semi-supervised learners. NeurIPS, 33:22243–22255, 2020b.

Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546, 2005.

Chuang, C.-Y., Robinson, J., Lin, Y.-C., Torralba, A., and Jegelka, S. Debiased contrastive learning. In NeurIPS, pp. 8765–8775, 2020.

David, H. A. and Nagaraja, H. N. Order Statistics. 2004. ISBN 9780471654018.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Diederik, P. and Jimmy, B. Adam: A method for stochastic optimization. In ICLR, 2015.

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, pp. 766–774, 2014.

Du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning, pp. 1386–1394. PMLR, 2015.

Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Florian, S., Dmitry, K., and James, P. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pp. 815–823, 2015.

Glivenko, V. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4:92–99, 1933.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(2), 2012.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735–1742, 2006.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pp. 31, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729–9738, 2020.

Henaff, O. Data-efficient image recognition with contrastive predictive coding. In ICML, pp. 4182–4192, 2020.

Hesterberg, T. C. Advances in Importance Sampling. Stanford University, 1988.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Huang, J., Dong, Q., Gong, S., and Zhu, X. Unsupervised deep learning by neighbourhood discovery. In ICML, pp. 2849–2858, 2019.

Hyun, Oh, S., Yu, X., Stefanie, J., and Silvio, S. Deep metric learning via lifted structured feature embedding. In CVPR, pp. 4004–4012, 2016.

Jessa, B. and Jesse, D. Learning from positive and unlabeled data: a survey. Machine Learning, 109:719–760, 2020.

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. In NeurIPS, pp. 18661–18673, 2020.

Kiryo, R., Niu, G., Du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Komodakis, N. and Gidaris, S. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lajanugen, L. and Honglak, L. An efficient framework for learning sentence representations. In ICLR, 2018.

Lindsay, B. Mixture Models: Theory, Geometry and Applications. 1995. ISBN 0-94-0600-32-3.

Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., and Tang, J. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.

Misra, I. and Maaten, L. v. d. Self-supervised learning of pretext-invariant representations. In CVPR, pp. 6707–6717, 2020.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Poole, B., Ozair, S., van den Oord, A., Alemi, A. A., and Tucker, G. On variational bounds of mutual information. In ICML, pp. 5171–5180, 2019.

Robinson, J., Ching-Yao, C., Sra, S., and Jegelka, S. Contrastive learning with hard negative samples. In ICLR, 2021.

Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pp. 5628–5637, 2019.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In ECCV, pp. 776–794, 2020.

Wang, F. and Liu, H. Understanding the behaviour of contrastive loss. In CVPR, pp. 2495–2504, 2021.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, pp. 9929–9939, 2020.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742, 2018.

Xia, J., Wu, L., Wang, G., Chen, J., and Li, S. Z. ProGCL: Rethinking hard negative mining in graph contrastive learning. In ICML, pp. 24332–24346, 2022.

Xu, L., Lian, J., Zhao, W. X., Gong, M., Shou, L., Jiang, D., Xie, X., and Wen, J.-R. Negative sampling for contrastive representation learning: A review. arXiv preprint arXiv:2206.00212, 2022.

Yannis, K., Mert, Bulent, S., Noe, P., Philippe, W., and Diane, L. Hard negative mixing for contrastive learning. In NeurIPS, pp. 21798–21809, 2020.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Zhuang, C., Zhai, A. L., and Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In CVPR, pp. 6002–6012, 2019.
Figure 7. Illustrative example of debiased contrastive objective inspired by (Chuang et al., 2020).

Figure 8. Generation process of observation exp(x̂/t) for one anchor.

A. Related Work
Contrastive learning adopts the learn-to-compare paradigm (Gutmann & Hyvärinen, 2010) that discriminates the observed data from noise data to relieve the model from reconstructing pixel-level information of the data (Oord et al., 2018). Although the representation encoder f and the similarity measure vary from task to task (Devlin et al., 2018; He et al., 2020; Dosovitskiy et al., 2014), they share the identical basic idea of contrasting similar pairs (x, x+) and dissimilar pairs (x, x−) to train f by optimizing a contrastive loss (Wang & Isola, 2020), such as the NCE loss (Gutmann & Hyvärinen, 2010), the InfoNCE loss (Oord et al., 2018), the Infomax loss (Hjelm et al., 2018), the asymptotic contrastive loss (Wang & Isola, 2020), etc. These loss functions implicitly (Oord et al., 2018) or explicitly (Hjelm et al., 2018) lower bound the mutual information. Arora et al. (Saunshi et al., 2019) provide a theoretical analysis of the generalization bound of contrastive learning for classification tasks. Remarkable successes of supervised contrastive learning have been observed for many applications in different domains (Henaff, 2020; Khosla et al., 2020), but their great limitation is the dependence on manually labeled datasets (Liu et al., 2021).

Self-supervised contrastive learning (Chen et al., 2020a;b; He et al., 2020; Henaff, 2020; Xu et al., 2022) has been extensively researched for its advantage of learning representations without human labelers manually labeling data, and it benefits almost all types of downstream tasks (Liu et al., 2021; Bachman et al., 2019; Chen et al., 2020c; Huang et al., 2019; Wu et al., 2018; Zhuang et al., 2019). The common practice is to obtain a positive sample x+ from some semantic-invariant operation on an anchor x with heavy data augmentation (Chen et al., 2020a), like random cropping and flipping (Oord et al., 2018), image rotation (Komodakis & Gidaris, 2018), cutout (DeVries & Taylor, 2017) and color distortion (Szegedy et al., 2015), while drawing negative samples x− simply from unlabeled data, which introduces the false negative problem, leading to incorrect encoder training. This problem is related to classic positive-unlabeled (PU) learning (Jessa & Jesse, 2020), where the unlabeled data used as negative samples would be down-weighted appropriately (Du Plessis et al., 2015; 2014; Kiryo et al., 2017). However, existing PU estimators are not directly applicable to the contrastive loss. In particular, two conflicting tasks are faced in self-supervised contrastive learning (Liu et al., 2021; Robinson et al., 2021): false negative debiasing and hard negative mining. How to estimate the probability of an unlabeled sample being a true negative sample, and thus how to optimally trade off the two tasks, still remains unsolved.

B. Implementation details of experiments


B.1. Transformation of Φ(x̂) and ΦUN(x̂)

We apply the total probability formula to relate marginal probabilities to conditional probabilities:

    ϕ_UN = τ^- ϕ_TN + τ^+ ϕ_FN
         = 2ϕ(x̂)[ατ^- + (1 - α)τ^+ + (1 - 2α)Φ(x̂)(τ^- - τ^+)]    (28)

By integrating both sides of Eq. (28) we can establish a correlation between the two distribution functions:

    Φ_UN(x̂) = \int_{-\infty}^{x̂} ϕ_UN(t) dt    (29)
             = \int_{-\infty}^{x̂} [τ^- ϕ_TN(t) + τ^+ ϕ_FN(t)] dt
             = \int_{-\infty}^{x̂} 2ϕ(t)[ατ^- + (1 - α)τ^+ + (1 - 2α)Φ(t)(τ^- - τ^+)] dt
             = 2[ατ^- + (1 - α)τ^+] \int_{-\infty}^{x̂} ϕ(t) dt + (1 - 2α)(τ^- - τ^+) \int_{-\infty}^{x̂} 2ϕ(t)Φ(t) dt
             = 2[ατ^- + (1 - α)τ^+] Φ(x̂) + (1 - 2α)(τ^- - τ^+) Φ^2(x̂)    (30)

Let

    a = (1 - 2α)(τ^- - τ^+)
    b = 2[ατ^- + (1 - α)τ^+]

So we have Φ_UN(x̂) = aΦ^2(x̂) + bΦ(x̂), and

    Φ(x̂) = \frac{-b + \sqrt{b^2 + 4aΦ_UN(x̂)}}{2a}    (31)

The other solution, Φ(x̂) = \frac{-b - \sqrt{b^2 + 4aΦ_UN(x̂)}}{2a}, is discarded because it falls outside the valid range [0, 1] of a cumulative distribution function. Fig. 9 illustrates the transformation between the two cumulative distribution functions Φ(x̂) and ΦUN(x̂), where we fix τ+ = 0.1.

B.2. Numerical experiments


Choice of the proposal distribution ϕ. For the convenience of controlling the observation within its theoretical minimum and maximum interval [−1/t², 1/t²], and since previous research shows that the contrastive loss asymptotically optimizes the negative samples for uniformity (Wang & Isola, 2020), we set ϕ as U(a, b) to perform the numerical experiments, where −1/t² ≤ a ≤ b ≤ 1/t² is randomly selected for each anchor. Specifically, the original interval was set as a = −0.5, b = 0.5, and γ ∈ [0, 1] is used to control the maximum sliding amplitude away from the original interval.

Implementation of accept-reject sampling. The objective is to generate a sample x̂ from the class conditional density ϕTN(x̂), i.e., x̂ ∼ ϕTN(x̂). The basic idea is to generate a sample x̂ from the proposal distribution ϕ and accept it with acceptance probability pTN. To calculate the acceptance probability pTN, we first write ϕTN(x̂) as a function of the proposal distribution ϕ(x̂):

    ϕ_TN(x̂) = αϕ_{(1)}(x̂) + (1 - α)ϕ_{(2)}(x̂)
             = [2α + (2 - 4α)Φ(x̂)] ϕ(x̂),

where Φ(x̂) is the C.D.F. of the proposal distribution ϕ(x̂). Next we find the minimal constant c that satisfies the following inequality:

    c · ϕ(x̂) ≥ ϕ_TN(x̂),
Figure 9. The transformation between the two cumulative distribution functions Φ(x̂) and ΦUN(x̂), where the x-axis is the C.D.F. of the unlabeled samples Φ̂UN and the y-axis is the anchor-specific C.D.F. Φ(x̂). From Eq. (28), it can be seen that when α = 0.5, which means the encoder makes random guesses, or when τ+ = 0.5, which means the prior class probabilities of positive and negative examples are equal, Φ(x̂) = ΦUN(x̂).

that is,

    c · ϕ(x̂) ≥ [2α + (2 - 4α)Φ(x̂)] ϕ(x̂).

Since 2 − 4α ≤ 0 for α ∈ [0.5, 1], the bracket is largest at Φ(x̂) = 0, so the minimal valid constant is

    c = \max_{Φ(x̂) \in [0, 1]} [2α + (2 - 4α)Φ(x̂)] = 2α,

attained at Φ(x̂) = 0. So the acceptance probability is

    p_TN = \frac{ϕ_TN(x̂)}{c · ϕ(x̂)} = [α + (1 - 2α) · Φ(x̂)] / α

An observation x̂ generated from the proposal distribution ϕ(x̂) is accepted with probability pTN, which yields empirical observations x̂ ∼ ϕTN(x̂) as described in Algorithm 2.

Likewise, the acceptance probability for ϕFN,

    p_FN = [1 - α + (2α - 1) · Φ(x̂)] / α,

can be calculated in a similar way. An observation x̂ from the proposal distribution ϕ(x̂) is accepted with probability pFN, which yields empirical observations x̂ ∼ ϕFN(x̂) as described in Algorithm 3.
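A minimal NumPy sketch of this accept-reject step (our illustration of Algorithms 2-3, assuming the uniform proposal U(a, b) described above; the interval and parameter values in the example are assumptions) is given below.

```python
import numpy as np

def sample_tn(alpha: float, a: float, b: float, rng: np.random.Generator) -> float:
    """Accept-reject sampling from phi_TN given a uniform proposal U(a, b) (cf. Algorithm 2)."""
    while True:
        x = rng.uniform(a, b)
        cdf = (x - a) / (b - a)                                       # Phi(x) of the uniform proposal
        accept_prob = (alpha + (1.0 - 2.0 * alpha) * cdf) / alpha     # p_TN
        if rng.uniform() <= accept_prob:
            return x

def sample_fn(alpha: float, a: float, b: float, rng: np.random.Generator) -> float:
    """Accept-reject sampling from phi_FN (cf. Algorithm 3)."""
    while True:
        x = rng.uniform(a, b)
        cdf = (x - a) / (b - a)
        accept_prob = (1.0 - alpha + (2.0 * alpha - 1.0) * cdf) / alpha   # p_FN
        if rng.uniform() <= accept_prob:
            return x

# Example: one observation following steps (i)-(iii) of Section 4.2, with assumed parameters
# alpha = 0.9, tau+ = 0.1, t = 2 (so the proposal interval is within [-1/t^2, 1/t^2] = [-0.25, 0.25]).
rng = np.random.default_rng(0)
alpha, tau_plus, t = 0.9, 0.1, 2.0
x_hat = sample_fn(alpha, -0.25, 0.25, rng) if rng.uniform() <= tau_plus else sample_tn(alpha, -0.25, 0.25, rng)
obs = np.exp(x_hat / t)    # step (iii): the final observation
```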
Parameter Settings. For the baseline estimator θ̂_DCL in Eq. (26), the number of positive samples K is fixed at 10. We start by fixing the parameters as α = 0.9, β = 0.9, γ = 0.1, τ+ = 0.1, temperature scaling t = 0.5, number of anchors M = 1e3, and number of negative samples N = 64 to investigate the performance of the different estimators under different parameter settings.

Mean values. Aside from the MSE, we report the mean values of the estimators θ̂. Our primary concern is whether the choice of the proposal distribution ϕ or the encoder f affects the consistency of the estimators. Fig. 10(a) shows the influence of different variations of the anchor-specific proposal distribution ϕ on the mean values of the estimators. Fig. 10(b) shows the influence of the encoder's performance on the mean values of the estimators. It can be seen that the mean values of the θ̂_BCL, θ̂_DCL and θ sequences are very close, which is guaranteed by the Chebyshev law of large numbers, namely \lim_{M \to \infty} P\{ |\frac{1}{M}\sum_{m=1}^{M} \hat{θ}_m - \frac{1}{M}\sum_{m=1}^{M} θ_m| < ϵ \} = 1, as \mathbb{E}\hat{θ}_m = θ_m. This conclusion is unaffected by the choice of the proposal distribution ϕ or the encoder f.
Algorithm 1 Numerical experiments.

  Input: location parameter α (mixture coefficient), temperature scaling t
  Output: observations and labels.
  for anchor m = 1, 2, ..., M do
      // choose an anchor-specific proposal distribution ϕ
      Randomly select [a, b] ⊆ [−1/t², 1/t²]
      Set ϕ ∼ U(a, b)
      for negative j = 1, 2, ..., N do
          // step 1: select one population
          p = random.uniform(0, 1)
          if p ≤ τ+ then
              // step 2: generate an observation from ϕFN
              x̂j = AccRejectSamplingFN(ϕFN)
              label = False
          else
              // step 2: generate an observation from ϕTN
              x̂j = AccRejectSamplingTN(ϕTN)
              label = True
          // step 3: mapping
          x̂j = exp(x̂j / t)
          collect observation x̂j and label
  Result: observations and labels.

Algorithm 2 AccRejectSamplingTN(ϕTN).

  Input: location parameter α, proposal distribution ϕ
  Output: a sample x̂ ∼ ϕTN.
  x̂j = generate an observation from ϕ
  cdf = \int_{-\infty}^{x̂j} ϕ(t) dt
  u = random.uniform(0, 1)
  while u > [α + (1 − 2α) · cdf] / α do
      x̂j = generate an observation from ϕ
      cdf = \int_{-\infty}^{x̂j} ϕ(t) dt
      u = random.uniform(0, 1)
  Result: x̂j

Algorithm 3 AccRejectSamplingFN(ϕFN).

  Input: location parameter α, proposal distribution ϕ
  Output: a sample x̂ ∼ ϕFN.
  x̂j = generate an observation from ϕ
  cdf = \int_{-\infty}^{x̂j} ϕ(t) dt
  u = random.uniform(0, 1)
  while u > [1 − α + (2α − 1) · cdf] / α do
      x̂j = generate an observation from ϕ
      cdf = \int_{-\infty}^{x̂j} ϕ(t) dt
      u = random.uniform(0, 1)
  Result: x̂j

Aside from the consistency of the estimators, Fig. 10(b) can be seen as a simulation of the training process: a better trained encoder with higher macro-AUC embeds true negative samples less similarly to the anchor and false negative samples more similarly to it. So we observe a decrease of θ̄, which corresponds to the decrease of the loss values during training, and an increase of the bias |θ̄_BIASED − θ̄|, since the false negative samples contained in θ̄_BIASED are scored higher as α increases.
Figure 10. Mean values of the estimators θ̂ over all anchors: (a) influence of the anchor-specific proposal distribution ϕ; (b) influence of the encoder's performance.

B.3. Real data experiment


Experimental setup: We perform the experiments on vision tasks using the CIFAR10 (Krizhevsky & Hinton, 2009) and STL10 (Adam et al., 2011) datasets. All the experimental settings are identical to those of DCL (Chuang et al., 2020) and HCL (Robinson et al., 2021). Specifically, we implement SimCLR (Chen et al., 2020a) with ResNet-50 (He et al., 2016) as the encoder architecture and use the Adam optimizer (Diederik & Jimmy, 2015) with learning rate 0.001. The temperature scaling and the dimension of the latent vector are set as t = 0.5 and d = 128. All the models are trained for 400 epochs and evaluated by training a linear classifier after fixing the learned embedding (Lajanugen & Honglak, 2018; Robinson et al., 2021). The source codes are available at https://github.com/liubin06/BCL

Algorithm 4 The Bayesian self-supervised Contrastive Learning objective L_BCL.

  Input: location parameter α, hardness level β, negative scores, positive scores
  Output: L_BCL.
  # Pseudocode of BCL in a PyTorch-like style.
  # pos: exponentiated scores of positive examples, shape [BS]
  # neg: exponentiated scores of negative examples, shape [BS, N]
  # norm_const: normalization constant alpha * (1 - beta) + (1 - alpha) * beta
  # Step-1: empirical C.D.F. of the unlabeled samples, Eq. (14)
  cdf_u = (neg.unsqueeze(1) <= neg.unsqueeze(-1)).float().sum(dim=-1) / N
  # Step-2: anchor-specific C.D.F. Phi(x), Eq. (13)
  cdf = cdf_trans(tau_plus, alpha, cdf_u)
  ccdf = 1 - cdf
  phi1 = 2 * ccdf                                   # phi_(1) up to phi(x), Eq. (8)
  phi2 = 2 * cdf                                    # phi_(2) up to phi(x), Eq. (9)
  x_tn = alpha * phi1 + (1 - alpha) * phi2          # phi_TN, Eq. (10)
  x_fn = (1 - alpha) * phi1 + alpha * phi2          # phi_FN, Eq. (11)
  x_thn = (alpha * (1 - beta) * phi1 + (1 - alpha) * beta * phi2) / norm_const   # phi_THN, Eq. (19)
  # Step-3: importance weights omega, Eq. (21)
  omega = x_thn / (x_tn * (1 - tau_plus) + x_fn * tau_plus)
  Ng = (omega * neg).sum(dim=-1)
  Loss = - log(pos / (pos + Ng))
  Result: L_BCL
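For reference, a short usage sketch wiring Algorithm 4 into the loss of Eq. (22) is given below; it is our reading of the pseudocode, not the authors' released implementation, and it assumes the bcl_weights helper sketched in Section 3.3 (which in turn assumes the empirical_cdf and cdf_trans sketches of Section 3.1).

```python
import torch

def bcl_loss(pos: torch.Tensor, neg: torch.Tensor,
             tau_plus: float = 0.1, alpha: float = 0.9, beta: float = 0.9) -> torch.Tensor:
    """Sketch of Eq. (22): pos is [BS], neg is [BS, N], both already exponentiated scores."""
    omega = bcl_weights(neg, tau_plus, alpha, beta)      # Eq. (21), assumed helper
    ng = (omega * neg).sum(dim=-1)                       # weighted negative term
    return -torch.log(pos / (pos + ng)).mean()           # Eq. (22)

# Example call with random exponentiated scores (illustration only):
pos = torch.rand(8) + 0.5        # [BS]
neg = torch.rand(8, 32) + 0.5    # [BS, N]
print(bcl_loss(pos, neg).item())
```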
C. Proofs

Proposition C.1 (Class Conditional Density). If ϕ(x̂) is a continuous density function that satisfies ϕ(x̂) ≥ 0 and \int_{-\infty}^{+\infty} ϕ(x̂) dx̂ = 1, then ϕTN(x̂) and ϕFN(x̂) are probability density functions that satisfy ϕTN(x̂) ≥ 0, ϕFN(x̂) ≥ 0, and \int_{-\infty}^{+\infty} ϕTN(x̂) dx̂ = 1, \int_{-\infty}^{+\infty} ϕFN(x̂) dx̂ = 1.

Proof. Since ϕ(x̂) ≥ 0 and \int_{-\infty}^{+\infty} ϕ(x̂) dx̂ = 1, we have

    ϕ_TN(x̂) = αϕ_{(1)}(x̂) + (1 - α)ϕ_{(2)}(x̂)
             = 2αϕ(x̂)[1 - Φ(x̂)] + 2(1 - α)ϕ(x̂)Φ(x̂)
             ≥ 0,    (32)

where α ∈ [0.5, 1] and Φ(x̂) ∈ [0, 1]. Moreover,

    \int_{-\infty}^{+\infty} ϕ_TN(x̂) dx̂ = \int_{-\infty}^{+\infty} \{2αϕ(x̂)[1 - Φ(x̂)] + 2(1 - α)ϕ(x̂)Φ(x̂)\} dx̂
        = 2α \int_{-\infty}^{+\infty} ϕ(x̂)[1 - Φ(x̂)] dx̂ + 2(1 - α) \int_{-\infty}^{+\infty} ϕ(x̂)Φ(x̂) dx̂
        = 2α \int_{-\infty}^{+\infty} [1 - Φ(x̂)] dΦ(x̂) + 2(1 - α) \int_{-\infty}^{+\infty} Φ(x̂) dΦ(x̂)
        = 2α \int_{0}^{1} (1 - µ) dµ + 2(1 - α) \int_{0}^{1} µ dµ    (33)
        = [α(2µ - µ^2) + (1 - α)µ^2]_{0}^{1}
        = 1,    (34)

where Eq. (33) uses integration by substitution with µ = Φ(x̂). Likewise,

    ϕ_FN(x̂) = αϕ_{(2)}(x̂) + (1 - α)ϕ_{(1)}(x̂)
             = 2αϕ(x̂)Φ(x̂) + 2(1 - α)ϕ(x̂)[1 - Φ(x̂)]
             ≥ 0,    (35)

and

    \int_{-\infty}^{+\infty} ϕ_FN(x̂) dx̂ = 2α \int_{-\infty}^{+\infty} ϕ(x̂)Φ(x̂) dx̂ + 2(1 - α) \int_{-\infty}^{+\infty} ϕ(x̂)[1 - Φ(x̂)] dx̂
        = 2α \int_{-\infty}^{+\infty} Φ(x̂) dΦ(x̂) + 2(1 - α) \int_{-\infty}^{+\infty} [1 - Φ(x̂)] dΦ(x̂)
        = 2α \int_{0}^{1} µ dµ + 2(1 - α) \int_{0}^{1} (1 - µ) dµ
        = [αµ^2 + (1 - α)(2µ - µ^2)]_{0}^{1}
        = 1.    (36)

Lemma C.2 (Posterior Probability Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any observation x̂i, the posterior probability estimate of x̂i being a true negative is given by

    p(TN | x̂_i) = ω_i τ^-
Proof. We apply Bayes' formula to complete the proof. Let β = 0.5. Then

    ϕ_THN(x̂; α, β = 0.5) = \frac{(1 - 0.5)αϕ_{(1)}(x̂) + 0.5(1 - α)ϕ_{(2)}(x̂)}{(1 - 0.5)α + 0.5(1 - α)}
                          = αϕ_{(1)}(x̂) + (1 - α)ϕ_{(2)}(x̂)
                          = ϕ_TN(x̂).    (37)

Inserting Eq. (37) back into Eq. (21), we obtain

    ω(x̂_i) = \frac{ϕ_TN(x̂_i)}{ϕ_UN(x̂_i)}
            = \frac{ϕ_TN(x̂_i)τ^-}{ϕ_TN(x̂_i)τ^- + ϕ_FN(x̂_i)τ^+} \cdot \frac{1}{τ^-}
            = p(TN | x̂_i) \cdot \frac{1}{τ^-}    (38)

Rearranging the above equation yields p(TN | x̂i) = ωi τ−, which completes the proof. More specifically, the posterior probability estimate can be expressed as a function of the cumulative distribution function (C.D.F.) Φ(x̂i) and the prior probability τ as follows:

    p(TN | x̂_i) = \frac{ϕ_TN(x̂_i)τ^-}{ϕ_TN(x̂_i)τ^- + ϕ_FN(x̂_i)τ^+}
                = \frac{ατ^- + (1 - 2α)Φ(x̂_i)τ^-}{ατ^- + (1 - α)τ^+ + (1 - 2α)Φ(x̂_i)(τ^- - τ^+)}    (39)

It is worth noting that the term ϕ(x̂i) has been eliminated. Although the explicit expression of Φ(x̂_i) = \int_{-\infty}^{x̂_i} ϕ(t) dt is unknown, we can obtain its empirical estimate through Eq. (14) and Eq. (13). Therefore, the posterior probability p(TN | x̂i) can be calculated without making any parametric assumptions about ϕ(x̂i). Similarly, for values of β other than 0.5, the calculation steps of ω follow the same algorithmic process as depicted in Algorithm 4.

Lemma C.3 (Asymptotic Unbiased Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any encoder f, as N → ∞, we have

    L_BCL → L_SUP

Proof. We draw the conclusion by the Lebesgue dominated convergence theorem and the properties of importance sampling:

    \lim_{N \to \infty} L_BCL = \lim_{N \to \infty} \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p_d} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^{N} ω_i \cdot e^{f(x)^\top f(x_i^-)}} \Big]
        = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+} \lim_{N \to \infty} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^{N} ω_i \cdot e^{f(x)^\top f(x_i^-)}} \Big]
        = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+} \lim_{N \to \infty} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + N \cdot \mathbb{E}_{x̂ \sim ϕ_UN}[ωx̂]} \Big]    (40)

where

    ω(x̂; α, β) = \frac{ϕ_THN(x̂; α, β)}{ϕ_UN(x̂)}

When taking β = 0.5,

    ϕ_THN(x̂; α, β = 0.5) = \frac{(1 - 0.5)αϕ_{(1)}(x̂) + 0.5(1 - α)ϕ_{(2)}(x̂)}{(1 - 0.5)α + 0.5(1 - α)}
                          = αϕ_{(1)}(x̂) + (1 - α)ϕ_{(2)}(x̂)
                          = ϕ_TN(x̂).    (41)

Therefore ω = ϕ_TN(x̂) / ϕ_UN(x̂), so the second term in the denominator of Eq. (40) satisfies

    \mathbb{E}_{x̂ \sim ϕ_UN}[ωx̂] = \int ωx̂\, ϕ_UN(x̂) dx̂
        = \int x̂ \frac{ϕ_TN(x̂)}{ϕ_UN(x̂)} ϕ_UN(x̂) dx̂
        = \int x̂\, ϕ_TN(x̂) dx̂
        = \mathbb{E}_{x^- \sim p^-} e^{f(x)^\top f(x_j^-)}    (42)

Inserting Eq. (42) back into Eq. (40), we obtain

    \lim_{N \to \infty} L_BCL = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+} \lim_{N \to \infty} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + N\, \mathbb{E}_{x^- \sim p^-} e^{f(x)^\top f(x_j^-)}} \Big]
        = \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p^-} \lim_{N \to \infty} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{j=1}^{N} e^{f(x)^\top f(x_j^-)}} \Big]
        = \lim_{N \to \infty} \mathbb{E}_{x \sim p_d,\, x^+ \sim p^+,\, x^- \sim p^-} \Big[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{j=1}^{N} e^{f(x)^\top f(x_j^-)}} \Big]    (43)
        = \lim_{N \to \infty} L_SUP    (44)

where Eq. (43) is obtained by the Lebesgue dominated convergence theorem.
