Bayesian Self-Supervised Contrastive Learning
Abstract

Recent years have witnessed many successful applications of contrastive learning in diverse domains, yet its self-supervised version still poses many exciting challenges. As the negative samples are drawn from unlabeled datasets, a randomly selected sample may actually be a false negative to an anchor, leading to incorrect encoder training. This paper proposes a new self-supervised contrastive loss, called the BCL loss, that still uses random samples from the unlabeled data while correcting the resulting bias with importance weights. The key idea is to design the desired sampling distribution for sampling hard true negative samples under a Bayesian framework. The prominent advantage is that the desired sampling distribution has a parametric structure, with a location parameter for debiasing false negatives and a concentration parameter for mining hard negatives. Experiments validate the effectiveness and superiority of the BCL loss.¹

¹[email protected]; [email protected]; School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. Corresponding author: Bang Wang.

1. Introduction

Unsupervised learning has been extensively researched for its advantage of learning representations without human labelers manually labeling data. How to learn good representations without supervision, however, has been a long-standing problem in machine learning. Recently, contrastive learning, which leverages a contrastive loss (Chopra et al., 2005; Hadsell et al., 2006) to train a representation encoder, has been promoted as a promising solution to this problem (Oord et al., 2018; Tian et al., 2020; Liu et al., 2021; Chen et al., 2020b). Remarkable successes of contrastive learning have been observed for many applications in different domains (Alec et al., 2018; 2019; Misra & Maaten, 2020; He et al., 2020). Nonetheless, its potential can be further released by designing a better contrastive loss.

We study the following self-supervised contrastive learning problem (Oord et al., 2018; Chuang et al., 2020; Robinson et al., 2021; Chen et al., 2020a; Liu et al., 2021): Consider an unlabeled dataset X and a class label set C, and let h : X → C be the classification function assigning a data point x ∈ X a class label c ∈ C. Assume that the probability of observing a class label is uniform, ρ(c) = τ+, and that τ− = 1 − τ+ is the probability of observing any different class. For a given data point x, let p+(x+) = p(x+ | h(x) = h(x+)) denote the probability of another point x+ having the same label as x; in such a case, x+ is called a positive sample specific to x. Likewise, let p−(x−) = p(x− | h(x) ≠ h(x−)) denote the probability of another point x− having a different label from x; in such a case, x− is called a negative sample specific to x. Let f : X → R^d denote the representation learning function (i.e., the encoder) mapping a point x to an embedding f(x) on a d-dimensional hypersphere.

Self-supervised contrastive learning contrasts similar pairs (x, x+) and dissimilar pairs (x, x−) to train the encoder f (Wang & Isola, 2020; Wang & Liu, 2021; Chuang et al., 2020), with the objective of encouraging the representations of (x, x+) to be closer than those of (x, x−). In training the encoder, we randomly draw a point from the underlying data distribution pd on X, i.e., x ∼ pd, and its positive sample x+ can be easily obtained from some semantic-invariant operation on x, such as image masking, written as x+ ∼ p+. In practice, a negative sample x− is drawn from the unlabeled dataset, i.e., x− ∼ pd. However, the sample x− could potentially have the same label as x, i.e., it is a false negative to x. In such a case, the false negative sample, also called sampling bias (Chuang et al., 2020), degrades the learned representations (Wang & Liu, 2021; Chuang et al., 2020).

In this work, we formalize the two crucial tasks of self-supervised contrastive learning, i.e., debiasing false negatives and mining hard negatives, in a Bayesian setting. These tasks are implicitly performed by jointly utilizing sample information (the empirical distribution function) and prior class information τ, which offers a flexible and principled self-supervised contrastive learning framework. It leverages the classical idea of Monte Carlo importance sampling to draw random samples from the unlabeled data while correcting the resulting bias with importance weights, which avoids the high computational costs associated with explicit negative sampling.
Contribution: (i) We propose the Bayesian self-supervised Contrastive Learning objective function, a novel extension of the contrastive loss function that allows for the false negative debiasing task and the hard true negative mining task in a flexible and principled way. BCL is easy to implement, requiring only a small modification of the original contrastive loss function. (ii) We derive analytical expressions for the class conditional probability densities of negative and positive examples, enabling us to estimate the posterior probability of an unlabeled sample being a true negative without making any parametric assumptions. This provides a new understanding of positive-unlabeled data and inspires future research.

2. Contrastive Loss and Analysis

2.1. Contrastive Loss

In the context of supervised contrastive learning, dissimilar pairs (x, x−) can be easily constructed by randomly drawing a true negative sample x− specific to x, i.e., x− ∼ p−. The contrastive predictive coding (CPC) (Oord et al., 2018) introduces the following InfoNCE loss (Gutmann & Hyvärinen, 2010; 2012):

LSUP = E_{x∼pd, x+∼p+, x−∼p−} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{i=1}^N exp(f(x)ᵀf(x−_i)) ) ) ]   (1)

to learn an encoder f : X → R^d/t that maps a data point x to the hypersphere in R^d of radius 1/t, where t is the temperature scaling. As in the CPC, we also set t = 1 in our theoretical analysis.

In the context of self-supervised contrastive learning, however, as samples' labels are not available, i.e., p−(x′) = p(x′ | h(x) ≠ h(x′)) is not accessible, the standard approach is to draw N samples from the data distribution pd, which are supposed to be negative samples to x, to optimize the following InfoNCE self-supervised contrastive loss:

LBIASED = E_{x∼pd, x+∼p+, x−∼pd} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{i=1}^N exp(f(x)ᵀf(x−_i)) ) ) ]   (2)

Following the DCL (Chuang et al., 2020), it is also called the biased contrastive loss, since those supposedly negative samples x− drawn from pd might come from the same class as the data point x with probability τ+.
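For concreteness, the sketch below shows one way LBIASED in Eq. (2) is computed in a mini-batch setting. It is a minimal illustration under our own assumptions (PyTorch-style tensors, L2-normalized embeddings, an explicit temperature t), not the authors' reference implementation.

```python
import torch

def biased_infonce(anchor, positive, negatives, t=1.0):
    """Minimal sketch of the biased InfoNCE loss L_BIASED in Eq. (2).

    anchor, positive: (B, d) L2-normalized embeddings f(x) and f(x+).
    negatives: (B, N, d) embeddings of N samples drawn from the unlabeled
    data p_d, so some of them may be false negatives to their anchor.
    """
    pos = torch.exp(torch.sum(anchor * positive, dim=-1) / t)           # (B,)
    neg = torch.exp(torch.einsum("bd,bnd->bn", anchor, negatives) / t)  # (B, N)
    return -torch.log(pos / (pos + neg.sum(dim=-1))).mean()
```

The BCL loss in Eq. (22) keeps exactly this structure and only re-weights each negative term with an importance weight ωi.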
2.2. Sampling Bias Analysis

It has been proven that for {x−_i ∈ TN}_{i=1}^N, optimizing the InfoNCE loss LSUP will result in the learning model estimating and optimizing the density ratio p+/p− (Oord et al., 2018; Poole et al., 2019). Denote x̂+ = exp(f(x)ᵀf(x+)). The CPC (Oord et al., 2018) proves that minimizing LSUP leads to

x̂+ ∝ p+/p−.   (3)

As discussed by (Oord et al., 2018), p+/p− preserves the mutual information (MI) of future information and present signals, where MI maximization is a fundamental problem in science and engineering (Poole et al., 2019; Belghazi et al., 2018).

Now consider the InfoNCE loss LBIASED, which can be regarded as the categorical cross-entropy of classifying one positive sample x+ from unlabeled samples. For analysis purposes, we rewrite x+ as x0. Given N + 1 unlabeled data points, the posterior probability of one data point x0 being a positive sample can be derived by

P(x0 ∈ pos | {xi}_{i=0}^N) = p+(x0) Π_{i=1}^N pd(xi) / Σ_{j=0}^N [ p+(xj) Π_{i≠j} pd(xi) ]
                          = ( p+(x0)/pd(x0) ) / ( p+(x0)/pd(x0) + Σ_{j=1}^N p+(xj)/pd(xj) )   (4)

To minimize LBIASED, the optimal value of this posterior probability is 1, which is achieved in the limit of p+(x0)/pd(x0) → +∞ or p+(xj)/pd(xj) → 0. Minimizing LBIASED thus leads to

x̂+ ∝ p+/pd.   (5)

Note that this is different from Eq. (3), since x−_i may not be TN for lack of ground truth labels.

Denote x̂+ = m · p+/pd, m ≥ 0. We investigate the gap between optimizing x̂+ and the optimization objective p+/p−. Inserting pd = p−τ− + p+τ+ back into Eq. (5), we obtain

x̂+ = m · p+ / ( p−τ− + p+τ+ ).   (6)

Rearranging the above equation yields

p+/p− = x̂+ · τ− / ( m − x̂+ · τ+ ).   (7)
Figure 1. Illustration of LBIASED and mutual information optimization by Eq. (7).

The existence of a jump discontinuity indicates that the optimization of LBIASED does not necessarily lead to the tractable MI optimization. The reason for the intractable MI optimization is the fact that not all {x−_i}_{i=1}^N are TN samples, as they are randomly drawn from the data distribution pd. This leads to the inclusion of p+ in the denominator of Eq. (6) when decomposing the data distribution pd. Fig. 7 in the Appendix provides an intuitive explanation: the four sampled data points actually contain one FN sample. Such a FN sample should be pulled closer to the anchor x. However, as it is mistakenly treated as a negative sample, during model training it will be pushed further apart from the anchor, which breaks the semantic structure of embeddings (Wang & Liu, 2021).

3. The Proposed Method

In this paper, we consider randomly drawing negative samples {x−_i}_{i=1}^N from the unlabeled dataset, i.e., x−_i ∼ pd. As the class label is not accessible, x−_i could be either a TN sample or a FN sample. We propose to include and compute an importance weight ωi in the InfoNCE contrastive loss for correcting the resulting bias of drawing negative samples from pd. The ideal situation is that we can set ω = 0 for each FN sample, so that only the hard true negative samples contribute to the calculation of the contrastive loss, which relies on the design of the desired sampling distribution.

We consider the following two design principles of the sampling distribution for drawing {x−_i}_{i=1}^N. The true principle (Wang & Liu, 2021; Robinson et al., 2021) states that the FN samples should not be pushed apart from the anchor x in the embedding space. The hard principle (Yannis et al., 2020; Robinson et al., 2021; Florian et al., 2015; Hyun et al., 2016) states that the hard TN samples should be pushed further apart in the embedding space.

Existing approaches to the above problem rely on density estimation to fit ϕ (Xia et al., 2022), where ϕ is parameterized as a two-component mixture of ϕTN and ϕFN, such as the Gaussian Mixture Model (Lindsay, 1995) or the Beta Mixture Model (Xia et al., 2022). To make the analysis possible, ϕTN and ϕFN are postulated to follow a simple density function with fixed parameters, which is a too strong assumption. In addition, the learning algorithm for estimating ϕ is expensive, since the mixture coefficients that indicate the probability of x̂ ∈ TN or FN are hidden variables. The parameters of ϕTN and ϕFN can only be obtained through the iterative numerical computation of the EM algorithm (Dempster et al., 1977), which is sensitive to initial values.

In this paper, we propose an analytic method without explicitly estimating ϕ, also called a nonparametric method in statistical theory. Consider n random variables from ϕ arranged in ascending order according to their realizations. We write them as X(1) ≤ X(2) ≤ ··· ≤ X(n), and X(k) is called the k-th (k = 1, ···, n) order statistic (David & Nagaraja, 2004). The probability density function (PDF) of X(k) is given by:

ϕ(k)(x̂) = n! / ((k − 1)!(n − k)!) · Φ^{k−1}(x̂) ϕ(x̂) [1 − Φ(x̂)]^{n−k}

By conditioning on n = 2 we obtain:

ϕ(1) = 2ϕ(x̂)[1 − Φ(x̂)]   (8)
ϕ(2) = 2ϕ(x̂)Φ(x̂)   (9)

Next, we investigate the positions of positive and negative samples on the hypersphere, so as to get a deeper insight into ϕTN. Considering a (x, x+, x−) triple, there exists a closed ball B[f(x), d+] = {f(·) | d(f(x), f(·)) ≤ d+} with center f(x) and radius d+, where d+ = ∥f(x) − f(x+)∥ is the distance between the anchor embedding f(x) and the positive sample embedding f(x+). Two possible cases arise: f(x−) ∈ B[f(x), d+] or f(x−) ∉ B[f(x), d+], as illustrated by Fig. 2. We can describe the two cases with the Euclidean distance: Fig 2(a) corresponds to d+ < d−, and Fig 2(b) corresponds to d− ≤ d+, where d− = ∥f(x) − f(x−)∥. Note that the Euclidean distance d± = √(2/t² − 2f(x)ᵀf(x±))
to the N unlabeled samples for correcting the bias due to sampling from the unlabeled data.

So far we have finished the estimate for the summation over hard true negative samples by using samples still randomly drawn from the unlabeled data. The key lies in the calculation of the importance weight for correcting the bias between the actual sampling distribution and the desired sampling distribution. By characterizing the encoder with the macro-AUC metric, the calculable empirical c.d.f. is introduced as the likelihood to construct the posterior estimator, which we name the Bayesian self-supervised Contrastive Learning objective:

LBCL = E_{x∼pd, x+∼p+, x−∼pd} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{i=1}^N ωi · exp(f(x)ᵀf(x−_i)) ) ) ]   (22)

Complexity: The steps for the LBCL computation are presented in Algorithm 4. Compared with the InfoNCE loss, the additional computational complexity comes from first calculating the empirical distribution function (empirical C.D.F.) Φ̂UN(x̂) from the unlabeled data, in the order of O(N), which can be neglected as it is far smaller than encoding a sample. Next, we provide a deeper understanding of LBCL.

Lemma 3.2 (Posterior Probability Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any observation x̂i, the posterior probability estimation of x̂i being true negative is given by p(TN|x̂i) = ωi τ−.

Proof. See Appendix C.2

Lemma 3.3 (Asymptotic Unbiased Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any encoder f, as N → ∞, we have LBCL → LSUP.

Proof. See Appendix C.3

Lemma 3.2 establishes the relationship between the weight ω and the posterior probability, and their calculation only involves (i) Φ̂UN(x̂) ∈ [0, 1], which represents the sample information from a Bayesian viewpoint and can be directly calculated from the N unlabeled data. It contains a probabilistic interpretation of the unlabeled data x ∈ FN given its observation x̂. For larger values of x̂ (with a closer embedding distance to the anchor), Φ̂UN(x̂) assigns a higher probability prediction of the unlabeled sample x sharing the identical latent class label with that of the anchor. In other words, Φ̂UN(x̂) reflects the classification result of the current model on the sample x, that is, the likelihood; and (ii) the class prior probability τ.

The parameter α ∈ [0.5, 1] corresponds to the encoder's macro-AUC metric and controls the confidence level of the sample information. It can be empirically estimated using a validation dataset during the training process. The parameter β ∈ [0.5, 1] corresponds to the proportion of misclassified hard negative components in the desired sampling distribution, and controls the hardness level of the distribution for sampling hard negatives. It can be manually selected. We refer to Fig 4 for a detailed illustration of how the importance weight ω(x̂, α, β) changes with respect to Φ̂UN(x̂) under various settings of α and β.

Figure 4. This figure illustrates how the weight ω changes with respect to Φ̂UN(x̂) under various settings of α and β, where τ+ is fixed at 0.1. The x-axis is the empirical C.D.F. Φ̂UN of the unlabeled samples. A value of Φ̂UN(x̂) = 1 indicates that the instance is the nearest to the anchor, i.e., the highest scored sample. The x-axis can thus be understood as the hardness level of the samples, with values closer to 1 indicating a higher hardness level. The y-axis represents the importance weight ω calculated by the BCL computation process. Since ω is calculated in terms of the empirical C.D.F. (relative ranking position) rather than an absolute similarity score, it is more robust to different temperature scaling settings and encoders. Different values of α and β provide flexible options for up-weighting or down-weighting negative samples.

4. Numerical Experiments

4.1. Experiment Objective

Recall that the core operation of a self-supervised contrastive loss is to estimate the expectation of x̂i (Chuang et al., 2020), where xi ∈ TN and x̂i ≜ exp(f(x)ᵀf(xi)), to approximate the supervised loss LSUP using N randomly selected unlabeled samples xi. For the supervised loss LSUP, we define the mean of the true negative samples' observations by

θSUP = ( Σ_{i=1}^N I(xi) · x̂i ) / ( Σ_{i=1}^N I(xi) ),   (23)

where I(xi) is the indicator function: I(xi) = 1 if xi ∈ TN and I(xi) = 0 otherwise, since the expectation is replaced by empirical estimates in practice (Oord et al., 2018; Chuang et al., 2020; Chen et al., 2020a). For the proposed self-supervised loss LBCL, we define the BCL estimator by

θ̂BCL = (1/N) Σ_{i=1}^N ωi · x̂i.   (24)

The empirical counterpart of the unsupervised loss LBCL equals that of the supervised loss LSUP if θ̂BCL = θSUP. We compare with the following two estimators: θ̂BIASED by (Oord et al., 2018) and θ̂DCL by (Chuang et al., 2020):

θ̂BIASED = (1/N) Σ_{i=1}^N x̂i   (25)

θ̂DCL = ( 1/(N τ−) ) ( Σ_{i=1}^N x̂i − N τ+ · (1/K) Σ_{j=1}^K x̂+_j )   (26)

Eq. (26) can be understood as the summation of the TN samples' observations divided by the number of TN samples, N τ−. Specifically, N τ+ is the number of FN samples, and (1/K) Σ_{j=1}^K x̂+_j is the mean value of K FN samples. So the second term inside the parentheses corresponds to the summation of the FN samples' observations among the N samples, while subtracting it from Σ_{i=1}^N x̂i corresponds to the summation of the TN samples among the N randomly selected samples.

4.2. Experiment Design

We design a stochastic process framework to simulate x̂, which depicts the numerical generative process of an observation x̂. In short, an observation x̂ is realized in the following sample function space.
(i) Select a class label according to the class probability τ+, indicating that the observation comes from FN with probability τ+, or comes from TN with probability τ−.

(ii) Generate an observation x̂ from the class conditional density ϕFN (or ϕTN), dependent on the anchor-specific ϕ and the location parameter α.

(iii) Map x̂ to exp(x̂/t) as the final observation.

An illustrative example is presented in Fig 8 of the Appendix. Repeating the process to generate N observations for one anchor, and repeating it for M anchors, we obtain {exp(x̂mi/t) : m = 1, ..., M, i = 1, ···, N}, corresponding to an empirical observation from the sample function space {X(x, e) : x ∈ X, e ∈ Ω}. The complete stochastic process depiction of observations is presented in Algorithm 1 of Appendix B.2.

Note that in (ii), even if we set ϕ as a simple distribution, the corresponding class conditional density ϕFN (or ϕTN) is no longer a simple distribution; we generate the observations from ϕFN (or ϕTN) using the accept-reject sampling (Casella et al., 2004) technique (see Algorithms 2 and 3 of Appendix B.2 for implementation details).

Note that in (iii), although the corresponding density expression of exp(x̂/t) is extremely complicated, x̂ → exp(x̂/t) is a strictly monotonic transformation, so the empirical distribution function remains the same: Φn(x̂) = Φn(exp(x̂/t)).

Fig 5 plots the empirical distribution of exp(x̂/t) according to the above generative process, which corresponds to the distribution depicted in Fig 3, where we set t = 2, M = 1 and N = 20000. We note that the empirical distributions in Fig 5 exhibit similar structures to those obtained with real-world datasets as reported in (Robinson et al., 2021; Xia et al., 2022), indicating the effectiveness of the proposed stochastic process depiction framework for simulating x̂.

We evaluate the quality of the estimators in terms of mean squared error (MSE). For M anchors, we calculate MSE(θ̂BCL) = (1/M) Σ_{m=1}^M (θ̂BCL,m − θSUP,m)². Fig. 6 compares the MSE of the different estimators against different parameters. It can be observed that the estimator θ̂BCL is superior to the other two estimators in terms of lower MSE under different parameter settings. We refer to Appendix B.2 for detailed settings of our numerical experiments. A particular notice here is that we set β = 0.5 in the ω computation, as β is designed in consideration of the requirement of the downstream task that TN samples be pushed further apart, rather than the statistical quality of θ̂BCL.

5. Real Dataset Experiments

We conduct the real dataset experiments with the same settings as the competitors below for two vision tasks using the CIFAR10 (Krizhevsky & Hinton, 2009) and STL10 (Adam et al., 2011) datasets (detailed settings are also provided in Appendix B.3).

• SimCLR (Chen et al., 2020a): It provides the experiment framework for all competitors, which comprises a stochastic data augmentation module, ResNet-50 as the encoder, and a neural network projection head. It uses the contrastive loss LBIASED.

• DCL (Chuang et al., 2020): It mainly considers the false negative debiasing task and uses the estimator in Eq. (26) to replace the second term in the denominator of LBIASED.

• HCL (Robinson et al., 2021): Following the DCL debiasing framework, it also takes into consideration hard negative mining by up-weighting each randomly
Figure 6. MSE of the estimators under different settings: (a) performance of encoder f; (b) number of negative samples; (c) class probability; (d) variation of proposal distribution ϕ; (e) temperature scaling; (f) number of anchors.
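To make the comparison in Eqs. (23), (25) and (26) concrete, here is a small simulation sketch. The Gaussian observation model and all variable names are our own illustrative assumptions, not the paper's Algorithm 1, and the BCL estimator is omitted because it additionally needs the importance weights ω of Eq. (24).

```python
import numpy as np

# Illustrative-only simulation of the estimators in Eqs. (23), (25) and (26).
rng = np.random.default_rng(0)
M, N, K, tau_pos = 1000, 64, 10, 0.1
tau_neg = 1.0 - tau_pos

mse_biased = mse_dcl = 0.0
for _ in range(M):
    is_fn = rng.random(N) < tau_pos                               # latent FN/TN labels
    x_hat = np.exp(rng.normal(np.where(is_fn, 1.0, 0.0), 0.5))    # observations of N unlabeled samples
    x_hat_pos = np.exp(rng.normal(1.0, 0.5, size=K))              # K positive observations

    theta_sup = x_hat[~is_fn].mean()                              # Eq. (23): needs the labels
    theta_biased = x_hat.mean()                                   # Eq. (25)
    theta_dcl = (x_hat.sum() - N * tau_pos * x_hat_pos.mean()) / (N * tau_neg)  # Eq. (26)

    mse_biased += (theta_biased - theta_sup) ** 2 / M
    mse_dcl += (theta_dcl - theta_sup) ** 2 / M

print(f"MSE(biased) = {mse_biased:.4f}   MSE(DCL) = {mse_dcl:.4f}")
```

The BCL estimator in Eq. (24) would additionally multiply each observation x̂i by its importance weight ωi, whose computation depends on Φ̂UN, α, β and τ (Algorithm 4).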
are difficult to fit, the hard negative principle dominates, encouraging the model to fit clean samples. However, for datasets that are relatively easy to fit, excessive hardness levels can lead to the model memorizing wrongly given labels, so the true negative principle should be balanced with the hard negative principle.

6. Conclusion

This paper proposes the BCL loss for self-supervised contrastive learning, which corrects the bias introduced by using random negative samples drawn from unlabeled data through importance sampling. The key idea is to design a desired sampling distribution under a Bayesian framework, which has a parametric structure enabling principled debiasing of false negatives and mining of hard negatives. In future work, we plan to adapt BCL to noisy label learning or other weak supervisions and explore the optimal tradeoff between debiasing false negatives and mining hard negatives.

References

Adam, C., Andrew, N., and Honglak, L. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.
Alec, R., Karthik, N., Tim, S., and Ilya, S. Improving language understanding by generative pre-training. 2018.
Alec, R., Jeffrey, W., Rewon, C., David, L., Dario, A., and Ilya, S. Language models are unsupervised multitask learners. 2019.
Arpit, D., Jastrzkebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242, 2017.
Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. 2019.
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. MINE: Mutual information neural estimation. In ICML, 2018.
Bengio, Y. and Senécal, J.-S. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
Casella, G., Robert, C. P., and Wells, M. T. Generalized accept-reject sampling schemes. Lecture Notes-Monograph Series, pp. 342–347, 2004.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607, 2020a.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. E. Big self-supervised models are strong semi-supervised learners. NeurIPS, 33:22243–22255, 2020b.
Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546, 2005.
Chuang, C.-Y., Robinson, J., Lin, Y.-C., Torralba, A., and Jegelka, S. Debiased contrastive learning. In NeurIPS, pp. 8765–8775, 2020.
David, H. A. and Nagaraja, H. N. Order statistics. 2004. ISBN 9780471654018.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Diederik, P. and Jimmy, B. Adam: A method for stochastic optimization. In ICLR, 2015.
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, pp. 766–774, 2014.
Du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In International conference on machine learning, pp. 1386–1394. PMLR, 2015.
Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
Florian, S., Dmitry, K., and James, P. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pp. 815–823, 2015.
Glivenko, V. Sulla determinazione empirica delle leggi di probabilita. Gion. Ist. Ital. Attauri., 4:92–99, 1933.
Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297–304, 2010.
Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(2), 2012.
Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735–1742, 2006.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 31, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729–9738, 2020.
Henaff, O. Data-efficient image recognition with contrastive predictive coding. In ICML, pp. 4182–4192, 2020.
Hesterberg, T. C. Advances in importance sampling. Stanford University, 1988.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Huang, J., Dong, Q., Gong, S., and Zhu, X. Unsupervised deep learning by neighbourhood discovery. In ICML, pp. 2849–2858, 2019.
Hyun, Oh, S., Yu, X., Stefanie, J., and Silvio, S. Deep metric learning via lifted structured feature embedding. In CVPR, pp. 4004–4012, 2016.
Jessa, B. and Jesse, D. Learning from positive and unlabeled data: a survey. Machine Learning, 109:719–760, 2020.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. In NeurIPS, pp. 18661–18673, 2020.
Kiryo, R., Niu, G., Du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
Komodakis, N. and Gidaris, S. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.
Lajanugen, L. and Honglak, L. An efficient framework for learning sentence representations. In ICLR, 2018.
Lindsay, B. Mixture Models: Theory, Geometry and Applications. 1995. ISBN 0-94-0600-32-3.
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., and Tang, J. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.
Misra, I. and Maaten, L. v. d. Self-supervised learning of pretext-invariant representations. In CVPR, pp. 6707–6717, 2020.
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Poole, B., Ozair, S., van den Oord, A., Alemi, A. A., and Tucker, G. On variational bounds of mutual information. In ICML, pp. 5171–5180, 2019.
Robinson, J., Ching-Yao, C., Sra, S., and Jegelka, S. Contrastive learning with hard negative samples. In ICLR, 2021.
Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pp. 5628–5637, 2019.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.
Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In ECCV, pp. 776–794, 2020.
Wang, F. and Liu, H. Understanding the behaviour of contrastive loss. In CVPR, pp. 2495–2504, 2021.
Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, pp. 9929–9939, 2020.
Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742, 2018.
Xia, J., Wu, L., Wang, G., Chen, J., and Li, S. Z. Progcl:
Rethinking hard negative mining in graph contrastive
learning. In ICML, pp. 24332–24346, 2022.
Xu, L., Lian, J., Zhao, W. X., Gong, M., Shou, L., Jiang,
D., Xie, X., and Wen, J.-R. Negative sampling for con-
trastive representation learning: A review. arXiv preprint
arXiv:2206.00212, 2022.
Yannis, K., Mert, Bulent, S., Noe, P., Philippe, W., and
Diane, L. Hard negative mixing for contrastive learning.
In NeurIPS, pp. 21798–21809, 2020.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O.
Understanding deep learning (still) requires rethinking
generalization. Communications of the ACM, 64(3):107–
115, 2021.
Zhuang, C., Zhai, A. L., and Yamins, D. Local aggregation
for unsupervised learning of visual embeddings. In CVPR,
pp. 6002–6012, 2019.
Figure 7. Illustrative example of debiased contrastive objective inspired by (Chuang et al., 2020).
A. Related Work
Contrastive learning adopts the learn-to-compare paradigm (Gutmann & Hyvärinen, 2010) that discriminates the observed data from noise data to relieve the model from reconstructing pixel-level information of data (Oord et al., 2018). Although the representation encoder f and the similarity measure vary from task to task (Devlin et al., 2018; He et al., 2020; Dosovitskiy et al., 2014), they share the identical basic idea of contrasting similar pairs (x, x+) and dissimilar pairs (x, x−) to train f by optimizing a contrastive loss (Wang & Isola, 2020), such as the NCE loss (Gutmann & Hyvärinen, 2010), the InfoNCE loss (Oord et al., 2018), the Infomax loss (Hjelm et al., 2018), the asymptotic contrastive loss (Wang & Isola, 2020), etc. These loss functions implicitly (Oord et al., 2018) or explicitly (Hjelm et al., 2018) lower bound the mutual information. Arora et al. (Saunshi et al., 2019) provide a theoretical analysis of the generalization bound of contrastive learning for classification tasks. Remarkable successes of supervised contrastive learning have been observed for many applications in different domains (Henaff, 2020; Khosla et al., 2020), but the great limitation is their dependence on manually labeled datasets (Liu et al., 2021).

Self-supervised contrastive learning (Chen et al., 2020a;b; He et al., 2020; Henaff, 2020; Xu et al., 2022) has been extensively researched for its advantage of learning representations without human labelers manually labeling data, and it benefits almost all types of downstream tasks (Liu et al., 2021; Bachman et al., 2019; Chen et al., 2020c; Huang et al., 2019; Wu et al., 2018; Zhuang et al., 2019). The common practice is to obtain a positive sample x+ from some semantic-invariant operation on an anchor x with heavy data augmentation (Chen et al., 2020a), like random cropping and flipping (Oord et al., 2018), image rotation (Komodakis & Gidaris, 2018), cutout (DeVries & Taylor, 2017) and color distortion (Szegedy et al., 2015), while drawing negative samples x− simply from unlabeled data, which introduces the false negative problem, leading to incorrect encoder training. This problem is related to classic positive-unlabeled (PU) learning (Jessa & Jesse, 2020), where the unlabeled data used as negative samples would be down-weighted appropriately (Du Plessis et al., 2015; 2014; Kiryo et al., 2017). However, existing PU estimators are not directly applicable to the contrastive loss. In particular, two conflicting tasks are faced in self-supervised contrastive learning (Liu et al., 2021; Robinson et al., 2021): false negative debiasing and hard negative mining. How to estimate the probability of an unlabeled sample being a true negative sample, and thus how to optimally trade off the two tasks, still remains unsolved.
ϕUN = τ− ϕTN + τ+ ϕFN
    = 2ϕ(x̂)[ατ− + (1 − α)τ+ + (1 − 2α)Φ(x̂)(τ− − τ+)]   (28)

By integrating both sides of Eq (28) we can establish a correlation between the two distribution functions:

ΦUN(x̂) = ∫_{−∞}^{x̂} ϕUN(t) dt   (29)
        = ∫_{−∞}^{x̂} [τ− ϕTN(t) + τ+ ϕFN(t)] dt
        = ∫_{−∞}^{x̂} 2ϕ(t)[ατ− + (1 − α)τ+ + (1 − 2α)Φ(t)(τ− − τ+)] dt
        = 2[ατ− + (1 − α)τ+] ∫_{−∞}^{x̂} ϕ(t) dt + (1 − 2α)(τ− − τ+) ∫_{−∞}^{x̂} 2ϕ(t)Φ(t) dt
        = 2[ατ− + (1 − α)τ+] Φ(x̂) + (1 − 2α)(τ− − τ+) Φ²(x̂)   (30)

Let

a = (1 − 2α)(τ− − τ+)
b = 2[ατ− + (1 − α)τ+]

where Φ(x̂) is the c.d.f. of the proposal distribution ϕ(x̂). Next we find a minimum c that satisfies the following inequality:

c · ϕ(x̂) ≥ ϕTN(x̂),
that is,

c · ϕ(x̂) ≥ [2α + (2 − 4α)Φ(x̂)] ϕ(x̂).

So the minimum of c is attained at c = 2α, and an observation is accepted with probability

pTN = ϕTN(x̂) / (c · ϕ(x̂)) = [α + (1 − 2α) · Φ(x̂)] / α.

An observation x̂ generated from the proposal distribution ϕ(x̂) is accepted with probability pTN, which formulates the empirical observations x̂ ∼ ϕTN(x̂), as described in Algorithm 2.

Likewise, the acceptance probability for ϕFN,

pFN = [1 − α + (2α − 1) · Φ(x̂)] / α,

can be calculated in a similar way. An observation x̂ from the proposal distribution ϕ(x̂) is accepted with probability pFN, which formulates the empirical observations x̂ ∼ ϕFN(x̂), as described in Algorithm 3.

Figure 9. The transformation between the two cumulative distribution functions Φ(x̂) and ΦUN(x̂), where the x-axis is the C.D.F. of unlabeled samples Φ̂UN, and the y-axis is the anchor-specific C.D.F. Φ(x̂). From Eq (28), it can be seen that when α = 0.5, which means the encoder makes random guesses, or when τ+ = 0.5, which means the prior class probabilities of positive and negative examples are equal, Φ(x̂) = ΦUN(x̂).
Parameter Settings. For the baseline estimator θ̂DCL in Eq (26), the number of positive samples K is fixed as 10. We start by fixing the parameters as α = 0.9, β = 0.9, γ = 0.1, τ+ = 0.1, temperature scaling t = 0.5, number of anchors M = 1e3, and number of negative samples N = 64, and then investigate the performance of the different estimators under different parameter settings.
Mean values. Aside from MSE, we report the mean values of the estimators θ̂. Our primary concern is whether the choice of the proposal distribution ϕ or the encoder f affects the consistency of the estimators. Fig 10(a) shows the influence of different variations of the anchor-specific proposal distribution ϕ on the mean values of the estimators. Fig 10(b) shows the influence of the encoder's performance on the mean values of the estimators. It can be seen that the mean values of the θ̂BCL, θ̂DCL and θ sequences are very close, which is guaranteed by the Chebyshev law of large numbers, namely lim_{M→∞} P{ |(1/M) Σ_{m=1}^M θ̂m − (1/M) Σ_{m=1}^M θm| < ϵ } = 1, as E θ̂m = θm. This conclusion is unaffected by the choice of the proposal distribution ϕ or the encoder f.
Algorithm 2 AccRejetSamplingTN(ϕTN).
Input: location parameter α, proposal distribution ϕ
Output: samples x̂ ∼ ϕTN.
  x̂j ← generate an observation x̂j from ϕ
  cdf ← ∫_{−∞}^{x̂j} ϕ(t) dt
  u ← random.uniform(0, 1)
  while u > [α + (1 − 2α) · cdf]/α do
      x̂j ← generate an observation x̂j from ϕ
      cdf ← ∫_{−∞}^{x̂j} ϕ(t) dt
      u ← random.uniform(0, 1)
  Result: x̂j
Algorithm 3 AccRejetSamplingFN(ϕFN).
Input: location parameter α, proposal distribution ϕ
Output: samples x̂ ∼ ϕFN.
  x̂j ← generate an observation x̂j from ϕ
  cdf ← ∫_{−∞}^{x̂j} ϕ(t) dt
  u ← random.uniform(0, 1)
  while u > [1 − α + (2α − 1) · cdf]/α do
      x̂j ← generate an observation x̂j from ϕ
      cdf ← ∫_{−∞}^{x̂j} ϕ(t) dt
      u ← random.uniform(0, 1)
  Result: x̂j
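For reference, a runnable Python transcription of Algorithms 2 and 3, under our own illustrative choice of a standard normal proposal ϕ (the anchor-specific proposal used in the experiments is not reproduced here):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def acc_rej_sampling_tn(alpha, size=1):
    """Accept-reject sampling from phi_TN (Algorithm 2), with phi = N(0, 1)."""
    out = []
    while len(out) < size:
        x = rng.standard_normal()
        cdf = norm.cdf(x)  # Phi(x) of the proposal
        if rng.uniform() <= (alpha + (1 - 2 * alpha) * cdf) / alpha:
            out.append(x)
    return np.array(out)

def acc_rej_sampling_fn(alpha, size=1):
    """Accept-reject sampling from phi_FN (Algorithm 3), with phi = N(0, 1)."""
    out = []
    while len(out) < size:
        x = rng.standard_normal()
        cdf = norm.cdf(x)
        if rng.uniform() <= (1 - alpha + (2 * alpha - 1) * cdf) / alpha:
            out.append(x)
    return np.array(out)

print(acc_rej_sampling_tn(0.9, 5000).mean(), acc_rej_sampling_fn(0.9, 5000).mean())
```

With α > 0.5, samples accepted by the TN routine concentrate at lower similarity scores than those accepted by the FN routine, which is the intended effect of the location parameter α.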
Aside from the consistency of the estimators, Fig 10(b) can be seen as a simulation of the training process: a better-trained encoder with a higher macro-AUC embeds true negative samples dissimilar to the anchor, while false negative samples become more similar to the anchor. So we observe the decrease of θ̄, which corresponds to the decrease of loss values during training, and the increase of the bias |θ̄BIASED − θ̄|, since the false negative samples contained in θ̄BIASED are scored higher as α increases.
Figure 10. Mean values of the estimators: (a) under different variations of the anchor-specific proposal distribution ϕ; (b) under different encoder performance.
C. Proofs
Proposition C.1 (Class Conditional Density). If ϕ(x̂) is a continuous density function that satisfies ϕ(x̂) ≥ 0 and ∫_{−∞}^{+∞} ϕ(x̂) dx̂ = 1, then ϕTN(x̂) and ϕFN(x̂) are probability density functions that satisfy ϕTN(x̂) ≥ 0, ϕFN(x̂) ≥ 0, and ∫_{−∞}^{+∞} ϕTN(x̂) dx̂ = 1, ∫_{−∞}^{+∞} ϕFN(x̂) dx̂ = 1.

Proof. Since ϕ(x̂) ≥ 0 and ∫_{−∞}^{+∞} ϕ(x̂) dx̂ = 1, we have

∫_{−∞}^{+∞} ϕTN(x̂) dx̂ = 2α ∫_{−∞}^{+∞} ϕ(x̂)[1 − Φ(x̂)] dx̂ + 2(1 − α) ∫_{−∞}^{+∞} ϕ(x̂)Φ(x̂) dx̂
                      = 2α ∫_0^1 (1 − µ) dµ + 2(1 − α) ∫_0^1 µ dµ
                      = [α(2µ − µ²) + (1 − α)µ²]_0^1
                      = 1

and

∫_{−∞}^{+∞} ϕFN(x̂) dx̂ = 2α ∫_{−∞}^{+∞} ϕ(x̂)Φ(x̂) dx̂ + 2(1 − α) ∫_{−∞}^{+∞} ϕ(x̂)[1 − Φ(x̂)] dx̂
                      = 2α ∫_{−∞}^{+∞} Φ(x̂) dΦ(x̂) + 2(1 − α) ∫_{−∞}^{+∞} [1 − Φ(x̂)] dΦ(x̂)
                      = 2α ∫_0^1 µ dµ + 2(1 − α) ∫_0^1 (1 − µ) dµ
                      = [αµ² + (1 − α)(2µ − µ²)]_0^1
                      = 1.   (36)
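As a quick numerical confirmation of Proposition C.1, the snippet below integrates ϕTN and ϕFN under an illustrative standard-normal proposal ϕ (our choice; any valid continuous proposal works):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

alpha = 0.8  # illustrative location parameter

def phi_tn(x):  # alpha * phi_(1) + (1 - alpha) * phi_(2), cf. Eq. (41)
    return 2 * norm.pdf(x) * (alpha * (1 - norm.cdf(x)) + (1 - alpha) * norm.cdf(x))

def phi_fn(x):  # alpha * phi_(2) + (1 - alpha) * phi_(1)
    return 2 * norm.pdf(x) * (alpha * norm.cdf(x) + (1 - alpha) * (1 - norm.cdf(x)))

print(quad(phi_tn, -np.inf, np.inf)[0])  # approximately 1.0
print(quad(phi_fn, -np.inf, np.inf)[0])  # approximately 1.0
```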
Lemma C.2 (Posterior Probability Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any observation x̂i, the posterior probability estimation of x̂i being true negative is given by

p(TN|x̂i) = ωi τ−
It is worth noting that the term ϕ(x̂i) has been eliminated. Although the explicit expression of Φ(x̂i) = ∫_{−∞}^{x̂i} ϕ(t) dt is unknown, we can obtain its empirical estimate through Eq (14) and Eq (13). Therefore, the posterior probability p(TN|x̂i) can be calculated without making any parametric assumptions about ϕ(x̂i). Similarly, for values of β other than 0.5, the calculation steps of ω follow the same algorithmic process as depicted in Algorithm 4.
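To make this computation concrete, the sketch below assembles the β = 0.5 weight from the pieces shown in this paper: ϕTHN = ϕTN at β = 0.5 (Eq. (41)), ϕUN from Eq. (28), and the Φ ↔ ΦUN transformation from Eq. (30). Since Eqs. (13)-(14) and Algorithm 4 are not reproduced in this excerpt, treat this as an illustrative reconstruction under those assumptions rather than the reference procedure.

```python
import numpy as np

def bcl_weights(x_hat, alpha, tau_pos):
    """Importance weights omega for beta = 0.5, where omega = phi_TN / phi_UN.

    x_hat: similarity observations of the N unlabeled samples for one anchor.
    The anchor-specific c.d.f. Phi is recovered from the empirical c.d.f.
    Phi_UN via Eq. (30), Phi_UN = b*Phi + a*Phi^2 (our reading of Eqs. (13)-(14)).
    """
    tau_neg = 1.0 - tau_pos
    n = len(x_hat)
    # empirical c.d.f. of the unlabeled observations (relative ranking position)
    phi_un_hat = (np.argsort(np.argsort(x_hat)) + 1) / n

    a = (1 - 2 * alpha) * (tau_neg - tau_pos)
    b = 2 * (alpha * tau_neg + (1 - alpha) * tau_pos)
    if abs(a) < 1e-12:                  # alpha = 0.5 or tau_pos = 0.5: Phi = Phi_UN
        phi = phi_un_hat / b
    else:                               # root of a*Phi^2 + b*Phi - Phi_UN = 0 in [0, 1]
        phi = (-b + np.sqrt(b ** 2 + 4 * a * phi_un_hat)) / (2 * a)

    # omega = phi_TN / phi_UN with the common factor 2*phi(x_hat) cancelled
    return (alpha + (1 - 2 * alpha) * phi) / (
        alpha * tau_neg + (1 - alpha) * tau_pos
        + (1 - 2 * alpha) * (tau_neg - tau_pos) * phi
    )
```

By Lemma C.2, multiplying the returned weights by τ− gives the posterior probability estimates p(TN|x̂i).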
Lemma C.3 (Asymptotic Unbiased Estimation). Let β = 0.5 and assume that the similarities x̂ are i.i.d. distributed. For any encoder f, as N → ∞, we have

LBCL → LSUP

Proof. We draw the conclusion by the Lebesgue Dominated Convergence Theorem and the properties of importance sampling:

lim_{N→∞} LBCL = lim_{N→∞} E_{x∼pd, x+∼p+, x−∼pd} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{i=1}^N ωi · exp(f(x)ᵀf(x−_i)) ) ) ]
= E_{x∼pd, x+∼p+, x−∼pd} lim_{N→∞} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{i=1}^N ωi · exp(f(x)ᵀf(x−_i)) ) ) ]
= E_{x∼pd, x+∼p+} lim_{N→∞} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + N · E_{x̂∼ϕUN}[ω x̂] ) ) ]   (40)

where

ω(x̂; α, β) = ϕTHN(x̂; α, β) / ϕUN(x̂).

When taking β = 0.5,

ϕTHN(x̂; α, β = 0.5) = [ (1 − 0.5) α ϕ(1)(x̂) + 0.5 (1 − α) ϕ(2)(x̂) ] / [ (1 − 0.5) α + 0.5 (1 − α) ]
                    = α ϕ(1)(x̂) + (1 − α) ϕ(2)(x̂)
                    = ϕTN(x̂).   (41)

Therefore, ω = ϕTN(x̂; α, β) / ϕUN(x̂), so the second term in the denominator of Eq (40) is

E_{x̂∼ϕUN}[ω x̂] = ∫ ω x̂ ϕUN(x̂) dx̂
               = ∫ x̂ ( ϕTN(x̂; α, β) / ϕUN(x̂) ) ϕUN(x̂) dx̂
               = ∫ x̂ ϕTN(x̂) dx̂
               = E_{x−∼p−} exp(f(x)ᵀf(x−_j)).   (42)

Substituting this back into Eq (40), the chain continues as

= E_{x∼pd, x+∼p+, x−∼p−} lim_{N→∞} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{j=1}^N exp(f(x)ᵀf(x−_j)) ) ) ]
= lim_{N→∞} E_{x∼pd, x+∼p+, x−∼p−} [ − log ( exp(f(x)ᵀf(x+)) / ( exp(f(x)ᵀf(x+)) + Σ_{j=1}^N exp(f(x)ᵀf(x−_j)) ) ) ]   (43)
= lim_{N→∞} LSUP   (44)
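A small Monte Carlo check of the importance-sampling identity in Eq. (42), again under an illustrative standard-normal proposal and our own parameter choices:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
alpha, tau_pos = 0.8, 0.1
tau_neg = 1.0 - tau_pos

def phi_tn(x):  # alpha * phi_(1) + (1 - alpha) * phi_(2), Eq. (41)
    return 2 * norm.pdf(x) * (alpha * (1 - norm.cdf(x)) + (1 - alpha) * norm.cdf(x))

def phi_un(x):  # Eq. (28)
    return 2 * norm.pdf(x) * (alpha * tau_neg + (1 - alpha) * tau_pos
                              + (1 - 2 * alpha) * norm.cdf(x) * (tau_neg - tau_pos))

# Draw from phi_UN by accept-reject against the proposal (envelope constant 2),
# then compare the Monte Carlo estimate of E_{phi_UN}[omega * x] with the
# direct integral of x * phi_TN(x).
x = rng.standard_normal(400_000)
kept = x[rng.uniform(size=x.size) <= phi_un(x) / (2 * norm.pdf(x))]
omega = phi_tn(kept) / phi_un(kept)
print((omega * kept).mean())
print(quad(lambda s: s * phi_tn(s), -np.inf, np.inf)[0])
```

The two printed values agree up to Monte Carlo error, which is the content of the importance-sampling step used in the proof.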