Contrastive Self Supervised Learning With Hard Negative Pair Mining
Wentao Zhu, Hang Shang, Tingxun Lv, Chao Liao, Sen Yang, Ji Liu
Kuaishou Technology
$U_j = T\left(A(I_j);\, \theta_T\right)$,

\[
\mathrm{DisSim}(U_i', U_j) = \left( \frac{U_i'}{\|U_i'\|_\infty} - \frac{U_j}{\|U_j\|_\infty} \right)^{2}. \qquad (3)
\]

There exist large numbers of negative sample pairs. Hard samples have been widely shown to improve the performance of deep learning models [Ren et al., 2015; Lin and others, 2017]. In the self-supervised learning framework, we define hard negative pairs as image pairs with small dissimilarity according to Equation 3. We try to maximize the normalized L2 distance, i.e., the dissimilarity, of negative image pairs. The contrastive loss for negative pairs can be derived as

\[
\mathcal{L}_2 = -\mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \tilde{B}_i} \mathrm{DisSim}(U_i', U_j)\right], \qquad (4)
\]

where $U_i$ and $U_i'$ are calculated from the teacher sub-network and the student sub-network, respectively. In the InfoNCE loss, $f_k(\cdot,\cdot)$ models the mutual information between the encoded representations, and we can use the similarity loss as a surrogate loss to approximate the mutual information. We define the similarity loss as the reciprocal of the normalized L2 distance of the encoded representations. The InfoNCE loss can then be defined as

\[
\begin{aligned}
\mathcal{L}_{NCE} &\triangleq \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \frac{\mathrm{DisSim}(U_i, U_i')}{\sum_{I_j \in \mathcal{D}} \mathrm{DisSim}(U_j, U_i')}\right] \\
&= \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \left(\frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty}\right)^{2}\right]
 - \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \mathcal{D}} \left(\frac{U_j}{\|U_j\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty}\right)^{2}\right].
\end{aligned} \qquad (9)
\]
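To make Equations 3 and 4 concrete, the following PyTorch-style sketch reads the squared difference in Equation 3 as the squared Euclidean distance between infinity-norm-normalized representations and mines the hard negative set $\tilde{B}_i$ inside each batch. The top-`num_hard` selection rule and the small stabilizing constants are illustrative assumptions; the excerpt only states that hard negatives are the pairs with small dissimilarity, not exactly how $\tilde{B}_i$ is constructed.

```python
import torch

def dissim(u, v, eps=1e-8):
    # Equation 3: squared distance between representations normalized by
    # their infinity norms (summed over the feature dimension).
    u = u / (u.norm(p=float("inf"), dim=-1, keepdim=True) + eps)
    v = v / (v.norm(p=float("inf"), dim=-1, keepdim=True) + eps)
    return ((u - v) ** 2).sum(dim=-1)

def negative_pair_loss(u_student, u_teacher, num_hard=16):
    # u_student: (B, D) student outputs U_i'; u_teacher: (B, D) teacher outputs U_j.
    B = u_student.size(0)
    # Pairwise dissimilarities between every student/teacher pair in the batch.
    d = dissim(u_student.unsqueeze(1), u_teacher.unsqueeze(0))  # (B, B)
    # Exclude the positive pair (same image) on the diagonal.
    d = d.masked_fill(torch.eye(B, dtype=torch.bool, device=d.device), float("inf"))
    # Hard negative pair mining: keep the negatives with the SMALLEST
    # dissimilarity for each anchor (an assumed construction of B~_i).
    hard, _ = d.topk(min(num_hard, B - 1), dim=1, largest=False)
    # Equation 4: L2 = -E_i[ log sum_{j in B~_i} DisSim(U_i', U_j) ].
    return -(hard.sum(dim=1) + 1e-8).log().mean()
```

The second expectation in Equation 9 has the same form as this negative-pair term when the sum runs over the whole dataset rather than the mined set $\tilde{B}_i$, which is the observation the next paragraph builds on.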
The second part of the derived loss in Equation 9 is the same as our negative pair loss in Equation 4 if we temporarily neglect our hard negative sample pair mining within each batch. Minimizing the first part of Equation 9 is equivalent to minimizing $\mathbb{E}_{I_i \sim \mathcal{D}}\big[\big(\frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty}\big)^{2}\big]$, which is the positive pair loss in Equation 2. From the above derivation we conclude that, with proper relaxation and assumptions, minimizing our loss is equivalent to minimizing the InfoNCE loss.

Method | Top-1 | Top-5
CPCv2 [Henaff, 2020] | 63.8 | 85.3
CMC [Tian et al., 2019] | 66.2 | 87.0
SimCLR [Chen et al., 2020a] | 69.3 | 89.0
MoCov2 [Chen and others, 2020] | 71.1 | N/A
SimCLRv2 [Chen et al., 2020b] | 71.7 | N/A
InfoMin Aug. [Tian et al., 2020] | 73.0 | 91.1
BYOL [Grill and others, 2020] | 74.3 | 91.6
Ours | 77.1 | 93.7

Table 1: The accuracy comparison of self-supervised learning (SSL) approaches with the ResNet-50 encoder based on linear evaluation on the ImageNet dataset. The bold face denotes the best accuracy.

Next we demonstrate that the hard negative pair mining (HNPM) leads to stable training. Without the trade-off factors $\alpha_1$ and $\alpha_2$, the loss can be written as
\[
\mathcal{L} = \mathbb{E}_{I_i \sim \mathcal{D}}\left[\left(\frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty}\right)^{2}\right]
- \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \tilde{B}_i}\left(\frac{U_j}{\|U_j\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty}\right)^{2}\right]. \qquad (10)
\]

Without loss of generality, we remove the normalization constraint and denote $\frac{U_i}{\|U_i\|_\infty}$ as $U_i$:

\[
\mathcal{L} = \mathbb{E}_{I_i \sim \mathcal{D}}\,(U_i - U_i')^{2} - \log \sum_{I_j \in \tilde{B}_i}(U_j - U_i')^{2}. \qquad (11)
\]

Method | Dep. | Wid. | Top-1 | Top-5
CMC | 50 | 2× | 70.6 | 89.7
SimCLRv2 | 50 | 2× | 75.6 | N/A
BYOL | 50 | 2× | 77.4 | 93.6
Ours | 50 | 2× | 79.4 | 94.5
SimCLR | 50 | 4× | 76.5 | 93.2
BYOL | 50 | 4× | 78.6 | 94.2
Ours | 50 | 4× | 80.3 | 95.1
BYOL | 200 | 2× | 79.6 | 94.8
Ours | 200 | 2× | 81.9 | 96.4

Table 2: The accuracy (%) comparison of SSL methods with other ResNet encoders based on linear evaluation.

The hard negative pair mining (HNPM) always explores negative pairs with L2 distance smaller than 1, which guarantees that $(U_j - U_i')^{2}$ is bounded to be smaller than 1. We use $M$ to denote the upper bound of the negative pair loss:

\[
|\mathcal{L}| \le \mathbb{E}_{I_i \sim \mathcal{D}}\,(U_i - U_i')^{2} + M. \qquad (12)
\]
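Before turning to the stability argument, the combined objective in Equation 10 can be sketched by reusing `dissim` and `negative_pair_loss` from the earlier snippet. Restoring the trade-off factors $\alpha_1 = 0.8$ and $\alpha_2 = 0.1$ mentioned in the implementation details is an assumed reading of the (unshown) Equation 7, not a quotation of it.

```python
def total_loss(u_teacher, u_student, alpha1=0.8, alpha2=0.1, num_hard=16):
    # Positive-pair term of Equation 10: dissimilarity between the teacher and
    # student representations of the same image (dissim normalizes internally).
    pos = dissim(u_teacher, u_student).mean()
    # Hard-negative term of Equation 10, i.e., Equation 4 with in-batch mining.
    neg = negative_pair_loss(u_student, u_teacher, num_hard=num_hard)
    # Assumed weighting of the two terms; with alpha1 = alpha2 = 1 this
    # reduces to Equation 10 (pos + neg).
    return alpha1 * pos + alpha2 * neg
```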
Next we can further prove that Equation 12 can be optimized stably, and that the first part of Equation 12, i.e., the loss of positive pairs, can be decreased consecutively by escaping undesirable equilibria. If the model gets stuck in an undesirable equilibrium solution, the feature representation of the teacher sub-network can be denoted as $\mathbb{E}[U_i' \mid U_i]$ from the update rule in Equation 6. The loss of positive pairs $\mathcal{L}_P$ can be derived as

\[
\mathcal{L}_P = \mathbb{E}_{I_i \sim \mathcal{D}}\,(U_i - U_i')^{2}
= \mathbb{E}_{I_i \sim \mathcal{D}}\,\big(\mathbb{E}[U_i' \mid U_i] - U_i'\big)^{2}
= \mathbb{E}_{I_i \sim \mathcal{D}}\big[\mathrm{Var}(U_i' \mid U_i)\big]. \qquad (13)
\]

Let $Z$ denote an additional variability induced by stochasticities in the training dynamics. We always have a solution leading to a lower loss during training, which escapes the current equilibrium, because

\[
\mathrm{Var}(U_i' \mid U_i, Z) \le \mathrm{Var}(U_i' \mid U_i). \qquad (14)
\]

From the above derivation, the learning is stable with the benefit of hard negative pair mining and the student sub-network updating rule.

We use residual networks as the student sub-network $S(\cdot, \theta_S)$ and the teacher sub-network $T(\cdot, \theta_T)$. For the two coefficients of the loss in Equation 7, $\alpha_1$ is set to 0.8 and $\alpha_2$ is set to 0.1. We employ a gradient clipping strategy in the back-propagation, where we set the maximum norm for gradient clipping to 1.0. The Adam optimizer is used to minimize the loss in Equation 7. The batch size is 160. The learning rate is set to 0.1, and we use a cosine annealing schedule for the learning rate with the maximum number of iterations set to 100. The smoothing coefficient $\tau$ in the update of the student sub-network in Equation 6 is set to 0.5.

We employ data augmentation for the teacher sub-network on-the-fly during training. We first apply color jittering with brightness of 0.8, contrast of 0.8, saturation of 0.8, and hue of 0.2 to a random 80% of the training images in each batch. Then we convert a random 20% of the images to grayscale and horizontally flip 50% of the images. After that, we smooth a random 10% of the images with a random Gaussian kernel of size 3 × 3 and standard deviation of 1.5 × 1.5. Finally, we crop each image with a random crop of scale range [0.8, 1.0]. We use the mean of [0.485, 0.456, 0.406] and the standard deviation of [0.229, 0.224, 0.225] to normalize the RGB channels.
Table 3: The accuracy (%) comparison of SSL methods with the ResNet-50 encoder based on semi-supervised learning on the ImageNet dataset.

Method | Dep. | Wid. | SK | Para. | Top-1 (1%) | Top-5 (1%) | Top-1 (10%) | Top-5 (10%)
SimCLR [Chen et al., 2020a] | 50 | 2× | ✗ | 94M | 58.5 | 83.0 | 71.7 | 91.2
BYOL [Grill and others, 2020] | 50 | 2× | ✗ | 94M | 62.2 | 84.1 | 73.5 | 91.7
Ours | 50 | 2× | ✗ | 94M | 65.7 | 86.2 | 78.6 (5.1↑) | 93.2 (1.5↑)
SimCLR [Chen et al., 2020a] | 50 | 4× | ✗ | 375M | 63.0 | 85.8 | 74.4 | 92.6
BYOL [Grill and others, 2020] | 50 | 4× | ✗ | 375M | 69.1 | 87.9 | 75.7 | 92.5
Ours | 50 | 4× | ✗ | 375M | 70.3 | 89.9 | 78.9 (3.2↑) | 95.5 (2.9↑)
BYOL [Grill and others, 2020] | 200 | 2× | ✗ | 250M | 71.2 | 87.9 | 77.7 | 92.5
Ours | 200 | 2× | ✗ | 250M | 76.5 | 90.3 | 80.7 (3.0↑) | 95.4 (2.9↑)
SimCLRv2 distilled [Chen et al., 2020b] | 50 | 1× | ✗ | N/A | 73.9 | 91.5 | 77.5 | 93.4
SimCLRv2 distilled [Chen et al., 2020b] | 50 | 2× | ✓ | N/A | 75.9 | 93.0 | 80.2 | 95.0
SimCLRv2 self-distilled [Chen et al., 2020b] | 152 | 3× | ✓ | N/A | 76.6 | 93.4 | 80.9 | 95.5
Ours | 152 | 3× | ✓ | N/A | 77.6 | 94.2 | 81.3 | 95.7

Table 4: The accuracy (%) comparison of SSL approaches with other ResNet encoders, including selective kernel convolution (SK), based on semi-supervised learning on the ImageNet dataset.

Table 5: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder based on linear evaluation on ImageNet.

Table 6: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder based on finetuning on ImageNet.
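For context on how the linear-evaluation numbers in Tables 1, 2, and 5 are conventionally obtained, a generic linear-probe sketch is shown below: the pretrained encoder is frozen and only a linear classifier is trained on top. The encoder handle, feature dimension, and optimizer settings are placeholders, not the authors' exact evaluation configuration.

```python
import torch
import torch.nn as nn

def linear_evaluation(frozen_encoder, train_loader, num_classes=1000,
                      feat_dim=2048, epochs=90, lr=0.1, device="cuda"):
    # Freeze the pretrained encoder so only the linear head is optimized.
    frozen_encoder.eval().to(device)
    for p in frozen_encoder.parameters():
        p.requires_grad = False

    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = frozen_encoder(images)  # fixed representations
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```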