
Contrastive Self-Supervised Learning with Hard Negative Pair Mining

Wentao Zhu, Hang Shang, Tingxun Lv, Chao Liao, Sen Yang, Ji Liu
Kuaishou Technology
arXiv:2202.13072v1 [cs.CV] 26 Feb 2022

Abstract

Recently, learning from vast unlabeled data, especially self-supervised learning, has been emerging and has attracted widespread attention. Self-supervised learning followed by supervised fine-tuning on a few labeled examples can significantly improve label efficiency and even outperform standard supervised training on fully annotated data [Chen et al., 2020b]. In this work, we present a novel self-supervised deep learning paradigm based on online hard negative pair mining. Specifically, we design a student-teacher network to generate multiple views of the data for self-supervised learning and integrate hard negative pair mining into the training. We then derive a new triplet-like loss that considers both positive sample pairs and mined hard negative sample pairs. Extensive experiments demonstrate the effectiveness of the proposed method and its components on ILSVRC-2012.
1 Introduction

Learning from large-scale unlabeled datasets has long been a hot topic in the computer vision community, because high quality labels require laborious and costly annotation for each task, while huge amounts of unlabeled data are available from various data servers and sources. Un/self-supervised learning can effectively learn a task-agnostic representation from vast unlabeled data, and downstream tasks, such as image classification, can then be performed well by fine-tuning on a few task-specific labels. This strategy has become a mainstream pipeline for transformer-based self-supervised learning approaches [Vaswani and others, 2017]. Recent advanced self-supervised learning achieves promising results and outperforms conventional fully supervised learning on image classification [Chen et al., 2020b].

The key effort of general self-supervised learning approaches mainly focuses on pretext task construction [Jing and Tian, 2020]. The pretext task can be designed as a predictive task [Mathieu and others, 2016], a generative task [Bansal et al., 2018], a contrastive task [Oord et al., 2018], or a combination of them. The supervision signal for the pretext task, i.e., the pseudo label, is typically yielded by a pretext construction process, which generally involves exhaustive multi-view construction to model various variations [Qian et al., 2020]. By solving the pretext task with a specific objective function, the network learns transferable visual features for various downstream tasks.

The study of conventional self-supervised learning methods mainly involves data-related pretext task design [Zhu et al., 2021]. Popular pretext tasks include colorizing gray scale images [Zhang and others, 2016], image inpainting [Pathak and others, 2016], solving image jigsaw puzzles [Noroozi and Favaro, 2016], etc. For video-related self-supervised learning approaches, the data-related pretext tasks can be sequence order verification [Misra and others, 2016], sequence sorting [Lee et al., 2017], predicting the odd or unrelated element [Fernando and others, 2017], classifying clip order [Xu and others, 2019], etc.

The recent tremendous success of self-supervised learning is mainly driven by advanced learning strategies. The InfoNCE loss is widely adopted for contrastive learning; it maximizes a lower bound of mutual information based on the pseudo label in the pretext task [Oord et al., 2018]. SimCLR employs larger batch sizes, more training steps and compositions of data augmentations, and matches the performance of a fully supervised ResNet-50 simply by adding one additional linear classifier [Chen et al., 2020a]. [Wu et al., 2018] maintains a large feature memory bank to store training image representations. MoCo builds a large and consistent dictionary through a dynamic queue and a momentum-updated encoder, and outperforms its supervised pretraining counterpart on detection and segmentation [He and others, 2020]. SimSiam employs a stop-gradient operation in a Siamese architecture to prevent collapsing solutions of self-supervised learning [Chen and He, 2020]. SimCLRv2 employs big (deep and wide) networks during pretraining and fine-tuning, and achieves surprisingly good performance for semi-supervised learning on ImageNet [Chen et al., 2020b]. BYOL trains an online network to predict a target network representation of the same image, where the target network is a slow-moving average of the online network [Grill and others, 2020].

In this work, we propose a novel self-supervised learning paradigm by introducing effective negative image pair mining into the contrastive learning framework. Specifically, we introduce a student-teacher network into the contrastive learning framework to construct multi-view representations of the data. To effectively learn from unlabeled data in contrastive learning, we further construct negative image pairs by hard negative image pair mining. The overall objective function can be derived as a form of triplet-like loss facilitated by the collected positive and negative image pairs.

We conduct extensive experiments including linear evaluation, semi-supervised learning, transfer learning, and an ablation study to evaluate our method on the ImageNet dataset [Russakovsky and others, 2015]. The proposed method achieves 77.1% top-1 accuracy with a ResNet-50 encoder under linear evaluation, which outperforms the previous state-of-the-art by 2.8%. For the semi-supervised learning task, our method with a ResNet-50 encoder obtains a top-1 accuracy of 73.4%, which outperforms the previous best result by 4.6% using 10% of the labels. For transfer learning with linear evaluation, our method with a ResNet-50 encoder achieves the best accuracy on six out of seven widely used transfer learning datasets, outperforming the previous best results by 2.5% on average. More specifically, our major contributions are summarized as follows.
• First, we build a student-teacher network to construct multi-view representations in the contrastive learning framework. The gradient of the student sub-network is blocked to ease the training difficulty and stabilize the training of self-supervised learning.

• Second, we collect hard negative image pairs on-the-fly and add these hard negative image pairs into the training of contrastive self-supervised learning.

• Third, extensive experiments demonstrate that the proposed contrastive self-supervised learning outperforms previous state-of-the-art self-supervised learning approaches for linear evaluation, semi-supervised learning and transfer learning on the ImageNet dataset.

Figure 1: The architecture of contrastive self-supervised learning with hard negative pair mining.

2 Related Work

The mainstream unsupervised/self-supervised learning literature generally involves two aspects: data/feature-related pretext tasks and loss functions [He and others, 2020]. The data/feature-related pretext tasks can typically be constructed by a multi-view data/feature generation process [Jing and Tian, 2020]. Through solving the pretext task, the deep network of self-supervised learning is expected to learn a good representation for downstream tasks. Loss objective functions can often improve the performance of self-supervised learning significantly. Our contrastive learning focuses on a novel loss function built on an advanced student-teacher network design. Next we discuss related studies with respect to these aspects.
Contrastive loss measures the similarity of image pairs in the feature space [Hadsell et al., 2006]. In the contrastive learning framework, the targets can be defined and generated on-the-fly during training [Hadsell et al., 2006]. The recent significant success of self-supervised learning has witnessed the widespread adoption of contrastive learning [Henaff, 2020]. [Zhuang et al., 2019] train an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space while allowing dissimilar instances to separate. Contrastive multi-view learning trains a deep network by maximizing mutual information between different views of the same scene [Tian et al., 2019].

The student-teacher network can be used to generate multi-view representations of unlabeled data. Temporal ensembling maintains an exponential moving average (EMA) prediction as the pseudo label for self-supervised training [Laine and Aila, 2016]. Instead of averaging label predictions, the mean teacher uses EMA to update the model weights [Tarvainen and Valpola, 2017]. MoCo further uses a momentum to update the encoder for new keys on-the-fly, and maintains a queue of keys in the contrastive learning framework [He and others, 2020]. BYOL maintains a student-teacher network to yield multiple views of samples during training [Grill and others, 2020]; without negative sample pairs in the training, BYOL achieves surprisingly good performance. Momentum teacher performs two independent momentum updates, for the teacher's weights and the teacher's batch normalization statistics, to maintain a stable training process [Li and others, 2021].

3 Method

We employ a student-teacher network to construct two representational views of each sample, as illustrated in Figure 1. On top of the student and teacher sub-networks, we construct both positive sample pairs and negative sample pairs. Specifically, we consider the representations of the same sample from the student and teacher sub-networks as a positive pair, and we only retain the most similar pair of two different samples to construct the negative pair, i.e., the hard negative pair. We block the gradient update of the student sub-network and employ an exponential moving average (EMA) to update its parameters, which stabilizes the self-supervised training.
3.1 Student-Teacher Network

Problem definition: Un/self-supervised learning tries to learn a good representation from a large-scale unlabeled dataset $\mathcal{D} = \{I_1, I_2, \cdots, I_N\}$, where each $I$ represents an image. For an image sampled from the dataset, $I_i \sim \mathcal{D}$, we can obtain two representational views of $I_i$ by constructing one student sub-network $S(\cdot; \theta_S)$ and one teacher sub-network $T(\cdot; \theta_T)$ [Shin, 2020]. To let the network learn various invariances, we employ advanced data augmentation $A$, including color jittering, horizontal flipping, Gaussian blurring and random cropping, in the data generation process for the teacher sub-network. We then obtain two views of representations for image $I_i$ as

$$U_i = T(A(I_i); \theta_T), \quad U_i' = S(I_i; \theta_S), \qquad (1)$$

where $U_i$ is the representation from the teacher sub-network and $U_i'$ is the representation from the student sub-network.

Self-supervised learning builds pretext tasks from these unlabeled data. The generated representation views $U_i$ and $U_i'$ from the student and teacher sub-networks can be considered a positive pair, which belongs to the same cluster. Our contrastive self-supervised learning tries to yield a compact representation for images of the same cluster by minimizing their normalized L2 distance in the representational space. The intra-cluster distance can be defined as

$$L_1 = \mathbb{E}_{I_i \sim \mathcal{D}}\left[\left\| \frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right], \qquad (2)$$

where images are randomly sampled from the dataset, $I_i \sim \mathcal{D}$, and $\|\cdot\|_\infty$ is the infinity norm, i.e., the maximum of the absolute values of the elements in the vector.
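To make the two-view construction concrete, the following PyTorch-style sketch implements equations (1) and (2); the encoder callables, the augmentation callable and the batch layout (one feature vector per row) are assumptions made here for illustration, not the authors' released code.

```python
import torch

def infnorm_normalize(u):
    """Scale each row by its infinity norm (maximum absolute value), as in eq. (2)."""
    return u / u.abs().amax(dim=1, keepdim=True)

def two_views(student, teacher, augment, images):
    """Equation (1): U_i = T(A(I_i); theta_T), U_i' = S(I_i; theta_S)."""
    u_teacher = teacher(augment(images))   # teacher sees the augmented view
    u_student = student(images)            # student sees the original image
    return u_teacher, u_student

def intra_cluster_loss(u_teacher, u_student):
    """Equation (2): mean squared L2 distance between the normalized positive pair."""
    diff = infnorm_normalize(u_teacher) - infnorm_normalize(u_student)
    return (diff ** 2).sum(dim=1).mean()
```

Given a batch of images, `intra_cluster_loss(*two_views(student, teacher, augment, images))` evaluates the positive-pair term $L_1$.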
3.2 Hard Negative Pair Mining (HNPM)

It is not efficient to train a self-supervised network solely with positive pairs of samples. Current self-supervised learning uses a large batch size [Chen et al., 2020a], a memory bank [Wu et al., 2018] or a large dynamic dictionary [He and others, 2020] to achieve promising results. Adding negative image pairs can significantly improve the training efficiency of a self-supervised learning model.

We heuristically construct negative pairs in the self-supervised learning framework by mining hard negative pairs of images. For two different images $I_i$ and $I_j$, we measure the dissimilarity of the two images by the normalized L2 distance in the representation space

$$U_j = T(A(I_j); \theta_T), \quad \mathrm{DisSim}(U_i', U_j) = \left\| \frac{U_i'}{\|U_i'\|_\infty} - \frac{U_j}{\|U_j\|_\infty} \right\|^2. \qquad (3)$$

There exists a large number of negative pairs of samples. Hard samples have been widely shown to improve the performance of deep learning models [Ren et al., 2015; Lin and others, 2017]. In the self-supervised learning framework, we define hard negative pairs as image pairs of small dissimilarity according to equation 3. We try to maximize the normalized L2 distance, i.e., the dissimilarity, of negative image pairs. The contrastive loss for negative pairs can be derived as

$$L_2 = -\mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \tilde{B}_i} \mathrm{DisSim}(U_i', U_j)\right], \qquad (4)$$

where images are randomly sampled from the dataset, $I_i \sim \mathcal{D}$, and $\tilde{B}_i$ is the hard negative sample set of the current batch $B_i$ for image $I_i$. The hard negative sample set $\tilde{B}_i$ can be constructed as

$$\tilde{B}_i = \{I_j \mid I_j \in B_i,\; I_j \neq I_i,\; \mathrm{DisSim}(U_i', U_j) \leq 1\}. \qquad (5)$$

We construct hard negative pairs on-the-fly during training, which can be used to efficiently train the self-supervised network.
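A minimal batched sketch of equations (3)-(5) follows. The threshold of 1 and the per-image hard negative set are taken from the text above; the function name, the epsilon guard and the loop-free masking are illustrative assumptions rather than the authors' implementation.

```python
import torch

def hard_negative_loss(u_student, u_teacher):
    """Equations (3)-(5): mine hard negatives inside the batch and score them.

    u_student: (B, D) student representations U_i'
    u_teacher: (B, D) teacher representations U_j of the same batch
    """
    s = u_student / u_student.abs().amax(dim=1, keepdim=True)
    t = u_teacher / u_teacher.abs().amax(dim=1, keepdim=True)

    # dissim[i, j] = DisSim(U_i', U_j): pairwise squared L2 distance (eq. 3).
    dissim = ((s.unsqueeze(1) - t.unsqueeze(0)) ** 2).sum(dim=2)

    batch = dissim.size(0)
    not_self = ~torch.eye(batch, dtype=torch.bool, device=dissim.device)
    # Hard negative set B~_i: different images with dissimilarity at most 1 (eq. 5).
    hard = not_self & (dissim <= 1.0)

    # Negative-pair loss L_2 (eq. 4); eps avoids log(0) when no hard negative exists.
    eps = 1e-8
    return -torch.log((dissim * hard).sum(dim=1) + eps).mean()
```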
3.3 Network Update

To stabilize the training and avoid a collapsing solution in self-supervised learning [Chen and He, 2020], we block the gradient for the student sub-network $S(\cdot; \theta_S)$. We employ the exponential moving average (EMA) to update the parameters $\theta_S$ of the student sub-network [Tarvainen and Valpola, 2017]:

$$\theta_S \leftarrow \tau \theta_S + (1 - \tau)\, \theta_T, \qquad (6)$$

where $\tau$ is a smoothing coefficient that tunes the update strength of the student sub-network.

In the back-propagation, we only use the gradient to update the parameters of the teacher sub-network. The overall loss function can be derived as

$$L(\theta_T) = \alpha_1 L_1 + \alpha_2 L_2, \qquad (7)$$

where $0 < \alpha_1 < 1$ and $0 < \alpha_2 < 1$ are fixed coefficients that tune the trade-off between the intra-cluster loss and the inter-cluster loss. During back-propagation, we employ gradient clipping to stabilize the training.
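Putting the pieces together, one training step implied by equations (6) and (7) might look like the sketch below; it reuses `intra_cluster_loss` and `hard_negative_loss` from the earlier sketches, assumes the optimizer is built over the teacher's parameters only, and uses the coefficients reported in Section 3.5 as defaults.

```python
import torch

@torch.no_grad()
def ema_update_student(student, teacher, tau=0.5):
    """Equation (6): theta_S <- tau * theta_S + (1 - tau) * theta_T."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_s.mul_(tau).add_(p_t, alpha=1.0 - tau)

def training_step(student, teacher, augment, optimizer, images,
                  alpha1=0.8, alpha2=0.1, max_grad_norm=1.0, tau=0.5):
    """One optimization step of equation (7); gradients flow only into the teacher.

    `intra_cluster_loss` and `hard_negative_loss` are the helper sketches above;
    `optimizer` is assumed to hold teacher.parameters() only.
    """
    with torch.no_grad():                          # blocked student gradient
        u_student = student(images)
    u_teacher = teacher(augment(images))           # teacher branch keeps gradients

    loss = (alpha1 * intra_cluster_loss(u_teacher, u_student)
            + alpha2 * hard_negative_loss(u_student, u_teacher))

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(teacher.parameters(), max_grad_norm)
    optimizer.step()                               # updates theta_T only
    ema_update_student(student, teacher, tau)      # equation (6)
    return loss.item()
```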
3.4 Connection with InfoNCE and Stability

In our method, we employ hard negative pair mining (HNPM) to add negative image pairs to the training, and we use a normalized L2 distance in the loss function. We now show that minimizing our loss is equivalent to minimizing the InfoNCE loss [Oord et al., 2018]. To simplify the analysis, we temporarily remove the hard negative pair mining mechanism from our method in the derivation of the connection with InfoNCE.

The InfoNCE loss [Oord et al., 2018] can be written as

$$L_{NCE} = -\mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \frac{f_k(U_i, U_i')}{\sum_{I_j \in \mathcal{D}} f_k(U_j, U_i')}\right], \qquad (8)$$

where $U_i$ and $U_i'$ are computed by the teacher and student sub-networks, and $f_k(\cdot, \cdot)$ models the mutual information between the encoded representations in InfoNCE; a similarity measure can be used as a surrogate to approximate the mutual information.

We define the similarity as the reciprocal of the normalized L2 distance between the encoded representations. The InfoNCE loss can then be written as

$$L_{NCE} \triangleq \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \frac{\mathrm{DisSim}(U_i, U_i')}{\sum_{I_j \in \mathcal{D}} \mathrm{DisSim}(U_j, U_i')}\right] = \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \left\| \frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right] - \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \mathcal{D}} \left\| \frac{U_j}{\|U_j\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right]. \qquad (9)$$

The second part of the derived loss in equation 9 is the same as our negative pair loss in equation 4 if we temporarily neglect the hard negative sample pair mining for each batch. Minimizing the first part of equation 9 is equivalent to minimizing $\mathbb{E}_{I_i \sim \mathcal{D}}\left[\left\| \frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right]$, which is the positive pair loss in equation 2. From the above derivation, we conclude that, with proper relaxation and assumptions, minimizing our loss is equivalent to minimizing the InfoNCE loss.

Next we show that hard negative pair mining (HNPM) leads to stable training. Without the trade-off factors $\alpha_1$ and $\alpha_2$, the loss can be written as

$$L = \mathbb{E}_{I_i \sim \mathcal{D}}\left[\left\| \frac{U_i}{\|U_i\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right] - \mathbb{E}_{I_i \sim \mathcal{D}}\left[\log \sum_{I_j \in \tilde{B}_i} \left\| \frac{U_j}{\|U_j\|_\infty} - \frac{U_i'}{\|U_i'\|_\infty} \right\|^2\right]. \qquad (10)$$

Without loss of generality, we drop the normalization and denote $\frac{U_i}{\|U_i\|_\infty}$ simply as $U_i$:

$$L = \mathbb{E}_{I_i \sim \mathcal{D}}\left[(U_i - U_i')^2 - \log \sum_{I_j \in \tilde{B}_i} (U_j - U_i')^2\right]. \qquad (11)$$

The hard negative pair mining (HNPM) only explores negative pairs with an L2 distance smaller than 1, which guarantees that $(U_j - U_i')^2$ is bounded above by 1. Using $M$ to denote the upper bound of the negative pair loss, we have

$$|L| \leq \mathbb{E}_{I_i \sim \mathcal{D}}\left[(U_i - U_i')^2\right] + M. \qquad (12)$$

Next we show that equation 12 can be optimized stably, and that its first part, i.e., the loss of positive pairs, can be decreased consecutively by escaping undesirable equilibria. If the model gets stuck in an undesirable equilibrium, the feature representation of the teacher sub-network can be denoted as $\mathbb{E}[U_i' \mid U_i]$ from the update rule in equation 6. The loss of positive pairs $L_P$ can then be derived as

$$L_P = \mathbb{E}_{I_i \sim \mathcal{D}}\left[(U_i - U_i')^2\right] = \mathbb{E}_{I_i \sim \mathcal{D}}\left[(\mathbb{E}[U_i' \mid U_i] - U_i')^2\right] = \mathbb{E}_{I_i \sim \mathcal{D}}\left[\mathrm{Var}(U_i' \mid U_i)\right]. \qquad (13)$$

Let $Z$ denote an additional source of variability induced by stochasticity in the training dynamics. There is always a solution leading to a lower loss during training, which escapes the current equilibrium, because

$$\mathrm{Var}(U_i' \mid U_i, Z) \leq \mathrm{Var}(U_i' \mid U_i). \qquad (14)$$

From the above derivation, the learning is stable thanks to hard negative pair mining and the student sub-network updating rule.
3.5 Implementation Details

Because of our advanced learning strategy, we do not use any pretrained model as the backbone in our implementation. To generate multi-view representations, we employ data augmentation to model various variations in the different views.

We use residual networks as the student sub-network $S(\cdot; \theta_S)$ and the teacher sub-network $T(\cdot; \theta_T)$. The two coefficients of the loss in equation 7 are set to $\alpha_1 = 0.8$ and $\alpha_2 = 0.1$. We employ gradient clipping in the back-propagation, with the maximum gradient norm set to 1.0. The Adam optimizer is used to minimize the loss in equation 7. The batch size is 160. The learning rate is set to 0.1, and we use a cosine annealing schedule for the learning rate with the maximum number of iterations set to 100. The smoothing coefficient $\tau$ in the student sub-network update of equation 6 is set to 0.5.
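As a rough sketch of this configuration, assuming a torchvision ResNet-50 backbone and omitting any projection head (which the paper does not detail), the optimizer and schedule could be set up as follows.

```python
import copy
import torch
import torchvision

# Teacher and student share the same ResNet architecture; the student starts as a
# copy of the teacher and is afterwards updated only by the EMA rule of eq. (6).
teacher = torchvision.models.resnet50(weights=None)   # no pretrained backbone
student = copy.deepcopy(teacher)
for p in student.parameters():
    p.requires_grad_(False)                            # student gradient is blocked

# Only the teacher's parameters are optimized (Section 3.3).
optimizer = torch.optim.Adam(teacher.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

batch_size = 160   # as reported above; gradient clipping (max norm 1.0) is applied per step
```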
We employ data augmentation for the teacher sub-network on-the-fly during training. We first apply color jittering with brightness 0.8, contrast 0.8, saturation 0.8, and hue 0.2 to a random 80% of the training images in each batch. Then we convert a random 20% of the images to gray scale and horizontally flip 50% of the images. After that, we smooth a random 10% of the images with a Gaussian kernel of size 3 × 3 and standard deviation 1.5 × 1.5. Finally, we crop each image with a random crop scale range of [0.8, 1.0]. We use a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225] to normalize the RGB channels.
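Using torchvision, the teacher-branch augmentation described above can be sketched as below; the 224 × 224 output resolution is an assumption, since the paper does not state the crop size.

```python
from torchvision import transforms

# Teacher-branch augmentation A(.) following the probabilities and parameters
# reported above; the 224x224 output size is an assumption, not stated in the paper.
teacher_augment = transforms.Compose([
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.8, contrast=0.8,
                                saturation=0.8, hue=0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=3, sigma=1.5)], p=0.1),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```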
4 Experiments

We conduct experiments to validate the performance of the proposed method on the ILSVRC-2012 dataset.

4.1 Linear Evaluation

Linear evaluation measures the accuracy of self-supervised learning (SSL) by freezing the SSL model and training a separate linear classifier on top of it [Grill and others, 2020; Kornblith et al., 2019; Zhang and others, 2016]. We compare our method with previous state-of-the-art approaches with the ResNet-50 encoder and with other ResNet encoders on ImageNet in Table 1 and Table 2, respectively, reporting top-1 and top-5 accuracy.
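As an illustration of this protocol rather than the authors' evaluation code, a frozen-backbone linear probe can be set up roughly as follows; the SGD settings are placeholder choices, and 2048 is ResNet-50's pooled feature dimension.

```python
import torch
import torchvision

# Linear evaluation: freeze the self-supervised encoder and train only a
# linear classifier on top of its features (1000 ImageNet classes).
encoder = torchvision.models.resnet50(weights=None)
# encoder.load_state_dict(...)  # load the self-supervised checkpoint here
encoder.fc = torch.nn.Identity()          # expose the 2048-d pooled features
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

linear_head = torch.nn.Linear(2048, 1000)
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def linear_probe_step(images, labels):
    with torch.no_grad():
        features = encoder(images)        # frozen backbone
    logits = linear_head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```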
With the standard ResNet-50 encoder [He et al., 2016], our method obtains 77.1% top-1 accuracy and 93.7% top-5 accuracy, which outperform the previous state-of-the-art top-1 and top-5 results by 2.8% and 2.1%, respectively. Most surprisingly, our method achieves 0.6% better accuracy than the 76.5% supervised baseline reported in SimCLR [Chen et al., 2020a].

Method                              Top-1   Top-5
CPCv2 [Henaff, 2020]                63.8    85.3
CMC [Tian et al., 2019]             66.2    87.0
SimCLR [Chen et al., 2020a]         69.3    89.0
MoCov2 [Chen and others, 2020]      71.1    N/A
SimCLRv2 [Chen et al., 2020b]       71.7    N/A
InfoMin Aug. [Tian et al., 2020]    73.0    91.1
BYOL [Grill and others, 2020]       74.3    91.6
Ours                                77.1    93.7

Table 1: The accuracy comparison of self-supervised learning (SSL) approaches with the ResNet-50 encoder based on linear evaluation on the ImageNet dataset. The bold face denotes the best accuracy.

Table 2 reports the accuracy of self-supervised learning methods using deeper and wider ResNet encoders under linear evaluation. Our method with ResNet-200 (2×) obtains 81.9% top-1 and 96.4% top-5 accuracy, which improve the previous best top-1 and top-5 accuracy by 2.3% and 1.6%, respectively. With ResNet-50 (2×) and ResNet-50 (4×) encoders, our method also achieves better accuracy than CMC [Tian et al., 2019], SimCLRv2 [Chen et al., 2020b] and BYOL [Grill and others, 2020] with the same encoder.

Method      Dep.   Wid.   Top-1   Top-5
CMC         50     2×     70.6    89.7
SimCLRv2    50     2×     75.6    N/A
BYOL        50     2×     77.4    93.6
Ours        50     2×     79.4    94.5
SimCLR      50     4×     76.5    93.2
BYOL        50     4×     78.6    94.2
Ours        50     4×     80.3    95.1
BYOL        200    2×     79.6    94.8
Ours        200    2×     81.9    96.4

Table 2: The accuracy (%) comparison of SSL methods with other ResNet encoders based on linear evaluation.
4.2 Semi-Supervised Learning

Semi-supervised learning can also be used to evaluate the accuracy of self-supervised learning (SSL), by fine-tuning the representation on a small subset of the training set [Grill and others, 2020]. In this experiment, we use the fixed data splits of 1% and 10% of the ImageNet training set, which are the same as in [Grill and others, 2020], and we again report top-1 and top-5 accuracy. The comparisons using the ResNet-50 encoder and deeper and wider ResNet encoders are listed in Table 3 and Table 4, respectively. Our method achieves 80.2% top-5 accuracy with a ResNet-50 encoder, which improves the previous best result by 1.8% using only 1% of the training labels (Table 3). Using 10% of the training labels, our method achieves 73.4% top-1 and 92.5% top-5 accuracy, which improve the previous best top-1 and top-5 accuracy by 4.6% and 3.5%.

Method                          Top-1 (1%)   Top-5 (1%)    Top-1 (10%)   Top-5 (10%)
SimCLR [Chen et al., 2020a]     48.3         75.5          65.6          87.8
SimCLRv2 [Chen et al., 2020b]   57.9         N/A           68.4          N/A
BYOL [Grill and others, 2020]   53.2         78.4          68.8          89.0
Ours                            56.7         80.2 (1.8↑)   73.4 (4.6↑)   92.5 (3.5↑)

Table 3: The accuracy (%) comparison of SSL methods with the ResNet-50 encoder based on semi-supervised learning on the ImageNet dataset.

The results with ResNets of various depths, widths, and selective kernel convolution (SK) [Li and others, 2019] configurations are listed in Table 4. Our method achieves the best top-1 and top-5 accuracy for all the experimental configurations. Specifically, with a ResNet-50 (2×) encoder, our method achieves 65.7% and 78.6% top-1 accuracy using 1% and 10% of the training labels, which improves the previous best top-1 accuracy by 3.5% and 5.1%. With ResNet-200 (2×), our method obtains 76.5% and 80.7% top-1 accuracy using 1% and 10% of the training labels, which improves the accuracy of BYOL [Grill and others, 2020] by 5.3% and 3.0%.

Method                                         Dep.   Wid.   SK   Para.   Top-1 (1%)   Top-5 (1%)   Top-1 (10%)   Top-5 (10%)
SimCLR [Chen et al., 2020a]                    50     2×     ✗    94M     58.5         83.0         71.7          91.2
BYOL [Grill and others, 2020]                  50     2×     ✗    94M     62.2         84.1         73.5          91.7
Ours                                           50     2×     ✗    94M     65.7         86.2         78.6 (5.1↑)   93.2 (1.5↑)
SimCLR [Chen et al., 2020a]                    50     4×     ✗    375M    63.0         85.8         74.4          92.6
BYOL [Grill and others, 2020]                  50     4×     ✗    375M    69.1         87.9         75.7          92.5
Ours                                           50     4×     ✗    375M    70.3         89.9         78.9 (3.2↑)   95.5 (2.9↑)
BYOL [Grill and others, 2020]                  200    2×     ✗    250M    71.2         87.9         77.7          92.5
Ours                                           200    2×     ✗    250M    76.5         90.3         80.7 (3.0↑)   95.4 (2.9↑)
SimCLRv2 distilled [Chen et al., 2020b]        50     1×     ✗    N/A     73.9         91.5         77.5          93.4
SimCLRv2 distilled [Chen et al., 2020b]        50     2×     ✓    N/A     75.9         93.0         80.2          95.0
SimCLRv2 self-distilled [Chen et al., 2020b]   152    3×     ✓    N/A     76.6         93.4         80.9          95.5
Ours                                           152    3×     ✓    N/A     77.6         94.2         81.3          95.7

Table 4: The accuracy (%) comparison of SSL approaches with other ResNet encoders, including selective kernel convolution (SK), based on semi-supervised learning on the ImageNet dataset.

4.3 Transfer Learning

Transfer learning is another widely used task to evaluate the accuracy of self-supervised learning (SSL) methods; it measures the generalization ability of the learned SSL model. In practice, both linear evaluation, i.e., training only the last classification layer, and fine-tuning the whole network on the target dataset can be used to evaluate transfer learning. The comparisons of transfer learning with linear evaluation and with fine-tuning are listed in Table 5 and Table 6.

For the linear evaluation of the transfer learning task, our method achieves better accuracy than previous state-of-the-art approaches on six out of seven widely used transfer learning datasets (Table 5). Compared with the previous best results in Table 5, our method improves accuracy by 2.3%, 3.5%, 4.5%, 2.7%, 4.9% and 0.9% on the Food101, SUN397, Cars, Pets, VOC 2007 and Flowers datasets, respectively. On average, the transfer learning accuracy of our method is 2.5% higher than the previous best results based on linear evaluation. For transfer learning with fine-tuning, our method achieves the best accuracy on four out of seven tasks (Table 6).
Method                               Food101   CIFAR-10   SUN397   Cars   Pets   VOC 2007   Flowers
BYOL [Grill and others, 2020]        75.3      91.3       60.6     67.8   90.4   82.5       96.1
SimCLR [Chen et al., 2020a]          68.4      90.6       58.8     50.3   83.6   80.5       91.2
Supervised-IN [Chen et al., 2020a]   72.3      93.6       61.9     66.7   91.5   82.8       94.7
Ours                                 77.6      92.4       65.4     72.3   94.2   87.7       97.0

Table 5: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder based on linear evaluation on ImageNet.

Method                               Food101   CIFAR-10   SUN397   Cars   Pets   VOC 2007   Flowers
BYOL [Grill and others, 2020]        88.5      97.8       63.7     91.6   91.7   85.4       97.0
SimCLR [Chen et al., 2020a]          88.2      97.7       63.5     91.3   89.2   84.1       97.0
Supervised-IN [Chen et al., 2020a]   88.3      97.5       64.3     92.1   92.1   85.0       97.6
Ours                                 89.1      98.0       64.1     92.1   92.8   85.3       97.5

Table 6: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder based on fine-tuning on ImageNet.

4.4 Ablation Study

Coefficient τ in the student sub-network update. We investigate the accuracy of our method under linear evaluation with the ResNet-50 encoder with respect to the smoothing coefficient τ of the exponential moving average (EMA) in Table 7. The larger τ is, the smaller the update the student sub-network receives. When τ is 0, the weights of the teacher sub-network are copied to the student sub-network completely at each step; when τ is 1, the student sub-network is never updated. We find that a moving average coefficient of 0.5 yields the best top-1 accuracy, 77.1%, under linear evaluation; neither τ = 0 nor τ = 1 gives good performance.

τ           1.0   0.999   0.5    0.0
Top-1 (%)   24    73.4    77.1   49.1

Table 7: The effect of the smoothing coefficient τ in the exponential moving average with the ResNet-50 encoder based on linear evaluation.
Hard negative pair mining (HNPM). We conduct an ablation study on hard negative pair mining (HNPM) based on the linear evaluation task, using the ResNet-200 (2×) encoder on the ImageNet dataset. Training with all negative pairs, i.e., without HNPM, is denoted as "w/o HNPM + block student gradient", and our method trained with HNPM is denoted as "w/ HNPM + block student gradient". The loss and accuracy comparisons over training epochs for the two methods are shown in Fig. 2. With hard negative pair mining, the training of our method is much more stable, and it achieves lower loss and higher accuracy than the variant without hard negative pair mining.

Figure 2: The loss (left) and accuracy (right) comparison w.r.t. different epochs for the ablation study of hard negative pair mining (HNPM) and blocking the gradient in the student sub-network, based on linear evaluation with the ResNet-200 (2×) encoder on ImageNet.

Blocking the gradient of the student sub-network. We also conduct an ablation study on blocking the gradient of the student sub-network in Fig. 2. Training without blocking the gradient of the student sub-network is denoted as "w/ HNPM + student gradient". Our method achieves lower loss and higher accuracy than the variant with gradient updating of the student sub-network.

5 Conclusion

In this work, we introduce a self-supervised learning framework with a student-teacher network and a contrastive loss. To increase training efficiency, we add hard negative image pairs into the contrastive self-supervised learning paradigm. To stabilize the training and avoid a collapsing solution, we block the gradient of the student sub-network and update its parameters using an exponential moving average. We also conduct ablation studies to validate the effectiveness of each component. Extensive experiments demonstrate that our method achieves better performance than previous state-of-the-art approaches on linear evaluation, semi-supervised learning and transfer learning on the ImageNet dataset.
References

[Bansal et al., 2018] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-GAN: Unsupervised video retargeting. In ECCV, pages 119–135, 2018.
[Chen and He, 2020] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
[Chen and others, 2020] Xinlei Chen et al. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[Chen et al., 2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[Chen et al., 2020b] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[Fernando and others, 2017] Basura Fernando et al. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017.
[Grill and others, 2020] Jean-Bastien Grill et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[Hadsell et al., 2006] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[He and others, 2020] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[Henaff, 2020] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192. PMLR, 2020.
[Jing and Tian, 2020] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI, 2020.
[Kornblith et al., 2019] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.
[Laine and Aila, 2016] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[Lee et al., 2017] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
[Li and others, 2019] Xiang Li et al. Selective kernel networks. In CVPR, 2019.
[Li and others, 2021] Zeming Li et al. Momentum^2 teacher: Momentum teacher with momentum statistics for self-supervised learning. arXiv preprint arXiv:2101.07525, 2021.
[Lin and others, 2017] Tsung-Yi Lin et al. Focal loss for dense object detection. In ICCV, 2017.
[Mathieu and others, 2016] Michael Mathieu et al. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[Misra and others, 2016] Ishan Misra et al. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[Noroozi and Favaro, 2016] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV. Springer, 2016.
[Oord et al., 2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[Pathak and others, 2016] Deepak Pathak et al. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[Qian et al., 2020] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800, 2020.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[Russakovsky and others, 2015] Olga Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[Shin, 2020] Minchul Shin. Semi-supervised learning with a teacher-student network for generalized attribute prediction. arXiv preprint arXiv:2007.06769, 2020.
[Tarvainen and Valpola, 2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[Tian et al., 2019] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[Tian et al., 2020] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.
[Vaswani and others, 2017] Ashish Vaswani et al. Attention is all you need. In NIPS, 2017.
[Wu et al., 2018] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[Xu and others, 2019] Dejing Xu et al. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
[Zhang and others, 2016] Richard Zhang et al. Colorful image colorization. In ECCV, 2016.
[Zhu et al., 2021] Wentao Zhu, Yufang Huang, Daguang Xu, Zhen Qian, Wei Fan, and Xiaohui Xie. Test-time training for deformable multi-scale image registration. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13618–13625. IEEE, 2021.
[Zhuang et al., 2019] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.
