Weakly Supervised Contrastive Learning

* Equal contributions. † Corresponding author.
Abstract
Unsupervised visual representation learning has gained much attention from the computer vision community because of the recent success of contrastive learning. Most existing contrastive learning frameworks adopt instance discrimination as the pretext task, which treats every single instance as a different class. However, such a method inevitably causes the class collision problem, which hurts the quality of the learned representation. Motivated by this observation, we introduce a weakly supervised contrastive learning framework (WCL) to tackle this issue. Specifically, our proposed framework is based on two projection heads, one of which performs the regular instance discrimination task. The other head uses a graph-based method to explore similar samples and generate a weak label, and then performs a supervised contrastive learning task based on the weak label to pull similar images closer. We further introduce a K-Nearest Neighbor based multi-crop strategy to expand the number of positive samples. Extensive experimental results demonstrate that WCL improves the quality of self-supervised representations across different datasets. Notably, we obtain a new state-of-the-art result for semi-supervised learning. With only 1% and 10% labeled examples, WCL achieves 65% and 72% ImageNet Top-1 accuracy using ResNet50, which is even higher than SimCLRv2 with ResNet101.

Figure 1. An example of the class collision problem. A typical instance discrimination method treats the first column and the third column as a negative pair since they are different instances. However, the semantic information of the first and third columns is very similar, so treating them as a positive pair would be much more reasonable.

1. Introduction

Modern deep convolutional neural networks demonstrate outstanding performance on various computer vision datasets [11, 15, 30] and edge devices [45, 36, 44, 35]. However, most successful methods are trained in a supervised fashion; they usually require a large volume of labeled data that is very hard to collect. Meanwhile, the quality of data annotations dramatically affects the performance. Recently, self-supervised learning has shown its superiority and achieved promising results for unsupervised and semi-supervised learning in computer vision (e.g. [6, 7, 19, 8, 9, 5, 18, 50]). These methods can learn general-purpose visual representations without labels, perform well on linear classification, and transfer well to different tasks or datasets. Notably, a large part of the recent self-supervised representation learning frameworks is based on the idea of contrastive learning.
A typical contrastive learning based method adopts noise contrastive estimation (NCE) [27] to perform non-parametric instance discrimination [41] as the pretext task, which encourages the two augmented views of the same image to be pulled closer in the embedding space while pushing apart all the other images. Most recent works mainly improve the performance of contrastive learning through the image augmentation for positive samples and the exploration of negative samples. However, instance discrimination based methods inevitably induce the class collision problem: even very similar instances still need to be pushed apart, as shown in Figure 1. These ignored instance similarities thus tend to hurt the representation quality [1]. Hence, identifying and even leveraging these similar instances plays a key role in the performance of the learned representations.

Surprisingly, the class collision problem seems to attract much less attention in contrastive learning. As far as we know, there has been little effort to identify similar samples. AdpCLR [49] finds the top-K closest samples in the embedding space and treats these samples as positives. However, in the early stage of training, the model cannot effectively extract the semantic information from the images; therefore, this method needs to use SimCLR [6] to pre-train for a period of time and then switch to AdpCLR to get the best performance. FNCancel [23] proposed a similar idea but adopts a very different way to find the top-K similar instances: for each sample, it generates a support set that contains different augmented views of the same image, then uses a mean or max aggregation strategy over the cosine similarity scores between the augmented views in the support set, and finally identifies the top-K similar samples. Nevertheless, the optimal support size is 8 in their experiments, requiring 8 additional forward passes to generate the embedding vectors. These methods therefore have two shortcomings. First, they are both time-consuming. Second, the top-K closest samples might not be reciprocal, i.e. x_i may be among the K closest samples of x_j while x_j is not among the K closest samples of x_i. In this case, x_j will treat x_i as a positive sample, but x_i will treat x_j as a negative sample, which results in conflicts.

In this paper, we regard the instance similarities as intrinsically weak supervision in representation learning and propose a weakly supervised contrastive learning framework (WCL) to address the class collision issue accordingly. In WCL, similar instances are assumed to share the same weak label compared to other instances, and instances with the same weak label are expected to be aggregated. To determine the weak labels, we model each batch of instances as a nearest neighbor graph; weak labels are thus determined and reciprocal for each connected component of the graph. Besides, we can further expand the graph by a KNN-based multi-crop strategy to propagate weak labels, such that we can have more positives for each weak label. In this way, similar instances with the same weak label can be pulled closer via the supervised contrastive learning [25] task. Nevertheless, since the mined instance similarities might be noisy and not completely reliable, in practice we adopt a two-head framework, where one head handles this weakly supervised task while the other performs the regular instance discrimination task. Extensive experiments demonstrate the effectiveness of our proposed method across different settings and various datasets.

Our contributions can be summarized as follows:

• We propose a two-head based framework to address the class collision problem, with one head focusing on instance discrimination and the other head attracting similar samples.

• We propose a simple graph-based and parameter-free method to find similar samples adaptively.

• We introduce a K-Nearest Neighbor based multi-crop strategy that provides much more diverse information than the standard multi-crop strategy.

• Experimental results show that WCL establishes a new state-of-the-art performance for contrastive learning based methods. With only 1% and 10% labeled samples, WCL achieves 65% and 72% Top-1 accuracy on ImageNet using ResNet50. Notably, this result is even higher than SimCLRv2 with ResNet101.

2. Related Work

Self-Supervised Learning. Early work in self-supervised learning mainly focuses on designing different pretext tasks, for example, predicting the relative offset of a pair of patches [12], solving jigsaw puzzles [33], colorizing gray-scale images [48], image inpainting [14], predicting the rotation angle [16], unsupervised deep clustering [4], and image reconstruction [2, 17, 13, 3, 28]. Although these methods have shown their effectiveness, the learned representations lack generality.

Contrastive Learning. Contrastive learning [27, 21, 41, 40] has become one of the most successful approaches in self-supervised learning. As mentioned above, most recent works mainly focus on the augmentation for positive samples and the exploration of negative samples. For example, SimCLR [6] proposed a composition of data augmentations, e.g. grayscale, random resized cropping, color jittering, and Gaussian blur, to make the model robust to these transformations. InfoMin [37] further introduced an "InfoMin principle" which suggests that a good augmentation strategy should reduce the mutual information between the positive pairs while keeping the downstream task-relevant information intact. To explore the use of negative samples, InstDisc [41] proposed a memory bank to store the representations of all the images in the dataset.
Figure 2. The overall framework of our proposed method. We adopt a two-head based structure (g and ϕ). The first head g plays a regular instance discrimination task. The second head ϕ generates a weak label based on the connected component labeling process, then uses the weak label to perform a supervised contrastive learning task. Please see more details in Section 3.
MoCo [19, 8] increases the number of negatives by using a momentum contrast mechanism that forces the query encoder to learn the representation from a slowly progressing key encoder and maintains a long queue to provide a large number of negative examples.

Contrastive Learning Without Negatives. Unlike the typical contrastive learning framework, BYOL [18] can learn a high-quality visual representation without negative samples. Specifically, it trains an online network to predict the target network representation of the same image under a different augmented view and uses an additional predictor network on top of the online encoder to avoid model collapse. SimSiam [9] extends BYOL to further explore the siamese structure in contrastive learning. Surprisingly, SimSiam prevents model collapse even without the target network and a large batch size; although its linear evaluation result is lower than BYOL, it performs better on downstream tasks.
3. Method

In this section, we first revisit the preliminary work on contrastive learning and address its limitations. Then we introduce our proposed weakly supervised contrastive learning framework (WCL), which automatically mines similar samples while doing the instance discrimination. After that, the algorithm and the implementation details are explained.

3.1. Revisiting Contrastive Learning

Typical contrastive learning methods adopt the noise contrastive estimation (NCE) objective for discriminating different instances in the dataset. Concretely, the NCE objective encourages different augmentations of the same instance to be pulled closer in a latent space while pushing away the augmentations of different instances. Following the setup of SimCLR [6], given a batch of unlabeled samples {x_i}_{i=1}^N, we randomly apply a composition of augmentation functions T(·) to obtain two different views of the same instance, which can be written as {x_i^1}_{i=1}^N = T(x, θ_1) and {x_i^2}_{i=1}^N = T(x, θ_2), where θ is the random seed of T. Then, a convolutional neural network based encoder F(·) extracts the information from the different augmentations, which can be expressed as {h^1} = F({x^1}_{i=1}^N) and {h^2} = F({x^2}_{i=1}^N). Finally, a non-linear projection head z = g(h) maps the representation h to the space where the NCE objective is applied. If we denote (z_i, z_j) as a positive pair, the NCE objective can be expressed as

\mathcal{L}_{NCE} = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}.   (1)
3.2. Instance Similarities as Weak Supervision

Instance discrimination based methods have already shown promising performance for unsupervised pretraining. However, this line of solutions ignores the relationships between different images, because only the augmentations of the same image are regarded as the same class. Inspired by previous works, we can leverage the embedding vectors to explore the relations between different images. Specifically, we generate a weak label based on the embedding vectors and then use it as a supervisory signal to attract similar samples in the embedding space. However, directly using this weak supervision causes two problems. First, there is a natural conflict between "instance discrimination" and "similar sample attraction", since one wants to push all the different instances apart while the other wants to pull similar samples closer. Second, there might be noise in the weak label, especially in the early training stages; simply attracting similar samples based on the weak label will slow down the convergence of the model.

Two-head framework. To resolve these issues, we propose an auxiliary projection head ϕ(·). The primary projection head g(·) still performs the regular instance discrimination task and focuses on instance-level information, while the auxiliary projection head has the same structure as g(·) and is responsible for exploring similar samples and generating a weak label as the supervisory signal to attract them.
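As a minimal illustration of this two-head structure, the sketch below attaches two independent projection heads to a shared ResNet-50 encoder. It is a hypothetical rendering rather than the released model definition, and the MLP widths and output dimension are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

def projection_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    # a 2-layer MLP head; the widths here are illustrative assumptions
    return nn.Sequential(nn.Linear(in_dim, hidden_dim),
                         nn.ReLU(inplace=True),
                         nn.Linear(hidden_dim, out_dim))

class TwoHeadModel(nn.Module):
    """Shared encoder F with a primary head g (instance discrimination)
    and an auxiliary head phi (similar-sample attraction)."""

    def __init__(self):
        super().__init__()
        self.encoder = resnet50()
        self.encoder.fc = nn.Identity()   # expose the 2048-d pooled feature h
        self.g = projection_head()        # produces z, used by the NCE loss of Eq. (1)
        self.phi = projection_head()      # produces v, used by the weakly supervised loss

    def forward(self, x):
        h = self.encoder(x)
        return h, self.g(h), self.phi(h)
```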
With these two heads of distinct responsibilities, we can further transform the features extracted by the encoder F into different embedding spaces to resolve the conflict. Moreover, the primary projection head ensures the model's convergence even when the weak label contains some noise. The information extracted from the auxiliary projection head can be written as

\mathbf{v}_i = \phi(\mathcal{F}(T(\mathbf{x}_i, \theta))).   (2)

Suppose we have obtained a weak label y ∈ R^{N×N} based on v which denotes whether a pair of samples is similar (i.e. y_ij = 1 means x_i and x_j are similar). Different from Eq. (1), which naturally forms positive pairs through augmentations, we can then leverage the label y_ij to indicate whether x_i and x_j can produce a positive pair or not. By introducing an indicator 1_{y_ij=1} into Eq. (1), we obtain the supervised contrastive loss [25]

\mathcal{L}_{sup} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{sup}^{i},   (3)

\mathcal{L}_{sup}^{i} = -\sum_{j=1}^{N} \mathbb{1}_{\mathbf{y}_{ij}=1} \log \frac{\exp(\mathrm{sim}(\mathbf{v}_i, \mathbf{v}_j)/\tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(\mathbf{v}_i, \mathbf{v}_k)/\tau)},   (4)

which has been shown to be more effective than the traditional supervised cross-entropy loss.
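A possible implementation of Eqs. (3)-(4) is sketched below; the weak label matrix y is assumed to be a binary N×N tensor, and the function name and masking details are our own illustrative choices rather than the official code.

```python
import torch
import torch.nn.functional as F

def weak_sup_con_loss(v, y, tau=0.1):
    """Supervised contrastive loss of Eqs. (3)-(4) driven by a weak label matrix.

    v: (N, D) embeddings from the auxiliary head phi.
    y: (N, N) binary matrix, y[i, j] = 1 if samples i and j share a weak label.
    """
    v = F.normalize(v, dim=1)
    n = v.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=v.device)
    logits = (torch.matmul(v, v.t()) / tau).masked_fill(eye, float('-inf'))  # drop k = i
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)         # log term of Eq. (4)
    log_prob = log_prob.masked_fill(eye, 0.0)   # the diagonal is never a positive; avoid 0 * inf
    y = y.float().masked_fill(eye, 0.0)         # enforce y_ii = 0
    return (-(y * log_prob).sum(dim=1)).mean()  # Eq. (3): batch average of Eq. (4)
```

The swapped loss of Eq. (7), introduced below, then amounts to weak_sup_con_loss(v1, y2) + weak_sup_con_loss(v2, y1).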
3.3. Weak Label Generation

In this section, we elaborate how to generate the weak label for a mini-batch of samples. The overall idea can be summarized in two points. First, for each sample, the closest sample can be regarded as a similar sample. Second, if (x_i, x_j) and (x_j, x_k) are two pairs of similar samples, then x_i and x_k can also be considered similar.

Suppose we use the auxiliary projection head ϕ to map a batch of samples to N embeddings V = {v_1, v_2, ..., v_N}. Then, for each sample v_i, we find the closest sample v_j by computing the cosine similarity score. Now, we can define an adjacency matrix by

A(i, j) = \begin{cases} 1, & \text{if } i = k_j^1 \text{ or } j = k_i^1 \\ 0, & \text{otherwise} \end{cases}   (5)

where k_i^1 denotes the 1-nearest neighbour of v_i. Basically, Eq. (5) generates a sparse and symmetric 1-nearest neighbor graph where each vertex is linked with its closest sample. To find all similar samples, we convert this problem into a Connected Components Labeling (CCL) process; that is, for each sample, we want to find all the reachable samples based on the 1-nearest neighbor graph. This is a traditional graph problem that can be easily solved by the well-known Hoshen–Kopelman algorithm [22] (also known as the two-pass algorithm). We define an undirected graph G = (V, E), where V is the set of embeddings from ϕ and E connects the vertices with A(i, j) = 1. The algorithm adopts a disjoint-set data structure that supports three operations: makeSet, union and find (see the definition in Algorithm 1). Basically, it first creates a singleton set for each v in V, then traverses each edge in E and merges the different sets along the edges; finally, it returns the set that each vertex belongs to. Returning to our proposed idea, we treat the samples in the same set as similar samples, and the weak label is defined as

\mathbf{y}_{ij} = \begin{cases} 1, & \text{if } \mathrm{find}(\mathbf{v}_i) = \mathrm{find}(\mathbf{v}_j) \text{ and } i \neq j \\ 0, & \text{otherwise} \end{cases}   (6)

Figure 3. An illustration of the weak label generation: (A) adjacency matrix, (B) nearest neighbour graph, (C) weak label.

This weak label generation method has several advantages:

• It is a parameter-free process, so we do not need any hyperparameter optimization.

• By the definition of an undirected graph and its connected components, the weak label is always reciprocal (i.e. y_ij = y_ji).

• It is a deterministic process; the final result does not depend on any initial state.
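The whole weak-label generation step of Eqs. (5)-(6) can be sketched with NumPy and a SciPy connected-components routine; the wiring below (function name, dense adjacency matrix) is an illustrative assumption rather than the exact implementation.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def weak_labels(v):
    """Weak label matrix of Eq. (6) from a batch of auxiliary embeddings v (N, D)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = v @ v.T
    np.fill_diagonal(sim, -np.inf)               # a sample is not its own neighbour
    nn1 = sim.argmax(axis=1)                     # 1-nearest neighbour k_i^1

    n = v.shape[0]
    adj = np.zeros((n, n), dtype=np.int8)        # Eq. (5): symmetric 1-NN graph
    adj[np.arange(n), nn1] = 1
    adj = np.maximum(adj, adj.T)

    # connected components labeling (equivalent to Algorithm 1 / Hoshen-Kopelman)
    _, comp = connected_components(adj, directed=False)

    y = (comp[:, None] == comp[None, :]).astype(np.float32)   # same component -> similar
    np.fill_diagonal(y, 0.0)                                   # i != j in Eq. (6)
    return y
```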
Algorithm 1: Connected Components Labeling
  Input: An undirected graph G = (V, E)
  Define makeSet(v): create a new set with element v
  Define union(A, B): return the set A ∪ B
  Define find(v): return the set which contains v
  for v in V do
      makeSet(v)
  end
  for each (v_i, v_j) in E do
      if find(v_i) ≠ find(v_j) then
          union(find(v_i), find(v_j))
  end
  for each v in V do
      return the set that contains v: find(v)
  end
  Output: The corresponding identification of the connected component for each v.

Algorithm 2: Weakly Supervised Contrastive Learning (WCL)
  Input: {x^1_i}_{i=1}^N and {x^2_i}_{i=1}^N: a batch of samples with different augmentations.
         F: the backbone network. g: the first projection head. ϕ: the auxiliary projection head.
  while network not converged do
      Initialize an empty list L
      for i = 1 to steps do
          h^1 = F({x^1_i}_{i=1}^N),  h^2 = F({x^2_i}_{i=1}^N)
          z^1 = g(h^1),  z^2 = g(h^2)
          v^1 = ϕ(h^1),  v^2 = ϕ(h^2)
          Calculate the contrastive loss L_NCE by Eq. (1)
          Generate weak labels y^1, y^2 based on v^1, v^2
          Calculate the swapped loss L_swap by Eq. (7)
          Calculate L_cNCE and L_cswap
          Optimize the network by L_overall in Eq. (8)
          Append h^1 to the list L
      end
      Compute the K-NN for each sample based on L
  end
  Output: The well-trained model F
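Algorithm 1 can equivalently be implemented with a small disjoint-set (union-find) structure. The sketch below is our own rendering of the makeSet / find / union operations for illustration; in practice an off-the-shelf connected-components routine gives the same result.

```python
class DisjointSet:
    """Minimal disjoint-set supporting the makeSet / find / union of Algorithm 1."""

    def __init__(self):
        self.parent = {}

    def make_set(self, v):
        self.parent[v] = v

    def find(self, v):                       # follow parents with path compression
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def label_components(vertices, edges):
    """Return the component identifier (root vertex) of every vertex."""
    ds = DisjointSet()
    for v in vertices:                       # makeSet for every vertex
        ds.make_set(v)
    for i, j in edges:                       # merge sets along every edge with A(i, j) = 1
        if ds.find(i) != ds.find(j):
            ds.union(i, j)
    return {v: ds.find(v) for v in vertices}
```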
The weak label is used as the supervisory signal for the auxiliary projection head ϕ. However, if v_i and v_j are in the same set, sim(v_i, v_j) is very likely to be large already. According to Eq. (4), directly using the weak label would make L_sup very small, which is not conducive to the model's optimization. To resolve this issue, we simply swap the weak labels to supervise the same batch of samples under different augmentations. Concretely, we derive embeddings V^1 and V^2 from the two types of augmentations, based on which we generate the corresponding weak labels y^1 and y^2. Then y^1 is used as the supervisory signal for V^2 and vice versa. The swapped version of Eq. (3) can be written as

\mathcal{L}_{swap} = \mathcal{L}_{sup}(V^1, \mathbf{y}^2) + \mathcal{L}_{sup}(V^2, \mathbf{y}^1).   (7)
3.4. Label Propagation with Multi-Crops

Since the comparison between random crops of an image plays a key role in contrastive learning, many previous works [10] have pointed out that increasing the number of crops or views can significantly increase the representation quality. SwAV [5] introduced a multi-crop strategy that adds K additional low-resolution crops in each batch; using low-resolution images greatly reduces the computational cost. However, multiple crops of the same image may have many overlapping areas, in which case more crops may not provide additional effective information. To address this issue, we propose a K-Nearest Neighbor based multi-crop strategy. Specifically, we store the feature h^1 of every batch and use these features to find the K closest samples based on cosine similarity at the end of each epoch. Then, in the next epoch, we use the low-resolution crops of the K closest images. If we apply L_swap on the K-NN multi-crops, the number of positive samples can be expanded by a factor of K. Note that the K-NN result is unreliable in the early training; hence, we should use the standard multi-crop strategy to warm up the model for a certain number of epochs and then switch to our K-NN multi-crops to get better performance (see more details in our experiments). If we use L_cNCE and L_cswap to denote the contrastive loss and the swapped loss for the multi-crop images, then the overall training objective for our weakly supervised contrastive learning framework can be expressed as

\mathcal{L}_{overall} = \mathcal{L}_{NCE} + \lambda \mathcal{L}_{cNCE} + \beta \mathcal{L}_{swap} + \gamma \mathcal{L}_{cswap},   (8)

where λ, β and γ are hyper-parameters. We simply take λ = 1, β = 0.5 and γ = 0.5 in our implementation. Please see more details in Algorithm 2.
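The K-NN part of this strategy amounts to keeping a feature bank over one epoch and refreshing the neighbour lists at the epoch boundary. The sketch below illustrates this bookkeeping under assumed names (KNNCropBank is hypothetical and not part of any released code); for a large dataset the similarity matrix would be computed in chunks rather than all at once.

```python
import torch
import torch.nn.functional as F

class KNNCropBank:
    """Feature bank for the K-NN multi-crop strategy of Section 3.4 (illustrative sketch)."""

    def __init__(self, k=4):
        self.k = k
        self.feats, self.indices = [], []

    @torch.no_grad()
    def append(self, h1, idx):
        # store the pooled features h^1 of the current batch and their dataset indices
        self.feats.append(F.normalize(h1.detach().cpu(), dim=1))
        self.indices.append(idx.cpu())

    @torch.no_grad()
    def epoch_end(self):
        feats = torch.cat(self.feats)                 # (N, D), whole dataset
        order = torch.cat(self.indices).argsort()     # restore dataset order
        feats = feats[order]
        sim = feats @ feats.t()                       # cosine similarities
        sim.fill_diagonal_(float('-inf'))
        knn = sim.topk(self.k, dim=1).indices         # K closest images per sample
        self.feats, self.indices = [], []
        return knn   # used by the data loader to pick low-resolution crops next epoch
```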
4. Experimental Results

4.1. Ablation Studies

In this section, we empirically study our Weakly Supervised Contrastive Learning (WCL) framework under different batch sizes, epochs and datasets (CIFAR-10, CIFAR-100, ImageNet-100), and show the effectiveness of each component through extensive experiments.

CIFAR-10 and CIFAR-100. The CIFAR-10 [26] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. CIFAR-100 is just like CIFAR-10, except it has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. We use ResNet50 [20] as our backbone network. Because the training images are only 32x32 pixels, we replace the first 7x7 convolution of stride 2 with a 3x3 convolution of stride 1 and also remove the first max-pooling operation. We use a 2-layer MLP for the two non-linear projection heads. For data augmentation, we use random resized crops (the lower bound of the random crop ratio is set to 0.2) and color distortion (strength = 0.5), and leave out Gaussian blur. The model is trained using the LARS optimizer [46] with a momentum of 0.9 and a weight decay of 1e-6. We linearly warm up the learning rate for 10 epochs until it reaches 0.25 × BatchSize/256, then switch to the cosine decay scheduler [31]. The temperature parameter τ is always set to 0.1. To perform the Connected Components Labeling process, we simply use the "connected components" function from the SciPy library [39]. We use the same training strategy for both CIFAR-10 and CIFAR-100.
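The learning-rate schedule described above (10 warm-up epochs followed by cosine decay from a peak of 0.25 × BatchSize/256) can be sketched as follows. PyTorch has no built-in LARS optimizer, so plain SGD stands in here purely for illustration.

```python
import math
import torch

def warmup_cosine(optimizer, warmup_epochs=10, total_epochs=400):
    """Linear warm-up followed by cosine decay (illustrative sketch)."""
    def factor(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=factor)

batch_size = 256
base_lr = 0.25 * batch_size / 256                        # peak learning rate described above
model = torch.nn.Linear(10, 10)                          # placeholder for the real network
opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=1e-6)
sched = warmup_cosine(opt)                               # call sched.step() once per epoch
```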
Table 1. Experiments on CIFAR-10 and CIFAR-100 with different batch sizes and training epochs.

Batch Size  Method       | CIFAR-10: 100 ep  200 ep  300 ep  400 ep | CIFAR-100: 100 ep  200 ep  300 ep  400 ep
64          SimCLR       | 77.20   80.64   82.77   84.48            | 52.35   55.86   58.18   59.96
64          WCL (Ours)   | 79.17   83.54   85.68   86.64            | 53.54   56.57   59.29   60.76
                         | (+1.97) (+2.90) (+2.91) (+2.16)          | (+1.19) (+0.71) (+1.11) (+0.80)
128         SimCLR       | 79.64   83.57   85.70   86.72            | 54.72   59.19   60.88   62.20
128         WCL (Ours)   | 81.82   85.65   87.81   88.65            | 55.46   60.30   61.73   63.17
                         | (+2.18) (+2.08) (+2.91) (+1.93)          | (+0.74) (+1.11) (+0.85) (+0.97)
256         SimCLR       | 81.78   85.34   87.29   88.48            | 57.16   61.18   63.49   64.20
256         WCL (Ours)   | 83.12   87.57   88.85   89.47            | 57.85   62.98   64.21   64.93
                         | (+1.34) (+2.23) (+1.56) (+0.98)          | (+0.70) (+1.80) (+0.72) (+0.73)
ImageNet-100. The ImageNet-100 dataset is a randomly chosen subset of the ILSVRC2010 ImageNet dataset [11] (we simply take the first 100 classes in our experiments). For training on ImageNet-100, we strictly follow the training strategy reported in SimCLR [6]. Specifically, we set the batch size to 2048 and use the LARS optimizer with lr = 0.075 × √BatchSize. Moreover, we found that the default augmentation used in SimCLR might be too strong, which makes the model very hard to converge at the beginning; thus, we adopt the same but slightly weaker version of the augmentation (the one used in MoCo v2 [8]) in the first 10 epochs and then switch back to the original augmentations after warm-up. The model is optimized for 200 epochs, and the rest of the settings (including temperature, weight decay, etc.) are the same as in our CIFAR training.
Evaluation Protocol. To test the representation quality, we evaluate the well-trained model with the widely adopted linear evaluation protocol: we freeze the encoder parameters and train a linear classifier on top of them using the standard SGD optimizer with a momentum of 0.9, a learning rate of 0.1 × BatchSize/256 and a cosine decay scheduler. We do not use any regularization techniques such as weight decay or gradient clipping. The classifier is trained for 80 epochs and then evaluated on the testing set.
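A minimal sketch of one step of this linear evaluation protocol is given below, assuming a frozen encoder that outputs 2048-d pooled features; the helper name and the example batch size are illustrative rather than part of the protocol itself.

```python
import torch
import torch.nn as nn

def linear_eval_step(encoder, classifier, optimizer, images, labels):
    """One optimization step of the linear evaluation protocol (sketch).

    The encoder stays frozen; only the linear classifier receives gradients.
    """
    encoder.eval()
    with torch.no_grad():
        feats = encoder(images)                 # frozen 2048-d representations h
    logits = classifier(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# classifier and optimizer as described above (batch size 256 is an example value)
classifier = nn.Linear(2048, 1000)
optimizer = torch.optim.SGD(classifier.parameters(),
                            lr=0.1 * 256 / 256, momentum=0.9, weight_decay=0.0)
```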
Effect of weak supervision. We choose SimCLR as our baseline and compare it with our method for batch sizes of 64, 128 and 256 and 100, 200, 300 and 400 epochs. Note that in these experiments we do not use any multi-crop strategy; only an additional L_swap is applied on top of SimCLR. Table 1 shows the results. Our proposed method substantially outperforms the baseline across all settings: for CIFAR-10 the improvement ranges from 0.98% to 2.91% depending on the setting, and for CIFAR-100 it ranges from 0.73% to 1.80%.

Table 2. Effectiveness of the two-head framework (ImageNet-100). Each row marks which heads and loss terms are enabled.

g   ϕ   L_NCE   L_swap   L_cNCE   L_cswap   Top-1
✓        ✓                                   75.79
    ✓            ✓                           71.33
✓        ✓       ✓                           75.26
✓   ✓    ✓       ✓                           77.51
✓   ✓    ✓       ✓        ✓                  79.06
✓   ✓    ✓       ✓                 ✓         79.08
✓   ✓    ✓       ✓        ✓        ✓         79.77

Effect of two-head framework. We also perform an extensive ablation study to examine the effectiveness of our two-head framework. The experiments are mainly performed on the ImageNet-100 dataset, and the results are shown in Table 2. Note that L_cNCE and L_cswap in this experiment are based on the standard multi-crop strategy (without KNN). The first row is the SimCLR baseline. The second row is the case where only L_swap is applied; the model can still learn a meaningful representation but results in worse accuracy than the baseline. We also try to apply both L_NCE and L_swap on the same head; the third row shows a 0.53% performance drop. We suspect this is because of the conflict between instance discrimination and similar sample attraction. The fourth row shows our proposed method, which separates the two tasks onto different heads; in this case, we get a 1.72% improvement over the baseline, which verifies our hypothesis. The last three rows show the results with the multi-crop strategy, with which the performance can be further improved by 2.26%.

Effect of K-NN Multi-Crops. As we have mentioned, the K-NN result is unreliable in the early training, and we need to use the standard multi-crop strategy to warm up the model for a certain number of epochs. Table 3 shows the results for different numbers of warm-up epochs. We can clearly see that with 50 epochs of warm-up, our K-NN multi-crop strategy brings a 1% improvement over the standard multi-crops (see the last row in Table 2). Finally, our proposed method achieves 80.78% Top-1 accuracy on linear evaluation, a 5% improvement over the SimCLR baseline (75.79%).

Table 3. Warm-up epochs for K-NN multi-crops (K = 4).

Warm-up epochs   0       25      50      75      100
Accuracy         79.73   80.25   80.78   80.63   80.23

Visualization. Figure 4 shows the t-SNE visualization [38] of h for 10 randomly selected classes. Compared to SimCLR, our weakly supervised contrastive learning framework produces much better intra-class compactness and inter-class discrepancy.

Figure 4. t-SNE visualization for SimCLR (left) and our method (right).

4.2. Comparison on ImageNet-1K Dataset

We also evaluate our algorithm on the large-scale ImageNet-1K dataset [11]. The training strategy is the same as our ImageNet-100 training, except that we adopt a larger batch size (4096) and use a 3-layer MLP for the two projection heads. For the K-NN multi-crops, we simply take the best strategy from Table 3, which means we use the standard multi-crop strategy for the first 25% of the epochs and then switch to our K-NN version.

Comparison to FNCancel [23]. Table 4 shows the comparison between our proposed method, FNCancel and SimCLR. Note that, for a fair comparison, all models are trained with a 3-layer MLP projection head. As we can see, with a negligible additional computational cost (0.01), our proposed method surpasses the SimCLR baseline by 1.7% and matches the result of FNCancel. FNCancel does not report its training time, but since it requires 8 additional forward passes to generate the support-view embeddings, its actual computational cost will be much higher than ours. We also compare the results with the multi-crop strategy; in this case, we use 2 160x160 images as our main views and 6 additional 96x96 K-NN crops. Looking at the last row, our proposed method achieves 71.0 Top-1 accuracy with only 31% additional cost over SimCLR. This is twice as fast as FNCancel, with a 0.6% improvement on linear evaluation.

Table 4. Comparison to FNCancel on ImageNet-1K.

Method                     Epochs   GPU (time)   Acc
SimCLR                     100      1.00         66.4
FNCancel                   100      -            68.1
WCL (Ours)                 100      1.01         68.1
SimCLR                     1000     10.00        70.3
FNCancel + multi-crops     100      2.85         70.4
WCL (Ours) + multi-crops   100      1.31         71.0

Table 5. Top-1 accuracy under the linear evaluation on ImageNet with the ResNet-50 backbone. The table compares methods over 200 epochs of pretraining. * denotes the multi-crop strategy.

Method          Arch   Param   Epochs   Top-1
Supervised      R50    24      -        76.5
InstDisc [41]   R50    24      200      58.5
LocalAgg [51]   R50    24      200      58.8
SimCLR [6]      R50    24      200      66.8
MoCo [19]       R50    24      200      60.8
MoCo v2 [8]     R50    24      200      67.5
MoCHi [24]      R50    24      200      68.0
CPC v2 [27]     R50    24      200      63.8
PCL v2 [29]     R50    24      200      67.6
SimSiam [9]     R50    24      200      70.0
SwAV [5]        R50    24      200      69.1
SwAV* [5]       R50    24      200      72.7
WCL (Ours)      R50    24      200      70.3
WCL* (Ours)     R50    24      200      73.3

Table 6. Top-1 accuracy under the linear evaluation on ImageNet. The table compares methods with more epochs of pretraining. * denotes the multi-crop strategy.

Method           Arch   Param   Epochs   Top-1
Supervised       R50    24      -        76.5
SeLa [43]        R50    24      400      61.5
SimCLR [6]       R50    24      800      69.1
SimCLR v2 [7]    R50    24      800      71.7
MoCo v2 [8]      R50    24      800      71.1
SimSiam [9]      R50    24      800      71.3
SwAV [5]         R50    24      800      71.8
BYOL [18]        R50    24      1000     74.3
FNCancel* [23]   R50    24      1000     74.4
AdpCLR [49]      R50    24      1100     72.3
WCL (Ours)       R50    24      800      72.2
WCL* (Ours)      R50    24      800      74.7
Others
SwAV* [5]        R50    24      800      75.3

Linear Evaluation. For the linear evaluation on ImageNet-1K, we strictly follow the setting in SimCLR [6]. Tables 5 and 6 show our results for 200 and 800 epochs of training. We also report results with 2 224x224 views and 6 additional 96x96 K-NN crops (as in SwAV [5]). When the model is optimized for 200 epochs, our proposed method achieves state-of-the-art performance among all the recent self-supervised learning frameworks. When the model is trained for 800 epochs, our model still outperforms most recent works, but is slightly below SwAV.
Table 7. Low-shot image classification on VOC07.

Method        Epochs   k=1     k=2     k=4     k=8     k=16    k=32    k=64    Full
Random        -        8.92    9.33    10.10   10.42   10.82   11.34   11.96   12.42
Supervised    90       54.46   68.15   73.79   79.51   82.26   84.00   85.13   87.27
MoCo v2 [8]   200      46.30   58.40   64.85   72.47   76.14   79.16   81.52   84.60
PCL v2 [29]   200      47.88   59.59   66.21   74.45   78.34   80.72   82.67   85.43
SwAV [5]      200      43.07   55.65   64.82   73.17   78.38   81.86   84.40   87.47
WCL (Ours)    200      48.06   60.12   68.52   76.16   80.24   82.97   85.01   87.75
SwAV [5]      400      42.14   55.34   64.31   73.08   78.47   82.09   84.62   87.78
SwAV [5]      800      42.85   54.90   64.03   72.94   78.65   82.32   84.90   88.13
WCL (Ours)    800      48.25   60.68   68.52   76.48   81.05   83.89   85.88   88.64
Table 8. ImageNet semi-supervised evaluation.

                                       1% labels          10% labels
Method                                 Top-1    Top-5     Top-1    Top-5
Supervised                             25.4     56.4      48.4     80.4
Semi-supervised
  S4L [47]                             -        53.4      -        83.8
  UDA [42]                             -        68.8      -        88.5
  FixMatch [34]                        -        -         71.46    89.1
Self-supervised, from AvgPool
  InstDisc [41]                        -        39.2      -        77.4
  PCL [29]                             -        75.6      -        86.2
  PIRL [32]                            30.7     60.4      57.2     83.8
  SimCLR v1 [6]                        48.3     75.5      65.6     87.8
  BYOL [18]                            53.2     78.4      68.8     89.0
  SwAV [5]                             53.9     78.5      70.2     89.9
  WCL (Ours)                           58.3     79.9      71.1     90.3
Self-supervised, from Projection Head
  SimCLR v2 (R50) [7]                  57.9     -         68.4     -
  SimCLR v2 (R101) [7]                 62.1     -         71.4     -
  FNCancel [23]                        63.7     85.3      71.1     90.2
  WCL (Ours)                           65.0     86.3      72.0     91.2

Semi-Supervised Learning. Next, we evaluate the performance obtained when fine-tuning the model representation using a small subset of labeled data. For a fair comparison, we take the same labeled lists as SimCLR [6]. Specifically, we report our results under two different settings. First, we follow the strategy in PCL [29] and fine-tune from the average pooling layer of the ResNet50 [20] network. In this setting, our model outperforms the previous state-of-the-art (SwAV) by 4.4% with 1% labels and 0.9% with 10% labels. Second, we follow the strategy in SimCLR v2 [7] and fine-tune from the first layer of the projection head. In this case, our method improves over FNCancel by 1.3% and 0.9% with 1% and 10% labels, respectively. Notably, this result is even higher than SimCLR v2 with a ResNet101 backbone.

Transfer Learning. Finally, we further evaluate the quality of the learned representations by transferring them to other datasets. Following [29, 5], we perform linear classification on the PASCAL VOC2007 dataset [15]. Specifically, we resize all images to 256 pixels along the shorter side and take a 224x224 center crop. Then, we train a linear SVM on top of the corresponding global average pooled final representations. To study the transferability of the representations in few-shot scenarios, we vary the number of labeled examples k and report the mAP. Table 7 shows the comparison between our method and previous works. We report the average performance over 5 runs (except for k=full). The results of our method and SwAV are both based on the multi-crop version. With 200 epochs of pretraining, our method and SwAV can already outperform supervised pretraining on the full dataset. Interestingly, our method is significantly better than all the other works, especially when k is small. With more pretraining epochs, our method can even surpass supervised pretraining at k = 64, and it consistently outperforms SwAV across all values of k.
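For reference, the low-shot SVM evaluation can be sketched with scikit-learn as below. The per-class sampling of k positives and k negatives and the SVM cost C are our assumptions about the protocol of [29, 5] rather than details specified here, so the sketch is illustrative only.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def voc07_low_shot_map(train_feats, train_labels, test_feats, test_labels, k, seed=0):
    """Low-shot linear SVM evaluation on frozen features (illustrative sketch).

    train_labels / test_labels: (N, C) binary multi-label matrices for VOC07.
    For every class, k positive and k negative images are sampled (an assumed
    protocol detail), a linear SVM is fit, and the per-class AP is averaged.
    """
    rng = np.random.RandomState(seed)
    aps = []
    for c in range(train_labels.shape[1]):
        pos = np.flatnonzero(train_labels[:, c] == 1)
        neg = np.flatnonzero(train_labels[:, c] == 0)
        idx = np.concatenate([rng.choice(pos, k, replace=False),
                              rng.choice(neg, k, replace=False)])
        svm = LinearSVC(C=1.0).fit(train_feats[idx], train_labels[idx, c])
        scores = svm.decision_function(test_feats)
        aps.append(average_precision_score(test_labels[:, c], scores))
    return float(np.mean(aps))
```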
5. Conclusion

In this work, we proposed a weakly supervised contrastive learning framework that consists of two projection heads: one focuses on the instance discrimination task, while the other adopts the Connected Components Labeling process to generate a weak label and then performs the supervised contrastive learning task by swapping the weak labels between different augmentations. Finally, we introduced a new K-NN based multi-crop strategy that provides much more effective information and expands the number of positive samples by a factor of K. Experiments on CIFAR-10, CIFAR-100 and ImageNet-100 show the effectiveness of each component. The results on semi-supervised learning and transfer learning demonstrate state-of-the-art performance for unsupervised representation learning.

Acknowledgment

This work is funded by the National Key Research and Development Program of China (No. 2018AAA0100701) and the NSFC 61876095. Chang Xu was supported in part by the Australian Research Council under Projects DE180101438 and DP210101859. Shan You is supported by the Beijing Postdoctoral Research Foundation.
References

[1] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv:1902.09229, 2019.
[2] P. Baldi. Autoencoders, unsupervised learning and deep architectures. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop (UTLW), 2011.
[3] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096, 2019.
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[5] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020.
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020.
[7] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big self-supervised models are strong semi-supervised learners. arXiv:2006.10029, 2020.
[8] X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
[9] X. Chen and K. He. Exploring simple siamese representation learning. 2020.
[10] C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka. Debiased contrastive learning. In NeurIPS, 2020.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[12] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[13] J. Donahue and K. Simonyan. Large scale adversarial representation learning. In NeurIPS, 2019.
[14] O. ElHarrouss, N. Almaadeed, S. Al-Maadeed, and Y. Akbari. Image inpainting: A review. Neural Processing Letters, 51:2007-2028, 2019.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[16] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv:1803.07728, 2018.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[18] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. D. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised learning. 2020.
[19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722, 2019.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
[21] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv:1808.06670, 2018.
[22] J. Hoshen and R. Kopelman. Percolation and cluster distribution. I. Cluster multiple labeling technique and critical concentration algorithm. Physical Review B, 14:3438-3445, 1976.
[23] T. Huynh, S. Kornblith, M. R. Walter, M. Maire, and M. Khademi. Boosting contrastive self-supervised learning with false negative cancellation. arXiv:2011.11765, 2020.
[24] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus. Hard negative mixing for contrastive learning. In NeurIPS, 2020.
[25] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. arXiv:2004.11362, 2020.
[26] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[27] C.-I. Lai. Contrastive predictive coding based feature for automatic speaker verification. arXiv:1904.01575, 2019.
[28] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[29] J. Li, P. Zhou, C. Xiong, and S. Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.
[30] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[31] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.
[32] I. Misra and L. van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
[33] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[34] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv:2001.07685, 2020.
[35] X. Su, S. You, F. Wang, C. Qian, C. Zhang, and C. Xu. BCNet: Searching for network width with bilaterally coupled network. In CVPR, 2021.
[36] X. Su, S. You, M. Zheng, F. Wang, C. Qian, C. Zhang, and C. Xu. K-shot NAS: Learnable weight-sharing for NAS with k-shot supernets. In ICML, 2021.
[37] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv:1906.05849, 2019.
[38] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. 2008.
[39] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261-272, 2020.
[40] T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv:2005.10242, 2020.
[41] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[42] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le. Unsupervised data augmentation for consistency training. In NeurIPS, 2020.
[43] Y. M. Asano, C. Rupprecht, and A. Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[44] S. You, T. Huang, M. Yang, F. Wang, C. Qian, and C. Zhang. GreedyNAS: Towards fast one-shot NAS with greedy supernet. In CVPR, 2020.
[45] S. You, C. Xu, C. Xu, and D. Tao. Learning from multiple teacher networks. In KDD, 2017.
[46] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. 2017.
[47] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.
[48] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
[49] S. Zhang, J. Yan, and X. Yang. Self-supervised representation learning via adaptive hard-positive mining. 2021.
[50] M. Zheng, S. You, F. Wang, C. Qian, C. Zhang, X. Wang, and C. Xu. ReSSL: Relational self-supervised learning with weak augmentation. arXiv:2107.09282, 2021.
[51] C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. 2019.