
Pattern Recognition 154 (2024) 110631


Contrastive visual clustering for improving instance-level contrastive learning as a plugin

Yue Liu a,b, Xiangzhen Zan a, Xianbin Li a, Wenbin Liu a, Gang Fang a,∗

a Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
b School of Information Engineering, Jiangxi College of Applied Technology, Ganzhou, China

ARTICLE INFO ABSTRACT

Keywords: Contrastive learning has achieved remarkable success in computer vision, however it is built on instance-
Self-supervised learning level discrimination which leaves the valuable intra-class correlation in dataset unexploited. Current semantic
Contrastive learning clustering methods are proven to be helpful but they would suffer from the error accumulated in the iteration
Deep clustering
process without ground-truth guidance. In an attempt to remedy the clustering error accumulation when
utilizing intra-class correlation for contrastive learning, we propose an online Contrastive Visual Clustering
(CVC) method with two actions: gathering instances with highly similar feature embeddings, and penalizing
instances being clustered with low confidence. CVC can integrate with not only contrastive learning but
also arbitrary self-supervised learning frameworks simply as a plugin. Under various experiment settings,
we show that CVC improves the linear classification performance by a large margin for models pre-trained
with self-supervised representation learning, in both image and video scenarios. The code is available at
https://github.com/yliu1229/CVC.

1. Introduction

With Self-Supervised Learning (SSL) making ever more contributions to the community, contrastive SSL, e.g. Contrastive Learning (CL), plays an important role. CL is one of the most frequently adopted SSL methods in a wide range of research areas, especially in computer vision [1–4]. Several remarkable works, e.g. MoCo [1] and SwAV [2], bring state-of-the-art performances that are even competitive with supervised learning methods in downstream tasks. Despite all its achievements, it should be emphasized that CL is built on instance-level discrimination [5], i.e. each sample in the dataset is treated as a unique class whose embedding should be different from the rest. This means samples cannot learn from intra-class positives other than their own variations (e.g., different views, augmentations), leaving the valuable intra-class correlation in the dataset unexploited. Moreover, as pointed out in [6], with positive samples being constructed by data augmentations, current contrastive self-supervised methods are better than supervised methods only at occlusion invariance. On the other hand, Supervised Contrastive Learning [7] proves that contrastive pre-training with ground-truth positives learns a representation space with better downstream performance, and the more positive samples used, the better the performance. Thus, how to utilize intra-class correlation to improve the representation space learned with contrastive learning frameworks is worth investigating.

In a self-supervised manner, one straightforward intuition for improving the representation space learned with instance-level CL is semantic clustering, i.e. clustering samples with semantically similar embeddings so that, ideally, samples of the same class are clustered together and different clusters are sparsely spread in the representation space. Works have been done to explore the effectiveness of utilizing clustering in unsupervised learning. DeepCluster [8] and PCL [9] iteratively perform K-means clustering and representation learning with the clustering assignment results. However, under the unsupervised condition the performance of K-means clustering merely depends on the quality of the feature encoder, and errors can accumulate in the iteration process [10,11]. SwAV [2] and CC [10] realize online clustering but do not further utilize the cluster assignment results to improve representation learning. Previous works handle clustering results cautiously, because cluster assignments without ground-truth guidance can be very tricky and direct enforcement of cluster assignments might backfire.

To remedy clustering error accumulation when exploiting the value of intra-class correlation for CL, this paper proposes an online Contrastive Visual Clustering (CVC) method that can be adopted in arbitrary self-supervised learning frameworks as a plugin.

∗ Corresponding author.
E-mail address: [email protected] (G. Fang).

https://doi.org/10.1016/j.patcog.2024.110631
Received 11 May 2023; Received in revised form 16 May 2024; Accepted 23 May 2024
Available online 25 May 2024
0031-3203/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Fig. 1. Observations from linear probing of pre-trained models released in [1,12], and illustrations of the two actions taken in Contrastive Visual Clustering (CVC). The representation space is presented as a unit hypersphere; circles on the hypersphere represent different categories, black dots represent correctly classified instances, and red dots represent wrongly classified ones. The closer an instance lies to the center of a circle, the more confident the linear probing is that it is correctly classified; e.g. a dot right in the center means linear probing is 100% sure that this instance is correctly classified. After effective self-supervised pre-training, the representation space presents linear separability as shown on the left side; instances are relatively sparsely spread by instance-discriminative pre-training. CVC performs Gathering and Penalizing in order to further increase the linear separability of the representation space. The wrongly classified red 'X' was preliminarily assigned to cluster 'Stopwatch' with low confidence; after Penalizing, 'X' is re-assigned to its correct cluster 'Digital Watch' and will be pulled to the correct cluster center by updating the Cluster Memory. With CVC, inter-class diversities are increased and intra-class variances are decreased. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Our idea comes from observations of linear probing on the pre-trained models released in [1,12], as shown on the left side of Fig. 1: after effective self-supervised pre-training, the representation space (presented as a unit hypersphere) exhibits linear separability, such that instances can be semantically clustered. Within each cluster (i.e. the circles on the unit hypersphere), instances closer to the center have higher confidence of being correctly classified, and indeed the vast majority (>90%) of these instances are correctly classified. On the other hand, there exist instances with low confidence located near the edge; it is noted that a large portion (>50%) of them are wrongly classified, and more importantly, these low-confidence wrongly-classified instances have a very high probability of finding their correct label within the top-5 classification results (please refer to [5] for details). So we take the following two actions accordingly in CVC, as shown in Fig. 1. Firstly, we gather only instances with highly similar feature embeddings, so that uncertain instances will not be included in the semantic clustering process; this largely reduces potential clustering error accumulation. In terms of linear separability, it is reasonable that we shrink each cluster to its key semantic embedding space with the Gathering action. Secondly, we penalize the model for instances being preliminarily assigned to clusters with low confidence, so that it has to learn to re-assign these instances to a cluster with higher confidence. Instances with low clustering confidence will not be considered in the gathering process; instead, the model will try to adjust itself such that these instances can be re-assigned with high confidence, which effectively remedies existing clustering errors. With Gathering and Penalizing in place, the error accumulation issue is largely mitigated.

To summarize, we make the following three contributions:

• We propose a contrastive clustering method, CVC, to remedy clustering error accumulation when utilizing intra-class correlation for improving instance-level contrastive learning, by two actions: Gathering instances with highly similar feature embeddings, and Penalizing instances being clustered with low confidence.
• We show that CVC is a generic method, which can be integrated with not only CL but also arbitrary SSL frameworks simply as a plugin. CVC is also adaptive to both image and video representation learning frameworks.
• We demonstrate that under various experiment settings, CVC improves the linear classification performance by a large margin for models pre-trained with self-supervised representation learning. Impressively, with CVC, Mugs [12] reaches 82.4% top-1 classification accuracy on ImageNet-1K.

2. Related works

2.1. Contrastive learning

CL is one of the most frequently adopted SSL methods in computer vision [13,14]. The main idea of CL is to pull together the representations of positive samples (i.e. samples of the same category) and to push away the representations of negative samples (i.e. samples of different categories) [13]. Without knowledge of the ground-truth classification, positive samples are normally constructed by various data augmentations of the same sample [15,16], while the remaining samples in the same training batch are considered negatives, making CL basically an instance discrimination task [5,17]. One major drawback of instance-level CL is that it leaves the valuable intra-class correlation in the dataset unexploited; to mitigate this issue and to obtain alignment and uniformity of the representation space [18], CL requires a huge amount of data, many training epochs, and especially a large batch size in the pretext training process [1], which in turn induces other issues such as exponentially increasing computation and false negatives [19].

Wu et al. bring up the concept of a memory bank [5] to store representations computed in previous batches, and He et al. improve the memory bank by updating representations with a momentum encoder [1]; both works aim to enlarge the volume of negative samples. Mugs [12] uses a memory buffer to store historical averaged token embeddings for top-k neighbor searching. However, none of these works consider the representations stored in the memory bank as potential positive samples, because it is very tricky to define positives without ground-truth guidance. Our work focuses on exploiting the valuable intra-class correlation to mitigate the issues of instance discrimination in CL; instead of simply expanding the number of negative samples, we utilize the memory bank technique to store constantly updated positives.


2.2. Deep clustering

Deep clustering methods iteratively perform clustering and representation learning with deep neural networks, and they have proven instructive in unsupervised representation learning [8,9,20]. The classic DeepCluster [8] utilizes the traditional K-means algorithm to guide unsupervised representation learning. Asano et al. solve the pseudo-label assignment problem as a variation of the optimal transport problem and enforce an equipartition constraint so that data samples are assigned uniformly [20]. Later works coordinate clustering and representation learning to a new height [21,22]; however, it is notable that most of these works operate in a two-step alternation fashion, in which representation learning and clustering are performed in turns. This alternation process not only limits the application to large-scale datasets, but also may induce error accumulation [10,11].

Recently, some online clustering methods surmount the two-step alternation [2,10]. SwAV [2] performs soft cluster assignment in an end-to-end fashion and uses the soft assignments to align the features from different transformations of the same image. CC [10] computes a cluster-level contrastive loss by grafting a cluster branch onto the conventional CL framework; interestingly, the cluster-level contrastive loss acts more like a regularization method. SACC [15] exploits strong and weak augmentations for contrastive clustering. These online methods often trade off clustering performance for the online property, which results in elusive cluster assignments, so none of these works further exploit the online cluster assignment result.

Of particular interest, the authors of RUC [11] and IDCEC [23] notice the error accumulation issue in deep clustering and propose to use reliable samples. In our work, we adopt the idea of co-training clustering and CL from [10,21] to implement an online clustering method, and we further take two actions to remedy clustering error accumulation.

3. Contrastive visual clustering

In this section, we first elaborate the details of our proposed method, Contrastive Visual Clustering (CVC), including the implementation of the two actions, i.e. gathering instances with highly similar feature embeddings and penalizing instances being clustered with low confidence. We then introduce how to integrate CVC with self-supervised learning, especially CL frameworks.

As shown in Fig. 2, the proposed CVC consists of two components: a Cluster Projector, which projects instance feature embeddings to cluster assignments, and a Cluster Memory, which stores and updates cluster feature representations.

3.1. Cluster projector

The objective of Cluster Projector is to project instance feature embeddings to cluster assignments according to their semantic feature embeddings, which resembles a linear projection function.

In contrastive learning, given a batch of data instances $X$, we get the feature embedding $Y$ through an encoder function: $Y = f(T(X))$, where $T$ represents a set of data augmentations. Let $Y \in \mathbb{R}^{N \times D}$, where $N$ is the batch size and $D$ is the dimension of the feature embedding. Embedding $Y$ is further processed by a multilayer perceptron (MLP): $Z = g(Y)$, where $Z \in \mathbb{R}^{N \times E}$ and $E$ is the dimension of the instance representations; finally, $Z$ is contrasted to compute the contrastive loss $\mathcal{L}_{CL}$. Meanwhile, a copy of $Y$ is sent to Cluster Projector. Cluster Projector consists of a linear transformation head $h(\cdot)$ and a Softmax function; it projects the instances' feature embeddings to a cluster assignment matrix $M$:

$$M = \mathrm{Softmax}(h(Y)) \tag{1}$$

where $M \in \mathbb{R}^{N \times C}$ and $C$ is the number of clusters. Generally, the copy of $Y$ can be obtained from either branch in the CL framework.

Cluster Projector is optimized to produce hard cluster assignments, i.e. instances are desired to be assigned to clusters with a confidence close to 1.0, which is proven to help align with the effect of linear classification. This hard assignment is achieved by optimizing a combination of the following two entropy functions:

$$H(M) = -\frac{1}{N}\sum_{i=1}^{N} M_i \log M_i, \qquad H(P) = -\sum_{j=1}^{C} P_j \log P_j$$

where $M_i$ stands for the cluster assignment of instance $x_i$, $P_j = \frac{1}{N}\sum_{i=1}^{N} M_{ij}$, and $M_{ij}$ is the probability of assigning instance $x_i$ to cluster $j$.

We want randomly sampled instances $X$ not only to be clustered with high confidence, but also to be assigned to each cluster with equal probability, i.e. enforcing an equipartition constraint [20]; thus we optimize the following loss function:

$$\mathcal{L}_{entropy} = H(M) - H(P) \tag{2}$$

The enforcement of Eq. (2) guarantees that instances are clustered uniformly with high confidence according to their feature embeddings; it also avoids the trivial solution of all instances being assigned to the same cluster [24].
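To make Eqs. (1)–(2) concrete, the following is a minimal PyTorch sketch of a cluster projector and the entropy loss. This is our illustration under stated assumptions, not the authors' released implementation; the module name ClusterProjector and the eps stabilizer are ours.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ClusterProjector(nn.Module):
        # Linear head h(.) followed by Softmax, mapping feature
        # embeddings Y (N x D) to an assignment matrix M (N x C), Eq. (1).
        def __init__(self, dim_d, num_clusters):
            super().__init__()
            self.head = nn.Linear(dim_d, num_clusters)

        def forward(self, y):
            return F.softmax(self.head(y), dim=1)

    def entropy_loss(m, eps=1e-8):
        # L_entropy = H(M) - H(P), Eq. (2): sharpen per-instance
        # assignments while keeping cluster usage uniform over the batch.
        h_m = -(m * (m + eps).log()).sum(dim=1).mean()   # H(M)
        p = m.mean(dim=0)                                # P_j = (1/N) sum_i M_ij
        h_p = -(p * (p + eps).log()).sum()               # H(P)
        return h_m - h_p

As a sanity check, a batch whose rows are one-hot and whose cluster usage is uniform attains the minimum of this loss, since H(M) = 0 and H(P) = log C; this is exactly the regime Eq. (2) encourages.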
3.2. Cluster memory

The objective of Cluster Memory is to store and update cluster feature representations, which can be interpreted as the representatives of the clusters. There are two phases in Cluster Memory management: the Initializing phase, where Cluster Memory is created and initialized, and the Updating phase, where cluster feature representations are constantly updated with an update ratio. Details of the two phases are as follows:

Initializing. Cluster Memory $Mem \in \mathbb{R}^{C \times E}$ is created as zeros. When CL is reaching convergence, instances' semantic features are well extracted; we then initialize the cluster feature representation $Mem_j$ simply as the first instance representation $z_i$ whose $M_{ij} \ge \theta$, where $\theta$ is a threshold close to 1.0. Recall that $M_{ij}$ is the probability of assigning instance $x_i$ to cluster $j$.

Updating. After Cluster Memory is initialized, CVC keeps updating it as CL proceeds. The purpose of the Updating phase is to retain and emphasize the shared semantic feature embedding among instances assigned to the same cluster, while discarding and diluting divergent feature embeddings. We define the Updating function as:

$$Mem_j \leftarrow (1 - \lambda)\, Mem_j + \lambda z_i, \quad i \in \{1, 2, \dots, N \mid M_{ij} \ge \theta\}$$

where $\lambda$ is an update ratio. However, we do not limit the update strategy here. When CVC updates a $Mem_j$ that is zero, it simply stores the value as in the Initializing phase.

The cluster feature representations maintained in Cluster Memory will afterwards be used to facilitate the two actions of CVC, i.e. gathering instances with highly similar feature embeddings and penalizing instances being clustered with low confidence. Intuitively, the number of clusters $C$ should be equal to the real number of data categories; however, we find that setting $C$ to the batch size performs well. Another point to mention is that a Cluster Memory of 1000 clusters with a feature dimension of 512 takes up less than 5 MB of memory, which is quite memory-efficient and can be stored in a checkpoint without heavy burden.
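A minimal PyTorch sketch of the memory update under our assumptions (the function name and the per-instance loop are ours; the paper explicitly leaves the exact update strategy open):

    import torch

    @torch.no_grad()
    def update_cluster_memory(mem, z, m, theta=0.9, lam=0.05):
        # mem: (C, E) cluster memory; z: (N, E) instance representations;
        # m: (N, C) assignment matrix from the Cluster Projector.
        # Only confident instances (max probability >= theta) update their
        # target cluster: Mem_j <- (1 - lam) * Mem_j + lam * z_i.
        conf, target = m.max(dim=1)
        for i in torch.nonzero(conf >= theta).flatten().tolist():
            j = int(target[i])
            if bool((mem[j] == 0).all()):   # uninitialized slot: store directly
                mem[j] = z[i]
            else:
                mem[j] = (1.0 - lam) * mem[j] + lam * z[i]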


Fig. 2. Overview of Contrastive Visual Clustering (CVC) integrated with a contrastive learning framework. CVC consists of two components: Cluster Projector and Cluster Memory. Cluster Projector receives instance feature embeddings from the Encoder and projects them to a cluster assignment matrix which matches each instance to an entry in Cluster Memory. CVC utilizes three losses: $\mathcal{L}_{CL}$, $\mathcal{L}_{entropy}$ and $\mathcal{L}_{cluster}$. The contrastive loss $\mathcal{L}_{CL}$ is computed as originally designed; $\mathcal{L}_{entropy}$ and $\mathcal{L}_{cluster}$ are computed as described in Sections 3.1 and 3.3, respectively. The contrastive learning framework is simplified for illustration purposes. ∼ denotes the Softmax operation.

3.3. Integrate with contrastive learning

As shown in Fig. 2, with Cluster Projector and Cluster Memory, CVC is co-trained along with the conventional CL framework. It requires no changes to the CL framework, which computes the contrastive loss $\mathcal{L}_{CL}$ as originally designed. Empirically, it is best practice to integrate CVC at the point when CL is close to convergence, where instances' semantic features are well extracted. As a bypass network, CVC gets a copy of $Y$ from the Encoder and passes it to Cluster Projector to compute the cluster assignment matrix $M$ for instances $X$, as described in Section 3.1. Associated with Cluster Memory, we perform the two actions as follows:

Gathering. Any instance assigned to a cluster with high confidence will be pulled to its cluster feature representation and pushed away from the rest of the clusters in Cluster Memory using InfoNCE [13]. Thus instances assigned to the same cluster will be gathered together, while different clusters will be pushed away from each other as they are constantly updated. We formally define the Gathering function as follows:

$$\ell_g = -\sum_i \log \frac{\exp(sim(z_i, Mem_t)/\tau_g)}{\sum_{j=1}^{C} \exp(sim(z_i, Mem_j)/\tau_g)}, \quad i \in \{1, 2, \dots, N \mid \exists M_{ij} \ge \theta,\ j \in [1, C]\} \tag{3}$$

where $Mem_t$ is the feature representation of the target cluster that instance $x_i$ is assigned to, $\tau_g$ is the temperature parameter controlling the softness, and $sim(\cdot)$ is the pair-wise similarity function.

Eq. (3) makes sure that instances assigned to the same cluster are centered on their cluster feature representation, which consequently shrinks each cluster to its key semantic feature space.
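Since Eq. (3) is an InfoNCE over the $C$ memory entries, it reduces to a cross-entropy whose logits are the similarities of each gathered instance to every cluster representative. A minimal sketch, assuming cosine similarity for $sim(\cdot)$ (the paper does not pin down the similarity function; Algorithm 1 below uses a plain dot product):

    import torch
    import torch.nn.functional as F

    def gathering_loss(z, mem, target, tau_g=0.1):
        # z: (K, E) representations of confidently assigned instances;
        # mem: (C, E) cluster memory; target: (K,) assigned cluster ids.
        logits = F.normalize(z, dim=1) @ F.normalize(mem, dim=1).t() / tau_g
        return F.cross_entropy(logits, target)   # Eq. (3) as a cross-entropy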
Penalizing. For any instance preliminarily assigned to a cluster with low confidence, the model will be penalized so that it has to learn to re-assign the instance to a cluster with higher confidence. It is noted that the top-5 classification accuracy is significantly higher than the top-1 accuracy on ImageNet, and the top-5 responding classes are very likely to be visually correlated [5], which implies that low-confidence wrongly-classified instances have a very high probability of finding their correct classes within the top-5 results. Inspired by SimCSE [25], applying different rates of Dropout and DropPath has the effect of random feature masking, and there is a good chance that after random feature masking the predominant semantic feature emerges while trivial features fade out. So we first apply a random perturbation to the Encoder $f(\cdot)$ by simply using a larger drop rate for the Dropout and DropPath operations; instances with low clustering confidence are then re-processed by the perturbed feature encoder $\tilde{f}(\cdot)$ and the same Cluster Projector to get new cluster assignments as in Eq. (1). If an instance is re-assigned with high confidence, it will then be pulled to its cluster. We formally define the Penalizing function as follows:

$$\ell_p = -\sum_i \log \frac{\exp(sim(z_i, Mem_t)/\tau_p)}{\sum_{j=1}^{C} \exp(sim(z_i, Mem_j)/\tau_p)}, \quad i \in \{1, 2, \dots, N \mid \forall M_{ij} < \sigma \ \text{and}\ \exists \tilde{M}_{ij} \ge \theta,\ j \in [1, C]\} \tag{4}$$

where $\tilde{M}_i$ is the cluster assignment vector for instance $x_i$ through the perturbed encoder $\tilde{f}(\cdot)$, i.e. $\tilde{M}_i = \mathrm{Softmax}(h(\tilde{f}(T(x_i))))$, $\tau_p$ is the temperature parameter, and $\sigma$ is the confidence threshold for Penalizing. Note that the perturbation operation is only intended to induce high-confidence cluster re-assignment, and does not interfere with the training of the Encoder $f(\cdot)$.

Penalizing still operates in an end-to-end fashion, and as the proportion of instances with low clustering confidence is relatively small, the extra cost is low. The Penalizing action is simple but effective: we do not prescribe the target clusters for re-assigning, but let instances choose their targets simply through semantic feature matching. It should be mentioned that instances are not guaranteed to be re-assigned with high confidence; such instances will simply be ignored.
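A minimal sketch of how the perturbed encoder $\tilde{f}(\cdot)$ could be obtained, assuming the stochastic layers are standard nn.Dropout modules (timm-style DropPath modules would be handled analogously); the function name and the drop-rate value are illustrative assumptions, not the authors' released code:

    import copy
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def perturbed_assignments(encoder, projector, x, drop_rate=0.3):
        # Copy f into f~ and raise the rate of every Dropout layer, then
        # recompute assignments M~ = Softmax(h(f~(T(x)))) for the
        # low-confidence inputs x (already augmented by T). Working on a
        # copy keeps the perturbation from interfering with training f.
        f_tilde = copy.deepcopy(encoder)
        f_tilde.train()                     # keep stochastic layers active
        for module in f_tilde.modules():
            if isinstance(module, nn.Dropout):
                module.p = drop_rate
        return projector(f_tilde(x))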
Put them together. With Gathering and Penalizing, we optimize CVC by minimizing the following:

$$\mathcal{L}_{cluster} = \ell_g + \ell_p \tag{5}$$

Please be aware that the Gathering and Penalizing functions are actually implemented using one InfoNCE loss; we separate them as $\ell_g$ in Eq. (3) and $\ell_p$ in Eq. (4) only for explanation purposes. Together with the original contrastive loss $\mathcal{L}_{CL}$, the overall optimization target is as follows:

$$\mathcal{L} = \mathcal{L}_{CL} + \mathcal{L}_{entropy} + \mathcal{L}_{cluster} \tag{6}$$

It should be stated that CVC can theoretically integrate with not only CL but also arbitrary self-supervised learning frameworks. This is done by simply replacing the contrastive loss $\mathcal{L}_{CL}$ with another self-supervised objective function in Eq. (6). However, it is noted that the performance of CVC with self-supervised frameworks other than CL is not consistent; we describe this further in Section 4.5.

Algorithm 1 provides the PyTorch-style pseudocode for computing the losses of CVC.

4. Experiments

Here we report the performance of SSL frameworks integrated with our proposed method CVC. We evaluate the pre-trained representation space on benchmark image and video classification tasks. As CVC is not a stand-alone architecture, we build simple baseline frameworks for self-supervised representation learning of images and videos, respectively, for performance analysis. We also integrate CVC with several representative SSL frameworks to show the performance improvements.


Algorithm 1: CVC PyTorch-style pseudocode

    # L_cl: the original contrastive loss
    # ClusterMem: the cluster memory
    # C, n: the number of clusters and the batch size
    # theta, sigma: high and low confidence thresholds

    for x in dataloader:
        y = f(T(x))                     # T is a set of augmentations
        m = cluster_proj(y)             # assignment matrix, n-by-C
        L_entropy = entropy_loss(m)
        g_idx, g_cluster_idx = index_to_gather(m)
        p_idx, p_cluster_idx = index_to_penalize(m, x)
        cluster_indices = stack(g_cluster_idx + p_cluster_idx)
        sim = matmul(y[g_idx + p_idx], ClusterMem.T)
        L_cluster = CrossEntropyLoss(sim, cluster_indices)
        loss = L_cl + L_entropy + L_cluster
        loss.backward()
        ClusterMem.update(y, m)         # update the cluster memory

    def entropy_loss(m):                # compute the loss of Eq. (2)
        h_entropy = (-m * log(m)).sum() / n
        p = m.sum(dim=0) / n
        p_entropy = (-p * log(p)).sum()
        return h_entropy - p_entropy

    def index_to_gather(m):             # get indices for Gathering
        idx, cluster_idx = [], []
        for i in range(n):
            if m[i].max() >= theta:
                idx.append(i)
                cluster_idx.append(m[i].argmax())
        return idx, cluster_idx

    def index_to_penalize(m, x):        # get indices for Penalizing
        idx, p_idx, p_cluster_idx = [], [], []
        for i in range(n):
            if m[i].max() < sigma:
                idx.append(i)
        m_tilde = cluster_proj(f_tilde(T(x[idx])))  # perturbed encoder f~
        for i in range(len(idx)):
            if m_tilde[i].max() >= theta:
                p_idx.append(idx[i])
                p_cluster_idx.append(m_tilde[i].argmax())
        return p_idx, p_cluster_idx

Table 1
Performance of the image representation learning baseline with and without CVC. ImageNet-1K is used for pre-training and evaluation. "#epoch" denotes the number of epochs of contrastive pre-training, and "+50" in the "#epoch" column means the model is co-trained with CVC for another 50 epochs. Linear probing ("Lin.") and k-NN top-1 accuracy (%) on ImageNet-1K are reported; improvements over the baseline at the same total epoch budget are given in parentheses.

    Method           Architecture   #epoch   Lin.          k-NN
    baseline         ViT-S/16       70       60.9          55.2
                                    100      62.7          57.8
                                    150      66.3          61.5
    baseline + CVC   ViT-S/16       20+50    61.3 (+0.4)   55.6 (+0.4)
                                    50+50    65.4 (+2.7)   61.1 (+3.3)
                                    100+50   69.5 (+3.2)   66.2 (+4.7)

Table 2
Comparison of linear probing results of representative self-supervised image representation learning frameworks with and without CVC. ImageNet-1K is used for pre-training and evaluation. Methods marked with † have results copied from the original paper. We re-evaluate the top-1 linear classification accuracy for the remaining methods, with or without CVC, under the same settings for fair comparison.

    Method      Architecture   #epoch     Accuracy
    SimCLR†     ResNet50       200        68.3
    SwAV†       ResNet50       200        69.1
    MoCo-v3†    ViT-S/16       300        73.2
    MAE†        ViT-B/16       1600       68.0
    MoCo-v3     ViT-S/16       150        70.3
      + CVC     ViT-S/16       100+50     73.1
    Mugs        ViT-S/16       150        73.1
      + CVC     ViT-S/16       100+50     76.4
    Mugs        ViT-L/16       300        81.1
      + CVC     ViT-L/16       250+50     82.4
    MAE         ViT-S/16       450        61.5
                               1650       66.7
      + CVC     ViT-S/16       400+50     60.9
                               1600+50    68.1
4.1. Contrastive image clustering

We first evaluate CVC with self-supervised image representation learning frameworks under two classification evaluation settings, i.e. linear probing and k-NN. Following Mugs [12], we test different numbers (10, 20, 50) of nearest neighbors for the k-NN evaluation. We construct a baseline model with ViT-S/16 and utilize variations of the ViT architecture (e.g. ViT-S and ViT-L) when integrating CVC with representative SSL frameworks, according to their released models. For both linear probing and k-NN, we report the top-1 accuracy.

Baseline. The baseline framework is built as a simplified version of SimCLR [14], and we replace the convolutional encoder with ViT [26] to be consistent with the other experiment settings. The baseline model is pre-trained from scratch on ImageNet-1K [27] without labels; the AdamW optimizer with a weight decay of 1e-5 and a step scheduler is used. We set the hyper-parameter $\theta$ to 0.9, $\sigma$ to 0.4 and $\lambda$ to 0.05 in experiments unless otherwise stated. We train the baseline for {20, 50, 100} epochs, and continue the training for another 50 epochs with and without CVC respectively.

The baseline performance is shown in Table 1. It is obvious that CVC improves the baseline when intervening at different epochs, and CVC brings significant improvements to the k-NN classification accuracy because instances are more tightly gathered through the Gathering operation. CVC is shown to perform better when integrated with a better pre-trained model: it improves accuracy by 3.2% with the model pre-trained for 100 epochs, compared with 2.7% with the model pre-trained for just 50 epochs. It is also noted that when initializing CVC with an early interrupted pre-training model, the improvement is marginal, as the model is not yet well trained for encoding semantic features.

Plug into representative frameworks. We integrate CVC with three representative SSL frameworks, i.e. MoCo-v3 [1], Mugs [12] and MAE [28], to show the performance improvement. We use the pre-trained models released in [1,12] to set up the training, and we use an unofficial pre-trained model of MAE, as [28] does not use ViT-S. MoCo-v3 and Mugs are pre-trained for 100 epochs, and for another 50 epochs with and without CVC under the same setting.


As argued in [28], Masked Image Modeling (MIM) methods need more training epochs as they only see a small portion of the data per epoch, so we use MAE models pre-trained for {400, 1600} epochs and then train them with and without CVC for another 50 epochs. Note that MoCo-v3 and Mugs are contrastive SSL methods, while MAE is an instance of MIM, i.e. a generative SSL method.

As shown in Table 2, CVC brings comprehensive classification performance improvements for all SSL methods: 2.8% for MoCo-v3, 3.3% and 1.3% for Mugs with ViT-S and ViT-L respectively, and 1.4% for MAE. Impressively, the top-1 accuracy of 82.4% achieved by Mugs with CVC is state-of-the-art for self-supervised image classification on ImageNet-1K. It is also noticeable that with CVC, all three SSL methods achieve classification performance on par with their independent frameworks of larger architecture and more training epochs. When integrated with CL frameworks, CVC provides more obvious and consistent performance improvements. On the other hand, MAE suffers a minor performance drop when integrated with CVC at an early epoch. We discuss this discrepancy further in Section 4.5. Nevertheless, CVC shows great potential in assisting SSL methods in learning a more linearly separable representation space.

Table 3
Performance of the video representation learning baseline with and without CVC. "#epoch" denotes the number of epochs of contrastive pre-training, and "+50" in the "#epoch" column means the model is co-trained with CVC for another 50 epochs. Linear probing top-1 accuracy ("Accuracy", %) is reported; improvements over the baseline at the same total epoch budget are given in parentheses.

    Method           Dataset       #epoch   Accuracy
    baseline         UCF101        70       41.6
                                   100      52.3
                     Kinetics400   100      49.2
    baseline + CVC   UCF101        20+50    45.8 (+4.2)
                                   50+50    59.9 (+7.6)
                     Kinetics400   50+50    54.3 (+5.1)

Table 4
Comparison of linear probing results of representative contrastive video representation learning frameworks with and without CVC. Kinetics400 is used for pre-training and evaluation. The method marked with † has its result copied from the original paper. We re-evaluate the top-1 linear classification accuracy for the remaining methods, with or without CVC, under the same settings for fair comparison.

    Method     Architecture     #epoch   Accuracy
    VCLR†      ResNet50/30      400      64.1
    baseline   Uniformer-S/8    100      49.2
      + CVC    Uniformer-S/8    50+50    54.3
    DPC        3D-ResNet34/16   200      59.6
      + CVC    3D-ResNet34/16   150+50   65.7

4.2. Contrastive video clustering

Following the experiments conducted for contrastive image clustering in Section 4.1, we further perform contrastive video clustering to exhibit the potential of CVC as a generic contrastive clustering method. We evaluate CVC with self-supervised video representation learning frameworks on action classification tasks on UCF101 [29] and Kinetics400 [30]. Unlike most self-supervised video representation learning methods, which mainly report fine-tuning classification accuracies, we focus on the classification performance with a frozen encoder, as linear probing reveals the linear separability of the pre-trained representation space more directly. We construct baseline models with Uniformer-S/8 for both UCF101 and Kinetics400, and utilize 3D-ResNet34/16 when integrating CVC with the representative SSL framework. For all action classification evaluations, we report top-1 accuracies.

Baseline. The baseline is constructed as a simple contrastive video representation learning framework, only for performance analysis purposes: two different sets of 8 consecutive frames (downsampled by 3) are extracted per video and processed by the same feature encoder. The feature embeddings of the two sets are projected by an MLP before being contrasted to compute the contrastive InfoNCE loss [13]. We adopt Uniformer-S [31] as the video feature encoder. The baseline model is pre-trained from scratch on UCF101 and Kinetics400, respectively, without labels; the AdamW optimizer with a weight decay of 1e-5 and a step scheduler is used. We train the baseline for {20, 50} epochs on UCF101 and 50 epochs on Kinetics400, and continue the training for another 50 epochs with and without CVC respectively. The results of linear probing are shown in Table 3. Impressively, CVC improves the baseline, intervening at different epochs and on different datasets, by a large margin. It is confirmed again here that CVC performs better when integrated with a better pre-trained model.
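For concreteness, the two-clip contrastive objective of this baseline can be sketched as follows; this is our illustration, assuming cosine similarity and in-batch negatives, not the exact released code:

    import torch
    import torch.nn.functional as F

    def clip_infonce(z1, z2, tau=0.1):
        # z1, z2: (N, E) projected embeddings of two clips drawn from the
        # same N videos; matching rows are positives, every other pair in
        # the batch serves as a negative.
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / tau                    # (N, N) similarities
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)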
Plug into contrastive frameworks. We integrate CVC with contrastive frameworks that use spatial–temporal feature encoders. We use the pre-trained model released with DPC [32] to set up the training, and we only report top-1 action classification accuracies by linear probing. As shown in Table 4, CVC brings remarkable linear probing accuracy improvements; DPC integrated with CVC outperforms VCLR in both accuracy and efficiency. We do not test CVC with some self-supervised video learning frameworks, e.g. video-version MAE [33] and VCLR [34], as the feature encoders used in these models are ViT or ResNet, which are no different from those used in image representation learning frameworks. Nevertheless, CVC shows promising performance across various architectures and various pre-training recipes.

Table 5
Ablation on action effectiveness using the baseline in Section 4.1. Improvements over the baseline are given in parentheses.

    Method           Action       #epoch   Accuracy
    baseline         –            150      66.3
    baseline + CVC   Gathering    100+50   68.9 (+2.6)
                     Penalizing   100+50   67.1 (+0.8)
                     All          100+50   69.5

Table 6
Quantitative analysis of the instances affected by Gathering and Penalizing.

    Method           Action       Instances affected (%)
    baseline + CVC   Gathering    40 ± 8
                     Penalizing   4 ± 2

Table 7
Ablation on the number of clusters using the baseline in Section 4.2.

    Method           Dataset       #C    Accuracy
    baseline + CVC   UCF101        101   61.2
                                   512   59.9
                     Kinetics400   400   54.5
                                   512   54.3

4.3. Ablations

Here we ablate the two proposed actions and the hyper-parameters of CVC using the baselines, and report the linear probing accuracy.

The effect of actions. CVC implements two actions, i.e. Gathering and Penalizing, to improve instance-level contrastive visual representation learning. In Table 5, we report the performance improvement brought by each action, using the image representation learning baseline. In general, Gathering contributes more to the performance improvement. We also conduct a quantitative analysis of the instances affected by Gathering and Penalizing, respectively; results are shown in Table 6.


Fig. 3. t-SNE visualization of the representation space of instances from 10 randomly-selected classes of ImageNet. In subfigures (a)–(d), the model is pre-trained for 150 epochs with or without CVC as indicated; in subfigure (e), the model is pre-trained for 300 epochs with CVC.

We found that about 40% of instances are clustered with high confidence and, in contrast, about 20% of instances are clustered with low confidence. Furthermore, in our experiments, 10%–30% of the low-clustering-confidence instances would be re-assigned confidently by Penalizing. Thus, Gathering would affect around 40% of the instances in a batch, while Penalizing would affect 6% of instances (20% × 30%) at most. Please note that the instances affected by Penalizing are hard positives, which have a greater influence in training. This explains the different contributions of the two actions. Although Penalizing contributes less, we argue it is an innovative and effective attempt to remedy wrongly clustered instances, which is rarely addressed in other works.

The number of clusters. The number of clusters, i.e. $C$, is an important hyper-parameter in Cluster Projector and Cluster Memory. Intuitively, $C$ should be equal to the real number of data categories. As shown in Table 7, not surprisingly, setting $C$ to the number of data categories gives better performance, but when setting $C$ to the batch size we still get good results. So, as stated in Section 3.2, we set $C$ to the batch size (1024 for ImageNet-1K and 512 for the video datasets) in our experiments. We hypothesize that setting $C$ near to or larger than the real number of categories gives promising performance, as over-clustering is better than under-clustering in terms of the linear separability of the representation space.

4.4. Visualization

In Fig. 3, we visualize the representation space of instances from 10 randomly-selected classes of ImageNet using t-SNE [35] under different training settings. In Fig. 3(a)–(d), the model is pre-trained for 150 epochs with or without CVC as indicated, while in Fig. 3(e), Mugs [12] is pre-trained for 300 epochs with CVC. It is clear that CVC brings more linear separability to the representation space, and with the two actions Gathering and Penalizing, Fig. 3(e) shows that instances of the same class are tightly clustered while different classes are well separated, i.e. intra-class variances are decreased and inter-class diversities are increased.

4.5. Discussion

Here we discuss observations noted during the experiments and try to give some insights into the behavior of CVC.

CVC with CL or MIM. As noted in Table 2, CVC performs more consistently and impressively with CL frameworks than with MIM frameworks. It should be noted that MIM methods struggle in linear probing evaluation [28,36]; Chen et al. show that MIM methods tend to consider all features in an image, while CL methods are better at capturing discriminative semantic features [37]. Similarly, we find that CVC receives a considerable portion of instances that are clustered with low confidence when integrated with MIM frameworks. The lack of discriminative semantic features makes MIM methods less cooperative in linear classification and clustering; thus we encourage the integration of CVC with CL frameworks.

CVC for image and video. It is impressive that CVC brings significant classification accuracy improvements for models pre-trained with contrastive video representation learning frameworks, as in Tables 3 and 4, even with the simple baseline. Please note that unlike video-version MAE [33], which uses an image feature encoder (ViT), the baseline and DPC [32] use spatial–temporal feature encoders that focus on the action itself rather than on frame features, which may vary dramatically even between videos of the same action category. Besides, videos are more complex than images, resulting in more noise in video feature embeddings. So, with the Gathering action, CVC successfully filters out noise and keeps video representations tightly close to their action features, which eventually brings more linear separability to the pre-trained video representation space.

Training efficiency of CVC. It is known that CL methods normally require a large batch size [1,14]; however, as CVC intervenes and operates only after CL is reaching convergence, a large batch size is optional. The two actions of CVC, i.e. Gathering and Penalizing, can be regarded as a fine-tuning method for the model; no major adjustment will be made. In our experiments, we decrease the learning rate when CVC intervenes, and use batch sizes of 1024 and 512 for Contrastive Image Clustering and Contrastive Video Clustering, respectively. Empirically, CVC takes only 20 epochs to produce demonstrable improvements. However, CVC still benefits from a large batch size, as it speeds up the fine-tuning process.

5. Conclusion

In this paper, we propose an online Contrastive Visual Clustering (CVC) method which utilizes the intra-class correlation in a dataset while avoiding clustering error accumulation, in order to improve instance-level contrastive visual representation learning. CVC implements two actions to increase the linear separability of the representation space: Gathering instances with highly similar feature embeddings, and Penalizing instances being clustered with low confidence. CVC can integrate with arbitrary self-supervised learning frameworks simply as a plugin, and extensive experiments show that CVC improves the linear classification performance by a large margin for models pre-trained with self-supervised representation learning, in both image and video scenarios.

The proposed CVC also has a few limitations. CVC only expects to be integrated with well pre-trained models: when initializing CVC with early interrupted pre-training models, the improvement is marginal. Another limitation is that there are multiple hyper-parameters in CVC, e.g. the number of clusters, the low and high confidence thresholds, the drop rate, etc., which need careful tuning. In addition, CVC utilizes a multi-objective optimization target, for which it is not easy to find a good solution without non-trivial effort. Although a simple implementation achieves impressive performance improvement, in future work we would like to try alternative implementations, e.g. finding a Pareto optimal solution, to achieve more performance improvement with less training cost. We would also like to explore different construction methods of Cluster Memory for maintaining cluster feature representations, such as keeping multiple instance representations per cluster, potentially preserving not only the discriminative cluster features but also the diversity of intra-cluster features. We hope our work will benefit the self-supervised learning community in terms of the generality of CVC; it is worth trying to integrate CVC with any self-supervised learning framework for further performance improvements.


CRediT authorship contribution statement

Yue Liu: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Validation, Writing – review & editing. Xiangzhen Zan: Investigation, Resources, Validation. Xianbin Li: Resources, Visualization. Wenbin Liu: Project administration, Resources, Supervision, Writing – review & editing. Gang Fang: Conceptualization, Formal analysis, Funding acquisition, Investigation, Supervision, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Data availability

I have shared a link to my code.

Acknowledgment

The study is supported by the National Natural Science Foundation of China with grant number 61972107.

References

[1] X. Chen, S. Xie, K. He, An empirical study of training self-supervised vision transformers, in: IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, October 10–17, 2021, IEEE, 2021, pp. 9620–9629.
[2] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, 2020.
[3] J. Yin, J. Xie, Z. Ma, J. Guo, MPCCL: Multiview predictive coding with contrastive learning for person re-identification, Pattern Recognit. 129 (2022) 108710.
[4] K. Yuan, G. Schaefer, Y.-K. Lai, Y. Wang, X. Liu, L. Guan, H. Fang, A multi-strategy contrastive learning framework for weakly supervised semantic segmentation, Pattern Recognit. 137 (2023) 109298.
[5] Z. Wu, Y. Xiong, S.X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 3733–3742.
[6] S. Purushwalkam, A. Gupta, Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, 2020.
[7] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised contrastive learning, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, 2020.
[8] M. Caron, P. Bojanowski, A. Joulin, M. Douze, Deep clustering for unsupervised learning of visual features, in: 15th European Conference on Computer Vision, Munich, Germany, September 8–14, 2018, Proceedings, Part XIV, in: Lecture Notes in Computer Science, vol. 11218, Springer, 2018, pp. 139–156.
[9] J. Li, P. Zhou, C. Xiong, S.C.H. Hoi, Prototypical contrastive learning of unsupervised representations, in: 9th International Conference on Learning Representations, Austria, May 3–7, 2021, 2021.
[10] Y. Li, P. Hu, J.Z. Liu, D. Peng, J.T. Zhou, X. Peng, Contrastive clustering, in: 35th AAAI Conference on Artificial Intelligence, February 2–9, 2021, AAAI Press, 2021, pp. 8547–8555.
[11] S. Park, S. Han, S. Kim, D. Kim, S. Park, S. Hong, M. Cha, Improving unsupervised image clustering with robust learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19–25, 2021, Computer Vision Foundation / IEEE, 2021, pp. 12278–12287.
[12] P. Zhou, Y. Zhou, C. Si, W. Yu, T.K. Ng, S. Yan, Mugs: A multi-granular self-supervised learning framework, 2023, CoRR abs/2203.14415, arXiv:2203.14415.
[13] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, 2018, CoRR abs/1807.03748, arXiv:1807.03748.
[14] T. Chen, S. Kornblith, M. Norouzi, G.E. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, ICML, 13–18 July 2020, in: Proceedings of Machine Learning Research, vol. 119, PMLR, 2020, pp. 1597–1607.
[15] X. Deng, D. Huang, D.-H. Chen, C.-D. Wang, J.-H. Lai, Strongly augmented contrastive clustering, Pattern Recognit. 139 (2023) 109470.
[16] Y. Tian, D. Krishnan, P. Isola, Contrastive multiview coding, in: 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, in: Lecture Notes in Computer Science, vol. 12356, Springer, 2020, pp. 776–794.
[17] A. Dosovitskiy, P. Fischer, J.T. Springenberg, M.A. Riedmiller, T. Brox, Discriminative unsupervised feature learning with exemplar convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (9) (2016) 1734–1747.
[18] T. Wang, P. Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, in: 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, in: Proceedings of Machine Learning Research, vol. 119, PMLR, 2020, pp. 9929–9939.
[19] T. Chen, W. Hung, H. Tseng, S. Chien, M. Yang, Incremental false negative detection for contrastive learning, in: The Tenth International Conference on Learning Representations, Virtual Event, April 25–29, 2022, 2022.
[20] Y.M. Asano, C. Rupprecht, A. Vedaldi, Self-labelling via simultaneous clustering and representation learning, in: 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, April 26–30, 2020, 2020.
[21] X. Wang, Z. Liu, S.X. Yu, Unsupervised feature learning by cross-level instance-group discrimination, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, June 19–25, 2021, Computer Vision Foundation / IEEE, 2021, pp. 12586–12595.
[22] H. Wang, N. Lu, H. Luo, Q. Liu, Self-supervised clustering with assistance from off-the-shelf classifier, Pattern Recognit. 138 (2023) 109350.
[23] H. Lu, C. Chen, H. Wei, Z. Ma, K. Jiang, Y. Wang, Improved deep convolutional embedded clustering with re-selectable sample training, Pattern Recognit. 127 (2022) 108611.
[24] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, M. Sugiyama, Learning discrete representations via information maximizing self-augmented training, in: 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, in: Proceedings of Machine Learning Research, vol. 70, PMLR, 2017, pp. 1558–1567.
[25] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Dominican Republic, 7–11 November, 2021, Association for Computational Linguistics, 2021, pp. 6894–6910.
[26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations, Virtual Event, Austria, May 3–7, 2021, 2021.
[27] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 20–25 June 2009, Miami, Florida, USA, IEEE Computer Society, pp. 248–255.
[28] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R.B. Girshick, Masked autoencoders are scalable vision learners, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, IEEE, 2022, pp. 15979–15988.
[29] K. Soomro, A.R. Zamir, M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012, CoRR abs/1212.0402, arXiv:1212.0402.
[30] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, July 21–26, 2017, IEEE Computer Society, 2017, pp. 4724–4733.
[31] K. Li, Y. Wang, P. Gao, G. Song, Y. Liu, H. Li, Y. Qiao, UniFormer: Unified transformer for efficient spatial-temporal representation learning, in: The Tenth International Conference on Learning Representations, Virtual Event, April 25–29, 2022, 2022.
[32] T. Han, W. Xie, A. Zisserman, Video representation learning by dense predictive coding, in: IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27–28, 2019, IEEE, 2019, pp. 1483–1492.
[33] Z. Tong, Y. Song, J. Wang, L. Wang, VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, in: Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022, 2022.
[34] H. Kuang, Y. Zhu, Z. Zhang, X. Li, J. Tighe, S. Schwertfeger, C. Stachniss, M. Li, Video contrastive learning with global context, in: IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, October 11–17, 2021, IEEE, 2021, p. 3188.
[35] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (86) (2008) 2579–2605.


[36] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, H. Hu, SimMIM: A simple framework for masked image modeling, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, IEEE, 2022, pp. 9643–9653.
[37] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo, G. Zeng, J. Wang, Context autoencoder for self-supervised representation learning, Int. J. Comput. Vis. 132 (1) (2024) 208–223, https://doi.org/10.1007/S11263-023-01852-4.

Yue Liu is a Ph.D. candidate under the supervision of Professor Gang Fang at the Institute of Computing Science and Technology, Guangzhou University. He received his master's degree from City University of Hong Kong.

Xiangzhen Zan is a Ph.D. candidate under the supervision of Professor Wenbin Liu at the Institute of Computing Science and Technology, Guangzhou University. He received his master's degree from Wenzhou University.

Xianbin Li is a post-doctoral scholar at the Institute of Computing Science and Technology, Guangzhou University. He received his Ph.D. from Sun Yat-sen University.

Wenbin Liu is a Full Professor at the Institute of Computing Science and Technology, Guangzhou University. He received his Ph.D. from Huazhong University of Science and Technology.

Gang Fang is a Full Professor at the Institute of Computing Science and Technology, Guangzhou University. He received his Ph.D. from Huazhong University of Science and Technology.