
Cluster-guided Asymmetric Contrastive Learning for Unsupervised Person Re-Identification

Mingkun Li, Chun-Guang Li, Senior Member, IEEE, and Jun Guo

M. Li, C.-G. Li and J. Guo are with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P.R. China. E-mail: {mingkun.li, lichunguang, guojun}@bupt.edu.cn. Chun-Guang Li is the corresponding author. Manuscript received xx, 2021; revised xx, xxxx.

Abstract—Unsupervised person re-identification (Re-ID) aims to match pedestrian images from different camera views in an unsupervised setting. Existing methods for unsupervised person Re-ID are usually built upon the pseudo labels from clustering. However, the result of clustering depends heavily on the quality of the learned features, which are overwhelmingly dominated by colors in images. In this paper, we attempt to suppress the negative dominating influence of colors to learn more effective features for unsupervised person Re-ID. Specifically, we propose a Cluster-guided Asymmetric Contrastive Learning (CACL) approach for unsupervised person Re-ID, in which the clustering result is leveraged to guide the feature learning in a properly designed asymmetric contrastive learning framework. In CACL, both instance-level and cluster-level contrastive learning are employed to help the siamese network learn discriminant features with respect to the clustering result, within and between different data augmentation views, respectively. In addition, we also present a cluster refinement method, and validate that the cluster refinement step helps CACL significantly. Extensive experiments conducted on three benchmark datasets demonstrate the superior performance of our proposal.

Index Terms—Unsupervised Person Re-Identification, Asymmetric Contrastive Learning, Cluster Refinement.

Fig. 1. Illustration of the basic idea of our proposal. We attempt to leverage the clustering information into contrastive learning to find more effective features by exploring the invariance between color images and gray-scale images.
I. INTRODUCTION

UNSUPERVISED person Re-identification (Re-ID) aims to match pedestrian images from different camera views in an unsupervised setting without demanding massive labelled data, and it has attracted increasing attention in the computer vision and pattern recognition community in recent years [1]. The great challenge in unsupervised person Re-ID is to tackle heavy variations from different viewpoints, varying illuminations, changing weather conditions, cluttered backgrounds, etc., without supervision labels.

Recently, existing methods for unsupervised person Re-ID are usually built on exploiting weak supervision information (e.g., pseudo labels) from clustering. For example, MMT [2] uses the DBSCAN [3] algorithm to generate pseudo labels and exploits the pseudo labels to train two networks; HCT [4] uses a hierarchical clustering algorithm to gradually assign pseudo labels to the training samples during the training stage; SSG [5] uses k-means on training samples with multiple views. However, the performance of these methods heavily relies on the quality of the pseudo labels, which directly depends on the feature representation of the input images.

More recently, contrastive learning has been applied to perform feature learning in the unsupervised setting, e.g., [6], [7], [8], [9], [10]. The primary idea in these methods is to learn some invariance in the feature representation with a self-supervised mechanism based on data augmentation. In SimCLR [8], each sample and its multiple augmentations are treated as positive pairs, the rest of the samples in the same batch are treated as negative pairs, and a contrastive loss is used to distinguish the positive and negative samples to prevent the model from falling into a trivial solution. We note that SimCLR requires a large batch size, e.g., 256 to 4096, to contain enough negative samples for effectively training the networks. In BYOL [10] and SimSiam [9], a predictor layer is used to prevent feature collapse without using negative samples. In InterCLR [6] and SwAV [7], clustering is used to prevent feature collapse. In particular, in SwAV [7], a scalable online clustering loss is proposed to train the siamese network with multi-crop data augmentation; whereas in InterCLR [6], a MarginNCE loss is proposed to enhance the discriminant power. While promising performance has been reported on ImageNet [11], these contrastive learning methods are not suitable for unsupervised person Re-ID due to serious feature collapse.

In this paper, we attempt to leverage cluster information into contrastive learning to develop an effective framework for unsupervised person Re-ID. We notice that the performance of person Re-ID depends heavily on the effectiveness of the learned features. However, the learned features are
overwhelmingly dominated by the colors in pedestrian images (such as the clothing color and background color), especially in the unsupervised setting. For example, pedestrian images with similar clothing colors often have smaller distances in the feature space, which may result in mistakes in clustering, and the mistakes in clustering may further bring wrong guidance to the pseudo labels for training the network. Although color is an important feature for matching pedestrian images in person Re-ID, it may also become an obstacle to learning more subtle and effective texture features, which are important fine-level cues for person Re-ID. Thus it is desirable to learn more robust and discriminating features that can resist dominant colors for the person Re-ID task.

Unfortunately, it is quite challenging to properly suppress the negative impact of colors for learning more effective fine-grained features without loss of discriminant information. For example, directly using random color changing (i.e., color-jitter [12]) for data augmentation in contrastive training may damage the consistency of the color distribution, which is not that helpful for gaining generalization ability on unseen samples. To this end, in this paper, we propose a novel and effective framework for unsupervised person Re-ID, termed Cluster-guided Asymmetric Contrastive Learning (CACL), in which clustering information is properly incorporated into contrastive learning to learn robust and discriminant features while suppressing dominant colors, as illustrated in Fig. 1. To be specific, we explore supervision information from the perspective of suppressing colors in the framework of cluster-guided contrastive learning, in which the samples in asymmetric views of specifically designed data augmentations (e.g., color images vs. gray-scale images), as shown in Fig. 2, are exploited to provide strong supervision to impose invariance in feature learning. By integrating the clustering results into contrastive learning, the proposed framework is able to avoid feature collapse. By suppressing dominant colors, the proposed framework is able to effectively learn robust and discriminating features other than colors. In addition, we also present a simple but effective cluster refinement method to improve the clustering result and thus further enhance the contrastive learning. We conduct extensive experiments on three benchmark datasets, and the experimental results validate the effectiveness of our proposal.

Fig. 2. Illustration of the raw images and the augmented images: (a) Raw, (b) T, (c) T', (d) G ∘ T'. The first column shows the raw images. The middle two columns show the images generated with the transforms T(·) and T'(·). The last column shows the corresponding gray-scale images, which are generated with both the transform T'(·) and the color-to-grayscale transform G(·), i.e., G ∘ T'(·).

Paper Contributions. The contributions of the paper are highlighted as follows.

1) We propose an effective unsupervised framework that leverages clustering information into contrastive learning while suppressing the dominant colors in images to learn fine-grained features.
2) We propose a novel cluster-level loss function to perform inter-view and intra-view contrastive learning that can effectively exploit the cluster-level hidden information from different data augmentation views.
3) We also present a cluster refinement method and verify that the refined clustering information helps the contrastive learning framework significantly.

The remainder of this paper is organized as follows. Section II describes the relevant work. Section III presents our proposal. Section IV shows experiments and Section V gives the conclusions.

II. RELATED WORK

A. Unsupervised Person Re-identification

Person Re-ID aims to find specific pedestrians in videos or images according to given targets. Owing to the increasing demand in real life and to avoid the high cost of labeling datasets, unsupervised person Re-ID has become popular in recent years [1]. The existing unsupervised person Re-ID methods can be divided into two categories: a) unsupervised domain adaptation methods, which need a labeled source dataset and an unlabeled target dataset [13], [14], [15], [16]; and b) purely unsupervised methods, which need only an unlabeled dataset [17], [18], [19], [20].

The unsupervised domain adaptation methods train the network with the help of labeled datasets, and transfer the network to unlabeled datasets by reducing the gap between the two datasets. For example, [21] proposed to align the second-order statistics of the distributions in the two domains through linear transformations to reduce the domain shift; [17] proposed a combined loss function to co-train with samples from the source and target domains and a merging memory bank; [22] proposed to maximize the inter-domain classification loss and minimize the intra-domain classification loss to learn domain-robust features. However, unsupervised domain adaptation methods are limited by the requirement that the target dataset have a distribution close to that of the source dataset.

Most purely unsupervised person Re-ID methods rely on pseudo labels to train the network. For example, HCT [4] uses hierarchical clustering to generate pseudo labels and trains a convolutional neural network for feature learning; [23] assigns multiple labels to samples and proposes a new loss function for multi-label training. Note that the quality of the pseudo labels relies on the feature representation of the input images. However, in the early stage, the feature representation is not good enough to generate high-quality pseudo labels, and thus
the low-quality pseudo labels will contaminate the network training. Therefore, it is necessary to design a cluster refinement method to improve the clustering quality before feeding the pseudo labels to train the network.

Fig. 3. Illustration of our proposed Cluster-guided Asymmetric Contrastive Learning (CACL) framework. After training, we keep only the ResNet F(·|Θ) in the first branch for inference and use the feature x_i for testing.

B. Contrastive Learning

In recent years, with the development and application of the siamese network, contrastive learning began to emerge in the field of unsupervised learning. Contrastive learning aims at learning good image representations. It learns invariance in features by manipulating a set of positive samples and negative samples with data augmentation.

The existing methods for contrastive learning can be categorized into: a) instance-level methods [8], [9], [24], [25], [10] and b) cluster-level methods [7], [6], [26]. Instance-level methods regard each image as an individual class, consider two augmented views of the same image as positive pairs, and treat the others in the same batch (or memory bank) as negative pairs. For example, SimCLR [8] regards samples in the current batch as the negative samples; MoCo [27] uses a dictionary to implement contrastive learning, converting one branch of the contrastive learning into a momentum encoder; SimSiam [9] proposed a stop-gradient method that can train the siamese network without negative samples. Cluster-level methods regard the samples in the same cluster as positive samples and the other samples as negative samples. For example, in [6] the InfoNCE loss is combined with a MarginNCE loss to attract positive samples and repel negative samples; in [7] multi-crop data augmentation is used to enhance the robustness of the network and a scalable online clustering method is proposed to explore the inter-invariance of clusters; in [26] weight-sharing deep neural networks are used to extract features from sample pairs with different data augmentations, and contrastive clustering is performed with respect to both the features in the row and column spaces.

However, in the unsupervised setting, the instance-level contrastive learning methods simply make each sample independently repel the others, which undoubtedly ignores the cluster information. In contrast, cluster-level contrastive learning can effectively mine cluster information, but it relies heavily on the clustering result. Unfortunately, in the early training stage, the features are not good enough to yield a good clustering result. Thus, an effective way to train the network by combining both lines of contrastive learning methods is needed.

In this paper, we attempt to bridge the two lines of contrastive learning methods into a unified framework to form effective mutual learning and joint training: a) the instance-level contrastive learning helps train the network to perform feature learning, especially in the early training stage; meanwhile b) the cluster-level contrastive learning helps train the network, especially once the quality of the clustering has been improved. In this way, the self-supervision information imposed by data augmentation and the weak supervision information obtained from clustering can be fully exploited without the need for negative sample pairs.

III. OUR PROPOSAL: CLUSTER-GUIDED ASYMMETRIC CONTRASTIVE LEARNING (CACL)

This section presents our proposal, the Cluster-guided Asymmetric Contrastive Learning (CACL) approach for unsupervised person Re-ID.

For clarity, we show the architecture of our proposed CACL in Fig. 3. Overall, our CACL is a siamese network, which consists of two branches of backbone networks F(·|Θ) and F'(·|Θ') without sharing parameters, where Θ and Θ' are the parameters of the two networks, respectively, and a predictor layer G(·|Ψ) is added after the first branch, where Ψ denotes the parameters of the predictor layer. The backbone networks F(·|Θ) and F'(·|Θ') are implemented¹ via ResNet-50 [28] for feature learning.

1 It also works if backbone networks other than ResNet-50 are used.
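To make this architecture concrete, below is a minimal PyTorch sketch of the asymmetric siamese structure: two ResNet-50 encoders F(·|Θ) and F'(·|Θ') that do not share parameters, with a predictor G(·|Ψ) attached to the first branch only. The module names and the torchvision weight loading are our illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn
from torchvision import models

class CACLNet(nn.Module):
    """Sketch of the asymmetric siamese backbone: F(.|Theta), F'(.|Theta'), G(.|Psi)."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        # Two ResNet-50 branches pre-trained on ImageNet; parameters are NOT shared.
        self.f = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.f_prime = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the classification heads; keep the 2048-d pooled features.
        self.f.fc = nn.Identity()
        self.f_prime.fc = nn.Identity()
        # The predictor G is attached to the FIRST branch only (asymmetry in structure).
        self.g = nn.Linear(feat_dim, feat_dim)

    def forward(self, img_hat, img_tilde):
        x = self.f(img_hat)                 # x_i: feature of the color view
        x_tilde = self.f_prime(img_tilde)   # x~_i: feature of the gray-scale view
        z = self.g(x)                       # z_i: predictor output of the first branch
        return x, x_tilde, z
```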

Given an unlabeled image dataset I = {I_i}_{i=1}^N consisting of N samples, for an input image I_i ∈ I we generate two samples Î_i and Ĩ_i via different data augmentation strategies as the inputs of the two branches, respectively, in which Î_i = T(I_i) and Ĩ_i = G(T'(I_i)), where T(·) and T'(·) denote two different transforms and G(·) denotes the operation that transforms a color image into a gray-scale image. For simplicity, we denote the output features of the first and the second network branches as x_i and x̃_i, and denote the output of the predictor layer in the first branch as z_i, where x_i, x̃_i, z_i ∈ R^D.

The clustering result of the output features X := {x_1, ..., x_N} from the first network branch is used to generate the pseudo labels Y := {y_1, ..., y_N}. We exploit the pseudo labels to leverage the cluster information in the contrastive learning. Specifically, in the training stage, the two network branches F(·|Θ) and F'(·|Θ') are trained with the augmented samples without sharing parameters, and the pseudo labels Y are used to guide the training of both network branches.

In CACL, we use instance memory banks M = {v_i}_{i=1}^N and M̃ = {ṽ_i}_{i=1}^N, where v_i, ṽ_i ∈ R^D, to store the outputs of the two branches, respectively. Both instance memory banks M and M̃ are initialized with X := {x_1, ..., x_N} and X̃ := {x̃_1, ..., x̃_N}, which are the outputs of the network branches F(·|Θ) and F'(·|Θ') pre-trained on ImageNet, respectively.
A. Cluster-guided Contrastive Learning

At the beginning, we pre-train the two network branches F(·|Θ) and F'(·|Θ') on ImageNet [11], and use the features from the first network branch F(·|Θ) to yield m clusters, denoted as C := {C^(1), C^(2), ..., C^(m)}. The clustering result is used to form pseudo labels to train the cluster-guided contrastive learning module.

To exploit the label invariance between the two augmented views and leverage the cluster structure, we employ two types of contrastive losses: a) an instance-level contrastive loss, denoted L_I, and b) a cluster-level contrastive loss, denoted L_C.

Instance-Level Contrastive Loss. To match the feature outputs z_i and x̃_i of the two network branches at the instance level, similar to [8], [10], we introduce the negative cosine similarity of the prediction output z_i in the first branch and the feature output x̃_i of the second branch to define an instance-level contrastive loss L_I as follows:

\mathcal{L}_I := -\frac{z_i^\top \tilde{x}_i}{\|z_i\|_2 \, \|\tilde{x}_i\|_2},  (1)

where \|\cdot\|_2 is the \ell_2-norm.
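A minimal PyTorch sketch of Eq. (1) is given below; the stop-gradient on the second-branch feature follows the training description in Section III-C, and the function name is hypothetical.

```python
import torch.nn.functional as F_t

def instance_level_loss(z, x_tilde):
    """Eq. (1): negative cosine similarity between the predictor output z_i of the
    first branch and the (gradient-stopped) feature x~_i of the second branch."""
    x_tilde = x_tilde.detach()  # stop-gradient on the second branch (Section III-C)
    return -F_t.cosine_similarity(z, x_tilde, dim=1).mean()
```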
Cluster-Level Contrastive Loss. To leverage the cluster structure to further explore the hidden information from different views, we propose a cluster-level contrastive loss L_C, which is further divided into an inter-views cluster-level contrastive loss and an intra-views cluster-level contrastive loss.

• Inter-views cluster-level contrastive loss, denoted L_C^(inter), which is defined as:

\mathcal{L}_C^{(inter)} := -\frac{z_i^\top \tilde{u}_{\omega(I_i)}}{\|z_i\|_2 \, \|\tilde{u}_{\omega(I_i)}\|_2},  (2)

where ω(I_i) finds the cluster index ℓ for z_i, and ũ_ℓ is the center vector of the ℓ-th cluster, in which Ũ := {ũ_1, ..., ũ_{m'}} and the cluster center ũ_ℓ is defined as

\tilde{u}_\ell = \frac{1}{|C^{(\ell)}|} \sum_{I_i \in C^{(\ell)}} \tilde{v}_i,  (3)

where ṽ_i is the instance feature of image Ĩ_i in the instance memory bank M̃ and C^(ℓ) is the ℓ-th cluster. The inter-views cluster-level contrastive loss L_C^(inter) defined in Eq. (2) is used to reduce the discrepancy between the projection output z_i of the first network branch and the cluster center ũ_ℓ of the feature output of the second branch with the gray-scale view.

• Intra-views cluster-level contrastive loss, denoted L_C^(intra), which is defined as:

\mathcal{L}_C^{(intra)} = -(1 - q_i)^2 \ln(q_i) - (1 - \tilde{q}_i)^2 \ln(\tilde{q}_i),  (4)

where q_i and q̃_i are the softmax of the inner product of the network outputs and the corresponding instance memory bank, defined as

q_i = \frac{\exp(u_{\omega(I_i)}^\top x_i / \tau)}{\sum_{\ell=1}^{m'} \exp(u_\ell^\top x_i / \tau)},  (5)

\tilde{q}_i = \frac{\exp(\tilde{u}_{\omega(I_i)}^\top \tilde{x}_i / \tau)}{\sum_{\ell=1}^{m'} \exp(\tilde{u}_\ell^\top \tilde{x}_i / \tau)},  (6)

where u_ℓ and ũ_ℓ are the center vectors of the ℓ-th cluster for the first branch and the second branch, respectively, in which ũ_ℓ is defined in Eq. (3) and u_ℓ is defined as

u_\ell = \frac{1}{|C^{(\ell)}|} \sum_{I_i \in C^{(\ell)}} v_i,  (7)

where v_i is the instance feature of image Î_i in the instance memory bank M. Note that both x_i and x̃_i share the same pseudo label ω(I_i) from clustering. The intra-views cluster-level contrastive loss L_C^(intra) in Eq. (4) is used to encourage the siamese network to learn features with respect to the corresponding cluster center for the two branches, respectively.

Putting the loss functions in Eqs. (2) and (4) together, we have the cluster-level contrastive loss L_C:

\mathcal{L}_C := \mathcal{L}_C^{(inter)} + \mathcal{L}_C^{(intra)}.  (8)

Remark 1. The cluster-level contrastive loss L_C in Eq. (8) aims to leverage the clustering information to minimize the difference between samples of the same cluster from different augmentation views via L_C^(inter), and within the same augmentation view via L_C^(intra). This helps the siamese network mine the hidden information brought by the basic augmented view in the first branch and the gray-scale augmented view in the second branch, to prevent feature collapse to a trivial solution and to impose supervision information for learning features other than colors.
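The cluster-level loss of Eq. (8) can be sketched as follows, assuming the batch features, the pseudo labels ω(I_i), and the cluster centers of Eqs. (3) and (7) are available as tensors; the names, shapes, and the small numerical epsilon are illustrative assumptions.

```python
import torch
import torch.nn.functional as F_t

def cluster_level_loss(z, x, x_tilde, labels, centers, centers_tilde, tau=0.05):
    """Sketch of Eq. (8) = Eq. (2) + Eq. (4). `labels` holds omega(I_i) for the
    batch; `centers`/`centers_tilde` are the (m' x D) cluster centers u_l and
    u~_l computed from the two memory banks via Eqs. (7) and (3)."""
    # Inter-views loss, Eq. (2): cosine between z_i and u~_{omega(I_i)}.
    u_tilde_pos = centers_tilde[labels]                       # (B, D)
    loss_inter = -F_t.cosine_similarity(z, u_tilde_pos, dim=1).mean()

    # Intra-views loss, Eq. (4): focal-style cross entropy over the softmax of
    # Eqs. (5)/(6), computed for each branch against its own centers.
    def focal_term(feat, ctrs):
        logits = feat @ ctrs.t() / tau                        # (B, m')
        q = F_t.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
        return -((1.0 - q) ** 2 * torch.log(q + 1e-12)).mean()

    loss_intra = focal_term(x, centers) + focal_term(x_tilde, centers_tilde)
    return loss_inter + loss_intra
```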

B. Clustering and Cluster Refinement

Note that the cluster-level contrastive loss is greatly affected by the quality of the clustering result. When the clusters are noisy, they will cause negative effects on the training. To improve the quality of the clustering result, we propose a cluster refinement method which removes a proportion of noisy samples from larger clusters, helping the model to better learn the information at the cluster level.

For a cluster, we want to keep the samples with higher similarity and remove the samples with lower similarity. Given a set of raw clusters, denoted as {C^(1), C^(2), ..., C^(m)}, without loss of generality, we pick C^(i) to perform cluster refinement. At first, we obtain an over-segmentation of C^(i), i.e., C^(i) is further divided into {C_1^(i), C_2^(i), ..., C_{n_i}^(i)}. Then we perform cluster refinement according to the following criterion:

\text{if } D(C_j^{(i)} \mid C^{(i)}) < D(C^{(i)}), \text{ then } C_j^{(i)} \text{ is kept};  (9)

otherwise C_j^(i) is removed, where D(C_j^(i) | C^(i)) is the average inter-distance from all samples in the sub-cluster C_j^(i) to the other samples in cluster C^(i), and D(C^(i)) is the average intra-distance among the samples in cluster C^(i).

After such a post-processing step, the clusters of larger size are improved and, at the same time, more singletons or tiny clusters are produced. We denote the refined clusters as C' = {C^(1), C^(2), ..., C^(m')}, where m' ≥ m. Compared to tiny clusters and singletons, the larger clusters are more informative for providing pseudo supervision to guide the contrastive learning.

Remark 2. In implementation, we use the DBSCAN algorithm [3] to generate both the raw clusters and the over-segmentation of the clusters. DBSCAN [3] is a density-based clustering algorithm: it regards a data point as density-reachable if the data point lies within a small distance threshold d of other samples, where the parameter d is the distance threshold used to find neighboring points. Specifically, to generate the raw clusters, we employ DBSCAN with a slightly larger distance threshold d (e.g., d = 0.6); whereas to generate the over-segmentation, we use a slightly smaller distance threshold d', where d' := d − δ (e.g., δ = 0.02). We will show the influence of the parameters δ and d in the experiments. A minimal sketch of this refinement step is given below.
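The sketch below implements the refinement under stated assumptions: scikit-learn's DBSCAN, cosine distance on L2-normalized features, and an arbitrary min_samples value. The paper does not restate these details, so they are illustrative, not the authors' exact settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def refine_clusters(features, d=0.6, delta=0.02):
    """Sketch of Section III-B / Remark 2: cluster with DBSCAN at threshold d,
    over-segment at d' = d - delta, and keep a sub-cluster only if its average
    distance to the rest of its raw cluster is below the average intra-cluster
    distance, as in Eq. (9)."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    raw = DBSCAN(eps=d, min_samples=4, metric="cosine").fit_predict(feats)
    fine = DBSCAN(eps=d - delta, min_samples=4, metric="cosine").fit_predict(feats)

    labels = np.full(len(feats), -1)
    for c in np.unique(raw[raw >= 0]):
        idx = np.where(raw == c)[0]
        cf = feats[idx]
        dist = 1.0 - cf @ cf.T                                # pairwise cosine distances
        d_intra = dist.sum() / max(len(idx) * (len(idx) - 1), 1)   # D(C^(i))
        for s in np.unique(fine[idx]):
            sub = idx[fine[idx] == s]
            rest = np.setdiff1d(idx, sub)
            if len(rest) == 0:
                labels[sub] = c                               # nothing to compare against
                continue
            d_sub = (1.0 - feats[sub] @ feats[rest].T).mean()  # D(C_j | C^(i))
            if d_sub < d_intra:                               # Eq. (9): keep the sub-cluster
                labels[sub] = c
    return labels  # refined pseudo labels; -1 marks removed/outlier samples
```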
C. Training Procedure for Our CACL Approach

In CACL, the two branches of the siamese network are implemented with ResNet-50 [28] and they do not share parameters. We pre-train the two network branches on ImageNet at first and use the learned features to initialize the two memory banks M and M̃, respectively.

In the training stage, we train both network branches at the same time with the total loss:

\mathcal{L} := \mathcal{L}_I + \mathcal{L}_C.  (10)

We update the two instance memory banks M and M̃, respectively, as follows:

v_i^{(t)} \leftarrow \alpha v_i^{(t-1)} + (1-\alpha)\, x_i,  (11)

\tilde{v}_i^{(t)} \leftarrow \alpha \tilde{v}_i^{(t-1)} + (1-\alpha)\, \tilde{x}_i,  (12)

where α is set to 0.2 by default (we will discuss the influence of α in the experiments).

In order to save computation cost², we also use a stop-gradient operation, as mentioned in SimSiam [9]. Note that we apply the stop-gradient operation [9] to the second network branch F'(·|Θ') when using the instance-level loss L_I in Eq. (1) to perform back-propagation. Thus, the parameters Θ' in the second network branch are updated only with the intra-views cluster-level contrastive loss L_C^(intra) in Eq. (4).

2 Note that it is not necessary to use the stop-gradient operation in our CACL, because the clustering result provides enough guidance under the asymmetric structure to prevent collapse. Although this is similar to the method in SimSiam [9], the purpose is different and it is not necessary in our proposal.

Remark 3. For clarity, we summarize the details of the training procedure in Algorithm 1. We note that the "asymmetry" in the proposed framework for cluster-guided contrastive learning lies in the following three aspects: a) asymmetry in network structure, i.e., a predictor layer is only added after the first branch³; b) asymmetry in data augmentation, i.e., the augmented samples provided to the second branch are further transformed into gray-scale; and c) asymmetry in pseudo label generation, i.e., the output features of the first branch are used to generate the pseudo labels, which are shared with the second branch. Because of the asymmetry in these three aspects, we term the proposed framework Cluster-guided Asymmetric Contrastive Learning (CACL).

3 It is also feasible to add another predictor layer after the second branch to obtain a symmetric network structure. Nevertheless, our experimental results show that merely marginal performance improvement is yielded after adding an extra predictor layer. Thus, we prefer the asymmetric network architecture for the contrastive learning framework.

Remark 4. Many unsupervised Re-ID methods [17], [13], [29], [12] have used contrastive learning to learn discriminant features. Most of them [13], [29], [12] are Generative Adversarial Network (GAN)-based methods and need additional supervised information to assist the training. For example, ATNet [13] trains multiple GANs by utilizing illumination and camera information, GCL [12] introduces pose information in training, and AD-Cluster [29] generates cross-camera samples to assist the training. Unlike these methods, our proposed CACL uses an asymmetric siamese network to effectively learn fine-grained features by suppressing color with simple data augmentation operations during training, rather than using expensive sample generation via GANs. Compared to GAN-based methods, our CACL is simple, efficient and effective.

D. Inference Procedure for CACL

After training, we keep only the ResNet F(·|Θ) in the first branch for inference in testing.

To be specific, in the inference procedure, we use the output features X of the first branch F(·|Θ) to calculate the similarity between images. Given the gallery image dataset I^g = {I_i^g}_{i=1}^{N^g} and the query image dataset I^q = {I_i^q}_{i=1}^{N^q}, where N^g and N^q are the sizes of the two datasets, respectively, for each image I_i^q in the query set we compute the distances between the query image and the images in the gallery I^g via the features obtained from the output of the first branch. Then, we sort the distances in ascending order to find the matched images.
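A short sketch of this inference procedure might look as follows; the distance choice (Euclidean on L2-normalized features, which is equivalent to cosine ranking) and the function and loader names are assumptions.

```python
import torch

@torch.no_grad()
def rank_gallery(model_f, query_loader, gallery_loader):
    """Sketch of Section III-D: extract features with the first branch F(.|Theta)
    only, then sort the gallery by distance to each query (ascending)."""
    def extract(loader):
        feats = torch.cat([model_f(imgs) for imgs, _ in loader], dim=0)
        return torch.nn.functional.normalize(feats, dim=1)

    q, g = extract(query_loader), extract(gallery_loader)
    dist = torch.cdist(q, g)          # (N_q, N_g) pairwise distances
    return dist.argsort(dim=1)        # per-query ranking: best match first
```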

Algorithm 1 Training Procedure for CACL

Input: A dataset I = {I_i}_{i=1}^N.
Output: The trained model F(·|Θ).
1: Pre-train the two network branches on ImageNet.
2: Initialize the two instance memory banks M and M̃ and set P = P_best = 0.
3: while epoch ≤ total_epoch do
4:   Generate Î_i and Ĩ_i via the data augmentations T(·) and G(T'(·));
5:   Perform feature extraction to get x_i and x̃_i;
6:   Perform clustering and cluster refinement via Eq. (9) to yield the pseudo labels Y = {y_1, ..., y_N};
7:   Update the two cluster centers U and Ũ via Eqs. (7) and (3);
8:   Train the siamese network, i.e., update Θ, Ψ and Θ' via the total loss in Eq. (10);
9:   Update the instance memory banks M and M̃ via Eq. (11) and Eq. (12);
10:  Evaluate the model performance P with F(·|Θ);
11:  if P > P_best then
12:    Output the best model F(·|Θ) and set P_best ← P;
13:  end if
14: end while
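For reference, one training iteration of Algorithm 1 (steps 4 to 9) could be condensed as the sketch below, reusing the loss sketches from Section III-A. All names are illustrative assumptions, and the clustering/refinement step is assumed to have already produced the pseudo labels and cluster centers for the current epoch.

```python
import torch

def train_step(net, opt, batch, labels, mem, mem_tilde,
               centers, centers_tilde, alpha=0.2):
    """One iteration following Algorithm 1: forward both views, apply the total
    loss of Eq. (10), then update the memory banks via Eqs. (11)-(12)."""
    img_hat, img_tilde, idx = batch            # two augmented views + sample ids
    x, x_tilde, z = net(img_hat, img_tilde)

    loss = instance_level_loss(z, x_tilde) + cluster_level_loss(
        z, x, x_tilde, labels[idx], centers, centers_tilde)    # Eq. (10)

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                      # Eqs. (11) and (12): EMA update
        mem[idx] = alpha * mem[idx] + (1 - alpha) * x.detach()
        mem_tilde[idx] = alpha * mem_tilde[idx] + (1 - alpha) * x_tilde.detach()
    return loss.item()
```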
IV. EXPERIMENTS

In this section, we first describe the benchmark datasets and the detailed parameter settings used in the experiments, and then provide extensive experiments on these datasets, including a set of detailed ablation studies and a set of evaluation experiments to show the effect of each component. Finally, we give a set of data visualization experiments.⁴

4 The code can be downloaded from https://github.com/MingkunLishigure/CACL.

A. Dataset Description

To evaluate the effectiveness of our proposal, we use the following three benchmark datasets: Market-1501 [41], DukeMTMC-ReID [45] and MSMT17 [46].

Market-1501 has 32,668 photos of 1,501 people from six different camera views. The training set contains 12,936 images of 751 identities. The testing set contains 19,732 images of 750 identities.

DukeMTMC-ReID consists of images sampled from the DukeMTMC video dataset, 120 frames per video, with a total of 36,411 images of 1,404 identities. The training set contains 16,522 images of 702 identities, and the testing set contains 2,228 query images of 702 identities and 17,661 gallery images. These images are taken from eight cameras.

MSMT17 has a total of 126,441 images under 15 camera views. The training set contains 32,621 images of 1,041 identities, and the testing set contains 93,820 images of 3,060 identities. MSMT17 is larger than Market-1501 and DukeMTMC-ReID.

B. Implementation Details

Settings for Training. In our CACL approach, we use ResNet-50 [28] pre-trained on ImageNet [11] for both network branches.⁵ The feature outputs x_i ∈ R^D and x̃_i ∈ R^D of the two networks F(·|Θ) and F'(·|Θ') are D-dimensional vectors, where D = 2048. We use the feature output x_i of the first branch F(·|Θ) to perform clustering, where x_i = F(Î_i|Θ) ∈ R^D. The prediction layer G(·) is a D × D fully connected layer. We initialize the two memory banks with the feature outputs of the corresponding network branches F(·|Θ) and F'(·|Θ'), respectively. We optimize the network with the Adam optimizer [47] with a weight decay of 0.0005 and train the network for 80 epochs in total. The learning rate is initially set to 0.00035 and decreased to one-tenth every 20 epochs. The batch size is set to 64. The temperature coefficient τ in Eq. (6) is set to 0.05, and the update factor α in Eqs. (11) and (12) is set to 0.2.

5 In Section IV-C, we also provide the performance evaluation with other backbone networks for the two branches.

Settings for Data Augmentation. In our experiments, we use the same data augmentation operations as other methods [17], [2], including random horizontal flip, random erasing and random crop, to define the data augmentations T(·) and T'(·). Besides, we add a gray-scale transform to the input of the second branch.

Metrics for Performance Evaluation. In evaluation, we use the mean average precision (mAP) and the cumulative matching characteristic (CMC) at Rank-1, 5 and 10 to evaluate the performance.

C. Comparison to the State-of-the-art Methods

We compare our proposed CACL to the state-of-the-art unsupervised domain adaptation methods and purely unsupervised methods for person Re-ID. The purely unsupervised methods include: CAMEL [40], PUL [19], SSL [20], LOMO [42], BOW [41], BUC [18], HCT [4], SpCL [17], and CAP [43]. The unsupervised domain adaptation methods include: PTGAN [30], ADTC [36], HHL [35], SSG [5], MMCL [23], AD-Cluster [29], MEB [38], NRMT [39], SPGAN [32], TJ-AIDL [16], JVTC [37], PGPPM [34], and MMT [2].

The comparison results of the state-of-the-art unsupervised domain adaptation methods and purely unsupervised methods are shown in Table I. We can find that our proposed CACL achieves 80.9%/92.7% at mAP/Rank-1 on Market-1501 and 69.6%/82.6% at mAP/Rank-1 on DukeMTMC-ReID, respectively. CACL not only performs better than all purely unsupervised methods but also outperforms the unsupervised domain adaptation methods.

Moreover, we also conduct experiments on the much larger dataset MSMT17 and report the experimental results in Table II. Again, we can observe that our proposed CACL achieves a leading performance, i.e., 23.0%/48.4% at mAP/Rank-1. It is worth noting that our CACL yields superior performance to some UDA methods on this challenging dataset. These results confirm the effectiveness of our proposal.
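For reproducibility, the two augmentation streams described in the "Settings for Data Augmentation" paragraph of Section IV-B can be sketched with torchvision as follows. The image size (256×128, a common choice in person Re-ID) and the flip/erase probabilities are assumptions, since the text does not specify them.

```python
from torchvision import transforms

# T(.) / T'(.): random flip, random crop, random erasing (Section IV-B).
T = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop((256, 128), padding=10),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])

# G(T'(.)): the same pipeline plus a color-to-grayscale transform, replicated
# to 3 channels so the ResNet input shape is unchanged (second branch input).
G_T_prime = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop((256, 128), padding=10),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])
```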

TABLE I: Comparison to other state-of-the-art methods. "UDA" refers to the unsupervised domain adaptation methods and "US" refers to the purely unsupervised learning methods. "*" means that the used backbone is pre-trained on ImageNet. For each dataset, the cells report mAP / Rank-1 / Rank-5 / Rank-10.

| Method | Type | Reference | Backbone | Market-1501 | DukeMTMC-ReID |
|---|---|---|---|---|---|
| PTGAN [30] | UDA | CVPR'18 | GoogleNet [31] | 15.7 / 38.6 / 57.3 / - | 13.5 / 27.4 / 43.6 / - |
| SPGAN [32] | UDA | CVPR'18 | ResNet50* [28] | 26.7 / 58.1 / 76.0 / 82.7 | 26.4 / 46.9 / 62.6 / 68.5 |
| TJ-AIDL [16] | UDA | CVPR'18 | MobileNet* [33] | 26.5 / 58.2 / 74.8 / - | 23.0 / 44.3 / 59.6 / - |
| PGPPM [34] | UDA | CVPR'18 | ResNet50* [28] | 33.9 / 63.9 / 81.1 / 86.4 | 17.9 / 36.3 / 54.0 / 61.6 |
| HHL [35] | UDA | ECCV'18 | ResNet50* [28] | 31.4 / 62.2 / 78.0 / 84.0 | 27.2 / 46.9 / 61.0 / 66.7 |
| SSG [5] | UDA | ICCV'19 | ResNet50* [28] | 58.3 / 80.0 / 90.0 / 92.4 | 53.4 / 73.0 / 80.6 / 83.2 |
| AD-Cluster [29] | UDA | CVPR'20 | ResNet50* [28] | 68.3 / 86.7 / 94.4 / 96.5 | 54.1 / 72.6 / 82.5 / 85.5 |
| ADTC [36] | UDA | ECCV'20 | ResNet50* [28] | 59.7 / 79.3 / 90.8 / 94.1 | 52.5 / 71.9 / 84.1 / 87.5 |
| MMCL [23] | UDA | CVPR'20 | ResNet50* [28] | 60.4 / 84.4 / 92.8 / 95.0 | 51.4 / 72.4 / 82.9 / 85.0 |
| MMT [2] | UDA | ICLR'20 | ResNet50* [28] | 73.8 / 89.5 / 96.0 / 97.6 | 62.3 / 76.3 / 87.7 / 91.2 |
| JVTC [37] | UDA | ECCV'20 | ResNet50* [28] | 67.2 / 86.8 / 95.2 / 97.1 | 66.5 / 80.4 / 89.9 / 93.7 |
| MEB [38] | UDA | ECCV'20 | ResNet50* [28] | 76.0 / 89.9 / 95.2 / 96.9 | 65.3 / 81.2 / 90.9 / 92.2 |
| NRMT [39] | UDA | ECCV'20 | ResNet50* [28] | 71.7 / 87.8 / 94.6 / 96.5 | 62.2 / 77.8 / 86.9 / 89.5 |
| SpCL [17] | UDA | NIPS'20 | ResNet50* [28] | 76.7 / 90.3 / 96.2 / 97.7 | 68.8 / 82.9 / 90.1 / 92.5 |
| CAMEL [40] | US | ICCV'17 | ResNet50* [28] | 26.3 / 54.4 / 73.1 / 79.6 | 19.8 / 40.2 / 57.5 / 64.9 |
| BOW [41] | US | ICCV'15 | - | 14.8 / 35.8 / 52.4 / 60.3 | 8.5 / 17.1 / 28.8 / 34.9 |
| PUL [19] | US | TOMM'18 | ResNet50* [28] | 22.8 / 51.5 / 70.1 / 76.8 | 22.3 / 41.1 / 46.6 / 63.0 |
| LOMO [42] | US | CVPR'15 | - | 8.0 / 27.2 / 41.6 / 49.1 | 4.8 / 12.3 / 21.3 / 26.6 |
| BUC [18] | US | AAAI'19 | ResNet50* [28] | 30.6 / 61.0 / 71.6 / 76.4 | 21.9 / 40.2 / 52.7 / 57.4 |
| HCT [4] | US | CVPR'20 | ResNet50* [28] | 56.4 / 80.0 / 91.6 / 95.2 | 50.1 / 69.6 / 83.4 / 87.4 |
| SSL [20] | US | CVPR'20 | ResNet50* [28] | 37.8 / 71.7 / 83.8 / 87.4 | 28.6 / 52.5 / 63.5 / 68.9 |
| SpCL [17] | US | NIPS'20 | ResNet50* [28] | 73.1 / 88.1 / 96.3 / 97.7 | 65.3 / 81.2 / 90.3 / 92.2 |
| CAP [43] | US | AAAI'20 | ResNet50* [28] | 79.2 / 91.4 / 96.3 / 97.7 | 67.3 / 81.1 / 89.3 / 91.8 |
| CACL | US | This paper | ResNet50* [28] | 80.9 / 92.7 / 97.4 / 98.5 | 69.6 / 82.6 / 91.2 / 93.8 |
| CACL | US | This paper | IBN-ResNet* [44] | 83.6 / 93.3 / 97.7 / 98.3 | 72.5 / 85.5 / 92.9 / 94.9 |

TABLE II: Experimental results on MSMT17.

| Method | Type | Reference | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|---|---|
| PTGAN [30] | UDA | CVPR'18 | 3.3 | 11.8 | - | 27.4 |
| ECN [48] | UDA | CVPR'19 | 10.2 | 30.2 | 41.5 | 46.8 |
| SSG [5] | UDA | ICCV'19 | 13.3 | 32.2 | - | 51.2 |
| MMCL [23] | UDA | CVPR'20 | 16.2 | 43.6 | 54.3 | 58.9 |
| JVTC+ [37] | US | ECCV'20 | 17.3 | 43.1 | 53.8 | 59.4 |
| SpCL [17] | US | NIPS'20 | 19.1 | 42.3 | 55.6 | 61.2 |
| MMT [2] | UDA | ICLR'20 | 24.0 | 50.1 | 63.5 | 69.3 |
| SpCL [17] | UDA | NIPS'20 | 26.8 | 53.7 | 79.3 | 83.1 |
| CACL | US | This paper | 23.0 | 48.9 | 61.2 | 66.4 |
| CACL w/ IBN-ResNet | US | This paper | 29.9 | 57.1 | 68.4 | 73.1 |

Note that Instance-Batch Normalization (IBN) [44] has been used in object recognition and has proved very effective. Here, we evaluate our CACL with the backbone implemented with an Instance-Batch Normalization ResNet (IBN-ResNet). Similar to CACL with ResNet [28], we introduce an Instance-Batch Normalization (IBN) layer to replace the BN layer, and call the result an IBN-ResNet. As shown in Table I, the performance of our CACL can be further improved when combined with IBN-ResNet.

D. Ablation Study

To evaluate the effectiveness of each component (L_I, L_C^(inter), L_C^(intra), and clustering with refinement) in our CACL approach, we conduct a set of ablation experiments on Market-1501 and DukeMTMC-ReID.

In the baseline method, we train both branches with the data augmentations T'(·) and T'(·) using the Non-Parametric Softmax loss [49], which is defined as

\mathcal{L}(x_i) = -\ln\!\left(\frac{\exp(u_{\omega(I_i)}^\top x_i/\tau)}{\sum_{\ell=1}^{m'} \exp(u_\ell^\top x_i/\tau)}\right),  (13)

and both the training process and the memory updating strategy in the baseline method are kept the same as in our CACL method.

To comprehensively evaluate the contribution of each component, we conduct a set of ablation experiments testing each component in our CACL framework individually, i.e., the cluster refinement, the instance-level contrastive loss L_I, and the cluster-level contrastive loss L_C. To further evaluate the sub-parts of the cluster-level contrastive loss, we also conduct experiments to evaluate the influence of using L_C^(inter) or L_C^(intra) separately.

In the ablation experiments, to test the model with the contrastive loss L_C or L_I, we train both branches with the data augmentations T'(·) and G(T'(·)), respectively. To test the model performance with the cluster-level contrastive loss L_C and the sub-part L_C^(intra), compared to the baseline method, we need to replace the Non-Parametric Softmax loss in Eq. (13) by the loss in Eq. (4) for both branches. The results of the ablation study are reported in Table III.

TABLE III: Ablation study on Market-1501 and DukeMTMC-ReID. For each dataset, the cells report mAP / Rank-1 / Rank-5 / Rank-10.

| Components | Cluster Refine | L_I | L_C^(intra) | L_C^(inter) | Market-1501 | DukeMTMC-ReID |
|---|---|---|---|---|---|---|
| Baseline | | | | | 68.1 / 85.2 / 94.0 / 96.0 | 62.5 / 78.5 / 88.5 / 90.3 |
| + L_C | | | X | X | 70.8 / 87.5 / 94.4 / 96.0 | 62.5 / 79.5 / 88.4 / 90.8 |
| + L_I | | X | | | 74.7 / 88.7 / 95.0 / 96.6 | 64.2 / 80.7 / 89.0 / 91.6 |
| + L_I + L_C | | X | X | X | 74.4 / 89.3 / 95.9 / 96.7 | 63.8 / 79.2 / 89.2 / 91.7 |
| + Cluster Refine | X | | | | 73.0 / 87.8 / 95.7 / 97.2 | 65.7 / 81.1 / 90.6 / 93.2 |
| + Cluster Refine + L_I | X | X | | | 78.2 / 91.2 / 97.0 / 98.1 | 67.6 / 81.8 / 90.2 / 93.0 |
| + Cluster Refine + L_I + L_C^(inter) | X | X | | X | 78.7 / 91.2 / 97.0 / 97.9 | 68.5 / 81.9 / 91.2 / 93.8 |
| + Cluster Refine + L_I + L_C^(intra) | X | X | X | | 79.2 / 91.9 / 96.7 / 98.0 | 68.3 / 82.1 / 90.3 / 93.2 |
| + Cluster Refine + L_C | X | | X | X | 80.4 / 92.2 / 97.1 / 98.2 | 68.8 / 82.2 / 91.3 / 93.8 |
| Our CACL | X | X | X | X | 80.9 / 92.7 / 97.4 / 98.5 | 69.6 / 82.6 / 91.2 / 93.8 |

As can be read from Table III, the performance improves when each component is used individually. This validates that each component contributes to the performance improvements. Using both L_C and L_I is not significantly better than just using L_I, and using L_C alone yields only a slight improvement over the baseline. This is because the clustering result is not of high quality, and using L_C makes the training pay more attention to the noisy cluster information; therefore, it might bring misleading information to the network training. In the experiments using both L_C and cluster refinement, we observe a significant performance improvement over using the cluster refinement alone. This also validates that the cluster refinement improves the clustering result, and the refined clustering information can further enhance the effectiveness of using L_C to train the network.

E. More Evaluation and Analysis

Evaluation on the Importance of Cluster Guidance. We use an instance-level contrastive loss in our method to mine the invariance between different augmented views, based on SimSiam [9]. To verify whether the clustering guidance is vital in the contrastive learning framework, we train our CACL framework using only the instance-level contrastive loss in Eq. (1), without the clustering guidance. The experimental results are shown in Table IV. As can be read from Table IV, surprisingly, the contrastive learning framework without clustering guidance does not work at all.

TABLE IV: Ablation study on Market-1501.

| Components | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| CACL w/o clustering | 0.3 | 0.5 | 1.2 | 2.3 |
| CACL w/o stopGrad | 80.2 | 92.0 | 97.0 | 97.6 |
| CACL | 80.9 | 92.7 | 97.4 | 98.5 |

Improvements Brought by Suppressing Colors. To suppress the influence of colors, CACL applies a gray-scale transform G(·) on top of the data augmentation T'(·) for the second network branch. To validate the effectiveness of suppressing colors, we conduct a set of experiments under different settings: a) simply using the data augmentation T'(·) with raw color; b) using another data augmentation approach, "color-jitter", denoted as J(·), to replace G(·), whose output is still a color image; and c) using the gray-scale transform G(·) after T'(·). It should be emphasized that, in the implementation, the "color-jitter" operation applies random-amplitude changes to the image. We display image samples processed with the different data augmentation methods in Fig. 4. As can be observed, "color-jitter" did change the images, but the color information still dominates.

Fig. 4. Illustration of the raw images and the augmented images. First row: raw images. Second row: "color-jitter". Bottom row: "gray-scale".

The experimental results are provided in Table V. We can read that using "color-jitter" J(·) yields some performance improvement, but using "gray-scale" G(·) yields the best performance improvement. When combined with the cluster refinement step, we observe a similar result: using "gray-scale" G(·) yields a better performance improvement than using "color-jitter" J(·). These results validate that suppressing colors is effective for gaining performance improvement. Compared to using "gray-scale", using "color-jitter" does not truly eliminate the influence brought by colors; that is to say, after using color-jitter, the color information still dominates.

TABLE V: Performance comparison on using color data augmentations and the gray-scale transform for the second network branch (Market-1501).

| Components | Cluster Refine | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|---|
| T'(·) | | 70.3 | 87.4 | 94.6 | 96.5 |
| J(T'(·)) | | 72.5 | 87.8 | 95.3 | 96.9 |
| G(T'(·)) | | 74.4 | 89.3 | 95.9 | 96.7 |
| T'(·) | X | 79.0 | 90.6 | 96.3 | 97.1 |
| J(T'(·)) | X | 79.1 | 90.8 | 96.7 | 97.8 |
| G(T'(·)) | X | 80.9 | 92.7 | 97.4 | 98.5 |

To further reveal the mechanism of why using "gray-scale" works better than using "color-jitter" in the proposed framework, we show the statistical histograms of the color distributions
of the raw images, the color-jittered images, and the gray-scale images, respectively. Specifically, we compute the statistical histograms of the intensity values in the RGB channels of the raw color images and of the images after applying "color-jitter" and "gray-scale", over 500 images sampled at random from the training data of Market-1501. The statistical results are shown in Fig. 5.

Fig. 5. Comparison of the intensity histograms in the RGB channels under different data augmentation operations: (a) Raw Images, (b) Color-Jitter, (c) Gray-Scale.
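A sketch of how such per-channel histograms can be computed is given below; the comparison over 500 randomly sampled images follows the text, while the concrete jitter magnitudes and the function names are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

def channel_histograms(paths, transform=None, bins=256):
    """Accumulate per-channel intensity histograms over a set of images,
    optionally after an augmentation such as color-jitter or grayscale
    (the statistic behind Fig. 5)."""
    hist = np.zeros((3, bins))
    for p in paths:
        img = Image.open(p).convert("RGB")
        if transform is not None:
            img = transform(img)
        arr = np.asarray(img)
        for c in range(3):
            hist[c] += np.histogram(arr[..., c], bins=bins, range=(0, 256))[0]
    return hist / hist.sum(axis=1, keepdims=True)   # normalize per channel

# Hypothetical usage mirroring the three panels of Fig. 5:
# h_raw  = channel_histograms(paths)
# h_jit  = channel_histograms(paths, transforms.ColorJitter(0.5, 0.5, 0.5, 0.1))
# h_gray = channel_histograms(paths, transforms.Grayscale(num_output_channels=3))
```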

We can observe that using "gray-scale" yields a roughly consistent distribution in the histogram compared to the raw images, whereas the histogram of the images after using "color-jitter" has some notable deviations from that of the raw images. In the histogram of using "gray-scale", the proportion of pixels at the two extreme values (i.e., 0 and 255) is significantly reduced; whereas in the histogram of using "color-jitter", the proportion of pixels at the two extreme values, especially at 0, is significantly magnified. This phenomenon might damage the content consistency with the raw image. The difference in the consistency of the histograms reveals the essential advantage of using "gray-scale", rather than "color-jitter", to suppress the influence of colors.

Evaluation on the Parameters in DBSCAN. We conduct experiments to evaluate the parameter d used to find the neighbors. In cluster refinement, we use DBSCAN with a smaller parameter d', where d' := d − δ, to find the over-segmentation. We conduct experiments on Market-1501 to evaluate the effects of changing the two parameters. The experiments are recorded in Table VI. We can find that, while the change of d affects the baseline performance, our CACL still improves the model performance significantly. Note that even though the baseline performance drops sharply when using d = 0.7, our method still achieves a good performance, which is also higher than the other unsupervised methods in Table I.

TABLE VI: Performance comparison of different cluster parameter d (the maximum distance between neighboring points) on CACL and the baseline method. The cells report mAP / Rank-1.

| d | Market-1501 Baseline | Market-1501 CACL | DukeMTMC-ReID Baseline | DukeMTMC-ReID CACL |
|---|---|---|---|---|
| 0.4 | 68.6 / 85.9 | 75.2 / 91.4 | 60.1 / 77.5 | 62.0 / 77.7 |
| 0.5 | 71.2 / 86.5 | 81.6 / 93.0 | 63.4 / 80.3 | 67.5 / 81.8 |
| 0.6 | 68.1 / 85.2 | 80.9 / 92.7 | 62.5 / 78.5 | 69.6 / 82.6 |
| 0.7 | 43.8 / 71.5 | 75.8 / 90.1 | 4.1 / 10.3 | 66.7 / 80.6 |

The cluster refinement is an important component in our proposed CACL, and δ is an important parameter for finding the over-segmentation of the raw clusters. Thus, we further conduct experiments to evaluate the performance of using different values of δ. The experimental results are shown in Table VII. We can find that the performance is not too sensitive to δ. When using δ = 0.02, the performance achieves the best, i.e., 80.9%/92.7% at mAP/Rank-1 on Market-1501 and 69.6%/82.6% at mAP/Rank-1 on DukeMTMC-ReID.

TABLE VII: Model performance with different δ on Market-1501. The cells report mAP / Rank-1.

| δ | d = 0.4 | d = 0.5 | d = 0.6 | d = 0.7 |
|---|---|---|---|---|
| 0.02 | 75.2 / 91.4 | 81.6 / 93.0 | 80.9 / 92.7 | 75.8 / 90.1 |
| 0.04 | 70.8 / 89.5 | 80.4 / 92.6 | 80.3 / 92.3 | 68.7 / 86.2 |
| 0.06 | 65.8 / 87.2 | 77.7 / 91.7 | 79.0 / 91.4 | 8.2 / 20.3 |
| 0.08 | 64.3 / 86.2 | 76.6 / 91.2 | 78.5 / 91.3 | 6.1 / 15.6 |

Moreover, we also test the stop-gradient operation under different structures. As can be read from Table IV, the performance of the framework with the asymmetric structure drops only slightly (i.e., only 0.7% lower than that of using the stop-gradient operation) when the stop-gradient operation is not used. This hints that the framework with the asymmetric structure in CACL does not highly depend on the stop-gradient operation.

Evaluation on the Performance of the Two Branches. To further reveal the performance of the trained networks, we record the performance of using the output features of each of the two network branches F(·|Θ) and F'(·|Θ'), separately, for person Re-ID in Table VIII. We can read that using the output features of the second branch F'(·|Θ') yields significantly lower performance than using the output features of the first branch F(·|Θ), and the result of using F'(·|Θ') is similar to the result of the experiments without using L_C^(intra). This is because the second network branch pays attention to learning features from gray-scale images, and thus lacks the ability to capture the richer information in color images.

TABLE VIII: Performance comparison on F(·|Θ) and F'(·|Θ') (Market-1501).

| Branch | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| F(·|Θ) (Color) | 80.9 | 92.7 | 97.4 | 98.5 |
| F'(·|Θ') (Gray-Scale) | 43.8 | 71.5 | 83.9 | 87.1 |

Evaluation on the Memory Update Parameter α. We conduct experiments to evaluate the effects of the memory update parameter α and show the results in Table IX. We can find that our CACL is not sensitive to changes of the memory update parameter α, except for α = 1. When using α = 1, the model performance drops significantly because the memory bank is never updated in this case. When using α = 0.2, the model achieves the best performance on Market-1501, i.e., 80.9%/92.7% at mAP/Rank-1.

TABLE IX: Performance comparison on different α (Market-1501).

| α | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| 0.0 | 75.1 | 89.8 | 96.3 | 97.3 |
| 0.2 | 80.9 | 92.7 | 97.4 | 98.5 |
| 0.4 | 80.8 | 92.5 | 97.1 | 98.2 |
| 0.6 | 80.2 | 92.4 | 97.2 | 98.3 |
| 0.8 | 77.3 | 90.9 | 96.6 | 98.0 |
| 1.0 | 4.3 | 10.9 | 19.9 | 24.9 |

Evaluation on Performance with Ground-truth Labels. We compare our CACL to the baseline method trained with the ground-truth labels (i.e., in the supervised setting). The results are shown in Table X. We can find that CACL achieves good performance under the unsupervised setting, which is merely 3%/1.1% lower at mAP/Rank-1 than the baseline method trained with the ground-truth labels on Market-1501. Moreover, if we provide ground-truth labels to train our CACL (i.e., CACL + labels), notable improvements in performance over the supervised baseline method can be observed.

TABLE X: Performance comparison to the baseline method in the supervised setting. "Baseline + labels" means that we use the ground-truth labels to train the baseline method, whereas "CACL + labels" means that we use the ground-truth labels to train our CACL.

| Method | Market-1501 mAP | Rank-1 | DukeMTMC-ReID mAP | Rank-1 |
|---|---|---|---|---|
| CACL | 80.9 | 92.7 | 69.6 | 82.6 |
| Baseline + labels | 83.9 | 93.6 | 73.3 | 86.6 |
| CACL + labels | 85.7 | 94.2 | 74.9 | 87.2 |

F. Data Visualization

To gain some intuitive understanding of the performance of our proposed CACL, we conduct a set of data visualization experiments on Market-1501 to visualize the clustering results of the learned features under two different training strategies: a) without using the contrastive losses L_C + L_I; and b) using the contrastive losses L_C + L_I.

The experimental results are shown in Fig. 6. We can observe that the contrastive loss L_C + L_I did help the model distinguish similar images while maintaining the cluster compactness, and also separate overlapping individual samples from each other. This confirms the effectiveness of our proposed approach, and it also shows that our approach can attenuate the influence of clothing color.

Fig. 6. Data visualization via t-SNE of the learned features and clusters under two different training strategies: training without L_C and L_I (left), as mentioned in Table III, and our full CACL (right). The data points come from the Market-1501 training set (1,000 images of 60 identities). Points with the same color correspond to images of the same identity. To demonstrate the difference between the two distributions in detail, we further zoom in on the circled clusters and show the corresponding images. The images in the boxes are similar to each other, and the corresponding data points are very close to each other or even overlapping in the feature space if the model is trained without using L_C and L_I, as shown in the left box; whereas using the contrastive losses L_C and L_I effectively distinguishes these data points while maintaining the cluster compactness, as shown in the right box.

At the same time, we also select some query samples with the top-10 best-matching images in the gallery set and show them in Fig. 7. Compared to the baseline model, our approach returns more accurate results. We can find that most of the wrong samples matched by the baseline model are dressed in the same color as the query sample. These results suggest that our approach can effectively ignore the interference caused by samples with similar colors and thus find more accurate matches.

Fig. 7. Visualization of the top-10 best-matched images. We show the top-10 best-matching samples in the gallery set for each query sample with the baseline method and our proposed CACL. The images with frames in green and in red are the correctly matched images and the mismatched images, respectively.

V. CONCLUSION

We have proposed a Cluster-guided Asymmetric Contrastive Learning (CACL) approach for unsupervised person Re-ID, in which cluster information is leveraged to guide the feature learning in a properly designed contrastive learning framework. Specifically, in our proposed CACL, instance-level contrastive learning is conducted with respect to the asymmetric data augmentation, and cluster-level contrastive learning is conducted with respect to the refined clustering result. By leveraging the refined cluster result in contrastive learning, CACL is able to effectively exploit the invariance within and between different data augmentation views for learning more effective features beyond the dominating colors. In addition, we confirmed that the refined clustering result helps our CACL approach mine invariant information more effectively at the cluster level. We have conducted extensive experiments on three benchmark datasets and demonstrated the superior performance of our proposal.

As future work, it is interesting and promising to incorporate attention mechanisms (e.g., [50], [51]), clustering ensembles and hybrid contrastive learning strategies (e.g., [52]), or side information in the dataset (e.g., [12]) to further enrich the representation capacity, improve the stability, and enhance the overall performance of the proposed framework. Moreover, in other related fields, such as face recognition or vehicle re-identification (e.g., [53], [54]), whether suppressing the dominating color can also bring a positive influence is a very interesting direction worth exploring.
R EFERENCES
REFERENCES

[1] L. Zheng, Y. Yang, and A. G. Hauptmann, "Person re-identification: Past, present and future," arXiv preprint arXiv:1610.02984, 2016.
[2] Y. Ge, D. Chen, and H. Li, "Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification," in International Conference on Learning Representations, 2020.
[3] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[4] K. Zeng, M. Ning, Y. Wang, and Y. Guo, "Hierarchical clustering with hard-batch triplet loss for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 13657–13665.

[5] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang, "Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification," in IEEE International Conference on Computer Vision, 2019, pp. 6112–6121.
[6] J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy, "Delving into inter-image invariance for unsupervised visual representations," in Conference and Workshop on Neural Information Processing Systems, 2020.
[7] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, "Unsupervised learning of visual features by contrasting cluster assignments," in Advances in Neural Information Processing Systems, 2020, pp. 9912–9924.
[8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning, 2020, pp. 1597–1607.
[9] X. Chen and K. He, "Exploring simple siamese representation learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
[10] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, "Bootstrap your own latent - a new approach to self-supervised learning," in Advances in Neural Information Processing Systems, 2020, pp. 21271–21284.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Conference and Workshop on Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, "Joint generative and contrastive learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 2004–2013.
[13] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, "Adaptive transfer network for cross-domain person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7202–7211.
[14] S. Bak, P. Carr, and J.-F. Lalonde, "Domain adaptation through synthesis for unsupervised person re-identification," in European Conference on Computer Vision, 2018, pp. 189–205.
[15] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, "Unsupervised cross-dataset transfer learning for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1306–1315.
[16] J. Wang, X. Zhu, S. Gong, and W. Li, "Transferable joint attribute-identity deep learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
[17] Y. Ge, F. Zhu, D. Chen, R. Zhao, and H. Li, "Self-paced contrastive learning with hybrid memory for domain adaptive object re-id," in Advances in Neural Information Processing Systems, 2020, pp. 11309–11321.
[18] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang, "A bottom-up clustering approach to unsupervised person re-identification," in AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8738–8745.
[19] H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, p. 83, 2018.
[20] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian, "Unsupervised person re-identification via softened similarity learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 3390–3399.
[21] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[22] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning, 2015, pp. 1180–1189.
[23] D. Wang and S. Zhang, "Unsupervised person re-identification via multi-label classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10981–10990.
[24] P. Bojanowski and A. Joulin, "Unsupervised learning by predicting noise," in International Conference on Machine Learning, 2017, pp. 517–526.
[25] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.
[26] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng, "Contrastive clustering," in AAAI Conference on Artificial Intelligence, 2021.
[27] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[29] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian, "Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9021–9030.
[30] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[32] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[34] F. Yang, Z. Zhong, Z. Luo, S. Lian, and S. Li, "Leveraging virtual and real person for unsupervised person re-identification," IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2444–2453, 2019.
[35] Z. Zhong, L. Zheng, S. Li, and Y. Yang, "Generalizing a person retrieval model hetero- and homogeneously," in European Conference on Computer Vision, 2018, pp. 172–188.
[36] Z. Ji, X. Zou, X. Lin, X. Liu, T. Huang, and S. Wu, "An attention-driven two-stage clustering method for unsupervised person re-identification," in European Conference on Computer Vision, 2020, pp. 20–36.
[37] J. Li and S. Zhang, "Joint visual and temporal consistency for unsupervised domain adaptive person re-identification," in European Conference on Computer Vision, 2020.
[38] Y. Zhai, Q. Ye, S. Lu, M. Jia, R. Ji, and Y. Tian, "Multiple expert brainstorming for domain adaptive person re-identification," in European Conference on Computer Vision, 2020, pp. 594–611.
[39] F. Zhao, S. Liao, G.-S. Xie, J. Zhao, K. Zhang, and L. Shao, "Unsupervised domain adaptation with noise resistible mutual-training for person re-identification," in European Conference on Computer Vision, 2020, pp. 526–544.
[40] H.-X. Yu, A. Wu, and W.-S. Zheng, "Cross-view asymmetric metric learning for unsupervised person re-identification," in IEEE International Conference on Computer Vision, 2017, pp. 994–1002.
[41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[42] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[43] M. Wang, B. Lai, J. Huang, X. Gong, and X.-S. Hua, "Camera-aware proxies for unsupervised person re-identification," in AAAI Conference on Artificial Intelligence, vol. 2, 2021, p. 4.
[44] X. Pan, P. Luo, J. Shi, and X. Tang, "Two at once: Enhancing learning and generalization capacities via IBN-Net," in European Conference on Computer Vision, 2018, pp. 464–479.
[45] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision, 2016, pp. 17–35.
[46] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[47] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[48] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, "Invariance matters: Exemplar memory for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 598–607.

[49] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, "Unsupervised feature learning via non-parametric instance discrimination," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[50] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, "Dual attention matching network for context-aware feature sequence based person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5363–5372.
[51] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2021.
[52] H. Sun, M. Li, and C.-G. Li, "Hybrid contrastive learning with cluster ensemble for unsupervised person re-identification," arXiv preprint arXiv:2201.11995, 2022.
[53] X. Liu, W. Liu, H. Ma, and H. Fu, "Large-scale vehicle re-identification in urban surveillance videos," in IEEE International Conference on Multimedia and Expo (ICME), 2016, pp. 1–6.
[54] X. Liu, W. Liu, T. Mei, and H. Ma, "Provid: Progressive and multi-modal vehicle reidentification for large-scale urban surveillance," IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645–658, 2017.
