Cluster-Guided Asymmetric Contrastive Learning for Unsupervised Person Re-Identification

Mingkun Li, Chun-Guang Li, and Jun Guo
Beijing University of Posts and Telecommunications

arXiv:2106.07846v2 [cs.CV], 9 May 2022
Abstract—Unsupervised person re-identification (Re-ID) aims
to match pedestrian images from different camera views in an
unsupervised setting. Existing methods for unsupervised person
Re-ID are usually built upon the pseudo labels from clustering.
Fig. 3. Illustration of our proposed Cluster-guided Asymmetric Contrastive Learning (CACL) framework. After training, we keep only the ResNet F(·|Θ) in the first branch for inference and use the feature x_i for testing.
the low-quality pseudo labels will contaminate the network training. Therefore, it is necessary to design a cluster refinement method that improves the clustering quality before feeding the pseudo labels to train the network.

B. Contrastive Learning

In recent years, with the development and application of the siamese network, contrastive learning began to emerge in the field of unsupervised learning. Contrastive learning aims at learning good image representations. It learns invariance in features by manipulating sets of positive and negative samples constructed via data augmentation.

The existing methods for contrastive learning can be categorized into: a) instance-level methods [8], [9], [24], [25], [10] and b) cluster-level methods [7], [6], [26]. Instance-level methods regard each image as an individual class, consider two augmented views of the same image as a positive pair, and treat the other samples in the same batch (or memory bank) as negative pairs. For example, SimCLR [8] regards the other samples in the current batch as negative samples; MoCo [27] uses a dictionary to implement contrastive learning, converting one branch of the contrastive learning into a momentum encoder; SimSiam [9] proposed a stop-gradient method that can train the siamese network without negative samples. Cluster-level methods regard the samples in the same cluster as positive samples and the other samples as negative samples. For example, in [6] the InfoNCE loss is combined with the MarginNCE loss to attract positive samples and repel negative samples; in [7] multi-crop data augmentation is used to enhance the robustness of the network and a scalable online clustering method is proposed to explore the inter-invariance of clusters; in [26] weight-sharing deep neural networks are used to extract features from sample pairs with different data augmentations, and contrastive clustering is performed with respect to the features in both the row and column spaces.

However, in the unsupervised setting, the instance-level contrastive learning methods simply make each sample independently repel the others, which ignores the cluster information. In contrast, cluster-level contrastive learning can effectively mine cluster information, but it relies heavily on the clustering result. Unfortunately, in the early training stage, the features are not good enough to yield a good clustering result. Thus, an effective way to train the network by combining the two lines of contrastive learning methods is needed.

In this paper, we attempt to bridge the two lines of contrastive learning methods into a unified framework to form effective mutual learning and joint training: a) the instance-level contrastive learning helps train the network to perform feature learning, especially in the early training stage; meanwhile, b) the cluster-level contrastive learning helps train the network, especially when the quality of the clustering has been improved. In this way, the self-supervision information imposed by data augmentation and the weak supervision information obtained from clustering can be fully exploited without the need for negative sample pairs.

III. OUR PROPOSAL: CLUSTER-GUIDED ASYMMETRIC CONTRASTIVE LEARNING (CACL)

This section presents our proposal, the Cluster-guided Asymmetric Contrastive Learning (CACL) approach for unsupervised person Re-ID.

For clarity, we show the architecture of our proposed CACL in Fig. 3. Overall, our CACL is a siamese network, which consists of two branches of backbone networks F(·|Θ) and F'(·|Θ') without sharing parameters, where Θ and Θ' are the parameters of the two networks, respectively, and a predictor layer G(·|Ψ) is added after the first branch, where Ψ denotes the parameters of the predictor layer. The backbone networks F(·|Θ) and F'(·|Θ') are implemented via ResNet-50 [28] for feature learning.¹

¹ It also works if backbone networks other than ResNet-50 are used.
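To make the architecture concrete, the following is a minimal PyTorch-style sketch of the asymmetric siamese structure described above (two unshared ResNet-50 encoders and a predictor head after the first branch only). It is an illustrative sketch under our own assumptions, not the authors' released code; in particular, the predictor width (2048-512-2048) and the use of torchvision's resnet50 are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CACLBackbone(nn.Module):
    """Sketch of the asymmetric siamese structure: F(.|Theta), F'(.|Theta'), predictor G(.|Psi)."""
    def __init__(self, feat_dim=2048, pred_hidden=512):
        super().__init__()
        # Two ResNet-50 encoders that do NOT share parameters (assumption: pooled output used as feature).
        self.encoder_1 = nn.Sequential(*list(models.resnet50(pretrained=True).children())[:-1])  # F(.|Theta)
        self.encoder_2 = nn.Sequential(*list(models.resnet50(pretrained=True).children())[:-1])  # F'(.|Theta')
        # Predictor head is attached to the FIRST branch only (asymmetry in network structure).
        self.predictor = nn.Sequential(
            nn.Linear(feat_dim, pred_hidden), nn.BatchNorm1d(pred_hidden), nn.ReLU(inplace=True),
            nn.Linear(pred_hidden, feat_dim),
        )  # G(.|Psi)

    def forward(self, img_aug, img_gray):
        x = self.encoder_1(img_aug).flatten(1)         # x_i: feature of the color-augmented view
        x_tilde = self.encoder_2(img_gray).flatten(1)  # x~_i: feature of the gray-scale view
        z = self.predictor(x)                          # z_i: prediction output of the first branch
        return x, x_tilde, z
```

At test time only encoder_1 would be kept, matching the inference procedure in Section III-D.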
Given an unlabeled image dataset I = {I_i}_{i=1}^N consisting of N samples, for an input image I_i ∈ I we generate two samples Î_i and Ĩ_i via different data augmentation strategies as the inputs of the two branches, respectively, in which Î_i = T(I_i) and Ĩ_i = G(T'(I_i)), where T(·) and T'(·) denote two different transforms and G(·) denotes the operation that transforms a color image into a gray-scale image. For simplicity, we denote the output features of the first and the second network branch as x_i and x̃_i, respectively, and denote the output of the predictor layer in the first branch as z_i, where x_i, x̃_i, z_i ∈ R^D.
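As a concrete illustration of the asymmetric augmentation Î_i = T(I_i) and Ĩ_i = G(T'(I_i)), here is a small torchvision-based sketch; the particular crop size, flip, and erasing parameters are our assumptions, since this section does not list the exact augmentation recipe.

```python
from torchvision import transforms

# T(.): augmentation for the first branch (assumed recipe: resize, flip, crop, erase).
transform_t = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop((256, 128), padding=10),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])

# G(T'(.)): augmentation for the second branch followed by the gray-scale operation G(.).
# Grayscale(num_output_channels=3) keeps a 3-channel image so the same ResNet-50 stem applies.
transform_t_prime_gray = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop((256, 128), padding=10),
    transforms.Grayscale(num_output_channels=3),  # G(.): suppress color information
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])
```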
The clustering result of the output features X := {x_1, ···, x_N} from the first network branch is used to generate the pseudo labels Y := {y_1, ···, y_N}. We exploit the pseudo labels to leverage the cluster information in the contrastive learning. Specifically, in the training stage, the two network branches F(·|Θ) and F'(·|Θ') are trained with the augmented samples without sharing parameters, and the pseudo labels Y are used to guide the training of both network branches.

In CACL, we use instance memory banks M = {v_i}_{i=1}^N and M̃ = {ṽ_i}_{i=1}^N, where v_i, ṽ_i ∈ R^D, to store the outputs of the two branches, respectively. Both instance memory banks M and M̃ are initialized with X := {x_1, ···, x_N} and X̃ := {x̃_1, ···, x̃_N}, which are the outputs of the network branches F(·|Θ) and F'(·|Θ') pre-trained on ImageNet, respectively.

A. Cluster-guided Contrastive Learning

At the beginning, we pre-train the two network branches F(·|Θ) and F'(·|Θ') on ImageNet [11], and use the features from the first network branch F(·|Θ) to yield m clusters, which are denoted as C := {C^(1), C^(2), ···, C^(m)}. The clustering result is used to form pseudo labels to train the cluster-guided contrastive learning module.

To exploit the label invariance between the two augmented views and leverage the cluster structure, we employ two types of contrastive losses: a) an instance-level contrastive loss, denoted as L_I, and b) a cluster-level contrastive loss, denoted as L_C.

Instance-Level Contrastive Loss. To match the feature outputs z_i and x̃_i of the two network branches at the instance level, similar to [8], [10], we introduce the negative cosine similarity between the prediction output z_i of the first branch and the feature output x̃_i of the second branch to define the instance-level contrastive loss L_I as follows:

\mathcal{L}_I := -\frac{z_i^\top \tilde{x}_i}{\|z_i\|_2 \,\|\tilde{x}_i\|_2}, \quad (1)

where ‖·‖_2 is the ℓ_2-norm.
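Below is a minimal sketch of the instance-level loss in Eq. (1), written as a negative cosine similarity with the stop-gradient on the second branch that Section III-C describes; the function name and the use of F.cosine_similarity are our choices, not the authors' code.

```python
import torch.nn.functional as F

def instance_level_loss(z, x_tilde):
    """Eq. (1): L_I = -cos(z_i, x~_i), averaged over the batch.
    x_tilde is detached so that this loss does not back-propagate into the
    second branch (the stop-gradient operation mentioned in Sec. III-C)."""
    return -F.cosine_similarity(z, x_tilde.detach(), dim=1).mean()
```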
Cluster-Level Contrastive Loss. To leverage the cluster structure to further explore the hidden information from different views, we propose a cluster-level contrastive loss L_C, which is further divided into an inter-views cluster-level contrastive loss and an intra-views cluster-level contrastive loss.

• Inter-views cluster-level contrastive loss, denoted as L_C^(inter), which is defined as:

\mathcal{L}_C^{(\mathrm{inter})} := -\frac{z_i^\top \tilde{u}_{\omega(I_i)}}{\|z_i\|_2 \,\|\tilde{u}_{\omega(I_i)}\|_2}, \quad (2)

where ω(I_i) finds the cluster index ℓ for z_i, and ũ_ℓ is the center vector of the ℓ-th cluster, in which Ũ := {ũ_1, ···, ũ_{m'}} and the cluster center ũ_ℓ is defined as

\tilde{u}_\ell = \frac{1}{|\mathcal{C}^{(\ell)}|} \sum_{I_i \in \mathcal{C}^{(\ell)}} \tilde{v}_i, \quad (3)

where ṽ_i is the instance feature of image Ĩ_i in the instance memory bank M̃ and C^(ℓ) is the ℓ-th cluster. The inter-views cluster-level contrastive loss L_C^(inter) defined in Eq. (2) is used to reduce the discrepancy between the prediction output z_i of the first network branch and the cluster center ũ_ℓ of the feature outputs of the second branch with the gray-scale view.

• Intra-views cluster-level contrastive loss, denoted as L_C^(intra), which is defined as:

\mathcal{L}_C^{(\mathrm{intra})} = -(1 - q_i)^2 \ln(q_i) - (1 - \tilde{q}_i)^2 \ln(\tilde{q}_i), \quad (4)

where q_i and q̃_i are the softmax of the inner products between the network outputs and the cluster centers of the corresponding instance memory bank, defined as

q_i = \frac{\exp(u_{\omega(I_i)}^\top x_i / \tau)}{\sum_{\ell=1}^{m'} \exp(u_\ell^\top x_i / \tau)}, \quad (5)

\tilde{q}_i = \frac{\exp(\tilde{u}_{\omega(I_i)}^\top \tilde{x}_i / \tau)}{\sum_{\ell=1}^{m'} \exp(\tilde{u}_\ell^\top \tilde{x}_i / \tau)}, \quad (6)

where u_ℓ and ũ_ℓ are the center vectors of the ℓ-th cluster for the first branch and the second branch, respectively, in which ũ_ℓ is defined in Eq. (3) and u_ℓ is defined as

u_\ell = \frac{1}{|\mathcal{C}^{(\ell)}|} \sum_{I_i \in \mathcal{C}^{(\ell)}} v_i, \quad (7)

where v_i is the instance feature of image Î_i in the instance memory bank M. Note that both x_i and x̃_i share the same pseudo label ω(I_i) from clustering. The intra-views cluster-level contrastive loss L_C^(intra) in Eq. (4) is used to encourage the siamese network to learn features with respect to the corresponding cluster center for the two branches, respectively.

Putting the loss functions in Eqs. (2) and (4) together, we have the cluster-level contrastive loss L_C as follows:

\mathcal{L}_C := \mathcal{L}_C^{(\mathrm{inter})} + \mathcal{L}_C^{(\mathrm{intra})}. \quad (8)

Remark 1. The cluster-level contrastive loss L_C in Eq. (8) aims to leverage the clustering information to minimize the difference between the samples of the same cluster from different augmentation views via L_C^(inter), and within the same augmentation view via L_C^(intra). This helps the siamese network mine the hidden information brought by the basic augmented view in the first branch and the gray-scale augmented view in the second branch, preventing feature collapse to a trivial solution and imposing supervision information to learn features other than colors.
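The sketch below illustrates Eqs. (2)-(8) under our own simplifying assumptions: the cluster centers are recomputed from the memory banks with the current pseudo labels, the temperature τ defaults to 0.05, and no normalization is applied beyond what the equations require. It is meant to convey the structure of the loss, not to reproduce the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cluster_centers(memory, pseudo_labels, num_clusters):
    """Eqs. (3) and (7): mean of memory-bank features per cluster."""
    centers = torch.zeros(num_clusters, memory.size(1), device=memory.device)
    for c in range(num_clusters):
        centers[c] = memory[pseudo_labels == c].mean(dim=0)
    return centers

def cluster_level_loss(z, x, x_tilde, mem, mem_tilde, pseudo_labels, num_clusters, tau=0.05):
    u = cluster_centers(mem, pseudo_labels, num_clusters)              # u_l from M
    u_tilde = cluster_centers(mem_tilde, pseudo_labels, num_clusters)  # u~_l from M~
    lbl = pseudo_labels  # omega(I_i) for each sample in the batch (long tensor assumed)

    # Eq. (2): inter-views loss, negative cosine between z_i and the gray-scale-branch center.
    loss_inter = -F.cosine_similarity(z, u_tilde[lbl], dim=1).mean()

    # Eqs. (5)-(6): softmax over cluster centers; Eq. (4): focal-style negative log-likelihood.
    q = F.softmax(x @ u.t() / tau, dim=1).gather(1, lbl.view(-1, 1)).squeeze(1)
    q_tilde = F.softmax(x_tilde @ u_tilde.t() / tau, dim=1).gather(1, lbl.view(-1, 1)).squeeze(1)
    loss_intra = (-(1 - q) ** 2 * torch.log(q) - (1 - q_tilde) ** 2 * torch.log(q_tilde)).mean()

    # Eq. (8): L_C = L_C^(inter) + L_C^(intra)
    return loss_inter + loss_intra
```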
B. Clustering and Cluster Refinement

Note that the cluster-level contrastive loss is greatly affected by the quality of the clustering result. When the clusters are noisy, they will have a negative effect on the training. To improve the quality of the clustering result, we propose a cluster refinement method which removes a proportion of noisy samples in larger clusters, helping the model to better learn the information at the cluster level.

For a cluster, we want to keep the samples with higher similarity and remove the samples with lower similarity. Given a set of raw clusters, denoted as {C^(1), C^(2), ···, C^(m)}, without loss of generality, we pick C^(i) to perform cluster refinement. At first, we obtain an over-segmentation of C^(i), i.e., C^(i) is further divided into {C_1^(i), C_2^(i), ···, C_{n_i}^(i)}. Then we perform cluster refinement according to the following criterion:

\text{if } D(\mathcal{C}_j^{(i)} \mid \mathcal{C}^{(i)}) < D(\mathcal{C}^{(i)}), \text{ then } \mathcal{C}_j^{(i)} \text{ is kept}; \quad (9)

otherwise C_j^(i) is removed, where D(C_j^(i) | C^(i)) is the average inter-distance from all samples in the sub-cluster C_j^(i) to the other samples in cluster C^(i), and D(C^(i)) is the average intra-distance among samples in cluster C^(i).

After such a post-processing step, the clusters of larger size are improved and, at the same time, more singletons or tiny clusters are produced. We denote the refined clusters as C' = {C^(1), C^(2), ···, C^(m')}, where m' ≥ m. Compared to tiny clusters and singletons, the larger clusters are more informative in providing pseudo supervision information to guide the contrastive learning.

Remark 2. In the implementation, we use the DBSCAN algorithm [3] to generate both the raw clusters and the over-segmentation of the clusters. DBSCAN [3] is a density-based clustering algorithm: it regards a data point as density-reachable if the data point lies within a small distance threshold d of other samples, where the parameter d is the distance threshold used to find neighboring points. Specifically, to generate the raw clusters, we employ DBSCAN with a slightly larger distance threshold parameter d (e.g., d = 0.6); whereas to generate the over-segmentation, we use a slightly smaller distance threshold parameter d', where d' := d − δ (e.g., δ = 0.02). We will show the influence of the parameters δ and d in the experiments.
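The following is a sketch of the refinement in Eq. (9) and Remark 2, assuming scikit-learn's DBSCAN on a precomputed pairwise distance matrix and simple averages for the inter-/intra-distances; the exact distance metric, the min_samples value, and the handling of DBSCAN outliers (label -1) are our assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def refine_clusters(dist, d=0.6, delta=0.02, min_samples=4):
    """Run DBSCAN twice (eps=d for raw clusters, eps=d-delta for the over-segmentation)
    and keep only the sub-clusters that satisfy the criterion in Eq. (9)."""
    raw = DBSCAN(eps=d, min_samples=min_samples, metric='precomputed').fit_predict(dist)
    fine = DBSCAN(eps=d - delta, min_samples=min_samples, metric='precomputed').fit_predict(dist)

    keep = np.zeros(len(dist), dtype=bool)
    for c in set(raw) - {-1}:                            # iterate over raw clusters C^(i)
        members = np.where(raw == c)[0]
        intra = dist[np.ix_(members, members)].mean()    # D(C^(i)): average intra-distance
        for sub in set(fine[members]) - {-1}:            # sub-clusters C_j^(i) of the over-segmentation
            sub_idx = members[fine[members] == sub]
            rest = np.setdiff1d(members, sub_idx)
            if len(rest) == 0:
                keep[sub_idx] = True
                continue
            inter = dist[np.ix_(sub_idx, rest)].mean()   # D(C_j^(i) | C^(i)): average inter-distance
            if inter < intra:                            # Eq. (9): keep the sub-cluster
                keep[sub_idx] = True
    return raw, keep
```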
C. Training Procedure for Our CACL Approach

In CACL, the two branches of the siamese network are implemented with ResNet-50 [28] and do not share parameters. We pre-train the two network branches on ImageNet at first and use the learned features to initialize the two memory banks M and M̃, respectively.

In the training stage, we train both network branches at the same time with the total loss:

\mathcal{L} := \mathcal{L}_I + \mathcal{L}_C. \quad (10)

We update the two instance memory banks M and M̃, respectively, as follows:

v_i^{(t)} \leftarrow \alpha\, v_i^{(t-1)} + (1 - \alpha)\, x_i, \quad (11)

\tilde{v}_i^{(t)} \leftarrow \alpha\, \tilde{v}_i^{(t-1)} + (1 - \alpha)\, \tilde{x}_i, \quad (12)

where α is set to 0.2 by default (we will discuss the influence of α in the experiments).

In order to save computation cost,² we also use a stop-gradient operation as in SimSiam [9]. Specifically, we apply the stop-gradient operation [9] to the second network branch F'(·|Θ') when using the instance-level loss L_I in Eq. (1) to perform back-propagation. Thus, the parameters Θ' in the second network branch are updated only with the intra-views cluster-level contrastive loss L_C^(intra) in Eq. (4).

² Note that it is not necessary to use the stop-gradient operation in our CACL, because the clustering result provides enough guidance under the asymmetric structure to prevent collapse. Although this is similar to the method in SimSiam [9], the purpose is different and it is not necessary in our proposal.
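The following is a condensed sketch of one training iteration combining Eqs. (1), (8), and (10)-(12). It reuses the hypothetical helpers defined in the earlier sketches (instance_level_loss, cluster_level_loss, CACLBackbone); the optimizer choice and batch indexing are our assumptions, while the momentum value α = 0.2 follows the text.

```python
import torch

def train_step(model, optimizer, img_aug, img_gray, idx, pseudo_labels,
               mem, mem_tilde, num_clusters, alpha=0.2):
    """One CACL update: total loss L = L_I + L_C (Eq. 10) and memory-bank update (Eqs. 11-12)."""
    x, x_tilde, z = model(img_aug, img_gray)

    loss = instance_level_loss(z, x_tilde) + cluster_level_loss(
        z, x, x_tilde, mem, mem_tilde, pseudo_labels[idx], num_clusters)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Eqs. (11)-(12): momentum update of the instance memory banks with alpha = 0.2.
    with torch.no_grad():
        mem[idx] = alpha * mem[idx] + (1 - alpha) * x
        mem_tilde[idx] = alpha * mem_tilde[idx] + (1 - alpha) * x_tilde
    return loss.item()
```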
Remark 3. For clarity, we summarize the details of the training procedure in Algorithm 1. We note that the "asymmetry" in the proposed framework for cluster-guided contrastive learning lies in the following three aspects: a) asymmetry in the network structure, i.e., a predictor layer is added only after the first branch;³ b) asymmetry in data augmentation, i.e., the augmented samples provided to the second branch are further transformed into gray-scale; and c) asymmetry in pseudo-label generation, i.e., the output features of the first branch are used to generate the pseudo labels, which are shared with the second branch. Because of the asymmetry in the three aspects mentioned above, we term the proposed framework Cluster-guided Asymmetric Contrastive Learning (CACL).

³ It is also feasible to add another predictor layer after the second branch to obtain a symmetric network structure. Nevertheless, our experimental results show that merely marginal performance improvement is yielded by adding an extra predictor layer. Thus, we prefer to use the asymmetric network architecture for the contrastive learning framework.

Remark 4. There have been many unsupervised Re-ID methods [17], [13], [29], [12] that use contrastive learning to learn discriminant features. Most of them [13], [29], [12] are Generative Adversarial Network (GAN)-based methods and need additional supervised information to assist the training. For example, ATNet [13] trains multiple GANs by utilizing illumination and camera information, GCL [12] introduces pose information in training, and AD-cluster [29] generates cross-camera samples to assist the training. Unlike these methods, our proposed CACL uses an asymmetric siamese network to effectively learn fine-grained features by suppressing color with simple data augmentation operations during training, rather than using expensive sample generation via GANs. Compared to GAN-based methods, our CACL is simple, efficient, and effective.

D. Inference Procedure for CACL

After training, we keep only the ResNet F(·|Θ) in the first branch for inference in testing.

To be specific, in the inference procedure, we use the output features of the first branch F(·|Θ) to calculate the similarity between images. Given the gallery image dataset I^g = {I_i^g}_{i=1}^{N^g} and the query image dataset I^q = {I_i^q}_{i=1}^{N^q}, where N^g and N^q are the sizes of the two datasets, respectively, for each image I_i^q in the query we compute the distances between the query image and the images in the gallery I^g via their features.
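A brief sketch of the inference step: extract features with the first branch only and rank the gallery images by pairwise distance to each query; the use of plain Euclidean distance (rather than, e.g., re-ranking) is our assumption.

```python
import torch

@torch.no_grad()
def rank_gallery(encoder_1, query_imgs, gallery_imgs):
    """Inference with the first branch F(.|Theta) only: compute query-gallery distances and rank."""
    q_feat = encoder_1(query_imgs).flatten(1)      # features of query images
    g_feat = encoder_1(gallery_imgs).flatten(1)    # features of gallery images
    dist = torch.cdist(q_feat, g_feat, p=2)        # pairwise Euclidean distances (N_q x N_g)
    ranking = dist.argsort(dim=1)                  # indices of gallery images, closest first
    return dist, ranking
```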
TABLE I
Comparison to other state-of-the-art methods. 'UDA' refers to unsupervised domain adaptation methods and 'US' refers to purely unsupervised learning methods. '*' means that the used backbone is pre-trained on ImageNet.

Method | Type | Reference | Backbone | Market-1501 (mAP, Rank-1, Rank-5, Rank-10) | DukeMTMC-ReID (mAP, Rank-1, Rank-5, Rank-10)
PTGAN [30] UDA CVPR’18 GoogleNet [31] 15.7 38.6 57.3 - 13.5 27.4 43.6 -
SPGAN [32] UDA CVPR’18 ResNet50* [28] 26.7 58.1 76.0 82.7 26.4 46.9 62.6 68.5
TJ-AIDL [16] UDA CVPR’18 MobileNet* [33] 26.5 58.2 74.8 - 23.0 44.3 59.6 -
PGPPM [34] UDA CVPR’18 ResNet50* [28] 33.9 63.9 81.1 86.4 17.9 36.3 54.0 61.6
HHL [35] UDA ECCV’18 ResNet50* [28] 31.4 62.2 78.0 84.0 27.2 46.9 61.0 66.7
SSG [5] UDA ECCV’19 ResNet50* [28] 58.3 80.0 90.0 92.4 53.4 73.0 80.6 83.2
AD-cluster [29] UDA CVPR’20 ResNet50* [28] 68.3 86.7 94.4 96.5 54.1 72.6 82.5 85.5
ADTC [36] UDA ECCV’20 ResNet50* [28] 59.7 79.3 90.8 94.1 52.5 71.9 84.1 87.5
MMCL [23] UDA CVPR’20 ResNet50* [28] 60.4 84.4 92.8 95.0 51.4 72.4 82.9 85.0
MMT [2] UDA ICLR’20 ResNet50* [28] 73.8 89.5 96.0 97.6 62.3 76.3 87.7 91.2
JVTC [37] UDA ECCV’20 ResNet50* [28] 67.2 86.8 95.2 97.1 66.5 80.4 89.9 93.7
MEB [38] UDA ECCV’20 ResNet50* [28] 76.0 89.9 95.2 96.9 65.3 81.2 90.9 92.2
NRMT [39] UDA ECCV’20 ResNet50* [28] 71.7 87.8 94.6 96.5 62.2 77.8 86.9 89.5
SpCL [17] UDA NIPS’20 ResNet50* [28] 76.7 90.3 96.2 97.7 68.8 82.9 90.1 92.5
CAMEL [40] US ICCV’17 ResNet50* [28] 26.3 54.4 73.1 79.6 19.8 40.2 57.5 64.9
Bow [41] US ICCV’15 - 14.8 35.8 52.4 60.3 8.5 17.1 28.8 34.9
PUL [19] US TOMM’18 ResNet50* [28] 22.8 51.5 70.1 76.8 22.3 41.1 46.6 63.0
LOMO [42] US CVPR’15 - 8.0 27.2 41.6 49.1 4.8 12.3 21.3 26.6
BUC [18] US AAAI’19 ResNet50* [28] 30.6 61.0 71.6 76.4 21.9 40.2 52.7 57.4
HCT [4] US CVPR’20 ResNet50* [28] 56.4 80.0 91.6 95.2 50.1 69.6 83.4 87.4
SSL [20] US CVPR’20 ResNet50* [28] 37.8 71.7 83.8 87.4 28.6 52.5 63.5 68.9
SpCL [17] US NIPS’20 ResNet50* [28] 73.1 88.1 96.3 97.7 65.3 81.2 90.3 92.2
CAP [43] US AAAI’21 ResNet50* [28] 79.2 91.4 96.3 97.7 67.3 81.1 89.3 91.8
CACL US This paper ResNet50* [28] 80.9 92.7 97.4 98.5 69.6 82.6 91.2 93.8
CACL US This paper IBN-ResNet* [44] 83.6 93.3 97.7 98.3 72.5 85.5 92.9 94.9
TABLE III
Ablation study on Market-1501 and DukeMTMC-ReID.

Components | Cluster Refine | L_I | L_C^(intra) | L_C^(inter) | Market-1501 (mAP, Rank-1, Rank-5, Rank-10) | DukeMTMC-ReID (mAP, Rank-1, Rank-5, Rank-10)
Baseline | | | | | 68.1 85.2 94.0 96.0 | 62.5 78.5 88.5 90.3
+ L_C | | | X | X | 70.8 87.5 94.4 96.0 | 62.5 79.5 88.4 90.8
+ L_I | | X | | | 74.7 88.7 95.0 96.6 | 64.2 80.7 89.0 91.6
+ L_I + L_C | | X | X | X | 74.4 89.3 95.9 96.7 | 63.8 79.2 89.2 91.7
+ Cluster Refine | X | | | | 73.0 87.8 95.7 97.2 | 65.7 81.1 90.6 93.2
+ Cluster Refine + L_I | X | X | | | 78.2 91.2 97.0 98.1 | 67.6 81.8 90.2 93.0
+ Cluster Refine + L_I + L_C^(inter) | X | X | | X | 78.7 91.2 97.0 97.9 | 68.5 81.9 91.2 93.8
+ Cluster Refine + L_I + L_C^(intra) | X | X | X | | 79.2 91.9 96.7 98.0 | 68.3 82.1 90.3 93.2
+ Cluster Refine + L_C | X | | X | X | 80.4 92.2 97.1 98.2 | 68.8 82.2 91.3 93.8
Our CACL | X | X | X | X | 80.9 92.7 97.4 98.5 | 69.6 82.6 91.2 93.8
E. More Evaluation and Analysis

Evaluation on Importance of Cluster-Guided. We use an instance-level contrastive loss in our method to mine the invariance between different augmented views based on SimSiam [9]. To verify whether the clustering guidance is vital in the proposed framework, we evaluate CACL without the clustering guidance and without the stop-gradient operation on Market-1501:

Components | Market-1501 (mAP, Rank-1, Rank-5, Rank-10)
CACL w/o clustering | 0.3 0.5 1.2 2.3
CACL w/o stopGrad | 80.2 92.0 97.0 97.6
CACL | 80.9 92.7 97.4 98.5

Improvements Brought by Suppressing Colors. To suppress the influence of colors, CACL applies a gray-scale process G(·) on top of the data augmentation T'(·) for the second network branch. To validate the effectiveness of suppressing colors, we conduct a set of experiments under different settings: a) simply using the data augmentation T'(·) with raw color; b) using another data augmentation approach, "color-jitter", denoted as J(·), to replace G(·), whose output is still a color image; c) applying the gray-scale transform G(·) after T'(·). It should be emphasized that, in the implementation, the "color-jitter" operation applies random-amplitude changes to the image. We display the image samples processed with the different data augmentation methods in Fig. 4. As can be observed, "color-jitter" did change the image, but the color information still dominates.

Fig. 4. Illustration of the raw images and the augmented images. 1st row: "raw images"; 2nd row: "color-jitter"; bottom row: "gray-scale".

TABLE V. Performance comparison on using color data augmentations and gray-scale transform for the second network branch.

Experimental results are provided in Table V. We can see that using "color-jitter" J(·) yields some performance improvement, but using "gray-scale" G(·) yields the best performance improvement. When combined with the cluster refinement step, we observe a similar result: using "gray-scale" G(·) yields a better performance improvement than using "color-jitter" J(·). These results validate that suppressing colors is effective for gaining performance improvement. Compared to "gray-scale", "color-jitter" does not truly eliminate the influence brought by colors; that is to say, after using color-jitter, the color information still dominates.

To further reveal the mechanism of why using "gray-scale" works better than using "color-jitter" in the proposed framework, we show the statistic histograms of the color distributions in Fig. 5.
Fig. 5. Comparison of the intensity histograms of the RGB channels under different data augmentation operations.
Fig. 6. Data visualization via t-SNE of the learned features and clusters under two different training strategies: training without L_C and L_I (left), as mentioned in Table III, and our CACL (right). The data points come from the Market-1501 training set (1,000 images of 60 identities). Points with the same color denote images of the same identity. To show the difference between the two distributions in detail, we zoom in on the circled clusters and show the corresponding images. The images in the boxes are similar to each other, and the corresponding data points are very close to each other or even overlapping in the feature space if the model is trained without using L_C and L_I, as shown in the left box; whereas using the contrastive losses L_C and L_I effectively distinguishes these data points and maintains the cluster compactness, as shown in the right box.
Fig. 7. Visualization of the top-10 best matched images. We show the top-10 best matching samples in the gallery set for each query sample with the baseline method and our proposed CACL. The images framed in green and in red are the correctly matched and mismatched images, respectively.
[5] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang, "Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification," in IEEE International Conference on Computer Vision, 2019, pp. 6112–6121.
[6] J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy, "Delving into inter-image invariance for unsupervised visual representations," in Conference and Workshop on Neural Information Processing Systems, 2020.
[7] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, "Unsupervised learning of visual features by contrasting cluster assignments," in Advances in Neural Information Processing Systems, 2020, pp. 9912–9924.
[8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning, 2020, pp. 1597–1607.
[9] X. Chen and K. He, "Exploring simple siamese representation learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
[10] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, "Bootstrap your own latent - a new approach to self-supervised learning," in Advances in Neural Information Processing Systems, 2020, pp. 21271–21284.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Conference and Workshop on Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, "Joint generative and contrastive learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 2004–2013.
[13] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, "Adaptive transfer network for cross-domain person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7202–7211.
[14] S. Bak, P. Carr, and J.-F. Lalonde, "Domain adaptation through synthesis for unsupervised person re-identification," in European Conference on Computer Vision, 2018, pp. 189–205.
[15] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, "Unsupervised cross-dataset transfer learning for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1306–1315.
[16] J. Wang, X. Zhu, S. Gong, and W. Li, "Transferable joint attribute-identity deep learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
[17] Y. Ge, F. Zhu, D. Chen, R. Zhao, and H. Li, "Self-paced contrastive learning with hybrid memory for domain adaptive object re-id," in Advances in Neural Information Processing Systems, 2020, pp. 11309–11321.
[18] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang, "A bottom-up clustering approach to unsupervised person re-identification," in AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8738–8745.
[19] H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, p. 83, 2018.
[20] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian, "Unsupervised person re-identification via softened similarity learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 3390–3399.
[21] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[22] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning, 2015, pp. 1180–1189.
[23] D. Wang and S. Zhang, "Unsupervised person re-identification via multi-label classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10981–10990.
[24] P. Bojanowski and A. Joulin, "Unsupervised learning by predicting noise," in International Conference on Machine Learning, 2017, pp. 517–526.
[25] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.
[26] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng, "Contrastive clustering," in AAAI Conference on Artificial Intelligence, 2021.
[27] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[29] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian, "AD-Cluster: Augmented discriminative clustering for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9021–9030.
[30] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[32] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[34] F. Yang, Z. Zhong, Z. Luo, S. Lian, and S. Li, "Leveraging virtual and real person for unsupervised person re-identification," IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2444–2453, 2019.
[35] Z. Zhong, L. Zheng, S. Li, and Y. Yang, "Generalizing a person retrieval model hetero- and homogeneously," in European Conference on Computer Vision, 2018, pp. 172–188.
[36] Z. Ji, X. Zou, X. Lin, X. Liu, T. Huang, and S. Wu, "An attention-driven two-stage clustering method for unsupervised person re-identification," in European Conference on Computer Vision, 2020, pp. 20–36.
[37] J. Li and S. Zhang, "Joint visual and temporal consistency for unsupervised domain adaptive person re-identification," in European Conference on Computer Vision, 2020.
[38] Y. Zhai, Q. Ye, S. Lu, M. Jia, R. Ji, and Y. Tian, "Multiple expert brainstorming for domain adaptive person re-identification," in European Conference on Computer Vision, 2020, pp. 594–611.
[39] F. Zhao, S. Liao, G.-S. Xie, J. Zhao, K. Zhang, and L. Shao, "Unsupervised domain adaptation with noise resistible mutual-training for person re-identification," in European Conference on Computer Vision, 2020, pp. 526–544.
[40] H.-X. Yu, A. Wu, and W.-S. Zheng, "Cross-view asymmetric metric learning for unsupervised person re-identification," in IEEE International Conference on Computer Vision, 2017, pp. 994–1002.
[41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[42] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[43] M. Wang, B. Lai, J. Huang, X. Gong, and X.-S. Hua, "Camera-aware proxies for unsupervised person re-identification," in AAAI Conference on Artificial Intelligence, 2021.
[44] X. Pan, P. Luo, J. Shi, and X. Tang, "Two at once: Enhancing learning and generalization capacities via IBN-Net," in European Conference on Computer Vision, 2018, pp. 464–479.
[45] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision, 2016, pp. 17–35.
[46] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[47] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[48] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, "Invariance matters: Exemplar memory for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 598–607.