
CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification

without Concrete Text Labels


Siyuan Li¹, Li Sun¹,²*, Qingli Li¹
¹Shanghai Key Laboratory of Multidimensional Information Processing
²Key Laboratory of Advanced Theory and Application in Statistics and Data Science
East China Normal University, Shanghai, China

Abstract

Pre-trained vision-language models like CLIP have recently shown superior performances on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes, lacking concrete text descriptions. Therefore, it remains to be determined how such models could be applied to these tasks. This paper first finds out that simply fine-tuning the visual model initialized by the image encoder in CLIP has already obtained competitive performances in various ReID tasks. Then we propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID, which are given to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP are kept fixed, and only the text tokens are optimized from scratch by the contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the designed loss in the downstream task, the image encoder is able to represent data accurately as vectors in the feature embedding. The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.

Figure 1: t-SNE visualization of image and text features (Van der Maaten and Hinton 2008). Ten randomly selected persons in MSMT17 are shown in different colors. The dots and pentagons indicate the image and text features, respectively. (a) and (b) show the data distributions after the first and second training stages; the legend distinguishes initialized text features, text features after stage 1, original image features, and image features after stage 2.

* Corresponding author, email: [email protected]. Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Image re-identification (ReID) aims to match the same object across different and non-overlapping camera views. Particularly, it focuses on detecting the same person or vehicle in surveillance camera networks. ReID is a challenging task mainly due to cluttered backgrounds, illumination variations, huge pose changes, or even occlusions. Most recent ReID models depend on building and training a convolutional neural network (CNN) so that each image is mapped to a feature vector in the embedding space before the classifier. Images of the same object tend to be close, while different objects become far away in this space. The parameters of the CNN can be effectively learned under the guidance of a cross-entropy loss together with a typical metric learning loss such as the center or triplet loss (Hermans, Beyer, and Leibe 2017).

Although CNN-based models for ReID have achieved good performance on some well-known datasets, they are still far from being usable in real applications. CNNs are often blamed for focusing on only a small, irrelevant region in the image, which indicates that their features are not robust and discriminative enough. Recently, vision transformers like ViT (Dosovitskiy et al. 2020) have become popular in many tasks, and they have also shown better performances in ReID. Compared to CNNs, transformers can model long-range dependencies in the whole image. However, due to the large number of model parameters, they require a big training set and often perform erratically during optimization. Since ReID datasets are relatively small, the potential of these models is not fully exploited yet.
[Figure 2 diagram: three panels, (a) CLIP, (b) CoOp, and (c) CLIP-ReID, each pairing an image encoder with a text encoder. (a) uses concrete prompts such as "A photo of a dog." with the contrastive losses Li2t + Lt2i; (b) prepends learnable prompt tokens [P]1 [P]2 ... [P]K to the class name while both encoders stay fixed; (c) uses the ID-specific prompt "A photo of a [X]1 [X]2 ... [X]M person." and is trained in two stages, with Li2t + Lt2i in stage 1 and Lid + Ltri + Li2tce in stage 2.]

Figure 2: Overview of our approach compared to CLIP and CoOp. (a) describes the model of CLIP, which uses pairs of text and image to train the image encoder and text encoder. (b) shows the model of CoOp, which fixes the image encoder and text encoder and fine-tunes a text prompt on the downstream dataset. (c) is our proposed CLIP-ReID method, which fixes the text encoder and image encoder in the first training stage, optimizes a set of learnable text tokens to generate the text features, and then uses the text features to optimize the image encoder in the second training stage.

Both CNN-based and ViT-based methods heavily rely on pre-training. Almost all ReID methods need an initial model trained on ImageNet, which contains images manually given one-hot labels from a pre-defined set. Visual contents describing rich semantics outside the set are completely ignored. Recently, cross-modal learning like CLIP (Radford et al. 2021) connects the visual representation with its corresponding high-level language description. Such methods not only train on a larger dataset but also change the pre-training task, matching visual features to their language descriptions. Therefore, the image encoder can sense a variety of high-level semantics from the text and learns transferable features, which can be adapted to many different tasks. For example, given a particular image classification task, the candidate text labels are concrete and can be combined with a prompt, such as "A photo of a", to form the text descriptions. The classification is then realized by comparing image features with text features generated by the text encoder, which takes the text descriptions of the categories as input. Note that this is a zero-shot solution without tuning any parameters for downstream tasks, yet it still gives satisfactory results. Based on this, CoOp (Zhou et al. 2021) incorporates a learnable prompt for different tasks. The optimized prompt further improves the performance.
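The zero-shot recipe described above can be sketched in a few lines. The snippet below is a minimal illustration, assuming the reference `clip` package released with Radford et al. (2021) (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`); the class names and image path are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # CLIP image + text encoders

class_names = ["cat", "dog", "horse"]                       # placeholder label set
prompts = [f"A photo of a {c}." for c in class_names]       # prompt-augmented labels
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    img_feat = model.encode_image(image)                    # V_i
    txt_feat = model.encode_text(text_tokens)                # T_1 .. T_C
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.t()                 # cosine similarities
    pred = logits.argmax(dim=-1)                             # predicted class index
```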
CLIP and CoOp need text labels to form text descriptions in downstream tasks. However, in most ReID tasks the labels are indexes, and there are no specific words to describe the images, so vision-language models have not been widely adopted in ReID. In this paper, we intend to exploit CLIP fully. We first fine-tune the image encoder by directly using the common losses in ReID, which already obtains high metrics compared to existing works. We use this model as our baseline and try to improve it by utilizing the text encoder in CLIP. A two-stage strategy is proposed, which aims to constrain the image encoder by generating language descriptions from the text encoder. A series of learnable text tokens are incorporated, and they are used to describe each ID ambiguously. In the first training stage, both the image and text encoders are fixed, and only these tokens are optimized. In the second stage, the description tokens and the text encoder are kept static, and together they provide ambiguous descriptions for each ID, which helps to build up the cross-modal image-to-text cross-entropy loss. Since CLIP has CNN-based and ViT-based models, the proposed method is validated on both ResNet-50 and ViT-B/16. Both types of models achieve the state-of-the-art on different ReID datasets. Moreover, our method can also support the input of camera IDs and overlapped token settings in its ViT-based version.

Fig. 1 visualizes image and text features together in 2D coordinates, which helps to understand our training strategy. In the first stage, the text feature of each ID is adapted to its corresponding image features, making it an ambiguous description. In the second stage, image features gather around their text descriptions so that image features from different IDs become distant.

In summary, the contributions of this paper lie in the following aspects:

• To our knowledge, we are the first to utilize CLIP for ReID. We provide competitive baseline models on several ReID datasets, which are the result of fine-tuning the visual model initialized by the CLIP image encoder.
• We propose CLIP-ReID, which fully exploits the cross-modal describing ability of CLIP. In our model, ID-specific learnable tokens are incorporated to give ambiguous text descriptions, and a two-stage training strategy is designed to take full advantage of the text encoder during training.
• We demonstrate that CLIP-ReID achieves state-of-the-art performances on many ReID datasets, covering both person and vehicle benchmarks.

Related Works

Image ReID
Previous ReID works focus on learning discriminative features like foreground histograms (Das, Chakraborty, and Roy-Chowdhury 2014), local maximal occurrences (Liao et al. 2015), bag-of-visual words (Zheng et al. 2015), or hierarchical Gaussian descriptors (Matsukawa et al. 2016). On the other hand, ReID can also be solved as a metric learning problem, expecting a reasonable distance measurement for inter- and intra-class samples (Koestinger et al. 2012). These two aspects are naturally combined by deep neural networks (Yi et al. 2014), in which the parameters are optimized under an appropriate loss function with almost no intentional interference. Particularly, with the scale development of CNNs on ImageNet, ResNet-50 (He et al. 2016) has been regarded as the common model (Luo et al. 2019) for most ReID datasets.

Despite the powerful ability of CNNs, they are blamed for irrelevant highlighted regions, which is probably due to overfitting on limited training data. OSNet (Zhou et al. 2019) gives a lightweight model to deal with it. Auto-ReID (Quan et al. 2019) and CDNet (Li, Wu, and Zheng 2021) employ network architecture search for a compact model. OfM (Zhang et al. 2021a) proposes a data selection method that learns a sampler to choose generalizable data during training. Although they obtain good results on some small datasets, performances drop significantly on large ones like MSMT17.

Introducing prior knowledge into the network can also alleviate overfitting. An intuitive idea is to use features from different regions for identification. PCB (Sun et al. 2018) and SAN (Qian et al. 2020) divide the feature into horizontal stripes to enhance its ability to represent local regions. MGN (Wang et al. 2018) utilizes a multiple-granularity scheme on feature division to further enhance its expressive capability, and it has several branches to capture features from different parts, so model complexity becomes its major issue. BDB (Dai et al. 2019) has a simple structure with only two branches, one for global features and the other for local features, and it employs a simple batch feature drop strategy to randomly erase a horizontal stripe for all samples within a batch. CBDB-Net (Tan et al. 2021) enhances BDB with more types of feature dropping. Similar multi-branch approaches (Zhang et al. 2021c; Wang et al. 2022; Zhang et al. 2021b; He et al. 2019; Zhang et al. 2019; Sun et al. 2020) with the purpose of mining rich features from different locations are also proposed, and they can be improved if a semantic parsing map participates during training (Jin et al. 2020b; Zhu et al. 2020; Meng et al. 2020; Chen et al. 2020).

Attention enlarges the receptive field and hence is another way to prevent the model from focusing on small areas. In RGA (Zhang et al. 2020), non-local attention is performed along the spatial and channel directions. ABDNet (Chen et al. 2019) adopts a similar attention module and adds a regularization term to ensure feature orthogonality. HOReID (Wang et al. 2020) extends traditional attention into high-order computation, giving more discriminative features. CAL (Rao et al. 2021) provides an attention scheme for counterfactual learning, which filters out irrelevant areas and increases prediction accuracy. Recently, due to the power of the transformer, it has become popular in ReID. PAT (Li et al. 2021b) and DRL-Net (Jia et al. 2022) build on ResNet-50, but they utilize a transformer decoder to exploit image features from the CNN. In the decoder attention block, learnable queries first interact with key tokens from the image and are then updated by weighted image values; they are expected to reflect local features for ReID. TransReID (He et al. 2021), AAformer (Zhu et al. 2021) and DCAL (Zhu et al. 2022) all use encoder attention blocks in ViT, and they obtain better performance, especially on large datasets.

This paper implements both CNN and ViT models initialized from CLIP. Benefiting from the two-stage training, both achieve SOTA on different datasets.

Vision-language learning
Compared to supervised pre-training on ImageNet, vision-language pre-training (VLP) has significantly improved the performance of many downstream tasks by training to match images and language. CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021) are good practices, which utilize a pair of image and text encoders and two directional InfoNCE losses computed between their outputs for training. Built on CLIP, several works (Li et al. 2022; Kim, Son, and Kim 2021) have been proposed to incorporate more types of learning tasks, such as image-to-text matching and masked image/text modeling. ALBEF (Li et al. 2021a) aligns the image and text representations before fusing them through cross-modal attention. SimVLM (Wang et al. 2021) uses a single prefix language modeling objective for end-to-end training.

Inspired by recent advances in NLP, prompt- or adapter-based tuning has become prevalent in the vision domain. CoOp (Zhou et al. 2021) proposes to fit a learnable prompt for image classification. CoCoOp (Zhou et al. 2022) learns a light-weight visual network to give meta tokens for each image, combined with a set of learnable context vectors. CLIP-Adapter (Gao et al. 2021) adds a light-weight module on top of both the image and text encoders.

In addition, researchers have investigated different downstream tasks to apply CLIP. DenseCLIP (Rao et al. 2022) and MaskCLIP (Zhou, Loy, and Dai 2021) apply it to per-pixel prediction in segmentation. ViLD (Gu et al. 2021) adapts the image and text encoders in CLIP for object detection. EI-CLIP (Ma et al. 2022) and CLIP4CirDemo (Baldrati et al. 2022) use CLIP to solve retrieval problems. However, as far as we know, no works deal with ReID based on CLIP.
Algorithm 1: CLIP-ReID's training process.
Input: a batch of images xi and their corresponding texts tyi.
Parameter: a set of learnable text tokens [X]m (m ∈ 1,...,M) for all IDs existing in the training set X, an image encoder I, a text encoder T, and linear layers gV and gT.
1: Initialize I, T, gV and gT from the pre-trained CLIP. Initialize [X]m (m ∈ 1,...,M) randomly.
2: while in the 1st stage do
3:   s(Vi, Tyi) = gV(I(xi)) · gT(T(tyi))
4:   Optimize [X]m by Eq. (5).
5: end while
6: for yi = 1 to N do
7:   text_yi = gT(T(tyi))
8: end for
9: while in the 2nd stage do
10:  s(Vi, Tyi) = gV(I(xi)) · text_yi
11:  Optimize I by Eq. (9).
12: end while

Method

Preliminaries: Overview of CLIP
We first briefly review CLIP. It consists of two encoders, an image encoder I(·) and a text encoder T(·). The architecture of I(·) has several alternatives; basically, a transformer like ViT-B/16 and a CNN like ResNet-50 are the two models we work on. Either of them is able to summarize the image into a feature vector in the cross-modal embedding.

On the other hand, the text encoder T(·) is implemented as a transformer, which is used to generate a representation from a sentence. Specifically, given a description such as "A photo of a [class].", where [class] is generally replaced by concrete text labels, T(·) first converts each word into a unique numeric ID by lower-cased byte pair encoding (BPE) with a 49,152 vocab size (Sennrich, Haddow, and Birch 2015). Then, each ID is mapped to a 512-d word embedding. To achieve parallel computation, each text sequence has a fixed length of 77, including the start [SOS] and end [EOS] tokens. After a 12-layer model with 8 attention heads, the [EOS] token is taken as the feature representation of the text, which is layer normalized and then linearly projected into the cross-modal embedding space.

Specifically, i ∈ {1...B} denotes the index of an image within a batch. Let imgi be the [CLS] token embedding of the image feature, and texti the corresponding [EOS] token embedding of the text feature; we then compute the similarity between imgi and texti:

s(V_i, T_i) = V_i \cdot T_i = g_V(img_i) \cdot g_T(text_i)   (1)

where gV(·) and gT(·) are linear layers projecting embeddings into the cross-modal embedding space. The image-to-text contrastive loss Li2t is calculated as:

L_{i2t}(i) = -\log \frac{\exp(s(V_i, T_i))}{\sum_{a=1}^{B} \exp(s(V_i, T_a))}   (2)

and the text-to-image contrastive loss Lt2i as:

L_{t2i}(i) = -\log \frac{\exp(s(V_i, T_i))}{\sum_{a=1}^{B} \exp(s(V_a, T_i))}   (3)

where the numerators in Eq. (2) and Eq. (3) are the similarities of the two embeddings of a matched pair, and the denominators are all similarities with respect to the anchor Vi or Ti.

For regular classification tasks, CLIP converts the concrete labels of the dataset into text descriptions, then produces the embedding features Ti and Vi and aligns them. CoOp incorporates a learnable prompt for different tasks while the entire set of pre-trained parameters is kept fixed, as depicted in Fig. 2(b). However, it is difficult to exploit CLIP in ReID tasks where the labels are indexes instead of specific text.

CLIP-ReID
To deal with the above problem, we propose CLIP-ReID, which complements the lacking textual information by pre-training a set of learnable text tokens. As shown in Fig. 2(c), our scheme is built on pre-trained CLIP with two stages of training, and its metrics exceed our baseline.

The first training stage. We first introduce ID-specific learnable tokens to learn ambiguous text descriptions, which are independent for each ID. Specifically, the text descriptions fed into T(·) are designed as "A photo of a [X]1[X]2[X]3...[X]M person/vehicle", where each [X]m (m ∈ 1,...,M) is a learnable text token with the same dimension as a word embedding. M indicates the number of learnable text tokens. In this stage, we fix the parameters of I(·) and T(·), and only the tokens [X]m are optimized.

Similar to CLIP, we use Li2t and Lt2i, but replace texti with text_yi in Eq. (1), since each ID shares the same text description. Moreover, for Lt2i, different images in a batch probably belong to the same person, so Tyi may have more than one positive, and we change it to:

L_{t2i}(y_i) = \frac{-1}{|P(y_i)|} \sum_{p \in P(y_i)} \log \frac{\exp(s(V_p, T_{y_i}))}{\sum_{a=1}^{B} \exp(s(V_a, T_{y_i}))}   (4)

Here, P(yi) = {p ∈ 1...B : yp = yi} is the set of indices of all positives for Tyi in the batch, and |·| is its cardinality. By minimizing the losses Li2t and Lt2i, the gradients are back-propagated through the fixed T(·) to optimize [X]1[X]2[X]3...[X]M, taking full advantage of T(·).

L_{stage1} = L_{i2t} + L_{t2i}   (5)

To improve computation efficiency, we obtain all the image features by feeding the whole training set into I(·) at the beginning of the first training stage. For a dataset with N IDs, we save the N different Tyi of all IDs at the end of this stage, preparing for the next stage of training.
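A minimal sketch of this first stage is given below. The frozen CLIP pieces are mocked by simple stand-ins (the real prompt assembly and text encoder are not reproduced here), the dimensions are hypothetical, and only the bank of ID-specific tokens receives gradients; the loss follows Eq. (2) with text_yi plus the multi-positive form of Eq. (4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the frozen CLIP pieces (hypothetical sizes: 512-d embeddings).
num_ids, M, dim, B = 100, 4, 512, 64
id_tokens = nn.Parameter(0.02 * torch.randn(num_ids, M, dim))   # learnable [X]_1..[X]_M per ID
text_encoder = nn.Linear(M * dim, dim).requires_grad_(False)    # placeholder for the frozen T(.)
gV = nn.Linear(dim, dim).requires_grad_(False)                  # frozen projection heads
gT = nn.Linear(dim, dim).requires_grad_(False)

optimizer = torch.optim.Adam([id_tokens], lr=3.5e-4)

def stage1_loss(img_feats, ids):
    """img_feats: (B, dim) pre-computed frozen image features; ids: (B,) integer ID labels."""
    uniq = ids.unique()
    # "A photo of a [X]_1..[X]_M person." -> here the frozen text encoder is mocked
    # by a linear layer over the concatenated learnable tokens of each ID in the batch.
    txt = text_encoder(id_tokens[uniq].flatten(1))              # (U, dim)
    V = F.normalize(gV(img_feats), dim=-1)
    T = F.normalize(gT(txt), dim=-1)
    sim = V @ T.t()                                             # (B, U) similarities

    col = (ids.view(-1, 1) == uniq.view(1, -1)).float()         # (B, U) positive mask
    l_i2t = F.cross_entropy(sim, col.argmax(dim=1))             # Eq. (2) with text_{y_i}

    log_p = F.log_softmax(sim.t(), dim=1)                       # (U, B)
    pos = col.t()
    l_t2i = (-(log_p * pos).sum(1) / pos.sum(1)).mean()         # Eq. (4), multi-positive
    return l_i2t + l_t2i                                        # Eq. (5)

# One optimization step on random stand-in data:
loss = stage1_loss(torch.randn(B, dim), torch.randint(0, num_ids, (B,)))
loss.backward(); optimizer.step()
```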
The second training stage. In this stage, only the parameters of I(·) are optimized. To boost the final performance, we follow the general strong pipeline of object ReID (Luo et al. 2019). We employ the triplet loss Ltri and the ID loss Lid with label smoothing for optimization, which are calculated as:

L_{id} = \sum_{k=1}^{N} -q_k \log(p_k)   (6)

L_{tri} = \max(d_p - d_n + \alpha, 0)   (7)

where q_k = (1-\epsilon)\delta_{k,y} + \epsilon/N denotes the value of the target distribution, p_k represents the ID prediction logit of class k, d_p and d_n are the feature distances of the positive pair and the negative pair, and α is the margin of Ltri.
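The two ReID losses of Eq. (6) and Eq. (7) can be written compactly as below. This is a generic sketch: batch-hard mining and a smoothing value of ε = 0.1 are common choices assumed here, while the margin α = 0.3 follows the training details given later.

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels, eps=0.1):
    """Label-smoothed cross entropy, Eq. (6). logits: (B, N), labels: (B,)."""
    n = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    q = torch.full_like(log_p, eps / n)                        # epsilon / N everywhere
    q.scatter_(1, labels.view(-1, 1), 1.0 - eps + eps / n)     # (1 - eps) on the true class
    return -(q * log_p).sum(dim=1).mean()

def triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss, Eq. (7). feats: (B, d), labels: (B,)."""
    dist = torch.cdist(feats, feats)                           # pairwise Euclidean distances
    same = labels.view(-1, 1) == labels.view(1, -1)
    d_p = dist.masked_fill(~same, 0).max(dim=1).values         # hardest positive per anchor
    d_n = dist.masked_fill(same, float("inf")).min(dim=1).values  # hardest negative per anchor
    return F.relu(d_p - d_n + margin).mean()
```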
To fully exploit CLIP, for each image we can use the text features obtained in the first training stage to calculate the image-to-text cross-entropy Li2tce, as shown in Eq. (8). Note that, following Lid, we utilize label smoothing on qk in Li2tce.

L_{i2tce}(i) = \sum_{k=1}^{N} -q_k \log \frac{\exp(s(V_i, T_{y_k}))}{\sum_{y_a=1}^{N} \exp(s(V_i, T_{y_a}))}   (8)

Ultimately, the losses used in our second training stage are summarized as follows:

L_{stage2} = L_{id} + L_{tri} + L_{i2tce}   (9)

The whole training process of the proposed CLIP-ReID, including both the first and second stages, is summarized in Algorithm 1. We use the learnable prompts to mine and store the hidden states of the pre-trained image encoder and text encoder, allowing CLIP to retain its own advantages. During the second stage, these prompts can regularize the image encoder and thus increase its generalization ability.
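The second-stage objective can then be sketched as follows, reusing `id_loss` and `triplet_loss` from the previous sketch. `text_feats_all` stands for the N buffered ID text features saved at the end of stage 1, and the separate classifier head producing `cls_logits` is an assumption of this sketch rather than a detail spelled out in the text.

```python
import torch
import torch.nn.functional as F

def i2tce_loss(img_feats, text_feats_all, labels, eps=0.1):
    """Image-to-text cross entropy over all N ID text features, Eq. (8)."""
    V = F.normalize(img_feats, dim=-1)
    T = F.normalize(text_feats_all, dim=-1)       # (N, d), frozen, buffered after stage 1
    logits = V @ T.t()                            # (B, N) similarities s(V_i, T_{y_k})
    return id_loss(logits, labels, eps)           # label-smoothed CE, as for L_id

def stage2_loss(img_feats, cls_logits, text_feats_all, labels):
    """L_stage2 = L_id + L_tri + L_i2tce, Eq. (9)."""
    return (id_loss(cls_logits, labels)
            + triplet_loss(img_feats, labels)
            + i2tce_loss(img_feats, text_feats_all, labels))
```

The ablation reported later uses loss weights of 0.25/1/1 for the ViT model and 1/1/1 for the CNN model, so a weighted sum can be substituted for the plain sum above.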
SIE and OLP. To make the model aware of the camera or viewpoint, we use Side Information Embeddings (SIE) (He et al. 2021) to introduce the relevant information. Unlike TransReID, we only add the camera information to the [CLS] token, rather than to all tokens, to avoid disturbing image details. Overlapping Patches (OLP) can further enhance the model at the cost of increased computational resources, and it is realized simply by changing the stride in the token embedding.
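A sketch of how these two pieces attach to a ViT-style patch embedding is shown below. It illustrates the idea under stated assumptions (embedding dimension, number of cameras, patch stride and the scale on the camera embedding are placeholders) and is not the exact TransReID or CLIP-ReID implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedOLP(nn.Module):
    """Overlapping patch embedding: the stride is smaller than the patch size."""
    def __init__(self, in_ch=3, dim=768, patch=16, stride=12):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=stride)

    def forward(self, x):                          # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)

class SIEOnCLS(nn.Module):
    """Side Information Embedding added only to the [CLS] token."""
    def __init__(self, num_cameras=15, dim=768, scale=1.0):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))
        self.scale = scale

    def forward(self, tokens, cam_ids):            # tokens: (B, 1+P, dim), cam_ids: (B,)
        cls, patches = tokens[:, :1], tokens[:, 1:]
        cls = cls + self.scale * self.cam_embed[cam_ids].unsqueeze(1)
        return torch.cat([cls, patches], dim=1)
```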
Experiments

Datasets and Evaluation Protocols
We evaluate our method on four person re-identification datasets, namely MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016) and Occluded-Duke (Miao et al. 2019), and on two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a). The details of these datasets are summarized in Tab. 1. Following common practices, we adopt the cumulative matching characteristics (CMC) at Rank-1 (R1) and the mean average precision (mAP) to evaluate the performance.

Dataset        | Images  | IDs    | Cam + View
MSMT17         | 126,441 | 4,101  | 15
Market-1501    | 32,668  | 1,501  | 6
DukeMTMC-reID  | 36,411  | 1,404  | 8
Occluded-Duke  | 35,489  | 1,404  | 8
VeRi-776       | 49,357  | 776    | 28
VehicleID      | 221,763 | 26,267 | -

Table 1: Statistics of the datasets used in the paper.
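For reference, the two evaluation metrics can be computed from a query-gallery distance matrix as sketched below. This is the standard, simplified formulation; it ignores the usual filtering of same-camera gallery candidates.

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids):
    """dist: (Q, G) distance matrix; returns (Rank-1 accuracy, mAP)."""
    rank1, aps = 0.0, []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                       # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                      # no relevant gallery sample
        rank1 += matches[0]                               # CMC at Rank-1
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return rank1 / len(aps), float(np.mean(aps))
```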
Implementations
Models. We adopt the visual encoder I(·) and the text encoder T(·) from CLIP as the backbones of our image and text feature extractors. CLIP provides two alternatives for I(·), namely a transformer and a CNN with a global attention pooling layer. For the transformer, we choose ViT-B/16, which contains 12 transformer layers with a hidden size of 768 dimensions. To match the output of T(·), the dimension of the image feature vector is reduced from 768 to 512 by a linear layer. For the CNN, we choose ResNet-50, where the last stride is changed from 2 to 1, resulting in a larger feature map that preserves spatial information. The global attention pooling layer after ResNet-50 reduces the dimension of the embedding vectors from 2048 to 1024, matching the dimension of the text features, which are converted from 512 to 1024.

Training details. In the first training stage, we use the Adam optimizer for both the CNN-based and the ViT-based models, with a learning rate initialized at 3.5 × 10⁻⁴ and decayed by a cosine schedule. At this stage, the batch size is set to 64 without using any augmentation methods, and only the learnable text tokens [X]1[X]2[X]3...[X]M are optimizable. In the second training stage (the same as our baseline), the Adam optimizer is also used to train the image encoder. Each mini-batch consists of B = P × K images, where P is the number of randomly selected identities and K is the number of samples per identity. We take P = 16 and K = 4. Each image is augmented by random horizontal flipping, padding, cropping and erasing (Zhong et al. 2020). For the CNN-based model, we spend 10 epochs linearly increasing the learning rate from 3.5 × 10⁻⁶ to 3.5 × 10⁻⁴, and then the learning rate is decayed by 0.1 at the 40th and 70th epochs. For the ViT-based model, we warm up the model for 10 epochs with a linearly growing learning rate from 5 × 10⁻⁷ to 5 × 10⁻⁶; it is then decreased by a factor of 0.1 at the 30th and 50th epochs. We train the CNN-based model for 120 epochs and the ViT-based model for 60 epochs. For the CNN-based model, we apply Ltri and Lid before and after the global attention pooling layer, with α set to 0.3; similarly, for the ViT-based model, we apply them before and after the linear layer following the transformer. Note that we also employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.
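The ViT schedule described above translates into a configuration like the following. The linear model is only a stand-in for the image encoder, and the warmup-then-step schedule is expressed with a `LambdaLR` for brevity rather than claiming the authors' exact scheduler code.

```python
import torch

base_lr, warmup_start = 5e-6, 5e-7                        # ViT-based setting from the text
warmup_epochs, total_epochs, decay_epochs = 10, 60, (30, 50)

model = torch.nn.Linear(8, 8)                             # stand-in for the image encoder
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

def lr_factor(epoch):
    if epoch < warmup_epochs:                             # linear warmup 5e-7 -> 5e-6
        start = warmup_start / base_lr
        return start + (1.0 - start) * epoch / warmup_epochs
    return 0.1 ** sum(epoch >= e for e in decay_epochs)   # x0.1 at the 30th and 50th epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for epoch in range(total_epochs):
    # ... one epoch of P x K batches (P = 16 identities, K = 4 images each) goes here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])                    # 5e-6 * 0.01 after 60 epochs
```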
Comparison with State-of-the-Art Methods
We compare our method with the state-of-the-art methods on three widely used person ReID benchmarks and one occluded ReID benchmark in Tab. 2, and on two vehicle ReID benchmarks in Tab. 3. Despite being simple, CLIP-ReID achieves strikingly good results. Note that all results listed here are without re-ranking.

Person ReID. For both CNN-based and ViT-based methods, CLIP-ReID outperforms previous methods by a large margin on the most challenging dataset, MSMT17. Our method achieves 63.0% mAP and 84.4% R1 with the CNN-based backbone, and 73.4% mAP and 88.7% R1 (6.0% and 3.4% higher than TransReID+SIE+OLP) with the ViT-based backbone using only the CLIP-ReID method; with the further use of SIE and OLP, we improve mAP and R1 to 75.8% and 89.7%. On other smaller or occluded datasets, such as Market-1501, DukeMTMC-reID and Occluded-Duke, we also increase the mAP with the ViT-based backbone by 1.0%, 0.5% and 1.1%, respectively.
Backbone | Methods | Reference | MSMT17 mAP R1 | Market-1501 mAP R1 | DukeMTMC mAP R1 | Occluded-Duke mAP R1
CNN | PCB* | ECCV (2018) | - - | 81.6 93.8 | 69.2 83.3 | - -
CNN | MGN* | MM (2018) | - - | 86.9 95.7 | 78.4 88.7 | - -
CNN | OSNet | ICCV (2019) | 52.9 78.7 | 84.9 94.8 | 73.5 88.6 | - -
CNN | ABD-Net* | ICCV (2019) | 60.8 82.3 | 88.3 95.6 | 78.6 89.0 | - -
CNN | Auto-ReID* | ICCV (2019) | 52.5 78.2 | 85.1 94.5 | - - | - -
CNN | HOReID | CVPR (2020) | - - | 84.9 94.2 | 75.6 86.9 | 43.8 55.1
CNN | ISP | ECCV (2020) | - - | 88.6 95.3 | 80.0 89.6 | 52.3 62.8
CNN | SAN | AAAI (2020b) | 55.7 79.2 | 88.0 96.1 | 75.5 87.9 | - -
CNN | OfM | AAAI (2021a) | 54.7 78.4 | 87.9 94.9 | 78.6 89.0 | - -
CNN | CDNet | CVPR (2021) | 54.7 78.9 | 86.0 95.1 | 76.8 88.6 | - -
CNN | PAT | CVPR (2021b) | - - | 88.0 95.4 | 78.2 88.8 | 53.6 64.5
CNN | CAL* | ICCV (2021) | 56.2 79.5 | 87.0 94.5 | 76.4 87.2 | - -
CNN | CBDB-Net* | TCSVT (2021) | - - | 85.0 94.4 | 74.3 87.7 | 38.9 50.9
CNN | ALDER* | TIP (2021b) | 59.1 82.5 | 88.9 95.6 | 78.9 89.9 | - -
CNN | LTReID* | TMM (2022) | 58.6 81.0 | 89.0 95.9 | 80.4 90.5 | - -
CNN | DRL-Net | TMM (2022) | 55.3 78.4 | 86.9 94.7 | 76.6 88.1 | 50.8 65.0
CNN | baseline | | 60.7 82.1 | 88.1 94.7 | 79.3 88.6 | 47.4 54.2
CNN | CLIP-ReID | | 63.0 84.4 | 89.8 95.7 | 80.7 90.0 | 53.5 61.0
ViT | AAformer* | arXiv (2021) | 63.2 83.6 | 87.7 95.4 | 80.0 90.1 | 58.2 67.0
ViT | TransReID+SIE+OLP | ICCV (2021) | 67.4 85.3 | 88.9 95.2 | 82.0 90.7 | 59.2 66.4
ViT | TransReID+SIE+OLP* | | 69.4 86.2 | 89.5 95.2 | 82.6 90.7 | - -
ViT | DCAL | CVPR (2022) | 64.0 83.1 | 87.5 94.7 | 80.1 89.0 | - -
ViT | baseline | | 66.1 84.4 | 86.4 93.3 | 80.0 88.8 | 53.5 60.8
ViT | CLIP-ReID | | 73.4 88.7 | 89.6 95.5 | 82.5 90.0 | 59.5 67.1
ViT | CLIP-ReID+SIE+OLP | | 75.8 89.7 | 90.5 95.4 | 83.1 90.8 | 60.3 67.2

Table 2: Comparison with state-of-the-art CNN- and ViT-based methods on person ReID datasets. DukeMTMC denotes the DukeMTMC-reID benchmark. The superscript star (*) means that the input image is resized to a resolution larger than 256×128.

Vehicle ReID. Our method achieves competitive performance compared to the prior CNN-based and ViT-based methods. With the ViT-based backbone, CLIP-ReID reaches 85.3% R1 and 97.6% R5 on VehicleID, while CLIP-ReID! reaches 84.5% mAP and 97.3% R1 on VeRi-776.

Backbone | Methods | VeRi-776 mAP R1 | VehicleID R1 R5
CNN | PRN (2019) | 74.3 94.3 | 78.4 92.3
CNN | PGAN (2019) | 79.3 96.5 | 77.8 92.1
CNN | SAN (2020) | 72.5 93.3 | 79.7 94.3
CNN | UMTS (2020a) | 75.9 95.8 | 80.9 -
CNN | SPAN (2020) | 68.9 94.0 | - -
CNN | PVEN (2020) | 79.5 95.6 | 84.7 97.0
CNN | SAVER (2020) | 79.6 96.4 | 79.9 95.2
CNN | CFVMNet (2020) | 77.1 95.3 | 81.4 94.1
CNN | CAL (2021) | 74.3 95.4 | 82.5 94.7
CNN | EIA-Net (2018) | 79.3 95.7 | 84.1 96.5
CNN | FIDI (2021) | 77.6 95.7 | 78.5 91.9
CNN | baseline | 79.3 95.7 | 84.4 96.6
CNN | CLIP-ReID | 80.3 96.8 | 85.2 97.1
ViT | TransReID (2021) | 80.6 96.9 | 83.6 97.1
ViT | TransReID! | 82.0 97.1 | 85.2 97.5
ViT | DCAL (2022) | 80.2 96.9 | - -
ViT | baseline | 79.3 95.7 | 84.2 96.6
ViT | CLIP-ReID | 83.3 97.4 | 85.3 97.6
ViT | CLIP-ReID! | 84.5 97.3 | 85.5 97.2

Table 3: Comparison with state-of-the-art CNN- and ViT-based methods on vehicle ReID datasets. Only the small subset of VehicleID is used in this paper. ! indicates that the method further uses SIE and OLP on VeRi-776 and OLP on VehicleID.

Ablation Studies and Analysis
We conduct comprehensive ablation studies on the MSMT17 dataset to analyze the influences and sensitivity of the major parameters.

Baseline comparison. Many CNN-based works are based on the strong baseline proposed by BoT (Luo et al. 2019). For ViT-based methods, TransReID's baseline is widely adopted, while AAformer also proposes a baseline. Although slightly different, both of them are pre-trained on ImageNet, which is different from ours. As shown in Tab. 4, due to the effectiveness of CLIP pre-training, our baseline achieves superior performance compared to the other baselines.

Necessity of two-stage training. CLIP aligns embeddings from the text and image domains, so it is important to exploit its text encoder. Since ReID has no specific text that distinguishes different IDs, we aim to provide this by pre-training a set of learnable text tokens. There are two ways to optimize them. One is one-stage training, in which we train the image encoder I(·) while using the contrastive loss to train the text tokens at the same time. The other is the two-stage strategy that
we propose, in which we tune the learnable text tokens in the first stage and use them to calculate Li2tce in the second stage. To verify which approach is more effective, we perform a comparison on MSMT17. As shown in Tab. 5, the one-stage training is less effective because, in the early stage of training, the learnable text tokens cannot describe the image well but affect the optimization of I(·).

Backbone | Methods | mAP | Rank-1
CNN | BoT | 51.3 | 75.3
CNN | CLIP-ReID baseline | 60.7 | 82.1
ViT | AAformer baseline | 58.5 | 79.4
ViT | TransReID baseline | 61.0 | 81.8
ViT | CLIP-ReID baseline | 66.1 | 84.4

Table 4: Comparison of baselines on the MSMT17 dataset.

Backbone | Methods | mAP | Rank-1
CNN | baseline | 60.7 | 82.1
CNN | one stage | 61.9 | 82.8
CNN | two stage | 63.0 | 84.4
ViT | baseline | 66.1 | 84.4
ViT | one stage | 68.9 | 85.9
ViT | two stage | 73.4 | 88.7

Table 5: Comparison between one- and two-stage training.

Constraint from the text encoder in the second stage. There are P different IDs in a batch, with K images per ID. When computing Li2tce, if we only consider the text embeddings of the IDs within a batch, as in Li2t, the number of participating IDs is much smaller than the total number of IDs used in Lid. We therefore extend it to all IDs in the training set, as in Li2tce. From Tab. 6, we can conclude that comparing with all IDs in the training set is better than comparing only with the IDs of the current batch. Another conclusion is that Lt2i is not necessary in the second stage. Finally, we combine Lid, Ltri and Li2tce to form the total loss. For the ViT, the weights of the three loss terms are 0.25, 1, and 1, respectively, while they are 1, 1, and 1 for the CNN.

Li2tce | Li2t | Lt2i | mAP | Rank-1
- | - | - | 66.1 | 84.4
- | X | X | 71.3 | 87.5
- | X | - | 71.7 | 87.6
X | - | X | 73.2 | 88.6
X | - | - | 73.4 | 88.7

Table 6: Loss terms from the text encoder in the second stage.

Number of learnable tokens M. To be consistent with CLIP, we set the text description to "A photo of a [X]1[X]2[X]3...[X]M person/vehicle.". We analyze the parameter M and find that M = 1 does not learn a sufficient text description, while increasing M to 8 is redundant and unhelpful. We finally choose M = 4, which gives the best result among the different settings.

SIE and OLP. In Tab. 7, we evaluate the effectiveness of SIE and OLP on MSMT17. Using SIE only for the [CLS] token works better than adding it to all global tokens. The model gains 1.1% mAP on MSMT17 when using only SIE-cls and 1.2% when using only OLP. When they are applied together, mAP and R1 rise by 2.4% and 1.0%, respectively.

SIE-all | SIE-cls | OLP | mAP | Rank-1
- | - | - | 73.4 | 88.7
X | - | - | 74.3 | 88.6
- | X | - | 74.5 | 88.8
- | - | X | 74.6 | 89.5
- | X | X | 75.8 | 89.7

Table 7: The validations of SIE-cls and OLP in the ViT-based image encoder.

Visualization of CLIP-ReID. Finally, we perform visualization experiments using the method of (Chefer, Gur, and Wolf 2021) to show the areas the model focuses on. Both TransReID's baseline and our baseline focus on local areas, ignoring other details of the human body, while CLIP-ReID focuses on a more comprehensive area.

Figure 3: Visualization. (a) Input images, (b) TransReID baseline, (c) our baseline, (d) CLIP-ReID.

Conclusion
This paper investigates the way to apply a vision-language pre-training model to image ReID. We find that fine-tuning the visual model initialized by the CLIP image encoder, either ResNet-50 or ViT-B/16, gives a good performance compared to other baselines. To fully utilize the cross-modal description ability of the pre-trained model, we propose CLIP-ReID with a two-stage training strategy, in which learnable text tokens shared within each ID are incorporated and augmented to describe different instances. In the first stage, only these tokens are optimized, forming ambiguous text descriptions. In the second stage, these tokens and the text encoder together provide constraints for optimizing the parameters of the image encoder. We validate CLIP-ReID on several datasets of persons and vehicles, and the results demonstrate the effectiveness of the text descriptions and the superiority of our model.
Acknowledgments
This work is supported by the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105800, 19511120800 and 22DZ2229004.

References
Baldrati, A.; Bertini, M.; Uricchio, T.; and Del Bimbo, A. 2022. Effective Conditioned and Composed Image Retrieval Combining CLIP-Based Features. In Proceedings of the IEEE/CVF CVPR, 21466–21474.
Chefer, H.; Gur, S.; and Wolf, L. 2021. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF CVPR, 782–791.
Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; and Wang, Z. 2019. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8351–8361.
Chen, T.-S.; Liu, C.-T.; Wu, C.-W.; and Chien, S.-Y. 2020. Orientation-aware vehicle re-identification with semantics-guided part attention network. In European Conference on Computer Vision, 330–346. Springer.
Dai, Z.; Chen, M.; Gu, X.; Zhu, S.; and Tan, P. 2019. Batch dropblock network for person re-identification and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3691–3701.
Das, A.; Chakraborty, A.; and Roy-Chowdhury, A. K. 2014. Consistent re-identification in a camera network. In European Conference on Computer Vision, 330–345. Springer.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; and Qiao, Y. 2021. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544.
Gu, X.; Lin, T.-Y.; Kuo, W.; and Cui, Y. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
He, B.; Li, J.; Zhao, Y.; and Tian, Y. 2019. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE/CVF CVPR, 3997–4005.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; and Jiang, W. 2021. TransReID: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15013–15022.
Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916. PMLR.
Jia, M.; Cheng, X.; Lu, S.; and Zhang, J. 2022. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Transactions on Multimedia.
Jin, X.; Lan, C.; Zeng, W.; and Chen, Z. 2020a. Uncertainty-aware multi-shot knowledge distillation for image-based object re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11165–11172.
Jin, X.; Lan, C.; Zeng, W.; Wei, G.; and Chen, Z. 2020b. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11173–11180.
Khorramshahi, P.; Peri, N.; Chen, J.-c.; and Chellappa, R. 2020. The devil is in the details: Self-supervised attention for vehicle re-identification. In European Conference on Computer Vision, 369–386. Springer.
Kim, W.; Son, B.; and Kim, I. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, 5583–5594. PMLR.
Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P. M.; and Bischof, H. 2012. Large scale metric learning from equivalence constraints. In 2012 IEEE CVPR, 2288–2295. IEEE.
Li, H.; Wu, G.; and Zheng, W.-S. 2021. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF CVPR, 6729–6738.
Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.
Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; and Hoi, S. C. H. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34: 9694–9705.
Li, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; and Wu, F. 2021b. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF CVPR, 2898–2907.
Liang, L.; Lang, C.; Li, Z.; Zhao, J.; Wang, T.; and Feng, S. 2018. Seeing Crucial Parts: Vehicle Model Verification via a Discriminative Representation Model. Journal of the ACM (JACM), 22.
Liao, S.; Hu, Y.; Zhu, X.; and Li, S. Z. 2015. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE CVPR, 2197–2206.
Liu, H.; Tian, Y.; Yang, Y.; Pang, L.; and Huang, T. 2016a. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE CVPR, 2167–2175.
Liu, X.; Liu, W.; Ma, H.; and Fu, H. 2016b. Large-scale vehicle re-identification in urban surveillance videos. In 2016 IEEE International Conference on Multimedia and Expo (ICME), 1–6. IEEE.
Luo, H.; Gu, Y.; Liao, X.; Lai, S.; and Jiang, W. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF CVPR Workshops, 0–0.
Ma, H.; Zhao, H.; Lin, Z.; Kale, A.; Wang, Z.; Yu, T.; Gu, J.; Choudhary, S.; and Xie, X. 2022. EI-CLIP: Entity-Aware Interventional Contrastive Learning for E-Commerce Cross-Modal Retrieval. In Proceedings of the IEEE/CVF CVPR, 18051–18061.
Matsukawa, T.; Okabe, T.; Suzuki, E.; and Sato, Y. 2016. Hierarchical Gaussian descriptor for person re-identification. In Proceedings of the IEEE CVPR, 1363–1372.
Meng, D.; Li, L.; Liu, X.; Li, Y.; Yang, S.; Zha, Z.-J.; Gao, X.; Wang, S.; and Huang, Q. 2020. Parsing-based view-aware embedding network for vehicle re-identification. In Proceedings of the IEEE/CVF CVPR, 7103–7112.
Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; and Yang, Y. 2019. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 542–551.
Qian, J.; Jiang, W.; Luo, H.; and Yu, H. 2020. Stripe-based and attribute-aware network: A two-branch deep model for vehicle re-identification. Measurement Science and Technology, 31(9): 095401.
Quan, R.; Dong, X.; Wu, Y.; Zhu, L.; and Yang, Y. 2019. Auto-ReID: Searching for a part-aware convnet for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3750–3759.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
Rao, Y.; Chen, G.; Lu, J.; and Zhou, J. 2021. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1025–1034.
Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; and Lu, J. 2022. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF CVPR, 18082–18091.
Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; and Tomasi, C. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, 17–35. Springer.
Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; and Wang, S. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 480–496.
Sun, Z.; Nie, X.; Xi, X.; and Yin, Y. 2020. CFVMNet: A multi-branch network for vehicle re-identification based on common field of view. In Proceedings of the 28th ACM International Conference on Multimedia, 3523–3531.
Tan, H.; Liu, X.; Bian, Y.; Wang, H.; and Yin, B. 2021. Incomplete descriptor mining with elastic loss for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 32(1): 160–171.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; and Sun, J. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF CVPR, 6449–6458.
Wang, G.; Yuan, Y.; Chen, X.; Li, J.; and Zhou, X. 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, 274–282.
Wang, P.; Zhao, Z.; Su, F.; and Meng, H. 2022. LTReID: Factorizable Feature Generation with Independent Components for Long-Tailed Person Re-Identification. IEEE Transactions on Multimedia.
Wang, Z.; Yu, J.; Yu, A. W.; Dai, Z.; Tsvetkov, Y.; and Cao, Y. 2021. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
Wei, L.; Zhang, S.; Gao, W.; and Tian, Q. 2018. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE CVPR, 79–88.
Yan, C.; Pang, G.; Bai, X.; Liu, C.; Ning, X.; Gu, L.; and Zhou, J. 2021. Beyond triplet loss: Person re-identification with fine-grained difference-aware pairwise loss. IEEE Transactions on Multimedia, 24: 1665–1677.
Yi, D.; Lei, Z.; Liao, S.; and Li, S. Z. 2014. Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition, 34–39. IEEE.
Zhang, E.; Jiang, X.; Cheng, H.; Wu, A.; Yu, F.; Li, K.; Guo, X.; Zheng, F.; Zheng, W.; and Sun, X. 2021a. One for More: Selecting Generalizable Samples for Generalizable ReID Model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 3324–3332.
Zhang, Q.; Lai, J.; Feng, Z.; and Xie, X. 2021b. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification. IEEE Transactions on Image Processing, 31: 352–365.
Zhang, X.; Zhang, R.; Cao, J.; Gong, D.; You, M.; and Shen, C. 2019. Part-guided attention learning for vehicle re-identification. arXiv preprint arXiv:1909.06023, 2(8).
Zhang, Y.; He, B.; Sun, L.; and Li, Q. 2021c. Progressive Multi-Stage Feature Mix for Person Re-Identification. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2765–2769. IEEE.
Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; and Chen, Z. 2020. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF CVPR, 3186–3195.
Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, 1116–1124.
Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 13001–13008.
Zhou, C.; Loy, C. C.; and Dai, B. 2021. DenseCLIP: Extract free dense labels from CLIP. arXiv preprint arXiv:2112.01071.
Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2021. Learning to Prompt for Vision-Language Models.
Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF CVPR, 16816–16825.
Zhou, K.; Yang, Y.; Cavallaro, A.; and Xiang, T. 2019. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3702–3712.
Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; and Shan, Y. 2022. Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification. In Proceedings of the IEEE/CVF CVPR, 4692–4702.
Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; and Wang, J. 2020. Identity-guided human semantic parsing for person re-identification. In European Conference on Computer Vision, 346–363. Springer.
Zhu, K.; Guo, H.; Zhang, S.; Wang, Y.; Huang, G.; Qiao, H.; Liu, J.; Wang, J.; and Tang, M. 2021. AAformer: Auto-aligned transformer for person re-identification. arXiv preprint arXiv:2104.00921.
Supplementary Material

Alternative way for the first training stage.
In order to improve the training efficiency of the first stage, we propose another method, which computes Lt2ice based on the average image feature V̄yi over all images with ID yi. Since the image encoder is kept fixed, V̄yi can be computed offline and buffered in memory. In this way, the first stage is completed in a short time, at the expense of a drop in final performance, as shown in Tab. 8.

L_{t2ice}(y_i) = -\log \frac{\exp(s(\bar{V}_{y_i}, T_{y_i}))}{\sum_{y_a=1}^{N} \exp(s(\bar{V}_{y_a}, T_{y_i}))}   (10)

Methods | MSMT17 mAP R1 T | Market-1501 mAP R1 T
average | 72.4 88.3 2.0 | 89.3 95.2 2.0
instance | 73.4 88.6 37.5 | 89.6 95.5 15.5

Table 8: Comparison between the two training methods in the first stage. T denotes the training time in minutes. We implement the ViT-based model on a single NVIDIA GeForce RTX 2080 Ti GPU.
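A sketch of this faster variant is given below. The per-ID mean features are computed once from the frozen image encoder's outputs, and only Eq. (10) is optimized; normalization before the dot product is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def buffer_mean_features(feats, ids, num_ids):
    """feats: (num_images, d) frozen image features; returns (num_ids, d) per-ID means."""
    sums = torch.zeros(num_ids, feats.size(1)).index_add_(0, ids, feats)
    counts = torch.bincount(ids, minlength=num_ids).clamp(min=1).unsqueeze(1)
    return sums / counts                                   # \bar{V}_{y}

def t2ice_loss(mean_feats, text_feats):
    """Eq. (10): each ID's text feature against the N averaged image features."""
    V = F.normalize(mean_feats, dim=-1)                    # (N, d), fixed buffer
    T = F.normalize(text_feats, dim=-1)                    # (N, d), from the learnable prompts
    logits = T @ V.t()                                     # row i: s(\bar{V}_{y_a}, T_{y_i})
    labels = torch.arange(T.size(0))
    return F.cross_entropy(logits, labels)
```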
Retrieval result visualization.
We visualize the retrieval results on MSMT17 in Fig. 4, with incorrectly identified samples highlighted in orange.

Figure 4: Retrieval result visualization.

The Ltri of the previous layer.
We find that adding Ltri to the previous layer usually improves the model's ability to discriminate between different IDs, as shown in Tab. 9. Therefore, we employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50. To give a clear illustration of our loss constraints, we show them for the CNN version of CLIP-ReID in Fig. 5.

Backbone | Methods | MSMT17 mAP R1
CNN | baseline w/o pre Ltri | 57.7 79.8
CNN | baseline | 60.7 82.1
CNN | CLIP-ReID w/o pre Ltri | 59.5 82.1
CNN | CLIP-ReID | 63.0 84.4
ViT | baseline w/o pre Ltri | 65.9 84.2
ViT | baseline | 66.0 84.4
ViT | CLIP-ReID w/o pre Ltri | 73.1 88.6
ViT | CLIP-ReID | 73.4 88.7

Table 9: Comparison with/without Ltri on the previous layer.

Number of learnable tokens M.
We conduct an analysis on the number of learnable tokens M in Tab. 10. It shows that the performance is not sensitive to M, and M = 4 gives the best results, as we present in the manuscript.

M | MSMT17 mAP R1 | Market-1501 mAP R1
1 | 72.7 88.4 | 89.5 94.9
4 | 73.4 88.7 | 89.6 95.5
8 | 73.4 88.6 | 89.5 95.1

Table 10: Performance analysis of the parameter M for the ViT-based model.

Dimensions of inference features
As shown in Fig. 5, three image features are available during inference; the results of their different combinations are given in Tab. 11. We concatenate the img_feature and post_img_feature as the final feature representation.

Comparison with state-of-the-art methods on two vehicle datasets.
We further evaluate our proposed method intensively on two vehicle ReID datasets, and the full metrics are shown in Tab. 12. For the VehicleID dataset, the test sets are available in small, medium, and large versions, and CLIP-ReID achieves promising results in all three settings.
[Figure 5 diagram: the ViT-based and ResNet-based image encoders, showing where the three features are taken: pre_img_feature (before the final pooling/projection, supervised by Ltri), img_feature (supervised by Lid + Ltri), and post_img_feature after the global attention pooling or the final linear layer (supervised by Lid + Ltri + Li2tce).]

Figure 5: Details of the CNN-based CLIP-ReID image encoder.

Backbone | Inference features | Dim | MSMT17 mAP R1 | Market-1501 mAP R1
CNN | post_img_feature | 1024 | 61.3 84.0 | 88.6 95.2
CNN | pre_img_feature | 2048 | 48.3 68.8 | 83.1 91.9
CNN | img_feature | 2048 | 57.6 80.5 | 88.6 95.2
CNN | img_feature + post_img_feature | 3072 | 63.0 84.4 | 89.8 95.7
CNN | img_feature + pre_img_feature | 4096 | 57.0 78.7 | 88.5 94.9
CNN | pre_img_feature + img_feature + post_img_feature | 5120 | 62.9 83.9 | 89.9 95.6
ViT | post_img_feature | 512 | 72.3 88.2 | 89.0 94.9
ViT | pre_img_feature | 768 | 71.7 87.1 | 88.3 94.7
ViT | img_feature | 768 | 73.4 88.7 | 89.6 95.4
ViT | img_feature + post_img_feature | 1280 | 73.4 88.7 | 89.6 95.5
ViT | img_feature + pre_img_feature | 1536 | 73.6 88.6 | 89.7 95.5
ViT | pre_img_feature + img_feature + post_img_feature | 2048 | 73.6 88.6 | 89.7 95.5

Table 11: The validations on different inference features.
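The final representation used at test time is therefore a simple concatenation, sketched below with hypothetical tensors matching the ViT dimensions in Tab. 11; whether each part is L2-normalized before concatenation is an implementation detail not stated in the text and is assumed here.

```python
import torch
import torch.nn.functional as F

B = 8
img_feature = torch.randn(B, 768)        # [CLS] feature before the final projection
post_img_feature = torch.randn(B, 512)   # projected feature after the linear layer

final = torch.cat([F.normalize(img_feature, dim=1),
                   F.normalize(post_img_feature, dim=1)], dim=1)   # (B, 1280)
```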

Methods | VeRi-776 mAP R1 R5 | VehicleID Small R1 R5 mAP | Medium R1 R5 mAP | Large R1 R5 mAP
PRN | 74.3 94.3 98.7 | 78.4 92.3 - | 75.0 88.3 - | 74.2 86.4 -
SAN | 72.5 93.3 97.1 | 79.7 94.3 - | 78.4 91.3 - | 75.6 88.3 -
UMTS | 75.9 95.8 - | 80.9 - 87.0 | 78.8 - 84.2 | 76.1 - 82.8
PVEN | 79.5 95.6 98.4 | 84.7 97.0 - | 80.6 94.5 - | 77.8 92.0 -
SAVER | 79.6 96.4 98.6 | 79.9 95.2 - | 77.6 91.1 - | 75.3 88.3 -
CFVMNet | 77.1 95.3 98.4 | 81.4 94.1 - | 77.3 90.4 - | 74.7 88.7 -
CAL | 74.3 95.4 97.9 | 82.5 94.7 87.8 | 78.2 91.0 83.8 | 75.1 88.5 80.9
CLIP-ReID (CNN) | 80.3 96.8 98.4 | 85.2 97.1 90.3 | 80.7 94.3 86.5 | 78.7 92.3 84.6
CLIP-ReID (ViT) | 83.3 97.4 98.6 | 85.3 97.6 90.6 | 81.0 95.0 86.9 | 78.1 92.7 84.4

Table 12: Comparisons with the state-of-the-art vehicle ReID methods on the VeRi-776 and VehicleID datasets.
