Figure 2: Overview of our approach compared to CLIP and CoOp. (a) describes the model of CLIP, using pairs of text and image
to train the image encoder and text encoder. (b) shows the model of CoOp, which fixes the image encoder and text encoder
and fine-tunes the text prompt on the downstream dataset. (c) is our proposed CLIP-ReID method, which fixes the text encoder and
image encoder in the first training stage, optimizes a set of learnable text tokens to generate the text features, and then uses the
text features to optimize the image encoder in the second training stage.
level semantics from the text and learns transferable features, which can be adapted to many different tasks. For example, given a particular image classification task, the candidate text labels are concrete and can be combined with a prompt, such as "A photo of a", to form the text descriptions. The classification is then realized by comparing image features with text features generated by the text encoder, which takes the text descriptions of the categories as input. Note that this is a zero-shot solution that tunes no parameters for the downstream task, yet it still gives satisfactory results. Based on this, CoOp (Zhou et al. 2021) incorporates a learnable prompt for different tasks. The optimized prompt further improves the performance.

CLIP and CoOp need text labels to form text descriptions in downstream tasks. However, in most ReID tasks the labels are indexes, and there are no specific words to describe the images, so vision-language models have not been widely adopted in ReID. In this paper, we intend to exploit CLIP fully. We first fine-tune the image encoder by directly using the common losses in ReID, which already obtains high metrics compared to existing works. We use this model as our baseline and try to improve it by utilizing the text encoder in CLIP. A two-stage strategy is proposed, which aims to constrain the image encoder by generating language descriptions from the text encoder. A series of learnable text tokens are incorporated, and they are used to describe each ID ambiguously. In the first training stage, both the image and text encoders are fixed, and only these tokens are optimized. In the second stage, the description tokens and the text encoder are kept static, and together they provide an ambiguous description for each ID, which helps to build up the cross-modality image-to-text cross-entropy loss. Since CLIP has CNN-based and ViT-based models, the proposed method is validated on both ResNet-50 and ViT-B/16. Both types of model achieve the state-of-the-art on different ReID datasets. Moreover, our method can also support camera ID input and overlapped token settings in its ViT-based version.

Fig. 1 visualizes image and text features simultaneously in 2D coordinates, which helps to understand our training strategy. In the first stage, the text feature of each ID is adapted to its corresponding image features, making it become an ambiguous description. In the second stage, image features gather around their text descriptions, so that image features from different IDs become distant.

In summary, the contributions of this paper lie in the following aspects:
• To our knowledge, we are the first to utilize CLIP for ReID. We provide competitive baseline models on several ReID datasets, which are the result of fine-tuning the visual model initialized by the CLIP image encoder.
• We propose CLIP-ReID, which fully exploits the cross-modal describing ability of CLIP. In our model, ID-specific learnable tokens are incorporated to give ambiguous text descriptions, and a two-stage training strategy is designed to take full advantage of the text encoder during training.
• We demonstrate that CLIP-ReID achieves state-of-the-art performance on many ReID datasets, covering both person and vehicle ReID.

Related Works

Image ReID
Previous ReID works focus on learning discriminative features like foreground histograms (Das, Chakraborty, and Roy-Chowdhury 2014), local maximal occurrences (Liao et al. 2015), bag-of-visual words (Zheng et al. 2015), or hierarchical Gaussian descriptors (Matsukawa et al. 2016). On the other hand, ReID can also be solved as a metric learning problem, expecting a reasonable distance measurement for inter- and intra-class samples (Koestinger et al. 2012). These two aspects are naturally combined by deep neural networks (Yi et al. 2014), in which the parameters are optimized under an appropriate loss function with almost no manual interference. Particularly, with the scaling up of CNNs on ImageNet, ResNet-50 (He et al. 2016) has been regarded as the common model (Luo et al. 2019) for most ReID datasets.

Despite the powerful ability of CNNs, they are often criticized for highlighting irrelevant regions, which is probably due to overfitting on limited training data. OSNet (Zhou et al. 2019) gives a lightweight model to deal with this. Auto-ReID (Quan et al. 2019) and CDNet (Li, Wu, and Zheng 2021) employ network architecture search for a compact model. OfM (Zhang et al. 2021a) proposes a data selection method that learns a sampler to choose generalizable data during training. Although these methods obtain good results on some small datasets, their performance drops significantly on large ones like MSMT17.

Introducing prior knowledge into the network can also alleviate overfitting. An intuitive idea is to use features from different regions for identification. PCB (Sun et al. 2018) and SAN (Qian et al. 2020) divide the feature into horizontal stripes to enhance its ability to represent local regions. MGN (Wang et al. 2018) utilizes a multiple-granularity scheme of feature division to further enhance the expressive capability, and it has several branches to capture features from different parts; therefore, model complexity becomes its major issue. BDB (Dai et al. 2019) has a simple structure with only two branches, one for global features and the other for local features, and employs a simple batch feature drop strategy to randomly erase a horizontal stripe for all samples within a batch. CBDB-Net (Tan et al. 2021) enhances BDB with more types of feature dropping. Similar multi-branch approaches (Zhang et al. 2021c; Wang et al. 2022; Zhang et al. 2021b; He et al. 2019; Zhang et al. 2019; Sun et al. 2020) that aim to mine rich features from different locations have also been proposed, and they can be further improved if a semantic parsing map participates during training (Jin et al. 2020b; Zhu et al. 2020; Meng et al. 2020; Chen et al. 2020).

Attention enlarges the receptive field and is hence another way to prevent the model from focusing on small areas. In RGA (Zhang et al. 2020), non-local attention is performed along the spatial and channel directions. ABDNet (Chen et al. 2019) adopts a similar attention module and adds a regularization term to ensure feature orthogonality. HOReID (Wang et al. 2020) extends traditional attention into high-order computation, giving more discriminative features. CAL (Rao et al. 2021) provides an attention scheme for counterfactual learning, which filters out irrelevant areas and increases prediction accuracy. Recently, due to the power of the transformer, it has become popular in ReID. PAT (Li et al. 2021b) and DRL-Net (Jia et al. 2022) build on ResNet-50, but they utilize a transformer decoder to exploit image features from the CNN: in the decoder attention block, learnable queries first interact with key tokens from the image and are then updated by the weighted image values, and they are expected to reflect local features for ReID. TransReID (He et al. 2021), AAformer (Zhu et al. 2021) and DCAL (Zhu et al. 2022) all use encoder attention blocks in ViT, and they obtain better performance, especially on large datasets.

This paper implements both CNN and ViT models initialized from CLIP. Benefiting from the two-stage training, both achieve SOTA results on different datasets.

Vision-language learning
Compared to supervised pre-training on ImageNet, vision-language pre-training (VLP) has significantly improved the performance of many downstream tasks by training to match images and language. CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021) are good practices, which utilize a pair of image and text encoders and two directional InfoNCE losses computed between their outputs for training. Built on CLIP, several works (Li et al. 2022; Kim, Son, and Kim 2021) have been proposed to incorporate more types of learning tasks, such as image-to-text matching and masked image/text modeling. ALBEF (Li et al. 2021a) aligns the image and text representations before fusing them through cross-modal attention. SimVLM (Wang et al. 2021) uses a single prefix language modeling objective for end-to-end training.

Inspired by recent advances in NLP, prompt- or adapter-based tuning has become prevalent in the vision domain. CoOp (Zhou et al. 2021) proposes to fit a learnable prompt for image classification. CoCoOp (Zhou et al. 2022) learns a light-weight visual network to give meta tokens for each image, combined with a set of learnable context vectors. CLIP-Adapter (Gao et al. 2021) adds a light-weight module on top of both the image and text encoders.

In addition, researchers have investigated different downstream tasks to which CLIP can be applied. DenseCLIP (Rao et al. 2022) and MaskCLIP (Zhou, Loy, and Dai 2021) apply it to per-pixel prediction in segmentation. ViLD (Gu et al. 2021) adapts the image and text encoders in CLIP for object detection. EI-CLIP (Ma et al. 2022) and CLIP4CirDemo (Baldrati et al. 2022) use CLIP to solve retrieval problems. However, as far as we know, no existing work deals with ReID based on CLIP.
Algorithm 1: CLIP-ReID's training process.
Input: a batch of images x_i and their corresponding texts t_{y_i}.
Parameters: a set of learnable text tokens [X]_m (m ∈ 1,...,M) for all IDs in the training set X, an image encoder I, a text encoder T, and linear layers g_V and g_T.
1: Initialize I, T, g_V and g_T from the pre-trained CLIP. Initialize [X]_m (m ∈ 1,...,M) randomly.
2: while in the 1st stage do
3:   s(V_i, T_{y_i}) = g_V(I(x_i)) · g_T(T(t_{y_i}))
4:   Optimize [X]_m by Eq. (5).
5: end while
6: for y_i = 1 to N do
7:   text_{y_i} = g_T(T(t_{y_i}))
8: end for
9: while in the 2nd stage do
10:  s(V_i, T_{y_i}) = g_V(I(x_i)) · text_{y_i}
11:  Optimize I by Eq. (9).
12: end while

Method

Preliminaries: Overview of CLIP
We first briefly review CLIP. It consists of two encoders, an image encoder I(·) and a text encoder T(·). The architecture of I(·) has several alternatives; basically, a transformer like ViT-B/16 and a CNN like ResNet-50 are the two models we work on. Either of them is able to summarize the image into a feature vector in the cross-modal embedding space.

On the other hand, the text encoder T(·) is implemented as a transformer, which is used to generate a representation from a sentence. Specifically, given a description such as "A photo of a [class].", where [class] is generally replaced by concrete text labels, T(·) first converts each word into a unique numeric ID by lower-cased byte pair encoding (BPE) with a 49,152 vocab size (Sennrich, Haddow, and Birch 2015). Then, each ID is mapped to a 512-d word embedding. To achieve parallel computation, each text sequence has a fixed length of 77, including the start [SOS] and end [EOS] tokens. After a 12-layer model with 8 attention heads, the [EOS] token is taken as the feature representation of the text, which is layer normalized and then linearly projected into the cross-modal embedding space.

Specifically, i ∈ {1...B} denotes the index of an image within a batch. Let img_i be the [CLS] token embedding of the image feature and text_i the corresponding [EOS] token embedding of the text feature; the similarity between img_i and text_i is computed as

s(V_i, T_i) = V_i \cdot T_i = g_V(img_i) \cdot g_T(text_i)    (1)

where g_V(·) and g_T(·) are linear layers projecting the embeddings into the cross-modal embedding space. The image-to-text contrastive loss L_i2t is calculated as

\mathcal{L}_{i2t}(i) = -\log \frac{\exp(s(V_i, T_i))}{\sum_{a=1}^{B} \exp(s(V_i, T_a))}    (2)

and the text-to-image contrastive loss L_t2i as

\mathcal{L}_{t2i}(i) = -\log \frac{\exp(s(V_i, T_i))}{\sum_{a=1}^{B} \exp(s(V_a, T_i))}    (3)

where the numerators in Eq. (2) and Eq. (3) are the similarities of the two embeddings of a matched pair, and the denominators are all similarities with respect to the anchor V_i or T_i.
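The symmetric objective of Eqs. (1)-(3) maps directly onto a cross-entropy over a batch similarity matrix. The following is a minimal PyTorch sketch, not the authors' implementation: it assumes the projected embeddings are L2-normalized and omits the learned logit temperature that CLIP applies.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_losses(V, T):
    """Eq. (2) and Eq. (3) for one batch.

    V: (B, d) projected image embeddings, V_i = g_V(img_i)
    T: (B, d) projected text embeddings,  T_i = g_T(text_i)
    Row i of V and row i of T form the matched pair.
    """
    V = F.normalize(V, dim=-1)
    T = F.normalize(T, dim=-1)
    logits = V @ T.t()                               # s(V_i, T_a) for all pairs, Eq. (1)
    labels = torch.arange(V.size(0), device=V.device)
    loss_i2t = F.cross_entropy(logits, labels)       # Eq. (2): anchor V_i over all texts
    loss_t2i = F.cross_entropy(logits.t(), labels)   # Eq. (3): anchor T_i over all images
    return loss_i2t, loss_t2i
```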
For regular classification tasks, CLIP converts the concrete labels of the dataset into text descriptions, then produces the embedding features T_i and V_i and aligns them. CoOp incorporates a learnable prompt for different tasks while the entire set of pre-trained parameters is kept fixed, as depicted in Fig. 2(b). However, it is difficult to exploit CLIP in ReID tasks, where the labels are indexes instead of specific text.

CLIP-ReID
To deal with the above problem, we propose CLIP-ReID, which complements the lacking textual information by pre-training a set of learnable text tokens. As shown in Fig. 2(c), our scheme is built on pre-trained CLIP with two stages of training, and its metrics exceed our baseline.

The first training stage. We first introduce ID-specific learnable tokens to learn ambiguous text descriptions, which are independent for each ID. Specifically, the text descriptions fed into T(·) are designed as "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person/vehicle", where each [X]_m (m ∈ 1,...,M) is a learnable text token with the same dimension as the word embedding, and M indicates the number of learnable text tokens. In this stage, we fix the parameters of I(·) and T(·), and only the tokens [X]_m are optimized.

Similar to CLIP, we use L_i2t and L_t2i, but replace text_i with text_{y_i} in Eq. (1), since all images of an ID share the same text description. Moreover, for L_t2i, different images in a batch may belong to the same person, so T_{y_i} can have more than one positive; we therefore change it to

\mathcal{L}_{t2i}(y_i) = \frac{-1}{|P(y_i)|} \sum_{p \in P(y_i)} \log \frac{\exp(s(V_p, T_{y_i}))}{\sum_{a=1}^{B} \exp(s(V_a, T_{y_i}))}    (4)

Here, P(y_i) = {p ∈ 1...B : y_p = y_i} is the set of indices of all positives for T_{y_i} in the batch, and |·| is its cardinality.

By minimizing L_i2t and L_t2i, the gradients are back-propagated through the fixed T(·) to optimize [X]_1 [X]_2 [X]_3 ... [X]_M, taking full advantage of T(·):

\mathcal{L}_{stage1} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}    (5)

To improve computational efficiency, we obtain all the image features by feeding the whole training set into I(·) at the beginning of the first training stage. For a dataset with N IDs, we save the N different T_{y_i} of all IDs at the end of this stage, preparing for the next stage of training.
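A minimal sketch of the first-stage objective of Eq. (5), assuming the image features have been pre-extracted with the frozen image encoder as described above, and that T_y holds the frozen text encoder's output for each image's ID-specific prompt. The function and variable names are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def stage1_loss(V, T_y, labels):
    """L_stage1 = L_i2t + L_t2i (Eq. (5)); only the tokens [X]_m receive gradients.

    V:      (B, d) image embeddings from the frozen image encoder
    T_y:    (B, d) text embeddings T_{y_i} from the frozen text encoder, built
            from the ID-specific tokens of each image's identity
    labels: (B,) ID labels y_i
    """
    V = F.normalize(V, dim=-1)
    T = F.normalize(T_y, dim=-1)
    sim = V @ T.t()                                   # sim[i, a] = s(V_i, T_{y_a})
    idx = torch.arange(V.size(0), device=V.device)

    # L_i2t keeps the form of Eq. (2), with text_i replaced by text_{y_i}.
    loss_i2t = F.cross_entropy(sim, idx)

    # Eq. (4): text-to-image loss averaged over the positive set P(y_i),
    # i.e. every image in the batch that shares the anchor's ID.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()   # pos[i, a] = 1 if y_a = y_i
    log_p = sim.t().log_softmax(dim=1)                # row i: softmax over images V_a
    loss_t2i = (-(log_p * pos).sum(1) / pos.sum(1)).mean()

    return loss_i2t + loss_t2i
```

Here Eq. (4) is evaluated once per image rather than once per ID; with P × K sampling this only rescales the average.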
The second training stage. In this stage, only the parameters of I(·) are optimized. To boost the final performance, we follow the general strong pipeline of object ReID (Luo et al. 2019). We employ the triplet loss L_tri and the ID loss L_id with label smoothing for optimization; they are calculated as

\mathcal{L}_{id} = \sum_{k=1}^{N} -q_k \log(p_k)    (6)

\mathcal{L}_{tri} = \max(d_p - d_n + \alpha, 0)    (7)

where q_k = (1-\varepsilon)\delta_{k,y} + \varepsilon/N denotes the value of the target distribution, p_k represents the ID prediction logit of class k, d_p and d_n are the feature distances of the positive pair and the negative pair, and \alpha is the margin of L_tri.

To fully exploit CLIP, for each image we use the text features obtained in the first training stage to calculate the image-to-text cross-entropy L_i2tce, as shown in Eq. (8). Note that, following L_id, we utilize label smoothing on q_k in L_i2tce:

\mathcal{L}_{i2tce}(i) = \sum_{k=1}^{N} -q_k \log \frac{\exp(s(V_i, T_{y_k}))}{\sum_{y_a=1}^{N} \exp(s(V_i, T_{y_a}))}    (8)

Ultimately, the losses used in our second training stage are summarized as follows:

\mathcal{L}_{stage2} = \mathcal{L}_{id} + \mathcal{L}_{tri} + \mathcal{L}_{i2tce}    (9)

The whole training process of the proposed CLIP-ReID, including both the first and second stages, is summarized in Algorithm 1. We use the learnable prompts to mine and store the hidden states of the pre-trained image encoder and text encoder, allowing CLIP to retain its own advantages. During the second stage, these prompts regularize the image encoder and thus increase its generalization ability.
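A sketch of the second-stage objective of Eq. (9) under stated assumptions: a batch-hard triplet is used as one common realization of L_tri, the N text features saved after stage one sit in a frozen text bank, and any logit scaling as well as the per-term loss weights are omitted. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def stage2_loss(feats, cls_logits, labels, text_bank, margin=0.3, eps=0.1):
    """L_stage2 = L_id + L_tri + L_i2tce (Eq. (9)).

    feats:      (B, d) image embeddings V_i from the image encoder being tuned
    cls_logits: (B, N) ID classification logits p_k from a linear classifier
    labels:     (B,) ID labels
    text_bank:  (N, d) frozen text features T_{y_k} from the first stage
    """
    # Eq. (6): ID loss with label smoothing, q_k = (1 - eps) * delta_{k,y} + eps / N.
    loss_id = F.cross_entropy(cls_logits, labels, label_smoothing=eps)

    # Eq. (7): triplet loss with batch-hard mining on Euclidean distances.
    dist = torch.cdist(feats, feats)
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    d_p = (dist * same.float()).max(dim=1).values                  # hardest positive
    d_n = dist.masked_fill(same, float("inf")).min(dim=1).values   # hardest negative
    loss_tri = F.relu(d_p - d_n + margin).mean()

    # Eq. (8): image-to-text cross-entropy against all N ID text features,
    # also with label smoothing on q_k.
    sim = F.normalize(feats, dim=-1) @ F.normalize(text_bank, dim=-1).t()
    loss_i2tce = F.cross_entropy(sim, labels, label_smoothing=eps)

    return loss_id + loss_tri + loss_i2tce
```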
SIE and OLP. To make the model aware of the camera or viewpoint, we use Side Information Embeddings (SIE) (He et al. 2021) to introduce the relevant information. Unlike TransReID, we only add the camera information to the [CLS] token, rather than to all tokens, to avoid disturbing image details. Overlapping Patches (OLP) can further enhance the model at the cost of increased computation, and it is realized simply by changing the stride in the token embedding.
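The two options are easy to sketch for the ViT branch. Below is a rough illustration, assuming a TransReID-style learnable camera table for SIE and a convolutional patch projection for OLP; the module names, the scaling factor, and the stride value are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CameraSIE(nn.Module):
    """Side Information Embedding added only to the [CLS] token."""
    def __init__(self, num_cameras, dim, scale=1.0):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))
        self.scale = scale  # weighting of the side information (assumed hyper-parameter)

    def forward(self, tokens, cam_ids):
        # tokens: (B, 1 + num_patches, dim); tokens[:, 0] is the [CLS] token.
        tokens = tokens.clone()
        tokens[:, 0] = tokens[:, 0] + self.scale * self.cam_embed[cam_ids]
        return tokens

# Overlapping Patches (OLP): keep the 16x16 patch projection of ViT-B/16 but use a
# stride smaller than the patch size (12 here, an assumed value), so neighbouring
# patches overlap and the token sequence becomes longer.
overlapping_patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                                    kernel_size=16, stride=12)
```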
Experiments

Datasets and Evaluation Protocols
We evaluate our method on four person re-identification datasets, MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016) and Occluded-Duke (Miao et al. 2019), and on two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a). The details of these datasets are summarized in Tab. 1. Following common practice, we adopt the cumulative matching characteristics (CMC) at Rank-1 (R1) and the mean average precision (mAP) to evaluate performance.

Dataset | Images | IDs | Cams (+ Views)
MSMT17 | 126,441 | 4,101 | 15
Market-1501 | 32,668 | 1,501 | 6
DukeMTMC-reID | 36,411 | 1,404 | 8
Occluded-Duke | 35,489 | 1,404 | 8
VeRi-776 | 49,357 | 776 | 28
VehicleID | 221,763 | 26,267 | -
Table 1: Statistics of the datasets used in the paper.

Implementations
Models. We adopt the visual encoder I(·) and the text encoder T(·) from CLIP as the backbones of our image and text feature extractors. CLIP provides two alternatives for I(·), namely a transformer and a CNN with a global attention pooling layer. For the transformer, we choose ViT-B/16, which contains 12 transformer layers with a hidden size of 768. To match the output of T(·), the dimension of the image feature vector is reduced from 768 to 512 by a linear layer. For the CNN, we choose ResNet-50, where the last stride is changed from 2 to 1, resulting in a larger feature map that preserves more spatial information. The global attention pooling layer after ResNet-50 reduces the dimension of the embedding vectors from 2048 to 1024, matching the dimension of the text features, which are converted from 512 to 1024.

Training details. In the first training stage, we use the Adam optimizer for both the CNN-based and the ViT-based models, with a learning rate initialized at 3.5 × 10^-4 and decayed by a cosine schedule. At this stage, the batch size is set to 64 without any augmentation, and only the learnable text tokens [X]_1 [X]_2 [X]_3 ... [X]_M are optimized. In the second training stage (the same as our baseline), the Adam optimizer is also used to train the image encoder. Each mini-batch consists of B = P × K images, where P is the number of randomly selected identities and K is the number of samples per identity; we take P = 16 and K = 4. Each image is augmented by random horizontal flipping, padding, cropping and erasing (Zhong et al. 2020). For the CNN-based model, we spend 10 epochs linearly increasing the learning rate from 3.5 × 10^-6 to 3.5 × 10^-4, and then the learning rate is decayed by 0.1 at the 40th and 70th epochs. For the ViT-based model, we warm up the model for 10 epochs with a linearly growing learning rate from 5 × 10^-7 to 5 × 10^-6; then it is decreased by a factor of 0.1 at the 30th and 50th epochs. We train the CNN-based model for 120 epochs and the ViT-based model for 60 epochs. For the CNN-based model, we apply L_tri and L_id before and after the global attention pooling layer, with α set to 0.3; similarly, for the ViT-based model we apply them before and after the linear layer following the transformer. Note that we also employ L_tri after the 11th transformer layer of ViT-B/16 and after the 3rd residual layer of ResNet-50.
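The B = P × K batch composition described above can be produced by an identity-balanced sampler. The sketch below is one possible implementation, not the authors' code; identities with fewer than K images are re-sampled with replacement, which is an assumption.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class PKSampler(Sampler):
    """Yields indices so that every mini-batch holds P identities with K images each."""
    def __init__(self, labels, P=16, K=4):
        self.P, self.K = P, K
        self.by_id = defaultdict(list)
        for idx, pid in enumerate(labels):
            self.by_id[pid].append(idx)
        self.pids = list(self.by_id)

    def __iter__(self):
        pids = self.pids[:]
        random.shuffle(pids)
        for start in range(0, len(pids) - self.P + 1, self.P):
            for pid in pids[start:start + self.P]:
                pool = self.by_id[pid]
                if len(pool) >= self.K:
                    yield from random.sample(pool, self.K)
                else:
                    yield from random.choices(pool, k=self.K)  # re-sample small IDs

    def __len__(self):
        return (len(self.pids) // self.P) * self.P * self.K
```

It could be paired with `DataLoader(dataset, batch_size=16 * 4, sampler=PKSampler(labels))`, so that each mini-batch of 64 images contains 16 identities with 4 images each.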
Comparison with State-of-the-Art Methods
We compare our method with state-of-the-art methods on three widely used person ReID benchmarks and one occluded ReID benchmark in Tab. 2, and on two vehicle ReID benchmarks in Tab. 3. Despite being simple, CLIP-ReID achieves strikingly good results. Note that all results listed here are without re-ranking.

Person ReID. For both CNN-based and ViT-based methods, CLIP-ReID outperforms previous methods by a large margin on the most challenging dataset, MSMT17. Our method achieves 63.0% mAP and 84.4% R1 with the CNN-based backbone, and 73.4% mAP and 88.7% R1 (6.0% and 3.4% higher than TransReID+SIE+OLP) with the ViT-based backbone using only the CLIP-ReID method; with the further use of SIE and OLP, we can improve mAP and R1 to 75.8% and 89.7%. On other smaller or occluded datasets, such as Market-1501, DukeMTMC-reID, and Occluded-Duke, we also increase the mAP with the ViT-based backbone by 1.0%, 0.5% and 1.1%, respectively.
Backbone | Method | Reference | MSMT17 | Market-1501 | DukeMTMC | Occluded-Duke
CNN | PCB* | ECCV (2018) | - / - | 81.6 / 93.8 | 69.2 / 83.3 | - / -
CNN | MGN* | MM (2018) | - / - | 86.9 / 95.7 | 78.4 / 88.7 | - / -
CNN | OSNet | ICCV (2019) | 52.9 / 78.7 | 84.9 / 94.8 | 73.5 / 88.6 | - / -
CNN | ABD-Net* | ICCV (2019) | 60.8 / 82.3 | 88.3 / 95.6 | 78.6 / 89.0 | - / -
CNN | Auto-ReID* | ICCV (2019) | 52.5 / 78.2 | 85.1 / 94.5 | - / - | - / -
CNN | HOReID | CVPR (2020) | - / - | 84.9 / 94.2 | 75.6 / 86.9 | 43.8 / 55.1
CNN | ISP | ECCV (2020) | - / - | 88.6 / 95.3 | 80.0 / 89.6 | 52.3 / 62.8
CNN | SAN | AAAI (2020b) | 55.7 / 79.2 | 88.0 / 96.1 | 75.5 / 87.9 | - / -
CNN | OfM | AAAI (2021a) | 54.7 / 78.4 | 87.9 / 94.9 | 78.6 / 89.0 | - / -
CNN | CDNet | CVPR (2021) | 54.7 / 78.9 | 86.0 / 95.1 | 76.8 / 88.6 | - / -
CNN | PAT | CVPR (2021b) | - / - | 88.0 / 95.4 | 78.2 / 88.8 | 53.6 / 64.5
CNN | CAL* | ICCV (2021) | 56.2 / 79.5 | 87.0 / 94.5 | 76.4 / 87.2 | - / -
CNN | CBDB-Net* | TCSVT (2021) | - / - | 85.0 / 94.4 | 74.3 / 87.7 | 38.9 / 50.9
CNN | ALDER* | TIP (2021b) | 59.1 / 82.5 | 88.9 / 95.6 | 78.9 / 89.9 | - / -
CNN | LTReID* | TMM (2022) | 58.6 / 81.0 | 89.0 / 95.9 | 80.4 / 90.5 | - / -
CNN | DRL-Net | TMM (2022) | 55.3 / 78.4 | 86.9 / 94.7 | 76.6 / 88.1 | 50.8 / 65.0
CNN | baseline | | 60.7 / 82.1 | 88.1 / 94.7 | 79.3 / 88.6 | 47.4 / 54.2
CNN | CLIP-ReID | | 63.0 / 84.4 | 89.8 / 95.7 | 80.7 / 90.0 | 53.5 / 61.0
ViT | AAformer* | arXiv (2021) | 63.2 / 83.6 | 87.7 / 95.4 | 80.0 / 90.1 | 58.2 / 67.0
ViT | TransReID+SIE+OLP | ICCV (2021) | 67.4 / 85.3 | 88.9 / 95.2 | 82.0 / 90.7 | 59.2 / 66.4
ViT | TransReID+SIE+OLP* | | 69.4 / 86.2 | 89.5 / 95.2 | 82.6 / 90.7 | - / -
ViT | DCAL | CVPR (2022) | 64.0 / 83.1 | 87.5 / 94.7 | 80.1 / 89.0 | - / -
ViT | baseline | | 66.1 / 84.4 | 86.4 / 93.3 | 80.0 / 88.8 | 53.5 / 60.8
ViT | CLIP-ReID | | 73.4 / 88.7 | 89.6 / 95.5 | 82.5 / 90.0 | 59.5 / 67.1
ViT | CLIP-ReID+SIE+OLP | | 75.8 / 89.7 | 90.5 / 95.4 | 83.1 / 90.8 | 60.3 / 67.2
Table 2: Comparison with state-of-the-art CNN- and ViT-based methods on person ReID datasets; each dataset column reports mAP / R1 (%). DukeMTMC denotes the DukeMTMC-reID benchmark. The superscript star (*) means that the input image is resized to a resolution larger than 256×128.
Vehicle ReID. Our method achieves competitive performance compared to the prior CNN-based and ViT-based methods. With the ViT-based backbone, CLIP-ReID reaches 85.3% R1 and 97.6% R5 on VehicleID, while CLIP-ReID! reaches 84.5% mAP and 97.3% R1 on VeRi-776.

Backbone | Method | VeRi-776 mAP / R1 | VehicleID R1 / R5
CNN | PRN (2019) | 74.3 / 94.3 | 78.4 / 92.3
CNN | PGAN (2019) | 79.3 / 96.5 | 77.8 / 92.1
CNN | SAN (2020) | 72.5 / 93.3 | 79.7 / 94.3
CNN | UMTS (2020a) | 75.9 / 95.8 | 80.9 / -
CNN | SPAN (2020) | 68.9 / 94.0 | - / -
CNN | PVEN (2020) | 79.5 / 95.6 | 84.7 / 97.0
CNN | SAVER (2020) | 79.6 / 96.4 | 79.9 / 95.2
CNN | CFVMNet (2020) | 77.1 / 95.3 | 81.4 / 94.1
CNN | CAL (2021) | 74.3 / 95.4 | 82.5 / 94.7
CNN | EIA-Net (2018) | 79.3 / 95.7 | 84.1 / 96.5
CNN | FIDI (2021) | 77.6 / 95.7 | 78.5 / 91.9
CNN | baseline | 79.3 / 95.7 | 84.4 / 96.6
CNN | CLIP-ReID | 80.3 / 96.8 | 85.2 / 97.1
ViT | TransReID (2021) | 80.6 / 96.9 | 83.6 / 97.1
ViT | TransReID! | 82.0 / 97.1 | 85.2 / 97.5
ViT | DCAL (2022) | 80.2 / 96.9 | - / -
ViT | baseline | 79.3 / 95.7 | 84.2 / 96.6
ViT | CLIP-ReID | 83.3 / 97.4 | 85.3 / 97.6
ViT | CLIP-ReID! | 84.5 / 97.3 | 85.5 / 97.2
Table 3: Comparison with state-of-the-art CNN- and ViT-based methods on vehicle ReID datasets; columns report mAP / R1 on VeRi-776 and R1 / R5 on VehicleID. Only the small subset of VehicleID is used in this paper. ! indicates that the method further uses SIE and OLP on VeRi-776 and OLP on VehicleID.

Ablation Studies and Analysis
We conduct comprehensive ablation studies on the MSMT17 dataset to analyze the influence and sensitivity of the major design choices.

Baseline comparison. Many CNN-based works are based on the strong baseline proposed by BoT (Luo et al. 2019). For ViT-based methods, TransReID's baseline is widely adopted, while AAformer also proposes its own baseline. Although slightly different, both of them are pre-trained on ImageNet, which is different from ours. As shown in Tab. 4, due to the effectiveness of CLIP pre-training, our baseline achieves superior performance compared to the other baselines.

Backbone | Method | mAP | Rank-1
CNN | BoT | 51.3 | 75.3
CNN | CLIP-ReID baseline | 60.7 | 82.1
ViT | AAformer baseline | 58.5 | 79.4
ViT | TransReID baseline | 61.0 | 81.8
ViT | CLIP-ReID baseline | 66.1 | 84.4
Table 4: Comparison of baselines on the MSMT17 dataset.
Necessity of two-stage training. CLIP aligns embeddings from the text and image domains, so it is important to exploit its text encoder. Since ReID has no specific text that distinguishes different IDs, we aim to provide it by pre-training a set of learnable text tokens. There are two ways to optimize them. One is one-stage training, in which we train the image encoder I(·) while using the contrastive loss to train the text tokens at the same time. The other is the two-stage scheme that we propose, in which we tune the learnable text tokens in the first stage and use them to calculate L_i2tce in the second stage. To verify which approach is more effective, we perform a comparison on MSMT17. As shown in Tab. 5, one-stage training is less effective because, in the early stage of training, the learnable text tokens cannot describe the images well yet still affect the optimization of I(·).

Backbone | Method | mAP | Rank-1
CNN | baseline | 60.7 | 82.1
CNN | one stage | 61.9 | 82.8
CNN | two stage | 63.0 | 84.4
ViT | baseline | 66.1 | 84.4
ViT | one stage | 68.9 | 85.9
ViT | two stage | 73.4 | 88.7
Table 5: Comparison between one- and two-stage training.
Constraint from the text encoder in the second stage. There are P different IDs in a batch, with K images per ID. When computing L_i2tce, if we only consider the text embeddings of the IDs within a batch, as in L_i2t, the number of participating IDs is much smaller than the total number of IDs used in L_id. We therefore extend it to all IDs in the training set, as in L_i2tce. From Tab. 6, we can conclude that comparing with all IDs in the training set is better than comparing only with the IDs of the current batch. Another conclusion is that L_t2i is not necessary in the second stage. Finally, we combine L_id, L_tri and L_i2tce to form the total loss. For the ViT, the weights of the three loss terms are 0.25, 1, and 1, respectively, while they are 1, 1, and 1 for the CNN.

SIE and OLP. In Tab. 7, we evaluate the effectiveness of SIE and OLP on MSMT17. Using SIE only for the [CLS] token works better than adding it to all global tokens. The model gains 1.1% mAP on MSMT17 when using only SIE-cls and 1.2% when using only OLP. When applied together, mAP and R1 rise by 2.4% and 1.0%, respectively.

SIE-all | SIE-cls | OLP | mAP | Rank-1
- | - | - | 73.4 | 88.7
X | - | - | 74.3 | 88.6
- | X | - | 74.5 | 88.8
- | - | X | 74.6 | 89.5
- | X | X | 75.8 | 89.7
Table 7: Validation of SIE-cls and OLP in the ViT-based image encoder.

Visualization of CLIP-ReID. Finally, we perform visualization experiments using the method of (Chefer, Gur, and Wolf 2021) to show the areas the model focuses on. Both TransReID's baseline and our baseline focus on local areas, ignoring other details of the human body, while CLIP-ReID attends to a more comprehensive area.

Figure 3: Visualization. (a) Input images, (b) TransReID baseline, (c) our baseline, (d) CLIP-ReID.
Dimensions of inference features
As shown in Fig. 5, we have three image features that can be used during inference; the results of different combinations are reported in Tab. 11. We concatenate the img feature and the post img feature as the final feature representation.

[Figure 5: diagram (not reproduced here) of where the pre, img, and post image features are taken in the CNN- and ViT-based models, together with the losses attached at each point.]
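A one-line sketch of the final descriptor described above; whether each part is L2-normalized before concatenation is an assumption, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def inference_descriptor(img_feature, post_img_feature):
    """Concatenate the img feature and the post img feature for retrieval."""
    return torch.cat([F.normalize(img_feature, dim=-1),
                      F.normalize(post_img_feature, dim=-1)], dim=-1)
```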
Backbone | Inference features | Dim | MSMT17 mAP / R1 | Market-1501 mAP / R1
CNN | post img feature | 1024 | 61.3 / 84.0 | 88.6 / 95.2
CNN | pre img feature | 2048 | 48.3 / 68.8 | 83.1 / 91.9
CNN | img feature | 2048 | 57.6 / 80.5 | 88.6 / 95.2
CNN | img feature + post img feature | 3072 | 63.0 / 84.4 | 89.8 / 95.7
CNN | img feature + pre img feature | 4096 | 57.0 / 78.7 | 88.5 / 94.9
CNN | pre img feature + img feature + post img feature | 5120 | 62.9 / 83.9 | 89.9 / 95.6
ViT | post img feature | 512 | 72.3 / 88.2 | 89.0 / 94.9
ViT | pre img feature | 768 | 71.7 / 87.1 | 88.3 / 94.7
ViT | img feature | 768 | 73.4 / 88.7 | 89.6 / 95.4
ViT | img feature + post img feature | 1280 | 73.4 / 88.7 | 89.6 / 95.5
ViT | img feature + pre img feature | 1536 | 73.6 / 88.6 | 89.7 / 95.5
ViT | pre img feature + img feature + post img feature | 2048 | 73.6 / 88.6 | 89.7 / 95.5
Table 11: Results of different combinations of inference features.
Method | VeRi-776 mAP / R1 / R5 | VehicleID-Small R1 / R5 / mAP | VehicleID-Medium R1 / R5 / mAP | VehicleID-Large R1 / R5 / mAP
PRN | 74.3 / 94.3 / 98.7 | 78.4 / 92.3 / - | 75.0 / 88.3 / - | 74.2 / 86.4 / -
SAN | 72.5 / 93.3 / 97.1 | 79.7 / 94.3 / - | 78.4 / 91.3 / - | 75.6 / 88.3 / -
UMTS | 75.9 / 95.8 / - | 80.9 / - / 87.0 | 78.8 / - / 84.2 | 76.1 / - / 82.8
PVEN | 79.5 / 95.6 / 98.4 | 84.7 / 97.0 / - | 80.6 / 94.5 / - | 77.8 / 92.0 / -
SAVER | 79.6 / 96.4 / 98.6 | 79.9 / 95.2 / - | 77.6 / 91.1 / - | 75.3 / 88.3 / -
CFVMNet | 77.1 / 95.3 / 98.4 | 81.4 / 94.1 / - | 77.3 / 90.4 / - | 74.7 / 88.7 / -
CAL | 74.3 / 95.4 / 97.9 | 82.5 / 94.7 / 87.8 | 78.2 / 91.0 / 83.8 | 75.1 / 88.5 / 80.9
CLIP-ReID (CNN) | 80.3 / 96.8 / 98.4 | 85.2 / 97.1 / 90.3 | 80.7 / 94.3 / 86.5 | 78.7 / 92.3 / 84.6
CLIP-ReID (ViT) | 83.3 / 97.4 / 98.6 | 85.3 / 97.6 / 90.6 | 81.0 / 95.0 / 86.9 | 78.1 / 92.7 / 84.4
Table 12: Comparisons with the state-of-the-art vehicle ReID methods on the VeRi-776 and VehicleID datasets.