
¹ Northwestern Polytechnical University, Xi’an, Shaanxi 710000, China
² Xidian University, Xi’an, Shaanxi 710000, China
Email: {szzhang,lran,xyh_7491,ynzhang}@nwpu.edu.cn, {luowenlong,yqc123}@mail.nwpu.edu.cn, {dcheng}@xidian.edu.cn

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Shizhou Zhang¹ (ORCID 0000-0002-5914-7109), Wenlong Luo¹ (ORCID 0009-0000-5439-2749), De Cheng (ORCID 0000-0003-1932-4390, corresponding author), Qingchun Yang¹ (ORCID 0009-0005-2822-2443), Lingyan Ran¹ (ORCID 0000-0002-3084-9860), Yinghui Xing¹ (ORCID 0000-0001-6021-8261), Yanning Zhang¹ (ORCID 0000-0002-2977-8057)
Abstract

In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. The G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID, termed VSLA-CLIP, which transforms the cross-platform visual alignment problem into visual-semantic alignment through a vision-language model (i.e., CLIP) and applies a parameter-efficient Video Set-Level-Adapter module to adapt the image-based foundation model to video ReID tasks. Besides, to further reduce the great discrepancy across platforms, we also devise platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on three widely used video ReID datasets and our proposed G2A-VReID dataset. The code and datasets are available at https://github.com/FHR-L/VSLA-CLIP.

Keywords: Dataset · Ground-to-Aerial · Person Re-Identification

1 Introduction

Video-based person Re-Identification (VReID) [14, 29, 30, 2] has been attracting much attention in recent years, as video can provide richer information than a single image. Existing research efforts on video-based ReID are mostly based on data from a single type of platform, such as ground surveillance cameras. Suppose that a suspect has committed a crime in a city where abundant surveillance cameras have been deployed and has escaped into rural areas where no ground surveillance cameras have been deployed in advance. One feasible solution is to send out a moving camera mounted on an airborne UAV platform. The technical crux thus becomes cross-platform video-based person ReID.

In this paper, to meet the research need of cross-platform video person ReID, we construct a large-scale benchmark dataset named Ground-to-Aerial Video ReID (G2A-VReID). The G2A-VReID dataset consists of 185,907 images in total, with 5,576 tracklets belonging to 2,788 different person IDs. Each person ID includes two tracklets captured by the UAV and ground surveillance platforms, respectively. There is an average of 33.3 images for each tracklet. In terms of the number of identities, the G2A-VReID dataset is larger than most existing video-based person ReID datasets, such as MARS [52], iLIDS [41], and PRID-2011 [18].

To capture videos of the same person with both the ground surveillance camera and the UAV-mounted camera, we simulate the ground-to-aerial ReID setting by fixing a ground surveillance camera at a specific location while flying a DJI consumer UAV nearby. The ground camera is set at about 2.0 meters above the ground, and the flight altitude of the UAV varies from 20 meters to 60 meters. Additionally, to be more realistic, the flight mode is switched randomly among hovering, cruising, and rotating with diverse view angles, which greatly enriches the perspectives in the dataset.

Furthermore, the dataset is collected in nine different scenarios, including school campuses, subway station entrances, tourist sites, crossroads, etc. As shown in Fig. 1, the cross-platform video person ReID task is much more challenging than its counterpart on a single ground platform, as the tracklets captured in ground-to-aerial cross-platform scenarios are characterized by drastic variations in viewpoint, pose, and resolution. We have evaluated nine existing video-based person ReID algorithms on our newly collected cross-platform dataset. The experimental results show inferior performance compared with that on conventional single-platform datasets. Due to the great challenges of drastic view, pose, and resolution changes, it is not easy to align the visual part features between the cross-platform devices, which is essential in the ReID task.

Recently, with the emergence of large-scale pre-trained vision-language models, e.g., CLIP [35], a well-aligned visual-semantic space can be obtained through cross-modality contrastive learning of large web visual data along with high-level language descriptions. Although in the ReID task there are no language descriptions for each person, whose identity is denoted merely by an index number, a set of learnable description tokens can be introduced to roughly describe each ID [31]. In this paper, we propose to transform the cross-platform visual alignment problem into visual-semantic alignment with the help of the foundation model CLIP. To be concrete, a two-stage optimization strategy is utilized: the first stage learns description tokens for each ID, and the second stage fine-tunes the Image Encoder by aligning visual embeddings to the semantic features obtained through the learned description tokens. Our experiments demonstrate that fine-tuning the Image Encoder with the constraint of visual-semantic alignment achieves competitive performance.

However, there are two obvious drawbacks in adapting image-based pre-trained foundation models to video ReID tasks by simple fine-tuning. One is the huge training cost caused by the large number of trainable parameters, and the other is that the image encoder lacks the capability of modeling inter-frame information. Many previous works [3, 1, 21] deem a video to be a stack of frames with temporal structure, and are devoted to modeling temporal features with well-designed modules. But these works ignore the complementarity of frames in a video, which has proved to be more effective in the ReID task [2]. Moreover, from the aerial perspective, temporal information is limited due to severe self-occlusion. As shown in Tab. 2, temporal models [15, 12, 21] show inferior performance on G2A-VReID. In this paper, we present a new perspective that regards a video clip as an unordered set and propose a parameter-efficient Video Set-Level-Adapter (VSLA) module for foundation model adaptation. Concretely, VSLA consists of a Cross-Frame Attention Adapter (CFAA) and an Intra-Frame Adapter (IFA). CFAA uses cross-frame attention to allow information exchange between frames, enabling our model to collect complementary features in each video set for powerful video-level representations. IFA transfers the visual ability of the image-based foundation model to downstream tasks, providing strong intra-frame appearance representation.

Furthermore, we also propose the Platform-Bridge Prompt (PBP) module to solve the visual misalignment problem in cross-platform tasks, where prompts are adopted to provide explicit instructions to the pre-trained model for generating task-specific results [26, 23, 27, 43]. Specifically, the designed PBP consists of two sets of platform-specific prompts inserted into the Image Encoder, which aim to guide the model to focus on learning platform-invariant features, thus bridging the semantic gap of visual features between the ground and aerial platforms.

In summary, the main contributions are as follows:

  • We are the first to collect a large-scale Ground-to-Aerial Video person ReID benchmark dataset for the task of cross-platform video-based person ReID, and we evaluate extensive baseline methods on our dataset.

  • We propose to transform the essential cross-platform visual part alignment problem into visual-semantic alignment with the help of CLIP, and propose PBP to further bridge the semantic gap of visual features between the ground and aerial platforms.

  • We propose the Video Set-Level-Adapter to efficiently adapt a pre-trained image-based visual foundation model to video ReID tasks. Our method achieves state-of-the-art performance on three widely used video ReID datasets and our cross-platform benchmark dataset.

2 Related Works

In this section, we provide a concise review of two sets of works closely related to our research.

Video ReID Datasets. Existing works on person ReID can be categorized into image-based ReID [50, 8, 31, 49, 7, 16, 9, 55] and video-based ReID [29, 1, 5, 52]. For video-based ReID, the popular datasets include PRID-2011 [18], iLIDS [41], MARS [52], and LS-VID [28]. PRID-2011 comprises multiple person trajectories captured by two static surveillance cameras, encompassing only 400 sequences involving 200 individuals. In contrast, LS-VID is a large-scale benchmark featuring 14,943 sequences of 3,772 persons, with videos captured at various times throughout the day. Many works have achieved strong performance on these datasets. Specifically, FGReID [46] achieved a Rank-1 of 96.1% on PRID-2011, SINet [2] reached a Rank-1 of 92.5% on iLIDS, and DenseIL [17] achieved an mAP of 87.0% on MARS, indicating a saturation trend on these datasets. The existing datasets are all captured with a single platform, i.e., ground surveillance cameras, while we aim to collect a Ground-to-Aerial cross-platform video ReID dataset to support the development of this field.

Video ReID Methods. The input processed in video-based person ReID is a video composed of a sequence of person images. Videos contain richer temporal and spatial information than images. Previous works used 3D CNNs [29, 1, 15], temporal weighting [14, 54, 5, 15, 52], optical flow [13, 32, 10], and many other methods [21, 12, 20, 6] to model the spatiotemporal information of video sequences and alleviate the negative effects of appearance change, occlusion, pose variation, etc. For 3D CNNs, STRF [1] proposed a trainable unit with negligible computational overhead, which is used in conjunction with a 3D CNN to learn discriminative 3D features. For temporal weighting, AP3D [15] assigns attention scores to each spatial region to achieve discriminative part mining and frame selection. Optical flow refers to the apparent movement of target pixels between two consecutive frames, caused by the movement of objects in the image or of the camera. STA [14] makes use of color and optical flow information to capture appearance and motion information. An essential topic in improving the performance of video-based ReID is the visual part alignment between query and gallery videos. PiT [48] divides each frame into small patches of different granularity in different directions, allowing the model to align two videos with multi-scale local information. For these methods, it is relatively easy to align the visual part features between the query and gallery videos by utilizing a simple stripe partition, as the variations of view, pose, and resolution are limited among single-platform ground cameras.

To solve the severe misalignment of visual features in cross-platform tasks, we resort to visual-semantic alignment of the CLIP model to align the cross-platform person features.

3 Dataset

In this section, we first introduce how we collect and annotate our G2A-VReID dataset in Sec. 3.1 and Sec. 3.2. Then, we make comparisons with other datasets and highlight the key characteristics of G2A-VReID in Sec. 3.3.

Figure 1: Visualization of proposed G2A-VReID at different heights.
Figure 2: The distributions of sequence length.

3.1 Dataset Collection

To increase the richness of the data and make it closer to real environments, the videos are captured in 9 different scenarios, including a library, crossroads, a bus stop, tourist sites, etc. Ground surveillance cameras are used to shoot videos from the ground perspective, and a DJI Mavic UAV is adopted to gather videos from the aerial perspective. Specifically, the surveillance camera is fixed at a height of about two meters above the ground, and the UAV flies at heights from 20 to 60 meters. The UAV alternates among hovering, cruising, and rotating, so that the captured persons are seen from richer perspectives. We sampled the captured videos at intervals of 0.5 seconds to generate 31,770 frames. As shown in Fig. 1, there are great differences in the viewing perspective and resolution of images taken on different platforms, making the dataset more challenging than existing ones.

3.2 Annotation

During annotation, all persons appearing in the videos are marked with bounding boxes, and each person is cropped from the scene image according to the box. At the same time, we use a mosaic to mask clearly visible faces for privacy protection. Then, the same people in the UAV and surveillance videos are associated and assigned unique IDs. Next, we combine all the images of a person in one camera into one trajectory. Thus, each person has two trajectories, one from the surveillance camera and the other from the UAV. Finally, we annotated 185,907 images of 2,788 identities, corresponding to 5,576 tracklets. Fig. 2 shows the distributions of sequence length.
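The grouping step described above, where all crops of one person from one camera form a single trajectory, can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout `(person_id, platform, frame_index, image_path)`, the platform names, and the file paths are hypothetical and are not taken from the released annotation format.

```python
from collections import defaultdict

def build_tracklets(crops):
    """Group cropped detections into tracklets keyed by (person_id, platform).

    `crops` is an iterable of (person_id, platform, frame_index, image_path)
    tuples; this record layout is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for person_id, platform, frame_index, image_path in crops:
        grouped[(person_id, platform)].append((frame_index, image_path))
    # Sort each tracklet by frame index so that frames follow capture order.
    return {key: [path for _, path in sorted(frames)]
            for key, frames in grouped.items()}

# Toy annotations (hypothetical paths): person 7 is seen on both platforms.
crops = [
    (7, "uav", 2, "uav/0007_f002.jpg"),
    (7, "ground", 1, "ground/0007_f001.jpg"),
    (7, "uav", 1, "uav/0007_f001.jpg"),
    (9, "ground", 1, "ground/0009_f001.jpg"),
]
tracklets = build_tracklets(crops)
for key, frames in sorted(tracklets.items()):
    print(key, frames)
```

Keying on the pair (identity, platform) is what yields one ground tracklet and one aerial tracklet per annotated person.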

3.3 Characteristics of Our G2A-VReID

Compared with existing VReID datasets [28, 52, 41], the characteristics of G2A-VReID are as follows: 1) Drastic view changes. The tracklets in the query and gallery sets are captured from different types of cameras. Consequently, the views in the query and gallery tracklets differ significantly. 2) Large number of annotated identities. Our G2A-VReID consists of 2,788 person IDs and 185,907 images, corresponding to 5,576 tracklets. The number of identities is significantly higher than in all existing datasets except LS-VID [28], as shown in Tab. 1. 3) Rich outdoor scenarios with large view changes. G2A-VReID consists of footage from nine diverse scenarios. This diversity enables G2A-VReID to better represent realistic environments for person ReID. In contrast, the videos in MARS [52] are captured on a university campus, while iLIDS [41] only contains videos collected in an airport arrival hall. 4) Huge difference in resolution. As depicted in Fig. 1, the height of the UAV-mounted camera varies significantly, spanning from 20 to 60 meters. The width of individuals in images captured by ground cameras primarily ranges from 10 to 70 pixels, whereas in UAV-captured images the range is narrower, from 5 to 35 pixels.

Table 1: Comparison of G2A-VReID with other Video-ReID datasets. CWM denotes the camera working mode. AD is the average duration of each video sequence.

| Datasets | G2A-VReID | LS-VID [28] | MARS [52] | iLIDS [41] | PRID-2011 [18] | 3DPeS [3] |
|---|---|---|---|---|---|---|
| Identities | 2,788 | 3,772 | 1,261 | 300 | 200 | 200 |
| Tracklets | 5,576 | 14,943 | 20,715 | 600 | 400 | 1,000 |
| Images | 185,907 | 2,982,685 | 1,067,516 | 42,460 | 40,033 | 200,000 |
| AD (s) | 16.7 | 6.7 | 5.6 | 2.4 | 3.3 | 6.7 |
| Cameras | 2 | 15 | 6 | 2 | 2 | 8 |
| View | ground & sky | ground | ground | ground | ground | ground |
| CWM | moving | fixed | fixed | fixed | fixed | fixed |
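The G2A-VReID column of Table 1 can be cross-checked against figures quoted in the text. The short Python sketch below recomputes the average tracklet length and the number of tracklets per identity. It also estimates the AD entry under the assumption that the average duration is simply the mean tracklet length multiplied by the 0.5-second sampling interval from Sec. 3.1. That formula is an assumption made here for illustration, not a procedure stated by the authors.

```python
# Figures quoted in the paper for G2A-VReID.
num_images = 185_907
num_tracklets = 5_576
num_identities = 2_788
sampling_interval_s = 0.5  # frame sampling interval from Sec. 3.1

images_per_tracklet = num_images / num_tracklets
tracklets_per_identity = num_tracklets / num_identities
# Assumption: average duration = mean tracklet length x sampling interval.
average_duration_s = images_per_tracklet * sampling_interval_s

print(f"images per tracklet:    {images_per_tracklet:.1f}")
print(f"tracklets per identity: {tracklets_per_identity:.1f}")
print(f"average duration (s):   {average_duration_s:.1f}")
```

Under that assumption the three values round to 33.3, 2.0, and 16.7, in line with the text and with the AD entry of Table 1.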

3.4 Privacy Protection

We try our best to protect the privacy of pedestrians in the following ways: 1) We mask the faces of all pedestrians using a mosaic to eliminate private information, minimizing privacy risks. 2) We use cordons to mark data collection areas and post notifications near the sites during the data capture process. However, we acknowledge the limitation that we cannot ensure every pedestrian is informed. 3) The dataset will be licensed for non-profit academic research only. More details about privacy protection (e.g., notification, mosaic, and license) are available at https://github.com/FHR-L/G2A-VReID.

4 Approach

Fig. 3 illustrates the overall architecture of our proposed method. Our approach focuses on cross-platform video person ReID and aims to adapt pre-trained image-based visual foundation models to video person ReID tasks in a parameter-efficient way. To bridge visual misalignment in cross-platform tasks, we propose to transform the fundamental visual alignment problem into visual-semantic alignment based on CLIP. Specifically, we design a simple baseline method, named FT-CLIP, by fine-tuning the Image Encoder of CLIP. A two-stage training strategy is employed to optimize our approach. In the first training stage, ID-specific description tokens are learned from samples originating from the different platforms. Then, in the second stage, visual features extracted from different platforms are aligned with the semantic features obtained through the learned description tokens. Our work shows that FT-CLIP with the constraint of visual-semantic alignment yields competitive performance, but it is not parameter-efficient and ignores inter-frame information. Therefore, we propose the Video Set-Level-Adapter for efficient model tuning, and term the resulting model VSLA-CLIP, which outperforms FT-CLIP while using fewer trainable parameters. To further bridge the semantic gap in cross-platform tasks, we propose a prompt-based approach called Platform-Bridge Prompt (PBP).

Figure 3: Overview of our proposed framework. ID-specific descriptions and shared text prompts are learned in stage one (left). Video Set-Level-Adapter and PBP are introduced and trained in the second stage (right) while freezing other parameters.

4.1 Revisiting CLIP-ReID

CLIP-ReID [31] is the pioneering approach that employs pre-trained vision-language models for image-based ReID. CLIP [35] relies on text labels to generate text descriptions. However, the labels in ReID tasks are indexes rather than specific text, and thus cannot depict detailed information about the corresponding persons. To solve this problem, CLIP-ReID uses a series of ID-specific learnable tokens to learn text descriptions and adopts a two-stage optimization strategy.

In the first training stage, only ID-specific tokens are optimized to learn text descriptions for each ID. The text $\mathcal{TD}$ that feeds into the Text-Encoder $\mathbf{E}_t(\cdot)$ is “a photo of $[\mathbf{X}]_1 \ldots [\mathbf{X}]_M$ person”, where each $[\mathbf{X}]_i$ is a learnable token. Text embedding $\mathbf{T}$ and image embedding $\mathbf{I}$ are obtained by:

$$\mathbf{T} = \mathbf{E}_t(\mathcal{TD}), \quad \mathbf{I} = \mathbf{E}_i(\mathcal{I}), \tag{1}$$

where $\mathbf{E}_i(\cdot)$ is the Image Encoder. The image-to-text contrastive loss $\mathcal{L}_{i2t}$ and text-to-image contrastive loss $\mathcal{L}_{t2i}$ are used to optimize $[\mathbf{X}]_1 \ldots [\mathbf{X}]_M$. Since there are samples with the same ID in a batch, $\mathcal{L}_{t2i}$ in CLIP-ReID is defined as:

$$\mathcal{L}_{t2i}(y_i) = \frac{-1}{|P(y_i)|} \sum_{p \in P(y_i)} \log \frac{\exp(s(\mathbf{I}_p, \mathbf{T}_{y_i}))}{\sum_{a=1}^{B} \exp(s(\mathbf{I}_a, \mathbf{T}_{y_i}))}, \tag{2}$$

where $\mathbf{T}_{y_i}$ represents the text embedding of ID $y_i$, $P(y_i) = \{p \in \{1 \ldots B\}, y_p = y_i\}$ is the set of positive samples for $\mathbf{T}_{y_i}$, and $B$ represents the batch size. $\mathcal{L}_{i2t}$ is similar to $\mathcal{L}_{t2i}$. The overall loss function of stage one, $\mathcal{L}_{stage1}$, is as follows:

$$\mathcal{L}_{stage1} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}. \tag{3}$$
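To make Eq. (2) concrete, the following standard-library Python sketch evaluates the text-to-image contrastive loss for a single ID's text embedding over a toy batch. It is a minimal illustration rather than the authors' implementation: the function names and the 2-D embeddings are made up, and no temperature scaling is applied because Eq. (2) does not include one.

```python
import math

def cosine(u, v):
    """Cosine similarity s(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def t2i_loss(image_embs, labels, text_emb, target_id):
    """Text-to-image contrastive loss of Eq. (2) for one ID's text embedding.

    image_embs: the B image embeddings in the batch; labels: their B identity
    labels; text_emb: T_{y_i}; target_id: y_i.
    """
    sims = [cosine(img, text_emb) for img in image_embs]
    log_denominator = math.log(sum(math.exp(s) for s in sims))
    positives = [p for p, y in enumerate(labels) if y == target_id]  # P(y_i)
    return -sum(sims[p] - log_denominator for p in positives) / len(positives)

# Toy batch (made-up 2-D embeddings): two images of ID 0 and one image of ID 1.
batch = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
labels = [0, 0, 1]
aligned = t2i_loss(batch, labels, text_emb=[1.0, 0.0], target_id=0)
misaligned = t2i_loss(batch, labels, text_emb=[0.0, 1.0], target_id=0)
print(aligned < misaligned)
```

The loss is lower when the text embedding of ID 0 points toward the images of ID 0, which is the behavior the first training stage exploits when optimizing the learnable tokens.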

In the second stage, the ID-specific tokens and Text-Encoder are frozen. Triplet loss $\mathcal{L}_{tri}$ [37], identity loss $\mathcal{L}_{id}$, and image-to-text cross-entropy loss $\mathcal{L}_{i2tce}$ are used to optimize the CLIP Image Encoder. $\mathcal{L}_{i2tce}$ is defined as follows:

$$\mathcal{L}_{i2tce}(y) = \sum_{k=1}^{N} -q_k \log \frac{\exp(s(\mathbf{I}_y, \mathbf{T}_{y_k}))}{\sum_{y_a=1}^{N} \exp(s(\mathbf{I}_y, \mathbf{T}_{y_a}))}, \tag{4}$$

where $q_k$ denotes the smoothed label [38] in the target distribution for the $k$-th ID, $s$ represents cosine similarity, and $N$ is the number of identities.
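Eq. (4) is a cross-entropy between a label-smoothed target distribution and a softmax over image-to-text similarities. The Python sketch below shows one way to compute it. The smoothing rule $q_k = (1 - \varepsilon)\,[k = y] + \varepsilon / N$ and the value $\varepsilon = 0.1$ are assumptions based on the common form of label smoothing; the text does not state which values are used.

```python
import math

def smoothed_targets(target, num_ids, eps=0.1):
    """Label-smoothed target distribution q over num_ids identities.

    Assumed rule: (1 - eps) on the true ID plus eps / num_ids on every ID.
    """
    return [(1.0 - eps if k == target else 0.0) + eps / num_ids
            for k in range(num_ids)]

def i2tce_loss(sims, target, eps=0.1):
    """Cross-entropy of Eq. (4); sims[k] stands for s(I_y, T_{y_k})."""
    log_z = math.log(sum(math.exp(s) for s in sims))
    q = smoothed_targets(target, len(sims), eps)
    return sum(-q_k * (s - log_z) for q_k, s in zip(q, sims))

# Made-up similarities of one image embedding to the text embeddings of 4 IDs.
sims = [0.8, 0.1, -0.2, 0.3]
loss_correct = i2tce_loss(sims, target=0)  # true ID is the most similar one
loss_wrong = i2tce_loss(sims, target=2)    # true ID is the least similar one
print(loss_correct < loss_wrong)
```

Because the text embeddings are frozen in this stage, minimizing this loss can only move the image embedding toward the text embedding of its own ID.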

4.2 Visual-Semantic Alignment

We propose to transform the fundamental challenge of cross-platform visual alignment into visual-semantic alignment, and explore the efficacy of fine-tuning to adapt CLIP to video-based ReID tasks with visual-semantic alignment. We name the resulting model FT-CLIP. As shown in Fig. 3 (left), learnable ID-specific description tokens $[\mathbf{S}]_i$ and shared text prompts $[\mathbf{P}]_i$ are inserted into the Text-Encoder. All the tokens that feed into the Text-Encoder are concatenated as “$[[\mathbf{P}]_1 \ldots [\mathbf{P}]_{n/2} : [\mathbf{S}]_1 \ldots [\mathbf{S}]_M : [\mathbf{P}]_{n/2+1} \ldots [\mathbf{P}]_n]$”. Semantic features $\mathbf{T}$ can be obtained by:

$$\mathbf{T} = \mathbf{E}_t([[\mathbf{P}]_1 \ldots [\mathbf{P}]_{n/2} : [\mathbf{S}]_1 \ldots [\mathbf{S}]_M : [\mathbf{P}]_{n/2+1} \ldots [\mathbf{P}]_n]), \tag{5}$$

where $[\cdot : \cdot]$ represents the concatenation operation, and the dimensions of $[\mathbf{P}]_i$ and $[\mathbf{S}]_i$ are the same as that of the word embedding.
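The token layout in Eq. (5) places the ID-specific tokens between the two halves of the shared prompts. A minimal Python sketch of that concatenation is given below. Strings stand in for embedding vectors, the helper name `build_text_tokens` is hypothetical, and the sketch assumes that the number of shared prompts $n$ is even.

```python
def build_text_tokens(shared_prompts, id_tokens):
    """Concatenate tokens as [P_1..P_{n/2} : S_1..S_M : P_{n/2+1}..P_n] (Eq. 5)."""
    if len(shared_prompts) % 2 != 0:
        raise ValueError("this sketch assumes an even number n of shared prompts")
    half = len(shared_prompts) // 2
    return shared_prompts[:half] + id_tokens + shared_prompts[half:]

# Strings stand in for embedding vectors: n = 4 shared prompts, M = 2 ID tokens.
shared = ["P1", "P2", "P3", "P4"]
per_id = {0: ["S1_id0", "S2_id0"], 1: ["S1_id1", "S2_id1"]}
sequences = {pid: build_text_tokens(shared, tokens) for pid, tokens in per_id.items()}
print(sequences[0])
```

The shared prompts are identical across identities, while the middle segment changes per ID, so each identity gets its own input sequence for the Text-Encoder.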

Inspired by CLIP-ReID [31], we adopt a two-stage optimization strategy. In the first optimization stage, we freeze both the Image Encoder and Text Encoder, using the loss function $\mathcal{L}_{stage1}$ in Eq. (3) to optimize the ID-specific description tokens and the shared text prompts. In the second optimization stage, the Image Encoder is trained to align the video embeddings to semantic features. Given a video sample $\mathcal{V}_i \in \mathbb{R}^{T \times H \times W \times 3}$ with $T$ frames, the CLIP image encoder encodes the $T$ frames independently and mean-pooling is used to fuse the frame embeddings. Visual embeddings $\mathbf{V}_i$ can be obtained by:

$$\mathbf{V}_i = \frac{1}{T} \sum_{j}^{T} \mathbf{E}_i(\mathcal{V}_{ij}), \tag{6}$$

where $\mathcal{V}_{ij}$ represents the $j$-th frame of $\mathcal{V}_i$. The visual-to-semantic cross-entropy loss $\mathcal{L}_{v2sce}$, $\mathcal{L}_{i2t}$ and $\mathcal{L}_{t2i}$ are adopted to align visual embeddings to semantic features. $\mathcal{L}_{v2sce}$ is similar to $\mathcal{L}_{i2tce}$, defined as:

$$\mathcal{L}_{v2sce}(i) = \sum_{k=1}^{N} -q_k \log \frac{\exp(s(\mathbf{V}_i, \mathbf{T}_{y_k}))}{\sum_{y_j=1}^{N} \exp(s(\mathbf{V}_i, \mathbf{T}_{y_j}))}, \tag{7}$$

where qksubscript𝑞𝑘q_{k}italic_q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT represents the soft label in the target distribution, and N𝑁Nitalic_N is the number of identities. Meanwhile, triplet loss trisubscript𝑡𝑟𝑖\mathcal{L}_{tri}caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_i end_POSTSUBSCRIPT with soft-margin and ID loss idsubscript𝑖𝑑\mathcal{L}_{id}caligraphic_L start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT are also used:

$$\mathcal{L}_{tri}=\max(d_{p}-d_{n}+\theta,0), \tag{8}$$
$$\mathcal{L}_{id}=\sum_{k=1}^{N}-q_{k}\log(p_{k}), \tag{9}$$

where $\theta$ is the soft-margin of $\mathcal{L}_{tri}$, $p_{k}$ represents the ID prediction logits of class $k$, and $d_{p}$ and $d_{n}$ are the feature distances of the positive pair and the negative pair, respectively. The overall loss $\mathcal{L}_{stage2}$ is defined as follows:

$$\mathcal{L}_{stage2}=\mathcal{L}_{v2sce}+\beta\mathcal{L}_{tri}+\gamma\mathcal{L}_{id}+\delta\mathcal{L}_{i2t}+\epsilon\mathcal{L}_{t2i}, \tag{10}$$

where $\beta$, $\gamma$, $\delta$ and $\epsilon$ balance the relative importance of the losses.
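The loss terms in Eqs. (7)-(10) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: the function names are hypothetical, $s(\cdot,\cdot)$ is assumed to be cosine similarity without a temperature, the soft label $q_{k}$ is assumed to come from label smoothing with a smoothing factor of 0.1, and $\mathcal{L}_{i2t}$ and $\mathcal{L}_{t2i}$ are passed in as precomputed scalars. The default weights and margin are the values reported in Sec. 5.2.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def smoothed_targets(label, num_ids, eps=0.1):
    # Assumed form of the soft label q: 1 - eps on the true identity,
    # eps spread uniformly over the remaining identities.
    q = np.full(num_ids, eps / (num_ids - 1))
    q[label] = 1.0 - eps
    return q

def v2s_ce(video_emb, text_embs, label, eps=0.1):
    # Eq. (7): soft-label cross-entropy over video-to-text similarities.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v  # assumed cosine similarity s(V_i, T_k) for every identity k
    q = smoothed_targets(label, len(text_embs), eps)
    return float(-(q * log_softmax(sims)).sum())

def triplet(d_pos, d_neg, margin=0.3):
    # Eq. (8) with the margin theta.
    return max(d_pos - d_neg + margin, 0.0)

def id_loss(logits, label, eps=0.1):
    # Eq. (9), taking p as the softmax of the classifier outputs.
    q = smoothed_targets(label, len(logits), eps)
    return float(-(q * log_softmax(logits)).sum())

def stage2_loss(l_v2sce, l_tri, l_id, l_i2t, l_t2i,
                beta=1.0, gamma=0.25, delta=1.0, epsilon=1.0):
    # Eq. (10): weighted sum of the five terms.
    return l_v2sce + beta * l_tri + gamma * l_id + delta * l_i2t + epsilon * l_t2i
```

With these definitions, a video embedding that is closest to the text feature of its own identity receives a lower `v2s_ce` value than one assigned to a wrong identity, which is the behaviour the alignment loss is meant to encourage.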

4.3 Video Set-Level-Adapter for Efficient Model Tuning

Video ReID requires the model to learn appearance representations both within frames and across frames. We present a novel perspective, where a video sample is regarded as a frame set $\mathcal{S}_{i}=\{\mathcal{V}_{ij}\mid j=1,2,\ldots,n\}$ consisting of independent frames, and propose an efficient Video Set-Level-Adapter (VSLA) module. The VSLA consists of two components: an Intra-Frame Adapter (IFA, Fig. 3 (a)) and a Cross-Frame Attention Adapter (CFAA, Fig. 3 (b)). IFA is designed to adapt the pre-trained visual foundation model to downstream tasks in a parameter-efficient way; it takes raw frames as input and provides image-level appearance representations. CFAA takes a set of frames as input and aggregates the complementary inter-frame information into more powerful video-level representations.

IFA consists of two mapping matrices in a bottleneck structure. It runs in parallel with MLP blocks within each layer of the Image Encoder. As shown in Fig. 3, the Image Encoder in CLIP (ViT-Base-16) consists of alternating layers of Multi-Head Self-Attention (MSA) [39], Multi-Layer Perceptron (MLP) and LayerNorm (LN), which can be formulated as:

$$\mathbf{x}_{i}^{\prime}=\textsc{MSA}(\textsc{LN}(\mathbf{x}_{i-1}))+\mathbf{x}_{i-1}, \tag{11}$$
$$\mathbf{x}_{i}=\textsc{MLP}(\textsc{LN}(\mathbf{x}_{i}^{\prime}))+\mathbf{x}_{i}^{\prime}. \tag{12}$$

We denote the input of IFA as $\mathbf{x}_{i}^{\prime}\in\mathbb{R}^{T\times(N+1)\times D}$, where $N=HW/P^{2}$, $D$ represents the dimension and $T$ is the number of frames. The down-projection layer $\mathbf{W}_{down}$ projects $\mathbf{x}_{i}^{\prime}$ to $\mathbf{x}_{i}^{\prime\prime}\in\mathbb{R}^{T\times(N+1)\times\alpha}$, where $\alpha$ is a hyper-parameter. Then $\mathbf{x}_{i}^{\prime\prime}$ goes through a GELU activation $\sigma$ and the up-projection layer $\mathbf{W}_{up}$. The process can be formulated as:

$$\textsc{IFA}(\mathbf{x}_{i}^{\prime})=\sigma(\mathbf{x}_{i}^{\prime}\mathbf{W}_{down})\mathbf{W}_{up}, \tag{13}$$
$$\mathbf{x}_{i}=\textsc{MLP}(\textsc{LN}(\mathbf{x}_{i}^{\prime}))+\mathbf{x}_{i}^{\prime}+\textsc{IFA}(\mathbf{x}_{i}^{\prime}). \tag{14}$$

Unlike LoRA [22], which adds trainable pairs of rank decomposition matrices in parallel to every pre-existing weight matrix, IFA runs in parallel only with the MLP blocks. Therefore, adopting IFA results in far fewer parameters, accounting for only 5.5% ($\alpha=256$) of the whole Image Encoder (ViT-Base-16).
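As an illustration of Eqs. (13) and (14), the following NumPy sketch applies a bottleneck adapter in parallel with the MLP branch. It is a minimal sketch under stated assumptions: the function names are hypothetical, biases are omitted as in Eq. (13), GELU uses the common tanh approximation, the frozen LN and MLP are replaced by identity stand-ins, and $\mathbf{W}_{up}$ is zero-initialized (a frequent adapter initialization that the text does not confirm). The shapes follow the 8-frame, 256×128 input of Sec. 5.2 with 16×16 patches.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ifa(x, w_down, w_up):
    # Eq. (13): token-wise down-projection, GELU, up-projection.
    return gelu(x @ w_down) @ w_up

def block_second_half(x_prime, mlp, ln, w_down, w_up):
    # Eq. (14): frozen MLP branch + residual + parallel IFA branch.
    return mlp(ln(x_prime)) + x_prime + ifa(x_prime, w_down, w_up)

rng = np.random.default_rng(0)
T, N, D, alpha = 8, 128, 768, 256        # N = (256 * 128) / 16**2 patches per frame
x_prime = rng.standard_normal((T, N + 1, D))
w_down = rng.standard_normal((D, alpha)) * 0.02
w_up = np.zeros((alpha, D))              # assumed zero initialization

identity = lambda t: t                   # stand-ins for the frozen LN and MLP
out = block_second_half(x_prime, identity, identity, w_down, w_up)
```

With a zero-initialized up-projection the adapter branch contributes nothing at the start of training, so the block initially reproduces the output of the frozen pre-trained layer.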

CFAA is also a bottleneck architecture with a cross-frame attention layer in the middle. Our model $\mathbf{M}(\cdot)$ with CFAA is immune to frame ordering [4], which can be formulated as:

$$\mathbf{M}(\{\mathcal{V}_{ij}\mid j=1,2,\ldots,n\})=\mathbf{M}(\{\mathcal{V}_{i\pi(j)}\mid j=1,2,\ldots,n\}), \tag{15}$$

where $\pi$ is any permutation [47]. We denote the input of CFAA as $\mathbf{x}_{i-1}\in\mathbb{R}^{T\times(N+1)\times D}$; the down-projection layer projects $\mathbf{x}_{i-1}$ to $\mathbf{x}_{i-1}^{\prime}\in\mathbb{R}^{T\times(N+1)\times\alpha}$. The cross-frame attention layer has the same structure as Multi-Head Self-Attention (MSA) [39]. To aggregate the complementary information among the $T$ frames, we reshape the input of the cross-frame attention layer $\mathbf{x}_{i-1}^{\prime}$ to $\mathbf{x}_{i-1}^{\prime\mathsf{T}}\in\mathbb{R}^{(N+1)\times T\times\alpha}$, and the attention is computed along the second dimension of $\mathbf{x}_{i-1}^{\prime\mathsf{T}}$, thus enabling visual information to be exchanged across frames. Then, we restore the output of the cross-frame attention layer from $\mathbf{x}_{i-1}^{\prime\prime\mathsf{T}}\in\mathbb{R}^{(N+1)\times T\times\alpha}$ to $\mathbf{x}_{i-1}^{\prime\prime}\in\mathbb{R}^{T\times(N+1)\times\alpha}$, and $\mathbf{x}_{i-1}^{\prime\prime}$ passes through the up-projection layer. It can be formulated as:

$$\mathbf{x}_{i}^{\prime}=\textsc{MSA}(\textsc{LN}(\mathbf{x}_{i-1}))+\mathbf{x}_{i-1}+\textsc{CFAA}(\mathbf{x}_{i-1}). \tag{16}$$
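The reshape-and-attend step of CFAA can be sketched as follows. This is a simplified illustration rather than the authors' code: it uses a single attention head instead of the multi-head structure, omits biases, and all names are hypothetical. Because no frame-position information is injected, the branch is permutation-equivariant over the frame axis; the averaging over frames used at the end to illustrate Eq. (15) is an assumption about how a set-level feature might be pooled.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cfaa(x, w_down, w_q, w_k, w_v, w_up):
    # x: (T, N+1, D); returns the adapter branch with the same shape.
    h = x @ w_down                         # (T, N+1, alpha)
    h = np.swapaxes(h, 0, 1)               # (N+1, T, alpha): frames become the sequence axis
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(h.shape[-1])
    attn = softmax(scores, axis=-1)        # (N+1, T, T): attention across frames
    h = attn @ v                           # each token position mixes information over frames
    h = np.swapaxes(h, 0, 1)               # restore (T, N+1, alpha)
    return h @ w_up

rng = np.random.default_rng(0)
T, L, D, alpha = 8, 5, 16, 4               # small toy sizes; L plays the role of N+1
x = rng.standard_normal((T, L, D))
w_down = rng.standard_normal((D, alpha))
w_q, w_k, w_v = (rng.standard_normal((alpha, alpha)) for _ in range(3))
w_up = rng.standard_normal((alpha, D))

perm = rng.permutation(T)
y = cfaa(x, w_down, w_q, w_k, w_v, w_up)
y_perm = cfaa(x[perm], w_down, w_q, w_k, w_v, w_up)
```

In this sketch, shuffling the input frames only shuffles the per-frame outputs in the same way, so any order-independent pooling over frames (here a mean) gives the same clip-level feature for every frame order.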

4.4 Platform-Bridge Prompt

We additionally introduce the Platform-Bridge Prompt (PBP) to further bridge platform differences. PBP is designed to guide the model to focus on platform differences. As illustrated in Fig. 3, we add a series of platform-specific learnable prompts to the Image Encoder. Specifically, there are only two sets of prompts, one corresponding to the ground platform and the other to the UAV platform. Applying PBP can be viewed as changing the inputs of each MSA layer in the Vision Transformer (ViT [11]). We denote the inputs of the MSA layer as $\mathbf{h}\in\mathbb{R}^{(N+1)\times D}$, where $N=HW/P^{2}$ and $D$ represents the dimension. The MSA layer with PBP can be formulated as follows:

$$f_{k}(\mathbf{h},\mathbf{p}_{k})=\begin{cases}MSA_{k}([\mathbf{h}:\mathbf{p}_{k}^{ground}]) & \text{if } k<d \text{ and } \mathbf{h}\in Set^{ground}\\ MSA_{k}([\mathbf{h}:\mathbf{p}_{k}^{uav}]) & \text{if } k<d \text{ and } \mathbf{h}\in Set^{uav}\\ MSA_{k}(\mathbf{h}) & \text{if } k\geq d,\end{cases} \tag{17}$$

where $\mathbf{p}_{k}^{ground}\in\mathbb{R}^{l\times D}$ and $\mathbf{p}_{k}^{uav}\in\mathbb{R}^{l\times D}$, $d$ and $l$ are the depth and length of PBP, $[:]$ denotes the concatenation operation, $MSA_{k}$ represents the $k$-th MSA layer in the Image Encoder, and $Set^{uav}$ and $Set^{ground}$ are the sets containing the samples from the UAV platform and from the ground platform, respectively.
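The case distinction in Eq. (17) amounts to choosing which prompt tokens, if any, are appended before a given MSA layer. The sketch below shows only that input construction, under a few assumptions: the function and dictionary names are hypothetical, layers are indexed from 0, and the handling of the prompt positions after the MSA layer is left out because the text does not specify it. The toy values $d=3$ and $l=16$ are the best setting reported in the ablation study.

```python
import numpy as np

def msa_input_with_pbp(h, k, platform, prompts, depth):
    # h: (N+1, D) tokens of one frame; prompts[platform]: (depth, l, D).
    # Returns the token sequence fed to the k-th MSA layer (0-based k).
    if k >= depth:
        return h                                  # third case of Eq. (17)
    return np.concatenate([h, prompts[platform][k]], axis=0)

rng = np.random.default_rng(0)
N, D, depth, length = 128, 768, 3, 16
prompts = {
    "ground": rng.standard_normal((depth, length, D)),
    "uav": rng.standard_normal((depth, length, D)),
}
h = rng.standard_normal((N + 1, D))

shallow = msa_input_with_pbp(h, 0, "uav", prompts, depth)     # prompts appended
deep = msa_input_with_pbp(h, 3, "uav", prompts, depth)        # unchanged
```

Only the first `depth` layers see the extra `length` tokens, and a sample from the ground platform would pick up `prompts["ground"]` instead, so the two platforms share every backbone weight and differ only in these prompt tokens.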

Table 2: Comparison with state-of-the-art methods. † represents the model initialized by the weights of CLIP [35] released by OpenAI, and ‡ represents the model initialized by the weights of ViFi-CLIP [36]. We use bold to indicate the best results of our methods, and underlines to highlight the best results of other methods. On all datasets, our method outperforms the comparisons significantly.

| Method | MARS mAP | MARS rank-1 | LS-VID mAP | LS-VID rank-1 | iLIDS rank-1 | G2A-VReID mAP | G2A-VReID rank-1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLTR [30] | 78.5 | 87.0 | 44.3 | 63.1 | 86 | - | - |
| VRSTC [19] | 82.3 | 88.5 | - | - | 83.4 | - | - |
| AP3D [15] | 85.1 | 90.1 | 73.2 | 84.5 | 88.7 | 67.7 | 57.5 |
| STGCN [45] | 83.7 | 90.0 | - | - | - | - | - |
| MGH [44] | 85.8 | 90.0 | - | - | 85.6 | 76.7 | 69.9 |
| MG-RAFA [51] | 85.9 | 88.8 | - | - | 88.6 | - | - |
| AFA [6] | 82.9 | 90.2 | - | - | 88.5 | - | - |
| TCLNet [20] | 85.1 | 89.8 | 70.3 | 81.5 | 86.6 | 65.4 | 54.7 |
| STRF [1] | 86.1 | 90.3 | - | - | 89.3 | - | - |
| GRL [34] | 84.8 | 91.0 | - | - | 90.4 | 52.8 | 41.4 |
| DenseIL [17] | 87.0 | 90.8 | - | - | 92 | - | - |
| BiCnet-TKS [21] | 86.0 | 90.2 | 75.1 | 84.6 | - | 63.4 | 51.7 |
| PSTA [42] | 85.8 | 91.5 | - | - | - | 64.6 | 54.5 |
| STMN [12] | 84.5 | 90.5 | 69.2 | 82.1 | 91.5 | 66.7 | 56.1 |
| PiT [48] | - | 90.2 | - | - | 92.1 | 76.3 | 67.7 |
| SINet [2] | 86.2 | 91.0 | 79.6 | 87.4 | 92.5 | 74.5 | 65.6 |
| LSTRL [33] | 86.8 | 91.6 | 82.4 | 89.8 | 92.2 | - | - |
| FT-CLIP‡ | 88.00 | 91.62 | 84.07 | 90.77 | 94.00 | 78.11 | 69.32 |
| VSLA-CLIP† | 88.22 | 90.91 | 84.05 | 90.54 | 95.33 | 79.14 | 71.64 |
| VSLA-CLIP‡ | 88.60 | 91.82 | 85.20 | 91.66 | 95.33 | 79.70 | 72.55 |

5 Experiments

In this section, we first introduce the evaluation protocols and implementation details. Subsequently, we compare our proposed methods with state-of-the-art algorithms. Finally, ablation studies are conducted to investigate the contribution of each component.

5.1 Datasets and Evaluation Metrics

We conduct experiments on our G2A-VReID and three widely used video person ReID datasets, i.e., iLIDS [41], MARS [52], and LS-VID [28]. For G2A-VReID, we divide the 2,788 identities into training and test sets at a ratio of roughly 1:2, similar to LS-VID [28]. This yields 930 identities in the training set and 1,858 identities in the test set. During evaluation, we keep the cross-camera search paradigm of the ReID task [28, 41, 52, 18]. The query and gallery sets are composed of video sequences from the ground and UAV cameras, respectively.

The Cumulative Matching Characteristic (CMC) at Rank-1 and the mean Average Precision (mAP) are employed to evaluate the performance of our model.
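For readers unfamiliar with these two metrics, the following sketch computes Rank-1 and mAP from a query-by-gallery distance matrix. It is a generic illustration with a hypothetical function name, not the evaluation script of the paper. Standard ReID evaluation code often also discards gallery samples from the same camera as the query; that filtering is omitted here on the assumption that query and gallery already come from different platforms.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    # dist: (num_query, num_gallery) distances; smaller means more similar.
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i], kind="stable")
        matches = gallery_ids[order] == query_ids[i]
        if not matches.any():
            continue                      # skip queries with no correct gallery item
        rank1_hits.append(float(matches[0]))
        hit_ranks = np.flatnonzero(matches) + 1
        precision_at_hits = np.arange(1, len(hit_ranks) + 1) / hit_ranks
        average_precisions.append(precision_at_hits.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: two queries of identity 1, three gallery tracklets.
dist = np.array([[0.1, 0.9, 0.5],
                 [0.8, 0.2, 0.3]])
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 1], gallery_ids=[1, 2, 1])
```

In the toy example the first query ranks both correct tracklets first (AP of 1), while the second places them at ranks 2 and 3 (AP of about 0.58), giving a Rank-1 of 0.5 and an mAP of about 0.79. It also shows that mAP can be higher than Rank-1.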

5.2 Implementation Details

ViT-Base-16 [35] is selected as the Image Encoder. The initial weights are those of ViFi-CLIP [36], whose Image Encoder and Text Encoder have been fine-tuned on the large-scale action recognition dataset Kinetics-400 [24]. A sparse temporal sampling strategy [40] is used to generate a clip containing 8 frames, with each frame resized to $256\times128$. We randomly shuffle the order of the frames in each clip. Each batch has 32 clips corresponding to 8 identities. The Adam [25] optimizer is used in both stages. In the first training stage, we optimize the ID-specific description tokens and shared text prompts with a learning rate of $3.5\times10^{-4}$, while freezing the other parameters. In the second training stage, we adopt an initial learning rate of $5\times10^{-6}$ for FT-CLIP and $1\times10^{-4}$ for VSLA-CLIP, in both cases decayed by 0.1 and 0.01 at the 60th and 90th epochs. The margin $\theta$ of the triplet loss in Eq. (8) is set to 0.3, and $\beta$, $\gamma$, $\delta$ and $\epsilon$ in Eq. (10) are 1.0, 0.25, 1.0 and 1.0, respectively. Each image is padded with 10 pixels and augmented with random cropping, horizontal flipping, and erasing [53].
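Two of the settings above, the sparse temporal sampling with shuffled frame order and the step-wise learning-rate decay, can be sketched as below. Both functions are hypothetical illustrations: the sampler assumes one random frame per equal-length segment, and the schedule assumes that 0.1 and 0.01 are factors relative to the initial learning rate and that epochs are counted from 0. The text leaves these conventions open.

```python
import random

def sparse_sample(num_frames, clip_len=8, rng=random):
    # Split the tracklet into clip_len equal segments and draw one index per segment.
    bounds = [i * num_frames // clip_len for i in range(clip_len + 1)]
    indices = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        hi = max(hi, lo + 1)                      # guard for tracklets shorter than clip_len
        indices.append(min(rng.randrange(lo, hi), num_frames - 1))
    rng.shuffle(indices)                          # frame order is randomly shuffled
    return indices

def lr_at_epoch(base_lr, epoch, milestones=(60, 90), factors=(0.1, 0.01)):
    # Learning rate after applying the decay factor of the last milestone reached.
    lr = base_lr
    for milestone, factor in zip(milestones, factors):
        if epoch >= milestone:
            lr = base_lr * factor
    return lr

clip = sparse_sample(64, rng=random.Random(0))    # 8 shuffled indices, one per 8-frame segment
```

Under these assumptions, `lr_at_epoch(1e-4, 59)` stays at the initial value, while epochs 60 and 90 switch to one tenth and one hundredth of it.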

5.3 Comparison with State-of-the-Art Methods

On G2A-VReID Dataset. We comprehensively evaluate nine state-of-the-art methods [2, 48, 12, 42, 21, 34, 15, 44, 20] on G2A-VReID, and report the results in Tab. 2. As can be seen, MGH [44] and PiT [48] perform best among the compared methods on our G2A-VReID dataset; for example, MGH achieves 76.7% mAP and 69.9% Rank-1. We attribute this to the careful visual alignment strategy adopted by MGH and PiT, which splits the full image into vertical or horizontal stripes and aligns the stripes. This strategy mitigates the self-occlusion inherent in the UAV perspective. Our method, which extracts description tokens for each person and aligns visual embeddings with semantic features, effectively addresses the cross-platform visual misalignment problem. VSLA-CLIP‡ achieves 79.70% mAP and 72.55% Rank-1 on G2A-VReID.

On All Video ReID Datasets. As shown in Tab. 2: 1) All the variants of our method that align visual embeddings to semantic features show consistent improvement on all datasets. In particular, our method achieves 85.20% mAP and 91.66% Rank-1 on the challenging LS-VID dataset, improving the mAP by 2.80% and the Rank-1 by 1.86% over the state-of-the-art LSTRL [33]. 2) Models initialized with the weights of ViFi-CLIP (marked as ‡) perform better than those initialized with the original weights released by OpenAI (marked as †). 3) It is worth noting that VSLA-CLIP outperforms fine-tuning the whole Image Encoder (FT-CLIP) with far fewer tunable parameters. Specifically, VSLA-CLIP‡ outperforms FT-CLIP‡ by 1.59% mAP on G2A-VReID while tuning far fewer parameters (14.5M vs. 88.0M).

Our experiments show that adapting pre-trained image-based models to video ReID tasks with the Video Set-Level-Adapter is both effective and efficient, setting a new baseline method for research endeavors in this field.

Table 3: Effectiveness of proposed components and comparison of the number of tunable parameters. baseline represents training FT-CLIP‡ without $\mathcal{L}_{v2sce}$ in Eq. (7), VSA is Visual-Semantic Alignment, IFA represents Intra-Frame Adapter, CFAA is Cross-Frame Attention Adapter and PBP is Platform Bridge Prompt.

| Methods | Overall Param (M) | Tunable Param (M) | LS-VID mAP | LS-VID rank-1 | G2A-VReID mAP | G2A-VReID rank-1 |
| --- | --- | --- | --- | --- | --- | --- |
| AP3D [15] | 34.0 | 24.9 | 73.2 | 84.5 | 67.7 | 57.5 |
| BiCnet-TKS [21] | 33.7 | 29.3 | 75.1 | 84.6 | 63.4 | 51.7 |
| STMN [12] | 90.9 | 87.0 | 69.2 | 82.1 | 66.7 | 56.1 |
| SINet [2] | 33.7 | 27.3 | 79.6 | 87.1 | 74.5 | 65.6 |
| baseline | 86.1 | 86.1 | 76.10 | 84.26 | 72.80 | 63.62 |
| baseline+VSA (FT-CLIP‡) | 127.4 | 88.0 | 84.07 | 90.77 | 78.11 | 69.32 |
| IFA | 90.8 | 4.7 | 77.31 | 84.86 | 73.82 | 65.12 |
| IFA+VSA | 132.1 | 6.6 | 84.16 | 90.94 | 79.01 | 71.67 |
| IFA+VSA+CFAA (VSLA-CLIP‡) | 140.0 | 14.5 | 85.20 | 91.66 | 79.70 | 72.55 |
| IFA+VSA+CFAA+PBP | 140.0 | 14.5 | - | - | 81.29 | 74.27 |

5.4 Ablation Study

To demonstrate the effectiveness of the components proposed in Sec. 4, we conduct ablation studies and compare our method with four other methods.

Effectiveness of Visual-Semantic Alignment. To verify the effectiveness of Visual-Semantic Alignment, we first fine-tune the Image Encoder by directly using two common losses ($\mathcal{L}_{tri}$ and $\mathcal{L}_{id}$ in Eq. (10)), and set this model as our baseline. As shown in Tab. 3, Visual-Semantic Alignment is effective for both finetuning-based methods (FT-CLIP‡ vs. baseline) and adapter-based methods (IFA+VSA vs. IFA). In addition, we further analyze the loss functions for visual-semantic alignment. As shown in Tab. 5, when $\mathcal{L}_{i2t}$, $\mathcal{L}_{t2i}$ and $\mathcal{L}_{v2sce}$ are used jointly, our model achieves the best results on LS-VID.

Table 4: Effect of $\alpha$ of Intra-Frame Adapter and Cross-Frame Attention Adapter on LS-VID. TP represents the tunable parameter.

| $\alpha$ | TP (M), IFA | TP (M), CFAA | LS-VID mAP | LS-VID rank-1 |
| --- | --- | --- | --- | --- |
| 64 | 1.2 | 1.4 | 79.58 | 86.71 |
| 128 | 2.4 | 3.2 | 83.64 | 90.00 |
| 256 | 4.7 | 7.9 | 85.20 | 91.66 |
| 384 | 7.1 | 14.2 | 85.09 | 91.49 |

Table 5: Ablation experiments for the losses used for Visual-Semantic Alignment on LS-VID.
$\mathcal{L}_{v2sce}$ $\mathcal{L}_{i2t}$ $\mathcal{L}_{t2i}$ mAP
84.73
84.29
84.45
84.71
85.20

Effectiveness of Video Set-Level-Adapter. Our goal in proposing the Video Set-Level-Adapter is to efficiently adapt a pre-trained image-based visual foundation model to video-based ReID tasks. Since the Video Set-Level-Adapter (VSLA) contains two modules, i.e., an Intra-Frame Adapter (IFA) and a Cross-Frame Attention Adapter (CFAA), we perform ablation experiments separately to verify the effectiveness of each module. As shown in Tab. 3, IFA surpasses the fully fine-tuned baseline (77.31% vs. 76.10% mAP on LS-VID) with significantly fewer tunable parameters (4.7M vs. 86.1M). In addition, CFAA further improves model performance (85.20% vs. 84.16% mAP on LS-VID) while also using a small number of tunable parameters, which indicates that regarding video sequences as a set is effective.

Figure 4: Analysis on the depth and length of PBP on our G2A-VReID.

We also analyze the hyper-parameter $\alpha$ introduced in Sec. 4.3, which determines the model's complexity and the number of trainable parameters. We set $\alpha$ to 64, 128, 256, and 384, respectively. As presented in Tab. 4, the performance tends to improve with increasing $\alpha$ and reaches the best mAP at $\alpha=256$. Therefore, we fix $\alpha$ to 256 for the other datasets. In this setting, the VSLA module contains only approximately 12.6 million parameters, and VSLA-CLIP achieves 85.20% mAP on LS-VID, surpassing FT-CLIP by 1.13%.
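The parameter counts discussed here can be checked with a short back-of-the-envelope computation. The sketch below is an estimate under assumptions that the text does not state explicitly: a 12-layer ViT-Base-16 with hidden dimension $D=768$, one adapter per layer, linear layers with biases, and a cross-frame attention layer with four $\alpha\times\alpha$ projections (query, key, value and output). The function names are hypothetical.

```python
def linear_params(d_in, d_out, bias=True):
    # Weights plus optional bias of one fully connected layer.
    return d_in * d_out + (d_out if bias else 0)

def ifa_params(alpha, dim=768, layers=12):
    # Down-projection and up-projection in every layer.
    per_layer = linear_params(dim, alpha) + linear_params(alpha, dim)
    return layers * per_layer

def cfaa_params(alpha, dim=768, layers=12):
    # Bottleneck projections plus Q, K, V and output projections of the attention.
    attention = 4 * linear_params(alpha, alpha)
    per_layer = linear_params(dim, alpha) + attention + linear_params(alpha, dim)
    return layers * per_layer

for alpha in (64, 128, 256, 384):
    ifa_m = round(ifa_params(alpha) / 1e6, 1)
    cfaa_m = round(cfaa_params(alpha) / 1e6, 1)
    print(alpha, ifa_m, cfaa_m)
```

Under these assumptions the estimates round to the same values as the TP columns of Tab. 4, and at $\alpha=256$ the two adapters together come to about 12.6M parameters, consistent with the figure quoted above.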

Effectiveness of PBP. The Platform Bridge Prompt (PBP) offers meticulous instructions to enable models to discern differences across platforms. It adeptly steers the model towards obtaining precise and targeted information, thereby bridging the semantic gap in visual features. The depth $d$ and length $l$ are two hyper-parameters in PBP, which are introduced in Sec. 4.4. To analyze their impact, we use grid search to explore how different value combinations affect model performance. The results for the various parameter combinations are presented in Fig. 4, and the optimal performance is achieved when $d=3$ and $l=16$.

6 Conclusion

In this paper, we construct a large-scale benchmark dataset for cross-platform video person ReID. We also propose a baseline method that addresses the cross-platform visual misalignment problem by transforming the visual alignment problem into visual-semantic alignment through a vision-language model (i.e., CLIP). To efficiently and effectively adapt the pre-trained image-based visual foundation model to video ReID, we propose a Video Set-Level-Adapter module, which aggregates the complementary inter-frame information into more powerful video-level representations with only 12.6 million trainable parameters. Experimental results demonstrate that our proposed methods achieve state-of-the-art performance and can serve as a new baseline for cross-platform video ReID tasks.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62101453, 62176198 and 62201467, the Key Research and Development Program of Shaanxi Province under Grant 2024GX-YBXM-135, in part by China Postdoctoral Science Foundation under Grant 2022TQ0260, 2023M742842, in part by the Young Talent Fund of Xi’an Association for Science and Technology under Grant 959202313088, Innovation Capability Support Program of Shaanxi (No. 2024ZC-KJXX-043), the Fundamental Research Funds for the Central Universities No. HYGJZN202331 and the Natural Science Basic Research Program of Shaanxi Province (No. 2022JC-DW-08).

References

  • [1] Aich, A., Zheng, M., Karanam, S., Chen, T., Roy-Chowdhury, A.K., Wu, Z.: Spatio-temporal representation factorization for video-based person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 152–162 (2021)
  • [2] Bai, S., Ma, B., Chang, H., Huang, R., Chen, X.: Salient-to-broad transition for video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7339–7348 (2022)
  • [3] Baltieri, D., Vezzani, R., Cucchiara, R.: 3dpes: 3d people dataset for surveillance and forensics. In: Joint ACM Workshop on Human Gesture & Behavior Understanding (2011)
  • [4] Chao, H., He, Y., Zhang, J., Feng, J.: Gaitset: Regarding gait as a set for cross-view gait recognition. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 8126–8133 (2019)
  • [5] Chen, D., Li, H., Xiao, T., Yi, S., Wang, X.: Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1169–1178 (2018)
  • [6] Chen, G., Rao, Y., Lu, J., Zhou, J.: Temporal coherence or temporal motion: Which is more critical for video-based person re-identification? In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. pp. 660–676. Springer (2020)
  • [7] Cheng, D., He, L., Wang, N., Zhang, S., Wang, Z., Gao, X.: Efficient bilateral cross-modality cluster matching for unsupervised visible-infrared person reid. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 1325–1333 (2023)
  • [8] Cheng, D., Ji, Y., Gong, D., Li, Y., Wang, N., Han, J., Zhang, D.: Continual all-in-one adverse weather removal with knowledge replay on a unified network structure. IEEE Transactions on Multimedia (2024)
  • [9] Cheng, D., Zhou, J., Wang, N., Gao, X.: Hybrid dynamic contrast and probability distillation for unsupervised person re-id. IEEE Trans. Image Process. 31, 3334–3346 (2022). https://doi.org/10.1109/TIP.2022.3169693
  • [10] Chung, D., Tahboub, K., Delp, E.J.: A two stream siamese convolutional neural network for person re-identification. In: Proceedings of the IEEE international conference on computer vision. pp. 1983–1991 (2017)
  • [11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [12] Eom, C., Lee, G., Lee, J., Ham, B.: Video-based person re-identification with spatial and temporal memory networks. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12016–12025 (2021). https://doi.org/10.1109/ICCV48922.2021.01182
  • [13] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1933–1941 (2016)
  • [14] Fu, Y., Wang, X., Wei, Y., Huang, T.: Sta: Spatial-temporal attention for large-scale video-based person re-identification. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 8287–8294 (2019)
  • [15] Gu, X., Chang, H., Ma, B., Zhang, H., Chen, X.: Appearance-preserving 3d convolution for video-based person re-identification. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 228–243. Springer (2020)
  • [16] He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: Transreid: Transformer-based object re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 15013–15022 (October 2021)
  • [17] He, T., Jin, X., Shen, X., Huang, J., Chen, Z., Hua, X.S.: Dense interaction learning for video-based person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1490–1501 (2021)
  • [18] Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person re-identification by descriptive and discriminative classification. In: Image Analysis: 17th Scandinavian Conference, SCIA 2011, Ystad, Sweden, May 2011. Proceedings 17. pp. 91–102. Springer (2011)
  • [19] Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Vrstc: Occlusion-free video person re-identification. In: CVPR. pp. 7183–7192 (2019)
  • [20] Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Temporal complementary learning for video person re-identification. In: ECCV. pp. 388–405 (2020)
  • [21] Hou, R., Chang, H., Ma, B., Huang, R., Shan, S.: Bicnet-tks: Learning efficient spatial-temporal representation for video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2014–2023 (June 2021)
  • [22] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9
  • [23] Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S.J., Hariharan, B., Lim, S.: Visual prompt tuning. In: ECCV (33). Lecture Notes in Computer Science, vol. 13693, pp. 709–727. Springer (2022)
  • [24] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • [25] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [26] Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)
  • [27] Li, H., Zhang, D., Liu, N., Cheng, L., Dai, Y., Zhang, C., Wang, X., Han, J.: Boosting low-data instance segmentation by unsupervised pre-training with saliency prompt. arXiv preprint arXiv:2302.01171 (2023)
  • [28] Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: ICCV. pp. 3958–3967 (2019)
  • [29] Li, J., Zhang, S., Huang, T.: Multiscale 3d convolution network for video based person reidentification. In: AAAI. pp. 8618–8625 (2019)
  • [30] Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3958–3967 (2019)
  • [31] Li, S., Sun, L., Li, Q.: Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels. arXiv preprint arXiv:2211.13977 (2022)
  • [32] Liu, H., Jie, Z., Jayashree, K., Qi, M., Jiang, J., Yan, S., Feng, J.: Video-based person re-identification with accumulative motion context. IEEE transactions on circuits and systems for video technology 28(10), 2788–2802 (2017)
  • [33] Liu, X., Zhang, P., Lu, H.: Video-based person re-identification with long short-term representation learning. In: International Conference on Image and Graphics. pp. 55–67. Springer (2023)
  • [34] Liu, X., Zhang, P., Yu, C., Lu, H., Yang, X.: Watching you: Global-guided reciprocal learning for video-based person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13334–13343 (2021)
  • [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [36] Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Finetuned clip models are efficient video learners. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
  • [37] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 815–823 (2015)
  • [38] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)
  • [39] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [40] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: European conference on computer vision. pp. 20–36. Springer (2016)
  • [41] Wang, X., Zhao, R.: Person re-identification: System design and evaluation overview. In: Person Re-Identification, pp. 351–370. Springer (2014)
  • [42] Wang, Y., Zhang, P., Gao, S., Geng, X., Lu, H., Wang, D.: Pyramid spatial-temporal aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12026–12035 (2021)
  • [43] Xing, Y., Wu, Q., Cheng, D., Zhang, S., Liang, G., Wang, P., Zhang, Y.: Dual modality prompt tuning for vision-language pre-trained model. IEEE Transactions on Multimedia 26, 2056–2068 (2024). https://doi.org/10.1109/TMM.2023.3291588
  • [44] Yan, Y., Qin, J., Chen, J., Liu, L., Zhu, F., Tai, Y., Shao, L.: Learning multi-granular hypergraphs for video-based person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2899–2908 (2020)
  • [45] Yang, J., Zheng, W.S., Yang, Q., Chen, Y.C., Tian, Q.: Spatial-temporal graph convolutional network for video-based person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3289–3299 (2020)
  • [46] Yin, J., Wu, A., Zheng, W.S.: Fine-grained person re-identification. International Journal of Computer Vision 128(6), 1654–1672 (Jun 2020). https://doi.org/10.1007/s11263-019-01259-0
  • [47] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. Advances in neural information processing systems 30 (2017)
  • [48] Zang, X., Li, G., Gao, W.: Multidirection and multiscale pyramid in transformer for video-based pedestrian retrieval. IEEE Transactions on Industrial Informatics 18(12), 8776–8785 (2022). https://doi.org/10.1109/TII.2022.3151766
  • [49] Zhang, S., Yang, Y., Wang, P., Liang, G., Zhang, X., Zhang, Y.: Attend to the difference: Cross-modality person re-identification via contrastive correlation. IEEE Transactions on Image Processing 30, 8861–8872 (2021). https://doi.org/10.1109/TIP.2021.3120881
  • [50] Zhang, S., Zhang, Q., Yang, Y., Wei, X., Wang, P., Jiao, B., Zhang, Y.: Person re-identification in aerial imagery. IEEE Transactions on Multimedia 23, 281–291 (2021). https://doi.org/10.1109/TMM.2020.2977528
  • [51] Zhang, Z., Lan, C., Zeng, W., Chen, Z.: Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10407–10416 (2020)
  • [52] Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., Tian, Q.: Mars: A video benchmark for large-scale person re-identification. In: ECCV. pp. 868–884 (2016)
  • [53] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 13001–13008 (2020)
  • [54] Zhou, Z., Huang, Y., Wang, W., Wang, L., Tan, T.: See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4747–4756 (2017)
  • [55] Zhu, K., Guo, H., Zhang, S., Wang, Y., Liu, J., Wang, J., Tang, M.: Aaformer: Auto-aligned transformer for person re-identification. IEEE Transactions on Neural Networks and Learning Systems (2023)