
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Ding Jiang1, Mang Ye1,2*

1 National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, Hubei Key Laboratory of Multimedia and Network Communication Engineering, School of Computer Science, Wuhan University, Wuhan, China
2 Hubei Luojia Laboratory, Wuhan, China

https://github.com/anosorae/IRRA

* Corresponding Author: Mang Ye (yemang@whu.edu.cn)

Abstract

Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.

Figure 1. Evolution of text-to-image person retrieval paradigms. (a) Early global-matching methods directly align global image and text embeddings. (b) Recent local-matching methods explicitly extract and align local image and text embeddings. (c) Our implicit relation reasoning aided matching paradigm implicitly reasons about the relations among all local tokens to better align global image and text embeddings.

1. Introduction

Text-to-image person retrieval aims to retrieve a person-of-interest from a large image gallery that best matches the text description query [30], which is a sub-task of both image-text retrieval [26, 33, 42] and image-based person re-identification (Re-ID) [15, 32, 45]. Textual descriptions provide a natural and relatively comprehensive way to describe a person's attributes, and are more easily accessible than images. Text-to-image person retrieval has thus received increasing attention in recent years, benefiting a variety of applications from personal photo album search to public security.

However, text-to-image person retrieval remains a challenging task due to significant intra-identity variations and modality heterogeneity between vision and language. The former challenge stems from the fact that the visual appearance of an identity differs with pose, viewpoint, illumination, and other factors, while the textual description varies in descriptive order and suffers from textual ambiguity. The latter challenge is the primary issue in cross-modal tasks and is caused by inherent representation discrepancies between vision and language. To tackle these two challenges, the core research problem in text-to-image person retrieval is to explore better ways to extract discriminative feature representations and to design better cross-modal matching methods that align images and texts into a joint embedding space. Early global-matching methods [53, 54] aligned images and texts into a joint embedding space by designing cross-modal matching loss functions (Fig. 1 (a)). Typically, these approaches applied matching losses only at the end of the network, failing to achieve sufficient modality interaction in middle-level layers, which is crucial for bridging the feature-level modality gap. Therefore, some later methods [5, 7, 21, 46] introduced local matching by building correspondences between body parts and textual entities (Fig. 1 (b)). Although this local-matching strategy benefits retrieval performance, it introduces unavoidable noise and uncertainty into the retrieval process. Besides, the strategy requires extracting and storing multiple local part representations of images and texts and computing pairwise similarities between all of those representations during inference. These resource-demanding properties limit its applicability in practical large-scale scenarios.

In this paper, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework, which performs global alignment with the aid of cross-modal implicit local relation learning. Unlike previous methods that heavily rely on explicit fine-grained local alignment, our approach implicitly utilizes fine-grained information to enhance global alignment without requiring any additional supervision or inference cost (Fig. 1 (c)). Specifically, we design an Implicit Relation Reasoning module that effectively builds relations between visual and textual representations through self- and cross-attention mechanisms. The fused representation is then used to perform a masked language modeling (MLM) task to achieve effective implicit inter-modal and intra-modal fine-grained relation learning. MLM is generally utilized during the pre-training stage of vision-language pre-training (VLP) [6, 9, 27, 31, 41]. In this work, we make the first attempt to demonstrate the effectiveness of MLM in downstream fine-tuning tasks. Our main innovation is the design of a multimodal interaction encoder that efficiently fuses visual and textual representations and aligns cross-modal fine-grained features through the MLM task. This design helps the backbone network extract more discriminative global image-text representations without requiring additional supervision.

To guide the image-text matching, commonly used loss functions include the ranking loss and the cross-modal projection matching (CMPM) [53] loss. Compared to the ranking loss, the CMPM loss does not require the selection of specific triplets or margin parameter tuning, and it exhibits great stability with varying batch sizes, making it widely used in text-to-image person retrieval [5, 39, 50]. However, we found that the projection in CMPM can be regarded as a variable weight that adjusts the distribution of the softmax output logits, similar to the temperature parameter [17] in knowledge distillation. Nevertheless, limited by the varying projection length, CMPM cannot precisely control the projection probability distribution, making it difficult to focus on hard negative samples during model updates. To explore a more effective cross-modal matching objective, we further propose an image-text similarity distribution matching (SDM) loss. The SDM loss minimizes the KL divergence between the normalized image-text similarity score distributions and the normalized ground-truth label matching distributions. Additionally, we introduce a temperature hyperparameter to precisely control the compactness of the similarity distribution, which enables model updates to focus on hard negative samples and effectively enlarges the variance between non-matching pairs and the correlation between matching pairs.

To address the limitations of models pre-trained separately on unimodal datasets, we leverage Contrastive Language-Image Pre-training (CLIP) [35] as the initialization of our model. CLIP is pre-trained on abundant image-text pairs and has powerful underlying cross-modal alignment capabilities. Some previous approaches [13, 50] have either frozen part of the parameters or introduced only CLIP's image encoder, which limits their ability to fully exploit CLIP's capabilities in image-text matching. With the proposed IRRA, we successfully transfer the knowledge directly from the pre-trained full CLIP model and continue to learn fine-grained cross-modal implicit local relations on text-to-image person retrieval datasets. In addition, compared to many recent methods [5, 38, 50], IRRA is more efficient as it computes only one global image-text pair similarity score in the inference stage. The main contributions can be summarized as follows:

• We propose IRRA to implicitly utilize fine-grained interaction to enhance the global alignment without requiring any additional supervision or inference cost.

• We introduce a new cross-modal matching loss named
image-text similarity distribution matching (SDM) loss. It directly minimizes the KL divergence between image-text similarity distributions and the normalized label matching distributions.

• We demonstrate that the full CLIP model can be applied to text-to-image person retrieval and can outperform existing state-of-the-art methods with straightforward fine-tuning. Moreover, our proposed IRR module enables fine-grained image-text relation learning, allowing IRRA to learn more discriminative image-text representations.

• Extensive experiments on three public benchmark datasets, i.e., CUHK-PEDES [30], ICFG-PEDES [7] and RSTPReid [55], show that IRRA consistently outperforms the state of the art by a large margin.

2. Related Work

Text-to-image Person Retrieval was first introduced by Li et al. [30], who proposed the first benchmark dataset, CUHK-PEDES [30]. The main challenge is how to efficiently align image and text features into a joint embedding space for fast retrieval. Early works [2, 29, 30] utilized VGG [40] and LSTM [18] to learn representations for the visual and textual modalities and then aligned them using a matching loss. Later works [4, 36, 53] improved the feature extraction backbone with ResNet50/101 [14] and BERT [22], and designed novel cross-modal matching losses to align global image-text features in a joint embedding space. More recent works [5, 46, 47, 49, 55] extensively employ additional local feature learning branches that explicitly exploit human segmentation, body parts, color information, and text phrases. There are also some works [7, 10, 38, 51] that implicitly perform local feature learning through attention mechanisms. However, while these approaches have been shown to provide better retrieval results than using only global features, they also introduce additional computational complexity during inference when computing image-text similarity. The aforementioned works all use backbones pre-trained separately on unimodal data to extract visual and textual features, and then perform cross-modal alignment without exploiting the strong cross-modal alignment capabilities of recently promising vision-language pre-training models. Han et al. [13] first introduced a CLIP model for text-to-image person retrieval, using a momentum contrastive learning framework to transfer the knowledge learned from large-scale generic image-text pairs. Later, Yan et al. [50] proposed a CLIP-driven fine-grained information excavation framework to transfer the knowledge of CLIP. However, they failed to directly transfer the originally aligned CLIP dual-encoder to text-to-image person retrieval. In this work, we demonstrate that the CLIP model can be easily transferred to text-to-image person retrieval and propose IRRA to learn more discriminative image-text embeddings.

Vision-Language Pre-training aims to learn the semantic correspondence between the vision and language modalities by pre-training on large-scale image-text pairs. Inspired by the success of Transformer-based [44] language model pre-training (such as BERT [22]) and the Vision Transformer (ViT) [8], Vision-Language Pre-training (VLP) has emerged as the prevailing paradigm for learning multimodal representations, demonstrating strong results on downstream tasks such as image captioning [3], image-text retrieval [25] and visual question answering [1]. Existing work on VLP can be categorized into two types, single-stream and dual-stream, depending on the model structure. In single-stream models [6, 23, 41], text and visual features are concatenated and then fed into a single transformer encoder. Although this architecture is more parameter-efficient as it uses the same set of parameters for both modalities, it has a slow retrieval speed during the inference stage because it needs to predict the similarity score of all possible image-text pairs. On the other hand, dual-stream models [9, 20, 35] use two separate encoders to extract the text and visual features independently; these two transformer encoders do not share parameters. While achieving remarkable performance on image-text retrieval tasks, dual-stream models lack the ability to model complex interactions between vision and language for other vision-language understanding tasks.

3. Method

In this section, we present our proposed IRRA framework. The overview of IRRA is illustrated in Fig. 2 and the details are discussed in the following subsections.

3.1. Feature Extraction Dual-Encoder

Previous works in text-to-image person retrieval typically utilize image and text encoders that are pre-trained separately on unimodal datasets. Inspired by the partial success of transferring knowledge from CLIP to text-to-image person retrieval [13], we directly initialize IRRA with the full CLIP image and text encoders to endow it with strong underlying cross-modal alignment capabilities.

Image Encoder. Given an input image I ∈ R^{H×W×C}, a CLIP pre-trained ViT model is adopted to obtain the image embedding. We first split I into a sequence of N = HW/P^2 fixed-sized non-overlapping patches, where P denotes the patch size, and then map the patch sequence to 1D tokens {f_i^v}_{i=1}^N by a trainable linear projection. With the injection of positional embeddings and an extra [CLS] token, the sequence of tokens {f_cls^v, f_1^v, ..., f_N^v} is input into L-layer transformer blocks to model the correlations of each patch. Finally, a linear projection maps f_cls^v to the joint image-text embedding space, which serves as the global image representation.
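Since IRRA is initialized from the released CLIP weights, the dual-encoder feature extraction described above can be reproduced in spirit with the public CLIP release. The following PyTorch sketch is illustrative only: it uses OpenAI's `clip` package with its stock 224x224 preprocessing, whereas IRRA resizes inputs to 384x128 (which would require interpolating the positional embeddings); the file name and caption are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the full pre-trained CLIP ViT-B/16 dual encoder (image + text).
model, preprocess = clip.load("ViT-B/16", device=device)

# "person.jpg" and the caption below are placeholders for a gallery image and a query.
image = preprocess(Image.open("person.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["A woman in a gray pair of shorts and gray shoes."]).to(device)

with torch.no_grad():
    f_v = model.encode_image(image)   # projected [CLS] token -> global image feature f^v
    f_t = model.encode_text(text)     # projected [EOS] token -> global text feature f^t

# The retrieval score is the cosine similarity between the two global features.
f_v = f_v / f_v.norm(dim=-1, keepdim=True)
f_t = f_t / f_t.norm(dim=-1, keepdim=True)
print((f_v @ f_t.t()).item())
```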
Figure 2. Overview of the proposed IRRA framework. It consists of a dual-stream feature extraction backbone and three representation learning branches, i.e., Implicit Relation Reasoning (IRR), Similarity Distribution Matching (SDM) and Identity Identification (ID loss). IRR aims to implicitly utilize fine-grained information to learn a discriminative global representation. SDM minimizes the KL divergence between image-text similarity score distributions and true label matching distributions, which effectively enlarges the variance between non-matching pairs and the correlation between matching pairs. Additionally, we adopt an ID loss to aggregate the feature representations of the same identity, further improving the retrieval performance. IRRA is trained end-to-end with these three tasks, and it computes only one global image-text similarity score, making it computationally efficient. Modules connected by dashed lines are removed during the inference stage.

Text Encoder. For an input text T, we directly use the CLIP text encoder to extract the text representation, which is a Transformer [44] modified by Radford et al. [35]. Following CLIP, the lower-cased byte pair encoding (BPE) with a 49152 vocabulary size [37] is first employed to tokenize the input text description. The text description is bracketed with [SOS] and [EOS] tokens to indicate the start and end of the sequence. The tokenized text {f_sos^t, f_1^t, ..., f_eos^t} is then fed into the transformer, which exploits the correlations of each token by masked self-attention. Finally, the highest layer of the transformer at the [EOS] token, f_eos^t, is linearly projected into the image-text joint embedding space to obtain the global text representation.

3.2. Implicit Relation Reasoning

To fully exploit fine-grained information, it is crucial to bridge the significant modality gap between vision and language. While most existing methods do so by explicitly aligning local features between images and text, this paper introduces a novel approach. Specifically, we use MLM to implicitly mine fine-grained relations and learn discriminative global features.

Masked Language Modeling. Masked language modeling (MLM) was initially proposed by Taylor [43] in 1953, and it became widely known when the BERT model adopted it as a novel pre-training task. In this work, we utilize MLM to predict masked textual tokens not only from the remaining unmasked textual tokens but also from the visual tokens. Similar to the analysis of Fu et al. [11] in pure language pre-training, MLM optimizes two properties: (1) the alignment of image and text contextualized representations with the static embeddings of masked textual tokens, and (2) the uniformity of static embeddings in the joint embedding space. In the alignment property, sampled embeddings of masked textual tokens serve as an anchor to align image and text contextualized representations, as illustrated in Fig. 3. We find that such a local anchor is essential for modeling local dependencies and can implicitly utilize fine-grained local information for global feature alignment.

Multimodal Interaction Encoder. To achieve full interaction between the image and text modalities, we design an efficient multimodal interaction encoder to fuse the image and text embeddings. Compared to two other popular multimodal interaction modules [9, 16], our design is more computationally efficient, as illustrated in Fig. 4. The multimodal interaction encoder consists of a multi-head cross-attention (MCA) layer and 4-layer transformer blocks.
Figure 3. Illustration of the MLM objective. MLM uses the static embeddings of masked textual tokens as local fine-grained keys to align image and text contextualized representations in the same context.

Figure 4. Illustration of our multimodal interaction encoder and two other popular interaction modules. (a) Co-attention: textual and visual features are fed into separate transformer blocks with self-attention and cross-attention independently to enable cross-modal interaction. (b) Merged attention: textual and visual features are concatenated together and then fed into a single transformer block. (c) Ours: textual and visual features are first fused by a cross-attention layer and then fed into a single transformer block.

Given an input text description T, we randomly mask out the text tokens with a probability of 15% and replace them with the special token [MASK]. Following BERT, the replacements are 10% random tokens, 10% unchanged, and 80% [MASK]. The masked text is defined as T̂ and fed into the text Transformer as described in Sec. 3.1. The last hidden states {h_i^t̂}_{i=1}^L and {h_i^v}_{i=1}^N of the text transformer and the vision transformer are then fed into the multimodal interaction encoder jointly. In order to fuse the image and masked text representations more effectively, the masked text representations {h_i^t̂}_{i=1}^L serve as the query (Q), and the image representations {h_i^v}_{i=1}^N serve as the key (K) and value (V). The full interaction between image and masked text representations is achieved by

\{h^m_i\}_{i=1}^{L} = \mathrm{Transformer}(\mathrm{MCA}(\mathrm{LN}(\mathcal{Q}, \mathcal{K}, \mathcal{V}))),   (1)

where {h_i^m}_{i=1}^L are the fused image and masked-text contextualized representations, L is the length of the input textual tokens, LN(·) denotes Layer Normalization, and MCA(·) is the multi-head cross attention, realized by

\mathrm{MCA}(\mathcal{Q}, \mathcal{K}, \mathcal{V}) = \mathrm{softmax}\left(\frac{\mathcal{Q}\mathcal{K}^\top}{\sqrt{d}}\right)\mathcal{V},   (2)

where d is the embedding dimension of the masked tokens.

For each masked position {h_i^m : i ∈ M}, we use a multi-layer perceptron (MLP) classifier to predict the probability of the corresponding original tokens, {m_j^i}_{j=1}^{|V|} = MLP(h_i^m). The IRR objective can be formulated as

\mathcal{L}_{irr} = -\frac{1}{|\mathcal{M}||\mathcal{V}|} \sum_{i \in \mathcal{M}} \sum_{j=1}^{|\mathcal{V}|} y_j^i \log \frac{\exp(m_j^i)}{\sum_{k=1}^{|\mathcal{V}|} \exp(m_k^i)},   (3)

where M denotes the set of masked text tokens and |V| is the size of the vocabulary V, m^i is the predicted token probability distribution, and y^i is a one-hot vocabulary distribution in which the ground-truth token has a probability of 1.
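To make the IRR branch concrete, the following is a minimal PyTorch sketch of the pipeline described above: BERT-style random masking (15%, split 80/10/10), a multi-head cross-attention fusion layer as in Eqs. (1)-(2), a 4-layer transformer, and an MLM classifier trained with the objective of Eq. (3). It is a sketch under stated assumptions rather than the authors' released code; the module and function names, the MLM head layout, and the exact LayerNorm placement are our own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bert_style_mask(token_ids, vocab_size, mask_id, special_ids, p=0.15):
    """Corrupt 15% of the non-special text tokens: 80% -> [MASK],
    10% -> random token, 10% -> unchanged. Returns masked ids and MLM labels."""
    token_ids = token_ids.clone()
    labels = torch.full_like(token_ids, -100)            # -100 is ignored by cross_entropy
    candidate = ~torch.isin(token_ids, special_ids)
    chosen = candidate & (torch.rand_like(token_ids, dtype=torch.float) < p)
    labels[chosen] = token_ids[chosen]
    r = torch.rand_like(token_ids, dtype=torch.float)
    token_ids[chosen & (r < 0.8)] = mask_id
    rand_tok = torch.randint_like(token_ids, vocab_size)
    swap = chosen & (r >= 0.8) & (r < 0.9)
    token_ids[swap] = rand_tok[swap]
    return token_ids, labels

class ImplicitRelationReasoning(nn.Module):
    """Cross-attention fusion (Eqs. 1-2) followed by an MLM head (Eq. 3)."""
    def __init__(self, dim=512, heads=8, layers=4, vocab_size=49152):
        super().__init__()
        self.ln_q, self.ln_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.mlm_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                      nn.LayerNorm(dim), nn.Linear(dim, vocab_size))

    def forward(self, text_hidden, image_hidden, labels):
        # Masked text tokens are the queries; image tokens provide keys and values.
        fused, _ = self.mca(self.ln_q(text_hidden),
                            self.ln_kv(image_hidden), self.ln_kv(image_hidden))
        fused = self.blocks(fused)
        logits = self.mlm_head(fused)                     # (B, L, |V|)
        # Cross-entropy over the masked positions only (ignore_index skips the rest).
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
```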
3.3. Similarity Distribution Matching

We introduce a novel cross-modal matching loss termed Similarity Distribution Matching (SDM), which incorporates the cosine similarity distributions of the N × N image-text pair embeddings into a KL divergence to associate the representations across the two modalities.

Given a mini-batch of N image-text pairs, for each global image representation f_i^v we construct a set of image-text representation pairs {(f_i^v, f_j^t), y_{i,j}}_{j=1}^N, where y_{i,j} is a true matching label: y_{i,j} = 1 means that (f_i^v, f_j^t) is a matched pair from the same identity, while y_{i,j} = 0 indicates an unmatched pair. Let sim(u, v) = u^⊤v / (∥u∥∥v∥) denote the dot product between the L2-normalized u and v (i.e., cosine similarity). The probability of matching pairs can then be calculated with the following softmax function:

p_{i,j} = \frac{\exp(\mathrm{sim}(f_i^v, f_j^t)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(f_i^v, f_k^t)/\tau)},   (4)

where τ is a temperature hyperparameter which controls how peaked the probability distribution is. The matching probability p_{i,j} can be viewed as the proportion of the cosine similarity score between f_i^v and f_j^t to the sum of the cosine similarity scores between f_i^v and {f_j^t}_{j=1}^N in a mini-batch. The SDM loss from image to text in a mini-batch is computed by

\mathcal{L}_{i2t} = KL(\mathbf{p}_i \| \mathbf{q}_i) = \frac{1}{N} \sum_{i=1}^{N}\sum_{j=1}^{N} p_{i,j}\log\left(\frac{p_{i,j}}{q_{i,j} + \epsilon}\right),   (5)

where ϵ is a small number to avoid numerical problems, and q_{i,j} = y_{i,j} / \sum_{k=1}^{N} y_{i,k} is the true matching probability. Symmetrically, the SDM loss from text to image, L_{t2i}, can be formulated by exchanging f^v and f^t in Eqs. (4) and (5), and the bi-directional SDM loss is calculated by

\mathcal{L}_{sdm} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}.   (6)
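The SDM loss is easy to express directly from Eqs. (4)-(6). The snippet below is a compact PyTorch sketch written from those definitions; the function name `sdm_loss` and the exact epsilon handling are ours, and the official implementation may differ in details.

```python
import torch
import torch.nn.functional as F

def sdm_loss(f_v, f_t, pids, tau=0.02, eps=1e-8):
    """Similarity Distribution Matching, Eqs. (4)-(6).
    f_v, f_t: (N, D) global image/text features; pids: (N,) identity labels."""
    f_v = F.normalize(f_v, dim=-1)
    f_t = F.normalize(f_t, dim=-1)

    # y_{i,j} = 1 iff image i and text j share the same identity (symmetric).
    y = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()
    q = y / y.sum(dim=1, keepdim=True)                 # true matching distribution

    sim = f_v @ f_t.t() / tau                          # cosine similarities scaled by 1/tau
    p_i2t = F.softmax(sim, dim=1)                      # Eq. (4)
    p_t2i = F.softmax(sim.t(), dim=1)                  # exchange f^v and f^t

    # KL(p || q) averaged over the batch, Eq. (5); eps avoids log(0).
    l_i2t = (p_i2t * (p_i2t.clamp_min(eps).log() - (q + eps).log())).sum(dim=1).mean()
    l_t2i = (p_t2i * (p_t2i.clamp_min(eps).log() - (q + eps).log())).sum(dim=1).mean()
    return l_i2t + l_t2i                               # Eq. (6)
```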

Method Type Ref Image Enc. Text Enc. Rank-1 Rank-5 Rank-10 mAP mINP
CMPM/C [53] L ECCV18 RN50 LSTM 49.37 - 79.27 - -
TIMAM [36] G ICCV19 RN101 BERT 54.51 77.56 79.27 - -
ViTAA [46] L ECCV20 RN50 LSTM 54.92 75.18 82.90 51.60 -
NAFS [12] L arXiv21 RN50 BERT 59.36 79.13 86.00 54.07 -
DSSL [55] L MM21 RN50 BERT 59.98 80.41 87.56 - -
SSAN [7] L arXiv21 RN50 LSTM 61.37 80.15 86.73 - -
LapsCore [49] L ICCV21 RN50 BERT 63.40 - 87.80 - -
ISANet [51] L arXiv22 RN50 LSTM 63.92 82.15 87.69 - -
LBUL [48] L MM22 RN50 BERT 64.04 82.66 87.22 - -
Han et al. [13] G BMVC21 CLIP-RN101 CLIP-Xformer 64.08 81.73 88.19 60.08 -
SAF [28] L ICASSP22 ViT-Base BERT 64.13 82.62 88.40 - -
TIPCB [5] L Neuro22 RN50 BERT 64.26 83.19 89.10 - -
CAIBC [47] L MM22 RN50 BERT 64.43 82.87 88.37 - -
AXM-Net [10] L MM22 RN50 BERT 64.44 80.52 86.77 58.73 -
LGUR [38] L MM22 DeiT-Small BERT 65.25 83.12 89.00 - -
IVT [39] G ECCVW22 ViT-Base BERT 65.59 83.11 89.21 - -
CFine [50] L arXiv22 CLIP-ViT BERT 69.57 85.93 91.15 - -
Baseline (CLIP-RN50) G - CLIP-RN50 CLIP-Xformer 57.26 78.57 85.58 50.88 34.44
Baseline (CLIP-RN101) G - CLIP-RN101 CLIP-Xformer 60.27 80.88 87.88 53.93 37.54
Baseline (CLIP-ViT-B/16) G - CLIP-ViT CLIP-Xformer 68.19 86.47 91.47 61.12 44.86
IRRA (Ours) G - CLIP-ViT CLIP-Xformer 73.38 89.93 93.71 66.13 50.24

Table 1. Performance comparisons with state-of-the-art methods on CUHK-PEDES dataset. Results are ordered based on the Rank-1
accuracy. “G” and “L” in “Type” column stand for global-matching/local-matching method.

Optimization. As mentioned previously, the main objective of IRRA is to improve the learning of global image-text representations in the joint embedding space. To achieve this goal, the commonly utilized ID loss [54] is also adopted along with the SDM loss and the IRR loss to optimize IRRA. The ID loss is a softmax loss that classifies an image or text into distinct groups based on its identity. It explicitly considers the intra-modal distance and ensures that feature representations of the same image/text group are closely clustered together in the joint embedding space.

IRRA is trained in an end-to-end manner and the overall optimization objective for training is defined as

\mathcal{L} = \mathcal{L}_{irr} + \mathcal{L}_{sdm} + \mathcal{L}_{id}.   (7)
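Putting the three terms together, a training step could combine them as sketched below. This reuses the `sdm_loss` sketch given after Eq. (6); the `IDClassifier` head, its size, and the decision to apply the ID loss to both image and text global features are illustrative assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class IDClassifier(nn.Module):
    """Shared softmax identity classifier used for the ID loss (a sketch;
    the head size equals the number of training identities, e.g. 11,003 on CUHK-PEDES)."""
    def __init__(self, dim=512, num_identities=11003):
        super().__init__()
        self.fc = nn.Linear(dim, num_identities)

    def forward(self, feats, pids):
        return F.cross_entropy(self.fc(feats), pids)

def training_loss(f_v, f_t, pids, mlm_loss, id_head):
    """Overall objective of Eq. (7): L = L_irr + L_sdm + L_id,
    with the ID loss applied here to both image and text global features."""
    l_sdm = sdm_loss(f_v, f_t, pids)            # sketch defined after Eq. (6)
    l_id = id_head(f_v, pids) + id_head(f_t, pids)
    return mlm_loss + l_sdm + l_id
```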
4. Experiments

We extensively evaluate our method on three challenging text-to-image person retrieval datasets.

CUHK-PEDES [30] is the first dataset dedicated to text-to-image person retrieval, containing 40,206 images and 80,412 textual descriptions for 13,003 identities. Following the official data split, the training set consists of 11,003 identities, 34,054 images and 68,108 textual descriptions. The validation set and test set contain 3,078 and 3,074 images, with 6,158 and 6,156 textual descriptions, respectively, and both of them cover 1,000 identities.

ICFG-PEDES [7] contains a total of 54,522 images of 4,102 identities. Each image has only one corresponding textual description. The dataset is divided into a training set and a test set; the former comprises 34,674 image-text pairs of 3,102 identities, while the latter contains 19,848 image-text pairs for the remaining 1,000 identities.

RSTPReid [55] contains 20,505 images of 4,101 identities from 15 cameras. Each identity has 5 corresponding images taken by different cameras, and each image is annotated with 2 textual descriptions. Following the official data split, the training, validation and test sets contain 3,701, 200 and 200 identities, respectively.

Evaluation Metrics. We adopt the popular Rank-k metrics (k = 1, 5, 10) as the primary evaluation metrics. Rank-k reports the probability of finding at least one matching person image within the top-k candidate list when given a textual description as a query. In addition, for a comprehensive evaluation, we also adopt the mean Average Precision (mAP) and the mean Inverse Negative Penalty (mINP) [52] as additional retrieval criteria. Higher Rank-k, mAP and mINP indicate better performance.
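For reference, the two main metrics can be computed as in the short PyTorch sketch below, where each text query is scored against the full image gallery. This is an illustration only; it omits mINP and may differ from the exact evaluation script used by the authors, e.g. in tie handling.

```python
import torch

def rank_k_and_map(sim, q_pids, g_pids, ks=(1, 5, 10)):
    """sim: (num_queries, num_gallery) text-to-image similarity matrix.
    q_pids / g_pids: identity labels of the text queries and gallery images."""
    order = sim.argsort(dim=1, descending=True)               # ranked gallery per query
    matches = (g_pids[order] == q_pids.unsqueeze(1)).float()  # 1 where identities match

    # Rank-k: fraction of queries with at least one correct image in the top k.
    ranks = {k: (matches[:, :k].sum(dim=1) > 0).float().mean().item() for k in ks}

    # mean Average Precision over all queries.
    aps = []
    for m in matches:
        hits = m.nonzero(as_tuple=True)[0]
        if hits.numel() == 0:
            continue
        precision_at_hit = torch.arange(1, hits.numel() + 1, dtype=torch.float) / (hits + 1).float()
        aps.append(precision_at_hit.mean().item())
    return ranks, sum(aps) / max(1, len(aps))
```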
Implementation Details. IRRA consists of a pre-trained image encoder, i.e., CLIP-ViT-B/16, a pre-trained text encoder, i.e., the CLIP text Transformer, and a randomly initialized multimodal interaction encoder. For each layer of the multimodal interaction encoder, the hidden size and number of heads are set to 512 and 8. During training, random horizontal flipping, random crop with padding, and random erasing are employed for image data augmentation. All input images are resized to 384 × 128. The maximum length of the textual token sequence L is set to 77. Our model is trained with the Adam optimizer [24] for 60 epochs with a learning rate initialized to 1 × 10^-5 and cosine learning rate decay. At the beginning, we spend 5 warm-up epochs linearly increasing the learning rate from 1 × 10^-6 to 1 × 10^-5. For randomly initialized modules, we set the initial learning rate to 5 × 10^-5. The temperature parameter τ in the SDM loss is set to 0.02. This work is supported by Huawei MindSpore [19]. We perform our experiments on a single RTX3090 24GB GPU.
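The training recipe above maps onto a short configuration sketch. The snippet below (PyTorch and torchvision) is illustrative: the parameter grouping, transform ordering, and helper names are assumptions rather than the released configuration.

```python
import math
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

# Image augmentation as described: resize to 384x128, flip, pad-and-crop, random erasing.
train_transform = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.Pad(10),
    transforms.RandomCrop((384, 128)),
    transforms.ToTensor(),
    transforms.RandomErasing(),
])

def build_optimizer(clip_params, new_params, epochs=60, warmup_epochs=5):
    """Adam with lr 1e-5 for CLIP-initialized weights and 5e-5 for randomly
    initialized modules, 5 warm-up epochs, then cosine decay (epoch-wise)."""
    optimizer = Adam([
        {"params": clip_params, "lr": 1e-5},
        {"params": new_params, "lr": 5e-5},
    ])

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            # Linear warm-up: factor 0.1 -> 1.0 (i.e. 1e-6 -> 1e-5 for the CLIP group;
            # the randomly initialized group warms up proportionally).
            return 0.1 + 0.9 * epoch / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```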
4.1. Comparison with State-of-the-Art Methods

In this section, we present comparison results with state-of-the-art methods on three public benchmark datasets. Note that the Baseline models in Tabs. 1, 2 and 3 denote different CLIP models fine-tuned with the original CLIP loss (InfoNCE [34]).

Performance Comparisons on CUHK-PEDES. We first evaluate the proposed method on the most common benchmark, CUHK-PEDES. As shown in Tab. 1, IRRA outperforms all state-of-the-art methods, achieving 73.38% Rank-1 accuracy and 66.13% mAP. It is worth noting that our directly fine-tuned CLIP Baseline is already competitive with the recent state-of-the-art method CFine [50], with Rank-1 and Rank-5 accuracy reaching 68.19% and 86.47%, respectively. In Tab. 1, we annotate the feature extraction backbones ("Image Enc." and "Text Enc." columns) employed by each method; it is evident that there is a growing demand for powerful feature extraction backbones in text-to-image person retrieval, with transformer-based backbones becoming progressively dominant.

Performance Comparisons on ICFG-PEDES. The experimental results on the ICFG-PEDES dataset are reported in Tab. 2. The Baseline achieves results comparable to recent state-of-the-art methods, with 56.74%, 75.72% and 82.26% on Rank-1, Rank-5 and Rank-10, respectively. Moreover, our proposed IRRA achieves 63.46%, 80.24% and 85.82% on these metrics, which exceeds the recent state-of-the-art local-matching method CFine [50] by a large margin, i.e., +2.63%, +3.69% and +3.4%. It is worth noting that the mINP [52] metric on ICFG-PEDES is relatively low, which indicates the inferior capability of IRRA to find the hardest matching samples.

Method Type Rank-1 Rank-5 Rank-10 mAP mINP
Dual Path [54] G 38.99 59.44 68.41 - -
CMPM/C [53] L 43.51 65.44 74.26 - -
ViTAA [46] L 50.98 68.79 75.78 - -
SSAN [7] L 54.23 72.63 79.53 - -
IVT [39] G 56.04 73.60 80.22 - -
ISANet [51] L 57.73 75.42 81.72 - -
CFine [50] L 60.83 76.55 82.42 - -
Baseline (CLIP-RN50) G 41.46 63.68 73.04 21.00 2.46
Baseline (CLIP-RN101) G 44.09 66.27 74.75 22.59 2.84
Baseline (CLIP-ViT-B/16) G 56.74 75.72 82.26 31.84 5.03
IRRA (Ours) G 63.46 80.25 85.82 38.06 7.93

Table 2. Performance comparisons with state-of-the-art methods on the ICFG-PEDES dataset.

Performance Comparisons on RSTPReid. We also report our experimental results on the newly released RSTPReid dataset in Tab. 3. Our proposed IRRA dramatically surpasses the recent global-matching method IVT [39] by +13.5%, +11.3% and +9.4% on Rank-1, Rank-5 and Rank-10, respectively. Compared with the recent local-matching method CFine [50], IRRA also achieves considerable performance gains, with rises of +9.65%, +8.8% and +6.6% on Rank-1, Rank-5 and Rank-10, respectively.

Method Type Rank-1 Rank-5 Rank-10 mAP mINP
DSSL [55] G 39.05 62.60 73.95 - -
SSAN [7] L 43.50 67.80 77.15 - -
LBUL [48] L 45.55 68.20 77.85 - -
IVT [39] G 46.70 70.00 78.80 - -
CFine [50] L 50.55 72.50 81.60 - -
Baseline (CLIP-RN50) G 41.40 68.55 77.95 31.51 12.71
Baseline (CLIP-RN101) G 43.45 67.75 78.40 29.91 11.18
Baseline (CLIP-ViT-B/16) G 54.05 80.70 88.00 43.41 22.31
IRRA (Ours) G 60.20 81.30 88.20 47.17 25.28

Table 3. Performance comparisons with state-of-the-art methods on the RSTPReid dataset.

In summary, our IRRA consistently achieves the best performance for all metrics on all three benchmark datasets. This demonstrates the generalization and robustness of our proposed method.

4.2. Ablation Study

In this subsection, we analyze the effectiveness of each component in the IRRA framework. Here, we adopt the CLIP-ViT-B/16 model fine-tuned with the InfoNCE loss as the Baseline to facilitate the ablation study.

Ablations on proposed components. To fully demonstrate the impact of the different components in IRRA, we conduct a comprehensive empirical analysis on three public datasets (i.e., CUHK-PEDES [30], ICFG-PEDES [7] and RSTPReid [55]). The Rank-1, Rank-5 and Rank-10 accuracies (%) are reported in Tab. 4.

IRR learns local relations through the MLM task, which can be easily integrated with other transformer-based methods to facilitate fine-grained cross-modal alignment. The efficacy of IRR is revealed by the experimental results of No.0 vs. No.4, No.2 vs. No.6 and No.5 vs. No.7. Merely adding IRR to the Baseline improves the Rank-1 accuracy by 3.04%, 4.22% and 3.85% on the three datasets, respectively. These results clearly show that the IRR module is beneficial for cross-modal matching.

To demonstrate the effectiveness of our proposed similarity distribution matching (SDM) loss, we compare it with the commonly used cross-modal projection matching (CMPM) loss [53] (No.1 vs. No.2) on the three public datasets; the SDM loss improves the Rank-1 accuracy over the CMPM loss by 11.11%, 6.62%, and 2.2%, respectively. Besides, replacing the original InfoNCE loss with the commonly used CMPM loss (No.0 vs. No.1) does not improve performance on text-to-image person retrieval; instead, it leads to performance degradation. In contrast, the SDM loss improves the Rank-1 accuracy over the Baseline by 2.23%, 3.71%, and 3.15% on the three datasets, respectively. These results demonstrate that the proposed SDM loss aligns the feature representations of the two modalities well.

In addition, the experimental results of No.2 vs. No.5 and No.6 vs. No.7 demonstrate the effectiveness of the ID loss.

No. Methods SDM Lid IRR CUHK-PEDES (Rank-1 Rank-5 Rank-10) ICFG-PEDES (Rank-1 Rank-5 Rank-10) RSTPReid (Rank-1 Rank-5 Rank-10)
0 Baseline - - - 68.19 86.47 91.47 56.74 75.72 82.26 54.05 80.70 88.00
1 +Lcmpm [53] - - - 59.31 79.66 86.11 53.83 72.20 79.02 55.40 77.70 85.25
2 +SDM ✓ - - 70.42 86.73 92.04 60.45 77.88 83.86 57.20 79.90 88.10
3 +Lid - ✓ - 65.33 84.05 90.33 53.38 72.70 79.70 54.15 76.65 85.00
4 +IRR - - ✓ 71.23 88.89 93.24 60.96 79.02 84.90 57.90 80.85 88.50
5 +SDM+Lid ✓ ✓ - 70.52 87.59 92.12 61.03 78.26 83.89 58.65 80.70 87.05
6 +SDM+IRR ✓ - ✓ 72.81 89.31 93.39 63.27 80.10 85.77 59.25 79.70 88.00
7 IRRA ✓ ✓ ✓ 73.38 89.83 93.71 63.46 80.25 85.82 60.20 81.30 88.20

Table 4. Ablation study on each component of IRRA on CUHK-PEDES, ICFG-PEDES and RSTPReid.

Analysis of the Multimodal Interaction Encoder. To demonstrate the advantages of our proposed multimodal interaction module, we compare it with two other popular multimodal interaction modules in Tab. 5. The multimodal interaction module in IRR is a computationally efficient operation for fusing multimodal features and building the connection between the two modalities. We extensively compare it with Co-attn and Merged attn under our proposed IRRA setting, and observe a slight but consistent performance gain on all Rank-k metrics. Notably, our major advantage is computational efficiency.

Method Param(M) Time(ms) Rank-1 Rank-5 Rank-10
Co-attn 33.62 24.30 73.28 89.04 93.44
Merged attn 12.61 19.20 73.21 89.18 93.70
Ours 13.66 6.42 73.38 89.83 93.71

Table 5. Comparison between different multimodal interaction modules of IRRA on CUHK-PEDES.

4.3. Qualitative Results

Fig. 5 compares the top-10 retrieval results of the Baseline and our proposed IRRA. As the figure shows, IRRA achieves much more accurate retrieval results and succeeds in cases where the Baseline fails. This is mainly due to the Implicit Relation Reasoning (IRR) module we designed, which fully exploits fine-grained discriminative clues to distinguish different pedestrians, as illustrated by the orange-highlighted text and image regions in Fig. 5. Moreover, we found that our model only learns word-level semantic information and is unable to understand phrase-level semantics in the description text, which leads to the distortion of semantic information. This is because we only mask random single tokens in MLM and do not perform phrase-level masking. We plan to address this issue in the future.

Figure 5. Comparison of top-10 retrieved results on CUHK-PEDES between the Baseline (the first row) and IRRA (the second row) for each text query. The image corresponding to the query text, matched images, and mismatched images are marked with black, green and red rectangles, respectively.

5. Conclusion

In this paper, we introduce a cross-modal implicit relation reasoning and aligning framework (IRRA) to learn discriminative global image-text representations. To achieve full cross-modal interaction, we propose an Implicit Relation Reasoning module that exploits MLM to mine fine-grained relations between visual and textual tokens. We further propose a Similarity Distribution Matching loss to effectively enlarge the variance between non-matching pairs and the correlation between matching pairs. These modules collaborate to align images and text into a joint embedding space. Significant performance gains on three popular benchmark datasets prove the superiority and effectiveness of our proposed IRRA framework. We believe that the CLIP-based approach will be the future trend for text-to-image person retrieval.

Acknowledgement. This work is partially supported by the Key Research and Development Program of Hubei Province (2021BAA187), National Natural Science Foundation of China under Grant (62176188), Zhejiang Lab (NO.2022NF0AB01), the Special Fund of Hubei Luojia Laboratory (220100015) and the CAAI-Huawei MindSpore Open Fund.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[2] Tianlang Chen, Chenliang Xu, and Jiebo Luo. Improving text-based person search by spatial matching and adaptive threshold. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1879–1887. IEEE, 2018.
[3] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[4] Yucheng Chen, Rui Huang, Hong Chang, Chuanqi Tan, Tao Xue, and Bingpeng Ma. Cross-modal knowledge adaptation for language-based person search. IEEE Transactions on Image Processing, 30:4057–4069, 2021.
[5] Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. Tipcb: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing, 494:171–181, 2022.
[6] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
[7] Zefeng Ding, Changxing Ding, Zhiyin Shao, and Dacheng Tao. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint arXiv:2107.12666, 2021.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[9] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022.
[10] Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. Axm-net: Implicit cross-modal feature alignment for person re-identification. 36(4):4477–4485, 2022.
[11] Zhiyi Fu, Wangchunshu Zhou, Jingjing Xu, Hao Zhou, and Lei Li. Contextual representation learning beyond masked language modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2701–2714, 2022.
[12] Chenyang Gao, Guanyu Cai, Xinyang Jiang, Feng Zheng, Jun Zhang, Yifei Gong, Pai Peng, Xiaowei Guo, and Xing Sun. Contextual non-local alignment over full-scale representation for text-based person search. arXiv preprint arXiv:2101.03036, 2021.
[13] Xiao Han, Sen He, Li Zhang, and Tao Xiang. Text-based person search with limited data. arXiv preprint arXiv:2110.10807, 2021.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15013–15022, 2021.
[16] Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. Decoupling the role of data, attention, and losses in multimodal transformers. Transactions of the Association for Computational Linguistics, 9:570–585, 2021.
[17] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[19] Huawei. Mindspore, https://www.mindspore.cn/, 2020.
[20] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[21] Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, and Tieniu Tan. Pose-guided multi-granularity attention network for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11189–11196, 2020.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[23] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
[25] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[26] Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L Berg, and Licheng Yu. Loopitr: Combining dual and cross encoder architectures for image-text retrieval. arXiv preprint arXiv:2203.05465, 2022.
[27] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[28] Shiping Li, Min Cao, and Min Zhang. Learning semantic-aligned feature representation for text-based person search. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2724–2728. IEEE, 2022.
[29] Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, and Xiaogang Wang. Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision, pages 1890–1899, 2017.
[30] Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1970–1979, 2017.
[31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
[32] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019.
[33] Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2021.
[34] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5814–5824, 2019.
[37] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[38] Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. arXiv preprint arXiv:2207.07802, 2022.
[39] Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. arXiv preprint arXiv:2208.08608, 2022.
[40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[41] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[42] Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, and Jingjing Liu. Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 982–997, 2021.
[43] Wilson L Taylor. "cloze procedure": A new tool for measuring readability. Journalism quarterly, 30(4):415–433, 1953.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[45] Haochen Wang, Jiayi Shen, Yongtuo Liu, Yan Gao, and Efstratios Gavves. Nformer: Robust person re-identification with neighbor transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7297–7307, 2022.
[46] Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vitaa: Visual-textual attributes alignment in person search by natural language. In European Conference on Computer Vision, pages 402–420. Springer, 2020.
[47] Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Caibc: Capturing all-round information beyond color for text-based person retrieval. arXiv preprint arXiv:2209.05773, 2022.
[48] Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992, 2022.
[49] Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, and Shuguang Cui. Lapscore: Language-guided person search via color reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1624–1633, 2021.
[50] Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. Clip-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022.
[51] Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. arXiv preprint arXiv:2208.14365, 2022.
[52] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6):2872–2893, 2021.
[53] Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European conference on computer vision (ECCV), pages 686–701, 2018.
[54] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020.
[55] Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pages 209–217, 2021.
