

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

Lewei Yao 1,2, Jianhua Han 2, Xiaodan Liang 3†, Dan Xu 1, Wei Zhang 2, Zhenguo Li 2, Hang Xu 2†
1 Hong Kong University of Science and Technology, 2 Huawei Noah's Ark Lab, 3 Shenzhen Campus of Sun Yat-Sen University
† Corresponding authors: [email protected], [email protected]

Figure 1. Visualizations of DetCLIPv2 for open-vocabulary word-region alignment. DetCLIPv2 is able to detect broad concepts. Panel captions: (a) Hand drawn black and white illustration, flying eagle with a snake in claws. (b) Fried egg on a frying pan with cherry tomatoes and parsley. (c) Curator looks on as we consider paintings in a shared studio space. (d) A SCUBA diver looks over his shoulder towards a sea lion. (e) Little girl witch with black cat, owl, the witch's cauldron, ghost spirits and text on violet background. The concept of Halloween. (f) The Vibrant Tattoos. (g) Playa Esmeralda in Holguin, Cuba. The view from the top of the beach. Beautiful Caribbean sea turquoise. (h) Dark eyed juvenile perched in a conifer.

Abstract

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo-labeling process, DetCLIPv2 directly learns fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution inputs for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13× more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance; e.g., DetCLIPv2 with a Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.

1. Introduction

Traditional object detection frameworks [6, 35, 36, 57] are typically trained to predict a set of predefined categories, which fails to meet the demand of many downstream application scenarios that require detecting arbitrary categories (denoted as open-vocabulary detection, OVD). For example, a robust autonomous driving system requires accurate predictions for all classes of objects on the road [26]. Extending traditional object detectors to such scenarios needs tremendous human effort for extra instance-level bounding-box annotations, especially for rare classes. To obtain an open-vocabulary detector without the expensive annotation process, the central question we should ask is: where does knowledge about unseen categories come from?
Recent works [16, 44, 51] try to achieve open-vocabulary object detection by transferring knowledge from a pre-trained vision-language (VL) model [20, 33, 49]. For example, ViLD [16] distills CLIP's [33] image embeddings of cropped proposals into the proposal features of a detection model. However, these solutions suffer from a domain gap problem: VL models are typically pre-trained with image-level supervision at a fixed input resolution, and are therefore not capable of recognizing objects at the various scales required by the detection task, especially small objects.

Another line of work resorts to exploiting massive image-text pairs crawled from the Internet. To utilize image-text pair data without instance-level annotations, approaches [13, 14, 19, 27, 48, 54] generate pseudo-bounding-box labels following a self-training paradigm [40] or based on a pre-trained VL model [33]. However, their final performance is restricted by the quality of pseudo-labels provided by a detector trained with limited human-annotated concepts or by a VL model suffering from the aforementioned domain gap problem. Besides, using high-resolution inputs similar to detection data for massive image-text pairs imposes a huge computational burden on training, preventing us from further scaling up image-text pairs.

Figure 2. Different OVD training paradigms. (a) Distilling knowledge from a pre-trained VL model [16]. (b) Exploiting image-text pairs via pseudo labeling [27]. (c) Our end-to-end joint training eliminates complex multi-stage training schemes, allowing for mutual benefits in learning from different types of data.

To address the above issues, we present DetCLIPv2, an end-to-end open-vocabulary detection pre-training framework that effectively incorporates large-scale image-text pairs. DetCLIPv2 simultaneously learns localization capability and knowledge of broad concepts without relying on a teacher model to provide pseudo labels. Specifically, we perform joint training with heterogeneous data from multiple sources, including detection [38], grounding [22] and image-text pairs [7, 39], under a unified data formulation. To enable image-text pairs without instance-level annotations to facilitate the learning of detection, inspired by [49], we employ an optimal matching-based set similarity between visual regions and textual concepts to guide the contrastive learning. By alternating different types of data during training, we enable a "flywheel effect": learning from detection data provides accurate localization, which helps extract representative regions for contrastive learning, while contrastive learning from image-text pairs helps recognize broader concepts, which further improves the localization of unseen categories. As training goes on, the detector learns to locate and recognize increasingly rich concepts.

Furthermore, to relieve the computation burden brought by large-scale image-text pairs, we adopt a low-resolution input for image-text pair data, which significantly improves the training efficiency. This is a reasonable design since the caption of an image-text pair typically describes only the main objects appearing in the image, which alleviates the necessity of high-resolution training.

Benefiting from these effective designs, DetCLIPv2 demonstrates superior open-vocabulary detection performance and promising scaling behavior. For example, compared to the prior work DetCLIP [48], DetCLIPv2 is able to exploit 13× more image-text pairs while requiring only a similar training time. Using the vanilla ATSS [53] as the detector, DetCLIPv2 with a Swin-T backbone achieves 40.4% zero-shot AP on the LVIS [17] benchmark, surpassing previous works GLIP [27]/GLIPv2 [52]/DetCLIP [48] by 14.4/11.4/4.5% AP, respectively. DetCLIPv2 also exhibits great generalization when transferring to downstream tasks, e.g., it achieves SoTA fine-tuning performance on LVIS and ODinW13 [27]. We present a possibility of achieving open-world detection by incorporating large-scale image-text pairs and hope it will enlighten the community to explore a trajectory similar to CLIP's [33] success.

2. Related Work

Vision-Language Pre-training (VLP). Conventional vision-language models are designed to serve a specific task, e.g., VQA [2, 15, 24, 28] and image captioning [1, 30, 45, 50]. Recently, there has been a trend to develop generic vision-language representation learning systems by exploiting large-scale low-cost image-text pairs. For example, CLIP [33] and ALIGN [20] perform cross-modal contrastive learning on millions of image-text pairs and achieve impressive zero-shot image classification performance. The most relevant work to our approach is FILIP [49], which proposes a cross-modal late interaction mechanism based on a word-patch similarity to better facilitate image-text alignment. However, it is non-trivial to leverage this idea to construct an open-vocabulary detection system, for which our approach provides a solution.

Open-vocabulary object detection (OVD) has emerged recently as a more general and practical paradigm to detect objects of unbounded concepts. Inspired by the success of vision-language pre-training, recent works [16, 44, 51, 54] propose to transfer knowledge of a pre-trained VL model (e.g., CLIP [33]) into a detector. Another effective idea is to use a wider source of training data. E.g., [13, 14, 19, 27, 48]
incorporate low-cost image-text pairs to expand domain coverage via a pseudo labeling process. X-DETR [5] integrates a standard contrastive learning objective from VLP [20, 33] to learn image-to-text alignment. Detic [55] turns to solving a large-vocabulary detection problem by directly assigning classification labels to the max-size region proposals. Unlike previous works, our approach targets building an end-to-end framework that effectively learns word-region alignment from massive image-text pairs without relying on a teacher model.

Semi-supervised Object Detection (SSOD) methods [40, 41, 46, 58] aim to improve object detection systems by exploiting unlabeled data on the basis of some available labeled data. Although effective in improving performance, these methods still assume a closed-domain setting where the categories in the unlabeled data should be covered by the labeled data. On the other hand, Weakly-supervised Object Detection (WSOD) methods [3, 8, 32] aim to establish localization-capable detectors by leveraging image-level labels, which also require a set of pre-defined categories. Differing from methods in these fields, our approach considers a more challenging open-domain setting and targets establishing an open-world detector by learning unlimited concepts from massive image-text pairs.

3. The Proposed Approach

An overview of the proposed approach is illustrated in Figure 3. To construct a robust open-world object detection system, DetCLIPv2 incorporates data from different sources, i.e., detection, grounding, and image-text pairs, for pre-training. We first introduce a unified paralleled data formulation enabling training with heterogeneous supervisions (Sec. 3.1). To utilize image-text pairs without instance-level annotations, we introduce a fine-grained contrastive learning that automatically aligns textual words and visual regions (Sec. 3.2). Finally, we introduce the model architecture and training objective (Sec. 3.3) and the joint-training details (Sec. 3.4).

3.1. A Unified Data Formulation

Following DetCLIP [48], we use a paralleled formulation to unify the formats of data from different sources. Specifically, we formulate each data sample as a triplet (x^I, {b_i}_{i=1}^N, {t_j}_{j=1}^M), where x^I ∈ R^{3×h×w} is the image, and {b_i | b_i ∈ R^4}_{i=1}^N and T = {t_j}_{j=1}^M denote a set of bounding-box annotations and a set of concept names, respectively. The triplet is constructed differently for different types of data (a small sketch follows the list):

• Detection. T is constructed from sampled category names of the dataset, which consist of the categories appearing in the image and additional randomly-sampled negative categories. To explicitly provide the relationships between various concepts, we apply concept enrichment [48] during both training and testing phases, i.e., each t_j is obtained by concatenating its category name with the corresponding definition.

• Grounding. We first extract noun phrases (provided in the annotations) from the original caption to form a positive concept set T_pos = {t_j}_{j=1}^{|pos|}. To provide enough negative concepts for learning, we further randomly sample a negative concept set T_neg = {t_j}_{j=1}^{|neg|} that is not contained in the caption (i.e., T_pos ∩ T_neg = ∅) from a constructed concept dictionary [48]. The final category name set is formed by T = T_pos ∪ T_neg.

• Image-text pairs. As instance-level annotation is not available, we have {b_i}_{i=1}^N = ∅. T consists of the original caption and the noun phrases extracted from it (we use the NLP parser provided by the spaCy [18] repository).
leled data formulation enabling a training with heteroge- embedding fTbatch ∈ RMB ×D , where MB is the total num-
neous supervisions (Sec. 3.1). To utilize image-text pairs ber of concepts in a global batch after deduplication. Then
without instance-level annotations, we introduce a fine- we calculate the similarity matrix S ∈ RK×MB between fP
grained contrastive learning that automatically aligns tex- and fTbatch by
tual words and visual regions (Sec. 3.2). Finally, we in-
troduce the model architecture/training objective (Sec. 3.3) \label {eq:similarity} S = \textbf {f}^P (\textbf {f}^{T_{batch}})^\top \vspace {-.2em} (1)
and the joint-training details (Sec. 3.4). When instance-level annotations are available, e.g., for
3.1. A Unified Data Formulation detection and grounding data, we can construct a target
matrix G ∈ {0, 1}K×MB following a ground-truth assign-
Following DetCLIP [48], we use a paralleled formula-
ment process in conventional object detection frameworks
tion to unify the formats of data from different sources.
[36, 43, 53], then the alignment loss Lalign (S, G) (detailed
Specifically, we formulate each data sample as a triplet:
in Sec. 3.3) can be calculated; while for image-text pairs
(xI , {bi }N M I
i=1 , {tj }j=1 ), where x ∈ R
3×h×w
is the image,
4 N M
where the instance-level annotation is not available, we
{bi |bi ∈ R }i=1 and T = {tj }j=1 denote a set of bound- elaborate our approach in the Sec. 3.2.
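A minimal sketch of Eq. (1) and the alignment step, under assumptions of ours that the paper does not spell out: features are L2-normalized before the dot product, the cross-GPU gather and deduplication are done elsewhere, and the focal-loss normalization is a common choice rather than the authors' exact recipe.

```python
import torch
import torch.nn.functional as F


def word_region_similarity(region_feats: torch.Tensor, text_feats_batch: torch.Tensor) -> torch.Tensor:
    """Eq. (1): S = f^P (f^{T_batch})^T.
    region_feats:     (K, D) classification features of the K proposals.
    text_feats_batch: (M_B, D) concept embeddings gathered over the global batch,
                      duplicates removed.
    """
    region_feats = F.normalize(region_feats, dim=-1)          # assumption: cosine-style similarity
    text_feats_batch = F.normalize(text_feats_batch, dim=-1)
    return region_feats @ text_feats_batch.t()                 # (K, M_B)


def alignment_loss(S: torch.Tensor, G: torch.Tensor, gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal-loss alignment between S and a {0,1} target matrix G produced by the
    detector's ground-truth assignment (only possible for detection/grounding data)."""
    p = S.sigmoid()
    ce = F.binary_cross_entropy_with_logits(S, G, reduction="none")
    p_t = p * G + (1 - p) * (1 - G)
    alpha_t = alpha * G + (1 - alpha) * (1 - G)
    # normalize by the number of positive targets (an assumed convention)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum() / G.sum().clamp(min=1)
```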
3.2. Learning from Image-text Pairs

Massive image-text pairs crawled from the Internet can provide rich knowledge for the learning of vision-language models. However, due to the lack of instance-level annotations, it is non-trivial to leverage image-text pairs to improve a dense prediction (e.g., object detection) learning system. Inspired by [49], we introduce a contrastive learning method that learns fine-grained word-region correspondences without relying on instance-level annotations, which is described as follows.
Figure 3. Overall architecture of DetCLIPv2. DetCLIPv2 performs joint training with detection, grounding and image-text pair data in an end-to-end manner. The architecture consists of an image encoder that extracts region embeddings f^P from an input image and a text encoder that computes word embeddings f^T for the input noun phrases. For detection and grounding data, learning is performed by aligning the word-region similarity matrix S to a target matrix constructed with instance-level annotations. For image-text pairs, we calculate an optimal match-based set similarity between f^T and f^P to guide the contrastive learning, enabling the learning of word-region alignment. (Diagram: detection/grounding data use high-resolution inputs with a small batch; image-text pairs use low-resolution inputs with a large batch; proposals are selected by text-to-region similarity before computing the text-to-image similarity.)

Word-region alignment similarity. Given an image-text pair (x^I, x^T), we extract a set of noun phrases T = {t_j}_{j=1}^M from x^T and take (x^I, {t_j}_{j=1}^M) as the input of the model. The image encoder generates a set of proposals P = {p_k}_{k=1}^K from x^I with their region features f^P ∈ R^{K×D}, and the text encoder extracts text embeddings f^T ∈ R^{M×D} of {t_j}_{j=1}^M. Our word-region alignment contrastive learning is constructed based on the set similarity between P and T. Specifically, for the j-th concept t_j ∈ T, we find its closest match in P by calculating

m_j = \arg\max_{0 < k \le K} [\mathbf{f}^T]_j^\top [\mathbf{f}^P]_k    (2)

where [f^P]_k is the k-th region feature in f^P, and similarly for [f^T]_j. This operation can be interpreted as follows: for each concept, we find the region that best fits its description. Then we calculate the text-to-image similarity s^T between x^I and x^T by aggregating all word-to-region similarities, i.e.,

s^T(x^I, x^T) = \frac{1}{M} \sum_{j=1}^{M} [\mathbf{f}^T]_j^\top [\mathbf{f}^P]_{m_j}    (3)

Note that the image-to-text similarity s^I(x^I, x^T) can be calculated in a similar way. However, we exclude this part from our algorithm, since image-text pairs crawled from the Internet suffer from a severe partial labeling problem: for the vast majority of data, the text describes only a small fraction of the objects appearing in the image, i.e., most region proposals cannot find a corresponding match in the caption text. Including image-to-text matching can result in a noticeable performance degradation, for which we give an ablation in Sec. 4.2.1.

Another reasonable consideration is that each textual concept should correspond to multiple regions. This design can be modeled by using a softmax-weighted-sum similarity between a textual concept and all visual regions, i.e.,

s^T(x^I, x^T) = \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{K} \frac{\exp(s_{j,k}/\tau_t)}{\sum_{i=1}^{K} \exp(s_{j,i}/\tau_t)} s_{j,k}    (4)

where s_{j,k} = [\mathbf{f}^T]_j^\top [\mathbf{f}^P]_k is the similarity between the j-th textual concept and the k-th visual region, and τ_t is a temperature hyper-parameter that controls the sharpness of the softmax-based weights (when τ_t → 0, Eq. 4 degrades to Eq. 3). We investigate this design in Sec. 4.2.1.

Image-text contrastive loss. Based on the introduced word-region alignment similarity, standard contrastive learning between image-text pairs can be performed [33]. Specifically, assume a batch of B image-text pairs {(x_i^I, x_i^T)}_{i=1}^B; the contrastive loss L_cts is formulated as

\mathcal{L}_{cts} = \mathcal{L}_{T \rightarrow I} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(s^T(x_i^I, x_i^T)/\tau)}{\sum_{j=1}^{B} \exp(s^T(x_j^I, x_i^T)/\tau)}    (5)

where s^T(x_i^I, x_j^T) is the text-to-image word-region alignment similarity between the i-th image x_i^I and the j-th text x_j^T, given by Eq. 3, and τ is a temperature to scale the logits. As discussed before, we only consider the text-to-image contrastive loss. By incorporating the word-region alignment similarity, the contrastive loss helps the model learn fine-grained word-region correspondences automatically.
the vast majority of data, the text describes only a small Proposal selection. Intuitively, we expect to select the most
fraction of the objects appearing in the image, i.e., most representative regions in an image to calculate similarities
of region proposals cannot find their corresponding match with textual concepts. There are several schemes to accom-
in the caption texts. Including image-to-text matching can plish this. For example, many detectors incorporate class-
result in a noticeable performance degradation, for which agnostic object scores in their designs, e.g., foreground clas-
we give an ablation in Sec. 4.2.1. sification score in RPN [36], centerness in FCOS [43], etc.,

23500
which can be utilized to generate high-quality region proposals with good generalization [16, 23]. However, these approaches fail to take the textual information into consideration. To select regions valuable for contrastive learning, for each candidate region we calculate its similarities with all textual concepts within a local batch, and use the maximum similarity as its objectness score. The benefits of this design are two-fold: (1) it selects the regions most relevant to the text description; (2) it selects hard negative concepts that are described in other texts, which may benefit the contrastive learning. With the objectness score, we select the top-k proposals after an NMS operation. Different proposal selection strategies and the optimal k are studied in Sec. 4.2.1.
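A sketch of this selection step, based on our reading of the description above rather than the authors' code: the NMS IoU threshold is an assumed value, k = 100 follows the ablation in Sec. 4.2.1, and torchvision's `nms` is used for the suppression step.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import nms


def select_proposals(boxes: torch.Tensor, region_feats: torch.Tensor,
                     batch_text_feats: torch.Tensor, k: int = 100,
                     iou_thresh: float = 0.6):
    """boxes: (K, 4) xyxy proposals; region_feats: (K, D);
    batch_text_feats: (N, D) concept embeddings from all texts in the local batch."""
    # Batch-wise text similarity: a region scores highly if it matches ANY concept
    # mentioned in the batch, which also surfaces hard negatives from other captions.
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(batch_text_feats, dim=-1).t()
    objectness = sim.max(dim=1).values                 # (K,) max similarity per region
    keep = nms(boxes, objectness, iou_thresh)          # indices sorted by decreasing score
    keep = keep[:k]                                    # top-k after suppression
    return boxes[keep], region_feats[keep]
```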
3.3. Model Architecture and Training Objective

Model architecture. Similar to DetCLIP [48], DetCLIPv2 is built on the vanilla ATSS [53] detector equipped with a transformer-based [33, 34] text encoder. We do not introduce additional heavy modules such as the DyHead [9] adopted in [27, 52] or the cross-modal fusion adopted in [12, 27, 52]. A special design is that we insert a lightweight deformable convolution [56] at the beginning of the classification head, which uses the features output by the regression head to calculate the spatial offsets and the modulation scalar, and aggregates the features from the backbone output. The motivation is that when training with image-text pairs, there is no supervision signal on the regression branch and therefore no gradient is generated for it. This design helps the gradient from the classification head flow back to the regression head, so that the regression head also benefits from training with massive image-text pairs; i.e., learning a better spatial aggregation for backbone features helps the regression head acquire better localization ability. We show that this neat design provides a substantial performance improvement when training with image-text pairs (see Sec. 4.2.2).
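A rough sketch of how such a module could be wired, using torchvision's modulated deformable convolution. This is our interpretation only: the channel width, kernel size and exact placement are assumptions, not details confirmed by the paper. The offsets and modulation mask are predicted from the regression-branch features, and the deformable convolution then aggregates the backbone features that feed the classification head, so classification-side gradients also reach the regression branch.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class RegGuidedDeformAggregation(nn.Module):
    """Lightweight deformable aggregation at the start of the classification head."""

    def __init__(self, channels: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        n = kernel_size * kernel_size
        # offsets (dx, dy per sampling tap) and modulation scalars come from reg features
        self.offset_pred = nn.Conv2d(channels, 2 * n, kernel_size, padding=pad)
        self.mask_pred = nn.Conv2d(channels, n, kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, backbone_feat: torch.Tensor, reg_feat: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(reg_feat)
        mask = self.mask_pred(reg_feat).sigmoid()
        # aggregated feature that is then fed to the classification head
        return self.deform(backbone_feat, offset, mask)


# usage sketch:
# agg = RegGuidedDeformAggregation(256)
# cls_input = agg(fpn_feature, regression_tower_feature)
```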
Training Objective. The overall objective of DetCLIPv2 can be formulated as

\mathcal{L} = \begin{cases} \mathcal{L}_{align} + \alpha \mathcal{L}_{reg} + \beta \mathcal{L}_{center}, & \text{for detection} \\ \mathcal{L}_{align}, & \text{for grounding} \\ \lambda \mathcal{L}_{cts}, & \text{for image-text pairs} \end{cases}    (6)

where L_align is the alignment loss described in Sec. 3.1, L_cts is the contrastive loss in Eq. 5, L_reg and L_center are the regression and centerness losses, respectively, and α, β and λ are loss weights. Following ATSS [53], we use focal loss for L_align, GIoU loss [37] for L_reg, and cross-entropy loss for L_center. We remove the localization loss for grounding data due to its inaccurate bounding-box annotations.

3.4. Joint Training

DetCLIPv2 performs joint training with heterogeneous datasets. During training, we group data belonging to the same type into a global batch. At each iteration, we sample one type of data for training. Different data types are trained with different input resolutions and batch sizes. Specifically, we use a high-resolution input with a small batch size for detection and grounding data, while for image-text pairs a low-resolution input with a large batch size is adopted, which helps increase the number of negative samples in contrastive learning and considerably reduces the training cost of massive image-text pairs.
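The alternating scheme together with the per-data-type objective of Eq. (6) can be pictured as the loop below. This is a schematic, not the released training code: the round-robin sampling, the loader/model interfaces and the `data_type` keyword are hypothetical, while the loss weights α = 2, β = 1, λ = 0.1 follow Sec. 4.1.

```python
import itertools


def joint_training(model, optimizer, det_loader, grd_loader, pair_loader,
                   alpha=2.0, beta=1.0, lam=0.1, num_iters=10000):
    """Alternate mini-batches from the three sources; each batch contributes the
    loss defined for its data type in Eq. (6)."""
    iters = {"det": itertools.cycle(det_loader),
             "grd": itertools.cycle(grd_loader),
             "pair": itertools.cycle(pair_loader)}
    order = itertools.cycle(["det", "grd", "pair"])   # assumption: simple round-robin sampling

    for _ in range(num_iters):
        kind = next(order)
        batch = next(iters[kind])
        out = model(batch, data_type=kind)            # hypothetical forward returning loss terms
        if kind == "det":                             # high-resolution input, small batch
            loss = out["align"] + alpha * out["reg"] + beta * out["center"]
        elif kind == "grd":                           # localization loss dropped (noisy boxes)
            loss = out["align"]
        else:                                         # image-text pairs: low-res, large batch
            loss = lam * out["cts"]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```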
Dataset | Type | Volume
Objects365 [38] (O365) | Detection | 0.66M
GoldG [22] | Grounding | 0.77M
CC15M (CC3M [39] + CC12M [7]) | Image-text pairs | 13M (3M + 10M)
Table 1. A summary of training data. CC15M contains only 13M image-text pairs since some URLs are invalid.

4. Experimental Results

4.1. Implementation Details

Training Dataset. We use multiple datasets from different sources for training (Table 1). Specifically, for detection data, we use a sampled subset of the Objects365v2 [38] dataset (denoted as O365) with 0.66M images; for grounding data, we use GoldG [22] with COCO [29] images removed, which results in a fairer zero-shot evaluation on LVIS [17]. For image-text pairs, we use two versions of the Conceptual Captions (CC) dataset, i.e., CC3M [39] and CC12M [7] (together denoted as CC15M).

Training details. We use Swin-Transformer [31] backbones for the image encoder. For the text encoder, the maximum token length is set to 16 for efficient training and inference. We initialize the text encoder with a pretrained FILIP model [49]. 32/64 V100 GPUs are used for training the Swin-T/L-based models, respectively. For detection and grounding data, we use an input resolution of 1333 × 800 with a batch size of 128/256 for Swin-T/L models (4 per card), respectively; for image-text pairs, we use an input resolution of 320 × 320 with a batch size of 6144 (192/96 per card for Swin-T/L models). We set α = 2, β = 1 and λ = 0.1 in Eq. 6. Unless otherwise specified, all models are trained for 12 epochs. More training details can be found in the Appendix.

Evaluation benchmark. Following GLIP [27] and DetCLIP [48], we evaluate our method's zero-shot performance on LVIS [17] with 1203 categories. Fixed AP [10] on LVIS minival5k is reported for ablation and comparison with other methods. To further study the generalization ability of our method, we also evaluate on the ODinW13 dataset [27, 52], which contains 13 downstream detection tasks with highly varied distributions. We focus on the GLIP protocol [27] rather than the ViLD protocol [16] that splits LVIS into seen/unseen categories, since the former is a stronger and more practical open-world setting that does
not make any prior assumptions on downstream tasks, while the latter still requires partial LVIS data for training.

4.2. Ablation Studies

4.2.1 Ablations for Image-text Contrastive Learning

We investigate the key factors for our image-text contrastive learning in Table 2. The experiments are conducted with a Swin-T-based model on the O365+CC3M datasets.

Proposal selection strategy. Selecting representative regions is critical for image-text contrastive learning. Table 2a studies multiple class-agnostic objectness scores for selecting proposals, including the foreground classification score [47] (row 1), the IoU score [21] (row 2) and the centerness [43] (row 3). Except for centerness, which is originally designed in ATSS [53], the other two scores are predicted by plugging an additional head after the regression branch. We consider three additional scores that utilize textual information: (1) sample-wise text similarity (row 4), i.e., each region calculates the similarities with the textual concepts of its own sample and the maximum similarity is used as the objectness score; (2) batch-wise text similarity (row 5), i.e., the similarities are calculated between a region and the textual concepts within a local batch, as described in Sec. 3.2; and (3) multiplying the batch-wise text similarity with the centerness score (row 6), which is commonly adopted by conventional detectors [43, 53].

Among the three class-agnostic objectness scores, centerness and IoU scores are superior to the classification score, indicating that localization-based objectness scores provide better class-agnostic proposals. This result is consistent with the observations in [23]. Considering only the sample-wise text similarity performs worse than using class-agnostic scores, since the regions selected in this way make it easier to distinguish between positive and negative samples in the contrastive learning, thus reducing the learning efficiency. Batch-wise similarity addresses the problem by considering text similarities with negative samples and achieves the best performance of 31.3 AP. Further integrating the centerness score results in a performance drop to 30.9 AP.

Word-region alignment strategy. Table 2b investigates the word-region alignment strategies described in Sec. 3.2. Specifically, for fine-grained word-region alignment, two matching strategies are studied: (1) 1-to-1 matching (row 2), i.e., each textual concept is matched with its closest region, and (2) 1-to-many matching (row 3), i.e., each textual concept calculates similarities with all regions, which are then aggregated through a softmax-weighted-sum operation. Besides, we also study a coarse-grained image-text matching strategy proposed in [55] (row 1), which directly calculates the similarity between the max-size proposal of the image and the entire caption. Both fine-grained word-region alignment strategies outperform the coarse-grained image-text alignment. Assigning each textual concept to the closest region reaches the best performance (31.3 AP) and substantially saves GPU memory compared to the 1-to-many strategy, which allows a larger batch size to boost the contrastive learning.

Number of proposals k. Table 2d investigates the optimal k when selecting proposals. We vary k from 25 to 200. Using a large k = 200 results in too many low-quality candidates, which slightly decreases the performance. A too small k = 25 leads to insufficient region extraction, which causes a noticeable performance drop. A modest design with 100 proposals achieves the best performance.

Contrastive loss design. Table 2c performs ablation experiments on the different sides of the image-text contrastive loss (Eq. 5). Three designs are considered: (1) only the image-to-text side loss; (2) only the text-to-image side loss; and (3) the bilateral loss. As discussed in Sec. 3.2, using only the image-to-text contrastive loss leads to a significant performance degradation (29.8 AP) due to the partial labeling problem of the image-text pair data. Excluding the image-to-text contrastive loss alleviates the problem and achieves a better performance of 31.3 AP.

Temperature and loss weight. Tables 2e and 2f study the optimal values of the temperature τ in Eq. 5 and the loss weight λ in Eq. 6, respectively. The default values of λ = 1, τ = 0.07 commonly adopted in standard contrastive learning methods [33, 49] perform poorly in our case. We use τ = 0.5 and λ = 0.1 as our final setting.

4.2.2 Effectiveness of Deformable Module

Table 3 studies the effectiveness of the proposed deformable module described in Sec. 3.3. The deformable module effectively promotes weakly supervised learning. Specifically, it has a negative effect when trained only with strongly supervised detection data (rows 1 and 2), while demonstrating a substantial performance improvement when incorporating grounding/image-text pair data without localization supervision (rows 3 and 4). Besides, the lightweight deformable module introduces negligible computational cost in terms of training time.

4.2.3 Incorporating More Data Helps Learning

Table 4 reports the performance gains when scaling up the training data. With the proposed framework, incorporating more training data from different sources consistently improves the performance. Compared to training with only Objects365, including CC3M effectively improves the overall AP from 28.6 to 31.3, especially for rare categories (from 24.2 to 29.4, +5.2 AP). GoldG helps significantly improve the overall AP to 38.4 thanks to its instance-level annotations. Including CC12M pushes the envelope further, achieving a 40.4 overall AP, which already surpasses the performance of the fully-supervised method (see Table 6).
(a) Proposal selection strategy. Batch-wise text similarity generates better proposals for contrastive learning.
# | Strategy | AP (r/c/f)
1 | cls | 28.4 (26.6/28.2/28.8)
2 | IoU | 30.1 (30.0/30.1/30.2)
3 | centerness | 30.2 (28.4/30.6/30.1)
4 | text sim (S) | 29.6 (24.9/29.5/30.5)
5 | text sim (B) | 31.3 (29.4/31.7/31.3)
6 | +centerness | 30.9 (30.2/31.1/30.8)

(b) Word-region matching strategy. Matching each textual concept to the closest region is effective and memory-efficient.
# | Strategy | AP (r/c/f) | Memory
1 | max-bbox | 29.8 (28.5/39.5/30.4) | 19.8 GB
2 | 1-to-1 | 31.3 (29.4/31.7/31.3) | 20.6 GB
3 | 1-to-many | 30.9 (31.3/30.7/31.1) | 26.0 GB

(c) Contrastive loss design. Excluding the image-to-text contrastive loss boosts the performance.
Design | AP (r/c/f)
text-to-image | 31.3 (29.4/31.7/31.3)
image-to-text | 29.8 (30.0/29.7/29.9)
bilateral | 30.9 (30.0/31.5/30.5)

(d) Number of proposals. We use k = 100.
Top-k | AP (r/c/f)
25 | 30.6 (29.9/30.5/30.8)
50 | 30.8 (30.2/30.6/31.0)
100 | 31.3 (29.4/31.7/31.3)
200 | 30.8 (29.4/30.6/31.1)

(e) Temperature τ. τ = 0.5 works the best.
τ | AP (r/c/f)
1 | 30.1 (28.4/30.3/30.3)
0.5 | 31.3 (29.4/31.7/31.3)
0.15 | 30.8 (29.6/31.0/30.9)
0.07 | 29.2 (27.5/29.0/29.6)

(f) Contrastive loss weight λ. We use λ = 0.1.
λ | AP (r/c/f)
0.03 | 30.5 (29.6/30.0/31.0)
0.1 | 31.3 (29.4/31.7/31.3)
0.3 | 30.9 (29.4/31.5/30.7)
1 | 28.9 (27.7/28.8/29.3)

Table 2. Ablation experiments for image-text contrastive learning. The models are based on the Swin-T backbone and trained with the O365+CC3M dataset. We report zero-shot fixed AP (%) [10] on LVIS minival5k [22]. r/c/f indicate AP of rare/common/frequent categories, respectively. The designs with the highest overall AP are selected as our final setting.

Pretrain data | deform | AP (r/c/f) | iter time (s)
O365 | no | 28.8 (26.0/28.0/30.0) | 0.925
O365 | yes | 28.6 (24.2/27.1/30.6) | 1.075
O365+GoldG+CC3M | no | 37.3 (34.1/36.9/38.2) | 0.980
O365+GoldG+CC3M | yes | 38.4 (36.7/37.9/39.1) | 1.092
Table 3. The deformable module effectively improves the weakly-supervised learning while introducing negligible computational cost. 'iter time' is the training time per iteration.

Pretrain data | AP (r/c/f)
O365 | 28.6 (24.2/27.1/30.6)
O365+CC3M | 31.3 (29.4/31.7/31.3)
O365+GoldG+CC3M | 38.4 (36.7/37.9/39.1)
O365+GoldG+CC15M | 40.4 (36.0/41.7/40.0)
Table 4. Incorporating more data from different sources consistently improves the performance.

Model | Pretrain data | Training time (GPU hours) | Training FPS
GLIP-T [27]† | O365+GoldG | 7.4k (3.0k)† | 1.6
DetCLIP-T [48] | O365+GoldG+YFCC1M | 2.0k | 4.1
DetCLIPv2-T | O365+GoldG+CC15M | 2.1k | 25.7
Table 5. Training efficiency. For DetCLIP, we directly use the result reported in the paper; for GLIP, we calculate the training time based on the FPS provided in the paper. †: 7.4k is calculated based on the official implementation, which trains for 30 epochs, while 3.0k is obtained by converting it to our setting of 12 epochs.

4.2.4 Training Efficiency

We develop DetCLIPv2 with several designs that facilitate training efficiency, including using low-resolution inputs for image-text pairs and limiting the maximum token length of the text encoder to 16. Table 5 compares the training efficiency of DetCLIPv2 with that of GLIP [27] and DetCLIP [48]. First, both DetCLIP and DetCLIPv2 are more efficient than GLIP due to the lightweight architecture design described in Sec. 3.3. Besides, DetCLIPv2 is much faster than DetCLIP: it exploits 13× more image-text pairs than DetCLIP with a similar training time, achieving more than a 6× FPS speed-up (25.7 FPS vs. 4.1 FPS). This indicates the great scaling property of our method and opens the possibility of incorporating larger-scale image-text pairs to build a more powerful open-vocabulary detection system.

4.3. Main Results

4.3.1 Zero-shot Performance on LVIS

To compare with existing works, we train DetCLIPv2 with the best setting reported in Sec. 4.2.1. We vary the model capacity by considering two backbones, i.e., Swin-T and Swin-L [31], denoted as DetCLIPv2-T/L, respectively. Table 6 reports the comparison with MDETR [22], GLIP [27], GLIPv2 [52], and DetCLIP [48] on zero-shot performance. For better demonstration, we also report the performance of the fully-supervised method on LVIS.

DetCLIPv2 outperforms the existing methods by a large margin. Compared to GLIP/GLIPv2, DetCLIPv2 uses a more lightweight architecture (without the heavy DyHead [9] and cross-modal fusion) but still achieves better performance, e.g., DetCLIPv2-T outperforms GLIP-T/GLIPv2-T by 14.4/11.4 AP, respectively. Compared to DetCLIP, DetCLIPv2 achieves 4.5 (40.4 vs. 35.9) and 6.1 (44.7 vs. 38.6) AP performance gains for the Swin-T- and Swin-L-based models, respectively. Despite using more training data, our total training cost is on par with DetCLIP [48], as reported in Table 5. Notably, our models beat their fully-supervised counterparts in a zero-shot manner, e.g., by +6.8/0.8 AP for the Swin-T- and Swin-L-based models, respectively. In particular, due to the long-tailed property of LVIS, the improvements over rare categories are significant, i.e., more than 10 AP improvement can be observed for both models.
Method | Detector (Backbone) | Pre-Train Data | LVIS AP | APr/APc/APf
MDETR [22] | DETR [6] (RN101) | GoldG+ | 24.2 | 20.9/24.3/24.2
Supervised | ATSS [53] (Swin-T) | LVIS | 33.6 | 19.7/32.4/37.2
GLIP-T [27] | DyHead [9] (Swin-T) | O365, GoldG, Cap4M | 26.0 | 20.8/21.4/31.0
GLIPv2-T [52] | DyHead [9] (Swin-T) | O365, GoldG, Cap4M | 29.0 | - / - / -
DetCLIP-T [48] | ATSS [53] (Swin-T) | O365, GoldG, YFCC1M | 35.9 | 33.2/35.7/36.4
DetCLIPv2-T (ours) | ATSS [53] (Swin-T) | O365, GoldG, CC15M | 40.4 | 36.0/41.7/40.0
Supervised | ATSS [53] (Swin-L) | LVIS | 43.9 | 30.6/43.6/46.6
GLIP-L [27] | DyHead [9] (Swin-L) | O365, GoldG, Cap24M | 37.3 | 28.2/34.3/41.5
DetCLIP-L [48] | ATSS [53] (Swin-L) | O365, GoldG, YFCC1M | 38.6 | 36.0/38.3/39.3
DetCLIPv2-L (ours) | ATSS [53] (Swin-L) | O365, GoldG, CC15M | 44.7 | 43.1/46.3/43.7
Table 6. Zero-shot performance on LVIS minival5k [22]. Fixed AP [10] is reported. DetCLIPv2 achieves SoTA performance.
Method | LVIS AP (APr/APc/APf) | ODinW13 average AP
GLIP-T [27] | - | 64.9
GLIPv2-T [52] | 50.6 (- / - / -) | 66.5
DetCLIPv2-T (ours) | 50.7 (44.3/52.4/50.3) | 68.0
GLIP-L [27] | - | 68.9
GLIPv2-B [52] | 57.3 (- / - / -) | 69.4
GLIPv2-H [52] | 59.8 (- / - / -) | 70.4
DetCLIPv2-L (ours) | 60.1 (58.3/61.7/59.1) | 70.4
Table 7. Fine-tuning performance. Fixed AP [10] on LVIS minival5k [22] and average AP on ODinW13 [27] are reported. Numbers with † use mask annotations for training.

Pretrain data | AR (s/m/l)
O365 | 44.9 (35.2/52.9/62.0)
O365+GoldG | 57.2 (42.1/67.0/76.2)
O365+GoldG+CC15M | 59.4 (44.5/69.4/76.8)
Table 8. Average recall (AR) across 0.5-0.95 IoU on LVIS. s/m/l denote small/medium/large objects, respectively.

4.3.2 Transfer Results with Fine-tuning

We study the transferability of DetCLIPv2 by fine-tuning it on downstream tasks. Specifically, we conduct full-shot fine-tuning on LVIS [17] with 1203 categories and on ODinW13 [27, 52], which contains 13 detection tasks. The results are shown in Table 7. Without using mask annotations for training, DetCLIPv2 slightly outperforms GLIPv2 on LVIS, e.g., 50.7 AP for DetCLIPv2-T vs. 50.6 AP for GLIPv2-T. On ODinW13, DetCLIPv2-T demonstrates superior performance, outperforming GLIP-T/GLIPv2-T by 3.1/1.5 average AP, respectively; and DetCLIPv2-L with a Swin-L backbone achieves the same performance (70.4 average AP) as GLIPv2-H, which uses a heavier Swin-H backbone.

4.4. Visualizations and Analyses

Visualization of word-region alignment. Figure 1 visualizes the learned word-region alignment on image-text pairs from CC12M [7]. For each textual concept, we find the region with the highest similarity to it, as described in Sec. 3.2. Our approach achieves accurate instance-level word-region alignment with great generalization, which is demonstrated by several aspects: (1) it succeeds in recognizing concepts that are not covered by detection datasets, e.g., 'parsley' in case (b); (2) it works for images with natural distribution shifts [42], e.g., the sketch image in case (a) and the cartoon image in case (e); and (3) it is capable of resolving co-reference expressions, e.g., the 'juvenile' in case (h) refers to a young bird and the 'curator' in case (c) refers to a person. These capabilities are critical for open-world detectors but cannot be reflected well in commonly adopted evaluation benchmarks like LVIS [17].

Learning from image-text pairs benefits localization. Table 8 provides more evidence that learning from image-text pairs also helps localization. Specifically, we evaluate the average recall across 0.5-0.95 IoU on LVIS and compare models trained with different data. Incorporating image-text pairs brings significant and comprehensive recall improvements for small, medium, and large objects.

5. Conclusion

Learning from massive Internet-crawled data to achieve generic visual/language understanding systems has always been an important topic in both the NLP [4, 11, 34] and CV [20, 25, 33] fields. In this paper, we present DetCLIPv2, a unified end-to-end pre-training framework towards open-vocabulary object detection. By employing a best-matching set similarity between regions and words to guide the contrastive objective, we effectively leverage massive image-text pairs to serve the object detection task. Experiments demonstrate DetCLIPv2's superior open-vocabulary performance and its broad domain coverage. Our method provides a possible way to achieve open-world detection by further scaling up image-text pairs, and we leave this to future work.

Acknowledgements. We acknowledge the support of MindSpore (https://www.mindspore.cn/), CANN (Compute Architecture for Neural Networks) and the Ascend AI Processor used for this research.
References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, pages 2425–2433, 2015.
[3] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, pages 2846–2854, 2016.
[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
[5] Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, and Stefano Soatto. X-detr: A versatile architecture for instance-wise vision-language tasks. arXiv preprint arXiv:2204.05626, 2022.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
[7] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
[8] Ze Chen, Zhihang Fu, Rongxin Jiang, Yaowu Chen, and Xian-Sheng Hua. Slv: Spatial likelihood voting for weakly supervised object detection. In CVPR, pages 12995–13004, 2020.
[9] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In CVPR, pages 7373–7382, 2021.
[10] Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kirillov, and Ross Girshick. Evaluating large-vocabulary object detectors: The devil is in the details. arXiv preprint arXiv:2102.01066, 2021.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2019.
[12] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, et al. Coarse-to-fine vision-language pre-training with fusion in the backbone. arXiv preprint arXiv:2206.07643, 2022.
[13] Dario Fontanel, Matteo Tarantino, Fabio Cermelli, and Barbara Caputo. Detecting the unknown in object detection. arXiv preprint arXiv:2208.11641, 2022.
[14] Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, and Caiming Xiong. Towards open vocabulary object detection without human-provided bounding boxes. arXiv preprint arXiv:2111.09452, 2021.
[15] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In CVPR, pages 6639–6648, 2019.
[16] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[17] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019.
[18] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python. 2020.
[19] Matthew Inkawhich, Nathan Inkawhich, Hai Li, and Yiran Chen. Self-trained proposal networks for the open world. arXiv preprint arXiv:2208.11050, 2022.
[20] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[21] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018.
[22] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr - modulated detection for end-to-end multi-modal understanding. In ICCV, pages 1780–1790, 2021.
[23] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 7(2):5453–5460, 2022.
[24] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021.
[25] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.
[26] Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, et al. Coda: A real-world road corner case dataset for object detection in autonomous driving. arXiv e-prints, pages arXiv–2203, 2022.
[27] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021.
[28] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pages 121–137. Springer, 2020.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[30] Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, and Jing Liu. Cptr: Full transformer network for image captioning. arXiv preprint arXiv:2101.10804, 2021.
[31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
[32] Jinjie Mai, Meng Yang, and Wenfeng Luo. Erasing integrated learning: A simple yet effective approach for weakly supervised object localization. In CVPR, pages 8766–8775, 2020.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[34] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[35] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[37] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
[38] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, pages 8430–8439, 2019.
[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
[40] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
[41] Peng Tang, Chetan Ramaiah, Yan Wang, Ran Xu, and Caiming Xiong. Proposal learning for semi-supervised object detection. In WACV, pages 2291–2301, 2021.
[42] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In NeurIPS, volume 33, pages 18583–18599, 2020.
[43] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
[44] Johnathan Xie and Shuai Zheng. Zsd-yolo: Zero-shot yolo detection using vision-language knowledge distillation. arXiv preprint arXiv:2109.12066, 2021.
[45] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057. PMLR, 2015.
[46] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In ICCV, pages 3060–3069, 2021.
[47] Jianwei Yang, Jiasen Lu, Dhruv Batra, and Devi Parikh. A faster pytorch implementation of faster r-cnn. https://github.com/jwyang/faster-rcnn.pytorch, 2017.
[48] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. arXiv preprint arXiv:2209.09407, 2022.
[49] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2022.
[50] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, pages 684–699, 2018.
[51] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. arXiv preprint arXiv:2203.11876, 2022.
[52] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836, 2022.
[53] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, pages 9759–9768, 2020.
[54] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022.
[55] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. arXiv preprint arXiv:2201.02605, 2022.
[56] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019.
[57] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
[58] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. Rethinking pre-training and self-training. In NeurIPS, volume 33, pages 3833–3845, 2020.