UN-DETR: Promoting Objectness Learning via Joint Supervision for Unknown Object Detection

Haomiao Liu1*, Hao Xu1*, Chuhuai Yue1*, Bo Ma1† (* equal contribution; † corresponding author)
Abstract

Unknown Object Detection (UOD) aims to identify objects of unseen categories, differing from the traditional detection paradigm limited by the closed-world assumption. A key component of UOD is learning a generalized representation, i.e., objectness for both known and unknown categories, to distinguish and localize objects from the background in a class-agnostic manner. However, previous methods obtain supervision signals for learning objectness in isolation from either localization or classification information, leading to poor performance for UOD. To address this issue, we propose a transformer-based UOD framework, UN-DETR. Based on this, we craft the Instance Presence Score (IPS) to represent the probability of an object’s presence. For the purpose of information complementarity, IPS employs a strategy of joint supervised learning, integrating attributes representing general objectness from the positional and the categorical latent spaces as supervision signals. To enhance IPS learning, we introduce a one-to-many assignment strategy to incorporate more supervision. Then, we propose Unbiased Query Selection to provide premium initial query vectors for the decoder. Additionally, we propose an IPS-guided post-processing strategy to filter redundant boxes and correct classification predictions for known and unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner to obtain an objectness prior. Our UN-DETR is comprehensively evaluated on multiple UOD and known detection benchmarks, demonstrating its effectiveness and achieving state-of-the-art performance. Our method is available at https://github.com/ndwxhmzz/UN-DETR.

Introduction

Deep learning-based vision solutions have achieved remarkable success in the past (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016; Vaswani 2017), but their generalization performance in open scenarios still faces significant challenges. Within the closed-world assumption, conventional object detection frameworks (Ren et al. 2015; Redmon et al. 2016; Carion et al. 2020) are limited to detecting objects belonging to predefined categories present in the training set, thereby disregarding objects outside these predefined categories. This limitation leads to detrimental outcomes in real-world scenarios where high precision is essential. To advance from conventional detectors to open-world detectors, numerous outstanding works (Bendale and Boult 2016; Zheng et al. 2022) have emerged. Among them, Liang et al. (2023) clarified the definition of Unknown Object Detection (UOD).

Figure 1: Joint supervision for objectness learning

UOD can be viewed as a two-step problem: 1) locating all objects, both known and unknown, in a class-agnostic manner; and 2) assigning specific categories to these objects. The first step hinges on how to learn the generalized features of any object, i.e., objectness, to distinguish objects from the background. The difficulty arises because, under the definition of UOD, unknown objects lack supervision, and their objectness is mostly generalized from known categories. Recent methods (Wu et al. 2022a; Liang et al. 2023; Zohar, Wang, and Yeung 2023) have been developed to learn objectness, but they still struggle with low recall and precision, often misclassifying unknown objects as background or vice versa.

Our core insight is that the deficiencies of previous methods stem from their failure to consider both known-category and location information when learning objectness. Classification information represents both class-specific categorization and the class-agnostic probability of being foreground. UnSniffer (Liang et al. 2023) utilizes only position prediction (the IoU between predictions and ground truths) (Figure 1(a)) and may therefore misclassify high-IoU boxes without instances as foreground. Class-agnostic positional information is essential for accurate object localization. PROB (Zohar, Wang, and Yeung 2023) ignores positional information and merges the objectness and classification heads (Figure 1(b)), reducing localization accuracy. We hypothesize that extracting general objectness from complementary positional and categorical latent spaces may address these issues.

To validate this hypothesis, we developed UN-DETR, the first Transformer-based UOD framework, and introduced the Instance Presence Score (IPS), which integrates elements representing general objectness from positional and categorical latent spaces. Concretely, we introduce an IPS Predictor (IPP), alongside the original classification head and regression head, to directly output IPS. The optimization process is jointly supervised by signals from both spaces, ensuring their mutual complementarity. To enhance IPS learning, we propose a one-to-many assignment strategy that introduces more positive samples. Subsequently, we propose Unbiased Query Selection to optimize the initialization of queries by replacing the original classification head with the learned IPP. Moreover, we propose an IPS-guided post-processing strategy that filters redundant boxes and further separates known from unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner, using the region prior and the self-supervised encoder, to obtain objectness priors. Essentially, our approach improves the robustness and generalization of the detector, elevating UOD performance to a new level (Figure 1(c)).

To summarize, the contributions of our work are as follows:

  • We reveal that a major flaw in previous UOD methods is the separate use of classification and localization information when learning objectness. To address this issue, we propose the very first Transformer-based UOD framework, UN-DETR.

  • Our core design involves using a dedicated IPP to learn IPS under joint supervision signals from complementary positional and categorical latent spaces. Moreover, IPS also participates in multiple stages of UN-DETR, including query selection and post-processing.

  • Extensive experiments on both UOD and known detection benchmarks clearly demonstrate that our approach surpasses previous methods, achieving state-of-the-art performance.

Related Work

Transformer-Based Detector

Since Detection Transformer (DETR) (Carion et al. 2020) pioneered the first fully end-to-end object detector, transformer-based detectors have gained significant attention for their outstanding performance and scalability. Deformable DETR (D-DETR) (Zhu et al. 2020) further improved this by introducing deformable attention, which efficiently samples key elements, reducing computational demands, speeding up convergence, and enhancing performance.

To enhance training efficiency, recent methods (Li et al. 2022a) have optimized the one-to-one assignment in DETR, which pairs each ground-truth object with a single prediction. Group-DETR (Chen et al. 2023) implemented a group-wise one-to-many assignment, performing decoder self-attention within each group. Similarly, Co-DETR (Zong, Song, and Liu 2023) introduced a collaborative training scheme with multiple auxiliary heads using one-to-many label assignments.

Building on these advancements, we developed UN-DETR, the first transformer-based UOD method, based on D-DETR (Zhu et al. 2020). We apply one-to-many assignment specifically for IPS learning to facilitate generalized feature extraction from more positive sample queries.

Unknown Object Detection and Related Tasks

Recent years have seen the emergence of tasks aimed at detecting unknown objects. Open Set Detection (OSD) (Bendale and Boult 2016) requires identifying and excluding unknown samples, but issues with overconfidence affect accuracy. Open World Object Detection (OWOD) (Joseph et al. 2021) aims to detect both known and unknown objects, yet the absence of labels for unknowns prevents precise evaluation. Recently, Liang et al. (2023) further clarify the UOD evaluation protocol with both the precision and recall of unknown objects as metrics.

Early OSD methods (Bendale and Boult 2016; Liang, Li, and Srikant 2018) focus on distinguishing known and unknown objects. Techniques like maximum softmax probability (Hendrycks and Gimpel 2017), minimum Mahalanobis distance (Denouden et al. 2018), energy scores (Liu et al. 2020), and virtual outliers (Du et al. 2022a, b) have been used. However, these methods primarily enhance known object detection, reducing unknown object recall. In contrast, our approach seeks unbiased detection of unknowns.

Joseph et al. (2021) introduced OWOD with the ORE detector, featuring RPN-based unknown pseudo-labeling and contrast clustering. OW-DETR (Gupta et al. 2022) and others (Yang et al. 2022; Zhao et al. 2023; Wu et al. 2022b; Ma et al. 2023) explored various pseudo-labeling methods. Yet, pseudo-labeling often misclassifies non-objects as unknowns, reducing precision.

Recent efforts (Liang et al. 2023; Zohar, Wang, and Yeung 2023; Wu et al. 2022a) focus on objectness scores without pseudo-labeling, reducing false positives. Wu et al. (2022a) extended ORE with a localization-based objectness head, improving recall. Similarly, Liang et al. (2023) introduced a localization-based GOC score using only known samples for supervision, and a graph-based box decision scheme. On the other hand, Zohar, Wang, and Yeung (2023) introduced a framework for classification-based objectness estimation that alternates between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space, ultimately estimating the objectness of different proposals.

In this paper, we propose a learnable objectness score, IPS, that integrates positional and categorical signals for robust objectness representation, avoiding pseudo-labeling pitfalls.

Figure 2: The overall architecture of UN-DETR

Preliminary

Two-Stage Deformable DETR Pipeline

In the D-DETR pipeline, an input image $\boldsymbol{I}$ is processed by a backbone network to extract features, which are fed into an encoder using attention mechanisms to produce an enhanced feature sequence. In the decoder, $N_{\mathit{query}}$ object queries are updated through self-attention and cross-attention with the encoder’s output, leading to refined queries $q\in R^{D}$. These are then processed by the bounding box regression head $f_{\mathit{bbox}}$ and classification head $f_{\mathit{cls}}$ for final predictions. A one-to-one bipartite matching using $L_{\text{match}}$ ensures alignment with ground-truth (GT) labels for supervision. The two-stage D-DETR additionally leverages region proposals generated by the encoder as initial object queries for further refinement in the decoder. The top-scoring region proposals are determined by applying $f_{\mathit{bbox}}$ and $f_{\mathit{cls}}$ to the encoder’s output feature maps.

Unknown Object Detection

The task of unknown object detection extends conventional object detection frameworks. Following (Du et al. 2022b; Joseph et al. 2021), the problem is formulated as follows. We are given a dataset $D=\{\boldsymbol{I},\boldsymbol{Y}\}$, where the $N$ input images are denoted as $\boldsymbol{I}=\{\mathbf{I}_{1},\ldots,\mathbf{I}_{N}\}$, with corresponding labels $\boldsymbol{Y}=\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{N}\}$. Each $\mathbf{Y}_{i}=\{y_{1},\ldots,y_{k}\}$ $(i\in[1,2,\dots,N])$ contains a set of objects with $y_{k}=[l_{k},b_{k}]$, where $l_{k}$ is the class label for bounding box $b_{k}$ represented by $x_{k},y_{k},w_{k},h_{k}$. The known class set is denoted as $\mathcal{K}=\{1,2,\ldots,C\}$, and the unknown class is denoted as $\mathcal{U}=\{C+1\}$.

The model is trained on data labeled only with known-class objects $\{(\mathbf{I}_{n},\mathbf{Y}_{n})|l_{k}\in\mathcal{K}\}_{n=1}^{N_{\text{train}}}$, but tested on data including both known and unknown objects $\{(\mathbf{I}_{n},\mathbf{Y}_{n})|l_{k}\in\mathcal{K}\cup\mathcal{U}\}_{n=1}^{N_{\text{test}}}$, where $N_{\text{train}}$ is the number of images in the training set, $N_{\text{test}}$ is the number in the test set, and $N=N_{\text{test}}+N_{\text{train}}$.

UN-DETR

Overall Architecture

The architecture of UN-DETR is depicted in Figure 2. The processing pipeline is as follows: An image $\boldsymbol{I}$ of dimensions $H\times W\times C$ is fed into the backbone to extract features $\boldsymbol{F}$. These features are then processed by the encoder to produce feature memory $\boldsymbol{M}$. The top K initial object queries are refined and filtered using $\boldsymbol{M}$, fed into the regression head, and a linear layer to produce $M$ object queries $\boldsymbol{Q}$. Subsequently, $N$ decoder layers transform $\boldsymbol{Q}$ into query embeddings $\boldsymbol{E}$, capturing the necessary spatial and semantic information for accurate Unknown Object Detection (UOD). These embeddings $\boldsymbol{E}$ are processed through three branches (classification, regression, and IPP) to predict potential instances. The predicted bounding boxes $\boldsymbol{B}$ undergo post-processing guided by IPS to yield the final prediction $\boldsymbol{P}$ of object instances. Prior to end-to-end training, the UN-DETR model undergoes unsupervised pretraining to establish objectness priors.

Instance Presence Score Predictor

Instance Presence Score

As discussed in the Introduction, extracting representations from complementary positional and categorical latent spaces favors objectness learning. Disregarding the class-agnostic categorical latent space, UnSniffer (Liang et al. 2023) relies solely on position prediction (Intersection over Union, IoU, between the predictions and ground truth) and may therefore misclassify prediction boxes that have high IoU scores but contain no instances as foreground. The class-agnostic positional latent space, neglected by PROB (Zohar, Wang, and Yeung 2023), directly impacts the detector’s ability to locate potential objects. PROB only integrates the objectness head within the classification head and does not adjust the regression head, which impairs the localization accuracy of unknown objects and leads to partial or oversized prediction boxes.

Furthermore, to validate the above conjecture, we extract representations from either the positional or the categorical latent space as the objectness score and visualize the discriminability score of feature maps during inference, as shown in Figure 3. When only the categorical latent space is considered, as shown in Figure 3(b), the model exhibits a higher discriminability score between instances and the background but poor distinction among instances themselves. Conversely, when only the positional latent space representations are used, as illustrated in Figure 3(c), the model demonstrates greater distinctiveness among different instances but less between instances and the background. These experiments suggest that the two latent spaces are complementary in learning objectness.

In the object detection task, predictions require locating objects (regression) and identifying them (classification). This highlights the need to integrate both positional and categorical latent spaces, suggesting that their combined use in the UOD task is more effective than treating them separately. Therefore, we formulate IPS by leveraging attributes of objectness from both spaces, enhancing the use of knowledge learned from known categories and improving generalization to unknown objects. This approach also enhances the distinction between foreground and background, increasing robustness in diverse real-world environments, such as those with varying appearance and scale, which aligns with the primary goal of the UOD task.

Similarly, to validate the discriminability of IPS for different instances, the distinction maps are visualized in Figure 3(d). When the representations from both latent spaces are fully utilized, instances are clearly distinguished from one another as well as from the background. The comparison demonstrates the effectiveness of IPS.

Jointly Supervised IPP

To accurately estimate the IPS, we design a specialized IPP alongside the classification head and regression head. Specifically, the IPP is a simple single-layer feed-forward neural network that takes the query embedding $e_{i}\in\boldsymbol{E}$ as input and computes the corresponding IPS $I(e_{i})$. For IPP training, we propose a jointly supervised strategy. First, the query embedding $e_{i}$ is fed into the two heads $f_{\mathit{cls}}(e_{i})$, $f_{\mathit{bbox}}(e_{i})$ to obtain lower-dimensional representations $e_{cls},e_{bbox}$ of $e_{i}$, which carry categorical and positional information in the two latent spaces $S_{cls},S_{bbox}$, respectively. To remove the class-related components in $e_{cls},e_{bbox}$, we extract the embeddings composed of components representing generic objectness, $e_{cls}^{o},e_{bbox}^{o}$, from each of the two representations. For $e_{bbox}^{o}$, we transform the representation into a bounding box $\hat{b}_{i}$ and compute its generalized IoU (GIoU) with the matched GT $b_{i}$. GIoU is a metric based on spatial overlap that is independent of categories and thus performs better when confronted with categories unseen in the training set, and we formalize $e_{bbox}^{o}$ in terms of GIoU as follows:

$$\hat{e}_{bbox,\sigma^{pos}}^{o}=\text{GIoU}(b_{i},\hat{b}_{\sigma^{pos}}) \tag{1}$$

where $\sigma^{pos}$ is the index of the positive sample queries used to train IPP, as explained in the next subsection. For $e_{cls}^{o}$, we represent the generalized objectness by the sum of the $\mathcal{K}$ (number of known categories) logits, since the $\mathcal{K}$ logits represent the confidence of all the categories appearing in the training set, and the sum represents the overall probability of the foreground. This avoids the model’s tendency to favor a particular category or categories, and reflects the robustness of the model to different categories of objects in various environments. So $e_{cls}^{o}$ is formalized as follows:

$$\hat{e}_{cls,\sigma^{pos}}^{o}=\hat{P}_{f}{}_{\sigma^{pos}} \tag{2}$$

where $P_{f}$ is the sum of the $\mathcal{K}$ logits. To capitalize on the complementarity of $e_{bbox}^{o},e_{cls}^{o}$, we set the objective probability $P_{o}(e_{i})=\alpha\cdot e_{bbox}^{o}+\beta\cdot e_{cls}^{o}$, which serves as the supervised signal for IPP training. Therefore, the loss function of IPP is as follows:

$$L_{\text{IPS}}^{\text{H}}=\ell_{1}(P_{o},I(e_{i})_{\sigma^{pos}}),\quad\text{if }\text{GIoU}(b_{i},\hat{b}_{\sigma^{pos}})>\tau \tag{3}$$
$$L_{\text{IPS}}^{\text{L}}=\ell_{1}(C,I(e_{i})_{\sigma^{pos}}),\quad\text{if }\text{GIoU}(b_{i},\hat{b}_{\sigma^{pos}})\leq\tau \tag{4}$$

where $\ell_{1}$ denotes the L1 loss, $C$ is the objective constant and $\tau$ is the GIoU threshold. Here we introduce $C$ to increase the distinctiveness of IPS learning and maintain the stability of IPP training. The total IPS loss can be represented as:

$$L_{\text{IPS}}=L_{\text{IPS}}^{\text{H}}+L_{\text{IPS}}^{\text{L}} \tag{5}$$
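Equations (1) to (5) can be traced with a small numerical sketch. This is an illustration under stated assumptions, not the authors' implementation: boxes are given as corner coordinates rather than $(x, y, w, h)$, the per-class scores stand in for the $\mathcal{K}$ logits, and the values of `alpha`, `beta`, `tau`, and `c` are placeholders, since the text above does not state them.

```python
def giou(box_a, box_b):
    """Generalized IoU of two non-degenerate boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def ips_target(gt_box, pred_box, known_scores, alpha, beta, tau, c):
    """Supervision target for one positive-sample query."""
    overlap = giou(gt_box, pred_box)      # positional cue, Eq. (1)
    foreground = sum(known_scores)        # categorical cue P_f, Eq. (2)
    if overlap > tau:
        return alpha * overlap + beta * foreground   # P_o, target in Eq. (3)
    return c                                          # constant target in Eq. (4)


def ips_loss(queries, alpha=0.5, beta=0.5, tau=0.5, c=0.0):
    """Sum of the L1 terms over all positive-sample queries, as in Eq. (5).

    The default hyperparameters are placeholders chosen for this example only.
    """
    total = 0.0
    for gt_box, pred_box, known_scores, ips_pred in queries:
        target = ips_target(gt_box, pred_box, known_scores, alpha, beta, tau, c)
        total += abs(target - ips_pred)
    return total


# Each entry: (GT box, predicted box, per-class scores, IPS predicted by the IPP).
queries = [
    ((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0), [0.6, 0.2], 0.8),   # well localized
    ((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0), [0.1, 0.1], 0.25),  # poorly localized
]
loss = ips_loss(queries)
```

The first query overlaps its GT perfectly, so its target mixes the positional and categorical cues; the second falls below the threshold and is pulled toward the constant instead.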

Finally, the entire loss of UN-DETR can be represented as:

$$L_{\text{UN-DETR}}=\lambda_{1}\cdot L_{\text{IPS}}+\lambda_{2}\cdot L_{\text{cls}}+\lambda_{3}\cdot L_{\text{bbox}} \tag{6}$$

where $L_{\text{cls}}$ and $L_{\text{bbox}}$ are consistent with the classification and regression losses of D-DETR, and $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the loss weights, set to 3, 2, and 5, respectively.
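As a quick arithmetic check of Equation (6), the weighted sum can be written out directly. The weights come from the text; the three loss values are made-up numbers used only to show how the terms combine, and the function name is hypothetical.

```python
# Loss weights as stated in the text: lambda_1 (IPS), lambda_2 (cls), lambda_3 (bbox).
LAMBDA_IPS, LAMBDA_CLS, LAMBDA_BBOX = 3.0, 2.0, 5.0


def un_detr_loss(l_ips, l_cls, l_bbox):
    """Weighted sum of the three loss terms in Eq. (6)."""
    return LAMBDA_IPS * l_ips + LAMBDA_CLS * l_cls + LAMBDA_BBOX * l_bbox


# Made-up loss values: 3 * 0.35 + 2 * 1.2 + 5 * 0.4 = 1.05 + 2.4 + 2.0
total = un_detr_loss(l_ips=0.35, l_cls=1.2, l_bbox=0.4)
```

With these weights the regression term dominates whenever the three raw losses are of similar size.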

Figure 3: Visualizations of discriminability scores

One-to-Many Assignment

In the Preliminary, we note that D-DETR employs one-to-one assignment to associate GT with potential objects. However, previous research has highlighted several issues with this approach, such as inefficient training. As a result, various one-to-many assignment methods have been proposed. These methods primarily aim to enhance convergence speed and training stability. Nonetheless, in UOD, beyond the aforementioned challenges, the most significant challenge is the invisibility of labels for unknown objects during training. Consequently, the joint supervision process of IPP must rely solely on features from the positional and categorical latent spaces. This reliance introduces potential uncertainty in the supervisory information, hindering the model parameters from iterating towards a more optimal solution, thereby compromising training stability and adversely affecting model performance.

One-to-many assignment allows for more flexible use of all supervision even when some of it is noisy or uncertain. By allowing multiple predictions to capture the same GT, the model aggregates information from these different matches, learning more stable and comprehensive object features.

Therefore, we propose a simple one-to-many assignment strategy to provide more positive samples for jointly supervised IPP training. Specifically, in addition to the set of best-matching queries, we introduce one set of sub-optimal queries, since they also match known instances with high probability and are easily obtained by bipartite matching. The index for sub-optimal queries can be formalized as:

$$\hat{\sigma}=\arg\min_{\sigma\in\mathcal{S_{N}}/\hat{\sigma^{*}}}\sum_{i}L_{\text{match}}(y_{i},\hat{y}_{\sigma(i)}) \tag{7}$$

where σ^^𝜎\hat{\sigma}over^ start_ARG italic_σ end_ARG is the index of the sub-optimal matching queries, which has the same length as the index σ^^superscript𝜎\hat{\sigma^{*}}over^ start_ARG italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_ARG of the best-matching queries. Then, the index of all positive sample queries for IPP training can be represented as:

σpos^=σ^σ^^superscript𝜎𝑝𝑜𝑠^𝜎^superscript𝜎\hat{\sigma^{pos}}=\hat{\sigma}\cup\hat{\sigma^{*}}over^ start_ARG italic_σ start_POSTSUPERSCRIPT italic_p italic_o italic_s end_POSTSUPERSCRIPT end_ARG = over^ start_ARG italic_σ end_ARG ∪ over^ start_ARG italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_ARG (8)

Note that we only use the sub-optimal queries obtained from the one-to-many assignment for the jointly supervised IPP, and not for the classification and regression head. This is because the supervision for the regression head and classification head is derived directly from the labels and is inherently accurate. Introducing suboptimal supervision may therefore negatively impact their training.
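As an illustration of Eqs. 7 and 8, the following is a minimal sketch, not the authors' implementation. It assumes that the sub-optimal set is obtained by re-running the matching after the best-matching queries are removed from the pool, and it uses a brute-force search in place of the Hungarian algorithm so that it stays dependency-free. The function names and the toy cost matrix are hypothetical.

```python
from itertools import permutations


def min_cost_match(cost, candidates):
    """Assign each GT (row of `cost`) to a distinct query drawn from `candidates`.

    Brute force over permutations: adequate for a toy example, whereas
    DETR-style detectors solve the same objective with the Hungarian algorithm.
    """
    n_gt = len(cost)
    return list(min(
        permutations(candidates, n_gt),
        key=lambda perm: sum(cost[i][q] for i, q in enumerate(perm)),
    ))


def one_to_two_assignment(cost):
    """Return (best, sub_optimal, positives) query indices for IPP training."""
    all_queries = range(len(cost[0]))
    best = min_cost_match(cost, all_queries)            # best-matching set
    # Assumption: the sub-optimal set is found by re-matching after the
    # best-matching queries have been excluded (our reading of Eq. 7).
    remaining = [q for q in all_queries if q not in best]
    sub_optimal = min_cost_match(cost, remaining)       # sub-optimal set
    positives = sorted(set(best) | set(sub_optimal))    # union, as in Eq. 8
    return best, sub_optimal, positives


# Toy matching costs: 2 ground-truth boxes (rows) x 5 queries (columns).
cost = [
    [0.1, 0.9, 0.3, 0.8, 0.7],
    [0.8, 0.2, 0.9, 0.4, 0.6],
]
best, sub_optimal, positives = one_to_two_assignment(cost)
print(best, sub_optimal, positives)
```

In this toy case, queries 0 and 1 form the best match and queries 2 and 3 form the sub-optimal match. Only the union would supervise the IPP, while the classification and regression heads would still use the best match alone.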

Unbiased Query Selection

To effectively select object queries relevant to the current input, the two-stage D-DETR employs an additional regression head along with a classification head to refine and filter appropriate proposals, initializing them as object queries. Specifically, this newly introduced classification head is trained using category labels represented as a tensor of all zeros, allowing the first dimension to indicate the probability that the input is a likely object. Consequently, during prediction, the top K proposals are selected based on the first dimension of the category prediction. This straightforward approach, however, disregards information from other dimensions predicted by the classification head, introducing a bias that causes this head to favor larger outputs in the first dimension. This bias leads to sparse gradient updates, as other classes contribute minimally, ultimately affecting the convergence and accuracy of this head, which is crucial to the overall detection task.

The additional classification head introduced in two-stage D-DETR essentially treats query selection as an instance recognition task, similar to the target of IPP. To address the bias mentioned earlier, we follow this approach and introduce an additional IPP, naming our method Unbiased Query Selection (UQS). In UQS, we replace the original classification head with an additional IPP to filter proposals. The supervision information of IPP reflects its class-agnostic nature, as its predictions denote the probability that the query represents a foreground object, independent of any category-specific priors. This lack of reliance on class-specific information eliminates potential bias, allowing IPP to focus solely on the objectness of the query itself, rather than being influenced by class-based assumptions.

Figure 4: Visualization of classification scores and IPS for encoder features

To analyze the effectiveness of UQS, we visualize the classification scores and IPS of the encoder features in Figure 4. In the scatterplot, blue and green dots represent encoder features from models trained with Unbiased Query Selection and vanilla query selection, respectively. Dots closer to the top right indicate higher-quality features, meaning a greater likelihood of representing foreground objects. Notably, the top right corner has more blue dots than green, indicating that queries filtered by UQS are of higher quality and more likely to contain instances. Introducing the extra IPP, which incorporates positional information, allows for verifying the spatial accuracy of queries, reducing false detections. This mitigates the class-specific bias, improving the quality of object queries forwarded to the decoder. Additionally, integrating bias elimination into the loss function further optimizes gradients, enhancing training stability.

IPS-Guided Post Process

In D-DETR, the top-K bounding boxes are directly output as detection results. However, in UOD, unknown objects may outnumber known ones. A fixed K constrains the recall rate and is unreliable to set manually. Additionally, the original D-DETR post-processing doesn’t work due to the one-to-many assignment in IPP training and the need to further differentiate unknown from known categories. To solve this, we propose the IPS-Guided Post Process, which includes IPS-Guided Non-Maximum Suppression (NMS) to remove redundant proposals and a dual-criteria unknown distinguish protocol.

IPS-Guided NMS

Traditional NMS methods (Neubeck and Van Gool 2006) rank proposals based on classification confidence. However, when labels for unknown objects are missing, these methods may fail and even discard well-predicted boxes. Furthermore, Jiang et al. (2018) highlight the inconsistency between localization and classification information in traditional NMS, where boxes with accurate localization may have low scores, or highly scored boxes may have poor localization. To eliminate as much of the background as possible and address this inconsistency, we propose IPS-Guided NMS. We rank all boxes by IPS and, to account for the distance between the center points of bounding boxes, measure overlap with Distance IoU (DIoU) (Zheng et al. 2020).
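The following minimal sketch shows one way the ranking-by-IPS and DIoU overlap test could be combined; it is not the authors' implementation. DIoU is computed in its standard form (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box). The box format `(x1, y1, x2, y2)`, the suppression threshold of 0.5, and the function names are assumptions made for illustration.

```python
def diou(a, b):
    """Distance IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def ips_guided_nms(boxes, ips, thresh=0.5):
    """Keep boxes in descending IPS order, dropping those whose DIoU with an
    already kept box exceeds `thresh` (hypothetical value)."""
    order = sorted(range(len(boxes)), key=lambda i: ips[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
ips = [0.9, 0.8, 0.7]
print(ips_guided_nms(boxes, ips))
```

Here the second box overlaps heavily with the first and has a lower IPS, so it is suppressed, while the distant third box is kept regardless of its classification confidence.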

Dual-Criteria Unknown Distinguish Protocol

Having identified all bounding boxes containing foreground objects, the next step is to classify the remaining proposals into known and unknown categories, after removing any redundant bounding boxes. Although there is no dedicated classification head for unknowns, utilizing IPS and classification confidence together still distinguishes between known and unknown. If both classification confidence and IPS are above set thresholds, the object is assigned to a known category. If classification confidence is low but IPS is high, the object is recognized but not confidently categorized, hence it’s classified as unknown.
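The two rules above can be sketched as a small decision function. This is an illustrative sketch only: the threshold values and the function name are hypothetical, and the handling of boxes with low IPS (treated here as background) is an assumption, since the text specifies only the two high-IPS cases.

```python
def distinguish(cls_conf, ips, cls_thresh=0.5, ips_thresh=0.5):
    """Label a surviving box using classification confidence and IPS.

    Thresholds are hypothetical placeholders, not values from the paper.
    """
    if ips <= ips_thresh:
        # Assumption: low objectness is treated as background.
        return "background"
    if cls_conf > cls_thresh:
        return "known"      # confident category and high IPS
    return "unknown"        # an object is present but not confidently categorized


print(distinguish(0.9, 0.8), distinguish(0.2, 0.8), distinguish(0.9, 0.1))
```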

Unsupervised Pretraining with Objectness Priors

Self-supervised representation learning can reduce the amount of labeled data required by the model and improve its representation capability. We aim to improve the performance of UN-DETR with such techniques. To this end, following DETReg (Bar et al. 2022), we pretrain the entire UN-DETR in an unsupervised manner to obtain objectness priors for both localization and classification. Specifically, we utilize an unsupervised region proposal generator, Selective Search (Uijlings et al. 2013), to match object localization boxes. Moreover, we adopt a self-supervised image encoder, SwAV (Caron et al. 2020), to align the object embeddings used for classification. Note that to avoid introducing additional data, we only use the training set for unsupervised pretraining.

Experiment

Following the UOD Benchmark, we utilize COCO-OOD, COCO-Mixed (Liang et al. 2023), and VOC (Everingham et al. 2010) as test sets and employ mAP, U-AP, U-F1, U-PRE, and U-REC as evaluation metrics, as detailed in the Appendix (https://github.com/ndwxhmzz/UN-DETR).

Implementation Details

In training, we use ResNet50 as the UN-DETR backbone. Moreover, the entire UN-DETR is pretrained on the VOC training set in an unsupervised manner (Bar et al. 2022). We introduce only one additional set of suboptimal queries for joint supervision of IPP training. The weight parameters $\alpha$ and $\beta$ are empirically set to 0.6 and 0.4, respectively. In Eq. 4, $C$ is set to 0.5 and $\tau$ is set to 0.6.

| Methods | mAP | U-AP | U-F1 | U-PRE | U-REC |
|---|---|---|---|---|---|
| MSP | 47.0 | 21.3 | 31.4 | 27.9 | 35.9 |
| Mahalanobis | 44.7 | 12.9 | 27.1 | 30.9 | 24.1 |
| Energy score | 47.4 | 21.3 | 30.8 | 26.0 | 37.7 |
| VOS | 46.9 | 20.5 | 31.7 | 29.1 | 34.8 |
| ORE | 24.3 | 21.4 | 25.5 | 16.3 | 78.2 |
| OW-DETR | 42.0 | 3.3 | 5.6 | 3.0 | 38.0 |
| PROB | 36.0 | 4.3 | 17.5 | 11.7 | 35.2 |
| UnSniffer | 46.4 | 45.4 | 47.9 | 43.3 | 53.5 |
| Ours | 47.2 | 47.0 | 54.9 | 54.5 | 55.3 |

Table 1: Comparisons with other methods on the VOC-test and COCO-OOD datasets. The mAP is based on VOC-test, while the other metrics are from COCO-OOD. The best results are in bold, second best are underlined.

Results

Quantitative Analysis. Tables 1 and 2 present the results of our method, UN-DETR, alongside 8 classic or recent state-of-the-art methods on the UOD Benchmark. Notably, on the COCO-OOD dataset, UN-DETR outperforms the other methods on every metric except U-REC. In particular, for U-F1 and U-PRE, our method surpasses the second-best result by 7.0 and 11.2 percentage points, respectively. ORE achieves a higher U-REC than our method, but it also recalls many non-objects. This is evident from the fact that our method’s U-AP, U-F1, and U-PRE are all more than twice as high as ORE’s. On the COCO-Mixed dataset, our method maintains a lead in U-AP and U-PRE, exceeding the second-best result by 4.1 and 0.5 percentage points, respectively. These results demonstrate that UN-DETR surpasses the previously leading method in unknown object detection, which we attribute to our proposed jointly supervised IPP training. In addition, experimental results on both the VOC-test and COCO-Mixed datasets show that UN-DETR performs comparably to existing methods on known detections, which demonstrates that it does not improve unknown detection at the expense of known detection.

In general, our method outperforms other OSD methods because it is designed to detect, rather than exclude, all unknown objects. Compared to pseudo-label-based OWOD detectors, our UN-DETR introduces only one additional set of query samples through the one-to-many assignment strategy, which reduces the interference of negative samples and thus yields a larger lead on U-PRE. Most importantly, benefiting from jointly supervised IPP training from both positional and categorical latent spaces, our approach exceeds other objectness-based approaches, as will be further demonstrated in the ablation study of Sec. 5.3.

| Methods | mAP | U-AP | U-F1 | U-PRE | U-REC |
|---|---|---|---|---|---|
| MSP | 36.4 | 5.5 | 16.9 | 19.0 | 15.3 |
| Mahalanobis | 35.1 | 5.1 | 14.9 | 20.7 | 11.6 |
| Energy score | 36.4 | 4.9 | 16.9 | 16.7 | 17.1 |
| VOS | 36.4 | 5.1 | 17.2 | 18.4 | 16.3 |
| ORE | 21.3 | 14.0 | 17.5 | 10.3 | 59.2 |
| OW-DETR | 41.4 | 0.7 | 2.5 | 1.4 | 16.1 |
| PROB | 40.1 | 9.4 | 26.2 | 17.0 | 56.7 |
| UnSniffer* | 35.9 | 14.8 | 26.7 | 19.3 | 40.9 |
| Ours | 34.0 | 18.9 | 24.7 | 19.8 | 32.8 |

Table 2: Comparisons with other methods on the COCO-Mixed dataset. * denotes our replicated results.

Qualitative Analysis. Figure 5 visualizes the results of various methods. Our UN-DETR outperforms the other methods in both localizing and identifying unknown objects. On the one hand, UnSniffer, which ignores category information, misidentifies some unknown objects, such as the elephant in the second row, and misses some obvious unknowns, such as the surfboard in the third row. On the other hand, PROB, which neglects positional information, has difficulty accurately localizing object boundaries, such as the water cup in the first row. In contrast, our method accurately detects the largest number of unknown objects, because it recouples generic objectness from both the positional and categorical latent spaces. In addition, our method accurately distinguishes between known and unknown objects and excludes redundant boxes, which is attributed to our proposed IPS-guided post-processing. More visualization results are presented in the Appendix.

Figure 5: Example results on COCO-OOD (first two rows) and COCO-Mixed (last two rows) datasets. Detections are overlaid on known (yellow) and unknown (blue) objects.

Ablation Study

To examine the contribution of each component in our method, we conduct extensive ablation experiments, as presented in Table 3. For IPP training, supervision from only the categorical or only the positional latent space significantly degrades the performance of the model (rows 1 and 2). If only the classification information is considered, the detector may miss some unknown objects, which dramatically reduces recall. If only the regression information is considered, the detector may confuse unknowns with knowns and recall non-objects, thus impairing precision. Our UN-DETR balances the two to obtain excellent precision and recall simultaneously.

In the one-to-many assignment strategy, we introduce only one additional set of samples. This outperforms the original one-to-one matching (row 3) and avoids the performance drop caused by introducing more sets, which may add negative sample noise (row 4).

For UQS, we train an additional IPP to replace the original classification head for query filtering, and the experimental results show that IPP outperforms not only the original classification head (row 5), but also the IPP supervised only with regression information (row 6). This proves the superiority of our joint supervision and the effectiveness of UQS.

To demonstrate the effectiveness of our proposed post-processing, we replace our IPS-Guided NMS with the original NMS (row 7), and the experimental results show that the U-AP, U-F1, and U-PRE of UN-DETR all decrease. When unsupervised pretraining is not used (row 8), the U-PRE of UN-DETR decreases significantly. This is because pretraining provides the model with an objectness prior, giving it an initial class-agnostic perceptual ability, which is crucial for UOD.

| Row | Component | U-AP | U-F1 | U-PRE | U-REC |
|---|---|---|---|---|---|
| 1 | IPP only cls | 17.2 | 22.1 | 57.3 | 13.7 |
| 2 | IPP only reg | 40.9 | 23.9 | 14.7 | 63.4 |
| 3 | one-to-one | 46.0 | 51.3 | 51.4 | 51.1 |
| 4 | one-to-three | 46.0 | 53.7 | 53.5 | 53.9 |
| 5 | UQS origin cls | 39.7 | 51.1 | 56.7 | 46.4 |
| 6 | UQS only reg | 45.9 | 50.9 | 48.4 | 53.7 |
| 7 | w/o IPS-NMS | 46.7 | 52.3 | 50.1 | 56.8 |
| 8 | w/o Unsupervised | 45.8 | 44.5 | 35.0 | 60.1 |
| 9 | All | 47.0 | 54.9 | 54.5 | 55.3 |

Table 3: Ablation studies on COCO-OOD.

Conclusion

We propose a novel transformer-based UOD method, UN-DETR, that outperforms existing state-of-the-art methods. We investigate the deficiencies of current methods in exploiting complementary classification and regression predictions, which lead to unstable objectness learning. Therefore, the core insight of our approach is jointly supervised learning of the objectness score, IPS, from both positional and categorical latent spaces. Then, we propose a one-to-many assignment strategy to provide more positive samples for IPS learning. Furthermore, IPS is employed for query selection and post-processing in UN-DETR because it encodes class-agnostic categorization and localization information. Finally, we pretrain the entire UN-DETR in an unsupervised manner to obtain the objectness prior. We hope that our work will inspire further research in UOD within the community.

Acknowledgments

This work was supported by the Joint Funds of the National Natural Science Foundation of China (No. U2441206) and the National Natural Science Foundation of China (No. 62072042).

References

  • Bar et al. (2022) Bar, A.; Wang, X.; Kantorov, V.; Reed, C. J.; Herzig, R.; Chechik, G.; Rohrbach, A.; Darrell, T.; and Globerson, A. 2022. Detreg: Unsupervised pretraining with region priors for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14605–14615.
  • Bendale and Boult (2016) Bendale, A.; and Boult, T. E. 2016. Towards Open Set Deep Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1563–1572.
  • Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
  • Caron et al. (2020) Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33: 9912–9924.
  • Chen et al. (2023) Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; and Wang, J. 2023. Group detr: Fast detr training with group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6633–6642.
  • Denouden et al. (2018) Denouden, T.; Salay, R.; Czarnecki, K.; Abdelzad, V.; Phan, B.; and Vernekar, S. 2018. Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance. arXiv preprint arXiv:1812.02765.
  • Du et al. (2022a) Du, X.; Wang, X.; Gozum, G.; and Li, Y. 2022a. Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13678–13688.
  • Du et al. (2022b) Du, X.; Wang, Z.; Cai, M.; and Li, Y. 2022b. Vos: Learning what you don’t know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197.
  • Everingham et al. (2010) Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision, 88: 303–338.
  • Gupta et al. (2022) Gupta, A.; Narayan, S.; Joseph, K.; Khan, S.; Khan, F. S.; and Shah, M. 2022. Ow-detr: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9235–9244.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hendrycks and Gimpel (2017) Hendrycks, D.; and Gimpel, K. 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations.
  • Jiang et al. (2018) Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; and Jiang, Y. 2018. Acquisition of localization confidence for accurate object detection. In Proceedings of the European conference on computer vision (ECCV), 784–799.
  • Joseph et al. (2021) Joseph, K.; Khan, S.; Khan, F. S.; and Balasubramanian, V. N. 2021. Towards open world object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5830–5840.
  • Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
  • Li et al. (2022a) Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L. M.; and Zhang, L. 2022a. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13619–13627.
  • Li et al. (2022b) Li, L. H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N.; Chang, K.-W.; and Gao, J. 2022b. Grounded Language-Image Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10965–10975.
  • Liang, Li, and Srikant (2018) Liang, S.; Li, Y.; and Srikant, R. 2018. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations.
  • Liang et al. (2023) Liang, W.; Xue, F.; Liu, Y.; Zhong, G.; and Ming, A. 2023. Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3230–3239.
  • Liu et al. (2023) Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; Zhu, J.; and Zhang, L. 2023. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499.
  • Liu et al. (2020) Liu, W.; Wang, X.; Owens, J.; and Li, Y. 2020. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33: 21464–21475.
  • Ma et al. (2023) Ma, S.; Wang, Y.; Wei, Y.; Fan, J.; Li, T. H.; Liu, H.; and Lv, F. 2023. Cat: Localization and identification cascade detection transformer for open-world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19681–19690.
  • Neubeck and Van Gool (2006) Neubeck, A.; and Van Gool, L. 2006. Efficient non-maximum suppression. In 18th international conference on pattern recognition (ICPR’06), volume 3, 850–855. IEEE.
  • Redmon et al. (2016) Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788.
  • Ren et al. (2015) Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
  • Uijlings et al. (2013) Uijlings, J. R.; Van De Sande, K. E.; Gevers, T.; and Smeulders, A. W. 2013. Selective search for object recognition. International journal of computer vision, 104: 154–171.
  • Vaswani (2017) Vaswani, A. 2017. Attention is All You Need. arXiv preprint arXiv:1706.03762.
  • Wu et al. (2022a) Wu, Y.; Zhao, X.; Ma, Y.; Wang, D.; and Liu, X. 2022a. Two-branch objectness-centric open world detection. In Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, 35–40.
  • Wu et al. (2022b) Wu, Z.; Lu, Y.; Chen, X.; Wu, Z.; Kang, L.; and Yu, J. 2022b. UC-OWOD: Unknown-classified open world object detection. In European Conference on Computer Vision, 193–210. Springer.
  • Yang et al. (2022) Yang, S.; Sun, P.; Jiang, Y.; Xia, X.; Zhang, R.; Yuan, Z.; Wang, C.; Luo, P.; and Xu, M. 2022. Objects in Semantic Topology. In International Conference on Learning Representations.
  • Zhao et al. (2023) Zhao, X.; Ma, Y.; Wang, D.; Shen, Y.; Qiao, Y.; and Liu, X. 2023. Revisiting open world object detection. IEEE Transactions on Circuits and Systems for Video Technology.
  • Zheng et al. (2022) Zheng, J.; Li, W.; Hong, J.; Petersson, L.; and Barnes, N. 2022. Towards open-set object detection and discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961–3970.
  • Zheng et al. (2020) Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; and Ren, D. 2020. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 12993–13000.
  • Zhu et al. (2020) Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
  • Zohar, Wang, and Yeung (2023) Zohar, O.; Wang, K.-C.; and Yeung, S. 2023. Prob: Probabilistic objectness for open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11444–11453.
  • Zong, Song, and Liu (2023) Zong, Z.; Song, G.; and Liu, Y. 2023. Detrs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF international conference on computer vision, 6748–6758.

Appendix A Appendix

This appendix is organized as follows: 1) More details of the method; 2) Benchmarks for UOD; 3) Experimental results, especially more visualizations; 4) Comparison with promptable approaches; 5) Limitations and future work.

Appendix B UN-DETR Details

In this section, we show more details of the method. In UQS:

$$I=f_{ipp}(Q) \tag{9}$$
$$Q_{\text{UQS}}=\text{TopK}(I,K) \tag{10}$$

where $Q$ denotes the output from the encoder and $K$ denotes the number of queries.
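As a minimal sketch of the selection step in Eqs. 9 and 10 (not the authors' code), the snippet below ranks encoder proposals by their IPS and keeps the indices of the top K. The function name `select_queries` and the toy scores are hypothetical; in the actual model the scores would come from the additional IPP applied to the encoder output.

```python
def select_queries(ips, k):
    """Return the indices of the k proposals with the highest IPS."""
    order = sorted(range(len(ips)), key=lambda i: ips[i], reverse=True)
    return order[:k]


# Toy IPS values for five encoder proposals.
ips = [0.12, 0.85, 0.40, 0.91, 0.05]
print(select_queries(ips, k=2))
```

Because the ranking uses a single class-agnostic score, no category dimension is privileged during selection, which is the property UQS relies on.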

In one-to-many assignment:

$$Cost=\lambda_{1}\cdot Cost_{cls}+\lambda_{2}\cdot Cost_{box} \tag{11}$$
$$Cost_{cls}(i,j)=-p_{cls}^{i}(c_{j}) \tag{12}$$
$$Cost_{box}(i,j)=1-IoU(\mathbf{b}^{i},\mathbf{b}_{gt}^{j}) \tag{13}$$

where $p_{cls}^{i}(c_{j})$ denotes the probability that the $i$-th prediction belongs to GT class $c_{j}$.
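The matching cost of Eqs. 11 to 13 can be sketched as follows. This is an illustrative example under stated assumptions: boxes are given as `(x1, y1, x2, y2)`, the weights default to 1.0 because the text does not state their values, and the function names are hypothetical. The matrix is stored with one row per GT and one column per prediction.

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matching_cost(pred_probs, pred_boxes, gt_labels, gt_boxes,
                  lam_cls=1.0, lam_box=1.0):
    """Cost matrix with rows indexed by GT j and columns by prediction i.

    lam_cls and lam_box stand in for the two weights of Eq. 11; their
    default values here are placeholders, not values from the paper.
    """
    return [
        [lam_cls * (-probs[c_j]) + lam_box * (1.0 - iou(box, b_gt))
         for probs, box in zip(pred_probs, pred_boxes)]
        for c_j, b_gt in zip(gt_labels, gt_boxes)
    ]


# Two predictions over two classes, and one GT of class 1.
pred_probs = [[0.1, 0.9], [0.8, 0.2]]
pred_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
cost_matrix = matching_cost(pred_probs, pred_boxes, [1], [(0, 0, 10, 10)])
print(cost_matrix)
```

The first prediction agrees with the GT in both class and position, so its cost is the lowest and it would be chosen by the matcher.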

Appendix C Unknown Object Detection Benchmark

Datasets. Following Joseph et al. (2021) and Gupta et al. (2022), we use the training and validation sets of the Pascal VOC dataset (Everingham et al. 2010) as the training data, which contain annotations for 20 known object categories. For testing, we use the following three datasets:

  • The test set of the PASCAL VOC dataset (Everingham et al. 2010) is used to evaluate the accuracy of known object detection.

  • COCO-OOD dataset (Liang et al. 2023) contains only unknown categories and consists of 504 images annotated with 1655 unknown objects. It serves for evaluating UOD.

  • COCO-Mixed dataset (Liang et al. 2023) comprises 897 images annotated with both known and unknown categories. It encompasses 2533 unknown objects and 2658 known objects. The inclusion of known objects in this dataset heightens the challenge of UOD. We evaluate the detection accuracy of known and unknown objects in this more challenging dataset.

Figure 6: Convergence curves for different assignment strategies

Evaluation Metrics. To evaluate the performance of UOD, let $TP_{u}$ denote the true positive proposals of unknown classes, $FN_{u}$ the false negative proposals, and $FP_{u}$ the false positive proposals. We employ four unknown metrics as follows:

  • The definition of the Unknown Average Precision (U-AP) is consistent with the known object AP commonly used in conventional object detection (Everingham et al. 2010).

  • The Recall Rate of Unknown (U-REC) is defined as:

    $$U\text{-}REC=\frac{TP_{u}}{TP_{u}+FN_{u}} \tag{14}$$
  • The Precision Rate of Unknown (U-PRE) is defined as:

    $$U\text{-}PRE=\frac{TP_{u}}{TP_{u}+FP_{u}} \tag{15}$$
  • For a comprehensive comparison, we report the Unknown F1-Score defined as the harmonic mean of U-PRE and U-REC:

    $$U\text{-}F1=\frac{2\times U\text{-}PRE\times U\text{-}REC}{U\text{-}PRE+U\text{-}REC} \tag{16}$$

In addition, mAP is used to evaluate known object detection. Note that we measure mAP over different IoU thresholds from 0.5 to 0.95. Other metrics are assessed at the IoU threshold of 0.5.
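The three count-based metrics can be sketched in a few lines. The function names and the toy counts are hypothetical. As a consistency check, the snippet also recomputes U-F1 from the U-PRE and U-REC values reported for our method on COCO-OOD.

```python
def f1_score(u_pre, u_rec):
    """Harmonic mean of U-PRE and U-REC (Eq. 16)."""
    total = u_pre + u_rec
    return 2 * u_pre * u_rec / total if total else 0.0


def unknown_metrics(tp_u, fp_u, fn_u):
    """Return (U-PRE, U-REC, U-F1) from counts of unknown-class proposals."""
    u_pre = tp_u / (tp_u + fp_u) if tp_u + fp_u else 0.0   # Eq. 15
    u_rec = tp_u / (tp_u + fn_u) if tp_u + fn_u else 0.0   # Eq. 14
    return u_pre, u_rec, f1_score(u_pre, u_rec)


# Toy counts: 3 true positives, 1 false positive, 2 false negatives.
print(unknown_metrics(tp_u=3, fp_u=1, fn_u=2))

# U-PRE = 54.5 and U-REC = 55.3 (our row in Table 1) give U-F1 of about 54.9.
print(round(f1_score(54.5, 55.3), 1))
```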

Appendix D Experiment Result

Ablation Study

One-to-many Assignment Strategy. To demonstrate the improvement of the one-to-many assignment strategy for IPP training, we show the convergence curves for introducing different sets of query samples in Figure 6. It is clearly observed that compared to the one-to-one assignment, the introduction of one additional set of samples (one-to-two) significantly improves the model performance with a small difference in the convergence speed (see the right of Figure 6). In addition, the introduction of more sets of samples (one-to-three) significantly increases the instability of UN-DETR because of the possible presence of negative sample noise.

Unsupervised Pre-training. As shown in Table 3 of the paper, unsupervised pre-training enhances the U-AP and U-PRE of our method. This is because it provides the model with an objectness prior, giving it an initial class-agnostic perceptual ability, which is crucial for UOD. Notably, while the other methods did not use our unsupervised pre-training, they used different pre-training methods. However, as indicated in the 8th row of Table 3, our method still surpasses them in U-AP, even without pre-training. This demonstrates that our superior performance does not rely on pre-training. Furthermore, we retrained OW-DETR and PROB with the same pre-training method. Since this pre-training is specifically suited to DETR-like architectures, we pre-trained UnSniffer using a closely similar approach. The U-AP of PROB improved from 4.3 to 6.8, and the U-AP of OW-DETR improved from 3.3 to 4.4. For UnSniffer, the pre-trained weights achieved a U-AP of 26.1, showing strong unknown object perception, but the final training yielded a U-AP of 43.9 due to method incompatibility.

Qualitative Analysis

In order to verify the effectiveness of our approach, in particular the jointly supervised IPP, we visualized the output of the UN-DETR decoder using t-SNE. As shown in Figure 7, the unknown objects are scattered among the known ones, while the background is far from them. Thus, the t-SNE visualization intuitively demonstrates that our approach distinguishes between objects and non-objects, thanks to our proposed IPS that jointly re-couples generalized objectness from both positional and categorical latent spaces.

Figure 7: t-SNE visualization of various classes’ hidden vectors
Figure 8: Example results on COCO-OOD datasets. Detections are overlaid on known (yellow) and unknown (blue) objects.
Figure 9: Example results on COCO-Mixed datasets. Detections are overlaid on known (yellow) and unknown (blue) objects.

We show more visualization examples. Specifically, our UN-DETR is compared with 5 classic or recent state-of-the-art methods, including the OSD detector VOS (Du et al. 2022b), the pseudo-labeling-based OWOD detectors ORE (Joseph et al. 2021) and OW-DETR (Gupta et al. 2022), and the objectness-based methods PROB (Zohar, Wang, and Yeung 2023) and UnSniffer (Liang et al. 2023). For a fair comparison, all results are either provided by the authors or reproduced by an open-source model trained on the same training set with the recommended setting.

Figure 8 and Figure 9 visualize the results of the above methods on example images from the COCO-OOD and COCO-Mixed datasets. Without learning reliable objectness, both the OSD and the OWOD methods misclassify or miss many apparent unknown objects. The objectness-based methods detect more unknown objects, but they are limited by supervision from a single source and thus still suffer from inaccurate object boundaries and low classification confidence. In contrast, our UN-DETR accurately detects the majority of unknown objects, thanks to the jointly supervised IPS. For example, the donut in the second row and the orange in the fourth row of Figure 8, and the moon in the fifth row of Figure 9, are accurately predicted only by our method. Moreover, owing to our proposed IPS-guided post-processing, known and unknown objects are also well distinguished; see the first and third rows of Figure 9.

Computational Cost and Inference Efficiency

Table 5 reports computational cost and inference efficiency. Our method's parameter count, FLOPs, and FPS are comparable to those of the original D-DETR and Faster-RCNN models, demonstrating its scalability and practicality. Additionally, on the CrowdHuman dataset, which averages 23 instances per image, our method runs at 21.3 FPS. This shows that our method remains efficient in large-scale, dense detection tasks.
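The FPS figures above depend on how inference time is measured, and the measurement protocol is not detailed here. As a rough illustration only, the following sketch shows one common way to estimate FPS: run a few warm-up passes, then time a full pass over the inputs. The names `measure_fps`, `run_inference`, and `dummy_model` are hypothetical and are not part of the UN-DETR code base.

```python
import time


def measure_fps(run_inference, inputs, warmup=2):
    """Estimate frames per second of `run_inference` over `inputs`.

    `run_inference` is any callable taking one input (hypothetical stand-in
    for a detector's forward pass). Warm-up calls are excluded from timing.
    """
    inputs = list(inputs)
    # Warm-up passes so one-time setup costs do not distort the timing.
    for item in inputs[:warmup]:
        run_inference(item)

    start = time.perf_counter()
    for item in inputs:
        run_inference(item)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse timers.
    return len(inputs) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Dummy "model": sums a list of numbers in place of a real detector.
    def dummy_model(image):
        return sum(image)

    images = [[0.5] * 10_000 for _ in range(50)]
    print(f"approx. FPS: {measure_fps(dummy_model, images):.1f}")
```

When timing a real detector on a GPU, the device usually has to be synchronized before each timer reading, since GPU kernels are launched asynchronously; that step is omitted from this CPU-only sketch.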

Result on OWOD Benchmark

Liang et al. (2023) point out that COCO's incomplete labeling limits the assessment of U-PRE. PROB and OW-DETR, which were designed for OWOD (Joseph et al. 2021), focus on U-REC, which leads to redundant boxes. Our method addresses both U-PRE and U-REC, achieving high U-AP and U-F1 on the UOD benchmark (Liang, Li, and Srikant 2018). Additionally, on the OWOD benchmark it achieves a U-REC of 21.2 on MOWOD task 1 and 18.8 on SOWOD task 1, showing competitive performance.

Appendix E Comparison with promptable open-set detectors

Recently, numerous promptable and grounded open-set detectors, such as RegionCLIP, GLIP, and Grounding DINO, have been introduced, offering novel solutions for object detection in open scenarios. These are termed Visual-Language detectors because they rely on text prompts, unlike Visual-Only detectors such as ours. While both can identify unknowns, significant differences exist:

First, Learning Features. Prompt-based methods leverage multimodal data (image-text pairs) to align features, enabling generalization to unknowns, while our method leverages unimodal data to learn objectness features. Thus, direct comparison is not feasible.

Second, Data and Computational Requirements. Prompt-based methods (e.g., Grounding DINO) need large-scale, semantically rich datasets (12M images and >1000 categories) and significant computational resources (64 Nvidia A100 GPUs), whereas our method works with much smaller datasets (16K images, 20 categories) and a single RTX 4090.

Third, Category Limitations. Prompt-based methods require predefined text prompts, which may not cover all unknowns. Setting appropriate prompts for various scenarios is also challenging.

We support these points below with an experimental comparison with Grounding DINO.

| Methods | mAP | U-AP | U-F1 | U-PRE | U-REC |
| --- | --- | --- | --- | --- | --- |
| Grounding DINO | 56.7 | 26.0 | 41.9 | 83.2 | 28.1 |
| Ours | 47.2 | 47.0 | 54.9 | 54.5 | 55.3 |

Table 4: Comparisons with Grounding DINO on the VOC-test and COCO-OOD datasets.

| Methods | Parameters (M) | FLOPs (G) | FPS |
| --- | --- | --- | --- |
| Faster-RCNN | 41.72 | 180 | 21.72 |
| D-DETR | 42.02 | 137 | 24.33 |
| Ours | 42.14 | 137.3 | 23.61 |

Table 5: Efficiency comparisons.

| Methods | U-AP | U-F1 | U-PRE | U-REC |
| --- | --- | --- | --- | --- |
| Grounding DINO | 26.0 | 41.9 | 83.2 | 28.1 |
| Grounding DINO (3 classes) | 32.7 | 45.2 | 75.9 | 32.1 |
| Grounding DINO (5 classes) | 41.7 | 52.7 | 75.8 | 40.4 |
| Grounding DINO (7 classes) | 41.8 | 53.9 | 75.2 | 41.8 |
| Ours | 47.0 | 54.9 | 54.5 | 55.3 |

Table 6: Comparison of Grounding DINO with different numbers of class prompts on COCO-OOD.
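To make the trade-off between U-PRE and U-REC in Table 6 easier to read, the sketch below recomputes an F1 score from the reported precision and recall. It assumes that U-F1 is the usual harmonic mean of U-PRE and U-REC, which this section does not state explicitly; the helper name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (U-PRE, U-REC, reported U-F1) copied from Table 6.
TABLE_6 = {
    "Grounding DINO": (83.2, 28.1, 41.9),
    "Grounding DINO (3 classes)": (75.9, 32.1, 45.2),
    "Grounding DINO (5 classes)": (75.8, 40.4, 52.7),
    "Grounding DINO (7 classes)": (75.2, 41.8, 53.9),
    "Ours": (54.5, 55.3, 54.9),
}

if __name__ == "__main__":
    for name, (u_pre, u_rec, reported) in TABLE_6.items():
        recomputed = f1_score(u_pre, u_rec)
        print(f"{name}: recomputed {recomputed:.2f}, reported {reported}")
```

Under this assumption, every recomputed value lies within 0.2 points of the reported U-F1; the small gaps are presumably because the published precision and recall are rounded. The calculation also shows why Grounding DINO's high U-PRE does not translate into a high U-F1: the harmonic mean is dominated by its much lower U-REC.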

Experimental comparison with Grounding DINO

Recently, Liu et al. (2023) designed a powerful “open set” object detector, Grounding DINO. To incorporate language information and enhance generalization to unseen objects, they introduced a tight modality fusion technique based on DINO, enabling feature fusion at multiple levels of the pipeline. In addition, the authors improved the training strategy of GLIP (Li et al. 2022b) by using sub-sentence level text features.

Liu et al. (2023) define the detection of arbitrary objects using human inputs such as category names or referring expressions as “open set object detection”. The UOD task, on the other hand, requires the model to learn general objectness only from the known categories in the input image data and to generalize to unknown objects, detecting instances that do not belong to any known category as “unknown”. Despite significant differences in task settings between our UN-DETR and Grounding DINO, the zero-shot inference ability of Grounding DINO is similar to UN-DETR’s capability to detect unknown objects. Therefore, we evaluated Grounding DINO on the UOD benchmark for comparison with our model.

In our experiments, we use GroundingDINO-T with BERT-BASE as the text encoder. Following the prompt example given by Liu et al. (2023), we set the prompt to “Category 1. Category 2. ... Category n. ”; in particular, for the experiments on COCO-OOD, we set the prompt to “unknown object. ”.
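As a minimal illustration of the prompt format described above, the following sketch joins a list of category names into a single period-separated string. The function name `build_prompt` is hypothetical, and the sketch covers only string formatting; it does not call Grounding DINO itself.

```python
def build_prompt(categories):
    """Join category names into the form 'name1. name2. ... nameN. '.

    Empty or whitespace-only names are skipped; each remaining name is
    followed by a period and a space, matching the format quoted above.
    """
    cleaned = [name.strip() for name in categories if name.strip()]
    return "".join(f"{name}. " for name in cleaned)


if __name__ == "__main__":
    # Known-category style prompt (category names here are illustrative).
    print(repr(build_prompt(["person", "dog", "car"])))
    # Single generic prompt of the kind used for COCO-OOD in the text.
    print(repr(build_prompt(["unknown object"])))
```

Keeping the separator in one place makes it straightforward to vary the number of categories in the prompt, as in the experiments with 3, 5, and 7 class prompts.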

Grounding DINO’s zero-shot performance on the VOC validation set is far superior to that of other detectors. This reflects the strong generalization that results from pre-training on large-scale datasets and from the additional textual semantic information, as well as the excellent performance of the DINO architecture itself.

Conversely, Grounding DINO’s performance on COCO-OOD is significantly lower than that of UN-DETR, as shown in Table 4, primarily because the UOD task setting requires all objects to be uniformly labeled as “unknown”. Grounding DINO’s text encoder struggles to accurately encode the prompt “unknown. ” because of the inherent ambiguity of the concept, which contrasts with the more concrete and descriptive language typically used during its training. UN-DETR, in contrast, is designed to correctly categorize known categories while classifying anything beyond them as “unknown”. Grounding DINO instead detects objects based on the specific words or phrases in the text prompt that indicate category information. This fundamental difference in task setting hinders Grounding DINO’s performance in UOD tasks, leading to frequent missed detections. Nevertheless, owing to its powerful image feature extraction capabilities, Grounding DINO still outperforms some previous methods on COCO-OOD.

Figure 10 compares the visualization results of UN-DETR and Grounding DINO on COCO-OOD. Because the text prompt carries limited category information, Grounding DINO misses some obvious unknown objects, such as the lamp and table in the first column and the laptop in the fourth row, which directly reflects its limitations. In contrast, our UN-DETR, which does not require any prompt input, detects all of these unknown objects.

To further analyze Grounding DINO’s limited performance in the UOD setting, we added several types of frequently occurring unknown objects in COCO-OOD to Grounding DINO’s prompt, while fixing the model’s output label as “unknown”. As the results in Table 6 show, adding category information to the prompt improves its performance. This approach essentially predefines certain categories for Grounding DINO to detect in order to enhance its performance in this context.
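The relabeling step described above can be sketched as follows. The detection format used here (a list of dictionaries with `label`, `score`, and `box` keys) and the function name `relabel_as_unknown` are assumptions made for illustration; they do not reflect Grounding DINO's actual output structure or the evaluation code used in our experiments.

```python
def relabel_as_unknown(detections, score_threshold=0.0):
    """Map every prompted detection to the single label 'unknown'.

    `detections` is a hypothetical list of dicts with keys 'label',
    'score', and 'box'. Detections below `score_threshold` are dropped;
    scores and boxes of the remaining detections are left unchanged.
    """
    return [
        {"label": "unknown", "score": det["score"], "box": det["box"]}
        for det in detections
        if det["score"] >= score_threshold
    ]


if __name__ == "__main__":
    # Illustrative detections from a prompt listing specific categories.
    raw = [
        {"label": "lamp", "score": 0.62, "box": (10, 20, 50, 90)},
        {"label": "laptop", "score": 0.18, "box": (60, 40, 120, 100)},
    ]
    for det in relabel_as_unknown(raw, score_threshold=0.3):
        print(det)
```

Because the category names in the prompt serve only to trigger detections, the predicted phrase is discarded and each surviving box is evaluated as an “unknown” instance.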

As shown in Figure 11, as the category information contained in the prompt provided to Grounding DINO increases, the number of objects it can detect increases accordingly. This additional experiment demonstrates that Grounding DINO’s performance is severely constrained by the information provided in the predefined prompt. If the prompt lacks an accurate and specific description, particularly in open-world scenarios where it is challenging to predefine all possible categories, Grounding DINO struggles to detect unknown targets accurately. This finding underscores both the research significance and the practical importance of the UOD task.

Figure 10: Example results on COCO-OOD datasets. Detections are overlaid on known (yellow) and unknown (blue) objects.
Figure 11: Example results with different number of class prompts on COCO-OOD datasets. Detections are overlaid on known (yellow) and unknown (blue) objects.

Appendix F Limitation and Future Work

Although our method is fully evaluated on existing UOD benchmarks, the limited size of their data constrains both training and evaluation. In future work, we will extend the unknown detection dataset to enable a more comprehensive comparison of unknown object detection methods. In addition, for application in real-world scenarios, we plan to optimize the inference speed of UN-DETR.