
End-to-End Object Detection with Fully Convolutional Network

Jianfeng Wang¹* Lin Song²*† Zeming Li¹ Hongbin Sun² Jian Sun¹ Nanning Zheng²
¹ Megvii Technology  ² Xi’an Jiaotong University
[email protected] [email protected]
{lizeming,sunjian}@megvii.com {hsun,nnzheng}@mail.xjtu.edu.cn
* Equal contribution. † This work was done at Megvii Technology.
arXiv:2012.03544v3 [cs.CV] 26 Mar 2021

Abstract

Mainstream object detectors based on fully convolutional networks have achieved impressive performance. However, most of them still need a hand-designed non-maximum suppression (NMS) post-processing step, which impedes fully end-to-end training. In this paper, we analyze what is required to discard NMS, and the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains performance comparable to NMS. Besides, a simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. The code is available at https://github.com/Megvii-BaseDetection/DeFCN.

Figure 1. As shown in the dashed box, most detectors based on the fully convolutional network adopt multiple predictions and NMS post-processing for each instance. With the proposed prediction-aware one-to-one label assignment and 3D Max Filtering, our end-to-end detector can directly perform a single prediction for each instance without post-processing.

1. Introduction

Object detection is a fundamental topic in computer vision, which predicts a set of bounding boxes with pre-defined category labels for each image. Most mainstream detectors [9, 22, 30, 54] utilize hand-crafted designs such as anchor-based label assignment and non-maximum suppression (NMS). Recently, quite a number of methods [46, 58, 5] have been proposed to eliminate the pre-defined set of anchor boxes by using distance-aware and distribution-based label assignments. Although they achieve remarkable progress and superior performance, discarding the NMS post-processing remains a challenge, which hinders fully end-to-end training.

To tackle this issue, Learnable NMS [13], Soft-NMS [1] and other NMS variants [12, 24, 15], as well as CenterNet [5], have been proposed to improve duplicate removal, but they still do not provide an effective end-to-end training strategy. Meanwhile, many approaches [43, 35, 26, 32, 36] based on recurrent neural networks have been introduced to predict the bounding box for each instance by using an autoregressive decoder. These approaches provide naturally sequential modeling for the prediction of bounding boxes, but they are only evaluated on some small datasets without modern detectors, and the iterative manner makes the inference process inefficient.

Recently, DETR [3] introduced a bipartite-matching-based training strategy and transformers with a parallel decoder to enable end-to-end detection. It achieves competitive performance against many state-of-the-art detectors. However, DETR currently suffers from a much longer training duration to converge and relatively lower performance on small objects. To this end, this paper explores a new perspective: could a fully convolutional network achieve competitive end-to-end object detection?

In this paper, we attempt to answer this question in two dimensions, i.e., label assignment and network architecture.
As shown in Fig. 1, most fully convolutional detectors [22, 46, 50, 21] adopt a one-to-many label assignment rule, i.e., they assign many predictions as foreground samples for one ground-truth instance. This rule provides adequate foreground samples to obtain a strong and robust feature representation. Nevertheless, the massive foreground samples lead to duplicate predicted boxes for a single instance, which prevents end-to-end detection. To demonstrate this, we first give an empirical comparison of existing hand-designed label assignments. We find that one-to-one label assignment plays a crucial role in eliminating the post-processing of duplicate removal. However, the hand-designed one-to-one assignment still has a drawback: the fixed assignment could cause ambiguity issues and reduce the discriminability of features, since the pre-defined regions of an instance may not be the best choice [17] for training. To solve this issue, we propose a prediction-aware one-to-one (POTO) label assignment, which dynamically assigns the foreground samples according to the quality of classification and regression simultaneously.

Furthermore, for the modern FPN based detector [46], extensive experiments demonstrate that the duplicate bounding boxes majorly come from the nearby regions of the most confident prediction across adjacent scales. Therefore, we design a 3D Max Filtering (3DMF), which can be embedded into the FPN head as a differentiable module. This module improves the discriminability of convolutions in local regions by using a simple 3D max filtering operator across adjacent scales. Besides, to provide adequate supervision for feature representation learning, we modify a one-to-many assignment into an auxiliary loss.

With the proposed techniques, our end-to-end detection framework achieves competitive performance against many state-of-the-art detectors. On the COCO [23] dataset, our end-to-end detector based on the FCOS framework [46] and the ResNeXt-101 [49] backbone remarkably outperforms the baseline with NMS by 1.1% mAP. Furthermore, our end-to-end detector is more robust and flexible for crowded detection. To demonstrate its superiority in crowded scenes, we conduct further experiments on the CrowdHuman [37] dataset. With the ResNet-50 backbone, our end-to-end detector achieves 3.0% AP50 and 6.0% mMR absolute gains over the FCOS baseline with NMS.

2. Related Work

2.1. Fully Convolutional Object Detector

Owing to the success of convolutional networks [11, 40, 41, 39, 20, 51, 52], object detection has achieved tremendous progress during the last decade. Modern one-stage [22, 25, 31, 38, 29, 7] and two-stage detectors [33, 21, 2] heavily rely on anchors or anchor-based proposals. In these detectors, the anchor boxes are made up of pre-defined sliding windows, which are assigned as foreground or background samples with bounding box offsets. Due to the hand-designed and data-independent anchor boxes, the training targets of anchor-based detectors are typically sub-optimal and require careful tuning of hyper-parameters. Recently, FCOS [46] and CornerNet [18] gave a different perspective on fully convolutional detectors by introducing anchor-free frameworks. Nevertheless, these frameworks still need a hand-designed post-processing step for duplicate removal, i.e., non-maximum suppression (NMS). Since NMS is a heuristic approach and adopts a constant threshold for all instances, it needs careful tuning and might not be robust, especially in crowded scenes. In contrast, based on the anchor-free framework, this paper proposes a prediction-aware one-to-one assignment rule for classification to discard the non-trainable NMS.

2.2. End-to-End Object Detection

To achieve end-to-end detection, many approaches have been explored in the previous literature. In earlier research, numerous detection frameworks based on recurrent neural networks [43, 35, 26, 32, 36] attempt to produce a set of bounding boxes directly. Although they allow end-to-end learning in principle, their effectiveness is only demonstrated on some small datasets and not against modern baselines [46, 8]. Meanwhile, Learnable NMS [13] was proposed to learn duplicate removal by using a very deep and complex network, which achieves performance comparable to NMS. But it is constructed from discrete components and does not give an effective solution for end-to-end training. Recently, the relation network [14] and DETR [3] applied the attention mechanism to object detection, which models pairwise relations between different predictions. By using one-to-one assignment rules and direct set losses, they do not need any additional post-processing steps. Nevertheless, when handling massive predictions, these methods incur a prohibitively high cost, making them inappropriate for dense prediction frameworks. Due to the lack of image priors and a multi-scale fusion mechanism, DETR also suffers from a much longer training duration than mainstream detectors and lower performance on small objects. Different from the approaches mentioned above, our method is the first to enable end-to-end object detection based on a fully convolutional network.
3. Methodology

3.1. Analysis on Label Assignment

To reveal the effect of label assignment on end-to-end object detection, we conduct several ablation studies of conventional label assignments on the COCO [23] dataset. As shown in Tab. 1, all the experiments are based on the FCOS [46] framework, whose centerness branch is removed to achieve a head-to-head comparison. The results demonstrate the superiority of one-to-many assignment for feature representation and the potential of one-to-one assignment for discarding the NMS. The detailed analysis is elaborated in the following sections.

Table 1. The comparison of different label assignment rules for end-to-end object detection on COCO val set. ∆ indicates the gap between with and without NMS. ‘Aux’ is the proposed auxiliary loss. All models are based on the ResNet-50 backbone with 180k training iterations.

Assignment rule | Method                              | mAP w/ NMS | mAP w/o NMS | ∆     | mAR w/ NMS | mAR w/o NMS | ∆
One-to-many     | Hand-designed: FCOS [46] baseline * | 40.5       | 12.1        | -28.4 | 58.3       | 52.8        | -5.5
One-to-one      | Hand-designed: Anchor               | 37.2       | 35.8        | -1.4  | 57.0       | 59.2        | +2.2
One-to-one      | Hand-designed: Center               | 37.2       | 33.6        | -3.6  | 57.8       | 59.7        | +1.9
One-to-one      | Prediction-aware: Foreground loss   | 38.3       | 37.1        | -1.2  | 58.6       | 61.4        | +2.8
One-to-one      | Prediction-aware: POTO              | 38.6       | 38.0        | -0.6  | 57.9       | 60.5        | +2.6
One-to-one      | Prediction-aware: POTO+3DMF         | 40.0       | 39.8        | -0.2  | 58.8       | 60.9        | +2.1
Mixture **      | Prediction-aware: POTO+3DMF+Aux     | 41.2       | 41.1        | -0.1  | 58.9       | 61.2        | +2.3

* We remove its centerness branch to achieve a head-to-head comparison.
** We adopt a one-to-one assignment in POTO and a one-to-many assignment in the auxiliary loss, respectively.

3.1.1 One-to-many Label Assignment

Since the NMS post-processing is widely adopted in dense prediction frameworks [21, 22, 58, 53, 46, 50, 57, 29, 7], one-to-many label assignment has become the conventional way to assign training targets. The adequate foreground samples lead to a strong and robust feature representation. However, when discarding the NMS, the redundant foreground samples of one-to-many label assignment produce duplicate false-positive predictions that cause a dramatic drop in performance, e.g., a 28.4% mAP absolute drop for the FCOS [46] baseline. In addition, the mAR reported in Tab. 1 indicates the recall rate for the predictions with the top 100 scores. Without NMS, the one-to-many assignment rule leads to numerous duplicate predictions with high scores, thus reducing the recall rate. Therefore, it is hard for a detector to achieve competitive end-to-end detection by relying only on the one-to-many assignment.

3.1.2 Hand-designed One-to-one Label Assignment

MultiBox [45] and YOLO [30] demonstrate the potential of applying one-to-one label assignment to a dense prediction framework. In this paper, we evaluate two one-to-one label assignment rules to reveal the underlying connection with discarding NMS. These rules are modified from two widely-used one-to-many label assignments, the Anchor rule and the Center rule. Concretely, the Anchor rule is based on RetinaNet [22]: each ground-truth instance is only assigned to the anchor with the maximum Intersection-over-Union (IoU). The Center rule is based on FCOS [46]: each ground-truth instance is only assigned to the pixel closest to the center of the instance in the pre-defined feature layer. All other anchors or pixels are set as background samples.

As shown in Tab. 1, compared with the one-to-many label assignment, one-to-one label assignment allows fully convolutional detectors to greatly reduce the gap between with and without NMS and to achieve reasonable performance without NMS. For instance, the detector based on the Center rule achieves 21.5% mAP absolute gains over the FCOS baseline. Besides, as it avoids the erroneous suppression of NMS in complex scenes, the recall rate is further increased. Nevertheless, two issues remain unresolved. First, when one-to-one label assignment is applied, the performance gap between detectors with and without NMS remains non-negligible. Second, due to the reduced supervision for each instance, the performance of one-to-one label assignment is still inferior to the FCOS baseline.
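To make the hand-designed rules concrete, the following is a minimal sketch of the Center rule described above, assuming the candidate locations of the pre-defined feature layer are flattened into a single (N, 2) tensor and boxes are given as (x1, y1, x2, y2); the function name and tensor layouts are illustrative, not the authors' implementation.

```python
# Minimal sketch of the hand-designed "Center" one-to-one rule: each
# ground-truth box claims the single feature location closest to its
# center, and every other location stays background. For brevity, ties
# and collisions between ground-truths are not resolved here.
import torch

def center_one_to_one(points: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """points: (N, 2) feature locations; gt_boxes: (G, 4) as (x1, y1, x2, y2).
    Returns fg_idx: (G,) index of the unique foreground location per instance."""
    centers = 0.5 * (gt_boxes[:, :2] + gt_boxes[:, 2:])  # (G, 2) box centers
    dist = torch.cdist(centers, points)                  # (G, N) pairwise distances
    return dist.argmin(dim=1)                            # closest location per ground-truth
```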
3.2. Our Methods

In this paper, to enable competitive end-to-end object detection, we propose a mixture label assignment and a new 3D Max Filtering (3DMF). The mixture label assignment is made up of the proposed prediction-aware one-to-one (POTO) label assignment and a modified one-to-many label assignment (the auxiliary loss). With these techniques, our end-to-end framework can discard the NMS post-processing and keep a strong feature representation.

3.2.1 Prediction-aware One-to-one Label Assignment

The hand-designed one-to-one label assignment follows a fixed rule. However, this rule may be sub-optimal for various instances in complex scenes, e.g., the Center rule for an eccentric object [17]. Thus, if the assignment procedure is forced to assign a sub-optimal prediction as the unique foreground sample, the difficulty for the network to converge could be dramatically increased, leading to more false-positive predictions. To this end, we propose a new rule named Prediction-aware One-To-One (POTO) label assignment, which dynamically assigns samples according to the quality of the predictions.

Figure 2. The diagram of the head with 3D Max Filtering (3DMF) in an FPN stage. ‘POTO’ indicates the proposed Prediction-aware One-to-one label assignment rule to achieve end-to-end detection. ‘Conv + σ’ denotes a convolution layer followed by a sigmoid function [10], which outputs coarse classification scores. ‘Aux Loss’ is the proposed auxiliary loss to improve feature representation. The dotted lines highlight the additional components in the training phase, which are abandoned in the inference phase.

Let Ψ denote the index set of all the predictions. G and N correspond to the number of ground-truth instances and predictions, respectively, where typically G ≪ N in dense prediction detectors. π ∈ Π^N_G indicates a G-permutation of the N predictions. POTO aims to generate a suitable permutation π̂ of predictions as the foreground samples. The training loss is formulated as Eq. 1, which consists of the foreground loss L_fg and the background loss L_bg:

    L = \sum_{i}^{G} \mathcal{L}_{fg}\left(\hat{p}_{\hat{\pi}(i)}, \hat{b}_{\hat{\pi}(i)} \mid c_i, b_i\right) + \sum_{j \in \Psi \setminus R(\hat{\pi})} \mathcal{L}_{bg}\left(\hat{p}_j\right),    (1)

where R(π̂) denotes the index set of the assigned foreground samples. For the i-th ground-truth, c_i and b_i are its category label and bounding box coordinates, respectively; for the π̂(i)-th prediction, p̂_{π̂(i)} and b̂_{π̂(i)} correspond to its predicted classification scores and predicted box coordinates, respectively.

To achieve competitive end-to-end detection, we need to find a suitable label assignment π̂. As shown in Eq. 2, previous works [6, 3] treat this as a bipartite matching problem with the foreground loss [22, 34] as the matching cost, which can be rapidly solved by the Hungarian algorithm [43]:

    \hat{\pi} = \arg\min_{\pi \in \Pi^N_G} \sum_{i}^{G} \mathcal{L}_{fg}\left(\hat{p}_{\pi(i)}, \hat{b}_{\pi(i)} \mid c_i, b_i\right).    (2)

However, the foreground loss typically needs additional weights to alleviate optimization issues, e.g., unbalanced training samples and the joint training of multiple tasks. As shown in Tab. 1, this property makes the training loss a sub-optimal choice for the matching cost. Therefore, as presented in Eq. 3 and Eq. 4, we propose a cleaner and more effective formulation (POTO) to find a better assignment:

    \hat{\pi} = \arg\max_{\pi \in \Pi^N_G} \sum_{i}^{G} Q_{i,\pi(i)},    (3)

where

    Q_{i,\pi(i)} = \underbrace{\mathbb{1}\left[\pi(i) \in \Omega_i\right]}_{\text{spatial prior}} \cdot \underbrace{\left(\hat{p}_{\pi(i)}(c_i)\right)^{1-\alpha}}_{\text{classification}} \cdot \underbrace{\mathrm{IoU}\left(b_i, \hat{b}_{\pi(i)}\right)^{\alpha}}_{\text{regression}}.    (4)

Here Q_{i,π(i)} ∈ [0, 1] represents the proposed matching quality of the i-th ground-truth with the π(i)-th prediction. It considers the spatial prior, the confidence of classification, and the quality of regression simultaneously. Ω_i indicates the set of candidate predictions for the i-th ground-truth, i.e., the spatial prior. The spatial prior is widely used in the training phase [21, 22, 58, 53, 46, 50]. For instance, the center sampling strategy adopted in FCOS [46] only considers the predictions in the central portion of the ground-truth instance as foreground samples. We also apply it in POTO to achieve higher performance, but it is not necessary for discarding NMS (more details in Sec. 4.2.2). To achieve a balance, we define the quality as the weighted geometric mean of the classification score p̂_{π(i)}(c_i) and the regression quality IoU(b_i, b̂_{π(i)}) in Eq. 4. The hyper-parameter α ∈ [0, 1] adjusts the ratio between classification and regression, where α = 0.8 is adopted by default; more ablation studies are elaborated in Sec. 4.2.2. As shown in Tab. 1, POTO not only narrows the gap with NMS but also improves the performance.
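The assignment of Eq. 3 and Eq. 4 translates into a few lines of code. The snippet below is a minimal sketch, assuming sigmoid classification scores and boxes in (x1, y1, x2, y2) format; `poto_assign` and its tensor layouts are our own illustrative names, and the released code may differ.

```python
# Minimal sketch of POTO: build the quality matrix Q of Eq. 4 from the
# spatial prior, the classification score of the target class, and the
# IoU, then solve the one-to-one assignment of Eq. 3 with the Hungarian
# algorithm (the assignment itself needs no gradients).
import torch
from torchvision.ops import box_iou
from scipy.optimize import linear_sum_assignment

def poto_assign(cls_scores, pred_boxes, gt_labels, gt_boxes, in_prior, alpha=0.8):
    """cls_scores: (N, C) sigmoid scores; pred_boxes: (N, 4); gt_labels: (G,);
    gt_boxes: (G, 4); in_prior: (G, N) bool spatial-prior mask (e.g. center sampling)."""
    prob = cls_scores[:, gt_labels].t()                 # (G, N) score of each GT's class
    iou = box_iou(gt_boxes, pred_boxes)                 # (G, N) regression quality
    quality = in_prior.float() * prob.pow(1 - alpha) * iou.pow(alpha)   # Eq. 4
    gt_idx, pred_idx = linear_sum_assignment(
        quality.detach().cpu().numpy(), maximize=True)  # Eq. 3, one prediction per GT
    return gt_idx, pred_idx
```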
Table 2. Comparison of different configurations for NMS post-processing on COCO val set. ‘Across scales’ indicates applying NMS to the multiple adjacent stages of the feature pyramid network. ‘Spatial range’ denotes the spatial range for duplicate removal in each scale.

Model     | Across scales | Spatial range | mAP
FCOS [46] | ✗             | 1×1           | 19.0
FCOS [46] | ✗             | 3×3           | 37.4
FCOS [46] | ✗             | 5×5           | 39.2
FCOS [46] | ✗             | ∞×∞           | 39.2
FCOS [46] | ✓             | ∞×∞           | 40.9

3.2.2 3D Max Filtering

In addition to the label assignment, we attempt to design an effective architecture to realize more competitive end-to-end detection. To this end, we first reveal the distribution of duplicate predictions. As shown in Tab. 2, for a modern FPN based detector [46], the performance degrades noticeably when NMS is applied to each scale separately. Moreover, we find that the duplicate predictions majorly come from the nearby spatial regions of the most confident prediction. Therefore, we propose a new module called 3D Max Filtering (3DMF) to suppress duplicate predictions.

Convolution is a linear operation with translational equivariance, which produces similar outputs for similar patterns at different positions. However, this property is a great obstacle to duplicate removal, since different predictions of the same instance typically have similar features [22] in dense prediction detectors. The max filter is a rank-based non-linear filter [42], which can be used to compensate for the discriminant ability of convolutions in a local region. The max filter has also been utilized in keypoint-based detectors, e.g., CenterNet [56] and CornerNet [18], as a new post-processing step to replace non-maximum suppression. It demonstrates some potential for duplicate removal, but its non-trainable manner hinders both effectiveness and end-to-end training. Meanwhile, the max filter only considers a single-scale feature, which is not appropriate for the widely-used FPN based detectors [22, 46, 50].

Therefore, we extend the max filter to a multi-scale version, called 3D Max Filtering, which transforms the features in each scale of the FPN. The 3D Max Filtering is applied to each channel of a feature map separately:

    \tilde{x}^s = \left\{ \tilde{x}^{s,k} := \mathrm{Bilinear}_{x^s}\left(x^k\right) \;\middle|\; \forall k \in \left[s - \tfrac{\tau}{2},\, s + \tfrac{\tau}{2}\right] \right\}.    (5)

Specifically, as shown in Eq. 5, given an input feature x^s at scale s of the FPN, we first adopt the bilinear operator [28] to interpolate the features from τ adjacent scales to the same size as the input feature x^s.

    y_i^s = \max_{k \in [s - \tau/2,\, s + \tau/2]} \; \max_{j \in \mathcal{N}_i^{\phi \times \phi}} \tilde{x}_j^{s,k}.    (6)

As shown in Eq. 6, for a spatial location i at scale s, the maximum value y_i^s is then obtained in a pre-defined 3D neighbourhood tube with τ scales and a φ × φ spatial extent. This operation can be easily implemented by a highly efficient 3D max-pooling operator [27].

Furthermore, to embed the 3D Max Filtering into existing frameworks and enable end-to-end training, we propose a new module, as shown in Fig. 3. This module leverages the max filtering to select the predictions with the highest activation value in a local region and enhances the distinction from other predictions, which is further verified in Sec. 4.2.1. Owing to this property, as shown in Fig. 2, we adopt the 3DMF to refine the coarse dense predictions and suppress the duplicate ones. Besides, all the modules are constructed from simple differentiable operators and introduce only a slight computational overhead.

Figure 3. The diagram of 3D Max Filtering. The detailed procedure of 3D max filtering is illustrated in the dashed box. ‘GN’ and ‘σ’ indicate the group normalization [47] and the sigmoid activation function, respectively.
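Eq. 5 and Eq. 6 translate almost directly into a bilinear resize followed by a 3D max pooling. The sketch below is one illustrative reading of the two equations; the learned convolutions, group normalization and sigmoid branches of the full module in Fig. 3 are omitted, so this is not the authors' released module.

```python
# Minimal sketch of 3D Max Filtering: neighbouring FPN levels are
# bilinearly resized to the current level (Eq. 5), stacked along a scale
# axis, and a 3D max pool takes the maximum over the scale x phi x phi
# tube around each location (Eq. 6).
import torch
import torch.nn.functional as F

def max_filter_3d(features, s, tau=2, phi=3):
    """features: list of (B, C, H_l, W_l) FPN maps; s: current level index.
    Returns y^s with shape (B, C, H_s, W_s)."""
    size = features[s].shape[-2:]
    lo, hi = max(0, s - tau // 2), min(len(features) - 1, s + tau // 2)
    # Eq. 5: resize the adjacent scales to the resolution of level s.
    tube = [F.interpolate(features[k], size=size, mode="bilinear", align_corners=False)
            for k in range(lo, hi + 1)]
    x = torch.stack(tube, dim=2)                        # (B, C, K, H_s, W_s)
    # Eq. 6: max over the K x phi x phi neighbourhood of each location.
    y = F.max_pool3d(x, kernel_size=(x.shape[2], phi, phi),
                     stride=1, padding=(0, phi // 2, phi // 2))
    return y.squeeze(2)                                 # back to (B, C, H_s, W_s)
```

With the best configuration reported later (τ = 2, φ = 3), each location competes with a 3 × 3 window on its own scale and on the two adjacent scales.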
3.2.3 Auxiliary Loss

In addition, when using the NMS, as shown in Tab. 1, the performance of POTO and 3DMF is still inferior to the FCOS baseline. This phenomenon may be attributed to the fact that one-to-one label assignment provides less supervision, making it difficult for the network to learn a strong and robust feature representation [44]. This can further reduce the discrimination of classification, thus causing a decrease in performance. To this end, motivated by many previous works [44, 54, 55], we introduce an auxiliary loss based on one-to-many label assignment to provide adequate supervision, as illustrated in Fig. 2.

Similar to ATSS [50], our auxiliary loss adopts the focal loss [22] with a modified one-to-many label assignment. Specifically, the one-to-many label assignment first takes the top-9 predictions as candidates in each FPN stage, according to the proposed matching quality in Eq. 4. It then assigns as foreground samples the candidates whose matching qualities exceed a statistical threshold, calculated as the sum of the mean and the standard deviation of all the candidate matching qualities, as sketched below. In addition, different forms of one-to-many label assignment for the auxiliary loss are reported in the supplementary material.
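The following is a minimal sketch of this statistical-threshold assignment, assuming a (G, N_l) matching-quality matrix (Eq. 4) has already been computed for each FPN stage; the helper name and shapes are illustrative assumptions rather than the released code.

```python
# Minimal sketch of the modified one-to-many assignment for the auxiliary
# loss: take the top-9 candidates per FPN stage by matching quality, then
# keep only the candidates above the mean-plus-std threshold computed over
# all candidates of each ground-truth.
import torch

def aux_one_to_many(quality_per_level, k=9):
    """quality_per_level: list over FPN stages of (G, N_l) quality matrices.
    Returns a list of (G, N_l) boolean foreground masks."""
    topk = [q.topk(min(k, q.shape[1]), dim=1) for q in quality_per_level]
    cand = torch.cat([t.values for t in topk], dim=1)     # (G, L*k) candidate qualities
    thr = cand.mean(dim=1, keepdim=True) + cand.std(dim=1, keepdim=True)  # (G, 1)
    masks = []
    for q, t in zip(quality_per_level, topk):
        m = torch.zeros_like(q, dtype=torch.bool)
        m.scatter_(1, t.indices, True)                    # top-k candidates per stage
        masks.append(m & (q >= thr))                      # statistical threshold
    return masks
```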
Figure 4. Visualization of the predicted classification scores from different approaches: (a) FCOS baseline; (b) POTO; (c) POTO+3DMF; (d) POTO+3DMF+Aux. The input image has three instances of different scales, i.e., person, tie and pot. The heatmaps from left to right of each approach correspond to the score maps in the FPN stages ‘P5’, ‘P6’ and ‘P7’, respectively. ‘Aux’ indicates the proposed auxiliary loss. Our POTO based detector significantly suppresses the duplicate predictions of the vanilla FCOS framework. The 3DMF enhances the distinctiveness of the local region across adjacent scales. Besides, the auxiliary loss further improves the feature representation.

4. Experiments

4.1. Implementation Details

Following FCOS [46], our detector adopts a pair of 4-convolution heads for classification and regression, respectively. The output channel numbers of the first and the second convolution in 3DMF are 256 and 1, respectively. All the backbones are pre-trained on the ImageNet dataset [4] with frozen batch normalization [16]. In the training phase, input images are reshaped so that their shorter side is 800 pixels. All training hyper-parameters are identical to the 2x schedule (180k iterations) in Detectron2 [48] if not specifically mentioned.

4.2. Ablation Studies on COCO

4.2.1 Visualization

As shown in Fig. 4, we present the visualization of the classification scores from the FCOS baseline and our proposed framework. For a single instance, the FCOS baseline with a one-to-many assignment rule outputs massive duplicate predictions, which are highly activated and have scores comparable to the most confident one. These duplicate predictions are evaluated as false-positive samples and greatly affect performance. In contrast, with the proposed POTO rule, the scores of duplicate samples are significantly suppressed. This property is crucial for the detector to achieve direct bounding box prediction without NMS. Moreover, with the proposed 3DMF module, this property is further enhanced, especially in the nearby regions of the most confident prediction. Besides, since the 3DMF module introduces a multi-scale competitive mechanism, the detector performs unique predictions well across different FPN stages, e.g., an instance in Fig. 4 has a single highly activated score across the stages.

4.2.2 Prediction-Aware One-to-One Label Assignment

Table 3. Results of POTO with different configurations of α and spatial prior on COCO val set. α = 0 is equivalent to considering classification alone; α = 1 is equivalent to considering regression alone. ‘Center sampling’ and ‘inside box’ both follow FCOS [46]. ‘/’ is used to distinguish between results without and with NMS.

α   | center sampling | inside box  | global
0.0 | 33.5 / 33.6     | 24.1 / 24.2 | 1.9 / 2.1
0.2 | 33.7 / 33.9     | 28.8 / 28.8 | 19.4 / 19.5
0.4 | 35.0 / 35.2     | 32.7 / 32.8 | 28.3 / 28.4
0.6 | 36.6 / 36.9     | 35.3 / 35.5 | 34.7 / 34.9
0.8 | 38.0 / 38.6     | 37.4 / 37.9 | 37.3 / 37.9
1.0 | 11.8 / 29.7     | 4.5 / 13.0  | non-convergence

Spatial prior. As shown in Tab. 3, for the spatial range of assignment, the center sampling strategy is relatively superior to the inside-box and global strategies on the COCO dataset. This reflects that the prior knowledge of images is essential in real-world scenarios.

Classification vs. regression. The hyper-parameter α, as shown in Eq. 4, controls the relative importance of classification and regression. As reported in Tab. 3, when α = 1, the gap with NMS is not narrowed. This could be attributed to the misalignment between the best positions for classification and regression. When α = 0, the assignment rule relies only on the predicted classification scores. Under this condition, the gap with NMS is considerably eliminated, but the absolute performance is still unsatisfactory, which could be caused by overfitting to a sub-optimal initialization. In contrast, with a proper fusion of classification and regression quality, the absolute performance is remarkably improved.

Table 4. The effect of various quality functions on COCO val set. ‘/’ is used to distinguish between results without and with NMS. ‘Add’ and ‘Mul’ indicate two fusion functions.

Method | α   | mAP         | AP50        | AP75
Add    | 0.2 | 36.0 / 36.2 | 55.7 / 57.0 | 38.7 / 38.3
Add    | 0.5 | 37.3 / 37.8 | 54.9 / 57.4 | 40.5 / 40.4
Add    | 0.8 | 29.3 / 35.6 | 40.3 / 53.4 | 32.8 / 38.4
Mul    | 0.8 | 38.0 / 38.6 | 55.2 / 57.6 | 41.4 / 41.3

Quality function. We further explore the effect of different fusion methods on the quality function, i.e., Eq. 4. As presented in Tab. 4, the method called ‘Add’ replaces the original quality function by (1 − α) · p̂_{π(i)}(c_i) + α · IoU(b_i, b̂_{π(i)}), which has a similar form to [19]. However, we find that the multiplication fusion, i.e., ‘Mul’, is more suitable for end-to-end detection, achieving 0.7% mAP absolute gains over the ‘Add’ fusion method.
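For concreteness, the two fusion functions compared in Tab. 4 can be written out for a single ground-truth/prediction pair as follows; this is a plain illustration of the formulas above, not code from the paper's release.

```python
# 'Mul' is the weighted geometric mean of Eq. 4; 'Add' is the weighted
# arithmetic mean. The spatial-prior indicator is omitted here.
def quality_mul(cls_score: float, iou: float, alpha: float = 0.8) -> float:
    return cls_score ** (1 - alpha) * iou ** alpha       # Eq. 4, 'Mul' fusion

def quality_add(cls_score: float, iou: float, alpha: float = 0.8) -> float:
    return (1 - alpha) * cls_score + alpha * iou         # 'Add' variant
```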
Table 5. The effect of sub-modules in the proposed 3DMF module on COCO val set. ‘3DMF’ and ‘Aux Loss’ indicate using the 3D Max Filtering and the auxiliary loss, respectively. ‘/’ is used to distinguish between results without and with NMS.

Model     | 3DMF | Aux Loss | mAP
FCOS [46] | ✗    | ✗        | 19.0 / 40.9
FCOS [46] | ✗    | ✓        | 18.9 / 41.3
FCOS [46] | ✓ *  | ✗        | 38.7 / 40.0
Ours      | ✗    | ✗        | 38.0 / 38.6
Ours      | ✓    | ✗        | 39.8 / 40.0
Ours      | ✓    | ✓        | 41.1 / 41.2

* We modify 3D Max Filtering as a post-processing step.

Table 6. The effect of hyper-parameters in the proposed 3DMF module on COCO val set. τ = 0 is equivalent to applying 2D Max Filtering to transform features on a single scale. ‘/’ is used to distinguish between results without and with NMS.

      | φ = 1       | φ = 3       | φ = 5
τ = 0 | 39.2 / 39.5 | 39.1 / 39.5 | 39.0 / 39.4
τ = 2 | 39.0 / 39.3 | 39.8 / 40.0 | 39.3 / 39.5
τ = 4 | 39.1 / 39.3 | 39.3 / 39.4 | 39.4 / 39.6

4.2.3 3D Max Filtering

Components. As shown in Tab. 5, without NMS post-processing, our end-to-end detector with POTO achieves 19.0% mAP absolute gains over the vanilla FCOS. By using the proposed 3DMF, the performance is further improved by 1.8% mAP, and the gap with NMS is narrowed to 0.2% mAP. As shown in Fig. 4, the result shows the crucial role of multi-scale and local-range suppression for end-to-end object detection. The proposed auxiliary loss provides adequate supervision, enabling our detector to obtain competitive performance against FCOS with NMS.

End-to-end. To demonstrate the superiority of the end-to-end training manner, we replace the 2D Max Filtering of CenterNet [5] with 3D Max Filtering as a new post-processing step for duplicate removal. This post-processing is then applied to the FCOS detector. As shown in Tab. 5, the end-to-end manner achieves significant absolute gains of 1.1% mAP.

Kernel size. As shown in Tab. 6, we evaluate different settings of the spatial range φ and the scale range τ in 3DMF. With φ = 3 and τ = 2, our method obtains the highest performance on the COCO dataset. This reflects that the duplicate predictions majorly come from a local region across adjacent scales, consistent with the observation in Sec. 3.2.2.

Performance w.r.t. training duration. As illustrated in Fig. 5(a), at the very beginning of training, the performance of our end-to-end detectors on COCO val set is inferior to that of the detectors with NMS. As training progresses, the performance gap becomes smaller and smaller. After 180k training iterations, our method finally outperforms the other detectors with NMS. This phenomenon also occurs on CrowdHuman val set, as shown in Fig. 5(c). Moreover, due to the removal of hand-designed post-processing, Fig. 5(b) demonstrates the superiority of our method in recall rate against the NMS based methods.

Figure 5. The comparison graphs of performance w.r.t. training duration: (a) mAP on COCO val set; (b) mAR on COCO val set; (c) AP50 on CrowdHuman val set. Curves are shown for RetinaNet, FCOS and our method, each with and without NMS. The horizontal axis corresponds to the training iterations. All the models are based on the ResNet-50 backbone. The threshold of NMS is set to 0.6.

Table 7. The experiments of the proposed framework with larger backbones on COCO 2017 test-dev set. The hyper-parameters of all the models follow the official settings.

Backbone        | Model          | Epochs | mAP
ResNet-101      | RetinaNet [22] | 36     | 41.0
ResNet-101      | FCOS [46]      | 36     | 43.1
ResNet-101      | DETR [3]       | 500    | 43.5
ResNet-101      | Ours (w/o NMS) | 36     | 43.6
ResNeXt-101+DCN | RetinaNet [22] | 24     | 44.5
ResNeXt-101+DCN | FCOS [46]      | 24     | 46.5
ResNeXt-101+DCN | Ours (w/o NMS) | 24     | 47.6

Table 8. The comparison of fully convolutional detectors on CrowdHuman val set. All models are based on the ResNet-50 backbone. ‘Aux’ indicates the auxiliary loss.

Method                | Epochs | AP50 | mMR  | Recall
RetinaNet [22]        | 32     | 81.7 | 57.6 | 88.6
FCOS [46]             | 32     | 86.1 | 54.9 | 94.2
ATSS [50]             | 32     | 87.2 | 49.7 | 94.0
DETR [3]              | 300    | 72.8 | 80.1 | 82.7
Ground-truth (w/ NMS) | -      | -    | -    | 95.1
POTO                  | 32     | 88.5 | 52.2 | 96.3
POTO+3DMF             | 32     | 88.8 | 51.0 | 96.6
POTO+3DMF+Aux         | 32     | 89.1 | 48.9 | 96.5

4.2.4 Larger Backbone

To further demonstrate the robustness and effectiveness of our method, we provide experiments with larger backbones. The detailed results are reported in Tab. 7. Concretely, when using ResNet-101 as the backbone, our method is slightly superior to FCOS, by 0.5% mAP. But when introducing a stronger backbone, i.e., ResNeXt-101 [49] with deformable convolutions [59], our end-to-end detector achieves 1.1% mAP absolute gains over FCOS with NMS. This might be attributed to the flexible spatial modeling of deformable convolutions. Moreover, the proposed 3DMF is efficient and easy to implement: the 3DMF module introduces only a slight computational overhead against the baseline detector with NMS.

4.3. Evaluation on CrowdHuman

We evaluate our model on the CrowdHuman dataset [37], a large human detection dataset with various kinds of occlusions. Compared with the COCO dataset, CrowdHuman has more complex and crowded scenes, posing severe challenges to conventional duplicate removal. Our end-to-end detector is more robust and flexible in crowded scenes. As shown in Tab. 8 and Fig. 5, our method significantly outperforms several state-of-the-art detectors with NMS, e.g., with 3.0% AP50 and 6.0% mMR absolute gains over FCOS. Moreover, the recall rate of our method is even superior to that of the ground-truth boxes with NMS.

5. Conclusion

This paper has presented a prediction-aware one-to-one label assignment and a 3D Max Filtering to bridge the gap between fully convolutional networks and end-to-end object detection. With the auxiliary loss, our end-to-end framework achieves superior performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. Our method also demonstrates great potential in complex and crowded scenes, which may benefit many other instance-level tasks.

Acknowledgement

This research was supported by the National Key R&D Program of China (No. 2017YFA0700800), the National Natural Science Foundation of China (No. 61790563 and 61751401) and the Beijing Academy of Artificial Intelligence (BAAI).
References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: Improving object detection with one line of code. In IEEE International Conference on Computer Vision, 2017.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[5] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In IEEE International Conference on Computer Vision, 2019.
[6] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[7] Zheng Ge, Jianfeng Wang, Xin Huang, Songtao Liu, and Osamu Yoshie. LLA: Loss-aware label assignment for dense pedestrian detection. arXiv preprint arXiv:2101.04307, 2021.
[8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[9] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, 2015.
[10] Jun Han and Claudio Moraga. The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks, 1995.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[13] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[14] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[15] Xin Huang, Zheng Ge, Zequn Jie, and Osamu Yoshie. NMS by representative region: Towards crowded pedestrian detection by proposal pairing. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[17] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with IoU prediction for object detection. arXiv preprint arXiv:2007.08103, 2020.
[18] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In European Conference on Computer Vision, 2018.
[19] Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, and Larry S Davis. Learning from noisy anchors for one-stage object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[20] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[24] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive NMS: Refining pedestrian detection in a crowd. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.
[26] Eunbyung Park and Alexander C Berg. Learning to decompose for object detection and instance segmentation. arXiv preprint arXiv:1511.06449, 2015.
[27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
[28] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[29] Han Qiu, Yuchen Ma, Zeming Li, Songtao Liu, and Jian Sun. BorderDet: Border feature for dense object detection. In European Conference on Computer Vision, 2020.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[31] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[32] Mengye Ren and Richard S Zemel. End-to-end instance segmentation with recurrent attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[34] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[35] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In European Conference on Computer Vision, 2016.
[36] Amaia Salvador, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier Giro-i Nieto. Recurrent neural networks for semantic instance segmentation. arXiv preprint arXiv:1712.00617, 2017.
[37] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[38] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. Fine-grained dynamic head for object detection. In Advances in Neural Information Processing Systems, 2020.
[39] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, and Nanning Zheng. Rethinking learnable tree filter for generic feature transform. In Advances in Neural Information Processing Systems, 2020.
[40] Lin Song, Yanwei Li, Zeming Li, Gang Yu, Hongbin Sun, Jian Sun, and Nanning Zheng. Learnable tree filter for structure-preserving feature transform. In Advances in Neural Information Processing Systems, 2019.
[41] Lin Song, Shiwei Zhang, Gang Yu, and Hongbin Sun. TACNet: Transition-aware context network for spatio-temporal action detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[42] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Cengage Learning, 2014.
[43] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[45] Christian Szegedy, Scott Reed, Dumitru Erhan, Dragomir Anguelov, and Sergey Ioffe. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
[46] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In IEEE International Conference on Computer Vision, 2019.
[47] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, 2018.
[48] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[49] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[50] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[51] Shiwei Zhang, Lin Song, Changxin Gao, and Nong Sang. GLNet: Global local network for weakly supervised action localization. IEEE Transactions on Multimedia, 2019.
[52] Songyang Zhang, Shipeng Yan, and Xuming He. LatentGNN: Learning efficient non-local relations for visual recognition. In International Conference on Machine Learning, 2019.
[53] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, 2019.
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[55] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[56] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[57] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
[58] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[59] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Figure 6. The prediction visualizations of different detectors on CrowdHuman val set: (a) ground-truth; (b) FCOS baseline; (c) ours. Our method demonstrates superiority in crowded scenes. All the models are based on the ResNet-50 backbone. The threshold of the classification score for visualization is set to 0.3.

Figure 7. The prediction visualizations of different detectors on COCO val set: (a) ground-truth; (b) FCOS baseline; (c) ours. Compared with the FCOS framework, our end-to-end detector produces far fewer duplicate predictions, which is crucial for downstream instance-aware tasks. All the models are based on the ResNet-50 backbone. The threshold of the classification score for visualization is set to 0.3.
A. Auxiliary Loss

In this section, we evaluate different one-to-many label assignment rules for the auxiliary loss. The detailed implementations are as follows:

FCOS. We adopt the assignment rule in FCOS [46].
ATSS. We adopt the assignment rule in ATSS [50].
Quality-ATSS. The rule elaborated in Sec. 3.2.3.
Quality-FCOS. Similar to FCOS, each ground-truth instance is assigned to the pixels in the pre-defined central area of a specific FPN stage, but the specific FPN stage is selected according to the proposed quality instead of the size of the instance.
Quality-Top-k. Each ground-truth instance is assigned to the pixels with the top-k highest qualities over all the FPN stages. We set k = 9 to align with the other rules.

As shown in Tab. 9, the results demonstrate the superiority of our proposed prediction-aware quality function over the hand-designed matching metrics. Compared with the standard ATSS rule, the quality based rule obtains 1.3% mAP absolute gains.

Table 9. The results of different one-to-many label assignment rules for the auxiliary loss on COCO val set. All the models are based on the ResNet-50 backbone. ‘/’ is used to distinguish between results without and with NMS.

Rule type        | Method        | mAP         | AP50        | AP75
-                | None          | 39.8 / 40.0 | 57.4 / 59.1 | 43.6 / 43.1
Hand-designed    | FCOS [46]     | 39.4 / 39.8 | 57.0 / 59.1 | 43.4 / 43.0
Hand-designed    | ATSS [50]     | 39.8 / 40.1 | 57.5 / 59.5 | 44.1 / 43.4
Prediction-aware | Quality-FCOS  | 39.7 / 40.0 | 57.7 / 59.6 | 43.6 / 43.0
Prediction-aware | Quality-ATSS  | 41.1 / 41.2 | 59.0 / 60.7 | 45.4 / 44.8
Prediction-aware | Quality-Top-k | 40.7 / 41.0 | 58.7 / 60.4 | 44.9 / 44.3

B. Comparison to DETR

As shown in Tab. 10 and Tab. 11, we compare different methods based on the ResNet-50 backbone, where NMS is not utilized except for FCOS.

Table 10. The comparison on COCO val set.

Method    | Epochs | mAP  | APs  | APm  | APl  | #Param
DETR [3]  | 500    | 42.0 | 20.5 | 45.8 | 61.1 | 41.5 M
FCOS [46] | 36     | 41.1 | 25.9 | 44.8 | 52.3 | 36.4 M
Ours      | 36     | 41.5 | 26.4 | 44.7 | 52.8 | 37.0 M
Ours *    | 36     | 43.5 | 26.3 | 46.6 | 55.4 | 40.3 M

* Adopts two extra deformable convolutions in the head.

Table 11. The comparison on CrowdHuman val set.

Method   | Queries | Epochs | AP50 | mMR  | Recall
DETR [3] | 100     | 300    | 72.8 | 80.1 | 82.7
DETR     | 200     | 300    | 78.8 | 66.3 | 90.2
DETR     | 300     | 300    | 70.6 | 79.1 | 89.7
Ours     | -       | 32     | 89.1 | 48.9 | 96.5

Compared with transformers, convolutions have been extensively tested in vision applications and have many variants that outperform DETR, e.g., deformable convolutions [59] in Tab. 10. Moreover, as shown in Tab. 11, our framework has great advantages over DETR [3] in convergence speed and in crowded scenes.
