End-to-End Object Detection With Fully Convolutional Network

Jianfeng Wang1* Lin Song2*† Zeming Li1 Hongbin Sun2 Jian Sun1 Nanning Zheng2
1 Megvii Technology  2 Xi'an Jiaotong University
[email protected]  [email protected]
{lizeming,sunjian}@megvii.com  {hsun,nnzheng}@mail.xjtu.edu.cn
Abstract

Mainstream detectors [22, 46, 50, 21] adopt a one-to-many label assignment rule, i.e., assigning many predictions as foreground samples for one ground-truth instance. This rule provides adequate foreground samples to obtain a strong and robust feature representation. Nevertheless, the massive foreground samples lead to duplicate predicted boxes for a single instance, which prevents end-to-end detection. To demonstrate this, we first give an empirical comparison of existing hand-designed label assignments. We find that one-to-one label assignment plays a crucial role in eliminating the post-processing of duplicate removal. However, the hand-designed one-to-one assignment still has a drawback: the fixed assignment could cause ambiguity and reduce the discriminability of features, since the predefined regions of an instance may not be the best choice [17] for training. To solve this issue, we propose a prediction-aware one-to-one (POTO) label assignment, which dynamically assigns the foreground samples according to the quality of classification and regression simultaneously.

Furthermore, for the modern FPN-based detector [46], extensive experiments demonstrate that the duplicate bounding boxes mainly come from the regions near the most confident prediction across adjacent scales. Therefore, we design a 3D Max Filtering (3DMF), which can be embedded into the FPN head as a differentiable module. This module improves the discriminability of convolution in local regions by using a simple 3D max filtering operator across adjacent scales. Besides, to provide adequate supervision for feature representation learning, we modify a one-to-many assignment into an auxiliary loss.

With the proposed techniques, our end-to-end detection framework achieves competitive performance against many state-of-the-art detectors. On the COCO [23] dataset, our end-to-end detector based on the FCOS framework [46] and a ResNeXt-101 [49] backbone remarkably outperforms the baseline with NMS by 1.1% mAP. Furthermore, our end-to-end detector is more robust and flexible for crowded detection. To demonstrate its superiority in crowded scenes, we conduct further experiments on the CrowdHuman [37] dataset. With a ResNet-50 backbone, our end-to-end detector achieves 3.0% AP50 and 6.0% mMR absolute gains over the FCOS baseline with NMS.

2. Related Work

2.1. Fully Convolutional Object Detector

Owing to the success of convolution networks [11, 40, 41, 39, 20, 51, 52], object detection has achieved tremendous progress during the last decade. Modern one-stage [22, 25, 31, 38, 29, 7] and two-stage detectors [33, 21, 2] heavily rely on anchors or anchor-based proposals. In these detectors, the anchor boxes are made up of pre-defined sliding windows, which are assigned as foreground or background samples with bounding box offsets. Due to the hand-designed and data-independent anchor boxes, the training targets of anchor-based detectors are typically sub-optimal and require careful tuning of hyper-parameters. Recently, FCOS [46] and CornerNet [18] gave a different perspective on fully convolutional detectors by introducing an anchor-free framework. Nevertheless, these frameworks still need a hand-designed post-processing step for duplicate removal, i.e., non-maximum suppression (NMS). Since NMS is a heuristic approach and adopts a constant threshold for all instances, it needs careful tuning and might not be robust, especially in crowded scenes. In contrast, based on the anchor-free framework, this paper proposes a prediction-aware one-to-one assignment rule for classification to discard the non-trainable NMS.

2.2. End-to-End Object Detection

To achieve end-to-end detection, many approaches have been explored in the previous literature. Concretely, in earlier research, numerous detection frameworks based on recurrent neural networks [43, 35, 26, 32, 36] attempt to produce a set of bounding boxes directly. Although they allow end-to-end learning in principle, they have only been demonstrated effective on some small datasets and not against modern baselines [46, 8]. Meanwhile, Learnable NMS [13] was proposed to learn duplicate removal by using a very deep and complex network, which achieves comparable performance to NMS. But it is constructed from discrete components and does not give an effective solution for end-to-end training. Recently, the relation network [14] and DETR [3] apply the attention mechanism to object detection, modeling pairwise relations between different predictions. By using one-to-one assignment rules and direct set losses, they do not need any additional post-processing steps. Nevertheless, when handling massive predictions, these methods incur a very high cost, making them unsuitable for dense prediction frameworks. Due to the lack of image prior and multi-scale fusion mechanisms, DETR also suffers from a much longer training duration than mainstream detectors and lower performance on small objects. Different from the approaches mentioned above, our method is the first to enable end-to-end object detection based on a fully convolutional network.

3. Methodology

3.1. Analysis on Label Assignment

To reveal the effect of label assignment on end-to-end object detection, we conduct several ablation studies of conventional label assignments on the COCO [23] dataset. As shown in Tab. 1, all the experiments are based on the FCOS [46] framework, whose centerness branch is removed to achieve a head-to-head comparison. The results demonstrate the superiority of one-to-many assignment for feature representation and the potential of one-to-one assignment for discarding NMS. The detailed analysis is elaborated in the following sections.
Table 1. The comparison of different label assignment rules for end-to-end object detection on COCO val set. ∆ indicates the gap between with and without NMS. 'Aux' is the proposed auxiliary loss. All models are based on the ResNet-50 backbone with 180k training iterations.

| Assignment Rule | Method           |                      | mAP w/ NMS | mAP w/o NMS | ∆     | mAR w/ NMS | mAR w/o NMS | ∆    |
|-----------------|------------------|----------------------|------------|-------------|-------|------------|-------------|------|
| One-to-many     | Hand-designed    | FCOS [46] baseline * | 40.5       | 12.1        | -28.4 | 58.3       | 52.8        | -5.5 |
| One-to-one      | Hand-designed    | Anchor               | 37.2       | 35.8        | -1.4  | 57.0       | 59.2        | +2.2 |
| One-to-one      | Hand-designed    | Center               | 37.2       | 33.6        | -3.6  | 57.8       | 59.7        | +1.9 |
| One-to-one      | Prediction-aware | Foreground loss      | 38.3       | 37.1        | -1.2  | 58.6       | 61.4        | +2.8 |
| One-to-one      | Prediction-aware | POTO                 | 38.6       | 38.0        | -0.6  | 57.9       | 60.5        | +2.6 |
| One-to-one      | Prediction-aware | POTO+3DMF            | 40.0       | 39.8        | -0.2  | 58.8       | 60.9        | +2.1 |
| Mixture **      | Prediction-aware | POTO+3DMF+Aux        | 41.2       | 41.1        | -0.1  | 58.9       | 61.2        | +2.3 |

* We remove its centerness branch to achieve a head-to-head comparison.
** We adopt a one-to-one assignment in POTO and a one-to-many assignment in the auxiliary loss, respectively.
3.1.1 One-to-many Label Assignment

Since NMS post-processing is widely adopted in dense prediction frameworks [21, 22, 58, 53, 46, 50, 57, 29, 7], one-to-many label assignment has become the conventional way to assign training targets. The adequate foreground samples lead to a strong and robust feature representation. However, when NMS is discarded, the redundant foreground samples of one-to-many label assignment produce duplicate false-positive predictions that cause a dramatic drop in performance, e.g., a 28.4% mAP absolute drop for the FCOS [46] baseline. In addition, the mAR reported in Tab. 1 indicates the recall rate over the top-100-scoring predictions. Without NMS, the one-to-many assignment rule leads to numerous duplicate predictions with high scores, thus reducing the recall rate. Therefore, it is hard for the detector to achieve competitive end-to-end detection by relying only on the one-to-many assignment.
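For reference, the duplicate-removal step being discarded is the classic greedy NMS. The following is a minimal NumPy sketch, a textbook implementation for illustration rather than the authors' code; the constant `iou_threshold=0.6` mirrors the NMS threshold used elsewhere in the paper and is exactly the hand-tuned, instance-agnostic parameter criticized above:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.6):
    """Greedy NMS over boxes in (x1, y1, x2, y2) format."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current top box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * \
                 (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Suppress every remaining box overlapping the kept one too much.
        order = rest[iou <= iou_threshold]
    return keep
```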
3.1.2 Hand-designed One-to-one Label Assignment

MultiBox [45] and YOLO [30] demonstrate the potential of applying one-to-one label assignment to a dense prediction framework. In this paper, we evaluate two one-to-one label assignment rules to reveal the underlying connection with discarding NMS. These rules are modified from two widely used one-to-many label assignments: the Anchor rule and the Center rule. Concretely, the Anchor rule is based on RetinaNet [22]: each ground-truth instance is assigned only to the anchor with the maximum Intersection-over-Union (IoU). The Center rule is based on FCOS [46]: each ground-truth instance is assigned only to the pixel closest to the center of the instance in the pre-defined feature layer. In both cases, all other anchors or pixels are set as background samples.
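To make the Center rule concrete, here is a minimal PyTorch sketch of our reading of it; the tensor layout is an assumption, and the pre-selection of the FPN layer by instance size is omitted:

```python
import torch

def center_rule_one_to_one(gt_boxes, locations):
    """Hand-designed 'Center' one-to-one rule (sketch).

    gt_boxes:  (G, 4) tensor in (x1, y1, x2, y2) format.
    locations: (N, 2) tensor of (x, y) pixel centers from the assigned
               feature layer (layer selection is omitted here).
    Returns a (G,) tensor: the single foreground location per
    ground-truth instance; all other locations are background.
    """
    centers = (gt_boxes[:, :2] + gt_boxes[:, 2:]) / 2  # (G, 2) box centers
    dist = torch.cdist(centers, locations)             # (G, N) L2 distances
    return dist.argmin(dim=1)                          # nearest pixel per GT
```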
As shown in Tab. 1, compared with one-to-many label assignment, one-to-one label assignment allows fully convolutional detectors without NMS to greatly reduce the gap between with and without NMS and to achieve reasonable performance. For instance, the detector based on the Center rule achieves a 21.5% mAP absolute gain over the FCOS baseline. Besides, as it avoids the erroneous suppression of NMS in complex scenes, the recall rate is further increased. Nevertheless, two issues remain unresolved. First, when one-to-one label assignment is applied, the performance gap between detectors with and without NMS remains non-negligible. Second, due to the weaker supervision for each instance, the performance of one-to-one label assignment is still inferior to the FCOS baseline.

3.2. Our Methods

In this paper, to enable competitive end-to-end object detection, we propose a mixture label assignment and a new 3D Max Filtering (3DMF). The mixture label assignment is made up of the proposed prediction-aware one-to-one (POTO) label assignment and a modified one-to-many label assignment (auxiliary loss). With these techniques, our end-to-end framework can discard NMS post-processing while keeping a strong feature representation.

3.2.1 Prediction-aware One-to-one Label Assignment

The hand-designed one-to-one label assignment follows a fixed rule. However, this rule may be sub-optimal for various instances in complex scenes, e.g., the Center rule for an eccentric object [17]. Thus, if the assignment procedure is forced to assign a sub-optimal prediction as the unique foreground sample, the difficulty for the network to converge could be dramatically increased, leading to more false-positive predictions. To this end, we propose a new rule named Prediction-aware One-To-One (POTO) label assignment, which dynamically assigns samples according to the quality of predictions.
Figure 2. The diagram of the head with 3D Max Filtering (3DMF) in an FPN stage. 'POTO' indicates the proposed Prediction-aware One-to-one Label Assignment rule to achieve end-to-end detection. 'Conv + σ' denotes a convolution layer followed by a sigmoid function [10], which outputs coarse classification scores. 'Aux Loss' is the proposed auxiliary loss to improve feature representation. The dotted lines highlight the additional components of the training phase, which are discarded in the inference phase.
Let Ψ denote the index set of all the predictions. G and N correspond to the number of ground-truth instances and predictions, respectively, where typically G ≪ N in dense prediction detectors. π ∈ Π^N_G indicates a G-permutation of the N predictions. Our POTO aims to generate a suitable permutation π̂ of predictions as the foreground samples. The training loss is formulated as Eq. 1, which consists of the foreground loss L_fg and the background loss L_bg:

    L = \sum_{i}^{G} L_{fg}\big(\hat{p}_{\hat{\pi}(i)}, \hat{b}_{\hat{\pi}(i)} \mid c_i, b_i\big) + \sum_{j \in \Psi \setminus R(\hat{\pi})} L_{bg}\big(\hat{p}_j\big),   (1)

where R(π̂) denotes the corresponding index set of the assigned foreground samples. For the i-th ground-truth, c_i and b_i are its category label and bounding box coordinates, respectively, while for the π̂(i)-th prediction, p̂_π̂(i) and b̂_π̂(i) correspond to its predicted classification scores and predicted box coordinates, respectively.

To achieve competitive end-to-end detection, we need to find a suitable label assignment π̂. As shown in Eq. 2, previous works [6, 3] treat it as a bipartite matching problem by using the foreground loss [22, 34] as the matching cost, which can be rapidly solved by the Hungarian algorithm [43]:

    \hat{\pi} = \arg\min_{\pi \in \Pi^N_G} \sum_{i}^{G} L_{fg}\big(\hat{p}_{\pi(i)}, \hat{b}_{\pi(i)} \mid c_i, b_i\big).   (2)

However, the foreground loss typically needs additional weights to alleviate optimization issues, e.g., unbalanced training samples and the joint training of multiple tasks. As shown in Tab. 1, this property makes the training loss not the optimal choice for the matching cost. Therefore, as presented in Eq. 3 and Eq. 4, we propose a cleaner and more effective formulation (POTO) to find a better assignment:

    \hat{\pi} = \arg\max_{\pi \in \Pi^N_G} \sum_{i}^{G} Q_{i,\pi(i)},   (3)

where

    Q_{i,\pi(i)} = \underbrace{\mathbb{1}\big[\pi(i) \in \Omega_i\big]}_{\text{spatial prior}} \cdot \underbrace{\hat{p}_{\pi(i)}(c_i)^{1-\alpha}}_{\text{classification}} \cdot \underbrace{\mathrm{IoU}\big(b_i, \hat{b}_{\pi(i)}\big)^{\alpha}}_{\text{regression}}.   (4)

Here Q_{i,π(i)} ∈ [0, 1] represents the proposed matching quality of the i-th ground-truth with the π(i)-th prediction. It considers the spatial prior, the confidence of classification, and the quality of regression simultaneously. Ω_i indicates the set of candidate predictions for the i-th ground-truth, i.e., the spatial prior. The spatial prior is widely used in the training phase [21, 22, 58, 53, 46, 50]. For instance, the center sampling strategy adopted in FCOS [46] only considers predictions in the central portion of a ground-truth instance as foreground samples. We also apply it in POTO to achieve higher performance, but it is not necessary for discarding NMS (for details, refer to Sec. 4.2.2). To achieve a balance, we define the quality as the weighted geometric mean of the classification score p̂_π(i)(c_i) and the regression quality IoU(b_i, b̂_π(i)) in Eq. 4. The hyper-parameter α ∈ [0, 1] adjusts the ratio between classification and regression, where α = 0.8 is adopted by default; more ablation studies are elaborated in Sec. 4.2.2. As shown in Tab. 1, POTO not only narrows the gap with NMS but also improves performance.
Table 2. Comparison of different configurations for NMS post-processing on COCO val set. 'Across scales' indicates applying NMS to multiple adjacent stages of the feature pyramid network. 'Spatial range' denotes the spatial range for duplicate removal in each scale.

| Model     | Across scales | Spatial range | mAP  |
|-----------|---------------|---------------|------|
| FCOS [46] | ✗             | 1×1           | 19.0 |
| FCOS [46] | ✗             | 3×3           | 37.4 |
| FCOS [46] | ✗             | 5×5           | 39.2 |
| FCOS [46] | ✗             | ∞×∞           | 39.2 |
| FCOS [46] | ✓             | ∞×∞           | 40.9 |
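Table 2 indicates that duplicates concentrate in a small spatial window and across adjacent scales, which is what motivates the 3D Max Filtering module; the defining text of 3DMF falls on a page missing from this excerpt, so the sketch below is only our reading of the operator, taking τ as the number of adjacent scales pooled and φ as the spatial window size (per Table 6), with the resizing and windowing details as assumptions:

```python
import torch
import torch.nn.functional as F

def max_filter_3d(fpn_feats, level, tau=2, phi=3):
    """3D max filtering across adjacent FPN scales (a sketch).

    fpn_feats: list of (B, C, H_l, W_l) maps, one per FPN level.
    tau: number of adjacent scales pooled (tau = 0 degenerates to 2D
         filtering on a single scale, matching the note in Tab. 6).
    phi: spatial window size.
    Returns a max-filtered map shaped like fpn_feats[level].
    """
    b, c, h, w = fpn_feats[level].shape
    lo = max(0, level - tau // 2)
    hi = min(len(fpn_feats), level + tau - tau // 2 + 1)
    # Resize the neighboring scales to the current resolution and stack
    # them along a new depth axis: (B, C, D, H, W).
    stack = torch.stack(
        [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
         for f in fpn_feats[lo:hi]],
        dim=2,
    )
    # A single 3D max pooling takes the max over the scale neighborhood
    # and a phi x phi spatial window around every location.
    out = F.max_pool3d(
        stack,
        kernel_size=(stack.shape[2], phi, phi),
        stride=1,
        padding=(0, phi // 2, phi // 2),
    )
    return out.squeeze(2)
```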
[Figure: panels 'Image', '(a) FCOS baseline', '(b) POTO']
3.2.3 Auxiliary Loss

…a one-to-many label assignment to provide adequate supervision, which is illustrated in Fig. 2.

Similar to ATSS [50], our auxiliary loss adopts the focal loss [22] with a modified one-to-many label assignment. Specifically, the one-to-many label assignment first takes the top-9 predictions as candidates in each FPN stage, according to the proposed matching quality in Eq. 4. It then assigns as foreground samples the candidates whose matching qualities exceed a statistical threshold. The statistical threshold is calculated as the sum of the mean and the standard deviation of all the candidate matching qualities. In addition, different forms of one-to-many label assignment for the auxiliary loss are reported in the supplementary material.
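A minimal sketch of this candidate selection for a single ground-truth instance follows; the per-level 1-D quality layout is an assumption:

```python
import torch

def select_aux_foreground(quality_per_level, topk=9):
    """One-to-many assignment of the auxiliary loss (sketch).

    quality_per_level: list of (N_l,) matching-quality tensors (Eq. 4)
        between one ground-truth instance and the predictions of each
        FPN level.
    Returns a boolean mask over the concatenated predictions marking
    the auxiliary foreground samples for this instance.
    """
    # Step 1: top-9 candidates per FPN stage.
    cand_idx, offset = [], 0
    for q in quality_per_level:
        k = min(topk, q.numel())
        cand_idx.append(q.topk(k).indices + offset)
        offset += q.numel()
    cand_idx = torch.cat(cand_idx)
    all_q = torch.cat(quality_per_level)
    cand_q = all_q[cand_idx]
    # Step 2: keep candidates whose quality exceeds mean + std over
    # all candidates (the statistical threshold described above).
    threshold = cand_q.mean() + cand_q.std()
    mask = torch.zeros_like(all_q, dtype=torch.bool)
    mask[cand_idx[cand_q > threshold]] = True
    return mask
```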
4. Experiments

4.1. Implementation Details

As in FCOS [46], our detector adopts a pair of 4-convolution heads for classification and regression, respectively. The output channels of the first and second convolutions in 3DMF are 256 and 1, respectively. All backbones are pre-trained on the ImageNet dataset [4] with frozen batch normalization [16]. In the training phase, input images are resized so that their shorter side is 800 pixels. All training hyper-parameters are identical to the 2x schedule (180k iterations) in Detectron2 [48] unless otherwise mentioned.

4.2. Ablation Studies on COCO

4.2.1 Visualization

As shown in Fig. 4, we present the visualization of classification scores from the FCOS baseline and our proposed framework. For a single instance, the FCOS baseline with its one-to-many assignment rule outputs massive duplicate predictions, which are highly activated and have scores comparable to the most confident one. These duplicate predictions are evaluated as false positives and greatly affect performance. In contrast, with the proposed POTO rule, the scores of duplicate samples are significantly suppressed. This property is crucial for the detector to achieve direct bounding box prediction without NMS. Moreover, with the proposed 3DMF module, this property is further enhanced, especially in the regions near the most confident prediction. Besides, since the 3DMF module introduces a multi-scale competitive mechanism, the detector produces unique predictions across different FPN stages, e.g., an instance in Fig. 4 has a single highly activated score across the various stages.

4.2.2 Prediction-Aware One-to-One Label Assignment

Spatial prior. As shown in Tab. 3, for the spatial range of assignment, the center sampling strategy is superior to the inside-box and global strategies on the COCO dataset. This reflects that the prior knowledge of images is essential in real-world scenarios.

Classification vs. regression. The hyper-parameter α, as…
Table 3. Results of POTO with different configurations of α and spatial prior on COCO val set. α = 0 is equivalent to considering classification alone; α = 1 is equivalent to considering regression alone. 'center sampling' and 'inside box' both follow FCOS [46]. '/' distinguishes results without and with NMS.

| α   | center sampling | inside box  | global          |
|-----|-----------------|-------------|-----------------|
| 0.0 | 33.5 / 33.6     | 24.1 / 24.2 | 1.9 / 2.1       |
| 0.2 | 33.7 / 33.9     | 28.8 / 28.8 | 19.4 / 19.5     |
| 0.4 | 35.0 / 35.2     | 32.7 / 32.8 | 28.3 / 28.4     |
| 0.6 | 36.6 / 36.9     | 35.3 / 35.5 | 34.7 / 34.9     |
| 0.8 | 38.0 / 38.6     | 37.4 / 37.9 | 37.3 / 37.9     |
| 1.0 | 11.8 / 29.7     | 4.5 / 13.0  | non-convergence |

Table 4. The effect of various quality functions on COCO val set. '/' distinguishes results without and with NMS. 'Add' and 'Mul' indicate two fusion functions.

| Method | α   | mAP         | AP50        | AP75        |
|--------|-----|-------------|-------------|-------------|
| Add    | 0.2 | 36.0 / 36.2 | 55.7 / 57.0 | 38.7 / 38.3 |
| Add    | 0.5 | 37.3 / 37.8 | 54.9 / 57.4 | 40.5 / 40.4 |
| Add    | 0.8 | 29.3 / 35.6 | 40.3 / 53.4 | 32.8 / 38.4 |
| Mul    | 0.8 | 38.0 / 38.6 | 55.2 / 57.6 | 41.4 / 41.3 |

Table 5. The effect of sub-modules in the proposed 3DMF module on COCO val set. '3DMF' and 'Aux Loss' indicate using the 3D Max Filtering and the auxiliary loss, respectively. '/' distinguishes results without and with NMS.

| Model     | 3DMF | Aux Loss | mAP         |
|-----------|------|----------|-------------|
| FCOS [46] | ✗    | ✗        | 19.0 / 40.9 |
| FCOS [46] | ✗    | ✓        | 18.9 / 41.3 |
| FCOS [46] | ✓*   | ✗        | 38.7 / 40.0 |
| Ours      | ✗    | ✗        | 38.0 / 38.6 |
| Ours      | ✓    | ✗        | 39.8 / 40.0 |
| Ours      | ✓    | ✓        | 41.1 / 41.2 |

* We modify 3D Max Filtering into a post-processing step.

Table 6. The effect of hyper-parameters in the proposed 3DMF module on COCO val set. τ = 0 is equivalent to applying 2D Max Filtering to transform features on a single scale. '/' distinguishes results without and with NMS.

|       | φ = 1       | φ = 3       | φ = 5       |
|-------|-------------|-------------|-------------|
| τ = 0 | 39.2 / 39.5 | 39.1 / 39.5 | 39.0 / 39.4 |
| τ = 2 | 39.0 / 39.3 | 39.8 / 40.0 | 39.3 / 39.5 |
| τ = 4 | 39.1 / 39.3 | 39.3 / 39.4 | 39.4 / 39.6 |
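Table 4's 'Mul' row is the weighted geometric mean of Eq. 4. The exact 'Add' form is not given in this excerpt, so the weighted arithmetic mean in the sketch below is an assumption used only for illustration:

```python
import torch

def fuse_quality(cls_prob, iou, alpha=0.8, mode="mul"):
    """Two ways to fuse classification and regression quality (Tab. 4).

    cls_prob and iou are tensors of matching shape. 'mul' is the
    weighted geometric mean of Eq. 4; 'add' is an assumed weighted
    arithmetic-mean variant, since its exact form is not in this excerpt.
    """
    if mode == "mul":
        return cls_prob.pow(1.0 - alpha) * iou.pow(alpha)  # Eq. 4
    return (1.0 - alpha) * cls_prob + alpha * iou          # assumed 'Add'
```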
Figure 5. The comparison graphs of performance w.r.t. training duration: (a) mAP on COCO val set, (b) mAR on COCO val set, (c) AP50 on CrowdHuman val set. Curves: RetinaNet w/ NMS, RetinaNet w/o NMS, FCOS w/ NMS, FCOS w/o NMS, Ours w/ NMS, Ours w/o NMS. The horizontal axis corresponds to training iterations. All the models are based on the ResNet-50 backbone. The threshold of NMS is set to 0.6.
Table 7. Experiments of the proposed framework with larger backbones on COCO 2017 test-dev set. The hyper-parameters of all the models follow the official settings.

| Backbone        | Model          | Epochs | mAP  |
|-----------------|----------------|--------|------|
| ResNet-101      | RetinaNet [22] | 36     | 41.0 |
| ResNet-101      | FCOS [46]      | 36     | 43.1 |
| ResNet-101      | DETR [3]       | 500    | 43.5 |
| ResNet-101      | Ours (w/o NMS) | 36     | 43.6 |
| ResNeXt-101+DCN | RetinaNet [22] | 24     | 44.5 |
| ResNeXt-101+DCN | FCOS [46]      | 24     | 46.5 |
| ResNeXt-101+DCN | Ours (w/o NMS) | 24     | 47.6 |

Table 8. The comparison of fully convolutional detectors on CrowdHuman val set. All models are based on the ResNet-50 backbone. 'Aux' indicates the auxiliary loss.

| Method                | Epochs | AP50 | mMR  | Recall |
|-----------------------|--------|------|------|--------|
| RetinaNet [22]        | 32     | 81.7 | 57.6 | 88.6   |
| FCOS [46]             | 32     | 86.1 | 54.9 | 94.2   |
| ATSS [50]             | 32     | 87.2 | 49.7 | 94.0   |
| DETR [3]              | 300    | 72.8 | 80.1 | 82.7   |
| Ground-truth (w/ NMS) | -      | -    | -    | 95.1   |
| POTO                  | 32     | 88.5 | 52.2 | 96.3   |
| POTO+3DMF             | 32     | 88.8 | 51.0 | 96.6   |
| POTO+3DMF+Aux         | 32     | 89.1 | 48.9 | 96.5   |
References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: Improving object detection with one line of code. In IEEE International Conference on Computer Vision, 2017.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[5] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In IEEE International Conference on Computer Vision, 2019.
[6] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[7] Zheng Ge, Jianfeng Wang, Xin Huang, Songtao Liu, and Osamu Yoshie. LLA: Loss-aware label assignment for dense pedestrian detection. arXiv preprint arXiv:2101.04307, 2021.
[8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[9] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, 2015.
[10] Jun Han and Claudio Moraga. The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks, 1995.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[13] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[14] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[15] Xin Huang, Zheng Ge, Zequn Jie, and Osamu Yoshie. NMS by representative region: Towards crowded pedestrian detection by proposal pairing. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[17] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with IoU prediction for object detection. arXiv preprint arXiv:2007.08103, 2020.
[18] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In European Conference on Computer Vision, 2018.
[19] Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, and Larry S Davis. Learning from noisy anchors for one-stage object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[20] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[24] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive NMS: Refining pedestrian detection in a crowd. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.
[26] Eunbyung Park and Alexander C Berg. Learning to decompose for object detection and instance segmentation. arXiv preprint arXiv:1511.06449, 2015.
[27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
[28] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[29] Han Qiu, Yuchen Ma, Zeming Li, Songtao Liu, and Jian Sun. BorderDet: Border feature for dense object detection. In European Conference on Computer Vision, 2020.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[31] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[32] Mengye Ren and Richard S Zemel. End-to-end instance segmentation with recurrent attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[34] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[35] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In European Conference on Computer Vision, 2016.
[36] Amaia Salvador, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier Giro-i Nieto. Recurrent neural networks for semantic instance segmentation. arXiv preprint arXiv:1712.00617, 2017.
[37] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[38] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. Fine-grained dynamic head for object detection. In Advances in Neural Information Processing Systems, 2020.
[39] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, and Nanning Zheng. Rethinking learnable tree filter for generic feature transform. In Advances in Neural Information Processing Systems, 2020.
[40] Lin Song, Yanwei Li, Zeming Li, Gang Yu, Hongbin Sun, Jian Sun, and Nanning Zheng. Learnable tree filter for structure-preserving feature transform. In Advances in Neural Information Processing Systems, 2019.
[41] Lin Song, Shiwei Zhang, Gang Yu, and Hongbin Sun. TACNet: Transition-aware context network for spatio-temporal action detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[42] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Cengage Learning, 2014.
[43] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[45] Christian Szegedy, Scott Reed, Dumitru Erhan, Dragomir Anguelov, and Sergey Ioffe. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
[46] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In IEEE International Conference on Computer Vision, 2019.
[47] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, 2018.
[48] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[49] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[50] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[51] Shiwei Zhang, Lin Song, Changxin Gao, and Nong Sang. GLNet: Global local network for weakly supervised action localization. IEEE Transactions on Multimedia, 2019.
[52] Songyang Zhang, Shipeng Yan, and Xuming He. LatentGNN: Learning efficient non-local relations for visual recognition. In International Conference on Machine Learning, 2019.
[53] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, 2019.
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[55] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[56] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[57] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
[58] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[59] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Figure 6. The prediction visualizations of different detectors on CrowdHuman val set: (a) Ground-truth, (b) FCOS baseline, (c) Ours. Our method demonstrates superiority in the crowded scenes. All the models are based on the ResNet-50 backbone. The threshold of the classification score for visualization is set to 0.3.
Figure 7. The prediction visualizations of different detectors on COCO val set: (a) Ground-truth, (b) FCOS baseline, (c) Ours. Compared with the FCOS framework, our end-to-end detector obtains far fewer duplicate predictions, which is crucial for downstream instance-aware tasks. All the models are based on the ResNet-50 backbone. The threshold of the classification score for visualization is set to 0.3.
A. Auxiliary Loss

In this section, we evaluate different one-to-many label assignment rules for the auxiliary loss. The detailed implementations are as follows:

FCOS. We adopt the assignment rule in FCOS [46].
ATSS. We adopt the assignment rule in ATSS [50].
Quality-ATSS. The rule is elaborated in Sec. 3.2.3.
Quality-FCOS. Similar to FCOS, each ground-truth instance is assigned to the pixels in the pre-defined central area of a specific FPN stage, but that stage is selected according to the proposed quality instead of the size of the instance.
Quality-Top-k. Each ground-truth instance is assigned to the pixels with the top-k highest qualities over all the FPN stages; a sketch is given below. We set k = 9 to align with the other rules.

As shown in Tab. 9, the results demonstrate the superiority of our proposed prediction-aware quality function over the hand-designed matching metrics. Compared with the standard ATSS framework, the quality-based rule obtains 1.3% mAP absolute gains.
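The Quality-Top-k rule above reduces to a single pooled top-k. A minimal sketch for one ground-truth instance, assuming per-level 1-D quality vectors:

```python
import torch

def quality_top_k(quality_per_level, k=9):
    """'Quality-Top-k' auxiliary rule (sketch): one ground-truth instance
    is assigned to the k pixels with the highest matching quality (Eq. 4)
    over all FPN stages pooled together.

    quality_per_level: list of (N_l,) quality tensors, one per FPN stage.
    Returns indices into the concatenated predictions.
    """
    all_q = torch.cat(quality_per_level)
    return all_q.topk(min(k, all_q.numel())).indices
```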
B. Comparison to DETR

As shown in Tab. 10 and Tab. 11, we compare different methods based on the ResNet-50 backbone, where NMS is not utilized except for FCOS. …variants for better performance than DETR, e.g., deformable convolutions [59] in Tab. 10. Moreover, as shown in Tab. 11, our framework has great advantages over DETR [3] in convergence speed and in crowded scenes.

Table 11. The comparison on CrowdHuman val set.

| Method   | Queries | Epochs | AP50 | mMR  | Recall |
|----------|---------|--------|------|------|--------|
| DETR [3] | 100     | 300    | 72.8 | 80.1 | 82.7   |
| DETR     | 200     | 300    | 78.8 | 66.3 | 90.2   |
| DETR     | 300     | 300    | 70.6 | 79.1 | 89.7   |
| Ours     | -       | 32     | 89.1 | 48.9 | 96.5   |