MMDetection: Open MMLab Detection Toolbox and Benchmark

Kai Chen1, Jiaqi Wang1*, Jiangmiao Pang2*, Yuhang Cao1, Yu Xiong1, Xiaoxiao Li1, Shuyang Sun3, Wansen Feng4, Ziwei Liu1, Jiarui Xu5, Zheng Zhang6, Dazhi Cheng7, Chenchen Zhu8, Tianheng Cheng9, Qijie Zhao10, Buyu Li1, Xin Lu4, Rui Zhu11, Yue Wu12, Jifeng Dai6, Jingdong Wang6, Jianping Shi4, Wanli Ouyang3, Chen Change Loy13, Dahua Lin1

1 The Chinese University of Hong Kong, 2 Zhejiang University, 3 The University of Sydney, 4 SenseTime Research, 5 Hong Kong University of Science and Technology, 6 Microsoft Research Asia, 7 Beijing Institute of Technology, 8 Nanjing University, 9 Huazhong University of Science and Technology, 10 Peking University, 11 Sun Yat-sen University, 12 Northeastern University, 13 Nanyang Technological University

arXiv:1906.07155v1 [cs.CV] 17 Jun 2019
other codebases, especially for recent ones. A list is given as follows.

2.1. Single-stage Methods

• SSD [19]: a classic and widely used single-stage detector with a simple model architecture, proposed in 2015.
• RetinaNet [18]: a high-performance single-stage detector with Focal Loss, proposed in 2017.
• GHM [16]: a gradient harmonizing mechanism to improve single-stage detectors, proposed in 2019.
• FCOS [32]: a fully convolutional anchor-free single-stage detector, proposed in 2019.
• FSAF [39]: a feature selective anchor-free module for single-stage detectors, proposed in 2019.

2.2. Two-stage Methods

• Fast R-CNN [9]: a classic object detector which requires pre-computed proposals, proposed in 2015.
• Faster R-CNN [27]: a classic and widely used two-stage object detector which can be trained end-to-end, proposed in 2015.
• R-FCN [7]: a fully convolutional object detector with faster speed than Faster R-CNN, proposed in 2016.
• Mask R-CNN [13]: a classic and widely used object detection and instance segmentation method, proposed in 2017.
• Grid R-CNN [20]: a grid guided localization mechanism as an alternative to bounding box regression, proposed in 2018.
• Mask Scoring R-CNN [15]: an improvement over Mask R-CNN by predicting the mask IoU, proposed in 2019.
• Double-Head R-CNN [35]: different heads for classification and localization, proposed in 2019.

2.3. Multi-stage Methods

• Cascade R-CNN [2]: a powerful multi-stage object detection method, proposed in 2017.
• Hybrid Task Cascade [4]: a multi-stage multi-branch object detection and instance segmentation method, proposed in 2019.

2.4. General Modules and Methods

• Mixed Precision Training [22]: train deep neural networks using half precision floating point (FP16) numbers, proposed in 2018.
• Soft NMS [1]: an alternative to NMS, proposed in 2017.
• OHEM [29]: an online sampling method that mines hard samples for training, proposed in 2016.
• DCN [8]: deformable convolution and deformable RoI pooling, proposed in 2017.
• DCNv2 [42]: modulated deformable operators, proposed in 2018.
• Train from Scratch [12]: training from random initialization instead of ImageNet pretraining, proposed in 2018.
• ScratchDet [40]: another exploration on training from scratch, proposed in 2018.
• M2Det [38]: a new feature pyramid network to construct more effective feature pyramids, proposed in 2018.
• GCNet [3]: a global context block that can efficiently model the global context, proposed in 2019.
• Generalized Attention [41]: a generalized attention formulation, proposed in 2019.
• SyncBN [25]: synchronized batch normalization across GPUs; we adopt the official implementation by PyTorch.
• Group Normalization [36]: a simple alternative to BN, proposed in 2018.
• Weight Standardization [26]: standardizing the weights in the convolutional layers for micro-batch training, proposed in 2019.
• HRNet [30, 31]: a new backbone with a focus on learning reliable high-resolution representations, proposed in 2019.
• Guided Anchoring [34]: a new anchoring scheme that predicts sparse and arbitrary-shaped anchors, proposed in 2019.
• Libra R-CNN [23]: a new framework towards balanced learning for object detection, proposed in 2019.
Table 1: Supported features of different codebases. “X” means officially supported, “*” means supported in a forked repository, and blank means not supported.
3. Architecture

3.1. Model Representation

Although the model architectures of different detectors are different, they have common components, which can be roughly summarized into the following classes.

Backbone. The backbone is the part that transforms an image to feature maps, such as a ResNet-50 without the last fully connected layer.

Neck. The neck is the part that connects the backbone and heads. It performs some refinements or reconfigurations on the raw feature maps produced by the backbone. An example is Feature Pyramid Network (FPN).

DenseHead (AnchorHead/AnchorFreeHead). The DenseHead is the part that operates on dense locations of feature maps, including AnchorHead and AnchorFreeHead, e.g., RPNHead, RetinaHead, FCOSHead.

RoIExtractor. The RoIExtractor is the part that extracts RoI-wise features from a single or multiple feature maps with RoIPooling-like operators. An example that extracts RoI features from the corresponding level of feature pyramids is SingleRoIExtractor.

RoIHead (BBoxHead/MaskHead). The RoIHead is the part that takes RoI features as input and makes RoI-wise task-specific predictions, such as bounding box classification/regression and mask prediction.

With the above abstractions, the framework of single-stage and two-stage detectors is illustrated in Figure 1. We can develop our own methods by simply creating some new components and assembling existing ones.

Figure 1: Framework of single-stage and two-stage detectors, illustrated with abstractions in MMDetection. (Single-stage: Backbone {VGG, ResNet, ResNeXt, …} → Neck {FPN, BFP, …} → DenseHead {SSDHead, RetinaHead, FCOSHead, FSAFHead, …}. Two-stage: Backbone {ResNet, ResNeXt, …} → Neck {FPN, BFP, …} → DenseHead {RPNHead, GARPNHead, …} → RoIHead {BBoxHead, MaskHead, …}.)
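As an illustration of this component-based design, the following config-style sketch assembles a two-stage detector from a Backbone, Neck, DenseHead, RoIExtractor and RoIHead. The field names follow the general flavor of MMDetection config files, but the exact keys and values here are illustrative assumptions rather than a verbatim configuration.

```python
# Minimal, illustrative sketch of assembling a two-stage detector from the
# abstractions above (field names and values are assumptions, not an exact
# MMDetection config).
model = dict(
    type='FasterRCNN',
    backbone=dict(type='ResNet', depth=50, out_indices=(0, 1, 2, 3)),   # Backbone
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],           # Neck
              out_channels=256, num_outs=5),
    rpn_head=dict(type='RPNHead', in_channels=256),                     # DenseHead
    roi_head=dict(                                                      # RoIHead
        type='StandardRoIHead',
        bbox_roi_extractor=dict(type='SingleRoIExtractor',              # RoIExtractor
                                roi_layer=dict(type='RoIAlign', output_size=7),
                                out_channels=256,
                                featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(type='Shared2FCBBoxHead', num_classes=80),
    ),
)
```

Swapping the backbone, neck or heads for other registered components yields a different detector without touching the rest of the pipeline.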
3.2. Training Pipeline

We design a unified training pipeline with a hooking mechanism. This training pipeline can not only be used for object detection, but also for other computer vision tasks such as image classification and semantic segmentation.

The training processes of many tasks share a similar workflow, where training epochs and validation epochs run iteratively and validation epochs are optional. In each epoch, we forward and backward the model for many iterations. To make the pipeline more flexible and easy to customize, we define a minimum pipeline which just forwards the model repeatedly. Other behaviors are defined by a hooking mechanism. In order to run a custom training process, we may want to perform some self-defined operations before or after some specific steps. We define some timepoints where users may register any executable methods (hooks), including before_run, before_train_epoch, after_train_epoch, before_train_iter, after_train_iter, before_val_epoch, after_val_epoch, before_val_iter, after_val_iter and after_run. Registered hooks are triggered at the specified timepoints following their priority level. A typical training pipeline in MMDetection is shown in Figure 2. The validation epoch is not shown in the figure since we use evaluation hooks to test the performance after each epoch. If specified, it has the same pipeline as the training epoch.

Figure 2: Training pipeline. (The pipeline runs from start to end with the timepoints before_train_epoch, before_train_iter, after_train_iter and after_train_epoch around the model forward step; hooks shown include DistSamplerHook, LrUpdaterHook, IterTimerHook, OptimizerHook, LoggerHook, CheckpointHook and EvalmAPHook.)
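To make the hooking mechanism concrete, the sketch below registers a hook that reacts at two of the timepoints listed above. The class layout mirrors the timepoint names from the text, while the runner interface (runner.iter, runner.lr, runner.hooks) is a simplified assumption rather than the exact MMDetection API.

```python
# Illustrative sketch of the hooking mechanism (assumed, simplified interfaces;
# not the exact MMDetection/mmcv classes).
class Hook:
    """Base class: one no-op method per timepoint."""
    def before_train_epoch(self, runner): pass
    def after_train_iter(self, runner): pass
    # ... analogous methods for before_run, after_train_epoch, etc.

class LrWarmupHook(Hook):
    """Example hook: linearly warm up the learning rate over the first iterations."""
    def __init__(self, warmup_iters=500, base_lr=0.02):
        self.warmup_iters = warmup_iters
        self.base_lr = base_lr

    def after_train_iter(self, runner):
        if runner.iter < self.warmup_iters:
            runner.lr = self.base_lr * (runner.iter + 1) / self.warmup_iters

# A runner keeps a priority-sorted list of registered hooks and, at every
# timepoint of the minimum forward-only pipeline, calls e.g.
#     for hook in runner.hooks:
#         hook.after_train_iter(runner)
```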
4. Benchmarks

4.1. Experimental Setting

Dataset. MMDetection supports both VOC-style and COCO-style datasets. We adopt MS COCO 2017 as the primary benchmark for all experiments since it is more challenging and widely used. We use the train split for training and report the performance on the val split.

Implementation details. If not otherwise specified, we adopt the following settings. (1) Images are resized to a maximum scale of 1333 × 800, without changing the aspect ratio. (2) We use 8 V100 GPUs for training with a total batch size of 16 (2 images per GPU) and a single V100 GPU for inference. (3) The training schedule is the same as Detectron [10]. “1x” and “2x” mean 12 epochs and 24 epochs respectively. “20e” is adopted in cascade models, which denotes 20 epochs.

Evaluation metrics. We adopt the standard evaluation metrics for the COCO dataset, where multiple IoU thresholds from 0.5 to 0.95 are applied. The results of the region proposal network (RPN) are measured with Average Recall (AR) and detection results are evaluated with mAP.

4.2. Benchmarking Results

Main results. We benchmark different methods on COCO 2017 val, including SSD [19], RetinaNet [18], Faster R-CNN [27], Mask R-CNN [13], Cascade R-CNN [2], Hybrid Task Cascade [4] and FCOS [32]. We evaluate all results with four widely used backbones, i.e., ResNet-50 [14], ResNet-101 [14], ResNeXt-101-32x4d [37] and ResNeXt-101-64x4d [37]. We report the inference speed of these methods and bbox/mask AP in Figure 3. The inference time is tested on a single Tesla V100 GPU.

Comparison with other codebases. Besides MMDetection, there are also other popular codebases like Detectron [10], maskrcnn-benchmark [21] and SimpleDet [6]. They are built on the deep learning frameworks caffe2¹, PyTorch [24] and MXNet [5], respectively. We compare MMDetection with Detectron (@a6a835f), maskrcnn-benchmark (@c8eff2c) and SimpleDet (@cf4fce4) from three aspects: performance, speed and memory. Mask R-CNN and RetinaNet are taken as representatives of two-stage and single-stage detectors. Since these codebases are also under development, the reported results in their model zoos may be outdated, and those results were tested on different hardware. For a fair comparison, we pull the latest code and test it in the same environment. Results are shown in Table 2. The memory reported by different frameworks is measured in different ways. MMDetection reports the maximum memory of all GPUs, maskrcnn-benchmark reports the memory of GPU 0, and these two adopt the PyTorch API “torch.cuda.max_memory_allocated()”.

¹ https://github.com/facebookarchive/caffe2
Figure 3: Benchmarking results of different methods. Each method is tested with four different backbones. (Methods shown include Faster R-CNN, Fast R-CNN, Mask R-CNN, Fast Mask R-CNN, RetinaNet, Cascade R-CNN, Cascade Mask R-CNN, Mask Scoring R-CNN and Hybrid Task Cascade.)
Detectron reports the GPU memory with the caffe2 API “caffe2.python.utils.GetGPUMemoryUsageStats()”, and SimpleDet reports the memory shown by “nvidia-smi”, a command-line utility provided by NVIDIA. Generally, the actual memory usage of MMDetection and maskrcnn-benchmark is similar and lower than the others.

Inference speed on different GPUs. Different researchers may use various GPUs, so here we show the speed benchmark on common GPUs, e.g., TITAN X, TITAN Xp, TITAN V, GTX 1080 Ti, RTX 2080 Ti and V100. We evaluate three models on each type of GPU and report the inference speed in Figure 4. Note that other hardware of these servers, such as CPUs and hard disks, is not exactly the same, but the results can provide a basic impression of the speed benchmark.

Figure 4: Inference speed benchmark of different GPUs.

Mixed precision training. We report the results and compare them with the other two codebases in Table 3. We test all codebases on the same V100 node. Additionally, we investigate more models to figure out the effectiveness of mixed precision training. As shown in Table 4, mixed precision training saves more memory when the batch size is larger. When the batch size is increased to 12, the memory of FP16 training is reduced to nearly half of that of FP32 training. Moreover, mixed precision training is more memory efficient when applied to simpler frameworks like RetinaNet.

² https://github.com/NVIDIA/apex
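As a generic illustration of FP16 training (not MMDetection's own implementation, which predates this API), PyTorch's automatic mixed precision utilities express the idea in a few lines:

```python
import torch

# Generic mixed precision (FP16) training step using PyTorch AMP.
# Illustrative only; the model is assumed to return a dict of losses.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in mixed precision
        losses = model(images, targets)
        loss = sum(losses.values())
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```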
Multi-node scalability. Since MMDetection supports distributed training on multiple nodes, we test its scalability on 8, 16, 32 and 64 GPUs, respectively. We adopt Mask R-CNN as the benchmarking method and conduct experiments on another V100 cluster. Following [11], the base learning rate is adjusted linearly when adopting different batch sizes. Experimental results in Figure 5 show that MMDetection achieves nearly linear acceleration for multiple nodes.
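As a concrete reading of the linear scaling rule from [11], the base learning rate is multiplied by the ratio of the actual total batch size to the reference batch size; the reference values below (lr 0.02 at batch size 16) are common Detectron-style defaults used here only for illustration.

```python
# Linear learning-rate scaling rule [11]: scale the base lr proportionally
# to the total batch size (reference values are assumptions for illustration).
def scaled_lr(num_gpus, imgs_per_gpu=2, base_lr=0.02, base_batch_size=16):
    total_batch_size = num_gpus * imgs_per_gpu
    return base_lr * total_batch_size / base_batch_size

# e.g. 8 GPUs -> 0.02, 32 GPUs -> 0.08, 64 GPUs -> 0.16
```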
5. Extensive Studies

With MMDetection, we conducted extensive studies on some important components and hyper-parameters. We wish that this study can shed light on better practices for making fair comparisons across different methods and settings.
Table 2: Comparison of different codebases in terms of speed, memory and performance.
Codebase model Train (iter/s) Inf (fps) Mem (GB) APbox APmask
MMDetection Mask RCNN 0.430 10.8 3.8 37.4 34.3
maskrcnn-benchmark Mask RCNN 0.436 12.1 3.3 37.8 34.2
Detectron Mask RCNN 0.744 8.1 8.8 37.8 34.1
SimpleDet Mask RCNN 0.646 8.8 6.7 37.1 33.7
MMDetection RetinaNet 0.285 13.1 3.4 35.8 -
maskrcnn-benchmark RetinaNet 0.275 11.1 2.7 36.0 -
Detectron RetinaNet 0.552 8.3 6.9 35.4 -
SimpleDet RetinaNet 0.565 11.6 5.1 35.6 -
Table 3: Comparison of FP32 and mixed precision (FP16) training in different codebases.

Codebase Type Mem (GB) Train (iter/s) Inf (fps) APbox APmask
MMDetection FP32 3.8 0.430 10.8 37.4 34.3
MMDetection FP16 3.0 0.364 10.9 37.4 34.4
maskrcnn-benchmark FP32 3.3 0.436 12.1 37.8 34.2
maskrcnn-benchmark FP16 3.3 0.457 9.0 37.7 34.2
SimpleDet FP32 6.7 0.646 8.8 37.1 33.7
SimpleDet FP16 5.5 0.635 9.0 37.3 33.9
Table 5: Comparison of various regression losses with different loss weights (lw). Faster R-CNN with ResNet-50-FPN is adopted.

Regression Loss lw=1 lw=2 lw=5 lw=10
Smooth L1 Loss [27] 36.4 36.9 35.7 -
L1 Loss 36.8 36.9 34.0 -
Balanced L1 Loss [23] 37.2 36.7 33.0 -
IoU Loss [32] 36.9 37.3 35.4 30.7
GIoU Loss [28] 37.1 37.4 35.4 30.0
Bounded IoU Loss [33] 34.0 35.7 36.8 36.8

Table 6: Comparison of different BN settings and lr schedules. Mask R-CNN with ResNet-50-FPN is adopted.

eval requires_grad lr schedule APbox APmask
False True 1x 34.2 31.2
True False 1x 37.4 34.3
True True 1x 37.3 34.2
True False 2x 37.9 34.6
True True 2x 38.5 35.1

5.1. Regression Losses

L1 Loss yields larger loss values than Smooth L1, especially for bounding boxes that are relatively accurate. According to the analysis in [23], boosting the gradients of better located bounding boxes will benefit the localization. The loss values of L1 Loss are already quite large, therefore increasing the loss weight does not work better. Balanced L1 Loss achieves 0.3% higher mAP than L1 Loss for end-to-end Faster R-CNN, which is a little different from the experiments in [23] that adopt pre-computed proposals. However, we find that Balanced L1 Loss can lead to a higher gain on the baseline of the proposed IoU-balanced sampling or balanced FPN. IoU-based losses perform slightly better than L1-based losses with optimal loss weights, except for Bounded IoU Loss. GIoU Loss is 0.1% higher than IoU Loss, and Bounded IoU Loss has similar performance to Smooth L1 Loss but requires a larger loss weight.

5.2. Normalization Layers

The batch size used when training detectors is usually small (1 or 2) due to limited GPU memory, and thus BN layers are usually frozen as a typical convention. There are two options for configuring BN layers: (1) whether to update the statistics E(x) and Var(x), and (2) whether to optimize the affine weights γ and β. Following the argument names of PyTorch, we denote (1) and (2) as eval and requires_grad. eval = True means the statistics are not updated, and requires_grad = True means γ and β are also optimized during training. Apart from freezing BN layers, there are also other normalization layers which tackle the problem of small batch sizes, such as Synchronized BN (SyncBN) [25] and Group Normalization (GN) [36]. We first evaluate different settings for BN layers in backbones, and then compare BN with SyncBN and GN.

BN settings. We evaluate different combinations of eval and requires_grad on Mask R-CNN, under 1x and 2x training schedules. Results in Table 6 show what happens when we recompute the statistics (eval is False) or fix the affine weights (requires_grad is False), respectively. Updating the statistics with a small batch size severely harms the performance: compared with eval = True, requires_grad = True, it is 3.1% lower in terms of bbox AP and 3.0% lower in terms of mask AP. Under the 1x learning rate (lr) schedule, fixing the affine weights or not makes only a slight difference, i.e., 0.1%. When a longer lr schedule is adopted, making the affine weights trainable outperforms fixing them by about 0.5%. In MMDetection, eval = True, requires_grad = True is adopted as the default setting.

Different normalization layers. Batch Normalization (BN) is widely adopted in modern CNNs. However, it heavily depends on a large batch size to precisely estimate the statistics E(x) and Var(x). In object detection, the batch size is usually much smaller than in classification, and the typical solution is to use the statistics of pretrained backbones and not to update them during training, denoted as FrozenBN. More recently, SyncBN and GN have been proposed and have proved their effectiveness [36, 25]. SyncBN computes the mean and variance across multiple GPUs, and GN divides the channels of features into groups and computes the mean and variance within each group, which helps to combat the issue of small batch sizes. FrozenBN, SyncBN and GN can be specified in MMDetection with only simple modifications in config files.
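For reference, the config fragments below sketch how these choices might be expressed; the norm_cfg, norm_eval and requires_grad fields follow the spirit of MMDetection config files, but the exact keys and values here are illustrative assumptions.

```python
# Illustrative config fragments for the normalization options discussed above
# (field names are assumptions in the spirit of MMDetection configs).

# FrozenBN-style: keep running statistics fixed (eval) but optimize gamma/beta.
backbone_frozen_bn = dict(
    type='ResNet', depth=50,
    norm_cfg=dict(type='BN', requires_grad=True),
    norm_eval=True,   # do not update E(x) and Var(x) during training
)

# SyncBN: statistics are computed across all GPUs.
backbone_syncbn = dict(
    type='ResNet', depth=50,
    norm_cfg=dict(type='SyncBN', requires_grad=True),
    norm_eval=False,
)

# GN: per-group statistics, independent of the batch size.
backbone_gn = dict(
    type='ResNet', depth=50,
    norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
)
```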
Here we study two questions. (1) How do different normalization layers compare with each other? (2) Where should we add normalization layers to detectors? To answer these two questions, we run three experiments of Mask R-CNN with ResNet-50-FPN and replace the BN layers in backbones with FrozenBN, SyncBN and GN, respectively. The group number is set to 32 following [36]. Other settings and model architectures are kept the same. In [36], the 2fc bbox head is replaced with 4conv1fc and GN layers are also added to FPN and bbox/mask heads. We perform another two sets of experiments to study these two changes. Furthermore, we explore different numbers of convolution layers for the bbox head.

Results in Table 7 show that (1) FrozenBN, SyncBN and GN achieve similar performance if we just replace the BN layers in backbones with the corresponding ones; (2) adding SyncBN or GN to FPN and bbox/mask head will not bring further gain; (3) replacing the 2fc bbox head with 4conv1fc as well as adding normalization layers to FPN and bbox/mask head improves the performance by around 1.5%; and (4) more convolution layers in the bbox head lead to higher performance.
Table 7: Comparison of adopting different normalization layers and adding normalization layers on different components. (SBN is short for SyncBN.)

Backbone FPN Head APbox APmask
FrozenBN - - (2fc) 37.3 34.2
FrozenBN - - (4conv1fc) 37.8 34.2
SBN - - (2fc) 37.4 34.1
SBN SBN SBN (2fc) 37.4 34.6
SBN SBN SBN (4conv1fc) 38.9 35.2
GN - - (2fc) 37.4 34.3
GN GN GN (2fc) 37.4 34.5
GN GN GN (2conv1fc) 38.2 35.1
GN GN GN (4conv1fc) 38.8 35.2
GN GN GN (6conv1fc) 39.0 35.4

Table 8: Comparison of different training scales. Mask R-CNN with ResNet-50-FPN and the 2x lr schedule are adopted.

Training scale(s) APbox APmask
1333 × 800 38.5 35.1
1333 × [640:800:32] 39.3 35.8
1333 × [640:960:32] 39.7 36.0
2000 × [640:800:32] 39.3 35.9
1333 × [640:800] 39.3 35.9
1333 × [640:960] 39.7 36.3
1333 × [480:960] 39.7 36.1

Table 9: Study of hyper-parameters on RPN with ResNet-50.

smoothl1_beta allowed_border neg_pos_ub AR1000
1/5 0 ∞ 56.5
1/9 0 ∞ 57.1
1/15 0 ∞ 57.3
1/9 ∞ ∞ 57.7
1/9 ∞ 3 58.3
1/9 ∞ 5 58.1

5.3. Training Scales

As a typical convention, training images are resized to a predefined scale without changing the aspect ratio. Previous studies typically preferred a scale of 1000 × 600, and now 1333 × 800 is typically adopted. In MMDetection, we adopt 1333 × 800 as the default training scale. As a simple data augmentation method, multi-scale training is also commonly used. However, no systematic study exists that examines how to select an appropriate training scale. Knowing this is crucial to facilitate more effective and efficient training. When multi-scale training is adopted, a scale is randomly selected in each iteration, and the image is resized to the selected scale. There are mainly two random selection methods: one is to predefine a set of scales and randomly pick a scale from them, the other is to define a scale range and randomly generate a scale between the minimum and maximum scales. We denote the first method as “value” mode and the second one as “range” mode. Specifically, “range” mode can be seen as a special case of “value” mode where the interval of predefined scales is 1.

We train Mask R-CNN with different scales and random modes, and adopt the 2x lr schedule because more training augmentation usually requires longer lr schedules. The results are shown in Table 8, in which 1333 × [640:800:32] indicates that the longer edge is fixed to 1333 and the shorter edge is randomly selected from the pool of {640, 672, 704, 736, 768, 800}, corresponding to the “value” mode. The setting 1333 × [640:800] indicates that the shorter edge is randomly selected between 640 and 800, which corresponds to the “range” mode. From the results we can learn that the “range” mode performs similarly to or slightly better than the “value” mode with the same minimum and maximum scales. Usually a wider range brings more improvement, especially for larger maximum scales. Specifically, [640:960] is 0.4% and 0.5% higher than [640:800] in terms of bbox and mask AP. However, a smaller minimum scale like 480 will not achieve better performance.
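To make the two selection modes concrete, the sketch below draws one training scale per iteration in each mode; it is an illustration of the “value” and “range” modes described above, not MMDetection's actual data-pipeline code.

```python
import random

# Illustrative sketch of the two multi-scale selection modes described above.

def sample_scale_value_mode(scales=((1333, 640), (1333, 672), (1333, 704),
                                    (1333, 736), (1333, 768), (1333, 800))):
    """'value' mode: pick one scale from a predefined set, e.g. 1333x[640:800:32]."""
    return random.choice(scales)

def sample_scale_range_mode(long_edge=1333, short_min=640, short_max=800):
    """'range' mode: draw the shorter edge uniformly from [min, max], e.g. 1333x[640:800]."""
    return (long_edge, random.randint(short_min, short_max))
```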
5.4. Other Hyper-parameters

MMDetection mainly follows the hyper-parameter settings in Detectron and also explores our own implementations. Empirically, we found that some of the hyper-parameters of Detectron are not optimal, especially for RPN. In Table 9, we list those that can further improve the performance of RPN. Although the tuning may benefit the performance, in MMDetection we adopt the same settings as Detectron by default and just leave this study for reference.

smoothl1_beta. Most detection methods adopt Smooth L1 Loss as the regression loss, implemented as torch.where(x < beta, 0.5 * x^2 / beta, x - 0.5 * beta). The parameter beta is the threshold between the L1 term and the MSELoss term. It is set to 1/9 in RPN by default, according to the standard deviation of the regression errors empirically. Experimental results show that a smaller beta may improve the average recall (AR) of RPN slightly. In the study of Section 5.1, we found that L1 Loss performs better than Smooth L1 when the loss weight is 1. When we set beta to a smaller value, Smooth L1 Loss gets closer to L1 Loss and the equivalent loss weight is larger, resulting in better performance.
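Writing the quoted expression out as a function makes the role of beta explicit; the snippet below is a plain restatement of that formula with x = |pred − target|, shown here for illustration.

```python
import torch

# Smooth L1 loss as quoted above:
# torch.where(x < beta, 0.5 * x**2 / beta, x - 0.5 * beta), with x = |pred - target|.
# As beta shrinks, the loss approaches the plain L1 loss |pred - target|.
def smooth_l1_loss(pred, target, beta=1.0 / 9.0):
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
```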
allowed_border. In RPN, pre-defined anchors are generated on each location of a feature map. Anchors exceeding the boundaries of the image by more than allowed_border will be ignored during training. It is set to 0 by default, which means any anchors exceeding the image boundary will be ignored. However, we find that relaxing this rule is beneficial. If we set it to infinity, which means none of the anchors are ignored, AR is improved from 57.1% to 57.7%. In this way, ground truth objects near boundaries will have more matching positive samples during training.
neg_pos_ub. We add this new hyper-parameter for sampling positive and negative anchors. When training the RPN, in the case where insufficient positive anchors are present, one typically samples more negative samples to guarantee a fixed number of training samples. Here we explore neg_pos_ub to control the upper bound of the ratio of negative samples to positive samples. Setting neg_pos_ub to infinity leads to the aforementioned sampling behavior. This default practice will sometimes cause an imbalanced distribution of negative and positive samples. By setting it to a reasonable value, e.g., 3 or 5, which means we sample negative samples at most 3 or 5 times the number of positive ones, a gain of 1.2% or 1.1% is observed.
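The sampling rule can be stated compactly: cap the number of sampled negatives at neg_pos_ub times the number of positives. The helper below is an illustrative sketch of that rule, not the actual MMDetection sampler.

```python
import random

# Illustrative sketch of the neg_pos_ub rule described above.
def sample_anchors(pos_inds, neg_inds, num_expected=256, pos_fraction=0.5,
                   neg_pos_ub=float('inf')):
    num_pos = min(len(pos_inds), int(num_expected * pos_fraction))
    pos = random.sample(pos_inds, num_pos)
    # Fill the rest of the batch with negatives ...
    num_neg = num_expected - num_pos
    # ... but cap negatives at neg_pos_ub times the number of positives.
    if neg_pos_ub != float('inf'):
        num_neg = min(num_neg, int(neg_pos_ub * max(num_pos, 1)))
    neg = random.sample(neg_inds, min(num_neg, len(neg_inds)))
    return pos, neg
```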
A. Detailed Results

We present detailed benchmarking results for some methods in Table 10. R-50 and R-50 (c) denote the pytorch-style and caffe-style ResNet-50 backbones, respectively. In the bottleneck residual block, pytorch-style ResNet uses a 1x1 stride-1 convolutional layer followed by a 3x3 stride-2 convolutional layer, while caffe-style ResNet uses a 1x1 stride-2 convolutional layer followed by a 3x3 stride-1 convolutional layer. Refer to https://github.com/open-mmlab/mmdetection/blob/master/MODEL_ZOO.md for more settings and components.
Table 10: Results of different detection methods on COCO val2017. APb and APm denote box mAP and mask mAP respectively.

Method Backbone Lr Schd APb APb50 APb75 APbS APbM APbL APm APm50 APm75 APmS APmM APmL
R-50 (c) 1x 36.6 58.5 39.2 20.7 40.5 47.9 - - - - - -
R-101 (c) 1x 38.8 60.5 42.3 23.3 43.1 50.3 - - - - - -
R-50 1x 36.4 58.4 39.1 21.5 40.0 46.6 - - - - - -
R-101 1x 38.5 60.3 41.6 22.3 43.0 49.8 - - - - - -
X-101-32x4d 1x 40.1 62.0 43.8 23.4 44.6 51.7 - - - - - -
Faster R-CNN
X-101-64x4d 1x 41.3 63.3 45.2 24.4 45.8 53.4 - - - - - -
R-50 2x 37.7 59.2 41.1 21.9 41.4 48.7 - - - - - -
R-101 2x 39.4 60.6 43.0 22.1 43.6 52.1 - - - - - -
X-101-32x4d 2x 40.4 61.9 44.1 23.3 44.6 52.9 - - - - - -
X-101-64x4d 2x 40.7 62.0 44.6 22.9 44.5 53.6 - - - - - -
R-50 1x 40.4 58.5 43.9 21.5 43.7 53.8 - - - - - -
R-101 1x 42.0 60.3 45.9 23.2 45.9 56.3 - - - - - -
X-101-32x4d 1x 43.6 62.2 47.4 25.0 47.7 57.4 - - - - - -
X-101-64x4d 1x 44.5 63.3 48.6 26.1 48.1 59.1 - - - - - -
Cascade R-CNN
R-50 20e 41.1 59.1 44.8 22.5 44.4 54.9 - - - - - -
R-101 20e 42.5 60.7 46.3 23.7 46.1 56.9 - - - - - -
X-101-32x4d 20e 44.0 62.5 48.0 25.3 47.8 58.1 - - - - - -
X-101-64x4d 20e 44.7 63.1 49.0 25.8 48.3 58.8 - - - - - -
SSD300 VGG16 120e 25.7 43.9 26.2 6.9 27.7 42.6 - - - - - -
SSD512 VGG16 120e 29.3 49.2 30.8 11.8 34.1 44.7 - - - - - -
R-50 (c) 1x 35.8 55.5 38.3 20.1 39.5 47.7 - - - - - -
R-101 (c) 1x 37.8 58.0 40.7 20.4 42.1 50.7 - - - - - -
R-50 1x 35.6 55.5 38.3 20.0 39.6 46.8 - - - - - -
R-101 1x 37.7 57.5 40.4 21.1 42.2 49.5 - - - - - -
X-101-32x4d 1x 39.0 59.4 41.7 22.6 43.4 50.9 - - - - - -
RetinaNet
X-101-64x4d 1x 40.0 60.9 43.0 23.5 44.4 52.6 - - - - - -
R-50 2x 36.4 56.3 38.7 19.3 39.9 48.9 - - - - - -
R-101 2x 38.1 58.1 40.6 20.2 41.8 50.8 - - - - - -
X-101-32x4d 2x 39.3 59.8 42.3 21.0 43.6 52.3 - - - - - -
X-101-64x4d 2x 39.6 60.3 42.3 21.6 43.5 53.5 - - - - - -
R-50 1x 36.9 55.5 39.1 20.4 40.3 48.7 - - - - - -
R-101 1x 39.0 57.7 41.3 21.8 43.2 51.8 - - - - - -
RetinaNet-GHM
X-101-32x4d 1x 40.5 59.7 43.1 22.8 44.8 53.5 - - - - - -
X-101-64x4d 1x 41.6 61.3 44.3 23.5 45.5 55.1 - - - - - -
R-50 (c) 1x 36.7 55.8 39.2 21.0 40.7 48.4 - - - - - -
R-101 (c) 1x 39.1 58.5 41.8 22.0 43.5 51.1 - - - - - -
FCOS
R-50 (c) 2x 36.9 55.8 39.1 20.4 40.1 49.2 - - - - - -
R-101 (c) 2x 39.1 58.6 41.7 22.1 42.4 52.5 - - - - - -
R-50 (c) 2x 38.7 58.0 41.4 23.4 42.8 49.0 - - - - - -
FCOS (mstrain) R-101 (c) 2x 40.8 60.1 43.8 24.5 44.5 52.8 - - - - - -
X-101-64x4d 2x 42.8 62.6 45.7 26.5 46.9 54.5 - - - - - -
R-50 1x 38.5 59.5 42.5 22.9 41.8 48.9 - - - - - -
R-101 1x 40.3 61.2 43.9 23.3 44.3 52.2 - - - - - -
Libra Faster R-CNN
X-101-32x4d 1x 41.6 62.7 45.6 24.8 45.8 53.6 - - - - - -
X-101-64x4d 1x 42.7 63.8 46.8 25.8 46.6 55.4 - - - - - -
R-50 (c) 1x 39.9 59.1 43.6 22.8 43.5 52.8 - - - - - -
R-101 (c) 1x 41.5 60.7 45.5 23.3 45.6 55.3 - - - - - -
GA-Faster R-CNN
X-101-32x4d 1x 42.9 62.1 46.8 24.8 46.9 56.1 - - - - - -
X-101-64x4d 1x 43.9 63.3 48.3 25.4 47.9 57.0 - - - - - -
R-50 (c) 1x 37.0 56.6 39.8 20.0 40.8 50.1 - - - - - -
R-101 (c) 1x 38.9 59.1 41.8 22.0 42.6 51.9 - - - - - -
GA-RetinaNet
X-101-32x4d 1x 40.3 60.9 43.5 23.5 44.9 53.5 - - - - - -
X-101-64x4d 1x 40.8 61.4 44.0 23.9 44.9 54.3 - - - - - -
R-50 (c) 1x 37.4 58.9 40.4 21.7 41.0 49.1 34.3 55.8 36.4 18.0 37.6 47.3
R-101 (c) 1x 39.9 61.5 43.6 23.9 44.0 51.8 36.1 57.9 38.7 19.8 39.8 49.5
R-50 1x 37.3 59.0 40.2 21.9 40.9 48.1 34.2 55.9 36.2 18.2 37.5 46.3
R-101 1x 39.4 60.9 43.3 23.0 43.7 51.4 35.9 57.7 38.4 19.2 39.7 49.7
X-101-32x4d 1x 41.1 62.8 45.0 24.0 45.4 52.6 37.1 59.4 39.8 19.7 41.1 50.1
Mask R-CNN
X-101-64x4d 1x 42.1 63.8 46.3 24.4 46.6 55.3 38.0 60.6 40.9 20.2 42.1 52.4
R-50 2x 38.5 59.9 41.8 22.6 42.0 50.5 35.1 56.8 37.0 18.9 38.0 48.3
R-101 2x 40.3 61.5 44.1 22.2 44.8 52.9 36.5 58.1 39.1 18.4 40.2 50.4
X-101-32x4d 2x 41.4 62.5 45.4 24.0 45.4 54.5 37.1 59.4 39.5 19.9 40.6 51.3
X-101-64x4d 2x 42.0 63.1 46.1 23.9 45.8 55.6 37.7 59.9 40.4 19.6 41.3 52.5
R-50 (c) 1x 37.5 59.2 40.5 21.4 41.3 48.9 35.6 55.6 38.5 18.2 39.1 49.2
R-101 (c) 1x 40.0 61.4 43.7 23.2 44.2 52.3 37.3 57.7 40.2 19.5 41.1 51.6
Mask Scoring R-CNN X-101-64x4d 1x 42.2 64.0 46.2 24.9 46.5 54.6 39.2 60.4 42.4 21.1 43.1 54.3
X-101-32x4d 2x 41.5 62.6 45.1 23.7 45.2 54.7 38.4 58.9 41.7 20.1 42.0 53.9
X-101-64x4d 2x 42.2 63.4 46.1 24.2 46.0 56.1 38.9 59.4 42.1 20.4 42.4 54.7
R-50 1x 41.2 59.1 45.1 23.3 44.5 54.5 35.7 56.3 38.6 18.5 38.6 49.2
R-101 1x 42.6 60.7 46.7 23.8 46.4 56.9 37.0 58.0 39.9 19.1 40.5 51.4
X-101-32x4d 1x 44.4 62.6 48.6 25.4 48.1 58.7 38.2 59.6 41.2 20.3 41.9 52.4
X-101-64x4d 1x 45.4 63.7 49.7 25.8 49.2 60.6 39.1 61.0 42.1 20.5 42.6 54.1
Cascade Mask R-CNN
R-50 20e 42.3 60.5 46.0 23.7 45.7 56.4 36.6 57.6 39.5 19.0 39.4 50.7
R-101 20e 43.3 61.3 47.0 24.4 46.9 58.0 37.6 58.5 40.6 19.7 40.8 52.4
X-101-32x4d 20e 44.7 63.0 48.9 25.9 48.7 58.9 38.6 60.2 41.7 20.9 42.1 52.7
X-101-64x4d 20e 45.7 64.1 50.0 26.2 49.6 60.0 39.4 61.3 42.9 20.8 42.7 54.1
R-50 1x 42.1 60.8 45.9 23.9 45.5 56.2 37.3 58.2 40.2 19.5 40.6 51.7
R-50 20e 43.2 62.1 46.8 24.9 46.4 57.8 38.1 59.4 41.0 20.3 41.1 52.8
Hybrid Task Cascade R-101 20e 44.9 63.8 48.7 26.4 48.3 59.9 39.4 60.9 42.4 21.4 42.4 54.4
X-101-32x4d 20e 46.1 65.1 50.2 27.5 49.8 61.2 40.3 62.2 43.5 22.3 43.7 55.5
X-101-64x4d 20e 46.9 66.0 51.2 28.0 50.7 62.1 40.8 63.3 44.1 22.7 44.2 56.3
References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS – improving object detection with one line of code. In IEEE International Conference on Computer Vision, 2017.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[5] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[6] Yuntao Chen, Chenxia Han, Yanghao Li, Zehao Huang, Yi Jiang, Naiyan Wang, and Zhaoxiang Zhang. SimpleDet: A simple and versatile distributed framework for object detection and instance recognition. arXiv preprint arXiv:1903.05831, 2019.
[7] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, 2017.
[9] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, 2015.
[10] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large mini-batch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[12] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883, 2018.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[15] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[16] Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. In AAAI Conference on Artificial Intelligence, 2019.
[17] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892, 2019.
[18] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
[19] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[20] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[21] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of instance segmentation and object detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
[22] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
[23] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[25] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[26] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[28] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[29] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[30] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[31] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
[32] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
[33] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[34] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[35] Yue Wu, Yinpeng Chen, Lu Yuan, Zicheng Liu, Lijuan Wang, Hongzhi Li, and Yun Fu. Double-head RCNN: Rethinking classification and localization for object detection. arXiv preprint arXiv:1904.06493, 2019.
[36] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[37] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[38] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2Det: A single-shot object detector based on multi-level feature pyramid network. In AAAI Conference on Artificial Intelligence, 2018.
[39] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[40] Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, and Tao Mei. ScratchDet: Exploring to train single-shot object detectors from scratch. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[41] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. arXiv preprint arXiv:1904.05873, 2019.
[42] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.