Grid R-CNN
Xin Lu1 Buyu Li2 Yuxin Yue3 Quanquan Li1 Junjie Yan1
1 SenseTime Research, 2 The Chinese University of Hong Kong, 3 Beihang University
{luxin,liquanquan,yanjunjie}@sensetime.com, libuyu [email protected], [email protected]
Abstract
it may share very similar local features with nearby pixels. To overcome this problem, we design a multi-point supervision formulation. By defining target points in a grid, we have more clues to reduce the impact of inaccurate predictions of some points. For instance, in a typical 3 × 3 grid points supervision case, the probably inaccurate y-axis coordinate of the top-right point can be calibrated by that of the top-middle point, which lies exactly on the boundary of the object. The grid points are effective designs to decrease the overall deviation.

Furthermore, to take full advantage of the correlation of points in a grid, we propose an information fusion approach. Specifically, we design an individual group of feature maps for each grid point. For one grid point, the feature maps of the neighboring grid points are collected and fused into an integrated feature map. The integrated feature map is utilized for the location prediction of the corresponding grid point. Thus complementary information from spatially related grid points is incorporated to make the prediction more accurate.

We showcase the effectiveness of our Grid R-CNN framework on the object detection track of the challenging COCO benchmark [19]. Our approach outperforms traditional regression-based state-of-the-art methods by a significant margin. For example, we surpass Faster R-CNN [25] with a ResNet-50 [14] backbone and FPN [17] architecture by 2.2% AP. Further comparison on different IoU threshold criteria shows that our approach has overwhelming strength in high quality object localization, with a 4.1% AP gain at IoU=0.8 and a 10.0% AP gain at IoU=0.9.

The main contributions of our work are listed as follows:

1. We propose a novel localization framework called Grid R-CNN which substitutes the traditional regression network with a fully convolutional network that preserves spatial information efficiently. To the best of our knowledge, Grid R-CNN is the first region-based (two-stage) detection framework that locates objects by predicting grid points at the pixel level.

2. We design a multi-point supervision form that predicts points in a grid to reduce the impact of some inaccurate points. We further propose a feature map level information fusion mechanism that enables the spatially related grid points to obtain incorporated features so that their locations can be well calibrated.

3. We perform extensive experiments and show that the Grid R-CNN framework is widely applicable across different detection frameworks and network architectures with consistent gains. Grid R-CNN performs even better under stricter localization criteria (e.g. IoU threshold = 0.75). Thus we are confident that our grid guided localization mechanism is a better alternative to regression-based localization methods.

2. Related Works

Since our new approach is based on two-stage object detectors, here we briefly review some related works. The two-stage object detector was developed from the R-CNN architecture [9], a region-based deep learning framework that classifies and locates every RoI (Region of Interest) generated by some low-level computer vision algorithms [30, 34]. Then SPP-Net [12] and Fast R-CNN [8] introduced a new way to save redundant computation by extracting every region feature from the shared feature map generated from the entire image. Although SPP-Net and Fast R-CNN significantly improve the performance of object detection, the RoI generation part still cannot be trained end-to-end. Later, Faster R-CNN [25] was proposed to solve this problem by utilizing a light region proposal network (RPN) to generate a sparse set of RoIs. This makes the whole detection pipeline an end-to-end trainable network and further improves the accuracy and speed of the detector. In addition, some single-stage frameworks [20, 18, 24, 16] have also been proposed to balance the performance and efficiency of the model.

Recently, many works have extended the Faster R-CNN architecture in various aspects to achieve better performance. R-FCN [3] proposed to use a region-based fully convolutional network to replace the original fully connected network. FPN [17] proposed a top-down architecture with lateral connections for building high-level semantic feature maps at various scales. Mask R-CNN [11] extended Faster R-CNN by adding a branch for predicting a pixel-wise object mask. Different from Mask R-CNN, our method replaces the regression branch with a new grid branch to locate objects more accurately. Also, our method needs no extra annotation other than the bounding box.

LocNet [7] proposed a boundary-based method for accurate localization in object detection. It relies on conditional probabilities of region boundaries while our method is based on grid point prediction. In addition, LocNet is used for proposal generation (like RPN) whereas Grid R-CNN is used for bounding box prediction.

CornerNet [15] is a single-stage object detector which uses paired keypoints to locate objects. It is a bottom-up detector that detects all possible bounding box (corner point) locations through an hourglass [22] network. Meanwhile, an embedding network is designed to map the paired keypoints as close as possible. With this embedding mechanism, detected corners can be grouped into pairs to locate the bounding boxes.

It is worth noting that our approach is quite different from CornerNet. CornerNet is a bottom-up method, which means it directly generates keypoints from the entire image without defining instances. The key step of CornerNet is to recognize keypoints and group them correctly. In contrast, our approach is a top-down two-stage detector which defines instances in the first stage. What we focus on is how to locate the grid points accurately. Furthermore, we design a feature fusion module to exploit the features of related grid points and calibrate for more accurate grid point localization than using the two corner points only.
Figure 2. Overview of the pipeline of Grid R-CNN. Region proposals are obtained from RPN and used for RoI feature extraction from the
output feature maps of a CNN backbone. The RoI features are then used to perform classification and localization. In contrast to previous
works with a box offset regression branch, we adopt a grid guided mechanism for high quality localization. The grid prediction branch
adopts an FCN to output a probability heatmap from which we can locate the grid points in the bounding box aligned with the object. With
the grid points, we finally determine the accurate object bounding box by a feature map level information fusion approach.
3. Grid R-CNN

An overview of the Grid R-CNN framework is shown in Figure 2. Based on region proposals, features for each RoI are extracted individually from the feature maps obtained by a CNN backbone. The RoI features are then used to perform classification and localization for the corresponding proposals. In contrast to previous works, e.g. Faster R-CNN, we use a grid guided mechanism for localization instead of offset regression. The grid prediction branch adopts a fully convolutional network [21]. It outputs a fine spatial layout (probability heatmap) from which we can locate the grid points of the bounding box aligned with the object. With the grid points, we finally determine the accurate object bounding box by a feature map level information fusion approach.

3.1. Grid Guided Localization

Most previous methods [9, 8, 25, 17, 11, 1] use several fully connected layers as a regressor to predict the box offset for object localization, whereas we adopt a fully convolutional network to predict the locations of predefined grid points and then utilize them to determine the accurate object bounding box.

We design an N × N grid form of target points aligned in the bounding box of the object. An example of the 3 × 3 case is shown in Figure 1.b; the grid points here are the four corner points, the midpoints of the four edges, and the center point. Features of each proposal are extracted by a RoIAlign [11] operation with a fixed spatial size of 14 × 14, followed by eight 3 × 3 dilated (for a large receptive field) convolutional layers. After that, two 2× group deconvolution layers are adopted to achieve a resolution of 56 × 56.
layers are adopted to achieve a resolution of 56 × 56. N N
j∈E3 j∈E4
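To make the procedure concrete, the following NumPy sketch converts a set of predicted heatmaps into a box for the 3 × 3 case by applying Equations 1 and 2; it is illustrative only, and the function and variable names are ours, not the authors'.

import numpy as np

def heatmaps_to_box(heatmaps, px, py, wp, hp, grid=3):
    # heatmaps: (grid*grid, ho, wo) probability maps, one per grid point,
    # in row-major order (top-left, top-middle, ..., bottom-right).
    # (px, py): upper-left corner of the proposal; (wp, hp): proposal size.
    _, ho, wo = heatmaps.shape
    xs, ys, ps = [], [], []
    for m in heatmaps:
        hy, hx = np.unravel_index(np.argmax(m), m.shape)   # most confident pixel
        xs.append(px + hx / wo * wp)                       # Equation 1
        ys.append(py + hy / ho * hp)
        ps.append(m[hy, hx])
    xs, ys, ps = np.array(xs), np.array(ys), np.array(ps)
    # Grid points lying on each edge of a 3x3 grid (row-major indices).
    left, upper, right, bottom = [0, 3, 6], [0, 1, 2], [2, 5, 8], [6, 7, 8]

    def edge(vals, idx):
        # Equation 2: probability-weighted coordinates averaged over the
        # N grid points on this edge.
        return float((vals[idx] * ps[idx]).sum() / grid)

    return edge(xs, left), edge(ys, upper), edge(xs, right), edge(ys, bottom)

For example, with 56 × 56 heatmaps and a proposal whose upper-left corner is (100, 80) and whose size is 120 × 90, the returned (x_l, y_u, x_r, y_b) is the grid-calibrated box in image coordinates.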
box and 7 of the 9 grid points cannot be covered by the output heatmap.

A natural idea is to enlarge the proposal area. This approach can make sure that most of the grid points will be included in the proposal area, but it will also introduce redundant features of the background or even other objects. Experiments show that simply enlarging the proposal area brings no gain but harms the detection accuracy of small objects.

To address this problem, we modify the relationship between the output heatmaps and the regions in the original image by an extended region mapping approach. Specifically, when the proposals are obtained, the RoI features are still extracted from the same region on the feature map without enlarging the proposal area, while we re-define the representation area of the output heatmap as a twice larger corresponding region in the image, so that all grid points are covered in most cases, as shown in Figure 4 (the dashed box).

The extended region mapping is formulated as a modification of Equation 1:

    I'_x = P_x + \frac{4 H_x - w_o}{2 w_o} w_p, \quad I'_y = P_y + \frac{4 H_y - h_o}{2 h_o} h_p        (4)

After the new mapping, all the target grid points of the positive proposals (which have an overlap larger than 0.5 with the ground truth box) will be covered by the corresponding region of the heatmap.
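As an illustration of Equation 4, the mapping can be written as a small helper in the same spirit as the earlier sketch; again this is only a sketch under the same notation, not the authors' implementation.

def extended_region_mapping(hx, hy, px, py, wp, hp, wo, ho):
    # Equation 4: the output heatmap is re-interpreted as covering a region
    # twice as large as the proposal and centered on it, while the RoI features
    # are still extracted from the original proposal region.
    ix = px + (4 * hx - wo) / (2 * wo) * wp
    iy = py + (4 * hy - ho) / (2 * ho) * hp
    return ix, iy

At H_x = w_o / 2 this maps to the proposal center, and the mapped range runs from P_x - w_p / 2 to P_x + 3 w_p / 2, i.e. a region of width 2 w_p centered on the proposal, so grid points of the ground truth box that lie outside the proposal can still fall inside the heatmap's representation area.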
The backbone network is initialized with weights pretrained on the ImageNet dataset [26]; other new parameters are initialized by He (MSRA) initialization [13]. No data augmentation except standard horizontal flipping is used. Our model is trained on 32 Nvidia TITAN Xp GPUs, with one image per GPU, for 20 epochs with an initial learning rate of 0.02, which is decreased by a factor of 10 at the 13th and 18th epochs. We also use learning rate warm-up and the Synchronized BatchNorm mechanism [10, 23] (only used in the grid branch) to make multi-GPU training more stable.

Inference: During the inference stage, the RPN generates 300/1000 (Faster R-CNN/FPN) RoIs per image. The features of these RoIs are then processed by the RoIAlign [11] layer and the classification branch to generate category scores, followed by non-maximum suppression (NMS) with a 0.5 IoU threshold. After that we select the top 125 highest scoring RoIs and feed their RoIAlign features into the grid branch for further location prediction. Finally, NMS with a 0.5 IoU threshold is applied to remove duplicate detection boxes.

4. Experiments

We perform experiments on two object detection datasets, Pascal VOC [5] and COCO [19]. On the Pascal VOC dataset, we train our model on the VOC07+12 trainval set and evaluate on the VOC2007 test set. On the COCO [19] dataset, which contains 80 object categories, we train our model on the union of the 80k train images and a 35k subset of val images, and test on a 5k subset of val (minival) and the 20k test-dev.
method AP AP.5 AP.75
w/o fusion 38.9 58.2 41.2
bi-directional fusion [2] 39.2 58.2 41.8
first order feature fusion 39.2 58.1 41.9
second order feature fusion 39.6 58.3 42.4
Table 2. Comparison of different feature fusion methods. Bi-directional feature fusion, first order feature fusion and second order fusion all demonstrate improvements. Second order fusion achieves the best performance with an improvement of 0.7% on AP.

method AP APsmall APlarge
baseline 37.7 22.1 48.0
enlarge proposal area 37.7 20.8 50.9
extended region mapping 38.9 22.1 51.4
Table 3. Comparison of directly enlarging the proposal and the extended region mapping strategy.

the effectiveness of feature fusion. We perform experiments on several typical feature fusion methods and achieve different levels of improvement in AP performance. The bi-directional fusion method, as mentioned in [2], models the information flow as a bi-directional tree. For a fair comparison, we directly use the feature maps from the first order feature fusion stage for grid point location prediction, and observe the same 0.3% AP gain as bi-directional fusion. We also perform an experiment with the complete two-stage feature fusion. As can be seen in Table 2, the second order fusion further improves the AP by 0.4%, giving a 0.7% gain over the non-fusion baseline. Notably, the improvement of AP0.75 is more significant than that of AP0.5, which indicates that the feature fusion mechanism helps to improve the localization accuracy of the bounding box.

Extended Region Mapping: Table 3 shows the results of our extended region mapping strategy compared with the original region representation and the method of directly enlarging the proposal box. Directly enlarging the region of the proposal box for RoI feature extraction helps to cover more grid points of big objects but also brings in redundant information for small objects. Thus we can see that with this enlargement method there is an increase in APlarge but a decrease in APsmall, and finally a decline compared with the baseline. In contrast, the extended region mapping strategy improves APlarge performance while producing no negative influence on APsmall, which leads to a 1.2% improvement in AP.

method backbone AP
R-FCN ResNet-50 45.6
FPN ResNet-50 51.7
FPN based Grid R-CNN ResNet-50 55.3
Table 4. Comparison with R-FCN and FPN on the Pascal VOC dataset. Note that we evaluate the results with a COCO-style criterion, i.e. the average AP over the IoU threshold range [0.5:0.95].

two frameworks for fair comparison.

Experiments on Pascal VOC: We train Grid R-CNN on the Pascal VOC dataset for 18 epochs, with the learning rate reduced by 10 at 15 and 17 epochs. The original evaluation criterion of PASCAL VOC is to calculate the mAP at a 0.5 IoU threshold. We extend it to the COCO-style criterion, which calculates the average AP across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. We compare Grid R-CNN with R-FCN [3] and FPN [17]. Results in Table 4 show that our Grid R-CNN significantly improves AP over FPN and R-FCN, by 3.6% and 9.7% respectively.

Experiments on COCO: To further demonstrate the generalization capacity of our approach, we conduct experiments on the challenging COCO dataset. Table 5 shows that our approach brings consistent and substantial improvement across multiple backbones and frameworks. Compared with the Faster R-CNN framework, Grid R-CNN improves AP over the baseline by 2.1% with a ResNet-50 backbone. Significant improvements are also shown on the FPN framework based on both ResNet-50 and ResNet-101 backbones. Experiments in Table 5 demonstrate that Grid R-CNN significantly improves the performance on middle and large objects by about 3 points.

Results on COCO test-dev Set: For a complete comparison, we also evaluate Grid R-CNN on the COCO test-dev set. We adopt ResNet-101 and ResNeXt-101 [31] with FPN [17] constructed on top. Without bells and whistles, Grid R-CNN based on ResNet-101-FPN and ResNeXt-101-FPN achieves 41.5 and 43.2 AP respectively. As shown in Table 6, Grid R-CNN achieves very competitive performance compared with other state-of-the-art detectors. It outperforms Mask R-CNN by a large margin without using any extra annotations. Note that since techniques such as the scaling used in SNIP [28] and the cascading in Cascade R-CNN [1] are not applied in the current Grid R-CNN framework, there is still room for further performance improvement (e.g. combining with scaling and cascading methods).
method backbone AP AP.5 AP.75 APS APM APL
Faster R-CNN ResNet-50 33.8 55.4 35.9 17.4 37.9 45.3
Grid R-CNN ResNet-50 35.9 54.0 38.0 18.6 40.2 47.8
Faster R-CNN w FPN ResNet-50 37.4 59.3 40.3 21.8 40.9 47.9
Grid R-CNN w FPN ResNet-50 39.6 58.3 42.4 22.6 43.8 51.5
Faster R-CNN w FPN ResNet-101 39.5 61.2 43.1 22.7 43.7 50.8
Grid R-CNN w FPN ResNet-101 41.3 60.3 44.4 23.4 45.8 54.1
Table 5. Bounding box detection AP on COCO minival. Grid R-CNN outperforms both Faster R-CNN and FPN on ResNet-50 and
ResNet-101 backbones.
the same ResNet-50 backbone across IoU thresholds from 0.5 to 0.9. Grid R-CNN outperforms regression at higher IoU thresholds (greater than 0.7). The improvements over the baseline at AP0.8 and AP0.9 are 4.1% and 10% respectively, which means that Grid R-CNN achieves better performance mainly by improving the localization quality of the bounding box. In addition, the results at AP0.5 indicate that the grid branch may slightly affect the performance of the classification branch.

[Figure: bar chart of mAP for Faster R-CNN with FPN and Grid R-CNN with FPN at increasing IoU thresholds]

Varying Degrees of Improvement in Different Categories: We have analyzed the specific improvement of Grid R-CNN on each category and discovered a meaningful and interesting phenomenon. As shown in Table 7, the categories with the most gains usually have a rectangular or bar-like shape (e.g. keyboard, laptop, fork, train, and refrigerator), while the categories suffering declines or having the least gains usually have a round shape without structural edges (e.g. sports ball, frisbee, bowl, clock and cup). This phenomenon is reasonable since the grid points are distributed in a rectangular shape. Thus rectangular objects tend to have more grid points on their body, while round objects can never cover all the grid points (especially the corners) with their body.
Figure 6. Qualitative results comparison. The results of Grid R-CNN are listed in the first and third row, while those of Faster R-CNN are
in the second and fourth row.
category: cat, bear, giraffe, dog, airplane, horse, zebra, toilet, keyboard, fork, teddy bear, train, laptop, refrigerator, hot dog
gain: 6.0, 5.6, 5.4, 5.3, 5.3, 5.0, 4.8, 4.8, 4.7, 4.6, 4.4, 4.2, 4.0, 3.6, 3.6
category: toaster, hair drier, sports ball, frisbee, traffic light, backpack, kite, handbag, microwave, bowl, clock, cup, carrot, dining table, boat
gain: -1.9, -1.3, -1.0, -0.8, -0.5, -0.4, -0.3, -0.1, -0.1, -0.1, 0.1, 0.1, 0.2, 0.3, 0.3
Table 7. The top 15 categories with the most gains and the most declines respectively, in the results of Grid R-CNN compared to Faster R-CNN.
References
[1] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[2] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
[3] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
[4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
[5] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
[6] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[7] S. Gidaris and N. Komodakis. Locnet: Improving localization accuracy for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 789–798, 2016.
[8] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[10] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361. Springer, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] H. Law and J. Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[16] B. Li, Y. Liu, and X. Wang. Gradient harmonized single-stage detector. In AAAI Conference on Artificial Intelligence, 2019.
[17] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 936–944. IEEE, 2017.
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[23] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6181–6189, 2018.
[24] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[27] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
[28] B. Singh and L. S. Davis. An analysis of scale invariance in object detection snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
[29] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[30] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[31] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
[32] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa. Deep regionlets for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 798–814, 2018.
[33] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203–4212, 2018.
[34] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European conference on computer vision, pages 391–405. Springer, 2014.