Feature Pyramid Networks for Object Detection

Recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single-scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.
Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale.

[Figure: the merging building block of the top-down pathway — a 2x upsampled top-down map and a 1x1 conv lateral map, merged by addition.]
The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1x1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1x1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3x3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (number of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.

Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
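To make the construction above concrete, the following is a minimal PyTorch-style sketch of the lateral connections and top-down pathway. The module and argument names are ours, and nearest-neighbor 2x upsampling and ResNet C2-C5 channel widths are assumed; this is an illustration of the design, not our released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Sketch: lateral connections + top-down pathway producing {P2..P5}."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 convs reduce each bottom-up map {C2..C5} to d channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # 3x3 convs smooth each merged map to reduce upsampling aliasing.
        self.smooth = nn.ModuleList(nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Start the iteration: a 1x1 conv on C5 gives the coarsest map.
        p5 = self.lateral[3](c5)
        # Iterate toward finer resolutions: upsample by 2x (nearest neighbor
        # assumed) and merge with the lateral map by element-wise addition.
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        # Final 3x3 convs; note there are no non-linearities in these extra layers.
        return tuple(s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5)))
```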
4. Applications

Our method is a generic solution for building feature pyramids inside deep ConvNets. In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.

4.1. Feature Pyramid Networks for RPN

RPN [29] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3x3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3x3 convolutional layer followed by two sibling 1x1 convolutions for classification and regression, which we refer to as a network head. The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. The anchors are of multiple pre-defined scales and aspect ratios in order to cover objects of different shapes.

We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3x3 conv and two sibling 1x1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively.¹ As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.

¹Here we introduce P6 only for covering a larger anchor scale of 512². P6 is simply a stride two subsampling of P5. P6 is not used by the Fast R-CNN detector in the next section.
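As a worked illustration of this anchor scheme, the sketch below enumerates the 15 (width, height) anchor shapes implied by the areas and ratios above; the function name is hypothetical.

```python
def pyramid_anchor_shapes():
    """Sketch: one anchor scale per pyramid level, three aspect ratios each."""
    areas = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
    ratios = (0.5, 1.0, 2.0)  # height/width ratios 1:2, 1:1, 2:1
    shapes = {}
    for level, area in areas.items():
        # For a box of the given area and ratio r = h/w:
        # w = sqrt(area / r), h = r * w.
        shapes[level] = [((area / r) ** 0.5, r * (area / r) ** 0.5) for r in ratios]
    return shapes  # 5 levels x 3 ratios = 15 anchor shapes over the pyramid
```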
We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box or an IoU over 0.7 with any ground-truth box, and a negative label if it has IoU lower than 0.3 for all ground-truth boxes. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. As such, we introduce no extra rules in addition to those in [29].

We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.

With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. We elaborate on the implementation details in the experiments.

4.2. Feature Pyramid Networks for Fast R-CNN

Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.

We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by:

    k = ⌊k0 + log2(√(wh) / 224)⌋.        (1)

Here 224 is the canonical ImageNet pre-training size, and k0 is the target level on which an RoI with w × h = 224² should be mapped into. Analogous to the ResNet-based Faster R-CNN system [16] that uses C4 as the single-scale feature map, we set k0 to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k = 3).
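The level assignment of Eqn. (1) can be written as a small helper; the function name and the clamping to the levels used by the detector ({P2, ..., P5}; see footnote 1) are our assumptions.

```python
import math

def roi_to_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Sketch of Eqn. (1): map an RoI of width w, height h to a pyramid level."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the available levels (P6 is not used by Fast R-CNN).
    return min(max(k, k_min), k_max)

# Example: a 112x112 RoI (half of 224 on each side) maps one level finer.
assert roi_to_level(112, 112) == 3
```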
We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [16], a ResNet's conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [16], we simply adopt RoI pooling to extract 7x7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.

Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid. Implementation details are given in the experimental section.
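A minimal sketch of this 2-fc head follows, assuming d = 256 FPN features, 7x7 RoI pooling, and 80 COCO classes plus background; the class count and layer names are our assumptions.

```python
import torch.nn as nn

class TwoFCHead(nn.Module):
    """Sketch: RoI features -> two hidden 1,024-d fc layers (each followed by
    ReLU) -> final classification and box regression layers."""
    def __init__(self, d=256, roi_size=7, num_classes=81):
        super().__init__()
        self.fc1 = nn.Linear(d * roi_size * roi_size, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.relu = nn.ReLU(inplace=True)
        self.cls_score = nn.Linear(1024, num_classes)      # class scores
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)  # per-class box deltas

    def forward(self, roi_feats):  # roi_feats: (N, d, 7, 7) from RoI pooling
        x = roi_feats.flatten(start_dim=1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.cls_score(x), self.bbox_pred(x)
```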
5. Experiments on Object Detection

We perform experiments on the 80 category COCO detection dataset [21]. We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). We also report final results on the standard test set (test-std) [21] which has no disclosed labels.

As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset. We use the pre-trained ResNet-50 and ResNet-101 models that are publicly available.² Our code is a reimplementation of py-faster-rcnn³ using Caffe2.⁴

²https://github.com/kaiminghe/deep-residual-networks
³https://github.com/rbgirshick/py-faster-rcnn
⁴https://github.com/caffe2/caffe2

5.1. Region Proposal with RPN

We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects (ARs, ARm, ARl) following the definitions in [21]. We report results for 100 and 1000 proposals per image (AR100 and AR1k).
Implementation details. All architectures in Table 1 are trained end-to-end. The input image is resized such that its shorter side has 800 pixels. We adopt synchronized SGD training on 8 GPUs. A mini-batch involves 2 images per GPU and 256 anchors per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 30k mini-batches and 0.002 for the next 10k. For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored. Other implementation details are as in [29]. Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.
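For reference, the recipe above can be summarized as a configuration sketch; the keys are hypothetical, while the values are those stated in the text.

```python
# Sketch: the RPN training recipe above as a config dict (keys are ours).
rpn_train_cfg = dict(
    short_side=800,           # resize input so the shorter side is 800 pixels
    num_gpus=8,               # synchronized SGD across 8 GPUs
    images_per_gpu=2,
    anchors_per_image=256,
    weight_decay=1e-4,
    momentum=0.9,
    lr_schedule=[(30_000, 0.02), (10_000, 0.002)],  # (mini-batches, lr)
    include_outside_anchors=True,  # unlike [29], keep cross-boundary anchors
)
```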
5.1.1 Ablation Experiments

Comparisons with baselines. For fair comparisons with original RPNs [29], we run two baselines (Table 1(a, b)) using the single-scale map of C4 (the same as [16]) or C5, both using the same hyper-parameters as ours, including using 5 scale anchors of {32², 64², 128², 256², 512²}. Table 1(b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.

Placing FPN in RPN improves AR1k to 56.3 (Table 1(c)), which is an 8.0-point increase over the single-scale RPN baseline (Table 1(a)). In addition, the performance on small objects (AR1k_s) is boosted by a large margin of 12.9 points. Our pyramid representation greatly improves RPN's robustness to object scale variation.

How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. With this modification, the 1x1 lateral connections followed by 3x3 convolutions are attached to the bottom-up pyramid. This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(c)).

The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(c)), especially for very deep ResNets. We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. This issue cannot be simply remedied by level-specific heads.

How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1x1 lateral connections. This top-down pyramid has strong semantic features and fine resolutions. But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps. As a result, FPN has an AR1k score 10 points higher than Table 1(e).

How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of P2 (i.e., the finest level in our pyramids). Similar to the single-scale baselines, we assign all anchors to the P2 feature map. This variant (Table 1(f)) is better than the baseline but inferior to our approach. RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.

In addition, we note that using P2 alone leads to more anchors (750k, Table 1(f)) caused by its large spatial resolution. This result suggests that a larger number of anchors is not sufficient in itself to improve accuracy.
RPN | feature | # anchors | lateral? | top-down? | AR100 | AR1k | AR1k_s | AR1k_m | AR1k_l
(a) baseline on conv4 | C4 | 47k | | | 36.1 | 48.3 | 32.0 | 58.7 | 62.2
(b) baseline on conv5 | C5 | 12k | | | 36.3 | 44.9 | 25.3 | 55.5 | 64.2
(c) FPN | {Pk} | 200k | ✓ | ✓ | 44.0 | 56.3 | 44.9 | 63.4 | 66.2
Ablation experiments follow:
(d) bottom-up pyramid | {Pk} | 200k | ✓ | | 37.4 | 49.5 | 30.5 | 59.9 | 68.0
(e) top-down pyramid, w/o lateral | {Pk} | 200k | | ✓ | 34.5 | 46.1 | 26.5 | 57.4 | 64.7
(f) only finest level | P2 | 750k | ✓ | ✓ | 38.4 | 51.3 | 35.1 | 59.7 | 67.6

Table 1. Bounding box proposal results using RPN [29], evaluated on the COCO minival set. All models are trained on trainval35k. The columns "lateral" and "top-down" denote the presence of lateral and top-down connections, respectively. The column "feature" denotes the feature maps on which the heads are attached. All results are based on ResNet-50 and share the same hyper-parameters.
Fast R-CNN | proposals | feature | head | lateral? | top-down? | [email protected] | AP | APs | APm | APl
(a) baseline on conv4 | RPN, {Pk} | C4 | conv5 | | | 54.7 | 31.9 | 15.7 | 36.5 | 45.5
(b) baseline on conv5 | RPN, {Pk} | C5 | 2fc | | | 52.9 | 28.8 | 11.9 | 32.4 | 43.4
(c) FPN | RPN, {Pk} | {Pk} | 2fc | ✓ | ✓ | 56.9 | 33.9 | 17.8 | 37.7 | 45.8
Ablation experiments follow:
(d) bottom-up pyramid | RPN, {Pk} | {Pk} | 2fc | ✓ | | 44.9 | 24.9 | 10.9 | 24.4 | 38.5
(e) top-down pyramid, w/o lateral | RPN, {Pk} | {Pk} | 2fc | | ✓ | 54.0 | 31.3 | 13.3 | 35.2 | 45.3
(f) only finest level | RPN, {Pk} | P2 | 2fc | ✓ | ✓ | 56.3 | 33.4 | 17.3 | 37.3 | 45.6

Table 2. Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, {Pk}, Table 1(c)), evaluated on the COCO minival set. Models are trained on the trainval35k set. All results are based on ResNet-50 and share the same hyper-parameters.
Faster R-CNN | proposals | feature | head | lateral? | top-down? | [email protected] | AP | APs | APm | APl
(*) baseline from He et al. [16]† | RPN, C4 | C4 | conv5 | | | 47.3 | 26.3 | - | - | -
(a) baseline on conv4 | RPN, C4 | C4 | conv5 | | | 53.1 | 31.6 | 13.2 | 35.6 | 47.1
(b) baseline on conv5 | RPN, C5 | C5 | 2fc | | | 51.7 | 28.0 | 9.6 | 31.9 | 43.1
(c) FPN | RPN, {Pk} | {Pk} | 2fc | ✓ | ✓ | 56.9 | 33.9 | 17.8 | 37.7 | 45.8

Table 3. Object detection results using Faster R-CNN [29] evaluated on the COCO minival set. The backbone networks for RPN and Fast R-CNN are consistent. Models are trained on the trainval35k set and use ResNet-50. †Provided by authors of [16].
5.2. Object Detection with Fast/Faster R-CNN

Next we investigate FPN for region-based (non-sliding window) detectors. We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes (namely, APs, APm, and APl) following the definitions in [21].

Implementation details. The input image is resized such that its shorter side has 800 pixels. Synchronized SGD is used to train the model on 8 GPUs. Each mini-batch involves 2 images per GPU and 512 RoIs per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 60k mini-batches and 0.002 for the next 20k. We use 2000 RoIs per image for training and 1000 for testing. Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.

5.2.1 Fast R-CNN (on fixed proposals)

To better investigate FPN's effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. For simplicity we do not share features between Fast R-CNN and RPN, except when specified.

As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14x14 and attach all conv5 layers as the hidden layers of the head. This gives an AP of 31.9 in Table 2(a). Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a).

Table 2(c) shows the results of our FPN in Fast R-CNN. Comparing with the baseline in Table 2(a), our method improves AP by 2.0 points and small object AP by 2.1 points. Comparing with the baseline that also adopts a 2fc head (Table 2(b)), our method improves AP by 5.1 points.⁵ These comparisons indicate that our feature pyramid is superior to single-scale features for a region-based object detector.

Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above subsection for RPN.

⁵We expect a stronger architecture of the head [30] will improve upon our results, which is beyond the focus of this paper.
method | backbone | competition | image pyramid | test-dev: [email protected] / AP / APs / APm / APl | test-std: [email protected] / AP / APs / APm / APl
ours, Faster R-CNN on FPN | ResNet-101 | - | | 59.1 / 36.2 / 18.2 / 39.0 / 48.2 | 58.5 / 35.8 / 17.5 / 38.7 / 47.8
Competition-winning single-model results follow:
G-RMI† | Inception-ResNet | 2016 | | - / 34.7 / - / - / - | - / - / - / - / -
AttractioNet‡ [10] | VGG16 + Wide ResNet§ | 2016 | | 53.4 / 35.7 / 15.6 / 38.0 / 52.7 | 52.9 / 35.3 / 14.7 / 37.6 / 51.9
Faster R-CNN +++ [16] | ResNet-101 | 2015 | | 55.7 / 34.9 / 15.6 / 38.7 / 50.9 | - / - / - / - / -
Multipath [40] (on minival) | VGG-16 | 2015 | | 49.6 / 31.5 / - / - / - | - / - / - / - / -
ION‡ [2] | VGG-16 | 2015 | | 53.4 / 31.2 / 12.8 / 32.9 / 45.2 | 52.9 / 30.7 / 11.8 / 32.8 / 44.8

Table 4. Comparisons of single-model results on the COCO detection benchmark. Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). †: http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf. ‡: http://mscoco.org/dataset/#detections-leaderboard. §: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result.
Figure 4. FPN for object segment proposals. The feature pyramid is constructed with identical structure as for object detection. We apply a small MLP on 5x5 windows to generate dense object segments with output dimension of 14x14. Shown in orange are the size of the image regions the mask corresponds to for each pyramid level (levels P3-5 are shown here; the labeled regions are 80x80 [64x64], 160x160 [128x128], and 320x320 [256x256]). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. Half octaves are handled by an MLP on 7x7 windows (7 ≈ 5√2), not shown here. Details are in the appendix.

method | image pyramid | AR | ARs | ARm | ARl | time (s)
DeepMask [27] | ✓ | 37.1 | 15.8 | 50.1 | 54.9 | 0.49
SharpMask [28] | ✓ | 39.8 | 17.4 | 53.1 | 59.1 | 0.77
InstanceFCN [4] | ✓ | 39.2 | – | – | – | 1.50†
FPN Mask Results:
single MLP [5x5] | | 43.4 | 32.5 | 49.2 | 53.7 | 0.15
single MLP [7x7] | | 43.5 | 30.0 | 49.6 | 57.8 | 0.19
dual MLP [5x5, 7x7] | | 45.7 | 31.9 | 51.5 | 60.8 | 0.24
+ 2x mask resolution | | 46.7 | 31.7 | 53.1 | 63.2 | 0.25
+ 2x train schedule | | 48.1 | 32.6 | 54.2 | 65.6 | 0.25

Table 6. Instance segmentation proposals evaluated on the first 5k COCO val images. All models are trained on the train set. DeepMask, SharpMask, and FPN use ResNet-50 while InstanceFCN uses VGG-16. DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the 'zoom' variants). †Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.
On the test-dev set, our method increases over the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of [email protected] (59.1 vs. 55.7). It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects. This could only be achieved by high-resolution image inputs with previous methods.

Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.

Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. See [14] for details.
6. Extensions: Segmentation Proposals

Our method is a generic pyramid representation and can be used in applications other than object detection. In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28].

DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores. At inference time, these models are run convolutionally to generate dense proposals in an image. To generate segments at multiple scales, image pyramids are necessary [27, 28].

It is easy to adapt FPN to generate mask proposals. We use a fully convolutional setup for both training and inference. We construct our feature pyramid as in Sec. 5.1 and set d = 128. On top of each level of the feature pyramid, we apply a small 5x5 MLP to predict 14x14 masks and object scores in a fully convolutional fashion; see Fig. 4. Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7x7 to handle half octaves. The two MLPs play a similar role as anchors in RPN. The architecture is trained end-to-end; full implementation details are given in the appendix.
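A minimal sketch of one such MLP head, realized as convolutions so that it runs fully convolutionally over a pyramid level; the hidden width and names are our assumptions, and the exact architecture is in the appendix.

```python
import torch.nn as nn

class MaskMLPHead(nn.Module):
    """Sketch: a 5x5 MLP applied densely on one pyramid level; each 5x5
    window predicts a 14x14 mask (flattened) and an object score."""
    def __init__(self, d=128, window=5, mask_size=14, hidden=512):
        super().__init__()
        # A conv with kernel = window size evaluates the MLP at every location.
        self.hidden = nn.Conv2d(d, hidden, kernel_size=window)
        self.relu = nn.ReLU(inplace=True)
        self.mask = nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, p):  # p: one level of the d = 128 feature pyramid
        x = self.relu(self.hidden(p))
        return self.mask(x), self.score(x)
```

A second instance with window=7 would play the role of the half-octave MLP described above.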
6.1. Segmentation Proposal Results

Results are shown in Table 6. We report segment AR and segment AR on small, medium, and large objects, always for 1000 proposals. Our baseline FPN model with a single 5x5 MLP achieves an AR of 43.4. Switching to a slightly larger 7x7 MLP leaves accuracy largely unchanged. Using both MLPs together increases accuracy to 45.7 AR. Increasing mask output size from 14x14 to 28x28 increases AR another point (larger sizes begin to degrade accuracy). Finally, doubling the training iterations increases AR to 48.1.

We also report comparisons to DeepMask [27], SharpMask [28], and InstanceFCN [4], the previous state-of-the-art methods in mask proposal generation. We outperform the accuracy of these approaches by over 8.3 points AR. In particular, we nearly double the accuracy on small objects.

Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by 2^{−2:0.5:1} in [27, 28]), making them computationally expensive. Our approach, based on FPNs, is substantially faster (our models run at 4 to 6 fps). These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.

7. Conclusion

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.
References

[1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 1984.
[2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[4] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[8] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
[9] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
[10] S. Gidaris and N. Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. In BMVC, 2016.
[11] R. Girshick. Fast R-CNN. In ICCV, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In CVPR, 2016.
[18] T. Kong, A. Yao, Y. Chen, and F. Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
[23] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In ICLR Workshop, 2016.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[26] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[27] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[28] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. TPAMI, 2016.
[31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[32] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[35] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[37] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[38] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
[39] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
[40] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.