NAS-FCOS: Fast Neural Architecture Search for Object Detection
CVPR 2020
Ning Wang†‡, Yang Gao†‡, Hao Chen⋄, Peng Wang†‡, Zhi Tian⋄, Chunhua Shen⋄, Yanning Zhang†‡
† School of Computer Science, Northwestern Polytechnical University, China
‡ National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, China
⋄ The University of Adelaide, Australia
overall detection architecture is based on FCOS [24], a simple anchor-free one-stage object detection framework, in which the feature pyramid network and prediction head are searched using our proposed NAS method.

Our main contributions are summarized as follows.

• We propose a fast and memory-efficient NAS method for searching both FPN and head architectures, with carefully designed proxy tasks, search space and evaluation strategies, which is able to find top-performing architectures among more than 3,000 candidates using only 28 GPU-days. Specifically, this high efficiency is enabled by the following designs:
− developing a fast proxy task training scheme by skipping the backbone finetuning stage;
− adapting a progressive search strategy to reduce the time cost incurred by the extended search space;
− using a more discriminative criterion for evaluating searched architectures;
− employing an efficient anchor-free one-stage detection framework with simple post-processing.

• Using NAS, we explore the workload relationship between the FPN and the head, proving the importance of weight sharing in the head.

• We show that the overall structure of NAS-FCOS is general and flexible in that it can be equipped with various backbones, including MobileNetV2, ResNet-50, ResNet-101 and ResNeXt-101, and surpasses state-of-the-art object detection algorithms at comparable computational complexity and memory footprint. More specifically, our model improves AP by 1.5 ∼ 3.5 points over the FCOS counterparts of all the above models.

2. Related Work

2.1. Object Detection

The frameworks of deep neural networks for object detection can be roughly categorized into two types: one-stage detectors [12] and two-stage detectors [6, 20].

Two-stage detection frameworks first generate class-independent region proposals using a region proposal network (RPN), and then classify and refine them using extra detection heads. In spite of achieving top performance, the two-stage methods have noticeable drawbacks: they are computationally expensive and have many hyper-parameters that need to be tuned to fit a specific dataset. In comparison, the structures of one-stage detectors are much simpler. They directly predict object categories and bounding boxes at each location of the feature maps generated by a single CNN backbone.

Note that most state-of-the-art object detectors (including both one-stage detectors [12, 16, 19] and two-stage detectors [20]) make predictions based on anchor boxes of different scales and aspect ratios at each convolutional feature-map location. However, the use of anchor boxes may lead to a severe imbalance between object and non-object examples and introduces extra hyper-parameters. More recently, anchor-free one-stage detectors [9, 10, 24, 29, 30] have attracted increasing research interest due to their simple fully convolutional architectures and reduced consumption of computational resources.

2.2. Neural Architecture Search

NAS is usually time consuming: we have seen search costs improve from 24,000 GPU-days [32] to 0.2 GPU-day [28]. The trick is to first construct a supernet containing the complete search space and train the candidates all at once with bi-level optimization and efficient weight sharing [13, 15]. However, the large memory allocation and the difficulties of approximated optimization prohibit the search for more complex structures.

Recently, researchers [1, 5, 23] have proposed applying single-path training to reduce the bias introduced by approximation and model simplification of the supernet. DetNAS [2] follows this idea to search for an efficient object detection architecture. One limitation of the single-path approach is that the search space is restricted to a sequential structure. Single-path sampling and straight-through estimation of the weight gradients introduce large variance into the optimization process and prohibit the search for more complex structures under this framework. Within such a simple search space, NAS algorithms can only make trivial decisions such as kernel sizes for manually designed modules.

Object detection models differ from single-path image classification networks in the way they merge multi-level features and distribute the task to parallel prediction heads. Feature pyramid networks (FPNs) [4, 8, 11, 14, 27], designed to handle this job, play an important role in modern object detection models. NAS-FPN [4] searches for an FPN alternative based on the one-stage framework RetinaNet [12]. Feature pyramid architectures are sampled with a recurrent neural network (RNN) controller, which is trained with reinforcement learning (RL). However, the search is very time-consuming even though a proxy task with a ResNet-10 backbone is used to evaluate each architecture.

Since all three lines of research ([2, 4] and ours) focus on the object detection framework, we summarize their differences: DetNAS [2] aims to search for better backbone designs, NAS-FPN [4] searches for the FPN structure, and our search space contains both the FPN and the head structure.
To speed up reward evaluation for RL-based NAS, the work of [17] proposes progressive tasks and other training acceleration methods. By caching the encoder features, they are able to train semantic segmentation decoders with very large batch sizes very efficiently. In the sequel of this paper, we refer to this technique as fast decoder adaptation. However, directly applying this technique to object detection tasks does not bring a similar speed boost, because existing detectors either do not use a fully convolutional model [11] or require complicated post-processing that does not scale with the batch size [12].

To reduce the post-processing overhead, we resort to a recently introduced anchor-free one-stage framework, namely FCOS [24], which significantly improves search efficiency by eliminating the anchor-box matching of RetinaNet. Compared to its anchor-based counterpart, FCOS significantly reduces the training memory footprint while improving performance.

3. Our Approach

In our work, we search for anchor-free fully convolutional detection models with fast decoder adaptation, so that NAS methods can be easily applied.

3.1. Problem Formulation

We base our search algorithm upon the one-stage framework FCOS due to its simplicity. Our training tuples {(x, Y)} consist of input image tensors x of size (3 × H × W) and FCOS output targets Y in a pyramid representation, which is a list of tensors y_l, each of size ((K + 4 + 1) × H_l × W_l), where H_l × W_l is the feature map size on level l of the pyramid. (K + 4 + 1) is the number of output channels of FCOS; the three terms are the length-K one-hot classification labels, the 4 bounding-box regression targets and the 1 center-ness factor, respectively.

The network g : x → Ŷ in the original FCOS consists of three parts: a backbone b, an FPN f and multi-level subnets that we call prediction heads h in this paper. First, the backbone b : x → C maps the input tensor to a set of intermediate-level features C = {c_3, c_4, c_5}, with resolutions (H_i × W_i) = (H/2^i × W/2^i). Then the FPN f : C → P maps the features to a feature pyramid P = {p_3, p_4, p_5, p_6, p_7}. Finally, the prediction head h : p → y is applied to each level of P and the results are collected to create the final prediction. To avoid overfitting, the same h is usually applied to all instances in P.

Since objects of different scales require different effective receptive fields, the mechanism that selects and merges the intermediate-level features C is particularly important in object detection network design. Thus, most research efforts [16, 20] are devoted to designing f and h while using widely adopted backbone structures such as ResNet [7]. Following this principle, our search goal is to decide when to choose which features from C and how to merge them.

To improve efficiency, we reuse the parameters of b pretrained on the target dataset and search for the optimal structures on top of them. For the convenience of the following statements, we call the network components to be searched, namely f and h, together the decoder structure of the object detection network.

f and h take care of different parts of the detection job: f extracts features targeting different object scales in the pyramid representation P, while h is a unified mapping applied to each feature in P to avoid overfitting. In practice, people seldom discuss the possibility of using a more diversified f to extract features at different levels, or how many layers of h need to be shared across levels. In this work, we use NAS as an automatic method to test these possibilities.
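To make these shapes concrete, the following is a minimal sketch (ours, not the authors' released code), assuming PyTorch, an illustrative 256-channel width, and random stand-in tensors in place of the real backbone and FPN computation:

    import torch

    K = 80                        # number of classes (e.g. COCO)
    H = W = 512                   # input resolution, illustrative only
    x = torch.randn(1, 3, H, W)   # input image tensor (a real b would consume x)

    # Backbone b : x -> C = {c3, c4, c5}; stand-in tensors at strides 2^i.
    C = {i: torch.randn(1, 256, H // 2**i, W // 2**i) for i in (3, 4, 5)}

    # FPN f : C -> P = {p3, ..., p7}; stand-in tensors at strides 2^l.
    P = {l: torch.randn(1, 256, H // 2**l, W // 2**l) for l in range(3, 8)}

    # Head h : p -> y, shared across levels; the (K + 4 + 1) output channels
    # are K class logits, 4 box-regression targets and 1 center-ness score.
    h = torch.nn.Conv2d(256, K + 4 + 1, kernel_size=3, padding=1)
    Y = {l: h(p) for l, p in P.items()}
    for l, y in Y.items():
        assert y.shape == (1, K + 4 + 1, H // 2**l, W // 2**l)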
3.2. Search Space

Considering the different functions of f and h, we apply two search spaces respectively. Given the particularity of the FPN structure, a basic block with a new overall connection and f's output design is built for it. For simplicity, a sequential space is applied for the h part.

We replace the cell structure with atomic operations to provide even more flexibility. To construct one basic block, we first choose two layers x_1, x_2 from the sampling pool X at positions id1, id2; then two operations op1, op2 are applied to each of them, and an aggregation operation agg merges the two outputs into one feature. To build a deep decoder structure, we apply multiple basic blocks, with their outputs added to the sampling pool. Our basic block bb_t : X_{t−1} → X_t at time step t transforms the sampling pool X_{t−1} into X_t = X_{t−1} ∪ {x_t}, where x_t is the output of bb_t.

The candidate operations are listed in Table 1. We include only separable/depth-wise convolutions so that the decoder can be efficient. In order to enable the decoder to apply convolutional filters on irregular grids, we also include deformable 3 × 3 convolutions [31]. For the aggregation operations, we include element-wise sum and concatenation followed by a 1 × 1 convolution.

ID  Description
0   separable conv 3 × 3
1   separable conv 3 × 3 with dilation rate 3
2   separable conv 5 × 5 with dilation rate 6
3   skip-connection
4   deformable 3 × 3 convolution
Table 1. Unary operations used in the search process.
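The basic-block sampling described above can be sketched as follows (our illustration, assuming all pool features share one resolution; only two Table 1 operations are stubbed in, and the sum aggregation is used):

    import random
    import torch
    import torch.nn as nn

    def sep_conv3x3(c):           # stand-in for Table 1 ID 0 (separable conv 3x3)
        return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                             nn.Conv2d(c, c, 1))

    UNARY = {0: sep_conv3x3, 3: lambda c: nn.Identity()}   # subset of Table 1

    def basic_block(pool, channels, rng):
        # bb_t : X_{t-1} -> X_t = X_{t-1} ∪ {x_t}: sample two inputs (id1, id2),
        # apply two unary ops (op1, op2), merge with an aggregation (sum here;
        # the paper also allows concatenation followed by a 1x1 conv).
        x1, x2 = rng.choice(pool), rng.choice(pool)
        op1 = UNARY[rng.choice(list(UNARY))](channels)
        op2 = UNARY[rng.choice(list(UNARY))](channels)
        return pool + [op1(x1) + op2(x2)]

    rng = random.Random(0)
    pool = [torch.randn(1, 64, 32, 32) for _ in range(3)]  # X_0 = C (one size here)
    for _ in range(7):                                     # the FPN applies bb 7 times
        pool = basic_block(pool, 64, rng)
    p3, p4, p5 = pool[-3:]                                 # last three outputs form P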
The decoder configuration can be represented by a sequence with three components: the FPN configuration, the head configuration and the weight-sharing stage. We provide detailed descriptions of each of them in the following sections. The complete diagram of our decoder structure is shown in Fig. 1.

3.2.1 FPN Search Space

As mentioned above, the FPN f maps the convolutional features C to P. First, we initialize the sampling pool as X_0 = C. Our FPN is defined by applying the basic block 7 times to the sampling pool, f := bb^f_1 ◦ bb^f_2 ◦ · · · ◦ bb^f_7. To yield the pyramid features P, we collect the last three basic block outputs {x_5, x_6, x_7} as {p_3, p_4, p_5}.

To allow information to be shared across all layers, we use a simple rule to create global features. If there is some dangling layer x_t that is neither sampled by later blocks {bb^f_i | i > t} nor among the last three layers (t < 5), we use element-wise add to merge it into all output features:

p*_i = p_i + x_t,  i ∈ {3, 4, 5}.    (1)

As with the aggregation operations, if the features have different resolutions, the smaller one is upsampled with bilinear interpolation.

To be consistent with FCOS, p_6 and p_7 are obtained via a 3 × 3 stride-2 convolution on p_5 and p_6, respectively.
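A sketch of the rule in Eq. (1) (ours; the names xs, P and sampled_ids are hypothetical bookkeeping, with sampled_ids recording which pool entries later blocks consumed):

    import torch
    import torch.nn.functional as F

    def merge_dangling(xs, P, sampled_ids):
        # xs: {t: x_t} basic-block outputs; P: {3: p3, 4: p4, 5: p5};
        # sampled_ids: indices consumed by some later block bb_i (i > t).
        for t, xt in xs.items():
            if t < 5 and t not in sampled_ids:   # dangling: unused, not in last three
                for i in P:
                    xi = xt
                    if xi.shape[-2:] != P[i].shape[-2:]:
                        # resize to match p_i (the paper bilinearly upsamples
                        # the smaller feature map)
                        xi = F.interpolate(xi, size=P[i].shape[-2:],
                                           mode="bilinear", align_corners=False)
                    P[i] = P[i] + xi             # Eq. (1): p*_i = p_i + x_t
        return P

    P = merge_dangling(
        {t: torch.randn(1, 64, 32, 32) for t in range(1, 8)},
        {i: torch.randn(1, 64, 32, 32) for i in (3, 4, 5)},
        sampled_ids={2, 3},   # here x_1 and x_4 are the dangling layers
    )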
3.2.2 Prediction Head Search Space

The prediction head h maps each feature of the pyramid P to its corresponding output y; in FCOS and RetinaNet it consists of four 3 × 3 convolutions. To explore the potential of the head, we therefore extend a sequential search space for its generation. Specifically, our head is defined as a sequence of six basic operations. Compared with the candidate operations of the FPN structure, the head search space has two slight differences. First, we add standard convolution modules (including conv1x1 and conv3x3) to the head sampling pool for better comparison. Second, we follow the design of FCOS by replacing all Batch Normalization (BN) layers with Group Normalization (GN) [25] in the head's operation sampling pool, since the head needs to share weights between different levels, which renders BN invalid. The final output of the head is the output of the last (sixth) layer.
3.2.3 Searching for Head Weight Sharing

To add even more flexibility and to understand the effect of weight sharing in prediction heads, we further add an index i as the location where the prediction head starts to share weights. For every layer before stage i, the head h creates an independent set of weights for each FPN output level; from stage i onwards, it uses the globally shared weights.

Considering the independent part of the heads as an extension of the FPN branches and the shared part as a head of adaptive length, we can further balance the workload between the individual FPN branches, which extract level-specific features, and the prediction head shared across all levels.
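A sketch of this partial-sharing scheme (ours; plain conv3x3 + GN stand-ins replace the searched operations, with six layers, five pyramid levels, and a hypothetical sharing index share_from):

    import torch
    import torch.nn as nn

    def conv_gn(c):   # stand-in for a searched head op; GN replaces BN as in FCOS
        return nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                             nn.GroupNorm(32, c), nn.ReLU())

    class PartiallySharedHead(nn.Module):
        def __init__(self, channels=128, depth=6, levels=5, share_from=2):
            super().__init__()
            # Layers before `share_from` are level-specific ("independent part");
            # layers from `share_from` on reuse one set of weights for all
            # pyramid levels ("shared part").
            self.private = nn.ModuleList(
                nn.Sequential(*[conv_gn(channels) for _ in range(share_from)])
                for _ in range(levels))
            self.shared = nn.Sequential(
                *[conv_gn(channels) for _ in range(depth - share_from)])

        def forward(self, pyramid):   # pyramid: list of P features, one per level
            return [self.shared(branch(p))
                    for branch, p in zip(self.private, pyramid)]

    head = PartiallySharedHead()
    P = [torch.randn(1, 128, 64 // 2**l, 64 // 2**l) for l in range(5)]
    ys = head(P)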
3.3. Search Strategy

An RL-based strategy is applied to the search process. We rely on an LSTM-based controller to predict the full configuration. We use a progressive search strategy rather than a joint search over both the FPN structure and the prediction head, since the former requires far less computing resources and time than the latter. The training dataset is randomly split into a meta-train subset D_t and a meta-val subset D_v. To speed up the training, we fix the backbone network and cache the pre-computed backbone output C. This makes the training cost of a single architecture independent of the depth of the backbone network. Taking advantage of this, we can apply much more complex backbone structures and utilize high-quality multi-level features as our decoder's input. We find that the backbone finetuning process can be skipped if the cached features are powerful enough. Speedup techniques such as Polyak weight averaging are also applied during training.

The most widely used detection metric is average precision (AP). However, due to the difficulty of the object detection task, AP is too low at the early stages to tell good architectures from bad ones, which makes the controller take much longer to converge. To make the architecture evaluation process easier even at the early stages of training, we therefore use the negative sum of the losses as the reward instead of average precision:

R(a) = − Σ_{(x,Y) ∈ D_v} ( L_cls(x, Y | a) + L_reg(x, Y | a) + L_ctr(x, Y | a) )    (2)

where L_cls, L_reg, L_ctr are the three loss terms in FCOS. The gradient of the controller is estimated via proximal policy optimization (PPO) [22].
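In code, the reward of Eq. (2) simply accumulates the negative FCOS losses over the meta-val subset; a sketch assuming a hypothetical fcos_losses helper that returns the three loss terms as tensors:

    import torch

    def reward(model, meta_val, fcos_losses):
        # R(a) = - sum over (x, Y) in D_v of (L_cls + L_reg + L_ctr); Eq. (2).
        total = 0.0
        with torch.no_grad():
            for x, Y in meta_val:
                l_cls, l_reg, l_ctr = fcos_losses(model, x, Y)
                total += (l_cls + l_reg + l_ctr).item()
        return -total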
Figure 1. A conceptual example of our NAS-FCOS decoder. It consists of two sub-networks, an FPN f and a set of prediction heads h which share structures. One notable difference from other FPN-based one-stage detectors is that our heads have partially shared weights: only the last several layers of the prediction heads (marked in yellow) are tied by their weights. The number of layers to share is decided automatically by the search algorithm. Note that both the FPN and the head in our actual search space have more layers than shown in this figure; the figure is for illustration only.
4. Experiments

4.1. Implementation Details

4.1.1 Searching Phase

We design a fast proxy task for evaluating the decoder architectures sampled in the searching phase. PASCAL VOC is selected as the proxy dataset; it contains 5715 training images with object bounding-box annotations for 20 classes. The transfer capacity of the searched structures can be assessed since the search phase and the full training phase use different datasets. The VOC training set is randomly split into a meta-train set of 4,000 images and a meta-val set of 1715 images. For each sampled architecture, we train it on meta-train and compute the reward (2) on meta-val. Input images are resized to a short side of 384 and then randomly cropped to 384 × 384. Target object sizes of interest are scaled correspondingly. We use the Adam optimizer with learning rate 8e−4 and batch size 200. Polyak averaging is applied with a decay rate of 0.9. The decoder is evaluated after 300 iterations. As we use fast decoder adaptation, the backbone features are fixed and cached during the search phase. To enhance the cached backbone features, we first initialize them with the pre-trained weights provided by the open-source implementation of FCOS and then finetune on VOC using the training strategies of FCOS. Note that this finetuning is performed only once, at the beginning of the search phase.
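Polyak weight averaging here can be sketched as an exponential moving average over parameters (a minimal version of ours, using the decay rate of 0.9 stated above; buffers are ignored for brevity):

    import copy
    import torch

    @torch.no_grad()
    def polyak_update(avg_model, model, decay=0.9):
        # avg <- decay * avg + (1 - decay) * current, applied after each step.
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.mul_(decay).add_(p, alpha=1 - decay)

    model = torch.nn.Linear(8, 8)      # stand-in for the sampled decoder
    avg_model = copy.deepcopy(model)   # averaged weights used for evaluation
    # ... after each optimizer step during the 300 proxy iterations:
    polyak_update(avg_model, model)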
A progressive strategy is used for the search of f and h (see the sketch below). We first search for the FPN part while retaining the original head. All operations in the FPN structure have 64 output channels. The decoder inputs C are resized to fit the output channel width of the FPN via 1 × 1 convolutions. After this step, the searched FPN structure is fixed and the second-stage search for the head starts on top of it. Most parameters for searching the head are identical to those for searching the FPN structure, except that the output channel width is increased from 64 to 128 to deliver more information.
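The progressive schedule can be summarized as two sequential controller loops; the sketch below is schematic (controller, train_proxy and reward are hypothetical helpers, and it compresses the paper's top-20/top-10 selection into a simple arg-max):

    def progressive_search(controller, train_proxy, reward,
                           n_fpn=2800, n_head=600):
        # Stage 1: search the FPN while keeping the original FCOS head.
        fpn_pool = []
        for _ in range(n_fpn):
            fpn_cfg = controller.sample_fpn()
            r = reward(train_proxy(fpn_cfg, head_cfg=None))  # default head
            controller.update(r)                             # PPO step
            fpn_pool.append((r, fpn_cfg))
        best_fpn = max(fpn_pool, key=lambda t: t[0])[1]
        # Stage 2: fix the searched FPN (with its features pre-fetched in the
        # paper), then search the head on top of it.
        head_pool = []
        for _ in range(n_head):
            head_cfg = controller.sample_head()
            r = reward(train_proxy(best_fpn, head_cfg))
            controller.update(r)
            head_pool.append((r, head_cfg))
        return best_fpn, max(head_pool, key=lambda t: t[0])[1]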
For the FPN search part, the controller model nearly converged after searching over 2.8K architectures on the proxy task, as shown in Fig. 2. Then the top-20 best-performing architectures on the proxy task are selected for the next full training phase. For the head search part, we choose the best searched FPN among the top-20 architectures and pre-fetch its features. It takes about 600 rounds for the controller to nearly converge, which is much faster than for searching FPN architectures. After that, we select the top-10 heads that achieve the best performance on the proxy task for full training. In total, the whole search phase can be finished within 4 days using 8 V100 GPUs.

Figure 2. Reward during the proxy task. The reward keeps growing throughout the search, indicating that the reinforcement learning controller is working.

4.1.2 Full Training Phase

In this phase, we fully train the searched models on the MS COCO training dataset and select the best one by evaluating them on MS COCO validation images. Note that our training configurations are exactly the same as those of FCOS for fair comparison. Input images are resized to a short side of 800, with the maximum long side set to 1333. The models are trained using 4 V100 GPUs with batch size 16 for 90K iterations. The initial learning rate is 0.01 and is reduced to one tenth at the 60K-th and 80K-th iterations. The improving tricks are applied only on the final model (w/improv).
Decoder Backbone FLOPs (G) Params (M) AP
FPN-RetinaNet @256 MobileNetV2 133.4 11.3 30.8
FPN-FCOS @256 MobileNetV2 105.4 9.8 31.2
NAS-FCOS (ours) @128 MobileNetV2 39.3 5.9 32.0
NAS-FCOS (ours) @128-256 MobileNetV2 95.6 9.9 33.8
NAS-FCOS (ours) @256 MobileNetV2 121.8 16.1 34.7
FPN-RetinaNet @256 R-50 198.0 33.6 36.1
FPN-FCOS @256 R-50 169.9 32.0 37.4
NAS-FCOS (ours) @128 R-50 104.0 27.8 37.9
NAS-FCOS (ours) @128-256 R-50 160.4 31.8 39.1
NAS-FCOS (ours) @256 R-50 189.6 38.4 39.8
FPN-RetinaNet @256 R-101 262.4 52.5 37.8
FPN-FCOS @256 R-101 234.3 50.9 41.5
NAS-FCOS (ours) @256 R-101 254.0 57.3 43.0
FPN-FCOS @256 X-64x4d-101 371.2 89.6 43.2
NAS-FCOS (ours) @128-256 X-64x4d-101 361.6 89.4 44.5
FPN-FCOS @256 w/improvements X-64x4d-101 371.2 89.6 44.7
NAS-FCOS (ours) @128-256 w/improvements X-64x4d-101 361.6 89.4 46.1
Table 2. Results on the MS COCO test-dev set after full training. R-50 and R-101 denote ResNet backbones and X-64x4d-101 denotes ResNeXt-101 (64 × 4d). All networks share the same input image resolution. FLOPs and parameters are measured at 1088 × 800, the median input size on COCO. For RetinaNet and FCOS, we use the official models provided by the authors. For our NAS-FCOS, @128 and @256 mean that the decoder channel width is 128 and 256, respectively; @128-256 is the decoder with 128 FPN width and 256 head width. The same improving tricks used in the newest FCOS version are applied to our model for fair comparison.
Figure 3. Our discovered FPN structure. C2 is omitted from this figure since it is not chosen by this particular structure during the search process.

Figure 4. Our discovered Head structure.

4.2. Search Results

The best FPN structure is illustrated in Fig. 3. The controller identifies deformable convolution and concatenation as the best performing unary and aggregation operations, respectively. From Fig. 4, we can see that the controller chooses to use 4 operations (with two skip connections), rather than the maximum allowed 6 operations. Note that the discovered "dconv + 1x1 conv" structure achieves a good trade-off between accuracy and FLOPs. Compared with the original head, our searched head has fewer FLOPs/Params (FLOPs 79.24G vs. 89.16G, Params 3.41M vs. 4.92M) and significantly better performance (AP 38.7 vs. 37.4).

We use the searched decoder together with either lightweight backbones such as MobileNet-V2 [21] or more powerful backbones such as ResNet-101 [7] and ResNeXt-101 [26]. To balance performance and efficiency, we implement three decoders with different computation budgets: one with feature dimension 128 (@128), one with 256 (@256), and another with FPN channel width 128 and prediction head width 256 (@128-256). The results on COCO test-dev with a short side of 800 are shown in Table 2. The searched decoder with feature dimension 256 (@256) surpasses its FCOS counterpart by 1.5 to 3.5 points in AP under different backbones. The one with 128 channels (@128) has significantly fewer parameters and lower computation, making it more suitable for resource-constrained environments. In particular, our searched model with 128 channels and the MobileNetV2 backbone surpasses the original FCOS with the same backbone by 0.8 AP points with only 1/3 of the FLOPs. The third type of decoder (@128-256) achieves a good balance between accuracy and parameters. Note that our searched model outperforms the strongest FCOS variant by 1.4 AP points (46.1 vs. 44.7) with slightly fewer FLOPs and Params. The comparison of FLOPs and parameters against AP is illustrated in Fig. 7 and Fig. 8.
Figure 5. Trend of head weight sharing during the search. The horizontal axis is the index of the statistical period; each period consists of 50 sampled head structures. The vertical axis is the proportion of those 50 structures whose heads fully share weights.

Figure 6. Correlation between the search reward obtained on the VOC meta-val dataset and the AP evaluated on COCO-val.

Figure 7. Relationship between FLOPs and AP for different backbones. Points of different shapes represent different backbones. NAS-FCOS@128 slightly increases precision while also reducing computation; the 256-channel variant obtains the highest precision at higher computational complexity; FPN channel width 128 with prediction head width 256 (@128-256) offers a trade-off.

Figure 8. Relationship between parameters and AP for different backbones. Adjusting the number of channels in the FPN structure and head helps to achieve a balance between accuracy and parameters.
Arch FLOPs (G) Search Cost (GPU-day) Searched Archs AP
NAS-FPN @256 R-50 >325.0 333×#TPUs 17000 <38.0
NAS-FPN 7@256 R-50 1125.5 333×#TPUs 17000 44.8
DetNAS-FPN-Faster - 44 2200 40.2
DetNAS-RetinaNet - 44 2200 33.3
NAS-FCOS (ours) @256 R-50 189.6 28 3000 39.8
NAS-FCOS (ours) @128-256 X-64x4d-101 361.6 28 3000 46.1
Table 3. Comparison with other NAS methods. For NAS-FPN, the input size is 1280 × 1280 and the search cost should be multiplied by the number of TPUs used to train each architecture. Note that the FLOPs and AP of NAS-FPN @256 here are taken from Figure 11 in NAS-FPN [4], and NAS-FPN 7@256 stacks the searched FPN structure 7 times. For DetNASNet [2] and our models, input images are resized such that their shorter side is 800 pixels.
References
[1] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proc. Int. Conf. Learn. Representations, 2019.
[2] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Neural architecture search on object detection. In Proc. Advances in Neural Inf. Process. Syst., 2019.
[3] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. In J. Mach. Learn. Res., 2019.
[4] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[5] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2961–2969, 2017.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proc. Eur. Conf. Comp. Vis., pages 630–645, 2016.
[8] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[9] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019.
[10] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proc. Eur. Conf. Comp. Vis., pages 734–750, 2018.
[11] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2117–2125, 2017.
[12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2980–2988, 2017.
[13] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[14] Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, and Wei Jiang. An end-to-end network for panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In Proc. Int. Conf. Learn. Representations, 2019.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander Berg. SSD: Single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pages 21–37, 2016.
[17] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[18] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In Proc. Int. Conf. Mach. Learn., 2018.
[19] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pages 91–99, 2015.
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4510–4520, 2018.
[22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[23] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
[24] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
[25] Yuxin Wu and Kaiming He. Group normalization. In Proc. Eur. Conf. Comp. Vis., pages 3–19, 2018.
[26] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[27] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[28] Hongpeng Zhou, Minghao Yang, Jun Wang, and Wei Pan. BayesNAS: A Bayesian approach for neural architecture search. In Proc. Int. Conf. Mach. Learn., 2019.
[29] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[30] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[31] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[32] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proc. Int. Conf. Learn. Representations, 2017.