
NAS-FCOS: Fast Neural Architecture Search for Object Detection∗

Ning Wang†‡, Yang Gao†‡, Hao Chen⋄, Peng Wang†‡, Zhi Tian⋄, Chunhua Shen⋄, Yanning Zhang†‡

† School of Computer Science, Northwestern Polytechnical University, China

‡ National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, China

⋄ The University of Adelaide, Australia

∗ NW, YG, HC contributed to this work equally.

Abstract

The success of deep neural networks relies on significant architecture engineering. Recently, neural architecture search (NAS) has emerged as a promising approach to greatly reduce the manual effort in network design by automatically searching for optimal architectures, although such algorithms typically need an excessive amount of computational resources, e.g., a few thousand GPU-days. To date, on challenging vision tasks such as object detection, NAS, especially fast versions of NAS, is less studied. Here we propose to search for the decoder structure of object detectors with search efficiency taken into consideration. To be more specific, we aim to efficiently search for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely FCOS [24], using a tailored reinforcement learning paradigm. With a carefully designed search space, search algorithms and strategies for evaluating network quality, we are able to find a top-performing detection architecture within 4 days using 8 V100 GPUs. The discovered architecture surpasses state-of-the-art object detection models (such as Faster R-CNN, RetinaNet and FCOS) by 1.5 to 3.5 points in AP on the COCO dataset, with comparable computation complexity and memory footprint, demonstrating the efficacy of the proposed NAS method for object detection.

1. Introduction

Object detection is one of the fundamental tasks in computer vision and has been researched extensively. In the past few years, state-of-the-art methods for this task have been based on deep convolutional neural networks (such as Faster R-CNN [20] and RetinaNet [11]), due to their impressive performance. Typically, the design of an object detection network is much more complex than that of an image classification network, because the former needs to localize and classify multiple objects in an image simultaneously while the latter only needs to output image-level labels. Due to its complex structure and numerous hyper-parameters, designing effective object detection networks is more challenging and usually requires much manual effort.

On the other hand, Neural Architecture Search (NAS) approaches [4, 17, 32] have been showing impressive results in automatically discovering top-performing neural network architectures in large-scale search spaces. Compared to manual designs, NAS methods are data-driven instead of experience-driven, and hence need much less human intervention. As defined in [3], the workflow of NAS can be divided into three processes: 1) sampling an architecture from a search space following some search strategy; 2) evaluating the performance of the sampled architecture; and 3) updating the parameters based on the performance.

One of the main problems prohibiting NAS from being used in more realistic applications is its search efficiency. The evaluation process is the most time-consuming part because it involves a full training procedure of a neural network. To reduce the evaluation time, in practice a proxy task is often used as a lower-cost substitute. In the proxy task, the input, network parameters and training iterations are often scaled down to speed up the evaluation. However, there is often a performance gap for samples between the proxy task and the target task, which makes the evaluation process biased. How to design proxy tasks that are both accurate and efficient for a specific problem is itself challenging. Another way to improve search efficiency is to construct a supernet that covers the complete search space and to train candidate architectures with shared parameters [15, 18]. However, this solution leads to significantly increased memory consumption and restricts itself to small-to-moderate sized search spaces.

To our knowledge, studies on efficient and accurate NAS approaches for object detection networks are rare, despite their significant importance. To this end, we present a fast and memory-saving NAS method for object detection networks, which is capable of discovering top-performing architectures within significantly reduced search time. Our overall detection architecture is based on FCOS [24], a simple anchor-free one-stage object detection framework, in which the feature pyramid network and prediction head are searched using our proposed NAS method.
Our main contributions are summarized as follows.

• In this work, we propose a fast and memory-efficient NAS method for searching both FPN and head architectures, with carefully designed proxy tasks, search space and evaluation strategies, which is able to find a top-performing architecture among over 3,000 searched architectures using only 28 GPU-days. Specifically, this high efficiency is enabled by the following designs:

− developing a fast proxy-task training scheme by skipping the backbone finetuning stage;

− adopting a progressive search strategy to reduce the time cost incurred by the extended search space;

− using a more discriminative criterion for the evaluation of searched architectures;

− employing an efficient anchor-free one-stage detection framework with simple post-processing.

• Using NAS, we explore the workload relationship between the FPN and the head, demonstrating the importance of weight sharing in the head.

• We show that the overall structure of NAS-FCOS is general and flexible in that it can be equipped with various backbones including MobileNetV2, ResNet-50, ResNet-101 and ResNeXt-101, and surpasses state-of-the-art object detection algorithms with comparable computation complexity and memory footprint. More specifically, our model can improve the AP by 1.5 ∼ 3.5 points over the FCOS counterparts of all the above models.

2. Related Work

2.1. Object Detection

The frameworks of deep neural networks for object detection can be roughly categorized into two types: one-stage detectors [12] and two-stage detectors [6, 20]. Two-stage detection frameworks first generate class-independent region proposals using a region proposal network (RPN), and then classify and refine them using extra detection heads. In spite of achieving top performance, two-stage methods have noticeable drawbacks: they are computationally expensive and have many hyper-parameters that need to be tuned to fit a specific dataset. In comparison, the structures of one-stage detectors are much simpler. They directly predict object categories and bounding boxes at each location of the feature maps generated by a single CNN backbone.

Note that most state-of-the-art object detectors (including both one-stage detectors [12, 16, 19] and two-stage detectors [20]) make predictions based on anchor boxes of different scales and aspect ratios at each convolutional feature map location. However, the use of anchor boxes may lead to a high imbalance between object and non-object examples and introduces extra hyper-parameters. More recently, anchor-free one-stage detectors [9, 10, 24, 29, 30] have attracted increasing research interest, due to their simple fully convolutional architectures and reduced consumption of computational resources.

2.2. Neural Architecture Search

NAS is usually time consuming: we have seen great improvements from 24,000 GPU-days [32] to 0.2 GPU-day [28]. The trick is to first construct a supernet containing the complete search space and train the candidates all at once with bi-level optimization and efficient weight sharing [13, 15]. However, the large memory allocation and the difficulties in approximated optimization prohibit the search for more complex structures.

Recently, researchers [1, 5, 23] have proposed applying single-path training to reduce the bias introduced by the approximation and model simplification of the supernet. DetNAS [2] follows this idea to search for an efficient object detection architecture. One limitation of the single-path approach is that the search space is restricted to a sequential structure. Single-path sampling and the straight-through estimate of the weight gradients introduce large variance into the optimization process and prohibit the search for more complex structures under this framework. Within such a simple search space, NAS algorithms can only make trivial decisions like kernel sizes for manually designed modules.

Object detection models differ from single-path image classification networks in the way they merge multi-level features and distribute the task to parallel prediction heads. Feature pyramid networks (FPNs) [4, 8, 11, 14, 27], designed to handle this job, play an important role in modern object detection models. NAS-FPN [4] targets searching for an FPN alternative based on the one-stage framework RetinaNet [12]. Feature pyramid architectures are sampled with a recurrent neural network (RNN) controller, which is trained with reinforcement learning (RL). However, the search is very time-consuming even though a proxy task with a ResNet-10 backbone is used to evaluate each architecture.

Since all three lines of research ([2, 4] and ours) focus on the object detection framework, we point out their differences: DetNAS [2] aims to search for better backbone designs, NAS-FPN [4] searches the FPN structure, and our search space contains both FPN and head structure.
To speed up the reward evaluation of RL-based NAS, the work of [17] proposes to use progressive tasks and other training acceleration methods. By caching the encoder features, they are able to train semantic segmentation decoders with very large batch sizes very efficiently. In the sequel of this paper, we refer to this technique as fast decoder adaptation. However, directly applying this technique to object detection tasks does not enjoy a similar speed boost, because detection models either do not use a fully convolutional architecture [11] or require complicated post-processing that does not scale with the batch size [12].

To reduce the post-processing overhead, we resort to a recently introduced anchor-free one-stage framework, namely FCOS [24], which significantly improves the search efficiency by eliminating the anchor-box matching time of RetinaNet. Compared to its anchor-based counterpart, FCOS significantly reduces the training memory footprint while being able to improve the performance.
3. Our Approach

In our work, we search for anchor-free fully convolutional detection models with fast decoder adaptation, so that NAS methods can be easily applied.

3.1. Problem Formulation

We base our search algorithm upon the one-stage framework FCOS due to its simplicity. Our training tuples {(x, Y)} consist of input image tensors x of size (3 × H × W) and FCOS output targets Y in a pyramid representation, which is a list of tensors yl, each of size ((K + 4 + 1) × Hl × Wl), where Hl × Wl is the feature map size on level l of the pyramid. (K + 4 + 1) is the number of output channels of FCOS; the three terms are the length-K one-hot classification labels, the 4 bounding box regression targets and the 1 centerness factor, respectively.
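To make this layout concrete, the snippet below (our illustration, not code from the paper) allocates zero-initialized targets with the shape ((K + 4 + 1) × Hl × Wl) per level; the pyramid strides and K = 80 classes are assumptions for a COCO-style setup.

import torch

K = 80                                          # number of classes (COCO-style assumption)
H, W = 800, 1024                                # example input image size
strides = {3: 8, 4: 16, 5: 32, 6: 64, 7: 128}   # assumed FCOS strides for p3..p7

# One target tensor per pyramid level: K one-hot classification channels,
# 4 box-regression distances (l, t, r, b) and 1 centerness channel.
targets = {
    level: torch.zeros(K + 4 + 1, H // s, W // s)
    for level, s in strides.items()
}
print({level: tuple(t.shape) for level, t in targets.items()})
# e.g. level 3 -> (85, 100, 128), level 7 -> (85, 6, 8)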
The network g : x → Ŷ in the original FCOS consists of three parts: a backbone b, an FPN f, and multi-level subnets that we call prediction heads h in this paper. First, the backbone b : x → C maps the input tensor to a set of intermediate-level features C = {c3, c4, c5}, with resolutions (Hi × Wi) = (H/2^i × W/2^i). Then the FPN f : C → P maps the features to a feature pyramid P = {p3, p4, p5, p6, p7}. Finally, the prediction head h : p → y is applied to each level of P and the results are collected to create the final prediction. To avoid overfitting, the same h is usually applied to all instances in P.

Since objects of different scales require different effective receptive fields, the mechanism to select and merge the intermediate-level features C is particularly important in object detection network design. Thus, most research [16, 20] is carried out on designing f and h while using widely adopted backbone structures such as ResNet [7]. Following this principle, our search goal is to decide when to choose which features from C and how to merge them.

To improve efficiency, we reuse the parameters of b pretrained on the target dataset and search for the optimal structures on top of them. For convenience, we refer to the network components to be searched, namely f and h, together as the decoder structure of the object detection network.

f and h take care of different parts of the detection job: f extracts features targeting different object scales in the pyramid representation P, while h is a unified mapping applied to each feature in P to avoid overfitting. In practice, the possibility of using a more diversified f to extract features at different levels, and the question of how many layers of h should be shared across levels, are seldom discussed. In this work, we use NAS as an automatic method to test these possibilities.

3.2. Search Space

Considering the different functions of f and h, we apply two search spaces respectively. Given the particularity of the FPN structure, a basic block with a new overall connection and an output design for f is built for it. For simplicity, a sequential space is applied for the h part.

We replace the cell structure with atomic operations to provide even more flexibility. To construct one basic block, we first choose two layers x1, x2 from the sampling pool X at indices id1, id2; then two operations op1, op2 are applied to each of them, and an aggregation operation agg merges the two outputs into one feature. To build a deep decoder structure, we apply multiple basic blocks, with their outputs added to the sampling pool. Our basic block bbt : Xt−1 → Xt at time step t transforms the sampling pool Xt−1 into Xt = Xt−1 ∪ {xt}, where xt is the output of bbt.

The candidate operations are listed in Table 1. We include only separable/depth-wise convolutions so that the decoder can be efficient. In order to enable the decoder to apply convolutional filters on irregular grids, we have also included deformable 3 × 3 convolutions [31]. For the aggregation operations, we include element-wise sum and concatenation followed by a 1 × 1 convolution.

ID Description
0 separable conv 3 × 3
1 separable conv 3 × 3 with dilation rate 3
2 separable conv 5 × 5 with dilation rate 6
3 skip-connection
4 deformable 3 × 3 convolution
Table 1. Unary operations used in the search process.
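As a sketch of how a sampled basic block could be materialized, the PyTorch-style code below follows Table 1 (the deformable 3 × 3 convolution is omitted for brevity; it could be added via torchvision.ops.DeformConv2d with an offset branch). All class and helper names here are our own illustration, under the assumption that every feature in the pool shares one channel width.

import torch
import torch.nn as nn
import torch.nn.functional as F

def sep_conv(ch, k, dil=1):
    # Depth-wise separable convolution: depth-wise k x k, then point-wise 1 x 1.
    pad = dil * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=pad, dilation=dil, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.GroupNorm(32, ch),
        nn.ReLU(inplace=True))

# Candidate unary operations, indexed as in Table 1 (op 4 omitted, see above).
OP_FACTORY = {
    0: lambda ch: sep_conv(ch, 3),
    1: lambda ch: sep_conv(ch, 3, dil=3),
    2: lambda ch: sep_conv(ch, 5, dil=6),
    3: lambda ch: nn.Identity(),          # skip-connection
}

class BasicBlock(nn.Module):
    """One decoder basic block: sample two pool entries, transform, aggregate."""
    def __init__(self, ch, id1, id2, op1, op2, agg='sum'):
        super().__init__()
        self.id1, self.id2, self.agg = id1, id2, agg
        self.op1 = OP_FACTORY[op1](ch)
        self.op2 = OP_FACTORY[op2](ch)
        if agg == 'cat':                  # concatenation needs a 1 x 1 reduction
            self.reduce = nn.Conv2d(2 * ch, ch, 1, bias=False)

    def forward(self, pool):
        a, b = self.op1(pool[self.id1]), self.op2(pool[self.id2])
        if a.shape[-2:] != b.shape[-2:]:  # match resolutions before merging
            b = F.interpolate(b, size=a.shape[-2:], mode='bilinear',
                              align_corners=False)
        x = a + b if self.agg == 'sum' else self.reduce(torch.cat([a, b], 1))
        return pool + [x]                 # X_t = X_{t-1} ∪ {x_t}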
The decoder configuration can be represented by a sequence with three components: the FPN configuration, the head configuration, and the weight-sharing stage. We provide detailed descriptions of each of them in the following sections. The complete diagram of our decoder structure is shown in Fig. 1.

Figure 1. A conceptual example of our NAS-FCOS decoder. It consists of two sub-networks: an FPN f and a set of prediction heads h which have shared structures. One notable difference from other FPN-based one-stage detectors is that our heads have partially shared weights: only the last several layers of the prediction heads (marked in yellow) are tied by their weights, and the number of layers to share is decided automatically by the search algorithm. Note that both the FPN and the head are in our actual search space and have more layers than shown in this figure; the figure is for illustration only.

3.2.1 FPN Search Space

As mentioned above, the FPN f maps the convolutional features C to P. First, we initialize the sampling pool as X0 = C. Our FPN is defined by applying the basic block 7 times to the sampling pool, f := bb1 ◦ bb2 ◦ · · · ◦ bb7. To yield the pyramid features P, we collect the last three basic block outputs {x5, x6, x7} as {p3, p4, p5}.

To allow information to be shared across all layers, we use a simple rule to create global features. If there is some dangling layer xt which is neither sampled by later blocks {bbi | i > t} nor belongs to the last three layers (t < 5), we use element-wise addition to merge it into all output features:

p∗i = pi + xt, i ∈ {3, 4, 5}. (1)

As with the aggregation operations, if the features have different resolutions, the smaller one is upsampled with bilinear interpolation. To be consistent with FCOS, p6 and p7 are obtained via a 3 × 3 stride-2 convolution on p5 and p6, respectively.
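The dangling-layer rule of Eq. (1) can be sketched as a small post-processing step over the sampling pool; the function below is our reconstruction (the names are ours), assuming the pyramid outputs and the dangling feature share the same channel width.

import torch.nn.functional as F

def merge_dangling(pool, sampled_ids, out_ids):
    """Eq. (1): add every dangling feature x_t to each pyramid output p_i.

    pool:        list of all features produced while building the FPN
    sampled_ids: indices consumed by any later basic block
    out_ids:     indices of the collected outputs {p3, p4, p5}
    """
    outputs = [pool[i] for i in out_ids]
    for t, x in enumerate(pool):
        if t in sampled_ids or t in out_ids:
            continue                          # x_t is not dangling
        for j, p in enumerate(outputs):
            if x.shape[-2:] != p.shape[-2:]:  # bilinear resize to match p_i
                x_t = F.interpolate(x, size=p.shape[-2:], mode='bilinear',
                                    align_corners=False)
            else:
                x_t = x
            outputs[j] = p + x_t              # p*_i = p_i + x_t
    return outputs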
3.2.2 Prediction Head Search Space

The prediction head h maps each feature in the pyramid P to the corresponding output y; in FCOS and RetinaNet it consists of four 3 × 3 convolutions. To explore the potential of the head, we therefore adopt an extended sequential search space for its generation. Specifically, our head is defined as a sequence of six basic operations. Compared with the candidate operations in the FPN structure, the head search space has two slight differences. First, we add standard convolution modules (including conv1x1 and conv3x3) to the head sampling pool for better comparison. Second, we follow the design of FCOS by replacing all Batch Normalization (BN) layers with Group Normalization (GN) [25] in the head's operation sampling pool, since the head needs to share weights between different levels, which makes BN invalid. The final output of the head is the output of the last (sixth) layer.
3.2.3 Searching for Head Weight Sharing

To add even more flexibility and to understand the effect of weight sharing in prediction heads, we further add an index i as the location where the prediction head starts to share weights. For every layer before stage i, the head h creates an independent set of weights for each FPN output level; from stage i onwards, it uses the globally shared weights. Considering the independent part of the heads as an extended FPN branch and the shared part as a head of adaptive length, we can further balance the workload between each individual FPN branch, which extracts level-specific features, and the prediction head shared across all levels.
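The index-based sharing scheme can be sketched as below, reusing the hypothetical OP_FACTORY from the earlier basic-block sketch: layers before the sharing index are instantiated once per pyramid level, the remaining layers once globally. This is an illustration of the idea, not the paper's implementation.

import torch.nn as nn

class PartiallySharedHead(nn.Module):
    """Six-layer head; layers [0, i) are level-specific, layers [i, 6) shared."""
    def __init__(self, num_levels, i, ops, ch=128):
        super().__init__()
        assert len(ops) == 6 and 0 <= i <= 6
        # Independent weights per FPN level for the first i layers.
        self.branches = nn.ModuleList(
            nn.Sequential(*[OP_FACTORY[op](ch) for op in ops[:i]])
            for _ in range(num_levels))
        # One global set of weights for the remaining layers.
        self.shared = nn.Sequential(*[OP_FACTORY[op](ch) for op in ops[i:]])

    def forward(self, pyramid):
        # Level-specific prefix, then the shared suffix, applied per level.
        return [self.shared(branch(p)) for branch, p in zip(self.branches, pyramid)]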
3.3. Search Strategy

An RL-based strategy is applied to the search process. We rely on an LSTM-based controller to predict the full configuration. We use a progressive search strategy rather than a joint search for both the FPN structure and the prediction head, since the former requires less computing resources and time than the latter. The training dataset is randomly split into a meta-train subset Dt and a meta-val subset Dv. To speed up the training, we fix the backbone network and cache the pre-computed backbone output C. This makes the training cost of a single architecture independent of the depth of the backbone network. Taking advantage of this, we can apply much more complex backbone structures and utilize high-quality multi-level features as our decoder's input. We find that the backbone finetuning process can be skipped if the cached features are powerful enough. Speedup techniques such as Polyak weight averaging are also applied during training.

The most widely used detection metric is average precision (AP). However, due to the difficulty of the object detection task, AP is too low at the early stages to tell good architectures from bad ones, which makes the controller take much more time to converge. To make the architecture evaluation process easier even at the early stages of training, we therefore use the negative loss sum as the reward instead of average precision:

R(a) = − Σ_{(x,Y)∈Dv} ( Lcls(x, Y | a) + Lreg(x, Y | a) + Lctr(x, Y | a) )    (2)

where Lcls, Lreg and Lctr are the three loss terms in FCOS. The gradient of the controller is estimated via proximal policy optimization (PPO) [22].
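As a sketch, the reward of Eq. (2) can be evaluated as below; we assume a model callable that returns the three FCOS loss terms for a batch, which abstracts away the actual training pipeline.

import torch

@torch.no_grad()
def architecture_reward(model, meta_val_loader):
    """Negative sum of the FCOS losses over the meta-val split, cf. Eq. (2)."""
    total = 0.0
    for images, targets in meta_val_loader:
        # Assumed interface: returns {'cls': ..., 'reg': ..., 'ctr': ...} loss tensors.
        losses = model(images, targets)
        total += (losses['cls'] + losses['reg'] + losses['ctr']).item()
    return -total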

4. Experiments

4.1. Implementation Details

4.1.1 Searching Phase

We design a fast proxy task for evaluating the decoder architectures sampled in the searching phase. PASCAL VOC is selected as the proxy dataset; it contains 5715 training images with object bounding box annotations for 20 classes. Since the search and full-training phases use different datasets, the transfer capacity of the searched structures can also be assessed. The VOC training set is randomly split into a meta-train set with 4,000 images and a meta-val set with 1715 images. For each sampled architecture, we train it on meta-train and compute the reward (2) on meta-val. Input images are resized to a short side of 384 and then randomly cropped to 384 × 384; target object sizes of interest are scaled correspondingly. We use the Adam optimizer with learning rate 8e−4 and batch size 200. Polyak averaging is applied with a decay rate of 0.9, and the decoder is evaluated after 300 iterations. As we use fast decoder adaptation, the backbone features are fixed and cached during the search phase. To enhance the cached backbone features, we first initialize them with the pre-trained weights provided by the open-source implementation of FCOS and then finetune on VOC using the training strategies of FCOS. Note that this finetuning is performed only once, at the beginning of the search phase.
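A minimal sketch of the Polyak averaging step used in this schedule is given below (decay 0.9 as stated above); decoder and avg_decoder are hypothetical module names, and the surrounding training loop is only indicated in comments.

import copy
import torch

def polyak_update(avg_decoder, decoder, decay=0.9):
    # avg <- decay * avg + (1 - decay) * current, applied parameter-wise.
    with torch.no_grad():
        for p_avg, p in zip(avg_decoder.parameters(), decoder.parameters()):
            p_avg.mul_(decay).add_(p, alpha=1.0 - decay)

# Sketch of the 300-iteration proxy training loop:
#   avg_decoder = copy.deepcopy(decoder)
#   for step in range(300):
#       ...forward, backward, optimizer.step()...
#       polyak_update(avg_decoder, decoder)
#   reward = architecture_reward(avg_decoder, meta_val_loader)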
A progressive strategy is used for the search of f and h. We first search for the FPN part while retaining the original head. All operations in the FPN structure have 64 output channels; the decoder inputs C are resized to fit the FPN's output channel width via 1 × 1 convolutions. After this step, the searched FPN structure is fixed and the second-stage search for the head starts on top of it. Most parameters for searching the head are identical to those for searching the FPN structure, except that the output channel width is adjusted from 64 to 128 to deliver more information.

For the FPN search part, the controller model nearly converged after searching over 2.8K architectures on the proxy task, as shown in Fig. 2. The top-20 best performing architectures on the proxy task are then selected for the next full-training phase. For the head search part, we choose the best searched FPN among the top-20 architectures and pre-fetch its features. It takes about 600 rounds for the controller to nearly converge, which is much faster than for searching FPN architectures. After that, we select for full training the top-10 heads that achieve the best performance on the proxy task. In total, the whole search phase can be finished within 4 days using 8 V100 GPUs.

Figure 2. The reward during the proxy task grows throughout the search, indicating that the reinforcement learning model works.

4.1.2 Full Training Phase

In this phase, we fully train the searched models on the MS COCO training dataset and select the best one by evaluating them on MS COCO validation images. Note that our training configurations are exactly the same as those in FCOS for fair comparison. Input images are resized to a short side of 800 and the maximum long side is set to 1333. The models are trained using 4 V100 GPUs with batch size 16 for 90K iterations. The initial learning rate is 0.01 and is reduced to one tenth at the 60K-th and 80K-th iterations. The improving tricks are applied only on the final model (w/improv).
Decoder Backbone FLOPs (G) Params (M) AP
FPN-RetinaNet @256 MobileNetV2 133.4 11.3 30.8
FPN-FCOS @256 MobileNetV2 105.4 9.8 31.2
NAS-FCOS (ours) @128 MobileNetV2 39.3 5.9 32.0
NAS-FCOS (ours) @128-256 MobileNetV2 95.6 9.9 33.8
NAS-FCOS (ours) @256 MobileNetV2 121.8 16.1 34.7
FPN-RetinaNet @256 R-50 198.0 33.6 36.1
FPN-FCOS @256 R-50 169.9 32.0 37.4
NAS-FCOS (ours) @128 R-50 104.0 27.8 37.9
NAS-FCOS (ours) @128-256 R-50 160.4 31.8 39.1
NAS-FCOS (ours) @256 R-50 189.6 38.4 39.8
FPN-RetinaNet @256 R-101 262.4 52.5 37.8
FPN-FCOS @256 R-101 234.3 50.9 41.5
NAS-FCOS (ours) @256 R-101 254.0 57.3 43.0
FPN-FCOS @256 X-64x4d-101 371.2 89.6 43.2
NAS-FCOS (ours) @128-256 X-64x4d-101 361.6 89.4 44.5
FPN-FCOS @256 w/improvements X-64x4d-101 371.2 89.6 44.7
NAS-FCOS (ours) @128-256 w/improvements X-64x4d-101 361.6 89.4 46.1
Table 2. Results on the MS COCO test-dev set after full training. R-50 and R-101 denote ResNet backbones and X-64x4d-101 denotes ResNeXt-101 (64 × 4d). All networks share the same input image resolution. FLOPs and parameters are measured at 1088 × 800, which is the median input size on COCO. For RetinaNet and FCOS, we use the official models provided by the authors. For our NAS-FCOS, @128 and @256 mean that the decoder channel width is 128 and 256 respectively; @128-256 is the decoder with FPN width 128 and head width 256. The same improving tricks used in the newest FCOS version are applied to our model for fair comparison.

Figure 3. Our discovered FPN structure. C2 is omitted from this figure since it is not chosen by this particular structure during the search process.

Figure 4. Our discovered Head structure.

4.2. Search Results

The best FPN structure is illustrated in Fig. 3. The controller identifies deformable convolution and concatenation as the best performing unary and aggregation operations, respectively. From Fig. 4, we can see that the controller chooses to use 4 operations (with two skip-connections), rather than the maximum allowed 6 operations. Note that the discovered "dconv + 1x1 conv" structure achieves a good trade-off between accuracy and FLOPs. Compared with the original head, our searched head has fewer FLOPs/Params (FLOPs 79.24G vs. 89.16G, Params 3.41M vs. 4.92M) and significantly better performance (AP 38.7 vs. 37.4).

We use the searched decoder together with either lightweight backbones such as MobileNet-V2 [21] or more powerful backbones such as ResNet-101 [7] and ResNeXt-101 [26]. To balance performance and efficiency, we implement three decoders with different computation budgets: one with feature dimension 128 (@128), one with 256 (@256), and another with FPN channel width 128 and prediction head width 256 (@128-256). The results on the COCO test-dev set with short side 800 are shown in Table 2. The searched decoder with feature dimension 256 (@256) surpasses its FCOS counterpart by 1.5 to 3.5 points in AP under different backbones. The one with 128 channels (@128) has significantly reduced parameters and computation, making it more suitable for resource-constrained environments. In particular, our searched model with 128 channels and a MobileNetV2 backbone surpasses the original FCOS with the same backbone by 0.8 AP points with only 1/3 of the FLOPs. The third type of decoder (@128-256) achieves a good balance between accuracy and parameters. Note that our searched model outperforms the strongest FCOS variant by 1.4 AP points (46.1 vs. 44.7) with slightly smaller FLOPs and Params. The comparison of FLOPs and number of parameters against other models is illustrated in Fig. 7 and Fig. 8, respectively.
Figure 7. Relationship between FLOPs and AP with different backbones. Points of different shapes represent different backbones. NAS-FCOS@128 slightly increases precision while also reducing computation; the one with 256 channels obtains the highest precision at higher computation complexity; using FPN channel width 128 and prediction head width 256 (@128-256) offers a trade-off.

Figure 8. Relationship between parameters and AP with different backbones. Adjusting the number of channels in the FPN structure and head helps to achieve a balance between accuracy and parameters.

In order to understand the importance of weight sharing in the head, we add the number of weight-shared layers as an object of the search. Fig. 5 shows the trend of head weight sharing during the search, with 50 structures as one statistical cycle. As the search deepens, the proportion of fully shared structures increases, indicating that for multi-scale detection models, head weight sharing is a necessity.

Figure 5. Trend of head weight sharing during the search. The horizontal axis is the index of the statistical period; a period consists of 50 head structures. The vertical axis is the proportion of heads that fully share weights among those 50 structures.

We also compare with other NAS methods for object detection in Table 3. Our method is able to search roughly twice as many architectures per GPU-day as DetNAS [2]. Note that the AP of NAS-FPN [4] is achieved by stacking the searched FPN 7 times, while we do not stack our searched FPN. Our model with ResNeXt-101 (64x4d) as the backbone outperforms NAS-FPN by 1.3 AP points while using only 1/3 of the FLOPs and a lower search cost.

Figure 6. Correlation between the search reward obtained on the VOC meta-val dataset and the AP evaluated on COCO-val.

We further measure the correlation between the rewards obtained on the proxy dataset during the search and the APs attained by the same architectures after training on COCO. Specifically, we randomly sample 15 architectures from all the searched structures and train them on COCO with batch size 16. Since full training on COCO is time-consuming, we reduce the number of iterations to 60K. Each model is then evaluated on the COCO 2017 validation set. As shown in Fig. 6, there is a strong correlation between search rewards and the APs obtained on COCO: poorly and well performing architectures can be distinguished by their rewards on the proxy task very well.
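The correlation in Fig. 6 can be quantified in a few lines; the sketch below computes a Pearson coefficient over the sampled (reward, AP) pairs. The arrays are placeholders to be filled with measured values, not the paper's data.

import numpy as np

def reward_ap_correlation(rewards, aps):
    """Pearson correlation between proxy-task rewards and COCO APs (cf. Fig. 6)."""
    rewards, aps = np.asarray(rewards, float), np.asarray(aps, float)
    return np.corrcoef(rewards, aps)[0, 1]

# Usage: fill in the 15 measured (reward, AP) pairs from the experiment.
# rho = reward_ap_correlation(rewards=[...], aps=[...])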
Arch FLOPs (G) Search Cost (GPU-day) Searched Archs AP
NAS-FPN @256 R-50 >325.0 333×#TPUs 17000 <38.0
NAS-FPN 7@256 R-50 1125.5 333×#TPUs 17000 44.8
DetNAS-FPN-Faster - 44 2200 40.2
DetNAS-RetinaNet - 44 2200 33.3
NAS-FCOS (ours) @256 R-50 189.6 28 3000 39.8
NAS-FCOS (ours) @128-256 X-64x4d-101 361.6 28 3000 46.1
Table 3. Comparison with other NAS methods. For NAS-FPN, the input size is 1280 × 1280 and the search cost should be multiplied by the number of TPUs used to train each architecture. Note that the FLOPs and AP of NAS-FPN @256 here are taken from Figure 11 in NAS-FPN [4], and NAS-FPN 7@256 stacks the searched FPN structure 7 times. Input images are resized such that their shorter side is 800 pixels for DetNASNet [2] and our models.

4.3. Ablation Study

4.3.1 Design of Reinforcement Learning Reward

As discussed above, it is common to use widely accepted task-specific indicators as search rewards, such as mIOU for segmentation and AP for object detection. However, we found that using AP as the reward did not show a clear upward trend within short-term search rounds (blue curve in Fig. 9). A possible reason is that the controller has to learn a mapping from decoders to rewards, and since the calculation of AP is itself complicated, this mapping is difficult to learn within a limited number of iterations. In comparison, we clearly see the AP increase when the validation loss is used as the RL reward (red curve in Fig. 9).

Figure 9. Comparison of two different RL reward designs. The vertical axis represents the AP obtained from the proxy task on the validation dataset.

4.3.2 Effectiveness of Search Space

To further assess the impact of the search spaces f and h, we design three experiments for verification: one searches f with the original head fixed, one searches h with the original FPN fixed, and another searches the entire decoder (f + h). As shown in Table 4, searching f brings slightly more benefit than searching h only, and our progressive search, which combines both f and h, achieves the best result.

Decoder Search Space AP
FPN-FCOS @256 - 37.4
NAS-FCOS @256 h only 38.7
NAS-FCOS @256 f only 38.9
NAS-FCOS @256 f + h 39.8
Table 4. Comparison of APs obtained under different search spaces with a ResNet-50 backbone.

4.3.3 Impact of Deformable Convolution

As mentioned above, deformable convolutions are included in the set of candidate operations for both f and h, since they are able to adapt to the geometric variations of objects. For fair comparison, we also replace all standard 3 × 3 convolutions with deformable 3 × 3 convolutions in the FPN structure of the original FCOS and repeat them twice, making the FLOPs and parameters nearly equal to those of our searched model. We call this new model DeformFPN-FCOS. It turns out that our NAS-FCOS model still achieves better performance (AP = 38.9 with FPN search only, and AP = 39.8 with both FPN and head searched) than the DeformFPN-FCOS model (AP = 38.4) under this setting.

5. Conclusion

In this paper, we have proposed to use Neural Architecture Search to further optimize the process of designing object detection networks. We show that top-performing detectors can be efficiently searched using carefully designed proxy tasks, search strategies and model evaluation metrics. The experiments on COCO demonstrate the efficiency of our discovered model, NAS-FCOS, and its flexibility to be used with various backbone architectures.

Acknowledgements

NW, YG and PW's participation in this work was in part supported by the National Natural Science Foundation of China (No. 61876152, No. U19B2037). CS' participation was in part supported by the ARC DP project "Deep Learning that Scales".

References

[1] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proc. Int. Conf. Learn. Representations, 2019.
[2] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Neural architecture search on object detection. In Proc. Advances in Neural Inf. Process. Syst., 2019.
[3] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. J. Mach. Learn. Res., 2019.
[4] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[5] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2961–2969, 2017.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proc. Eur. Conf. Comp. Vis., pages 630–645, 2016.
[8] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[9] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019.
[10] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proc. Eur. Conf. Comp. Vis., pages 734–750, 2018.
[11] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2117–2125, 2017.
[12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2980–2988, 2017.
[13] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[14] Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, and Wei Jiang. An end-to-end network for panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In Proc. Int. Conf. Learn. Representations, 2019.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander Berg. SSD: Single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pages 21–37, 2016.
[17] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[18] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In Proc. Int. Conf. Mach. Learn., 2018.
[19] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pages 91–99, 2015.
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4510–4520, 2018.
[22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[23] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
[24] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
[25] Yuxin Wu and Kaiming He. Group normalization. In Proc. Eur. Conf. Comp. Vis., pages 3–19, 2018.
[26] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[27] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[28] Hongpeng Zhou, Minghao Yang, Jun Wang, and Wei Pan. BayesNAS: A Bayesian approach for neural architecture search. In Proc. Int. Conf. Mach. Learn., 2019.
[29] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[30] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[31] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[32] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proc. Int. Conf. Learn. Representations, 2017.
