Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Chenxi Liu1∗, Liang-Chieh Chen2, Florian Schroff2, Hartwig Adam2, Wei Hua2, Alan Yuille1, Li Fei-Fei3
1 Johns Hopkins University  2 Google  3 Stanford University
In image classification, NAS typically transfers architectures searched on low resolution images to high resolution images [92], whereas optimal architectures for semantic segmentation must inherently operate on high resolution imagery. This suggests the need for: (1) a more relaxed and general search space to capture the architectural variations brought by the higher resolution, and (2) a more efficient architecture search technique, since higher resolution requires heavier computation.

We notice that modern CNN designs [25, 84, 31] usually follow a two-level hierarchy, where the outer network level controls the spatial resolution changes and the inner cell level governs the specific layer-wise computations. The vast majority of current works on NAS [92, 47, 61, 59, 49] follow this two-level hierarchical design, but only automatically search the inner cell level while hand-designing the outer network level. This limited search space becomes problematic for dense image prediction, which is sensitive to spatial resolution changes. In our work, we therefore propose a trellis-like network level search space that augments the commonly-used cell level search space first proposed in [92], forming a hierarchical architecture search space. Our goal is to jointly learn a good combination of repeatable cell structure and network structure specifically for semantic image segmentation.

In terms of the architecture search method, reinforcement learning [91, 92] and evolutionary algorithms [62, 61] tend to be computationally intensive even on the low resolution CIFAR-10 dataset, and are therefore probably not suitable for semantic image segmentation. We draw inspiration from the differentiable formulation of NAS [68, 49], and develop a continuous relaxation of the discrete architectures that exactly matches the hierarchical architecture search space. The hierarchical architecture search is conducted via stochastic gradient descent. When the search terminates, the best cell architecture is decoded greedily, and the best network architecture is decoded efficiently using the Viterbi algorithm. We search the architecture directly on 321×321 image crops from Cityscapes [13]. The search is very efficient and only takes about 3 days on one P100 GPU.

We report experimental results on multiple semantic segmentation benchmarks, including Cityscapes [13], PASCAL VOC 2012 [15], and ADE20K [89]. Without ImageNet [64] pretraining, our best model significantly outperforms FRRN-B [60] by 8.6% and GridNet [17] by 10.9% on the Cityscapes test set, and performs comparably with other ImageNet-pretrained state-of-the-art models [81, 87, 4, 11, 6] when also exploiting the coarse annotations on Cityscapes. Notably, our best model (without pretraining) attains the same performance as DeepLabv3+ [11] (with pretraining) while being 2.23 times faster in Multi-Adds. Additionally, our light-weight model attains performance only 1.2% lower than DeepLabv3+ [11], while requiring 76.7% fewer parameters and being 4.65 times faster in Multi-Adds. Finally, on PASCAL VOC 2012 and ADE20K, our best model outperforms several state-of-the-art models [89, 44, 81, 87, 82] while using strictly less data for pretraining.

To summarize, the contribution of our paper is four-fold:

• Ours is one of the first attempts to extend NAS beyond image classification to dense image prediction.

• We propose a network level architecture search space that augments and complements the much-studied cell level one, and consider the more challenging joint search of network level and cell level architectures.

• We develop a differentiable, continuous formulation that conducts the two-level hierarchical architecture search efficiently in 3 GPU days.

• Without ImageNet pretraining, our model significantly outperforms FRRN-B and GridNet, and attains comparable performance with other ImageNet-pretrained state-of-the-art models on Cityscapes. On PASCAL VOC 2012 and ADE20K, our best model also outperforms several state-of-the-art models.

2. Related Work

Semantic Image Segmentation. Convolutional neural networks [42] deployed in a fully convolutional manner (FCNs [67, 51]) have achieved remarkable performance on several semantic segmentation benchmarks. Within the state-of-the-art systems there are two essential components: the multi-scale context module and the neural network design. Context information has long been known to be crucial for pixel labeling tasks [26, 69, 37, 39, 16, 54, 14, 10]. Therefore, PSPNet [87] performs spatial pyramid pooling [21, 41, 24] at several grid scales (including image-level pooling [50]), while DeepLab [8, 9] applies several parallel atrous convolutions [28, 20, 67, 57, 7] with different rates. On the other hand, improvements in neural network design have significantly driven performance, from AlexNet [38], VGG [71], Inception [32, 75, 73], and ResNet [25] to more recent architectures such as Wide ResNet [85], ResNeXt [84], DenseNet [31], and Xception [12]. In addition to adopting those networks as backbones for semantic segmentation, one could employ encoder-decoder structures [63, 2, 55, 44, 60, 58, 33, 78, 18, 11, 86, 82], which efficiently capture long-range context information while keeping detailed object boundaries. Nevertheless, most of these models require initialization from ImageNet [64] pretrained checkpoints; for the task of semantic segmentation, the exceptions are FRRN [60] and GridNet [17]. Specifically, FRRN [60] employs a two-stream system, where full-resolution information is carried in one stream and context information in the other pooling stream. GridNet, building on a similar idea, contains multiple streams with different resolutions. In this work, we apply neural architecture search for network backbones specific to semantic segmentation. We further show state-of-the-art performance without ImageNet pretraining, significantly outperforming FRRN [60] and GridNet [17] on Cityscapes [13].

Most relevant to our work, [6] searches the Atrous Spatial Pyramid Pooling (ASPP) module using random search, whereas we focus on searching the much more fundamental network backbone architecture using more advanced and more efficient search methods.
Figure 1: Left: Our network level search space with L = 12. Gray nodes represent the fixed "stem" layers, and a path along the blue nodes represents a candidate network level architecture. Right: During the search, each cell is a densely connected structure as described in Sec. 4.1.1. Every yellow arrow is associated with the set of values $\alpha_{j \to i}$. The three arrows after concat are associated with $\beta_{s/2 \to s}^{l}$, $\beta_{s \to s}^{l}$, and $\beta_{2s \to s}^{l}$ respectively, as described in Sec. 4.1.2. Best viewed in color.
In image classification NAS [92, 47, 61], the network level follows a pre-defined pattern: a sequence of "normal cells" that keep the spatial resolution, interleaved with "reduction cells" that divide the spatial resolution by 2 and multiply the number of filters by 2. This keep-downsampling strategy is reasonable in the image classification case, but in dense image prediction it is also important to keep high spatial resolution, and as a result there are more network level variations [9, 56, 55].

Among the various network architectures for dense image prediction, we notice two principles that are consistent:

• The spatial resolution of the next layer is either twice as large, twice as small, or remains the same.

• The smallest spatial resolution is downsampled by 32.

Following these common practices, we propose the following network level search space. The beginning of the network is a two-layer "stem" structure, each layer of which reduces the spatial resolution by a factor of 2. After that, there are a total of L layers with unknown spatial resolutions, the maximum being downsampled by 4 and the minimum being downsampled by 32. Since each layer may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8. We illustrate our network level search space in Fig. 1. Our goal is then to find a good path in this L-layer trellis.

In Fig. 2 we show that our search space is general enough to cover many popular designs. In the future, we plan to relax this search space even further to include U-net architectures [63, 45, 70], where layer l may receive input from one more layer preceding l in addition to l − 1.

Figure 2: Our network level search space is general and includes various existing designs. Each diagram plots the downsample rate (1 to 32) against the layer index (1 to L): (a) network level architecture used in DeepLabv3 [9]; (b) network level architecture used in Conv-Deconv [56]; (c) network level architecture used in Stacked Hourglass [55].

We reiterate that our work searches the network level architecture in addition to the cell level architecture. Our search space is therefore strictly more challenging and general-purpose than those of previous works.
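To make the trellis concrete, the following sketch (our illustration, not the paper's code) represents the network level search space and counts the candidate paths by dynamic programming; the layer count L = 12 follows Fig. 1.

```python
# Network level search space as an L-layer trellis (a sketch, not the
# authors' code): each layer sits at a downsample rate in {4, 8, 16, 32},
# consecutive layers differ by at most a factor of 2, and the first layer
# after the stem may only be at rate 4 or 8.

RATES = [4, 8, 16, 32]

def count_paths(L=12):
    """Count candidate network level architectures by dynamic programming."""
    counts = {4: 1, 8: 1, 16: 0, 32: 0}  # layer 1, right after the stem
    for _ in range(L - 1):
        counts = {
            s: sum(counts[p] for p in (s // 2, s, 2 * s) if p in RATES)
            for s in RATES
        }
    return sum(counts.values())

print(count_paths(12))  # number of network level paths in the trellis
```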
4. Methods

We begin by introducing a continuous relaxation of the (exponentially many) discrete architectures that exactly matches the hierarchical architecture search described above. We then discuss how to perform the architecture search via optimization, and how to decode back a discrete architecture after the search terminates.
4.1. Continuous Relaxation of Architectures

4.1.1 Cell Architecture

We reuse the continuous relaxation described in [49]. Every block's output tensor $H_i^l$ is connected to all hidden states in $\mathcal{I}_i^l$:

  $H_i^l = \sum_{H_j^l \in \mathcal{I}_i^l} O_{j \to i}(H_j^l)$    (1)

In addition, we approximate each $O_{j \to i}$ with its continuous relaxation $\bar{O}_{j \to i}$, defined as:

  $\bar{O}_{j \to i}(H_j^l) = \sum_{O^k \in \mathcal{O}} \alpha_{j \to i}^k \, O^k(H_j^l)$    (2)

where

  $\sum_{k=1}^{|\mathcal{O}|} \alpha_{j \to i}^k = 1 \quad \forall i, j$    (3)

  $\alpha_{j \to i}^k \geq 0 \quad \forall i, j, k$    (4)

In other words, the $\alpha_{j \to i}^k$ are normalized scalars associated with each operator $O^k \in \mathcal{O}$, easily implemented as softmax.

Recall from Sec. 3.1 that $H^{l-1}$ and $H^{l-2}$ are always included in $\mathcal{I}_i^l$, and that $H^l$ is the concatenation of $\{H_1^l, \ldots, H_B^l\}$. Together with Eq. (1) and Eq. (2), the cell level update may be summarized as:

  $H^l = \mathrm{Cell}(H^{l-1}, H^{l-2}; \alpha)$    (5)
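For concreteness, here is a minimal PyTorch-style sketch of the mixed operator in Eq. (2); it is our illustration rather than the authors' released code, and the toy operator set below merely stands in for the paper's candidate set $\mathcal{O}$ (separable and atrous convolutions, etc.).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge (Eq. 2): a softmax-weighted sum of
    all candidate operators applied to the same input, so the normalization
    constraints of Eqs. (3)-(4) hold by construction."""

    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture logit alpha^k per candidate operator.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, h):
        weights = F.softmax(self.alpha, dim=0)  # normalized alpha^k_{j->i}
        return sum(w * op(h) for w, op in zip(weights, self.ops))

# Toy usage with placeholder operators (identity, plain and atrous 3x3 conv):
ops = [nn.Identity(),
       nn.Conv2d(8, 8, 3, padding=1),
       nn.Conv2d(8, 8, 3, padding=2, dilation=2)]
edge = MixedOp(ops)
out = edge(torch.randn(1, 8, 32, 32))  # relaxed edge output, same shape
```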
The scalars relaxing the network level, $\beta$, are normalized analogously and also implemented as softmax. Eq. (6) shows how the continuous relaxations of the two-level hierarchy are weaved together. In particular, $\beta$ controls the outer network level, and hence depends on the spatial size and layer index. Each scalar in $\beta$ governs an entire set of $\alpha$, yet $\alpha$ specifies the same architecture that depends on neither spatial size nor layer index.

As illustrated in Fig. 1, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to each spatial resolution at the L-th layer (with the atrous rates adjusted accordingly). Their outputs are bilinearly upsampled to the original resolution before being summed to produce the prediction.

4.2. Optimization

The advantage of introducing this continuous relaxation is that the scalars controlling the connection strength between different hidden states are now part of the differentiable computation graph. They can therefore be optimized efficiently using gradient descent. We adopt the first-order approximation in [49], and partition the training data into two disjoint sets, trainA and trainB. The optimization alternates between:

1. Update the network weights $w$ by $\nabla_w \mathcal{L}_{trainA}(w, \alpha, \beta)$

2. Update the architecture $\alpha, \beta$ by $\nabla_{\alpha, \beta} \mathcal{L}_{trainB}(w, \alpha, \beta)$

where the loss function $\mathcal{L}$ is the cross entropy calculated on the semantic segmentation mini-batch.
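A minimal sketch of this alternation (assuming PyTorch; model.weight_parameters() and model.arch_parameters() are hypothetical helper names for the split between $w$ and $(\alpha, \beta)$, not functions from the paper):

```python
import torch
import torch.nn.functional as F

def search_step(model, batch_a, batch_b, w_opt, a_opt):
    """One alternation of the first-order bi-level search (Sec. 4.2).
    `w_opt` optimizes model.weight_parameters(); `a_opt` optimizes
    model.arch_parameters() -- both helper names are our assumptions."""
    (xa, ya), (xb, yb) = batch_a, batch_b

    # 1. Update network weights w on a trainA mini-batch.
    w_opt.zero_grad()
    F.cross_entropy(model(xa), ya, ignore_index=255).backward()
    w_opt.step()

    # 2. Update architecture parameters (alpha, beta) on a trainB mini-batch
    #    (first-order approximation of [49]: w is not unrolled).
    a_opt.zero_grad()
    F.cross_entropy(model(xb), yb, ignore_index=255).backward()
    a_opt.step()
```

(The ignore_index=255 follows the common segmentation convention of a void label; the paper states only that the loss is the cross entropy.)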
Figure 3: Best network and cell architecture found by our Hierarchical Neural Architecture Search. Gray dashed arrows show
the connection with maximum β at each node. atr: atrous convolution. sep: depthwise-separable convolution.
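The introduction states that, once the search terminates, the best network level architecture is decoded with the Viterbi algorithm over the $\beta$ weights (Fig. 3 visualizes the maximum-$\beta$ connection at each node). Below is a minimal sketch of such a decoder; storing $\beta$ as a list of per-transition dictionaries is our layout for illustration, not the paper's data structure.

```python
import numpy as np

RATES = [4, 8, 16, 32]

def decode_network(beta):
    """Viterbi decoding of the best path through the L-layer trellis.
    beta: list of L-1 dicts mapping (prev_rate, rate) -> transition weight."""
    # Log-domain scores; the first layer after the stem is at rate 4 or 8.
    score = {s: (0.0 if s in (4, 8) else -np.inf) for s in RATES}
    back = []
    for trans in beta:
        new_score, ptr = {}, {}
        for s in RATES:
            cands = [(score[p] + np.log(trans.get((p, s), 1e-12)), p)
                     for p in (s // 2, s, 2 * s) if p in RATES]
            new_score[s], ptr[s] = max(cands)
        score = new_score
        back.append(ptr)
    path = [max(score, key=score.get)]   # best final rate
    for ptr in reversed(back):           # backtrack
        path.append(ptr[path[-1]])
    return path[::-1]                    # downsample rate per layer
```

Unlike the gray dashed arrows in Fig. 3, which mark the per-node maximum, Viterbi maximizes the product of transition weights over the entire path.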
We follow the same training protocol as in [9, 11]. In brief, during training we adopt a polynomial ("poly") learning rate schedule [50] with initial learning rate 0.05, and a large crop size (e.g., 769 × 769 on Cityscapes, and 513 × 513 on PASCAL VOC 2012 and resized ADE20K images). Batch normalization parameters [32] are fine-tuned during training. The models are trained from scratch for 1.5M iterations on Cityscapes, 1.5M iterations on PASCAL VOC 2012, and 4M iterations on ADE20K, respectively.

We adopt a simple encoder-decoder structure similar to DeepLabv3+ [11]. Specifically, our encoder consists of our found best network architecture augmented with the ASPP module, and our decoder is the same as the one in DeepLabv3+.
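For reference, the "poly" schedule takes the form lr = lr_0 · (1 − step/max_steps)^p; a minimal sketch follows (the power p = 0.9 is the common DeepLab setting, which this text does not state explicitly).

```python
def poly_lr(step, max_steps, base_lr=0.05, power=0.9):
    """'Poly' learning rate: decays from base_lr to 0 over max_steps.
    power=0.9 is an assumed (conventional) value; the paper gives only
    the initial rate 0.05."""
    return base_lr * (1.0 - step / float(max_steps)) ** power

# e.g. the 1.5M-iteration Cityscapes schedule:
# lr_at_t = poly_lr(t, 1_500_000)
```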
5.2.2 PASCAL VOC 2012

PASCAL VOC 2012 [15] contains 20 foreground object classes and one background class. We augment the original dataset with the extra annotations provided by [23], resulting in 10582 (train aug) training images.

In Tab. 5, we report our validation set results. Our best model, Auto-DeepLab-L, with single scale inference significantly outperforms [19] by 20.36%. Additionally, for all our model variants, adopting multi-scale inference improves performance by about 1%. Further pretraining our models on COCO [46] for 4M iterations improves performance significantly.

Method            MS   COCO   mIOU (%)
DropBlock [19]                53.4
Auto-DeepLab-S                71.68
Auto-DeepLab-S    ✓           72.54
Auto-DeepLab-M                72.78
Auto-DeepLab-M    ✓           73.69
Auto-DeepLab-L                73.76
Auto-DeepLab-L    ✓           75.26
Auto-DeepLab-S         ✓      78.31
Auto-DeepLab-S    ✓    ✓      80.27
Auto-DeepLab-M         ✓      79.78
Auto-DeepLab-M    ✓    ✓      80.73
Auto-DeepLab-L         ✓      80.75
Auto-DeepLab-L    ✓    ✓      82.04

Table 5: PASCAL VOC 2012 validation set results. We experiment with the effect of adopting multi-scale inference (MS) and COCO-pretrained checkpoints (COCO). Without any pretraining, our best model (Auto-DeepLab-L) outperforms DropBlock by 20.36%. None of our models are pretrained with ImageNet images.

Finally, we report the PASCAL VOC 2012 test set results with our COCO-pretrained model variants in Tab. 6. As shown in the table, our best model attains 85.6% on the test set, outperforming RefineNet [44] and PSPNet [87]. Our model lags behind the top-performing DeepLabv3+ [11], which uses Xception-65 as the network backbone, by 2.2%. We believe the PASCAL VOC 2012 dataset is too small to train models from scratch, and pretraining on ImageNet is still beneficial in this case.

Method           ImageNet   COCO   mIOU (%)
Auto-DeepLab-S              ✓      82.5
Auto-DeepLab-M              ✓      84.1
Auto-DeepLab-L              ✓      85.6
RefineNet [44]   ✓          ✓      84.2
ResNet-38 [81]   ✓          ✓      84.9
PSPNet [87]      ✓          ✓      85.4
DeepLabv3+ [11]  ✓          ✓      87.8
MSCI [43]        ✓          ✓      88.0

Table 6: PASCAL VOC 2012 test set results. Our Auto-DeepLab-L attains comparable performance with many state-of-the-art models which are pretrained on both the ImageNet and COCO datasets. We refer readers to the official leaderboard for other state-of-the-art models.
5.2.3 ADE20K

ADE20K [89] has 150 semantic classes and high quality annotations of 20000 training images and 2000 validation images. In our experiments, the images are all resized so that the longer side is 513 during training.

In Tab. 7, we report our validation set results. Our models outperform some state-of-the-art models, including RefineNet [44], UPerNet [82], and PSPNet (ResNet-152) [87]; however, without any ImageNet [64] pretraining, our performance lags behind the latest work of [11].

Method                           ImageNet   mIOU (%)   Pixel-Acc (%)   Avg (%)
Auto-DeepLab-S                              40.69      80.60           60.65
Auto-DeepLab-M                              42.19      81.09           61.64
Auto-DeepLab-L                              43.98      81.72           62.85
CascadeNet (VGG-16) [89]         ✓          34.90      74.52           54.71
RefineNet (ResNet-152) [44]      ✓          40.70      -               -
UPerNet (ResNet-101) [82] †      ✓          42.66      81.01           61.84
PSPNet (ResNet-152) [87]         ✓          43.51      81.38           62.45
PSPNet (ResNet-269) [87]         ✓          44.94      81.69           63.32
DeepLabv3+ (Xception-65) [11] †  ✓          45.65      82.52           64.09

Table 7: ADE20K validation set results. We employ multi-scale inputs during inference. †: Results are obtained from their up-to-date model zoo websites respectively. ImageNet: Models pretrained on ImageNet. Avg: Average of mIOU and Pixel-Accuracy.

6. Conclusion

In this paper, we present one of the first attempts to extend Neural Architecture Search beyond image classification to dense image prediction problems. Instead of fixating on the cell level, we acknowledge the importance of spatial resolution changes, and embrace the architectural variations by incorporating the network level into the search space. We also develop a differentiable formulation that allows efficient (about 1000× faster than DPC [6]) architecture search over our two-level hierarchical search space. The result of the search, Auto-DeepLab, is evaluated by training on benchmark semantic segmentation datasets from scratch. On Cityscapes, Auto-DeepLab significantly outperforms the previous state-of-the-art by 8.6%, and performs comparably with ImageNet-pretrained top models when exploiting the coarse annotations. On PASCAL VOC 2012 and ADE20K, Auto-DeepLab also outperforms several ImageNet-pretrained state-of-the-art models.

There are many possible directions for future work. Within the current framework, related applications such as object detection should be plausible; we could also try untying the cell architecture α across different layers (cf. [76]) with little computation overhead. Beyond the current framework, a more general and relaxed network level search space should be beneficial (cf. Sec. 3.2).
Figure 5: Visualization results on the Cityscapes val set. Our failure mode is shown in the last row, where our model confuses some difficult semantic classes, such as person and rider.

Figure 6: Visualization results on the ADE20K validation set. Our failure mode is shown in the last row, where our model could not segment very fine-grained objects (e.g., chair legs) and confuses some difficult semantic classes (e.g., floor and rug).
Acknowledgments. We thank Sergey Ioffe for the valuable feedback about the draft, and the Cloud AI and Mobile Vision teams for support.

References

[1] K. Ahmed and L. Torresani. Maskconnect: Connectivity learning by gradient descent. In ECCV, 2018. 3
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015. 2
[3] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017. 3
[4] S. R. Bulò, L. Porzi, and P. Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In CVPR, 2018. 2, 7
[5] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI, 2018. 3
[6] L.-C. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, 2018. 1, 2, 3, 7, 8
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 1, 2
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017. 2, 7
[9] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017. 2, 4, 6, 7
[10] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016. 2
[11] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 1, 2, 7, 8
[12] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017. 2
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 3, 6, 7
[14] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015. 2
[15] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2014. 2, 3, 6, 7
[16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013. 1, 2
[17] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017. 2, 3, 7
[18] J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017. 2
[19] G. Ghiasi, T.-Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In NIPS, 2018. 7, 8
[20] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013. 2
[21] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005. 2
[22] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. arXiv:1503.04069, 2015. 3
[23] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 7
[24] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 2
[25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2
[26] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004. 2
[27] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. 1
[28] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. Springer, 1989. 2
[29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017. 7
[30] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1
[31] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017. 1, 2
[32] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 1, 2, 7
[33] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017. 2
[34] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015. 3
[35] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting crfs with boltzmann machine shape priors for image labeling. In CVPR, 2013. 3
[36] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 1, 6
[37] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009. 2
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1, 2
[39] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009. 2
[40] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017. 7
[41] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 2
[42] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. 2
[43] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018. 8
[44] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR, 2017. 2, 8
[45] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 4
[46] T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV, 2014. 8
[47] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018. 1, 2, 3
[48] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018. 3
[49] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv:1806.09055, 2018. 1, 2, 3, 5
[50] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv:1506.04579, 2015. 2, 7
[51] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 2
[52] R. Luo, F. Tian, T. Qin, and T.-Y. Liu. Neural architecture optimization. In NIPS, 2018. 3
[53] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv:1703.00548, 2017. 3
[54] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015. 2
[55] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 1, 2, 4
[56] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015. 1, 4
[57] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015. 2
[58] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters -- improve semantic segmentation by global convolutional network. In CVPR, 2017. 2
[59] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018. 2, 3
[60] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017. 2, 3, 7
[61] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018. 1, 2, 3
[62] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017. 2, 3
[63] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 1, 2, 4
[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 2, 7, 8
[65] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018. 7
[66] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016. 3
[67] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 2
[68] R. Shin, C. Packer, and D. Song. Differentiable neural network architecture search. In ICLR Workshop, 2018. 2, 3
[69] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009. 2
[70] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016. 4
[71] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 2
[72] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 1
[73] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017. 1, 2
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 1
[75] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 1, 2
[76] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv:1807.11626, 2018. 3, 8
[77] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In WACV, 2018. 7
[78] Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L.-C. Chen, A. Fathi, and J. Uijlings. The devil is in the decoder. In BMVC, 2017. 2
[79] Y. Wu and K. He. Group normalization. In ECCV, 2018. 1
[80] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016. 1
[81] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080, 2016. 2, 7, 8
[82] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018. 2, 8
[83] L. Xie and A. Yuille. Genetic cnn. In ICCV, 2017. 3
[84] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. 1, 2
[85] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016. 2
[86] Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018. 2
[87] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017. 2, 7, 8
[88] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu. Practical block-wise neural network architecture generation. In CVPR, 2018. 3
[89] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017. 2, 3, 6, 8
[90] Y. Zhuang, F. Yang, L. Tao, C. Ma, Z. Zhang, Y. Li, H. Jia, X. Xie, and W. Gao. Dense relation network: Learning consistent and context-aware representation for semantic image segmentation. In ICIP, 2018. 7
[91] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017. 2, 3
[92] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018. 1, 2, 3, 7