Practical Block-Wise Neural Network Architecture Generation
Zhao Zhong 1,3*, Junjie Yan 2, Wei Wu 2, Jing Shao 2, Cheng-Lin Liu 1,3,4
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 SenseTime Research
3 University of Chinese Academy of Sciences
4 CAS Center for Excellence of Brain Science and Intelligence Technology
Email: {zhao.zhong, liucl}@nlpr.ia.ac.cn, {yanjunjie, wuwei, shaojing}@sensetime.com
Figure 1. The proposed BlockQNN (right, in the red box) compared with hand-crafted networks marked in yellow and existing auto-generated networks marked in green. Automatically generating plain networks [2, 37], marked in blue, requires a large computational cost to search the optimal layer types and hyper-parameters for every single layer, while the block-wise network greatly reduces the cost by searching the structure of only one block. The entire network is then constructed by stacking the generated blocks. A similar block concept has demonstrated its superiority in hand-crafted networks, such as the inception block and the residual block, marked in red.
play [19] and an epsilon-greedy strategy [21] to effectively and efficiently search for the optimal block structure. The network block is constructed by the learning agent, which is trained to sequentially choose component layers. Afterwards we stack the block to construct the whole auto-generated network. Moreover, we propose an early-stop strategy to enable efficient search with fast convergence. A novel reward function is designed to ensure that the accuracy of the early-stopped network is positively correlated with that of the converged network; using this property, we can pick good blocks in reduced training time. With this acceleration strategy, we can construct a Q-learning agent to learn the optimal block-wise network structures for a given task with limited resources (e.g. a few GPUs or a short time period). The generated architectures are thus succinct and have powerful generalization ability compared to the networks produced by other automatic network generation methods.

The proposed block-wise network generation brings a few advantages as follows:

• Effective. The automatically generated networks present performance comparable to that of networks hand-crafted with human expertise. The proposed method is also superior to the existing works and achieves state-of-the-art performance on CIFAR-10 with a 3.54% error rate.

• Efficient. We are the first to consider a block-wise setup in automatic network generation. Combined with the proposed early-stop strategy, the method results in a fast search process. The network generation for the CIFAR task reaches convergence with only 32 GPUs in 3 days, which is much more efficient than NAS [37] with 800 GPUs in 28 days.

• Transferable. It offers surprisingly superior transferability: the network generated for CIFAR can be transferred to ImageNet with little modification and still achieve outstanding performance.

2. Related Work

Early works, from the 1980s, made efforts to automate neural network design, often searching for good architectures with genetic or other evolutionary algorithms [24, 27, 26, 28, 23, 7, 34]. Nevertheless, these works, to the best of our knowledge, cannot perform competitively compared with hand-crafted networks. Recent works, i.e. Neural Architecture Search (NAS) [37] and MetaQNN [2], adopted reinforcement learning to automatically search for a good network architecture. Although they can yield good performance on small datasets such as CIFAR-10 and CIFAR-100, directly using MetaQNN or NAS for architecture design on big datasets like ImageNet [6] is computationally expensive because of the huge search space. Besides, the networks generated by these methods are task-specific or dataset-specific, that is, they cannot be transferred well to other tasks or to datasets with different input sizes. For example, the network designed for CIFAR-10 cannot be generalized to ImageNet.

Instead, our approach aims to design the network block architecture with an efficient search method built on a distributed asynchronous Q-learning framework and an early-stop strategy. The block design concept follows modern convolutional neural networks such as Inception [30, 14, 31] and ResNet [10, 11]. Inception-based networks construct their inception blocks via a hand-crafted multi-level feature extractor strategy that computes 1 × 1, 3 × 3 and 5 × 5 convolutions, while ResNet uses residual blocks with shortcut connections that make it easier to represent the identity mapping and thus allow very deep networks. The blocks automatically generated by our approach have similar structures: some blocks contain shortcut connections and inception-like multi-branch combinations. We will discuss the details in Section 5.1.

Another line of related work includes hyper-parameter optimization [3], meta-learning [32] and learning-to-learn methods [12, 1]. However, the goal of these works is to use meta-data to improve the performance of existing algorithms, such as finding the optimal learning rate of an optimization method or the optimal number of hidden layers to construct a network. In this paper, we focus on learning the entire topological architecture of network blocks to improve performance.
3. Methodology

3.1. Convolutional Neural Network Blocks

Modern CNNs, e.g. Inception and ResNet, are designed by stacking several blocks, each of which shares a similar structure but with different weights and filter numbers, to construct the network. With the block-wise design, the network not only achieves high performance but also has powerful generalization ability to different datasets and tasks. Unlike previous research on automating neural network design, which generates the entire network directly, we aim at designing the block structure.

[Figure 3: the two representative auto-generated networks for CIFAR and ImageNet, built from stacked Block ×N units separated by pooling layers after an initial convolution and ending with a linear layer.]

Name | Index | Type | Kernel Size | Pred1 | Pred2
Convolution | T | 1 | 1, 3, 5 | K | 0
Max Pooling | T | 2 | 1, 3 | K | 0
Average Pooling | T | 3 | 1, 3 | K | 0
Identity | T | 4 | 0 | K | 0
Elemental Add | T | 5 | 0 | K | K
Concat | T | 6 | 0 | K | K
Terminal | T | 7 | 0 | 0 | 0

Table 1. Network Structure Code Space. The space contains seven types of commonly used layers. The layer index stands for the position of the current layer in a block; its range is T = {1, 2, 3, ..., max layer index}. Three kernel sizes are considered for the convolution layer and two for the pooling layers. Pred1 and Pred2 are the predecessor parameters, which represent the indices of a layer's predecessors; the allowed range is K = {1, 2, ..., current layer index − 1}.
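To make the code space concrete, the following sketch writes Table 1 as a small Python structure and checks a single 5-D NSC vector against it. This is our own illustration rather than the paper's released code; the names (NSC_SPACE, is_valid_nsc) and the assumed maximum block depth are ours.

```python
# Illustrative sketch of the Network Structure Code (NSC) space in Table 1.
# All names here are ours, not from the paper's implementation.

MAX_LAYER_INDEX = 23  # assumed upper bound on layers per block

# type code -> (name, allowed kernel sizes, uses pred1, uses pred2)
NSC_SPACE = {
    1: ("Convolution",     (1, 3, 5), True,  False),
    2: ("Max Pooling",     (1, 3),    True,  False),
    3: ("Average Pooling", (1, 3),    True,  False),
    4: ("Identity",        (0,),      True,  False),
    5: ("Elemental Add",   (0,),      True,  True),
    6: ("Concat",          (0,),      True,  True),
    7: ("Terminal",        (0,),      False, False),
}

def is_valid_nsc(code):
    """Check one 5-D NSC vector (index, type, kernel, pred1, pred2)."""
    index, layer_type, kernel, pred1, pred2 = code
    if layer_type not in NSC_SPACE or not (1 <= index <= MAX_LAYER_INDEX):
        return False
    name, kernels, uses_p1, uses_p2 = NSC_SPACE[layer_type]
    if kernel not in kernels:
        return False
    # Predecessors must point to earlier layers (range K = {1, ..., index - 1});
    # 0 stands for the block input or an unused slot.
    if uses_p1 and not (0 <= pred1 < index):
        return False
    if uses_p2 and not (0 <= pred2 < index):
        return False
    if not uses_p2 and pred2 != 0:
        return False
    return True

print(is_valid_nsc((2, 1, 3, 1, 0)))  # a 3x3 Conv taking layer 1 as input -> True
```

A full block is then just a list of such vectors ending with a Terminal code, as in the examples of Fig. 2 below.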
As a CNN contains a feed-forward computation procedure, we represent it by a directed acyclic graph (DAG), where each node corresponds to a layer in the CNN and directed edges stand for the data flow from one layer to another. To turn such a graph into a uniform representation, we propose a novel layer representation called the Network Structure Code (NSC), as shown in Table 1. Each block is then depicted by a set of 5-D NSC vectors. In an NSC, the first three numbers stand for the layer index, the operation type and the kernel size. The last two are predecessor parameters, which refer to the positions of a layer's predecessor layers in the structure codes. The second predecessor (Pred2) is used for layers that have two predecessors; for a layer with only one predecessor, Pred2 is set to zero. This design is motivated by the current powerful hand-crafted networks like Inception and ResNet, which own their special block structures. This kind of block structure shares similar properties, such as containing more complex connections, e.g. shortcut connections or multi-branch connections, than the simple connections of a plain network like AlexNet. Thus, the proposed NSC can encode complex architectures, as shown in Fig. 2. In addition, all layers without a successor in the block are concatenated together to provide the final output. Note that each convolution operation, following the declaration in ResNet [11], refers to a Pre-activation Convolutional Cell (PCC) with three components, i.e. ReLU, Convolution and Batch Normalization. This results in a smaller search space than searching the three components separately, and hence with the PCC we can get a better initialization for searching and generate an optimal block structure with a quick training process.

Figure 2. Representative block exemplars with their Network Structure Codes (NSC): a block with multi-branch connections (left) and a block with shortcut connections (right).
Multi-branch block codes: [(1,4,0,0,0), (2,1,1,1,0), (3,1,3,2,0), (4,1,1,1,0), (5,1,5,4,0), (6,6,0,3,5), (7,2,3,1,0), (8,1,1,7,0), (9,6,0,6,8), (10,7,0,0,0)]
Shortcut block codes: [(1,4,0,0,0), (2,1,3,1,0), (3,1,3,2,0), (4,5,0,1,3), (5,7,0,0,0)]
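As a concrete reading of the NSC semantics and the concatenation rule above, the sketch below (our own helper functions, with hypothetical names) decodes the shortcut-block code list from Fig. 2 into its layers and edges, then collects the layers without successors, i.e. those whose outputs would be concatenated to form the block output.

```python
# Decode a block from its NSC vectors (layer index, type, kernel, pred1, pred2).
# Helper names are ours; the code list is the shortcut block from Fig. 2.
TYPE_NAMES = {1: "Conv", 2: "MaxP", 3: "AvgP", 4: "Identity", 5: "Add", 6: "Concat", 7: "Terminal"}

codes = [(1, 4, 0, 0, 0), (2, 1, 3, 1, 0), (3, 1, 3, 2, 0), (4, 5, 0, 1, 3), (5, 7, 0, 0, 0)]

def block_edges(codes):
    """Return (layer descriptions, edge list predecessor -> layer), ignoring the Terminal code."""
    layers, edges = {}, []
    for index, ltype, kernel, p1, p2 in codes:
        if ltype == 7:          # Terminal marks the end of the code sequence
            continue
        layers[index] = f"{TYPE_NAMES[ltype]},{kernel}" if kernel else TYPE_NAMES[ltype]
        for p in (p1, p2):
            if p > 0:           # 0 means "block input" or "unused"
                edges.append((p, index))
    return layers, edges

layers, edges = block_edges(codes)
has_successor = {p for p, _ in edges}
outputs = [i for i in layers if i not in has_successor]
print(layers)    # {1: 'Identity', 2: 'Conv,3', 3: 'Conv,3', 4: 'Add'}
print(edges)     # [(1, 2), (2, 3), (1, 4), (3, 4)]
print(outputs)   # [4] -> layers without a successor are concatenated as the block output
```

In this reading, every decoded Conv entry stands for the ReLU-Conv-BN pre-activation cell (PCC) rather than a bare convolution.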
Figure 4. Q-learning process illustration. (a) The state transition process by different action choices. The block structure in (b) is generated
by the red solid line in (a). (c) The flow chart of the Q-learning procedure.
Based on the above defined blocks, we construct the complete network by stacking these block structures sequentially, which turns a common plain network into its block-wise counterpart. Two representative auto-generated networks for the CIFAR and ImageNet tasks are shown in Fig. 3. There is no down-sampling operation within a block; we perform down-sampling directly with pooling layers. If the size of the feature map is halved by a pooling operation, the number of filters in the block is doubled. The architecture for ImageNet contains more pooling layers than that for CIFAR because of the different input sizes, i.e. 224 × 224 for ImageNet and 32 × 32 for CIFAR. More importantly, the blocks can be repeated N times to fulfill different demands, and the blocks can even be placed in other manners, such as inserting the block into the Network-in-Network [20] framework or setting shortcut connections between different blocks.
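A minimal PyTorch-style sketch of this stacking scheme is given below. It is an illustration under our own simplifications: the searched block is replaced by a stub of two pre-activation cells, and the helper names (PCC, make_block, build_stacked_net) are ours; only the overall pattern (repeat a block N times per stage, halve the feature map with pooling and double the filters, e.g. [32, 64, 128] for CIFAR) follows the text above.

```python
import torch.nn as nn

class PCC(nn.Sequential):
    """Pre-activation Convolutional Cell: ReLU -> Conv -> BatchNorm (Sec. 3.1)."""
    def __init__(self, c_in, c_out, k):
        super().__init__(
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
        )

def make_block(channels):
    # Stand-in for a searched block: two stacked PCCs keeping the channel count.
    return nn.Sequential(PCC(channels, channels, 3), PCC(channels, channels, 3))

def build_stacked_net(num_classes=100, n_repeat=4, stages=(32, 64, 128)):
    """Stack blocks; after each pooling (spatial size halved) the filter count doubles."""
    layers = [nn.Conv2d(3, stages[0], 3, padding=1, bias=False)]
    for i, width in enumerate(stages):
        layers += [make_block(width) for _ in range(n_repeat)]
        if i + 1 < len(stages):
            layers += [nn.MaxPool2d(2), PCC(width, stages[i + 1], 1)]  # halve map, double filters
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(stages[-1], num_classes)]
    return nn.Sequential(*layers)

net = build_stacked_net()  # CIFAR-style: 32x32 input, three stages of repeated blocks
```

In the real framework the stub block would be replaced by the block decoded from the learned NSC vectors.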
3.2. Designing Network Blocks With Q-Learning

Although we squeeze the search space of the entire network design by focusing on constructing network blocks, there is still a large number of possible structures to seek. Therefore, we employ reinforcement learning rather than random sampling for automatic design. Our method is based on Q-learning, a kind of reinforcement learning, which concerns how an agent ought to take actions so as to maximize the cumulative reward. The Q-learning model consists of an agent, states and a set of actions.

In this paper, the state s ∈ S represents the status of the current layer, which is defined as a Network Structure Code (NSC) as introduced in Section 3.1, i.e. the 5-D vector {layer index, layer type, kernel size, pred1, pred2}. The action a ∈ A is the decision for the next successive layer. Thanks to the defined NSC set with a limited number of choices, both the state and action space are finite and discrete, which ensures a relatively small search space. The state transition process (s_t, a(s_t)) → s_{t+1} is shown in Fig. 4(a), where t refers to the current layer. The block example in Fig. 4(b) is generated by the red solid lines in Fig. 4(a). The learning agent is given the task of sequentially picking the NSCs of a block. The structure of a block can be considered as an action selection trajectory τ_{a_{1:T}}, i.e. a sequence of NSCs. We model the layer selection process as a Markov Decision Process with the assumption that a well-performing layer in one block should also perform well in another block [2]. To find the optimal architecture, we ask our agent to maximize its expected reward over all possible trajectories, denoted by R_τ:

    R_τ = E_{P(τ_{a_{1:T}})}[R],    (1)

where R is the cumulative reward. For this maximization problem, it is common to use the recursive Bellman equation to characterize optimality. Given a state s_t ∈ S and a subsequent action a ∈ A(s_t), we define the maximum total expected reward to be Q∗(s_t, a), which is known as the Q-value of the state-action pair. The recursive Bellman equation can then be written as

    Q∗(s_t, a) = E_{s_{t+1}|s_t,a} [ E_{r|s_t,a,s_{t+1}} [r | s_t, a, s_{t+1}] + γ max_{a′ ∈ A(s_{t+1})} Q∗(s_{t+1}, a′) ].    (2)

An empirical way to solve the above quantity is to formulate it as an iterative update:

    Q(s_T, a) = 0,    (3)
    Q(s_{T−1}, a_T) = (1 − α) Q(s_{T−1}, a_T) + α r_T,    (4)
    Q(s_t, a) = (1 − α) Q(s_t, a) + α [ r_t + γ max_{a′} Q(s_{t+1}, a′) ],  t ∈ {1, 2, ..., T − 2},    (5)

where α is the learning rate, which determines how much the newly acquired information overrides the old information, γ is the discount factor, which measures the importance of future rewards, and r_t denotes the intermediate reward observed during the current state transition.
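The iterative update in Eqs. (3)-(5) is a standard tabular Q-learning rule over (state, action) pairs of NSC choices. The sketch below pairs that update with an epsilon-greedy choice of the next layer, as mentioned in Section 1; the data structures, the constants α and γ, and the zero intermediate reward are our own simplifications, and the reward stands in for the validation accuracy of the early-stopped network.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.01, 1.0          # learning rate and discount factor (values are ours)
Q = defaultdict(float)            # Q[(state, action)]; Eq. (3): unseen entries default to 0

def choose_action(state, actions, epsilon):
    """Epsilon-greedy selection of the next layer's NSC."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update_trajectory(trajectory, reward, candidate_actions):
    """Backward pass of Eqs. (4)-(5) over one sampled block.

    trajectory: list of (state, action) pairs ending with the Terminal action.
    reward: accuracy of the early-stopped network (r_T); intermediate r_t = 0 here.
    candidate_actions: callable returning the legal actions from a state.
    """
    s_last, a_last = trajectory[-1]
    Q[(s_last, a_last)] = (1 - ALPHA) * Q[(s_last, a_last)] + ALPHA * reward   # Eq. (4)
    for (s, a), (s_next, _) in zip(reversed(trajectory[:-1]), reversed(trajectory[1:])):
        best_next = max(Q[(s_next, a2)] for a2 in candidate_actions(s_next))
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (0.0 + GAMMA * best_next)  # Eq. (5)
```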
[Figure panels: "Q-learning performance with different intermediate reward" (ignoring the intermediate reward vs. a shaped reward) and "Analysis of early stop accuracy" (early-stop vs. final accuracy, redefined reward, FLOPs, density), together with the block topologies referred to in Figure 8(b)-(d).]
Figure 8. (a) Q-learning performance on CIFAR-100. The accuracy goes up as epsilon decreases and the top models are all found in the final stage, showing that our agent learns to generate better block structures rather than searching randomly. (b-c) Topology of the top-2 block structures generated by our approach, which we call Block-QNN-A and Block-QNN-B. (d) Topology of the best block structure generated with limited parameters, named Block-QNN-S.
[Figure: Q-learning performance with different structure codes, comparing the PCC {ReLU, Conv, BN} cell with separately searched ReLU, BN and Conv layers; y-axis: accuracy (%).]

Method | Depth | Para | C-10 | C-100
VGG [25] | - | - | 7.25 | -
ResNet [10] | 110 | 1.7M | 6.61 | -
Wide ResNet [36] | 28 | 36.5M | 4.17 | 20.5

Table 3 (excerpt). Error rate (%) comparison on CIFAR-10 (C-10) and CIFAR-100 (C-100).
Comparison with hand-crafted networks - The results show that our Block-QNN networks outperform most hand-crafted networks. DenseNet-BC [13] uses additional 1 × 1 convolutions in each composite function and compressive transition layers to reduce parameters and improve performance, which we do not adopt in our design. Our performance could be further improved by using this prior knowledge.

Method | Best model error (%) | GPUs | Time (days)
MetaQNN [2] | 6.92 | 10 | 10
NAS [37] | 3.65 | 800 | 28
Our approach | 3.54 | 32 | 3

Table 4. The computing resources and time required by our approach compared with other automatic network design methods.
Comparison with auto-generated networks - Our approach achieves a significant improvement over MetaQNN [2], and is even better than NAS's best model (i.e. NASv3 with more filters) [37] proposed by Google Brain, which incurs expensive time and GPU costs. As shown in Table 4, NAS trains the whole system on 800 GPUs for 28 days, while we only need 32 GPUs for 3 days to reach state-of-the-art performance.

Transfer block from CIFAR-100 to CIFAR-10 - We transfer the top blocks learned on CIFAR-100 to the CIFAR-10 dataset, with all experiment settings kept the same. As shown in Table 3, the blocks also achieve state-of-the-art results on CIFAR-10, with a 3.60% error rate, which proves that Block-QNN networks have powerful transferability.
Analysis on network parameters - The networks generated by our method might be complex and have a large number of parameters, since we do not add any constraints during training. We further conduct an experiment on searching for networks with limited parameters and an adaptive number of blocks. We set the maximal parameter count to 10M and obtain an optimal block (i.e. Block-QNN-S) which outperforms NASv3 with fewer parameters, as shown in Fig. 8(d). In addition, when using more filters in each convolutional layer (e.g. going from [32, 64, 128] to [80, 160, 320]), we can achieve an even better result (3.54%).
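The parameter constraint used in this experiment can be checked with a few lines of Python, assuming the candidate network is a PyTorch module; the helper names are ours and only the 10M cap comes from the text above.

```python
def count_parameters(model):
    """Total number of trainable parameters of a candidate network (PyTorch module)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def within_budget(model, max_params=10_000_000):
    """Reject candidate networks exceeding the 10M-parameter cap used for Block-QNN-S."""
    return count_parameters(model) <= max_params
```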
5.3. Transfer to ImageNet

To demonstrate the generalizability of our approach, we transfer the block structure learned from CIFAR to the ImageNet dataset.

For the ImageNet task, we set the block repeat number to N = 3 and add more down-sampling operations before the blocks; the filter numbers for the convolution layers in the different-level blocks are [64, 128, 256, 512]. We use the best block structure learned from CIFAR-100 directly, without any fine-tuning, and the generated network is initialized with MSRA initialization as above. The experimental results are shown in Table 5. The network generated by our framework achieves competitive results compared with other human-designed models. Recently proposed methods such as Xception [4] and ResNext [35] use special depth-wise convolution operations to reduce their total number of parameters and improve performance. In our work, we do not use this new convolution operation, so the comparison is not entirely fair; we will consider this in future work to further improve the performance.
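Using the hypothetical builder sketched in Section 3.1, this configuration would look roughly as follows; it is an illustration only (the extra down-sampling the paper adds before the blocks is omitted), with MSRA initialization expressed through PyTorch's kaiming_normal_.

```python
import torch.nn as nn

def msra_init(module):
    # MSRA (Kaiming) initialization for every convolution layer.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")

# ImageNet variant: N = 3 block repeats per stage, filters [64, 128, 256, 512];
# build_stacked_net is the stub builder from the Section 3.1 sketch.
imagenet_net = build_stacked_net(num_classes=1000, n_repeat=3,
                                 stages=(64, 128, 256, 512))
imagenet_net.apply(msra_init)
```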
Method | Input Size | Depth | Top-1 | Top-5
VGG [25] | 224x224 | 16 | 28.5 | 9.90
Inception V1 [30] | 224x224 | 22 | 27.8 | 10.10
Inception V2 [14] | 224x224 | 22 | 25.2 | 7.80
ResNet-50 [11] | 224x224 | 50 | 24.7 | 7.80
ResNet-152 [11] | 224x224 | 152 | 23.0 | 6.70
Xception (our test) [4] | 224x224 | 50 | 23.6 | 7.10
ResNext-101 (64x4d) [35] | 224x224 | 101 | 20.4 | 5.30
Block-QNN-B, N=3 | 224x224 | 38 | 24.3 | 7.40
Block-QNN-S, N=3 | 224x224 | 38 | 22.6 | 6.46

Table 5. Block-QNN results (single-crop error rate, %) compared with modern methods on the ImageNet-1K dataset.

As far as we know, most previous works on automatic network generation did not report competitive results on large-scale image classification datasets. With the conception of block learning, we can easily transfer an architecture learned on small datasets to a big dataset such as ImageNet. In future experiments, we will try to apply the generated blocks to other tasks such as object detection and semantic segmentation.

6. Conclusion

In this paper, we show how to efficiently design high-performance network blocks with Q-learning. We use a distributed asynchronous Q-learning framework and an early-stop strategy focused on fast block-structure search. We applied the framework to automatic block generation for constructing good convolutional networks. Our Block-QNN networks outperform modern hand-crafted networks as well as other auto-generated networks on image classification tasks. The best block structure, which achieves state-of-the-art performance on CIFAR, can easily be transferred to the large-scale ImageNet dataset and also yields competitive performance compared with the best hand-crafted networks. We show that searching with the block design strategy produces more elegant and model-explicable network architectures. In the future, we will continue to improve the proposed framework in different aspects, such as using more powerful convolution layers and making the search process faster. We will also try to search for blocks with limited FLOPs and conduct experiments on other tasks such as detection and segmentation.
Acknowledgments

This work has been supported by the National Natural Science Foundation of China (NSFC) Grants 61721004 and 61633021.

Appendix

A. Efficiency of BlockQNN
We demonstrate the effectiveness of our proposed BlockQNN for network architecture generation on the CIFAR-100 dataset, compared to random search given an equivalent number of training iterations, i.e. the same number of sampled networks. We define the effectiveness of a network architecture auto-generation algorithm as the increase in the top auto-generated network's performance from the initial random exploration to the exploitation phase, since we aim to obtain the optimal auto-generated network rather than to improve the average performance.
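As a rough sketch (our own illustration, not the paper's evaluation code), this effectiveness criterion can be read as tracking the best, or top-k mean, early-stop accuracy of the architectures sampled so far and comparing the exploitation phase against the random-exploration baseline:

```python
# Sketch of the effectiveness measure described above (illustrative names only).
def top_k_mean(accuracies, k=5):
    """Mean early-stop accuracy of the k best architectures sampled so far."""
    best = sorted(accuracies, reverse=True)[:k]
    return sum(best) / len(best)

def effectiveness(accs, exploration_iters):
    """Gain of the top models from the random-exploration phase to exploitation."""
    explore = accs[:exploration_iters]
    return top_k_mean(accs) - top_k_mean(explore)

# e.g. compare effectiveness(blockqnn_accs, 60) with effectiveness(random_search_accs, 60)
```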
Figure 10 shows the performance of BlockQNN and random search (RS) over a complete training process, i.e. sampling 11,392 blocks in total. We find that the best model generated by BlockQNN is markedly better than the best model found by RS, by over 1%, in the exploitation phase on the CIFAR-100 dataset. We observe the same for the mean performance of the top-5 models generated by BlockQNN compared to RS. Note that the compared random search method starts from the same exploration phase as BlockQNN for fairness.

Figure 10. Measuring the efficiency of BlockQNN against random search (RS) for learning neural architectures. The x-axis measures the training iterations (batch size is 64), i.e. the total number of architectures sampled, and the y-axis is the early-stop performance after 12 epochs of CIFAR-100 training. Each pair of curves measures the mean accuracy across the top-ranking models generated by each algorithm. Best viewed in color.
Figure 11 shows the performance of BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L) and of random search with limited parameters and adaptive block numbers (RS-L) over a complete training process. We observe the same phenomenon: BlockQNN-L outperforms RS-L by over 1% in the exploitation phase. These results prove that our BlockQNN can learn to generate better network architectures rather than relying on random search.

Figure 11. Measuring the efficiency of BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L) against random search with limited parameters and adaptive block numbers (RS-L) for learning neural architectures. The x-axis measures the training iterations (batch size is 64), i.e. the total number of architectures sampled, and the y-axis is the early-stop performance after 12 epochs of CIFAR-100 training. Each pair of curves measures the mean accuracy across the top-ranking models generated by each algorithm. Best viewed in color.
B. Evolutionary Process of Auto-Generated Blocks

We sample the block structures with median performance generated by our approach at different stages, i.e. at iterations [1, 30, 60, 90, 110, 130, 150, 170], to show the evolutionary process. As illustrated in Figure 12 and Figure 13, for BlockQNN and BlockQNN-L respectively, the block structures generated in the random exploration stage are much simpler than the structures generated in the exploitation stage.

In the exploitation stage, multi-branch structures appear frequently. Note that the number of connections gradually increases and the blocks tend to choose "Concat" as the last layer. We also find that shortcut connections and elemental add layers are common in the exploitation stage. Additionally, blocks generated by BlockQNN-L have fewer "Conv,5" layers, i.e. convolution layers with kernel size 5, because of the parameter limit.

These observations show that our approach can learn universal design concepts for good network blocks. Compared to other automatic network architecture design methods, our generated networks are more elegant and model-explicable.
Figure 12. Evolutionary process of blocks generated by BlockQNN. We sample the block structures with median performance at iteration
[1, 30, 60, 90, 110, 130, 150, 170] to compare the difference between the blocks in the random exploration stage and the blocks in the
exploitation stage.
Figure 13. Evolutionary process of blocks generated by BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L).
We sample the block structures with median performance at iteration [1, 30, 60, 90, 110, 130, 150, 170] to compare the difference between
the blocks in the random exploration stage and the blocks in the exploitation stage.