[Figure 1: the example two-stage network; Stage 1 is encoded as 1-00-111 and Stage 2 as 0-10-000-0011.]
Figure 1. A two-stage network ($S = 2$, $(K_1, K_2) = (4, 5)$) and the encoded binary string (best viewed in color). The default input and output nodes (see Section 3.1.1) and the connections related to these nodes are marked in red and green, respectively. We only encode the connections between the ordinary nodes (regions with a light blue background). Within each stage, the number of convolutional filters is constant (32 in Stage 1, 64 in Stage 2), and the spatial resolution remains unchanged ($32 \times 32$ in Stage 1, $16 \times 16$ in Stage 2). Each pooling layer down-samples the data by a factor of 2. ReLU and batch normalization are added after each convolution.
and ReLU [19] are followed, which are verified efficient in training very deep neural networks [13]. We do not encode the fully-connected layers of a network.

In each stage, we use $1 + 2 + \ldots + (K_s - 1) = \frac{1}{2}K_s(K_s - 1)$ bits to encode the inter-node connections. The first bit represents the connection between $(v_{s,1}, v_{s,2})$, then the following two bits represent the connections between $(v_{s,1}, v_{s,3})$ and $(v_{s,2}, v_{s,3})$, etc. This process continues until the last $K_s - 1$ bits are used to represent the connections between $v_{s,1}, v_{s,2}, \ldots, v_{s,K_s - 1}$ and $v_{s,K_s}$. For $1 \leq i < j \leq K_s$, if the bit corresponding to $(v_{s,i}, v_{s,j})$ is 1, there is an edge connecting $v_{s,i}$ and $v_{s,j}$, i.e., $v_{s,j}$ takes the output of $v_{s,i}$ as a part of the element-wise summation, and vice versa. In summary, an $S$-stage network with $K_s$ nodes at the $s$-th stage is encoded into a binary string of length $L = \frac{1}{2}\sum_s K_s(K_s - 1)$. Figure 1 illustrates an example of encoding a two-stage network.

We note that the number of possible network structures ($2^L$) may be very large. In the CIFAR10 experiments (see Section 4.1), we have $S = 3$ and $(K_1, K_2, K_3) = (3, 4, 5)$, therefore $L = 19$ and $2^L = 524{,}288$. It is computationally intractable to enumerate all these structures and find the optimal one(s). To this end, we use the genetic algorithm to efficiently explore good candidates in this large space.
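To make the encoding concrete, here is a minimal Python sketch (our illustration, not the authors' code; the helper names are ours) that computes the string length $L$ and decodes one stage's bit segment into the list of encoded edges, following the bit ordering described above:

```python
from typing import List, Tuple

def string_length(stage_sizes: List[int]) -> int:
    """L = 1/2 * sum_s K_s * (K_s - 1), the total number of encoded bits."""
    return sum(k * (k - 1) // 2 for k in stage_sizes)

def decode_stage(bits: str) -> List[Tuple[int, int]]:
    """Decode one stage's segment into edges (i, j) with 1 <= i < j <= K_s.

    The first bit encodes (v_1, v_2), the next two bits encode (v_1, v_3)
    and (v_2, v_3), and so on, until the last K_s - 1 bits encode the
    connections into v_{K_s}.
    """
    edges, pos, j = [], 0, 2
    while pos < len(bits):
        for i in range(1, j):            # candidate predecessors of v_j
            if bits[pos] == '1':
                edges.append((i, j))
            pos += 1
        j += 1
    return edges

print(string_length([3, 4, 5]))   # 19, the CIFAR10 setting in Section 4.1
print(decode_stage("100111"))     # Stage 1 of Figure 1: [(1, 2), (1, 4), (2, 4), (3, 4)]
```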
3.1.1 Technical Details

To make every binary string valid, we define two default nodes in each stage. The default input node, denoted as $v_{s,0}$, receives data from the previous stage, performs convolution, and sends its output to every node without a predecessor, e.g., $v_{s,1}$. The default output node, denoted as $v_{s,K_s+1}$, receives data from all nodes without a successor, e.g., $v_{s,K_s}$, sums them up, performs convolution, and sends its output to the pooling layer. Note that the connections between the ordinary nodes and the default nodes are not encoded.

There are two special cases. First, if an ordinary node $v_{s,i}$ is isolated (i.e., it is not connected to any other ordinary node $v_{s,j}$, $i \neq j$), then it is simply ignored, i.e., it is connected neither to the default input node nor to the default output node (see the B2 node in Figure 1). This guarantees that a stage with more nodes can simulate all structures represented by a stage with fewer nodes. Second, if there are no connections at a stage, i.e., all bits in the binary string are 0, then the convolutional operation is performed only once, not twice (one performed by the default input node and the other by the default output node).
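The rules above can be summarized in a short sketch (again our own illustration under the stated rules, not the authors' code): given a stage's decoded edge list, it assigns every active node its inputs, wires the default input and output nodes, drops isolated ordinary nodes, and collapses the all-zero case into a single convolution.

```python
from typing import Dict, List, Tuple

def stage_wiring(edges: List[Tuple[int, int]]) -> Dict[str, List[str]]:
    """Return, for each active node, the list of nodes whose outputs it sums.

    Ordinary nodes are named '1', '2', ...; 'in' and 'out' denote the default
    input and output nodes. Isolated ordinary nodes are dropped, and a stage
    with no encoded connections reduces to a single convolution ('in' -> 'out').
    """
    if not edges:                                    # special case 2: all-zero string
        return {"out": ["in"]}
    active = {v for e in edges for v in e}           # special case 1: ignore isolated nodes
    preds = {j: [i for i, jj in edges if jj == j] for j in active}
    succs = {i: [j for ii, j in edges if ii == i] for i in active}
    wiring = {}
    for j in sorted(active):
        # a node without encoded predecessors reads from the default input node
        wiring[str(j)] = [str(i) for i in preds[j]] if preds[j] else ["in"]
    # the default output node sums the outputs of nodes without successors
    wiring["out"] = [str(i) for i in sorted(active) if not succs[i]]
    return wiring

# Stage 2 of Figure 1: code 0-10-000-0011 encodes edges (1,3), (3,5), (4,5);
# node 2 is isolated and therefore ignored.
print(stage_wiring([(1, 3), (3, 5), (4, 5)]))
# {'1': ['in'], '3': ['1'], '4': ['in'], '5': ['3', '4'], 'out': ['5']}
```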
3.1.2 Examples and Limitations

Many popular network structures can be represented using the proposed encoding scheme. Examples include VGGNet [32], ResNet [13], and a modified variant of DenseNet [15], which are illustrated in Figure 2.

Figure 2. The basic building blocks of VGGNet [32], ResNet [13] and a variant of DenseNet [15] can be encoded as binary strings defined in Section 3.1 (with $K_s = 4$ ordinary nodes in each block, the codes are 1-01-001, 1-01-101 and 1-11-111, respectively).
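As a sanity check, applying the hypothetical decode_stage helper from the earlier sketch to the codes in Figure 2 (each block has four ordinary nodes, hence $1 + 2 + 3 = 6$ bits) recovers the expected connectivity patterns:

```python
# Codes taken from Figure 2; decode_stage is the helper sketched in Section 3.1.
for name, code in [("VGGNet block", "1-01-001"),
                   ("ResNet block", "1-01-101"),
                   ("DenseNet-like block", "1-11-111")]:
    print(name, decode_stage(code.replace("-", "")))
# VGGNet block        [(1, 2), (2, 3), (3, 4)]                          -> a plain chain
# ResNet block        [(1, 2), (2, 3), (1, 4), (3, 4)]                  -> chain plus a skip connection
# DenseNet-like block [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)]  -> every pair connected
```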
Currently, the encoded structures only involve convolutional and pooling operations, which makes it impossible to generate some tricky network modules such as Maxout [10]. Also, the convolutional kernel size and the number of channels are fixed within each stage, which limits the network from incorporating multi-scale information as in the inception module [36]. We note that all automatically learned network structures have such limitations [45]. Our approach can be easily modified to include more types of layers and more flexible inter-layer connections. As shown in experiments, we can achieve competitive recognition performance using merely these basic building blocks.

As shown in a recently published work using reinforcement learning to explore neural architectures [45], this type of method often requires heavy computation to traverse the huge solution space. We apply a strategy to learn network architectures on a small dataset, and transfer the top-ranked structures to large-scale visual recognition tasks.

3.2. Genetic Operations

The flowchart of the genetic process is shown in Algorithm 1. It starts with an initialized population of $N$ randomized individuals. Then, we perform $T$ rounds, or $T$ generations, each of which consists of three operations, i.e., selection, mutation and crossover. The fitness function of each individual is evaluated via training-from-scratch on the reference dataset.

3.2.1 Initialization

We initialize a set of randomized models $\{M_{0,n}\}_{n=1}^{N}$. Each model is a binary string with $L$ bits, i.e., $M_{0,n}: b_{0,n} \in \{0,1\}^{L}$. Each bit in each individual is independently sampled from a Bernoulli distribution: $b_{0,n}^{l} \sim \mathcal{B}(0.5)$, $l = 1, 2, \ldots, L$. After this, we evaluate each individual (see Section 3.2.4) to obtain their fitness function values.

As we shall see in Section 4.1.2, different strategies of initialization do not impact the genetic performance too much. Even starting with a naive initialization (all individuals are all-zero strings), the genetic process can discover quite competitive structures via crossover and mutation.
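This initialization is a one-line sampling step; a NumPy sketch (our illustration; the population is just an $N \times L$ binary matrix, and $N = 20$ here is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 20, 19                                   # L = 19 for the CIFAR10 setting
population = rng.integers(0, 2, size=(N, L))    # each bit sampled from Bernoulli(0.5)
all_zero = np.zeros((N, L), dtype=int)          # the naive all-zero initialization
```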
3.2.2 Selection

The selection process is performed at the beginning of every generation. Before the $t$-th generation, the $n$-th individual $M_{t-1,n}$ is assigned a fitness function, which is defined as the recognition rate $r_{t-1,n}$ obtained in the previous generation or initialization. $r_{t-1,n}$ directly impacts the probability that $M_{t-1,n}$ survives the selection process.

We perform a Russian roulette process to determine which individuals survive. Each individual in the next generation $M_{t,n}$ is determined independently by a non-uniform sampling over the set $\{M_{t-1,n}\}_{n=1}^{N}$. The probability of sampling $M_{t-1,n}$ is proportional to $r_{t-1,n} - r_{t-1,0}$, where $r_{t-1,0} = \min_{n=1}^{N}\{r_{t-1,n}\}$ is the minimal fitness function value in the previous generation. This means that the best individual has the largest probability of being selected, and the worst one is always eliminated. As the number of individuals $N$ remains unchanged, each individual in the previous generation may be selected multiple times.
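A minimal sketch of this Russian roulette step (our illustration; fitness values are the recognition rates, and the sampling weight of each individual is $r_{t-1,n} - r_{t-1,0}$ as described above):

```python
import numpy as np

def select(population: np.ndarray, fitness: np.ndarray,
           rng: np.random.Generator) -> np.ndarray:
    """Sample N survivors with replacement, with probability proportional to
    r_n - min(r); the worst individual is therefore never selected."""
    weights = fitness - fitness.min()
    if weights.sum() == 0:                 # degenerate case: all fitness values equal
        weights = np.ones_like(weights)
    probs = weights / weights.sum()
    idx = rng.choice(len(population), size=len(population), replace=True, p=probs)
    return population[idx].copy()
```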
3.2.3 Mutation and Crossover

The mutation process of an individual $M_{t,n}$ involves flipping each bit independently with a probability $q_M$. In practice, $q_M$ is often small, e.g., 0.05, so that mutation is not likely to change one individual too much. This is to preserve the good properties of a survived individual while providing an opportunity of trying out new possibilities.

The crossover process involves changing two individuals simultaneously. Instead of considering each bit individually, the basic unit in crossover is a stage, which is motivated by the need to retain the local structures within each stage. Similar to mutation, each pair of corresponding stages is exchanged with a small probability $q_C$.

Both mutation and crossover are performed in an overall flowchart (see Algorithm 1). The probabilities of mutation and crossover for each individual (or pair) are $p_M$ and $p_C$, respectively. Of course, there are many different ways of performing mutation and crossover. In experiments, our simple choice leads to competitive performance.
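Both operators can be sketched as follows (our illustration; the stage boundaries are recovered from the per-stage node counts, e.g., $(K_1, K_2, K_3) = (3, 4, 5)$ gives segments of 3, 6 and 10 bits):

```python
import numpy as np
from typing import List, Tuple

def mutate(bits: np.ndarray, qM: float, rng: np.random.Generator) -> np.ndarray:
    """Flip each bit independently with probability qM (e.g., 0.05)."""
    flips = rng.random(bits.shape) < qM
    return np.where(flips, 1 - bits, bits)

def crossover(a: np.ndarray, b: np.ndarray, stage_sizes: List[int], qC: float,
              rng: np.random.Generator) -> Tuple[np.ndarray, np.ndarray]:
    """Exchange whole stages between two individuals, each with probability qC."""
    a, b = a.copy(), b.copy()
    start = 0
    for K in stage_sizes:
        end = start + K * (K - 1) // 2     # number of bits encoding this stage
        if rng.random() < qC:
            a[start:end], b[start:end] = b[start:end].copy(), a[start:end].copy()
        start = end
    return a, b
```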
3.2.4 Evaluation

After the above processes, each individual $M_{t,n}$ is evaluated to obtain the fitness function value. A reference dataset $\mathcal{D}$ is pre-defined, and we individually train each model $M_{t,n}$ from scratch. If $M_{t,n}$ has been evaluated before, we simply evaluate it once again and compute the average accuracy over all its occurrences. This strategy, at least to some extent, alleviates the instability caused by the randomness in the training process.
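The averaging over repeated evaluations amounts to a small cache keyed by the bit string; a sketch (our illustration; evaluate_fn stands for training the decoded network from scratch on the reference dataset and returning its recognition rate, and is assumed rather than defined here):

```python
from collections import defaultdict

class FitnessCache:
    """Average the recognition rate of an individual over all of its evaluations."""

    def __init__(self, evaluate_fn):
        self.evaluate_fn = evaluate_fn          # assumed: bit string -> recognition rate
        self.history = defaultdict(list)        # bit string -> list of observed accuracies

    def fitness(self, bits: str) -> float:
        # a previously seen individual is still re-trained once more, and the
        # reported fitness is the average over all of its occurrences
        self.history[bits].append(self.evaluate_fn(bits))
        return sum(self.history[bits]) / len(self.history[bits])
```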
Algorithm 1 The Genetic Process for Network Design
1: Input: the reference dataset $\mathcal{D}$, the number of generations $T$, the number of individuals in each generation $N$, the mutation and crossover probabilities $p_M$ and $p_C$, the mutation parameter $q_M$, and the crossover parameter $q_C$.
2: Initialization: generating a set of randomized individuals $\{M_{0,n}\}_{n=1}^{N}$, and computing their recognition accuracies;
3: for $t = 1, 2, \ldots, T$ do
4:     Selection: producing a new generation $\{M'_{t,n}\}_{n=1}^{N}$ with a Russian roulette process on $\{M_{t-1,n}\}_{n=1}^{N}$;
5:     Crossover: for each pair $\{(M_{t,2n-1}, M_{t,2n})\}_{n=1}^{\lfloor N/2 \rfloor}$, performing crossover with probability $p_C$ and parameter $q_C$;
6:     Mutation: for each non-crossover individual $\{M_{t,n}\}_{n=1}^{N}$, doing mutation with probability $p_M$ and parameter $q_M$;
7:     Evaluation: computing the recognition accuracy for each new individual $\{M_{t,n}\}_{n=1}^{N}$;
8: end for
9: Output: a set of individuals in the final generation $\{M_{T,n}\}_{n=1}^{N}$ with their recognition accuracies.
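Putting the pieces together, Algorithm 1 corresponds to a loop of the following shape; this sketch reuses the hypothetical helpers from the snippets above (select, crossover, mutate and FitnessCache) and is not the authors' implementation:

```python
import numpy as np

def genetic_process(L, N, T, pM, pC, qM, qC, stage_sizes, cache, seed=0):
    """Initialization followed by T generations of selection, crossover,
    mutation (on non-crossover individuals) and evaluation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(N, L))
    fit = np.array([cache.fitness("".join(map(str, ind))) for ind in pop])
    for t in range(T):
        pop = select(pop, fit, rng)
        crossed = np.zeros(N, dtype=bool)
        for n in range(0, N - 1, 2):                      # crossover acts on pairs
            if rng.random() < pC:
                pop[n], pop[n + 1] = crossover(pop[n], pop[n + 1],
                                               stage_sizes, qC, rng)
                crossed[n] = crossed[n + 1] = True
        for n in range(N):                                # mutate non-crossover individuals
            if not crossed[n] and rng.random() < pM:
                pop[n] = mutate(pop[n], qM, rng)
        fit = np.array([cache.fitness("".join(map(str, ind))) for ind in pop])
    return pop, fit
```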
[Figure: recognition accuracy versus generation number, comparing random initialization with all-zero initialization (see Section 4.1.2).]
within the same stage are connected. This network produces a 76.84% recognition rate, which is significantly lower than the number (77.19%) reported in Table 1. Considering that the densely-connected network requires heavier computational overheads, we conclude that the structures learned by the genetic algorithm are more effective and efficient than using dense connections.

4.1.5 Visualization

In Figure 5, we visualize the network structures learned from two individual genetic processes. The structures learned by the genetic algorithm are somewhat different from the manually designed ones, although some manually designed local structures are observed, like the chain-shaped networks, multi-path networks and highway networks. We emphasize that these two networks, though obtained by independent genetic processes, are somewhat similar, which demonstrates that the genetic process generally converges to similar network structures.

[Figure 5: the two learned structures, annotated with their stage codes (e.g., 1-01, 0-01-100, 1-01-100, 0-11-101-0001) and with local patterns resembling VGGNet-like chains, GoogLeNet-style multiple-path networks, highway networks and residual networks.]