
Genetic CNN

Lingxi Xie, Alan Yuille


Department of Computer Science, The Johns Hopkins University, Baltimore, MD, USA

Abstract

The deep convolutional neural network (CNN) is the state-of-the-art solution for large-scale visual recognition. Following some basic principles such as increasing network depth and constructing highway connections, researchers have manually designed a lot of fixed network architectures and verified their effectiveness.

In this paper, we discuss the possibility of learning deep network structures automatically. Note that the number of possible network structures increases exponentially with the number of layers in the network, which motivates us to adopt the genetic algorithm to efficiently explore this large search space. The core idea is to propose an encoding method to represent each network structure as a fixed-length binary string. The genetic algorithm is initialized by generating a set of randomized individuals. In each generation, we define standard genetic operations, e.g., selection, mutation and crossover, to generate competitive individuals and eliminate weak ones. The competitiveness of each individual is defined as its recognition accuracy, which is obtained via a standalone training process on a reference dataset. We run the genetic process on CIFAR10, a small-scale dataset, demonstrating its ability to find high-quality structures that have been little studied before. The learned powerful structures are also transferrable to the ILSVRC2012 dataset for large-scale visual recognition.

1. Introduction

Visual recognition is a fundamental task in computer vision, implying a wide range of applications. Recently, the state-of-the-art algorithms on visual recognition are mostly based on the deep Convolutional Neural Network (CNN). Starting from the fundamental chain-styled network models [19], researchers have been increasing the depth of the network [32], as well as designing novel network modules [36][13] to improve recognition accuracy. Although these modern networks have been shown to be effective, we note that their structures are manually designed, not learned, which limits the flexibility of the approach.

In this paper, we reveal the possibility of automatically learning the structure of deep neural networks. We consider a constrained case, in which the network has a limited number of stages, and each stage is defined as a set of pre-defined building blocks such as convolution and pooling layers. Even under these limitations, the total number of possible network structures grows exponentially with the number of layers, making it impractical to enumerate all the candidates and find the best one. Instead, we formulate this problem as optimization in a large search space, and apply the genetic algorithm to explore the space efficiently.

The genetic algorithm involves constructing an initial population of individuals, and performing genetic operations to allow them to evolve in an iterative process. We propose a novel encoding scheme to represent each network structure as a fixed-length binary string, and define several standard genetic operations, i.e., selection, mutation and crossover, so that new competitive individuals are generated from the previous generation and weak ones are eliminated. The quality (fitness) of each individual is determined by its recognition accuracy on a reference dataset. To this end, we perform a complete training process for each individual (i.e., network structure), which is independent of the genetic algorithm. The genetic process comes to an end after a fixed number of generations.

It is worth emphasizing that the genetic algorithm is computationally expensive, as we need to undergo a complete network training process for each generated individual. We adopt the strategy of running the genetic process on a small dataset (CIFAR10), on which we observe the ability of the genetic algorithm to find effective network structures, and then transfer the learned top-ranked structures to perform large-scale visual recognition. The learned structures, most of which have been little studied before, often perform better than the manually designed ones in both small-scale and large-scale experiments.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 illustrates the way of using the genetic algorithm to design network structures. Experiments are shown in Section 4, and conclusions are drawn in Section 5.
2. Related Work

2.1. Convolutional Neural Networks

Recent years have witnessed a revolution in visual recognition. Conventional classification tasks [20][8] have been extended into large-scale environments [5][44]. With the availability of powerful computational resources (e.g., GPUs), Convolutional Neural Networks (CNNs) [19][32] have shown superior performance over the conventional Bag-of-Visual-Words [3][38][29] and compositional models [9].

The CNN is a hierarchical model for large-scale visual recognition. It is based on the observation that a network with enough neurons is able to fit any complicated data distribution. In past years, neural networks were shown effective for simple recognition tasks [22]. More recently, the availability of large-scale training data (e.g., ImageNet [5]) and powerful GPUs has made it possible to train deep CNNs [19] which significantly outperform BoVW models. A CNN is composed of several stacked layers. In each of them, responses from the previous layer are convolved with a filter bank and activated by a differentiable non-linearity. Hence, a CNN can be considered as a composite function, which is trained by back-propagating error signals defined by the difference between the supervision and the prediction at the top layer. Recently, several efficient methods were proposed to help CNNs converge faster and prevent over-fitting, such as ReLU activation [19], batch normalization [17], Dropout [34] and DisturbLabel [40].

Designing powerful CNN structures is an intriguing problem. It is believed that deeper networks produce better recognition results [32][36], and adding highway information has also been verified to be useful [13][42]. We also find some work which uses stochastic [16] or dense [15] structures. All these network structures are deterministic (although a stochastic strategy is used in [16] to accelerate training and prevent over-fitting), which limits the flexibility of the models and thus inspires us to automatically learn network structures.

2.2. Genetic Algorithm

The genetic algorithm is a metaheuristic inspired by the natural selection process. It is commonly used to generate high-quality solutions to optimization and search problems [14][30][2][4] by performing bio-inspired operators such as mutation, crossover and selection.

A standard genetic algorithm requires two prerequisites, i.e., a genetic representation of the solution domain, and a fitness function to evaluate each individual. A typical example is the travelling-salesman problem (TSP) [11], a famous NP-complete problem which aims at finding the optimal Hamiltonian path in a graph of N nodes. In this situation, each feasible solution is represented as a permutation of {1, 2, . . . , N}, and the fitness function is the total distance of the path. We will show in Section 3.1 that deep neural networks can be encoded into a binary string.

The core idea of the genetic algorithm is to allow individuals to evolve via some genetic operations. Popular operations include selection, mutation, crossover, etc. The selection process allows us to preserve strong individuals while eliminating weak ones. The ways of performing mutation and crossover are often based on the properties of the specific problem. For example, in the TSP with the permutation-based representation, a possible mutation operation is to change the order of two visited nodes. Research has been conducted to improve the performance of genetic algorithms, including performing local search [37] and generating random keys [33]. In our work, we show that the vanilla genetic algorithm works well enough without these tricks. We also note that some previous work applied the genetic algorithm to learning the structure [35][1] or weights [41][6] of artificial neural networks, but our work aims at learning the architecture of modern CNNs, which has not been studied in prior research.

3. Our Approach

This section presents the genetic algorithm for learning competitive network structures. First, we propose a way of encoding a network structure into a fixed-length binary string. Next, genetic operations are defined, including selection, mutation and crossover, so that we can explore the search space efficiently and find high-quality solutions. Throughout this work, the genetic algorithm is only used to propose new network structures; the parameters and recognition accuracy of each individual are obtained via standalone training-from-scratch.

3.1. Binary Network Representation

We provide a binary string representation for a network structure in a constrained case. We consider those network structures [32][13] which can be organized into several stages. In each stage, the geometric dimensions (width, height and depth) of the data cube remain unchanged. Neighboring stages are connected via a spatial pooling operation, which may change the spatial resolution. All the convolutional operations within one stage have the same number of filters, a.k.a. data channels.

We follow this idea to define a family of networks which can be encoded into fixed-length binary strings. A network is composed of S stages, and the s-th stage, s = 1, 2, . . . , S, contains K_s nodes, denoted by v_{s,k_s}, k_s = 1, 2, . . . , K_s. The nodes within each stage are ordered, and we only allow connections from a lower-numbered node to a higher-numbered node. Each node corresponds to a convolutional operation, which takes place after element-wise summing up all its input nodes (the lower-numbered nodes that are connected to it). After convolution, batch normalization [17] and ReLU [19] follow, which have been verified effective in training very deep neural networks [13]. We do not encode the fully-connected layers of a network.

Figure 1. A two-stage network (S = 2, (K_1, K_2) = (4, 5)) and the encoded binary string (best viewed in color). The default input and output nodes (see Section 3.1.1) and the connections related to these nodes are marked in red and green, respectively. We only encode the connections between the ordinary nodes (regions with light blue background). Within each stage, the number of convolutional filters is a constant (32 in Stage 1, 64 in Stage 2), and the spatial resolution remains unchanged (32 × 32 in Stage 1, 16 × 16 in Stage 2). Each pooling layer down-samples the data by a factor of 2. ReLU and batch normalization are added after each convolution. Stage 1 (nodes A0-A5) is encoded as 1-00-111, and Stage 2 (nodes B0-B6) as 0-10-000-0011.

In each stage, we use 1 + 2 + . . . + (K_s − 1) = (1/2) K_s (K_s − 1) bits to encode the inter-node connections. The first bit represents the connection between (v_{s,1}, v_{s,2}); the following two bits represent the connections between (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}), etc. This process continues until the last K_s − 1 bits represent the connections between v_{s,1}, v_{s,2}, . . . , v_{s,K_s−1} and v_{s,K_s}. For 1 ≤ i < j ≤ K_s, if the bit corresponding to (v_{s,i}, v_{s,j}) is 1, there is an edge connecting v_{s,i} and v_{s,j}, i.e., v_{s,j} takes the output of v_{s,i} as a part of its element-wise summation, and vice versa. In summary, an S-stage network with K_s nodes at the s-th stage is encoded into a binary string of length L = (1/2) Σ_s K_s (K_s − 1). Figure 1 illustrates an example of encoding a two-stage network.

We note that the number of possible network structures (2^L) may be very large. In the CIFAR10 experiments (see Section 4.1), we have S = 3 and (K_1, K_2, K_3) = (3, 4, 5), therefore L = 19 and 2^L = 524,288. It is computationally intractable to enumerate all these structures and find the optimal one(s). To this end, we use the genetic algorithm to efficiently explore good candidates in this large space.

3.1.1 Technical Details

To make every binary string valid, we define two default nodes in each stage. The default input node, denoted as v_{s,0}, receives data from the previous stage, performs convolution, and sends its output to every node without a predecessor, e.g., v_{s,1}. The default output node, denoted as v_{s,K_s+1}, receives data from all nodes without a successor, e.g., v_{s,K_s}, sums them up, performs convolution, and sends its output to the pooling layer. Note that the connections between the ordinary nodes and the default nodes are not encoded.

There are two special cases. First, if an ordinary node v_{s,i} is isolated (i.e., it is not connected to any other ordinary node v_{s,j}, i ≠ j), then it is simply ignored, i.e., it is connected to neither the default input node nor the default output node (see the B2 node in Figure 1). This guarantees that a stage with more nodes can simulate all structures represented by a stage with fewer nodes. Second, if there are no connections at a stage, i.e., all bits in the binary string are 0, then the convolutional operation is performed only once, not twice (once by the default input node and once by the default output node).
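To make the encoding concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the helper names are ours). It enumerates the bit positions of a stage in the order stated above, reports L = 19 for the CIFAR10 setting (K_1, K_2, K_3) = (3, 4, 5), and decodes Stage 2 of Figure 1 from its code 0-10-000-0011.

```python
def stage_bits(k):
    """Number of encoding bits for a stage with k ordinary nodes: k*(k-1)/2."""
    return k * (k - 1) // 2

def bit_layout(k):
    """Ordered node pairs matching the paper's convention:
    (1,2); (1,3),(2,3); ...; (1,k),...,(k-1,k). Nodes are 1-based."""
    return [(i, j) for j in range(2, k + 1) for i in range(1, j)]

def decode_stage(bits, k):
    """Return the set of enabled edges (i, j) encoded by a stage's bit string."""
    pairs = bit_layout(k)
    assert len(bits) == stage_bits(k)
    return {pair for bit, pair in zip(bits, pairs) if bit == '1'}

if __name__ == '__main__':
    # CIFAR10 setting of Section 4.1: S = 3, (K1, K2, K3) = (3, 4, 5).
    lengths = [stage_bits(k) for k in (3, 4, 5)]
    print(lengths, sum(lengths), 2 ** sum(lengths))   # [3, 6, 10] 19 524288

    # Stage 2 of Figure 1 (K2 = 5), code "0-10-000-0011":
    print(sorted(decode_stage('0100000011', 5)))      # [(1, 3), (3, 5), (4, 5)]
    # Node 2 (B2 in Figure 1) appears in no edge, i.e., it is isolated.
```

Decoding recovers the edges B1-B3, B3-B5 and B4-B5 and shows that B2 is isolated, consistent with the first special case discussed above.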
3.1.2 Examples and Limitations

Many popular network structures can be represented using the proposed encoding scheme. Examples include VGGNet [32], ResNet [13], and a modified variant of DenseNet [15], which are illustrated in Figure 2.

Figure 2. The basic building blocks of VGGNet [32], ResNet [13] and a variant of DenseNet [15] can be encoded as binary strings defined in Section 3.1. Each block has K = 4 nodes; the corresponding codes are 1-01-001 (VGGNet), 1-01-101 (ResNet) and 1-11-111 (DenseNet variant).

Currently, the encoded structures only involve convolutional and pooling operations, which makes it impossible to generate some tricky network modules such as Maxout [10]. Also, the convolutional kernel size and the number of channels are fixed within each stage, which prevents the network from incorporating multi-scale information as in the Inception module [36]. We note that all automatically learned network structures have such limitations [45]. Our approach can be easily modified to include more types of layers and more flexible inter-layer connections. As shown in the experiments, we can achieve competitive recognition performance using merely these basic building blocks.

As shown in a recently published work using reinforcement learning to explore neural architectures [45], this type of method often requires heavy computation to traverse the huge solution space. We apply a strategy of learning network architectures on a small dataset, and transferring the top-ranked structures to large-scale visual recognition tasks.

3.2. Genetic Operations

The flowchart of the genetic process is shown in Algorithm 1. It starts with an initialized population of N randomized individuals. Then, we perform T rounds, or T generations, each of which consists of three operations, i.e., selection, mutation and crossover. The fitness function of each individual is evaluated via training-from-scratch on the reference dataset.

3.2.1 Initialization

We initialize a set of randomized models {M_{0,n}}, n = 1, . . . , N. Each model is a binary string with L bits, i.e., M_{0,n}: b_{0,n} ∈ {0, 1}^L. Each bit in each individual is independently sampled from a Bernoulli distribution: b^l_{0,n} ~ B(0.5), l = 1, 2, . . . , L. After this, we evaluate each individual (see Section 3.2.4) to obtain its fitness function value.

As we shall see in Section 4.1.2, different strategies of initialization do not impact the genetic performance too much. Even starting from a naive initialization (all individuals are all-zero strings), the genetic process can discover quite competitive structures via crossover and mutation.

3.2.2 Selection

The selection process is performed at the beginning of every generation. Before the t-th generation, the n-th individual M_{t−1,n} is assigned a fitness value, which is defined as the recognition rate r_{t−1,n} obtained in the previous generation or initialization. r_{t−1,n} directly impacts the probability that M_{t−1,n} survives the selection process.

We perform a Russian-roulette process to determine which individuals survive. Each individual in the next generation, M_{t,n}, is determined independently by a non-uniform sampling over the set {M_{t−1,n}}, n = 1, . . . , N. The probability of sampling M_{t−1,n} is proportional to r_{t−1,n} − r_{t−1,0}, where r_{t−1,0} = min_n {r_{t−1,n}} is the minimal fitness value in the previous generation. This means that the best individual has the largest probability of being selected, and the worst one is always eliminated. As the number of individuals N remains unchanged, each individual in the previous generation may be selected multiple times.

3.2.3 Mutation and Crossover

The mutation process of an individual M_{t,n} involves flipping each bit independently with a probability q_M. In practice, q_M is often small, e.g., 0.05, so that mutation is not likely to change one individual too much. This is to preserve the good properties of a surviving individual while providing an opportunity to try out new possibilities.

The crossover process involves changing two individuals simultaneously. Instead of considering each bit individually, the basic unit in crossover is a stage, which is motivated by the need to retain the local structures within each stage. Similar to mutation, each pair of corresponding stages is exchanged with a small probability q_C.

Both mutation and crossover are performed within an overall flowchart (see Algorithm 1). The probabilities of mutation and crossover for each individual (or pair) are p_M and p_C, respectively. Of course, there are many different ways of performing mutation and crossover. In our experiments, this simple choice leads to competitive performance.

3.2.4 Evaluation

After the above processes, each individual M_{t,n} is evaluated to obtain its fitness value. A reference dataset D is pre-defined, and we individually train each model M_{t,n} from scratch. If M_{t,n} has been evaluated previously, we simply evaluate it once again and compute the average accuracy over all its occurrences. This strategy, at least to some extent, alleviates the instability caused by the randomness in the training process.
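The selection, mutation and crossover operations described above can be summarized in the following Python sketch (ours, not the authors' implementation; the function names, the use of Python's random module, and the fallback to uniform weights when all fitness values are equal are our assumptions). Selection draws every slot of the next generation with probability proportional to r_{t−1,n} − r_{t−1,0}, mutation flips each bit independently with probability q_M, and crossover swaps whole stages between a pair of individuals with probability q_C per stage.

```python
import random

def select(population, fitness):
    """Russian-roulette selection: each slot of the next generation is drawn
    independently with weight (r_n - r_min), so the worst individual
    (weight 0) is always eliminated."""
    r_min = min(fitness)
    weights = [r - r_min for r in fitness]
    if sum(weights) == 0:              # all fitness values equal: our fallback
        weights = [1.0] * len(population)
    return random.choices(population, weights=weights, k=len(population))

def mutate(bits, q_m=0.05):
    """Flip each bit independently with probability q_m."""
    return ''.join(b if random.random() >= q_m else ('1' if b == '0' else '0')
                   for b in bits)

def crossover(a, b, stage_lengths, q_c=0.2):
    """Exchange corresponding stages (not single bits) with probability q_c."""
    a_out, b_out, pos = [], [], 0
    for length in stage_lengths:
        sa, sb = a[pos:pos + length], b[pos:pos + length]
        if random.random() < q_c:
            sa, sb = sb, sa
        a_out.append(sa)
        b_out.append(sb)
        pos += length
    return ''.join(a_out), ''.join(b_out)
```

For the CIFAR10 setting, stage_lengths would be [3, 6, 10], matching the 19-bit strings used in Section 4.1.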
Algorithm 1 The Genetic Process for Network Design
1: Input: the reference dataset D, the number of generations T, the number of individuals in each generation N, the mutation and crossover probabilities p_M and p_C, the mutation parameter q_M, and the crossover parameter q_C.
2: Initialization: generate a set of randomized individuals {M_{0,n}}, n = 1, . . . , N, and compute their recognition accuracies;
3: for t = 1, 2, . . . , T do
4:   Selection: produce a new generation {M'_{t,n}}, n = 1, . . . , N, with a Russian-roulette process on {M_{t−1,n}}, n = 1, . . . , N;
5:   Crossover: for each pair (M_{t,2n−1}, M_{t,2n}), n = 1, . . . , ⌊N/2⌋, perform crossover with probability p_C and parameter q_C;
6:   Mutation: for each non-crossover individual M_{t,n}, perform mutation with probability p_M and parameter q_M;
7:   Evaluation: compute the recognition accuracy of each new individual M_{t,n};
8: end for
9: Output: the set of individuals in the final generation {M_{T,n}}, n = 1, . . . , N, with their recognition accuracies.
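Read as code, Algorithm 1 amounts to the loop sketched below (again our own paraphrase in Python, reusing the select, mutate and crossover helpers sketched earlier; train_and_evaluate stands for the standalone training-from-scratch that returns recognition accuracy on the reference dataset and is assumed rather than defined here). The averaging of repeated evaluations described in Section 3.2.4 is omitted for brevity.

```python
import random

def genetic_process(N, T, L, stage_lengths, train_and_evaluate,
                    p_m=0.8, q_m=0.05, p_c=0.2, q_c=0.2):
    """Sketch of Algorithm 1; returns the final generation and its accuracies."""
    # Step 2: initialization, each bit sampled from Bernoulli(0.5).
    population = [''.join(random.choice('01') for _ in range(L))
                  for _ in range(N)]
    accuracy = [train_and_evaluate(bits) for bits in population]

    for t in range(1, T + 1):
        # Step 4: Russian-roulette selection on the previous generation.
        population = select(population, accuracy)

        # Step 5: stage-wise crossover on consecutive pairs, with probability p_c.
        crossed = [False] * N
        for i in range(0, N - 1, 2):
            if random.random() < p_c:
                population[i], population[i + 1] = crossover(
                    population[i], population[i + 1], stage_lengths, q_c)
                crossed[i] = crossed[i + 1] = True

        # Step 6: bit-wise mutation on each non-crossover individual, with prob. p_m.
        for i in range(N):
            if not crossed[i] and random.random() < p_m:
                population[i] = mutate(population[i], q_m)

        # Step 7: evaluation via standalone training-from-scratch.
        accuracy = [train_and_evaluate(bits) for bits in population]

    return population, accuracy
```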

4. Experiments

Like other methods for learning network structures [45], our genetic algorithm requires a very large amount of computational resources, which makes it intractable to evaluate directly on a large-scale dataset such as ILSVRC2012 [31]. Our strategy is to explore promising network structures on a small dataset, namely CIFAR10 [18], and then transfer these structures to the large-scale environment.

4.1. CIFAR10 Experiments

The CIFAR10 dataset [18] contains 10 basic categories of 32 × 32 RGB images. There are 50,000 images for training, and 10,000 images for testing. To avoid seeing the testing data in the genetic process, we leave out 10,000 images from the training set for validation.

4.1.1 Settings and Results

The basic configuration follows a revised version of LeNet [21], with the network structure abbreviated as: C3(P1)@8-MP3(S2)-C3(P1)@8-MP3(S2)-C3(P1)@16-MP3(S2)-FC32-D0.5-FC10. Here, C3(P1)@8 is a convolutional layer with a kernel size of 3 × 3, a default spatial stride of 1, a padding width of 1 and 8 kernels; MP3(S2) is a max-pooling layer with a kernel size of 3 and a spatial stride of 2; FC32 is a fully-connected layer with 32 outputs; and D0.5 is a Dropout layer with a drop ratio of 0.5. Please note that we significantly reduce the number of filters at each stage to accelerate the training process. We apply 120 training epochs with a learning rate of 10^−2, followed by 60 epochs with a learning rate of 10^−3, 40 epochs with a learning rate of 10^−4 and another 20 epochs with a learning rate of 10^−5.

We keep the fully-connected part of the above network unchanged, and set S = 3 and (K_1, K_2, K_3) = (3, 4, 5). Within each stage, the first convolutional layer remains the same as in the original LeNet, and the other convolutional layers take a kernel size of 3 × 3 and the same channel number. The length L of each binary string is 19, which means that there are 2^19 = 524,288 possible individuals. We create an initial population with N = 20 individuals, and run the genetic process for T = 50 rounds. The other parameters are set to p_M = 0.8, q_M = 0.05, p_C = 0.2 and q_C = 0.2. The mutation and crossover parameters q_M and q_C are set to be smaller because the strings become longer. The maximal number of explored individuals is 20 × (50 + 1) = 1,020 ≪ 524,288. Training each individual takes an average of 0.4 hours, and the entire genetic process takes about 17 GPU-days. 10 GPUs are used, and each of them trains 2 networks in each generation. As a result, we can finish the entire genetic process in 2 days. We note that [45] trained 10× more networks, each of which is much more complicated, resulting in at least 100× more computational overhead than our work.

We perform two individual genetic processes. The results of one of them are summarized in Table 1. With the genetic operations, we can find competitive network structures with improved recognition performance. Although over a short period the best individual may not be updated, the average and median accuracies generally get higher from generation to generation. This is very important, because it guarantees that the genetic algorithm improves the overall quality of the individuals. According to our diagnosis in Section 4.1.3, this facilitates the creation of strong individuals, since the quality of a new individual is positively correlated with the quality of its parent(s). After 50 generations, the recognition error rate of the best individual drops from 24.04% to 22.81%. We also visualize the best structures found by these two processes in Figure 5.
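For reference, the abbreviation above unrolls into the following PyTorch sketch (our own rendering, not the released model). The 32 × 32 RGB input size, Caffe-style ceil-mode pooling (giving spatial sizes 32 -> 16 -> 8 -> 4 and hence 16 × 4 × 4 = 256 features before FC32), the ReLU after FC32, and the omission of batch normalization in this baseline are our assumptions.

```python
import torch
import torch.nn as nn

class BaselineLeNet(nn.Module):
    """Sketch of C3(P1)@8-MP3(S2)-C3(P1)@8-MP3(S2)-C3(P1)@16-MP3(S2)-FC32-D0.5-FC10."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1),    # C3(P1)@8
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # MP3(S2)
            nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=1),    # C3(P1)@8
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # MP3(S2)
            nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1),   # C3(P1)@16
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # MP3(S2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 32),   # FC32 (256 inputs for 32x32 images)
            nn.ReLU(inplace=True),       # activation after FC32: our assumption
            nn.Dropout(p=0.5),           # D0.5
            nn.Linear(32, num_classes),  # FC10
        )

    def forward(self, x):
        return self.classifier(self.features(x))

if __name__ == '__main__':
    logits = BaselineLeNet()(torch.randn(2, 3, 32, 32))
    print(logits.shape)  # torch.Size([2, 10])
```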
Gen Max % Min % Avg % Med % Std-D Best Network Structure
00 75.96 71.81 74.39 74.53 0.91 0-01|0-01-111|0-11-010-0111
01 75.96 73.93 75.01 75.17 0.57 0-01|0-01-111|0-11-010-0111
02 75.96 73.95 75.32 75.48 0.57 0-01|0-01-111|0-11-010-0111
03 76.06 73.47 75.37 75.62 0.70 1-01|0-01-111|0-11-010-0111
05 76.24 72.60 75.32 75.65 0.89 1-01|0-01-111|0-11-010-0011
08 76.59 74.75 75.77 75.86 0.53 1-01|0-01-111|0-11-010-1011
10 76.72 73.92 75.68 75.80 0.88 1-01|0-01-110|0-11-111-0001
20 76.83 74.91 76.45 76.79 0.61 1-01|1-01-110|0-11-111-0001
30 76.95 74.38 76.42 76.53 0.46 1-01|0-01-100|0-11-111-0001
50 77.19 75.34 76.58 76.81 0.55 1-01|0-01-100|0-11-101-0001
Table 1. Recognition accuracy (%) on the CIFAR10 testing set. The zeroth generation is the initial population. We set S = 3 and
(K1 , K2 , K3 ) = (3, 4, 5). The best individual in each generation is also shown in binary codes.

4.1.2 Initialization Issues

We observe the impact of different initializations. For this, we start a naive population with N = 20 all-zero individuals, and use the same parameters for a complete genetic process. Results are shown in Figure 3. We find that, although the all-zero string corresponds to a very simple and less competitive network structure, the genetic algorithm is able to find strong individuals after several generations. This naive initialization reaches the initial performance of the randomized individuals within about 5 generations. After about 30 generations, there is almost no statistical difference between the two populations.

Figure 3. The average recognition accuracy over all individuals with respect to the generation number. The bars indicate the highest and lowest accuracies in the corresponding generation. (The plot compares random initialization with all-zero initialization; axes: classification accuracy vs. generation number.)

Figure 4. The relationship in accuracy between the parent(s) and the child(ren) (best viewed in color). A dot is bigger and closer to red if the recognition rate is higher; otherwise it is smaller and closer to blue. The dots on the horizontal axis are from mutation operations, while the others are from crossover operations.
4.1.3 Reasonability and Efficiency

We perform diagnostic experiments to verify the hypothesis that a better individual is more likely to generate a good individual via mutation or crossover. For this, we randomly select several occurrences of mutation and crossover in the genetic process, and observe the relationship between an individual and its parent(s). Figure 4 shows the results. We argue that the genetic operations tend to preserve the excellent "genes" from the parent(s), making it possible for the population to evolve over generations.

We also investigate the efficiency of the genetic algorithm. To this end, we randomly generate 20 × (50 + 1) = 1,020 network architectures and evaluate each of them. The best individual among these 1,020 candidates reports 76.94% accuracy, which is lower than the number (77.19%) obtained after the entire genetic process. From Table 1, we find that after 30 rounds, the genetic process is able to find an individual yielding 76.95% accuracy, which suggests that the genetic process is much more efficient than random search in the large solution space.

4.1.4 Parameters and Complexity

We note that the number of learnable weights of a network is related to the number of non-isolated nodes, since each of them contributes the same number of weights regardless of the number of lower-numbered nodes that are connected to it. In our experiments, isolation rarely happens, and thus all the individuals have a very similar number of parameters.
The number of 1-bits in the network encoding (inter-layer connections) is the main factor of network complexity. However, we point out that a network with more 1-bits does not necessarily dominate another with fewer 1-bits. As direct evidence, we investigate the individual with all bits set to 1, which leads to a network in which any two layers within the same stage are connected. This network produces a 76.84% recognition rate, which is significantly lower than the number (77.19%) reported in Table 1. Considering that the densely-connected network requires heavier computational overheads, we conclude that the structures learned by the genetic algorithm are more effective and efficient than dense connections.

4.1.5 Visualization

In Figure 5, we visualize the network structures learned from the two individual genetic processes. The structures learned by the genetic algorithm are somewhat different from the manually designed ones, although some manually designed local structures are observed, such as chain-shaped networks, multiple-path networks and highway networks. We emphasize that these two networks, though obtained by independent genetic processes, are somewhat similar, which demonstrates that the genetic process generally converges to similar network structures.

Figure 5. Two network structures (GeNet #1 and GeNet #2) learned from the two independent genetic processes on the CIFAR10 dataset (best viewed in color). These are three-stage networks (S = 3) with (K_1, K_2, K_3) = (3, 4, 5). Local patterns in the learned stages resemble chain-shaped networks (cf. AlexNet, VGGNet), multiple-path networks (cf. GoogLeNet) and highway networks (cf. residual networks); the annotated stage codes include 1-01, 0-01-100, 1-01-100, 0-11-101-0001 and 0-01-000-1011.

4.2. Small-Scale Transfer Experiments

We apply the networks learned from the CIFAR10 experiments to more small-scale datasets. We test three datasets, i.e., CIFAR10, CIFAR100 and SVHN. CIFAR100 is an extension of CIFAR10 which contains 100 categories at a finer level. It has the same numbers of training and testing images as CIFAR10, and these images are also uniformly distributed over the 100 categories.

SVHN (Street View House Numbers) [28] is a large collection of 32 × 32 RGB images, with 73,257 training samples, 26,032 testing samples, and 531,131 extra training samples. We preprocess the data as in previous work [28], i.e., selecting 400 samples per category from the training set as well as 200 samples per category from the extra set, using these 6,000 images for validation, and the remaining 598,388 images as training samples. We also use local contrast normalization (LCN) for preprocessing [10].

We evaluate the best network structure in each generation of the genetic process. We resume using a large number of filters at each stage, i.e., the three stages and the first fully-connected layer are equipped with 64, 128, 256 and 1024 filters, respectively. The training strategy, including the numbers of epochs and learning rates, remains the same as in the previous experiments.
We compare our results with some state-of-the-art methods in Table 2. First, we note that the recognition accuracy goes up through the genetic process, which verifies the transfer ability of the learned network structures. Although these accuracies are lower than some state-of-the-art candidates [42][16][15], we note that those networks are much deeper (e.g., 40-100 layers, compared to the 17-layer GeNet #1 and #2). For a fair comparison, we start from the 40-layer wide residual network [42]. We create a population of 10 identical individuals, and perform genetic operations for 5 rounds. Each of the initialized individuals is a variant of the 40-layer network with a few bits randomly reversed. Using 10 GPUs to train these networks simultaneously, this process takes around 10 days. As a result, we find a better individual, different from the original network structure, whose error rates on CIFAR10, CIFAR100 and SVHN are 5.39%, 25.12% and 1.71%, respectively. This provides an alternative strategy for generating better architectures from existing manually designed ones.

Method                     SVHN    CF10    CF100
Zeiler et al. [43]         2.80    15.13   42.51
Goodfellow et al. [10]     2.47    9.38    38.57
Lin et al. [26]            2.35    8.81    35.68
Lee et al. [24]            1.92    7.97    34.57
Liang et al. [25]          1.77    7.09    31.75
Lee et al. [23]            1.69    6.05    32.37
Zagoruyko et al. [42]      1.77    5.54    25.52
Xie et al. [39]            1.67    5.31    25.01
Huang et al. [16]          1.75    5.25    24.98
Huang et al. [15]          1.59    3.74    19.25
GeNet after G-00           2.25    8.18    31.46
GeNet after G-05           2.15    7.67    30.17
GeNet after G-20           2.05    7.36    29.63
GeNet #1 (G-50)            1.99    7.19    29.03
GeNet #2 (G-50)            1.97    7.10    29.05
GeNet from WRN [42]        1.71    5.39    25.12

Table 2. Comparison of the recognition error rate (%) with the state-of-the-arts. We apply data augmentation on all these datasets. GeNet #1 and GeNet #2 are the structures shown in Figure 5.

4.3. Large-Scale Transfer Experiments

We evaluate the learned network structures on the ILSVRC2012 classification task [31]. This is a subset of the ImageNet database [5] which contains 1,000 object categories. The training, validation and testing sets contain 1.3M, 50K and 150K images, respectively. The input images are of 224 × 224 × 3 pixels. We first apply the first two stages of VGGNet (4 convolutional layers and two pooling layers) to change the data dimension to 56 × 56 × 128. Then, we apply the two networks shown in Figure 5, and adjust the numbers of filters at the three stages to 256, 512 and 512 (following VGGNet), respectively. After these stages, we obtain a 7 × 7 × 512 data cube. We preserve the fully-connected layers of VGGNet with a dropout rate of 0.5. We apply the same training strategy as in VGGNet. Training each network takes around 20 GPU-days.

Results are summarized in Table 3. We can see that, in general, structures learned from a small dataset (CIFAR10) can be transferred to large-scale visual recognition (ILSVRC2012). Our model achieves better performance than VGGNet-16 and VGGNet-19, because the original chain-styled stages are replaced by the automatically learned structures, which are verified to be more effective.

Network           Top-1   Top-5   # Params
AlexNet [19]      42.6    19.6    62M
GoogLeNet [36]    34.2    12.9    13M
VGGNet-16 [32]    28.5    9.9     138M
VGGNet-19 [32]    28.7    9.9     144M
GeNet #1          28.12   9.95    156M
GeNet #2          27.87   9.74    156M

Table 3. Top-1 and top-5 recognition error rates (%) on the ILSVRC2012 dataset. For all competitors, we report the single-model performance without using any complicated data augmentation in testing. These numbers are copied from http://www.vlfeat.org/matconvnet/pretrained/. GeNet #1 and GeNet #2 are the structures shown in Figure 5.

Finally, we evaluate the transfer ability of the GeNets on the Caltech256 dataset [12]. We use VGGNet-16, VGGNet-19 and the GeNets to extract 4,096-dimensional features from the first fully-connected layer, perform ReLU activation [27] followed by square-root normalization and ℓ2 normalization, and feed the feature vectors to a linear SVM classifier [7]. With 60 training samples per category, the classification accuracies with VGGNet-16 and VGGNet-19 features are 82.69% and 83.51%, respectively. GeNets #1 and #2 produce 83.59% and 83.78% accuracies, which are slightly higher. This verifies that the benefits of the GeNets are generally transferrable to other visual recognition tasks.

5. Conclusions

This paper applies the genetic algorithm to automatically learning the structure of deep convolutional neural networks. Our main idea is to use an encoding scheme to represent each network structure as a fixed-length binary string, and to evaluate each generated individual via a standalone training process on a reference dataset. Based on this framework, we design some genetic operations, such as mutation and crossover, to explore the search space efficiently. We perform the genetic algorithm on a small reference dataset (CIFAR10), and find that the generated structures transfer well to the ILSVRC2012 dataset and to extracting deep features for other visual recognition tasks.

Despite the interesting results we have obtained, our algorithm suffers from several drawbacks. First, a large fraction of network structures are still unexplored, including some novel modules like Maxout [10], channel concatenation [36][15], and introducing multi-scale processing into convolutions [36]. In addition, the recurrent structure is also worth exploring [45]. Second, in the current work, the genetic algorithm is only used to explore the network structure, whereas the network training process is performed separately. It would be very interesting to incorporate the genetic algorithm to train the network structure and weights simultaneously. These directions are left for future work.

Acknowledgements. This work was supported by NSF CCF-1231216 and ONR N00014-15-1-2356. We thank Dr. Wei Shen, Yuyin Zhou and Siyuan Qiao for discussions.
References

[1] J. Bayer, D. Wierstra, J. Togelius, and J. Schmidhuber. Evolving Memory Cell Structures for Sequence Learning. International Conference on Artificial Neural Networks, 2009.
[2] J. Beasley and P. Chu. A Genetic Algorithm for the Set Covering Problem. European Journal of Operational Research, 94(2):392–404, 1996.
[3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual Categorization with Bags of Keypoints. Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision, 1(22):1–2, 2004.
[4] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition, 2009.
[6] S. Ding, H. Li, C. Su, J. Yu, and F. Jin. Evolutionary Artificial Neural Networks: A Review. Artificial Intelligence Review, 39(3):251–260, 2013.
[7] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[8] L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
[9] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[10] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. International Conference on Machine Learning, 2013.
[11] J. Grefenstette, R. Gopal, B. Rosmaita, and D. Van Gucht. Genetic Algorithms for the Traveling Salesman Problem. International Conference on Genetic Algorithms and their Applications, 1985.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report CNS-TR-2007-001, 2007.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition, 2016.
[14] C. Houck, J. Joines, and M. Kay. A Genetic Algorithm for Function Optimization: A Matlab Implementation. Technical Report, North Carolina State University, 2009.
[15] G. Huang, Z. Liu, and K. Weinberger. Densely Connected Convolutional Networks. Computer Vision and Pattern Recognition, 2017.
[16] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep Networks with Stochastic Depth. European Conference on Computer Vision, 2016.
[17] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning, 2015.
[18] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto, 1(4):7, 2009.
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.
[20] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Computer Vision and Pattern Recognition, 2006.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Y. LeCun, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems, 1990.
[23] C. Lee, P. Gallagher, and Z. Tu. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. International Conference on Artificial Intelligence and Statistics, 2016.
[24] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. International Conference on Artificial Intelligence and Statistics, 2015.
[25] M. Liang and X. Hu. Recurrent Convolutional Neural Network for Object Recognition. Computer Vision and Pattern Recognition, 2015.
[26] M. Lin, Q. Chen, and S. Yan. Network in Network. International Conference on Learning Representations, 2014.
[27] V. Nair and G. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. International Conference on Machine Learning, 2010.
[28] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[29] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. European Conference on Computer Vision, 2010.
[30] C. Reeves. A Genetic Algorithm for Flowshop Sequencing. Computers & Operations Research, 22(1):5–13, 1995.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, pages 1–42, 2015.
[32] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 2014.
[33] L. Snyder and M. Daskin. A Random-Key Genetic Algorithm for the Generalized Traveling Salesman Problem. European Journal of Operational Research, 174(1):38–53, 2006.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[35] K. Stanley and R. Miikkulainen. Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2):99–127, 2002.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. Computer Vision and Pattern Recognition, 2015.
[37] N. Ulder, E. Aarts, H. Bandelt, P. van Laarhoven, and E. Pesch. Genetic Local Search Algorithms for the Traveling Salesman Problem. International Conference on Parallel Problem Solving from Nature, 1990.
[38] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-Constrained Linear Coding for Image Classification. Computer Vision and Pattern Recognition, 2010.
[39] L. Xie, Q. Tian, J. Flynn, J. Wang, and A. Yuille. Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons. European Conference on Computer Vision, 2016.
[40] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. DisturbLabel: Regularizing CNN on the Loss Layer. Computer Vision and Pattern Recognition, 2016.
[41] X. Yao. Evolving Artificial Neural Networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
[42] S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv preprint arXiv:1605.07146, 2016.
[43] M. Zeiler and R. Fergus. Stochastic Pooling for Regularization of Deep Convolutional Neural Networks. International Conference on Learning Representations, 2013.
[44] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition Using Places Database. Advances in Neural Information Processing Systems, 2014.
[45] B. Zoph and Q. Le. Neural Architecture Search with Reinforcement Learning. International Conference on Learning Representations, 2017.
