Abstract

Differentiable Architecture Search (DARTS) has achieved a rapid search for excellent architectures by optimizing architecture parameters through gradient descent. However, this ...

(Figure 1: comparison of EG-NAS with NSGANet, NSGANetV1, EPCNAS, EAEPSO, and DARTS (1st); the plot annotations include accuracies of 97.45 and 97.50 and a "4x faster" marker.)
... different search directions to prevent the dilemma of falling into a local optimum. However, in the field of Neural Architecture Search (NAS), despite the proposal of numerous Evolutionary Algorithm (EA) methods that reduce search time and achieve some results (Lu et al. 2018; Huang et al. 2022), these methods have not met expectations in effectively balancing the trade-off between search cost and performance. Therefore, striking a balance between premature convergence to local optima and search cost has become a major challenge for researchers.
In this paper, we propose a simple and efficient compound search algorithm, called EG-NAS, to address the above issues by combining an evolution strategy with gradient descent for neural architecture search. We adopt an improved evolution strategy to explore more diverse search directions and tune the architecture parameters, rather than relying exclusively on gradient descent (GD) to update both the architecture and network parameters. This approach allows us to leverage the strengths of both gradient descent and evolution strategies, namely efficiency and global searchability. Furthermore, we redesign the fitness function to emphasize individual similarity rather than solely focusing on individual performance, which enhances the diversity of search directions. When the current individual outperforms the old one, reducing the similarity between individuals fosters a more diverse evolution of the search direction. Conversely, increasing the similarity between individuals allows timely adjustment of the evolutionary direction back toward the previously promising search direction. By adopting this evolution strategy, our approach ensures that the output reflects the exploration of various search directions, alleviating the risk of being trapped in local optima, while guaranteeing that the new search directions possess desirable performance. Finally, we iteratively update the architecture parameters with the new search direction obtained by the evolution strategy while optimizing the network parameters using gradient descent. This combination allows our approach to efficiently explore diverse architectures with excellent performance.
To demonstrate the effectiveness of the proposed EG-NAS, we conducted extensive experiments on different datasets and search spaces, showing significant competitive results compared with other methods. On CIFAR-10, our method achieves remarkable results, requiring only 0.1 GPU-Days for the search (see Fig. 1). The discovered architectures not only achieve 97.47% accuracy on CIFAR-10 but also reach 74.4% top-1 accuracy when transferred to ImageNet. Moreover, we directly performed the search and evaluation on ImageNet, achieving an outstanding 75.1% top-1 accuracy with a search cost of just 1.2 GPU-Days on two RTX 4090 GPUs, the fastest search among state-of-the-art methods.
Related Work

The search strategy based on Evolutionary Algorithms (EA) is one of the most common approaches; it relies on the global search capability of EA to prevent premature convergence to local optima and to effectively handle large-scale NAS tasks with a discrete search space. The general process of an EA-based method involves defining the search space, initializing a population of network architectures, evaluating their fitness based on performance, selecting superior architectures, applying crossover and mutation, updating the population, and repeating the process for several iterations (a minimal sketch of this loop is given after this paragraph). However, the challenge of balancing performance and search cost has been a persistent issue for EA-based methods (Cui et al. 2018; Xue et al. 2021). For example, (Liu et al. 2018) and (Real et al. 2018) proposed Hierarchical Evolution and AmoebaNet, which achieved high accuracies of 96.25% and 97.45% on the CIFAR-10 dataset, respectively. However, the computational costs of these methods were significantly high, at 300 and 3150 GPU-Days, respectively, far exceeding the resources available to most researchers. To reduce the excessive search cost, EA experts (Lu et al. 2018; Huang et al. 2022; Yuan et al. 2023) have incorporated efficient EA algorithms, such as the Non-dominated Sorting Genetic Algorithm and Particle Swarm Optimization (PSO), into NAS tasks and achieved some notable results. Nevertheless, an excessive focus on efficiency sometimes prevents these algorithms from fully leveraging their strengths, thereby dramatically degrading the performance of the obtained architectures.
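The generic EA-based NAS loop described above can be summarized by the following minimal Python sketch. It is purely illustrative: the helpers `sample_random_architecture`, `evaluate_accuracy`, `crossover`, and `mutate` are hypothetical placeholders standing in for whatever encoding and evaluation a concrete method uses, not functions from any cited work.

```python
import random

def ea_nas_search(sample_random_architecture, evaluate_accuracy,
                  crossover, mutate,
                  population_size=20, generations=10, mutation_rate=0.2):
    """Generic EA-based NAS loop: initialize, evaluate, select, vary, repeat."""
    # Initialize a population of architectures drawn from the search space.
    population = [sample_random_architecture() for _ in range(population_size)]

    for _ in range(generations):
        # Evaluate fitness, e.g. the validation accuracy of each candidate.
        fitness = [evaluate_accuracy(arch) for arch in population]

        # Keep the better half as parents (simple truncation selection).
        ranked = sorted(zip(fitness, population), key=lambda p: p[0], reverse=True)
        parents = [arch for _, arch in ranked[: population_size // 2]]

        # Produce offspring via crossover and mutation.
        offspring = []
        while len(parents) + len(offspring) < population_size:
            p1, p2 = random.sample(parents, 2)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)

        # Update the population and repeat.
        population = parents + offspring

    # Return the best architecture found in the final population.
    return max(population, key=evaluate_accuracy)
```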
The high search cost barrier in NAS was first broken by Differentiable Architecture Search (DARTS), proposed by (Liu, Simonyan, and Yang 2018). DARTS applies a continuous relaxation that transforms discrete operation choices into continuously differentiable weights, enabling the bi-level optimization objective of architecture search to be handled efficiently through gradient descent. To further reduce the memory overhead and improve the search speed, (Xu et al. 2019) proposed PC-DARTS, which randomly selects only a subset of the channels for computation in the search phase. Although the memory overhead is alleviated, several practitioners (Xie et al. 2018; Chen and Hsieh 2020; Wang et al. 2021a) have raised new questions about DARTS and provided corresponding solutions. For example, (Hu et al. 2020b) proposed an angle-based metric that simplifies the original search space by eliminating unpromising candidates, thereby reducing the difficulty existing NAS methods face in searching for high-quality architectures; (Wang et al. 2021a) proposed a node-normalization and decorrelation-discretization strategy to improve generality and stability; (Xiao et al. 2022b) introduced the Shapley value to evaluate the importance of operations; and (Chen et al. 2020) adopted an incremental learning scheme to bridge the gap between search and evaluation. However, the critical issue highlighted by researchers (Chen et al. 2019; Zela et al. 2019) remains: gradient-based methods suffer from premature convergence to local optima, which significantly compromises the performance of the architectures obtained during the search stage.

Methodology

Preliminaries
Figure 2: Illustration of the EG-NAS main framework. The improved evolution strategy (ES) outputs the best of the evolved architectures as a direction to optimize α, while gradient descent (GD) updates ω. After α is updated, the new architecture is composed and evaluated (EVA is shorthand for evaluation).
Differentiable Architecture Search (DARTS). DARTS (Liu, Simonyan, and Yang 2018) is a gradient-based method widely used in NAS that leverages continuous relaxation and weight-sharing techniques to make the architecture search process differentiable, significantly reducing its computational cost. Following previous research (Pham et al. 2018; Bender et al. 2018), DARTS searches for optimal cell architectures and constructs a supernet by repeatedly stacking normal and reduction cells. Note that, unlike normal cells, reduction cells are located at 1/3 and 2/3 of the total depth of the network, and all of their operations adjacent to the input nodes use a stride of 2. During the search process, each cell is regarded as a directed acyclic graph (DAG) with N nodes and E edges, where each node x^{(i)} is represented by a feature map and each edge (i, j) represents an operation o^{(i,j)} applied to the information flow between nodes. In DARTS, all candidate operations are continuously relaxed to enable a gradient-based search. Specifically, each intermediate node is computed as a softmax mixture of the candidate operations:

\bar{o}^{(i,j)}(x^{(i)}) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(x^{(i)}),   (1)

where i < j, all candidate operations are stored in \mathcal{O}, and \alpha_o^{(i,j)} represents the mixing weight of operation o on edge (i, j) in the supernet.
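As a concrete illustration of Eq. 1, the PyTorch-style sketch below shows how one mixed edge combines candidate operations with softmax-weighted architecture parameters, together with the argmax discretization applied after the search (described at the end of this subsection). It is a simplified sketch under assumed shapes, not the original DARTS implementation; `candidate_ops` is a hypothetical list of operation modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the cell: softmax-weighted sum of candidate operations (Eq. 1)."""

    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture parameter alpha_o per candidate operation on this edge.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)  # continuous relaxation weights
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        """After the search, keep only the operation with the largest alpha."""
        return self.ops[int(torch.argmax(self.alpha))]

# Example: an edge mixing three simple candidate operations.
ops = [nn.Identity(),
       nn.Conv2d(16, 16, 3, padding=1),
       nn.AvgPool2d(3, stride=1, padding=1)]
edge = MixedOp(ops)
out = edge(torch.randn(2, 16, 8, 8))  # softmax mixture of the three outputs
best_op = edge.discretize()           # argmax_o alpha_o, as in DARTS
```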
Under this relaxation, the architecture search optimizes the network weights ω and the architecture parameters α in a differentiable manner, establishing a bi-level optimization model for NAS:

\min_{\alpha} F(\omega^*(\alpha), \alpha) = \mathcal{L}_{val}(\omega^*(\alpha), \alpha) \quad \text{s.t.} \quad \omega^*(\alpha) = \arg\min_{\omega} \mathcal{L}_{train}(\omega, \alpha),   (2)

where the optimization variables α and ω are updated via gradient descent. After the search ends, the final architecture is composed of the operation with the largest architecture parameter on each edge, i.e., o^{(i,j)} = \arg\max_{o' \in \mathcal{O}} \alpha_{o'}^{(i,j)}.
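In practice, the bi-level problem in Eq. 2 is usually approximated by alternating gradient steps on ω (training loss) and α (validation loss). The sketch below shows this first-order alternation with hypothetical `supernet`, `train_batch`, and `val_batch` objects; it is the standard DARTS-style approximation rather than an exact solution of the bi-level problem.

```python
import torch
import torch.nn.functional as F

def darts_first_order_step(supernet, w_optimizer, alpha_optimizer,
                           train_batch, val_batch):
    """One alternating step approximating the bi-level problem in Eq. 2.

    w_optimizer holds only the network weights omega,
    alpha_optimizer holds only the architecture parameters alpha.
    """
    # Inner problem: update omega on the training loss.
    x_tr, y_tr = train_batch
    supernet.zero_grad()
    F.cross_entropy(supernet(x_tr), y_tr).backward()
    w_optimizer.step()

    # Outer problem: update alpha on the validation loss.
    x_val, y_val = val_batch
    supernet.zero_grad()
    F.cross_entropy(supernet(x_val), y_val).backward()
    alpha_optimizer.step()
```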
Explore Search Directions with ES. For gradient-based methods, the most common step in NAS is to compute the gradient information and use it as the search direction, iterating until convergence. Although this approach is simple and efficient, the search direction is limited by the gradient information, which makes it monotonous and prone to falling into local optima. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen 2016; Loshchilov and Hutter 2016) is one of the most appreciated evolutionary algorithms for continuous black-box problems. Therefore, we exploit the capabilities of CMA-ES, namely efficient convergence and global search, to explore different search directions and avoid getting trapped in local optima.

We begin by sampling N architectures, denoted x_n for n = 1, 2, ..., N, from a Gaussian distribution with α as the mean vector m^0 and the identity matrix I as the covariance matrix C^0; in other words, x_n = α + σy with y ~ N(0, I) for n = 1, 2, ..., N. Subsequently, each sampled architecture x_n, for n = 1, 2, ..., N, is used to initialize one of N CMA-ES based searches, serving as the initial mean vector m^0 of the n-th ES for search-direction exploration. More specifically, the initial search population of the n-th ES is sampled as follows:

z_i^t = m^t + \sigma y_i, \quad y_i \sim \mathcal{N}(0, C^t),   (3)

where t, starting from 0, is the iteration index, C^0 = I, i = 1, ..., λ, and σ is the step size. After that, the mean vector m^{t+1} and the covariance matrix C^{t+1} are computed by Eq. 4 and Eq. 5, which generate the new individuals z_i^{t+1}:

m^{t+1} = \sum_{i=1}^{\lfloor \lambda/2 \rfloor} \beta_i z_i^t,   (4)

where \beta_i denotes the fitness weight assigned to each individual z_i^t, and

C^{t+1} = (1 - c_1 - c_{\lfloor \lambda/2 \rfloor}) C^t + c_1 p p^T + c_{\lfloor \lambda/2 \rfloor} \sum_{i=1}^{\lfloor \lambda/2 \rfloor} \beta_i (z_i^t - m^{t+1})(z_i^t - m^{t+1})^T,   (5)
where p is the evolutionary path. During the search, each individual is evaluated by the fitness function f(·), and the individuals are sorted by their fitness values. The best ⌊λ/2⌋ solutions are then utilized to update the search parameters, such as the covariance matrix C^{t+1} and the two learning rates c_1 and c_{⌊λ/2⌋}, for the next iteration of CMA-ES. Considering that the optimal individual may not necessarily be present in the final population, we not only update these variables to optimize the search direction but also maintain a record of the best individual z_i^* from each population by storing them in the set P according to their fitness values. Ultimately, in the final iteration phase, we select the individual z_best^* with the highest fitness value from the set P as the output, representing the optimal search direction for the n-th sampling.
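The per-sample search-direction exploration (Eqs. 3-5, together with the record of best individuals in P) can be sketched as follows. This is a deliberately simplified, NumPy-based version of the procedure: it uses rank-based weights and a rank-mu style covariance update but omits the evolution path p, step-size adaptation, and other standard CMA-ES machinery. `fitness_fn` stands for the compound fitness f(·) defined in the next subsection, and larger fitness is assumed to be better; it is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def explore_direction(alpha, fitness_fn, lam=50, iters=10, sigma=0.1, rng=None):
    """Simplified CMA-ES-style search for one sampled starting point x_n."""
    rng = rng or np.random.default_rng()
    d = alpha.size
    m = alpha + sigma * rng.standard_normal(d)   # x_n = alpha + sigma * y (initial mean)
    C = np.eye(d)                                # C^0 = I
    mu = lam // 2
    beta = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    beta /= beta.sum()                           # rank-based fitness weights
    best = (None, -np.inf)                       # record of the best individual (set P)

    for _ in range(iters):
        # Eq. 3: sample lambda individuals z_i = m + sigma * y_i, y_i ~ N(0, C).
        z = m + sigma * rng.multivariate_normal(np.zeros(d), C, size=lam)
        f = np.array([fitness_fn(zi) for zi in z])
        order = np.argsort(-f)[:mu]              # best floor(lambda/2) individuals
        if f[order[0]] > best[1]:
            best = (z[order[0]].copy(), f[order[0]])

        # Eq. 4: new mean as a weighted sum of the selected individuals.
        m_new = (beta[:, None] * z[order]).sum(axis=0)
        # Eq. 5 (rank-mu part only): refresh the covariance around the new mean.
        diff = z[order] - m_new
        C = 0.8 * C + 0.2 * (beta[:, None, None] *
                             diff[:, :, None] * diff[:, None, :]).sum(axis=0)
        m = m_new

    return best[0]   # z*_best: the optimal search direction for this sampling
```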
Compound Fitness Function

The proposed fitness function, shown in Eq. 7, supplements the usual objective with a measure of individual similarity (or diversity). In other words, the fitness function f(·) is composed of the cross-entropy loss L1 and the cosine similarity L2, as shown in Eq. 6:

L_1(y_i, \hat{y}_i) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i), \qquad L_2(\alpha^t, z_i^{t+1}) = \frac{1}{2}\left[\frac{\alpha^t \cdot z_i^{t+1}}{\lVert \alpha^t \rVert \, \lVert z_i^{t+1} \rVert} + 1\right],   (6)

where C is the number of classes and L1 is the cross-entropy loss used to measure the performance of architecture z_i^{t+1}. In L1, y_i and \hat{y}_i denote the true labels and the labels predicted by architecture z_i^{t+1}, respectively. The essence of L2 is the cosine similarity, which measures the similarity between the current architecture z_i^{t+1} and the original architecture α^t. Note that, for convenience in subsequent calculations, we apply a simple rescaling so that L2 ∈ [0, 1]. When the L2 value tends to 0, the two architectures are regarded as similar; otherwise, they are regarded as dissimilar.

f(\alpha^t, z_i^{t+1}) = \begin{cases} \zeta L_1 - \eta L_2, & \text{if } Acc(\alpha^t) > Acc(z_i^{t+1}) \\ \zeta L_1 + \eta L_2, & \text{otherwise} \end{cases}   (7)

where ζ and η are the weight coefficients of L1 and L2, respectively. When the performance of the currently generated individual is inferior to that of the original individual, we drive the L2 term towards 0, directing the new individual towards more promising performance. Conversely, when its performance is better, we drive the L2 term towards 1, encouraging the new individual to evolve towards greater architectural diversity. Assisted by this composite fitness function f(·), the evolution strategy generates more diverse individuals and evaluates a broader range of search directions, leading to a more effective balance between performance and diversity. Additionally, exploring and evaluating more diverse search directions helps alleviate premature convergence to local optima and yields superior architectures. To confirm the effectiveness of the compound fitness function f(·), we have conducted relevant ablation studies, and the corresponding comparative results are presented in Fig. 3; a detailed analysis of the results can be found in the ablation studies.
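A direct transcription of Eqs. 6 and 7 into Python might look as follows. The accuracy and loss oracles (`accuracy_of` and `cross_entropy_of`, which would evaluate an architecture parameterization on validation data) are hypothetical placeholders, as are the default values of ζ and η; only the similarity term and the sign switch follow the equations above.

```python
import numpy as np

def cosine_similarity_term(alpha_t, z_next):
    """L2 in Eq. 6: cosine similarity rescaled into [0, 1]."""
    cos = np.dot(alpha_t, z_next) / (np.linalg.norm(alpha_t) * np.linalg.norm(z_next))
    return 0.5 * (cos + 1.0)

def compound_fitness(alpha_t, z_next, accuracy_of, cross_entropy_of,
                     zeta=1.0, eta=0.8):
    """f(alpha^t, z_i^{t+1}) in Eq. 7: performance term plus/minus similarity term."""
    l1 = cross_entropy_of(z_next)                 # L1: cross-entropy loss of the candidate
    l2 = cosine_similarity_term(alpha_t, z_next)  # L2: similarity to the current alpha
    if accuracy_of(alpha_t) > accuracy_of(z_next):
        # The candidate performs worse than alpha^t: subtract the similarity term.
        return zeta * l1 - eta * l2
    # The candidate performs better: add the similarity term.
    return zeta * l1 + eta * l2
```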
Evolution Strategy with Gradient Descent-based Architecture Search

To better guide the architecture search process, we propose exploring various search directions with the evolution strategy and selecting the optimal ones based on task performance. In contrast to optimizing both the architecture and network parameters with gradient-based methods, our approach updates only the network parameters through gradient descent, while the architecture parameters are updated using the evolution strategy, which samples excellent search directions for guidance. To illustrate the approach more visually, we provide the general steps of EG-NAS in Fig. 2. The update of α is summarized in Algorithm 1 and in Eq. 8:

\alpha^t = \alpha^{t-1} + \xi \nabla_{\alpha} \mathcal{L}_{val}(\omega(\alpha), \alpha) \;\;\longrightarrow\;\; \alpha^t = \alpha^{t-1} + \xi s^t,   (8)

where s^t represents the direction of the α update at the t-th step of the optimization process, \nabla_{\alpha} \mathcal{L}_{val}(\omega(\alpha), \alpha) represents the search direction based on gradient information, and ξ is the step size. To ensure that the search direction s used to update α is closely related to task performance, we introduce task-performance-based stabilization during the optimization:

s^t = \arg\max_{x_n^*,\; n = 1, 2, \ldots, N} Acc_{val}(\omega^{t-1}, x_n^*),   (9)

where Acc_val denotes the validation accuracy, x_n^* is the search direction with the best fitness value obtained by the n-th evolution-strategy sampling, and ω^{t-1} are the network parameters at step (t-1). After completing the search, we adopt the same approach as DARTS to obtain the final architecture, where each edge keeps the operation with the maximum weight. Compared with DARTS, which updates the architecture parameters only once per round, our method optimizes the architecture parameters in every sampling and evolution step. As a result, in terms of architecture parameter optimization, EG-NAS has a time complexity N × λ times that of traditional gradient descent methods.
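Putting the pieces together, one round of EG-NAS-style optimization alternates a gradient step on the network weights ω with an ES-derived update of the architecture parameters α (Eqs. 8 and 9). The sketch below assumes a PyTorch supernet whose architecture parameters can be read and written as a flat vector (`get_alpha` and `set_alpha`), the `explore_direction` routine from the earlier sketch, and a hypothetical `validation_accuracy(model, alpha)` oracle; it illustrates the control flow rather than reproducing the authors' implementation.

```python
import numpy as np
import torch

def eg_nas_round(model, w_optimizer, train_batch, fitness_fn, validation_accuracy,
                 explore_direction, get_alpha, set_alpha, n_samples=4, xi=0.01):
    """One EG-NAS round: GD step on omega, then ES-based step on alpha."""
    # 1. Update the network weights omega by ordinary gradient descent (training loss).
    x, y = train_batch
    w_optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    w_optimizer.step()

    # 2. Explore N candidate search directions with the (simplified) evolution strategy.
    alpha = get_alpha(model)   # current architecture parameters as a numpy vector
    candidates = [explore_direction(alpha, fitness_fn) for _ in range(n_samples)]

    # 3. Eq. 9: keep the candidate direction with the best validation accuracy.
    accs = [validation_accuracy(model, c) for c in candidates]
    s_t = candidates[int(np.argmax(accs))]

    # 4. Eq. 8: alpha^t = alpha^{t-1} + xi * s^t, with s^t from the ES instead of the gradient.
    set_alpha(model, alpha + xi * s_t)
```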
Experiments and Analysis

In this part, we conduct extensive experiments to evaluate our approach, EG-NAS, on the DARTS search space with CIFAR-10, CIFAR-100, and ImageNet for image classification, as well as on the NAS-Bench-201 search space with CIFAR-10, CIFAR-100, and ImageNet-16-120. All experiments were conducted on a single Nvidia RTX 3090, except for the ImageNet experiments, which were conducted on two RTX 4090 GPUs. The ImageNet dataset mentioned in this paper refers to ImageNet-1K (the ILSVRC2012 dataset), except for ImageNet-16-120 in NAS-Bench-201.

Datasets and Implementation Details

The CIFAR-10 dataset contains 60,000 color images from 10 different categories. The CIFAR-100 dataset consists of 100 different categories, including some finer-grained classes. The ...
Table 1: Comparison of EG-NAS with state-of-the-art image classifiers on CIFAR-10 and CIFAR-100. The results of EG-NAS were obtained by repeated experiments with 4 random seeds.
...100, EG-NAS still surpasses most of the NAS methods. In particular, it achieves exceptional performance with 46.13% ...
Table 3: Performance comparison on the NAS-Bench-201 benchmark. Note that EG-NAS searched only on the CIFAR-10 dataset but achieved competitive results on all of CIFAR-10, CIFAR-100, and ImageNet16-120. The average values are obtained over four independent search runs.
Influence of coefficients ζ and η of L1 and L2. To explore the influence of different values of the coefficients ζ and η in the fitness function f(·) of the ES, we assign L1 and L2 different values of ζ and η. The resulting validation accuracy and model parameter cost are reported in Table 4. As the number of updates increases, the L2 coefficient η in the fitness function f(·) plays a critical role in the evolution strategy. When η is set too large, the ES tends to prioritize exploring diverse search directions, leading to excessive dispersion and hindered convergence. On the contrary, if η is too small, the ES focuses predominantly on optimizing performance, resulting in overly monotonous search directions. As the L1 coefficient ζ increases, the evolution strategy discovers more stable and superior optimization directions, with ζ = 1.0 achieving the best performance.

Table 4: Validation accuracy and model parameters under different coefficients ζ and η.

        η = 1.0                      η = 0.8
ζ       valid acc (%)  params (M)    valid acc (%)  params (M)
1.0     84.36          3.23          85.42          2.96
0.8     85.21          2.96          84.89          3.56
0.4     84.36          3.23          83.80          3.85
0.2     83.61          3.10          83.13          3.57

        η = 0.4                      η = 0.2
ζ       valid acc (%)  params (M)    valid acc (%)  params (M)
1.0     85.53          3.71          85.40          2.85
0.8     84.82          3.97          83.63          3.41
0.4     82.81          3.00          81.25          2.85
0.2     81.54          2.96          80.68          3.66

Effect of Population Size λ on EG-NAS. In this part, a series of experiments is conducted on the CIFAR-10 dataset to investigate the impact of the population size λ on EG-NAS. Based on the observations in Fig. 4, when the population size λ is set too small, it significantly constrains the diversity of the population, limiting the discovery of high-performance architectures. As the population size λ increases, more individuals are available for selection, which increases the diversity within the population and makes it easier to discover high-quality network architectures. However, as λ continues to grow, the performance of the discovered architectures sees only marginal improvements, due to the constraints imposed by the cell search space, while the search cost escalates significantly. To strike an effective balance between search cost and architecture performance, we ultimately opted for a population size of λ = 50.

Conclusion and Future Work

In this paper, we propose a simple yet efficient compound approach based on gradient descent and an improved evolution strategy, termed EG-NAS, for neural architecture search, which alleviates the dilemma of premature convergence to local optima caused by gradient descent. Meanwhile, by reducing the similarity between individuals in the evolution strategy, we can effectively explore various search directions and avoid being trapped in local optima. Finally, we evaluate the selected search directions with better fitness values using validation accuracy, so as to more accurately determine the relationship between search directions and task performance. In the future, we aim to further investigate the effective integration of different algorithms and to enhance the stability of hybrid algorithms for a wide range of tasks.