Keywords: Swarm intelligence, PSO, Optimization, CNN, Hyper-parameter

Abstract

Swarm intelligence algorithms have been widely adopted in solving many highly nonlinear, multimodal problems and have achieved tremendous successes. However, their application to deep neural networks is largely unexplored. On the other hand, deep neural networks, especially the convolutional neural network (CNN), have recently achieved breakthroughs in tackling many intractable problems; nevertheless, their performance depends heavily on the chosen values of their hyper-parameters, whose fine-tuning is both labor-intensive and time-consuming. In this paper, we propose a novel particle swarm optimization (PSO) variant, cPSO-CNN, for optimizing the hyper-parameter configuration of architecture-determined CNNs. cPSO-CNN utilizes a confidence function defined by a compound normal distribution to model experts' knowledge of CNN hyper-parameter fine-tuning so as to enhance the canonical PSO's exploration capability. cPSO-CNN also redefines the scalar acceleration coefficients of PSO as vectors to better adapt to the varying ranges of CNN hyper-parameters. Besides, a linear prediction model is adopted for fast ranking of the PSO particles to reduce the cost of fitness function evaluation. The experimental results demonstrate that cPSO-CNN performs competitively when compared with several reported algorithms in terms of both CNN hyper-parameter superiority and overall computation cost.
1. Introduction

Artificial Swarm Intelligence (SI) algorithms, such as the genetic algorithm [1], particle swarm optimization [2] and the fireworks algorithm [3], have been showing their powerful capabilities for optimizing difficult real-world problems ever since their advent. Relying on a population of simple agents that gather guiding information from neighbors and the environment and then adjust their behaviors accordingly, a high-level intelligence emerges as a whole, with the advantage of not requiring the target problem to have any optimization-convenient mathematical properties such as continuity, differentiability or convexity. As such, for many complex problems, which are often highly nonlinear and multimodal, SI algorithms usually perform better than traditional mathematical programming methods. Optimizing the hyper-parameter configuration of a Convolutional Neural Network (CNN) is just this kind of problem.

Although LeNet-5, the pioneering CNN by LeCun et al. [4], solved the problem of character recognition decades ago, it was not until AlexNet [5], a GPU implementation of a deep CNN, obtained an impressive top-5 error rate of 15.3% (more than 10.8% lower than that of the runner-up) on image classification in the ImageNet Challenge 2012 that CNNs received renewed attention from researchers. Encouraged by its success, many CNNs such as ResNet [6], VGG [7], GoogleNet [8] and DenseNet [9] were invented. The outstanding performance of these CNNs is owed not only to the authors' brilliant architecture designs but also to the carefully chosen values of the hyper-parameters. A new CNN architecture is usually introduced by an insightful observation of the limitations of existing CNNs. For instance, ResNet [6] introduced shortcut connections to tackle the vanishing gradient problem when adding more layers to a CNN. However, choosing proper values for hyper-parameters is very tricky, since it depends not only on one's level of experience but also on one's ability to learn from each round of value trial. Usually, hyper-parameter fine-tuning is conducted manually in a costly trial-and-error way. The evaluation of different hyper-parameter configurations involves many rounds of time-consuming CNN training. At the same time, new CNNs tend to have more and more layers, which leads to a surge in the number of hyper-parameters. For example, although AlexNet [5] has
only 27 hyper-parameters, its successors VGG-16 [7], GoogleNet [8], ResNet-52 [6] and DenseNet [9] have a total of 57, 78, 150 and 376 hyper-parameters, respectively. Therefore, it is almost impossible to manually pinpoint a close-to-optimal hyper-parameter configuration for a CNN at a reasonable cost, which hampers the adoption of CNNs for various real-world problems.

Since the hyper-parameters of each CNN layer are integers or can be encoded as integers, hyper-parameter fine-tuning for CNNs (or CNN tuning for short) is essentially an integer programming problem, which is NP-complete [10] and calls for an approximation algorithm running in polynomial time in order to find a close-to-optimal solution. Recently, Particle Swarm Optimization (PSO), a popular SI algorithm, has attracted much attention from researchers for optimizing CNNs [11–16] and other neural networks [17–24].

Although these attempts have obtained some promising results, there is still substantial room for further improvement. We propose a new PSO variant to better satisfy the requirements of CNN tuning. The most time-consuming part of CNN tuning is CNN training, since it is defined as the fitness function to be evaluated. Therefore, to improve the efficiency of CNN tuning, we need to reduce the frequency and execution time of fitness evaluation while maintaining acceptable accuracy. To this end, we first enhance PSO's exploration capability. This is accomplished by regenerating the worst particles' positions with a compound normal confidence distribution. This distribution has a larger variance, which helps perform a free-falling style of search that requires far fewer generations to reach a solution of a given quality. Then, we take into consideration the varying lengths of the ranges of CNN hyper-parameters when updating particles' velocities, to speed up the search over large ranges while preventing the values of small-range hyper-parameters from flip-flopping between their boundaries. Finally, we utilize a linear model to predict the ranking of hyper-parameter configurations and stop the CNN training used for fitness evaluation prematurely once the trend of the ranking is stable. These revisions to the canonical PSO enable our approach to find better hyper-parameter values at a lower cost. The key contributions of this paper are: (1) it is the first PSO variant that enhances the particles' exploration capability with a confidence distribution that compounds normal distributions; (2) CNN hyper-parameter characteristics are taken into consideration when revising PSO's update equation; (3) a hyper-parameter quality prediction model is built to save fitness evaluation time; (4) extensive experiments are carried out to verify the effectiveness of the proposed approach.

The remainder of this paper is organized as follows. In Section 2, we briefly describe the concept of compound confidence distribution, which is an important underpinning of our work, and summarize recent advances in the application of swarm intelligence algorithms to neural network hyper-parameter optimization. In Section 3, we describe our approach in detail, from its overall design to each of its parts. Experiments and result discussions are presented in Section 4. Finally, we conclude our work in Section 5.

2. Related works

2.1. Compound confidence distribution

A confidence distribution [25] is a distribution estimator. Unlike a point estimator or an interval estimator, it is a sample-dependent distribution which can represent confidence intervals of all levels for the estimated parameter and thus contains much more information for inference. A confidence distribution is not a probability distribution function of the parameter of interest, but may still be a function useful for making inferences [26]. Inferences made from this distribution have a direct frequency interpretation. A formal definition of confidence distribution [25] is given as follows:

Definition 1. A function Hn(·) = Hn(x, ·) on 𝜒 × Θ → [0, 1] is called a confidence distribution (CD) for a parameter θ if: 1) for each given x ∈ 𝜒, Hn(·) is a cumulative distribution function on Θ; 2) at the true parameter value θ = θ0, Hn(θ0) ≡ Hn(x, θ0), as a function of the sample x, follows the uniform distribution U[0, 1].

In the definition, Θ is the domain of the unknown parameter θ and 𝜒 is the sample space of the data x = {x1, …, xn}. Although we can simply stack up the boundaries of a set of confidence intervals of all levels for a parameter to obtain a confidence distribution, an analytic expression is more favorable for the sake of infinitely many levels of confidence. When an expert tunes the hyper-parameters of a CNN, she/he usually has a prior guess of the values of the hyper-parameters based on her/his experience or intuition, but with inaccurate confidence. After observing the performance of the hyper-parameters evaluated through a few rounds of trial-and-error, she/he gains a more accurate confidence on the distribution of the hyper-parameters' optima. In this paper, we model the manual fine-tuning performed by experts with a confidence distribution so that an automatic hyper-parameter tuning approach can be devised.
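As a concrete illustration of Definition 1 (a standard example from the confidence-distribution literature, not one given in this paper): for a sample x = {x1, …, xn} drawn from N(μ, σ²) with σ known,

H_n(μ′) = Φ( (μ′ − x̄) / (σ/√n) ),

i.e., the cumulative distribution function of N(x̄, σ²/n) evaluated at μ′, is a CD for the mean μ. For any fixed sample it is a cumulative distribution function in μ′, and at the true value μ′ = μ0 the quantity (μ0 − x̄)/(σ/√n) is standard normal, so H_n(μ0) follows U[0, 1], exactly as the definition requires; its quantiles reproduce the usual confidence intervals of every level.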
A compound probability distribution G is the distribution of a random variable X that follows a distribution A whose unknown parameter θ is itself a random variable distributed according to a second distribution B. In this way, G is obtained by compounding A with B, which are called the original distribution and the latent distribution, respectively. G results from integrating out the unknown parameter(s) θ over B. The probability density function [27,28] of G is given by Equation (1).

    p_G(x) = ∫ p_A(x ∣ θ) p_B(θ) dθ        (1)

G's mean and variance [27,28] are given by Equation (2).

    E_G[X] = E_B[E_A[X ∣ θ]]
    Var_G(X) = E_B[Var_A(X ∣ θ)] + Var_B(E_A[X ∣ θ])        (2)

A compound distribution G is similar to the original distribution A in many ways. For example, they have the same support, and their shapes are largely similar too. However, a compound distribution typically has greater variance and is usually heavy-tailed as well. These are appealing properties that could be exploited to construct a distribution able to enhance the exploration capability of the canonical PSO, which is otherwise prone to getting stuck at a local optimum.

A compound confidence distribution is the combination of the above two distributions and will be used as one of the significant components of our approach, elaborated in Section 3.
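The greater-variance and heavy-tail claims are easy to check numerically. The short sketch below is an illustration added here (it is not code from the paper): it samples a compound normal of the kind used later in Section 3.6, where the scale σ of the original normal is itself normally distributed with mean En and deviation He, and compares the empirical variance with the En² + He² implied by Equation (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Compound normal: X | sigma ~ N(Ex, sigma^2), with sigma ~ N(En, He^2)  (cf. Equations (1), (2) and (8))
Ex, En, He = 0.0, 2.0, 0.6
n = 1_000_000

sigma = rng.normal(En, He, size=n)       # latent distribution B for the scale parameter
x = rng.normal(Ex, np.abs(sigma))        # original distribution A; the parameter is "integrated out" by sampling

print("empirical variance :", x.var())                            # ~ En^2 + He^2
print("predicted variance :", En**2 + He**2)                      # from Equation (2)
excess_kurtosis = ((x - x.mean())**4).mean() / x.var()**2 - 3
print("excess kurtosis    :", excess_kurtosis)                    # > 0: heavier tail than a plain normal
```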
2.2. PSO-based neural network hyper-parameter optimization

Several works have been done to optimize the hyper-parameters of neural networks [11–24,29,30]. However, their definitions of hyper-parameters differ, which leads to different optimization goals. Some adopt hyper-parameters in a narrow sense, comprising only the hyper-parameters within each neural network layer, and aim to fine-tune existing neural networks without changing their overall architectures. Others treat hyper-parameters in a broad sense that also includes the number and order of layers, the learning rate and so on, and their goal is to generate whole new neural networks from scratch. In this paper, we focus only on the former.

Fine-tuning only occurs after a neural network's architecture is determined or when the network is used for a different dataset. For a CNN, these hyper-parameters usually include the convolutional layer parameters (e.g., kernel size, kernel number, stride and padding), the pooling layer parameters (e.g., pooling method, stride and padding) and the fully-connected layer parameter (i.e., kernel number). For Radial Basis Function Neural Networks (RBFNNs), they may include hidden centers and widths as well as the controlling parameters of their kernel functions.

Some researchers tried canonical meta-heuristic algorithms and obtained some promising results. For example, L. M. Rasdi Rere et
al. [31] investigated the performance of three meta-heuristic algorithms, including simulated annealing, differential evolution and harmony search, in optimizing LeNet-5, and achieved a reduction of 7.14% in CER (classification error rate). Toshihiko Yamasaki et al. [13] applied PSO to the optimization of AlexNet and achieved a 0.7–5.7% reduction in CER on five different datasets. Toshi Sinha et al. [11] applied PSO to optimize the hyper-parameters of the first layer of a 13-layer CNN and obtained an 18.53% CER on the CIFAR-10 dataset [32], which is better than the 22.5% CER of the 8-layer AlexNet on the same dataset. As can be seen, direct applications of canonical meta-heuristic algorithms have substantial effects on small CNNs such as LeNet-5, but the effects are weakened on medium-sized CNNs such as AlexNet. Although one can fine-tune a deeper CNN to obtain higher accuracy with these approaches, she/he may have to afford more computation cost than bearable. Thus, it is not a preferable way to solve this problem.

In contrast, some researchers proposed hybrid or adaptive solutions, most of which are based on PSO. S.Y.S. Leung et al. [21] proposed a hybrid PSO named ALPSO for optimizing RBFNNs by introducing a linearly decreasing inertia weight whose value is determined according to fitness evaluations, to better balance PSO's exploration and exploitation capabilities. Efe Camci et al. [17] optimized type-2 fuzzy neural networks with a hybrid algorithm that uses PSO to tune the antecedent parameters and sliding mode control to update the consequent parameters. Jenni Raitoharju et al. [24] applied MD-PSO, the multi-dimensional extension of the canonical PSO, to the optimization of the class-specific cluster centroids and locations of RBFNNs. Honggui Han et al. [20] proposed adaptive particle swarm optimization (APSO), which develops a nonlinear regressive function to adjust the inertia weight so as to avoid being trapped in local optima, and used APSO to improve the accuracy and parsimony of RBFNNs. Junfei Qiao et al. [23] applied APSO to optimizing the design of a dynamic modular neural network. Yuhao Chin et al. [18] adopted a fuzzy PSO to fine-tune the location parameter of a hyper-rectangular composite neural network. Although these works are not designed for fine-tuning the hyper-parameters of CNNs, their achievements suggest that a method complementing the premature-convergence shortcoming of PSO is a promising way of solving the problem of CNN tuning.

Apart from the aspect of exploration capability, some researchers investigated methods to minimize the cost of fitness evaluation for CNN hyper-parameter fine-tuning. For example, Toshihiko Yamasaki et al. [13] proposed two methods for reducing the time of evaluating the fitness function, defined as AlexNet training, for ranking PSO particles. The authors observed that the convergence speed of CNN training depends on the chosen dataset, so once the dataset is determined, it is possible to know how many epochs are needed to train the network well enough. By calculating Spearman's ranking correlation upon completing the training, the authors could determine a proper epoch number for the ensuing fitness evaluations. To get rid of the costly complete training, the authors proposed a second method that defines a volatility-based metric for measuring the stability of the CNN's accuracy; once it is stable, the fitness evaluation is interrupted. Tobias Domhan et al. [33] aimed to reduce the fitness evaluation time of CNNs by prematurely stopping the solutions that have pessimistic expectations for future training. They combined eleven parametric functions linearly and used Markov Chain Monte Carlo (MCMC) to predict the future performance of a solution. This method brings a reduction of 0.27% in CER on the CIFAR-10 dataset. Although it helps to save computation resources in neural network training during the period of solution evaluation, its computation cost may neutralize the gained error rate reduction, since it involves a bunch of sophisticated functions and a non-trivial MCMC computation.

Taking all these facts together, we may safely infer that a PSO variant with a stronger exploration capability and a lower fitness evaluation cost would point to a hopeful direction for tackling the hyper-parameter fine-tuning problem of CNNs.

3. Adapted PSO for CNN hyper-parameter fine-tuning

3.1. Motivation

CNN hyper-parameter fine-tuning is a problem of multimodal function optimization. Taking AlexNet as an example, the relationship between classification accuracy/error and the CNN hyper-parameter configuration is shown in Fig. 1. It can be seen that the accuracy (Fig. 1(a)), and thus the CER (Fig. 1(b)), is a multimodal function of the hyper-parameters. For example, given a kernel size of 5 × 5, there are several optima of the kernel number (Fig. 1(c)). Without a wise selection strategy, it is highly likely to end up with a hyper-parameter configuration leading to disappointing CNN performance. Among the 700 hyper-parameter configurations, only 1.14% of them lead to a satisfactory CNN performance (accuracy ∈ [0.8, 0.9)). A large number of them (88.00%) lead to ordinary CNNs (accuracy ∈ [0.5, 0.8)), and 10.86% of them lead to bad performance (accuracy ∈ [0, 0.5)), as shown in Fig. 1(d).

Many heuristic algorithms have the potential for integer programming of CNN hyper-parameters. The Genetic Algorithm (GA), a typical SI algorithm, is an apparent choice [29,30], since GA inherently supports integer optimization. However, Particle Swarm Optimization (PSO) can obtain the same level of optimization as GA but usually at less cost in terms of generations [34].

PSO and its variants have been tried for optimizing CNNs' hyper-parameters [11,13,31]. These results have demonstrated that PSO can reduce the CER of a CNN by finding a better hyper-parameter configuration for it. However, the high cost of CNN training and PSO's tendency to produce a premature solution limit the performance of PSO on CNN hyper-parameter fine-tuning. Training a deep CNN usually requires a tremendous number of iterations for adjusting weights and biases, which makes it very costly to evaluate the quality of a hyper-parameter configuration. Sophisticated PSO variants with multiple layers or distributed population structures may not help in this case, since they usually require more particles than the canonical PSO, which leads to more instances of CNN training than affordable for users. We need a more powerful PSO variant that uses fewer particles to find the optima within fewer PSO generations. Nevertheless, fewer particles usually lead to less diversity, thus resulting in weak exploration capability. To solve this dilemma, we suggest adapting the canonical PSO to CNN hyper-parameter fine-tuning to meet these requirements: 1. the process of CNN training should be stopped prematurely at an appropriate point to save time and computation resources while still maintaining a trustworthy CER for ranking particles; 2. the number of particles should be as small as possible to reduce the number of instances of CNN training in each PSO generation; 3. the number of PSO generations should be as small as possible to reduce the number of rounds of CNN training; 4. our approach must find a close-to-optimal solution in terms of CNN CER reduction. These requirements motivate us to devise an approach that can, first of all, fine-tune the hyper-parameters like a CNN expert, exploring the hyper-parameter space with a narrower and narrower confidence interval, second, consider the different ranges of the hyper-parameters, and finally, pinpoint the close-to-optimal hyper-parameter configuration at an affordable cost.
3.2. Problem definition
Fig. 1. Illustration of hyper-parameter space structure of CNN using AlexNet with different kernel numbers and kernel sizes for the first layer.
The problem of CNN hyper-parameter fine-tuning is defined in the following equation:

    min_{H⃗ ∈ 𝕀^k} CNN(H⃗, θ⃗; D)    s.t.  i_D ≤ i_max        (3)

in which CNN, as the objective function, denotes an architecture-determined CNN that takes H⃗, θ⃗ and D as inputs and outputs the CER. H⃗ ∈ 𝕀^k is a vector of k hyper-parameters, θ⃗ is the vector of learned parameters (i.e., the weights and biases of the CNN), and D is the data set for training, testing and validating the CNN. We define the constraint of CNN as the number of iterations of CNN training. i_D ∈ 𝕀+ is the number of iterations consumed by training the target CNN on dataset D, and i_max ∈ 𝕀+ is the upper bound on iterations specified by CNN users according to their budgets. One epoch may contain tens to hundreds of iterations, and each iteration takes a mini-batch of data as input and performs an update on the CNN's parameters. Compared to the epoch, the iteration is a more precise measurement of a CNN's training cost. Since CNN training involves a large amount of sample data as input for updating a tremendous number of parameters, it is often very time-consuming; thus the evaluation of CER is very costly. The constraint plays an important role in avoiding impractical optimization schemes that find high-quality hyper-parameter configurations but at an unaffordable computation cost. Minimizing CER drives the search for optimal hyper-parameter values under the constraint.
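To make Equation (3) concrete, the fitness evaluation can be thought of as a function that trains the architecture-determined CNN for at most i_max mini-batch iterations under a given hyper-parameter vector and returns the CER. The sketch below is our own illustration of that contract, not the authors' code; `train_one_iteration` and `measure_cer` are hypothetical callables standing in for whatever deep-learning framework is used.

```python
from typing import Callable, Sequence

def evaluate_configuration(h: Sequence[int],
                           train_one_iteration: Callable[[Sequence[int]], None],
                           measure_cer: Callable[[], float],
                           i_max: int) -> float:
    """Fitness of Equation (3): train the CNN built from hyper-parameter vector h
    for at most i_max mini-batch iterations, then return its CER."""
    for _ in range(i_max):          # budget constraint i_D <= i_max
        train_one_iteration(h)      # one iteration = one mini-batch update of weights/biases
    return measure_cer()            # classification error rate, the quantity minimized in Eq. (3)

# Toy usage with dummy callables, only to show the calling convention:
if __name__ == "__main__":
    state = {"cer": 0.9}
    cer = evaluate_configuration(
        h=[11, 96, 4, 2],  # e.g. kernel size, kernel number, stride, padding of one layer
        train_one_iteration=lambda h: state.update(cer=state["cer"] * 0.999),
        measure_cer=lambda: state["cer"],
        i_max=500,
    )
    print(f"CER after 500 iterations: {cer:.3f}")
```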
3.3. Overall design of our approach

The overall process of cPSO-CNN is consistent with the canonical PSO, as shown in Algorithm 1. Firstly, a swarm of particles, denoted as an ordered set X⃗ whose elements are denoted as x_i, each representing a different configuration of hyper-parameters (each hyper-parameter is denoted as x_{i,j}), is initialized randomly using a uniform distribution (lines 3–11), wherein x_p and x_g represent the personal and global best hyper-parameter values. Secondly, the particles' positions are recalculated with both personal and environmental information (lines 13–21). Then the x_i are rearranged in descending order in X⃗ according to their fitness evaluation results, and the set X_p of personal bests as well as the global best x_g are updated accordingly. Finally, the best solution is returned when the termination condition is met (line 23). However, we make three refinements to the canonical PSO so that it performs better in CNN tuning. The first one is Fast Fitness Evaluation (FFE, shown in lines 10 and 18), which aims to output a ranking of particles and their associated personal bests using fewer iterations of CNN training. The second one is the revision of the update equation (line 21), which adapts particles' accelerations to the different sizes of the hyper-parameters' domains (denoted as B⃗ = ⟨b_1, b_2, …, b_j, …, b_|B⃗|⟩, in which the lower and upper bounds of b_j are represented by b̲_j and b̄_j), so that an efficient search can be conducted. The third one is selecting β percent of the particles as the worst particles and regenerating their positions with the Compound Normal Confidence (CNC) distribution (line 21), which helps a lot in enhancing the exploration capability of PSO. Each of these refinements is described in detail in the following subsections. It would be possible to further enhance the local searching capability of our approach by switching to more efficient local search strategies, such as a quasi-Newton method, in the last phase of searching, but we do not discuss this in this paper since it is just a simple combination.
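The Algorithm 1 listing itself is not reproduced in this text, so the following condensed sketch is our own reconstruction of the loop structure described above, under several stated simplifications: the CNN fitness is replaced by a cheap toy function, Ex for the regeneration step is taken as the per-dimension mean of the better half of the swarm, σ is scaled by the domain span so that every dimension is explored, and w = 0.7 and α1 = α2 = 0.3 are arbitrary demo values (only γ = 0.1 and β = 0.2 follow Section 4).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hyper-parameter domains B = <b_1..b_k> for one convolutional layer (bounds as used in Section 4):
lower = np.array([1, 1, 1, 1])        # kernel size, kernel number, stride, padding
upper = np.array([13, 128, 4, 4])

def fitness(x):
    """Stand-in for FFE + CNN training: any cheap multimodal integer function will do for the demo."""
    return np.sum(np.abs(np.sin(x)) + 0.01 * (x - 0.6 * upper) ** 2)

def cpso_cnn(n_particles=10, n_generations=100, w=0.7, alpha1=0.3, alpha2=0.3, beta=0.2, gamma=0.1):
    span = upper - lower
    c1, c2 = alpha1 * span, alpha2 * span                    # vector acceleration coefficients (Eq. (7))
    x = rng.integers(lower, upper + 1, size=(n_particles, len(lower))).astype(float)
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    g_best = pbest[np.argmin(pbest_f)].copy()

    for gen in range(n_generations):
        # canonical position/velocity update with range-adapted coefficients
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g_best - x)
        x = np.clip(np.rint(x + v), lower, upper)

        # regenerate the worst beta fraction from a compound normal around the elites (Section 3.6)
        f = np.array([fitness(p) for p in x])                # in cPSO-CNN this ranking comes from FFE
        worst = np.argsort(f)[-max(1, int(beta * n_particles)):]
        elites = x[np.argsort(f)[: n_particles // 2]]
        En = 2.0 / (1.0 + np.exp(-(6.0 - gen / 16.0)))       # Eq. (10) with the values reported in Section 3.6
        sigma = np.abs(rng.normal(En, gamma * En, size=(len(worst), x.shape[1])))
        # scaling sigma by the domain span is our simplification, not taken from the paper
        x[worst] = np.clip(np.rint(rng.normal(elites.mean(axis=0), sigma * span)), lower, upper)

        # book-keeping of personal and global bests
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g_best = pbest[np.argmin(pbest_f)].copy()
    return g_best, pbest_f.min()

print(cpso_cnn())
```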
    P = { 1,                     i_max ≥ e_i > e_{i+1} − 1
        { Σ_{i=1} δ(i − e_i),    otherwise                        (5)

in which i is the iteration number, CER_i is the CER at the i-th iteration, and the values of a and b are chosen so that the prediction error of ĈER is minimized. We rank the particles according to their predicted CERs at i_max, but a one-time ranking is not trustworthy, since the trends may change with the iterations.
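Only fragments of the fast-fitness-evaluation material survive here, so the sketch below is our own illustration of the general idea stated above and in Section 1: fit a simple trend to the CERs observed during the first training iterations of each particle, extrapolate to i_max, and rank the particles by the extrapolated CERs, re-checking the ranking as more iterations arrive. A plain linear form ĈER(i) = a·i + b is assumed; the paper's actual prediction model is not reproduced in this text.

```python
import numpy as np

def predicted_cer_at(i_max, iterations, cers):
    """Least-squares fit of CER ~ a*i + b over the observed iterations, extrapolated to i_max
    (the linear form is an assumption for this illustration)."""
    a, b = np.polyfit(iterations, cers, deg=1)
    return a * i_max + b

def rank_particles(observed, i_max):
    """observed: {particle_id: (iterations, cers)}; returns ids sorted best (lowest predicted CER) first."""
    predictions = {pid: predicted_cer_at(i_max, np.asarray(it), np.asarray(c))
                   for pid, (it, c) in observed.items()}
    return sorted(predictions, key=predictions.get), predictions

# Toy usage: three particles observed for the first 600 of i_max = 1000 iterations.
observed = {
    0: ([100, 200, 400, 600], [0.82, 0.74, 0.63, 0.55]),
    1: ([100, 200, 400, 600], [0.80, 0.76, 0.70, 0.66]),
    2: ([100, 200, 400, 600], [0.85, 0.70, 0.58, 0.47]),
}
order, preds = rank_particles(observed, i_max=1000)
print(order)   # ranking used by the PSO; training can stop early once this ordering stops changing
```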
    v_ij(t + 1) = w·v_ij(t) + c1[j]·r1·(p_ij − x_ij(t)) + c2[j]·r2·(p_gj − x_ij(t))        (7)

Here, c1[j] = α1(b̄_j − b̲_j) and c2[j] = α2(b̄_j − b̲_j), where b̲_j and b̄_j denote the lower and upper bounds of b_j, and α1, α2 ∈ (0, 1) are ratio coefficients.
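In other words, each dimension j receives its own acceleration coefficients, proportional to the width of that hyper-parameter's domain, so a kernel-number dimension with range 1–128 moves much faster than a stride dimension with range 1–4. A minimal numerical illustration of Equation (7) follows (our own sketch; α1 = α2 = 0.3 and w = 0.7 are arbitrary demo values):

```python
import numpy as np

rng = np.random.default_rng(42)

# Domains of the first-layer hyper-parameters used in Section 4: kernel size, kernel number, stride, padding
lower = np.array([1.0, 1.0, 1.0, 1.0])
upper = np.array([13.0, 128.0, 4.0, 4.0])
alpha1 = alpha2 = 0.3                      # ratio coefficients in (0, 1); 0.3 is an arbitrary demo choice

c1 = alpha1 * (upper - lower)              # c1[j] = alpha1 * (upper_bound_j - lower_bound_j)
c2 = alpha2 * (upper - lower)              # c2[j] = alpha2 * (upper_bound_j - lower_bound_j)

w = 0.7
x, v = np.array([5.0, 30.0, 2.0, 2.0]), np.zeros(4)
p_i = np.array([7.0, 96.0, 4.0, 1.0])      # personal best of this particle
p_g = np.array([11.0, 64.0, 4.0, 2.0])     # global best of the swarm

r1, r2 = rng.random(4), rng.random(4)
v = w * v + c1 * r1 * (p_i - x) + c2 * r2 * (p_g - x)          # Equation (7), coefficients now vectors
x = np.clip(np.rint(x + v), lower, upper)                      # integer hyper-parameters stay inside their ranges

print("c1 per dimension:", c1)   # large for kernel number (38.1), small for stride/padding (0.9)
print("new position    :", x)
```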
3.6. Compound normal confidence distribution

PSO is a popular metaheuristic searching algorithm, but it is prone to premature convergence and getting stuck at sub-optima. This shortcoming lies in its weak exploration capability. The worst particles are expected to explore and find a new global best position in the following generations. However, the positions of the worst particles are constrained too strongly by the current best positions, which limits their searching capability according to Equation (7). On one hand, the current global best position drives the worst particles to rush towards the position of the best particle right from the beginning, thus discouraging them from exploring areas far away from the current global best position. On the other hand, the current local best position only encourages a worst particle to search for the best position around itself, which is less likely to succeed, since an inferior position is more likely to appear in an inferior area. Both the global and local best positions attempt to turn the worst particles into the best, which is usually inefficient. Such a strong constraint has a negative effect on maintaining population diversity as well as on balancing the exploration and exploitation of the population [38,39]. Inspired by the selection operation in GA, we argue that the worst particles should be abandoned and new particles should be generated in a way that not only utilizes information from superior solutions (mimicking the selected genes in GA) with a gradually increasing strength but also explores the hyper-parameter space intensively (mimicking the crossover and mutation operations of GA, but in a more controllable manner). To this end, we require a distribution that can express such semantics. Cloud Models [40–42], which have been proposed for Uncertainty Artificial Intelligence, can be used for this purpose. For simplicity of statement, we rewrite the definition of the Normal Cloud Model as Equation (8).

    X ∼ N(Ex, σ²)   s.t.   σ ∼ N(En, He²)        (8)

It can be seen that X is distributed according to a compound normal distribution. The Normal Cloud Model was originally used for transforming between qualitative and quantitative knowledge, but we interpret it in another way to better fit hyper-parameter optimization. We model the confidence of an expert's estimation of the optimal values of the hyper-parameters with this compound normal distribution. The expected mean Ex is the point estimate of the hyper-parameter. The deviation σ is the interval estimate of the hyper-parameter, but the confidence in its accuracy varies and is expressed by a third numeric feature, He. In this way, we obtain a confidence function that can support exploration for hyper-parameters with different levels of confidence. Moreover, this property allows particles to be placed randomly at locations farther from the population center while still maintaining a connection with the population center.

The Normal Cloud Model also defines a membership function using a simplified version of the probability density function of a normal distribution, as shown in Equation (9).

    f_m(x) = e^{−(x − Ex)² / (2σ²)}        (9)

We use this membership function to encourage exploration. Given En and He, we can always generate several samples of σ (denoted as σ_i, i ∈ 𝕀+) using the distribution N(En, He²). Then we can generate samples of X using the set of distributions N(Ex, σ_i²), i ∈ 𝕀+. According to the 68-95-99.7 rule [43], nearly 100% of the values of a normal distribution lie within the band [Ex − 3σ, Ex + 3σ]. However, the compound normal distribution does not have such a constraint, since it has a larger variance of En² + He² [44]; it can thus explore a larger hyper-parameter space with a higher possibility of finding the optima. We use a sigmoid-like function to control the value of En and a linear function to model the confidence changes of CNN experts, as shown in Equation (10) and Equation (11).

    En = v / (1 + e^{−(s − f·g)})        (10)

    He = γ·En        (11)

Here, g is the number of generations of cPSO-CNN, and v, s and f are parameters controlling the shape of En, which are set to 2, 6 and 1/16 in our experiments, respectively. γ ∈ (0, 1) is a ratio factor determining the confidence (a low value corresponds to a high confidence). According to Equation (10), En is large in the initial generations and thereby tends to generate particles over a larger range, thus favoring exploration. As the number of generations increases, the value of En gradually decreases, causing the range of the generated particles to shrink towards Ex, which favors exploitation. Late in the generations, En gradually turns to zero, so that the generated particles converge to Ex. We use the shape of the sigmoid-like function as an imitation of a human expert's behavior when fine-tuning a CNN hyper-parameter: first explore possible candidates to find the potential area, then refine the result with each trial during the later stage. Ex utilizes the elitists' information [39] instead of the global best position, which helps to maintain the diversity of the population. The procedure for determining the positions of particles using a compound normal distribution is described in Algorithm 3. Note that in line 10 we use the membership function to encourage larger-range searching in the initial generations. However, this does not mean the exploration capability of our approach degrades to the level of the canonical PSO during later generations, since the compound normal distribution can still generate positions far away from the center of the population thanks to its heavy-tail property.
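Algorithm 3 is not reproduced in this text, so the following is a hedged sketch of the regeneration step as we read this section: Ex is the per-dimension center taken from the superior particles, En follows the sigmoid-like schedule of Equation (10), He = γ·En as in Equation (11), a scale σ is drawn as in Equation (8), and the new position is sampled from N(Ex, σ²). The membership-function weighting of Equation (9) used in line 10 of Algorithm 3 is omitted here, so this illustrates the sampling mechanics rather than the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(7)

def en_schedule(g, v=2.0, s=6.0, f=1.0 / 16.0):
    """Equation (10): En = v / (1 + exp(-(s - f*g))); large early (explore), near zero late (exploit)."""
    return v / (1.0 + np.exp(-(s - f * g)))

def cnc_sample(Ex, g, gamma=0.1, size=1):
    """Draw positions from the compound normal confidence distribution of Equation (8):
    sigma ~ N(En, He^2) with He = gamma*En (Equation (11)), then X ~ N(Ex, sigma^2).
    Ex is the per-dimension center taken from the superior particles (elitist information)."""
    En = en_schedule(g)
    He = gamma * En                                   # Equation (11)
    sigma = np.abs(rng.normal(En, He, size=(size, np.size(Ex))))
    return rng.normal(Ex, sigma)                      # heavy-tailed around Ex, variance ~ En^2 + He^2

# Regenerating the worst particles of a toy 4-dimensional swarm at two different generations:
Ex = np.array([7.0, 96.0, 2.0, 1.0])                  # e.g. elite mean of kernel size, kernel number, stride, padding
early = cnc_sample(Ex, g=5, size=3)                   # En ~ 2.0 -> noticeably spread around Ex
late = cnc_sample(Ex, g=150, size=3)                  # En ~ 0.07 -> positions essentially collapse onto Ex
print("generation   5:\n", np.rint(early))
print("generation 150:\n", np.rint(late))
```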
Fig. 2. CER variation of the best particles of the 5 PSOs over the generations. cPSO-CNN brings more free-falling drops of CER.
4. Experiments and discussion

We carried out experiments for two purposes: ablation analysis and comparison with similar works. The former is to verify the effectiveness of the enhancements to the searching capability of PSO. The latter is to show the superiority of cPSO-CNN by comparing its performance and cost on well-known datasets and CNNs with those of other similar works. Note that the mini-batch size in our experiments, for both our approach and the compared approaches, is 128, and each epoch contains 195 iterations.

We revise the canonical PSO with three mechanisms: range-adaptive acceleration, exploration enhancement based on the compound normal confidence distribution, and fast fitness evaluation using linear estimation. The first two mechanisms are introduced to improve the efficiency of solution searching. In order to evaluate their effects, we used the canonical PSO and four PSO variants in the experiment, as shown in Table 1. All of these PSOs use ten particles to fine-tune the same set of hyper-parameters: kernel size (1–13), kernel number (1–128), stride (1–4) and padding (1–4). vPSO only uses the range-adaptive acceleration mechanism, which sets c1 and c2 in PSO according to Section 3.5. uPSO additionally replaces the positions of the worst two (i.e., β = 0.2) particles with new random values in the hyper-parameter space, drawn from a uniform distribution in each dimension. For nPSO, we construct a normal distribution for the positions of the worst two particles: the average value of the superior particles in each dimension is taken as Ex of the normal distribution for that dimension, and its deviation σ is determined in the same way as En (see Equation (10)). When deciding the value of He in our approach, we set γ to 0.1 in Equation (11). Note that the values of the remaining parameters of all these PSOs are the same. To avoid compromising the fairness of the comparison through random initialization of particles, all of these PSOs use the same particle initialization values.

We choose the hyper-parameters of AlexNet as the optimization target and CIFAR-10 [32] as the experiment dataset, since AlexNet is a typical CNN with moderate complexity and CIFAR-10 is a widely used CNN dataset. Firstly, we only optimize the first convolutional layer, in that it takes the critical responsibility for extracting features from the raw data.

Table 1
PSOs under evaluation.

Name        Description
PSO         canonical PSO
vPSO        PSO with range-adaptive acceleration
uPSO        vPSO with uniform distribution regeneration
nPSO        vPSO with normal distribution regeneration
cPSO-CNN    vPSO with CNC regeneration

Fig. 3. The average performances of the PSOs on AlexNet. cPSO-CNN is better than the others.

Since the best particle represents the optimization result, we demonstrate the performance of all the PSOs, shown in Fig. 2. As can be seen, all five curves contain some small downward steps corresponding to local searches, because all of the algorithms are rooted in PSO. However, in its first 58 generations, vPSO searches more aggressively than PSO, obtaining a 5.8% CER reduction (from 55.6% to 49.8%), which is larger than PSO's CER reduction of 3.1%. Nevertheless, in the remaining generations, vPSO gets stuck in local optima, just like the canonical PSO. The range-adapted acceleration mechanism has some effect on CER reduction but is not enough by itself to avoid the premature convergence of PSO. uPSO shows a steep drop that reduces CER by 8.6% within only 26 generations, which suggests that exploration is critical during the early generations, since the unexplored space is more likely to contain positions better than the randomly initialized ones. However, uPSO also loses its exploration strength afterwards. The reason, we infer, is that the uniform distribution does not take the structure of the hyper-parameter space into consideration and thus searches blindly in the vast space, even though the particles' positions provide useful information on the space structure. It is hard to pinpoint a better position in the vast hyper-parameter space once a relatively good position has been found. Neither vPSO nor uPSO utilizes the knowledge on solution distribution provided by the good particles. Alternatively, nPSO uses a normal distribution to model this knowledge. It performs similarly to uPSO in the early generations, reducing CER by 8.4% after 31 generations. While trapped in local optima in the middle generations, nPSO strives to get out of them from generation 104 on and surpasses uPSO by 0.8% in the end. However, the 68-95-99.7 rule [43] of the normal distribution
confines the searching space to an area near the found optimum. With the additional parameter He, which pushes the search not only towards farther places but also exploits the knowledge acquired by the better particles, some hard-to-reach yet better positions can be discovered. As shown by the curve of cPSO-CNN, it maintains a strong exploration capability down the generations, with a series of free-falling reductions that result in an amazing 14.8% in the end, far lower than that of the other PSOs. Note that the CERs in the figure are only used for ranking the best particles of each compared algorithm, not for training a CNN completely. Once the best particle (thus the best hyper-parameter configuration) is determined in this way, it is used to fully train a CNN, which yields a much lower CER (8.65%, as shown in Fig. 4(a)).

Apart from the performance of the best particles, the average performance of all particles is also noteworthy, since it reflects the overall capability of these algorithms. The average performance of all particles in each generation is shown in Fig. 3. It can be seen that our approach generally finds better positions compared to the others. Especially during the late generations, the curve of our approach converges to the same value, meaning the particles as a whole are superior to those of the other PSOs.

These experiments are performed on AlexNet, but we also optimized other CNNs' first convolutional layers, whose results are shown in Table 2. The experiments verify that our approach is also effective with other CNNs.

Table 2
CERs before and after optimization with cPSO-CNN on CIFAR-10.

Method            Before (%)   After (%)   Reduction (%)
AlexNet [5]       23           10.01       12.99
VGGNet-16 [7]     17.38        8.98        8.40
VGGNet-19 [7]     13.65        6.62        7.03
GoogleNet [8]     8.12         6.16        2.06
ResNet-52 [6]     6.97         4.77        2.20
ResNet-101 [6]    6.61         5.39        1.22
DenseNet-121 [9]  4.96         3.82        1.14

From all of this evidence, we can safely conclude that: (1) the range-adapted acceleration mechanism has some effect on enhancing the searching capability of the canonical PSO; (2) randomly regenerating the positions of the worst particles can help improve exploration capability; (3) a center-focused and heavy-tailed distribution has more potential for finding a better solution. cPSO-CNN is designed with these integrated mechanisms and produces the best results.

In order to compare cPSO-CNN with other works on hyper-parameter optimization of neural networks, we use CIFAR-10 as the benchmark dataset and CER as the performance metric, and we optimized all eight layers of AlexNet this time. The results are shown in Table 3. As can be seen, our approach is better than the state-of-the-art NMM among all these compared approaches. In particular, our approach performs much better than EPSO-CNN, which is the state-of-the-art approach for AlexNet hyper-parameter optimization.

Table 3
Comparison of CERs of different approaches on CIFAR-10.

Name                      Network      CER (%)   Approach
GA-CNN [30]               EDEN         25.41     GA
EPSO-CNN [13]             AlexNet      19.85     PSO
PSO-b [11]                13-CNN       18.53     PSO
SMAC [33]                 Small CNN    17.47     DNN
HORD [35]                 19-CNN       20.54     RBF
MCDNN [45]                Single DNN   11.21     SA
NMM [36]                  7-CNN        10.05     Nelder-Mead
cPSO-CNN (our approach)   AlexNet      8.67      PSO

cPSO-CNN can optimize one CNN layer at a time, which is a useful trait for obtaining better performance at less cost, since CNN layers have different potential for reducing CER depending on their types and positions, as shown in Table 4. C, P and F represent Convolutional, Pooling and Fully-Connected layers respectively, and the subscript represents the position of the layer. For convolutional layers, the

Table 4
CER after single-layer optimization on expert-tuned canonical AlexNet.

Layer(s)   CER (%)   #Gen.   Cost-Effect. Ratio
C1         10.4      179     13.81
P1         20.4      93      35.77
C2         17.7      167     31.51
P2         21.1      97      53.89
C3         19.3      189     51.08
C4         19.4      177     49.17
P3         20.3      91      33.70
F          21.6      231     165.00
5. Conclusion
References

[21] S. Leung, Y. Tang, W. Wong, A hybrid particle swarm optimization and its application in neural networks, Expert Syst. Appl. (2012) 395–405.
[22] C. Mao, R. Lin, C. Xu, Q. He, Towards a trust prediction framework for cloud services based on PSO-driven neural network, IEEE Access (2017) 2187–2199.
[23] J. Qiao, C. Lu, W. Li, Design of dynamic modular neural network based on adaptive particle swarm optimization algorithm, IEEE Access (2018) 10850–10857.
[24] J. Raitoharju, S. Kiranyaz, M. Gabbouj, Training radial basis function neural networks for classification via class-specific clustering, IEEE Trans. Neural Netw. Learn. Syst. (2016) 2458–2471.
[25] M. Xie, K. Singh, Confidence distribution, the frequentist distribution estimator of a parameter: a review, Int. Stat. Rev. (2013) 3–39.
[26] M. Xie, Rejoinder, Int. Stat. Rev. 81 (1) (2013) 68–77.
[27] R. Cassady, J. Nachlas, Probability Models in Operations Research, CRC Press, 2008.
[28] G.A. Fox, S. Negrete-Yankelevich, V.J. Sosa, Ecological Statistics: Contemporary Theory and Application, Oxford University Press, 2015.
[29] L. Xie, A. Yuille, Genetic CNN, in: IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1379–1388.
[30] S.R. Young, D.C. Rose, T.P. Karnowski, S.-H. Lim, R.M. Patton, Optimizing deep learning hyper-parameters through an evolutionary algorithm, in: Workshop on Machine Learning in High-Performance Computing Environments, MLHPC '15, ACM, New York, NY, USA, 2015, pp. 4:1–4:5.
[31] L.M.R. Rere, M.I. Fanany, A.M. Arymurthy, Metaheuristic algorithms for convolution neural network, Comput. Intell. Neurosci. (2016) 1–13.
[32] A. Krizhevsky, V. Nair, G. Hinton, The CIFAR-10 Dataset, 2014, http://www.cs.toronto.edu/kriz/cifar.html.
[33] T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: IJCAI, 2015, pp. 3460–3468.
[34] R. Hassan, B. Cohanim, O. de Weck, G. Venter, A Comparison of Particle Swarm Optimization and the Genetic Algorithm, American Institute of Aeronautics and Astronautics, 2004, p. 1897.
[35] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter optimization for deep learning algorithms using deterministic RBF surrogates, in: Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017, pp. 822–829.
[36] S. Albelwi, A. Mahmood, A framework for designing the architectures of deep convolutional neural networks, Entropy (2017) 242.
[37] G. Venter, J. Sobieszczanski-Sobieski, Particle swarm optimization, AIAA J. (2003) 1583–1589.
[38] X. Zhao, W. Lin, J. Hao, X. Zuo, J. Yuan, Clustering and pattern search for enhancing particle swarm optimization with Euclidean spatial neighborhood search, Neurocomputing 171 (2016) 966–981.
[39] G. Xu, X. Zhao, T. Wu, R. Li, X. Li, An elitist learning particle swarm optimization with scaling mutation and ring topology, IEEE Access 6 (2018) 78453–78470.
[40] D. Li, S. Wang, H. Yuan, D. Li, Software and applications of spatial data mining, WIREs Data Min. Knowl. Discov. (2016) 84–114.
[41] S. Wang, H. Chi, H. Yuan, J. Geng, Extraction and representation of common feature from uncertain facial expressions with cloud model, Environ. Sci. Pollut. Control Ser. (2017) 27778–27787.
[42] D. Li, Y. Du, Artificial Intelligence with Uncertainty, CRC Press, 2017.
[43] D.S. Moore, G.P. McCabe, Introduction to the Practice of Statistics, Freeman, 1993.
[44] G. Zhang, Researches and Applications on Evolutionary Algorithm Based on Cloud Model, Ph.D. thesis, School of Computer Science, Beihang University, Beijing, China, 2008.
[45] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science (1983) 671–680.