
Swarm and Evolutionary Computation 49 (2019) 114–123

cPSO-CNN: An efficient PSO-based algorithm for fine-tuning hyper-parameters of convolutional neural networks

Yulong Wang, Haoxin Zhang, Guangwei Zhang ∗

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, 10 Xitucheng Road, Haidian District, Beijing, 100086, China

∗ Corresponding author. E-mail addresses: [email protected] (Y. Wang), [email protected] (H. Zhang), [email protected] (G. Zhang).

https://doi.org/10.1016/j.swevo.2019.06.002
Received 13 December 2018; Received in revised form 15 May 2019; Accepted 5 June 2019; Available online 10 June 2019
2210-6502/© 2019 Elsevier B.V. All rights reserved.

A B S T R A C T

Swarm intelligence algorithms have been widely adopted in solving many highly nonlinear, multimodal problems and have achieved tremendous successes. However, their application to deep neural networks is largely unexplored. On the other hand, deep neural networks, especially the convolutional neural network (CNN), have recently achieved breakthroughs in tackling many intractable problems; nevertheless, their performance depends heavily on the chosen values of their hyper-parameters, whose fine-tuning is both labor-intensive and time-consuming. In this paper, we propose a novel particle swarm optimization (PSO) variant, cPSO-CNN, for optimizing the hyper-parameter configuration of architecture-determined CNNs. cPSO-CNN utilizes a confidence function defined by a compound normal distribution to model experts' knowledge of CNN hyper-parameter fine-tuning so as to enhance the canonical PSO's exploration capability. cPSO-CNN also redefines the scalar acceleration coefficients of PSO as vectors to better adapt to the variant ranges of CNN hyper-parameters. Besides, a linear prediction model is adopted for fast ranking of the PSO particles to reduce the cost of fitness function evaluation. The experimental results demonstrate that cPSO-CNN performs competitively when compared with several reported algorithms in terms of both CNN hyper-parameter superiority and overall computation cost.

Keywords: Swarm intelligence; PSO; Optimization; CNN; Hyper-parameter

1. Introduction

Artificial Swarm Intelligence (SI) algorithms, such as the genetic algorithm [1], particle swarm optimization [2] and the fireworks algorithm [3], have been showing their powerful capabilities for optimizing difficult real-world problems ever since their advent. Relying on a population of simple agents which gather guiding information from neighbors and the environment and then adjust their behaviors accordingly, a high intelligence emerges as a whole, which has the advantage of not requiring the target problem to have any optimization-convenient mathematical properties such as continuity, differentiability or convexity. As such, for many complex problems which are often highly nonlinear and multimodal, SI algorithms usually perform better than traditional mathematical programming methods. Optimizing the hyper-parameter configuration of a Convolutional Neural Network (CNN) is exactly this kind of problem.

Although LeNet-5, the pioneering CNN by LeCun et al. [4], solved the problem of character recognition decades ago, it was not until AlexNet [5], a GPU implementation of a deep CNN, obtained an impressive top-5 error rate of 15.3%, more than 10.8% lower than that of the runner-up, on image classification in the ImageNet Challenge 2012, that CNNs received renewed attention from researchers. Encouraged by its success, many CNNs such as ResNet [6], VGG [7], GoogleNet [8] and DenseNet [9] were invented. The outstanding performance of these CNNs is owed not only to the authors' brilliant designs of the CNNs' architectures but also to the carefully chosen values of the hyper-parameters. A new CNN architecture is usually introduced by an insightful observation of the limitations of an existing CNN. For instance, ResNet [6] introduced shortcut connections to tackle the vanishing gradient problem when adding more layers to a CNN.


However, choosing proper values for hyper-parameters is very tricky, since it depends not only on one's level of experience but also on her/his ability to learn from each round of value trial. Usually, fine-tuning hyper-parameters is conducted manually in a costly trial-and-error way. The evaluations of different hyper-parameter configurations involve many rounds of time-consuming CNN training. At the same time, new CNNs tend to have more and more layers, which leads to a surge in the number of hyper-parameters. For example, although AlexNet [5] has only 27 hyper-parameters, its successors VGG-16 [7], GoogleNet [8], ResNet-52 [6] and DenseNet [9] have a total of 57, 78, 150 and 376 hyper-parameters respectively. Therefore, it is almost impossible to pinpoint a close-to-optimal hyper-parameter configuration for a CNN manually under a reasonable cost, which hampers the adoption of CNNs for various real-world problems.

Since the hyper-parameters of each layer of a CNN are integers or can be encoded as integers, hyper-parameter fine-tuning for CNNs (or CNN tuning for short) is essentially an integer programming problem, which is NP-complete [10] and calls for an approximation algorithm running in polynomial time in order to find a close-to-optimal solution. Recently, Particle Swarm Optimization (PSO), as a popular SI algorithm, has attracted much attention from researchers for optimizing CNNs [11–16] and other neural networks [17–24].

Although these attempts have obtained some promising results, there is still substantial room for further improvement. We propose a new PSO variant to better satisfy the requirements of CNN tuning. The most time-consuming part of CNN tuning is CNN training, since it is defined as the fitness function to be evaluated. Therefore, to improve the efficiency of CNN tuning, we need to reduce the fitness evaluation's frequency and execution time while maintaining acceptable accuracy. To this end, we first enhance PSO's exploration capability. This is accomplished by regenerating the worst particles' positions with a compound normal confidence distribution. This distribution has a larger variance that helps perform a free-falling style search. Such a search results in far fewer generations for a solution of a given quality. Then, we take into consideration the variant lengths of the ranges of CNN hyper-parameters when updating particles' velocities, to speed up the search in large ranges while preventing the values of small-range hyper-parameters from flip-flopping between their boundaries. Finally, we utilize a linear model to predict the ranking of hyper-parameter configurations and stop the CNN training prematurely for fitness evaluation once the trend of the ranking is stable. These revisions to the canonical PSO enable our approach to find better hyper-parameter values at less cost. The key contributions of this paper are: (1) it is the first PSO variant that enhances the particles' exploration capability with a confidence distribution that compounds normal distributions; (2) CNN hyper-parameter characteristics are taken into consideration when revising PSO's update equation; (3) a hyper-parameter quality prediction model is built for saving fitness evaluation time; (4) extensive experiments are carried out to verify the effectiveness of the proposed approach.

The remainder of this paper is organized as follows. In Section 2, we briefly describe the concept of the compound confidence distribution, which is an important underpinning of our work. We also summarize the recent advances in the application of swarm intelligence algorithms to neural network hyper-parameter optimization. In Section 3, we describe our approach in detail, from its overall design to each of its parts. Experiments and result discussions are presented in Section 4. Finally, we conclude our work in Section 5.

2. Related works

2.1. Compound confidence distribution

A confidence distribution [25] is a distribution estimator. Unlike a point estimator or an interval estimator, it is a sample-dependent distribution which can represent confidence intervals of all levels for the estimated parameter and thus contains much more information for inference. A confidence distribution is not a probability distribution function of the parameter of interest, but may still be a function useful for making inferences [26]. Inferences made from this distribution have a direct frequency interpretation. A formal definition of the confidence distribution [25] is given as follows:

Definition 1. A function Hn(·) = Hn(x, ·) on χ × Θ → [0, 1] is called a confidence distribution (CD) for a parameter θ, if: 1) for each given x ∈ χ, Hn(·) is a cumulative distribution function on Θ; 2) at the true parameter value θ = θ0, Hn(θ0) ≡ Hn(x, θ0), as a function of the sample x, follows the uniform distribution U[0, 1].

In the definition, Θ is the domain of the unknown parameter θ, and χ is the sample space of the data x = {x1, …, xn}. Although we can simply stack up the boundaries of a set of confidence intervals of all levels for a parameter to obtain a confidence distribution, an analytic expression is more favorable for the sake of infinite levels of confidence. When an expert tunes the hyper-parameters of a CNN, she/he usually has a prior guess of the values of the hyper-parameters based on her/his experience or intuition, but with inaccurate confidence. After observing the performance of the hyper-parameters evaluated through a few rounds of trial-and-error, she/he gains a more accurate confidence on the distribution of the hyper-parameters' optima. In this paper, we model the manual fine-tuning performed by experts with a confidence distribution so that an automatic hyper-parameter tuning approach can be devised.

A compound probability distribution G is a probability distribution A of a random variable X, for which A has an unknown parameter θ that is itself a random variable distributed according to a second distribution B. In this way, G is the distribution obtained by compounding A with B, which are called the original distribution and the latent distribution respectively. G results from integrating out the unknown parameter(s) θ over B. The probability density function [27,28] of G is given by Equation (1):

pG(x) = ∫ pA(x | θ) pB(θ) dθ        (1)

and G's mean and variance [27,28] are given by Equation (2):

EG[X] = EB[EA[X | θ]]        (2)
VarG(X) = EB[VarA(X | θ)] + VarB(EA[X | θ])

A compound distribution G is similar to the original distribution A in many ways. For example, they have the same support and their shapes are largely similar too. However, a compound distribution typically has greater variance and is usually heavy-tailed as well. These are appealing properties that can be exploited to construct a distribution able to enhance the exploration capability of the canonical PSO, which is otherwise prone to getting stuck at a local optimum.

A compound confidence distribution is the combination of the above two distributions and will be used as one of the significant components of our approach, elaborated in Section 3.
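As a rough numerical illustration of Equation (2), and of the normal-compound-normal case that Section 3.6 later builds on, the following Python sketch checks the mean/variance identities by Monte Carlo. The values of Ex, En and He are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent distribution B: the compounded parameter (here the standard deviation)
# is itself normally distributed, as in the Normal Cloud Model used in Section 3.6.
Ex, En, He = 0.0, 2.0, 0.5          # illustrative values only

n = 1_000_000
sigma = np.abs(rng.normal(En, He, size=n))   # theta ~ B = N(En, He^2); abs() guards rare negative draws
x = rng.normal(Ex, sigma, size=n)            # X | theta ~ A = N(Ex, sigma^2)

# Equation (2): E_G[X] = E_B[E_A[X | theta]] = Ex, and
# Var_G(X) = E_B[Var_A(X | theta)] + Var_B(E_A[X | theta]) = (En^2 + He^2) + 0.
print("empirical mean:", round(x.mean(), 3), "  expected:", Ex)
print("empirical var :", round(x.var(), 3), "  expected:", En**2 + He**2)
```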
2.2. PSO-based neural network hyper-parameter optimization

Several works have been done to optimize the hyper-parameters of neural networks [11–24,29,30]. However, their definitions of hyper-parameters differ, leading to different optimization goals. Some adopt hyper-parameters in a narrow sense, comprising only the hyper-parameters within each neural network layer, and they aim to fine-tune existing neural networks without changing their overall architectures. Others treat hyper-parameters in a broad sense, which also includes the number and order of layers, the learning rate and so on, and their goal is to generate whole new neural networks from scratch. In this paper, we focus only on the former.

Fine-tuning only occurs after a neural network's architecture is determined or when it is used for a different dataset. For a CNN, these hyper-parameters usually include the convolutional layer parameters (e.g., kernel size, kernel number, stride and padding), the pooling layer parameters (e.g., pooling method, stride and padding) and the fully-connected layer parameter (i.e., kernel number). For Radial Basis Function Neural Networks (RBFNNs), they may include hidden centers, widths as well as the controlling parameters of their kernel functions.


Some researchers have tried canonical meta-heuristic algorithms and obtained some promising results. For example, L. M. Rasdi Rere et al. [31] investigated the performance of three meta-heuristic algorithms, including simulated annealing, differential evolution and harmony search, in optimizing LeNet-5, and they achieved a reduction of 7.14% in CER.¹ Toshihiko Yamasaki et al. [13] applied PSO to the optimization of AlexNet and achieved a 0.7–5.7% reduction in CER on five different datasets. Toshi Sinha et al. [11] applied PSO to optimize the hyper-parameters of the first layer of a 13-layer CNN and obtained an 18.53% CER on the CIFAR-10 dataset [32], which is better than the 22.5% CER of the 8-layer AlexNet on the same dataset. As can be seen, direct applications of canonical meta-heuristic algorithms have substantial effects on small CNNs such as LeNet-5, but the effects are weakened on medium-sized CNNs such as AlexNet. Although one can fine-tune a deeper CNN to obtain higher accuracy with these approaches, she/he may have to afford more computation cost than bearable. Thus, it is not a preferable way to solve this problem.

¹ For easy comparison, all accuracy figures are converted to CERs in this paper, which refer to top-1 classification error rates if not explicitly stated.

In contrast, some researchers proposed hybrid or adaptive solutions, most of which are based on PSO. S.Y.S. Leung et al. [21] proposed a hybrid PSO named ALPSO for optimizing RBFNNs by introducing a linearly decreasing inertia weight whose value is determined according to fitness evaluations to better balance PSO's exploration and exploitation capabilities. Efe Camci et al. [17] optimized type-2 fuzzy neural networks with a hybrid algorithm using PSO to tune the antecedent parameters and sliding mode control to update the consequent parameters. Jenni Raitoharju et al. [24] applied MD-PSO, the multi-dimensional extension of the canonical PSO, to the optimization of the class-specific cluster centroids and locations of RBFNNs. Honggui Han et al. [20] proposed adaptive particle swarm optimization (APSO), which developed a nonlinear regressive function to adjust the inertia weight to avoid being trapped in local optima, and used APSO to improve the accuracy and parsimony of RBFNNs. Junfei Qiao et al. [23] applied APSO to optimizing the design of a dynamic modular neural network. Yuhao Chin et al. [18] adopted a fuzzy PSO to fine-tune the location parameter of a hyper-rectangular composite neural network. Although these works are not designed for fine-tuning the hyper-parameters of CNNs, their achievements suggest that a method complementing the premature-convergence shortcoming of PSO is a more promising way of solving the problem of CNN tuning.

Apart from the aspect of exploration capability, some researchers investigated methods to minimize the cost of fitness evaluation for CNN hyper-parameter fine-tuning. For example, Toshihiko Yamasaki et al. [13] proposed two methods for reducing the time of evaluating the fitness function, defined as AlexNet training, for ranking PSO particles. The authors observed that the convergence speed of CNN training depends on the chosen dataset, so once the dataset is determined, it is possible to know how many epochs are needed to train the network well enough. By calculating Spearman's ranking correlation when completing the training, the authors could determine a proper epoch number for the ensuing fitness evaluations. To get rid of the costly complete training, the authors proposed a second method by defining a volatility-based metric for measuring the stability of the CNN's accuracy. Once it is stable, the fitness evaluation is interrupted. Tobias Domhan et al. [33] aimed to reduce the fitness evaluation time of CNNs by prematurely stopping the solutions that have pessimistic expectations in future training. They combined eleven parametric functions linearly and used Markov Chain Monte Carlo (MCMC) to predict the future performance of a solution. This method brings a reduction of 0.27% in CER on the CIFAR-10 dataset. Although it helps to save computation resources in neural network training during the period of solution evaluation, its computation cost may neutralize the gained error rate reduction since it involves a bunch of sophisticated functions and a non-trivial MCMC computation.

Taking all these facts together, we may safely infer that a PSO variant with a stronger exploration capability and a lower fitness evaluation cost would point to a hopeful direction for tackling the hyper-parameter fine-tuning problem of CNNs.

3. Adapted PSO for CNN hyper-parameter fine-tuning

3.1. Motivation

CNN hyper-parameter fine-tuning is a problem of multimodal function optimization. Taking AlexNet as an example, the relationship between classification accuracy/error and CNN hyper-parameter configuration is shown in Fig. 1. It can be seen that the accuracy (Fig. 1 (a)), and thus the CER (Fig. 1 (b)), is a multimodal function of the hyper-parameters. For example, given kernel size 5 × 5, there are several optima of kernel number (Fig. 1 (c)). Without a wise selection strategy, it is highly likely to end up with a hyper-parameter configuration leading to disappointing CNN performance. Among the 700 hyper-parameter configurations, only 1.14% of them lead to a satisfactory CNN performance (accuracy ∈ [0.8, 0.9)). A large number of them (88.00%) lead to ordinary CNNs (accuracy ∈ [0.5, 0.8)), and 10.86% of them lead to bad performances (accuracy ∈ [0, 0.5)), as shown in Fig. 1 (d).

Many heuristic algorithms have the potential for integer programming of CNN hyper-parameters. The Genetic Algorithm (GA), a typical SI algorithm, is an apparent choice [29,30] since GA inherently supports integer optimization. However, Particle Swarm Optimization (PSO) can obtain the same level of optimization as GA but usually at less cost in terms of generations [34].

PSO and its variants have been tried for optimizing CNNs' hyper-parameters [11,13,31]. These results have demonstrated that PSO can reduce the CER of a CNN by finding a better hyper-parameter configuration for it. However, the high cost of CNN training and PSO's tendency to produce a premature solution limit the performance of PSO on CNN hyper-parameter fine-tuning. Training a deep CNN usually requires a tremendous number of iterations for adjusting weights and biases, which makes it very costly to evaluate the quality of a hyper-parameter configuration. Sophisticated PSO variants with multiple layers or distributed population structures may not help in this case since they usually require more particles than the canonical PSO, which leads to more instances of CNN training than affordable for users. We need a more powerful PSO variant that uses fewer particles to find the optima within fewer PSO generations. Nevertheless, fewer particles usually lead to less diversity, thus resulting in weak exploration capability. To solve this dilemma, we suggest adapting the canonical PSO to CNN hyper-parameter fine-tuning to meet these requirements: 1. the process of CNN training should be stopped prematurely at an appropriate point to save time and computation resources while still maintaining a trustworthy CER for ranking particles; 2. the count of particles should be as small as possible to reduce the number of instances of CNN training for each PSO generation; 3. the generations of PSO should be as few as possible to reduce the number of rounds of CNN training; 4. the approach should find a close-to-optimal solution in terms of CNN CER reduction. These requirements motivate us to devise an approach that can, first of all, fine-tune the hyper-parameters like a CNN expert exploring the hyper-parameter space with an ever-narrower confidence interval, second, consider the different ranges of hyper-parameters, and finally pinpoint the close-to-optimal hyper-parameter configuration at an affordable cost.


Fig. 1. Illustration of hyper-parameter space structure of CNN using AlexNet with different kernel numbers and kernel sizes for the first layer.

3.2. Problem definition

We define the problem of CNN hyper-parameter fine-tuning as an integer programming problem. The objective is to find the optimal hyper-parameter values that bring about the lowest CER, which is defined in the following equation:

min_{H⃗ ∈ 𝕀^k} CNN(H⃗, θ⃗; D)    s.t.  iD ≤ imax        (3)

in which CNN, as the objective function, denotes an architecture-determined CNN that takes H⃗, θ⃗ and D as inputs and outputs the CER. H⃗ ∈ 𝕀^k is a vector of k hyper-parameters, θ⃗ is the vector of learned parameters (i.e., the weights and biases of the CNN), and D is the data set for training, testing and validating the CNN. We define the constraint of CNN in terms of iterations of CNN training: iD ∈ 𝕀+ is the number of iterations consumed by training the target CNN on dataset D, and imax ∈ 𝕀+ is the upper bound on iterations specified by CNN users according to their budgets. One epoch may contain tens to hundreds of iterations, and each iteration takes a mini-batch of data as input and performs an update on the CNN's parameters. Compared to the epoch, the iteration is a more precise measurement of a CNN's training cost. Since CNN training involves a large amount of sample data as input for updating a tremendous number of parameters, it is often very time-consuming. Thus the evaluation of CER is very costly. The constraint plays an important role in avoiding impractical optimizing schemes that can find high-quality hyper-parameter configurations but only at an unaffordable computation cost. Minimizing CER drives the search for optimal hyper-parameter values under the constraint.
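To make the notation of Equation (3) concrete, the following is a minimal sketch (not the authors' code) of the constrained fitness interface: H⃗ is an integer vector clipped to per-dimension bounds, and the fitness is the CER returned by a training run capped at imax iterations. The callback name train_and_eval_cer and the dummy evaluator are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TuningProblem:
    """Sketch of the constrained objective of Equation (3); not the authors' code."""
    bounds: Sequence[tuple]           # integer range (lower, upper) of each hyper-parameter
    train_and_eval_cer: Callable      # hypothetical: trains the CNN on D and returns its CER
    max_iterations: int               # i_max, the user's training-iteration budget

    def fitness(self, h: Sequence[int]) -> float:
        # Keep H inside the integer domain before spending any training budget.
        h = [min(max(int(v), lo), hi) for v, (lo, hi) in zip(h, self.bounds)]
        return self.train_and_eval_cer(h, self.max_iterations)

# Hypothetical usage with the first-layer bounds of Section 4 and a dummy evaluator:
problem = TuningProblem(
    bounds=[(1, 13), (1, 128), (1, 4), (1, 4)],   # kernel size, kernel number, stride, padding
    train_and_eval_cer=lambda h, imax: 0.5,       # placeholder; a real evaluator trains the CNN
    max_iterations=80_000,
)
print(problem.fitness([5, 200, 2, 1]))            # kernel number is clipped to 128 before evaluation
```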
3.3. Overall design

The overall process of cPSO-CNN is consistent with the canonical PSO, as shown in Algorithm 1. Firstly, a swarm of particles, denoted as an ordered set X⃗ in which each element xi represents a different configuration of hyper-parameters (each hyper-parameter is denoted xi,j), is initialized randomly using a uniform distribution (lines 3–11), wherein xp and xg represent the personal and global best hyper-parameter values. Secondly, the particles' positions are recalculated with both personal and environmental information (lines 13–21). Then the xi are rearranged in descending order in X⃗ according to their fitness evaluation results, and the set Xp of personal bests as well as the global best xg are updated accordingly. Finally, the best solution is returned when the termination condition is met (line 23). However, we made three refinements to the canonical PSO so that it performs better in CNN tuning. The first one is Fast Fitness Evaluation (FFE, as shown in lines 10 and 18), which aims to output a ranking of particles and their associated personal bests using fewer iterations of CNN training. The second one is the revision of the update equation (line 21), which adapts particles' accelerations to the different sizes of the hyper-parameters' domains (denoted as B⃗ = <b1, b2, …, bj, …, b|B|>, in which the lower and upper bounds of bj are written b̲j and b̄j) so that an efficient search can be conducted. The third one is selecting β percent of the particles as the worst particles and regenerating their positions with the Compound Normal Confidence (CNC) distribution (line 21), which helps a lot in enhancing the exploration capability of PSO. Each of these refinements will be described in detail in the following subsections. It is possible to further enhance the local searching capability of our approach by switching to more efficient local search strategies, such as a quasi-Newton method, in the last phase of searching, but we do not discuss this in this paper since it is just a simple combination.


Algorithm 1 Overall Design of cPSO-CNN.
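The pseudo-code of Algorithm 1 is not reproduced here; the sketch below only outlines the loop as described in Section 3.3, under stated assumptions. The velocity update follows Equation (7) of Section 3.5, while the uniform resampling of the worst particles is merely a stand-in for the CNC regeneration of Section 3.6 (see the sketch there); parameter values such as w, α1 and α2 are illustrative, not the authors' settings.

```python
import numpy as np

def cpso_cnn(fitness, bounds, n_particles=10, n_generations=100,
             w=0.7, a1=0.2, a2=0.2, beta=0.2, seed=0):
    """Sketch of the outer loop of Section 3.3 (not the authors' code).
    `fitness` maps an integer hyper-parameter vector to an (estimated) CER."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds], float)
    hi = np.array([b[1] for b in bounds], float)
    span = hi - lo                              # per-dimension ranges, used by Eq. (7)

    # Random initialization of the swarm (lines 3-11 of Algorithm 1).
    x = np.round(rng.uniform(lo, hi, (n_particles, len(bounds))))
    v = np.zeros_like(x)
    cost = np.array([fitness(xi) for xi in x])
    p_best, p_cost = x.copy(), cost.copy()
    g_best = p_best[p_cost.argmin()].copy()

    n_worst = max(1, int(round(beta * n_particles)))
    for g in range(n_generations):
        for i in range(n_particles):
            r1, r2 = rng.random(len(bounds)), rng.random(len(bounds))
            # Range-adapted update of Eq. (7): acceleration scales with each hyper-parameter's range.
            v[i] = (w * v[i] + a1 * span * r1 * (p_best[i] - x[i])
                             + a2 * span * r2 * (g_best - x[i]))
            x[i] = np.clip(np.round(x[i] + v[i]), lo, hi)

        cost = np.array([fitness(xi) for xi in x])  # the paper ranks particles via FFE (Section 3.4)
        better = cost < p_cost
        p_best[better], p_cost[better] = x[better], cost[better]
        g_best = p_best[p_cost.argmin()].copy()

        # Regenerate the worst beta*|X| particles; the paper draws them from the CNC
        # distribution (Section 3.6) -- uniform resampling is only a stand-in here.
        for i in np.argsort(cost)[-n_worst:]:
            x[i] = np.round(rng.uniform(lo, hi, len(bounds)))

    return g_best, p_cost.min()
```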
3.4. Fast fitness evaluation

The most time-consuming part of hyper-parameter fine-tuning for deep neural networks such as CNNs is the training of the network, which is often selected as the fitness function of a meta-heuristic algorithm. Some researchers simply trained the neural networks until convergence to evaluate the effectiveness of the candidate solutions [11,16,35]. Saleh Albelwi et al. [36] and Lingxi Xie et al. [29] also used complete epochs of training but with a selected subset of the training dataset. Pablo Ribalta Lorenzo et al. [12] and Bin Wang et al. [15] used either a small or an arbitrary number of epochs of training as an early-stopping condition. Toshihiko Yamasaki et al. proposed a more precise method to avoid complete training by measuring the volatility of the CER [13]. Tobias Domhan et al. [33] combined eleven parametric learning curve models to predict the training performance in order to stop the training when the predictions indicated a bad hyper-parameter configuration.

We argue that we can cut off unnecessary iterations of CNN training by predicting the final ranking of the candidate solutions with a light-weight model. If the trends of solution performances are relatively stable and can at least filter out the best one and a few worst candidates, then there is no need to continue the training process. Compared to an epoch, which consumes the complete training dataset and is usually very large, an iteration of CNN training only uses a mini-batch of the same dataset and thus potentially consumes fewer computation resources.

We treat the evaluation sequence of a particle as a time series; a trend estimation can then be used to justify statements about the particle's tendency by correlating its CER with its iterations. This model can then be used to determine the quality of the particles under evaluation. For computational efficiency, we utilize linear trend estimation and adopt the commonly used least squares to fit it, as shown in the following equation:

ĈER = Σi [CERi − (a·i + b)]²        (4)

in which i is the iteration number, CERi is the CER at the ith iteration, and the values of a and b are chosen to minimize ĈER. We rank the particles according to their predicted CERs at imax, but a one-time ranking is not trustworthy since the trends may change with further iterations of CNN training. Thus, we also need to observe the trends' stability. If the rank of the concerned particles, which are the top particle and the set of the worst ⌊(1 − β)|X|⌋ particles, has not changed in λ consecutive iterations, then the linear trends are considered stable and can be used to determine the particles' quality. Besides, since trend estimation itself consumes some computation resources, we reduce the number of evaluations of Equation (4) by using an indicator function, defined in Equation (5), to control the frequency of its calculation:

P = { 1                               if imax ≥ ei > ei+1 − 1
    { Σ_{i=1}^{imax} δ(i − ei)        otherwise        (5)

Here, δ is the unit impulse function, and r ∈ (0, 1) is a real number representing the increasing rate of the evaluation frequency, with ei = ⌊ei−1 + r(ei−1 − ei−2)⌋, ei ∈ 𝕀+, i ∈ [3, imax], and e1, e2 ∈ 𝕀+, e1 < e2. The first two prediction points e1 and e2 are set empirically with values ranging from hundreds to thousands of iterations, since the ranking of particles is expected to be dynamic during the initial stage of CNN training. Note that although these two parameters need to be set manually, determining their values is much easier than determining the CNN hyper-parameters. Even a conservative choice of small e1 and e2 would not bring about much negative effect beyond more occurrences of trend prediction. In the experiments conducted in this paper, we set e1 and e2 to 975 and 2145 respectively. The following prediction points are determined by Equation (5). The predictions occur more and more frequently until reaching one occurrence per iteration, or until the maximum allowed number of iterations has been reached. The complete procedure for fast fitness evaluation is shown in Algorithm 2.
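A minimal sketch of the two ingredients of FFE described above: the least-squares trend of Equation (4) and the prediction schedule implied by Equation (5). The values e1 = 975, e2 = 2145 and r = 0.5 are those reported in the paper; the iteration budget and the λ used in the stability check are placeholders.

```python
import numpy as np

def prediction_schedule(e1=975, e2=2145, r=0.5, budget=80_000):
    """Iterations at which the linear trend is (re)fitted:
    e_i = floor(e_{i-1} + r * (e_{i-1} - e_{i-2})), per Equation (5)."""
    e = [e1, e2]
    while True:
        nxt = int(np.floor(e[-1] + r * (e[-1] - e[-2])))
        if nxt >= budget or nxt <= e[-1]:   # stop at the budget, or once predictions become per-iteration
            break
        e.append(nxt)
    return e

def predicted_cer(iters, cers, at_iteration):
    """Least-squares linear trend of Equation (4), extrapolated to `at_iteration`."""
    a, b = np.polyfit(np.asarray(iters, float), np.asarray(cers, float), deg=1)
    return a * at_iteration + b

def ranking_stable(rank_history, lam=3):
    """True once the watched ranking has not changed for `lam` consecutive checks
    (lambda is not given numerically in this excerpt; 3 is only a placeholder)."""
    return len(rank_history) >= lam and len(set(map(tuple, rank_history[-lam:]))) == 1

# Example: the prediction checkpoints grow denser over the course of training.
print(prediction_schedule()[:8])   # [975, 2145, 2730, 3022, 3168, 3241, 3277, 3295]
```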
3.5. Range-adapted acceleration

The velocity updating equation of the canonical PSO is:

vid(t + 1) = w·vid(t) + c1·r1·(pid − xid(t)) + c2·r2·(pgd − xid(t))        (6)

where vid is the velocity of the ith particle in its dth dimension, r1 and r2 are random real numbers in (0, 1), c1 and c2 are acceleration coefficients, w is an inertia weight, pid and pgd are the local best value and the global best value respectively, and xid denotes the position of the ith particle in its dth dimension. Since c1 and c2 are scalar values, particles in the canonical PSO move with the same acceleration in all dimensions. A scalar acceleration coefficient would not be a problem in a parameter space where all dimensions are of almost the same size. However, that is not the case with CNN hyper-parameter domains. For example, the number of convolutional kernels may range from 1 to 128, while the convolutional stride may range from 1 to 4. Sharing one acceleration coefficient between large-range and small-range hyper-parameters usually leads to a slow search for the large-range hyper-parameter's optimal value, or results in out-of-range values for small-range hyper-parameters, or causes both. Existing works set position values to their upper or lower boundaries when the newly calculated values are out of range [34,37], but such a strategy could cause a small-range value to bounce between the two ends of its range when an acceleration coefficient suitable for large-range hyper-parameters is used. Thus, we expect a hyper-parameter with a large range of values to have a larger acceleration coefficient to speed up the exploration, while a small-range hyper-parameter gets a smaller acceleration coefficient to prevent it from flip-flopping between its boundaries. To this end, we redefine c1 and c2 as d-dimensional vectors and modify Equation (6) into Equation (7):

vij(t + 1) = w·vij(t) + c1[j]·r1·(pij − xij(t)) + c2[j]·r2·(pgj − xij(t))        (7)

Here, c1[j] = α1(b̄j − b̲j) and c2[j] = α2(b̄j − b̲j), where α1, α2 ∈ (0, 1) are ratio coefficients.
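The following sketch applies Equation (7) to the first-layer hyper-parameter ranges used later in Section 4; α1 = α2 = 0.2 and w = 0.7 are illustrative choices, not values given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-dimension acceleration coefficients of Equation (7):
#   c1[j] = alpha1 * (upper_j - lower_j),  c2[j] = alpha2 * (upper_j - lower_j)
# The bounds are the first-layer ranges of Section 4: kernel size, kernel number, stride, padding.
bounds = np.array([[1, 13], [1, 128], [1, 4], [1, 4]], dtype=float)
alpha1 = alpha2 = 0.2
c1 = alpha1 * (bounds[:, 1] - bounds[:, 0])
c2 = alpha2 * (bounds[:, 1] - bounds[:, 0])

def velocity_update(v, x, p_best, g_best, w=0.7):
    """One application of Equation (7) for a single particle (arrays over the dimensions j)."""
    r1, r2 = rng.random(len(x)), rng.random(len(x))
    return w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)

# The kernel-number dimension (range 1-128) is pushed much harder than the stride
# dimension (range 1-4), which a single scalar c1, c2 in Eq. (6) cannot express.
print(c1)   # [ 2.4 25.4  0.6  0.6]
print(velocity_update(v=np.zeros(4),
                      x=np.array([5.0, 32.0, 2.0, 1.0]),
                      p_best=np.array([7.0, 96.0, 2.0, 2.0]),
                      g_best=np.array([9.0, 64.0, 3.0, 2.0])))
```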


Algorithm 2 Fast Fitness Evaluation.

3.6. Compound normal confidence distribution

PSO is a popular metaheuristic searching algorithm, but it is prone to premature convergence and gets stuck at sub-optima. This shortcoming lies in its weak exploration capability. The worst particles are expected to explore and find a new global best position in the following generations. However, the positions of the worst particles are constrained too strongly by the current best positions, which limits their searching capability according to Equation (7). On one hand, the current global best position drives the worst particles to rush to the position of the best particle right from the beginning, thus discouraging them from exploring areas far away from the current global best position. On the other hand, the current local best position only encourages a worst particle to search for the best position around itself, which is less likely to succeed since an inferior position is more likely to appear in an inferior area. Both the global and local best positions attempt to turn the worst particles into the best, which is usually inefficient. Such a strong constraint has a negative effect on maintaining population diversity as well as on balancing the exploration and exploitation of the population [38,39]. Inspired by the selection operation in GA, we argue that the worst particles should be abandoned and new particles should be generated in a way that not only utilizes information from superior solutions (mimicking the selected genes in GA) with a gradually increasing strength, but also explores the hyper-parameter space intensively (mimicking the operations of crossover and mutation in GA, but in a more controllable manner). To this end, we require a distribution that can express such a semantic. Cloud Models [40–42], which have been proposed for Uncertainty Artificial Intelligence, can be used for this purpose. For simplicity of statement, we rewrite the definition of the Normal Cloud Model as Equation (8):

X ∼ N(Ex, σ²)    s.t.  σ ∼ N(En, He²)        (8)

It can be seen that X is distributed according to a compound normal distribution. The Normal Cloud Model was originally used for transforming between qualitative knowledge and quantitative knowledge, but we interpret it in another way to better fit hyper-parameter optimization. We model the confidence of an expert's estimation of the optimal values of hyper-parameters with this compound normal distribution. The expected mean Ex is the point estimate of the hyper-parameter. The deviation σ is the interval estimate of the hyper-parameter, but the confidence of its accuracy is variant and is expressed by a third numeric feature, He. In this way, we obtain a confidence function that can support explorations for hyper-parameters with different levels of confidence. Moreover, this property allows particles to be placed randomly in locations farther from the population center while still maintaining a connection with the population center.

The Normal Cloud Model also defines a membership function using a simplified version of the probability density function of a normal distribution, as shown in Equation (9):

fm(x) = e^(−(x − Ex)² / (2σ²))        (9)

We use this membership function to encourage exploration. Given En and He, we can always generate several samples of σ (denoted as σi, i ∈ 𝕀+) using the distribution N(En, He²). Then we can generate samples of X using the set of distributions N(Ex, σi²), i ∈ 𝕀+. According to the 68-95-99.7 rule [43], nearly 100% of the values of a normal distribution lie within the band [Ex − 3σ, Ex + 3σ]. However, the compound normal distribution does not have such a constraint since it has a larger variance of En² + He² [44], and thus it can explore a larger hyper-parameter space with a higher possibility of finding the optima. We use a sigmoid-like function to control the value of En and a linear function to model the confidence changes of CNN experts, as shown in Equation (10) and Equation (11):

En = v / (1 + e^(−(s − f·g)))        (10)

He = γ·En        (11)

Here, g is the number of generations of cPSO-CNN, and v, s and f are parameters controlling the shape of En, which are set to 2, 6 and 1/16 in our experiments respectively. γ ∈ (0, 1) is a ratio factor determining the confidence (a low figure corresponds to a high confidence). According to Equation (10), En is large in the initial generations and thereby tends to generate particles over a larger range, thus tending to explore. As the number of generations increases, the value of En gradually decreases, causing the range of generated particles to shrink towards Ex, which tends to exploit. Late in the generations, En gradually turns to zero, so that the generated particles converge to Ex. We use the shape of the sigmoid-like function as an imitation of a human expert's behavior when she/he fine-tunes a CNN hyper-parameter: explore possible candidates first to find the potential area, then refine the result in each trial during the later stage. Ex utilizes the elitists' information [39] instead of the global best position, which helps to maintain the diversity of the population. The procedure for determining the positions of particles using the compound normal distribution is described in Algorithm 3. Note that in line 10, we use the membership function to encourage larger-range searching in the initial generations. However, this does not mean the exploration capability of our approach will degrade to the level of the canonical PSO during later generations, since the compound normal distribution can still generate positions far away from the center of the population thanks to its heavy-tail property.

Algorithm 3 Compound Normal Confidence Distribution.

4. Experiments and discussion

We carried out experiments for two purposes: ablation analysis and comparison with similar work. The former is to verify the effectiveness of the enhancements to the searching capability of PSO. The latter is to show the superiority of cPSO-CNN by comparing its performance and cost on well-known datasets and CNNs with other similar works. Note that the mini-batch size in our experiments, for both our approach and the compared approaches, is 128, and each epoch contains 195 iterations.

We revise the canonical PSO with three mechanisms: range-adaptive acceleration, exploration enhancement based on the compound normal confidence distribution, and fast fitness evaluation using linear estimation. The first two mechanisms are introduced for improving the efficiency of solution searching. In order to evaluate their effects, we used the canonical PSO and four PSO variants in the experiment, as shown in Table 1.

Table 1
PSOs under evaluation.

Name       Description
PSO        canonical PSO
vPSO       PSO with range-adaptive acceleration
uPSO       vPSO with uniform distribution regeneration
nPSO       vPSO with normal distribution regeneration
cPSO-CNN   vPSO with CNC regeneration

All these PSOs use ten particles to fine-tune the same set of hyper-parameters: kernel size (1–13), kernel number (1–128), stride (1–4) and padding (1–4). vPSO only uses the range-adaptive acceleration mechanism, which sets c1 and c2 in PSO according to Section 3.5. uPSO additionally replaces the positions of the worst two (i.e., β = 0.2) particles with new random values in the hyper-parameter space according to a uniform distribution in each dimension. For nPSO, we construct a normal distribution for the positions of the worst two particles. The average value of the superior particles in each dimension is taken as the Ex of the normal distribution for that dimension, and its variance σ is determined in the same way as En (see Equation (10)). When deciding the value of He in our approach, we set γ to 0.1 in Equation (11). Note that the values of the remaining parameters of all these PSOs are the same. To avoid hindering the fairness of the comparison due to random initialization of particles, we have all these PSOs use the same particle initialization values.

We choose the hyper-parameters of AlexNet as the optimization target and CIFAR-10 [32] as the experiment dataset, since AlexNet is a typical CNN with moderate complexity and CIFAR-10 is a widely used CNN dataset. Firstly, we only optimize the first convolutional layer, in that it takes the critical responsibility of extracting features from raw data.

Fig. 2. CER variation of the best particles of 5 PSOs over the generations. cPSO-CNN brings more free-fallings of CER.


Since the best particle represents the optimization result, we demonstrate the performances of all the PSOs in Fig. 2. As can be seen, all five curves contain some small downward steps corresponding to local searches, because all of the algorithms are rooted in PSO. However, in its first 58 generations, vPSO searches more aggressively than PSO, obtaining a 5.8% CER reduction (from 55.6% to 49.8%), which is larger than PSO's CER reduction of 3.1%. Nevertheless, in the remaining generations, vPSO gets stuck in local optima, just like the canonical PSO. The range-adapted acceleration mechanism has some effect on CER reduction but is not enough by itself to avoid the premature convergence of PSO. uPSO shows a steep drop that reduces CER by 8.6% within only 26 generations, which suggests that exploration is critical during the early generations, since it is more likely that the unexplored space contains positions better than the randomly initialized ones. However, uPSO also loses its exploration strength afterwards. The reason, we infer, is that the uniform distribution does not take the structure of the hyper-parameter space into consideration and thus searches blindly in the vast space, even though the particles' positions provide useful information on the space structure. It is hard to pinpoint a better position in the vast hyper-parameter space once a relatively good position is found. Neither vPSO nor uPSO utilizes the knowledge on solution distribution provided by the good particles. Alternatively, nPSO uses a normal distribution to model this knowledge. It performs similarly to uPSO in early generations, reducing CER by 8.4% after 31 generations. While trapped in local optima in the middle generations, nPSO strives to get out of them from generation 104 on and surpasses uPSO by 0.8% in the end. However, the 68-95-99.7 rule [43] of the normal distribution confines the searching space to an area near the found optimum. With the additional parameter He, which pushes the search not only towards farther places but also exploits the knowledge acquired by better particles, some hard-to-reach yet better positions can be discovered. As shown by the curve of cPSO-CNN, it maintains a strong exploration capability down the generations, with a series of free-falling reductions that result in an amazing 14.8% in the end, far lower than that of the other PSOs. Note that the CERs in the figure are only used for ranking the best particles of each compared algorithm, not for training a CNN completely. Once the best particle (and thus the best hyper-parameter configuration) is determined in this way, it is used to fully train a CNN, which generates a much lower CER (8.65%, as shown in Fig. 4 (a)).

Fig. 3. The Average Performances of PSOs on AlexNet. cPSO-CNN is Better Than Others.

Apart from the performances of the best particles, the average performances of all particles are also noteworthy, since they reflect the overall capabilities of these algorithms. The average performances of all particles of each generation are shown in Fig. 3. It can be seen that our approach generally finds better positions compared to the others. Especially during the late generations, the curve of our approach converges to the same value, meaning that the particles as a whole are superior to those of the other PSOs.

These experiments are performed on AlexNet, but we also optimized other CNNs' first convolutional layers, whose results are shown in Table 2. The experiments verify that our approach is also effective with other CNNs.

Table 2
CERs Before and After Optimization with cPSO-CNN on CIFAR-10.

Method            Before (%)   After (%)   Reduction (%)
AlexNet [5]       23           10.01       12.99
VGGNet-16 [7]     17.38        8.98        8.40
VGGNet-19 [7]     13.65        6.62        7.03
GoogleNet [8]     8.12         6.16        2.06
ResNet-52 [6]     6.97         4.77        2.20
ResNet-101 [6]    6.61         5.39        1.22
DenseNet-121 [9]  4.96         3.82        1.14

From all of this evidence, we can safely conclude that: (1) the range-adapted acceleration mechanism has some effect on enhancing the searching capability of the canonical PSO; (2) randomly regenerating the positions of the worst particles helps improve exploration capability; (3) a center-focused and heavy-tailed distribution has more potential for finding a better solution. cPSO-CNN is designed with these integrated mechanisms and produces the best results.

In order to compare cPSO-CNN with other works on hyper-parameter optimization of neural networks, we use CIFAR-10 as the benchmark dataset and CER as the performance metric, and we optimized all eight layers of AlexNet this time. The results are shown in Table 3. As can be seen, our approach is better than the state-of-the-art NMM among all these compared approaches. In particular, our approach performs much better than EPSO-CNN, which is the state-of-the-art approach for AlexNet hyper-parameter optimization.

Table 3
Comparison of CERs of different approaches on CIFAR-10.

Name                      Network      CER (%)   Approach
GA-CNN [30]               EDEN         25.41     GA
EPSO-CNN [13]             AlexNet      19.85     PSO
PSO-b [11]                13-CNN       18.53     PSO
SMAC [33]                 Small CNN    17.47     DNN
HORD [35]                 19-CNN       20.54     RBF
MCDNN [45]                Single DNN   11.21     SA
NMM [36]                  7-CNN        10.05     Nelder-Mead
cPSO-CNN (our approach)   AlexNet      8.67      PSO
(3) A center-focused and heavy-tailed distribution has more potential in


cPSO-CNN can optimize one CNN layer at a time, which is a useful trait for obtaining better performance with less cost, since CNN layers have different potentials for reducing CER depending on their types and positions, as shown in Table 4. C, P and F represent Convolutional, Pooling and Fully-Connected layers respectively, and the subscript represents the position of the layer. For the convolutional layers, the hyper-parameters include kernel size (1–13), kernel number (1–128), stride (1–4) and padding (1–4). For all pooling layers, the hyper-parameters include kernel size (1–5), stride (1–4) and padding (1–4). The fully-connected layer has only one hyper-parameter, kernel number (1–4096). For example, if only one layer is to be optimized due to a limited budget, the results imply that it is reasonable to fine-tune the first layer of a CNN, since the first layer C1 has a far lower cost-effectiveness ratio than the other layers.

Table 4
CER after single layer optimization on expert-tuned canonical AlexNet.

Layer(s)   CER (%)   #Gen.   Cost-Effect. Ratio
C1         10.4      179     13.81
P1         20.4      93      35.77
C2         17.7      167     31.51
P2         21.1      97      53.89
C3         19.3      189     51.08
C4         19.4      177     49.17
P3         20.3      91      33.70
F          21.6      231     165.00

We also compare two optimization strategies: layer-after-layer (LAL) and overall. Current works adopt the overall strategy, which optimizes all layers at the same time. However, our experiments show that overall optimization produces roughly the same level of CER as LAL optimization, but its cost is almost twice that of LAL (measured in the number of generations), as can be seen in Fig. 4 (a). Given a limited computation budget, which means a small number of generations (such as 40 generations), overall optimization does not perform as well as LAL optimization, as shown in Fig. 4 (b). LAL is also beneficial for transfer learning, in that a new CNN created by reusing the fine-tuned preceding layers of an existing CNN only needs to fine-tune its remaining layers, which saves tremendous computation resources. Therefore, it is preferable to adopt the LAL strategy to obtain better configurations of hyper-parameters, which is one of our advantages over other similar works.

Fig. 4. LAL Strategy vs. Overall Strategy.

Apart from CER, we also compare the cost of our approach with others. For the reason explained in Section 3.2, we use the number of iterations of CNN training as the cost evaluation metric. The results are shown in Fig. 5. Ten particles participate in the fitness evaluation, among which particle 1 is set according to the configuration of the canonical AlexNet to represent an expert-chosen configuration, and the others are set randomly. For CIFAR-10 and AlexNet, the dataset and CNN used in this experiment, a complete training would need around 80,000 iterations, which is far more than the number of iterations required by CNN tuning approaches with an early-stopping mechanism. Among these early-stopping approaches, the Spearman's-correlation-based method [13] performs the worst, consuming 5500 iterations for one fitness evaluation. EvoCNN [15] uses a fixed number of 10 epochs as the early-stopping condition, which corresponds to 4000 iterations. However, EvoCNN's arbitrary selection strategy is not adaptable and thus may perform worse on other datasets or CNNs. The volatility-based method [13] measures the stability of the particles' CERs and uses it to determine the proper number of epochs (3300 iterations) at which to stop AlexNet's training for fitness evaluation. Our approach (with r = 0.5, imax = 10 in Equation (5)), which measures the stability of the CER's trend for part of the particles, yields the best result (needing only 2400 iterations for one fitness evaluation), surpassing the cost of the state-of-the-art method (the volatility-based method) by 27.3%.

Fig. 5. CER variations determined by particles during fitness evaluation.

5. Conclusion

Hyper-parameter fine-tuning has been an obstacle to obtaining a satisfying CNN due to the high cost of its trial-and-error process. To tackle this problem, we should speed up the searching efficiency as well as reduce the computation cost of fitness evaluation. In this paper, we bring in three mechanisms: vectorizing the acceleration coefficients to adapt to the variant ranges of CNN hyper-parameters, enhancing exploration capability with a compound normal confidence distribution, and a linear-estimation-based scheme for fast fitness evaluation. PSO with these mechanisms substantially improves the quality of a CNN's hyper-parameters at less computation cost. Our work suggests that finding proper CNN-oriented PSO variants is a promising direction for CNN optimization. In our future work, we plan to extend cPSO-CNN to optimize CNN architecture.

Acknowledgement

The work was supported by the National High-tech R&D Program of China (863 Program) (2015AA017201).

References

[1] J. Zhang, H.S. HungChung, W. Lo, Clustering-based adaptive crossover and mutation probabilities for genetic algorithms, IEEE Trans. Evol. Comput. (2007) 326–335.
[2] D. Tian, Z. Shi, MPSO: modified particle swarm optimization and its applications, Swarm Evolut. Comput. (2018) 49–68.
[3] Y. Chen, L. Li, X. Zhao, J. Xiao, Q. Wu, Y. Tan, Simplified hybrid fireworks algorithm, Knowl. Based Syst. (2019).
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE (1998) 2278–2324.
[5] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[6] H. Kaiming, Z. Xiangyu, R. Shaoqing, S. Jian, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[7] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[8] S. Christian, L. Wei, J. Yangqing, S. Pierre, R. Scott, A. Dragomir, E. Dumitru, V. Vincent, R. Andrew, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[9] G. Huang, Z. Liu, L. van der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
[10] C.H. Papadimitriou, On the complexity of integer programming, J. ACM 28 (4) (1981) 765–768.
[11] S. Toshi, H. Ali, V. Brijesh, Particle swarm optimization based approach for finding optimal values of convolutional neural network parameters, in: IEEE Congress on Evolutionary Computation (CEC), 2018, pp. 1–6.
[12] L.P. Ribalta, N. Jakub, K. Michal, R.L. Sanchez, P.J. Ranilla, Particle swarm optimization for hyper-parameter selection in deep neural networks, in: The Genetic and Evolutionary Computation Conference, 2017, pp. 481–488.
[13] T. Yamasaki, T. Honma, K. Aizawa, Efficient optimization of convolutional neural networks using particle swarm optimization, in: IEEE Third International Conference on Multimedia Big Data, 2017, pp. 70–73.
[14] F.C. Soon, H.Y. Khaw, J.H. Chuah, J. Kanesan, Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition, IET Intell. Transp. Syst. (2018) 939–946.
[15] W. Bin, S. Yanan, X. Bing, Z. Mengjie, Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification, in: 2018 IEEE Congress on Evolutionary Computation (CEC), 2018.
[16] S.S. Talathi, Hyper-parameter optimization of deep convolutional networks for object recognition, in: IEEE International Conference on Image Processing (ICIP), 2015, pp. 3982–3986.
[17] E. Camci, D.R. Kripalani, L. Ma, E. Kayacan, M.A. Khanesar, An aerial robot for rice farm quality inspection with type-2 fuzzy neural networks tuned by particle swarm optimization-sliding mode control hybrid algorithm, Swarm Evolut. Comput. (2018) 1–8.
[18] Y. HaoChin, Y. ZengHsieh, M. ChunSu, S. FangLee, M. WenChen, J. ChingWang, Music emotion recognition using PSO-based fuzzy hyper-rectangular composite neural networks, IET Signal Process. (2017) 884–891.
[19] P. Ghamisi, Y. Chen, X.X. Zhu, A self-improving convolution neural network for the classification of hyperspectral data, IEEE Geosci. Remote Sens. Lett. (2016) 1537–1541.
[20] H. GuiHan, W. Lu, Y. Hou, J. FeiQiao, An adaptive-PSO-based self-organizing RBF neural network, IEEE Trans. Neural Netw. Learn. Syst. (2018) 104–117.


[21] S. Leung, Y. Tang, W. Wong, A hybrid particle swarm optimization and its application in neural networks, Expert Syst. Appl. (2012) 395–405.
[22] C. Mao, R. Lin, C. Xu, Q. He, Towards a trust prediction framework for cloud services based on PSO-driven neural network, IEEE Access (2017) 2187–2199.
[23] J. Qiao, C. Lu, W. Li, Design of dynamic modular neural network based on adaptive particle swarm optimization algorithm, IEEE Access (2018) 10850–10857.
[24] J. Raitoharju, S. Kiranyaz, M. Gabbouj, Training radial basis function neural networks for classification via class-specific clustering, IEEE Trans. Neural Netw. Learn. Syst. (2016) 2458–2471.
[25] M. Xie, K. Singh, Confidence distribution, the frequentist distribution estimator of a parameter: a review, Int. Stat. Rev. (2013) 3–39.
[26] M. Xie, Rejoinder, Int. Stat. Rev. 81 (1) (2013) 68–77.
[27] R. Cassady, J. Nachlas, Probability Models in Operations Research, CRC Press, 2008.
[28] G.A. Fox, S. Negrete-Yankelevich, V.J. Sosa, Ecological Statistics: Contemporary Theory and Application, Oxford University Press, 2015.
[29] L. Xie, A. Yuille, Genetic CNN, in: IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1379–1388.
[30] S.R. Young, D.C. Rose, T.P. Karnowski, S.-H. Lim, R.M. Patton, Optimizing deep learning hyper-parameters through an evolutionary algorithm, in: The Workshop on Machine Learning in High-Performance Computing Environments, MLHPC '15, ACM, New York, NY, USA, 2015, pp. 4:1–4:5.
[31] L.M.R. Rere, M.I. Fanany, A.M. Arymurthy, Metaheuristic algorithms for convolution neural network, Comput. Intell. Neurosci. (2016) 1–13.
[32] K. Alex, N. Vinod, H. Geoffrey, The CIFAR-10 Dataset, 2014, http://www.cs.toronto.edu/kriz/cifar.html.
[33] D. Tobias, S.J. Tobias, H. Frank, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: IJCAI, 2015, pp. 3460–3468.
[34] H. Rania, C. Babak, de Weck Olivier, V. Gerhard, A Comparison of Particle Swarm Optimization and the Genetic Algorithm, American Institute of Aeronautics and Astronautics, 2004, p. 1897.
[35] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter optimization for deep learning algorithms using deterministic RBF surrogates, in: Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017, pp. 822–829.
[36] A. Saleh, M. Ausif, A framework for designing the architectures of deep convolutional neural networks, Entropy (2017) 242.
[37] V. Gerhard, S. Sobieski, Jaroslaw, Particle swarm optimization, AIAA J. (2003) 1583–1589.
[38] X. Zhao, W. Lin, J. Hao, X. Zuo, J. Yuan, Clustering and pattern search for enhancing particle swarm optimization with euclidean spatial neighborhood search, Neurocomputing 171 (2016) 966–981.
[39] G. Xu, X. Zhao, T. Wu, R. Li, X. Li, An elitist learning particle swarm optimization with scaling mutation and ring topology, IEEE Access 6 (2018) 78453–78470.
[40] D. Li, S. Wang, H. Yuan, D. Li, Software and applications of spatial data mining, WIREs Data Min. Knowl. Discov. (2016) 84–114.
[41] S. Wang, H. Chi, H. Yuan, J. Geng, Extraction and representation of common feature from uncertain facial expressions with cloud model, Environ. Sci. Pollut. Control Ser. (2017) 27778–27787.
[42] D. Li, Y. Du, Artificial Intelligence with Uncertainty, CRC Press, 2017.
[43] D.S. Moore, G.P. McCabe, Introduction to the Practice of Statistics, Freeman, 1993.
[44] Z. Guangwei, Researches and Applications on Evolutionary Algorithm Based on Cloud Model, Ph.D. thesis, School of Computer Science, Beihang University, Beijing, China, 2008.
[45] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science (1983) 671–680.

