function, etc. In this paper, the logarithmic sigmoid given by (2) is used as the activation function:

f(x) = 1 / (1 + e^{-x})   (2)

Fig. 1. Processing unit of an ANN (neuron).

The optimization goal is to minimize the objective function by optimizing the network weights. The mean square error (MSE), given by (3), is chosen as the network error function:

E(w(t)) = (1/L) * sum_{j=1}^{L} sum_{k} (d_k - o_k)^2   (3)

where E(w(t)) is the error at the tth iteration; w(t) is the weight vector at the tth iteration; d_k and o_k represent respectively the desired and actual values of the kth output node; and L is the number of patterns.
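To make the objective concrete, the following Python sketch evaluates a candidate weight vector of a single-hidden-layer network with the sigmoid of (2) and returns the MSE of (3). It is only an illustration, not the authors' code; the flat weight layout (input-to-hidden weights plus biases, then hidden-to-output weights plus biases) and the helper names are our assumptions.

import numpy as np

def sigmoid(x):
    # Logarithmic sigmoid activation, Eq. (2)
    return 1.0 / (1.0 + np.exp(-x))

def mse_error(weights, inputs, targets, n_in, n_hidden, n_out):
    # Mean square error, Eq. (3), of a one-hidden-layer MLP whose weights
    # and biases are packed into a single flat vector (assumed layout).
    w1_end = (n_in + 1) * n_hidden
    W1 = weights[:w1_end].reshape(n_in + 1, n_hidden)       # input -> hidden
    W2 = weights[w1_end:].reshape(n_hidden + 1, n_out)      # hidden -> output
    ones = np.ones((inputs.shape[0], 1))
    hidden = sigmoid(np.hstack([inputs, ones]) @ W1)        # hidden activations
    output = sigmoid(np.hstack([hidden, ones]) @ W2)        # actual outputs o_k
    # average over the L patterns of the squared error (d_k - o_k)^2
    return np.mean(np.sum((targets - output) ** 2, axis=1))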
Fig. 2. Multilayer feed-forward neural network (MLP).

III. TRAINING ARTIFICIAL NEURAL NETWORK
Differential evolution (DE), proposed by Storn and Price [7], is a very popular EA. Like other EAs, DE is a population-based stochastic search technique. It uses mutation, crossover and selection operators at each generation to move its population toward the global optimum.

A. Initialization in DE
The initial population is generated uniformly at random between the lower boundary (LB) and the upper boundary (UB):

X_{i,j}^{G=0} = lb_j + rand_j(0,1) * (ub_j - lb_j)   (4)

where rand_j(0,1) is a random number in [0, 1].
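A minimal sketch of the uniform initialization (4), assuming lb and ub are arrays of length D holding the lower and upper boundaries:

import numpy as np

def init_population(NP, lb, ub, rng=None):
    # Eq. (4): X_{i,j}^{G=0} = lb_j + rand_j(0,1) * (ub_j - lb_j)
    if rng is None:
        rng = np.random.default_rng()
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    return lb + rng.random((NP, lb.size)) * (ub - lb)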
B. Mutation Operation
In this process, DE creates a mutant vector V_i^G = (v_{i,1}^G, ..., v_{i,D}^G) for each individual X_i^G (called a target vector) in the current population at each generation. There are several variants of DE; according to [7], [8], some common mutation schemes are as follows:

DE/rand/1: V_{i,j}^G = X_{r1,j}^G + F*(X_{r2,j}^G - X_{r3,j}^G)   (5)

DE/best/1: V_{i,j}^G = X_{best,j}^G + F*(X_{r1,j}^G - X_{r2,j}^G)   (6)

DE/rand/2: V_{i,j}^G = X_{r1,j}^G + F*(X_{r2,j}^G - X_{r3,j}^G) + F*(X_{r4,j}^G - X_{r5,j}^G)   (7)

DE/best/2: V_{i,j}^G = X_{best,j}^G + F*(X_{r1,j}^G - X_{r2,j}^G) + F*(X_{r3,j}^G - X_{r4,j}^G)   (8)

DE/rand to best/1: V_{i,j}^G = X_{r1,j}^G + F*(X_{best,j}^G - X_{r1,j}^G) + F*(X_{r2,j}^G - X_{r3,j}^G)   (9)

where r1, r2, r3, r4 and r5 are distinct integers randomly selected from the range [1, NP] that are also different from i. The parameter F, called the scaling factor, amplifies the difference vectors. X_best is the best individual in the current population.
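For illustration, the mutation schemes (5)-(9) can be sketched as below; pop is the (NP, D) population array, and the index handling is our assumption (NP must be at least 6 so that five distinct indices different from i exist).

def mutate(pop, i, best_idx, F, scheme, rng):
    # One mutant vector V_i for target i, following Eqs. (5)-(9).
    NP = pop.shape[0]
    # five mutually distinct indices, all different from the target index i
    r1, r2, r3, r4, r5 = rng.choice([k for k in range(NP) if k != i],
                                    size=5, replace=False)
    best = pop[best_idx]
    if scheme == "rand/1":            # Eq. (5)
        return pop[r1] + F * (pop[r2] - pop[r3])
    if scheme == "best/1":            # Eq. (6)
        return best + F * (pop[r1] - pop[r2])
    if scheme == "rand/2":            # Eq. (7)
        return pop[r1] + F * (pop[r2] - pop[r3]) + F * (pop[r4] - pop[r5])
    if scheme == "best/2":            # Eq. (8)
        return best + F * (pop[r1] - pop[r2]) + F * (pop[r3] - pop[r4])
    if scheme == "rand-to-best/1":    # Eq. (9)
        return pop[r1] + F * (best - pop[r1]) + F * (pop[r2] - pop[r3])
    raise ValueError("unknown mutation scheme: " + scheme)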
C. Crossover Operation
After the mutation process, DE performs a binomial crossover operation on X_i^G and V_i^G to generate a trial vector U_i^G = (u_{i,1}^G, ..., u_{i,D}^G) for each particle i, as shown in (10):

u_{i,j}^G = V_{i,j}^G  if rand_j(0,1) <= CR or j = j_rand;  X_{i,j}^G  otherwise   (10)

where i = 1, ..., NP, j = 1, ..., D, j_rand is a randomly chosen integer in [1, D], rand_j(0,1) is a uniformly distributed random number between 0 and 1 generated for each j, and CR ∈ [0, 1] is called the crossover control parameter. Due to the use of j_rand, the trial vector U_i^G differs from the target vector X_i^G.
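The binomial crossover of (10) then mixes mutant and target components; a short sketch:

import numpy as np

def crossover(target, mutant, CR, rng):
    # Binomial crossover, Eq. (10): take the mutant component when
    # rand_j(0,1) <= CR or j == j_rand, otherwise keep the target component.
    D = target.size
    j_rand = rng.integers(D)
    mask = rng.random(D) <= CR
    mask[j_rand] = True    # guarantees U_i differs from X_i in at least one component
    return np.where(mask, mutant, target)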
D. Selection Operation
The selection operator is performed to select the better one between the target vector X_i^G and the trial vector U_i^G to enter the next generation:

X_i^{G+1} = U_i^G  if f(U_i^G) <= f(X_i^G);  X_i^G  otherwise   (11)

where i = 1, ..., NP and X_i^{G+1} is the target vector in the next population.
E. Algorithm 1: DE Algorithm
Requirements: Max_Cycles, number of particles NP, crossover constant CR and scaling factor F.
Begin
Step 1: Initialize the population
Step 2: Evaluate the population
Step 3: Cycle = 1
Step 4: While for each individual X_i^G do
Step 5: Mutation: DE creates a mutant vector V_i^G using equations (5) to (9), depending on the mutation scheme
Step 6: Crossover: DE creates a trial vector U_i^G using equation (10)
Step 7: Greedy selection: to decide whether it should become a member of generation G + 1 (the next generation), the trial vector U_i^G is compared to the target vector X_i^G using (11)
Step 8: Memorize the best solution found thus far
Step 9: Cycle = Cycle + 1
Step 10: End while
Step 11: Return best solution
End
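Putting these pieces together, Algorithm 1 can be sketched as a short loop; init_population, mutate and crossover are the illustrative helpers sketched above, and the default parameter values are examples rather than the paper's settings.

import numpy as np

def de(objective, lb, ub, NP=40, F=0.5, CR=0.9, max_cycles=200,
       scheme="rand/1", seed=0):
    rng = np.random.default_rng(seed)
    pop = init_population(NP, lb, ub, rng)              # Step 1
    fit = np.array([objective(x) for x in pop])         # Step 2
    for _ in range(max_cycles):                         # Steps 3-4, 9-10
        best_idx = int(np.argmin(fit))
        for i in range(NP):
            v = mutate(pop, i, best_idx, F, scheme, rng)   # Step 5
            u = crossover(pop[i], v, CR, rng)              # Step 6
            fu = objective(u)
            if fu <= fit[i]:                               # Step 7, Eq. (11)
                pop[i], fit[i] = u, fu
    best_idx = int(np.argmin(fit))                       # Steps 8, 11
    return pop[best_idx], fit[best_idx]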
F. Related Work of DE
This section reviews some papers that compared different extensions of DE with the original DE. After that, we concentrate on papers that deal with parameter control in DE.
There have been many research works on controlling the search parameters of DE. The DE control parameters include NP, F and CR.
R. Storn and K. Price [7] argued that these three control parameters are not difficult to set for obtaining good performance. They suggested that NP should be between 5D and 10D, that F = 0.5 is a good initial choice while values of F smaller than 0.4 or larger than 1.0 lead to performance degradation, and that CR can be set to 0.1 or 0.9.
Omar S. Soliman and Lam T. Bui [9] introduced a self-adaptive approach to DE parameters using a variable step length generated by a Gaussian distribution; the mutation amplification and crossover parameter were also made self-adaptive. These parameters evolve during the optimization process.
A. K. Qin and P. N. Suganthan [10] proposed SaDE, a new approach in which the choice of learning strategy and the two control parameters F and CR do not require predetermining; parameter adaptation is applied during evolution. They considered allowing F to take different random values in the range (0, 2], following a normal distribution with mean 0.5 and standard deviation 0.3, for different individuals in the current population. CR is assumed to be normally distributed with mean CRm and standard deviation 0.1. The CR values associated with trial vectors that successfully enter the next generation are recorded; after CR has been regenerated several times over a specified number of generations under the same normal distribution with center CRm and standard deviation 0.1, CRm is recalculated from all the recorded CR values corresponding to successful trial vectors during this period.
J. Liu and J. Lampinen [11] present an algorithm based on Fuzzy Logic Control (FLC) in which the step length is controlled using a single FLC. Its two inputs are the linearly depressed parameter-vector change and the function-value change over the whole population between the current generation and the previous generation.
J. Teo [12] proposed dynamic self-adaptive populations in differential evolution, in addition to self-adapting crossover and mutation rates, and showed that DE with self-adaptive populations produces highly competitive results compared to a conventional DE algorithm with static populations.
J. Brest [13] presented another variant of DE, jDE, which uses self-adaptive mechanisms applied to the control parameters: the step length F and the crossover rate CR are carried by each individual and produce new factors F and CR for the new parent vector:

F_i^{G+1} = F_l + rand_1 * F_u  if rand_2 < τ1;  F_i^G  otherwise   (12)

CR_i^{G+1} = rand_3  if rand_4 < τ2;  CR_i^G  otherwise   (13)

where rand_1, rand_2, rand_3 and rand_4 are uniform random values in [0, 1], and τ1 and τ2 represent the probabilities to adjust the factors F and CR, respectively; the authors set τ1 = τ2. Because F_l = 0.1 and F_u = 0.9, the new F takes a value from [0.1, 0.9] in a random manner, while the new CR takes a value from [0, 1]. The new F and CR are obtained before the mutation process.
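For illustration, the jDE update (12)-(13) amounts to only a few lines; the τ values shown below follow the common setting reported for jDE [13] and are examples rather than values prescribed here.

def jde_update(F_i, CR_i, rng, F_l=0.1, F_u=0.9, tau1=0.1, tau2=0.1):
    # Eqs. (12)-(13): each individual carries its own F and CR, which are
    # regenerated with probabilities tau1 and tau2 before mutation.
    if rng.random() < tau1:
        F_i = F_l + rng.random() * F_u
    if rng.random() < tau2:
        CR_i = rng.random()
    return F_i, CR_i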
Through reviewing the related work, we understood that it is difficult to select the DE learning strategies in the mutation operator and the DE control parameters. To overcome this drawback, we propose the Improvement of Self-Adapting control parameters in Differential Evolution (ISADE), a new version of DE, in this research. The details of ISADE are presented in the next section.

IV. IMPROVEMENT OF SELF-ADAPTING CONTROL PARAMETERS IN DIFFERENTIAL EVOLUTION
To achieve good performance on a specific problem with the original DE algorithm, we need to try all available learning strategies in the mutation operator (usually the five mentioned above) and fine-tune the corresponding critical control parameters NP, F and CR. From experiments we know that the performance of the original DE algorithm is highly dependent on the chosen strategy and parameter settings. Although we may find the most suitable strategy and the corresponding control parameters for a specific problem, this may require a huge amount of computation time. Also, during different evolution stages, different strategies and corresponding parameter settings with different global and local search capabilities might be preferred. Therefore, to overcome this drawback, we attempt to develop a new version of the DE algorithm that can automatically adapt the learning strategies and the parameter settings during evolution. The main ideas of the ISADE algorithm are summarized below.

A. Adaptive Selection of Learning Strategies in the Mutation Operator
ISADE probabilistically selects one out of several available learning strategies in the mutation operator for each individual in the current population. Hence, we should have several candidate learning strategies available to be chosen, and we also need a procedure to determine the probability of applying each learning strategy. In this research, we select three learning strategies in the mutation operator as candidates: "DE/best/1/bin", "DE/best/2/bin" and "DE/rand to best/1/bin", which are respectively expressed as:

DE/best/1: V_{i,j}^G = X_{best,j}^G + F*(X_{r1,j}^G - X_{r2,j}^G)   (14)

DE/best/2: V_{i,j}^G = X_{best,j}^G + F*(X_{r1,j}^G - X_{r2,j}^G) + F*(X_{r3,j}^G - X_{r4,j}^G)   (15)

DE/rand to best/1: V_{i,j}^G = X_{r1,j}^G + F*(X_{best,j}^G - X_{r1,j}^G) + F*(X_{r2,j}^G - X_{r3,j}^G)   (16)

The reason for our choice is that these three strategies have been commonly used in the DE literature and reported to perform well on problems with distinct characteristics [7], [8]. Among them, the "DE/rand to best/1/bin" strategy usually demonstrates good diversity, while the "DE/best/1/bin" and "DE/best/2/bin" strategies show good convergence properties, which we also observe in our experiments.
Since we have three candidate strategies here, the probability p_i of applying each strategy to a particle in the current population is set to the same value, p_1 = p_2 = p_3 = 1/3. With this learning-strategy selection in the mutation operator, the procedure can gradually evolve the most suitable learning strategy at different learning stages for the problem under consideration.
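A sketch of this equal-probability strategy selection, reusing the scheme labels of the mutate helper sketched earlier:

STRATEGIES = ("best/1", "best/2", "rand-to-best/1")   # Eqs. (14)-(16)

def pick_strategy(rng, probs=(1/3, 1/3, 1/3)):
    # each individual draws one of the three candidate mutation strategies
    # with equal probability p1 = p2 = p3 = 1/3
    return STRATEGIES[rng.choice(len(STRATEGIES), p=probs)]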
B. Adaptive Scaling Factor F
In the multi-point search of DE, particles move from their current points to new search points in the design space of the design variables. For example, as shown in Fig. 3, particle A requires only a slight change of its design variables to reach the global optimum solution. On the other hand, particle B cannot reach the global optimum solution without a significant change, and, in addition, particle C has landed in a local optimum solution. Such a situation, in which good individuals and poor individuals are intermingled, can generally occur at any time during the search process. Therefore, we have to recognize each individual's situation and propose a suitable design-variable generation process for each individual's situation in the design space.
... trial solution and reduce the calculation cost.
For better performance of ISADE, the scale factor F should be high in the beginning to provide more exploration, and after a certain number of generations F needs to be small for proper exploitation. To implement this, we use a new approach to calculate the scale factor F as follows:

F^{iter} = F_min + (F_max - F_min) * ((iter_max - iter) / iter_max)^{n^{iter}}   (18)

where F_min, F_max, iter, iter_max and n^{iter} denote the lower boundary of F, the upper boundary of F, the current generation, the maximum generation and the nonlinear modulation index, respectively. From our experiments we assign F_min = 0.15 and F_max = 1.55.
To control F^{iter}, we vary the nonlinear modulation index n^{iter} with the generation as follows:

n^{iter} = n_min + (n_max - n_min) * (iter / iter_max)   (19)

where n_max and n_min are typically chosen in the range (0, 15]. After a number of experiments on the values of n_max and n_min, we have found that the best choice for them is 0.2 and 6.0. The variation of F^{iter} with the iteration number and the nonlinear modulation index n^{iter} is shown in Fig. 5.
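A sketch of the schedule in (18)-(19) as reconstructed above; the linear ramp of n^{iter} is our reading of the description, so treat the exact form as an assumption rather than the authors' definitive rule.

def adaptive_F(iteration, iter_max, F_min=0.15, F_max=1.55, n_min=0.2, n_max=6.0):
    # Eq. (19): nonlinear modulation index varied with the generation count
    n_iter = n_min + (n_max - n_min) * (iteration / iter_max)
    # Eq. (18): large F early (exploration), small F late (exploitation)
    return F_min + (F_max - F_min) * ((iter_max - iteration) / iter_max) ** n_iter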
Fig. 3. Fitness values of individuals A, B and C relative to the global minimum point; individual C is trapped in a local minimum.
C. Adaptive Crossover Control Parameter CR
Ref. [15] suggested counting a success whenever a child substitutes its parent in the next generation. The minimum, maximum and median values of such a set of successes are used for this purpose.
To be able to detect a separable problem, a binomial crossover operator with low values of CR is chosen. To be able to detect non-separable problems, a binomial crossover operator with high values of CR is chosen. In this way, the algorithm will be able to detect whether high values of CR are useful and, furthermore, whether a rotationally invariant crossover is required. A minimum base for CR around its median value is incorporated to avoid stagnation around a single value; Fig. 6 shows this principle. Based on these ideas, we propose the following adaptive mechanism for the crossover.
The control parameter CR is adapted as follows:

CR_i^{G+1} = rand_1  if rand_2 < τ;  CR_i^G  otherwise   (21)

where rand_1 and rand_2 are uniform random values in [0, 1] and τ represents the probability to adjust CR; the same as in [5], we assign τ = 0.10.
After that we adjust CR as follows:

CR_i^{G+1} = ...   (22)

where CR_min, CR_med and CR_max denote the low value, median value and high value of the crossover parameter, respectively. From our experiments over many trials, we assign CR_min = 0.05, CR_med = 0.50 and CR_max = 0.95.

Fig. 6. Suggested CR values: values near CR_min are highly suggested for independent (separable) problems and not recommended for dependent problems, while values near CR_max are highly suggested for dependent (non-separable) problems and not recommended for independent problems.

The purpose of our approach is that the user does not need to tune good values for F and CR, which are problem dependent. The rules for improving the self-adapting control parameters are quite simple; therefore the new version of the DE algorithm does not increase the time complexity in comparison with the original DE algorithm.
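A sketch of the stochastic CR reassignment (21); because the exact adjustment rule (22) is not reproduced above, the final clamp toward [CR_min, CR_max] below is only an illustrative stand-in, not the authors' rule.

def adapt_CR(CR_i, rng, tau=0.10, CR_min=0.05, CR_max=0.95):
    # Eq. (21): regenerate CR with probability tau, in the spirit of jDE
    if rng.random() < tau:
        CR_i = rng.random()
    # stand-in for Eq. (22): keep CR inside [CR_min, CR_max]
    return min(max(CR_i, CR_min), CR_max)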
D. Algorithm 2: ISADE Algorithm
Requirements: Max_Cycles, number of particles NP.
Begin
Step 1: Initialize the population
Step 2: Evaluate and rank the population
Step 3: Cycle = 1
Step 4: While for each individual X_i^G do
Step 5: Adaptive scaling factor F by (17), (18) and (20)
Step 6: Adaptive crossover control parameter CR by (21) and (22)
Step 7: Mutation: adaptive selection of learning strategies in the mutation operator
Step 8: Crossover: DE creates a trial vector using (10)
Step 9: Selection: to decide whether it should become a member of generation G + 1 (the next generation), the trial vector is compared to the target vector (11)
Step 10: Memorize the best solution found thus far
Step 11: Cycle = Cycle + 1
Step 12: End while
Step 13: Return best solution
End
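Algorithm 2 can be sketched by combining the illustrative helpers from the previous sections (init_population, mutate, crossover, pick_strategy, adaptive_F, adapt_CR). Note that the rank-based part of the scale-factor adaptation, (17) and (20), is not reproduced in this text, so the sketch uses only the iteration schedule (18)-(19) and is therefore a simplification, not the full ISADE.

import numpy as np

def isade(objective, lb, ub, NP=40, max_cycles=400, seed=0):
    rng = np.random.default_rng(seed)
    pop = init_population(NP, lb, ub, rng)              # Step 1
    fit = np.array([objective(x) for x in pop])         # Step 2
    CR = np.full(NP, 0.5)                               # per-individual CR
    for cycle in range(max_cycles):                     # Steps 3-4, 11-12
        best_idx = int(np.argmin(fit))                  # best of current population
        F = adaptive_F(cycle, max_cycles)               # Step 5, Eqs. (18)-(19) only
        for i in range(NP):
            CR[i] = adapt_CR(CR[i], rng)                # Step 6, Eq. (21)
            scheme = pick_strategy(rng)                 # Step 7, Eqs. (14)-(16)
            v = mutate(pop, i, best_idx, F, scheme, rng)
            u = crossover(pop[i], v, CR[i], rng)        # Step 8, Eq. (10)
            fu = objective(u)
            if fu <= fit[i]:                            # Step 9, Eq. (11)
                pop[i], fit[i] = u, fu
    best_idx = int(np.argmin(fit))
    return pop[best_idx], fit[best_idx]                 # Step 13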
V. EXPERIMENTS
We apply ISADE to training the same neural networks as in [4], covering the XOR, 3-Bit Parity and Decoder-Encoder problems. These experiments involved 30 trials for each problem; the initial seed was varied randomly in each trial.
Three-layer feed-forward neural networks are used for each problem, i.e. one hidden layer plus the input and output layers. In the network structures, bias nodes are also applied, and the sigmoid function is used as the activation function of the hidden nodes.
A. The Exclusive-OR
The first test problem is the exclusive OR (XOR) Boolean function, a difficult classification problem that maps two binary inputs to a single binary output, as shown in Table I. In the simulations, we used a 2-2-1 feed-forward neural network with six connection weights and no biases (six parameters, XOR6), a 2-2-1 feed-forward neural network with six connection weights and three biases (nine parameters, XOR9), and a 2-3-1 feed-forward neural network with nine connection weights and four biases (thirteen parameters in total, XOR13). For the XOR6, XOR9 and XOR13 problems, the parameter ranges [-100, 100], [-10, 10] and [-10, 10] are used, respectively. The maximum iteration was 200.

TABLE I: BINARY XOR PROBLEM
Input 1  Input 2  Output
0        0        0
0        1        1
1        0        1
1        1        0
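As a usage illustration, the XOR9 case (a 2-2-1 network with biases, nine parameters) could be wired to the earlier sketches roughly as follows; the data follow Table I, the bounds follow the stated range [-10, 10], and NP = 40 is merely an example value, since the population size used in the experiments is not stated in this section.

import numpy as np

# Table I: binary XOR patterns
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# XOR9: 2-2-1 network, (2+1)*2 + (2+1)*1 = 9 parameters, range [-10, 10]
D = 9

def objective(w):
    return mse_error(w, X, T, n_in=2, n_hidden=2, n_out=1)

best_w, best_err = isade(objective, lb=-10 * np.ones(D), ub=10 * np.ones(D),
                         NP=40, max_cycles=200)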
B. The 3-Bit Parity Problem
The second test problem is the three-bit parity problem, i.e. taking the modulus 2 of the summation of the three inputs. In other words, if the number of binary inputs equal to 1 is odd, the output is 1; otherwise it is 0, as shown in Table II. We use a 3-3-1 feed-forward neural network structure for the 3-Bit Parity problem. The parameter range was [-10, 10] for this problem. The maximum iteration was 400.

TABLE II: 3-BIT PARITY PROBLEM
Input 1  Input 2  Input 3  Output
0        0        0        0
0        0        1        1
0        1        0        1
0        1        1        0
1        0        0        1
1        0        1        0
1        1        0        0
1        1        1        1