A Modified Invasive Weed Optimization Algorithm For Training of Feed Forward Neural Networks
Abstract- Invasive Weed Optimization Algorithm (IWO) is an ecologically inspired metaheuristic that mimics the process of weed colonization and distribution and is capable of solving multi-dimensional, linear and nonlinear optimization problems with appreciable efficiency. In this article a modified version of IWO has been used for training feed-forward Artificial Neural Networks (ANNs) by adjusting the weights and biases of the network. It has been found that the modified IWO performs better than another very competitive real parameter optimizer called Differential Evolution (DE) and a few classical gradient-based optimization algorithms for the weight training of feed-forward ANNs, in terms of both learning rate and solution quality. Moreover, IWO can also be used in the validation of reached optima and in the development of regularization terms and non-conventional transfer functions that do not necessarily provide gradient information.

Keywords: metaheuristics, invasive weed optimization, differential evolution, feed-forward neural networks, classification, back propagation.

I. INTRODUCTION

Since the advent of the Artificial Neural Network (ANN), it has been widely used in the fields of pattern recognition and function approximation. Among various kinds of ANNs, Feed-Forward Artificial Neural Networks (FFANNs) are considered to be powerful tools in the area of pattern classification [1], where universal FFANN approximators, for arbitrary finite-input environment measures, can be constituted by using only a single hidden layer [2]. The technique involves training the FFANN with the dataset to be recognized. The process of training an ANN is concerned with adjusting the weights between each pair of individual neurons and the corresponding biases until a close approximation of the desired output is achieved. Usually an ANN, unless specified otherwise, uses the Back Propagation (BP) algorithm for training purposes [3, 4]. The BP algorithm is a trajectory-driven technique, analogous to an error minimizing process. BP learning requires the neuron transfer function to be differentiable and it also suffers from the possibility of falling into local optima. BP is also known to be sensitive to the initial weight settings, and many weight initialization techniques have been proposed to lessen such a possibility [5, 6, 7]. So, BP is considered to be inefficient in searching for the global minimum of the search space [8].

The computational drawbacks of existing derivative-based numerical methods have forced researchers all over the world to rely on metaheuristic algorithms founded on simulations to solve engineering optimization problems. A common factor shared by the metaheuristics is that they combine rules and randomness to imitate some natural phenomena. Two closely related families of algorithms that primarily constitute this field today are the Evolutionary Algorithms (EAs) [9-11] and the Swarm Intelligence (SI) algorithms [12-14]. While the EAs emulate the processes of Darwinian evolution and natural genetics, the SI algorithms draw inspiration from the collective intelligence emerging from the behavior of a group of social insects (like bees, termites and wasps) and also from the socio-cognition theory of human beings.

To overcome the shortcomings of BP in training the ANN, metaheuristic ANN training models, i.e. the combination of stochastic optimization algorithms like Genetic Algorithms (GAs) [9, 10], Particle Swarm Optimization (PSO) [11, 12], and Differential Evolution (DE) [13] with the ANN learning process, have been proposed. A survey and overview of the evolutionary techniques in evolving ANNs can be found in [10]. Such evolutionary ANN models do not exhibit the inefficiencies of the BP algorithms, like the need for differentiability of the neuron transfer function, the possibility of getting trapped in a local optimum, etc. Further, the search techniques of the evolutionary models are population driven instead of the trajectory driven techniques of BP.

The common evolutionary techniques are biologically inspired stochastic global optimization methods. They have one common underlying idea behind them, which is based on a population of individuals [11]. Environmental pressure causes natural selection that in turn causes a rise in the fitness of the population. An objective (fitness) function represents a heuristic estimation of solution quality, and the variation and selection operators drive the search process. Such a process is iterated until convergence is reached. The best population member is expected to be a near-optimum solution [12].

Using a suitable ANN representation, the process of supervised ANN training using an evolutionary method involves performing several iterations in order to minimize or maximize a certain fitness function [8, 13, 14]. Such an optimization process would usually stochastically generate vectors representing the network's weight values, including
biases, calculate the fitness for the generated vectors, and try to keep those vectors that give better fitness values. It is also possible to include the ANN structure in such a representation, in which case the structure can also evolve [15]. The cycle is repeated to generate new offspring and eventually, after several iterations, the training process is halted based on some criteria.

In the recent past Mehrabian and Lucas proposed Invasive Weed Optimization (IWO) [16], a derivative-free metaheuristic algorithm mimicking the ecological behavior of colonizing weeds. Since its inception, IWO has found successful applications in many practical optimization problems like optimization and tuning of a robust controller [16], optimal positioning of piezoelectric actuators [17], developing a recommender system [18], design of an E-shaped MIMO antenna [19], and design of encoding sequences for DNA computing [20]. In this article IWO, with a modification from its original form, has been used as an evolutionary optimization technique to train artificial neural networks for the purpose of pattern recognition and function approximation. A single case for function approximation and three instances of pattern recognition have been used to illustrate the application of the proposed algorithm. Comparison with the results obtained by another very common and largely used evolutionary algorithm, DE [21], and three common back propagation algorithms, namely gradient descent BP, resilient BP and one step secant BP, establishes the superiority of the proposed method.

The rest of the paper is organized in the following way. Section II outlines the method to construct the FFANN structure and its details, Section III gives a short description of the IWO algorithm along with its modification, Section IV describes the performance indices, Section V presents the results on various datasets obtained by IWO and the comparison with the competing algorithms, and Section VI finally concludes the paper and unfolds some future research directions.

II. PROPOSED METHODOLOGY

A. FFANN Model

Artificial Neural Networks are highly interconnected simple processing units designed in a way to model how the human brain performs a particular task. Each of those units, also called neurons, forms a weighted sum of its inputs, to which a constant term called bias is added. This sum is then passed through a transfer function: linear, sigmoid or hyperbolic tangent. Figure 1 shows the internal structure of a neuron.

Figure 1: Structure of a neuron
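To make the neuron model concrete, the sketch below (not from the paper; the layer sizes, the random parameters and the use of NumPy are assumptions for illustration) computes the weighted sum plus bias for each unit and passes it through a tansig-style transfer function, which is exactly the computation an FFANN repeats layer by layer.

```python
import numpy as np

def tansig(x):
    # Hyperbolic tangent sigmoid transfer function (the "tansig" used later in the paper).
    return np.tanh(x)

def neuron(inputs, weights, bias, transfer=tansig):
    # One unit: weighted sum of the inputs plus the bias, passed through a transfer function.
    return transfer(np.dot(weights, inputs) + bias)

def ffann_forward(x, layers):
    # "layers" is a list of (W, b, transfer) tuples; each layer feeds the next.
    a = x
    for W, b, transfer in layers:
        a = transfer(W @ a + b)
    return a

# Toy 1-5-1 network with random parameters applied to a single scalar input.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 1)), rng.normal(size=5), tansig),
          (rng.normal(size=(1, 5)), rng.normal(size=1), tansig)]
print(ffann_forward(np.array([0.3]), layers))
```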
Multi-Layer Perceptrons (MLPs) are the best known and most widely used kind of neural network. Networks with interconnections that do not form any loops are called Feed Forward Artificial Neural Networks (FFANNs). The units are organized in a way that defines the network architecture. In feed forward networks, units are often arranged in layers: an input layer, one or more hidden layers and an output layer. The units in each layer may share the same inputs, but are not connected to each other. Typically, the units in the input layer serve only for transferring the input pattern to the rest of the network, without any processing. The information is processed by the units in the hidden and output layers. Figure 2 depicts the architecture of a generic three-layered FFANN model.

The neural network considered is fully connected in the sense that every unit belonging to each layer is connected to every unit belonging to the adjacent layer. In order to find the optimal network architecture, several combinations were evaluated. These combinations included networks with different numbers of hidden layers, different numbers of units in each layer and different types of transfer functions. We converged to a configuration consisting of one input layer, one hidden layer and one output layer. However, the number of neurons in each layer and the transfer function of each layer vary with the different problems of function approximation and pattern recognition. The structure of the neural network for each case is described later with the individual cases.

This configuration has been proven to be a universal mapper, provided that the hidden layer has enough units. On the one hand, if there are too few units, the network will not be flexible enough to model the data well and, on the other hand, if there are too many units, the network may overfit the data. Typically, the number of units in the hidden layer is chosen by trial and error, selecting a few alternatives and then running simulations to find the one with the best results. Training of feed forward networks is normally performed in a supervised manner. One assumes that a training set is available, given by the dataset, containing both inputs and the corresponding desired outputs, which is presented to the network. An evolutionary algorithm has been used in this training to choose appropriate values of the weights and biases of the ANN so as to minimize the training error of the corresponding problem. The error minimization process is repeated until an acceptable criterion for convergence is reached. The knowledge acquired by the neural network through the learning process is tested by applying new data that it has never seen before, called the testing set. The network should be able to generalize and produce accurate outputs for this unseen data. It is undesirable to overtrain the neural network, meaning that the network
would only work well on the training set, and would not generalize well to new data outside the training set. For ANNs the most common learning algorithm is the back propagation algorithm. However, the standard back propagation learning algorithm is not efficient numerically and tends to converge slowly. To improve the training results we have used the ecologically inspired algorithm IWO rather than the BP algorithms, and have found that the proposed algorithm can outperform the others, i.e. DE, traingd, trainoss, trainrp, etc.

III. CLASSICAL IWO AND ITS MODIFICATION

Invasive Weed Optimization (IWO) is a metaheuristic algorithm that mimics the colonizing behavior of weeds. IWO can be summarized as follows.

A. Initialization

A finite number of weeds are initialized randomly in the entire search space. For the training of the FFANN, each weed consists of a string of network weights followed by the network biases. So the i'th weed can be represented as

$\vec{W}_i = [w_{i,1}, w_{i,2}, \ldots, w_{i,n}, b_{i,1}, b_{i,2}, \ldots, b_{i,m}]$

where $w_{i,p}$ is the p'th weight term of the network, $b_{i,q}$ is the q'th bias term of the network, and $n$ and $m$ are the total numbers of weights and biases respectively.
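A concrete, hedged illustration of this encoding is sketched below (the helper names `flatten_params`/`unflatten_params` and the 1-5-1 layer sizes are assumptions for illustration, not from the paper): the weight matrices and bias vectors are laid end to end in one real-valued vector that a weed can carry, and are restored before the network is evaluated.

```python
import numpy as np

def flatten_params(weights, biases):
    # Concatenate all weight matrices followed by all bias vectors into one weed vector.
    parts = [W.ravel() for W in weights] + [b.ravel() for b in biases]
    return np.concatenate(parts)

def unflatten_params(vector, layer_sizes):
    # Rebuild the per-layer weight matrices and bias vectors from a flat weed vector.
    weights, biases, idx = [], [], 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(vector[idx:idx + n_in * n_out].reshape(n_out, n_in))
        idx += n_in * n_out
    for n_out in layer_sizes[1:]:
        biases.append(vector[idx:idx + n_out])
        idx += n_out
    return weights, biases

# Example: a 1-5-1 network has n = 1*5 + 5*1 = 10 weights and m = 5 + 1 = 6 biases.
layer_sizes = [1, 5, 1]
rng = np.random.default_rng(1)
weed = rng.uniform(-500, 500, size=16)          # one weed = 16 parameters
W, b = unflatten_params(weed, layer_sizes)
assert np.allclose(flatten_params(W, b), weed)  # round-trip check
```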
B. Reproduction

Each weed $W_{i,G}$ of the population at generation $G$ is allowed to produce seeds depending on its own fitness, as well as the highest and lowest fitness of the colony, such that the number of seeds produced by a weed increases linearly from the lowest possible number for the weed with the worst fitness to the maximum number of seeds for the weed with the best fitness.

C. Spatial Dispersal

The generated seeds are then randomly distributed over the entire search space as normally distributed random numbers with zero mean but varying variance. This means that the seeds will be randomly distributed in the neighborhood of the parent weed. Here the standard deviation ($\sigma$) of the random function is reduced from a previously defined initial value $\sigma_{initial}$ to a final value $\sigma_{final}$ in every iteration of the algorithm following eq. (1):

$\sigma_{iter} = \dfrac{(iter_{max} - iter)^{pow}}{(iter_{max})^{pow}}\,(\sigma_{initial} - \sigma_{final}) + \sigma_{final}$    (1)

where $iter_{max}$ is the maximum number of iterations, $\sigma_{iter}$ is the standard deviation at the present iteration and $pow$ is the non-linear modulation index. This step ensures that the probability of dropping a seed in a distant area decreases nonlinearly with each iteration, which results in grouping of the fitter plants and elimination of inappropriate plants.
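The schedule of eq. (1) can be sketched in a few lines (illustrative only; the values of sigma_initial and sigma_final follow the percentages given in Table 1 for a [-500, 500] search range, while the modulation index pow = 3 is a common choice in the IWO literature rather than a value stated in this text).

```python
def sigma_schedule(iteration, iter_max, sigma_initial, sigma_final, pow_=3):
    # Eq. (1): the seed-dispersal standard deviation shrinks nonlinearly
    # from sigma_initial down to sigma_final as the iterations progress.
    frac = (iter_max - iteration) ** pow_ / iter_max ** pow_
    return frac * (sigma_initial - sigma_final) + sigma_final

# Search range [-500, 500]: sigma_initial = 10% and sigma_final = 0.01% of its width (Table 1).
sigma_initial, sigma_final = 0.10 * 1000, 0.0001 * 1000
for it in (0, 50, 100, 150, 200):
    print(it, round(sigma_schedule(it, 200, sigma_initial, sigma_final), 4))
```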
E. Modification of IWO

Here we aim at reducing the standard deviation $\sigma$ for a weed when the objective function value of that particular weed nears the minimum objective function value of the current population, so that the weed disperses its seeds within a small neighborhood of the suspected optimum. Eqn. (2) describes the scheme by which the standard deviation $\sigma_i$ of
each weed is adapted, where

$\Delta f_i = |f(\vec{W}_i) - f(\vec{W}_{best})|$    (3)

so that when $\Delta f_i \to 0$, $\sigma_i \to \sigma_{final}$. As $\sigma_{final} \ll \sigma_{initial}$, when $\Delta f_i \to 0$, i.e. the i'th weed is in close proximity of the optimum, the standard deviation of the weed becomes very small, resulting in dispersal of the corresponding seeds within a small neighborhood around the optimum. Thus in this scheme, instead of using a fixed $\sigma$ for all weeds in a particular iteration, we vary the standard deviation of each weed depending on its objective function value. So this scheme on the one hand increases the explorative power of the weeds and on the other creates some probability for the seeds dispersed by the undesirable weeds (the weeds with higher objective function values) to become fitter plants. These features were absent in the classical IWO algorithm. Figure 3 shows the variation of $\sigma_i$ vs $\Delta f_i$ and Figure 4 represents the flowchart of the modified IWO algorithm.

Figure 3: Variation of $\sigma_i$ with $\Delta f_i$ for $\sigma_{final} = 0.001$
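Since eq. (2) itself is not reproduced in this text, the following sketch is only one plausible realization of the idea just described, stated explicitly as an assumption rather than the authors' exact rule: a weed whose objective value is close to the population best receives a spread near sigma_final, while a poor weed keeps a spread near the iteration-level value.

```python
import numpy as np

def per_weed_sigma(f_weed, f_best, sigma_iter, sigma_final, scale=1.0):
    # Illustrative stand-in for eq. (2): delta_f = |f(W_i) - f(W_best)| drives the spread;
    # delta_f -> 0 gives sigma_i -> sigma_final, large delta_f gives sigma_i -> sigma_iter.
    delta_f = abs(f_weed - f_best)
    return sigma_final + (sigma_iter - sigma_final) * (1.0 - np.exp(-delta_f / scale))

f_best = 0.02
for f_weed in (0.02, 0.05, 0.5, 5.0):
    print(f_weed, round(per_weed_sigma(f_weed, f_best, sigma_iter=10.0,
                                       sigma_final=0.001), 4))
```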
IV. INDEX OF PERFORMANCE EVALUATION

In this paper we have used two different classes of problem set: Function Approximation and Pattern Recognition. The indices of performance evaluation for these two classes are elaborated as follows.

A. Index for Function Approximation Problem Set

In the case of the function approximation problem, the regularized Mean Square Error (MSEREG), i.e. the mean of the squared differences between the desired and obtained outputs augmented with a regularization term, is used as the index of performance evaluation. As the neural network may have a large number of solutions of network weights and biases having the same mean square error, the network parameters may grow explosively. So, to allow only network parameters with the lowest numerical values to be selected, a penalty term consisting of the squares of the weights and biases has been added to the mean square error to form the complete fitness function (MSEREG) of the evolutionary optimization algorithms like IWO and DE for training the neural network. Hence, for well trained networks, MSEREG should be as small as possible.
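A minimal sketch of such a regularized fitness is given below (an illustration under assumptions: the 1-5-1 tansig network of the function approximation experiment is hard-coded, and the relative weighting `penalty_ratio` of the error and penalty terms is not specified in the text, so it is introduced here in the spirit of MATLAB's msereg performance measure).

```python
import numpy as np

def msereg_fitness(weed, X, T, penalty_ratio=0.9):
    # Fitness of one weed encoding a 1-5-1 tansig network: mean squared output
    # error plus a penalty on the squared weights and biases (smaller is better).
    W1 = weed[0:5].reshape(5, 1);  W2 = weed[5:10].reshape(1, 5)
    b1 = weed[10:15];              b2 = weed[15:16]
    hidden = np.tanh(X @ W1.T + b1)          # X has shape (samples, 1)
    output = np.tanh(hidden @ W2.T + b2)     # shape (samples, 1)
    mse = np.mean((output - T) ** 2)
    msw = np.mean(weed ** 2)                 # mean squared parameter value
    return penalty_ratio * mse + (1.0 - penalty_ratio) * msw

# Example: fitness of a random weed on 50 samples of sin(x).
rng = np.random.default_rng(2)
X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
T = np.sin(X)
print(msereg_fitness(rng.uniform(-500, 500, size=16), X, T))
```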
B. Index for Pattern Recognition Problem Set

In the case of the pattern recognition problems, the Classification Error Percentage (CEP), as defined in (4), is used as the fitness function of the IWO and DE algorithms:

$CEP = \dfrac{E_p}{P} \times 100$    (4)

where
$E_p$ = total number of incorrectly recognized training or testing patterns,
$P$ = total number of training or testing patterns.

Hence, for well trained networks, CEP should be as small as possible.
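Eq. (4) reduces to a few lines of code; the sketch below (illustrative, assuming the usual one-output-neuron-per-class decoding implied by the network structures in Table 3) counts the misclassified patterns and expresses them as a percentage.

```python
import numpy as np

def cep(network_outputs, targets):
    # Eq. (4): percentage of patterns whose predicted class (the output neuron
    # with the largest activation) differs from the target class.
    predicted = np.argmax(network_outputs, axis=1)
    actual = np.argmax(targets, axis=1)
    E_p = np.count_nonzero(predicted != actual)
    return 100.0 * E_p / len(targets)

# Example: 4 patterns, 2 classes, one misclassified -> CEP = 25%.
outputs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
targets = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
print(cep(outputs, targets))
```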
V. EXPERIMENTS AND RESULTS

A. Competitor Algorithms

The performance of the modified IWO has been compared with the following algorithms.

• Differential Evolution (DE): a novel evolutionary algorithm first introduced by Storn and Price [21], which is inspired by the theory of evolution. It has been successfully used in many artificial and real optimization problems and applications [22], including training a neural network [23]. In this paper the DE/rand/1/bin variant is used for comparison.
• Back propagation algorithm with an adaptive learning rate (TRAINGDX).
• One step secant learning method (TRAINOSS).
• Resilient back propagation algorithm (TRAINRP).

B. Experimental Results

In this section the performance of IWO is evaluated by experiments. The experiments were conducted with various configurations of FFANN and two commonly used problem domains: Function Approximation and Pattern Recognition.

1) Function Approximation

Here the IWO-trained FFANN has been used to approximate a very simple and conventional function, sin(x). The structure of the selected FFANN consists of one input, one hidden and one output layer containing 1, 5 and 1 neurons respectively. The transfer functions of the network layers are tansig-tansig-tansig (tansig: Hyperbolic Tangent Sigmoid)
respectively. Such a network has been selected after much experimentation. The same network architecture is used for the other competing algorithms.

Table 1: Parametric set-up for IWO and DE/rand/1/bin

IWO:
  Search range: [-500, 500]
  Maximum population size: 200
  Initial population size: 50
  Max seed: 8
  Min seed: 1
  σ_initial: 10% of the entire search range
  σ_final: 0.01% of the entire search range

DE/rand/1/bin:
  Search range: [-500, 500]
  Population size: 200
  Scale factor, F: 0.8
  Crossover probability, Cr: 0.9

Figure 4: Flowchart of the modified IWO algorithm (INITIALIZATION: initialize randomly generated weeds in the entire search space; REPRODUCTION AND SPATIAL DISPERSAL: create the seed population by producing normally distributed seeds with zero mean and standard deviation σ_i for each weed following eqn. (2), depending on the fitness of the weed).
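To tie the pieces together, here is a compact sketch of the training loop suggested by the flowchart (a hedged illustration, not the authors' code: competitive exclusion is realized as simple truncation to the maximum population size, the per-weed spread reuses the illustrative stand-in for eq. (2) from Section III, and the fitness can be any of the measures of Section IV; a simple sphere function keeps the example self-contained).

```python
import numpy as np

def modified_iwo(fitness, dim, iters=200, init_pop=50, max_pop=200,
                 seed_min=1, seed_max=8, lo=-500.0, hi=500.0,
                 sigma_initial=100.0, sigma_final=0.1, pow_=3, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    pop = rng.uniform(lo, hi, size=(init_pop, dim))               # initialization
    cost = np.array([fitness(w) for w in pop])
    for it in range(iters):
        sigma_iter = ((iters - it) ** pow_ / iters ** pow_) * \
                     (sigma_initial - sigma_final) + sigma_final  # eq. (1)
        worst, best = cost.max(), cost.min()
        seeds = []
        for w, c in zip(pop, cost):
            # Reproduction: seed count grows linearly from worst to best fitness.
            n_seeds = int(round(seed_min + (seed_max - seed_min) *
                                (worst - c) / (worst - best + 1e-12)))
            # Spatial dispersal with a fitness-dependent spread (illustrative eq. (2)).
            sigma_i = sigma_final + (sigma_iter - sigma_final) * \
                      (1.0 - np.exp(-abs(c - best)))
            for _ in range(n_seeds):
                seeds.append(np.clip(w + rng.normal(0.0, sigma_i, dim), lo, hi))
        pop = np.vstack([pop, seeds])
        cost = np.array([fitness(w) for w in pop])
        keep = np.argsort(cost)[:max_pop]                          # competitive exclusion
        pop, cost = pop[keep], cost[keep]
    return pop[0], cost[0]

# Example: minimize a simple sphere function over 16 parameters.
best_weed, best_cost = modified_iwo(lambda w: float(np.sum(w ** 2)), dim=16, iters=50)
print(best_cost)
```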
On the training data a BP-trained network may occasionally attain a lesser performance index. Its limitation gets exposed when it is tested with some new data points, as evident from the pattern recognition problems to be discussed next. The approximated sine curve obtained by IWO, along with the original one, is shown in Figure 5.

Figure 5: Approximated sine curve along with the original one, obtained by IWO

2) Pattern Recognition Problem

In this paper three datasets, Diabetes, Cancer and Glass, have been used, which are available online from [25]. The properties of the datasets are summarized in Table 3. The last row of Table 3 shows the percentage of the various classes in each dataset. There are two classes in each of the Cancer and Diabetes datasets and six classes in the Glass dataset.

A three layer FFANN was used for each problem to work as a pattern classifier. For all three problems the transfer functions of the layers have been chosen as purelin-tansig-tansig respectively. Such a selection has been made after much experimentation to obtain the best possible results. The network configurations used for each dataset are summarized in Table 3. The number of neurons in the output layer is equal to the number of classes in the corresponding dataset. The parametric set-up for the IWO and DE/rand/1/bin algorithms is the same as shown in Table 1. 80% of each dataset has been used for training and the remaining 20% for testing. Both the training and testing CEPs obtained by IWO and the other competing algorithms are shown in Tables 4, 5 and 6.

The various classes are numbered in the Tables according to Table 3. We have run 50 independent training sessions of the IWO, DE and BP algorithms for each of the selected datasets and report the mean of these runs along with the standard deviation, best and worst CEP. It is evident that in some instances the training CEP obtained by the BP algorithms is better than that obtained by IWO. This again occurs due to overtraining of the network on the training data. However, the testing CEP obtained by IWO is much better than those obtained by the DE and BP algorithms for each dataset, as can be verified from Tables 4, 5 and 6. This fact establishes the claim of "overtraining" of the network, i.e. fitting only the training dataset, by the BP algorithms. Whether a network is overtrained or not can be understood only by comparing the testing CEP, not the training one. Now, as IWO comfortably beats the other competitors where the testing CEP is concerned, these experiments establish the superiority of IWO in training FFANNs for use as pattern classifiers.

Table 3: Properties of the three datasets and the configuration of the neural networks used

CANCER:   FFANN structure 9-8-2, 169 weights, 19 biases; classes: I. Benign (65.14%), II. Malign (34.86%)
DIABETES: FFANN structure 8-7-2, 134 weights, 17 biases; classes: I. Diabetes (33.07%), II. No Diabetes (66.92%)
GLASS:    FFANN structure 9-12-6, 261 weights, 27 biases; classes: I. Building float processed (40.19%), II. Building non-float processed (27.10%), III. Vehicle float processed (6.54%), IV. Containers (8.41%), V. Tableware (5.61%), VI. Headlamps (12.15%)
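The evaluation protocol described above (an 80/20 train/test split and statistics over 50 independent runs) can be sketched as follows; the random shuffling, the seeding and the `train_once` placeholder are assumptions for illustration, not details given in the paper.

```python
import numpy as np

def split_80_20(patterns, targets, rng):
    # Random 80% / 20% split of a dataset into training and testing portions.
    idx = rng.permutation(len(patterns))
    cut = int(0.8 * len(patterns))
    tr, te = idx[:cut], idx[cut:]
    return (patterns[tr], targets[tr]), (patterns[te], targets[te])

def repeated_runs(train_once, patterns, targets, runs=50, seed=0):
    # Run the trainer several times and report mean, std, best and worst test CEP.
    rng = np.random.default_rng(seed)
    ceps = []
    for _ in range(runs):
        (Xtr, Ttr), (Xte, Tte) = split_80_20(patterns, targets, rng)
        ceps.append(train_once(Xtr, Ttr, Xte, Tte))   # returns the testing CEP
    ceps = np.array(ceps)
    return ceps.mean(), ceps.std(), ceps.min(), ceps.max()

# Dummy example: a "trainer" that just returns a random CEP.
X = np.zeros((100, 8)); T = np.zeros((100, 2))
print(repeated_runs(lambda *a: np.random.default_rng().uniform(0, 30), X, T, runs=5))
```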
Table 4: Comparison of the CEP for the CANCER dataset among IWO and the other algorithms
Table 5: Comparison of the CEP for the DIABETES dataset among IWO and the other algorithms
Table 6: Comparison of the CEP for the GLASS dataset among IWO and the other algorithms
VI. CONCLUSION

The training time required for obtaining convergence with this algorithm sometimes becomes intolerable for larger datasets. But as the performance is much better, a trade-off has to be made between time and performance, and future work should consider this trade-off. For larger FFANNs with larger datasets, the intrinsic parallel nature of the FFANN feed-forward calculations would invite the use of a parallel implementation to speed up the fitness function calculations, resulting in a reduction of the overall training time required by our proposed algorithm, as sketched below. Future research may also focus on the recognition of more complex and useful applications like speech, characters, etc.
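As a pointer to that parallelization idea, the following sketch (an assumption-based illustration, not part of the original study) farms the fitness evaluation of a weed population out to worker processes; this is attractive because each weed's fitness needs only the dataset and its own parameter vector.

```python
from multiprocessing import Pool

import numpy as np

def fitness(weed):
    # Placeholder fitness: any of the measures from Section IV could be used here;
    # a simple sphere function keeps this sketch self-contained.
    return float(np.sum(np.asarray(weed) ** 2))

def evaluate_population(population, workers=4):
    # Evaluate all weeds in parallel; each worker handles a slice of the population.
    with Pool(processes=workers) as pool:
        return np.array(pool.map(fitness, population))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    population = list(rng.uniform(-500, 500, size=(200, 16)))
    print(evaluate_population(population)[:5])
```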
ACKNOWLEDGEMENT

This work was supported by the Czech Science Foundation under grant No. 102/09/1494.

REFERENCES

[1] U. Seiffert, Training of Large-Scale Feed Forward Neural Networks, International Joint Conference on Neural Networks (2006) 5324-5329.
[2] X. Jiang, A.H.K.S. Wah, Constructing and Training Feed-Forward Neural Networks for Pattern Classification, Pattern Recognition 36 (2003) 853-867.
[3] A.T. Chronopoulos, J. Sarangapani, A Distributed Discrete-Time Neural Network Architecture for Pattern Allocation and Control, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02) (2002) 204-211.
[4] K.M. Lane, R.D. Neidinger, Neural Networks from Idea to Implementation, ACM SIGAPL APL Quote Quad 25(3) (1995) 27-37.
[5] L. Fausett, Fundamentals of Neural Networks: Architecture, Algorithms, and Applications, Prentice Hall, New Jersey, 1994.
[6] L.G.C. Hamey, XOR Has No Local Minima: A Case Study in Neural Network Error Surface Analysis, Neural Networks 11 (1998) 669-681.
[7] G. Wei, Study of Evolutionary Neural Network based on Ant Colony Optimization, International Conference on Computational Intelligence and Security Workshops (2007) 3-6.
[8] D. Kim, H. Kim, D. Chung, A Modified Genetic Algorithm for Fast Training Neural Networks, Advances in Neural Networks - ISNN 2005, Springer Berlin/Heidelberg, volume 3496/2005, 660-665.
[9] W. Gao, Evolutionary Neural Network based on New Ant Colony Algorithm, International Symposium on Computational Intelligence and Design (ISCID) (2008) 318-321.
[10] X. Yao, Evolving Artificial Neural Networks, Proceedings of the IEEE 87(9) (1999) 1423-1447.
[11] H. Pierreval, C. Caux, J.L. Paris, F. Viguier, Evolutionary Approaches to Design and Organization of Manufacturing Systems, Computers and Industrial Engineering 44 (2003) 339-364.
[12] M.G.H. Omran, M. Mahdavi, Global Best Harmony Search, Applied Mathematics and Computation 198 (2008) 643-656.
[13] E. Alba, J.F. Chicano, Training Artificial Neural Networks with GA Hybrid Algorithm, Genetic and Evolutionary Computation Conference (GECCO) 2004.
[14] R.E. Dorsey, J.D. Johnson, W.J. Mayer, A Genetic Algorithm for the Training of Feedforward Neural Networks, Advances in Artificial Intelligence in Economics, Finance and Management (1994) 93-111.
[15] J. Yu, S. Wang, L. Xi, Evolving Artificial Neural Networks using an Improved PSO and DPSO, Neurocomputing 71 (2008) 1054-1060.
[16] A.R. Mehrabian, C. Lucas, A Novel Numerical Optimization Algorithm Inspired from Weed Colonization, Ecological Informatics 1 (2006) 355-366.
[17] A.R. Mehrabian, A. Yousefi-Koma, Optimal Positioning of Piezoelectric Actuators on a Smart Fin using Bio-inspired Algorithms, Aerospace Science and Technology 11 (2007) 174-182.
[18] H. Sepehri Rad, C. Lucas, "A Recommender System based on Invasive Weed Optimization Algorithm", IEEE Congress on Evolutionary Computation, CEC 2007, pp. 4297-4304.
[19] A.R. Mallahzadeh, S. Es'haghi, A. Alipour, "Design of an E-shaped MIMO Antenna using IWO Algorithm for Wireless Application at 5.8 GHz", Progress in Electromagnetics Research, PIER 90 (2009) 187-203.
[20] X. Zhang, Y. Wang, G. Cui, Y. Niu, J. Xu, Application of a Novel IWO to the Design of Encoding Sequences for DNA Computing, Comput. Math. Appl. 57 (2009) 2001-2008.
[21] R. Storn, K.V. Price, "Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces", Technical Report TR-95-012, ICSI, 1995.
[22] J. Lampinen, A Bibliography of Differential Evolution Algorithm, http://www.lut.fi
[23] T. Masters, W. Land, A New Training Algorithm for the General Regression Neural Network, IEEE International Conference on Systems, Man and Cybernetics, Computational Cybernetics and Simulation, 3 (1997) 1990-1994.
[24] Y. Liu, J.A. Starzyk, Z. Zhu, Optimized Approximation Algorithm in Neural Networks without Overfitting, IEEE Transactions on Neural Networks 19(6) (2008) 983-995.
[25] ftp://ftp.ira.uka.de/pub/neuron/proben1.tar.gz