

Université Libre de Bruxelles
Institut de Recherches Interdisciplinaires
et de Développements en Intelligence Artificielle

Training feed-forward neural networks
with ant colony optimization:
An application to pattern classification

Christian Blum and Krzysztof Socha

IRIDIA – Technical Report Series


Technical Report No.
TR/IRIDIA/2005-038
December 2005

Published in proceedings of Hybrid Intelligent Systems Conference, HIS-2005


IRIDIA – Technical Report Series
ISSN 1781-3794

Published by:
IRIDIA, Institut de Recherches Interdisciplinaires
et de Développements en Intelligence Artificielle
Université Libre de Bruxelles
Av F. D. Roosevelt 50, CP 194/6
1050 Bruxelles, Belgium

Technical report number TR/IRIDIA/2005-038

The information provided is the sole responsibility of the authors and does not necessarily reflect the opinion of the members of IRIDIA. The authors take full responsibility for any copyright breaches that may result from publication of this paper in the IRIDIA – Technical Report Series. IRIDIA is not responsible for any use that might be made of data appearing in this publication.

Training feed-forward neural networks with ant
colony optimization: An application to pattern
classification

Christian Blum
ALBCOM, LSI
Universitat Politècnica de Catalunya
Barcelona, Spain
Email: [email protected]

Krzysztof Socha
IRIDIA
Université Libre de Bruxelles
Brussels, Belgium
Email: [email protected]

Abstract— Ant colony optimization is an optimization technique that was inspired by the foraging behaviour of real ant colonies. Originally, the method was introduced for the application to discrete optimization problems. Recent research efforts led to the development of algorithms that are also applicable to continuous optimization problems. In this work we present one of the most successful variants for continuous optimization and apply it to the training of feed-forward neural networks for pattern classification. For evaluating our algorithm we apply it to classification problems from the medical field. The results show, first, that our algorithm is comparable to specialized algorithms for neural network training, and second, that our algorithm has advantages over other general purpose optimizers.

I. INTRODUCTION

Pattern classification is an important real-world problem. In the medical field, for example, pattern classification problems arise when physicians are interested in reliable classifiers for diseases based on a number of measurements. Feed-forward neural networks (NNs) are commonly used systems for the task of pattern classification [5], but require prior configuration. Generally the configuration problem consists of two parts: First, the structure of the feed-forward NN has to be determined. Second, the numerical weights of the neuron connections have to be determined such that the resulting classifier is as correct as possible. In this work we focus only on the second part, namely the optimization of the connection weights. We adopt the NN structures from earlier works on the same subject.

Ant colony optimization (ACO) is an optimization technique that was introduced for the application to discrete optimization problems in the early 90s by M. Dorigo and colleagues [9], [10], [11]. The origins of ant colony optimization are in a field called swarm intelligence (SI) [6], which studies the use of certain properties of social insects, flocks of birds, or fish schools for tasks such as optimization. The inspiring source of ACO is the foraging behaviour of real ant colonies. When searching for food, ants initially explore the area surrounding their nest in a random manner. While moving, ants leave a chemical pheromone trail on the ground. As soon as an ant finds a food source, it evaluates the quantity and the quality of the food and carries some of it back to the nest. During the return trip, the quantity of pheromone that an ant leaves on the ground may depend on the quantity and quality of the food. The pheromone trails guide other ants to the food source. It has been shown in [8] that the indirect communication between the ants via pheromone trails enables them to find shortest paths between their nest and food sources. The shortest path finding capabilities of real ant colonies are exploited in artificial ant colonies for solving optimization problems.

While ACO algorithms were originally introduced to solve discrete (i.e., combinatorial) optimization problems, their adaptation to solve continuous optimization problems enjoys increasing attention. Early applications of the ant metaphor to continuous optimization include algorithms such as Continuous ACO (CACO) [2], the API algorithm [17], and Continuous Interacting Ant Colony (CIAC) [12]. However, all these approaches follow the original ACO framework rather loosely. The latest approach, which is at the same time the approach that is closest to the spirit of ACO for combinatorial problems, was proposed in [20]. In this work we extend this approach and apply it to the problem of optimizing the weights of feed-forward NNs for the task of pattern classification.

The outline of our work is as follows. In Section 2 we briefly present the structure of feed-forward NNs for the purpose of pattern classification. Then, in Section 3 we present the ACO algorithm, while in Section 4 we compare our algorithm to methods specialized for feed-forward NN training, as well as to a genetic algorithm. Finally, in Section 5 we offer a conclusion and a glimpse of future work.

II. FEED-FORWARD NEURAL NETWORKS FOR PATTERN CLASSIFICATION

A dataset for pattern classification consists of a number of patterns together with their correct classification. Each pattern
consists of a number of measurements (i.e., numerical values). The goal consists in generating a classifier that takes the measurements of a pattern as input and provides its correct classification as output. A popular type of classifier is the feed-forward neural network (NN).

A feed-forward NN consists of an input layer of neurons, an arbitrary number of hidden layers, and an output layer (for an example, see Figure 1). Feed-forward NNs for pattern classification purposes consist of as many input neurons as the patterns of the data set have measurements, i.e., for each measurement there exists exactly one input neuron. The output layer consists of as many neurons as the data set has classes, i.e., if the patterns of a medical data set belong to either the class normal or to the class pathological, the output layer consists of two neurons. Given the weights of all the neuron connections, in order to classify a pattern one provides its measurements as input to the input neurons and propagates the output signals from layer to layer until the output signals of the output neurons are obtained. Each output neuron is identified with one of the possible classes. The output neuron that produces the highest output signal classifies the respective pattern.

[Fig. 1. (a) A feed-forward NN with one hidden layer of neurons. Note that each neuron of a certain layer is connected to each neuron of the next layer. (b) A single neuron (from either the hidden layer or the output layer). The neuron receives inputs (i.e., signals i_l, weighted by weights w_l) from each neuron of the previous layer. Additionally, it receives a so-called bias input i_bias with weight w_bias. The transfer function f(·) of the neuron transforms the sum of all the weighted inputs into an output signal, which serves as input for all the neurons of the following layer. Input signals, output signals, biases, and weights are real values.]

The process of generating a NN classifier consists of determining the weights of the connections between the neurons such that the NN classifier shows a high performance. Since the weights are real-valued, this is a continuous optimization problem of the following form: Given are n decision variables {X_1, ..., X_n} with continuous domains. These domains are not restricted, i.e., each real number is feasible. Furthermore, the problem is unconstrained, which means that the variable settings do not depend on each other. Sought is a solution that minimizes the objective function called square error percentage (SEP):

    SEP = 100 \, \frac{o_{max} - o_{min}}{n_o \cdot n_p} \sum_{p=1}^{n_p} \sum_{i=1}^{n_o} \left( t_i^p - o_i^p \right)^2 ,    (1)

where o_{max} and o_{min} are respectively the maximum and minimum values of the output signals of the output neurons, n_p is the number of patterns, n_o is the number of output neurons, and t_i^p and o_i^p are respectively the expected and actual values of output neuron i for pattern p.
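To make the objective concrete, the following minimal sketch (not taken from the paper; it assumes a sigmoid transfer function, a single hidden layer, and hypothetical helper names such as forward and sep) shows how a candidate weight setting could be evaluated: a pattern is propagated through the network, the class is read off the strongest output neuron, and the SEP of Equation (1) is accumulated over all training patterns.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Propagate one input pattern through a network with a single hidden layer.
    Shapes: W1 (n_hidden, n_in), b1 (n_hidden,), W2 (n_out, n_hidden), b2 (n_out,)."""
    hidden = sigmoid(W1 @ x + b1)        # hidden-layer output signals
    return sigmoid(W2 @ hidden + b2)     # output-layer signals in (0, 1)

def classify(x, W1, b1, W2, b2):
    # The output neuron producing the highest signal determines the class.
    return int(np.argmax(forward(x, W1, b1, W2, b2)))

def sep(patterns, targets, W1, b1, W2, b2, o_min=0.0, o_max=1.0):
    """Square error percentage of Equation (1). `targets` holds the expected
    output values t_i^p (one row per pattern, one column per output neuron)."""
    n_p, n_o = targets.shape
    outputs = np.array([forward(x, W1, b1, W2, b2) for x in patterns])
    return 100.0 * (o_max - o_min) / (n_o * n_p) * np.sum((targets - outputs) ** 2)
```

With a sigmoid transfer function the output signals lie in (0, 1), so o_max - o_min = 1 and SEP reduces to 100/(n_o · n_p) times the summed squared error.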
III. ANT COLONY OPTIMIZATION FOR CONTINUOUS OPTIMIZATION

ACO algorithms are iterative methods that try to solve optimization problems as follows. At each iteration, candidate solutions are probabilistically constructed by sampling a probability distribution over the search space. Then, this probability distribution is modified using the better ones among the constructed solutions. The goal is to bias, over time, the sampling of solutions towards areas of the search space that contain high quality solutions.

In ACO algorithms for discrete optimization problems, the probability distribution is discrete and is derived from artificial pheromone information. In a way, the pheromone information represents the stored search experience of the algorithm. In contrast, our ACO algorithm for continuous optimization, henceforth denoted by ACO*, utilizes a continuous probability density function (PDF). This density function is, for each solution construction, produced from a population P of solutions that the algorithm keeps at all times. The management of this population works as follows. Before the start of the algorithm, the population, whose size k is a parameter of the algorithm, is filled with random solutions. Even though the domains of the decision variables are not restricted, we used the initial interval [-1, 1] for the sake of simplicity. Then, at each iteration a set of m solutions is generated and added to P. The same number of the worst solutions is removed from P. This biases the search process towards the best solutions found during the search.
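As an illustration of this population management (a sketch under our own naming, not the authors' code; `evaluate` stands for any objective to be minimized, such as the SEP function sketched above):

```python
import numpy as np

def init_population(k, n, evaluate, rng):
    # Fill the population with k random solutions, each of the n variables
    # drawn from the initial interval [-1, 1]; keep the population sorted best-first.
    solutions = [rng.uniform(-1.0, 1.0, size=n) for _ in range(k)]
    return sorted(solutions, key=evaluate)

def update_population(population, new_solutions, evaluate):
    # Add the m newly constructed solutions and remove the same number of
    # worst ones, biasing the search towards the best solutions found so far.
    m = len(new_solutions)
    merged = sorted(list(population) + list(new_solutions), key=evaluate)
    return merged[:len(merged) - m]
```

With the settings used later in the experiments (Table II), k would be, for example, 148 for Cancer1, and m = 2 new solutions would be constructed per iteration.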
For constructing a solution, an ant acts as follows. First, it transforms the original set of decision variables X = {X_1, ..., X_n} into a set of temporary variables Z = {Z_1, ..., Z_n}. The purpose of introducing temporary variables is to improve the algorithm's performance by limiting the correlation between decision variables. Note that this transformation also affects the population of solutions: all the solutions are transformed to the new coordinate system as well. The method of transforming the set of decision variables is presented towards the end of this section.

Then, at each construction step i = 1, ..., n, the ant chooses a value for decision variable Z_i. For performing this choice it uses a Gaussian kernel PDF, which is a weighted superposition of several Gaussian functions. For a decision variable Z_i the Gaussian kernel G_i is given as follows:

    G_i(z) = \sum_{j=1}^{k} \omega_j \, \frac{1}{\sigma_j \sqrt{2\pi}} \, e^{-\frac{(z - \mu_j)^2}{2\sigma_j^2}} , \quad \forall z \in \mathbb{R} ,    (2)
where the j-th Gaussian function is derived from the j-th member of the population P (remember that k is the size of P). Note that ω, µ, and σ denote vectors of size k: ω is the vector of weights, whereas µ and σ are the vectors of means and standard deviations, respectively. Figure 2 presents an example of a Gaussian kernel PDF consisting of five separate Gaussian functions.

[Fig. 2. An example of a Gaussian kernel PDF consisting of five separate Gaussian functions. The plot shows the individual Gaussian functions and their weighted superposition as a function of z.]
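For illustration, Equation (2) can be evaluated directly from the three vectors (a minimal sketch with our own function name; how the means and standard deviations are obtained from the population is described next):

```python
import numpy as np

def gaussian_kernel(z, weights, means, sigmas):
    """Evaluate the Gaussian kernel PDF G_i(z) of Equation (2): a weighted
    superposition of k one-dimensional Gaussian functions, one per population member."""
    weights, means, sigmas = (np.asarray(v, dtype=float) for v in (weights, means, sigmas))
    densities = np.exp(-((z - means) ** 2) / (2.0 * sigmas ** 2)) / (sigmas * np.sqrt(2.0 * np.pi))
    return float(np.sum(weights * densities))
```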
Sampling directly the Gaussian kernel PDF as defined in Equation 2 is problematic. Therefore we accomplish it with the following procedure, which may be proven to be exactly equivalent to sampling the PDF G_i directly.

Before starting a solution construction, we choose exactly one of the Gaussian functions, which is then used for all n construction steps. (Note that this also has the advantage that it allows the exploitation of the possibly existing correlation between the variables.) A Gaussian function j* is chosen with the following probability distribution:

    p_j = \frac{\omega_j}{\sum_{l=1}^{k} \omega_l} , \quad \forall j = 1, \ldots, k ,    (3)

where ω_j is the weight of Gaussian function j, which is obtained as follows. All solutions in P are ranked according to their quality (i.e., their SEP value), with the best solution having rank 1. Assuming the rank of the j-th solution in P to be r, the weight ω_j of the j-th Gaussian function is calculated according to the following formula:

    \omega_j = \frac{1}{qk\sqrt{2\pi}} \, e^{-\frac{(r-1)^2}{2q^2k^2}} ,    (4)

which essentially defines the weight to be the value of a Gaussian function with argument r, mean 1.0, and standard deviation qk, where q is a parameter of the algorithm. When parameter q is small, the best-ranked solutions are strongly preferred; when it is larger, the probability becomes more uniform. Thanks to using the ranks instead of the actual fitness function values, the algorithm is not sensitive to the scaling of the fitness function.

The sampling of the chosen Gaussian function j* may be done using a random number generator that is able to generate random numbers according to a parameterized normal distribution, or by using a uniform random generator in conjunction with (for instance) the Box-Muller method [7]. However, before doing that we have to specify the mean and the standard deviation of the j*-th Gaussian function. As mean µ_{j*} we choose the value of the i-th decision variable in solution j*. It remains to specify the standard deviation σ_{j*}. For doing that we calculate the average distance of the other population members from the j*-th solution and multiply it by a parameter ρ, which regulates the speed of convergence:

    \sigma_{j^*} = \rho \sum_{l=1}^{k} \left| z_i^l - z_i^{j^*} \right|    (5)

Parameter ρ has a role similar to the pheromone evaporation rate in combinatorial ACO. The higher the value of ρ ∈ (0, 1), the lower the convergence speed of the algorithm, and hence the lower the learning rate. Since this whole process is done for each dimension (i.e., decision variable) in turn, each time the distance is calculated only with the use of one single dimension (the rest of them are discarded). This ensures that the algorithm is able to adapt to convergence, but also allows the handling of problems that are scaled differently in different directions.
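Equations (3)-(5) can be combined into a single construction procedure, sketched below under our own naming conventions (an illustration, not the authors' implementation). `population` is assumed to be an array of shape (k, n) holding the solutions already expressed in the temporary coordinates Z and sorted by rank, best first.

```python
import numpy as np

def rank_weights(k, q):
    # Equation (4): the weight of the solution of rank r, for r = 1, ..., k.
    ranks = np.arange(1, k + 1)
    return np.exp(-((ranks - 1) ** 2) / (2.0 * q ** 2 * k ** 2)) / (q * k * np.sqrt(2.0 * np.pi))

def construct_solution(population, q, rho, rng):
    """One ant builds one candidate solution of length n."""
    k, n = population.shape
    # Equation (3): choose one Gaussian function j* for the whole construction.
    weights = rank_weights(k, q)
    j_star = rng.choice(k, p=weights / weights.sum())
    z = np.empty(n)
    for i in range(n):                                        # one construction step per variable Z_i
        mu = population[j_star, i]                            # mean: i-th variable of solution j*
        sigma = rho * np.sum(np.abs(population[:, i] - mu))   # Equation (5), one dimension at a time
        z[i] = rng.normal(mu, sigma)                          # sample the chosen Gaussian function
    return z
```

Usage would be something like `ants = [construct_solution(P, q=0.01, rho=0.95, rng=rng) for _ in range(2)]`, matching q = 0.01, m = 2, and the Cancer1 value ρ = 0.95 reported later in Table II.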
Finally, it remains to explain how the set of temporary decision variables Z is created from the original set X. (Note that ACO algorithms in general do not exploit correlation information between different decision variables, or components. In ACO*, due to the specific way the search experience is stored, i.e., as a population of solutions, it is in fact possible to take the correlation between the decision variables into account.) An obvious choice for adapting the coordinate system to the distribution of population P would be Principal Component Analysis (PCA) [14]. Although PCA works very well for reasonably regular distributions, its performance is less convincing for more complex functions. The mechanism that we designed instead is relatively simple. Each ant at each step of the construction process chooses a direction. The direction is chosen by randomly selecting a solution u from the population that is reasonably far away from the solution j* chosen as mean of the PDF. Then, the vector from j* to u becomes the chosen direction. The probability of choosing solution u (having solution j* chosen as the mean of the PDF) is the following:

    p(u \mid j^*) = \frac{d(u, j^*)^4}{\sum_{l=1}^{k} d(l, j^*)^4} ,    (6)

where d(·,·) is the function that returns the distance between two members of the population P. Once this vector is chosen, the new orthogonal basis for the ant's coordinate system is created using the Gram-Schmidt process [13]. It takes as input all the (already orthogonal) directions chosen in the ant's earlier steps and the newly chosen vector. The remaining missing vectors (for the remaining dimensions) are chosen randomly. Then, all the current coordinates of all the solutions in the population are rotated and recalculated according to this new orthogonal basis, resulting in the set of new temporary variables Z.
Only then is the ant able to measure the average distance, and subsequently to sample from the PDF (as it can now calculate the mean and standard deviation). At the end of the construction process, the chosen values of the temporary variables Z are converted back into the original coordinate system X.
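The coordinate change can be sketched as follows (an illustration with hypothetical function names, simplified so that the full basis is built in one pass from the chosen direction rather than incrementally step by step): a direction is drawn according to Equation (6), the Gram-Schmidt process completes an orthonormal basis (padding the remaining dimensions with random vectors), and all solutions are expressed in, and later converted back from, the new coordinate system.

```python
import numpy as np

def choose_direction(population, j_star, rng):
    # Equation (6): select a solution u with probability proportional to d(u, j*)^4,
    # so that population members far away from j* are strongly preferred.
    dists = np.linalg.norm(population - population[j_star], axis=1)
    probs = dists ** 4
    u = rng.choice(len(population), p=probs / probs.sum())
    return population[u] - population[j_star]

def orthonormal_basis(chosen_directions, n, rng):
    """Gram-Schmidt: orthonormalize the chosen direction(s) and fill the
    remaining dimensions with random vectors."""
    basis = []
    candidates = list(chosen_directions) + [rng.normal(size=n) for _ in range(n)]
    for v in candidates:
        w = v - sum((np.dot(v, b) * b for b in basis), np.zeros(n))
        norm = np.linalg.norm(w)
        if norm > 1e-12:                  # discard (near-)linearly dependent candidates
            basis.append(w / norm)
        if len(basis) == n:
            break
    return np.array(basis)                # rows are the new orthonormal basis vectors

def to_temporary(population, basis):
    # Rotate all solutions into the temporary coordinate system Z ...
    return population @ basis.T

def to_original(z_values, basis):
    # ... and convert chosen values back to the original coordinates X afterwards.
    return z_values @ basis
```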
IV. EXPERIMENTAL EVALUATION

An important collection of medical data sets for pattern classification is the well-known PROBEN1 data repository [18], which has been used various times in the past to evaluate and compare the performance of different methods for training NN classifiers. From the available data sets we chose Cancer1, Diabetes1, and Heart1 for our experimentation.

Cancer1 concerns the diagnosis of breast cancer (possible outcomes: yes or no). The data set contains altogether 699 patterns. Each pattern has 9 input parameters (i.e., measurements). We used the first 525 of the patterns (i.e., about 75%) as training set (i.e., for optimizing the NN weights), and the remaining 174 as test set. Diabetes1 concerns the diagnosis of diabetes (possible outcomes: yes or no). Each pattern has 8 input parameters. The data set contains altogether 768 patterns. We used the first 576 of them as training set and the remaining 192 as test set. Finally, Heart1 concerns the diagnosis of a heart condition (possible outcomes: yes or no). Each pattern has 35 input parameters. The data set contains altogether 920 cases. We used the first 690 of them as training set and the remaining 230 as test set.

Concerning the structure of the feed-forward NNs that we used, we took inspiration from the literature. More specifically, we used the same network structures that were used in [1]. For an overview of these NN structures see Table I.

TABLE I
Summary of the NN structures that we use for the three data sets. The last table column gives the number of weights to be optimized for each tackled problem. Note that for the calculation of this number, the bias inputs of the neurons also have to be taken into account.

Data set    Inp. layer   Hid. layer   Outp. layer   # of weights
Cancer1     9            6            2             74
Diabetes1   8            6            2             68
Heart1      35           6            2             230

(For example, for Cancer1 the count is (9+1)*6 weights into the hidden layer plus (6+1)*2 weights into the output layer, i.e., 74 in total, where the "+1" accounts for the bias input of each neuron.)

A. Algorithms for comparison

For comparison purposes we have re-implemented some algorithms traditionally used for training NNs, namely the back-propagation (BP) algorithm [19] and the Levenberg-Marquardt (LM) algorithm [15], [16]. Both algorithms (i.e., BP and LM) require gradient information, hence they require the neuron transfer function f(·) to be differentiable. Consequently, these algorithms may not, in contrast to ACO*, be used in case the neuron transfer function is not differentiable or is unknown. In case of training NNs whose transfer function is differentiable, the drawback of general optimization algorithms such as ACO* is however that they do not exploit available additional information, e.g., the gradient. In order to see how the additional gradient information influences the performance of ACO*, we have also implemented hybridized versions of ACO*, namely ACO*-BP and ACO*-LM. In these hybrids, each solution generated by the ACO* algorithm is improved by running a single improving iteration of either BP or LM before being evaluated.

Finally, we wanted to see how all the tested algorithms compare to a simple random search (RS) method. This is an algorithm that randomly generates a set of values for the weights and then evaluates these solutions. As we used a sigmoid function as neuron transfer function, it was sufficient to limit the range of weight values to values close to 0. Hence, we arbitrarily chose a range of [-5, 5].

We have performed a limited-scope parameter tuning for all algorithms using Birattari's F-Race method (see [4], [3]). The outcome is shown in Table II.

B. Results

In order to compare the performance of the algorithms, we applied each algorithm 50 times to each of the three test problems. As stopping condition we used the number of fitness function evaluations. Following the work of Alba and Chicano [1], we used 1000 function evaluations as the limit. Figures 3, 4, and 5 present respectively the results obtained for the cancer, diabetes, and heart test problems in the form of box-plots. Each figure presents the distribution of the actual classification error percentage (CEP) values obtained by the algorithms (over 50 independent runs).

Cancer1 (see Figure 3) appears to be the easiest data set among the three that we tackled. All algorithms obtained reasonably good results, including the RS method. However, the best performing algorithm is BP. From the fact that the results obtained by RS do not differ significantly from the results obtained by the other, more complex, algorithms, it may be concluded that the problem is relatively easy, and that there are many reasonably good solutions scattered over the search space. None of the algorithms was able to classify all the test patterns correctly. This may be due to the limited size of the training set, i.e., there might not have been enough information in the training set to generalize perfectly.

Diabetes1 (see Figure 4) is a more difficult problem than Cancer1. All our algorithms clearly outperform RS. However, the overall performance of the algorithms in terms of the CEP value is not very good. The best performing algorithm is again BP. The weaker overall performance of the algorithms may again indicate that the training set does not fully represent all the possible patterns.
TABLE II
Summary of the final parameter values that we chose for our algorithms. Not included in the table are the parameters common to all ACO* versions, namely q and m. For these parameters we used the settings q = 0.01 and m = 2 (the number of ants used in each iteration). Note that η is the step-size parameter of BP, and β is the adaptation-step parameter of LM.

                  Cancer1                   Diabetes1                 Heart1
Algorithm   k     ρ     η      β      k     ρ     η      β      k     ρ     η      β
ACO*        148   0.95  -      -      136   0.8   -      -      230   0.6   -      -
ACO*-BP     148   0.98  0.3    -      136   0.7   0.1    -      230   0.98  0.4    -
ACO*-LM     148   0.9   -      10     136   0.1   -      10     230   0.1   -      10
BP          -     -     0.002  -      -     -     0.01   -      -     -     0.001  -
LM          -     -     -      50     -     -     -      5      -     -     -      1.5

[Fig. 3. Box-plots of the CEP values for Cancer1. The boxes are drawn between the first and the third quartile of the distribution, while the indentations in the box-plots (or notches) indicate the 95% confidence interval.]

[Fig. 4. Box-plots of the CEP values for Diabetes1. The boxes are drawn between the first and the third quartile of the distribution, while the indentations in the box-plots (or notches) indicate the 95% confidence interval.]

The Heart1 problem (see Figure 5) is, with 230 weights, the largest problem that we tackled. It is also the one on which the performance of the algorithms differed the most. All tested algorithms clearly outperform RS, but there are also significant differences among the more complex algorithms. BP, which performed quite well on the other two problems, did not do so well on Heart1. ACO* achieves results similar to BP. In turn, LM, which was not performing so well on the first two problems, obtains quite good results. Very interesting is the performance of the hybridized versions of ACO*, namely ACO*-BP and ACO*-LM. The ACO*-BP hybrid clearly outperforms both ACO* and BP. ACO*-LM likewise outperforms both ACO* and LM. Additionally, ACO*-LM performs best overall.

Finally, it is interesting to compare the performance of the ACO* based algorithms to some other general optimization algorithms. Alba and Chicano [1] have published the results of a genetic algorithm (GA) used for tackling exactly the same three problems as we did. They have tested not only a stand-alone GA, but also its hybridized versions GA-BP and GA-LM. Table III summarizes the results obtained by the ACO* and GA based algorithms. Clearly the stand-alone ACO* performs better than the stand-alone GA for all the test problems. ACO*-BP and ACO*-LM perform better than GA-BP and GA-LM, respectively, on both of the more difficult problems Diabetes1 and Heart1, and worse on Cancer1. For the Heart1 problem the mean performance of any ACO* based algorithm is significantly better than that of the best GA based algorithm (which was reported as the state-of-the-art for this problem in 2004).

V. CONCLUSION

We have presented an ant colony optimization algorithm (i.e., ACO*) for the training of feed-forward neural networks for pattern classification. The performance of the algorithm was evaluated on real-world test problems and compared to specialized algorithms for feed-forward neural network training (back-propagation and Levenberg-Marquardt), and also to algorithms based on a genetic algorithm.
TABLE III
Pairwise comparison of the results of the ACO* based algorithms with recent results obtained by a set of GA based algorithms (see [1]). The results can be compared due to the fact that 1000 evaluations were used as stopping criterion for all the algorithms. For each problem-algorithm pair we give the mean (over 50 independent runs) and the standard deviation (in brackets). The best result of each comparison is indicated in bold.

            GA              ACO*            GA-BP           ACO*-BP         GA-LM           ACO*-LM
Cancer1     16.76 (6.15)    2.39 (1.15)     1.43 (4.87)     2.14 (1.09)     0.02 (0.11)     2.08 (0.68)
Diabetes1   36.46 (0.00)    25.82 (2.59)    36.36 (0.00)    23.80 (1.73)    28.29 (1.15)    24.26 (1.40)
Heart1      41.50 (14.68)   21.59 (1.14)    54.30 (20.03)   18.29 (1.00)    22.66 (0.82)    16.53 (1.37)

[Fig. 5. Box-plots of the CEP values for Heart1. The boxes are drawn between the first and the third quartile of the distribution, while the indentations in the box-plots (or notches) indicate the 95% confidence interval.]

The performance of the stand-alone ACO* was comparable to (or at least not much worse than) the performance of specialized algorithms for neural network training. This result is particularly interesting as ACO*, being a much more generic approach, also allows the training of networks in which the neuron transfer function is either not differentiable or unknown. The hybrid of ACO* and the Levenberg-Marquardt algorithm (i.e., ACO*-LM) was in some cases able to outperform the back-propagation and the Levenberg-Marquardt algorithms. Finally, the results indicate that ACO* outperforms other general-purpose optimizers such as genetic algorithms.

ACKNOWLEDGMENT

This work was supported by the Spanish CICYT project TRACER (grant TIC-2002-04498-C05-03), and by the "Juan de la Cierva" program of the Spanish Ministry of Science and Technology, of which Christian Blum is a post-doctoral research fellow. This work was also partially supported by the ANTS project, an Action de Recherche Concertée funded by the Scientific Research Directorate of the French Community of Belgium.

REFERENCES

[1] E. Alba and J. F. Chicano, "Training neural networks with GA hybrid algorithms," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004), ser. Lecture Notes in Computer Science, K. D. et al., Ed., vol. 3102. Springer Verlag, Berlin, Germany, 2004, pp. 852-863.
[2] B. Bilchev and I. C. Parmee, "The ant colony metaphor for searching continuous design spaces," in Proceedings of the AISB Workshop on Evolutionary Computation, ser. Lecture Notes in Computer Science, vol. 993, 1995, pp. 25-39.
[3] M. Birattari, "The problem of tuning metaheuristics as seen from a machine learning perspective," Ph.D. dissertation, Université Libre de Bruxelles, Brussels, Belgium, 2004.
[4] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp, "A racing algorithm for configuring metaheuristics," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002), W. L. et al., Ed. Morgan Kaufmann Publishers, San Mateo, CA, 2002, pp. 11-18.
[5] C. M. Bishop, Neural Networks for Pattern Recognition. MIT Press, 2005.
[6] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York, NY, 1999.
[7] G. E. P. Box and M. E. Muller, "A note on the generation of random normal deviates," Annals of Mathematical Statistics, vol. 29, no. 2, pp. 610-611, 1958.
[8] J.-L. Deneubourg, S. Aron, S. Goss, and J.-M. Pasteels, "The self-organizing exploratory pattern of the Argentine ant," Journal of Insect Behaviour, vol. 3, pp. 159-168, 1990.
[9] M. Dorigo, "Optimization, learning and natural algorithms (in Italian)," Ph.D. dissertation, Dipartimento di Elettronica, Politecnico di Milano, Italy, 1992.
[10] M. Dorigo, V. Maniezzo, and A. Colorni, "Ant System: Optimization by a colony of cooperating agents," IEEE Transactions on Systems, Man, and Cybernetics - Part B, vol. 26, no. 1, pp. 29-41, 1996.
[11] M. Dorigo and T. Stützle, Ant Colony Optimization. MIT Press, Cambridge, MA, 2004.
[12] J. Dréo and P. Siarry, "A new ant colony algorithm using the heterarchical concept aimed at optimization of multiminima continuous functions," in Proceedings of ANTS 2002, ser. Lecture Notes in Computer Science, M. Dorigo, G. Di Caro, and M. Sampels, Eds., vol. 2463. Springer Verlag, Berlin, Germany, 2002, pp. 216-221.
[13] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. The Johns Hopkins University Press, Baltimore, USA, 1989.
[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer-Verlag, Berlin, Germany, 2001.
[15] K. Levenberg, "A method for the solution of certain problems in least squares," Quarterly of Applied Mathematics, vol. 2, pp. 164-168, 1944.
[16] D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM Journal on Applied Mathematics, vol. 11, pp. 431-441, 1963.
[17] N. Monmarché, G. Venturini, and M. Slimane, "On how Pachycondyla apicalis ants suggest a new search algorithm," Future Generation Computer Systems, vol. 16, pp. 937-946, 2000.
[18] L. Prechelt, "Proben1 - A set of neural network benchmark problems and benchmarking rules," Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany, Tech. Rep. 21, 1994.
[19] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[20] K. Socha, "Extended ACO for continuous and mixed-variable optimization," in Proceedings of ANTS 2004 - Fourth International Workshop on Ant Algorithms and Swarm Intelligence, ser. Lecture Notes in Computer Science, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, and T. Stützle, Eds. Springer Verlag, Berlin, Germany, 2004.
