PARALLEL IMPLEMENTATION OF AN ANT COLONY OPTIMIZATION
METAHEURISTIC WITH OPENMP
Pierre Delisle (1), Michaël Krajecki (2, 3), Marc Gravel (1), Caroline Gagné (1)

(1) Département d'informatique et mathématique, Université du Québec à Chicoutimi,
Chicoutimi, Québec, Canada, G7H 2B1
[email protected], [email protected], [email protected]

(2) Université de Reims Champagne-Ardenne, LERI,
BP 1039, F-51687 Reims Cedex 2
[email protected]

(3) Collège Militaire Royal du Canada,
CP 17000, Succursale Forces, Kingston, Ontario, Canada, K7K 7B4
ABSTRACT
This paper presents a parallel implementation of an ant colony optimization metaheuristic for the solution of an industrial scheduling problem in an aluminum casting center. The usefulness and efficiency of the algorithm, in its sequential form, for this particular optimization problem have already been shown in previous work. However, like metaheuristics in general, the method still requires considerable computational time and resources, even though it produces solutions of good quality. The structure of the algorithm, on the other hand, makes it well suited for parallelization, so a considerable improvement in execution time can be achieved that way. In this paper, an efficient and straightforward OpenMP implementation on a shared memory architecture is presented, and the main algorithmic and programming issues that had to be addressed are discussed. The code is written in C and the application has been executed on a Silicon Graphics Origin2000 parallel machine.
1 INTRODUCTION

In many industrial scheduling situations, exact optimization algorithms require overlong solution times and cannot produce an acceptable or even feasible solution in the time available. Metaheuristics have been shown to offer successful solution strategies for such problems. More specifically, Gravel et al. [2001] have presented an efficient representation of a continuous horizontal casting operation, a real scheduling problem encountered in an aluminium foundry, using an ant colony optimization metaheuristic. The algorithm, in its original sequential form, has been implemented in an application that is now used in the foundry. However, even if it has proven to be a viable solution, it is still very demanding in computational time and resources, and this is particularly true as the problem size increases. Moreover, the generation and evaluation of ants (solutions), which are the main parts of the algorithm and its principal source of computational cost, exhibit a low degree of dependency, so the structure of the algorithm makes it well suited for concurrent execution. For those reasons, a parallel approach was justified and an implementation was made to study the improvement in execution time that could be obtained that way.

We have already shown the interest of using OpenMP for the parallelization of irregular applications (Habbas et al. [2000]). In that previous work we studied an exact algorithm for the resolution of constraint satisfaction problems (CSP). We hope to show in this paper that the same approach can be applied to the present Ant Colony Optimization implementation.

In the first section of this paper, the Ant Colony Optimization algorithm (ACO) is presented. Then, the choice of a shared memory model and the OpenMP environment is explained, the parallel implementation of the ACO is detailed, and some algorithmic and programming issues that had to be addressed during the parallelization process are discussed. Finally, some results are presented to show the performance of the resulting parallel application.

2 ANT COLONY OPTIMIZATION ALGORITHM (ACO)

The first ant colony optimization metaheuristic (ACO), called ant system (Colorni et al. [1991], Dorigo [1992]), was inspired by studies of the behavior of ants (Deneubourg et al. [1983]; Deneubourg & Goss [1989]; Goss et al. [1990]).
Ants communicate among themselves through pheromone, a substance they deposit on the ground in variable amounts as they move about. It has been observed that the more ants use a particular path, the more pheromone is deposited on that path and the more attractive it becomes to other ants seeking food. If an obstacle is suddenly placed on an established path leading to a food source, ants will initially go right or left in a seemingly random manner, but those choosing the side that is in fact shorter will reach the food more quickly and will make the return journey more often. The pheromone on the shorter path will therefore be more strongly reinforced and will eventually become the preferred route for the stream of ants. The works of Colorni et al. [1991], Dorigo et al. [1991], Dorigo et al. [1996], Dorigo & Gambardella [1997] and Dorigo & Di Caro [1999] offer detailed information on the workings of the algorithm and the choice of the values of the various parameters.

In the multiple-objective scheduling problem treated in this paper, we must determine the processing sequence for a set of orders where setup times are sequence-dependent. Our formulation is based on the well-known traveling salesman problem (TSP). Each order to be processed is represented as a "city" in the TSP network. When an ant moves from city i to city j, it leaves a trail analogous to the pheromone on the edge (ij). The trail records information related to the previous use of edge (ij), and the higher this use has been, the greater is the probability of choosing it once again. For the scheduling problem, the pheromone trail will contain information based on the number of times ants chose to make jobs (ij) adjacent. We will explain later how the trail is initialized and modified.

At time t, from an existing partial job sequence, each ant k chooses the next job to append using a probabilistic rule pij^k(t) based on the visibility (ηij) and on the intensity of the pheromone trail (τij(t)). For the scheduling problem, the visibility is defined by a matrix that aggregates information on each of the four objectives to minimize. This matrix represents the visibility information analogous to the D matrix in the TSP. At initialization of the algorithm, the trail intensity for all job pairs (ij) is initialized to a small positive quantity τ0. Parameters α and β are used to vary the relative importance of the trail intensity and the visibility. To ensure that a job that has already been placed in the sequence being constructed is not selected again, a tabu list is maintained. Each ant k has its own tabu list tabuk recording the ordered list of jobs already selected.
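In the original ant system, this selection rule typically takes the form pij^k(t) = [τij(t)]^α [ηij]^β / Σl [τil(t)]^α [ηil]^β, where the sum runs over the jobs l not yet in tabuk. As an illustration only, the following C sketch shows how such a roulette-wheel selection over the non-tabu jobs might be coded; the names (choose_next_job, tau, eta, in_tabu) are ours and do not come from the actual application.

#include <stdlib.h>
#include <math.h>

/* Illustrative sketch of the ant system transition rule: choose the
   next job j after job i with probability proportional to
   tau[i][j]^alpha * eta[i][j]^beta, restricted to jobs not yet in the
   ant's tabu list. All names are hypothetical; n is the number of
   jobs. */
int choose_next_job(int i, int n, double **tau, double **eta,
                    double alpha, double beta, const int *in_tabu)
{
    double weight[n];  /* C99 variable-length array */
    double total = 0.0;

    for (int j = 0; j < n; j++) {
        weight[j] = in_tabu[j] ? 0.0
                  : pow(tau[i][j], alpha) * pow(eta[i][j], beta);
        total += weight[j];
    }

    /* Roulette-wheel selection: draw a point in [0, total) and find
       the job whose cumulative weight covers it. */
    double r = ((double)rand() / RAND_MAX) * total;
    double cumul = 0.0;
    for (int j = 0; j < n; j++) {
        cumul += weight[j];
        if (!in_tabu[j] && r < cumul)
            return j;
    }
    return -1; /* no eligible job: the sequence is complete */
}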
At any given time, more than one ant constructs a job sequence, and a cycle is completed when each of the m ants has completed its construction. The version of the algorithm proposed in this paper carries out an updating of the trail intensity at the end of each cycle. This allows us to update the trail according to the evaluation of the solutions found in the cycle. Let the evaluation on the most important objective (h') found by the kth ant be Lk^h'. The contribution of ant k to the update of the pheromone trail is then calculated as follows: ∆τij^k(t) = Q / Lk^h'(t), where Q is a system parameter. The updating of the trail is also influenced by an evaporation factor (1-ρ) that diminishes the trail accumulated during previous cycles.

The reader may consult Gravel et al. [2001] for the details of these adaptations of the original ACO to the actual industrial problem. To treat the multiple-objective optimization problem, all nondominated solutions found by the metaheuristic are stored in a quadtree (Finkel & Bentley [1974]). If a solution is dominated, it will be eliminated during the quadtree insertion process, and if it dominates other solutions already in the quadtree, the insertion process will remove them before the solution is inserted in its correct position.

For clarity in the next section, the ACO algorithm can be formulated (see Figure 1) in the following way, which is less formal than the original specification but simpler and closer to the way it has been implemented.

Figure 1 Sequential implementation of the ACO

NC = 0;
Initialize τij matrix;
Initialize quadtree;
While (NC < NCMax) and (Not Stagnation Behaviour)
    Initialize ∆τij matrix;
    For each ant k do
        Build a job sequence;
        Evaluate solution k on each objective;
        Update ∆τij matrix according to solution k;
        Insert solution k into the quadtree;
    Update τij matrix according to ∆τij matrix;
    NC = NC + 1;
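The end-of-cycle update of the τij matrix in Figure 1 combines the evaporation factor with the accumulated ant contributions; in the ant system it is usually written τij(t+1) = ρ·τij(t) + ∆τij(t), with ∆τij(t) = Σk ∆τij^k(t). A minimal C sketch of this step, under the same naming assumptions as above, might look as follows:

/* Illustrative end-of-cycle pheromone update: the existing trail is
   diminished by the evaporation factor (rho is the persistence), then
   the contributions accumulated in delta_tau during the cycle are
   added. Names are hypothetical; n is the number of jobs. */
void update_trail(int n, double **tau, double **delta_tau, double rho)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            tau[i][j] = rho * tau[i][j] + delta_tau[i][j];
}

/* Contribution of one ant whose solution scored L on the most
   important objective h': a deposit of Q / L on every pair of jobs
   made adjacent by its sequence. */
void deposit(int n, const int *sequence, double **delta_tau,
             double Q, double L)
{
    for (int pos = 0; pos < n - 1; pos++)
        delta_tau[sequence[pos]][sequence[pos + 1]] += Q / L;
}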
3 PARALLEL IMPLEMENTATION OF THE ACO

As far as we know, few references can be found at this time about parallel implementations of the ACO. Moreover, the work that has been done in this field is mostly related to message passing MIMD architectures, which present different issues compared to shared memory architectures.
Bullnheimer et al. [1998] have proposed a synchronous parallel implementation of the Ant System for the message passing model. The authors outline the considerable cost of the communications encountered and the existence of a synchronization procedure that cannot be neglected. Talbi et al. [1999] have developed a similar parallel ACO algorithm that is combined with a local tabu search to solve the quadratic assignment problem (QAP).

In this paper we hope to show that a design for a shared memory model is better suited to reducing the cost of the parallelization, although the synchronization procedure cannot be avoided.

There is also an issue about concurrent updates of shared information that emerges when using a shared memory model, but it can be easily resolved, so it is possible to achieve good performance in this environment. With the availability of OpenMP (OpenMP [1998]) it is possible to experiment with this approach, and the parallelization of the existing sequential C code can be made in an easy, straightforward way.

The reader may notice that our goal in this paper is to improve the execution time of the algorithm without altering its behavior. Improvement of the quality of the solutions found by the ACO with parallel mechanisms is another part of this project and will not be detailed in this paper.

3.1 The "natural" parallelism of the ACO

As we can see in Figure 1, the "for loop" is the main part of the algorithm and the source of its complexity (the standard ACO algorithm is O(n³)). In fact, the generation of one solution is of complexity O(n²) (n is the number of jobs) and, since we are in a real scheduling environment, the evaluation has to simulate the industrial process for each solution, so it is considerably more time-consuming than, for example, the computation of a TSP tour. Besides, these two operations are independent for each ant of a given cycle, so they can easily be parallelized.

3.2 Message passing model vs. shared memory model

Figure 2 shows the behavior of an implementation of the ACO based on the parallel synchronous Ant System in a message passing model.

At the beginning of the algorithm, a master process initializes the information, spawns k processes (one for each ant k), and broadcasts the information. At the beginning of a cycle, the τij matrix (the pheromone trail) is sent to each process and the computations (generation and evaluation of solutions) are done in parallel. Then, the solutions and their evaluations are sent back to the master, the τij matrix is updated, and a new cycle begins with the broadcasting of the updated τij matrix.

Figure 2 Parallel ACO in a message passing model

Master                              Ant 1 ... Ant k
Initialize τij matrix
NC = 0
Send D matrix            --->
Send τij matrix          --->
                                    Generate solution k
                                    Evaluate solution k
                         <---       Send k, lk
NC = 1  Update τij matrix
Send τij matrix          --->
                                    Generate solution k
                                    Evaluate solution k
                         <---       Send k, lk
NC = 2  Update τij matrix
...
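For illustration only, the master's side of one such cycle could be sketched in C with MPI as follows; the message passing variant is not the implementation reported in this paper, and the ranks, tags and names used here are our assumptions.

#include <mpi.h>

/* Hypothetical sketch of the master's side of one cycle of the
   synchronous message passing ACO of Figure 2. Each of the p worker
   processes (ranks 1..p) builds and evaluates one solution per cycle;
   n is the number of jobs and tau holds the trail as n*n doubles. */
void master_cycle(int n, int p, double *tau, double *evals, int *solutions)
{
    /* Broadcast the current trail to every ant process. */
    MPI_Bcast(tau, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Collect each ant's job sequence and its evaluation lk. */
    for (int k = 1; k <= p; k++) {
        MPI_Status status;
        MPI_Recv(&solutions[(k - 1) * n], n, MPI_INT, k, 0,
                 MPI_COMM_WORLD, &status);
        MPI_Recv(&evals[k - 1], 1, MPI_DOUBLE, k, 1,
                 MPI_COMM_WORLD, &status);
    }

    /* Update tau from the received solutions (evaporation plus
       deposits); the next cycle starts with a new broadcast. */
}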
As shown by Bullnheimer et al. [1998], if we put aside the communication overhead, this parallelization strategy implies an optimal speedup, assuming that there are enough processors available to assign one ant to each processing element. However, this overhead cannot be neglected, as the high number of communication operations needed (packing and unpacking messages, sending and receiving packets, idle times) causes a considerable loss in efficiency.

With the recent resurgence of powerful shared memory parallel computers and the availability of OpenMP, which permits the parallelization of the existing sequential C program, it is interesting to experiment with a shared memory implementation of the ACO to solve our industrial scheduling problem. In a model where explicit communications are no longer needed, it may be possible to obtain good performance with minimal changes to the algorithm.

3.3 Synchronization issue with the updating of the τij matrix

At the end of each cycle, the τij matrix has to be updated for the ants of the next cycle. Even if in our implementation there is no need of explicit
communications before the updating procedure, the master thread still has to wait for all the processes (ants) of the current cycle to compute and evaluate their job sequences, so this synchronization barrier is independent of the model used and cannot be avoided without altering the original method. In future work we will study the possibility of modifying the algorithm and the frequency of these updates in order to achieve better efficiency without losing searching performance.

3.4 Sharing the load between processors

A first step in the parallelization process could be to naively assign the generation and evaluation of each ant to a different processor (with a #pragma omp parallel statement where Number of threads = Number of ants), keeping the ∆τij matrix and the quadtree updates outside the parallel region, since there would be a concurrent update conflict if they were inside. We could also include the updates in the parallel region, in the shared memory, but in a critical section where only one update at a time could be done. However, in both cases synchronization causes a great loss in efficiency.

Besides, in this situation the chosen number of ants, which is a parameter of the ACO, is limited by the number of processors at our disposal. This could be problematic, since we may need more ants for bigger problems and we want the application to run on smaller parallel machines. It is then better to share the load between processors in a way that gives more than one ant to each processor (with a #pragma omp parallel for statement where Number of threads < Number of ants). For this load balancing matter, we can use the static and dynamic scheduling options provided by OpenMP. This way we get a more efficient implementation, and further improvement can be achieved by parallelizing the updates of the ∆τij matrix and of the quadtree, which means having the whole for loop parallelized, as sketched below.
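A minimal sketch of this distribution of the ants over the threads, assuming hypothetical names for the application's routines, might look as follows:

#include <omp.h>

/* Hypothetical application routines standing in for the real ones. */
extern void build_job_sequence(int k);
extern void evaluate_solution(int k);
extern void update_delta_tau(int t, int k);    /* accumulate into ∆τij[t] */
extern void insert_into_quadtree(int t, int k); /* insert into quadtree[t] */

/* One ACO cycle with the ants shared among p threads; each thread
   handles several ants. schedule(dynamic) lets OpenMP balance the
   load at run time; schedule(static) would split the ants evenly. */
void run_cycle(int num_ants, int p)
{
    omp_set_num_threads(p);
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < num_ants; k++) {
        int t = omp_get_thread_num();  /* index of this thread's private copies */
        build_job_sequence(k);         /* generation, O(n^2) */
        evaluate_solution(k);          /* simulation of the industrial process */
        update_delta_tau(t, k);        /* see section 3.5 */
        insert_into_quadtree(t, k);    /* see section 3.6 */
    }
    /* Implicit barrier here: every ant of the cycle has finished
       before the master thread updates the τij matrix (section 3.3). */
}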
3.5 Updating the ∆τij matrix concurrently

The most important structures used by the ACO are the τij matrix, the D matrix, the ∆τij matrix and the quadtree. The τij matrix is updated once each cycle, and this update cannot be parallelized without changing the behaviour of the algorithm, so the matrix stays in the shared memory during the execution. It is accessed in read mode by the generation function and its update is done by the master thread at the end of each cycle, after the end of the parallel region.

The D matrix is constructed at the beginning of the execution and is never updated, so it stays in the shared memory, as do all the other parameter variables and structures which are accessed in read mode during the execution of the algorithm.

In the original sequential implementation of the ACO, as well as in the first parallel implementation that was made, the ∆τij matrix is in the shared memory and is updated by one ant (i.e. by an OpenMP thread) at a time, in a critical section of the parallel region, with O(n) operations (where n is the number of jobs). At a memory cost, we can improve execution time by creating one matrix for each thread, i.e. for each independent group of ants. The only use of this matrix is to update the τij matrix at the end of the cycle, so we can merge all the ∆τij matrices (the computational complexity of this operation is O(n²)) in parallel, after the main parallel region and before the updating of the τij matrix, without altering the behavior of the algorithm.
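A minimal sketch of this merging step, assuming the p per-thread matrices are held in an array delta[p] (the names are ours), might look as follows:

/* Hypothetical sketch of the parallel merge of the p per-thread ∆τij
   matrices: the (i, j) entries are partitioned among the threads by
   row, so no two threads ever write the same entry and no critical
   section is needed. The work is O(p·n²) in total, i.e. O(n²) as in
   the text when the number of threads is regarded as a constant. */
void merge_delta(int p, int n, double ***delta, double **delta_merged)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int t = 0; t < p; t++)
                sum += delta[t][i][j];
            delta_merged[i][j] = sum;
        }
    }
    /* The master thread then applies delta_merged and the evaporation
       factor to the τij matrix, as in update_trail() above. */
}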
3.6 Updating the quadtree concurrently

A similar issue had to be addressed with the updating of the quadtree. Originally, there is only one tree structure that is updated sequentially or in a critical construct, but at another memory cost it is possible to create multiple trees (one quadtree for each processor) and merge them outside the parallel region. The merging procedure can be done after the execution of all the cycles, and not at each cycle as with the ∆τij matrix, since the tree is a storage structure and not an information structure needed by other parts of the algorithm.

However, there may be a drawback in performance due to the fact that the management of the quadtree is done by dynamic memory allocation, i.e. each insertion procedure implies dynamic creation and destruction of nodes. Even if pointers are private for each quadtree, private dynamic memory allocation is not supported by OpenMP at this time and all the quadtrees are part of the same shared memory heap. This issue and its consequences on the performance of the parallel implementation will be addressed in future work.
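The overall pattern might be sketched as follows; the quadtree type and its insertion and merging routines stand in for the application's own and are hypothetical here:

/* Hypothetical sketch of the per-thread quadtree pattern: each thread
   inserts the nondominated solutions of its own ants into a private
   tree; after all cycles, the p trees are merged into a single one.
   quadtree_insert() is assumed to discard a dominated solution and to
   remove any solutions it dominates, as described in section 2. */
typedef struct quadtree quadtree;  /* opaque; defined by the application */

extern void quadtree_insert(quadtree *dest, const double *objectives);
extern void quadtree_merge_into(quadtree *dest, quadtree *src);

void merge_quadtrees(quadtree **trees, int p)
{
    for (int t = 1; t < p; t++)
        quadtree_merge_into(trees[0], trees[t]); /* reinsert the nodes of tree t */
}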
With the modifications mentioned above in place, we obtain a parallel implementation of the ACO as shown in Figure 3.

4 RESULTS

The implementation strategy described in section 3 has been tested by starting from the already existing sequential implementation written in C, adding the appropriate OpenMP directives to it and making the necessary changes that have been discussed.
Figure 3 Parallel implementation of the ACO

NC = 0;
NumThreads = p;
Initialize τij matrix;
Initialize the p quadtrees;
While (NC < NCMax) and (Not Stagnation Behaviour)
    Initialize the p ∆τij matrices;
    Parallel for with p threads
        For each ant k do
            Build a job sequence;
            Evaluate solution k on each objective;
            Update ∆τij[p] matrix according to the solution k;
            Insert solution k into the quadtree[p];
    Merge the p ∆τij matrices in parallel;
    Update τij matrix according to ∆τij matrix;
    NC = NC + 1;
Merge the p quadtrees;

Tables 1, 2, 3 and 4 show the performance of the parallel implementation when 1, 2, 4, 8 and 16 processors are used to process order books of size 50 and 80 with two sets of parameters. The number of ants (k) has been set to 1000. Speedup is the sequential execution time divided by the parallel execution time, and efficiency is the speedup divided by the number of processors. Figure 4 is a graphical representation of the execution times of Table 3.

Table 1 Results of parallel execution with 50 jobs, 100 cycles and 1000 ants

Number of processors    Execution time (sec)    Speedup    Efficiency
1                       572                     -          -
2                       309                     1.85       0.93
4                       167                     3.43       0.86
8                       112                     5.11       0.64
16                      125                     4.58       0.29

Table 2 Results of parallel execution with 50 jobs, 200 cycles and 1000 ants

Number of processors    Execution time (sec)    Speedup    Efficiency
1                       1136                    -          -
2                       634                     1.79       0.90
4                       349                     3.25       0.81
8                       231                     4.91       0.61
16                      267                     4.25       0.26

Table 3 Results of parallel execution with 80 jobs, 100 cycles and 1000 ants

Number of processors    Execution time (sec)    Speedup    Efficiency
1                       1111                    -          -
2                       564                     1.97       0.98
4                       305                     3.64       0.91
8                       187                     5.94       0.74
16                      204                     5.45       0.34

Table 4 Results of parallel execution with 80 jobs, 200 cycles and 1000 ants

Number of processors    Execution time (sec)    Speedup    Efficiency
1                       2181                    -          -
2                       1152                    1.89       0.95
4                       613                     3.56       0.89
8                       381                     5.72       0.72
16                      429                     5.08       0.32

Figure 4 Execution time with 80 jobs, 100 cycles and 1000 ants (execution time in seconds as a function of the number of processors; graphical representation of Table 3)

Regarding problem size, Tables 1 and 3 show that we get better efficiency with 80 jobs than with 50, as we supposed, when the same parameters are applied (Figure 5 shows the efficiency for 8 different sizes of order books). We believe that we would need bigger order books to properly see the benefits that can be gained from parallelism, but actual program limitations (mostly the complex industrial evaluation process) hinder the use of bigger order books. A first step in this project was to study feasibility and performance on existing, well-studied large problems, so in future work we will expand the application to process more than 80 jobs.
Figure 5 Efficiency with 4 processors, 100 cycles, 1000 ants and order books varying from 10 to 80 (efficiency as a function of the number of jobs)

Another issue that we wanted to address in this work is the penalty in efficiency caused by the synchronizations due to the updating of the τij matrix. For this matter, the application was executed with variations of the number of cycles while keeping the other parameters the same. Results show that when NC is increased for the same problem, which implies more updates of the τij matrix and more synchronizations, there is a loss in efficiency.

Overall, the results show that our parallel implementation of the ACO for this problem leads to significant speedups. However, the experiments are not as convincing as we expected, especially with 8 and more processors. The degradation of efficiency obtained when increasing the number of processors is faster than we expected, and the loss of performance associated with the increase of the number of cycles is smaller. The same observation holds as we increase the number of ants. In fact, when we increase the number of ants to 10 000, which is a high number compared to the usual choice, we get better efficiency, but not as high as one could expect when 8 and 16 processors are used (0.99 for 2 processors, 0.92 for 4, 0.76 for 8 and 0.36 for 16).

Further experiments and studies will lead us to a better understanding of the mechanisms of the implementation, the effects of modifying the algorithm parameters, and the influence of the software and hardware elements that were used for the development. The main issues that we may address in short term work are:

• The effects of dynamic memory allocation in parallel with OpenMP
• The SGI Origin2000 computer, its CC-NUMA architecture, the data structures used in the application and the way the memory is used
• Better knowledge of the OpenMP environment and of the use of its directives
• The design of the actual application code
• More extensive experimentation with different parameter settings

5 CONCLUSION

In this paper we have presented a shared memory parallel implementation of an Ant Colony Optimization metaheuristic applied to an industrial scheduling problem, and we have shown the main issues that had to be addressed during the parallelization process. The nature of the ACO and the functionality offered by OpenMP made the transition from sequential to parallel easy and straightforward, although some changes had to be made to the algorithm and to the program to obtain the level of efficiency that we achieved.

Our aim was to reduce the execution time of the algorithm. The resulting implementation has shown that it was possible to design an efficient parallel Ant Colony Optimization metaheuristic in a shared memory model with OpenMP. It has also shown some limitations as we increased the number of processors used in the application, and the causes of those drawbacks should be analysed in the near future.

In another part of this project we plan to exploit the parallelism potential of the ACO in a way that will improve its solution searching capabilities without increasing its execution time. It is likely that this goal will imply a model of co-evolution of many ant colonies, which means a higher level of parallelization, the possible use of a message passing model and a resulting implementation that mixes MPI and OpenMP.

ACKNOWLEDGEMENT

This work was partly supported by the Centre Lorrain de Calcul à Hautes Performances (Centre Charles Hermite : CCH) and by the Centre Informatique National de l'Enseignement Supérieur (CINES).

REFERENCES

Bullnheimer B., Kotsis G., Strauss C. [1998], Parallelization Strategies for the Ant System, in R. De Leone, A. Murli, P. Pardalos and G. Toraldo, editors, High Performance Algorithms and Software in Nonlinear Optimization, volume 24 of Applied Optimization, pages 87-100, Kluwer: Dordrecht.
Colorni A., Dorigo M., Maniezzo V. [1991], Distributed optimization by ant colonies, in Proceedings of the European Conference on Artificial Life (ECAL'91), edited by F. Varela and P. Bourgine, 134-142, Cambridge, Mass., USA, MIT Press.

Colorni A., Dorigo M., Maniezzo V., Trubian M. [1994], Ant system for job-shop scheduling, Belgian Journal of Operations Research (JORBEL), Statistics and Computer Science, 34, 1, 39-53.

Deneubourg J.L., Pasteels J.M., Verhaeghe J.C. [1983], Probabilistic behaviour in ants: A strategy of errors?, Journal of Theoretical Biology, 105, 259-271.

Deneubourg J.L., Goss S. [1989], Collective patterns and decision-making, Ethology Ecology & Evolution, 1, 295-311.

Dorigo M. [1992], Optimization, learning and natural algorithms, Ph.D. Thesis, Politecnico di Milano, Italy.

Dorigo M., Di Caro G. [1999], The Ant Colony Optimization Meta-Heuristic, in D. Corne, M. Dorigo and F. Glover, editors, New Ideas in Optimization, McGraw-Hill.

Dorigo M., Gambardella L.M. [1997], Ant colonies for the traveling salesman problem, BioSystems, 43, 73-81.

Dorigo M., Maniezzo V., Colorni A. [1991], Positive feedback as a search strategy, Technical Report No. 91-016, Politecnico di Milano, Italy, 20 pages.

Finkel R.A., Bentley J.L. [1974], Quad trees, a data structure for retrieval on composite keys, Acta Informatica, 4, 1-9.

Goss S., Beckers R., Deneubourg J.L., Aron S., Pasteels J.M. [1990], How trail laying and trail following can solve foraging problems for ant colonies, in Behavioural Mechanisms of Food Selection, R.N. Hughes, editor, NATO-ASI Series, vol. G20, Berlin: Springer-Verlag.

Gravel M., Price W., Gagné C. [2001], Scheduling continuous casting of aluminum using a multiple-objective ant colony optimization metaheuristic, Document de Travail no. 2001-004, Faculté des Sciences de l'Administration, Université Laval, Québec, Canada (submitted for publication).

Habbas Z., Krajecki M., Singer D. [2000], Domain Decomposition for Parallel Resolution of Constraint Satisfaction Problems with OpenMP, in Proceedings of the Second European Workshop on OpenMP, Edinburgh, Scotland, 1-8.

OPENMP ARCHITECTURE REVIEW BOARD [1998], OpenMP C and C++ Application Program Interface Version 1.0, http://www.openmp.org.

Talbi E-G., Roux O., Fonlupt C., Robillard D. [1999], Parallel ant colonies for combinatorial optimization problems, BioSP3 Workshop on Biologically Inspired Solutions to Parallel Processing Systems, in IEEE IPPS/SPDP'99 (Int. Parallel Processing Symposium / Symposium on Parallel and Distributed Processing), edited by Jose Rolim, Lecture Notes in Computer Science, Springer-Verlag, San Juan, Puerto Rico, USA.