CHAPTER 3
Parallelisation of GA
The current serial implementation of GA takes about half a day to perform
500 generations of a complementary peptide prediction (on the SGI Indy).
As the prediction of 3D ligand structures will require more intensive computation, GA
will take even longer to reach a conclusion. It is therefore suggested
that GA be parallelised to reduce the computation time required. The bulk
synchronous parallel (BSP) computation model is used, and a parallel version of GA
runs successfully on an SGI PowerChallenge with 4 processors.
3.1 BSP COMPUTATION MODEL
The BSP model is a generalisation of the parallel random access machine
(PRAM) model (McColl 1994). PRAM involves a set of synchronous processors
which run in parallel and communicate via a common random access memory. BSP
allows PRAM to be modelled efficiently by controlling the routing network via
barrier synchronisation. Other features of BSP include the requirement for data
partitioning and a network that communicates point-to-point at uniform cost.
3.1.1 Barrier synchronisation of the processors
A BSP computation consists of a sequence of supersteps. Each superstep is
itself a sequence of computing steps followed by a barrier synchronisation. Local updating
and remote references are performed during the superstep, but the updating of global
memory occurs only after the synchronisation. As such, values read must belong to
the earlier superstep, and values written are only ready after the superstep ends
(section 3.1.3).
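As an illustration, a single superstep under the Oxford BSP library is bracketed by a pair of calls. The sketch below is a minimal example, not the thesis code; the primitive names (bspsstep, bspsstep_end, bspstore) follow the library described by Miller & Reed (1993), but the exact signatures and the header name shown here are assumptions.

    #include "bsp.h"              /* Oxford BSP library header (name assumed) */
    #define BLOCK 64

    double local[BLOCK];          /* data private to this process             */
    double incoming[BLOCK];       /* filled by a remote store from a peer     */

    void one_superstep(void)
    {
        bspsstep(1);                        /* begin superstep 1              */
        for (int i = 0; i < BLOCK; i++)     /* local updating                 */
            local[i] += 1.0;
        /* remote reference: deposit our block into process 0's buffer;      */
        /* the stored values only become readable after the barrier below    */
        bspstore(0, local, incoming, sizeof local);
        bspsstep_end(1);                    /* barrier synchronisation        */
        /* any value read above belonged to the previous superstep; the      */
        /* values just written are ready only from the next superstep on     */
    }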
3.1.2 Partitioning of data
Data has to be divided and distributed onto a static set of processors. Each
processor runs the same program sequentially on its own set of program variables
(SPMD). This allows the data to be manipulated concurrently even though data
cannot be shared globally.
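For example, with a simple block distribution each process can work out its own slice of a D-element population from its process id alone. The fragment below is plain C and independent of any library; it is a sketch introduced here for illustration.

    /* Block partition of D items over p processes: process pid owns      */
    /* indices [lo, hi). The first D % p processes take one extra item.   */
    void block_range(int D, int p, int pid, int *lo, int *hi)
    {
        int base = D / p, extra = D % p;
        *lo = pid * base + (pid < extra ? pid : extra);
        *hi = *lo + base + (pid < extra ? 1 : 0);
    }

With D = 500 and p = 4, every process owns exactly 125 consecutive individuals.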
3.1.3 Implementation of a master-slave paradigm
This is one possible alternative to SPMD. One of the processes is identified as
the master process. The Oxford BSP library exchanges data between the master and the
slaves via remote assignment operations. These operations use a communication
network to transmit the data. The network is transparent to the user, which helps to
simplify design and coding.
Each call of an operation identifies both the destination and the source of the data.
This reduces the possibility of deadlock and produces more efficient code than
explicit message passing. To ensure that the results obtained
are consistent, a data object which is being changed by one process must not be fetched
by another process. Also, the same data cannot be both read and written in the same
superstep (Miller 1994).
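A hedged sketch of such a remote assignment: each slave stores its result into a disjoint element of a buffer owned by the master, and the buffer is only read after the barrier, so no data object is both read and written in the same superstep. The call names are those of the Oxford BSP library; the signatures are assumptions.

    #define MAXPROCS 8

    double part;                  /* this slave's local result               */
    double results[MAXPROCS];     /* only significant on the master, pid 0   */

    void collect(int mypid)
    {
        bspsstep(2);
        if (mypid != 0)           /* every call names both source and target */
            bspstore(0, &part, &results[mypid], sizeof(double));
        bspsstep_end(2);          /* only after this barrier may the master  */
                                  /* read results[1..p-1]                    */
    }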
3.1.4 BSP parameters
The parameters are (Bisseling & McColl 1994):
p = number of processors
s = processor speed (in time steps)
l = minimal size of a superstep, i.e. the interval between successive
synchronisations (in equivalent time steps)
g = cost of global communication¹

p    s (megaflops/sec)    l (flops)    g    N1/2 (words/store operation)
2    75                   900          12   21
4    75                   1600         12   25
Table 3-1: Parametric values for the SGI PowerChallenge machine (Miller 1994).

¹ g is the ratio of global computation to communication balance, i.e. the ratio of the total local operations performed by all processors in one second to the total data words delivered by the communication network in one second.
A superstep in which each process stores h words to one other process costs
l + gh flops. For each process, the effective g is approximately g(1 + N1/2/k), where g
is the value in Table 3-1 and k is the number of words transferred during each
data-storage or data-fetching call. Both l and g depend on p: if p is increased, g is
likely to increase as well.
The cost C (in flops) of a superstep is

C = x + l + gh (equation 3-1)

(x: number of local operations performed by a processor)
(gh: maximum communication cost)

C can be lowered by performing fewer supersteps, because each superstep
requires additional synchronisation time. As such, each superstep should be designed
to contain the maximum number of steps while remaining consistent.
From equation 3-1, it is possible to re-express C as

C = E(D,p)·xi + l + E(D,p)·g·hi (equation 3-2)

(hi: words sent by an individual)
(xi: operations performed by an individual)
(D: overall population size)
(E(D,p): a function of the number of processors and the population size)

The BSP cost model predicts the performance of algorithms that are to be
implemented. The parallel efficiency is defined as E = (cost on 1 processor) /
(p × cost on p processors). If n is the number of flops required for the sequential
computation, we get

E = n / (n + p(l + gh)) (equation 3-3)
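For concreteness, equations 3-1 and 3-3 can be written down directly as plain C helpers (no library calls; the symbol names mirror the text):

    /* Equation 3-1: cost in flops of one superstep.                      */
    double superstep_cost(double x, double l, double g, double h)
    {
        return x + l + g * h;      /* local work + sync + communication   */
    }

    /* Equation 3-3: parallel efficiency, given n sequential flops,       */
    /* p processors and one superstep of communication per processor.     */
    double efficiency(double n, int p, double l, double g, double h)
    {
        return n / (n + p * (l + g * h));
    }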
The von Neumann model has been described as an efficient interface between
serial software and hardware. The BSP model acts as a similarly efficient bridge, but
between parallel software and hardware. The Oxford BSP library includes a set of
functions that allow the control of processes, bulk synchronisation and data
communication (Miller & Reed 1993). These functions can be readily incorporated
into existing programs with minimal changes. Much of the hardware detail has been
'hidden' away by the library, which allows more effort to be spent on the
parallelising of GA.²,³

² The parallelising of GA can be expressed as follows (Mühlenbein 1991): xi^(t+1) = Gi(x1^t, ..., xN^t, F(x1^t), ..., F(xN^t)), where i = 1, ..., N and N is the number of different individuals/GA searches which are to be performed in parallel. xi refers to the position of individual i, F(xi) describes the fitness of individual i, t refers to a particular generation and G is the selection schedule. G = (G1, ..., GN) describes the schema exchange among the searches. If the searches do not communicate, we get xi^(t+1) = Gi(xi^t, F(xi^t)). If two searches communicate with each other, we get xi^(t+1) = Gi(x(i-1)^t, xi^t, F(x(i-1)^t), F(xi^t)).
³ The C library is used because the GA codes have been written in C.
3.2 PARALLELISING GA
A number of factors become important when GA is run in parallel: the
number of processors, the inter-processor communication costs and the
communication between different GA runs (Goldberg et al. 1995). The way data
exchange occurs between different GA runs can affect the local performance of a GA.
Possible parallel strategies range from the Ideal Mixing Model (running GAs
concurrently before exchanging data) to the Isolated GA Model (running GAs
concurrently in complete isolation).
SEGA has been chosen for parallelising. The computation required by its
mechanisms is as follows:
Reorder - has O(n) time complexity (section 1.11.1).
Genetic operators - HillCrossover and HillMutation have O(n²) time
complexity in the worst case (sections 1.11.3, 1.11.4). They involve
randomly choosing a site along the chromosome for the purpose of
information exchanges and changes. If the new children produced fail to
improve on the fitness, the process is repeated and another site is picked to
generate another pair of children.
Since the latter pair of genetic operators requires much more intensive
computation (this will become worse when three-dimensional evaluation is involved),
it was decided to focus the parallelising on just HillCrossover and HillMutation.
Interestingly, most other GA applications also spend most of their time
performing function evaluations (Mitchell, Holland & Forrest 1995).
3.3 IMPLEMENTATION
It is useful to eliminate synchronisations, as the achievable speedup is
constrained by the slowest individual (Maruyama, Konagaya & Konishi 1992).
It is also important to achieve the same or better quality of solutions as the
sequential genetic algorithm.
3.3.1 IntraPopulation Parallelisation
The master-slave paradigm has been tried by researchers in both synchronous and
asynchronous manners (Huntley & Brown 1991). The synchronous version requires the
master to wait for all the slaves to finish before it can commence the next task. In
the asynchronous version, the master process does not wait for the slaves; instead,
when a slave has completed a task, it messages the master and awaits the next
instruction. A synchronous version is implemented with the Oxford BSP library
(a sketch follows Diagram 3-1):

The master process performs Reorder in a superstep. Upon
completion of Reorder, the population of individuals is segmented into four quarters
and distributed among the four processors. Each processor then performs a superstep
of HillCrossover and HillMutation on its quarter of the population.

Once the operations are completed, the data are recombined and stored on the
master process. The master process completes the generation by computing the relevant
statistics and then starts the next round of operations.
[Diagram 3-1: Intra-Population Parallelisation of GA. Master (Reorder) → Distribution of Data → Slaves 1..n (Crossover & Mutation) → Recombination of Data → Master (PostProcess).]
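The following is a sketch of the generation loop just described, reusing block_range from the fragment in section 3.1.2. Reorder, HillCrossover, HillMutation and PostProcess stand for the existing SEGA routines, and Individual is a stand-in for the existing chromosome structure; as before, the Oxford BSP signatures are assumptions rather than a verbatim listing of the thesis code.

    typedef struct {                     /* stand-in for SEGA's chromosome */
        double fitness;
        char   genes[64];
    } Individual;

    void Reorder(Individual *, int);            /* existing SEGA routines  */
    void HillCrossover(Individual *, int, int);
    void HillMutation(Individual *, int, int);
    void PostProcess(Individual *, int);
    void block_range(int, int, int, int *, int *);   /* see section 3.1.2  */

    void parallel_generation(Individual pop[], int D, int nprocs, int mypid)
    {
        int lo, hi, p;

        bspsstep(1);                            /* superstep 1: master     */
        if (mypid == 0) {
            Reorder(pop, D);
            for (p = 1; p < nprocs; p++) {      /* distribute the quarters */
                block_range(D, nprocs, p, &lo, &hi);
                bspstore(p, &pop[lo], &pop[lo],
                         (hi - lo) * sizeof(Individual));
            }
        }
        bspsstep_end(1);

        bspsstep(2);                            /* superstep 2: all slaves */
        block_range(D, nprocs, mypid, &lo, &hi);
        HillCrossover(pop, lo, hi);             /* work on own quarter     */
        HillMutation(pop, lo, hi);
        if (mypid != 0)                         /* recombine on the master */
            bspstore(0, &pop[lo], &pop[lo],
                     (hi - lo) * sizeof(Individual));
        bspsstep_end(2);

        if (mypid == 0) PostProcess(pop, D);    /* statistics, next round  */
    }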
3.3.2 BSP Cost Computation
The BSP cost of the distribution and recombination of data is calculated as
follows. Assume:
p = 4 processors
l = 1600 flops
hi = 25 words communicated by an individual
g = 12 × (1 + 25/hi) = 24 ⁴
D = 500 individuals (population size)
xi = 200 fitness evaluations by an individual via Crossover and Mutation,
assuming that an individual has to perform 10 Crossovers and 10 Mutations for
hillclimbing, and that each operator performs 10 fitness evaluations for the
various window sizes
n = D·xi (total number of fitness evaluations for the sequential computation)

From equation 3-2, we get C ≈ D·xi/p + l + D·g·hi/p, which simplifies to
C ≈ 25000 + 1600 + 75000 (Table 3-1). This indicates that the communication cost is
going to be relatively high and may affect the efficiency. When the parallel efficiency
is computed for a generation of the population (taking each fitness evaluation as
25 flops), we get

E = (500 × 200 × 25) / (2500000 + 4 × 2 × (1600 + 75000)) ≈ 80%

(the calculation assumes that each generation needs two supersteps to distribute and
recombine the data). The figure obtained is reasonable, and it seems that the above
parallelising approach is feasible.⁵

⁴ Assuming each superstep contains a single data-fetch or data-store call, then k = hi.
⁵ The efficiency is later recomputed by changing l and g, as described in section 3.5.
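The arithmetic above can be checked mechanically. This throwaway program (plain C) reproduces the 80% figure under the same assumptions, including the 25 flops per evaluation implicit in the original numbers:

    #include <stdio.h>

    int main(void)
    {
        double p = 4, l = 1600, g = 24, hi = 25, D = 500, xi = 200;
        double comm = D * g * hi / p;              /* 75000 flops           */
        double C = D * xi / p + l + comm;          /* 25000 + 1600 + 75000  */
        double n = D * xi * 25;                    /* 25 flops/evaluation   */
        double E = n / (n + p * 2 * (l + comm));   /* two supersteps        */
        printf("C = %.0f flops, E = %.0f%%\n", C, 100 * E);    /* E = 80%   */
        return 0;
    }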
3.4 RESULTS
The IntraPopulation version of GA was attempted, with the number of processors
ranging from 1 to 4. The times taken by serialGA and parallelGA are as follows:
Processors                    Time (seconds)
Serial                        112
1 processor (with vm flag)    113
1 processor                   111
2 processors                  56
3 processors                  38
4 processors                  29
Table 3-2: Comparison of time taken by different numbers of processors
[Graph 3-1: Time taken by GA (seconds, 0 to 120) versus number of processors (Serial, 1 vmproc, 1 proc, 2 proc, 3 proc, 4 proc).]
It is mentioned that if the programs are compiled with the '-vm' flag, the resulting
code will run faster, because '-vm' allows processes direct access to
each other's virtual memory. Without the flag, a shared-memory buffer has to be
created among the processes. However, no significant difference is observed in the
above runs.
3.5 DISCUSSION
The fitnesses of the individuals obtained via the above parallelisation are identical to
those of the serial version of GA - there is no improvement in accuracy. Improvement in
accuracy may be achieved by adopting DGA (section 3.6.3).

From the BSP cost computation above (section 3.3.2), C is dominated by the
communication cost, D·g·hi/p. It is possible that the distribution and recombination of
data may become a serious bottleneck, because these supersteps perform sequential
processing on the master process. However, the results obtained indicate that the strategy
of dividing the most intensive operators is feasible: a near-linear speedup is achieved.
The practical efficiency obtained is 112 / (4 × 29) ≈ 97%. The efficiency is
higher than the predicted value, possibly because l and g have been estimated too high.
The SGI machine takes about 112 seconds to perform 500 generations of 2
million evaluations. This amounts to about 9 megaflops, which is only 0.12 (9/75) of
the best possible speed of the machine. When both l and g are scaled by the factor of
0.12, we get

E = 2500000 / (2500000 + 8 × 0.12 × (1600 + 75000)) ≈ 97%

The new estimate is very close to the practical efficiency obtained.
The design of SEGA contributed to the good parallel performance:
- Reorder - a fast process which does not need parallelising.
- HillCrossover - each slave performs many evaluations per
generation.⁶ This repetitive performance of the same task helps to improve the
cache hit rate.
- Faster convergence - SEGA needs fewer generations to converge. This cuts
down the synchronisation frequency and the synchronisation time required.
This version of GA will be even more useful when intensive three-dimensional
computation is performed in future: xi will increase, and the communication cost
will correspondingly become less significant (section 3.3.2).
⁶ The above approach is similar to Micro-Grained Parallelism (MGP) (Punch et al. 1993). MGP divides the bulk of the fitness evaluation among processors by allocating sets of evaluations to each processor so as to achieve almost linear speedup.
3.6 IMPROVING THE PARALLEL GA
It has been suggested that for serial GA, a smaller population should be used and
run repeatedly. This maintains diversity via the repeated injection of
fresh schemata, with the added advantage of allowing random search about
solutions while maintaining the processing rate of useful schemata. On the other hand,
the population should preferably be kept large when GA is run in parallel, so as
to obtain better schema averages, since a bigger population has better diversity (Goldberg
1989b). As such, the population size will probably be kept large for the
following strategies, which attempt to improve both efficiency and accuracy.
3.6.1 InterPopulation Parallelisation
GA can also be parallelised at a higher level, that is, InterPopulation
parallelisation instead of IntraPopulation parallelisation (Table 4-1):
[Diagram 3-2: InterPopulation Parallelism of GA. Processes 1..n (Reorder & Crossover & Mutation) → Combination of Fitter Individuals → Process 0 (Meta-GA) → Distribution of 'Enhanced' Individuals.]
Each group of slave processor(s) is tasked with running a population
of individuals (either serially or in parallel). Upon completion of a generation, a
number of fitter individuals is obtained per group. These are updated onto the master
processor(s), which then performs a meta-GA run. The resultant population is then
distributed back to the rest of the populations.

This approach facilitates the easy parallelising of GA by simply running a
copy of GA (or a few copies) on each processor. Besides, the meta process can
apply different techniques to obtain very fit individuals (for example, hillclimbing
as in section 2.6.2).
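A hedged sketch of the resulting loop follows. RunGeneration, SelectFittest, MetaGA and InjectMigrants are hypothetical helper names introduced for illustration (not existing routines), K and MAXGEN are illustrative constants, Individual is as in the earlier sketch, and the BSP calls carry the same signature assumptions as before.

    #include <string.h>
    #define K        5          /* elites shipped per process (assumed)   */
    #define MAXGEN   500
    #define MAXPROCS 8

    void RunGeneration(Individual *, int);           /* hypothetical      */
    void SelectFittest(Individual *, int, Individual *, int);
    void MetaGA(Individual *, int, Individual *, int);
    void InjectMigrants(Individual *, int, Individual *, int);

    void interpop_ga(Individual mypop[], int popsize, int nprocs, int mypid)
    {
        Individual elite[K], pool[MAXPROCS * K];

        for (int gen = 0; gen < MAXGEN; gen++) {
            RunGeneration(mypop, popsize);       /* Reorder + operators    */

            bspsstep(2 * gen);                   /* pool the elites        */
            SelectFittest(mypop, popsize, elite, K);
            if (mypid == 0)                      /* master's own elite     */
                memcpy(pool, elite, K * sizeof(Individual));
            else
                bspstore(0, elite, &pool[mypid * K],
                         K * sizeof(Individual));
            bspsstep_end(2 * gen);

            bspsstep(2 * gen + 1);               /* meta-GA, then scatter  */
            if (mypid == 0) {
                MetaGA(pool, nprocs * K, elite, K);   /* e.g. hillclimbing */
                for (int p = 1; p < nprocs; p++)
                    bspstore(p, elite, elite, K * sizeof(Individual));
            }
            bspsstep_end(2 * gen + 1);
            InjectMigrants(mypop, popsize, elite, K); /* replace weakest K */
        }
    }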
This is a form of 'coarse-grained parallelisation' which involves a set of GAs
being run in parallel and interacting via the exchange of individuals. It may require many
runs, especially if each run is performed on a smaller population (Goldberg et al.
1995). Note that the fine-grained model (neighbourhood model) involves a single
population, each individual of which is placed in a cell of a planar grid, and operators
are only applied between neighbouring individuals on the grid⁷ (Stender 1993). The
smaller the neighbourhood, the less overhead for synchronisation and inter-processor
communication is needed. However, this also increases the chance of converging on
locally optimal solutions.

The bottleneck problem (section 3.5) still exists for this strategy.

⁷ MIMD machines (transputer-based) are more suited for DGA, while SIMD machines (array processors) are more suited for the fine-grained model (Dorigo & Maniezzo 1993).
3.6.2 Non-Generational and SEGA Parallelisation
A good speedup may be achieved by modifying the above master-slave set-up.
Each slave can be tasked to run a number of generations (Reorder, crossover and
mutation) until a plateau in fitness values is reached. The slaves then update the
master, which performs a global selection before redistributing the individuals to the
slaves again. This reduces the communication between the slaves and the master, and
minimises the bottleneck problem. The method also preserves diversity, as each slave
achieves its own speciation differently (section 3.6.3). Hence, it may help SEGA to
slow down the convergence rate and improve the fitness, as in the sketch below.
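One way to express the plateau test in C; EPSILON, PLATEAU_LEN and the helper names are hypothetical, and Individual is as in the earlier sketches.

    #define EPSILON     1e-6    /* minimum improvement that counts        */
    #define PLATEAU_LEN 5       /* generations without improvement        */

    double BestFitness(Individual *, int);           /* hypothetical      */
    void   RunGeneration(Individual *, int);

    /* Run generations locally until fitness stalls; only then pay the   */
    /* communication cost of a global selection on the master.           */
    void run_until_plateau(Individual mypop[], int popsize)
    {
        double best = BestFitness(mypop, popsize), prev;
        int stalled = 0;

        while (stalled < PLATEAU_LEN) {
            prev = best;
            RunGeneration(mypop, popsize);  /* Reorder, crossover, mutation */
            best = BestFitness(mypop, popsize);
            stalled = (best - prev < EPSILON) ? stalled + 1 : 0;
        }
        /* synchronise with the master for the global selection here      */
    }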
3.6.3 DGA
Distributed GA (DGA) is another InterPopulation method. It performs GA on
a number of smaller subpopulations and has a migration phase after each generation
(Tanese 1989). During this phase, a portion of each subpopulation is selected and
exchanged with another subpopulation. Migration, like mutation, introduces diversity,
but it is less destructive than mutation because migration introduces useful variation
and does not randomly change existing schemata (Gorges-Schleuter 1989).
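A migration phase of this kind might look as follows; the ring topology and the helper names are assumptions for illustration, not Tanese's actual scheme, and the BSP calls carry the same signature assumptions as before.

    #include <string.h>
    #define MIGRANTS 10

    void SortByFitness(Individual *, int);         /* hypothetical helper  */
    Individual immigrants[MIGRANTS];   /* filled by the neighbour's store  */

    /* After each generation, send our best MIGRANTS individuals to the   */
    /* next subpopulation in a ring; after the barrier, replace our worst */
    /* individuals with what arrived in immigrants[].                     */
    void migrate(Individual sub[], int size, int mypid, int nprocs)
    {
        int dest = (mypid + 1) % nprocs;           /* ring neighbour       */

        SortByFitness(sub, size);                  /* best first           */
        bspsstep(3);
        bspstore(dest, sub, immigrants, MIGRANTS * sizeof(Individual));
        bspsstep_end(3);
        memcpy(&sub[size - MIGRANTS], immigrants,  /* worst replaced       */
               MIGRANTS * sizeof(Individual));
    }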
After migration, the system explores new areas via crossover (Mühlenbein
1991). To prevent a subpopulation from being dominated by superior migrants, the
rate of migration has to be controlled (Baluja 1993). It has been observed that
multiple runs of DGA with small population sizes can produce results that are as
good as those of a single large population of TGA (and of micro-grained GA).⁸ Alternatively,
it is also possible for a subpopulation which is stuck at a local optimum to broadcast
this, so that other subpopulations can send it their best individuals. In that case, the
exchange of individuals happens at different intervals (Kröger, Schwenderling &
Vornberger 1990).

At present, it is still debatable whether DGA will work for most functions. DGA
may be less efficient because each processor wastes computing cycles on
subpopulations that are suboptimal; hence, more generations are needed to converge
(Bianchini & Brown 1993). Other work suggests that DGA may not work well for
functions in which the recombination of lower-order building blocks is important
(Forrest & Mitchell 1991). These lower-order schemata are necessary for deceptive
problems (Mahfoud & Goldberg 1995).

⁸ A number of suggestions have been given to account for DGA performance:
a. Shifting Balance Theory - populations in nature are able to avoid being stuck in local optima because fit individuals spread from successful subpopulations (which, being more successful, are bigger in size) to other populations (Wright 1932).
b. Punctuated Equilibrium Theory - evolution seems to progress in leaps and bounds rather than being regularly 'spaced'. It is suggested that this is due to the isolation of subpopulations, which results in speciation upon convergence. These species only evolve rapidly when new individuals (with new schemata) migrate from elsewhere and are added to the subpopulations (Eldredge & Gould 1972).
c. Retaining of changes - it is proposed that subpopulations tend to maintain overall population diversity because each subpopulation evolves separately and converges differently, hence preserving the diversity (Futuyma 1987).
d. Tendency for TGA to get stuck - a single copy of TGA tends to find a local optimum rapidly and get stuck, as the diversity of the population drops rapidly (Forrest & Mitchell 1991).
Suggestions (a) and (c) seem to contradict each other. However, both can be synergetic if a proper balance of migration rate can be achieved.
DGA has much similarity with niche-formation methods (Deb & Goldberg
1989). A niche refers to the surroundings inhabited by an individual. When several
niches arise via division of the natural environment, inter-species competition is
reduced. This encourages exploitation of the local search space and forms a stable
subpopulation at each niche.⁹ The technique may come in handy for problems
(for example, drug design) that have multiple and equally 'important' peaks (Keane
1995). Other possible methods include Messy GAs (Goldberg, Korb & Deb 1989) and
crowding.

⁹ A sharing scheme has been implemented whereby a population is divided into subpopulations according to the similarity of individuals, and an individual's fitness drops if it has to share more resources with more neighbours. A mating restriction scheme has also been tried to improve the performance.
3.6.4 Micro and Coarse-grained Parallelisation
Theoretically, it is better to have more threads running than there are
physical processors. This optimises the usage of the processors and prevents them
from idling. It can readily be achieved by integrating both micro- and coarse-grained
GA: multiple copies of GA are run in parallel, and each copy has its genetic
operations processed in parallel as well. This approach is probably suitable for
massively parallel machines.

Alternatively, it is also possible to perform both InterPopulation and
IntraPopulation parallelism concurrently.
3.6.5 Fine-grained Parallelisation
This distribution involves spatiality (Manderick & Spiessens 1989). Each
individual has a specific place in a spatial environment (for example, a grid implemented
as a toroidal array), and the genetic operators act as local updating rules, as sketched
below:
- selection is performed over nearby individuals for each cell;
- crossover involves recombining a cell with a randomly chosen neighbour and
then replacing the cell by one of the offspring;
- mutation is applied to all cells.
As such, the model does not need a global control structure. This approach
involves more exploration of the search space because of the local selection. It has been
found that while a model with a large neighbourhood performs like the serial version, a
model with a small neighbourhood (9 to 25 cells) has better performance. The fine-grained
model is suited for combinatorial optimisation problems.
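The three updating rules can be sketched for a toroidal grid in plain C. Fittest, CrossoverOp and MutateOp are hypothetical helpers standing for selection, recombination and mutation; here selection picks the fittest of the four adjacent cells as the mate, whereas the text above also allows a randomly chosen neighbour.

    #define W 16
    #define H 16

    Individual *Fittest(Individual *, Individual *,      /* hypothetical  */
                        Individual *, Individual *);
    Individual  CrossoverOp(Individual *, Individual *);
    void        MutateOp(Individual *);

    /* One local update of cell (i, j) on a W x H toroidal grid: select a */
    /* mate among the four adjacent cells, recombine, keep one offspring, */
    /* then mutate. No global control structure is involved.              */
    void update_cell(Individual grid[H][W], int i, int j)
    {
        int up = (i + H - 1) % H, down = (i + 1) % H;    /* wrap around    */
        int left = (j + W - 1) % W, right = (j + 1) % W;

        Individual *mate = Fittest(&grid[up][j], &grid[down][j],
                                   &grid[i][left], &grid[i][right]);
        grid[i][j] = CrossoverOp(&grid[i][j], mate);     /* one offspring  */
        MutateOp(&grid[i][j]);                     /* applied to all cells */
    }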