Rashmi Sharan Sinha1, Satvir Singh2, Sarabjeet Singh3 and Vijay Kumar Banga4
1,2 Department of Electronics & Communication Engineering, SBS State Technical Campus, Ferozepur–152004, (Punjab) India
3 Department of Computer Science & Engineering, SBS State Technical Campus, Ferozepur–152004, (Punjab) India
4 Department of Electronics & Communication Engineering, Amritsar College of Engg. & Tech., Amritsar–143001, (Punjab) India
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]
Abstract
The Genetic Algorithm (GA) is one of the most popular swarm-based evolutionary search
algorithms; it simulates the natural phenomenon of genetic evolution to search for
solutions to arbitrary engineering problems. Although GAs are very effective in solving
many practical problems, their execution time can become a limiting factor when evolving
solutions to most real-life problems, as these involve a large number of parameters to
be determined. Fortunately, the most time-consuming operators, i.e., fitness evaluation,
selection, crossover and mutation, involve multiple data-independent computations.
Such computations can be parallelized on GPU cores using the Compute Unified Device
Architecture (CUDA) platform. In this paper, the various operations of GA, such as
fitness evaluation, selection, crossover and mutation, are implemented in parallel on
GPU cores, and their performance is compared with a serial implementation. The
algorithm's performance in serial and in parallel implementations is examined on a
testbed of well-known benchmark optimization functions. The performance is analyzed
over varying parameters, viz. (i) population sizes, (ii) dimension sizes, and (iii)
problems of differing complexity. Results show that the overall computational time can
be decreased substantially by parallel implementation on GPU cores: the proposed
implementation runs 1.18 to 4.15 times faster than the corresponding serial
implementation on a CPU.
1. Introduction
The Genetic Algorithm (GA) is a swarm-based global search algorithm inspired by the
natural mechanism of genetic improvement in biological species [1], as described by the
Darwinian theory of survival of the fittest. It was developed by John Holland in 1970 at
the University of Michigan [2]. It simulates the process of evolution to evolve solutions
to arbitrary problems [3].
GA maintains multiple candidate solutions, each represented by a string of variables
analogous to a chromosome in genetics. Starting from an initially randomly generated
population, every swarm member is required to be evolved. Evolution is based on
fitness: pairs of parent solutions are selected randomly and stochastically reproduce the
next generation of solutions. Each child chromosome carries features of both parents as
an outcome of crossover, while limited alteration of feature values across the generation
represents the effect of mutation.
GA essentially strives to attain the global maximum (or minimum) of a cost function,
depending upon the nature of the problem. Over years of advancement, GAs have been
widely used and extensively researched as optimization and search tools in several fields
such as medicine, engineering, and finance. The basic reasons for their success are their
simple structure and broad applicability to problems [4]. Goldberg and Harik introduced
the compact Genetic Algorithm (cGA), which represents the solution as a probability
distribution over the wide space of candidate solutions; Huanlai and Rong utilized this
concept in minimizing the resources of network codes [5]. GAs produce high-quality
solutions through their high adaptability to environmental changes [6]. Prakash and
Deshmukh investigated the use of meta-heuristics, including GA, for combinatorial
decision-making problems in flexible manufacturing systems [7]. Prominent GA
applications include pattern recognition [8], classification problems [9], protein folding
[10] and neural network design [11]. GAs are also suitable for multi-objective optimal
design problems [12], solving multiple objectives simultaneously.
Even though GAs are powerful in solving many practical problems, their execution time
acts as a bottleneck in some real-life problems, since a GA involves a large number of
trial vectors that need to be evaluated. However, the major portion of this time, the
fitness evaluations, can be performed independently owing to data independence and can
therefore be carried out using parallel computational mechanisms.
With the advent of General Purpose GPUs (GPGPUs), researchers have been porting
Evolutionary Computation [13-16] to parallel implementations. Similar advancements in
the field of genetic programming have been quickly adopted by GA researchers.
After this brief background, the remainder of the paper is organized as follows: Section II
describes the GA operators along with pseudocode for their parallel implementation.
Section III introduces the architecture of GPGPU and C-CUDA, followed by Section IV
on its implementation. Section V summarizes the performance evaluation of the
experimental results. The conclusion of this paper, with future prospects, is presented in
Section VI.
2. GA Operators
GA provides a number of solutions; however, the best among them is the one obtained
with the least processing time [17]. The primary operators involved in GA are (1)
selection, (2) crossover, (3) mutation and (4) elite solutions, described as follows.
2.1. Selection
There are three popular selection strategies, viz. Tournament Selection, Rank-Based
Roulette Wheel, and Roulette Wheel Selection [18]. These strategies are used to pick
potential parent chromosomes from the randomly generated population based on the
fitness of individuals. The selection operator is expected to produce solutions with
higher fitness in succeeding generations. At the same time, the selection operator should
give every individual a relative probability of being selected according to its fitness in
the swarm. This leads the algorithm to find the global best solution rather than
converging to its nearest local best solution.
2.1.3. Roulette Wheel: In this method, the selection of parent solutions for the next
generation of child solutions depends upon probabilities proportional to fitness values,
analogous to portions of a spinning roulette wheel. Chromosomes are chosen for the next
generation on the basis of their fitness values, i.e., a chromosome's chance of selection is
directly proportional to the section of the roulette wheel corresponding to its fitness
level. Let f1, f2, f3, ..., fn be the fitness values of chromosomes 1, 2, 3, ..., n. Then the
probability of selection Pi for chromosome i is defined as (1),

    Pi = fi / (f1 + f2 + · · · + fn)    (1)
The advantage of proportional roulette wheel selection is that it can select any solution
of the swarm, with probability proportional to its fitness value. Hence it maintains
diversity among solutions.
________________________________________________________________________
Algorithm 1: Pseudocode for Roulette Wheel Selection
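The original listing is summarized here by a minimal C-CUDA sketch of fitness-proportional selection, assuming the prefix sum of fitness values (cumFit) and uniform random numbers in [0, 1) (rnd) are already available on the device; the kernel and variable names are illustrative only, not the paper's.

// Hypothetical roulette-wheel selection kernel: one thread selects one parent.
__global__ void rouletteWheelSelect(const float *cumFit, const float *rnd,
                                    int *selIdx, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float target = rnd[tid] * cumFit[n - 1];   // spin the wheel
    int lo = 0, hi = n - 1;
    while (lo < hi) {                          // binary search for the winning slot
        int mid = (lo + hi) / 2;
        if (cumFit[mid] < target) lo = mid + 1;
        else                      hi = mid;
    }
    selIdx[tid] = lo;                          // index of the selected parent
}
________________________________________________________________________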
2.2. Crossover
The process of producing child chromosomes from parent chromosomes is termed
crossover. It is a significant operator which mimics biological crossover and
reproduction in nature. This operator in GA is broadly classified into three different
techniques, viz. single-point, double-point, and uniform crossover.
2.2.1. Single Point Crossover: In single-point crossover, the selected parent chromosome
strings are swapped beyond a randomly selected crossover point, as sketched below. The
resulting chromosomes after swapping form the child population for the next generation.
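As an illustration only (not the paper's listing), a plain C sketch of single-point crossover for one pair of parents, with a given cut point, is:

/* Hypothetical single-point crossover: genes before the cut come from one
   parent, genes after the cut from the other. */
void singlePointCrossover(const float *pA, const float *pB,
                          float *c1, float *c2, int genes, int cut)
{
    for (int g = 0; g < genes; ++g) {
        c1[g] = (g < cut) ? pA[g] : pB[g];
        c2[g] = (g < cut) ? pB[g] : pA[g];
    }
}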
2.2.2. Double Point Crossover: Double-point crossover is similar to single-point
crossover; however, there are two crossover points rather than one.
________________________________________________________________________
Algorithm 2: Pseudocode for Uniform Crossover
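The original listing is summarized here by a minimal C-CUDA sketch of uniform crossover, assuming a pre-generated random mask on the device; the kernel and variable names are illustrative only.

// Hypothetical uniform crossover kernel: one thread per gene. Each gene of
// the child is copied from parent A or parent B depending on a random mask.
__global__ void uniformCrossover(const float *parentA, const float *parentB,
                                 const float *mask, float *child, int genes)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < genes)
        child[g] = (mask[g] < 0.5f) ? parentA[g] : parentB[g];
}
________________________________________________________________________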
2.3. Mutation
The mutation operator is applied to preserve genetic variance (diversity) in succeeding
generations of the GA population. The mutation operator creates a new solution for each
possible trial solution. To avoid stagnation of the optimal search, the difference between
two chromosomes is scaled by a factor termed the mutation factor; in this experiment,
the factor is kept relative to the iteration number, between 0.01 and 1.
Each solution of the population is mutated by single-thread operations; scheduling a
block with a sufficient number of threads is needed to mutate the whole population.
________________________________________________________________________
Algorithm 3: Pseudocode for Mutation
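The original listing is summarized here by a minimal C-CUDA sketch of mutation, assuming pre-generated uniform random numbers on the device; the kernel and variable names, and the perturbation formula, are illustrative only.

// Hypothetical mutation kernel: one thread per gene. A gene is perturbed
// when its random draw falls below the mutation probability pm; 'step'
// plays the role of the mutation factor.
__global__ void mutate(float *pop, const float *rnd, float pm,
                       float step, int totalGenes)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < totalGenes && rnd[g] < pm)
        pop[g] += step * (2.0f * rnd[g] / pm - 1.0f);  // signed offset in [-step, step)
}
________________________________________________________________________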
3. GPGPU and C-CUDA Architecture
The memory in the CUDA architecture is divided into four types, viz. texture memory,
constant memory, per-thread private local memory, and global memory for data shared
by all threads. Among these, texture memory and constant memory can be accessed as
fast read-only caches, while registers and shared memory are the fastest memories
overall.
For constant memory, the optimal access pattern is for all threads to read the same
memory location. Threads can read neighboring addresses through the texture cache
with high reading efficiency. The CUDA functions cudaMalloc and cudaFree are used
for allocation and deallocation of memory, respectively, and the CUDA function
cudaMemcpy is used to copy data from the host to the device.
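As an illustration, a minimal host-side sketch of this allocate/copy/free pattern is given below; the population size POP and dimension DIM are illustrative placeholders, not values from the paper.

#include <cuda_runtime.h>
#include <stdlib.h>

#define POP 256   /* illustrative population size */
#define DIM 32    /* illustrative dimension size  */

int main(void)
{
    size_t bytes = POP * DIM * sizeof(float);
    float *h_pop = (float *)malloc(bytes);       /* host-side population     */
    for (int i = 0; i < POP * DIM; ++i)
        h_pop[i] = (float)rand() / RAND_MAX;     /* random initial population */

    float *d_pop;
    cudaMalloc((void **)&d_pop, bytes);          /* allocate device memory   */
    cudaMemcpy(d_pop, h_pop, bytes,
               cudaMemcpyHostToDevice);          /* copy host -> device      */
    /* ... kernel launches operate on d_pop ... */
    cudaMemcpy(h_pop, d_pop, bytes,
               cudaMemcpyDeviceToHost);          /* copy results back        */
    cudaFree(d_pop);                             /* deallocate device memory */
    free(h_pop);
    return 0;
}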
4. Implementation of GA on GPU
4.1. All GA solutions are computed using thread blocks at each generation. The
maximum GA population size at each generation is limited by the total number of
threads, which is currently (2^16 - 1)^2.
4.2. Every trial solution uses threads to compute its possible outcome. The GPU's limit
is 512 threads per block for compute capability 1.x and 1024 for 2.x and above; hence it
is directly tied to the hardware capability.
4.3. The GPU accesses all trial solutions in one step, i.e., with each kernel call C-CUDA
launches a number of threads equal to the population size of the generation.
These characteristics make GPUs well suited for massive implementations of such
algorithms and readily speed up the overall computation time of GA. The C-CUDA
kernels generate the population in one step and then compute the respective fitness
values. Each genetic operator is applied to every solution, with the number of threads
kept equal to the population size. Each succeeding generation of the population follows
the same strategy to find a solution. Owing to these benefits, the GPU takes less time
than sequential execution on the CPU.
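A sketch of this one-thread-per-solution launch pattern is given below; the sphere benchmark f(x) = x1^2 + ... + xD^2 stands in for the paper's test functions, and all names are illustrative assumptions.

#include <cuda_runtime.h>

// Hypothetical fitness kernel: one thread evaluates one trial solution.
__global__ void evaluateFitness(const float *pop, float *fit,
                                int popSize, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= popSize) return;
    float s = 0.0f;
    for (int d = 0; d < dim; ++d) {      // sphere function: sum of squares
        float x = pop[i * dim + d];
        s += x * x;
    }
    fit[i] = s;
}

// Host-side launch: one thread per trial solution, so a whole generation
// is evaluated in a single kernel call.
void evaluateGeneration(const float *d_pop, float *d_fit, int pop, int dim)
{
    int threadsPerBlock = 256;   // hardware-dependent choice
    int blocks = (pop + threadsPerBlock - 1) / threadsPerBlock;
    evaluateFitness<<<blocks, threadsPerBlock>>>(d_pop, d_fit, pop, dim);
    cudaDeviceSynchronize();     // wait for the generation to finish
}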
5. Performance Evaluation
5.1. Experimental Setup
The experiments were conducted on two different PCs (PC1 and PC2) with the same
nVidia cards (refer to Table I for system specifications) for separate performance
evaluations. PC1 was tested with active background applications, whereas PC2 was kept
idle until the complete performance evaluation was carried out. Each card has 16
streaming multiprocessors of 16 streaming processors each, hence 256 streaming
processors in each PC. The entire C-CUDA GA code was written in Visual Studio C++
(2012, release mode) and compiled with the nvcc compiler. The experiments were
evaluated using two different iteration counts, 10,000 and 100,000, with the dimension
size kept fixed within each run. The acceleration of GA on the GPU proves efficient for
large dimension sizes and maximum iteration counts. Tables III and IV in the next
section show a significant speedup.
5.3.1. Case Study 1: In this experiment, the dimension size of the generated population
is set first to 32 and then to 64. Each dimension size is run for the maximum number of
iterations, set to 10,000. The performance shown in the result table is the average of 20
trials. The speedups of GPU over CPU for all seven benchmark test functions are shown
in Table III and plotted in Figure 3 and Figure 4.
The best computational performance achieved between GPU1 and GPU2 for a
dimension size of 32 is 2.17 times, for f2(x) on GPU1, while on GPU2 with a dimension
size of 64, f2(x) shows a speedup of 4.15 times over its CPU execution time. Table III
depicts the best results for 10,000 iterations along with solution quality.
5.3.2. Case Study 2: In this experiment the dimension sizes are kept the same as in Case
Study 1; however, the iteration count is increased to 100,000. The respective speedups
for all seven benchmark test functions are shown in Table IV and plotted in Figure 5 and
Figure 6.
The highest speedup achieved in this case for dimension size 32, between GPU1 and
GPU2, is 2.39 for test function f7(x). For dimension size 64, the best speedup is 2.78
times, again for f7(x). The average GPU execution time is 33.70 seconds, against 127.52
seconds on the CPU.
The GPU implementation can hide the latency of memory access by executing many
threads in parallel, while the CPU implementation executes the GA calculation
sequentially. In particular, it is notable that our implementation, which parallelizes the
processing of both individuals and their data, is more effective because it enables the
execution of more threads than other schemes. In addition, most GA processes are
executed on the GPU, which suppresses the frequency of data transfers between the host
and the device, typically the bottleneck for GPU speedup.
6. Conclusion
In this paper, an implementation of GA on GPGPU using C-CUDA is presented. The
massively parallel architecture of the GPU is exploited to attain the required speedups in
GA, showing an acceleration of 1.18-4.15 times compared with sequential execution on
the CPU across a variety of benchmark test functions. From these results it is concluded
that the algorithm can be further optimized for several search problems to broaden its
applicability. In future work, the performance of the GA model will be further improved
by extending the single-objective GA to a multi-objective GA. The model will be
improved further by combining the multi-objective GA with a fuzzy logic system, which
is expected to yield a fast parallel approach.
References
[1] L. De Giovanni and F. Pezzella, “An improved genetic algorithm for the distributed and flexible job-
shop scheduling problem”, European journal of operational research, vol. 200, no. 2, (2010), pp. 395-
408.
[2] J. H. Holland, “Genetic algorithms and the optimal allocation of trials”, SIAM Journal on Computing,
vol. 2, no. 2, (1973), pp. 88-105.
[3] K. Ghoseiri and S. F. Ghannadpour, “Multi-objective vehicle routing problem with time windows using
goal programming and genetic algorithm”, Applied Soft Computing, vol. 10, no. 4, (2010), pp. 1096-
1107.
[4] K. P. Murphy, “Machine learning: a probabilistic perspective”, MIT press, (2012).
[5] H. Xing and R. Qu, “A compact genetic algorithm for the network coding based resource minimization
problem”, Applied Intelligence, vol. 36, no. 4, (2012), pp. 809-823.
[6] S. Yang, H. Cheng and F. Wang, “Genetic algorithms with immigrants and memory schemes for
dynamic shortest path routing problems in mobile ad hoc networks”, Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, vol. 40, no. 1, (2010), pp. 52-63.
[7] A. Prakash, F. T. Chan and S. Deshmukh, “Fms scheduling with knowledge based genetic algorithm
approach”, Expert Systems with Applications, vol. 38, no. 4, (2011), pp. 3161-3171.
[8] J. Adams, D. L. Woodard, G. Dozier, P. Miller, K. Bryant and G. Glenn, “Genetic-based type ii feature
extraction for periocular biometric recognition: Less is more”, Pattern Recognition (ICPR), 2010 20th
International Conference on. IEEE, (2010), pp. 205-208.
[9] A. Quteishat, C. P. Lim and K. S. Tan, “A modified fuzzy min–max neural network with a genetic-
algorithm-based rule extractor for pattern classification”, Systems, Man and Cybernetics, Part A:
Systems and Humans, IEEE Transactions, vol. 40, no. 3, (2010), pp. 641-650.
[10] Y. Zhang and L. Wu, “Artificial bee colony for two dimensional protein folding,” Advances in
Electrical Engineering Systems, vol. 1, no. 1, (2012), pp. 19-23.
[11] L. Magnier and F. Haghighat, “Multiobjective optimization of building design using trnsys simulations,
genetic algorithm, and artificial neural network”, Building and Environment, vol. 45, no. 3, (2010), pp.
739-746.
[12] S. Omkar, J. Senthilnath, R. Khandelwal, G. Narayana Naik and S. Gopalakrishnan, “Artificial bee
colony (abc) for multi-objective design optimization of composite structures”, Applied Soft Computing,
vol. 11, no. 1, (2011), pp. 489-499.
[13] F. Fabris and R. A. Krohling, “A co-evolutionary differential evolution algorithm for solving min–max
optimization problems implemented on gpu using c-cuda”, Expert Systems with Applications, vol. 39,
no. 12, (2012), pp. 10324-10333.
[14] O. Maitre, F. Krüger, S. Querry, N. Lachiche and P. Collet, “Easea: specification and execution of
evolutionary algorithms on gpgpu”, Soft Computing, vol. 16, no. 2, (2012), pp. 261-279.
[15] O. Maitre, N. Lachiche and P. Collet, “Fast evaluation of gp trees on gpgpu by optimizing hardware
scheduling”, Genetic Programming, Springer, (2010), pp. 301-312.
[16] M. A. Franco, N. Krasnogor and J. Bacardit, “Speeding up the evaluation of evolutionary learning
systems using gpgpus”, Proceedings of the 12th annual conference on Genetic and evolutionary
computation. ACM, (2010), pp. 1039-1046.
[17] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk and W.-M. W. Hwu, “Optimization
principles and application performance evaluation of a multithreaded gpu using cuda”, Proceedings of
the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, ACM,
(2008), pp. 73-82.
[18] M. R. Noraini and J. Geraghty, “Genetic algorithm performance with different selection strategies in
solving tsp”, Proceedings of the World Congress on Engineering 2011, vol. II, (2011).
[19] M. Oiso, T. Yasuda, K. Ohkura and Y. Matumura, “Accelerating steady-state genetic algorithms based
on cuda architecture”, Evolutionary Computation (CEC), 2011 IEEE Congress on. IEEE, (2011), pp.
687-692.
[20] P. Pospíchal, J. Schwarz and J. Jaros, “Parallel genetic algorithm solving 0/1 knapsack problem running
on the gpu”, 16th International Conference on Soft Computing MENDEL, vol. 2, (2010), pp. 64-70.
[21] R. Arora, R. Tulshyan and K. Deb, “Parallelization of binary and real-coded genetic algorithms on gpu
using cuda”, Evolutionary Computation (CEC), 2010 IEEE Congress, IEEE, (2010), pp. 1-8.
[22] M. Oiso, Y. Matsumura, T. Yasuda and K. Ohkura, “Evaluation of generation alternation models in
evolutionary robotics”, Natural Computing. Springer, (2010), pp. 268-275.
[23] M. Jamil and X.-S. Yang, “A literature survey of benchmark functions for global optimisation
problems”, International Journal of Mathematical Modelling and Numerical Optimisation, vol. 4, no. 2,
(2013), pp. 150-194.
Authors
Rashmi Sharan Sinha was born on April 6, 1987. She received her
Bachelor's degree (B.Tech.) from Lala Lajpat Rai Institute of
Engineering and Technology, Moga, in 2011, and her Master's
degree (M.Tech.) from Shaheed Bhagat Singh State Technical
Campus (formerly SBS College of Engineering & Technology),
Ferozepur, Punjab (India), with specialization in Electronics &
Communication Engineering, in 2015. Her research interests
include Evolutionary Algorithms and Fuzzy Logic Systems.