Thesis - CUDA NVIDIA
Directors:
Full Professor: Nouredine Melab, Lille 1
Full Professor: El-Ghazali Talbi, Lille 1
Introduction
2 Efficient CPU-GPU Cooperation
2.1 Task Repartition for Metaheuristics on GPU
2.1.1 Model of Parallel Evaluation of Solutions
2.1.2 Parallelization Scheme on GPU
2.2 Data Transfer Optimization
2.2.1 Generation of the Neighborhood in S-metaheuristics
2.2.2 The Proposed GPU-based Algorithm
2.2.3 Additional Data Transfer Optimization
2.3 Performance Evaluation
2.3.1 Analysis of the Data Transfers from CPU to GPU
2.3.2 Additional Data Transfer Optimization
2.4 Comparison with Other Parallel and Distributed Architectures
2.4.1 Parallelization Scheme on Parallel and Distributed Architectures
2.4.2 Configurations
2.4.3 Cluster of Workstations
2.4.4 Workstations in a Grid Organization
Appendix
.1 Mapping Proofs
.1.1 Two-to-one Index Transformation
.1.2 One-to-two Index Transformation
.1.3 One-to-three Index Transformation
.1.4 Three-to-one Index Transformation
.2 Statistical Tests
Bibliography
In the optimization field, both academic and industrial problems are often complex and
NP-hard. In practice, their modeling is continuously evolving in terms of constraints
and objectives. Thereby, a large number of real-life optimization problems in science,
engineering, economics, and business are complex and difficult to solve. Their resolution
cannot be performed in an exact manner within a reasonable amount of time, and their
resource requirements are ever increasing. To deal with such an issue, the design of resolu-
tion methods must be based on the joint use of advanced approaches from combinatorial
optimization, large-scale parallelism and engineering methods.
In recent decades, metaheuristics, which are approximate algorithms, have been successfully
applied to solve optimization problems. Indeed, this class of methods makes it possible to
produce near-optimal solutions in a reasonable time. Metaheuristics may solve instances of
problems that are believed to be hard in general, by exploring the usually large solution
search space of these instances. These algorithms achieve this by reducing the effective size
of the search space and by exploring that space efficiently. However, although metaheuristics
reduce the time complexity of problem resolution, they remain unsatisfactory for tackling
large problems. Experiments using large problems are often stopped without any convergence
being reached. Thereby, in designing metaheuristics, there is often a trade-off to be found
between the size of the problem instance and the computational complexity to explore it.
As a result, only the use of parallelism makes it possible to design new methods to tackle
large problems.
Over the last decades, parallel computing has proven to be an essential way to deal
with large instances of difficult optimization problems. The design and implementation
of parallel metaheuristics are strongly influenced by the computing platform. Many
contributions have been proposed for the design and implementation of parallel
metaheuristics using massively parallel processors [CSK93], networks or clusters of
workstations [CTG95, BSB+ 01], and shared-memory machines [JRG09, Bev02]. The proposed
approaches are based on three parallel models: parallel evaluation of a single solution,
parallel evaluation of solutions, and parallel (cooperative or independent) execution of
several metaheuristics. These parallel approaches have later been revisited for large-scale
computational grids [TMT07]. Indeed, grid computing is an impressively powerful way to
solve challenging instances in combinatorial optimization. However, computational grids
providing a huge amount of resources are not easily available and accessible to every user.
Recently, graphics processing units (GPUs) have emerged as a popular new support for
massively parallel computing [RRS+ 08, OML+ 08]. Such resources supply great computing
power, are energy-efficient and, unlike grids, are highly available everywhere: laptops,
desktops, clusters, etc. For many years, GPU computing was dedicated to graphics and
video applications. Its utilization has recently been extended to other application domains
[CBM+ 08, GLGN+ 08] (e.g. scientific computing) thanks to the publication of the CUDA
(Compute Unified Device Architecture) development toolkit, which allows GPU programming
in a C-like language [NBGS08]. In some areas such as numerical computing [TSP+ 08], we
are now witnessing the proliferation of software libraries for GPU such as CUBLAS. However,
in other areas such as combinatorial optimization, in particular metaheuristics, the
utilization of GPUs is not growing at the same pace. With the arrival of open standard
programming languages for GPUs and of compilers for these languages, combinatorial
optimization on GPU will, like other application areas, generate a growing interest.
Indeed, GPU computing has emerged in recent years as an important challenge for
the parallel computing research area. This emerging technology is believed to be
extremely useful for speeding up many complex algorithms. One of the major issues for
metaheuristics is to rethink existing parallel models and programming paradigms to allow
their deployment on GPU accelerators. In other words, the challenge is to revisit the
parallel models and paradigms to efficiently take into account the characteristics of GPUs.
However, the exploitation of parallel models is not trivial, and many issues related to
the hierarchical memory management of this architecture have to be considered.
Generally speaking, the major issues we have to deal with are: the distribution of data
processing between the CPU and the GPU, thread synchronization, the optimization of data
transfers between the different memories, the memory capacity constraints, etc.
The contribution of this thesis is to deal with these issues through the redesign of the
parallel models of metaheuristics, to allow large-scale optimization problems to be solved
on GPU architectures. Our objective is to rethink the existing parallel models and to enable
their deployment on GPUs. To this end, we propose in this document a new generic guideline
for building efficient parallel metaheuristics on GPU. Our challenge is to come up with
a GPU-based design of the whole hierarchy of parallel models. This document deals with the
following contributions and salient issues: (1) an effective cooperation between the
CPU and the GPU, which requires optimizing the data transfers between the CPU and
the GPU; (2) an efficient parallelism control to associate working units with threads and
to meet the memory constraints; (3) efficient mappings of the data structures of the different
models onto the hierarchy of memories (with different access latencies) provided by the
GPU; (4) software framework-based implementations to make the GPU as transparent as
possible for the user.
Document Organization
Chapter 1
The first chapter describes general concepts of parallel metaheuristics and GPU computing.
To this end, we will first introduce the parallel models of metaheuristics. Thereafter,
we will present the general GPU architecture and highlight the different challenges that
arise when dealing with metaheuristics. The rest of the chapter is dedicated to all the
definitions and protocols required for the general comprehension of the document.
Chapter 2
The next chapter contributes to the efficient cooperation between the CPU and the GPU.
We will highlight how the optimization of data transfers between the two components
has a crucial impact on the performance of metaheuristics on GPU. Furthermore,
extensive experiments demonstrate the strong potential of GPU-based metaheuristics
compared to cluster-based or grid-based parallel architectures.
Chapter 3
In the third chapter, the focus is on the efficient control of parallelism when designing
metaheuristics on GPU. On the one hand, such a step establishes a clear association
between the elements to be processed and the spatial thread organization of GPUs. On
the other hand, controlling the generation of threads introduces some robustness into
GPU applications. Therefore, it may prevent applications from crashing, and it may lead
to some performance improvements.
Chapter 4
In the fourth chapter, we will describe different memory associations of optimization
structures to deal with different GPU-based metaheuristics. As an illustration, the scope of
this chapter is to redefine parallel and cooperative algorithms in which this memory
management is prominent. We will investigate how redesigning the same algorithm with
different memory associations impacts the global performance.
Chapter 5
The final chapter introduces an extension of the ParadisEO framework for the transparent
deployment of metaheuristics on GPU. To this end, conceptual aspects are exposed
to allow the user to program GPU-based metaheuristics while minimizing their
involvement in GPU management.
Chapter 1
GPU Computing for Parallel Metaheuristics
This first chapter presents all the background and prerequisites necessary for the general
comprehension of the document.
First, we will describe the principles of metaheuristics within the optimization context.
An overview of the parallel models of metaheuristics used to accelerate the search
process is then given. Thereafter, GPU computing is introduced in the context of
metaheuristics. To this end, we will present the different advantages of, and the challenges
associated with, this emergent technology. Furthermore, a review of the literature on
metaheuristics for parallel and GPU architectures will be made. Finally, we will describe
the experimental protocol used for the experiments performed in this manuscript.
Contents
1.1 Parallel Metaheuristics
1.1.1 Optimization Context
1.1.2 Principles of Metaheuristics
1.1.3 Parallel Models of Metaheuristics
1.2 Metaheuristics and GPU Computing
1.2.1 GPU Architecture
1.2.2 GPU Challenges for Metaheuristics
1.2.3 General GPU Model: CPU-GPU Cooperation
1.2.4 GPU Threads Model: Parallelism Control
1.2.5 Kernel Management: Memory Management
1.3 Related Works on Parallel Metaheuristics
1.3.1 Metaheuristics on Parallel and Distributed Architectures
1.3.2 Research Works on GPU-based Metaheuristics
1.4 Experimental Protocol
1.4.1 Optimization Problems
1.4.2 Machines Configuration
1.4.3 Metric and Statistical Tests
A multiobjective optimization problem may be formulated as:

min F(x) = (f1(x), f2(x), . . . , fn(x)),  subject to x ∈ S,

where n is the number of objectives, x = (x1, . . . , xk) is the vector representing the
decision variables, and S represents the set of feasible solutions associated with equality
and inequality constraints and explicit bounds. F(x) is the vector of objectives to be
optimized. For n = 1 (respectively n ≥ 2), monoobjective (respectively multiobjective)
optimization is considered.
Thereby, the resolution of a monoobjective optimization problem consists in finding the
feasible solution that minimizes the objective function. In the multiobjective context, the
problem resolution aims at finding a set of Pareto optimal solutions, which is called the
Pareto front.
Depending on the complexity of the problem, two main families of resolution methods can
be used: exact methods and heuristics. Figure 1.1 illustrates the different resolution
methods. Exact methods (e.g. branch-and-x, dynamic programming or constraint
programming) find optimal solutions and guarantee their optimality. However, they
become impractical for large problems.
Conversely, heuristics produce high-quality solutions in a reasonable time, making them
practical for large-size problem instances. Heuristics can be specific, i.e. designed to solve a particular
problem and/or instance. They can also be generic and applicable to different problem
types. In this case, they are called metaheuristics. The latter are based on the iterative
improvement of either a single solution (e.g. hill climbing, simulated annealing or tabu
search) or a population of solutions (e.g. evolutionary algorithms or ant colonies) of a given
optimization problem. In this document, the focus will be exclusively on metaheuristics.
Unlike exact methods, metaheuristics make it possible to tackle large-size problem instances
by delivering satisfactory solutions in a reasonable time. There is, however, no guarantee
of finding globally optimal solutions, or even bounded solutions. Metaheuristics have gained
increasing popularity over the past 20 years. Their use in many applications shows their
efficiency and effectiveness in solving large and complex problems. Metaheuristics fall into two categories:
single-solution based metaheuristics (S-metaheuristics) and population-based metaheuris-
tics (P-metaheuristics).
S-metaheuristics manipulate and transform a single solution during the search, while in
P-metaheuristics a whole population of solutions is evolved. These two families have
complementary characteristics: S-metaheuristics are exploitation oriented; they have the
ability to intensify the search in local regions. P-metaheuristics are exploration oriented;
they provide a better diversification in the entire search space. In the next chapters of
this document, we will mainly use this classification.
The objective function f formulates the goal to be achieved. It associates with each
solution of the search space a real value that gives the quality or the fitness of the solution,
f : S → IR. It thus represents an absolute value and allows a complete ordering of all
solutions of the search space. The objective function is an essential element in designing
a metaheuristic. It guides the search toward “good” solutions of the search space. If the
objective function is improperly defined, it can lead to unacceptable solutions whatever
metaheuristic is used.
S-metaheuristics are iterative techniques that have been successfully applied for solving
many real and complex problems. These methods can be viewed as “walks through
neighborhoods”, that is, search trajectories through the solution domains of the problems
at hand. The walks are performed by iterative procedures that move from one
solution to another in the solution space (see Algorithm 1).
An S-metaheuristic usually starts with a randomly generated solution. At each iteration
of the algorithm, the current solution is replaced by another one selected from the set of
its neighboring candidates, and so on. An evaluation function associates a fitness value to
each solution indicating its suitability to the problem (selection criterion). Many strategies
related to the considered S-metaheuristic can be applied in the selection of a move: best
improvement, first improvement, random selection, etc.
The simplest S-metaheuristic is the hill climbing algorithm (see Algorithm 2). It starts
from a given solution. At each iteration, the heuristic replaces the current solution
by a neighbor that improves the objective function. The search stops when all candidate
neighbors are worse than the current solution, meaning a local optimum is reached.
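As an illustration, the following is a minimal C++ sketch of such a hill climbing with a
first-improvement selection strategy; the solution encoding, the evaluate() routine and the
neighborhood enumeration are hypothetical placeholders that must be supplied for the
target optimization problem:

    #include <vector>

    using Solution = std::vector<int>;

    // Problem-dependent placeholders (assumed to be provided elsewhere).
    double evaluate(const Solution& s);                  // fitness of a solution
    std::vector<Solution> neighbors(const Solution& s);  // candidate moves

    // Hill climbing: stops when no candidate neighbor improves the solution.
    Solution hillClimbing(Solution current) {
        double currentFit = evaluate(current);
        bool improved = true;
        while (improved) {
            improved = false;
            for (const Solution& n : neighbors(current)) {
                double fit = evaluate(n);
                if (fit < currentFit) {   // minimization: accept improving move
                    current = n;
                    currentFit = fit;
                    improved = true;
                    break;                // first-improvement selection strategy
                }
            }
        }
        return current;   // all neighbors are worse: a local optimum is reached
    }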
Another widespread method is the tabu search algorithm [Glo89, Glo90] (see Algorithm 3).
In this local search, the best solution in the neighborhood is selected as the new current
solution even if it is not improving the current solution. This policy may generate cycles,
i.e. previously visited solutions could be selected again. To avoid these cycles, the algorithm
manages a memory of the moves recently applied, which is called the tabu list. This list is
a short-term memory which contains the solutions (moves) that have been visited in the
recent past.
Other popular examples of S-metaheuristics are simulated annealing, iterative local search
and variable neighborhood search. A survey of the history and a state-of-the-art of S-
metaheuristics can be found in [DPST06, Tal09].
representation of the individuals or how each step of the algorithm is designed. The main
subclasses of evolutionary algorithms are the genetic algorithms, genetic programming,
evolution strategies, etc. A large review of these evolutionary computation techniques is
given in [BFM97].
• Improve the quality of the obtained solutions: Some parallel models for metaheuristics
make it possible to improve the quality of the search. Indeed, exchanging information
between cooperative metaheuristics will alter their behavior in terms of searching in
the landscape associated with the problem. The main goal of a parallel cooperation
between metaheuristics is to improve the quality of solutions.
To this end, three major parallel models for metaheuristics can be distinguished:
solution-level, iteration-level and algorithmic-level (see Figure 1.3).
Conversely, a GPU has a large number of arithmetic units with limited cache and few
control units. This allows the GPU to compute, in a massively parallel way, the rendering
of small independent elements, while processing a large flow of data. Since more
transistors are devoted to data processing rather than data caching and flow control, the
GPU is specialized for compute-intensive and highly parallel computations.
Figure 1.5 depicts the general GPU architecture. It is composed of streaming multipro-
cessors (SMs), each containing a certain number of streaming processors (SPs), or pro-
cessor cores. Each core executes a single thread instruction in a SIMD (single-instruction
multiple-data) fashion, with the instruction unit distributing the current instruction to the
cores. Each core has one multiply-add arithmetic unit that can perform single-precision
floating-point operations or 32-bit integer arithmetic. In addition, each SM has special
functional units (SFUs), which execute more complex floating-point operations such as
reciprocal, sine, cosine and square root with low cycle latency.
The SM contains other resources such as shared memory and the register file. Groups of
SMs belong to thread processing clusters (TPCs). The latter also contain resources (e.g.
caches and texture fetch units) that are shared among the SMs. The GPU architecture
comprises the collection of TPCs, the interconnection network, and the memory system
(DRAM memory controllers).
Figure 1.6 gives a comparison of the execution model for both CPU and GPU architectures.
Basically, a CPU thread processes one data element per operation. With the extension of
SSE (streaming SIMD extensions) instructions, such a CPU thread can operate on two
to four data elements. Regarding a GPU multiprocessor, 32 threads process 32 data
elements. These groups of 32 threads are called warps. They are exposed as individual
threads but execute the same instruction. Therefore, a divergence in the threads' execution
provokes a serialization of the different instructions.
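To make this point concrete, here is a small illustrative CUDA kernel, not taken from the
thesis, in which threads of the same warp take different branches and are therefore
serialized:

    // Threads of a warp that take different branches are serialized:
    // the warp first executes one branch, then the other.
    __global__ void divergent(float* data, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            if (tid % 2 == 0)                  // even/odd lanes diverge
                data[tid] = data[tid] * 2.0f;  // executed in a first pass
            else
                data[tid] = data[tid] + 1.0f;  // executed in a second pass
        }
        // Branching on a per-warp quantity instead, e.g. (tid / 32) % 2,
        // would keep all lanes of a warp on the same path.
    }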
A complete review of GPU architectures can be found in [RRS+ 08, ND10].
1. Cooperation between the CPU and the GPU. Such a step requires defining
the task repartition in metaheuristics. To this end, the optimization of data
transfers between the two components is necessary to achieve the best performance.
The contribution of this document is to deal with these challenges. Throughout this
manuscript, we will mainly use this classification. The next sections provide more details
about these different challenges.
The kernel handling depends on the general-purpose language. For instance, CUDA
[NVI11] or OpenCL [Gro10] are parallel computing environments which provide an appli-
cation programming interface for GPU architectures. Indeed, these toolkits introduce a
model of threads which provides an easy abstraction for SIMD architectures. The concept
of a GPU thread does not have exactly the same meaning as a CPU thread. A thread on
GPU can be seen as an element of data to be processed. Compared to CPU threads, GPU
threads are lightweight. It means that changing the context between two threads is not a
costly operation.
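A minimal CUDA example illustrates this view of a thread as one element of data to be
processed; this is a hypothetical vector addition, not specific to metaheuristics:

    // Each GPU thread handles exactly one data element.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (tid < n)                  // guard: the grid may be larger than n
            c[tid] = a[tid] + b[tid];
    }

    // Host-side launch: 256 threads per block, enough blocks to cover n.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);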
Regarding their spatial organization, threads are organized within so-called thread blocks
(see Figure 1.8). A kernel is executed by multiple equally shaped thread blocks. Blocks can be
Figure 1.7: Illustration of the general GPU model. The GPU can be seen as a coprocessor
where data transfers must be performed between the CPU and the GPU.
From a hardware point of view, graphics cards consist of streaming multiprocessors, each
with processing units, registers and on-chip memory. Since multiprocessors are organized
according to the SPMD model, threads share the same code and have access to different
memory areas. Figure 1.9 illustrates these different available memories and connections
with thread blocks.
Communication between the CPU host and its device is done through the global memory.
Since in some GPU configurations, this memory is not cached and its access is slow, one
needs to minimize accesses to global memory (read/write operations) and reuse data within
the local multiprocessor memories. Graphics cards also provide read-only texture memory
to accelerate operations such as 2D or 3D mapping. Texture memory units are provided to
Figure 1.8: GPU threads model. GPU threads are organized into block structures.
Figure 1.9: GPU memory model. Different on-chip memories and connections with thread
blocks are available.
allow faster graphic operations. This way, binding a texture on global memory can provide
an alternative optimization. Indeed, it improves random accesses or uncoalesced memory
access patterns that occur in common applications. Constant memory is read only from
kernels and is hardware optimized for the case where all threads read the same location.
Shared memory is a fast memory located on the multiprocessors and shared by threads of
each thread block. This memory area provides a way for threads to communicate within
the same block. Registers among streaming processors are exclusive to an individual
thread; they constitute a fast-access memory. In the kernel code, each declared scalar
variable is automatically placed into registers. Local memory is a memory abstraction and
is not an actual hardware component: in fact, local memory resides in global memory and
is allocated by the compiler. Complex structures such as declared arrays will reside in local memory.
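The following illustrative kernel, a sketch of ours assuming 256-thread blocks, summarizes
how these memory spaces appear in CUDA code:

    __constant__ float scale;        // constant memory: read-only, cached
                                     // (set from the host with cudaMemcpyToSymbol)

    __global__ void memorySpaces(const float* in, float* out) {
        __shared__ float tile[256];  // shared memory: one copy per thread block
        int tid = threadIdx.x;       // tid and gid live in registers
        int gid = blockIdx.x * blockDim.x + tid;
        tile[tid] = in[gid] * scale; // in/out reside in global memory
        __syncthreads();             // threads of the block synchronize
        out[gid] = tile[blockDim.x - 1 - tid]; // communication via shared memory
    }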
Regarding the execution model, each block of threads is split into SIMD groups of threads
called warps. At each clock cycle, each multiprocessor selects a warp that is
ready to execute the same instruction on different data. To be efficient, global memory
accesses must be coalesced: memory reads issued by consecutive threads of a warp are
combined by the hardware into a few wider memory transactions. The requirement is that
threads of the same warp must read global memory in an ordered manner (see Figure 1.10).

Figure 1.10: Illustration of access patterns that lead to coalesced and uncoalesced accesses
to the global memory.

Non-coalesced global memory access patterns may significantly decrease the
performance of a program. As a result, an efficient management of the optimization
structures with the different available memories has to be considered.
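The two illustrative kernels below contrast the two access patterns (the stride parameter
is a hypothetical example of ours):

    // Coalesced: consecutive threads of a warp access consecutive addresses,
    // so the hardware combines the warp's reads into a few wide transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[tid];
    }

    // Uncoalesced: a stride scatters the warp's accesses across memory,
    // which forces many separate transactions and degrades bandwidth.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid * stride < n) out[tid] = in[tid * stride];
    }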
and varies during the search. For instance, this happens when the evaluation cost of the
objective function depends on the solution. When dealing with such problems in which
the computations or the data transfers become irregular or asynchronous, parallel and
distributed architectures such as COWs or computational grids may be more appropriate.
With the emergence of standard programming languages on GPU and the arrival of com-
pilers for these languages, combinatorial optimization on GPU has generated a growing
interest. Historically, due to their embarrassingly parallel nature, P-metaheuristics such as
evolutionary algorithms have been the first subject of parallelization on GPU architectures.
One of the first pioneering works on genetic algorithms was proposed by Wong et al.
[WWF05, WW06, FWW07]. In their works, the population evaluation and a specific mu-
tation operator (Cauchy mutation) are performed on GPU. However, since replacement
and selection operators are implemented on CPU, massive data transfers are performed
between the CPU and the GPU. Such techniques limit the performance of the algorithm.
Concurrently, to deal with this drawback, Yu et al. [YCP05] were the first authors to
establish a full parallelization of the genetic process on GPU. To achieve this, the popula-
tion is organized into a 2D toroidal grid, which makes it possible to apply selection operators
similar to those used in cellular genetic algorithms. However, the implementation is only
specific to a vector of real values. Later, Li et al. [LWHC07] extended this work to binary
representations, and implemented further specific genetic operators. In a similar manner,
Luo et al. [LL06] were among the first authors to implement a cellular genetic algorithm
on GPU for the 3-SAT problem. To perform this, the semantics of the original cellular
algorithm are completely modified to meet the GPU constraints.
The implementations quoted above are based on a transformation of evolutionary
structures into a series of raster operations on GPU, using shader libraries based on
Direct3D or OpenGL. In other words, to implement metaheuristics with such libraries,
one needs to solve the problem of storing the relevant information in texture arrays.
The following works on GPU are implemented with the CUDA development toolkit, which
allows programming on GPUs in a more accessible C-like language. In addition to this,
such a thread-based approach is easier in terms of reproducibility in comparison with
shader libraries.
Zhu suggested in [Zhu09] an evolution strategy algorithm to solve a set of continuous
benchmark problems using the CUDA toolkit. In his implementation, multiple kernels are designed
for some evolutionary operators such as the selection, the crossover, the evaluation and
the mutation. The rest of the search process is handled by the CPU.
Later, Arora et al. presented a similar implementation in [ATD10] for genetic algorithms.
In addition, the contribution of their work is to investigate the effect of a set of parameters
(e.g. number of threads, problem size or population size) on the acceleration of their GPU
implementation with respect to a sequential genetic algorithm.
Tsutsui et al. were among the first authors to establish memory management concepts for
combinatorial optimization problems [TF09]. In their implementation for the quadratic
assignment problem, accesses to the global memory (the population) are coalesced, the
shared memory is used to store as many individuals as possible, and matrices are associated
with the constant memory. Their approach is based on a full parallelization of the search
process to deal with data transfers. To do so, the global genetic algorithm is di-
vided into multiple independent genetic algorithms, where each sub-population represents
a thread block. The obtained speed-ups are of course less convincing than for continuous
problems, due to the management of data structures in combinatorial problems.
In [MBL+ 09], Maitre et al. proposed a tentative framework for the automatic parallelization
of the evaluation function on GPU. To this end, the user does not need to know
any CUDA keywords, and only the evaluation function code must be specified. Such
a strategy makes it possible to evaluate the population transparently on GPU. However, this
automatic parallelization presents some drawbacks. Indeed, it lacks flexibility due to the
data transfers and non-optimized memory accesses. Moreover, the application is restricted
to problems which do not require any data structures (e.g. continuous problems).
Another tentative framework was introduced in [SBPE10] by Soca et al. for cellular
genetic algorithms. Multiple kernels are used for each evolutionary operator. The platform
offers a collection of predetermined operators (three crossovers and two mutations) that the
user must instantiate. In addition, the management of problem structures in global memory
seems to be taken into account, since an application to the quadratic assignment problem
is presented. Unfortunately, unlike the previous one, the framework seems to remain at the
design stage, since no link is available for downloading.
Another implementation of a cellular evolutionary algorithm is provided by Vidal et al.
in [VA10a]. It is based on a full parallelization of the search process on GPU. An applica-
tion of the approach is made for continuous and discrete problems without any problem
structures. Later, the authors submitted a multi-GPU implementation of their algorithm
in [VA10b]. Nevertheless, due to the challenging issue of the context management (e.g.
two separate GPU global memories), their multi-GPU implementation does not provide
any significant benefits in terms of performance.
Zhang et al. introduced a design of an island model on GPU in [ZH09]. The parallelization
of the entire algorithm is performed on GPU. Each sub-population is stored on the shared
memory. Regarding the topology of the different islands, it is similar to the exchange
topology found in cellular evolutionary algorithms (toroidal grid). Unfortunately, the
model remains at the design stage, since the authors did not report any experimental
results.
Pospichal et al. performed an implementation similar to the previous model for continuous
optimization problems in [PJS10]. Each sub-population is stored on the shared memory
and organized according to a ring topology. The obtained speed-ups are impressive in com-
parison with a sequential algorithm (thousands of times). However, the implementation
is only dedicated to a few continuous optimization problems.
Since no general methods can be outlined from the two previous works, we investigated
the parallel island model on GPU in [6]. We addressed its redesign, implementation, and
associated issues related to the GPU execution context. To this end, we designed three
parallelization strategies involving different memory managements. We demonstrated the
effectiveness of the proposed approaches and their capabilities to fully exploit the GPU
architecture.
Regarding S-metaheuristics, Janiak et al. implemented a multi-start tabu search algorithm
applied to the traveling salesman problem and the flowshop scheduling problem [JJL08].
Using shader libraries, the parallelization is entirely performed on GPU, and each thread
is associated with one tabu search. However, such a parallelization is not effective
since a large number of local search algorithms is required to cover the memory access
latency.
Concurrently, a similar approach based on the CUDA toolkit was proposed by Zhu et al.
[ZCM08]. The implementation has been applied to the quadratic assignment problem,
where the memory management of optimization structures is made on the global memory.
Nevertheless, the global performance is limited by the instance size, since each thread is
associated with one local search.
Although the multi-start model has already been applied in the context of the tabu search
on GPU, it has never been widely investigated in terms of reproducibility and memory
management. We provided in [5] a general methodology for the design of multi-start
algorithms, applicable to any local search algorithm such as hill climbing, tabu search or
simulated annealing. Furthermore, we contributed with efficient associations between the
different available memories and the data commonly used for these algorithms.
However, as quoted above, the application of the multi-start model on GPU is limited, since
a large number of local search algorithms is required at launch time to be effective. As
a matter of fact, the parallelization of the evaluation of the neighborhood on GPU might be
more valuable. To this end, we came up with a pioneering work in [9] on the redesign
of the parallel evaluation of the neighborhood on GPU. We introduced the generation of
neighbors on the GPU side to minimize the data transfers. Furthermore, we proposed to
manage the commonly used structures in combinatorial optimization with the different
available memories.
Munawar et al. introduced a hybrid genetic algorithm on GPU in [MWMA09a]. In
their implementation, an island model is implemented where each population represents
a cellular genetic algorithm. In addition to this, a hill climbing algorithm follows the
mutation step of the hybrid genetic algorithm. Each sub-population is associated with the
shared memory, and traditional code optimizations such as memory coalescing are performed.
The implementation is performed for the maximum satisfiability problem.
Since a full parallelization is investigated, the previous work requires the hill climbing
to be combined with the island model. To perform a hybridization with a local search
in the general case, we contributed with the redesign of hybrid evolutionary algorithms on
GPU in [7]. To this end, the focus is on the generation on GPU of the different neighborhoods
corresponding to each individual of the evolutionary process to be mutated. Such
a mechanism guarantees more flexibility, since any local search algorithm can be combined
with any evolutionary algorithm.
Wong was the first author to introduce a multiobjective evolutionary algorithm on GPU
in [Won09]. In his implementation, most of the multiobjective algorithm (NSGA-II) is
implemented on GPU except the selection of non-dominated solutions.
In a similar manner, we contributed in [3] with the first multiobjective local search al-
gorithms on GPU. The parallelization strategy is based on the parallel evaluation of the
neighborhood on GPU using a set of non-dominated solutions to generate the neighbor-
hood.
Table 1.1 reports the major works on GPU according to the classification proposed in
Section 1.2.2. Basically, most approaches of the literature are based on either the parallel
evaluation of solutions on GPU (iteration-level) or the execution of simultaneous indepen-
dent/cooperative algorithms (algorithmic-level). Regarding the CPU-GPU cooperation,
for the first category, some implementations also consider the parallelization of other treat-
ments on GPU (e.g. selection or variation operators in evolutionary algorithms). One may
argue about the validity of these choices, since an execution profiling may show that such
treatments are negligible in comparison with the evaluation of solutions. As quoted above, a
full parallelization of metaheuristics on GPU may also be performed to reduce the data
transfers between the CPU and the GPU. In this case, the original semantics of the
metaheuristic are modified to fit the GPU execution model. This can explain why an
important group of works only deals with concurrent independent/cooperative algorithms.
For the parallelism control, most implementations associate one thread with one solu-
tion. In addition, some cooperative algorithms may take advantage of the threads model
by associating one thread block with one sub-population. However, to the best of our
knowledge, no work has investigated the efficient management of thread parallelism to
meet the memory constraints.

Table 1.1: Classification of the major works of the literature.

When dealing with a large set of solutions or large problem
instances, the previous implementations might not be robust. We will show in Chapter 3
how an efficient thread control makes it possible to introduce fault-tolerance mechanisms in GPU
applications.
Regarding the memory management, in some implementations, no explicit efforts have
been made for memory access optimizations. For instance, memory coalescing is known to
be one of the key elements for speedups in CUDA, and local memories could additionally
be considered to reduce non-coalesced accesses. Some authors have simply relied on using
the shared memory to cache spatially local accesses to global memory, which
does not guarantee performance improvement. In some other implementations, explicit
efforts have been made to handle optimization structures with the different available
memories. However, no general guideline can be outlined from the previous works. Indeed,
most of the time, these memory associations strictly depend on the target optimization
problem (e.g. small size of problems instances or no data inputs). We will try to examine
such issues for the general case in Chapter 4.
Many other works on P-metaheuristics on GPU have been proposed so far. The paral-
lelization strategies used for these implementations are similar to the prior techniques men-
tioned above. These works include particle swarm optimization [MCD09, ZT09, RK10], ant
colonies [BOL+ 09, SAGM10, TF11, CGU+ 11], genetic programming [HB07, Chi07, LB08,
Lan11] and other evolutionary computation techniques [MWMA09b, dPVK10, FKB10].
In comparison with previous works on P-metaheuristics, the spread of S-metaheuristics on
GPU does not occur at the same pace. Indeed, the parallelization on GPU architectures
is harder, due to the improvement of a single solution (and not a population of solutions).
In Chapter 2 and Chapter 3, we will fully contribute to the design and implementation of
S-metaheuristics on GPU.
To validate the approaches proposed in this manuscript, five optimization problems with
different encodings have been considered on GPU: the permuted perceptron problem, the
quadratic assignment problem, the continuous Weierstrass function, the traveling salesman
problem, and the Golomb rulers.
The quadratic assignment problem [BcRW98] arises in many applications such as facility
location or data analysis. Let A = (aij ) and B = (bij ) be n × n matrices of positive
integers. Finding a solution of the quadratic assignment problem is equivalent to finding
a permutation π of {1, 2, . . . , n} that minimizes the objective function:
z(π) = Σ_{i=1}^{n} Σ_{j=1}^{n} a_{ij} b_{π(i)π(j)}
The problem has been implemented using a permutation representation. The evaluation
function has an O(n²) time complexity, where n is the instance size. When considering a
∆ evaluation for S-metaheuristics, some move evaluations can be calculated in constant
time, while others have a time complexity of O(n). This requires a structure storing the
previous ∆ evaluations, with a quadratic space complexity.
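As an illustration, a sketch of such a ∆ evaluation for a swap move is given below; it
follows the classical O(n) formula for the quadratic assignment problem (matrices stored
row-major; the function name is ours):

    // O(n) Delta-evaluation of swapping positions i and j in permutation p,
    // following the classical QAP formula; a and b are the n x n matrices.
    long qapSwapDelta(const int* a, const int* b, const int* p,
                      int n, int i, int j) {
        long d = a[i*n+i] * (long)(b[p[j]*n+p[j]] - b[p[i]*n+p[i]])
               + a[i*n+j] * (long)(b[p[j]*n+p[i]] - b[p[i]*n+p[j]])
               + a[j*n+i] * (long)(b[p[i]*n+p[j]] - b[p[j]*n+p[i]])
               + a[j*n+j] * (long)(b[p[i]*n+p[i]] - b[p[j]*n+p[j]]);
        for (int k = 0; k < n; k++) {
            if (k == i || k == j) continue;
            d += a[k*n+i] * (long)(b[p[k]*n+p[j]] - b[p[k]*n+p[i]])
               + a[k*n+j] * (long)(b[p[k]*n+p[i]] - b[p[k]*n+p[j]])
               + a[i*n+k] * (long)(b[p[j]*n+p[k]] - b[p[i]*n+p[k]])
               + a[j*n+k] * (long)(b[p[i]*n+p[k]] - b[p[j]*n+p[k]]);
        }
        return d;  // z(pi') - z(pi) for the permutation pi' after the swap
    }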
The Weierstrass functions belong to the class of continuous optimization problems. These
functions have been widely used for the simulation of fractal surfaces. According to [LV98],
Weierstrass-Mandelbrot functions are defined as follows:
W_{b,h}(x) = Σ_{i=1}^{∞} b^{−ih} sin(b^{i} x),  with b > 1 and 0 < h < 1    (1.1)
The parameter h has an impact on the irregularity (“noisy” local perturbation of limited
amplitude) and these functions possess many local optima. The problem has been imple-
mented using a vector of real values. The domain definition has been set to −1 ≤ xk ≤ 1, h
has been fixed to 0.25, and the number of iterations to compute the approximation to 100
(instead of ∞). Such parameters are in accordance with those used in [LV98]. The
complexity of the evaluation function is quadratic. Regarding S-metaheuristics, since
traditional neighborhoods for continuous optimization impact all the elements composing
a solution, no easy technique can be applied for computing a ∆ evaluation.
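A sketch of the corresponding per-coordinate evaluation as a CUDA device function, with
the truncation and parameters stated above (the function name is ours):

    // Truncated Weierstrass-Mandelbrot approximation for one decision
    // variable: 100 terms and h = 0.25 in the protocol described above.
    __device__ float weierstrass(float x, float b, float h, int iters) {
        float sum = 0.0f;
        for (int i = 1; i <= iters; i++)
            sum += powf(b, -i * h) * sinf(powf(b, (float)i) * x);
        return sum;
    }
    // The fitness of a solution accumulates this term over all of its
    // coordinates, e.g. f(x) = sum over k of weierstrass(x[k], b, 0.25f, 100).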
Given n cities and a distance matrix dn,n , where each element dij represents the distance
between the cities i and j, the traveling salesman problem [Kru56] consists in finding a
tour which minimizes the total distance. A tour visits each city exactly once.
The chosen representation is a permutation structure. The evaluation function can be
performed in linear time. Regarding S-metaheuristics, a usual neighborhood
for the traveling salesman problem is the two-opt operator. In such a neighborhood, move
evaluations can be performed in constant time.
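For a symmetric instance, such a constant-time move evaluation only involves the four
edges touched by the move, as sketched below (names are ours; d is stored row-major):

    // Constant-time evaluation of a two-opt move: reversing the tour segment
    // between positions i+1 and j replaces edges (t[i],t[i+1]) and
    // (t[j],t[j+1]) by (t[i],t[j]) and (t[i+1],t[j+1]).
    float twoOptDelta(const float* d, const int* t, int n, int i, int j) {
        int a = t[i], b = t[(i + 1) % n];
        int c = t[j], e = t[(j + 1) % n];
        return (d[a*n+c] + d[b*n+e]) - (d[a*n+b] + d[c*n+e]);
    }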
Golomb rulers are numerical sequences, or a class of graphs, named after Solomon W. Golomb
[GB77], which have applications in a wide variety of fields, including communications and
the setup of interferometers for radio astronomy.
A Golomb ruler is an ordered sequence of positive integer numbers. These numbers are
referred to as marks. A distance is the difference between two marks, and all distances
must be distinct for that ruler. The last mark is referred to as the length of the Golomb
ruler. By convention, the first mark of the ruler must be placed at position 0. Fig-
ure 1.11 shows an example for a solution with 4 marks.
More exactly, a Golomb ruler with n marks is a set a1 < a2 < ... < an of positive integer
positions such that the differences ai − aj (for all i ≠ j) are distinct, and where a1 = 0. By
convention, an is the length of the solution, and this ruler has n(n − 1)/2 distances. A
solution with n marks is optimal if there does not exist any shorter Golomb ruler with the
same number of marks.
The problem has been implemented using a discrete vector representation. A solution
evaluation has an O(n³) time complexity. When considering an S-metaheuristic, move
evaluations can be performed in quadratic time, using an auxiliary structure (quadratic
space complexity).
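For illustration, a sketch of the distance-distinctness check implied by this definition,
assuming the marks are sorted and the ruler length is bounded by the array size:

    #include <stdbool.h>

    // Checks that all pairwise differences of a sorted ruler a[0..n-1]
    // (with a[0] == 0) are distinct; O(n^2) time.
    bool isGolombRuler(const int* a, int n) {
        bool seen[1024] = { false };           // assumes a[n-1] < 1024
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                int diff = a[j] - a[i];
                if (seen[diff]) return false;  // duplicate distance
                seen[diff] = true;
            }
        return true;
    }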
The quadratic assignment problem and the traveling salesman problem are permutation
problems, the permuted perceptron problem is based on the binary encoding, the Weier-
strass function is represented by a vector of real values, and the Golomb rulers require a
discrete vector encoding.
The selected problems deal with the four principal encodings of the literature, presented in
Section 1.1.2.1. They present different time complexities, which are worth investigating.
Regarding the traveling salesman problem, it has been chosen since very large instances of
this problem are considered. Indeed, applications which require a huge number of threads
might fail at runtime, due to hardware register limitations.
To have an appreciation of the performance behavior of these problems, one has to examine
the different operations composing an application, to point out the limiting factors. A program
is considered compute bound if the number of calculating operations (e.g. additions or
multiplications) dominates the total number of operations. Conversely, an application is
said memory bound if memory accesses are the leading operations. Figure 1.12 illustrates
these concepts for the evaluation functions of the different problems. Such a classification is
achieved by investigating the number of operations in the evaluation function code. Indeed,
code profiling highlights that the evaluation function represents the most time-consuming
part of a metaheuristic implementation. For instance, most operations in the Weierstrass
evaluation function are power calculations, sine functions and multiplications, while in the
traveling salesman problem, the leading operations are memory accesses. Since a GPU
is composed of many arithmetic-logic units, it can clearly outperform the CPU when
considering pure compute bound algorithms. Thereby, when dealing with both compute
bound and memory bound algorithms, one of the issues is to take advantage of both the
GPU execution model and the GPU local memories to achieve the best performance.
the blocks assigned to a multiprocessor. The warp size is the number of threads running
concurrently on a multiprocessor. These threads are running both in parallel and pipelined.
When a multiprocessor is given thread blocks to execute, it partitions them into warps
that get processed by a warp scheduler at runtime.
The shared memory is a small memory within each multiprocessor that can be read/written
by any thread in a block assigned to that multiprocessor. The thread limit constrains the
amount of cooperation between threads. Indeed, only threads within the same block
can synchronize with each other, and exchange data through the fast shared memory
in a multiprocessor. The way the threads access global memory also affects the global
performance. The execution process goes much faster if the GPU can coalesce several
global addresses into a single contiguous access over the wide data bus that goes to the
external SDRAM. Such coalescing rules are relaxed when dealing with recent architectures.
In Fermi-series cards such as the Tesla M2050, there is an L1/L2 cache structure. When threads
require more registers than the hardware can support, they will spill into L1 cache, which
is very fast. If L1 cache is full, or there are other conflicts, these registers will spill into L2
cache, which is significantly larger. Still, L2 cache is much faster than accessing memory
off chip. In prior architectures, texture memory used to be an alternative for caching global
memory.
To assess the performance of the proposed GPU-based algorithms in this thesis, execution
times and acceleration factors are reported in comparison with a single-core CPU. The
acceleration factor is defined as:

Acceleration = CPU time / GPU time
Statistical analysis must be performed to ensure that the conclusions deduced from the
experiments are meaningful. Furthermore, an objective is also to prove that a specific
algorithm outperforms another one. However, the comparison between two average values
might not be enough. Indeed, it may differ from the comparison between two distributions.
Therefore, a test has to be performed to ensure the statistical significance of the obtained
results. In other words, one has to determine whether an observation is likely to be due
to a sampling error or not.
The first test consists in checking whether the data set is normally distributed, for a number
of experiments above 30. This is done by applying the Kolmogorov-Smirnov test, which is a
powerful and accurate method.
To compare two different distributions (i.e. whether an algorithm is better than another
or not), the Student's t-test is widely used to compare averages of normal data. The
prerequisites for such a test are to check the data normality (Kolmogorov-Smirnov) and
then to examine the equality of the variances of the two samples. The latter can be done
with Levene's test, an inferential statistic used to assess the equality of variances in
different samples.
The statistical confidence level is fixed to 95%, and the p-values are represented for all the
statistical analysis tables.
Since many experiments are presented in this document, Section .2 in the Appendix
only reports the results for which the statistical test cannot conclude whether an algorithm
is better than another one.
Conclusion
In this chapter, we have described all the concepts necessary for the general understanding
of the document. To this end, we have introduced the optimization context, the principles
of metaheuristics, the different optimization problems at hand, and the individual
GPU configurations. More importantly, we have mainly focused on the principles of parallel
metaheuristics for parallel and GPU architectures. Understanding the hierarchical
organization of the GPU architecture is useful for providing an efficient implementation of
parallel metaheuristics.
• GPU challenges. The goal of this thesis is to redesign the different parallel models
on GPU architectures. To this end, we have proposed a classification of the different
challenges involved in the design and implementation of metaheuristics on GPU
accelerators. These challenges concern the efficient cooperation between the CPU
and the GPU, the efficient parallelism control and the efficient management of the
hierarchical memory. These three challenges constitute the heart of this document.
In relation to this classification, we have shown how each work of the literature
can be classified into different categories according to these challenges. The next
three chapters will be dedicated to solving each of these challenges for the design of
GPU-based metaheuristics.
Chapter 2
Efficient CPU-GPU Cooperation
The scope of this chapter is to establish an efficient cooperation between the CPU and
the GPU, which requires sharing the work and optimizing the data transfers between the
two components. First, we will briefly describe a parallelization scheme on GPU common
to all metaheuristics. Then, we will focus on the optimization of data transfers between
the CPU and the GPU. Indeed, this represents one of the critical issues for achieving the
best performance in GPU applications. We will show how this optimization impacts
S-metaheuristics on GPU in particular. Finally, we will investigate the strong potential
of GPU-based S-metaheuristics compared to cluster-based or grid-based parallel architectures.
Contents
2.1 Task Repartition for Metaheuristics on GPU
2.1.1 Model of Parallel Evaluation of Solutions
2.1.2 Parallelization Scheme on GPU
2.2 Data Transfer Optimization
2.2.1 Generation of the Neighborhood in S-metaheuristics
2.2.2 The Proposed GPU-based Algorithm
2.2.3 Additional Data Transfer Optimization
2.3 Performance Evaluation
2.3.1 Analysis of the Data Transfers from CPU to GPU
2.3.2 Additional Data Transfer Optimization
2.4 Comparison with Other Parallel and Distributed Architectures
2.4.1 Parallelization Scheme on Parallel and Distributed Architectures
2.4.2 Configurations
2.4.3 Cluster of Workstations
2.4.4 Workstations in a Grid Organization
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU-based Approaches for
Multiobjective Local Search Algorithms. A Case Study: the Flowshop Scheduling Prob-
lem. 11th European Conference on Evolutionary Computation in Combinatorial Optimiza-
tion, EVOCOP 2011, pages 155–166, volume 6622 of Lecture Notes in Computer Science,
Springer, 2011.
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Local Search Algorithms on
Graphics Processing Units. A case study: the Permutation Perceptron Problem. Evolu-
tionary Computation in Combinatorial Optimization, 10th European Conference, EvoCOP
2010, pages 264–275, volume 6022 of Lecture Notes in Computer Science, Springer, 2010.
Figure 2.1: The parallel evaluation of solutions (iteration-level). The solutions are decom-
posed into different partitions which are evaluated in a parallel independent way.
In the iteration-level model, the focus is on the parallelization of each iteration of meta-
heuristics. The iteration-level parallel model is mainly based on the distribution of the
handled solutions. Indeed, the most time-consuming part in a metaheuristic is the evalu-
ation of the generated solutions. The parallelization concerns search mechanisms that are
problem-independent operations, such as the generation and evaluation of the neighbor-
hood in S-metaheuristics and the evaluation of successive populations in P-metaheuristics.
In other words, the iteration-level model is a low-level Master-Worker model that does not
alter the behavior of the heuristic. Figure 2.1 gives an illustration of this model.
At each iteration, the master generates the set of solutions to be evaluated. Each worker
receives from the master a partition of the solutions set. These solutions are evaluated and
returned to the master. For S-metaheuristics, the neighbors can also be generated
by the workers. In this case, each worker receives a copy of the current solution, generates
one or several neighbor(s) to be evaluated, and returns them to the master. A challenge
of this model is to determine the granularity of each partition of solutions to be allocated
to each worker according to the communication delays of the given architecture. In terms
of genericity, as the model is problem-independent, it is generic and reusable.
As quoted above, the evaluation of solution candidates is often the most time-consuming
part of metaheuristics. Thereby, it must be done in parallel, in accordance with the
iteration-level parallel model. Hence, according to the Master-Worker paradigm, the idea is to
Figure 2.2: The parallel evaluation of solutions on GPU (iteration-level). The evaluation
of solutions is performed on GPU and the CPU executes the sequential part of the search
process.
The GPU has its own memory and processing elements that are separate from the host
computer. Thereby, data transfer between CPU and GPU through the PCIe bus might be
a serious bottleneck in the performance of GPU applications. Table 2.1 gives an insight
into the different transfer rates for the four different configurations presented in Chapter 1.
This sample from the CUDA SDK [NVI11] delivers practical transfer rates by considering one
data transfer of 30 MB, repeated thousands of times so that a higher amount of data is
transferred overall.
Table 2.1: External bandwidth test. The program measures the elapsed time to copy 30
MB of data and delivers bi-directional transfer rates for the 4 different configurations.
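A minimal sketch of such a measurement using CUDA events follows; this is our own
illustrative version, not the SDK program itself:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 30u * 1024 * 1024;        // one 30 MB transfer
        char *h, *d;
        cudaMallocHost((void**)&h, bytes);             // pinned host memory
        cudaMalloc((void**)&d, bytes);
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);        // elapsed time in ms
        printf("CPU -> GPU: %.1f MB/s\n", 30.0 / (ms / 1000.0));
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }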
• Generation of the neighborhood on CPU and its evaluation on GPU. At each iteration
of the search process, the neighborhood is generated on the CPU side, and
its associated structure storing the solutions is copied to the GPU. This approach is
the most straightforward, since a thread is automatically associated with its physical
neighbor representation. This is what is usually done for the parallelization
of P-metaheuristics on GPU. Thereby, the data transfers are essentially the set of
neighboring solutions copied from the CPU to the GPU, and the fitness structure,
which is copied from the GPU to the CPU.
• Generation of the neighborhood and its evaluation on GPU. In the second approach, the neighborhood is generated on the GPU. This generation is performed in a dynamic manner, which implies that no explicit structure needs to be allocated. This is achieved by considering a neighbor as a slight variation of the candidate solution from which the neighborhood is generated. Thereby, only the representation of this candidate solution must be copied from the CPU to the GPU. The advantage of such an approach is to drastically reduce the data transfers, since the whole neighborhood does not have to be copied. The resulting fitnesses are the only structure which has to be copied back from the GPU to the CPU. However, such an approach raises another issue: a mapping between a thread and a neighbor must be determined, which might be challenging in some cases. This issue will be discussed in Chapter 3.
Even if the first approach is simpler, applying it to S-metaheuristics on GPU results in heavy data transfers for large neighborhoods. The same holds for P-metaheuristics on GPU, since the entire population is usually copied from the CPU to the GPU. Such an approach leads to a great loss of performance due to the limited external bandwidth. That is the reason why, in the rest of this chapter, we consider the second approach: the generation and the evaluation of the neighborhood on GPU. An experimental comparison of the two approaches is presented in Section 2.3.1.
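To make the second approach concrete, here is a minimal kernel sketch for a binary encoding with a Hamming distance of one, where neighbor k is obtained by flipping bit k of the candidate solution; the fitness computation is a placeholder, and all names are illustrative:

// Each thread evaluates the neighbor obtained by flipping one bit of the
// candidate solution; the neighborhood itself is never materialized, so only
// the candidate and the fitnesses structure cross the PCIe bus.
__global__ void evaluate_neighborhood(const int *candidate, int n,
                                      float *fitnesses) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per neighbor
    if (k >= n) return;
    float fitness = 0.0f;
    for (int i = 0; i < n; ++i) {
        int bit = candidate[i];
        if (i == k) bit = 1 - bit;  // the move: flip bit k, computed on the fly
        fitness += (float)bit;      // placeholder for the problem-specific evaluation
    }
    fitnesses[k] = fitness;         // only this structure goes back to the CPU
}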
structure and never change throughout the execution of the S-metaheuristic. Therefore, their associated memory is copied only once, for the whole execution. Third comes the parallel iteration-level, in which each neighboring solution is generated, evaluated, and copied into the neighborhood fitnesses structure (lines 12 to 15). Fourth, since the order in which candidate neighbors are evaluated is undefined, the neighborhood fitnesses structure has to be copied to the host CPU (line 16). Then, a specific solution selection strategy is applied to this structure (line 17): the exploration of the neighborhood fitnesses structure is carried out by the CPU in a sequential way. Finally, after a new candidate has been selected, the latter and its additional structures are copied to the GPU (lines 19 and 20). The process is repeated until a stopping criterion is satisfied.
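A condensed host-side view of this scheme (a sketch, not the thesis's exact Algorithm 6; stopping_criterion and select_and_replace are assumed helpers, and the kernel is the one sketched above) might be:

#include <cuda_runtime.h>
#include <cstdlib>

bool stopping_criterion();                            // assumed helper
void select_and_replace(int *, const float *, int);   // assumed helper

void s_metaheuristic_on_gpu(int *h_candidate, int n) {
    int *d_candidate;
    float *d_fitnesses;
    float *h_fitnesses = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_candidate, n * sizeof(int));
    cudaMalloc(&d_fitnesses, n * sizeof(float));
    // Problem inputs would be allocated and copied once here: they never change.
    while (!stopping_criterion()) {
        // Only the candidate solution crosses the bus at each iteration.
        cudaMemcpy(d_candidate, h_candidate, n * sizeof(int),
                   cudaMemcpyHostToDevice);
        int tpb = 256;                                // threads per block
        evaluate_neighborhood<<<(n + tpb - 1) / tpb, tpb>>>(d_candidate, n,
                                                            d_fitnesses);
        // The fitnesses structure is copied back and explored sequentially.
        cudaMemcpy(h_fitnesses, d_fitnesses, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        select_and_replace(h_candidate, h_fitnesses, n);
    }
    cudaFree(d_candidate);
    cudaFree(d_fitnesses);
    free(h_fitnesses);
}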
This parallelization can be seen as an acceleration model which does not change the semantics of the S-metaheuristic. The iteration-level parallel model on GPU may easily be extended to variable neighborhood search (VNS) metaheuristics, in which the same parallel exploration is applied to the various neighborhoods associated with the problem. Its extension to iterated local search (ILS) metaheuristics is also straightforward, as the parallelization on GPU can be used at each iteration of the ILS metaheuristic. The same holds when dealing with hybrid genetic algorithms.
Figure 2.3: Reduction operation to find the minimum of each block. Local synchronizations are performed between threads of the same block via the shared memory.
2.2.3 Additional Data Transfer Optimization
In other S-metaheuristics such as hill climbing or variable neighborhood descent, the selection operates on the minimal/maximal fitness for finding the best solution. Therefore, only one value of the fitnesses structure needs to be copied from the GPU to the CPU. However, since read/write operations on memory are performed in an asynchronous manner, finding the appropriate minimal/maximal fitness is not straightforward. Indeed, traditional parallel techniques such as semaphores, which imply the global synchronization (via atomic operations) of thousands of threads, can lead to drastically diminished performance. To deal with this issue, an adaptation of parallel reduction techniques for each thread block must be considered (see Fig. 2.3).
Algorithm 7 gives a template of the parallel reduction for a thread block (a partition of the neighborhood). Basically, each thread loads one element from global to shared memory (lines 1 and 2). At each loop iteration, elements of the array are compared in pairs (lines 3 to 7). Then, by using local synchronizations between threads of a given block via the shared memory, one can find the minimum/maximum of a given array, since threads operate at different memory addresses. For the sake of simplicity, the template is given for a neighborhood size which is a power of two, but adapting the template to the general case is straightforward. The complexity of such an algorithm is O(log2(n)), where n is the size of each thread block. If several iterations are performed on reduction kernels, the minimum of all the neighbors can be found. Thereby, the GPU reduction kernel makes it possible to get the minimum/maximum of each block of threads. More details of the method are given in [Har08]. The benefits of such a technique will be pointed out in Section 2.3.2.
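As a minimal sketch of such a block-wise reduction kernel (assuming, as in the text, that the block size is a power of two and that the fitnesses array is padded to a multiple of the block size), each block could write the minimum of its partition as follows:

// Block-wise minimum reduction: one value per thread block.
__global__ void block_min(const float *fitnesses, float *block_mins) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = fitnesses[i];              // load one element per thread
    __syncthreads();
    // Pairwise comparisons; local synchronization via shared memory.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) block_mins[blockIdx.x] = sdata[0];  // minimum of the block
}

The kernel would be launched with blockDim.x * sizeof(float) bytes of dynamic shared memory per block, and iterated on its own output until a single value remains, in accordance with the iterated reduction described above.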
2.3 Performance Evaluation
2.3.1 Analysis of the Data Transfers from CPU to GPU
As previously said, one of the major issues in obtaining the best performance resides in optimizing the data transfers between the CPU and the GPU. To validate the performance of our algorithms, we propose an analysis of the time consumed by each major operation in two different approaches, in order to assess the impact of such operations in terms of efficiency: 1) the generation of the neighborhood on CPU and its evaluation on GPU; 2) the generation of the neighborhood and its evaluation on GPU.
For the next experiments, a tabu search with 10000 iterations is considered on the GTX 280 configuration. The GPU adaptation of the tabu search is straightforward according to the proposed GPU algorithm (see Algorithm 6 in Section 2.2.2). First, the metaheuristic pre-treatment (line 3) is the tabu list initialization. Second, the replacement strategy (line 17) selects the best admissible neighbor according to its availability in the tabu list. Finally, the post-treatment (line 18) represents the tabu list update.
A single CPU core implementation and a CPU-GPU one are considered. The number of threads per block has been arbitrarily set to 256 (a multiple of 32), and the total number of threads created at run time is equal to the neighborhood size.
As an application, the permuted perceptron problem has been considered. Table 2.2 reports the time spent by each operation in the two approaches, using a neighborhood based on a Hamming distance of one (n neighbors). For the first approach, one can observe that the time spent on the data transfers is significant: it represents almost 25% of the total execution time for each instance. In the second approach, in comparison with the previous one, the time spent on the data transfers is drastically reduced as the instance size increases.
Table 2.2: Measures of the benefits of generating the neighborhood on GPU on the GTX
280. The permuted perceptron problem using a neighborhood based on a Hamming dis-
tance of one is considered.
Instance | CPU | Evaluation on GPU: GPU (time × speed-up), process, transfers, kernel | Gen. and eval. on GPU: GPU (time × speed-up), process, transfers, kernel
73-73 1.1 3.4×0.3 4.0% 22.7% 73.3% 3.0×0.4 1.0% 23.2% 75.8%
81-81 1.3 3.8×0.3 5.3% 25.9% 68.8% 3.3×0.4 1.1% 22.9% 75.9%
101-117 2.2 5.1×0.4 4.8% 24.5% 70.7% 4.2×0.5 1.1% 19.1% 79.8%
201-217 8.1 11×0.7 6.8% 25.3% 67.9% 7.7×1.1 1.2% 12.0% 86.8%
401-417 31 27×1.2 7.1% 25.0% 67.9% 14×2.2 1.1% 6.2% 92.7%
601-617 105 68×1.5 7.1% 26.1% 66.8% 43×2.4 1.0% 3.9% 95.1%
801-817 200 98×2.0 7.1% 24.2% 68.7% 50×4.0 0.6% 1.4% 98.0%
1001-1017 336 106×3.2 5.5% 23.5% 71.0% 58×5.8 0.3% 0.6% 99.1%
1301-1317 687 146×4.7 5.2% 22.4% 72.4% 85×8.0 0.2% 0.4% 99.4%
Indeed, for the instance m = 73 and n = 73, this time corresponds to 19% of the total running time, and it drops to 1% for the last instance (m = 1301 and n = 1317).
Another observation concerns the time taken by the generation and the evaluation of the neighborhood on GPU. Generally speaking, the algorithm in the second approach makes better use of the resources, since most of the total running time is dedicated to the GPU kernel execution. For example, for the fourth instance (m = 201 and n = 217), the time associated with the evaluation of the neighborhood accounts for 86% of the total execution time. This proportion grows with the instance size (more than 90% for the larger instances).
As a result, the second approach outperforms the first one in terms of efficiency. Indeed, the related acceleration factors for the two approaches are in accordance with the previous observations. This difference in performance tends to grow with the instance size. In a general manner, the speed-up grows with the problem size (up to ×8 for m = 1301, n = 1317). The acceleration factor for this implementation is significant but not spectacular. Indeed, since the neighborhood is relatively small (n threads), the number of threads per block is not enough to fully hide the memory access latency.
To validate this point, a neighborhood based on a Hamming distance of two has been implemented. Table 2.3 details the time spent by each operation in the two approaches using this neighborhood.
For the first approach, most of the time is devoted to data transfers: they account for nearly 75% of the execution time. As a consequence, such an approach is actually inefficient, since the time spent on the data transfers dominates the whole algorithm. The measured speed-ups confirm the previous observations. Indeed, since the amount of
Table 2.3: Measures of the benefits of generating the neighborhood on GPU on the GTX
280. The permuted perceptron problem using a neighborhood based on a Hamming dis-
tance of two is considered.
Table 2.4: Amount of data transfers at each iteration from the CPU to the GPU for the
permuted perceptron problem.
data transferred tends to grow as the size increases, the acceleration factors diminish with the instance size (from ×3.3 to ×0.6). Furthermore, the algorithm could not be executed for larger instances, since it exceeds the 1 GB global memory of the GTX 280. Table 2.4 emphasizes these points by looking at the amount of data transferred.
Table 2.5 reports the obtained speed-ups for the four configurations. Considering the generation and evaluation on GPU for a neighborhood based on a Hamming distance of two, for the first instance (m = 73, n = 73), acceleration factors are already significant (from ×3.6 to ×12.3). As the instance size increases, the acceleration factor grows accordingly (from ×3.6 to ×8 for the first configuration). Since a large number of cores are available on both the 8800 and the GTX 280, efficient speed-ups can be obtained (from ×10.1 to ×44.1). The application also scales well on the Tesla Fermi card (speed-ups varying from ×11.7 to ×73.3). In comparison with the second approach, generating on CPU and evaluating on GPU is clearly inadequate in terms of performance. A conclusion to this experiment is that generating the neighborhood on the GPU side is required to obtain the best performance; similar observations can be made for other optimization problems (see Tables 2.6 and 2.7).
Table 2.5: Benefits of generating the neighborhood on GPU. The acceleration factors are
reported for a tabu search on GPU on the permuted perceptron problem.
Table 2.6: Benefits of generating the neighborhood on GPU. The acceleration factors are reported for a tabu search on GPU on the quadratic assignment problem and the Weierstrass continuous function.
Table 2.7: Benefits of generating the neighborhood on GPU. The acceleration factors are
reported for a tabu search on GPU on the traveling salesman problem and the Golomb
Rulers.
Table 2.9: Measures of the benefits of using the reduction operation on the GTX 280. The permuted perceptron problem is considered for two different neighborhoods, using an iterated local search composed of 100 hill climbing algorithms.
Instance | n neighbors: CPU, GPU, GPUR | n(n−1)/2 neighbors: CPU, GPU, GPUR
73-73 0.08 0.22×0.4 0.25×0.3 5.29 0.42×12.6 0.35×15.1
81-81 0.13 0.29×0.4 0.32×0.4 9.47 0.65×14.6 0.52×18.2
101-117 0.27 0.42×0.6 0.47×0.6 28.4 1.2×23.7 1.1×25.9
201-217 1.5 1.4×1.1 1.5×1.0 94.7 3.1×30.5 2.8×33.8
401-417 12.1 5.4×2.2 4.8×2.5 923 27.3×33.8 25×36.9
601-617 102 32.1×3.2 29.4×3.5 4754 110×43.2 103×46.1
801-817 199 49.3×4.0 45.7×4.4 13039 270×48.3 251×51.9
1001-1017 395 67.4×5.9 62.2×6.3 29041 593×48.9 551×52.7
1301-1317 1132 141×8.0 125×9.0 74902 1512×49.5 1395×53.7
2.3.2 Additional Data Transfer Optimization
Another point concerns the data transfers from the GPU to the CPU. Indeed, in some S-metaheuristics such as hill climbing, the selection of the best neighbor is operated by choosing the minimal/maximal fitness at each iteration. Hence, for these algorithms, there is no need to transfer the entire fitnesses structure, and further optimizations are possible. The following experiment consists in comparing two GPU-based versions of the hill climbing algorithm.
In the first version, the standard GPU-based algorithm is considered, i.e. the fitnesses structure is copied back from the GPU to the CPU. In the second one, a reduction operation is iterated on the GPU to find the minimum of all the fitnesses at each iteration.
Since the hill climbing heuristic converges rapidly, an iterated local search composed of 100 hill climbing algorithms has been considered. Such a number of runs is consistent with the previous running times of the tabu search.
Results for the permuted perceptron problem, considering two different neighborhoods, are reported in Table 2.9. Regarding the version using a reduction operation (GPUR), significant improvements in comparison with the standard version (GPU) can be observed. For example, for the instance m = 73 and n = 73, in the case of n(n−1)/2 neighbors, the speed-up is equal to ×15.1 for the version using reduction and ×12.6 for the other one. Such an improvement, between 10% and 20%, is maintained for most of the instances. A peak performance is reached with the instance m = 1301 and n = 1317 (×53.7 for GPUR against ×49.5 for GPU).
An analysis of the average percentage of time consumed by each operation can clarify this improvement. Table 2.10 highlights the time dedicated to each major operation for a neighborhood based on a Hamming distance of two. On the one hand,
Table 2.10: Analysis of the time dedicated to each operation for an iterated local search composed of 100 hill climbing algorithms. The permuted perceptron problem using a Hamming distance of two and the reduction operation are considered.
Instance | GPU: process, transfers, kernel | GPUR: process, transfers, kernel
73-73 19.0% 11.2% 69.8% 1.43% 1.46% 97.11%
81-81 18.8% 10.7% 70.5% 0.91% 0.98% 98.01%
101-117 18.7% 10.1% 71.2% 0.46% 0.44% 99.10%
201-217 18.5% 7.3% 74.2% 0.36% 0.11% 99.53%
401-417 18.2% 6.3% 75.5% 0.08% 0.04% 99.88%
601-617 17.7% 4.5% 77.8% 0.04% 0.02% 99.94%
801-817 13.3% 2.5% 84.2% 0.03% 0.02% 99.96%
1001-1017 12.7% 1.5% 85.8% 0.02% 0.01% 99.97%
1301-1317 10.9% 1.5% 87.6% 0.01% 0.01% 99.98%
for the second version, whatever the size of the neighborhood or the instance, the time spent on data transfers is nearly constant (varying between 0.01% and 1.46%). This can be explained by the fact that only one solution is transferred from the GPU to the CPU at each iteration. On the other hand, one can also notice that the time spent on the search process on CPU is minimized for the second version. Indeed, by definition, the reduction operation consists in finding the minimum, which is performed on the GPU side in logarithmic time, whereas in the first version most of the CPU search process time corresponds to the linear-time search of the minimum in the fitnesses structure. Therefore, both the minimization of the data transfers and the complexity reduction justify such an improvement in performance.
The same observations can be made for the other problems, where the reduction operator provides a 10% to 20% performance improvement (see Table 2.11).
2.4 Comparison with Other Parallel and Distributed Architectures
During the last decade, clusters of workstations (COWs) and computational grids have been widely deployed to provide standard high-performance computing platforms. Hence, it is interesting to compare the performance provided by GPU computing with such multi-level architectures with regard to S-metaheuristics. For the next experiments, we compare each GPU configuration first with COWs and then with computational grids.
Table 2.11: Benefits of the reduction operator on GPU. The acceleration factors are re-
ported for an iterative local search on different optimization problems.
2.4.1 Parallelization Scheme on Parallel and Distributed Architectures
Algorithm 8 provides the template of the parallelization on top of these parallel and distributed architectures. Basically, the neighborhood is decomposed into separate partitions of equal size, which are distributed among the cores of the different machines (line 4). The only data which have to be copied are the candidate solution and the additional structures needed for its evaluation (lines 5 and 6). Each working node (CPU core) is in charge of the evaluation of its own neighborhood partition (lines 8 to 10). For deterministic S-metaheuristics, the parallelization is synchronous, and one has to wait for the termination of the exploration of all partitions (line 11). Such a synchronization step ensures that the semantics of the S-metaheuristic are preserved. Then, the master handles the sequential part of the S-metaheuristic. The process is repeated until a stopping criterion is satisfied.
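A minimal hybrid MPI/OpenMP sketch of one iteration of this template (evaluate_neighbor is an assumed, problem-dependent helper; for simplicity the neighborhood size is assumed divisible by the number of processes) could be:

#include <mpi.h>
#include <omp.h>
#include <vector>

float evaluate_neighbor(const std::vector<int> &candidate, int move);  // assumed

void evaluate_iteration(std::vector<int> &candidate,
                        std::vector<float> &fitnesses, int neigh_size) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Only the candidate solution is communicated (lines 5 and 6 of the template).
    MPI_Bcast(candidate.data(), (int)candidate.size(), MPI_INT, 0,
              MPI_COMM_WORLD);
    int chunk = neigh_size / size;          // equal-size partitions per node
    std::vector<float> local(chunk);
    #pragma omp parallel for                // one neighbor per CPU core
    for (int k = 0; k < chunk; ++k)
        local[k] = evaluate_neighbor(candidate, rank * chunk + k);
    // Synchronous gather: preserves the semantics of the S-metaheuristic.
    MPI_Gather(local.data(), chunk, MPI_FLOAT,
               fitnesses.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
}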
In comparison with the GPU parallelization, the iteration-level model on parallel and distributed architectures is more flexible. Indeed, due to the asynchronous nature of these architectures, the copy and the exploration of the partitions may also be done in an asynchronous manner. Thereby, it makes it possible to deal with S-metaheuristics which explore a partial neighborhood, or with some irregular problems in which the execution time varies during the search process.
2.4.2 Configurations
To allow a fair comparison with the previous results on GPU, the different parallel and distributed architectures must have the same computational power. Table 2.12 presents
Table 2.12: Parallel and distributed machines used for the experiments on COWs and
Grid’5000.
Architecture | Configuration 1 (machines) | GFLOPS | Configuration 2 (machines) | GFLOPS
GPU | Core 2 Duo T5800 + GeForce 8600M GT | 76.8 | Core 2 Quad Q6600 + GeForce 8800 GTX | 384
COWs | Intel Xeon E5440 (8 CPU cores) | 90.656 | 4 Intel Xeon E5440 (32 CPU cores) | 362.624

Architecture | Configuration 3 (machines) | GFLOPS | Configuration 4 (machines) | GFLOPS
GPU | Intel Xeon E5450 + GeForce GTX 280 | 981.12 | Intel Xeon E5620 + Tesla M2050 | 1106.08
COWs | 11 Intel Xeon E5440 (88 CPU cores) | 995.236 | 13 Intel Xeon E5440 (104 CPU cores) | 1176.188
Grid | 2 Intel Xeon E5520 + 2 AMD Opteron 2218 + 2 Intel Xeon E5520 + 4 Intel Xeon E5520 + Intel Xeon X5570 + Intel Xeon E5520 (96 CPU cores) | 979.104 | 4 Intel Xeon E5520 + 2 AMD Opteron 2218 + 2 Intel Xeon E5520 + 4 Intel Xeon E5520 + Intel Xeon X5570 + Intel Xeon E5520 (112 CPU cores) | 1160.056
the different machines used for the experiments. The number of potential GFLOPS is calculated from the theoretical figures provided by the manufacturers.
The machines used for the COW and grid experiments are described in Table 2.12. Most of them are octo-core workstations. The different computers have been chosen in accordance with the different GPU configurations, i.e. in agreement with their computational power, as deduced from the potential GFLOPS delivered by the different machines.
From an implementation point of view, a hybrid OpenMP/MPI version has been produced to take advantage of both multi-core and distributed environments. Such a hybrid implementation has widely proved its efficiency for multi-level architectures in the past [JJMH03]. The tabu search previously described has been implemented on top of these architectures.
The permuted perceptron problem using a neighborhood based on a Hamming distance of two is considered on the two architectures. A Myri-10G Ethernet network connects the different machines of the COWs. For the workstations distributed in a grid
Table 2.13: Measures in terms of efficiency for a cluster of workstations. The permuted
perceptron problem using a neighborhood based on a Hamming distance of two is consid-
ered.
Instance | Intel Xeon E5440 (8 CPU cores): GPU, COW | 4 Intel Xeon E5440 (32 CPU cores): GPU, COW (entries: time × speed-up)
73-73 0.8×3.6 0.9×3.5 0.2×10.1 1.4×1.6
81-81 1.1×3.8 1.2×3.6 0.3×10.4 1.6×1.8
101-117 2.5×4.4 2.9×3.8 0.6×12.4 1.9×3.8
201-217 15×4.7 18×3.9 3.3×15.4 6.2×8.2
401-417 103×5.4 139×4.1 24×18.3 39×11.3
601-617 512×6.3 966×3.3 89×28.3 258×9.8
801-817 1245×6.9 2828×3.0 212×32.8 737×9.4
1001-1017 2421×7.2 6307×2.8 409×35.2 1708×8.5
1301-1317 4903×8.0 15257×2.6 911×36.2 3968×8.4
Instance | 11 Intel Xeon E5440 (88 CPU cores): GPU, COW | 13 Intel Xeon E5440 (104 CPU cores): GPU, COW (entries: time × speed-up)
73-73 0.2×10.9 5.4×0.4 0.2×12.3 5.9×0.4
81-81 0.2×12.2 5.6×0.5 0.2×13.4 6.3×0.4
101-117 0.4×18.1 6.0×1.2 0.3×22.0 6.7×1.1
201-217 1.9×25.3 7.6×6.3 1.6×30.6 7.0×6.8
401-417 14×28.8 21×19.2 10×38.3 19×21.2
601-617 51×40.1 115×17.8 35×58.4 108×19.0
801-817 128×42.3 322×16.8 81×67.1 311×17.4
1001-1017 252×43.9 793×14.0 154×71.9 778×14.3
1301-1317 568×44.1 1807×13.8 342×73.3 1789×14.0
organization, experiments have been carried out on the high-performance computing grid Grid'5000 [BCC+06], involving two, five and seven French sites. The acceleration factors are established with respect to the single CPU core used for the previous experiments.
Table 2.14: Analysis of the time dedicated to each operation for a cluster of workstations.
The permuted perceptron problem using a neighborhood based on a Hamming distance
of two is considered.
Instance | Intel Xeon E5440 (8 CPU cores): process, transfers, workers | 4 Intel Xeon E5440 (32 CPU cores): process, transfers, workers
73-73 4.2% 45.7% 50.1% 1.8% 93.0% 5.2%
81-81 3.3% 45.3% 51.4% 2.3% 91.9% 5.8%
101-117 2.6% 43.1% 54.3% 3.9% 83.8% 12.3%
201-217 1.9% 42.4% 55.7% 5.7% 67.8% 26.5%
401-417 1.8% 39.6% 58.6% 6.5% 57.1% 36.4%
601-617 0.9% 52.0% 47.1% 3.5% 64.9% 31.6%
801-817 0.6% 56.5% 42.9% 2.3% 67.4% 30.3%
1001-1017 0.5% 59.4% 40.1% 1.9% 70.7% 27.4%
1301-1317 0.4% 62.5% 37.1% 1.6% 71.3% 27.1%
Instance | 11 Intel Xeon E5440 (88 CPU cores): process, transfers, workers | 13 Intel Xeon E5440 (104 CPU cores): process, transfers, workers
73-73 0.7% 98.8% 0.5% 0.7% 98.9% 0.4%
81-81 0.7% 98.7% 0.6% 0.7% 98.8% 0.5%
101-117 1.2% 97.4% 1.4% 1.1% 97.6% 1.3%
201-217 4.6% 88.2% 7.2% 4.5% 88.5% 7.0%
401-417 12.1% 63.8% 24.1% 11.9% 64.0% 24.1%
601-617 7.8% 71.7% 20.5% 7.9% 71.9% 20.6%
801-817 5.3% 75.4% 19.3% 5.4% 75.7% 19.5%
1001-1017 4.0% 79.9% 16.1% 3.9% 81.2% 15.9%
1301-1317 3.5% 80.6% 15.9% 3.4% 80.8% 15.8%
Furthermore, increasing the number of machines (i.e. the number of communications) has a negative impact on the performance for small instances such as m = 73 and n = 73. Indeed, the time dedicated to the transfers clearly dominates the algorithm (93% and 98% for the second and third configurations). Such a behaviour does not appear in the first configuration, since communication is only performed between the cores of a single machine.
Regarding the overall performance, whatever the instance size, acceleration factors are less salient than their GPU counterparts. For COWs, the acceleration factors range from ×0.4 to ×21.2, whereas for GPUs they range from ×3.6 to ×73.3.
All the previous observations made for COWs remain valid when dealing with workstations distributed in a grid organization. In general, the overall performance is less significant than on COWs for a comparable computational horsepower. Indeed, the acceleration factors vary from ×0.3 to ×16.1 (see Table 2.15). This performance diminution is explained by the growth of the communication time, since the clusters are distributed among different sites.
Table 2.15: Measures in terms of efficiency for workstations distributed in a grid orga-
nization. The permuted perceptron problem using a neighborhood based on a Hamming
distance of two is considered.
Instance | 2 machines (8 CPU cores): GPU, Grid | 5 machines (40 CPU cores): GPU, Grid (entries: time × speed-up)
73-73 0.8×3.6 1.1×2.6 0.2×10.1 1.8×1.2
81-81 1.1×3.8 1.4×3.0 0.3×10.4 2.0×1.4
101-117 2.5×4.4 3.5×3.1 0.6×12.4 2.4×3.0
201-217 15×4.7 22×3.2 3.3×15.4 7.8×6.5
401-417 103×5.4 167×3.4 24×18.3 49×9.0
601-617 512×6.3 1159×2.8 89×28.3 323×7.8
801-817 1245×6.9 3394×2.5 212×32.8 922×7.5
1001-1017 2421×7.2 7568×2.3 409×35.2 2135×6.8
1301-1317 4903×8.0 18308×2.2 911×36.2 4960×6.7
Instance | 12 machines (96 CPU cores): GPU, Grid | 14 machines (112 CPU cores): GPU, Grid (entries: time × speed-up)
73-73 0.2×10.9 7.3×0.3 0.2×12.3 7.9×0.3
81-81 0.2×12.2 7.6×0.4 0.2×13.4 8.3×0.4
101-117 0.4×18.1 8.1×0.9 0.3×22.0 8.8×0.8
201-217 1.9×25.3 10×4.7 1.6×30.6 9.5×4.9
401-417 14×28.8 28×14.4 10×38.3 25×16.1
601-617 51×40.1 155×13.2 35×58.4 148×13.8
801-817 128×42.3 425×12.7 81×67.1 411×13.1
1001-1017 252×43.9 1071×10.3 154×71.9 1043×10.6
1301-1317 568×44.1 2439×10.2 342×73.3 2405×10.3
An analysis of the time dedicated to the transfers in Table 2.16 confirms this observation. In comparison with COWs, the transfer time, which corresponds to sending the partitions and to the synchronization, is significantly more prominent whatever the instance size. This can be explained by the distribution of the computers among the different sites (respectively two, five and seven according to the configuration). Indeed, in COWs, such extra inter-site communication does not occur, since the computers are directly linked by a gigabit Ethernet network.
Table 2.16: Analysis of the time dedicated to each operation for workstations distributed
in a grid organization. The permuted perceptron problem using a neighborhood based on
a Hamming distance of two is considered.
Conclusion
In this chapter, we have proposed an efficient cooperation between the CPU and the GPU. This challenge represents one of the critical issues when dealing with the parallel evaluation of solutions on GPU (iteration-level model). Indeed, since the evaluation of solutions is often the most time-consuming part of metaheuristics, it has to be performed in parallel on GPU.
• Optimization of data transfers from CPU to GPU. One of the crucial issues is to minimize the data transfers between the CPU and the GPU. For S-metaheuristics, generating the neighborhood on the GPU side is a must to achieve the best performance. For this purpose, we have proposed an efficient algorithm which performs the generation and evaluation of the neighborhood in parallel on GPU, while optimizing the associated data transfers. In Chapter 3, we will show how to accomplish this generation on the GPU side.
• Comparison with COWs and grids. For the same computational power, implementations on GPU architectures are much more efficient than COWs and grids for data-parallel regular applications. Indeed, the main issue in such distributed architectures is the communication cost. This is also due to the synchronous nature of the parallel iteration-level model (tabu search). However, since GPUs execute threads in a SIMD fashion, they may not be well-adapted to some irregular problems (e.g. [MCT06]), in which the computations become asynchronous.
Chapter 3
Efficient Parallelism Control
In this chapter, the focus is on the efficient control of parallelism for the iteration-level model on GPU. Indeed, GPU computing is based on massive multithreading, and the order in which the threads are executed is unknown.
First, an efficient thread control must be applied to meet the memory constraints. It allows some robustness to be added to the metaheuristics developed on GPU, and improves the overall performance. Second, regarding S-metaheuristics on GPU, an efficient mapping has to be defined between each neighboring candidate solution and a thread designated by a unique identifier. Then, we will examine the design on GPU of S-metaheuristics which explore a partial neighborhood, and show why they are not well-adapted to GPU architectures. Finally, having at hand new tools to design S-metaheuristics, we will assess how increasing the neighborhood size can improve the quality of the obtained solutions.
Contents
3.1 Thread Control for Metaheuristics on GPU . . . . . . . . . . . 61
3.1.1 Execution Parameters at Runtime . . . . . . . . . . . . . . . . . 61
3.1.2 Thread Control Heuristic . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Efficient Mapping of Neighborhood Structures on GPU . . . . 64
3.2.1 Binary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2 Discrete Vector Representation . . . . . . . . . . . . . . . . . . . 65
3.2.3 Vector of Real Values . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.4 Permutation Representation . . . . . . . . . . . . . . . . . . . . . 66
3.3 First Improvement S-metaheuristics on GPU . . . . . . . . . . 69
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.1 Thread Control for Preventing Crashes . . . . . . . . . . . . . . 71
3.4.2 Thread Control for Further Optimization . . . . . . . . . . . . . 74
3.4.3 Performance of User-defined Mappings . . . . . . . . . . . . . . . 74
3.4.4 First Improvement S-metaheuristics on GPU . . . . . . . . . . . 77
3.5 Large Neighborhoods for Improving Solutions Quality . . . . . 80
3.5.1 Application to the Permuted Perceptron Problem . . . . . . . . . 81
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Neighborhood Structures for
GPU-based Local Search Algorithms. Parallel Processing Letters, 20(4):307–324, 2010.
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Large Neighborhood Local
Search Optimization on Graphics Processing Units. 24th IEEE International Symposium
on Parallel and Distributed Processing, IPDPS 2010, pages 1–8, Workshop Proceedings,
IEEE, 2010.
Figure 3.1: Illustration of the operation of the warp-based thread scheduling scheme.
In practice, good performance for applications such as metaheuristics is usually reached with 64, 128, 256 or 512 threads per block.
However, for a very large solution set, some experiments might not be conducted at all. The major issue is then to control the number of threads so as to meet the memory constraints, such as the limited size and number of registers to be allocated to each thread. Unlike the previous approach, one thread might then be associated not with one neighbor but with several neighbors. As a result, on the one hand, an efficient thread control will prevent GPU programs from crashing. On the other hand, it will allow the number of threads required at runtime to be tuned to get the best multiprocessor occupancy, leading to better performance.
Different works [CSV10, NM09] have investigated parameter auto-tuning. These heuristics are a priori approaches based on enumerating all the different values of the two parameters (the number of threads per block and the total number of threads). However, such approaches are too time-consuming and may not be well-adapted to metaheuristics due to their a priori nature. To deal with this issue, we have proposed in [1] a dynamic heuristic for parameter auto-tuning at runtime. To the best of our knowledge, such an approach had never been investigated in the different works on GPU-based metaheuristics. Algorithm 9 gives the general template of this heuristic. Such a method is common to all metaheuristics on GPU (i.e. P-metaheuristics and S-metaheuristics).
The main idea of this approach is to send threads by “waves” to the GPU kernel in order to perform the parameter tuning during the first iterations of the metaheuristic. Thereby, measuring the time of each selected configuration over a certain number of trials (lines 5 to 14) will yield the best configuration parameters. Regarding the number of threads per block, as quoted above, it is set to a multiple of the warp size (see line 19). The starting total number of threads is set to the nearest power of two of the solution size, in accordance with the previous point. To decrease the total number of configurations, the algorithm terminates when the logarithm of the neighborhood size is reached. In some cases, when a thread block allocates more registers than are available on a multiprocessor, the kernel execution fails because too many threads are requested. Therefore, a fault-tolerance mechanism is provided to detect such a situation (lines 8 to 12). In this case, the heuristic terminates and returns the best configuration parameters previously found.
The only parameter left to determine is the number of trials per configuration. The higher this value, the more accurate the final tuning, at the expense of extra computational time. The benefits of the thread control will be presented in Sections 3.4.1 and 3.4.2.
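A condensed host-side sketch of such a heuristic (not the thesis's exact Algorithm 9; time_kernel is an assumed helper that launches the evaluation kernel with a given configuration over the requested trials and returns the elapsed time) might be:

#include <cuda_runtime.h>
#include <cfloat>

struct Config { int total_threads; int threads_per_block; float time; };

float time_kernel(int total, int tpb, int trials);  // assumed helper

static int next_pow2(int x) { int p = 1; while (p < x) p <<= 1; return p; }

Config tune_parameters(int neighborhood_size, int trials) {
    Config best = {0, 0, FLT_MAX};
    // Send "waves" of threads: start from the nearest power of two of the
    // problem size and halve it, bounding the number of configurations.
    for (int total = next_pow2(neighborhood_size); total >= 64; total /= 2) {
        for (int tpb = 64; tpb <= 512; tpb *= 2) {   // multiples of the warp size
            float t = time_kernel(total, tpb, trials);
            if (cudaGetLastError() != cudaSuccess)   // fault tolerance: the
                return best;   // kernel failed, keep the last working config
            if (t < best.time) best = {total, tpb, t};
        }
    }
    return best;
}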
3.2 Efficient Mapping of Neighborhood Structures on GPU
The main remaining difficulty is to find an efficient mapping between a GPU thread and the neighboring candidate solutions. In other words, the issue is to decide which solution must be handled by which thread. The answer depends on the solution representation. Indeed, the neighborhood structure strongly depends on the representation of the target optimization problem. In the following, we provide a methodology to deal with the main structures of the literature.
Figure 3.2: Binary representation. For a Hamming distance of one, the neighborhood of
a solution consists in flipping one bit of the solution.
Such a mapping has been investigated for the permuted perceptron problem [Poi95], with a neighborhood based on a Hamming distance of one.
For continuous optimization, a solution is coded as a vector of real values. A usual neighborhood for such a representation consists in discretizing the solution space. The neighborhood is defined in [CS00] by using the concept of a “ball”. A ball B(s, r) is centered on s with radius r; it contains all points s′ such that ||s′ − s|| ≤ r. To obtain a homogeneous exploration of the space, a set of balls centered on the current solution s is considered, with radii h0, h1, ..., hm.
Thus, the space is partitioned into “crowns” Ci(s, hi−1, hi) = {s′ | hi−1 ≤ ||s′ − s|| ≤ hi}. The m neighbors of s are chosen by randomly selecting one point inside each crown Ci, for i varying from 1 to m (see Fig. 3.4). This can easily be done by geometrical construction. The mapping consists in associating one thread with at least one neighbor corresponding to one point inside each crown; thus, such a mapping is feasible. An application of this mapping is done for the Weierstrass function [LV98].
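As a sketch, one geometrical construction for drawing a point inside crown Ci on the GPU could be the following (illustrative only: a Gaussian vector gives a random direction, the radius is drawn in [h_{i-1}, h_i], and curand state initialization is omitted):

#include <curand_kernel.h>

// Write into `neighbor` one point of the crown centered on s with inner
// radius r_in and outer radius r_out, using the thread's curand state.
__device__ void sample_in_crown(curandState *state, const float *s, int dim,
                                float r_in, float r_out, float *neighbor) {
    // Random direction: normalize a vector of Gaussian samples.
    float norm = 0.0f;
    for (int d = 0; d < dim; ++d) {
        neighbor[d] = curand_normal(state);
        norm += neighbor[d] * neighbor[d];
    }
    norm = sqrtf(norm);
    // Radius inside the crown [r_in, r_out].
    float r = r_in + (r_out - r_in) * curand_uniform(state);
    for (int d = 0; d < dim; ++d)
        neighbor[d] = s[d] + r * neighbor[d] / norm;
}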
Figure 3.4: A neighborhood for a continuous problem with two dimensions. The neighbors
are taken by random selection of one point inside each crown.
Figure 3.5: Permutation representation. A usual neighborhood is based on the swap oper-
ator which consists in exchanging the location of two elements of the candidate solution.
The proofs of the two-to-one and one-to-two index transformations can be found in Appendices .1.1 and .1.2. The complexity of such mappings is nearly constant time, i.e. it depends on the calculation of a square root on GPU (solving a quadratic equation). An application of this mapping is done for the quadratic assignment problem. In a similar way, a mapping for a neighborhood based on a 2-opt operator has been applied to the traveling salesman problem. Moreover, a slight modification of the mapping has been applied to the permuted perceptron problem for a neighborhood based on a Hamming distance of two.
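For illustration, one common closed form for the one-to-two transformation of a swap neighborhood is sketched below (thread index k in [0, n(n−1)/2) is mapped to the pair (i, j) with 0 ≤ i < j < n by solving the underlying quadratic equation; this is an illustrative variant, not necessarily the thesis's exact derivation, and double precision may be needed for very large n):

// Invert the row-wise enumeration of swap pairs with a square root.
__device__ void index_to_swap_pair(int k, int n, int *i, int *j) {
    *i = n - 2 - (int)floorf(sqrtf(-8.0f * k + 4.0f * n * (n - 1) - 7.0f)
                             / 2.0f - 0.5f);
    *j = k + *i + 1 - n * (n - 1) / 2 + (n - *i) * ((n - *i) - 1) / 2;
}

For example, with n = 4 the six pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) are recovered in order for k = 0, ..., 5.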
For more complex neighborhoods, finding a mapping might be more difficult. To get rid of such constraints, a common solution is to construct mapping tables on the CPU and to copy them once to the GPU global memory. In this manner, each thread just needs to retrieve its corresponding indexes from the mapping tables. Figure 3.6 illustrates this idea with a neighborhood based on a 3-exchange operator.
The construction of mapping tables makes it possible to deal with any neighborhood. The drawback of this method is the extra cost due to the additional global memory accesses. A study of how this impacts the global performance of S-metaheuristics on GPU will be conducted in Section 3.4.3.
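A sketch of this construction for a 3-exchange neighborhood (names are illustrative): the table is filled on the CPU and copied once to global memory, and each thread then reads its three indices.

#include <vector>
#include <cuda_runtime.h>

// Enumerate all triples (i, j, k) with i < j < k on the CPU, then copy the
// resulting mapping table to GPU global memory a single time.
int3 *build_mapping_table_on_gpu(int n, size_t *out_size) {
    std::vector<int3> h_table;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            for (int k = j + 1; k < n; ++k)
                h_table.push_back(make_int3(i, j, k));
    int3 *d_table;
    *out_size = h_table.size();
    cudaMalloc(&d_table, h_table.size() * sizeof(int3));
    cudaMemcpy(d_table, h_table.data(), h_table.size() * sizeof(int3),
               cudaMemcpyHostToDevice);   // copied only once
    return d_table;
}
// In the kernel, each thread would simply do:
//   int3 move = d_table[blockIdx.x * blockDim.x + threadIdx.x];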
3.3 First Improvement S-metaheuristics on GPU
A way to deal with such an asynchronous parallelization is to transform these algorithms into a data-parallel regular application. Thereby, one has to consider the previous parallelization scheme of the iteration-level on GPU, applied not to the whole neighborhood but to a partial set of solutions. In other words, the approach is to generate and evaluate a partial set of solutions on GPU (see Figure 3.7). After this parallel evaluation, according to the S-metaheuristic, a specific post-treatment is performed on CPU on this partial set of solutions. Such a mechanism can be seen as a way to simulate a first improvement-based S-metaheuristic in a parallel way. From an implementation point of view, it is similar to Algorithm 6 proposed in the previous chapter. The only difference concerns the partial set to be handled, in which the neighbors are randomly chosen from the entire neighborhood. Once the number of neighbors has been set, the thread control heuristic presented in Section 3.1 can automatically adjust the remaining parameters.
Even if this approach to dealing with such an asynchronous algorithm may seem natural, it may not be efficient in comparison with an S-metaheuristic in which a full exploration of the neighborhood is performed on GPU. Indeed, to get a better global memory performance, memory accesses must constitute a contiguous range of addresses in order to be coalesced. This is not achieved in the exploration of a partial neighborhood, since the neighbors are randomly chosen.
Figure 3.8 illustrates a memory access pattern for the two different neighborhood explorations. In the full exploration of the neighborhood (left case), all the neighbors are generated. Hence, the elements of the structures are laid out such that many thread accesses will be coalesced into a single memory transaction. This does not happen in the partial exploration of the neighborhood (right case), since there is no connection between the elements to be accessed. Indeed, the different moves to be performed from the partial
Figure 3.8: Illustration of a memory access pattern for two different neighborhood explorations.
set of solutions are randomly chosen from the whole neighborhood. In this case, the different memory accesses have to be serialized, increasing the total number of instructions executed for the warp.
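The two patterns can be sketched as follows (illustrative kernels: in the full exploration, thread t reads element t, while in the partial exploration it reads element idx[t], where idx holds the randomly chosen moves):

// Full exploration: thread t reads element t of the structure.
__global__ void full_exploration_read(const float *data, float *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = data[t];       // contiguous range: coalesced into few transactions
}

// Partial exploration: thread t reads a randomly chosen element idx[t].
__global__ void partial_exploration_read(const float *data, const int *idx,
                                         float *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = data[idx[t]];  // scattered: warp accesses may be serialized
}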
In Section 3.4.4, we will examine how such uncoalesced accesses to the global memory impact the overall performance of S-metaheuristics.
3.4 Performance Evaluation
3.4.1 Thread Control for Preventing Crashes
As previously said, when dealing with a large solution set, some experiments might fail at execution time. This is typically the case when too many threads are requested, because a thread block then allocates more registers than are available on a multiprocessor. Hence, the main issue is to control the number of threads so as to meet the memory constraints. In the following, we will feature an application which provokes errors at runtime. Thereafter, we will show how the thread control heuristic makes it possible to prevent such errors.
For the next experiments, a tabu search with 10000 iterations is considered on the GTX 280 configuration. The number of threads per block has been arbitrarily set to 256, and the total number of threads created at run time is equal to the neighborhood size. A 2-opt
Table 3.1: Measures in terms of efficiency for the traveling salesman problem using a
pair-wise-exchange neighborhood (permutation representation).
operator for the TSP has been implemented on GPU. The considered instances have been
selected among the TSPLIB instances presented in [DG97].
Table 3.1 presents the results for the traveling salesman problem. On the one hand, even if a large number of threads is executed (n(n−1)/2 neighbors), the values for the first configuration are not significant (acceleration factors from ×1.2 to ×1.5). Indeed, the
neighbor evaluation function consists of replacing two edges of a solution. As a result, this computation can be done in constant time, which is not enough to hide the memory latency. Regarding the other configurations, using more cores overcomes this issue and results in a better global performance. Indeed, for the GeForce 8800, accelerations start from ×1.5 with the eil101 instance and grow up to ×4.4 for pr2392. In a similar manner, the GTX 280 starts from ×2.3 and goes up to an acceleration factor of ×11 for the fnl4461 instance. Nevertheless, for the first three configurations, for larger instances such as pr2392, fnl4461 or rl5915, the program provoked an execution error because of the hardware register limitation. Such a problem does not exist for the Tesla M2050, since more registers are available on this card.
Table 3.2: Measures of the benefits of applying thread control. The traveling salesman
problem using a neighborhood based on a 2-opt operator is considered.
Since the GPU may fail to execute large neighborhoods on large instances, the next experiment highlights the benefits of the thread control presented in Section 3.1. The associated heuristic based on thread “waves” has been applied to the traveling salesman problem previously seen. The only required parameter is the number of trials, which defines the accuracy of each parameter configuration (i.e. the number of threads per block and the total number of threads). The value of this parameter has been fixed to 10. Table 3.2 presents the obtained results for the traveling salesman problem.
The first observation concerns the robustness provided by the thread control version for the large instances pr2392, fnl4461 and rl5915. Indeed, one can clearly see the great benefit of such control, since the execution of these instances on GPU terminated successfully whatever card was used. According to the execution logs, the heuristic is fault-tolerant, since it is able to detect kernel errors at run time (e.g. too many registers requested, a bad configuration of the parameters, etc.) and to restore the previous functional state of the algorithm. Regarding the acceleration factors using the thread control, they
range between ×1.3 and ×19.9 according to the instance size (GPUTC). The performance improvement in comparison with the standard version varies between 1% and 5%, which is not especially significant. Furthermore, for some instances, statistical analysis cannot determine whether the distributions of the averages of the two algorithms differ. This can be explained by the fact that the instances are really large, and thus so is the neighborhood size. Indeed, since the number of tuning iterations is directly linked to the neighborhood size, the algorithm may take too many iterations to reach a suitable parameter tuning.
Table 3.3: Measures of the benefits of applying thread control. The permuted perceptron
problem using a neighborhood based on a Hamming distance of two is considered.
Table 3.4: Benefits of the thread control on GPU. The acceleration factors are reported
for a tabu search on GPU on different optimization problems.
Table 3.5: Measures of the benefits of using user-defined mappings instead of constructed
mapping tables on the GTX 280. The permuted perceptron problem is considered for two
different neighborhoods using a tabu search.
Instance | n neighbors: CPU, GPU, GPUMT | n(n−1)/2 neighbors: CPU, GPU, GPUMT
73-73 1.1 3.0×0.4 3.2×0.4 2.1 0.2×10.9 0.2×9.5
81-81 1.3 3.3×0.4 3.5×0.4 2.7 0.2×12.2 0.3×10.6
101-117 2.2 4.2×0.5 4.5×0.5 7.0 0.4×18.1 0.4×15.6
201-217 8.1 7.7×1.1 8.3×1.0 48 1.9×25.3 2.2×21.5
401-417 31 14×2.2 16×2.0 403 14×28.8 16×24.8
601-617 105 43×2.4 48×2.2 2049 51×40.1 61×33.7
801-817 200 50×4.0 54×3.8 5410 128×42.3 149×36.0
1001-1017 336 58×5.8 66×5.1 11075 252×43.9 297×37.3
1301-1317 687 85×8.0 94×7.3 25016 568×44.1 661×37.9
3.4.3 Performance of User-defined Mappings
Since each thread has to retrieve the indexes of its move from the mapping tables stored in global memory, additional accesses to the global memory are implied. Table 3.5 reports the results obtained with user-defined mappings and with constructed mapping tables for the permuted perceptron problem.
Regarding the neighborhood based on a Hamming distance of one (n neighbors), the acceleration factors obtained with the version using a mapping table (GPUMT) are quite close to those of the original version. Indeed, they range from ×0.4 to ×7.3 (against ×0.4 to ×8.0). This can be explained by the fact that the number of additional accesses remains moderate (a mapping table of n elements). However, when considering the bigger neighborhood based on a Hamming distance of two, the performance gap is quite remarkable. The speed-ups obtained vary from ×9.5 to ×37.9 for GPUMT (against ×10.9 to ×44.1 for the original version). Such a performance difference ranges from 5% to 15% according to the instance. As a consequence, this performance diminution is fairly significant, and it justifies the use of user-defined mappings when increasing the neighborhood size.
In a similar way, the same observations can be made for the other problems, where the use of mapping tables generates a 5% to 20% performance degradation (see Table 3.6).
Table 3.6: Benefits of the use of user-defined mappings in comparison with constructed
mapping tables. The acceleration factors are reported for a tabu search on GPU on
different optimization problems.
Table 3.7: Measures in terms of efficiency of two different exploration strategies. The
quadratic assignment problem using a 3-exchange neighborhood is considered.
3.4.4 First Improvement S-metaheuristics on GPU
Since the neighbors are randomly chosen, accesses to data structures in the global memory may be serialized, leading to a performance decrease. The next experiment consists in evaluating this performance degradation for an S-metaheuristic which explores a partial neighborhood. To measure it, this algorithm is directly compared with another one which operates a full exploration of the neighborhood.
To do so, an iterated local search for the quadratic assignment problem is considered on the GTX 280 configuration. The embedded algorithm is a hill climbing heuristic. The first version performs a full exploration of the neighborhood, whereas the second one considers a partial exploration. The neighborhood is based on a 3-exchange operator. Such a large neighborhood ensures that the number of threads is enough to cover the access latency, so as not to introduce any bias into the experiment.
The stopping criterion is fixed to a certain number of evaluations, equivalent to 10000 iterations of a full exploration. Regarding the partial exploration strategy, experiments have been performed with the number of neighbors per iteration fixed to 1024, 2048 and 4096. The thread control heuristic is in charge of tuning the parameters. Thereafter, only the average of the best of the three configurations is reported in the results. Furthermore, since there is no way to detect the convergence to a local optimum in the partial exploration strategy, the number of iterations is fixed to 50 before applying the perturbation. All the acceleration factors are measured against an iterated local search with a full exploration strategy executed on a single CPU core. Table 3.7 reports the obtained results for the two algorithms.
In a general manner, for the iterated local search based on a full exploration (GPUILSHCFE), speed-ups grow with the instance size. They vary from ×9.5 to ×21.2. One can also observe that these acceleration factors are more salient than those obtained with a neighborhood based on a 2-exchange. Regarding the iterated local search based on a partial exploration (GPUILSHCPE), accelerations also grow with the size increase (from ×3.8 to ×9.6). How-
Table 3.8: Analysis of the execution path of two different exploration strategies provided
by the CUDA Profiler. The instance tai50a is considered.
ever, as expected, the results obtained with the partial exploration of the neighborhood are clearly less effective than with a full exploration strategy. This performance difference essentially comes from the random moves of the partial exploration, leading to non-coalesced accesses to the global memory.
To confirm this point, an analysis of the execution path of the two exploration strategies is required. To achieve this, the CUDA Profiler [NVI11] provides a tool to examine the number of branches taken. Table 3.8 reports the results for the instance tai50a.
Regarding the presented results, in addition to the full exploration-based algorithm, partial exploration-based algorithms are considered with three different neighborhood sizes. In general, the number of branches taken by threads is similar for each algorithm. However, one can clearly observe that the number of divergent branches taken by threads in each partial exploration-based algorithm is at least twice that of the full exploration strategy. This typically results from the additional non-coalesced accesses to the global memory. Hence, such thread divergence leads to many memory accesses that have to be serialized, increasing the total number of instructions executed. Indeed, the number of warp serializations for a partial exploration is about five times higher than for a full exploration-based algorithm. This analysis confirms the performance difference which occurs between the two exploration strategies.
3.5 Large Neighborhoods for Improving Solutions Quality
Since the exploration of large neighborhoods is particularly time-consuming, this mechanism is often not fully exploited in practice: large neighborhood algorithms are usually unusable because of their high computational cost. Indeed, experiments using large neighborhoods are often stopped without convergence being reached. Thereby, in designing S-metaheuristics, there is often a trade-off between the size of the neighborhood to use and the computational complexity to explore it. To deal with such issues, only the use of parallelism allows the design of methods based on large neighborhood structures. We have shown in [2, 8] how the use of GPU computing makes it possible to exploit the parallelism of such algorithms.
3.5.1 Application to the Permuted Perceptron Problem
As an application, a tabu search has been implemented on GPU for the permuted perceptron problem. Three neighborhoods based on different Hamming distances are considered. Indeed, usual neighborhoods for solving binary problems are in general a linear (e.g. 1-Hamming distance) or quadratic (e.g. 2-Hamming distance) function of the input instance size. Some large neighborhoods may be a higher-order polynomial of the size of the input instance (e.g. 3-Hamming distance).
The configuration used is the third one, with the NVIDIA GTX 280 card. The following experiments intend to assess the quality of solutions for the four instances of the literature addressed in [KM99]. A tabu search has been executed 50 times with a maximum of n × (n − 1) × (n − 2)/6 iterations. The tabu list size has been arbitrarily set to m/6, where m is the number of neighbors. The average value of the evaluation function (fitness) and its standard deviation (as a subscript) have been measured. The number of successful tries (fitness equal to zero) and the average number of iterations needed to converge to a solution are also reported.
Table 3.9 reports the results for the tabu search based on the 1-Hamming distance neighborhood. In a short execution time, the algorithm has been able to find a few solutions for the instances m = 73, n = 73 (11 successful tries out of 50) and m = 81, n = 81 (5 successful tries out of 50). The two other instances are well-known for their difficulty, and no solutions were found. Regarding the execution time, the GPU version does not offer any benefit in terms of efficiency. Indeed, since the neighborhood is relatively small (n threads), the number of threads per block is not enough to fully hide the memory access latency. To measure the efficiency of the GPU-based implementation of this neighborhood, bigger instances of the permuted perceptron problem should be considered.
A tabu search has then been implemented on GPU using a neighborhood based on a Hamming distance of two. The results of the experiment for the permuted perceptron problem are reported in Table 3.10.
With this neighborhood, in comparison with Table 3.9, the quality of solutions has been significantly improved: on the one hand, the number of successful tries for both m = 73, n = 73 (22 solutions) and m = 81, n = 81 (17 solutions) is more prominent; on the other hand, 13 solutions were found for the instance m = 101, n = 101. Regarding the execution time, the acceleration factor of the GPU version is remarkably efficient (from ×8.2 to ×18.5). Indeed, since a large number of threads are executed, the GPU can take full advantage of the multiprocessor occupancy.
Finally, a tabu search using a neighborhood based on a Hamming distance of three has been implemented. The obtained results are collected in Table 3.11.
In comparison with Knudsen and Meier's article [KM99], the results found by the generic
Figure 3.9: Analysis of the time spent on each major operation. The three different neighborhoods are compared, and the instances are ordered according to their size.
tabu search are competitive without any use of cryptanalysis techniques. Indeed, the number of successful tries has been drastically improved for every instance (respectively 39, 33 and 22 successful tries), and 3 solutions have even been found for the difficult instance m = 101, n = 117. Regarding the execution time, the acceleration factors using the GPU are highly significant (from ×24.2 to ×26.2).
The conclusions of this experiment indicate that the use of the GPU provides an efficient way to deal with large neighborhoods. Indeed, a neighborhood based on a Hamming distance of three on the permuted perceptron problem was impractical in terms of single CPU computational resources. Implementing this algorithm on GPU has made it possible to exploit the parallelism of such a neighborhood and to improve the quality of the solutions. Furthermore, we strongly believe that the quality of the solutions could be drastically enhanced by (1) increasing the number of iterations of the algorithm and (2) introducing appropriate cryptanalysis heuristics.
To validate the performance of the algorithms, we propose an analysis of the time spent on each major operation, in order to assess its impact in terms of efficiency. The obtained results are reported in Fig. 3.9.
For the first neighborhood (n neighbors), one can notice that the time spent on the data transfers is significant. For example, it represents 28% of the total execution time for the instance m = 73, n = 73. The same holds for the other instances. Furthermore, regarding the time spent on the search process on CPU, almost 15% is dedicated to this task, whatever the instance size. As a consequence, since only half of the total execution
time is dedicated to the GPU kernel, the amount of computation may not be enough to
fully cover the memory access latency. That is the reason why, no acceleration is provided
for a tabu search based on a neighborhood with a Hamming distance of one (see Table 3.9).
Following the same idea, if a neighborhood based on a Hamming distance of two is considered
(n × (n − 1) / 2 neighbors), one can notice that the time spent on the GPU calculation is
greater than 70% for each instance. This time is almost equal to 90% for a neighborhood
based on a Hamming distance of three. Thereby, the time spent on both the
data transfers and the search process on CPU tends to decrease as (1) the number
of neighbors and (2) the instance size increase. Indeed, this can be explained by the fact
that, in designing S-metaheuristics, these two parameters usually have a significant
influence on the total execution time and thus on the amount of calculation. As a result, in
accordance with the previous results (see Table 3.10 and Table 3.11), algorithms based on
bigger neighborhoods clearly improve the GPU occupancy i.e. the amount of calculation
performed by the GPU, leading to a better global performance.
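As a concrete illustration (a simple count derived from the neighborhood formulas above,
taking the instance m = 73, n = 73), the neighborhood sizes grow quickly with the
Hamming distance: n = 73 neighbors for a distance of one, n × (n − 1) / 2 = 2628 neighbors
for a distance of two, and n × (n − 1) × (n − 2) / 6 = 62196 neighbors for a distance of
three. The last two figures largely exceed the number of threads needed to saturate the
multiprocessors, which explains the better occupancy.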
Conclusion
In this chapter, we have dealt with the efficient control of parallelism for the iteration-level
model on GPU. For this challenge, a clear understanding of the thread-based execution model
makes it possible to design new algorithms and to improve the performance of metaheuristics on
GPU.
• Thread control heuristic. When dealing with applications requiring a large number
of threads, some experiments cannot be conducted at all. Controlling the generation
of threads is a must to meet memory constraints such as the limited number
of registers allocated to each thread. We have proposed an efficient thread control
heuristic to automatically tune the different parameters involved during the kernel
execution. Such a control introduces some fault-tolerance mechanisms in GPU
applications. Further, it may provide an additional performance improvement.
Chapter 4

Efficient Memory Management
Efficient cooperation and parallelism control have been proposed in the previous chapters.
In this chapter, the focus is on memory management on GPU architectures. Understanding
the hierarchical organization of the different memories is useful to provide an
efficient implementation of parallel metaheuristics.
First, we will introduce concepts of memory management common to all metaheuristics.
We will briefly explain how different memory techniques affect the performance of GPU
applications. Then, the focus of this chapter will be on the parallel and cooperative model
on GPU. Indeed, memory management is more prominent when dealing with these
cooperative algorithms. Thereby, we will investigate how interactions between the
threads in P-metaheuristics might be exploited on the hierarchical GPU. For this purpose,
traditional mechanisms of cooperative algorithms are revisited on GPU to achieve better
performance.
Contents
4.1 Common Concepts of Memory Management . . . . . . . . . . . 89
4.1.1 Memory Coalescing Issues . . . . . . . . . . . . . . . . . . . . . . 89
4.1.2 Coalescing Transformation . . . . . . . . . . . . . . . . . . . . . 90
4.1.3 Texture Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 Memory Management in Cooperative Algorithms . . . . . . . . 94
4.2.1 Parallel and Cooperative Model . . . . . . . . . . . . . . . . . . . 94
4.2.2 Parallelization Strategies for Cooperative Algorithms . . . . . . . 96
4.2.3 Issues Related to the Fully Distributed Schemes . . . . . . . . . 101
4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 106
4.3.1 Coalescing accesses to global memory . . . . . . . . . . . . . . . 106
4.3.2 Memory Associations of Optimization Problems . . . . . . . . . 107
4.4 Performance of Cooperative Algorithms . . . . . . . . . . . . . . 108
4.4.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4.2 Measures in Terms of Efficiency . . . . . . . . . . . . . . . . . . . 109
4.4.3 Measures in Terms of Effectiveness . . . . . . . . . . . . . . . . . 113
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Parallel Hybrid Evolutionary
Algorithms on GPU. In IEEE Congress on Evolutionary Computation, CEC 2010, pages
1–8, Proceedings, IEEE, 2010.
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU-based Multi-start Local
Search Algorithms. In 5th International Conference on Learning and Intelligent Optimization,
LION 5, in press, Lecture Notes in Computer Science, Springer, 2011.
In the GPU execution model, each block of threads is split into SIMD groups (warps). At
any clock cycle, each processor of the multiprocessor selects a half-warp (16 threads) that
is ready to execute the same instruction on different data. Global memory is conceptually
organized into a sequence of 128-byte segments. The number of memory transactions
performed for a half-warp is the number of distinct segments addressed
by that half-warp. Figure 4.1 illustrates an example of the memory management for a
basic vector addition.
For more efficiency, global memory accesses must be coalesced, which means that a mem-
ory request performed by consecutive threads in a half-warp is strictly associated with one
segment. The requirement is that threads of the same warp must read global memory in
an ordered pattern. If per-thread memory accesses for a single half-warp constitute a con-
tiguous range of addresses, accesses will be coalesced into a single memory transaction. In
the example of vector addition, memory accesses to the vectors a and b are fully coalesced,
since threads with consecutive thread indices access contiguous words.
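As an illustration, the following minimal CUDA sketch shows the coalesced access pattern
of the vector addition discussed above (the kernel name and signature are illustrative):

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Thread i reads a[i] and b[i]: consecutive threads of a half-warp
    // access consecutive words, so the requests fall into a single
    // 128-byte segment and are coalesced into one memory transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}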
Otherwise, accessing scattered locations results in memory divergence and requires the
processor to produce one memory transaction per thread. The performance penalty for
non-coalesced memory accesses varies according to the size of the data structure. In
Chapter 3, an insight into this issue has already been given for first improvement-based
S-metaheuristics (Section 3.3 and Section 3.4.4).
Indeed, regarding structures in optimization problems, coalescing is sometimes hardly
feasible since global memory accesses have a data-dependent unstructured pattern (espe-
cially for permutation representation). Many research works on GPU such as [MBL+ 09,
SBPE10, VA10a, VA10b, ZCM08] ignore coalescing transformations of data structures.
Therefore, non-coalesced memory accesses imply many memory transactions that lead to
a significant performance decrease for these metaheuristics.
Nevertheless, for some optimization structures which are particular to a given thread,
memory coalescing on global memory can be performed. This is typically the case for the
data organization of a population in P-metaheuristics; or large local structures used for
the evaluation function. Figure 4.2 exhibits an example of a coalescing transformation
for local structures. As illustrated in the top of the figure, a natural but inefficient approach
to arrange the elements is to align the different structures one after the other. Thereby,
each thread can access the elements of its own structure with the pattern
baseAddress × id + offset. For instance, in the figure, each thread accesses the
second element of its structure with baseAddress = 3 and offset = 2.
Even if this way of organizing the elements in global memory is natural, it is clearly not
efficient. Indeed, to get better global memory performance, memory accesses must constitute
a contiguous range of addresses to be coalesced. This is done in the bottom of the figure.
In the second approach, the elements of the structures are dispatched such that thread
accesses will be coalesced into a single memory transaction. In the figure, for instance,
the second element is accessed using the pattern baseAddress2 × offset + id.
An experimental comparison of the two approaches is conducted in Section 4.3.1. In this
case, each solution uses a large private structure which cannot be stored in a local private
memory. We will show how this transformation mechanism is well-adapted for large local
structures.
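The two access patterns can be sketched as follows (STRUCT_SIZE and the helper names
are illustrative; they are not part of the thesis implementation):

#define STRUCT_SIZE 3  // size of each per-thread structure, as in Figure 4.2

// First layout: the elements of one structure are contiguous, so the
// threads of a half-warp touch addresses STRUCT_SIZE words apart
// (non-coalesced accesses).
__device__ float getElemNonCoalesced(const float *buf, int id, int offset) {
    return buf[STRUCT_SIZE * id + offset];   // baseAddress * id + offset
}

// Second layout: element `offset` of all structures is stored contiguously,
// so consecutive threads read consecutive words (coalesced accesses).
__device__ float getElemCoalesced(const float *buf, int nbThreads,
                                  int id, int offset) {
    return buf[nbThreads * offset + id];     // baseAddress2 * offset + id
}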
• Data accesses are frequent in the computation of evaluation methods. Therefore,
using texture memory can provide a high performance improvement by reducing the
number of memory transactions.
• Cached texture data is laid out so as to give the best performance for 1D/2D access
patterns. The best performance will be achieved when the threads of a warp read
locations that are close together from a spatial locality perspective. Since optimiza-
tion problem inputs are generally 2D matrices or 1D solution vectors, optimization
structures can be bound to texture memory.
The use of textures in place of global memory accesses is a purely mechanical transformation.
Details of texture coordinate clamping and filtering are given in [NBGS08, NVI11].
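For instance, a 1D problem input can be bound to the texture cache along the following
lines, assuming the texture reference API of the CUDA versions contemporary with this
work (the names texInput and bindInput are illustrative):

texture<int, 1, cudaReadModeElementType> texInput;   // file-scope reference

void bindInput(const int *hInput, int n) {
    int *dInput;
    cudaMalloc((void **)&dInput, n * sizeof(int));
    cudaMemcpy(dInput, hInput, n * sizeof(int), cudaMemcpyHostToDevice);
    // Every global read dInput[i] in the kernel can then be mechanically
    // replaced by tex1Dfetch(texInput, i) to go through the texture cache.
    cudaBindTexture(NULL, texInput, dInput, n * sizeof(int));
}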
Table 4.1: Kernel memory management. Summary of the different memories used in the
evaluation function.
Type of memory Optimization structure
Texture memory data inputs, solution representation
Global memory fitnesses structure, large structures
Registers additional variables
Local memory small structures
Shared memory partial fitnesses structure
Notice that, on Fermi-based GPUs, global memory is easier to access. This is due
to the relaxation of the coalescing rules and the presence of an L1 cache. It means
that applications developed on GPU get better global memory performance on these cards.
Hence, the benefits of using texture memory as a data cache are less pronounced.
Table 4.1 summarizes the kernel memory management in accordance with the different
structures used in metaheuristics. The inputs of the problem (e.g. a matrix in the traveling
salesman problem) are associated with the texture memory. In the case of S-metaheuristics,
the solution which generates the neighborhood can also be placed in this memory. The
fitnesses structure, which stores the obtained results for each solution, is declared as global
memory. Indeed, since only one writing operation per thread is performed at each iteration,
this structure is not part of intensive calculations. Declared variables for the computation
of each solution evaluation are automatically associated with registers by the compiler.
Additional complex structures, which are private to a solution, will reside in local memory.
In the case where these structures are too large to fit into local memory, they should be
stored in global memory using the coalescing transformation mentioned above. Finally,
the shared memory may be used to perform additional reduction operations on the fitness
structure to collect the minimal/maximal fitness (see Section 2.2.3 in Chapter 2). All
the different memories quoted above have been widely used in the previous experiments
presented in Chapter 2 and Chapter 3.
Even if the shared memory has been widely investigated to reduce non-coalesced accesses
in regular applications (e.g. [RRS+ 08, OML+ 08]), its use may not be well-adapted for the
parallel iteration-level model. Due to the limited capacity of each multiprocessor (varying
from 16KB to 48KB), data inputs such as matrices cannot be completely stored on shared
memory. Thus, the use of shared memory must be considered as a user-managed cache.
This implies an explicit effort of code transformation: one has to identify common sub-
structures which are likely to be concurrently accessed by SIMD threads of a same block.
Unfortunately, such common accesses are not always predictable in evaluation functions
since most access patterns to data inputs differ from one solution to another (especially
for permutation-based problems).
The following transformation intends to show how to take advantage of the shared memory
in the case of the quadratic assignment problem. By considering an S-metaheuristic with a
pair-wise exchange operator, one part of the ∆ calculation (slight variation) of the evaluation
function for a neighbor (i, j) is given by:

∆ = ∆1 + Σ_{k=0, k≠i, k≠j}^{n} [ (a_{k,i} − a_{k,j}) × (b_{π(k),π(j)} − b_{π(k),π(i)})
                                + (a_{i,k} − a_{j,k}) × (b_{π(j),π(k)} − b_{π(i),π(k)}) ]   (4.2)

A closer look at (4.2) indicates that the a and b sub-matrices involving the variable k might be
concurrently accessed in parallel. Therefore, the idea is to associate such sub-structures
with the shared memory as a user-managed cache. Equation (4.2) can be transformed
into:
∆ = ∆1 + Σ_{k=0, k≠i, k≠j}^{n} [ (ra_i − ra_j) × (rb_{π(j)} − rb_{π(i)})
                                + (ca_i − ca_j) × (cb_{π(j)} − cb_{π(i)}) ]   (4.3)

where ra ← row(a, k), rb ← row(b, π(k)), ca ← col(a, k) and cb ← col(b, π(k)).
Rows and columns are copied into shared memory at the beginning of each loop iteration via
a synchronization mechanism. In this manner, accesses to these structures in (4.3) are performed
through this memory. Therefore, sub-matrices can benefit from the shared memory
since it is a low-latency memory in which non-coalesced accesses are reduced. However,
this transformation is problem-dependent, and it might not be applicable to some evaluation
functions.
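A minimal kernel sketch of this transformation is given below, assuming row-major n × n
matrices a and b, one neighbor (i, j) per thread, and a grid that exactly covers the
neighborhood (the names and the flat signature are illustrative, not the thesis listing):

__global__ void qapDeltaShared(const int *a, const int *b, const int *pi,
                               const int *iIdx, const int *jIdx,
                               int *delta, int n) {
    // Launched with 4 * n * sizeof(int) bytes of dynamic shared memory.
    extern __shared__ int smem[];
    int *ra = smem,         *rb = smem + n;      // row(a,k), row(b,pi(k))
    int *ca = smem + 2 * n, *cb = smem + 3 * n;  // col(a,k), col(b,pi(k))

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int i = iIdx[id], j = jIdx[id];   // the move evaluated by this thread
    int d = 0;                        // accumulates the sum of (4.3)

    for (int k = 0; k < n; k++) {
        // The block cooperatively stages the k-th rows/columns.
        for (int e = threadIdx.x; e < n; e += blockDim.x) {
            ra[e] = a[k * n + e];
            rb[e] = b[pi[k] * n + e];
            ca[e] = a[e * n + k];
            cb[e] = b[e * n + pi[k]];
        }
        __syncthreads();              // staging done before any read
        if (k != i && k != j)
            d += (ra[i] - ra[j]) * (rb[pi[j]] - rb[pi[i]])
               + (ca[i] - ca[j]) * (cb[pi[j]] - cb[pi[i]]);
        __syncthreads();              // reads done before the next overwrite
    }
    delta[id] = d;                    // the Delta_1 term is added on CPU
}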
We will explore the performance obtained with the different memories in Section 4.3.2.
Even if some research works on GPU have considered the association of data structures
with the shared memory [TF09, VA10a], we will experimentally show why this association
is not well-adapted for the parallel iteration-level model.
objective is to delay the global convergence, especially when the metaheuristics
are heterogeneous. The migration of solutions follows a policy defined by a few parameters:
• Exchange topology: The topology specifies, for each P-metaheuristic, its neighbors
with respect to the migration process. In other words, it indicates for each P-metaheuristic
the other metaheuristics to which it may send its emigrants, and the
ones from which it may receive immigrants. Different well-known topologies are
proposed such as ring, mesh, torus, hypercube, etc.
Previous works on P-metaheuristics on GPU have attempted to implement parallel cooperative
algorithms [ZH09, PJS10, MWMA09a, SBPE10]. However, GPU-based cooperative
algorithms have never been deeply investigated, since no general methods can be outlined
from these previous works. In this section, the focus is on the redesign of the cooperative
algorithmic-level model. To accomplish this, we have designed in [5, 6] three different
parallelization schemes allowing a clear separation of the GPU memory hierarchical management
concepts.
A first scheme for the design and the implementation of cooperative algorithms is based on
a combination with the parallel iteration-level model on GPU (parallel evaluation of the
solutions). Indeed, in general, evaluating a fitness function for each solution is frequently
the most costly operation of a P-metaheuristic. Therefore, in this scheme, task distri-
bution is clearly defined: the CPU manages the whole sequential search process for each
cooperative algorithm, and the GPU is dedicated to the parallel evaluation of populations.
Figure 4.4 shows the principle of the parallel evaluation of each P-metaheuristic on GPU.
The CPU sends a certain number of solutions to be evaluated to the GPU via the global
memory, and these solutions will be processed on GPU. Each thread associated with one
solution executes the same evaluation function kernel. Finally, results of the evaluation
function are returned back to the host via the global memory.
Regarding the details of the corresponding algorithm (see Algorithm 11), first, memory
allocations on GPU are made: data inputs of the problem, populations and their corresponding
fitnesses structures must be allocated (lines 3 to 5). Additional solution structures,
which are problem-dependent, can also be allocated to facilitate the computation of
the solution evaluation (line 5). Second, problem data inputs have to be copied onto the GPU
(line 7). These data are read-only structures, and their associated memory is copied
only once during the whole execution. Third, regarding the main body of the algorithm, the
different populations and their associated structures have to be copied at each iteration
(lines 11 and 12). Thereafter, the evaluation of solutions is performed in parallel on GPU
(lines 13 to 16). Fourth, the fitnesses structures have to be copied back to the host CPU (line
17). Then, a specific post-treatment strategy and the replacement of the population are
performed (lines 18 and 19). Finally, at the end of each generation, a possible migration may be
performed on CPU to exchange information between the different P-metaheuristics (line
20). The process is repeated until a stopping criterion is satisfied.
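A host-side sketch of this first scheme could look as follows (evalKernel and the helper
functions are placeholders for problem-specific code; names and sizes are illustrative):

void cooperativeSearch(float *hPop, float *hFit, const float *hData,
                       int nbIslands, int popSize, int solSize, int dataSize) {
    float *dPop, *dFit, *dData;
    cudaMalloc(&dPop,  nbIslands * popSize * solSize * sizeof(float));
    cudaMalloc(&dFit,  nbIslands * popSize * sizeof(float));
    cudaMalloc(&dData, dataSize * sizeof(float));
    // Problem inputs are read-only: copied once for the whole run.
    cudaMemcpy(dData, hData, dataSize * sizeof(float),
               cudaMemcpyHostToDevice);

    while (!stoppingCriterion()) {
        // Populations change every generation and must be re-copied.
        cudaMemcpy(dPop, hPop,
                   nbIslands * popSize * solSize * sizeof(float),
                   cudaMemcpyHostToDevice);
        evalKernel<<<nbIslands, popSize>>>(dPop, dData, dFit);
        cudaMemcpy(hFit, dFit, nbIslands * popSize * sizeof(float),
                   cudaMemcpyDeviceToHost);
        postTreatmentAndReplacement(hPop, hFit);  // done on the CPU
        migrate(hPop, hFit);                      // CPU-side migration
    }
    cudaFree(dPop); cudaFree(dFit); cudaFree(dData);
}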
As previously seen in Chapter 2, copying operations from CPU to GPU (i.e. population and
fitnesses structures) might be a bottleneck and thus can lead to a significant performance
decrease.
The main goal of this scheme is to accelerate the search process, and it does not alter the
semantics of the algorithm. As a result, the migration policy between the P-metaheuristics
remains unchanged in comparison with a classic design on CPU. Since the GPU is used as
a coprocessor to evaluate all individuals in parallel, this scheme is intrinsically dedicated
to synchronous cooperative algorithms.
all the individuals of a same block (island) reach the same point. Unlike S-metaheuristics,
which iterate on a single solution, the full parallelization of cooperative algorithms on GPU
can be naturally achieved, since the solutions (threads) of each local P-metaheuristic are
involved during the whole algorithm process.
Regarding the migration policy, communications are performed via the global memory,
which stores the global population. This way, each local P-metaheuristic can communicate
with any other one according to the given topology.
Algorithm 11 gives more details of the proposed algorithm. In comparison with the previous
scheme, all the different allocations and data transfers are performed only once, at
the beginning of the algorithm (lines 3 to 9). Thereafter, all the standard operations of
P-metaheuristics are fully performed on GPU. In other words, they are hard-coded in the
GPU kernel and not on CPU. One of the key points to pay attention to concerns the
synchronizations previously mentioned (lines 12 and 17). Such a procedure ensures that
operations involving the cooperation of various solutions are valid.
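The skeleton below sketches this second scheme (one island per thread block, one individual
per thread, everything in global memory); the device functions initialize, evaluate,
selectAndVary, replace and migrate stand for problem-specific code and are not part of
the thesis listing:

__global__ void islandEA(float *gPop, float *gFit, int nbGen, int migFreq) {
    int isl = blockIdx.x, ind = threadIdx.x;
    initialize(gPop, isl, ind);
    gFit[isl * blockDim.x + ind] = evaluate(gPop, isl, ind);
    __syncthreads();                  // every individual is evaluated

    for (int g = 0; g < nbGen; g++) {
        selectAndVary(gPop, gFit, isl, ind);  // reads others' fitnesses
        __syncthreads();              // cooperation step is consistent
        gFit[isl * blockDim.x + ind] = evaluate(gPop, isl, ind);
        __syncthreads();              // before replacement and migration
        replace(gPop, gFit, isl, ind);
        if (g % migFreq == 0)
            migrate(gPop, gFit, isl, ind);    // via the global memory
        __syncthreads();
    }
}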
In comparison with the previous parallelization scheme, one of the limitations of moving
the entire algorithm onto the GPU is the fact that heterogeneous strategies cannot be chosen.
Indeed, since threads work in a SIMD fashion, the same parameter configurations and
the same search components (e.g. mutation or crossover in evolutionary algorithms)
must be applied to each local P-metaheuristic. Another drawback of this scheme
concerns the maximal number of solutions per P-metaheuristic, since this latter is limited
to the maximal number of threads per block (up to 512 or 1024 according to the GPU
architecture). A seemingly natural idea to lift this restriction would be to associate one
P-metaheuristic with multiple thread blocks. However, it cannot be easily achieved in
practice since (1) threads of different blocks work in an asynchronous manner and (2) thread
synchronizations are local to a same block. For instance, for evolutionary algorithms, one
can imagine a scenario in which a selection is made on two individuals of different blocks,
where one of the two threads has not yet updated its associated fitness value (i.e. one of the
two individuals has not yet been evaluated). Such a situation would produce incoherent
results.
Regarding the kernel memory management, from a hardware point of view, graphics cards
consist of multiprocessors, each with processing units, registers and on-chip memory. Ac-
cessing global memory incurs an additional 400 to 600 clock cycles of memory latency.
As previously said, since this memory is not cached and its access is slow, one needs to
minimize accesses to global memory (read/write operations) and reuse data within the
local multiprocessor memories.
To accomplish this, the shared memory, presented in Section 4.1.4, is a fast on-chip memory
located on the multiprocessors and shared by threads of each thread block. This memory
can be considered as a user-managed cache which can deliver substantial speedups by
conserving bandwidth to main memory [NVI11]. Furthermore, since the shared memory
is accessible to all threads of a block, it provides an efficient way for threads to communicate
within the same block.
Therefore, a last parallelization strategy is to associate each local P-metaheuristic with
a thread block on GPU together with the use of the shared memory. An illustration of this
scheme for each P-metaheuristic is shown in Figure 4.6. This strategy is similar to the previous
one except that local populations and their associated fitnesses are stored in the
on-chip shared memory. For this purpose, each solution (thread) of each P-metaheuristic
(block) performs the algorithm process (initialization, evaluation, etc.) via the shared
memory.
Figure 4.6: Scheme of the full distribution of cooperative algorithms on GPU using shared
memory.
A more refined view is given in Algorithm 11. Basically, the template is similar to the
previous one. The main difference is that all the algorithm operations are performed
not in global memory but in shared memory. For this purpose, at the
beginning of the algorithm, populations are copied to the shared memory (lines 10 to 12).
Regarding the migration between P-metaheuristics, since it requires an inter-metaheuristic
communication, copying operations from each local population (shared memory) to the
global population (global memory) have to be considered (line 23). Details of this point
will be given in the next section.
Even if this scheme can improve the efficiency of cooperative algorithms, it presents a major
limitation: since each multiprocessor has a limited capacity of shared memory (varying
from 16KB to 48KB), only small problem instances can be dealt with. Indeed, the amount
of allocated shared memory per block depends on both the size of the local population
and the size of the problem instance. Therefore, a trade-off must be considered between
the number of threads per block and the size of the handled problem.
• Number of emigrants: This parameter does not raise any particular issue,
except that if the number of emigrants is too high, accesses to the global
memory will be more frequent, leading to a small performance decrease.
Whether for a synchronous or an asynchronous model, regarding the scheme using shared
memory, emigrants must be copied into the global memory to ensure their visibility be-
tween the different metaheuristics. Figure 4.9 summarizes the memory management during
the migration process through an example of cooperative P-metaheuristics on GPU using
the shared memory.
In this example, solutions of each local population are associated with the shared mem-
ory, and a ring topology is considered. The two best individuals of each local population
(block) are first copied from the shared memory to the global memory. Such a procedure
ensures that the solutions to migrate are globally visible between the different local popu-
lations. Thereafter, the two worst solutions of each P-metaheuristic are replaced by their
corresponding emigrants.
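The copy operations of this migration step can be sketched as follows (bestIndex and
worstIndex are hypothetical helpers; note that __syncthreads only synchronizes a block,
so in the asynchronous model an island may read slightly outdated emigrants, as discussed
above):

__device__ void migrateRing(float *sPop, float *sFit,      // shared memory
                            float *gEmigrants,             // global memory
                            int nbIslands, int solSize, int nbEmigrants) {
    int b = blockIdx.x;
    if ((int)threadIdx.x < nbEmigrants) {
        int best = bestIndex(sFit, threadIdx.x);   // threadIdx-th best one
        for (int e = 0; e < solSize; e++)          // shared -> global copy
            gEmigrants[(b * nbEmigrants + threadIdx.x) * solSize + e] =
                sPop[best * solSize + e];
    }
    __syncthreads();                               // block-local barrier only
    int src = (b + nbIslands - 1) % nbIslands;     // ring predecessor
    if ((int)threadIdx.x < nbEmigrants) {
        int worst = worstIndex(sFit, threadIdx.x); // individual to replace
        for (int e = 0; e < solSize; e++)          // global -> shared copy
            sPop[worst * solSize + e] =
                gEmigrants[(src * nbEmigrants + threadIdx.x) * solSize + e];
    }
}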
Table 4.2: Measures of the benefits of using coalesced accesses to the global memory
on the GTX 280 (times in seconds, acceleration factors in parentheses). The CUDA
Profiler is used for analysing the execution path. The permuted perceptron problem using
a neighborhood based on a Hamming distance of two is considered.

                       Non-coalesced accesses              Coalesced accesses
Instance     CPU     GPU           Warp serializations   GPU           Warp serializations
401-417      403     64 (×6.3)     3306512379            14 (×28.8)    435067418
601-617      2049    249 (×8.2)    6613185750            51 (×40.1)    612332013
801-817      5410    665 (×8.1)    11242415771           128 (×42.3)   1003787122
1001-1017    11075   1361 (×8.1)   17855601525           252 (×43.9)   1594250136
1301-1317    25016   3180 (×7.9)   29759335873           568 (×44.1)   1859958499
As previously said in Section 4.1.2, non-coalesced accesses to the global memory have a
remarkable impact on the performance of GPU applications. To achieve the best performance,
this issue has to be dealt with in the case of metaheuristics on GPU.
In Section 4.1.2, two different access patterns have been described to deal with large
structures stored in global memory. Even if the first one is natural, its associated
performance might be limited because of non-coalesced memory accesses. That is the
reason why the second pattern has always been used for the experiments presented in
Chapter 2 and in Chapter 3.
To confirm this point, the next experiment consists in comparing the performance results
obtained with the two different access patterns. To do so, the permuted perceptron
problem is considered. In this problem, the neighbor evaluation requires the calculation
of a structure called a histogram. Since this structure is particular to a neighbor, the local
memory is used to store the histogram. However, for large instances (from m = 401
and n = 417), the amount of local memory may not be enough to store this structure, and
the program will fail at compilation time. As a consequence, in that case, the histogram
must be stored in global memory. Table 4.2 presents the obtained results on the GTX
280 for a tabu search with a neighborhood based on a Hamming distance of two (10000
iterations).
In comparison with the second approach, the obtained performance results for the first
one are drastically reduced whatever the instance. For the version using non-coalesced
accesses, the speed-up varies between ×6.3 and ×8.2, whilst the speed-up for the second
approach ranges from ×28.8 to ×44.1. An analysis of the execution path of the two
algorithms confirms this observation. One can observe that the number of warp
serializations for the version using non-coalesced accesses is between seven and sixteen
times higher than for its counterpart. Indeed, the thread divergence caused by
non-coalesced accesses leads to many memory accesses that have to be serialized,
increasing the number of instructions to be executed. This explains the performance
difference between the two versions. A conclusion of this experiment is that memory
coalescing applied to local structures is a must to obtain the best performance.
Table 4.3: Measures in terms of efficiency of three different versions (global, texture,
and shared memories). The quadratic assignment problem using a pair-wise-exchange
neighborhood is considered.
On the other hand, performance results for the quadratic assignment problem indicate that this
version is dominated by the texture one, especially for low-end graphics card configurations.
Therefore, it seems that the stand-alone use of shared memory is not well-adapted for
the parallel iteration-level model. That is the reason why all the previous experiments on
S-metaheuristics have been performed using the texture memory.
The exploitation of the threads spatial organization and of the different available memories
allows the design of cooperative algorithms on GPU. In Section 4.2, we have introduced a
guideline for the redesign of the algorithmic-level parallel model on GPU.
To validate the performance of the proposed approaches, a cooperative island model for
evolutionary algorithms has been implemented on GPU. For this purpose, the Weierstrass-
Mandelbrot functions have been considered.
4.4.1 Configuration
The configuration used for the experiments is an Intel Xeon with 8 cores at 2.4 GHz with an
NVIDIA GTX 280 card. For each experiment, different implementations of the island
model are given: a standalone single-core implementation of the island model on CPU
(CPU), the synchronous island model using the parallel evaluation of populations on
GPU (CPU+GPU), the asynchronous and fully distributed island model on GPU (GPU),
its synchronous version (SGPU), and their associated versions using the shared memory
(GPUSh and SGPUSh).
Regarding the full implementation of evolutionary algorithms on GPU, the previous
techniques presented in Section 4.2.3 are applied to the selection and the replacement.
Regarding the components which are dependent on the problem (i.e. initialization, evaluation
function and variation operators), they do not raise specific issues related to the GPU
kernel execution. From an implementation point of view, the focus is on the management
of random numbers. To achieve this, efficient techniques are provided in several books
such as [NVI10] to implement random generators on GPU (e.g. Gaussian or Mersenne
Twister generators).
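As an illustration (a sketch based on the cuRAND library rather than on the generators
actually used for the experiments), one generator state per individual (thread) can be
managed as follows:

#include <curand_kernel.h>

// One generator state per individual (thread).
__global__ void setupRng(curandState *states, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);  // distinct subsequence per thread
}

__device__ float uniformRand(curandState *state) {
    return curand_uniform(state);           // uniform value in (0, 1]
}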
The operators used in the evolutionary process are the following: the crossover is a
standard two-point crossover (crossover rate fixed to 80%) which generates two offspring;
the mutation consists in randomly changing one gene to a real value taken in [−1; 1]
(with a mutation rate of 30%); the selection is a deterministic tournament (tournament
size fixed to the block size divided by four); the replacement is a (µ + λ) replacement;
and the number of generations has been fixed to 100. Regarding the migration policy,
a ring topology has been chosen; a deterministic tournament is performed for both the
emigrants selection and the migration replacement (tournament size fixed to the block
size divided by four); the migration rate is equal to the number of local individuals divided
by four; and the migration frequency is set to 10 generations.
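For illustration, the two problem-independent variation operators described above can be
sketched as device helpers (names and signatures are illustrative; the mutation reuses a
cuRAND state as in the previous sketch):

// Standard two-point crossover producing one of the two offspring:
// genes between the cut points come from the second parent.
__device__ void twoPointCrossover(const float *p1, const float *p2,
                                  float *child, int dim, int cut1, int cut2) {
    for (int g = 0; g < dim; g++)
        child[g] = (g >= cut1 && g < cut2) ? p2[g] : p1[g];
}

// Mutation: one randomly chosen gene is replaced by a real value in (-1, 1].
__device__ void mutate(float *ind, int dim, curandState *state) {
    int g = curand(state) % dim;                   // random gene index
    ind[g] = 2.0f * curand_uniform(state) - 1.0f;  // new value in (-1, 1]
}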
The objective of the following experiments is to assess the impact of the different GPU-based
implementations in terms of efficiency. Only execution times (in seconds) and acceleration
factors (compared to a single CPU core) are reported. The average time has been
measured over 50 runs. The first experiment consists in varying the dimension
of the Weierstrass function. The obtained results are reported in Table 4.4.
Table 4.4: Measures in terms of efficiency of the island model for evolutionary algorithms
on GPU. The Weierstrass function is considered. The number of individuals per island is
fixed to 128 and the global population to 8192 (64 islands).

As the dimension increases, each GPU version yields outstanding accelerations
compared to the CPU version (up to ×1757 for the GPUSh version). The use
of the shared memory provides a way to accelerate the search process even though the GPU
version is already impressive. However, due to its limited capacity, bigger instances such
as a dimension of 11 cannot be handled by any shared memory version.
Regarding the fully distributed synchronous versions, since implicit synchronizations are
performed, a certain reduction of the acceleration factors (from ×63 to ×293 for the SGPU
version) can be observed in comparison with their associated asynchronous versions.
Nevertheless, the acceleration factors are still outstanding. For the scheme of the parallel
evaluation of populations (CPU+GPU version), the speed-ups are less remarkable even if
they remain significant (from ×25 to ×170). This is explained by the important number
of data transfers between the CPU and the GPU.
A conclusion of the first experiment is that the full distribution of the search process
on GPU, with its own memory management, delivers high performance results. However,
the importance of these results should be tempered due to the nature of the Weierstrass
function, which is a compute-bound application. When considering a standard application
which is both compute and memory bound, things are different. Table 4.5 reports the
obtained results for the quadratic assignment problem using the same parameters as
before.
Table 4.5: Measures in terms of efficiency of the island model for evolutionary algorithms
on GPU. The quadratic assignment problem is considered. The number of individuals per
island is fixed to 128 and the global population to 8192 (64 islands).

           CPU    CPU+GPU        GPU           GPUSh         SGPU          SGPUSh
instance   time   time   acc.    time  acc.    time  acc.    time  acc.    time  acc.
tai30a     8.4    6.5    ×1.3    1.3   ×6.2    1.2   ×7.1    2.3   ×3.6    2.0   ×4.2
tai35a     11.1   6.9    ×1.6    1.4   ×7.8    1.3   ×8.5    2.5   ×4.4    2.2   ×5.1
tai40a     15     7.1    ×2.1    1.6   ×9.5    1.5   ×10.2   2.8   ×5.4    2.4   ×6.3
tai50a     22     7.9    ×2.8    2.0   ×11.2   1.8   ×12.3   3.5   ×6.3    3.1   ×7.0
tai60a     31     8.9    ×3.5    2.4   ×12.8   2.3   ×13.6   4.2   ×7.4    3.6   ×8.5
tai80a     60     11.5   ×5.2    4.3   ×13.9   –     –       7.1   ×8.5    –     –
tai100a    101    13     ×7.7    6.6   ×15.2   –     –       10.4  ×9.7    –     –

In a general manner, one can see that the obtained acceleration factors are more in line
with those presented in the previous chapters. The acceleration factors grow with the
instance size, ranging from ×1.3 to ×15.2 according to the different parallelization
schemes. Regarding the versions using the shared memory, unlike the Weierstrass
function, the performance improvements are less pronounced, since accesses to the matrix
structures (global and texture memories) are prominent in the entire algorithm.
Furthermore, the same limitation on the problem size occurs from the instance tai80a, from which
the application could not be executed.
After evaluating the performance of the island model for evolutionary algorithms on GPU,
another experiment consists in measuring the scalability of our approaches. To do so,
the number of islands is varied over extreme values in order to determine
the application scalability. Results of this experiment are reported in Table 4.6 for the
Weierstrass function.
Regarding each fully distributed version, for a small number of islands (i.e. one or two
islands), the acceleration factor is significant but not spectacular (from ×7 to ×51). This
is explained by the fact that, since the global population is relatively small (fewer than 1024
threads), the number of threads per block is not enough to fully cover the memory access
latency. This is no longer the case when considering more islands. Indeed, the speed-up
grows with the number of islands and remains impressive (up to ×2074 for GPUSh).
Regarding the CPU+GPU version, the speed-ups for one or two islands are higher
than for the fully distributed versions. Indeed, since only the evaluation of the population is
distributed on GPU, fewer registers are allocated to each thread. As a result, this version
benefits from a better occupancy of the multiprocessors for a small number of islands. The GPU
keeps accelerating the process as the number of islands increases, until reaching a peak performance
of ×165 for 64 islands. Thereafter, the acceleration factor decreases as the number of islands
grows further. Indeed, for each parallel evaluation of the population, the amount
of data transfers is proportional to the number of individuals (e.g. 524288 threads for 4096
islands). Thus, from a certain number of islands, the time dedicated to copy operations
becomes significant, leading clearly to a decrease of the performance.
Table 4.6: Measures in terms of scalability of the island model for evolutionary algorithms
on GPU. The Weierstrass function is considered. The dimension of the problem is fixed
to 2 and the number of individuals per island to 128.
CPU CPU+GPU GPU GPUSh SGPU SGPUSh
islands time time acc. time acc. time acc. time acc. time acc.
1 3 0.10 ×33 0.20 ×17 0.12 ×27 0.45 ×7 0.27 ×12
2 7 0.12 ×55 0.20 ×33 0.13 ×51 0.45 ×15 0.29 ×23
4 13 0.15 ×89 0.20 ×65 0.13 ×104 0.45 ×29 0.29 ×46
8 26 0.19 ×139 0.20 ×132 0.13 ×207 0.45 ×59 0.29 ×92
16 53 0.34 ×154 0.21 ×256 0.13 ×403 0.46 ×114 0.29 ×179
32 106 0.66 ×160 0.26 ×406 0.13 ×828 0.59 ×180 0.29 ×368
64 211 1.28 ×165 0.33 ×644 0.14 ×1560 0.74 ×286 0.30 ×693
128 422 2.68 ×158 0.45 ×939 0.26 ×1596 1.01 ×417 0.60 ×709
256 845 5.61 ×151 0.69 ×1222 0.50 ×1677 1.56 ×543 1.13 ×746
512 1692 11.81 ×143 1.24 ×1365 1.00 ×1691 2.79 ×607 2.25 ×752
1024 3382 25.72 ×132 – – 1.70 ×1990 – – 3.82 ×885
2048 6781 53.23 ×127 – – 3.27 ×2074 – – 7.36 ×922
4096 13585 143.71 ×95 – – – – – – – –
Regarding the scalability of the fully distributed versions, from a certain number of islands,
the GPU fails to execute the program because of the hardware register limitation. For
instance, for 1024 islands (131072 threads), the SGPU implementation could
not be executed. In the GPUSh and SGPUSh versions, since the shared memory is
used, leading to fewer registers, this limit is only reached at a larger number of islands (4096).
The CPU+GPU version provides a higher scalability since fewer registers are allocated
(only the evaluation kernel is executed on GPU).
In the previous experiments, the number of individuals is fixed to 128. Another experiment
consists in varying the number of individuals per island, i.e. the number of threads per
block. This makes it possible to determine the effect of this parameter on the
global performance. Table 4.7 reports the obtained results. The execution time
of each version varies according to the number of threads per block. In general, the
best performances are obtained for 16, 32 or 64 threads per block. Indeed, the measured
results depend on the multiprocessor occupancy of the GPU, which varies according
to the number of threads used in a kernel and the amount of registers and shared memory
used. For instance, the CUDA occupancy calculator [NVI11] is an efficient tool to adjust
the different configurations. Another observation concerns the thread
limitation for the shared memory versions. Indeed, for 768 or 1024 threads per block, the
amount of allocated shared memory exceeds the memory capacity of each multiprocessor
(16KB). Therefore, for these versions, a trade-off must be found between the dimension of
the problem and the number of individuals per island.
Table 4.7: Measures in terms of efficiency by varying the number of individuals per island.
The dimension of the problem is fixed to 2 and the global population to 8192.
individuals 4 8 16 32 64 128 256 512 768 1024
CPU 41.8 42.0 42.1 42.2 42.4 42.9 43.4 44.8 49.8 55.2
CPU+GPU 0.27 0.22 0.24 0.34 0.54 0.96 1.72 2.90 3.23 4.12
GPU 0.18 0.17 0.16 0.16 0.15 0.16 0.19 0.25 0.28 0.33
GPUSh 0.05 0.03 0.02 0.02 0.02 0.04 0.14 0.17 – –
SGPU 0.41 0.38 0.36 0.36 0.34 0.36 0.42 0.55 0.63 0.74
SGPUSh 0.10 0.07 0.05 0.05 0.05 0.09 0.30 0.37 – –
Figure 4.10: Measures in terms of effectiveness of the different island models during the
first minute. The dimension of the problem is fixed to 10, the global population is set to
8192 and the number of individuals per island to 128.
Conclusion
In this chapter, the focus has been on the memory management of metaheuristics
on GPU. The comprehension of the hierarchical organization of the different memories
available on GPU architectures is the key issue to develop efficient parallel metaheuristics.
For this purpose, we have investigated the complete redesign of the algorithmic-level model
for parallel and cooperative algorithms.
Chapter 5

Extension of ParadisEO for GPU-based Metaheuristics
Previous chapters have shown that parallel combinatorial optimization on GPU is not
straightforward and requires a huge effort at the design as well as at the implementation level.
Indeed, the design of GPU-aware metaheuristics often involves the cost of a sometimes
painful apprenticeship of parallelization techniques and GPU computing technologies. In
order to free those who are unfamiliar with such advanced features from this burden,
ParadisEO integrates up-to-date parallelization techniques and allows their transparent
exploitation and deployment on COWs and computational grids.
First, a brief overview of the ParadisEO framework will be given. Then, we will extend
ParadisEO to deal with GPU accelerators. The challenge and contribution consist in
making the GPU as transparent as possible for the user, minimizing his or her involvement
in its management. For this purpose, we offer solutions to this challenge as an extension of
the ParadisEO framework.
Contents
5.1 The ParadisEO Framework . . . . . . . . . . . . . . . . . . . . . 117
5.1.1 Motivations and Goals . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.2 Presentation of the Framework . . . . . . . . . . . . . . . . . . . 117
5.2 GPU-enabled ParadisEO . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.1 Architecture of ParadisEO-GPU . . . . . . . . . . . . . . . . . . 119
5.2.2 ParadisEO-GPU Components . . . . . . . . . . . . . . . . . . . . 120
5.2.3 A Case Study: Parallel Evaluation of a Neighborhood . . . . . . 122
5.2.4 Automatic Construction of the Mapping Function . . . . . . . . 124
5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.1 Experimentation with ParadisEO-GPU . . . . . . . . . . . . . . 126
Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Neighborhood Structures for
GPU-based Local Search Algorithms. Parallel Processing Letters, 20(4):307–324, 2010.
Submitted article
Nouredine Melab, Thé Van Luong, K. Boufaras, and El-Ghazali Talbi. ParadisEO-MO-
GPU: A Framework for GPU-based Local Search Metaheuristics. Journal of Heuristics,
submitted.
• Maximum design and code reuse. The framework must provide a full architecture
design for using metaheuristics. For this purpose, the programmer should have to rewrite
as little code as possible. This objective requires a clear and maximal conceptual separation
between the methods and the problems to be solved. Therefore, the user should
only have to develop the minimal code specific to the problem at hand.
• Flexibility and adaptability. It must be possible for the user to easily add new fea-
tures or to modify existing ones without involving other components. Moreover, as
existing problems evolve and new ones appear, the framework components must be
specialized and adapted to the general demand.
• Utility. The framework must allow the user to cover a wide range of metaheuristics,
problems, parallel distributed models, hybridization mechanisms, etc.
• Portability. In order to satisfy a large number of users, the framework must support
different hardware architectures and operating systems.
The framework is composed of four modules. Figure 5.1 illustrates the different modules
available in ParadisEO. Each module is based on a clear conceptual separation of the
metaheuristics from the problems they are intended to solve. Such a separation provides
maximum code and design reuse to the user.
The first module ParadisEO-EO provides a broad range of classes for the development
of P-metaheuristics, including evolutionary algorithms and particle swarm optimization
techniques. Second, ParadisEO-MO contains a set of tools for S-metaheuristics such as
hill climbing, simulated annealing, or tabu search. Third, ParadisEO-MOEO is specifi-
cally dedicated to the reusable design of metaheuristics for multiobjective optimization.
Finally, ParadisEO-PEO provides a powerful set of classes for the design of parallel and
distributed metaheuristics: parallel evaluation of solutions, parallel evaluation of the ob-
jective function and parallel cooperative algorithms.
ParadisEO is one of the rare frameworks that provide the most common parallel and dis-
tributed models. These models are portable on distributed-memory machines and shared-
memory multiprocessors as they are implemented using standard libraries such as MPI
and PThreads. The models can be exploited in a transparent way. One has just to instan-
tiate their associated ParadisEO components. The ParadisEO-PEO module is an open
source framework originally intended to the design and deployment of parallel hybrid local
search metaheuristics on dedicated clusters and networks of workstations, shared-memory
machines and computational grids.
second one remains at a design stage. To the best of our knowledge, there does not exist any
software framework for GPU-based metaheuristics applied to combinatorial optimization.
Regarding the different parallel implementations of ParadisEO, in [CMT04], the first
version of ParadisEO-PEO was dedicated to the reusable design of parallel and distributed
metaheuristics for dedicated parallel hardware platforms only. Later, the framework
was adapted in [MCT06] to dynamic and heterogeneous large-scale environments using the
Condor-MW middleware, and in [TMDT07] to computational grids using Globus. In
this section, we present a step towards a ParadisEO framework for the reusable design
and implementation of GPU-based parallel metaheuristics.
The user level includes the problem-specific components: input data, the evaluation
function and representations. The software layer supplies the
ParadisEO components, including optimization solvers embedding metaheuristics. The
ParadisEO-GPU module provides a CUDA interface allowing the transparent interaction
with the hardware layer. The hardware layer supplies the different transparent tools
provided by ParadisEO-GPU, such as the allocation and copy of data or the parallel
generation and evaluation of the considered set of solutions. In addition, the platform offers
predefined structures (e.g. neighborhood structures) and mapping wrappers adapted to
hardware characteristics to deal with binary and permutation problems.
The layered architecture of ParadisEO-GPU has been designed in such a way that the user
does not need to write his or her own CUDA code for the specific problem to be solved.
Indeed, ParadisEO-GPU provides facilities for the automatic execution of metaheuristics on
GPU. The only things that must be user-managed are the different components described in
the user level quoted above.
Figure 5.3 illustrates the major components of the platform. The advantage of the decomposition
into components is to separate the components that must be defined by the user
from those which are generic and provided in ParadisEO-MO-GPU.
Initially, to implement a sequential metaheuristic, the user must overload the required classes
of ParadisEO-EO and ParadisEO-MO. The classes coding the problem-specific part are
the following:
• Problem data inputs. The user must specify which inputs of the problem to be solved
will be allocated on GPU. This can be achieved by introducing additional keywords.
To summarize, regarding the new user-defined classes, the user must specify the structures
which are likely to be accessed on the GPU device. This strict restriction enables more
efficiency and flexibility. Indeed, on the one hand, structures unnecessary for the parallel
evaluation of solutions should not be automatically allocated on GPU, in order to reduce
the memory space complexity. On the other hand, some additional structures required for each
solution evaluation might have to be copied at each iteration of the metaheuristic, whilst
others are transferred only once during the program. Such a restriction makes it possible
to avoid undesirable transfers, which would lead to a performance decrease.
Regarding the components supplied in the software framework, the associated classes
establish a hierarchy of classes implementing the invariant part of the code. The different
features of these generic components are the following:
• Memory allocation and deallocation. According to the specification made by the user,
this generic component enables the automatic allocation on GPU of the different
structures used for the problem. Type inference and size detection are managed by
this component. The same holds for the deallocation.
• Parallel evaluation. This component manages the kernel of the solutions evalua-
tion. Concepts involving kernel such as thread blocks and block grids are completely
hidden to the user.
1. The neighborhood component moGPUNeighborhood prepares all the steps for the
parallel generation of the neighborhood on GPU. The initialization consists in setting
a mapping table between GPU threads and neighbors. Thereafter, the associated
data are sent only once to the GPU global memory since the mapping structure does
not change during the execution process of S-metaheuristics. The last step relies on
3. The component moGPUKernelEval models the main kernel body which will be executed
by m concurrent threads on different input data. A first step consists in getting
the thread identifier and then the set of its associated data. This mechanism is done
through the mapping table previously mentioned. The second step calculates the
evaluation of the corresponding neighbor. Finally, the resulting fitness is stored at
the corresponding index of the fitnesses structure.
4. The worker component moGPUEvalFunc is the specific object which computes the
neighbor evaluation on the GPU device and returns the produced result back to the
CPU.
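The body executed by each thread (steps 3 and 4 above) can be sketched as follows; the
real moGPUKernelEval component is a C++ template, so the flat signature and the
evalNeighbor worker below are only illustrative:

__global__ void kernelEval(const int *mapping, const int *solution,
                           float *fitnesses, int m, int k, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= m) return;                  // m = neighborhood size
    // Step 1: retrieve the k move indexes of this neighbor from the
    // mapping table previously set up.
    int idx[3];                           // k <= 3 in the experiments
    for (int e = 0; e < k; e++)
        idx[e] = mapping[id + e * m];
    // Step 2: evaluate the neighbor obtained by applying the move
    // (delegated to the problem-specific worker, cf. moGPUEvalFunc).
    float fit = evalNeighbor(solution, idx, k, n);
    // Step 3: store the fitness at the index matching the thread.
    fitnesses[id] = fit;
}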
Once the entire neighborhood has been evaluated in parallel on GPU, the precalculated
fitnesses structure is copied back to the CPU and given as input to the ParadisEO-MO
module. In this way, the S-metaheuristic continues the neighborhood exploration (iteration)
on the CPU side. Instead of reevaluating each neighbor, the corresponding fitness value
is retrieved from the precomputed fitnesses structure. Hence, this mechanism has
the advantage of allowing both the deployment of any metaheuristic and the use of the
toolboxes provided in ParadisEO (e.g. statistical or fitness landscape analysis, checkpoint
monitors, etc.). This technique is also currently available for P-metaheuristics.
From a design point of view, the mapping functions are still user-managed. Indeed, the
neighborhood structure strongly depends on the representation of the target optimization
problem.
To deal with these issues, mappings for different neighborhoods could be hard-coded in
the software framework. However, such a solution does not ensure any flexibility. Hence,
we propose to add a supplementary layer of transparency for the deployment of
S-metaheuristics on GPU. The main idea is to find a generic mapping which is common
to a set of neighborhoods. To achieve this, we provide an automatic construction of the
mapping function for k-swap and k-Hamming-distance neighborhoods. Figure 5.5 depicts
such a construction of a mapping table. In this example, each neighbor associated with a
particular thread can retrieve its three corresponding indexes from the mapping table.
Considering a given vector of size n and a given neighborhood whose neighbors are composed
of k indexes, with k in {1, 2, 3, ...}, the size of the associated neighborhood is exactly
m = (n × (n − 1) × ... × (n − k + 1)) / k!. The resulting mapping table associates each thread
id with a set of k indexes. Each index can be respectively retrieved from the mapping
table with the access pattern:

{id, id + m, ..., id + (k − 2) × m, id + (k − 1) × m}
The corresponding mapping table will be used at each iteration of the local search. This ta-
ble is dynamically constructed on CPU according to the neighborhood size and transferred
only once to the GPU global memory during the program execution.
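A CPU-side sketch of this construction for a k-Hamming-distance neighborhood is given
below; it enumerates the C(n, k) index combinations and stores the e-th index of neighbor
id at position id + e × m, matching the access pattern above (the function name is
illustrative):

#include <vector>

std::vector<int> buildMapping(int n, int k, int &m) {
    m = 1;                                    // m = C(n, k)
    for (int e = 0; e < k; e++)
        m = m * (n - e) / (e + 1);
    std::vector<int> mapping(m * k);
    std::vector<int> comb(k);
    for (int e = 0; e < k; e++) comb[e] = e;  // first combination {0,...,k-1}
    for (int id = 0; id < m; id++) {
        for (int e = 0; e < k; e++)
            mapping[id + e * m] = comb[e];    // column-wise storage
        int e = k - 1;                        // advance to next combination
        while (e >= 0 && comb[e] == n - k + e) e--;
        if (e < 0) break;                     // last combination reached
        comb[e]++;
        for (int f = e + 1; f < k; f++) comb[f] = comb[f - 1] + 1;
    }
    return mapping;                           // transferred once to the GPU
}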
The objective is to assess the impact, in terms of efficiency, of an implementation done with
ParadisEO-GPU compared with an optimized version done outside the software framework.
For this purpose, a tabu search is considered for the permuted perceptron problem.
To measure the performance difference, three neighborhoods based on increasing Hamming
distances are considered for the experiments. In these neighborhoods, a neighbor is
produced by flipping respectively 1, 2 and 3 bits of the candidate solution.
The considered instances of the permuted perceptron problem are the difficult ones presented
in [Poi95] for cryptanalysis applications. A tabu search has been implemented in
four different versions: (1) a ParadisEO-MO implementation on CPU and its counterpart
on GPU; (2) an optimized CPU implementation and its associated GPU version. The
ParadisEO versions are pure object-based implementations, whilst the optimized ones are
pointer-based implementations made outside the software framework.
Experiments have been carried out on an Intel Core i7 970 3.2 GHz with a GTX
480 card (15 multiprocessors with 32 cores each). This is the machine dedicated to
engineering production. To measure the acceleration factors, only a single CPU core has
been considered, using the Intel i7 turbo mode (3.46 GHz). For each neighborhood,
50 executions of each version are considered. The stopping criterion of the S-metaheuristic
has been set to 10000 iterations.
Table 5.1 reports the results obtained for the tabu search based on a Hamming distance
of one. From the instance m = 171 and n = 187, both GPU versions start to yield
positive accelerations (from ×1.1 to ×1.2). As the instance size increases, the
acceleration factor grows accordingly (from ×1.3 to ×1.7). The acceleration factor for
this implementation is not really significant. This can be explained by the fact that, since
the neighborhood is relatively small (n threads), the number of threads per block is not
enough to fully cover the memory access latency. Furthermore, since the execution time
of the CPU versions is not significant either, one can question the use of GPU computing in
that case.
An experiment on a larger scale concerns a tabu search using a neighborhood based on
a Hamming distance of two. For this neighborhood, the evaluation kernel is executed by
n × (n − 1) / 2 threads. The obtained results are reported in Table 5.2.
For the first instance (m = 73, n = 73), the acceleration factors are already significant (from
×8.2 to ×13.1). As the instance size increases, the acceleration factor grows
accordingly. A peak performance is obtained for the last instance (efficient speed-ups varying
from ×29.5 to ×41.8). A thorough examination of the acceleration factors points out that
the performance obtained with ParadisEO-GPU is not far from that of an optimized
implementation. The performance degradation that occurs is certainly due to the overhead
introduced by ParadisEO-MO.
Indeed, regarding the two CPU versions, there is already a performance gap in the
execution times (between 68% and 86%). This difference can be explained by
the overhead caused by the creation of generic objects in ParadisEO, whereas the
optimized version on CPU is a pure pointer-based implementation. Indeed, the tabu search in
ParadisEO-MO is a specialized instantiation of a template common to any S-metaheuristic,
whilst the optimized version is a specific tabu search implementation. This also explains
the performance difference between the two GPU counterparts, in which the
same phenomenon occurs. However, for such a transparent exploitation and flexibility,
the obtained results are remarkably convincing. A conclusion of this experiment is
that the performance results of the GPU version provided by ParadisEO are not much
degraded compared to the pointer-based GPU one.
As previously said, the definition of the neighborhood is a major step for the performance
of the algorithm. Indeed, increasing the neighborhood size may improve
the quality of the obtained solutions. However, its exploitation for solving real-world
problems is possible only with a great computing power. The following experiment
intends to evaluate a large neighborhood obtained with a Hamming distance of three. For
such a neighborhood, the evaluation kernel is executed by n × (n − 1) × (n − 2) / 6 threads.
Table 5.3 presents the obtained results for this large neighborhood.
In general, for the same problem instance, the obtained acceleration factors are much
higher than for the previous neighborhoods. For example, for the first instance, the
obtained speed-up already varies from ×17.7 to ×19.8. The GPU keeps accelerating the
process as the size grows. A highly significant acceleration, varying from ×34.1 to ×53,
is reached for the biggest instance (m = 201, n = 217). Regarding the difference between
the ParadisEO-MO implementations and the optimized ones, the performance degradation is
larger between the CPU versions (from 53% to 70%). Indeed, increasing the neighborhood
size may induce more object creations. The same holds for the performance degradation
between the GPU counterparts. Nevertheless, according to the reported time measurements,
the performance results of ParadisEO-GPU are still satisfactory for such transparency.
The performance gap between the two CPU versions varies according to the instance. As
previously said, this difference is explained by the overhead caused by the creation of
generic objects in ParadisEO. This also explains the performance difference between the
two GPU counterparts (between 73% and 83%). However, for such a transparent exploitation,
the obtained results are still satisfactory.
Conclusion
In this chapter, we have proposed a pioneering framework called ParadisEO-GPU for
the reusable design and implementation of parallel metaheuristics on GPU architectures.
We have revisited the ParadisEO software framework to allow its utilization on GPU
accelerators, focusing on the parallel iteration-level model.
Conclusion and Future Works
In this thesis, we have dealt with the redesign of parallel models of metaheuristics for
solving large-scale optimization problems on GPU. More exactly, in our contribution, we
have proposed three different general
schemes for building efficient parallel and cooperative metaheuristics on GPU. In the first
scheme, cooperative algorithms are combined with the parallel evaluation of the population
on GPU (iteration-level). From an implementation point of view, this approach is the
most generic since only the evaluation kernel is considered. However, the performance
of this scheme is limited due to the data transfers between the CPU and the GPU. To
deal with this issue, the two other schemes operate on the full distribution of the search
process on GPU, involving the appropriate use of local memories. Applying such a strategy
drastically improves the performance. However, these schemes may be restricted by memory
limitations for problems that are more demanding in terms of resources.
In a general manner, we proved the effectiveness of our methods through extensive
experiments. In particular, we showed that they provide accelerations of up to ×80 on
GPU cards (compared with a single-core architecture) for well-known combinatorial
instances, and up to ×2000 for a continuous problem. In addition, the experiments
indicate that the approaches performed on these problems scale well on recent GPU cards
such as the Tesla Fermi. Moreover, the experiments highlight that GPU computing allows
not only to speed up the search process, but also to exploit parallelism to improve the
quality of the obtained solutions.
In this document, we have also presented a step towards a ParadisEO framework [CMT04]
for the reusable design and implementation of GPU-based parallel metaheuristics. In this
contribution, the focus has been set on the iteration-level parallel evaluation of the
solutions. We have revisited the design and implementation of this model in ParadisEO to
allow its efficient execution and its transparent use on GPU. In order to reduce the cost
of the data transfers, mapping functions which assign a thread identifier to each solution
are defined and implemented in ParadisEO. These mapping functions may be used in a fully
transparent way for dealing with many problem representations. An implementation made in
ParadisEO using CUDA has been experimentally validated and compared to the same
implementation realized outside ParadisEO. The experimental results show that the
performance degradation between the two implementations remains acceptable. Indeed, for
such flexibility and ease of reuse, the results obtained with ParadisEO-GPU are really
promising. Hence, the use of ParadisEO-GPU is a viable solution. The first release of
ParadisEO for GPU architectures is currently available on the ParadisEO website
(http://paradiseo.gforge.inria.fr). Tutorials and documentation are provided to facilitate
its reuse. This release is dedicated to parallel metaheuristics based on the
iteration-level parallel model. In the future, the framework will be extended with further
features.
Appendix
.1 Mapping Proofs
.1.1 Two-to-one Index Transformation
Let us consider a 2D abstraction in which the elements of the neighborhood are arranged
in a zero-based 2D representation. This layout is similar to a lower triangular matrix.
Let n be the size of the solution representation and let m = n × (n − 1) / 2 be the size
of its neighborhood. Let i and j be the indexes of two elements to be exchanged in a
permutation. A candidate neighbor is then identified by both the i and j indexes in the
2D abstraction. Let f (i, j) be the corresponding index in the 1D neighborhood fitnesses
structure. Fig. 6 is an example illustrating this abstraction. In this example, n = 6,
m = 15 and the neighbor identified by the coordinates (i = 2, j = 3) is mapped to the
corresponding 1D array element f (i, j) = 9.
The neighbor represented by the (i, j) coordinates is known, and its corresponding index
f (i, j) in the 1D structure has to be calculated. If the 1D array size were n × n, the
2D abstraction would be similar to a matrix and thus the mapping would be:
f (i, j) = i × (n − 1) + (j − 1)
Since the 1D array size is m = n × (n − 1) / 2, in the 2D abstraction, the elements above
the diagonal preceding the neighbor do not have to be considered (illustrated in Fig. 6),
which leads to:
f (i, j) = i × (n − 1) + (j − 1) − i × (i + 1) / 2    (1)
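For illustration, Equation (1) translates directly into a small device function. This is a minimal sketch, assuming zero-based indexes with 0 ≤ i < j ≤ n − 1; the function name is illustrative.

// Two-to-one mapping of Equation (1): maps the neighbor (i, j) to its
// index in the 1D neighborhood fitnesses structure.
__host__ __device__ int twoToOne(int i, int j, int n) {
    return i * (n - 1) + (j - 1) - i * (i + 1) / 2;
}

On the example of Fig. 6 (n = 6), twoToOne(2, 3, 6) returns 9, as expected.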
.1.2 One-to-two Index Transformation
Let X be the number of elements following f (i, j) in the neighborhood index-based array
numbering:
X = m − f (i, j) − 1    (3)
Since this number can also be represented in the 2D abstraction, the main idea is to
maximize the distance k such that:
k × (k + 1) / 2 ≤ X    (4)
A value of i can then be calculated according to (2). Finally, by using (1), j can be
given by:
j = f (i, j) − i × (n − 1) + i × (i + 1) / 2 + 1    (6)
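A minimal sketch of the whole transformation follows. Two assumptions are made: Equations (2) and (5) fall outside this extract, so the intermediate relation i = n − 2 − k used below is inferred from the surrounding derivation; and the maximal k of (4) is obtained in closed form from the triangular-number inequality rather than by an explicit search.

#include <math.h>

// One-to-two transformation: recovers the neighbor (i, j) from its 1D
// index f, for a solution of size n.
__host__ __device__ void oneToTwo(int f, int n, int* i, int* j) {
    int m = n * (n - 1) / 2;                        // neighborhood size
    int X = m - f - 1;                              // Equation (3)
    // Largest k such that k * (k + 1) / 2 <= X, Equation (4).
    int k = (int)floor((sqrt(8.0 * X + 1.0) - 1.0) / 2.0);
    *i = n - 2 - k;                                 // inferred Equation (2)
    *j = f - *i * (n - 1) + *i * (*i + 1) / 2 + 1;  // Equation (6)
}

For instance, oneToTwo(9, 6, &i, &j) yields i = 2 and j = 3, inverting the example above.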
.1.3 One-to-three Index Transformation
f (x, y, z) is a given index of the 1D neighborhood fitnesses structure, and the objective
is to find the three indexes x, y and z. Let n be the size of the solution representation
and let m = n × (n − 1) × (n − 2) / 6 be the size of the neighborhood. The main idea is to
find to which plane (coordinate z) the given element f (x, y, z) corresponds in the 3D
abstraction. Once this plane is found, the rest is similar to the one-to-two index
transformation. Figure 8 illustrates an example of the 3D abstraction.
In this representation, since each plane is a 2D abstraction, the number of elements in
each plane is the number of combinations C(k, 2) where k ∈ {2, 3, . . . , n − 1} according
to the plane. For a specific neighbor, if a value of k is found, then the value of the
corresponding plane z is:
z = n − k − 1    (7)
For a given index f (x, y, z) belonging to the plane k in the 3D abstraction, the number
of elements contained in the next planes is C(k, 3) (also equal to
k × (k − 1) × (k − 2) / 6).
Let Y be the number of elements following f (x, y, z) in both the 1D neighborhood fitnesses
structure and the 3D abstraction:
Y = m − f (x, y, z)
The value of k to find is then the smallest one such that:
(k + 1) × k × (k − 1) / 6 ≥ Y    (8)
By reordering (8), in order to find a value of k, the next step is to solve the following
equation:
k1^3 − k1 − 6Y = 0    (9)
Cardano's method allows, in theory, to solve cubic equations. Nevertheless, on a finite
discrete machine, this method can lose precision, especially for big integers. As a
consequence, a simple Newton-Raphson method for finding an approximate value of k1 is
enough for our problem. Indeed, this iterative process follows a set guideline to
approximate one root, considering the function, its derivative, an initial arbitrary
k1-value and a certain precision (see Algorithm 14). The value of k is then deduced as:
k = ⌈k1⌉
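As an illustration, the following sketch implements such a Newton-Raphson iteration for Equation (9) and deduces the plane index with Equation (7). It is a minimal sketch of the idea of Algorithm 14, not its exact transcription: the initial guess (the cube root of 6Y), the iteration bound, the precision and the rounding guard are assumptions.

#include <math.h>

// Newton-Raphson iteration for the root of g(k1) = k1^3 - k1 - 6Y
// (Equation (9)), followed by the plane deduction of Equation (7).
__host__ __device__ int planeFromIndex(long long f, int n) {
    long long m = (long long)n * (n - 1) * (n - 2) / 6;
    double Y = (double)(m - f);            // elements following f
    double k1 = cbrt(6.0 * Y);             // assumed initial guess
    for (int it = 0; it < 50; ++it) {      // Newton-Raphson iterations
        double g  = k1 * k1 * k1 - k1 - 6.0 * Y;
        double gp = 3.0 * k1 * k1 - 1.0;
        double next = k1 - g / gp;
        if (fabs(next - k1) < 1e-9) { k1 = next; break; }
        k1 = next;
    }
    // Guard against floating-point overshoot just above an integer root
    // before taking the ceiling k = ceil(k1).
    double r = round(k1);
    int k = (fabs(k1 - r) < 1e-6) ? (int)r : (int)ceil(k1);
    return n - k - 1;                      // z, Equation (7)
}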
Then a value of z can be deduced with (7). At this step, the plane corresponding to the
element f (x, y, z) is known. The next steps for finding x and y are exactly the same as
in the one-to-two index transformation, with a change of variables.
First, the number of elements preceding f (x, y, z) in the neighborhood index-based array
numbering is exactly:
nbElementsBefore = m − (k + 1) × k × (k − 1) / 6
Second, the number of elements contained in the same plane z as f (x, y, z) is:
nbElements = k × (k − 1) / 2
The change of variables is then the following:
n′ = n − (z + 1)
X = lastElement − f (x, y, z)
x = i + (z + 1)
y = j + (z + 1)
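Composing the helpers sketched previously gives the complete one-to-three transformation. This is a sketch under the same assumptions; deriving the in-plane index from nbElementsBefore replaces the lastElement bookkeeping of the text, which is an implementation choice rather than the text's exact procedure.

// One-to-three transformation: finds the plane z of f, then applies the
// one-to-two transformation inside that plane with the change of variables.
__host__ __device__ void oneToThree(long long f, int n,
                                    int* x, int* y, int* z) {
    *z = planeFromIndex(f, n);                         // Equation (7)
    int k = n - 1 - *z;
    long long m = (long long)n * (n - 1) * (n - 2) / 6;
    long long before = m - (long long)(k + 1) * k * (k - 1) / 6;
    int f2d = (int)(f - before);                       // index inside plane z
    int nPrime = n - (*z + 1);                         // n' = n - (z + 1)
    int i, j;
    oneToTwo(f2d, nPrime, &i, &j);                     // 2D transformation
    *x = i + (*z + 1);                                 // change of variables
    *y = j + (*z + 1);
}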
.1.4 Three-to-one Index Transformation
x, y and z are known, and the corresponding index f (x, y, z) has to be found. According
to the 3D abstraction, since the value of z is known, k can be calculated:
k = n − 1 − z
Then the number of elements preceding f (x, y, z) in the neighborhood index-based array
numbering can also be deduced.
If each plane size were (n − 2) × (n − 2), each 2D abstraction would be similar to a
matrix and the ℕ × ℕ × ℕ → ℕ mapping would be:
f1 (x, y, z) = z × (n − 2) × (n − 2) + (x − 1) × (n − 2) + (y − 2)    (10)
Since each 2D abstraction looks like a triangular matrix, some elements must not be
considered. The advantage of the 3D abstraction is that these elements can be found by
geometric construction (see Fig. 9).
Figure 9: ℕ × ℕ × ℕ → ℕ mapping.
First, given a plane z, the number of elements in the previous planes not to be considered
is:
n1 = z × (n − 2) × (n − 2) − nbElementsBefore
Second, the number of elements on the left side not to be considered in the plane z is:
n2 = z × (n − 2)
Third, the number of elements on the upper side not to be considered in the plane z is:
n3 = (x − z) × (n − k − 1)
Fourth, the number of elements in the upper triangle above f (x, y, z) not to be considered
is:
n4 = (x − z) × (x − z − 1) / 2
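The final combination of these four counts with Equation (10) is not shown in this extract; the sketch below assumes, as the geometric construction suggests, that f (x, y, z) = f1 (x, y, z) − n1 − n2 − n3 − n4, which can be checked by exhaustive enumeration on small instances.

// Three-to-one mapping by geometric construction, with zero-based
// coordinates z + 1 <= x < y <= n - 1.
__host__ __device__ long long threeToOne(int x, int y, int z, int n) {
    int k = n - 1 - z;
    long long m = (long long)n * (n - 1) * (n - 2) / 6;
    long long before = m - (long long)(k + 1) * k * (k - 1) / 6;
    long long f1 = (long long)z * (n - 2) * (n - 2)
                 + (long long)(x - 1) * (n - 2) + (y - 2);    // Equation (10)
    long long n1 = (long long)z * (n - 2) * (n - 2) - before; // previous planes
    long long n2 = (long long)z * (n - 2);                    // left side
    long long n3 = (long long)(x - z) * (n - k - 1);          // upper side
    long long n4 = (long long)(x - z) * (x - z - 1) / 2;      // upper triangle
    return f1 - n1 - n2 - n3 - n4;
}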
.2 Statistical Tests
Table 5: Measures in terms of efficiency for the permuted perceptron problem using a
neighborhood based on a Hamming distance of two (binary representation). Test of the
null hypothesis of normality with the Kolmogorov-Smirnov test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU GPU CPU GPU CPU GPU CPU GPU
73-73 + + + + + + + +
81-81 + + + + + + + +
101-117 + + + + + + + +
201-217 + + + + + + + +
401-417 + + + + + + + +
601-617 + + + + + + + +
801-817 + + + + + + + +
1001-1017 + + + + + + + +
1301-1317 + + + + + + + +
Table 6: Measures in terms of efficiency for the permuted perceptron problem using a
neighborhood based on a Hamming distance of two (binary representation). Test of the
null hypothesis of equality of variances with Levene's test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU - GPU CPU - GPU CPU - GPU CPU - GPU
73-73 + + + +
81-81 + + + +
101-117 + + + +
201-217 + + + +
401-417 + + + +
601-617 + + + +
801-817 + + + +
1001-1017 + + + +
1301-1317 + + + +
Table 7: Measures in terms of efficiency for the permuted perceptron problem using a
neighborhood based on a Hamming distance of two (binary representation). Test of the
null hypothesis of equality of means with Student's t-test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU - GPU CPU - GPU CPU - GPU CPU - GPU
73-73 - - - -
81-81 - - - -
101-117 - - - -
201-217 - - - -
401-417 0.083 - - -
601-617 - - - -
801-817 - - - -
1001-1017 - - - -
1301-1317 - - - -
Table 8: Measures in terms of efficiency for the traveling salesman problem using a 2-opt
neighborhood (permutation representation). Test of the null hypothesis of normality with
the Kolmogorov-Smirnov test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU GPU CPU GPU CPU GPU CPU GPU
eil101 + + + + + + + +
d198 + + + + + + + +
pcb442 + + + + + + + +
rat783 + + + + + + + +
d1291 + + + + + + + +
pr2392 + . + + + + + +
fnl4461 + . + . + + + +
rl5915 + . + . + . + +
Table 9: Measures in terms of efficiency for the traveling salesman problem using a 2-
opt neighborhood (permutation representation). Test of the null hypothesis of equality of
variances with Levene's test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU - GPU CPU - GPU CPU - GPU CPU - GPU
eil101 + + + +
d198 + + + +
pcb442 + + + +
rat783 + + + +
d1291 + + + +
pr2392 . + + +
fnl4461 . . + +
rl5915 . . . +
144
Bibliography
Table 10: Measures in terms of efficiency for the traveling salesman problem using a 2-opt
neighborhood (permutation representation). Test of the null hypothesis of equality of
means with Student's t-test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
CPU - GPU CPU - GPU CPU - GPU CPU - GPU
eil101 - 0.064 - -
d198 - - - -
pcb442 - - - -
rat783 - - - -
d1291 - - - -
pr2392 . - - -
fnl4461 . . - -
rl5915 . . . -
Table 11: Measures of the benefits of applying thread control. The traveling salesman
problem using a 2-opt neighborhood is considered. Test of the null hypothesis of normality
with the Kolmogorov-Smirnov test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
GPU GPUTC GPU GPUTC GPU GPUTC GPU GPUTC
eil101 + + + + + + + +
d198 + + + + + + + +
pcb442 + + + + + + + +
rat783 + + + + + + + +
d1291 + + + + + + + +
pr2392 . + + + + + + +
fnl4461 . + . + + + + +
rl5915 . + . + . + + +
Table 12: Measures of the benefits of applying thread control. The traveling salesman
problem using a 2-opt neighborhood is considered. Test of the null hypothesis of equality
of variances with Levene's test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
GPU - GPUTC GPU - GPUTC GPU - GPUTC GPU - GPUTC
eil101 + + + +
d198 + + + +
pcb442 + + + +
rat783 + + + +
d1291 + + + +
pr2392 . + + +
fnl4461 . . + +
rl5915 . . . +
145
Bibliography
Table 13: Measures of the benefits of applying thread control. The traveling salesman
problem using a 2-opt neighborhood is considered. Test of the null hypothesis of equality
of means with Student's t-test.
Core 2 Duo T5800 Core 2 Quad Q6600 Intel Xeon E5450 Xeon E5620
GeForce 8600M GT GeForce 8800 GTX GeForce GTX 280 Tesla M2050
Instance
32 GPU cores 128 GPU cores 240 GPU cores 448 GPU cores
GPU - GPUTC GPU - GPUTC GPU - GPUTC GPU - GPUTC
eil101 0.58 - 0.65 -
d198 0.62 - 0.60 -
pcb442 - - - -
rat783 0.55 - - -
d1291 0.64 - - -
pr2392 . - - -
fnl4461 . . - -
rl5915 . . . -
Table 14: Measures of the benefits of using the reduction operation on the GTX 280.
The permuted perceptron problem is considered for two different neighborhoods using 100
hill climbing algorithms. Test of the null hypothesis of normality with the
Kolmogorov-Smirnov test.
Instance      n neighbors              n × (n − 1) / 2 neighbors
              CPU   GPU   GPU_R        CPU   GPU   GPU_TexR
73-73 + + + + + +
81-81 + + + + + +
101-117 + + + + + +
201-217 + + + + + +
401-417 + + + + + +
601-617 + + + + + +
801-817 + + + + + +
1001-1017 + + + + + +
1301-1317 + + + + + +
Table 15: Measures of the benefits of using the reduction operation on the GTX 280. The
permuted perceptron problem is considered for two different neighborhoods using 100 hill
climbing algorithms. Test of the null hypothesis of equality of variances with Levene's
test.
Instance      n neighbors                            n × (n − 1) / 2 neighbors
              CPU - GPU_R    GPU - GPU_R             CPU - GPU_R    GPU - GPU_TexR
73-73 + + + +
81-81 + + + +
101-117 + + + +
201-217 + + + +
401-417 + + + +
601-617 + + + +
801-817 + + + +
1001-1017 + + + +
1301-1317 + + + +
146
Bibliography
Table 16: Measures of the benefits of using the reduction operation on the GTX 280. The
permuted perceptron problem is considered for two different neighborhoods using 100 hill
climbing algorithms. Test of the null hypothesis of equality of means with Student's
t-test.
Instance      n neighbors                            n × (n − 1) / 2 neighbors
              CPU - GPU_R    GPU - GPU_R             CPU - GPU_R    GPU - GPU_TexR
73-73 0.064 0.059 - -
81-81 0.058 0.057 - -
101-117 0.071 0.060 - -
201-217 0.062 0.063 - -
401-417 - - - -
601-617 - - - -
801-817 - - - -
1001-1017 - - - -
1301-1317 - - - -
Bibliography
[AGM+07] Ravindra K. Ahuja, Jon Goodstein, Amit Mukherjee, James B. Orlin,
and Dushyant Sharma. A very large-scale neighborhood search algo-
rithm for the combined through-fleet-assignment model. INFORMS
Journal on Computing, 19(3):416–428, 2007.
[AK96] David Andre and John R. Koza. Parallel genetic programming: a scal-
able implementation using the transputer network architecture. Ad-
vances in genetic programming: volume 2, pages 317–337, 1996.
[ALNT04] Enrique Alba, Francisco Luna, Antonio J. Nebro, and José M. Troya.
Parallel heterogeneous genetic algorithms for continuous optimization.
Parallel Computing, 30(5-6):699–719, 2004.
[AT02] Enrique Alba and Marco Tomassini. Parallelism and evolutionary algo-
rithms. IEEE Trans. Evolutionary Computation, 6(5):443–462, 2002.
[BCC+06] Raphael Bolze, Franck Cappello, Eddy Caron, Michel J. Daydé,
Frédéric Desprez, Emmanuel Jeannot, Yvon Jégou, Stéphane Lanteri,
Julien Leduc, Nouredine Melab, Guillaume Mornet, Raymond Namyst,
Pascale Primet, Benjamin Quétier, Olivier Richard, El-Ghazali Talbi,
and Iréa Touche. Grid’5000: A large scale and highly reconfigurable
experimental grid testbed. IJHPCA, 20(4):481–494, 2006.
[BcRW98] Rainer E. Burkard, Eranda Çela, Günter Rote, and Gerhard J. Woeginger.
The quadratic assignment problem with a monotone anti-Monge and a symmetric
Toeplitz matrix: easy and hard cases. Math. Program., 82:125–158, 1998.
[BHX00] Maria J. Blesa, Lluis Hernandez, and Fatos Xhafa. Parallel skeletons for tabu
search method. In Proceedings of the International Conference on Parallel and
Distributed Systems, ICPADS '01. IEEE, 2000.
[BOL+09] Hongtao Bai, Dantong OuYang, Ximing Li, Lili He, and Haihong Yu. Max-Min Ant
System on GPU with CUDA. In Proceedings of the 2009 Fourth International Conference on
Innovative Computing, Information and Control, ICICIC '09, pages 801–804, Washington,
DC, USA, 2009. IEEE Computer Society.
[BSB+01] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau Bölöni,
Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson,
Mitchell D. Theys, Bin Yao, Debra A. Hensgen, and Richard F. Fre-
und. A comparison of eleven static heuristics for mapping a class of
independent tasks onto heterogeneous distributed computing systems.
J. Parallel Distrib. Comput., 61(6):810–837, 2001.
[CBM+08] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and
Kevin Skadron. A performance study of general-purpose applications on graphics processors
using CUDA. J. Parallel Distributed Computing, 68(10):1370–1380, 2008.
[CGHM04] Teodor Gabriel Crainic, Michel Gendreau, Pierre Hansen, and Nenad Mladenovic.
Cooperative parallel variable neighborhood search for the p-median. J. Heuristics,
10(3):293–314, 2004.
[CGU+11] José M. Cecilia, José M. García, Manuel Ujaldon, Andy Nisbet, and Martyn Amos.
Parallelization strategies for ant colony optimisation on GPUs. In IPDPS Workshops,
pages 339–346. IEEE, 2011.
[CS00] Rachid Chelouah and Patrick Siarry. Tabu search applied to global
optimization. European Journal of Operational Research, 123(2):256–
270, 2000.
[CSV10] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-driven autotuning of sparse
matrix-vector multiply on GPUs. SIGPLAN Not., 45:115–126, January 2010.
[DBL96] 9th International Conference on VLSI Design (VLSI Design 1996), 3-6
January 1996, Bangalore, India. IEEE Computer Society, 1996.
[DG97] Marco Dorigo and Luca Maria Gambardella. Ant colony system: a
cooperative learning approach to the traveling salesman problem. IEEE
Trans. on Evolutionary Computation, 1(1):53–66, 1997.
[GLGN+08] Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick,
Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov. Parallel computing
experiences with CUDA. IEEE Micro, 28(4):13–27, 2008.
[Glo90] Fred Glover. Tabu search - Part II. INFORMS Journal on Computing, 2(1):4–32, 1990.
[GÉTA99] Luca Maria Gambardella, Éric Taillard, and Giovanni Agazzi. MACS-VRPTW: A
multiple ant colony system for vehicle routing problems with time windows. In New Ideas
in Optimization, pages 63–76. McGraw-Hill, 1999.
[JJMH03] Gabriele Jost, Haoqiang Jin, Dieter An Mey, and Ferhat F. Hatay. Comparing the
OpenMP, MPI, and hybrid programming paradigms on an SMP cluster. NASA Technical Report,
2003.
[KGRS01] Maarten Keijzer, Juan J. Merelo Guervós, Gustavo Romero, and Marc
Schoenauer. Evolving objects: A general purpose evolutionary compu-
tation library. In Pierre Collet, Cyril Fonlupt, Jin-Kao Hao, Evelyne
Lutton, and Marc Schoenauer, editors, Artificial Evolution, volume
2310 of Lecture Notes in Computer Science, pages 231–244. Springer,
2001.
[LL06] Zhongwen Luo and Hongzhi Liu. Cellular genetic algorithms and local search for the
3-SAT problem on graphics hardware. In Evolutionary Computation, 2006. CEC 2006. IEEE
Congress on, pages 2988–2992, 2006.
[LV98] Evelyne Lutton and Jacques Lévy Véhel. Hölder functions and deception of genetic
algorithms. IEEE Trans. on Evolutionary Computation, 2(2):56–71, 1998.
[MBL+09] Ogier Maitre, Laurent A. Baumes, Nicolas Lachiche, Avelino Corma, and Pierre
Collet. Coarse grain parallelization of evolutionary algorithms on GPGPU cards with
EASEA. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary
Computation, GECCO '09, pages 1403–1410, New York, NY, USA, 2009. ACM.
[MCD09] Luca Mussi, Stefano Cagnoni, and Fabio Daolio. GPU-based road sign detection
using particle swarm optimization. In ISDA, pages 152–157. IEEE Computer Society, 2009.
[MCT06] Nouredine Melab, Sébastien Cahon, and El-Ghazali Talbi. Grid com-
puting for parallel bioinspired algorithms. J. Parallel Distributed Com-
puting, 66(8):1052–1061, 2006.
[NBGS08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel
programming with CUDA. ACM Queue, 6(2):40–53, 2008.
[ND10] John Nickolls and William J. Dally. The GPU computing era. IEEE Micro,
30(2):56–69, 2010.
[NM09] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs.
In Proceedings of the Conference on High Performance Computing Networking, Storage and
Analysis, SC '09, pages 30:1–30:10, New York, NY, USA, 2009. ACM.
[NVI10] NVIDIA. GPU Gems 3. Chapter 37: Efficient Random Number Gen-
eration and Application Using CUDA, 2010.
[PDB10] Frédéric Pinel, Bernabé Dorronsoro Dı́az, and Pascal Bouvry. A new
parallel asynchronous cellular genetic algorithm for scheduling in grids.
In IPDPS Workshops, pages 1–8. IEEE, 2010.
[PJS10] Petr Pospichal, Jiří Jaros, and Josef Schwarz. Parallel genetic algorithm
on the CUDA architecture. In Cecilia Di Chio, Stefano Cagnoni, Carlos
Cotta, Marc Ebner, Anikó Ekárt, Anna Esparcia-Alcázar, Chi Keong
Goh, Juan J. Merelo Guervós, Ferrante Neri, Mike Preuss, Julian
Togelius, and Georgios N. Yannakakis, editors, EvoApplications (1),
volume 6024 of Lecture Notes in Computer Science, pages 442–451.
Springer, 2010.
[RRS+08] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, John A. Stratton, Sain-Zee
Ueng, Sara S. Baghsorkhi, and Wen-mei W. Hwu. Program optimization carving for GPU
computing. J. Parallel Distributed Computing, 68(10):1389–1401, 2008.
[SBPE10] Nicolas Soca, Jose Luis Blengio, Martin Pedemonte, and Pablo Ezzatti. PUGACE,
a cellular evolutionary algorithm framework on GPUs. In IEEE Congress on Evolutionary
Computation [DBL10], pages 1–8.
[TF11] Shigeyoshi Tsutsui and Noriyuki Fujimoto. Fast QAP solving by ACO with 2-opt
local search on a GPU. In IEEE Congress on Evolutionary Computation, pages 812–819.
IEEE, 2011.
[TMDT07] A-A. Tantar, N. Melab, C. Demarey, and E-G. Talbi. Building a Vir-
tual Globus Grid in a Reconfigurable Environment - A case study:
Grid5000. In INRIA Research Report. HAL INRIA, 2007.
[TSP+08] Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, and Francisco
Tirado. Parallel implementation of the 2D discrete wavelet transform on graphics
processing units: Filter bank versus lifting. IEEE Transactions on Parallel and
Distributed Systems, 19(3):299–310, 2008.
[VA10a] Pablo Vidal and Enrique Alba. Cellular genetic algorithm on graphic
processing units. In Juan González, David Pelta, Carlos Cruz, Germán
Terrazas, and Natalio Krasnogor, editors, Nature Inspired Cooperative
Strategies for Optimization (NICSO 2010), volume 284 of Studies in
Computational Intelligence, pages 223–232. Springer Berlin / Heidel-
berg, 2010.
[WW06] Man-Leung Wong and Tien-Tsin Wong. Parallel hybrid genetic algo-
rithms on consumer-level graphics hardware. In Evolutionary Compu-
tation, 2006. CEC 2006. IEEE Congress on, pages 2973–2980, 2006.
[WWF05] Man Leung Wong, Tien-Tsin Wong, and Ka-Ling Fok. Parallel evolu-
tionary algorithms on graphics processing unit. In Congress on Evolu-
tionary Computation, pages 2286–2293. IEEE, 2005.
[YCP05] Qizhi Yu, Chongcheng Chen, and Zhigeng Pan. Parallel genetic al-
gorithms on programmable graphics hardware. In Lecture Notes in
Computer Science 3612, page 1051. Springer, 2005.
[ZCM08] W. Zhu, J. Curry, and A. Marquez. SIMD tabu search with graphics hardware
acceleration on the quadratic assignment problem. International Journal of Production
Research, 2008.
[ZH09] Sifa Zhang and Zhenming He. Implementation of parallel genetic algorithm based
on CUDA. In Zhihua Cai, Zhenhua Li, Zhuo Kang, and Yong Liu, editors, Advances in
Computation and Intelligence, volume 5821 of Lecture Notes in Computer Science,
pages 24–30. Springer Berlin / Heidelberg, 2009.
[ZT09] You Zhou and Ying Tan. GPU-based parallel particle swarm optimization. In IEEE
Congress on Evolutionary Computation, pages 1493–1500. IEEE, 2009.
International Publications
[1] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU Computing for Local
Search Metaheuristic Algorithms. IEEE Transactions on Computers, in press, 2011.
[2] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Neighborhood Structures
for GPU-based Local Search Algorithms. Parallel Processing Letters, 20(4):307–324,
2010.
[3] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU-based Approaches
for Multiobjective Local Search Algorithms. A Case Study: the Flowshop Scheduling
Problem. 11th European Conference on Evolutionary Computation in Combinato-
rial Optimization, EVOCOP 2011, pages 155–166, volume 6622 of Lecture Notes in
Computer Science , Springer, 2011.
[4] Nouredine Melab, Thé Van Luong, K. Boufaras, and El-Ghazali Talbi. Towards
ParadisEO-MO-GPU: A Framework for GPU-based Local Search Metaheuristics.
11th International Work-Conference on Artificial Neural Networks, IWANN 2011,
pages 401–408, volume 6691 of Lecture Notes in Computer Science, Springer, 2011.
[5] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU-based Multi-start
Local Search Algorithms. 5th International Conference on Learning and Intelligent
Optimization, LION 5, in press, Lecture Notes in Computer Science, Springer, 2011.
[6] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. GPU-based Island Model
for Evolutionary Algorithms. Genetic and Evolutionary Computation Conference,
GECCO 2010, pages 1089–1096, Proceedings, ACM, 2010.
[7] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Parallel Hybrid Evolu-
tionary Algorithms on GPU. In IEEE Congress on Evolutionary Computation, CEC
2010, pages 1–8, Proceedings, IEEE, 2010.
[8] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Large Neighborhood Local
Search Optimization on Graphics Processing Units. 24th IEEE International Sym-
posium on Parallel and Distributed Processing, IPDPS 2010, pages 1–8, Workshop
Proceedings, IEEE, 2010.
[9] Thé Van Luong, Nouredine Melab, and El-Ghazali Talbi. Local Search Algorithms on
Graphics Processing Units. A case study: the Permutation Perceptron Problem. Evo-
lutionary Computation in Combinatorial Optimization, 10th European Conference,
EvoCOP 2010, pages 264–275, volume 6022 of Lecture Notes in Computer Science,
Springer, 2010.
[10] Thé Van Luong, Lakhdar Loukil, Nouredine Melab, and El-Ghazali Talbi. A GPU-
based Iterated Tabu Search for Solving the Quadratic 3-dimensional Assignment
Problem. The 8th ACS/IEEE International Conference on Computer Systems and
Applications, AICCSA 2010, pages 1–8, Proceedings, IEEE, 2010.
[11] Naouel Ayari, Thé Van Luong, and Abderrazak Jemai. A hybrid Genetic Algorithm
for Golomb Ruler Problem. The 8th ACS/IEEE International Conference on Com-
puter Systems and Applications, AICCSA 2010, pages 1–4, Proceedings, IEEE, 2010.
Résumé:
Real-world optimization problems are often complex and NP-hard. Although approximate
algorithms such as metaheuristics make it possible to reduce the complexity of their
resolution, these methods remain insufficient for dealing with large-scale problems.
Nowadays, GPU computing has proven effective for handling compute-intensive problems.
One of the major challenges for metaheuristics is to rethink the existing models to allow
their deployment on GPU accelerators. The contribution of this thesis is the redesign of
these parallel models to enable the resolution of large-scale optimization problems on
GPU architectures. To this end, efficient approaches have been proposed for the
optimization of data transfers between the CPU and the GPU, for thread control, and for
memory management. These approaches have been experimented exhaustively using five
optimization problems and four GPU configurations. Compared with a CPU execution, the
obtained accelerations reach up to 80-fold for combinatorial optimization problems and
up to 2000-fold for a continuous optimization problem.
Abstract:
Real-world optimization problems are often complex and NP-hard. Although approximate
algorithms such as metaheuristics make it possible to reduce the temporal complexity of
their resolution, they fail to tackle large problems satisfactorily. Nowadays, GPU
computing has proven effective to deal with time-intensive problems. One of the major
issues for metaheuristics is to rethink the existing parallel models and programming
paradigms to allow their deployment on GPU accelerators. The contribution of this thesis
is to deal with such issues for the redesign of parallel models of metaheuristics, to
allow solving large-scale optimization problems on GPU architectures. Our objective is
to rethink the existing parallel models and to enable their deployment on GPUs. For this
purpose, very efficient approaches are proposed for CPU-GPU data transfer optimization,
thread control and memory management. These approaches have been exhaustively
experimented using five optimization problems and four GPU configurations. Compared to
a CPU-based execution, experiments report up to 80-fold acceleration for large
combinatorial problems and up to 2000-fold speed-up for a continuous problem.