AMA: Asynchronous Management of Accelerators for Task-based Programming Models
Planas, Badia, Ayguade and Labarta
Abstract
Computational science has benefited in recent years from emerging accelerators that increase
the performance of scientific simulations, but using these devices complicates the programming
task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-
accelerator systems. AMA maximizes the overlap of computation and communication without
blocking the host threads, so the spare CPU time can be used to do other work while waiting for
device operations. Implemented on top of a task-based framework, the experimental evaluation of
AMA on a quad-GPU node shows that we reach the performance of a hand-tuned native CUDA
code, with the advantage of fully hiding the device management. In addition, we obtain
more than 2x speed-up with respect to the original framework implementation.
Keywords: accelerator management, asynchronous devices, programming models, multi-GPU systems
1 Introduction
Computational science is an interdisciplinary field where computational and numerical tech-
niques are applied to study systems of real-world scientific interest. Such studies are usually
done through computer simulation and modelling and allow us to simulate systems that were
previously too difficult to study due to their complexity. Moreover, scientists are now able to repro-
duce or simulate studies that may be too dangerous, take too much time or simply be impossible
to reproduce in a laboratory. In recent years, scientific computing has clearly benefited from
the advances in computer science, as computers have massively increased their performance,
popularity and usability. In this sense, heterogeneous high-performance computers have become
a key evolution of regular homogeneous CPU computers due to their computing power. The
TOP500 list (Nov 2014) [1] reflects this fact, as half of the top 10 machines have either NVIDIA
K20x GPUs or Intel Xeon Phi processors, the top 2 being heterogeneous computers.
However, heterogeneous computers make the programming task more difficult, especially for
programmers or scientists who code their applications targeting heterogeneous systems (het-
erogeneous applications). Even in single-node systems, accelerators, which offer massively parallel
hardware, may have their own separate memory space with limited capacity; therefore, pro-
grammers and scientists (who may not be expert programmers) need to care about when and
which pieces of data are transferred between memory spaces. Data movements and synchro-
nizations become particularly complicated if we want to split the computation between multiple
devices and still obtain optimal performance. Several proposals have arisen in recent
years to program accelerators, the most important being CUDA [10], which targets NVIDIA
GPUs, and OpenCL [7], which works with Intel MIC devices and GPUs as well. However,
none of them addresses the aforementioned challenges, since they both expose the underlying
hardware to the programmer and only offer a resource management API. Ideally, programming
models should hide heterogeneity and hierarchy from the programmer's point of view,
so that programmers can focus on their application development and forget about the management
of the available resources.
In this paper we present AMA (Asynchronous Management of Accelerators), a combina-
tion of several optimization ideas that help to efficiently manage and schedule computations
on accelerators. Since these devices are asynchronous, we can issue operations such as com-
putations from the host and then do other work on the CPU while waiting for such device
operations to complete. Our contributions are a specialized accelerator-oriented work scheduler combined
with an asynchronous, non-blocking management design for external devices. Our main targets
are task-based programming frameworks, where our techniques can improve the management of
multi-accelerator systems with minimal overhead. The objectives of this work are, first, to
increase the performance of heterogeneous applications with no effort from the programmer side
and, second, to improve the framework by making the accelerator management more efficient.
The paper is organized as follows: Section 2 explains AMA. Section 3 discusses the chosen
framework to implement AMA. The implementation is described in Section 4 and its evaluation
in Section 5. Related work can be found in Section 6. Section 7 concludes the paper.
the device, it will be run, independently of other data transfers or task executions related to
other tasks previously assigned. Consequently, the scheduling is now divided into two phases:
first, the framework decides which processing unit will run the task and, second, FTFR decides
when the task will be run. As a result of applying the AMA design, host-side threads are
never blocked, so they can do other useful work. This gives us an opportunity to make other
components smarter and more powerful, even if they increase the framework's overhead.
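To make the two-phase split concrete, the following is a minimal sketch with hypothetical data structures (Task, DeviceQueue and their fields are ours, not the framework's): phase one maps a task to a device; phase two, corresponding to FTFR, launches any assigned task as soon as its input data is resident on the device, independently of the assignment order.

#include <vector>

struct Task {                      // hypothetical task descriptor
    bool inputs_on_device;         // true once all input transfers have completed
    bool launched;
    void (*launch_kernel)();       // issues the asynchronous kernel launch
};

struct DeviceQueue {
    std::vector<Task*> assigned;   // phase 1: tasks already mapped to this device

    // Phase 2: polled repeatedly by the never-blocking helper thread. Any task
    // whose input data is already on the device is launched, regardless of the
    // order in which tasks were assigned.
    void try_launch_ready_tasks() {
        for (Task* t : assigned) {
            if (!t->launched && t->inputs_on_device) {
                t->launch_kernel();   // asynchronous; the thread does not wait
                t->launched = true;
            }
        }
    }
};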
2.2 Additional Runtime Enhancements
The specific modifications will mostly depend on the features of the target framework, but
we will focus on those aspects that especially benefit the execution on accelerators. Generally,
accelerators consume tasks faster than the host, so it is important to give priority to those tasks
that open more parallelism (i.e., tasks with several successors) and also to those in the critical
path of the data dependency graph. Thus, we propose to use the CPU spare time to traverse the
graph and compute the appropriate priority for each task. In addition, we propose to propagate
the priority of a task to its predecessors if they have lower priority, and do so for several levels
of predecessors to ensure that the dependencies of higher priority tasks are satisfied as soon
as possible. Assuming the target framework offers a data affinity-aware scheduling policy, we
propose to compute the affinity information at the latest possible moment, when the task has its
dependencies satisfied and is going to be assigned to one of the devices. Making a good affinity
decision is very important to optimize the amount of data that is transferred. It may seem
obvious to compute task data affinity as late as possible but, since the affinity computation is
quite costly, some frameworks compute this information much earlier to avoid additional
overhead while running tasks. With our approach, such overhead is completely hidden. All
these modifications increase the framework overhead, but AMA is able to hide it with the
asynchronous execution of tasks and data transfers.
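As an illustration, the sketch below shows how spare CPU time could be used to derive task priorities from the dependency graph. The structures and the concrete criterion (critical-path depth plus a bonus for fan-out) are our own illustrative assumptions, not the runtime's actual formula.

#include <algorithm>
#include <vector>

struct GraphTask {                          // hypothetical graph node
    std::vector<GraphTask*> successors;
    int priority = 0;
    bool computed = false;
};

// Bottom-up pass over the DAG: a task's priority grows with the length of its
// successor chain (critical path) and with the number of tasks it unblocks.
int compute_priority(GraphTask* t) {
    if (t->computed) return t->priority;
    int longest = 0;
    for (GraphTask* s : t->successors)
        longest = std::max(longest, compute_priority(s));
    t->priority = longest + 1 + static_cast<int>(t->successors.size());
    t->computed = true;
    return t->priority;
}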
and transferred), run (the task is executed) and completion (the task has been run; if needed,
output data is transferred and task memory space is freed).
OmpSs offers several task scheduling policies, but we will focus on the data affinity-aware
policy. This policy calculates the affinity between tasks and the processing units of the system
based on data size and locality. Each processing unit is assigned an affinity score and the task
is scheduled on the unit with the highest score. For performance reasons, this score is
computed at the task instantiation stage. OmpSs also supports task priorities to establish a task
execution order while preserving data dependencies: tasks with higher priority are executed
earlier. A task's priority is only propagated to its direct predecessors, due to the overhead of deeper propagation.
The flexible design and implementation of the OmpSs runtime makes it easy to extend any of its
features, such as adding a scheduling policy or support for a new architecture. Run-time configu-
ration arguments are used to select which plugins are loaded in each application execution.
3.1 OmpSs CUDA Support
At run time, one CPU helper thread is created for each GPU device. Each helper manages
all the operations on its own GPU. The original OmpSs implementation for GPUs [4] already
supports asynchrony, but it is very restrictive and inflexible: each task execution cycle1 can
hold, at most, two data transfers (one in each direction) overlapped with one task execution.
This means that for each task execution there can be two more operations: one device-to-host
transfer that copies the output data of the previously executed task (if any) and one host-to-
device transfer that prefetches the data for the next task (if any). Three CUDA streams are
used to achieve this overlapping. This approach presents two main problems: first, the helper
thread that manages the GPU gets blocked at the end of each task execution cycle, waiting for
all the operations of the cycle. This busy-waiting is done by calling cudaStreamSynchronize()
for each stream. Second, operations can be overlapped inside one cycle, but there is an explicit
synchronization between different cycles. Thus, for each cycle, the thread has to wait for the
longest operation before it can start a new cycle, so a significant amount of CPU time is
wasted just waiting for device operations. This gives us an opportunity to implement AMA on
top of OmpSs and evaluate our proposal.
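The blocking pattern just described can be sketched as follows. The cycle helper and its identifiers are ours, not OmpSs internals; the point is the per-stream cudaStreamSynchronize() calls that stall the helper thread at the end of every cycle.

#include <cstddef>
#include <cuda_runtime.h>

// One task execution cycle in the original scheme: three streams overlap
// operations *within* the cycle, but the helper thread blocks on all of them
// before the next cycle can start.
void run_task_cycle(cudaStream_t htod, cudaStream_t kernels, cudaStream_t dtoh,
                    void* d_prev_out, void* h_prev_out, size_t prev_bytes,
                    void* d_next_in,  void* h_next_in,  size_t next_bytes,
                    void (*launch_current_task)(cudaStream_t)) {
    // Copy back the output of the previously executed task (if any).
    if (prev_bytes)
        cudaMemcpyAsync(h_prev_out, d_prev_out, prev_bytes,
                        cudaMemcpyDeviceToHost, dtoh);
    // Prefetch the input data of the next task (if any).
    if (next_bytes)
        cudaMemcpyAsync(d_next_in, h_next_in, next_bytes,
                        cudaMemcpyHostToDevice, htod);
    // Run the current task's kernel.
    launch_current_task(kernels);

    // Explicit end-of-cycle synchronization: the helper thread waits for the
    // longest of the three operations before it can begin a new cycle.
    cudaStreamSynchronize(htod);
    cudaStreamSynchronize(kernels);
    cudaStreamSynchronize(dtoh);
}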
4 AMA Implementation
This section explains the implementation of AMA on top of the OmpSs CUDA device sup-
port. As mentioned before, this implementation is extensible to other task-based frameworks
supporting asynchronous devices (OpenCL, FPGA, ...).
4.1 OmpSs CUDA Support with AMA
The host-device synchronization in AMA is implemented by recording a CUDA event
after each asynchronous device operation (data transfer or kernel launch). The helper
thread links each event to the appropriate set of callbacks and holds a list of pending events
(registered events that have not been raised) on its device. The thread periodically checks this list
(querying CUDA for the state of each event), and when a raised event is detected, it executes the
associated callback actions. We use a combination of an event-polling mechanism and CUDA
callbacks for performance reasons. Several CUDA streams are used to overlap as many
operations as possible. We keep two CUDA streams for data transfers (one for each direction)
due to hardware limitations (although we could add more streams if the underlying hardware
1 A task execution cycle is the set of all the operations needed to run a task: transfer input data to device,
run the task on the device (usually, one kernel launch) and transfer output data back to host. Transfers can
happen in a different cycle, depending on the cache policy used.
supported more simultaneous transfers). We also take advantage of CUDA's concurrent kernel
execution feature by having multiple streams for kernel launches. The number of streams
devoted to kernels is set according to the OmpSs run-time configuration options.
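The following is a minimal sketch, with our own data structures, of the event-recording and polling side of this mechanism; the runtime's actual callback machinery is more elaborate, but the cudaEventRecord()/cudaEventQuery() pattern is what keeps the helper thread from ever blocking.

#include <cuda_runtime.h>
#include <functional>
#include <list>
#include <utility>

struct PendingEvent {
    cudaEvent_t event;
    std::function<void()> on_complete;  // e.g. release dependences, free buffers
};

std::list<PendingEvent> pending;        // per-device, owned by its helper thread

// Called right after issuing an asynchronous transfer or kernel launch.
void register_operation(cudaStream_t stream, std::function<void()> callback) {
    PendingEvent p;
    cudaEventCreateWithFlags(&p.event, cudaEventDisableTiming);
    cudaEventRecord(p.event, stream);   // event is raised when the operation finishes
    p.on_complete = std::move(callback);
    pending.push_back(std::move(p));
}

// Polled repeatedly by the helper thread in between other useful work.
void poll_pending_events() {
    for (auto it = pending.begin(); it != pending.end(); ) {
        if (cudaEventQuery(it->event) == cudaSuccess) {  // operation finished
            it->on_complete();
            cudaEventDestroy(it->event);
            it = pending.erase(it);
        } else {
            ++it;   // not finished yet; keep polling later, never block
        }
    }
}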
Figure 1 compares the execution of several tasks with the original OmpSs implementation
and with OmpSs+AMA. The top part shows a time line diagram of the execution of four tasks
t1, t2, t3 and t4 on the same device with the original OmpSs implementation. The orange
boxes labeled as HtD represent the amount of time spent in transferring data from host to
device. The blue boxes labeled as Run represent the amount of time that task execution takes
on the device. The green boxes labeled as DtH represent the amount of time spent transferring
data from device to host. The red vertical lines show the synchronization points where the
helper thread needs to block. In this example, we can see that always waiting for the longest
operation in each task cycle delays other operations that could be done earlier. For example,
the DtH t2 transfer could be issued right after t2's execution, but since the HtD t3 transfer is
longer, the DtH t2 transfer is delayed. Task t4 does not need input data transfers, but its execution
cannot be advanced due to the fixed in-order task execution of the original implementation.
Moreover, although not shown in Figure 1, the helper thread spends most of its time waiting
for the operations, as it calls cudaStreamSynchronize() to synchronize with the device.
The bottom part of Figure 1 shows the behavior of the OmpSs+AMA runtime proposal. We
can see that the global execution time is lower because the synchronization points have been
removed. We can still observe some gaps, but they are due to hardware limitations2. That is
why tasks t1 and t2 can only partially overlap their execution. Since task t4 does not need
input data transfers, the FTFR scheduler can advance its execution right at the beginning:
there is no need to wait for the data transfers and execution of t1, t2 and t3. Then, t4's output data
transfers are overlapped with t1's execution and t2's input data transfers. The overall result is
that the asynchronous runtime has significantly reduced the total execution time of these tasks.
Moreover, t4's dependences are released at the end of its execution, so its dependent tasks
become ready to run much earlier.
4.2 Priority Propagation
Since the original implementation of OmpSs already supports tasks with priorities, we have
modified this component in order to propagate task priority to several levels of predecessor tasks.
With this modification, we can enhance the scheduling decisions for ready tasks, as predecessors
of higher priority tasks will be scheduled earlier and thus dependences of higher priority tasks
will be satisfied earlier as well. We have enabled the double-linked data dependency graph for
2 A kernel can start only when all thread blocks of all prior kernels from any stream have started [10].
our approach, as it is disabled by default to avoid additional overhead. Then, we can navigate
through a task's predecessors and update their priority. The optimal number of predecessor
levels to navigate and update is application-dependent. As we said before, the overhead
that we add by enabling the doubly-linked graph is hidden by the execution of other tasks
and their data transfers, so, effectively, it is cost-free in our implementation.
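As an illustration, a minimal sketch of this propagation, assuming hypothetical task structures that expose the predecessor links provided by the doubly-linked graph:

#include <vector>

struct DepTask {                         // hypothetical task node
    std::vector<DepTask*> predecessors;  // available thanks to the doubly-linked graph
    int priority = 0;
};

// Raise the priority of predecessors that have a lower priority than the given
// task, and keep walking up the graph for a configurable number of levels.
void propagate_priority(DepTask* task, int levels) {
    if (levels == 0) return;
    for (DepTask* pred : task->predecessors) {
        if (pred->priority < task->priority) {
            pred->priority = task->priority;       // inherit the higher priority
            propagate_priority(pred, levels - 1);  // continue up the graph
        }
    }
}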
4.3 Affinity Scheduler
With the original affinity scheduling policy, the computed affinity is frequently outdated
by the time the task becomes ready. Thus, we have modified this scheduler to
better fit the execution of tasks on asynchronous devices. As we explained before, the original
affinity computation is done too early, at the task's instantiation stage. With AMA, the affinity
computation is delayed until the task's ready stage. We have also changed the affinity computation
criteria to refine the processing units' scores. These changes give us an accurate and up-to-date
affinity score for each processing unit that allows the runtime to make better task scheduling
decisions. As with the priority propagation mechanism, the overhead of these
modifications is hidden by the execution of other tasks and their data transfers.
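A simplified sketch of such a late, locality-based affinity computation is shown below; the concrete scoring criterion (bytes of the task's data already resident on each device) is our own simplification and not necessarily the exact OmpSs policy.

#include <cstddef>
#include <vector>

struct DataAccess { size_t bytes; int resident_device; };  // -1 if only on the host

// Called when the task becomes ready: score each device by how much of the
// task's data it already holds, and pick the device with the highest score.
int pick_device(const std::vector<DataAccess>& accesses, int num_devices) {
    std::vector<size_t> score(num_devices, 0);
    for (const DataAccess& a : accesses)
        if (a.resident_device >= 0)
            score[a.resident_device] += a.bytes;   // data already on that device

    int best = 0;
    for (int d = 1; d < num_devices; ++d)
        if (score[d] > score[best]) best = d;
    return best;                                   // highest-affinity device
}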
Table 1 summarizes the differences between the original implementation of OmpSs (Original
OmpSs) and our OmpSs+AMA proposal (OmpSs+AMA).
5 Evaluation
We present in this section the performance results of three applications in order to evaluate
our AMA proposal implemented on top of the OmpSs framework. We compare these results
with native CUDA versions and the original OmpSs framework on a multi-GPU Linux system
with two Intel Xeon E5-2650 processors at 2.00 GHz, 62.9 GB of main memory and four NVIDIA Tesla
K20c GPUs with 2496 CUDA cores and 4.7 GB of memory each. The native CUDA codes were compiled
with CUDA 5.5 and the OmpSs versions were compiled with the OmpSs compiler (using nvcc 5.5
and GCC 4.6.4). Optimization level -O3 was used in all codes. The same application source
code was used with both the OmpSs and OmpSs+AMA runtimes. We run the applications with
different numbers of GPU devices and data set sizes and analyze their impact on
performance. Results are reported as the mean of several executions.
N-body Simulation. In the context of computational science, the N-body simulation is
a molecular dynamics computation where a system of bodies (atoms, molecules) is allowed
to interact for a period of time. The result of the simulation gives a view of the motion
of the bodies, whose trajectories are determined by the forces between bodies and their potential
energy. The CUDA native implementation comes from the CUDA 5.5 SDK examples [10, 8]. We
transformed this code into an OmpSs application by adding task directives around the GPU
kernel calls and removing all data transfers and GPU management (a sketch of this taskification
is shown after the configuration table below). The performance of this
Configuration        App version    Runtime      Data size [Bodies]
OmpSs 256Kbod        OmpSs CUDA     OmpSs        262144
OmpSs+AMA 256Kbod    OmpSs CUDA     OmpSs+AMA    262144
OmpSs 512Kbod        OmpSs CUDA     OmpSs        524288
OmpSs+AMA 512Kbod    OmpSs CUDA     OmpSs+AMA    524288
CUDA 256Kbod         Native CUDA    CUDA         262144
CUDA 512Kbod         Native CUDA    CUDA         524288

[Figure: N-body performance, Performance [GFlop/s] vs. # GPUs (1 to 4), for the configurations listed above.]
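To illustrate the taskification mentioned above for the N-body port, the following sketch shows the kind of OmpSs annotations involved; the kernel name, clauses, block size and host loop are hypothetical and not taken from the paper.

#include <cuda_runtime.h>

/* OmpSs-style CUDA kernel task: the target and task directives declare the
 * device and the data directions, so the runtime handles transfers, scheduling
 * and multi-GPU management on its own. */
#pragma omp target device(cuda) copy_deps ndrange(1, n, 256)
#pragma omp task in(pos_in[0;n]) inout(vel[0;n]) out(pos_out[0;n])
__global__ void integrate_bodies(float4 *pos_out, const float4 *pos_in,
                                 float4 *vel, float dt, int n);

void simulate(float4 *pos_a, float4 *pos_b, float4 *vel,
              float dt, int n, int steps) {
    for (int s = 0; s < steps; ++s) {
        /* Each call instantiates an asynchronous task; no cudaMemcpy, stream
         * or device management appears in the application code. */
        integrate_bodies(pos_b, pos_a, vel, dt, n);
        float4 *tmp = pos_a; pos_a = pos_b; pos_b = tmp;   /* double buffering */
    }
    #pragma omp taskwait   /* wait for all tasks and for the output data on the host */
}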
Configuration        App version    Runtime      Data set size [DP FP elmts]
OmpSs 16K            OmpSs CUDA     OmpSs        16384 × 16384
OmpSs+AMA 16K        OmpSs CUDA     OmpSs+AMA    16384 × 16384
OmpSs 32K            OmpSs CUDA     OmpSs        32768 × 32768
OmpSs+AMA 32K        OmpSs CUDA     OmpSs+AMA    32768 × 32768
CUDA 16K             Native CUDA    CUDA         16384 × 16384

[Figure: Performance [GFlop/s] vs. # GPUs (1 to 4) for the configurations listed above.]
Configuration                  App version    Runtime      Data set size [DP FP elmts]
OmpSs 16K                      OmpSs CUDA     OmpSs        16384 × 16384
OmpSs+AMA 16K                  OmpSs CUDA     OmpSs+AMA    16384 × 16384
OmpSs 32K                      OmpSs CUDA     OmpSs        32768 × 32768
OmpSs+AMA 32K                  OmpSs CUDA     OmpSs+AMA    32768 × 32768
CUDA 16K                       Native CUDA    CUDA         16384 × 16384
CUDA dgemm 16K (ceiling ref.)  Native CUDA    CUDA         16384 × 16384

[Figure: Performance [GFlop/s] vs. # GPUs (1 to 4) for the configurations listed above, including the CUDA dgemm 16K ceiling reference.]
6 Related Work
Computational science can benefit from the performance of new heterogeneous multi-core
computer architectures. However, their programming complexity must be addressed.
data locality to decide which computing unit runs each task. On the other hand, we propose
a task priority mechanism that promotes the execution of critical tasks and their predecessors
(for example, high-priority tasks, or tasks in the critical path of the data dependency graph).
We have implemented our proposal on top of OmpSs, a task-based programming framework
with multi-GPU support. Our experiments show that our proposal outperforms the original
implementation in all cases, reaching more than 2x speed-up. Furthermore,
we can reach the same performance as a native hand-tuned CUDA application, or even better
results, with the advantages of task-based programming models that relieve the
programmer of many issues related to heterogeneous and asynchronous devices (such as device
management, memory coherency or data transfers). As future work, we plan to extend our
implementation to support other device types, such as OpenCL-capable accelerators or FPGA devices,
and then tune our scheduling parameters to fit all of them. In addition, we will consider
implementing AMA for cluster architectures as well.
Acknowledgments
European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), Intel-BSC Exas-
cale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU),
Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-
1051). We thank KAUST IT Research Computing for granting access to their machines.
References
[1] TOP500 Supercomputing Site. June 2014. http://www.top500.org/lists/2014/06.
[2] C. Augonnet et al. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore
Architectures. Concurr. Comput. : Pract. Exper., 23(2):187–198, 2011.
[3] E. Ayguade et al. Extending OpenMP to Survive the Heterogeneous Multi-core Era. International
Journal of Parallel Programming, 38(5-6):440–459, June 2010.
[4] J. Bueno-Hedo et al. Productive Programming of GPU Clusters with OmpSs. In Proceedings of
the 26th IEEE Int. Parallel and Distributed Processing Symposium, IPDPS 2012, May 2012.
[5] A. Filgueras et al. OmpSs@Zynq All-programmable SoC Ecosystem. In Proc. of ACM/SIGDA
Int. Symp. on Field-programmable Gate Arrays, FPGA ’14, pages 137–146, NY, USA, 2014.
[6] Tianyi David Han and Tarek S. Abdelrahman. hiCUDA: High-Level GPGPU Programming. IEEE
Transactions on Parallel and Distributed Systems, 22:78–90, 2011.
[7] Khronos OpenCL Working Group. The OpenCL Specification, version 2.0, March 2014.
[8] Lars Nyland, Mark Harris, and Jan Prins. Chapter 31: Fast N-Body Simulation with CUDA. In
Hubert Nguyen, editor, GPU Gems 3. Addison-Wesley Professional, first edition, 2007.
[9] R. Nath, S. Tomov, and J. Dongarra. An Improved MAGMA GEMM for Fermi GPUs. Technical
Report UT-CS-10-655, University of Tennessee Computer Science, July 2010.
[10] NVIDIA. CUDA C Programming Guide Version 5.5. NVIDIA Corporation, July 2013.
[11] OpenMP ARB. OpenMP Application Program Interface, v. 4.0, July 2013.
[12] J.M. Perez, R.M. Badia, and J. Labarta. A Dependency-aware Task-based Programming Envi-
ronment for Multi-core Architectures. IEEE Int. Conf. on Cluster Comp., pages 142–151, 2008.
[13] J.A. Pienaar et al. MDR: performance model driven runtime for heterogeneous parallel platforms.
In Proc. of the Int. Conf. on Supercomputing, ICS ’11, pages 225–234, NY, USA, 2011. ACM.
[14] J. Planas et al. Self-Adaptive OmpSs Tasks in Heterogeneous Environments. In IEEE 27th Int.
Parallel & Distributed Processing Symp. (IPDPS), pages 138–149, 2013.
[15] S. Ueng, M. Lathara, S.S. Baghsorkhi, and W.W. Hwu. CUDA-lite: Reducing GPU Programming
Complexity. In 21st Languages and Compilers for Parallel Computing (LCPC), 2008.