
J. Parallel Distrib. Comput. 114 (2018) 28–45

GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters

E. Agostini*, D. Rossetti, S. Potluri
NVIDIA, Santa Clara, CA, United States

Highlights

• GPUDirect Async: a new technology which enables the GPU to directly trigger and synchronize network transfers.
• LibMP: a simple message passing library to demonstrate the use of GPUDirect Async in applications.
• Two different Async communication models: the Stream Asynchronous model and the Kernel-Initiated model.
• Performance models introduced to help interpret results on domain-decomposed numerical applications.
• Representative benchmarks from different scientific domains to demonstrate the benefits of GPUDirect Async.

Article info
Article history: Received 16 February 2017; Received in revised form 24 October 2017; Accepted 12 December 2017; Available online 20 December 2017.
Keywords: GPUDirect Async; CUDA 8.0; InfiniBand; Asynchronous communication models.

Abstract
NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or among GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition which allows direct synchronization between GPU and third-party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand Connect-IB network adapter, with no involvement of the CPU in the critical communication path of GPU applications. In this paper we describe the motivations and the building blocks of GPUDirect Async. After an initial analysis with a micro-benchmark, by means of a performance model, we show the potential benefits of using two different asynchronous communication models supported by this new technology in two MPI multi-GPU applications: HPGMG-FV, a proxy for real-world geometric multi-grid applications, and CoMD-CUDA, a proxy for classical molecular dynamics codes. We also report a test case in which the use of GPUDirect Async does not provide any advantage, namely an implementation of the Breadth First Search algorithm for large-scale graphs.
© 2017 Elsevier Inc. All rights reserved.

1. Introduction

NVIDIA GPUDirect technologies [16] allow peer GPUs, network adapters and other devices to directly read from and write to GPU device memory. This eliminates additional copies to host memory, reducing latencies and lowering CPU overhead, and results in significant improvements in data transfer times for applications running on NVIDIA Tesla, GeForce and Quadro GPUs [29]. The first GPUDirect version was introduced in 2010 along with CUDA 3.1, to accelerate communication with third-party PCIe network and storage device drivers via shared pinned host memory. In 2011, starting from CUDA 4.0, GPUDirect Peer-to-Peer (P2P) allowed direct access and transfers between GPUs on the same PCIe root port. Around that time, some CUDA-aware¹ MPI middlewares added support for GPUDirect P2P to accelerate intra-node GPU-to-GPU communications.

Finally, with CUDA 5.0, NVIDIA released GPUDirect RDMA, enabling a direct PCIe data path between GPUs and third-party peripheral devices, such as network interface controllers (NICs). Since MLNX OFED 2.1, Mellanox [25] has supported GPUDirect RDMA on ConnectX-3 and later Host Channel Adapters (HCAs). Similarly, Chelsio added support for GPUDirect RDMA to their OFED software stack [6]. More recently, Broadcom announced the same.

* Corresponding author. E-mail address: [email protected] (E. Agostini).

¹ Refers to the ability for the user to pass GPU memory pointers to MPI communication functions. By doing so, MPI middlewares have the opportunity to improve performance, by deploying communication protocols optimized for GPUs, e.g. pipelined data staging, as well as usability, i.e. explicit device-to-host or host-to-device memory copies before/after communications are no longer necessary. For a tutorial on CUDA-aware MPI, see [11].

https://doi.org/10.1016/j.jpdc.2017.12.007

Async [31,1] is a recent member of the GPUDirect family of technologies, initially announced at the Supercomputing 2015 conference [30], with the required software APIs introduced in CUDA 8.0 [32]. While GPUDirect Async is generic, in that it can in principle be applied to different realms – e.g. network communications, storage I/O, etc. – in this paper we focus on its use in combination with Mellanox InfiniBand Host Channel Adapters (HCAs). With MOFED 3.4, Mellanox has released Peer-Direct Async, a set of Verbs extensions complementary to GPUDirect Async.

Traditionally in GPU-accelerated parallel applications, the CPU works as the orchestrator between GPU compute and network communication tasks, for example waiting for a (set of) GPU compute task(s) – CUDA kernels in NVIDIA parlance – to complete (cudaStreamSynchronize) before issuing a related communication onto the NIC, or conversely waiting for communication completions (MPI_Wait) before issuing GPU compute tasks. Both GPUDirect P2P and RDMA are all about optimizing data movement, respectively among GPUs and with third-party devices, in the previous example the NIC. GPUDirect Async instead helps offload the control path onto the GPU, by enabling the GPU to both trigger communication transfers and synchronize on notifications, directly over the PCIe bus without the use of agents running on the host CPU. That removes the CPU from the critical path in applications, potentially improving performance by allowing for more computation–communication overlap, by freeing CPU core cycles which could be spent elsewhere, or conversely by potentially sustaining performance in conjunction with low-performance CPUs.

In the following we describe two separate ways of leveraging GPUDirect Async, referred to as asynchronous communication models (Section 4), distinguished by the particular GPU engine which interacts with the NIC HW. In the first model, Stream Asynchronous or SA, we introduce on-a-stream point-to-point and one-sided communication primitives, which blend communications and computations in the same way GPU task synchronizations are expressed in CUDA, that is through the concept of CUDA streams. In other words, communications are naturally executed respecting the order in which they are submitted on the CUDA stream, correctly mixing together with CUDA asynchronous memory copies and CUDA kernels. We distinguish multiple phases: (a) the CPU prepares the communication primitives – including buffer pointers and sizes, network addresses, etc. – and posts the associated descriptors onto the NIC command queues; (b) the meta-data required to activate those descriptors are collected and converted into CUDA Memory Operations (MemOps); (c) the CPU submits those MemOps on the user CUDA stream; (d) sometime later the CUDA stream executes those MemOps, with the effect of triggering the communications prepared in phase (a). The second model, Kernel-Initiated or KI, is a variation of the first, where phases (a) and (b) remain the same as in the previous case, followed by phase (c) where the CPU passes the meta-data to a CUDA kernel; (d) later, probably after some form of inter-thread synchronization which is specific to the particular computation, the CUDA kernel (as opposed to the CUDA stream) uses the meta-data to either trigger the communications or to wait on their completions. Note that in both cases, phase (a) is executed on the CPU by the same software stack which handles regular communications, i.e. in our case by the Mellanox user-space libmlx5 driver. That code, full of bitwise operations and branches, is NIC HW dependent (each NIC vendor has its own HW interface) and hard to parallelize, so it is convenient to keep it on the CPU, which is optimized for low latency.

The rest of the paper is organized in the following way: other papers have already explored moving parts or most of the low-level communication stack onto the GPU programmable cores (SMs), mostly at the experimental level; those are briefly recalled in Section 2. GPUDirect Async and its implementation on InfiniBand Verbs are described in Section 3. There we depict the software stack, including LibMP, a simple message passing library that we developed to quickly enable GPUDirect Async in our benchmarks. In Section 4, for both Async communication models, we introduce simplified performance models in order to help in interpreting the experimental results. In Sections 5 and 6, we show performance results respectively for micro-benchmarks and for a small suite of applications representative of a few scientific domains. Finally, we draw our conclusions in Section 7.

2. Related works

Since the introduction of GPUs as general purpose accelerators, many papers have studied ways to optimize the communication data path between GPUs and NICs; see for example [2,4], custom FPGA-based NICs natively implementing both the NVIDIA P2P and the RDMA protocols in HW, and similarly [27], where the authors experimented with GPUDirect RDMA using the EXTOLL interconnect.

So far, instead, only a few papers have explored ways to offload the communication control path onto the GPU. In Table 1, we schematize those papers, based on where the different communication phases (preparation of communication descriptors, triggering) are carried out (CPU, GPU stream, GPU SMs), the location of the NIC control structures (host memory, GPU memory), the benchmark applications and the location of the related data buffers (GPUDirect RDMA is used to communicate data buffers located in GPU memory).

S. Kim et al. [23] describe a native GPU networking layer that provides a BSD-like socket abstraction and high-level networking APIs to GPU programs. While those APIs can be invoked directly by CUDA threads, they are actually performed by a proxy agent running on the CPU.

Lena Oden et al. [28] explored different approaches to generating and posting InfiniBand send and receive operations from within CUDA kernels. In one of the experiments they implemented a GPU-side subset of IB Verbs, modifying both the open source part of the NVIDIA kernel-mode driver and the Mellanox user-space driver to map some key InfiniBand HCA resources (e.g. memory queues and HW doorbell registers) onto the GPU. Those GPU Verbs APIs use a critical section to serialize access to the IB QP at the single CUDA thread-block granularity. They showed unsatisfactory results, leading the authors to conclude that the GPU-native design is inferior to traditional CPU-controlled network transfers.

F. Daud et al. [12] implement a GPU-side subset of the Global address space Programming Interface (GPI) library, enabling high-performance RDMA communications directly from GPU code, i.e. the CPU is completely bypassed and the CUDA kernel threads both prepare and trigger the NIC commands relative to the communications. Similarly to [28], they map InfiniBand resources onto the GPU. They also experiment with having some of those resources (QPs, CQs) backed by GPU memory instead of host memory. In the last two papers, they re-implement part of the communication stack on the GPU side. Besides, they had to hack the GPU and/or HCA drivers to allow the GPU to access the NIC doorbell and to place the control structures in GPU memory. This approach presents two drawbacks: the GPU-side stack uses more GPU resources (e.g. registers), potentially reducing the occupancy and consequently the performance of the computation kernels where the communication functions are used. Besides, to the best of our knowledge, they are affected by a GPU memory consistency issue [19] associated with using receive data buffers, updated via GPUDirect RDMA, from inside persistent CUDA kernels. GPUDirect Async officially introduces a mechanism to fence incoming traffic directed towards GPU memory buffers. This mechanism is exposed as a new FLUSH MemOp, which can be queued after the wait on a communication completion notification and before releasing a pre-launched GPU kernel.

Table 1
Comparing the different approaches cited vs. those proposed in this paper (last two rows).

Paper                 | Comm descriptors creation (WQEs) | Comm trigger (control path) | IB control structures (QP, CQ) location | Data buffers | Benchmarking
Oden et al. [28]      | GPU SMs                          | CUDA kernels (SMs)          | GPU or host                             | GPU          | Micro-benchmarks
Kim et al. [23]       | CPU                              | Proxy on CPU                | Host                                    | GPU          | Synth. workloads
Daud et al. [12]      | GPU SMs                          | CUDA kernels (SMs)          | GPU or host                             | GPU          | Micro-benchmarks, synth. workloads
Venkatesh et al. [34] | CPU                              | CPU-assisted GPU stream     | Host                                    | GPU          | –
Async SA (here & [1]) | CPU                              | GPU stream                  | Host                                    | Host         | Micro-benchmarks, HPC mini-apps
Async KI (here & [1]) | CPU                              | CUDA kernels (SMs)          | Host                                    | Host         | Micro-benchmarks, HPC mini-apps

Fig. 1. GPUDirect RDMA compute-and-send workflow.


Fig. 2. GPUDirect Async compute-and-send workflow.

In [34] the Ohio State University team, in collaboration with some of the authors, presented early results of using the GPUDirect Async technology in MVAPICH2 (MPI-GDS). The scope of that paper was to explore protocol designs which take advantage of GPUDirect Async while at the same time respecting the demands of the MPI specification. In that respect, this paper constitutes a premise of that other work. More specifically, MPI-GDS offers MPI point-to-point primitives synchronous to CUDA streams. MPI tag matching and the rendezvous protocol are supported and implemented using a hybrid approach, where the CPU actively progresses part of the protocol at the cost of additional overhead. In this paper, instead, we measure the potential performance of GPUDirect Async alone, irrespective of the use of other GPUDirect technologies. Besides, we also explore one-sided communication primitives and CUDA kernel-initiated communications. Finally, that work focuses on micro-benchmarks, while here we present benchmarks on applications.

Compared to our previous work [1], this paper introduces several improvements:

• A more detailed description of the Async technology and its software stack.
• A general performance model capturing the communication pattern of domain-decomposed multi-GPU applications.
• The performance model is applied to the GPUDirect Async models in order to clarify the requirements needed by each asynchronous model to reach a gain in performance.
• New micro-benchmarks and a negative test case are discussed here, clarifying some GPUDirect Async limitations.

3. GPUDirect Async

It is common for scientific applications to alternate between compute and communication phases. The transition from compute to communication in a multi-node GPU application involves launching a compute kernel on a GPU, waiting for it to complete and then sending the data over the network.

The workflow when using a GPUDirect RDMA-enabled InfiniBand HCA is illustrated in Fig. 1:

1. The CPU queues some computation tasks to the GPU (kernel launch) and synchronizes, waiting for their completion.
2. The CPU queues communication tasks to the InfiniBand HCA.
3. The HCA fetches data directly from GPU memory, thanks to GPUDirect RDMA.
4. The HCA injects the associated messages through the network.
5. The CPU synchronizes with the HCA by waiting for a completion (not shown in the figure).

Note that when GPUDirect RDMA is not used, suitable GPU-to-host copies will be included in the communication tasks.

GPUDirect Async removes the dependency on the CPU by enabling the GPU to trigger communication on the HCA and the HCA to unblock CUDA tasks. The CPU needs only to prepare and enqueue both the compute and communication tasks. The compute-and-send workflow in the presence of GPUDirect Async is shown in Fig. 2:

1. The CPU prepares and queues both computation and communication tasks to the GPU.
2. The GPU completes the computation tasks and directly triggers the pending communications on the HCA.
3. The HCA fetches data directly from host memory, or from GPU memory when GPUDirect RDMA is used.
4. The HCA sends the data.

In this latter case, the CPU workload changes. For example, after having prepared and queued all the necessary tasks onto the GPU, the CPU can go back and do other useful work.
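To make the contrast between the two workflows concrete, the hedged sketch below puts them side by side. The mp_isend_on_stream()/mp_wait_on_stream() calls follow the LibMP shapes used later in Algorithm 1 (Section 3.3.4); the header name mp.h, the kernel name, the grid configuration and the buffer sizes are placeholders, so this is an illustrative sketch rather than code from the paper's benchmarks.

```cuda
#include <cuda_runtime.h>
#include <mpi.h>
#include "mp.h"   /* LibMP header; name assumed for illustration */

extern __global__ void compute_kernel(char *buf);  /* placeholder kernel */

/* Traditional, CPU-orchestrated compute-and-send: the CPU sits between the
 * GPU and the network at every step (Fig. 1). */
void compute_and_send_sync(char *sendBuf, size_t n, int peer, MPI_Comm comm,
                           cudaStream_t stream)
{
    MPI_Request req;
    compute_kernel<<<128, 128, 0, stream>>>(sendBuf);
    cudaStreamSynchronize(stream);              /* CPU waits for the GPU  */
    MPI_Isend(sendBuf, (int)n, MPI_BYTE, peer, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* CPU waits for the HCA  */
}

/* GPUDirect Async, SA model (Fig. 2): everything is enqueued up front and
 * the GPU triggers the HCA once the kernel finishes. */
void compute_and_send_async(char *sendBuf, size_t n, int peer,
                            mp_reg_t *reg, mp_request_t *req,
                            cudaStream_t stream)
{
    compute_kernel<<<128, 128, 0, stream>>>(sendBuf);
    mp_isend_on_stream(sendBuf, n, peer, reg, req, stream);
    mp_wait_on_stream(req, stream);             /* completion handled on the stream */
    /* CPU returns immediately and can prepare the next iteration. */
}
```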
We note here that GPUDirect Async is independent of GPUDirect RDMA, so we can experiment with it in isolation, i.e. using the first without enabling the second. In [3] it has been noted that GPUDirect RDMA performance heavily depends on the PCIe fabric, i.e. the type and number of PCIe bridges and switches which connect the NIC to the GPU, as well as on the particular GPU architecture [32] used. So, for example, for large message sizes in the rendezvous protocol, pipelined staging through host memory on the sender side is more efficient than using GPUDirect RDMA. Doing the same thing here, as in MPI-GDS [34], would require the CPU to progress the communication, defying our objective of

Fig. 3. Communication period MPI multi-GPU application timeline.

benchmarking GPUDirect Async in isolation. Considering that on one of our benchmarking platforms GPUDirect RDMA is inefficient, we decided not to use it.

3.1. Motivations

Fig. 3 depicts the timeline of a typical multi-GPU application, comprising both computation and communication phases. The CPU iteratively schedules work onto the GPU, waits for GPU kernel completion, triggers communications on the HCA and finally polls the HCA for the completion of communications. The CPU must be constantly running at peak performance most of the time just to ensure responsiveness, that is, that the different phases are scheduled as quickly as possible.

Generally, when applications are strong-scaled, due to geometrical and/or physical properties, the length of both the GPU computations and the network communications reduces on each compute node. Additionally, the rate of reduction is different for computation and communication, e.g. the former scales with the volume while the latter with the surface, as in the domain decomposition approach. Therefore, even when the algorithm allows them to be overlapped, it is increasingly difficult to hide communication behind computation. Beyond a certain point, the application does not scale anymore. The onset of the non-scaling regime can be anticipated – i.e. the application stops scaling at a smaller number of GPUs – if the overheads incurred by the CPU when launching computation and communication tasks become of the same order as the time necessary to execute them, respectively on the GPU and the NIC. In these cases, launching a GPU computation can take up to tens of microseconds, which can be about the same time it takes to execute that very same task, or to exchange a few kilobytes of data over the network. Similarly, some applications, such as HPGMG which is introduced later, may go through phases – coarse-grain levels – where it is more convenient to move computations back to the CPU to avoid the overhead of launching work on the GPU.

By leveraging GPUDirect Async instead, a whole parallel computation phase can be offloaded onto a CUDA stream. That in turn allows overlapping – thereby paying the cost of – the work submission at iteration i while iteration i − 1 is being orchestrated by the GPU, effectively removing the CPU from the critical path. As shown qualitatively in Fig. 4, there are times when the CPU becomes idle for potentially extended periods of time, during which it can do useful work. When that is not possible, the CPU can be allowed to go down into deeper sleep states, thereby lowering the application power profile.

3.2. Implementation

Currently, support for GPUDirect Async requires the extended IB Verbs APIs contained in MLNX OFED 4.0 and is limited to the latest generation of Mellanox InfiniBand Host Channel Adapters (HCAs) [22], those supported by the libmlx5 user-space provider library.

Traditionally, the CPU issues communication operations to the IB HCA by filling in data structures (Work Requests or WQEs) on either the send or the receive memory queue, and subsequently updating some kind of doorbell register, both associated to a specific Queue Pair (QP). The doorbell update is needed to inform the HCA about the new requests ready to be processed. In the particular case of recent Mellanox HCAs, two distinct doorbell updates are required when triggering send operations: one to a 32-bit word (DBREC) in host memory, the other to a HW register located at a specific offset into one of the HCA PCIe resources (Base Address Register or BAR). When kernel by-pass is used, the user-space process directly updates the DB by using an uncached memory-mapped I/O (MMIO) mapping of the HCA BAR page holding that register. When a request is completed – i.e. data have been sent or received – the HCA adds a new CQE (Completion Queue Entry) into the send or the receive Completion Queue (CQ), respectively, associated to that QP at creation time. The application needs to poll the corresponding CQ in order to detect whether a request has been completed (Fig. 5).

While the CPU is still in charge of preparing the commands, GPUDirect Async requires the GPU to directly access the HCA doorbell registers and the CQs (which reside in host memory in our case), using a combination of two CUDA driver functions: cuMemHostRegister(), to page-lock an existing host memory range and map it into the GPU's address space, and cuMemHostGetDevicePointer(), to retrieve the corresponding device pointer. In particular, the CU_MEMHOSTREGISTER_IOMEMORY flag is used when registering an MMIO address range belonging to a third-party PCIe device (the InfiniBand HCA in our case). The latter corresponds to the creation of a so-called GPU peer mapping, that is, a GPU mapping to a peer PCIe device. Note that in the current implementation, the whole MMIO range must be physically contiguous and marked cache-inhibited for the CPU.
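As an illustration of how these two driver calls might be combined, the sketch below maps a CQ buffer (regular pinned host memory) and an HCA doorbell page (MMIO) into the GPU address space. The bar_ptr, bar_size, cq_buf and cq_size names stand for values that would come from the Verbs/mlx5 layer, and error handling plus context setup are omitted, so this is an illustrative sketch under those assumptions rather than the actual LibGDSync code.

```cuda
#include <cuda.h>

/* Sketch: expose a host-memory CQ buffer and an HCA doorbell (MMIO) page to
 * the GPU, so that CUDA MemOps or kernel threads can poll and ring them.
 * Assumes a CUDA context is already current; error handling omitted. */
static CUdeviceptr map_for_gpu(void *host_ptr, size_t size, int is_mmio)
{
    unsigned int flags = CU_MEMHOSTREGISTER_DEVICEMAP;
    if (is_mmio)
        flags |= CU_MEMHOSTREGISTER_IOMEMORY;  /* third-party PCIe BAR range */

    /* Page-lock the range and make it accessible to the GPU. */
    cuMemHostRegister(host_ptr, size, flags);

    /* Retrieve the device pointer the GPU will use to access it. */
    CUdeviceptr dptr = 0;
    cuMemHostGetDevicePointer(&dptr, host_ptr, 0);
    return dptr;
}

/* Usage (placeholder variables):
 *   CUdeviceptr cq_dev = map_for_gpu(cq_buf,  cq_size,  0);  // CQ in host memory
 *   CUdeviceptr db_dev = map_for_gpu(bar_ptr, bar_size, 1);  // HCA doorbell (MMIO)
 */
```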
Because of HW limitations in NVIDIA GPUs prior to the Pascal architecture, a special Mellanox HCA firmware is required to let the HCA PCIe resource (BAR) be placed in the appropriate address range.

Once the doorbell registers and the CQs are mapped on the GPU, it is possible to access them from either (a) CUDA streams or (b) CUDA kernel threads. We refer to the former as the Stream Asynchronous (SA) communication model – see Section 4.2 – and to the latter as the Kernel-Initiated (KI) communication model – see Section 4.3.

In the SA model, we make extensive use of the CUDA Memory Operations APIs, described below, to either wait on (poll) the CQEs or write (ring) the doorbell registers.

1. cuStreamWaitValue32(stream, value, address, condition): enqueues a synchronization of the CUDA stream on the given memory location. Work ordered after the operation will block until the given condition (EQual, Greater-or-EQual, AND) is met. This, for example, allows blocking the CUDA stream until a particular completion event (CQE) is signaled by the NIC.
2. cuStreamWriteValue32(stream, value, address): writes the passed value into the memory identified by the device address. This API is used to ring the QP doorbell register.

Fig. 4. Communication period multi-GPU application timeline with GPUDirect Async.

Fig. 5. InfiniBand HCA send/receive requests processing.

3. cuStreamBatchMemOp(stream, count, mem_ops[]): a batched version of the previous functions, taking as input a vector of memory operations (wait or write).

When GPUDirect Async is used – see the interaction diagram in Fig. 6 – the CPU is still needed to:

• allocate and register communication buffers (device or host pinned memory);
• map the HCA-specific data structures onto the GPU, as explained above;
• prepare and post WQEs on the QPs;
• prepare the send and receive request descriptors and convert them into a sequence of Memory Operations;
• poll on the CQEs, once successfully read by the CUDA stream.

On the contrary, the GPU has more tasks to do (a minimal sketch follows the list):

• triggering prepared WQEs by ringing the doorbells;
• waiting for the CQE related to a send or receive WQE, polling on the CQ.
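The sketch below shows how these GPU-side tasks can be expressed with the stream MemOps listed above: it enqueues the two doorbell writes and a wait on the expected CQE word. The dbrec_dev, db_dev and cqe_dev device pointers stand for the mappings obtained as shown earlier, and dbrec_val, db_val and cqe_idx are placeholder values that the Verbs layer would compute, so this is a simplified sketch of the mechanism, not the LibGDSync implementation.

```cuda
#include <cuda.h>

/* Sketch of the SA model: trigger a pre-built send WQE and wait for a
 * completion, entirely on a CUDA stream (no CPU in the critical path). */
static void post_send_on_stream(CUstream stream,
                                CUdeviceptr dbrec_dev, unsigned int dbrec_val,
                                CUdeviceptr db_dev,    unsigned int db_val,
                                CUdeviceptr cqe_dev,   unsigned int cqe_idx)
{
    /* Ring the doorbells: first the DBREC word in host memory, then the MMIO
     * doorbell register on the HCA BAR. The default flag keeps each write
     * ordered with respect to the work preceding it on the stream. */
    cuStreamWriteValue32(stream, dbrec_dev, dbrec_val, CU_STREAM_WRITE_VALUE_DEFAULT);
    cuStreamWriteValue32(stream, db_dev,    db_val,    CU_STREAM_WRITE_VALUE_DEFAULT);

    /* Block the stream until the HCA writes the matching CQE word. */
    cuStreamWaitValue32(stream, cqe_dev, cqe_idx, CU_STREAM_WAIT_VALUE_GEQ);

    /* The three operations could equally be submitted in one call with
     * cuStreamBatchMemOp(). */
}
```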
Given the small set of features offered by the CUDA Memory Operation APIs, i.e. the CUDA stream can only block on a memory location, we cannot implement a full-blown CQE parser and dispatcher there, as on the CPU. Hence, if we want to literally keep the CPU off the critical path, the GPU stream needs the CQEs and the send/receive operations to be strictly associated, i.e. the completion of the i-th WQE will be placed in the i-th available CQE. This is, for example, the case if distinct send and receive CQs are used at QP creation time, as well as giving up on Shared Receive Queues (SRQs).

Note that error handling and recovery are still done by the CPU. When polling the CQEs, the CPU may observe completions with errors. In that case, it is responsible for aborting all outstanding work for both the GPU and the HCA.

3.3. Software stack and design tradeoffs

In order to take advantage of the GPUDirect Async technology, we implemented or modified libraries at different levels of the software stack shown in Fig. 7.

3.3.1. libibverbs
libibverbs implements the OpenFabrics InfiniBand Verbs API specification. In version 4.0, Mellanox has introduced the new Peer-Direct Async APIs (e.g. see the peer_ops.h header) targeting the NVIDIA GPUDirect Async technology.

3.3.2. libmlx5
It is the vendor-specific low-level provider library managing recent Mellanox InfiniBand HCAs. It allows user-space processes to directly access the Mellanox HCA hardware with low latency and low overhead (kernel by-pass).

3.3.3. LibGDSync
Developed by the authors, it conceptually implements GPUDirect Async support on InfiniBand Verbs, bridging the gap between the CUDA and the Verbs APIs. It consists of a set of low-level APIs which are still very similar to IB Verbs, though operating on CUDA streams.
LibGDSync is responsible for creating the Verbs objects, i.e. queue pairs (QPs) and completion queues (CQs), as structures respecting the constraints of GPUDirect Async, for registering host memory when needed, and for posting send instructions and completion waits directly on the GPU stream. Functions like gds_stream_queue_send or gds_stream_wait_cq internally use the CUDA Stream MemOp APIs described in the previous Section 3.2.

3.3.4. LibMP
Implemented by the authors, it is a lightweight messaging library built on top of the LibGDSync APIs, developed as a technology demonstrator to easily deploy the GPUDirect Async technology in applications. Once the MPI environment is initialized (i.e. communicator, ranks, topology, etc.), it is possible to replace the standard MPI communication primitives with the respective LibMP ones, e.g. mp_isend_on_stream() instead of MPI_Isend(), mp_wait_on_stream() instead of MPI_Wait(), etc. LibMP's features are:

• Point-to-Point communication primitives using the send/receive semantics of IB Verbs: receive buffers are consumed in the order they are posted on the particular QP.

Fig. 6. InfiniBand HCA send/receive requests processing with GPUDirect Async.

Fig. 7. GPUDirect Async software stack.

• One-Sided asynchronous communications, e.g. put and get on remote memory addresses.
• No support for MPI-style tag matching.
• No collective communication primitives.

As previously stated, each QP has its own CQ. The depth of both the WQs and the CQs can be set at run-time; in our benchmarks we used a default depth of 512 entries. In our experiments the WQs, CQs and DBREC were residing in host memory; in a future version we plan to enable the use of GPU memory for CQs and DBREC.

The parameters for the communication primitives (i.e. destination/source peer ranks, message size, buffer pointers) are used when the CPU posts the WQEs, before collecting the descriptors and turning them into CUDA API calls. Hence they must be known at the time of WQE posting and cannot be, for example, the result of a GPU computation, which can add complexity in some applications, as shown below. While in principle it is possible to change some of those parameters by directly modifying the WQEs from within the GPU work, e.g. prior to triggering them, that would pose well-known challenges, as discussed in [12,28].

3.3.5. System requirements
Async requirements are:

• Mellanox Connect-IB or later HCA, possibly with a special firmware version.
• MLNX OFED 4.0 for the Peer-Direct Async Verbs APIs.
• CUDA 8.0 for the Stream Memory Operations APIs described in Section 3.2.
• NVIDIA display driver version 384 or newer.
• The LibGDSync library, available at [17].
• A special NVIDIA kernel driver registry key, required to enable GPU peer mappings.
• The nvidia_peer_memory kernel module.
• The GDRcopy library [15].

In Algorithm 1 we present the typical structure of a GPUDirect Async application, using LibMP functions, where two processes exchange some data using the Stream Asynchronous model, mixing communication and computation tasks.

4. GPUDirect Async models

As described in the previous sections, LibMP presents two different execution models: the Stream Asynchronous model (SA), where communications are asynchronous with respect to the host and synchronous with respect to the CUDA stream, and the Kernel-Initiated model (KI), where communications are triggered by CUDA threads within a kernel. In this section, with the help of abstract performance models, we compare the behavior of our Async models with respect to the standard MPI communication model. We consider an execution flow which is typical of GPU-accelerated MPI

applications, where each MPI rank alternates between computations and communication with other peers. Later, in Section 6, we will be using our performance models to explore the conditions under which we expect GPUDirect Async to improve over the MPI model.

Algorithm 1 LibMP example C-pseudocode
1: numRanks=2, Nreq=1;
2: ▷ Initialize MPI and CUDA environment
3: initialize_MPI_environment();
4: cuda_init();
5: myRank = get_MPI_rank();
6: ...
7: ▷ Initialize LibMP environment
8: mp_init(MPI_COMM_WORLD, !myRank, numRanks);
9: ...
10: ▷ Create mp request descriptors
11: mp_request_t *sreq, *rreq;
12: host_memory_alloc_request(sreq, Nreq);
13: host_memory_alloc_request(rreq, Nreq);
14: ...
15: ▷ Allocate send/receive buffers
16: memory_alloc_buffer(sendBuffer, sizeS);
17: memory_alloc_buffer(recvBuffer, sizeR);
18: ...
19: ▷ Register related memory regions
20: mp_reg_t sreg, rreg;
21: mp_register(sendBuffer, sizeS, &sreg);
22: mp_register(recvBuffer, sizeR, &rreg);
23: ...
24: ▷ Post a Receive WQE
25: mp_irecv(recvBuffer, sizeR, !myRank, &rreg, &rreq);
26:
27: ▷ Start a CUDA kernel to prepare send buffers
28: launch_cuda_kernel(sendBuffer, ..., stream);
29:
30: ▷ Trigger HCA for Send WQE
31: mp_isend_on_stream(sendBuffer, sizeS, !myRank, &sreg, &sreq, stream);
32:
33: ▷ Wait (poll) for Receive CQE
34: mp_wait_on_stream(&rreq, stream);
35:
36: ▷ Start a CUDA kernel to work on received data
37: launch_cuda_kernel(recvBuffer, ..., stream);
38: ...
39: ▷ Cleanup CQEs
40: mp_wait(&rreq);
41: mp_wait(&sreq);
42: ...
43: ▷ Synchronize and cleanup
44: cudaDeviceSynchronize();
45: mp_deregister(&rreg);
46: mp_deregister(&sreg);
47: cleanup_MPI_environment();

4.1. CPU synchronous model

As an example of regular multi-GPU MPI applications, we consider the kernel of a D-dimensional iterative stencil computation parallelized using the domain decomposition approach. Three independent phases can be identified:

1. Compute and Send: for X times, launch (L_{A_i} time) some CUDA tasks (running for A_i time) like kernels or memory transfers, execute some operations on the host, like a synchronization with the CUDA stream (T_H time), then send the computed data (S_i time).
2. Interior Compute: for Y times, execute some operations on the host (T_H time) and launch (L_{B_j} time) some CUDA tasks (B_j time) working on inner data elements, i.e. not dependent upon data coming from neighboring nodes.
3. Receive and Compute: for Z times, wait to receive something from the other processes (W_k), execute some operations on the host (T_H time) and launch (L_{C_k} time) CUDA tasks (C_k time) working on received data.

Considering R iterations of the above pattern, schematized in Fig. 8, Eqs. (1) represent the time spent respectively on the CPU (T_{CPU_S}), on the GPU (T_{GPU_S}) and by the whole application (T_S). T_{idle} is the GPU idle time spent while waiting for CPU work.

T_{CPU_S} = R \times \left[ \sum_{i=1}^{X} (L_{A_i} + T_{Hsync} + S_i) + \sum_{j=1}^{Y} L_{B_j} + \sum_{k=1}^{Z} (W_k + T_{Hsync} + L_{C_k}) + T_{Hsync} \right]

T_{GPU_S} = R \times \left[ \sum_{i=1}^{X} A_i + \sum_{j=1}^{Y} B_j + \sum_{k=1}^{Z} C_k \right] + T_{idle}          (1)

T_S = \max(T_{CPU_S}, T_{GPU_S}).

The total time T_S will be equal to the CPU time, because the CPU is always busy, i.e. at worst waiting for the completion of GPU tasks, represented by the T_{Hsync} parameter:

T_{GPU_S} \le T_{CPU_S} \rightarrow T_S = T_{CPU_S}.

In the following sections we examine the case of the LibMP communication models, giving some examples of their application.

4.2. Stream synchronous, CPU asynchronous model (SA)

As described previously, in this model communications are enqueued into a CUDA stream along with other CUDA tasks, like kernels, memory transfers, etc. Usually this model is relatively easy to use because it requires very few changes to the MPI application (i.e. replacing MPI_Isend with mp_isend_on_stream and removing CUDA synchronization primitives). Computation and communication tasks are executed asynchronously with respect to the host code but synchronized with respect to the CUDA streams.

The class of applications introduced in Section 4.1 can be leveraged with the SA model if it is possible to change the original algorithm in order to be coherent with the following Formula (2) (represented by Fig. 9):

T_{CPU_{SA}} = R \times \left[ \sum_{i=1}^{X} (L_{A_i} + L_{S_i}) + \sum_{j=1}^{Y} L_{B_j} + \sum_{k=1}^{Z} (L_{W_k} + L_{C_k}) \right]

T_{GPU_{SA}} = R \times \left[ \sum_{i=1}^{X} (A_i + S_i) + \sum_{j=1}^{Y} B_j + \sum_{k=1}^{Z} (W_k + C_k) \right]          (2)

T_{SA} = \max(T_{CPU_{SA}}, T_{GPU_{SA}})

where L_{S_i} and L_{W_k} are respectively the time spent by the CPU to enqueue the send or the wait-for-receive operations on the CUDA stream. In this model, the T_{idle} time can be considered negligible because, due to the asynchronous behavior, the CPU enqueues a lot of sequential tasks on the CUDA stream without waiting for their completion.

To ensure an asynchronous behavior, during communication periods it is required that:

• all the CUDA synchronization primitives must be removed;
• all the non-asynchronous CUDA primitives must be replaced with the respective CUDA asynchronous primitives;

Fig. 8. Typical timeline of an iterative multi-GPU domain-decomposed MPI application.

Fig. 9. General communication pattern of a multi-GPU with SA model application, Formula (2) representation.

• communication parameters must be known at the time of posting (for example send or receive buffer sizes, destination ranks, pointers, etc.);
• all the MPI functions must be replaced by LibMP functions.

An apparent side effect is that the CPU has less work to do, because the host code performs neither synchronizations nor communications, which are not relevant in an asynchronous context; for this reason we can consider the T_H parameter in Formula (2) negligible.

An algorithm that is coherent with Formula (2) represents an improvement with respect to a synchronous version if the following three conditions are verified.

4.2.1. C1 condition: asynchrony
In Formula (2) the total execution time is equal to the GPU time if:

T_{CPU_{SA}} < T_{GPU_{SA}} \rightarrow T_{SA} = T_{GPU_{SA}}.

That is, the time required by the CPU to enqueue tasks on the CUDA stream (launch time) must be less than the time spent by the GPU to execute those tasks (C1 condition):

(C1): \sum (L_A + L_S + L_B + L_W + L_C) < \sum (A + S + B + W + C).

Without this condition, the asynchrony cannot happen, because the CPU launch time is greater than the GPU execution time.

4.2.2. C2 condition: time gain
The SA model (Formula (2)) is faster than the synchronous model (Formula (1)) if:

T_{SA} < T_S \rightarrow T_{GPU_{SA}} < T_{CPU_S}.

This is always verified with the stronger condition:

T_{GPU_{SA}} < T_{GPU_S}

because T_{GPU_S} \le T_{CPU_S}. If the GPU computation tasks require about the same time in both models:

\left[\sum (A + B + C)\right]_{SA} \le \left[\sum (A + B + C)\right]_S,

then we obtain a simple condition:

(C2): \left[\sum (TS) + \sum (TW)\right]_{SA} \le [T_{idle}]_S

which means that the SA model is faster if the sum of the communication times (send TS and wait TW) is less than the GPU T_{idle} time (the idle time spent by the GPU waiting for CPU work) in the synchronous model. Intuitively, the effectiveness of the SA model relies on the relative magnitude of the CUDA-stream-synchronous communications in the SA model with respect to the CPU-initiated communications plus GPU synchronizations in the S model.

4.2.3. C3 condition: fragmented computations
The larger the number of sub-tasks R, X, Y and Z, the more the execution becomes asynchronous (due to C1):

(C3): R > 0, Y \ge 0, \max(X, Z) > 0.

In Section 6 we apply those conditions to several MPI + CUDA applications.

4.3. Kernel-Initiated model (KI)

The Streaming Multiprocessors (SMs), which are in charge of executing the CUDA kernels, can directly issue communication primitives to send messages or wait for receive completions. Having the HCA doorbell registers and CQs mapped in the GPU, a CUDA thread can use simple value assignments and comparisons inside the code to ring the doorbell or poll the CQs, respectively. In the KI model we used the kernel fusion technique, where in a single CUDA kernel both communication (send or wait) and computation tasks (type A, B or C) are fused together. This approach can lead to GPU memory consistency problems in case it is combined with GPUDirect RDMA [19]. In order to avoid this issue, in our

Fig. 10. Kernel-Initiated communication pattern timeline.

Fig. 12. Inter-block barrier.


Fig. 11. KI kernel, CUDA blocks tasks.

benchmarks (Section 6) we use host memory for the communication buffers. A typical timeline for the Kernel-Initiated model is shown in Fig. 10.

As in the SA model, the CPU prepares the communication descriptors and later those communications are triggered directly by threads in the CUDA KI kernels (using the descriptors) instead of being triggered by the CUDA stream as in the SA model. The complexity is moved to the CUDA KI kernels, which require at least N + M + 1 blocks, where N is the number of blocks required to compute type A tasks before the send operation and M is the number of blocks required by type C tasks working on received data, plus 1 block used to poll the CQs, as shown in Fig. 11. As in the SA model, tasks B represent other (possible) CUDA tasks not related to communications, which can be performed by either kind of block (as shown in the figure) depending on the particular algorithm.

In the KI model the kernel fusion technique is combined with a dynamic dispatcher which uses atomic operations to pick each thread block² and to assign it to the right task according to the following rules (a minimal sketch of this scheme is given after the list):

• In order to avoid dead-locks, the receive operation must not prevent the send operation from starting or progressing: there must always be at least one block waiting for the receive and another block executing the send.
• Receiving is time critical in our experience, so the first receiver block is used to wait for incoming messages. In particular, each thread polls on the receive CQ associated to each remote node.
• The blocks from the second up to the (N + 1)-th are assigned to the A group of operations, whereas the remaining M blocks, assigned to the C group of operations, wait for the receiver block to signal that all incoming data have been received.
• An inter-block barrier scheme [36] is used to synchronize the receiver and the taskC blocks, as explained in Fig. 12, where thread 0 of each taskC block waits for thread 0 of the receiver block to set a global memory variable to 1, whereas the remaining threads move to the __syncthreads() barrier. To prevent a waste of CUDA blocks waiting for the receiver, TaskA, TaskB and the send should always be executed before TaskC.
• When this happens (after the receive completion), all threads 0 in the taskC blocks will reach the matching __syncthreads() barrier and then start to unpack the received data.

² There is no guarantee about the order in which blocks are scheduled by the GPU HW.
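The CUDA sketch below illustrates the block-role dispatch and the inter-block synchronization just described, with a single sender block for simplicity. The cq_word, doorbell, recv_done pointers, the db_val/expected_cqe values and the task bodies are placeholders, and the scheme assumes that all blocks are co-resident on the GPU, so this is only an illustrative sketch of the KI mechanism, not the kernel used in the benchmarks.

```cuda
// Device-global dispatch counter; must be reset to 0 (e.g. with cudaMemset)
// before every launch of the kernel below.
__device__ int dispatch_counter;

// Illustrative KI-model kernel: block roles are picked dynamically with an
// atomic, the receiver block polls the mapped CQ word, the sender block packs
// and rings the doorbell, the remaining blocks run taskC after the receiver
// raises a global flag (inter-block barrier).
__global__ void ki_exchange(volatile unsigned int *cq_word,  // mapped recv CQE word
                            unsigned int expected_cqe,       // value the HCA will write
                            unsigned int *doorbell,          // mapped HCA doorbell (GPU peer mapping)
                            unsigned int db_val,             // doorbell value prepared by the CPU
                            volatile int *recv_done)         // global flag, initially 0
{
    __shared__ int role;
    if (threadIdx.x == 0)
        role = atomicAdd(&dispatch_counter, 1);  // dynamic block-role dispatch
    __syncthreads();

    if (role == 0) {                             // receiver block
        if (threadIdx.x == 0) {
            while (*cq_word < expected_cqe) ;    // poll the CQ word written by the HCA
            __threadfence();
            *recv_done = 1;                      // release the taskC blocks
        }
    } else if (role == 1) {                      // taskA + send block
        // ... taskA body: pack the send buffer ...
        __syncthreads();                         // make sure packing is complete
        if (threadIdx.x == 0)
            *doorbell = db_val;                  // trigger the pre-built send WQE
    } else {                                     // taskC blocks
        if (threadIdx.x == 0)
            while (*recv_done == 0) ;            // wait for the receiver block
        __syncthreads();                         // remaining threads wait here
        // ... taskC body: unpack the received data ...
    }
}
```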
For the same class of applications introduced in Sections 4.1 and 4.2, the KI model execution time can be estimated as:

T_{CPU_{KI}} = R \times L_K

T_{GPU_{KI}} = R \times \left\{ \sum_{i=1}^{X} (A_i + S_i) + \sum_{k=1}^{Z} (W_k + C_k) + \sum_{j=1}^{Y} B_j - O_{KI} \right\}

O_{KI} = {\rm Overlap}\left( \left[\sum_{i=1}^{X} (A_i + S_i)\right], \left[\sum_{k=1}^{Z} (W_k + C_k)\right], \left[\sum_{j=1}^{Y} B_j\right], \#SM, {\rm dispatcher}, {\rm communication\ pattern}, \ldots \right)          (3)

T_{KI} = \max(T_{CPU_{KI}}, T_{GPU_{KI}})

where A_i + S_i is the time spent by the sender blocks to execute tasks A plus the send, and W_k + C_k is the time spent to wait for data and execute tasks C; B_j represents other (possible) tasks not related to communications (like working on internal structures), which can be performed by any type of block (A blocks, C blocks or other blocks).

Empirically, we measured that T_{CPU_{KI}} is always negligible, therefore we can consider T_{KI} = T_{GPU_{KI}}.

When running on a GPU, multiple blocks of a CUDA kernel can be executed concurrently, therefore the execution of tasks A + S,

tasks W + C and tasks B can overlap. To represent this magnitude, we defined Overlap, a non-trivial function representing the overlap time among all the tasks, considering several input parameters like the number of GPU SMs, the task dispatching algorithm, the particular communication patterns, the computation times of tasks A, B and C, etc.

The gain of the KI model with respect to the SA model can be described as:

T_{KI} = T_{SA} - O_{KI}.

The best scenario is when all tasks fit in the logical number of CUDA blocks available on the GPU and tasks overlap for most of the execution:

\sum (A + S) + \sum (W + C) + \sum (B) \simeq 2 \times O_{KI}.

On the contrary, the worst scenario is when each of the tasks requires a high number of blocks, decreasing the importance of the time overlap:

\sum (A + S) + \sum (W + C) + \sum (B) \gg 2 \times O_{KI}

which means that the A, B and C tasks will be executed almost sequentially, as in the SA model, and the overlap in Formula (3) does not represent a real improvement with respect to Formula (2).

In Section 6, we explain when it is more convenient to use the KI model or the SA model.

5. Micro-benchmarks

In this section we discuss a few micro-benchmarks in the three variations, i.e. standard MPI, the SA model and the KI model. We start with the ping-pong latency benchmark, based on point-to-point (send–receive) communications. Then we take a quick diversion to analyze the performance of waiting in the SA model. Finally, we consider a variation of the first benchmark using one-sided primitives. These tests are designed to expose both the communication latency and the CUDA kernel launch overhead. Here the size of the messages and the duration of the kernels are treated as independent parameters, even though most of the time in real applications they are closely related, e.g. in 3D stencil applications communications are O(L²) while computations are O(L³).

For all the micro-benchmarks we run 1000 warm-up iterations and take 1000 samples. The test environment consists of two Supermicro servers, each hosting a single Tesla K40m GPU (using the boosted clock at 875 MHz) and a Mellanox Connect-IB card (56 Gbps bandwidth). The MPI implementation is OpenMPI v1.10.7.

5.1. Ping-pong latency benchmark

Algorithm 2 depicts a simple ping-pong test between two MPI processes, rank0 and rank1, exchanging some data placed in host memory and optionally executing a constant-time CUDA kernel in order to simulate a GPU computation consuming the data received or preparing the data to be sent.

Algorithm 2 Ping-pong latency with kernel compute
1: procedure PingPongKernel(myRank, msg_size, calc_size, use_kernel, communication_type)
2:    for i=0 to NUM_ITERS do
3:        post_recv(msg_size);
4:    end for
5:    for i=0 to NUM_ITERS do
6:        if myRank == rank0 then
7:            wait_recv(rank1);
8:            if use_kernel == 1 then
9:                calc_kernel<<< ... >>>(calc_size);
10:               if communication_type != SA then
11:                   cudaDeviceSynchronize();
12:               end if
13:           end if
14:           send(rank1, buf, msg_size);
15:       else
16:           send(rank0, buf, msg_size);
17:           wait_recv(rank0);
18:           if use_kernel == 1 then
19:               calc_kernel<<< ... >>>(calc_size);
20:               if communication_type != SA then
21:                   cudaDeviceSynchronize();
22:               end if
23:           end if
24:       end if
25:   end for
26: end procedure

In Fig. 13, we considered the MPI, SA model and KI model cases without the CUDA kernel, to get the baseline ping-pong latency.

Because there is no computation involved in this case, i.e. no GPU kernels are launched, and the data buffers are in CPU memory, the MPI results are in line with expectations, i.e. in the order of a microsecond for the half round-trip latency — note that the full round-trip latency is plotted in the figure. We note that traditional CPU-driven communications are faster for small message sizes. This is partly due to overheads in the GPU communication path,³ which show up as a constant per-operation overhead of roughly 2.5 µs. Latency grows linearly as the message size increases, but the latency of the SA model (red circles) is more irregular than both the MPI (blue squares) and the KI model (green triangles), with piece-wise constant periods interleaved with sudden peaks; to clarify this behavior, we need to explain how communications are carried out by the CUDA stream.

5.2. Notification waiting in the SA model

As described in Section 3.2, in the case of the SA model, multiple cuStreamWriteValue32() MemOps are used to trigger the send operation, while cuStreamWaitValue32() is used to stall the CUDA stream while waiting for both receive and send completions. During the wait, the GPU front-end unit is responsible for polling the CQE associated to each communication operation. The polling frequency depends on the GPU internal stream scheduling. We define the Wait Reaction Time (WRT) as the time between the update of the CQE, by the HCA in our case, and the GPU observing the updated value.

To estimate the WRT on our K40m GPUs, we wrote a standalone micro-benchmark, poll-latency, where a pinned host buffer is updated by the CPU, simulating the CQE update by the HCA, and polled using the CUDA MemOps APIs on a single stream, as shown in the interaction diagram in Fig. 14. Specifically, in a loop, the CPU calls cuStreamWaitValue32(stream, i, ptr1, GEQ) and cuStreamWriteValue32(stream, i, ptr2), then sleeps for ∆t microseconds – to let the associated commands be fetched by the GPU front-end unit – sets *ptr1=i and then measures the time spent before observing *ptr2==i.

In Fig. 15 we plot WRT = WRT(∆t) for the Tesla K40m, where it appears to be a sawtooth function of the sleep time with a period of 12 µs. Repeating the test with N_{AS} multiple active CUDA streams, each actively polling, shows one additional microsecond per stream:

WRT(∆t, N_{AS}) = WRT_{min} + F(∆t) + N_{AS}.          (4)

³ The expensive GPU fence operations, required when injecting work on the HCA, as explained in Section 3.2, as well as the use of non-inlined HCA commands.
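A minimal version of the poll-latency loop just described could look as follows; it uses the CUDA driver API directly, while now_us() and sleep_us() stand for host-side timing helpers and host1/host2 are pinned host words previously registered and mapped as dev1/dev2, so this is a sketch of the measurement idea rather than the exact benchmark code.

```cuda
#include <cuda.h>
#include <stdint.h>

/* Sketch of the poll-latency (WRT) measurement: the CPU plays the role of the
 * HCA by updating *host1, while the stream waits on dev1 and echoes into dev2.
 * now_us() and sleep_us() are placeholder host helpers. */
double measure_wrt(CUstream stream,
                   volatile uint32_t *host1, CUdeviceptr dev1,
                   volatile uint32_t *host2, CUdeviceptr dev2,
                   unsigned int delta_t_us, int iters)
{
    double total = 0.0;
    for (uint32_t i = 1; i <= (uint32_t)iters; i++) {
        /* Enqueue: wait until *ptr1 >= i, then write i into *ptr2. */
        cuStreamWaitValue32(stream, dev1, i, CU_STREAM_WAIT_VALUE_GEQ);
        cuStreamWriteValue32(stream, dev2, i, CU_STREAM_WRITE_VALUE_DEFAULT);

        sleep_us(delta_t_us);        /* let the GPU front-end fetch the commands */

        double t0 = now_us();
        *host1 = i;                  /* simulate the CQE update by the HCA */
        while (*host2 != i) ;        /* wait for the GPU to observe and echo it */
        total += now_us() - t0;      /* one WRT sample */
    }
    return total / iters;
}
```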

Fig. 13. Ping-Pong microbenchmark with MPI, SA model and KI model, communications only. Full round-trip time is plotted. 1000 iterations, excluding warmup. K40m GPU,
Mellanox Connect-IB HCA, FDR IB switch. Open MPI default eager limit is 12 kB.

Fig. 14. The poll-latency micro-benchmark, used to measure the GPU Stream Wait Reaction Time (WRT).

Fig. 15. Wait Reaction Time curve, GPU Tesla K40m, single active CUDA stream.

This is compatible with the GPU front-end unit circulating among the active streams, each polling taking one microsecond, with a pause of ten microseconds at the end of every cycle.

To qualitatively show how WRT influences the SA model latency in the ping-pong benchmark, we measured the IB Verbs send latency at varying message size, see the red line in Fig. 16, using the standard ib_send_lat test (Mellanox Perftest 5.6 package [26]) on a couple of Supermicro servers. The blue line is our previous measurement of WRT plotted on a time scale rather than the message size, using the equivalence 1 µs ≃ 6 kB, i.e. assuming 6 GB/s for the bandwidth of the Connect-IB used in this experiment. The yellow line in Fig. 16 is the sum of the blue (WRT) and the red (send latency) lines and is qualitatively similar to the SA model ping-pong latency in Fig. 13. We conclude that the piece-wise constant behavior of the SA model in Fig. 18 is related to the polling pattern of the GPU front-end unit.

5.2.1. Polling on newer GPU architectures
Our experiments show that WRT varies across different GPU architectures. Running the poll-latency test on a GPU with the Maxwell architecture shows results similar to those in Fig. 15, while on the newer Pascal architecture we obtained better results, as shown in Fig. 17.

Fig. 16. Ping-pong qualitative dependence on WRT.

Fig. 17. Wait Reaction Time curve trend, GPU Tesla P100, Pascal architecture.

Fig. 18. Ping-Pong micro-benchmark with MPI, SA model and KI model, communications and computation, Round-Trip Time latency.

5.3. Ping-pong latency with GPU compute

In Fig. 18 we plot the round-trip latency when enabling a ∼5 µs CUDA kernel computation, in the three cases considered.

The overall performance depends on both the GPU computation and the communication:

• MPI: as visible in the profiler timeline – Fig. 19 – rank0 waits for the receive completion, launches the CUDA kernel, synchronizes with the CUDA stream and finally sends the data.
• SA model: in Fig. 20, all communications and CUDA kernels are posted on the GPU stream and executed about 3 ms later. CPU and GPU work completely overlap.

• KI model: here we fuse together the polling on the receive Algorithm 3 HPGMG-FV, communication function, MPI version
completion, the constant time compute and the trigger for 1: procedure exchangeLevelBoundariesMPI(...)
the send into a new single kernel. That fused kernel is 2: cudaDeviceSynchronize();
launched multiple times in a loop. From the profiler point 3: for peer ∈ {0, . . . , numPeers − 1} do
of view, there is only a sequence of kernels on the CUDA 4: MPI_Irecv(recvBuf, ..., recvReqs[p]);
Stream. 5: end for
6: pack_kernel<<< ..., stream >>>(sendBuf, ...);
We note that the KI model is the best performing. Besides 7: cudaDeviceSynchronize();
the piece-wise constant behavior noticed for the SA model is not 8: ▷ Overlap between send and local kernel
present in this case. 9: for peer ∈ {0, . . . , numPeers − 1} do
10: MPI_Isend(sendBuf, ..., sendReqs[p]);
6. Applications benchmarks 11: end for
12: interior_kernel<<< ..., stream >>>(...);
GPUDirect Async is a new experimental technology, therefore 13: MPI_Waitall(recvReqs, ...);
we need to study its behavior into real MPI applications in order 14: unpack_kernel<<< ..., stream >>>(recvBuf, ...);
to verify if and when GPUDirect Async can improve performance. 15: MPI_Waitall(sendReqs, ...);
We choose several different multi-GPU MPI applications (with the 16: end procedure
constraint of a single GPU for a single process) having the commu-
nication periods similar to the one described in Section 4, replacing
the synchronous communication calls with the Async calls, testing
both SA and KI models. For our benchmarks we used the Wilkes 6.1.1. SA model
cluster [35] in Cambridge University (UK). The system consists The host code between CUDA kernels and communications
of 128 Dell T620 servers, 256 NVIDIA Tesla K20c GPUs intercon- consists of a simple cudaDeviceSynchronize() after the Pack kernel
nected by 256 Mellanox Connect IB cards, having the CUDA 8.0 that can be easily removed (negligible TH value); in addition, com-
Toolkit, and OpenMPI version 1.10.3. At the time of benchmarks, munication parameters are known at posting time. Therefore the
LibGDSync was on top of OFED 3.4 with an additional patch (re- SA model can be applied to HPGMG-FV and, considering Formula
cently included in OFED 4.0) to enable GPUDirect Async in case of IB (2), the constant values are (C3 condition):
Verbs.
R > 0, X = Z = Y = 1.
6.1. HPGMG-FV CUDA We modified the exchangeLevelBoundariesMPI function in Al-
gorithm 3 with the exchangeLevelBoundariesSA in Algorithm 4 .
HPGMG [20] is an HPC benchmarking effort developed by In case of InfiniBand communications, if a send is posted but the
Lawrence Berkeley National Laboratory. It is a geometric multi- corresponding receive is not ready yet, there are some delays
grid solver for elliptic problems using Finite Volume (FV) [13] and during communications. For this reason here we used the one-
Full Multigrid (FMG) [14] methods where the solution to a hard sided asynchronous call mp_iput_on_stream to ensure that each
problem (a continuous problem represented by a finer grid of ele- asynchronous send has the corresponding receive buffer posted by
ments by means of discretization) is expressed as a solution to an the opposite peer.
easier problem (coarser grid of element). In case of multi-process
execution, the workload is fairly distributed among processes: in
order to improve the parallelization, each problem level is divided Algorithm 4 HPGMG-FV, communication function, SA model ver-
into several same-size boxes. HPGMG-FV takes as input the num- sion
ber and the log2 (size) of the finest level boxes and calculates the 1: procedure exchangeLevelBoundariesSA(...)
size of all the other (smaller) levels. In HPGMG-FV CUDA version 2: for peer ∈ {0, . . . , numPeers − 1} do
(improved by NVIDIA [33]) finer levels are processed by GPU while 3: mp_irecv(recvBuf, peer, ..., recvReqs[p]);
4: mp_iput_on_stream(remoteAck[peer], ..., stream);
coarser levels by the CPU (according to a threshold on the level’s
5: end for
number of elements). Considering a multi-process execution of
6:
HPGMG-FV CUDA the main and most used communication func-
7: pack_kernel<<< ..., stream >>>(sendBuf, ...);
tion, in case of GPU finer levels, follows a 2D Stencil pattern that is 8:
similar to the one described in Section 4.1: 9: for peer ∈ {0, . . . , numPeers − 1} do
10: mp_wait_word_on_stream(localAck[peer], ..., stream);
1. Pack: a single CUDA kernel which stores its result data into
11: mp_isend_on_stream(sendBuf, ..., sendReqs[p], stream);
the send buffers (type A task)
12: end for
2. Send: Synchronize with the CUDA stream and send result
13:
data
14: interior_kernel<<< ..., stream >>>(...);
3. Interior computation: a single CUDA kernel working on 15: mp_wait_all_on_stream(recvReqs, ..., stream);
internal structures (type B task) 16: unpack_kernel<<< ..., stream >>>(recvBuf, ...);
4. Receive: Receive data from other processes 17: mp_wait_all_on_stream(sendReqs, ..., stream);
5. Unpack: a single CUDA kernel computation on received data 18: end procedure
(type C task)
The SA model communication pattern is represented in Fig. 23
In Algorithm 3 there is the MPI pseudo-code of this communi- and it appears on the CUDA Visual Profiler as in Fig. 24: the CPU
cation function. posts tasks but they are executed several microseconds later.
We applied both the SA and KI models, modifying the HPGMG-FV CUDA algorithm according to the models described in Sections 4.2 and 4.3.

6.1.1. SA model

The SA model can be applied to HPGMG-FV and, considering Formula (2), the constant values are (C3 condition):

R > 0, X = Z = Y = 1.

We modified the exchangeLevelBoundariesMPI function of Algorithm 3 into the exchangeLevelBoundariesSA of Algorithm 4. In case of InfiniBand communications, if a send is posted but the corresponding receive is not ready yet, there are some delays during communications. For this reason, here we used the one-sided asynchronous call mp_iput_on_stream to ensure that each asynchronous send has the corresponding receive buffer posted by the opposite peer.

Algorithm 4 HPGMG-FV, communication function, SA model version
1: procedure exchangeLevelBoundariesSA(...)
2:   for peer ∈ {0, ..., numPeers − 1} do
3:     mp_irecv(recvBuf, peer, ..., recvReqs[p]);
4:     mp_iput_on_stream(remoteAck[peer], ..., stream);
5:   end for
6:
7:   pack_kernel<<< ..., stream >>>(sendBuf, ...);
8:
9:   for peer ∈ {0, ..., numPeers − 1} do
10:    mp_wait_word_on_stream(localAck[peer], ..., stream);
11:    mp_isend_on_stream(sendBuf, ..., sendReqs[p], stream);
12:  end for
13:
14:  interior_kernel<<< ..., stream >>>(...);
15:  mp_wait_all_on_stream(recvReqs, ..., stream);
16:  unpack_kernel<<< ..., stream >>>(recvBuf, ...);
17:  mp_wait_all_on_stream(sendReqs, ..., stream);
18: end procedure

The SA model communication pattern is represented in Fig. 23 and it appears on the CUDA Visual Profiler as in Fig. 24: the CPU posts tasks but they are executed several microseconds later.

Considering a 4-process execution having 8 boxes with log2(size) = 4, only the finest level runs on the GPU; in that case, using the NVIDIA Visual Profiler, we found that moving from MPI to the SA model led to a decrease of about 64% of the TCPUSA work time (due to TH → 0) with respect to TCPUS, and that the TGPUSA computation was 49% longer than the TCPUSA in completing its tasks (condition C1 is satisfied); furthermore, we measured a decrease of about 28% of the TGPUSA time (condition C2 is satisfied).
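Mapping Algorithm 4 onto host code gives a structure like the sketch below. Only the call names (mp_irecv, mp_iput_on_stream, mp_wait_word_on_stream, mp_isend_on_stream, mp_wait_all_on_stream) are taken from Algorithm 4; the request type and the argument lists are simplified assumptions, since the real LibMP API [24] also takes registered memory regions. The placeholder kernels are the same ones used in the synchronous sketch above.

// SA-model exchange in the spirit of Algorithm 4. The mp_* primitives and
// the mp_request_t type come from LibMP [24], but their signatures are
// simplified here for illustration (the real calls also take registered
// memory keys/regions).
static void exchange_boundaries_sa(int numPeers, int msgBytes,
                                   char **sendBuf, char **recvBuf,
                                   mp_request_t *recvReqs, mp_request_t *sendReqs,
                                   uint32_t *localAck, uint32_t *remoteAck,
                                   cudaStream_t stream)
{
    for (int p = 0; p < numPeers; ++p) {
        // Post the receive on the CPU, non-blocking ...
        mp_irecv(recvBuf[p], msgBytes, p, &recvReqs[p]);
        // ... and enqueue a one-sided put of an ack flag, so the peer knows
        // this receive buffer is posted before it triggers its send.
        mp_iput_on_stream(&remoteAck[p], p, stream);
    }

    pack_kernel<<<128, 64, 0, stream>>>(sendBuf[0]);        // type A task

    for (int p = 0; p < numPeers; ++p) {
        // The stream, not the CPU, waits for the peer's ack to arrive ...
        mp_wait_word_on_stream(&localAck[p], stream);
        // ... then triggers the send, still in stream order.
        mp_isend_on_stream(sendBuf[p], msgBytes, p, &sendReqs[p], stream);
    }

    interior_kernel<<<128, 64, 0, stream>>>();              // type B task overlaps

    mp_wait_all_on_stream(numPeers, recvReqs, stream);      // stream-ordered waits
    unpack_kernel<<<128, 64, 0, stream>>>(recvBuf[0]);      // type C task
    mp_wait_all_on_stream(numPeers, sendReqs, stream);
}

All of these operations are enqueued by the CPU in one shot; as Fig. 24 shows, the CPU then runs ahead while the GPU and the NIC consume the queued work several microseconds later.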
Fig. 19. Rank 0, Ping-Pong CUDA Visual Profiler with MPI plus CUDA kernel.
Fig. 20. Rank 0, Ping-Pong CUDA Visual Profiler with Async plus CUDA kernel.
Fig. 21. HPGMG-FV, communication function timeline, MPI version.
Fig. 22. HPGMG-FV, communication function on CUDA Visual Profiler, MPI version.
Fig. 23. HPGMG-FV, SA model, communication pattern timeline.
Fig. 24. HPGMG-FV, SA model, CUDA Visual Profiler.
Fig. 25. HPGMG-FV SA implementation time gain with respect to MPI, GPU levels only, up to 16 processes, weak-scaling.
Fig. 26. HPGMG-FV communication pattern, KI model version.

In Fig. 25 the Y axis shows the performance increase of the GPU levels in case of this SA implementation compared with the standard MPI version on the Wilkes cluster, using up to 16 processes. The maximum gain (about 24%) is reached in case of log2(size) = 4 with 2 processes, while the performance gain decreases when increasing the size of the input level, for the following reasons:

• Message size grows with the box size, therefore communication overhead becomes less important. For large sizes, all communication methods should converge to the same performance level.
• HPGMG-FV weak-scales, because the number of boxes per rank is always the same for each input size.
• Increasing the size and number of levels increases the CUDA kernels' workload (i.e. computation time), decreasing the GPU idle time; thus replacing a small amount of GPU idle time with communications on the CUDA stream should not result in a significant performance improvement (C2 condition). Furthermore, if the communication time is greater than the idle time, there could be a negative gain.
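The last bullet can be restated compactly in the notation that Section 6.3.4 later uses for the same condition (an informal restatement, not the full C2 formula of Section 4): the stream-ordered sends (TS) and waits (TW) that replace the GPU idle time only pay off if

[∑(TS) + ∑(TW)]SA ≤ [Tidle]S,

otherwise the SA version can end up slower than the synchronous baseline; the BFS port of Section 6.3 is an example where this inequality is reversed.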

Fig. 27. HPGMG-FV KI implementation time gain with respect to MPI, GPU levels only, up to 16 processes.

6.1.2. KI model
According to the previous observations, the Pack kernel is a type A task, the Unpack kernel is a type C task and the Interior kernel is a type B task. Therefore we modified each communication period to use a single kernel, organizing the CUDA blocks as in Fig. 26 (in Fig. 26, "Receive" combines the receive and the ack put posted before the wait, while "Send" combines the ack wait posted before the send, as in the SA model).

The type B task is computed by the type A blocks after the send operation, in order to overlap with the Wait and Unpack operations. Considering the previous 4-process execution with 8 boxes and log2(size) = 4, in the SA model the sum of the CUDA blocks required by the Pack, Interior and Unpack kernels is 193, each one having 64 threads. Each thread needs about 37 registers and no shared memory is required. The Wilkes cluster has Tesla K20 GPUs, with 13 SMs and a maximum of 2048 threads per SM, which means that all the 193 logical CUDA blocks can be executed concurrently and all tasks can overlap for most of the time, leading to the best KI model scenario (Section 4.3). Observing with the NVIDIA Visual Profiler, in case of the KI model implementation we obtained a reduction of the TCPUKI and TGPUKI times, as summarized in Table 2.

Table 2
HPGMG-FV KI model time reduction.

Compared to    CPU time    GPU time
MPI            77%         32%
Async          37%         4%

We plot in Fig. 27 the KI model implementation performance gain. The maximum gain is 26% in case of 2 processes with log2(size) = 4 box size and, generally speaking, the performance is better than the SA model performance.
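The concurrency claim above is simple arithmetic; the sketch below spells it out for the K20 numbers quoted in the text. A production check would instead query the device properties and use cudaOccupancyMaxActiveBlocksPerMultiprocessor(), and would also account for the register budget (about 37 registers per thread here), which this toy calculation ignores.

#include <stdio.h>

// Back-of-the-envelope residency check for the fused KI kernel on a Tesla K20.
// The constants are the ones quoted in the text above.
int main(void)
{
    const int numSMs          = 13;    // Tesla K20
    const int maxThreadsPerSM = 2048;
    const int kiBlocks        = 193;   // Pack + Interior + Unpack blocks
    const int threadsPerBlock = 64;

    int required  = kiBlocks * threadsPerBlock;   // 12352 threads
    int available = numSMs * maxThreadsPerSM;     // 26624 threads

    printf("required %d, available %d -> %s\n", required, available,
           required <= available ? "all KI blocks can be resident at once"
                                 : "KI blocks must be serialized");
    return 0;
}

The same check explains why the transformation fails for CoMD-CUDA in Section 6.2.2 below: roughly 1700 blocks of 128 threads need far more resident threads than a single K20 provides.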
6.2. CoMD-CUDA

CoMD is one of several proxy-apps developed as part of the ExMatEx project [9] and is a reference implementation of typical classical molecular dynamics algorithms. In particular, it considers materials where the inter-atomic potentials are short range and the simulation requires the evaluation of all forces between atom pairs within the cutoff distance; that is, considering a multi-node distributed execution, the atoms on each node are assigned to cubic link cells and each atom only needs to test all the other atoms within its own cell and the 26 neighboring link cells in order to guarantee that it has found all possible interacting atoms. Only two different potentials are implemented: the Lennard-Jones (LJ) and the Embedded Atom Method (EAM). NVIDIA enhanced the original CoMD implementation with CUDA (see [7] and [9]). Considering the EAM potential, there are two different communication functions, force exchange and atoms exchange, where CoMD-CUDA repeats 3 times (for the x, y and z dimensions) the following steps: load data into 2 different buffers (one for the atoms' momenta and one for the atoms' positions) by means of 2 CUDA kernels (tasks A), synchronize with the GPU and send data, execute 2 CUDA kernels (tasks B), receive data and process them using 2 CUDA kernels (tasks C).

6.2.1. SA model
Applying the SA model, the values of the constants in Formula (2) are (C3 condition):

R = 3 → low number of sequential communication periods
X = 2 → 2 CUDA kernels and 2 sends
Y = 2 → 2 CUDA kernels between send and receive            (5)
Z = 2 → 2 CUDA kernels and 2 receives.

We considered executions with 4, 8 and 16 processes having a distributed set of 2,048,000 atoms. The performance gain of SA over MPI in Fig. 28 refers to communication periods only.

Fig. 28. CoMD-CUDA time gain, SA model on MPI, communication periods only, Wilkes cluster, weak-scaling.

Removing all the synchronizations with the CUDA stream and transforming some mandatory host code into CUDA kernel code (halving the computation time), we obtained a negligible TH, satisfying the C1 condition: considering the 4-process run, there is a difference of about 27 ms between the CPU launch of the GPU tasks and their actual execution on the CUDA stream. This behavior led to a negligible Tidle time. TGPUSA (considering the case of 4 processes) is about 40% lower than TGPUS (measured with the NVIDIA profiler), satisfying condition C2.

The overall CoMD-CUDA performance is dominated by the CUDA kernels computing the force between the atom pairs, whereas communications play a marginal role (about 7% of the total time in case of 16 processes). For this reason, the large improvement of the SA model in communications (between 25% and 35% time gain) does not lead to a remarkable improvement of the CoMD-CUDA total time (just about 5%), according to the C2 time gain condition. The aim of this experiment is to provide another example of the advantage of using GPUDirect Async communications.

6.2.2. KI model
The communication pattern of CoMD-CUDA is similar to the HPGMG-FV one, therefore it is easy to move communications into a single CUDA kernel as described before. Considering the benchmarks in the SA model, the problem resides in the number of blocks required by the KI CUDA kernel: N (task A) and M (task C) are very high numbers (an average of about 850 blocks), leading to the worst scenario for the KI model: the new KI kernel will require at least 1700 CUDA blocks, each one having 128 threads. Considering a Tesla K20 with a maximum of 2048 threads per SM, a fully parallel execution would require more than 100 SMs (the value of the Overlap function is very small). For this reason, our KI model implementation of CoMD-CUDA gave results similar to the SA model.

6.3. BFS

M. Bisson et al. in [5] implemented a parallel distributed Breadth First Search algorithm on multi-GPU systems for large graphs, represented by the adjacency matrix and partitioned by means of a 2D decomposition. The graph is partitioned among the computing nodes by assigning to each one a subset of the original vertex and edge sets. The search is performed in parallel, starting from the processor owning the root vertex. During each step of the main loop, processors handling one or more frontier vertices follow the edges connected to them to identify unvisited neighbors. The reached vertices are then exchanged in order to notify their owners and a new iteration begins. The search stops when the connected component containing the root vertex has been completely visited. They adopted a technique, based on a fixed-size bitmap, to reduce the size of the exchanged messages when the message size exceeds a pre-defined threshold.

During each iteration of the main loop, every rank executes the following steps:

1. exchange vertices in the frontier with MPI point-to-point primitives;
2. by means of CUDA kernels, CUDA primitives and CUB functions [10], get the neighbor vertices that have not been visited yet and put them into the next send buffer;
3. exchange vertices in the frontier with MPI point-to-point primitives;
4. by means of CUDA kernels, CUDA primitives and CUB functions, update the frontier;
5. evaluate the number of remaining vertices to be visited by means of the collective MPI_Allreduce function. If the number is 0, exit from the loop.

We faced several issues in porting the MPI version to the corresponding SA version.

6.3.1. Main loop iterations number
The number of iterations of the main loop is unknown (the R parameter in the performance model formulas, Section 4), because the loop stops only when the condition of step 5 is satisfied. Thus, even substituting the MPI_Allreduce with several point-to-point LibMP functions, a synchronization is needed at the end of each iteration in order to check the exit condition. This issue greatly reduces the asynchrony (C1 condition), having R = 1, X = 1, Z = 1 (C3 condition).
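A schematic of the loop just described makes the constraint explicit. Apart from MPI_Allreduce, the function names below are placeholders for steps 1-4, not the actual code of [5].

#include <mpi.h>
#include <stdint.h>

// Placeholders for the per-iteration work of steps 1-4.
void exchange_frontier(void);          // steps 1 and 3: point-to-point exchanges
int64_t expand_frontier_on_gpu(void);  // step 2: CUDA/CUB work, returns a local count
void update_frontier_on_gpu(void);     // step 4: CUDA/CUB work

// BFS main loop as described above. Step 5 reduces a value produced during the
// current iteration, so the CPU cannot enqueue iteration i+1 before iteration i
// has completed: effectively a single asynchronous period per iteration (R = 1).
void bfs_main_loop(MPI_Comm comm)
{
    int64_t remaining = 1;
    while (remaining > 0) {
        exchange_frontier();                       // step 1
        int64_t local = expand_frontier_on_gpu();  // step 2
        exchange_frontier();                       // step 3
        update_frontier_on_gpu();                  // step 4

        // step 5: global number of vertices still to visit; this collective
        // blocks the CPU every iteration and breaks the "post many periods
        // ahead" model that GPUDirect Async relies on.
        MPI_Allreduce(&local, &remaining, 1, MPI_INT64_T, MPI_SUM, comm);
    }
}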
6.3.2. Computation parameters
Steps 2 and 4 use a number of CUDA tasks, like kernels and synchronous primitives, to compute at run-time some values needed to start the next computation tasks. For the reasons explained in Section 3, this represents an issue for Async. In Algorithm 5 we reproduce an example of the parameters pA and pB computed at run-time.

Algorithm 5 BFS synchronous pseudo-code
1: int pA, pB;
2: ....
3: cudaKernel1<<<...>>>(deviceBuffer, ...);
4: cudaMemcpy(&pA, deviceBuffer[bufferLength], ...);
5: cudaKernel2<<<pA, ...>>>(pA, ...);
6: cub::DeviceScan::ExclusiveSum(deviceBuffer, pA, ...);
7: cudaMemcpy(&pB, deviceBuffer[pA], ...);
8: cudaKernel3<<<pB, ...>>>(pA, pB, ....);

To overcome this issue we needed to oversize the number of items in the CUB calls to the total number of elements of the deviceBuffer and to fix the CUDA kernels' grid size (according to the Tesla K20 hardware), increasing the computation overhead as described in Algorithm 6.
Algorithm 6 BFS asynchronous pseudo-code
1: int * pA, * pB;
2: allocateGPUMemory(pA, 1, sizeof(pA));
3: allocateGPUMemory(pB, 1, sizeof(pB));
4: ....
5: cudaKernel1<<<...>>>(deviceBuffer, ...);
6: cudaMemcpyAsync(pA, deviceBuffer[bufferLength], ...);
7: cudaKernel2<<<MAX_BLOCKS, MAX_THREADS>>>(pA, ...);
8: cub::DeviceScan::ExclusiveSum(deviceBuffer, bufferLength, ...);
9: cudaKernel4<<< 1, 1 >>>(deviceBuffer, pA, pB);
10: cudaKernel3<<<MAX_BLOCKS, MAX_THREADS>>>(pA, pB, ....);
11: ....
12: function cudaKernel4(deviceBuffer, pA, pB)
13:   pB[0] = deviceBuffer[pA[0]];
14: end function

The final cudaMemcpy() of the synchronous version (row 7 of Algorithm 5) has been transformed into cudaKernel4() (row 9 of Algorithm 6).
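The same transformation can be reproduced in plain CUDA. The toy example below is not the BFS code; it only shows the pattern of Algorithm 6: a producer kernel leaves a size parameter in device memory, and the consumer is launched with a fixed, oversized grid that grid-strides over the real element count, so no device-to-host copy is needed between the two kernels and the whole sequence can stay on the stream.

#include <cstdio>
#include <cuda_runtime.h>

#define MAX_BLOCKS  256
#define MAX_THREADS 128

// Producer: computes how many items the next kernel must process and leaves
// the count in device memory (n / 2 is just a stand-in for the real logic).
__global__ void producer(int n, int *count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *count = n / 2;
}

// Consumer: launched with a fixed, oversized grid; it reads the count from
// device memory and grid-strides over it, so surplus threads simply do nothing.
__global__ void consumer(const int *in, int *out, const int *count)
{
    int n = *count;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i] + 1;
}

int main(void)
{
    const int n = 1 << 20;
    int *in, *out, *count;
    cudaMalloc((void **)&in, n * sizeof(int));
    cudaMalloc((void **)&out, n * sizeof(int));
    cudaMalloc((void **)&count, sizeof(int));
    cudaMemset(in, 0, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both kernels are enqueued back to back: the CPU never reads the count,
    // so stream-ordered communication calls could directly follow them.
    producer<<<1, 32, 0, stream>>>(n, count);
    consumer<<<MAX_BLOCKS, MAX_THREADS, 0, stream>>>(in, out, count);
    cudaStreamSynchronize(stream);

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in); cudaFree(out); cudaFree(count);
    cudaStreamDestroy(stream);
    return 0;
}

In the real BFS port the CUB calls must additionally be oversized to the whole deviceBuffer, which is where the roughly 3× overhead reported in Section 6.3.4 comes from.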

6.3.3. Communication parameters
It is possible to use Async only when the fixed-length bitmap is used during communications, otherwise the size of the communication buffers is evaluated at run-time and is unknown at communication posting time (a LibMP requirement, Section 3.3).

6.3.4. SA model benchmark
Considering all those issues, the final SA implementation had no improvements, i.e. TGPUSA = TGPUS. That is, even if

[∑(TS) + ∑(TW)]SA ≤ [Tidle]S,

the C2 condition hypothesis about the times required by the computation tasks is not verified, because we measured (by means of the NVIDIA Visual Profiler) an increase of about 14% of the time of the most important CUDA kernel (due to a fixed and non-optimized grid size), whereas the CUB calls required about 3× the time required by the synchronous version (due to the oversizing of items), thus:

[∑(TS) + ∑(TW)]SA > [Tidle]S.

7. Conclusion and future work

GPUDirect Async is a technology enabling direct control paths between GPUs and 3rd party devices. It has been initially released with CUDA 8.0. InfiniBand network support for Async comes in the form of a new set of experimental OFED verbs (Mellanox OFED 4.0 in [18]), a mid-layer abstraction library, LibGDSync [17], and a sample message-passing library, LibMP [24]. All the applications presented in this paper are available on GitHub: HPGMG-FV CUDA Async in [21] and CoMD-CUDA Async in [8]. In summary we note that:

• GPUDirect Async allows for a new communication model for multi-GPU accelerated applications.
• GPUDirect Async is not necessarily faster, for example when the GPU idle time is greater than the communication time.
• The more the CPU can post several consecutive asynchronous communication periods, the greater the potential performance gain.

GPUDirect Async major drawbacks are:

• Communication parameters must be known before posting on the GPU stream.
• If the GPU is overloaded or if the idle time is smaller than the communication time, performance may actually decrease.

We remark here that GPUDirect Async is still under development. We are evaluating its efficacy in other domains, for example in combination with the CUDA Multi-Process Service (MPS), when multiple MPI processes share a single GPU. New optimizations are being explored, such as allocating InfiniBand elements (i.e. CQs) in GPU memory and offloaded collectives. We plan to analyze the interaction of GPUDirect Async with GPUDirect RDMA, explaining how to overcome issues like memory consistency [19]. Up to now we used only Mellanox hardware, therefore we will explore new types of connections. Moreover, we would like to test asynchronous applications using a higher number of nodes to better evaluate their scalability.

Given the design choices mentioned in Section 3.2, we expect to eventually be faced with scalability problems, e.g. the GPU having to poll O(N) CQs. In a sense we already faced the limitations of polling in the SA model. We are currently working on improved designs to relax the current constraints. We also plan to explore combining GPUDirect Async with advanced HW features of modern NICs, such as HW tag matching and MPI protocol offloading.

Acknowledgments

The authors would like to thank Filippo Spiga for the opportunity to experiment on the Wilkes cluster, and Massimo Bernaschi of the National Research Council of Italy for reviewing the paper. We also thank Nikolay Sakharnykh for useful discussions and David Fontaine for major contributions to the CUDA MemOp APIs.

References

[1] E. Agostini, D. Rossetti, S. Potluri, Offloading communication control logic in GPU accelerated applications, in: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Madrid, Spain, May 14-17, 2017.
[2] R. Ammendola, et al., APEnet+: a 3D Torus network optimized for GPU-based HPC systems, J. Phys. Conf. Ser. 396 (2012), IOP Publishing.
[3] R. Ammendola, et al., GPU peer-to-peer techniques applied to a cluster interconnect, in: CASS 2013 Workshop at the 27th IEEE International Parallel & Distributed Processing Symposium, IPDPS, 2013.
[4] R. Ammendola, et al., NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs, J. Instrum. 9 (2014).
[5] M. Bisson, M. Bernaschi, E. Mastrostefano, Parallel distributed breadth first search on the Kepler architecture, IEEE Trans. Parallel Distrib. Syst. 27 (7) (2016).
[6] Chelsio GDR. http://www.chelsio.com/gpudirect-rdma.
[7] CoMD. https://github.com/ECP-copa/CoMD.
[8] CoMD-CUDA Async on GitHub. https://github.com/e-ago/CoMD-CUDA-Async.
[9] CoMD-CUDA code. https://github.com/NVIDIA/CoMD-CUDA.
[10] CUB CUDA. https://nvlabs.github.io/cub.
[11] CUDA-Aware MPI. https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi.
[12] F. Daoud, A. Watad, M. Silberstein, GPUrdma: GPU-side library for high performance networking from GPU kernels, in: Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, Article No. 6.
[13] Finite Volume method. https://en.wikipedia.org/wiki/Finite_volume_method.
[14] Full Multigrid method. https://en.wikipedia.org/wiki/Multigrid_method.
[15] GDRcopy. https://github.com/NVIDIA/gdrcopy.
[16] GPUDirect family. https://developer.nvidia.com/gpudirect.
[17] GPUDirect LibGDSync. http://github.com/gpudirect/libgdsync.
[18] GPUDirect libmlx5. https://github.com/gpudirect/libmlx5.
[19] GPUDirect RDMA. http://docs.nvidia.com/cuda/gpudirect-rdma/#design-considerations.
[20] HPGMG. https://hpgmg.org.
[21] HPGMG-FV CUDA Async on GitHub. https://github.com/e-ago/hpgmg-cuda-async.
[22] InfiniBand Standard. http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf.
[23] S. Kim, S. Huh, Y. Hu, X. Zhang, E. Witchel, GPUnet: Networking abstractions for GPU programs, in: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, October 2014.
[24] LibMP on GitHub. https://github.com/gpudirect/libmp.
[25] Mellanox GDR. http://www.mellanox.com/page/products_dyn?product_family=116.
[26] Mellanox Perftest. https://community.mellanox.com/docs/DOC-2802.
[27] L. Oden, H. Fröning, GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters, in: Cluster Computing (CLUSTER), 2013 IEEE International Conference on.
[28] L. Oden, H. Fröning, F. Pfreundt, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, in: Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International.
[29] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, D.K. Panda, Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs, in: Proceedings of the 42nd International Conference on Parallel Processing, ICPP, 2013.
[30] D. Rossetti, Mellanox booth talk at Supercomputing 2016. https://www.youtube.com/watch?v=eO2tTVo8ALE.
[31] D. Rossetti, E. Agostini, How to enable NVIDIA CUDA stream synchronous communications using GPUDirect. http://on-demand.gputechconf.com/gtc/2017/presentation/s7128-davide-rossetti-how-to-enable.pdf.
[32] D. Rossetti, S. Potluri, D. Fontaine, State of GPUDirect technologies. http://on-demand.gputechconf.com/gtc/2016/presentation/s6264-davide-rossetti-GPUDirect.pdf.
[33] N. Sakharnykh, High-performance geometric multi-grid with GPU acceleration. https://devblogs.nvidia.com/parallelforall/high-performance-geometric-multi-grid-gpu-acceleration.
[34] A. Venkatesh, K. Hamidouche, S. Potluri, D. Rossetti, C.H. Chu, D.K. Panda, MPI-GDS: High performance MPI designs with GPUDirect-aSync for CPU–GPU control flow decoupling, in: International Conference on Parallel Processing, August 2017.
[35] Wilkes cluster, Cambridge, UK. www.hpc.cam.ac.uk.
[36] S. Xiao, W. Feng, Inter-block GPU communication via fast barrier synchronization.

E. Agostini received her Ph.D. in Computer Science from the University of Rome "La Sapienza" in collaboration with the National Research Council of Italy and is currently a Software Engineer at NVIDIA Corp. Her main scientific interests are parallel computing, GPGPUs, HPC, network protocols and cryptanalysis.

D. Rossetti has a degree in Theoretical Physics from Sapienza Rome University and is currently a senior engineer at NVIDIA Corp. His main research activities are in the fields of design and development of parallel computing and high-speed networking architectures optimized for numerical simulations, while his interests span different areas such as HPC, computer graphics, operating systems, I/O technologies, GPGPUs, embedded systems, digital design, real-time systems, etc.

S. Potluri received his Ph.D. in Computer Science and Engineering from The Ohio State University and is currently a Senior Software Engineer at NVIDIA Corp. His research interests include high-performance interconnects, heterogeneous architectures, parallel programming models and high-end computing applications. His current focus is on designing runtime and network solutions that enable high performance and scalable communication on clusters with NVIDIA GPUs.
