3.3.1 Multi-GPU Programming with CUDA
Stefano Markidis
Three Key-Points
• CUDA provides a way to program multiple GPUs on the same computing node
• To program multiple on-node GPUs, we first select a GPU with cudaSetDevice() and then associate a stream with it
• The CUDA peer-to-peer API enables copying data from one GPU's memory directly to another GPU's memory
Multi-GPU Systems
• There are two types of connectivity in multi-GPU systems:
• Multiple GPUs connected over the PCIe/NVLink bus within a single node
• Multiple GPUs connected over a network switch in a cluster
• For example, GPU0 and GPU1 are connected via the PCIe bus on node0, and GPU2 and GPU3 are connected via the PCIe bus on node1
• The two nodes (node0 and node1) are connected to each other through a network switch
• In this lecture, we focus on single-node multi-GPU programming
Counting the number of GPUs on the Node
• The number of CUDA-capable GPUs on a node can be queried with cudaGetDeviceCount()
• Because the kernel launches and data transfers issued in a loop over the devices are asynchronous, control returns to the host thread soon after each operation is invoked
• We can switch devices even if kernels or transfers issued by the current thread are still executing on the current device (a sketch of this pattern follows below)
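A minimal sketch of this pattern (not taken from the slides; the kernel iKernel, the array size, and the cap of eight devices are illustrative assumptions):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void iKernel(float *a) {
        a[threadIdx.x] += 1.0f;   // trivial per-element update
    }

    int main(void) {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);    // how many CUDA devices on this node?
        printf("CUDA-capable devices on this node: %d\n", ngpus);
        if (ngpus > 8) ngpus = 8;      // cap to the fixed-size arrays in this sketch

        const int N = 256;
        float *d_a[8];
        cudaStream_t streams[8];

        for (int i = 0; i < ngpus; i++) {
            cudaSetDevice(i);              // make device i the current device
            cudaStreamCreate(&streams[i]); // one stream per device
            cudaMalloc(&d_a[i], N * sizeof(float));
            cudaMemsetAsync(d_a[i], 0, N * sizeof(float), streams[i]);
            // The launch is asynchronous: control returns to the host at once,
            // so the loop can move on and issue work to the next device.
            iKernel<<<1, N, 0, streams[i]>>>(d_a[i]);
        }

        for (int i = 0; i < ngpus; i++) {  // wait for every device to finish
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            cudaFree(d_a[i]);
            cudaStreamDestroy(streams[i]);
        }
        return 0;
    }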
Workflow for on-node Multi-GPU Programming
1. Select the set of GPUs this application will use
• The CUDA peer-to-peer (P2P) API enables direct inter-device communication
• Peer-to-peer transfers allow us to copy data directly between the memories of two GPUs (a sketch of querying P2P support follows below)
[Figure: four GPUs (GPU 0 to GPU 3) on one node, connected for direct peer-to-peer transfers]
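As a rough sketch of how an application might discover which ordered device pairs support direct P2P transfers, using only standard runtime calls (the output format is my own choice):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        for (int i = 0; i < ngpus; i++) {
            for (int j = 0; j < ngpus; j++) {
                if (i == j) continue;
                int canAccess = 0;
                // Can device i read/write device j's memory directly?
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("GPU%d -> GPU%d: P2P %s\n", i, j,
                       canAccess ? "supported" : "not supported");
            }
        }
        return 0;
    }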
Checking and Enabling Peer Access
• We check whether a device can directly address a peer's memory with cudaDeviceCanAccessPeer() and enable peer access with cudaDeviceEnablePeerAccess()
• After enabling peer access between two devices, we can copy data between those devices asynchronously with cudaMemcpyPeerAsync()
• This transfers data from device memory on the device srcDev to device memory on the device dstDev. The function cudaMemcpyPeerAsync() is asynchronous with respect to the host and all other devices (see the sketch below)
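A minimal end-to-end sketch, assuming two P2P-capable GPUs with device IDs 0 and 1; the buffer size and transfer direction are illustrative:

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;      // 1 MiB; arbitrary example size
        float *d_src, *d_dst;
        cudaStream_t stream;

        cudaSetDevice(0);                  // source device (srcDev = 0)
        cudaMalloc(&d_src, bytes);
        cudaStreamCreate(&stream);         // stream associated with device 0
        cudaDeviceEnablePeerAccess(1, 0);  // let device 0 access device 1

        cudaSetDevice(1);                  // destination device (dstDev = 1)
        cudaMalloc(&d_dst, bytes);
        cudaDeviceEnablePeerAccess(0, 0);  // let device 1 access device 0

        cudaSetDevice(0);
        // Copy from device 0's memory to device 1's memory; asynchronous
        // with respect to the host and all other devices.
        cudaMemcpyPeerAsync(d_dst, 1, d_src, 0, bytes, stream);
        cudaStreamSynchronize(stream);     // wait for the transfer to finish

        cudaFree(d_src);
        cudaSetDevice(1);
        cudaFree(d_dst);
        cudaSetDevice(0);
        cudaStreamDestroy(stream);
        return 0;
    }

Note that cudaDeviceEnablePeerAccess() takes the peer device ID and a flags argument that must currently be 0; in real code one would first confirm support with cudaDeviceCanAccessPeer(), as in the previous sketch.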
Code Example
• Measuring bandwidth between different devices
https://github.com/zchee/cuda-sample/blob/master/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu
To Summarize
• CUDA allows us to program multiple GPUs on the same computing node
• To program multiple on-node GPUs, we first select a GPU with cudaSetDevice() and then associate a stream with it
• The CUDA peer-to-peer API enables copying data from one GPU's memory directly to another GPU's memory