Multi-GPU Programming with CUDA

Stefano Markidis
Three Key-Points
• CUDA provides a way to program multiple GPUs on the same
computing node
• To program multiple on-node GPUs, we first select a GPU with
cudaSetDevice() and associate a stream to it
• The CUDA peer-to-peer API enables data copies from one GPU's
memory to another GPU's memory
Multi-GPU Systems
• There are two types of connectivity in multi-GPU systems:
• Multiple GPUs connected over the PCIe/NVLink bus in a single node
• Multiple GPUs connected over a network switch in a cluster

• GPU0 and GPU1 are connected via the PCIe bus on node0. GPU2 and GPU3 are
connected via the PCIe bus on node1.
• The two nodes (node0 and node1) are connected to each other through a network
switch.
• In this lecture, we focus on single-node multi-GPU programming.
Counting the Number of GPUs on the Node
• A single host thread can manage multiple devices
• In general, the first step is determining the number of CUDA-enabled devices
available in a system with cudaGetDeviceCount()
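
A minimal sketch of this first step (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);   // number of CUDA-enabled devices visible
        printf("CUDA-capable devices: %d\n", ngpus);
        return 0;
    }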
Selecting One On-Node GPU
• We select which GPU is the current target for all CUDA operations with
cudaSetDevice()
• This function sets the device with identifier id as the current device.
• We use cudaSetDevice() to select any device, with device identifiers
ranging from 0 to ngpus-1.
• The current GPU can be changed while async calls (kernels, memcpy) are running

The following code will have both GPUs executing concurrently
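
A minimal sketch of such code (kernel, grid, block, and the d_data pointers are
illustrative names):

    cudaSetDevice(0);                  // make GPU 0 the current device
    kernel<<<grid, block>>>(d_data0);  // asynchronous launch on GPU 0
    cudaSetDevice(1);                  // switch devices; GPU 0 keeps running
    kernel<<<grid, block>>>(d_data1);  // asynchronous launch on GPU 1

Because both launches are asynchronous, the host reaches the second
cudaSetDevice() call while GPU 0 is still executing, so the two kernels overlap.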


Using Streams for Kernels on Different GPUs
• We execute different streams on different GPUs
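
A sketch of creating one stream per device; a stream is bound to the device
that was current when it was created (the fixed array size of 8 is an arbitrary
upper bound chosen for illustration):

    cudaStream_t streams[8];           // assumes at most 8 GPUs on the node
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);              // the stream below binds to device i
        cudaStreamCreate(&streams[i]);
    }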
Executing a Kernel on Different GPUs
• Because the kernel launches and data transfers in the loop below are asynchronous,
control returns to the host thread soon after each operation is invoked.
• We can switch devices even if kernels or transfers issued by the current
thread are still executing on the current device
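
A sketch of such a loop (iKernel, d_in, h_in, grid, block, and bytes are
illustrative names):

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        iKernel<<<grid, block, 0, streams[i]>>>(d_in[i]);
        // control returns immediately; the next iteration switches devices
        // while this GPU is still transferring data and computing
    }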
Workflow for On-Node Multi-GPU Programming
1. Select the set of GPUs this application will use
2. Create streams for each device
3. Allocate device resources on each device (for example, device memory)
4. Launch tasks on each GPU through the streams (for example, data transfers
or kernel executions)
5. Use the streams to wait for task completion
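
A condensed sketch of the five steps together (error checking omitted; h_buf,
bytes, work, grid, and block are illustrative names):

    int ngpus;
    cudaGetDeviceCount(&ngpus);                     // 1. select the set of GPUs

    cudaStream_t streams[8];                        // assumes at most 8 GPUs
    float *d_buf[8];
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);              // 2. one stream per device
        cudaMalloc(&d_buf[i], bytes);               // 3. allocate device memory
    }

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_buf[i], h_buf[i], bytes,  // 4. launch tasks through
                        cudaMemcpyHostToDevice,     //    the streams
                        streams[i]);
        work<<<grid, block, 0, streams[i]>>>(d_buf[i]);
    }

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);          // 5. wait for completion
    }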


Peer-to-Peer Communication
• The CUDA peer-to-peer (P2P) API enables direct inter-device communication
• Peer-to-peer transfers allow us to directly copy data between GPUs

[Figure: four GPUs (GPU 0 to GPU 3) connected so that data can be copied directly between any pair]
Checking and Enabling Peer Access
• Because not all GPUs support peer-to-peer access, we need to check whether a
device supports P2P using cudaDeviceCanAccessPeer()
• Peer-to-peer memory access must be explicitly enabled between two devices
with cudaDeviceEnablePeerAccess()
• This function enables peer-to-peer access from the current device
to peerDevice.
• The flag argument is reserved for future use and currently must be set to 0.
• The access granted by this function is unidirectional:
this function enables access from the current device to peerDevice but
does not enable access from peerDevice back to the current device.
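
A sketch of checking and then enabling peer access in both directions between
devices 0 and 1:

    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);  // can device 0 reach device 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);  // can device 1 reach device 0?

    if (canAccess01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // current device (0) -> peer 1
    }
    if (canAccess10) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);         // current device (1) -> peer 0
    }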
Peer-to-Peer Memory Copy
• After enabling peer access between two devices, we can copy data
between those devices asynchronously with cudaMemcpyPeerAsync()
• This function transfers data from device memory on the device srcDev to
device memory on the device dstDev; it is asynchronous with respect to
the host and all other devices.
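
A sketch of a direct copy from GPU 0's memory to GPU 1's memory (d_dst, d_src,
bytes, and stream are illustrative names):

    cudaMemcpyPeerAsync(d_dst, 1,   // destination pointer, dstDev = 1
                        d_src, 0,   // source pointer,      srcDev = 0
                        bytes,      // number of bytes to copy
                        stream);    // stream the copy is enqueued in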
Code Example
• Measuring bandwidth between different devices:
https://github.com/zchee/cuda-sample/blob/master/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu
To Summarize
• CUDA allows us to program multiple GPUs on the same computing
node
• To program multiple on-node GPUs, we first select a GPU with
cudaSetDevice() and then associate a stream with it
• The CUDA peer-to-peer API enables data copies from one GPU's
memory to another GPU's memory
