Multi-GPU Programming with CUDA

Stefano Markidis
Three Key-Points
• CUDA provides a way to program multiple GPUs on the same
computing node
• To program multiple on-node GPUs, we first select a GPU with
cudaSetDevice() and associate a stream to it
• The CUDA peer-to-peer API enables data copies from one GPU's
memory to another GPU's memory
Multi-GPU Systems
• There are two types of connectivity in multi-GPU systems:
• Multiple GPUs connected over the PCIe/NVLink bus in a single node
• Multiple GPUs connected over a network switch in a cluster

• GPU0 and GPU1 are connected via the PCIe bus on node0. GPU2 and GPU3 are
connected via the PCIe bus on node1.
• The two nodes (node0 and node1) are connected to each other through a network
switch.
• In this lecture, we focus on single-node multi-GPU programming.
Counting the Number of GPUs on the Node
• A single host thread can manage multiple devices
• In general, the first step is determining the number of CUDA-enabled devices
available in a system with cudaGetDeviceCount()
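
A minimal sketch of this first step (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);   // number of CUDA-enabled devices visible
        printf("CUDA-capable devices: %d\n", ngpus);
        return 0;
    }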
Selecting One On-Node GPU
• We select which GPU is the current target for all CUDA operations with
cudaSetDevice()
• This function sets the device with identifier id as the current device.
• We use cudaSetDevice() to select any device, with device identifiers
ranging from 0 to ngpus-1.
• The current GPU can be changed while async calls (kernels, memcpy) are running

The following code will have both GPUs executing concurrently
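
A minimal sketch of such code (kernel, grid, block, and the d_data pointers are
illustrative names):

    cudaSetDevice(0);                  // make GPU 0 the current device
    kernel<<<grid, block>>>(d_data0);  // asynchronous launch on GPU 0
    cudaSetDevice(1);                  // switch devices; GPU 0 keeps running
    kernel<<<grid, block>>>(d_data1);  // asynchronous launch on GPU 1

Because both launches are asynchronous, the host reaches the second
cudaSetDevice() call while GPU 0 is still executing, so the two kernels overlap.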


Using Streams for Kernels on Different GPUs
• We execute different streams on different GPUs
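
A sketch of creating one stream per device; a stream is bound to the device
that was current when it was created (the fixed array size of 8 is an arbitrary
upper bound chosen for illustration):

    cudaStream_t streams[8];           // assumes at most 8 GPUs on the node
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);              // the stream below binds to device i
        cudaStreamCreate(&streams[i]);
    }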
Executing a Kernel on Different GPUs
• Because the kernel launches and data transfers in the loop below are asynchronous,
control returns to the host thread soon after each operation is invoked.
• We can switch devices even if kernels or transfers issued by the current
thread are still executing on the current device
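
A sketch of such a loop (iKernel, d_in, h_in, grid, block, and bytes are
illustrative names):

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        iKernel<<<grid, block, 0, streams[i]>>>(d_in[i]);
        // control returns immediately; the next iteration switches devices
        // while this GPU is still transferring data and computing
    }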
Workflow for On-Node Multi-GPU Programming
1. Select the set of GPUs this application will use
2. Create streams for each device
3. Allocate device resources on each device (for example, device memory)
4. Launch tasks on each GPU through the streams (for example, data transfers
or kernel executions)
5. Use the streams to wait for task completion
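
A condensed sketch of the five steps together (error checking omitted; h_buf,
bytes, work, grid, and block are illustrative names):

    int ngpus;
    cudaGetDeviceCount(&ngpus);                     // 1. select the set of GPUs

    cudaStream_t streams[8];                        // assumes at most 8 GPUs
    float *d_buf[8];
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);              // 2. one stream per device
        cudaMalloc(&d_buf[i], bytes);               // 3. allocate device memory
    }

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_buf[i], h_buf[i], bytes,  // 4. launch tasks through
                        cudaMemcpyHostToDevice,     //    the streams
                        streams[i]);
        work<<<grid, block, 0, streams[i]>>>(d_buf[i]);
    }

    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);          // 5. wait for completion
    }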


Peer-to-Peer Communication
• The CUDA peer-to-peer (P2P) API enables direct inter-device communication
• Peer-to-peer transfers allow us to directly copy data between GPUs

[Figure: four GPUs (GPU 0 to GPU 3) connected so that data can be copied directly between any pair]
Checking and Enabling Peer Access
• Because not all GPUs support peer-to-peer access, we need to check whether a
device supports P2P using cudaDeviceCanAccessPeer()
• Peer-to-peer memory access must be explicitly enabled between two devices
with cudaDeviceEnablePeerAccess()
• This function enables peer-to-peer access from the current device
to peerDevice.
• The flag argument is reserved for future use and currently must be set to 0.
• The access granted by this function is unidirectional:
this function enables access from the current device to peerDevice but
does not enable access from peerDevice back to the current device.
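
A sketch of checking and then enabling peer access in both directions between
devices 0 and 1:

    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);  // can device 0 reach device 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);  // can device 1 reach device 0?

    if (canAccess01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // current device (0) -> peer 1
    }
    if (canAccess10) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);         // current device (1) -> peer 0
    }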
Peer-to-Peer Memory Copy
• After enabling peer access between two devices, we can copy data
between those devices asynchronously with cudaMemcpyPeerAsync()
• This function transfers data from device memory on the device srcDev to
device memory on the device dstDev; it is asynchronous with respect to
the host and all other devices.
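
A sketch of a direct copy from GPU 0's memory to GPU 1's memory (d_dst, d_src,
bytes, and stream are illustrative names):

    cudaMemcpyPeerAsync(d_dst, 1,   // destination pointer, dstDev = 1
                        d_src, 0,   // source pointer,      srcDev = 0
                        bytes,      // number of bytes to copy
                        stream);    // stream the copy is enqueued in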
Code Example
• Measuring bandwidth between different devices:
https://github.com/zchee/cuda-sample/blob/master/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu
To Summarize
• CUDA allows us to program multiple GPUs on the same computing
node
• To program multiple on-node GPUs, we first select a GPU with
cudaSetDevice() and then associate a stream with it
• The CUDA peer-to-peer API enables data copies from one GPU's
memory to another GPU's memory
