
MODULE FOUR:

GPU PROGRAMMING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives

▪ Multicore CPU vs GPU
▪ Introduction to GPU Data Management
▪ CUDA Managed Memory
▪ GPU Profiling with Nsight Systems
CPU VS GPU
CPU VS GPU
Number of cores and parallelism

▪ Both are extremely popular parallel processors, but with different degrees of parallelism
▪ CPUs generally have a small number of very fast physical cores
▪ GPUs have thousands of simple cores able to achieve high performance in aggregate
▪ Both require parallelism to be fully utilized, but GPUs require much more
CPU + GPU WORKFLOW
Application Code

▪ Compute-intensive functions (a small % of the code, but a large % of the runtime) run on the GPU
▪ The rest of the sequential CPU code runs on the CPU
GPU PROGRAMMING IN OPENACC

▪ Execution always begins and ends on the host CPU
▪ Compute-intensive loops are offloaded to the GPU using directives, which act as compiler hints
▪ Offloading may or may not require data movement between the host and device

int main(){
    <sequential code>

    #pragma acc kernels    // compiler hint
    {
        <parallel code>
    }
}
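As a minimal sketch of this pattern (an illustration, not part of the original lab code; the array names, size, and scaling factor are assumptions), a complete program might look like:

#include <stdio.h>

#define N 1000000

int main(){
    static double a[N], b[N];                 // data lives in host memory

    for(int i = 0; i < N; i++) a[i] = i;      // sequential initialization on the CPU

    #pragma acc kernels                       // compiler hint: offload this region to the GPU
    for(int i = 0; i < N; i++){
        b[i] = 2.0 * a[i];                    // loop iterations run in parallel on the device
    }

    printf("b[%d] = %f\n", N-1, b[N-1]);      // execution ends back on the host CPU
    return 0;
}

This could be built with the PGI compiler as, e.g., pgcc -fast -acc example.c (a hedged example invocation; the file name is an assumption).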
CPU + GPU
Physical Diagram

▪ CPU memory is larger; GPU memory has more bandwidth
▪ CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e)
▪ Any data transferred between the CPU and GPU will be handled by the I/O bus
▪ The I/O bus is relatively slow compared to memory bandwidth
▪ The GPU cannot perform computation until the data is within its memory

[Diagram: a multicore CPU with shared cache and high-capacity memory, connected over an I/O bus to a GPU with its own caches and high-bandwidth memory]
BASIC DATA MANAGEMENT
BASIC DATA MANAGEMENT
Between the host and device

▪ The host is traditionally a CPU
▪ The device is some parallel accelerator
▪ When our target hardware is multicore, the host and device are the same, meaning that their memory is also the same
▪ There is no need to explicitly manage data when using a shared-memory accelerator, such as the multicore target
BASIC DATA MANAGEMENT
Between the host and device

▪ When the target hardware is a GPU, data will usually need to migrate between CPU and GPU memory
▪ The next lecture will discuss OpenACC data management; for now we'll assume a unified host/accelerator memory
CUDA MANAGED MEMORY
CUDA MANAGED MEMORY
Simplified Developer Effort

▪ Commonly referred to as "unified memory"
▪ CPU and GPU memories are combined into a single, shared pool

[Diagram: without Managed Memory, system memory and GPU memory are separate pools; with Managed Memory, they are presented as a single managed pool]
CUDA MANAGED MEMORY
Usefulness

▪ Handling explicit data transfers between the host and device (CPU and GPU) can be difficult
▪ The PGI compiler can utilize CUDA Managed Memory to defer data management
▪ This allows the developer to concentrate on parallelism and think about data movement as an optimization

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel main.c
$ pgfortran -fast -acc -ta=tesla:managed -Minfo=accel main.f90
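A hedged sketch of what this enables (the file and variable names are assumptions, not from the slides): an ordinary malloc'd array can be used inside a compute region without any data clauses, and the managed-memory runtime migrates it on demand when the code is built with -ta=tesla:managed.

#include <stdlib.h>

int main(){
    int n = 1<<20;
    double *a = (double*)malloc(n * sizeof(double));   // ordinary host allocation

    for(int i = 0; i < n; i++) a[i] = 1.0;             // first touched on the CPU

    #pragma acc parallel loop                          // no data clauses: with -ta=tesla:managed
    for(int i = 0; i < n; i++){                        // the runtime migrates 'a' on demand
        a[i] = 2.0 * a[i];
    }

    free(a);
    return 0;
}

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel example.c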
MANAGED MEMORY
Limitations

▪ The programmer will almost always be able to get better performance by manually handling data transfers
▪ Memory allocation/deallocation takes longer with managed memory
▪ Data cannot be transferred asynchronously
▪ Currently only available from PGI on NVIDIA GPUs
OPENACC WITH MANAGED MEMORY
An Example from the Lab Code
▪ Without Managed Memory the compiler must determine the size of A and Anew and copy their data to and from the GPU each iteration to ensure correctness
▪ With Managed Memory the underlying runtime will move the data only when needed

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
    }
}
INTRODUCTION TO DATA CLAUSES
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

▪ Data clauses allow the programmer to tell the compiler which data to move and when
▪ Data clauses may be added to kernels or parallel regions, but also to data, enter data, and exit data, which will be discussed shortly

C/C++
#pragma acc kernels
for(int i = 0; i < N; i++){
    a[i] = 0;
}
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

C/C++
#pragma acc parallel loop copyout(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 0;
}

▪ I don't need the initial value of a, so I'll only copy it out of the region at the end.
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

Allocate 'a' on GPU → Copy 'a' from CPU to GPU → Execute Kernels → Copy 'a' from GPU to CPU → Deallocate 'a' from GPU

#pragma acc parallel loop copy(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 2 * a[i];
}

[Diagram: array A starts in CPU memory, is copied to GPU memory, updated on the GPU (A'), and the updated A' is copied back to CPU memory]
DATA CLAUSES
copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                 Principal use: For many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.
                 Principal use: Think of this like an array that you would use as just an input to a subroutine.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.
                 Principal use: A result that isn't overwriting the input data structure.

create( list )   Allocates memory on the GPU but does not copy.
                 Principal use: Temporary arrays.
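As a short, hedged sketch of how these clauses combine (the array names, sizes, and the computation are illustrative assumptions, not from the lab code):

// x is input only, y is output only, tmp is device-only scratch space
#pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N]) create(tmp[0:N])
for(int i = 0; i < N; i++){
    tmp[i] = 2.0 * x[i];    // written and read only in GPU memory
    y[i]   = tmp[i] + 1.0;  // copied back to the host when the region ends
}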
ARRAY SHAPING

▪ Sometimes the compiler needs help understanding the shape of an array
▪ The first number is the start index of the array
▪ In C/C++, the second number is how much data is to be transferred (the length)
▪ In Fortran, the second number is the ending index

copy(array[starting_index:length])          C/C++

copy(array(starting_index:ending_index))    Fortran
BASIC DATA MANAGEMENT
Multi-dimensional Array shaping

copy(array[0:N][0:M]) C/C++

copy(array(1:N, 1:M)) Fortran
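As a hedged illustration of shaping in practice (the array name, bounds, and computation are assumptions), the clause below transfers only rows 1 through N-2 of a statically sized C array, using the start-index:length form for the outer dimension:

// double A[N][M];  transfer rows 1..N-2 (start 1, length N-2), all M columns
#pragma acc parallel loop copy(A[1:N-2][0:M])
for(int j = 1; j < N-1; j++){
    for(int i = 0; i < M; i++){
        A[j][i] = 0.5 * A[j][i];
    }
}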


PROFILING GPU CODE
PROFILING GPU CODE (PGI)
Obtaining information about your GPU

▪ Using the pgaccelinfo command will display information about available accelerators
▪ Each device is numbered starting with 0
▪ The Device Name identifies the type of accelerator
▪ Can Managed Memory be used?
▪ What compiler options should be used to target this device?

Terminal Window
$ pgaccelinfo
Device Number: 0
Device Name: Tesla P100-PCIE-16GB
...
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Without Managed Memory
$ pgcc -ta=tesla:cc60 main.c

With Managed Memory
$ pgcc -ta=tesla:cc60,managed main.c
COMPILING GPU CODE

Terminal Window
$ pgcc -fast -ta=tesla:cc60 -Minfo=accel jacobi.c laplace2d.c
calcNext:
37, Generating copy(Anew[:m*n],A[:m*n])
    Accelerator kernel generated
    Generating Tesla code
    37, Generating reduction(max:error)
    38, #pragma acc loop gang /* blockIdx.x */
    41, #pragma acc loop vector(128) /* threadIdx.x */
41, Loop is parallelizable
swap:
56, Generating copy(Anew[:m*n],A[:m*n])
    Accelerator kernel generated
    Generating Tesla code
    57, #pragma acc loop gang /* blockIdx.x */
    60, #pragma acc loop vector(128) /* threadIdx.x */
60, Loop is parallelizable

▪ We can see that our data copies are being applied by the compiler ("Generating copy")
▪ We also see that the compiler is generating code for our GPU ("Generating Tesla code")
▪ The gang and vector loop lines show the parallelization of the outer loops (across thread blocks) and the inner loops (across threads within a block)
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Nsight Systems presents far more information when running on a GPU
▪ It is capable of capturing information about CUDA execution in the profiled process
▪ In the Timeline view, you can see all the information about kernels and memory movements (expand the CUDA row)
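A typical way to collect such a report from the command line might look like this (the trace options and output name are assumptions based on common nsys usage, not taken from these slides):

Terminal Window
$ nsys profile -t cuda,openacc -o laplace ./a.out

The resulting report file can then be opened in the Nsight Systems GUI to inspect the timeline.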
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code

▪ Kernels: These are our computational functions. We can see our calcNext and swap functions
▪ MemCpy(HtoD): This includes data transfers from the Host to the Device (CPU to GPU)
▪ MemCpy(DtoH): These are data transfers from the Device to the Host (GPU to CPU)
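For a quick text summary of the same information, nsys can also post-process the report on the command line (a hedged example; the report file name and extension depend on the nsys version and the output name chosen when profiling):

Terminal Window
$ nsys stats laplace.nsys-rep

This prints summary tables of the time spent in each GPU kernel (e.g. calcNext and swap) and in HtoD/DtoH memory copies.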
PROFILING GPU CODE
Receiving unexpected performance results

▪ Here we can see the runtime of our application: 151 seconds
▪ The program is now performing over 3 times worse than the sequential version
▪ A profiler can help us understand why the performance is worse

Terminal Window
$ pgcc -ta=tesla:cc60 jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 151.772627 s
PROFILING GPU CODE
Inspecting the Nsight Systems timeline

▪ Let's focus on the data movement (Memory row)
▪ At first glance, it looks like our program is spending a significant amount of time transferring data between the host and device
▪ We also see that the compute regions are very small and spread out
▪ What if we try Managed Memory?
PROFILING GPU CODE
Using managed memory
▪ Using managed memory drastically improves performance
▪ This managed memory version is performing over 20x better than the sequential code
▪ What does the profiler tell us about this?

Terminal Window
$ pgcc -ta=tesla:cc60,managed jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 1.474951 s
PROFILING GPU CODE
Using managed memory

▪ The data no longer needs to transfer between each kernel
▪ The data is only moved when it is first accessed on the GPU or CPU
▪ During the timestepping, data remains on the device
▪ Now a higher percentage of time is spent computing
KEY CONCEPTS
In this module we discussed…

▪ The fundamental differences between CPUs and GPUs
▪ Assisting the compiler by providing information about array sizes for data management
▪ Managed memory
THANK YOU
