Lecture GPU 17
(I2M-422/MAT-4202)
School of Mathematics
IISER Thiruvananthapuram
[email protected]
Organizing Threads
blockIdx (block index within a grid)
threadIdx (thread index within a block)
blockDim (block dimension, measured in threads)
gridDim (grid dimension, measured in blocks)
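For example (a minimal sketch, not from the slides), these built-in variables combine to give each thread a unique global index inside a kernel:

__global__ void scale(float *data, int n)
{
    // global thread index for a 1D grid of 1D blocks
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}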
Launching a CUDA Kernel
Know Your Limitations: Managing Devices
Organizing Parallel Threads: The following layouts are possible for matrix addition (the 2D-grid/2D-block case is sketched after this list):
2D grid with 2D blocks
1D grid with 1D blocks
2D grid with 1D blocks
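A minimal sketch of the 2D-grid/2D-block case (the kernel name, matrix dimensions nx/ny, and row-major layout are illustrative assumptions):

__global__ void sumMatrixOnGPU2D(float *A, float *B, float *C, int nx, int ny)
{
    // 2D global indices from a 2D grid of 2D blocks
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * nx + ix;                     // row-major linear index
    if (ix < nx && iy < ny) C[idx] = A[idx] + B[idx];
}

A matching launch configuration could look like:

dim3 block(32, 32);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
sumMatrixOnGPU2D<<<grid, block>>>(d_A, d_B, d_C, nx, ny);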
[Figure: Streaming Multiprocessor (SM) components: CUDA Cores, Shared Memory / L1-Cache, Register File, Load/Store Units, Special Function Units, Warp Scheduler]
Since many CUDA API calls and all kernel launches are
asynchronous with respect to the host,
cudaDeviceSynchronize can be used to block the host
application until all CUDA operations (copies, kernels, and so on)
have completed:
cudaError_t cudaDeviceSynchronize(void);
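A typical usage sketch (the error check shown here is a common convention, not taken from the slides):

// after launching asynchronous work, block the host until the device is done
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Error: %s\n", cudaGetErrorString(err));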
CUDA memory hierarchy (programmable memory spaces):
Registers
Shared memory
Local memory
Constant memory
Texture memory
Global memory
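As a brief illustration (a sketch using standard CUDA qualifiers, not code from the slides), the memory space a variable occupies is determined by how it is declared:

__constant__ float coeff[16];              // constant memory, written from the host
__device__   float gTable[256];            // global memory with device scope

__global__ void transform(float *gIn)      // gIn points to global memory
{
    __shared__ float tile[256];            // shared memory, one copy per block (assumes blockDim.x <= 256)
    float v = gIn[threadIdx.x];            // held in a register (spills go to local memory)
    tile[threadIdx.x] = v * coeff[0];
    __syncthreads();
    gIn[threadIdx.x] = tile[threadIdx.x] + gTable[threadIdx.x];
}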
For sharing a small amount of data between the host and device, zero-copy
memory may be a good choice because it simplifies programming and offers
reasonable performance.
For larger datasets with discrete GPUs connected via the PCIe bus, zero-copy
memory is a poor choice and causes significant performance degradation.
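A minimal zero-copy sketch (nBytes, the kernel, and its launch configuration are assumed from context; the device must support mapped host memory):

float *h_A, *d_A;
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);   // pinned host memory mapped into the device address space
cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0);     // device pointer aliasing the same allocation
kernel<<<grid, block>>>(d_A, n);     // on a discrete GPU, every access crosses the PCIe bus
cudaDeviceSynchronize();
cudaFreeHost(h_A);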
Unified Memory
With CUDA 6.0, a new feature called Unified Memory was introduced to
simplify memory management in the CUDA programming model.
Unified Memory creates a pool of managed memory, where each allocation
from this memory pool is accessible on both the CPU and GPU with the same
memory address (that is, pointer).
Unified Memory offers a “single-pointer-to-data” model that is conceptually
similar to zero-copy memory.
However, zero-copy memory is allocated in host memory, and as a result
kernel performance generally suffers from high-latency accesses to zero-copy
memory over the PCIe bus.
Unified Memory, on the other hand, decouples memory and execution
spaces so that data can be transparently migrated on demand to the host or
device to improve locality and performance.
cudaError_t cudaMallocManaged(void **devPtr, size_t size,
unsigned int flags=0);
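A minimal usage sketch (nBytes, n, the kernel, and its launch configuration are illustrative assumptions):

float *A;
cudaMallocManaged((void **)&A, nBytes);       // one allocation, one pointer, visible to host and device
for (int i = 0; i < n; i++) A[i] = 1.0f;      // initialize on the host
kernel<<<grid, block>>>(A, n);                // use the same pointer on the device
cudaDeviceSynchronize();                      // wait before touching A on the host again
cudaFree(A);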
Because the kernel launches and data transfers in the multi-device loop (sketched at the end of this section) are asynchronous, control returns to the host thread almost immediately after each operation is invoked.
Before distributing computation from the host to multiple devices, you first
need to determine how many GPUs are available in the current system:
int ngpus;
cudaGetDeviceCount(&ngpus);
printf("CUDA-capable devices: %i\n", ngpus);
Once the number of GPUs has been determined, you then declare host
memory, device memory, streams, and events for multiple devices.
float *d_A[NGPUS], *d_B[NGPUS], *d_C[NGPUS];
float *h_A[NGPUS], *h_B[NGPUS], *hostRef[NGPUS], *gpuRef[NGPUS];
cudaStream_t stream[NGPUS];
In our vector add example, a total input size of 16M elements is used and
evenly divided among all devices, giving each device iSize elements:
int size = 1 << 24;
int iSize = size / ngpus;
The size in bytes for one float vector on a device is calculated as follows:
size_t iBytes = iSize * sizeof(float);
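Putting these pieces together, a hedged sketch of the per-device setup and asynchronous dispatch loops (the kernel iKernel and the launch configuration are illustrative assumptions):

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);                               // select the current GPU
    cudaMalloc((void **)&d_A[i], iBytes);
    cudaMalloc((void **)&d_B[i], iBytes);
    cudaMalloc((void **)&d_C[i], iBytes);
    cudaMallocHost((void **)&h_A[i], iBytes);       // pinned host memory for asynchronous copies
    cudaMallocHost((void **)&h_B[i], iBytes);
    cudaMallocHost((void **)&gpuRef[i], iBytes);
    cudaStreamCreate(&stream[i]);                   // one stream per device
}

dim3 block(512);
dim3 grid((iSize + block.x - 1) / block.x);

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);
    cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyHostToDevice, stream[i]);
    cudaMemcpyAsync(d_B[i], h_B[i], iBytes, cudaMemcpyHostToDevice, stream[i]);
    iKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_B[i], d_C[i], iSize);
    cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDeviceToHost, stream[i]);
}

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);               // wait for this device's stream to drain
}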