Lecture GPU 17
(I2M-422/MAT-4202)
School of Mathematics
IISER Thiruvananthapuram
[email protected]
Organizing Threads
blockIdx (block index within a grid)
threadIdx (thread index within a block)
blockDim (block dimension, measured in threads)
gridDim (grid dimension, measured in blocks)
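For example (a minimal sketch, not from the slides), these built-in variables combine to give each thread a unique global index inside a kernel:

__global__ void scale(float *data, int n)
{
    // global thread index for a 1D grid of 1D blocks
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}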
Launching a CUDA Kernel
Know Your Limitations: Managing Devices
Organizing Parallel Threads: The following layouts are possible for matrix addition (the 2D-grid/2D-block case is sketched after this list):
2D grid with 2D blocks
1D grid with 1D blocks
2D grid with 1D blocks
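A minimal sketch of the 2D-grid/2D-block case (the kernel name, matrix dimensions nx/ny, and row-major layout are illustrative assumptions):

__global__ void sumMatrixOnGPU2D(float *A, float *B, float *C, int nx, int ny)
{
    // 2D global indices from a 2D grid of 2D blocks
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * nx + ix;                     // row-major linear index
    if (ix < nx && iy < ny) C[idx] = A[idx] + B[idx];
}

A matching launch configuration could look like:

dim3 block(32, 32);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
sumMatrixOnGPU2D<<<grid, block>>>(d_A, d_B, d_C, nx, ny);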
[Figure: Streaming Multiprocessor (SM) components: CUDA Cores, Shared Memory / L1-Cache, Register File, Load/Store Units, Special Function Units, Warp Scheduler]
Since many CUDA API calls and all kernel launches are
asynchronous with respect to the host,
cudaDeviceSynchronize can be used to block the host
application until all CUDA operations (copies, kernels, and so on)
have completed:
cudaError_t cudaDeviceSynchronize(void);
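A typical usage sketch (the error check shown here is a common convention, not taken from the slides):

// after launching asynchronous work, block the host until the device is done
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Error: %s\n", cudaGetErrorString(err));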
CUDA memory hierarchy (programmable memory spaces):
Registers
Shared memory
Local memory
Constant memory
Texture memory
Global memory
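As a brief illustration (a sketch using standard CUDA qualifiers, not code from the slides), the memory space a variable occupies is determined by how it is declared:

__constant__ float coeff[16];              // constant memory, written from the host
__device__   float gTable[256];            // global memory with device scope

__global__ void transform(float *gIn)      // gIn points to global memory
{
    __shared__ float tile[256];            // shared memory, one copy per block (assumes blockDim.x <= 256)
    float v = gIn[threadIdx.x];            // held in a register (spills go to local memory)
    tile[threadIdx.x] = v * coeff[0];
    __syncthreads();
    gIn[threadIdx.x] = tile[threadIdx.x] + gTable[threadIdx.x];
}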
For sharing a small amount of data between the host and device, zero-copy
memory may be a good choice because it simplifies programming and offers
reasonable performance.
For larger datasets with discrete GPUs connected via the PCIe bus, zero-copy
memory is a poor choice and causes significant performance degradation.
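A minimal zero-copy sketch (nBytes, the kernel, and its launch configuration are assumed from context; the device must support mapped host memory):

float *h_A, *d_A;
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);   // pinned host memory mapped into the device address space
cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0);     // device pointer aliasing the same allocation
kernel<<<grid, block>>>(d_A, n);     // on a discrete GPU, every access crosses the PCIe bus
cudaDeviceSynchronize();
cudaFreeHost(h_A);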
Unified Memory
With CUDA 6.0, a new feature called Unified Memory was introduced to
simplify memory management in the CUDA programming model.
Unified Memory creates a pool of managed memory, where each allocation
from this memory pool is accessible on both the CPU and GPU with the same
memory address (that is, pointer).
Unified Memory offers a “single-pointer-to-data” model that is conceptually
similar to zero-copy memory.
However, zero-copy memory is allocated in host memory, and as a result
kernel performance generally suffers from high-latency accesses to zero-copy
memory over the PCIe bus.
Unified Memory, on the other hand, decouples memory and execution
spaces so that data can be transparently migrated on demand to the host or
device to improve locality and performance.
cudaError_t cudaMallocManaged(void **devPtr, size_t size,
unsigned int flags=0);
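A minimal usage sketch (nBytes, n, the kernel, and its launch configuration are illustrative assumptions):

float *A;
cudaMallocManaged((void **)&A, nBytes);       // one allocation, one pointer, visible to host and device
for (int i = 0; i < n; i++) A[i] = 1.0f;      // initialize on the host
kernel<<<grid, block>>>(A, n);                // use the same pointer on the device
cudaDeviceSynchronize();                      // wait before touching A on the host again
cudaFree(A);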
Because the kernel launches and data transfers in the multi-device loop (sketched at the end of this section) are asynchronous, control returns to the host thread almost immediately after each operation is invoked.
Before distributing computation from the host to multiple devices, you first
need to determine how many GPUs are available in the current system:
int ngpus;
cudaGetDeviceCount(&ngpus);
printf("CUDA-capable devices: %i\n", ngpus);
Once the number of GPUs has been determined, you then declare host
memory, device memory, streams, and events for multiple devices.
float *d_A[NGPUS], *d_B[NGPUS], *d_C[NGPUS];
float *h_A[NGPUS], *h_B[NGPUS], *hostRef[NGPUS], *gpuRef[NGPUS];
cudaStream_t stream[NGPUS];
In our vector add example, a total input size of 16M elements is used and
evenly divided among all devices, giving each device iSize elements:
int size = 1 << 24;
int iSize = size / ngpus;
The size in bytes for one float vector on a device is calculated as follows:
size_t iBytes = iSize * sizeof(float);
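Putting these pieces together, a hedged sketch of the per-device setup and asynchronous dispatch loops (the kernel iKernel and the launch configuration are illustrative assumptions):

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);                               // select the current GPU
    cudaMalloc((void **)&d_A[i], iBytes);
    cudaMalloc((void **)&d_B[i], iBytes);
    cudaMalloc((void **)&d_C[i], iBytes);
    cudaMallocHost((void **)&h_A[i], iBytes);       // pinned host memory for asynchronous copies
    cudaMallocHost((void **)&h_B[i], iBytes);
    cudaMallocHost((void **)&gpuRef[i], iBytes);
    cudaStreamCreate(&stream[i]);                   // one stream per device
}

dim3 block(512);
dim3 grid((iSize + block.x - 1) / block.x);

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);
    cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyHostToDevice, stream[i]);
    cudaMemcpyAsync(d_B[i], h_B[i], iBytes, cudaMemcpyHostToDevice, stream[i]);
    iKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_B[i], d_C[i], iSize);
    cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDeviceToHost, stream[i]);
}

for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);               // wait for this device's stream to drain
}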