UNIT-5 Part 1
School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Unit-5::Syllabus
UNIT-V
Unit 5: HPC with CUDA (HPC), 9 Hours
CUDA programming model
Basic principles of CUDA programming
Concepts of threads and blocks
GPU and CPU data exchange
Unit-5:: CUDA programming model: HPC Architecture
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model for general-purpose computing on GPUs.
Source: https://en.wikipedia.org/wiki/CUDA
The CUDA programming model distinguishes two processors: the host (the CPU and its memory) and the device (the GPU and its memory).
Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
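A minimal sketch (not from the slides) of this host/device split, assuming a CUDA-capable toolchain: the __global__ kernel below runs on the device, while main() runs on the host and launches it.

#include <cstdio>
#include <cuda_runtime.h>

// Device code: this kernel executes on the GPU (the "device").
__global__ void helloFromDevice() {
    printf("Hello from device thread %d\n", threadIdx.x);
}

// Host code: main() executes on the CPU (the "host") and launches the kernel.
int main() {
    helloFromDevice<<<1, 4>>>();   // 1 block of 4 device threads
    cudaDeviceSynchronize();       // host waits for the device to finish
    return 0;
}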
To execute any CUDA program, there are three main steps (a host-code sketch of these steps follows below):
1. Copy the input data from host memory to device memory, also known as a host-to-device transfer.
2. Load the GPU program and execute it, caching data on-chip for performance.
3. Copy the results from device memory back to host memory, also called a device-to-host transfer.
Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
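A minimal sketch of those three steps using the CUDA runtime API; the scale kernel, buffer names, and launch configuration are illustrative and not part of the original slides.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float* h_x = (float*)malloc(bytes);          // host buffer
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x;
    cudaMalloc(&d_x, bytes);                     // device buffer

    // Step 1: host-to-device transfer
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // Step 2: load and execute the GPU program (kernel launch)
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, 2.0f, n);

    // Step 3: device-to-host transfer
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    printf("h_x[0] = %f\n", h_x[0]);             // expect 2.0
    cudaFree(d_x);
    free(h_x);
    return 0;
}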
CUDA kernel and thread hierarchy: a kernel is launched as a grid of thread blocks, and each block contains many threads.
Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Every CUDA kernel starts with a __global__ declaration specifier. Programmers provide a unique global ID to each thread by using the built-in variables threadIdx, blockIdx, and blockDim (see the kernel sketch below).
Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
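An illustrative kernel (the name and launch configuration are assumptions, not from the slides) showing the __global__ specifier and the usual way a unique global thread ID is computed from the built-in variables:

// Every kernel is marked with the __global__ declaration specifier.
// Each thread computes its own unique global ID from the built-in
// variables blockIdx, blockDim, and threadIdx.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (i < n)                                      // guard against surplus threads
        c[i] = a[i] + b[i];
}

// Launch syntax: kernel<<<number of blocks, threads per block>>>(args);
// e.g. vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);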
A CUDA kernel sees a hierarchy of memory spaces:
Registers: private to each thread, which means that registers assigned to one thread are not visible to other threads. The compiler makes decisions about register utilization.
L1/Shared memory (SMEM): every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and as shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by that SM (see the sketch after this list).
Read-only memory: each SM has an instruction cache, constant memory, texture memory and a read-only (RO) cache, all of which are read-only to kernel code.
L2 cache: the L2 cache is shared across all SMs, so every thread in every CUDA block can access it. The NVIDIA A100 GPU increased the L2 cache size to 40 MB, compared to 6 MB on V100 GPUs.
Global memory: the DRAM sitting on the GPU; its capacity is what is reported as the framebuffer size of the card.
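A minimal sketch (not from the slides) of how shared memory is typically used: each block stages data in on-chip SMEM and cooperates through __syncthreads() before writing back to global memory. The kernel name and the block size of 256 are assumptions.

#include <cuda_runtime.h>

// Illustrative kernel: each block sums 256 elements using fast on-chip
// shared memory (SMEM) and writes one partial sum to global memory (GPU DRAM).
// Launch with 256 threads per block, e.g.
//   blockSum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float smem[256];                    // per-block shared memory

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    smem[tid] = (i < n) ? in[i] : 0.0f;            // load from global memory
    __syncthreads();                               // wait for the whole block

    // Tree reduction carried out entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = smem[0];   // one write per block
}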
Department of Computer Science and Engineering
Thank You