
Department of Computer Science and Engineering

UNIT 5 – HPC with CUDA

Subject Name : MODERN COMPUTER ARCHITECTURE
Course Code : 10211CS129

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Unit-5:: Syllabus
UNIT-V
Unit 5: HPC with CUDA (9 Hours)
CUDA programming model
Basic principles of CUDA programming
Concepts of threads and blocks
GPU and CPU data exchange
Unit-5:: CUDA programming model
HPC Architecture
CUDA (Compute Unified Device Architecture)

The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware.

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on Graphics Processing Units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Unit-5:: CUDA programming model
HPC Architecture
CUDA processing flow

1. Copy data from main memory to GPU memory
2. CPU initiates the GPU compute kernel
3. GPU's CUDA cores execute the kernel in parallel
4. Copy the resulting data from GPU memory to main memory

Source: https://en.wikipedia.org/wiki/CUDA
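As a minimal sketch of this four-step flow (the element-wise add kernel addKernel, the array size N, and the launch configuration are illustrative assumptions, not taken from the slides):

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel executed on the GPU: each thread adds one pair of elements.
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    // Host (CPU) arrays in main memory
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) arrays in device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy data from main memory to GPU memory (host-to-device)
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: CPU initiates the kernel; the GPU's CUDA cores execute it in parallel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    addKernel<<<blocks, threads>>>(d_a, d_b, d_c, N);

    // Step 4: copy the result from GPU memory back to main memory (device-to-host)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);   // expected 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}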
Unit-5:: CUDA programming model
HPC Architecture
Host and device

The host is the CPU available in the system.
The system memory associated with the CPU is called host memory.
The GPU is called a device, and GPU memory is likewise called device memory.

Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Unit-5:: CUDA programming model
HPC Architecture
To execute any CUDA program, there are three main steps:
• Copy the input data from host memory to device memory, also known as host-to-device transfer.
• Load the GPU program and execute, caching data on-chip for performance.
• Copy the results from device memory to host memory, also called device-to-host transfer.

Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Unit-5:: CUDA programming model
HPC Architecture
CUDA kernel and thread hierarchy

The CUDA kernel is a function that gets executed on the GPU.

The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only one time like regular C/C++ functions.

Figure 1. The kernel is a function executed on the GPU.

Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
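For illustration, the contrast between a regular C/C++ function and a CUDA kernel executed by K parallel threads might look like the following sketch; the kernel name scaleKernel and K = 1024 are assumptions made for this example.

// Regular C/C++ function: the body runs K times, one iteration after another.
void scale_serial(float *data, float factor, int K) {
    for (int i = 0; i < K; i++)
        data[i] *= factor;
}

// CUDA kernel: the same body is executed once by each of K threads in parallel.
__global__ void scaleKernel(float *data, float factor) {
    int i = threadIdx.x;      // each thread handles one element of the array
    data[i] *= factor;
}

// Host-side launch: <<<1, K>>> starts one block of K threads, so the kernel
// body executes K times in parallel on the GPU, e.g.
// scaleKernel<<<1, 1024>>>(d_data, 2.0f);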
Unit-5:: CUDA programming model
HPC Architecture
Every CUDA kernel starts with a __global__ declaration specifier. Programmers provide a unique global ID to each thread by using built-in variables (see the sketch after Figure 2).

Figure 2. CUDA kernels are subdivided into blocks.
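A minimal sketch of how the built-in variables combine into a unique global thread ID (the kernel name fillKernel and the launch sizes are illustrative assumptions):

// __global__ marks a function as a CUDA kernel callable from host code.
__global__ void fillKernel(int *out, int n) {
    // Built-in variables: blockIdx (block index within the grid),
    // blockDim (threads per block), threadIdx (thread index within its block).
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n)
        out[globalId] = globalId;   // each thread writes its own unique global ID
}

// Host-side launch: a grid of blocks, each with 256 threads.
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// fillKernel<<<blocksPerGrid, threadsPerBlock>>>(d_out, n);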


Unit-5:: CUDA programming model
HPC Architecture
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
One SM can run several concurrent CUDA blocks, depending on the resources needed by the blocks.
Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time.
Figure 3 shows kernel execution and its mapping onto the hardware resources available in the GPU.

Figure 3. Kernel execution on GPU.
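As an illustrative sketch of running multiple kernels on one device at the same time, kernels can be issued into separate CUDA streams; the kernel and stream setup below are assumptions for the example, and actual overlap depends on the SM resources that are free.

#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void launchConcurrent(float *d_a, float *d_b, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Kernels issued into different streams may run concurrently
    // if enough SM resources are available.
    busyKernel<<<blocks, threads, 0, s1>>>(d_a, n);
    busyKernel<<<blocks, threads, 0, s2>>>(d_b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}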
Unit-5:: CUDA programming model
HPC Architecture
• The CUDA program for adding two matrices below shows multidimensional blockIdx and threadIdx and other variables like blockDim.
• In the example below, a 2D block is chosen for ease of indexing; each block has 256 threads, 16 in the x-direction and 16 in the y-direction.
• The total number of blocks is computed as the data size divided by the size of each block.
Unit-5:: CUDA programming model
HPC Architecture

Example of CUDA Program for Matrix Addition
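The code image on the original slide is not reproduced here; the following is a sketch consistent with the description on the previous slide (16 x 16 threads per block, grid size derived from a matrix dimension N chosen for illustration). The exact code on the slide may differ.

#include <cuda_runtime.h>

#define N 1024   // matrix dimension (illustrative)

// Each thread adds one element of the two matrices.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i < N && j < N)
        C[j][i] = A[j][i] + B[j][i];
}

int main() {
    // Device matrices (initialization and host-to-device transfer omitted for
    // brevity; they follow the transfer steps shown earlier).
    float (*A)[N], (*B)[N], (*C)[N];
    cudaMalloc((void **)&A, N * N * sizeof(float));
    cudaMalloc((void **)&B, N * N * sizeof(float));
    cudaMalloc((void **)&C, N * N * sizeof(float));

    // 2D block: 16 x 16 = 256 threads per block
    dim3 threadsPerBlock(16, 16);
    // Total number of blocks = data size divided by block size in each dimension
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}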


Unit-5:: CUDA programming model
HPC Architecture
Memory hierarchy: CUDA-capable GPUs have a memory hierarchy as depicted in Figure 4.

Figure 4. Memory hierarchy in GPUs.

Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Unit-5:: CUDA programming model
HPC Architecture
Registers: These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.
L1/Shared memory (SMEM): Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM.
Read-only memory: Each SM has an instruction cache, constant memory, texture memory and RO cache, which is read-only to kernel code.
L2 cache: The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU increased the L2 cache size to 40 MB, compared to 6 MB in V100 GPUs.
Global memory: This is the DRAM sitting in the GPU, i.e., the GPU's framebuffer memory.
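To make the register, shared-memory, and global-memory levels concrete, here is a small illustrative kernel; the block size and the block-level sum-reduction pattern are assumptions for this sketch, not taken from the slides.

#define BLOCK_SIZE 256

// Block-level sum reduction: each block sums BLOCK_SIZE elements
// using the fast on-chip shared memory (SMEM) of its SM.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[BLOCK_SIZE];          // shared by all threads in this block

    int tid = threadIdx.x;                      // per-thread values live in registers
    int i = blockIdx.x * blockDim.x + tid;      // global index into global memory

    tile[tid] = (i < n) ? in[i] : 0.0f;         // stage data from global memory into SMEM
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];        // one result per block back to global memory
}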
Department of Computer Science and Engineering

Thank You

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
