Program Structure of CUDA

The CPU and the GPU are separate entities, and each has its own memory space. The CPU cannot directly access GPU memory, and vice versa. In CUDA terminology, CPU memory is called host memory and GPU memory is called device memory. Pointers to CPU and GPU memory are called host pointers and device pointers, respectively.

For data to be accessible by the GPU, it must reside in device memory. CUDA provides APIs for allocating device memory and for transferring data between host and device memory. The following is the common workflow of a CUDA program.

1. Allocate host memory and initialize host data
2. Allocate device memory
3. Transfer input data from host to device memory
4. Execute kernels
5. Transfer output from device memory to host

So far, we have done steps 1 and 4. We will now add steps 2, 3, and 5 to our vector addition program and finish this exercise; the complete workflow is sketched below.
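The following is a minimal sketch of the finished program under these assumptions: the kernel is named vector_add, the vectors hold N floats, and all names are illustrative rather than taken from the original exercise. Error checking of the CUDA calls is omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1048576  /* illustrative vector length */

/* Kernel (device code): each thread adds one pair of elements */
__global__ void vector_add(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main(void) {
    size_t bytes = N * sizeof(float);

    /* Step 1: allocate host memory and initialize host data */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Step 2: allocate device memory */
    float *d_a, *d_b, *d_out;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_out, bytes);

    /* Step 3: transfer input data from host to device memory */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Step 4: execute the kernel */
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_out, N);

    /* Step 5: transfer output from device memory to host */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);  /* expected: 3.0 */

    /* Free device and host memory */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(h_a); free(h_b); free(h_out);
    return 0;
}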

Program Structure of CUDA


A typical CUDA program contains code intended for both the GPU and the CPU. By default, a traditional C program is a CUDA program with only host code. The CPU is referred to as the host, and the GPU is referred to as the device. Whereas the host code can be compiled by a traditional C compiler such as GCC, the device code needs a special compiler that understands the API functions being used. For Nvidia GPUs, this compiler is NVCC (the NVIDIA CUDA Compiler).
The device code runs on the GPU, and the host code runs on the CPU. NVCC processes a CUDA program and separates the host code from the device code. To accomplish this, it looks for special CUDA keywords that mark the code intended to run on the GPU (device code), labelling data-parallel functions called 'kernels'. The device code is further compiled by NVCC and executed on the GPU.
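For example, the __global__ qualifier is the CUDA keyword that labels a data-parallel function as a kernel; the kernel below is an illustrative sketch, not a function from the original text.

/* Device code: the __global__ keyword marks this function as a kernel,
   i.e. a data-parallel function that NVCC compiles for the GPU. */
__global__ void scale(float *data, float factor) {
    data[threadIdx.x] *= factor;   /* each thread handles one element */
}

/* Host code elsewhere in the same .cu file launches it, e.g.:
       scale<<<1, 256>>>(d_data, 2.0f);                        */

Such source files are conventionally given a .cu extension and compiled with NVCC, for example: nvcc program.cu -o program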
Execution of a CUDA C Program
How does a CUDA program work? While writing a CUDA program, the programmer has explicit control over the number of threads to launch (this number is chosen carefully). These threads collectively form a three-dimensional grid: threads are packed into blocks, and blocks are packed into a grid. Each thread is given a unique identifier, which can be used to determine which data it should act upon.
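Inside a kernel, the built-in variables threadIdx, blockIdx, and blockDim expose this hierarchy. A common pattern for computing a unique identifier in a one-dimensional grid is sketched below; the kernel name and the work it performs are illustrative only.

__global__ void process(float *data, int n) {
    /* Built-in variables: blockIdx = block position in the grid,
       blockDim = threads per block, threadIdx = thread position in the block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* unique identifier */
    if (i < n) {
        data[i] = 2.0f * data[i];   /* this thread acts only on its own element */
    }
}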

Device Global Memory and Data Transfer


As has been explained in the previous chapter, a typical GPU comes with its own global memory (DRAM, Dynamic Random Access Memory). For example, the Nvidia GeForce GTX 480 comes with 1.5 GB of DRAM. From now on, we will call this memory the device memory.
To execute a kernel on the GPU, the programmer needs to allocate separate memory on the GPU by writing code. The CUDA API provides specific functions for accomplishing this. Here is the flow sequence:
• After allocating memory on the device, data has to be transferred from the host memory to the device memory.
• After the kernel is executed on the device, the result has to be transferred back from the device memory to the host memory.
• Finally, the allocated memory on the device has to be freed up.
The host can access the device memory and transfer data to and from it, but not the other way round. CUDA provides API functions to accomplish all these steps; a short illustration follows.
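The sketch below exercises only these memory APIs, without launching a kernel, by copying a small array from the host to the device and back; the variable names and sizes are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 8;
    size_t bytes = n * sizeof(int);
    int h_in[8]  = {0, 1, 2, 3, 4, 5, 6, 7};  /* host data */
    int h_out[8] = {0};

    int *d_buf;
    cudaMalloc((void **)&d_buf, bytes);                        /* allocate device memory */
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);    /* host -> device */
    /* ... a kernel operating on d_buf would normally be launched here ... */
    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);   /* device -> host */
    cudaFree(d_buf);                                           /* free device memory */

    for (int i = 0; i < n; ++i) printf("%d ", h_out[i]);       /* prints the round-tripped data */
    printf("\n");
    return 0;
}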
