An Introduction to CUDA Programming
Irina Mocanu
University Polytechnica of Bucharest
Introduction
The programmable graphics processing unit (GPU) has evolved into a genuine computing
workhorse. Today's GPUs offer substantial resources for both graphics and non-graphics processing. Data-parallel processing maps data elements to parallel processing threads. Many applications that process
large data sets such as arrays can use a data-parallel programming model to speed up the computations. In
3D rendering large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media
processing applications can map image blocks and pixels to parallel processing threads. Many other
algorithms beyond image rendering and processing can likewise be accelerated by data-parallel
processing.
In this scope, Nvidia developed CUDA (Compute Unified Device Architecture) [1], a new
hardware and software architecture for issuing and managing computations on the GPU as a data-parallel
computing device, without the need to map them to a graphics API. It is available for the GeForce 8
Series, Quadro FX 5600/4600, and Tesla solutions.
This paper presents the principal features of CUDA programming. It describes the CUDA
architecture and the CUDA application programming interface, based on the documentation from [2]
and [4]. It also includes a simple CUDA program, presented in [3], that adds two matrices using the
parallel capabilities of CUDA.
When programmed with CUDA, the GPU is viewed as a compute device capable of executing a
very high number of threads in parallel. It operates as a coprocessor to the main CPU (host). The host
and the device maintain their own DRAM, the host memory and device memory, respectively, as in [2].
The batch of threads that executes a kernel is organized as a grid of thread blocks, as shown in Figure 2,
following [2].
The goal of CUDA programming is to provide a relatively simple path for users familiar with
the C language. Based on [2], it consists of:
• A runtime library (presented in Table 1) split into:
• A host component, that runs on the host and provides functions to control and access one or
more compute devices from the host;
• A device component, that runs on the device and provides device-specific functions;
• A common component, that provides built-in vector types and a subset of the C standard
library that are supported in both host and device code.
• A minimal set of extensions to the C language, that allow the programmer to target portions of the
source code for execution on the device (composed of four parts presented in Table 2).
A new directive to specify how a kernel is executed on the device from the host:
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used
to execute the function on the device. It is specified by the expression <<<Dg,Db,Ns>>>
inserted between the function name and the parenthesized argument list, where: (i) Dg is of
type dim3 and specifies the dimension and size of the grid (Dg.x*Dg.y is the number of
blocks being launched), (ii) Db is of type dim3 and specifies the dimension and size of each
block (Db.x*Db.y*Db.z is the number of threads per block), (iii) Ns is of type size_t and
specifies the number of bytes in shared memory that is dynamically allocated per block for
this call in addition to the statically allocated memory.
Four built-in variables that specify the grid and block dimensions and the block and thread indices:
gridDim is of type dim3 and contains the dimensions of the grid.
blockIdx is of type uint3 and contains the block index within the grid.
blockDim is of type dim3 and contains the dimensions of the block.
threadIdx is of type uint3 and contains the thread index within the block.
Table 2. The set of extensions to the C language
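As a brief illustration of the execution configuration and of the built-in variables listed in Table 2, the following sketch launches a simple kernel that scales a device vector. The kernel name scale, the vector length n and the block size are assumptions made for this illustration only and do not come from [2]:

#include <cuda_runtime.h>

// hypothetical kernel: each thread scales one element of the vector
__global__ void scale(float *v, float s, int n)
{
    // global thread index computed from the built-in variables
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = s * v[i];
}

int main()
{
    const int n = 4096;
    float *v_dev;
    cudaMalloc((void**)&v_dev, n * sizeof(float));

    // execution configuration <<<Dg, Db, Ns>>>:
    // Dg blocks of Db threads each, Ns = 0 bytes of dynamic shared memory
    dim3 Db(256);
    dim3 Dg((n + Db.x - 1) / Db.x);
    scale<<<Dg, Db, 0>>>(v_dev, 2.0f, n);

    cudaFree(v_dev);
    return 0;
}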
A CUDA example
In the following, a CUDA program for adding two matrices in parallel is presented, which [3] compares
with the same program written in plain C. The complete source of the program, as presented in [3], is
given below:
#include <cstdlib>   // for EXIT_SUCCESS

// set grid size
const int N = 1024;
const int blocksize = 16;

// compute kernel: each thread adds one element of the matrices
__global__
void add_matrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    // CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for (int i = 0; i < N*N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

    // GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    // copy data to GPU
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // execute kernel
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    // copy result back to CPU
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    // clean up and return
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}
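For reference, the serial version of the same computation mentioned above can be written as a plain C++ loop. The following is a minimal sketch prepared for this example (the function name add_matrix_cpu is introduced here for illustration only, not reproduced from [3]):

#include <cstdlib>

const int N = 1024;

// serial matrix addition: one loop over all N*N elements, run entirely on the CPU
void add_matrix_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n * n; ++i)
        c[i] = a[i] + b[i];
}

int main()
{
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for (int i = 0; i < N*N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

    add_matrix_cpu(a, b, c, N);

    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}

Comparing the two versions shows that the CUDA program replaces the loop over elements with a grid of threads, one thread per matrix element, while the per-element computation stays the same.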
Conclusions
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU)
using graphics APIs:
• It uses the standard C language, with some simple extensions; there is no need to learn a graphics API;
• Scattered writes – code can write to arbitrary addresses in memory;
• Shared memory – CUDA exposes a fast shared memory region that can be shared amongst
threads (a brief sketch follows this list).
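The shared-memory point can be made concrete with a small sketch. The kernel below is a hypothetical example written for this paper, not code from the CUDA documentation; it stages one block's worth of data in __shared__ memory and writes it back reversed within the block, assuming the array length is a multiple of the block size:

#define BLOCK 256

// each block copies its elements into fast on-chip shared memory, synchronizes,
// then writes them back in reversed order within the block
__global__ void reverse_in_block(float *d)
{
    __shared__ float s[BLOCK];   // visible to all threads of the same block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = d[i];
    __syncthreads();             // wait until every thread of the block has written s

    d[i] = s[blockDim.x - 1 - threadIdx.x];
}

Such a kernel would be launched with one thread per element, for example reverse_in_block<<<n/BLOCK, BLOCK>>>(d_ptr) for a device array d_ptr of n floats.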
But CUDA has some limitations:
• Recursive functions are not supported and must be converted to loops;
• Threads should be run in groups of at least 32 for best performance;
• CUDA-enabled GPUs are only available from Nvidia (GeForce 8 series and above, Quadro and Tesla).
Bibliography
1. http://www.nvidia.com/object/cuda_home.html
2. http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
3. Johan Seland, CUDA Programming, Winter School in Parallel Computing, Geilo, January 20-25, 2008
(http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf)
4. http://www.gpgpu.org/sc2007/