CUDA C


CUDA C

CUDA
A general-purpose parallel computing platform and programming model
Introduced in 2006 by NVIDIA
Enhances the compute engine in GPUs to solve complex computational problems in an efficient way
Comes with a software environment that allows developers to use C as a high-level programming language
Other languages, application programming interfaces, and directive-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC


CUDA C - A Scalable Programming Model

Mainstream processors are parallel
Advent of many-core and multi-core chips
3D graphics applications transparently scale their parallelism to GPUs with varying numbers of cores
Challenge: to develop application software that scales transparently with the number of cores
Let programmers focus on parallel algorithms
Not on the mechanics of a parallel programming language

CUDA C - A Scalable Programming Model

Facing the challenge with a minimal set of language extensions
Hierarchy of threads
Shared memory
Barrier synchronization
Partition the problem into coarse sub-problems
Solved independently in parallel by blocks of threads
Partition sub-problems into finer pieces
Solved cooperatively in parallel by all threads within the block

CUDA C - A Scalable Programming Model

Each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially
A compiled CUDA program can therefore execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count


Automatic Scalability

A multithreaded program is partitioned into blocks of threads that execute independently from each other, so a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.

Heterogeneous Computing
Host: CPU and its memory (host memory)
Device: GPU and its memory (device memory)


GPU programming model

Serial code executes in a HOST (CPU) thread
Parallel code executes in many concurrent DEVICE (GPU) threads across multiple parallel processing elements

Compiling CUDA C Programs

Refer: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/

Compiling CUDA C Programs

Source files for CUDA applications
A mixture of conventional C++ host code plus GPU device (i.e., GPU-side) functions
CUDA compilation trajectory
Separates the device functions from the host code,
Compiles the device functions using proprietary NVIDIA compilers/assemblers,
Compiles the host code using a general-purpose C/C++ compiler that is available on the host platform, and
Embeds the compiled GPU functions in the host object file.
In the linking stage, specific CUDA runtime libraries are added to support remote SIMD procedure calling and to provide explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer.


Purpose of NVCC
The compilation trajectory involves splitting, compilation, preprocessing, and merging steps for each CUDA source file
The CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general-purpose C compiler that is supported by nvcc
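For example, a single source file might be compiled with familiar gcc-style options (file, macro and path names here are hypothetical):

nvcc -O2 -Iinclude -DUSE_DOUBLE -o vecadd vecadd.cu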



Integrated C programs with CUDA extensions

The NVCC compiler splits the source:
Host code goes to the host C preprocessor, compiler, and linker
Device code (PTX) goes to the device just-in-time compiler
The result runs on a heterogeneous computing platform with CPUs and GPUs
Source: Kirk and Hwu

nvcc's basic workflow separates device code from host code and then:

Compiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing the kernel-launch syntax with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code / cubin object
The modified host code is output as object code by letting nvcc invoke the host compiler during the last compilation stage

Anatomy of a CUDA C Program


Step 1: Copy input data from host memory to device memory


Anatomy of a CUDA C Program


Step 2: Launch a kernel on the device


Anatomy of a CUDA C Program


Step 3: Execute the kernel on the device, caching data on chip for performance


Anatomy of a CUDA C Program


Step 4: Copy results from device memory to host memory
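Put together, a minimal host-side sketch of these four steps might look like the following (names such as d_A, h_A and myKernel are illustrative, and error checking is omitted):

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);     // Step 1: host -> device
myKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_C, n);  // Step 2: launch the kernel
// Step 3: the kernel runs on the device, keeping working data on chip where possible
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);     // Step 4: device -> host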


Thread

A simplified view of how a processor executes a program

Consists of
The code of the program
The particular point in the code that is being executed
The values of its variables and data structures

In CUDA, execution of each thread is sequential
But a CUDA program initiates parallel execution by launching kernel functions
This causes the underlying mechanisms to create many threads that process different parts of the data in parallel

CUDA Kernels

A kernel is a function that executes the parallel portions of an application on the device and can be called from the host
One kernel is executed at a time, by many threads
Threads execute as an array of threads in parallel
All threads run the same code
Each thread has an ID that is used to compute memory addresses and make control decisions
Kernels can only access device memory

Thread Hierarchies
Grid
One or more thread blocks
Organized as a 3D array of blocks
Block
3D array of threads
Each block in a grid has the same number of threads (dimension)
Each thread in a block can
Synchronize
Access shared memory

Thread Hierarchies
A kernel is executed as a grid of thread blocks
All threads share the data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution, for hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate


Thread Hierarchies
Thread Block
A group of threads
G80 and GT200: up to 512 threads
Fermi: up to 1024 threads
All reside on the same processor core
Share the memory of that core


Threads: Representation
Initialize many threads: hundreds or thousands, to wake up a GPU from its bed!

CUDA Thread Organization

Threads are grouped into blocks; blocks are grouped into grids
Kernels are executed as a grid of blocks of threads


CUDA Thread Organization

Only one kernel can execute on a device at one time

CUDA Thread Organization

All threads in a grid execute the same kernel function
They rely on unique coordinates to distinguish themselves
Two-level hierarchy of unique coordinates
blockIdx (block index): shared by all threads in a block
threadIdx (thread index): unique within a block
Used together they form a unique ID for each thread per kernel
These are built-in, pre-initialized variables accessed within kernels
References to them return the coordinates of the thread when executed
The kernel launch specifies the dimensions of the grid and of each block
gridDim and blockDim
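For example, inside a kernel these built-ins are typically combined to form a unique global index (1D case, a minimal sketch):

int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index for this thread across the grid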


Your First CUDA Program!


Program 1: Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

Standard C that runs on the host
The NVIDIA compiler (nvcc) can be used to compile programs with no device code


Program 1: Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Two new syntactic elements

Source: J. Sanders et al., CUDA by Example

Program 1: Hello World! with Device Code

__global__ void mykernel(void) {
}

The CUDA C/C++ keyword __global__ indicates a function that:
Runs on the device
Is called from host code

nvcc separates source code into host and device components
Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
Host functions (e.g. main()) are processed by the standard host compiler, e.g. gcc, cl.exe

Program 1: Hello World! with Device Code

mykernel<<<1,1>>>();

Triple angle brackets mark a call from host code to device code
Also called a kernel launch
We'll return to the parameters (1,1) in a moment
That's all that is required to execute a function on the GPU!


Program 2: Summing two vectors

Source: J. Sanders et al., CUDA by Example
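A minimal serial CPU version of the vector sum, in the spirit of the CUDA by Example listing (N is the number of elements):

#define N 10

void add(int *a, int *b, int *c) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    // one element per loop iteration
}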

The loop can also be written with an explicit index variable tid that is advanced each iteration; that form is easy to parallelize, as in the sketch below.


Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores.
For example, with a dual-core processor:
Change the increment to 2
Have one core initialize the loop with tid = 0 and the other with tid = 1
The first core would add the even-indexed elements, and the second core would add the odd-indexed elements

Multi-core CPU vector sum
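A sketch of the two-core version just described, where each core starts at a different index and strides by 2 (illustrative only; assumes N and the arrays are defined as above):

// Core 0 runs add_loop(0, ...) and core 1 runs add_loop(1, ...), e.g. in two host threads.
void add_loop(int tid, int *a, int *b, int *c) {
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;              // two cores, so each strides by 2
    }
}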


GPU vector sum

#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n){
    // 1. Allocate device memory for A, B and C
    //    Copy A and B to device memory
    // 2. Kernel launch code: have the device perform the actual vector addition
    // 3. Copy C from the device memory to the host memory
    //    Free memory allocated on the device
}

N is the number of parallel blocks specified while launching the kernel
For example, with four blocks, each runs through the same copy of the device code but with a different value for the block id (see the sketch below)
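A sketch of such a kernel, in the style of the CUDA by Example listing, where each block handles the element given by its block id (assumes N and the device arrays are already set up):

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;        // each block handles one element
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
// launched from the host as: add<<<N, 1>>>(dev_a, dev_b, dev_c);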



CUDA Device Memory Model

Global and constant memory
Written and read by the host by calling API functions
Constant memory: short-latency, high-bandwidth, read-only access by the device

Registers and shared memory (on-chip memories)
Accessed at high speed in a parallel manner
Registers
Allocated to individual threads
Each thread can only access its own registers

CUDA Device Memory Model

A kernel function uses registers to hold frequently accessed variables that are private to each thread

Shared memory is allocated to thread blocks
All threads in a block can access its variables
An efficient means for threads to cooperate by sharing input data and intermediate results


CUDA API function for managing device global memory


cudaMalloc()
First parameter: address of a pointer variable that will point to the allocated object after allocation
The address of the pointer variable should be cast to (void **)
The function expects a generic pointer value
This allows cudaMalloc() to write the address of the allocated object into the pointer variable
Second parameter: size of the object to be allocated, in bytes

float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);



cudaMemcpy()

Cannot be used to copy between different GPUs
Can be used to transfer data in both directions by
Proper ordering of the source and destination pointers and
Using the appropriate constant for the transfer type
Four symbolic predefined constants
cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

GPU vector sum

#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n){
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    //Kernel invocation code

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    //Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}


CUDA Extensions to function declaration

                                    Executed on the:    Only callable from the:
__device__ float DeviceFunc()       Device              Device
__global__ void  KernelFunc()       Device              Host
__host__   float HostFunc()         Host                Host
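A sketch of how these qualifiers appear in source code (function names are illustrative):

__device__ float DeviceFunc(float x) { return x * x; }   // runs on the device, callable from device code
__global__ void KernelFunc(float *d_out, float x) {      // runs on the device, launched from the host
    d_out[threadIdx.x] = DeviceFunc(x);
}
__host__ float HostFunc(float x) { return x * x; }        // ordinary host function (the default)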


CUDA Memory Hierarchy

Thread: registers and local memory
Blocks: shared memory


CUDA Memory Hierarchy

Grid (all thread blocks): global memory

Registers
Per thread; data lifetime = thread lifetime
Local memory
Per-thread off-chip memory
Data lifetime = thread lifetime
Shared memory
Per thread block, on-chip
Data lifetime = block lifetime
Global memory
Accessible by all threads as well as the host
Data lifetime = from allocation to deallocation
Host memory
Not directly accessible by CUDA threads
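A minimal kernel sketch showing where some of these memory spaces appear in code (illustrative only; assumes blocks of 256 threads):

__global__ void scaleKernel(float *g_data) {         // g_data points to global memory
    __shared__ float tile[256];                       // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register, private to the thread
    tile[threadIdx.x] = g_data[i];
    __syncthreads();                                  // barrier: the whole tile is now visible to the block
    g_data[i] = tile[threadIdx.x] * 2.0f;
}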


Thread and Memory Organization

Blocks execute on an SM (streaming multiprocessor)


Thread and Memory Organization

Grid of blocks across the GPU


Thread and Block Indexing

Each thread has an ID
Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates
Threads have 3D IDs, unique within a block
Built-in variables: threadIdx, blockIdx, blockDim, gridDim


Thread Indexing

A grid consists of N thread blocks
Each with a blockIdx.x value that ranges from 0 to N-1
Each block consists of M threads
Each with a threadIdx.x value that ranges from 0 to M-1
All blocks at the grid level are organized as a one-dimensional (1D) array
All threads within each block are also organized as a 1D array
Each grid has a total of N*M threads

Thread Indexing

Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Thread 3 of Block 0 has a thread ID value of 0*M + 3
Thread 3 of Block 5 has a thread ID value of 5*M + 3

E.g.: a grid with 128 (N) blocks and 32 (M) threads in each block
Total of 128*32 = 4096 threads in the grid
Each thread should have a unique ID
Thread 3 of Block 0 has a thread ID value of 0*32 + 3 = 3
Thread 3 of Block 5 has a thread ID value of 5*32 + 3 = 163
Thread 15 of Block 102, thread ID = ?


Thread and Block Indexing

Grid dimension: <3, 2>
gridDim.x = 3
gridDim.y = 2
blockIdx.x = 1
blockIdx.y = 1
Block dimension: <5, 3>
blockDim.x = 5
blockDim.y = 3
threadIdx.x = 4
threadIdx.y = 2
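With these built-in values, a thread's global 2D coordinates can be computed as follows (a sketch; the comments use the numbers above):

int col = blockIdx.x * blockDim.x + threadIdx.x;   // 1*5 + 4 = 9
int row = blockIdx.y * blockDim.y + threadIdx.y;   // 1*3 + 2 = 5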


Vector Addition Kernel Function

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if(i < n)
        C[i] = A[i] + B[i];
}

The if(i < n) statement allows the kernel to process vectors of arbitrary lengths.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads). The last 28 threads in block 3 need to be disabled.

Thread Indexing and Organization

In general
A grid is organized as a 3D array of blocks
A block is organized as a 3D array of threads
The exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters
2 parameters
The first describes the configuration of the grid: the number of blocks
The second describes the configuration of the blocks: groups of threads
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y and z (three dimensions)


Thread Indexing and Organization

For 1D and 2D grids/blocks, the unused fields should be set to 1
Example: a 1D grid with 128 blocks, each of which consists of 32 threads
Total number of threads = 128 * 32 = 4096

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>();

dimGrid and dimBlock are programmer-defined variables
These variables can have any names as long as they are of type dim3 and the kernel launch uses the appropriate names


Thread Indexing and Organization

If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration
The compiler takes each expression as the x dimension and assumes y and z are 1

vecAddKernel<<<ceil(n/256.0), 256>>>();

gridDim and blockDim are part of the CUDA C specification and cannot be changed
The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters
If n is 4000, then gridDim.x = 16 and blockDim.x = 256


Thread Indexing and Organization

Allowed values of gridDim.x, gridDim.y and gridDim.z range from 1 to 65536
All threads in a block share the same blockIdx.x, blockIdx.y and blockIdx.z
In a grid
blockIdx.x ranges between 0 and gridDim.x - 1
blockIdx.y ranges between 0 and gridDim.y - 1
blockIdx.z ranges between 0 and gridDim.z - 1

The total size of a block is limited to 1024 threads
Flexibility to divide them among the 3 dimensions
Example valid blockDim values: (512, 1, 1), (8, 16, 4) and (32, 16, 2)
(32, 32, 32) is not allowed, since 32*32*32 = 32768 exceeds 1024


2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2)

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>( ... );


CUDA execution configuration parameters

E.g.: dim3 dimBlock(5, 3);
      dim3 dimGrid(3, 2);
Kernel call:
E.g.: gauss<<<dimGrid, dimBlock>>>( )


Kernel Launch Statement

vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
The first execution configuration parameter is the number of thread blocks; the second is the number of threads in each block

If there are 1000 elements, we launch ceil(1000/256.0) = 4 thread blocks
This launches 4 * 256 = 1024 threads
The number of thread blocks depends on the length of the vectors (n)
If n = 750, 3 thread blocks
If n = 4000, 16 thread blocks

Vector Addition Kernel Launch

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n){
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    //Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
