CUDA C
CUDA
A general-purpose parallel computing platform and programming model
Introduced in 2006 by NVIDIA
Enhances the compute engine in GPUs to solve complex computational problems
Allows developers to use C as a high-level programming language
Other languages, application programming interfaces, and directive-based approaches are also supported
CUDA C
A small set of extensions to the standard C language that enables heterogeneous programming
Automatic Scalability
Each block of threads can be scheduled on any available multiprocessor, in any order, concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
Heterogeneous Computing
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
Heterogeneous Computing
Serial code executes in a HOST (CPU) thread
Parallel code executes in many concurrent DEVICE (GPU) threads across multiple parallel processing elements
Refer: http://docs.nvidia.com/cuda/cudacompiler-driver-nvcc/#axzz3Qz0M7rGW
CUDA compilation trajectory
  Separates the device functions from the host code
  Compiles the device functions using proprietary NVIDIA compilers/assemblers
  Compiles the host code using a general-purpose C/C++ compiler that is available on the host platform
  In the linking stage, specific CUDA runtime libraries are added to support remote SPMD procedure calling and explicit GPU manipulation
Purpose of NVCC
  Hides the intricate details of CUDA compilation from developers
  The compilation trajectory involves splitting, compilation, preprocessing, and merging steps for each CUDA source file
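For example, assuming a source file named vecadd.cu (file name illustrative), a single nvcc invocation drives the whole trajectory:

$ nvcc vecadd.cu -o vecadd    # compile host and device code, link in the CUDA runtime libraries
$ nvcc -ptx vecadd.cu         # stop after compiling the device code to PTX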
NVCC Compiler
[Figure: nvcc splits each source file into host code, handled by the host C preprocessor and compiler/linker, and device code, compiled by NVIDIA's device compilers]
Thread
CUDA Kernels
Thread Hierarchies
Grid
  One or more thread blocks
  Organized as a 3D array of blocks
Block
  3D array of threads
  Each block in a grid has the same dimensions (same number of threads)
Each thread in a block can
  Synchronize
  Access shared memory
Thread Hierarchies
A kernel is executed as a grid of thread blocks
All threads in a grid share the global data memory space
Threads within a block can cooperate by synchronizing their execution and by efficiently sharing data through low-latency shared memory
Two threads from two different blocks cannot cooperate
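As an illustration of in-block cooperation, a sketch after NVIDIA's shared-memory examples (kernel name, array size, and launch configuration are illustrative):

__global__ void staticReverse(int *d, int n) {
    __shared__ int s[64];    // shared memory, visible to all threads in the block
    int t = threadIdx.x;
    s[t] = d[t];             // each thread stages one element
    __syncthreads();         // barrier: wait until every thread in the block has written s
    d[t] = s[n - t - 1];     // now safely read an element written by another thread
}

Launched with a single 64-thread block: staticReverse<<<1, 64>>>(d_d, 64);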
Thread Hierarchies
Thread Block
Group of threads
G80 and GT200: Up to 512 threads
Fermi: Up to 1024 threads
Threads: Representation
Initialize many threads: hundreds or thousands to wake up a GPU from its bed!
All threads in a grid execute the same kernel code
A kernel launch specifies the dimensions of the grid and of each block
These dimensions are available inside the kernel through the built-in variables gridDim and blockDim
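A tiny illustration (kernel name hypothetical):

dim3 grid(4, 1, 1), block(128, 1, 1);
myKernel<<<grid, block>>>();   // inside myKernel, gridDim.x == 4 and blockDim.x == 128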
Output:
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
Source: J. Sanders et al., CUDA by Example
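The program compiled above is presumably the book's minimal example, an empty kernel launched from main(); a sketch:

#include <stdio.h>

__global__ void kernel(void) {
    // empty kernel: runs on the device and does nothing
}

int main(void) {
    kernel<<<1,1>>>();           // launch the kernel on the GPU
    printf("Hello World!\n");
    return 0;
}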
The CUDA C keyword __global__ indicates a function that:
  Runs on the device
  Is called from host code
nvcc separates source code into host and device components
  Device functions are processed by the NVIDIA compiler
  Host functions (e.g. main()) are processed by the standard host compiler: gcc, cl.exe
Triple angle brackets mark a call from host code to device code
  Also called a kernel launch
  We'll return to the parameters (1,1) in a moment
That is all that is required to execute a function on the GPU!
Source: J. Sanders et al., CUDA by Example
cudaMalloc()
First parameter: address of a pointer variable that will be set to point to the allocated object
  The address is cast to (void **) because the function expects a generic pointer value
  This allows cudaMalloc() to write the address of the allocated device memory into the pointer variable, regardless of its type
Second parameter: size of the requested allocation in bytes
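A minimal usage sketch, allocating space for n floats (n illustrative):

float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**) &d_A, size);   // on return, d_A holds a device address
cudaFree(d_A);                     // later: release the device allocation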
CUDA APIs
cudaMemcpy()
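cudaMemcpy(dst, src, count, kind) copies count bytes from src to dst; the kind parameter selects the transfer direction, as used in the code below:

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);   // host memory to device memory
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);   // device memory to host memory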
#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    // Part 2: kernel launch code (shown later)
    // Part 3: copy the result back and free device memory (shown later)
}
                                 Executed on    Only callable from
__device__ float DeviceFunc()    device         device
__global__ void KernelFunc()     device         host
__host__ float HostFunc()        host           host
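A sketch of the three qualifiers in source form (function names illustrative):

__device__ float square(float x) { return x * x; }      // device-only helper, callable from kernels

__global__ void squareAll(float *d, int n) {            // kernel: runs on the device, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = square(d[i]);
}

__host__ float hostFunc(void) { return 1.0f; }          // ordinary host function (__host__ is the default)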
Registers
  Per thread; data lifetime = thread lifetime
Local Memory
  Per-thread off-chip memory; data lifetime = thread lifetime
Shared Memory
  Per thread block, on-chip; data lifetime = block lifetime
Global Memory
  Accessible by all threads as well as the host; data lifetime = from allocation to deallocation
Host Memory
  Not directly accessible by CUDA threads
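A sketch mapping variables to these spaces (kernel name illustrative; assumes blockDim.x <= 256):

__global__ void memSpaces(float *g) {   // g points to global memory, visible to all threads and the host
    int r = threadIdx.x;                // r lives in a register, private to this thread
    __shared__ float s[256];            // s lives in on-chip shared memory, one copy per block
    s[r] = g[r];                        // stage global data into shared memory
    __syncthreads();
    g[r] = 2.0f * s[r];
}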
Built-in variables: gridDim, blockDim, blockIdx, threadIdx
Thread Indexing
Thread ID = blockIdx.x * blockDim.x + threadIdx.x
With M = blockDim.x threads in each block:
  Thread 3 of Block 0 has a thread ID of 0*M + 3
  Thread 3 of Block 5 has a thread ID of 5*M + 3
With 128 blocks of 32 threads each, a grid has a total of 128*32 = 4096 threads
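This index is exactly what the vector-addition kernel uses; a sketch consistent with the host code shown later:

__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard: the grid may have more threads than n
        C[i] = A[i] + B[i];
}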
In general, unused grid and block dimensions should be set to 1
Example: a 1D grid with 128 blocks, each of which consists of 32 threads
  Total number of threads = 128 * 32 = 4096

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>();

dimBlock and dimGrid are programmer-defined variables
These variables can have any names as long as they are of type dim3 and the kernel launch uses the matching names
For 1D grids and blocks, scalar expressions can be used instead of dim3 variables; the compiler takes them as the x dimensions and assumes the y and z dimensions are 1

vecAddKernel<<<ceil(n/256.0), 256>>>();

Within the kernel, the x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters and cannot be changed
The allowed values of gridDim.x, gridDim.y and gridDim.z range from 1 to 65,536
All threads in a block share the same blockIdx.x, blockIdx.y and blockIdx.z values
In a grid:
  blockIdx.x ranges between 0 and gridDim.x - 1
  blockIdx.y ranges between 0 and gridDim.y - 1
  blockIdx.z ranges between 0 and gridDim.z - 1
Example: a grid of (2, 2, 1) blocks, where each block is a (4, 2, 2) array of threads
Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
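Inside such a kernel, each thread can combine block and thread indices per dimension; a brief sketch (kernel name illustrative, indices unused here):

__global__ void Kernel3D(void) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 7 for this configuration
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 3
    int z = blockIdx.z * blockDim.z + threadIdx.z;   // 0 .. 1
}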
Vector Addition Kernel Launch
#include <cuda.h>
#include <math.h>

void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory and copy the inputs to the device
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    // Launch the kernel: ceil(n/256.0) blocks of 256 threads
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

    // Copy the result back and free device memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
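A short usage sketch from host code (sizes and values illustrative):

int main(void) {
    const int n = 1024;
    float A[n], B[n], C[n];
    for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2.0f * i; }
    vecAdd(A, B, C, n);           // after the call, C[i] == A[i] + B[i]
    return 0;
}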