Module 3.1 - CUDA Parallelism Model: GPU Teaching Kit
Accelerated Computing
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Example: Vector Addition Kernel Launch (Host Code)
Host Code
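The code for this slide is missing from the extracted text; a typical host-side wrapper, sketched here assuming the allocate/copy/launch/copy/free pattern described later in this module (d_A, d_B, d_C are the device copies of the host arrays), looks like:

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory and copy the inputs over
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel launch (shown on the next slide)

    // Copy the result back and release device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}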
More on Kernel Launch (Host Code)
Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B and d_C are the device copies of the arrays,
    // allocated and filled as shown on the previous slide
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}
Kernel execution in a nutshell
Host code:

__host__
void vecAdd(…)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}

Device code:

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

[Figure: the launched grid of blocks executes on the GPU, with its device memories M0 … Mk alongside the host RAM]
More on CUDA Function Declarations
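The declaration table for this slide did not survive extraction; the qualifiers it covers are standard CUDA, illustrated in the sketch below (function names are placeholders):

// __global__ : a kernel; runs on the device, launched from the host with <<<...>>>, must return void
__global__ void kernelFunc(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// __device__ : runs on the device, callable only from device code
__device__ float deviceFunc(float x) {
    return 2.0f * x;
}

// __host__ : runs on the host (the default for unqualified functions);
// combining __host__ __device__ compiles the function for both sides
__host__ __device__ float hostDeviceFunc(float x) {
    return x + 1.0f;
}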
How to manage memory (basics)
– Device memory allocation:
cudaError_t cudaMalloc(void ** devPtr, size_t size);
– Device memory deallocation:
cudaError_t cudaFree(void * devPtr);
– Copy data between host and device:
cudaError_t cudaMemcpy(void* dst, const void* src,
size_t size, cudaMemcpyKind kind);
– kind is one of the following:
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
Most of the time, data are allocated twice: once on the host and once on the device.
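As a concrete illustration of these calls, the sketch below allocates room for n floats on the device, copies a host array in and back out, and frees the allocation (h_data and d_data are illustrative names):

int size = n * sizeof(float);
float *d_data;

cudaError_t err = cudaMalloc((void **) &d_data, size);     // allocate on the device
if (err != cudaSuccess) { /* handle the allocation failure */ }

cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read and write d_data ...
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_data);                                          // release the device memory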
Compiling A CUDA Program
NVCC Compiler
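The compilation-flow figure is not in the extracted text; in practice a single nvcc invocation compiles both the host and the device parts of a .cu file (file and binary names here are placeholders):

nvcc -o vecAdd vecAdd.cu
nvcc -arch=sm_70 -o vecAdd vecAdd.cu    # optionally target a specific GPU architecture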
Exercise 1
A Multi-Dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of blocks (0,0), (0,1), (1,0), (1,1) on the device; a second launch creates Grid 2, and its Block (1,0) is shown expanded into threads (1,0,0), (1,0,1), (1,0,2), (1,0,3)]
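A launch that produces a small multi-dimensional grid like the one in the figure can be written as below; the kernel name, arguments, and block shape are illustrative, since the figure's exact dimensions are only partly recoverable:

dim3 DimGrid(2, 2, 1);    // a 2 x 2 grid of blocks, as in the figure
dim3 DimBlock(4, 1, 1);   // 4 threads per block (illustrative)
exampleKernel<<<DimGrid, DimBlock>>>(d_data, n);

// Inside the kernel, each thread finds its position from blockIdx.{x,y,z},
// threadIdx.{x,y,z}, blockDim.{x,y,z} and gridDim.{x,y,z}.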
Processing a Picture with a 2D Grid
[Figure: a 62×76 picture covered with 16×16 blocks]
Row-Major Layout in C/C++
A 4×4 matrix M laid out row by row:

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

becomes the linear array

M0 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15

so element (Row, Col) sits at index Row*Width + Col; for example M2,1 is at 2*4 + 1 = 9.
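In code, the same linearization is a single index computation (M, Row, Col, Width as above):

// Element (Row, Col) of a row-major matrix with 'Width' columns:
float value = M[Row * Width + Col];

// Figure example: Width = 4, Row = 2, Col = 1  ->  index 2*4 + 1 = 9, i.e. M2,1 is M[9]
float m21 = M[2 * 4 + 1];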
Source Code of a PictureKernel
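The kernel source itself is missing from the extracted text; the sketch below matches the launch on the next slide (m rows, n columns, one float per pixel) and simply scales every pixel by 2.0, which is an assumption about what the original kernel computes:

__global__
void PictureKernel(float * d_Pin, float * d_Pout, int height, int width)
{
    // Pixel coordinates of this thread
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    // Only threads that fall inside the picture do any work
    if (Row < height && Col < width) {
        d_Pout[Row * width + Col] = 2.0f * d_Pin[Row * width + Col];
    }
}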
Host Code for Launching PictureKernel
// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device
…
dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);
…
Covering a 62×76 Picture with 16×16 Blocks
Not all threads in a Block will follow the same control flow path.
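With 16×16 blocks this takes ceil(76/16) = 5 blocks along one picture dimension and ceil(62/16) = 4 along the other, i.e. 5 × 4 = 20 blocks and 20 × 256 = 5120 threads for 62 × 76 = 4712 pixels; the threads that land outside the picture fail the boundary test and do no work.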
RGB Color Image Representation
RGB to Grayscale Conversion
Color Calculating Formula
For each input pixel (r, g, b), the grayscale value is
gray = 0.21*r + 0.71*g + 0.07*b
RGB to Grayscale Conversion Code
// The input image is encoded as float [0, 1]
__global__ void colorConvert(float * grayImage,
float * rgbImage,
int width, int height, int channels) {
int x = …;  // horizontal pixel coordinate
int y = …;  // vertical pixel coordinate
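The rest of the kernel is not in the extracted text; one possible completion, using the 0.21/0.71/0.07 weights from the formula slide and assuming an interleaved r,g,b pixel layout, is sketched below:

__global__ void colorConvert(float * grayImage, float * rgbImage,
                             int width, int height, int channels) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal pixel coordinate
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // vertical pixel coordinate

    if (x < width && y < height) {
        int idx = (y * width + x) * channels;       // assumes interleaved r,g,b (channels >= 3)
        float r = rgbImage[idx];
        float g = rgbImage[idx + 1];
        float b = rgbImage[idx + 2];
        grayImage[y * width + x] = 0.21f * r + 0.71f * g + 0.07f * b;
    }
}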
Exercise 2
Image Blurring
Blurring Box
[Figure: the box of pixels processed by one thread block]
Image Blur as a 2D Kernel
__global__
void blurKernel(float * in, float * out, int w, int h, int c)
{
int Col = blockIdx.x * blockDim.x + threadIdx.x;
int Row = blockIdx.y * blockDim.y + threadIdx.y;
__global__
void blurKernel(float * in, float * out, int w, int h, int c) {
int Col = blockIdx.x * blockDim.x + threadIdx.x;
int Row = blockIdx.y * blockDim.y + threadIdx.y;
// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box around pixel (Col, Row)
for( … ) {
for( … ) {
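The loop bodies are missing from the extracted text; the sketch below is one way to finish the kernel, assuming a single-channel image (the channel count c is left unused) and a BLUR_SIZE radius centered on the pixel; the exact window convention used in the course code may differ:

#define BLUR_SIZE 1   // blur radius; illustrative value

__global__
void blurKernel(float * in, float * out, int w, int h, int c) {
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;

    if (Col < w && Row < h) {
        float pixVal = 0.0f;
        int pixels = 0;
        // Visit the box of pixels around (Col, Row), skipping those outside the image
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int curRow = Row + blurRow;
                int curCol = Col + blurCol;
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    pixels++;
                }
            }
        }
        out[Row * w + Col] = pixVal / pixels;   // average of the valid neighbours
    }
}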
Exercise 3
Transparent Scalability
[Figure: the same thread grid of eight blocks (Block 0 … Block 7) runs two blocks at a time on a smaller device and four blocks at a time on a larger one; over time, blocks may execute in any order relative to each other]
Example: Executing Thread Blocks
– Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
– The GPU in the room can take up to 2048 threads per SM
– Could be 256 (threads/block) * 8 blocks
– Or 512 (threads/block) * 4 blocks, etc.
– The SM maintains thread/block idx #s
[Figure: an SM with its SPs, shared memory, and the resident blocks of threads t0 t1 t2 … tm]
The Von-Neumann Model
[Figure: the von Neumann model: Memory and I/O feeding a Processing Unit (ALU and Register File) driven by a Control Unit (PC and IR)]
The Von-Neumann Model with SIMD units
[Figure: the same von Neumann organization in which the Control Unit (PC and IR) drives several SIMD processing units, each with its ALU and Register File, sharing Memory and I/O]
Warps as Scheduling Units
• Each block is executed as 32-thread warps
– This is an implementation decision, not part of the CUDA programming model
– Warps are the scheduling units in an SM
– Threads in a warp execute in SIMD
– Future GPUs may have a different number of threads per warp
Warp Example
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
– Each block is divided into 256/32 = 8 warps
– There are 8 * 3 = 24 warps
[Figure: the warps of Block 0, Block 1, and Block 2 (threads t0 t1 t2 … t31 each) resident on one SM, sharing its register file and L1/shared memory]
Example: Thread Scheduling (Cont.)
– SM implements zero-overhead warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible Warps are selected for execution based on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected
Block Granularity Considerations
– Use the occupancy calculator to compute the theoretically best block size configuration
– Example for imageBlur: nvcc --resource-usage (see the command sketch after this list) reports
– 26 registers
– 16x16 = 256 threads per block
– 0 bytes of shared memory (see further chapters)
– Occupancy calculator:
– 256/32 = 8 warps per block (max. is 64 warps per SM on these GPUs)
– 64/8 = 8 blocks per SM
– 256*8 = 2048 threads per SM (the max on these GPUs): 100% occupancy
– Mind the register usage
– Mind the shared memory usage
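The resource query referred to above can be run as follows; the source file name is a placeholder, and --resource-usage asks nvcc to print each kernel's register and memory usage during compilation:

nvcc --resource-usage -c imageBlur.cu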