
GPU Teaching Kit

Accelerated Computing

Module 3.1 - CUDA Parallelism Model


Kernel-Based SPMD Parallel Programming
Objective
– To learn the basic concepts involved in a simple CUDA kernel function
– Declaration
– Built-in variables
– Thread index to data index mapping

Example: Vector Addition Kernel

Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
int i = threadIdx.x+blockDim.x*blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}

Example: Vector Addition Kernel Launch (Host Code)

Host Code

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
// d_A, d_B, d_C allocations and copies omitted
// Run ceil(n/256.0) blocks of 256 threads each
vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.
More on Kernel Launch (Host Code)

Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
dim3 DimGrid((n-1)/256 + 1, 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}

This is an equivalent way to express the ceiling function.
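As a quick check (a standalone C sketch), both forms give the same block count, e.g. 4 blocks for n = 1000:

#include <stdio.h>
#include <math.h>
int main(void) {
  int n = 1000;
  printf("%d %d\n", (int)ceil(n / 256.0), (n - 1) / 256 + 1);  // prints "4 4"
  return 0;
}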

Kernel execution in a nutshell

__host__
void vecAdd(…)
{
  dim3 DimGrid(ceil(n/256.0),1,1);
  dim3 DimBlock(256,1,1);
  vecAddKernel<<<DimGrid,DimBlock>>>(d_A,d_B,d_C,n);
}

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if( i<n ) C[i] = A[i]+B[i];
}

[Figure: the grid of thread blocks (Blk 0 … Blk N-1) dispatched to a GPU with memories M0 … Mk and RAM]
More on CUDA Function Declarations

                                 Executed on the:    Only callable from the:
__device__ float DeviceFunc()    device              device
__global__ void KernelFunc()     device              host
__host__ float HostFunc()        host                host

− __global__ defines a kernel function
− Each “__” consists of two underscore characters
− A kernel function must return void
− __device__ and __host__ can be used together
− __host__ is optional if used alone (see the sketch below)
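A small illustration of combining the qualifiers (a sketch; the names square and squareAll are hypothetical):

__device__ __host__ float square(float x) { return x * x; }  // compiled for both host and device

__global__ void squareAll(float* v, int n)  // kernel: launched from host, runs on device
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] = square(v[i]);
}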

How to manage memory (basics)
– Device memory allocation:
cudaError_t cudaMalloc(void** devPtr, size_t size);
– Device memory deallocation:
cudaError_t cudaFree(void* devPtr);
– Copy data between host and device:
cudaError_t cudaMemcpy(void* dst, const void* src,
                       size_t size, cudaMemcpyKind kind);
– kind is one of the following:
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice

Most of the time, data are allocated twice: once on the host and once on the device.

Host allocation uses standard malloc or calloc.

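Putting these calls together, a sketch of the complete vecAdd host function from earlier, with the omitted allocations and copies filled in (error checking elided):

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;

  // Allocate device memory for the three vectors
  cudaMalloc((void**)&d_A, size);
  cudaMalloc((void**)&d_B, size);
  cudaMalloc((void**)&d_C, size);

  // Copy inputs from host to device
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  // Run ceil(n/256.0) blocks of 256 threads each
  vecAddKernel<<<(n - 1) / 256 + 1, 256>>>(d_A, d_B, d_C, n);

  // Copy the result back, then free device memory
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}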
Compiling A CUDA Program

Integrated C programs with CUDA extensions
  → NVCC Compiler, which separates:
    – Host Code → Host C Compiler/Linker
    – Device Code (PTX) → Device Just-in-Time Compiler
  → Heterogeneous Computing Platform with CPUs, GPUs, etc.
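A typical build command (source file name hypothetical); nvcc performs the host/device split automatically:

nvcc -O2 vecAdd.cu -o vecAdd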
Exercise 1

GPU Teaching Kit
Accelerated Computing

Lecture 3.2 – CUDA Parallelism Model


Multidimensional Kernel Configuration
Objective
– To understand multidimensional Grids
– Multi-dimensional block and thread indices
– Mapping block/thread indices to data indices

A Multi-Dimensional Grid Example

[Figure: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of Blocks (0,0), (0,1), (1,0), (1,1); it then launches Kernel 2 on Grid 2, where a single block such as Block (1,0) contains threads with 3D indices (0,0,0) … (1,0,3)]
Processing a Picture with a 2D Grid

[Figure: a 62×76-pixel picture covered by a grid of 16×16 thread blocks]
Row-Major Layout in C/C++
In row-major order, element (Row, Col) of a Width-wide matrix is stored at linear offset Row*Width + Col; for example, element (2,1) of the 4×4 matrix M is at 2*4 + 1 = 9.

Linear (row-major) layout of M:
M0 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

2D view of M:
M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3
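The same mapping in a minimal C sketch (array names hypothetical):

float M[4][4];           // 2D view
float* flat = &M[0][0];  // row-major linear view
// M[2][1] and flat[2*4 + 1] (i.e., flat[9]) name the same element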
Source Code of a PictureKernel

__global__ void PictureKernel(float* d_Pin, float* d_Pout,
                              int height, int width)
{
  // Calculate the row # of the d_Pin and d_Pout element
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // Each thread computes one element of d_Pout if in range
  if ((Row < height) && (Col < width)) {
    d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];
  }
}

Scale every pixel value by 2.0

Host Code for Launching PictureKernel
// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device

dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);
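For the 62×76 picture from the earlier slide (m = 62, n = 76), this gives DimGrid(5, 4, 1): 20 blocks * 256 threads = 5120 threads for 62 * 76 = 4712 pixels; the range check in the kernel masks off the extra threads.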

Covering a 62×76 Picture with 16×16 Blocks

Not all threads in a Block will follow the same control flow path.

GPU Teaching Kit
Accelerated Computing

Lecture 3.3 – CUDA Parallelism Model


Color-to-Grayscale Image Processing Example
Objective
– To gain a deeper understanding of multi-dimensional grid kernel configurations through a real-world use case

RGB Color Image Representation

– Each pixel in an image is an RGB value
– The format of an image’s row is (r g b) (r g b) … (r g b)
– RGB ranges are not distributed uniformly
– There are many different color spaces; here we show the constants to convert to the AdobeRGB color space
– The vertical axis (y value) and horizontal axis (x value) give the fraction of the pixel intensity allocated to G and B; the remaining fraction (1 − y − x) is assigned to R
– The triangle contains all the representable colors in this color space
RGB to Grayscale Conversion

A grayscale digital image is an image in which the value of each pixel carries only intensity information.
Color Calculating Formula

– For each pixel (r g b) at (I, J) do:
grayPixel[I,J] = 0.21*r + 0.71*g + 0.07*b
– This is just a dot product <[r,g,b],[0.21,0.71,0.07]> with the constants being specific to the input RGB space

[Figure: bar chart of the channel weights 0.21 (R), 0.71 (G), 0.07 (B)]
RGB to Grayscale Conversion Code
// The input image is encoded as float [0, 1]
__global__ void colorConvert(float * grayImage,
                             float * rgbImage,
                             int width, int height, int channels) {
  int x = …;  // horizontal pixel coordinate
  int y = …;  // vertical pixel coordinate

  if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = …;
    // one can think of the RGB image as having
    // CHANNELS times more columns than the grayscale image
    int rgbOffset = …;
    float r = rgbImage[rgbOffset];      // red value for the pixel
    float g = rgbImage[rgbOffset + 1];  // green value for the pixel
    float b = rgbImage[rgbOffset + 2];  // blue value for the pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = …;
  }
}
RGB to Grayscale Conversion Code
// The input image is encoded as float [0, 1]
// One possible way to fill in the blanks from the previous slide
__global__ void colorConvert(float * grayImage,
                             float * rgbImage,
                             int width, int height, int channels) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal pixel coordinate
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // vertical pixel coordinate

  if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = y * width + x;
    // one can think of the RGB image as having
    // CHANNELS times more columns than the grayscale image
    int rgbOffset = grayOffset * channels;
    float r = rgbImage[rgbOffset];      // red value for the pixel
    float g = rgbImage[rgbOffset + 1];  // green value for the pixel
    float b = rgbImage[rgbOffset + 2];  // blue value for the pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
  }
}
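A possible host-side launch (a sketch; d_gray and d_rgb are hypothetical device pointers, allocated and copied as on the memory-management slide), following the PictureKernel pattern:

dim3 DimGrid((width-1)/16 + 1, (height-1)/16 + 1, 1);
dim3 DimBlock(16, 16, 1);
colorConvert<<<DimGrid, DimBlock>>>(d_gray, d_rgb, width, height, 3);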
Exercise 2

GPU Teaching Kit
Accelerated Computing

Lecture 3.4 – CUDA Parallelism Model


Image Blur Example
Objective
– To learn a 2D kernel with more complex computation and memory access patterns

Image Blurring

[Figure: an example of image blurring]
Blurring Box

[Figure: the blur box; the highlighted region shows the pixels processed by one thread block]
Image Blur as a 2D Kernel

__global__
void blurKernel(float * in, float * out, int w, int h, int c)
{
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    ... // Rest of our kernel
  }
}

__global__
void blurKernel(float * in, float * out, int w, int h, int c) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    float pixVal_Red = 0;
    int pixels = 0;

    // Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box around the (Col;Row) pixel
    for( … ) {
      for( … ) {
        int inRow = Row + blurRow;
        int inCol = Col + blurCol;
        // Verify we have a valid image pixel for reading at (inCol;inRow)
        if(inRow > -1 && inRow < h && inCol > -1 && inCol < w) {
          // Accumulate value for the channel
          // Keep track of the number of pixels in the accumulated total
        }
      }
    }

    // Write our new pixel value out
    out[Row * w + Col] = …;
  }
}

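One possible completion of the loop body (a sketch, assuming a compile-time constant BLUR_SIZE and a single-channel image; a multi-channel image would repeat the accumulation per channel):

for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
  for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
    int inRow = Row + blurRow;
    int inCol = Col + blurCol;
    if(inRow > -1 && inRow < h && inCol > -1 && inCol < w) {
      pixVal_Red += in[inRow * w + inCol];  // accumulate value for the channel
      pixels++;                             // count pixels in the accumulated total
    }
  }
}
out[Row * w + Col] = pixVal_Red / pixels;   // write the averaged pixel value out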
Exercise 3

GPU Teaching Kit
Accelerated Computing

Lecture 3.5 – CUDA Parallelism Model


Thread Scheduling
Objective
– To learn how a CUDA kernel utilizes hardware execution resources
– Assigning thread blocks to execution resources
– Capacity constraints of execution resources
– Zero-overhead thread scheduling

Transparent Scalability
[Figure: the same thread grid of Blocks 0–7 runs on a small device as four waves of two blocks and on a larger device as two waves of four blocks, with time advancing down the page]

– Each block can execute in any order relative to others.
– Hardware is free to assign blocks to any processor at any time
– A kernel scales to any number of parallel processors

Example: Executing Thread Blocks
– Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
– The GPU in the room can take up to 2048 threads per SM
– Could be 256 (threads/block) * 8 blocks
– Or 512 (threads/block) * 4 blocks, etc.
– The SM maintains thread/block idx #s
– The SM manages/schedules thread execution

[Figure: an SM with its SPs and Shared Memory, holding the assigned Blocks of threads t0 t1 t2 … tm]
The Von-Neumann Model

[Figure: the von Neumann model: Memory connected to I/O and to the Processing Unit (ALU, Register File), driven by the Control Unit (PC, IR)]
The Von-Neumann Model with SIMD units

[Figure: the von Neumann model extended with Single Instruction Multiple Data (SIMD) units: one Control Unit (PC, IR) drives multiple Processing Units (ALU, Register File) sharing Memory and I/O]
Warps as Scheduling Units
• Each Block is executed as 32-thread Warps
– An implementation decision, not part of the CUDA programming model
– Warps are scheduling units in an SM
– Threads in a warp execute in SIMD
– Future GPUs may have a different number of threads in each warp (the current value can be queried, as sketched below)

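A minimal host-side sketch of that query, using the standard CUDA runtime API:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);          // properties of device 0
printf("warp size: %d\n", prop.warpSize);   // 32 on current NVIDIA GPUs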
Warp Example
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in the SM?
– Each Block is divided into 256/32 = 8 Warps
– There are 8 * 3 = 24 Warps

[Figure: an SM holding the warps of Blocks 0, 1, and 2 (each warp t0 t1 t2 … t31), sharing the Register File and L1/Shared Memory]
Example: Thread Scheduling (Cont.)
– The SM implements zero-overhead warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible Warps are selected for execution based on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected

Block Granularity Considerations
– Use the occupancy calculator to compute the theoretical best block size configuration
– Example for imageBlur, with per-kernel resources reported by nvcc --resource-usage (see the command sketch below):
– 26 registers
– 16x16 = 256 threads per block
– 0 bytes of shared memory (see further chapters)
– Occupancy calculator:
– 256/32 = 8 warps per block (max. is 64 warps per SM on these GPUs)
– 64/8 = 8 blocks per SM
– 256*8 = 2048 threads (max is 2048 threads per SM on these GPUs): 100% occupancy
– Mind the register usage
– Mind the shared memory usage
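A typical way to obtain the per-kernel numbers above (source file name hypothetical):

nvcc --resource-usage imageBlur.cu -o imageBlur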

