
GPU Programming

Course Code: CSGG3018


Instructor: AMIT GURUNG
Email: [email protected]

Jan – May, 2025
Overview
1. Quiz on the previous lectures
2. CUDA Thread Organization
3. Mapping Threads to Multidimensional Data
4. Synchronization and Transparent Scalability
5. Querying Device Properties
6. Thread Assignment
7. Thread Scheduling and Latency Tolerance
8. Conclusions
Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x;
    a[i] = i;
}

If the kernel launch configuration is:

kernel<<< 1, 4 >>>(a);

1) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockDim.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

2) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = threadIdx.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

3) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockIdx.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

4) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = i;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

5) What is the output of a?

https://www.youtube.com/watch?v=cRY5utouJzQ&t=60s
Class Test 1 Date: 4/02/2025

1) Differentiate between the following, providing
appropriate examples for each: [5+5+5]
a) Concurrent and parallel programming
b) A block diagram of a CPU and a GPU
c) Task parallelism and data parallelism
2) Why is the term GPGPUs used? [2]
3) Elaborate on the following terms: host, device, kernel
launching, and kernel launch configuration, providing
appropriate examples. [2+2+2+2]
OR
4) Describe the syntax of the following, giving suitable
examples for each: cudaMalloc(), cudaMemcpy(). What is
the significance of the following: blockIdx and
threadIdx.x? [2+2+2+2]
CUDA Thread
Organization
CUDA Thread Organization
In CUDA programming, the maximum number of threads
per block is 1024.

This limit has been in place since Compute
Capability 2.0.

This means that when configuring your kernel launch
parameters, each block can contain up to 1024 threads.

This limit is defined by the CUDA architecture to ensure
efficient execution and resource management.

Refer:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-
and-technical-specifications
CUDA Thread Organization

CUDA allows for flexible configuration of
thread blocks in up to three dimensions.
The maximum x, y, and z dimensions of a
block are 1024, 1024, and 64, respectively, and they
must be chosen such that x × y × z ≤ 1024, which is
the maximum number of threads per block.
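
As a quick illustration (a minimal sketch; the configuration values are arbitrary), the following block shapes respect or violate these limits:

dim3 ok1(1024, 1, 1); // 1024 threads: allowed
dim3 ok2(32, 8, 4);   // 32 × 8 × 4 = 1024 threads: allowed
dim3 bad1(32, 32, 2); // 32 × 32 × 2 = 2048 > 1024: launch fails
dim3 bad2(16, 16, 65); // z = 65 exceeds the z-dimension limit of 64
// A launch with bad1 or bad2 fails with cudaErrorInvalidConfiguration,
// which can be detected via cudaGetLastError().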
Grid, Blocks, and
Threads in CUDA
CUDA Execution Hierarchy

CUDA follows a hierarchical execution


model:

Grid consists of multiple blocks.


Block consists of multiple threads.

Threads execute the actual computations.


CUDA Grid Structure

A grid is a collection of blocks.

It can be 1D, 2D, or 3D, defined as:


dim3 grid(x, y, z);.

The grid organizes the blocks in an


efficient layout for parallel execution.
1D Grid
The grid extends only along the x-dimension.
Example: dim3 grid(n); (1 × n blocks)

B0 B1 B2 B3
Grid (1D) with 4 blocks
2D Grid
The grid extends only along x and y dimensions.
Example: dim3 grid(n, m); (m × n blocks)

B(0,0) B(1,0) B(2,0)
B(0,1) B(1,1) B(2,1)
B(0,2) B(1,2) B(2,2)

Grid (2D) with 3x3 blocks (x across columns, y down rows)
3D Grid
The grid extends along x, y, and z dimensions.

Example: dim3 grid(n, m, p); (p × m × n blocks)

Mapping of (x, y, z) in 3D Grid:


n → Number of blocks along the x-dimension.
m → Number of blocks along the y-dimension.
p → Number of blocks along the z-dimension.
Each block in the grid is uniquely identified by (blockIdx.x, blockIdx.y,
blockIdx.z), where:
blockIdx.x ranges from 0 to n-1
blockIdx.y ranges from 0 to m-1
blockIdx.z ranges from 0 to p-1
3D Grid
The grid extends along the x, y, and z dimensions.
Example: dim3 grid(n, m, p); (p × m × n blocks)

Layer 1 (z=1):
B(0,0,1) B(1,0,1) B(2,0,1)
B(0,1,1) B(1,1,1) B(2,1,1)
B(0,2,1) B(1,2,1) B(2,2,1)

Layer 0 (z=0):
B(0,0,0) B(1,0,0) B(2,0,0)
B(0,1,0) B(1,1,0) B(2,1,0)
B(0,2,0) B(1,2,0) B(2,2,0)

Grid (3D) with 3x3x2 blocks (x across columns, y down rows, z across layers)
CUDA Block Structure

Each block contains multiple threads.

Blocks can be 1D, 2D, or 3D, defined as:


dim3 block(x, y, z);.
1D Block
The block extends only along the x-dimension.
Example: dim3 block(n); (1 × n threads per block)

T0 T1 T2 T3
Block (1D) with 4 threads
2D Block
Threads are arranged in x and y dimensions.
Example: dim3 block(n, m); (m × n threads per
block)

T(0,0) T(1,0) T(2,0)
T(0,1) T(1,1) T(2,1)
T(0,2) T(1,2) T(2,2)

Block (2D) with 3x3 threads (x across columns, y down rows)
3D Block
Threads are arranged in x, y, and z dimensions.
Example: dim3 block(n, m, p); (p × m × n threads
per block)

Layer 1 (z=1):
T(0,0,1) T(1,0,1) T(2,0,1)
T(0,1,1) T(1,1,1) T(2,1,1)
T(0,2,1) T(1,2,1) T(2,2,1)

Layer 0 (z=0):
T(0,0,0) T(1,0,0) T(2,0,0)
T(0,1,0) T(1,1,0) T(2,1,0)
T(0,2,0) T(1,2,0) T(2,2,0)

Block (3D) with 3x3x2 threads (x across columns, y down rows, z across layers)
Example: kernel launch

// Define a 2D grid with 3x3 blocks, each block has
// 4x4 threads
dim3 grid(3, 3); // 3x3 blocks in the grid
dim3 block(4, 4); // 4x4 threads in each block

kernel<<<grid, block>>>();

This launches 3 × 3 = 9 blocks.

Each block contains 4 × 4 = 16 threads.

Example: kernel launch

// Define a 3D grid with 3x3x2 blocks, each block has
// 4x4x2 threads
dim3 grid(3, 3, 2); // 3x3x2 blocks in the grid
dim3 block(4, 4, 2); // 4x4x2 threads in each block

kernel<<<grid, block>>>();

This launches 3 × 3 × 2 = 18 blocks.

Each block contains 4 × 4 × 2 = 32 threads.


CUDA Thread Organization
The total number of threads created will be:

Total Threads = Number of Blocks in Grid × Number of Threads per Block

For a 3D grid and 3D blocks, the expanded formula is:

Total Threads = (Gx × Gy × Gz) × (Bx × By × Bz)

where:
Gx, Gy, Gz are the number of blocks in the x, y, z dimensions of the
grid.
Bx, By, Bz are the number of threads in the x, y, z dimensions of each
block.
CUDA Thread Organization
In general use, grids tend to be two-dimensional,
while blocks are three-dimensional. However, this
depends on the application you are writing.

CUDA provides a struct called dim3, which can be
used to specify the three dimensions of the
grids and blocks used to execute your kernel:

dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);
KernelFunction<<<dimGrid, dimBlock>>>(…);
Total threads = ??
CUDA Thread Organization
dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);
KernelFunction<<<dimGrid, dimBlock>>>(…);
For dimGrid, x = 5, y = 2, z = 1, and
for dimBlock, x = 4, y = 3, z = 6.
The threads created will have:
gridDim.x = 5, blockIdx.x = 0 … 4
gridDim.y = 2, blockIdx.y = 0 … 1
gridDim.z = 1, blockIdx.z = 0 … 0

blockDim.x = 4, threadIdx.x = 0 … 3
blockDim.y = 3, threadIdx.y = 0 … 2
blockDim.z = 6, threadIdx.z = 0 … 5
Therefore the total number of threads will be
5 * 2 * 1 * 4 * 3 * 6 = 720
CUDA Thread Organization
dim3 dimGrid(?, ?, ?);
dim3 dimBlock(?, ?, ?);

[Figure: ten blocks laid out in two rows of five, with
blockIdx.x == 0 … 4, blockIdx.y == 0 … 1, and blockIdx.z == 0
throughout. One block (blockIdx.x == 2, blockIdx.y == 1,
blockIdx.z == 0) is expanded to show its threads, labelled
Thread (x, y, z), with threadIdx.x == 0 … 3, threadIdx.y == 0 … 2,
and threadIdx.z == 0 … 5.]

blockDim.x == ?
blockDim.y == ?
blockDim.z == ?
CUDA Thread Organization
dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);

[Figure: the same layout as the previous slide, now labelled: ten
blocks in two rows of five, with blockIdx.x == 0 … 4,
blockIdx.y == 0 … 1, and blockIdx.z == 0 throughout. The expanded
block (blockIdx.x == 2, blockIdx.y == 1, blockIdx.z == 0) shows its
threads, labelled Thread (x, y, z), with threadIdx.x == 0 … 3,
threadIdx.y == 0 … 2, and threadIdx.z == 0 … 5.]

blockDim.x == 4
blockDim.y == 3
blockDim.z == 6
Mapping Threads
to
Multidimensional
Data
Mapping Threads to Multidimensional Data

◆ The choice of 1D, 2D, or 3D thread/block
organization depends on the structure of the data
being processed on the GPU.
◆ Mapping threads efficiently to data dimensions
improves performance and simplifies indexing.
Mapping Threads to Multidimensional Data

Examples:
✅ 1D Data (e.g., Arrays, Audio Signals)
◆ A 1D grid and 1D blocks are often used for
linear data structures like arrays or audio
waveforms.
◆ Each thread processes one element of the
array.
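
As a minimal sketch (the kernel name, array length n, and launch configuration are illustrative assumptions), a 1D mapping looks like this:

__global__ void scale(float *data, int n) {
    // one thread per array element
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) // guard threads that fall past the end of the array
        data[i] = data[i] * 2.0f;
}

// launch: enough 256-thread blocks to cover all n elements
// scale<<<(n + 255) / 256, 256>>>(d_data, n);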
Mapping Threads to Multidimensional Data

Examples:
✅ 2D Data (e.g., Grayscale Images, Matrices)
◆ A 2D grid and 2D blocks are useful for images and
matrices.
◆ Example: A grayscale image is a 2D array of pixels,
where each pixel holds an intensity value.
◆ Threads are mapped to (x, y) coordinates,
making access patterns more efficient.
Mapping Threads to Multidimensional Data

Examples:
✅ 3D Data (e.g., RGB Images, Volumetric Data, 3D
Grids)
◆ A 3D grid and 3D blocks are ideal for volumetric
data or multi-channel images.
◆ Example: An RGB image is a 3D structure, where
each pixel has three values (Red, Green, Blue).
◆ Threads are mapped to (x, y, color channel) for
efficient parallel processing.
How Do We Blur a 2013 × 3971 Pixel
Image on a GPU?
Mapping Threads to Multidimensional Data

Q) How Do We Blur a 2013 × 3971 Pixel Image on


a GPU?
◆Problem: We have a black-and-white image
with 2013 x 3971 pixels, and we want to blur it
(each pixel should take the average of itself and its
neighbors).
◆How should we assign GPU threads?
Mapping Threads to Multidimensional Data

Q) How should we assign GPU threads?


◆ Since the image is 2D, we must preserve the row
and column information when mapping it to GPU
threads.
◆ So, we use 2D thread mapping, where:
– Each thread corresponds to one pixel (x, y).
– The grid represents the entire image, split
into blocks of pixels.
– Each block contains threads that process a
small region of the image.
Mapping Threads to Multidimensional Data

[Figure: a 2013 × 3971 pixel image, x = 3971 (width), y = 2013 (height)]

Q) How does CUDA handle this?

One possible configuration: we divide the image into small
blocks, each handled by a group of threads:

dim3 block(16, 16); // each block handles a 16x16 pixel region
dim3 grid((3971 + 15) / 16, (2013 + 15) / 16); // integer ceiling; grid covers entire image
blurKernel<<<grid, block>>>(d_image, width, height);

Each thread processes one pixel, and the entire image is covered by mapping pixels
to (threadIdx.x, threadIdx.y) within a block and (blockIdx.x, blockIdx.y) in the grid.
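
A minimal blurKernel sketch under stated assumptions: one byte per pixel, a 3×3 averaging window, and separate input/output buffers (blurring in place would let threads read neighbors that were already blurred, so this sketch takes two pointers rather than the single d_image above):

__global__ void blurKernel(const unsigned char *in, unsigned char *out,
                           int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // pixel row
    if (x >= width || y >= height) return; // threads past the image edge do nothing

    int sum = 0, count = 0;
    for (int dy = -1; dy <= 1; dy++) { // 3x3 neighborhood
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx]; // row-major pixel access
                count++;
            }
        }
    }
    out[y * width + x] = (unsigned char)(sum / count); // average of self + neighbors
}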
How do we process an RGB image
efficiently using CUDA?
Mapping Threads to Multidimensional Data

Q) How do we process an RGB image efficiently


using CUDA?
Hint: A black-and-white image needs 2D thread
mapping (x, y), but an RGB image has three color
channels (Red, Green, Blue).
◆How should we assign GPU threads?
Mapping Threads to Multidimensional Data

Q) How should we assign GPU threads for RGB image?


◆ Each pixel has three values (R, G, B).
◆ Instead of treating the image as 2D (x, y), we map
threads as (x, y, channel).
◆ CUDA mapping idea:
x → Pixel column (width)
y → Pixel row (height)
z → Color channel (0 = R, 1 = G, 2 = B)
Mapping Threads to Multidimensional Data

Example:
◆ If a pixel at (30, 45) has values (255, 120, 60),
three threads handle:
(30, 45, 0) → Red
(30, 45, 1) → Green
(30, 45, 2) → Blue
Mapping Threads to Multidimensional Data

Q) How does CUDA handle this RGB image?


Each thread processes one color channel per pixel:
dim3 grid((width + 15) / 16, (height + 15) / 16); // 2D grid, integer ceiling
dim3 block(16, 16, 3); // 3D block (x, y, color channels)
processRGB<<<grid, block>>>(d_image, width, height);

☆ The grid, with ceil(width/16) × ceil(height/16) blocks, represents the image's
2D spatial layout (width × height).

☆ The block (16, 16, 3) assigns three threads per pixel (one for each R, G, and B
channel) to enable efficient parallel processing.
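
A minimal processRGB sketch under stated assumptions (interleaved RGB storage with one byte per channel; the per-channel operation, halving the brightness, is arbitrary):

__global__ void processRGB(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // pixel row
    int c = threadIdx.z; // color channel: 0 = R, 1 = G, 2 = B
    if (x >= width || y >= height) return;

    // interleaved layout: pixel (x, y) occupies 3 consecutive bytes
    int idx = (y * width + x) * 3 + c;
    img[idx] = (unsigned char)(img[idx] / 2); // example: halve the brightness
}

Each thread touches only its own byte, so the three channel threads of a pixel need no synchronization.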
Key Guidelines for Choosing Threads per Block
in CUDA
Mapping Threads to Multidimensional Data

Guidelines for Choosing Threads per Block in CUDA


When designing CUDA kernel launch configurations,
balancing parallel efficiency, memory access patterns, and
hardware limitations is crucial.
Here are the main points to consider:
(1) Choose Threads per Block based on GPU
Architecture
(2) Keep Blocks a Multiple of Warp Size (32 Threads)
(3) Consider Shared Memory and Registers per Block
(4) Optimize for Memory Access and Coalescing
(5) Grid Size should Cover Entire Data
Mapping Threads to Multidimensional Data

Choose Threads per Block based on GPU Architecture


Rule: Modern NVIDIA GPUs support a maximum of 1024
threads per block, but the optimal choice depends on CUDA cores
per Streaming Multiprocessor (SM) and occupancy.

Too Few Threads (< 128 per block):
✔️ While not ideal for performance, there are cases where it might be
necessary (small workloads or resource-limited situations).
❌ Not enough parallel work to keep the GPU busy.

Too Many Threads (close to 1024 per block):
✔️ May maximize SM occupancy.
❌ Can cause register/memory pressure, reducing performance.

💡 Guideline: Aim for 256 to 512 threads per block for a good balance.
Mapping Threads to Multidimensional Data

Keep Blocks a Multiple of Warp Size (32 Threads)


Rule: Threads execute in warps of 32. Configurations that are
not multiples of 32 cause divergence and inefficient
execution.
Example
✔️16 x 16 = 256 threads (8 full warps)
✔️ 16 x 32 = 512 threads (16 full warps)
❌ 20 x 25 = 500 threads (incomplete warps: inefficient)
Mapping Threads to Multidimensional Data

Consider Shared Memory and Registers per Block


Rule: More threads per block means more shared
memory/register usage. If memory is exhausted, fewer
blocks run concurrently, reducing performance.

High Register Usage (Too Many Threads):
❌ May reduce occupancy, causing stalls.

Low Register Usage (Too Few Threads):
❌ GPU cores remain underutilized.

💡 Guideline: Monitor memory use via the CUDA occupancy calculator
(download and use it).
Mapping Threads to Multidimensional Data
Consider Shared Memory and Registers per Block

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
Mapping Threads to Multidimensional Data

Optimize for Memory Access and Coalescing


Rule: Memory access patterns should be aligned for
efficient coalescing (i.e., consecutive threads access
consecutive memory addresses)

Too Small Blocks:
❌ Multiple global memory transactions → slower performance.

Too Large Blocks:
✔️ May improve coalescing but could increase bank conflicts.

💡 Guideline: Blocks of 16 × 16 or 16 × 32 generally work well
for image processing.
Mapping Threads to Multidimensional Data
Grid Size Should Cover Entire Data
Rule: Ensure the grid size (blocks) is large enough to cover all
data points while minimizing idle threads.
Example: Image Processing (2013 × 3971 pixels)

16 × 16 blocks:
✔️ Grid = ceil(2013/16) × ceil(3971/16) = 126 × 249
❌ Some threads (≈0.47%) may be idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,031,744
(= 249 × 126 × 16 × 16) will be in use. Thus, 38,121 threads are idle.

16 × 32 blocks:
✔️ Grid = ceil(2013/16) × ceil(3971/32) = 126 × 125
❌ Some threads (≈0.87%) may be idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,064,000
(= 125 × 126 × 16 × 32) will be in use. Thus, 70,377 threads are idle.

💡 Guideline: Minimize idle threads by choosing block sizes that evenly divide the
problem dimensions.
https://www.youtube.com/watch?v=b5lYGvcBjy4
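
A common idiom for sizing such grids (a sketch; the block shape and image dimensions follow the example above) is integer ceiling division:

dim3 block(16, 16);
dim3 grid((3971 + block.x - 1) / block.x,  // 249 blocks along x (width)
          (2013 + block.y - 1) / block.y); // 126 blocks along y (height)
// Kernels should still bounds-check, since the last row/column of
// blocks extends past the image edge.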
Mapping Threads to Multidimensional Data

Memory Layouts Linearization in CUDA

● CUDA does not support direct multi-


dimensional array allocation with cudaMalloc.
● Multi-dimensional arrays must be flattened into
a one-dimensional representation.
● Proper indexing ensures correct memory access
and performance.
Mapping Threads to Multidimensional Data

Memory Layouts in Programming Languages

● Row-major layout: Used by C, C++ (rows stored


consecutively).
● Column-major layout: Used by FORTRAN (columns
stored consecutively).
● Understanding these layouts is crucial for efficient
memory access.
Mapping Threads to Multidimensional Data
Row-Major Layout (C/C++)
M(0,0) M(0,1) M(0,2) M(0,3)
Height x Width
(4 x 4)
Conceptual Representation: M(1,0) M(1,1) M(1,2) M(1,3)

M(2,0) M(2,1) M(2,2) M(2,3)


row
M(3,0) M(3,1) M(3,2) M(3,3)

column
C/C++
Representation M(0,0) M(0,1) M(0,2) M(0,3) M(1,0) M(1,1) M(1,2) M(1,3) M(2,0) M(2,1) M(2,2) M(2,3) M(3,0) M(3,1) M(3,2) M(3,3)

in Memory:

Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)

Formula: index = row × width + column

Example: index of M(2,1) = 2 × 4 + 1 = 9
Mapping Threads to Multidimensional Data
Column-Major Layout (FORTRAN)
M(0,0) M(0,1) M(0,2) M(0,3)
Height x Width
(4 x 4)
Conceptual Representation: M(1,0) M(1,1) M(1,2) M(1,3)

M(2,0) M(2,1) M(2,2) M(2,3)


row
M(3,0) M(3,1) M(3,2) M(3,3)

column
FORTRAN M(0,0) M(1,0) M(2,0) M(3,0) M(0,1) M(1,1) M(2,1) M(3,1) M(0,2) M(1,2) M(2,2) M(3,2) M(0,3) M(1,3) M(2,3) M(3,3)
Representation
in Memory:

Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)

Formula: index = column × height + row

Example: index of M(2,1) = 1 × 4 + 2 = 6
Mapping Threads to Multidimensional Data

Note that this can also be expanded to 3, 4, and more
dimensions.

In 2D (with x as width/column, y as height/row):

index = (y * width) + x

In 3D (with x as width, y as height, z as depth):

index = (z * width * height) + (y * width) + x

and so on…
Mapping Threads to Multidimensional Data
Accessing Elements in CUDA

Since CUDA uses cudaMalloc, arrays must be manually
indexed.

Example in CUDA kernel:

__global__ void kernel(float *array, int width) {
    int row = threadIdx.y; // y maps to the row
    int col = threadIdx.x; // x maps to the column
    int index = row * width + col; // row-major layout indexing
    array[index] = array[index] + 10; // example operation
}

Ensures correct memory mapping and avoids misaligned accesses.
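
Note that this kernel uses only threadIdx, so it is correct for a single block. For arrays larger than one block, the index also folds in blockIdx (a sketch, assuming a 2D grid of 2D blocks and a height parameter for the bounds check):

__global__ void kernelGrid(float *array, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x; // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y; // global row
    if (row < height && col < width) { // guard partially filled edge blocks
        int index = row * width + col; // row-major layout indexing
        array[index] = array[index] + 10;
    }
}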


Synchronization
and Transparent
Scalability
Synchronization and Transparent Scalability

CUDA essentially provides one function to


coordinate thread activities:

__syncthreads()

This function ensures that all threads in the


currently executing block have reached that
function call.
__syncthreads()
Two things are very important to note with __syncthreads().
First, it only applies to threads within the same block.
Second, as it requires all threads to reach the same point
before continuing, threads that complete faster will
be idle until the other threads catch up.
[Figure: timeline of threads 0 … N-1. Each thread reaches the
__syncthreads() barrier at a different time and waits there; none
proceeds until the last thread arrives.]
Block Synchronization

__syncthreads() only synchronizes threads within a block:

●Scope of __syncthreads():

– The __syncthreads() function is a block-level synchronization


barrier. It ensures that all threads within the same block reach the
same point in the code before any thread can proceed further.

– It does not synchronize threads across different blocks


because blocks are designed to execute independently of one
another.

●Block independence:

– In CUDA, blocks are assigned to Streaming Multiprocessors


(SMs) for execution. Different blocks can run on different SMs, and
there is no guaranteed order of execution between blocks.

– Since blocks are independent, there is no mechanism for


synchronizing threads across blocks using __syncthreads().
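
As a brief illustration (a sketch; the kernel reverses one block-sized tile through shared memory, a classic case where the barrier is required between the write and read phases; it assumes an array length that is a multiple of the block size):

__global__ void reverseTile(int *data) {
    extern __shared__ int tile[]; // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i]; // phase 1: each thread writes its element
    __syncthreads(); // barrier: all writes are now visible block-wide
    data[i] = tile[blockDim.x - 1 - threadIdx.x]; // phase 2: read another thread's slot
}

// launch with dynamic shared memory sized to the block:
// reverseTile<<<numBlocks, 256, 256 * sizeof(int)>>>(d_data);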
Block Synchronization

CUDA devices can process multiple blocks simultaneously, but the


number of blocks that can be processed at once varies across different
GPU architectures.

CUDA-capable GPUs consist of multiple Streaming Multiprocessors


(SMs), each capable of processing multiple blocks concurrently,
depending on resource availability and hardware limits.

This design enables CUDA programs to scale efficiently, provided


there are enough blocks in the grid to fully utilize the available SMs.

Since each block operates independently, the GPU can execute as


many blocks as its hardware allows, dynamically distributing
workload across available resources.

As a result, the same CUDA program can run efficiently on


different GPUs, regardless of whether they have fewer or more SMs,
achieving transparent scalability without requiring code modifications.
Block Synchronization
[Figure: a kernel grid of 4 (four) blocks scheduled on two devices.
An older device with 2 SMs runs Blocks 1-2 first and then Blocks 3-4,
while a newer device with 4 SMs runs Blocks 1-4 simultaneously,
finishing in roughly half the time.]
● Blocks can execute in any order relative to one another, as they
are independent.

● Newer GPUs, with more Streaming Multiprocessors (SMs), can


execute more blocks in parallel, leading to higher
performance and improved efficiency.
Querying Device
Properties
Querying Device Properties
In CUDA C there are built-in function calls for determining the
properties of the device(s) on the system:

#include <iostream>
#include <cuda_runtime.h>
using namespace std;

int main() {
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        cout << "max threads per block: " << dev_prop.maxThreadsPerBlock << endl;
        cout << "max block x dim: " << dev_prop.maxThreadsDim[0] << endl;
        cout << "max block y dim: " << dev_prop.maxThreadsDim[1] << endl;
        cout << "max block z dim: " << dev_prop.maxThreadsDim[2] << endl;
        cout << "max grid x dim: " << dev_prop.maxGridSize[0] << endl;
        cout << "max grid y dim: " << dev_prop.maxGridSize[1] << endl;
        cout << "max grid z dim: " << dev_prop.maxGridSize[2] << endl;
    }
    return 0;
}

An extensive example:
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp
Thread Scheduling
and
Latency Tolerance
Thread Scheduling and Latency Tolerance

In most CUDA-capable GPUs, when a block is


assigned to a Streaming Multiprocessor (SM), it is
further divided into 32-thread units called warps.

Since execution occurs at the warp level, it is generally


recommended to have the number of threads per
block be a multiple of 32 (or the warp size specific to
the device) to maximize efficiency.

The warp size can be retrieved programmatically


using:

dev_prop.warpSize
Thread Scheduling and Latency Tolerance

CUDA schedules threads in warps, executing them in a SIMD


(Single Instruction, Multiple Data) manner—similar to a
vector processor—where all threads in a warp execute the same
instruction simultaneously.

When threads in a warp encounter a long-latency operation


(such as a read from global memory), the Streaming
Multiprocessor (SM) switches execution to another ready
warp, keeping the GPU fully utilized while waiting for the data.

This strategy, known as latency hiding or latency tolerance, is


also used in CPUs when scheduling multiple threads to maximize
efficiency.
Thread Scheduling and Latency Tolerance

Warp switching in CUDA incurs zero scheduling overhead,


ensuring that execution remains uninterrupted.

If an SM has enough active warps, long-latency operations


(such as memory accesses) do not slow down execution, as other
warps are scheduled to execute while waiting for data.

Unlike CPUs, which rely on large caches and branch


prediction, GPUs prioritize warp scheduling to allocate more
hardware for performing mathematical computations
(floating-point execution) rather than managing control flow.
This design helps GPUs execute many calculations in parallel,
making them highly efficient for tasks like graphics processing and
scientific computing.
Conclusions
Conclusions
In CUDA, a grid is divided into blocks, which are then assigned
to different Streaming Multiprocessors (SMs). Each SM
schedules and executes warps of threads using its streaming
processors.

All warps of a block execute on the same SM, which
ensures that __syncthreads() works correctly by synchronizing
threads within a block.

Although different GPUs may have different warp sizes,


varying numbers of SMs, and different limits on blocks per
SM, CUDA automatically manages thread scheduling. This allows
the same CUDA code to run efficiently across GPUs of different
architectures without modification.
References

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-
and-technical-specifications
