
GPU Programming

Course Code: CSGG3018


Instructor: AMIT GURUNG
Email: [email protected]

Jan – May, 2025
Overview
1. Quiz on the previous lectures
2. CUDA Thread Organization
3. Mapping Threads to Multidimensional Data
4. Synchronization and Transparent Scalability
5. Querying Device Properties
6. Thread Assignment
7. Thread Scheduling and Latency Tolerance
8. Conclusions
Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x;
    a[i] = i;
}

If the kernel launch configuration is:

kernel<<< 1, 4 >>>(a);

1) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockDim.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

2) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = threadIdx.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

3) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockIdx.x;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

4) What is the output of a?


Quiz
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = i;
}

If the kernel launch configuration is:

kernel<<< 3, 4 >>>(a);

5) What is the output of a?

https://www.youtube.com/watch?v=cRY5utouJzQ&t=60s
Class Test 1 Date: 4/02/2025

1) Differentiate between the following, providing
appropriate examples for each: [5+5+5]
a) Concurrent and parallel programming
b) A block diagram of a CPU and a GPU
c) Task parallelism and data parallelism
2) Why is the term GPGPUs used? [2]
3) Elaborate on the following terms: host, device, kernel
launching, and kernel launch configuration, providing
appropriate examples. [2+2+2+2]
OR
4) Describe the syntax of the following, giving suitable
examples for each: cudaMalloc(), cudaMemcpy(). What is
the significance of the following: blockIdx and
threadIdx.x? [2+2+2+2]
CUDA Thread
Organization
CUDA Thread Organization
In CUDA programming, the maximum number of threads
per block is 1024.

This limit has been in place since Compute
Capability 2.0.

This means that when configuring your kernel launch
parameters, each block can contain up to 1024 threads.

This limit is defined by the CUDA architecture to ensure
efficient execution and resource management.

Refer:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-
and-technical-specifications
CUDA Thread Organization

CUDA allows for flexible configuration of
thread blocks in up to three dimensions.
The maximum x, y, and z dimensions of a
block are 1024, 1024, and 64, respectively, and they
must be chosen such that x × y × z ≤ 1024, which is
the maximum number of threads per block.
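
As a quick illustration (a minimal sketch; the configuration values are arbitrary), the following block shapes respect or violate these limits:

dim3 ok1(1024, 1, 1); // 1024 threads: allowed
dim3 ok2(32, 8, 4);   // 32 × 8 × 4 = 1024 threads: allowed
dim3 bad1(32, 32, 2); // 32 × 32 × 2 = 2048 > 1024: launch fails
dim3 bad2(16, 16, 65); // z = 65 exceeds the z-dimension limit of 64
// A launch with bad1 or bad2 fails with cudaErrorInvalidConfiguration,
// which can be detected via cudaGetLastError().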
Grid, Blocks, and
Threads in CUDA
CUDA Execution Hierarchy

CUDA follows a hierarchical execution


model:

Grid consists of multiple blocks.


Block consists of multiple threads.

Threads execute the actual computations.


CUDA Grid Structure

A grid is a collection of blocks.

It can be 1D, 2D, or 3D, defined as:


dim3 grid(x, y, z);.

The grid organizes the blocks in an


efficient layout for parallel execution.
1D Grid
The grid extends only along the x-dimension.
Example: dim3 grid(n); (1 × n blocks)

B0 B1 B2 B3
Grid (1D) with 4 blocks
2D Grid
The grid extends only along x and y dimensions.
Example: dim3 grid(n, m); (m × n blocks)

B(0,0) B(1,0) B(2,0)
B(0,1) B(1,1) B(2,1)
B(0,2) B(1,2) B(2,2)

Grid (2D) with 3x3 blocks (x across columns, y down rows)
3D Grid
The grid extends along x, y, and z dimensions.

Example: dim3 grid(n, m, p); (p × m × n blocks)

Mapping of (x, y, z) in 3D Grid:


n → Number of blocks along the x-dimension.
m → Number of blocks along the y-dimension.
p → Number of blocks along the z-dimension.
Each block in the grid is uniquely identified by (blockIdx.x, blockIdx.y,
blockIdx.z), where:
blockIdx.x ranges from 0 to n-1
blockIdx.y ranges from 0 to m-1
blockIdx.z ranges from 0 to p-1
3D Grid
The grid extends along the x, y, and z dimensions.
Example: dim3 grid(n, m, p); (p × m × n blocks)

Layer 1 (z=1):
B(0,0,1) B(1,0,1) B(2,0,1)
B(0,1,1) B(1,1,1) B(2,1,1)
B(0,2,1) B(1,2,1) B(2,2,1)

Layer 0 (z=0):
B(0,0,0) B(1,0,0) B(2,0,0)
B(0,1,0) B(1,1,0) B(2,1,0)
B(0,2,0) B(1,2,0) B(2,2,0)

Grid (3D) with 3x3x2 blocks (x across columns, y down rows, z across layers)
CUDA Block Structure

Each block contains multiple threads.

Blocks can be 1D, 2D, or 3D, defined as:


dim3 block(x, y, z);.
1D Block
The block extends only along the x-dimension.
Example: dim3 block(n); (1 × n threads per block)

T0 T1 T2 T3
Block (1D) with 4 threads
2D Block
Threads are arranged in x and y dimensions.
Example: dim3 block(n, m); (m × n threads per
block)

T(0,0) T(1,0) T(2,0)
T(0,1) T(1,1) T(2,1)
T(0,2) T(1,2) T(2,2)

Block (2D) with 3x3 threads (x across columns, y down rows)
3D Block
Threads are arranged in x, y, and z dimensions.
Example: dim3 block(n, m, p); (p × m × n threads
per block)

Layer 1 (z=1):
T(0,0,1) T(1,0,1) T(2,0,1)
T(0,1,1) T(1,1,1) T(2,1,1)
T(0,2,1) T(1,2,1) T(2,2,1)

Layer 0 (z=0):
T(0,0,0) T(1,0,0) T(2,0,0)
T(0,1,0) T(1,1,0) T(2,1,0)
T(0,2,0) T(1,2,0) T(2,2,0)

Block (3D) with 3x3x2 threads (x across columns, y down rows, z across layers)
Example: kernel launch

// Define a 2D grid with 3x3 blocks, each block has
// 4x4 threads
dim3 grid(3, 3); // 3x3 blocks in the grid
dim3 block(4, 4); // 4x4 threads in each block

kernel<<<grid, block>>>();

This launches 3 × 3 = 9 blocks.

Each block contains 4 × 4 = 16 threads.

Example: kernel launch

// Define a 3D grid with 3x3x2 blocks, each block has
// 4x4x2 threads
dim3 grid(3, 3, 2); // 3x3x2 blocks in the grid
dim3 block(4, 4, 2); // 4x4x2 threads in each block

kernel<<<grid, block>>>();

This launches 3 × 3 × 2 = 18 blocks.

Each block contains 4 × 4 × 2 = 32 threads.


CUDA Thread Organization
The total number of threads created will be:

Total Threads = Number of Blocks in Grid × Number of Threads per Block

For a 3D grid and 3D blocks, the expanded formula is:

Total Threads = (Gx × Gy × Gz) × (Bx × By × Bz)

where:
Gx, Gy, Gz are the number of blocks in the x, y, z dimensions of the
grid.
Bx, By, Bz are the number of threads in the x, y, z dimensions of each
block.
CUDA Thread Organization
In general use, grids tend to be two-dimensional,
while blocks are three-dimensional. However, this
depends on the application you are writing.

CUDA provides a struct called dim3, which can be
used to specify the three dimensions of the
grids and blocks used to execute your kernel:

dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);
KernelFunction<<<dimGrid, dimBlock>>>(…);
Total threads = ??
CUDA Thread Organization
dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);
KernelFunction<<<dimGrid, dimBlock>>>(…);
For dimGrid, x = 5, y = 2, z = 1, and
for dimBlock, x = 4, y = 3, z = 6.
The threads created will have:
gridDim.x = 5, blockIdx.x = 0 … 4
gridDim.y = 2, blockIdx.y = 0 … 1
gridDim.z = 1, blockIdx.z = 0 … 0

blockDim.x = 4, threadIdx.x = 0 … 3
blockDim.y = 3, threadIdx.y = 0 … 2
blockDim.z = 6, threadIdx.z = 0 … 5
Therefore the total number of threads will be
5 * 2 * 1 * 4 * 3 * 6 = 720
CUDA Thread Organization
dim3 dimGrid(?, ?, ?);
dim3 dimBlock(?, ?, ?);

[Figure: ten blocks laid out in two rows of five, with
blockIdx.x == 0 … 4, blockIdx.y == 0 … 1, and blockIdx.z == 0
throughout. One block (blockIdx.x == 2, blockIdx.y == 1,
blockIdx.z == 0) is expanded to show its threads, labelled
Thread (x, y, z), with threadIdx.x == 0 … 3, threadIdx.y == 0 … 2,
and threadIdx.z == 0 … 5.]

blockDim.x == ?
blockDim.y == ?
blockDim.z == ?
CUDA Thread Organization
dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);

[Figure: the same layout as the previous slide, now labelled: ten
blocks in two rows of five, with blockIdx.x == 0 … 4,
blockIdx.y == 0 … 1, and blockIdx.z == 0 throughout. The expanded
block (blockIdx.x == 2, blockIdx.y == 1, blockIdx.z == 0) shows its
threads, labelled Thread (x, y, z), with threadIdx.x == 0 … 3,
threadIdx.y == 0 … 2, and threadIdx.z == 0 … 5.]

blockDim.x == 4
blockDim.y == 3
blockDim.z == 6
Mapping Threads
to
Multidimensional
Data
Mapping Threads to Multidimensional Data

◆ The choice of 1D, 2D, or 3D thread/block
organization depends on the structure of the data
being processed on the GPU.
◆ Mapping threads efficiently to data dimensions
improves performance and simplifies indexing.
Mapping Threads to Multidimensional Data

Examples:
✅ 1D Data (e.g., Arrays, Audio Signals)
◆ A 1D grid and 1D blocks are often used for
linear data structures like arrays or audio
waveforms.
◆ Each thread processes one element of the
array.
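
As a minimal sketch (the kernel name, array length n, and launch configuration are illustrative assumptions), a 1D mapping looks like this:

__global__ void scale(float *data, int n) {
    // one thread per array element
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) // guard threads that fall past the end of the array
        data[i] = data[i] * 2.0f;
}

// launch: enough 256-thread blocks to cover all n elements
// scale<<<(n + 255) / 256, 256>>>(d_data, n);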
Mapping Threads to Multidimensional Data

Examples:
✅ 2D Data (e.g., Grayscale Images, Matrices)
◆ A 2D grid and 2D blocks are useful for images and
matrices.
◆ Example: A grayscale image is a 2D array of pixels,
where each pixel holds an intensity value.
◆ Threads are mapped to (x, y) coordinates,
making access patterns more efficient.
Mapping Threads to Multidimensional Data

Examples:
✅ 3D Data (e.g., RGB Images, Volumetric Data, 3D
Grids)
◆ A 3D grid and 3D blocks are ideal for volumetric
data or multi-channel images.
◆ Example: An RGB image is a 3D structure, where
each pixel has three values (Red, Green, Blue).
◆ Threads are mapped to (x, y, color channel) for
efficient parallel processing.
How Do We Blur a 2013 × 3971 Pixel
Image on a GPU?
Mapping Threads to Multidimensional Data

Q) How Do We Blur a 2013 × 3971 Pixel Image on


a GPU?
◆Problem: We have a black-and-white image
with 2013 x 3971 pixels, and we want to blur it
(each pixel should take the average of itself and its
neighbors).
◆How should we assign GPU threads?
Mapping Threads to Multidimensional Data

Q) How should we assign GPU threads?


◆ Since the image is 2D, we must preserve the row
and column information when mapping it to GPU
threads.
◆ So, we use 2D thread mapping, where:
– Each thread corresponds to one pixel (x, y).
– The grid represents the entire image, split
into blocks of pixels.
– Each block contains threads that process a
small region of the image.
Mapping Threads to Multidimensional Data

[Figure: a 2013 × 3971 pixel image, x = 3971 (width), y = 2013 (height)]

Q) How does CUDA handle this?

One possible configuration: we divide the image into small
blocks, each handled by a group of threads:

dim3 block(16, 16); // each block handles a 16x16 pixel region
dim3 grid((3971 + 15) / 16, (2013 + 15) / 16); // integer ceiling; grid covers entire image
blurKernel<<<grid, block>>>(d_image, width, height);

Each thread processes one pixel, and the entire image is covered by mapping pixels
to (threadIdx.x, threadIdx.y) within a block and (blockIdx.x, blockIdx.y) in the grid.
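
A minimal blurKernel sketch under stated assumptions: one byte per pixel, a 3×3 averaging window, and separate input/output buffers (blurring in place would let threads read neighbors that were already blurred, so this sketch takes two pointers rather than the single d_image above):

__global__ void blurKernel(const unsigned char *in, unsigned char *out,
                           int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // pixel row
    if (x >= width || y >= height) return; // threads past the image edge do nothing

    int sum = 0, count = 0;
    for (int dy = -1; dy <= 1; dy++) { // 3x3 neighborhood
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx]; // row-major pixel access
                count++;
            }
        }
    }
    out[y * width + x] = (unsigned char)(sum / count); // average of self + neighbors
}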
How do we process an RGB image
efficiently using CUDA?
Mapping Threads to Multidimensional Data

Q) How do we process an RGB image efficiently


using CUDA?
Hint: A black-and-white image needs 2D thread
mapping (x, y), but an RGB image has three color
channels (Red, Green, Blue).
◆How should we assign GPU threads?
Mapping Threads to Multidimensional Data

Q) How should we assign GPU threads for RGB image?


◆ Each pixel has three values (R, G, B).
◆ Instead of treating the image as 2D (x, y), we map
threads as (x, y, channel).
◆ CUDA mapping idea:
x → Pixel column (width)
y → Pixel row (height)
z → Color channel (0 = R, 1 = G, 2 = B)
Mapping Threads to Multidimensional Data

Example:
◆ If a pixel at (30, 45) has values (255, 120, 60),
three threads handle:
(30, 45, 0) → Red
(30, 45, 1) → Green
(30, 45, 2) → Blue
Mapping Threads to Multidimensional Data

Q) How does CUDA handle this RGB image?


Each thread processes one color channel per pixel:
dim3 grid((width + 15) / 16, (height + 15) / 16); // 2D grid, integer ceiling
dim3 block(16, 16, 3); // 3D block (x, y, color channels)
processRGB<<<grid, block>>>(d_image, width, height);

☆ The grid, with ceil(width/16) × ceil(height/16) blocks, represents the image's
2D spatial layout (width × height).

☆ The block (16, 16, 3) assigns three threads per pixel (one for each R, G, and B
channel) to enable efficient parallel processing.
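
A minimal processRGB sketch under stated assumptions (interleaved RGB storage with one byte per channel; the per-channel operation, halving the brightness, is arbitrary):

__global__ void processRGB(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // pixel row
    int c = threadIdx.z; // color channel: 0 = R, 1 = G, 2 = B
    if (x >= width || y >= height) return;

    // interleaved layout: pixel (x, y) occupies 3 consecutive bytes
    int idx = (y * width + x) * 3 + c;
    img[idx] = (unsigned char)(img[idx] / 2); // example: halve the brightness
}

Each thread touches only its own byte, so the three channel threads of a pixel need no synchronization.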
Key Guidelines for Choosing Threads per Block
in CUDA
Mapping Threads to Multidimensional Data

Guidelines for Choosing Threads per Block in CUDA


When designing CUDA kernel launch configurations,
balancing parallel efficiency, memory access patterns, and
hardware limitations is crucial.
Here are the main points to consider:
(1) Choose Threads per Block based on GPU
Architecture
(2) Keep Blocks a Multiple of Warp Size (32 Threads)
(3) Consider Shared Memory and Registers per Block
(4) Optimize for Memory Access and Coalescing
(5) Grid Size should Cover Entire Data
Mapping Threads to Multidimensional Data

Choose Threads per Block based on GPU Architecture


Rule: Modern NVIDIA GPUs support a maximum of 1024
threads per block, but the optimal choice depends on CUDA cores
per Streaming Multiprocessor (SM) and occupancy.

Too Few Threads (< 128 per block):
✔️ While not ideal for performance, there are cases where it might be
necessary (small workloads or resource-limited situations).
❌ Not enough parallel work to keep the GPU busy.

Too Many Threads (close to 1024 per block):
✔️ May maximize SM occupancy.
❌ Can cause register/memory pressure, reducing performance.

💡 Guideline: Aim for 256 to 512 threads per block for a good balance.
Mapping Threads to Multidimensional Data

Keep Blocks a Multiple of Warp Size (32 Threads)


Rule: Threads execute in warps of 32. Configurations that are
not multiples of 32 cause divergence and inefficient
execution.
Example
✔️16 x 16 = 256 threads (8 full warps)
✔️ 16 x 32 = 512 threads (16 full warps)
❌ 20 x 25 = 500 threads (incomplete warps: inefficient)
Mapping Threads to Multidimensional Data

Consider Shared Memory and Registers per Block


Rule: More threads per block means more shared
memory/register usage. If memory is exhausted, fewer
blocks run concurrently, reducing performance.

High Register Usage (Too Many Threads):
❌ May reduce occupancy, causing stalls.

Low Register Usage (Too Few Threads):
❌ GPU cores remain underutilized.

💡 Guideline: Monitor memory use via the CUDA occupancy calculator
(download and use it).
Mapping Threads to Multidimensional Data
Consider Shared Memory and Registers per Block

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
Mapping Threads to Multidimensional Data

Optimize for Memory Access and Coalescing


Rule: Memory access patterns should be aligned for
efficient coalescing (i.e., consecutive threads access
consecutive memory addresses)

Too Small Blocks:
❌ Multiple global memory transactions → slower performance.

Too Large Blocks:
✔️ May improve coalescing but could increase bank conflicts.

💡 Guideline: Blocks of 16 × 16 or 16 × 32 generally work well
for image processing.
Mapping Threads to Multidimensional Data
Grid Size Should Cover Entire Data
Rule: Ensure the grid size (blocks) is large enough to cover all
data points while minimizing idle threads.
Example: Image Processing (2013 × 3971 pixels)

16 × 16 blocks:
✔️ Grid = ceil(2013/16) × ceil(3971/16) = 126 × 249
❌ Some threads (≈0.47%) may be idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,031,744
(= 249 × 126 × 16 × 16) will be in use. Thus, 38,121 threads are idle.

16 × 32 blocks:
✔️ Grid = ceil(2013/16) × ceil(3971/32) = 126 × 125
❌ Some threads (≈0.87%) may be idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,064,000
(= 125 × 126 × 16 × 32) will be in use. Thus, 70,377 threads are idle.

💡 Guideline: Minimize idle threads by choosing block sizes that evenly divide the
problem dimensions.
https://www.youtube.com/watch?v=b5lYGvcBjy4
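
A common idiom for sizing such grids (a sketch; the block shape and image dimensions follow the example above) is integer ceiling division:

dim3 block(16, 16);
dim3 grid((3971 + block.x - 1) / block.x,  // 249 blocks along x (width)
          (2013 + block.y - 1) / block.y); // 126 blocks along y (height)
// Kernels should still bounds-check, since the last row/column of
// blocks extends past the image edge.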
Mapping Threads to Multidimensional Data

Memory Layouts Linearization in CUDA

● CUDA does not support direct multi-


dimensional array allocation with cudaMalloc.
● Multi-dimensional arrays must be flattened into
a one-dimensional representation.
● Proper indexing ensures correct memory access
and performance.
Mapping Threads to Multidimensional Data

Memory Layouts in Programming Languages

● Row-major layout: Used by C, C++ (rows stored


consecutively).
● Column-major layout: Used by FORTRAN (columns
stored consecutively).
● Understanding these layouts is crucial for efficient
memory access.
Mapping Threads to Multidimensional Data
Row-Major Layout (C/C++)
M(0,0) M(0,1) M(0,2) M(0,3)
Height x Width
(4 x 4)
Conceptual Representation: M(1,0) M(1,1) M(1,2) M(1,3)

M(2,0) M(2,1) M(2,2) M(2,3)


row
M(3,0) M(3,1) M(3,2) M(3,3)

column
C/C++
Representation M(0,0) M(0,1) M(0,2) M(0,3) M(1,0) M(1,1) M(1,2) M(1,3) M(2,0) M(2,1) M(2,2) M(2,3) M(3,0) M(3,1) M(3,2) M(3,3)

in Memory:

Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)

Formula: index = row × width + column

Example: index of M(2,1) = 2 × 4 + 1 = 9
Mapping Threads to Multidimensional Data
Column-Major Layout (FORTRAN)
M(0,0) M(0,1) M(0,2) M(0,3)
Height x Width
(4 x 4)
Conceptual Representation: M(1,0) M(1,1) M(1,2) M(1,3)

M(2,0) M(2,1) M(2,2) M(2,3)


row
M(3,0) M(3,1) M(3,2) M(3,3)

column
FORTRAN M(0,0) M(1,0) M(2,0) M(3,0) M(0,1) M(1,1) M(2,1) M(3,1) M(0,2) M(1,2) M(2,2) M(3,2) M(0,3) M(1,3) M(2,3) M(3,3)
Representation
in Memory:

Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)

Formula: index = column × height + row

Example: index of M(2,1) = 1 × 4 + 2 = 6
Mapping Threads to Multidimensional Data

Note that this can also be expanded to 3, 4, and more
dimensions.

In 2D (with x as width/column, y as height/row):

index = (y * width) + x

In 3D (with x as width, y as height, z as depth):

index = (z * width * height) + (y * width) + x

and so on…
Mapping Threads to Multidimensional Data
Accessing Elements in CUDA

Since CUDA uses cudaMalloc, arrays must be manually
indexed.

Example in CUDA kernel:

__global__ void kernel(float *array, int width) {
    int row = threadIdx.y; // y maps to the row
    int col = threadIdx.x; // x maps to the column
    int index = row * width + col; // row-major layout indexing
    array[index] = array[index] + 10; // example operation
}

Ensures correct memory mapping and avoids misaligned accesses.
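
Note that this kernel uses only threadIdx, so it is correct for a single block. For arrays larger than one block, the index also folds in blockIdx (a sketch, assuming a 2D grid of 2D blocks and a height parameter for the bounds check):

__global__ void kernelGrid(float *array, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x; // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y; // global row
    if (row < height && col < width) { // guard partially filled edge blocks
        int index = row * width + col; // row-major layout indexing
        array[index] = array[index] + 10;
    }
}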


Synchronization
and Transparent
Scalability
Synchronization and Transparent Scalability

CUDA essentially provides one function to


coordinate thread activities:

__syncthreads()

This function ensures that all threads in the


currently executing block have reached that
function call.
__syncthreads()
Two things are very important to note with __syncthreads().
First, it only applies to threads within the same block.
Second, as it requires all threads to reach the same point
before continuing, threads that complete faster will
be idle until the other threads catch up.
[Figure: timeline of threads 0 … N-1. Each thread reaches the
__syncthreads() barrier at a different time and waits there; none
proceeds until the last thread arrives.]
Block Synchronization

__syncthreads() only synchronizes threads within a block:

●Scope of __syncthreads():

– The __syncthreads() function is a block-level synchronization


barrier. It ensures that all threads within the same block reach the
same point in the code before any thread can proceed further.

– It does not synchronize threads across different blocks


because blocks are designed to execute independently of one
another.

●Block independence:

– In CUDA, blocks are assigned to Streaming Multiprocessors


(SMs) for execution. Different blocks can run on different SMs, and
there is no guaranteed order of execution between blocks.

– Since blocks are independent, there is no mechanism for


synchronizing threads across blocks using __syncthreads().
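
As a brief illustration (a sketch; the kernel reverses one block-sized tile through shared memory, a classic case where the barrier is required between the write and read phases; it assumes an array length that is a multiple of the block size):

__global__ void reverseTile(int *data) {
    extern __shared__ int tile[]; // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i]; // phase 1: each thread writes its element
    __syncthreads(); // barrier: all writes are now visible block-wide
    data[i] = tile[blockDim.x - 1 - threadIdx.x]; // phase 2: read another thread's slot
}

// launch with dynamic shared memory sized to the block:
// reverseTile<<<numBlocks, 256, 256 * sizeof(int)>>>(d_data);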
Block Synchronization

CUDA devices can process multiple blocks simultaneously, but the


number of blocks that can be processed at once varies across different
GPU architectures.

CUDA-capable GPUs consist of multiple Streaming Multiprocessors


(SMs), each capable of processing multiple blocks concurrently,
depending on resource availability and hardware limits.

This design enables CUDA programs to scale efficiently, provided


there are enough blocks in the grid to fully utilize the available SMs.

Since each block operates independently, the GPU can execute as


many blocks as its hardware allows, dynamically distributing
workload across available resources.

As a result, the same CUDA program can run efficiently on


different GPUs, regardless of whether they have fewer or more SMs,
achieving transparent scalability without requiring code modifications.
Block Synchronization
[Figure: a kernel grid of 4 (four) blocks scheduled on two devices.
An older device with 2 SMs runs Blocks 1-2 first and then Blocks 3-4,
while a newer device with 4 SMs runs Blocks 1-4 simultaneously,
finishing in roughly half the time.]
● Blocks can execute in any order relative to one another, as they
are independent.

● Newer GPUs, with more Streaming Multiprocessors (SMs), can


execute more blocks in parallel, leading to higher
performance and improved efficiency.
Querying Device
Properties
Querying Device Properties
In CUDA C there are built-in function calls for determining the
properties of the device(s) on the system:

#include <iostream>
#include <cuda_runtime.h>
using namespace std;

int main() {
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        cout << "max threads per block: " << dev_prop.maxThreadsPerBlock << endl;
        cout << "max block x dim: " << dev_prop.maxThreadsDim[0] << endl;
        cout << "max block y dim: " << dev_prop.maxThreadsDim[1] << endl;
        cout << "max block z dim: " << dev_prop.maxThreadsDim[2] << endl;
        cout << "max grid x dim: " << dev_prop.maxGridSize[0] << endl;
        cout << "max grid y dim: " << dev_prop.maxGridSize[1] << endl;
        cout << "max grid z dim: " << dev_prop.maxGridSize[2] << endl;
    }
    return 0;
}

An extensive example:
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp
Thread Scheduling
and
Latency Tolerance
Thread Scheduling and Latency Tolerance

In most CUDA-capable GPUs, when a block is


assigned to a Streaming Multiprocessor (SM), it is
further divided into 32-thread units called warps.

Since execution occurs at the warp level, it is generally


recommended to have the number of threads per
block be a multiple of 32 (or the warp size specific to
the device) to maximize efficiency.

The warp size can be retrieved programmatically


using:

dev_prop.warpSize
Thread Scheduling and Latency Tolerance

CUDA schedules threads in warps, executing them in a SIMD


(Single Instruction, Multiple Data) manner—similar to a
vector processor—where all threads in a warp execute the same
instruction simultaneously.

When threads in a warp encounter a long-latency operation


(such as a read from global memory), the Streaming
Multiprocessor (SM) switches execution to another ready
warp, keeping the GPU fully utilized while waiting for the data.

This strategy, known as latency hiding or latency tolerance, is


also used in CPUs when scheduling multiple threads to maximize
efficiency.
Thread Scheduling and Latency Tolerance

Warp switching in CUDA incurs zero scheduling overhead,


ensuring that execution remains uninterrupted.

If an SM has enough active warps, long-latency operations


(such as memory accesses) do not slow down execution, as other
warps are scheduled to execute while waiting for data.

Unlike CPUs, which rely on large caches and branch


prediction, GPUs prioritize warp scheduling to allocate more
hardware for performing mathematical computations
(floating-point execution) rather than managing control flow.
This design helps GPUs execute many calculations in parallel,
making them highly efficient for tasks like graphics processing and
scientific computing.
Conclusions
Conclusions
In CUDA, a grid is divided into blocks, which are then assigned
to different Streaming Multiprocessors (SMs). Each SM
schedules and executes warps of threads using its streaming
processors.

All warps of a block execute on the same SM, which
ensures that __syncthreads() works correctly by synchronizing
threads within a block.

Although different GPUs may have different warp sizes,


varying numbers of SMs, and different limits on blocks per
SM, CUDA automatically manages thread scheduling. This allows
the same CUDA code to run efficiently across GPUs of different
architectures without modification.
References

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-
and-technical-specifications
