GPU_Programming_slides_3
Welcome
int i = threadIdx.x;
a[i] = i;            // each thread stores its own index within the block
a[i] = blockDim.x;   // every thread stores the block size (same value for all)
a[i] = threadIdx.x;  // same effect as a[i] = i;
a[i] = blockIdx.x;   // every thread stores the index of its block in the grid
https://www.youtube.com/watch?v=cRY5utouJzQ&t=60s
Class Test 1 Date: 4/02/2025
Refer: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
CUDA Thread Organization
[Figure: Grid (1D) with 4 blocks: B0, B1, B2, B3 along the x-dimension.]
2D Grid
The grid extends only along the x and y dimensions.
Example: dim3 grid(n, m); (m × n blocks)
[Figure: a 2D grid with blocks B(x, y), e.g. B(0,0), B(1,0), B(2,0) along the x-dimension.]
[Figure: Grid (3D) with 3×3×2 blocks; layer 0 (z = 0) shows B(0,0,0) … B(2,2,0) along the x- and y-dimensions.]
CUDA Block Structure
[Figure: Block (1D) with 4 threads: T0, T1, T2, T3 along the x-dimension.]
2D Block
Threads are arranged along the x and y dimensions.
Example: dim3 block(n, m); (m × n threads per block)
[Figure: a 2D block with threads T(x, y), e.g. T(0,0), T(1,0), T(2,0) along the x-dimension.]
[Figure: Block (3D) with 3×3×2 threads; layer 0 (z = 0) shows T(0,0,0) … T(2,2,0) along the x- and y-dimensions.]
Example: kernel launch
kernel<<<grid, block>>>();
With dim3 grid(5, 2, 1) and dim3 block(4, 3, 6):
blockDim.x = 4, threadIdx.x = 0 … 3
blockDim.y = 3, threadIdx.y = 0 … 2
blockDim.z = 6, threadIdx.z = 0 … 5
Therefore the total number of threads will be
(5 × 2 × 1) × (4 × 3 × 6) = 10 blocks × 72 threads = 720
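As a minimal sketch of this exact launch (the kernel body is illustrative, not from the slides):

#include <cstdio>

// Illustrative kernel: thread (0,0,0) of each block reports the block's
// coordinates and dimensions.
__global__ void kernel() {
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
        printf("block (%d,%d,%d), %d x %d x %d threads\n",
               blockIdx.x, blockIdx.y, blockIdx.z,
               blockDim.x, blockDim.y, blockDim.z);
}

int main() {
    dim3 grid(5, 2, 1);   // 10 blocks
    dim3 block(4, 3, 6);  // 72 threads per block -> 720 threads in total
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the device printf output
    return 0;
}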
CUDA Thread Organization
dim3 dimGrid(?, ?, ?);
dim3 dimBlock(?, ?, ?);
blockDim.x == ?
blockDim.y == ?
blockDim.z == ?
[Figure: the threads of one block, each labelled with its (threadIdx.x, threadIdx.y, threadIdx.z) values, e.g. Thread (0, 2, 0) … Thread (3, 2, 0) forming one row; determine the block dimensions from the layout.]
CUDA Thread Organization
dim3 dimGrid(5, 2, 1);
dim3 dimBlock(4, 3, 6);
blockDim.x == 4
blockDim.y == 3
blockDim.z == 6
[Figure: the same thread layout with the answers filled in; the row Thread (0, 2, 0) … Thread (3, 2, 0) has threadIdx.x = 0 … 3, threadIdx.y = 2, threadIdx.z = 0.]
Mapping Threads to Multidimensional Data
Mapping Threads to Multidimensional Data
Examples:
✅ 1D Data (e.g., Arrays, Audio Signals)
◆ A 1D grid and 1D blocks are often used for linear data structures like arrays or audio waveforms.
◆ Each thread processes one element of the array.
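A minimal 1D sketch of that mapping (the array name, size, and operation are illustrative):

// Each thread handles one element of a length-n array.
__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D index
    if (i < n)              // guard: the last block may be partially full
        a[i] = 2.0f * a[i];
}
// Launch with one thread per element, rounding the block count up:
// scale<<<(n + 255) / 256, 256>>>(d_a, n);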
Mapping Threads to Multidimensional Data
Examples:
✅ 2D Data (e.g., Grayscale Images, Matrices)
◆ A 2D grid and 2D blocks are useful for images and matrices.
◆ Example: A grayscale image is a 2D array of pixels, where each pixel holds an intensity value.
◆ Threads are mapped to (x, y) coordinates, making access patterns more efficient.
Mapping Threads to Multidimensional Data
Examples:
✅ 3D Data (e.g., RGB Images, Volumetric Data, 3D Grids)
◆ A 3D grid and 3D blocks are ideal for volumetric data or multi-channel images.
◆ Example: An RGB image is a 3D structure, where each pixel has three values (Red, Green, Blue).
◆ Threads are mapped to (x, y, color channel) for efficient parallel processing.
How Do We Blur a 2013 × 3971 Pixel Image on a GPU?
Mapping Threads to Multidimensional Data
[Figure: a 2013 × 3971 pixel image; y spans the 2013 rows and x spans the 3971 columns.]
Each thread processes one pixel, and the entire image is covered by mapping pixels to (threadIdx.x, threadIdx.y) within a block and (blockIdx.x, blockIdx.y) in the grid.
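A sketch of that mapping for the blur kernel (the blur arithmetic itself is omitted; names are illustrative):

__global__ void blurKernel(const unsigned char *in, unsigned char *out,
                           int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x: 0 … width - 1
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y: 0 … height - 1
    if (col < width && row < height) {
        // ... average the neighbouring pixels here ...
        out[row * width + col] = in[row * width + col];  // placeholder body
    }
}
// For the 2013 × 3971 image with 16 × 16 blocks, rounding up:
// dim3 block(16, 16);
// dim3 grid((3971 + 15) / 16, (2013 + 15) / 16);  // 249 × 126 blocks
// blurKernel<<<grid, block>>>(d_in, d_out, 3971, 2013);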
How do we process an RGB image efficiently using CUDA?
Mapping Threads to Multidimensional Data
Example:
◆ If a pixel at (30, 45) has values (255, 120, 60),
three threads handle:
(30, 45, 0) → Red
(30, 45, 1) → Green
(30, 45, 2) → Blue
Mapping Threads to Multidimensional Data
☆ The block (16,16,3) assigns three threads per pixel (one for each R, G, and B channel) to enable efficient parallel processing.
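A sketch of that (x, y, channel) mapping; the interleaved-RGB buffer layout and the per-channel operation are assumptions for illustration:

__global__ void perChannel(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int c = threadIdx.z;                            // 0 = R, 1 = G, 2 = B
    if (x < width && y < height) {
        int idx = (y * width + x) * 3 + c;          // interleaved RGB layout
        img[idx] = 255 - img[idx];                  // e.g., invert the channel
    }
}
// dim3 block(16, 16, 3);  // three threads per pixel, one per channel
// dim3 grid((width + 15) / 16, (height + 15) / 16);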
Key Guidelines for Choosing Threads per Block in CUDA
Mapping Threads to Multidimensional Data
Too Few Threads (< 128 per block): too few warps are resident to hide memory latency.
Too Many Threads (close to 1024 per block): per-thread registers and shared memory become scarce, limiting occupancy.
💡 Guideline: Aim for 256 to 512 threads per block for a good balance.
Mapping Threads to Multidimensional Data
[Figure: two panels, High Register Usage (Too Many Threads) vs. Low Register Usage (Too Few Threads).]
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
Mapping Threads to Multidimensional Data
❌ With 16 × 16 blocks, some threads (≈0.47%) are idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,031,744 (= 249 × 126 × 16 × 16) will be in use.
Thus, 38,121 threads are idle.
❌ With 16 × 32 blocks, some threads (≈0.87%) are idle.
Calculation: 7,993,623 (= 2013 × 3971) threads out of 8,064,000 (= 125 × 126 × 16 × 32) will be in use.
Thus, 70,377 threads are idle.
💡 Guideline: Minimize idle threads by choosing block sizes that evenly divide the problem dimensions.
https://www.youtube.com/watch?v=b5lYGvcBjy4
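The same arithmetic as a small host-side check (numbers taken from the slide above):

#include <cstdio>

int main() {
    long long used = 2013LL * 3971;       // one working thread per pixel
    long long a = 249LL * 126 * 16 * 16;  // threads launched with 16 × 16 blocks
    long long b = 125LL * 126 * 16 * 32;  // threads launched with 16 × 32 blocks
    printf("16x16: %lld idle (%.2f%%)\n", a - used, 100.0 * (a - used) / a);
    printf("16x32: %lld idle (%.2f%%)\n", b - used, 100.0 * (b - used) / b);
    return 0;
}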
Mapping Threads to Multidimensional Data
Row-Major Layout (C/C++)
C/C++ representation in memory (row after row):
M(0,0) M(0,1) M(0,2) M(0,3) M(1,0) M(1,1) M(1,2) M(1,3) M(2,0) M(2,1) M(2,2) M(2,3) M(3,0) M(3,1) M(3,2) M(3,3)
Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)
Formula: index = row × width + col; e.g., M(2,1) → 2 × 4 + 1 = 9
Mapping Threads to Multidimensional Data
Column-Major Layout (FORTRAN)
Conceptual representation: a Height × Width (4 × 4) matrix M(row, col).
FORTRAN representation in memory (column after column):
M(0,0) M(1,0) M(2,0) M(3,0) M(0,1) M(1,1) M(2,1) M(3,1) M(0,2) M(1,2) M(2,2) M(3,2) M(0,3) M(1,3) M(2,3) M(3,3)
Linearized: M(0) M(1) M(2) M(3) M(4) M(5) M(6) M(7) M(8) M(9) M(10) M(11) M(12) M(13) M(14) M(15)
Formula: index = col × height + row; e.g., M(2,1) → 1 × 4 + 2 = 6
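Both formulas as small index helpers (a sketch; the names are illustrative):

// Row-major (C/C++): rows are contiguous in memory.
__host__ __device__ int rowMajorIndex(int row, int col, int width) {
    return row * width + col;    // M(2,1) in a 4 × 4 matrix -> 2 * 4 + 1 = 9
}

// Column-major (FORTRAN): columns are contiguous in memory.
__host__ __device__ int colMajorIndex(int row, int col, int height) {
    return col * height + row;   // M(2,1) in a 4 × 4 matrix -> 1 * 4 + 2 = 6
}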
Mapping Threads to Multidimensional Data
…and so on for higher dimensions. [Figure: a 3D array with axes x, y, and z, linearized layer by layer.]
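For 3D data the same idea gains one more term; a sketch assuming row-major order with x varying fastest:

// Linear index of element (x, y, z) in a width × height × depth volume.
__host__ __device__ int index3D(int x, int y, int z, int width, int height) {
    return (z * height + y) * width + x;
}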
Mapping Threads to Multidimensional Data
Accessing Elements in CUDA
[Figure: barrier synchronization; threads 0, 1, 2, … N − 1 of a block each reach __syncthreads() and wait until every thread of the block has arrived, and only then does execution continue.]
Block Synchronization
●Scope of __syncthreads(): it synchronizes only the threads of a single block; there is no in-kernel barrier across blocks.
●Block independence: blocks must be executable in any order, on any SM; this is what makes transparent scalability possible.
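A minimal sketch of the barrier in use (the names and the 256-thread block size are assumptions):

__global__ void shiftLeft(const float *in, float *out, int n) {
    __shared__ float tile[256];                // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];             // stage data in shared memory
    __syncthreads();                           // every thread of the block waits here
    // Now it is safe to read a neighbour's slot -- but only within this block.
    if (i < n && threadIdx.x > 0)
        out[i] = tile[threadIdx.x - 1];
}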
[Figure: transparent scalability; the same kernel grid (Blocks 1–4) runs on an older device with 2 SMs, two blocks at a time, and on a newer device with 4 SMs, all four blocks at once.]
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

int main() {
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    // Query the limits of every CUDA device in the system.
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        cout << "max threads per block: " << dev_prop.maxThreadsPerBlock << endl;
        cout << "max block x dim: " << dev_prop.maxThreadsDim[0] << endl;
        cout << "max block y dim: " << dev_prop.maxThreadsDim[1] << endl;
        cout << "max block z dim: " << dev_prop.maxThreadsDim[2] << endl;
        cout << "max grid x dim: " << dev_prop.maxGridSize[0] << endl;
        cout << "max grid y dim: " << dev_prop.maxGridSize[1] << endl;
        cout << "max grid z dim: " << dev_prop.maxGridSize[2] << endl;
    }
    return 0;
}
An extensive example:
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp
Thread Scheduling and Latency Tolerance
Synchronization and Transparent Scalability
dev_prop.warpSize: the number of threads per warp (32 on current NVIDIA GPUs)
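It could be printed with one more line in the device-query loop shown earlier (a sketch):

cout << "warp size: " << dev_prop.warpSize << endl;  // 32 on current NVIDIA GPUs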
Synchronization and Transparent Scalability
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications