Multi - Dim
Multi-dimensional mapping of dataspace; Synchronization Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Course Organization
Topic                                            Week    Hours
Review of basic COA w.r.t. performance           1       2
Intro to GPU architectures                       2       3
Intro to CUDA programming                        3       2
Multi-dimensional data and synchronization       4       2
Warp Scheduling and Divergence                   5       2
Memory Access Coalescing                         6       2
Optimizing Reduction Kernels                     7       3
Kernel Fusion, Thread and Block Coarsening       8       3
OpenCL - runtime system                          9       3
OpenCL - heterogeneous computing                 10      2
Efficient Neural Network Training/Inferencing    11-12   6
Multi dimensional block
In general:
- a grid is a 3-D array of blocks
- a block is a 3-D array of threads
- specified by the C struct type dim3
- unused dimensions are set to 1
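As a minimal sketch of the points above, the following program declares a 3-D grid and a 2-D block with dim3 and has one thread print the launch shape. The kernel name showDims and the chosen sizes are hypothetical, picked only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread reports the shape of the launch.
__global__ void showDims() {
    bool firstThread = threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0;
    bool firstBlock  = blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0;
    if (firstThread && firstBlock) {
        printf("grid  = (%u, %u, %u)\n", gridDim.x, gridDim.y, gridDim.z);
        printf("block = (%u, %u, %u)\n", blockDim.x, blockDim.y, blockDim.z);
    }
}

int main() {
    dim3 grid(4, 2, 2);   // 4 x 2 x 2 = 16 blocks
    dim3 block(8, 8);     // z is not given, so dim3 sets block.z to 1
    showDims<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait so the device-side printf output is flushed
    return 0;
}
```

Because the dim3 constructor fills unspecified components with 1, the block shape reported by the kernel should have a z extent of 1.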
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
Multi-dimensional mapping of dataspace; Synchronization Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Multi dimensional grid, block
Multi dimensional grid, block declaration
The memory layout thus created on the device when the kernel is launched is shown next.
Figure: Grids and Blocks (a grid of 16 blocks labelled ⟨0, 0, 0⟩ through ⟨1, 1, 3⟩ along the blockIdx.x, blockIdx.y and blockIdx.z axes; one BLOCK is expanded into its threads along the threadIdx.x, threadIdx.y and threadIdx.z axes)
Figure: 2D Matrix (an 8 x 8 matrix with rows Row 0 to Row 7 and columns Col 0 to Col 7)
Figure: Global Thread IDs (the 16 blocks ⟨0, 0, 0⟩ through ⟨1, 1, 3⟩ on the blockIdx axes, shown alongside the global IDs Thread 0 through Thread 15; one BLOCK is expanded along the threadIdx.x, threadIdx.y and threadIdx.z axes)
Relations among variables
        Col 0  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7
Row 0       0      1      2      3      4      5      6      7
Row 1       8      9     10     11     12     13     14     15
Row 2      16     17     18     19     20     21     22     23
Row 3      24     25     26     27     28     29     30     31
Row 4      32     33     34     35     36     37     38     39
Row 5      40     41     42     43     44     45     46     47
Row 6      48     49     50     51     52     53     54     55
Row 7      56     57     58     59     60     61     62     63
Figure: Mapping Threads to Matrix
Mapping between kernels and data
The CUDA programming interface supports mapping kernels of any dimension (up to 3) to data of any dimension.
- Mapping a 3D kernel to 2D data results in complex memory access expressions.
- It makes sense to map a 2D kernel to 2D data and a 3D kernel to 3D data.
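The 2D-to-2D case can be sketched as follows. This is an illustration rather than code from the lecture: the kernel name scale2D, the row-major storage, the 8 x 15 matrix size and the 5 x 4 block shape are all assumptions chosen for the example. Each thread derives its row i and column j from the built-in variables and touches one matrix element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scale every element of a rows x cols row-major matrix.
__global__ void scale2D(float *A, int rows, int cols, float s) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i < rows && j < cols)                        // skip threads outside the matrix
        A[i * cols + j] *= s;
}

int main() {
    const int rows = 8, cols = 15;
    float h[rows * cols];
    for (int n = 0; n < rows * cols; ++n) h[n] = 1.0f;

    float *d = nullptr;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    dim3 block(5, 4);                                // 5 threads in x, 4 in y
    dim3 grid((cols + block.x - 1) / block.x,        // ceil(15 / 5) = 3 blocks in x
              (rows + block.y - 1) / block.y);       // ceil(8 / 4)  = 2 blocks in y
    scale2D<<<grid, block>>>(d, rows, cols, 2.0f);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("A[4][6] = %f\n", h[4 * cols + 6]);
    return 0;
}
```

The bounds check is redundant for these sizes, since the grid covers the matrix exactly, but it keeps the kernel safe when the matrix dimensions are not multiples of the block dimensions.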
NumCols = blockDim.x * gridDim.x
NumRows = blockDim.y * gridDim.y
j = blockIdx.x * blockDim.x + threadIdx.x    (annotated values: 1 * 5 + 1 in both examples)
i = blockIdx.y * blockDim.y + threadIdx.y    (annotated values: 0 * 4 + 0 and 1 * 4 + 0)
Figure: Two Dimensional Kernel
8 x 15 Matrix
j = blockIdx.x * blockDim.x + threadIdx.x    (annotated values: 1 * 5 + 1)
i = blockIdx.y * blockDim.y + threadIdx.y    (annotated values: 1 * 4 + 0)
Figure: Two Dimensional Kernel-Data Mapping
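As a worked example, the figure's annotations can be read as the values of blockIdx, blockDim and threadIdx in that order (an assumption about the slide's layout). Assuming also that the 8 x 15 matrix is stored in row-major order, the annotated thread maps to the following element:

```latex
\begin{align*}
j &= \texttt{blockIdx.x} \cdot \texttt{blockDim.x} + \texttt{threadIdx.x} = 1 \cdot 5 + 1 = 6, \\
i &= \texttt{blockIdx.y} \cdot \texttt{blockDim.y} + \texttt{threadIdx.y} = 1 \cdot 4 + 0 = 4, \\
\text{offset} &= i \cdot 15 + j = 4 \cdot 15 + 6 = 66.
\end{align*}
```

Under these assumptions the thread handles the element in row 4, column 6, which is the 67th element in memory.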
nX = blockDim.x * gridDim.x
nY = blockDim.y * gridDim.y
nZ = blockDim.z * gridDim.z
j = blockIdx.x * blockDim.x + threadIdx.x    (annotated values: 1 * 5 + 1 in both examples)
i = blockIdx.y * blockDim.y + threadIdx.y    (annotated values: 1 * 4 + 0 and 0 * 4 + 0)
k = blockIdx.z * blockDim.z + threadIdx.z    (annotated values: 0 * 3 + 0 and 1 * 3 + 0)
Figure: Three Dimensional Kernel
8 x 15 Matrix
j = blockIdx.x * blockDim.x + threadIdx.x    (annotated values: 1 * 5 + 1)
i = blockIdx.y * blockDim.y + threadIdx.y    (annotated values: 0 * 4 + 0)
k = blockIdx.z * blockDim.z + threadIdx.z    (annotated values: 1 * 3 + 0)
Figure: Three Dimensional Kernel-Data Mapping
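The three index expressions can be combined into a single memory offset. The sketch below is an assumption-laden illustration, not the lecture's code: the kernel name fill3D, the 5 x 4 x 3 block shape, the 3 x 2 x 2 grid and the layout (planes of row-major matrices stored one after another) are all chosen for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own linear offset into a 3-D volume.
__global__ void fill3D(int *out, int nX, int nY, int nZ) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // x: column within a plane
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // y: row within a plane
    int k = blockIdx.z * blockDim.z + threadIdx.z;   // z: plane number
    if (j < nX && i < nY && k < nZ) {
        int idx = (k * nY + i) * nX + j;             // plane first, then row, then column
        out[idx] = idx;
    }
}

int main() {
    dim3 block(5, 4, 3);
    dim3 grid(3, 2, 2);
    const int nX = block.x * grid.x;                 // 15 columns
    const int nY = block.y * grid.y;                 // 8 rows
    const int nZ = block.z * grid.z;                 // 6 planes
    std::vector<int> h(nX * nY * nZ, -1);

    int *d = nullptr;
    cudaMalloc((void **)&d, h.size() * sizeof(int));
    fill3D<<<grid, block>>>(d, nX, nY, nZ);
    cudaMemcpy(h.data(), d, h.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    bool ok = true;                                  // every cell should hold its own offset
    for (size_t n = 0; n < h.size(); ++n)
        if (h[n] != (int)n) ok = false;
    printf("%s\n", ok ? "all offsets match" : "mismatch");
    return 0;
}
```

With this layout, each z-plane is one 8 x 15 row-major matrix, so the 2D expression i * nX + j reappears inside the 3D one.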
Synchronization
Figure: a kernel grid of eight blocks (Block 0 to Block 7) laid out against a time axis, executed two blocks at a time in one schedule and four blocks at a time in the other
Figure: Input: an 11 x 11 matrix M (columns 0 to 10). Output: a vector V of size 12 in which the first 11 elements hold the column sums and the last element holds the sum of the column sums. A __syncthreads() barrier separates the two steps.
Synchronization Host Program
int main()
{
    int N = 1024;          // the matrix M is N x N
    int size_M = N * N;    // number of elements in M
    int size_V = N + 1;    // N column sums plus one total
}
Kernel
__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;               // one thread per column
    float sum = 0.0f;
    for (int i = 0; i < j; i++)        // rows above the diagonal in column j
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();                   // wait until every column sum is written
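Kernel (continued)
    if (j == N - 1)
    {
        sum = 0.0f;
        for (int i = 0; i < N; i++)    // i is declared here: the earlier loop's i is out of scope
            sum = sum + V[i];
        V[N] = sum;
    }
}
Once every thread has finished its column sum, the total is computed by the last thread (j == N - 1). The __syncthreads() barrier guarantees that all entries of V have been written before that thread reads them.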
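To see the kernel run end to end, the fragments can be assembled into one program, as in the sketch below. The host-side allocation, the all-ones test matrix and the single-block launch are assumptions added here; the slides show only the three size variables. A single block is used because __syncthreads() synchronizes only the threads of one block, and N = 1024 matches the per-block thread limit of current GPUs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;                   // one thread per column
    float sum = 0.0f;
    for (int i = 0; i < j; i++)            // rows above the diagonal in column j
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();                       // all column sums are now visible in V
    if (j == N - 1) {                      // the last thread adds them up
        sum = 0.0f;
        for (int i = 0; i < N; i++)
            sum += V[i];
        V[N] = sum;
    }
}

int main() {
    int N = 1024;
    int size_M = N * N;
    int size_V = N + 1;
    std::vector<float> hM(size_M, 1.0f);   // assumed test input: all ones
    std::vector<float> hV(size_V, 0.0f);

    float *dM = nullptr, *dV = nullptr;
    cudaMalloc((void **)&dM, size_M * sizeof(float));
    cudaMalloc((void **)&dV, size_V * sizeof(float));
    cudaMemcpy(dM, hM.data(), size_M * sizeof(float), cudaMemcpyHostToDevice);

    sumTriangle<<<1, N>>>(dM, dV, N);      // one block of N threads

    cudaMemcpy(hV.data(), dV, size_V * sizeof(float), cudaMemcpyDeviceToHost);
    printf("V[1] = %.0f, V[N-1] = %.0f, V[N] = %.0f\n", hV[1], hV[N - 1], hV[N]);

    cudaFree(dM);
    cudaFree(dV);
    return 0;
}
```

With an all-ones matrix, column j sums j elements, so V[j] should equal j and V[N] should equal 0 + 1 + ... + 1023 = 523776.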
Synchronization Program Variant I
    int j = threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < j; i++)
        if (i % 2)                     // check for odd indices
            sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();
Figure: A variant of sumTriangle where only the elements at odd indices of a column are added (threads tid = 0 to 10 read the matrix M, then write the vector V after __syncthreads())
Synchronization Program Variant II
Modification: sum over all indices again, but use all threads for the final reduction.
__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < j; i++)
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();
Synchronization Program Variant II (continued)
Once every thread has finished its column sum, the total is computed cooperatively by all the threads.
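One way the cooperative step might look is a pairwise tree reduction, sketched below. This is not the lecture's code, which does not appear in this text. The kernel name sumTriangleReduce, the shared-memory scratch buffer (used so the column sums in V stay intact) and the all-ones test input are assumptions of this sketch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sketch only: column sums as before, then a tree reduction by all threads.
__global__ void sumTriangleReduce(const float *M, float *V, int N) {
    extern __shared__ float scratch[];     // N floats, sized at launch (assumption)
    int j = threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < j; i++)
        sum += M[i * N + j];
    V[j] = sum;                            // the column sum stays in V[j]
    scratch[j] = sum;                      // working copy for the reduction
    __syncthreads();

    // Each pass adds elements `stride` apart; about half the active threads drop out.
    for (int stride = 1; stride < N; stride *= 2) {
        if (j % (2 * stride) == 0 && j + stride < N)
            scratch[j] += scratch[j + stride];
        __syncthreads();                   // reached by every thread on every pass
    }
    if (j == 0)
        V[N] = scratch[0];                 // thread 0 holds the total
}

int main() {
    int N = 1024;
    std::vector<float> hM(N * N, 1.0f);    // assumed test input: all ones
    std::vector<float> hV(N + 1, 0.0f);

    float *dM = nullptr, *dV = nullptr;
    cudaMalloc((void **)&dM, hM.size() * sizeof(float));
    cudaMalloc((void **)&dV, hV.size() * sizeof(float));
    cudaMemcpy(dM, hM.data(), hM.size() * sizeof(float), cudaMemcpyHostToDevice);

    // One block of N threads; the third launch argument sizes the scratch buffer.
    sumTriangleReduce<<<1, N, N * sizeof(float)>>>(dM, dV, N);

    cudaMemcpy(hV.data(), dV, hV.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("V[N] = %.0f\n", hV[N]);

    cudaFree(dM);
    cudaFree(dV);
    return 0;
}
```

The barrier sits outside the if so that every thread reaches it on every pass, including threads that have no addition left to do. The j + stride < N check lets the loop handle sizes that are not powers of two by carrying an unpaired element forward unchanged.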
Reduction
tid            0   1   2   3   4   5   6   7   8   9  10  11
values         0   1   2   3   4   5   6   7   8   9  10  11
iteration 1    1   5   9  13  17  21
iteration 2    6  22  38
iteration 3   28  38
iteration 4   66