
Multi-dimensional mapping of dataspace; Synchronization

Soumyajit Dey, Assistant Professor,


CSE, IIT Kharagpur

January 21, 2021

Course Organization

Topic                                           Week    Hours
Review of basic COA w.r.t. performance          1       2
Intro to GPU architectures                      2       3
Intro to CUDA programming                       3       2
Multi-dimensional data and synchronization      4       2
Warp Scheduling and Divergence                  5       2
Memory Access Coalescing                        6       2
Optimizing Reduction Kernels                    7       3
Kernel Fusion, Thread and Block Coarsening      8       3
OpenCL - runtime system                         9       3
OpenCL - heterogeneous computing                10      2
Efficient Neural Network Training/Inferencing   11-12   6
Multi-dimensional block

In general,
▶ a grid is a 3-D array of blocks
▶ a block is a 3-D array of threads
▶ the dimensions are specified by the CUDA struct type dim3
▶ unused dimensions are set to 1

Multi-dimensional grid, block

dim3 X(ceil(n / 256.0), 1, 1);
dim3 Y(256, 1, 1);
vecAddKernel<<<X, Y>>>(..);
vecAddKernel<<<ceil(n / 256.0), 256>>>(..);
// both launches are equivalent: scalar arguments in <<<...>>>
// are implicitly converted to dim3 with unused dimensions set to 1

Multi-dimensional grid, block

▶ gridDim.x/y/z ∈ [1, 2¹⁶]
▶ (blockIdx.x, blockIdx.y, blockIdx.z) identifies one block
▶ All threads in a block see the same values of the built-in variables blockIdx.x, blockIdx.y, blockIdx.z
▶ blockIdx.x/y/z ∈ [0, gridDim.x/y/z − 1]

Multi-dimensional grid, block

Block dimensions are limited by the total number of threads possible in a block: 1024.

▶ (512, 1, 1) ✓ (512 threads)
▶ (8, 16, 4) ✓ (512 threads)
▶ (32, 16, 2) ✓ (1024 threads)
▶ (32, 32, 32) ✗ (32768 threads > 1024)
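The examples above can be verified with a small host-side check (a sketch; `block_ok` is a hypothetical helper, not a CUDA API):

```c
/* A block shape is legal only if the product of its dimensions
   stays within the 1024-threads-per-block limit. */
int block_ok(int x, int y, int z) {
    return x * y * z <= 1024;
}
```

Real devices additionally cap the individual dimensions (z in particular is smaller than x and y); the actual limits can be queried with cudaGetDeviceProperties().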

Multi-dimensional grid, block declaration

Consider the following host-side code:

dim3 X(2, 2, 1);
dim3 Y(4, 2, 2);
vecAddKernel<<<X, Y>>>(..);

The thread hierarchy thus created on the device when the kernel is launched is shown next.
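As a sanity check, this launch creates 2 · 2 · 1 = 4 blocks of 4 · 2 · 2 = 16 threads each, 64 threads in total; a plain-C sketch of the arithmetic, with a stand-in for CUDA's dim3:

```c
/* Host-side stand-in for CUDA's dim3, so the counts can be computed directly. */
struct dims { int x, y, z; };

int num_blocks(struct dims g)        { return g.x * g.y * g.z; }
int threads_per_block(struct dims b) { return b.x * b.y * b.z; }
int total_threads(struct dims g, struct dims b) {
    return num_blocks(g) * threads_per_block(b);
}
```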

Figure: Grids and Blocks. The grid, indexed by ⟨blockIdx.z, blockIdx.y, blockIdx.x⟩, contains blocks ⟨0,0,0⟩, ⟨0,0,1⟩, ⟨0,1,0⟩, ⟨0,1,1⟩; each block, indexed by ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩, contains threads ⟨0,0,0⟩, ⟨0,0,1⟩, ..., ⟨1,1,3⟩ (16 threads).

Figure: 2D Matrix. An 8 × 8 matrix laid out as Rows 0-7 by Cols 0-7.

Figure: Global Thread IDs. The grid contains Blocks 0-3, indexed ⟨blockIdx.z, blockIdx.y, blockIdx.x⟩ = ⟨0,0,0⟩ ... ⟨0,1,1⟩; each block contains Threads 0-15, indexed ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩ = ⟨0,0,0⟩ ... ⟨1,1,3⟩. Linear IDs are computed as:

blockNum = blockIdx.z * (gridDim.x * gridDim.y) + blockIdx.y * gridDim.x + blockIdx.x
threadNum = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x
globalThreadId = blockNum * (blockDim.x * blockDim.y * blockDim.z) + threadNum

Relations among variables

blockNum = blockIdx.z * (gridDim.x * gridDim.y) + blockIdx.y * gridDim.x + blockIdx.x;
threadNum = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
globalThreadId = blockNum * (blockDim.x * blockDim.y * blockDim.z) + threadNum;
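These relations can be exercised with plain integers, treating the built-in variables as function arguments (a sketch; names are illustrative):

```c
/* Linearize block and thread coordinates exactly as in the relations above.
   gx, gy stand for gridDim.x/y; bx, by, bz for blockDim.x/y/z. */
int block_num(int bix, int biy, int biz, int gx, int gy) {
    return biz * (gx * gy) + biy * gx + bix;
}
int thread_num(int tix, int tiy, int tiz, int bx, int by) {
    return tiz * (bx * by) + tiy * bx + tix;
}
int global_thread_id(int bnum, int tnum, int bx, int by, int bz) {
    return bnum * (bx * by * bz) + tnum;
}
```

For the earlier launch with a 2 × 2 × 1 grid of 4 × 2 × 2 blocks, the last thread (blockIdx = (1, 1, 0), threadIdx = (3, 1, 1)) gets blockNum 3, threadNum 15, and globalThreadId 3 · 16 + 15 = 63, the last of the 64 threads.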

Figure: Mapping Threads to Matrix. Each cell of the 8 × 8 matrix holds the global thread ID assigned to it:

Row 0:  0  1  2  3  4  5  6  7
Row 1:  8  9 10 11 12 13 14 15
Row 2: 16 17 18 19 20 21 22 23
Row 3: 24 25 26 27 28 29 30 31
Row 4: 32 33 34 35 36 37 38 39
Row 5: 40 41 42 43 44 45 46 47
Row 6: 48 49 50 51 52 53 54 55
Row 7: 56 57 58 59 60 61 62 63

i = globalThreadId / NumCols,  j = globalThreadId % NumCols
NumRows * NumCols = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z
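The row/column mapping is plain integer division and remainder; a minimal sketch with values taken from the 8 × 8 figure:

```c
/* Map a linear global thread ID onto a matrix with num_cols columns per row. */
int row_of(int gid, int num_cols) { return gid / num_cols; }
int col_of(int gid, int num_cols) { return gid % num_cols; }
```

Thread 13, for example, lands in row 1, column 5, matching the figure's second row.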

Mapping between kernels and data

The CUDA programming interface supports mapping kernels of any dimension (up to 3) to data of any dimension.
▶ Mapping a 3D kernel to 2D data results in complex memory access expressions.
▶ It makes sense to map a 2D kernel to 2D data and a 3D kernel to 3D data.

Figure: Two Dimensional Kernel. A gridDim = ⟨3, 2⟩ grid of Blocks 0-5, each with blockDim = ⟨5, 4⟩, i.e., Threads 0-19 indexed ⟨threadIdx.y, threadIdx.x⟩ = ⟨0,0⟩ ... ⟨3,4⟩. The data dimensions are

NumCols = blockDim.x * gridDim.x
NumRows = blockDim.y * gridDim.y

and each thread computes its coordinates as

i = blockIdx.y * blockDim.y + threadIdx.y
j = blockIdx.x * blockDim.x + threadIdx.x

Figure: Two Dimensional Kernel-Data Mapping. The ⟨3, 2⟩ grid of ⟨5, 4⟩ blocks covers an 8 X 15 matrix (NumRows = 4 * 2 = 8, NumCols = 5 * 3 = 15); each thread handles element (i, j) with i = blockIdx.y * blockDim.y + threadIdx.y and j = blockIdx.x * blockDim.x + threadIdx.x.
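The per-thread coordinate computation can be checked on the host with this configuration (gridDim = ⟨3, 2⟩, blockDim = ⟨5, 4⟩); a sketch with the built-in variables passed as arguments:

```c
/* i and j as each 2D thread computes them from its block and thread indices. */
int row_i(int block_idx_y, int block_dim_y, int thread_idx_y) {
    return block_idx_y * block_dim_y + thread_idx_y;
}
int col_j(int block_idx_x, int block_dim_x, int thread_idx_x) {
    return block_idx_x * block_dim_x + thread_idx_x;
}
```

The last thread (threadIdx = (4, 3)) of the last block (blockIdx = (2, 1)) maps to (i, j) = (1 * 4 + 3, 2 * 5 + 4) = (7, 14), the bottom-right element of the 8 X 15 matrix.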

Figure: Three Dimensional Kernel. A gridDim = ⟨2, 2, 2⟩ grid of Blocks 0-7, each with blockDim = ⟨5, 4, 3⟩, i.e., Threads 0-59 indexed ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩ = ⟨0,0,0⟩ ... ⟨2,3,4⟩. The data dimensions are

nX = blockDim.x * gridDim.x
nY = blockDim.y * gridDim.y
nZ = blockDim.z * gridDim.z

and each thread computes its coordinates as

i = blockIdx.y * blockDim.y + threadIdx.y
j = blockIdx.x * blockDim.x + threadIdx.x
k = blockIdx.z * blockDim.z + threadIdx.z

Figure: Three Dimensional Kernel-Data Mapping. Each thread of the ⟨2, 2, 2⟩ grid of ⟨5, 4, 3⟩ blocks handles the data element at (i, j, k), computed from its block and thread indices as above.

Synchronization

Figure: Mapping Blocks to Hardware. The same kernel grid of Blocks 0-7 may run on a device that executes two blocks at a time (Blocks 0-1, then 2-3, then 4-5, then 6-7) or on a device that executes four blocks at a time (Blocks 0-3, then 4-7).

▶ Each block can execute in any order relative to other blocks.
▶ Lack of synchronization constraints between blocks enables scalability.
Synchronization

▶ Synchronization constraints can be enforced on threads inside a thread block.
▶ Threads may cooperate with each other and share data with the help of shared memory (more on this later).
▶ The CUDA construct __syncthreads() is used for enforcing synchronization.

Figure: Input: an 11 X 11 matrix M, with one thread per column (0-10). Output: a vector V of size 12 where each element represents a column sum and the last element represents the sum of the column sums. A syncthreads() barrier separates the two phases.
Synchronization Host Program

int main()
{
    int N = 1024;
    int size_M = N * N;
    int size_V = N + 1;
    float *M, *V, *d_M, *d_V;
    // host/device allocation (elided on the slide; sketched here)
    M = (float *) malloc(size_M * sizeof(float));
    V = (float *) malloc(size_V * sizeof(float));
    cudaMalloc((void **) &d_M, size_M * sizeof(float));
    cudaMalloc((void **) &d_V, size_V * sizeof(float));

    cudaMemcpy(d_M, M, size_M * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_V, V, size_V * sizeof(float), cudaMemcpyHostToDevice);
    dim3 grid(1, 1, 1);
    dim3 block(N, 1, 1);  // one thread per column, all in a single block
    sumTriangle<<<grid, block>>>(d_M, d_V, N);
    cudaMemcpy(V, d_V, size_V * sizeof(float), cudaMemcpyDeviceToHost);
}
Kernel

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)  // sum column j over rows i < j
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Kernel

    if (j == N - 1) {
        sum = 0.0;
        for (int i = 0; i < N; i++)
            sum = sum + V[i];
        V[N] = sum;
    }
}

Once each thread finishes computing the sum across its column, the total sum is computed
by the last thread.
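The kernel's intended result can be mirrored by a sequential reference on the host (a sketch for checking the semantics, not the CUDA code itself):

```c
/* Sequential reference for sumTriangle: for each column j, V[j] is the
   sum of the strictly upper-triangular elements M[i][j] with i < j;
   V[N] is the sum of all the column sums. */
void sum_triangle_ref(const float *M, float *V, int N) {
    float total = 0.0f;
    for (int j = 0; j < N; j++) {
        float sum = 0.0f;
        for (int i = 0; i < j; i++)
            sum += M[i * N + j];
        V[j] = sum;
        total += sum;
    }
    V[N] = total;
}
```

For the 3 × 3 matrix {1, ..., 9}, V becomes {0, 2, 9, 11}: column 0 has no rows above the diagonal, column 1 contributes M[0][1] = 2, and column 2 contributes 3 + 6 = 9.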

Synchronization Program Variant I

Modification: only elements at odd indices are summed.

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)
        if (i % 2)  // check for odd indices
            sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Synchronization Program Variant I

The final addition is still carried out by the last thread.

    if (j == N - 1) {     // last thread: threadIdx.x ranges over 0..N-1
        sum = 0.0;
        for (int i = 0; i < N; i++)
            sum = sum + V[i];
        V[N] = sum;       // V has N+1 elements; V[N] holds the total
    }
}

Figure: A variant of sumTriangle where only the elements at odd indices of a column are added; one thread per column (tid = 0-10), with a syncthreads() barrier before the final summation.
Synchronization Program Variant II

Modification: consider summing all indices again, but use all the threads for the final reduction.

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Synchronization Program Variant II

Reduction is possible since addition is an associative operation.

    for (unsigned int s = 1; s < N; s *= 2) {
        if (j % (2 * s) == 0 && j + s < N)
            V[j] += V[j + s];
        __syncthreads();
    }
}

Once each thread finishes computing the sum across its column, the total sum is computed
by all the threads; the result is left in V[0].
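Because __syncthreads() separates the strides, the loop can be simulated stride-by-stride on the host; a sequential sketch (iterating j in order within a stride is safe because updates at the same stride touch disjoint pairs):

```c
/* Sequential simulation of the stride-doubling reduction: all "threads"
   at stride s update before s doubles, mirroring the barrier between
   iterations. The final total ends up in V[0]. */
void reduce_sim(float *V, int N) {
    for (int s = 1; s < N; s *= 2)
        for (int j = 0; j < N; j++)     /* every thread at this stride */
            if (j % (2 * s) == 0 && j + s < N)
                V[j] += V[j + s];
}
```

For the 12 elements 0..11 of the next figure, the surviving partial sums after each stride are 1 5 9 13 17 21, then 6 22 38, then 28 38, and finally V[0] = 66.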
Reduction

Figure: Reducing an array of 12 elements (tid = 0-11). Initial values: 0 1 2 3 4 5 6 7 8 9 10 11; after stride s = 1: 1 5 9 13 17 21; after s = 2: 6 22 38; after s = 4: 28 38; after s = 8: 66.


