
Multi-dimensional mapping of dataspace; Synchronization

Soumyajit Dey, Assistant Professor,


CSE, IIT Kharagpur

January 21, 2021

Course Organization

Topic                                           Week    Hours
Review of basic COA w.r.t. performance          1       2
Intro to GPU architectures                      2       3
Intro to CUDA programming                       3       2
Multi-dimensional data and synchronization      4       2
Warp Scheduling and Divergence                  5       2
Memory Access Coalescing                        6       2
Optimizing Reduction Kernels                    7       3
Kernel Fusion, Thread and Block Coarsening      8       3
OpenCL - runtime system                         9       3
OpenCL - heterogeneous computing                10      2
Efficient Neural Network Training/Inferencing   11-12   6
Multi-dimensional block

In general,
▶ a grid is a 3-D array of blocks
▶ a block is a 3-D array of threads
▶ the dimensions are specified by the CUDA struct type dim3
▶ unused dimensions are set to 1

Multi-dimensional grid, block

dim3 X(ceil(n / 256.0), 1, 1);
dim3 Y(256, 1, 1);
vecAddKernel<<<X, Y>>>(..);
vecAddKernel<<<ceil(n / 256.0), 256>>>(..);
// both launches are equivalent: scalar arguments in <<<...>>>
// are implicitly converted to dim3 with unused dimensions set to 1

Multi-dimensional grid, block

▶ gridDim.x/y/z ∈ [1, 2¹⁶]
▶ (blockIdx.x, blockIdx.y, blockIdx.z) identifies one block
▶ All threads in a block see the same values of the built-in variables blockIdx.x, blockIdx.y, blockIdx.z
▶ blockIdx.x/y/z ∈ [0, gridDim.x/y/z − 1]

Multi-dimensional grid, block

Block dimensions are limited by the total number of threads possible in a block: 1024.

▶ (512, 1, 1) ✓ (512 threads)
▶ (8, 16, 4) ✓ (512 threads)
▶ (32, 16, 2) ✓ (1024 threads)
▶ (32, 32, 32) ✗ (32768 threads > 1024)
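The examples above can be verified with a small host-side check (a sketch; `block_ok` is a hypothetical helper, not a CUDA API):

```c
/* A block shape is legal only if the product of its dimensions
   stays within the 1024-threads-per-block limit. */
int block_ok(int x, int y, int z) {
    return x * y * z <= 1024;
}
```

Real devices additionally cap the individual dimensions (z in particular is smaller than x and y); the actual limits can be queried with cudaGetDeviceProperties().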

Multi-dimensional grid, block declaration

Consider the following host-side code:

dim3 X(2, 2, 1);
dim3 Y(4, 2, 2);
vecAddKernel<<<X, Y>>>(..);

The thread hierarchy thus created on the device when the kernel is launched is shown next.
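As a sanity check, this launch creates 2 · 2 · 1 = 4 blocks of 4 · 2 · 2 = 16 threads each, 64 threads in total; a plain-C sketch of the arithmetic, with a stand-in for CUDA's dim3:

```c
/* Host-side stand-in for CUDA's dim3, so the counts can be computed directly. */
struct dims { int x, y, z; };

int num_blocks(struct dims g)        { return g.x * g.y * g.z; }
int threads_per_block(struct dims b) { return b.x * b.y * b.z; }
int total_threads(struct dims g, struct dims b) {
    return num_blocks(g) * threads_per_block(b);
}
```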

Figure: Grids and Blocks. The grid, indexed by ⟨blockIdx.z, blockIdx.y, blockIdx.x⟩, contains blocks ⟨0,0,0⟩, ⟨0,0,1⟩, ⟨0,1,0⟩, ⟨0,1,1⟩; each block, indexed by ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩, contains threads ⟨0,0,0⟩, ⟨0,0,1⟩, ..., ⟨1,1,3⟩ (16 threads).

Figure: 2D Matrix. An 8 × 8 matrix laid out as Rows 0-7 by Cols 0-7.

Figure: Global Thread IDs. The grid contains Blocks 0-3, indexed ⟨blockIdx.z, blockIdx.y, blockIdx.x⟩ = ⟨0,0,0⟩ ... ⟨0,1,1⟩; each block contains Threads 0-15, indexed ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩ = ⟨0,0,0⟩ ... ⟨1,1,3⟩. Linear IDs are computed as:

blockNum = blockIdx.z * (gridDim.x * gridDim.y) + blockIdx.y * gridDim.x + blockIdx.x
threadNum = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x
globalThreadId = blockNum * (blockDim.x * blockDim.y * blockDim.z) + threadNum

Relations among variables

blockNum = blockIdx.z * (gridDim.x * gridDim.y) + blockIdx.y * gridDim.x + blockIdx.x;
threadNum = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
globalThreadId = blockNum * (blockDim.x * blockDim.y * blockDim.z) + threadNum;
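These relations can be exercised with plain integers, treating the built-in variables as function arguments (a sketch; names are illustrative):

```c
/* Linearize block and thread coordinates exactly as in the relations above.
   gx, gy stand for gridDim.x/y; bx, by, bz for blockDim.x/y/z. */
int block_num(int bix, int biy, int biz, int gx, int gy) {
    return biz * (gx * gy) + biy * gx + bix;
}
int thread_num(int tix, int tiy, int tiz, int bx, int by) {
    return tiz * (bx * by) + tiy * bx + tix;
}
int global_thread_id(int bnum, int tnum, int bx, int by, int bz) {
    return bnum * (bx * by * bz) + tnum;
}
```

For the earlier launch with a 2 × 2 × 1 grid of 4 × 2 × 2 blocks, the last thread (blockIdx = (1, 1, 0), threadIdx = (3, 1, 1)) gets blockNum 3, threadNum 15, and globalThreadId 3 · 16 + 15 = 63, the last of the 64 threads.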

Figure: Mapping Threads to Matrix. Each cell of the 8 × 8 matrix holds the global thread ID assigned to it:

Row 0:  0  1  2  3  4  5  6  7
Row 1:  8  9 10 11 12 13 14 15
Row 2: 16 17 18 19 20 21 22 23
Row 3: 24 25 26 27 28 29 30 31
Row 4: 32 33 34 35 36 37 38 39
Row 5: 40 41 42 43 44 45 46 47
Row 6: 48 49 50 51 52 53 54 55
Row 7: 56 57 58 59 60 61 62 63

i = globalThreadId / NumCols,  j = globalThreadId % NumCols
NumRows * NumCols = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z
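The row/column mapping is plain integer division and remainder; a minimal sketch with values taken from the 8 × 8 figure:

```c
/* Map a linear global thread ID onto a matrix with num_cols columns per row. */
int row_of(int gid, int num_cols) { return gid / num_cols; }
int col_of(int gid, int num_cols) { return gid % num_cols; }
```

Thread 13, for example, lands in row 1, column 5, matching the figure's second row.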

Mapping between kernels and data

The CUDA programming interface supports mapping kernels of any dimension (up to 3) to data of any dimension.
▶ Mapping a 3D kernel to 2D data results in complex memory access expressions.
▶ It makes sense to map a 2D kernel to 2D data and a 3D kernel to 3D data.

Figure: Two Dimensional Kernel. A gridDim = ⟨3, 2⟩ grid of Blocks 0-5, each with blockDim = ⟨5, 4⟩, i.e., Threads 0-19 indexed ⟨threadIdx.y, threadIdx.x⟩ = ⟨0,0⟩ ... ⟨3,4⟩. The data dimensions are

NumCols = blockDim.x * gridDim.x
NumRows = blockDim.y * gridDim.y

and each thread computes its coordinates as

i = blockIdx.y * blockDim.y + threadIdx.y
j = blockIdx.x * blockDim.x + threadIdx.x

Figure: Two Dimensional Kernel-Data Mapping. The ⟨3, 2⟩ grid of ⟨5, 4⟩ blocks covers an 8 X 15 matrix (NumRows = 4 * 2 = 8, NumCols = 5 * 3 = 15); each thread handles element (i, j) with i = blockIdx.y * blockDim.y + threadIdx.y and j = blockIdx.x * blockDim.x + threadIdx.x.
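The per-thread coordinate computation can be checked on the host with this configuration (gridDim = ⟨3, 2⟩, blockDim = ⟨5, 4⟩); a sketch with the built-in variables passed as arguments:

```c
/* i and j as each 2D thread computes them from its block and thread indices. */
int row_i(int block_idx_y, int block_dim_y, int thread_idx_y) {
    return block_idx_y * block_dim_y + thread_idx_y;
}
int col_j(int block_idx_x, int block_dim_x, int thread_idx_x) {
    return block_idx_x * block_dim_x + thread_idx_x;
}
```

The last thread (threadIdx = (4, 3)) of the last block (blockIdx = (2, 1)) maps to (i, j) = (1 * 4 + 3, 2 * 5 + 4) = (7, 14), the bottom-right element of the 8 X 15 matrix.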

Figure: Three Dimensional Kernel. A gridDim = ⟨2, 2, 2⟩ grid of Blocks 0-7, each with blockDim = ⟨5, 4, 3⟩, i.e., Threads 0-59 indexed ⟨threadIdx.z, threadIdx.y, threadIdx.x⟩ = ⟨0,0,0⟩ ... ⟨2,3,4⟩. The data dimensions are

nX = blockDim.x * gridDim.x
nY = blockDim.y * gridDim.y
nZ = blockDim.z * gridDim.z

and each thread computes its coordinates as

i = blockIdx.y * blockDim.y + threadIdx.y
j = blockIdx.x * blockDim.x + threadIdx.x
k = blockIdx.z * blockDim.z + threadIdx.z

Figure: Three Dimensional Kernel-Data Mapping. Each thread of the ⟨2, 2, 2⟩ grid of ⟨5, 4, 3⟩ blocks handles the data element at (i, j, k), computed from its block and thread indices as above.

Synchronization

Figure: Mapping Blocks to Hardware. The same kernel grid of Blocks 0-7 may run on a device that executes two blocks at a time (Blocks 0-1, then 2-3, then 4-5, then 6-7) or on a device that executes four blocks at a time (Blocks 0-3, then 4-7).

▶ Each block can execute in any order relative to other blocks.
▶ Lack of synchronization constraints between blocks enables scalability.
Synchronization

▶ Synchronization constraints can be enforced on threads inside a thread block.
▶ Threads may cooperate with each other and share data with the help of shared memory (more on this later).
▶ The CUDA construct __syncthreads() is used for enforcing synchronization.

Figure: Input: an 11 X 11 matrix M, with one thread per column (0-10). Output: a vector V of size 12 where each element represents a column sum and the last element represents the sum of the column sums. A syncthreads() barrier separates the two phases.
Synchronization Host Program

int main()
{
    int N = 1024;
    int size_M = N * N;
    int size_V = N + 1;
    float *M, *V, *d_M, *d_V;
    // host/device allocation (elided on the slide; sketched here)
    M = (float *) malloc(size_M * sizeof(float));
    V = (float *) malloc(size_V * sizeof(float));
    cudaMalloc((void **) &d_M, size_M * sizeof(float));
    cudaMalloc((void **) &d_V, size_V * sizeof(float));

    cudaMemcpy(d_M, M, size_M * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_V, V, size_V * sizeof(float), cudaMemcpyHostToDevice);
    dim3 grid(1, 1, 1);
    dim3 block(N, 1, 1);  // one thread per column, all in a single block
    sumTriangle<<<grid, block>>>(d_M, d_V, N);
    cudaMemcpy(V, d_V, size_V * sizeof(float), cudaMemcpyDeviceToHost);
}
Kernel

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)  // sum column j over rows i < j
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Kernel

    if (j == N - 1) {
        sum = 0.0;
        for (int i = 0; i < N; i++)
            sum = sum + V[i];
        V[N] = sum;
    }
}

Once each thread finishes computing the sum across its column, the total sum is computed
by the last thread.
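The kernel's intended result can be mirrored by a sequential reference on the host (a sketch for checking the semantics, not the CUDA code itself):

```c
/* Sequential reference for sumTriangle: for each column j, V[j] is the
   sum of the strictly upper-triangular elements M[i][j] with i < j;
   V[N] is the sum of all the column sums. */
void sum_triangle_ref(const float *M, float *V, int N) {
    float total = 0.0f;
    for (int j = 0; j < N; j++) {
        float sum = 0.0f;
        for (int i = 0; i < j; i++)
            sum += M[i * N + j];
        V[j] = sum;
        total += sum;
    }
    V[N] = total;
}
```

For the 3 × 3 matrix {1, ..., 9}, V becomes {0, 2, 9, 11}: column 0 has no rows above the diagonal, column 1 contributes M[0][1] = 2, and column 2 contributes 3 + 6 = 9.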

Synchronization Program Variant I

Modification: only elements at odd indices are summed.

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)
        if (i % 2)  // check for odd indices
            sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Synchronization Program Variant I

The final addition is still carried out by the last thread.

    if (j == N - 1) {     // last thread: threadIdx.x ranges over 0..N-1
        sum = 0.0;
        for (int i = 0; i < N; i++)
            sum = sum + V[i];
        V[N] = sum;       // V has N+1 elements; V[N] holds the total
    }
}

Figure: A variant of sumTriangle where only the elements at odd indices of a column are added; one thread per column (tid = 0-10), with a syncthreads() barrier before the final summation.
Synchronization Program Variant II

Modification: consider summing all indices again, but use all the threads for the final reduction.

__global__
void sumTriangle(float *M, float *V, int N) {
    int j = threadIdx.x;
    float sum = 0.0;
    for (int i = 0; i < j; i++)
        sum += M[i * N + j];
    V[j] = sum;
    __syncthreads();

Synchronization Program Variant II

Reduction is possible since addition is an associative operation.

    for (unsigned int s = 1; s < N; s *= 2) {
        if (j % (2 * s) == 0 && j + s < N)
            V[j] += V[j + s];
        __syncthreads();
    }
}

Once each thread finishes computing the sum across its column, the total sum is computed
by all the threads; the result is left in V[0].
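Because __syncthreads() separates the strides, the loop can be simulated stride-by-stride on the host; a sequential sketch (iterating j in order within a stride is safe because updates at the same stride touch disjoint pairs):

```c
/* Sequential simulation of the stride-doubling reduction: all "threads"
   at stride s update before s doubles, mirroring the barrier between
   iterations. The final total ends up in V[0]. */
void reduce_sim(float *V, int N) {
    for (int s = 1; s < N; s *= 2)
        for (int j = 0; j < N; j++)     /* every thread at this stride */
            if (j % (2 * s) == 0 && j + s < N)
                V[j] += V[j + s];
}
```

For the 12 elements 0..11 of the next figure, the surviving partial sums after each stride are 1 5 9 13 17 21, then 6 22 38, then 28 38, and finally V[0] = 66.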
Reduction

Figure: Reducing an array of 12 elements (tid = 0-11). Initial values: 0 1 2 3 4 5 6 7 8 9 10 11; after stride s = 1: 1 5 9 13 17 21; after s = 2: 6 22 38; after s = 4: 28 38; after s = 8: 66.


