CHAPTER 3 Multidimensional grids and data

Chapter Outline
3.1 Multidimensional grid organization
3.2 Mapping threads to multidimensional data
3.3 Image blur: a more complex kernel
3.4 Matrix multiplication
3.5 Summary
Exercises
Note that dimBlock and dimGrid are host code variables that are defined by
the programmer. These variables can have any legal C variable name as long as
they have the type dim3. For example, the following statements accomplish the
same result as the statements above:
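A minimal sketch of what such statements could look like, using arbitrarily chosen variable names (the grid and block sizes of 32 and 128 and the vecAddKernel arguments are assumptions, not taken from the text above):

    dim3 myGrid(32, 1, 1);    // 32 blocks, all in the x dimension
    dim3 myBlock(128, 1, 1);  // 128 threads per block, all in the x dimension
    vecAddKernel<<<myGrid, myBlock>>>(A_d, B_d, C_d, n);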
The grid and block dimensions can also be calculated from other variables.
For example, the kernel call in Fig. 2.12 can be written as follows:
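A sketch of such a call, assuming a block size of 256 threads (as in the discussion that follows) and the vector-addition kernel parameters:

    dim3 dimGrid(ceil(n/256.0), 1, 1);  // enough blocks to cover all n elements
    dim3 dimBlock(256, 1, 1);           // fixed block size of 256 threads
    vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);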
This allows the number of blocks to vary with the size of the vectors so that
the grid will have enough threads to cover all vector elements. In this example
the programmer chose to fix the block size at 256. The value of variable n at ker-
nel call time will determine the dimension of the grid. If n is equal to 1000, the grid
will consist of four blocks. If n is equal to 4000, the grid will have 16 blocks. In
each case, there will be enough threads to cover all the vector elements. Once the
grid has been launched, the grid and block dimensions will remain the same until
the entire grid has finished execution.
For convenience, CUDA provides a special shortcut for calling a kernel with
one-dimensional (1D) grids and blocks. Instead of using dim3 variables, one can
use arithmetic expressions to specify the configuration of 1D grids and blocks. In
this case, the CUDA compiler simply takes the arithmetic expressions as the x
dimensions and assumes that the y and z dimensions are 1. This gives us the ker-
nel call statement shown in Fig. 2.12:
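Assuming the same vector-addition example, the shortcut form of the launch might look like this:

    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);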
Readers who are familiar with C++ would realize that this “shorthand” conven-
tion for 1D configurations takes advantage of how C++ constructors and default
parameters work. The default values of the parameters to the dim3 constructor are
1. When a single value is passed where a dim3 is expected, that value will be passed
to the first parameter of the constructor, while the second and third parameters take
the default value of 1. The result is a 1D grid or block in which the size of the x
dimension is the value passed and the sizes of the y and z dimensions are 1.
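The dim3 constructor behaves roughly as sketched below (this is an illustration of the default-argument behavior, not the exact declaration in the CUDA headers):

    struct dim3 {
        unsigned int x, y, z;
        dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1)
            : x(vx), y(vy), z(vz) {}  // y and z default to 1
    };
    // dim3 dimBlock(256); therefore creates a block configuration of (256, 1, 1).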
Within the kernel function, the x fields of the variables gridDim and blockDim are
preinitialized according to the values of the execution configuration parameters.
For example, if n is equal to 4000, references to gridDim.x and blockDim.x in
the vecAddKernel kernel will result in 16 and 256, respectively. Note that unlike
the dim3 variables in the host code, the names of these variables within the kernel
functions are part of the CUDA C specification and cannot be changed. That is,
gridDim and blockDim are built-in variables in a kernel and always reflect the
dimensions of the grid and the blocks, respectively.
In CUDA C the allowed values of gridDim.x range from 1 to 2³¹ - 1,¹ and
those of gridDim.y and gridDim.z range from 1 to 2¹⁶ - 1 (65,535). All threads
in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
Among blocks, the blockIdx.x value ranges from 0 to gridDim.x-1, the
blockIdx.y value ranges from 0 to gridDim.y-1, and the blockIdx.z value
ranges from 0 to gridDim.z-1.
We now turn our attention to the configuration of blocks. Each block is orga-
nized into a 3D array of threads. Two-dimensional (2D) blocks can be created by
setting blockDim.z to 1. One-dimensional blocks can be created by setting both
blockDim.y and blockDim.z to 1, as in the vecAddKernel example. As we
mentioned before, all blocks in a grid have the same dimensions and sizes. The
number of threads in each dimension of a block is specified by the second execu-
tion configuration parameter at the kernel call. Within the kernel this configura-
tion parameter can be accessed as the x, y, and z fields of blockDim.
The total size of a block in current CUDA systems is limited to 1024 threads.
These threads can be distributed across the three dimensions in any way as long
as the total number of threads does not exceed 1024. For example, blockDim values of (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowed, but (32, 32, 2) is not allowed because the total number of threads would exceed 1024.

¹ Devices with a capability of less than 3.0 allow blockIdx.x to range from 1 to 2¹⁶ - 1.

FIGURE 3.1
A multidimensional example of CUDA grid organization.
A grid and its blocks do not need to have the same dimensionality. A grid can
have higher dimensionality than its blocks and vice versa. For example, Fig. 3.1
shows a small toy grid example with a gridDim of (2, 2, 1) and a blockDim of (4,
2, 2). Such a grid can be created with the following host code:
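A sketch of such host code; the kernel name exampleKernel and its argument data_d are placeholders:

    dim3 dimGrid(2, 2, 1);   // gridDim = (2, 2, 1): four blocks
    dim3 dimBlock(4, 2, 2);  // blockDim = (4, 2, 2): 16 threads per block
    exampleKernel<<<dimGrid, dimBlock>>>(data_d);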
The grid in Fig. 3.1 consists of four blocks organized into a 2 × 2 array. Each block is labeled with (blockIdx.y, blockIdx.x). For example, block (1,0) has blockIdx.y = 1 and blockIdx.x = 0. Note that the ordering of the block and thread labels is such that the highest dimension comes first. This notation uses an
ordering that is the reverse of that used in the C statements for setting configura-
tion parameters, in which the lowest dimension comes first. This reversed order-
ing for labeling blocks works better when we illustrate the mapping of thread
coordinates into data indexes in accessing multidimensional data.
Each threadIdx also consists of three fields: the x coordinate threadIdx.x, the
y coordinate threadIdx.y, and the z coordinate threadIdx.z. Fig. 3.1 illustrates
the organization of threads within a block. In this example, each block is orga-
nized into a 4 × 2 × 2 array of threads. Since all blocks within a grid have the same dimensions, we show only one of them. Fig. 3.1 expands block (1,1) to
show its 16 threads. For example, thread (1,0,2) has threadIdx.z = 1, threadIdx.y = 0, and threadIdx.x = 2. Note that in this example we have 4
blocks of 16 threads each, with a grand total of 64 threads in the grid. We use
these small numbers to keep the illustration simple. Typical CUDA grids contain
thousands to millions of threads.
² We will refer to the dimensions of multidimensional data in descending order: the z dimension followed by the y dimension, and so on. For example, for a picture of n pixels in the vertical or y dimension and m pixels in the horizontal or x dimension, we will refer to it as an n × m picture. This follows the C multidimensional array indexing convention. For example, we can refer to P[y][x] as Py,x in text and figures for conciseness. Unfortunately, this ordering is opposite to the order in which data dimensions are ordered in the gridDim and blockDim dimensions. The discrepancy can be especially confusing when we define the dimensions of a thread grid on the basis of a multidimensional array that is to be processed by its threads.
FIGURE 3.2
Using a 2D thread grid to process a 62 × 76 picture P.
Note that in Fig. 3.2 we have two extra threads in the y direction and four
extra threads in the x direction. That is, we will generate 64 × 80 threads to process 62 × 76 pixels. This is similar to the situation in which a 1000-element vec-
tor is processed by the 1D kernel vecAddKernel in Fig. 2.9 using four 256-thread
blocks. Recall that an if-statement in Fig. 2.10 is needed to prevent the extra 24
threads from taking effect. Similarly, we should expect that the picture-processing
kernel function will have if-statements to test whether the thread’s vertical and
horizontal indices fall within the valid range of pixels.
We assume that the host code uses an integer variable n to track the number
of pixels in the y direction and another integer variable m to track the number of
pixels in the x direction. We further assume that the input picture data has been
copied to the device global memory and can be accessed through a pointer vari-
able Pin_d. The output picture has been allocated in the device memory and can
be accessed through a pointer variable Pout_d. The following host code can be
used to call a 2D kernel colorToGrayscaleConversion to process the picture, as
follows:
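A sketch of such host code; the parameter order of colorToGrayscaleConversion is an assumption:

    dim3 dimGrid(ceil(m/16.0), ceil(n/16.0), 1);  // m columns -> x, n rows -> y
    dim3 dimBlock(16, 16, 1);
    colorToGrayscaleConversion<<<dimGrid, dimBlock>>>(Pout_d, Pin_d, m, n);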
In this example we assume for simplicity that the dimensions of the blocks are fixed at 16 × 16. The dimensions of the grid, on the other hand, depend on the dimensions of the picture. To process a 1500 × 2000 (3-million-pixel) picture, we would generate 11,750 blocks: 94 in the y direction and 125 in the x direction.
Within the kernel function, references to gridDim.x, gridDim.y, blockDim.x, and
blockDim.y will result in 125, 94, 16, and 16, respectively.
Before we show the kernel code, we first need to understand how C statements
access elements of dynamically allocated multidimensional arrays. Ideally, we
would like to access Pin_d as a 2D array in which an element at row j and col-
umn i can be accessed as Pin_d[j][i]. However, the ANSI C standard on the
basis of which CUDA C was developed requires the number of columns in Pin to
be known at compile time for Pin to be accessed as a 2D array. Unfortunately,
this information is not known at compile time for dynamically allocated arrays. In
fact, part of the reason why one uses dynamically allocated arrays is to allow the
sizes and dimensions of these arrays to vary according to the data size at runtime.
Thus the information on the number of columns in a dynamically allocated 2D
array is not known at compile time by design. As a result, programmers need to
explicitly linearize, or “flatten,” a dynamically allocated 2D array into an equiva-
lent 1D array in the current CUDA C.
In reality, all multidimensional arrays in C are linearized. This is due to the
use of a “flat” memory space in modern computers (see the “Memory Space”
sidebar). In the case of statically allocated arrays, the compilers allow the pro-
grammers to use higher-dimensional indexing syntax, such as Pin_d[j][i], to
access their elements. Under the hood, the compiler linearizes them into an equiv-
alent 1D array and translates the multidimensional indexing syntax into a 1D off-
set. In the case of dynamically allocated arrays, the current CUDA C compiler
leaves the work of such translation to the programmers, owing to lack of dimen-
sional information at compile time.
Memory Space
A memory space is a simplified view of how a processor accesses its
memory in modern computers. A memory space is usually associated with
each running application. The data to be processed by an application
and instructions executed for the application are stored in locations in its
memory space. Each location typically can accommodate a byte and has
an address. Variables that require multiple bytes—4 bytes for float and 8
bytes for double—are stored in consecutive byte locations. When acces-
sing a data value from the memory space, the processor gives the starting
address (address of the starting byte location) and the number of bytes
needed.
Most modern computers have at least 4G byte-sized locations, where each G is 1,073,741,824 (2³⁰). All locations are labeled with an
address that ranges from 0 to the largest number used. Since there is only
one address for every location, we say that the memory space has a “flat”
organization. As a result, all multidimensional arrays are ultimately “flat-
tened” into equivalent one-dimensional arrays. While a C programmer can
use multidimensional array syntax to access an element of a multidimen-
sional array, the compiler translates these accesses into a base pointer that
points to the beginning element of the array, along with a one-dimensional
offset calculated from these multidimensional indices.
There are at least two ways in which a 2D array can be linearized. One is to
place all elements of the same row into consecutive locations. The rows are then
placed one after another into the memory space. This arrangement, called the
row-major layout, is illustrated in Fig. 3.3. To improve readability, we use Mj,i to
denote an element of M at the jth row and the ith column. Mj,i is equivalent to
the C expression M[j][i] but slightly more readable. Fig. 3.3 shows an example in
which a 4 × 4 matrix M is linearized into a 16-element 1D array, with all elements of row 0 first, followed by the four elements of row 1, and so on. Therefore the 1D equivalent index for an element of M at row j and column i is j*4 + i. The j*4 term skips over all elements of the rows before row j. The i term then selects the right element within the section for row j. For example, the 1D index for M2,1 is 2*4 + 1 = 9. This is illustrated in Fig. 3.3, in which M9 is the 1D equivalent to M2,1. This is the way in which C compilers linearize 2D arrays.
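As a small illustration (the array name and size here are arbitrary), the offset arithmetic that the compiler applies to a statically allocated 2D array can be written out by hand:

    float M[4][4];
    float *M_flat = (float *)M;  // the same 16 elements viewed as a 1D array
    // M[j][i] and M_flat[j*4 + i] refer to the same element; for example,
    // M[2][1] is M_flat[2*4 + 1], that is, M_flat[9].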
Another way to linearize a 2D array is to place all elements of the same col-
umn in consecutive locations. The columns are then placed one after another into
the memory space. This arrangement, called the column-major layout, is used by
FORTRAN compilers. Note that the column-major layout of a 2D array is equiva-
lent to the row-major layout of its transposed form. We will not spend more time
on this except to mention that readers whose primary previous programming
experience was with FORTRAN should be aware that CUDA C uses the row-
major layout rather than the column-major layout. Also, many C libraries that are
FIGURE 3.3
Row-major layout for a 2D C array. The result is an equivalent 1D array accessed by an
index expression j*Width + i for an element that is in the jth row and ith column of an array
of Width elements in each row.
FIGURE 3.4
Source code of colorToGrayscaleConversion with 2D thread mapping to data.
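A sketch of such a kernel is given below; the grayscale weights and the exact parameter order are assumptions (CHANNELS is assumed to be a constant equal to 3), and the line numbers cited in the following discussion refer to Fig. 3.4:

    __global__
    void colorToGrayscaleConversion(unsigned char *Pout, unsigned char *Pin,
                                    int width, int height) {
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        if (col < width && row < height) {
            // 1D offset of the output (grayscale) pixel
            int grayOffset = row*width + col;
            // The input (RGB) image has CHANNELS (3) bytes per pixel
            int rgbOffset = grayOffset*CHANNELS;
            unsigned char r = Pin[rgbOffset];      // red value
            unsigned char g = Pin[rgbOffset + 1];  // green value
            unsigned char b = Pin[rgbOffset + 2];  // blue value
            // Rescale the three channels to one luminance value and store it
            Pout[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
        }
    }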
grayOffset is the pixel index for Pout since each pixel in the output grayscale image
is 1 byte (unsigned char). Using our 62 × 76 image example, the linearized 1D index of the Pout pixel calculated by thread (0,0) of block (1,0) is given by the following formula:

Pout_{blockIdx.y*blockDim.y+threadIdx.y, blockIdx.x*blockDim.x+threadIdx.x}
= Pout_{1*16+0, 0*16+0} = Pout_{16,0} = Pout[16*76 + 0] = Pout[1216]
As for Pin, we need to multiply the gray pixel index by 3 (line 13), since each colored pixel is stored as three elements (r, g, b), each of which is 1 byte. The resulting rgbOffset gives the starting location of the color pixel in the Pin array. We read the r, g, and b values from the three consecutive byte locations of the Pin array (lines 14-16), perform the calculation of the grayscale pixel value, and write that value into the Pout array using grayOffset (line 19). In our 62 × 76 image example the linearized 1D index of the first component of the Pin
pixel that is processed by thread (0,0) of block (1,0) can be calculated with the
following formula:
Pin_{blockIdx.y*blockDim.y+threadIdx.y, blockIdx.x*blockDim.x+threadIdx.x}
= Pin_{1*16+0, 0*16+0} = Pin_{16,0} = Pin[16*76*3 + 0] = Pin[3648]
The data that is being accessed is the 3 bytes starting at byte offset 3648.
Fig. 3.5 illustrates the execution of colorToGrayscaleConversion in
processing our 62 × 76 example. Assuming 16 × 16 blocks, calling the colorToGrayscaleConversion kernel generates 64 × 80 threads. The grid will have 4 × 5 = 20 blocks: four in the vertical direction and five in the horizontal
direction. The execution behavior of blocks will fall into one of four different
cases, shown as four shaded areas in Fig. 3.5.
The first area, marked 1 in Fig. 3.5, consists of the threads that belong to the
12 blocks covering the majority of pixels in the picture. Both col and row values
of these threads are within range; all these threads pass the if-statement test and
process pixels in the dark-shaded area of the picture. That is, all 16 × 16 = 256 threads in each block will process pixels.
The second area, marked 2 in Fig. 3.5, contains the threads that belong to the
three blocks in the medium-shaded area covering the upper-right pixels of the pic-
ture. Although the row values of these threads are always within range, the col
values of some of them exceed the m value of 76. This is because the number of
threads in the horizontal direction is always a multiple of the blockDim.x value
chosen by the programmer (16 in this case). The smallest multiple of 16 needed
to cover 76 pixels is 80. As a result, 12 threads in each row will find their col
values within range and will process pixels. The remaining four threads in each
row will find their col values out of range and thus will fail the if-statement con-
dition. These threads will not process any pixels. Overall, 12 × 16 = 192 of the 16 × 16 = 256 threads in each of these blocks will process pixels.
³ We assume that CHANNELS is a constant of value 3, and its definition is outside the kernel function.
FIGURE 3.5
Covering a 62 × 76 picture with 16 × 16 blocks.
The third area, marked 3 in Fig. 3.5, accounts for the four lower-left blocks
covering the medium-shaded area of the picture. Although the col values of these
threads are always within range, the row values of some of them exceed the n
value of 62. This is because the number of threads in the vertical direction is
always a multiple of the blockDim.y value chosen by the programmer (16 in this
case). The smallest multiple of 16 to cover 62 is 64. As a result, 14 threads in
each column will find their row values within range and will process pixels. The
remaining two threads in each column will not pass the if-statement and will not process any pixels. Overall, 16 × 14 = 224 of the 256 threads will process pixels.
The fourth area, marked 4 in Fig. 3.5, contains the threads that cover the lower
right, lightly shaded area of the picture. Like Area 2, four threads in each of the top 14 rows will find their col values out of range. Like Area 3, the entire bottom two rows of this block will find their row values out of range. Overall, only 14 × 12 = 168 of the 16 × 16 = 256 threads will process pixels.
We can easily extend our discussion of 2D arrays to 3D arrays by including
another dimension when we linearize the array. This is done by placing each
“plane” of the array one after another into the address space. Assume that the pro-
grammer uses variables m and n to track the number of columns and rows, respec-
tively, in a 3D array. The programmer also needs to determine the values of
blockDim.z and gridDim.z when calling a kernel. In the kernel the array index
will involve another global index:
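A sketch of this index calculation, assuming the usual global-index pattern for all three dimensions and a 3D array with n rows and m columns per plane:

    int plane = blockIdx.z*blockDim.z + threadIdx.z;
    int row   = blockIdx.y*blockDim.y + threadIdx.y;
    int col   = blockIdx.x*blockDim.x + threadIdx.x;
    // Linearized 1D offset: skip whole planes, then whole rows, then columns
    int idx = plane*m*n + row*m + col;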
3.3 Image blur: a more complex kernel
FIGURE 3.6
An original image (left) and a blurred version (right).
FIGURE 3.7
Each output pixel is the average of a patch of surrounding pixels and itself in the input
image.
FIGURE 3.8
An image blur kernel.
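A sketch of such a blur kernel follows; BLUR_SIZE is assumed to be a compile-time constant, and the line numbers cited in the following discussion refer to Fig. 3.8 rather than to this sketch:

    __global__
    void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        if (col < w && row < h) {
            int pixVal = 0;
            int pixels = 0;
            // Average the pixels in a (2*BLUR_SIZE + 1) x (2*BLUR_SIZE + 1) patch
            for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
                for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                    int curRow = row + blurRow;
                    int curCol = col + blurCol;
                    // Skip patch positions that fall outside the image
                    if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                        pixVal += in[curRow*w + curCol];
                        ++pixels;  // count only the pixels actually accumulated
                    }
                }
            }
            out[row*w + col] = (unsigned char)(pixVal / pixels);
        }
    }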
As shown in Fig. 3.7, the col and row values also give the central pixel loca-
tion of the patch of input pixels used for calculating the output pixel for the
thread. The nested for-loops in Fig. 3.8 (lines 10-11) iterate through all the pixels in the patch. We assume that the program has a defined constant BLUR_SIZE. The value of BLUR_SIZE is set such that BLUR_SIZE gives the number of pixels on each side (radius) of the patch and 2*BLUR_SIZE + 1 gives the total number of pixels across one dimension of the patch. For example, for a 3 × 3 patch, BLUR_SIZE is set to 1, whereas for a 7 × 7 patch, BLUR_SIZE is set to 3. The outer loop iterates
through the rows of the patch. For each row, the inner loop iterates through the
columns of the patch.
In our 3 × 3 patch example, the BLUR_SIZE is 1. For the thread that calculates output pixel (25, 50), during the first iteration of the outer loop, the curRow variable is row - BLUR_SIZE = 25 - 1 = 24. Thus during the first iteration of the outer loop, the inner loop iterates through the patch pixels in row 24. The inner loop iterates from column col - BLUR_SIZE = 50 - 1 = 49 to col + BLUR_SIZE = 51 using
the curCol variable. Therefore the pixels that are processed in the first iteration of
the outer loop are (24, 49), (24, 50), and (24, 51). The reader should verify that in
the second iteration of the outer loop, the inner loop iterates through pixels (25,
49), (25, 50), and (25, 51). Finally, in the third iteration of the outer loop, the
inner loop iterates through pixels (26, 49), (26, 50), and (26, 51).
FIGURE 3.9
Handling boundary conditions for pixels near the edges of the image.
Line 16 uses the linearized index of curRow and curCol to access the value
of the input pixel visited in the current iteration. It accumulates the pixel value
into a running sum variable pixVal. Line 17 records the fact that one more pixel
value has been added into the running sum by incrementing the pixels variable.
After all the pixels in the patch have been processed, line 22 calculates the aver-
age value of the pixels in the patch by dividing the pixVal value by the pixels
value. It uses the linearized index of row and col to write the result into its out-
put pixel.
Line 15 contains a conditional statement that guards the execution of lines 16
and 17. For example, in computing output pixels near the edge of the image, the
patch may extend beyond the valid range of the input image. This is illustrated in
Fig. 3.9 assuming 3 × 3 patches. In case 1, the pixel at the upper-left corner is
being blurred. Five of the nine pixels in the intended patch do not exist in the
input image. In this case, the row and col values of the output pixel are 0 and 0,
respectively. During the execution of the nested loop, the curRow and curCol values for the nine iterations are (-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1), (1, -1), (1, 0), and (1, 1). Note that for the five pixels that are outside the image, at least one of the values is less than 0. The curRow < 0 and curCol < 0 conditions
of the if-statement catch these values and skip the execution of lines 16 and 17.
As a result, only the values of the four valid pixels are accumulated into the run-
ning sum variable. The pixels value is also correctly incremented only four times
so that the average can be calculated properly at line 22.
The reader should work through the other cases in Fig. 3.9 and analyze the
execution behavior of the nested loop in blurKernel. Note that most of the
threads will find all the pixels in their assigned 3 × 3 patch within the input image. They will accumulate all nine pixels. However, for the pixels on the
four corners, the responsible threads will accumulate only four pixels. For other
pixels on the four edges, the responsible threads will accumulate six pixels. These
variations are what necessitates keeping track of the actual number of pixels that
are accumulated with the variable pixels.
3.4 Matrix multiplication
FIGURE 3.10
Matrix multiplication using multiple blocks by tiling P.
row = blockIdx.y*blockDim.y + threadIdx.y

and

col = blockIdx.x*blockDim.x + threadIdx.x
With this one-to-one mapping, the row and col thread indices are also the row
and column indices for their output elements. Fig. 3.11 shows the source code of the kernel based on this thread-to-data mapping. The reader should immediately see the familiar pattern of calculating row and col (lines 03-04) and the if-statement testing if row and col are both within range (line 05). These statements are almost identical to their counterparts in colorToGrayscaleConversion. The only significant difference is that we are making a simplifying assumption that matrixMulKernel needs to handle only square matrices, so we replace both width and height with Width. This thread-to-data mapping effectively divides P into tiles, one of which is shown as a light-colored square in Fig. 3.10. Each block is responsible for calculating one of these tiles.

FIGURE 3.11
A matrix multiplication kernel using one thread to compute one P element.
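A sketch of such a kernel is given below; the parameter order is an assumption, and the line numbers cited in the surrounding discussion refer to Fig. 3.11:

    __global__
    void matrixMulKernel(float *M, float *N, float *P, int Width) {
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        if (row < Width && col < Width) {
            float Pvalue = 0;
            // Inner product of the rowth row of M and the colth column of N
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row*Width + k] * N[k*Width + col];
            }
            P[row*Width + col] = Pvalue;
        }
    }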
We now turn our attention to the work done by each thread. Recall that
Prow,col is calculated as the inner product of the rowth row of M and the colth col-
umn of N. In Fig. 3.11 we use a for-loop to perform this inner product operation.
Before we enter the loop, we initialize a local variable Pvalue to 0 (line 06).
Each iteration of the loop accesses an element from the rowth row of M and an
element from the colth column of N, multiplies the two elements together, and
accumulates the product into Pvalue (line 08).
Let us first focus on accessing the M element within the for-loop. M is linear-
ized into an equivalent 1D array using row-major order. That is, the rows of M are
placed one after another in the memory space, starting with the 0th row.
Therefore the beginning element of row 1 is M[1*Width] because we need to account for all elements of row 0. In general, the beginning element of the rowth row is M[row*Width]. Since all elements of a row are placed in consecutive locations, the kth element of the rowth row is at M[row*Width + k]. This linearized
array offset is what we use in Fig. 3.11 (line 08).
We now turn our attention to accessing N. As is shown in Fig. 3.11, the begin-
ning element of the colth column is the colth element of row 0, which is N[col].
Accessing the next element in the colth column requires skipping over an entire row.
This is because the next element of the same column is the same element in the next
row. Therefore the kth element of the colth column is N[k*Width + col] (line 08).
After the execution exits the for-loop, all threads have their P element values
in the Pvalue variables. Each thread then uses the 1D equivalent index expression
row*Width + col to write its P element (line 10). Again, this index pattern is like
that used in the colorToGrayscaleConversion kernel.
We now use a small example to illustrate the execution of the matrix multiplication kernel. Fig. 3.12 shows a 4 × 4 P with BLOCK_WIDTH = 2. Although such small matrix and block sizes are not realistic, they allow us to fit the entire example into one picture. The P matrix is divided into four tiles, and each block calculates one tile. We do so by creating blocks that are 2 × 2 arrays of threads,
with each thread calculating one P element. In the example, thread (0,0) of block
(0,0) calculates P0,0, whereas thread (0,0) of block (1,0) calculates P2,0.
The row and col indices in matrixMulKernel identify the P element to be calcu-
lated by a thread. The row index also identifies the row of M, and the col index iden-
tifies the column of N as input values for the thread. Fig. 3.13 illustrates the
multiplication actions in each thread block. For the small matrix multiplication exam-
ple, threads in block (0,0) produce four dot products. The row and col indices of thread (1,0) in block (0,0) are 0*2 + 1 = 1 and 0*2 + 0 = 0, respectively. The thread thus maps to P1,0 and calculates the dot product of row 1 of M and column 0 of N.
FIGURE 3.12
A small execution example of matrixMulKernel.

FIGURE 3.13
Matrix multiplication actions of one thread block.

Let us walk through the execution of the for-loop of Fig. 3.11 for thread (0,0) in block (0,0). During iteration 0 (k = 0), row*Width + k = 0*4 + 0 = 0 and k*Width + col = 0*4 + 0 = 0. Therefore the input elements accessed are M[0] and N[0], which are the 1D equivalent of M0,0 and N0,0. Note that these are indeed the 0th elements of row 0 of M and column 0 of N. During iteration 1 (k = 1), row*Width + k = 0*4 + 1 = 1 and k*Width + col = 1*4 + 0 = 4. Therefore we are accessing M[1] and N[4], which are the 1D equivalent of M0,1 and N1,0. These are the first elements of row 0 of M and column 0 of N. During iteration 2 (k = 2), row*Width + k = 0*4 + 2 = 2 and k*Width + col = 2*4 + 0 = 8, which results in M[2] and N[8]. Therefore the elements accessed are the 1D equivalent of M0,2 and N2,0. Finally, during iteration 3 (k = 3), row*Width + k = 0*4 + 3 = 3 and k*Width + col = 3*4 + 0 = 12, which results in M[3] and N[12], the 1D equivalent of M0,3 and N3,0.
We have now verified that the for-loop performs the inner product between the 0th
row of M and the 0th column of N for thread (0,0) in block (0,0). After the loop, the thread writes P[row*Width + col], which is P[0]. This is the 1D equivalent of P0,0, so
thread (0,0) in block (0,0) successfully calculated the inner product between the 0th
row of M and the 0th column of N and deposited the result in P0,0.
We will leave it as an exercise for the reader to hand-execute and verify the
for-loop for other threads in block (0,0) or in other blocks.
Since the size of a grid is limited by the maximum number of blocks per grid
and threads per block, the size of the largest output matrix P that can be handled
by matrixMulKernel will also be limited by these constraints. In the situation in
which output matrices larger than this limit are to be computed, one can divide
the output matrix into submatrices whose sizes can be covered by a grid and use
the host code to launch a different grid for each submatrix. Alternatively, we can
change the kernel code so that each thread calculates more P elements. We will
explore both options later in this book.
3.5 Summary
CUDA grids and blocks are multidimensional with up to three dimensions. The
multidimensionality of grids and blocks is useful for organizing threads to be
mapped to multidimensional data. The kernel execution configuration parameters
define the dimensions of a grid and its blocks. Unique coordinates in blockIdx
and threadIdx allow threads of a grid to identify themselves and their domains of data.
Exercises
1. In this chapter we implemented a matrix multiplication kernel that has each
thread produce one output matrix element. In this question, you will
implement different matrix-matrix multiplication kernels and compare them.
a. Write a kernel that has each thread produce one output matrix row. Fill in
the execution configuration parameters for the design.
b. Write a kernel that has each thread produce one output matrix column. Fill
in the execution configuration parameters for the design.
c. Analyze the pros and cons of each of the two kernel designs.
2. A matrix-vector multiplication takes an input matrix B and a vector C and
produces one output vector A. Each element of the output vector A is the dot product of one row of the input matrix B and C, that is, A[i] = Σ_j B[i][j]*C[j].
For simplicity we will handle only square matrices whose elements are single-
precision floating-point numbers. Write a matrix-vector multiplication kernel and
the host stub function that can be called with four parameters: pointer to the output
matrix, pointer to the input matrix, pointer to the input vector, and the number of
elements in each dimension. Use one thread to calculate an output vector element.
3. Consider the following CUDA kernel and the corresponding host function that
calls it: