CUDA_part-2
• Ideally, we would like to access d_Pin as a 2D array where an element at row j and column i can be accessed as d_Pin[j][i].
• However, the ANSI C standard based on which CUDA C was developed requires that the number of columns in d_Pin be
known at compile time.
• Unfortunately, this information is not known at compile time for dynamically allocated arrays.
• As a result, programmers need to explicitly linearize, or “flatten,” a dynamically allocated 2D array into an equivalent 1D
array in the current CUDA C.
Mapping Threads to Multidimensional Data - Flattening a 2D-array
1. row-major layout
• Here we place all elements of the same row into consecutive locations. The rows are then placed one after another
into the memory space.
• M(j, i) denotes the element of M at the j-th row and the i-th column.
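• As a minimal sketch, the row-major index calculation for a dynamically allocated 2D array flattened into a 1D array (assuming the array has Width columns) looks like this; the helper name is illustrative:
// Row-major access: element at row j, column i of a matrix stored as a 1D array
// with 'Width' columns per row.
__host__ __device__ float getElementRowMajor(const float* M, int Width, int j, int i)
{
    return M[j * Width + i];
}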
Mapping Threads to Multidimensional Data - Flattening a 2D-array
2. column-major layout
• Here we place all elements of the same column into consecutive locations. The columns are then placed one after
another into the memory space.
• The column-major layout of a 2D array is equivalent to the row-major layout of its transposed form.
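• For comparison, a minimal sketch of the column-major index calculation (assuming the array has Height rows; the helper name is illustrative):
// Column-major access: element at row j, column i of a matrix stored as a 1D array
// with 'Height' rows per column.
__host__ __device__ float getElementColMajor(const float* M, int Height, int j, int i)
{
    return M[i * Height + j];
}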
Matrix-Matrix Multiplication – three variants
We will write programs in CUDA to multiply matrix-A (ha × wa) and matrix-B (hb × wb) in three variations, as follows:
Accessing Row & Col of a matrix
Accessing Row
• Recall that a matrix M is linearized into an equivalent 1D array where the rows of M are placed one after another in the memory space, starting with the 0th row.
• For example, the beginning element of the 1st row is M[1 * Width] because we need to account for all elements of the 0th row.
• In general, the beginning element of the Row-th row is M[Row * Width].
• Since all elements of a row are placed in consecutive locations, the k-th element of the Row-th row is at M[Row * Width + k].
Accessing Col
• The beginning element of the Col-th column is the Col-th element of the 0th row, which is M[Col].
• Accessing each additional element in the Col-th column requires skipping over entire rows. This is because the next element of the same column is actually the same element in the next row.
• Therefore, the k-th element of the Col-th column is M[k * Width + Col].
[Figure: a 3 x 4 matrix M with rows (M0 M1 M2 M3), (M4 M5 M6 M7), (M8 M9 M10 M11), linearized in row-major order as M0 M1 ... M11; its columns are Col0 = {M0, M4, M8}, Col1 = {M1, M5, M9}, Col2 = {M2, M6, M10}, Col3 = {M3, M7, M11}.]
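• A small host-side sketch that walks one row and one column of the flattened matrix using the two index expressions above (the function and variable names are illustrative):
// Walk the elements of row 'Row' and of column 'Col' of a Width x Width
// matrix M stored in row-major order.
void walkRowAndCol(const float* M, int Width, int Row, int Col)
{
    for (int k = 0; k < Width; ++k) {
        float rowElem = M[Row * Width + k];   // k-th element of the Row-th row
        float colElem = M[k * Width + Col];   // k-th element of the Col-th column
        (void)rowElem; (void)colElem;         // placeholder use
    }
}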
Each row of the resultant matrix to be computed by one thread
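• The code for this variant is not reproduced on the slide; a minimal sketch, assuming square Width × Width matrices d_M, d_N, and d_P (kernel name is illustrative), could look like:
// Variant 1 sketch: each thread computes one full row of the result matrix.
__global__ void MatrixMulRowKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    int Row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (Row < Width) {
        for (int Col = 0; Col < Width; ++Col) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
            d_P[Row * Width + Col] = Pvalue;
        }
    }
}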
Each column of the resultant matrix to be computed by one thread
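• Again the slide's code is not reproduced; a minimal sketch under the same assumptions (square Width × Width matrices, illustrative kernel name):
// Variant 2 sketch: each thread computes one full column of the result matrix.
__global__ void MatrixMulColKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (Col < Width) {
        for (int Row = 0; Row < Width; ++Row) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
            d_P[Row * Width + Col] = Pvalue;
        }
    }
}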
Mapping Threads to Multidimensional Data
• The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data.
• For example, pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels
in a picture.
▪ It is a 76x62 picture.
▪ Assume that we decided to use a 16x16 block, with 16 threads in the x-
direction and 16 threads in the y-direction.
▪ We will need five blocks in the x-direction and four blocks in the y-direction, which results in 5x4 = 20 blocks.
• In this picture example, we have four extra threads in the x-direction and two extra threads in the y-direction. That is, we will
generate 80x64 threads to process 76x62 pixels.
• The picture processing kernel function will have if statements to test whether the thread indices threadIdx.x and threadIdx.y fall
within the valid range of pixels.
Mapping Threads to Multidimensional Data
▪ Assume that the host code uses an integer variable n to track the number of pixels in the x-direction (n = 76), and another integer variable m to track the number of pixels in the y-direction (m = 62).
▪ Assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin. The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• Within the kernel function, references to built-in variables gridDim.x, gridDim.y, blockDim.x, and blockDim.y will result in 5,
4, 16, and 16, respectively.
Mapping Threads to Multidimensional Data – the pictureKernel()
__global__ void pictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Calculate the row # of the d_Pin and d_Pout element to process
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column # of the d_Pin and d_Pout element to process
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    // Each thread computes one element of d_Pout if in range
    if ((Row < m) && (Col < n))
    {
        d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
    }
}
• This kernel will scale every pixel value in the picture by a factor of 2.0.
• The condition (Col < n) && (Row < m) makes sure that only the threads covering pixels in the valid range perform the computation.
// Host code: set up the execution configuration and launch (n = 76, m = 62)
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);   // 5, 4, 1
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
Mapping Threads to Multidimensional Data – the pictureKernel()
• During execution, the generated 80 × 64 grid of threads covers the 76 × 62 picture, and the blocks fall into one of four different cases:
1. The first area, marked as 1, consists of the threads that belong to the 12 blocks covering the majority of the pixels in the picture. Both the Col and Row values of these threads are within range.
2. The second area, marked as 2, contains the threads that belong to the 3 blocks covering the upper-right pixels of the picture. Although the Row values of these threads are always within range, the Col values of some of them exceed the n value (76).
3. The third area, marked as 3, contains the threads that belong to the 4 blocks covering the lower-left pixels of the picture. Although the Col values of these threads are always within range, the Row values of some of them exceed the m value (62).
4. The fourth area, marked as 4, contains the threads that belong to the 1 block covering the lower-right pixels of the picture. Both the Col and Row values of these threads exceed the n and m values.
Matrix-Matrix Multiplication – A More Complex Kernel
• We can map every valid data element in a 2D array to a unique thread using threadIdx, blockIdx, blockDim, and
gridDim variables:
// Calculate the row #
int Row = blockIdx.y*blockDim.y + threadIdx.y;
// Calculate the column #
int Col = blockIdx.x*blockDim.x + threadIdx.x;
• Matrix-Matrix multiplication between an I x J matrix d_M and a J x K matrix d_N produces an I x K matrix d_P.
Matrix-Matrix Multiplication – A More Complex Kernel
• When performing a matrix-matrix multiplication, each element of the product matrix d_P is an inner product of a row of d_M and a column of d_N. The inner product between two vectors is the sum of products of corresponding elements. That is, d_P(Row, Col) = Σ_k d_M(Row, k) * d_N(k, Col), for k = 0, 1, ..., Width-1.
• We design a kernel where each thread is responsible for calculating one d_P element.
• The d_P element calculated by a thread is in row blockIdx.y * blockDim.y + threadIdx.y and in column blockIdx.x *
blockDim.x + threadIdx.x.
Matrix-Matrix Multiplication– Kernel for thread-to-data mapping
• In the following kernel, we assume square matrices having dimension Width*Width.
• Throughout the source code, instead of using a numerical value 16 for the block-width, the programmer can use the name
BLOCK_WIDTH by defining it. It helps in autotuning.
Kernel code:

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and d_M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width))
    {
        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }
}

Host code:

#define BLOCK_WIDTH 16

// Setup the execution configuration
int NumBlocks = Width / BLOCK_WIDTH;
if (Width % BLOCK_WIDTH)
    NumBlocks++;
dim3 dimGrid(NumBlocks, NumBlocks);
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
Sequential Sparse-Matrix Vector Multiplication (SpMV)
• In a sparse matrix, the vast majority of the elements are zeros. Storing and processing these zero elements are wasteful in terms of
memory, time, and energy.
• Due to the importance of sparse matrices, several sparse matrix storage formats and their corresponding processing methods have been
proposed and widely used in the field.
• Matrices are often used to represent the coefficients in a linear system of equations.
The variables x0 and x2 are involved in equation 0, none of the variables in equation 1, variables x1, x2, and
x3 in equation 2, and finally variables x0 and x3 in equation 3.
• Compressed Sparse Row (CSR) storage format is used to avoid storing zero elements of a sparse matrix.
• CSR stores only nonzero values of consecutive rows in a 1D data storage named data[ ].
This format compresses away all zero elements.
• The two sets of markers col_index[ ] and row_ptr[ ] preserve the structure of the
original sparse matrix.
• The marker col_index[ ] gives the column index of every nonzero value in the original
sparse matrix.
• The marker row_ptr[ ] gives the starting location of every row in the data[ ] array of the
compressed storage. Note that row_ptr[4] stores the starting location of a non-existing
row-4 as 7. This is for convenience, as some algorithms need to use the starting location
of the next row to delineate the end of the current row. This extra marker gives a
convenient way to locate the ending location of row 3.
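• The sparse matrix itself appears only in the omitted figure, but the nonzero pattern described above yields the following CSR arrays (the values in data[ ] are illustrative placeholders):
// CSR representation of the 4 x 4 example: row 0 uses x0 and x2, row 1 uses none,
// row 2 uses x1, x2, and x3, and row 3 uses x0 and x3 (7 nonzeros in total).
float data[7]      = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};  // placeholder values
int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};  // column index of each nonzero value
int   row_ptr[5]   = {0, 2, 2, 5, 7};        // row_ptr[4] = 7 marks the end of row 3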
Sequential Sparse-Matrix Vector Multiplication (SpMV)
• A sequential implementation of SpMV based on CSR is quite straightforward. We assume that the code has access to the following:
1. The num_rows, a function argument that specifies the number of rows in the sparse matrix.
2. A floating-point data[ ] array, integer row_ptr[ ] and col_index[ ] arrays, and the floating-point vectors x[ ] and y[ ].
1. Line 1 is a loop that iterates through all rows of the matrix, with each iteration calculating a dot product of the current row and the vector X.
2. Lines 3 and 4 set up the range of data[ ] array elements that belong to the current row.
3. Line 5 is a loop that fetches the elements of the current row from the sparse matrix A and the corresponding elements of the vector X.
4. The loop body in Line 6 calculates the dot product for the current row.
5. Line 7 adds the dot product to the corresponding element of the vector Y (a sketch of this loop is given below).
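• The code referenced by these line numbers is not shown in these notes; a minimal reconstruction of the sequential SpMV/CSR loop, with the line numbers as comments, is:
// Sequential SpMV/CSR sketch (a reconstruction of the loop referenced above).
void sequentialSpMV_CSR(int num_rows, float* data, int* row_ptr,
                        int* col_index, float* x, float* y)
{
    for (int row = 0; row < num_rows; row++) {                 // Line 1
        float dot = 0;                                         // Line 2
        int row_start = row_ptr[row];                          // Line 3
        int row_end   = row_ptr[row + 1];                      // Line 4
        for (int elem = row_start; elem < row_end; elem++)     // Line 5
            dot += data[elem] * x[col_index[elem]];            // Line 6
        y[row] += dot;                                         // Line 7
    }
}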
• We can easily convert this sequential SpMV/CSR into a parallel CUDA kernel by assigning each iteration of the outer loop to a thread.
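• A sketch of that conversion, where each thread handles one row of the sparse matrix:
// Parallel SpMV/CSR sketch: one thread per row of the sparse matrix.
__global__ void SpMV_CSR_Kernel(int num_rows, float* data, int* row_ptr,
                                int* col_index, float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one iteration of the outer loop
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end   = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] += dot;
    }
}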
Synchronization and Transparent Scalability
• When a kernel function calls __syncthreads(), all threads in a block will be held at the calling location until every thread in the block reaches the location.
• This ensures that all threads in a block have completed a phase of their execution of the kernel before any of them can
move on to the next phase.
• When a __syncthreads() statement is placed in an if-statement, either all threads in a block execute the path that includes the __syncthreads() or none of them does.
• For an if-then-else statement, if each path has a __syncthreads() statement, either all threads in a block execute the
__syncthreads() on the then path or all of them execute the else path.
• The two __syncthreads() are different barrier synchronization points. If a thread in a block executes the then path and
another executes the else path, they would be waiting at different barrier synchronization points. They would end up
waiting for each other forever.
• It is the responsibility of the programmers to write their code so that these requirements are satisfied.
• The ability to synchronize also imposes execution constraints on threads within a block. These threads should execute in
close time proximity with each other to avoid excessively long waiting times.
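• A minimal sketch of a correctly placed barrier, where every thread in the block reaches the same __syncthreads() call (the kernel and buffer size are illustrative; it assumes blocks of at most 256 threads):
// Sketch: the guarded work diverges, but the barrier itself is executed by all threads.
__global__ void phasedKernel(float* data, int n)
{
    __shared__ float buffer[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: every thread stores one element (guarded value, unguarded barrier).
    buffer[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();   // all threads in the block reach this same barrier

    // Phase 2: use the values loaded by other threads in the block.
    if (i < n)
        data[i] = buffer[(threadIdx.x + 1) % blockDim.x];
}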
Synchronization and Transparent Scalability
• In fact, one needs to make sure that all threads involved in the barrier synchronization have access to the necessary
resources to eventually arrive at the barrier. Otherwise, a thread that never arrived at the barrier synchronization point can
cause everyone else to wait forever.
• CUDA runtime systems satisfy this constraint by assigning execution resources to all threads in a block as a unit. A block
can begin execution only when the runtime system has secured all the resources needed for all threads in the block to
complete execution.
• When a thread of a block is assigned to an execution resource, all other threads in the same block are also assigned to the
same resource. This ensures the time proximity of all threads in a block and prevents excessive or indefinite waiting time
during barrier synchronization.
• By not allowing threads in different blocks to perform barrier synchronization with each other, the CUDA runtime system
can execute blocks in any order relative to each other since none of them need to wait for each other.
Synchronization and Transparent Scalability
• In a low-cost system with only a few execution resources, one can execute a small number of blocks at the same time. In a
high-end implementation with more execution resources, one can execute a large number of blocks at the same time.
• The ability to execute the same application code on hardware with a different number of execution resources is referred to
as transparent scalability, which reduces the burden on application developers and improves the usability of applications.
Assigning Resources to Blocks
• Once a kernel is launched, the CUDA runtime system generates the corresponding grid of threads. These threads are
assigned to execution resources on a block-by-block basis.
• In the current generation of hardware, the execution resources are organized into streaming multiprocessors (SMs).
• Each device has a limit on the number of blocks that can be assigned to each SM. For example, a CUDA device may allow
up to eight blocks to be assigned to each SM.
Assigning Resources to Blocks
• In situations where there is an insufficient amount of any one or more types of resources needed for the simultaneous
execution of eight blocks, the CUDA runtime automatically reduces the number of blocks assigned to each SM until their
combined resource usage falls under the limit.
• With a limited number of SMs and a limited number of blocks that can be assigned to each SM, there is a limit on the
number of blocks that can be actively executing in a CUDA device.
• Most grids contain many more blocks than this number. The runtime system maintains a list of blocks that need to execute
and assigns new blocks to SMs as they complete executing the blocks previously assigned to them.
• One of the SM resource limitations is the number of threads that can be simultaneously tracked and scheduled.
• It takes hardware resources for SMs to maintain the thread and block indices and track their execution status.
• In more recent CUDA device designs, up to 1,536 threads can be assigned to each SM. This could be in the form of 6 blocks
of 256 threads each, 3 blocks of 512 threads each, etc.
Querying Device Properties
• When a CUDA application executes on a system, how can it find out the number of SMs in a device and the number of
threads that can be assigned to each SM?
• The CUDA runtime system has an API function cudaGetDeviceCount() that returns the number of available CUDA devices
in the system:
int dev_count;
cudaGetDeviceCount( &dev_count);
• The CUDA runtime system numbers all the available devices in the system from 0 to dev_count-1. It provides an API function cudaGetDeviceProperties() that returns the properties of the device whose number is given as an argument:

cudaDeviceProp dev_prop;
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&dev_prop, i);
    // decide if device has sufficient resources and capabilities
}
• The built-in type cudaDeviceProp is a C structure with fields that represent the properties of a CUDA device.
Querying Device Properties
• The maximal number of threads allowed in a block in the queried device is given by the field
dev_prop.maxThreadsPerBlock.
• The host code can find the maximal number of threads allowed along each dimension of a block in
dev_prop.maxThreadsDim[0] (for the x dimension), dev_prop.maxThreadsDim[1] (for the y dimension), and
dev_prop.maxThreadsDim[2] (for the z dimension).
• The host code can find the maximal number of blocks allowed along each dimension of a grid in dev_prop.maxGridSize[0]
(for the x dimension), dev_prop.maxGridSize[1] (for the y dimension), and dev_prop.maxGridSize[2] (for the z dimension).
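• A small host-side sketch that queries and prints these limits for every device (multiProcessorCount is the field that reports the number of SMs):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        printf("Device %d: %d SMs, max %d threads/block, "
               "max block dims (%d, %d, %d), max grid dims (%d, %d, %d)\n",
               i, dev_prop.multiProcessorCount, dev_prop.maxThreadsPerBlock,
               dev_prop.maxThreadsDim[0], dev_prop.maxThreadsDim[1], dev_prop.maxThreadsDim[2],
               dev_prop.maxGridSize[0], dev_prop.maxGridSize[1], dev_prop.maxGridSize[2]);
    }
    return 0;
}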
Importance of Memory Access Efficiency
• In CUDA programming, the data to be processed by the threads is first transferred from the host memory to the device global memory.
• The threads then access their portion of the data from the global memory using their block IDs and thread IDs.
• The simple CUDA kernels will likely achieve only a small fraction of the potential speed of the underlying hardware. The poor performance
is due to the fact that global memory, which is typically implemented with dynamic random access memory (DRAM), tends to have long
access latencies (hundreds of clock cycles) and finite access bandwidth.
• In matrix multiplication kernel, the most important part of the kernel in terms of execution time is the for loop that performs inner
product calculation:
for (int k = 0; k < Width; ++k)
Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
1. In every iteration of this loop, two global memory accesses are performed for one floating-point multiplication and one
floating-point addition.
2. One global memory access fetches a d_M[ ] element and the other fetches a d_N[ ] element.
3. One floating-point operation multiplies the d_M[] and d_N[] elements fetched and the other accumulates the product into
Pvalue.
4. Thus, the ratio of floating-point calculation to global memory access operation is 1:1, or 1.0.
Importance of Memory Access Efficiency
• The compute to global memory access (CGMA) ratio is defined as the number of floating point calculations performed for each access to
the global memory within a region of a CUDA program.
➢ In a high-end device today, the global memory bandwidth is around 200 GB/s. With 4 bytes in each single-precision
floating-point value, one can expect to load no more than 50 (200/4) giga single-precision operands per second. With a
CGMA ratio of 1.0, the matrix multiplication kernel will execute no more than 50 giga floating-point operations per second
(GFLOPS).
➢ For the matrix multiplication code to achieve the peak 1,500 GFLOPS rating of the processor, we need a CGMA value of 30.
CUDA Device Memory Types
• At the bottom of the figure, we see global memory and constant memory. These types of memory can be written (W) and read (R) by the host by calling API functions.
• The constant memory supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the
same location.
• Registers and shared memory are on-chip memories. Variables that reside in these types of memory can be accessed at very high speed
in a highly parallel manner.
• Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold
frequently accessed variables that are private to each thread.
• Shared memory is allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to the
block. Shared memory is an efficient means for threads to cooperate by sharing their input data and the intermediate results of their work.
• Although both are on-chip memories, they differ significantly in functionality and cost
of access.
• When the processor accesses data that resides in the shared memory, it needs to
perform a memory load operation, just like accessing data in the global memory.
• However, because shared memory resides on-chip, it can be accessed with much lower
latency and much higher bandwidth than the global memory.
• Because of the need to perform a load operation, shared memory has longer latency and lower bandwidth than registers.
• In computer architecture, shared memory is a form of scratchpad memory.
• One important difference between the shared memory and registers in CUDA is that variables that reside in the shared memory are accessible by all threads in a block. This is in contrast to register data, which is private to a thread.
• Scope identifies the range of threads that can access the variable: by a single
thread only, by all threads of a block, or by all threads of all grids.
• Lifetime tells the portion of the program’s execution duration when the variable is
available for use: either within a kernel’s execution or throughout the entire
application.
• Automatic array variables are not stored in registers. Instead, they are stored into
the global memory and may incur long access delays and potential access
congestions. The scope of these arrays is, like automatic scalar variables, limited to
individual threads.
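• A brief sketch of how these variable types appear in CUDA source; all names are illustrative:
// Illustrative declarations showing scope and lifetime (all names are hypothetical).
__constant__ float coeff[16];      // constant memory: visible to all grids, application lifetime
__device__   float globalCounter;  // global memory: visible to all grids, application lifetime

__global__ void scopeExampleKernel(float* in)
{
    int t = threadIdx.x;               // automatic scalar -> register, private to one thread
    float local = in[t];               // automatic scalar -> register, private to one thread
    float temp[4] = {0, 0, 0, 0};      // automatic array -> global memory, still private to one thread
    __shared__ float tile[256];        // shared memory: all threads of the block, kernel lifetime
    tile[t] = local + temp[0] + coeff[0] + globalCounter;
}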
Constant Memory and Caching
• We can make three interesting observations about the way the mask array M is used in convolution:
1. First, the size of the M array is typically small.
2. Second, the contents of M are not changed throughout the execution of the kernel.
3. Third, all threads need to access the mask elements. Even better, all threads access the M elements in the same order, starting from M[0] and moving by one element at a time through the iterations of the for loop in 1D parallel convolution.
• These properties make the mask array an excellent candidate for constant memory and caching.
• Like global memory variables, constant memory variables are also visible to all thread blocks.
• The main difference is that a constant memory variable cannot be changed by threads during kernel execution.
• Furthermore, the size of the constant memory can vary from device to device. The amount of constant memory available on a device can be learned with a device property query. Assume that dev_prop is returned by cudaGetDeviceProperties(); the field dev_prop.totalConstMem then gives the amount of constant memory available on the device.
** This is a global variable declaration and should be outside any function in the source file. The keyword __constant__
(two underscores on each side) tells the compiler that array M should be placed into the device constant memory.
** This is a special memory copy function that informs the CUDA runtime that the data being copied into the constant memory
will not be changed during kernel execution.
** where dest is a pointer to the destination location in the constant memory, src is a pointer to the source data in the host
memory, and size is the number of bytes to be copied.
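• The code the bullets above describe is not reproduced here; a minimal sketch, using MASK_WIDTH and M as assumed names from the 1D convolution example:
#define MASK_WIDTH 5   // mask size 2*n + 1 with n = 2 (from the example above)

// Global-scope declaration: the keyword __constant__ places the mask M in device constant memory.
__constant__ float M[MASK_WIDTH];

// Host-side setup: copy the mask into constant memory before launching the kernel.
void setupMask(const float* h_M)
{
    // Copies MASK_WIDTH*sizeof(float) bytes from host memory h_M into the
    // constant-memory symbol M; the data will not change during kernel execution.
    cudaMemcpyToSymbol(M, h_M, MASK_WIDTH * sizeof(float));
}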
• Kernel functions access constant memory variables as global variables. Thus, their pointers do not need to be passed to the kernel as
parameters.
• This is a revised kernel to use the constant memory for
the 1D parallel convolution.
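• The revised kernel itself is not shown in these notes; a sketch of a 1D convolution kernel that reads the mask from constant memory (MASK_WIDTH and M as declared in the sketch above; out-of-range input elements are treated as 0):
// 1D convolution sketch using the constant-memory mask M; no mask pointer is passed in.
__global__ void convolution_1D_constant_kernel(float* N, float* P, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (MASK_WIDTH / 2);
    for (int j = 0; j < MASK_WIDTH; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width)
            Pvalue += N[N_start_point + j] * M[j];   // M comes from constant memory
    }
    if (i < Width)
        P[i] = Pvalue;
}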
• However, because the CUDA runtime knows that constant memory variables are not modified during kernel execution, it directs the
hardware to aggressively cache the constant memory variables during kernel execution.
• To mitigate the effect of memory bottleneck, modern processors commonly employ on-chip cache memories, or caches, to reduce the
number of variables that need to be accessed from DRAM.
• A major design issue with using caches in a massively parallel processor is cache coherence, which arises when one or more processor
cores modify cached data.
• A cache coherence mechanism is needed to ensure that the contents of the caches of the other processor cores are updated.
A Strategy for Reducing Global Memory Traffic
• A common strategy is to partition the data into subsets called tiles so that each tile fits into the shared memory.
• An important criterion is that the kernel computation on these tiles can be done independently of each other.
• Note that not all data structures can be partitioned into tiles given an arbitrary kernel function.
• The concept of tiling can be illustrated with the matrix multiplication example.
• Assume that we use four 2 x 2 blocks to compute the P matrix.
• The figure highlights the computation done by the four threads of block(0,0) to compute P0,0, P0,1,
P1,0, and P1,1.
• The table shows the global memory accesses done by all threads in block 0,0. The threads are listed
in the vertical direction, with time of access increasing to the right in the horizontal direction.
• Among the four threads highlighted, there is a significant overlap in terms of the M and N elements
they access.
A Strategy for Reducing Global Memory Traffic
• The kernel is written so that all the threads repeatedly access elements of
matrix M and N from the global memory.
• We can see that every M and N element is accessed exactly twice during the execution of a block. Therefore, if we can have all four threads collaborate in their accesses to global memory, we can reduce the traffic to the global memory by half.
• Keep in mind that the size of the shared memory is quite small and one must be careful not to exceed the capacity of the shared memory
when loading these M and N elements into the shared memory.
• This can be accomplished by dividing the M and N matrices into smaller tiles. The size of these tiles is chosen so that they can fit into the
shared memory. In the simplest form, the tile dimensions equal those of the block:
• The dot product calculations performed by each thread are now divided into phases.
• In each phase, all threads in a block collaborate to load a tile of M elements and a tile of N elements into the shared memory. This is done by having every thread in a block load one M element and one N element into the shared memory.
• The shared memory arrays for the M and N elements are called Mds and Nds, respectively.
• At the beginning of phase 1, the four threads of block0,0 collaboratively load a tile of M and a tile of N elements into shared memory.
• After the two tiles of M and N elements are loaded into the shared memory, these values are used in the calculation of the dot product.
• Note that each value in the shared memory is used twice which reduces the global memory access.
• Note that the calculation of each dot product is now performed in two phases.
• Note that Pvalue is an automatic variable so a private version is generated for each thread.
• In general, if an input matrix is of dimension Width and the tile size is TILE_WIDTH, the dot product would be performed in
Width/TILE_WIDTH phases.
• Note also that Mds and Nds are reused to hold the input values. This allows a much smaller shared memory to serve most of the accesses to
global memory. This is due to the fact that each phase focuses on a small subset of the input matrix elements. Such focused access behaviour
is called locality.
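• The tiled kernel itself is not reproduced on these slides; a sketch, assuming square matrices whose Width is a multiple of TILE_WIDTH and a (Width/TILE_WIDTH) x (Width/TILE_WIDTH) grid of TILE_WIDTH x TILE_WIDTH blocks:
#define TILE_WIDTH 16

// Tiled matrix multiplication sketch: each phase loads one pair of tiles into Mds/Nds.
__global__ void MatrixMulTiledKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Loop over the Width/TILE_WIDTH phases of the dot product.
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // Collaboratively load one M element and one N element into shared memory.
        Mds[threadIdx.y][threadIdx.x] = d_M[Row * Width + ph * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = d_N[(ph * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();   // wait until the whole tile is loaded

        // Use the tile; each shared value is read TILE_WIDTH times.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();   // wait before overwriting the tile in the next phase
    }
    d_P[Row * Width + Col] = Pvalue;
}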
Tiled 1D Convolution
• To understand tiled convolution, we will assume that each thread calculates one output P element. We will refer to the collection of output elements processed by each block as an output tile.
• The following figure shows a small example of a 16-element, 1D convolution using 4 thread blocks of 4 threads each.
• The first output tile covers P[0] through P[3], the second tile P[4] through P[7],
the third tile P[8] through P[11], and the fourth tile P[12] through P[15].
• We will assume that the mask M elements are in the constant memory.
• We will assume that the mask size is an odd number equal to 2 × n + 1. The figure shows an example where n = 2.
• We will study an intuitive input data tiling strategy which involves loading all input data elements needed for calculating all output
elements of a thread block into the shared memory.
• Threads in block 0 calculate output elements P[0] through P[3]. This is the leftmost tile in the
output data and is often referred to as the left boundary tile.
• The threads collectively require input elements N[0] through N[5].
• Note that the calculation also requires two ghost elements to the left of N[0]. This is shown as two dashed empty elements on the left end of tile 0. These ghost elements are assumed to have a default value of 0.
• Tile 3 has a similar situation at the right end of input array N.
• We will refer to tiles like tile 0 and tile 3 as boundary tiles since they involve elements at or
outside the boundary of the input array N.
• Threads in block 1 calculate output elements P[4] through P[7]. They collectively require input elements N[2] through N[9].
• Calculations for tiles 1 and 2 do not involve ghost elements and are often referred to as internal tiles.
• The elements N[2] and N[3] belong to two tiles and are loaded into the shared memory twice, once to the shared memory of block 0 and
once to the shared memory of block 1.
• Since the contents of shared memory of a block are only visible to the threads of the block, these elements need to be loaded into the
respective shared memories for all involved threads to access them.
• The elements that are involved in multiple tiles and loaded by multiple blocks are commonly referred to as halo elements.
• We will refer to the center part of an input tile that is solely used by a single block as the internal elements of that input tile.
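• As a sketch of the tiling strategy described above, the following kernel loads the left halo/ghost elements, the internal elements, and the right halo/ghost elements of its input tile into shared memory before computing. It assumes the constant-memory mask M and MASK_WIDTH from the earlier sketch, blocks of TILE_SIZE threads, and that Width is a multiple of the block size:
#define TILE_SIZE 4   // output tile size = block size in the example above

// Tiled 1D convolution sketch with halo and ghost element handling.
__global__ void convolution_1D_tiled_kernel(float* N, float* P, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float N_ds[TILE_SIZE + MASK_WIDTH - 1];
    int n = MASK_WIDTH / 2;

    // Load the left halo (last n elements of the previous block's range); ghosts become 0.
    int halo_left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x >= blockDim.x - n)
        N_ds[threadIdx.x - (blockDim.x - n)] = (halo_left < 0) ? 0 : N[halo_left];

    // Load the internal elements of the input tile.
    N_ds[n + threadIdx.x] = N[i];

    // Load the right halo (first n elements of the next block's range); ghosts become 0.
    int halo_right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x < n)
        N_ds[n + blockDim.x + threadIdx.x] = (halo_right >= Width) ? 0 : N[halo_right];

    __syncthreads();   // the whole input tile is now in shared memory

    // Each thread computes one output element from the shared input tile.
    float Pvalue = 0;
    for (int j = 0; j < MASK_WIDTH; j++)
        Pvalue += N_ds[threadIdx.x + j] * M[j];   // M is the mask in constant memory
    P[i] = Pvalue;
}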