CUDA_part-2
• Ideally, we would like to access d_Pin as a 2D array where an element at row j and column i can be accessed as d_Pin[j][i].
• However, the ANSI C standard based on which CUDA C was developed requires that the number of columns in d_Pin be
known at compile time.
• Unfortunately, this information is not known at compile time for dynamically allocated arrays.
• As a result, programmers need to explicitly linearize, or “flatten,” a dynamically allocated 2D array into an equivalent 1D
array in the current CUDA C.
Mapping Threads to Multidimensional Data - Flattening a 2D-array
1. row-major layout
• Here we place all elements of the same row into consecutive locations. The rows are then placed one after another
into the memory space.
• M(j, i) denotes the element of M at the j-th row and the i-th column.
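• As a minimal sketch, the row-major index calculation for a dynamically allocated 2D array flattened into a 1D array (assuming the array has Width columns) looks like this; the helper name is illustrative:
// Row-major access: element at row j, column i of a matrix stored as a 1D array
// with 'Width' columns per row.
__host__ __device__ float getElementRowMajor(const float* M, int Width, int j, int i)
{
    return M[j * Width + i];
}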
Mapping Threads to Multidimensional Data - Flattening a 2D-array
2. column-major layout
• Here we place all elements of the same column into consecutive locations. The columns are then placed one after
another into the memory space.
• The column-major layout of a 2D array is equivalent to the row-major layout of its transposed form.
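• For comparison, a minimal sketch of the column-major index calculation (assuming the array has Height rows; the helper name is illustrative):
// Column-major access: element at row j, column i of a matrix stored as a 1D array
// with 'Height' rows per column.
__host__ __device__ float getElementColMajor(const float* M, int Height, int j, int i)
{
    return M[i * Height + j];
}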
Matrix-Matrix Multiplication – three variants
We will write programs in CUDA to multiply matrix-A (ha × wa) and matrix-B (hb × wb) in three variations, as follows:
Accessing Row & Col of a matrix
Accessing Row
• Recall that a matrix M is linearized into an equivalent 1D array where the rows of M are placed one after another in the memory space, starting with the 0th row.
• For example, the beginning element of the 1st row is M[1 * Width] because we need to account for all elements of the 0th row.
• In general, the beginning element of the Row-th row is M[Row * Width].
• Since all elements of a row are placed in consecutive locations, the k-th element of the Row-th row is at M[Row * Width + k].
Accessing Col
• The beginning element of the Col-th column is the Col-th element of the 0th row, which is M[Col].
• Accessing each additional element in the Col-th column requires skipping over entire rows. This is because the next element of the same column is actually the same element in the next row.
• Therefore, the k-th element of the Col-th column is M[k * Width + Col].
[Figure: a 3 x 4 matrix M with rows (M0 M1 M2 M3), (M4 M5 M6 M7), (M8 M9 M10 M11), linearized in row-major order as M0 M1 ... M11; its columns are Col0 = {M0, M4, M8}, Col1 = {M1, M5, M9}, Col2 = {M2, M6, M10}, Col3 = {M3, M7, M11}.]
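• A small host-side sketch that walks one row and one column of the flattened matrix using the two index expressions above (the function and variable names are illustrative):
// Walk the elements of row 'Row' and of column 'Col' of a Width x Width
// matrix M stored in row-major order.
void walkRowAndCol(const float* M, int Width, int Row, int Col)
{
    for (int k = 0; k < Width; ++k) {
        float rowElem = M[Row * Width + k];   // k-th element of the Row-th row
        float colElem = M[k * Width + Col];   // k-th element of the Col-th column
        (void)rowElem; (void)colElem;         // placeholder use
    }
}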
Each row of the resultant matrix to be computed by one thread
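• The code for this variant is not reproduced on the slide; a minimal sketch, assuming square Width × Width matrices d_M, d_N, and d_P (kernel name is illustrative), could look like:
// Variant 1 sketch: each thread computes one full row of the result matrix.
__global__ void MatrixMulRowKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    int Row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (Row < Width) {
        for (int Col = 0; Col < Width; ++Col) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
            d_P[Row * Width + Col] = Pvalue;
        }
    }
}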
Each column of the resultant matrix to be computed by one thread
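• Again the slide's code is not reproduced; a minimal sketch under the same assumptions (square Width × Width matrices, illustrative kernel name):
// Variant 2 sketch: each thread computes one full column of the result matrix.
__global__ void MatrixMulColKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (Col < Width) {
        for (int Row = 0; Row < Width; ++Row) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
            d_P[Row * Width + Col] = Pvalue;
        }
    }
}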
Mapping Threads to Multidimensional Data
• The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data.
• For example, pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels
in a picture.
▪ It is a 76x62 picture.
▪ Assume that we decided to use a 16x16 block, with 16 threads in the x-
direction and 16 threads in the y-direction.
▪ We will need five blocks in the x-direction and four blocks in the y-direction, which results in 5x4 = 20 blocks.
• In this picture example, we have four extra threads in the x-direction and two extra threads in the y-direction. That is, we will
generate 80x64 threads to process 76x62 pixels.
• The picture processing kernel function will have if statements to test whether the thread indices threadIdx.x and threadIdx.y fall
within the valid range of pixels.
Mapping Threads to Multidimensional Data
▪ Assume that the host code uses an integer variable n to track the number of pixels in the x-direction (n = 76), and another integer variable m to track the number of pixels in the y-direction (m = 62).
▪ Assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin. The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• Within the kernel function, references to built-in variables gridDim.x, gridDim.y, blockDim.x, and blockDim.y will result in 5,
4, 16, and 16, respectively.
Mapping Threads to Multidimensional Data – the pictureKernel()
__global__ void pictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Calculate the row # of the d_Pin and d_Pout element to process
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column # of the d_Pin and d_Pout element to process
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    // Each thread computes one element of d_Pout if in range
    if ((Row < m) && (Col < n))
    {
        d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
    }
}
• This kernel will scale every pixel value in the picture by a factor of 2.0.
• The condition (Col < n) && (Row < m) makes sure that only the threads covering pixels in the valid range perform the computation.
// Host code: set up the execution configuration and launch (n = 76, m = 62)
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);   // 5, 4, 1
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
Mapping Threads to Multidimensional Data – the pictureKernel()
• During execution, the generated 80 × 64 grid of threads covers the 76 × 62 picture, and the blocks fall into one of four different cases:
1. The first area, marked as 1, consists of the threads that belong to the 12 blocks covering the majority of the pixels in the picture. Both the Col and Row values of these threads are within range.
2. The second area, marked as 2, contains the threads that belong to the 3 blocks covering the upper-right pixels of the picture. Although the Row values of these threads are always within range, the Col values of some of them exceed the n value (76).
3. The third area, marked as 3, contains the threads that belong to the 4 blocks covering the lower-left pixels of the picture. Although the Col values of these threads are always within range, the Row values of some of them exceed the m value (62).
4. The fourth area, marked as 4, contains the threads that belong to the 1 block covering the lower-right pixels of the picture. Both the Col and Row values of these threads exceed the n and m values.
Matrix-Matrix Multiplication – A More Complex Kernel
• We can map every valid data element in a 2D array to a unique thread using threadIdx, blockIdx, blockDim, and
gridDim variables:
// Calculate the row #
int Row = blockIdx.y*blockDim.y + threadIdx.y;
// Calculate the column #
int Col = blockIdx.x*blockDim.x + threadIdx.x;
• Matrix-Matrix multiplication between an I x J matrix d_M and a J x K matrix d_N produces an I x K matrix d_P.
Matrix-Matrix Multiplication – A More Complex Kernel
• When performing a matrix-matrix multiplication, each element of the product matrix d_P is an inner product of a row of d_M and a column of d_N. The inner product between two vectors is the sum of products of corresponding elements. That is, d_P(Row, Col) = Σ_k d_M(Row, k) * d_N(k, Col), for k = 0, 1, ..., Width-1.
• We design a kernel where each thread is responsible for calculating one d_P element.
• The d_P element calculated by a thread is in row blockIdx.y * blockDim.y + threadIdx.y and in column blockIdx.x *
blockDim.x + threadIdx.x.
Matrix-Matrix Multiplication– Kernel for thread-to-data mapping
• In the following kernel, we assume square matrices having dimension Width*Width.
• Throughout the source code, instead of using a numerical value 16 for the block-width, the programmer can use the name
BLOCK_WIDTH by defining it. It helps in autotuning.
Kernel code:

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and d_M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width))
    {
        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }
}

Host code:

#define BLOCK_WIDTH 16

// Setup the execution configuration
int NumBlocks = Width / BLOCK_WIDTH;
if (Width % BLOCK_WIDTH)
    NumBlocks++;
dim3 dimGrid(NumBlocks, NumBlocks);
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
Sequential Sparse-Matrix Vector Multiplication (SpMV)
• In a sparse matrix, the vast majority of the elements are zeros. Storing and processing these zero elements are wasteful in terms of
memory, time, and energy.
• Due to the importance of sparse matrices, several sparse matrix storage formats and their corresponding processing methods have been
proposed and widely used in the field.
• Matrices are often used to represent the coefficients in a linear system of equations.
The variables x0 and x2 are involved in equation 0, none of the variables in equation 1, variables x1, x2, and
x3 in equation 2, and finally variables x0 and x3 in equation 3.
• Compressed Sparse Row (CSR) storage format is used to avoid storing zero elements of a sparse matrix.
• CSR stores only nonzero values of consecutive rows in a 1D data storage named data[ ].
This format compresses away all zero elements.
• The two sets of markers col_index[ ] and row_ptr[ ] preserve the structure of the
original sparse matrix.
• The marker col_index[ ] gives the column index of every nonzero value in the original
sparse matrix.
• The marker row_ptr[ ] gives the starting location of every row in the data[ ] array of the
compressed storage. Note that row_ptr[4] stores the starting location of a non-existing
row-4 as 7. This is for convenience, as some algorithms need to use the starting location
of the next row to delineate the end of the current row. This extra marker gives a
convenient way to locate the ending location of row 3.
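• The sparse matrix itself appears only in the omitted figure, but the nonzero pattern described above yields the following CSR arrays (the values in data[ ] are illustrative placeholders):
// CSR representation of the 4 x 4 example: row 0 uses x0 and x2, row 1 uses none,
// row 2 uses x1, x2, and x3, and row 3 uses x0 and x3 (7 nonzeros in total).
float data[7]      = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};  // placeholder values
int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};  // column index of each nonzero value
int   row_ptr[5]   = {0, 2, 2, 5, 7};        // row_ptr[4] = 7 marks the end of row 3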
Sequential Sparse-Matrix Vector Multiplication (SpMV)
• A sequential implementation of SpMV based on CSR is quite straightforward. We assume that the code has access to the following:
1. The num_rows, a function argument that specifies the number of rows in the sparse matrix.
2. A floating-point data[ ] array, integer row_ptr[ ] and col_index[ ] arrays, and the floating-point vectors x[ ] and y[ ].
1. Line 1 is a loop that iterates through all rows of the matrix, with each iteration calculating a dot product of the current row and the vector X.
2. Lines 3 and 4 set up the range of data[ ] array elements that belong to the current row.
3. Line 5 is a loop that fetches the elements of the current row from the sparse matrix A and the corresponding elements of the vector X.
4. The loop body in Line 6 calculates the dot product for the current row.
5. Line 7 adds the dot product to the corresponding element of the vector Y (a sketch of this loop is given below).
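• The code referenced by these line numbers is not shown in these notes; a minimal reconstruction of the sequential SpMV/CSR loop, with the line numbers as comments, is:
// Sequential SpMV/CSR sketch (a reconstruction of the loop referenced above).
void sequentialSpMV_CSR(int num_rows, float* data, int* row_ptr,
                        int* col_index, float* x, float* y)
{
    for (int row = 0; row < num_rows; row++) {                 // Line 1
        float dot = 0;                                         // Line 2
        int row_start = row_ptr[row];                          // Line 3
        int row_end   = row_ptr[row + 1];                      // Line 4
        for (int elem = row_start; elem < row_end; elem++)     // Line 5
            dot += data[elem] * x[col_index[elem]];            // Line 6
        y[row] += dot;                                         // Line 7
    }
}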
• We can easily convert this sequential SpMV/CSR into a parallel CUDA kernel by assigning each iteration of the outer loop to a thread.
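• A sketch of that conversion, where each thread handles one row of the sparse matrix:
// Parallel SpMV/CSR sketch: one thread per row of the sparse matrix.
__global__ void SpMV_CSR_Kernel(int num_rows, float* data, int* row_ptr,
                                int* col_index, float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one iteration of the outer loop
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end   = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] += dot;
    }
}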
Synchronization and Transparent Scalability
• When a kernel function calls __syncthreads(), all threads in a block will be held at the calling location until every thread in the block reaches the location.
• This ensures that all threads in a block have completed a phase of their execution of the kernel before any of them can
move on to the next phase.
• When a __syncthreads() statement is placed in an if-statement, either all threads in a block execute the path that includes the __syncthreads() or none of them does.
• For an if-then-else statement, if each path has a __syncthreads() statement, either all threads in a block execute the
__syncthreads() on the then path or all of them execute the else path.
• The two __syncthreads() are different barrier synchronization points. If a thread in a block executes the then path and
another executes the else path, they would be waiting at different barrier synchronization points. They would end up
waiting for each other forever.
• It is the responsibility of the programmers to write their code so that these requirements are satisfied.
• The ability to synchronize also imposes execution constraints on threads within a block. These threads should execute in
close time proximity with each other to avoid excessively long waiting times.
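• A minimal sketch of a correctly placed barrier, where every thread in the block reaches the same __syncthreads() call (the kernel and buffer size are illustrative; it assumes blocks of at most 256 threads):
// Sketch: the guarded work diverges, but the barrier itself is executed by all threads.
__global__ void phasedKernel(float* data, int n)
{
    __shared__ float buffer[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: every thread stores one element (guarded value, unguarded barrier).
    buffer[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();   // all threads in the block reach this same barrier

    // Phase 2: use the values loaded by other threads in the block.
    if (i < n)
        data[i] = buffer[(threadIdx.x + 1) % blockDim.x];
}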
Synchronization and Transparent Scalability
• In fact, one needs to make sure that all threads involved in the barrier synchronization have access to the necessary
resources to eventually arrive at the barrier. Otherwise, a thread that never arrived at the barrier synchronization point can
cause everyone else to wait forever.
• CUDA runtime systems satisfy this constraint by assigning execution resources to all threads in a block as a unit. A block
can begin execution only when the runtime system has secured all the resources needed for all threads in the block to
complete execution.
• When a thread of a block is assigned to an execution resource, all other threads in the same block are also assigned to the
same resource. This ensures the time proximity of all threads in a block and prevents excessive or indefinite waiting time
during barrier synchronization.
• By not allowing threads in different blocks to perform barrier synchronization with each other, the CUDA runtime system
can execute blocks in any order relative to each other since none of them need to wait for each other.
Synchronization and Transparent Scalability
• In a low-cost system with only a few execution resources, one can execute a small number of blocks at the same time. In a
high-end implementation with more execution resources, one can execute a large number of blocks at the same time.
• The ability to execute the same application code on hardware with a different number of execution resources is referred to
as transparent scalability, which reduces the burden on application developers and improves the usability of applications.
Assigning Resources to Blocks
• Once a kernel is launched, the CUDA runtime system generates the corresponding grid of threads. These threads are
assigned to execution resources on a block-by-block basis.
• In the current generation of hardware, the execution resources are organized into streaming multiprocessors (SMs).
• Each device has a limit on the number of blocks that can be assigned to each SM. For example, a CUDA device may allow
up to eight blocks to be assigned to each SM.
Assigning Resources to Blocks
• In situations where there is an insufficient amount of any one or more types of resources needed for the simultaneous
execution of eight blocks, the CUDA runtime automatically reduces the number of blocks assigned to each SM until their
combined resource usage falls under the limit.
• With a limited number of SMs and a limited number of blocks that can be assigned to each SM, there is a limit on the
number of blocks that can be actively executing in a CUDA device.
• Most grids contain many more blocks than this number. The runtime system maintains a list of blocks that need to execute
and assigns new blocks to SMs as they complete executing the blocks previously assigned to them.
• One of the SM resource limitations is the number of threads that can be simultaneously tracked and scheduled.
• It takes hardware resources for SMs to maintain the thread and block indices and track their execution status.
• In more recent CUDA device designs, up to 1,536 threads can be assigned to each SM. This could be in the form of 6 blocks
of 256 threads each, 3 blocks of 512 threads each, etc.
Querying Device Properties
• When a CUDA application executes on a system, how can it find out the number of SMs in a device and the number of
threads that can be assigned to each SM?
• The CUDA runtime system has an API function cudaGetDeviceCount() that returns the number of available CUDA devices
in the system:
int dev_count;
cudaGetDeviceCount( &dev_count);
• The CUDA runtime system numbers all the available devices in the system from 0 to dev_count-1. It provides an API function cudaGetDeviceProperties() that returns the properties of the device whose number is given as an argument:

cudaDeviceProp dev_prop;
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&dev_prop, i);
    // decide if device has sufficient resources and capabilities
}
• The built-in type cudaDeviceProp is a C structure with fields that represent the properties of a CUDA device.
Querying Device Properties
• The maximal number of threads allowed in a block in the queried device is given by the field
dev_prop.maxThreadsPerBlock.
• The host code can find the maximal number of threads allowed along each dimension of a block in
dev_prop.maxThreadsDim[0] (for the x dimension), dev_prop.maxThreadsDim[1] (for the y dimension), and
dev_prop.maxThreadsDim[2] (for the z dimension).
• The host code can find the maximal number of blocks allowed along each dimension of a grid in dev_prop.maxGridSize[0]
(for the x dimension), dev_prop.maxGridSize[1] (for the y dimension), and dev_prop.maxGridSize[2] (for the z dimension).
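• A small host-side sketch that queries and prints these limits for every device (multiProcessorCount is the field that reports the number of SMs):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        printf("Device %d: %d SMs, max %d threads/block, "
               "max block dims (%d, %d, %d), max grid dims (%d, %d, %d)\n",
               i, dev_prop.multiProcessorCount, dev_prop.maxThreadsPerBlock,
               dev_prop.maxThreadsDim[0], dev_prop.maxThreadsDim[1], dev_prop.maxThreadsDim[2],
               dev_prop.maxGridSize[0], dev_prop.maxGridSize[1], dev_prop.maxGridSize[2]);
    }
    return 0;
}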
Importance of Memory Access Efficiency
• In CUDA programming, the data to be processed by the threads is first transferred from the host memory to the device global memory.
• The threads then access their portion of the data from the global memory using their block IDs and thread IDs.
• The simple CUDA kernels will likely achieve only a small fraction of the potential speed of the underlying hardware. The poor performance
is due to the fact that global memory, which is typically implemented with dynamic random access memory (DRAM), tends to have long
access latencies (hundreds of clock cycles) and finite access bandwidth.
• In matrix multiplication kernel, the most important part of the kernel in terms of execution time is the for loop that performs inner
product calculation:
for (int k = 0; k < Width; ++k)
Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
1. In every iteration of this loop, two global memory accesses are performed for one floating-point multiplication and one
floating-point addition.
2. One global memory access fetches a d_M[ ] element and the other fetches a d_N[ ] element.
3. One floating-point operation multiplies the d_M[] and d_N[] elements fetched and the other accumulates the product into
Pvalue.
4. Thus, the ratio of floating-point calculation to global memory access operation is 1:1, or 1.0.
Importance of Memory Access Efficiency
• The compute to global memory access (CGMA) ratio is defined as the number of floating point calculations performed for each access to
the global memory within a region of a CUDA program.
➢ In a high-end device today, the global memory bandwidth is around 200 GB/s. With 4 bytes in each single-precision
floating-point value, one can expect to load no more than 50 (200/4) giga single-precision operands per second. With a
CGMA ratio of 1.0, the matrix multiplication kernel will execute no more than 50 giga floating-point operations per second
(GFLOPS).
➢ For the matrix multiplication code to achieve the peak 1,500 GFLOPS rating of the processor, we need a CGMA value of 30.
CUDA Device Memory Types
• At the bottom of the figure, we see global memory and constant memory. These types of memory can be written (W) and read (R) by the host by calling API functions.
• The constant memory supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the
same location.
• Registers and shared memory are on-chip memories. Variables that reside in these types of memory can be accessed at very high speed
in a highly parallel manner.
• Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold
frequently accessed variables that are private to each thread.
• Shared memory is allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to the
block. Shared memory is an efficient means for threads to cooperate by sharing their input data and the intermediate results of their work.
• Although both are on-chip memories, they differ significantly in functionality and cost
of access.
• When the processor accesses data that resides in the shared memory, it needs to
perform a memory load operation, just like accessing data in the global memory.
• However, because shared memory resides on-chip, it can be accessed with much lower
latency and much higher bandwidth than the global memory.
• Because of the need to perform a load operation, shared memory has longer latency and lower bandwidth than registers.
• In computer architecture, shared memory is a form of scratchpad memory.
• One important difference between the shared memory and registers in CUDA is that variables that reside in the shared memory are accessible by all threads in a block. This is in contrast to register data, which is private to a thread.
• Scope identifies the range of threads that can access the variable: by a single
thread only, by all threads of a block, or by all threads of all grids.
• Lifetime tells the portion of the program’s execution duration when the variable is
available for use: either within a kernel’s execution or throughout the entire
application.
• Automatic array variables are not stored in registers. Instead, they are stored into
the global memory and may incur long access delays and potential access
congestions. The scope of these arrays is, like automatic scalar variables, limited to
individual threads.
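• A brief sketch of how these variable types appear in CUDA source; all names are illustrative:
// Illustrative declarations showing scope and lifetime (all names are hypothetical).
__constant__ float coeff[16];      // constant memory: visible to all grids, application lifetime
__device__   float globalCounter;  // global memory: visible to all grids, application lifetime

__global__ void scopeExampleKernel(float* in)
{
    int t = threadIdx.x;               // automatic scalar -> register, private to one thread
    float local = in[t];               // automatic scalar -> register, private to one thread
    float temp[4] = {0, 0, 0, 0};      // automatic array -> global memory, still private to one thread
    __shared__ float tile[256];        // shared memory: all threads of the block, kernel lifetime
    tile[t] = local + temp[0] + coeff[0] + globalCounter;
}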
Constant Memory and Caching
• We can make three interesting observations about the way the mask array M is used in convolution:
1. First, the size of the M array is typically small.
2. Second, the contents of M are not changed throughout the execution of the kernel.
3. Third, all threads need to access the mask elements. Even better, all threads access the M elements in the same order, starting from M[0] and moving by one element at a time through the iterations of the for loop in 1D parallel convolution.
• These properties make the mask array an excellent candidate for constant memory and caching.
• Like global memory variables, constant memory variables are also visible to all thread blocks.
• The main difference is that a constant memory variable cannot be changed by threads during kernel execution.
• Furthermore, the size of the constant memory can vary from device to device. The amount of constant memory available on a device can be learned with a device property query. Assume that dev_prop is returned by cudaGetDeviceProperties(); the field dev_prop.totalConstMem then gives the amount of constant memory available on the device.
** This is a global variable declaration and should be outside any function in the source file. The keyword __constant__
(two underscores on each side) tells the compiler that array M should be placed into the device constant memory.
** This is a special memory copy function that informs the CUDA runtime that the data being copied into the constant memory
will not be changed during kernel execution.
** where dest is a pointer to the destination location in the constant memory, src is a pointer to the source data in the host
memory, and size is the number of bytes to be copied.
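• The code the bullets above describe is not reproduced here; a minimal sketch, using MASK_WIDTH and M as assumed names from the 1D convolution example:
#define MASK_WIDTH 5   // mask size 2*n + 1 with n = 2 (from the example above)

// Global-scope declaration: the keyword __constant__ places the mask M in device constant memory.
__constant__ float M[MASK_WIDTH];

// Host-side setup: copy the mask into constant memory before launching the kernel.
void setupMask(const float* h_M)
{
    // Copies MASK_WIDTH*sizeof(float) bytes from host memory h_M into the
    // constant-memory symbol M; the data will not change during kernel execution.
    cudaMemcpyToSymbol(M, h_M, MASK_WIDTH * sizeof(float));
}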
• Kernel functions access constant memory variables as global variables. Thus, their pointers do not need to be passed to the kernel as
parameters.
• This is a revised kernel to use the constant memory for
the 1D parallel convolution.
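• The revised kernel itself is not shown in these notes; a sketch of a 1D convolution kernel that reads the mask from constant memory (MASK_WIDTH and M as declared in the sketch above; out-of-range input elements are treated as 0):
// 1D convolution sketch using the constant-memory mask M; no mask pointer is passed in.
__global__ void convolution_1D_constant_kernel(float* N, float* P, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (MASK_WIDTH / 2);
    for (int j = 0; j < MASK_WIDTH; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width)
            Pvalue += N[N_start_point + j] * M[j];   // M comes from constant memory
    }
    if (i < Width)
        P[i] = Pvalue;
}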
• However, because the CUDA runtime knows that constant memory variables are not modified during kernel execution, it directs the
hardware to aggressively cache the constant memory variables during kernel execution.
• To mitigate the effect of memory bottleneck, modern processors commonly employ on-chip cache memories, or caches, to reduce the
number of variables that need to be accessed from DRAM.
• A major design issue with using caches in a massively parallel processor is cache coherence, which arises when one or more processor
cores modify cached data.
• A cache coherence mechanism is needed to ensure that the contents of the caches of the other processor cores are updated.
A Strategy for Reducing Global Memory Traffic
• A common strategy is to partition the data into subsets called tiles so that each tile fits into the shared memory.
• An important criterion is that the kernel computation on these tiles can be done independently of each other.
• Note that not all data structures can be partitioned into tiles given an arbitrary kernel function.
• The concept of tiling can be illustrated with the matrix multiplication example.
• Assume that we use four 2 x 2 blocks to compute the P matrix.
• The figure highlights the computation done by the four threads of block(0,0) to compute P0,0, P0,1,
P1,0, and P1,1.
• The table shows the global memory accesses done by all threads in block 0,0. The threads are listed
in the vertical direction, with time of access increasing to the right in the horizontal direction.
• Among the four threads highlighted, there is a significant overlap in terms of the M and N elements
they access.
A Strategy for Reducing Global Memory Traffic
• The kernel is written so that all the threads repeatedly access elements of
matrix M and N from the global memory.
• We can see that every M and N element is accessed exactly twice during the execution of a block. Therefore, if we can have all four threads collaborate in their accesses to global memory, we can reduce the traffic to the global memory by half.
• Keep in mind that the size of the shared memory is quite small and one must be careful not to exceed the capacity of the shared memory
when loading these M and N elements into the shared memory.
• This can be accomplished by dividing the M and N matrices into smaller tiles. The size of these tiles is chosen so that they can fit into the
shared memory. In the simplest form, the tile dimensions equal those of the block:
• The dot product calculations performed by each thread are now divided into phases.
• In each phase, all threads in a block collaborate to load a tile of M elements and a tile of N elements into the shared memory. This is done by having every thread in a block load one M element and one N element into the shared memory.
• The shared memory arrays for the M and N elements are called Mds and Nds, respectively.
• At the beginning of phase 1, the four threads of block0,0 collaboratively load a tile of M and a tile of N elements into shared memory.
• After the two tiles of M and N elements are loaded into the shared memory, these values are used in the calculation of the dot product.
• Note that each value in the shared memory is used twice which reduces the global memory access.
• Note that the calculation of each dot product is now performed in two phases.
• Note that Pvalue is an automatic variable so a private version is generated for each thread.
• In general, if an input matrix is of dimension Width and the tile size is TILE_WIDTH, the dot product would be performed in
Width/TILE_WIDTH phases.
• Note also that Mds and Nds are reused to hold the input values. This allows a much smaller shared memory to serve most of the accesses to
global memory. This is due to the fact that each phase focuses on a small subset of the input matrix elements. Such focused access behaviour
is called locality.
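• The tiled kernel itself is not reproduced on these slides; a sketch, assuming square matrices whose Width is a multiple of TILE_WIDTH and a (Width/TILE_WIDTH) x (Width/TILE_WIDTH) grid of TILE_WIDTH x TILE_WIDTH blocks:
#define TILE_WIDTH 16

// Tiled matrix multiplication sketch: each phase loads one pair of tiles into Mds/Nds.
__global__ void MatrixMulTiledKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Loop over the Width/TILE_WIDTH phases of the dot product.
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // Collaboratively load one M element and one N element into shared memory.
        Mds[threadIdx.y][threadIdx.x] = d_M[Row * Width + ph * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = d_N[(ph * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();   // wait until the whole tile is loaded

        // Use the tile; each shared value is read TILE_WIDTH times.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();   // wait before overwriting the tile in the next phase
    }
    d_P[Row * Width + Col] = Pvalue;
}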
Tiled 1D Convolution
• To understand tiled convolution, we will assume that each thread calculates one output P element. We will refer to the collection of output elements processed by each block as an output tile.
• The following figure shows a small example of a 16-element, 1D convolution using 4 thread blocks of 4 threads each.
• The first output tile covers P[0] through P[3], the second tile P[4] through P[7],
the third tile P[8] through P[11], and the fourth tile P[12] through P[15].
• We will assume that the mask M elements are in the constant memory.
• We will assume that the mask size is an odd number equal to 2 × n + 1. The figure shows an example where n = 2.
• We will study an intuitive input data tiling strategy which involves loading all input data elements needed for calculating all output
elements of a thread block into the shared memory.
• Threads in block 0 calculate output elements P[0] through P[3]. This is the leftmost tile in the
output data and is often referred to as the left boundary tile.
• The threads collectively require input elements N[0] through N[5].
• Note that the calculation also requires two ghost elements to the left of N[0]. This is shown as two dashed empty elements on the left end of tile 0. These ghost elements are assumed to have a default value of 0.
• Tile 3 has a similar situation at the right end of input array N.
• We will refer to tiles like tile 0 and tile 3 as boundary tiles since they involve elements at or
outside the boundary of the input array N.
• Threads in block 1 calculate output elements P[4] through P[7]. They collectively require input elements N[2] through N[9].
• Calculations for tiles 1 and 2 do not involve ghost elements and are often referred to as internal tiles.
• The elements N[2] and N[3] belong to two tiles and are loaded into the shared memory twice, once to the shared memory of block 0 and
once to the shared memory of block 1.
• Since the contents of shared memory of a block are only visible to the threads of the block, these elements need to be loaded into the
respective shared memories for all involved threads to access them.
• The elements that are involved in multiple tiles and loaded by multiple blocks are commonly referred to as halo elements.
• We will refer to the center part of an input tile that is solely used by a single block as the internal elements of that input tile.
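• As a sketch of the tiling strategy described above, the following kernel loads the left halo/ghost elements, the internal elements, and the right halo/ghost elements of its input tile into shared memory before computing. It assumes the constant-memory mask M and MASK_WIDTH from the earlier sketch, blocks of TILE_SIZE threads, and that Width is a multiple of the block size:
#define TILE_SIZE 4   // output tile size = block size in the example above

// Tiled 1D convolution sketch with halo and ghost element handling.
__global__ void convolution_1D_tiled_kernel(float* N, float* P, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float N_ds[TILE_SIZE + MASK_WIDTH - 1];
    int n = MASK_WIDTH / 2;

    // Load the left halo (last n elements of the previous block's range); ghosts become 0.
    int halo_left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x >= blockDim.x - n)
        N_ds[threadIdx.x - (blockDim.x - n)] = (halo_left < 0) ? 0 : N[halo_left];

    // Load the internal elements of the input tile.
    N_ds[n + threadIdx.x] = N[i];

    // Load the right halo (first n elements of the next block's range); ghosts become 0.
    int halo_right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x < n)
        N_ds[n + blockDim.x + threadIdx.x] = (halo_right >= Width) ? 0 : N[halo_right];

    __syncthreads();   // the whole input tile is now in shared memory

    // Each thread computes one output element from the shared input tile.
    float Pvalue = 0;
    for (int j = 0; j < MASK_WIDTH; j++)
        Pvalue += N_ds[threadIdx.x + j] * M[j];   // M is the mask in constant memory
    P[i] = Pvalue;
}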