Chapter 5: Memory Architecture and Data Locality
Chapter Outline
5.1 Importance of memory access efficiency
5.2 CUDA memory types
5.3 Tiling for reduced memory traffic
5.4 A tiled matrix multiplication kernel
5.5 Boundary checks
5.6 Impact of memory usage on occupancy
5.7 Summary
Exercises
So far, we have learned how to write a CUDA kernel function and how to
configure and coordinate its execution by a massive number of threads. We
have also looked at the compute architecture of current GPU hardware and how
threads are scheduled to execute on this hardware. In this chapter we will focus
on the on-chip memory architecture of the GPU and begin to study how one can
organize and position data for efficient access by a massive number of threads.
The CUDA kernels that we have studied so far will likely achieve only a tiny
fraction of the potential speed of the underlying hardware. This poor perfor-
mance is because global memory, which is typically implemented with off-chip
DRAM, tends to have long access latency (hundreds of clock cycles) and finite
access bandwidth. While having many threads available for execution can theo-
retically tolerate long memory access latencies, one can easily run into a situa-
tion in which traffic congestion in the global memory access paths prevents all
but a very few threads from making progress, thus rendering some of the cores
in the streaming multiprocessors (SMs) idle. To circumvent such congestion,
GPUs provide a number of additional on-chip memory resources for accessing
data that can remove the majority of traffic to and from the global memory. In
this chapter we will study the use of different memory types to boost the execu-
tion performance of CUDA kernels.
5.1 Importance of memory access efficiency

FIGURE 5.1
The most executed part of the matrix multiplication kernel in Fig. 3.11.
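For reference, the innermost loop of that kernel has the following form (a sketch consistent with the simple kernel of Chapter 3; variable names are assumptions):

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        // Each iteration reads one element of M and one of N (2 x 4 bytes from global memory)
        // and performs one multiplication and one addition (2 floating-point operations),
        // giving a compute to global memory access ratio of 2/8 = 0.25 OP/B.
        Pvalue += M[row*Width + k] * N[k*Width + col];
    }
    P[row*Width + col] = Pvalue;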
To achieve higher performance for this kernel, we need to increase the com-
pute to global memory access ratio of the kernel by reducing the number of
global memory accesses it performs. For example, to fully utilize the 19,500
GFLOPS that the A100 GPU provides, a ratio of at least (19,500 GOP/second)/
(1555 GB/second)=12.5 OP/B is needed. This ratio means that for every 4-byte
floating point value accessed, there must be about 50 floating-point operations
performed! The extent to which such a ratio can be achieved depends on the
intrinsic data reuse in the computation at hand. We refer the reader to the “The
Roofline Model” sidebar for a useful model for analyzing a program’s potential
performance with respect to its compute intensity.
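As a point of reference, the roofline bound can be written as follows (a standard formulation, using the A100 numbers above):

    attainable throughput = min(peak compute throughput, compute intensity x peak memory bandwidth)

At 0.25 OP/B a kernel is therefore bounded by 0.25 x 1555 ≈ 389 GFLOPS, whereas an intensity of at least 12.5 OP/B is needed before the 19,500 GFLOPS compute peak, rather than the memory bandwidth, becomes the limiting factor.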
As we will see, matrix multiplication presents opportunities for reduction of
global memory accesses that can be captured with relatively simple techniques.
The execution speed of matrix multiplication functions can vary by orders of
magnitude, depending on the level of reduction of global memory accesses.
Therefore matrix multiplication provides an excellent initial example for such
techniques. This chapter introduces a commonly used technique for reducing the
number of global memory accesses and demonstrates the technique on matrix
multiplication.
5.2 CUDA memory types

Each thread is also assigned a portion of global memory that it uses as its own private local memory where it places
data that is private to the thread but cannot be allocated in registers. This data
includes statically allocated arrays, spilled registers, and other elements of the
thread’s call stack.
Registers and shared memory in Fig. 5.2 are on-chip memories. Variables that
reside in these types of memory can be accessed at very high speed in a highly
parallel manner. Registers are allocated to individual threads; each thread can
access only its own registers (see the “CPU versus GPU Register Architecture”
sidebar). A kernel function typically uses registers to hold frequently accessed
variables that are private to each thread. Shared memory is allocated to thread
blocks; all threads in a block can access shared memory variables declared for the
block. Shared memory is an efficient means by which threads can cooperate by
sharing their input data and intermediate results. By declaring a CUDA variable
in one of the CUDA memory types, a CUDA programmer dictates the visibility
and access speed of the variable.
To fully appreciate the difference between registers, shared memory, and global
memory, we need to go into a little more detail about how these different memory
types are realized and used in modern processors. As we discussed in the “Warps
and SIMD Hardware” sidebar in Chapter 4, Compute Architecture and Scheduling,
virtually all modern processors find their root in the model proposed by John von
Neumann in 1945, which is shown in Fig. 5.3. CUDA devices are no exception.
The global memory in a CUDA device maps to the Memory box in Fig. 5.3.
FIGURE 5.2
An (incomplete) overview of the CUDA device memory model. An important type of CUDA
memory that is not shown in this figure is the texture memory, since its use is not covered
in this textbook.
FIGURE 5.3
Memory versus registers in a modern computer based on the von Neumann model.
The processor box corresponds to the processor chip boundary that we typically see
today. The global memory is off the processor chip and is implemented with
DRAM technology, which implies long access latencies and relatively low access
bandwidth. The registers correspond to the “Register File” of the von Neumann
model. The Register File is on the processor chip, which implies very short access
latency and drastically higher access bandwidth when compared to the global mem-
ory. In a typical device, the aggregated access bandwidth of all the register files
across all the SMs is at least two orders of magnitude higher than that of the global
memory. Furthermore, whenever a variable is stored in a register, its accesses no longer consume off-chip global memory bandwidth. A further advantage is that arithmetic instructions in most modern processors have built-in register operands.
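For example, a floating-point addition instruction typically takes a form such as the following (the mnemonic and register numbers are illustrative):

    fadd r1, r2, r3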
where r2 and r3 are the register numbers that specify the location in the register
file where the input operand values can be found. The location for storing the
floating-point addition result value is specified by r1. Therefore when an operand
of an arithmetic instruction is in a register, no additional instruction is required to
make the operand value available to the arithmetic and logic unit (ALU), where
the arithmetic calculation is done.
Meanwhile, if an operand value is in the global memory, the processor needs to
perform a memory load operation to make the operand value available to the ALU.
For example, if the first operand of a floating-point addition instruction is in the global
memory, the instructions that are involved will likely look like the following example:
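A representative two-instruction sequence (the mnemonics, register names, and offset notation are illustrative) is:

    load r2, r4, offset
    fadd r1, r2, r3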
where the load instruction adds an offset value to the contents of r4 to form an
address for the operand value. It then accesses the global memory and places the
value into register r2. Once the operand value is in r2, the fadd instruction per-
forms the floating-point addition using the values in r2 and r3 and places the
result into r1. Since the processor can fetch and execute only a limited number of
instructions per clock cycle, the version with an additional load will likely take
more time to process than the one without. This is another reason why placing the
operands in registers can improve execution speed.
Finally, there is yet another subtle reason why placing an operand value in registers
is preferable. In modern computers the energy that is consumed for accessing a value
from the register file is at least an order of magnitude lower than for accessing a value
from the global memory. Accessing a value from registers has a tremendous advantage
in energy efficiency over accessing the value from the global memory. We will look at
more details of the speed and energy difference in accessing these two hardware struc-
tures in modern computers soon. On the other hand, as we will soon learn, the number
of registers that are available to each thread is quite limited in today’s GPUs. As we
saw in Chapter 4, Compute Architecture and Scheduling, the occupancy that is
achieved for an application can be reduced if the register usage in full-occupancy sce-
narios exceeds the limit. Therefore we also need to avoid oversubscribing to this lim-
ited resource whenever possible.
Fig. 5.4 shows the shared memory and registers in a CUDA device. Although
both are on-chip memories, they differ significantly in functionality and cost of
access. Shared memory is designed as part of the memory space that resides on
the processor chip. When the processor accesses data that resides in the shared
memory, it needs to perform a memory load operation, just as in accessing data
in the global memory. However, because shared memory resides on-chip, it can
be accessed with much lower latency and much higher throughput than the global
memory. Because of the need to perform a load operation, shared memory has
longer latency and lower bandwidth than registers. In computer architecture termi-
nology the shared memory is a form of scratchpad memory.
One important difference between the shared memory and registers in CUDA
is that variables that reside in the shared memory are accessible by all threads in
a block. This contrasts with register data, which is private to a thread. That is,
shared memory is designed to support efficient, high-bandwidth sharing of data
among threads in a block. As shown in Fig. 5.4, a CUDA device SM typically
employs multiple processing units to allow multiple threads to make simultaneous
progress (see the “Threads” sidebar in Chapter 2, Heterogeneous Data Parallel Computing). Threads in a block can be spread across these processing units.
Therefore the hardware implementations of the shared memory in these CUDA
devices are typically designed to allow multiple processing units to simulta-
neously access its contents to support efficient data sharing among threads in a
block. We will be learning several important types of parallel algorithms that can
greatly benefit from such efficient data sharing among threads.
It should be clear by now that registers, local memory, shared memory, and
global memory all have different functionalities, latencies, and bandwidth. It is
therefore important to understand how to declare a variable so that it will reside
in the intended type of memory. Table 5.1 presents the CUDA syntax for declar-
ing program variables into the various memory types. Each such declaration also
gives its declared CUDA variable a scope and lifetime.
FIGURE 5.4
Shared memory versus registers in a CUDA device SM.
Table 5.1 CUDA variable declaration type qualifiers and the properties of each type.

Variable declaration                        Memory     Scope    Lifetime
Automatic variables other than arrays       Register   Thread   Grid
Automatic array variables                   Local      Thread   Grid
__device__ __shared__ int SharedVar;        Shared     Block    Grid
__device__ int GlobalVar;                   Global     Grid     Application
__device__ __constant__ int ConstVar;       Constant   Grid     Application
Scope identifies the set of threads that can access the variable: a single thread only, all threads of a block, or
all threads of all grids. If a variable’s scope is a single thread, a private version of
the variable will be created for every thread; each thread can access only its pri-
vate version of the variable. For example, if a kernel declares a variable whose
scope is a thread and it is launched with one million threads, one million versions
of the variable will be created so that each thread initializes and uses its own ver-
sion of the variable.
Lifetime tells the portion of the program’s execution duration when the variable
is available for use: either within a grid’s execution or throughout the entire applica-
tion. If a variable’s lifetime is within a grid’s execution, it must be declared within
the kernel function body and will be available for use only by the kernel’s code. If
the kernel is invoked several times, the value of the variable is not maintained across
these invocations. Each invocation must initialize the variable in order to use it. On
the other hand, if a variable’s lifetime is throughout the entire application, it must be
declared outside of any function body. The contents of these variables are maintained
throughout the execution of the application and available to all kernels.
We refer to variables that are not arrays as scalar variables. As shown in
Table 5.1, all automatic scalar variables that are declared in kernel and device
functions are placed into registers. The scopes of these automatic variables are
within individual threads. When a kernel function declares an automatic variable,
a private copy of that variable is generated for every thread that executes the ker-
nel function. When a thread terminates, all its automatic variables cease to exist.
In the image blur kernel of Chapter 3, variables blurRow, blurCol, curRow, curCol, pixels, and pixVal are
all automatic variables and fall into this category. Note that accessing these vari-
ables is extremely fast and parallel, but one must be careful not to exceed the lim-
ited capacity of the register storage in the hardware implementations. Using a
large number of registers can negatively affect the occupancy of each SM, as we
saw in Chapter 4, Compute Architecture and Scheduling.
Automatic array variables are not stored in registers.1 Instead, they are stored into the thread’s local memory and may incur long access delays and potential access congestions. The scope of these arrays, like that of automatic scalar variables, is limited to individual threads. That is, a private version of each automatic array is created for and used by every thread. Once a thread terminates its execution, the contents of its automatic array variables cease to exist. From our experience, one seldom needs to use automatic array variables in kernel functions and device functions.

1 There are some exceptions to this rule. The compiler may decide to store an automatic array into registers if all accesses are done with constant index values.
If a variable declaration is preceded by the __shared__ keyword (each "__" consists of two "_" characters), it declares a shared variable in CUDA. One can
also add an optional __device__ in front of __shared__ in the declaration to
achieve the same effect. Such a declaration is typically made within a kernel
function or a device function. Shared variables reside in the shared memory. The
scope of a shared variable is within a thread block; that is, all threads in a block
see the same version of a shared variable. A private version of the shared variable
is created for and used by each block during kernel execution. The lifetime of a
shared variable is within the duration of the kernel execution. When a kernel ter-
minates its grid’s execution, the contents of its shared variables cease to exist. As
we discussed earlier, shared variables are an efficient means for threads within a
block to collaborate with each other. Accessing shared variables from the shared
memory is extremely fast and highly parallel. CUDA programmers often use
shared variables to hold the portion of global memory data that is frequently used
and reused in an execution phase of the kernel. One may need to adjust the algo-
rithms that are used to create execution phases that heavily focus on small por-
tions of the global memory data, as we will demonstrate with matrix
multiplication in Section 5.4.
If a variable declaration is preceded by the keyword __constant__ (each "__" consists of two "_" characters), it declares a constant variable in CUDA. One
can also add an optional __device__ in front of __constant__ to achieve the
same effect. Declaration of constant variables must be outside any function
body. The scope of a constant variable is all grids, meaning that all threads in
all grids see the same version of a constant variable. The lifetime of a constant
variable is the entire application execution. Constant variables are often used for
variables that provide input values to kernel functions. The values of the con-
stant variables cannot be changed by the kernel function code. Constant vari-
ables are stored in the global memory but are cached for efficient access. With
appropriate access patterns, accessing constant memory is extremely fast and
parallel. Currently, the total size of constant variables in an application is lim-
ited to 65,536 bytes. One may need to break up the input data volume to fit
within this limitation. We will demonstrate the usage of constant memory in
Chapter 7, Convolution.
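As an illustration (the names here are placeholders), a constant variable is declared at file scope and initialized from the host with cudaMemcpyToSymbol before the kernel is launched:

    __constant__ float filterCoeffs[64];      // constant memory; visible to all threads of all grids

    // Host code (error checking omitted):
    float h_coeffs[64];
    // ... fill h_coeffs with the input values ...
    cudaMemcpyToSymbol(filterCoeffs, h_coeffs, sizeof(h_coeffs));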
A variable whose declaration is preceded only by the keyword __device__ (each "__" consists of two "_" characters) is a global variable and will be placed
in the global memory. Accesses to a global variable are slow. Latency and
throughput of accessing global variables have been improved with caches in more
recent devices. One important advantage of global variables is that they are
visible to all threads of all kernels. Their contents also persist through the entire
execution. Thus global variables can be used as a means for threads to collaborate
across blocks. However, one must be aware that there is currently no easy way to
synchronize between threads from different thread blocks or to ensure data con-
sistency across threads in accessing global memory other than using atomic
operations or terminating the current kernel execution.2 Therefore global variables
are often used to pass information from one kernel invocation to another kernel
invocation.
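The declaration forms in Table 5.1 can be collected into a short sketch (all names are illustrative):

    __device__ int GlobalVar;                  // global memory; scope: all grids; lifetime: application
    __constant__ float ConstScale;             // constant memory; cached; read-only in kernel code

    __global__ void exampleKernel(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic scalar -> register; scope: thread
        float temp[4];                                    // automatic array -> local memory; scope: thread
        __shared__ float tile[256];                       // shared memory; scope: block; lifetime: grid

        tile[threadIdx.x] = data[i] * ConstScale;         // assumes blockDim.x <= 256
        __syncthreads();
        temp[0] = tile[threadIdx.x];
        data[i] = temp[0] + GlobalVar;                    // GlobalVar persists across kernel launches
    }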
In CUDA, pointers can be used to point to data objects in the global memory.
There are two typical ways in which pointer use arises in kernel and device func-
tions. First, if an object is allocated by a host function, the pointer to the object is
initialized by memory allocation API functions such as cudaMalloc and can be
passed to the kernel function as a parameter, as we saw in Chapter 2,
Heterogeneous Data Parallel Computing, and Chapter 3, Multidimensional Grids
and Data. The second type of use is to assign the address of a variable that is
declared in the global memory to a pointer variable. For example, the statement
float *ptr = &GlobalVar; in a kernel function assigns the address of GlobalVar
into an automatic pointer variable ptr. The reader should refer to the CUDA
Programming Guide for using pointers in other memory types.
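A small sketch of both uses (names are illustrative; error checking omitted):

    __device__ float GlobalVar;                 // a variable that resides in global memory

    __global__ void useKernel(float* data) {    // data points to memory allocated with cudaMalloc
        float* ptr = &GlobalVar;                // second use: address of a global-memory variable
        *ptr = data[0];                         // purely for illustration; all accesses go to global memory
    }

    // Host code for the first use:
    // float* d_data;
    // cudaMalloc((void**)&d_data, 256 * sizeof(float));
    // useKernel<<<1, 256>>>(d_data);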
2 One can use CUDA memory fencing to ensure data coherence between thread blocks if the number of thread blocks is smaller than the number of SMs in the CUDA device. See the CUDA Programming Guide for more details.
5.3 Tiling for reduced memory traffic

FIGURE 5.5
A small example of matrix multiplication. For brevity we show M[y*Width+x], N[y*Width+x], and P[y*Width+x] as My,x, Ny,x, and Py,x, respectively.
FIGURE 5.6
Global memory accesses performed by threads in block0,0.
FIGURE 5.7
Tiling M and N to utilize shared memory.
FIGURE 5.8
Execution phases of a tiled matrix multiplication.
The creation of these phases is key to the reduction of accesses to the global
memory. With each phase focusing on a small subset of the input matrix values,
the threads can collaboratively load the subset into the shared memory and use
the values in the shared memory to satisfy their overlapping input needs in the
phase.
Note also that Mds and Nds are reused across phases. In each phase, the same
Mds and Nds are reused to hold the subset of M and N elements used in the
phase. This allows a much smaller shared memory to serve most of the accesses
to global memory. This is because each phase focuses on a small subset of the
input matrix elements. Such focused access behavior is called locality. When an
algorithm exhibits locality, there is an opportunity to use small, high-speed mem-
ories to serve most of the accesses and remove these accesses from the global
memory. Locality is as important for achieving high performance in multicore
CPUs as in many-thread GPUs. We will return to the concept of locality in
Chapter 6, Performance Considerations.
5.4 A tiled matrix multiplication kernel

FIGURE 5.9
A tiled matrix multiplication kernel using shared memory.
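A sketch of such a tiled kernel is given below, numbered to match the line references in the discussion that follows (this is a reconstruction in the spirit of Fig. 5.9; it assumes Width is a multiple of TILE_WIDTH, and the exact listing may differ in minor details):

    01  #define TILE_WIDTH 16
    02  __global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    03    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    04    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    05
    06    int bx = blockIdx.x;  int by = blockIdx.y;
    07    int tx = threadIdx.x; int ty = threadIdx.y;
    08
    09    // Identify the row and column of the P element to work on
    10    int Row = by * TILE_WIDTH + ty;
    11    int Col = bx * TILE_WIDTH + tx;
    12
    13    // Loop over the M and N tiles required to compute the P element
    14    float Pvalue = 0;
    15
    16    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
    17
    18      // Collaborative loading of the M and N tiles into shared memory
    19      Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
    20      Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
    21      __syncthreads();
    22
    23      for (int k = 0; k < TILE_WIDTH; ++k) {
    24        Pvalue += Mds[ty][k] * Nds[k][tx];
    25      }
    26      __syncthreads();
    27
    28    }
    29    P[Row*Width + Col] = Pvalue;
    30  }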
In the horizontal direction, the blocks preceding block bx cover bx*TILE_WIDTH elements of P, and another tx threads within the same block cover another tx elements. Thus the thread with bx and tx should be responsible for calculating the P element whose x index is bx*TILE_WIDTH+tx. For the example in Fig. 5.7, the horizontal (x) index of the P element to be calculated by thread0,1 of block1,0 is 0*2+1=1. This horizontal index is saved in the variable Col
for the thread and is also illustrated in Fig. 5.10.
Similarly, the vertical (y) position, or the row index, of the P element to be pro-
cessed by a thread is calculated as by*TILE_WIDTH+ty. Going back to the example in Fig. 5.7, the y index of the P element to be calculated by thread0,1 of block1,0 is 1*2+0=2. This vertical index is saved in the variable Row for the thread. As shown
in Fig. 5.10, each thread calculates the P element at the Colth column and the Rowth
row. Thus the P element to be calculated by thread0,1 of block1,0 is P2,1.
Line 16 of Fig. 5.9 marks the beginning of the loop that iterates through all
the phases of calculating the P element. Each iteration of the loop corresponds to
one phase of the calculation shown in Fig. 5.8. The ph variable indicates the num-
ber of phases that have already been done for the dot product. Recall that each
phase uses one tile of M and one tile of N elements. Therefore at the beginning
FIGURE 5.10
Calculation of the matrix indices in tiled multiplication.
of each phase, ph*TILE_WIDTH pairs of M and N elements have been processed by
previous phases.
In each phase, lines 19 and 20 in Fig. 5.9 load the appropriate M and N elements,
respectively, into the shared memory. Since we already know the row of M and col-
umn of N to be processed by the thread, we now turn our focus to the column index
of M and row index of N. As shown in Fig. 5.10, each block has TILE_WIDTH² threads that will collaborate to load TILE_WIDTH² M elements and TILE_WIDTH² N ele-
ments into the shared memory. Thus all we need to do is to assign each thread to
load one M element and one N element. This is conveniently done by using the
blockIdx and threadIdx. Note that the beginning column index of the section of M
elements to be loaded is ph*TILE_WIDTH. Therefore an easy approach is to have every
thread load an element that is tx (the threadIdx.x value) positions away from that
beginning point. Similarly, the beginning row index of the section of N elements to
be loaded is also ph*TILE_WIDTH. Therefore every thread loads an element that is ty
(the threadIdx.y value) positions away from that beginning point.
This is precisely what we have in lines 19 and 20. In line 19, each thread
loads M[Row*Width + ph*TILE_WIDTH + tx], where the linearized index is formed with the row index Row and column index ph*TILE_WIDTH + tx. Since the value of Row is a linear function of ty, each of the TILE_WIDTH² threads will load a unique
M element into the shared memory because each thread has a unique combination
of tx and ty. Together, these threads will load a dark square subset of M in
Fig. 5.10. In a similar way, in line 20, each thread loads the appropriate N ele-
ment to shared memory using the linearized index (ph*TILE_WIDTH + ty)*Width +
Col. The reader should use the small example in Figs. 5.7 and 5.8 to verify that
the address calculation works correctly for individual threads.
The barrier __syncthreads() in line 21 ensures that all threads have finished
loading the tiles of M and N into Mds and Nds before any of them can move for-
ward. Recall from Chapter 4, Compute Architecture and Scheduling, that the call
to __syncthreads() can be used to make all threads in a block wait for each other
to reach the barrier before any of them can proceed. This is important because the
M and N elements to be used by a thread can be loaded by other threads. One
needs to ensure that all elements are properly loaded into the shared memory
before any of the threads start to use the elements. The loop in line 23 then per-
forms one phase of the dot product based on the tile elements. The progression of
the loop for threadty,tx is shown in Fig. 5.10, with the access direction of the M
and N elements along the arrow marked with k, the loop variable in line 23. Note
that these elements will be accessed from Mds and Nds, the shared memory arrays
holding these M and N elements. The barrier __syncthreads() in line 26 ensures
that all threads have finished using the M and N elements in the shared memory
before any of them move on to the next iteration and load the elements from the
next tiles. Thus none of the threads would load the elements too early and corrupt
the input values of other threads.
The two __syncthreads() calls in lines 21 and 26 demonstrate two different
types of data dependence that parallel programmers often have to reason about
when they are coordinating between threads. The first is called a read-after-write
dependence because threads must wait for data to be written to the proper place
by other threads before they try to read it. The second is called a write-after-read
dependence because a thread must wait for the data to be read by all threads that
need it before overwriting it. Other names for read-after-write and write-after-read
dependences are true and false dependences, respectively. A read-after-write depen-
dence is a true dependence because the reading thread truly needs the data supplied
by the writing thread, so it has no choice but to wait for it. A write-after-read
dependence is a false dependence because the writing thread does not need any
data from the reading thread. The dependence is caused by the fact that they are
reusing the same memory location and would not exist if they used different
locations.
The loop nest from line 16 to line 28 illustrates a technique called strip-
mining, which takes a long-running loop and breaks it into phases. Each phase
involves an inner loop that executes a few consecutive iterations of the original
loop. The original loop becomes an outer loop whose role is to iteratively invoke
the inner loop so that all the iterations of the original loop are executed in their
original order. By adding barrier synchronizations before and after the inner loop,
we force all threads in the same block to focus their work on the same section of
input data during each phase. Strip-mining is an important means to creating the
phases that are needed by tiling in data parallel programs.3
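For a simple sequential loop, strip-mining looks like the following sketch (names are illustrative, and N is assumed to be a multiple of TILE):

    // Original long-running loop:
    //   for (int i = 0; i < N; ++i) sum += a[i];

    float stripMinedSum(const float* a, int N) {
        const int TILE = 4;                               // phase length (illustrative)
        float sum = 0.0f;
        for (int phase = 0; phase < N/TILE; ++phase) {    // outer loop: iterates over the phases
            for (int j = 0; j < TILE; ++j) {              // inner loop: TILE consecutive iterations
                sum += a[phase*TILE + j];                 //   of the original loop, in original order
            }
        }
        return sum;
    }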
After all phases of the dot product are complete, the execution exits the outer
loop. In Line 29, all threads write to their P element using the linearized index
calculated from Row and Col.
The benefit of the tiled algorithm is substantial. For matrix multiplication, the
global memory accesses are reduced by a factor of TILE_WIDTH. With 16 × 16
tiles, one can reduce the global memory accesses by a factor of 16. This increases
the compute to global memory access ratio from 0.25 OP/B to 4 OP/B. This
improvement allows the memory bandwidth of a CUDA device to support a high-
er computation rate. For example, in the A100 GPU which has a global memory
bandwidth of 1555 GB/second, this improvement allows the device to achieve
(1555 GB/second) × (4 OP/B) = 6220 GFLOPS, which is substantially higher than
the 389 GFLOPS achieved by the kernel that did not use tiling.
Although tiling improves throughput substantially, 6220 GFLOPS is still only
32% of the device’s peak throughput of 19,500 GFLOPS. One can further opti-
mize the code to reduce the number of global memory accesses and improve
throughput. We will see some of these optimizations later in the book, while other
advanced optimizations will not be covered. Because of the importance of matrix
multiplication in many domains, there are highly optimized libraries, such as
cuBLAS and CUTLASS, that already incorporate many of these advanced optimi-
zations. Programmers can use these libraries to immediately achieve close to peak
performance in their linear algebra applications.
The effectiveness of tiling at improving the throughput of matrix multiplication
in particular and applications in general is not unique to GPUs. There is a long his-
tory of applying tiling (or blocking) techniques to improve performance on CPUs
by ensuring that the data that is reused by a CPU thread within a particular time
window will be found in the cache. One key difference is that tiling techniques on
CPUs rely on the CPU cache to keep reused data on-chip implicitly, whereas tiling
techniques on GPUs use shared memory explicitly to keep the data on-chip. The
reason is that a CPU core typically runs one or two threads at a time, so a thread
can rely on the cache keeping recently used data around. In contrast, a GPU SM
runs many threads simultaneously to be able to hide latency. These threads may
compete for cache slots, which makes the GPU cache less reliable, necessitating
the use of shared memory for important data that is to be reused.
3 The reader should note that strip-mining has long been used in programming CPUs. Strip-mining
followed by loop interchange is often used to enable tiling for improved locality in sequential pro-
grams. Strip-mining is also the main vehicle for vectorizing compilers to generate vector or SIMD
instructions for CPU programs.
5.5 Boundary checks

FIGURE 5.11
Loading input matrix elements that are close to the edge: phase 1 of block0,0.
The element after M0,2 in the linearized layout is M1,0. Although thread0,1
is attempting to access M0,3, it will end up getting M1,0. The use of this value
in the subsequent inner product calculation will obviously corrupt the output
value.
A similar problem arises in accessing an element that is past the end of a col-
umn (N accesses by thread1,0 and thread1,1 in Fig. 5.11). These accesses are to
memory locations outside the allocated area for the array. In some systems they
will return random values from other data structures. In other systems these
accesses will be rejected, causing the program to abort. Either way, the outcome
of such accesses is undesirable.
From our discussion so far, it may seem that the problematic accesses arise
only in the last phase of execution of the threads. This would suggest that we can
deal with it by taking special actions during the last phase of the tiled kernel exe-
cution. Unfortunately, this is not true. Problematic accesses can arise in all
phases. Fig. 5.12 shows the memory access pattern of block1,1 during phase 0.
We see that thread1,0 and thread1,1 attempt to access nonexisting M elements M3,0
and M3,1, whereas thread0,1 and thread1,1 attempt to access N0,3 and N1,3, which
do not exist.
Note that these problematic accesses cannot be prevented by simply
excluding the threads that do not calculate valid P elements. For example,
thread1,0 in block1,1 does not calculate any valid P element. However, it needs to
load N1,2 during phase 0 for other threads in block1,1 to use. Furthermore, note
that some threads that calculate valid P elements will attempt to access M or N
elements that do not exist. For example, as we saw in Fig. 5.11, thread0,1 of block
0,0 calculates a valid P element P0,1. However, it attempts to access a nonexisting
M0,3 during phase 1. These two facts indicate that we will need to use different
boundary condition tests for loading M tiles, loading N tiles, and calculating/
FIGURE 5.12
Loading input elements during phase 0 of block1,1.
storing P elements. A rule of thumb to follow is that every memory access needs
to have a corresponding check that ensures that the indices used in the access are
within the bounds of the array being accessed.
Let’s start with the boundary test condition for loading input tiles. When
a thread is to load an input tile element, it should test whether the input ele-
ment it is attempting to load is a valid element. This is easily done by exam-
ining the y and x indices. For example, in line 19 in Fig. 5.9, the linearized
index is derived from a y index of Row and an x index of ph*TILE_WIDTH + tx. The boundary condition test would be that both indices are smaller than Width: Row < Width && (ph*TILE_WIDTH+tx) < Width. If the condition is true, the thread should go ahead and load the M element. The reader should verify that the condition test for loading the N element is (ph*TILE_WIDTH+ty) < Width && Col < Width.
If the condition is false, the thread should not load the element. The question
is what should be placed into the shared memory location. The answer is 0.0, a
value that will not cause any harm if it is used in the inner product calculation. If
any thread uses this 0.0 value in the calculation of its inner product, there will not
be any change in the inner product value.
Finally, a thread should store its final inner product value only if it is responsi-
ble for calculating a valid P element. The test for this condition is (Row < Width) && (Col < Width). The kernel code with the additional boundary condition checks
is shown in Fig. 5.13.
With the boundary condition checks, the tile matrix multiplication kernel is
just one step away from being a general matrix multiplication kernel. In general,
FIGURE 5.13
Tiled matrix multiplication kernel with boundary condition checks.
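A sketch of the phase loop with these three boundary checks, following the structure of the kernel sketched after Fig. 5.9 (the listing in Fig. 5.13 may differ in minor details), is:

    for (int ph = 0; ph < (Width + TILE_WIDTH - 1)/TILE_WIDTH; ++ph) {
        // Load the M tile element only if it exists; otherwise pad with 0.0
        if (Row < Width && (ph*TILE_WIDTH + tx) < Width)
            Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        else
            Mds[ty][tx] = 0.0f;
        // Load the N tile element only if it exists; otherwise pad with 0.0
        if ((ph*TILE_WIDTH + ty) < Width && Col < Width)
            Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
        else
            Nds[ty][tx] = 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    // Store the result only if this thread is responsible for a valid P element
    if (Row < Width && Col < Width)
        P[Row*Width + Col] = Pvalue;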
5.6 Impact of memory usage on occupancy
Note that the size of shared memory in each SM can also vary from device to
device. Each generation or model of devices can have a different amount of
shared memory in each SM. It is often desirable for a kernel to be able to use dif-
ferent amounts of shared memory according to the amount available in the hard-
ware. That is, we may want a host code to dynamically determine the size of the
shared memory and adjust the amount of shared memory that is used by a kernel.
This can be done by calling the cudaGetDeviceProperties function. Assume that
variable &devProp is passed to the function. In this case, the field devProp.
sharedMemPerBlock gives the amount of shared memory that is available to each thread block. The programmer can then determine the amount of shared memory that
should be used by each block.
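A sketch of this query (device 0 is assumed):

    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);                // query the properties of device 0
    size_t smemPerBlock = devProp.sharedMemPerBlock;     // shared memory available to each block, in bytes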
Unfortunately, the kernels in Figs. 5.9 and 5.13 do not support any dynamic
adjustment of shared memory usage by the host code. The declarations that are
used in Fig. 5.9 hardwire the size of its shared memory usage to a compile-time
constant:
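These are the declarations in question (as in the kernel sketch given for Fig. 5.9):

    #define TILE_WIDTH 16
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];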
That is, the size of Mds and Nds is set to be TILE_WIDTH² elements, whatever the value of TILE_WIDTH is set to be at compile time. Since the code defines TILE_WIDTH to be 16, both Mds and Nds will have 256 elements. If we want to change the size of Mds
and Nds, we need to change the value of TILE_WIDTH and recompile the code. The
kernel cannot easily adjust its shared memory usage at runtime without
recompilation.
We can enable such adjustment with a different style of declaration in
CUDA by adding a C extern keyword in front of the shared memory declara-
tion and omitting the size of the array in the declaration. Based on this style,
the declarations for Mds and Nds need to be merged into one dynamically
allocated array:
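For example (the array name is illustrative):

    extern __shared__ float Mds_Nds[];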
Since there is only one merged array, we will also need to manually define
where the Mds section of the array starts and where the Nds section starts. Note
that the merged array is one-dimensional. We will need to access it by using a lin-
earized index based on the vertical and horizontal indices.
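At kernel launch, the host can then pass the intended number of bytes of dynamically allocated shared memory as the third execution configuration parameter, along the lines of the following sketch (the helper function is hypothetical):

    size_t size = calculate_appropriate_SM_usage(devProp.sharedMemPerBlock);   // hypothetical helper
    matrixMulKernel<<<dimGrid, dimBlock, size>>>(M, N, P, Width, size/2, size/2);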
where size_t is a built-in type for declaring a variable to hold the size infor-
mation for dynamically allocated data structures. The size is expressed in number
of bytes. In our matrix multiplication example, for a 16 × 16 tile, we have a size of 2 × 16 × 16 × 4 = 2048 bytes to accommodate both Mds and Nds. We have omit-
ted the details of the calculation for setting the value of size at runtime and leave
it as an exercise for the reader.
In Fig. 5.14 we show how one can modify the kernel code in Figs. 5.9 and 5.13 to use dynamically sized shared memory for the Mds and Nds arrays. It may
also be useful to pass the sizes of each section of the array as arguments into
the kernel function. In this example we added two arguments: The first argu-
ment is the size of the Mds section, and the second argument is the size of
the Nds section, both in terms of bytes. Note that in the host code above, we
passed size/2 as the values of these arguments, which is 1024 bytes. With the
assignments in lines 06 and 07, the rest of the kernel code can use Mds and Nds
as the base of the array and use a linearized index to access the Mds and Nds ele-
ments. For example, instead of using Mds[ty][tx], one would use Mds[ty*TILE_WIDTH+tx].
FIGURE 5.14
Tiled matrix multiplication kernel with dynamically sized shared memory usage.
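A sketch of such a kernel is given below (the parameter names follow the discussion above, TILE_WIDTH is still used for indexing, and the boundary checks of Fig. 5.13 are omitted for brevity):

    __global__ void matrixMulKernel(float* M, float* N, float* P, int Width,
                                    unsigned Mds_sz, unsigned Nds_sz) {
        extern __shared__ float Mds_Nds[];              // one dynamically sized array for both tiles
        float* Mds = Mds_Nds;                           // the Mds section starts at the beginning
        float* Nds = Mds_Nds + Mds_sz/sizeof(float);    // the Nds section starts Mds_sz bytes later

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        int Row = by*TILE_WIDTH + ty;
        int Col = bx*TILE_WIDTH + tx;
        float Pvalue = 0;

        for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
            // Linearized indexing replaces the two-dimensional Mds[ty][tx] and Nds[ty][tx]
            Mds[ty*TILE_WIDTH + tx] = M[Row*Width + ph*TILE_WIDTH + tx];
            Nds[ty*TILE_WIDTH + tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty*TILE_WIDTH + k] * Nds[k*TILE_WIDTH + tx];
            __syncthreads();
        }
        P[Row*Width + Col] = Pvalue;
    }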
5.7 Summary
In summary, the execution speed of a program in modern processors can be
severely limited by the speed of the memory. To achieve good utilization of the
execution throughput of a CUDA device, one needs to strive for a high compute
to global memory access ratio in the kernel code. If the ratio is low, the kernel is
memory-bound. That is, its execution speed is limited by the rate at which its
operands are accessed from memory.
CUDA provides access to registers, shared memory, and constant memory.
These memories are much smaller than the global memory but can be accessed at
much higher speed. Using these memories effectively requires redesign of the
algorithm. We use matrix multiplication as an example to illustrate tiling, a popu-
lar strategy to enhance locality of data access and enable effective use of shared
memory. In parallel programming, tiling uses barrier synchronization to force
multiple threads to jointly focus on a subset of the input data at each phase of the
execution so that the subset data can be placed into these special memory types to
enable much higher access speed.
However, it is important for CUDA programmers to be aware of the limited
sizes of these special types of memory. Their capacities are implementation
dependent. Once their capacities have been exceeded, they limit the number of
threads that can be executing simultaneously in each SM and can negatively
affect the GPU’s computation throughput as well as its ability to tolerate latency.
The ability to reason about hardware limitations when developing an application
is a key aspect of parallel programming.
Although we introduced the tiled algorithm in the context of CUDA C programming, it is an effective strategy for achieving high performance in virtually all types
of parallel computing systems. The reason is that an application must exhibit locality
in data access to make effective use of high-speed memories in these systems. For
example, in a multicore CPU system, data locality allows an application to effectively
use on-chip data caches to reduce memory access latency and achieve high perfor-
mance. These on-chip data caches are also of limited size and require the computa-
tion to exhibit locality. Therefore the reader will also find the tiled algorithm useful
when developing a parallel application for other types of parallel computing systems
using other programming models.
Our goal for this chapter was to introduce the concept of locality, tiling, and differ-
ent CUDA memory types. We introduced a tiled matrix multiplication kernel using
shared memory. We further studied the need for boundary test conditions to allow for
arbitrary data dimensions in applying tiling techniques. We also briefly discussed the
use of dynamically sized shared memory allocation so that the kernel can adjust the
size of shared memory that is used by each block according to the hardware capability.
We did not discuss the use of registers in tiling. We will explain the use of registers in
tiled algorithms when we discuss parallel algorithm patterns in Part II of the book.
Exercises
1. Consider matrix addition. Can one use shared memory to reduce the
global memory bandwidth consumption? Hint: Analyze the elements that
are accessed by each thread and see whether there is any commonality
between threads.
2. Draw the equivalent of Fig. 5.7 for an 8 × 8 matrix multiplication with 2 × 2 tiling and 4 × 4 tiling. Verify that the reduction in global memory bandwidth
is indeed proportional to the dimension size of the tiles.
3. What type of incorrect execution behavior can happen if one forgot to use
one or both __syncthreads() in the kernel of Fig. 5.9?
4. Assuming that capacity is not an issue for registers or shared memory, give
one important reason why it would be valuable to use shared memory
instead of registers to hold values fetched from global memory? Explain
your answer.
5. For our tiled matrix-matrix multiplication kernel, if we use a 32 × 32 tile,
what is the reduction of memory bandwidth usage for input matrices M
and N?
6. Assume that a CUDA kernel is launched with 1000 thread blocks, each of
which has 512 threads. If a variable is declared as a local variable in the
kernel, how many versions of the variable will be created through the
lifetime of the execution of the kernel?
7. In the previous question, if a variable is declared as a shared memory
variable, how many versions of the variable will be created through the
lifetime of the execution of the kernel?
8. Consider performing a matrix multiplication of two input matrices with
dimensions N × N. How many times is each element in the input matrices
requested from global memory when:
a. There is no tiling?
b. Tiles of size T × T are used?
9. A kernel performs 36 floating-point operations and seven 32-bit global
memory accesses per thread. For each of the following device
properties, indicate whether this kernel is compute-bound or memory-
bound.
a. Peak FLOPS=200 GFLOPS, peak memory bandwidth=100 GB/second
b. Peak FLOPS=300 GFLOPS, peak memory bandwidth=250 GB/second
10. To manipulate tiles, a new CUDA programmer has written a device kernel
that will transpose each tile in a matrix. The tiles are of size
BLOCK_WIDTH by BLOCK_WIDTH, and each of the dimensions of
matrix A is known to be a multiple of BLOCK_WIDTH. The kernel
invocation and code are shown below. BLOCK_WIDTH is known at
compile time and could be set anywhere from 1 to 20.
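A sketch consistent with the description in this exercise (a per-block tile transpose that stages the tile in a shared array; the names are illustrative, and the original listing may differ in details):

    #define BLOCK_WIDTH 16     // illustrative; the exercise allows values from 1 to 20

    __global__ void BlockTranspose(float* A_elements, int A_width, int A_height) {
        __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

        int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
        baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;

        blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];   // read tile element into shared memory
        A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];   // write back the transposed element
    }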
a. Out of the possible range of values for BLOCK_SIZE, for what values
of BLOCK_SIZE will this kernel function execute correctly on the
device?
b. If the code does not execute correctly for all BLOCK_SIZE values, what
is the root cause of this incorrect execution behavior? Suggest a fix to the
code to make it work for all BLOCK_SIZE values.
11. Consider the following CUDA kernel and the corresponding host function
that calls it:
e. What is the amount of shared memory used per block (in bytes)?
f. What is the floating-point to global memory access ratio of the kernel (in OP/B)?
12. Consider a GPU with the following hardware limits: 2048 threads/SM, 32
blocks/SM, 64K (65,536) registers/SM, and 96 KB of shared memory/SM.
For each of the following kernel characteristics, specify whether the kernel
can achieve full occupancy. If not, specify the limiting factor.
a. The kernel uses 64 threads/block, 27 registers/thread, and 4 KB of shared
memory/SM.
b. The kernel uses 256 threads/block, 31 registers/thread, and 8 KB of
shared memory/SM.