Chapter 4
Compute Architecture and Scheduling

Chapter Outline
4.1 Architecture of a modern GPU 70
4.2 Block scheduling 70
4.3 Synchronization and transparent scalability 71
4.4 Warps and SIMD hardware 74
4.5 Control divergence 79
4.6 Warp scheduling and latency tolerance 83
4.7 Resource partitioning and occupancy 85
4.8 Querying device properties 87
4.9 Summary 90
Exercises 90
References 92
In Chapter 1, Introduction, we saw that CPUs are designed to minimize the latency of
instruction execution and that GPUs are designed to maximize the throughput of exe-
cuting instructions. In Chapters 2, Heterogeneous Data Parallel Computing and 3,
Multidimensional Grids and Data, we learned the core features of the CUDA program-
ming interface for creating and calling kernels to launch and execute threads. In the
next three chapters we will discuss the architecture of modern GPUs, both the compute
architecture and the memory architecture, and the performance optimization techniques
stemming from the understanding of this architecture. This chapter presents several
aspects of the GPU compute architecture that are essential for CUDA C programmers
to understand in order to reason about the performance behavior of their kernel code. We will
start by showing a high-level, simplified view of the compute architecture and explore
the concepts of flexible resource assignment, scheduling of blocks, and occupancy. We
will then advance into thread scheduling, latency tolerance, control divergence, and
synchronization. We will finish the chapter with a description of the API functions that
can be used to query the resources that are available in the GPU and the tools to help
estimate the occupancy of the GPU when executing a kernel. In the following two
chapters, we will present the core concepts and programming considerations of the
GPU memory architecture. In particular, Chapter 5, Memory Architecture and Data
Locality, focuses on the on-chip memory architecture, and Chapter 6, Performance
Considerations, briefly covers the off-chip memory architecture then elaborates on vari-
ous performance considerations of the GPU architecture as a whole. A CUDA C
programmer who masters these concepts is well equipped to write and to understand
high-performance parallel kernels.

FIGURE 4.1
Architecture of a CUDA-capable GPU.

FIGURE 4.2
Thread block assignment to streaming multiprocessors (SMs).
Fig. 4.2 illustrates the assignment of blocks to SMs. Multiple blocks are likely
to be simultaneously assigned to the same SM. For example, in Fig. 4.2, three
blocks are assigned to each SM. However, blocks need to reserve hardware
resources to execute, so only a limited number of blocks can be simultaneously
assigned to a given SM. The limit on the number of blocks depends on a variety
of factors that are discussed in Section 4.6.
With a limited number of SMs and a limited number of blocks that can be simul-
taneously assigned to each SM, there is a limit on the total number of blocks that can
be simultaneously executing in a CUDA device. Most grids contain many more
blocks than this number. To ensure that all blocks in a grid get executed, the runtime
system maintains a list of blocks that need to execute and assigns new blocks to SMs
when previously assigned blocks complete execution.
The assignment of threads to SMs on a block-by-block basis guarantees that
threads in the same block are scheduled simultaneously on the same SM. This
guarantee makes it possible for threads in the same block to interact with each
other in ways that threads across different blocks cannot.1 This includes barrier
synchronization, which is discussed in Section 4.3. It also includes accessing a
low-latency shared memory that resides on the SM, which is discussed in
Chapter 5, Memory Architecture and Data Locality.
1. Threads in different blocks can perform barrier synchronization through the Cooperative Groups
API. However, there are several important restrictions that must be obeyed to ensure that all threads
involved are indeed simultaneously executing on the SMs. Interested readers are referred to the
CUDA C Programming Guide for proper use of the Cooperative Groups API.
In CUDA, barrier synchronization among the threads of a block is performed by calling the
function __syncthreads(), whose name begins with two “_” characters. When a thread calls
__syncthreads(), it will be held at the pro-
gram location of the call until every thread in the same block reaches that loca-
tion. This ensures that all threads in a block have completed a phase of their
execution before any of them can move on to the next phase.
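To make this behavior concrete, the following minimal sketch (the kernel name, the 256-thread block size, and the averaging computation are assumptions made for this example, not code from the text) uses __syncthreads() to separate a phase in which every thread writes to shared memory from a phase in which threads read one another's results:

__global__ void twoPhaseKernel(float* out, const float* in, int n) {
    __shared__ float buffer[256];   // assumes the kernel is launched with 256-thread blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: each thread loads one element into on-chip shared memory
    if (i < n) buffer[threadIdx.x] = in[i];

    __syncthreads();   // no thread continues until every thread in the block has arrived

    // Phase 2: each thread may now safely read the element written by its left neighbor
    if (i < n) {
        int left = (threadIdx.x > 0) ? threadIdx.x - 1 : threadIdx.x;
        out[i] = 0.5f * (buffer[threadIdx.x] + buffer[left]);
    }
}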
Barrier synchronization is a simple and popular method for coordinating parallel
activities. In real life, we often use barrier synchronization to coordinate parallel activi-
ties of multiple people. For example, assume that four friends go to a shopping mall in
a car. They can all go to different stores to shop for their own clothes. This is a parallel
activity and is much more efficient than if they all remain as a group and sequentially
visit all the stores of interest. However, barrier synchronization is needed before they
leave the mall. They must wait until all four friends have returned to the car before
they can leave. The ones who finish earlier than the others must wait for those who fin-
ish later. Without the barrier synchronization, one or more individuals can be left in the
mall when the car leaves, which could seriously damage their friendship!
Fig. 4.3 illustrates the execution of barrier synchronization. There are N
threads in the block. Time goes from left to right. Some of the threads reach the
barrier synchronization statement early, and some reach it much later. The ones
that reach the barrier early will wait for those that arrive late. When the latest one
arrives at the barrier, all threads can continue their execution. With barrier syn-
chronization, “no one is left behind.”
FIGURE 4.3
An example execution of barrier synchronization. The arrows represent execution activities
over time. The vertical curve marks the time when each thread executes the __syncthreads
statement. The empty space to the right of the vertical curve depicts the time that each
thread waits for all threads to complete. The vertical line marks the time when the last thread
executes the __syncthreads statement, after which all threads are allowed to proceed to
execute the statements after the __syncthreads statement.
FIGURE 4.4
An incorrect use of __syncthreads().
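The figure itself is not reproduced here; the fragment below is an illustrative sketch (hypothetical kernel) of the kind of mistake such a caption refers to. Because __syncthreads() must be reached by all threads of a block, placing it inside the branches of a thread-dependent if-else means the threads of one block wait at two different barriers, which results in undefined behavior and, in practice, often a deadlock:

__global__ void incorrectBarrier(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] += 1.0f;
        __syncthreads();   // reached only by the even-numbered threads
    } else {
        data[i] += 2.0f;
        __syncthreads();   // a different barrier, reached only by the odd-numbered threads
    }
}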
FIGURE 4.5
Lack of synchronization constraints between blocks enables transparent scalability for
CUDA programs.
By not allowing threads in different blocks to synchronize with each other, the CUDA runtime
system can execute blocks in any order relative to each other, since none of them need to wait
for each other. This flexibility enables
scalable implementations, as shown in Fig. 4.5. Time in the figure progresses from
top to bottom. In a low-cost system with only a few execution resources, one can
execute a small number of blocks at the same time, portrayed as executing two
blocks at a time on the left-hand side of Fig. 4.5. In a higher-end implementation with
more execution resources, one can execute many blocks at the same time, portrayed
as executing four blocks at a time on the right-hand side of Fig. 4.5. A high-end
GPU today can execute hundreds of blocks simultaneously.
The ability to execute the same application code with a wide range of speeds allows
the production of a wide range of implementations according to the cost, power, and
performance requirements of different market segments. For example, a mobile proces-
sor may execute an application slowly but at extremely low power consumption, and a
desktop processor may execute the same application at a higher speed while consuming
more power. Both execute the same application program with no change to the code.
The ability to execute the same application code on different hardware with different
amounts of execution resources is referred to as transparent scalability, which reduces
the burden on application developers and improves the usability of applications.
The correctness of executing a kernel should not depend on any assumption that certain threads will
execute in synchrony with each other without the use of barrier synchronizations.
Thread scheduling in CUDA GPUs is a hardware implementation concept and
therefore must be discussed in the context of specific hardware implementations.
In most implementations to date, once a block has been assigned to an SM, it is
further divided into 32-thread units called warps. The size of warps is implemen-
tation specific and can vary in future generations of GPUs. Knowledge of warps
can be helpful in understanding and optimizing the performance of CUDA appli-
cations on particular generations of CUDA devices.
A warp is the unit of thread scheduling in SMs. Fig. 4.6 shows the division of
blocks into warps in an implementation. In this example there are three blocks—
Block 1, Block 2, and Block 3—all assigned to an SM. Each of the three blocks
is further divided into warps for scheduling purposes. Each warp consists of 32
threads of consecutive threadIdx values: threads 0 through 31 form the first
warp, threads 32 through 63 form the second warp, and so on. We can calculate
the number of warps that reside in an SM for a given block size and a given num-
ber of blocks assigned to each SM. In this example, if each block has 256 threads,
we can determine that each block has 256/32 or 8 warps. With three blocks in the
SM, we have 8 × 3 = 24 warps in the SM.
FIGURE 4.6
Blocks are partitioned into warps for thread scheduling.
Blocks are partitioned into warps on the basis of thread indices. If a block is
organized into a one-dimensional array, that is, only threadIdx.x is used, the par-
tition is straightforward. The threadIdx.x values within a warp are consecutive
and increasing. For a warp size of 32, warp 0 starts with thread 0 and ends with
thread 31, warp 1 starts with thread 32 and ends with thread 63, and so on. In
general, warp n starts with thread 32 × n and ends with thread 32 × (n + 1) − 1.
For a block whose size is not a multiple of 32, the last warp will be padded with
inactive threads to fill up the 32 thread positions. For example, if a block has 48
threads, it will be partitioned into two warps, and the second warp will be padded
with 16 inactive threads.
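In code, the number of warps needed for a block can be computed with a ceiling division; the following small fragment is illustrative (the 48-thread block size is taken from the example above):

int threadsPerBlock = 48;                                   // block size from the example above
int warpsPerBlock = (threadsPerBlock + 31) / 32;            // ceiling division: 2 warps
int inactiveThreads = warpsPerBlock * 32 - threadsPerBlock; // 16 padded, inactive threads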
For blocks that consist of multiple dimensions of threads, the dimensions will be
projected into a linearized row-major layout before partitioning into warps. The linear
layout is determined by placing the rows with larger y and z coordinates after those
with lower ones. That is, if a block consists of two dimensions of threads, one will
form the linear layout by placing all threads whose threadIdx.y is 1 after those
whose threadIdx.y is 0. Threads whose threadIdx.y is 2 will be placed after those
whose threadIdx.y is 1, and so on. Threads with the same threadIdx.y value are
placed in consecutive positions in increasing threadIdx.x order.
Fig. 4.7 shows an example of placing threads of a two-dimensional block into
a linear layout. The upper part shows the two-dimensional view of the block. The
reader should recognize the similarity to the row-major layout of two-dimensional
arrays. Each thread is shown as Ty,x, x being threadIdx.x and y being
threadIdx.y. The lower part of Fig. 4.7 shows the linearized view of the block.
The first four threads are the threads whose threadIdx.y value is 0; they are
ordered with increasing threadIdx.x values. The next four threads are the threads
whose threadIdx.y value is 1. They are also placed with increasing threadIdx.x
values. In this example, all 16 threads form half a warp. The warp will be padded
with another 16 threads to complete a 32-thread warp. Imagine a two-dimensional
FIGURE 4.7
Placing 2D threads into a linear layout.
block with 8 × 8 threads. The 64 threads will form two warps. The first warp
starts from T0,0 and ends with T3,7. The second warp starts with T4,0 and ends
with T7,7. It would be useful for the reader to draw out the picture as an exercise.
For a three-dimensional block, we first place all threads whose threadIdx.z
value is 0 into the linear order. These threads are treated as a two-dimensional
block, as shown in Fig. 4.7. All threads whose threadIdx.z value is 1 will then
be placed into the linear order, and so on. For example, for a three-dimensional
2 × 8 × 4 block (four in the x dimension, eight in the y dimension, and two in the
z dimension), the 64 threads will be partitioned into two warps, with T0,0,0
through T0,7,3 in the first warp and T1,0,0 through T1,7,3 in the second warp.
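The row-major linearization described above can also be written directly in device code; the following fragment (illustrative, not from the text) computes a thread's linearized index within its block and, from it, its warp number and position within the warp:

// Linearized (row-major) thread index within the block: z varies slowest, x fastest
int linearTid = threadIdx.z * blockDim.y * blockDim.x
              + threadIdx.y * blockDim.x
              + threadIdx.x;
int warpInBlock = linearTid / 32;   // which warp of the block this thread belongs to
int lane        = linearTid % 32;   // this thread's position within its warp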
An SM is designed to execute all threads in a warp following the single-instruction,
multiple-data (SIMD) model. That is, at any instant in time, one instruction is fetched
and executed for all threads in the warp (see the “Warps and SIMD Hardware” side-
bar). Fig. 4.8 shows how the cores in an SM are grouped into processing blocks in
which every 8 cores form a processing block and share an instruction fetch/dispatch
unit. As a real example, the Ampere A100 SM, which has 64 cores, is organized into
four processing blocks with 16 cores each. Threads in the same warp are assigned to
the same processing block, which fetches the instruction for the warp and executes it
for all threads in the warp at the same time. These threads apply the same instruction
to different portions of the data. Because the SIMD hardware effectively restricts all
threads in a warp to execute the same instruction at any point in time, the execution
behavior of a warp is often referred to as single-instruction, multiple-thread (SIMT).
The advantage of SIMD is that the cost of the control hardware, such as the
instruction fetch/dispatch unit, is shared across many execution units. This design
choice allows for a smaller percentage of the hardware to be dedicated to control and a
larger percentage to be dedicated to arithmetic throughput.
FIGURE 4.8
Streaming multiprocessors are organized into processing blocks for SIMD execution.
Since all processing units are controlled by the same instruction in the
Instruction Register (IR) of the Control Unit, their execution differences
are due to the different data operand values in the register files. This is
called Single-Instruction-Multiple-Data (SIMD) in processor design. For
example, although all processing units (cores) are controlled by an
instruction, such as add r1, r2, r3, the contents of r2 and r3 are different
in different processing units.
Control units in modern processors are quite complex, including
sophisticated logic for fetching instructions and access ports to the instruc-
tion cache. Having multiple processing units to share a control unit can
result in significant reduction in hardware manufacturing cost and power
consumption.
For example, for an if-else construct, if some threads in a warp follow the if-path
while others follow the else path, the hardware will take two passes. One pass
executes the threads that follow the if-path, and the other executes the threads
that follow the else-path. During each pass, the threads that follow the other path
are not allowed to take effect.
When threads in the same warp follow different execution paths, we say that
these threads exhibit control divergence, that is, they diverge in their execution.
The multipass approach to divergent warp execution extends the SIMD hard-
ware’s ability to implement the full semantics of CUDA threads. While the hard-
ware executes the same instruction for all threads in a warp, it selectively lets
these threads take effect in only the pass that corresponds to the path that they
took, allowing every thread to appear to take its own control flow path. This pre-
serves the independence of threads while taking advantage of the reduced cost of
SIMD hardware. The cost of divergence, however, is the extra passes the hard-
ware needs to take to allow different threads in a warp to make their own deci-
sions as well as the execution resources that are consumed by the inactive threads
in each pass.
Fig. 4.9 shows an example of how a warp would execute a divergent if-else
statement. In this example, when the warp consisting of threads 0–31 arrives
at the if-else statement, threads 0–23 take the then-path, while threads 24–31
take the else-path. In this case, the warp will do a pass through the code in which
threads 0–23 execute A while threads 24–31 are inactive. The warp will also do
another pass through the code in which threads 24–31 execute B while threads
0–23 are inactive. The threads in the warp then reconverge and execute C. In the
Pascal architecture and prior architectures, these passes are executed sequentially,
FIGURE 4.9
Example of a warp diverging at an if-else statement.
meaning that one pass is executed to completion followed by the other pass.
From the Volta architecture onwards, the passes may be executed concurrently,
meaning that the execution of one pass may be interleaved with the execution of
another pass. This feature is referred to as independent thread scheduling.
Interested readers are referred to the whitepaper on the Volta V100 architecture
(NVIDIA, 2017) for details.
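The pattern in Fig. 4.9 can be sketched as follows (the kernel and the statements standing in for A, B, and C are made up; only the split between threads 0–23 and threads 24–31 follows the example in the text, assuming a single 32-thread warp per block):

__global__ void divergeExample(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x < 24) {
        x[i] = x[i] * 2.0f;   // A: executed in one pass, threads 0-23 active
    } else {
        x[i] = x[i] + 1.0f;   // B: executed in another pass, threads 24-31 active
    }
    x[i] = x[i] - 3.0f;       // C: the warp reconverges; all 32 threads active
}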
Divergence also can arise in other control flow constructs. Fig. 4.10 shows an
example of how a warp would execute a divergent for-loop. In this example, each
thread executes a different number of loop iterations, which vary between four and
eight. For the first four iterations, all threads are active and execute A. For the
remaining iterations, some threads execute A, while others are inactive because they
have completed their iterations.
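A corresponding sketch (hypothetical kernel; the per-thread iteration counts are assumed to come from the data) might look like this:

__global__ void divergentLoop(float* x, const int* numIters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread's trip count is data dependent (e.g., between 4 and 8 iterations),
    // so threads of the same warp may finish the loop at different times.
    for (int k = 0; k < numIters[i]; k++) {
        x[i] += 1.0f;   // A: only threads that have not finished their iterations are active
    }
}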
One can determine whether a control construct can result in thread divergence
by inspecting its decision condition. If the decision condition is based on
threadIdx values, the control statement can potentially cause thread divergence.
For example, the statement if(threadIdx.x > 2) {...} causes the threads in the
first warp of a block to follow two divergent control flow paths. Threads 0, 1, and
2 follow a different path than that of threads 3, 4, 5, and so on. Similarly, a loop
can cause thread divergence if its loop condition is based on thread index values.
A prevalent reason for using a control construct with thread control divergence is
handling boundary conditions when mapping threads to data. This is usually because
the total number of threads needs to be a multiple of the thread block size, whereas
the size of the data can be an arbitrary number. Starting with our vector addition ker-
nel in Chapter 2, Heterogeneous Data Parallel Computing, we had an if(i < n) state-
ment in addVecKernel. This is because not all vector lengths can be expressed as
multiples of the block size. For example, let’s assume that the vector length is 1003
FIGURE 4.10
Example of a warp diverging at a for-loop.
and we picked 64 as the block size. One would need to launch 16 thread blocks to
process all the 1003 vector elements. However, the 16 thread blocks would have
1024 threads. We need to disable the last 21 threads in thread block 15 from doing
work that is not expected or not allowed by the original program. Keep in mind that
these 16 blocks are partitioned into 32 warps. Only the last warp (i.e., the second
warp in the last block) will have control divergence.
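The boundary-check pattern under discussion looks like the following sketch (the kernel body follows the vector addition example of Chapter 2; the exact code there may differ in details):

__global__ void addVecKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // threads mapped beyond the end of the vectors do nothing
        C[i] = A[i] + B[i];
    }
}
// For n = 1003 with 64-thread blocks: 16 blocks and 1024 threads are launched, the last
// 21 threads fail the if condition, and only the final warp of the final block diverges.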
Note that the performance impact of control divergence decreases as the size
of the vectors being processed increases. For a vector length of 100, one of the
four warps will have control divergence, which can have significant impact on
performance. For a vector size of 1000, only one of the 32 warps will have con-
trol divergence. That is, control divergence will affect only about 3% of the exe-
cution time. Even if it doubles the execution time of the warp, the net impact on
the total execution time will be about 3%. Obviously, if the vector length is
10,000 or more, only one of the 313 warps will have control divergence. The
impact of control divergence will be much less than 1%!
For two-dimensional data, such as the color-to-grayscale conversion example
in Chapter 3, Multidimensional Grids and Data, if statements are also used to han-
dle the boundary conditions for threads that operate at the edge of the data. In
Fig. 3.2, to process the 62 × 76 image, we used 20 = 4 × 5 two-dimensional
blocks that consist of 16 × 16 threads each. Each block will be partitioned into 8
warps; each one consists of two rows of a block. A total of 160 warps (8 warps per
block) are involved. To analyze the impact of control divergence, refer to
Fig. 3.5. None of the warps in the 12 blocks in region 1 will have control diver-
gence. There are 12 × 8 = 96 warps in region 1. For region 2, all the 24 warps
will have control divergence. For region 3, all the bottom warps are mapped to
data that are completely outside the image. As a result, none of them will pass the
if condition. The reader should verify that these warps would have had control
divergence if the picture had an odd number of pixels in the vertical dimension.
In region 4, the first 7 warps will have control divergence, but the last warp will
not. All in all, 31 out of the 160 warps will have control divergence.
Once again, the performance impact of control divergence decreases as the
number of pixels in the horizontal dimension increases. For example, if we pro-
cess a 200 × 150 picture with 16 × 16 blocks, there will be a total of 130 =
13 × 10 thread blocks or 1040 warps. The number of warps in regions 1 through
4 will be 864 (12 × 9 × 8), 72 (9 × 8), 96 (12 × 8), and 8 (1 × 8). Only 80 of
these warps will have control divergence. Thus the performance impact of control
divergence will be less than 8%. Obviously, if we process a realistic picture with
more than 1000 pixels in the horizontal dimension, the performance impact of
control divergence will be less than 2%.
An important implication of control divergence is that one cannot assume that
all threads in a warp have the same execution timing. Therefore if all threads in a
warp must complete a phase of their execution before any of them can move on,
one must use a barrier synchronization mechanism such as __syncwarp() to
ensure correctness.
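A minimal sketch of such intra-warp synchronization (hypothetical kernel; the single-warp block size and the shared-memory reduction are assumptions made for this example) is shown below:

__global__ void warpSumSketch(float* out, const float* in) {
    __shared__ float vals[32];                 // assumes each block is a single 32-thread warp
    int lane = threadIdx.x;                    // 0..31
    vals[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp();                              // all lanes have written before any lane reads

    // Tree reduction within the warp; a barrier after each step keeps the lanes in step,
    // since threads of a warp cannot be assumed to execute in lockstep on Volta and later.
    for (int stride = 16; stride > 0; stride /= 2) {
        if (lane < stride) vals[lane] += vals[lane + stride];
        __syncwarp();
    }
    if (lane == 0) out[blockIdx.x] = vals[0];
}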
Latency Tolerance
Latency tolerance is needed in many everyday situations. For example, in
post offices, each person who is trying to ship a package should ideally
have filled out all the forms and labels before going to the service counter.
However, as we all have experienced, some people wait for the service
desk clerk to tell them which form to fill out and how to fill out the form.
When there is a long line in front of the service desk, it is important to
maximize the productivity of the service clerks. Letting a person fill out the
form in front of the clerk while everyone waits is not a good approach.
The clerk should be helping the next customers who are waiting in line
while the person fills out the form. These other customers are “ready to
go” and should not be blocked by the customer who needs more time to
fill out a form.
This is why a good clerk would politely ask the first customer to step
aside to fill out the form while the clerk serves other customers. In most
cases, instead of going to the end of the line, the first customer will be
served as soon as he or she finishes the form and the clerk finishes serving
the current customer.
We can think of these post office customers as warps and the clerk as a
hardware execution unit. The customer who needs to fill out the form cor-
responds to a warp whose continued execution is dependent on a long-
latency operation.
Note that warp scheduling is also used for tolerating other types of operation
latencies, such as pipelined floating-point arithmetic and branch instructions.
With enough warps around, the hardware will likely find a warp to execute at any
point in time, thus making full use of the execution hardware while the instruc-
tions of some warps wait for the results of these long-latency operations. The
selection of warps that are ready for execution does not introduce any idle or
wasted time into the execution timeline, which is referred to as zero-overhead
thread scheduling (see the “Threads, Context-switching, and Zero-overhead
Scheduling” sidebar). With warp scheduling, the long waiting time of warp
instructions is “hidden” by executing instructions from other warps. This ability
to tolerate long operation latencies is the main reason why GPUs do not dedicate
nearly as much chip area to cache memories and branch prediction mechanisms
as CPUs do. As a result, GPUs can dedicate more chip area to floating-point exe-
cution and memory access channel resources.
These hotel amenities are part of the properties, or resources and capa-
bilities, of the hotels. Veteran travelers check the properties at hotel web-
sites, choose the hotels that best match their needs, and pack more
efficiently and effectively.
int devCount;
cudaGetDeviceCount(&devCount);
While it may not be obvious, a modern PC system often has two or more CUDA
devices. This is because many PC systems come with one or more “integrated”
GPUs. These GPUs are the default graphics units and provide rudimentary capabilities
and hardware resources to perform minimal graphics functionalities for modern
window-based user interfaces. Most CUDA applications will not perform very well on
these integrated devices. This would be a reason for the host code to iterate through all
the available devices, query their resources and capabilities, and choose the ones that
have enough resources to execute the application with satisfactory performance.
The CUDA runtime numbers all the available devices in the system from 0 to
devCount-1. It provides an API function cudaGetDeviceProperties that returns
the properties of the device whose number is given as an argument. For example,
we can use the following statements in the host code to iterate through the avail-
able devices and query their properties:
cudaDeviceProp devProp;
for(unsigned int i = 0; i < devCount; i++) {
cudaGetDeviceProperties(&devProp, i);
// Decide if the device has sufficient resources/capabilities
}
The built-in type cudaDeviceProp is a C struct type with fields that represent
the properties of a CUDA device. The reader is referred to the CUDA C
Programming Guide for all the fields of the type. We will discuss a few of these
fields that are particularly relevant to the assignment of execution resources to
threads. We assume that the properties are returned in the devProp variable whose
fields are set by the cudaGetDeviceProperties function. If the reader chooses to
name the variable differently, the appropriate variable name will obviously need
to be substituted in the following discussion.
As the name suggests, the field devProp.maxThreadsPerBlock gives the maximum
number of threads allowed in a block in the queried device. Some devices allow up to
1024 threads in each block, and other devices may allow fewer. It is possible that
future devices may even allow more than 1024 threads per block. Therefore it is a
good idea to query the available devices and determine which ones will allow a suffi-
cient number of threads in each block as far as the application is concerned.
The number of SMs in the device is given in devProp.multiProcessorCount. If the
application requires many SMs to achieve satisfactory performance, it should definitely
check this property of the prospective device. Furthermore, the clock frequency of the
device is in devProp.clockRate. The combination of the clock rate and the number of SMs
gives a good indication of the maximum hardware execution throughput of the device.
The host code can find the maximum number of threads allowed along each
dimension of a block in fields devProp.maxThreadsDim[0] (for the x dimension),
devProp.maxThreadsDim[1] (for the y dimension), and devProp.maxThreadsDim[2]
(for the z dimension). An example of use of this information is for an automated
tuning system to set the range of block dimensions when evaluating the best per-
forming block dimensions for the underlying hardware. Similarly, it can find the
maximum number of blocks allowed along each dimension of a grid in devProp.
maxGridSize[0] (for the x dimension), devProp.maxGridSize[1] (for the y dimen-
sion), and devProp.maxGridSize[2] (for the z dimension). A typical use of this
information is to determine whether a grid can have enough threads to handle the
entire dataset or some kind of iterative approach is needed.
The field devProp.regsPerBlock gives the number of registers that are avail-
able in each SM. This field can be useful in determining whether the kernel can
achieve maximum occupancy on a particular device or will be limited by its reg-
ister usage. Note that the name of the field is a little misleading. For most com-
pute capability levels, the maximum number of registers that a block can use is
indeed the same as the total number of registers that are available in the SM.
However, for some compute capability levels, the maximum number of registers
that a block can use is less than the total that are available on the SM.
We have also discussed that the size of warps depends on the hardware. The
size of warps can be obtained from the devProp.warpSize field.
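Putting the fields discussed in this section together, a host-side query might look like the following sketch (the printed labels are made up; the field names are those described above):

int devCount;
cudaGetDeviceCount(&devCount);
for (int i = 0; i < devCount; i++) {
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, i);
    printf("Device %d: %s\n", i, devProp.name);
    printf("  Max threads per block:  %d\n", devProp.maxThreadsPerBlock);
    printf("  Number of SMs:          %d\n", devProp.multiProcessorCount);
    printf("  Clock rate (kHz):       %d\n", devProp.clockRate);
    printf("  Max block dimensions:   %d x %d x %d\n",
           devProp.maxThreadsDim[0], devProp.maxThreadsDim[1], devProp.maxThreadsDim[2]);
    printf("  Max grid dimensions:    %d x %d x %d\n",
           devProp.maxGridSize[0], devProp.maxGridSize[1], devProp.maxGridSize[2]);
    printf("  Registers (per SM):     %d\n", devProp.regsPerBlock);
    printf("  Warp size:              %d\n", devProp.warpSize);
}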
There are many more fields in the cudaDeviceProp type. We will discuss
them throughout the book as we introduce the concepts and features that they are
designed to reflect.
4.9 Summary
A GPU is organized into SMs, which consist of multiple processing blocks of
cores that share control logic and memory resources. When a grid is launched, its
blocks are assigned to SMs in an arbitrary order, resulting in transparent scalabil-
ity of CUDA applications. The transparent scalability comes with a limitation:
Threads in different blocks cannot synchronize with each other.
Threads are assigned to SMs for execution on a block-by-block basis. Once a block
has been assigned to an SM, it is further partitioned into warps. Threads in a warp are
executed following the SIMD model. If threads in the same warp diverge by taking dif-
ferent execution paths, the processing block executes these paths in passes in which
each thread is active only in the pass corresponding to the path that it takes.
An SM may have many more threads assigned to it than it can execute simulta-
neously. At any time, the SM executes instructions of only a small subset of its resident
warps. This allows the other warps to wait for long-latency operations without slowing
down the overall execution throughput of the massive number of processing units. The
ratio of the number of threads assigned to the SM to the maximum number of threads
it can support is referred to as occupancy. The higher the occupancy of an SM, the bet-
ter it can hide long-latency operations.
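As an illustration, the CUDA runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor can be used to estimate this ratio at runtime; in the sketch below, the kernel name myKernel and the 256-thread block size are placeholders:

int blockSize = 256;    // placeholder block size
int blocksPerSM = 0;
// myKernel is any __global__ kernel defined elsewhere (placeholder name)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp, 0);
float occupancy = (float)(blocksPerSM * blockSize) / devProp.maxThreadsPerMultiProcessor;
printf("Estimated occupancy: %.0f%%\n", occupancy * 100);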
Each CUDA device imposes a potentially different limitation on the amount of
resources available in each SM. For example, each CUDA device has a limit on the
number of blocks, the number of threads, the number of registers, and the amount of
other resources that each of its SMs can accommodate. For each kernel, one or more of
these resource limitations can become the limiting factor for occupancy. CUDA C pro-
vides programmers with the ability to query the resources available in a GPU at runtime.
Exercises
1. Consider the following CUDA kernel and the corresponding host function that
calls it:
8. Consider a GPU with the following hardware limits: 2048 threads per SM, 32
blocks per SM, and 64K (65,536) registers per SM. For each of the following
kernel characteristics, specify whether the kernel can achieve full occupancy.
If not, specify the limiting factor.
a. The kernel uses 128 threads per block and 30 registers per thread.
b. The kernel uses 32 threads per block and 29 registers per thread.
c. The kernel uses 256 threads per block and 34 registers per thread.
9. A student mentions that they were able to multiply two 1024 × 1024 matrices
using a matrix multiplication kernel with 32 × 32 thread blocks. The student is
using a CUDA device that allows up to 512 threads per block and up to 8 blocks
per SM. The student further mentions that each thread in a thread block calculates
one element of the result matrix. What would be your reaction and why?
References
CUDA Occupancy Calculator, 2021. https://docs.nvidia.com/cuda/cuda-occupancy-calculator/
index.html.
NVIDIA (2017). NVIDIA Tesla V100 GPU Architecture. Version WP-08608-001_v1.1.
Ryoo, S., Rodrigues, C., Stone, S., Baghsorkhi, S., Ueng, S., Stratton, J., et al., Program
optimization space pruning for a multithreaded GPU. In: Proceedings of the Sixth
ACM/IEEE International Symposium on Code Generation and Optimization, April
6–9, 2008.