
Demystifying GPU Microarchitecture through Microbenchmarking
Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos
Department of Electrical and Computer Engineering, University of Toronto
{henry, myrto, alvandim, moshovos}@eecg.utoronto.ca

Abstract—Graphics processors (GPUs) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. This analysis exposes undocumented features that impact program performance and correctness. These measurements can be useful for performance optimization, analysis, and modeling on this architecture, and offer additional insight into the decisions made in developing this GPU.

Fig. 1: Streaming Multiprocessor with 8 Scalar Processors Each
Fig. 2: Thread Processing Cluster with 3 SMs Each

I. INTRODUCTION

The graphics processor (GPU) as a non-graphics compute processor has a different architecture from traditional sequential processors. For developers and for GPU architecture and compiler researchers, it is essential to understand the architecture of a modern GPU design in detail.

The Nvidia G80 and GT200 GPUs are capable of non-graphics computation using the C-like CUDA programming interface. The CUDA Programming Guide provides hints of the GPU's performance characteristics in the form of rules [1]. However, these rules are sometimes vague, and there is little information about the underlying hardware organization that motivates them.

This work presents a suite of microbenchmarks targeting specific parts of the architecture. The presented measurements focus on two major parts that impact GPU performance: the arithmetic processing cores, and the memory hierarchies that feed instructions and data to these processing cores. A precise understanding of the processing cores and of the caching hierarchies is needed for avoiding deadlocks, for optimizing application performance, and for cycle-accurate GPU performance modeling. Specifically, in this work:

• We verify performance characteristics listed in the CUDA Programming Guide.
• We explore the detailed functionality of branch divergence and of barrier synchronization. We find some non-intuitive branching code sequences that lead to deadlock, which an understanding of the internal architecture can avoid.
• We measure the structure and performance of the memory caching hierarchies, including the Translation Lookaside Buffer (TLB) hierarchy, constant memory, texture memory, and instruction memory caches.
• We discuss our measurement techniques, which we believe will be useful in the analysis and modeling of other GPUs and GPU-like systems, and for improving the fidelity of GPU performance modeling and simulation [2].

The remainder of this paper is organized as follows. Section II reviews the CUDA computation model. Section III describes the measurement methodology, and Section IV presents the measurements. Section V reviews related work and Section VI summarizes our findings.

Fig. 3: GPU with TPCs and Memory Banks

II. BACKGROUND: GPU ARCHITECTURE AND PROGRAMMING MODEL

A. GPU Architecture

CUDA models the GPU architecture as a multi-core system. It abstracts the thread-level parallelism of the GPU into a hierarchy of threads (grids of blocks of warps of threads) [1]. These threads are mapped onto a hierarchy of hardware resources. Blocks of threads are executed within Streaming Multiprocessors (SM, Figure 1). While the programming model uses collections of scalar threads, the SM more closely resembles an eight-wide vector processor operating on 32-wide vectors.
TABLE I: GT200 Parameters according to Nvidia [1], [3]

SM Resources
    SPs (Scalar Processor)            8 per SM
    SFUs (Special Function Unit)      2 per SM
    DPUs (Double Precision Unit)      1 per SM
    Registers                         16,384 per SM
    Shared Memory                     16 KB per SM
Caches
    Constant Cache                    8 KB per SM
    Texture Cache                     6-8 KB per SM
GPU Organization
    TPCs (Thread Processing Cluster)  10 total
    SMs (Streaming Multiprocessor)    3 per TPC
    Shader Clock                      1.35 GHz
    Memory                            8 × 128 MB, 64-bit
    Memory Latency                    400-600 clocks
Programming Model
    Warps                             32 threads
    Blocks                            512 threads max
    Registers                         128 per thread max
    Constant Memory                   64 KB total
    Kernel Size                       2M PTX instructions max

The basic unit of execution flow in the SM is the warp. In the GT200, a warp is a collection of 32 threads and is executed in groups of eight on the eight Scalar Processors (SP). Nvidia refers to this arrangement as Single-Instruction Multiple-Thread (SIMT), where every thread of a warp executes the same instruction in lockstep, but each thread is allowed to branch separately. The SM contains arithmetic units and other resources that are private to blocks and threads, such as per-block shared memory and the register file. Groups of SMs belong to Thread Processing Clusters (TPC, Figure 2). TPCs also contain resources (e.g., caches, texture fetch units) that are shared among the SMs, most of which are not visible to the programmer. From CUDA's perspective, the GPU comprises the collection of TPCs, the interconnection network, and the memory system (DRAM memory controllers), as shown in Figure 3. Table I shows the parameters Nvidia discloses for the GT200 [1], [3].

B. CUDA Software Programming Interface

CUDA presents the GPU architecture using a C-like programming language with extensions to abstract the threading model. In the CUDA model, host CPU code can launch GPU kernels by calling device functions that execute on the GPU. Since the GPU uses a different instruction set from the host CPU, the CUDA compilation flow compiles CPU and GPU code using different compilers targeting different instruction sets. The GPU code is first compiled into PTX "assembly", then "assembled" into native code. The compiled CPU and GPU code is then merged into a single "fat" binary [4].

Although PTX is described as being the assembly-level representation of GPU code, it is only an intermediate representation and was not useful for detailed analysis or microbenchmarking. Since the native instruction set is different and compiler optimization is performed on the PTX code, PTX code is not a good representation of the actual machine instructions executed. In most cases, we have found it most productive to write in CUDA C, then verify the generated machine code sequences at the native code level using decuda [5]. The use of decuda was mainly for convenience, as the generated instruction sequences can also be verified in the native cubin binary. Decuda is a disassembler for Nvidia's machine-level instructions, derived from analysis of Nvidia's compiler output, as the native instruction set is not publicly documented.

III. MEASUREMENT METHODOLOGY

A. Microbenchmark Methodology

To explore the GT200 architecture, we create microbenchmarks to expose each characteristic we wish to measure. Our conclusions were drawn from analyzing the execution times of the microbenchmarks. Decuda was used to report code size and location when measuring instruction cache parameters, which agreed with our analysis of the compiled code. We also used decuda to inspect native instruction sequences generated by the CUDA compiler and to analyze the code generated to handle branch divergence and reconvergence.

The general structure of a microbenchmark consists of GPU kernel code containing timing code around a code section (typically an unrolled loop running multiple times) that exercises the hardware being measured. A benchmark kernel runs through the entire code twice, disregarding the first iteration to avoid the effects of cold instruction cache misses. In all cases, the kernel code size is small enough to fit into the L1 instruction cache (4 KB, see Section IV-K). Timing measurements are done by reading the clock register (using clock()). The clock values are first stored in registers, then written to global memory at the end of the kernel, to avoid slow global memory accesses interfering with the timing measurements.

When investigating caching hierarchies, we observed that memory requests which traverse the interconnect (e.g., accessing L3 caches and off-chip memory) had latencies that varied depending on which TPC was executing the code. We average our measurements across all 10 TPC placements and report the variation where relevant.

B. Deducing Cache Characteristics from Latency Plots

Most of our cache and TLB parameter measurements use stride accesses to arrays of varying size, with the average access latency plotted. The basic techniques described in this section are also used to measure CPU cache parameters. We develop variations for instruction caches and shared cache hierarchies.

Figure 4 shows an example of extracting cache size, way size, and line size from an average latency plot. This example assumes an LRU replacement policy, a set-associative cache, and no prefetching. The cache parameters can be deduced from the example plot of Figure 4(a) as follows: As long as the array fits in the cache, the latency remains constant (sizes 384 and below). Once the array size starts exceeding the cache size, latency steps, equal in number to the number of cache sets (four), occur as the sets overflow one by one (sizes 385-512,
cache way size). The increase in array size needed to trigger each step of average latency increase equals the line size (32 bytes). Latency plateaus when all cache sets overflow (size ≥ 16 cache lines). The cache associativity (three) can be found by dividing the cache size (384 bytes) by the way size (128 bytes). This calculation needs neither the line size nor the number of cache sets. There are other possible ways to compute the four cache parameters, as knowing any three will give the fourth, using cache size = cache sets × line size × associativity.

Fig. 4: Three-Way 12-Line Set-Associative Cache and its Latency Plot. (a) Latency plot for a 384-byte, 3-way, 4-set, 32-byte-line cache. (b) Array 480 bytes (15 lines) in size.

Listings 1 and 2 show the structure of our memory microbenchmarks. For each array size and stride, the microbenchmark performs a sequence of dependent reads, with the precomputed stride access pattern stored in the array, eliminating address computation overhead in the timed inner loop. The stride should be smaller than the cache line size so that all steps in the latency plot are observable, but large enough that transitions between latency steps are not too small to be clearly distinguished.

    for (i = 0; i < array_size; i++) {
        int t = i + stride;
        if (t >= array_size) t %= stride;
        host_array[i] = (int)device_array + 4 * t;
    }
    cudaMemcpy(device_array, host_array, ...);

Listing 1: Array Initialization (CPU Code)

    int *j = &device_array[0];
    // start timing
    repeat256(j = *(int **)j;)   // Macro copies the statement 256 times
    // end timing

Listing 2: Sequence of Dependent Reads (GPU Kernel Code)

IV. TESTS AND RESULTS

This section presents our detailed tests and results. We begin by measuring the latency of the clock() function. We then investigate the SM's various arithmetic pipelines, branch divergence, and barrier synchronization. We also explore the memory caching hierarchies both within and surrounding the SMs, as well as memory translation and TLBs.

A. Clock Overhead and Characteristics

All timing measurements use the clock() function, which returns the value of a counter that is incremented every clock cycle [1]. The clock() function translates to a move from the clock register followed by a dependent left-shift by one, suggesting that the counter is incremented at half the shader clock frequency. A clock() followed by a non-dependent operation takes 28 cycles.

The experiment in Figure 5 demonstrates that clock registers are per-TPC. Points in the figure show the timestamp values returned by clock() when called at the beginning and end of a block's execution. We see that blocks running on the same TPC share timestamp values, and thus share clock registers. If the clock registers were globally synchronized, the start times of all blocks in a kernel would be approximately the same. Conversely, if the clock registers were per-SM, the start times of blocks within a TPC would not share the same timestamp.

Fig. 5: Timing of two consecutive kernel launches of 10 and 30 blocks. Kernel calls are serialized, showing TPCs have independent clock registers.

B. Arithmetic Pipelines

Each SM contains three different types of execution units (as shown in Figure 1 and Table I):

• Eight Scalar Processors (SP) that execute single-precision floating-point and integer arithmetic and logic instructions.
• Two Special Function Units (SFU) that are responsible for executing transcendental and mathematical functions such as reciprocal square root, sine, and cosine, as well as single-precision floating-point multiplication.
• One Double Precision Unit (DPU) that handles computations on 64-bit floating-point operands.

TABLE II: Arithmetic Pipeline Latency and Throughput

Unit   Latency (clocks)   Throughput (ops/clock)   Issue Rate (clocks/warp)
SP     24                 8                        4
SFU    28                 2                        16
DPU    48                 1                        32
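These pipeline numbers are obtained by timing long chains of dependent operations, as described below. The structure of such a test can be sketched in plain host-side C (illustrative only, not the actual benchmark; the C library clock() stands in for the GPU's per-TPC clock register):

```c
#include <time.h>

/* Sketch of a latency microbenchmark: a chain of dependent operations.
   Each statement's input is the previous statement's output, so the
   operations cannot overlap in the pipeline; elapsed time divided by
   the chain length approximates per-operation latency. */
#define REPEAT8(stmt) stmt stmt stmt stmt stmt stmt stmt stmt

unsigned dependent_chain(unsigned seed, int iters, clock_t *elapsed) {
    volatile unsigned t = seed;     /* volatile keeps the chain live */
    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        REPEAT8(t = t * 3u + 1u;)   /* unrolled dependent operations */
    }
    *elapsed = clock() - start;
    return t;
}
```

A throughput variant runs the same chain in many threads at once so the pipeline stays full, mirroring the 512-thread blocks used for the GPU tests.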
Table II shows the latency and throughput for these execution units when all operands are in registers.

To measure the pipeline latency and throughput, we use tests consisting of a chain of dependent operations. For latency tests, we run only one thread. For throughput tests, we run a block of 512 threads (the maximum number of threads per block) to ensure full occupancy of the units. Tables III and IV show which execution unit each operation uses, as well as the observed latency and throughput.

Table III shows that single- and double-precision floating-point multiplication and multiply-and-add (mad) each map to a single device instruction. However, 32-bit integer multiplication translates to four native instructions, requiring 96 cycles, and 32-bit integer mad translates to five dependent instructions, taking 120 cycles. The hardware supports only 24-bit integer multiplication, via the __mul24() intrinsic.

For 32-bit integer and double-precision operands, division translates to a subroutine call, resulting in high latency and low throughput. However, single-precision floating-point division is translated to a short inlined sequence of instructions with much lower latency.

The measured throughput for single-precision floating-point multiplication is ∼11.2 ops/clock. This is greater than the SP throughput of eight, which suggests that multiplication is issued to both the SP and SFU units. This suggests that each of the two SFUs is capable of doing ∼2 multiplications per cycle (4 in total for the 2 SFUs), twice the throughput of the other (more complex) instructions that map to the SFU. The throughput for single-precision floating-point mad is 7.9 ops/clock, suggesting that mad operations cannot be executed by the SFUs.

Decuda shows that the __sinf(), __cosf(), and __exp2f() intrinsics each translate to a sequence of two dependent instructions operating on a single operand. The Programming Guide states that the SFUs execute transcendental operations; however, the latency and throughput measurements for these transcendental instructions do not match those for simpler instructions (e.g., __log2f()) executed by these units. sqrt() maps to two instructions: a reciprocal square root followed by a reciprocal.

Figure 6 shows the latency and throughput of dependent SP instructions (integer additions) as the number of warps on the SM increases. Below six concurrent warps, the observed latency is 24 cycles. Since all warps observe the same latency, the warp scheduler is fair. Throughput increases linearly while the pipeline is not full, then saturates at eight (the number of SP units) operations per clock once the pipeline is full. The Programming Guide states that six warps (192 threads) should be sufficient to hide register read-after-write latencies. However, the scheduler does not manage to fill the pipeline when there are six or seven warps in the SM.

TABLE III: Latency and Throughput of Arithmetic and Logic Operations

Operation               Type        Exec. Unit   Latency (clocks)   Throughput (ops/clock)
add, sub, max, min      uint, int   SP           24                 7.9
mad                     uint, int   SP           120                1.4
mul                     uint, int   SP           96                 1.7
div                     uint        –            608                0.28
div                     int         –            684                0.23
rem                     uint        –            728                0.24
rem                     int         –            784                0.20
and, or, xor, shl, shr  uint        SP           24                 7.9
add, sub, max, min      float       SP           24                 7.9
mad                     float       SP           24                 7.9
mul                     float       SP, SFU      24                 11.2
div                     float       –            137                1.5
add, sub, max, min      double      DPU          48                 1.0
mad                     double      DPU          48                 1.0
mul                     double      DPU          48                 1.0
div                     double      –            1366               0.063

TABLE IV: Latency and Throughput of Mathematical Intrinsics. A "–" in the Execution Unit column denotes an operation that maps to a multi-instruction routine.

Operation                  Type     Exec. Unit   Latency (clocks)   Throughput (ops/clock)
__umul24()                 uint     SP           24                 7.9
__mul24()                  int      SP           24                 7.9
__usad()                   uint     SP           24                 7.9
__sad()                    int      SP           24                 7.9
__umulhi()                 uint     –            144                1.0
__mulhi()                  int      –            180                0.77
__fadd_rn(), __fadd_rz()   float    SP           24                 7.9
__fmul_rn(), __fmul_rz()   float    SP, SFU      26                 10.4
__fdividef()               float    –            52                 1.9
__dadd_rn()                double   DPU          48                 1.0
__sinf(), __cosf()         float    SFU?         48                 2.0
__tanf()                   float    –            98                 0.67
__exp2f()                  float    SFU?         48                 2.0
__expf(), __exp10f()       float    –            72                 2.0
__log2f()                  float    SFU          28                 2.0
__logf(), __log10f()       float    –            52                 2.0
__powf()                   float    –            75                 1.0
rsqrt()                    float    SFU          28                 2.0
sqrt()                     float    SFU          56                 2.0
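The 24-bit multiplier helps explain the multi-instruction expansion of 32-bit mul in Table III. The actual native sequence is undocumented; the following plain-C sketch shows one standard way the low 32 bits of a 32×32-bit product can be composed from 16-bit halves, where every partial product fits comfortably within a 24-bit multiply (the function names are ours, not Nvidia's):

```c
#include <stdint.h>

/* Model of a 24-bit multiply: the low 32 bits of the product of the
   low 24 bits of each operand (cf. the __mul24() intrinsic). */
static uint32_t mul24_model(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)p;
}

/* One plausible expansion of a full 32-bit low multiply using only
   24-bit multiplies: split each operand into 16-bit halves, so each
   partial product uses operands below 2^16 < 2^24. */
uint32_t mul32_via_mul24(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;
    uint32_t lo   = mul24_model(a_lo, b_lo);        /* low partial product */
    uint32_t mid1 = mul24_model(a_lo, b_hi) << 16;  /* cross terms wrap    */
    uint32_t mid2 = mul24_model(a_hi, b_lo) << 16;  /* mod 2^32 naturally  */
    return lo + mid1 + mid2;  /* a_hi*b_hi << 32 vanishes mod 2^32 */
}
```

Whatever the exact native sequence, a decomposition of this shape accounts for a handful of dependent instructions per 32-bit multiply, consistent with the measured 96-cycle latency.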
Fig. 6: SP Throughput and Latency. Having six or seven warps does not fully utilize the pipeline.

C. Control Flow

1) Branch Divergence: All threads of a warp execute a single common instruction at a time. The Programming Guide states that when threads of a warp diverge due to a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path [1]. Our observations are consistent with the expected behavior. Figure 7 shows the measured execution timeline for two concurrent warps in a block whose threads diverge 32 ways. Each thread takes a different path based on its thread ID and performs a sequence of arithmetic operations. The figure shows that within a single warp, each path is executed serially, while the execution of different warps may overlap. Within a warp, threads that take the same path are executed concurrently.

Fig. 7: Execution Timeline of Two 32-Way Divergent Warps. Top series shows timing for Warp 0, bottom for Warp 1.

2) Reconvergence: When the execution of diverged paths is complete, the threads converge back to the same execution path. Decuda shows that the compiler inserts one instruction before a potentially-diverging branch, which provides the hardware with the location of the reconvergence point. Decuda also shows that the instruction at the reconvergence point is marked using a field in the instruction encoding. We observe that when threads diverge, the execution of each path is serialized up to the reconvergence point. Only when one path reaches the reconvergence point does the other path begin executing.

According to Lindholm et al., a branch synchronization stack is used to manage independent threads that diverge and converge [6]. We use the kernel shown in Listing 3 to confirm this statement. The array c contains a permutation of the set of numbers between 0 and 31 specifying the thread execution order. We observe that when a warp reaches a conditional branch, the taken path is always executed first: for each if statement the else path is the taken path and is executed first, so that the last then-clause (else if (tid == c[31])) is always executed first, and the first then-clause (if (tid == c[0])) is executed last.

    if (tid == c[0]) { ... }
    else if (tid == c[1]) { ... }
    else if (tid == c[2]) { ... }
    ...
    else if (tid == c[31]) { ... }

Listing 3: Reconvergence Stack Test

Figure 8 shows the execution timeline of this kernel when the array c contains the increasing sequence {0, 1, ..., 31}. In this case, thread 31 is the first thread to execute. When the array c contains the decreasing sequence {31, 30, ..., 0}, thread 0 is the first to execute, showing that the thread ID does not affect execution order. The observed execution ordering is consistent with the taken path being executed first, and the fall-through path being pushed on a stack. Other tests show that the number of active threads on a path also has no effect on which path is executed first.

3) Effects of Serialization due to SIMT: The Programming Guide states that for correctness, the programmer can ignore the SIMT behavior. In this section, we show an example of code that would work if threads were independent, but deadlocks due to the SIMT behavior. In Listing 4, if threads were independent, the first thread would break out of the while loop and increment sharedvar. This would cause each consecutive thread to do the same: fall out of the while loop and increment sharedvar, permitting the next thread to execute. In the SIMT model, branch divergence occurs when thread 0 fails the while-loop condition. The compiler marks the reconvergence point just before sharedvar++. When thread 0 reaches the reconvergence point, the other (serialized) path is executed. Thread 0 cannot continue and increment sharedvar until the rest of the threads also reach the reconvergence point. This causes deadlock, as these threads can never reach the reconvergence point.

    int __shared__ sharedvar = 0;
    while (sharedvar != tid);
    /* ** reconvergence point ** */
    sharedvar++;

Listing 4: Example code that breaks due to SIMT behavior.
Fig. 8: Execution Timeline of the Kernel Shown in Listing 3. Array c contains the increasing sequence {0, 1, ..., 31}.

D. Barrier Synchronization

Synchronization between warps of a single block is done using __syncthreads(), which acts as a barrier. __syncthreads() is implemented as a single instruction, with a latency of 20 clock cycles for a single warp executing a sequence of __syncthreads() calls.

The Programming Guide recommends that __syncthreads() be used in conditional code only if the condition evaluates identically across the entire thread block. The rest of this section investigates the behavior of __syncthreads() when this recommendation is violated. We demonstrate that __syncthreads() operates as a barrier for warps, not threads. We show that when the threads of a warp are serialized due to branch divergence, a __syncthreads() on one path does not wait for threads from the other path, but only waits for other warps running within the same thread block.
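The warp-granularity rule can be stated as a tiny model (our own restatement of the behavior measured below, not Nvidia's implementation): the barrier completes once every warp in the block has either arrived at a __syncthreads() or terminated.

```c
/* Per-warp status for a block arriving at a __syncthreads() barrier. */
typedef enum { WARP_ARRIVED, WARP_TERMINATED, WARP_SPINNING } WarpState;

/* Returns 1 if the barrier can complete, 0 if the block deadlocks.
   A spinning warp (one that neither arrives nor terminates) blocks
   the barrier forever, since no time-out mechanism was observed. */
int barrier_completes(const WarpState *warps, int nwarps) {
    for (int i = 0; i < nwarps; i++)
        if (warps[i] == WARP_SPINNING)
            return 0;
    return 1;
}
```

Note that the unit here is the warp, not the thread: divergent halves of one warp each count as the same warp arriving, which is why the tests below show no deadlock within a single diverged warp.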
1) __syncthreads() for Threads of a Single Warp: The Programming Guide states that __syncthreads() acts as a barrier for all threads in the same block. However, the test in Listing 5 shows that __syncthreads() acts as a barrier for all warps in the same block. This kernel is executed for a single warp, in which the first half of the warp produces values in shared memory for the second half to consume.

If __syncthreads() waited for all threads in a block, the two __syncthreads() in this example would act as a common barrier, forcing the producer threads (first half of the warp) to write values before the consumer threads (second half of the warp) read them. In addition, since branch divergence serializes execution of divergent paths (see Section IV-C1), a kernel would deadlock whenever __syncthreads() is used within a divergent warp (in this example, one set of 16 threads would wait for the other serialized set of 16 threads to reach its __syncthreads() call). We observe that there is no deadlock, and that the second half of the warp does not read the updated values in the array shared_array (the else clause executes first, see Section IV-C1), showing that __syncthreads() does not synchronize diverged threads within one warp as the Programming Guide's description might suggest.

    if (tid < 16) {
        shared_array[tid] = tid;
        __syncthreads();
    }
    else {
        __syncthreads();
        output[tid] = shared_array[tid % 16];
    }

Listing 5: Example code that shows __syncthreads() synchronizes at warp granularity

2) __syncthreads() Across Multiple Warps: __syncthreads() is a barrier that waits for all warps to either call __syncthreads() or terminate. If there is a warp that neither calls __syncthreads() nor terminates, __syncthreads() will wait indefinitely, suggesting the lack of a time-out mechanism. Listing 6 shows one example of such a deadlock (with no branch divergence), where the second warp spins waiting for data generated after the __syncthreads() by the first warp.

    // Test run with two warps
    count = 0;
    if (warp0) {
        __syncthreads();
        count = 1;
    }
    else {
        while (count == 0);
    }

Listing 6: Example code that deadlocks due to __syncthreads(). Test is run with two warps.

Listing 7 illustrates the details of the interaction between __syncthreads() and branch divergence. Given that __syncthreads() operates at a warp granularity, one would expect that either the hardware would ignore __syncthreads() inside divergent warps, or that divergent warps participate in barriers in the same way as warps without divergence. We show that the latter is true.

In this example, the second __syncthreads() synchronizes with the third, and the first with the fourth (for warp 0, code block 2 executes before code block 1 because block 2 is the branch's taken path, see Section IV-C1). This confirms that __syncthreads() operates at the granularity of warps and that diverged warps are no exception. Each serialized path executes __syncthreads() separately (code block 2 does not wait for 1 at the barrier); it waits for all other warps in the block to also execute __syncthreads() or terminate.

    if (warp0) {
        // Two-way divergence
        if (tid < 16)
            __syncthreads();    // [1]
        else
            __syncthreads();    // [2]
    }
    if (warp1) {
        __syncthreads();        // [3]
        __syncthreads();        // [4]
    }

Listing 7: Example code that produces unintended results due to __syncthreads()

Fig. 9: Total registers used by a block is limited to 16,384 (64 KB). Maximum number of threads in a block is quantized to 64 threads when limited by register file capacity.

E. Register File

We confirm that the register file contains 16,384 32-bit
registers (64 KB), as the Programming Guide states [1]. The number of registers used by a thread is rounded up to a multiple of four [4]. Attempting to launch kernels that use more than 128 registers per thread, or a total of more than 64 KB of registers in a block, results in a failed launch. In Figure 9, below 32 registers per thread, the register file cannot be fully utilized because the maximum number of threads allowed per block is 512. Above 32 registers per thread, the register file capacity limits the number of threads that can run in a block. Figure 9 shows that when limited by register file capacity, the maximum number of threads in a block is quantized to 64 threads. This puts an additional limit on the number of registers that can be used by one kernel, and is most visible when threads use 88 registers each: only 128 threads can run in a block and only 11,264 (128 threads × 88) registers can be used, utilizing only 69% of the register file.

The quantizing of threads per block to 64 threads suggests that each thread's registers are distributed to one of 64 logical "banks". Each bank is the same size, so each bank can fit the same number of threads, limiting threads to multiples of 64 when limited by register file capacity. Note that this is different from quantizing the total register use.

Because all eight SPs always execute the same instruction at any given time, a physical implementation of 64 logical banks can share address lines among the SPs and use wider memory arrays instead of 64 real banks. Having the ability to perform four register accesses per SP every clock cycle (four logical banks) provides sufficient bandwidth to execute three-read, one-write operand instructions (e.g., multiply-add) every clock cycle. A thread would access its registers over multiple cycles, since they all reside in a single bank, with accesses for multiple threads occurring simultaneously.

Having eight logical banks per SP could provide extra bandwidth for the "dual-issue" feature using the SFUs (see Section IV-B) and for performing memory operations in parallel with arithmetic.

The Programming Guide alludes to preferring multiples of 64 threads by suggesting that to avoid bank conflicts, "best results" are achieved if the number of threads per block is a multiple of 64. We observe that when limited by register count, the number of threads per block is limited to a multiple of 64, while no bank conflicts were observed.

F. Shared Memory

Shared memory is a non-cached per-SM memory space. It is used by the threads of a block to cooperate by sharing data with other threads from the same block. The amount of shared memory allowed per block is 16 KB. The kernel's function parameters also occupy shared memory, thus slightly reducing the usable memory size.

We measure the read latency to be 38 cycles, using stride accesses as in Listings 1 and 2. Volkov and Demmel reported a similar latency of 36 cycles on the 8800GTX, the predecessor of the GT200 [7]. The Programming Guide states that shared memory latency is comparable to register access latency. Varying the memory footprint and stride of our microbenchmark verified the lack of caching for shared memory.

Fig. 10: Texture Memory. 5 KB L1 and 256 KB, 8-way L2 caches. Measured using 64-byte stride.

G. Global Memory

Global memory is accessible by all running threads, even if they belong to different blocks. Global memory accesses are uncached and have a documented latency of 400-600 cycles [1]. Our microbenchmark executes a sequence of pointer-chasing dependent reads to global memory, similar to Listings 1 and 2. In the absence of a TLB miss, we measure a read latency in the range of 436-443 cycles. Section IV-I2 presents more details on the effects of memory translation on global memory access latency. We also investigated the presence of caches. No caching effects were observed.

H. Texture Memory

Texture memory is a cached, read-only, globally-visible memory space. In graphics rendering, textures are often two-dimensional and exhibit two-dimensional locality. CUDA supports one-, two-, and three-dimensional textures. We measure the cache hierarchy of a one-dimensional texture bound to a region of linear memory. Our code performs dependent texture fetches from a texture, similar to Listings 1 and 2. Figure 10 shows the presence of two levels of texture caching using a stride of 64 bytes, showing 5 KB and 256 KB for the L1 and L2 cache sizes, respectively.

We expect the memory hierarchy for higher-dimension (2D and 3D) textures not to be significantly different. 2D spatial locality is typically achieved by rearranging texture elements in "tiles" using an address computation, rather than requiring specialized caches [8]–[10].

1) Texture L1 Cache: The texture L1 cache is 5 KB, 20-way set-associative, with 32-byte cache lines. Figure 11 focuses on the first latency increase at 5 KB and shows results with an eight-byte stride. A 256-byte way size for a 5 KB cache implies 20-way set associativity. We see that the L1 hit latency (261 clocks) is more than half that of main memory (499 clocks), consistent with the Programming Guide's statement that texture caches do not reduce fetch latency but do reduce DRAM bandwidth demand.

2) Texture L2 Cache: The texture L2 cache is 256 KB, 8-way set-associative, with 256-byte cache lines. Figure 10 shows a way size of 32 KB for a 256 KB cache, implying 8-way set
Fig. 13: Global Memory. 8 MB fully-associative L1 and 32 MB 8-way L2
Fig. 11: Texture L1 Cache. 5 KB, 20-way, 32-byte lines. Measured using 8-byte
TLBs. Measured using 512 KB stride.
stride. Maximum and minimum average latency over all TPC placements are
also shown: L2 has TPC placement-dependent latency.

Fig. 14: Global L1 TLB. 16-way fully-associative with 512 KB line size.
Fig. 12: Texture L2 Cache. 256 KB, 8-way, 256-byte lines. Measured using
64-byte stride.
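The stride-and-footprint sweeps used in these measurements can be reasoned about with a small software cache model. The following Python sketch is our illustration, not code from the paper; it assumes LRU replacement, which the measurements do not establish for the real hardware. It models a cache with the texture L1's reported geometry (5 KB, 20-way, 32-byte lines) and shows why misses first appear once the probed footprint exceeds the capacity:

```python
from collections import OrderedDict

def probe(footprint, stride, capacity=5120, line=32, ways=20):
    """Walk `footprint` bytes at `stride` through a modeled set-associative
    LRU cache: one warm-up pass, then one measured pass. Returns the
    measured miss rate."""
    sets = capacity // (line * ways)
    cache = [OrderedDict() for _ in range(sets)]   # per-set LRU state

    def access(addr):
        block = addr // line
        s, tag = block % sets, block // sets
        hit = tag in cache[s]
        cache[s][tag] = True
        cache[s].move_to_end(tag)                  # mark most recently used
        if len(cache[s]) > ways:
            cache[s].popitem(last=False)           # evict least recently used
        return hit

    addrs = list(range(0, footprint, stride))
    for a in addrs:                                # warm-up pass
        access(a)
    misses = sum(not access(a) for a in addrs)     # measured pass
    return misses / len(addrs)

# A 5 KB footprint fits the modeled 5 KB texture-L1-like cache: no misses.
assert probe(5 * 1024, 32) == 0.0
# One extra line pushes one set past 20 ways: that set thrashes under LRU.
assert probe(5 * 1024 + 32, 32) > 0.0
```

In the actual microbenchmark, the miss rate is observed indirectly as a jump in the average latency of the dependent read chain rather than counted directly.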
I. Memory Translation

We investigate the presence of TLBs using stride-accessed dependent reads, similar to Listings 1 and 2. Measuring TLB parameters is similar to measuring caches, but with increased array sizes and larger strides comparable to the page size. Detailed TLB results for both global and texture memory are presented in Sections IV-I1 and IV-I2 respectively.

1) Global Memory Translation: Figure 13 shows that there are two TLB levels for global memory. The L1 TLB is fully-associative, holding mappings for 8 MB of memory, containing 16 lines with a 512 KB TLB line size. The 32 MB L2 TLB is 8-way set associative, with a 4 KB line size. We use the term TLB size to refer to the total size of the pages that can be mapped by the TLB, rather than the raw size of the entries stored in the TLB. For example, an 8 MB TLB describes a TLB that can cache 2 K mappings when the page size is 4 KB. If, in addition, the TLB line size were 512 KB, the TLB would be organized in 16 lines with 128 mappings of consecutive pages per line.

In Figure 13, the first latency plateau at ∼440 cycles indicates an L1 TLB hit (global memory read latency as measured in Section IV-G). The second plateau at ∼487 cycles indicates an L2 TLB hit, while an L2 TLB miss takes ∼698 cycles. We measure the 16-way associativity of the L1 TLB by accessing a fixed number of elements with varying strides. Figure 14 depicts the results when 16 and 17 array elements are accessed. For large strides, where all elements map to the same cache set (e.g., 8 MB), accessing 16 elements always experiences L1 TLB hits, while accessing 17 elements will miss the L1 TLB at 512 KB stride and up. We can also see that the L1 TLB has only one cache set, implying it is fully-associative with 512 KB lines. If there were at least two sets, then when the stride is not a power of two and is greater than 512 KB (the size of a cache way), some elements would map to different sets. When 17 elements are accessed, they would not all be mapped to the same set, and there would be some stride for which no L1 misses occur. We never see L1 TLB hits for strides beyond 512 KB (e.g., 608, 724, and 821 KB). However, we can see (at strides beyond 4 MB) that the L2 TLB is not fully-associative. Figure 13 shows that the size of an L2 TLB way is 4 MB (32 to 36 MB; see Section III-B). With an L2 TLB size of 32 MB, the associativity of the L2 TLB is eight. Extending the test did not find evidence of multi-level paging.

Although the L1 TLB line size is 512 KB, the L2 TLB line size is a smaller 4 KB. We used a microbenchmark that uses two sets of 10 elements (20 total) with each element separated by a 2 MB stride. The two sets of elements are separated by 2 MB+offset. The ith element has address (i < 10) ? (i × 2 MB) : (i × 2 MB + offset). We need to access more than 16 elements to prevent the 16-way L1 TLB from hiding the accesses. Since the size of an L2 TLB way is 4 MB and we use 2 MB strides, our 20 elements map to two L2 sets when offset is zero. Figure 15 shows the 4 KB L2 TLB line size. When offset is zero, our 20 elements occupy two sets, 10 elements per set, causing conflict misses in the 8-way associative L2 TLB. As offset is increased beyond the 4 KB L2 TLB line size, having 5 elements per set no longer causes conflict misses in the L2 TLB.

Although the page size can be less than the 4 KB L2 TLB line size, we believe a 4 KB page size is a reasonable choice. We note that the Intel x86 architecture uses multi-level paging with mainly 4 KB pages, while Intel’s family of GPUs uses single-level 4 KB paging [9], [11].

Fig. 15: Global L2 TLB. 4 KB TLB line size.

2) Texture Memory Translation: We used the same methodology as in Section IV-I1 to compute the configuration parameters of the texture memory TLBs. The methodology is not repeated here for the sake of brevity. Texture memory contains two levels of TLBs, with 8 MB and 16 MB of mappings, as seen in Figure 16 with 256 KB stride. The L1 TLB is 16-way fully-associative with each line holding translations for 512 KB of memory. The L2 TLB is 8-way set associative with a line size of 4 KB. At 512 KB stride, the virtually-indexed 20-way L1 texture cache hides the features of the L1 texture TLB. The access latencies as measured with 512 KB stride are 497 (TLB hit), 544 (L1 TLB miss), and 753 (L2 TLB miss) clocks.

Fig. 16: Texture Memory. 8 MB fully-associative L1 TLB and 16 MB 8-way L2 TLB. 544 clocks L1 and 753 clocks L2 TLB miss. Measured using 256 KB stride.

J. Constant Memory

There are two segments of constant memory: one is user-accessible, while the other is used by compiler-generated constants (e.g., comparisons for branch conditions) [4]. The user-accessible segment is limited to 64 KB.

Fig. 17: Constant Memory. 2 KB L1, 8 KB 4-way L2, 32 KB 8-way L3 caches. Measured using 256-byte stride. Maximum and minimum average latency over all TPC placements are also shown: L3 has TPC placement-dependent latency.

Fig. 18: Constant L1 cache. 2 KB, 4-way, 64-byte lines. Measured using 16-byte stride.

The plot in Figure 17 shows three levels of caching of sizes 2 KB, 8 KB, and 32 KB. The measured latency includes the latency of two arithmetic instructions (one address computation and one load), so the raw memory access time would be roughly 48 cycles lower (8, 81, 220, and 476 clocks for an L1 hit, L2 hit, L3 hit, and L3 miss, respectively). Our microbenchmarks perform dependent constant memory reads, similar to Listings 1 and 2.
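The geometries quoted in Sections IV-I and IV-J are tied together by simple bookkeeping identities: associativity is capacity divided by way size, the set count is capacity divided by associativity times line size, and a TLB's reach is its line count times the memory covered per line. A quick check of the reported numbers (illustrative arithmetic only):

```python
def assoc(capacity, way_size):
    # Associativity: capacity divided by the way size (the stride at
    # which conflict misses first appear).
    return capacity // way_size

def sets(capacity, ways, line):
    # Set count: capacity divided by (associativity x line size).
    return capacity // (ways * line)

KB, MB = 1024, 1024 * 1024

assert assoc(5 * KB, 256) == 20          # texture L1: 20-way
assert assoc(256 * KB, 32 * KB) == 8     # texture L2: 8-way
assert sets(2 * KB, 4, 64) == 8          # constant L1: 8 sets
assert sets(32 * KB, 8, 256) == 16       # constant L3: 16 sets
assert 16 * (512 * KB) == 8 * MB         # global L1 TLB: 16 lines x 512 KB reach
assert (8 * MB) // (4 * KB) == 2048      # = 2 K page mappings at 4 KB pages
assert assoc(32 * MB, 4 * MB) == 8       # global L2 TLB: 32 MB reach, 4 MB way
```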
1) Constant L1 Cache: A 2 KB L1 constant cache is located in each SM (see Section IV-J4). The L1 has a 64-byte cache line size, and is 4-way set associative with eight sets. The way size is 512 bytes, indicating 4-way set associativity in a 2 KB cache. Figure 18 shows these parameters.

2) Constant L2 Cache: An 8 KB L2 constant cache is located in each TPC and is shared with instruction memory (see Sections IV-J4 and IV-J5). The L2 cache has a 256-byte cache line size and is 4-way set associative with 8 sets. The region near 8,192 bytes in Figure 17 shows these parameters. A 2 KB way size in an 8 KB cache indicates an associativity of four.

3) Constant L3 Cache: We observe a 32 KB L3 constant cache shared among all TPCs. The L3 cache has 256-byte cache lines and is 8-way set associative with 16 sets. We observe the cache parameters in the region near 32 KB in Figure 17. The minimum and maximum access latencies for the L3 cache (Figure 17, 8-32 KB region) differ significantly depending on which TPC executes the test code. This suggests that the L3 cache is located on a non-uniform interconnect that connects TPCs to L3 cache and memory. The latency variance does not change with increasing array size, even when main memory is accessed (array size > 32 KB), suggesting that the L3 cache is located near the main memory controllers.

We also measure the L3 cache bandwidth. Figure 19 shows the aggregate L3 cache read bandwidth when a varying number of blocks make concurrent L3 cache read requests, with the requests within each thread being independent. The observed aggregate bandwidth of the L3 constant cache is ∼9.75 bytes/clock when running between 10 and 20 blocks. We run two variants of the bandwidth tests: one variant using one thread per block and one using eight to increase constant cache fetch demand within a TPC. Both tests show similar behavior below 20 blocks. This suggests that when running one block, an SM is only capable of fetching ∼1.2 bytes/clock even with increased demand within the block (from multiple threads). The measurements are invalid above 20 blocks in the eight-thread case, as there are not enough unique data sets and the per-TPC L2 cache hides some requests from the L3, causing apparent aggregate bandwidth to increase. Above 30 blocks, some SMs run more than one block, causing load imbalance.

Fig. 19: Constant L3 cache bandwidth. 9.75 bytes per clock.

4) Cache Sharing: The L1 constant cache is private to each SM, the L2 is shared among SMs on a TPC, and the L3 is global. This was tested by measuring latency using two concurrent blocks with varying placement (same SM, same TPC, two different TPCs). The two blocks will compete for shared caches, causing the observed cache size to be halved. Figure 20 shows the results of this test. In all cases, the observed L3 cache size is halved to 16 KB (L3 is global). With two blocks placed on the same TPC, the observed L2 cache size is halved to 4 KB (L2 is per-TPC). Similarly, with two blocks on the same SM, the observed L1 cache size is halved to 1 KB (L1 is per-SM).

Fig. 20: Constant Memory Sharing. Per-SM L1 cache, per-TPC L2, global L3. Measured using 256-byte stride.

5) Cache Sharing with Instruction Memory: It has been suggested that part of the constant cache and instruction cache hierarchies are unified [12], [13]. We find that the L2 and L3 caches are indeed instruction and constant caches, while the L1 caches are single-purpose. Similar to Section IV-J4, we measure the interference between instruction fetches and constant cache fetches with varying placements. The result is plotted in Figure 21. The L1 access times are not affected by instruction fetch demand even when the blocks run on the same SM, so the L1 caches are single-purpose.

Fig. 21: Constant Memory Instruction Cache Sharing. L2 and L3 caches are shared with instructions. Measured using 256-byte stride.
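The capacity-halving argument used in the placement tests can be sketched with a toy model. The code below is our illustration, not the paper's: it uses a fully-associative LRU cache, whereas the real L3 is 8-way set associative, but the sharing effect is the same. When two blocks interleave accesses through one shared cache, the largest footprint each block can keep resident is halved.

```python
from collections import OrderedDict

def observed_capacity(cache_lines, sharers, line=256):
    """Largest footprint (bytes) one block can keep fully resident when
    `sharers` blocks round-robin through one shared fully-associative
    LRU cache of `cache_lines` lines."""
    best = 0
    for n in range(1, cache_lines + 1):            # per-block footprint in lines
        lru = OrderedDict()
        misses = 0
        for p in range(2):                         # pass 0 warms up, pass 1 counts
            for i in range(n):
                for b in range(sharers):           # blocks interleave accesses
                    tag = (b, i)
                    hit = tag in lru
                    lru[tag] = True
                    lru.move_to_end(tag)           # mark most recently used
                    if len(lru) > cache_lines:
                        lru.popitem(last=False)    # evict least recently used
                    if p == 1 and not hit:
                        misses += 1
        if misses == 0:
            best = n
    return best * line

# One block alone sees the full 32 KB L3 (128 lines of 256 bytes); two
# concurrent blocks each observe only 16 KB, matching the halving in Figure 20.
assert observed_capacity(128, 1) == 32 * 1024
assert observed_capacity(128, 2) == 16 * 1024
```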
Fig. 22: Instruction Cache Latency. 8 KB 4-way L2, 32 KB 8-way L3. This test fails to detect the 4 KB L1 cache.

K. Instruction Supply

We detect three levels of instruction caching, of sizes 4 KB, 8 KB, and 32 KB, respectively (plotted in Figure 22). The microbenchmark code consists of differently-sized blocks of independent 8-byte arithmetic instructions (abs) to maximize fetch demand. The 8 KB L2 and 32 KB L3 caches are visible in the figure, but the 4 KB L1 is not, probably due to a small amount of instruction prefetching that hides the L2 access latency.

1) L1 Instruction Cache: The 4 KB L1 instruction cache resides in each SM, with 256-byte cache lines and 4-way set associativity.

The L1 cache parameters were measured (Figure 23) by running concurrent blocks of code on the other two SMs on the same TPC to introduce contention for the L2 cache, so L1 misses beginning at 4 KB do not stay hidden as in Figure 22. The 256-byte line size is visible, as well as the presence of 4 cache sets. The L1 instruction cache is per-SM. When other SMs on the same TPC flood their instruction cache hierarchies, the instruction cache size of 4 KB for the SM being observed does not decrease.

Fig. 23: Instruction L1 cache. 4 KB, 4-way, 256-byte lines. Contention for the L2 cache is added to make L1 misses visible. Maximum and minimum average latency over all TPC placements are also shown: L3 has TPC placement-dependent latency.

2) L2 Instruction Cache: The 8 KB L2 instruction cache is located on the TPC, with 256-byte cache lines and 4-way set associativity. We have shown in Section IV-J5 that the L2 instruction cache is also used for constant memory. We have verified that the L2 instruction cache parameters match those of the L2 constant cache, but omit the results due to space constraints.

3) L3 Instruction Cache: The 32 KB L3 instruction cache is global, with 256-byte cache lines and 8-way set associativity. We have shown in Section IV-J5 that the L3 instruction cache is also used for constant memory, and have verified that the cache parameters for instruction caches match those for constant memory.

4) Instruction Fetch: The SM appears to fetch instructions from the L1 instruction cache 64 bytes at a time (8-16 instructions). Figure 24 shows the execution timeline of our measurement code, consisting of 36 consecutive clock() reads (72 instructions), averaged over 10,000 executions of the microbenchmark.

While one warp runs the measurement code, seven “evicting warps” running on the same SM repeatedly evict the region indicated by the large points in the plot, by looping through 24 instructions (192 bytes) that cause conflict misses in the instruction cache. The evicting warps will repeatedly evict a cache line used by the measurement code with high probability, depending on warp scheduling. The cache miss latency caused by an eviction will be observed only at instruction fetch boundaries (160, 224, 288, 352, and 416 bytes in Figure 24). We see that a whole cache line (spanning the code region 160-416 bytes) is evicted when a conflict occurs and that the effect of a cache miss is only observed across, but not within, blocks of 64 bytes.

Fig. 24: Instruction fetch size. The SM appears to fetch from the L1 cache in blocks of 64 bytes. The code spans three 256-byte cache lines, with boundaries at 160 and 416 bytes.

V. Related Work

Microbenchmarking has been used extensively in the past to determine the hardware organization of various processor structures. We limit our attention to work targeting GPUs.
Volkov and Demmel benchmarked the 8800GTX GPU, the predecessor of the GT200 [7]. They measured characteristics of the GPU relevant to accelerating dense linear algebra, revealing the structure of the texture caches and one level of TLBs. Although they used the previous generation hardware, their measurements generally agree with ours. We focus on the microarchitecture of the GPU, revealing an additional TLB level and caches, and the organization of the processing cores.

GPUs have also been benchmarked for performance analysis. An example is GPUBench [14], a set of microbenchmarks written in the OpenGL ARB shading language that measures some of the GPU instruction and memory performance characteristics. The higher-level ARB shading language is further abstracted from the hardware than CUDA, making it difficult to infer detailed hardware structures from the results. However, the ARB shading language offers vendor-independence, which CUDA does not.

Currently, specifications of Nvidia’s GPUs and CUDA optimization techniques come from the manufacturer [1], [3]. Studies on optimization (e.g., [15]) as well as performance simulators (e.g., [2]) rely on these published specifications. We present more detailed parameters, which we hope will be useful in improving the accuracy of these studies.

VI. Summary and Conclusions

This paper presented our analysis of the Nvidia GT200 GPU and our measurement techniques. Our suite of microbenchmarks revealed architectural details of the processing cores and the memory hierarchies. A GPU is a complex device, and it is impossible to reverse-engineer every detail; we believe we have investigated an interesting subset of its features. Table V summarizes our architectural findings.

TABLE V: GT200 Architecture Summary

Arithmetic Pipeline
        Latency (clocks)   Throughput (ops/clock)
  SP    24                 8
  SFU   28                 2 (4 for MUL)
  DPU   48                 1

Pipeline Control Flow
  Branch Divergence: Diverged paths are serialized. Reconvergence is handled via a stack.
  Barrier Synchronization: syncthreads() works at warp granularity. Warps wait at the barrier until all other warps execute syncthreads() or terminate.

Memories
  Register File: 16 K 32-bit registers, 64 logical banks per SM.
  Instruction: L1: 4 KB, 256-byte line, 4-way, per-SM. L2: 8 KB, 256-byte line, 4-way, per-TPC. L3: 32 KB, 256-byte line, 8-way, global. L2 and L3 shared with constant memory.
  Constant: L1: 2 KB, 64-byte line, 4-way, per-SM, 8 clk. L2: 8 KB, 256-byte line, 4-way, per-TPC, 81 clk. L3: 32 KB, 256-byte line, 8-way, global, 220 clk. L2 and L3 shared with instruction memory.
  Global: ∼436-443 cycles read latency. 4 KB translation page size. L1 TLB: 16 entries, 128 pages/entry, 16-way. L2 TLB: 8192 entries, 1 page/entry, 8-way.
  Texture: L1: 5 KB, 32-byte line, 20-way, 261 clk. L2: 256 KB, 256-byte line, 8-way, 371 clk. 4 KB translation page size. L1 TLB: 16 entries, 128 pages/entry, 16-way. L2 TLB: 4096 entries, 1 page/entry, 8-way.
  Shared: 16 KB, 38 cycles read latency.

Our results validated some of the hardware characteristics presented in the CUDA Programming Guide [1], but also revealed the presence of some undocumented hardware structures such as mechanisms for control flow and caching and TLB hierarchies. In addition, in some cases our findings deviated from the documented characteristics (e.g., texture and constant caches).

We also presented our techniques for our architectural analysis. We believe that these techniques will be useful for the analysis of other GPU-like architectures and validation of GPU-like performance models. The ultimate goal is to know the hardware better, so that we can harvest its full potential.

References

[1] Nvidia, “Compute Unified Device Architecture Programming Guide Version 2.0,” http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf.
[2] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in ISPASS 2009: IEEE International Symposium on Performance Analysis of Systems and Software, April 2009, pp. 163–174.
[3] Nvidia, “NVIDIA GeForce GTX 200 GPU Architectural Overview,” http://www.nvidia.com/docs/IO/55506/GeForce_GTX_200_GPU_Technical_Brief.pdf, May 2008.
[4] ——, “The CUDA Compiler Driver NVCC,” http://www.nvidia.com/object/io_1213955090354.html.
[5] W. J. van der Laan, “Decuda,” http://wiki.github.com/laanwj/decuda/.
[6] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[7] V. Volkov and J. W. Demmel, “Benchmarking GPUs to Tune Dense Linear Algebra,” in SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11.
[8] Z. S. Hakura and A. Gupta, “The Design and Analysis of a Cache Architecture for Texture Mapping,” SIGARCH Comput. Archit. News, vol. 25, no. 2, pp. 108–120, 1997.
[9] Intel, G45: Volume 1a Graphics Core, Intel 965G Express Chipset Family and Intel G35 Express Chipset Graphics Controller Programmer’s Reference Manual (PRM), January 2009.
[10] AMD, ATI CTM Guide, Technical Reference Manual.
[11] Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, September 2009.
[12] H. Goto, “GT200 Overview,” http://pc.watch.impress.co.jp/docs/2008/0617/kaigai_10.pdf, 2008.
[13] D. Kirk and W. W. Hwu, “ECE 498AL Lectures 8-9: The CUDA Hardware Model,” http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures/lecture8-9-hardware.ppt, 2007.
[14] I. Buck, K. Fatahalian, and M. Houston, “GPUBench,” http://graphics.stanford.edu/projects/gpubench/.
[15] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, “Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA,” in PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM, 2008, pp. 73–82.
