Abstract
As CPU cores become both faster and more numerous, the limiting factor
for most programs is now, and will be for some time, memory access.
Hardware designers have come up with ever more sophisticated memory
handling and acceleration techniques, such as CPU caches, but these
cannot work optimally without some help from the programmer.
Unfortunately, neither the structure nor the cost of using the memory
subsystem of a computer or the caches on CPUs is well understood by
most programmers. This paper explains the structure of memory
subsystems in use on modern commodity hardware, illustrating why CPU
caches were developed, how they work, and what programs should do to
achieve optimal performance by utilizing them.
1. Types of RAM
1.1 Static RAM
The structure of a six-transistor SRAM cell is as follows. The core of the cell is
formed by the four transistors M1 to M4, which form two cross-coupled
inverters. They have two stable states, representing 0 and 1 respectively.
The state is stable as long as power on Vdd is available. If access to the
state of the cell is needed, the word access line WL is raised. This makes
the state of the cell immediately available for reading on the bit line BL and its
complement. If the cell state must be overwritten, the bit line and its complement
are first set to the desired values and then WL is raised. Since the outside drivers
are stronger than the four transistors M1 through M4, this allows the old
state to be overwritten.
Conclusion:
• one cell requires six transistors. There are variants with four transistors
but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word
access line WL is raised. The signal is as rectangular (changing quickly
between the two binary states) as other transistor-controlled signals.
• the cell state is stable, no refresh cycles are needed.
1.2 Dynamic RAM
Dynamic RAM is, in its structure, much simpler than static RAM. Figure
2.5 shows the structure of a usual DRAM cell design. All it consists of is
one transistor and one capacitor. This huge difference in complexity of
course means that it functions very differently than static RAM.
2. CPU Caches
CPUs are today much more sophisticated than they were only 25 years
ago. In those days, the frequency of the CPU core was at a level
equivalent to that of the memory bus. Memory access was only a bit
slower than register access. But this changed dramatically in the early 90s,
when CPU designers increased the frequency of the CPU core but the
frequency of the memory bus and the performance of RAM chips did not
increase proportionally. This is not due to the fact that faster RAM could
not be built, as explained in the previous section. It is possible but it is not
economical. RAM as fast as current CPU cores is orders of magnitude
more expensive than any dynamic RAM.
The minimum cache configuration corresponds to the architecture which
could be found in early systems which deployed CPU caches. The CPU core
is no longer directly connected to the main memory. All loads and stores
have to go through the cache. The
connection between the CPU core and the cache is a special, fast
connection. In a simplified representation, the main memory and the
cache are connected to the system bus which can also be used for
communication with other components of the system. We introduced the
system bus as “FSB” which is the name in use today; see section 2.2. In
this section we ignore the Northbridge; it is assumed to be present to
facilitate the communication of the CPU(s) with the main memory.
Modern CPUs have multiple levels of cache. We introduce here the nomenclature
we will use in the remainder of the document: L1d is the level 1 data cache,
L1i the level 1 instruction cache, and so on. Note that this is a schematic
description; the data flow in reality need not pass through any of the
higher-level caches on the way from the core to the main memory.
CPU designers have a lot of freedom
designing the interfaces of the caches. For programmers these design
choices are invisible.
When memory content is needed by the processor the entire cache line is
loaded into the L1d. The memory address for each cache line is computed
by masking the address value according to the cache line size. For a 64
byte cache line this means the low 6 bits are zeroed. The discarded bits
are used as the offset into the cache line. The remaining bits are in some
cases used to locate the line in the cache and as the tag. In practice an
address value is split into three parts. For a 32-bit address it might look as
follows, from most significant to least significant bits:
Tag (T bits) | Cache Set (S bits) | Offset (O bits)
With a cache line size of 2^O the low O bits are used as the offset into the
cache line. The next S bits select the "cache set". We will go into more
detail soon on why sets, and not single slots, are used for cache lines. For
now it is sufficient to understand there are 2^S sets of cache lines. This
leaves the top 32 − S − O = T bits which form the tag. These T bits are the
value associated with each cache line to distinguish all the aliases
which are cached in the same cache set. The S bits used to address the
cache set do not have to be stored since they are the same for all cache
lines in the same set.
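As a minimal sketch of this decomposition (the cache geometry here, 64-byte lines and 64 sets, is hypothetical and chosen only for illustration), the three parts can be extracted with simple shifts and masks:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6   /* O = log2(cache line size): 64-byte lines assumed */
    #define SET_BITS  6   /* S = log2(number of sets): 64 sets assumed        */

    int main(void)
    {
        uint32_t addr   = 0x12345678u;                    /* arbitrary 32-bit address */
        uint32_t offset = addr & ((1u << LINE_BITS) - 1);                /* low O bits */
        uint32_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);  /* next S bits */
        uint32_t tag    = addr >> (LINE_BITS + SET_BITS);                /* top T bits  */

        printf("offset=%u set=%u tag=0x%x\n", offset, set, tag);
        return 0;
    }

The offset selects the byte within the cache line, the set bits select which of the 2^S sets is searched, and the tag is what is compared against the tags stored alongside the lines in that set.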
To be able to load new data in a cache it is almost always first necessary
to make room in the cache. An eviction from L1d pushes the cache line
down into L2 (which uses the same cache line size). This of course means
room has to be made in L2. This in turn might push the content into L3
and ultimately into main memory. Each eviction is progressively more
expensive. What is described here is the model for an exclusive cache as
is preferred by modern AMD and VIA processors. Intel implements
inclusive caches, where each cache line in L1d is also present in L2.
Therefore evicting from L1d is much faster. With enough L2 cache the
disadvantage of wasting memory for content held in two places is
minimal and it pays off when evicting. A possible advantage of an
exclusive cache is that loading a new cache line only has to touch the L1d
and not the L2, which could be faster. The CPUs are allowed to manage
the caches as they like as long as the memory model defined for the
processor architecture is not changed. It is, for instance, perfectly fine for
a processor to take advantage of little or no memory bus activity and
proactively write dirty cache lines back to main memory. The wide
variety of cache architectures among x86 and x86-64 processors, between
manufacturers and even within the models of the same manufacturer, is
testament to the power of the memory model abstraction.
More sophisticated cache implementations allow another possibility to
happen. Assume a cache line is dirty in one processor’s cache and a
second processor wants to read or write that cache line. In this case the
main memory is out-of-date and the requesting processor must, instead,
get the cache line content from the first processor. Through snooping, the
first processor notices this situation and automatically sends the
requesting processor the data. This action bypasses main memory, though
in some implementations the memory controller is supposed to notice this
direct transfer and store the updated cache line content in main memory.
If the access is for writing, the first processor then invalidates its local
copy of the cache line. Over time a number of cache coherency protocols
have been developed. The most important is MESI, which we will
introduce in section 3.3.4. The outcome of all this can be summarized in a
few simple rules: a dirty cache line is not present in any other processor's
cache, and clean copies of the same cache line can reside in arbitrarily
many caches.
These are the actual access times measured in CPU cycles. It is interesting
to note that for the on-die L2 cache a large part (probably even the
majority) of the access time is caused by wire delays. This is a physical
limitation which can only get worse with increasing cache sizes. Only
process shrinking (for instance, going from 65nm for Merom to 45nm for
Penryn in Intel’s lineup) can improve those numbers.
The numbers in the table look high but, fortunately, the entire cost does
not have to be paid for each occurrence of the cache load and miss. Some
parts of the cost can be hidden. Today’s processors all use internal
pipelines of different lengths where the instructions are decoded and
prepared for execution. Part of the preparation is loading values from
memory (or cache) if they are transferred to a register. If the memory load
operation can be started early enough in the pipeline, it may happen in
parallel with other operations and the entire cost of the load might be
hidden. This is often possible for L1d; for some processors with long
pipelines for L2 as well. There are many obstacles to starting the memory
read early. It might be as simple as not having sufficient resources for the
memory access or it might be that the final address of the load becomes
available late as the result of another instruction. In these cases the load
costs cannot be hidden (completely).
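A hypothetical example of the latter obstacle is pointer chasing, where the address of each load is itself the result of the previous load; the function names below are purely illustrative:

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* The address of every load depends on the previous load's result,
       so the accesses cannot be started early or overlapped. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->value;
            n = n->next;
        }
        return sum;
    }

    /* All addresses are known up front from the base pointer and the index,
       so the loads can be issued early in the pipeline and much of the
       cache access cost can be hidden. */
    long sum_array(const long *a, long count)
    {
        long sum = 0;
        for (long i = 0; i < count; ++i)
            sum += a[i];
        return sum;
    }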
The relationship of all these values is that the cache size is:
cache line size × associativity × number of sets
The addresses are mapped into the cache by using:
O = log2 cache line size
S = log2 number of sets
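As a hypothetical worked example: a 32 kB cache with 64-byte cache lines and 8-way associativity has 32768 / (64 × 8) = 64 sets, so O = 6 and S = 6; on a 32-bit system the tag is then T = 32 − 6 − 6 = 20 bits.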
Figure: Cache Size vs Associativity (CL=32)
2.3.2 Measurements of Cache Effects
All the figures are created by measuring a program which can simulate
working sets of arbitrary size, read and write access, and sequential or
random access.
All entries are chained in a circular list using the n element, either in
sequential or random order. Advancing from one entry to the next
always uses the pointer, even if the elements are laid out sequentially.
The pad element is the payload and it can grow arbitrarily large. In
some tests the data is modified, in others the program only performs
read operations.
A working set of 2^N bytes contains 2^N / sizeof(struct l) elements.
Obviously sizeof(struct l) depends on the value of NPAD. For 32-bit
systems, NPAD=7 means the size of each array element is 32 bytes;
for 64-bit systems the size is 64 bytes.
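A sketch of the element structure consistent with this description (the definition in the actual measurement program may differ in its details):

    #ifndef NPAD
    #define NPAD 7   /* controls the amount of padding and thus the element size */
    #endif

    struct l {
        struct l *n;         /* pointer to the next element in the (circular) list */
        long int pad[NPAD];  /* payload; unused when only the chain is traversed   */
    };

With NPAD=7 this gives 4 + 7 × 4 = 32 bytes per element on a 32-bit system and 8 + 7 × 8 = 64 bytes on a 64-bit system, matching the sizes above.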
2.4 Instruction Cache
Not just the data used by the processor is cached; the instructions
executed by the processor are also cached. However, this cache is
much less problematic than the data cache. There are several reasons:
• The quantity of code which is executed depends on the size of the
code that is needed. The size of the code in general depends on the
complexity of the problem. The complexity of the problem is fixed.
• While the program’s data handling is designed by the programmer
the program’s instructions are usually generated by a compiler. The
compiler writers know about the rules for good code generation.
• Program flow is much more predictable than data access patterns.
Today’s CPUs are very good at detecting patterns. This helps with
prefetching.
• Code always has quite good spatial and temporal locality.
On CISC processors the decoding stage can also take some time. The x86
and x86-64 processors are especially affected. In recent years these
processors therefore do not cache the raw byte sequence of the
instructions in L1i but instead they cache the decoded instructions. L1i in
this case is called the “trace cache”. Trace caching allows the processor to
skip over the first steps of the pipeline in case of a cache hit which is
especially good if the pipeline stalled.
To achieve the best performance there are only a few rules related to the
instruction cache:
1. Generate code which is as small as possible. There are
exceptions when software pipelining for the sake of using pipelines
requires creating more code or where the overhead of using small code is
too high.
2. Help the processor make good prefetching decisions. This can
be done through code layout or with explicit prefetching, as sketched below.
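As a GCC-specific sketch of the second rule (the function is purely illustrative): branch hints such as __builtin_expect let the compiler move unlikely paths away from the hot code, which keeps the frequently executed instructions dense in L1i; compiling with -Os additionally asks the compiler to optimize for code size.

    /* GCC-specific branch hints; other compilers need different mechanisms. */
    #define likely(expr)   __builtin_expect(!!(expr), 1)
    #define unlikely(expr) __builtin_expect(!!(expr), 0)

    long sum_buffer(const long *buf, long len)
    {
        if (unlikely(buf == 0 || len <= 0))
            return -1;            /* rare error path, laid out away from the hot loop */

        long sum = 0;
        for (long i = 0; i < len; ++i)
            sum += buf[i];        /* hot path stays small and contiguous */
        return sum;
    }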
2.4.1 Self Modifying Code
SMC should in general be avoided. Though it is generally executed correctly,
there are boundary cases which are not, and it creates performance problems if
not done correctly. Obviously, code which is changed cannot be kept in the trace
cache which contains the decoded instructions. But even if the trace cache is not
used because the code has not been executed at all (or for some time) the
processor might have problems. If an upcoming instruction is changed after it
has already entered the pipeline, the processor has to throw away a lot of work and
start all over again. There are even situations where most of the state of the
processor has to be tossed away.
It is highly advised to avoid SMC whenever possible. Memory is not such a
scarce resource anymore. It is better to write separate functions instead of
modifying one function according to specific needs. Maybe one day SMC
support can be made optional and we can detect exploit code trying to modify
code this way. If SMC absolutely has to be used, the write operations should
bypass the cache so as not to create problems with data in L1d that is needed in L1i.
3.5 Cache Miss Factors
We have already seen that when memory accesses miss the caches the costs
skyrocket. Sometimes this is not avoidable and it is important to understand the
actual costs and what can be done to mitigate the problem.
3.5.1 Cache and Memory Bandwidth
To get a better understanding of the capabilities of the processors we measure
the bandwidth available in optimal circumstances. This measurement is
especially interesting since different processor versions vary widely. This is why
this section is filled with the data of several different machines. The program to
measure performance uses the SSE instructions of the x86 and x86-64
processors to load or store 16 bytes at once. The working set is increased from
1kB to 512MB just as in our other tests, and we measure how many bytes per
cycle can be loaded or stored.
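A minimal sketch of the kind of inner loops used (names are illustrative; the real measurement program additionally times the loops, repeats them, and varies the working set size):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Read `size` bytes, 16 bytes per SSE load; `buf` must be 16-byte aligned
       and `size` a multiple of 16.  The result is returned so the compiler
       cannot optimize the loads away. */
    __m128i read_test(const char *buf, size_t size)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < size; i += 16)
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i)));
        return acc;
    }

    /* Write `size` bytes, 16 bytes per SSE store. */
    void write_test(char *buf, size_t size)
    {
        const __m128i v = _mm_set1_epi8(0x5a);
        for (size_t i = 0; i < size; i += 16)
            _mm_store_si128((__m128i *)(buf + i), v);
    }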
Figures: Pentium 4 Bandwidth; P4 Bandwidth with 2 Hyper-Threads; Core 2 Bandwidth
What is more astonishing than the read performance is the write and copy
performance. The write performance, even for small working set sizes, does not
ever rise above 4 bytes per cycle. This indicates that, in these Netburst
processors, Intel elected to use a Write-Through mode for L1d where the
performance is obviously limited by the L2 speed. This also means that the
performance of the copy test, which copies from one memory region into a
second, non-overlapping memory region, is not significantly worse. The
necessary read operations are so much faster and can partially overlap with the
write operations. The most noteworthy detail of the write and copy
measurements is the low performance once the L2 cache is not sufficient
anymore. The performance drops to 0.5 bytes per cycle! That means write
operations are slower than the read operations by a factor of ten. This means
optimizing those operations is even more important for the performance of the
program.
The interesting point is the write and copy performance for working set sizes
which would fit into L1d. As can be seen in the figure, the performance is the
same as if the data had to be read from the main memory. Both threads compete
for the same memory location and RFO messages for the cache lines have to be
sent. The problematic point is that these requests are not handled at the speed of
the L2 cache, even though both cores share the cache. Once the L1d cache is not
sufficient anymore modified entries are flushed from each core’s L1d into the
shared L2. At that point the performance increases significantly since now the
L1d misses are satisfied by the L2 cache and RFO messages are only needed
when the data has not yet been flushed. This is why we see a 50% reduction in
speed for these sizes of the working set. The asymptotic behavior is as expected:
since both cores share the same FSB each core gets half the FSB bandwidth
which means for large working sets each thread’s performance is about half that
of the single threaded case.
Because there are significant differences even between the processor versions of
one vendor it is certainly worthwhile looking at the performance of other
vendors’ processors, too. Figure 3.28 shows the performance of an AMD family
10h Opteron processor. This processor has 64kB L1d, 512kB L2, and 2MB of
L3. The L3 cache is shared between all cores of the processor.