Abstract
As CPU cores become both faster and more numerous, the limiting factor
for most programs is now, and will be for some time, memory access.
Hardware designers have come up with ever more sophisticated memory
handling and acceleration techniques, such as CPU caches, but these
cannot work optimally without some help from the programmer.
Unfortunately, neither the structure nor the cost of using the memory
subsystem of a computer or the caches on CPUs is well understood by
most programmers. This paper explains the structure of memory
subsystems in use on modern commodity hardware, illustrating why CPU
caches were developed, how they work, and what programs should do to
achieve optimal performance by utilizing them.
1. Types of RAM
1.1 Static RAM
The structure of a six-transistor SRAM cell is as follows. The core of the cell is
formed by the four transistors M1 to M4, which form two cross-coupled
inverters. They have two stable states, representing 0 and 1 respectively.
The state is stable as long as power on Vdd is available. If access to the
state of the cell is needed, the word access line WL is raised. This makes
the state of the cell immediately available for reading on the bit line BL and its
complement. If the cell state must be overwritten, the bit line and its complement
are first set to the desired values and then WL is raised. Since the outside drivers
are stronger than the four transistors M1 through M4, this allows the old
state to be overwritten.
Conclusion:
• one cell requires six transistors. There are variants with four transistors
but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word
access line WL is raised. The signal is as rectangular (changing quickly
between the two binary states) as other transistor-controlled signals.
• the cell state is stable, no refresh cycles are needed.
1.2 Dynamic RAM
Dynamic RAM is, in its structure, much simpler than static RAM. Figure
2.5 shows the structure of a usual DRAM cell design. All it consists of is
one transistor and one capacitor. This huge difference in complexity of
course means that it functions very differently than static RAM.
2. CPU Caches
CPUs are today much more sophisticated than they were only 25 years
ago. In those days, the frequency of the CPU core was at a level
equivalent to that of the memory bus. Memory access was only a bit
slower than register access. But this changed dramatically in the early 90s,
when CPU designers increased the frequency of the CPU core but the
frequency of the memory bus and the performance of RAM chips did not
increase proportionally. This is not due to the fact that faster RAM could
not be built, as explained in the previous section. It is possible but it is not
economical. RAM as fast as current CPU cores is orders of magnitude
more expensive than any dynamic RAM.
The minimum cache configuration corresponds to the architecture which
could be found in early systems which deployed CPU caches. The CPU core
is no longer directly connected to the main memory. All loads and stores
have to go through the cache. The
connection between the CPU core and the cache is a special, fast
connection. In a simplified representation, the main memory and the
cache are connected to the system bus which can also be used for
communication with other components of the system. We introduced the
system bus as “FSB” which is the name in use today; see section 2.2. In
this section we ignore the Northbridge; it is assumed to be present to
facilitate the communication of the CPU(s) with the main memory.
Modern CPUs have multiple levels of cache. We introduce here the nomenclature
we will use in the remainder of the document: L1d is the level 1 data cache,
L1i the level 1 instruction cache, and so on. Note that this is a schematic
description; the data flow in reality need not pass through any of the
higher-level caches on the way from the core to the main memory.
CPU designers have a lot of freedom
designing the interfaces of the caches. For programmers these design
choices are invisible.
When memory content is needed by the processor the entire cache line is
loaded into the L1d. The memory address for each cache line is computed
by masking the address value according to the cache line size. For a 64
byte cache line this means the low 6 bits are zeroed. The discarded bits
are used as the offset into the cache line. The remaining bits are in some
cases used to locate the line in the cache and as the tag. In practice an
address value is split into three parts. For a 32-bit address it might look as
follows, from most significant to least significant bits:
Tag (T bits) | Cache Set (S bits) | Offset (O bits)
With a cache line size of 2^O the low O bits are used as the offset into the
cache line. The next S bits select the "cache set". We will go into more
detail soon on why sets, and not single slots, are used for cache lines. For
now it is sufficient to understand there are 2^S sets of cache lines. This
leaves the top 32 − S − O = T bits which form the tag. These T bits are the
value associated with each cache line to distinguish all the aliases
which are cached in the same cache set. The S bits used to address the
cache set do not have to be stored since they are the same for all cache
lines in the same set.
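As a minimal sketch of this decomposition (the cache geometry here, 64-byte lines and 64 sets, is hypothetical and chosen only for illustration), the three parts can be extracted with simple shifts and masks:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6   /* O = log2(cache line size): 64-byte lines assumed */
    #define SET_BITS  6   /* S = log2(number of sets): 64 sets assumed        */

    int main(void)
    {
        uint32_t addr   = 0x12345678u;                    /* arbitrary 32-bit address */
        uint32_t offset = addr & ((1u << LINE_BITS) - 1);                /* low O bits */
        uint32_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);  /* next S bits */
        uint32_t tag    = addr >> (LINE_BITS + SET_BITS);                /* top T bits  */

        printf("offset=%u set=%u tag=0x%x\n", offset, set, tag);
        return 0;
    }

The offset selects the byte within the cache line, the set bits select which of the 2^S sets is searched, and the tag is what is compared against the tags stored alongside the lines in that set.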
To be able to load new data in a cache it is almost always first necessary
to make room in the cache. An eviction from L1d pushes the cache line
down into L2 (which uses the same cache line size). This of course means
room has to be made in L2. This in turn might push the content into L3
and ultimately into main memory. Each eviction is progressively more
expensive. What is described here is the model for an exclusive cache as
is preferred by modern AMD and VIA processors. Intel implements
inclusive caches, where each cache line in L1d is also present in L2.
Therefore evicting from L1d is much faster. With enough L2 cache the
disadvantage of wasting memory for content held in two places is
minimal and it pays off when evicting. A possible advantage of an
exclusive cache is that loading a new cache line only has to touch the L1d
and not the L2, which could be faster. The CPUs are allowed to manage
the caches as they like as long as the memory model defined for the
processor architecture is not changed. It is, for instance, perfectly fine for
a processor to take advantage of little or no memory bus activity and
proactively write dirty cache lines back to main memory. The wide
variety of cache architectures among x86 and x86-64 processors, between
manufacturers and even within the models of the same manufacturer, is
testament to the power of the memory model abstraction.
More sophisticated cache implementations allow another possibility to
happen. Assume a cache line is dirty in one processor’s cache and a
second processor wants to read or write that cache line. In this case the
main memory is out-of-date and the requesting processor must, instead,
get the cache line content from the first processor. Through snooping, the
first processor notices this situation and automatically sends the
requesting processor the data. This action bypasses main memory, though
in some implementations the memory controller is supposed to notice this
direct transfer and store the updated cache line content in main memory.
If the access is for writing, the first processor then invalidates its local
copy of the cache line. Over time a number of cache coherency protocols
have been developed. The most important is MESI, which we will
introduce in section 3.3.4. The outcome of all this can be summarized in a
few simple rules: a dirty cache line is not present in any other processor's
cache, and clean copies of the same cache line can reside in arbitrarily
many caches.
These are the actual access times measured in CPU cycles. It is interesting
to note that for the on-die L2 cache a large part (probably even the
majority) of the access time is caused by wire delays. This is a physical
limitation which can only get worse with increasing cache sizes. Only
process shrinking (for instance, going from 65nm for Merom to 45nm for
Penryn in Intel’s lineup) can improve those numbers.
The numbers in the table look high but, fortunately, the entire cost does
not have to be paid for each occurrence of the cache load and miss. Some
parts of the cost can be hidden. Today’s processors all use internal
pipelines of different lengths where the instructions are decoded and
prepared for execution. Part of the preparation is loading values from
memory (or cache) if they are transferred to a register. If the memory load
operation can be started early enough in the pipeline, it may happen in
parallel with other operations and the entire cost of the load might be
hidden. This is often possible for L1d; for some processors with long
pipelines for L2 as well. There are many obstacles to starting the memory
read early. It might be as simple as not having sufficient resources for the
memory access or it might be that the final address of the load becomes
available late as the result of another instruction. In these cases the load
costs cannot be hidden (completely).
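A hypothetical example of the latter obstacle is pointer chasing, where the address of each load is itself the result of the previous load; the function names below are purely illustrative:

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* The address of every load depends on the previous load's result,
       so the accesses cannot be started early or overlapped. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->value;
            n = n->next;
        }
        return sum;
    }

    /* All addresses are known up front from the base pointer and the index,
       so the loads can be issued early in the pipeline and much of the
       cache access cost can be hidden. */
    long sum_array(const long *a, long count)
    {
        long sum = 0;
        for (long i = 0; i < count; ++i)
            sum += a[i];
        return sum;
    }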
The relationship of all these values is that the cache size is:
cache line size × associativity × number of sets
The addresses are mapped into the cache by using:
O = log2 cache line size
S = log2 number of sets
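As a hypothetical worked example: a 32 kB cache with 64-byte cache lines and 8-way associativity has 32768 / (64 × 8) = 64 sets, so O = 6 and S = 6; on a 32-bit system the tag is then T = 32 − 6 − 6 = 20 bits.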
Figure: Cache Size vs Associativity (CL=32)
2.3.2 Measurements of Cache Effects
All the figures are created by measuring a program which can simulate
working sets of arbitrary size, read and write access, and sequential or
random access.
All entries are chained in a circular list using the n element, either in
sequential or random order. Advancing from one entry to the next
always uses the pointer, even if the elements are laid out sequentially.
The pad element is the payload and it can grow arbitrarily large. In
some tests the data is modified, in others the program only performs
read operations.
A working set of 2^N bytes contains 2^N / sizeof(struct l) elements.
Obviously sizeof(struct l) depends on the value of NPAD. For 32-bit
systems, NPAD=7 means the size of each array element is 32 bytes;
for 64-bit systems the size is 64 bytes.
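A sketch of the element structure consistent with this description (the definition in the actual measurement program may differ in its details):

    #ifndef NPAD
    #define NPAD 7   /* controls the amount of padding and thus the element size */
    #endif

    struct l {
        struct l *n;         /* pointer to the next element in the (circular) list */
        long int pad[NPAD];  /* payload; unused when only the chain is traversed   */
    };

With NPAD=7 this gives 4 + 7 × 4 = 32 bytes per element on a 32-bit system and 8 + 7 × 8 = 64 bytes on a 64-bit system, matching the sizes above.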
2.4 Instruction Cache
Not just the data used by the processor is cached; the instructions
executed by the processor are also cached. However, this cache is
much less problematic than the data cache. There are several reasons:
• The quantity of code which is executed depends on the size of the
code that is needed. The size of the code in general depends on the
complexity of the problem. The complexity of the problem is fixed.
• While the program’s data handling is designed by the programmer
the program’s instructions are usually generated by a compiler. The
compiler writers know about the rules for good code generation.
• Program flow is much more predictable than data access patterns.
Today’s CPUs are very good at detecting patterns. This helps with
prefetching.
• Code always has quite good spatial and temporal locality.
On CISC processors the decoding stage can also take some time. The x86
and x86-64 processors are especially affected. In recent years these
processors therefore do not cache the raw byte sequence of the
instructions in L1i but instead they cache the decoded instructions. L1i in
this case is called the “trace cache”. Trace caching allows the processor to
skip over the first steps of the pipeline in case of a cache hit which is
especially good if the pipeline stalled.
To achieve the best performance there are only a few rules related to the
instruction cache:
1. Generate code which is as small as possible. There are
exceptions when software pipelining for the sake of using pipelines
requires creating more code or where the overhead of using small code is
too high.
2. Help the processor make good prefetching decisions. This can
be done through code layout or with explicit prefetching, as sketched below.
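As a GCC-specific sketch of the second rule (the function is purely illustrative): branch hints such as __builtin_expect let the compiler move unlikely paths away from the hot code, which keeps the frequently executed instructions dense in L1i; compiling with -Os additionally asks the compiler to optimize for code size.

    /* GCC-specific branch hints; other compilers need different mechanisms. */
    #define likely(expr)   __builtin_expect(!!(expr), 1)
    #define unlikely(expr) __builtin_expect(!!(expr), 0)

    long sum_buffer(const long *buf, long len)
    {
        if (unlikely(buf == 0 || len <= 0))
            return -1;            /* rare error path, laid out away from the hot loop */

        long sum = 0;
        for (long i = 0; i < len; ++i)
            sum += buf[i];        /* hot path stays small and contiguous */
        return sum;
    }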
2.4.1 Self Modifying Code
SMC should in general be avoided. Though it is generally executed correctly,
there are boundary cases which are not, and it creates performance problems if
not done correctly. Obviously, code which is changed cannot be kept in the trace
cache which contains the decoded instructions. But even if the trace cache is not
used because the code has not been executed at all (or for some time) the
processor might have problems. If an upcoming instruction is changed after it
has already entered the pipeline, the processor has to throw away a lot of work and
start all over again. There are even situations where most of the state of the
processor has to be tossed away.
It is highly advised to avoid SMC whenever possible. Memory is not such a
scarce resource anymore. It is better to write separate functions instead of
modifying one function according to specific needs. Maybe one day SMC
support can be made optional and we can detect exploit code trying to modify
code this way. If SMC absolutely has to be used, the write operations should
bypass the cache so as not to create problems with data in L1d that is needed in L1i.
3.5 Cache Miss Factors
We have already seen that when memory accesses miss the caches the costs
skyrocket. Sometimes this is not avoidable and it is important to understand the
actual costs and what can be done to mitigate the problem.
3.5.1 Cache and Memory Bandwidth
To get a better understanding of the capabilities of the processors we measure
the bandwidth available in optimal circumstances. This measurement is
especially interesting since different processor versions vary widely. This is why
this section is filled with the data of several different machines. The program to
measure performance uses the SSE instructions of the x86 and x86-64
processors to load or store 16 bytes at once. The working set is increased from
1kB to 512MB just as in our other tests, and we measure how many bytes per
cycle can be loaded or stored.
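A minimal sketch of the kind of inner loops used (names are illustrative; the real measurement program additionally times the loops, repeats them, and varies the working set size):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Read `size` bytes, 16 bytes per SSE load; `buf` must be 16-byte aligned
       and `size` a multiple of 16.  The result is returned so the compiler
       cannot optimize the loads away. */
    __m128i read_test(const char *buf, size_t size)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < size; i += 16)
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i)));
        return acc;
    }

    /* Write `size` bytes, 16 bytes per SSE store. */
    void write_test(char *buf, size_t size)
    {
        const __m128i v = _mm_set1_epi8(0x5a);
        for (size_t i = 0; i < size; i += 16)
            _mm_store_si128((__m128i *)(buf + i), v);
    }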
Figures: Pentium 4 Bandwidth; P4 Bandwidth with 2 Hyper-Threads; Core 2 Bandwidth
What is more astonishing than the read performance is the write and copy
performance. The write performance, even for small working set sizes, does not
ever rise above 4 bytes per cycle. This indicates that, in these Netburst
processors, Intel elected to use a Write-Through mode for L1d where the
performance is obviously limited by the L2 speed. This also means that the
performance of the copy test, which copies from one memory region into a
second, non-overlapping memory region, is not significantly worse. The
necessary read operations are so much faster and can partially overlap with the
write operations. The most noteworthy detail of the write and copy
measurements is the low performance once the L2 cache is not sufficient
anymore. The performance drops to 0.5 bytes per cycle! That means write
operations are slower than the read operations by a factor of ten. This means
optimizing those operations is even more important for the performance of the
program.
The interesting point is the write and copy performance for working set sizes
which would fit into L1d. As can be seen in the figure, the performance is the
same as if the data had to be read from the main memory. Both threads compete
for the same memory location and RFO messages for the cache lines have to be
sent. The problematic point is that these requests are not handled at the speed of
the L2 cache, even though both cores share the cache. Once the L1d cache is not
sufficient anymore modified entries are flushed from each core’s L1d into the
shared L2. At that point the performance increases significantly since now the
L1d misses are satisfied by the L2 cache and RFO messages are only needed
when the data has not yet been flushed. This is why we see a 50% reduction in
speed for these sizes of the working set. The asymptotic behavior is as expected:
since both cores share the same FSB each core gets half the FSB bandwidth
which means for large working sets each thread’s performance is about half that
of the single threaded case.
Because there are significant differences even between the processor versions of
one vendor it is certainly worthwhile looking at the performance of other
vendors’ processors, too. Figure 3.28 shows the performance of an AMD family
10h Opteron processor. This processor has 64kB L1d, 512kB L2, and 2MB of
L3. The L3 cache is shared between all cores of the processor.