Memory Systems
Yoongu Kim
Carnegie Mellon University
Onur Mutlu
Carnegie Mellon University
1.1 Introduction
    1.1.1 Basic Concepts and Metrics
    1.1.2 Two Components of the Memory System
1.2 Memory Hierarchy
1.3 Managing the Memory Hierarchy
    1.3.1 Virtual vs. Physical Address Spaces
    1.3.2 Virtual Memory System
1.4 Caches
    1.4.1 Basic Design Considerations
    1.4.2 Logical Organization
    1.4.3 Management Policies
    1.4.4 Managing Multiple Caches
    1.4.5 Specialized Caches for Virtual Memory
1.5 Main Memory
    1.5.1 DRAM Organization
    1.5.2 Bank Operation
    1.5.3 Memory Request Scheduling
    1.5.4 Refresh
1.6 Current and Future Research Issues
    1.6.1 Caches
    1.6.2 Main Memory
1.7 Summary
1.1 Introduction
As shown in Figure 1.1, a computing system consists of three fundamental units: (i) units of computation that perform operations on data (e.g., processors, as we have seen in a previous chapter), (ii) units of storage (or memory) that store data to be operated on or archived, and (iii) units of communication that communicate data between computation units and storage units. The storage/memory units are usually categorized into two: (i) the memory system, which acts as a working storage area, storing the data that is currently being operated on by the running programs, and (ii) the backup storage system,
e.g., the hard disk, which acts as a backing store, storing data for a longer
term in a persistent manner. This chapter will focus on the “working storage
area” of the processor, i.e., the memory system.
[Figure 1.1: A computing system, consisting of units of computation, units of storage (memory), and units of communication.]
The memory system is the repository of data from where data can be
retrieved and updated by the processor (or processors). Throughout the op-
eration of a computing system, the processor reads data from the memory
system, performs computation on the data, and writes the modified data back
into the memory system – continuously repeating this procedure until all the
necessary computation has been performed on all the necessary data.
Bandwidth [accesses/time] = Parallelism [unitless] / Latency [time/access]

Bandwidth [bytes/time] = (Parallelism [unitless] / Latency [time/access]) × DataSize [bytes/access]
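To make these definitions concrete, consider a small worked example with assumed, illustrative numbers (8 concurrent accesses in flight, a latency of 10 ns per access, and 64 bytes transferred per access):

Bandwidth = 8 / (10 ns) = 0.8 accesses/ns = 8 × 10^8 accesses/s
Bandwidth = 8 × 10^8 accesses/s × 64 bytes/access = 51.2 GB/s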
An additional characteristic of a memory system is cost. The cost of a
memory system is the capital expenditure required to implement it. Cost
is closely related to the capacity and performance of the memory system:
increasing the capacity and performance of a memory system usually also
makes it more expensive.
[Figure: The memory system, consisting of caches and main memory.]
Note that the structure and operation of the hardware components that
make up the cache and main memory can be similar (in fact, they can be
exactly the same). However, the structure and operation of cache and memory
components are affected by (i) the function of the respective components
and (ii) the technology in which they are implemented. The main function
of caches is to store a small amount of data such that it can be accessed
quickly. Traditionally, caches have been designed using the SRAM technology
(so that they are fast), and main memory has been designed using the DRAM
technology (so that it has large capacity). As a result, caches and main memory
have evolved to be different in structure and operation, as we describe in later
sections (Section 1.4 and Section 1.5).
[Figure: The memory hierarchy. Moving down the hierarchy trades performance for capacity – e.g., an L1 cache holds tens of KB at roughly 1 ns, while an L2 cache holds hundreds of KB at under 5 ns.]
Latency_effective = P^hit_cache × Latency_cache + (1 − P^hit_cache) × Latency_main_memory

• 0 ≤ P^hit_cache ≤ 1
• Latency_cache ≪ Latency_main_memory
Similarly, for a three-level memory hierarchy with two caches and main
memory, the effective latency of the memory system can be expressed by the
following equation.
Latency_effective = P^hit_cache1 × Latency_cache1 + (1 − P^hit_cache1) × { P^hit_cache2 × Latency_cache2 + (1 − P^hit_cache2) × Latency_main_memory }

• 0 ≤ P^hit_cache1, P^hit_cache2 ≤ 1
• Latency_cache1 < Latency_cache2 ≪ Latency_main_memory
As both equations show, a high hit-rate in the cache implies a lower effective latency. In the best case, when all accesses hit in the cache (P^hit_cache = 1),
then the memory hierarchy has the lowest effective latency, equal to that of
the cache. While having a high hit-rate is always desirable, the actual value
of the hit-rate is determined primarily by (i) the cache’s size and (ii) the
processor’s memory access behavior. First, compared to a small cache, a large
cache is able to store more data and has a better chance that a given access
will hit in the cache. Second, if the processor tends to access a small set of
data over and over again, the cache can store those data so that subsequent
accesses will hit in the cache. In this case, a small cache would be sufficient
to achieve a high hit-rate.
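As a concrete illustration of the two-level equation above, the following C sketch computes the effective latency for assumed, illustrative parameters (a 1 ns cache and a 100 ns main memory); it is a minimal sketch, not a model of any particular system.

#include <stdio.h>

/* Effective latency of a two-level hierarchy (cache + main memory),
 * following the equation given earlier in this section. */
static double effective_latency(double p_hit, double lat_cache, double lat_mem)
{
    return p_hit * lat_cache + (1.0 - p_hit) * lat_mem;
}

int main(void)
{
    double lat_cache = 1.0;    /* assumed cache latency, in ns       */
    double lat_mem   = 100.0;  /* assumed main memory latency, in ns */

    /* With a 90% hit-rate: 0.9*1 + 0.1*100 = 10.9 ns. */
    printf("P_hit = 0.9 -> %.1f ns\n", effective_latency(0.9, lat_cache, lat_mem));
    /* With a perfect hit-rate, the effective latency equals the cache latency. */
    printf("P_hit = 1.0 -> %.1f ns\n", effective_latency(1.0, lat_cache, lat_mem));
    return 0;
}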
Fortunately, many computer programs – that the processor executes –
access the memory system in this manner. In other words, many computer
programs exhibit locality in their memory access behavior. Locality exists in
two forms: temporal locality and spatial locality. First, given a piece of data that
has been accessed, temporal locality refers to the phenomenon (or memory
access behavior) in which the same piece of data is likely to be accessed again
in the near future. Second, given a piece of data that has been accessed,
spatial locality refers to the phenomenon (or memory access behavior) in
which neighboring pieces of data (i.e., data at nearby addresses) are likely
to be accessed in the near future. Thanks to both temporal/spatial locality,
the cache – and, more generally, the memory hierarchy – is able to reduce the
effective latency of the memory system.
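Both forms of locality can be seen even in a trivial piece of code. In the illustrative C fragment below, the repeated accesses to the variable sum exhibit temporal locality, while the sequential walk over the array exhibits spatial locality.

/* Illustrative only: a loop whose memory accesses exhibit locality. */
int sum_array(const int *array, int n)
{
    int sum = 0;            /* reused on every iteration: temporal locality   */
    for (int i = 0; i < n; i++)
        sum += array[i];    /* consecutive nearby addresses: spatial locality */
    return sum;
}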
perfectly valid for the first memory system, the same address is invalid for
the second memory system, since it exceeds the maximum bound (1 MB) of
the address space. As a result, if the address space of the memory system is
directly exposed to the program, the software programmer can never be sure
which addresses are valid and can be used to store data for the program she
is composing.
Second, when a computer program is being composed, the software pro-
grammer has no way of knowing which other programs will run simultaneously
with the program. For example, when the user runs the program, it may run
on the same computer as many other different programs, all of which may hap-
pen to utilize the same address (e.g., address 0) to store a particular piece of
their data. In this case, when one program modifies the data at that address, it
overwrites another program’s data that was stored at the same address, even
though it should not be allowed to. As a result, if the address space of the
memory system is directly exposed to the program, then multiple programs
may overwrite and corrupt each other’s data, leading to incorrect execution
for all of the programs.
Intel uses 48-bit virtual addresses – i.e., a 256 TB virtual address space (2^48 = 256 T) [13].
of the program: the operating system maps the virtual address to a physical
address that is free – i.e., a physical address that is not yet mapped to another
virtual address. Once this mapping has been established, it is memorized by
the operating system and used later for “translating” any subsequent access
to that virtual address to its corresponding physical address (where the data
is stored).
[Figure 1.4: The virtual address space and the physical address space are each divided into 4 KB pages; the operating system maps virtual pages to physical pages.]
First, most virtual memory systems map a virtual address when it is ac-
cessed for the very first time – i.e., on-demand. In other words, if a virtual
address is never accessed, it is never mapped to a physical address. Although
the virtual address space is extremely large (e.g., 256 TB), in practice, only
a very small fraction of it is actually utilized by most programs. Therefore,
mapping the entirety of virtual address space to the physical address space
is wasteful, because the overwhelming majority of the virtual addresses will
never be accessed. Moreover, the virtual address space is much larger than the physical address space, so it is not possible to map every virtual address in the first place.
Second, a virtual memory system must adopt a granularity at which it
maps addresses from the virtual address space to the physical address space.
For example, if the granularity is set equal to 1 byte, then the virtual memory
system evenly divides the virtual/physical address into 1-byte virtual/physical
“chunks,” respectively. Then, the virtual memory system can arbitrarily map
a 1-byte virtual chunk to any 1-byte physical chunk, as long as the physical
chunk is free. However, such a fine division of the address spaces into large
numbers of small chunks has a major disadvantage: it increases the complex-
ity of the virtual memory system. As we recall, once a mapping between a
pair of virtual/physical chunks is established, it must be memorized by the
virtual memory system. Hence, large numbers of virtual/physical chunks im-
ply a large number of possible mappings between the two, which increases the
bookkeeping overhead of memorizing the mappings. To reduce this overhead, most virtual memory systems coarsely divide the address spaces into a smaller number of larger chunks; such a chunk is called a page, and its typical size is 4 KB. As shown in Figure 1.4, a 4 KB chunk of the virtual address
space is referred to as a virtual page, whereas a 4 KB chunk of the physical
address is referred to as a physical page (alternatively, a frame). Every time a
virtual page is mapped to a physical page, the operating system keeps track
of the mapping by storing it in a data structure called the page table.
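To make the bookkeeping concrete, the C sketch below shows, under simplifying assumptions (4 KB pages and a single flat array standing in for the page table), how a virtual address would be split into a virtual page number and an offset and then translated into a physical address. Real page tables are multi-level structures, and the names used here (page_table, NUM_VIRTUAL_PAGES) are hypothetical placeholders.

#include <stdint.h>

#define PAGE_SIZE          4096u        /* 4 KB pages, as in the text            */
#define NUM_VIRTUAL_PAGES  (1u << 20)   /* assumed: 2^20 pages, i.e., 4 GB of
                                           virtual address space for this sketch */

/* Hypothetical flat page table: entry i holds the physical page number that
 * virtual page i is mapped to (handling of unmapped pages is omitted). */
static uint64_t page_table[NUM_VIRTUAL_PAGES];

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr / PAGE_SIZE;  /* virtual page number         */
    uint64_t offset = vaddr % PAGE_SIZE;  /* byte offset within the page */
    uint64_t ppn    = page_table[vpn];    /* look up the stored mapping  */
    return ppn * PAGE_SIZE + offset;      /* resulting physical address  */
}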
Third, as a program accesses a new virtual page for the very first time,
the virtual memory system maps the virtual page to a free physical page.
However, if this happens over and over, the physical address space may become
exhausted – i.e., none of the physical pages are free since all of them have
been mapped to virtual pages. At this point, the virtual memory system must
“create” a free physical page by reclaiming one of the mapped physical pages.
The virtual memory system does so by evicting a physical page’s data from
main memory and “un-mapping” the physical page from its virtual page.
Once a free physical page is created in such a manner, the virtual memory
system can map it to a new virtual page. More specifically, the virtual memory
system takes the following three steps in order to reclaim a physical page and
map it to a new virtual page. First, the virtual memory system selects the
physical page that will be reclaimed, i.e., the victim. The selection process of
determining the victim is referred to as the page replacement policy [5]. While
the simplest policy is randomly selecting any physical page, such a policy may
significantly degrade the performance of the computing system. For example,
if a very frequently accessed physical page is selected as the victim, then a
future access to that physical page would be served by the hard disk. However,
since a hard disk is extremely slow compared to main memory, the access
would incur a very large latency. Instead, virtual memory systems employ
more sophisticated page replacement policies that try to select a physical
page that is unlikely to be accessed in the near future, in order to minimize
the performance degradation. Second, after a physical page has been selected
as the victim, the virtual memory system decides whether the page’s data
should be migrated out of main memory and into the hard disk. If the page’s
data had been modified by the program while it was in main memory, then the
page must be written back into the hard disk – otherwise, the modifications
that were made to the page’s data would be lost. On the other hand, if the
page’s data had not been modified, then the page can simply be evicted from
main memory (without being written into the hard disk) since the program
can always retrieve the page’s original data from the hard disk. Third, the
operating system updates the page table so that the virtual page (that had
previously mapped to the victim) is now mapped to the hard disk instead of
a physical page in main memory. Finally, now that a physical page has been
reclaimed, the virtual memory system maps the free physical page to a new
virtual page and updates the page table accordingly.
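The reclamation procedure described above can be summarized in a short C-style sketch. All of the helper functions below (select_victim, write_to_disk, and so on) are hypothetical placeholders for operating-system internals, not an actual interface.

/* Simplified sketch: reclaim a physical page and map it to a new virtual page. */
struct phys_page { int dirty; /* modified while resident in main memory? */ };

extern struct phys_page *select_victim(void);     /* page replacement policy     */
extern void write_to_disk(struct phys_page *p);   /* write modified data to disk */
extern void unmap_page(struct phys_page *p);      /* page table: victim's virtual
                                                     page now maps to the disk   */
extern void map_page(struct phys_page *p, unsigned long new_vpn);

void reclaim_and_map(unsigned long new_vpn)
{
    struct phys_page *victim = select_victim();  /* Step 1: choose the victim           */
    if (victim->dirty)                           /* Step 2: write back only if modified */
        write_to_disk(victim);
    unmap_page(victim);                          /* Step 3: update the page table       */
    map_page(victim, new_vpn);                   /* finally, reuse the freed page       */
}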
After the victim has been evicted from main memory, it would be best
if the program does not access the victim’s data ever again. This is because
accessing the victim’s data incurs the large latency of the hard disk where
it is stored. However, if the victim is eventually accessed, then the virtual
memory system brings the victim’s data back from the hard disk and places
it into a free physical page in main memory. Unfortunately, if main memory
has no free physical pages remaining at this point, then another physical page
must be chosen as a victim and be evicted from main memory. If this happens
repeatedly, different physical pages are forced to ping-pong back and forth
between main memory and hard disk. This phenomenon, referred to as swap-
ping or thrashing, typically occurs when the capacity of the main memory is
not large enough to accommodate all of the data that a program is actively
accessing (i.e., its working set). When a computing system experiences swap-
ping, its performance degrades detrimentally since it must constantly access
the extremely slow hard disk instead of the faster main memory.
1.4 Caches
Generally, a cache is any structure that stores data that is likely to be
accessed again (e.g., frequently accessed data or recently accessed data) in
order to avoid the long latency operation required to access the data from
a much slower structure. For example, web servers on the internet typically
employ caches that store the most popular photographs or news articles so
that they can be retrieved quickly and sent to the end user. In the context
of the memory system, a cache refers to a small but fast component of the
memory hierarchy that stores the most recently (or most frequently) accessed
data among all data in the memory system [44, 26]. Since a cache is designed
to be faster than main memory, data stored in the cache can be accessed
quickly by the processor. The effectiveness of a cache depends on whether a
large fraction of the memory accesses “hit” in the cache and, as a result,
are able to avoid being served by the much slower main memory. Despite its
small capacity, a cache can still achieve a high hit-rate thanks to the fact that
many computer programs exhibit locality (Section 1.2) in their memory access
behavior: data that have been accessed in the past are likely to be accessed
again in the future. That is why a small cache, whose capacity is much less
than that of main memory, is able to serve most of the memory accesses as
long as the cache stores the most recently (or most frequently) accessed data.
every cache block has its own tag where the address of the data (not the data
itself) is stored. When the processor accesses the cache for a piece of data at a
particular address, it searches the cache for the cache block whose tag matches
the address. If such a cache block exists, then the processor accesses the data
contained in the cache block – as explained earlier, this is called a cache hit.
In addition to its address, a cache block's tag may also store other types of information about the cache block: for example, whether the cache block is empty, whether it has been written to, or how recently it has been accessed. These topics and more will soon be discussed in this
section.
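The tag and the per-block metadata just described can be pictured as a small C structure. The sketch below is illustrative; the exact fields and their widths vary across real cache designs.

#include <stdint.h>

/* One cache block: the stored data plus its tag and per-block metadata. */
struct cache_block {
    uint64_t tag;        /* which address (chunk) the stored data belongs to   */
    uint8_t  valid;      /* is the block empty (0) or does it hold data (1)?   */
    uint8_t  dirty;      /* has the block been written to? (see Section 1.4.4) */
    uint32_t lru_stamp;  /* how recently the block was accessed (for LRU)      */
    uint8_t  data[64];   /* the cached 64-byte chunk of data                   */
};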
[Figure: Logical organization of a cache. A cache consists of multiple cache blocks; each cache block has a tag that stores information about the block, e.g., the address of its data.]
fully-associative cache is the best at efficiently utilizing all the cache blocks
in the cache. However, the downside of a fully-associative cache is that the
processor must exhaustively search all cache blocks whenever it accesses the
cache, since any one of the cache blocks may contain the data that the pro-
cessor wants. Unfortunately, searching through all cache blocks not only takes
a long time (leading to high access latency), but also wastes energy.
[Figure 1.6: The address space (0 to 8 GB) is divided into 64-byte “chunks” (at addresses 0, 64B, 128B, 192B, ...), each of which can map to a cache block in the cache; the lower part of the figure illustrates the fully-associative, direct-mapped, and set-associative organizations.]
On the other hand, a cache that provides the least freedom in mapping a
chunk to a cache block is said to have a direct-mapped organization (Figure 1.6,
lower-middle). When a new chunk is brought in from main memory, a direct-
mapped cache allows the chunk to be placed in only a specific cache block.
For example, let us assume a 64 KB cache consisting of 1024 cache blocks of 64 bytes each. A simple implementation of a direct-mapped cache would map every 1024th 64-byte chunk of the address space to the same cache block
– e.g., chunks at address 0, address 64K, address 128K, etc. would all map to
the 0th cache block in the cache. But if the cache block is already occupied
with a different chunk, then the old chunk must first be evicted before a new
chunk can be stored in the cache block. This is referred to as a conflict – i.e.,
when two different chunks (corresponding to two different addresses) contend
with each other for the same cache block. In a direct-mapped cache, conflicts
can occur at one cache block even when all other cache blocks are empty. In
this regard, a direct-mapped cache is the worst at efficiently utilizing all the
cache blocks in the cache. However, the upside of a direct-mapped cache is
that the processor can simply search only one cache block to quickly determine
whether the cache contains the data it wants. Hence, the access latency of a
direct-mapped cache is low.
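For the 64 KB direct-mapped cache of the example above (1024 cache blocks of 64 bytes each), the cache block that an address maps to can be computed with simple integer arithmetic, as in the following illustrative C sketch.

#include <stdint.h>

#define BLOCK_SIZE  64u     /* bytes per cache block                   */
#define NUM_BLOCKS  1024u   /* 64 KB cache divided into 64-byte blocks */

/* Which cache block does an address map to in a direct-mapped cache? */
static unsigned dm_block_index(uint64_t addr)
{
    uint64_t chunk = addr / BLOCK_SIZE;     /* which 64-byte chunk of memory       */
    return (unsigned)(chunk % NUM_BLOCKS);  /* chunks at 0, 64K, 128K, ... collide */
}

/* The tag stored alongside the block, used to detect whether an access hits. */
static uint64_t dm_tag(uint64_t addr)
{
    return (addr / BLOCK_SIZE) / NUM_BLOCKS;
}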
As a middle ground between the two organizations (fully-associative vs.
direct-mapped), there is a third alternative called the set-associative organi-
zation [12] (Figure 1.6, lower-right), which allows a chunk to map to one of
multiple (but not all) cache blocks within a cache. If a cache has a total of
N cache blocks, then a fully-associative organization would map a chunk to
any of the N cache blocks, while a direct-mapped organization would map a
chunk to only 1 specific cache block. A set-associative organization, in con-
trast, is based on the concept of sets, which are small non-overlapping groups
of cache blocks. A set-associative cache is similar to a direct-mapped cache
in that a chunk is mapped to only one specific set. However, a set-associative
cache is also similar to a fully-associative cache in that the chunk can map to
any cache block that belongs to the specific set. For example, let us assume
a set-associative cache in which each set consists of 2 cache blocks, which is
called a 2-way set-associative cache. Initially, such a cache maps a chunk to
one specific set out of all N/2 sets. Then, within the set, the chunk can map to
either of the 2 cache blocks that belong to the set. More generally, a W -way
set-associative cache (1 < W < N ) directly maps a chunk to one specific set,
while fully-associatively mapping a chunk to any of the W cache blocks within
the set. For a set-associative cache, the value of W is fixed when the cache is
designed and cannot be changed afterwards. However, depending on the value
of W , a set-associative cache can behave similarly to a fully-associative cache
(for large values of W ) or a direct-mapped cache (for small values of W ). In
fact, an N -way set-associative cache degenerates into a fully-associative cache,
whereas a 1-way set-associative cache degenerates into a direct-mapped cache.
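The index computation generalizes naturally to a W-way set-associative cache: an address first selects one of the N/W sets, and the chunk may then reside in any of the W blocks of that set. The brief C sketch below reuses the BLOCK_SIZE and NUM_BLOCKS constants from the direct-mapped sketch above and is, again, only illustrative.

#define ASSOC     2u                       /* W: cache blocks per set (2-way here) */
#define NUM_SETS  (NUM_BLOCKS / ASSOC)     /* N/W sets in total                    */

/* Which set does an address map to in a W-way set-associative cache? */
static unsigned sa_set_index(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_SIZE) % NUM_SETS);
}
/* On a lookup, all ASSOC blocks of the selected set are searched for a tag
 * match; ASSOC = NUM_BLOCKS degenerates to fully-associative, ASSOC = 1 to
 * direct-mapped. */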
why always-allocate is one of the most popular allocation policies: for every
cache miss, the always-allocate policy populates an empty cache block with
the new chunk. (On the other hand, a different allocation policy may be more
discriminative and prevent certain chunks from being allocated in the cache.)
However, when the cache has no empty cache blocks left, a new chunk cannot
be stored in the cache unless the cache “creates” an empty cache block by
reclaiming one of the occupied cache blocks. The cache does so by evicting the
data stored in an occupied cache block and replacing it with the new chunk,
as described next.
Second, when the cache does not have an empty cache block where it can
store a new chunk, the cache’s replacement policy selects one of the occupied
cache blocks to evict, i.e., the victim cache block. The replacement policy
is invoked when a new chunk is brought into the cache. Depending on the
cache’s logical organization, the chunk may map to one or more cache blocks.
However, if all such cache blocks are already occupied, the replacement policy
must select one of the occupied cache blocks to evict from the cache. For a
direct-mapped cache, the replacement policy is trivial: since a chunk can be
mapped to only one specific cache block, then there is no choice but to evict
that specific cache block if it is occupied. Therefore, a replacement policy
applies only to set-associative or fully-associative caches, where a chunk can
potentially be mapped to one of multiple cache blocks – any one of which may
become the victim if all of those cache blocks are occupied. Ideally, the re-
placement policy should select the cache block that is expected to be accessed
the farthest away in the future, such that evicting the cache block has the
least impact on the cache’s hit-rate [3]. That is why one of the most common
replacement policies is the LRU (least-recently-used) policy, in which the cache
block that has been the least recently accessed is selected as the victim. Due
to the principle of locality, such a cache block is less likely to be accessed in the
future. Under the LRU policy, the victim is the least recently accessed cache
block among (i) all cache blocks within a set (for a set-associative cache) or
(ii) among all cache blocks within the cache (for a fully-associative cache).
To implement the LRU policy, the cache must keep track of each block in
terms of the last time when it was accessed. A similar replacement policy is
the LFU (least-frequently-used) policy, in which the cache block that has been
the least frequently accessed is selected as the victim. We refer the reader to
the following works for more details on cache replacement policies: Liptay [26]
and Qureshi et al. [36, 39].
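A minimal sketch of LRU victim selection within one set, assuming the hypothetical cache_block structure shown earlier (with its lru_stamp field updated on every access):

/* Pick the LRU victim within a set: the block whose last access
 * (lru_stamp) is the oldest. Assumes all blocks in the set are occupied. */
static unsigned lru_victim(const struct cache_block *set, unsigned assoc)
{
    unsigned victim = 0;
    for (unsigned i = 1; i < assoc; i++)
        if (set[i].lru_stamp < set[victim].lru_stamp)
            victim = i;
    return victim;
}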
Until now, we have discussed the design issues and management policies for
a single cache. However, as described previously, a memory hierarchy typically
consists of more than just one cache. In the following, we discuss the policies
that govern how multiple caches within the memory hierarchy interact with
each other: (i) inclusion policy, (ii) write handling policy, and (iii) partitioning
policy.
First, the memory hierarchy may consist of multiple levels of caches in
addition to main memory. As the lowest level, main memory always contains
the superset of all data stored in any of the caches. In other words, main
memory is inclusive of the caches. However, the same relationship may not
hold between one cache and another cache depending on the inclusion pol-
icy employed by the memory system [2]. There are three different inclusion
policies: (i) inclusive, (ii) exclusive, and (iii) non-inclusive. First, in the in-
clusive policy, a piece of data in one cache is guaranteed to be also found in
all higher levels of caches. Second, in the exclusive policy, a piece of data in
one cache is guaranteed not to be found in any of the higher levels of caches. Third,
in the non-inclusive policy, a piece of data in one cache may or may not be
found in higher levels of caches. Among the three policies, the inclusive and
exclusive policies are opposites of each other, while all other policies between
the two are categorized as non-inclusive. On the one hand, the advantage of
the exclusive policy is that it does not waste cache capacity since it does not
store multiple copies of the same data in all of the caches. On the other hand,
the advantage of the inclusive policy is that it simplifies searching for data
when there are multiple processors in the computing system. For example, if
one processor wants to know whether another processor has the data it needs,
it does not need to search all levels of caches of that other processor, but
instead search only the largest cache. (This is related to the concept of cache
coherence which is not covered in this chapter.) Lastly, the advantage of the
non-inclusive policy is that it does not require the effort to maintain a strict
inclusive/exclusive relationship between caches. For example, when a piece of
data is inserted into one cache, inclusive or exclusive policies require that the
same piece of data be inserted into or evicted from other levels of caches. In
contrast, the non-inclusive policy does not have this requirement.
Second, when the processor writes new data into a cache block, the data
stored in the cache block is modified and becomes different from the data
that was originally brought into the cache. While the cache contains the newest
copy of the data, all lower levels of the memory hierarchy (i.e., caches and main
memory) still contain an old copy of the data. In other words, when a write
access hits in a cache, a discrepancy arises between the modified cache block
and the lower levels of the memory hierarchy. The memory system resolves
this discrepancy by employing a write handling policy. There are two types
of write handling policies: (i) write-through and (ii) write-back. First, in a
write-through policy, every write access that hits in the cache is propagated
down to the lowest levels of the memory hierarchy. In other words, when a
cache at a particular level is modified, the same modification is made for
all lower levels of caches and for main memory. The advantage of the write-
through policy is that it prevents any data discrepancy from arising in the first
place. But, its disadvantage is that every write access is propagated through
the entire memory hierarchy (wasting energy and bandwidth), even when
the write access hits in the cache. Second, in a write-back policy, a write
access that hits in the cache modifies the cache block at only that cache,
without being propagated down the rest of the memory hierarchy. In this
case, however, the cache block contains the only modified copy of the data,
which is different from the copies contained in lower levels of caches and main
memory. To signify that the cache block contains modified data, a write-back
cache must have a dirty flag in the tag of each cache block: when set to
‘1’, the dirty flag denotes that the cache block contains modified data. Later
on, when the cache block is evicted from the cache, it must be written into
the immediately lower level in the memory hierarchy, where the dirty flag is
again set to ‘1’. Eventually, through a cascading series of evictions at multiple
levels of caches, the modified data is propagated all the way down to main
memory. The advantage of the write-back policy is that it can prevent write
accesses from always being written into all levels of the memory hierarchy –
thereby conserving energy and bandwidth. Its disadvantage is that it slightly
complicates the cache design since it requires additional dirty flags and special
handling when modified cache blocks are evicted.
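The write-back policy described above boils down to two small rules: set the dirty flag on a write hit, and write the block to the next lower level only when a dirty block is evicted. The C sketch below reuses the hypothetical cache_block structure from earlier; write_to_next_level is a placeholder, not a real interface.

extern void write_to_next_level(const struct cache_block *blk);  /* hypothetical */

/* On a write hit: modify the block locally and mark it dirty. */
static void on_write_hit(struct cache_block *blk, const uint8_t *new_data)
{
    for (unsigned i = 0; i < sizeof blk->data; i++)
        blk->data[i] = new_data[i];
    blk->dirty = 1;               /* the only up-to-date copy now lives here */
}

/* On eviction: propagate the data downward only if it was modified. */
static void on_evict(struct cache_block *blk)
{
    if (blk->dirty)
        write_to_next_level(blk);
    blk->valid = 0;               /* the block is now free for a new chunk   */
}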
Third, a cache at one particular level may be partitioned into two smaller
caches, each of which is dedicated to one of two different types of data: (i) instructions and (ii) data. Instructions are a special type of data that tells the computer
how to manipulate (e.g., add, subtract, move) other data. Having two separate
caches (i.e., an instruction cache and a data cache) has two advantages. First,
it prevents one type of data from monopolizing the cache. While the processor
needs both types of data to execute a program, if the cache is filled with only
one type of data, the processor may need to access the other type of data from
lower levels of the memory hierarchy, thereby incurring a large latency. Second,
it allows each of the caches to be placed closer to the processor – lowering
the latency to supply instructions and data to the processor. Typically, one
part of the processor (i.e., the instruction fetch engine) accesses instructions,
while another part of the processor (i.e., the data fetch engine) accesses non-
instruction data. In this case, the two caches can each be co-located with
the part of the processor that accesses its data – resulting in lower latencies
and potentially higher operating frequency for the processor. For this reason,
only the highest level of the cache in the memory hierarchy, which is directly
accessed by the processor, is partitioned into an instruction cache and a data
cache.
that hit in the TLB. Essentially, a TLB is a cache that caches the parts of the
page table that are recently used by the processor.
these three, only the data bus is bi-directional since the memory controller
can both send data to and receive data from the DRAM chips, whereas the address and
command buses are uni-directional since only the memory controller sends the
address and command to the DRAM chips.
[Figure: (a) Bank: high-level view. (b) Bank: low-level view, showing the row-decoder, wordlines, bitlines, sense-amplifiers, and the row-buffer. (c) Cell, consisting of an access transistor and a capacitor connected to a bitline.]
connect a sense-amplifier to any of the cells in the same column. A wire called
the wordline (one for each row) determines whether or not the corresponding
row of cells is connected to the bitlines.
1. ACTIVATE (issued with a row address): Load the entire row into the
row-buffer.
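Although the surrounding description of the DRAM commands is abbreviated here, the standard access sequence can be sketched as follows: a request to the row that is already in the row-buffer (a row-buffer hit) needs only a column command (READ or WRITE), whereas a request to a different row must first PRECHARGE the bank and then ACTIVATE the new row. The C sketch below is illustrative; the command-issuing helpers and the eight-bank assumption are hypothetical.

extern void issue_precharge(int bank);
extern void issue_activate(int bank, int row);
extern void issue_read(int bank, int col);       /* column command (READ) */

static int open_row[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };  /* -1: no open row */

/* Serve one read request to (bank, row, col). */
static void serve_request(int bank, int row, int col)
{
    if (open_row[bank] != row) {         /* row-buffer miss (or bank closed) */
        if (open_row[bank] != -1)
            issue_precharge(bank);       /* close the currently open row     */
        issue_activate(bank, row);       /* load the requested row           */
        open_row[bank] = row;
    }
    issue_read(bank, col);               /* row-buffer hit path: just a READ */
}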
1.5.4 Refresh
A DRAM cell stores data as charge on a capacitor. Over time, this charge
steadily leaks, causing the data to be lost. That is why DRAM is named
“dynamic” RAM, since its charge changes over time. In order to preserve
data integrity, the charge in each DRAM cell must be periodically restored or
refreshed. DRAM cells are refreshed at the granularity of a row by reading it
out and writing it back in – which is equivalent to issuing an ACTIVATE and
a PRECHARGE to the row in succession.
In modern DRAM chips, all DRAM rows must be refreshed once every
64ms [16], which is called the refresh interval. The memory controller inter-
nally keeps track of time to ensure that it refreshes all DRAM rows before their
refresh interval expires. When the memory controller decides to refresh the
DRAM chips, it issues a REFRESH command. Upon receiving a REFRESH
command, a DRAM chip internally refreshes a few of its rows by activating
and precharging them. A DRAM chip refreshes only a few rows at a time since
it has a very large number of rows and refreshing all of them would incur a very
large latency. Since a DRAM chip cannot serve any memory requests while it
is being refreshed, it is important that the refresh latency is kept short such
that no memory request is delayed for too long. So instead of refreshing all
rows at the end of each 64ms interval, throughout a given 64ms time interval,
the memory controller issues many REFRESH commands, each of which trig-
gers the DRAM chip to refresh only a subset of rows. However, the memory
controller ensures that REFRESH commands are issued frequently enough
such that all rows eventually do become refreshed before 64ms has passed.
For more information on DRAM refresh (and methods to reduce its effect on
performance and energy), we refer the reader to Liu et al. [27].
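As a rough illustration of how the refreshes are spread out in time: if a chip is refreshed with 8192 REFRESH commands per 64 ms interval (a typical figure for DDR3, used here only as an assumption for the arithmetic), then the memory controller issues one REFRESH command approximately every

64 ms / 8192 = 7.8125 µs,

with each command refreshing only a small batch of rows.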
1.6.1 Caches
Efficient Utilization. To better utilize the limited capacity of a cache, many
replacement policies have been proposed to improve upon the simple LRU
policy. The LRU policy is not always the most beneficial one, because different memory access patterns have different amounts of locality. For example, in the
LRU policy, the most-recently-accessed data is always allocated in the cache
even if the data has low-locality (i.e., unlikely to be ever accessed again). To
make matters even worse, the low-locality data is unnecessarily retained in
the cache for a long time. This is because the LRU policy always inserts data
into the cache as the most-recently-accessed and evicts the data only after
it becomes the least-recently-accessed. As a solution, researchers have been
working to develop sophisticated replacement policies using a combination of
three approaches. First, when a cache block is allocated in the cache, it should
not always be inserted as the most-recently-accessed. Instead, cache blocks
with low-locality should be inserted as the least-recently-accessed, so that
they are quickly evicted from the cache to make room for other cache blocks
that may have more locality (e.g., [15, 35, 41]). Second, when a cache block
the system can exploit the advantages of each technology while hiding the
disadvantages of each [29, 45].
Quality-of-Service. Similar to the last-level cache, main memory is also
shared by all the cores in a processor. When the cores contend to access main
memory, their accesses may interfere with each other and cause significant de-
lays. In the worst case, a memory-intensive core may continuously access main
memory in such a way that all the other cores are denied service from main
memory [30]. This would detrimentally degrade the performance of not only
those particular cores, but also of the entire computing system. To address
this problem, researchers have proposed mechanisms that provide quality-of-
service to each core when accessing shared main memory. For example, mem-
ory request scheduling algorithms can ensure that memory requests from all
the cores are served in a fair manner [18, 19, 33, 34]. Another approach is for
the user to explicitly specify memory service requirements of a program to the
memory controller so that the memory scheduling algorithm can subsequently
guarantee that those requirements are met [42]. Other approaches to quality-
of-service include mechanisms proposed to map the data of those applications
that significantly interfere with each other to different memory channels [31]
and mechanisms proposed to throttle down the request rate of the processors
that cause significant interference to other processors [8]. Request scheduling
mechanisms that prioritize bottleneck threads in parallel applications have
also been proposed [10]. The QoS problem gets exacerbated when the pro-
cessors that share the main memory are different, e.g., when main memory
is shared by a CPU consisting of multiple processors and a GPU, and recent
research has started to examine solutions to this [1]. Finally, providing QoS
and high performance in the presence of different types of memory requests
from multiple processing cores, such as speculative prefetch requests that aim
to fetch the data from memory before it is needed, is a challenging problem
that recent research has started providing solutions for [22, 23, 11, 9].
1.7 Summary
The memory system is a critical component of a computing system. It
serves as the repository of data from where the processor (or processors) can
access data. An ideal memory system would have both high performance and
large capacity. However, there exists a fundamental trade-off relationship be-
tween the two: it is possible to achieve either high performance or large capac-
ity, but not both at the same time in a cost-effective manner. As a result of
the trade-off, a memory system typically consists of two components: caches
(which are small but relatively fast-to-access) and main memory (which is
large but relatively slow-to-access). Multiple caches and a single main mem-
ory, all of which strike a different balance between performance and capacity,
are combined to form a memory hierarchy. The goal of the memory hierarchy
is to provide the high performance of a cache at the large capacity of main
memory. The memory system is co-operatively managed by both the operating
system and the hardware.
This chapter provided an introductory level description of memory sys-
tems employed in modern computing systems, focusing especially on how the
memory hierarchy, consisting of caches and main memory, is organized and
managed. The memory system is expected to become an even more critical bottleneck
going into the future, as described in Section 1.6. Many problems abound, yet
the authors of this chapter remain confident that promising solutions, some of
which are also described in Section 1.6, will also abound and hopefully prevail.
Bibliography
[11] Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N Patt. Co-
ordinated Control of Multiple Prefetchers in Multi-Core Systems. In
International Symposium on Microarchitecture, 2009.
[12] M. D. Hill and A. J. Smith. Evaluating Associativity in CPU Caches.
IEEE Trans. Comput., 38(12), Dec 1989.
[13] Intel. Intel 64 and IA-32 Architectures Software Developers Manual, Aug
2012.
[14] Ravi Iyer. CQoS: A Framework for Enabling QoS in Shared Caches of
CMP Platforms. In International Conference on Supercomputing, 2004.
[15] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot,
Simon Steely Jr., and Joel Emer. Adaptive Insertion Policies for Manag-
ing Shared Caches. In International Conference on Parallel Architectures
and Compilation Techniques, 2008.
[16] Joint Electron Devices Engineering Council (JEDEC). DDR3 SDRAM
Standard (JESD79-3F), 2012.
[17] Seongbeom Kim, Dhruba Chandra, and Yan Solihin. Fair Cache Sharing
and Partitioning in a Chip Multiprocessor Architecture. In International
Conference on Parallel Architectures and Compilation Techniques, 2004.
[18] Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter. ATLAS: A
Scalable and High-Performance Scheduling Algorithm for Multiple Mem-
ory Controllers. In International Symposium on High Performance Com-
puter Architecture, 2010.
[19] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-
Balter. Thread Cluster Memory Scheduling: Exploiting Differences in
Memory Access Behavior. In International Symposium on Microarchitec-
ture, 2010.
[20] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu.
A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM. In
International Symposium on Computer Architecture, 2012.
[21] Benjamin C Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting
Phase Change Memory as a Scalable DRAM Alternative. In International
Symposium on Computer Architecture, 2009.
[22] Chang J Lee, Onur Mutlu, Veynu Narasiman, and Yale N Patt. Prefetch-
Aware DRAM Controllers. In International Symposium on Microarchi-
tecture, 2008.
[23] Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N Patt. Im-
proving Memory Bank-Level Parallelism in the Presence of Prefetching.
In International Symposium on Microarchitecture, 2009.
[24] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subra-
manian, and Onur Mutlu. Tiered-Latency DRAM: A Low Latency and
Low Cost DRAM Architecture. In International Symposium on High
Performance Computer Architecture, 2013.
[25] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang,
and P. Sadayappan. Gaining Insights into Multicore Cache Partitioning:
Bridging the Gap between Simulation and Real Systems. In International
Symposium on High Performance Computer Architecture, 2008.
[26] J. S. Liptay. Structural Aspects of the System/360 Model 85: II The
Cache. IBM Syst. J., 7(1), Mar 1968.
[27] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. RAIDR:
Retention-Aware Intelligent DRAM Refresh. In International Sympo-
sium on Computer Architecture, 2012.
[28] Gabriel H Loh. 3D-Stacked Memory Architectures for Multi-core Pro-
cessors. In International Symposium on Computer Architecture, 2008.
[29] Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and
Parthasarathy Ranganathan. Enabling Efficient and Scalable Hybrid
Memories Using Fine-Granularity DRAM Cache Management. IEEE
Computer Architecture Letters, 11(2), July 2012.
[30] Thomas Moscibroda and Onur Mutlu. Memory Performance Attacks:
Denial of Memory Service in Multi-Core Systems. In USENIX Security
Symposium, 2007.
[31] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mah-
mut Kandemir, and Thomas Moscibroda. Reducing Memory Interference
in Multicore Systems via Application-Aware Memory Channel Partition-
ing. In International Symposium on Microarchitecture, 2011.
[32] Onur Mutlu. Memory Systems in the Many-Core Era: Challenges,
Opportunities, and Solution Directions. In International Symposium
on Memory Management, 2011. http://users.ece.cmu.edu/~omutlu/
pub/onur-ismm-mspc-keynote-june-5-2011-short.pptx.
[33] Onur Mutlu and Thomas Moscibroda. Stall-Time Fair Memory Access
Scheduling for Chip Multiprocessors. In International Symposium on
Microarchitecture, 2007.
[34] Onur Mutlu and Thomas Moscibroda. Parallelism-Aware Batch Schedul-
ing: Enhancing both Performance and Fairness of Shared DRAM Sys-
tems. In International Symposium on Computer Architecture, 2008.
[35] Moinuddin K Qureshi, Aamer Jaleel, Yale N Patt, Simon C Steely, and
Joel Emer. Adaptive Insertion Policies for High Performance Caching.
In International Symposium on Computer Architecture, 2007.
[36] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt.
A Case for MLP-Aware Cache Replacement. In International Symposium
on Computer Architecture, 2006.
[37] Moinuddin K Qureshi and Yale N Patt. Utility-Based Cache Partitioning:
A Low-Overhead, High-Performance, Runtime Mechanism to Partition
Shared Caches. In International Symposium on Microarchitecture, 2006.
[41] Vivek Seshadri, Onur Mutlu, Michael A Kozuch, and Todd C Mowry.
The Evicted-Address Filter: A Unified Mechanism to Address Both Cache
Pollution and Thrashing. In International Conference on Parallel Archi-
tectures and Compilation Techniques, 2012.
[42] Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and
Onur Mutlu. MISE: Providing Performance Predictability and Improving
Fairness in Shared Main Memory Systems. In International Symposium
on High Performance Computer Architecture, 2013.
[43] Chris Wilkerson, Hongliang Gao, Alaa R Alameldeen, Zeshan Chishti,
Muhammad Khellah, and Shih-Lien Lu. Trading off Cache Capacity for
Reliability to Enable Low Voltage Operation. In International Sympo-
sium on Computer Architecture, 2008.
[44] M. V. Wilkes. Slave Memories and Dynamic Storage Allocation. Elec-
tronic Computers, IEEE Transactions on, EC-14(2), 1965.
[45] HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael A. Hard-
ing, and Onur Mutlu. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In International Conference on Computer Design, 2012.