Exploiting Memory Hierarchy
Locality
Locality is a principle that makes having a memory hierarchy a
good idea
If an item is referenced, then:
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon
Why does code have locality? Consider both instruction and data accesses.
Hit and Miss
Focus on any two adjacent levels in the memory hierarchy – called
upper (closer to CPU) and lower (farther from CPU) – because
each block copy is always between two adjacent levels
Terminology:
block: minimum unit of data to move between levels
hit: data requested is in upper level
miss: data requested is not in upper level
hit rate: fraction of memory accesses that are hits (i.e.,
found at upper level)
miss rate: fraction of memory accesses that are not hits
miss rate = 1 – hit rate
hit time: time to determine if the access is indeed a hit +
time to access and deliver the data from the upper level to
the CPU
miss penalty: time to determine if the access is a miss + time
to replace block at upper level with corresponding block at
lower level + time to deliver the block to the CPU
Caches
A simple example: assume block size = one word of data
Figure: cache contents before (a) and after (b) a reference to Xn –
the cache initially holds X1, X2, X3, X4, Xn−2, and Xn−1; the
reference to Xn causes a miss, so Xn is fetched from memory and
placed in the cache
Issues:
how do we know if a data item is in the cache?
if it is, how do we find it?
if not, what do we do?
Solution depends on cache addressing scheme…
Direct Mapped Cache
MIPS style:
Figure: direct-mapped cache – the 32-bit address (bit positions 31–0)
splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a
2-bit byte offset (bits 1–0); the index selects one of 1024 entries
(0–1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; a
comparator matches the stored tag against the address tag to produce
Hit, and the selected entry supplies Data
Cache with 1024 1-word blocks: the byte offset (2 least-significant
bits) is ignored and the next 10 bits index into the cache
What kind of locality are we taking advantage of?
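As a rough sketch of the address split just described (constants match the 1024-block figure; variable names are illustrative, not from the slides):

    #include <stdio.h>
    #include <stdint.h>

    /* Address split for a direct-mapped cache with 1024 one-word blocks:
       [31:12] tag (20 bits) | [11:2] index (10 bits) | [1:0] byte offset */
    int main(void) {
        uint32_t addr = 0x1234ABCD;               /* arbitrary example address */
        unsigned byte_off = addr & 0x3;           /* bits [1:0], ignored on word access */
        unsigned index    = (addr >> 2) & 0x3FF;  /* bits [11:2], 1 of 1024 entries */
        unsigned tag      = addr >> 12;           /* bits [31:12], compared with stored tag */
        printf("tag=0x%05X index=%u offset=%u\n", tag, index, byte_off);
        return 0;
    }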
DECStation 3100 Cache (MIPS R2000 processor)
Figure: the 32-bit address splits into a 16-bit tag (bits 31–16), a
14-bit index (bits 15–2), and a 2-bit byte offset (bits 1–0); the index
selects one of 16K entries, each holding a valid bit, a 16-bit tag, and
32 bits of data; tag comparison produces Hit and the entry supplies Data
Cache with 16K 1-word blocks: the byte offset (2 least-significant
bits) is ignored and the next 14 bits index into the cache
Cache Read Hit/Miss
Cache read hit: no action needed
Instruction cache read miss:
1. Send original PC value (current PC – 4, as PC has already
been incremented in first step of instruction cycle) to
memory
2. Instruct main memory to perform read and wait for
memory to complete access – stall on read
3. After read completes write cache entry
4. Restart instruction execution at first step to refetch
instruction
Data cache read miss:
Similar to instruction cache miss
To reduce the data miss penalty, allow the processor to keep
executing instructions while waiting for the read to complete,
stalling only when the missing word is actually required – stall on use
Cache Write Hit/Miss
Write-through scheme
on write hit: update the data in both cache and memory with
every write to avoid inconsistency
on write miss: write the word into both cache and memory –
obviously no need to read the missed word from memory!
Write-through is slow because every write requires a memory write
performance is improved with a write buffer where words are
stored while waiting to be written to memory – the processor
can continue execution unless the write buffer is full
when a word in the write buffer completes its write into main
memory, that buffer slot is freed and becomes available for
future writes
DEC 3100 write buffer has 4 words
Write-back scheme
write the data only into the cache, and write the block back to
main memory only when it is replaced in the cache
more efficient than write-through, but more complex to implement
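A minimal C sketch contrasting the two write-hit policies (array names and sizes are illustrative assumptions, not from the slides):

    #include <stdio.h>

    #define NENTRIES 1024
    static unsigned cache_data[NENTRIES];      /* one word per cache entry */
    static unsigned memory[NENTRIES * 16];     /* toy main memory */
    static int      dirty[NENTRIES];           /* write-back bookkeeping */

    /* write-through: update cache and memory on every write */
    static void write_through(unsigned idx, unsigned mem_word, unsigned val) {
        cache_data[idx]  = val;
        memory[mem_word] = val;                /* the always-required memory write */
    }

    /* write-back: update only the cache; memory is written at eviction */
    static void write_back(unsigned idx, unsigned val) {
        cache_data[idx] = val;
        dirty[idx] = 1;                        /* block must be written back later */
    }

    int main(void) {
        write_through(5, 5, 42);
        write_back(7, 99);
        printf("memory[5]=%u (written now), dirty[7]=%d (write deferred)\n",
               memory[5], dirty[7]);
        return 0;
    }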
Direct Mapped Cache: Taking Advantage of Spatial Locality
Taking advantage of spatial locality with larger blocks:
Figure: direct-mapped cache with 4K four-word blocks – the 32-bit
address splits into a 16-bit tag (bits 31–16), a 12-bit index (bits
15–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits
1–0); each of the 4K entries holds a valid bit, a 16-bit tag, and 128
bits (four words) of data; the block offset drives a 4-to-1 multiplexor
that selects one 32-bit word, and tag comparison produces Hit
Cache with 4K 4-word blocks: the byte offset (2 least-significant bits) is ignored,
the next 2 bits are the block offset, and the next 12 bits index into the cache
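The same split in C for this 4K-entry, four-word-block cache (a sketch; the address value is made up):

    #include <stdio.h>
    #include <stdint.h>

    /* [31:16] tag | [15:4] index | [3:2] block offset | [1:0] byte offset */
    int main(void) {
        uint32_t addr = 0x00ABCDEF;               /* arbitrary example address */
        unsigned byte_off  = addr & 0x3;          /* bits [1:0] */
        unsigned block_off = (addr >> 2) & 0x3;   /* bits [3:2], selects word via mux */
        unsigned index     = (addr >> 4) & 0xFFF; /* bits [15:4], 1 of 4096 entries */
        unsigned tag       = addr >> 16;          /* bits [31:16] */
        printf("tag=0x%04X index=%u word=%u byte=%u\n",
               tag, index, block_off, byte_off);
        return 0;
    }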
Direct Mapped Cache: Taking Advantage of Spatial Locality
Cache replacement in large (multiword) blocks:
word read miss: read entire block from main memory
word write miss: cannot simply write word and tag! Why?!
writing in a write-through cache:
if write hit, i.e., tag of requested address and cache
entry are equal, continue as for 1-word blocks by
replacing word and writing block to both cache and
memory
if write miss, i.e., tags are unequal, fetch block from
memory, replace word that caused miss, and write block
to both cache and memory
therefore, unlike case of 1-word blocks, a write miss with
a multiword block causes a memory read
Direct Mapped Cache: Taking Advantage of Spatial Locality
Miss rate falls at first with increasing block size, as expected,
but, as block size becomes a large fraction of total cache size,
miss rate may go up because
there are fewer blocks
competition for blocks increases
blocks get ejected before most of their words are accessed
(thrashing in the cache)
Figure: miss rate vs. block size for various cache sizes – block size
ranges from 4 to 256 bytes on the x-axis, miss rate from 0% to 40% on
the y-axis, with one curve per cache size (1 KB, 8 KB, 16 KB, 64 KB,
256 KB)
Example
How many total bits are required for a direct-mapped cache
with 128 KB of data and 1-word block size, assuming a 32-bit
address?
Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks
Cache entry size = block data bits + tag bits + valid bit
= 32 + (32 − 15 − 2) + 1 = 48 bits
Therefore, cache size = 2^15 × 48 bits
= 2^15 × (1.5 × 32) bits = 1.5 × 2^20 bits = 1.5 Mbits
data bits in cache = 128 KB × 8 = 1 Mbit
total cache size / actual cache data = 1.5
Example Problem
How many total bits are required for a direct-mapped cache with
128 KB of data and 4-word block size, assuming a 32-bit
address?
Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^13 blocks
Cache entry size = block data bits + tag bits + valid bit
= 128 + (32 − 13 − 2 − 2) + 1 = 144 bits
Therefore, cache size = 2^13 × 144 bits
= 2^13 × (1.125 × 128) bits = 1.125 × 2^20 bits = 1.125 Mbits
data bits in cache = 128 KB × 8 = 1 Mbit
total cache size / actual cache data = 1.125
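Both examples follow the same formula, which a short C helper can reproduce (a sketch written for these slides; cache_bits is an assumed name, not a standard routine):

    #include <stdio.h>

    /* total bits = #blocks x (block data bits + tag bits + valid bit) */
    static long cache_bits(long data_bytes, int block_words, int addr_bits) {
        long blocks = data_bytes / (block_words * 4);        /* 4 bytes per word */
        int index = 0, blk_off = 0;
        while ((1L << index) < blocks)        index++;       /* log2(#blocks)     */
        while ((1  << blk_off) < block_words) blk_off++;     /* log2(block words) */
        int tag = addr_bits - index - blk_off - 2;           /* 2-bit byte offset */
        return blocks * (block_words * 32L + tag + 1);
    }

    int main(void) {
        printf("1-word blocks: %ld bits\n",
               cache_bits(128 * 1024, 1, 32));  /* 1,572,864 = 1.5 Mbit   */
        printf("4-word blocks: %ld bits\n",
               cache_bits(128 * 1024, 4, 32));  /* 1,179,648 = 1.125 Mbit */
        return 0;
    }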
Example Problem
Consider a cache with 64 blocks and a block size of 16 bytes. What
block number does byte address 1200 map to?
As block size = 16 bytes:
byte address 1200 → block address 1200/16 = 75
As cache size = 64 blocks:
block address 75 → cache block (75 mod 64) = 11
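The same two-step mapping in C (numbers straight from the example):

    #include <stdio.h>

    int main(void) {
        unsigned byte_addr  = 1200;
        unsigned block_addr = byte_addr / 16;  /* block size 16 bytes: 1200/16 = 75 */
        unsigned cache_blk  = block_addr % 64; /* 64 cache blocks: 75 mod 64 = 11   */
        printf("byte %u -> block %u -> cache block %u\n",
               byte_addr, block_addr, cache_blk);
        return 0;
    }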
Block Size Considerations
Larger blocks should reduce miss rate
Due to spatial locality
But in a fixed-sized cache
Larger blocks → fewer of them
More competition → increased miss rate
Larger miss penalty
Can override benefit of reduced miss rate
Early restart and critical-word-first can help
Performance
Simplified model assuming equal read and write miss penalties:
CPU time = (execution cycles + memory stall cycles) × cycle time
memory stall cycles = number of memory accesses × miss rate
× miss penalty
Therefore, two ways to improve performance in cache:
decrease miss rate
decrease miss penalty
what happens if we increase block size?
Example
Assume for a given machine and program:
instruction cache miss rate 2%
data cache miss rate 4%
miss penalty always 40 cycles
CPI of 2 without memory stalls
frequency of load/stores 36% of instructions
1. How much faster is a machine with a perfect cache that never
misses?
2. What happens if we speed up the machine by reducing its CPI to 1
without changing the clock rate?
3. What happens if we speed up the machine by doubling its clock rate,
but the absolute time for a miss penalty remains the same?
Solution
1.
Assume instruction count = I
Instruction miss cycles = I × 2% × 40 = 0.8 I
Data miss cycles = I × 36% × 4% × 40 = 0.576 I
So, total memory-stall cycles = 0.8 I + 0.576 I = 1.376 I
in other words, 1.376 stall cycles per instruction
Therefore, CPI with memory stalls = 2 + 1.376 = 3.376
Assuming instruction count and clock rate remain same for a
perfect cache and a cache that misses:
CPU time with stalls / CPU time with perfect cache
= 3.376 / 2 = 1.688
Performance with a perfect cache is better by a factor of 1.688
Solution (cont.)
2. What happens if we speed up the machine by reducing its CPI to 1
without changing the clock rate?
CPI without stall = 1
CPI with stall = 1 + 1.376 = 2.376 (clock has not changed so
stall cycles per instruction
remains same)
CPU time with stalls / CPU time with perfect cache
= CPI with stall / CPI without stall
= 2.376
Performance with a perfect cache is better by a factor of 2.376
Conclusion: the lower the CPI, the more pronounced the impact of
stall cycles
Solution (cont.)
3. What happens if we speed up the machine by doubling its clock rate, but
the absolute time for a miss penalty remains the same?
With doubled clock rate, miss penalty = 2 × 40 = 80 clock cycles
Stall cycles per instruction = (I × 2% × 80) + (I × 36% × 4% × 80)
= 2.752 I
So, faster machine with cache miss has CPI = 2 + 2.752 = 4.752
CPU time with stalls / CPU time with perfect cache
= CPI with stall / CPI without stall
= 4.752 / 2 = 2.376
Performance with a perfect cache is better by a factor of 2.376
Conclusion: with higher clock rate cache misses “hurt more” than
with lower clock rate
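The three scenarios can be checked with a few lines of C (all values from the example above):

    #include <stdio.h>

    int main(void) {
        double i_miss = 0.02, d_miss = 0.04, ld_st = 0.36;
        double penalty = 40.0, base_cpi = 2.0;
        /* stall cycles per instruction = 0.02*40 + 0.36*0.04*40 = 1.376 */
        double stalls = i_miss * penalty + ld_st * d_miss * penalty;

        printf("1. base CPI 2:    slowdown = %.3f\n",
               (base_cpi + stalls) / base_cpi);              /* 1.688 */
        printf("2. base CPI 1:    slowdown = %.3f\n",
               (1.0 + stalls) / 1.0);                        /* 2.376 */
        printf("3. doubled clock: slowdown = %.3f\n",
               (base_cpi + 2.0 * stalls) / base_cpi);        /* penalty 80: 2.376 */
        return 0;
    }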
Decreasing Miss Rates with Associative Block Placement
Direct mapped: one unique cache location for each memory block
cache block address = memory block address mod number of blocks in cache
Fully associative: each memory block can be placed anywhere in cache
all cache entries are searched (in parallel) to locate the block
Set associative: each memory block maps to a unique set of cache
locations – if each set holds n blocks, the cache is n-way set-associative
cache set address = memory block address mod number of sets in cache
all cache entries in the corresponding set are searched (in parallel)
to locate the block
Increasing degree of associativity
reduces miss rate
increases hit time because of the parallel search and then fetch
Decreasing Miss Rates with Associative Block Placement
Figure: direct mapped – block 12 can go only in cache block
12 mod 8 = 4 (one tag compared); 2-way set associative – block 12 goes
in either block of set 12 mod 4 = 0 (two tags searched); fully
associative – block 12 can go anywhere (all eight tags searched)
Location of a memory block with address 12 in a cache with 8
blocks with different degrees of associativity
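In C, the placement rules for this example are one line each (a sketch mirroring the figure):

    #include <stdio.h>

    int main(void) {
        int block = 12, cache_blocks = 8;
        printf("direct mapped:     cache block %d\n",
               block % cache_blocks);                 /* 12 mod 8 = 4 */
        printf("2-way associative: set %d (either way)\n",
               block % (cache_blocks / 2));           /* 12 mod 4 = 0 */
        printf("fully associative: any of the %d blocks\n", cache_blocks);
        return 0;
    }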
Decreasing Miss Rates with Associative Block Placement
Figure: one-way set associative (direct mapped) – 8 sets of one block
each; two-way set associative – 4 sets of two blocks; four-way set
associative – 2 sets of four blocks; eight-way set associative (fully
associative) – one set of eight blocks; each block holds a tag and data
Configurations of an 8-block cache with different degrees of associativity
Example
Find the number of misses for a cache with four 1-word blocks given
the following sequence of memory block accesses:
0, 8, 0, 6, 8
for each of the following cache configurations
1. direct mapped
2. 2-way set associative (use LRU replacement policy)
3. fully associative
Note about LRU replacement
in a 2-way set-associative cache, LRU replacement can be
implemented with one bit per set whose value indicates the
most recently referenced block (the other block is the victim)
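A minimal sketch of that one-bit scheme (names are illustrative, not from the slides):

    #include <stdio.h>

    /* One LRU bit per set in a 2-way set-associative cache: the bit records
       the most recently used way, so the other way is the replacement victim. */
    #define NSETS 4
    static int mru[NSETS];

    static void touch(int set, int way) { mru[set] = way; }      /* on every access */
    static int  victim(int set)         { return 1 - mru[set]; } /* on a miss */

    int main(void) {
        touch(0, 1);                                   /* way 1 of set 0 just used */
        printf("replace way %d in set 0\n", victim(0)); /* prints 0 */
        return 0;
    }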
Solution
1 (direct-mapped)
Block address Cache block
0 0 (= 0 mod 4)
6 2 (= 6 mod 4)
8 0 (= 8 mod 4)
Block address translation in direct-mapped cache
Block address  Hit/miss  Block 0    Block 1    Block 2    Block 3
0              miss      Memory[0]
8              miss      Memory[8]
0              miss      Memory[0]
6              miss      Memory[0]             Memory[6]
8              miss      Memory[8]             Memory[6]
Cache contents after each reference – each miss adds or replaces an entry
5 misses
Solution (cont.)
2 (two-way set-associative)
Block address Cache set
0 0 (= 0 mod 2)
6 0 (= 6 mod 2)
8 0 (= 8 mod 2)
Block address translation in a two-way set-associative cache
Block address  Hit/miss  Set 0      Set 0      Set 1      Set 1
0              miss      Memory[0]
8              miss      Memory[0]  Memory[8]
0              hit       Memory[0]  Memory[8]
6              miss      Memory[0]  Memory[6]
8              miss      Memory[8]  Memory[6]
Cache contents after each reference – each miss adds or replaces an entry
4 misses
Solution (cont.)
3 (fully associative)
Block address  Hit/miss  Block 0    Block 1    Block 2    Block 3
0              miss      Memory[0]
8              miss      Memory[0]  Memory[8]
0              hit       Memory[0]  Memory[8]
6              miss      Memory[0]  Memory[8]  Memory[6]
8              hit       Memory[0]  Memory[8]  Memory[6]
Cache contents after each reference – each miss adds a new entry
3 misses
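All three answers can be reproduced with a small LRU cache simulator in C (a checking aid written for these slides, not part of them):

    #include <stdio.h>

    #define NBLOCKS 4   /* four 1-word blocks, as in the example */

    /* Count misses for a stream of block addresses in a 4-block cache with
       the given associativity (1 = direct mapped, 4 = fully associative),
       using LRU replacement and preferring invalid ways. */
    static int count_misses(int assoc, const int *refs, int n) {
        int sets = NBLOCKS / assoc;
        int tag[NBLOCKS], last_use[NBLOCKS], valid[NBLOCKS];
        int misses = 0;
        for (int i = 0; i < NBLOCKS; i++) { valid[i] = 0; last_use[i] = -1; }
        for (int t = 0; t < n; t++) {
            int set = refs[t] % sets;
            int hit = -1, victim = set * assoc;
            for (int w = 0; w < assoc; w++) {
                int e = set * assoc + w;
                if (valid[e] && tag[e] == refs[t]) hit = e;
                if (!valid[e] && valid[victim]) victim = e;          /* prefer empty way */
                else if (valid[e] && valid[victim] &&
                         last_use[e] < last_use[victim]) victim = e; /* else LRU way */
            }
            if (hit < 0) { misses++; hit = victim; valid[hit] = 1; tag[hit] = refs[t]; }
            last_use[hit] = t;
        }
        return misses;
    }

    int main(void) {
        int refs[] = { 0, 8, 0, 6, 8 };
        printf("direct mapped:     %d misses\n", count_misses(1, refs, 5)); /* 5 */
        printf("2-way LRU:         %d misses\n", count_misses(2, refs, 5)); /* 4 */
        printf("fully associative: %d misses\n", count_misses(4, refs, 5)); /* 3 */
        return 0;
    }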
Implementation of a Set-Associative Cache
Figure: 4-way set-associative cache – the 32-bit address splits into a
22-bit tag (bits 31–10), an 8-bit index (bits 9–2), and a 2-bit byte
offset; the index selects one of 256 sets (0–255), each holding four
(valid, tag, data) entries; four comparators check the tags in parallel
and a 4-to-1 multiplexor selects the matching Data, producing Hit
4-way set-associative cache with 4 comparators and one 4-to-1
multiplexor: cache size is 1K blocks = 256 sets × 4 blocks per set
Performance with Set-Associative Caches
Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way,
four-way, eight-way) for eight cache sizes (1 KB, 2 KB, 4 KB, 8 KB,
16 KB, 32 KB, 64 KB, 128 KB) – data generated from SPEC92 benchmarks
with 32-byte block size for all caches
Replacement Policy
Direct mapped: no choice
Set associative
Prefer non-valid entry, if there is one
Otherwise, choose among entries in the set
Least-recently used (LRU)
Choose the one unused for the longest time
Simple for 2-way, manageable for 4-way, too hard
beyond that
Random
Gives approximately the same performance as
LRU for high associativity
Multilevel Caches
Primary cache attached to CPU
Small, but fast
Level-2 cache services misses from primary
cache
Larger, slower, but still faster than main memory
Main memory services L-2 cache misses
Some high-end systems include L-3 cache
Decreasing Miss Penalty with Multilevel Caches
Add a second-level cache
primary cache is on the same chip as the processor
use SRAMs to add a second-level cache, between main
memory and the first-level cache
if a miss occurs in the primary cache → the second-level cache
is accessed
if data is found in the second-level cache → the miss penalty is
the access time of the second-level cache, which is much less
than main memory access time
if the access misses again at the second level → main memory
access is required and a large miss penalty is incurred
Design considerations with two levels of caches:
try to optimize the hit time on the 1st-level cache, to reduce
the clock cycle
try to optimize the miss rate on the 2nd-level cache, to reduce
memory access penalties
In other words, the 2nd level allows the 1st level to go for speed
without “worrying” about failure…
Example Problem
Assume a 500 MHz machine with
base CPI 1.0
main memory access time 200 ns.
miss rate 5%
How much faster will the machine be if we add a second-level
cache with 20ns access time that decreases the miss rate to 2%?
Solution
Miss penalty to main = 200 ns / (2 ns / clock cycle) = 100 clock cycles
Effective CPI with one level of cache
= Base CPI + memory-stall cycles per instruction
= 1.0 + 5% × 100 = 6.0
With two levels of cache, miss penalty to second-level cache
= 20 ns / (2 ns / clock cycle) = 10 clock cycles
Effective CPI with two levels of cache
= Base CPI + primary stalls per instruction
+ secondary stalls per instruction
= 1 + 5% × 10 + 2% × 100 = 3.5
(equivalently, 1 + (5% − 2%) × 10 + 2% × (10 + 100) = 3.5)
Therefore, machine with secondary cache is faster by a factor of
6.0 / 3.5 = 1.71
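The same computation in C (all constants from the problem statement):

    #include <stdio.h>

    int main(void) {
        double cycle    = 2.0;                 /* ns, 500 MHz clock         */
        double main_pen = 200.0 / cycle;       /* 100 cycles to main memory */
        double l2_pen   = 20.0 / cycle;        /* 10 cycles to L2           */
        double cpi_one  = 1.0 + 0.05 * main_pen;                 /* 6.0 */
        double cpi_two  = 1.0 + 0.05 * l2_pen + 0.02 * main_pen; /* 3.5 */
        printf("speedup with L2 = %.2f\n", cpi_one / cpi_two);   /* 1.71 */
        return 0;
    }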
Multilevel On-Chip Caches
Intel Nehalem 4-core processor
Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache
3-Level Cache Organization
              Intel Nehalem                        AMD Opteron X4
L1 caches     I-cache: 32KB, 64-byte blocks,       I-cache: 32KB, 64-byte blocks,
(per core)    4-way, approx LRU replacement,       2-way, LRU replacement,
              hit time n/a                         hit time 3 cycles
              D-cache: 32KB, 64-byte blocks,       D-cache: 32KB, 64-byte blocks,
              8-way, approx LRU replacement,       2-way, LRU replacement,
              write-back/allocate, hit time n/a    write-back/allocate, hit time 9 cycles
L2 unified    512KB, 64-byte blocks, 8-way,        512KB, 64-byte blocks, 16-way,
cache         approx LRU replacement,              approx LRU replacement,
(per core)    write-back/allocate, hit time n/a    write-back/allocate, hit time n/a
L3 unified    8MB, 64-byte blocks, 16-way,         2MB, 64-byte blocks, 32-way,
cache         replacement n/a,                     replace block shared by fewest cores,
(shared)      write-back/allocate, hit time n/a    write-back/allocate, hit time 32 cycles
n/a: data not available
Sources of Misses (3C’s)
Compulsory misses (aka cold start misses)
First access to a block
Capacity misses
Due to finite cache size
A replaced block is later accessed again
Conflict misses (aka collision misses)
In a non-fully associative cache
Due to competition for entries in a set
Would not occur in a fully associative cache of the
same total size
Cache Design Trade-offs
Design change            Effect on miss rate           Negative performance effect
Increase cache size      Decrease capacity misses      May increase access time
Increase associativity   Decrease conflict misses      May increase access time
Increase block size      Decrease compulsory misses    Increases miss penalty