Lecture 15
Set‐associative cache
Cache performance
Adapted from Computer Organization and Design, 4th edition, Patterson and Hennessy
The Memory Hierarchy: Terminology
• Block (or line): the minimum unit of information that is present (or not) in a cache
• Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy
– Hit Time: time to access that level, which consists of
Time to access the block + Time to determine hit/miss
• Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy = 1 − (Hit Rate)
– Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of
Time to access the block in the lower level + Time to transmit that block to the level that experienced the miss + Time to insert the block in that level + Time to pass the block to the requestor
Hit Time << Miss Penalty
Handling Cache Misses (Single-Word Blocks)
• Read misses (I$ and D$)
– stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
• Write misses (D$ only)
1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write‐back cache), write the word from the processor to the cache, then let the pipeline resume
or
2. Write allocate – just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall
or
3. No‐write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn’t full
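A minimal C sketch of write-miss options 2 and 3 above, assuming a hypothetical single-word cache line structure and a write_buffer_push helper (both are illustrative, not from the lecture):

    /* Hypothetical sketch of the two write-miss policies for single-word
     * blocks; names and structures are illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data;      /* one-word block */
    } cache_line_t;

    /* Policy 2: write allocate -- overwrite tag and data, no hit check needed */
    void write_miss_allocate(cache_line_t *line, uint32_t tag, uint32_t word) {
        line->valid = true;
        line->tag   = tag;
        line->data  = word;          /* the whole block is the new word */
    }

    /* Policy 3: no-write allocate -- invalidate the (possibly stale) block and
     * send the word to a write buffer bound for the next memory level */
    void write_miss_no_allocate(cache_line_t *line, uint32_t word,
                                void (*write_buffer_push)(uint32_t)) {
        line->valid = false;         /* block may now hold stale data */
        write_buffer_push(word);     /* stalls only if the buffer is full */
    }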
Multiword Block Considerations
• Read misses (I$ and D$)
– Processed the same as for single-word blocks – a miss returns the entire block from memory
– Miss penalty grows as block size grows
• Early restart – processor resumes execution as soon as the requested word of the block is returned
• Requested word first – requested word is transferred from the
memory to the cache (and processor) first
– Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss
• Write misses (D$)
– If using write allocate, must first fetch the block from memory and then write the word into the block, or could end up with a “garbled” block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block)
Handling Cache Hits
• Read hits (I$ and D$)
– this is what we want!
• Write hits (D$ only)
– require the cache and memory to be consistent
• always write the data into both the cache block and the next level in
the memory hierarchy (write‐through)
• writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer and stall only if the write buffer is full
– allow cache and memory to be inconsistent
• write the data only into the cache block (write‐back the cache block to the next level in the memory hierarchy when that cache block is “evicted”)
• need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help “buffer” write‐backs of dirty blocks
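A matching sketch of the two write-hit policies, again with an assumed line structure and an assumed write_buffer_push helper:

    /* Illustrative sketch of the write-through and write-back hit policies. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid, dirty;
        uint32_t tag, data;
    } cache_line_t;

    /* Write-through: update the cache and send the write toward memory
     * (through a write buffer so the processor rarely stalls). */
    void write_hit_through(cache_line_t *line, uint32_t word,
                           void (*write_buffer_push)(uint32_t)) {
        line->data = word;
        write_buffer_push(word);     /* stall only if the buffer is full */
    }

    /* Write-back: update only the cache and mark the block dirty; the block
     * is written to the next level later, when it is evicted. */
    void write_hit_back(cache_line_t *line, uint32_t word) {
        line->data  = word;
        line->dirty = true;          /* must be written back on eviction */
    }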
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
• In a direct-mapped cache, a memory block maps to exactly one cache block
• At the other extreme, could allow a memory block to be mapped to any cache block – a fully associative cache
• A compromise is to divide the cache into sets, each of which consists of n “ways” (n‐way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
(block address) modulo (# sets in the cache)
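A small illustrative program for this mapping; the constants (one-word blocks, 2 sets) are chosen to match the two-way example a few slides below and are assumptions for illustration:

    /* Minimal sketch of the set-associative mapping: compute the set index
     * and tag from a byte address. Block size and set count are assumed
     * powers of two; the constants here are illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BYTES 4u    /* one 32-bit word per block */
    #define NUM_SETS    2u    /* e.g., a 4-block, 2-way cache */

    int main(void) {
        uint32_t byte_addr  = 0x00000010;             /* word address 4 */
        uint32_t block_addr = byte_addr / BLOCK_BYTES;
        uint32_t set = block_addr % NUM_SETS;          /* (block address) mod (# sets) */
        uint32_t tag = block_addr / NUM_SETS;          /* remaining high-order bits */
        printf("block %u -> set %u, tag %u\n",
               (unsigned)block_addr, (unsigned)set, (unsigned)tag);
        return 0;
    }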
Another Reference String Mapping
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
Start with an empty cache – all blocks initially marked as not valid
8 requests, 8 misses
Ping pong effect due to conflict misses ‐ two memory locations
that map into the same cache block
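A tiny simulation of this reference string, assuming a direct-mapped cache with four one-word blocks (so word addresses 0 and 4 both map to block 0); the cache size is an assumption for illustration:

    /* Reference string 0 4 0 4 0 4 0 4 on an assumed 4-block direct-mapped
     * cache with one-word blocks: every access misses (ping-pong effect). */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BLOCKS 4

    int main(void) {
        int  tag[NUM_BLOCKS];
        bool valid[NUM_BLOCKS] = { false };
        int  refs[] = { 0, 4, 0, 4, 0, 4, 0, 4 };
        int  misses = 0;

        for (int i = 0; i < 8; i++) {
            int block = refs[i] % NUM_BLOCKS;        /* index */
            int t     = refs[i] / NUM_BLOCKS;        /* tag   */
            if (!valid[block] || tag[block] != t) {  /* miss: evict and refill */
                misses++;
                valid[block] = true;
                tag[block]   = t;
            }
        }
        printf("8 requests, %d misses\n", misses);   /* prints 8 misses */
        return 0;
    }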
Set Associative Cache Example
[Figure: a two-way set-associative cache (2 sets, each with ways 0 and 1, holding V, Tag, Data) alongside a 16-block main memory with addresses 0000xx through 1111xx. One-word blocks; the two low-order address bits define the byte in the 32-bit word.]
Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.
Q2: How do we find it? Use the next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache).
Another Reference String Mapping
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
Start with an empty cache – all blocks initially marked as not valid
0 miss         4 miss         0 hit          4 hit
000 Mem(0)     000 Mem(0)     000 Mem(0)     000 Mem(0)
               010 Mem(4)     010 Mem(4)     010 Mem(4)
8 requests, 2 misses
Solves the ping pong effect in a direct mapped cache due to
conflict misses since now two memory locations that map into
the same cache set can co‐exist!
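A companion sketch for the 2-way case, assuming the same four-block capacity organized as 2 sets × 2 ways; blocks 0 and 4 now share set 0 but occupy different ways:

    /* The same reference string on a 2-set, 2-way cache: only 2 misses. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_SETS 2
    #define WAYS     2

    int main(void) {
        int  tag[NUM_SETS][WAYS];
        bool valid[NUM_SETS][WAYS] = { { false } };
        int  refs[] = { 0, 4, 0, 4, 0, 4, 0, 4 };
        int  misses = 0;

        for (int i = 0; i < 8; i++) {
            int set = refs[i] % NUM_SETS;
            int t   = refs[i] / NUM_SETS;
            bool hit = false;
            for (int w = 0; w < WAYS; w++)           /* compare both ways' tags */
                if (valid[set][w] && tag[set][w] == t) hit = true;
            if (!hit) {                              /* miss: fill the first free way */
                misses++;
                for (int w = 0; w < WAYS; w++)
                    if (!valid[set][w]) { valid[set][w] = true; tag[set][w] = t; break; }
            }
        }
        printf("8 requests, %d misses\n", misses);   /* prints 2 misses */
        return 0;
    }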
Four‐Way Set Associative Cache
• 2^8 = 256 sets, each with four ways (each with one block)
[Figure: a four-way set-associative cache. The 32-bit address (with a 2-bit byte offset in bits 1–0) is split into a 22-bit tag and an 8-bit index. The index selects one of 256 sets; each of the four ways stores a V bit, a tag, and a 32-bit data block. The tag is compared against all four ways in parallel, and a 4-to-1 select multiplexor chooses the data word on a hit, producing the Hit and Data outputs.]
Range of Set Associative Caches
• For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
[Diagram: the spectrum of associativity for a fixed-size cache. At one end, direct mapped (only one way): smaller tags, only a single comparator. At the other end, fully associative (only one set): the tag is all the address bits except the block and byte offset. Moving toward fully associative increases associativity; moving toward direct mapped decreases it.]
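An illustrative calculation of this index/tag trade-off; the 4 KB cache, 4-byte blocks, and 32-bit addresses are assumptions, not lecture parameters:

    /* How index and tag widths trade off as associativity changes
     * for a fixed-size cache (assumed sizes). */
    #include <stdio.h>

    int log2u(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        const unsigned addr_bits   = 32;
        const unsigned cache_bytes = 4096;
        const unsigned block_bytes = 4;
        const unsigned num_blocks  = cache_bytes / block_bytes;   /* 1024 */

        for (unsigned ways = 1; ways <= num_blocks; ways *= 2) {
            unsigned sets   = num_blocks / ways;
            unsigned offset = log2u(block_bytes);
            unsigned index  = log2u(sets);        /* 0 bits when fully associative */
            unsigned tag    = addr_bits - index - offset;
            printf("%4u-way: %4u sets, index %2u bits, tag %2u bits\n",
                   ways, sets, index, tag);
        }
        return 0;
    }

Each doubling of the number of ways halves the sets, shrinking the index by 1 bit and growing the tag by 1 bit, as the slide states.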
Costs of Set Associative Caches
• When a miss occurs, which way’s block do we pick for replacement?
– Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
• Must have hardware to keep track of when each way’s block
was used relative to the other blocks in the set
• For 2‐way set associative, takes one bit per set → set the bit when a block is referenced (and reset the other way’s bit); see the sketch after this slide
• N‐way set associative cache costs
– N comparators (delay and area)
– MUX delay (set selection) before data is available
– Data available after set selection (and Hit/Miss decision). In a direct-mapped cache, the cache block is available before the Hit/Miss decision
• So it’s not possible to just assume a hit and continue, and recover later if it was a miss
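A hedged sketch of the one-bit-per-set LRU scheme for a 2-way cache mentioned above; the set count and function names are illustrative:

    /* One bit of LRU state per set for a 2-way set-associative cache: the
     * bit records the most recently used way, so the other way is the victim. */
    #include <stdint.h>

    #define NUM_SETS 256

    static uint8_t mru_way[NUM_SETS];   /* 1 bit of state per set: 0 or 1 */

    /* Call on every access that hits (or fills) 'way' of 'set'. */
    void lru_touch(unsigned set, unsigned way) {
        mru_way[set] = (uint8_t)way;    /* mark this way most recently used */
    }

    /* Call on a miss to choose the replacement victim for 'set'. */
    unsigned lru_victim(unsigned set) {
        return mru_way[set] ^ 1u;       /* the other way is least recently used */
    }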
Benefits of Set Associative Caches
• The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
[Chart: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4 KB to 512 KB. Miss rate falls as associativity and cache size increase. Data from Hennessy & Patterson, Computer Architecture, 2003.]
Largest gains are in going from direct mapped to 2‐way (20%+ reduction in miss rate)
Sources of Cache Misses
• Compulsory (cold start or process migration, first reference):
– First access to a block, “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instructions, compulsory misses are insignificant
– Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size (may increase access time)
• Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity (stay tuned) (may increase access time)
FIGURE 5.31 The miss rate can be broken into three sources of misses. This graph shows the total miss rate and its components for a range of cache sizes. This data is for the SPEC2000 integer and floating‐point benchmarks and is from the same source as the data in Figure 5.30. The compulsory miss component is 0.006% and cannot be seen in this graph. The next component is the capacity miss rate, which depends on cache size. The conflict portion, which depends both on associativity and on cache size, is shown for a range of associativities from one‐way to eight‐way. In each case, the labeled section corresponds to the increase in the miss rate that occurs when the associativity is changed from the next higher degree to the labeled degree of associativity. For example, the section labeled two‐way indicates the additional misses arising when the cache has associativity of two rather than four. Thus, the difference in the miss rate incurred by a direct‐mapped cache versus a fully associative cache of the same size is given by the sum of the sections marked eight‐way, four‐way, two‐way, and one‐way. The difference between eight‐way and four‐way is so small that it is difficult to see on this graph. Copyright © 2009 Elsevier, Inc. All rights reserved.
Measuring Cache Performance
• Assuming cache hit costs are included as part of the normal CPU execution cycle, then
CPU time = IC × CPI × CC = IC × (CPIideal + Memory‐stall cycles) × CC
where CPIstall = CPIideal + Memory‐stall cycles
Memory‐stall cycles come from cache misses (a sum of read‐stalls and write‐stalls)
Read‐stall cycles = reads/program × read miss rate × read miss penalty
Write‐stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
For write‐through caches, we can simplify this to
Memory‐stall cycles = accesses/program × miss rate × miss penalty
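A minimal sketch of this stall-cycle bookkeeping; the counts, rates, and penalties below are placeholder assumptions, not lecture data:

    /* Read-stall + write-stall accounting from the formulas above. */
    #include <stdio.h>

    int main(void) {
        double reads = 1.0e6, writes = 0.4e6;            /* per program (assumed) */
        double read_miss_rate = 0.03, write_miss_rate = 0.03;
        double read_penalty = 100, write_penalty = 100;  /* in clock cycles */
        double write_buffer_stalls = 0;                  /* assume buffer rarely fills */

        double read_stalls  = reads  * read_miss_rate  * read_penalty;
        double write_stalls = writes * write_miss_rate * write_penalty
                            + write_buffer_stalls;

        /* For write-through caches with one miss rate and penalty this
         * collapses to accesses * miss rate * miss penalty. */
        printf("memory-stall cycles = %.0f\n", read_stalls + write_stalls);
        return 0;
    }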
Impacts of Cache Performance
• Relative cache penalty increases as processor performance
improves (faster clock rate and/or lower CPI)
– The memory speed is unlikely to improve as fast as processor cycle time. When calculating CPIstall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
– The lower the CPIideal, the more pronounced the impact of stalls
• A processor with a CPIideal of 2, a 100 cycle miss penalty,
36% load/store instr’s, and 2% I$ and 4% D$ miss rates
Memory‐stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
So CPIstalls = 2 + 3.44 = 5.44
more than twice the CPIideal !
• What if the CPIideal is reduced to 1? 0.5? 0.25?
• What if the D$ miss rate went up 1%? 2%?
• What if the processor clock rate is doubled (doubling the miss penalty)?
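A short sketch of the worked example above, including the CPIideal what-ifs (1, 0.5, 0.25); it reproduces the 3.44 and 5.44 figures from the slide:

    /* CPIideal = 2, 100-cycle miss penalty, 36% loads/stores,
     * 2% I$ and 4% D$ miss rates. */
    #include <stdio.h>

    int main(void) {
        double ifetch_miss = 0.02, dmem_miss = 0.04, ldst_frac = 0.36;
        double penalty = 100.0;

        /* every instruction makes one I$ access; 36% also make a D$ access */
        double stalls = ifetch_miss * penalty + ldst_frac * dmem_miss * penalty;
        printf("memory-stall cycles per instruction = %.2f\n", stalls);  /* 3.44 */

        double cpi_ideal[] = { 2.0, 1.0, 0.5, 0.25 };
        for (int i = 0; i < 4; i++)
            printf("CPIideal = %.2f -> CPIstall = %.2f (stalls are %.0f%% of CPI)\n",
                   cpi_ideal[i], cpi_ideal[i] + stalls,
                   100.0 * stalls / (cpi_ideal[i] + stalls));
        return 0;
    }

The loop shows that the lower the CPIideal, the larger the fraction of execution time spent stalled on memory.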
Average Memory Access Time (AMAT)
• A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses
AMAT = Time for a hit + Miss rate × Miss penalty
• What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
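A one-liner sketch of the AMAT question above; the answer it prints follows directly from the given numbers:

    /* AMAT = hit time + miss rate * miss penalty, in cycles and picoseconds. */
    #include <stdio.h>

    int main(void) {
        double clock_ps   = 20.0;
        double hit_cycles = 1.0, miss_rate = 0.02, miss_penalty = 50.0;

        double amat_cycles = hit_cycles + miss_rate * miss_penalty;  /* 1 + 0.02*50 = 2 */
        printf("AMAT = %.1f cycles = %.0f ps\n", amat_cycles, amat_cycles * clock_ps);
        return 0;
    }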
Reducing Cache Miss Rates #2
2. Use multiple levels of caches
• With advancing technology, have more than enough room on the die for bigger L1 caches or for a second level of caches – normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache
• For our example, CPIideal of 2, 100-cycle miss penalty (to main memory) and a 25-cycle miss penalty (to UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, add a 0.5% UL2$ miss rate
CPIstalls = 2 + .02×25 + .36×.04×25 + .005×100 +
.36×.005×100 = 3.54
(as compared to 5.44 with no L2$)
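A sketch of this two-level example; it reproduces the 3.54 figure from the slide:

    /* L1 misses pay the 25-cycle UL2$ penalty; UL2$ misses (0.5% of all
     * accesses) pay the 100-cycle main-memory penalty on top of that. */
    #include <stdio.h>

    int main(void) {
        double cpi_ideal = 2.0, ldst_frac = 0.36;
        double l1i_miss = 0.02, l1d_miss = 0.04, l2_miss = 0.005;
        double l2_penalty = 25.0, mem_penalty = 100.0;

        double stalls = l1i_miss * l2_penalty                 /* I$ misses to UL2$     */
                      + ldst_frac * l1d_miss * l2_penalty     /* D$ misses to UL2$     */
                      + l2_miss * mem_penalty                 /* UL2$ misses, fetches  */
                      + ldst_frac * l2_miss * mem_penalty;    /* UL2$ misses, ld/st    */

        printf("CPIstall = %.2f\n", cpi_ideal + stalls);      /* 3.54 vs 5.44 with no L2$ */
        return 0;
    }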
Multilevel Cache Design Considerations
• Design considerations for L1 and L2 caches are very different
– Primary cache should focus on minimizing hit time in support of a shorter clock cycle
• Smaller with smaller block sizes
– Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
• Larger with larger block sizes
• Higher levels of associativity
• The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller (i.e., faster) but have a higher miss rate
• For the L2 cache, hit time is less important than miss rate
– The L2$ hit time determines L1$’s miss penalty
– L2$ local miss rate >> the global miss rate