Cache Organization
Topics
Generic cache memory organization
Direct mapped caches
Set associative caches
Impact of caches on programming
Cache Vocabulary
Capacity
Cache block (aka cache line)
Associativity
Cache set
Index
Tag
Hit rate
Miss rate
Replacement policy
General Org of a Cache Memory
A cache is an array of S = 2^s sets.
Each set contains E lines (one or more lines per set).
Each line holds a block of B = 2^b bytes of data, plus a valid bit and t tag bits.
[Figure: S sets, each containing E lines of the form: valid | tag | byte 0 ... byte B-1]
An m-bit address is partitioned into <tag> (t bits), <set index> (s bits), and <block offset> (b bits).
The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.
Direct-Mapped Cache
Simplest kind of cache
Characterized by exactly one line per set (E = 1).
Accessing Direct-Mapped Caches
Set selection
Use the set index bits to determine the set of interest.
Accessing Direct-Mapped Caches
Line matching and word selection
Line matching: Find a valid line in the selected set with a matching tag.
Word selection: Then extract the word.
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: address <tag = 0110, set index = i, block offset = 100> compared (=?)
against the single line of set i, whose block holds bytes 0-7.]
Direct-Mapped Cache Simulation
M = 16 byte addresses (m = 4 bits), B = 2 bytes/block, S = 4 sets, E = 1 line/set
Address bits: t = 1 (tag), s = 2 (set index), b = 1 (block offset)
Address trace (reads):
0  [0000]  miss  (loads M[0-1] into set 00)
1  [0001]  hit
13 [1101]  miss  (loads M[12-13] into set 10)
8  [1000]  miss  (evicts M[0-1], loads M[8-9] into set 00)
0  [0000]  miss  (evicts M[8-9], reloads M[0-1])
Why Use Middle Bits as Index?
Consider a 4-line cache indexed with either the high-order or the middle-order address bits.
High-Order Bit Indexing
Adjacent memory lines would map to the same cache entry
Poor use of spatial locality
Middle-Order Bit Indexing
Consecutive memory lines map to different cache lines
Can hold a C-byte region of the address space in the cache at one time
[Figure: the 16 memory lines 0000-1111; under high-order indexing each quarter of
memory maps to a single cache line, while under middle-order indexing consecutive
lines cycle through all four cache lines.]
Set Associative Caches
Characterized by more than one line per set (E > 1)
[Figure: each of the S sets holds E lines of the form: valid | tag | cache block]
Accessing Set Associative Caches
Set selection
Identical to direct-mapped cache: the s set index bits select one of the S sets.
[Figure: address split into t tag bits, s set index bits, and b block offset bits;
the set index selects one set, each containing E lines of the form: valid | tag | cache block.]
Accessing Set Associative Caches
Line matching and word selection
Must compare the tag in each valid line in the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: selected set i holds two valid lines with tags 1001 and 0110; the address
<tag = 0110, set index = i, block offset = 100> matches the second line, and the
offset selects the starting byte within its block (bytes 0-7, words w0-w3).]
Cache Performance Metrics
Miss Rate
Fraction of memory references not found in cache
(misses/references)
Typical numbers:
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
Time to deliver a line in the cache to the processor (includes
time to determine whether the line is in the cache)
Typical numbers:
1-3 clock cycles for L1
5-12 clock cycles for L2
Miss Penalty
Additional time required because of a miss
Typically 100-300 cycles for main memory
Memory System Performance
Average Memory Access Time (AMAT)
AMAT = hit time + miss rate x miss penalty
Assume a 1-level cache, 90% hit rate, 1 cycle hit time, 200 cycle miss penalty
AMAT = 1 + 0.10 x 200 = 21 cycles!!! - even though 90% of accesses take only one cycle
Memory System Performance - II
How does AMAT affect overall performance?
Recall the CPI equation (pipeline efficiency):
CPI = 1.0 + lp + mp + rp
The load/use penalty (lp) assumed a memory access of 1 cycle
Further, we assumed that all load instructions were 1 cycle
A more realistic AMAT (20+ cycles) really hurts CPI and overall performance

Cause       Name  Instr. Freq.  Cond. Freq.  Stalls  Product
Load        lp    0.30          0.7          21      4.41
Load/Use    lp    0.30          0.3          21+1    1.98
Mispredict  mp    0.20          0.4          2       0.16
Return      rp    0.02          1.0          3       0.06
Total penalty                                        6.61
Memory System Performance - III
How to reduce AMAT?
Reduce miss rate
Reduce miss penalty
Reduce hit time
Writing Cache Friendly Code
Can write code to improve miss rate
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Examples assume: cold cache, 4-byte words, 4-word cache blocks
Concluding Observations
Programmer can optimize for cache performance
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique