CS 211: Computer Architecture
Cache Memory Design
Memory
• In our discussions (of the MIPS pipeline, superscalar, EPIC) we've constantly been assuming that we can access our operand from memory in 1 clock cycle...
  - This is possible, but it's complicated
  - We'll now discuss how this happens

Memory Technology
• Memory comes in many flavors
  - SRAM (Static Random Access Memory)
    - Like a register file: once data is written to SRAM, its contents stay valid - no need to refresh it
  - DRAM (Dynamic Random Access Memory)
    - Like leaky capacitors: data is stored by charging memory cells to their max values; the charge slowly leaks and will eventually be too low to be valid, so refresh circuitry rewrites the data and charges the cells back to max
  - Static RAM is faster but more expensive
  - Cache uses static RAM
The Processor-Memory Performance Gap
• [Figure: processor vs. DRAM performance over time, 1980-2000. DRAM performance improves at only about 9%/yr (2X per 10 years), so the gap between processor and memory speed keeps growing.]
• Working around slow DRAM in software is OK only in very regular applications
  - Can use SW pipelining, vectors
  - Not OK in most other applications
The Principle of Locality / Levels in a Typical Memory Hierarchy
• [Figure: levels in a typical memory hierarchy; propagation delay bounds where memory can be placed.]
• Memory technology examples: Double Data Rate (DDR) SDRAM; XDR planned at 3 to 6 GHz

Cache: Terminology
• Cache is the name given to the first level of the memory hierarchy encountered once an address leaves the CPU
  - Takes advantage of the principle of locality
• The term cache is also now applied whenever buffering is employed to reuse items
• Cache controller
  - The HW that controls access to the cache or generates requests to memory
What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  - Registers: "a cache" on variables - software managed
  - First-level cache: a cache on the second-level cache
  - Second-level cache: a cache on memory
  - Memory: a cache on disk (virtual memory)
  - TLB: a cache on the page table
  - Branch prediction: a cache on prediction information?

Caches: multilevel
• [Figure: Proc/Regs -> L1 cache -> L2 cache -> L3 cache -> main memory -> disk/tape. Each level away from the CPU is bigger but slower. Typical figures: L1 16~32KB at 1~2 pclk latency, L2 ~256KB at ~10 pclk latency, L3 ~4MB at ~50 pclk latency.]
A brief description of a cache
• Block is the minimum amount of information that can be in the cache
  - A fixed-size collection of data, retrieved from memory and placed into the cache
• The processor generates a request for a block (Blk X); the upper level (cache) is checked before the request goes to the lower level (memory)

Terminology Summary
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on Alpha 21264)
• [Figure: processor exchanging block X with upper-level memory, which in turn exchanges blocks with lower-level memory.]
Memory Hierarchy
• Placing the fastest memory near the CPU can result in increases in performance
• Consider the number of cycles the CPU is stalled on a miss

Unified or Separate I-Cache and D-Cache
• Two types of accesses:
  - Instruction fetch
  - Data fetch (load/store instructions)
Where can a block be placed in a cache?
• Fully Associative: a block can be placed anywhere in the cache
• Direct Mapped: a block can be placed in exactly one location
• Set Associative: a block can be placed anywhere within one set
• [Figure: memory blocks 1-8 mapped into the cache under fully associative, direct mapped, and set associative (Set 0 ... Set 3) organizations.]

Associativity
• If you have associativity > 1 you have to have a replacement policy
  - FIFO
  - LRU
  - Random
Block Identification: How is a block found in the cache?
• Since we have many-to-one mappings, we need a tag
• Caches have an address tag on each block that gives the block address
  - E.g.: if slot zero in the cache contains tag K, the value in slot zero corresponds to block zero from the area of memory that has tag K
  - The address consists of <tag t, block b, offset o>
  - Examine the tag in slot b of the cache:
    - If it matches t, extract the value from slot b in the cache
    - Else use the memory address to fetch the block from memory, place a copy in slot b of the cache, replace the tag with t, and use o to select the appropriate byte

Large Blocks and Subblocking
• Large cache blocks can take a long time to refill
  - Refill the cache line critical word first
  - Restart the cache access before the refill completes
• Large cache blocks can waste bus bandwidth if the block size is larger than the spatial locality
  - Divide a block into subblocks
  - Associate a separate valid bit with each subblock
  - [Figure: tag | v subblock | v subblock | v subblock]
How is a block found in the cache?
• The address entry is divided between block address & block offset...
  - ...and the block address is further divided between tag field & index field
• The index is used to select the set to be checked
  - Ex.: an address stored in set 0 must have 0 in the index field
• The offset is not necessary in the comparison - the entire block is present or not, and all block offsets must match
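To make the <tag, index, offset> decomposition concrete, here is a minimal C sketch of a direct-mapped lookup. The block size, slot count, and struct layout are illustrative assumptions, not parameters from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 5                      /* 32-byte blocks (assumed) */
    #define INDEX_BITS  7                      /* 128 slots (assumed)      */
    #define NUM_SLOTS   (1u << INDEX_BITS)

    struct slot { bool valid; uint32_t tag; uint8_t data[1u << OFFSET_BITS]; };
    static struct slot cache[NUM_SLOTS];

    /* Split an address into <tag t, block b, offset o> and check slot b. */
    bool lookup(uint32_t addr, uint8_t *byte_out)
    {
        uint32_t o = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t b = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t t = addr >> (OFFSET_BITS + INDEX_BITS);

        if (cache[b].valid && cache[b].tag == t) {   /* tag matches: hit   */
            *byte_out = cache[b].data[o];            /* o selects the byte */
            return true;
        }
        return false;   /* miss: fetch the block from memory, install it in
                           slot b, set its tag to t, then retry             */
    }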
Cache Memory Structures
• Indexed / direct-mapped memory: a k-bit index selects one of 2^k blocks; one tag comparison per access
• Associative memory (CAM): no index; the key is compared against every stored tag in parallel; unlimited blocks
• N-way set-associative memory: a k-bit index selects a set of N blocks (2^k x N blocks total); N tag comparisons per access
• [Figure: the three structures, showing the index decoder, tag comparators (=), match lines, and multiplexor; the block address supplies B index bits and b offset bits.]
N-Way Set Associative Cache
• [Figure: an N-way set-associative cache; the index drives a decoder to select a set, the N tags in the set are compared in parallel (associative search), and a multiplexor selects the matching way.]
• Cache size = N x 2^(B+b), where 2^B is the number of sets (B index bits) and 2^b is the block size in bytes (b offset bits)

Which block should be replaced on a miss?
• Example: access 0xffff8004 when the four cache frames currently hold blocks from 0x00004000, 0x00003800, 0xffff8000, and 0x00cd0800
• Direct mapped caches have 1 choice of what block to replace; with associativity > 1, a victim must be chosen
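As a quick sanity check of the size formula above, a tiny C helper; the 2-way / 7 index bits / 5 offset bits numbers are made up for illustration only.

    #include <stdio.h>

    /* Data capacity of an N-way set-associative cache:
     * 2^B sets, N blocks per set, 2^b bytes per block => N * 2^(B+b) bytes. */
    static unsigned long cache_size(unsigned n_ways, unsigned B_index_bits, unsigned b_offset_bits)
    {
        return (unsigned long)n_ways << (B_index_bits + b_offset_bits);
    }

    int main(void)
    {
        /* e.g. 2-way, 128 sets (B = 7), 32-byte blocks (b = 5) -> 8192 bytes */
        printf("%lu bytes\n", cache_size(2, 7, 5));
        return 0;
    }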
Approximating LRU
• LRU is too complicated
  - Must access and possibly update all counters in a set on every access (not just on replacement)
• Need something simpler and faster
• Replacement:
  - Randomly select a non-MRU line
  - Something like a FIFO will also work
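A minimal sketch of the "randomly select a non-MRU line" idea, assuming a hypothetical 4-way set that records only its most-recently-used way (all names here are illustrative).

    #include <stdlib.h>

    #define WAYS 4

    struct set {
        unsigned mru_way;   /* updated on every hit/fill: the only state kept */
    };

    /* Pick a victim that is not the MRU way: much cheaper than true LRU,
     * which would need a full recency ordering over all WAYS lines. */
    unsigned pick_victim(const struct set *s)
    {
        unsigned v = (unsigned)rand() % (WAYS - 1);   /* one of the WAYS-1 non-MRU ways */
        return (v >= s->mru_way) ? v + 1 : v;         /* skip over the MRU way          */
    }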
What happens on a write?
• FYI: most accesses to a cache are reads
  - All instruction fetches are reads
  - Most instructions don't write to memory
• Reads are actually pretty easy to do
  - Can read the block while comparing/reading the tag
  - The block read begins as soon as the address is available
  - If it's a hit, the data is just passed right on to the CPU
• Generically, there are 2 kinds of write policies:
  - Write through (or store through)
    - Information is written to the block in the cache and to the block in lower-level memory
  - Write back (or copy back)
    - Information is written only to the cache; it is written back to lower-level memory when the cache block is replaced
• Write back versus write through:
  - Write back is advantageous because:
    - Writes occur at the speed of the cache and don't incur the delay of lower-level memory
    - Multiple writes to a cache block result in only 1 lower-level memory access
  - Write through is advantageous because:
    - Lower levels of memory always have the most recent copy of the data
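The difference between the two policies on a write hit, as a hedged C sketch; the line structure and the memory_write() helper are stand-ins, not a real cache implementation.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

    extern void memory_write(uint32_t addr, uint32_t value);  /* lower-level memory, assumed elsewhere */

    /* Write-through: update the cache block AND lower-level memory on every store. */
    void store_write_through(struct line *l, uint32_t addr, uint32_t value, unsigned offset)
    {
        memcpy(&l->data[offset], &value, sizeof value);
        memory_write(addr, value);           /* every store pays the memory latency */
    }

    /* Write-back: update only the cache block and mark it dirty; lower-level
     * memory is updated later, when the block is evicted. */
    void store_write_back(struct line *l, uint32_t addr, uint32_t value, unsigned offset)
    {
        (void)addr;
        memcpy(&l->data[offset], &value, sizeof value);
        l->dirty = true;                     /* repeated stores cost one writeback total */
    }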
Cache Performance - Simplified Models: Average Memory Access Time
• With access times for level 1, level 2, etc.:
  - Let r1 and r2 be the fractions of accesses satisfied by level 1 and level 2, and Ch1, Ch2, Cm the access times of level 1, level 2, and main memory
  - Average access time = r1 × Ch1 + r2 × Ch2 + (1 - r1 - r2) × Cm
When do we get a miss?
• Instruction
  - Fetch instruction - not found in cache
  - How many instructions?
• Data access
  - Load and Store instructions
  - Data not found in cache
  - How many data accesses?

Separating out the memory component entirely
• AMAT = Average Memory Access Time
• CPI_AluOps does not include memory instructions

    CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

    AMAT = HitTime + MissRate × MissPenalty
         = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) +
           (HitTime_Data + MissRate_Data × MissPenalty_Data)
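Plugging numbers into these formulas is straightforward; the values below are placeholders chosen only to show the shape of the calculation, not figures from the slides.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical machine parameters (placeholders only) */
        double ic           = 1e9;   /* instruction count                  */
        double cycle_time   = 1e-9;  /* seconds per cycle                  */
        double cpi_aluops   = 1.0;   /* CPI excluding memory instructions  */
        double aluops_per_i = 0.7;   /* ALU ops per instruction            */
        double memacc_per_i = 1.3;   /* memory accesses per instruction
                                        (1 fetch + 0.3 data)               */
        double hit_time     = 1.0;   /* cycles */
        double miss_rate    = 0.05;
        double miss_penalty = 50.0;  /* cycles */

        double amat     = hit_time + miss_rate * miss_penalty;
        double cpu_time = ic * (aluops_per_i * cpi_aluops + memacc_per_i * amat) * cycle_time;

        printf("AMAT = %.2f cycles, CPU time = %.3f s\n", amat, cpu_time);
        return 0;
    }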
Cache Performance: System Performance - Memory Access Equations
• CPU time = IC × CPI × clock cycle time
  - CPI depends on memory stall cycles
• CPU time = (CPU execution clock cycles + memory stall clock cycles) × clock cycle time
• Using what we defined previously, we can say:
  - Memory stall clock cycles =
      Reads × Read miss rate × Read miss penalty +
      Writes × Write miss rate × Write miss penalty
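A direct translation of the stall-cycle equation into C; the example counts and rates in the comment are assumptions, not measurements.

    /* Memory stall cycles = reads*read_miss_rate*read_penalty
     *                     + writes*write_miss_rate*write_penalty */
    double memory_stall_cycles(double reads,  double read_miss_rate,  double read_penalty,
                               double writes, double write_miss_rate, double write_penalty)
    {
        return reads  * read_miss_rate  * read_penalty
             + writes * write_miss_rate * write_penalty;
    }
    /* e.g. memory_stall_cycles(1.0e9, 0.02, 50, 0.3e9, 0.05, 50) */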
A 4-entry direct mapped cache with 4 data words/block
• The physical address (10 bits) is split as: Tag (6 bits) | Index (2 bits) | Offset (2 bits)
• Each entry holds V (valid), D (dirty), a Tag, and 4 data words (word offsets 00, 01, 10, 11)
• Example contents: the entry at index 10 holds tag 101010 and the data words 35, 24, 17, 25 (decimal)
• This cache can hold 16 data words...
What if we get the same pattern of accesses we had before?
• Reorganize the same 16 words as 2 entries with 8 data words/block
  - Address split: Tag (6 bits) | Index (1 bit) | Offset (3 bits)
  - Pattern of accesses: 1.) 101010 1 000, 2.) 101010 1 001, 3.) 101010 1 010, 4.) 101010 1 011
  - Final state: the entry at index 1 holds tag 101010 and the words 35, 24, 17, 25, A, B, C, D
  - Note that there is now more data associated with a given cache block
  - However, we now have only 1 bit of index: any address with a tag different from 101010 and a 1 in the index position will result in a conflict miss

But, we could also make our cache look like this...
• 8 entries with 2 data words/block
  - Address split: Tag (6 bits) | Index (3 bits) | Offset (1 bit) - note the different number of offset and index bits now
  - Again, let's assume we want to read the following data words:
      1.) 101010 | 100 | 0  holds 35
      2.) 101010 | 100 | 1  holds 24
      3.) 101010 | 101 | 0  holds 17
      4.) 101010 | 101 | 1  holds 25
  - Assuming all of these accesses occur for the 1st time (and sequentially), accesses (1) and (3) result in compulsory misses, and accesses (2) and (4) hit because of spatial locality (final cache state: entry 100 holds tag 101010 with 35, 24; entry 101 holds tag 101010 with 17, 25)
  - There are now just 2 words associated with each cache block
  - Note that by organizing the cache this way, conflict misses are reduced: there are now more entries that the 10-bit physical address can map to
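To make the hit/miss behavior concrete, here is a tiny simulation of the 8-entry, 2-words-per-block configuration on those four reads; it is a sketch using the slides' 10-bit addresses, not code from the slides.

    #include <stdio.h>
    #include <stdbool.h>

    /* 10-bit address = tag(6) | index(3) | offset(1); 8 entries, 2 words/block */
    struct entry { bool valid; unsigned tag; };

    int main(void)
    {
        struct entry cache[8] = {0};
        unsigned reads[4] = { 0x2A8, 0x2A9, 0x2AA, 0x2AB };  /* 1010101000 ... 1010101011 */

        for (int i = 0; i < 4; i++) {
            unsigned addr  = reads[i];
            unsigned index = (addr >> 1) & 0x7;   /* 3 index bits */
            unsigned tag   = addr >> 4;           /* 6 tag bits   */

            if (cache[index].valid && cache[index].tag == tag)
                printf("access %d: hit  (set %u)\n", i + 1, index);
            else {
                printf("access %d: miss (set %u) - load block\n", i + 1, index);
                cache[index].valid = true;        /* compulsory miss brings in the 2-word block */
                cache[index].tag   = tag;
            }
        }
        return 0;   /* prints miss, hit, miss, hit - matching the slide */
    }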
What's a capacity miss?
• The cache is only so big: we won't be able to store every block accessed in a program, so we must swap them out!
• Thus, to avoid capacity misses, make the cache bigger
Next: Cache Optimization
Cache Performance: Miss-Oriented Formulation
• Miss-oriented approach to memory access (compare with the AMAT-based formulation above):

    CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

    CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime

  - Here CPI_Execution covers all instructions; the memory-stall component is added separately
Cache Misses: Reducing Miss Rate - the 3C's Model
• Misses are classified as compulsory, capacity, or conflict
• [Figure: miss-rate breakdown across block sizes 16, 32, 64, 128.]
(1) Larger cache block size
• The easiest way to reduce miss rate is to increase the cache block size
  - This will help eliminate what kind of misses?
• Helps improve miss rate b/c of the principle of locality:
  - Temporal locality says that if something is accessed once, it'll probably be accessed again soon
  - Spatial locality says that if something is accessed, something nearby it will probably be accessed
  - Larger block sizes help with spatial locality
• Be careful though!
  - Larger block sizes can increase the miss penalty!
  - Generally, larger blocks reduce the # of total blocks in the cache
• [Figure: miss rate (0-25%) vs. block size (16-256 bytes) for total cache sizes of 1K, 4K, 16K, 64K, and 256K, with total cache size held constant for each curve. Why this trend?]
Larger cache block size (example continued)
• Average memory access time (in clock cycles) vs. block size, for several total cache sizes:

    Block Size | Miss Penalty | 1K     | 4K    | 16K   | 64K   | 256K
    16         | 42           | 7.321  | 4.599 | 2.655 | 1.857 | 1.485
    32         | 44           | 6.870  | 4.186 | 2.263 | 1.594 | 1.308
    64         | 48           | 7.605  | 4.360 | 2.267 | 1.509 | 1.245
    128        | 56           | 10.318 | 5.357 | 2.551 | 1.571 | 1.274
    256        | 72           | 16.847 | 7.847 | 3.369 | 1.828 | 1.353

• Red entries (in the original slide) mark the lowest average access time for a particular configuration
• Note: all of these block sizes are common in today's processors
• Note: the table entries are average memory access times in units of clock cycles

Larger cache block sizes (wrap-up)
• We want to minimize cache miss rate & cache miss penalty at the same time!
• Selection of block size depends on the latency and bandwidth of lower-level memory:
  - High latency, high bandwidth encourage a large block size
    - The cache gets many more bytes per miss for a small increase in miss penalty
  - Low latency, low bandwidth encourage a small block size
    - Twice the miss penalty of a small block may be close to the penalty of a block twice the size
    - A larger # of small blocks may reduce conflict misses
Higher associativity
• Higher associativity can improve cache miss rates...
• [Figure: miss rate for 1-way vs. 2-way associativity, showing the reduction in the conflict-miss component.]
Miss Rate Reduction Strategies
• Increase block size
  - Reduces compulsory misses
  - But a larger block size for a fixed total size can lead to more capacity misses
• Larger caches
  - A larger size can reduce capacity and conflict misses
• Higher associativity
  - Can reduce conflict misses
  - No effect on cold (compulsory) misses
• Compiler techniques
  - Compiler-controlled pre-fetching (faulting/non-faulting)
  - Code reorganization (e.g., merging arrays)

Larger Block Size (fixed size & associativity)
• [Figure: miss rate (0-25%) vs. block size (16-256 bytes) for total cache sizes of 1K, 4K, 16K, 64K, and 256K. Moving to larger blocks reduces compulsory misses but eventually increases conflict misses.]
Improving Performance: Reducing Cache Miss Penalty
• Multilevel caches
  - Second and subsequent level caches can be large enough to capture many accesses that would have gone to main memory, and are faster (therefore less penalty)
• Critical word first and early restart
  - Don't wait for the full block to be loaded; send the critical word first, restart the CPU, and continue the load
• Give priority to read misses over write misses
• Merging write buffer
  - If the address of a new entry matches that of one already in the write buffer, combine them
• Victim caches
  - Cache discarded blocks elsewhere: remember what was discarded in case it is needed again
  - Insert a small fully associative cache between the cache and its refill path
  - This "victim cache" contains only blocks that were discarded from the cache on a miss (by the replacement policy)
  - Check the victim cache on a miss before going to the next lower level of memory

Early restart and critical word 1st
• With this strategy we're going to be impatient
  - As soon as some of the block is loaded, see if the data is there and send it to the CPU
  - (i.e. we don't wait for the whole block to be loaded)
• There are 2 general strategies:
  - Early restart: as soon as the requested word gets to the cache, send it to the CPU
  - Critical word first: specifically ask for the needed word 1st, make sure it gets to the CPU, then get the rest of the cache block data
Multi-Level Caches: Second-level caches
• The local miss rate is not a good measure of secondary caches - it's a function of the L1 miss rate
  - Which can vary by changing the L1 cache
  - Use the global cache miss rate to evaluate 2nd level caches!
• 2nd level caches are usually BIG!
  - Usually L1 is a subset of L2
  - There should be few capacity misses in the L2 cache
  - Only worry about compulsory and conflict misses for optimizations...
Second-level caches (continued)
• We can reduce the miss penalty by reducing the miss rate of the 2nd level cache using techniques previously discussed...
  - I.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd-level hit time
  - And much of the average access time is due to misses in the L2 cache
• Could also reduce misses by increasing the L2 block size
• Need to think about something called the "multilevel inclusion property":
  - In other words, all data in the L1 cache is always in L2...
  - Gets complex for writes, and what not...

Hardware prefetching
• This one should intuitively be pretty obvious:
  - Try to fetch blocks before they're even requested...
  - This could work with both instructions and data
  - Usually, prefetched blocks are placed either:
    - Directly in the cache (what's a downside to this?)
    - Or in some external buffer that's usually a small, fast cache
• Let's look at an example: the Alpha AXP 21064
  - On a cache miss, it fetches 2 blocks:
    - One is the new cache entry that's needed
    - The other is the next consecutive block - it goes in a buffer
  - How well does this buffer perform?
    - A single-entry buffer catches 15-25% of misses
    - With a 4-entry buffer, this improves to about 50%
Hardware prefetching example
• What is the effective miss rate for the Alpha using instruction prefetching?
• Assume:
  - It takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer
  - The prefetch hit rate is 25%
  - The miss rate for the 8-KB instruction cache is 1.10%
  - Hit time is 2 clock cycles
  - Miss penalty is 50 clock cycles
• We need a revised memory access time formula:
  - Average memory access time (prefetch) =
      Hit time + Miss rate × Prefetch hit rate × 1 + Miss rate × (1 - Prefetch hit rate) × Miss penalty
  - Average memory access time (no prefetching) = Hit time + Miss rate × Miss penalty
• Results: AMAT with prefetching = 2.415 cycles, so the effective miss rate is (2.415 - 2) / 50 = 0.83%
  - The calculation suggests the effective miss rate of prefetching with an 8KB cache is 0.83%
  - Actual miss rates: 16KB cache = 0.64%, 8KB cache = 1.10%
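The arithmetic, spelled out in C using the slide's numbers:

    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 2.0;     /* cycles                        */
        double miss_rate    = 0.011;   /* 1.10% for the 8-KB I-cache    */
        double prefetch_hit = 0.25;    /* 25% of misses found in buffer */
        double miss_penalty = 50.0;    /* cycles                        */

        double amat_prefetch = hit_time
                             + miss_rate * prefetch_hit * 1.0
                             + miss_rate * (1.0 - prefetch_hit) * miss_penalty;

        /* effective miss rate: the miss rate that would give the same AMAT
           without prefetching                                             */
        double effective_mr = (amat_prefetch - hit_time) / miss_penalty;

        printf("AMAT with prefetching = %.3f cycles\n", amat_prefetch);   /* 2.415 */
        printf("Effective miss rate   = %.2f%%\n", effective_mr * 100);   /* 0.83  */
        return 0;
    }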
Compiler optimizations: merging arrays, loop fusion, blocking
• These optimizations try to reduce misses by improving temporal locality
• To get a handle on this, you have to work through code on your own - Homework!
• This is used mainly with arrays!

Blocking Example
(a blocked version is sketched after this example)

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

• Two inner loops:
  - Read all NxN elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Row-major access
• Capacity misses are a function of N & cache size:
  - 2N^3 + N^2 words accessed (assuming no conflicts; otherwise ...)
• Idea: compute on a BxB submatrix that fits in the cache
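One way the blocked ("after") version might look - a sketch only, assuming the blocking factor B divides N and that x[][] is zeroed beforehand, since partial sums are accumulated:

    /* After: blocked matrix multiply.  Each pass reuses a BxB submatrix of z
     * and a strip of y while they are still resident in the cache, cutting
     * the words touched from roughly 2N^3 + N^2 to roughly 2N^3/B + N^2. */
    for (jj = 0; jj < N; jj = jj + B)
        for (kk = 0; kk < N; kk = kk + B)
            for (i = 0; i < N; i = i + 1)
                for (j = jj; j < jj + B; j = j + 1) {
                    r = 0;
                    for (k = kk; k < kk + B; k = k + 1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;   /* accumulate the partial sum */
                }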
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
• [Figure: performance improvement (1x-3x) from hand-applied compiler optimizations on benchmarks including vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing hit time: keep the cache small and simple
• Keeping the hit time this small is a good thing!
• Direct mapping also falls under the category of "simple"
  - Relates to the point above as well - you can check the tag and read the data at the same time!
Avoid address translation during cache indexing
• This problem centers around virtual addresses: should we send the virtual address to the cache?
  - In other words, we have virtual caches vs. physical caches
• Why is this a problem anyhow?
  - Recall from OS that a processor usually deals with processes
  - What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
  - The data in the cache would be totally different! - this is called aliasing

Separate Instruction and Data Caches
• A multilevel cache is one option for design
• Another view:
  - Separate the instruction and data caches
  - Instead of a unified cache, have a separate I-cache and D-cache
  - Problem: what size does each have?
• What's the avg. memory access time in each case?
  - (75% × 1.995) + (25% × 2.995) = 2.24 cycles
• Despite a higher miss rate, access time is faster for the split cache!
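The weighted average, written out; reading the 75%/25% split as the fraction of instruction vs. data references is an assumption about the slide's numbers.

    #include <stdio.h>

    int main(void)
    {
        double frac_inst = 0.75, frac_data = 0.25;   /* fraction of accesses         */
        double amat_inst = 1.995, amat_data = 2.995; /* cycles per access, per class */

        /* Overall average memory access time is the access-weighted mean. */
        double amat = frac_inst * amat_inst + frac_data * amat_data;
        printf("AMAT = %.3f cycles\n", amat);        /* 2.245, quoted as 2.24 on the slide */
        return 0;
    }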
Reducing Time to Hit in Cache: The Trace Cache Proposal
• Trace caches
  - An ILP technique
  - A trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block
  - Branch prediction is folded into the cache
• [Figure: basic blocks A-G. The I-cache stores them along static program order (lines broken at static boundaries, with a path taken only 10% of the time mixed in), while the trace cache packs the dynamically common 90% path into contiguous trace-cache lines.]
Cache Summary