This document summarizes a lecture on memory hierarchies and caches. It discusses set-associative caches, block placement and identification policies, block replacement policies, and write policies like write-back and write-through. It also describes techniques for reducing cache miss penalties like victim caches, critical word first loading, and multilevel caches. The goal is to improve overall cache performance by reducing miss rates, miss penalties, and hit times.


Computer Science 246
Computer Architecture
Spring 2009
Harvard University

Instructor: Prof. David Brooks
[email protected]

Memory Hierarchy and Caches (Part 2)

Computer Science 246


David Brooks
Caches
• Monday lecture
– Review of cache basics, direct-mapped, set-associative caches
• Today
– More on cache performance, write strategies

Computer Science 246


David Brooks
Summary of Set Associativity
• Direct Mapped
– One place in cache, One Comparator, No Muxes
• Set Associative Caches
– Restricted set of places
– N-way set associativity
– Number of comparators = number of blocks per set
– N:1 mux
• Fully Associative
– Anywhere in cache
– Number of comparators = number of blocks in cache
– N:1 mux needed (an address-breakdown sketch follows below)
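As a rough illustration of how an address maps onto these organizations, here is a minimal C sketch; the 32 KB / 4-way / 64-byte-block parameters are illustrative assumptions, not values from the lecture.

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters (assumed, not from the slides): 32 KB, 4-way, 64 B blocks */
#define CACHE_BYTES (32 * 1024)
#define WAYS        4
#define BLOCK_BYTES 64
#define NUM_SETS    (CACHE_BYTES / (WAYS * BLOCK_BYTES))   /* 128 sets */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr % BLOCK_BYTES;               /* byte within the block */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;   /* selects one set */
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_SETS);   /* compared against each of the WAYS tags */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}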
Computer Science 246
David Brooks
More Detailed Questions
• Block placement policy?
– Where does a block go when it is fetched?
• Block identification policy?
– How do we find a block in the cache?
• Block replacement policy?
– When fetching a block into a full cache, how do we
decide what other block gets kicked out?
• Write strategy?
– Does any of this differ for reads vs. writes?

Computer Science 246


David Brooks
Block Placement + ID
• Placement
– Invariant: block always goes in exactly one set
– Fully-Associative: Cache is one set, block goes anywhere
– Direct-Mapped: Block goes in exactly one frame
– Set-Associative: Block goes in one of a few frames
• Identification
– Find Set
– Search ways in parallel (compare tags, check valid bits); a lookup sketch follows below
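A minimal C sketch of the identification step described above, assuming a 4-way set; the struct and function names are hypothetical, and real hardware compares all ways in parallel rather than in a loop.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct line { bool valid; uint32_t tag; };   /* data array omitted */

/* Returns the matching way within the selected set, or -1 on a miss */
int lookup(const struct line set[WAYS], uint32_t tag) {
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == tag)   /* check valid bit, compare tag */
            return w;
    return -1;
}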

Computer Science 246


David Brooks
Block Replacement
• Cache miss requires a replacement
• No decision needed in direct mapped cache
• More than one possible place for a memory block in a set-associative cache
• Replacement Strategies
– Optimal
• Replace Block used furthest ahead in time (oracle)
– Least Recently Used (LRU)
• Optimized for temporal locality (a victim-selection sketch follows below)
– (Pseudo) Random
• Nearly as good as LRU, simpler
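Below is a minimal C sketch of LRU victim selection within a set, assuming per-way last-use counters; the function name and 4-way parameter are illustrative, and real hardware typically uses cheaper pseudo-LRU approximations.

#include <stdint.h>

#define WAYS 4

/* Choose the victim way under LRU: the way touched least recently.
 * last_used[w] is assumed to be updated from a global counter on every hit. */
int lru_victim(const uint64_t last_used[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}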
Computer Science 246
David Brooks
Write Policies
• Writes are only about 21% of data cache traffic
• Optimize cache for reads, do writes “on the side”
– Reads can do tag check/data read in parallel
– Writes must be sure we are updating the correct data
and the correct amount of data (1-8 byte writes)
– Serial process => slow
• What to do on a write hit?
• What to do on a write miss?

Computer Science 246


David Brooks
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write back – Information is only written to the cache.
– Next lower level only updated when the block is evicted (dirty bits say when data has been modified)
– Can write at speed of cache
– Caches become temporarily inconsistent with lower levels of the hierarchy
– Uses less memory bandwidth/power (multiple consecutive writes may require only 1 final write)
– Multiple writes within a block can be merged into one write
– Evictions are longer latency now (must write back); a write-back sketch follows below
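A minimal write-back sketch in C; the line structure, the 64-byte block size, and the writeback_to_next_level helper are illustrative assumptions, not the lecture's implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

/* Stand-in for the transfer to the next level (hypothetical helper) */
static void writeback_to_next_level(const struct line *l) {
    printf("writing back block with tag 0x%x\n", l->tag);
}

/* Write hit: update only the cached copy and set the dirty bit;
 * the lower level is updated later, at eviction time. */
static void write_hit(struct line *l, int offset, uint8_t value) {
    l->data[offset] = value;
    l->dirty = true;                      /* lower level is now stale */
}

/* Eviction: dirty blocks must be written back first (longer-latency eviction) */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        writeback_to_next_level(l);
    l->valid = false;
    l->dirty = false;
}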

Computer Science 246


David Brooks
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write through – Information is written to cache
and to the lower-level memory
– Main memory is always “consistent/coherent”
– Easier to implement – no dirty bits
– Reads never result in writes to lower levels (cheaper)
– Higher bandwidth needed
– Write buffers used to avoid write stalls

Computer Science 246


David Brooks
Write buffers
[Diagram: CPU -> Cache -> Write Buffer -> Lower Levels of Memory]
• Small chunks of memory to buffer outgoing writes
• Processor can continue when data is written to the buffer
• Allows overlap of processor execution with memory update

• Write buffers are essential for write-through caches


Computer Science 246
David Brooks
Write buffers
• Writes can now be pipelined (rather than serial)
• Check tag + write store data into Write Buffer
• Write data from Write Buffer to L2 cache (tags ok)
• Loads must check the Write Buffer for pending stores to the same address (a sketch of this check follows below)
[Diagram: store op -> Write Buffer entries (Address | Data) -> Data Cache (Tag | Data) -> subsequent levels of memory]
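A minimal C sketch of the load check against the write buffer; the 4-entry size and single-word granularity are assumptions, and a real buffer would also handle multiple matches and coalescing.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* A load must scan the write buffer for a pending store to the same address
 * before going to the lower levels; if one is found, the load can be
 * serviced from the buffer (or stall until the store drains). */
bool load_hits_write_buffer(uint32_t addr, uint32_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *out = write_buffer[i].data;
            return true;
        }
    return false;   /* fall through to the cache / lower levels */
}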
Computer Science 246
David Brooks
Write buffer policies:
Performance/Complexity Tradeoffs
[Diagram: stores and loads flowing between the write buffer and the L2 cache]

• Allow merging of multiple stores? (“coalescing”)
• “Flush Policy” – How to do flushing of entries?
• “Load Servicing Policy” – What happens when a load occurs to data currently in the write buffer?
Computer Science 246
David Brooks
Write misses?
• Write Allocate
– Block is allocated on a write miss
– Standard write hit actions follow the block allocation
– Write misses = Read Misses
– Goes well with write-back
• No-write Allocate
– Write misses do not allocate a block
– Only update lower-level memory
– Blocks are only allocated on Read misses!
– Goes well with write-through (a write-miss sketch follows below)
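A minimal C sketch contrasting the two write-miss policies; the helper functions are hypothetical stubs standing in for the rest of the cache model.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs for the rest of the cache model */
static void fetch_block_into_cache(uint32_t addr)              { printf("fetch block for 0x%x\n", addr); }
static void write_into_cache(uint32_t addr, uint32_t data)     { printf("cache write 0x%x\n", addr); (void)data; }
static void write_to_lower_level(uint32_t addr, uint32_t data) { printf("memory write 0x%x\n", addr); (void)data; }

static void handle_write_miss(uint32_t addr, uint32_t data, bool write_allocate) {
    if (write_allocate) {
        fetch_block_into_cache(addr);      /* the miss is handled like a read miss */
        write_into_cache(addr, data);      /* then the normal write-hit path runs */
    } else {
        write_to_lower_level(addr, data);  /* no-write allocate: cache contents are unchanged */
    }
}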
Computer Science 246
David Brooks
Summary of Write Policies

Write Policy Hit/Miss Writes to


WriteBack/Allocate Both L1 Cache
WriteBack/NoAllocate Hit L1 Cache
WriteBack/NoAllocate Miss L2 Cache
WriteThrough/Allocate Both Both
WriteThrough/NoAllocate Hit Both
WriteThrough/NoAllocate Miss L2 Cache

Computer Science 246


David Brooks
Cache Performance
CPU time = (CPU execution cycles + Memory Stall Cycles) * Clock Cycle Time

AMAT = Hit Time + Miss Rate * Miss Penalty

• Reducing these three parameters can have a big impact on performance (a worked example follows below)
• Out-of-order processors can hide some of the miss penalty
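As a worked example with illustrative numbers (not taken from the lecture): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 * 100 = 6 cycles. Halving the miss penalty to 50 cycles gives AMAT = 1 + 0.05 * 50 = 3.5 cycles, which is why the techniques that follow target all three terms.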
Computer Science 246
David Brooks
Reducing Miss Penalty
• Have already seen two examples of techniques to
reduce miss penalty
– Write buffers give priority to read misses over writes
– Merging write buffers
• Multiword writes are faster than many single word writes
• Now we consider several more
– Victim Caches
– Critical Word First/Early Restart
– Multilevel caches

Computer Science 246


David Brooks
Reducing Miss Penalty:
Victim Caches
• Direct mapped caches => many conflict misses
• Solution 1: More associativity (expensive)
• Solution 2: Victim Cache
• Victim Cache
– Small (4- to 8-entry), fully-associative cache between the L1 cache and the refill path
– Holds blocks discarded from cache because of evictions
– Checked on a miss before going to L2 cache
– Hit in victim cache => swap victim block with cache block (a lookup sketch follows below)
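A minimal C sketch of the victim-cache check performed on an L1 miss; the 4-entry size is within the range above, the tag here stands for the full block address, and the swap with the L1 block is omitted.

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* victim caches are tiny (4-8 entries), fully associative */

struct vline { bool valid; uint32_t tag; };   /* tag covers the full block address; data omitted */
static struct vline victim_cache[VC_ENTRIES];

/* Checked only after an L1 miss, before going to L2.  On a hit, the victim
 * block and the conflicting L1 block would be swapped. */
int victim_lookup(uint32_t tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].tag == tag)
            return i;      /* victim-cache hit: the L2 access is avoided */
    return -1;             /* miss: continue to L2 */
}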

Computer Science 246


David Brooks
Reducing Miss Penalty:
Victim Caches

• Even one entry helps some benchmarks!
• Helps more for smaller caches, larger block sizes
Computer Science 246
David Brooks
Reducing Miss Penalty:
Critical Word First/Early Restart
• CPU normally just needs one word at a time
• Large cache blocks have long transfer times
• Don’t wait for the full block to be loaded before sending the requested data word to the CPU
• Critical Word First
– Request the missed word first from memory and send it
to the CPU and continue execution
• Early Restart
– Fetch words in order, but as soon as the requested word arrives, send it to the CPU and continue execution

Computer Science 246


David Brooks
Review: Improving Cache
Performance
• How to improve cache performance?
– Reducing Cache Miss Penalty
– Reducing Miss Rate
– Reducing Miss Penalty/Rate via parallelism
– Reducing Hit Time

Computer Science 246


David Brooks
Non-blocking Caches to reduce stalls
on misses
• Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
– requires out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller as there
can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise cannot support)
– Pentium Pro allows 4 outstanding memory misses (a miss-tracking sketch follows below)
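A minimal C sketch of tracking outstanding misses so hits can continue to be serviced; the 4-entry limit mirrors the Pentium Pro example above, while the structure and function names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4   /* e.g. the Pentium Pro allowed 4 outstanding misses */

struct miss_entry { bool busy; uint32_t block_addr; };
static struct miss_entry outstanding[MAX_OUTSTANDING];

/* Returns true if a new miss can be tracked (hit-under-miss / miss-under-miss
 * continues); false means all entries are in use and the cache must stall. */
bool start_miss(uint32_t block_addr) {
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (!outstanding[i].busy) {
            outstanding[i].busy = true;
            outstanding[i].block_addr = block_addr;
            return true;
        }
    return false;
}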
Computer Science 246
David Brooks
Value of Hit Under Miss for SPEC
Percentage Memory Stall Time of a Blocking Cache

• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
Reducing Misses by Hardware
Prefetching of Instructions & Data
• Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer” not the cache
– On Access: check both cache and stream buffer
– On SB Hit: move line into cache
– On SB Miss: Clear and refill SB with successive lines
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty (a stream-buffer sketch follows below)
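A minimal C sketch of the stream-buffer refill behavior described above; the 4-entry depth and block-address arithmetic are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define SB_DEPTH 4

struct sb_entry { bool valid; uint32_t block_addr; };   /* block data omitted */
static struct sb_entry stream_buf[SB_DEPTH];

/* On a miss that also misses the stream buffer: clear it and refill it with
 * the blocks that follow the missed one, so sequential accesses hit in it. */
void refill_stream_buffer(uint32_t missed_block_addr) {
    for (int i = 0; i < SB_DEPTH; i++) {
        stream_buf[i].valid = true;
        stream_buf[i].block_addr = missed_block_addr + 1u + (uint32_t)i;
    }
}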
Computer Science 246
David Brooks
Hardware Prefetching
• What to prefetch?
– One block ahead (spatially)
• What will this work well for?
– Address prediction for non-sequential data
• Correlated predictors (store miss, next_miss pairs in table)
• Jump-pointers (augment data structures with prefetch pointers)
• When to prefetch?
– On every reference
– On a miss (basically doubles block size!)
– When resident data becomes “dead” -- how do we know?
• No one will use it anymore, so it can be kicked out
Computer Science 246
David Brooks
Reducing Misses by
Software Prefetching Data
• Data Prefetch
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; a form of speculative
execution
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
• Must be correct address and register!
– Non-Binding prefetch: Load into cache.
• Can be incorrect. Faults?
• Issuing Prefetch Instructions takes time
– Is cost of prefetch issues < savings in reduced misses?
– Higher superscalar width reduces the difficulty of issue bandwidth (a software-prefetch sketch follows below)
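A minimal non-binding software-prefetch sketch in C, assuming a GCC/Clang-style compiler that provides __builtin_prefetch; the prefetch distance of 16 iterations is a guess, not a tuned value.

/* Non-binding prefetch: brings the line into the cache, cannot fault, and a
 * wrong address only wastes bandwidth.  __builtin_prefetch(addr, rw, locality)
 * is the GCC/Clang intrinsic. */
#define PREFETCH_DISTANCE 16

double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}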
Computer Science 246
David Brooks
Reducing Hit Times
• Some common techniques/trends
q
– Small and simple caches
• Pentium III – 16KB L1
• Pentium 4 – 8KB L1
– Pipelined Caches (actually bandwidth increase)
• Pentium – 1 clock cycle I-Cache
• Pentium III – 2 clock cycle I-Cache
• Pentium 4 – 4 clock cycle I-Cache
– Trace Caches
• Beyond spatial locality
• Dynamic sequences of instructions (including taken branches)

Computer Science 246


David Brooks
Cache Bandwidth
• Superscalars need multiple memory accesses per cycle
• Parallel cache access: more difficult than parallel ALUs
– Caches have state, so multiple accesses will affect each other
• “True Multiporting”
– Multiple decoders, read/write wordlines per SRAM cell
– Pipeline a single port by “double pumping” (Alpha 21264)
– Multiple cache copies (like clustered register file) (POWER4)
• Interleaved Multiporting
– Cache divides into banks – two accesses to the same bank => conflict (a bank-selection sketch follows below)
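A minimal C sketch of interleaved bank selection; the 4-bank and 64-byte-block parameters are illustrative assumptions.

#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4

/* Two accesses in the same cycle conflict only if they select the same bank */
uint32_t bank_of(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}

int accesses_conflict(uint32_t addr_a, uint32_t addr_b) {
    return bank_of(addr_a) == bank_of(addr_b);
}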

Computer Science 246


David Brooks
