Stanford Advanced Caches
http://eeclass.stanford.edu/ee282
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Non-blocking or Lockup-Free Caches
Basic idea
Allow for hits while serving a miss (hit-under-miss)
Allow for more than one outstanding miss (miss-under-miss)
When does it make sense (for L1, L2, …)
When the processor can handle >1 pending load/store
This is the case with superscalar processors
When the cache serves >1 processor or other cache
When the lower level allows for multiple pending accesses
More on this later
What is difficult about non-blocking caches
Handling multiple misses at the same time
Handling loads to pending misses
Handling stores to pending misses
Potential of Non-blocking Caches
[Timeline diagram: a blocking cache stalls the CPU for the full miss penalty; a non-blocking cache lets the CPU continue and service hits under the outstanding miss, hiding part of the miss penalty]
Miss Status Handling Register (MSHR)
Keeps track of
Outstanding cache misses
Pending loads & stores that refer to that cache block
Fields of an MSHR
Valid bit
Cache block address
Must support associative search
Issued bit (1 if already request issued to memory)
For each pending load or store
Valid bit
Type (load/store) and format (byte/halfword/…)
Block offset
Destination register for load OR store buffer entry for stores
MSHR layout (field widths in bits):
Valid (1) | Block Address (27) | Issued (1)
Load/store entry 0: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
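The MSHR fields above can be modeled as a small Python sketch (class and method names are illustrative, not from the slides):

```python
class LoadStoreEntry:
    """One pending load/store waiting on an outstanding miss."""
    def __init__(self, op, fmt, block_offset, destination):
        self.valid = True
        self.op = op                    # "load" or "store"
        self.fmt = fmt                  # "byte", "halfword", "word", ...
        self.block_offset = block_offset
        self.destination = destination  # dest register (load) or store-buffer entry (store)

class MSHR:
    """Tracks one outstanding cache miss and its pending accesses."""
    def __init__(self, num_entries=4):
        self.valid = False
        self.block_address = None   # searched associatively on each miss
        self.issued = False         # True once the request went to memory
        self.entries = [None] * num_entries

    def allocate(self, block_address):
        self.valid = True
        self.block_address = block_address
        self.issued = False
        self.entries = [None] * len(self.entries)

    def add_access(self, entry):
        """Attach a pending load/store; returns False if no slot is free (stall)."""
        for i, e in enumerate(self.entries):
            if e is None:
                self.entries[i] = entry
                return True
        return False
```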
Non-blocking Caches: Operation
On a cache miss:
Search MSHRs for pending access to same cache block
If yes, just allocate new load/store entry
(if no) Allocate free MSHR
Update block address and first load/store entry
If no MSHR or load/store entry free, stall
When one word/sub-block of the cache line becomes available
Check which loads/stores are waiting for it
Forward data to LSU
Mark those loads/stores as invalid
Write the word into the cache
When last word for cache line is available
Mark MSHR as invalid
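The miss-handling steps above can be sketched as follows (a minimal model; the dict-based MSHR representation and function names are illustrative):

```python
def on_miss(mshrs, block_addr, access):
    """Handle a cache miss: merge into a pending MSHR or allocate a new one.
    mshrs: list of dicts {"valid", "block_addr", "issued", "entries"}
    access: descriptor of the pending load/store (any object)
    Returns "merged", "allocated", or "stall".
    """
    # 1. Associative search for a pending miss to the same cache block
    for m in mshrs:
        if m["valid"] and m["block_addr"] == block_addr:
            m["entries"].append(access)      # just add a load/store entry
            return "merged"
    # 2. Otherwise allocate a free MSHR
    for m in mshrs:
        if not m["valid"]:
            m.update(valid=True, block_addr=block_addr,
                     issued=False, entries=[access])
            return "allocated"
    # 3. No free MSHR (or load/store entry): the processor must stall
    return "stall"

def on_fill(mshrs, block_addr):
    """Last word of the line arrived: free the MSHR and return the
    pending accesses, which would be forwarded to the LSU."""
    for m in mshrs:
        if m["valid"] and m["block_addr"] == block_addr:
            waiting = m["entries"]
            m["valid"] = False               # mark MSHR as invalid
            return waiting
    return []
```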
Non-blocking Cache Efficacy
[Chart: performance impact of a non-blocking cache]
Prefetching
Idea: fetch data into the cache before the processor requests it
Can address cold misses
Can be done by the programmer, compiler, or hardware
Stream Buffer Design
[Figure: stream buffer organization]
Strided Prefetching
Idea: detect and prefetch strided accesses
A table indexed by PC tracks (Stride, Last Addr, Conf), e.g. for:
for (i=0; i<N; i++) A[i*1024]++;
PC = 0x08ab0, Stride = 8, Last Addr = 0xff024, Conf = 10
http://www.intel.com/Assets/PDF/manual/248966.pdf
Other Ideas in Prefetching
Prefetch for pointer-based data structures
Predict if fetched data contain a pointer & follow it
Works for linked-lists, graphs, etc
Must be very careful:
What is a pointer?
How far to prefetch?
Prefetching Efficacy
[Chart: performance impact of prefetching]
Multi-ported Caches
Idea: allow for multiple accesses in parallel
Processor with many LSUs, I+D access in L2, …
[Figure: cache with multiple request/data ports]
Multi-banked Caches
[Figure: two cache banks, each with its own request and read-data port]
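Banked caches typically select the bank from low-order block-address bits, so consecutive blocks map to different banks and can be accessed in parallel; a minimal sketch (the block size and bank count are illustrative assumptions):

```python
BLOCK_BITS = 6    # 64-byte cache blocks (illustrative)
NUM_BANKS = 2     # power of two, so bank selection is a simple bit field

def bank_of(addr):
    """Select the bank from the low bits of the block address."""
    block_addr = addr >> BLOCK_BITS
    return block_addr % NUM_BANKS

def can_serve_in_parallel(addr_a, addr_b):
    """Two requests proceed in parallel only if they hit different banks;
    same-bank requests conflict and serialize."""
    return bank_of(addr_a) != bank_of(addr_b)
```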
Multi-porting Efficacy
[Chart: performance impact of multi-porting]
Summary of Advanced Cache Optimizations
(+ improves, - hurts, ~ small/mixed effect)

Cache                | Miss rate                 | Miss    | Hit  | Band-
optimization         | Cold  Capacity  Conflict  | penalty | time | width
Multi-level          |                           |    +    |      |
Victim cache         |                    ~      |    +    |      |
Pseudo-assoc.        |                   +/~     |         |      |
Skew-assoc.          |                    +      |         |  ~   |
Critical-word-first  |                           |    +    |      |
Non-blocking         |                           |    +    |      |   ~
Prefetching          |   +                -      |    +    |      |
Multi-porting        |                           |         |  ~   |   +

Also see Figure 5.11 in H&P
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
The coherence problem:
[Figure: bus-based multiprocessor with per-processor caches, memory, and I/O devices. Two processors read u and cache u = 5; another processor then writes u = 7, leaving stale copies of u = 5 in the other caches]
Coherence mechanisms
Metadata to track state for cached data
Controller that snoops bus (or interconnect)
activity and reacts if needed to adjust the state
of the cache data
MSI: Simple Coherence Protocol for Write-Back Caches

Each cache line has an address tag plus state bits:
M: Modified
S: Shared
I: Invalid

[State diagram, for the cache state in processor P1:
M → M: P1 reads or writes
M → S: other processor reads (P1 writes back)
M → I: other processor signals intent to write (P1 writes back)
S → S: read by any processor
S → I: other processor signals intent to write
I → S: read miss]
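The state transitions for P1's copy of a line can be written as a lookup table; a sketch covering just the transitions in the diagram (event names are illustrative, and P1-initiated upgrades such as I→M on a write miss are omitted here):

```python
# MSI transitions for a line in P1's cache, keyed by (state, observed event).
MSI = {
    ("M", "p1_read"): "M",
    ("M", "p1_write"): "M",
    ("M", "other_read"): "S",               # P1 writes the block back
    ("M", "other_intent_to_write"): "I",    # P1 writes back, then invalidates
    ("S", "any_read"): "S",
    ("S", "other_intent_to_write"): "I",
    ("I", "read_miss"): "S",
}

def next_state(state, event):
    """Return the next MSI state; unknown events leave the state unchanged."""
    return MSI.get((state, event), state)
```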
Quick Questions
How many copies of a cache line can you have
in S state?
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Playstation 3!
Cache vis-à-vis Local Store
[Table: side-by-side comparison of cache vs. local store]
Local Stores: AMAT
AMAT = HitTime + MissRate * MissPenalty
MissRate = 0%!
Consequences?
Simpler performance analysis
Less motivation for out-of-order cores
Cell processor is in-order
High clock rate and low power
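To make the AMAT point concrete, a worked example (the hit time, miss rate, and miss penalty below are illustrative numbers, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: AMAT = HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty

# A cache with a 2% miss rate and a 200-cycle miss penalty (illustrative):
cache_amat = amat(1, 0.02, 200)   # 1 + 0.02 * 200 = 5.0 cycles, and variable
# A local store: software guarantees residency, so MissRate = 0:
ls_amat = amat(1, 0.0, 200)       # exactly 1 cycle, fully predictable
```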
Local Stores: Operation
LD/ST instructions to LS proceed normally
No LD/ST to non-LS memory
Stream Programming
[Timeline: get(a) into the local store; then do_something(a) overlapped with get(b); then do_something(b), ...]
SW complexity
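The overlap of get() with do_something() is classic double buffering; a sketch in Python, with a thread pool standing in for the asynchronous DMA engine (the function names follow the slide, everything else is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def stream(chunks, get, do_something):
    """Process a stream of chunks, overlapping the fetch of chunk i+1
    with the computation on chunk i (double buffering)."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:    # stands in for the DMA engine
        pending = dma.submit(get, chunks[0])          # get(a)
        for nxt in chunks[1:]:
            buf = pending.result()                    # wait for the fetch to land
            pending = dma.submit(get, nxt)            # start get(b) ...
            results.append(do_something(buf))         # ... while computing on a
        results.append(do_something(pending.result()))
    return results
```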
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Extending the hierarchy below DRAM:
Disk: ~10^6-10^7 cycles, ~1 TB, managed by software/OS
Further out: tape, "the Interwebs"
Example: File cache
Do files exhibit locality?
Prefetching? Microsoft "SuperFetch": load common programs at boot
Write back or write through? When should we write to disk?
Coherence? "Leases" in network filesystems
Associativity? Place arbitrarily and keep an index
Most disks have caches
Example: Browser cache
Do web pages you visit exhibit locality?
Write back or write through? No writes!
Coherence? Did the page change since I last checked? Relaxed coherence via the "If-Modified-Since" header
Replacement policy? Probably LRU
AMAT?
Caching is a ubiquitous tool
Same design issues in system design as in
processor design
Placement, lookup, write policies, replacement
policies, coherence
Next Lecture