
EE282 Lecture 4

Advanced Caching (2)


Jacob Leverich

http://eeclass.stanford.edu/ee282

EE282 – Spring 2011 – Lecture 04


Announcements
 HW1 out
 Due Wed 4/20 @ 5pm, box outside Gates 305

2
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


3
Advanced Cache Optimizations
 Multi-level caches and inclusion
 Victim caches
 Pseudo-associative caches
 Skew-associative caches
 Critical word first
 Non-blocking caches
 Prefetching
 Multi-ported caches

 Readings: H&P 5.1-2 and 4.2


 Read on your own about way prediction, pipelined caches, merging
write buffers, compiler optimizations

4
Non-blocking or Lockup-Free Caches
 Basic idea
 Allow for hits while serving a miss (hit-under-miss)
 Allow for more than one outstanding miss (miss-under-miss)
 When does it make sense (for L1, L2, …)
 When the processor can handle >1 pending load/store
 This is the case with superscalar processors
 When the cache serves >1 processor or other cache
 When the lower level allows for multiple pending accesses
 More on this later
 What is difficult about non-blocking caches
 Handling multiple misses at the same time
 Handling loads to pending misses
 Handling stores to pending misses
5
Potential of Non-blocking Caches

(Figure: execution timelines. With a blocking cache, the CPU stalls for the full miss penalty on every miss. With hit-under-miss, the CPU keeps executing hits during one outstanding miss and stalls only when the result is needed. With multiple outstanding misses, several miss penalties overlap.)
6
Miss Status Handling Register
 Keeps track of
 Outstanding cache misses
 Pending load & stores that refer to that cache block
 Fields of an MSHR
 Valid bit
 Cache block address
 Must support associative search
 Issued bit (1 if request already issued to memory)
 For each pending load or store
 Valid bit
 Type (load/store) and format (byte/halfword/…)
 Block offset
 Destination register for load OR store buffer entry for stores
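
A minimal sketch of these fields as a C struct (names are illustrative; the widths match the diagram on the next slide):

#include <stdint.h>

#define MSHR_TARGETS 4          /* load/store entries per MSHR */

/* One pending load/store waiting on the missing block */
typedef struct {
    uint8_t valid;              /* 1 bit: entry in use */
    uint8_t type;               /* 3 bits: load/store + format (byte/halfword/...) */
    uint8_t block_offset;       /* 5 bits: word within the cache block */
    uint8_t dest;               /* 5 bits: dest register (load) or store buffer entry (store) */
} mshr_target_t;

/* One miss status handling register */
typedef struct {
    uint8_t       valid;        /* 1 bit: MSHR in use */
    uint32_t      block_addr;   /* 27 bits: block address (searched associatively) */
    uint8_t       issued;       /* 1 bit: request already sent to memory */
    mshr_target_t target[MSHR_TARGETS];
} mshr_t;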
7
MSHR

Layout (field widths in bits):

  Valid (1) | Block Address (27) | Issued (1)
  Load/store 0: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 1: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 2: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 3: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
8
Non-blocking Caches: Operation
 On a cache miss:
  Search MSHRs for a pending access to the same cache block
  If found, just allocate a new load/store entry
  If not, allocate a free MSHR
  Update the block address and the first load/store entry
  If no MSHR or load/store entry is free, stall
 When one word/sub-block of the cache line becomes available
  Check which loads/stores are waiting for it
  Forward the data to the LSU
  Mark those loads/stores as invalid
  Write the word into the cache
 When the last word of the cache line is available
  Mark the MSHR as invalid
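
A hedged sketch of the miss path in C, reusing the mshr_t type from the sketch above (the allocation policy and helper names are illustrative, not a definitive implementation):

/* Returns 0 if the miss was absorbed by an MSHR, -1 if the pipeline must stall. */
int handle_miss(mshr_t *mshrs, int nmshr, uint32_t block_addr, mshr_target_t req) {
    mshr_t *m = 0;

    /* Associative search for a pending miss to the same cache block */
    for (int i = 0; i < nmshr; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) { m = &mshrs[i]; break; }

    if (!m) {
        /* No match: allocate a free MSHR */
        for (int i = 0; i < nmshr; i++)
            if (!mshrs[i].valid) { m = &mshrs[i]; break; }
        if (!m) return -1;                  /* no free MSHR: stall */
        m->valid = 1;
        m->block_addr = block_addr;
        m->issued = 0;                      /* memory request not yet sent */
    }

    /* Record the load/store in a free target entry */
    for (int i = 0; i < MSHR_TARGETS; i++) {
        if (!m->target[i].valid) {
            m->target[i] = req;
            m->target[i].valid = 1;
            return 0;
        }
    }
    return -1;                              /* no free load/store entry: stall */
}

The fill path (forwarding to the LSU, writing words into the cache, freeing the MSHR) follows the steps above and is omitted here.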
9
Non-blocking Cache Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Non-blocking cache |      |          |          |              |          |

10
Prefetching
 Idea: fetch data into the cache before the processor requests it
 Can address cold misses
 Can be done by the programmer, compiler, or hardware

 Characteristics of ideal prefetching


 You only prefetch data that are truly needed
 Avoid bandwidth waste
 You issue prefetch requests early enough
 To hide the memory latency
 You don’t issue prefetch requests too early
 To avoid cache pollution
11
Software Prefetching
for (i=0; i<N; i++) {
  __prefetch(a[i+8]);
  __prefetch(b[i+8]);
  sum += a[i]*b[i];
}

Doesn’t have to be correct! A bogus prefetch is simply ignored:
  __prefetch(-1);

 Issues with software prefetching
  Takes up issue slots
  Not a big issue with superscalar
  Takes up system bandwidth
  Must have non-blocking caches
  Prefetch distance depends on the specific system implementation
  Non-portable code
  Not easy to use for pointer-based structures
  Requires a ninja programmer/compiler!
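
The __prefetch() above is pseudocode; with GCC or Clang the same loop can be written using the real __builtin_prefetch(addr, rw, locality) intrinsic. A sketch (the lookahead distance of 8 is illustrative and machine-dependent):

/* Dot product with software prefetching. Prefetching a little past the end
   of the array is harmless in practice: prefetches never fault. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], 0, 1);   /* 0 = read, 1 = low temporal locality */
        __builtin_prefetch(&b[i + 8], 0, 1);
        sum += a[i] * b[i];
    }
    return sum;
}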
12
Hardware Prefetching
 Same goal as software prefetching, but initiated by hardware
 Can tune to specific system implementation
 Does not waste instruction issue bandwidth
 More portable code
 Major design questions
 Where to place a prefetch engine?
 L1, L2, …
 What to prefetch?
 Next sequential cache line(s), strided patterns, pointers, …
 When to prefetch?
 On a load, on a miss, when other prefetched data used, …
 Where to place prefetched data?
 In the cache or in a special prefetch buffer
 How to handle VM exceptions?
 Don’t prefetch beyond a page?
13
Simple Sequential Prefetching
 On a cache miss, fetch two sequential memory
blocks
 Exploits spatial locality in both instructions & data
 Exploits high bandwidth for sequential accesses

 Called “Adjacent Cache Line Prefetch” or “Spatial


Prefetch” by Intel

 Extend to fetching N sequential memory blocks
  Pick N large enough to hide the memory latency
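
A minimal sketch of the policy (fetch_line() stands in for a request to the next memory level and is hypothetical):

#include <stdint.h>

extern void fetch_line(uint64_t line);     /* hypothetical: request line from next level */

/* On a demand miss to 'line', also request the next n-1 sequential lines
   (n = 2 gives the adjacent-line prefetch described above). */
void on_demand_miss(uint64_t line, int n) {
    fetch_line(line);                      /* the demand fetch itself */
    for (int i = 1; i < n; i++)
        fetch_line(line + i);              /* sequential prefetches */
}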
14
Stream Prefetching
 Sequential prefetching problem
 Performance slows down once every N cache lines
 Stream prefetching is a continuous version of prefetching
 Stream buffer can fit N cache lines
 On a miss, start fetching N sequential cache lines
 On a stream buffer hit:
 Move cache line to cache, start fetching line (N+1)
 In other words, stream buffer tries to stay N cache lines ahead
 Design issues
 When is a stream buffer allocated
 When is a stream buffer released
 Can use multiple stream buffers to capture multiple streams
 E.g. a program operating on 2 arrays
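
A hedged sketch of one stream buffer's behavior (helper names are hypothetical):

#include <stdint.h>

extern void fetch_line(uint64_t line);      /* hypothetical: request from next level */
extern void move_to_cache(uint64_t line);   /* hypothetical: promote line into cache */

typedef struct { uint64_t next; } stream_buf_t;   /* next sequential line to prefetch */

/* Miss that allocates a stream: start fetching the next 'ahead' lines. */
void on_alloc_miss(stream_buf_t *sb, uint64_t line, int ahead) {
    sb->next = line + 1;
    for (int i = 0; i < ahead; i++)
        fetch_line(sb->next++);
}

/* Stream buffer hit: move the line into the cache and fetch line N+1,
   so the buffer stays N lines ahead of the access stream. */
void on_stream_hit(stream_buf_t *sb, uint64_t line) {
    move_to_cache(line);
    fetch_line(sb->next++);
}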

15
Stream Buffer Design

16
Strided Prefetching
 Idea: detect and prefetch strided accesses
  for (i=0; i<N; i++) A[i*1024]++;
 Stride detected using a PC-based table
  For each PC, remember the stride

  PC      | Stride | Last Addr | Conf
  0x08ab0 | 8      | 0xff024   | 10
  0x03fa8 | 1024   | 0xf0ab2   | 11

 Stride detection
  Remember the last address used for this PC
  Compare to the currently used address for this PC
  Track confidence using a two-bit saturating counter
  Increment when the stride is correct, decrement when incorrect
 How to use the PC-based table
  Similar to stream prefetching, except using the stride instead of +1
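
A sketch of one table entry's update rule (indexing by PC and the prefetch call are illustrative):

#include <stdint.h>

extern void prefetch_line(uint64_t addr);   /* hypothetical prefetch request */

typedef struct {
    uint64_t last_addr;   /* last address issued by this PC */
    int64_t  stride;      /* last observed stride */
    uint8_t  conf;        /* two-bit saturating confidence counter (0..3) */
} stride_entry_t;

void on_load(stride_entry_t *e, uint64_t addr) {
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;         /* stride confirmed: gain confidence */
    } else {
        if (e->conf > 0) e->conf--;         /* stride broken: lose confidence */
        e->stride = stride;                 /* learn the new stride */
    }
    e->last_addr = addr;
    if (e->conf >= 2)
        prefetch_line(addr + e->stride);    /* prefetch one stride ahead */
}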
18
Sandy Bridge Prefetching
(Intel Core i7-2600K)
 “Intel 64 and IA-32 Architectures Optimization
Reference Manual, Jan 2011”, pg 2-24

http://www.intel.com/Assets/PDF/manual/248966.pdf
19
Other Ideas in Prefetching
 Prefetch for pointer-based data structures
 Predict if fetched data contain a pointer & follow it
 Works for linked lists, graphs, etc.
 Must be very careful:
 What is a pointer?
 How far to prefetch?
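
Hardware pointer prefetchers guess which fetched words are pointers and follow them; the software analogue of the same idea is easy to sketch (work() is a hypothetical per-node computation):

typedef struct node { struct node *next; int payload; } node_t;

extern int work(int payload);               /* hypothetical per-node computation */

int walk(node_t *n) {
    int sum = 0;
    while (n) {
        __builtin_prefetch(n->next);        /* fetch the next node while we work
                                               on this one; prefetching a NULL
                                               pointer is harmless */
        sum += work(n->payload);
        n = n->next;
    }
    return sum;
}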

 Different correlation techniques


 Markov prefetchers
 Delta correlation prefetchers

20
Prefetching Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Prefetching        |      |          |          |              |          |

21
Multi-ported Caches
 Idea: allow for multiple accesses in parallel
 Processor with many LSUs, I+D access in L2, …

 Can be implemented in multiple ways


 True multi-porting
 Multiple banks

 What is difficult about multi-porting


 Interaction between parallel accesses (especially for
stores)
22
True Multi-porting
 True multiporting
 Use 2-ported tag/data storage
 Problem: large area increase
 Problem: hit time increase

(Figure: two request ports and two data ports into a single cache array.)

23
Multi-banked Caches
(Figure: request 1 is routed to cache bank 1 and request 2 to cache bank 2; the two reads return data in parallel.)

 Partition address space into multiple banks


 Bank0 caches addresses from partition 0, bank1 from partition 1…
 Can use least or most significant address bits for partitioning
 What are the advantages of each approach?

 Benefits: accesses can go in parallel if no conflicts


 Challenges: conflicts, distribution network, bank
utilization
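
A minimal sketch of bank selection (sizes illustrative). Using low-order line-address bits interleaves consecutive lines across banks, which spreads sequential traffic; using high-order bits instead gives each bank a contiguous region of the address space:

#include <stdint.h>

#define BLOCK_BITS 6                      /* 64-byte cache lines (illustrative) */
#define NBANKS     8                      /* must be a power of two */

/* Low-order interleaving: consecutive lines map to consecutive banks. */
static inline unsigned bank_of(uint64_t addr) {
    return (addr >> BLOCK_BITS) & (NBANKS - 1);
}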
24
Sun UltraSPARC T2
8-bank L2 cache

25
Multi-porting Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Multi-porting      |      |          |          |              |          |

26
Summary of Advanced Cache Optimizations

Cache optimization  | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Multi-level         |      |          |          | +            |          |
Victim cache        |      |          | ~        | +            |          |
Pseudo-assoc.       |      |          | +/~      |              |          |
Skew-assoc.         |      |          | +        |              | ~        |
Non-blocking        |      |          |          | +            |          | ~
Critical-word-first |      |          |          | +            |          |
Prefetching         | +    |          | -        | +            |          |
Multi-porting       |      |          |          |              | ~        | +

Also see Figure 5.11 in H&P
27
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


28
Cache Coherence Problem
(Figure: three cores P1, P2, P3 with private caches ($) above a shared bus connecting memory and I/O devices. Events: (1) P1 reads u=5 from memory; (2) P3 reads u=5; (3) P3 writes u=7 into its own cache; (4) P1 then reads u and still sees 5 in its cache; (5) P2 reads u and gets the stale 5 from memory.)

 Cores may see different values for u


 With write back caches, value written back to memory depends on
happenstance of which cache flushes or writes back value when
 Threads or processes accessing main memory may see very stale value
 Unacceptable for programming, and it’s frequent!
29
Hardware Cache Coherence
Using Snooping
 Hardware guarantees that loads from all
cores will return the value of the latest
write

 Coherence mechanisms
 Metadata to track state for cached data
 Controller that snoops bus (or interconnect)
activity and reacts if needed to adjust the state
of the cache data

 There needs to be a serialization point


 Shared L3, memory controller, or memory bus

30
MSI: Simple Coherence Protocol for Write-Back Caches

Each cache line has an address tag plus state bits:
 M: Modified
 S: Shared
 I: Invalid

State transitions for a line in processor P1's cache:
 I → S: read miss (fetch the line)
 S → M: P1 signals intent to write
 S → S: reads by any processor
 S → I: another processor signals intent to write
 M → M: P1 reads or writes
 M → S: another processor reads; P1 writes the line back
 M → I: another processor signals intent to write (P1 writes back)
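
A hedged sketch of these transitions as a C state function (bus actions such as write-backs and invalidate broadcasts are noted only in comments):

typedef enum { I, S, M } msi_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,          /* from this core */
    BUS_READ, BUS_WRITE_INTENT    /* snooped from other cores */
} msi_event_t;

msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case I:
        if (e == CPU_READ)  return S;   /* read miss: fetch the line */
        if (e == CPU_WRITE) return M;   /* write miss: fetch + invalidate others */
        return I;
    case S:
        if (e == CPU_WRITE)        return M;  /* broadcast intent to write */
        if (e == BUS_WRITE_INTENT) return I;  /* another core will write */
        return S;                             /* reads keep the line shared */
    case M:
        if (e == BUS_READ)         return S;  /* write back, then share */
        if (e == BUS_WRITE_INTENT) return I;  /* write back, then invalidate */
        return M;                             /* own reads/writes hit */
    }
    return I;
}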
31
Quick Questions
 How many copies of a cache line can you have
in S state?

 How many copies can you have in M state?

 How does L2 inclusion help?

33
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


34
Software-managed Memory
 Caches are complex, hard to design, hard to
optimize, hard to analyze, hard to use well,
hard to keep coherent…

 Private on-chip memory with its own address space
 Not implicitly backed by main memory
 Also called “Local Store”, “Local Memory”,
“Scratchpad”, “Stream Register File”
 Ubiquitous in embedded computing space
35
Local Stores in the wild
 IBM Cell Processor
 256KB LS per core
 Shared by inst. and data!

Playstation 3!
36
Cache vis-à-vis Local Store
 Cache  Local Store

37
Local Stores: AMAT
AMAT = HitTime + MissRate * MissPenalty

 MissRate = 0%!

Consequences?
 Simpler performance analysis
 Less motivation for out-of-order cores
 Cell processor is in-order
 High clock rate and low power
38
Local Stores: Operation
 LD/ST instructions to LS proceed normally
 No LD/ST to non-LS memory

 DMA transfers (Direct Memory Access) to


move data to/from main memory and LS
 Bulk, like memcpy()
 Asynchronous

dma(void *local_address, void *remote_address,
    int size, int tag, boolean direction);

39
Stream Programming
Time →

Serial:      get(a)  do_something(a)  get(b)  do_something(b)

Overlapped:  get(a)  get(b)
                     do_something(a)  do_something(b)

 Overlap communication with computation


 Hide memory latency
 “Macroscopic” software prefetching
 No ugly prefetch instructions interlaced w/ your code
 Doesn’t waste instruction issue bandwidth
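
A hedged sketch of this pattern with the dma() primitive from the earlier slide (dma_wait(), the tag discipline, and CHUNK are illustrative assumptions, not part of any specific API):

#include <stdbool.h>

#define CHUNK 16384                           /* illustrative transfer size */

extern void dma(void *local_address, void *remote_address,
                int size, int tag, bool direction);
extern void dma_wait(int tag);                /* hypothetical: block until transfer done */
extern void do_something(char *buf);          /* the per-chunk computation */

/* Double buffering: fetch chunk i+1 into one buffer while computing on
   chunk i in the other, overlapping communication with computation. */
void process(char *remote, int nchunks) {
    static char buf[2][CHUNK];                /* two local-store buffers */
    dma(buf[0], remote, CHUNK, 0, true);      /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int cur = i & 1;
        if (i + 1 < nchunks)                  /* start the next transfer early */
            dma(buf[cur ^ 1], remote + (i + 1) * CHUNK, CHUNK, cur ^ 1, true);
        dma_wait(cur);                        /* wait for the current chunk */
        do_something(buf[cur]);               /* compute while the next DMA runs */
    }
}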
40
Local Stores: Pros and Cons
Pros:
 No coherence!
 Simple to implement
 Less overhead (no tags)
 Predictable performance, great for in-order cores
 Can potentially hide all memory latency

Cons:
 No coherence…
 Complex to program
 Can’t run existing SW
 Unpredictable access patterns perform poorly
 Pointer chasing difficult (linked lists, trees, etc.)
 People resort to implementing set-associative caches in software…
41
Local Store Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth | SW Complexity
Local Store        |      |          |          |              |          |           |

42
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


43
Everything is a Cache for
Something Else
Level               | Access Time      | Capacity | Managed by
Registers           | 1 cycle          | ~500B    | Software/compiler
Level 1 Cache       | 1-3 cycles       | ~64KB    | Hardware
Level 2 Cache       | 5-10 cycles      | 1-10MB   | Hardware
DRAM                | ~100 cycles      | ~10GB    | Software/OS
Disk                | 10^6-10^7 cycles | ~1TB     | Software/OS
Tape, The Interwebs |                  |          |
44
Example: File cache
 Do files exhibit locality?
 Write back or write through?
  When should we write to disk?
 Associativity?
  Place arbitrarily and keep an index
 Prefetching?
  Microsoft “SuperFetch”: loads common programs at boot
 Coherence?
  “Leases” in network filesystems
 Most disks have caches
45
Example: Browser cache
 Do web pages you visit exhibit locality?
 Write back or write through?
  No writes!
 Replacement policy?
  Probably LRU
 AMAT?
 Coherence?
  Did the page change since I last checked?
  Relaxed coherence: “If-Modified-Since” header
46
Caching is a ubiquitous tool
 Same design issues in system design as in
processor design
 Placement, lookup, write policies, replacement
policies, coherence

 Same optimization dimensions


 Size, associativity, granularity
 Hit time, miss rate, miss penalty, bandwidth,
complexity

47
Next Lecture

 DRAM (Main Memory)

48
