This Unit: Caches
- Basic memory hierarchy concepts
- Speed vs. capacity
Readings
- MA:FSPTCM: Section 2.2; Sections 6.1, 6.2, 6.3.1
Start-of-class Exercise
You're a researcher
- You frequently use books from the library
- Your productivity is reduced while waiting for books

Paper
- Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990
- ISCA's most influential paper award, awarded 15 years later

How do you:
- Coordinate/organize/manage the books?
- Fetch the books from the library when needed?
- How do you reduce overall waiting?
- What techniques can you apply? Consider both simple & more clever approaches
Unobtainable goal:
- Memory that operates at processor speeds
- Memory as large as needed for all running programs
- Memory that is cost-effective
Types of Memory
Static RAM (SRAM)
- 6 or 8 transistors per bit
  - Two inverters (4 transistors) + transistors for reading/writing
- Optimized for speed (first) and density (second)
- Fast (sub-nanosecond latencies for small SRAM)
  - Speed roughly proportional to its area
- Mixes well with standard processor logic
Dynamic RAM (DRAM)
- 1 transistor + 1 capacitor per bit
  - Bits as capacitors, transistors as access ports
  - 1T cells: one access transistor per bit
- Optimized for density (in terms of cost per bit)
- Slow (>40ns internal access, ~100ns pin-to-pin)
- Different fabrication steps (does not mix well with logic)
Static
- Cross-coupled inverters hold state
Dynamic means
- Capacitors not connected to power/ground
- Stored charge decays over time
- Must be explicitly refreshed
To read
- Equalize (pre-charge bitlines to 0.5), swing, amplify via sense amp
To write
- Overwhelm (drive the bitlines to the new value)
DRAM process
- Same basic materials/steps as CMOS, but optimized for DRAM
Latency and Bandwidth
- Processors get faster more quickly than memory (note log scale)
[Figure: processor vs. memory access time over time. Copyright Elsevier Scientific 2003]
Burks, Goldstine, and Von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," IAS memo, 1946
Library Analogy
Consider books in a library
- Library has lots of books, but it is slow to access
  - Far away (time to walk to the library)
  - Big (time to walk within the library)
Temporal locality
- Recently referenced data is likely to be referenced again soon
- Reactive: cache recently used data in small, fast memory
Spatial locality
- More likely to reference data near recently referenced data
- Proactive: fetch data in large chunks to include nearby data
Caches ≈ bookshelves
- Moderate capacity, pretty fast to access
Connected by buses
Which also have latency and bandwidth issues
I$, D$
- Split instruction (I$) and data (D$)
- Typically 8KB to 64KB each
L2, L3
- L2 typically ~256KB to 512KB
- Last-level cache typically 4MB to 16MB
Main Memory
- Made of DRAM (Dynamic RAM)
- Typically 1GB to 4GB for desktops/laptops
- Servers can have 100s of GB
Disk
- Uses magnetic disks or flash drives
Cache organization
- ABC, miss classification
High-performance techniques
- Reducing misses, improving miss penalty, improving hit latency
Main memory
- DRAM-based memory systems
Virtual memory
- Disk
Warmup
What is a hash table?
What is it used for? How does it work?
Short answer:
- Maps a key to a value
- Constant time lookup/insert
- Have a table of some size, say N, of buckets
- Take a key value, apply a hash function to it
- Insert and lookup a key at hash(key) modulo N
- Need to store the key and value in each bucket
- Need to check to make sure the key matches
- Need to handle conflicts/overflows somehow (chaining, re-hashing)
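As a point of comparison for the cache lookup logic later in this unit, here is a minimal hash-table sketch in C; the bucket count, hash function, and chaining scheme are illustrative choices, not part of the slides.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256   /* table size N (illustrative) */

typedef struct entry {
    char         *key;    /* stored key, checked on lookup */
    int           value;
    struct entry *next;   /* chaining handles conflicts/overflows */
} entry_t;

static entry_t *table[NBUCKETS];

/* Simple string hash (illustrative); any reasonable hash works. */
static unsigned hash(const char *key) {
    unsigned h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Insert key/value at bucket hash(key) % N. */
void ht_insert(const char *key, int value) {
    unsigned idx = hash(key) % NBUCKETS;
    entry_t *e = malloc(sizeof *e);
    e->key   = strdup(key);
    e->value = value;
    e->next  = table[idx];   /* chain onto existing entries in this bucket */
    table[idx] = e;
}

/* Lookup: index by hash, then confirm the stored key actually matches. */
int ht_lookup(const char *key, int *value_out) {
    unsigned idx = hash(key) % NBUCKETS;
    for (entry_t *e = table[idx]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) { *value_out = e->value; return 1; }
    }
    return 0;  /* miss */
}
```

The same shape recurs in a cache: index into an array, then compare a stored "key" (the tag) to confirm the match.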
Cache Basics
Number of entries
- 1024 frames in this example, selected by 10 address (index) bits driving the wordlines
Size of entries
- Width of data accessed
- Data travels on bitlines
- 256 bits (32 bytes) in example
[Figure: SRAM array of 1024 frames; address bits decoded onto wordlines, data read out on bitlines]
To each frame attach a tag and a valid bit
- Compare frame tag to address tag bits
- No need to match index bits (why?)
Lookup algorithm
- Read frame indicated by index bits
- Hit if tag matches and valid bit is set
- Otherwise, a miss: get data from next level
Address: tag [31:15] | index [14:5] | offset [4:0]
- 32B frames → 5-bit offset; 1024 frames → 10-bit index
- 32-bit address − 5-bit offset − 10-bit index = 17-bit tag
- (17-bit tag + 1-bit valid) × 1024 frames = 18Kb tags = 2.2KB tags
- ~6% overhead
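A minimal C sketch of this lookup, using the geometry above (1024 frames of 32B); the tag/valid/data arrays and the fill_from_next_level() routine are hypothetical placeholders for the hardware structures.

```c
#include <stdint.h>
#include <stdbool.h>

#define NFRAMES    1024           /* 10-bit index */
#define BLOCK_SIZE 32             /* 32B frames -> 5-bit offset */

static uint32_t tags[NFRAMES];    /* 17-bit tag stored per frame */
static bool     valid[NFRAMES];
static uint8_t  data[NFRAMES][BLOCK_SIZE];

/* Hypothetical: fetch the whole block from the next level of the hierarchy. */
extern void fill_from_next_level(uint32_t block_addr, uint8_t *frame);

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t offset = addr & (BLOCK_SIZE - 1);        /* bits [4:0]  */
    uint32_t index  = (addr >> 5) & (NFRAMES - 1);    /* bits [14:5] */
    uint32_t tag    = addr >> 15;                     /* bits [31:15] */

    if (!(valid[index] && tags[index] == tag)) {      /* miss */
        fill_from_next_level(addr & ~(uint32_t)(BLOCK_SIZE - 1), data[index]);
        tags[index]  = tag;
        valid[index] = true;
    }
    return data[index][offset];                        /* hit path */
}
```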
Cache Examples
4-bit addresses → 16B memory
- Simpler cache diagrams than 32 bits
8B cache, 2B blocks
- Figure out number of sets: 4 (capacity / block size)
- Figure out how address splits into offset/index/tag bits
  - Offset: least-significant log2(block size) = log2(2) = 1 bit
  - Index: next log2(number of sets) = log2(4) = 2 bits
  - Tag: rest = 4 − 1 − 2 = 1 bit
- Address split: tag (1 bit) | index (2 bits) | offset (1 bit)
[Figure: 16B main memory, addresses 0000–1111 holding data A–Q; 8B direct-mapped cache, address split tag (1 bit) | index (2 bits) | offset (1 bit)]
Cache Capacity
+ Miss rate decreases monotonically with capacity
  - Working set: insns/data the program is actively using
  - Diminishing returns
- However, thit increases
  - Latency proportional to sqrt(capacity)
- tavg?
[Figure: %miss vs. working set size]
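A quick back-of-the-envelope view of the tension, in C; the latencies and miss rates below are made-up illustrative numbers, not from the slides.

```c
#include <stdio.h>

/* tavg = thit + %miss * tmiss, for two hypothetical capacities. */
int main(void) {
    double tmiss = 20.0;                         /* cycles to next level (assumed) */
    double tavg_small = 1.0 + 0.10 * tmiss;      /* small cache: 1-cycle hit, 10% miss */
    double tavg_large = 2.0 + 0.05 * tmiss;      /* 4x capacity: ~2x hit latency (sqrt), 5% miss */
    printf("small: %.1f cycles, large: %.1f cycles\n", tavg_small, tavg_large);
    return 0;
}
```

With these particular numbers the two come out equal (3.0 cycles each), which is exactly the point: bigger is not automatically better for tavg.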
Block Size
Given capacity, manipulate %miss by changing organization
- One option: increase block size
  - e.g., 512 frames × 512 bits (64B blocks) instead of 1024 × 256 bits (32B blocks)
  + Exploits spatial locality
  - Notice index/offset bits change; tag remains the same
Ramifications
+ Reduce %miss (up to a point)
+ Reduce tag overhead (why? fewer frames for the same capacity → fewer tags)
- Potentially useless data transfer
- Premature replacement of useful data
- Fragmentation
Address: tag [31:15] | index [14:6] | offset [5:0]
- 64B frames → 6-bit offset; 512 frames → 9-bit index
- 32-bit address − 6-bit offset − 9-bit index = 17-bit tag
- (17-bit tag + 1-bit valid) × 512 frames = 9Kb tags = 1.1KB tags
+ ~3% overhead
[Figure: 8B cache with 4B blocks; address split tag (1 bit) | index (1 bit) | offset (2 bits); 16B main memory holding A–Q]
Critical Word First / Early Restart (CRF/ER)
- Requested word fetched first, pipeline restarts immediately
- Remaining words in block transferred/filled in the background
Cache Conflicts
[Figure: conflict example in the direct-mapped cache; 16B main memory, addresses 0000–1111 holding A–Q]
Set-Associativity
Set-associativity
- Block can reside in one of a few frames
- Frame groups are called sets; each frame in a set is called a way
- This is 2-way set-associative (SA)
  - 1-way → direct-mapped (DM)
  - 1-set → fully-associative (FA)
[Figure: 2-way set-associative array; 512 sets, frames 0–1023 arranged as sets × ways]
+ Reduces conflicts
- Increases latencyhit: additional tag match & muxing
- Note: valid bit not shown
Set-Associativity
Lookup algorithm
- Use index bits to find the set
- Read data/tags in all frames of the set in parallel
- Any way with a tag match and its valid bit set → hit
- Notice tag/index/offset bits
  - Only a 9-bit index (versus 10-bit for direct-mapped)
  - tag [31:14] | index [13:5] | offset [4:0]
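A C sketch of the same lookup for the 2-way organization above (512 sets, 32B blocks); the array names are hypothetical stand-ins for the per-way tag/valid/data storage.

```c
#include <stdint.h>
#include <stdbool.h>

#define NWAYS      2
#define NSETS      512            /* 9-bit index */
#define BLOCK_SIZE 32             /* 5-bit offset */

static uint32_t tags[NSETS][NWAYS];   /* 18-bit tags, bits [31:14] */
static bool     valid[NSETS][NWAYS];
static uint8_t  data[NSETS][NWAYS][BLOCK_SIZE];

/* Returns the matching way on a hit, or -1 on a miss. */
int sa_lookup(uint32_t addr) {
    uint32_t index = (addr >> 5) & (NSETS - 1);   /* bits [13:5]  */
    uint32_t tag   = addr >> 14;                  /* bits [31:14] */

    for (int way = 0; way < NWAYS; way++) {       /* hardware checks all ways in parallel */
        if (valid[index][way] && tags[index][way] == tag)
            return way;                            /* hit */
    }
    return -1;                                     /* miss: fetch from next level */
}
```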
Replacement Policies
Some options (see the sketch below for a 2-way LRU implementation)
- Random
- FIFO (first-in first-out)
- LRU (least recently used)
  - Fits with temporal locality; LRU = least likely to be used in the future
- NMRU (not most recently used)
  - An easier-to-implement approximation of LRU
  - Is LRU for 2-way set-associative caches
- Belady's: replace the block that will be used furthest in the future
  - Unachievable optimum
- Which policy is simulated in the previous example?
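For a 2-way set, LRU (equivalently NMRU) needs just one bit per set. A self-contained C sketch with hypothetical names; the cache controller would call these on hits and misses.

```c
#include <stdint.h>
#include <stdbool.h>

#define NWAYS 2
#define NSETS 512

static bool valid[NSETS][NWAYS];
static int  lru_way[NSETS];       /* one LRU bit per set: which way is least recent */

/* Called on every access that hits in (or fills) 'way'. */
void update_lru(uint32_t index, int way) {
    lru_way[index] = 1 - way;     /* the other way is now least recently used */
}

/* Called on a miss: pick the victim way to replace. */
int choose_victim(uint32_t index) {
    for (int way = 0; way < NWAYS; way++)
        if (!valid[index][way]) return way;   /* prefer an invalid way */
    return lru_way[index];                    /* otherwise evict the LRU way */
}
```

With more than 2 ways, true LRU needs an ordering over all ways, which is why NMRU (track only the most recently used way) is a common approximation.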
[Figure: 2-way set-associative example; 8B cache, 2B blocks, 2 sets; address split tag (2 bits) | index (1 bit) | offset (1 bit); each set holds an LRU bit plus per-way tag and data; 16B main memory A–Q]
Associativity
[Figure: %miss vs. associativity]
[Figure: parallel vs. serial tag/data access datapaths]
Way Predictor
+ Advantages: fast, low-power
- Disadvantage: more misses
[Figure: way-predicted set-associative access datapath]
Prefetching
Prefetching: put blocks in cache proactively/speculatively
- Key: anticipate upcoming miss addresses accurately
- Can do in software or hardware
- Simple example: next-block prefetching (see the sketch below)
  - Miss on address X → anticipate miss on X + block-size
  + Works for insns: sequential execution
  + Works for data: arrays
- Timeliness: initiate prefetches sufficiently in advance
- Coverage: prefetch for as many misses as possible
- Accuracy: don't pollute with unnecessary data (it evicts useful data)
[Figure: prefetch logic sits alongside the I$/D$, prefetching from L2]
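A minimal sketch of the next-block policy in C; issue_prefetch() is a hypothetical stand-in for the hardware request sent to the next level.

```c
#include <stdint.h>

#define BLOCK_SIZE 32

/* Hypothetical: request that a block be filled into the cache. */
extern void issue_prefetch(uint32_t block_addr);

/* Called by the cache controller on every demand miss to address x. */
void on_demand_miss(uint32_t x) {
    uint32_t this_block = x & ~(uint32_t)(BLOCK_SIZE - 1);
    issue_prefetch(this_block + BLOCK_SIZE);   /* anticipate a miss on X + block-size */
}
```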
Software Prefetching
Use a special prefetch instruction
- Tells the hardware to bring in the data; doesn't actually read it
- Just a hint
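With GCC and Clang, this hint is exposed in C as the __builtin_prefetch intrinsic. A sketch of prefetching ahead in an array traversal; the distance of 16 elements is an illustrative tuning choice, not from the slides.

```c
/* Sum an array, prefetching a fixed distance ahead of the current element. */
double sum(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read access, low temporal locality; just a hint */
        s += a[i];
    }
    return s;
}
```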
Hardware Prefetching
What to prefetch?
Use a hardware table to detect strides, common patterns
Address-prediction
- Needed for non-sequential data: lists, trees, etc.
- Large table records (miss-addr → next-miss-addr) pairs
- On a miss, access the table to find out what will miss next
- It's OK for this table to be large and slow
- 20% performance improvement for large trees (>1M nodes)
- But ~15% slowdown for small trees (<1K nodes)
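A generic sketch of such a (miss-addr → next-miss-addr) table in C; the direct-mapped table size and hashing are illustrative, not the specific prefetcher those numbers come from.

```c
#include <stdint.h>

#define TABLE_SIZE 4096            /* "OK for this table to be large and slow" */

static uint32_t predicted_next[TABLE_SIZE];   /* indexed by (hashed) miss address */
static uint32_t last_miss_addr;

/* Hypothetical: request that a block be filled into the cache. */
extern void issue_prefetch(uint32_t block_addr);

static unsigned table_index(uint32_t addr) {
    return (addr >> 5) % TABLE_SIZE;           /* hash on the block address */
}

/* Called on every demand miss. */
void on_miss(uint32_t addr) {
    /* Record the observed pair: last miss -> this miss. */
    predicted_next[table_index(last_miss_addr)] = addr;
    last_miss_addr = addr;

    /* Consult the table: what missed right after this address last time? */
    uint32_t pred = predicted_next[table_index(addr)];
    if (pred != 0)
        issue_prefetch(pred);
}
```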
Write Issues
So far we have looked at reading from cache
Instruction fetches, loads
Tag/Data Access
Reads: read tag and data in parallel
- Tag mismatch → data is wrong (OK, just stall until good data arrives)
Write Propagation
When to propagate a new value to (lower-level) memory?
Option #1: Write-through: immediately
- On a hit, update the cache
- Immediately send the write to the next level
Option #2: Write-back: when the block is replaced
+ Key advantage: uses less bandwidth
- Reverse of other pros/cons above
- Used by Intel, AMD, and ARM
- Second-level and beyond are generally write-back caches
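A sketch contrasting the two policies for a store hit and an eviction; the per-frame dirty bit is the usual write-back mechanism, and the array names and write_next_level() are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define NFRAMES    1024
#define BLOCK_SIZE 32

static uint8_t data[NFRAMES][BLOCK_SIZE];
static bool    dirty[NFRAMES];                 /* used only by the write-back policy */

/* Hypothetical: write an entire block to the next level of the hierarchy. */
extern void write_next_level(uint32_t block_addr, const uint8_t *block);

/* Write-through: on a hit, update the cache and immediately send the write down. */
void store_write_through(uint32_t index, uint32_t offset, uint8_t value, uint32_t block_addr) {
    data[index][offset] = value;
    write_next_level(block_addr, data[index]);
}

/* Write-back: on a hit, update the cache and just mark the block dirty. */
void store_write_back(uint32_t index, uint32_t offset, uint8_t value) {
    data[index][offset] = value;
    dirty[index] = true;
}

/* Write-back: dirty data is propagated only when the block is evicted. */
void evict_write_back(uint32_t index, uint32_t block_addr) {
    if (dirty[index]) {
        write_next_level(block_addr, data[index]);
        dirty[index] = false;
    }
}
```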
Write miss?
- Technically, no instruction is waiting for the data, so why stall?
[Figure: store buffer (SB) between the processor and the cache]
Cache Hierarchies
Next-level cache
- Infrequent access → thit less important
- tmiss is bad → %miss important
- Higher capacity, associativity, and block size (to reduce %miss)
Exclusion
- Bring block from memory into L1 but not L2
- Move block to L2 on L1 eviction
  - L2 becomes a large victim cache
- Block is either in L1 or L2 (never both)
- Good if L2 is small relative to L1
  - Example: AMD's Duron, 64KB L1s, 64KB L2
Non-inclusion
- No guarantees
Hierarchy Performance
Performance metric: tavg, average access time

tavg = tavg-M1 = thit-M1 + (%miss-M1 × tmiss-M1)
tmiss-M1 = tavg-M2 = thit-M2 + (%miss-M2 × tmiss-M2)
tmiss-M2 = tavg-M3 = thit-M3 + (%miss-M3 × tmiss-M3)
tmiss-M3 = tavg-M4
...
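The recursion is easy to evaluate bottom-up; a C sketch with made-up per-level latencies and miss rates (not from the slides).

```c
#include <stdio.h>

/* tavg-Mi = thit-Mi + %miss-Mi * tmiss-Mi, where tmiss-Mi = tavg-M(i+1). */
int main(void) {
    double thit[]  = { 1.0, 10.0, 100.0 };   /* M1, M2, M3 hit times in cycles (assumed) */
    double pmiss[] = { 0.05, 0.20, 0.0 };    /* last level always hits */
    int levels = 3;

    double tavg = thit[levels - 1];          /* start at the bottom of the hierarchy */
    for (int i = levels - 2; i >= 0; i--)
        tavg = thit[i] + pmiss[i] * tavg;    /* a miss at level i costs tavg of level i+1 */

    printf("tavg = %.2f cycles\n", tavg);    /* 1 + 0.05*(10 + 0.20*100) = 2.50 */
    return 0;
}
```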
Calculate CPI
CPI = 1 + (30% × 5% × tmiss-D$)
tmiss-D$ = tavg-L2 = thit-L2 + (%miss-L2 × thit-Mem) = 10 + (20% × 50) = 20 cycles
Thus, CPI = 1 + (30% × 5% × 20) = 1.3
Summary
Average access time of a memory component
- latencyavg = latencyhit + (%miss × latencymiss)
- Hard to get low latencyhit and %miss in one structure → hierarchy
Memory hierarchy
- Cache (SRAM) → memory (DRAM) → virtual memory (Disk)
- Smaller, faster, more expensive → bigger, slower, cheaper