CS 152/252A Computer
Architecture and Engineering Sophia Shao
Lecture 7 – Memory II
Intel Kills Optane Memory Business,
Pays $559 Million Inventory Write-Off
Intel used Optane memory to create
both storage and memory products,
and it has long been rumored to be on
the chopping block. At its debut in
2015, Intel and partner Micron touted
the underlying tech, 3D XPoint, as
delivering 1000x the performance and
1000x the endurance of NAND
storage, and 10x the density of DRAM.
https://www.tomshardware.com/news/intel-kills-optane-memory-business-for-good
https://www.intel.com/content/www/us/en/developer/videos/disrupting-the-storage-memory-hierarchy.html
Last time in Lecture 6
§ Dynamic RAM (DRAM) is the main form of main memory
storage in use today
– Holds values on small capacitors, need refreshing (hence dynamic)
– Slow multi-step access: precharge, read row, read column
§ Static RAM (SRAM) is faster but more expensive
– Used to build on-chip memory for caches
§ Cache holds small set of values in fast memory (SRAM)
close to processor
– Need to develop search scheme to find values in cache, and replacement
policy to make space for newly accessed locations
§ Caches exploit two forms of predictability in memory
reference streams
– Temporal locality, same location likely to be accessed again soon
– Spatial locality, neighboring location likely to be accessed soon
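As a concrete (illustrative) example of the two forms of locality, consider summing a matrix: the running total is reused every iteration (temporal locality), and the row elements are visited in address order (spatial locality).

```python
# Illustrative example (not from the slides): one loop showing both kinds of locality.
def sum_matrix(matrix):
    total = 0                      # 'total' is touched every iteration -> temporal locality
    for row in matrix:
        for x in row:              # consecutive elements of 'row' -> spatial locality
            total += x
    return total

print(sum_matrix([[1, 2, 3], [4, 5, 6]]))   # 21
```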
2
Recap: Replacement Policy
In an associative cache, which line from a set should be
evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
• LRU cache state must be updated on every access
• True implementation only feasible for small sets (2-way)
• Pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
• Used in highly associative caches
• Not-Most-Recently Used (NMRU)
• FIFO with exception for most-recently used line or lines
[Diagram: pseudo-LRU binary tree of per-node bits for a 4-way set]
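A minimal software sketch of the tree pseudo-LRU scheme mentioned above, for a 4-way set (the encoding and names are illustrative, not a specific hardware design): one bit per internal tree node points toward the less recently used half, each access flips the bits on its path away from the touched way, and the victim is found by following the bits.

```python
# Hypothetical sketch of tree pseudo-LRU for a 4-way set (3 tree bits).
# bits[0] selects the pair {0,1} vs {2,3}; bits[1] selects within {0,1}; bits[2] within {2,3}.
class TreePLRU4:
    def __init__(self):
        self.bits = [0, 0, 0]                    # [root, left pair, right pair]

    def touch(self, way):                        # on every hit or fill, point bits away from 'way'
        self.bits[0] = 0 if way >= 2 else 1      # root points at the other half
        if way < 2:
            self.bits[1] = 1 - way               # point within {0,1} away from 'way'
        else:
            self.bits[2] = 1 - (way - 2)         # point within {2,3} away from 'way'

    def victim(self):                            # follow the bits to the pseudo-LRU way
        if self.bits[0] == 0:
            return self.bits[1]                  # a way in {0,1}
        return 2 + self.bits[2]                  # a way in {2,3}

plru = TreePLRU4()
for w in (0, 2, 1):
    plru.touch(w)
print(plru.victim())                             # 3: the only way not touched
```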
4
CPU-Cache Interaction
(5-stage pipeline)
[Diagram: 5-stage pipeline datapath with a primary instruction cache feeding instruction fetch and a primary data cache in the memory stage; both caches produce hit? signals, and misses go out to the memory controller]
Stall entire CPU on data cache miss
5
Improving Cache Performance
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
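These three terms combine in the average memory access time, AMAT = hit time + miss rate x miss penalty; a tiny worked example with illustrative numbers:

```python
# AMAT = hit time + miss rate x miss penalty (all times in cycles).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))   # 6.0 cycles per access on average
```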
7
Effect of Cache Parameters on Performance
§ Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
§ Higher associativity
+ reduces conflict misses
- may increase hit time
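The size trade-off can be made concrete with the same AMAT model (all numbers illustrative):

```python
# Trade-off sketch: a larger cache lowers the miss rate but lengthens the hit time.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

small = amat(hit_time=1, miss_rate=0.06, miss_penalty=100)   # 7.0 cycles
large = amat(hit_time=2, miss_rate=0.03, miss_penalty=100)   # 5.0 cycles
print(small, large)   # here the larger cache wins despite the slower hit
```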
8
Figure B.9 Total miss rate (top) and
distribution of miss rate (bottom) for
each size cache according to the three
C's for the data in Figure B.8. The top
diagram shows the actual data cache
miss rates, while the bottom diagram
shows the percentage in each category.
(Space allows the graphs to show one
extra cache size than can fit in Figure
B.8.)
10
Figure B.10 Miss rate versus block size for five different-sized caches.
Note that miss rate actually goes up if the block size is too large relative to the
cache size. Each line represents a cache of different size. Figure B.11 shows
the data used to plot these lines. Unfortunately, SPEC2000 traces would take
too long if block size were included, so these data are based on SPEC92 on a
DECstation 5000 (Gee et al. 1993).
© 2019 Elsevier Inc. All rights reserved. 11
Write Policy Choices
§ Cache hit:
– write-through: write both cache & memory
• Generally higher traffic but simpler pipeline & cache design
– write-back: write cache only, memory is written only when the
entry is evicted
• A dirty bit per line further reduces write-back traffic
• Must handle 0, 1, or 2 accesses to memory for each load/store
§ Cache miss:
– no-write-allocate: only write to main memory
– write-allocate (aka fetch-on-write): fetch into cache
§ Common combinations:
– write-through and no-write-allocate
– write-back with write-allocate
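A behavioral sketch of the two common combinations on a store (names and data structures are illustrative, not a hardware description):

```python
# Behavioral sketch of the two common write-policy combinations.
def store_write_through_no_allocate(cache, memory, addr, value):
    if addr in cache:                 # write hit: update the cache copy
        cache[addr] = value
    memory[addr] = value              # always write memory; on a miss the cache is not filled

def store_write_back_allocate(cache, dirty, memory, addr, value):
    if addr not in cache:             # write miss: fetch the line into the cache first
        cache[addr] = memory.get(addr, 0)
    cache[addr] = value               # write only the cache ...
    dirty[addr] = True                # ... and mark it dirty; memory is updated on eviction

cache, dirty, memory = {}, {}, {}
store_write_back_allocate(cache, dirty, memory, 0x100, 42)
print(cache, dirty, memory)           # memory stays untouched until the line is evicted
```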
12
Write Performance
[Diagram: cache write path; the address splits into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines holding valid bit, tag, and data; the tag comparison (=) gates the data-array write enable (WE)]
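A small sketch of how an address decomposes into the tag, index, and offset fields shown above (block size and line count are illustrative):

```python
# Split a byte address into tag / index / offset for a direct-mapped cache.
def split_address(addr, block_bytes=64, num_lines=256):
    b = block_bytes.bit_length() - 1          # offset bits (b)
    k = num_lines.bit_length() - 1            # index bits (k)
    offset = addr & (block_bytes - 1)
    index = (addr >> b) & (num_lines - 1)
    tag = addr >> (b + k)
    return tag, index, offset

print(split_address(0x12345678))              # (18641, 89, 56): tag=0x48d1, index=0x59, offset=0x38
```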
13
Reducing Write Hit Time
Problem: Writes take two cycles in memory stage, one
cycle for tag check plus one cycle for data write if hit
Solutions:
§ Design data RAM that can perform read and write in one
cycle, restore old value after tag miss
§ Pipelined writes: Hold write data for store in single buffer
ahead of cache, write cache data during next store’s tag check
§ Fully-associative (CAM Tag) caches: Word line only enabled if
hit
14
Pipelining Cache Writes
[Diagram: pipelined cache write; address and store data from the CPU are held in a delayed-write buffer, and the tag check (=?) for the current store overlaps the data write of the previous store]
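A behavioral sketch of the delayed-write idea from the previous slide (illustrative, not the actual pipeline control): each store's data sits in a one-entry buffer, is written into the data array while the next store's tag is checked, and loads must be able to bypass from that buffer.

```python
# Sketch: a single delayed-write buffer in front of the cache data array.
class DelayedWriteCache:
    def __init__(self):
        self.data = {}                 # stands in for the data array (tag checks omitted)
        self.pending = None            # (addr, value) waiting to be written

    def store(self, addr, value):
        if self.pending:               # drain the previous store while this store's tag is checked
            old_addr, old_val = self.pending
            self.data[old_addr] = old_val
        self.pending = (addr, value)   # hold this store's data for the next write cycle

    def load(self, addr):
        if self.pending and self.pending[0] == addr:
            return self.pending[1]     # bypass the newest value from the buffer
        return self.data.get(addr)

c = DelayedWriteCache()
c.store(0x40, 7)
print(c.load(0x40))                    # 7, bypassed from the delayed-write buffer
```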
CS152 Administrivia
§ Attend discussions/OHs.
16
CS252 Administrivia
§ Start thinking of class projects and forming teams
– Teams of 2-3 students
§ RISC vs CISC discussion this week.
CS252 17
Write Buffer to Reduce Read Miss Penalty
[Diagram: CPU with register file (RF) and L1 data cache; a write buffer sits between the data cache and the unified L2 cache]
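A behavioral sketch of the idea in the title (illustrative): stores are queued in the write buffer so the processor does not wait for L2, and a later read miss checks the buffer for a matching address before going to L2.

```python
from collections import deque

# Sketch: write buffer between the L1 data cache and the unified L2.
write_buffer = deque()                   # queued (addr, value) writes headed for L2
l2 = {}                                  # stands in for the unified L2 cache

def buffered_write(addr, value):
    write_buffer.append((addr, value))   # the CPU continues without waiting for L2

def read_miss(addr):
    for a, v in reversed(write_buffer):  # newest matching buffered write wins
        if a == addr:
            return v
    return l2.get(addr)                  # otherwise read L2 (the buffer drains in the background)

buffered_write(0x80, 99)
print(read_miss(0x80))                   # 99, forwarded from the write buffer
```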
18
Reducing Tag Overhead with Sub-Blocks
§ Problem: Tags are too large, i.e., too much overhead
– Simple solution: Larger lines, but miss penalty could be large.
§ Solution: Sub-block placement (aka sector cache)
– A valid bit added to units smaller than full line, called sub-blocks
– Only read a sub-block on a miss
– If a tag matches, is the word in the cache?
Tag   Sub-block valid bits
100   1 1 1 1
300   1 1 0 0
204   0 1 0 1
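A minimal lookup sketch (illustrative) that answers the question above: a tag match alone is not enough, the addressed sub-block's valid bit must also be set. The table entries above are reused as data.

```python
# Sector-cache lookup sketch: one tag per line, one valid bit per sub-block.
lines = {
    100: [1, 1, 1, 1],
    300: [1, 1, 0, 0],
    204: [0, 1, 0, 1],
}

def hit(tag, sub_block):
    valid = lines.get(tag)
    return valid is not None and valid[sub_block] == 1   # tag match AND sub-block valid

print(hit(300, 1), hit(300, 3))   # True False: same line, but sub-block 3 was never fetched
```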
19
Multilevel Caches
Problem: A memory cannot be large and fast
Solution: a hierarchy of caches, increasing in size at each level
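With two levels the AMAT expression nests: an L1 miss pays the L2 hit time, and an L2 miss pays the memory latency. A small worked example with illustrative numbers:

```python
# Two-level AMAT: an L1 miss pays the L2 hit time, and an L2 miss pays the memory latency.
def amat2(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l1_miss_penalty

# Illustrative: 1-cycle L1, 10% L1 misses, 10-cycle L2, 20% local L2 miss rate, 200-cycle DRAM.
print(amat2(1, 0.10, 10, 0.20, 200))   # 6.0 cycles
```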
20
Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches
smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in
the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates.
The miss rate of a single-level cache versus size is plotted against the local miss rate
and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2
caches (unified) were two-way set associative with LRU replacement. Each had split L1
instruction and data caches that were 64 KiB two-way set associative with LRU
replacement. The block size for both L1 and L2 caches was 64 bytes. Data were
collected as in Figure B.4. © 2019 Elsevier Inc. All rights reserved. 21
Presence of L2 influences L1 design
§ Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time
– Backup L2 reduces L1 miss penalty
– Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)
22
Inclusion Policy
§ Inclusive multilevel cache:
– Inner cache can only hold lines also present in outer
cache
– External coherence snoop access need only check
outer cache
§ Exclusive multilevel caches:
– Inner cache may hold lines not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB
secondary cache
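A behavioral sketch of the difference (illustrative): inclusion requires back-invalidating the inner cache when the outer cache evicts a line, while exclusion swaps lines between the two levels on an inner miss.

```python
# Sketch: maintaining inclusion vs. exclusion between an inner (L1) and outer (L2) cache.
l1, l2 = set(), set()

def evict_from_l2_inclusive(line):
    l2.discard(line)
    l1.discard(line)          # back-invalidate: inner cache may only hold lines in the outer cache

def miss_in_l1_exclusive(line, l1_victim=None):
    if line in l2:            # exclusive: the line moves inward, the victim moves outward (a swap)
        l2.discard(line)
    l1.add(line)
    if l1_victim is not None:
        l1.discard(l1_victim)
        l2.add(l1_victim)

l2.update({"A", "B"})
l1.add("A")
evict_from_l2_inclusive("A")
print("A" in l1, "A" in l2)   # False False: the line is gone from both levels
miss_in_l1_exclusive("B")
print("B" in l1, "B" in l2)   # True False: the line moved inward, leaving L2 (exclusion)
```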
23
Itanium-2 On-Chip Caches
(Intel/HP, 2002)
24
Power 7 On-Chip Caches [IBM 2009]
32KB L1 I$/core
32KB L1 D$/core
3-cycle latency
25
IBM z196 Mainframe Caches 2010
26
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
27