CHAPTER 2 Memory Hierarchy Design & APPENDIX B. Review of Memory Hierarchy
by:
Fara Mutia, Hansen Hendra, Rachmadio Noval, & Julyando
2.2 Ten Advanced Optimizations of Cache Performance
1. Small and Simple First-level Caches to Reduce Hit
Time and Power
Direct-mapped is faster than set associative for both reads and writes.
In particular, tag-checking can overlap data transmission, as there is
only one piece of data for each index.
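To illustrate why the direct-mapped case is fast, here is a minimal sketch of a direct-mapped cache lookup. The geometry (64-byte blocks, 256 sets) and all names are assumptions for illustration; the point is that each index holds exactly one tag, so a single tag comparison decides hit or miss and can overlap reading the data out.

```python
# Hypothetical direct-mapped cache model (geometry is assumed):
# one block per index means one tag comparison per access.

BLOCK_SIZE = 64   # bytes per block (assumption)
NUM_SETS = 256    # number of index slots (assumption)

def split_address(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_SETS   # one tag per index

    def access(self, addr):
        tag, index, _ = split_address(addr)
        hit = self.tags[index] == tag   # the single tag check
        if not hit:
            self.tags[index] = tag      # fill on miss (data omitted)
        return hit
```

Two addresses that differ by `BLOCK_SIZE * NUM_SETS` map to the same index with different tags, so they evict each other, which is the classic conflict-miss downside of direct mapping.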
(Figures: impact of cache size and associativity on hit time; energy consumption per read vs. cache size and associativity)
2. Way Prediction to Reduce Hit Time
With way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access.
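A small sketch of the idea, for an assumed 2-way set-associative cache: each set keeps one predictor bit selecting which way's tag to compare first. A correct prediction gives the fast (direct-mapped-like) hit time; a mispredicted hit pays an extra cycle to check the other way. All names and the fill policy are illustrative assumptions.

```python
# Way prediction in a toy 2-way set-associative cache (assumed geometry).

NUM_SETS = 4
WAYS = 2

class WayPredictedCache:
    def __init__(self):
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]
        self.predict = [0] * NUM_SETS   # the extra prediction bits

    def access(self, addr):
        index, tag = addr % NUM_SETS, addr // NUM_SETS
        first = self.predict[index]
        if self.tags[index][first] == tag:
            return "fast hit"               # prediction correct
        other = 1 - first
        if self.tags[index][other] == tag:
            self.predict[index] = other     # retrain the predictor
            return "slow hit"               # extra cycle to check other way
        self.tags[index][other] = tag       # simplistic fill on miss
        self.predict[index] = other
        return "miss"
```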
3. Pipelined Cache Access to Increase Cache Bandwidth
Rather than treat the cache as a single monolithic block, we can divide
it into independent banks that can support simultaneous accesses.
6. Critical Word First and Early Restart to Reduce Miss Penalty
Critical Word First: Request the missed word first from memory and
send it to the processor as soon as it arrives; let the processor
continue execution while filling the rest of the words in the block.
Early Restart: Fetch the words in normal order, but as soon as the
requested word of the block arrives, send it to the processor and let the
processor continue execution.
7. Merging Write Buffer to Reduce Miss Penalty
Access time:
▶ the time between when a read is requested and when the desired word arrives
Cycle time:
▶ the minimum time between unrelated requests to memory
SRAM and DRAM

            DRAM           SRAM
Speed       Slower         Faster
Cost        Cheap          Expensive
Capacity    up to 16 GB    up to 12 MB
DRAM Growth
Current DDR4 devices reach 16 Gbit capacity at 3200 MHz.
2. Reducing power
Power consumption depends on read/write activity and on standby power. Supply voltages have dropped with each generation:
DDR2: 1.5–1.8 V
DDR3: 1.35–1.5 V
DDR4: 1–1.2 V
Optimization of Memory (3)
Protection:
▶ Keeps processes in their own memory space
▶ Prevents programs from corrupting each other
Each program has its own page table, mapping its pages to unique physical addresses.
(Figure: program address space in Linux)
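To make the protection idea concrete, here is a toy model of per-process page tables. The class names, page size, and fault behavior are assumptions for illustration: because translation always goes through the process's own table, one process has no way to name another process's memory.

```python
# Toy per-process page table (all names and structure are assumed):
# each process maps its virtual page numbers to its own physical frames.

PAGE_SIZE = 4096

class Process:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical frame

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            # unmapped page: a fault, not access to someone else's memory
            raise MemoryError("page fault / protection violation")
        return self.page_table[vpn] * PAGE_SIZE + offset

# Two processes using the same virtual address reach different frames.
a = Process({0: 7})
b = Process({0: 3})
```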
The frequency of the cache coherency problem is different for multiprocessors than I/O.
Multiple data copies are a rare event for I/O—one to be avoided whenever possible—but
a program running on multiple processors will want to have copies of the same data in
several caches. Performance of a multiprocessor program depends on the performance
of the system when sharing data.
The I/O cache coherency question is this: Where does the I/O occur in the computer—
between the I/O device and the cache or between the I/O device and main memory? If
input puts data into the cache and output reads data from the cache, both I/O and the
processor see the same data. The difficulty in this approach is that it interferes with the
processor and can cause the processor to stall for I/O. Input may also interfere with the
cache by displacing some information with new data that are unlikely to be accessed
soon.
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A8 and Intel Core i7
2.7 Fallacies and Pitfalls
• Fallacy : predicting cache performance of one program from
another
• Pitfall :
– Simulating enough instructions to get accurate performance measures
of the memory hierarchy
• Trying to predict performance of a large cache using a small trace
• Program’s locality behavior is not constant over the run of the entire program
• Program’s locality behavior may vary depending on the input
Memory hierarchy basic questions (1)
1. Where can a block be placed in the upper level? How is a block found if it is
in the cache?
Procedure (LRU replacement):
1. On every access to a cache block, stamp it with an incrementing number.
2. When a block must be replaced, look for the lowest number (the set of
candidates depends on fully vs. set associative placement).
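The counter-based procedure above can be sketched as follows, for the fully associative case where any block is a replacement candidate. The class and its size parameter are illustrative assumptions.

```python
# Counter-based LRU replacement sketch (fully associative, assumed names):
# every access stamps the block; on replacement, evict the lowest stamp.

class LRUCache:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.stamps = {}     # block address -> counter at last access
        self.counter = 0

    def access(self, block):
        self.counter += 1
        hit = block in self.stamps
        if not hit and len(self.stamps) == self.num_blocks:
            # replace the least recently used block: the lowest stamp
            victim = min(self.stamps, key=self.stamps.get)
            del self.stamps[victim]
        self.stamps[block] = self.counter
        return hit
```

In a set-associative cache the same procedure applies, but `min` would only range over the blocks of the indexed set rather than the whole cache.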
Memory hierarchy basic questions (3)
Example: Cache
Cache: 16 KB instruction + 16 KB data
Hit = 1 cycle, miss penalty = 50 cycles
75% of memory references are instruction fetches
Miss rate info (shown in table):
• Find miss rates and AMAT

Size     I-Cache   D-Cache
1 KB     3.0%      24.6%
2 KB     2.3%      20.6%
4 KB     1.8%      16.0%
8 KB     1.1%      10.2%
16 KB    0.6%      6.5%
32 KB    0.4%      4.8%
64 KB    0.1%      3.8%
Cache miss rates:
• Miss rate instruction (16 KB) = 0.6%
• Miss rate data (16 KB) = 6.5%
• Overall miss rate = 0.75 × 0.6% + 0.25 × 6.5% = 2.075%
• AMAT = hit time + miss rate × miss penalty = 1 + 0.02075 × 50 ≈ 2.04 cycles
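The example's arithmetic can be checked with a few lines of code. The function name is an illustrative assumption; the formula is the standard AMAT definition used above.

```python
# Reproducing the example: 16 KB I-cache (0.6% miss rate), 16 KB D-cache
# (6.5%), 75% of references are instruction fetches, hit = 1 cycle,
# miss penalty = 50 cycles.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

overall_miss_rate = 0.75 * 0.006 + 0.25 * 0.065   # 0.02075, i.e. 2.075%
overall_amat = amat(1, overall_miss_rate, 50)     # 2.0375 cycles
```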
Pitfall:
• Too small an address space (e.g., the PDP-11, by DEC and Carnegie
Mellon University, had 16-bit addresses, compared to the 24- to 31-bit
addresses of the IBM 360 and the 32-bit addresses of the VAX)
This is a problem because the address size:
- Limits program length (a program must be smaller than 2^(address size))
- Determines the minimum width of anything that can contain an
address: PC, register, memory word, and effective address
arithmetic
Pitfall :
• Ignoring the impact of the operating system on the performance of the
memory hierarchy
Pitfall:
• Relying on the operating system to change the page size over time