
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 2
Memory Hierarchy Design

Copyright © 2012, Elsevier Inc. All rights reserved.


Introduction
Memory Hierarchy



Introduction
Causes of misses
 Compulsory
 First reference to a block
 Capacity
 Blocks discarded because the cache cannot hold all the blocks the program needs, then later retrieved
 Conflict
 Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
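To make the conflict case concrete, here is a minimal C sketch with illustrative cache parameters: two addresses exactly one cache-size apart map to the same direct-mapped set, so alternating accesses to them evict each other.

#include <stdio.h>
#include <stdint.h>

/* Illustrative direct-mapped cache: 64-byte blocks, 512 sets (32 KiB). */
#define BLOCK_SIZE 64
#define NUM_SETS   512

static unsigned set_index(uint64_t addr) {
    return (unsigned)((addr / BLOCK_SIZE) % NUM_SETS);
}

int main(void) {
    uint64_t a = 0x10000;
    uint64_t b = a + (uint64_t)BLOCK_SIZE * NUM_SETS;   /* one cache size apart */
    printf("set(a) = %u, set(b) = %u\n", set_index(a), set_index(b));   /* same set */
    return 0;
}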



Introduction
Memory Hierarchy Basics

 Note that speculative and multithreaded processors may execute other instructions during a miss
 Reduces performance impact of misses



Introduction
Memory Hierarchy Basics
 Six basic cache optimizations:
 Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses, increases miss penalty
 Larger total cache capacity to reduce miss rate
 Increases hit time, increases power consumption
 Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
 Higher number of cache levels
 Reduces overall memory access time
 Giving priority to read misses over writes
 Use a write buffer!
 Reduces miss penalty
 Avoiding address translation in cache indexing
 Reduces hit time
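These trade-offs are usually weighed with the average memory access time formula from the text; the numbers below are purely illustrative:

\[
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

For example, a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give \(\text{AMAT} = 1 + 0.05 \times 100 = 6\) cycles; halving the miss rate (say, by doubling capacity) would lower this to 3.5 cycles even if the hit time grew slightly.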



ADVANCED CACHE OPTIMIZATIONS



Advanced Optimizations
1. Small and simple first level caches
 Critical timing path:
 addressing tag memory, then
 comparing tags, then
 selecting the correct way
 Direct-mapped caches can overlap tag compare and transmission of data
 Lower associativity reduces power because fewer cache lines are accessed



Advanced Optimizations
2. Way Prediction
 To improve hit time, predict the way to pre-set the mux
 Mis-prediction gives longer hit time
 Prediction accuracy
 > 90% for two-way
 > 80% for four-way
 I-cache has better accuracy than D-cache
 First used on MIPS R10000 in mid-90s
 Used on ARM Cortex-A8
 Extend to predict block as well
 “Way selection”
 Increases mis-prediction penalty
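A minimal software model of the mechanism, assuming a 2-way set-associative cache with 64 sets and 64-byte blocks; every name and parameter here is illustrative, not any shipping design:

#include <stdint.h>
#include <stdbool.h>

#define NSETS 64
#define NWAYS 2

typedef struct {
    bool     valid[NWAYS];
    uint64_t tag[NWAYS];
    int      pred_way;    /* way predicted for the next access to this set */
} cache_set_t;

static cache_set_t cache[NSETS];

/* 0 = miss, 1 = fast hit (prediction right), 2 = slow hit (extra cycle). */
int lookup(uint64_t addr) {
    unsigned set = (unsigned)((addr >> 6) % NSETS);   /* 6 offset bits */
    uint64_t tag = addr >> 12;                        /* 6 offset + 6 index bits */
    cache_set_t *s = &cache[set];

    int p = s->pred_way;
    if (s->valid[p] && s->tag[p] == tag)
        return 1;                      /* predicted way hit: normal hit time */

    for (int w = 0; w < NWAYS; w++)
        if (w != p && s->valid[w] && s->tag[w] == tag) {
            s->pred_way = w;           /* retrain the predictor */
            return 2;                  /* hit, but with mis-prediction penalty */
        }
    return 0;                          /* miss */
}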



Advanced Optimizations
3. Pipelining Cache
 Pipeline cache access to improve bandwidth
 Examples:
 Pentium: 1 cycle
 Pentium Pro – Pentium III: 2 cycles
 Pentium 4 – Core i7: 4 cycles
 Increases branch mis-prediction penalty
 Makes it easier to increase associativity



Advanced Optimizations
4. Nonblocking Caches
 Allow hits before previous misses complete
 “Hit under miss”
 “Hit under multiple miss”
 L2 must support this
 In general, processors can hide L1 miss penalty but not L2 miss penalty
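Nonblocking caches are commonly built around miss status holding registers (MSHRs). The sketch below, with illustrative names and sizes, shows the allocation step that lets the cache keep servicing hits while misses are outstanding:

#include <stdint.h>
#include <stdbool.h>

#define NMSHR 4    /* illustrative: max outstanding misses */

typedef struct {
    bool     valid;
    uint64_t block_addr;
} mshr_t;

static mshr_t mshr[NMSHR];

/* On a miss, record it so later hits can proceed. Returns false when
 * every MSHR is busy and the cache must stall. */
bool allocate_mshr(uint64_t block_addr) {
    for (int i = 0; i < NMSHR; i++)    /* merge with an in-flight miss */
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return true;
    for (int i = 0; i < NMSHR; i++)    /* otherwise claim a free entry */
        if (!mshr[i].valid) {
            mshr[i].valid = true;
            mshr[i].block_addr = block_addr;
            return true;
        }
    return false;
}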



Advanced Optimizations
5. Multibanked Caches
 Organize cache as independent banks to support simultaneous access
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 and 8 banks for L2
 Interleave banks according to block address
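Sequential interleaving reduces to a one-line mapping from address to bank; the parameters here are illustrative:

#include <stdint.h>

/* Block i goes to bank i mod nbanks ("sequential interleaving"). */
unsigned bank_of(uint64_t addr, unsigned block_size, unsigned nbanks) {
    return (unsigned)((addr / block_size) % nbanks);
}

With 4 banks and 64-byte blocks, consecutive blocks land in banks 0, 1, 2, 3 and then wrap, so a sequential stream keeps all banks busy in parallel.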



Advanced Optimizations
6. Critical Word First, Early Restart
 Critical word first
 Request missed word from memory first
 Send it to the processor as soon as it arrives
 Early restart
 Request words in normal order
 Send missed word to the processor as soon as it arrives
 Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
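A worked example under illustrative assumptions: a 64-byte block, one 8-byte bus transfer per cycle after a 10-cycle access latency, and a demand for the fourth 8-byte chunk of the block. The processor can then restart after

\[
10 + 8 = 18 \text{ cycles (neither optimization)}, \quad
10 + 4 = 14 \text{ (early restart)}, \quad
10 + 1 = 11 \text{ (critical word first)}.
\]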



Advanced Optimizations
7. Merging Write Buffer
 When storing to a block that is already pending in the write buffer, update the existing write buffer entry
 Reduces stalls due to a full write buffer
 Do not apply to I/O addresses

[Figure: write buffer contents without vs. with write merging]
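A minimal sketch of the merge check, assuming 64-byte buffer entries; all names and sizes are illustrative:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4
#define BLOCK      64

typedef struct {
    bool     valid;
    uint64_t block_addr;    /* aligned address of the 64-byte block */
    uint8_t  data[BLOCK];
    uint64_t byte_mask;     /* one bit per valid byte */
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

/* Returns true if the store was absorbed; false means the buffer is
 * full and the store must stall until an entry drains to memory. */
bool write_buffer_store(uint64_t addr, uint8_t val) {
    uint64_t blk = addr & ~(uint64_t)(BLOCK - 1);
    unsigned off = (unsigned)(addr & (BLOCK - 1));

    for (int i = 0; i < WB_ENTRIES; i++)    /* merge into a pending entry */
        if (wb[i].valid && wb[i].block_addr == blk) {
            wb[i].data[off] = val;
            wb[i].byte_mask |= 1ULL << off;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)    /* otherwise take a free entry */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = blk;
            wb[i].data[off] = val;
            wb[i].byte_mask = 1ULL << off;
            return true;
        }
    return false;
}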



Advanced Optimizations
8. Compiler Optimizations
 Loop Interchange
 Swap nested loops to access memory in sequential order
 Blocking
 Instead of accessing entire rows or columns, subdivide matrices into blocks
 Requires more memory accesses but improves locality of accesses (both techniques are sketched below)
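Both transformations in C, in the style of the examples in the text; N and the blocking factor B are illustrative, and N is assumed to be a multiple of B:

#define N 512
#define B 64    /* blocking (tile) factor; assumes N % B == 0 */

double x[N][N], y[N][N], z[N][N];

/* Loop interchange: the first version touches x column by column, one
 * element per cache block in row-major C; swapping the loops makes the
 * inner loop walk consecutive addresses. */
void scale_before(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

void scale_after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}

/* Blocking: accumulate x += y * z one B x B tile at a time so the tiles
 * of y and z stay cache-resident while they are reused. */
void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}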



Advanced Optimizations
9. Hardware Prefetching
 Fetch two blocks on miss (include next sequential block); see the sketch below

[Figure: Pentium 4 hardware prefetching speedup]
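The policy itself is tiny. In this sketch, fetch_block and prefetch_block are hypothetical stand-ins for the memory interface; the prefetched block typically goes to a stream buffer rather than directly into the cache:

#include <stdint.h>

static void fetch_block(uint64_t blk)    { (void)blk; /* hypothetical: demand fetch into cache */ }
static void prefetch_block(uint64_t blk) { (void)blk; /* hypothetical: fetch into stream buffer */ }

/* On a miss to block b, also request the next sequential block b+1. */
void handle_miss(uint64_t blk) {
    fetch_block(blk);
    prefetch_block(blk + 1);
}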



Advanced Optimizations
10. Compiler Prefetching
 Insert prefetch instructions before data is needed
 Non-faulting: prefetch doesn’t cause exceptions
 Register prefetch
 Loads data into register
 Cache prefetch
 Loads data into cache
 Combine with loop unrolling and software pipelining
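A minimal sketch using GCC/Clang's __builtin_prefetch, which compiles to a non-faulting cache-prefetch instruction where the target supports one; the prefetch distance is an illustrative tuning parameter:

#define PREFETCH_DIST 16    /* illustrative: tune to memory latency */

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    return s;
}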



Advanced Optimizations
Summary

