Week 13 - Lecture 13 - Memory (cont)
❑ Example: Given
▪ I-cache miss rate = 1%, D-cache miss rate = 5%
▪ Loads & stores are 30% of instructions
▪ Miss penalty = 100 cycles
▪ Program executes 10⁶ instructions
❑ Solution:
➢ misses/instruction = 1% + 30% × 5% = 0.025
➢ memory stall cycles/instruction = 0.025 × 100 = 2.5 cycles
➢ total memory stall cycles = 2.5 × 10⁶ = 2,500,000 cycles
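The same arithmetic, checked in a short Python sketch (the inputs are the Given values above):

```python
# Memory stall cycles for the example above.
i_miss_rate  = 0.01       # I-cache miss rate (one I-fetch per instruction)
d_miss_rate  = 0.05       # D-cache miss rate
ls_fraction  = 0.30       # loads & stores as a fraction of instructions
miss_penalty = 100        # cycles per miss
instructions = 1_000_000

misses_per_instr = i_miss_rate + ls_fraction * d_miss_rate  # 0.025
stalls_per_instr = misses_per_instr * miss_penalty          # 2.5 cycles
total_stalls     = stalls_per_instr * instructions          # 2,500,000 cycles
print(misses_per_instr, stalls_per_instr, total_stalls)
```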
Impacts of Cache Performance
❑ Relative cache penalty increases as processor performance
improves (faster clock rate and/or lower CPI)
➢ Memory speed is unlikely to improve as fast as processor cycle time → when
calculating CPIstall, the cache miss penalty is measured in processor clock
cycles needed to handle a miss.
➢ The lower the CPIideal, the more pronounced the impact of stalls
❑ Example: Given
▪ I-cache miss rate = 2%, D-cache miss rate = 4%
▪ Miss penalty = 100 cycles
▪ Base CPI (ideal cache) = 2
▪ Loads & stores are 36% of instructions
Questions:
➢ What is CPIstall? 2 + (2% + 36% × 4%) × 100 = 5.44; % time on memory stalls = 3.44/5.44 = 63%
➢ What if CPIideal is reduced to 1? CPIstall = 4.44; % time on memory stalls = 3.44/4.44 = 77%
➢ What if the processor clock rate is doubled? Miss penalty = 200 cycles, CPIstall = 2 + (2% + 36% × 4%) × 200 = 8.88
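A quick Python sketch of the three answers (the helper name cpi_stall is ours, not from the slides):

```python
# CPI with memory stalls: CPI_stall = CPI_ideal + misses/instr * penalty.
def cpi_stall(cpi_ideal, i_miss, d_miss, ls_frac, penalty):
    return cpi_ideal + (i_miss + ls_frac * d_miss) * penalty

cpi = cpi_stall(2, 0.02, 0.04, 0.36, 100)
print(cpi, (cpi - 2) / cpi)                 # 5.44, ~63% of time on stalls
cpi = cpi_stall(1, 0.02, 0.04, 0.36, 100)
print(cpi, (cpi - 1) / cpi)                 # 4.44, ~77% of time on stalls
print(cpi_stall(2, 0.02, 0.04, 0.36, 200))  # 8.88 with a doubled clock rate
```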
Average Memory Access Time (AMAT)
❑ Hit time is also important for performance
➢ A larger cache will have a longer access time → an increase in hit time will likely add another stage to the pipeline.
➢ At some point, the increase in hit time for a larger cache will outweigh the improvement in hit rate, leading to a decrease in performance.
❑ AMAT = Hit time + Miss rate × Miss penalty
❑ Example: Given a 2 ns clock cycle time, a cache hit time of 1 cycle, a miss penalty of 20 cycles, and a miss rate of 5%, what is the AMAT?
❑ Solution:
➢ AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
➢ Without the cache, the AMAT equals the miss penalty = 20 cycles = 40 ns
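The formula in a minimal Python sketch (values from the example above):

```python
# AMAT = hit time + miss rate * miss penalty (all in cycles here).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

clock_ns = 2.0
cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
print(cycles, cycles * clock_ns)   # 2.0 cycles, 4.0 ns
```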
Reducing cache miss rates #1: cache associativity
❑ Allow more flexible block placement
➢ In a direct mapped cache, a memory block maps to exactly one cache block
➢ At the other extreme, we could allow a memory block to map to any cache block → a fully associative cache (no indexing)
❑ Example result: 8 requests, 2 misses
❑ Solves the ping-pong effect in a direct mapped cache due to conflict misses, since two memory locations that map into the same cache set can now co-exist! (See the sketch below.)
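A toy simulation of the effect. The alternating block stream 0, 4, 0, 4, ... is our illustrative choice; the two-way count matches the "8 requests, 2 misses" result above:

```python
# Count misses for an N-way set-associative cache with LRU replacement.
def misses(stream, num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set: tags in LRU order
    count = 0
    for block in stream:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: refresh LRU position
        else:
            count += 1                     # miss
            if len(s) == ways:
                s.pop(0)                   # evict least recently used
        s.append(block)                    # most recently used at the end
    return count

stream = [0, 4] * 4                        # 8 requests
print(misses(stream, num_sets=4, ways=1))  # direct mapped: 8 (ping-pong)
print(misses(stream, num_sets=2, ways=2))  # two-way: 2 (cold misses only)
```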
Four-Way Set Associative Cache Organization
[Figure: a four-way set-associative cache with 2⁸ = 256 sets, each with four ways (one block per way).]
[Figure: the range of set-associative organizations. Decreasing associativity toward direct mapped (only one way: smaller tags, only a single comparator); increasing associativity toward fully associative (only one set: the tag is all the bits except the block and byte offset).]
❑ Least Recently Used (LRU): replace the one that has been
unused for the longest time
➢ Requires hardware to keep track of when each way’s block was used relative
to the other blocks in the set. For 2-way set associative, takes one bit per set
→ set the bit when a block is referenced (and reset the other way’s bit)
➢ Manageable for 4-way, too hard beyond that.
❑ Random
➢ Gives approximately the same performance as LRU for high associativity.
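A minimal sketch of the one-LRU-bit-per-set bookkeeping for a two-way set-associative cache (an illustrative model, not a particular hardware design):

```python
class TwoWaySet:
    """One set of a two-way set-associative cache with a single LRU bit."""
    def __init__(self):
        self.tags = [None, None]   # the two ways
        self.mru = 0               # which way was referenced most recently

    def access(self, tag):
        for way in (0, 1):
            if self.tags[way] == tag:
                self.mru = way     # hit: set this way's bit (reset the other)
                return True
        victim = 1 - self.mru      # miss: replace the least recently used way
        self.tags[victim] = tag
        self.mru = victim
        return False
```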
How Much Associativity?
❑ Increased associativity decreases miss rate
➢ But with diminishing returns
[Figure: miss rate (%) vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB.]
❑ The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
❑ N-way set associative cache costs
➢ N comparators (delay and area)
➢ MUX delay (set selection) before data is available
➢ Data available only after set selection and Hit/Miss decision (cf. direct mapped cache: the cache block is available before the Hit/Miss decision) → this can be an important consideration (why?)
Reducing Cache Miss Rates #2: multi-level caches
❑ Use multiple levels of caches
➢ Primary (L1) cache attached to CPU
➢ A larger, slower L2 cache services misses from the primary cache. With advancing technology there is more than enough room on the die for an L2 cache, normally unified (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.
❑ Example: Given
▪ CPU base CPI = 1, clock rate = 4 GHz
▪ Miss rate/instruction = 2%
▪ Main memory access time = 100 ns
Questions:
➢ Compute the actual CPI with just the primary cache.
➢ Compute the performance gain if we add an L2 cache with
▪ Access time = 5 ns
▪ Global miss rate to main memory = 0.5%
Multi-level cache: example solution
❑ With just the primary cache
➢ Miss penalty = 100 ns / 0.25 ns = 400 cycles
➢ CPIstall = 1 + 0.02 × 400 = 9
❑ With the L2 cache added
➢ L1 miss penalty (satisfied by L2) = 5 ns / 0.25 ns = 20 cycles
➢ CPIstall = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
➢ Performance gain = 9 / 3.4 ≈ 2.6
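Both cases in a short Python sketch (a 4 GHz clock gives a 0.25 ns cycle):

```python
cycle_ns = 0.25                       # 4 GHz clock
l1_miss, l2_global_miss = 0.02, 0.005
mem_penalty = 100 / cycle_ns          # 400 cycles to main memory
l2_penalty  = 5 / cycle_ns            # 20 cycles to L2

cpi_l1_only = 1 + l1_miss * mem_penalty                     # 9.0
cpi_with_l2 = (1 + l1_miss * l2_penalty
                 + l2_global_miss * mem_penalty)            # 3.4
print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)  # gain ~2.6x
```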
Memory: the next hierarchy
[Figure: the memory hierarchy. Processor ↔ L1$ (4-8 byte words) ↔ L2$ (8-32 byte blocks) ↔ Main Memory (1 to 4 blocks) ↔ Secondary Memory (1,024+ byte disk sectors = pages). The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is in turn a subset of what is in SM. Access time increases with distance from the processor.]
[Figure: virtual memory. Each program's virtual address space is mapped onto physical main memory.]
Virtual memory: address translation
❑ Assuming fixed-size pages, each memory request first requires
an address translation from virtual space to physical space
➢ Done by a combination of hardware and software
➢ Page fault: a virtual memory miss (i.e., the page is not in physical memory). The page fault penalty is very costly, often taking millions of clock cycles.
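A minimal sketch of the translation step, assuming hypothetical 4 KiB pages and a flat dictionary as the page table (both our choices, for illustration only):

```python
PAGE_SIZE = 4096   # bytes, so the low 12 bits of an address are the offset

# Page table: virtual page number -> (valid bit, physical page number).
page_table = {0: (1, 7), 1: (0, None), 2: (1, 3)}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)       # split VPN from page offset
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        raise RuntimeError("page fault: OS must bring the page in from disk")
    return ppn * PAGE_SIZE + offset              # physical address

print(hex(translate(0x2010)))   # VPN 2 -> PPN 3 -> 0x3010
```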
Address Translation Mechanisms (1)
[Figure: a virtual address is split into a virtual page number and a page offset. The virtual page number indexes the page table (itself stored in main memory); each page table entry is 32 bits wide (a valid bit V + an 18-bit physical page number + extra bits) and supplies the physical page base address. If the valid bit is off, the page is not present in memory and must be fetched from disk storage.]
Replacement and Writes
❑ To reduce page fault rate, prefer least-recently used (LRU)
replacement
➢ Reference bit (aka use bit) in the page table entry is set to 1 on each access to the page
➢ Periodically cleared to 0 by OS
➢ A page with reference bit = 0 means it has not been used recently
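A sketch of how an OS can approximate LRU with these reference bits (illustrative Python, not real OS code):

```python
ref_bit = {}            # page -> reference bit

def touch(page):
    ref_bit[page] = 1   # hardware sets the bit on each access to the page

def clear_bits():
    for p in ref_bit:   # OS periodically clears all reference bits
        ref_bit[p] = 0

def pick_victim():
    for p, bit in ref_bit.items():
        if bit == 0:    # not referenced since the last clearing
            return p
    return next(iter(ref_bit))   # everything used recently: pick any page
```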
❑ When the OS starts a new process, it creates space on disk for all the pages of the process (all valid bits in the page table = zero)
➢ Called demand paging: pages of the process are loaded from disk only as needed
[Figure: the TLB (Translation Lookaside Buffer), a small cache that holds recently used address translation data.]
❑ TLB misses are much more frequent than true page faults
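A toy model of the TLB as a small cache in front of the page table (the page_table layout here is the same hypothetical one as in the earlier sketch):

```python
page_table = {0: (1, 7), 2: (1, 3)}   # VPN -> (valid bit, PPN), hypothetical
tlb = {}                              # VPN -> PPN, recently used translations

def lookup(vpn):
    if vpn in tlb:
        return tlb[vpn]               # TLB hit: no page-table access needed
    valid, ppn = page_table.get(vpn, (0, None))   # TLB miss: walk page table
    if not valid:
        raise RuntimeError("page fault")   # true fault: rarer than a TLB miss
    tlb[vpn] = ppn                    # refill the TLB with the translation
    return ppn
```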
Summary: steps in memory access