Ch 2 / Appendix B
[Figure: typical memory hierarchy layout for (a) mobile computers, (b) laptops/desktops, (c) servers]
Performance Formulae
• CPU execution time must factor in stalls from
memory access
– assume L1 cache responds within the amount of time
allotted for the load/store/instruction fetch stage
• e.g., 1 cycle
– CPU time = IC * CPI * Clock cycle time
• CPI = ideal CPI + stalls
– stalls from both hazards and memory hierarchy delays (e.g., cache
misses), or stalls = pipeline stalls + memory stalls
– compute memory stall cycles (msc) as follows
• msc = IC * (misses per instruction) * miss penalty
• msc = IC * (memory accesses per instruction) * miss rate *
miss penalty
Example
• CPI = 1.0 when cache has 100% hit rate
– data accesses are only performed in loads and stores
– load/stores make up 50% of all instructions
– miss penalty = 50 clock cycles
– miss rate = 1%
• How much faster is an ideal machine?
– CPU time ideal = IC * 1.0 * clock cycle time
– CPU time this machine = IC * (1.0 + 1.5 * 0.01 * 50) * clock cycle time = IC * 1.75 * clock cycle time
• the 1.5 is memory accesses per instruction: one instruction fetch plus 0.5 data accesses
– ideal machine is 1.75 times faster (75%)
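As a quick check of the arithmetic, here is a minimal C sketch that plugs the example's numbers into the stall formula above; the variable names are mine, not the text's.

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.0;   /* CPI with a 100% hit rate     */
    double ls_fraction  = 0.50;  /* loads/stores per instruction */
    double miss_rate    = 0.01;  /* 1% miss rate                 */
    double miss_penalty = 50.0;  /* clock cycles per miss        */

    /* each instruction makes 1 instruction fetch plus, on average,
       0.5 data accesses, so 1.5 memory accesses per instruction   */
    double accesses_per_instr = 1.0 + ls_fraction;

    /* memory stall cycles per instruction = accesses * miss rate * penalty */
    double stall_cpi = accesses_per_instr * miss_rate * miss_penalty;

    double real_cpi = ideal_cpi + stall_cpi;                 /* 1.75 */
    printf("effective CPI         = %.2f\n", real_cpi);
    printf("ideal machine speedup = %.2f\n", real_cpi / ideal_cpi);
    return 0;
}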
Cache Questions
• Where can a block be placed?
– determined by type of cache (direct-mapped, set associative)
• How is a block found?
– requires tag checking (see the address-breakdown sketch after this list)
– tags get larger as we move from direct-mapped to higher associativity (fewer index bits, so more tag bits)
• the amount of hardware needed for parallel tag checks also increases
• making the hardware more expensive and slower
• Which block should be replaced on a miss?
– direct-mapped: no choice
– associative: replacement strategy (random, LRU, FIFO)
• What happens on a write?
– read B-6..B-12 if you need a refresher from CSC 362
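To make the tag check concrete, here is a small C sketch of how an address splits into tag, index, and block offset for a direct-mapped lookup; the geometry (64-byte blocks, 256 sets, 32-bit addresses) is assumed for illustration and does not come from the slides.

#include <stdint.h>
#include <stdio.h>

/* assumed geometry: 64-byte blocks, 256 sets, 32-bit addresses */
#define BLOCK_BITS 6    /* log2(64)  -> block offset bits */
#define INDEX_BITS 8    /* log2(256) -> set index bits    */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);

    /* a direct-mapped cache compares 'tag' against the one tag stored at
       set 'index'; a set-associative cache compares against every tag in
       the set in parallel, which is the extra hardware mentioned above   */
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}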
Write Strategy Example
• Computer has a fully associative write-back data cache
• Refill line size is 64 bytes
for (i = 0; i < 1024; i++) {
    x[i] = b[i] * y;    // S1
    c[i] = x[i] + z;    // S2
}
• x, b and c are double arrays of 1024 elements
• assume b & c are already in cache and moving x into the
cache does not conflict with b or c
– x[0] is stored at the beginning of a block
• Questions:
– how many misses arise with respect to accessing x if the cache
uses a no-write allocate policy?
– how many misses arise with respect to accessing x if the cache
uses a write allocate policy?
– redo the question with S2 before statement S1
Solution
• Refill line size = 64 bytes
– each array element is 8 bytes, 8 elements per refill line
– misses arise only when accessing x
• No-write allocate policy means
– a miss on a write sends the write to memory without loading the
line into cache
• a miss in S1 does not load the line and so S2 will also have a miss
– a miss on a read causes line to be brought into cache
• a miss in S2 causes line to be brought into cache
– along with the next 7 array elements
• read misses occur for the first array element and every 8th thereafter
– 1024 / 8 = 128 read misses
• each read miss in S2 is preceded by a write miss in S1, so 256 total misses
• Write allocate policy means write miss in S1 loads the line
into cache (so no miss in S2)
– therefore will have a total of 128 misses
• Reversing the order of S1 and S2 leads to the read miss
happening first, so 128 total misses no matter which policy
is used
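To double-check these counts, here is a small C sketch that simulates only the accesses to x under each policy; treating the fully associative cache as a simple present/absent flag per line is my own simplification.

#include <stdio.h>
#include <string.h>

#define N 1024
#define LINE_BYTES 64
#define ELEM_BYTES 8                 /* sizeof(double) */
#define LINES (N * ELEM_BYTES / LINE_BYTES)

/* simulate accesses to x[] only: S1 writes x[i], then S2 reads x[i] */
static int count_misses(int write_allocate) {
    char cached[LINES];              /* 1 if the line holding x[i] is in cache */
    memset(cached, 0, sizeof cached);
    int misses = 0;

    for (int i = 0; i < N; i++) {
        int line = (i * ELEM_BYTES) / LINE_BYTES;

        /* S1: write to x[i] */
        if (!cached[line]) {
            misses++;
            if (write_allocate)      /* write miss loads the line */
                cached[line] = 1;
        }
        /* S2: read of x[i] */
        if (!cached[line]) {
            misses++;
            cached[line] = 1;        /* read miss always loads the line */
        }
    }
    return misses;
}

int main(void) {
    printf("no-write-allocate: %d misses\n", count_misses(0));  /* 256 */
    printf("write-allocate:    %d misses\n", count_misses(1));  /* 128 */
    return 0;
}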
Write Buffer
• To support write through, add a write buffer
– buffer caches updated lines so that, to the CPU, writing to
cache is the same as writing to memory
• but writes to memory get temporarily cached
– successive writes to the same block get stored in the buffer
• only if writes are to different blocks will the current write buffer
contents be sent to memory
• on cache miss, check buffer to see if it is storing what we need
– example: 4-word write buffer, and in the code below, 512
and 1024 map to the same cache line
• sw x3, 512(x0) – write to buffer
• lw x1, 1024(x0) – load causes block with 512 to be discarded
• lw x2, 512(x0) – datum found in write buffer, small penalty!
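Here is a rough C sketch of the buffer just described, assuming a 4-entry buffer keyed by block address; the entry layout and merge policy are illustrative, not any particular hardware design.

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* address of the block being written back */
    uint32_t data;         /* stand-in for the buffered word(s)       */
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

/* a store first tries to merge with an existing entry for the same block */
void wb_store(uint32_t block_addr, uint32_t data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            wb[i].data = data;             /* merge into existing entry */
            return;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].valid) {                /* otherwise take a free slot */
            wb[i] = (wb_entry_t){ true, block_addr, data };
            return;
        }
    }
    /* buffer full: a real design stalls or drains an entry to memory here */
}

/* on a cache miss, a load checks the buffer before going to memory */
bool wb_lookup(uint32_t block_addr, uint32_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            *out = wb[i].data;             /* small penalty, no memory access */
            return true;
        }
    }
    return false;
}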
Performance: Unified Versus Split Cache
• Recall we want 2 caches to support 2 accesses
per cycle: 1 instruction, 1 data
– what if we had a single, unified cache?
• assume either two 16KB caches or one 32KB cache
• assume 36% loads/stores and a miss penalty of 100 cycles
• use the table below which shows misses per 1000
instructions (not miss rate)
Miss rates for DRAM need to be very small or else the miss penalty has far too great an impact on performance
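Since the misses-per-1000-instructions table is not reproduced in these notes, the C sketch below uses placeholder miss numbers (only the 100-cycle penalty and 36% load/store mix come from the slide) just to show how the split-versus-unified comparison is carried out; it also assumes a 1-cycle hit time and that a data access to the single-ported unified cache stalls one extra cycle.

#include <stdio.h>

int main(void) {
    /* from the slide: 36% of instructions are loads/stores, 100-cycle miss penalty */
    const double ls_frac      = 0.36;
    const double miss_penalty = 100.0;
    const double hit_time     = 1.0;    /* assumed 1-cycle hit */

    /* PLACEHOLDER misses per 1000 instructions; substitute the table's
       real values before drawing any conclusions                        */
    const double mpki_instr_16k   = 4.0;
    const double mpki_data_16k    = 40.0;
    const double mpki_unified_32k = 44.0;

    /* memory accesses per instruction: 1 fetch + 0.36 data accesses */
    double accesses = 1.0 + ls_frac;
    double f_instr  = 1.0 / accesses;    /* fraction of accesses that are fetches */
    double f_data   = ls_frac / accesses;

    /* convert misses per 1000 instructions into per-access miss rates */
    double mr_i = (mpki_instr_16k   / 1000.0) / 1.0;
    double mr_d = (mpki_data_16k    / 1000.0) / ls_frac;
    double mr_u = (mpki_unified_32k / 1000.0) / accesses;

    double amat_split = f_instr * (hit_time + mr_i * miss_penalty)
                      + f_data  * (hit_time + mr_d * miss_penalty);

    /* assumption: the single-ported unified cache stalls data accesses one
       extra cycle because a fetch and a data access can collide           */
    double amat_unified = f_instr * (hit_time + mr_u * miss_penalty)
                        + f_data  * (hit_time + 1.0 + mr_u * miss_penalty);

    printf("split AMAT   = %.2f cycles\n", amat_split);
    printf("unified AMAT = %.2f cycles\n", amat_unified);
    return 0;
}

Plugging in the real table values gives the actual split-versus-unified comparison.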
Fast Address Translation
• Recall we want to avoid address translation
– instruction and data caches usually store items by virtual address rather than physical address, which brings about two problems
• we must still enforce protection, i.e., check that the virtual address is legal for this process
• on a cache miss, we still need to perform address translation
• Address translation uses the page table
– resides in memory
– very large page tables could, at least partially, be stored in swap space
– use a TLB (on-chip cache) to cache paging information to reduce impact
of page table access, but TLB is usually an associative cache
• virtual address is the tag
• frame number is the datum
• TLB entry also contains a protection bit (is this page owned by this
process?) and a dirty bit (is the page dirty, not is the TLB entry dirty?)
– using an associative cache slows the translation process down even more (on top of what is effectively a second cache access), but there's not much we can do about it!
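A minimal C sketch of the TLB lookup just described (fully associative, virtual page number as the tag, frame number as the datum); the 64-entry size and 4 KB pages are assumed for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12          /* assumed 4 KB pages */

typedef struct {
    bool     valid;
    uint64_t vpn;               /* virtual page number = the tag             */
    uint64_t frame;             /* physical frame number = the datum         */
    bool     protection_ok;     /* does this process own the page?           */
    bool     dirty;             /* has the page (not the entry) been written? */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* fully associative lookup: compare the VPN (tag) against every entry */
static bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].protection_ok) {
            *paddr = (tlb[i].frame << PAGE_BITS) | offset;
            return true;        /* TLB hit: no page table access */
        }
    }
    return false;               /* TLB miss: walk the page table in memory */
}

int main(void) {
    /* pre-load one entry so the lookup has something to hit */
    tlb[0] = (tlb_entry_t){ true, 0x12345, 0x00042, true, false };

    uint64_t pa;
    if (tlb_translate(0x12345ABCull, &pa))
        printf("hit:  paddr = 0x%llx\n", (unsigned long long)pa);
    else
        printf("miss: walk the page table\n");
    return 0;
}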
Selecting a Page Size
• Larger page sizes
– smaller page table (fewer total pages per process); see the back-of-the-envelope sketch after this list
– transferring more data per access makes more effective use of the hard disk
– with fewer pages overall, the TLB's fixed set of entries covers a proportionally larger share of them
• Smaller page sizes
– less potential wasted space (smaller internal fragments)
– less time to load a page from disk (a smaller miss penalty, though each disk access transfers less data and so is used less efficiently)
• Some CPUs today support multiple page sizes based
on the program size
– we prefer small pages if the program is smaller
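The back-of-the-envelope sketch referenced above: a short C program (my own assumptions: 48-bit virtual addresses, 8-byte page table entries, a flat single-level table) showing how the page size drives page table size.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* assumed: 48-bit virtual address space, 8-byte page table entries */
    const uint64_t vaddr_space  = 1ULL << 48;
    const uint64_t pte_bytes    = 8;
    const uint64_t page_sizes[] = { 4096, 2ULL << 20, 1ULL << 30 }; /* 4 KB, 2 MB, 1 GB */

    for (int i = 0; i < 3; i++) {
        uint64_t pages = vaddr_space / page_sizes[i];
        /* a flat, single-level table; real page tables are multi-level */
        printf("page size %10llu B -> %12llu pages, flat table = %llu MB\n",
               (unsigned long long)page_sizes[i],
               (unsigned long long)pages,
               (unsigned long long)((pages * pte_bytes) >> 20));
    }
    return 0;
}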
Implementing a Memory Hierarchy
• Autonomous instruction fetch units
– with out-of-order execution, we remove the IF
stage from the pipeline and move it to a separate IF
unit
• IF unit accesses instruction cache to fetch an entire
block at a time
• move fetched instructions into a buffer, to be decoded one at a time and flushed if a branch turns out to be taken (see the sketch after this list)
– miss rates become harder to compute because multiple instructions are fetched at a time; on any given fetch, some of the requested instructions may hit while others miss
• similarly, prefetching data may result in partial misses
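A structural C sketch of such a fetch buffer, under simplifying assumptions of my own (64-byte blocks of 4-byte instructions, the buffer refilled whole and drained one instruction at a time); it illustrates only the bookkeeping, not a real IF unit.

#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 16            /* assumed: 64-byte block, 4-byte instructions */

typedef struct {
    uint32_t instr[BLOCK_WORDS];  /* instructions fetched as one block   */
    int      head;                /* next instruction to hand to decode  */
    int      count;               /* valid instructions in the buffer    */
} fetch_buffer_t;

/* the IF unit refills the buffer with a whole cache block at a time */
void if_refill(fetch_buffer_t *fb, const uint32_t *block) {
    memcpy(fb->instr, block, sizeof fb->instr);
    fb->head  = 0;
    fb->count = BLOCK_WORDS;
}

/* decode pulls one instruction per cycle from the buffer */
int if_next(fetch_buffer_t *fb, uint32_t *instr) {
    if (fb->head >= fb->count)
        return 0;                 /* buffer empty: IF unit must refill   */
    *instr = fb->instr[fb->head++];
    return 1;
}

/* a taken branch discards everything already fetched past it */
void if_flush(fetch_buffer_t *fb) {
    fb->head = fb->count = 0;
}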
Continued
• Speculation
– causes a previously fetched instruction to be executed
before the hardware has computed whether a branch
should have been taken or not
• Need memory support for speculation
– proper protection must be implemented
• a memory violation caused by an instruction that was fetched only because of misspeculation should not raise an exception
• to maintain precise exceptions, the memory violation must be postponed until we know the instruction really should execute
– details on protection can be found in appendix B, pages B-49 to B-54
– a cache miss after misspeculation throws off the miss rate
• because the instruction should never have been fetched
Cache Coherence
• With multi-core processors (or any parallel
processor), we now have a new concern
– core 1 has loaded datum x from memory into its L1
cache, which is a write-back cache
– core 1 increments x in its cache, but not yet to memory
– core 2 now loads datum x from memory into its L1 cache
– core 2 now has a stale copy of x (the up-to-date value exists only in core 1's dirty cache line)
• this cannot be allowed
– we resolve this by implementing cache coherence
• we look at this in chapter 5
ARM Cortex-A53 Mem Hierarchy
• Issue up to 2 instructions per cycle
– three TLBs: instruction TLB, data TLB, second level
TLB (L2 version of a TLB)
• instruction and data TLBs are 2-way set associative and can
store up to 10 entries each – 2 cycle penalty on a TLB miss
• second level TLB is 4-way set associative with 512 entries
and a 20 cycle miss penalty (to get to page table in
memory)
• TLBs use critical word first with early restart
– L1 instruction and data caches: 2-way set associative,
64-byte block, 13 cycle miss penalty
– L2 unified cache: 16-way set associative, LRU, 124
cycle miss penalty
• see figure 2.20 on page 131 if you want to be confused!
Memory Performance
• Tested on SPECInt2006 benchmarks
• 32KB L1 caches, 1MB L2
– miss rate for L1 was mostly under 5%
• reached 10% on two and 35%+ on mcf
– miss rate for L2 was under 1% for most
– the average miss penalty per memory reference was under 1 cycle for most benchmarks
• 2-5 cycles for 3 others, 16 cycles for mcf
Intel Core i7 6700 Mem Hierarchy
• Out of order execution processor, four cores
• We focus on the memory hierarchy of a single core
– each core can issue up to 4 instructions at a time
– 16-stage pipeline, dynamically scheduled
• up to 3 memory banks can be accessed simultaneously
– 48-bit virtual addresses, 36-bit physical addresses
– 3-level cache hierarchy (but can support a fourth level
using high-bandwidth memory)
• L1 indexed by virtual address, L2 & L3 use phys addr
– this led to an oddity in that the L2 could cache an instruction both by
virtual address (when backing up L1) and physical address
• TLB access and L1 caches accessed simultaneously to reduce
impact of a cache miss requiring address translation
– two TLBs (instruction, data) are 8-way associative
More
• Here are a few more details
– L1 instruction cache is divided into 8 banks for simultaneous
accesses
– L2 cache uses write-back with a 10-entry merging write
buffer
– hardware prefetching performed on both L1 and L2
• usually just the next block after the one being requested
• Performance is hard to determine
– the out-of-order nature of the processor
– separate instruction-fetch unit (which attempts to fetch 16
bytes at a time)
– the L1 miss rate with prefetching ranged between 1% and 22% for the SPECInt2006 benchmarks, mostly 1-3%
– instruction fetch unit is stalled less than 2% of the time for
most programs, up to 12% for one
Fallacies and Pitfalls
• P: predicting cache performance from one program
to another
– cache performance can vary dramatically because of the
nature of a program’s branches, use of arrays, local
variables, pointers, etc
• P: not delivering high memory bandwidth to the
cache
– moving too little at a time causes cache performance to
degrade
• P: having too small an address space
– not providing enough bits for either a physical or virtual
address (this limitation can arise because of the size of
address registers and the address bus)