
Memory Hierarchy

• We combine appendix B and chapter 2 to study the memory hierarchy
– how it works
– improvements to cache memory
– improvements to virtual memory
• we also briefly look at the ARM Cortex-A53 memory hierarchy
– NOTE: we are omitting chunks of chapter 2 and appendix B
The “Memory” Problem
• Significant improvements in processor speed
– doubling every couple of years (until around 2005, but still increasing from 2005 through 2018)
• DRAM speed has improved only slowly (roughly linearly) since 1980
– every year, the improvement in processor speed far outpaces the improvement in DRAM access time
– since the processor accesses memory in every fetch-execute cycle, there is no sense increasing clock speed unless we find a faster form of memory

This is why we use cache memory and create the memory hierarchy
Cache

Typical memory hierarchy layout for (a) mobile computers, (b) laptops/desktops, (c) servers
Performance Formulae
• CPU execution time must factor in stalls from
memory access
– assume L1 cache responds within the amount of time
allotted for the load/store/instruction fetch stage
• e.g., 1 cycle
– CPU time = IC * CPI * Clock cycle time
• CPI = ideal CPI + stalls
– stalls from both hazards and memory hierarchy delays (e.g., cache
misses), or stalls = pipeline stalls + memory stalls
– compute memory stall cycles (msc) as follows
• msc = IC * (misses per instruction) * miss penalty
• msc = IC * (memory accesses per instruction) * miss rate *
miss penalty
Example
• CPI = 1.0 when cache has 100% hit rate
– data accesses are only performed in loads and stores
– load/stores make up 50% of all instructions
– miss penalty = 50 clock cycles
– miss rate = 1%
• How much faster is an ideal machine (one with no misses)?
– CPU time ideal = IC * 1.0 * clock cycle time
– CPU time this machine = IC * (1.0 + 1.50 * .01 * 50) * clock cycle time = IC * 1.75 * clock cycle time
• 1.50 = memory accesses per instruction (1 instruction fetch + 0.5 data accesses)
– the ideal machine is 1.75 times as fast (75% faster) – see the sketch below
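A minimal C sketch of this calculation (the values are the example's assumptions; the variable names are just illustrative):

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.0;   /* CPI with a 100% hit rate                 */
    double mem_refs     = 1.5;   /* memory accesses per instruction:
                                    1 instruction fetch + 0.5 loads/stores   */
    double miss_rate    = 0.01;  /* 1% of memory accesses miss               */
    double miss_penalty = 50.0;  /* clock cycles per miss                    */

    /* memory stall cycles per instruction = accesses/instr * miss rate * penalty */
    double stall_cpi = mem_refs * miss_rate * miss_penalty;
    double real_cpi  = ideal_cpi + stall_cpi;

    printf("CPI including memory stalls = %.2f\n", real_cpi);        /* 1.75 */
    printf("slowdown versus ideal = %.2fx\n", real_cpi / ideal_cpi); /* 1.75 */
    return 0;
}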
Cache Questions
• Where can a block be placed?
– determined by type of cache (direct-mapped, set associative)
• How is a block found?
– requires tag checking
– tags get larger as we move from direct-mapped to some level
of associativity
• amount of hardware required for parallel tag checks increases
• causing the hardware to become more expensive and slower
• Which block should be replaced on a miss?
– direct-mapped: no choice
– associative: replacement strategy (random, LRU, FIFO)
• What happens on a write?
– read B-6..B-12 if you need a refresher from CSC 362
Write Strategy Example
• Computer has a fully associative write-back data cache
• Refill line size is 64 bytes
for(i=0;i<1024;i++){
x[i] = b[i] * y;// S1
c[i] = x[i] + z;// S2
}
• x, b and c are double arrays of 1024 elements
• assume b & c are already in cache and moving x into the
cache does not conflict with b or c
– x[0] is stored at the beginning of a block
• Questions:
– how many misses arise with respect to accessing x if the cache
uses a no-write allocate policy?
– how many misses arise with respect to accessing x if the cache
uses a write allocate policy?
– redo the question with S2 before statement S1
Solution
• Refill line size = 64 bytes
– each array element is 8 bytes, 8 elements per refill line
– misses arise only when accessing x
• No-write allocate policy means
– a miss on a write sends the write to memory without loading the
line into cache
• a miss in S1 does not load the line and so S2 will also have a miss
– a miss on a read causes line to be brought into cache
• a miss in S2 causes line to be brought into cache
– along with the next 7 array elements
• read misses occur for the first array element and every 8th thereafter
– 1024 / 8 = 128 read misses
• each read miss in S2 was preceded by a write miss in S1, giving 256 total misses
• Write allocate policy means a write miss in S1 loads the line into cache (so no miss in S2)
– therefore we have a total of 128 misses
• Reversing the order of S1 and S2 leads to the read miss happening first, so 128 total misses no matter which policy is used (see the simulation sketch below)
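A small C simulation sketch of the example above (line size and policies as stated; b and c are assumed already cached, so only x is tracked):

#include <stdio.h>
#include <string.h>

#define N 1024
#define DOUBLES_PER_LINE 8              /* 64-byte line / 8-byte double */
#define LINES (N / DOUBLES_PER_LINE)

static int count_misses(int write_allocate) {
    int present[LINES];                 /* is the line holding x[i] in the cache? */
    int misses = 0;
    memset(present, 0, sizeof present);
    for (int i = 0; i < N; i++) {
        int line = i / DOUBLES_PER_LINE;
        /* S1: write x[i] = b[i] * y */
        if (!present[line]) {
            misses++;
            if (write_allocate) present[line] = 1;  /* write miss loads the line */
        }
        /* S2: read x[i] for c[i] = x[i] + z */
        if (!present[line]) {
            misses++;
            present[line] = 1;                      /* read misses always allocate */
        }
    }
    return misses;
}

int main(void) {
    printf("no-write-allocate: %d misses\n", count_misses(0));  /* 256 */
    printf("write-allocate:    %d misses\n", count_misses(1));  /* 128 */
    return 0;
}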
Write Buffer
• To support write through, add a write buffer
– buffer caches updated lines so that, to the CPU, writing to
cache is the same as writing to memory
• but writes to memory get temporarily cached
– successive writes to the same block get stored in the buffer
• only if writes are to different blocks will the current write buffer
contents be sent to memory
• on cache miss, check buffer to see if it is storing what we need
– example: 4-word write buffer, and in the code below, 512
and 1024 map to the same cache line
• sw x3, 512(x0) – write to buffer
• lw x1, 1024(x0) – load causes block with 512 to be discarded
• lw x2, 512(x0) – datum found in write buffer, small penalty!
Performance: Unified Versus Split Cache
• Recall we want 2 caches to support 2 accesses
per cycle: 1 instruction, 1 data
– what if we had a single, unified cache?
• assume either two 16KB caches or one 32KB cache
• assume 36% loads/stores and a miss penalty of 100 cycles
• use the table below which shows misses per 1000
instructions (not miss rate)

Size Instruction cache Data cache Unified cache


8 KB 8.16 44.0 63.0
16 KB 3.82 40.9 51.0
32 KB 1.36 38.4 43.3
64 KB 0.61 36.9 39.4
128 KB 0.30 35.3 36.2
256 KB 0.02 32.6 32.9
Solution
• Convert misses/instruction to miss rate
– unified cache: 43.3 misses per 1000 instructions
• miss rate 32KB unified = 43.3/1000/(1+.36) = .0318
– split caches
• miss rate 16KB instruction cache = 3.82/1000/1 = .004
• miss rate 16KB data cache = 40.9/1000/.36 = .114
• overall miss rate = (1/1.36) * .004 + (.36/1.36) * .114 = .0326
– split cache EAT = 1 + .0326 * 100 = 4.26
– unified cache EAT = 1 + .0318 * 100 + .0318 * .36 * 2 * 100
= 6.47
• unified cache can only respond to 1 access at a time, so if we have a
load/store, that load/store waits 1 cycle
– split caches have poorer miss rates but performance is far
better because unified cache requires stalling an instruction
fetch when there is a need to access a datum
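A short C sketch of the misses-per-1000-instructions to miss-rate conversion used above (the 36% load/store frequency and the table values are the example's assumptions):

#include <stdio.h>

static double miss_rate(double misses_per_1000, double accesses_per_instr) {
    return (misses_per_1000 / 1000.0) / accesses_per_instr;
}

int main(void) {
    double data_frac = 0.36;                       /* loads/stores per instruction */
    double i16 = miss_rate(3.82, 1.0);             /* 16KB instruction cache       */
    double d16 = miss_rate(40.9, data_frac);       /* 16KB data cache              */
    double u32 = miss_rate(43.3, 1.0 + data_frac); /* 32KB unified cache           */

    /* weight the split caches by the fraction of accesses that are instruction
       fetches (1/1.36) versus data accesses (.36/1.36) */
    double split = (1.0 / 1.36) * i16 + (data_frac / 1.36) * d16;

    printf("instr %.4f, data %.4f, unified %.4f, split overall %.4f\n",
           i16, d16, u32, split);
    /* the slide rounds i16 to .004 and d16 to .114, giving .0326 overall */
    return 0;
}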
Miss Rates: the 3 C’s
• Cache misses can be categorized as
– compulsory misses
• first access to a line – this results from a process just starting or
resuming so that the first access is a miss
– capacity misses
• cache cannot contain all lines needed for the process – results from not
having a large enough cache
– conflict misses
• mapping function causes new line to be inserted into a line that will
need to be accessed in the near future
• direct-mapped caches have no replacement strategy, so this argues for
some amount of associativity in our caches
• Some optimizations focus on one particular type of miss
– conflict misses are more common in direct-mapped caches while
fully associative caches never have conflict misses
– capacity misses are the most common cause of a cache miss
• see figure B.8 on page B-24
Impact of Associativity
• One way to improve (lower) miss rate is to use
associativity
– use a replacement strategy to better support locality of reference and produce a lower conflict miss rate
– associativity requires more hardware and thus slows the
clock speed
– is it worthwhile?
• Base machine has CPI of 1.0
– clock cycle time of .35 ns
– 128 KB direct mapped cache, 64 words per line, miss
penalty of 65 ns
– assume 1.4 memory references per instruction
– should we replace the direct-mapped cache with a 2-way set
associative cache if it increases clock cycle time by 35%?
Solution
• Using figure B.8, we have the following miss rates
– direct mapped cache = 2.1%, 2-way assoc cache = 1.9%
• Miss penalty = 65 ns
• EAT = hit time + miss rate * miss penalty
– DM EAT: .35 ns + .021 * 65 ns = 1.72 ns
– 2-way EAT: (.35 * 1.35) ns + .019 * 65 ns = 1.71 ns
• But the clock cycle is lengthened for our associative cache, so compare CPU times
– CPU time = IC * (CPI * clock cycle time + miss rate * memory accesses per instruction * miss penalty)
• DM: IC * (1.0 * .35 ns + .021 * 1.4 * 65 ns) = IC * 2.26 ns
• 2-way: IC * (1.0 * .35 ns * 1.35 + .019 * 1.4 * 65 ns) = IC * 2.20 ns
– the 2-way offers a slight improvement
• we’ll see a fuller example shortly
Impact on OOC
• In our previous example, we saw that a cache miss
resulted in a set number of stalls
– this is the case in a pipeline
– with out-of-order completion, a cache miss can be
somewhat hidden
• Our machine is the same as the previous example but
with the 35% longer clock cycle (.35 ns * 1.35)
– we will use the same direct mapped cache with a 65 ns miss penalty but assume 30% of the miss penalty is hidden because of reservation stations, reducing the effective miss penalty from 65 ns to 45.5 ns
• CPU time = IC * (1.6 * .35 * 1.35 + .021 * 1.4 * 45.5) = IC *
2.09
• so, in spite of a slower clock, we see a significant improvement
Cache Optimization Strategies
• In section 2.3 and B.3, we focus on 17 different strategies (one
is not in the book)
– recall: EAT = hit time + miss rate * miss penalty
• Reduce EAT by reducing any of these three values
– reducing hit time
• hit time impacts every access, so it seems the most significant part of EAT to improve
• but reducing hit time usually means a faster cache, which is technologically hard to achieve beyond using a simple D-M cache
– reducing miss rate – we may be reaching a limit here
– reducing miss penalty – this is the largest of the three values
• Two additional approaches improve cache performance
through parallelism
– reduce miss penalty and/or miss rate by parallelism
– increase cache bandwidth
Larger Block Sizes
• Larger block size brings more to cache at a time
– this will reduce compulsory misses
• Larger blocks mean fewer blocks
– this may negatively impact conflict misses
• Larger block sizes take more time to transfer
– this will negatively impact miss penalty
• Example: 4KB cache, 16 vs 32 byte blocks
– assume memory access = 80 cycles + 2 cycles/16 bytes
– EAT 16-byte blocks = 1 + .0857 * 82 = 8.027 cycles
– EAT 32-byte blocks = 1 + .0724 * 84 = 7.082 cycles
• see figure B.12, page B-28, for comparison of various block sizes on
different cache sizes
• 32 byte blocks are best for 4K caches, 64 byte blocks best for larger
caches
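A quick C sketch of the block-size trade-off above (the miss rates are the example's figure B.12 values; the miss penalty grows with the transfer time):

#include <stdio.h>

int main(void) {
    int    block_bytes[] = {16, 32};
    double miss_rate[]   = {0.0857, 0.0724};   /* 4KB cache, from the example */

    for (int i = 0; i < 2; i++) {
        /* miss penalty = 80 cycles + 2 cycles per 16 bytes transferred */
        double penalty = 80.0 + 2.0 * (block_bytes[i] / 16.0);
        double eat     = 1.0 + miss_rate[i] * penalty;   /* 1-cycle hit time */
        printf("%2d-byte blocks: penalty = %.0f cycles, EAT = %.3f cycles\n",
               block_bytes[i], penalty, eat);            /* 8.027 and 7.082 */
    }
    return 0;
}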
Higher Associativity
• Research points to the “2:1 cache rule of thumb”
– cache of size N has about the same miss rate as a cache of
size N / 2 of the next higher associativity
• e.g., a 32 KB D-M cache has a similar miss rate (.042) to a 16 KB 2-way cache (.041) – see figure B.8 on page B-24
– there is no significant improvement beyond an 8-way set associative cache
• no need for 16-way, 32-way, etc.
• Question: which type of cache should we use?
– choose between direct-mapped, 2-, 4- or 8-way cache?
– increased associativity requires slowing down clock speed
– let’s test this out assuming
• clock rates of 1.00 (DM), 1.36 (2W), 1.44 (4W), 1.52 (8W)
• miss penalty of 25 clock cycles
• caches of 4KB and 512 KB
Solution
• 4K cache (see figure B.8 for miss rates)
– DM: 1.00 + .098 * 25 = 3.44 ns
– 2W: 1.36 + .076 * 25 = 3.26 ns
– 4W: 1.44 + .071 * 25 = 3.22 ns
– 8W: 1.52 + .071 * 25 = 3.3 ns
• 512K cache
– DM: 1.00 + .008 * 25 = 1.2 ns
– 2W: 1.36 + .007 * 25 = 1.54 ns
– 4W: 1.44 + .006 * 25 = 1.59 ns
– 8W: 1.52 + .006 * 25 = 1.67 ns
• see figure B.13 on page B-30 for a summary of results, notice that DM is the
best for caches of 16K or larger, 4-way best for smaller caches
– increased associativity has a greater impact on smaller caches
– these results do not take into account how the increased clock speed
impacts other parts of the processor
– prefer associativity for L2 caches over L1 for this reason
Multi-Level Caches
• Competing ideas for our L1 cache
– larger caches improve miss rate but take up more space
– higher associativity improves miss rate but slows clock speed
– we want to maximize clock speed
– we want to keep costs down
• Solution: add a larger, higher associativity cache to the processor (or
off-chip) that backs up L1 so that miss penalty is reduced but hit time
is not impacted
– we can have L2 and L3 caches if it is economically feasible
• Today, the L2 is on the chip and shared between cores
– one problem is that L2 cannot respond to two or more cache requests at the
same time
– so hit time for L2 can vary based on the number of requests
– aside from cost, there may not be room for an L3 cache on some devices like
a smart phone
EAT for Multi-level Cache
• Our EAT formula goes from
– hit time + miss rate * miss penalty, to
– hit time 1 + miss rate 1 * (hit time 2 + miss rate 2 * miss penalty 2), to
– hit time 1 + miss rate 1 * (hit time 2 + miss rate 2 * (hit time 3 + miss rate 3 * miss penalty))
• L2 duplicates everything in L1
– so even though L2 is much larger, its miss rate does not
necessarily reflect the difference in size completely
• We divide the L2 (and L3) miss rate into
– local miss rate – number of accesses at L2 that are misses /
number of L2 cache accesses
– global miss rate – number of misses at L2 / number of total
cache accesses = L1 miss rate * L2 local miss rate
Example
• D-M L1 cache
– hit time = 1 clock cycle, miss rate = 3.94%
• L2 is 128 KB 2-way set associative cache
– hit time of 5 clock cycles
– local miss rate of 40%
– global miss rate = 3.94% * 40% = 1.6%
• L3 is off-chip 1 MB 8-way set associative cache
– hit time = 10 clock cycles, local miss rate of 25%
– global miss rate = 1.6% * 25% = .004
• Memory access time = 80 clock cycles
• What is the performance with all three caches? without the 2nd on-
chip cache? without L3?
– EAT 3 caches = 1 + .0394 * (5 + .40 * (10 + .25 * 80)) = 1.67
– EAT without L2 = 1 + .0394 * (10 + .25 * 80) = 2.18
– EAT without L3 = 1 + .0394 * (5 + .40 * 80) = 2.46
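A C sketch of the three-level EAT calculation above (the hit times, local miss rates, and 80-cycle memory access time are the example's assumptions):

#include <stdio.h>

int main(void) {
    double l1_miss = 0.0394, l2_local = 0.40, l3_local = 0.25;
    double l2_hit = 5.0, l3_hit = 10.0, memory = 80.0;

    double all_three = 1.0 + l1_miss * (l2_hit + l2_local * (l3_hit + l3_local * memory));
    double no_l2     = 1.0 + l1_miss * (l3_hit + l3_local * memory);
    double no_l3     = 1.0 + l1_miss * (l2_hit + l2_local * memory);

    printf("EAT with L1+L2+L3: %.2f cycles\n", all_three);  /* 1.67 */
    printf("EAT without L2:    %.2f cycles\n", no_l2);      /* 2.18 */
    printf("EAT without L3:    %.2f cycles\n", no_l3);      /* 2.46 */

    /* global miss rates fall out of the local miss rates */
    printf("L2 global miss rate %.4f, L3 global miss rate %.4f\n",
           l1_miss * l2_local, l1_miss * l2_local * l3_local);
    return 0;
}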
Priority to Read over Write Misses
• Reads (loads) occur with a much greater frequency than
writes (stores) – make the common case fast
– writes are slower because the tag check must occur prior to starting the write, and the write may also need to go to memory
– tag check for reads can be done in parallel with the read
• the read is canceled if the tag is incorrect
• Use a write buffer for both types of write policy
– write-through cache writes to write buffer first
• read misses given priority over writing the write buffer to memory
– write-back cache writes to write buffer and the write buffer
is only written to memory when we are assured of no
conflict with a read miss
• in either case, a read miss will first examine the write buffer before
going on to memory in order to potentially save time
Avoiding Address Translation
• The CPU generates a logical address into virtual
memory
– this must be mapped to a physical address before
memory can be accessed
– even if the mapping information is found in the TLB, it
requires 2 cache accesses
• TLB miss requires page table access in DRAM
• Solution: cache instructions and data in the L1
cache using their virtual addresses instead of the
physical address
– this has its own problems and solutions
• read the text for a complete analysis (pages B-36 – B-40)
– we return to this later in these notes
Small/Simple L1 Caches
• Simple means direct-mapped
– reduces hit time so that our system clock can be faster
• direct-mapped caches also have shorter tags
• direct-mapped caches have lower power consumption because only one line is examined (rather than 2 in a 2-way or 4 in a 4-way, etc.)
– with a direct-mapped cache, we can spread the tag check and data access over two short cycles
• this allows us to create a super-pipelined computer where the clock cycle is halved (as we saw with the MIPS R4000)
• the deeper pipeline gives us greater speedup both from the clock cycle improvement and from having a greater number of instructions in the pipe at a time
– the cost of the simple cache is a higher miss rate
Way Prediction
• Use a 2-way set-associative cache to act like a direct-mapped
cache as follows
– given the address, predict which of the set's two blocks (which way) to examine
– perform the tag comparison based on the prediction
– on a hit, we're done, 0 stall cycles
– on a mispredict, check the other way, 1 cycle stall
• update the prediction for next time within the same cycle
– on a miss in the second way as well, it's a true cache miss, go to the next level of the memory hierarchy
• The 2-way set associative cache's miss rate doesn't change
– hit time on a correctly predicted access = hit time of a direct-mapped cache
• because we didn't employ extra hardware for parallel tag checking
– on a miss in the predicted way but a hit in the other way, hit time is 2 * the direct-mapped cache hit time (2 cycles)
– prediction accuracy for a 2-way set associative cache can be as high as 90%
Variation: Pseudo-Associative Cache
• Use a direct-mapped cache but simulate a 2-way set associative
cache as follows
– given address, do tag comparison at given refill line
– on hit, no change
– on miss, invert last two (or first two) bits of the line number, check
that tag – on a hit, 1 cycle penalty (second access takes an extra cycle)
– on miss, go to the next level of the hierarchy
• Now we can employ a replacement strategy
– given a refill line, place it in its proper location or in the location with
the last two (or first two) bits of the line number reversed
• Example:
– address of line number: 1010111101
– search (or place new item) at 1010111101 or 1010111110
– the DM cache now has about the same miss rate as a 2-way set associative cache without the extra hardware expense or the slower clock
Examples
• To employ a 2-way set associative cache we have to lengthen the clock cycle time by 25%
– hit time is 1 (DM), 1.25 (2-W)
– miss penalty is 10 clock cycles
• On-chip 8KB cache
– D-M miss rate = .068, 2-way miss rate = .049
– EAT D-M = 1 + .068 * 10 = 1.68
– EAT 2-way = 1.25 + .049 * 10 = 1.74
– EAT way prediction = .9 * 1 + .1 * 2 + .049 * 10 = 1.59
– EAT pseudo-assoc = 1 + .019 * 1 + .049 * 10 = 1.509
• pseudo-associative is best here although it is more expensive
than way-prediction
• with a 16KB cache, way prediction does worse than D-M!
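A C sketch comparing the four organizations with the example's assumed miss rates, hit times, and 10-cycle miss penalty:

#include <stdio.h>

int main(void) {
    double penalty  = 10.0;
    double dm_rate  = 0.068, two_way_rate = 0.049;   /* 8KB cache               */
    double pred_acc = 0.9;                           /* way-prediction accuracy */

    double eat_dm     = 1.0  + dm_rate * penalty;
    double eat_2way   = 1.25 + two_way_rate * penalty;            /* slower clock     */
    double eat_pred   = pred_acc * 1.0 + (1.0 - pred_acc) * 2.0   /* 1- or 2-cycle hit */
                        + two_way_rate * penalty;
    /* pseudo-associative: accesses that miss the primary line but hit the
       alternate line (dm_rate - two_way_rate of them) pay 1 extra cycle */
    double eat_pseudo = 1.0 + (dm_rate - two_way_rate) * 1.0 + two_way_rate * penalty;

    printf("DM %.3f, 2-way %.3f, way prediction %.3f, pseudo-assoc %.3f\n",
           eat_dm, eat_2way, eat_pred, eat_pseudo);   /* 1.68, 1.74, 1.59, 1.509 */
    return 0;
}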
Multi-Banked Caches
• A common memory optimization is to have multiple
memory banks
– each bank can be accessed independently for parallel accesses
• independent accesses to different banks (high order interleave)
• or access to several consecutive locations (low order interleave)
• Multiple cache “banks” can allow for interleaved cache
accesses
– multiple cache accesses during the same clock cycle to
accommodate superscalar processors with multiple load/store
units
• (the slide's figure showing low-order interleave is omitted here)
Variation: Pipelined Cache Access
• Recall MIPS R4000 was superpipelined
– divide cache accesses into multiple pipe stages (cache accesses
are now pipelined)
– decreases clock cycle time
– two stages: cache access, tag check
– increases number of data dependency stalls as instructions will
have to wait an additional cycle for datum to be forwarded
– many architectures have adopted this approach
• Sun Niagara, ARM Cortex-A8 and Intel Core i7: 4 L1 banks
• AMD Opteron: 2 L2 banks, i7: 8 L2 banks (pipelining L2 has no impact on clock speed but is actually used for power saving)
– in general, instruction cache access is pipelined more often than data cache access
• in part because of branch prediction accuracy and in part because instructions are never written to
• data write tag checking has to be performed prior to physically writing the datum to the cache
Non-Blocking Caches
• Ordinary cache misses result in the cache blocking
– cache cannot respond to another request until the refill line
from the next level has been brought into the cache
• A non-blocking cache can continue to respond to CPU
requests while waiting for earlier miss to be resolved
– more expensive cache, lessens impact of stalls in out-of-order
completion architectures
• sometimes referred to as a “hit under miss” because part of the
miss penalty is hidden as the processor continues to work
– a second request must be to a different line than the one that missed
– often non-blocking caches start to block upon a second miss
– experimental research shows miss penalties can be reduced by
20-30% with a “one hit under miss” non-blocking cache
• A non-blocking cache does not help a normal pipeline
Implementing a Non-Blocking Cache
• Two problems need to be resolved
– a later cache access might collide with (occur at the same time as) a refill line retrieved from a previous miss
– multiple misses can occur (“miss under miss”)
• if we allow multiple misses, then we have to figure out for
every returned refill line which miss it corresponds with
– resolved through having multiple miss status handling registers
(MSHRs), one per miss
• miss under miss could also result in multiple collisions
arising if multiple lines are returned at the same time
– general strategy is to allow current accesses to occur
before taking care of a retrieved refill line
• we might need a buffer to store retrieved lines temporarily
Critical Word First/Early Restart
• On cache miss, an entire line is moved from the next level
in the hierarchy to the cache
– miss penalty consists of next level access time and transfer time
• assume 64 words per line, moved 4 words at a time
• the retrieved line sent from the next level back to the cache takes 16 total transfers
– if cache waits for full line to be retrieved, miss penalty is static
– if requested word is returned first, cache can send that word
back to CPU to let the pipeline resume flowing while remaining
words are delivered
• This optimization requires a non-blocking cache since the
cache needs to start responding to the request while still
storing other words as returned from the next level
– miss penalty = time for the next level access + 1 bus transfer
(rest of miss penalty is hidden)
– without this, longer block sizes create larger miss penalties!
Example
• Compare performance of a data access where the data cache
– responds normally
– with early restart only
– with critical word first and early restart
• Assume cache + memory only
– hit time = .5 ns, miss rate = 2.5%
– refill line is 64 words long
– memory access time = 20 ns, transferring 4 words at a time at a rate of 2 ns
per transfer
– look at average performance
• assume the requested word is in the middle of the line (8 of the 16 transfers complete before it arrives)
– normal: miss penalty = 20 ns + 2 ns * (64 / 4) = 52 ns, EAT = .5 ns + .025 * 52 ns = 1.8 ns
– early restart but no critical word first: miss penalty = 20 ns + 2 ns * (64 / 4) / 2 = 36 ns, EAT = .5 ns + .025 * 36 ns = 1.4 ns
– critical word first: miss penalty = 20 ns + 2 ns = 22 ns, EAT = .5 ns + .025 * 22 ns = 1.05 ns
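A C sketch of the three miss-penalty models above (20 ns access, 2 ns per 4-word transfer, 64-word line, requested word assumed to be in the middle of the line):

#include <stdio.h>

int main(void) {
    double hit = 0.5, miss_rate = 0.025;
    double access = 20.0, per_transfer = 2.0;
    double transfers = 64.0 / 4.0;                            /* 16 transfers per line        */

    double normal = access + per_transfer * transfers;        /* wait for the whole line      */
    double early  = access + per_transfer * transfers / 2.0;  /* word arrives halfway through */
    double crit   = access + per_transfer;                    /* requested word sent first    */

    printf("normal:              penalty %.0f ns, EAT %.2f ns\n", normal, hit + miss_rate * normal);
    printf("early restart:       penalty %.0f ns, EAT %.2f ns\n", early,  hit + miss_rate * early);
    printf("critical word first: penalty %.0f ns, EAT %.2f ns\n", crit,   hit + miss_rate * crit);
    return 0;   /* 52/1.80, 36/1.40, 22/1.05 */
}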
Merging Write Buffer
• A write-through cache may use a write buffer to
reduce the number of writes being sent to memory
– write buffer contains multiple items waiting to be written to
memory
– writes to memory are postponed until either the buffer is full
or a modified refill line is being discarded
• in the latter case, the write occurs at the same time as a write-back
cache
– write buffer will be small
• perhaps storing a few items at a time
– we can make our write buffer more efficient as follows
• organize write buffer in rows, one row represents one refill line
• multiple writes to the same line are saved in the same buffer row
• write to memory moves entire block from buffer
Example
• Below we see 4 consecutive writes being placed in separate rows of the buffer (top)
– instead, since all four writes are to the same block, we place them in the same row
– 1 memory write instead of 4
• Assume the 4 writes occur inside a loop, each write being 4 instructions later than the previous, and a 25 clock cycle memory access time
– sending each item to memory separately will cause delays in accessing the write buffer of 21 cycles each (84 total cycles), but if the writes are merged into a single line, only 21 cycles total
Compiler Optimizations
• Compiler scheduling can improve cache accesses by
grouping together accesses of items in the same cache
block
– loop interchange: exchange loops in a nested loop situation so
that array elements are accessed based on order that they will
appear in the cache and not programmer-prescribed order
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];
– currently, the for loops cause array elements to be accessed in column-major order but they are stored in memory (and cache) in row-major order: exchange the two loops, as shown below
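A minimal sketch of the interchanged version (the wrapper function and its name are just for illustration); walking the rows in the outer loop means consecutive accesses fall within the same cache line:

void scale(double x[5000][100]) {
    for (int i = 0; i < 5000; i++)      /* rows in the outer loop ...  */
        for (int j = 0; j < 100; j++)   /* ... columns in the inner    */
            x[i][j] = 2 * x[i][j];
}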
Continued
• Imagine a line stores 16 words
• An image processing routine processes a matrix
element (image[i][j]) and its neighbors
– image[i-1][j-1], image[i-1][j], image[i-1][j+1], image[i][j-
1], image[i][j+1], image[i+1][j-1], image[i+1][j],
image[i+1][j+1]
– knowing this, rather than operating on [i][j] in a row-by-
row manner, with blocking, the compiler rearranges the
loops so that once the neighbors are loaded into cache, all
of the neighbors are processed first
– in this way, items loaded to process [i][j] and then
discarded some time later (e.g., [i+1][j]) do not need to be
reloaded later
– see the example on pages 108-109
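The canonical illustration of blocking is matrix multiply; the sketch below (with assumed values for N and the blocking factor B, and x assumed zeroed beforehand) processes B x B sub-blocks so the active pieces of y and z stay resident in the cache instead of being reloaded:

#define N 512
#define B 32    /* blocking factor: chosen so the active sub-blocks fit in the cache */

void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    /* x must be zeroed before the first kk block is processed */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];   /* reuse the cached sub-blocks */
                    x[i][j] += r;
                }
}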
Hardware Prefetching
• Prefetching can be controlled by hardware or compiler
(see the next slide)
– prefetching can operate on either instructions or data or both
– one simple idea for instruction prefetching is that when there is
an instruction miss to block i, fetch it and then fetch block i + 1
• this might be controlled by hardware outside of the cache
• the second block is left in the instruction stream buffer and not
moved into the cache
• if an instruction is part of the instruction stream buffer, then the
cache access is cancelled (and thus a potential miss is cancelled)
– if prefetching is to place multiple blocks into the cache, then
the cache must be non-blocking
• see figure 2.15 on page 110 to see the speedup of many SPEC
2000 benchmarks (mostly FP) with hardware prefetching
(speedup ranges from 1.16 to 1.97)
Compiler Pre-fetching
• The compiler inserts prefetching instructions into the code
(for data prefetching only)
– the original code (first block below) causes 150 misses from array a and 101 from array b
– assume an 8KB direct-mapped cache with 16 byte blocks, and a and b are double arrays
– the code with compiler-inserted prefetch instructions (second block below) has only 19 misses
• see pages 112-113 for an analysis of the misses
Original code:
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];

Code with prefetch instructions:
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);   /* prefetch 7 iterations later */
    prefetch(a[0][j+7]);
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
    }
Victim Cache
• In a direct-mapped cache, consider the following assuming a
cache of 512 entries where block x is not yet in the cache:
– access block x (cache miss)
– access block x + 512 (cache miss, block x discarded)
– access block x (cache miss)
– 3 misses in 3 accesses
• A victim cache is a small, fully associative cache, consulted
on an L1 cache miss
– might store 1-5 blocks
• Victim cache only stores blocks that have been discarded
from the cache on a cache miss
– on a cache miss, examine victim cache and request from next
level
– on victim cache hit, cancel request to next level, exchange item
in victim cache and item in same line in direct-mapped cache
Continued
• Hit time = 1
• Miss rate decreases slightly because we have a little associativity
• Miss penalty:
– if the item is in the victim cache, the miss penalty is a little more than 2 (not exactly 2 because the victim cache is associative)
– if it is not in the victim cache, the miss penalty remains the same (we had already started the request to the next level)
– on a hit in the victim cache, we swap the items between L1 and the victim cache, so the L1 cache either must be non-blocking or it causes a 1-2 cycle stall (depending on how many cycles the exchange takes)
Note: we are skipping HBM (high bandwidth memory) as L4 caches, see p. 114-117
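A minimal C sketch of the victim-cache lookup just described (the structure and field names are illustrative, not from the text):

#include <stdbool.h>
#include <stdint.h>

#define L1_LINES       512
#define VICTIM_ENTRIES 4                 /* small, fully associative */

struct line { bool valid; uint64_t block_addr; /* data omitted */ };

static struct line l1[L1_LINES];         /* direct-mapped L1 */
static struct line victim[VICTIM_ENTRIES];

/* returns true on an L1 or victim-cache hit for the given block address */
bool lookup(uint64_t block_addr) {
    unsigned index = block_addr % L1_LINES;          /* direct-mapped index */

    if (l1[index].valid && l1[index].block_addr == block_addr)
        return true;                                 /* ordinary L1 hit     */

    for (int i = 0; i < VICTIM_ENTRIES; i++)         /* search every entry  */
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            /* victim hit: cancel the request already sent to the next level
               and swap this block with the conflicting block in L1 */
            struct line tmp = l1[index];
            l1[index] = victim[i];
            victim[i] = tmp;
            return true;
        }

    return false;  /* true miss: the line later displaced from l1[index]
                      will be placed into the victim cache */
}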
Summary
(Note: power consumption is omitted; + means the technique improves that factor, - means it hurts it; * in the comments means widely used)

• Larger block sizes: miss penalty -, miss rate +; hardware complexity 0; trivial
• Larger caches: miss rate +, hit time -; complexity 1; *, especially L2 caches
• Higher associativity: miss rate +, hit time -; complexity 1; *
• Multilevel caches: miss penalty +; complexity 2; *, costly, difficult if L1 block size != L2 block size
• Read miss over write priority: miss penalty +; complexity 1; *
• No address translation: hit time +; complexity 1; *
• Small/simple caches: miss rate -, hit time +; complexity 0; *, trivial
• Way prediction: hit time +; complexity 1; Pentium IV
• Pseudo-associative cache: miss rate +; complexity 2; found in RISC
Continued
• Trace cache: hit time +; complexity 3; Pentium IV
• Pipelined accesses: bandwidth +, hit time -; complexity 1; *
• Nonblocking caches: miss penalty +, bandwidth +; complexity 3; used with all OOC
• Multibanked caches: bandwidth +; complexity 1; for L2 caches
• Early restart: miss penalty +; complexity 2; *
• Merging write buffer: miss penalty +; complexity 1; *, for write-through caches
• Compiler techniques: miss rate +; complexity 0; software is challenging
• Hardware prefetching: miss penalty +, miss rate +; complexity 2 (instr) or 3 (data); * for instr, P4 & Opteron for data
• Compiler prefetching: miss penalty +, miss rate +; complexity 3; *, needs non-blocking cache
• Victim caches: miss penalty +, miss rate +; complexity 2; AMD Athlon
• HBM as L4 cache: miss penalty -, bandwidth +/-, miss rate +; complexity 3; varies
Opteron Data Cache
• 2-way set associative cache
• 64 KBytes, 64 bytes per line
• 512 sets (512 lines per way)
• CPU address: 40 bits
– 25 bit tag
– 9 bit block (set) number
– 6 bit byte number
• Write-back cache, so a dirty bit is needed in addition to the valid bit
• LRU replacement strategy
• Victim buffer – discussed later
(Figure: tag comparisons, with a MUX to select one block on a tag match)
VM: Review
• Backing up DRAM with swap space on hard disk
– requires page number → frame number translation
– use TLB (L1 cache) and page table (DRAM)
• Concerns
– address translation doubles (at least) access time because of
two cache accesses
• store items in L1 (and maybe L2/L3) cache by virtual address
rather than physical address (complex, may or may not be
desirable)
– multitasking requires additional mechanisms to ensure no
memory violation
• especially if pages are shared between processes/threads
• if a single page table for all processes then page table must encode
who can access the given page
– textbook spends several subsections on VMs, we are skipping this
Comparing Cache and VM
Parameter First-level cache VM
Block/page size 16-128 bytes 4K-64KB
Hit time 1-3 clock cycles 100-200 clock cycles
Miss penalty 8-200 clock cycles 1,000,000+ clock cycles
access time 6-160 clock cycles 800,000+ clock cycles
transfer time 2-40 clock cycles 200,000+ clock cycles
Miss rate .1%-10% .00001%-.001%
Address mapping 25-45 bit physical address 32-64-bit virtual address to
to 14-20-bit cache address 25-45-bit physical address

Miss rates for virtual memory (DRAM backed by disk) need to be very small or else the miss penalty has far too great an impact on performance
Fast Address Translation
• Recall we want to avoid address translation
– instruction and data caches usually store items by virtual addresses rather
than physical addresses which brings about two problems
• must ensure proper protection that the virtual address is legal for this
process
• on a cache miss, we still need to perform address translation
• Address translation uses the page table
– resides in memory
– very large page tables could, at least partially, be stored in swap space
– use a TLB (on-chip cache) to cache paging information to reduce impact
of page table access, but TLB is usually an associative cache
• virtual address is the tag
• frame number is the datum
• TLB entry also contains a protection bit (is this page owned by this
process?) and a dirty bit (is the page dirty, not is the TLB entry dirty?)
– using associative cache slows the translation process down even more
(beyond a second cache access) but there’s not much we can do about it!
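A minimal C sketch of a fully associative TLB entry and lookup as described above (the virtual page number is the tag, the frame number is the datum; the field and function names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid;
    bool     dirty;        /* is the page dirty (not the TLB entry)?       */
    uint8_t  protection;   /* does this process have the right to access?  */
    uint64_t vpn;          /* virtual page number = the tag                */
    uint64_t frame;        /* physical frame number = the datum            */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* returns true on a TLB hit and fills in *frame; a miss means walking the
   page table in DRAM (and possibly taking a page fault) */
bool tlb_lookup(uint64_t vpn, uint64_t *frame) {
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* hardware checks all entries in
                                                 parallel; this loop is the
                                                 sequential equivalent */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return true;
        }
    }
    return false;
}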
Selecting a Page Size
• Larger page sizes
– smaller page table size (fewer overall pages for a process)
– transferring more data at a time is a more effective use of
accessing a hard disk
– fewer pages also means more entries in the TLB
(proportionally)
• Smaller page sizes
– less potential wasted space (smaller fragments)
– less time to load a page from disk (a smaller miss penalty, although each transfer moves less data and so uses the disk less effectively)
• Some CPUs today support multiple page sizes based
on the program size
– we prefer small pages if the program is smaller
Implementing a Memory Hierarchy
• Autonomous instruction fetch units
– with out-of-order execution, we remove the IF
stage from the pipeline and move it to a separate IF
unit
• IF unit accesses instruction cache to fetch an entire
block at a time
• move fetched instructions into a buffer (to be decoded
one at a time, and flushed if a branch had been taken)
– miss rates become harder to compute because multiple instructions are being fetched at a time; within a single fetch, some instructions may hit while others miss
• similarly, prefetching data may result in partial misses
Continued
• Speculation
– causes a previously fetched instruction to be executed
before the hardware has computed whether a branch
should have been taken or not
• Need memory support for speculation
– proper protection must be implemented
• a memory violation arising when a fetched instruction should not have been fetched due to mis-speculation should not cause an exception
• to maintain precise exceptions, the memory violation must be postponed
– details on protection can be found in appendix B, pages B-49 – B-54
– a cache miss after mis-speculation throws off the miss rate
• because the instruction should never have been fetched
Cache Coherence
• With multi-core processors (or any parallel
processor), we now have a new concern
– core 1 has loaded datum x from memory into its L1
cache, which is a write-back cache
– core 1 increments x in its cache, but not yet to memory
– core 2 now loads datum x from memory into its L1 cache
– core 2 now has a stale (out-of-date) version of x
• this cannot be allowed
– we resolve this by implementing cache coherence
• we look at this in chapter 5
ARM Cortex-A53 Mem Hierarchy
• Issue up to 2 instructions per cycle
– three TLBs: instruction TLB, data TLB, second level
TLB (L2 version of a TLB)
• instruction and data TLBs are 2-way set associative and can
store up to 10 entries each – 2 cycle penalty on a TLB miss
• second level TLB is 4-way set associative with 512 entries
and a 20 cycle miss penalty (to get to page table in
memory)
• TLBs use critical word first with early restart
– L1 instruction and data caches: 2-way set associative,
64-byte block, 13 cycle miss penalty
– L2 unified cache: 16-way set associative, LRU, 124
cycle miss penalty
• see figure 2.20 on page 131 if you want to be confused!
Memory Performance
• Tested on SPECInt2006 benchmarks
• 32KB L1 caches, 1MB L2
– miss rate for L1 was mostly under 5%
• reached 10% on two and 35%+ on mcf
– miss rate for L2 was under 1% for most
– miss penalty was under 1 cycle for most
benchmarks
• 2-5 for 3 others, 16 for mcf
Intel Core i7 6700 Mem Hierarchy
• Out of order execution processor, four cores
• We focus on the memory hierarchy of a single core
– each core can issue up to 4 instructions at a time
– 16-stage pipeline, dynamically scheduled
• up to 3 memory banks can be accessed simultaneously
– 48-bit virtual memory → 36-bit physical memory
– 3-level cache hierarchy (but can support a fourth level
using high-bandwidth memory)
• L1 indexed by virtual address, L2 & L3 use phys addr
– this led to an oddity in that the L2 could cache an instruction both by
virtual address (when backing up L1) and physical address
• the TLB and L1 caches are accessed simultaneously to reduce the impact of a cache miss requiring address translation
– two TLBs (instruction, data) are 8-way associative
More
• Here are a few more details
– L1 instruction cache is divided into 8 banks for simultaneous
accesses
– L2 cache uses write-back with a 10-entry merging write
buffer
– hardware prefetching performed on both L1 and L2
• usually just the next block after the one being requested
• Performance is hard to determine
– OOC nature of the processor
– separate instruction-fetch unit (which attempts to fetch 16
bytes at a time)
– L1 miss rate using prefetching ranged between 1% and 22%
for SPEC2006Int benchmarks, mostly 1-3%
– instruction fetch unit is stalled less than 2% of the time for
most programs, up to 12% for one
Fallacies and Pitfalls
• P: predicting cache performance from one program
to another
– cache performance can vary dramatically because of the
nature of a program’s branches, use of arrays, local
variables, pointers, etc
• P: not delivering high memory bandwidth to the
cache
– moving too little at a time causes cache performance to
degrade
• P: having too small an address space
– not providing enough bits for either a physical or virtual
address (this limitation can arise because of the size of
address registers and the address bus)
