Memory Hierarchy
Module 2, Part 2
Memory Hierarchy
https://play.google.com/books/reader?id=grz-CwAAQBAJ&pg=GBS.PT66
Memory Hierarchy Technology
• Memory Hierarchy
– The goal of the memory hierarchy is to keep the contents that are needed now at or near the top of the hierarchy
– Parameters
• Access time
• Memory size
• Cost per byte
• Transfer bandwidth
• Unit of transfer
– Properties
• Inclusion
• Coherence
• Locality
Memory Hierarchy Technology
▪ Spatial: the tendency of a process to access items whose addresses are near one another
(e.g., elements of an array, or subroutines and macros stored in nearby locations).
Thus if location M is referenced at time t, then a nearby location M ± Δm is likely to be
referenced at some time t + Δt.
Memory Design Implications (Each type of locality affects design of memory hierarchy)
One implication of locality is that data and instructions should be kept in separate
caches. The main advantage of separate caches is that instructions and operands can be
fetched simultaneously. (This is the design basis of the Harvard architecture.)
Locality of Reference
Locality Example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
  – Array elements are referenced in succession (a stride-1 reference pattern)
  – sum is referenced on each iteration
• Instructions
  – Instructions are referenced in sequence
  – The loop is cycled through repeatedly
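The effect of stride on spatial locality can be made concrete in C. The sketch below (the array size N is an arbitrary choice, not from the slides) sums the same 2D array twice: the row-major loop is a stride-1 pattern, while the column-major loop jumps a full row between accesses and so touches a new cache block almost every time:

    #include <stdio.h>

    #define N 1024            /* arbitrary array size for illustration */

    static int a[N][N];

    /* Stride-1 (row-major) traversal: consecutive addresses, good spatial locality */
    long sum_row_major(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Stride-N traversal: N*sizeof(int) bytes between accesses, poor spatial locality */
    long sum_col_major(void) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }

Both functions compute the same sum; on a typical machine the row-major version runs noticeably faster purely because of cache behavior.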
[Figure: two adjacent levels of the hierarchy. The processor exchanges data with the inner (upper) level memory, which holds Blk X; on a miss, Blk Y is brought in from the outer (lower) level memory.]
Memory Capacity Planning
• Miss: the data being accessed is not found at this level, so the next level of the
hierarchy must be examined
e.g., the data must be retrieved from a block in the outer level (Blk Y)
–Miss rate (or miss ratio): the fraction of all memory accesses that miss
–Miss rate = 1 – hit rate (equivalently, hit rate = 1 – miss rate)
–Miss penalty: time to access the next level
– (time to replace a block in the inner level + time to deliver the block to the
processor)
• Hit time << Miss penalty (a miss can cost ~500 instruction slots on the Alpha 21264!)
• Note that speculative and multithreaded processors may execute other
instructions during a miss, which reduces the performance impact of misses
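As a rough illustration of these definitions, here is a toy direct-mapped cache simulation (the geometry, 256 lines of 16 words each, and the sequential address stream are made up for the example). A sequential stream misses once per block, so the reported miss rate is 1/16:

    #include <stdio.h>

    #define LINES 256    /* illustrative cache geometry */
    #define BLOCK 16     /* words per line */

    int main(void) {
        long tags[LINES];
        for (int i = 0; i < LINES; i++) tags[i] = -1;   /* cold (empty) cache */

        long hits = 0, misses = 0;
        for (long addr = 0; addr < 100000; addr++) {    /* sequential word addresses */
            long line = (addr / BLOCK) % LINES;
            long tag  =  addr / (BLOCK * LINES);
            if (tags[line] == tag) hits++;
            else { misses++; tags[line] = tag; }        /* miss: fetch block from outer level */
        }
        printf("miss rate = %.4f, hit rate = %.4f\n",
               (double)misses / (hits + misses),
               (double)hits / (hits + misses));
        return 0;
    }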
Memory Capacity Planning
• Average memory-access time (AMAT) = effective access time
= Hit time + Miss rate × Miss penalty (in ns or clock cycles)
• That is, a memory access on average takes the time to access the cache, plus,
on a miss, the time it takes to access the next level
• With a 2-level cache, the effective access time is:
• Average memory access time =
hit time0 + miss rate0 × (hit time1 + miss rate1 × miss penalty1)
• Including the impact of paging:
• effective access time = hit time0 + miss rate0 × (hit time1 + miss rate1 × (hit
time2 + miss rate2 × miss penalty2))
• Level 0 is the on-chip cache
• Level 1 is the off-chip cache
• Level 2 is main memory
• Level 3 is disk (miss penalty2 is the disk access time, which is lengthy)
• Access time: time to access the outer level = f(latency to outer level)
• Transfer time: time to transfer the block = f(bandwidth between upper & lower levels)
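The nested formula folds up naturally from the innermost level outward, which makes it easy to compute for any depth. A minimal C sketch, using the hit times and miss rates that appear in the worked example later in this section:

    #include <stdio.h>

    /* Effective access time: T = t0 + m0*(t1 + m1*(t2 + m2*penalty)),
       folded from the innermost (outermost-level) term outward. */
    double amat(const double hit_time[], const double miss_rate[], int levels,
                double final_miss_penalty) {
        double t = final_miss_penalty;              /* start with the last-level penalty */
        for (int i = levels - 1; i >= 0; i--)
            t = hit_time[i] + miss_rate[i] * t;     /* wrap one level around it */
        return t;
    }

    int main(void) {
        /* values from the worked example in this section */
        double hit_time[]  = { 5.0, 10.0, 60.0 };   /* ns: on-chip, off-chip, main memory */
        double miss_rate[] = { 0.10, 0.04, 0.002 };
        printf("AMAT = %.2f ns\n", amat(hit_time, miss_rate, 3, 10000.0 /* disk, ns */));
        return 0;
    }

This prints 6.32 ns, matching the example below.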
Memory Capacity Planning
• Access frequency to Mi: fi = hi · (1 − h1)(1 − h2) ··· (1 − hi−1), where hi is the
hit ratio at level Mi (the frequencies sum to 1: Σ fi = 1)
Hierarchy Optimization
• The total cost of a memory hierarchy is estimated as
Ctotal = Σ (i = 1..n) ci · si, where ci is the cost per byte and si the capacity of Mi
• An optimal design should have Teff ≈ t1 (the access time of M1) at a total cost ≈ the
cost of Mn. The optimization is a linear programming problem: given a ceiling C0 on
total cost, minimize the effective access time
Teff = Σ (i = 1..n) fi · ti, subject to si > 0, ti > 0, and Ctotal < C0
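A quick numeric check of Teff under assumed hit ratios and access times (the values below are illustrative, not from the slides; the last level is assumed to always hit):

    #include <stdio.h>

    int main(void) {
        double h[] = { 0.95, 0.99, 1.0 };     /* hit ratios hi; last level always hits */
        double t[] = { 1.0, 10.0, 100.0 };    /* access times ti in ns */
        double teff = 0.0, reach = 1.0;       /* reach = (1-h1)...(1-h_{i-1}) */
        for (int i = 0; i < 3; i++) {
            double f = reach * h[i];          /* access frequency fi to level Mi */
            teff += f * t[i];
            reach *= (1.0 - h[i]);
        }
        printf("Teff = %.3f ns\n", teff);     /* 1.495 ns for these values */
        return 0;
    }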
Memory Capacity Planning
CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time
• Access time
= hit time0 + miss rate0 × (hit time1 + miss rate1 × (hit time2 +
miss rate2 × miss penalty2))
= 5 ns + 0.10 × (10 ns + 0.04 × (60 ns + 0.002 × 10,000 ns))
= 6.32 ns
– So the memory hierarchy adds about 26% to the 5 ns on-chip cache
access time
Example
❖ A machine has a CPI of 1.0 at 5 GHz, a 2% miss rate, and a 100 ns main-memory
access time. Adding a 2nd-level cache with a 5 ns access time decreases the miss
rate to main memory to 0.5%. How much faster is the new configuration?
❑ Ans: Clock cycle time = 1 / frequency = 1 / (5 × 10^9) = 0.2 ns
a. Without L2: memory access cycles = memory access time / clock cycle time
   = 100 ns / 0.2 ns per cycle = 500 clock cycles
   Total CPI = Base CPI + memory stall cycles per instruction
   Total CPI = 1.0 + 2% × 500 = 11.0
b. With L2: L2 access cycles = L2 access time / clock cycle time
   = 5 ns / 0.2 ns per cycle = 25 clock cycles
   Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
   Total CPI = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4.0
Speedup = 11.0 / 4.0 = 2.75 ≈ 2.8
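The same arithmetic, checked in C with the values from the example (note that 11.0/4.0 is exactly 2.75, which the slide rounds up to 2.8):

    #include <stdio.h>

    int main(void) {
        double cycle_ns   = 1.0 / 5.0;                /* 5 GHz -> 0.2 ns cycle */
        double mem_cycles = 100.0 / cycle_ns;         /* 500 cycles to main memory */
        double l2_cycles  = 5.0 / cycle_ns;           /* 25 cycles to L2 */

        double cpi_no_l2  = 1.0 + 0.02 * mem_cycles;  /* 1 + 0.02*500 = 11.0 */
        double cpi_l2     = 1.0 + 0.02 * l2_cycles    /* primary stalls: 0.5 */
                                + 0.005 * mem_cycles; /* secondary stalls: 2.5 */

        printf("CPI without L2 = %.1f, with L2 = %.1f, speedup = %.2f\n",
               cpi_no_l2, cpi_l2, cpi_no_l2 / cpi_l2);
        return 0;
    }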
Virtual Memory
• Just as DRAM (main memory) acts as a backing store for the cache, the hard disk
(specifically, the swap space) acts as a backing store for DRAM
• This arrangement is known as virtual memory
–Virtual memory is necessary because most programs are too large to
store entirely in memory
• Also, there are parts of a program that are not used very often, so why
waste the time loading those parts into memory if they won’t be used?
–Page – a fixed-size unit of memory – all programs and data are broken
into pages
–Paging – the process of bringing in a page when it is needed (this might
require throwing a page out of memory, moving it back to the swap disk)
• The operating system is in charge of virtual memory: it moves needed pages from
disk into memory and keeps track of where each page is placed
The Paging Process
• When the CPU generates a memory address, it is a logical (or virtual)
address
– The first address of a program is 0, so a logical address is merely an offset
into the program or into the data segment
• For instance, address 25 is located 25 bytes from the beginning of the program
• But 25 is not the physical address in memory, so the logical address must
be translated (or mapped) into a physical address
– Assume memory is broken into fixed-size units known as frames (1 page
fits into 1 frame)
• The logical address is viewed as a page # plus an offset into that page
– We have to translate the page # into a frame # (that is, where is that
particular page currently stored in memory, or is it even in memory?)
• Thus, the mapping process for paging means finding the frame # and
replacing the page # with it
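A minimal sketch of this translation in C, assuming 4 KB pages and a made-up page-to-frame table: the page # (high bits) indexes the table and is swapped for a frame #, while the offset (low bits) passes through unchanged:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12                  /* assume 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* A real page-table entry also carries a valid bit (is the page in
       memory at all?); that check is omitted in this sketch. */
    uint32_t translate(uint32_t vaddr, const uint32_t frame_of_page[]) {
        uint32_t page   = vaddr >> PAGE_BITS;        /* high bits: page # */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* low bits: unchanged offset */
        return (frame_of_page[page] << PAGE_BITS) | offset;
    }

    int main(void) {
        uint32_t frame_of_page[] = { 7, 3, 0, 5 };   /* made-up page# -> frame# map */
        uint32_t va = 0x1A2C;                        /* page 1, offset 0xA2C */
        printf("VA 0x%X -> PA 0x%X\n", (unsigned)va,
               (unsigned)translate(va, frame_of_page));
        return 0;
    }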
Example of Paging
Example address: binary address 1010 is page 101, item (offset) 0
Here, we have 13 bits for our addresses even though main memory is only 4K = 2^12
The Full Paging Process
If every memory access now requires first accessing the page table, which is itself in
memory, every access is slowed down.
So we move the most heavily used portion of the page table into a special cache known
as the Translation Lookaside Buffer (TLB).
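A minimal software model of a TLB (the size, fields, and round-robin replacement below are illustrative; a real TLB compares all entries in parallel in hardware, and the page-table walk is stubbed out):

    #include <stdio.h>
    #include <stdint.h>

    #define TLB_ENTRIES 16

    struct tlb_entry { uint32_t page, frame; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];
    static int next_victim;                          /* simple round-robin replacement */

    /* Stub for the slow path: a real walk reads the page table in memory. */
    static uint32_t walk_page_table(uint32_t page) { return page + 100; }

    static uint32_t lookup_frame(uint32_t page) {
        for (int i = 0; i < TLB_ENTRIES; i++)        /* hardware compares in parallel */
            if (tlb[i].valid && tlb[i].page == page)
                return tlb[i].frame;                 /* TLB hit: page table untouched */
        uint32_t frame = walk_page_table(page);      /* TLB miss: slow path */
        tlb[next_victim] = (struct tlb_entry){ page, frame, 1 };
        next_victim = (next_victim + 1) % TLB_ENTRIES;
        return frame;
    }

    int main(void) {
        printf("frame = %u\n", lookup_frame(5));     /* miss: walks the page table */
        printf("frame = %u\n", lookup_frame(5));     /* hit: served from the TLB */
        return 0;
    }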
A Variation: Segmentation
• One flaw of paging is that, because a page is fixed in size, a chunk of code might
be divided into two or more pages
– So page faults can occur at any time
• Consider, as an example, a loop which crosses 2 pages
• If the OS must remove one of the two pages to load the other, then the OS
generates 2 page faults for each loop iteration!
• A variation of paging is segmentation
– instead of fixed-size blocks, programs are divided into variable-sized procedural
units, each as large as the code or data it contains
• We subdivide programs into procedures
• We subdivide data into structures (e.g., arrays, structs)
– We still use the “on-demand” approach of virtual memory, but when a block of
code is needed, the entire block is loaded into memory at once
• Segmentation uses a segment table instead of a page table and works similarly
although addresses are put together differently
• But segmentation causes fragmentation: when a segment is discarded from memory
to make room for a new one, a chunk of memory may be left over that goes unused
• One solution to fragmentation is to combine paging with segmentation
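A sketch of segment-table translation in C (table contents are made up). Unlike a page, a segment has a variable length, so the offset must be bounds-checked against the segment's limit:

    #include <stdio.h>
    #include <stdint.h>

    struct segment { uint32_t base, limit; };

    int translate_seg(const struct segment st[], uint32_t seg, uint32_t off,
                      uint32_t *paddr) {
        if (off >= st[seg].limit) return -1;       /* out of bounds: segmentation fault */
        *paddr = st[seg].base + off;               /* base + offset, no fixed block size */
        return 0;
    }

    int main(void) {
        struct segment st[] = { { 0x8000, 0x1000 }, { 0xC000, 0x0200 } };
        uint32_t pa;
        if (translate_seg(st, 1, 0x10, &pa) == 0)
            printf("PA = 0x%X\n", (unsigned)pa);   /* 0xC010 */
        return 0;
    }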
Virtual Address Translation schemes
Address Translation using TLB and PTs
Inverted Address Mapping
Paging vs Segmentation
Page Replacement Algorithms
FIFO is not a stack algorithm. In certain cases, the number of page faults can
actually increase when more frames are allocated to the process (Belady’s anomaly).
In the example, there are 9 page faults with 3 frames and 10 page faults with 4 frames.
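This can be verified with a small FIFO simulator. The reference string below is the classic Belady anomaly string, used here as a stand-in since the slide's own example figure is not reproduced; it gives 9 faults with 3 frames and 10 with 4:

    #include <stdio.h>

    /* Count FIFO page faults for a reference string (supports up to 16 frames). */
    int fifo_faults(const int refs[], int n, int frames) {
        int mem[16], head = 0, used = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (mem[j] == refs[i]) { hit = 1; break; }
            if (hit) continue;
            faults++;
            if (used < frames) mem[used++] = refs[i];                 /* free frame */
            else { mem[head] = refs[i]; head = (head + 1) % frames; } /* evict oldest */
        }
        return faults;
    }

    int main(void) {
        int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        int n = sizeof refs / sizeof refs[0];
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* 9 */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
        return 0;
    }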
Examples - Page Replacement algorithms
• Given page reference string:
• 1,2,3,4,2,1,5,6,2,1,2,3,7,6,3,2,1,2,3,6
(Clock replacement figure: an asterisk indicates that the corresponding reference/use
bit is set to 1; the arrow indicates the current position of the pointer.)
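A sketch of the clock (second-chance) algorithm that the figure depicts, with a use bit per frame (the asterisk) and a rotating pointer (the arrow); the frame count and reference stream are illustrative:

    #include <stdio.h>

    #define FRAMES 4

    int main(void) {
        int page[FRAMES], use[FRAMES], valid[FRAMES] = {0};
        int hand = 0, faults = 0;                       /* hand = the rotating pointer */
        int refs[] = { 1, 2, 3, 4, 1, 5, 2, 6 };

        for (int i = 0; i < (int)(sizeof refs / sizeof refs[0]); i++) {
            int hit = 0;
            for (int j = 0; j < FRAMES; j++)
                if (valid[j] && page[j] == refs[i]) { use[j] = 1; hit = 1; break; }
            if (hit) continue;
            faults++;
            while (valid[hand] && use[hand]) {          /* second chance: clear bit, move on */
                use[hand] = 0;
                hand = (hand + 1) % FRAMES;
            }
            page[hand] = refs[i]; use[hand] = 1; valid[hand] = 1;
            hand = (hand + 1) % FRAMES;                 /* advance past the new page */
        }
        printf("%d faults\n", faults);
        return 0;
    }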
Cost of Misses, CPU time
Assume 75% instruction accesses, 25% data accesses