
COMPUTER SYSTEM ARCHITECTURE - CS 405

Module - 2 Part – 2

Memory Hierarchy

https://play.google.com/books/reader?id=grz-CwAAQBAJ&pg=GBS.PT66
Memory Hierarchy Technology

• Memory Hierarchy
  – The goal of a memory hierarchy is to keep the contents that are needed now at or near the top of the hierarchy
  – Parameters
    • Access time
    • Memory size
    • Cost per byte
    • Transfer bandwidth
    • Unit of transfer
  – Properties
    • Inclusion
    • Coherence
    • Locality
Memory Hierarchy Technology

❑ Memory Hierarchy Parameters

• T(i): Access time (round-trip time from CPU to i-th level memory)
  – T(i-1) < T(i) < T(i+1)
• S(i): Memory size (number of bytes or words in i-th level memory)
  – S(i-1) < S(i) < S(i+1)
• C(i): Cost per byte (per-byte cost of i-th level memory;
  total cost of level i estimated by C(i)*S(i))
  – C(i-1) > C(i) > C(i+1)
• B(i): Transfer bandwidth (rate at which information is transferred between adjacent levels)
  – B(i-1) > B(i) > B(i+1)
• X(i): Unit of transfer (grain size for data transfer between levels i and i+1)
  – X(i-1) < X(i) < X(i+1)
Memory Hierarchy Technology
❑ Memory Hierarchy Properties
❖ Inclusion Property
  o Implies that all information items are originally stored in the outermost level Mn. During processing, subsets of Mn are copied into Mn-1; similarly, subsets of Mn-1 are copied into Mn-2, and so on.
  o If a value is found at one level, it must also be present at all of the levels outer to it.
    – M(i-2) is a subset of M(i-1), which is a subset of M(i)
❖ Coherence Property
  – Copies of the same information item at successive memory levels must be consistent. If a word is modified in the cache, copies of that word must be updated, immediately or eventually, at all outer levels.
  o Coherence strategies (a minimal sketch contrasting the two appears below)
    ▪ Write-through
      As soon as a data item in Mi is modified, an immediate update of the corresponding data item(s) in Mi+1, Mi+2, ..., Mn is required. This is the most aggressive (and expensive) strategy.
    ▪ Write-back
      The update of the data item in Mi+1 corresponding to a modified item in Mi is deferred until the block or page in Mi that contains it is replaced or removed. This is the most efficient approach, but it cannot be used (without modification) when multiple processors share Mi+1, Mi+2, ..., Mn.
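A minimal C sketch contrasting the two strategies for a single cached item. The cache_line structure, the mem[] array standing in for the outer level Mi+1, and all function names are illustrative assumptions, not anything defined on the slides:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical single cache line in level Mi; mem[] below simulates
       the outer level Mi+1. */
    typedef struct {
        uint32_t addr;
        uint32_t data;
        bool     dirty;              /* used only by the write-back policy */
    } cache_line;

    static uint32_t mem[1024];       /* simulated outer level Mi+1 */

    static void next_level_write(uint32_t addr, uint32_t data) {
        mem[addr % 1024] = data;     /* stand-in for a store to Mi+1 */
    }

    /* Write-through: every write is propagated to Mi+1 immediately,
       so the outer copy is always consistent (aggressive, expensive). */
    void write_through(cache_line *line, uint32_t data) {
        line->data = data;
        next_level_write(line->addr, data);
    }

    /* Write-back: only mark the line dirty; Mi+1 is brought up to date
       when the line is replaced or removed. */
    void store_write_back(cache_line *line, uint32_t data) {
        line->data = data;
        line->dirty = true;          /* outer copy is now stale */
    }

    void evict_write_back(cache_line *line) {
        if (line->dirty) {
            next_level_write(line->addr, line->data);
            line->dirty = false;
        }
    }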
Memory Hierarchy Technology
❖ Locality of Reference
▪ Temporal: recently referenced items are likely to be referenced again in the near future (loop iterations, process stacks, temporary variables, subroutines).
Thus if location M is referenced at time t, then location M will be referenced again at some time t + Δt.

▪ Spatial: the tendency of a process to access items whose addresses are near one another (elements of an array, subroutines or macros stored in nearby locations).
Thus if location M is referenced at time t, then another location M ± Δm will be referenced at some time t + Δt.

▪ Sequential: execution of instructions follows the sequential program order unless branch instructions create out-of-order execution.
Thus if location M is referenced at time t, then locations M+1, M+2, ... will be referenced at some times t + Δt, t + Δt′, etc.

Memory Design Implications (each type of locality affects the design of the memory hierarchy)
One implication of locality is that data and instructions should be held in separate data and instruction caches. The main advantage of separate caches is that one can fetch instructions and operands simultaneously. (This is the design basis of the Harvard architecture.)
Locality of Reference

Locality Example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern)
  – Reference sum on each iteration

• Instructions
  – Reference instructions in sequence
  – Cycle through the loop repeatedly

❖ The famous 90/10 rule, which comes from empirical observation: "A program spends 90% of its time in 10% of its code."
Memory Hierarchy
Memory Capacity Planning
• Hit: the data being accessed is found at the current level
  e.g., the data appears in some block in the inner level (example: Block X)
  – Hit Rate or Hit Ratio (hi): the fraction of all memory accesses that are hits,
    i.e., the fraction of memory accesses where the data is found in the inner level
  – The hit ratio hi at Mi is the probability that the data will be found in Mi
  – Hit Time: the time to access that level of the hierarchy,
    i.e., the time to access the inner level, which consists of
    (RAM access time + time to determine whether the access is a hit or a miss)
[Figure: block transfer between the inner-level memory (Blk X) and the outer-level memory (Blk Y), to and from the processor]
Memory Capacity Planning
• Miss: the data being accessed is not found at the current level, and the next level of the hierarchy must be examined
  e.g., the data must be retrieved from a block in the outer level (Block Y)
  – Miss Rate or Miss Ratio: the fraction of all memory accesses that are misses
  – Miss rate = 1 – hit rate; hit rate = 1 – miss rate
  – Miss Penalty: the time to access the next level
    (time to replace a block in the inner level + time to deliver the block to the processor)
• Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)
• Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses
Memory Capacity Planning
• Average memory-access time = effective access time
  = Hit time + Miss rate × Miss penalty (ns or clocks)
• That is, our memory access, on average, costs the time it takes to access the cache plus, for a miss, the time it takes to access memory
• With a 2-level cache, the effective access time is:
  Average memory access time =
  hit time0 + miss rate0 × (hit time1 + miss rate1 × miss penalty1)
• Including the impact of paging:
  Effective access time = hit time0 + miss rate0 × (hit time1 + miss rate1 × (hit time2 + miss rate2 × miss penalty2))
  (a short C sketch of this nested formula follows below)
• Level 0 is on-chip cache
• Level 1 is off-chip cache
• Level 2 is main memory
• Level 3 is disk (miss penalty2 is the disk access time, which is lengthy)
• Access time: time to access the outer level = f(latency to outer level)
• Transfer time: time to transfer the block = f(BW between upper & lower levels)
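A minimal C sketch of the nested formula above; the function name amat3 and the parameter layout are ours, not the slides'. Plugging in the numbers from the second worked example later in this section reproduces its 6.32 ns result:

    #include <stdio.h>

    /* Effective access time for a three-level lookup (L0 cache, L1 cache,
       main memory) backed by disk, following
       t_eff = h0 + m0*(h1 + m1*(h2 + m2*penalty2)). */
    double amat3(double hit0, double miss0,
                 double hit1, double miss1,
                 double hit2, double miss2,
                 double penalty2) {
        return hit0 + miss0 * (hit1 + miss1 * (hit2 + miss2 * penalty2));
    }

    int main(void) {
        /* Numbers from the worked example below: 5 ns on-chip cache with a
           10% miss rate, 10 ns off-chip cache with 4% misses, 60 ns main
           memory with 0.2% misses, 10,000 ns disk penalty. */
        double t = amat3(5.0, 0.10, 10.0, 0.04, 60.0, 0.002, 10000.0);
        printf("effective access time = %.2f ns\n", t);  /* 6.32 ns */
        return 0;
    }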
Memory Capacity Planning
Access frequency to Mi

The access frequency to Mi is the probability that a reference is satisfied at level i, i.e., that it missed every inner level and hit Mi:

    f(i) = (1-h1)(1-h2) ... (1-h(i-1)).hi

Assume h0 = 0 and hn = 1, where n is the total number of memory levels.

Effective Access Time (Teff):

    Teff = h1.t1 + (1-h1).h2.t2 + (1-h1)(1-h2).h3.t3 + ... + (1-h1)(1-h2)(1-h3) ... (1-h(n-1)).hn.tn

Hierarchy Optimization
The total cost of a memory hierarchy is estimated as

    Ctotal = C(1).S(1) + C(2).S(2) + ... + C(n).S(n)

An optimal design should have Teff close to t1 of M1 and a total cost close to the cost of Mn. The optimization process is a linear programming problem with a ceiling C0 on the total cost: minimize the effective access time Teff subject to Ctotal < C0.
Memory Capacity Planning

CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time

Memory stall cycles
    = (Memory accesses / Program) × Miss rate × Miss penalty
    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Cost of Misses, CPU time
❖ Example:
Hierarchy = primary cache, secondary cache, main memory
  t1 = access time to primary cache = 5 ns
  t2 = access time to secondary cache = 20 ns
  t3 = access time to main memory = 70 ns
  h1 = hit ratio of primary cache = 0.9
  h2 = hit ratio of secondary cache = 0.95
  h3 = hit ratio of main memory = 1
What is the average access time for this memory hierarchy?
➢ Teff = h1.t1 + (1-h1).h2.t2 + (1-h1)(1-h2).h3.t3 + ... + (1-h1)(1-h2)(1-h3) ... (1-h(n-1)).hn.tn
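Substituting the given values (an evaluation the slide leaves to the reader):

    Teff = 0.9 × 5 + (0.1)(0.95) × 20 + (0.1)(0.05)(1) × 70
         = 4.5 + 1.9 + 0.35
         = 6.75 ns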
❖ Example:
On-chip cache hit rate is 90% with a 5 ns hit time; off-chip cache hit rate is 96% with a 10 ns hit time; main memory hit rate is 99.8% with a 60 ns hit time; memory miss penalty is 10 µs = 10,000 ns.
(The memory miss penalty is the same as the disk hit time, i.e., the disk access time.)

• Access time
  = hit time0 + miss rate0 × (hit time1 + miss rate1 × (hit time2 + miss rate2 × miss penalty2))
  = 5 ns + 0.10 × (10 ns + 0.04 × (60 ns + 0.002 × 10,000 ns))
  = 6.32 ns
  – So the memory hierarchy adds over 26% to the bare 5 ns cache access time
Example
❖ CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns main memory access time. Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to 0.5%. How much faster is the new configuration?
❑ Ans: Clock cycle time = 1/frequency = 1/(5 × 10^9) = 0.2 ns
a. Without L2: number of memory stall cycles per miss = memory access time × frequency
     = 100 ns / (0.2 ns per clock cycle) = 500 clock cycles
   Total CPI = Base CPI + Memory stall cycles per instruction
   Total CPI = 1.0 + 2% × 500 = 11.0
b. With L2: number of L2 access cycles = L2 access time × frequency
     = 5 ns / (0.2 ns per clock cycle) = 25 clock cycles
   Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
   Total CPI = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4.0
   Speedup = 11 / 4 = 2.75 ≈ 2.8
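A quick C check of the same arithmetic (the function name and structure are ours, not the slides'):

    #include <stdio.h>

    /* Total CPI = base CPI + stalls per instruction at each level. */
    double total_cpi(double base, double l1_miss, double l1_penalty,
                     double l2_miss, double l2_penalty) {
        return base + l1_miss * l1_penalty + l2_miss * l2_penalty;
    }

    int main(void) {
        double cycle_ns   = 1.0 / 5.0;           /* 5 GHz -> 0.2 ns cycles   */
        double mem_cycles = 100.0 / cycle_ns;    /* 500-cycle memory penalty */
        double l2_cycles  = 5.0 / cycle_ns;      /* 25-cycle L2 penalty      */

        double without_l2 = total_cpi(1.0, 0.02, mem_cycles, 0.0, 0.0);
        double with_l2    = total_cpi(1.0, 0.02, l2_cycles, 0.005, mem_cycles);
        printf("CPI without L2 = %.1f, with L2 = %.1f, speedup = %.2f\n",
               without_l2, with_l2, without_l2 / with_l2); /* 11.0, 4.0, 2.75 */
        return 0;
    }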
Virtual Memory
• Just as DRAM(main memory) acts as a backup for cache, hard disk
(known as the swap space) acts as a backup for DRAM
• This is known as virtual memory
–Virtual memory is necessary because most programs are too large to
store entirely in memory
• Also, there are parts of a program that are not used very often, so why
waste the time loading those parts into memory if they won’t be used?
–Page – a fixed sized unit of memory – all programs and data are broken
into pages
–Paging – the process of bringing in a page when it is needed (this might
require throwing a page out of memory, moving it back to the swap disk)
• The operating system is in charge of Virtual Memory and it moves
needed pages into memory from disk and keeps track of where a specific
page is placed
The Paging Process
• When the CPU generates a memory address, it is a logical (or virtual) address
  – The first address of a program is 0, so the logical address is merely an offset into the program or into the data segment
    • For instance, address 25 is located 25 bytes from the beginning of the program
    • But 25 is not the physical address in memory, so the logical address must be translated (or mapped) into a physical address
  – Assume memory is broken into fixed-size units known as frames (1 page fits into 1 frame)
    • The logical address is interpreted as a page # plus an offset into the page
  – We have to translate the page # into the frame # (that is, where is that particular page currently stored in memory; or is it even in memory?)
    • Thus, the mapping process for paging means finding the frame # and replacing the page # with it (a minimal sketch of this lookup follows)
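A minimal C sketch of that lookup, assuming 4 KB pages, a flat page table, and an 8-page process; all sizes and names here are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12                    /* assume 4 KB pages/frames */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_PAGES  8

    /* One page-table entry: which frame holds the page, and is it resident? */
    typedef struct { uint32_t frame; bool valid; } pte;
    static pte page_table[NUM_PAGES];

    /* Translate a logical address; returns false on a page fault. */
    bool translate(uint32_t logical, uint32_t *physical) {
        uint32_t page   = logical >> PAGE_SHIFT;      /* high bits: page # */
        uint32_t offset = logical & (PAGE_SIZE - 1);  /* low bits: offset  */
        if (page >= NUM_PAGES || !page_table[page].valid)
            return false;              /* page fault: OS must load the page */
        *physical = (page_table[page].frame << PAGE_SHIFT) | offset;
        return true;
    }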
Example of Paging

Here, we have a process of 8 pages but only 4 physical frames in memory; therefore we must place a page into one of the available frames in memory whenever that page is needed.
At this point in time, pages 0, 3, 4 and 7 have been moved into memory, at frames 2, 0, 1 and 3 respectively.
This information (which page is stored in which frame) is kept in memory in a structure known as the Page Table. The page table also stores status bits for each page, such as a valid bit (is the page resident?) and a dirty bit (has the page been modified?), much like our cache.
A More Complete Example

[Figure: a virtual address mapped to a physical address through the page table]

Virtual address 1010 is page 101, item (offset) 0.
Page 101 (5) is located in frame 11 (3), so item 1010 is found at physical address 110.

[Figure: logical and physical memory for our program]


Page Faults
• Just as cache is limited in size, so is main memory; a process is usually given a limited number of frames
• What if a referenced page is not currently in memory?
  – The memory reference causes a page fault
• The page fault requires that the OS handle the problem (sketched below)
  – The process's status is saved and the CPU switches to the OS
  – The OS determines if there is an empty frame for the referenced page; if not, then the OS uses a replacement strategy to select a page to discard
    • If that page is dirty, then the page must be written back to disk rather than simply discarded
  – The OS locates the requested page on disk and loads it into the appropriate frame in memory
  – The page table is modified to reflect the change
• Page faults are time consuming because of the disk access; this causes our effective memory access time to deteriorate badly!
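A self-contained C sketch of those steps as a small simulation. The FIFO victim choice, the printf stand-ins for disk I/O, and all names are our assumptions, not the slides':

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PAGES  8
    #define NUM_FRAMES 4

    /* Simulated page table and frame state; a real OS would also talk
       to the disk, which we reduce to printf stubs here. */
    typedef struct { int frame; bool valid; } pte;
    static pte  page_table[NUM_PAGES];
    static int  frame_to_page[NUM_FRAMES];
    static bool frame_used[NUM_FRAMES], frame_dirty[NUM_FRAMES];
    static int  next_victim = 0;                /* trivial FIFO policy */

    static void handle_page_fault(int page) {
        int frame = -1;
        for (int f = 0; f < NUM_FRAMES; f++)    /* 1. look for an empty frame */
            if (!frame_used[f]) { frame = f; break; }
        if (frame < 0) {                        /* 2. none: evict a victim */
            frame = next_victim;
            next_victim = (next_victim + 1) % NUM_FRAMES;
            int victim = frame_to_page[frame];
            if (frame_dirty[frame])             /* dirty: must swap out */
                printf("write page %d back to disk\n", victim);
            page_table[victim].valid = false;
        }
        printf("load page %d from disk into frame %d\n", page, frame);
        page_table[page].frame = frame;         /* 3. update the page table */
        page_table[page].valid = true;
        frame_used[frame] = true;
        frame_to_page[frame] = page;
        frame_dirty[frame] = false;
    }

    int main(void) {
        int refs[] = {0, 3, 4, 7, 1};   /* fifth reference forces an eviction */
        for (int i = 0; i < 5; i++)
            if (!page_table[refs[i]].valid)
                handle_page_fault(refs[i]);
        return 0;
    }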
Another Paging Example

Here, we have 13 bits for our addresses even though main memory is only 4K = 2^12
The Full Paging Process

If every memory access now requires first accessing the page table, which is itself in memory, it slows down our computer.

So we move the most heavily used portion of the page table into a special cache known as the Translation Lookaside Buffer (TLB); a small sketch of the lookup follows.
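A minimal C sketch of a TLB consulted before the page table. The tiny fully associative, round-robin design and the identity-mapped page_table_lookup stub are our assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 4

    /* A tiny fully associative TLB holding recent page -> frame mappings. */
    typedef struct { uint32_t page, frame; bool valid; } tlb_entry;
    static tlb_entry tlb[TLB_ENTRIES];
    static int tlb_next = 0;                      /* round-robin refill slot */

    /* Stand-in for the slow page-table walk (a real one reads memory);
       the identity mapping here is purely for illustration. */
    static uint32_t page_table_lookup(uint32_t page) { return page; }

    uint32_t page_to_frame(uint32_t page) {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].page == page)
                return tlb[i].frame;              /* TLB hit: no memory access */
        uint32_t frame = page_table_lookup(page); /* TLB miss: walk page table */
        tlb[tlb_next] = (tlb_entry){ page, frame, true }; /* cache translation */
        tlb_next = (tlb_next + 1) % TLB_ENTRIES;
        return frame;
    }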
A Variation: Segmentation
• One flaw of paging is that, because a page is fixed in size, a chunk of code might be divided across two or more pages
  – So page faults can occur at any time
    • Consider, as an example, a loop which crosses 2 pages
    • If the OS must remove one of the two pages to load the other, then it generates 2 page faults for each loop iteration!
• A variation of paging is segmentation
  – Instead of fixed-size blocks, programs are divided into procedural units of whatever size those units happen to be
    • We subdivide programs into procedures
    • We subdivide data into structures (e.g., arrays, structs)
  – We still use the "on-demand" approach of virtual memory, but when a block of code is needed, the entire block is loaded into memory
• Segmentation uses a segment table instead of a page table and works similarly, although addresses are put together differently
• But segmentation causes (external) fragmentation: when a segment is discarded from memory to make room for a new segment, there may be a leftover chunk of memory that goes unused
• One solution to fragmentation is to use paging combined with segmentation
Virtual Address Translation schemes
Address Translation using TLB and PTs
Inverted Address Mapping
Paging vs Segmentation
Page Replacement Algorithms

▪ When there is a page fault, the referenced page must be loaded.
  – If there is no available frame in memory, then one page (the victim) is selected for replacement.
  – If the selected page has been modified, it must be copied back to disk (swapped out).

Various algorithms are used:
  – FIFO
  – LRU
  – LFU
  – Circular FIFO (Clock)
  – Optimal
  – Random
Belady’s Anomaly

FIFO is not a stack algorithm: in certain cases, the number of page faults can actually increase when more frames are allocated to the process. In the example, there are 9 page faults with 3 frames and 10 page faults with 4 frames (a small simulation is sketched below).
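A small C simulation of FIFO replacement. The reference string here is the classic Belady example (an assumption on our part, since the slide's own figure is not reproduced); it yields 9 faults with 3 frames and 10 with 4:

    #include <stdio.h>
    #include <stdbool.h>

    /* Count FIFO page faults for a reference string and frame count (<= 16). */
    int fifo_faults(const int *refs, int n, int frames) {
        int held[16], count = 0, oldest = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            bool hit = false;
            for (int j = 0; j < count; j++)
                if (held[j] == refs[i]) { hit = true; break; }
            if (hit) continue;
            faults++;
            if (count < frames)
                held[count++] = refs[i];          /* fill an empty frame   */
            else {
                held[oldest] = refs[i];           /* evict the oldest page */
                oldest = (oldest + 1) % frames;
            }
        }
        return faults;
    }

    int main(void) {
        /* Classic Belady reference string (assumed; not shown on the slide). */
        int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3));  /* 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4));  /* 10 */
        return 0;
    }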
Examples - Page Replacement algorithms
• Given the page reference string:
  1, 2, 3, 4, 2, 1, 5, 6, 2, 1, 2, 3, 7, 6, 3, 2, 1, 2, 3, 6
• Compare the number of page faults and page references for the LRU, FIFO and Optimal page replacement algorithms
FIFO (First In First Out)
FIFO: on a page fault, the page that has been in memory the longest (the oldest) is replaced.
Optimal
• Belady's optimal algorithm is not a realizable replacement algorithm: it looks forward in time to see which frame to replace on a page fault. It gives us a baseline of comparison for a given static page reference sequence.
LRU (Least Recently Used)
• On a page fault, the page that was least recently used is replaced.
LRU

  – Keep a counter, perhaps a 64-bit counter
  – The counter is incremented after each instruction access
  – Store the value of the counter in the corresponding page-table entry at each reference (the last access time to the page)
  – When it is time to remove a page, find the lowest counter value (this identifies the LRU page)
  – Nice and simple, but expensive: it requires dedicated hardware (a software simulation of the idea is sketched below)
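A minimal C sketch of the counter scheme; in real hardware the timestamping would be done by the MMU on every access, so the software form here is purely illustrative:

    #include <stdint.h>

    #define NUM_PAGES 8

    /* Per-page timestamp: the global counter value at the last reference. */
    static uint64_t last_used[NUM_PAGES];
    static uint64_t counter = 0;

    void on_reference(int page) {
        last_used[page] = ++counter;     /* stamp the page at every access */
    }

    /* Victim = resident page with the smallest (oldest) timestamp. */
    int lru_victim(const int *resident, int n) {
        int victim = resident[0];
        for (int i = 1; i < n; i++)
            if (last_used[resident[i]] < last_used[victim])
                victim = resident[i];
        return victim;
    }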
LFU (Least Frequently Used)
• On a page fault, the page with the smallest reference count is replaced.

[Figure: LFU worked example; the number of page faults is 9]
Circular FIFO(Clock)
• The set of frames that are candidates for replacement is treated as a circular buffer.
• When a page is replaced, a pointer is set to point to the next frame in the buffer.
• A reference bit (or use bit) for each frame is set to 1 whenever:
  – a page is first loaded into the frame.
  – the corresponding page is referenced.
• When it is time to replace a page, the first frame encountered with the reference bit set to 0 is replaced:
  – During the search for a replacement, each reference bit set to 1 is changed to 0.
❖ Algorithm of Circular FIFO (Clock), sketched below:
  Like FIFO, but before throwing out a page, check its R (use) bit:
  ✓ If 0, remove it.
  ✓ If 1, clear it and move on.
  ✓ If all pages have R=1, the search eventually degenerates to FIFO.
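A minimal C sketch of the clock sweep, assuming a small fixed frame set (the names and sizes are ours):

    #include <stdbool.h>

    #define NUM_FRAMES 4

    static int  frame_page[NUM_FRAMES];   /* which page each frame holds */
    static bool ref_bit[NUM_FRAMES];      /* reference/use bit per frame */
    static int  hand = 0;                 /* the clock pointer           */

    /* Pick a frame to replace: sweep past frames with the reference bit
       set, clearing them, until a frame with ref_bit == 0 is found. */
    int clock_select(void) {
        for (;;) {
            if (!ref_bit[hand]) {
                int victim = hand;
                hand = (hand + 1) % NUM_FRAMES;  /* advance past the victim */
                return victim;
            }
            ref_bit[hand] = false;               /* second chance: clear bit */
            hand = (hand + 1) % NUM_FRAMES;
        }
    }

    void on_reference(int frame) {
        ref_bit[frame] = true;       /* set on every access to the page */
    }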
Clock Page-Replacement Algorithm
(A. Frank - P. Weisberg)

[Figure: clock page-replacement example. An asterisk indicates that the corresponding reference/use bit is set to 1; the arrow indicates the current position of the pointer.]

Cost of Misses, CPU time
[Figure: worked example; assume 75% instruction accesses, 25% data accesses]