TLB & Caches
- Direct mapped cache: use low address bits to index cache
Why partial residency?
- Assumption we made to simplify things:
  - All of the process's data is in memory
  - Load entire process into memory before it can run

Expanding physical memory
- Virtual address translated to:
  - Physical memory ($1/meg): very fast, but small
  - Disk ($.01/meg): very large, but very slow (millis vs nanos)
- OS software, on a page fault (see the sketch after this section):
  - choose an old page to replace
  - ...
  - set mapping
  - continue thread

- Key questions:
  - what to eject?
    - physical memory is always too small: which page to replace?
    - may need to write the evicted page back to the disk
  - how many pages for each process?
  - what to do when not enough memory?
  - how to deal with thrashing?

Restart the faulting instruction
- Key constraint: don't want the user process to be aware that a page fault happened (just like context switching)
- Can we skip the faulting instruction? No.
- Can we restart the instruction from the beginning?
  - Not if it has partial side effects.
- Can we inspect the instruction to figure out what to do?
  - May be ambiguous where it was.
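To make the flow above concrete, here is a minimal user-space sketch of those steps (choose a victim, write it back if dirty, set the new mapping, continue so the faulting instruction re-executes). All names and data structures are toy stand-ins, not a real kernel API.

    /* Toy model of the OS-software steps above; disk_read/disk_write just print. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NFRAMES 4
    #define NPAGES  16

    struct pte { bool valid, modified; int frame; };

    static struct pte page_table[NPAGES];
    static int frame_owner[NFRAMES];        /* which virtual page holds each frame */
    static int next_victim = 0;             /* FIFO victim pointer */

    static void disk_write(int page) { printf("  write page %d back to disk\n", page); }
    static void disk_read(int page)  { printf("  read  page %d from disk\n", page); }

    static void handle_fault(int page)
    {
        int f = next_victim;                         /* choose an old page to replace */
        next_victim = (next_victim + 1) % NFRAMES;

        int old = frame_owner[f];
        if (old >= 0) {                              /* evict the current occupant */
            if (page_table[old].modified)            /* may need to write it back */
                disk_write(old);
            page_table[old].valid = false;
        }
        disk_read(page);                             /* bring in the missing page */
        frame_owner[f] = page;
        page_table[page] = (struct pte){ .valid = true, .modified = false, .frame = f };
        /* return from the fault: the faulting instruction is re-executed */
    }

    static void touch(int page, bool write)
    {
        if (!page_table[page].valid) {               /* page fault */
            printf("fault on page %d\n", page);
            handle_fault(page);
        }
        if (write) page_table[page].modified = true;
    }

    int main(void)
    {
        for (int i = 0; i < NFRAMES; i++) frame_owner[i] = -1;
        touch(1, false); touch(2, true); touch(3, false);
        touch(4, false); touch(5, false);            /* fifth page forces an eviction */
        touch(2, false);                             /* still resident: no fault */
        return 0;
    }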
Solution: hardware support
- RISC machines are pretty simple:
  - instructions tend to have 1 memory ref & 1 side effect
  - thus, only need the faulting address and the faulting PC
- Example: MIPS

      0xffdcc: add r1,r2,r3
      0xffdd0: ld  r1, 0(sp)      <- Fault: epc = 0xffdd0, bad va = 0x0ef80
      (fault handler runs, then: jump 0xffdd0 to re-execute the load)

- CISC harder:
  - multiple memory references and side effects; interpret the instruction?

Deciding what page(s) to fetch
- Page selection: when to bring pages into memory
- Like all caches: we need to know the future
  - Doesn't the user know? Not reliably
  - How to communicate that to the OS?
- Easy load-time hack: demand paging (sketched below)
  - Load initial page(s). Run. Load others on fault.
  [Timeline: ld init pages -> run -> fault: ld page -> run -> fault: ld page -> ...]
- When will startup be slower? Memory less utilized?
- Most systems do some sort of variant of this
- Tweak: pre-paging. Get the page & its neighbors.
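A toy model of the load-time hack and the pre-paging tweak, only to show the fault-count difference; the page count, the sequential access pattern, and the one-page "neighbor" radius are assumptions for illustration.

    /* Demand paging: load only the initial page, fetch the rest on fault.
     * With prepage on, a fault also fetches the neighbors of the faulting page. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NPAGES 16

    static bool resident[NPAGES];
    static int faults;

    static void fetch(int page)
    {
        if (page >= 0 && page < NPAGES && !resident[page])
            resident[page] = true;          /* "read it from the executable" */
    }

    static void access_page(int page, bool prepage)
    {
        if (!resident[page]) {
            faults++;
            fetch(page);
            if (prepage) {                  /* pre-paging: grab the neighbors too */
                fetch(page - 1);
                fetch(page + 1);
            }
        }
    }

    static int run(bool prepage)
    {
        for (int i = 0; i < NPAGES; i++) resident[i] = false;
        faults = 0;
        resident[0] = true;                 /* load initial page(s), then run */
        for (int p = 0; p < NPAGES; p++)    /* sequential access pattern */
            access_page(p, prepage);
        return faults;
    }

    int main(void)
    {
        printf("demand paging : %d faults\n", run(false));  /* 15 */
        printf("pre-paging    : %d faults\n", run(true));   /* 8  */
        return 0;
    }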
- Replacement based on knowing future references
  - Cons: no on-line implementation

FIFO
- Algorithm
  - Throw out the oldest page
- Pros
  - Low-overhead implementation
- Cons
  - May replace heavily used pages
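A minimal sketch of the FIFO mechanism above: frames are kept in load order and the victim is always the head of the queue, however heavily that page is being used. The array shifting is just to keep the toy code short.

    #include <stdio.h>

    #define NFRAMES 3

    static int fifo[NFRAMES];   /* page held by each queue slot, oldest first */
    static int used = 0;

    /* Returns the evicted page, or -1 if a free frame was available. */
    static int fifo_load(int page)
    {
        if (used < NFRAMES) { fifo[used++] = page; return -1; }
        int victim = fifo[0];                       /* throw out the oldest page */
        for (int i = 1; i < NFRAMES; i++) fifo[i - 1] = fifo[i];
        fifo[NFRAMES - 1] = page;                   /* newcomer goes to the tail */
        return victim;
    }

    int main(void)
    {
        int pages[] = { 1, 2, 3, 4, 5 };
        for (int i = 0; i < 5; i++)
            printf("load %d, evict %d\n", pages[i], fifo_load(pages[i]));
        return 0;
    }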
Least Recently Used (LRU)
- Algorithm
  - Replace the page that hasn't been used for the longest time
- Question
  - What hardware mechanisms are required to implement LRU?
  [Figure: pages ordered from most recently used to least recently used: 5 3 4 7 9 11 2 1 15]

Implementing LRU
- Perfect
  - Use a timestamp on each reference
  - Keep a list of pages ordered by time of reference
  - Is this practical?
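A sketch of the "perfect" scheme for a handful of frames, assuming a global counter serves as the timestamp. It works in a simulator, but stamping and re-ordering on every memory reference is exactly what makes the scheme impractical in hardware.

    #include <stdio.h>

    #define NFRAMES 3

    static int frame_page[NFRAMES];       /* which page each frame holds (-1 = free) */
    static long frame_stamp[NFRAMES];     /* time of that page's last reference */
    static long now = 0;

    static int lru_reference(int page)    /* returns the evicted page, or -1 */
    {
        now++;
        int oldest = 0, free_slot = -1;
        for (int i = 0; i < NFRAMES; i++) {
            if (frame_page[i] == page) { frame_stamp[i] = now; return -1; }  /* hit */
            if (frame_page[i] == -1) free_slot = i;
            if (frame_stamp[i] < frame_stamp[oldest]) oldest = i;
        }
        int slot = (free_slot != -1) ? free_slot : oldest;
        int victim = frame_page[slot];
        frame_page[slot] = page;          /* replace the least recently used page */
        frame_stamp[slot] = now;
        return victim;
    }

    int main(void)
    {
        for (int i = 0; i < NFRAMES; i++) { frame_page[i] = -1; frame_stamp[i] = 0; }
        int refs[] = { 1, 2, 3, 1, 4 };   /* the 4 should evict 2, the LRU page */
        for (int i = 0; i < 5; i++)
            printf("ref %d -> evict %d\n", refs[i], lru_reference(refs[i]));
        return 0;
    }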
Belady's anomaly
- Consider the same reference string with 3 page frames
  - FIFO replacement
  - 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
  - 9 page faults! (with 4 frames the same string causes 10 faults, so adding memory increased the fault count)
- This is called Belady's anomaly
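A short FIFO simulation that reproduces the count above and shows the anomaly: 9 faults with 3 frames, 10 with 4.

    #include <stdio.h>

    static int fifo_faults(const int *refs, int n, int nframes)
    {
        int frames[16], used = 0, next = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frames[j] == refs[i]) { hit = 1; break; }
            if (hit) continue;
            faults++;
            if (used < nframes) frames[used++] = refs[i];
            else { frames[next] = refs[i]; next = (next + 1) % nframes; }  /* evict oldest */
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3));  /* 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4));  /* 10 */
        return 0;
    }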
FIFO with 2nd chance (clock)
- Main idea: add a "reference bit" (or use bit) per PTE
- Check the reference bit of the oldest page
  - If it is 0, then replace it
  - If it is 1, clear the reference bit, put the page at the end of the list, and continue searching
- Pros
  - Fast, and does not replace a heavily used page
- Cons
  - The worst case may take a long time
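A sketch of that search, assuming the hardware has already set the reference bits and the frames are visited in load order (the classic clock hand over a circular list).

    #include <stdio.h>

    #define NFRAMES 4

    static int page_in[NFRAMES] = { 10, 11, 12, 13 };  /* resident pages (example) */
    static int ref_bit[NFRAMES] = {  1,  0,  1,  1 };  /* set by hardware on access */
    static int hand = 0;                               /* oldest frame / clock hand */

    static int clock_evict(int new_page)
    {
        for (;;) {
            if (ref_bit[hand] == 0) {                  /* not recently used: take it */
                int victim = page_in[hand];
                page_in[hand] = new_page;
                ref_bit[hand] = 1;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            ref_bit[hand] = 0;                         /* second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }

    int main(void)
    {
        printf("evicted page %d\n", clock_evict(20));  /* skips page 10, evicts 11 */
        return 0;
    }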
Enhanced FIFO with 2nd-chance
- Same as the basic FIFO with 2nd chance, except that this method considers both the reference bit and the modified bit:
  - (0,0): neither recently used nor modified
  - (0,1): not recently used but modified
  - (1,0): recently used but clean
  - (1,1): recently used and modified
- Pros
  - Avoids write-backs (prefers to evict clean pages)
- Cons
  - More complicated

State per page table entry
- Many machines maintain four bits per page table entry:
  - use (aka reference): set when page is referenced, cleared by "clock algorithm"
  - modified (aka dirty): set when page is modified, cleared when page is written to disk
  - valid (aka present): ok for program to reference this page
  - read-only: ok for program to read the page, but not to modify it
- Hardware sets the use bit in the TLB; when a TLB entry is replaced, software copies the use bit back to the page table
- Software manages TLB entries as a FIFO list; everything not ...
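A small sketch tying the two lists together: a PTE carrying those four bits, plus a helper that computes the (use, modified) class the enhanced 2nd-chance scheme uses to prefer clean, not-recently-used victims. The bit-field layout is invented for illustration; real page-table formats are fixed by the MMU.

    #include <stdio.h>

    struct pte {
        unsigned valid     : 1;   /* aka present: ok to reference this page          */
        unsigned read_only : 1;   /* ok to read, but not to modify                   */
        unsigned use       : 1;   /* aka reference: set on access, cleared by clock  */
        unsigned modified  : 1;   /* aka dirty: set on write, cleared on write-back  */
        unsigned frame     : 20;  /* physical frame number                           */
    };

    /* 0 = (0,0) best victim ... 3 = (1,1) worst victim (recently used and dirty). */
    static int victim_class(const struct pte *p)
    {
        return (p->use << 1) | p->modified;
    }

    int main(void)
    {
        struct pte clean_old = { .valid = 1, .use = 0, .modified = 0, .frame = 42 };
        struct pte dirty_hot = { .valid = 1, .use = 1, .modified = 1, .frame = 7  };
        printf("clean_old class %d, dirty_hot class %d\n",
               victim_class(&clean_old), victim_class(&dirty_hot));   /* 0 and 3 */
        return 0;
    }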
- Example: IBM 370 needs 6 pages to handle the SS MOVE instruction:
  - the instruction is 6 bytes, might span 2 pages
  - 2 pages to handle the "from" operand
  - 2 pages to handle the "to" operand
Proportional allocation of frames
- s_i = size of process p_i
- S = ∑ s_i over all processes
- m = total number of frames
- a_i = allocation for p_i = (s_i / S) × m

  Example: m = 64, s1 = 10, s2 = 127
           a1 = (10 / 137) × 64 ≈ 5
           a2 = (127 / 137) × 64 ≈ 59

- Global replacement: a process selects a replacement frame from the set of all frames; one process can take a frame from another
- Local replacement: each process selects from only its own set of allocated frames
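The same arithmetic in a few lines of C, confirming the ≈5 / ≈59 split for the example sizes above.

    #include <stdio.h>

    int main(void)
    {
        int s[] = { 10, 127 };                        /* process sizes */
        int m = 64;                                   /* total frames  */
        int S = 0;
        for (int i = 0; i < 2; i++) S += s[i];        /* S = 137 */
        for (int i = 0; i < 2; i++) {
            double a = (double)s[i] / S * m;          /* a_i = (s_i / S) * m */
            printf("a%d = %d/%d * %d = %.1f (about %d frames)\n",
                   i + 1, s[i], S, m, a, (int)(a + 0.5));
        }
        return 0;
    }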