Lecture 7 Memory 2021
Outline: Memory Hierarchy • Virtual Memory • Demand Paging • TLBs • Cache Organization • Cache Performance
Introduction to the Memory Hierarchy
Processor-DRAM Gap (Latency)
• In 1980, a microprocessor executed ~one instruction in the same time as a DRAM access
• In 2017, a microprocessor executes ~1000 instructions in the same time as a DRAM access
Memory Caching
• The mismatch between processor and memory speeds leads us to add a new level: a memory cache
[Figure: the processor datapath (PC, registers) issues addresses to a cache, which sits between the datapath and the program/data memory.]
The Principle of Locality
• Programs access a small proportion of their address space at any time
• Temporal locality (locality in time)
• Items accessed recently are likely to be accessed again soon
• e.g., instructions in a loop, induction variables
• Spatial locality (locality in space)
• Items near those accessed recently are likely to be accessed soon
• E.g., sequential instruction access, array data
• Taking advantage of locality
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to smaller DRAM memory
• Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
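As a small illustration (hypothetical code, not from the original slides), the loop below shows both kinds of locality: the running sum is reused on every iteration (temporal locality), and the array elements are accessed sequentially (spatial locality), so a cache block that holds several adjacent elements gets used more than once.

```c
#include <stddef.h>

/* Sketch: summing an array exhibits temporal locality (sum and i are reused
 * every iteration) and spatial locality (a[i], a[i+1], ... are adjacent, so
 * they usually fall in the same cache block). */
long sum_array(const int *a, size_t n)
{
    long sum = 0;                 /* reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* sequential accesses: spatial locality     */
    return sum;
}
```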
Characteristics of the Memory Hierarchy
• The hierarchy is inclusive: what is in the L1$ is a subset of what is in Main Memory, which is in turn a subset of what is in Secondary Memory.
• Increasing distance from the processor means increasing access time.
• Typical transfer sizes between levels: 4-8 bytes (a word) between the processor and the L1$, 8-32 bytes (a block) between cache levels, 1 to 4 blocks between cache and Main Memory, and 1,024+ bytes (a disk sector = page) between Main Memory and Secondary Memory.
• Main Memory: Dynamic RAM (DRAM), 50ns – 70ns access time, $20 – $75 per GB.
• Secondary Memory: magnetic disk (or Flash), 5ms – 20ms access time, $0.20 – $2 per GB.
• Across the hierarchy: Speed (#cycles): ½’s → 1’s → 10’s → 100’s → 10,000’s; Size (bytes): 100’s → 10K’s → M’s → G’s → T’s; Cost per byte: highest → lowest.
Virtual Memory
“Bare” 5-Stage Pipeline
• In a bare machine, the only kind of address is a physical address
Multiple processes
• We want many things running concurrently.
• In fact, in today’s machines, hundreds of processes are running
• Each process has its own state and needs to access and manage memory
Multi-process Addressing Challenges
• Size
• We cannot fit 2³² (or 2⁶⁴) bytes on a chip (or even in several chips…)
• We cannot fit a big program (e.g., Linux) in embedded memory
• Protection
• If we have several programs running, how do we keep them from overwriting each other’s data?
• How do we keep the private data of one process from being read by another?
• Location Independence
• We want to compile the addresses of a program, assuming its address space starts from a pre-known base address.
• We want to give the process the feeling that it owns the entire memory space.
Option: Base and Bounds
• Each process will be given a base and bound address
• The process stores these inside base and bound registers.
• Compiled addresses are offsets that are added to the base address.
• Upon every memory access, check that the bound address is not surpassed.
[Figure: the datapath (PC, registers, ALU) issues virtual addresses; adding the base register produces physical addresses into the physical address space.]
• Physical Address Space: the set of addresses that map to actual physical locations in memory; it is hidden from user applications.
• There are many virtual address spaces (one per software process or hardware core), but only one main memory.
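A minimal sketch of the base-and-bounds check (hypothetical code, just to illustrate the idea): every compiled address is an offset that is relocated by the base register and checked against the bound before it reaches memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-process relocation registers (illustrative only). */
typedef struct {
    uint32_t base;   /* start of the process's region in physical memory */
    uint32_t bound;  /* size of the region in bytes                      */
} seg_regs_t;

/* Translate a virtual (offset) address; returns false on a bounds violation,
 * which the hardware would raise as a protection exception. */
static bool translate_base_bounds(seg_regs_t r, uint32_t vaddr, uint32_t *paddr)
{
    if (vaddr >= r.bound)
        return false;            /* access past the bound: trap */
    *paddr = r.base + vaddr;     /* relocate by the base        */
    return true;
}
```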
Library book analogy
• An author writes a book and sells it to a library.
• The library puts the book on a shelf.
• Does the author print the location of the book
on the book cover?
• No. Then we would need to set the
location in every library in the world.
• Instead, we provide the book with an
identifier (i.e., ISBN)
• The library has a catalog that says
where the specific book is placed.
• This indirection is the concept of
virtual addressing.
Paging
• Divide the memory into “pages”
• Instead of referencing physical addresses, the processor references a virtual address made of a “page number” and an “offset”
• Typical page size is 4KB (12-bit offset)
• A “page table” contains the physical address of the base of each page
[Figure: each page (0, 1, 2, 3) of User-1’s address space is mapped through User-1’s page table to an arbitrary page of physical memory.]
• Allocate each process its own page table and pages
• This addresses the size problem: we do not need to fit 2³² bytes on a chip, or a whole program in embedded memory
Paged Memory Address Translation
• The Operating System keeps track of which process is active
• It stores the address of that process’s Page Table in the Page Table Base Register (PTBR)
• The Memory Management Unit (MMU) extracts the virtual page number (VPN) from the virtual address (e.g., just the top 20 bits of a 32-bit virtual address)
• The Physical Page Number (PPN) is stored in the Page Table Entry (PTE)
• The physical address is formed from the PPN (taken from the PTE) and the page offset
• On a user (process) switch: PTBR = System PT Base + new User ID
• Note: this requires two DRAM accesses to access one data word or instruction!
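A sketch of this single-level translation in C (hypothetical structures; a real MMU does this in hardware): the VPN indexes the page table found through the PTBR, and the PPN from the PTE is combined with the offset. The load of the PTE is itself a memory access, which is why each reference costs two DRAM accesses.

```c
#include <stdint.h>

#define PAGE_BITS  12u                    /* 4 KB pages                 */
#define VPN(va)    ((va) >> PAGE_BITS)    /* top 20 bits of a 32-bit VA */
#define OFFSET(va) ((va) & 0xFFFu)

/* Illustrative PTE: physical page number plus a valid bit. */
typedef struct { uint32_t ppn; uint8_t valid; } pte_t;

/* ptbr points to the active process's linear page table in memory. */
static uint32_t translate(const pte_t *ptbr, uint32_t vaddr)
{
    pte_t pte = ptbr[VPN(vaddr)];        /* DRAM access #1: read the PTE */
    if (!pte.valid) {
        /* page fault: the OS would bring the page in and retry */
        return 0;
    }
    return (pte.ppn << PAGE_BITS) | OFFSET(vaddr);  /* DRAM access #2 uses this PA */
}
```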
Demand Paging
• Sometimes there is not enough DRAM to hold all allocated memory
• For example, the entire virtual memory space of just one process is 2³² bytes = 4GB!
• Therefore, we can use secondary storage (HDD) to increase our capacity
• Space allocated on secondary storage is called “swap”
• Add a “valid bit” to the PTE, which is set (valid=1) when the page is in memory.
• When a page is swapped out, the valid bit is cleared.
• For a swapped-out page, the PTE stores a Disk Page Number (DPN) instead of a PPN.
• This scheme is known as “Demand Paging”
• An access to a swapped-out page (valid=0) results in a “page fault”.
• We then need to allocate memory in DRAM and copy the page in from storage.
• If out of memory, we first need to evict a page
• This is done in software by the OS (and takes millions of cycles)
Page Fault Handling
• Upon addressing a swapped-out page, we need to:
• Locate the page on the secondary storage
• Allocate space for the page in main memory
• If out of memory, choose a page to evict
• Swap the evicted page with the newly allocated page
• Update the page table and re-run the access instruction
• This is done in software by the OS (source: Silberschatz)
• To reduce the cost of page faults:
• Use fully associative page placement (handled by the OS)
• Add an “access bit” (a.k.a. “use bit”) to the PTE to enable pseudo-LRU replacement
• Add a “dirty bit” to the PTE and only write back a swapped page when it was modified (=dirty)
• Never swap out the page tables of the Operating System
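A hedged sketch of what such an OS handler might look like in C (all helper functions and types here are hypothetical; a real handler also deals with locking, I/O scheduling, and reverse mappings):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Minimal illustrative types; everything here is hypothetical. */
typedef struct { uint32_t ppn, dpn; bool valid, dirty, accessed; } pte_t;
typedef struct { uint32_t ppn; pte_t *owner; } frame_t;

/* Assumed hooks into the rest of the OS (declared, not implemented here). */
frame_t *alloc_frame(void);
frame_t *choose_victim(void);                    /* e.g., pseudo-LRU via access bits */
void write_to_swap(uint32_t dpn, frame_t *f);    /* disk write                       */
void read_from_swap(uint32_t dpn, frame_t *f);   /* disk read (millions of cycles)   */

void handle_page_fault(pte_t *pte)
{
    frame_t *frame = alloc_frame();              /* try to find a free physical frame */
    if (frame == NULL) {                         /* out of memory: evict a victim     */
        frame = choose_victim();
        if (frame->owner->dirty)                 /* write back only if modified       */
            write_to_swap(frame->owner->dpn, frame);
        frame->owner->valid = false;             /* victim is no longer resident      */
    }
    read_from_swap(pte->dpn, frame);             /* bring the faulting page in        */
    pte->ppn   = frame->ppn;
    pte->valid = true;
    frame->owner = pte;
    /* the trap returns and the faulting instruction is re-executed */
}
```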
Problem: Size of Linear Page Table
• With 32-bit addresses, 4-KB pages, and 4-byte Page Table Entries (PTEs):
• 2²⁰ PTEs, i.e., a 4 MB page table per process
• 4 GB of swap needed to back up the full virtual address space
• Larger pages?
• Internal fragmentation (not all memory in a page is used)
• Larger page fault penalty (more time to read from disk)
• What about a 64-bit virtual address space???
• Even 1MB pages would require 2⁴⁴ 8-byte PTEs (35 TB!)
Solution: Hierarchical Page Table
• Divide the virtual address into a hierarchy of page tables:
• Bits 31-22: p1 (10-bit Level 1 index); bits 21-12: p2 (10-bit Level 2 index); bits 11-0: page offset
• The root of the current page table is held in a register, called the Supervisor Page Table Base Register (SPTBR) in RISC-V
• A Level 1 entry covers a 4MB region (1024 × 4096B); Level 2 page tables map 4KB pages (12-bit offset → 4096B)
• A PTE may point to a page in primary memory, a page in secondary memory, or a nonexistent page
• Now every memory access needs to access several page tables, i.e., several DRAM accesses
• This is called a “Page Table Walk”
RISC-V Sv32 Virtual Addressing Mode
• Virtual Addresses: VPN[1] (10 bits) | VPN[0] (10 bits) | page offset (12 bits)
• VPN[1] is the index into the 1st-level page table; VPN[0] is the index into the 2nd-level page table
• The user page table is organized as a two-level tree instead of a linear table with 2²⁰ entries
• Physical Addresses: PPN[1] (12 bits) | PPN[0] (10 bits) | page offset (12 bits)
• Both page tables and base pages are 4KB (1024 × 4B entries)
• Megapages (4MB) can be defined by using PPN[0]+offset as a 22-bit offset within the megapage
• 32-bit Page Table Entry: PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V
• PPN[1]:PPN[0] (22 bits): the physical page number of the next-level page table or of the mapped page
• D = Dirty bit, A = Recently Accessed bit, U = User page, X/W/R = Execute/Write/Read permissions, V = Valid bit
• XWR = 000 → the PTE is a pointer to the next-level page table
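Below is a minimal sketch of an Sv32-style two-level page-table walk in C (illustrative only, assuming the field layout above and ignoring permission checks, superpage alignment rules, and the A/D-bit updates a real walker performs):

```c
#include <stdint.h>

#define PAGE_BITS   12u
#define VPN1(va)    (((va) >> 22) & 0x3FFu)           /* 10-bit level-1 index  */
#define VPN0(va)    (((va) >> 12) & 0x3FFu)           /* 10-bit level-2 index  */
#define OFF(va)     ((va) & 0xFFFu)

#define PTE_V       0x001u                             /* valid                 */
#define PTE_XWR     0x00Eu                             /* R/W/X permission bits */
#define PTE_PPN(p)  ((p) >> 10)                        /* 22-bit PPN field      */

/* pt_base: physical page number of the root page table (from SPTBR).
 * mem:     a flat word-indexed view of physical memory, for illustration.
 * Returns the physical address, or 0 to stand in for a page fault. */
static uint32_t sv32_walk(const uint32_t *mem, uint32_t pt_base, uint32_t va)
{
    /* Level 1: read the PTE selected by VPN[1]. */
    uint32_t pte1 = mem[(pt_base << (PAGE_BITS - 2)) + VPN1(va)];
    if (!(pte1 & PTE_V))
        return 0;                                      /* page fault            */
    if (pte1 & PTE_XWR)                                /* leaf at level 1:      */
        return (PTE_PPN(pte1) << PAGE_BITS) | (va & 0x3FFFFFu); /* 4MB megapage */

    /* Level 2: XWR == 000 means pte1 points to the next-level table. */
    uint32_t pte0 = mem[(PTE_PPN(pte1) << (PAGE_BITS - 2)) + VPN0(va)];
    if (!(pte0 & PTE_V) || !(pte0 & PTE_XWR))
        return 0;                                      /* page fault            */
    return (PTE_PPN(pte0) << PAGE_BITS) | OFF(va);
}
```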
TLBs
Page-Based Virtual-Memory Machine
[Figure: a 5-stage pipeline (PC → Instruction Cache → Decode → Execute → Memory → Writeback) with an address-translation unit in front of both the instruction cache and the data cache; each translation can raise a page fault or protection violation; a Hardware Page Table Walker and memory controller sit between the physically addressed caches and Main Memory (DRAM); the Page-Table Base Register points to the page table.]
• On an Instruction Memory access:
• Translate the VA to a PA
• Exceptions? Invoke the handler (OS)
• Request the PA from the cache
• If cache hit, continue; if cache miss, access DRAM
• Repeat for the Data Memory access
Translation Lookaside Buffers (TLB)
• Address translation is very expensive!
• In a single-level page table, each reference becomes two memory accesses
• In a two-level page table, each reference becomes three memory accesses
• Solution: Cache some translations in TLB
• TLB hit → Single-Cycle Translation
• TLB miss → Page-Table Walk to refill
[Figure: the TLB caches VPN → PPN translations; the virtual address is VPN | offset, and on a TLB hit the physical address is PPN | offset.]
TLB Access
• Look up VPN in TLB
• TLB Hit
• Take PA from TLB
• Turn on access (ref) bit
• Turn on dirty bit for write
• TLB Miss
• Check if page is in memory
• If yes, load to TLB and retry
• If no → Page Fault
(source: Patterson, Hennessy)
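A small illustrative sketch of a fully associative TLB lookup in C (hypothetical structure; a real TLB performs all the comparisons in parallel in hardware):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64u   /* assumed size; typical TLBs have 16-512 entries */

typedef struct {
    uint32_t vpn, ppn;
    bool valid, ref, dirty;
} tlb_entry_t;

/* Search all entries for the VPN; on a hit, update ref/dirty bits and return
 * the translated physical address. On a miss, the page table would be walked
 * and the TLB refilled (not shown here). */
static bool tlb_lookup(tlb_entry_t tlb[TLB_ENTRIES], uint32_t vaddr,
                       bool is_write, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> 12;
    for (uint32_t i = 0; i < TLB_ENTRIES; i++) {       /* parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].ref = true;                         /* turn on access bit   */
            if (is_write)
                tlb[i].dirty = true;                   /* turn on dirty bit    */
            *paddr = (tlb[i].ppn << 12) | (vaddr & 0xFFFu);
            return true;                               /* TLB hit              */
        }
    }
    return false;                                      /* TLB miss             */
}
```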
TLB Designs
• Specs:
• Typically 16-512 entries (most commonly 32-128)
• 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate
• Usually fully associative
• Each entry maps a large page, so there is little spatial locality across entries → more likely that two entries conflict
• Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
• Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• LRU is too costly
• “TLB Reach”
• The size of the largest virtual address space that can be simultaneously mapped by the TLB
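For example (assumed figures, for illustration only): a 64-entry, fully associative TLB with 4 KiB pages has a TLB reach of 64 × 4 KiB = 256 KiB; with 4 MiB megapages the same TLB would reach 256 MiB.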
VM-related events in pipeline
• Handling a TLB miss needs a hardware or software mechanism to refill TLB
• Usually done in hardware now
• Handling a page fault (e.g., page is on disk) needs a precise trap
so software handler can easily resume after retrieving page
• Handling protection violation may abort process
Address Translation: putting it all together
[Flowchart: a virtual address goes to a TLB lookup (hardware); on a hit, the translation completes immediately in hardware; on a miss, a page-table walk refills the TLB (in hardware or software), and a missing page triggers a page fault handled in software.]
Virtual-Memory Machine with TLB
[Figure: the same 5-stage pipeline as before, but with a TLB in front of both the instruction cache and the data cache; page faults and protection violations are raised at the TLBs, and a Hardware Page Table Walker refills them from the page table pointed to by the Page-Table Base Register; the physically addressed caches connect through the memory controller to Main Memory (DRAM).]
• Access the TLB
• If TLB hit: check permissions and continue
• If TLB miss: perform a Page Table Walk and add the translation to the TLB
• If the walk finds a page fault, invoke the OS
Summary
• To enable a large memory space and multiple processes, we use Virtual Memory Addressing.
• However, accessing memory using VM is expensive:
• First, we need to access the page table to find the physical address.
• Then we need a second access to retrieve the data from DRAM.
• If the page is not resident in memory, a page fault occurs, which is very costly.
• Make translation faster by caching VA→PA translations in a TLB.
• But we still need to access DRAM to retrieve the data…
Cache Organization
Reminder: Adding Cache to Computer
• A cache exploits data locality to reduce the number of external memory accesses.
• The idea: copy commonly accessed data to an on-chip memory (SRAM).
[Figure: the processor (control + datapath with PC and registers) reads and writes a cache, which sits between the datapath and the program/data memory.]
Anatomy of a Simple Cache
• Say we have a 16B cache with 4B (1-word) blocks
• Cache “Tags”
• We need a way to tell whether we have a copy of a memory location, so we can decide between a hit and a miss
• On a cache miss, put the memory address of the block in the “tag” field of the cache block
• First idea: compare all four tags to see if the data is in the cache
• Cache Replacement: which block do we evict upon a cache miss?
[Figure: a 4-entry cache holding (tag, data) pairs such as (252, 12), (1022, 99), (131, 7), (2041, 20), sitting between the processor’s 32-bit address/data bus and memory.]
Bigger Block Size
• Let’s now put two words in each cache line
• E.g., a 32B cache with 8B blocks
• Alignment: blocks must be aligned in memory, otherwise we could get the same word twice in the cache
• The last 3 bits of the block address are always 000two, so the tag can omit these 3 bits
• We can get a hit for either word in the block
• Use the lowest 3 bits of the address as the byte offset within the block
• Address layout: tttttttttttttttttttttttttttt ooo (tag to check we have the correct block, then byte offset within the block)
[Figure: a 4-entry cache with two words per block, e.g., tag 252 holding (12, -10), tag 1022 holding (99, 1000), tag 130 holding (42, 7), tag 2040 holding (1947, 20).]
Dividing the cache into “Sets”
• In the previous cache, we need to compare every tag to the processor address
• Comparators are expensive!
• New idea:
• Break the cache into “sets”
• Use index bits of the address to state which set the address is allowed to be in
• Compare only the tags in the selected set
[Figure: the cache split into Set 0 and Set 1, each holding two (tag, data) entries.]
Example: Alternatives in an 8 Block Cache
• Direct Mapped: 8 blocks, 1 way, 1 tag comparator, 8 sets
• Fully Associative: 8 blocks, 8 ways, 8 tag comparators, 1 set
• 2 Way Set Associative: 8 blocks, 2 ways, 2 tag comparators, 4 sets
• 4 Way Set Associative: 8 blocks, 4 ways, 4 tag comparators, 2 sets
[Figure: the same 8 blocks drawn four ways: Direct Mapped = 8 sets × 1 way; Fully Associative = 1 set × 8 ways; 2-way Set Associative = 4 sets × 2 ways; 4-way Set Associative = 2 sets × 4 ways.]
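The sketch below (hypothetical parameters, in C) shows how the address is split for a set-associative cache: the byte offset selects within the block, the index selects the set, and only the tags of that set are compared.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed example geometry: 8 blocks of 4B, 2-way set associative → 4 sets. */
#define BLOCK_BYTES  4u
#define NUM_BLOCKS   8u
#define WAYS         2u
#define NUM_SETS     (NUM_BLOCKS / WAYS)      /* 4 sets            */
#define OFFSET_BITS  2u                       /* log2(BLOCK_BYTES) */
#define INDEX_BITS   2u                       /* log2(NUM_SETS)    */

typedef struct { uint32_t tag; bool valid; uint8_t data[BLOCK_BYTES]; } line_t;

static bool cache_hit(line_t cache[NUM_SETS][WAYS], uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);   /* which set        */
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);      /* the rest is tag  */

    for (uint32_t way = 0; way < WAYS; way++)                  /* only WAYS        */
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return true;                                       /* comparators used */
    return false;                                              /* miss             */
}
```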
Benefits of Set-Associative Caches
• The choice of Direct Mapped versus Set Associative depends on the cost of a miss versus the cost of implementation
• The largest gains come from going from Direct Mapped to 2-way (20%+ reduction in miss rate)
One More Detail: Valid Bit
• When a new program starts, the cache does not have valid information for this program
• We need an indicator of whether this tag entry is valid for this program
• Add a “valid bit” to the cache tag entry
• 0 => cache miss, even if by chance, address = tag
• 1 => cache hit, if processor address = tag
Cache Performance
Write Policy
• How do we make sure cache and memory have same values on writes?
• Write-Through Policy:
• Write cache and write through the cache to memory
• Too slow, so include Write Buffer to allow processor to continue
• Write buffer may have multiple entries to absorb bursts of writes
• Write-Back Policy:
• Write only to cache. Write block back to memory when evicted.
• Only single write to memory per block
• Need to specify if block was changed → include “Dirty Bit”
• What do you do on a write miss?
• Usually Write Allocate → First fetch the block, then write and set dirty bit.
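A rough sketch of a write-back, write-allocate store path in C (hypothetical structures, ignoring the write buffer and restricting the example to a single cache line):

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BYTES 64u

typedef struct { uint32_t tag; bool valid, dirty; uint8_t data[BLOCK_BYTES]; } line_t;

/* Assumed memory-side helpers (declared, not implemented here). */
void mem_read_block(uint32_t block_addr, uint8_t *buf);
void mem_write_block(uint32_t block_addr, const uint8_t *buf);

/* Write one byte through a single cache line (purely illustrative). */
static void cache_write_byte(line_t *line, uint32_t addr, uint8_t value)
{
    uint32_t tag    = addr / BLOCK_BYTES;
    uint32_t offset = addr % BLOCK_BYTES;

    if (!line->valid || line->tag != tag) {            /* write miss             */
        if (line->valid && line->dirty)                /* evict: write back once */
            mem_write_block(line->tag * BLOCK_BYTES, line->data);
        mem_read_block(tag * BLOCK_BYTES, line->data); /* write allocate: fetch  */
        line->tag = tag;
        line->valid = true;
        line->dirty = false;
    }
    line->data[offset] = value;                        /* write only the cache   */
    line->dirty = true;                                /* mark block modified    */
}
```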
Write-Through vs. Write-Back
• Write-Through:
• Simpler control logic
• More predictable timing simplifies processor control logic
• Easier to make reliable, since memory always has a copy of the data (big idea: Redundancy!)
• Write-Back:
• More complex control logic
• More variable timing (0, 1, or 2 memory accesses per cache access)
• Usually reduces write traffic
• Harder to make reliable, since sometimes the cache has the only copy of the data
Cache Performance
• Hit rate: the fraction of accesses that hit in the cache
• Miss rate: 1 − Hit rate
• Miss penalty: the time to bring a block from a lower level of the memory hierarchy into the cache
• Hit time: the time to access the cache memory (including the tag comparison)
• Average Memory Access Time (AMAT): the average time to access memory
• AMAT = Time for a hit + Miss rate × Miss penalty
• To reduce AMAT:
• Reduce Hit Time
• Reduce Miss Rate
• Reduce Miss Penalty
• Balance the cache parameters: capacity, associativity, block size
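For example (assumed numbers, just to exercise the formula): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 × 100 = 6 cycles per access.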
Cache Replacement Policy
• Random Replacement
• Hardware randomly selects a cache entry to evict
• Least-Recently Used (LRU)
• Hardware keeps track of the access history
• Replace the entry that has not been used for the longest time
• For a 2-way set-associative cache, we need one bit per set for LRU replacement
• Example of a simple “pseudo-LRU” implementation (“not-most-recently used”), sketched in code below:
• Assume 64 fully associative entries
• A hardware replacement pointer points to one cache entry
• Whenever an access is made to the entry that the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer
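A minimal sketch of that not-most-recently-used pointer scheme in C (hypothetical, for a 64-entry fully associative cache): the pointer always identifies the victim, and it only advances when the entry it points at is used, so it never lingers on a recently used entry.

```c
#include <stdint.h>

#define ENTRIES 64u   /* assumed: 64 fully associative entries */

static uint32_t replace_ptr = 0;   /* hardware replacement pointer */

/* Called on every cache access that hits entry 'index'. */
static void nmru_on_access(uint32_t index)
{
    if (index == replace_ptr)                       /* pointer's entry was just used */
        replace_ptr = (replace_ptr + 1) % ENTRIES;  /* move on to the next entry     */
    /* otherwise: leave the pointer where it is */
}

/* Called on a miss: the entry the pointer identifies is the victim. */
static uint32_t nmru_choose_victim(void)
{
    return replace_ptr;
}
```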
The 3-C’s of Cache Misses (How to simulate?)
• Compulsory: the first access to a block can never be a hit (cold-start misses)
• Capacity: the cache cannot contain all the blocks the program needs
• Conflict: several blocks compete for the same set in a direct-mapped or set-associative cache
Cache Design Trade-Offs
Design change → Effect on miss rate → Negative performance effect
• Increase cache size → Decreases capacity misses → May increase access time
• Increase block size → Decreases compulsory misses → Increases miss penalty; for very large block sizes, may increase the miss rate due to pollution
Multilevel Caches
• To improve cache performance, use a hierarchy of caches (Level 1, Level 2, Level 3, …, Level n)
• Local Miss Rate
• The fraction of misses at a given level of the cache hierarchy
• Local miss rate of L2$ = L2$ misses / total L2$ accesses = L2$ misses / L1$ misses
• Global Miss Rate
• The fraction of accesses that miss all the way to memory
• Global miss rate = Ln$ misses / total accesses
• The Ln$ local miss rate is much higher (>>) than the global miss rate
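For example (assumed numbers): if the L1$ misses on 5% of all accesses and the L2$ has a 20% local miss rate, the global miss rate to memory is 0.05 × 0.20 = 1%, even though 20% looks high in isolation.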
Multilevel Cache Considerations
• There are different design considerations for the L1$ and the L2$
• L1$: focus on fast access, minimize hit time → smaller cache
• L2$, L3$: focus on low miss rate, to reduce the penalty of main memory access times → larger cache, larger block sizes, higher levels of associativity
• The miss penalty of the L1$ is significantly reduced by the presence of the L2$, so the L1$ can be smaller and faster even with a higher miss rate
Multilevel Cache Example
• Given: CPU base CPI = 1, clock rate = 4GHz (0.25ns cycle), miss rate = 2% per instruction, main memory access time = 100ns
• With just the primary cache:
• Miss penalty = 100ns / 0.25ns = 400 cycles
• Effective CPI = 1 + 0.02 × 400 = 9
• Now add an L2 cache with access time = 5ns and a global miss rate to main memory of 0.5%
• Primary miss with L2 hit: penalty = 5ns / 0.25ns = 20 cycles
• Primary miss with L2 miss: extra penalty = 400 cycles
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9 / 3.4 = 2.6
Cache Blocks vs. VM Pages
• In caches, we dealt with individual blocks (or “lines”), usually ~64B on modern systems
• In VM, we deal with individual pages, usually ~4 KB on modern systems
Bytes, Words, Blocks, Pages
• Example: a 16 KiB memory with 4 KB pages (for VM), 128 B blocks (for caches), and 4 B words (for lw/sw)
• We can think of the memory as: 4 pages, or 128 blocks, or 4,096 words, or 16,384 bytes
• We can think of each page as: 32 blocks, or 1,024 words
• So an address selects 1 of 4 pages in the memory (Page 0 to Page 3), 1 of 32 blocks within that page (Block 0 to Block 31), and 1 of 32 words within that block (Word 0 to Word 31)
References
• Patterson, Hennessy, “Computer Organization and Design – The RISC-V Edition”
• Berkeley 61C
• MIT 6.175