
Lecture 7:

The Memory Hierarchy


Advanced Digital VLSI Design I
Bar-Ilan University, Course 83-614
Semester B, 2021
20 June 2021
Lecture Overview

Memory Hierarchy | Virtual Memory | Demand Paging | TLBs | Cache Organization | Cache Performance

Introduction to the Memory Hierarchy
Processor-DRAM Gap (Latency)
• In 1980, a microprocessor executed roughly one instruction in the time of one DRAM access.
• By 2017, a microprocessor executed roughly 1,000 instructions in the time of one DRAM access.
• Slow DRAM access therefore has a disastrous impact on CPU performance!
Memory Caching
• The mismatch between processor and memory speeds leads us to add a new level: a memory cache.
• Implemented with the same IC processing technology as the CPU (usually integrated on the same chip).
• Faster, but more expensive per byte, than DRAM.
• The cache holds a copy of a subset of main memory.
Adding Cache to Computer
[Block diagram: the cache sits between the processor (control, datapath, PC, registers, ALU) and memory (program and data bytes) on the processor-memory interface, carrying the address, write data, read data and read/write control signals; memory also connects to input and output devices over the I/O-memory interfaces.]
The Principle of Locality
• Programs access a small proportion of their address space at any time
• Temporal locality (locality in time)
• Items accessed recently are likely to be accessed again soon
• e.g., instructions in a loop, induction variables
• Spatial locality (locality in space)
• Items near those accessed recently are likely to be accessed soon
• e.g., sequential instruction access, array data (see the sketch below)
• Taking advantage of locality:
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to a smaller DRAM memory
• Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory
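To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the matrix size N is arbitrary) in which the same data is traversed once with good and once with poor spatial locality:

```c
#include <stdio.h>

#define N 1024

/* Illustrative only: summing a matrix row-by-row touches consecutive
 * addresses (spatial locality), and the loop counters and accumulator
 * are reused every iteration (temporal locality). */
static int a[N][N];

int main(void) {
    long sum = 0;

    /* Row-major traversal: a[i][j] and a[i][j+1] are adjacent in memory,
     * so each cache block fetched from DRAM is fully used before eviction. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal touches one word per cache block before
     * jumping N*sizeof(int) bytes ahead -- far worse spatial locality. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}
```

With, say, 64 B cache blocks and 4 B ints, the first loop uses all 16 words of every fetched block, while the second uses only one word per block before moving on.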
Characteristics of the Memory Hierarchy
• Increasing distance from the processor means increasing access time, and the (relative) size of the memory at each level grows.
• The unit of transfer grows with distance: 4-8 bytes (a word) between processor and cache, 8-32 bytes (a block) between cache levels, 1 to 4 blocks between cache and main memory, and 1,024+ bytes (a disk sector = page) between main memory and secondary memory.
• The hierarchy is inclusive: what is in the L1$ is a subset of what is in main memory, which in turn is a subset of what is in secondary memory.
Typical Memory Hierarchy
• The Trick: present the processor with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
• Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB (on-chip components: register file, instruction cache, data cache)
• Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB (main memory)
• Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB (secondary memory: disk or flash)
• Moving down the hierarchy (register file → caches → main memory → secondary memory):
• Speed (#cycles): ½’s → 1’s → 10’s → 100’s → 10,000’s
• Size (bytes): 100’s → 10K’s → M’s → G’s → T’s
• Cost per byte: highest → lowest

Virtual Memory

“Bare” 5-Stage Pipeline
• In a bare machine, the only kind of address is a physical address

Multiple processes
• We want many things running concurrently.
• In fact, in today’s machines, hundreds of processes are running
• Each process has its own state and needs to access and manage memory
Multi-process Addressing Challenges
• Size
• We cannot fit 2^32 (or 2^64) bytes on a chip (or even in several chips…)
• We cannot fit a big program (e.g., Linux) in embedded memory
• Protection
• If we have several programs running, how do we keep them from overwriting each other’s data?
• How do we keep private data of one process from being read by another?
• Location Independence
• We want to compile the addresses of a program assuming its address space starts from a pre-known base address.
• We want to give each process the feeling that it owns the entire memory space.
Option: Base and Bounds
• Each process will be given a base and bound address
• The process stores these inside base and bound registers.
• Compiled addresses are offsets that are added to the base address.
• Upon every memory access, check that the offset does not exceed the bound (see the sketch below).

• This approach is often known as segmentation:
• Each program gets a segment of memory.
• It is still partially used in modern systems (e.g., x86).
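As a rough illustration of the base-and-bounds check described above, here is a small C sketch (the register values and segment layout are made up for the example):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of base-and-bounds translation: the compiled address
 * is an offset, checked against the bound register and added to the base. */
typedef struct {
    uint32_t base;   /* start of this process's segment in physical memory */
    uint32_t bound;  /* size of the segment in bytes */
} segment_t;

static uint32_t translate(segment_t seg, uint32_t offset) {
    if (offset >= seg.bound) {          /* protection: out-of-segment access */
        fprintf(stderr, "segmentation fault: offset 0x%x >= bound 0x%x\n",
                offset, seg.bound);
        exit(1);
    }
    return seg.base + offset;           /* relocation: location independence */
}

int main(void) {
    segment_t user1 = { .base = 0x4000, .bound = 0x1000 };  /* a 4 KB segment */
    printf("0x%x\n", translate(user1, 0x0123));             /* -> 0x4123 */
    return 0;
}
```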
Problem: Fragmentation
• As users (processes) come and go, the storage becomes “fragmented”.
• Therefore, at some stage, the OS has to move programs around to compact the storage.
[Figure: an OS memory map as users 4 and 5 arrive and then users 2 and 5 leave, leaving the free space split into small, non-contiguous holes (e.g., 24K, 16K and 8K gaps) between users 1, 3 and 4.]
Solution: Virtual Addressing
• Processes use virtual addresses (VAs), e.g., 0 … 0xffff,ffff
• Many processes, all using the same (conflicting) addresses
• Memory uses physical addresses (PAs), also e.g., 0 … 0xffff,ffff
• A memory manager maps virtual to physical addresses
• Virtual Address Space: the set of addresses that the user program knows about
• Physical Address Space: the set of addresses that map to actual physical locations in memory; hidden from user applications
• Many processors (software and hardware cores) share one main memory
[Figure: the processor (and caches) issue virtual addresses; the memory manager translates them to physical addresses used by main memory (DRAM).]
Library book analogy
• An author writes a book and sells it to a library.
• The library puts the book on a shelf.
• Does the author print the location of the book on the book cover?
• No. Then we would need to set the location in every library in the world.
• Instead, we provide the book with an identifier (i.e., an ISBN).
• The library has a catalog that says where the specific book is placed.
• This indirection is the concept of virtual addressing.
Paging
• Divide the memory into “pages”.
• Instead of referencing physical addresses, the processor references a “page number” and an “offset”.
• A typical page size is 4 KB (12 offset bits).
• A “page table” contains the physical address of the base of each page.
[Figure: the address space of User-1 (pages 0-3) is mapped through User-1’s page table to scattered page frames in physical memory.]
• Page tables make it possible to store the pages of a process non-contiguously.
• They also enable referencing one copy (e.g., of a shared library) from many processes.
Protection
• Each process has its own page table and is allocated different pages in DRAM.
• This isolation keeps them from accessing each other’s memory.
• Sharing is also possible:
• The OS may assign the same physical page to several processes.
• For example, one copy of printf() can be accessed from many VAs.
• Access Protection:
• Protection bits (R, W, X) give the access rights to a page.
• Only the supervisor (OS) can modify them.
Did Virtual Addressing solve our challenges?
• Size: we cannot fit 2^32 bytes on a chip, or a big program in embedded memory → use external memory.
• Protection: how do we keep programs from reading or overwriting each other’s data? → allocate each process its own page table and pages.
• Location Independence: how do we provide it? → each process owns the entire (virtual) address space; page tables translate these addresses into actual memory locations.
• Fragmentation: how do we dynamically allocate enough memory for processes? → virtual addresses allow us to sparsely fill the address space.
• Wasted resources: how do we avoid creating several copies of each library in memory? → point to the same physical address from several virtual addresses.

Address Translation and Demand Paging
Paged Memory Address Translation
• The Operating System keeps track of which process is active.
• It stores the address of that process’s page table in the Page Table Base Register (PTBR).
• The memory management unit (MMU) extracts the virtual page number (VPN) from the virtual address (e.g., just the top 20 bits of a 32-bit address).
• The Physical Page Number (PPN) is stored in the Page Table Entry (PTE).
• The physical address is formed from the PPN (upper bits) and the offset taken from the virtual address (lower bits), i.e., PPN × page size + offset.
• The number of offset bits is determined by the page size (12 bits for 4 KB pages).
• Where do the page tables reside? Also in main memory…
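A minimal C sketch of this single-level translation, assuming 4 KB pages and a 32-bit virtual address as on the slide (the page table is modeled as a plain array rather than memory reached through the PTBR):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES (1u << (32 - PAGE_BITS))   /* 2^20 entries */

typedef struct {
    uint32_t ppn;        /* physical page number */
    unsigned valid : 1;  /* page resident in memory? */
} pte_t;

static pte_t page_table[NUM_PAGES];          /* indexed by PTBR + VPN in hardware */

static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_BITS;       /* top 20 bits */
    uint32_t offset = va & (PAGE_SIZE - 1);  /* bottom 12 bits */
    pte_t pte = page_table[vpn];             /* one extra DRAM access per reference */
    if (!pte.valid)
        return -1;                           /* page fault: handled by the OS */
    *pa = (pte.ppn << PAGE_BITS) | offset;   /* PPN concatenated with offset */
    return 0;
}

int main(void) {
    page_table[0x12345].ppn = 0x00ABC;       /* map one page for the demo */
    page_table[0x12345].valid = 1;
    uint32_t pa;
    if (translate(0x12345678, &pa) == 0)
        printf("PA = 0x%08X\n", pa);         /* -> 0x00ABC678 */
    return 0;
}
```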
Translation Using a Page Table
• Translation:
• PPN = Mem[PTBR + VPN]
• PA = PPN + offset (the PPN concatenated with the page offset)
• On a user switch: PTBR = System PT Base + new User ID
• It requires two DRAM accesses to access one data word or instruction!
[Figure: each user has its own page table in main memory; a system page table holds the base of each user’s page table, selected through the PTBR.]
Demand Paging
• Sometimes there is not enough DRAM to hold all allocated memory.
• For example, the entire virtual memory space of one process is 2^32 = 4 GB!
• Therefore, we can use secondary storage (HDD) to increase our capacity.
• Space allocated on secondary storage is called “swap”.
• Add a “valid bit” to the PTE, which is set (valid=1) when the page is in memory.
• When a page is swapped out, the valid bit is cleared.
• For a swapped-out page, the PTE stores a Disk Page Number (DPN) instead of a PPN.
• This scheme is known as “Demand Paging”.
• An access to a swapped-out page (valid=0) results in a “page fault”.
• The OS then needs to allocate memory in DRAM and copy the page in from storage.
• If out of memory, it first needs to evict a page.
• This is done in software by the OS (and takes millions of cycles).
Page Fault Handling
• Upon addressing a swapped out page, we need to:
• Locate the page on the secondary storage
• Allocate space for the page in main memory
• If out of memory, choose a page to evict
• Swap evicted page with newly allocated page
• Update page table and re-run access instruction
• Done in software by OS
• To reduce the cost of page faults:
• Use fully associative page placement (handled by the OS)
• Add an “access bit” (a.k.a. “use bit”) to the PTE to enable pseudo-LRU replacement
• Add a “dirty bit” to the PTE and only write back a swapped page when it was modified (= dirty)
• Never swap out the page tables of the Operating System
[Figure source: Silberschatz]
Problem: Size of Linear Page Table
• With 32-bit addresses, 4 KB pages and 4-byte Page Table Entries (PTEs):
• 2^20 PTEs, i.e., a 4 MB page table per process
• 4 GB of swap needed to back up the full virtual address space
• Larger pages?
• Internal fragmentation (not all memory in a page is used)
• Larger page fault penalty (more time to read from disk)
• What about a 64-bit virtual address space???
• Even with 1 MB pages, we would need 2^44 8-byte PTEs (2^47 bytes, i.e., 128 TiB of page table!)
Solution: Hierarchical Page Table
• Divide the virtual address into a hierarchy of page tables:
• Bits [31:22] → p1, a 10-bit index into the Level 1 page table
• Bits [21:12] → p2, a 10-bit index into a Level 2 page table
• Bits [11:0] → the 12-bit offset within a 4096 B page
• The root of the current Level 1 page table is held in a register, called the Supervisor Page Table Base Register (SPTBR) in RISC-V.
• Each Level 1 entry covers 4 MB (1024 × 4096 B) and points to a Level 2 page table; each Level 2 entry points to a page that may be in primary memory, in secondary memory, or nonexistent.
• Now every memory access needs to go through several page tables, i.e., several DRAM accesses. This is called a “Page Table Walk”.
RISC-V Sv32 Virtual Addressing Mode
• Virtual Addresses: VPN[1] (10 bits) | VPN[0] (10 bits) | page offset (12 bits)
• VPN[1] indexes the 1st-level page table; VPN[0] indexes the 2nd-level page table.
• The user page table is organized as a two-level tree instead of a linear table with 2^20 entries.
• Physical Addresses: PPN[1] (12 bits) | PPN[0] (10 bits) | page offset (12 bits)
• Both page tables and base pages are 4 KB (1024 × 4 B entries).
• Megapages (4 MB) can be defined by using PPN[0] + offset as the offset.
• 32-bit Page Table Entry: PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V
• The 22 PPN bits give the physical page number of the next-level page table (non-leaf PTE) or of the base page itself (leaf PTE).
• D = Dirty bit, A = Accessed (“recently accessed”) bit, U = User page, X/W/R = Execute/Write/Read permissions, V = Valid bit.
• XWR = 000 means the PTE is a pointer to a next-level page table.
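The two-level walk can be sketched in C roughly as follows, assuming the Sv32 layout above (10+10+12 bit split, 4-byte PTEs, V in bit 0 and the PPN in bits 31:10); megapages and permission checks are omitted, and physical memory is modeled as a small array:

```c
#include <stdint.h>

#define PTE_V        0x1u        /* valid bit */
#define PTE_XWR_MASK 0xEu        /* X|W|R bits: 000 means "pointer to next level" */

static uint32_t pmem[1u << 20];  /* toy 4 MB physical memory, word-indexed */

static uint32_t pmem_read32(uint64_t paddr) {
    return pmem[(paddr >> 2) % (1u << 20)];
}

/* Returns 0 and fills *pa (up to 34 bits) on success, -1 on a page fault. */
int sv32_translate(uint32_t root_ppn, uint32_t va, uint64_t *pa) {
    uint32_t vpn1   = (va >> 22) & 0x3FFu;   /* index into the level-1 table */
    uint32_t vpn0   = (va >> 12) & 0x3FFu;   /* index into the level-2 table */
    uint32_t offset =  va        & 0xFFFu;

    /* Level 1: each table is one 4 KB page holding 1024 x 4-byte PTEs. */
    uint32_t pte1 = pmem_read32(((uint64_t)root_ppn << 12) + 4u * vpn1);
    if (!(pte1 & PTE_V) || (pte1 & PTE_XWR_MASK))   /* invalid, or a 4 MB leaf */
        return -1;                                  /* (megapages not handled) */

    /* Level 2: the PPN field of the level-1 PTE points to the level-2 table. */
    uint64_t l2_ppn = pte1 >> 10;                   /* 22-bit PPN[1]:PPN[0] */
    uint32_t pte0   = pmem_read32((l2_ppn << 12) + 4u * vpn0);
    if (!(pte0 & PTE_V) || !(pte0 & PTE_XWR_MASK))  /* must be a valid leaf */
        return -1;                                  /* -> page fault */

    *pa = ((uint64_t)(pte0 >> 10) << 12) | offset;  /* PPN concatenated with offset */
    return 0;
}
```

Each call performs two reads of (modeled) main memory, which is exactly the “page table walk” cost the next section removes with a TLB.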

The Translation Lookaside Buffer (TLB)
Page-Based Virtual-Memory Machine
[Figure: a 5-stage pipeline (PC → Inst. Cache → Decode → Execute → Memory/Data Cache → Writeback) with an address translation unit in front of each cache; each translation can raise a page fault or protection violation. A hardware page table walker, rooted at the Page-Table Base Register, and a memory controller sit between the caches and main memory (DRAM).]
• On an instruction memory access:
• Translate the VA to a PA
• Exceptions? Invoke the handler (OS)
• Request the PA from the cache
• If cache hit, continue; if cache miss, access DRAM
• Repeat for the data memory access
Translation Lookaside Buffers (TLB)
• Address translation is very expensive!
• In a single-level page table, each reference becomes two memory accesses
• In a two-level page table, each reference becomes three memory accesses
• Solution: Cache some translations in TLB
• TLB hit → Single-Cycle Translation
• TLB miss → Page-Table Walk to refill
[Figure: the virtual address (VPN | offset) is looked up in the TLB; each TLB entry holds a valid bit (V), a dirty bit (D), a tag, and a PPN. On a hit, the physical address is the PPN concatenated with the offset.]
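A minimal C sketch of a fully associative TLB lookup as drawn above (the entry count and field layout are illustrative; a real TLB compares all tags in parallel):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 32
#define PAGE_BITS   12

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t vpn;   /* tag: virtual page number */
    uint32_t ppn;   /* cached translation */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *pa; false means a miss
 * (a page-table walk is then required to refill the TLB). */
bool tlb_lookup(uint32_t va, bool is_write, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_BITS;
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++) {          /* all comparisons happen */
        if (tlb[i].valid && tlb[i].vpn == vpn) {     /* in parallel in hardware */
            if (is_write)
                tlb[i].dirty = true;                 /* mark the page as modified */
            *pa = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;                             /* single-cycle translation */
        }
    }
    return false;                                    /* TLB miss */
}
```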
TLB Access
• Look up VPN in TLB

• TLB Hit
• Take PA from TLB
• Turn on access (ref) bit
• Turn on dirty bit for write

• TLB Miss
• Check if the page is in memory
• If yes, load the translation into the TLB and retry
• If no → Page Fault
[Figure source: Patterson, Hennessy]
TLB Designs
• Specs:
• Typically 16-512 entries (commonly 32-128)
• 0.5-1 cycle for a hit, 10-100 cycles for a miss, 0.01%-1% miss rate
• Usually fully associative:
• Each entry maps a large page, so there is little spatial locality across pages → two entries are more likely to conflict in a set-associative design.
• Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative.
• Larger systems sometimes have multi-level (L1 and L2) TLBs.
• Random or FIFO replacement policy (LRU is too costly).
• “TLB Reach”: the size of the largest virtual address space that can be simultaneously mapped by the TLB (for example, 64 entries × 4 KB pages give 256 KB of reach).
VM-related events in pipeline
• Handling a TLB miss needs a hardware or software mechanism to refill TLB
• Usually done in hardware now
• Handling a page fault (e.g., the page is on disk) needs a precise trap, so the software handler can easily resume after retrieving the page
• Handling a protection violation may abort the process
[Figure: the 5-stage pipeline with an instruction TLB and instruction cache at the fetch stage and a data TLB and data cache at the memory stage; TLB misses, page faults, and protection violations can be raised at either point.]
Address Translation: Putting It All Together
• Virtual Address → TLB Lookup (hardware)
• TLB hit → Protection Check:
• permitted → Physical Address (to cache)
• denied → Protection Fault (seg fault)
• TLB miss → Page Table Walk (hardware or software):
• page is in memory → Update TLB → resume the instruction
• page is not in memory → Page Fault (the OS loads the page, in software) → resume the instruction
Virtual-Memory Machine with TLB
[Figure: the same 5-stage pipeline, now with an instruction TLB in front of the instruction cache and a data TLB in front of the data cache; a hardware page table walker (rooted at the Page-Table Base Register) and the memory controller connect to main memory (DRAM).]
• Access the TLB
• If TLB hit: check permissions and continue
• If TLB miss: perform a page table walk and add the translation to the TLB
• If page fault: invoke the OS
Summary
• To enable a large memory space and multiple processes, use Virtual Memory Addressing.
• However, accessing memory through virtual memory is expensive:
• First, we need to access the page table to find the physical address.
• Then we need to access memory again to retrieve the data from DRAM.
• If the page is not resident in memory, page faults are really bad.
• Make it faster by caching VA-to-PA translations – use a TLB.
• But we still need to access DRAM to retrieve the data…
• Which brings us to the idea of on-chip caches.

Cache Organization

Reminder: Adding Cache to Computer
• A cache exploits data locality to reduce the number of external memory accesses.
• The idea: copy commonly accessed data to an on-chip memory (SRAM).
[Block diagram, as before: the cache sits between the processor and memory on the processor-memory interface; memory also connects to input and output devices over the I/O-memory interfaces.]
The Basics
• Block (aka Cache Line): the unit of transfer between cache and memory; may be multiple words.
• If the accessed data is present in the upper level:
• Hit: the access is satisfied by the upper level.
• Hit Ratio (or Hit Rate): hits/accesses.
• If the accessed data is absent:
• Miss: the block is copied from the lower level; the time taken is the miss penalty.
• Miss Ratio (or Miss Rate): 1 – hit rate.
• The accessed data is then supplied from the upper level.
• Questions to answer: How do we organize the cache? Where does each memory address map to? How do we know which elements are in the cache? How do we quickly locate them?
Anatomy of a Simple Cache
• Say we have a 16 B cache with 4 B (1-word) blocks.
• Cache “Tags”:
• We need a way to tell whether we have a copy of a memory location, so we can decide between hit and miss.
• On a cache miss, put the memory address of the block in the “tag” field of the cache block.
• First idea: compare all four tags to see whether the data is in the cache.
• Cache Replacement: which block do we evict upon a cache miss?
[Figure: a 4-entry cache holding (tag, data) pairs such as (252, 12), (1022, 99), (131, 7), (2041, 20), between a 32-bit processor and memory.]
Bigger Block Size
• Let’s now put two words in a cache line, e.g., a 32 B cache with 8 B blocks.
• Alignment: blocks must be aligned in memory; otherwise we could get the same word twice in the cache.
• Since blocks are aligned, the last 3 bits of a block’s base address are always 000two, so the tag can omit those 3 bits.
• We can get a hit for either word in the block.
• The low 3 bits of the address are used as the byte offset within the block.
• Address layout: tag (29 bits, to check if we have the correct block) | byte offset within the block (3 bits).
Dividing the Cache into “Sets”
• In the previous cache, we need to compare every tag against the processor address, and comparators are expensive!
• New idea:
• Break the cache into “sets”.
• Use index bits to determine which set an address is allowed to map to.
• Compare only the tags in the selected set.
• With two sets, we now need half the number of comparators!
• Address layout: tag (to check if we have the correct block) | index (to select the set) | byte offset (within the block).
Cache Addressing Terminology
• Address layout: tag | set index | byte offset; all fields are read as unsigned integers.
• Offset: specifies which byte within the block we want.
• Set Index: selects which set to search in. Size of Index = log2(number of sets).
• Tag: the remaining bits, used to distinguish between all the memory addresses that map to the same location.
• Size of Tag = address size – Size of Index – log2(number of bytes/block).
How many sets can we have?
• The more sets → the fewer comparators.
• So, what’s the limit?
• At least one block per set.
• So #sets <= #cache blocks
• This is called a Direct Mapped Cache
• Each address can only map to a single set.
• Only need one comparator.
• Simple Example:
• 32 B memory, 8 B cache, 1 byte per line
• Address layout: tt iii → 2-bit tag, 3-bit index, no offset (blocks are only 1 byte)
Direct-Mapped Cache Example
• Suppose we have 8 B of data in a direct-mapped cache with 2-byte blocks.
• Determine the size of the tag, index and offset fields for a 32-bit architecture (RV32).
• Offset:
• Needs to specify the correct byte within a block.
• A block contains 2 bytes = 2^1 bytes → need 1 offset bit.
• Index:
• Needs to specify the correct block in the cache.
• The cache contains 8 B = 2^3 bytes and a block contains 2 B = 2^1 bytes.
• #blocks/cache = (bytes/cache) / (bytes/block) = 2^3 / 2^1 = 2^2 blocks → need 2 index bits.
• Tag: use the remaining bits.
• Tag length = address length – offset – index = 32 – 1 – 2 = 29 bits, so the tag is the 29 MSBs of the memory address.
Example: Larger Block Size
• 32-bit address, 1 KB cache with 64 blocks.
• How many bytes in a cache line?
• 2^10 / 2^6 = 2^4 = 16 B/block → 4 offset bits
• How many index bits?
• 64 blocks = 2^6 → 6 index bits
• How many tag bits?
• 32-bit address → 32 – 6 – 4 = 22 tag bits
• Address layout: Tag [31:10] (22 bits) | Index [9:4] (6 bits) | Offset [3:0] (4 bits)
• To what block number does address 1200ten map?
• Block address = 1200 / 16 = 75
• Block number = 75 modulo 64 = 11
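The same field extraction can be checked with a few lines of C (a hypothetical helper that splits a 32-bit address using the 4 offset bits and 6 index bits of this example):

```c
#include <stdint.h>
#include <stdio.h>

/* 1 KB direct-mapped cache, 64 blocks of 16 B, 32-bit addresses:
 * 4 offset bits, 6 index bits, 22 tag bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  6

static void split_address(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr %u -> tag %u, index (block number) %u, offset %u\n",
           addr, tag, index, offset);
}

int main(void) {
    /* Address 1200: block address = 1200/16 = 75, block number = 75 mod 64 = 11. */
    split_address(1200);
    return 0;
}
```

For address 1200 this prints tag 1, index 11, offset 0, matching the block number computed above.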
Alternative Cache Organizations – Summary
• “Fully Associative”: a block can go anywhere.
• The first design in this lecture; no index field, but one comparator per block.
• No conflicts, but expensive hardware.
• “Direct Mapped”: a block goes in exactly one place.
• Only 1 comparator; number of sets = number of blocks.
• Cheap hardware, but many conflicts.
• “N-way Set Associative”: N possible places for a block.
• Number of sets = number of blocks / N; N comparators.
• Trades off hardware cost against conflicts.
• Fully Associative: N = number of blocks. Direct Mapped: N = 1.
Example: Alternatives in an 8 Block Cache
• Direct Mapped: 8 blocks, 1 way, 1 tag comparator, 8 sets
• Fully Associative: 8 blocks, 8 ways, 8 tag comparators, 1 set
• 2 Way Set Associative: 8 blocks, 2 ways, 2 tag comparators, 4 sets
• 4 Way Set Associative: 8 blocks, 4 ways, 4 tag comparators, 2 sets
[Figure: the same 8 blocks drawn as 8 sets of 1 way (DM), 1 set of 8 ways (FA), 4 sets of 2 ways (2-way SA), and 2 sets of 4 ways (4-way SA).]
Benefits of Set-Associative Caches
• The choice of Direct Mapped versus Set Associative depends on the cost of a miss versus the cost of implementation.
• The largest gains come from going from Direct Mapped to 2-way (20%+ reduction in miss rate).
One More Detail: Valid Bit
• When a new program starts, the cache does not hold valid information for this program.
• We need an indicator of whether a tag entry is valid for this program.
• Add a “valid bit” to the cache tag entry:
• 0 → cache miss, even if by chance address = tag
• 1 → cache hit, if processor address = tag

Caching Policies and Performance
Write Policy
• How do we make sure the cache and memory have the same values on writes?
• Write-Through Policy:
• Write to the cache and also write through the cache to memory
• Too slow on its own, so include a Write Buffer to allow the processor to continue
• Write buffer may have multiple entries to absorb bursts of writes
• Write-Back Policy:
• Write only to cache. Write block back to memory when evicted.
• Only single write to memory per block
• Need to specify if block was changed → include “Dirty Bit”
• What do you do on a write miss?
• Usually Write Allocate → First fetch the block, then write and set dirty bit.

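A small C sketch contrasting the two policies on a single cache line (all structure and function names are illustrative, not the lecture's exact hardware; the backing store is a toy array):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 64

static uint8_t main_memory[1 << 16];               /* toy 64 KB backing store */

static void memory_write(uint32_t addr, uint8_t byte) {
    main_memory[addr & 0xFFFF] = byte;
}

typedef struct {
    bool     valid;
    bool     dirty;                                 /* only used by write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

/* Write-through: update the line and immediately forward the write to memory
 * (real hardware adds a write buffer so the processor need not stall). */
void write_through(cache_line_t *line, uint32_t addr, uint8_t byte) {
    line->data[addr % BLOCK_SIZE] = byte;
    memory_write(addr, byte);
}

/* Write-back: update only the cache and mark the line dirty;
 * memory is updated once, when the dirty line is eventually evicted. */
void write_back(cache_line_t *line, uint32_t addr, uint8_t byte) {
    line->data[addr % BLOCK_SIZE] = byte;
    line->dirty = true;
}

/* Eviction under write-back: write the whole block back only if it is dirty. */
void evict(cache_line_t *line, uint32_t block_addr) {
    if (line->dirty)
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            memory_write(block_addr + i, line->data[i]);
    line->valid = false;
    line->dirty = false;
}
```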
Write-Through vs. Write-Back
• Write-Through:
• Simpler control logic
• More predictable timing simplifies the processor control logic
• Easier to make reliable, since memory always has a copy of the data (big idea: Redundancy!)
• Write-Back:
• More complex control logic
• More variable timing (0, 1, or 2 memory accesses per cache access)
• Usually reduces write traffic
• Harder to make reliable, since sometimes the cache has the only copy of the data
Cache Performance
• Hit rate: the fraction of accesses that hit in the cache.
• Miss rate: 1 – hit rate.
• Miss penalty: the time to bring a block from a lower level of the memory hierarchy into the cache.
• Hit time: the time to access the cache memory (including the tag comparison).
• Average Memory Access Time (AMAT): the average time to access memory.
• AMAT = Time for a hit + Miss rate × Miss penalty
• To reduce AMAT → reduce hit time, reduce miss rate, and/or reduce miss penalty.
• Balance the cache parameters: capacity, associativity, block size.
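A one-line helper makes the formula concrete (the hit time, miss rate and miss penalty below are illustrative numbers, not from the lecture):

```c
#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty */
static double amat(double hit_time_cycles, double miss_rate, double miss_penalty_cycles) {
    return hit_time_cycles + miss_rate * miss_penalty_cycles;
}

int main(void) {
    /* e.g., 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6 cycles on average */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));
    return 0;
}
```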
Cache Replacement Policy
• Random Replacement
• Hardware randomly selects a cache entry to evict.
• Least-Recently Used (LRU)
• Hardware keeps track of the access history.
• Replace the entry that has not been used for the longest time.
• For a 2-way set-associative cache, one bit per set suffices for LRU replacement.
• Example of a simple “pseudo-LRU” implementation (“not-most-recently used”), sketched below:
• Assume 64 fully associative entries.
• A hardware replacement pointer points to one cache entry.
• Whenever an access is made to the entry that the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer.
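A minimal C sketch of that pointer scheme (the entry count comes from the slide's example; everything else is illustrative):

```c
#include <stdint.h>

#define ENTRIES 64

static unsigned victim = 0;     /* hardware replacement pointer */

/* Called on every cache access to entry 'index': the pointer only advances
 * when the entry it points to is used, so it never designates the most
 * recently used entry as the next victim. */
void on_access(unsigned index) {
    if (index == victim)
        victim = (victim + 1) % ENTRIES;   /* move off the just-used entry */
    /* otherwise: do not move the pointer */
}

/* Called on a miss: the entry currently pointed to is evicted and reused. */
unsigned choose_victim(void) {
    return victim;
}
```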
The 3 C’s of Cache Misses
• Compulsory (cold start, first reference):
• The first access to a block; there is not a lot you can do about it.
• If running billions of instructions, compulsory misses are insignificant.
• How to simulate: set the cache size to infinity, make it fully associative, and count the number of misses.
• Capacity:
• The cache cannot contain all blocks accessed by the program.
• Misses that would not occur with an infinite cache.
• How to simulate: shrink the cache size from infinity and count the misses added by each reduction in size.
• Conflict (collision):
• Multiple memory locations map to the same cache set.
• Misses that would not occur with an ideal fully associative cache.
• How to simulate: change from fully associative to n-way set associative while counting misses.
Cache Design Trade-Offs
• Increase cache size → decreases capacity misses; may increase access time.
• Increase associativity → decreases conflict misses; may increase access time.
• Increase block size → decreases compulsory misses; increases the miss penalty, and for very large block sizes may increase the miss rate due to pollution.
Multilevel Caches
• To improve cache performance, use a hierarchy of caches (Level 1, Level 2, Level 3, …, Level n).
• Local Miss Rate:
• The fraction of accesses at a given cache level that miss.
• Local miss rate of the L2$ = L2$ misses / L1$ misses = L2$ misses / total L2$ accesses.
• Global Miss Rate:
• The fraction of all accesses whose misses go all the way to memory.
• Global miss rate = LN$ misses / total accesses.
• The LN$ local miss rate is much higher (>>) than the global miss rate.
Multilevel Cache Considerations
• There are different design considerations for the L1$ and the L2$:
• L1$: focus on fast access, minimize hit time → smaller cache.
• L2$, L3$: focus on low miss rate, to reduce the penalty of main memory access times → larger cache, larger block sizes, higher levels of associativity.
• The miss penalty of the L1$ is significantly reduced by the presence of the L2$, so the L1$ can be smaller and faster even with a higher miss rate.
Multilevel Cache Example
• Given:
• CPU base CPI = 1, clock rate = 4 GHz (0.25 ns cycle time)
• Miss rate per instruction = 2%
• Main memory access time = 100 ns
• With just the primary cache:
• Miss penalty = 100 ns / 0.25 ns = 400 cycles
• Effective CPI = 1 + 0.02 × 400 = 9
• Now add an L2 cache:
• Access time = 5 ns
• Global miss rate to main memory = 0.5%
• Primary miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
• Primary miss with L2 miss: extra penalty = 400 cycles
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9 / 3.4 = 2.6
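The arithmetic above can be checked with a short C program (the numbers are taken directly from the slide):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double l1_miss_rate = 0.02, global_miss_rate = 0.005;
    double l2_hit_penalty = 20.0, mem_penalty = 400.0;

    double cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty;          /* = 9.0 */
    double cpi_with_l2 = base_cpi + l1_miss_rate * l2_hit_penalty
                                  + global_miss_rate * mem_penalty;      /* = 3.4 */

    printf("CPI (L1 only) = %.1f\n", cpi_l1_only);
    printf("CPI (L1+L2)   = %.1f\n", cpi_with_l2);
    printf("Speedup       = %.1f\n", cpi_l1_only / cpi_with_l2);         /* ~2.6 */
    return 0;
}
```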
Cache Blocks vs. VM Pages
• In caches, we dealt with individual blocks, usually ~64 B on modern systems.
• In VM, we deal with individual pages, usually ~4 KB on modern systems.
• Comparison (Cache vs. Virtual Memory):
• Unit: Block or Line vs. Page
• On absence: Miss vs. Page Fault
• Unit size: 32-64 B vs. 4 KB-8 KB
• Placement: Direct Mapped or N-way Set Associative vs. Fully Associative
• Replacement: LRU or Random vs. LRU
• Write policy: Write Through or Write Back vs. Write Back
• Common point of confusion: bytes, words, blocks and pages are all just different ways of looking at memory!
Bytes, Words, Blocks, Pages
• Example: a 16 KiB DRAM with 4 KB pages (for VM), 128 B blocks (for caches) and 4 B words (for lw/sw).
• We can think of the memory as: 4 pages, or 128 blocks, or 4,096 words, or 16,384 bytes.
• We can think of one page as: 32 blocks, or 1,024 words.
[Figure: the 16 KiB memory drawn as 4 pages (Page 0-3), each page as 32 blocks (Block 0-31), and each block as 32 words (Word 0-31).]
References
• Patterson, Hennessy, “Computer Organization and Design – The RISC-V Edition”
• Berkeley CS 61C
• MIT 6.175
