Chapter 5 discusses the principles of locality in memory access, emphasizing temporal and spatial locality to optimize memory hierarchy through various levels including SRAM, DRAM, and disk storage. It explains cache memory operations, including direct-mapped cache, hit/miss mechanisms, and the impact of block size on cache performance. The chapter also covers strategies for handling cache misses and write operations, such as write-through and write-back techniques.


COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
5th Edition

Chapter 5
Large and Fast:
Exploiting Memory
Hierarchy
§5.1 Introduction
Principle of Locality
■ Programs access a small proportion of
their address space at any time
■ Temporal locality
■ Items accessed recently are likely to be
accessed again soon
■ e.g., instructions in a loop, induction variables
■ Spatial locality
■ Items near those accessed recently are likely
to be accessed soon
■ E.g., sequential instruction access, array data
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 2
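To make the two kinds of locality concrete, here is a minimal C sketch (illustrative, not from the slides): the loop variable and the running sum are reused on every iteration (temporal locality), while the array elements are touched in consecutive order (spatial locality).

#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];   /* i and sum are reused each iteration (temporal locality);
                          a[0], a[1], ... are accessed sequentially (spatial locality),
                          so each cache block fetched for a[] is fully used */

    printf("sum = %ld\n", sum);
    return 0;
}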
Taking Advantage of Locality
■ Memory hierarchy
■ Store everything on disk
■ Copy recently accessed (and nearby)
items from disk to smaller DRAM memory
■ Main memory
■ Copy more recently accessed (and
nearby) items from DRAM to smaller
SRAM memory
■ Cache memory attached to CPU

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 3


Memory Hierarchy Levels
■ Block (aka line): unit of copying
■ May be multiple words
■ If accessed data is present in
upper level
■ Hit: access satisfied by upper level
■ Hit ratio: hits/accesses
■ If accessed data is absent
■ Miss: block copied from lower level
■ Time taken: miss penalty
■ Miss ratio: misses/accesses
= 1 – hit ratio
■ Then accessed data supplied from
upper level

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 4


§5.2 Memory Technologies
Memory Technology
■ Static RAM (SRAM)
■ 0.5ns – 2.5ns, $2000 – $5000 per GB
■ Dynamic RAM (DRAM)
■ 50ns – 70ns, $20 – $75 per GB
■ Magnetic disk
■ 5ms – 20ms, $0.20 – $2 per GB
■ Ideal memory
■ Access time of SRAM
■ Capacity and cost/GB of disk

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 5


§6.3 Disk Storage
Disk Storage
■ Nonvolatile, rotating magnetic storage

Chapter 6 — Storage and Other I/O Topics — 6


Disk Sectors and Access
■ Each sector records
■ Sector ID
■ Data (512 bytes, 4096 bytes proposed)
■ Error correcting code (ECC)
■ Used to hide defects and recording errors
■ Synchronization fields and gaps
■ Access to a sector involves
■ Queuing delay if other accesses are pending
■ Seek: move the heads
■ Rotational latency
■ Data transfer
■ Controller overhead

Chapter 6 — Storage and Other I/O Topics — 7


Disk Access Example
■ Given
■ 512B sector, 15,000rpm, 4ms average seek
time, 100MB/s transfer rate, 0.2ms controller
overhead, idle disk
■ Average read time
■ 4ms seek time
+ ½ / (15,000/60) = 2ms rotational latency
+ 512 / 100MB/s = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms
■ If actual average seek time is 1ms
■ Average read time = 3.2ms

Chapter 6 — Storage and Other I/O Topics — 8
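The same arithmetic as a small C sketch, with the example's values hard-coded; the half-revolution average rotational latency is the usual assumption.

#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                           /* average seek time            */
    double rpm           = 15000.0;
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0;   /* half a revolution = 2 ms     */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;        /* 512 B at 100 MB/s ~ 0.005 ms */
    double controller_ms = 0.2;

    printf("average read time = %.3f ms\n",
           seek_ms + rotation_ms + transfer_ms + controller_ms);   /* ~ 6.2 ms */
    return 0;
}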


§5.3 The Basics of Caches
Cache Memory
■ Cache memory
■ The level of the memory hierarchy closest to
the CPU
■ Given accesses X1, …, Xn–1, Xn

■ How do we know if
the data is present?
■ Where do we look?

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 9


Direct Mapped Cache
■ Location determined by address
■ Direct mapped: only one choice
■ (Block address) modulo (#Blocks in cache)

■ #Blocks is a
power of 2
■ Use low-order
address bits

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 10
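A minimal C sketch of the mapping: because the block count is a power of 2, the modulo operation is just a selection of low-order address bits. The 4-bit offset and 10-bit index below are illustrative choices, not values fixed by the slide.

#include <stdio.h>
#include <stdint.h>

/* Illustrative geometry: 16-byte blocks (4 offset bits), 1024 blocks (10 index bits). */
#define OFFSET_BITS 4
#define INDEX_BITS  10

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte within the block            */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* low-order bits of block address  */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* remaining high-order bits        */
    printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n", addr, tag, index, offset);
    return 0;
}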


Tags and Valid Bits
■ How do we know which particular block is
stored in a cache location?
■ Store block address as well as the data
■ Actually, only need the high-order bits
■ Called the tag
■ What if there is no data in a location?
■ Valid bit: 1 = present, 0 = not present
■ Initially 0

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 11


Cache Example
■ 8 blocks, 1 word/block, direct mapped
■ Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16
■ Initial state

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 12


Cache Example
  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 13


Cache Example
  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 14


Cache Example
  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 15


Cache Example
  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 16


Cache Example
  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]  (replaces tag 11, Mem[11010])
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 17


Address Subdivision

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 18




Total Bits in a Cache (Direct-Mapped)
■ 32-bit byte addresses
■ The cache size is 2^n blocks, so n bits are used for the index
■ The block size (b) is 2^m words (2^(m+2) bytes = 2^(m+5) bits)
  ■ m bits are used for the word within the block
  ■ two bits are used for the byte part of the address
■ Tag field size: t = 32 - (n + m + 2)
■ Valid field size: v = 1
■ Total number of bits:
  C = 2^n × (b + t + v)
    = 2^n × (2^(m+5) + 32 - (n + m + 2) + 1)
    = 2^n × (2^m × 32 + 32 - n - m - 1)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 19


Total Bits in a Cache: Example
■ How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks (i.e., 16 bytes), assuming a 32-bit address?
■ 16 KiB = 4 Ki words = 1 Ki blocks = 2^10 blocks, so n = 10
■ 4 words per block, so m = 2
■ C = 1024 × (b + t + v)
    = 1024 × (4 × 32 + t + 1)
    = 1024 × (4 × 32 + 18 + 1)
    = 147 Kbits
■ Address fields: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 20
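A quick C check of the formula for this configuration (n = 10, m = 2), confirming the 147 Kbit total.

#include <stdio.h>

int main(void) {
    int  n = 10;                          /* 2^10 = 1024 blocks        */
    int  m = 2;                           /* 2^2  = 4 words per block  */
    long blocks    = 1L << n;
    long data_bits = (1L << m) * 32;      /* 128 data bits per block   */
    long tag_bits  = 32 - (n + m + 2);    /* 18 tag bits               */
    long total     = blocks * (data_bits + tag_bits + 1 /* valid bit */);

    printf("total = %ld bits = %ld Kbits\n", total, total / 1024);      /* 147 Kbits */
    return 0;
}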


Example: Larger Block Size
■ 64 blocks, 16 bytes/block
■ To what block number does byte address 1200 map?
■ Block address = ⌊byte address / bytes per block⌋ = ⌊1200/16⌋ = 75
■ Block number = 75 modulo 64 = 11
■ Address fields: Tag = bits 31-10 (22 bits), Index = bits 9-4 (6 bits, 64 blocks), Offset = bits 3-0 (4 bits, 16 bytes)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 21


Example: Larger Block Size
■ 64 blocks, 16 bytes/block
■ To what block number does byte address 1200 map?
■ Block address = ⌊1200/16⌋ = 75
■ Block number = 75 modulo 64 = 11
■ In fact, block 11 maps all addresses between 1200 and 1215
■ Address fields: Tag = 22 bits, Index = 6 bits, Offset = 4 bits
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 22
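The block-mapping arithmetic from the last two slides as a quick C check (geometry taken from the example).

#include <stdio.h>

int main(void) {
    unsigned bytes_per_block = 16, num_blocks = 64;
    unsigned addr       = 1200;
    unsigned block_addr = addr / bytes_per_block;     /* = 75 */
    unsigned block_num  = block_addr % num_blocks;    /* = 11 */

    printf("address %u -> block address %u -> cache block %u\n", addr, block_addr, block_num);
    printf("that block covers byte addresses %u..%u\n",           /* 1200..1215 */
           block_addr * bytes_per_block,
           block_addr * bytes_per_block + bytes_per_block - 1);
    return 0;
}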


Block Size Considerations
■ Larger blocks should reduce miss rate
■ Due to spatial locality
■ But in a fixed-sized cache
■ Larger blocks ⇒ fewer of them
■ More competition ⇒ increased miss rate
■ Larger miss penalty
■ Larger blocks ⇒ Larger transfer time
■ Can override benefit of reduced miss rate
■ Early restart and critical-word-first can help

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 23




Early Restart
■ Resume execution as soon as the requested word of the block is returned; do not wait for the entire block
■ Works best for instruction accesses
  ■ Instruction accesses are largely sequential
  ■ If the memory system can deliver a word every clock cycle, the processor may be able to restart operation when the requested word is returned, with the memory system delivering new instruction words just in time
■ This technique is usually less effective for data caches, because the words are likely to be requested from the block in a less predictable order

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 24


Critical Word First
■ Organizes the memory so that the
requested word is transferred from the
memory to the cache first.
■ The remainder of the block is then
transferred, starting with the address after
the requested word and wrapping
around to the beginning of the block.
■ Can be slightly faster than early restart
■ but it is limited by the same properties that
limit early restart.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 25
Cache Misses
■ On cache hit, CPU proceeds normally
■ On cache miss
■ Stall the CPU pipeline
■ Fetch block from next level of hierarchy
■ Instruction cache miss
■ Restart instruction fetch
■ Data cache miss
■ Complete data access

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 26


Write-Through
■ On data-write hit, could just update the block in
cache
■ But then cache and memory would be inconsistent
■ Write through: also update memory
■ But makes writes take longer
■ e.g., if base CPI = 1, 10% of instructions are stores,
write to memory takes 100 cycles
■ Effective CPI = 1 + 0.1×100 = 11
■ Solution: write buffer
■ Holds data waiting to be written to memory
■ CPU continues immediately
■ Only stalls on write if write buffer is already full

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 27


Write-Back
■ Alternative: On data-write hit, just update
the block in cache
■ Keep track of whether each block is dirty
■ When a dirty block is replaced
■ Write it back to memory
■ Can use a write buffer to allow replacing block
to be read first

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 28


Write Allocation
■ What should happen on a write miss?
■ Alternatives for write-through:
  ■ Write allocate (allocate on miss): fetch the block
  ■ No write allocate (write around): don't fetch the block
    ■ Since programs often write a whole block before reading it (e.g., initialization)
■ For write-back
  ■ Usually fetch the block
■ (See next slides)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 29


Advantage for Write-Through
■ We can write the data into the cache and then read the tag
■ If the tag mismatches, a miss has occurred
■ Because the cache is write-through, overwriting the block in the cache is not catastrophic
  ■ Memory still has the correct value
■ THIS CANNOT BE DONE FOR WRITE-BACK
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 30
Write Back
■ If we have a cache miss, we must first
write the block back to memory if the data
in the cache is modified.
■ stores require two cycles:
■ a cycle to check for a hit
■ followed by a cycle to actually perform the
write
■ Alternative: Write Buffer to hold that data

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 31


Write-Back: Write Buffer
■ A write buffer holds the data to be written
■ This effectively allows the store to take only one cycle by pipelining it:
■ When a store buffer is used, the processor
does the cache lookup and places the data in
the store buffer during the normal cache
access cycle.
■ Assuming a cache hit, the new data is written
from the store buffer into the cache on the
next unused cache access cycle.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 32


Write-Back: Write Buffer for a Miss
■ the modified block is moved to a
write-back buffer associated with the
cache in case of a miss
■ while the requested block is read
from memory.
■ The write-back buffer is later written back to
memory.
■ Assuming another miss does not occur
immediately, this technique reduces the
miss penalty
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 33
Example: Intrinsity FastMATH
■ Embedded MIPS processor
■ 12-stage pipeline
■ Instruction and data access on each cycle
■ Split cache: separate I-cache and D-cache
■ Each 16KB: 256 blocks × 16 words/block
■ D-cache: write-through or write-back
■ SPEC2000 miss rates
■ I-cache: 0.4%
■ D-cache: 11.4%
■ Weighted average: 3.2%

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 34


Example: Intrinsity FastMATH
256 blocks × 16 words/block

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 35


§5.4 Measuring and Improving Cache Performance
Measuring Cache Performance
■ CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 36


Measuring Cache Performance
■ For a write-through cache:
  ■ Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
  ■ Read-stall cycles = (Reads/Program) × Read miss rate × Read miss penalty
  ■ Write-stall cycles = (Writes/Program) × Write miss rate × Write miss penalty + Write buffer stalls
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 37


Notes: Write Buffer Stalls
■ Write buffer stalls depend on the proximity of writes, not just their frequency
  ■ So it is not easy to deduce a simple equation for them
■ Usually we can ignore write buffer stalls in systems with
  ■ a write buffer of 4 or more words depth, and
  ■ a memory capable of accepting writes at a rate that significantly exceeds the average write frequency in programs (by a factor of 2)
■ If a system did not meet these criteria, it would not be well designed; instead, the designer should have used either a deeper write buffer or a write-back organization
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 38
§5.4 Measuring and Improving Cache Performance
Measuring Cache Performance
■ With simplifying assumptions:
■ read and write miss penalties are same
■ In most write-through schemes this is the case
■ Write buffer stalls are negligible

■ Memory-stall cycles:
  ■ For data: Data-miss cycles = (Data accesses/Program) × Data miss rate × Miss penalty
  ■ For instructions: Instruction-miss cycles = (Instructions/Program) × Instruction miss rate × Miss penalty

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 39


Cache Performance Example
■ Given:
  ■ Miss rates: I-cache = 2%, D-cache = 4%
  ■ Miss penalty = 100 cycles
  ■ Base CPI (ideal cache) = 2
  ■ Loads and stores are 36% of instructions
■ How much faster is a processor with a perfect cache?
■ Say the total instruction count is I
  ■ I-cache miss cycles: I × 0.02 × 100 = 2.00 × I
  ■ D-cache miss cycles: I × 0.36 × 0.04 × 100 = 1.44 × I
■ Actual CPI = 2 + 2 + 1.44 = 5.44
■ Ideal CPU is 5.44/2 = 2.72 times faster

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 40
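The CPI arithmetic above as a quick C check (miss rates and penalty from the example).

#include <stdio.h>

int main(void) {
    double base_cpi = 2.0;
    double i_stalls = 0.02 * 100;           /* I-cache stall cycles per instruction = 2.00 */
    double d_stalls = 0.36 * 0.04 * 100;    /* D-cache stall cycles per instruction = 1.44 */
    double cpi      = base_cpi + i_stalls + d_stalls;

    printf("actual CPI = %.2f, perfect-cache speedup = %.2f\n",
           cpi, cpi / base_cpi);            /* 5.44 and 2.72 */
    return 0;
}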


Amdahl’s Law
■ A rule stating that the performance
enhancement possible with a given
improvement is limited by the amount that
the improved feature is used.
■ What happens if the processor is made
faster, but the memory system is not?
■ The amount of time spent on memory stalls
will take up an increasing fraction of the
execution time

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 41


Cache Performance Example 2
■ Given:
  ■ Miss rates: I-cache = 2%, D-cache = 4%
  ■ Miss penalty = 100 cycles
  ■ Base CPI (ideal cache) = 1
  ■ Loads and stores are 36% of instructions
■ How much faster is a processor with a perfect cache?
■ Say the total instruction count is I
  ■ I-cache miss cycles: I × 0.02 × 100 = 2.00 × I
  ■ D-cache miss cycles: I × 0.36 × 0.04 × 100 = 1.44 × I
■ Actual CPI = 1 + 2 + 1.44 = 4.44
■ Ideal CPU is 4.44/1 = 4.44 times faster

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 42


Lesson (re)Learned
■ Actual CPI dropped from 5.44 to 4.44, but the slowdown relative to a perfect cache grew from 2.72× to 4.44×
■ The faster the processor (lower base CPI), the larger the fraction of performance lost to memory stalls

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 43


Average Access Time
■ Hit time is also important for performance
■ Average memory access time (AMAT)
■ AMAT = Hit time + Miss rate × Miss penalty
■ Example
■ CPU with 1ns clock, hit time = 1 cycle, miss
penalty = 20 cycles, I-cache miss rate = 5%
■ AMAT = 1 + 0.05 × 20 = 2ns
■ 2 cycles per instruction

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 44
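The AMAT formula as a small C helper, applied to the example values (times expressed in cycles, 1 ns per cycle).

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double cycles = amat(1.0, 0.05, 20.0);                        /* = 2 cycles */
    printf("AMAT = %.1f cycles = %.1f ns with a 1 ns clock\n", cycles, cycles);
    return 0;
}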


Performance Summary
■ When CPU performance increased
■ Miss penalty becomes more significant
■ Decreasing base CPI
■ Greater proportion of time spent on memory
stalls
■ Increasing clock rate
■ Memory stalls account for more CPU cycles
■ Can’t neglect cache behavior when
evaluating system performance

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 45


Associative Caches
■ Fully associative
■ Allow a given block to go in any cache entry
■ Requires all entries to be searched at once
■ Comparator per entry (expensive)
■ n-way set associative
■ Each set contains n entries
■ Block number determines which set
■ (Block number) modulo (#Sets in cache)
■ Search all entries in a given set at once
■ n comparators (less expensive)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 46
Associative Cache Example

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 47


1-Way Set Associative

How many bits? 3 bits


■ A cache block can only go in one spot in the cache.
■ It makes a cache block very easy to find
■ but it's not very flexible about where to put the blocks.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 48


2-Way Set Associative

How many bits? 2 bits


■ This cache is made up of sets that can fit two blocks
each.
■ The index is now used to find the set
■ The tag helps find the block within the set.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 49
4-Way Set Associative

How many bits? 1 bit

■ Each set here fits four blocks,


■ So there are fewer sets.
■ As such, fewer index bits are needed.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 50
Fully Associative

How many bits? 0 bit


■ No index is needed, since a cache block can go anywhere in
the cache.
■ Every tag must be compared when finding a block in the
cache
■ but block placement is very flexible!
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 51
Spectrum of Associativity
■ For a cache with 8 entries

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 52


Associativity Example
■ Compare 4-block caches
■ Direct mapped, 2-way set associative,
fully associative
■ Block access sequence: 0, 8, 0, 6, 8
■ Direct mapped
  Block addr  Index (mod 4)  Hit/miss  Cache content after access (indexes 0..3)
  0           0              miss      Mem[0], -, -, -
  8           0              miss      Mem[8], -, -, -
  0           0              miss      Mem[0], -, -, -
  6           2              miss      Mem[0], -, Mem[6], -
  8           0              miss      Mem[8], -, Mem[6], -

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 53


Associativity Example
■ Compare 4-block caches
■ Direct mapped, 2-way set associative,
fully associative
■ Block access sequence: 0, 8, 0, 6, 8
■ 2-way set associative
  Block addr  Set (mod 2)  Hit/miss  Cache content after access (set 0 | set 1)
  0           0            miss      Mem[0]         |
  8           0            miss      Mem[0], Mem[8] |
  0           0            hit       Mem[0], Mem[8] |
  6           0            miss      Mem[0], Mem[6] |
  8           0            miss      Mem[8], Mem[6] |

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 54


Associativity Example
■ Compare 4-block caches
■ Direct mapped, 2-way set associative,
fully associative
■ Block access sequence: 0, 8, 0, 6, 8
■ Fully associative

  Block addr  Hit/miss  Cache content after access
  0           miss      Mem[0]
  8           miss      Mem[0], Mem[8]
  0           hit       Mem[0], Mem[8]
  6           miss      Mem[0], Mem[8], Mem[6]
  8           hit       Mem[0], Mem[8], Mem[6]

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 55
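A compact C sketch that simulates the direct-mapped case for the sequence 0, 8, 0, 6, 8 and reproduces the 5 misses in the first table; the 2-way and fully associative variants behave as in the other tables but need LRU bookkeeping, which is omitted here.

#include <stdio.h>

#define CACHE_BLOCKS 4

int main(void) {
    int seq[]   = {0, 8, 0, 6, 8};
    int tag[CACHE_BLOCKS];
    int valid[CACHE_BLOCKS] = {0};
    int misses  = 0;

    for (int i = 0; i < 5; i++) {
        int blk = seq[i];
        int idx = blk % CACHE_BLOCKS;          /* direct mapped: only one possible slot */
        if (valid[idx] && tag[idx] == blk) {
            printf("block %d: hit  (index %d)\n", blk, idx);
        } else {
            printf("block %d: miss (index %d)\n", blk, idx);
            valid[idx] = 1;
            tag[idx]   = blk;                  /* store the block address as the tag */
            misses++;
        }
    }
    printf("%d misses out of 5 accesses\n", misses);   /* 5, matching the table */
    return 0;
}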


Set Associative Cache Organization

■ 4-way set-associative organization
■ An alternate implementation: remove the multiplexor and use enable signals

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 56


Replacement Policy
■ Direct mapped: no choice
■ Set associative
■ Prefer non-valid entry, if there is one
■ Otherwise, choose among entries in the set
■ Least-recently used (LRU)
■ Choose the one unused for the longest time
■ Simple for 2-way, manageable for 4-way, too hard
beyond that
■ Random
■ Gives approximately the same performance as
LRU for high associativity

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 57


Tags versus Set Associativity
■ Cache of 4096 blocks
■ a 4-word block size
■ 32-bit address,
■ Find the total number of sets and the total
number of tag bits for caches that are
■ direct mapped
■ two-way set associative
■ four-way set associative
■ fully associative.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 58


Tags versus Set Associativity
■ 4096 blocks; 32-bit address
■ 4-word block = 16 (2^4) bytes per block
■ So tag + index = 32 - 4 = 28 bits
■ Address fields: 16-bit tag | 12-bit index | 4-bit offset
■ Direct mapped:
  ■ 4096 (2^12) 1-way sets => 12-bit index
  ■ 16-bit tag (28 - 12)
  ■ 4096 entries × 16-bit tag = 64 Kbits of tag

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 59


Tags versus Set Associativity
■ 4096 blocks; 32-bit address
■ 4-word block = 16 (2^4) bytes per block
■ So tag + index = 32 - 4 = 28 bits
■ Address fields: 17-bit tag | 11-bit index | 4-bit offset
■ 2-way set associative:
  ■ 4096/2 = 2048 (2^11) sets => 11-bit index
  ■ 17-bit tag (28 - 11)
  ■ 4096 entries × 17-bit tag = 68 Kbits of tag

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 60


Tags versus Set Associativity
■ 4096 blocks; 32-bit address
■ 4-word block = 16 (2^4) bytes per block
■ So tag + index = 32 - 4 = 28 bits
■ Address fields: 18-bit tag | 10-bit index | 4-bit offset
■ 4-way set associative:
  ■ 4096/4 = 1024 (2^10) sets => 10-bit index
  ■ 18-bit tag (28 - 10)
  ■ 4096 entries × 18-bit tag = 72 Kbits of tag

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 61


Tags versus Set Associativity
■ 4096 blocks; 32-bit address
■ 4-word block = 16 (2^4) bytes per block
■ So tag + index = 32 - 4 = 28 bits
■ Address fields: 28-bit tag | 4-bit offset
■ Fully associative:
  ■ No index
  ■ 28-bit tag (28 - 0)
  ■ 4096 entries × 28-bit tag = 112 Kbits of tag

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 62


Tags versus Set Associativity
■ 4096 blocks; 4-word blocks; 32-bit address
■ 4-word block = 16 (2^4) bytes per block
■ So tag + index = 32 - 4 = 28 bits (plus a 4-bit offset)

  Associativity              Total tag bits (Kbits)
  Direct mapped (1-way)       64
  2-way set associative       68
  4-way set associative       72
  Fully associative          112

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 63
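A small C sketch reproducing the table: with 4096 blocks and 28 bits to split between tag and index, each doubling of associativity removes one index bit and adds one tag bit per entry.

#include <stdio.h>

int main(void) {
    int blocks = 4096;
    int tag_plus_index = 28;                     /* 32-bit address minus 4-bit block offset */
    int ways[] = {1, 2, 4, 4096};                /* 4096-way == fully associative           */

    for (int i = 0; i < 4; i++) {
        int sets = blocks / ways[i];
        int index_bits = 0;
        for (int s = sets; s > 1; s >>= 1)       /* log2(sets): 12, 11, 10, 0 */
            index_bits++;
        int tag_bits = tag_plus_index - index_bits;
        printf("%4d-way: %2d index bits, %2d tag bits, %3d Kbits of tags\n",
               ways[i], index_bits, tag_bits, blocks * tag_bits / 1024);
    }
    return 0;
}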


Multilevel Caches
■ Primary cache attached to CPU
■ Small, but fast
■ Level-2 cache services misses from
primary cache
■ Larger, slower, but still faster than main
memory
■ Main memory services L-2 cache misses
■ Some high-end systems include L-3 cache

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 64


Multilevel Cache Example
■ Given
■ CPU base CPI = 1, clock rate = 4GHz
■ Miss rate/instruction @ primary cache = 2%
■ Main memory access time = 100ns
■ 4 GHz => 0.25ns cycle length
■ With just primary cache
■ Miss penalty = 100ns/0.25ns = 400 cycles
■ Effective CPI = 1 + 0.02 × 400 = 9

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 65


Example (cont.)
■ Now add an L-2 cache (CPU base CPI = 1, clock rate = 4 GHz)
  ■ Access time = 5 ns
  ■ Global miss rate to main memory = 0.5%
■ Primary miss with L-2 hit
  ■ Penalty = 5 ns / 0.25 ns = 20 cycles
■ Primary miss with L-2 miss
  ■ Extra penalty = 400 cycles (primary miss rate = 2%)
■ CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
■ Performance ratio = 9/3.4 = 2.6

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 66
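The two-level CPI arithmetic as a quick C check (values from the example; the 400-cycle main-memory penalty comes from the previous slide).

#include <stdio.h>

int main(void) {
    double base_cpi       = 1.0;
    double l1_miss_rate   = 0.02;
    double l2_hit_penalty = 20.0;     /* 5 ns / 0.25 ns   */
    double mem_miss_rate  = 0.005;    /* global miss rate */
    double mem_penalty    = 400.0;    /* 100 ns / 0.25 ns */

    double cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty;          /* = 9   */
    double cpi_with_l2 = base_cpi + l1_miss_rate * l2_hit_penalty
                                  + mem_miss_rate * mem_penalty;         /* = 3.4 */

    printf("CPI (L1 only) = %.1f, CPI (with L2) = %.1f, ratio = %.1f\n",
           cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);         /* 2.6x  */
    return 0;
}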


Multilevel Cache Considerations
■ Primary cache
■ Focus on minimal hit time
■ L-2 cache
■ Focus on low miss rate to avoid main memory
access
■ Hit time has less overall impact
■ Results
■ L-1 cache usually smaller than a single cache
■ L-1 block size smaller than L-2 block size

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 67


Interactions with Advanced CPUs
■ Out-of-order CPUs can execute
instructions during cache miss
■ Pending store stays in load/store unit
■ Dependent instructions wait in reservation
stations
■ Independent instructions continue
■ Effect of miss depends on program data
flow
■ Much harder to analyse
■ Use system simulation

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 68


§5.5 Dependable Memory Hierarchy
Dependability
■ Two service states:
  ■ Service accomplishment: service delivered as specified
  ■ Service interruption: deviation from specified service
  ■ A failure moves the system from accomplishment to interruption; a restoration moves it back
■ Fault: failure of a component
  ■ May or may not lead to system failure

Chapter 6 — Storage and Other I/O Topics — 69


Dependability Measures
■ Reliability: mean time to failure (MTTF)
■ Service interruption: mean time to repair (MTTR)
■ Mean time between failures
■ MTBF = MTTF + MTTR
■ Availability = MTTF / (MTTF + MTTR)
■ Improving Availability
■ Increase MTTF: fault avoidance, fault tolerance, fault
forecasting
■ Reduce MTTR: improved tools and processes for
diagnosis and repair

Chapter 6 — Storage and Other I/O Topics — 70


Nines of Availability
■ We want availability to be very high.
■ One shorthand is to quote the number of
“nines of availability” per year.

# of Nines % of uptime
One 90
Two 99
Three 99.9
Four 99.99
Five 99.999

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 71


Nines of Availability
■ Given 365 days per year, which is 365 * 24
* 60 = 525,600 minutes.
■ Then the shorthand is decoded as follows:
■ 90% => 525,600 * 0.1 downtime = 52560
minutes = 52560/(60*24) = 36.5 days.
■ # of Nines % of uptime Downtime/Year
One 90 36.5 days
Two 99 3.65 days
Three 99.9 526 min
Four 99.99 52.6 min
Five 99.999 5.26 min

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 72


MTTF and AFR
■ MTTF is a reliability measure.
■ A related term is annual failure rate (AFR)
■ The percentage of devices that would be
expected to fail in a year for a given MTTF.
■ Hours in a year / MTTF
■ When MTTF gets large it can be misleading
■ while AFR leads to better intuition

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 73


High MTTF and AFR
■ Some disks today are quoted to have a
1,000,000-hour MTTF.
■ 1,000,000/(365 * 24) = 114 years
■ they practically never fail???
■ Warehouse scale computers that run
Internet services such as Search might
have 50,000 servers.
■ Assume each server has 2 disks.
■ How many disks would we expect to fail per year?
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 74
High MTTF and AFR
■ One year => 365 * 24 = 8760 hours.
■ A 1,000,000-hour MTTF means an AFR
of 8760/1,000,000 = 0.876%.
■ We have 50000 * 2 = 100,000 disks
■ we would expect 0.00876 * 100,000 = 876
disks to fail per year
■ On average more than (876/365) 2 disk
failures per day.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 75
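The same AFR arithmetic as a C sketch (MTTF and fleet size from the example).

#include <stdio.h>

int main(void) {
    double hours_per_year = 365.0 * 24.0;                /* 8760          */
    double mttf_hours     = 1000000.0;
    double afr            = hours_per_year / mttf_hours; /* 0.876%        */
    double disks          = 50000.0 * 2.0;
    double failures       = afr * disks;                 /* 876 per year  */

    printf("AFR = %.3f%%, expected failures = %.0f per year (%.1f per day)\n",
           afr * 100.0, failures, failures / 365.0);
    return 0;
}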


§5.7 Virtual Memory
Virtual Memory
■ Use main memory as a “cache” for
secondary (disk) storage
■ Managed jointly by CPU hardware and the
operating system (OS)
■ Programs share main memory
■ Each gets a private virtual address space
holding its frequently used code and data
■ Protected from other programs
■ CPU and OS translate virtual addresses to
physical addresses
■ VM “block” is called a page
■ VM translation “miss” is called a page fault

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 76


Address Translation
■ Fixed-size pages (e.g., 4K)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 77


Page Fault Penalty
■ On page fault, the page must be fetched
from disk
■ Takes millions of clock cycles
■ Handled by OS code
■ Try to minimize page fault rate
■ Fully associative placement
■ But this needs costly search!!!
■ Smart replacement algorithms

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 78


Page Tables
■ Stores placement information
■ Array of page table entries, indexed by virtual
page number
■ Page table register in CPU points to page
table in physical memory
■ If page is present in memory
■ PTE stores the physical page number
■ Plus other status bits (referenced, dirty, …)
■ If page is not present
■ PTE can refer to location in swap space on
disk
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 79
Translation Using a Page Table

■ With a 32-bit virtual address and 4 KiB pages, the page table has 2^20 (= 1 M) entries
■ What is the size of the page table?
  ■ Each entry needs about 19 bits of information, but entries are usually made 32 bits wide

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 80


Page Table Size Issues
■ Size of the page table: 2^20 entries × 4 bytes => 4 MB (per process)
■ What about 100 processes, each with its own page table?
■ What will happen if we have 64-bit addresses (by the same calculation)?
  ■ 2^(64 - 12) = 2^52 entries!!!
  ■ Address fields: virtual page number = bits 63-12 (52 bits), page offset = bits 11-0 (12 bits)
■ There are techniques to reduce the amount of storage required for the page table
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 81
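The page-table-size arithmetic as a small C sketch; the 4-byte entry size is the usual assumption carried over from the previous slide.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int page_offset_bits = 12;                                   /* 4 KiB pages            */
    int pte_bytes        = 4;                                    /* assumed 32-bit entries */

    uint64_t entries32 = 1ull << (32 - page_offset_bits);        /* 2^20 entries           */
    printf("32-bit addresses: %llu entries, %llu MB per process\n",
           (unsigned long long)entries32,
           (unsigned long long)((entries32 * pte_bytes) >> 20)); /* 4 MB                   */

    uint64_t entries64 = 1ull << (64 - page_offset_bits);        /* 2^52 entries           */
    printf("64-bit addresses: %llu entries -- a flat table is infeasible\n",
           (unsigned long long)entries64);
    return 0;
}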
Techniques: Limit Registers
■ Keep a limit register that restricts the size
of the page table for a given process.
■ If the virtual page number becomes larger
than the contents of the limit register, entries
must be added to the page table.
■ This technique allows the page table to grow
as a process consumes more space.
■ Thus, the page table will only be large if the
process is using many pages of
virtual address space.
■ This technique requires that the address
space expand in only one direction.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 82
Techniques: Limit Registers
Limit Register

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 83


Techniques: Two Limits
■ Allowing growth in only one direction is not sufficient
■ Use two separate page tables and two separate limits
  ■ Stack: grows from the highest address down
  ■ Heap: grows from the lowest address up
■ So, the address space is divided into 2 segments
■ The high-order bit of an address usually determines which segment, i.e., which page table to use for that address
■ So, each segment can be as large as one-half of the address space
■ A limit register for each segment specifies the current size of the segment, which grows in units of pages

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 84


Techniques: Inverted Page Table
■ Keep only one entry per physical block
(i.e., frame)
■ Such a structure is called an inverted page
table.
■ we can no longer just index the page table.
■ So, the lookup process is slightly more
complex
■ May apply a hashing function to the virtual
address
■ To make the lookups faster
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 85
Techniques: Multiple Levels
■ The first level maps large fixed-size blocks of virtual address space (sometimes called segments)
■ Each entry in the segment table:
■ indicates whether any pages in that segment
are allocated
■ if so, points to a page table for that segment.
■ Address translation happens:
■ by first looking in the segment table, using the
highest-order bits of the address.
■ If the segment address is valid, the next set of
high-order bits is used to index the page table
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 86
Techniques: Multiple Levels

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 87


Mapping Pages to Storage

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 88


Page Fault
■ If the valid bit for a virtual page is off, a
page fault occurs.
■ The operating system must be given
control.
■ This transfer is done with the exception
mechanism
■ OS must find the page in the next level of the
hierarchy
■ and decide where to place the requested page
in main memory.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 89


Swap Space
■ Virtual addr. alone does not immediately tell us
where the page is on disk.
■ OS usually creates the space on flash memory/disk for all the pages of a process: the swap space


■ OS creates a data structure to record where
each virtual page is stored on disk.
■ may be part of the page table
■ or an auxiliary data structure indexed in the same way
as the page table
■ OS also creates a data structure that tracks
■ which processes and which virtual addresses use
each physical page
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 90
Replacement and Writes
■ To reduce page fault rate, prefer
least-recently used (LRU) replacement
■ Reference bit (aka use bit) in PTE set to 1 on
access to page
■ Periodically cleared to 0 by OS
■ A page with reference bit = 0 has not been
used recently
■ Disk writes take millions of cycles
■ Block at once, not individual locations
■ Write through is impractical
■ Use write-back
■ Dirty bit in PTE set when page is written

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 91


Fast Translation Using a TLB
■ Address translation would appear to require extra memory references (page tables are in main memory)
  ■ One to access the PTE
  ■ Then the actual memory access
■ But access to page tables has good locality
■ So use a fast cache of PTEs within the CPU
■ Called a Translation Look-aside Buffer (TLB)
■ Typical: 16–512 PTEs
■ Misses could be handled by hardware or software

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 92


Fast Translation Using a TLB

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 93


TLB Misses
■ If page is in memory
■ Load the PTE from memory and retry
■ Could be handled in hardware
■ Can get complex for more complicated page table
structures
■ Or in software
■ Raise a special exception, with optimized handler
■ If page is not in memory (page fault)
■ OS handles fetching the page and updating
the page table
■ Then restart the faulting instruction

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 94


TLB Miss Handler
■ A TLB miss indicates either
  ■ Page present, but PTE not in TLB
  ■ Page not present (a true page fault)
■ Handler copies the PTE from memory to the TLB
  ■ Then restarts the instruction
  ■ If the page is not present, a page fault will occur
■ The reference and dirty bits may change in the TLB
  ■ So they must also be copied back to the PTE when that TLB entry is replaced (a write-back scheme)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 95
TLB: Associativity
■ Some systems use small, fully associative TLBs
  ■ Higher associativity lowers the miss rate
  ■ With few entries, the search cost is not too high
  ■ But the replacement choice becomes tricky
    ■ A hardware LRU scheme is too expensive
    ■ An expensive software algorithm is also not feasible, because TLB misses are much more frequent than page faults
  ■ Many systems provide some support for randomly choosing an entry to replace
■ Other systems use large TLBs, often with small associativity

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 96


Page Fault Handler
■ Use faulting virtual address to find PTE
■ Locate page on disk
■ Choose page to replace
■ If dirty, write to disk first
■ Read page into memory and update page
table
■ Make process runnable again
■ Restart from faulting instruction

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 97


Intrinsity FastMATH TLB
■ 4 KiB (212) pages; 32-bit address space
■ So, virtual page number is (32 – 12 =) 20 bits long
■ Physical address is same size as virtual address.
■ TLB: 16 entries; fully associative
■ shared between the instruction and data
■ Each entry is 64 bits wide
■ a 20-bit tag (virtual page number for that TLB entry)
■ the corresponding physical page number (also 20 bits),
■ a valid bit, a dirty bit, and other bookkeeping bits.
■ Like most MIPS systems, it uses software to handle
TLB misses.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 98


TLB and Cache Interaction
■ TLB implementation: CAM (content addressable memory)
■ Cache: 256 blocks × 16 words/block

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 99


Content Addressable Memory (CAM)
■ CAM is a circuit that combines comparison
and storage in a single device.
■ Unlike a RAM, you do not supply an address and read a word
■ Instead, you supply the data, and the CAM looks to see if it has a copy and returns the index of the matching row
■ With CAMs higher set associativity in cache
can be implemented

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 100


Processing a read or a write-through in the
Intrinsity FastMATH TLB and cache

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 101


TLB, Cache and VM Events Combined

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 102


Physically Addressed Cache
■ Here, cache is
“physically addressed”
and “physically tagged”
■ Time to access memory for a cache hit includes:
■ TLB access time
■ Cache access time
■ Of course, these
accesses can be
pipelined.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 103
Virtually Indexed Cache
■ Alternatively, the processor can index the
cache with a virtual address VIVT
■ Virtually indexed and Virtually tagged cache
■ Here, TLB is unused during the normal
cache access
■ Reduce cache latency
■ Cache miss=> the processor needs to
translate the address to a physical address
■ so that it can fetch the cache block from main
memory.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 104
Aliasing in VIVT Cache

■ Aliasing occurs when the same object has two names


■ Two virtual addresses for the same page.
■ May happen for shared pages between processes
■ This ambiguity creates a problem:
■ A word on such a page may be cached in two different locations,
each corresponding to different virtual addresses.
■ One program may write the data without the other program being
aware that the data had changed.
■ Solution to Aliasing Issues
■ either introduce design limitations on the cache and TLB to
reduce aliases
■ or require the operating system, and possibly the user, to take
steps to ensure that aliases do not occur

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 105


Virtually Indexed Physically Tagged
■ Physical Tag using just the page-offset
portion of the address, which is really
a physical address since it is not translated
■ These designs, which are virtually indexed
but physically tagged, attempt to achieve
the performance advantages of virtually
indexed caches with the architecturally
simpler advantages of a physically
addressed cache.
■ There is no alias problem in this case.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 106
Memory Protection
■ Different tasks can share parts of their
virtual address spaces
■ But need to protect against errant access
■ Requires OS assistance
■ Hardware support for OS protection
■ Privileged supervisor mode (aka kernel mode)
■ Privileged instructions
■ Page tables and other state information only
accessible in supervisor mode
■ System call exception (e.g., syscall in MIPS)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 107
Memory Protection: H/W Support
■ Need to Support:
■ at least two modes:
■ Privileged supervisor mode (aka kernel mode)
■ User mode
■ Different Processor states:
■ Allow for a process to read only; No Write allowed
■ To write, privileged instructions are needed that are only
available in supervisor mode
■ Page tables and other state information only accessible in
supervisor mode
■ System call exception (e.g., syscall in MIPS):
■ This allows a process to change mode:
■ User mode => supervisor mode (and vice versa)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 108


Memory Protection
■ Each process has its own virtual address space.
■ OS can keep the page tables organized so that the
independent virtual pages map to disjoint physical
pages
■ one process will not be able to access another’s data.
■ But what if the Process changes the mapping?
■ Page tables are placed in the protected address
space of the OS
■ So, a user process is prevented from changing the page table mapping.
■ But OS is able to modify the page tables.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 109


Memory Sharing and Protection
■ OS assists processes for limited information sharing
■ OS can change the Page table as needed
■ The write access bit is used to restrict write
■ Can be changed only by OS

Example:
■ P2 wants P1 to access its page
■ P2 asks OS to create a page table entry for a virtual page in
P1’s address space that points to the same physical page that
P2 wants to share.
■ Any bits that determine the access rights for a page must be
included in both the page table and the TLB because the page
table is accessed only on a TLB miss.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 110


Memory Protection: Context Switch
■ Context Switch: A changing of the internal state of the
processor to allow a different process to use the
processor
■ Suppose a context switch has occurred:
■ P1 was running; now P2 will run
■ OS must ensure that P2 cannot get access to P1’s page tables
■ Page Table register is changed
■ What about the TLB?
■ OS must clear the TLB entries that belong to P1
■ to protect the data of P1 and
■ to force the TLB to load the entries for P2.
■ If process switches are frequent, this flushing could be quite inefficient.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 111


Memory Protection: Context Switch
A common alternative:
■ Extend the virtual address space by adding a process identifier or task identifier
■ The Intrinsity FastMATH has an 8-bit address space ID (ASID) field for this purpose
  ■ This small field identifies the currently running process
  ■ It is kept in a register loaded by the OS when it switches processes
■ The process identifier is concatenated to the tag portion of the TLB
  ■ A TLB hit occurs only if both the page number and the process identifier match
■ This eliminates the need to clear the TLB on a context switch
■ Similar problems can occur for a cache, since on a process switch the cache will contain data from the running process

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 112


Exception Enable/Disable
■ Suppose we have a page fault exception and OS is handling
it
■ What will happen, if a second exception occurs?
■ The control unit would overwrite the exception program counter, making it
impossible to return to the instruction that caused the page fault!
■ We need the ability to disable and enable exceptions.
■ When an exception first occurs, the processor sets a bit that
disables all other exceptions;
■ this could happen at the same time the processor sets the
supervisor mode bit.
■ The OS will then save just enough state to allow it to recover if
another exception occurs
■ the exception program counter (EPC) and Cause registers
■ The operating system can then re-enable exceptions.

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 113


Special Control Registers for Exceptions, TLB Misses, and Page Faults
■ When a TLB miss occurs, the MIPS hardware saves the page number of the reference in one of these registers

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 114


TLB Miss in MIPS
■ TLB Miss exception invokes OS, which handles the miss in
software.
■ Control is transferred to 8000 0000hex (TLB Miss Handler Address)
■ To find the physical address for the missing page, the TLB miss
routine indexes the page table using the page number of the virtual
address and the page table register
■ To make this indexing fast, MIPS hardware places the address of
the Page Table Entry in a special Context Register
■ Thus, the first two instructions copy the Context register into the
kernel temporary register $k1 and then load the page table entry
from that address into $k1.
■ Recall that $k0 and $k1 are reserved for the operating system

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 115


TLB Miss in MIPS
■ The TLB miss handler does not check whether the page table entry is valid
■ If it is invalid, another and different exception occurs, and the OS recognizes the page fault
  ■ It transfers control to 8000 0180hex
■ The (frequent) TLB miss becomes fast
  ■ at a slight performance penalty for the (infrequent) page fault
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 116
§5.8 A Common Framework for Memory Hierarchies
The Memory Hierarchy
The BIG
Picture
■ Common principles apply at all levels of
the memory hierarchy
■ Based on notions of caching
■ At each level in the hierarchy
■ Block placement
■ Finding a block
■ Replacement on a miss
■ Write policy

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 117


Block Placement
■ Determined by associativity
■ Direct mapped (1-way associative)
■ One choice for placement
■ n-way set associative
■ n choices within a set
■ Fully associative
■ Any location
■ Higher associativity reduces miss rate
■ Increases complexity, cost, and access time

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 118


Finding a Block
  Associativity        Location method                              Tag comparisons
  Direct mapped        Index                                        1
  n-way set assoc.     Set index, then search entries in the set    n
  Fully associative    Search all entries                           #entries
  Fully associative    Full lookup table                            0

■ Hardware caches
■ Reduce comparisons to reduce cost
■ Virtual memory
■ Full table lookup makes full associativity feasible
■ Benefit in reduced miss rate

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 119


Replacement
■ Choice of entry to replace on a miss
■ Least recently used (LRU)
■ Complex and costly hardware for high associativity
■ Random
■ Close to LRU, easier to implement
■ Virtual memory
■ LRU approximation with hardware support

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 120


Write Policy
■ Write-through
■ Update both upper and lower levels
■ Simplifies replacement, but may require write
buffer
■ Write-back
■ Update upper level only
■ Update lower level when block is replaced
■ Need to keep more state
■ Virtual memory
■ Only write-back is feasible, given disk write
latency

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 121


Sources of Misses
■ Compulsory misses (aka cold start misses)
■ First access to a block
■ Capacity misses
■ Due to finite cache size
■ A replaced block is later accessed again
■ Conflict misses (aka collision misses)
■ In a non-fully associative cache
■ Due to competition for entries in a set
■ Would not occur in a fully associative cache of
the same total size

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 122


Cache Design Trade-offs
  Design change            Effect on miss rate            Negative performance effect
  Increase cache size      Decreases capacity misses      May increase access time
  Increase associativity   Decreases conflict misses      May increase access time
  Increase block size      Decreases compulsory misses    Increases miss penalty; a very large
                                                          block could increase the miss rate

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 123


§5.9 Using a Finite State Machine to Control A Simple Cache
Cache Control
■ Example cache characteristics
■ Direct-mapped, write-back, write allocate
■ Block size: 4 words (16 bytes)
■ Cache size: 16 KB (1024 blocks)
■ 32-bit byte addresses
■ Valid bit and dirty bit per block
■ Blocking cache
■ CPU waits until access is complete

Address fields: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 124


Interface Signals

■ CPU <-> Cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
■ Cache <-> Memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
■ Memory: multiple cycles per access

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 125


Finite State Machines
■ Use an FSM to
sequence control steps
■ Set of states, transition
on each clock edge
■ State values are binary
encoded
■ Current state stored in a
register
■ Next state
= fn (current state,
current inputs)
■ Control output signals
= fo (current state)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 126
Cache Controller FSM

Could partition
into separate
states to
reduce clock
cycle time

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 127
