Chapter 5.1-5.6 Memory
Cache Memories in the Datapath
[Figure: pipelined datapath with an I-Cache in the fetch stage (addressed by the PC) and a D-Cache in the memory stage (addressed by the ALU result, with register-file BusB as Data_in and Data_out returned to the write-back mux). On a miss, the block address is sent to lower-level memory and the instruction block or data block is transferred into the cache. An I-Cache miss or a D-Cache miss causes the pipeline to stall.]
Four Basic Questions on Caches
• Q1: Where can a block be placed in a cache?
– Block placement
– Direct Mapped, Set Associative, Fully Associative
• Q2: How is a block found in a cache?
– Block identification
– Block address, tag, index
• Q3: Which block should be replaced on a cache miss?
– Block replacement
– FIFO, Random, LRU
• Q4: What happens on a write?
– Write strategy
– Write Back or Write Through cache (with Write Buffer)
Inside a Cache Memory
[Figure: the processor sends an address to the cache and exchanges data with it; on a miss, the cache forwards the address to main memory and transfers the block. The cache holds N cache blocks, each stored together with an address tag (Address Tag 0 with Cache Block 0, and so on).]
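The tag/block pairing in the figure can be written down directly as a data structure. A minimal sketch in C, assuming a direct-mapped organization with 16-byte blocks (the names cache_block_t, BLOCK_SIZE, and NUM_BLOCKS are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE   16   /* bytes per cache block (assumed) */
#define NUM_BLOCKS  256   /* cache blocks (assumed)          */

/* One cache entry: the stored address tag plus the block of data it caches. */
typedef struct {
    bool     valid;                 /* does the entry hold a real block? */
    uint32_t tag;                   /* upper address bits                */
    uint8_t  data[BLOCK_SIZE];      /* the cached block itself           */
} cache_block_t;

/* The whole cache is just an array of such entries. */
typedef struct {
    cache_block_t blocks[NUM_BLOCKS];
} cache_t;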
Block Placement: Direct Mapped
• Block: unit of data transfer between cache and memory
• Direct Mapped Cache:
– A block can be placed in exactly one location in the cache
In this example: cache index = least significant 3 bits of the memory address
[Figure: an 8-entry cache (indices 000 to 111) and a main memory of 32 blocks (block addresses 00000 to 11111); each memory block maps to the cache entry whose index equals its low 3 bits, so blocks 00001, 01001, 10001, and 11001 all share cache index 001.]
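The mapping in the figure is just a modulo (equivalently, a bit-select) operation. A minimal sketch in C for the 8-block example above (the function name cache_index is illustrative, not from the slides):

#include <stdint.h>
#include <stdio.h>

#define CACHE_BLOCKS 8   /* 8 cache blocks -> index = low 3 bits of block address */

/* Direct-mapped placement: each memory block has exactly one possible slot. */
static uint32_t cache_index(uint32_t block_address)
{
    return block_address % CACHE_BLOCKS;   /* same as block_address & 0x7 */
}

int main(void)
{
    /* Memory blocks 00001, 01001, 10001, 11001 (binary) all collide at index 001. */
    uint32_t blocks[] = { 0x01, 0x09, 0x11, 0x19 };
    for (int i = 0; i < 4; i++)
        printf("memory block %2u -> cache index %u\n",
               (unsigned)blocks[i], (unsigned)cache_index(blocks[i]));
    return 0;
}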
Direct-Mapped Cache
• A memory address is divided into:
– Block address: identifies the block in memory
– Block offset: to access bytes within a block
– A 32-bit address is divided into Tag, Index, and Offset fields of 20, 8, and 4 bits:
• 4-bit byte offset field, because block size = 2^4 = 16 bytes
• 8-bit cache index, because there are 2^8 = 256 blocks in the cache
• 20-bit tag field
– Example: address 0x01FFF8AC is divided as follows:
– Byte offset = 0xC = 12 (least significant 4 bits of the address)
– Cache index = 0x8A = 138 (next lower 8 bits of the address)
– Tag = 0x01FFF (upper 20 bits of the address)
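Extracting these fields is a pair of shifts and masks. A minimal sketch using the 20/8/4 split above (split_address and addr_fields_t are illustrative names):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4    /* 16-byte blocks          */
#define INDEX_BITS  8    /* 256 blocks in the cache */

typedef struct {
    uint32_t tag;        /* upper 20 bits */
    uint32_t index;      /* next 8 bits   */
    uint32_t offset;     /* low 4 bits    */
} addr_fields_t;

static addr_fields_t split_address(uint32_t addr)
{
    addr_fields_t f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}

int main(void)
{
    addr_fields_t f = split_address(0x01FFF8ACu);
    /* Prints tag=0x01FFF index=0x8A offset=0xC, matching the example above. */
    printf("tag=0x%05X index=0x%02X offset=0x%X\n",
           (unsigned)f.tag, (unsigned)f.index, (unsigned)f.offset);
    return 0;
}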
Example on Cache Placement & Misses
• Consider a small direct-mapped cache with 32 blocks
– Cache is initially empty, Block size = 16 bytes
– The following memory addresses (in decimal) are referenced:
1000, 1004, 1008, 2548, 2552, 2556.
– Map addresses to cache blocks and indicate whether hit or miss
• Solution: the address is divided into a 23-bit Tag, a 5-bit Index, and a 4-bit Offset
– Block address = address ÷ 16; cache index = block address mod 32
– 1000, 1004 → block 62, index 30: miss, then hit
– 1008 → block 63, index 31: miss
– 2548 → block 159, index 31: miss (replaces block 63)
– 2552, 2556 → block 159, index 31: hit, hit
– Result: 3 misses and 3 hits (see the simulation sketch below)
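The placement can also be checked mechanically. A minimal simulation sketch of this 32-block direct-mapped cache (array and variable names are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 32   /* direct-mapped cache with 32 blocks */
#define BLOCK_SIZE 16   /* 16 bytes per block                 */

int main(void)
{
    bool     valid[NUM_BLOCKS] = { false };   /* cache is initially empty */
    uint32_t tags[NUM_BLOCKS]  = { 0 };

    uint32_t refs[] = { 1000, 1004, 1008, 2548, 2552, 2556 };
    for (int i = 0; i < 6; i++) {
        uint32_t block = refs[i] / BLOCK_SIZE;   /* block address */
        uint32_t index = block % NUM_BLOCKS;     /* cache index   */
        uint32_t tag   = block / NUM_BLOCKS;     /* tag           */
        bool hit = valid[index] && tags[index] == tag;
        printf("address %4u -> block %3u, index %2u: %s\n",
               (unsigned)refs[i], (unsigned)block, (unsigned)index,
               hit ? "hit" : "miss");
        valid[index] = true;                     /* allocate the block on a miss */
        tags[index]  = tag;
    }
    return 0;
}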
Set-Associative Cache
• A set is a group of cache blocks that share the same index
• A block is first mapped onto a set
– Set index = Block address mod Number of sets in cache
[Figure: an m-way set-associative cache; each way of the indexed set holds a Valid bit, a Tag, and Block Data. The tags of all m ways are compared in parallel (one comparator per way), a multiplexer selects the data of the matching way, and a match on a valid entry signals a hit.]
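A set-associative lookup indexes a set, then compares the tag against every way of that set in parallel (sequentially in software). A minimal sketch, assuming a 4-way cache with 64 sets (all names and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WAYS      4    /* m = 4 blocks per set (assumed) */
#define NUM_SETS 64    /* number of sets (assumed)       */

typedef struct {
    bool     valid;
    uint32_t tag;
} way_t;

static way_t cache[NUM_SETS][WAYS];

/* Returns true on a hit; *way_out reports which way matched. */
static bool lookup(uint32_t block_address, int *way_out)
{
    uint32_t set = block_address % NUM_SETS;   /* set index      */
    uint32_t tag = block_address / NUM_SETS;   /* remaining bits */
    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;
            return true;   /* hit: the mux would select this way's data */
        }
    }
    return false;          /* miss */
}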
Write Policy
• Write Through:
– Writes update cache and lower-level memory
– Cache control bit: only a Valid bit is needed
– Memory always has latest data, which simplifies data coherency
– Can always discard cached data when a block is replaced
• Write Back:
– Writes update cache only
– Cache control bits: Valid and Modified bits are required
– Modified cached data is written back to memory when replaced
– Multiple writes to a cache block require only one write to memory
– Uses less memory bandwidth than write-through and less power
– However, more complex to implement than write through
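The two policies differ only in when lower-level memory is updated on a write hit. A minimal sketch of the write-hit path under each policy (line_t and the function names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 16

typedef struct {
    bool     valid;
    bool     modified;            /* dirty bit: needed only for write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

/* Write-through hit: update the cache line AND lower-level memory,
 * so memory always holds the latest data. */
static void write_through_hit(line_t *line, uint32_t offset, uint8_t byte,
                              uint8_t *memory_block)
{
    line->data[offset]   = byte;
    memory_block[offset] = byte;
}

/* Write-back hit: update the cache only and set the Modified bit;
 * the block is written back to memory later, when it is replaced. */
static void write_back_hit(line_t *line, uint32_t offset, uint8_t byte)
{
    line->data[offset] = byte;
    line->modified     = true;
}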
Write Miss Policy
• What happens on a write miss?
• Write Allocate:
– Allocate new block in cache
– Write miss acts like a read miss, block is fetched and updated
• No Write Allocate:
– Send data to lower-level memory
– Cache is not modified
• Typically, write back caches use write allocate
– Hoping subsequent writes will be captured in the cache
• Write-through caches often use no-write allocate
– Reasoning: writes must still go to lower level memory
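A companion sketch for the write-miss side, reusing the same illustrative line_t structure (an assumption for illustration, not the slides' notation):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

typedef struct {
    bool     valid;
    bool     modified;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

/* Write-allocate miss (typical with write-back): handle it like a read miss,
 * fetch the block into the cache, then perform the write there. */
static void write_allocate_miss(line_t *line, uint32_t tag,
                                const uint8_t *memory_block,
                                uint32_t offset, uint8_t byte)
{
    memcpy(line->data, memory_block, BLOCK_SIZE);
    line->valid        = true;
    line->tag          = tag;
    line->data[offset] = byte;
    line->modified     = true;
}

/* No-write-allocate miss (typical with write-through): send the data to
 * lower-level memory; the cache is not modified. */
static void no_write_allocate_miss(uint8_t *memory_block,
                                   uint32_t offset, uint8_t byte)
{
    memory_block[offset] = byte;
}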
Write Buffer
• Decouples CPU writes from the actual writes to memory over the bus
– Permits writes to occur without stall cycles until buffer is full
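One way to picture the write buffer is a small FIFO between the processor and memory: the CPU stalls only when the FIFO is full. A minimal sketch (write_buffer_t and its 4-entry size are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* small FIFO of pending writes */

typedef struct { uint32_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    int head, tail, count;
} write_buffer_t;

/* CPU side: returns true if the write was buffered without stalling;
 * false means the buffer is full and the CPU must stall this cycle. */
static bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
{
    if (wb->count == WB_ENTRIES)
        return false;
    wb->entry[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory side: drains one pending write per bus transaction. */
static bool wb_pop(write_buffer_t *wb, wb_entry_t *out)
{
    if (wb->count == 0)
        return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}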
Replacement Policy – cont’d
• Least Recently Used (LRU)
– Replace block that has been unused for the longest time
– Order blocks within a set from least to most recently used
– Update ordering of blocks on each cache hit
– With m blocks per set, there are m! possible permutations
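The ordering can be kept as a small per-set list: a hit moves the block to the most-recently-used end, and the victim is always taken from the least-recently-used end. A minimal sketch for one 4-way set (names and the m = 4 choice are illustrative):

#define M 4   /* m blocks per set (assumed 4-way) */

/* lru[0] is the least recently used way, lru[M-1] the most recently used. */
static int lru[M] = { 0, 1, 2, 3 };

/* On a cache hit to 'way', move it to the most-recently-used position. */
static void lru_touch(int way)
{
    int pos = 0;
    while (lru[pos] != way)      /* find the way in the current ordering */
        pos++;
    for (; pos < M - 1; pos++)   /* shift the more-recent ways down      */
        lru[pos] = lru[pos + 1];
    lru[M - 1] = way;            /* place the hit way at the MRU end     */
}

/* On a miss, replace the least recently used way. */
static int lru_victim(void)
{
    return lru[0];
}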
Example on CPI with Memory Stalls
• A processor has CPI of 1.5 without any memory stalls
– Cache miss rate is 2% for instruction and 5% for data
– 20% of instructions are loads and stores
– Cache miss penalty is 100 clock cycles for I-cache and D-cache
• Solution:
– Memory stall cycles per instruction = I-cache stalls + D-cache stalls
– = 0.02 × 100 + 0.20 × 0.05 × 100 = 2 + 1 = 3 cycles
– CPI with memory stalls = 1.5 + 3 = 4.5
– The processor is 4.5 / 1.5 = 3 times slower than with a perfect cache
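A quick arithmetic check of the solution (variable names are only for illustration):

#include <stdio.h>

int main(void)
{
    double base_cpi     = 1.5;    /* CPI without memory stalls  */
    double i_miss_rate  = 0.02;   /* I-cache miss rate          */
    double d_miss_rate  = 0.05;   /* D-cache miss rate          */
    double mem_ref_frac = 0.20;   /* fraction of loads/stores   */
    double miss_penalty = 100.0;  /* miss penalty, clock cycles */

    double stalls = i_miss_rate * miss_penalty
                  + mem_ref_frac * d_miss_rate * miss_penalty;
    printf("memory stall cycles per instruction = %.1f\n", stalls);            /* 3.0 */
    printf("CPI with memory stalls              = %.1f\n", base_cpi + stalls); /* 4.5 */
    return 0;
}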
Improving Cache Performance
• Average Memory Access Time (AMAT)