Lectures wk11
139
Capacity
• In general, bigger is better
• The more data you can store in the cache, the less
often you have to go out to the main memory
140
Cache Line Length
• Cache groups contiguous addresses into lines
• Lines almost always aligned on their size
• Caches fetch or write back an entire line of data on a
miss
• Spatial Locality
• Reading/Writing a Line
• Typically it takes much longer to fetch the first word of a
line than subsequent words
• Page Mode memories
141
What causes a MISS?
• Three Major Categories of Cache Misses:
• Compulsory Misses: first access to a block
• Capacity Misses: cache cannot contain all blocks needed to
execute the program
• Conflict Misses: block replaced by another block and then later
retrieved (affects set-associative or direct-mapped caches)
• Nightmare Scenario: ping pong effect!
142
3/13/2024
Block Size and Spatial Locality
Block is unit of transfer between the cache and memory
[Figure: miss rate (0% to 35%) vs. block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls as block size grows, but climbs again in the smallest caches once blocks become too large a fraction of the cache.]
144
Hit Rate isn’t Everything
• Average access time is a better performance
indicator than hit rate
Tavg = Phit * Thit + Pmiss * Tmiss
Tmiss = Tfetch = Tfirst + (line length / fetch width) * Tsubsequent
145
Block Size Tradeoff
[Figure: as block size increases, miss rate first falls (exploits spatial locality) and then rises again (fewer blocks compromises temporal locality). Miss penalty grows steadily with block size. Average access time is therefore minimized at an intermediate block size, where increased miss penalty and miss rate begin to outweigh the spatial-locality benefit.]
147
Contents of a direct mapped cache
148
ADDRESS
Direct-mapped cache address (showing bit positions):
• Tag: bits [31:12] (20 bits)
• Index: bits [11:2] (10 bits)
• Byte offset: bits [1:0]
Separate the address into fields:
• Byte offset within the word
• Index for the row of the cache
• Tag identifying the block
[Figure: cache with rows indexed 0 to 1023; each row holds a Valid bit, a 20-bit Tag, and 32 bits of Data. The stored tag is compared with the address tag; a match on a valid row asserts Hit and drives out the Data word.]
A cache of 2^n words, a block being a 4-byte word, has 2^n * (63 - n) bits
of storage for a 32-bit address:
#rows = 2^n
#bits/row = 32 (data) + (32 - 2 - n) (tag) + 1 (valid) = 63 - n
149
Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
[Figure: 32-bit address split for this cache. Cache Tag: bits [31:10] (Example: 0x50, stored as part of the cache “state”). Cache Index: bits [9:5] (Ex: 0x01). Byte Select: bits [4:0] (Ex: 0x00). The row at index 1 holds tag 0x50 and a 32-byte block (Byte 32 through Byte 63).]
150
Extreme Example: single big line
151
A Two-way Set Associative Cache
[Figure: the Cache Index selects one set; the set holds two blocks (Cache Block 0 in each way), each with its own Valid bit and Cache Tag. The address tag (Adr Tag) is compared against both stored tags in parallel; the two Compare results are ORed to produce Hit, and Sel1/Sel0 drive a mux that selects the matching way's Cache Block.]
152
Another Extreme Example: Fully Associative
[Figure: no index field. The Cache Tag occupies bits [31:5] (27 bits) and Byte Select bits [4:0] (Ex: 0x01). Every entry's stored tag is compared against the address tag in parallel; each row holds a 32-byte block (Byte 32 through Byte 63 shown).]
153
Which Block Should be Replaced on a Miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
• Random: easier to implement
• Least Recently Used (LRU): harder to implement; may be approximated
• Miss rates for caches with different sizes, associativities, and
replacement algorithms:
Associativity: 2-way 4-way 8-way
Size LRU Random LRU Random LRU Random
16 KB 5.18% 5.69% 4.67% 5.29% 4.39% 4.96%
64 KB 1.88% 2.01% 1.54% 1.66% 1.39% 1.53%
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%
For caches with low miss rates, random is almost as good as LRU.
Q4: What Happens on a Write?
• Write through: The information is written to both the block in
the cache and to the block in the lower-level memory.
• Write back: The information is written only to the block in the
cache. The modified cache block is written to main memory only
when it is replaced.
• Is the block clean or dirty? (add a dirty bit to each block)
• Pros and Cons of each:
• Write through
• Read misses cannot result in writes to memory
• Easier to implement
• Always combine with write buffers to avoid memory latency
• Write back
• Less memory traffic
• Perform writes at the speed of the cache
Q4: What Happens on a Write?
• Since data does not have to be brought into the cache on a write
miss, there are two options:
• Write allocate
• The block is brought into the cache on a write miss
• Used with write-back caches
• Hope subsequent writes to the block hit in cache
• No-write allocate
• The block is modified in memory, but not brought into the cache
• Used with write-through caches
• Writes have to go to memory anyway, so why bring the block into the cache
ARM9 – Split Cache
• ARM9TDMI
• ARM 32-bit and Thumb 16-bit instructions (v4T ISA).
• Code compatibility with ARM7TDMI:
• Portable to 0.25, 0.18 µm CMOS and below.
• Harvard 5-stage pipeline implementation:
• Higher performance (CPI = 1.5)
• Coprocessor interface for on-chip coprocessors:
• Allows floating point, DSP, graphics accelerators.
• EmbeddedICE debug capability
CPU Pipeline structure with Cache
[Figure: six pipeline phases: Fetch, Decode, Read, E1, E2, Write.
• Fetch phase: PC and fetch-address generation feed a 128-bit ICache into the I-buffer.
• Decode phase: variable-size instruction decoding and immediate generation.
• Read phase: register-file access and bypassing.
• E1/E2 phases: IU and Mul execution, Load/Store unit accessing the DCache, branch unit and exception generation.
• Write phase: register-file write-back and rollback.]
158
ARM Cortex-A9 MPCore