Chapter 5
Cache Memory
[Figure 5.1 Cache and Main Memory: access speed grades from fastest at the cache nearest the processor to slow at main memory]
• Frame
– To distinguish between the data transferred and the chunk of physical
memory, the term frame, or block frame, is sometimes used with
reference to caches
• Line
– A portion of cache memory capable of holding one block, so-called
because it is usually drawn as a horizontal object
• Tag
– A portion of a cache line that is used for addressing purposes
• Line size
– The number of data bytes, or block size, contained in a line
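To make these terms concrete, here is a minimal Python sketch (not from the text; the 64-KByte cache size and 64-byte line size are assumed example values) showing how cache size and line size determine the number of lines and the width of the byte-offset field:

```python
# Hypothetical geometry: a 64-KByte cache with 64-byte lines (assumed values).
CACHE_SIZE = 64 * 1024    # total data capacity in bytes
LINE_SIZE = 64            # bytes per line (the line size / block size)

num_lines = CACHE_SIZE // LINE_SIZE          # number of line frames in the cache
offset_bits = (LINE_SIZE - 1).bit_length()   # address bits selecting a byte within a line

print(num_lines, offset_bits)   # 1024 lines, 6 offset bits
```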
[Figure 5.2 Cache/Main-Memory Structure: (a) the cache consists of C lines, each holding a tag plus a block of K words; (b) main memory consists of 2^n words, viewed as M blocks of K words each]
[Figure 5.3 Cache Read Operation: the cache receives a read address (RA) from the CPU; on a hit the addressed word is delivered to the CPU directly; on a miss the main memory block containing the word is loaded into a cache line and the word is then delivered to the CPU]
[Figure 5.4 Typical Cache Organization: the cache connects to the processor over data, control, and address lines and attaches to the system bus through an address buffer and a data buffer]
Table 5.1 Elements of Cache Design
• Cache Addresses – Logical; Physical
• Cache Size
• Mapping Function – Direct; Associative; Set associative
• Replacement Algorithm – Least recently used (LRU); First in first out (FIFO); Least frequently used (LFU); Random
• Write Policy – Write through; Write back
• Line Size
• Number of Caches – Single or two level; Unified or split
[Figure: two views of the data path between the processor, the cache, and main memory]
• Set associative mapping (summary)
– The cache consists of a sequence of m lines organized as v sets of k lines each (m = v × k)
– Each block of main memory maps to one unique cache set
– The line (set) portion of the address is used to access the cache set; the tag portion is used to check every line in that set for a hit on that line
[Figure 5.6 Mapping from Main Memory to Cache: (a) direct mapping, in which each of the first m blocks of main memory maps to exactly one of the m cache lines; (b) associative mapping, in which any one block of main memory can be loaded into any line of the cache; each line carries a tag of t bits alongside a block of b bits]
[Figure 5.7 Direct Mapping Example: a 16-MByte main memory cached by a 16K-line cache; each line holds an 8-bit tag and 32 bits of data. Note: memory address values are in binary representation; other values are in hexadecimal]
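The field widths in Figure 5.7 can be checked with a short sketch. The following Python fragment (an illustration, using the figure's parameters: a 24-bit address split into an 8-bit tag, a 14-bit line number, and a 2-bit word offset) decomposes one of the figure's addresses:

```python
# Direct mapping, Figure 5.7 parameters: 24-bit address, 16K lines, 4-byte lines.
WORD_BITS = 2    # 4 bytes per line
LINE_BITS = 14   # 16K = 2**14 lines
# the tag is the remaining 24 - 14 - 2 = 8 bits

def split_direct(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag = addr >> (WORD_BITS + LINE_BITS)
    return tag, line, word

# One of the addresses shown in the figure:
tag, line, word = split_direct(0b000101100000000000000100)
print(hex(tag), hex(line), word)   # 0x16 0x1 0 -> tag 16, line 0001, as in the figure
```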
[Figure 5.8 Associative Mapping Example: a 16K-line cache in which each line holds a 22-bit tag and 32 bits of data; the main memory address is divided into a 22-bit tag and a 2-bit word field]
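For comparison, the following sketch (illustrative Python; the cache is modeled as a plain dictionary) decomposes an address under the Figure 5.8 associative scheme, where only a tag and a word offset exist and the tag must be matched against every line:

```python
# Associative mapping, Figure 5.8 parameters: 24-bit address = 22-bit tag + 2-bit word.
WORD_BITS = 2

def split_assoc(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    tag = addr >> WORD_BITS
    return tag, word

# Hardware compares the tag against all lines simultaneously; a dict lookup
# stands in for that parallel search here.
cache = {0x058CE7: 0xFEDCBA98}   # {tag: data}, values from the figure

tag, word = split_assoc(0b000101100011001110011100)
print(hex(tag), tag in cache)    # 0x58ce7 True -> a hit
```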
[Figure 5.9 Mapping from Main Memory to Cache: k-Way Set Associative: the cache can be viewed either as v sets of k lines each, with each set acting as a small fully associative cache, or as k direct-mapped ways of v lines each; the first v blocks of main memory (equal in number to the sets) map one-to-one onto the v sets]
[Figure 5.11 Two-Way Set-Associative Mapping Example: a 16-MByte main memory cached by a two-way set-associative cache; each line holds a 9-bit tag and 32 bits of data. Note: memory address values are in binary representation; other values are in hexadecimal]
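The same decomposition for the two-way set-associative example of Figure 5.11 (illustrative Python, using the figure's parameters: a 9-bit tag, a 13-bit set number for 8K sets, and a 2-bit word offset):

```python
# Two-way set-associative, Figure 5.11 parameters.
WORD_BITS = 2
SET_BITS = 13    # 8K sets x 2 lines/set = 16K lines

def split_set_assoc(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    set_no = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_no, word

tag, set_no, word = split_set_assoc(0b000101100000000000000000)
# 0x2c 0x0 0 -> tag 02C, set 0000 as in the figure; the tag is then
# compared against both lines of the selected set.
print(hex(tag), hex(set_no), word)
```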
[Figure 5.12 Varying Associativity over Cache Size: hit ratio plotted against cache size from 1k to 1M bytes for direct-mapped and 2-way, 4-way, 8-way, and 16-way set-associative organizations]
• First-in-first-out (FIFO)
– Replace that block in the set that has been in the cache longest
– Easily implemented as a round-robin or circular buffer technique
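A minimal sketch of that round-robin implementation (illustrative Python; the class name and the set size k = 4 are assumed):

```python
# FIFO replacement for one cache set, implemented as a circular buffer:
# the pointer always indicates the line that was filled longest ago.
class FifoSet:
    def __init__(self, k):
        self.tags = [None] * k   # tags currently resident in this set
        self.next = 0            # round-robin victim pointer

    def access(self, tag):
        if tag in self.tags:
            return True          # hit: FIFO order is unaffected by reuse
        self.tags[self.next] = tag                   # miss: replace the oldest line
        self.next = (self.next + 1) % len(self.tags)
        return False

s = FifoSet(k=4)
for t in (1, 2, 3, 4, 1, 5, 1):
    print(t, s.access(t))   # the block for tag 5 evicts tag 1 (the oldest),
                            # so the final access to tag 1 misses
```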
• If at least one write operation has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block
• A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache: if a word is altered in one cache, it could conceivably invalidate a word in other caches
• Write back
– Minimizes memory writes
– Updates are made only in the cache
– Portions of main memory are invalid and hence accesses by I/O modules
can be allowed only through the cache
– This makes for complex circuitry and a potential bottleneck
• Either of these policies can be used with either write through or write
back
• No write allocate is most commonly used with write through
• Write allocate is most commonly used with write back
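These pairings can be sketched as follows (illustrative Python, not from the text; the cache is a dictionary of lines keyed by block address, and the helper names are assumptions):

```python
# Contrasting write policies with their usual allocation pairings.
class Line:
    def __init__(self, data):
        self.data = data
        self.dirty = False

def write_through(cache, memory, addr, value):
    memory[addr] = value            # every write goes to main memory
    if addr in cache:               # no write allocate: a miss does not
        cache[addr].data = value    # bring the block into the cache

def write_back(cache, memory, addr, value):
    if addr not in cache:           # write allocate on a miss
        cache[addr] = Line(memory.get(addr, 0))
    cache[addr].data = value
    cache[addr].dirty = True        # main memory is now stale

def evict(cache, memory, addr):
    line = cache.pop(addr)
    if line.dirty:
        memory[addr] = line.data    # the deferred write to main memory
```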
• Two-level cache:
– Internal cache designated as level 1 (L1)
– External cache designated as level 2 (L2)
• Potential savings due to the use of an L2 cache depend on the hit rates in both the L1 and L2 caches
• The use of multilevel caches complicates all of the design issues related to caches,
including size, replacement algorithm, and write policy
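A short worked example of that dependence (illustrative Python; all latencies and hit rates below are assumed values, not figures from the text):

```python
# Average access time for a two-level cache hierarchy.
t_L1, t_L2, t_mem = 1, 10, 100   # access times in cycles (assumed)
H1, H2 = 0.90, 0.95              # L1 hit rate and local L2 hit rate (assumed)

# Every access tries L1; L1 misses try L2; L2 misses go to main memory.
t_avg = t_L1 + (1 - H1) * (t_L2 + (1 - H2) * t_mem)
print(t_avg)   # 1 + 0.1 * (10 + 0.05 * 100) = 2.5 cycles
```

Under the same assumptions, raising H1 from 0.90 to 0.95 drops the average to about 1.75 cycles, which is why the L1 hit rate dominates the savings.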
[Figure 5.13 Total Hit Ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1: total hit ratio plotted against L2 cache size from 1k to 2M for L1 = 8k and L1 = 16k]
• Trend is toward split caches at the L1 and unified caches for higher
levels
• Advantages of split cache:
– Eliminates cache contention between instruction fetch/decode unit and execution unit
▪ Important in pipelining
• Exclusive policy
– Dictates that a piece of data in one cache is guaranteed not to be found in any lower level of cache
– The advantage is that it does not waste cache capacity, since it does not store multiple copies of the same data across the cache levels
– The disadvantage is the need to search multiple cache levels when invalidating or updating a block
– To minimize the search time, the highest-level tag sets are typically duplicated at the lowest cache level to centralize searching
• Noninclusive policy
– With the noninclusive policy, a piece of data in one cache may or may not be found in lower levels of cache
– As with the exclusive policy, the higher-level tag sets are generally maintained at the lowest cache level to centralize searching
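A minimal sketch of the exclusive policy for a two-level hierarchy (illustrative Python; the data structures and the tiny L1 capacity are assumed, and the victim choice is arbitrary for brevity):

```python
# Exclusive policy: a block lives in at most one cache level.
L1, L2 = {}, {}   # {block_address: data}

def read(addr, memory, l1_capacity=4):
    if addr in L1:
        return L1[addr]          # L1 hit
    if addr in L2:
        data = L2.pop(addr)      # promotion removes the block from L2
    else:
        data = memory[addr]      # fetch from main memory
    if len(L1) >= l1_capacity:   # make room in L1
        victim, vdata = L1.popitem()   # arbitrary victim for brevity
        L2[victim] = vdata       # the victim moves down rather than
    L1[addr] = data              # being duplicated in both levels
    return data
```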
Intel cache evolution (problem, solution, and the processor on which the feature first appears):
• Internal cache is rather small, due to limited space on the chip → add an external L2 cache using faster technology than main memory (486)
• Contention occurs when both the instruction prefetcher and the execution unit simultaneously require access to the cache, stalling the prefetcher while the execution unit's data access takes place → create separate data and instruction caches (Pentium)
• Some applications deal with massive databases and must have rapid access to large amounts of data, and the on-chip caches are too small → add an external L3 cache (Pentium III), then move the L3 cache on-chip (Pentium 4)
• Set associative
– It is not possible to transmit bytes and compare tags in parallel as can be done with
direct-mapped with speculative access
– However, the circuitry can be designed so that the data block from each line in a set
can be loaded and then transmitted once the tag check is made
• Direct-mapped with speculation
– t_hit = t_rl + t_xb
– t_miss = t_rl + t_ct
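Here t_rl, t_ct, and t_xb can be read as the times to read a line, check (compare) the tag, and transmit the selected bytes, respectively; a small numeric illustration with assumed component times:

```python
# Direct-mapped access with speculation: bytes are transmitted in parallel
# with the tag check, so a hit never pays separately for the comparison.
t_rl, t_xb, t_ct = 2, 1, 1   # cycles (assumed values)

t_hit = t_rl + t_xb    # line read, bytes forwarded speculatively
t_miss = t_rl + t_ct   # the tag check is what reveals the miss
print(t_hit, t_miss)   # 3 3
```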
• Techniques for improving cache performance involve tradeoffs between hit time and miss rate:
– Way prediction
– Cache capacity: a small cache favors fast hits, a large cache favors fewer misses
– Line size: small versus large
– Degree of associativity: decreasing it shortens hit time, increasing it lowers the miss rate
– More flexible replacement policies
– Victim cache
– Wider busses