
Computer Organization and Architecture

Designing for Performance


11th Edition, Global Edition

Chapter 5
Cache Memory

Copyright © 2022 Pearson Education, Ltd. All Rights Reserved


Figure 5.1
Cache and Main Memory
(a) Single cache: the CPU exchanges words with a fast cache, and blocks are transferred between the cache and slow main memory.
(b) Three-level cache organization: CPU → Level 1 (L1) cache (fastest) → Level 2 (L2) cache → Level 3 (L3) cache (less fast) → main memory (slow).


Cache Memory Principles
• Block
– The minimum unit of transfer between cache and main memory

• Frame
– To distinguish between the data transferred and the chunk of physical
memory, the term frame, or block frame, is sometimes used with
reference to caches

• Line
– A portion of cache memory capable of holding one block, so-called
because it is usually drawn as a horizontal object

• Tag
– A portion of a cache line that is used for addressing purposes

• Line size
– The number of data bytes, or block size, contained in a line
Figure 5.2
Cache/Main-Memory Structure
(a) Cache: C lines numbered 0 to C-1, each holding a tag and a block of K words (the block length).
(b) Main memory: 2^n addressable words organized as M blocks of K words each, numbered 0 to M-1.
Figure 5.3
Cache Read Operation
START: receive address RA from CPU.
If the block containing RA is in the cache (hit): fetch the RA word and deliver it to the CPU. DONE.
If not (miss): access main memory for the block containing RA, allocate a cache line for the main memory block, load the block into the cache line, and deliver the RA word to the CPU. DONE.
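The read flow above can be made concrete with a short software sketch. This is an illustration only: it assumes a direct-mapped organization and made-up parameters (BLOCK_SIZE, NUM_LINES, and a word-addressable main_memory list); it is not the textbook's implementation.

# Minimal sketch of the Figure 5.3 read flow for a direct-mapped cache.
# All names and parameter values here are illustrative assumptions.
BLOCK_SIZE = 4      # words per block (K)
NUM_LINES = 16      # lines in the cache (C)

# Each line holds (valid, tag, block-of-words).
cache = [(False, None, None)] * NUM_LINES

def read(ra, main_memory):
    """Return the word at address RA, loading its block on a miss."""
    block_addr = ra // BLOCK_SIZE          # which memory block holds RA
    line = block_addr % NUM_LINES          # line that block maps to
    tag = block_addr // NUM_LINES          # remaining high-order bits
    valid, stored_tag, block = cache[line]

    if valid and stored_tag == tag:        # hit: fetch RA word, deliver to CPU
        return block[ra % BLOCK_SIZE]

    # Miss: access main memory for the block containing RA,
    # allocate the cache line, then deliver the word.
    start = block_addr * BLOCK_SIZE
    block = main_memory[start:start + BLOCK_SIZE]
    cache[line] = (True, tag, block)
    return block[ra % BLOCK_SIZE]

# Example: a small word-addressable main memory.
memory = list(range(256))
assert read(37, memory) == 37

On a hit the word is returned directly; on a miss the whole block is first loaded into the line it maps to, mirroring the flow of Figure 5.3.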


Figure 5.4
Typical Cache Organization
The cache connects to the processor by data, address, and control lines; an address buffer and a data buffer connect it to the system bus, over which main memory is reached.
Table 5.1
Elements of Cache Design

Cache Addresses: logical, physical
Cache Size
Mapping Function: direct, associative, set associative
Replacement Algorithm: least recently used (LRU), first in first out (FIFO), least frequently used (LFU), random
Write Policy: write through, write back
Line Size
Number of Caches: single or two level; unified or split


Cache Addresses
Virtual Memory
• Virtual memory
– Facility that allows programs to address memory from a
logical point of view, without regard to the amount of main
memory physically available
– When used, the address fields of machine instructions
contain virtual addresses
– For reads from and writes to main memory, a hardware
memory management unit (MMU) translates each virtual
address into a physical address in main memory



Figure 5.5
Logical and Physical Caches
(a) Logical cache: the cache sits between the processor and the MMU, so it stores data using virtual (logical) addresses; the MMU translates addresses on the path to main memory.
(b) Physical cache: the MMU sits between the processor and the cache, so the cache stores data using main memory physical addresses.
Cache Size
• Preferable for the size of the cache to be:
– Small enough so that the overall average cost per bit is close to that of
main memory alone
– Large enough so that the overall average access time is close to that of
the cache alone

• Motivations for minimizing cache size:
  – The larger the cache, the larger the number of gates involved in addressing it, so large caches tend to be slightly slower than small ones
  – The available chip and board area also limits cache size

• Because cache performance is very sensitive to the nature of the workload, it is impossible to arrive at a single "optimum" cache size
Table 5.2
Cache Sizes of Some Processors

Processor              Type                 Year  L1 Cache*            L2 Cache        L3 Cache
IBM 360/85             Mainframe            1968  16 to 32 kB          –               –
PDP-11/70              Minicomputer         1975  1 kB                 –               –
IBM 3033               Mainframe            1978  64 kB                –               –
IBM 3090               Mainframe            1985  128 to 256 kB        –               –
Intel 80486            PC                   1989  8 kB                 –               –
Pentium                PC                   1993  8 kB/8 kB            256 to 512 kB   –
PowerPC 620            PC                   1996  32 kB/32 kB          –               –
IBM S/390 G6           Mainframe            1999  256 kB               8 MB            –
Pentium 4              PC/server            2000  8 kB/8 kB            256 kB          –
Itanium                PC/server            2001  16 kB/16 kB          96 kB           4 MB
Itanium 2              PC/server            2002  32 kB                256 kB          6 MB
IBM POWER5             High-end server      2003  64 kB                1.9 MB          36 MB
CRAY XD-1              Supercomputer        2004  64 kB/64 kB          1 MB            –
IBM POWER6             PC/server            2007  64 kB/64 kB          4 MB            32 MB
IBM z10                Mainframe            2008  64 kB/128 kB         3 MB            24-48 MB
Intel Core i7 EE 990   Workstation/server   2011  6 × 32 kB/32 kB      6 × 1.5 MB      12 MB
IBM zEnterprise 196    Mainframe/server     2011  24 × 64 kB/128 kB    24 × 1.5 MB     24 MB L3, 192 MB L4
IBM z13                Mainframe/server     2015  24 × 96 kB/128 kB    24 × 2 MB/2 MB  64 MB L3, 480 MB L4
Intel Core i9-7900X    Workstation/server   2017  8 × 32 kB/32 kB      8 × 1 MB        14 MB

* Two values separated by a slash refer to instruction and data caches.
(Table can be found on page 145 in the textbook.)


Table 5.3
Cache Access Methods

Direct Mapped
  Organization: sequence of m lines
  Mapping of main memory blocks to cache: each block of main memory maps to one unique line of cache
  Access using main memory address: Line portion of address used to access cache line; Tag portion used to check for hit on that line

Fully Associative
  Organization: sequence of m lines
  Mapping of main memory blocks to cache: each block of main memory can map to any line of cache
  Access using main memory address: Tag portion of address used to check every line for hit on that line

Set Associative
  Organization: sequence of m lines organized as v sets of k lines each (m = v × k)
  Mapping of main memory blocks to cache: each block of main memory maps to one unique cache set
  Access using main memory address: Line portion of address used to access cache set; Tag portion used to check every line in that set for hit on that line


Figure 5.6
Mapping from Main Memory to Cache: Direct and Associative
(a) Direct mapping: each of the first m blocks of main memory (equal to the size of the cache) maps to a fixed one of the m cache lines; b = length of block in bits, t = length of tag in bits.
(b) Associative mapping: any one block of main memory can be placed in any of the m cache lines.


Figure 5.7
Direct-Mapping Cache Organization



Figure 5.8
Direct Mapping Example
16-MByte main memory (24-bit addresses) and a 16K-line cache with 4-byte lines.
Main memory address = Tag (8 bits) | Line (14 bits) | Word (2 bits).
Sample cache contents (tag / data / line number): 00 / 13579246 / 0000; 16 / 11235813 / 0001; 16 / FEDCBA98 / 0CE7; FF / 11223344 / 3FFE; 16 / 12345678 / 3FFF.
Note: memory address values are in binary representation; other values are in hexadecimal.
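To make the field widths concrete, the sketch below splits a 24-bit address into the 8-bit tag, 14-bit line, and 2-bit word fields used in Figure 5.8. The function name and the choice of example address are illustrative assumptions, not part of the textbook figure.

# Split a 24-bit address into the Tag (8) / Line (14) / Word (2) fields of Figure 5.8.
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split_address(addr):
    word = addr & ((1 << WORD_BITS) - 1)                 # byte within the 4-byte block
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # cache line the block maps to
    tag  = addr >> (WORD_BITS + LINE_BITS)               # remaining high-order bits
    return tag, line, word

# Address 0x16339C (binary 000101100011001110011100), shown against data FEDCBA98 in the figure:
tag, line, word = split_address(0x16339C)
print(f"tag={tag:02X} line={line:04X} word={word}")      # tag=16 line=0CE7 word=0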
Content-Addressable Memory (CAM)
• Also known as associative storage
• Content-addressable memory is constructed of static RAM (SRAM)
cells but is considerably more expensive and holds much less data
than regular SRAM chips
• A CAM with the same data capacity as a regular SRAM is about
60% larger
• A CAM is designed such that when a bit string is supplied, the CAM
searches its entire memory in parallel for a match
– If the content is found, the CAM returns the address where the match
is found and, in some architectures, also returns the associated data
word
– This process takes only one clock cycle

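As a rough functional model (a sequential software approximation, not a description of the parallel hardware search), the lookup behavior described above can be sketched as follows; the names cam_search and cam_entries are illustrative.

# Functional model of a CAM lookup: supply a bit string (e.g. a tag),
# get back the position(s) where it matches. Real CAM hardware performs
# this comparison across all entries in parallel, in one clock cycle.
def cam_search(cam_entries, key):
    """Return the indices of entries whose stored value matches the key."""
    return [i for i, value in enumerate(cam_entries) if value == key]

cam = ["0x1A2B", "0x3C4D", "0x1A2B", "0x0000"]
print(cam_search(cam, "0x1A2B"))   # [0, 2]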


Figure 5.9
Content-Addressable Memory



Figure 5.10
Fully Associative Cache Organization



Figure 5.11
Associative Mapping Example
16-MByte main memory (24-bit addresses) and a 16K-line cache with 4-byte lines.
Main memory address = Tag (22 bits) | Word (2 bits).
Sample cache contents (tag / data / line number): 3FFFFE / 11223344 / 0000; 058CE7 / FEDCBA98 / 0001; 3FFFFD / 33333333 / 3FFD; 000000 / 13579246 / 3FFE; 3FFFFF / 24682468 / 3FFF.
Note: memory address values are in binary representation; other values are in hexadecimal.
Set Associative Mapping
• Compromise that exhibits the strengths of both the direct
and associative approaches while reducing their
disadvantages
• Cache consists of a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
• e.g., 2 lines per set
  – 2-way set-associative mapping
  – A given block can be in either of 2 lines in exactly one set (see the sketch below)

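The following sketch shows how a block address selects a set in a k-way set-associative cache, after which the tag is compared against all k lines of that set. The cache geometry (K, V, BLOCK_SIZE) is an assumed example chosen to match the two-way organization of Figure 5.14, not a definition from the text.

# Locate the set for a block in a k-way set-associative cache.
# Geometry below is an assumed example: m = V * K total lines.
K = 2           # lines per set (2-way); the tag is checked against all K lines of the set
V = 8192        # number of sets
BLOCK_SIZE = 4  # bytes per block

def find_set(addr):
    block_addr = addr // BLOCK_SIZE
    set_index = block_addr % V        # Set field of the address
    tag = block_addr // V             # Tag field
    return set_index, tag

set_index, tag = find_set(0x16339C)
print(hex(set_index), hex(tag))       # 0xce7 0x2c (matches the 13-bit set / 9-bit tag of Figure 5.14)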


Figure 5.12
Mapping from Main Memory to Cache: k-Way Set Associative
(a) Viewed as v associative-mapped caches: each of the first v blocks of main memory (equal to the number of sets) maps to its own set of k lines.
(b) Viewed as k direct-mapped caches: each "way" is a direct-mapped cache of v lines, and a given block maps to the same line position in any of the k ways.
Figure 5.13
k-Way Set Associative Cache Organization



Figure 5.14
Two-Way Set-Associative Mapping Example
16-MByte main memory (24-bit addresses) and a 16K-line cache organized as 8K sets of two 4-byte lines.
Main memory address = Tag (9 bits) | Set (13 bits) | Word (2 bits).
Sample cache contents (set number: tag / data): 0000: 000 / 13579246 and 02C / 77777777; 0001: 02C / 11235813; 0CE7: 02C / FEDCBA98; 1FFE: 1FF / 11223344; 1FFF: 02C / 12345678 and 1FF / 24682468.
Note: memory address values are in binary representation; other values are in hexadecimal.
Figure 5.15
Varying Associativity over Cache Size
Hit ratio (0.0 to 1.0) plotted against cache size (1k to 1M bytes) for direct-mapped and 2-way, 4-way, 8-way, and 16-way set-associative organizations.


Replacement Algorithms
• Once the cache has been filled, when a new block is
brought into the cache, one of the existing blocks must
be replaced
• For direct mapping there is only one possible line for any
particular block and no choice is possible
• For the associative and set-associative techniques a
replacement algorithm is needed
• To achieve high speed, an algorithm must be
implemented in hardware



The most common replacement algorithms
are:
• Least recently used (LRU)
– Most effective
– Replace that block in the set that has been in the cache longest with no
reference to it
– Because of its simplicity of implementation, LRU is the most popular
replacement algorithm

• First-in-first-out (FIFO)
– Replace that block in the set that has been in the cache longest
– Easily implemented as a round-robin or circular buffer technique

• Least frequently used (LFU)


– Replace that block in the set that has experienced the fewest references
– Could be implemented by associating a counter with each line

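As a software analogy only (the hardware implementation the slides refer to can be much simpler; for a two-way set-associative cache, LRU can be realized with a USE bit per line), the LRU policy for one cache set can be sketched as follows. The class name LRUSet and the load_block callback are illustrative assumptions.

from collections import OrderedDict

# LRU policy for one cache set with a fixed number of lines (an illustrative
# software model; real caches implement the policy in hardware).
class LRUSet:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()          # tag -> block, ordered oldest-first

    def access(self, tag, load_block):
        if tag in self.lines:               # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return self.lines[tag]
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        block = load_block(tag)             # miss: fetch block from the next level
        self.lines[tag] = block
        return block

s = LRUSet(num_lines=2)
for tag in [0x02C, 0x1FF, 0x02C, 0x000]:    # 0x1FF is evicted, not the re-referenced 0x02C
    s.access(tag, load_block=lambda t: f"block-{t:03X}")
print(list(s.lines))                        # [44, 0], i.e. tags 0x02C and 0x000 remain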


Write Policy

• When a block that is resident in the cache is to be replaced, there are two cases to consider:
  – If the old block in the cache has not been altered, then it may be overwritten with a new block without first writing out the old block
  – If at least one write operation has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block

• There are two problems to contend with:
  – More than one device may have access to main memory
  – A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache; if a word is altered in one cache, it could conceivably invalidate a word in other caches


Write Through
and Write Back
• Write through
– Simplest technique
– All write operations are made to main memory as well as to the cache
– The main disadvantage of this technique is that it generates substantial
memory traffic and may create a bottleneck

• Write back
– Minimizes memory writes
– Updates are made only in the cache
– Portions of main memory are invalid and hence accesses by I/O modules
can be allowed only through the cache
– This makes for complex circuitry and a potential bottleneck

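A minimal sketch of the bookkeeping behind the two policies, under assumed data structures (a per-line dirty flag for write back); this is an illustration of the idea, not a complete cache model.

# Write through: every write goes to both the cache and main memory.
def write_through(cache_line, offset, value, main_memory, mem_addr):
    cache_line["data"][offset] = value
    main_memory[mem_addr] = value            # memory is always up to date

# Write back: writes go only to the cache; the dirty flag defers the
# memory update until the line is evicted.
def write_back(cache_line, offset, value):
    cache_line["data"][offset] = value
    cache_line["dirty"] = True               # line no longer matches memory

def evict(cache_line, main_memory, block_start):
    if cache_line.get("dirty"):              # flush only if the line was modified
        for i, word in enumerate(cache_line["data"]):
            main_memory[block_start + i] = word
    cache_line["dirty"] = False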


Write Miss Alternatives
• There are two alternatives in the event of a write miss at a cache
level:
– Write allocate
– The block containing the word to be written is fetched from main memory (or
next level cache) into the cache and the processor proceeds with the write
cycle
– No write allocate
– The block containing the word to be written is modified in the main memory
and not loaded into the cache

• Either of these policies can be used with either write through or write
back
• No write allocate is most commonly used with write through
• Write allocate is most commonly used with write back

Copyright © 2022 Pearson Education, Ltd. All Rights Reserved


Cache Coherency
• A new problem is introduced in a bus organization in which more than one device has
a cache and main memory is shared
• If data in one cache are altered, this invalidates not only the corresponding word in
main memory, but also that same word in other caches
• Even if a write-through policy is used, the other caches may contain invalid data
• Possible approaches to cache coherency include:
– Bus watching with write through
• Each cache controller monitors the address lines to detect write operations to memory by other bus
masters
• If another master writes to a location in shared memory that also resides in the cache memory, the cache
controller invalidates that cache entry
• This strategy depends on the use of a write-through policy by all cache controllers
– Hardware transparency
• Additional hardware is used to ensure that all updates to main memory via cache are reflected in all
caches
• If one processor modifies a word in its cache, this update is written to main memory
– Noncacheable memory
• Only a portion of main memory is shared by more than one processor, and this is designated as noncacheable
• All accesses to shared memory are cache misses, because the shared memory is never copied into the cache
• The noncacheable memory can be identified using chip-select logic or high-address bits



Line Size

• When a block of data is retrieved and placed in the cache, not only the desired word but also some number of adjacent words are retrieved
• As the block size increases, the hit ratio will at first increase because of the principle of locality
• As the block size increases, more useful data are brought into the cache
• The hit ratio will begin to decrease, however, as the block becomes bigger and the probability of using the newly fetched information becomes less than the probability of reusing the information that has to be replaced
• Two specific effects come into play:
  – Larger blocks reduce the number of blocks that fit into a cache
  – As a block becomes larger, each additional word is farther from the requested word


Multilevel Caches
• As logic density has increased it has become possible to have a cache on the same
chip as the processor
• The on-chip cache reduces the processor's external bus activity, speeding up execution and increasing overall system performance
– When the requested instruction or data is found in the on-chip cache, the bus access is
eliminated
– On-chip cache accesses will complete appreciably faster than would even zero-wait
state bus cycles
– During this period the bus is free to support other transfers

• Two-level cache:
– Internal cache designated as level 1 (L1)
– External cache designated as level 2 (L2)

• Potential savings due to the use of an L2 cache depends on the hit rates in both the
L1 and L2 caches
• The use of multilevel caches complicates all of the design issues related to caches,
including size, replacement algorithm, and write policy

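The potential savings can be made concrete with a simple two-level average-access-time model. The latencies and hit rates below are illustrative assumptions, not figures from the textbook.

# Simplified two-level model: average access time as seen by the processor.
# All latencies and hit rates are illustrative assumptions.
t_l1, t_l2, t_mem = 1, 10, 100   # access times in cycles
h_l1, h_l2 = 0.90, 0.95          # hit rates (the L2 rate applies to accesses that miss L1)

avg = t_l1 + (1 - h_l1) * (t_l2 + (1 - h_l2) * t_mem)
print(f"average access time = {avg:.1f} cycles")   # 2.5 cycles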


Figure 5.16
Total Hit Ratio (L1 and L2) for 8-kB and 16-kB L1
Total hit ratio (0.78 to 0.98) plotted against L2 cache size (1k to 2M bytes) for L1 = 8 kB and L1 = 16 kB.
Unified Versus Split Caches
• Has become common to split cache:
– One dedicated to instructions
– One dedicated to data
– Both exist at the same level, typically as two L1 caches

• Advantages of unified cache:
  – Higher hit rate
    ▪ Balances the load of instruction and data fetches automatically
  – Only one cache needs to be designed and implemented

• Trend is toward split caches at the L1 level and unified caches for higher levels
• Advantages of split cache:
– Eliminates cache contention between instruction fetch/decode unit and execution unit
▪ Important in pipelining



Inclusion Policy
• Inclusive policy
– Dictates that a piece of data in one cache is guaranteed to be also found in all lower levels of caches
– Advantage is that it simplifies searching for data when there are multiple processors in the computing
system
– This property is useful in enforcing cache coherence

• Exclusive policy
– Dictates that a piece of data in one cache is guaranteed not to be found in all lower levels of caches
– The advantage is that it does not waste cache capacity since it does not store multiple copies of the
same data in all of the caches
– The disadvantage is the need to search multiple cache levels when invalidating or updating a block
– To minimize the search time, the highest-level tag sets are typically duplicated at the lowest cache
level to centralize searching

• Noninclusive policy
– With the noninclusive policy a piece of data in one cache may or may not be found in lower levels of
caches
– As with the exclusive policy, this policy will generally maintain all higher-level cache sets at the lowest
cache level



Table 5.4
Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
Processor on which feature first appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
Processor on which feature first appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
Processor on which feature first appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
Processor on which feature first appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache.
Processor on which feature first appears: Pentium Pro
Solution: Move L2 cache onto the processor chip.
Processor on which feature first appears: Pentium II

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Solution: Add external L3 cache.
Processor on which feature first appears: Pentium III
Solution: Move L3 cache on-chip.
Processor on which feature first appears: Pentium 4


Figure 5.17
Pentium 4 Block Diagram



Table 5.5
Pentium 4 Cache Operating Modes

Control Bits Operating Mode

CD NW Cache Fills Write Throughs Invalidates

0 0 Enabled Enabled Enabled

1 0 Disabled Enabled Enabled

1 1 Disabled Disabled Disabled

Note: CD = 0; NW = 1 is an invalid combination.



Figure 5.18
IBM z13 CPC Drawer Logical Structure



Cache Timing Model
• Direct-mapped cache access
– The first operation is checking the Tag field of an address against the tag value in the
line designated by the Line field
– If there is not a match (miss), the operation is complete
– If there is a match (hit), the cache hardware reads the data block from the line in the
cache and then fetches the byte or word indicated by the Offset field of the address
– An advantage is that it allows simple and fast speculation

• Fully associative cache


– The line number is not known until the tag comparison is completed
– The hit time is the same as for direct-mapped
– Because this is a content-addressable memory, the miss time is simply the tag
comparison time

• Set associative
– It is not possible to transmit bytes and compare tags in parallel as can be done with
direct-mapped with speculative access
– However, the circuitry can be designed so that the data block from each line in a set
can be loaded and then transmitted once the tag check is made



Table 5.6
Cache Timing Equations

Organization                          Time for hit                     Time for miss
Direct-Mapped                         thit = trl + txb + tct           tmiss = trl + tct
Direct-Mapped with Speculation        thit = trl + txb                 tmiss = trl + tct
Fully Associative                     thit = trl + txb + tct           tmiss = tct
Set-Associative                       thit = trl + txb + tct           tmiss = trl + tct
Set-Associative with Way Prediction   thit = trl + txb + (1 - Fp)tct   tmiss = trl + tct
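Using the equations in Table 5.6 with assumed component latencies (trl = line read time, txb = byte transmit time, tct = tag comparison time, Fp = fraction of correct way predictions), the organizations can be compared directly. The numeric values below are illustrative only.

# Evaluate the Table 5.6 timing equations with assumed latencies.
trl, txb, tct = 2.0, 1.0, 1.0   # illustrative component times
Fp = 0.9                        # assumed way-prediction accuracy

timings = {
    "direct-mapped":               (trl + txb + tct, trl + tct),
    "direct-mapped, speculative":  (trl + txb,       trl + tct),
    "fully associative":           (trl + txb + tct, tct),
    "set-associative":             (trl + txb + tct, trl + tct),
    "set-assoc. with way predict": (trl + txb + (1 - Fp) * tct, trl + tct),
}
for org, (t_hit, t_miss) in timings.items():
    print(f"{org:29s} hit = {t_hit:.1f}  miss = {t_miss:.1f}")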


Table 5.7
Cache Performance Improvement Techniques

Technique                            Reduce t1        Reduce (1 - h1)   Reduce tpenalty
Way Prediction                       ✓
Cache Capacity                       Small            Large
Line Size                            Small            Large
Degree of Associativity              Decrease         Increase
More Flexible Replacement Policies                    ✓
Cache Unity                          Split I-cache    Unified cache
                                     and D-cache
Prefetching                                           ✓
Write Through                                         Write allocate    No write allocate
Critical Word First                                                     ✓
Victim Cache                                          ✓
Wider Busses                                                            ✓


Summary
Chapter 5
Cache Memory

• Cache memory principles
• Elements of cache design
  – Cache addresses
  – Cache size
  – Logical cache organization
  – Replacement algorithms
  – Write policy
  – Line size
  – Number of caches
  – Inclusion policy
• Intel x86 cache organization
• The IBM z13 cache organization
• Cache performance models
  – Cache timing model
  – Design options for improving performance
