Chapter 5
Rung-Bin Lin
Introduction
The necessity of a memory hierarchy in computer system design is driven by two factors:

Locality of reference: the nature of program behavior.
The large gap in speed between the CPU and slower memories such as DRAM main memory and mass storage devices.
Memory Hierarchy
ABCs of Caches
Recalling some terms:

Cache: The name given to the first level of the memory hierarchy encountered once the address leaves the CPU.
Miss rate: The fraction of accesses not found in the cache.
Miss penalty: The additional time required to service a miss.
Block: The minimum unit of information that can be present in the cache.

Four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Cache Performance
Formulas for performance evaluation:

CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time
                   = IC * (CPI_execution + Memory stall cycles / IC) * Clock cycle time

Memory stall cycles = IC * Memory references per instruction * Miss rate * Miss penalty

Measure of memory-hierarchy performance:

Average memory access time = Hit time + Miss rate * Miss penalty
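The formulas above can be evaluated directly. A minimal sketch, using illustrative numbers that are not from the text (10^9 instructions, base CPI 1.0, 1.5 memory references per instruction, 2% miss rate, 100-cycle miss penalty, 1 ns clock):

```python
# Sketch of the CPU-time and average-memory-access-time formulas above.
# All numeric inputs below are assumed for illustration, not taken from the text.

def memory_stall_cycles(ic, refs_per_instr, miss_rate, miss_penalty):
    # Memory stall cycles = IC * refs/instruction * miss rate * miss penalty
    return ic * refs_per_instr * miss_rate * miss_penalty

def cpu_time(ic, cpi_exec, refs_per_instr, miss_rate, miss_penalty, cycle_time):
    # CPU time = (CPU clock cycles + Memory stall cycles) * clock cycle time
    stalls = memory_stall_cycles(ic, refs_per_instr, miss_rate, miss_penalty)
    return (ic * cpi_exec + stalls) * cycle_time

def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = Hit time + Miss rate * Miss penalty
    return hit_time + miss_rate * miss_penalty

print(cpu_time(1e9, 1.0, 1.5, 0.02, 100, 1e-9))  # 4.0 seconds
print(amat(1, 0.02, 100))                         # 3.0 cycles
```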
Fully associative: A block can be placed anywhere in the cache.
Set associative: A block can be placed in a restricted set of places in the cache. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually obtained by

(Block address) MOD (Number of sets in the cache)

If there are n blocks in a set, the cache is called n-way set associative.
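The MOD mapping above can be sketched directly. The parameters here (8 blocks organized 2-way, giving 4 sets) are assumed for illustration:

```python
# Sketch of the set-mapping rule: (block address) MOD (number of sets).
# Cache geometry below is an assumed example, not from the text.

NUM_BLOCKS = 8
ASSOCIATIVITY = 2                         # 2-way set associative
NUM_SETS = NUM_BLOCKS // ASSOCIATIVITY    # 8 / 2 = 4 sets

def set_index(block_address):
    # (block address) MOD (number of sets in the cache)
    return block_address % NUM_SETS

# Block addresses 12 and 20 map to the same set (both 12 % 4 and 20 % 4 are 0);
# they can coexist only because each set holds 2 blocks.
print(set_index(12), set_index(20))  # 0 0
```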
Block Identification
Q2: How is a block found if it is in the cache?
Each cache block consists of:

Address tag: Gives the block address.
Valid bit: Indicates whether or not the entry contains a valid address.
Data
Identification Steps
The index field of the CPU address is used to select a set.
The tag field presented by the CPU is compared in parallel to all address tags of the blocks in the selected set.
If any address tag matches the tag field of the CPU address and its valid bit is set, it is a cache hit.
The offset field is used to select the desired data within the block.
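The steps above start by splitting the CPU address into tag, index, and offset fields. A minimal sketch, assuming an illustrative cache with 64-byte blocks and 128 sets (6 offset bits, 7 index bits):

```python
# Sketch of tag / index / offset extraction from a byte address.
# The cache geometry (64-byte blocks, 128 sets) is an assumed example.

BLOCK_SIZE = 64                              # bytes per block -> 6 offset bits
NUM_SETS = 128                               # -> 7 index bits
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # 6
INDEX_BITS = NUM_SETS.bit_length() - 1       # 7

def split_address(addr):
    offset = addr & (BLOCK_SIZE - 1)             # select byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)  # select the set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)     # compared against stored tags
    return tag, index, offset

tag, index, offset = split_address(0x12345)
print(hex(tag), index, offset)  # 0x9 13 5
```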
Block Replacement
Q3: Which block should be replaced on a cache miss?
For a direct-mapped cache, the answer is obvious: only one block can hold the data, so only that block can be replaced. For set-associative or fully associative caches, three strategies are commonly used:

Random
Least-recently used (LRU)
First in, first out (FIFO)
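LRU for a single set can be sketched with an ordered dictionary acting as the recency list (most recently used at the end). The 2-way geometry and the access trace are assumed for illustration:

```python
# Sketch of LRU replacement for one cache set; geometry and trace are
# assumed examples, not from the text.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data; insertion order = recency

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting the LRU block."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in [1, 2, 1, 3, 2]]
# Tag 3 evicts tag 2 (the LRU block), so the final access to 2 misses again.
print(hits)  # [False, False, True, False, False]
```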
Write Strategy
Q4: What happens on a write?
Traffic patterns:

Writes make up about 7% of the overall memory traffic and about 25% of the data cache traffic. Although reads dominate processor cache traffic, writes still cannot be ignored in a high-performance design.
Either write-miss option can be used with write through or write back, but write-back caches generally use write allocate, and write-through caches often use no-write allocate.
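The difference in memory traffic between the two write-hit policies can be sketched with a counter. The class structure and the three-write example are illustrative, not from the text:

```python
# Sketch contrasting write-through and write-back on write hits.
# The model only counts writes that reach main memory; it is an assumed example.

class Cache:
    def __init__(self, write_back):
        self.write_back = write_back
        self.data = {}       # block address -> value
        self.dirty = set()   # dirty blocks (write-back only)
        self.mem_writes = 0  # writes that reach main memory

    def write_hit(self, block, value):
        self.data[block] = value
        if self.write_back:
            self.dirty.add(block)   # defer the memory write until eviction
        else:
            self.mem_writes += 1    # write through to memory immediately

    def evict(self, block):
        if self.write_back and block in self.dirty:
            self.mem_writes += 1    # dirty block written back on eviction
            self.dirty.discard(block)
        self.data.pop(block, None)

wt, wb = Cache(write_back=False), Cache(write_back=True)
for c in (wt, wb):
    c.data[0] = 0
    for v in range(3):      # three successive writes to the same block
        c.write_hit(0, v)
    c.evict(0)
print(wt.mem_writes, wb.mem_writes)  # 3 1
```

Three writes to one block cost three memory writes under write through but only one write back on eviction, which is why write back reduces memory traffic for repeatedly written blocks.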
Cache Performance
Average memory access time for processors with in-order execution
Average memory access time = Hit time + Miss rate * Miss penalty

Examples on pages 408 and 409.
Multilevel Caches
Question: a larger cache or a faster cache? The two goals conflict. Solution: add another level of cache. A second-level cache, however, complicates performance evaluation of the memory hierarchy.
Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1

where

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
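The two-level formula composes directly. A sketch with assumed numbers (L1 hit 1 cycle, L1 local miss rate 4%, L2 hit 10 cycles, L2 local miss rate 20%, 200-cycle main-memory penalty):

```python
# Sketch of the two-level average-memory-access-time formula above.
# All numeric inputs are assumed for illustration, not from the text.

def amat_two_level(hit_l1, miss_l1, hit_l2, miss_l2, penalty_l2):
    # Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
    miss_penalty_l1 = hit_l2 + miss_l2 * penalty_l2
    # AMAT = Hit time_L1 + Miss rate_L1 * Miss penalty_L1
    return hit_l1 + miss_l1 * miss_penalty_l1

print(amat_two_level(1, 0.04, 10, 0.20, 200))  # 1 + 0.04 * (10 + 40) = 3.0
```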
Example (P417)
Miss Categories
Compulsory miss
The very first access to a block always misses, because the block cannot yet be in the cache.
Capacity miss
Occurs when the cache cannot contain all the blocks needed during execution of a program, so blocks are discarded and later retrieved.
Conflict miss
Occurs in direct-mapped or set-associative caches when too many blocks map to the same set, so a block is discarded and later retrieved even though other sets may have room.
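Compulsory misses can be separated from the other two categories by tracking which blocks have ever been referenced. A sketch with a tiny direct-mapped model; the geometry and trace are assumed (fully separating capacity from conflict misses would additionally require simulating a fully associative cache of the same size, which is omitted here):

```python
# Sketch: counting compulsory vs. other (capacity/conflict) misses in a
# direct-mapped cache. Geometry and trace are assumed examples.

def classify_misses(trace, num_sets):
    seen = set()   # block addresses referenced at least once
    cache = {}     # set index -> block address currently resident
    compulsory = other = 0
    for block in trace:
        idx = block % num_sets
        if cache.get(idx) == block:
            continue              # hit
        if block in seen:
            other += 1            # capacity or conflict miss
        else:
            compulsory += 1       # first-ever reference cannot hit
            seen.add(block)
        cache[idx] = block
    return compulsory, other

# Blocks 0 and 4 collide in a 4-set direct-mapped cache, so the repeated
# references miss even though the cache is nearly empty (conflict behavior).
print(classify_misses([0, 4, 0, 4], num_sets=4))  # (2, 2)
```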
Example (P426)
Larger Caches
Drawbacks:

Longer hit time
Higher cost
Higher Associativity
Two general rules of thumb:

An 8-way set-associative cache is, for practical purposes, as effective at reducing misses as a fully associative cache.
2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

The pressure of a fast processor clock cycle encourages simple caches, but the increasing miss penalty rewards associativity. Example on page 429.
Way Prediction
Goal: reduce conflict misses and yet maintain the hit speed of a direct-mapped cache.
Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access. This means the multiplexor can be set early to select the desired block. A misprediction results in checking the other blocks for matches in subsequent cycles. The Alpha 21264 uses this technique:

Correctly predicted hits take 1 cycle
Mispredicted hits take 3 cycles
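The average hit time under way prediction follows from the two latencies above. A sketch assuming (this accuracy figure is not from the text) an 85% prediction accuracy:

```python
# Sketch: average hit time under way prediction. The 85% accuracy is an
# assumed value for illustration; cycle counts follow the slide above.

def avg_hit_time(accuracy, correct_cycles=1, mispredict_cycles=3):
    return accuracy * correct_cycles + (1 - accuracy) * mispredict_cycles

print(avg_hit_time(0.85))  # ~1.3 cycles on average
```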
Pseudoassociative Caches
On a hit, the access proceeds just as in a direct-mapped cache. On a miss, a second cache entry is checked to see if the block matches there before going to the next lower level.
Compiler Optimizations
Loop interchange
Reduces misses by improving spatial locality.
Blocking
Reduces capacity misses.
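Loop interchange can be sketched on a row-major 2-D array: both loop nests compute the same result, but the interchanged version touches memory sequentially. (Python itself hides cache effects; in C or Fortran the changed access order is what reduces misses. The array size is an assumed example.)

```python
# Sketch of loop interchange on a row-major array; same work, better
# spatial locality in the interchanged version.

N = 4
x = [[0] * N for _ in range(N)]

# Before interchange: column-major traversal of a row-major array
# (stride-N accesses in a real memory layout).
for j in range(N):
    for i in range(N):
        x[i][j] = i * N + j

before = [row[:] for row in x]

# After interchange: row-major traversal (consecutive addresses).
for i in range(N):
    for j in range(N):
        x[i][j] = i * N + j

assert x == before  # identical result; only the access order changed
```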
Blocking
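Blocking (tiling) can be sketched with matrix multiply: the tiled loops compute the same product while reusing a small tile of each operand before moving on, so the working set fits in the cache. The matrix size and block factor B are assumed examples:

```python
# Sketch of blocking (tiling) for matrix multiply. N and the block factor B
# are assumed illustrative values; in practice B is chosen so the tiles fit
# in the cache, reducing capacity misses.

N, B = 8, 4
A = [[i + j for j in range(N)] for i in range(N)]
Bm = [[(i * j) % 7 for j in range(N)] for i in range(N)]

def matmul_naive(A, Bm):
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * Bm[k][j]
    return C

def matmul_blocked(A, Bm):
    C = [[0] * N for _ in range(N)]
    for jj in range(0, N, B):          # tile the j loop
        for kk in range(0, N, B):      # tile the k loop
            for i in range(N):
                for j in range(jj, jj + B):
                    s = C[i][j]
                    for k in range(kk, kk + B):
                        s += A[i][k] * Bm[k][j]
                    C[i][j] = s
    return C

assert matmul_blocked(A, Bm) == matmul_naive(A, Bm)  # same product
```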
Performance of Hit-Under-Miss
With four instruction (data) stream buffers, the hit rate improves to 50% (43%).
Compiler-Controlled Prefetching
The compiler inserts prefetch instructions to request data before they are needed. Two flavors:

Register prefetch: loads the value into a register.
Cache prefetch: loads the data only into the cache, not into a register.
Memory Technology
Performance metrics — latency has two measures:

Access time: The time between when a read is requested and when the desired word arrives.
Cycle time: The minimum time between requests to memory.
DRAM
Refreshing occupies less than 5% of a DRAM's time; DRAM speed improves only slowly.
Synchronous DRAM
Synchronous DRAMs have a programmable register that holds the number of bytes requested, and hence can send many bytes over several cycles per request, amortizing the overhead of synchronizing with the controller.
Physical address: used to access main memory.

Address translation: converts a virtual address to a physical address. It can easily form the critical path that limits the clock cycle time.
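Address translation can be sketched with a single-level page table. The 4 KB page size and the table contents below are assumed for illustration:

```python
# Sketch of virtual-to-physical address translation with a one-level page
# table. Page size and table contents are assumed examples.

PAGE_SIZE = 4096                 # 4 KB pages -> 12 offset bits
page_table = {0: 7, 1: 3, 2: 9}  # virtual page number -> physical frame number

def translate(virtual_addr):
    vpn = virtual_addr // PAGE_SIZE      # virtual page number
    offset = virtual_addr % PAGE_SIZE    # unchanged by translation
    if vpn not in page_table:
        raise LookupError("page fault")  # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # VPN 1 maps to frame 3: 0x3234
```

In hardware this lookup is accelerated by a TLB precisely because doing it on every access would otherwise sit on the critical path mentioned above.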
Types of VM
Paged
Segmented
Paged segments
See Fig. 5.34 on page 463.
The size of VM is determined by the size of the processor address; the size of a cache is independent of the processor address size.

Secondary storage occupied by the file system is not normally part of the VM address space.
Computer designers can make protection easy for the OS to implement through the VM design.
Protecting Process
The simplest mechanism: use base and bound registers. An access is valid if

Base <= Address <= Bound
To enable this protection, computer designers have three responsibilities:

Provide at least two execution modes: user mode and kernel (OS, supervisor) mode.
Provide a portion of the CPU state that a user process can use but not write.
Provide mechanisms whereby the CPU can go from user mode to kernel mode and back.
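The base-and-bound check itself is a one-line comparison. A sketch with assumed register values:

```python
# Sketch of the base-and-bound protection check: an access is valid if
# Base <= Address <= Bound. Register values are assumed examples.

BASE, BOUND = 0x1000, 0x1FFF

def check_access(addr):
    if not (BASE <= addr <= BOUND):
        # In hardware this raises an exception and traps to the OS.
        raise PermissionError("protection violation")
    return addr  # access proceeds

print(hex(check_access(0x1800)))  # within [Base, Bound]: allowed
```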
The Alpha obeys only the protection requirements imposed by the bottom-level PTEs.
Concluding Remarks
The primary challenge for the memory-hierarchy designer lies in choosing parameters that work well together, not in inventing new techniques; there are already enough of those.