Chapter 4 - Cache Memory
Luis Tarrataca
[email protected]
CEFET-RJ
Table of Contents I
1 Introduction
Memory Hierarchy
Cache Addresses
Cache Size
Mapping Function
Direct Mapping
Associative Mapping
Set-associative mapping
Replacement Algorithms
Write Policy
Line Size
Number of caches
Table of Contents II
Multilevel caches
5 Intel Cache
Introduction
Computer memory can be classified according to several characteristics:
• type;
• technology;
• organization;
• performance;
• and cost.
Typically:
• some external
• accessible via an I/O module;
Can you see any advantages / disadvantages of using each one?
Let's see if you can guess what each one of these signifies... Any ideas?
• Performance:
• Access time (latency):
• For RAM: time to perform a read or write operation;
• For Non-RAM: time to position the read-write head at desired location;
• Physical characteristics:
• Volatile: information decays naturally or is lost when powered off;
Memory Hierarchy
• How much?
• If memory exists, applications will likely be developed to use it.
• How fast?
• Best performance achieved when memory keeps up with the processor;
• How expensive?
• Cost of memory must be reasonable in relationship to other components;
• Greater capacity:
• Smaller cost per bit;
• But also slower access time.
How can we solve this issue? Or at least mitigate the problem? Any
ideas?
• Space (spatial locality):
• if we access a memory location, nearby addresses will very likely be accessed soon;
• Time (temporal locality):
• if we access a memory location, we will very likely access it again (a short illustration follows).
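To make these two principles concrete, here is a minimal C sketch (illustrative, not from the slides): both loop nests compute the same sum, but the first walks the array in row-major order, touching consecutive addresses (spatial locality), while the second strides across it.

#include <stdio.h>

#define N 1024

static int a[N][N];                     /* 4 MiB array, zero-initialized */

int main(void) {
    long sum = 0;

    /* Row-major traversal: consecutive addresses, cache-friendly. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal: a stride of N ints between accesses,
       so each access likely lands in a different cache block. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}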
Example (1/5)
• Level 1 - L1 :
• contains 1000 words and has an access time of 0.01µs;
• Level 2 - L2 :
• contains 100,000 words and has an access time of 0.1µs.
• Assume that:
• if word ∈ L1, then the processor accesses it directly;
• if word ∈ L2, then the word is first transferred to L1 and then accessed by the processor.
Example (2/5)
For simplicity, ignore the time required to determine whether the word is in L1 or L2.
Also, let:
• H define the fraction of all memory accesses that are found in L1;
• T1 and T2 denote the access times of L1 and L2.
The average access time is then: T = H × T1 + (1 − H) × (T1 + T2).
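A minimal C sketch of this formula, using the T1 and T2 values from Example (1/5); the sweep over hit ratios is mine, for illustration:

#include <stdio.h>

int main(void) {
    const double t1 = 0.01;   /* L1 access time, in microseconds */
    const double t2 = 0.1;    /* L2 access time, in microseconds */

    /* T = H*T1 + (1 - H)*(T1 + T2): a miss costs an L1 probe
       plus the access from L2. */
    for (double h = 0.5; h <= 1.0001; h += 0.1) {
        double t = h * t1 + (1.0 - h) * (t1 + t2);
        printf("H = %.1f -> T = %.4f us\n", h, t);
    }
    return 0;
}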
Example (3/5)
General shape of the curve that covers this situation:
Example (4/5)
• For high percentages of L1 access, the average total access time is much
closer to that of L1 than that of L2 ;
Example (5/5)
• Eventually:
• Data ∈ L1 will be swapped to L2 to make room for new data;
This principle can be applied across more than two levels of memory:
• Processor registers:
• Fastest, smallest, and most expensive type of memory
• Followed immediately by the cache:
• Stages data movement between registers and main memory;
• Improves performance;
• Is not usually visible to the processor;
• Is not usually visible to the programmer.
• Followed by main memory:
• Principal internal memory system of the computer;
• Each location has a unique address.
This means that we should maybe have a closer look at the cache =)
Figure: Cache and main memory - single cache approach (Source: [Stallings, 2015])
• ...it is likely that there will be future references to that same memory location;
Can you see any way of improving the cache concept? Any ideas?
Figure: Cache and main memory - three-level cache organization (Source: [Stallings, 2015])
Main memory consists of M blocks, while the cache holds m lines:
• m ≪ M;
• Because m ≪ M, lines cannot be uniquely and permanently dedicated to blocks:
• each line includes a tag identifying which block it currently stores.
Read operation:
• If the requested word is in the cache, it is delivered to the processor;
• Otherwise:
• Block containing that word is loaded into the cache;
• and the word is then delivered to the processor.
• Data and address lines also attach to data and address buffers:
• Which attach to a system bus...
What do you think happens when a word is not in cache? Any ideas?
Cache Addresses
• Physical addresses:
• Actual memory addresses;
• Logical addresses:
• Virtual-memory addresses;
• ...secondary memory:
• Processor accesses the cache directly, without going through the MMU.
• Advantage:
• Faster access speed;
• Cache can respond without the need for an MMU address translation;
• Disadvantage:
• Same virtual address in two different applications refers to two different
physical addresses;
Cache Size
The larger the cache, the more complex the addressing logic:
Mapping Function
Recall that there are fewer cache lines than main memory blocks
How should one map main memory blocks into cache lines? Any ideas?
Three techniques can be used for mapping blocks into cache lines:
• Direct;
• Associative;
• Set associative.
Direct Mapping
Maps each block of main memory into only one possible cache line as:
i = j mod m
where:
• i = cache line number;
• j = main memory block number;
• m = number of lines in the cache.
Thus:
• The first m main memory blocks map one-to-one onto the m cache lines;
• the next m blocks map onto the same lines in the same order;
• and so on...
Over time:
• Tag (s − r bits): to distinguish blocks that are mapped to the same line;
• Does the line contain the 1st block that can be assigned?
• Does the line contain the 2nd block that can be assigned?
• ...
• Does the line contain the 2^(s−r)-th block that can be assigned?
1 Use the line field of the memory address to index the cache line;
2 Compare the tag from the memory address with the line's tag:
• If both match, then Cache Hit;
• Otherwise, Cache Miss.
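A minimal C sketch of this procedure; the 32-bit address split (r = 14 line bits, w = 2 word bits) and the structure layout are illustrative assumptions, not values from the slides:

#include <stdbool.h>
#include <stdint.h>

#define R_BITS 14                       /* line field width (assumed) */
#define W_BITS 2                        /* word field width (assumed) */
#define NUM_LINES (1u << R_BITS)        /* m = 2^r cache lines */

struct line {
    bool     valid;
    uint32_t tag;                       /* s - r tag bits */
    /* data block omitted */
};

static struct line cache[NUM_LINES];

/* Direct-mapped lookup: index with the line field, compare tags. */
static bool lookup(uint32_t address) {
    uint32_t block = address >> W_BITS;      /* j = block number     */
    uint32_t line  = block % NUM_LINES;      /* i = j mod m          */
    uint32_t tag   = block / NUM_LINES;      /* remaining high bits  */
    return cache[line].valid && cache[line].tag == tag;
}

int main(void) {
    cache[0].valid = true;                   /* pretend block 0 is resident */
    cache[0].tag   = 0;
    return lookup(0x0) ? 0 : 1;              /* 0 on hit */
}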
Associative Mapping
• Flexibility as to which block to replace when a new block is read into the
cache;
Can you see any way of improving the associative scheme? Any ideas?
Set-associative mapping
m = v × k
i = j mod v
where:
• i = cache set number;
• v = number of sets;
• m = number of lines in the cache;
• k = number of lines in each set;
• j = main memory block number.
Idea:
• Tag: used in conjunction with the set bits to identify a block (s − d bits);
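A minimal C sketch of this address split; the field widths (d = 13 set bits, w = 2 word bits) are illustrative assumptions:

#include <stdint.h>
#include <stdio.h>

#define D_BITS 13                       /* set field width (assumed) */
#define W_BITS 2                        /* word field width (assumed) */
#define NUM_SETS (1u << D_BITS)         /* v = 2^d sets of k lines each */

/* A block may be placed in any of the k lines of set i = j mod v;
   the tag (s - d bits) then identifies it within the set. */
static void split_address(uint32_t address, uint32_t *set, uint32_t *tag) {
    uint32_t block = address >> W_BITS;      /* j = block number */
    *set = block % NUM_SETS;                 /* i = j mod v      */
    *tag = block / NUM_SETS;                 /* s - d tag bits   */
}

int main(void) {
    uint32_t set, tag;
    split_address(0x12345678u, &set, &tag);
    printf("set = %u, tag = %u\n", set, tag);
    return 0;
}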
Exercise (1/4)
Questions:
• How many bits are required for encoding words, sets and tag?
Exercise (2/4)
How many bits are required for the words? Any ideas?
Exercise (3/4)
How many bits are required for the set? Any ideas?
Exercise (4/4)
How many bits are required for the tag? Any ideas?
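The concrete cache parameters for this exercise are on the original slides; purely as an illustration, the C sketch below assumes a hypothetical 24-bit address, 4 words per block and 8192 sets, and derives each field width as a base-2 logarithm:

#include <stdio.h>

/* Bits needed to encode n choices, assuming n is a power of two. */
static unsigned bits_for(unsigned n) {
    unsigned b = 0;
    while (n > 1) { n >>= 1; b++; }
    return b;
}

int main(void) {
    const unsigned address_bits    = 24;     /* hypothetical */
    const unsigned words_per_block = 4;      /* hypothetical */
    const unsigned num_sets        = 8192;   /* hypothetical */

    unsigned w   = bits_for(words_per_block);    /* word field       */
    unsigned d   = bits_for(num_sets);           /* set field        */
    unsigned tag = address_bits - d - w;         /* tag = s - d bits */

    printf("word = %u, set = %u, tag = %u bits\n", w, d, tag);
    return 0;
}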
Hint: The specific details about these models would make great exam
questions ;)
E.g.: what happens when we vary the number of lines k in each set?
Figure: Varying associativity degree k (lines per set) over cache size
Replacement Algorithms
• Direct Mapping;
• Associative Mapping;
• Set-Associative Mapping.
Replacement Algorithms
• For direct mapping, there is only one possible line for any particular block:
• Thus no choice is possible;
• Least recently used (LRU):
• Replace the block in the set that has been in the cache longest:
• With no references to it! (See the sketch after this list.)
• First-in-first-out (FIFO):
• Replace the block in the set that has been in the cache longest:
• Regardless of whether or not there exist references to the block;
• Random replacement:
• studies have shown only slightly inferior performance to LRU, LFU and FIFO =)
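As promised above, a minimal C sketch of LRU within a single set; the timestamp scheme and structure are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

#define K 4                             /* lines per set (assumed) */

struct line {
    uint32_t tag;
    uint64_t last_used;                 /* time of most recent reference */
};

/* LRU victim: the line whose most recent reference is oldest. */
static int lru_victim(const struct line set[K]) {
    int victim = 0;
    for (int i = 1; i < K; i++)
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    return victim;
}

int main(void) {
    struct line set[K] = {{10, 40}, {11, 10}, {12, 30}, {13, 20}};
    printf("evict line %d\n", lru_victim(set));   /* prints 1 */
    return 0;
}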
Write Policy
Can you see any implications that having a cache has on memory
management? Any ideas?
• more than one device may have access to main memory, e.g.:
• I/O module may be able to read-write directly to memory;
• Write through;
• Write back;
• Write through:
• All write operations are made to main memory as well as to the cache;
• Disadvantage:
• lots of memory accesses → worse performance;
• Write back:
• Updates are made only in the cache;
• When a block is replaced, it is written back to memory iff its dirty (use) bit is set.
• Disadvantage:
• portions of main memory may be stale, so an I/O module accessing main memory must go through the cache (later chapter)...
Example (1/2)
What is the number of times that the line must be written before being swapped out for a write-back cache to be more efficient than a write-through cache?
Example (2/2)
• Write-back case:
• At swap-out time we need to transfer 32/4 = 8 words;
• Thus we need 8 × 30 = 240ns
• Write-through case:
• Each line update requires that one word be written to memory, taking 30ns
• Conclusion:
• If line gets written more than 8 times, the write-back method is more efficient;
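The same break-even computation as a C sketch (the 30 ns word-write time and 8 words per line come from the example; the code itself is mine):

#include <stdio.h>

int main(void) {
    const int words_per_line = 32 / 4;   /* 8 words, as in the example */
    const int word_write_ns  = 30;       /* time to write one word     */

    /* Write-back: one burst of 8 words at swap-out time = 240 ns. */
    const int write_back_ns = words_per_line * word_write_ns;

    /* Write-through: each line update writes one word to memory. */
    for (int writes = 7; writes <= 9; writes++)
        printf("%d writes: through = %3d ns, back = %d ns\n",
               writes, writes * word_write_ns, write_back_ns);

    /* Break-even at 8 writes; beyond that, write-back is cheaper. */
    return 0;
}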
Can you see the implications of having multiple caches for memory
management?
What are the possible mechanisms for dealing with cache coherency?
Any ideas?
• Hardware transparency:
• Use additional hardware to ensure that all updates to main memory via
cache are reflected in all caches
• Noncacheable memory:
• Only a portion of main memory is shared by more than one processor, and
this is designated as noncacheable;
• All accesses to shared memory are cache misses, because the shared
memory is never copied into the cache.
• MESI Protocol:
• We will see this in better detail later on...
Line Size
• Larger blocks reduce the number of blocks that fit into a cache.
• Also, because each block fetch overwrites older cache contents...
• ...a small number of blocks results in data being overwritten shortly after they
are fetched.
Number of caches
Multilevel caches
Figure: Total hit ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1 (Source: [Stallings, 2015])
• L2 has little effect on performance until it is at least double the L1 cache size.
• Otherwise, L2 cache has little impact on total cache performance.
• If the L2 cache has the same line size and capacity as the L1 cache...
• Data cache;
• Chapter 14
• Pipelining:
• Multiple stages of the instruction cycle can be executed simultaneously
• Chapter 14;
• Performance bottleneck;
References I
Stallings, W. (2015). Computer Organization and Architecture: Designing for Performance. Pearson Education.