Lecture 08 - CH No. 04 (Part 02)
Chapter 04 – Cache Memory (Part 02)
Lecture – 09
17-04-2018
1
Topics to Cover
4.3 Elements of Cache Design
• Cache Addresses
• Cache Size
• Mapping Function
• Types of Mapping
• Replacement Algorithms
• Write Policy
• Line Size
• Number of Caches
2
4.3 Elements of Cache Design (Table 4.2)
• Although there are a large number of cache implementations, there
are a few basic design elements that serve to classify and differentiate
cache architectures. Table 4.2 below lists key elements.
3
Cache Size
• We would like the size of the cache to be small enough so that the
overall average cost per bit is close to that of main memory alone.
• We would like the cache to be large enough (up to a point) so that the
overall average access time is close to that of the cache alone.
• It is observed that the larger the cache, the larger the number of
gates involved in addressing the cache.
• The result is that large caches tend to be slower than smaller ones.
• The available chip and board area also limits cache size.
• The performance of cache is very sensitive to the nature of workload,
so it is impossible to arrive at a single ‘optimum’ cache size.
4
Mapping Function
• Why do we need mapping?
• Because there are fewer cache lines than main memory blocks, an
algorithm is needed for mapping main memory blocks into cache
lines. (How do we place blocks into the cache?)
• Further, a means is needed for determining which main memory
block currently occupies a cache line. (How do we identify and search for a block?)
• Three mapping techniques can be used:
1. Direct 2. Associative 3. Set-Associative
• The choice of mapping function dictates how the cache is organized.
5
Example 4.2
• We will examine each of these three mapping techniques. In each
case, we look at the general structure and then a specific example.
6
Example Explained (Cache)
• Cache capacity = 64-K bytes (64 * 1024 bytes).
• One cache line = one block = 4 bytes => the cache has 16K = 2^14 lines.
7
Example Explained (Memory)
• Main-memory capacity = 16-Mbytes (16*1024*1024 bytes)
• Main-memory addresses = 16 * 1024 * 1024 bytes
= 2^4 * 2^10 * 2^10
= 2^(4+10+10)
= 2^24 (=> 24-bit address, 2^24 = 16M)
• For mapping purposes, we have: 16M bytes / 4 bytes per block
= 4 * 1024 * 1024 = 4M
• Thus, we have in main-memory 4M blocks (of 4 bytes each).
8
1. Direct Mapping
• Direct mapping maps each block of main memory into only one
possible cache line. (Slicing)
• I.e., if a block is in the cache, it must be in one specific place.
• ‘Direct mapping’ is expressed as: i = j modulo m, where i = cache line
number, j = main-memory block number, and m = number of lines in the
cache (see the sketch below).
9
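As a one-line illustration of the formula above (a sketch in C with hypothetical names, not part of the lecture), the cache line for main-memory block j is:

/* i = j modulo m, where m is the number of lines in the cache. */
unsigned direct_map_line(unsigned block_j, unsigned num_lines_m)
{
    return block_j % num_lines_m;
}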
Direct Mapping
• In ‘Direct Mapping’, the blocks of main-memory are assigned to lines
of the cache as follows: line 0 holds blocks 0, m, 2m, ...; line 1 holds
blocks 1, m+1, 2m+1, ...; and so on up to line m-1.
• The Tag field identifies which of these blocks currently occupies a given line.
11
(Figure: Direct-mapping cache organization – note the s bits of the main-memory address.)
12
(Figure: Direct-mapping address structure – the Tag bits identify which block, the Line bits identify which cache line, and the Word bits identify which word; each block/line holds 4 words. This box contains the answer to Question 3.)
13
Summary of ‘Direct Mapping’
• Address length = (s + w) bits; number of addressable units in main-memory = 2^(s+w) words or bytes.
• Block size = line size = 2^w words; number of blocks in main-memory = 2^s.
• Number of lines in cache = m = 2^r; size of cache = 2^(r+w) words; size of tag = (s - r) bits.
14
Advantages and Disadvantages
(Direct Mapping)
• Advantages
• The direct mapping technique is simple and inexpensive to implement
• Disadvantage
• Its main ‘disadvantage’ is that there is a fixed cache location for any
given block.
• Thus, if a program happens to reference words repeatedly from two
different blocks that map into the same line, ‘cache misses’ are HIGH.
• Then the blocks will be continuously swapped in the cache, and the hit
ratio will be low; this phenomenon is known as thrashing.
15
Victim Cache
• One approach to lower the miss penalty is to remember what was
discarded in case it is needed again.
• Since the discarded data has already been fetched, it can be used
again at a small cost.
• Such recycling is possible using a ‘Victim cache’.
• The ‘Victim cache’ was originally proposed as an approach to reduce the
conflict misses of a direct-mapped cache without affecting its fast
access time. (A sketch of the idea follows below.)
• Victim cache is a fully associative cache, whose size is typically 4 to 16
cache lines, residing between a direct mapped L1 cache and memory.
16
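As a rough sketch of the idea (the 4-entry size, FIFO placement, block layout, and all names are illustrative assumptions, not details from the lecture), a victim cache can be modelled as a small fully associative buffer that is filled with blocks evicted from L1 and searched on an L1 miss:

#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 4                      /* typically 4 to 16 lines          */

typedef struct {
    bool     valid;
    uint32_t block_addr;                  /* address of the evicted block     */
    uint8_t  data[4];
} VictimLine;

static VictimLine victim[VC_ENTRIES];
static unsigned   next_slot;              /* simple FIFO placement            */

/* Called when L1 evicts a block: remember it here instead of discarding it. */
void victim_insert(uint32_t block_addr, const uint8_t data[4])
{
    VictimLine *v = &victim[next_slot];
    next_slot = (next_slot + 1) % VC_ENTRIES;
    v->valid = true;
    v->block_addr = block_addr;
    for (int i = 0; i < 4; i++)
        v->data[i] = data[i];
}

/* Called on an L1 miss: if the block is here, it can be reloaded cheaply. */
bool victim_lookup(uint32_t block_addr, uint8_t out[4])
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            for (int j = 0; j < 4; j++)
                out[j] = victim[i].data[j];
            return true;
        }
    }
    return false;
}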
Example 4.2 (a)
(Figure: Direct-mapping example – the Tag is 8 bits and specifies which memory block is placed in a cache line; the r bits specify one of the m lines; each block/line holds one 4-byte word, so blocks are spaced 4 bytes apart in main memory.)
17
For Example
Tag (s - r) = 8 bits | Line or Slot (r) = 14 bits | Word (w) = 2 bits
• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier (s – r + r) See ‘Slide-6’
• 8 bit tag (=22-14)
• 14 bit slot or line
• No two blocks in the same line have the same Tag field
• Check contents of cache by finding line and checking Tag
18
Tag (s - r) = 8 bits | Line or Slot (r) = 14 bits | Word (w) = 2 bits
Cache Read Operation (Example)
1. The cache system is presented with an (s+w)-bit address.
2. The r bits of the line number are used as an index into the cache to
access a particular line (containing 4 bytes/words).
3. If the tag bits of the address match the tag currently stored in that
line (a hit), then the w bits of the word number are used to select one
of the words in that line.
4. Otherwise (a miss), the s bits (tag + line) are used to fetch a block
from main memory.
5. The actual address used for the fetch is the s bits concatenated with
two 0-bits, so that 4 bytes (2^2 = 4) are fetched starting on a block
boundary. (A minimal code sketch of this lookup follows below.)
(Break) 19
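To make the steps above concrete, here is a minimal C sketch of a direct-mapped lookup, assuming the parameters of Example 4.2 (8-bit tag, 14-bit line, 2-bit word); the structure, function names, and the handling of the miss path are illustrative assumptions, not part of the lecture.

#include <stdint.h>
#include <stdbool.h>

#define WORD_BITS 2                       /* w = 2 -> 4 bytes per line      */
#define LINE_BITS 14                      /* r = 14 -> 16K lines            */
#define NUM_LINES (1u << LINE_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;                         /* s - r = 8 bits                 */
    uint8_t  data[1 << WORD_BITS];        /* the 4-byte block               */
} CacheLine;

static CacheLine cache[NUM_LINES];

/* Returns true on a hit and copies the addressed byte into *out. */
bool cache_read(uint32_t addr, uint8_t *out)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);         /* which word  */
    uint32_t line = (addr >> WORD_BITS) & (NUM_LINES - 1);  /* which line  */
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);        /* which block */

    if (cache[line].valid && cache[line].tag == tag) {
        *out = cache[line].data[word];                      /* hit         */
        return true;
    }
    /* Miss: fetch the 4-byte block whose address is the tag + line bits
     * concatenated with two 0-bits (i.e. starting on a block boundary),
     * install it in cache[line], and retry.                              */
    return false;
}

With the 8/14/2 split, an address such as 0x16339C would decompose into tag 0x16, line 0x0CE7, and word 0.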
2. Associative Mapping
• Associative mapping permits each main-memory block to be loaded
into any line of the cache.
20
Fully Associative Cache Organization
(Fig. Next)
• The ‘cache control logic’ interprets a memory address simply as a Tag
and a Word field.
• The Tag field uniquely identifies a block of main-memory: the number of
blocks is M = 2^n / K = 2^s (n = address bits, K = block size).
21
Cache Access Logic
(Figure: Fully associative cache organization – the Tag bits identify which block and the Word bits identify which word within the block. This box contains the answer to Question 4.)
22
Summary of ‘Associative Mapping’
• Address length = (s + w) bits; number of addressable units in main-memory = 2^(s+w) words or bytes.
• Block size = line size = 2^w words (e.g. 2^2 = 4 bytes); number of blocks in main-memory = 2^s.
• Number of lines in cache is not determined by the address format; size of tag = s bits.
23
Advantages and Disadvantages
(Associative Mapping)
• Advantages
• With ‘Associative mapping’, there is flexibility as to which block to
replace when a new block is read into the cache.
• Replacement algorithms help to decide which line to replace and are
designed to maximize the HIT-ratio.
• Disadvantage
• The principal disadvantage of ‘associative mapping’ is the complex
circuitry required to examine the tags of all cache lines in parallel.
• This means that ‘cache searching gets expensive’.
24
Example 4.2 (b)
(Figure: Associative-mapping example – the 24-bit address is interpreted as an s-bit Tag plus a w-bit Word field; the 2 low-order Word bits are removed before the tag comparison.)
25
We have 4M blocks in Main-memory => 4 * 1024 * 1024 = 2^(2+10+10)
= 2^22 => 22-bit tag
For Example
Tag = 22 bits | Word = 2 bits
(A lookup sketch follows below.)
26
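For comparison, here is a minimal sketch of the fully associative lookup (22-bit tag, 2-bit word). In hardware all tags are examined in parallel; the loop below only stands in for that parallel comparison, and the names and line count are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define WORD_BITS 2
#define NUM_LINES 16384                   /* 64-KB cache / 4-byte lines     */

typedef struct {
    bool     valid;
    uint32_t tag;                         /* 22 bits                        */
    uint8_t  data[1 << WORD_BITS];
} CacheLine;

static CacheLine cache[NUM_LINES];

bool assoc_read(uint32_t addr, uint8_t *out)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t tag  = addr >> WORD_BITS;    /* the remaining 22 bits          */

    for (unsigned i = 0; i < NUM_LINES; i++) {   /* done in parallel in HW  */
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data[word];
            return true;                          /* hit                    */
        }
    }
    return false;   /* miss: any line may be chosen by the replacement algorithm */
}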
3. Set-Associative Mapping
• In Set-Associative mapping, each block maps into any of the cache lines of a
specific set, so that main-memory block B0 maps into set 0, and so on.
• The cache consists of a number of sets (v), each of which consists of a
number of lines (k). The relationships are:
m = v * k and i = j modulo v,
where i = cache set number, j = main-memory block number, m = number of
lines in the cache, v = number of sets, and k = number of lines per set
(a block can go into any of the k lines of its set, out of the v sets).
27
K-way Set-Associative Mapping (Fig. Next)
• With k-lines per set, this is referred to as k-way set-associative
mapping.
• With set-associative mapping, block Bj can be mapped into any of the
lines of set-j.
• Thus a set-associative cache can be physically implemented as v associative caches.
• E.g. in a ‘2-way set-associative mapping’ (having 2 lines per set), a
given block can be in one of the two-lines in any one set.
• Figure (next slide) illustrates this mapping for the first v blocks of
main-memory.
28
‘v’ Associative-Mapped Caches
(Figure: v sets with k lines per set – each set is associative, and block B0 fits into any line of set 0.)
29
Set-Associative ‘Cache Control
Logic’
• For ‘Set-Associative mapping’, the cache control logic interprets a
memory address as three fields: Tag, Set, and Word.
• The d bits specify one of v = 2^d sets.
• The s bits of the Tag and Set fields specify one of the 2^s blocks of
main-memory.
• Advantage: With ‘k-way set-associative mapping’, the tag in a
memory address is much smaller and is only compared to the k tags
within a single set. (whereas associative-mapping tag is large, and compared with all lines).
• Figure (next slide) illustrates the ‘cache control logic’.
30
Cache Access Logic
(Figure: Set-associative cache organization – the Tag bits identify which block, the Set bits identify which set (block Bj goes to set j, which has k lines), and the Word bits identify which word. This box contains the answer to Question 5.)
31
Summary of ‘Set-Associative
Mapping’
• Address length = (s + w) bits; number of addressable units in main-memory = 2^(s+w) words or bytes.
• Block size = line size = 2^w words (e.g. 2^2 = 4 bytes); number of blocks in main-memory = 2^s.
• Number of lines per set = k; number of sets = v = 2^d; number of lines in cache = k * v.
• Size of cache = k * v * 2^w words; size of tag = (s - d) bits.
(A lookup sketch follows below.)
32
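As a sketch of a k-way lookup, the code below assumes a two-way (k = 2) version of the Example 4.2 cache, which gives v = 2^13 sets, a 13-bit Set field, a 2-bit Word field, and a 9-bit Tag; these concrete widths and all names are assumptions for illustration, not details from the lecture.

#include <stdint.h>
#include <stdbool.h>

#define WORD_BITS 2
#define SET_BITS  13                      /* d = 13 -> v = 2^13 sets (assumed) */
#define K_WAYS    2                       /* 2-way set-associative (assumed)   */
#define NUM_SETS  (1u << SET_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;                         /* s - d = 9 bits (assumed)          */
    uint8_t  data[1 << WORD_BITS];
} CacheLine;

static CacheLine cache[NUM_SETS][K_WAYS];

bool set_assoc_read(uint32_t addr, uint8_t *out)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t set  = (addr >> WORD_BITS) & (NUM_SETS - 1);
    uint32_t tag  = addr >> (WORD_BITS + SET_BITS);

    for (unsigned way = 0; way < K_WAYS; way++) {   /* only k tag comparisons */
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *out = cache[set][way].data[word];
            return true;                             /* hit                   */
        }
    }
    return false;    /* miss: a replacement algorithm picks a victim within this set */
}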
Example 4.2 (c)
34
Set-Associative Mapping (Extreme
Cases)
• In the extreme case of v = m, k =1, the set-associative technique
reduces to direct-mapping.
• For v =1, k = m, it reduces to associative-mapping.
• The use of two-lines per set (v = m/2 for k = 2) is the most common
‘set-associative’ organization.
• It significantly improves the HIT-ratio over ‘direct-mapping’.
• ‘Four-way set-associative’ (v = m/4, k = 4) makes a modest additional
improvement. Further increase in No. of lines per set has little effect.
35
See Last Questions
Problem 4-1
(Answer)
36
Problem 4-2
(Answer)
37
Summary of Three Mapping
Techniques
Mapping Type | Address Length (bits) | No. of Addressable Units | Block/Line Size | No. of Blocks in Main-memory | No. of Lines in Cache (m) | Size of Cache (words) | Tag Size (bits)
Direct | (s+w) | 2^(s+w) words/bytes | 2^w words/bytes | 2^s | 2^r lines | 2^(r+w) | (s-r)
Associative | (s+w) | 2^(s+w) words/bytes | 2^w words/bytes | 2^s | undefined | unknown | s bits
Set-Associative | (s+w) | 2^(s+w) words/bytes | 2^w words/bytes | 2^s | k lines per set (v sets) | k * v * 2^w | (s-d)
(Break)
38
Cache ‘Replacement Algorithms’
(Associative)
• Once the cache has been filled, when a new block is brought into the
cache, one of the existing blocks must be replaced.
• For direct-mapping, there is only one possible line for a particular
block, and no choice is possible. We have to replace that line (thrash).
• For the associative and set-associative techniques, a replacement
algorithm is needed.
• To achieve high speed, such an algorithm must be implemented in
hardware.
• A number of algorithms have been tried. We mention only four.
39
Types of ‘Cache Replacement
Algorithms’
• Four of the most common cache replacement algorithms are:
1) Least Recently Used (LRU)
2) First-In-First-Out (FIFO)
3) Least Frequently Used (LFU)
4) Random Replacement (RR)
40
Types of ‘Cache Replacement
Algorithms’
1. Least Recently Used (LRU): Replace that block in the set that has
been in the cache longest with no reference to it. (relates to time)
• We are assuming that more recently used memory locations are likely
to be referenced again. This is called ‘Temporal locality’ and it should
give the best HIT-ratio.
• The cache mechanism maintains a separate list of indexes to all the
lines in the associative cache.
• When a line is referenced, it moves to the front of the list.
• For replacement, the line at the back of the list is used (discarded).
41
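A minimal sketch of the list-based LRU bookkeeping described above, for one k-way set; the array-of-way-indexes representation and names are illustrative assumptions (a real cache keeps such state per set, in hardware).

#include <stdint.h>

#define K_WAYS 4                          /* assumed 4-way set                */

typedef struct {
    uint8_t order[K_WAYS];                /* order[0] = most recently used,
                                             order[K_WAYS-1] = least recently used */
} LruState;

void lru_init(LruState *s)
{
    for (int i = 0; i < K_WAYS; i++)
        s->order[i] = (uint8_t)i;
}

/* On every reference (hit), move the referenced way to the front of the list. */
void lru_touch(LruState *s, uint8_t way)
{
    int i = 0;
    while (s->order[i] != way)            /* find its current position        */
        i++;
    for (; i > 0; i--)                    /* shift the more-recent entries    */
        s->order[i] = s->order[i - 1];
    s->order[0] = way;                    /* now the most recently used       */
}

/* On a miss, the way at the back of the list is the one to replace. */
uint8_t lru_victim(const LruState *s)
{
    return s->order[K_WAYS - 1];
}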
Types of ‘Cache Replacement
Algorithms’
2. First-In-First-Out (FIFO): Replace that block in the set that has been
in the cache longest. (first-come first-serve pattern)
• FIFO is easily implemented as a round-robin or circular buffer
technique.
3. Least Frequently Used (LFU): Replace that block in the set that has
experienced the fewest references. (relates to the use of data item)
• LFU could be implemented by associating a counter with each line.
4. Random Replacement (RR): Pick and replace a line at random from
among the candidate lines.
42
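For instance, the round-robin (circular buffer) implementation of FIFO mentioned in point 2 can be sketched as a single pointer per set; the 4-way size and names are illustrative assumptions.

#define K_WAYS 4                          /* assumed 4-way set                */

typedef struct {
    unsigned next;                        /* next way to be replaced          */
} FifoState;

/* On each replacement, take the current way and advance the circular pointer. */
unsigned fifo_victim(FifoState *s)
{
    unsigned victim = s->next;
    s->next = (s->next + 1) % K_WAYS;
    return victim;
}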
Write Policy
• Coherency property: States that if a Word or a cache line (block) is
modified in the lower level memory M1 (cache), then copies of that
word must be updated immediately at all higher memory levels M2,
M3 and so on, so as to maintain ‘data coherency’ among different
memory levels.
• When a block that is resident in the cache is to be replaced, if it has
not been altered (updated/overwritten), then it may directly be
replaced without first writing out this block to the main memory.
• If at least one write operation has been performed on a word in that
line of the cache, then main memory must be updated by writing that
block to the main memory before bringing in the new block in cache.
43
Types of ‘Write Policies’
• If a Word has been altered only in the cache, then the corresponding
memory Word is invalid.
• If the I/O device has altered the main-memory, then the cache Word
is invalid.
• To maintain ‘data coherency’, two write policies have been adopted:
1) Write through 2) Write back
44
Types of ‘Write Policies’
1. Write through: Using this technique, all write operations are made
to main memory as well as to the cache simultaneously, ensuring
that main-memory is always valid. (Advantage)
• ‘Write through’ does an ‘immediate update’ in RAM whenever a word
is modified in cache.
• Its main disadvantage is that it generates substantial memory traffic and may create a bottleneck.
2. Write back: Minimizes memory writes, and it ‘delays an update’ in
main-memory until that updated block is being replaced by a new
block in that same line in the cache. (set ‘updated’ bit for a line)
• Until the ‘write back’ occurs, the main-memory copy of the block is invalid (Disadvantage). (A sketch of both policies follows below.)
45
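The two policies can be contrasted with a small C sketch; mem_write(), the line layout, and the 4-byte block size are illustrative assumptions, not part of the lecture.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;                       /* the 'updated' bit used by write-back */
    uint32_t tag;
    uint8_t  data[4];
} CacheLine;

void mem_write(uint32_t addr, uint8_t value);   /* assumed to exist elsewhere */

/* Write-through: update the cache and main memory at the same time,
 * so main memory is always valid (at the cost of extra memory traffic). */
void write_through(CacheLine *line, uint32_t addr, unsigned word, uint8_t value)
{
    line->data[word] = value;
    mem_write(addr, value);
}

/* Write-back: update only the cache and set the dirty bit; main memory
 * is not updated until this line is replaced.                           */
void write_back(CacheLine *line, unsigned word, uint8_t value)
{
    line->data[word] = value;
    line->dirty = true;                   /* main-memory copy is now stale  */
}

/* On replacement, a dirty line must first be written out to main memory. */
void evict(CacheLine *line, uint32_t block_addr)
{
    if (line->dirty) {
        for (unsigned w = 0; w < 4; w++)
            mem_write(block_addr + w, line->data[w]);
        line->dirty = false;
    }
    line->valid = false;
}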
Example 4.3
Answer:
46
Line Size
• When a block of data is retrieved and placed in the cache, not only
the desired word but also some number of adjacent words are
retrieved. (a block is a combination of words)
• As the block size increases, the hit ratio will at first increase, due to
the principle of locality, which states that data in the vicinity of a
referenced word are likely to be referenced in the near future.
• As the block size increases further, the hit ratio will begin to decrease.
• The relationship between block size and hit ratio is complex, and no
definitive optimum value has been found; a size of 8 to 64 bytes is reasonable.
47
Number of Caches
• Multilevel Caches / Cache Hierarchy
1. On-chip cache (L1): is a cache on the same chip as the processor.
• The ‘on-chip cache (Level-1)’ reduces the processor’s external bus
activity and therefore speeds up execution times and increases overall
system performance. It is often accessed in one cycle by CPU.
2. An off-chip cache (L2) is external to the processor and is designated
as Level-2. It is made of static-RAM (SRAM).
• Because the slow system-bus speed and slow main-memory access time would
otherwise result in poor performance, an L2 cache is desirable; a fast L2
allows accesses with ‘zero wait states’.
48
• L2 cache uses a separate data path (Local bus), so as to reduce the
burden on the ‘system bus’.
• With the continued shrinking of the processor components (due to
Moore’s law), a number of processors now incorporate the L2 cache
on the processor chip, improving performance.
• A Hit is counted if the desired data appears in either the L1 or the L2
cache.
• L2 has little effect on the total number of cache hits until it is at least
double the L1 cache size.
• With the increasing availability of on-chip area available for cache,
most microprocessors have moved the L2 cache on the processor chip
and added an L3 cache (Off chip).
• The L3 cache is accessible over the external bus, and adding this third
level gives a performance advantage. (As with L2, each level's size should
be at least double that of the level below it.)
49
‘Unified’ Versus ‘Split Caches’
• The on-chip cache first consisted of a single cache used to store
references to both data and instructions.
• Now it has become common to split the cache into two: one
dedicated to instruction and one dedicated to data, called split cache.
• These two caches both exist at the same level, typically as two L1 caches.
• When the processor attempts to fetch an ‘instruction/data’ from
main-memory, it first consults the ‘instruction/data’ L1 cache.
• Unified cache has only one cache that is used for holding both data
and instructions.
50
Advantages of a ‘Unified Cache’
• There are two potential advantages of a unified cache over split cache:
1. For a given cache size, a unified cache has a higher hit rate than split
caches because it balances the load between instruction and data
fetches automatically.
• That is, if an execution pattern involves many more instruction fetches
than data fetches, then the cache will tend to fill up with instructions,
and if an execution pattern involves relatively more data fetches, the
opposite will occur.
2. Only one cache needs to be designed and implemented.
51
Advantages of a ‘Split Cache’
1. The key advantage of the split cache design is that it eliminates
contention for the cache between the instruction fetch/decode unit
and the execution unit. (useful for ‘pipelining’ of instructions)
2. This enables the processor to fetch instructions ahead of time and
fill a buffer, or pipeline, with instructions to be executed.
The trend is toward split caches at the L1 and unified caches for
higher levels.
52
‘Spatial’ & ‘Temporal’ LOCALITY (Q
4.8 = Q12)
• The principle of locality states that data in the vicinity of a
referenced word are likely to be referenced in the near future.
• The ‘principle of locality’ has two types:
1) Spatial locality 2) Temporal locality
1. Spatial locality: refers to the tendency of execution to involve a
number of memory locations that are clustered. This means to ‘access
neighbouring info items whose addresses are near one another’. Space locality.
2. Temporal locality: refers to the tendency for a processor to access
memory locations that have been used recently. So the item that has a
‘probability of being used in future is kept’. (so don’t discard it) locality in time.
53
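A tiny (assumed) example of the two kinds of locality in ordinary code: the running sum is reused on every iteration (temporal locality), while the array elements are accessed at adjacent addresses (spatial locality).

#include <stddef.h>

int sum_array(const int *a, size_t n)
{
    int sum = 0;                 /* reused every iteration: temporal locality     */
    for (size_t i = 0; i < n; i++)
        sum += a[i];             /* a[0], a[1], ... are adjacent: spatial locality */
    return sum;
}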
Quiz # 2
Preparatory Questions
Chapter #4
Q1. Why do we need ‘cache mapping’ techniques? What are the types of
mapping? (Slide – 05)
Q2. What are the differences among ‘direct mapping, associative mapping and
set-associative mapping’? (Slide 9, 20 and 27)
Q3. For a ‘direct-mapped cache’, a main-memory address is viewed as consisting
of three fields. List and define the three fields. (Slide – 13.Box, Attached Q5.
Answer below)
Q4. For an ‘associative cache’, a main-memory address is viewed as consisting of
two fields. List and define the two fields. (Slide – 22.Box, Attached Q6. Answer
below)
Q5. For a ‘set-associative cache’, a main-memory address is viewed as consisting
of three fields. List and define the three fields. (Slide – 31.Box, Attached Q7.
Answer below)
54
Preparatory Questions
Q6. Write down the advantages and disadvantages of ‘direct mapping’. (Slide –
15)
Q7. Write down the advantages and disadvantages of ‘associative mapping’.
(Slide – 24)
Q8. Write down the two ‘Extreme cases’ of ‘set-associative mapping’. (Slide – 35)
Q9. List and explain the various ‘replacement algorithms’ used in cache memory.
(Slide – 39, 40, 41 and 42)
Q10. What is ‘data coherency’? To maintain data-coherency which two ‘write
policies’ are adopted in caches? Explain. (Slide 43, 44 and 45)
Q11. State the advantages of ‘unified cache’ over ‘split cache’, and vice versa.
(Slide – 51)
Q12. Why do we use ‘Victim cache’ in direct mapping?(Slide – 53)
55
Answers to Guess Questions: Q2. (4.4), Q3.
(4.5), Q4. (4.6), Q5. (4.7), Q12. (4.8) below:
56
Answers to Problems 4.1 and
4.2 below.
57
For Information
58
Cache Addresses
Out of Course
• Can be of two types
1) Logical addressing 2) Physical addressing
• Almost all processors support ‘Virtual Memory’.
• Virtual memory is a facility that allows programs to address memory
locations higher than the processor’s addressing space from a logical
point of view, without regard to the amount of main memory
physically available. (This concept will be discussed in Chapter 8)
• Memory Management Unit (MMU) is a hardware unit that translates
each virtual address into a physical address in main memory, to allow
reads and writes to main memory.
59
Logical and Physical Caches
1. Logical cache, also known as virtual cache, stores
data using virtual addresses.
• One obvious advantage of the logical cache is that ‘the processor
accesses the cache directly, without going through the MMU’. So the
cache access speed is faster than for a physical cache. Because the
cache can respond before the MMU performs an ‘address translation’.
• The disadvantage is that most virtual memory systems supply each
application with the same virtual address space, starting at address 0;
the same virtual address in two different applications thus refers to two
different physical addresses, so the cache must be completely flushed
when switching between applications.
2. Physical cache stores data using main memory physical addresses.
60