Lect12 Cache

Exploiting Memory Hierarchy

Locality
- Locality is the principle that makes having a memory hierarchy a good idea
- If an item is referenced, then because of
  - temporal locality: it will tend to be referenced again soon
  - spatial locality: nearby items will tend to be referenced soon
- Why does code have locality? Consider both instructions and data.
Hit and Miss
- Focus on any two adjacent levels in the memory hierarchy, called upper (closer to the CPU) and lower (farther from the CPU), because blocks are always copied between two adjacent levels
- Terminology:
  - block: minimum unit of data moved between levels
  - hit: data requested is in the upper level
  - miss: data requested is not in the upper level
  - hit rate: fraction of memory accesses that are hits (i.e., found at the upper level)
  - miss rate: fraction of memory accesses that are not hits
    - miss rate = 1 - hit rate
  - hit time: time to determine whether the access is a hit + time to access and deliver the data from the upper level to the CPU
  - miss penalty: time to determine that the access is a miss + time to replace the block at the upper level with the corresponding block from the lower level + time to deliver the block to the CPU
Caches
- A simple example:
  - assume block size = one word of data
  - a reference to a word Xn that is not in the cache causes a miss, so Xn is fetched from memory into the cache

[Figure: cache contents (a) before the reference to Xn and (b) after the reference to Xn; after the miss, Xn has been added alongside the previously cached words X1, X2, X3, X4, Xn-2, Xn-1]

- Issues:
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
  - If it is not, what do we do?
- The solution depends on the cache addressing scheme...

Direct Mapped Cache
- MIPS style:

[Figure: Address (showing bit positions 31..0) split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); a 1024-entry array of valid bit, 20-bit tag, and 32-bit data; the stored tag is compared with the address tag to produce Hit, and the data field supplies Data]

- Cache with 1024 one-word blocks: the byte offset (the 2 least significant bits) is ignored and the next 10 bits are used to index into the cache (see the sketch below)
- What kind of locality are we taking advantage of?
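The tag/index/offset split described above can be sketched in a few lines of Python. This is an illustrative sketch, not anything from the slides: the parameter values (10 index bits, 2 byte-offset bits, 1024 blocks) come from the figure, while the function and array names are assumptions.

```python
# Minimal sketch of direct-mapped address decomposition (assumed parameters:
# 1024 one-word blocks, 32-bit byte addresses, as in the figure above).

NUM_BLOCKS = 1024          # 2**10 entries -> 10 index bits
BYTE_OFFSET_BITS = 2       # one 4-byte word per block
INDEX_BITS = 10

def split_address(addr: int):
    """Split a 32-bit byte address into (tag, index, byte_offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (addr >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_offset

# A lookup is a hit when the indexed entry is valid and its stored tag matches.
valid = [False] * NUM_BLOCKS
tags = [0] * NUM_BLOCKS

def is_hit(addr: int) -> bool:
    tag, index, _ = split_address(addr)
    return valid[index] and tags[index] == tag
```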


DECStation 3100 Cache (MIPS R2000 processor)

[Figure: Address (showing bit positions 31..0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset; a 16K-entry array of valid bit, 16-bit tag, and 32-bit data; tag comparison produces Hit, the data field supplies Data]

- Cache with 16K one-word blocks: the byte offset (the 2 least significant bits) is ignored and the next 14 bits are used to index into the cache
Cache Read Hit/Miss
- Cache read hit: no action needed
- Instruction cache read miss:
  1. Send the original PC value (current PC - 4, since the PC has already been incremented in the first step of the instruction cycle) to memory
  2. Instruct main memory to perform a read and wait for it to complete the access (stall on read)
  3. After the read completes, write the cache entry
  4. Restart instruction execution at the first step to refetch the instruction
- Data cache read miss:
  - similar to an instruction cache miss
  - to reduce the data miss penalty, allow the processor to keep executing instructions while waiting for the read to complete, stalling only when the word is actually required (stall on use)
Cache Write Hit/Miss
- Write-through scheme:
  - on a write hit: update the data in both the cache and memory on every write to avoid inconsistency
  - on a write miss: write the word into both the cache and memory (with 1-word blocks there is no need to read the missed word from memory first)
  - write-through is slow because every write requires a memory write
    - performance is improved with a write buffer where words are stored while waiting to be written to memory; the processor can continue execution until the write buffer is full
    - when a word in the write buffer finishes writing to main memory, that buffer slot is freed and becomes available for future writes
    - the DEC 3100 write buffer holds 4 words
- Write-back scheme (see the sketch below):
  - write the data block only into the cache, and write the block back to main memory only when it is replaced in the cache
  - more efficient than write-through, but more complex to implement
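To make the policy difference concrete, here is a minimal Python sketch contrasting the two schemes for a direct-mapped cache with 1-word blocks. The class and field names are assumptions made for illustration; the behavior follows the bullets above: write-through touches memory on every write, while write-back defers the memory write until a modified block is evicted.

```python
# Minimal sketch of write-through vs. write-back for a direct-mapped cache
# with 1-word blocks (illustrative names; not from the slides).

class WriteThroughCache:
    def __init__(self, num_blocks, memory):
        self.block = [None] * num_blocks   # which memory block occupies each entry
        self.data = [0] * num_blocks
        self.memory = memory               # backing store: dict block_addr -> word

    def write(self, block_addr, word):
        index = block_addr % len(self.data)
        # Hit or miss, the word is placed in the cache entry and written to memory
        # (1-word blocks, so a write miss needs no memory read).
        self.block[index] = block_addr
        self.data[index] = word
        self.memory[block_addr] = word     # every write goes to memory (hence the write buffer)

class WriteBackCache:
    def __init__(self, num_blocks, memory):
        self.block = [None] * num_blocks
        self.data = [0] * num_blocks
        self.dirty = [False] * num_blocks
        self.memory = memory

    def write(self, block_addr, word):
        index = block_addr % len(self.data)
        if self.block[index] not in (None, block_addr) and self.dirty[index]:
            # Evicting a modified block: only now is it written back to memory.
            self.memory[self.block[index]] = self.data[index]
        self.block[index] = block_addr
        self.data[index] = word
        self.dirty[index] = True           # memory is not updated yet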
Direct Mapped Cache: Taking Advantage of Spatial Locality
- Taking advantage of spatial locality with larger blocks:

[Figure: Address (bit positions 31..0) split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); a 4K-entry array of valid bit, 16-bit tag, and 128-bit data; a multiplexor selects one of the four 32-bit words using the block offset, and tag comparison produces Hit]

- Cache with 4K 4-word blocks: the byte offset (the 2 least significant bits) is ignored, the next 2 bits are the block offset, and the next 12 bits are used to index into the cache (see the sketch below)
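As with the 1-word-block cache earlier, the address split for the 4-word-block cache can be sketched directly; the bit widths are taken from the figure above and the helper name is an assumption for illustration.

```python
# Minimal sketch of the address split for the 4K-entry, 4-word-block cache above.

BYTE_OFFSET_BITS = 2    # 4 bytes per word
BLOCK_OFFSET_BITS = 2   # 4 words per block
INDEX_BITS = 12         # 4K entries

def split_address(addr: int):
    """Return (tag, index, block_offset, byte_offset) for a 32-bit address."""
    byte_offset = addr & 0b11
    block_offset = (addr >> BYTE_OFFSET_BITS) & 0b11
    index = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, block_offset, byte_offset

# On a hit, the block offset plays the role of the multiplexor select,
# picking one of the four words stored in the indexed block.
```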
Direct Mapped Cache: Taking Advantage of Spatial Locality
- Cache replacement with large (multiword) blocks:
  - word read miss: read the entire block from main memory
  - word write miss: cannot simply write the word and tag! Why? (the other words of the old block would remain in the entry under the new tag)
  - writing in a write-through cache:
    - if write hit, i.e., the tag of the requested address and the cache entry are equal, continue as for 1-word blocks by replacing the word and writing it to both cache and memory
    - if write miss, i.e., the tags are unequal, fetch the block from memory, replace the word that caused the miss, and write the block to both cache and memory
  - therefore, unlike the 1-word block case, a write miss with a multiword block causes a memory read
Direct Mapped Cache: Taking Advantage of Spatial Locality
- Miss rate falls at first with increasing block size, as expected, but as the block size becomes a large fraction of the total cache size, the miss rate may go up because
  - there are fewer blocks
  - competition for blocks increases
  - blocks get ejected before most of their words are accessed (thrashing in the cache)

[Figure: miss rate vs. block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate axis ranges from 0% to 40%]
Example
- How many total bits are required for a direct-mapped cache with 128 KB of data and 1-word block size, assuming a 32-bit address?

- Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks
- Cache entry size = block data bits + tag bits + valid bit
  = 32 + (32 - 15 - 2) + 1 = 48 bits
- Therefore, total cache size = 2^15 × 48 bits = 2^15 × (1.5 × 32) bits = 1.5 × 2^20 bits = 1.5 Mbits
- Data bits in the cache = 128 KB × 8 = 1 Mbit
- Total cache size / actual cache data = 1.5
Example Problem
- How many total bits are required for a direct-mapped cache with 128 KB of data and 4-word block size, assuming a 32-bit address?

- Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^13 blocks
- Cache entry size = block data bits + tag bits + valid bit
  = 128 + (32 - 13 - 2 - 2) + 1 = 144 bits
- Therefore, total cache size = 2^13 × 144 bits = 2^13 × (1.125 × 128) bits = 1.125 × 2^20 bits = 1.125 Mbits
- Data bits in the cache = 128 KB × 8 = 1 Mbit
- Total cache size / actual cache data = 1.125
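The bit counts in both examples follow the same formula, which a short Python sketch can parameterize over block size (assumed function name; 32-bit addresses and one valid bit per entry, as in the examples).

```python
# Minimal sketch of the total-bit calculation for a direct-mapped cache.

def total_cache_bits(data_bytes: int, words_per_block: int, addr_bits: int = 32) -> int:
    words = data_bytes // 4
    blocks = words // words_per_block
    index_bits = blocks.bit_length() - 1            # log2(blocks), a power of two here
    block_offset_bits = words_per_block.bit_length() - 1
    byte_offset_bits = 2
    tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits
    entry_bits = words_per_block * 32 + tag_bits + 1    # data + tag + valid
    return blocks * entry_bits

print(total_cache_bits(128 * 1024, 1))   # 1572864 bits = 1.5   * 2**20
print(total_cache_bits(128 * 1024, 4))   # 1179648 bits = 1.125 * 2**20
```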


Example Problem
- Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?

- Since block size = 16 bytes:
  byte address 1200 → block address ⌊1200 / 16⌋ = 75
- Since cache size = 64 blocks:
  block address 75 → cache block (75 mod 64) = 11
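The same mapping, checked in a couple of lines of Python (illustrative variable names):

```python
# Byte address -> memory block address -> cache block index
# for a 64-block cache with 16-byte blocks, as in the example.
byte_address = 1200
block_address = byte_address // 16        # floor(1200 / 16) = 75
cache_block = block_address % 64          # 75 mod 64 = 11
print(block_address, cache_block)         # 75 11
```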
Block Size Considerations
- Larger blocks should reduce the miss rate
  - due to spatial locality
- But in a fixed-size cache
  - larger blocks → fewer of them
  - more competition → increased miss rate
- Larger miss penalty
  - can override the benefit of a reduced miss rate
  - early restart and critical-word-first can help
Performance
- Simplified model assuming equal read and write miss penalties:
  - CPU time = (execution cycles + memory stall cycles) × cycle time
  - memory stall cycles = number of memory accesses × miss rate × miss penalty
- Therefore, two ways to improve cache performance:
  - decrease the miss rate
  - decrease the miss penalty
  - What happens if we increase block size?
Example
- Assume for a given machine and program:
  - instruction cache miss rate 2%
  - data cache miss rate 4%
  - miss penalty always 40 cycles
  - CPI of 2 without memory stalls
  - frequency of loads/stores: 36% of instructions

1. How much faster is a machine with a perfect cache that never misses?
2. What happens if we speed up the machine by reducing its CPI to 1 without changing the clock rate?
3. What happens if we speed up the machine by doubling its clock rate, assuming the absolute time for a miss penalty remains the same?
Solution
1.
- Assume instruction count = I
- Instruction miss cycles = I × 2% × 40 = 0.8 × I
- Data miss cycles = I × 36% × 4% × 40 = 0.576 × I
- So, total memory-stall cycles = 0.8 × I + 0.576 × I = 1.376 × I
  - in other words, 1.376 stall cycles per instruction
- Therefore, CPI with memory stalls = 2 + 1.376 = 3.376
- Assuming instruction count and clock rate remain the same for a perfect cache and a cache that misses:
  CPU time with stalls / CPU time with perfect cache = 3.376 / 2 = 1.688
- Performance with a perfect cache is better by a factor of 1.688
Solution (cont.)
2. What happens if we speed up the machine by reducing its CPI to 1 without changing the clock rate?

- CPI without stalls = 1
- CPI with stalls = 1 + 1.376 = 2.376 (the clock has not changed, so the stall cycles per instruction remain the same)
- CPU time with stalls / CPU time with perfect cache
  = CPI with stalls / CPI without stalls
  = 2.376
- Performance with a perfect cache is better by a factor of 2.376
- Conclusion: the lower the CPI, the more pronounced the impact of stall cycles
Solution (cont.)
3. What happens if we speed up the machine by doubling its clock rate, assuming the absolute time for a miss penalty remains the same?

- With the doubled clock rate, miss penalty = 2 × 40 = 80 clock cycles
- Stall cycles per instruction = (I × 2% × 80) + (I × 36% × 4% × 80) = 2.752 × I
- So, the faster machine with cache misses has CPI = 2 + 2.752 = 4.752
- CPU time with stalls / CPU time with perfect cache
  = CPI with stalls / CPI without stalls
  = 4.752 / 2 = 2.376
- Performance with a perfect cache is better by a factor of 2.376
- Conclusion: with a higher clock rate, cache misses "hurt more" than with a lower clock rate
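The three scenarios can be verified with a few lines of Python that simply mechanize the arithmetic above (assumed function and variable names):

```python
# Memory-stall arithmetic from the example (2% I-miss, 4% D-miss, 36% loads/stores).

def stalls_per_instruction(i_miss, d_miss, ls_freq, penalty):
    return i_miss * penalty + ls_freq * d_miss * penalty

base = stalls_per_instruction(0.02, 0.04, 0.36, 40)      # 1.376
print(2 + base, (2 + base) / 2)                           # 3.376, slowdown 1.688 vs. perfect cache
print(1 + base, (1 + base) / 1)                           # 2.376, slowdown 2.376 (CPI reduced to 1)

fast = stalls_per_instruction(0.02, 0.04, 0.36, 80)       # 2.752 (doubled clock, same miss time)
print(2 + fast, (2 + fast) / 2)                           # 4.752, slowdown 2.376
```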
Decreasing Miss Rates with Associative Block Placement
- Direct mapped: one unique cache location for each memory block
  - cache block index = memory block address mod cache size (in blocks)
- Fully associative: each memory block can be placed anywhere in the cache
  - all cache entries are searched (in parallel) to locate a block
- Set associative: each memory block maps to a unique set of cache locations; if the set is of size n, the cache is n-way set-associative
  - cache set index = memory block address mod number of sets in the cache
  - all cache entries in the corresponding set are searched (in parallel) to locate a block
- Increasing the degree of associativity
  - reduces the miss rate
  - increases the hit time because of the parallel search followed by the fetch
Decreasing Miss Rates with Associative Block Placement

[Figure: location of a memory block with address 12 in a cache with 8 blocks, for different degrees of associativity. Direct mapped: block 12 goes to cache block 12 mod 8 = 4. 2-way set associative: it goes to set 12 mod 4 = 0, and both tags in that set are searched. Fully associative: it can go anywhere, and all tags are searched in parallel.]
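A short Python sketch (illustrative names, assuming the ways of a set occupy contiguous block frames) that reproduces the placements shown in the figure for memory block 12 in an 8-block cache:

```python
# Candidate cache blocks for memory block 12 in an 8-block cache,
# for different degrees of associativity (1 = direct mapped, 8 = fully associative).
NUM_BLOCKS = 8
block_address = 12

for ways in (1, 2, 8):
    num_sets = NUM_BLOCKS // ways
    set_index = block_address % num_sets
    candidates = [set_index * ways + i for i in range(ways)]
    print(f"{ways}-way: set {set_index}, candidate blocks {candidates}")

# Output:
# 1-way: set 4, candidate blocks [4]
# 2-way: set 0, candidate blocks [0, 1]
# 8-way: set 0, candidate blocks [0, 1, 2, 3, 4, 5, 6, 7]
```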
Decreasing Miss Rates with Associative Block Placement

[Figure: configurations of an 8-block cache with different degrees of associativity. One-way set associative (direct mapped): 8 blocks, each with its own tag and data. Two-way set associative: 4 sets, each with two (tag, data) entries. Four-way set associative: 2 sets of four entries. Eight-way set associative (fully associative): 1 set of eight entries.]


Example
- Find the number of misses for a cache with four 1-word blocks, given the following sequence of memory block accesses:
  0, 8, 0, 6, 8
  for each of the following cache configurations:
  1. direct mapped
  2. 2-way set associative (use LRU replacement policy)
  3. fully associative
- Note about LRU replacement:
  - in a 2-way set-associative cache, LRU replacement can be implemented with one bit per set whose value indicates the most recently referenced block in that set
Solution
- 1 (direct mapped)

  Block address | Cache block
  0             | 0 (= 0 mod 4)
  6             | 2 (= 6 mod 4)
  8             | 0 (= 8 mod 4)
  Block address translation in a direct-mapped cache

  Block accessed | Hit or miss | Block 0   | Block 1 | Block 2   | Block 3
  0              | miss        | Memory[0] |         |           |
  8              | miss        | Memory[8] |         |           |
  0              | miss        | Memory[0] |         |           |
  6              | miss        | Memory[0] |         | Memory[6] |
  8              | miss        | Memory[8] |         | Memory[6] |
  Cache contents after each reference (newly added entries were highlighted in the original slides)

- 5 misses
Solution (cont.)
- 2 (two-way set associative)

  Block address | Cache set
  0             | 0 (= 0 mod 2)
  6             | 0 (= 6 mod 2)
  8             | 0 (= 8 mod 2)
  Block address translation in a two-way set-associative cache

  Block accessed | Hit or miss | Set 0, way 0 | Set 0, way 1 | Set 1, way 0 | Set 1, way 1
  0              | miss        | Memory[0]    |              |              |
  8              | miss        | Memory[0]    | Memory[8]    |              |
  0              | hit         | Memory[0]    | Memory[8]    |              |
  6              | miss        | Memory[0]    | Memory[6]    |              |
  8              | miss        | Memory[8]    | Memory[6]    |              |
  Cache contents after each reference (newly added entries were highlighted in the original slides)

- 4 misses
Solution (cont.)
- 3 (fully associative)

  Block accessed | Hit or miss | Block 0   | Block 1   | Block 2   | Block 3
  0              | miss        | Memory[0] |           |           |
  8              | miss        | Memory[0] | Memory[8] |           |
  0              | hit         | Memory[0] | Memory[8] |           |
  6              | miss        | Memory[0] | Memory[8] | Memory[6] |
  8              | hit         | Memory[0] | Memory[8] | Memory[6] |
  Cache contents after each reference (newly added entries were highlighted in the original slides)

- 3 misses
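All three miss counts can be reproduced with a tiny LRU cache simulator, sketched below in Python with assumed names; the cache holds four 1-word blocks, and associativity 1, 2, and 4 corresponds to the three configurations.

```python
# Minimal LRU cache simulator for counting misses on a block-address trace.

def count_misses(trace, num_blocks=4, ways=1):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]      # each set: block addresses, LRU first
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                   # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                      # evict the least recently used block
        s.append(block)
    return misses

trace = [0, 8, 0, 6, 8]
print(count_misses(trace, ways=1))   # direct mapped         -> 5
print(count_misses(trace, ways=2))   # 2-way set associative -> 4
print(count_misses(trace, ways=4))   # fully associative     -> 3
```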
Implementation of a Set-Associative Cache

[Figure: 4-way set-associative cache. The address (bit positions 31..0) is split into a 22-bit tag (bits 31-10), an 8-bit set index (bits 9-2), and the byte offset; each of the 256 sets holds four (valid, tag, data) entries, four comparators check the stored tags against the address tag in parallel, and a 4-to-1 multiplexor selects the data; outputs are Hit and Data.]

- 4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor: cache size is 1K blocks = 256 sets × 4 blocks per set
Performance with Set-Associative Caches

[Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB]

- Miss rates for each of eight cache sizes with increasing associativity: data generated from SPEC92 benchmarks with a 32-byte block size for all caches
Replacement Policy
- Direct mapped: no choice
- Set associative:
  - prefer a non-valid entry, if there is one
  - otherwise, choose among the entries in the set
- Least recently used (LRU)
  - choose the entry unused for the longest time
  - simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - gives approximately the same performance as LRU for high associativity
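For the 2-way case, the single LRU bit per set mentioned in the earlier example can be sketched as follows (Python, with illustrative class and field names):

```python
# One LRU bit per set in a 2-way set-associative cache: the bit records which
# of the two ways was most recently referenced, so the other way is the victim.
class TwoWaySet:
    def __init__(self):
        self.blocks = [None, None]   # block address stored in way 0 and way 1
        self.mru = 0                 # index of the most recently used way

    def access(self, block):
        if block in self.blocks:               # hit
            self.mru = self.blocks.index(block)
            return True
        victim = 1 - self.mru                  # the LRU way is the one that is not MRU
        self.blocks[victim] = block
        self.mru = victim
        return False                           # miss
```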
Multilevel Caches
- Primary cache attached to the CPU
  - small, but fast
- Level-2 cache services misses from the primary cache
  - larger and slower, but still faster than main memory
- Main memory services L2 cache misses
- Some high-end systems include an L3 cache
Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache
  - the primary cache is on the same chip as the processor
  - use SRAMs to add a second-level cache between main memory and the first-level cache
  - if a miss occurs in the primary cache, the second-level cache is accessed
  - if the data is found in the second-level cache, the miss penalty is the access time of the second-level cache, which is much less than the main memory access time
  - if a miss occurs again at the second level, then a main memory access is required and a large miss penalty is incurred
- Design considerations with two levels of caches:
  - try to optimize the hit time of the 1st-level cache to reduce the clock cycle
  - try to optimize the miss rate of the 2nd-level cache to reduce memory access penalties
  - in other words, the 2nd level allows the 1st level to go for speed without "worrying" about failure...


Example Problem
- Assume a 500 MHz machine with
  - base CPI of 1.0
  - main memory access time of 200 ns
  - miss rate of 5%
- How much faster will the machine be if we add a second-level cache with a 20 ns access time that decreases the miss rate to main memory to 2%?
Solution
- Miss penalty to main memory = 200 ns / (2 ns per clock cycle) = 100 clock cycles
- Effective CPI with one level of cache
  = base CPI + memory-stall cycles per instruction
  = 1.0 + 5% × 100 = 6.0
- With two levels of cache, the miss penalty to the second-level cache
  = 20 ns / (2 ns per clock cycle) = 10 clock cycles
- Effective CPI with two levels of cache
  = base CPI + primary stalls per instruction + secondary stalls per instruction
  = 1 + 5% × 10 + 2% × 100 = 3.5
  (equivalently, 1 + (5% - 2%) × 10 + 2% × (10 + 100) = 3.5)
- Therefore, the machine with the secondary cache is faster by a factor of 6.0 / 3.5 = 1.71
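A quick Python check of this calculation (assumed variable names; it simply repeats the slide's arithmetic):

```python
# Two-level cache CPI calculation from the example (500 MHz clock -> 2 ns/cycle).
cycle_ns = 2.0
base_cpi = 1.0

main_penalty = 200 / cycle_ns            # 100 cycles
l2_penalty   = 20 / cycle_ns             # 10 cycles

cpi_one_level  = base_cpi + 0.05 * main_penalty                       # 6.0
cpi_two_levels = base_cpi + 0.05 * l2_penalty + 0.02 * main_penalty   # 3.5
print(cpi_one_level / cpi_two_levels)                                 # ~1.71
```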
Multilevel On-Chip Caches
- Intel Nehalem 4-core processor
- Per core: 32 KB L1 I-cache, 32 KB L1 D-cache, 512 KB L2 cache


3-Level Cache Organization

  Level                | Intel Nehalem                           | AMD Opteron X4
  L1 caches (per core) | L1 I-cache: 32 KB, 64-byte blocks,      | L1 I-cache: 32 KB, 64-byte blocks,
                       | 4-way, approx LRU replacement,          | 2-way, LRU replacement,
                       | hit time n/a                            | hit time 3 cycles
                       | L1 D-cache: 32 KB, 64-byte blocks,      | L1 D-cache: 32 KB, 64-byte blocks,
                       | 8-way, approx LRU replacement,          | 2-way, LRU replacement,
                       | write-back/allocate, hit time n/a       | write-back/allocate, hit time 9 cycles
  L2 unified cache     | 512 KB, 64-byte blocks, 8-way,          | 512 KB, 64-byte blocks, 16-way,
  (per core)           | approx LRU replacement,                 | approx LRU replacement,
                       | write-back/allocate, hit time n/a       | write-back/allocate, hit time n/a
  L3 unified cache     | 8 MB, 64-byte blocks, 16-way,           | 2 MB, 64-byte blocks, 32-way,
  (shared)             | replacement n/a,                        | replace block shared by fewest cores,
                       | write-back/allocate, hit time n/a       | write-back/allocate, hit time 32 cycles

  n/a: data not available
Sources of Misses (3 C's)
- Compulsory misses (aka cold-start misses)
  - first access to a block
- Capacity misses
  - due to finite cache size
  - a replaced block is later accessed again
- Conflict misses (aka collision misses)
  - in a non-fully-associative cache
  - due to competition for entries in a set
  - would not occur in a fully associative cache of the same total size
Cache Design Trade-offs

  Design change          | Effect on miss rate         | Negative performance effect
  Increase cache size    | Decreases capacity misses   | May increase access time
  Increase associativity | Decreases conflict misses   | May increase access time
  Increase block size    | Decreases compulsory misses | Increases miss penalty
