EE6304 Lecture 9 - Memory Caches
Lecture 9 - Caches
Cache
Index    Valid Bit   Tag   Data
0 (00)   1
1 (01)   1
2 (10)   1
3 (11)   0                  (invalid data)
Retrieving Data from the Cache
• When the CPU requires instruction/data from memory,
the address will be sent to the cache controller
– Lowest k bits serve as the cache index
– Upper (m - k) bits serve as the tag
• Data is sent to the CPU if valid data is available
Loading Data into the Cache
• A copy of the data read from memory is stored into
the cache
• Lowest k bits of the address specify a cache block
• Upper (m - k) bits specify the tag
• Data from memory is stored in the cache's data field
• The valid bit is set to 1 (see the lookup/fill sketch below)
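A minimal C sketch of the retrieve/fill behavior just described, assuming the 4-entry direct-mapped cache with one-byte blocks from the figure above; the names (cache_line, lookup, fill) and bit widths are illustrative, not from the slides:

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES  4   /* 4-entry cache -> k = 2 index bits */
#define INDEX_BITS 2

typedef struct {
    bool    valid;     /* valid bit                  */
    uint8_t tag;       /* upper (m - k) address bits */
    uint8_t data;      /* one-byte block             */
} cache_line;

static cache_line cache[NUM_LINES];

/* Retrieve: returns true on a hit and hands the data to the CPU. */
bool lookup(uint8_t addr, uint8_t *out)
{
    uint8_t index = addr & (NUM_LINES - 1);  /* lowest k bits  */
    uint8_t tag   = addr >> INDEX_BITS;      /* upper m-k bits */
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data;
        return true;
    }
    return false;  /* miss: fetch from memory, then call fill() */
}

/* Load: store a copy of the data read from memory and set the valid bit. */
void fill(uint8_t addr, uint8_t data)
{
    uint8_t index = addr & (NUM_LINES - 1);
    cache[index].tag   = addr >> INDEX_BITS;
    cache[index].data  = data;
    cache[index].valid = true;
}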
Spatial Locality
• How can caches be made more efficient by exploiting locality?
• Make the cache block size larger than one byte
• E.g., two-byte blocks
  – The last address bit indicates which data entry within the block
[Figure: 16-byte memory (addresses 0 (0000) through F (1111)) mapped onto a 4-entry cache with Index, Tag, and Data fields]
Spatial Locality cont.
• When main memory is accessed, its entire block (depending on the block size) will be written into the cache
• E.g., if the cache has a block size of 2 and address 12h is read from memory, the whole block containing 12h is fetched (see the sketch below)
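As a small illustration (address and block size from the slide, code mine): masking off the offset bits gives the base address of the block that will be brought in.

#include <stdio.h>

#define BLOCK_SIZE 2   /* bytes per block, as in the slide */

int main(void)
{
    unsigned addr = 0x12;                      /* address 12h           */
    unsigned base = addr & ~(BLOCK_SIZE - 1u); /* clear the offset bits */
    /* The whole block [base, base + BLOCK_SIZE) is written into the cache. */
    printf("miss on %02Xh fills block %02Xh-%02Xh\n",
           addr, base, base + BLOCK_SIZE - 1); /* 12h-13h */
    return 0;
}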
Locating Data in a Multi-Block Cache
• A block select (block offset) is required to select which byte within the block to read
Example
• For the addresses below, what byte is read? (see the decomposition sketch after the figure)
  – 1010
  – 1110
[Figure: address with the block-offset field highlighted, plus cache contents]
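The cache-contents figure did not survive extraction, so the bytes themselves cannot be reproduced here, but the field split can. Assuming the two-byte-block, four-entry cache from the previous slides (1 tag bit, 2 index bits, 1 offset bit in a 4-bit address), a sketch of the decomposition:

#include <stdio.h>

int main(void)
{
    unsigned addrs[] = { 0xA /* 1010 */, 0xE /* 1110 */ };
    for (int i = 0; i < 2; i++) {
        unsigned a      = addrs[i];
        unsigned offset =  a       & 0x1;  /* bit 0: byte within the block */
        unsigned index  = (a >> 1) & 0x3;  /* bits 2..1: cache index       */
        unsigned tag    =  a >> 3;         /* bit 3: tag                   */
        printf("addr %X -> tag %u, index %u, offset %u\n",
               a, tag, index, offset);
    }
    return 0;
}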
Disadvantages of Direct-Mapped Caches
• Advantages of direct-mapped caches:
  – Simple hardware to implement
  – Offset can be computed quickly and efficiently
• Disadvantages:
  – Cache can have low performance and be underutilized if program addresses lead to the same cache index, e.g., 4, 8, 4, 8, ...
[Figure: 16-byte memory (addresses 0 (0000) through F (1111)) mapped onto a 4-entry direct-mapped cache with Index, Tag, and Data fields]
Fully Associative Cache
• Allows data to be stored in any cache line
– When data is fetched from memory → It is placed
in any unused cache block
– No conflicts between multiple memory addresses
mapped onto a single cache block
Fully Associative Cache cont.
• Pros:
– Makes use of cache space more effectively
– No address conflicts
• Cons:
– It is expensive (area) to implement
  • No index field → the entire address is used as the tag, increasing the tag storage
  • Data can be anywhere in the cache → need to check every tag of every cache block (see the sketch below)
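A sketch of why lookups are expensive, as a small software model (names mine): with no index field, every line's tag must be compared against the full address; hardware pays for this with one comparator per line.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 4

typedef struct {
    bool     valid;
    uint32_t tag;   /* no index field: the entire address is the tag */
    uint8_t  data;
} fa_line;

static fa_line fa_cache[NUM_LINES];

bool fa_lookup(uint32_t addr, uint8_t *out)
{
    for (int i = 0; i < NUM_LINES; i++) {   /* check every tag */
        if (fa_cache[i].valid && fa_cache[i].tag == addr) {
            *out = fa_cache[i].data;
            return true;
        }
    }
    return false;
}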
Set Associativity
• Intermediate possibility
– Cache is divided into groups of blocks called sets.
– Each memory address maps to exactly one set in the cache,
but data may be placed in any block within that set
• If each set has 2^x blocks, the cache is a 2^x-way set-associative cache
• A 1-way set-associative cache = a direct-mapped cache (see the mapping sketch below)
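A sketch of the mapping step (parameters assumed, not from the slides): the set index is the block number modulo the number of sets, and within the chosen set the block may occupy any way. With one line per set this reduces to direct-mapped; with a single set it becomes fully associative.

#include <stdio.h>

#define BLOCK_SIZE 2   /* bytes per block (assumed)      */
#define NUM_SETS   2   /* e.g., 4 lines, 2-way -> 2 sets */

int main(void)
{
    unsigned addr  = 0xA;
    unsigned block = addr / BLOCK_SIZE;  /* block number                */
    unsigned set   = block % NUM_SETS;   /* exactly one set per address */
    unsigned tag   = block / NUM_SETS;   /* stored to identify the block */
    printf("addr %X -> set %u, tag %u\n", addr, set, tag);
    return 0;
}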
Example
Address   Outcome
0         Miss
2         Hit
4         Miss
128       Miss
0         Hit
128       Hit
64        Miss
4         Hit
0         Miss
32        Miss
64        Hit
Example Solution – Block Size
• Address 0 misses and address 2 then hits, so addresses 0 and 2 share a block; address 4 misses, so the block size is 4 bytes (each block holds addr0, addr1, addr2, addr3, i.e., a 2-bit block offset)
• The remaining address bits split into a tag and an index
[Figure: the reference table above, with the address divided into Tag and Index fields and cache rows (..., 14, 15, ...) each holding addr0-addr3]
Example 1
• Given the example code below, and assuming a
virtually-addressed direct-mapped cache of capacity
8KBytes and 64-bit blocks (8 bytes), compute the overall
miss rate (number of misses divided by number of
references). Assume that all variables except array
locations reside in registers, and that arrays A, B, and C
are placed consecutively in memory.
All entries from the selected set are sent to the hit/miss logic
Tag Arrays
• A tag entry records which line of data is currently stored in the data-cache line associated with it
• A tag entry consists of:
  – A tag field that contains the tag portion of the address
  – A valid bit that records whether or not the line associated with this tag entry contains valid data
  – A dirty bit
  – Depending on the replacement policy, additional state:
    • LRU: records how many of the other lines in the set have been referenced since the last time the corresponding line was referenced (log2(cache associativity) bits)
[Figure: tag array entry layout: valid bit | dirty bit | tag field]
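One way to picture a tag-array entry in C (a sketch; the 19-bit tag width and 2 LRU bits assume the 4-way, 32-bit-address example that follows):

#include <stdint.h>

/* One tag-array entry: valid bit, dirty bit, LRU state, tag field. */
typedef struct {
    uint32_t valid : 1;   /* line holds valid data                           */
    uint32_t dirty : 1;   /* line modified since fill (write-back policy)    */
    uint32_t lru   : 2;   /* log2(4-way associativity) = 2 bits of LRU state */
    uint32_t tag   : 19;  /* tag portion of the address                      */
} tag_entry;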
Example
• How many bits of storage are required for the tag array of a 32-KB cache with 256-byte cache lines and 4-way associativity, if the cache is write-back but does not require any additional bits in the tag array to implement the write-back policy? Assume that the system containing the cache uses 32-bit addresses and requires 1 dirty bit and 1 valid bit.
Solution
• A 32-KB cache with 256-byte lines contains:
  – 32 KB / 256 B = 128 lines
• Since the cache is 4-way set associative, it has:
  – 128 / 4 = 32 sets → m = 5 bits
• Lines that are 256 bytes long mean log2(256) = n = 8 → 8 + 5 = 13 bits (m + n) of the address are used to select a set and determine the byte within the line that an address points to
• Tag field of each tag-array entry = 32 - 13 = 19 bits
• Adding 2 bits for the dirty and valid bits = 21 bits per entry
• 21 bits × 128 lines = 2,688 bits of storage in the tag array
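The same arithmetic in checkable form (a sketch; variable names are mine):

#include <stdio.h>

static unsigned log2u(unsigned x)   /* x assumed to be a power of two */
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned cache_bytes = 32 * 1024;   /* 32-KB cache      */
    unsigned line_bytes  = 256;         /* 256-byte lines   */
    unsigned ways        = 4;           /* 4-way set assoc. */
    unsigned addr_bits   = 32;
    unsigned status_bits = 2;           /* valid + dirty    */

    unsigned lines    = cache_bytes / line_bytes;                    /* 128 */
    unsigned sets     = lines / ways;                                /* 32  */
    unsigned tag_bits = addr_bits - log2u(sets) - log2u(line_bytes); /* 19  */

    printf("%u bits\n", (tag_bits + status_bits) * lines);           /* 2688 */
    return 0;
}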
Cache Misses and Performance
• How does this affect performance?
• Performance = Time / Program
  Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
               = (code size) × (CPI) × (cycle time)
• Cache organization affects cycle time
– Hit latency
• Cache misses affect CPI
Average Memory Access Time (AMAT)
AMAT = Hit_L1 + MissRate_L1 × (Hit_L2 + MissRate_L2 × (Hit_mem + MissRate_mem × MissPenalty_mem))
Example AMAT
• Calculate the AMAT for a system with the
following properties:
– L1$ hits in 1 cycle with local hit rate of 50%
– L2$ hits in 10 cycles with a local hit rate of 75%
– L3$ hits in 100 cycles with local hit rate of 90%
– Main memory always hits in 1000 cycles
Solution Example AMAT
L1$ hits in 1 cycle with local hit rate of 50%
L2$ hits in 10 cycles with a local hit rate of 75%
L3$ hits in 100 cycles with local hit rate of 90%
Main memory always hits in 1000 cycles
AMAT = 1 + (1 - 0.5) × (10 + (1 - 0.75) × (100 + (1 - 0.9) × 1000)) = 31 cycles
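The nested evaluation maps directly onto a loop that folds the hierarchy in from memory back to L1 (a sketch; array names are mine, values from the problem):

#include <stdio.h>

int main(void)
{
    /* Hit times (cycles) and local hit rates for L1, L2, L3. */
    double hit_time[] = { 1.0, 10.0, 100.0 };
    double hit_rate[] = { 0.50, 0.75, 0.90 };
    double amat = 1000.0;            /* main memory always hits */

    /* Fold each level in, innermost (memory) first. */
    for (int i = 2; i >= 0; i--)
        amat = hit_time[i] + (1.0 - hit_rate[i]) * amat;

    printf("AMAT = %.0f cycles\n", amat);   /* 31 */
    return 0;
}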
Cache Misses and Performance
• CPI equation
– Only holds for misses that cannot be overlapped with other
activity
– Store misses often overlapped
• Place store in store queue → Wait for miss to complete →Perform
store
• Allow subsequent instructions to continue in parallel
– Modern out-of-order processors also do this for loads
• Cache performance modeling requires detailed modeling of the entire processor core
Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access to
a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Conflict (collision):
– Multiple memory locations mapped
to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
• Coherence (invalidation): another process (e.g., I/O) updates memory
Cache Miss Rates
• Subtle tradeoffs between cache organization
parameters
– Large blocks reduce compulsory misses but increase miss
penalty
• #compulsory = (working set) / (block size)
• #transfers = (block size)/(bus width)
– Large blocks increase conflict misses
• #blocks = (cache size) / (block size)
– Associativity reduces conflict misses
– Associativity increases access time
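A quick numeric illustration of the block-size tradeoff (working-set and cache sizes are assumed, not from the slides): doubling the block size halves the compulsory-miss count for a fixed working set, but also halves the number of blocks available, which tends to raise conflict misses.

#include <stdio.h>

int main(void)
{
    unsigned working_set = 64 * 1024;   /* 64 KB touched once (assumed) */
    unsigned cache_size  =  8 * 1024;   /*  8 KB cache (assumed)        */

    for (unsigned block = 32; block <= 64; block *= 2) {
        unsigned compulsory = working_set / block; /* first-touch misses    */
        unsigned blocks     = cache_size / block;  /* fewer places for data */
        printf("block %2uB: %4u compulsory misses, %3u cache blocks\n",
               block, compulsory, blocks);
    }
    return 0;
}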
Cache Miss Rates: 3 C's
• Vary size and associativity
  – Compulsory misses are constant
  – Capacity and conflict misses are reduced
[Chart: misses per instruction (%) for 8K1W, 8K4W, 16K1W, and 16K4W configurations, stacked into conflict, capacity, and compulsory components. © Hill, Lipasti]
Cache Miss Rates: 3 C's cont.
• Vary size and block size
  – Compulsory misses drop with increased block size
  – Capacity and conflict misses can increase with larger blocks
[Chart: misses per instruction (%) for 8K32B, 8K64B, 16K32B, and 16K64B configurations, stacked into conflict, capacity, and compulsory components. © Hill, Lipasti]
Basic Cache Optimizations
• Metrics for cache optimizations: hit latency, miss rate, and miss penalty
• Reducing hit latency, which depends on:
  – Block size
  – Associativity
  – Number of blocks
• Reducing Miss Rate
– Larger Block size (compulsory misses)
– Larger Cache size (capacity misses)
– Higher Associativity (conflict misses)
– Compiler optimizations
• Reducing Miss Penalty
– Multilevel Caches
– Hardware prefetching and compiler prefetching
Summary #1/2: The Cache Design Space
• Several interacting dimensions
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
  – write allocation
• The optimal choice is a compromise
  – depends on access characteristics
    • workload
    • use (I-cache, D-cache, TLB)
  – depends on technology / cost
[Figure: design-space sketch plotting "Good" vs. "Bad" against Factor A and Factor B for cache size, associativity, and block size]