Cache Memory and Architecture
Dr. Manish Kumar Bajpai
Note that cache design for High Performance Computing (HPC) is very
different from cache design for other computers
Some HPC applications perform poorly with typical cache designs
Cache Size does matter
• Cost
— More cache is expensive
— Would like cost/bit to approach cost of main
memory
• Speed
— We want overall access time to approach cache speed for all memory accesses
— More cache is faster (up to a point)
— But checking the cache for data takes time
— Larger caches are slower to operate
Logical Cache
• A logical (virtual) cache stores virtual
addresses rather than physical addresses
• Processor addresses cache directly without
going through MMU
• Obvious advantage is that addresses do not
have to be translated by the MMU
• A not-so-obvious disadvantage is that all
processes have the same virtual address space –
a block of memory starting at 0
— The same virtual address in two processes usually
refers to different physical addresses
— So either flush the cache on every context switch or
add extra bits (e.g. an address-space ID) to each tag, as sketched below
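The "extra bits" option can be pictured as extending each line's tag with an address-space identifier (ASID). A minimal C sketch of the idea; the structure and names are illustrative, not taken from the lecture:

```c
/* Illustrative sketch: a virtually-tagged cache line whose tag is extended
 * with an address-space ID (ASID), so the cache need not be flushed on
 * every context switch. */
#include <stdint.h>
#include <stdbool.h>

struct vcache_line {
    bool     valid;
    uint16_t asid;      /* extra bits identifying the owning process */
    uint32_t vtag;      /* tag taken from the *virtual* address      */
    uint8_t  data[4];
};

/* A hit now requires both the virtual tag and the ASID to match. */
static bool vcache_hit(const struct vcache_line *line,
                       uint32_t vtag, uint16_t current_asid)
{
    return line->valid && line->vtag == vtag && line->asid == current_asid;
}
```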
Logical and Physical Cache
Look-aside and Look-through
• Look-aside cache is parallel with main memory
• Cache and main memory both see the bus cycle
— Cache hit: processor loaded from cache, bus cycle
terminates
— Cache miss: processor AND cache loaded from
memory in parallel
• Pro: less expensive, better response to cache
miss
• Con: Processor cannot access cache while
another bus master accesses memory
Look-through cache
• Cache checked first when processor requests
data from memory
— Hit: data loaded from cache
— Miss: cache loaded from memory, then processor
loaded from cache
• Pro:
— Processor can run on cache while another bus
master uses the bus
• Con:
— More expensive than look-aside, cache misses slower
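The two organisations can be contrasted with a toy C model. This is illustrative only: the cycle counts are made-up placeholders, and only the structure of the two access flows matters.

```c
/* Toy model contrasting look-aside and look-through access flows. */
#include <stdio.h>
#include <stdbool.h>

static int bus_transactions;          /* memory bus transactions issued */

/* Look-aside: cache and main memory see the bus cycle together. */
static int lookaside_access(bool hit)
{
    bus_transactions++;               /* bus cycle starts immediately */
    if (hit)
        return 1;                     /* cache answers, bus cycle is aborted */
    return 10;                        /* memory answers; the cache is filled
                                         in parallel, so no extra penalty */
}

/* Look-through: the cache is checked first; the bus is used only on a miss. */
static int lookthrough_access(bool hit)
{
    if (hit)
        return 1;                     /* bus stays free for other masters */
    bus_transactions++;
    return 1 + 10 + 1;                /* check, fill from memory, then load
                                         the processor from the cache */
}

int main(void)
{
    printf("look-aside   miss: %2d cycles\n", lookaside_access(false));
    printf("look-through miss: %2d cycles\n", lookthrough_access(false));
    printf("bus transactions issued: %d\n", bus_transactions);
    return 0;
}
```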
Mapping Function
• There are fewer cache lines than memory
blocks so we need
— An algorithm for mapping memory into cache lines
— A means to determine which memory block is in
which cache line
• Example elements:
— Cache of 64kByte
— Cache block of 4 bytes
– i.e. cache is 16k (2^14) lines of 4 bytes
— 16MBytes main memory
— 24-bit address (2^24 = 16M)
(note: Pentium cache line = 32 bytes until Pentium 4 (128 bytes))
Direct Mapping
• Each block of main memory maps to only one cache
line
— i.e. if a block is in cache, it must be in one specific place
• Mapping function is i = j modulo m
(i = j % m) where
i = cache line number
j = main memory block number
m = number of cache lines
• Example: remove the 2 l.s. word bits from the address; the remaining low-order bits 0001 0100 0000 00, regrouped as 00 0101 0000 0000, give cache line 0500
• Cache line   Main memory blocks held
— 0      000000, 010000, …, FF0000
— 1      000004, 010004, …, FF0004
— …
— m-1    00FFFC, 01FFFC, …, FFFFFC
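The field extraction above can be written directly in C. A minimal sketch using the example parameters (24-bit address, 4-byte blocks, 16K lines); the sample address is arbitrary:

```c
/* Split a 24-bit address into tag / line / word fields for a direct-mapped
 * cache with 16K lines of 4 bytes (64 KB in total). */
#include <stdio.h>

#define WORD_BITS 2                     /* 4-byte blocks          */
#define LINE_BITS 14                    /* 16K = 2^14 cache lines */
#define LINE_MASK ((1u << LINE_BITS) - 1)

int main(void)
{
    unsigned addr = 0xFF0004;           /* any 24-bit address     */

    unsigned block = addr >> WORD_BITS;                 /* memory block number j */
    unsigned line  = block & LINE_MASK;                 /* i = j mod m, m = 2^14 */
    unsigned tag   = addr >> (WORD_BITS + LINE_BITS);   /* remaining 8 bits      */
    unsigned word  = addr & ((1u << WORD_BITS) - 1);

    printf("addr %06X -> tag %02X, line %04X, word %X\n", addr, tag, line, word);
    return 0;
}
```

For address FF0004 this prints tag FF, line 0001, word 0, matching the row for cache line 1 in the table above.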
Direct Mapping Pros & Cons
• Pro
— Simple
— Inexpensive
• Con
— Fixed location for given block
— If a program accesses 2 blocks that map to the same
line repeatedly, cache misses are very high (thrashing)
• Victim cache
— A solution to direct mapped cache thrashing
— Discarded lines are stored in a small "victim" cache (4 to 16 lines)
— Victim cache is fully associative and resides
between L1 and next level of memory
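A minimal C sketch of the idea. This is illustrative only; the slides do not specify a replacement policy, so FIFO is assumed here just to keep the example short:

```c
/* Tiny fully associative victim cache holding the last few lines evicted
 * from a direct-mapped L1. */
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 4

struct victim_entry {
    bool     valid;
    uint32_t addr;      /* block-aligned address acts as the tag */
    uint8_t  data[4];
};

static struct victim_entry victim[VICTIM_ENTRIES];
static int next_slot;   /* FIFO pointer */

/* Called when the direct-mapped L1 evicts a line. */
void victim_insert(uint32_t block_addr, const uint8_t data[4])
{
    struct victim_entry *e = &victim[next_slot];
    e->valid = true;
    e->addr  = block_addr;
    for (int i = 0; i < 4; i++)
        e->data[i] = data[i];
    next_slot = (next_slot + 1) % VICTIM_ENTRIES;
}

/* Checked after an L1 miss, before going to the next level of memory. */
bool victim_lookup(uint32_t block_addr, uint8_t out[4])
{
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].addr == block_addr) {
            for (int j = 0; j < 4; j++)
                out[j] = victim[i].data[j];
            return true;        /* hit: no need to access the next level */
        }
    }
    return false;
}
```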
Associative Mapping
• A main memory block can load into any line of
cache
• Memory address is interpreted as 2 fields: tag
and word
• Tag uniquely identifies block of memory
• Every line’s tag is examined simultaneously for a
match
— Cache searching gets expensive because a comparator
must be wired to each tag
— A comparator is built from XNOR gates (output is true when both
inputs are equal)
— Complexity of comparator circuits makes fully
associative cache expensive
Associative Mapping
• Because no bit field in the address specifies a
line number the cache size is not determined
by the address size
• Associative-mapped memory is also called
"content-addressable memory"
• Items are found not by their address but by
their content
— Used extensively in routers and other network
devices
— Corresponds to associative arrays in Perl and other
languages
• Primary disadvantage is the cost of circuitry
Direct Mapping compared to Associative
Associative Mapping Address Structure
• Address structure: Tag (22 bits) | Word (2 bits)
• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for
hit
• Least significant 2 bits of address identify which byte is required from the 32-bit (4-byte) data block
• e.g.
— Address: FFFFFC   Tag (top 22 bits): 3FFFFF   Data: 24682468   Cache line: 3FFF
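In hardware every tag is compared simultaneously; in software this can only be modelled as a scan over all lines. A minimal C sketch using the example sizes above (64 KB cache, 4-byte lines, 22-bit tags); names are illustrative:

```c
/* Fully associative lookup: the tag of every line is checked. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 16384                 /* 16K lines of 4 bytes = 64 KB */

struct cache_line {
    bool     valid;
    uint32_t tag;                       /* 22-bit tag */
    uint8_t  data[4];
};

static struct cache_line cache[NUM_LINES];

bool assoc_lookup(uint32_t addr, uint8_t *out_byte)
{
    uint32_t tag  = addr >> 2;          /* top 22 bits of the 24-bit address */
    uint32_t word = addr & 0x3;         /* byte within the 4-byte block      */

    for (int i = 0; i < NUM_LINES; i++) {        /* parallel in hardware */
        if (cache[i].valid && cache[i].tag == tag) {
            *out_byte = cache[i].data[word];
            return true;                         /* hit  */
        }
    }
    return false;                                /* miss */
}
```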
Fully Associative Cache Organization
Associative Mapping
• Parking lot analogy: there are more permits than
spaces
• Any student can park in any space
• Makes full use of parking lot
— With direct mapping many spaces may be unfilled
Set Associative Mapping
• Cache is divided into v sets of k lines each; a block maps to exactly one set but may occupy any line within that set
• Address structure: Tag (9 bits) | Set (13 bits) | Word (2 bits)
• e.g. two addresses that map to the same set:
— Address 1FF 7FFC → Tag 1FF, Data 12345678, Set number 1FFF
— Address 001 7FFC → Tag 001, Data 11223344, Set number 1FFF
• Tags are much smaller than in a fully associative cache, and the comparators needed for simultaneous lookup are much less expensive
Set Associative Mapping Summary
• For a k-way set associative cache with v sets (each set
contains k lines):
— Address length = (t + d + w) bits, where w = log2(block size) and d = log2(v)
— Number of addressable units = 2^(t+d+w) words or bytes
— Size of tag = t bits
— Block size = line size = 2^w words or bytes
— Number of blocks in main memory = 2^(t+d)
— Number of lines per set = k
— Number of sets = v = 2^d
— Number of lines in cache = k·v = k × 2^d
• Address structure: Tag (t bits) | Set (d bits) | Word (w bits)
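As a quick check of these formulas, a small C sketch (illustrative only, assuming the running example of a 64 KB, 2-way set associative cache with 4-byte lines and 24-bit addresses) derives the field widths shown earlier:

```c
/* Derive tag / set / word field widths from the cache parameters.
 * Expected output for the example: tag=9 set=13 word=2. */
#include <stdio.h>

static unsigned log2u(unsigned x)       /* x is assumed to be a power of two */
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned addr_bits  = 24;
    unsigned block_size = 4;            /* bytes per line        */
    unsigned cache_size = 64 * 1024;    /* bytes                 */
    unsigned k          = 2;            /* lines per set (2-way) */

    unsigned lines = cache_size / block_size;   /* k * v = 16384 lines   */
    unsigned v     = lines / k;                 /* number of sets = 8192 */

    unsigned w = log2u(block_size);             /* word field */
    unsigned d = log2u(v);                      /* set field  */
    unsigned t = addr_bits - d - w;             /* tag field  */

    printf("tag=%u set=%u word=%u (lines=%u, sets=%u)\n", t, d, w, lines, v);
    return 0;
}
```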
Additional Notes
• When v (# sets) = m (# lines in cache) and k = 1 (one line per set), set associative mapping reduces to direct mapping
• When v = 1 (a single set) and k = m (every line in that set), it reduces to pure associative mapping
• Two lines per set (v = m/2, k = 2) is quite common
• Significant improvement in hit ratio over direct mapping
• Four-way set associative mapping (v = m/4, k = 4) provides a further modest improvement
Set Associative Mapping Implementation
• A set associative cache can be implemented as
k direct mapped caches OR as v associative
caches
• With k direct mapped caches each direct
mapped cache is referred to as a way
• The direct mapped implementation is used for
small degrees of associativity (small k) and the
associative mapped implementation for higher
degrees.
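A minimal C sketch of this idea, with the k ways modelled as k parallel direct-mapped arrays (sizes follow the two-way running example; names are illustrative, not from the slides):

```c
/* k-way set associative lookup modelled as k direct-mapped "ways",
 * each indexed by the same set field of the address. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS      2
#define SET_BITS  13                       /* 8192 sets    */
#define WORD_BITS 2                        /* 4-byte lines */
#define NUM_SETS  (1u << SET_BITS)

struct line {
    bool     valid;
    uint32_t tag;                          /* 9-bit tag */
    uint8_t  data[4];
};

/* way[w] behaves like an independent direct-mapped cache. */
static struct line way[WAYS][NUM_SETS];

bool set_assoc_lookup(uint32_t addr, uint8_t *out_byte)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t set  = (addr >> WORD_BITS) & (NUM_SETS - 1);
    uint32_t tag  = addr >> (WORD_BITS + SET_BITS);

    /* Only the k lines of one set are compared (in parallel in hardware). */
    for (int w = 0; w < WAYS; w++) {
        if (way[w][set].valid && way[w][set].tag == tag) {
            *out_byte = way[w][set].data[word];
            return true;                   /* hit in this way */
        }
    }
    return false;                          /* miss: consult replacement policy */
}
```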
Direct mapped implementation
Associative Mapped Implementation
Varying associativity over cache size
Cache replacement algorithms
• When a new line is read into a full cache (or full set) it must replace a line already in the cache
• Except with direct mapping, there is a choice of which line to evict, so a replacement algorithm is needed