Memory Hierarchy
Haresh Dagale
Dept of ESE
Motivation
Memory Technologies
Access Time vs Cost
• SRAM: levels closer to the CPU
• DRAM: main memory
• Magnetic disk: largest and slowest level

    Technology   Access Time   Cost (ratio)
    SRAM         5 – 25 ns     100
    DRAM         60 – 120 ns   5
    Disk         10 – 20 ms    0.1
Memory
▪ Programmer’s dream:
• An unlimited amount of fast memory (one that works at the same speed as the processor)
▪ Hardware designer’s response:
• Create the illusion of a vast memory that can be accessed without making the processor wait (on average)
▪ How could this illusion be created?
• Programs access a relatively small portion of the address space at any instant of time
- “principle of locality”
▪ Temporal locality
• If an item is referenced, it will tend to be referenced again soon.
▪ Spatial locality
• If an item is referenced, items whose addresses are close by will tend to be referenced soon.
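Both kinds of locality are easiest to see in a loop nest. A minimal C sketch (the array and its size are illustrative, not from the slides):

```c
#include <stdio.h>

#define N 1024

/* Temporal locality: `sum` and the loop counters are reused every iteration.
 * Spatial locality: a[i][j] and a[i][j+1] are adjacent in memory, so one
 * fetched cache block serves several consecutive iterations. */
static int a[N][N];

int main(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* row-major: walks memory sequentially */
            sum += a[i][j];
    /* Swapping the two loops (column-major traversal) would touch a new
     * cache block on almost every access and typically runs much slower. */
    printf("%ld\n", sum);
    return 0;
}
```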
Memory Hierarchy
• To take advantage of temporal locality, memory is built as a hierarchy of levels
• Faster and smaller memory is placed close to the processor
• Slower and larger (less expensive) memory sits below the first level
Memory Organization
• All data is stored at the lowest level
• A level closer to the processor is a subset of any level further away
• Data is copied between two adjacent levels at a time
• The minimum unit of information that is transferred from one level to another is called a block
• The block size must be larger than one word
• to take advantage of spatial locality
Cache
▪ Intermediary between processor and memory
▪ A standard feature in all modern processors
▪ Most CPU designs use two levels of cache:
▪ “Level 1” or “Primary” cache (also called internal cache when it is implemented on-chip)
• Usually implemented on-chip and runs at the same clock rate as the processor
• In some processors, the L1 cache is divided into separate I-cache and D-cache
• L1 caches vary in size from 2 KB up to 64 KB
▪ “Level 2” or “Secondary” cache (also called external cache when it is implemented off-chip)
• L2 cache is usually implemented separately from the processor using fast static RAM (SRAM)
• Varies in size from 2 KB up to (?) MB
• The communication between this cache and the CPU is usually via a dedicated bus to ease traffic congestion with other subsystems
▪ Recent trend is to build the L2 cache on-chip as well, and yet another level (L3) off-chip.
Cache Organization
▪ The cache is divided into slots (or lines), each containing a block of data and a Tag field.
Cache line bits:
▪ Data field: a block of data (a multiple of the word size)
▪ Tag field: the upper portion of the address
• the bits that are not used as an index into the cache
• required to identify whether a word in the cache corresponds to the requested word
▪ Dirty bit: data has been written to the cache but not yet to external memory
• Instruction cache lines do not have this bit because the I-cache is read-only
▪ Valid bit: the cache line is not empty and has not been invalidated
▪ Lock bit: the cache line can be accessed but not replaced
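These fields map naturally onto a struct. A sketch of one cache line in C, with an assumed 64-byte block and illustrative field widths:

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 64          /* bytes of data per line (assumed) */

/* One cache line, mirroring the fields listed above. */
struct cache_line {
    uint32_t tag;              /* upper address bits, identifies the block */
    bool     valid;            /* line holds live data                     */
    bool     dirty;            /* written in cache but not yet in memory   */
    bool     lock;             /* line may be read but not replaced        */
    uint8_t  data[BLOCK_SIZE]; /* the cached block itself                  */
};
```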
Direct-mapped Cache
Cache Associativity
2-way Set Associative Cache
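In both organizations, a lookup splits the address into tag, index, and offset, then probes every way of the selected set. A minimal C sketch under assumed geometry (32-byte blocks, 64 sets, 2 ways, 32-bit addresses); a direct-mapped cache is the special case with a single way:

```c
#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 5                  /* log2(32-byte block)  (assumed) */
#define INDEX_BITS  6                  /* log2(64 sets)        (assumed) */
#define NUM_SETS    (1u << INDEX_BITS)
#define NUM_WAYS    2                  /* 1 would be direct-mapped */

struct line { uint32_t tag; bool valid; };
static struct line cache[NUM_SETS][NUM_WAYS];

/* Split the address and check each way of the indexed set for a tag match. */
bool lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return true;               /* hit  */
    return false;                      /* miss */
}
```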
Cache Miss
▪ Types of misses
• Compulsory misses (or cold-start misses)
• remedy: increase the block size
• Capacity misses
• remedy: increase the cache size (at the cost of additional hardware and address-resolution logic)
• Conflict misses
• remedy: reduce the swapping of competing blocks in and out (e.g., by increasing associativity)
▪ Design considerations
• Block size
• Replacement policy
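For the replacement policy, least-recently-used (LRU) is the common textbook choice. A minimal sketch for one 4-way set using per-way age counters (the function names are my own; real hardware often uses cheaper pseudo-LRU bits):

```c
#define WAYS 4

static unsigned age[WAYS];             /* per-way age: 0 = most recently used */

/* On a hit to `way`: everything younger than it ages by one, it becomes 0. */
static void touch(int way) {
    for (int w = 0; w < WAYS; w++)
        if (age[w] < age[way]) age[w]++;
    age[way] = 0;
}

/* On a miss: evict the oldest way in the set. */
static int victim(void) {
    int oldest = 0;
    for (int w = 1; w < WAYS; w++)
        if (age[w] > age[oldest]) oldest = w;
    return oldest;
}
```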
Updating Memory
▪ How to update main memory if cached data is modified?
▪ Write-through
• data is written immediately to the main memory
• causes more traffic on the bus
▪ Write-back (or copy-back)
• the memory update is delayed until the block is replaced
• more complex to implement
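The two policies differ only in when a store reaches main memory. A minimal C sketch; `memory_write` is a hypothetical bus operation and the line layout is assumed:

```c
#include <stdint.h>
#include <stdbool.h>

struct line { uint8_t data[64]; uint32_t tag; bool valid, dirty; };
extern void memory_write(uint32_t addr, uint8_t value);  /* hypothetical bus op */

/* Write-through: update the cache AND main memory on every store. */
void store_write_through(struct line *l, uint32_t addr, uint8_t v) {
    l->data[addr & 63] = v;
    memory_write(addr, v);          /* extra bus traffic on every write */
}

/* Write-back: update only the cache; memory sees the data when the
 * dirty block is eventually replaced. */
void store_write_back(struct line *l, uint32_t addr, uint8_t v) {
    l->data[addr & 63] = v;
    l->dirty = true;                /* remember to flush at replacement */
}
```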
Write Through
[Figure: on write-through, a write buffer holds pending memory updates so the processor need not stall for main memory; on a write-back (write-allocate) cache miss, if the old block is dirty it must be written back to memory before the new block is allocated]
Cache Coherency Problem
▪ Main memory is shared among the processors and I/O subsystems
• Individual caches improve performance by storing frequently used data in faster
memory
▪ The view of memory through the cache could be different from the view of
memory through the I/O subsystem
▪ Since all processors share the same address space, it is possible for more
than one processor to cache an address (or data item) at a time
▪ If one processor updates the data item without informing the other processors, inconsistencies may result, causing incorrect execution
Coherency /Consistency
▪ Coherence and Consistency are two complementary issues, though both define the behaviour of reads and writes to memory locations
▪ The Coherence model defines what value can be returned by a read
▪ The Consistency model defines when a written value must be seen by a read
▪ A simple definition of coherency: a read of a memory location returns the value most recently written to that location
[Figure: three-state (M/S/I) coherence diagram: processor LOAD and STORE operations move a line between I, S and M; L_REQ and S_REQ requests snooped from other cache controllers downgrade or invalidate the line, with FLUSH writing a modified block back to memory]
MESI Protocol
[Figure: MESI state diagram: on a processor Load, an Invalid line becomes Exclusive if no other cache holds the block, or Shared if the snooped L_REQ reports a sharer (S); a Store from E, S or I moves the line to Modified]

Transitions on requests from other cache controllers:

    Current State   Request from other Cache Controller   Next State
    M               Load  (L_REQ)                         S  (FLUSH block to memory)
    M               Store (S_REQ)                         I  (FLUSH block to memory)
    E               Load  (L_REQ)                         S
    E               Store (S_REQ)                         I
    S               Load  (L_REQ)                         S
    S               Store (S_REQ)                         I
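The snoop-side half of the protocol (the table above) is small enough to state directly in code. A minimal C sketch; `snoop`, `remote_store`, and `flush` are illustrative names of my own choosing:

```c
#include <stdbool.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

/* What this cache controller does with its own copy of a block when it
 * observes another controller's load (L_REQ) or store (S_REQ) for it. */
mesi_t snoop(mesi_t current, bool remote_store, bool *flush) {
    *flush = (current == MODIFIED);   /* a Modified copy must be written back */
    if (current == INVALID)
        return INVALID;               /* nothing cached, nothing to do */
    return remote_store ? INVALID     /* remote Store: our copy becomes stale */
                        : SHARED;     /* remote Load: downgrade M/E to S      */
}
```

The processor-side transitions (Loads and Stores issued by this CPU) would be a second function of the same shape.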
Cache Design
• Design a cache-memory system for a processor with an 8-bit data bus. It has 4 MBytes of RAM and 16 KBytes of on-chip cache. The cache is 4-way set associative. Assume that a cache line (cache block) is 128 bytes long.
[Figure: 4-way set-associative cache between the processor and main memory, with one set highlighted]
• Minimum address bus width?
• The tag field?
• Index?
• Offset?
• Number of sets?
• Number of possible (competing) memory blocks per set?
• Bits required to address 4 MB?
Cache Design Solution
▪ RAM: minimum 22 bits required to address 4 MBytes of memory (2^22 = 4 M)
▪ Number of bits required to identify a byte within a cache block = offset
▪ Offset = number of bits required to address a byte in a 128-byte block = 7 bits
▪ Number of sets =
• cache size / (number of cache lines per set × cache block length)
• = 16 K / (4 × 128) = 32
• Therefore, index field length = log2 32 = 5 bits
▪ Tag field = 22 − 5 − 7 = 10 bits
▪ We have 32 sets, each holding 4 cache blocks.
• Therefore, for a particular set, the possible cache blocks:
• Total memory blocks = 4 MB / 128 bytes = 32768
• Number of competing memory blocks for a particular set:
• 32768 / 32 (total memory blocks / total number of sets available) = 1024
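The same arithmetic can be checked mechanically. A small C program recomputing the figures above (variable names are my own):

```c
#include <stdio.h>

/* Recompute the worked example: 4 MB RAM, 16 KB 4-way cache, 128 B lines. */
int main(void) {
    unsigned ram_bytes   = 4u << 20;                    /* 4 MB          */
    unsigned cache_bytes = 16u << 10;                   /* 16 KB         */
    unsigned line_bytes  = 128, ways = 4;

    unsigned sets   = cache_bytes / (ways * line_bytes); /* 32           */
    unsigned offset = 7;                                 /* log2(128)    */
    unsigned index  = 5;                                 /* log2(32)     */
    unsigned addr   = 22;                                /* log2(4 MB)   */
    unsigned tag    = addr - index - offset;             /* 10 bits      */

    unsigned mem_blocks = ram_bytes / line_bytes;        /* 32768        */
    unsigned competing  = mem_blocks / sets;             /* 1024 per set */

    printf("sets=%u tag=%u mem_blocks=%u competing=%u\n",
           sets, tag, mem_blocks, competing);
    return 0;
}
```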