13 Memory
Multilevel Caches
Memory Hierarchy & Caches ICS 233 / COE 301 – Computer Organization © Muhamed Mudawar – slide 2
Memory Technology
Static RAM (SRAM)
Typically used to implement cache memory
Static RAM Storage Cell
[Figure: SRAM storage cell: cross-coupled inverters powered by Vcc, accessed through pass transistors controlled by the word line]
Dynamic RAM (DRAM) cell implementation:
1-Transistor cell: a single pass transistor with a storage capacitor
Example of a Memory Chip
24-pin dual in-line package: 2^22 × 4-bit = 16 Mibit memory
[Figure: pin layout of the 24-pin DIP package]
Typical Memory Structure
A 2^r × 2^c × m-bit memory array is organized as:
Row Decoder: selects the row to read/write, using the r-bit row address
Cell Matrix: 2D array of tiny memory cells, 2^r rows × 2^c columns × m bits
Row Latch: holds one entire row of 2^c × m bits
Sense/Write Amplifiers: sense & amplify data on a read; drive the bit lines with data-in on a write
Column Decoder: selects the column to read/write, using the c-bit column address
The same m data lines are used for data in/out
DRAM Operation
Row Access (RAS)
Latch and decode row address to enable addressed row
Small change in voltage detected by sense amplifiers
Latch whole row of bits
Sense amplifiers drive bit lines to recharge storage cells
SDRAM and DDR SDRAM
SDRAM is Synchronous Dynamic RAM
Added clock to DRAM interface
Memory Bandwidth
Rate at which data is transferred between memory and CPU
Bandwidth is measured as millions of Bytes per second
Increased from 800 to 25600 MBytes/sec (between 1996 and 2016)
Improvement in memory bandwidth is 32X (1996 to 2016)
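The bandwidth numbers above follow from clock rate, transfers per clock, and bus width. A minimal sketch (the specific module figures are illustrative assumptions, not from the slides; DDR transfers data on both clock edges):

```python
def peak_bandwidth_mb_s(bus_clock_mhz, transfers_per_clock, bus_width_bytes):
    """Peak transfer rate in MBytes/sec = clock rate x transfers/clock x bus width."""
    return bus_clock_mhz * transfers_per_clock * bus_width_bytes

# Illustrative figures (assumed): 64-bit (8-byte) memory bus
sdram_pc100 = peak_bandwidth_mb_s(100, 1, 8)  # SDR SDRAM: 1 transfer/clock -> 800 MB/s
ddr3_1600   = peak_bandwidth_mb_s(800, 2, 8)  # DDR: 2 transfers/clock -> 12800 MB/s
```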
DRAM Refresh Cycles
Refresh period is on the order of tens of milliseconds
Refreshing is done for the entire memory
Each row is read and written back to restore the charge
Some of the memory bandwidth is lost to refresh cycles
[Figure: cell voltage decaying toward the threshold voltage between refresh cycles]
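The fraction of bandwidth lost to refresh can be sketched as the time spent refreshing every row once per refresh period (row count, per-row refresh time, and period below are assumed illustrative values, not from the slides):

```python
def refresh_overhead(num_rows, row_refresh_time_ns, refresh_period_ms):
    """Fraction of memory bandwidth lost to refresh: every row is read
    and written back once per refresh period."""
    busy_ns = num_rows * row_refresh_time_ns
    period_ns = refresh_period_ms * 1e6
    return busy_ns / period_ns

# Assumed figures: 8192 rows, 50 ns to refresh a row, 64 ms refresh period
overhead = refresh_overhead(8192, 50, 64)  # ~0.64% of bandwidth lost
```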
Next . . .
Multilevel Caches
Processor-Memory Performance Gap
Performance Gap
Processor performance has historically improved much faster (roughly 50% per year) than DRAM access time (about 7% per year), creating a widening processor-memory performance gap
Typical Memory Hierarchy
Registers are at the top of the hierarchy
Typical size < 1 KB, access time < 0.5 ns
Level 1 Cache (8 – 64 KiB): access time about 1 ns
L2 Cache (1 MiB – 8 MiB): access time 3 – 10 ns
Main Memory (8 – 32 GiB), reached over the Memory Bus: access time 40 – 50 ns
Disk storage, reached over the I/O Bus: access time 5 – 10 ms
Moving down the hierarchy each level gets bigger; moving up, each level gets faster
Principle of Locality of Reference
Programs access only a small portion of their address space at any moment
At any time, only a small set of instructions & data is needed
Temporal locality: a recently accessed item is likely to be accessed again soon
Spatial locality: items with addresses near a recently accessed item are likely to be accessed soon
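The payoff of spatial locality can be sketched by counting how many distinct cache blocks an address stream touches (64-byte blocks assumed here for illustration):

```python
def blocks_touched(addresses, block_size=64):
    """Number of distinct cache blocks referenced by a byte-address stream."""
    return len({addr // block_size for addr in addresses})

# 256 four-byte words accessed sequentially vs. with a large stride:
sequential = [4 * i for i in range(256)]    # 1024 contiguous bytes -> 16 blocks
strided    = [256 * i for i in range(256)]  # one word per block -> 256 blocks
```

With the same number of accesses, the sequential stream reuses each fetched block 16 times, while the strided stream pays a block transfer on every access.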
What is a Cache Memory ?
Small and fast (SRAM) memory technology
Stores the subset of instructions & data currently being accessed
Goal is to achieve:
The fast access speed of the (SRAM) cache memory
While balancing the overall cost of the memory system
Cache Memories in the Datapath
[Figure: pipelined datapath with an I-Cache in the fetch stage and a D-Cache in the memory stage. The PC addresses the I-Cache to fetch each instruction into the pipeline; the register file (Rs, Rt, Rd ports, BusA/BusB/BusW) and ALU compute the data address, which addresses the D-Cache for loads and stores via Data_in/Data_out. On an I-Cache miss the instruction block address is sent to the next memory level and an instruction block is transferred; on a D-Cache miss the data block address is sent and a data block is transferred. Either miss causes the pipeline to stall.]
Almost Everything is a Cache !
In computer architecture, almost everything is a cache!
Registers are a software-managed cache on variables
The L1 and L2 caches are caches on main memory
Virtual memory makes main memory act as a cache on disk storage
The TLB (translation lookaside buffer) is a cache on the page table
Next . . .
Multilevel Caches
Four Basic Questions on Caches
Q1: Where can a block be placed in a cache?
Block placement
Direct Mapped, Set Associative, Fully Associative
Q2: How is a block found in a cache?
Block identification
Block address, tag, index
Q3: Which block should be replaced on a cache miss?
Block replacement
FIFO, Random, LRU
Q4: What happens on a write?
Write strategy
Write Back or Write Through cache (with Write Buffer)
Inside a Cache Memory
[Figure: the cache sits between the processor and main memory; the processor sends an address to and exchanges data with the cache, which in turn sends addresses to and exchanges data with main memory]
A cache contains N cache blocks, each paired with a stored address tag (Address Tag 0 with Cache Block 0, and so on)
Block Placement: Direct Mapped
Block: unit of data transfer between cache and memory
Direct Mapped Cache:
A block can be placed in exactly one location in the cache
In this example:
Cache index = least significant 3 bits of the block address
The cache has 8 blocks, indexed 000 through 111; main-memory blocks whose addresses end in the same 3 bits map to the same cache block (e.g., memory blocks 00000, 01000, 10000, and 11000 all map to cache index 000)
Direct-Mapped Cache
A memory address is divided into:
Block address: identifies the block in memory; itself divided into Tag and Index fields
Block offset: to access bytes within a block
The Valid bit indicates whether a cache block is valid or not
Hit: the Valid bit is set and the address Tag matches the stored tag
Direct Mapped Cache – cont’d
Cache hit: the requested block is stored inside the cache
The Index field of the block address is used to access the cache block; the Tag field is compared with the stored tag
Mapping an Address to a Cache Block
Example
Consider a direct-mapped cache with 256 blocks
Block size = 16 bytes
Compute tag, index, and byte offset of address: 0x01FFF8AC
The 32-bit address is divided into: Tag = 20 bits, Index = 8 bits, byte offset = 4 bits
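The field extraction can be written directly as shifts and masks (a sketch for this example's geometry):

```python
# Direct-mapped cache: 256 blocks of 16 bytes -> 8 index bits, 4 offset bits
addr = 0x01FFF8AC

OFFSET_BITS = 4   # log2(16-byte block size)
INDEX_BITS  = 8   # log2(256 blocks)

offset = addr & ((1 << OFFSET_BITS) - 1)               # byte offset within block
index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # cache block index
tag    = addr >> (OFFSET_BITS + INDEX_BITS)            # remaining upper bits

print(hex(tag), hex(index), hex(offset))  # 0x1fff 0x8a 0xc
```

So address 0x01FFF8AC has Tag = 0x1FFF, Index = 0x8A, and byte offset = 0xC.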
Example on Cache Placement & Misses
Consider a small direct-mapped cache with 32 blocks
Cache is initially empty, Block size = 16 bytes
The following memory addresses (in decimal) are referenced:
1000, 1004, 1008, 2548, 2552, 2556.
Map addresses to cache blocks and indicate whether hit or miss
Solution: the address is divided into Tag = 23 bits, Index = 5 bits, byte offset = 4 bits
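The sequence can be checked with a minimal direct-mapped simulator (a sketch, not the slides' solution table). Addresses 1000 and 1004 fall in block 62 (index 30), 1008 in block 63 (index 31), and 2548/2552/2556 in block 159, which conflicts with block 63 at index 31:

```python
def simulate_direct_mapped(addresses, num_blocks=32, block_size=16):
    """Tiny direct-mapped cache simulator: returns 'hit'/'miss' per access."""
    cache = {}  # index -> tag of the block currently resident there
    results = []
    for addr in addresses:
        block = addr // block_size     # block address
        index = block % num_blocks     # cache index
        tag = block // num_blocks      # remaining upper bits
        if cache.get(index) == tag:
            results.append("hit")
        else:
            cache[index] = tag         # allocate/replace on a miss
            results.append("miss")
    return results

refs = [1000, 1004, 1008, 2548, 2552, 2556]
print(simulate_direct_mapped(refs))
# ['miss', 'hit', 'miss', 'miss', 'hit', 'hit']
```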
Fully Associative Cache
A block can be placed anywhere in the cache, so the address has only Tag and offset fields (no index)
[Figure: m-way associative lookup: each entry stores Valid, Tag, and Block Data; the address tag is compared (=) against all stored tags in parallel, a mux selects the data of the matching entry, and Hit is asserted on a valid match]
Set-Associative Cache
A set is a group of m blocks that share the same cache index
A block is first mapped onto a set
Set index = Block address mod Number of sets in cache
The block can then be placed in any of the m entries (ways) of that set
Set-Associative Cache Diagram
Address Tag Index offset
[Figure: m-way set-associative cache: the Index selects one set; each of the m ways stores Valid, Tag, and Block Data; the address Tag is compared (=) against all m stored tags in parallel, a mux selects the matching way's data, and Hit is asserted on a valid match]
Write Policy
Write Through:
Writes update cache and lower-level memory
Cache control bit: only a Valid bit is needed
Memory always has latest data, which simplifies data coherency
Can always discard cached data when a block is replaced
Write Back:
Writes update cache only
Cache control bits: Valid and Modified bits are required
Modified cached data is written back to memory when replaced
Multiple writes to a cache block require only one write to memory
Uses less memory bandwidth than write-through and less power
However, more complex to implement than write through
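The bandwidth difference between the two policies can be sketched by counting memory writes for a run of stores (a simplified model, assumed for illustration, that ignores evictions during the run):

```python
def memory_writes_write_through(store_blocks):
    """Write-through: every store also writes to lower-level memory."""
    return len(store_blocks)

def memory_writes_write_back(store_blocks):
    """Write-back sketch: a dirty block is written to memory only when
    evicted; assuming all blocks stay cached until the end, each distinct
    dirty block costs exactly one memory write."""
    return len(set(store_blocks))

stores = [5, 5, 5, 9, 9]                       # block addresses of 5 stores
wt = memory_writes_write_through(stores)       # 5 memory writes
wb = memory_writes_write_back(stores)          # 2 memory writes (blocks 5, 9)
```

Multiple stores to the same block collapse into one memory write under write-back, which is where the bandwidth and power savings come from.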
What Happens on a Cache Miss?
Cache sends a miss signal to stall the processor
Decide which cache block to allocate/replace
One choice only when the cache is directly mapped
Multiple choices for set-associative or fully-associative cache
Replacement Policy
Which block should be replaced on a cache miss?
No selection alternatives for direct-mapped caches
m blocks per set to choose from for associative caches
Random replacement
Candidate blocks are randomly selected
One counter for all sets (0 to m – 1): incremented on every cycle
On a cache miss replace block specified by counter
Replacement Policy – cont’d
Least Recently Used (LRU)
Replace block that has been unused for the longest time
Order blocks within a set from least to most recently used
Update ordering of blocks on each cache hit
With m blocks per set, there are m! possible permutations
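LRU ordering within one set can be sketched with a recency list (a minimal software model, not how hardware tracks the ordering):

```python
def lru_set_access(block_tags, ways=2):
    """Simulate one cache set with LRU replacement; return 'hit'/'miss' per
    access. 'order' holds resident tags from least to most recently used."""
    order = []
    results = []
    for tag in block_tags:
        if tag in order:
            results.append("hit")
            order.remove(tag)      # will be re-appended as most recently used
        else:
            results.append("miss")
            if len(order) == ways:
                order.pop(0)       # evict the least recently used tag
        order.append(tag)          # mark as most recently used
    return results

print(lru_set_access(["A", "B", "A", "C", "B"], ways=2))
# ['miss', 'miss', 'hit', 'miss', 'miss']
```

In the trace, C evicts B (the least recently used of A and B), so the final access to B misses again.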
Next . . .
Multilevel Caches
Hit Rate and Miss Rate
Hit Rate = Hits / (Hits + Misses)
Miss Rate = Misses / (Hits + Misses)
I-Cache Miss Rate = Miss rate in the Instruction Cache
D-Cache Miss Rate = Miss rate in the Data Cache
Example:
Out of 1000 instructions fetched, 150 missed in the I-Cache
25% are load-store instructions, 50 missed in the D-Cache
What are the I-cache and D-cache miss rates?
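Working the example as code: the I-cache sees all 1000 fetches, while the D-cache sees only the 250 load/store accesses.

```python
instructions = 1000
icache_misses = 150
ls_accesses = 0.25 * instructions   # 250 data accesses by loads/stores
dcache_misses = 50

icache_miss_rate = icache_misses / instructions  # 150/1000 = 0.15 (15%)
dcache_miss_rate = dcache_misses / ls_accesses   # 50/250  = 0.20 (20%)
```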
Memory Stall Cycles Per Instruction
Memory Stall Cycles Per Instruction =
Combined Misses Per Instruction × Miss Penalty
Miss Penalty is assumed equal for I-cache & D-cache
Miss Penalty is assumed equal for Load and Store
Combined Misses Per Instruction =
I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
Therefore, Memory Stall Cycles Per Instruction =
(I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate) × Miss Penalty
Example on Memory Stall Cycles
Consider a program with the given characteristics
Instruction count (I-Count) = 10^6 instructions
30% of instructions are loads and stores
D-cache miss rate is 5% and I-cache miss rate is 1%
Miss penalty is 100 clock cycles for instruction and data caches
Compute combined misses per instruction and memory stall cycles
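Working the example with the stall-cycle formula above:

```python
i_count = 10**6
ls_frequency = 0.30
i_miss_rate = 0.01
d_miss_rate = 0.05
miss_penalty = 100

# Combined misses per instruction = I$ miss rate + LS frequency x D$ miss rate
combined_misses = i_miss_rate + ls_frequency * d_miss_rate  # 0.025 per instruction
stalls_per_instruction = combined_misses * miss_penalty     # 2.5 cycles
total_stall_cycles = i_count * stalls_per_instruction       # 2.5 million cycles
```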
CPU Time with Memory Stall Cycles
CPU Time = I-Count × CPI(MemoryStalls) × Clock Cycle
CPI(MemoryStalls) = CPI(PerfectCache) + Memory Stall Cycles Per Instruction
Example on CPI with Memory Stalls
A processor has CPI of 1.5 without any memory stalls
Cache miss rate is 2% for instruction and 5% for data
20% of instructions are loads and stores
Cache miss penalty is 100 clock cycles for I-cache and D-cache
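Working the example: the memory stalls add 3 cycles to every instruction on average.

```python
cpi_perfect = 1.5
i_miss_rate, d_miss_rate = 0.02, 0.05
ls_frequency = 0.20
miss_penalty = 100

# Stall cycles per instruction = (I$ MR + LS freq x D$ MR) x penalty
stalls = (i_miss_rate + ls_frequency * d_miss_rate) * miss_penalty  # 3.0
cpi_with_stalls = cpi_perfect + stalls                              # 4.5
```

The processor is effectively 4.5 / 1.5 = 3 times slower than with a perfect cache.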
Average Memory Access Time
Average Memory Access Time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
Time to access a cache for both hits and misses
Example: Find the AMAT for a cache with
Cache access time (Hit time) of 1 cycle = 2 ns
Miss penalty of 20 clock cycles
Miss rate of 0.05 per access
Solution:
AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
Without the cache, AMAT will be equal to Miss penalty = 20 cycles
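The example as code, with the 2 ns cycle time from the problem statement:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time, in the same unit as hit_time (cycles here)."""
    return hit_time + miss_rate * miss_penalty

cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)  # 1 + 0.05*20 = 2 cycles
nanoseconds = cycles * 2                                    # 2 ns per cycle -> 4 ns
```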
Next . . .
Multilevel Caches
Improving Cache Performance
Average Memory Access Time: AMAT = Hit time + Miss rate × Miss penalty
AMAT can be improved by reducing the Hit time, the Miss rate, or the Miss penalty
Small and Simple Caches
Hit time is critical: affects the processor clock cycle
Fast clock rate demands small and simple L1 cache designs
Classifying Misses – Three Cs
Conditions under which misses occur
Compulsory: program starts with no block in cache
Also called cold start misses
Misses that would occur even if a cache has infinite size
Classifying Misses – cont’d
Capacity: misses that occur because the cache cannot hold all the blocks the program needs
Would occur even in a fully associative cache of the same size
Conflict: misses that occur in direct-mapped and set-associative caches when too many blocks compete for the same set
Also called collision misses
Larger Size and Higher Associativity
A larger cache reduces capacity misses
Higher associativity reduces conflict misses
However, both tend to increase the hit time and the cost
Larger Block Size
Simplest way to reduce miss rate is to increase block size
However, it increases conflict misses if cache is small
[Figure: miss rate versus block size (16, 64, 128, 256 bytes) for several cache sizes; miss rate first decreases as blocks grow, then increases again for small caches]
Multilevel Caches
Top level cache should be kept small to
Keep pace with processor speed
The global miss rate is Miss Rate(L1) for the L1 cache, and Miss Rate(L1) × Miss Rate(L2) for the L2 cache
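With an L2 cache, an L1 miss first pays the L2 access time, and only global misses (L1 miss rate × L2 local miss rate) pay the full memory penalty. A sketch with assumed latencies and miss rates (illustrative values, not from the slides):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """AMAT (cycles) with two cache levels: L1 misses go to L2; L2 local
    misses go on to main memory."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)

# Assumed figures: L1 hit 1 cycle, L2 hit 8 cycles, memory penalty 100 cycles,
# L1 miss rate 5%, L2 local miss rate 40%
example = amat_two_level(1, 0.05, 8, 0.40, 100)  # 1 + 0.05*(8 + 40) = 3.4 cycles
```

Without the L2, the same L1 would give AMAT = 1 + 0.05 × 100 = 6 cycles, so the second level cuts the average access time almost in half here.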
Power 7 On-Chip Caches [IBM 2010]
32KB I-Cache/core and 32KB D-Cache/core, 3-cycle latency
256KB unified L2 cache/core, 8-cycle latency
32MB unified shared L3 cache in embedded DRAM, 25-cycle latency to the local slice
Multilevel Cache Policies
Multilevel Inclusion
L1 cache data is always present in L2 cache
Multilevel Cache Policies – cont’d
Multilevel exclusion
L1 data is never found in L2 cache – Prevents wasting space