06 - Memory System - I
Chapter 05-01
Memory System
Memory Hierarchy and Cache Memory
Hyoukjun Kwon
[email protected]
EECS 112 (Spring 2024)
Organization of Digital Computers
2
Key Trade-off in the Memory Technology
On-chip memory: small and fast, but costly
Off-chip memory: large and cheap, but slow
3
Principle of Locality
§ Key Observation
• Programs access a small proportion of their address space at any time
§ Temporal locality
• Items accessed recently are likely to be accessed again soon
• e.g., instructions in a loop, induction variables
§ Spatial locality
• Items near those accessed recently are likely to be accessed soon
• e.g., sequential instruction access, array data
4
Example C Code
int array_sum(int *ary, int len) {
    int sum = 0;                      /* sum, i: temporal locality (reused every iteration) */
    for (int i = 0; i < len; i++)
        sum += ary[i];                /* ary[i]: spatial locality (sequential accesses)     */
    return sum;
}
Both spatial and temporal locality patterns are easily found in many programs
5
Strategy: Memory Hierarchy
[Figure: memory hierarchy; on-chip cache at the top, off-chip main memory (DRAM) below it, flash storage (SSD) at the bottom]
§ Strategy 1: Use small and fast memories for frequently accessed data
§ Strategy 2: Use large and slow memories as backing storage (for infrequently accessed data)
6
Utilizing Memory Hierarchy
§ If accessed data is present in the upper level
• Hit: access satisfied by the upper level
o Hit ratio: hits/accesses
§ If accessed data is absent from the upper level
• Miss: the block is copied from the lower level, then the access is satisfied by the upper level
o Miss ratio: misses/accesses = 1 − hit ratio
7
8
Cache Memory
§ Cache memory
• A small on-chip memory based on SRAM technology
• Closest memory element to the CPU (1 to a few cycles for access) other than the register file
[Figure: a 4-entry cache indexed by the low two bits of the address; each entry (index 2'b00 to 2'b11) is shared across all data whose address ends with that index]
9
Terminologies
10
Terminologies in Addressing
[Figure: cache organization; each row (index 0, 1, ...) holds a valid bit, a tag, and a B-byte data block, and the address is split into a block address (tag + index) and a block offset]
Tag width: t = 32 − (k + b) bits
Index width: k = log2(K) bits, where K is the number of rows (sets)
Offset width: b = log2(B) bits, where B is the block size in bytes
11
Example
[Figure: a 4-row direct-mapped cache (index 2'b00 to 2'b11); each row holds a valid bit, a tag, and a 32-byte (eight-word) data block]
Address = Tag (t bits) | Index (k bits) | Offset (b bits)
b = log2(32) = 5
k = log2(4) = 2
t = 32 − (2 + 5) = 25
Address: 32'b 0000 0111 1010 0010 1101 1000 1101 1100
Offset: 5'b11100 = byte 28 within the block = word 7 within the block (numbering starts from 0)
Index: 2'b10 = row 2 of the cache (numbering starts from 0)
Tag: 25'b0000011110100010110110001
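A quick check of this arithmetic in C (a sketch; the address constant and the field widths b = 5 and k = 2 come from the example above, everything else is illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t addr = 0x07A2D8DCu;  /* 32'b 0000 0111 1010 0010 1101 1000 1101 1100 */
    const unsigned b = 5, k = 2;        /* offset and index widths from this example    */

    uint32_t offset = addr & ((1u << b) - 1);        /* low 5 bits:  0b11100 = 28 */
    uint32_t index  = (addr >> b) & ((1u << k) - 1); /* next 2 bits: 0b10    = 2  */
    uint32_t tag    = addr >> (b + k);               /* remaining 25 upper bits   */

    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}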
14
Direct Mapped Cache
§ Location determined by the address
• An address is split into a block address and a block offset
§ Direct mapped: only one choice
• Location = Index = (Block address) modulo (#Blocks in cache), as in the sketch below
• #Blocks is a power of 2, so the index is simply the low-order bits of the block address
• Block address: the upper bits of the address, excluding the block offset
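A minimal illustration of the mapping (hypothetical helper, arbitrary sizes):

/* With #Blocks = 8, block addresses 1, 9, 17, ... all map to index 1. */
unsigned dm_index(unsigned block_addr, unsigned num_blocks) {
    /* num_blocks is a power of 2, so the modulo is just the low log2(num_blocks) bits */
    return block_addr % num_blocks;
}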
15-19
[Figures: a worked sequence of memory accesses on the direct-mapped cache, showing how each block address maps to its cache index and fills or evicts the corresponding entry]
Default Policy: On a conflict, evict the old value and keep the new value
20
Cache Data Path
§ Address decoding
• Extract the tag, index, and offset fields from the address
• The tag stored at the indexed row is compared with the address tag; a hit requires a tag match and a set valid bit
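A minimal C sketch of this lookup path, assuming a 4-row direct-mapped cache with 32-byte blocks as in the earlier example (the struct and names are illustrative, not an actual design):

#include <stdint.h>
#include <stdbool.h>

#define K     4        /* number of rows (sets) */
#define B     32       /* block size in bytes   */
#define KBITS 2        /* log2(K)               */
#define BBITS 5        /* log2(B)               */

struct line { bool valid; uint32_t tag; uint8_t data[B]; };
static struct line cache[K];

/* Returns true on a hit and copies the requested byte into *out. */
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & (B - 1);
    uint32_t index  = (addr >> BBITS) & (K - 1);
    uint32_t tag    = addr >> (BBITS + KBITS);

    /* Hit requires a set valid bit and a matching tag. */
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data[offset];
        return true;
    }
    return false;      /* miss: the block must be fetched from the next level */
}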
21
Example: Total Cache Size
Total size = total valid-bit size + total tag size + total data size
           = 0.125 KiB + 2.25 KiB + 16 KiB = 18.375 KiB
Note: KiB = 1024 bytes; KB = 1000 bytes
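These figures are consistent with, for example, a direct-mapped cache of 1024 blocks, 16-byte (4-word) blocks, and 32-bit addresses; a small check of that assumed configuration:

#include <stdio.h>

int main(void) {
    /* Assumed configuration (consistent with the 18.375 KiB figure above):
       1024 blocks, 16-byte (4-word) blocks, 32-bit addresses, direct mapped. */
    const int blocks = 1024, block_bytes = 16, addr_bits = 32;
    const int k = 10, b = 4;                       /* log2(1024), log2(16) */
    const int tag_bits = addr_bits - k - b;        /* 18 bits per block    */

    double valid_kib = blocks * 1.0 / 8 / 1024;             /* 0.125 KiB */
    double tag_kib   = blocks * tag_bits / 8.0 / 1024;      /* 2.25 KiB  */
    double data_kib  = blocks * block_bytes / 1024.0;       /* 16 KiB    */

    printf("total = %.3f KiB\n", valid_kib + tag_kib + data_kib);  /* 18.375 */
    return 0;
}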
22
Trade-off of Large Block Size
§ Pros: Large blocks can reduce the miss rate
• Due to spatial locality
§ Cons 1: Cache pollution
• The total cache size is fixed => fewer rows in a cache with larger blocks
o More competition for each set => can increase the miss rate
• Accessing a small piece of data can evict a large amount of useful data (cache pollution)
§ Cons 2: Larger miss penalty
• A larger block takes longer to fetch from the next level of the hierarchy on a miss
23
Operations on Cache Misses
§ On a cache hit, the CPU proceeds normally
• The memory stage can operate within one cycle (or a few)
§ On a cache miss
• Step 1) Stall the CPU pipeline and wait for memory
• Step 2) Fetch the block from the next level of the memory hierarchy
• Step 3 – Case 1) Instruction cache miss (cache miss at the IF stage)
o After fetching the instruction into the instruction cache, restart the instruction fetch and resume execution
• Step 3 – Case 2) Data cache miss (cache miss at the MEM stage)
o After fetching the data into the data cache, complete the data access in the MEM stage and resume execution
24
25
Data Cache Read and Write Hits
§ Read Hit
• When the processor is executing a load instruction
• The data exists in the data cache
§ Write Hit
• When the processor is executing a store instruction
• The data exists in the data cache
• Update the existing data with the newer version
Potential issue on a write hit: inconsistent values across the cache and main memory
26
Write Policies on Cache Hit
§ Write-through
• Write the new value to both the cache and main memory
• Pros: No value inconsistency across the cache and main memory
• Cons: All writes involve costly main memory accesses
o Solution: Use a "write buffer." The processor resumes execution while the data in the write buffer is written to main memory
§ Write-back
• Only update the data cache
• Update the main memory value when a "dirty" cache line is evicted
o "Dirty": indicates that the cached block has been modified (new data written)
• Pros: Fast; only the cache is accessed
• Cons: Inconsistent values across the cache and memory
27
Write Policies on Cache Miss
§ Write-allocate
• Fetch the block into the cache first, then perform the write (follow the write-hit policy afterwards)
o Works well with the write-back policy
§ Write-no-allocate
• When a cache miss occurs on a write, write directly to main memory without fetching the block
o Write-back does not align well with this approach
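A minimal sketch combining the write-back (dirty bit) hit policy with write-allocate on a miss, for a single cache line with 1-word blocks; the helper names and the toy backing store are hypothetical:

#include <stdint.h>
#include <stdbool.h>

/* Toy backing store: 1-word blocks, addressed by block address. */
static uint32_t main_memory[1 << 16];
static uint32_t mem_read_block(uint32_t block_addr) { return main_memory[block_addr]; }
static void mem_write_block(uint32_t block_addr, uint32_t data) { main_memory[block_addr] = data; }

/* One cache line; for simplicity it stores the full block address instead of just a tag. */
struct line { bool valid, dirty; uint32_t block_addr; uint32_t data; };

/* Write one word using write-back + write-allocate. */
void cache_write(struct line *ln, uint32_t block_addr, uint32_t value) {
    if (!(ln->valid && ln->block_addr == block_addr)) {    /* write miss */
        if (ln->valid && ln->dirty)
            mem_write_block(ln->block_addr, ln->data);     /* write back the dirty victim */
        ln->data       = mem_read_block(block_addr);       /* write-allocate: fetch the block first */
        ln->block_addr = block_addr;
        ln->valid      = true;
    }
    ln->data  = value;    /* hit path: update the cache only                  */
    ln->dirty = true;     /* main memory is updated when this line is evicted */
}

A write-through cache would instead call mem_write_block on every store (usually through a write buffer), and a write-no-allocate cache would skip the fetch on a miss and write directly to memory.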
28
Cache Example: Intrinsity FastMATH
16KiB: 256 blocks × 16 words/block
29
Example: Intrinsity FastMATH
§ Embedded MIPS processor
• 12-stage pipeline
• Instruction and data access on each cycle
30
31
Problem of the Direct-Mapped Cache
§ Assumptions: 8 blocks (slots), 1 word/block, direct mapped, 7-bit address
§ Problem: blocks whose addresses map to the same index repeatedly evict each other (conflict misses), even when other cache entries are unused
§ Main Idea
• Employ multiple entries (ways) for each cache index (row)
• With the same total data size, reduce the number of sets (cache rows) and increase the number of ways
§ N-way Set-Associative Cache
• The number of ways = N
• The number of sets (rows) = 1/N of that of a direct-mapped cache with the same data size
35
2-Way Set Associative Cache
§ Assumptions: 4 sets (rows), 2 ways per set (8 blocks total), 1 word/block, 7-bit address
[Figure: each set (index 2'b00 to 2'b11) holds two entries, each with a valid bit, a tag, and one word of data]
Address = Tag (3 bits) | Index (2 bits) | Offset (2 bits)
Example: after accessing address 7'b0111000, set 2'b10 holds a valid entry with tag 3'b011 and data Mem[0111000]-Mem[0111011]
38
Set Associative Cache
§ Idea: Reduce cache misses by more flexible placement of blocks
39
Direct-Mapped vs Set-Associative Caches
40
Spectrum of Associativity
§ For a cache with 8 entries
41
Associativity Example
§ Compare 4-block caches
• Direct mapped, 2-way set associative, fully associative
• Block access sequence: 0, 8, 0, 6, 8
§ Direct mapped
Block address   Cache index   Hit/miss   Cache content after access
0               0             miss       [0]=Mem[0]
8               0             miss       [0]=Mem[8]
0               0             miss       [0]=Mem[0]
6               2             miss       [0]=Mem[0], [2]=Mem[6]
8               0             miss       [0]=Mem[8], [2]=Mem[6]
42
Associativity Example
§ 2-way set associative
Block address   Cache index   Hit/miss   Cache content after access (Set 0 has two ways; Set 1 stays empty)
0               0             miss       Set 0: Mem[0]
8               0             miss       Set 0: Mem[0], Mem[8]
0               0             hit        Set 0: Mem[0], Mem[8]
6               0             miss       Set 0: Mem[0], Mem[6]
8               0             miss       Set 0: Mem[8], Mem[6]
§ Fully-associative
Block address   Hit/miss   Cache content after access
0               miss       Mem[0]
8               miss       Mem[0], Mem[8]
0               hit        Mem[0], Mem[8]
6               miss       Mem[0], Mem[8], Mem[6]
8               hit        Mem[0], Mem[8], Mem[6]
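A small sketch that replays the direct-mapped case above (block sequence 0, 8, 0, 6, 8 on a 4-block cache) and reproduces the five misses; the code is illustrative, not from the slides:

#include <stdio.h>
#include <stdbool.h>

int main(void) {
    const int seq[] = {0, 8, 0, 6, 8};
    const int nblocks = 4;
    int  block[4];                      /* which block address each index currently holds */
    bool valid[4] = {false};

    for (int i = 0; i < 5; i++) {
        int idx = seq[i] % nblocks;     /* direct mapped: only one possible location */
        bool hit = valid[idx] && block[idx] == seq[i];
        printf("access %d -> index %d: %s\n", seq[i], idx, hit ? "hit" : "miss");
        valid[idx] = true;
        block[idx] = seq[i];            /* on a miss, the old value is evicted */
    }
    return 0;
}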
43
How Much Associativity Is Desired?
§ Increased associativity decreases miss rate
• But with diminishing returns
§ Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000
• 1-way: 10.3%
• 2-way: 8.6%
• 4-way: 8.3%
• 8-way: 8.1%
44
Set Associative Cache Datapath
45
Replacement Policy
§ Direct mapped: no choice
§ Set associative
• Prefer an invalid (empty) entry, if there is one
• Otherwise, choose among entries in the set
§ Least-recently used (LRU)
• Choose the one unused for the longest time
o Simple for 2-way, manageable for 4-way, too hard beyond that
§ Random
• Gives approximately the same performance as LRU for high
associativity
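A minimal sketch of a 2-way set-associative lookup with LRU replacement; with two ways, a single bit per set is enough to track the least-recently-used way (names, block size, and the stand-in memory function are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define SETS 4

struct way { bool valid; uint32_t tag; uint32_t data; };
struct set { struct way way[2]; int lru; };   /* lru = index of the least-recently-used way */
static struct set cache[SETS];

static uint32_t mem_read_block(uint32_t addr) { return addr; }   /* stand-in for the next level */

uint32_t cache_access(uint32_t addr) {
    uint32_t block = addr / 4;                /* 1-word (4-byte) blocks: drop the offset bits */
    uint32_t index = block % SETS;
    uint32_t tag   = block / SETS;
    struct set *s  = &cache[index];

    for (int w = 0; w < 2; w++)
        if (s->way[w].valid && s->way[w].tag == tag) {   /* hit */
            s->lru = 1 - w;                              /* the other way becomes LRU */
            return s->way[w].data;
        }

    /* Miss: prefer an invalid way; otherwise evict the LRU way. */
    int victim = !s->way[0].valid ? 0 : (!s->way[1].valid ? 1 : s->lru);
    s->way[victim] = (struct way){ .valid = true, .tag = tag, .data = mem_read_block(addr) };
    s->lru = 1 - victim;
    return s->way[victim].data;
}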
46
47
Main Memory Supporting Caches
§ Use DRAMs for main memory
• Fixed width (e.g., 1 word)
• Connected by fixed-width clocked bus
o Bus clock is typically slower than CPU clock
§ Example cache block read
• 1 bus cycle for address transfer
• 15 bus cycles per DRAM access
• 1 bus cycle per data transfer
§ For 4-word block, 1-word-wide DRAM
• Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
• Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
48
Measuring Cache Performance
§ Components of CPU time
• Program execution cycles
o Includes cache hit time
• Memory stall cycles
o Mainly from cache misses
§ With simplifying assumptions:
• Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
50
Average Memory Access Time
§ Hit time is also important for performance
§ Average memory access time (AMAT)
• AMAT = Hit time + Miss rate × Miss penalty
§ Example
• CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
• AMAT = 1 + 0.05 × 20 = 2ns
o 2 cycles per instruction
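The same formula as a small helper, reproducing the 2 ns result (values taken from the example; the helper itself is illustrative):

#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty (all times in cycles here) */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Values from the example: 1 ns clock, 1-cycle hit, 5% miss rate, 20-cycle miss penalty */
    printf("AMAT = %.1f cycles = %.1f ns\n", amat(1.0, 0.05, 20.0), amat(1.0, 0.05, 20.0) * 1.0);
    return 0;
}

For a multi-level cache (next slide), the L1 miss penalty can itself be expressed with the same formula applied to the L2 access time and L2 miss behavior.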
51
AMAT Example
§ The clock rate is 1 GHz, and the system has two levels of cache: L1 and L2
• L1: miss rate 3%, 1-cycle access time
• L2: miss rate 1%, 18-cycle access time, miss penalty = 200 cycles
52
Performance Summary
§ As CPU performance increases
• The miss penalty becomes more significant
§ Decreasing the base CPI
• A greater proportion of time is spent on memory stalls
§ Increasing the clock rate
• Memory stalls account for more CPU cycles
§ Cache behavior can't be neglected when evaluating system performance
53
54
DRAM Technology
§ Data stored as a charge in a capacitor
[Figure: DRAM cell array; a row address and a column address select a cell, each cell stores charge on a capacitor accessed through a transistor acting as a switch, and data moves in and out through the column input-output lines]
55
DRAM Technology
§ Data stored as a charge in a capacitor
• Single transistor used to access the charge
• Must periodically be refreshed
o Read contents and write back
o Performed on a DRAM “row”
56
Advanced DRAM Organization
§ Bits in a DRAM are organized as a rectangular array
• DRAM accesses an entire row
• Burst mode: supply successive words from a row with reduced latency
§ Double data rate (DDR) DRAM
• Transfer on rising and falling clock edges
§ Quad data rate (QDR) DRAM
• Separate DDR inputs and outputs
57
DRAM Generations
Year   Capacity       $/GB
1980   64 Kibibit     $6,480,000
1983   256 Kibibit    $1,980,000
1985   1 Mebibit      $720,000
1989   4 Mebibit      $128,000
1992   16 Mebibit     $30,000
1996   64 Mebibit     $9,000
1998   128 Mebibit    $900
2000   256 Mebibit    $840
2004   512 Mebibit    $150
2007   1 Gibibit      $40
2010   2 Gibibit      $13
2012   4 Gibibit      $5
2015   8 Gibibit      $7
2018   16 Gibibit     $6

tRAC: random access time, the time required to read any random single memory cell
tCAC: column (page) access time, the time required to get data from an already-open row
58
DRAM Performance Factors
§ Row Buffer
• A small buffer that temporarily holds the most recently accessed DRAM row
• The row buffer holds multiple words (an entire DRAM row)
§ Burst Access
• Allows reading consecutive data without sending an individual address for each word
• Improves bandwidth
§ DRAM Banking
• Deploy multiple DRAM chips (banks) and read/write them simultaneously
• Improves bandwidth
59
4 words x 1 bank vs 1 word x 4 banks
§ Assumptions
• 1 cycle to send the address to RAM
• 15 cycles of RAM access latency
• 1 cycle to return data from RAM
§ 4-word-wide memory (one bank)
• Cache miss penalty = 1 + 15 + 1 = 17 bus cycles
• Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
• Disadvantage: cost of wider buses
§ 4-bank interleaved memory (1 word wide)
• Cache miss penalty = 1 + 15 + 4×1 = 20 bus cycles
• Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
• Benefit: the access latencies of the four words overlap
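The bus-cycle arithmetic for the three organizations (the 1-word-wide case is from the earlier "Main Memory Supporting Caches" slide), as a small illustrative calculation:

#include <stdio.h>

int main(void) {
    const int addr = 1, access = 15, xfer = 1, words = 4, bytes = 16;

    int one_wide   = addr + words * access + words * xfer;  /* 1 + 4*15 + 4*1 = 65 */
    int four_wide  = addr + access + xfer;                  /* 1 + 15 + 1     = 17 */
    int four_banks = addr + access + words * xfer;          /* 1 + 15 + 4*1   = 20 */

    printf("1-word wide: %2d cycles, %.2f B/cycle\n", one_wide,   (double)bytes / one_wide);
    printf("4-word wide: %2d cycles, %.2f B/cycle\n", four_wide,  (double)bytes / four_wide);
    printf("4 banks:     %2d cycles, %.2f B/cycle\n", four_banks, (double)bytes / four_banks);
    return 0;
}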
60
Flash Storage
§ Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)
66
Flash Types
§ NOR flash: bit cell like a NOR gate
• Random read/write access
• Used for instruction memory in embedded systems
§ NAND flash: bit cell like a NAND gate
• Denser (bits/area), but block-at-a-time access
• Cheaper per GB
• Used for USB keys, media storage, …
§ Flash bits wear out after 1,000s to 100,000s of accesses
• Not suitable for direct RAM or disk replacement
• Wear leveling: remap data to less used blocks
67
Disk Storage
68