L05 Memory
Lecture 5 – Memory
Chris Fletcher
Electrical Engineering and Computer Sciences
University of California at Berkeley
https://cwfletcher.github.io/
http://inst.eecs.berkeley.edu/~cs152
Last time in Lecture 4
▪ Handling exceptions in pipelined machines by passing
exceptions down pipeline until instructions cross commit
point in order
▪ Can use values before commit through bypass network
▪ Four different pipeline categories: *-stage pipelined,
decoupled, out-of-order, superscalar
▪ Pipeline hazards can be avoided through software
techniques: scheduling, loop unrolling
▪ Decoupled architectures use queues between “access”
and “execute” pipelines to tolerate long memory latency
▪ Regularizing all functional units to have same latency
simplifies more complex pipeline design by avoiding
structural hazards, can be expanded to in-order
superscalar designs
Where are we
▪ ISA
▪ Microarchitecture
– Control & Datapath
• Fixed control & Pipelined
▪ *-Stage in-order pipelines
▪ Decoupled
▪ [Limited] Out-of-order
▪ [Limited] Out-of-order + superscalar
– Memory
• Today
Early Read-Only Memory Technologies
[Figures: IBM Balanced Capacitor ROS; IBM Card Capacitor ROS]
Early Read/Write Main Memory Technologies
[Figures: Babbage, 1800s: digits stored on mechanical wheels; Williams Tube, Manchester Mark 1, 1947]
Core Memory
▪ Core memory was the first large-scale reliable main memory
– invented by Forrester in the late 40s/early 50s at MIT for the Whirlwind project
▪ Bits stored as magnetization polarity on small ferrite cores threaded onto a two-dimensional grid of wires
▪ Coincident current pulses on X and Y wires would write a cell and also sense its original state (destructive reads)
▪ Robust, non-volatile storage
▪ Used on Space Shuttle computers
▪ Cores threaded onto wires by hand (25 billion a year at peak production)
▪ Core access time ~1µs
[Figure: DEC PDP-8/E board, 4K words x 12 bits (1968)]
Semiconductor Memory
▪ Semiconductor memory began to be competitive in the early 1970s
– Intel was formed to exploit the market for semiconductor memory
– Early semiconductor memory was Static RAM (SRAM). SRAM cell internals are similar to a latch (cross-coupled inverters).
▪ Good overview of the commercial memory technology
war
▪ And other things (litho basics, geopolitics, supply chain…)
Memory organization
▪ Notation Y -> X: a Y is made up of Xs
▪ DRAM: DIMM -> Rank -> Chip -> Bank -> Array -> Cell
▪ SRAM: similar story
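The hierarchy above implies that a physical address decomposes into per-level coordinates. A minimal sketch, assuming hypothetical field widths and ordering (real controllers interleave these bits to maximize rank/bank parallelism; none of the widths below are from the slides):

```python
# Hypothetical DRAM address map: field order and widths are illustrative only.
# Real controllers interleave these bits to spread traffic across banks/ranks.
FIELDS = [("column", 10), ("bank", 2), ("chip", 3), ("rank", 1), ("row", 14)]

def split_address(paddr):
    """Peel fields off the low end of a physical address, lowest field first."""
    coords = {}
    for name, bits in FIELDS:
        coords[name] = paddr & ((1 << bits) - 1)  # extract this field
        paddr >>= bits                            # shift to the next field
    return coords
```

For example, `split_address(1 << 10)` lands in column 0 of bank 1, since the 10 column bits sit below the 2 bank bits in this assumed layout.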
One-Transistor Dynamic RAM [Dennard, IBM]
[Figure: 1-T DRAM cell. The word line gates an access transistor connecting the bit line to a storage capacitor (implemented as a FET gate, trench, or stack): TiN top electrode at VREF, Ta2O5 dielectric, poly-W bottom electrode]
Modern DRAM Structure
[Figure: an N-bit row address is decoded to select one of 2^N rows; an M-bit column address drives the column decoder and sense amplifiers; each memory cell stores one bit; N+M total address bits, D data bits out]
Double-Data Rate (DDR2) DRAM
[Figure: 200MHz clock with data transferred on both edges, giving a 400Mb/s per-pin data rate. Micron, 256Mb DDR2 SDRAM datasheet]
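The "double data rate" in the name is exactly the factor of two between the clock and the per-pin rate; a one-line check using the slide's figures:

```python
clock_mhz = 200          # DDR2 bus clock from the slide
transfers_per_cycle = 2  # double data rate: data moves on both clock edges
per_pin_mbps = clock_mhz * transfers_per_cycle
print(per_pin_mbps)  # 400, matching the 400Mb/s per-pin rate
```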
DRAM vs. SRAM: Cell level
[Figure: DRAM cell (word line, single bit line) vs. SRAM cell (word line, complementary bit line and ~bit line)]
DRAM vs. SRAM: Cell level
▪ DRAM: optimized for density first, then speed
▪ SRAM: optimized for speed first, then density
Relative Memory Cell Sizes
[Figure: on-chip SRAM in a logic chip vs. DRAM in a memory chip. Foss, “Implementing Application-Specific Memory”, ISSCC 1996]
Administrivia
▪ PS 2 out today, due 2/20
▪ Lab 2 out Thursday
▪ Nafea Bshara guest lecture 2/25
▪ Midterm 3/4
CPU-Memory Bottleneck
[Figure: CPU connected to Memory]
Factors influencing modern memory
system design
▪ Latency
– As a function of capacity/size
▪ Memory improvement vs. CPU improvement
▪ Bandwidth
– Modern packaging
▪ Memory access pattern characteristics
Processor-DRAM Gap (latency)
[Figure: log-scale performance vs. time, 1980–2000. CPU (µProc) performance improves 60%/year; DRAM improves 7%/year; the processor-memory performance gap grows ~50%/year]
A four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during the time for one memory access!
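The 1,200-instruction figure is simply issue width times clock rate times memory latency:

```python
clock_hz = 3e9           # 3GHz core
issue_width = 4          # four-issue superscalar
dram_latency_s = 100e-9  # 100ns DRAM access
instructions = issue_width * clock_hz * dram_latency_s
print(int(instructions))  # 1200
```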
Physical Size Affects Latency
[Figure: a CPU next to a small memory vs. a CPU next to a big memory; the bigger array spans more physical distance, so accesses take longer]
DRAM Packaging
(Laptops/Desktops/Servers)
[Figure: signals into a DRAM chip: ~7 clock and control signals; ~12 multiplexed row/column address lines; a data bus (4b, 8b, 16b, or 32b)]
DRAM Packaging, Apple M1
• 128b databus, running at 4.2Gb/s
• 68GB/s bandwidth
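Assuming the 4.2Gb/s figure is per pin, the quoted bandwidth is just bus width times per-pin rate:

```python
bus_width_bits = 128  # M1 databus width from the slide
per_pin_gbps = 4.2    # per-pin transfer rate (assumed per-pin; slide says 4.2Gb/s)
total_gbytes_per_s = bus_width_bits * per_pin_gbps / 8  # bits -> bytes
# ~67.2 GB/s, quoted on the slide (rounded) as 68GB/s
```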
High-Bandwidth Memory in SX-Aurora
Typical Memory Reference Patterns
[Figure: memory address vs. time: instruction fetches (sequential, with jumps at subroutine call/return), stack accesses (argument access), and data accesses (scalar accesses)]
Two predictable properties of memory references:
▪ Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
▪ Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.
Memory Reference Patterns
[Figure: memory address (one dot per access) vs. time, annotated with regions of temporal and spatial locality. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
Factors influencing modern memory
system design
▪ Latency
– As a function of capacity/size
▪ Memory improvement vs. CPU improvement
▪ Bandwidth
– Modern packaging
▪ Memory access pattern characteristics
Punchline:
Memory hierarchy,
Arranged in small/fast → big/slow pyramid
Policies to exploit data locality
Memory Hierarchy
[Figure: CPU reads and writes a small, fast memory (RF, SRAM) backed by a big, slow memory (DRAM); data items A and B are copied between the two levels]
Management of Memory Hierarchy
▪ Small/fast storage, e.g., registers
– Address usually specified in instruction
– Generally implemented directly as a register file
• but hardware might do things behind software’s back, e.g., stack management, register renaming
▪ Large/slower storage, e.g., main memory
– Address usually computed from values in registers
– Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
Inside a Cache
[Figure: Processor exchanges addresses and data with the CACHE, which exchanges addresses and data with Main Memory; each cache entry pairs an address tag with a data block]
Cache Algorithm (Read)
Look at Processor Address, search cache tags to find match. Then either:
▪ HIT: return a copy of the data from the cache
▪ MISS: read the block from Main Memory, install it in the cache (evicting a victim line if needed), and return the data
Another view:
Cache = HW-Optimized Hash table
[Figure: the address splits into tag (t bits), index (k bits), and block offset (b bits); the index bits act as the hash, selecting one of 2^k lines (bins); each line (slot) holds a valid bit, tag (key), and data block (value); a hit requires the stored tag to equal the address tag]
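The analogy can be made concrete in a few lines. A minimal direct-mapped sketch; the sizes (k=4 index bits, b=6 offset bits) are illustrative choices, not from the slide:

```python
K_BITS, B_BITS = 4, 6  # illustrative: 2^4 = 16 lines of 2^6 = 64-byte blocks

class Line:
    """One cache line: valid bit, tag (the key), data block (the value)."""
    def __init__(self):
        self.valid, self.tag, self.block = False, 0, None

lines = [Line() for _ in range(1 << K_BITS)]  # the 2^k "bins"

def lookup(addr):
    index = (addr >> B_BITS) & ((1 << K_BITS) - 1)  # index bits = the hash
    tag = addr >> (B_BITS + K_BITS)                 # tag bits = the key
    line = lines[index]
    return line.valid and line.tag == tag           # hit iff stored key matches

def fill(addr, block):
    line = lines[(addr >> B_BITS) & ((1 << K_BITS) - 1)]
    line.valid, line.tag, line.block = True, addr >> (B_BITS + K_BITS), block
```

After `fill(0x1234, data)`, any address in the same 64-byte block hits, while an address with the same index but a different tag misses (a conflict): unlike a software hash table, a direct-mapped cache has exactly one slot per bin.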
Direct Map Address Selection
[Figure: direct-mapped cache indexed by either higher-order or lower-order address bits; k index bits select one of 2^k lines and a t-bit tag comparison determines the hit]
2-Way Set-Associative Cache
[Figure: address split into Tag (t), Index (k), and Block Offset (b); the index selects one set containing two ways, each with {V, Tag, Data Block}; both stored tags are compared against the address tag in parallel, a match selects the data word or byte, and HIT is raised]
Fully Associative Cache
[Figure: no index bits; every line's stored tag (t bits) is compared against the address tag in parallel; any match raises HIT, and the offset (b bits) selects the word or byte from the matching data block]
Replacement Policy
In an associative cache, which block from a set should
be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
• LRU cache state must be updated on every access
• true implementation only feasible for small sets (2-way)
• pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
• used in highly associative caches
• Not-Most-Recently Used (NMRU)
• FIFO with exception for most-recently used block or blocks
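The pseudo-LRU binary tree mentioned above can be sketched for a single 4-way set: three bits stand in for the full LRU ordering, each pointing toward the subtree to evict from next. Class and method names here are illustrative:

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set: 3 bits instead of a full LRU order."""
    def __init__(self):
        self.root = 0   # 0: evict from ways {0,1}, 1: evict from ways {2,3}
        self.left = 0   # within {0,1}: which way to evict next
        self.right = 0  # within {2,3}: 0 -> evict way 2, 1 -> evict way 3

    def touch(self, way):
        """On an access, flip the bits on the path to point away from `way`."""
        if way < 2:
            self.root = 1        # next victim comes from the other pair
            self.left = 1 - way  # ...and the other way within this pair
        else:
            self.root = 0
            self.right = 3 - way

    def victim(self):
        """Follow the bits to the pseudo-least-recently-used way."""
        return self.left if self.root == 0 else 2 + self.right
```

Accessing ways 0, 2, 1, 3 in order leaves way 0 as the victim, matching true LRU in this case; the approximation only diverges on some longer patterns, which is the price of 3 bits versus the state a full 4-way LRU ordering needs.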
Block Size and Spatial Locality
A block is the unit of transfer between the cache and memory
Acknowledgements
▪ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Krste Asanovic (UCB)
– Sophia Shao (UCB)