
CS 152 Computer Architecture and Engineering

CS252 Graduate Computer Architecture

Lecture 5 – Memory

Chris Fletcher
Electrical Engineering and Computer Sciences
University of California at Berkeley

https://cwfletcher.github.io/
http://inst.eecs.berkeley.edu/~cs152
Last time in Lecture 4
▪ Handling exceptions in pipelined machines by passing exceptions down the pipeline until instructions cross the commit point in order
▪ Can use values before commit through the bypass network
▪ Four different pipeline categories: *-stage pipelined, decoupled, out-of-order, superscalar
▪ Pipeline hazards can be avoided through software techniques: scheduling, loop unrolling
▪ Decoupled architectures use queues between "access" and "execute" pipelines to tolerate long memory latency
▪ Regularizing all functional units to have the same latency simplifies more complex pipeline design by avoiding structural hazards, and can be extended to in-order superscalar designs
Where are we
▪ ISA
▪ Microarchitecture
– Control & Datapath
• Fixed control & pipelined
▪ *-stage in-order pipelines
▪ Decoupled
▪ [Limited] Out-of-order
▪ [Limited] Out-of-order + superscalar
– Memory
• Today
Early Read-Only Memory Technologies
[Figures: punched cards, from the early 1700s through the Jacquard loom, Babbage, and then IBM; punched paper tape, the instruction stream in the Harvard Mk 1; diode matrix, EDSAC-2 µcode store; IBM Balanced Capacitor ROS; IBM Card Capacitor ROS]
Early Read/Write Main Memory Technologies
[Figures: Babbage, 1800s, digits stored on mechanical wheels; Williams tube, Manchester Mark 1, 1947; mercury delay line, Univac 1, 1951]
Also, regenerative capacitor memory on the Atanasoff-Berry computer, and rotating magnetic drum memory on the IBM 650
MIT Whirlwind Core Memory

Core Memory
▪ Core memory was the first large-scale reliable main memory
– Invented by Forrester in the late '40s/early '50s at MIT for the Whirlwind project
▪ Bits stored as magnetization polarity on small ferrite cores threaded onto a two-dimensional grid of wires
▪ Coincident current pulses on X and Y wires would write a cell and also sense its original state (destructive reads)
▪ Robust, non-volatile storage
▪ Used on space shuttle computers
▪ Cores threaded onto wires by hand (25 billion a year at peak production)
▪ Core access time ~1µs
[Photo: DEC PDP-8/E board, 4K words x 12 bits (1968)]
Semiconductor Memory
▪ Semiconductor memory began to be competitive in the early 1970s
– Intel formed to exploit the market for semiconductor memory
– Early semiconductor memory was Static RAM (SRAM); SRAM cell internals are similar to a latch (cross-coupled inverters)
▪ First commercial Dynamic RAM (DRAM) was the Intel 1103
– 1Kbit of storage on a single chip
– Charge on a capacitor used to hold the value
▪ Semiconductor memory quickly replaced core in the '70s
[Photo credit: Thomas Nguyen, CC-BY-SA]
▪ Good overview of the commercial memory technology war
▪ And other things (litho basics, geopolitics, supply chain…)
Memory organization
▪ Y -> X: Y is made up of Xs
▪ DRAM: DIMM -> Rank -> Chip -> Bank -> Array -> Cell
▪ SRAM: similar story
▪ Bank: the "atomic unit" that gets activated / accessed (a rough address-decoding sketch follows below)
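To make this decomposition concrete, here is a minimal C sketch of how a memory controller might slice a physical address into DRAM coordinates. The field widths, and the idea of slicing plain contiguous bits, are illustrative assumptions only; real controllers choose (and often hash) these mappings per platform.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical field widths, for illustration only. */
    #define COL_BITS  10   /* 1024 columns per row */
    #define BANK_BITS  3   /* 8 banks per chip     */
    #define ROW_BITS  15   /* 32768 rows per bank  */
    #define RANK_BITS  1   /* 2 ranks per DIMM     */

    typedef struct { unsigned col, bank, row, rank; } dram_coord;

    /* Slice a physical address into DRAM coordinates, low bits first. */
    static dram_coord decode(uint64_t paddr) {
        dram_coord c;
        c.col  = paddr & ((1u << COL_BITS)  - 1);  paddr >>= COL_BITS;
        c.bank = paddr & ((1u << BANK_BITS) - 1);  paddr >>= BANK_BITS;
        c.row  = paddr & ((1u << ROW_BITS)  - 1);  paddr >>= ROW_BITS;
        c.rank = paddr & ((1u << RANK_BITS) - 1);
        return c;
    }

    int main(void) {
        dram_coord c = decode(0x12345678);
        printf("rank %u, row %u, bank %u, col %u\n", c.rank, c.row, c.bank, c.col);
        return 0;
    }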
One-Transistor Dynamic RAM [Dennard, IBM]
[Figure: the 1-T DRAM cell; the word line gates an access transistor that connects the bit line to a storage capacitor, which can be built as a FET gate, trench, or stacked capacitor (e.g., TiN top electrode at VREF, Ta2O5 dielectric, poly-W bottom electrode)]
Modern DRAM Structure
[Micrograph: Samsung, sub-70nm DRAM, 2004]
DRAM Array Architecture
[Figure: a 2^N-row by 2^M-column array; N row-address bits drive a row decoder that enables one of 2^N word lines, and M column-address bits drive a column decoder that selects among the sense amplifiers on the bit lines; each memory cell stores one bit, and the selected bits appear on data line D]
▪ Bits stored in 2-dimensional arrays on chip
▪ Modern chips have around 4-8 logical banks on each chip
– Each logical bank physically implemented as many smaller arrays
DRAM Operation
▪ Three steps in read/write access to a given bank
▪ Row access (RAS)
– Decode row address, enable addressed row (often multiple Kb in row)
– Bitlines share charge with storage cell
– Small change in voltage detected by sense amplifiers, which latch whole row of bits
– Sense amplifiers drive bitlines full rail to recharge storage cells
▪ Column access (CAS)
– Decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
– On read, send latched bits out to chip pins
– On write, change sense amplifier latches, which then charge storage cells to required value
– Can perform multiple column accesses on same row without another row access (burst mode)
▪ Precharge
– Charges bit lines to known value, required before next row access
▪ Each step has a latency of around 15-20ns in modern DRAMs
▪ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share the same core architecture
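To make the cost of these three steps concrete, here is a minimal C sketch of an open-page latency model, assuming a flat 15ns per step (the low end of the slide's range); the policy and numbers are illustrative, not a specific controller's.

    #include <stdio.h>

    /* Assumed per-step latencies in ns (slide: ~15-20ns each). */
    #define T_RAS 15   /* row access    */
    #define T_CAS 15   /* column access */
    #define T_PRE 15   /* precharge     */

    static int open_row = -1;   /* row currently latched in the sense amps */

    /* Latency of one access to this bank under an open-page policy:
       row hit  -> column access only (burst mode);
       row miss -> precharge the old row, then row + column access. */
    static int access_ns(int row) {
        if (row == open_row)
            return T_CAS;
        int t = (open_row >= 0 ? T_PRE : 0) + T_RAS + T_CAS;
        open_row = row;
        return t;
    }

    int main(void) {
        printf("first access, row 7: %d ns\n", access_ns(7));  /* 30 ns */
        printf("same row (burst):    %d ns\n", access_ns(7));  /* 15 ns */
        printf("different row:       %d ns\n", access_ns(3));  /* 45 ns */
        return 0;
    }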
Double-Data Rate (DDR2) DRAM
[Figure: DDR2 timing, from a Micron 256Mb DDR2 SDRAM datasheet; a 200MHz clock with Row, Column, Precharge, then Row' commands; data transfers on both clock edges, for a 400Mb/s data rate per pin]
DRAM vs. SRAM: Cell level
[Figure: a DRAM cell (word line gating one access transistor onto a bit line, with charge stored on a capacitor) next to an SRAM cell (cross-coupled inverters accessed through the word line onto complementary bit line / ~bit line pairs)]
DRAM vs. SRAM: Cell level
DRAM:
▪ Optimized for density, then speed
▪ 1-transistor cells
▪ Time-multiplexed address pins
▪ Destructive reads
▪ Must refresh every few ms
SRAM:
▪ Optimized for speed, then density
▪ 4 to 6 transistors per cell
▪ Separate address pins
▪ Reads not destructive
▪ Static → no refresh
Relative Memory Cell Sizes
[Figure: comparing cell sizes of on-chip SRAM in a logic chip vs. DRAM in a memory chip. Foss, "Implementing Application-Specific Memory", ISSCC 1996]
Administrivia
▪ PS 2 out today, due 2/20
▪ Lab 2 out Thursday
▪ Nafea Bshara guest lecture 2/25
▪ Midterm 3/4

CPU-Memory Bottleneck
[Figure: CPU connected to Memory]
Performance of high-speed computers is usually limited by memory bandwidth & latency
▪ Latency (time for a single access)
– Memory access time >> processor cycle time
▪ Bandwidth (number of accesses per unit time)
– If fraction m of instructions access memory ⇒ 1+m memory references / instruction ⇒ CPI = 1 requires 1+m memory refs / cycle (assuming RISC-V ISA)
▪ Also, occupancy (time a memory bank is busy with one request)
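A worked instance of the bandwidth requirement, with an assumed m purely for illustration: if m = 0.3, each instruction generates 1 instruction fetch plus 0.3 data references, i.e., 1.3 memory references per instruction; sustaining CPI = 1 on a 3GHz core then requires the memory system to service 1.3 × 3 × 10^9 ≈ 3.9 billion references per second.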
Factors influencing modern memory system design
▪ Latency
– As a function of capacity/size
▪ Memory improvement vs. CPU improvement
▪ Bandwidth
– Modern packaging
▪ Memory access pattern characteristics
Processor-DRAM Gap (latency)
[Figure: relative performance (log scale) vs. time, 1980-2000; CPU performance ("µProc") grows ~60%/year while DRAM grows ~7%/year, opening a processor-memory performance gap that grows ~50%/year]
A four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during the time for one memory access (4 instructions/cycle × 3 cycles/ns × 100ns)!
Physical Size Affects Latency
[Figure: a CPU beside a small memory vs. a CPU beside a big memory; in the bigger memory:]
▪ Signals have further to travel
▪ Fan out to more locations
DRAM Packaging (Laptops/Desktops/Servers)
[Figure: signals at a DRAM chip: ~7 clock and control signals; ~12 address lines, multiplexed to carry the row then the column address; a 4b, 8b, 16b, or 32b data bus]
▪ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
▪ Data pins work together to return a wide word (e.g., a 64-bit data bus using 16 x4-bit parts)
DRAM Packaging, Apple M1
[Photo: two DRAM chips on the same package as the system SoC]
▪ 128b data bus, running at 4.2Gb/s per pin
▪ 68GB/s bandwidth (128 bits × 4.2Gb/s ÷ 8 bits/byte ≈ 67GB/s)
High-Bandwidth Memory in SX-Aurora
▪ 1.2TB/s HBM bandwidth: 6 channels × 1024 bits/channel × 1.6Gb/s/pin ÷ 8 bits/byte ≈ 1.23TB/s
Real Memory Reference Patterns
[Figure: memory address (one dot per access) vs. time. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
Typical Memory Reference Patterns
[Figure: address vs. time, stylized; instruction fetches cycle through n loop iterations; stack accesses grow and shrink around a subroutine call and return, including argument accesses; data accesses include scalar accesses]
Two predictable properties of memory references:
▪ Temporal locality: if a location is referenced, it is likely to be referenced again in the near future.
▪ Spatial locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.
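Both properties show up in even the simplest code. A small C example (ours, not from the lecture):

    #include <stdio.h>

    /* Temporal locality: s, i, n, and the loop's own instructions are
       re-referenced on every iteration.
       Spatial locality: a[0], a[1], ... are adjacent in memory, so each
       access lands next to the previous one (often in the same cache line). */
    static int sum(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", sum(a, 8));   /* prints 36 */
        return 0;
    }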
Memory Reference Patterns
[Figure: the same address-vs.-time plot (one dot per access), annotated with examples of temporal locality and spatial locality. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
Factors influencing modern memory system design
▪ Latency
– As a function of capacity/size
▪ Memory improvement vs. CPU improvement
▪ Bandwidth
– Modern packaging
▪ Memory access pattern characteristics

Punchline:
Memory hierarchy,
arranged in a small/fast → big/slow pyramid,
with policies to exploit data locality
Memory Hierarchy
[Figure: the CPU talks to a small, fast memory (RF, SRAM), which backs onto a big, slow memory (DRAM); the small memory holds frequently used data]
▪ Capacity: register << SRAM << DRAM
▪ Latency: register << SRAM << DRAM
▪ Bandwidth: on-chip >> off-chip
On a data access:
– if data ∈ fast memory ⇒ low latency access (SRAM)
– if data ∉ fast memory ⇒ high latency access (DRAM)
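A worked example with illustrative numbers (not from the slide): if 95% of accesses hit a 1ns SRAM and the remaining 5% go to a 100ns DRAM, the average access time is 0.95 × 1ns + 0.05 × 100ns ≈ 6ns, much closer to the fast memory than to the slow one, which is exactly what the hierarchy is for.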
Management of Memory Hierarchy
▪ Small/fast storage, e.g., registers
– Address usually specified in instruction
– Generally implemented directly as a register file
• But hardware might do things behind software's back, e.g., stack management, register renaming
▪ Larger/slower storage, e.g., main memory
– Address usually computed from values in register
– Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
• But software may provide "hints", e.g., don't cache or prefetch
Caches exploit both types of predictability:
▪ Exploit temporal locality by remembering the contents of recently accessed locations.
▪ Exploit spatial locality by fetching blocks of data around recently accessed locations.
Inside a Cache
[Figure: the processor exchanges addresses and data with the CACHE, which in turn exchanges addresses and data with Main Memory; the cache holds copies of main-memory locations (e.g., locations 100 and 101), with each cache line pairing an address tag with a data block of several bytes]
Cache Algorithm (Read)
Look at the processor address, search cache tags to find a match. Then either:
▪ Found in cache, a.k.a. HIT: return copy of data from cache
▪ Not in cache, a.k.a. MISS: read block of data from main memory, wait ..., return data to processor and update cache
Q: Which line do we replace?
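The same flow in code: a minimal C sketch of a read in a direct-mapped cache. The geometry and the fake main memory are illustrative assumptions, not the lecture's.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative geometry: 64 lines of 64 bytes (a 4KB direct-mapped cache). */
    #define NLINES 64
    #define BLOCK  64

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint8_t  data[BLOCK];
    } line_t;

    static line_t cache[NLINES];

    /* Hypothetical stand-in for slow main memory. */
    static uint8_t memory[1 << 20];
    static void mem_read_block(uint64_t block_addr, uint8_t *buf) {
        for (int i = 0; i < BLOCK; i++)
            buf[i] = memory[block_addr + i];
    }

    uint8_t cache_read(uint64_t addr) {
        uint64_t off   = addr % BLOCK;             /* block-offset bits  */
        uint64_t index = (addr / BLOCK) % NLINES;  /* index bits         */
        uint64_t tag   = addr / BLOCK / NLINES;    /* remaining tag bits */
        line_t *l = &cache[index];

        if (!(l->valid && l->tag == tag)) {        /* MISS: fetch block,  */
            mem_read_block(addr - off, l->data);   /* wait, update cache  */
            l->tag   = tag;
            l->valid = true;
        }
        return l->data[off];                       /* HIT (or just-filled) path */
    }

    int main(void) {
        memory[1234] = 42;
        printf("%d\n", cache_read(1234));  /* miss, then fill: prints 42 */
        printf("%d\n", cache_read(1235));  /* same block: hit, prints 0  */
        return 0;
    }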
Placement Policy
[Figure: memory blocks numbered 0-31 above a cache of 8 blocks, organized three ways]
Block 12 can be placed:
▪ Fully associative: anywhere
▪ (2-way) set associative: anywhere in set 0 (12 mod 4)
▪ Direct mapped: only into block 4 (12 mod 8)
Another view:
Cache = HW-optimized hash table
[Figure: a hash function maps a key to a bin; each bin holds a fixed number of slots, each storing a (key, value) pair]
▪ Hash function: take bits of the address ("index bits")
▪ Bin == set, slots == ways
▪ 1 bin → fully associative; 1 slot / bin → direct mapped
▪ M slots / N bins where M, N > 1 → set associative
▪ Fixed # slots per bin; slots read out in parallel
▪ Key/value pairs in bins stored separately (tag + data array)
Direct-Mapped Cache
[Figure: the address splits into a t-bit tag, a k-bit index, and a b-bit block offset; the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; a comparator matches the stored tag against the address's tag bits to produce HIT and select the data word or byte]
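To make the field widths concrete (illustrative numbers, not from the slide): with 32-bit addresses, 64-byte blocks (b = 6), and 2^k = 64 lines (k = 6), the tag is t = 32 - 6 - 6 = 20 bits, so each line stores a 20-bit tag and a valid bit alongside its 64-byte data block. This matches the geometry of the code sketch above.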
Direct Map Address Selection
higher-order vs. lower-order address bits
[Figure: the same 2^k-line direct-mapped cache, but with the address split as k index bits, then t tag bits, then the b-bit block offset, i.e., the index drawn from higher-order address bits]
2-Way Set-Associative Cache
[Figure: tag / index / block-offset address split; the index selects one set and reads out both ways in parallel (each with a valid bit, tag, and data block); two comparators check the stored tags against the address tag, HIT if either matches, and the matching way supplies the data word or byte]
Fully Associative Cache
[Figure: no index bits; the address splits into a tag and a block offset only, and every line's stored tag is compared against the address tag in parallel; HIT if any comparator matches, and the matching line's data block supplies the word or byte]
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
– LRU cache state must be updated on every access
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU binary tree often used for 4-8 way (see the sketch below)
• First-In, First-Out (FIFO) a.k.a. Round-Robin
– Used in highly associative caches
• Not-Most-Recently Used (NMRU)
– FIFO with exception for most-recently-used block or blocks

This is a second-order effect. Why?
Replacement only happens on misses
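As promised above, a minimal C sketch of the tree pseudo-LRU idea for one 4-way set; the bit layout is one common convention, assumed here for illustration. Three bits form a binary tree whose bits point toward the next victim; each access flips the bits on its path to point away from the way just used.

    #include <stdint.h>
    #include <stdio.h>

    /* Tree bits for one 4-way set: b0 = root (0 -> victim in the left
       pair, 1 -> right pair); b1 picks within ways {0,1}; b2 within {2,3}. */

    /* On an access, point the bits on the path AWAY from the used way. */
    static void plru_touch(uint8_t *tree, int way) {
        if (way < 2) {                     /* used left pair -> victim right */
            *tree |= 0x1;
            if (way == 0) *tree |= 0x2; else *tree &= ~0x2;
        } else {                           /* used right pair -> victim left */
            *tree &= ~0x1;
            if (way == 2) *tree |= 0x4; else *tree &= ~0x4;
        }
    }

    /* On a miss, follow the bits to the pseudo-least-recently-used way. */
    static int plru_victim(uint8_t tree) {
        if (tree & 0x1)
            return (tree & 0x4) ? 3 : 2;
        return (tree & 0x2) ? 1 : 0;
    }

    int main(void) {
        uint8_t tree = 0;
        for (int w = 0; w < 4; w++)
            plru_touch(&tree, w);
        printf("victim after touching 0,1,2,3: way %d\n", plru_victim(tree)); /* 0 */
        return 0;
    }

Only 3 state bits per set instead of the full LRU ordering, which is why this is the usual compromise at 4-8 ways.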
Block Size and Spatial Locality
Block is the unit of transfer between the cache and memory
[Figure: a 4-word block (b = 2): a tag alongside Word0-Word3; the CPU address splits into a (32-b)-bit block address and a b-bit offset]
▪ 2^b = block size, a.k.a. line size (in bytes)
▪ Larger block size has distinct hardware advantages:
– Less tag overhead
– Exploit fast burst transfers from DRAM
– Exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
Fewer blocks => more conflicts. Can waste bandwidth.
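A worked instance of the tag-overhead advantage (illustrative numbers): in a 4KB direct-mapped cache with 32-bit addresses, 16-byte blocks give 256 lines and a 20-bit tag per 128 bits of data (~16% overhead), while 64-byte blocks give 64 lines and the same 20-bit tag per 512 bits of data (~4% overhead).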
Acknowledgements
▪ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Krste Asanovic (UCB)
– Sophia Shao (UCB)
