Lecture 10: Memory System - Memory Technology: CSE 564 Computer Architecture Summer 2017
Topics for Memory Systems
• Memory Technology and Metrics
– SRAM, DRAM, Flash/SSD, 3-D Stacked Memory, Phase-change memory
– Latency and Bandwidth, Error Correction
– Memory wall
• Cache
– Cache basics
– Cache performance and optimization
– Advanced optimization
– Multiple-level cache, shared and private cache, prefetching
• Virtual Memory
– Protection, Virtualization, and Relocation
– Page/segment, protection
– Address Translation and TLB
Topics for Memory Systems
• Parallelism (to be discussed in TLP)
– Memory consistency model
• Instructions for fences, etc.
– Cache coherence
– NUMA and first touch
– Transactional memory (not covered)
Acknowledgement
• Based on slides prepared by Professor David A. Patterson for Computer Science 252, Fall 1996, and edited and presented by Prof. Kurt Keutzer for 2000, from UCB
• Some slides are adapted from the textbook slides for Computer Organization and Design, Fifth Edition: The Hardware/Software Interface
The Big Picture: Where are We Now?
[Figure: the five classic components of a computer - the processor (control + datapath), memory, input, and output]
• Memory system
– Supplies data on time for computation
– The term "memory" includes the circuits for storing data:
• Cache (SRAM)
• Scratchpad (SRAM)
• RAM (DRAM)
• etc.
Technology Trends
         Capacity        Speed (latency)
Logic:   2x in 3 years   2x in 3 years
DRAM:    4x in 3 years   2x in 10 years
Disk:    4x in 3 years   2x in 10 years

DRAM generations:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
(1980-1995: 1000:1 growth in capacity vs. only about 2:1 improvement in cycle time)
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency) → Memory Wall
[Figure: performance (log scale) vs. year, 1980-2000. Processor performance ("Moore's Law") grows ~60%/yr (2X/1.5 yr), while DRAM latency improves only ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.]
The Situation: Microprocessors
• Rely on caches to bridge the gap
• Microprocessor-DRAM performance gap
– time of a full cache miss, in instructions executed:
1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
– 1/2X latency x 3X clock rate x 3X instr/clock ⇒ ~5X
The Goal: illusion of large, fast, cheap memory
An Expanded View of the Memory System
[Figure: the processor (control + datapath) connected to multiple levels of memory - small and fast near the processor, large and slow farther away]
Why hierarchy works
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address - references cluster in a small region of the address space]
Typical Memory Reference Patterns
[Figure: address vs. time for a typical program. Instruction fetches advance sequentially and repeat over n loop iterations; stack accesses grow and shrink around subroutine call and return; data accesses mix argument accesses and scalar accesses.]
Locality
• Principle of Locality:
– Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
– Spatial locality: Items with nearby addresses tend to be referenced close together in time
– Temporal locality: Recently referenced items are likely to be referenced in the near future
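Not on the original slide - a minimal C sketch to make both kinds of locality concrete: a single loop re-references one variable (temporal) while walking an array in address order (spatial).

/* sum is re-referenced on every iteration: temporal locality.
 * a[0], a[1], ... are adjacent in memory and touched in order:
 * spatial locality (stride-1 access pattern). */
int sum_array(const int a[], int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}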
Locality Example
• Question: Can you permute the loops so that the function scans the 3-D array a[] with a stride-1 reference pattern (and thus has good spatial locality)?
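The function itself did not survive extraction; a common version of this exercise (an assumption on my part, with a hypothetical size N) sums a 3-D array with the loop indices permuted relative to the subscripts:

#define N 64                      /* hypothetical array dimension */

int sum_array_3d(int a[N][N][N]) {
    int sum = 0;
    /* The access is a[k][i][j], so the innermost loop (k) strides
     * through the slowest-varying dimension: poor spatial locality. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}

Answer sketch: reorder the loops as k, i, j so the innermost loop varies the last subscript. C stores arrays in row-major order, so a[k][i][j] and a[k][i][j+1] are adjacent, giving stride-1 accesses:

    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];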
Memory Hierarchy of a Computer System
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the
cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: memory hierarchy - registers and on-chip cache inside the processor, then 2nd/3rd-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape)]
Intel i7 (Nehalem)
• Private L1 and L2 caches per core
– L2 is 256 KB each, 10-cycle latency
• 8 MB shared L3, ~40-cycle latency
How is the hierarchy managed?
• Registers <-> Memory
– by the compiler (programmer?)
• Cache <-> Memory
– by the hardware
• Memory <-> Disks
– by the hardware and operating system (virtual memory)
– by the programmer (files)
• Virtual memory
– A virtual layer between the application address space and physical memory
– Not part of the physical memory hierarchy
Technology vs. Architecture
• Technology determines the raw speed
– Latency
– Bandwidth
– It is materials science
Memory Technology
• Static RAM (SRAM)
– 0.5 ns – 2.5 ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
– 50 ns – 70 ns, $20 – $75 per GB
Memory Technology
• Random Access:
– "Random" is good: access time is the same for all locations
– DRAM: Dynamic Random Access Memory
• High density, low power, cheap, slow
• Dynamic: needs to be refreshed regularly
– SRAM: Static Random Access Memory
• Low density, high power, expensive, fast
• Static: content lasts "forever" (until power is lost)
• Not-so-random Access Technology:
– Access time varies from location to location and from time to time
– Examples: disk, CD-ROM
• Sequential Access Technology: access time linear in location (e.g., tape)
Main Memory Background
• Performance of Main Memory:
– Latency: cache miss penalty
• Access time: time between the request and when the word arrives
• Cycle time: minimum time between requests
– Bandwidth: I/O & large-block miss penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
– Needs to be refreshed periodically (every 8 ms)
– Addresses divided into 2 halves (memory as a 2-D matrix):
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor)
– Size ratio DRAM/SRAM: 4-8x; cost & cycle-time ratio SRAM/DRAM: 8-16x
Random Access Memory (RAM) Technology
Static RAM Cell
[Figure: 6-transistor SRAM cell - cross-coupled inverters holding 0/1, with a word line (row select) gating two access transistors onto complementary bit and bit' lines]
• Write:
1. Drive the bit lines (bit = 1, bit' = 0)
2. Select the row
• Read:
1. Precharge bit and bit' to Vdd or Vdd/2 (make sure they are equal); the precharge devices can be replaced with pullups to save area
2. Select the row
3. The cell pulls one line low
4. A sense amp on the column detects the difference between bit and bit'
Problems with SRAM
• Six transistors use up a lot of area
• Consider a zero stored in the cell:
– Transistor N1 will try to pull "bit" to 0
– Transistor P2 will try to pull "bit bar" to 1
• But the bit lines are precharged high: are P1 and P2 really necessary?
[Figure: the cell with Select = 1, bit = 1, bit bar = 0; on one side P1 is off and N1 is on, on the other P2 is on and N2 is off, with both access transistors on]
1-Transistor Memory Cell (DRAM)
[Figure: one access transistor, gated by the row select line, connecting a storage capacitor to the bit line]
• Write:
1. Drive the bit line
2. Select the row
• Read:
1. Precharge the bit line to Vdd
2. Select the row
3. The cell and the bit line share charge
• Only a very small voltage change on the bit line
4. Sense (fancy sense amp)
• Can detect changes of ~1 million electrons
5. Write: restore the value (the read is destructive)
• Refresh
– Just do a dummy read to every cell
Classical DRAM Organization (square)
[Figure: a square RAM cell array in which each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell. A row decoder drives the word lines; sense amps & I/O plus a column decoder select among the bit lines to produce the data bit D.]
• The row and column addresses each cover the square root of the bits, sent in turn with RAS/CAS
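To make the row/column split concrete, here is a small C sketch (my illustration with hypothetical sizes, not from the slides): a 64 Kbit part arranged as a 256 x 256 square needs only 8 address pins, with the high half of the cell address strobed by RAS and the low half by CAS.

#include <stdio.h>

/* Hypothetical 64 Kbit DRAM arranged as a 256 x 256 square:
 * 8 bits select the row (sent with RAS), 8 select the column
 * (sent with CAS), so the chip needs only 8 address pins. */
enum { COL_BITS = 8 };

int main(void) {
    unsigned addr = 0xBEEF;                        /* 16-bit cell address       */
    unsigned row  = addr >> COL_BITS;              /* high half, strobed by RAS */
    unsigned col  = addr & ((1u << COL_BITS) - 1); /* low half, strobed by CAS  */
    printf("cell 0x%04X -> row %u, column %u\n", addr, row, col);
    return 0;
}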
Main Memory Performance
[Timing diagram: the cycle time runs from "Start Access for D1" to the earliest possible "Start Access for D2"; D1 becomes available partway through, after the access time.]
[Figure: access pattern with 4-way interleaving - the CPU issues to Memory Banks 0-3 in turn; Access Bank 0, Access Bank 1, Access Bank 2, and Access Bank 3 overlap, and by the time Bank 3 has started we can access Bank 0 again.]
Main Memory Performance
• Timing model
– 1 cycle to send the address
– 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
– Cache block is 4 words
• Simple M.P. = 4 x (1 + 10 + 1) = 48 (one word at a time)
• Wide M.P. = 1 + 10 + 1 = 12 (one block-wide access)
• Interleaved M.P. = 1 + 10 + 1 + 3 = 15 (overlapped bank accesses)
• Word-interleaved addresses across 4 banks:
Bank:    0    1    2    3
         0    1    2    3
         4    5    6    7
         8    9   10   11
        12   13   14   15
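A tiny C sketch (added here, using the slide's numbers) that reproduces the three miss penalties:

#include <stdio.h>

/* Slide's timing model: 1 cycle to send the address, 10-cycle DRAM
 * cycle time between successive accesses, 1 cycle to send data,
 * and a 4-word cache block. */
enum { ADDR = 1, CYCLE = 10, DATA = 1, WORDS = 4 };

int main(void) {
    int simple = WORDS * (ADDR + CYCLE + DATA);          /* 4 x 12 = 48 cycles       */
    int wide   = ADDR + CYCLE + DATA;                    /* one block-wide access: 12 */
    int interleaved = ADDR + CYCLE + DATA + (WORDS - 1); /* banks overlap; one word
                                                            arrives per cycle: 15    */
    printf("simple=%d  wide=%d  interleaved=%d\n", simple, wide, interleaved);
    return 0;
}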
Independent Memory Banks
• How many banks?
– number of banks ≥ number of clocks to access a word in a bank
– For sequential accesses; otherwise the CPU will return to the original bank before it has the next word ready (e.g., with a 10-clock bank access time, at least 10 banks are needed to deliver one word per clock)
• Increasing DRAM capacity => fewer chips => harder to have many banks
– Growth in bits/chip of DRAM: 50%-60%/yr
– Nathan Myhrvold, Microsoft: mature software growth (33%/yr for NT) ≈ growth in MB/$ of DRAM (25%-30%/yr)
DRAM History
• DRAMs: capacity +60%/yr, cost -30%/yr
– 2.5X cells/area, 1.5X die size in ~3 years
• A '97 DRAM fab line costs $1B to $2B
– DRAM only: density, leakage vs. speed
• Rely on an increasing number of computers & memory per computer (60% market)
– SIMM or DIMM is the replaceable unit => computers can use any generation of DRAM
• Commodity, second-source industry => high volume, low profit, conservative
– Little organizational innovation in 20 years: page mode, EDO, synchronous DRAM
• Order of importance: 1) cost/bit, 1a) capacity
– RAMBUS: 10X BW, +30% cost => little impact
Advanced DRAM Organization
• Bits in a DRAM are organized as a rectangular array
– DRAM accesses an entire row
– Burst mode: supply successive words from a row with reduced latency
• Double data rate (DDR) DRAM
– Transfer on rising and falling clock edges
• Quad data rate (QDR) DRAM
– Separate DDR inputs and outputs
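A back-of-the-envelope peak-bandwidth calculation (a sketch with hypothetical numbers, not from the slides): DDR doubles transfers per clock cycle, and QDR roughly doubles again by separating the DDR input and output paths.

#include <stdio.h>

int main(void) {
    double clock_hz    = 400e6;              /* hypothetical DRAM bus clock */
    double width_bytes = 8;                  /* 64-bit DIMM data path       */
    double sdr = clock_hz * width_bytes;     /* one transfer per cycle      */
    printf("SDR: %.1f GB/s\n", sdr / 1e9);
    printf("DDR: %.1f GB/s (both clock edges)\n", 2 * sdr / 1e9);
    printf("QDR: %.1f GB/s (separate DDR in + out)\n", 4 * sdr / 1e9);
    return 0;
}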
3-D Stack Memory
• High Bandwidth Memory
• Hybrid Memory Cube
Flash Storage
• Nonvolatile semiconductor storage
– 100× – 1000× faster than disk
– Smaller, lower power, more robust
– But more $/GB (between disk and DRAM)
Flash Types
• NOR flash: bit cell like a NOR gate
– Random read/write access
– Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
– Denser (bits/area), but block-at-a-time access
– Cheaper per GB
– Used for USB keys, media storage, ...
• Flash bits wear out after 1000's of accesses
– Not suitable for direct RAM or disk replacement
– Wear leveling: remap data to less-used blocks (see the sketch below)
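To illustrate only the remapping idea (a toy sketch of my own, nothing like a real flash translation layer), writes to a logical block can be steered to the physical block with the fewest erases:

enum { BLOCKS = 8 };
static int map_l2p[BLOCKS];      /* logical -> physical block map */
static int erase_count[BLOCKS];  /* wear per physical block       */

/* Pick the least-worn physical block. A real FTL would also track
 * which blocks are free and migrate live data before reusing one. */
static int least_worn(void) {
    int best = 0;
    for (int p = 1; p < BLOCKS; p++)
        if (erase_count[p] < erase_count[best])
            best = p;
    return best;
}

void write_block(int logical) {
    int p = least_worn();
    map_l2p[logical] = p;        /* remap the logical block       */
    erase_count[p]++;            /* erase-before-write adds wear  */
}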
Disk Storage
• Nonvolatile, rotating magnetic storage
Disk Sectors and Access
• Each sector records
– Sector ID
– Data (512 bytes, 4096 bytes proposed)
– Error correcting code (ECC)
• Used to hide defects and recording errors
– Synchronization fields and gaps
• Access to a sector involves
– Queuing delay if other accesses are pending
– Seek: move the heads
– Rotational latency
– Data transfer
– Controller overhead
Disk Access Example
• Given
– 512 B sector, 15,000 rpm, 4 ms average seek time, 100 MB/s transfer rate, 0.2 ms controller overhead, idle disk
• Average read time
– 4 ms seek time
+ 1/2 rotation / (15,000/60 rotations per second) = 2 ms rotational latency
+ 512 B / 100 MB/s = 0.005 ms transfer time
+ 0.2 ms controller delay
= 6.2 ms
• If the actual average seek time is 1 ms
– Average read time = 3.2 ms
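The same arithmetic as a runnable C sketch (using the slide's parameters):

#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                       /* average seek time            */
    double rpm           = 15000.0;
    double rot_ms        = 0.5 / (rpm / 60.0) * 1000; /* half a rotation: 2 ms        */
    double transfer_ms   = 512.0 / 100e6 * 1000;      /* 512 B at 100 MB/s: ~0.005 ms */
    double controller_ms = 0.2;
    printf("average read = %.3f ms\n",
           seek_ms + rot_ms + transfer_ms + controller_ms);   /* ~6.2 ms */
    return 0;
}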
Summary
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
• DRAM is slow but cheap and dense:
– A good choice for presenting the user with a BIG memory system
• SRAM is fast but expensive and not very dense:
– A good choice for providing the user with FAST access time