Onur Mutlu, Computer Architecture, Fall 2018, Lecture 10b: Memory Latency (after-lecture slides)
Goals
• Cost
• Latency
• Bandwidth
• Parallelism
• Power
• Energy
• Reliability
• …
[Figure: DRAM cell array with row decoder; the sense amplifier is built from cross-coupled inverters with an enable signal]
Sense Amplifier – Two Stable States
[Figure: cross-coupled inverters with two stable states, (VDD, 0) and (0, VDD); the amplifier drives toward the state indicated by the initial imbalance, e.g. when VT > VB]
DRAM Cell – Capacitor

Capacitor to Sense Amplifier
[Figure: connecting the cell capacitor to the sense amplifier drives the bitlines to (0, VDD) or (VDD, 0) depending on the stored value]
DRAM Cell Operation
[Figure: bitline precharged to ½VDD; activating the cell perturbs it to ½VDD ± δ, which the sense amplifier then amplifies to a full VDD or 0]
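The perturbation δ in the figure above follows from charge sharing between the small cell capacitor and the much larger bitline capacitance. A minimal sketch of that arithmetic (the capacitance and voltage values are illustrative assumptions, not numbers from the slides):

```python
# Charge sharing between a DRAM cell capacitor and the bitline.
# After the access transistor opens, charge redistributes so both
# capacitors settle at the same voltage (total charge is conserved).
def bitline_voltage_after_sharing(v_cell, v_bl, c_cell, c_bl):
    # V_final = (C_cell*V_cell + C_bl*V_bl) / (C_cell + C_bl)
    return (c_cell * v_cell + c_bl * v_bl) / (c_cell + c_bl)

VDD = 1.2
C_CELL = 25e-15   # ~25 fF cell capacitor (illustrative)
C_BL = 200e-15    # bitline capacitance, much larger (illustrative)

# Reading a stored '1': the bitline, precharged to VDD/2, rises only slightly.
v1 = bitline_voltage_after_sharing(VDD, VDD / 2, C_CELL, C_BL)
delta = v1 - VDD / 2   # the small perturbation the sense amplifier must detect
```

Because C_BL dominates, δ is tens of millivolts, which is why a sensitive amplifier is needed at all.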
DRAM Subarray – Building Block for DRAM Chip
[Figure: a subarray: cell arrays attached to a row decoder and a row of sense amplifiers]
DRAM Bank
[Figure: a bank is built from multiple subarrays, each a cell array with its own row decoder and array of sense amplifiers (8Kb), all connected to the bank I/O over a shared internal bus]
Accessing a bank:
1. ACTIVATE a row (row address → row decoder)
2. READ/WRITE a column (column address → bank I/O → data)
3. PRECHARGE
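The ACTIVATE/READ-WRITE/PRECHARGE sequence can be sketched as a toy memory-controller model; the row-buffer bookkeeping below is an illustrative sketch, not a real controller:

```python
# Toy model of the command sequence a memory controller issues to a
# DRAM bank: ACTIVATE a row, READ/WRITE columns, PRECHARGE on a conflict.
class Bank:
    def __init__(self):
        self.open_row = None  # which row the sense amplifiers currently hold

    def access(self, row, col):
        """Return the list of DRAM commands needed to read (row, col)."""
        cmds = []
        if self.open_row is not None and self.open_row != row:
            cmds.append("PRECHARGE")            # close the currently open row
            self.open_row = None
        if self.open_row is None:
            cmds.append(f"ACTIVATE row {row}")  # latch the row into the sense amps
            self.open_row = row
        cmds.append(f"READ col {col}")          # column access from the row buffer
        return cmds

bank = Bank()
first = bank.access(3, 7)    # row miss: ACTIVATE + READ
second = bank.access(3, 9)   # row-buffer hit: READ only
third = bank.access(5, 0)    # row conflict: PRECHARGE + ACTIVATE + READ
```

The model makes the latency asymmetry visible: a row-buffer hit needs one command, a conflict needs three.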
Memory Latency: Fundamental Tradeoffs

Review: Memory Latency Lags Behind
[Chart: DRAM improvement (log scale), 1999–2017: capacity improved ~128x and bandwidth ~20x, but latency only ~1.3x]
Retrospective: Conventional Latency Tolerance Techniques

Two Major Sources of Latency Inefficiency

What Causes the Long Memory Latency?

Tiered Latency DRAM
What Causes the Long Latency?
[Figure: latency components of a DRAM access: subarray (cell array), I/O, and channel; the subarray is dominant]
Why is the Subarray So Slow?
[Figure: a subarray cell: capacitor plus access transistor on a bitline, driven by a wordline from the row decoder; each bitline ends in a large sense amplifier]
• Long bitline
  – Amortizes sense amplifier cost → small area
  – Large bitline capacitance → high latency & power
Trade-Off: Area (Die Size) vs. Latency
[Figure: long bitlines → smaller area; short bitlines → faster access]
[Chart: normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline; commodity DRAM uses long bitlines (512 cells/bitline) and is cheaper, while fancy (short-bitline) DRAM is faster but larger]
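The shape of the area/latency curve can be approximated with a toy model in which the per-bitline sense-amplifier area is amortized over the cells on the bitline, while latency grows with bitline capacitance. All constants below are illustrative assumptions, chosen only to reproduce the trend:

```python
# Toy model of the DRAM area-vs-latency trade-off as a function of
# cells per bitline. Shorter bitlines need more sense amplifiers per
# cell (more area) but have less bitline capacitance (less latency).
SENSE_AMP_COST = 12.0   # sense-amp area in cell-equivalents (illustrative)
LATENCY_PER_CELL = 0.1  # ns of bitline delay per attached cell (illustrative)
BASE_LATENCY = 10.0     # ns of fixed decoder/I/O delay (illustrative)

def normalized_area(cells_per_bitline):
    # one unit of cell area plus an amortized share of the sense amplifier
    return 1.0 + SENSE_AMP_COST / cells_per_bitline

def latency_ns(cells_per_bitline):
    return BASE_LATENCY + LATENCY_PER_CELL * cells_per_bitline

# Short bitlines (32 cells): larger area, lower latency.
# Long bitlines (512 cells): smaller area, higher latency.
```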
Approximating the Best of Both Worlds
• Long bitline: small area, high latency
• Short bitline: large area, low latency
• Our proposal: small area (using long bitlines) and low latency
Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
– Near segment: 32 cells
– Far segment: 480 cells
• Latency Evaluation
– SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
– DRAM area/power simulator from Rambus
– DDR3 energy calculator from Micron
Commodity DRAM vs. TL-DRAM [HPCA 2013]
• DRAM latency (tRC): near segment –56%, far segment +23% (52.5ns), relative to commodity DRAM (100%)
• DRAM power: near segment –51%, far segment +49%, relative to commodity DRAM (100%)
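The reported percentage changes can be turned into absolute numbers with simple arithmetic. The baseline tRC used below is an illustrative assumption taken from the 52.5ns figure on the slide:

```python
# Applying the reported TL-DRAM latency deltas to a baseline tRC,
# to see the absolute effect. The baseline value is illustrative.
def apply_delta(baseline, pct_change):
    return baseline * (1.0 + pct_change / 100.0)

baseline_trc_ns = 52.5                          # illustrative commodity-DRAM tRC
near_trc = apply_delta(baseline_trc_ns, -56)    # near segment: -56% latency
far_trc = apply_delta(baseline_trc_ns, +23)     # far segment: +23% latency
```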
Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by the hardware and/or software
• Many potential uses
  1. Use near segment as hardware-managed inclusive cache to far segment
  2. Use near segment as hardware-managed exclusive cache to far segment
  3. Profile-based page mapping by operating system
  4. Simply replace DRAM with TL-DRAM
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray with an isolation transistor: the far segment serves as main memory, the near segment (next to the sense amplifier) as its cache]
Inter-Segment Migration
• Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across the shared bitlines concurrently
[Figure: source row in the far segment, isolation transistor, destination row in the near segment, sense amplifier]
Inter-Segment Migration
• Transfer data from source to destination across the shared bitlines concurrently
  – Step 1: Activate source row
  – Step 2: Activate destination row to connect cell and bitline
• Migration is overlapped with source row access: only ~4ns additional over the row access latency
Near Segment as Hardware-Managed Cache
[Charts: normalized performance and normalized power for 1-, 2-, and 4-core systems (1, 2, and 4 channels)]
Using near segment as a cache improves performance and reduces power consumption
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Single-Core: Varying Near Segment Length
[Chart: IPC improvement vs. near segment length (1 to 256 cells); maximum improvement ~12–14%. A longer near segment gives larger cache capacity but higher cache access latency]
By adjusting the near segment length, we can trade off cache capacity for cache latency
More on TL-DRAM
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu,
"Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,"
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013.
[Slides (pptx)]
LISA: Low-Cost Inter-Linked Subarrays [HPCA 2016]
Problem: Inefficient Bulk Data Movement
• Bulk data movement is a key operation in many applications
  – memmove & memcpy: 5% of cycles in Google’s datacenter [Kanev+ ISCA’15]
[Figure: cores and LLC connect to memory over a 64-bit channel; copying src → dst moves every byte through the CPU. Inside the DRAM chip, the subarrays of a bank share a narrow internal data bus (64b)]
Low connectivity in DRAM is the fundamental bottleneck for bulk data movement
Goal: Provide a new substrate to enable wide connectivity between subarrays
Key Idea and Applications
• Low-cost Inter-linked subarrays (LISA)
  – Fast bulk data movement between subarrays
  – Wide datapath via isolation transistors: 0.8% DRAM chip area
[Figure: adjacent subarrays linked through isolation transistors]
• LISA is a versatile substrate → new applications
  1. Fast bulk data copy: copy latency 1.363ms → 0.148ms (9.2x) → 66% speedup, –55% DRAM energy
  2. In-DRAM caching: hot-data access latency 48.7ns → 21.5ns (2.2x) → 5% speedup
  3. Fast precharge: precharge latency 13.1ns → 5.0ns (2.6x) → 8% speedup
New DRAM Command to Use LISA
Row Buffer Movement (RBM): Move a row of data in an activated row buffer to a precharged one
[Figure: RBM SA1 → SA2: the activated sense amplifiers of subarray 1 (at Vdd − Δ) are linked to the precharged sense amplifiers of subarray 2 (at Vdd/2 + Δ); charge sharing moves the data, and subarray 2's amplifiers then amplify it]
RBM transfers an entire row between subarrays
RBM Analysis
• The range of RBM depends on the DRAM design
  – Multiple RBMs to move data across > 3 subarrays
• Validated with SPICE using worst-case cells
  – NCSU FreePDK 45nm library
• 4KB of data in 8ns (w/ 60% guardband)
  → 500 GB/s, 26x the bandwidth of a DDR4-2400 channel
• 0.8% DRAM chip area overhead [O+ ISCA’14]
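The 500 GB/s figure follows directly from moving 4KB in 8ns; a quick arithmetic check, using the standard 19.2 GB/s peak bandwidth of a 64-bit DDR4-2400 channel:

```python
# Sanity-checking the RBM bandwidth claim: one 4KB row moved in 8ns.
bytes_moved = 4 * 1024                     # one 4KB row
time_s = 8e-9                              # 8 ns per RBM
rbm_bw_gbs = bytes_moved / time_s / 1e9    # GB/s

# A DDR4-2400 channel: 2400 MT/s * 8 bytes per transfer = 19.2 GB/s.
ddr4_2400_bw_gbs = 2400e6 * 8 / 1e9
speedup = rbm_bw_gbs / ddr4_2400_bw_gbs    # ~26x, matching the slide
```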
1. Rapid Inter-Subarray Copying (RISC)
• Goal: Efficiently copy a row across subarrays
• Key idea: Use RBM to form a new command sequence
[Figure: step 1, activate the source row in subarray 1]
2. Variable Latency DRAM (VILLA)
• Goal: Reduce DRAM latency with low area overhead
• Motivation: Trade-off between area and latency
  – Long bitline (DDRx) vs. short bitline (RLDRAM)
3. Linked Precharge (LIP)
• Problem: The precharge time is limited by the strength of one precharge unit
• Linked Precharge (LIP): LISA precharges a subarray using multiple precharge units
[Figure: neighboring subarrays' precharge units assist through the isolation transistors]
Reduces precharge latency by 2.6x (43% guardband)
What Causes the Long DRAM Latency?
Tackling the Fixed Latency Mindset
• Reliable operation latency is actually very heterogeneous
  – Across temperatures, chips, parts of a chip, voltage levels, …
• Idea: Dynamically find out and use the lowest latency one can reliably access a memory location with
  – Adaptive-Latency DRAM [HPCA 2015]
  – Flexible-Latency DRAM [SIGMETRICS 2016]
  – Design-Induced Variation-Aware DRAM [SIGMETRICS 2017]
  – Voltron [SIGMETRICS 2017]
  – DRAM Latency PUF [HPCA 2018]
  – ...
[Figure: distribution of DRAM latency across cells, from low to high, with a tail of slow cells]
Why is Latency High?
• DRAM latency: delay as specified in DRAM standards
  – Doesn’t reflect true DRAM device latency
• Imperfect manufacturing process → latency variation
• High standard latency chosen to increase yield
[Figure: latency distributions of DRAM chips A, B, and C under manufacturing variation; the standard latency is set beyond the slowest]
What Causes the Long Memory Latency?
Conservative timing margins!
• Worst-case temperatures
  – 85°C vs. the common case, to enable a wide range of operating conditions
• Worst-case devices
  – DRAM cell with the smallest charge across any acceptable device, to tolerate process variation at acceptable yield
Three steps of charge movement:
1. Sensing
2. Restore
3. Precharge
DRAM Charge over Time
[Figure: cell charge over time for data 1 and data 0 as the sense amplifier operates; the standard sensing and restore timing parameters ("in theory") include a margin beyond the time the charge movement actually takes ("in practice")]
2. Temperature Dependence
– DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
DRAM Cells are Not Equal
• Ideal: same size, same charge, same latency
• Real: different size (smallest vs. largest cell), different charge, different latency
Large variation in cell size → large variation in charge → large variation in access latency
Process Variation
[Figure: a DRAM cell, with three sources of variation:]
❶ Cell capacitance
❷ Contact resistance
❸ Transistor performance
2. Temperature Dependence
– DRAM leaks more charge at higher temperature
– Leads to extra timing margin for cells that operate at low temperature, since the margin is set for the high temperature
Charge Leakage ∝ Temperature
[Figure: small leakage at room temperature, large leakage at high temperature (85°C)]
Cells store small charge at high temperature and large charge at low temperature
→ Large variation in access latency
DRAM Timing Parameters
• DRAM timing parameters are dictated by the worst case
  – The smallest cell with the smallest charge in all DRAM products
  – Operating at the highest temperature
Adaptive-Latency DRAM [HPCA 2015]
• Idea: Optimize DRAM timing for the common case
  – Current temperature
  – Current DRAM module
• A DRAM cell can store much more charge in the common case (low temperature, strong cell) than in the worst case
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.

Extra Charge → Reduced Latency
1. Sensing: sense cells with extra charge faster → lower sensing latency
2. Restore: no need to fully restore cells with extra charge → lower restoration latency
3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency
DRAM Characterization Infrastructure
[Figure: temperature controller, FPGAs, heater, and host PC]
• Flexible
• Easy to use (C++ API)
• Open-source: github.com/CMU-SAFARI/SoftMC

SoftMC: Open Source DRAM Infrastructure
https://github.com/CMU-SAFARI/SoftMC
Observation 1. Faster Sensing
[115 DIMM characterization: a typical DIMM at low temperature has more charge → strong charge flow → faster sensing; 17% ↓ timing (tRCD) with no errors]
• Key idea
  – Optimize DRAM timing parameters online
• Two components
  – DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  – System monitors DRAM temperature & uses appropriate DRAM timing parameters
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.
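The two components above can be sketched as a small controller-side model: the DIMM ships with per-temperature timing sets, and the system picks the most aggressive set that is reliable at the current temperature. All timing values below are illustrative assumptions, not measured numbers:

```python
# Sketch of AL-DRAM's mechanism: choose DRAM timing parameters based
# on the currently measured temperature. Values are illustrative.
# Timing sets keyed by the maximum temperature (deg C) at which each
# set was verified reliable for this DIMM.
TIMING_SETS = {
    55: {"tRCD": 10.0, "tRC": 40.0},    # aggressive: reliable up to 55C
    70: {"tRCD": 12.5, "tRC": 45.0},
    85: {"tRCD": 13.75, "tRC": 48.75},  # standard worst-case timings
}

def select_timings(current_temp_c):
    """Pick the most aggressive timing set still reliable at this temperature."""
    for max_temp in sorted(TIMING_SETS):
        if current_temp_c <= max_temp:
            return TIMING_SETS[max_temp]
    return TIMING_SETS[max(TIMING_SETS)]  # too hot: fall back to worst case

cool = select_timings(34)  # typical server (under 34C): aggressive timings
hot = select_timings(85)   # worst case: standard timings
```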
DRAM Temperature
• DRAM temperature measurement
  – Server cluster: operates at under 34°C
  – Desktop: operates at under 50°C
  – DRAM standard optimized for 85°C
• DRAM operates at low temperatures in the common case
• Previous works – DRAM temperature is low
  – El-Sayed+ SIGMETRICS 2012
  – Liu+ ISCA 2007
• Previous works – Maintain low DRAM temperature
  – David+ ICAC 2011
  – Liu+ ISCA 2007
  – Zhu+ ITHERM 2008
Latency Reduction Summary of 115 DIMMs
• Latency reduction for read & write (55°C)
– Read Latency: 32.7%
– Write Latency: 55.1%
• Workload
– 35 applications from SPEC, STREAM, Parsec,
Memcached, Apache, GUPS
AL-DRAM: Single-Core Evaluation
[Chart: per-workload performance improvement (gems, soplex, libq, s.cluster, gups, mcf, lbm, copy, milc, …); averages: 6.7% (intensive), 5.0% (all 35 workloads), 1.4% (non-intensive)]
AL-DRAM improves performance on a real system
AL-DRAM: Multi-Core Evaluation
[Chart: per-workload performance improvement; averages: 14.0% (intensive), 10.4% (all 35 workloads), 2.9% (non-intensive)]
AL-DRAM provides higher performance for multi-programmed & multi-threaded workloads
Reducing Latency Also Reduces Energy
AL-DRAM reduces DRAM power consumption by 5.8%
AL-DRAM: Advantages & Disadvantages
Advantages
+ Simple mechanism to reduce latency
+ Significant system performance and energy benefits
+ Benefits higher at low temperature
+ Low cost, low complexity
Disadvantages
- Need to determine reliable operating latencies for different temperatures and different DIMMs → higher testing cost
  (might not be that difficult for low temperatures)
More on AL-DRAM
Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu,
"Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,"
Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015.
[Slides (pptx) (pdf)] [Full data sets]
Different Types of Latency Variation
• AL-DRAM exploits latency variation
  – Across time (different temperatures)
  – Across chips
Variation in Activation Errors
[Box plots (min, quartiles, max) of activation error counts from 7500 rounds over 240 chips, at activation latencies below the 13.1ns standard: some DIMMs show no ACT errors, some very few, and some are rife with errors]
Modern DRAM chips exhibit significant variation in activation latency; characteristics differ across DIMMs
Spatial Locality of Activation Errors
[Figure: activation errors for one DIMM @ tRCD = 7.5ns cluster in particular regions of the chip]
• Key idea:
  1. Divide memory into regions of different latencies
  2. Memory controller: use lower latency for regions without slow cells; higher latency for other regions
[Charts: fraction of cells operable at reduced tRCD and tRP (7.5ns, 10ns, 13ns) for three real DIMMs (D1–D3) vs. the DDR3 baseline and an upper bound]
Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
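The key idea above amounts to a per-region timing lookup in the memory controller; a rough sketch (the region granularity and tRCD values are illustrative assumptions):

```python
# Sketch of FLY-DRAM's key idea: keep a profiled latency per memory
# region and issue each access with that region's timing.
REGION_BITS = 20  # 1MB regions (illustrative granularity)

# One-time profile: region index -> lowest reliable tRCD (ns).
# Regions containing slow cells keep the conservative standard timing.
profile = {0: 7.5, 1: 7.5, 2: 13.0, 3: 10.0}
STANDARD_TRCD = 13.0

def trcd_for_access(addr):
    region = addr >> REGION_BITS
    return profile.get(region, STANDARD_TRCD)  # unprofiled -> worst case

fast = trcd_for_access(0x0000_1000)  # region 0: no slow cells, fast timing
slow = trcd_for_access(0x0020_0000)  # region 2: contains slow cells
```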
Results
[Chart: performance normalized to the DDR3 baseline over 40 workloads: FLY-DRAM improves performance by 13.3% (D1), 17.6% (D2), and 19.5% (D3); the upper bound is 19.7%]
FLY-DRAM improves performance by exploiting spatial latency variation in DRAM
Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
FLY-DRAM: Advantages & Disadvantages
Advantages
+ Reduces latency significantly
+ Exploits significant within-chip latency variation
Disadvantages
- Need to determine reliable operating latencies for different parts of a chip → higher testing cost
- Slightly more complicated controller
Analysis of Latency Variation in DRAM Chips
Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh, Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, and Onur Mutlu,
"Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Antibes Juan-les-Pins, France, June 2016.
[Slides (pptx) (pdf)] [Source Code]
[Figure: per-cell fail probability at time t2 vs. time t1, with subarray borders and remapped rows marked (DRAM rows and columns at 1-cell granularity)]
This shows that we can rely on a static profile of weak bitlines to determine whether an access will cause failures
Solar-DRAM
Uses a static profile of weak subarray columns
• Identifies subarray columns as weak or strong
• Obtained in a one-time profiling step
Three Components
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)
Solar-DRAM: VLC (I)
[Figure: weak and strong bitlines within a subarray; each cache line maps onto a set of bitlines, so cache lines on strong bitlines can use reduced latency]
Solar-DRAM: RSC (II)
[Figure: cache line 0 and cache line 1 remapped across subarray columns]
Solar-DRAM: RLW (III)
All bitlines are strong when issuing writes
[Figure: writes to any cache line in the subarray]
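Putting the components together, the controller-side decision can be sketched roughly as follows (the profile format and timing values are illustrative assumptions, not Solar-DRAM's actual encoding):

```python
# Rough sketch of Solar-DRAM's controller decision: use a one-time
# profile of weak subarray columns to pick the access latency.
weak_columns = {(0, 3), (1, 7)}    # (subarray, column) pairs profiled as weak
FAST_TRCD, SLOW_TRCD = 7.5, 13.0   # ns (illustrative)

def access_trcd(subarray, column, is_write):
    if is_write:
        return FAST_TRCD           # RLW: all bitlines are strong for writes
    if (subarray, column) in weak_columns:
        return SLOW_TRCD           # VLC: weak columns keep standard latency
    return FAST_TRCD               # strong columns get reduced latency

r_strong = access_trcd(0, 0, is_write=False)
r_weak = access_trcd(0, 3, is_write=False)
w_weak = access_trcd(0, 3, is_write=True)
```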
Why Is There Spatial Latency Variation Within a Chip?

What Is Design-Induced Variation?
[Figure: across a column, speed varies with distance from the wordline driver; across a row, with distance from the sense amplifier. Cells near the wordline drivers and sense amplifiers are inherently fast; cells far from them are inherently slow]
Systematic variation in cell access times caused by the physical organization of DRAM
DIVA Online Profiling (Design-Induced-Variation-Aware)
[Figure: the inherently slow region lies farthest from the wordline driver and sense amplifier]
Profile only slow regions to determine the minimum latency
→ Dynamic & low-cost latency optimization
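The profiling shortcut above can be sketched as follows: instead of testing every cell, test only the region known by design to be slowest. The subarray geometry and region size below are illustrative assumptions:

```python
# Sketch of DIVA online profiling: test only the design-induced slow
# region (cells farthest from wordline drivers and sense amplifiers)
# instead of scanning the whole subarray.
ROWS, COLS = 512, 1024   # illustrative subarray geometry

def slow_region_cells():
    """Cells in the last rows/columns, i.e. farthest from drivers/amps."""
    slow_rows = range(ROWS - 8, ROWS)   # farthest from sense amplifiers
    slow_cols = range(COLS - 8, COLS)   # farthest from wordline drivers
    return [(r, c) for r in slow_rows for c in slow_cols]

cells_tested = len(slow_region_cells())
full_scan = ROWS * COLS
# Profiling cost drops from a full subarray scan to a small corner region.
```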
DIVA Online Profiling (Design-Induced-Variation-Aware)
[Figure: process variation causes slow cells at random locations (random errors); design-induced variation causes localized errors in the inherently slow regions]
[Chart: error rates at 55°C and 85°C for AL-DRAM, DIVA Profiling, and DIVA Profiling + Shuffling]
DIVA-DRAM: Advantages & Disadvantages
Advantages
++ Automatically finds the lowest reliable operating latency at system runtime (lower production-time testing cost)
+ Reduces latency more than prior methods (w/ ECC)
+ Reduces latency at high temperatures as well
Disadvantages
- Requires knowledge of inherently-slow regions
- Requires ECC (Error Correcting Codes)
- Imposes overhead during runtime profiling
Design-Induced Latency Variation in DRAM
Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu,
"Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms,"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Urbana-Champaign, IL, USA, June 2017.