Onur Mutlu, Computer Architecture, Fall 2018, Lecture 10b: Memory Latency (after-lecture slides)
Goals
• Cost
• Latency
• Bandwidth
• Parallelism
• Power
• Energy
• Reliability
• …
[Figure: DRAM cell array with row decoder; the sense amplifier is built from cross-coupled inverters with an enable signal]
Sense Amplifier – Two Stable States
[Figure: cross-coupled inverters with two stable states, (VDD, 0) and (0, VDD); the amplifier drives toward the state indicated by the initial imbalance, e.g. when VT > VB]
DRAM Cell – Capacitor

Capacitor to Sense Amplifier
[Figure: connecting the cell capacitor to the sense amplifier drives the bitlines to (0, VDD) or (VDD, 0) depending on the stored value]
DRAM Cell Operation
[Figure: bitline precharged to ½VDD; activating the cell perturbs it to ½VDD ± δ, which the sense amplifier then amplifies to a full VDD or 0]
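The perturbation δ in the figure above follows from charge sharing between the small cell capacitor and the much larger bitline capacitance. A minimal sketch of that arithmetic (the capacitance and voltage values are illustrative assumptions, not numbers from the slides):

```python
# Charge sharing between a DRAM cell capacitor and the bitline.
# After the access transistor opens, charge redistributes so both
# capacitors settle at the same voltage (total charge is conserved).
def bitline_voltage_after_sharing(v_cell, v_bl, c_cell, c_bl):
    # V_final = (C_cell*V_cell + C_bl*V_bl) / (C_cell + C_bl)
    return (c_cell * v_cell + c_bl * v_bl) / (c_cell + c_bl)

VDD = 1.2
C_CELL = 25e-15   # ~25 fF cell capacitor (illustrative)
C_BL = 200e-15    # bitline capacitance, much larger (illustrative)

# Reading a stored '1': the bitline, precharged to VDD/2, rises only slightly.
v1 = bitline_voltage_after_sharing(VDD, VDD / 2, C_CELL, C_BL)
delta = v1 - VDD / 2   # the small perturbation the sense amplifier must detect
```

Because C_BL dominates, δ is tens of millivolts, which is why a sensitive amplifier is needed at all.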
DRAM Subarray – Building Block for DRAM Chip
[Figure: a subarray: cell arrays attached to a row decoder and a row of sense amplifiers]
DRAM Bank
[Figure: a bank is built from multiple subarrays, each a cell array with its own row decoder and array of sense amplifiers (8Kb), all connected to the bank I/O over a shared internal bus]
Accessing a bank:
1. ACTIVATE a row (row address → row decoder)
2. READ/WRITE a column (column address → bank I/O → data)
3. PRECHARGE
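The ACTIVATE/READ-WRITE/PRECHARGE sequence can be sketched as a toy memory-controller model; the row-buffer bookkeeping below is an illustrative sketch, not a real controller:

```python
# Toy model of the command sequence a memory controller issues to a
# DRAM bank: ACTIVATE a row, READ/WRITE columns, PRECHARGE on a conflict.
class Bank:
    def __init__(self):
        self.open_row = None  # which row the sense amplifiers currently hold

    def access(self, row, col):
        """Return the list of DRAM commands needed to read (row, col)."""
        cmds = []
        if self.open_row is not None and self.open_row != row:
            cmds.append("PRECHARGE")            # close the currently open row
            self.open_row = None
        if self.open_row is None:
            cmds.append(f"ACTIVATE row {row}")  # latch the row into the sense amps
            self.open_row = row
        cmds.append(f"READ col {col}")          # column access from the row buffer
        return cmds

bank = Bank()
first = bank.access(3, 7)    # row miss: ACTIVATE + READ
second = bank.access(3, 9)   # row-buffer hit: READ only
third = bank.access(5, 0)    # row conflict: PRECHARGE + ACTIVATE + READ
```

The model makes the latency asymmetry visible: a row-buffer hit needs one command, a conflict needs three.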
Memory Latency: Fundamental Tradeoffs

Review: Memory Latency Lags Behind
[Chart: DRAM improvement (log scale), 1999–2017: capacity improved ~128x and bandwidth ~20x, but latency only ~1.3x]
Retrospective: Conventional Latency Tolerance Techniques

Two Major Sources of Latency Inefficiency

What Causes the Long Memory Latency?

Tiered Latency DRAM
What Causes the Long Latency?
[Figure: latency components of a DRAM access: subarray (cell array), I/O, and channel; the subarray is dominant]
Why is the Subarray So Slow?
[Figure: a subarray cell: capacitor plus access transistor on a bitline, driven by a wordline from the row decoder; each bitline ends in a large sense amplifier]
• Long bitline
  – Amortizes sense amplifier cost → small area
  – Large bitline capacitance → high latency & power
Trade-Off: Area (Die Size) vs. Latency
[Figure: long bitlines → smaller area; short bitlines → faster access]
[Chart: normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline; commodity DRAM uses long bitlines (512 cells/bitline) and is cheaper, while fancy (short-bitline) DRAM is faster but larger]
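The shape of the area/latency curve can be approximated with a toy model in which the per-bitline sense-amplifier area is amortized over the cells on the bitline, while latency grows with bitline capacitance. All constants below are illustrative assumptions, chosen only to reproduce the trend:

```python
# Toy model of the DRAM area-vs-latency trade-off as a function of
# cells per bitline. Shorter bitlines need more sense amplifiers per
# cell (more area) but have less bitline capacitance (less latency).
SENSE_AMP_COST = 12.0   # sense-amp area in cell-equivalents (illustrative)
LATENCY_PER_CELL = 0.1  # ns of bitline delay per attached cell (illustrative)
BASE_LATENCY = 10.0     # ns of fixed decoder/I/O delay (illustrative)

def normalized_area(cells_per_bitline):
    # one unit of cell area plus an amortized share of the sense amplifier
    return 1.0 + SENSE_AMP_COST / cells_per_bitline

def latency_ns(cells_per_bitline):
    return BASE_LATENCY + LATENCY_PER_CELL * cells_per_bitline

# Short bitlines (32 cells): larger area, lower latency.
# Long bitlines (512 cells): smaller area, higher latency.
```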
Approximating the Best of Both Worlds
• Long bitline: small area, high latency
• Short bitline: large area, low latency
• Our proposal: small area (using long bitlines) and low latency
Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
– Near segment: 32 cells
– Far segment: 480 cells
• Latency Evaluation
– SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
– DRAM area/power simulator from Rambus
– DDR3 energy calculator from Micron
Commodity DRAM vs. TL-DRAM [HPCA 2013]
• DRAM latency (tRC): near segment –56%, far segment +23% (52.5ns), relative to commodity DRAM (100%)
• DRAM power: near segment –51%, far segment +49%, relative to commodity DRAM (100%)
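The reported percentage changes can be turned into absolute numbers with simple arithmetic. The baseline tRC used below is an illustrative assumption taken from the 52.5ns figure on the slide:

```python
# Applying the reported TL-DRAM latency deltas to a baseline tRC,
# to see the absolute effect. The baseline value is illustrative.
def apply_delta(baseline, pct_change):
    return baseline * (1.0 + pct_change / 100.0)

baseline_trc_ns = 52.5                          # illustrative commodity-DRAM tRC
near_trc = apply_delta(baseline_trc_ns, -56)    # near segment: -56% latency
far_trc = apply_delta(baseline_trc_ns, +23)     # far segment: +23% latency
```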
Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by the hardware and/or software
• Many potential uses
  1. Use near segment as hardware-managed inclusive cache to far segment
  2. Use near segment as hardware-managed exclusive cache to far segment
  3. Profile-based page mapping by operating system
  4. Simply replace DRAM with TL-DRAM
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray with an isolation transistor: the far segment serves as main memory, the near segment (next to the sense amplifier) as its cache]
Inter-Segment Migration
• Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across the shared bitlines concurrently
[Figure: source row in the far segment, isolation transistor, destination row in the near segment, sense amplifier]
Inter-Segment Migration
• Transfer data from source to destination across the shared bitlines concurrently
  – Step 1: Activate source row
  – Step 2: Activate destination row to connect cell and bitline
• Migration is overlapped with source row access: only ~4ns additional over the row access latency
Near Segment as Hardware-Managed Cache
[Charts: normalized performance and normalized power for 1-, 2-, and 4-core systems (1, 2, and 4 channels)]
Using near segment as a cache improves performance and reduces power consumption
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Single-Core: Varying Near Segment Length
[Chart: IPC improvement vs. near segment length (1 to 256 cells); maximum improvement ~12–14%. A longer near segment gives larger cache capacity but higher cache access latency]
By adjusting the near segment length, we can trade off cache capacity for cache latency
More on TL-DRAM
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu,
"Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,"
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013.
[Slides (pptx)]
LISA: Low-Cost Inter-Linked Subarrays [HPCA 2016]
Problem: Inefficient Bulk Data Movement
• Bulk data movement is a key operation in many applications
  – memmove & memcpy: 5% of cycles in Google’s datacenter [Kanev+ ISCA’15]
[Figure: cores and LLC connect to memory over a 64-bit channel; copying src → dst moves every byte through the CPU. Inside the DRAM chip, the subarrays of a bank share a narrow internal data bus (64b)]
Low connectivity in DRAM is the fundamental bottleneck for bulk data movement
Goal: Provide a new substrate to enable wide connectivity between subarrays
Key Idea and Applications
• Low-cost Inter-linked subarrays (LISA)
  – Fast bulk data movement between subarrays
  – Wide datapath via isolation transistors: 0.8% DRAM chip area
[Figure: adjacent subarrays linked through isolation transistors]
• LISA is a versatile substrate → new applications
  1. Fast bulk data copy: copy latency 1.363ms → 0.148ms (9.2x) → 66% speedup, –55% DRAM energy
  2. In-DRAM caching: hot-data access latency 48.7ns → 21.5ns (2.2x) → 5% speedup
  3. Fast precharge: precharge latency 13.1ns → 5.0ns (2.6x) → 8% speedup
New DRAM Command to Use LISA
Row Buffer Movement (RBM): Move a row of data in an activated row buffer to a precharged one
[Figure: RBM SA1 → SA2: the activated sense amplifiers of subarray 1 (at Vdd − Δ) are linked to the precharged sense amplifiers of subarray 2 (at Vdd/2 + Δ); charge sharing moves the data, and subarray 2's amplifiers then amplify it]
RBM transfers an entire row between subarrays
RBM Analysis
• The range of RBM depends on the DRAM design
  – Multiple RBMs to move data across > 3 subarrays
• Validated with SPICE using worst-case cells
  – NCSU FreePDK 45nm library
• 4KB of data in 8ns (w/ 60% guardband)
  → 500 GB/s, 26x the bandwidth of a DDR4-2400 channel
• 0.8% DRAM chip area overhead [O+ ISCA’14]
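The 500 GB/s figure follows directly from moving 4KB in 8ns; a quick arithmetic check, using the standard 19.2 GB/s peak bandwidth of a 64-bit DDR4-2400 channel:

```python
# Sanity-checking the RBM bandwidth claim: one 4KB row moved in 8ns.
bytes_moved = 4 * 1024                     # one 4KB row
time_s = 8e-9                              # 8 ns per RBM
rbm_bw_gbs = bytes_moved / time_s / 1e9    # GB/s

# A DDR4-2400 channel: 2400 MT/s * 8 bytes per transfer = 19.2 GB/s.
ddr4_2400_bw_gbs = 2400e6 * 8 / 1e9
speedup = rbm_bw_gbs / ddr4_2400_bw_gbs    # ~26x, matching the slide
```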
1. Rapid Inter-Subarray Copying (RISC)
• Goal: Efficiently copy a row across subarrays
• Key idea: Use RBM to form a new command sequence
[Figure: step 1, activate the source row in subarray 1]
2. Variable Latency DRAM (VILLA)
• Goal: Reduce DRAM latency with low area overhead
• Motivation: Trade-off between area and latency
  – Long bitline (DDRx) vs. short bitline (RLDRAM)
3. Linked Precharge (LIP)
• Problem: The precharge time is limited by the strength of one precharge unit
• Linked Precharge (LIP): LISA precharges a subarray using multiple precharge units
[Figure: neighboring subarrays' precharge units assist through the isolation transistors]
Reduces precharge latency by 2.6x (43% guardband)
What Causes the Long DRAM Latency?
Tackling the Fixed Latency Mindset
• Reliable operation latency is actually very heterogeneous
  – Across temperatures, chips, parts of a chip, voltage levels, …
• Idea: Dynamically find out and use the lowest latency one can reliably access a memory location with
  – Adaptive-Latency DRAM [HPCA 2015]
  – Flexible-Latency DRAM [SIGMETRICS 2016]
  – Design-Induced Variation-Aware DRAM [SIGMETRICS 2017]
  – Voltron [SIGMETRICS 2017]
  – DRAM Latency PUF [HPCA 2018]
  – ...
[Figure: distribution of DRAM latency across cells, from low to high, with a tail of slow cells]
Why is Latency High?
• DRAM latency: delay as specified in DRAM standards
  – Doesn’t reflect true DRAM device latency
• Imperfect manufacturing process → latency variation
• High standard latency chosen to increase yield
[Figure: latency distributions of DRAM chips A, B, and C under manufacturing variation; the standard latency is set beyond the slowest]
What Causes the Long Memory Latency?
Conservative timing margins!
• Worst-case temperatures
  – 85°C vs. the common case, to enable a wide range of operating conditions
• Worst-case devices
  – DRAM cell with the smallest charge across any acceptable device, to tolerate process variation at acceptable yield
Three steps of charge movement:
1. Sensing
2. Restore
3. Precharge
DRAM Charge over Time
[Figure: cell charge over time for data 1 and data 0 as the sense amplifier operates; the standard sensing and restore timing parameters ("in theory") include a margin beyond the time the charge movement actually takes ("in practice")]
2. Temperature Dependence
– DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
DRAM Cells are Not Equal
• Ideal: same size, same charge, same latency
• Real: different size (smallest vs. largest cell), different charge, different latency
Large variation in cell size → large variation in charge → large variation in access latency
Process Variation
[Figure: a DRAM cell, with three sources of variation:]
❶ Cell capacitance
❷ Contact resistance
❸ Transistor performance
2. Temperature Dependence
– DRAM leaks more charge at higher temperature
– Leads to extra timing margin for cells that operate at low temperature, since the margin is set for the high temperature
Charge Leakage ∝ Temperature
[Figure: small leakage at room temperature, large leakage at high temperature (85°C)]
Cells store small charge at high temperature and large charge at low temperature
→ Large variation in access latency
DRAM Timing Parameters
• DRAM timing parameters are dictated by the worst case
  – The smallest cell with the smallest charge in all DRAM products
  – Operating at the highest temperature
Adaptive-Latency DRAM [HPCA 2015]
• Idea: Optimize DRAM timing for the common case
  – Current temperature
  – Current DRAM module
• A DRAM cell can store much more charge in the common case (low temperature, strong cell) than in the worst case
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.

Extra Charge → Reduced Latency
1. Sensing: sense cells with extra charge faster → lower sensing latency
2. Restore: no need to fully restore cells with extra charge → lower restoration latency
3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency
DRAM Characterization Infrastructure
[Figure: temperature controller, FPGAs, heater, and host PC]
• Flexible
• Easy to use (C++ API)
• Open-source: github.com/CMU-SAFARI/SoftMC

SoftMC: Open Source DRAM Infrastructure
https://github.com/CMU-SAFARI/SoftMC
Observation 1. Faster Sensing
[115 DIMM characterization: a typical DIMM at low temperature has more charge → strong charge flow → faster sensing; 17% ↓ timing (tRCD) with no errors]
• Key idea
  – Optimize DRAM timing parameters online
• Two components
  – DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  – System monitors DRAM temperature & uses appropriate DRAM timing parameters
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.
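The two components above can be sketched as a small controller-side model: the DIMM ships with per-temperature timing sets, and the system picks the most aggressive set that is reliable at the current temperature. All timing values below are illustrative assumptions, not measured numbers:

```python
# Sketch of AL-DRAM's mechanism: choose DRAM timing parameters based
# on the currently measured temperature. Values are illustrative.
# Timing sets keyed by the maximum temperature (deg C) at which each
# set was verified reliable for this DIMM.
TIMING_SETS = {
    55: {"tRCD": 10.0, "tRC": 40.0},    # aggressive: reliable up to 55C
    70: {"tRCD": 12.5, "tRC": 45.0},
    85: {"tRCD": 13.75, "tRC": 48.75},  # standard worst-case timings
}

def select_timings(current_temp_c):
    """Pick the most aggressive timing set still reliable at this temperature."""
    for max_temp in sorted(TIMING_SETS):
        if current_temp_c <= max_temp:
            return TIMING_SETS[max_temp]
    return TIMING_SETS[max(TIMING_SETS)]  # too hot: fall back to worst case

cool = select_timings(34)  # typical server (under 34C): aggressive timings
hot = select_timings(85)   # worst case: standard timings
```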
DRAM Temperature
• DRAM temperature measurement
  – Server cluster: operates at under 34°C
  – Desktop: operates at under 50°C
  – DRAM standard optimized for 85°C
• DRAM operates at low temperatures in the common case
• Previous works – DRAM temperature is low
  – El-Sayed+ SIGMETRICS 2012
  – Liu+ ISCA 2007
• Previous works – Maintain low DRAM temperature
  – David+ ICAC 2011
  – Liu+ ISCA 2007
  – Zhu+ ITHERM 2008
Latency Reduction Summary of 115 DIMMs
• Latency reduction for read & write (55°C)
– Read Latency: 32.7%
– Write Latency: 55.1%
• Workload
– 35 applications from SPEC, STREAM, Parsec,
Memcached, Apache, GUPS
AL-DRAM: Single-Core Evaluation
[Chart: per-workload performance improvement (gems, soplex, libq, s.cluster, gups, mcf, lbm, copy, milc, …); averages: 6.7% (intensive), 5.0% (all 35 workloads), 1.4% (non-intensive)]
AL-DRAM improves performance on a real system
AL-DRAM: Multi-Core Evaluation
[Chart: per-workload performance improvement; averages: 14.0% (intensive), 10.4% (all 35 workloads), 2.9% (non-intensive)]
AL-DRAM provides higher performance for multi-programmed & multi-threaded workloads
Reducing Latency Also Reduces Energy
AL-DRAM reduces DRAM power consumption by 5.8%
AL-DRAM: Advantages & Disadvantages
Advantages
+ Simple mechanism to reduce latency
+ Significant system performance and energy benefits
+ Benefits higher at low temperature
+ Low cost, low complexity
Disadvantages
- Need to determine reliable operating latencies for different temperatures and different DIMMs → higher testing cost
  (might not be that difficult for low temperatures)
More on AL-DRAM
Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu,
"Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,"
Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015.
[Slides (pptx) (pdf)] [Full data sets]
Different Types of Latency Variation
• AL-DRAM exploits latency variation
  – Across time (different temperatures)
  – Across chips
Variation in Activation Errors
[Box plots (min, quartiles, max) of activation error counts from 7500 rounds over 240 chips, at activation latencies below the 13.1ns standard: some DIMMs show no ACT errors, some very few, and some are rife with errors]
Modern DRAM chips exhibit significant variation in activation latency; characteristics differ across DIMMs
Spatial Locality of Activation Errors
[Figure: activation errors for one DIMM @ tRCD = 7.5ns cluster in particular regions of the chip]
• Key idea:
  1. Divide memory into regions of different latencies
  2. Memory controller: use lower latency for regions without slow cells; higher latency for other regions
[Charts: fraction of cells operable at reduced tRCD and tRP (7.5ns, 10ns, 13ns) for three real DIMMs (D1–D3) vs. the DDR3 baseline and an upper bound]
Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
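The key idea above amounts to a per-region timing lookup in the memory controller; a rough sketch (the region granularity and tRCD values are illustrative assumptions):

```python
# Sketch of FLY-DRAM's key idea: keep a profiled latency per memory
# region and issue each access with that region's timing.
REGION_BITS = 20  # 1MB regions (illustrative granularity)

# One-time profile: region index -> lowest reliable tRCD (ns).
# Regions containing slow cells keep the conservative standard timing.
profile = {0: 7.5, 1: 7.5, 2: 13.0, 3: 10.0}
STANDARD_TRCD = 13.0

def trcd_for_access(addr):
    region = addr >> REGION_BITS
    return profile.get(region, STANDARD_TRCD)  # unprofiled -> worst case

fast = trcd_for_access(0x0000_1000)  # region 0: no slow cells, fast timing
slow = trcd_for_access(0x0020_0000)  # region 2: contains slow cells
```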
Results
[Chart: performance normalized to the DDR3 baseline over 40 workloads: FLY-DRAM improves performance by 13.3% (D1), 17.6% (D2), and 19.5% (D3); the upper bound is 19.7%]
FLY-DRAM improves performance by exploiting spatial latency variation in DRAM
Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
FLY-DRAM: Advantages & Disadvantages
Advantages
+ Reduces latency significantly
+ Exploits significant within-chip latency variation
Disadvantages
- Need to determine reliable operating latencies for different parts of a chip → higher testing cost
- Slightly more complicated controller
Analysis of Latency Variation in DRAM Chips
Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh, Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, and Onur Mutlu,
"Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Antibes Juan-les-Pins, France, June 2016.
[Slides (pptx) (pdf)] [Source Code]
[Figure: per-cell fail probability at time t2 vs. time t1, with subarray borders and remapped rows marked (DRAM rows and columns at 1-cell granularity)]
This shows that we can rely on a static profile of weak bitlines to determine whether an access will cause failures
Solar-DRAM
Uses a static profile of weak subarray columns
• Identifies subarray columns as weak or strong
• Obtained in a one-time profiling step
Three Components
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)
Solar-DRAM: VLC (I)
[Figure: weak and strong bitlines within a subarray; each cache line maps onto a set of bitlines, so cache lines on strong bitlines can use reduced latency]
Solar-DRAM: RSC (II)
[Figure: cache line 0 and cache line 1 remapped across subarray columns]
Solar-DRAM: RLW (III)
All bitlines are strong when issuing writes
[Figure: writes to any cache line in the subarray]
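Putting the components together, the controller-side decision can be sketched roughly as follows (the profile format and timing values are illustrative assumptions, not Solar-DRAM's actual encoding):

```python
# Rough sketch of Solar-DRAM's controller decision: use a one-time
# profile of weak subarray columns to pick the access latency.
weak_columns = {(0, 3), (1, 7)}    # (subarray, column) pairs profiled as weak
FAST_TRCD, SLOW_TRCD = 7.5, 13.0   # ns (illustrative)

def access_trcd(subarray, column, is_write):
    if is_write:
        return FAST_TRCD           # RLW: all bitlines are strong for writes
    if (subarray, column) in weak_columns:
        return SLOW_TRCD           # VLC: weak columns keep standard latency
    return FAST_TRCD               # strong columns get reduced latency

r_strong = access_trcd(0, 0, is_write=False)
r_weak = access_trcd(0, 3, is_write=False)
w_weak = access_trcd(0, 3, is_write=True)
```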
Why Is There Spatial Latency Variation Within a Chip?

What Is Design-Induced Variation?
[Figure: across a column, speed varies with distance from the wordline driver; across a row, with distance from the sense amplifier. Cells near the wordline drivers and sense amplifiers are inherently fast; cells far from them are inherently slow]
Systematic variation in cell access times caused by the physical organization of DRAM
DIVA Online Profiling (Design-Induced-Variation-Aware)
[Figure: the inherently slow region lies farthest from the wordline driver and sense amplifier]
Profile only slow regions to determine the minimum latency
→ Dynamic & low-cost latency optimization
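The profiling shortcut above can be sketched as follows: instead of testing every cell, test only the region known by design to be slowest. The subarray geometry and region size below are illustrative assumptions:

```python
# Sketch of DIVA online profiling: test only the design-induced slow
# region (cells farthest from wordline drivers and sense amplifiers)
# instead of scanning the whole subarray.
ROWS, COLS = 512, 1024   # illustrative subarray geometry

def slow_region_cells():
    """Cells in the last rows/columns, i.e. farthest from drivers/amps."""
    slow_rows = range(ROWS - 8, ROWS)   # farthest from sense amplifiers
    slow_cols = range(COLS - 8, COLS)   # farthest from wordline drivers
    return [(r, c) for r in slow_rows for c in slow_cols]

cells_tested = len(slow_region_cells())
full_scan = ROWS * COLS
# Profiling cost drops from a full subarray scan to a small corner region.
```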
DIVA Online Profiling (Design-Induced-Variation-Aware)
[Figure: process variation causes slow cells at random locations (random errors); design-induced variation causes localized errors in the inherently slow regions]
[Chart: error rates at 55°C and 85°C for AL-DRAM, DIVA Profiling, and DIVA Profiling + Shuffling]
DIVA-DRAM: Advantages & Disadvantages
Advantages
++ Automatically finds the lowest reliable operating latency at system runtime (lower production-time testing cost)
+ Reduces latency more than prior methods (w/ ECC)
+ Reduces latency at high temperatures as well
Disadvantages
- Requires knowledge of inherently-slow regions
- Requires ECC (Error Correcting Codes)
- Imposes overhead during runtime profiling
Design-Induced Latency Variation in DRAM
Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu,
"Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms,"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Urbana-Champaign, IL, USA, June 2017.