
Computer Architecture

Lecture 10b: Memory Latency

Prof. Onur Mutlu


ETH Zürich
Fall 2018
18 October 2018
DRAM Memory:
A Low-Level Perspective
DRAM Module and Chip

3
Goals
• Cost
• Latency
• Bandwidth
• Parallelism
• Power
• Energy
• Reliability
• …

4
5
DRAM Chip

Figure: a DRAM chip contains multiple banks; each bank has cell arrays above and below arrays of sense amplifiers, with row decoders and bank I/O.
Sense Amplifier

Figure: two cross-coupled inverters connect the top and bottom bitlines; an enable signal turns the amplifier on.
6
Sense Amplifier – Two Stable States

Figure: the two stable states – the top bitline at VDD and the bottom at 0 (logical “1”), or the top at 0 and the bottom at VDD (logical “0”).

7
Sense Amplifier Operation
Figure: the top bitline starts at V_T and the bottom at V_B, with V_T > V_B; when enabled, the sense amplifier amplifies the difference, driving the top bitline to VDD and the bottom to 0.
8
DRAM Cell – Capacitor

• Empty state: logical “0”; fully charged state: logical “1”

• Problem 1: the stored charge is small and cannot drive circuits

• Problem 2: reading destroys the stored state

9
Capacitor to Sense Amplifier

Figure: the cell capacitor is connected to the sense amplifier, which amplifies the small stored charge to a full logical “0” (0 V) or logical “1” (VDD).
10
DRAM Cell Operation

Figure: the bitline is precharged to ½VDD; connecting the cell perturbs it to ½VDD + δ (or ½VDD − δ), which the sense amplifier amplifies to VDD (logical “1”) or 0 (logical “0”). A numeric example follows below.

11
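To make the ½VDD ± δ figure concrete, here is a minimal charge-sharing calculation. The capacitance and voltage values are illustrative assumptions, not numbers from the lecture.

```python
# Minimal sketch of bitline charge sharing during ACTIVATE.
# All capacitance and voltage values below are illustrative assumptions.

VDD = 1.2            # supply voltage (V), assumed
C_CELL = 25e-15      # cell capacitance (F), assumed
C_BITLINE = 150e-15  # bitline capacitance (F), assumed (much larger than the cell)

def bitline_after_charge_sharing(cell_voltage: float) -> float:
    """Bitline voltage after the access transistor connects the cell.

    The bitline starts precharged at VDD/2; charge is conserved when the
    cell (at cell_voltage) is connected, giving VDD/2 plus/minus a small delta.
    """
    total_charge = C_BITLINE * (VDD / 2) + C_CELL * cell_voltage
    return total_charge / (C_BITLINE + C_CELL)

for stored, label in [(VDD, "logical 1"), (0.0, "logical 0")]:
    v = bitline_after_charge_sharing(stored)
    delta = v - VDD / 2
    print(f"{label}: bitline = {v:.3f} V (delta = {delta*1000:+.1f} mV)")
```

With these assumed values the perturbation is under 100 mV, which is why the cell cannot drive circuits directly and a sense amplifier must amplify the difference.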
DRAM Subarray – Building Block for
DRAM Chip

Figure: a subarray is the building block – a row decoder, cell arrays above and below an array of sense amplifiers, which serves as the 8Kb row buffer.
12
DRAM Bank

Figure: a bank is built from multiple subarrays. The row address drives the row decoders; the selected row is sensed into an 8Kb array of sense amplifiers; the column address selects 64 bits, which move through the bank I/O to the data bus. (A small address-decoding sketch follows this slide.)
13
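As a rough illustration of how the two addresses in this figure are used, here is a minimal sketch. The 8Kb row buffer and 64-bit bank I/O come from the figure; the row-address width and the bit layout are assumptions.

```python
# Minimal sketch of splitting a bank-local address into row and column.
# The row address selects one row (driven through the row decoder into the
# 8Kb row buffer); the column address selects one 64-bit piece of that row
# through the bank I/O. The row-address width is an assumption; the column
# count follows from 8 Kb / 64 b = 128 column positions.

ROW_BUFFER_BITS = 8 * 1024                        # 8Kb row buffer (from the figure)
IO_WIDTH_BITS = 64                                # bank I/O width (from the figure)
NUM_COLUMNS = ROW_BUFFER_BITS // IO_WIDTH_BITS    # 128 column positions
ROW_BITS = 15                                     # e.g., 32K rows per bank, assumed

def split_bank_address(addr: int):
    """Split a bank-local address into (row, column), column in the low bits."""
    column = addr % NUM_COLUMNS
    row = (addr // NUM_COLUMNS) % (1 << ROW_BITS)
    return row, column

print(split_bank_address(0))      # (0, 0)
print(split_bank_address(1337))   # (10, 57)
```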
14
DRAM Chip

Figure: a DRAM chip contains multiple banks (each with its own subarrays, row decoders, arrays of sense amplifiers, and bank I/O); the banks share an internal bus and connect to the memory channel, which is 8 bits wide per chip.
DRAM Operation

1. ACTIVATE row: the row address is decoded and the selected row is sensed into the array of sense amplifiers (row buffer)
2. READ/WRITE column: the column address selects the data to transfer through the bank I/O
3. PRECHARGE: the bank is prepared for the next ACTIVATE

(A timing sketch follows this slide.)
15
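A minimal sketch of this three-step protocol as a memory controller might model it, tracking the open row per bank. The timing values are illustrative DDR3-like assumptions; a row-buffer hit, an access to a precharged bank, and a row-buffer conflict incur different latencies.

```python
# Minimal sketch of ACTIVATE -> READ/WRITE -> PRECHARGE latencies per bank.
# The timing values are illustrative DDR3-like numbers, not from the lecture.

T_RCD = 13.75  # ACTIVATE -> READ/WRITE delay (ns), assumed
T_CL  = 13.75  # READ -> data delay (ns), assumed
T_RP  = 13.75  # PRECHARGE -> next ACTIVATE delay (ns), assumed

class Bank:
    def __init__(self):
        self.open_row = None  # row currently latched in the row buffer

    def access(self, row: int) -> float:
        """Return the latency (ns) to read one column of `row`."""
        latency = 0.0
        if self.open_row is None:                 # bank is precharged
            latency += T_RCD                      # ACTIVATE the row
        elif self.open_row != row:                # row-buffer conflict
            latency += T_RP + T_RCD               # PRECHARGE, then ACTIVATE
        # else: row-buffer hit, the row is already open
        self.open_row = row
        latency += T_CL                           # READ the column
        return latency

bank = Bank()
print(bank.access(5))   # precharged bank: tRCD + tCL
print(bank.access(5))   # row-buffer hit: tCL only
print(bank.access(9))   # row-buffer conflict: tRP + tRCD + tCL
```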
Memory Latency:
Fundamental Tradeoffs
Review: Memory Latency Lags Behind
Figure: DRAM improvement on a log scale, 1999–2017 – capacity improved by ~128x and bandwidth by ~20x, but latency by only ~1.3x.
Memory latency remains almost constant


DRAM Latency Is Critical for Performance

• In-memory databases [Mao+, EuroSys’12; Clapp+ (Intel), IISWC’15]
• Graph/tree processing [Xu+, IISWC’12; Umuroglu+, FPL’15]
• In-memory data analytics [Clapp+ (Intel), IISWC’15; Awan+, BDCloud’15]
• Datacenter workloads [Kanev+ (Google), ISCA’15]

Long memory latency → performance bottleneck
The Memory Latency Problem

• High memory latency is a significant limiter of system performance and energy-efficiency

• It is becoming increasingly so with higher memory contention in multi-core and heterogeneous architectures
  – Exacerbating the bandwidth need
  – Exacerbating the QoS problem

• It increases processor design complexity due to the mechanisms incorporated to tolerate memory latency

20
Retrospective: Conventional Latency Tolerance Techniques

• Caching [initially by Wilkes, 1965]
  – Widely used, simple, effective, but inefficient, passive
  – Not all applications/phases exhibit temporal or spatial locality

• Prefetching [initially in IBM 360/91, 1967]
  – Works well for regular memory access patterns (see the sketch after this slide)
  – Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive

• Multithreading [initially in CDC 6600, 1964]
  – Works well if there are multiple threads
  – Improving single thread performance using multithreading hardware is an ongoing research effort

• Out-of-order execution [initially by Tomasulo, 1967]
  – Tolerates cache misses that cannot be prefetched
  – Requires extensive hardware resources for tolerating long latencies

21
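To illustrate why prefetching suits regular access patterns (and struggles on irregular ones), here is a minimal sketch of a per-PC stride prefetcher. It is not any particular machine's design; the table organization, confidence policy, and prefetch degree are assumptions.

```python
# Minimal sketch of a per-PC stride prefetcher. Table size, prefetch degree,
# and the confidence/training policy are illustrative assumptions.

class StridePrefetcher:
    def __init__(self, degree: int = 2):
        self.table = {}       # pc -> (last_addr, last_stride, confidence)
        self.degree = degree  # how many strides ahead to prefetch

    def on_access(self, pc: int, addr: int) -> list[int]:
        """Train on (pc, addr) and return addresses to prefetch."""
        last_addr, last_stride, conf = self.table.get(pc, (addr, 0, 0))
        stride = addr - last_addr
        conf = min(conf + 1, 3) if stride == last_stride and stride != 0 else 0
        self.table[pc] = (addr, stride, conf)
        if conf >= 2:  # stride confirmed repeatedly -> issue prefetches
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []

pf = StridePrefetcher()
for a in range(0x1000, 0x1400, 0x40):      # regular pattern: 64-byte stride
    print(hex(a), [hex(p) for p in pf.on_access(pc=0x400, addr=a)])
```

On a regular stream the stride is quickly learned and useful prefetches are issued; an irregular address stream never builds confidence, which is the limitation the slide points out.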
Two Major Sources of Latency Inefficiency

• Modern DRAM is not designed for low latency
  – Main focus is cost-per-bit (capacity)

• Modern DRAM latency is determined by worst-case conditions and worst-case devices

• Much of memory latency is unnecessary

22
What Causes
the Long Memory Latency?
Why the Long Memory Latency?

 Reason 1: Design of DRAM Micro-architecture


 Goal: Maximize capacity/area, not minimize latency

 Reason 2: “One size fits all” approach to latency specification


 Same latency parameters for all temperatures
 Same latency parameters for all DRAM chips
 Same latency parameters for all parts of a DRAM chip
 Same latency parameters for all supply voltage levels
 Same latency parameters for all application data
 …

24
Tiered Latency DRAM

25
What Causes the Long Latency?
Figure: a DRAM chip is made of subarrays (cell arrays with sense amplifiers) plus I/O circuitry connecting to the channel.

DRAM Latency = Subarray Latency + I/O Latency
Subarray latency is dominant.
26
Why is the Subarray So Slow?
Figure: in a subarray, each cell (a capacitor plus an access transistor) sits on a bitline of 512 cells; the row decoder drives the wordlines, and a large sense amplifier terminates each bitline.

• Long bitline
  – Amortizes sense amplifier cost → small area
  – Large bitline capacitance → high latency & power
27
Trade-Off: Area (Die Size) vs. Latency
Figure: a long bitline makes the chip smaller (cheaper); a short bitline makes access faster. Trade-off: area vs. latency.

28
Trade-Off: Area (Die Size) vs. Latency
Figure: normalized DRAM area vs. latency (ns) for different bitline lengths (32, 64, 128, 256, 512 cells/bitline). Commodity DRAM uses long bitlines (512 cells/bitline) and is the cheapest but slowest; fancy short-bitline DRAM (32 cells/bitline) is the fastest but takes roughly 4x the area.
29
Approximating the Best of Both Worlds
• Long bitline: small area, high latency. Short bitline: large area, low latency.
• Our proposal: add isolation transistors to a long bitline, so the segment near the sense amplifiers behaves like a short bitline → fast.
30
Approximating the Best of Both Worlds
Tiered-Latency DRAM: small area (using a long bitline) and low latency (by isolating a short near segment), approximating the best of long and short bitlines.

31
Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
– Near segment: 32 cells
– Far segment: 480 cells
• Latency Evaluation
– SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
– DRAM area/power simulator from Rambus
– DDR3 energy calculator from Micron

32
Commodity DRAM vs. TL-DRAM [HPCA 2013]
• DRAM latency (tRC): commodity DRAM = 100% (52.5 ns); the TL-DRAM near segment is ~56% lower, the far segment ~23% higher
• DRAM power: the TL-DRAM near segment is ~51% lower, the far segment ~49% higher
• DRAM area overhead: ~3%, mainly due to the isolation transistors
33
Trade-Off: Area (Die-Area) vs. Latency
Figure: normalized DRAM area vs. latency (ns). The TL-DRAM near segment reaches the latency of short-bitline designs, while the far segment stays close to commodity DRAM's (512 cells/bitline) area and latency.
34
Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by
the hardware and/or software
• Many potential uses
1. Use near segment as hardware-managed inclusive
cache to far segment
2. Use near segment as hardware-managed exclusive
cache to far segment
3. Profile-based page mapping by operating system
4. Simply replace DRAM with TL-DRAM

35
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Near Segment as Hardware-Managed Cache
Figure: in a TL-DRAM subarray, the far segment is used as main memory and the near segment as a cache; both connect through the sense amplifiers to the I/O channel.

• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache? (A sketch of one possible policy follows this slide.)
36
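One plausible way a memory controller could manage the near segment as a cache is sketched below. The LRU policy and the bookkeeping are assumptions; the 32-row near segment matches the configuration given earlier, and the real mechanism migrates rows inside DRAM rather than in software.

```python
# Minimal sketch of managing the TL-DRAM near segment as a hardware-managed
# cache of far-segment rows. LRU replacement is an assumed policy.

from collections import OrderedDict

NEAR_ROWS = 32   # rows in the near (fast) segment (32 cells/bitline, as earlier)

class NearSegmentCache:
    def __init__(self):
        self.lru = OrderedDict()   # far_row -> True, kept in LRU order

    def access(self, far_row: int) -> str:
        if far_row in self.lru:
            self.lru.move_to_end(far_row)
            return f"row {far_row}: near-segment hit (low latency)"
        # Miss: serve from the far segment, then migrate the row into the
        # near segment over the shared bitlines (the ~4 ns migration above).
        if len(self.lru) >= NEAR_ROWS:
            victim, _ = self.lru.popitem(last=False)  # evict the LRU row; a
            # real design would write it back to the far segment if dirty
        self.lru[far_row] = True
        return f"row {far_row}: far-segment access, then cached in near segment"

cache = NearSegmentCache()
print(cache.access(100))   # miss: far access + migration
print(cache.access(100))   # hit: served from the near segment
```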
Inter-Segment Migration
• Goal: Migrate the source row into the destination row
• Naïve way: the memory controller reads the source row byte by byte and writes to the destination row byte by byte → high latency

Figure: the source row lies in the far segment and the destination row in the near segment, separated by the isolation transistor, with the sense amplifier at the bottom.
37
Inter-Segment Migration
• Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across the shared bitlines concurrently

Figure: same layout as before – source row (far segment), isolation transistor, destination row (near segment), sense amplifier.
38
Inter-Segment Migration
• Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across the shared bitlines concurrently
• Step 1: Activate the source row
• Step 2: Activate the destination row to connect its cells to the bitlines

Migration is overlapped with the source row access: only ~4 ns of additional latency over the row access latency.
39
Near Segment as Hardware-Managed Cache
Figure: far segment as main memory, near segment as cache (same organization as before).

• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache?
• Challenge 2: How to efficiently manage the cache?
40
Performance & Power Consumption
120% 120%
12.4% 11.5% 10.7%
Normalized Performance

–23% –24% –26%

Normalized Power
100% 100%

80% 80%

60% 60%

40% 40%

20% 20%

0% 0%
1 (1-ch) 2 (2-ch) 4 (4-ch) 1 (1-ch) 2 (2-ch) 4 (4-ch)
Core-Count (Channel) Core-Count (Channel)
Using near segment as a cache improves
performance and reduces power consumption
41
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Single-Core: Varying Near Segment Length
Figure: performance improvement vs. near segment length (1 to 256 cells); the maximum IPC improvement is ~14%.
• A larger near segment gives larger cache capacity but higher cache access latency

By adjusting the near segment length, we can trade off cache capacity for cache latency.
42
More on TL-DRAM
 Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya
Subramanian, and Onur Mutlu,
"Tiered-Latency DRAM: A Low Latency and Low Cost
DRAM Architecture"
Proceedings of the 19th International Symposium on High-
Performance Computer Architecture (HPCA), Shenzhen, China,
February 2013. Slides (pptx)

43
LISA: Low-Cost Inter-Linked Subarrays
[HPCA 2016]

44
Problem: Inefficient Bulk Data Movement
Bulk data movement is a key operation in many applications
– memmove & memcpy: 5% cycles in Google’s datacenter [Kanev+ ISCA’15]
Figure: to copy src to dst today, the data moves from memory across the 64-bit channel, through the memory controller, LLC, and cores, and back again.

Long latency and high energy


45
Moving Data Inside DRAM?
Figure: a bank contains many subarrays (Subarray 1 … Subarray N, each 512 rows of 8Kb); the banks of a chip share only a narrow 64-bit internal data bus.

Low connectivity in DRAM is the fundamental bottleneck for bulk data movement.
Goal: Provide a new substrate to enable wide connectivity between subarrays.
46
Key Idea and Applications
• Low-cost Inter-linked Subarrays (LISA)
  – Fast bulk data movement between subarrays
  – Wide datapath via isolation transistors: 0.8% DRAM chip area

• LISA is a versatile substrate → new applications
  – Fast bulk data copy: copy latency 1.363µs → 0.148µs (9.2x) → 66% speedup, –55% DRAM energy
  – In-DRAM caching: hot-data access latency 48.7ns → 21.5ns (2.2x) → 5% speedup
  – Fast precharge: precharge latency 13.1ns → 5.0ns (2.6x) → 8% speedup
47
New DRAM Command to Use LISA
Row Buffer Movement (RBM): move a row of data in an activated row buffer to a precharged one.

Figure: RBM SA1→SA2. Subarray 1's row buffer is activated (near Vdd); Subarray 2's is precharged (at Vdd/2). Turning on the isolation transistors causes charge sharing, pulling Subarray 2's bitlines to Vdd/2+Δ, and its sense amplifiers then amplify the charge.

RBM transfers an entire row between subarrays.

48
RBM Analysis
• The range of RBM depends on the DRAM design
– Multiple RBMs to move data across > 3 subarrays
Subarray 1

Subarray 2

Subarray 3
• Validated with SPICE using worst-case cells
– NCSU FreePDK 45nm library
• 4KB data in 8ns (w/ 60% guardband)
→ 500 GB/s, 26x bandwidth of a DDR4-2400 channel
• 0.8% DRAM chip area overhead [O+ ISCA’14]
49
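A quick arithmetic check of the bandwidth claim on this slide. The DDR4-2400 channel peak is derived under the assumption of an 8-byte data bus at 2400 MT/s; the slide's 500 GB/s and 26x figures are the same numbers after rounding.

```python
# Quick check of the RBM bandwidth numbers. The DDR4-2400 channel peak is
# computed assuming an 8-byte data bus at 2400 MT/s.

rbm_bytes = 4 * 1024          # 4 KB moved per RBM
rbm_time_s = 8e-9             # 8 ns (with 60% guardband)
rbm_bw = rbm_bytes / rbm_time_s

ddr4_2400_bw = 2400e6 * 8     # 2400 MT/s * 8 bytes per transfer

print(f"RBM bandwidth: {rbm_bw / 1e9:.0f} GB/s")            # ~512 GB/s (~500 GB/s rounded)
print(f"DDR4-2400 channel: {ddr4_2400_bw / 1e9:.1f} GB/s")   # 19.2 GB/s
print(f"Ratio: {rbm_bw / ddr4_2400_bw:.0f}x")                # ~27x (~26x with rounding)
```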
1. Rapid Inter-Subarray Copying (RISC)
• Goal: Efficiently copy a row across subarrays
• Key idea: Use RBM to form a new command sequence
  1. Activate the source row (Subarray 1)
  2. RBM SA1→SA2
  3. Activate the destination row (Subarray 2), writing the row buffer into the destination row

Reduces row-copy latency by 9.2x and DRAM energy by 48.1x.

50
2. Variable Latency DRAM (VILLA)
• Goal: Reduce DRAM latency with low area overhead
• Motivation: trade-off between area and latency
  – Long bitline (DDRx) vs. short bitline (RLDRAM)
  – Shorter bitlines → faster activate and precharge time, but high area overhead: >40%
51
2. Variable Latency DRAM (VILLA)
• Key idea: Reduce the access latency of hot data via a heterogeneous DRAM design [Lee+ HPCA’13, Son+ ISCA’13]
• VILLA: Add fast subarrays (32 rows) as a cache in each bank of slow subarrays (512 rows)
• Challenge: the VILLA cache requires frequent movement of data rows
• LISA: cache rows rapidly from slow to fast subarrays

Reduces hot-data access latency by 2.2x at only 1.6% area overhead.

52
3. Linked Precharge (LIP)
• Problem: The precharge time is limited by the strength of one precharge unit
• Linked Precharge (LIP): LISA precharges a subarray using multiple precharge units

Figure: in conventional DRAM, only the local precharge units drive the activated row's bitlines; with LISA, the isolation transistors are turned on so a neighboring subarray's precharge units help.

Reduces precharge latency by 2.6x (with a 43% guardband).


53
More on LISA
 Kevin K. Chang, Prashant J. Nair, Saugata Ghose, Donghyuk Lee,
Moinuddin K. Qureshi, and Onur Mutlu,
"Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast
Inter-Subarray Data Movement in DRAM"
Proceedings of the 22nd International Symposium on High-
Performance Computer Architecture (HPCA), Barcelona, Spain,
March 2016.
[Slides (pptx) (pdf)]
[Source Code]

54
What Causes
the Long DRAM Latency?
Why the Long Memory Latency?

 Reason 1: Design of DRAM Micro-architecture


 Goal: Maximize capacity/area, not minimize latency

 Reason 2: “One size fits all” approach to latency specification


 Same latency parameters for all temperatures
 Same latency parameters for all DRAM chips
 Same latency parameters for all parts of a DRAM chip
 Same latency parameters for all supply voltage levels
 Same latency parameters for all application data
 …

56
Tackling the Fixed Latency Mindset
 Reliable operation latency is actually very heterogeneous
 Across temperatures, chips, parts of a chip, voltage levels, …

 Idea: Dynamically find out and use the lowest latency one
can reliably access a memory location with
 Adaptive-Latency DRAM [HPCA 2015]
 Flexible-Latency DRAM [SIGMETRICS 2016]
 Design-Induced Variation-Aware DRAM [SIGMETRICS 2017]
 Voltron [SIGMETRICS 2017]
 DRAM Latency PUF [HPCA 2018]
 ...

 We would like to find sources of latency heterogeneity and


exploit them to minimize latency
57
Latency Variation in Memory Chips
Heterogeneous manufacturing & operating conditions →
latency variation in timing parameters

Figure: latency distributions of three DRAM chips (A, B, C); each has a tail of slow cells toward the high-latency end.

58
Why is Latency High?
• DRAM latency: Delay as specified in DRAM standards
– Doesn’t reflect true DRAM device latency
• Imperfect manufacturing process → latency variation
• High standard latency chosen to increase yield
Figure: the standard latency is set above the latency distributions of all chips (A, B, C) to cover manufacturing variation.
59
What Causes the Long Memory Latency?
 Conservative timing margins!

 DRAM timing parameters are set to cover the worst case

 Worst-case temperatures
 85 degrees vs. common-case
 to enable a wide range of operating conditions
 Worst-case devices
 DRAM cell with smallest charge across any acceptable device
 to tolerate process variation at acceptable yield

 This leads to large timing margins for the common case


60
Understanding and Exploiting
Variation in DRAM Latency
DRAM Stores Data as Charge
DRAM Cell

Three steps of
charge movement
1. Sensing
2. Restore
3. Precharge Sense-Amplifier

62
DRAM Charge over Time
Figure: cell charge over time through sensing and restore; the "data 1" and "data 0" levels diverge as the sense amplifier drives the cell. The standardized (in theory) timing parameters include a margin beyond the sensing and restore time needed in practice.

Why does DRAM need the extra timing margin?


63
Two Reasons for Timing Margin
1. Process Variation
   – DRAM cells are not equal
   – Leads to extra timing margin for a cell that can store a large amount of charge

2. Temperature Dependence
   – DRAM leaks more charge at higher temperature
   – Leads to extra timing margin when operating at low temperature

64
DRAM Cells are Not Equal
• Ideal: all cells have the same size → same charge → same latency
• Real: cells have different sizes → different charge → different latency

Large variation in cell size → large variation in charge → large variation in access latency
65
Process Variation
Sources of variation in a DRAM cell:
❶ Cell capacitance
❷ Contact resistance
❸ Transistor performance

A small cell can store only a small charge:
• Small cell capacitance
• High contact resistance
• Slow access transistor
→ High access latency
66
Two Reasons for Timing Margin
1. Process Variation
   – DRAM cells are not equal
   – Leads to extra timing margin for a cell that can store a large amount of charge

2. Temperature Dependence
   – DRAM leaks more charge at higher temperature
   – Timing is set for cells operating at the highest temperature, leaving extra timing margin when operating at low temperature

67
Charge Leakage ∝ Temperature

Figure: at room temperature, leakage is small; at hot temperature (85°C), leakage is large.

Cells store small charge at high temperature and large charge at low temperature
→ Large variation in access latency
68
DRAM Timing Parameters
• DRAM timing parameters are dictated by
the worst-case
– The smallest cell with the smallest charge in
all DRAM products
– Operating at the highest temperature

• Large timing margin for the common-case

69
Adaptive-Latency DRAM [HPCA 2015]
 Idea: Optimize DRAM timing for the common case
 Current temperature
 Current DRAM module

 Why would this reduce latency?

 A DRAM cell can store much more charge in the common case
(low temperature, strong cell) than in the worst case

 More charge in a DRAM cell


 Faster sensing, charge restoration, precharging
 Faster access (read, write, refresh, …)

Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.
Extra Charge → Reduced Latency
1. Sensing
   Sense cells with extra charge faster → lower sensing latency
2. Restore
   No need to fully restore cells with extra charge → lower restoration latency
3. Precharge
   No need to fully precharge bitlines for cells with extra charge → lower precharge latency
71
DRAM Characterization Infrastructure

Figure: FPGA-based DRAM testing infrastructure – FPGAs hosting the DRAM under test, a temperature controller with heaters, and a host PC.

Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
72
DRAM Characterization Infrastructure

 Hasan Hassan et al., SoftMC: A


Flexible and Practical Open-
Source Infrastructure for
Enabling Experimental DRAM
Studies, HPCA 2017.

 Flexible
 Easy to Use (C++ API)
 Open-source
github.com/CMU-SAFARI/SoftMC

73
SoftMC: Open Source DRAM Infrastructure

 https://github.com/CMU-SAFARI/SoftMC

74
Observation 1. Faster Sensing
• Typical DIMM at low temperature: more charge → strong charge flow → faster sensing
• Characterization of 115 DIMMs: timing (tRCD) reduced by 17% with no errors

Typical DIMM at low temperature → more charge → faster sensing
75
Observation 2. Reducing Restore Time
• Typical DIMM at low temperature: less leakage → extra charge → no need to fully restore the charge
• Characterization of 115 DIMMs: read (tRAS) reduced by 37%, write (tWR) reduced by 54%, with no errors

Typical DIMM at lower temperature → more charge → restore time reduction
76
AL-DRAM

• Key idea
  – Optimize DRAM timing parameters online

• Two components (a sketch follows below)
  – The DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  – The system monitors DRAM temperature & uses the appropriate DRAM timing parameters

Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015. 77
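A minimal sketch of the second component: pick the most aggressive DIMM-provided timing set that is reliable at the current temperature. The table contents below are illustrative placeholders, not the characterization results from the paper.

```python
# Minimal sketch of AL-DRAM's online timing selection. All numbers are
# illustrative placeholders, not measured values.

# (max_temp_C, {timing parameter: ns}) pairs, assumed to be provided per DIMM
TIMING_SETS = [
    (45, {"tRCD": 10.0,  "tRAS": 24.0, "tWR": 10.0, "tRP": 10.0}),
    (65, {"tRCD": 11.5,  "tRAS": 30.0, "tWR": 12.5, "tRP": 11.5}),
    (85, {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}),  # standard
]

def select_timings(dram_temp_c: float) -> dict:
    """Pick the fastest timing set that is reliable at this temperature."""
    for max_temp, timings in TIMING_SETS:
        if dram_temp_c <= max_temp:
            return timings
    return TIMING_SETS[-1][1]   # above 85C: fall back to the standard timings

print(select_timings(34))   # server-like temperature -> fastest reliable set
print(select_timings(70))   # warmer -> a more conservative set
```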
DRAM Temperature
• DRAM temperature measurement
  – Server cluster: operates at under 34°C
  – Desktop: operates at under 50°C
  – DRAM standard optimized for 85°C
• DRAM operates at low temperatures in the common case
• Previous works – DRAM temperature is low
  – El-Sayed+ SIGMETRICS 2012, Liu+ ISCA 2007
• Previous works – maintain low DRAM temperature
  – David+ ICAC 2011, Liu+ ISCA 2007, Zhu+ ITHERM 2008

78
Latency Reduction Summary of 115 DIMMs
• Latency reduction for read & write (55°C)
– Read Latency: 32.7%
– Write Latency: 55.1%

• Latency reduction for each timing


parameter (55°C)
– Sensing: 17.3%
– Restore: 37.3% (read), 54.8% (write)
– Precharge: 35.2%
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.
79
AL-DRAM: Real System Evaluation
• System
– CPU: AMD 4386 ( 8 Cores, 3.1GHz, 8MB LLC)
– DRAM: 4GByte DDR3-1600 (800Mhz Clock)
– OS: Linux
– Storage: 128GByte SSD

• Workload
– 35 applications from SPEC, STREAM, Parsec,
Memcached, Apache, GUPS

80
AL-DRAM: Single-Core Evaluation
Figure: performance improvement for individual single-core workloads (soplex, mcf, milc, libq, lbm, gems, copy, s.cluster, gups, …) and averages.
• Average improvement: 6.7% over memory-intensive workloads, 1.4% over non-intensive workloads, 5.0% over all 35 workloads
AL-DRAM improves performance on a real system
81
AL-DRAM: Multi-Core Evaluation
Figure: performance improvement for multi-core workloads and averages.
• Average improvement: 14.0% over memory-intensive workloads, 2.9% over non-intensive workloads, 10.4% over all workloads
AL-DRAM provides higher performance for
multi-programmed & multi-threaded workloads
82
Reducing Latency Also Reduces Energy
 AL-DRAM reduces DRAM power consumption by 5.8%

 Major reason: reduction in row activation time

83
AL-DRAM: Advantages & Disadvantages

 Advantages
+ Simple mechanism to reduce latency
+ Significant system performance and energy benefits
+ Benefits higher at low temperature
+ Low cost, low complexity

 Disadvantages
- Need to determine reliable operating latencies for different
temperatures and different DIMMs  higher testing cost
(might not be that difficult for low temperatures)

84
More on AL-DRAM
 Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan,
Vivek Seshadri, Kevin Chang, and Onur Mutlu,
"Adaptive-Latency DRAM: Optimizing DRAM Timing for
the Common-Case"
Proceedings of the 21st International Symposium on High-
Performance Computer Architecture (HPCA), Bay Area, CA,
February 2015.
[Slides (pptx) (pdf)] [Full data sets]

85
Different Types of Latency Variation
 AL-DRAM exploits latency variation
 Across time (different temperatures)
 Across chips

 Is there also latency variation within a chip?


 Across different parts of a chip

86
Variation in Activation Errors
Figure: distribution (min, quartiles, max) of activation error counts across DIMMs as the activation latency is reduced below the 13.1 ns standard, from 7500 rounds of tests over 240 chips. Some DIMMs show very few errors while others are rife with errors: different characteristics across DIMMs.

Modern DRAM chips exhibit significant variation in activation latency.
87
Spatial Locality of Activation Errors
Figure: map of activation errors in one DIMM at tRCD = 7.5 ns.

Activation errors are concentrated at certain columns of cells.
88
Mechanism to Reduce DRAM Latency
• Observation: DRAM timing errors (slow DRAM
cells) are concentrated on certain regions

• Flexible-LatencY (FLY) DRAM


– A software-transparent design that reduces latency

• Key idea:
1) Divide memory into regions of different latencies
2) Memory controller: Use lower latency for regions without
slow cells; higher latency for other regions

Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
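A minimal sketch of the FLY-DRAM key idea as a memory controller might apply it: consult a profiled set of slow regions and pick the tRCD for each activation accordingly. The region granularity, timing values, and profile contents are assumptions.

```python
# Minimal sketch of FLY-DRAM's per-region latency selection. Region
# granularity, timing values, and the profile contents are assumptions.

T_RCD_FAST = 7.5    # ns, for regions with no slow cells (assumed)
T_RCD_SLOW = 13.0   # ns, standard, for regions containing slow cells (assumed)
REGION_SIZE = 512   # rows per region (assumed granularity)

# One-time profiling result: set of region IDs that contain slow cells.
slow_regions = {3, 17, 42}   # hypothetical profile

def trcd_for(row: int) -> float:
    """Latency the controller uses when activating this row."""
    region = row // REGION_SIZE
    return T_RCD_SLOW if region in slow_regions else T_RCD_FAST

print(trcd_for(100))     # region 0: no slow cells -> 7.5 ns
print(trcd_for(1600))    # region 3: contains slow cells -> 13.0 ns
```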
FLY-DRAM Configurations
Figure: fraction of cells assigned each latency, for the baseline (DDR3), profiles of 3 real DIMMs, and an upper bound.
• tRCD: the baseline uses 13 ns for all cells; in the three DIMM profiles, a large majority of cells (99% and 93% in two DIMMs, but only 12% in the third) can use the reduced 7.5 ns; the upper bound uses 7.5 ns for all cells
• tRP: similarly, 99%, 74%, and 13% of cells can use the reduced 7.5 ns across the three DIMM profiles

Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
Results
Figure: performance normalized to the DDR3 baseline, averaged over 40 workloads, for FLY-DRAM with the three DIMM profiles (D1, D2, D3) and an upper bound. FLY-DRAM improves performance by 13.3% to 19.5% depending on the profile, close to the 19.7% upper bound.

FLY-DRAM improves performance by exploiting spatial latency variation in DRAM.

Chang+, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” SIGMETRICS 2016.
FLY-DRAM: Advantages & Disadvantages

 Advantages
+ Reduces latency significantly
+ Exploits significant within-chip latency variation

 Disadvantages
- Need to determine reliable operating latencies for different
parts of a chip  higher testing cost
- Slightly more complicated controller

92
Analysis of Latency Variation in DRAM Chips
 Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh,
Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, and
Onur Mutlu,
"Understanding Latency Variation in Modern DRAM Chips:
Experimental Characterization, Analysis, and Optimization"
Proceedings of the ACM International Conference on Measurement and
Modeling of Computer Systems (SIGMETRICS), Antibes Juan-Les-Pins,
France, June 2016.
[Slides (pptx) (pdf)]
[Source Code]

93
Computer Architecture
Lecture 10b: Memory Latency

Prof. Onur Mutlu


ETH Zürich
Fall 2018
18 October 2018
We did not cover the following slides in lecture.
These are for your benefit.
Spatial Distribution of Failures
How are activation failures spatially distributed in DRAM?

Figure: map of activation failures over DRAM rows and columns (0 to 1024); failures line up along local bitlines, with subarray borders, subarray-edge rows, and remapped rows visible.

Activation failures are highly constrained to local bitlines.
96
Short-term Variation
Does a bitline’s probability of failure change over time?

Figure: each bitline's failure probability at time t2 plotted against its failure probability at time t1; points lie near the diagonal.

A weak bitline is likely to remain weak and a strong bitline is likely to remain strong over time.
97
Short-term Variation
Does a bitline’s probability of failure change over time?

This shows that we can rely on a static profile of weak bitlines to determine whether an access will cause failures.

A weak bitline is likely to remain weak and a strong bitline is likely to remain strong over time.
98
Write Operations
How are write operations affected by reduced tRCD?
Figure: even a cache line that contains weak bitlines completes writes correctly, because a write drives the data into the local row buffer directly.

We can reliably issue write operations with significantly reduced tRCD (e.g., by 77%).
99
Solar-DRAM
Uses a static profile of weak subarray columns
• Identifies subarray columns as weak or strong
• Obtained in a one-time profiling step

Three Components
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

100
Solar-DRAM: VLC (I)
Figure: within a subarray row, some cache-line positions consist only of strong bitlines while others contain weak bitlines.

Identify cache lines comprised of strong bitlines.
Access such cache lines with a reduced tRCD. (See the sketch after this slide.)
Access such cache lines with a reduced tRCD
103
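A minimal sketch of how a controller could apply VLC: consult the static profile of weak subarray columns and use the reduced tRCD only for cache lines made entirely of strong bitlines. The profile contents and timing values are illustrative assumptions.

```python
# Minimal sketch of Solar-DRAM's variable-latency cache lines (VLC).
# Profile contents and timing values are illustrative assumptions.

T_RCD_DEFAULT = 13.0   # ns, standard (assumed)
T_RCD_REDUCED = 7.5    # ns, for cache lines on strong bitlines only (assumed)

# weak_lines[subarray_id] = set of weak cache-line indices (one-time profiling)
weak_lines = {0: {5, 12}, 1: set(), 2: {0}}   # hypothetical profile

def trcd_for_read(subarray: int, cache_line: int) -> float:
    if cache_line in weak_lines.get(subarray, set()):
        return T_RCD_DEFAULT   # a weak bitline is present: keep the safe latency
    return T_RCD_REDUCED       # all bitlines strong: fetch with reduced tRCD

print(trcd_for_read(subarray=0, cache_line=5))   # weak -> 13.0 ns
print(trcd_for_read(subarray=1, cache_line=5))   # strong -> 7.5 ns
```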
Solar-DRAM
Uses a static profile of weak subarray columns
• Identifies subarray columns as weak or strong
• Obtained in a one-time profiling step

Three Components
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

104
Solar-DRAM: RSC (II)
Figure: cache line 0 of a row may fall on weak bitlines in some subarrays.

Remap cache lines across DRAM at the memory controller level so that cache line 0 will likely map to a strong cache line.
105
Solar-DRAM
Uses a static profile of weak subarray columns
• Identifies subarray columns as weak or strong
• Obtained in a one-time profiling step

Three Components
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

106
Solar-DRAM: RLW (III)
All bitlines are strong when issuing writes.

Write to all locations in DRAM with a significantly reduced tRCD (e.g., by 77%).
107
More on Solar-DRAM
 Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu,
"Solar-DRAM: Reducing DRAM Access Latency by Exploiting
the Variation in Local Bitlines"
Proceedings of the 36th IEEE International Conference on Computer
Design (ICCD), Orlando, FL, USA, October 2018.

108
Why Is There
Spatial Latency Variation
Within a Chip?

109
What Is Design-Induced Variation?
Figure: within a subarray, access time varies systematically with a cell's distance from the wordline drivers (across a column) and from the sense amplifiers (across a row); cells far from both are inherently slow, cells near both are inherently fast.

Systematic variation in cell access times caused by the physical organization of DRAM.
110
DIVA Online Profiling
Design-Induced-Variation-Aware
Figure: only the inherently slow region (cells farthest from the wordline drivers and sense amplifiers) needs to be profiled.

Profile only slow regions to determine the minimum latency
→ Dynamic & low-cost latency optimization
111
DIVA Online Profiling
Design-Induced-Variation-Aware
• Process variation → random slow cells → random errors → handled with an error-correcting code
• Design-induced variation → inherently slow regions → localized errors → handled with online profiling

Combine error-correcting codes & online profiling
→ Reliably reduce DRAM latency
112
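A minimal sketch of the profiling loop this suggests: test only the inherently slow region at candidate latencies and adopt the lowest one that passes, leaving random process-variation outliers to ECC. The region choice, candidate latencies, and test routine are placeholders, not the mechanism's actual parameters.

```python
# Minimal sketch of DIVA-style online profiling. The slow-region choice,
# candidate latencies, and the test routine are placeholder assumptions.

CANDIDATE_TRCD = [7.5, 10.0, 11.25, 13.0]   # ns, tried from fastest to slowest

def rows_in_slow_region():
    """Rows farthest from the wordline drivers/sense amplifiers (placeholder)."""
    return range(480, 512)   # hypothetical: last rows of a 512-row subarray

def test_row(row: int, trcd_ns: float) -> bool:
    """Write/read-back test of one row at the given latency (placeholder)."""
    return trcd_ns >= 10.0   # stand-in result: this module passes at >= 10 ns

def diva_profile() -> float:
    for trcd in CANDIDATE_TRCD:                       # fastest candidate first
        if all(test_row(r, trcd) for r in rows_in_slow_region()):
            return trcd                               # lowest reliable latency
    return CANDIDATE_TRCD[-1]                         # fall back to standard

print(f"Selected tRCD: {diva_profile()} ns")
```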
DIVA-DRAM Reduces Latency
Figure: read and write latency reduction at 55°C and 85°C for AL-DRAM, DIVA Profiling, and DIVA Profiling + Shuffling. The DIVA variants reduce read latency by roughly 35–37% and write latency by roughly 39–41% at both temperatures, more than AL-DRAM at the corresponding temperatures.

DIVA-DRAM reduces latency more aggressively and uses ECC to correct random slow cells.
113
DIVA-DRAM: Advantages & Disadvantages

 Advantages
++ Automatically finds the lowest reliable operating latency
at system runtime (lower production-time testing cost)
+ Reduces latency more than prior methods (w/ ECC)
+ Reduces latency at high temperatures as well

 Disadvantages
- Requires knowledge of inherently-slow regions
- Requires ECC (Error Correcting Codes)
- Imposes overhead during runtime profiling

114
Design-Induced Latency Variation in DRAM
 Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose,
Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and
Onur Mutlu,
"Design-Induced Latency Variation in Modern DRAM Chips:
Characterization, Analysis, and Latency Reduction Mechanisms"
Proceedings of the ACM International Conference on Measurement and
Modeling of Computer Systems (SIGMETRICS), Urbana-Champaign, IL,
USA, June 2017.

115
