Computer Architecture - Fall 2020 - Lecture 11b: Memory Interference and QoS
Memory System: A Shared Resource View
[Figure: many cores, each with private L1 and L2 caches, all sharing the storage/main memory system]
Memory System is the Major Shared Resource
Threads' requests interfere with each other in the shared memory system.
Much More of a Shared Resource in Future
Inter-Thread/Application Interference
Problem: Threads share the memory system, but the memory system does not distinguish between threads' requests.
Unfair Slowdowns due to Interference
[Figure: matlab and gcc running on different cores share the L2 caches, the interconnect, the DRAM memory controller, and the shared DRAM memory system; the result is unfairness]
A Memory Performance Hog
STREAM: // initialize large arrays A, B — sequential memory access, very high row-buffer locality (96% hit rate), memory intensive.
RANDOM: // initialize large arrays A, B — random memory access, very low row-buffer locality (3% hit rate), similarly memory intensive.
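As a concrete illustration, a minimal sketch of what two such microbenchmarks might look like (the array size, the use of rand(), and the kernel bodies are assumptions for illustration, not the original code):

#include <stdio.h>
#include <stdlib.h>

#define N (64 * 1024 * 1024)  /* assumed array size: large enough to overflow the caches */

/* STREAM-like kernel: sequential accesses; consecutive elements fall into the same DRAM row,
 * so almost every access is a row-buffer hit (high row-buffer locality). */
static void stream_kernel(int *A, int *B) {
    for (long i = 0; i < N; i++)
        A[i] = B[i] + 1;
}

/* RANDOM-like kernel: each access touches an arbitrary element, so almost every access
 * opens a different DRAM row (very low row-buffer locality), yet it is just as memory intensive. */
static void random_kernel(int *A, int *B) {
    for (long i = 0; i < N; i++)
        A[rand() % N] = B[rand() % N] + 1;   /* rand() used only for simplicity */
}

int main(void) {
    int *A = malloc(N * sizeof(int));        /* initialize large arrays A, B */
    int *B = calloc(N, sizeof(int));
    if (!A || !B) return 1;
    stream_kernel(A, B);
    random_kernel(A, B);
    printf("done\n");
    free(A);
    free(B);
    return 0;
}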
What Does the Memory Hog Do?
[Figure: memory request buffer and row buffer — the hog's (T0) requests to Row 0 hit the open row and are serviced back to back, while the other thread's (T1) requests to other rows keep waiting]
DRAM Controllers
A row-conflict memory access takes significantly longer than a row-hit access.
[Figure: slowdowns of STREAM, RANDOM, and gcc when sharing the memory system]
Greater Problem with More Cores
Distributed DoS in Networked Multi-Core Systems
Attackers (Cores 1-8) vs. a stock option pricing application (Cores 9-64): ~5000X latency increase.
http://www.youtube.com/watch?v=VJzZbwgBfy8
More on Interconnect-Based Starvation
Boris Grot, Stephen W. Keckler, and Onur Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip", Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 268-279, New York, NY, December 2009. Slides (pdf)
How Do We Solve The Problem?
Inter-thread interference is uncontrolled in all memory resources: the memory controller, the interconnect, and the caches.
We need to control it, i.e., design an interference-aware (QoS-aware) memory system.
QoS-Aware Memory Systems: Challenges
How do we reduce inter-thread interference?
Improve system performance and core utilization.
Reduce request serialization and core starvation.
Fundamental Interference Control Techniques
Goal: to reduce/control inter-thread memory interference
1. Prioritization or request scheduling
2. Data mapping to banks/channels/ranks
3. Core/source throttling
4. Application/thread scheduling
QoS-Aware Memory Scheduling
[Figure: multiple cores sharing a memory controller]
The memory controller resolves memory contention by scheduling requests.
QoS-Aware Memory Scheduling: Evolution
Stall-time fair memory scheduling [Mutlu+ MICRO'07]
Idea: Estimate and balance thread slowdowns.
Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive).
QoS-Aware Memory Scheduling: Evolution
BLISS: Blacklisting Memory Scheduler [Subramanian+ ICCD'14, TPDS'16]
Idea: Deprioritize (i.e., blacklist) a thread that has had a large number of consecutive requests serviced.
Takeaway: Blacklisting greatly reduces interference and enables the scheduler to be simple, without requiring full thread ranking.
How Do We Solve the Problem?
Stall-time fair memory scheduling [Mutlu+ MICRO’07]
40
Stall-Time Fairness in Shared DRAM Systems
A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system.
STFM: Tracks ST_shared and estimates ST_alone for each thread; memory slowdown = ST_shared / ST_alone, and unfairness = max slowdown / min slowdown.
If unfairness < α (a threshold): use the DRAM-throughput-oriented scheduling policy.
If unfairness ≥ α: use the fairness-oriented scheduling policy (prioritize requests of the most-slowed-down thread). A minimal sketch of this decision follows.
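A minimal sketch of the STFM policy decision, assuming the controller maintains the two stall-time counters per thread (the struct and helper names are illustrative, not from the paper):

#define NUM_THREADS 4

/* Per-thread stall-time counters tracked/estimated by the memory controller. */
typedef struct {
    double st_shared;  /* measured memory stall time when running with others */
    double st_alone;   /* estimated memory stall time if the thread ran alone */
} stfm_state_t;

/* Memory slowdown of one thread: ST_shared / ST_alone (ST_alone assumed nonzero). */
static double stfm_slowdown(const stfm_state_t *t) {
    return t->st_shared / t->st_alone;
}

/* Returns 1 if the fairness-oriented policy should be used:
 * unfairness = max slowdown / min slowdown, compared against the threshold alpha. */
static int use_fairness_policy(const stfm_state_t th[NUM_THREADS], double alpha) {
    double max_s = 0.0, min_s = 1e30;
    for (int i = 0; i < NUM_THREADS; i++) {
        double s = stfm_slowdown(&th[i]);
        if (s > max_s) max_s = s;
        if (s < min_s) min_s = s;
    }
    return (max_s / min_s) >= alpha;  /* >= alpha: prioritize the most-slowed-down thread */
}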
[Figure: example of STFM in action — the slowdowns of T0 and T1 and the resulting unfairness are tracked over time as requests are serviced]
STFM Pros and Cons
Upsides:
First algorithm for fair multi-core memory scheduling
Provides a mechanism to estimate memory slowdown
of a thread
Good at providing fairness
Being fair can improve performance
Downsides:
Does not handle all types of interference
(Somewhat) complex to implement
Slowdown estimations can be incorrect
44
More on STFM
Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", Proceedings of the 40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. [Summary] [Slides (ppt)]
Parallelism-Aware Batch Scheduling
Bank Parallelism Interference in DRAM
[Figure: with the baseline scheduler, each thread's 2 DRAM requests to Bank 0 and Bank 1 are interleaved with the other thread's, serializing both threads' bank accesses]
Parallelism-Aware Scheduler
[Figure: the parallelism-aware scheduler services each thread's 2 DRAM requests to different banks back to back, overlapping their latencies]
Parallelism-Aware Batch Scheduling (PAR-BS)
Principle 1: Parallelism-awareness
Schedule requests from a thread (to different banks) back to back.
Preserves each thread's bank parallelism.
But, this can cause starvation...
Principle 2: Request Batching
Group a fixed number of oldest requests from each thread into a "batch".
Service the batch before all other requests.
Form a new batch when the current one is done.
Eliminates starvation, provides fairness.
Allows parallelism-awareness within a batch.
[Figure: a batch of requests from threads T0-T3 queued at Bank 0 and Bank 1]
Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.
PAR-BS Components
Request batching
Within-batch scheduling (parallelism-aware)
Request Batching
Each memory request has a bit (marked) associated
with it
Batch formation:
Mark up to Marking-Cap oldest requests per bank for each
thread
Marked requests constitute the batch
Form a new batch when no marked requests are left
Within-Batch Scheduling: Thread Ranking
Key Idea: Within a batch, rank threads shortest-job-first using two measures of their marked requests: max-bank-load (the largest number of marked requests to any single bank) and total-load (the total number of marked requests). Threads with smaller loads are ranked higher.
Example:
T0: max-bank-load 1, total-load 3
T1: max-bank-load 2, total-load 4
T2: max-bank-load 2, total-load 6
T3: max-bank-load 5, total-load 9
Resulting rank order: T0 > T1 > T2 > T3. A sketch of this ranking follows.
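A sketch of the shortest-job-first style ranking implied by the table above: rank by max-bank-load, breaking ties with total-load (the 2D array layout of the queued marked requests is an illustrative assumption):

#define NUM_THREADS 4
#define NUM_BANKS   4

/* marked[t][b] = number of marked (batched) requests of thread t queued at bank b */
static int max_bank_load(const int marked[NUM_THREADS][NUM_BANKS], int t) {
    int m = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        if (marked[t][b] > m) m = marked[t][b];
    return m;
}

static int total_load(const int marked[NUM_THREADS][NUM_BANKS], int t) {
    int sum = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        sum += marked[t][b];
    return sum;
}

/* Returns <0 if thread a should be ranked higher (serviced earlier) than thread b. */
static int parbs_rank_cmp(const int marked[NUM_THREADS][NUM_BANKS], int a, int b) {
    int d = max_bank_load(marked, a) - max_bank_load(marked, b);
    if (d != 0) return d;                                    /* smaller max-bank-load ranks higher */
    return total_load(marked, a) - total_load(marked, b);    /* tie-break: smaller total load */
}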
Example Within-Batch Scheduling
[Figure: the same batch of requests from T0-T3 scheduled across two banks in baseline (arrival) order vs. PAR-BS rank order]
Baseline (arrival order): thread stall times of 4, 4, 5, and 7 — average of 5 bank access latencies.
PAR-BS order: thread stall times of 1, 2, 4, and 7 — average of 3.5 bank access latencies.
Putting It Together: PAR-BS Scheduling Policy
(1) Marked requests first (batching)
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first
Rules (2)-(4) form the parallelism-aware within-batch scheduling; a sketch of the prioritization order follows.
Three properties:
Exploits row-buffer locality and intra-thread bank parallelism.
Work-conserving: services unmarked requests to banks without marked requests.
Marking-Cap is important: too small a cap destroys row-buffer locality; too large a cap penalizes memory non-intensive threads.
Many more trade-offs are analyzed in the paper.
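The four rules can be expressed as a single request comparator, sketched below (the struct fields are illustrative names for state the controller already tracks):

#include <stdint.h>

typedef struct {
    int      marked;       /* belongs to the current batch */
    int      row_hit;      /* targets the currently open row in its bank */
    int      thread_rank;  /* rank of the issuing thread within the batch (0 = highest) */
    uint64_t arrival;      /* arrival time in cycles */
} mem_req_t;

/* Returns nonzero if request a should be scheduled before request b, following PAR-BS:
 * (1) marked first, (2) row-hit first, (3) higher-rank thread first, (4) oldest first. */
static int parbs_before(const mem_req_t *a, const mem_req_t *b) {
    if (a->marked      != b->marked)      return a->marked > b->marked;
    if (a->row_hit     != b->row_hit)     return a->row_hit > b->row_hit;
    if (a->thread_rank != b->thread_rank) return a->thread_rank < b->thread_rank;
    return a->arrival < b->arrival;
}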
Hardware Cost
<1.5KB storage cost for
8-core system with 128-entry memory request buffer
60
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Figure: unfairness (lower is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS improves unfairness over the best previous scheduler by 1.11X, 1.11X, and 1.08X respectively]
System Performance (Hmean Speedup)
[Figure: normalized harmonic-mean speedup of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS improves system performance by 8.3%, 6.1%, and 5.1% respectively]
PAR-BS Pros and Cons
Upsides:
First scheduler to address bank parallelism destruction
across multiple threads
Simple mechanism (vs. STFM)
Batching provides fairness
Ranking enables parallelism awareness
Downsides:
Does not always prioritize the latency-sensitive
applications
63
More on PAR-BS
Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems", Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. [Summary] [Slides (ppt)]
One of the 12 computer architecture papers of 2008 selected as Top Picks by IEEE Micro.
http://www.youtube.com/watch?v=UB1kgYR-4V0
More on PAR-BS
Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enabling High-Performance and
Fair Memory Controllers"
IEEE Micro, Special Issue: Micro's Top Picks from 2008 Computer Architecture
Conferences (MICRO TOP PICKS), Vol. 29, No. 1, pages 22-32,
January/February 2009.
65
ATLAS Memory Scheduler
[Figure: system throughput as the number of memory controllers varies (1, 2, 4, 8, 16), with ATLAS's labeled gains of 5.9%, 8.4%, 9.8%, and 17.0% over previous schedulers]
ATLAS consistently provides higher system throughput than all previous scheduling algorithms.
System Throughput: 4-MC System
[Figure: system throughput of PAR-BS vs. ATLAS as the number of cores varies (4, 8, 16, 24, 32); ATLAS's labeled gains are 1.1%, 3.5%, 4.0%, 8.4%, and 10.8%, growing with the core count]
ATLAS Pros and Cons
Upsides:
Good at improving overall throughput (compute-intensive threads are prioritized).
Low complexity.
Coordination among controllers happens infrequently.
Downsides:
Lowest- and medium-ranked threads get delayed significantly → high unfairness.
More on ATLAS Memory Scheduler
Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter, "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers", Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA), Bangalore, India, January 2010. Slides (pptx)
TCM: Thread Cluster Memory Scheduling
[Figure: maximum slowdown vs. weighted speedup (better system throughput to the right) for previous schedulers such as ATLAS; none is near the ideal corner]
No previous memory scheduling algorithm provides both the best fairness and the best system throughput.
Throughput vs. Fairness
Throughput-biased approach: prioritize less memory-intensive threads.
Fairness-biased approach: take turns accessing memory.
Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle the thread ranking.
Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically.
Thread Cluster Memory Scheduling [Kim+ MICRO'10]
1. Group threads into two clusters.
2. Prioritize the non-intensive cluster.
3. Use different policies for each cluster.
[Figure: threads are divided into a memory-non-intensive cluster (prioritized, for throughput) and a memory-intensive cluster; the non-intensive cluster gets higher priority]
Clustering Threads
Step 1: Sort threads by MPKI (misses per kilo-instruction).
Step 2: Starting from the lowest-MPKI thread, place threads into the non-intensive cluster until their combined memory bandwidth usage exceeds αT, where T is the total memory bandwidth usage and α is the ClusterThreshold (α < 10%); the remaining threads form the intensive cluster. A sketch of this step follows.
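A minimal sketch of the clustering step under the assumptions above (the struct layout is illustrative; whether the thread that crosses the threshold joins the intensive cluster is an approximation of the paper's exact rule):

#include <stdlib.h>

typedef struct {
    int    id;
    double mpki;       /* misses per kilo-instruction in the last quantum */
    double bw_usage;   /* memory bandwidth used by the thread in the last quantum */
    int    intensive;  /* output: 1 if placed in the intensive cluster */
} tcm_thread_t;

static int by_mpki(const void *a, const void *b) {
    double d = ((const tcm_thread_t *)a)->mpki - ((const tcm_thread_t *)b)->mpki;
    return (d > 0) - (d < 0);
}

/* cluster_threshold is the alpha on the slide, e.g. 0.10 (alpha < 10%). */
static void tcm_cluster(tcm_thread_t *th, int n, double cluster_threshold) {
    double total_bw = 0.0, used = 0.0;
    for (int i = 0; i < n; i++)
        total_bw += th[i].bw_usage;
    qsort(th, n, sizeof(th[0]), by_mpki);               /* Step 1: sort by MPKI */
    for (int i = 0; i < n; i++) {                       /* Step 2: fill the non-intensive cluster */
        used += th[i].bw_usage;
        th[i].intensive = (used > cluster_threshold * total_bw);
    }
}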
TCM Outline: 1. Clustering; 2. Prioritization Between Clusters; 3. Non-Intensive Cluster (throughput); 4. Intensive Cluster (fairness)
Prioritization Between Clusters
Prioritize the non-intensive cluster over the intensive cluster.
Increases system throughput: non-intensive threads have greater potential for making progress.
Does not degrade fairness: non-intensive threads are "light" and rarely interfere with intensive threads.
Non-Intensive Cluster
Prioritize threads according to MPKI: lowest MPKI → highest priority, highest MPKI → lowest priority.
Intensive Cluster
Periodically shuffle the priority of threads, so that each intensive thread takes turns being the most prioritized.
Case Study: A Tale of Two Threads
Two intensive threads contending: 1. random-access, 2. streaming. Which is slowed down more easily?
[Figure: when the streaming thread is prioritized, the random-access thread slows down 7x while streaming stays at about 1x; when the random-access thread is prioritized, both stay near 1x]
Why are Threads Different?
random-access: requests are spread across all banks in parallel (high bank-level parallelism); if any one request gets stuck, the thread's progress stalls, so it is vulnerable to interference.
streaming: all requests go to the same activated row (high row-buffer locality).
[Figure: request queues at Banks 1-4 for the two threads]
Niceness
How do we quantify the difference between threads? Define a niceness metric: threads with high bank-level parallelism are vulnerable to interference (more nice), while threads with high row-buffer locality cause interference (less nice).
TCM: Quantum-Based Operation
Time is divided into quanta (~1M cycles each); ranks within the intensive cluster are shuffled at a finer shuffle interval (~1K cycles).
During a quantum, monitor each thread's behavior: 1. memory intensity, 2. bank-level parallelism, 3. row-buffer locality.
At the beginning of each quantum: perform clustering and compute the niceness of intensive threads.
TCM: Scheduling Algorithm
1. Highest-rank first: requests from higher-ranked threads are prioritized.
   Non-intensive cluster > intensive cluster.
   Non-intensive cluster: lower intensity → higher rank.
   Intensive cluster: rank shuffling.
2. Row-hit first: row-buffer hit requests are prioritized.
3. Oldest first: older requests are prioritized.
A sketch of the thread prioritization follows.
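A sketch of rule 1 as a thread comparator (non-intensive cluster above intensive; within the non-intensive cluster, lower MPKI first; within the intensive cluster, the current shuffled rank decides). The struct fields are illustrative names for per-thread state the controller keeps:

typedef struct {
    int    intensive;     /* 1 if the thread is in the intensive cluster */
    double mpki;          /* memory intensity (used within the non-intensive cluster) */
    int    shuffle_rank;  /* periodically shuffled rank (used within the intensive cluster) */
} tcm_rank_t;

/* Returns nonzero if thread a has higher scheduling priority than thread b. */
static int tcm_higher_priority(const tcm_rank_t *a, const tcm_rank_t *b) {
    if (a->intensive != b->intensive)
        return !a->intensive;                  /* non-intensive cluster > intensive cluster */
    if (!a->intensive)
        return a->mpki < b->mpki;              /* non-intensive: lower intensity -> higher rank */
    return a->shuffle_rank < b->shuffle_rank;  /* intensive: rank shuffling decides */
}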
TCM: Implementation Cost
Required storage at memory controller (24 cores)
MPKI ~0.2kb
Bank-level parallelism ~0.6kb
Row-buffer locality ~2.9kb
Total < 4kbits
• No computation is on the critical path
91
Previous Work
FR-FCFS [Rixner et al., ISCA 2000]: prioritizes row-buffer hits — thread-oblivious → low throughput and low fairness.
[Figure: maximum slowdown vs. weighted speedup (better system throughput to the right) for STFM, PAR-BS, ATLAS, and TCM]
TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.
TCM: Fairness-Throughput Tradeoff
When the configuration parameter (ClusterThreshold) is varied...
[Figure: maximum slowdown (better fairness downward) vs. weighted speedup (better system throughput to the right); FRFCFS, STFM, PAR-BS, and ATLAS are fixed points, while adjusting ClusterThreshold traces a curve for TCM that dominates them, exposing a fairness-throughput trade-off]
Conclusion
• No previous memory scheduling algorithm provides
both high system throughput and fairness
– Problem: They use a single policy for all threads
96
TCM Pros and Cons
Upsides:
Provides both high fairness and high performance
Caters to the needs for different types of threads
(latency vs. bandwidth sensitive)
(Relatively) simple
Downsides:
Scalability to large buffer sizes?
Robustness of clustering and shuffling algorithms?
Ranking is still too complex?
97
More on TCM
Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior", Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)
The Blacklisting Memory Scheduler
Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu, "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost", Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD), Seoul, South Korea, October 2014. [Slides (pptx) (pdf)]
Tackling Inter-Application Interference: Application-Aware Memory Scheduling
Application-aware schedulers monitor application memory behavior, rank applications, and enforce the ranking when scheduling requests.
[Figure: each request in the request buffer carries an application ID that is compared against the computed ranking to pick the highest-ranked request]
Full ranking increases critical path latency and area significantly in order to improve performance and fairness.
Performance vs. Fairness vs. Simplicity
[Figure: three-way comparison — the application-unaware FRFCFS is simple but has low performance and fairness; application-aware schedulers (PARBS, ATLAS, TCM) gain performance and fairness through ranking at high complexity; the proposed Blacklisting scheduler (no ranking) aims for all three]
Benefit 1: Low complexity compared to ranking.
Benefit 2: Lower slowdowns than ranking.
Key Observation 1: Group Rather Than Rank
Observation 1: It is sufficient to separate applications into two groups (interference-causing vs. vulnerable), rather than do full ranking.
Basic Idea:
Group applications with a large number of consecutive requests as interference-causing → blacklisting.
Deprioritize blacklisted applications.
Clear the blacklist periodically (every 1000s of cycles).
Benefits:
Lower complexity.
Finer-grained grouping decisions → lower unfairness.
A minimal sketch of the blacklisting logic follows.
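A minimal sketch of the blacklisting bookkeeping, assuming a consecutive-request threshold of 4 and a clearing interval of 10,000 cycles — the slide only says "a large number" and "1000s of cycles", so both constants are illustrative:

#include <stdint.h>
#include <string.h>

#define MAX_APPS            16
#define BLACKLIST_THRESHOLD 4       /* assumed: consecutive serviced requests before blacklisting */
#define CLEAR_INTERVAL      10000   /* assumed: cycles between blacklist clears */

static int      blacklisted[MAX_APPS];
static int      last_app = -1;          /* application whose request was serviced last */
static int      streak   = 0;           /* length of the current consecutive-service streak */
static uint64_t last_clear_cycle = 0;

/* Call whenever a request from application `app` is serviced. */
static void bliss_on_request_served(int app) {
    streak = (app == last_app) ? streak + 1 : 1;
    last_app = app;
    if (streak >= BLACKLIST_THRESHOLD)
        blacklisted[app] = 1;            /* interference-causing: deprioritize its requests */
}

/* Call every cycle: clear the blacklist periodically. */
static void bliss_tick(uint64_t cycle) {
    if (cycle - last_clear_cycle >= CLEAR_INTERVAL) {
        memset(blacklisted, 0, sizeof(blacklisted));
        last_clear_cycle = cycle;
    }
}

/* Scheduler rule: requests of non-blacklisted applications are prioritized over
 * requests of blacklisted ones (then row-hit first, then oldest first). */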
Performance vs. Fairness vs. Simplicity
[Figure: compared with FRFCFS, FRFCFS-Cap, PARBS, ATLAS, and TCM, the Blacklisting scheduler is close to the fairest, achieves the highest performance, and is close to the simplest]
Performance and Fairness
[Figure: unfairness (lower is better) vs. performance (higher is better) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; the annotations show roughly 5% higher performance and 21% lower unfairness for Blacklisting relative to the best previous schedulers]
1. Blacklisting achieves the highest performance.
2. Blacklisting balances performance and fairness.
Complexity
[Figure: scheduler area (sq. um) vs. critical path latency (ns) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; the annotations show 43% and 70% reductions for Blacklisting relative to the ranking-based schedulers]
Blacklisting reduces complexity significantly.
More on BLISS (I)
Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu, "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost", Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD), Seoul, South Korea, October 2014. [Slides (pptx) (pdf)]
More on BLISS: Longer Version
Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu, "BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
Computer Architecture
Lecture 11b: Memory
Interference
and Quality of Service
Prof. Onur Mutlu
ETH Zürich
Fall 2020
29 October 2020
Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.
[Figure: Stage 3 — the DRAM command scheduler issues commands for Banks 1-4 to DRAM]
SMS: Staged Memory Scheduling
[Figure: requests from Cores 1-4 and the GPU flow through Stage 1 (batch formation), Stage 2 (batch scheduler), and Stage 3 (DRAM command scheduler for Banks 1-4), which issues commands to DRAM]
Putting Everything Together
[Figure: the complete SMS design, with per-source batch formation in Stage 1 feeding the batch scheduler and the per-bank DRAM command schedulers]
Complexity
Compared to a row-hit-first scheduler, SMS consumes 66% less area and 46% less static power.
Performance at Different GPU Weights
[Figure: system performance vs. GPUweight (0.001 to 1000) for SMS and the best previous scheduler at each weight (FR-FCFS, ATLAS, or TCM)]
At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight.
More on SMS
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pptx)
DASH Memory Scheduler
[TACO 2016]
120
Current SoC Architectures
[Figure: multiple CPUs (and accelerators) sharing a DRAM controller and DRAM]
Basic Idea:
Different CPU applications and hardware accelerators have different memory requirements.
Track the progress of the different agents and prioritize accordingly.
Key Observation:
Distribute Priority for Accelerators
• GPU/accelerators need priority to meet deadlines
• Worst case prioritization not always the best
• Prioritize when they are not on track to meet a
deadline
124
Key Observation:
Not All Accelerators are Equal
• Long-deadline accelerators are more likely to
meet their deadlines
• Short-deadline accelerators are more likely to
miss their deadlines
125
Key Observation:
Not All CPU cores are Equal
• Memory-intensive cores are much less
vulnerable to interference
• Memory non-intensive cores are much more
vulnerable to interference
126
DASH Summary:
Key Ideas and Results
• Distribute priority for HWAs
• Prioritize HWAs over memory-intensive CPU
cores even when not urgent
• Prioritize short-deadline-period HWAs based
on worst case estimates
127
DASH: Scheduling Policy
1. Short-deadline-period HWAs with high priority
2. Long-deadline-period HWAs with high priority
3. Memory non-intensive CPU applications
4. Long-deadline-period HWAs with low priority
5. Memory-intensive CPU applications
6. Short-deadline-period HWAs with low priority
The scheduler switches probabilistically between the low-priority HWAs and the memory-intensive CPU applications. A sketch of this priority ordering follows.
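A sketch that maps each agent class to a numeric priority level following the order listed above. The enum, the struct, and the interpretation of "high priority" as "not on track to meet its deadline" (per the earlier key-observation slide) are illustrative assumptions; the probabilistic switching noted on the slide is only mentioned in a comment, not implemented:

typedef enum { SHORT_DEADLINE_HWA, LONG_DEADLINE_HWA, CPU_APP } agent_kind_t;

typedef struct {
    agent_kind_t kind;
    int urgent;            /* HWA currently not on track to meet its deadline -> high priority */
    int memory_intensive;  /* CPU application classified as memory intensive */
} agent_t;

/* Smaller number = higher scheduling priority, following the order on the slide.
 * (The slide notes that the scheduler switches probabilistically between the last
 * two classes; that refinement is omitted here.) */
static int dash_priority_level(const agent_t *a) {
    if (a->kind == SHORT_DEADLINE_HWA && a->urgent) return 1;
    if (a->kind == LONG_DEADLINE_HWA  && a->urgent) return 2;
    if (a->kind == CPU_APP && !a->memory_intensive) return 3;
    if (a->kind == LONG_DEADLINE_HWA)               return 4;
    if (a->kind == CPU_APP)                         return 5;
    return 6;  /* short-deadline-period HWA with low priority */
}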
More on DASH
Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and Onur Mutlu, "DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators", ACM Transactions on Architecture and Code Optimization (TACO), 2016.
Predictable Performance: Strong Memory Service Guarantees
Goal: Predictable Performance in Complex Systems
[Figure: CPUs, a GPU, and HWAs sharing the memory system]
How do we allocate resources to heterogeneous agents to mitigate interference and provide predictable performance?
Strong Memory Service
Guarantees
Goal: Satisfy performance/SLA requirements in the
presence of shared main memory, heterogeneous
agents, and hybrid memory/storage
Approach:
Develop techniques/models to accurately estimate the
performance loss of an application/agent in the
presence of resource sharing
Develop mechanisms (hardware and software) to
enable the resource partitioning/prioritization needed
to achieve the required performance levels for all
applications
All the while providing high system performance
Subramanian et al., "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems," HPCA 2013.
Subramanian et al., "The Application Slowdown Model," MICRO 2015.
Predictable Performance Readings (I)
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems", Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)
Predictable Performance Readings (II)
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems", Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)
Predictable Performance Readings (III)
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu, "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory", Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]
MISE:
Providing Performance
Predictability
in Shared Main Memory
Systems
Lavanya Subramanian, Vivek Seshadri,
Yoongu Kim, Ben Jaiyen, Onur Mutlu
136
Unpredictable Application Slowdowns
[Figure: slowdowns of leslie3d (core 0) with gcc (core 1) vs. slowdowns of leslie3d (core 0) with mcf (core 1) — an application's slowdown depends heavily on which application it runs with]
Need for Predictable Performance
There is a need for predictable performance when multiple applications share resources, especially if some applications require performance guarantees.
This work: 1. estimate slowdowns, 2. control slowdowns.
Outline
1. Estimate Slowdown
Key Observations
Implementation
2. Control Slowdown
Providing Soft Slowdown
Guarantees
Minimizing Maximum Slowdown
140
Slowdown: Definition
Slowdown = Performance_Alone / Performance_Shared
Key Observation 1
For a memory-bound application, performance is proportional to its memory request service rate, so
Slowdown = Performance_Alone / Performance_Shared (harder to measure) ≈ Request Service Rate Alone / Request Service Rate Shared (easier to measure).
[Figure: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar, measured on an Intel Core i7 with 4 cores and 8.5 GB/s memory bandwidth — the relationship is close to linear]
Key Observation 2
Request Service Rate Alone (RSRAlone) of an application
can be estimated by giving the application highest
priority in accessing memory
143
[Figure: request buffer and service-order timelines — when an application runs alone vs. when it is given the highest priority among co-running applications, its requests are serviced in nearly the same order and time]
The Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:
Slowdown = RSR_Alone / RSR_Shared
Key Observation 3
A non-memory-bound application alternates between compute phases and memory phases. Let α be the fraction of execution time spent in memory phases; only that memory fraction slows down with interference. The MISE model for non-memory-bound applications is therefore:
Slowdown = (1 − α) + α × RSR_Alone / RSR_Shared
A small worked sketch follows.
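The formula above, written as a small helper with a worked example (the function wrapper is illustrative; the formula itself is the one on the slide):

/* MISE slowdown estimate: alpha is the fraction of time spent in memory phases.
 * For a fully memory-bound application (alpha = 1) this reduces to rsr_alone / rsr_shared. */
static double mise_slowdown(double alpha, double rsr_alone, double rsr_shared) {
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared);
}

/* Example: alpha = 0.5 and RSR_Alone = 2 x RSR_Shared  ->  slowdown = 0.5 + 0.5 * 2 = 1.5 */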
Interval-Based Operation
[Figure: execution time is divided into intervals; the slowdown estimate is updated at the end of each interval]
Measuring RSRShared and α
Request Service Rate Shared (RSR_Shared): a per-core counter tracks the number of requests serviced; at the end of each interval,
RSR_Shared = Number of Requests Serviced / Interval Length
The memory phase fraction α is measured similarly with a per-core counter over each interval.
Estimating Request Service Rate Alone (RSRAlone)
Goal: estimate RSR_Alone. How: periodically give each application the highest priority in accessing memory.
Divide each interval into shorter epochs. At the beginning of each epoch, the memory controller randomly picks one application as the highest-priority application. At the end of the interval, for each application, estimate
RSR_Alone = Number of Requests Serviced During High-Priority Epochs / Number of Cycles Application Was Given High Priority
A counter-level sketch of both measurements follows.
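A minimal sketch of the per-core counters behind the two measurements (the struct and field names are illustrative; the next slides refine RSR_Alone by also subtracting interference cycles, which is omitted here):

#include <stdint.h>

typedef struct {
    /* accumulated over the current interval */
    uint64_t requests_serviced;     /* all requests serviced for this core */
    uint64_t hp_requests_serviced;  /* requests serviced while this core had highest priority */
    uint64_t hp_cycles;             /* cycles this core was the highest-priority application */
    uint64_t interval_cycles;       /* length of the interval in cycles */
} mise_counters_t;

static double rsr_shared(const mise_counters_t *c) {
    return (double)c->requests_serviced / (double)c->interval_cycles;
}

/* Estimated as if run alone: only the high-priority epochs are counted. */
static double rsr_alone_estimate(const mise_counters_t *c) {
    return (double)c->hp_requests_serviced / (double)c->hp_cycles;
}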
Inaccuracy in Estimating RSRAlone
Even when an application has the highest priority, it still experiences some interference from other applications' requests; the cycles during which its requests are delayed by such requests are interference cycles.
[Figure: request buffer and service-order timelines illustrating the interference cycles experienced by the highest-priority application]
Accounting for Interference in RSRAlone Estimation
Solution: determine and remove interference cycles from the RSR_Alone calculation — a cycle is an interference cycle if a request from the highest-priority application is waiting in the request buffer and another application's request was issued previously.
MISE Model: Putting It All Together
[Figure: interval-based operation — during each interval, measure RSR_Shared and α and estimate RSR_Alone; at the end of each interval, estimate slowdown using the MISE model]
Previous Work on Slowdown Estimation
STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO '07]
FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS '10]
Per-thread Cycle Accounting [Du Bois+, HiPEAC '13]
Advantage 2 of MISE: STFM does not take the compute phase into account for non-memory-bound applications; MISE accounts for the compute phase → better accuracy.
Methodology
Configuration of our simulated system
4 cores
1 channel, 8 banks/channel
DDR3 1066 DRAM
512 KB private cache/core
Workloads
SPEC CPU2006
300 multiprogrammed workloads
159
Quantitative Comparison
SPEC CPU2006 application: leslie3d
[Figure: actual slowdown vs. the STFM and MISE estimates over 100 million cycles — the MISE estimate tracks the actual slowdown more closely]
Comparison to STFM
[Figure: slowdown estimates over time for several more applications, including wrf, calculix, and povray]
Average slowdown estimation error of MISE: 8.2% (across 300 workloads).
Providing “Soft” Slowdown
Guarantees
Goal
1. Ensure QoS-critical applications meet a
prescribed slowdown bound
2. Maximize system performance for other
applications
Basic Idea
Allocate just enough bandwidth to QoS-critical
application
Assign remaining bandwidth to other
applications
163
MISE-QoS: Mechanism to Provide Soft QoS
Assign an initial bandwidth allocation to the QoS-critical application.
Estimate the slowdown of the QoS-critical application using the MISE model.
After every N intervals:
If slowdown > bound B + ε, increase the bandwidth allocation.
If slowdown < bound B − ε, decrease the bandwidth allocation.
When the slowdown bound is not met for N intervals, notify the OS so it can migrate/de-schedule jobs.
A sketch of this control loop follows.
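A minimal sketch of that control loop, assuming the bandwidth allocation is represented as a fraction in [0, 1] and adjusted in fixed steps — the step size, the clamping, and the counting of consecutive misses are illustrative assumptions:

typedef struct {
    double bw_alloc;       /* fraction of memory bandwidth allocated to the QoS-critical app */
    int    missed_checks;  /* consecutive checks in which the slowdown bound was not met */
} mise_qos_t;

/* Called every N intervals with the MISE slowdown estimate of the QoS-critical application.
 * bound is B, eps the tolerance, step the allocation adjustment granularity.
 * Returns 1 if the OS should be notified (bound repeatedly not met). */
static int mise_qos_adjust(mise_qos_t *q, double slowdown, double bound,
                           double eps, double step, int max_missed) {
    if (slowdown > bound + eps) {
        q->bw_alloc += step;                      /* give the QoS-critical app more bandwidth */
        if (q->bw_alloc > 1.0) q->bw_alloc = 1.0;
        q->missed_checks++;                       /* bound currently not met */
    } else if (slowdown < bound - eps) {
        q->bw_alloc -= step;                      /* reclaim bandwidth for the other applications */
        if (q->bw_alloc < 0.0) q->bw_alloc = 0.0;
        q->missed_checks = 0;
    } else {
        q->missed_checks = 0;
    }
    return q->missed_checks >= max_missed;        /* let the OS migrate/de-schedule jobs */
}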
Methodology
Each application (25 applications in total)
considered the QoS-critical application
Run with 12 sets of co-runners of different memory
intensities
Total of 300 multiprogrammed workloads
Each workload run with 10 slowdown bound values
Baseline memory scheduling mechanism
Always prioritize QoS-critical application
[Iyer+, SIGMETRICS 2007]
Other applications’ requests scheduled in FRFCFS
order
[Zuravleff +, US Patent 1997, Rixner+, ISCA 2000]
165
A Look at One Workload
[Figure: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, and omnetpp (non-QoS-critical) under AlwaysPrioritize, MISE-QoS-10/1, and MISE-QoS-10/3, for slowdown bounds of 10, 3.33, and 2]
MISE is effective in:
1. meeting the slowdown bound for the QoS-critical application
2. improving the performance of non-QoS-critical applications
Effectiveness of MISE in Enforcing QoS
Across 3000 data points:
                        Predicted Met    Predicted Not Met
QoS Bound Met           78.8%            2.1%
QoS Bound Not Met       2.2%             16.9%
[Figure: system performance of AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9 as the number of memory-intensive applications varies; performance is higher when the slowdown bound is looser]
When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%.
Other Results in the Paper
Sensitivity to model parameters: robust across different values of the model parameters.
Additional results on slowdown guarantees and other system configurations.
Summary
Uncontrolled memory interference slows down
applications unpredictably
Goal: Estimate and control slowdowns
Key contribution
MISE: An accurate slowdown estimation model
Average error of MISE: 8.2%
Key Idea
Request Service Rate is a proxy for performance
Request Service Rate Alone estimated by giving an
application highest priority in accessing memory
Leverage slowdown estimates to control
slowdowns
Providing soft slowdown guarantees
Minimizing maximum slowdown
171
MISE: Pros and Cons
Upsides:
Simple new insight to estimate slowdown
Much more accurate slowdown estimations than prior
techniques (STFM, FST)
Enables a number of QoS mechanisms that can use
slowdown estimates to satisfy performance
requirements
Downsides:
Slowdown estimation is not perfect - there are still
errors
Does not take into account caches and other shared
resources in slowdown estimation
172
More on MISE
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems", Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)
Extending MISE to Shared Caches: ASM
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu, "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory", Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]
Handling Memory Interference
In Multithreaded Applications
177
Barriers
Synchronization point
Threads have to wait until all threads reach the
barrier
Last thread arriving at the barrier is on the critical
path
Each thread:
loop1 {
Compute
}
barrier
loop2 {
Compute
}
178
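A concrete version of the per-thread pseudocode above, using POSIX threads (the thread count and the busy-wait "Compute" bodies are placeholders; compile with -pthread). The last thread to reach the barrier determines when all threads can start loop2:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    /* loop1: Compute (threads arrive at the barrier at different times) */
    for (volatile long i = 0; i < 1000000 * (id + 1); i++) ;
    pthread_barrier_wait(&barrier);  /* the slowest thread is on the critical path */
    /* loop2: Compute */
    for (volatile long i = 0; i < 1000000; i++) ;
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    printf("all threads passed the barrier\n");
    return 0;
}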
Stages of Pipelined Programs
Loop iterations are statically divided into code segments called
stages
Threads execute stages on different cores
Thread executing the slowest stage is on the critical path
A B C
loop {
Compute1 A
Compute2 B
Compute3 C
}
179
Handling Interference in Parallel
Applications
Threads in a multithreaded application are inter-
dependent
Some threads can be on the critical path of
execution due to synchronization; some threads are
not
How do we schedule requests of inter-dependent
threads to maximize multithreaded application
performance?
181
Parallel App Mem Scheduling: Pros
and Cons
Upsides:
Downsides:
What if there are multiple multi-threaded applications
running together?
Limiter thread estimation can become complex
182
More on PAMS
Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin,
Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Parallel Application Memory Scheduling"
Proceedings of the
44th International Symposium on Microarchitecture
(MICRO), Porto Alegre, Brazil, December 2011.
Slides (pptx)
183