
Computer Architecture
Lecture 11b: Memory Interference and Quality of Service
Prof. Onur Mutlu
ETH Zürich
Fall 2020
29 October 2020
Shared Resource Design
for Multi-Core Systems

2
Memory System: A Shared
Resource View

Storage

Most of the system is dedicated to storing and moving data


3
Resource Sharing Concept
 Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
 Example resources: functional units, pipeline, caches,
buses, memory, interconnects, storage
 Why?

+ Resource sharing improves utilization/efficiency → throughput
 When a resource is left idle by one thread, another
thread can use it; no need to replicate shared data
+ Reduces communication latency
 For example, shared data kept in the same cache in SMT
processors
+ Compatible with the shared memory model
4
Resource Sharing
Disadvantages
Resource sharing results in contention for resources
 When the resource is not idle, another thread cannot use
it
 If space is occupied by one thread, another thread needs
to re-occupy it

- Sometimes reduces each or some thread's performance
 - Thread performance can be worse than when it is run alone
- Eliminates performance isolation → inconsistent performance across runs
 - Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS
- Causes unfairness, starvation
5
Example: Problem with Shared Caches
[Figure: two processor cores, each with a private L1 cache, sharing an L2 cache; thread t1 runs on Core 1 and thread t2 on Core 2, and t1's data fills the shared L2]

t2's throughput is significantly reduced due to unfair cache sharing.

Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
8
Need for QoS and Shared Resource Mgmt.
 Why is unpredictable performance (or lack of QoS) bad?

 Makes programmer’s life difficult


 An optimized program can get low performance (and
performance varies widely depending on co-runners)

 Causes discomfort to user


 An important program can starve
 Examples from shared software resources

 Makes system management difficult


 How do we enforce a Service Level Agreement when sharing of hardware resources is uncontrollable?
9
Resource Sharing vs.
Partitioning
Sharing improves throughput
 Better utilization of space

 Partitioning provides performance isolation


(predictable performance)
 Dedicated space

 Can we get the benefits of both?

 Idea: Design shared resources such that they are


efficiently utilized, controllable and partitionable
 No wasted resource + QoS mechanisms for threads

10
Memory System is the Major
Shared Resource
Threads' requests interfere

11
Much More of a Shared Resource
in Future

12
Inter-Thread/Application
Interference
Problem: Threads share the memory system, but
memory system does not distinguish between
threads’ requests

 Existing memory systems


 Free-for-all, shared based on demand
 Control algorithms thread-unaware and thread-unfair
 Aggressive threads can deny service to others
 Do not try to reduce or control inter-thread
interference

13
Unfair Slowdowns due to Interference
[Figure: matlab and gcc running on different cores of a multi-core system, sharing the memory system]

Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.
14
Uncontrolled Interference: An Example
[Figure: a multi-core chip with two cores running stream and random, each with a private L2 cache, connected through an interconnect to a shared DRAM memory controller and DRAM Banks 0-3; unfairness arises in the shared DRAM memory system]
15
A Memory Performance Hog

STREAM (streaming access pattern):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = j*linesize;
  A[index] = B[index];
  …
}
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random access pattern):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = rand();
  A[index] = B[index];
  …
}
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007
16
What Does the Memory Hog Do?
[Figure: a DRAM bank with row decoder, 8KB row buffer, and column mux; the memory request buffer holds interleaved requests from T0 (STREAM, all to Row 0) and T1 (RANDOM, to Rows 5, 111, 16, …)]
Row size: 8KB, cache block size: 64B
128 (8KB/64B) requests of T0 serviced before T1
T0: STREAM, T1: RANDOM

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007
17
DRAM Controllers
 A row-conflict memory access takes significantly
longer than a row-hit access

 Current controllers take advantage of the row buffer

 Commonly used scheduling policy (FR-FCFS) [Rixner+, ISCA 2000]*
 (1) Row-hit first: Service row-hit memory accesses first
 (2) Oldest-first: Then service older accesses first

 This scheduling policy aims to maximize DRAM throughput
 But, it is unfair when multiple threads share the DRAM system

*Rixner et al., "Memory Access Scheduling," ISCA 2000.
*Zuravleff and Robinson, "Controller for a synchronous DRAM …," US Patent 5,630,096, May 1997.
18
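To make the two FR-FCFS rules concrete, here is a minimal request-comparison sketch in C. It is illustrative only: the struct fields, the open_row[] table, and the function names are assumptions, not the controller design from the papers cited above.

  /* Illustrative FR-FCFS priority comparison: returns 1 if request a should be
   * scheduled before request b. */
  typedef struct {
      int bank;            /* DRAM bank the request targets        */
      int row;             /* DRAM row the request targets         */
      unsigned long age;   /* arrival time stamp (smaller = older) */
  } mem_request_t;

  extern int open_row[];   /* open_row[bank] = row currently in that bank's row buffer (assumed) */

  static int is_row_hit(const mem_request_t *r) {
      return open_row[r->bank] == r->row;
  }

  int frfcfs_higher_priority(const mem_request_t *a, const mem_request_t *b) {
      int hit_a = is_row_hit(a), hit_b = is_row_hit(b);
      if (hit_a != hit_b)
          return hit_a;          /* (1) row-hit first   */
      return a->age < b->age;    /* (2) then oldest first */
  }

Because this comparison never looks at which thread issued a request, a thread with very high row-buffer locality (like STREAM) keeps winning rule (1), which is exactly the unfairness discussed on the following slides.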
Effect of the Memory Performance Hog
[Bar chart: slowdowns when STREAM and RANDOM run together on a dual-core system — STREAM slows down by only 1.18X, while RANDOM slows down by 2.82X; similar behavior is seen with other co-runners such as gcc and Virtual PC]
Results on Intel Pentium D running Windows XP


(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007

19
Greater Problem with More
Cores

 Vulnerable to denial of service (DoS)


 Unable to enforce priorities or SLAs
 Low system performance
Uncontrollable, unpredictable system

20
Distributed DoS in Networked Multi-Core Systems
Attackers (Cores 1-8); stock option pricing application (Cores 9-64)

 Cores connected via packet-switched routers on chip

 ~5000X latency increase

 Grot, Hestness, Keckler, Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009.
22
More on Memory Performance
Attacks
Thomas Moscibroda and Onur Mutlu,
"Memory Performance Attacks: Denial of Memory
Service in Multi-Core Systems"

Proceedings of the 16th USENIX Security Symposium


(USENIX SECURITY), pages 257-274, Boston, MA,
August 2007. Slides (ppt)

23
http://www.youtube.com/watch?v=VJzZbwgBfy8
More on Interconnect Based
Starvation
Boris Grot, Stephen W. Keckler, and Onur Mutlu,
"Preemptive Virtual Clock: A Flexible, Efficient, and
Cost-effective QOS Scheme for Networks-on-Chip"

Proceedings of the
42nd International Symposium on Microarchitecture
(MICRO), pages 268-279, New York, NY, December 2009.
Slides (pdf)

24
How Do We Solve The Problem?
 Inter-thread interference is uncontrolled in all memory
resources
 Memory controller
 Interconnect
 Caches

 We need to control it
 i.e., design an interference-aware (QoS-aware) memory
system

25
QoS-Aware Memory Systems:
Challenges
How do we reduce inter-thread interference?
 Improve system performance and core utilization
 Reduce request serialization and core starvation

 How do we control inter-thread interference?


 Provide mechanisms to enable system software to
enforce QoS policies
 While providing high system performance

 How do we make the memory system


configurable/flexible?
 Enable flexible mechanisms that can achieve many
goals
 Provide fairness or throughput when needed
 Satisfy performance guarantees when needed
26
Designing QoS-Aware Memory Systems:
Approaches
 Smart resources: Design each shared resource to have
a configurable interference control/reduction
mechanism
 QoS-aware memory controllers
 QoS-aware interconnects
 QoS-aware caches

 Dumb resources: Keep each resource free-for-all, but


reduce/control interference by injection control or data
mapping
 Source throttling to control access to memory system
 QoS-aware data mapping to memory controllers
 QoS-aware thread scheduling to cores

27
Fundamental Interference Control
Techniques
 Goal: to reduce/control inter-thread memory

interference

1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread scheduling

28
QoS-Aware Memory Scheduling
Resolves memory contention by scheduling memory requests
[Figure: four cores sharing one memory controller that schedules their requests to memory]

 How to schedule requests to provide


 High system performance
 High fairness to applications
 Configurability to system software

 Memory controller needs to be aware of threads

29
QoS-Aware Memory
Scheduling:
Evolution
QoS-Aware Memory Scheduling:
Evolution
Stall-time fair memory scheduling [Mutlu+ MICRO’07]
 Idea: Estimate and balance thread slowdowns
 Takeaway: Proportional thread progress improves
performance, especially when threads are “heavy”
(memory intensive)

 Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top


Picks’09]
 Idea: Rank threads and service in rank order (to preserve
bank parallelism); batch requests to prevent starvation
 Takeaway: Preserving within-thread bank-parallelism
improves performance; request batching improves fairness

 ATLAS memory scheduler [Kim+ HPCA’10]


 Idea: Prioritize threads that have attained the least service
from the memory scheduler
 Takeaway: Prioritizing "light" threads improves performance
31
QoS-Aware Memory Scheduling:
Evolution
Thread cluster memory scheduling [Kim+ MICRO’10, Top
Picks’11]
 Idea: Cluster threads into two groups (latency vs. bandwidth
sensitive); prioritize the latency-sensitive ones; employ a
fairness policy in the bandwidth sensitive group
 Takeaway: Heterogeneous scheduling policy that is different
based on thread behavior maximizes both performance and
fairness

 Integrated Memory Channel Partitioning and


Scheduling [Muralidhara+ MICRO’11]
 Idea: Only prioritize very latency-sensitive threads in the
scheduler; mitigate all other applications’ interference via
channel partitioning
 Takeaway: Intelligently combining application-aware channel partitioning and memory scheduling provides better performance than either
32
QoS-Aware Memory Scheduling:
Evolution
Parallel application memory scheduling [Ebrahimi+
MICRO’11]
 Idea: Identify and prioritize limiter threads of a
multithreaded application in the memory scheduler; provide
fast and fair progress to non-limiter threads
 Takeaway: Carefully prioritizing between limiter and non-
limiter threads of a parallel application improves
performance

 Staged memory scheduling [Ausavarungnirun+ ISCA’12]


 Idea: Divide the functional tasks of an application-aware
memory scheduler into multiple distinct stages, where each
stage is significantly simpler than a monolithic scheduler
 Takeaway: Staging enables the design of a scalable and
relatively simpler application-aware memory scheduler that
works on very large request buffers
33
QoS-Aware Memory Scheduling:
Evolution
MISE: Memory Slowdown Model [Subramanian+ HPCA’13]
 Idea: Estimate the performance of a thread by estimating
its change in memory request service rate when run alone
vs. shared  use this simple model to estimate slowdown to
design a scheduling policy that provides predictable
performance or fairness
 Takeaway: Request service rate of a thread is a good proxy
for its performance; alone request service rate can be
estimated by giving high priority to the thread in memory
scheduling for a while

 ASM: Application Slowdown Model [Subramanian+


MICRO’15]
 Idea: Extend MISE to take into account cache+memory

interference
 Takeaway: Cache access rate of an application can be used as a proxy for its performance
34
QoS-Aware Memory Scheduling:
Evolution
BLISS: Blacklisting Memory Scheduler [Subramanian+
ICCD’14, TPDS’16]
 Idea: Deprioritize (i.e., blacklist) a thread that has had a large number of its requests serviced consecutively
 Takeaway: Blacklisting greatly reduces interference and enables the scheduler to be simple without requiring full thread ranking

 DASH: Deadline-Aware Memory Scheduler [Usui+


TACO’16]
 Idea: Balance prioritization between CPUs, GPUs and Hardware
Accelerators (HWA) by keeping HWA progress in check vs.
deadlines such that HWAs do not hog performance and
appropriately distinguishing between latency-sensitive vs.
bandwidth-sensitive CPU workloads
 Takeaway: Proper control of HWA progress and application-aware CPU prioritization leads to better system performance while meeting HWA deadlines
35
QoS-Aware Memory Scheduling:
Evolution
Prefetch-aware shared resource management
[Ebrahimi+ ISCA’11] [Ebrahimi+ MICRO’09] [Ebrahimi+ HPCA’09]
[Lee+ MICRO’08’09]
 Idea: Prioritize prefetches depending on how they affect
system performance; even accurate prefetches can degrade
performance of the system
 Takeaway: Carefully controlling and prioritizing prefetch
requests improves performance and fairness

 DRAM-Aware last-level cache policies and write


scheduling [Lee+ HPS Tech Report’10] [Seshadri+ ISCA’14]
 Idea: Design cache eviction and replacement policies such
that they proactively exploit the state of the memory
controller and DRAM (e.g., proactively evict data from the
cache that hit in open rows)
 Takeaway: Coordination of last-level cache and DRAM policies improves performance and fairness; writes should …
36
QoS-Aware Memory Scheduling:
Evolution
FIRM: Memory Scheduling for NVM [Zhao+ MICRO’14]
 Idea: Carefully handle write-read prioritization with coarse-
grained batching and application-aware scheduling
 Takeaway: Carefully controlling and prioritizing write
requests improves performance and fairness; write requests
are especially critical in NVMs

 Criticality-Aware Memory Scheduling for GPUs [Jog+


SIGMETRICS’16]
 Idea: Prioritize latency-critical cores’ requests in a GPU
system
 Takeaway: Need to carefully balance locality and criticality
to make sure performance improves by taking advantage of
both

 Worst-case Execution Time Based Memory Scheduling for Real-Time Systems [Kim+ RTAS'14, …]
37
Stall-Time Fair Memory
Scheduling

Onur Mutlu and Thomas Moscibroda,


"Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO),
pages 146-158, Chicago, IL, December 2007. Slides (ppt)

STFM Micro 2007 Talk


The Problem: Unfairness

 Vulnerable to denial of service (DoS)


 Unable to enforce priorities or SLAs
 Low system performance
Uncontrollable, unpredictable system

39
How Do We Solve the Problem?
 Stall-time fair memory scheduling [Mutlu+ MICRO’07]

 Goal: Threads sharing main memory should


experience similar slowdowns compared to when
they are run alone  fair scheduling
 Also improves overall system performance by ensuring cores
make “proportional” progress

 Idea: Memory controller estimates each thread’s


slowdown due to interference and schedules
requests in a way to balance the slowdowns

 Mutlu and Moscibroda, “Stall-Time Fair Memory Access


Scheduling for Chip Multiprocessors,” MICRO 2007.

40
Stall-Time Fairness in Shared
DRAM Systems
 A DRAM system is fair if it equalizes the slowdown of equal-
priority threads relative to when each thread is run alone on the
same system

 DRAM-related stall-time: The time a thread spends waiting for DRAM


memory
 STshared: DRAM-related stall-time when the thread runs with other
threads
 STalone: DRAM-related stall-time when the thread runs alone
 Memory-slowdown = STshared/STalone
 Relative increase in stall-time

 Stall-Time Fair Memory scheduler (STFM) aims to equalize


Memory-slowdown for interfering threads, without sacrificing
performance
41
 Considers inherent DRAM performance of each thread
STFM Scheduling Algorithm
[MICRO’07]
 For each thread, the DRAM controller

 Tracks STshared
 Estimates STalone

 Each cycle, the DRAM controller


 Computes Slowdown = STshared/STalone for threads with legal
requests
 Computes unfairness = MAX Slowdown / MIN Slowdown

 If unfairness < α
 Use DRAM throughput oriented scheduling policy

 If unfairness ≥ α
 Use fairness-oriented scheduling policy
 (1) requests from thread with MAX Slowdown first
 (2) row-hit first, (3) oldest-first
42
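The cycle-by-cycle policy choice described above can be sketched in C as follows. This is an illustrative simplification: the slowdown bookkeeping, the threshold value, and the three extern hooks are assumptions, not the hardware design from the MICRO 2007 paper.

  #define NUM_THREADS 8

  double st_shared[NUM_THREADS];   /* tracked DRAM-related stall time (shared run)  */
  double st_alone[NUM_THREADS];    /* estimated DRAM-related stall time (alone run) */
  double alpha = 1.05;             /* unfairness threshold (illustrative value)     */

  extern int  thread_has_legal_request(int t);   /* assumed hook */
  extern void schedule_fr_fcfs(void);            /* assumed hook: throughput-oriented policy  */
  extern void schedule_fairness_oriented(void);  /* assumed hook: MAX-slowdown thread first,
                                                    then row-hit first, then oldest first     */

  void stfm_pick_policy(void) {
      double max_slowdown = 0.0, min_slowdown = 1e30;
      for (int t = 0; t < NUM_THREADS; t++) {
          if (!thread_has_legal_request(t))
              continue;
          double slowdown = st_shared[t] / st_alone[t];
          if (slowdown > max_slowdown) max_slowdown = slowdown;
          if (slowdown < min_slowdown) min_slowdown = slowdown;
      }
      double unfairness = max_slowdown / min_slowdown;
      if (unfairness < alpha)
          schedule_fr_fcfs();            /* stay throughput-oriented         */
      else
          schedule_fairness_oriented();  /* balance slowdowns across threads */
  }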
How Does STFM Prevent Unfairness?
[Animation: the request buffer holds interleaved requests from T0 (all to Row 0) and T1 (to Rows 5, 111, 16); as requests are serviced, the controller updates T0's and T1's estimated slowdowns (values around 1.00-1.14 in the example) and the resulting unfairness (1.00-1.06); once unfairness reaches the threshold α = 1.05, the fairness-oriented policy services the most-slowed-down thread first]
43
STFM Pros and Cons
 Upsides:
 First algorithm for fair multi-core memory scheduling
 Provides a mechanism to estimate memory slowdown
of a thread
 Good at providing fairness
 Being fair can improve performance

 Downsides:
 Does not handle all types of interference
 (Somewhat) complex to implement
 Slowdown estimations can be incorrect

44
More on STFM
 Onur Mutlu and Thomas Moscibroda,
"Stall-Time Fair Memory Access Scheduling for Chi
p Multiprocessors"

Proceedings of the
40th International Symposium on Microarchitecture
(MICRO), pages 146-158, Chicago, IL, December 2007. [
Summary] [Slides (ppt)]

45
Parallelism-Aware Batch
Scheduling

Onur Mutlu and Thomas Moscibroda,


"Parallelism-Aware Batch Scheduling: Enhancing both
Performance and Fairness of Shared DRAM Systems”
35th International Symposium on Computer Architecture (ISCA),
pages 63-74, Beijing, China, June 2008. Slides (ppt)

PAR-BS ISCA 2008 Talk


Another Problem due to Memory
Interference
 Processors try to tolerate the latency of DRAM
requests by generating multiple outstanding requests
 Memory-Level Parallelism (MLP)
 Out-of-order execution, non-blocking caches, runahead
execution

 Effective only if the DRAM controller actually services


the multiple requests in parallel in DRAM banks

 Multiple threads share the DRAM controller


 DRAM controllers are not aware of a thread’s MLP
 Can service each thread’s outstanding requests serially, not
in parallel
47
Bank Parallelism of a Thread
Single Thread: Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1)
[Timeline: Compute → Stall → Compute; the two bank accesses proceed in parallel]

 Bank access latencies of the two requests overlapped
 Thread stalls for ~ONE bank access latency
48
Bank Parallelism Interference in DRAM
Baseline Scheduler:
 Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1)
 Thread B issues 2 DRAM requests (Bank 1, Row 99 and Bank 0, Row 99)
[Timeline: with the baseline scheduler, each thread computes, then stalls while its two bank accesses are serviced one after the other, then computes]

 Bank access latencies of each thread serialized
 Each thread stalls for ~TWO bank access latencies
49
Parallelism-Aware Scheduler
Baseline Scheduler: [as on the previous slide, each thread stalls for ~two bank access latencies]
Parallelism-aware Scheduler:
[Timeline: Thread A's two requests are serviced back to back in different banks, so A stalls for only ~one bank access latency (saved cycles); Thread B still stalls for ~two]

 Average stall-time: ~1.5 bank access latencies
50
Parallelism-Aware Batch Scheduling (PAR-BS)
 Principle 1: Parallelism-awareness
 Schedule requests from a thread (to different banks) back to back
 Preserves each thread's bank parallelism
 But, this can cause starvation…
 Principle 2: Request Batching
 Group a fixed number of oldest requests from each thread into a "batch"
 Service the batch before all other requests
 Form a new batch when the current one is done
 Eliminates starvation, provides fairness
 Allows parallelism-awareness within a batch
[Figure: per-bank request queues (Bank 0, Bank 1) holding requests from threads T0-T3; the oldest requests from each thread form the current batch]
Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.
51
PAR-BS Components

 Request batching

 Within-batch scheduling
 Parallelism aware

52
Request Batching
 Each memory request has a bit (marked) associated
with it

 Batch formation:
 Mark up to Marking-Cap oldest requests per bank for each
thread
 Marked requests constitute the batch
 Form a new batch when no marked requests are left

 Marked requests are prioritized over unmarked ones


 No reordering of requests across batches: no starvation, high
fairness

 How to prioritize requests within a batch?
53
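A minimal batch-formation sketch in C, under the assumption that the request buffer is scanned oldest-first; the constants and struct layout are illustrative, not taken from the ISCA 2008 paper.

  #define MARKING_CAP  5
  #define MAX_THREADS 16
  #define MAX_BANKS    8

  typedef struct {
      int thread_id;
      int bank;
      int marked;      /* 1 if the request belongs to the current batch */
  } request_t;

  /* Mark up to MARKING_CAP oldest requests per (thread, bank) pair; marked
   * requests are prioritized over unmarked ones until the batch drains. */
  void form_batch(request_t *reqs, int num_reqs) {
      int marked_count[MAX_THREADS][MAX_BANKS] = {{0}};
      for (int i = 0; i < num_reqs; i++) {          /* reqs[] assumed oldest-first */
          int t = reqs[i].thread_id, b = reqs[i].bank;
          if (marked_count[t][b] < MARKING_CAP) {
              reqs[i].marked = 1;
              marked_count[t][b]++;
          } else {
              reqs[i].marked = 0;
          }
      }
  }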
Within-Batch Scheduling
 Can use any existing DRAM scheduling policy
 FR-FCFS (row-hit first, then oldest-first) exploits row-buffer
locality
 But, we also want to preserve intra-thread bank
parallelism
 Service each thread’s requests back to back
HOW?

 Scheduler computes a ranking of threads when the


batch is formed
 Higher-ranked threads are prioritized over lower-ranked
ones
 Improves the likelihood that requests from a thread are serviced in parallel by different banks
 Different threads prioritized in the same order across ALL banks
54
Thread Ranking
Key Idea: rank threads and service each thread's requests back to back across banks
[Figure: without ranking, thread A's and thread B's requests are interleaved in Bank 0 and Bank 1, and both threads wait long on the memory service timeline; with ranking, thread A's requests are serviced in parallel first, saving cycles for thread A while thread B waits no longer than before]
55
How to Rank Threads within a
Batch
Ranking scheme affects system throughput and
fairness

 Maximize system throughput


 Minimize average stall-time of threads within the batch
 Minimize unfairness (Equalize the slowdown of
threads)
 Service threads with inherently low stall-time early in the
batch
 Insight: delaying memory non-intensive threads results in
high slowdown

 Shortest stall-time first (shortest job first) ranking


 Provides optimal system throughput [Smith, 1956]*
* W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
 Controller estimates each thread's stall-time within the batch
56
Shortest Stall-Time First Ranking
 Maximum number of marked requests to any bank (max-bank-load)
 Rank thread with lower max-bank-load higher (~ low stall-time)
 Total number of marked requests (total-load)
 Breaks ties: rank thread with lower total-load higher

[Figure: per-bank marked request queues for Banks 0-3 holding requests from T0-T3]

Thread   max-bank-load   total-load
T0       1               3
T1       2               4
T2       2               6
T3       5               9

Ranking: T0 > T1 > T2 > T3
57
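The ranking rule on this slide can be written as a simple comparison over per-thread, per-bank counts of marked requests; the array layout and function name below are illustrative assumptions.

  #define MAX_THREADS 16
  #define MAX_BANKS    8

  /* marked[t][b] = number of marked (batched) requests thread t has to bank b */
  int marked[MAX_THREADS][MAX_BANKS];

  /* Returns 1 if thread a should be ranked above thread b (shortest stall-time first). */
  int ranked_higher(int a, int b) {
      int max_a = 0, max_b = 0, tot_a = 0, tot_b = 0;
      for (int bank = 0; bank < MAX_BANKS; bank++) {
          if (marked[a][bank] > max_a) max_a = marked[a][bank];
          if (marked[b][bank] > max_b) max_b = marked[b][bank];
          tot_a += marked[a][bank];
          tot_b += marked[b][bank];
      }
      if (max_a != max_b)
          return max_a < max_b;   /* lower max-bank-load → higher rank           */
      return tot_a < tot_b;       /* tie-break: lower total-load → higher rank   */
  }

With the example counts above (T0: 1/3, T1: 2/4, T2: 2/6, T3: 5/9) this comparison yields the ranking T0 > T1 > T2 > T3.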
Example Within-Batch Scheduling
[Figure: per-bank service order for Banks 0-3 under the baseline (arrival order) and under PAR-BS with the ranking T0 > T1 > T2 > T3]
Baseline stall times: T0 = 4, T1 = 4, T2 = 5, T3 = 7 bank access latencies (AVG: 5)
PAR-BS stall times: T0 = 1, T1 = 2, T2 = 4, T3 = 7 bank access latencies (AVG: 3.5)
58
Putting It Together: PAR-BS Scheduling Policy
 PAR-BS Scheduling Policy
 (1) Marked requests first (Batching)
 (2) Row-hit requests first
 (3) Higher-rank thread first (shortest stall-time first)
 (4) Oldest first
 (rules 2-4: parallelism-aware within-batch scheduling)

 Three properties:
 Exploits row-buffer locality and intra-thread bank parallelism
 Work-conserving
 Services unmarked requests to banks without marked requests
 Marking-Cap is important
 Too small cap: destroys row-buffer locality
 Too large cap: penalizes memory non-intensive threads
 Many more trade-offs analyzed in the paper
59
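The four prioritization rules can be combined into one request comparison, sketched below; the struct fields and the convention that a smaller thread_rank value means a higher rank are illustrative assumptions.

  /* Illustrative PAR-BS request priority: returns 1 if request a should be
   * scheduled before request b. */
  typedef struct {
      int marked;          /* 1 if part of the current batch          */
      int row_hit;         /* 1 if it hits the currently open row     */
      int thread_rank;     /* smaller value = higher-ranked thread    */
      unsigned long age;   /* arrival time stamp (smaller = older)    */
  } parbs_req_t;

  int parbs_higher_priority(const parbs_req_t *a, const parbs_req_t *b) {
      if (a->marked != b->marked)   return a->marked;          /* (1) marked first  */
      if (a->row_hit != b->row_hit) return a->row_hit;         /* (2) row-hit first */
      if (a->thread_rank != b->thread_rank)
          return a->thread_rank < b->thread_rank;              /* (3) higher rank   */
      return a->age < b->age;                                  /* (4) oldest first  */
  }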
Hardware Cost
 <1.5KB storage cost for
 8-core system with 128-entry memory request buffer

 No complex operations (e.g., divisions)

 Not on the critical path


 Scheduler makes a decision only every DRAM cycle

60
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Bar chart (lower is better): unfairness of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS reduces unfairness over the best previous scheduler by 1.11X, 1.11X, and 1.08X, respectively]
61
System Performance (Hmean-speedup)
[Bar chart: normalized harmonic-mean speedup of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS improves system performance by 8.3%, 6.1%, and 5.1%, respectively]
62
PAR-BS Pros and Cons
 Upsides:
 First scheduler to address bank parallelism destruction
across multiple threads
 Simple mechanism (vs. STFM)
 Batching provides fairness
 Ranking enables parallelism awareness

 Downsides:
 Does not always prioritize the latency-sensitive
applications

63
More on PAR-BS
 Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enhancing b
oth Performance and Fairness of Shared DRAM Sys
tems"

Proceedings of the
35th International Symposium on Computer Architecture
(ISCA), pages 63-74, Beijing, China, June 2008. [
Summary] [Slides (ppt)]
One of the 12 computer architecture papers of
2008 selected as Top Picks by IEEE Micro.

http://www.youtube.com/watch?v=UB1kgYR-4V0
64
More on PAR-BS
 Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enabling High-Performance and
Fair Memory Controllers"

IEEE Micro, Special Issue: Micro's Top Picks from 2008 Computer Architecture
Conferences (MICRO TOP PICKS), Vol. 29, No. 1, pages 22-32,
January/February 2009.

65
ATLAS Memory Scheduler

Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter,


"ATLAS: A Scalable and High-Performance
Scheduling Algorithm for Multiple Memory Controllers"
16th International Symposium on High-Performance Computer Architecture (HPCA),
Bangalore, India, January 2010. Slides (pptx)

ATLAS HPCA 2010 Talk


ATLAS: Summary
 Goal: To maximize system performance

 Main idea: Prioritize the thread that has attained the


least service from the memory controllers (Adaptive
per-Thread Least Attained Service Scheduling)
 Rank threads based on attained service in the past
time interval(s)
 Enforce thread ranking in the memory scheduler
during the current interval

 Why it works: Prioritizes “light” (memory non-


intensive) threads that are more likely to keep their
cores busy
67
System Throughput: 24-Core System
System throughput = ∑ Speedup
[Line chart: system throughput of FCFS, FR_FCFS, STFM, PAR-BS, and ATLAS as the number of memory controllers varies from 1 to 16; ATLAS's improvement over the best previous scheduler is 17.0%, 9.8%, 8.4%, 5.9%, and 3.5% for 1, 2, 4, 8, and 16 controllers]
ATLAS consistently provides higher system throughput than all previous scheduling algorithms
68
System Throughput: 4-MC System
[Line chart: system throughput of PAR-BS and ATLAS as the number of cores varies from 4 to 32; ATLAS's benefit over PAR-BS grows from 1.1% at 4 cores to 3.5%, 4.0%, 8.4%, and 10.8% at 8, 16, 24, and 32 cores]

 # of cores increases → ATLAS performance benefit increases
69
ATLAS Pros and Cons
 Upsides:
 Good at improving overall throughput (compute-
intensive threads are prioritized)
 Low complexity
 Coordination among controllers happens infrequently

 Downsides:
 Lowest/medium ranked threads get delayed
significantly  high unfairness

70
More on ATLAS Memory
Scheduler
Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-
Balter,
"ATLAS: A Scalable and High-Performance Scheduli
ng Algorithm for Multiple Memory Controllers"

Proceedings of the
16th International Symposium on High-Performance Com
puter Architecture
(HPCA), Bangalore, India, January 2010. Slides (pptx)

71
TCM:
Thread Cluster Memory
Scheduling

Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,


"Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior"
43rd International Symposium on Microarchitecture (MICRO),
pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

TCM Micro 2010 Talk


Previous Scheduling Algorithms are Biased
24 cores, 4 memory controllers, 96 workloads
[Scatter plot: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FCFS, FRFCFS, STFM, PAR-BS, and ATLAS; ATLAS shows a system throughput bias, STFM and PAR-BS a fairness bias, and none reaches the ideal corner]
No previous memory scheduling algorithm provides both the best fairness and system throughput
73
Throughput vs. Fairness
 Throughput biased approach: prioritize less memory-intensive threads
 Good for throughput, but the memory-intensive thread is not prioritized → starvation → unfairness
 Fairness biased approach: take turns accessing memory
 Does not starve any thread, but the less memory-intensive thread is not prioritized → reduced throughput

 Single policy for all threads is insufficient
74
Achieving the Best of Both Worlds
 For Throughput
 Prioritize memory-non-intensive threads (higher priority)
 For Fairness
 Unfairness is caused by memory-intensive threads being prioritized over each other
 • Shuffle thread ranking
 Memory-intensive threads have different vulnerability to interference
 • Shuffle asymmetrically
75
Thread Cluster Memory Scheduling [Kim+ MICRO'10]
1. Group threads into two clusters
2. Prioritize non-intensive cluster
3. Different policies for each cluster
[Figure: the threads in the system are split into a memory-non-intensive cluster (prioritized, managed for throughput) and a memory-intensive cluster (managed for fairness)]
76
TCM Outline

1. Clustering

77
Clustering Threads
Step 1: Sort threads by MPKI (misses per kilo-instruction)
Step 2: Memory bandwidth usage αT divides the clusters
 T = total memory bandwidth usage; α < 10% (ClusterThreshold)
[Figure: threads sorted by increasing MPKI; the lowest-MPKI threads whose combined bandwidth usage stays within αT form the non-intensive cluster, and the remaining threads form the intensive cluster]
78
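A minimal clustering sketch in C, assuming per-thread MPKI and bandwidth-usage counters gathered over the previous quantum; the sorting-based formulation and field names are illustrative, not the exact hardware mechanism.

  #include <stdlib.h>

  typedef struct {
      int    id;
      double mpki;       /* misses per kilo-instruction (memory intensity proxy)  */
      double bandwidth;  /* measured memory bandwidth usage over the last quantum */
      int    intensive;  /* output: 1 if placed in the intensive cluster          */
  } tcm_thread_t;

  static int by_mpki(const void *a, const void *b) {
      double d = ((const tcm_thread_t *)a)->mpki - ((const tcm_thread_t *)b)->mpki;
      return (d > 0) - (d < 0);
  }

  void tcm_cluster(tcm_thread_t *threads, int n, double alpha) {
      double total = 0.0, running = 0.0;
      for (int i = 0; i < n; i++) total += threads[i].bandwidth;

      qsort(threads, n, sizeof(tcm_thread_t), by_mpki);  /* least intensive first */

      for (int i = 0; i < n; i++) {
          running += threads[i].bandwidth;
          /* threads covering the first alpha*T of bandwidth stay non-intensive */
          threads[i].intensive = (running > alpha * total);
      }
  }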
TCM Outline

1. Clustering

2. Between
Clusters

79
Prioritization Between Clusters
Prioritize the non-intensive cluster over the intensive cluster
• Increases system throughput
 – Non-intensive threads have greater potential for making progress
• Does not degrade fairness
 – Non-intensive threads are "light"
 – Rarely interfere with intensive threads
80
TCM Outline
3. Non-Intensive
Cluster

1. Clustering

Throughput
2. Between
Clusters

81
Non-Intensive Cluster
Prioritize threads according to MPKI
[Figure: threads ordered from lowest MPKI (highest priority) to highest MPKI (lowest priority)]
• Increases system throughput
 – Least intensive thread has the greatest potential for making progress in the processor
82
TCM Outline
3. Non-Intensive
Cluster

1. Clustering

Throughput
2. Between 4. Intensive
Clusters Cluster

Fairness
83
Intensive Cluster
Periodically shuffle the priority of threads
[Figure: the priority order of the intensive-cluster threads rotates over time, so each thread is periodically the most prioritized]
• Increases fairness
• Is treating all threads equally good enough?
• BUT: Equal turns ≠ Same slowdown
84
Case Study: A Tale of Two Threads
Case Study: Two intensive threads contending
1. random-access
2. streaming
Which is slowed down more easily?
[Bar charts: when random-access is prioritized, it sees a 1x slowdown while streaming sees 7x; when streaming is prioritized, it sees a 1x slowdown while random-access sees 11x]

 random-access thread is more easily slowed down
85
Why are Threads Different?
 random-access: all requests go to different banks in parallel → high bank-level parallelism → vulnerable to interference (a single stuck request stalls the thread)
 streaming: all requests go to the same activated row → high row-buffer locality → causes interference
[Figure: memory with Banks 1-4; the random-access thread has one request per bank while the streaming thread's requests pile onto one activated row]
86
TCM Outline
3. Non-Intensive
Cluster

1. Clustering

Throughput
2. Between 4. Intensive
Clusters Cluster

Fairness
87
Niceness
How to quantify the difference between threads?
 Niceness: high bank-level parallelism (vulnerability to interference) → higher niceness; high row-buffer locality (causes interference) → lower niceness
88
TCM: Quantum-Based Operation
[Timeline: execution is divided into quanta of ~1M cycles; within the intensive cluster, thread ranks are shuffled every ~1K cycles (shuffle interval)]
During quantum: monitor thread behavior
 1. Memory intensity
 2. Bank-level parallelism
 3. Row-buffer locality
Beginning of quantum:
 • Perform clustering
 • Compute niceness of intensive threads
89
TCM: Scheduling Algorithm
1. Highest-rank: Requests from higher-ranked threads prioritized
 • Non-Intensive cluster > Intensive cluster
 • Non-Intensive cluster: lower intensity → higher rank
 • Intensive cluster: rank shuffling
2. Row-hit: Row-buffer hit requests are prioritized
3. Oldest: Older requests are prioritized
90
TCM: Implementation Cost
Required storage at memory controller (24 cores)

Thread memory behavior Storage

MPKI ~0.2kb
Bank-level parallelism ~0.6kb
Row-buffer locality ~2.9kb
Total < 4kbits
• No computation is on the critical path

91
Previous Work
FRFCFS [Rixner et al., ISCA00]: Prioritizes row-buffer hits
– Thread-oblivious  Low throughput & Low fairness

STFM [Mutlu et al., MICRO07]: Equalizes thread slowdowns


– Non-intensive threads not prioritized  Low throughput

PAR-BS [Mutlu et al., ISCA08]: Prioritizes oldest batch of requests


while preserving bank-level parallelism
– Non-intensive threads not always prioritized  Low
throughput
ATLAS [Kim et al., HPCA10]: Prioritizes threads with less memory
service
– Most intensive thread starves  Low fairness
92
TCM: Throughput and Fairness
24 cores, 4 memory controllers, 96 workloads
[Scatter plot: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FRFCFS, ATLAS, STFM, PAR-BS, and TCM; TCM lies closest to the ideal corner]
TCM, a heterogeneous scheduling policy, provides best fairness and system throughput
93
TCM: Fairness-Throughput Tradeoff
When configuration parameter is varied…
[Scatter plot: varying each scheduler's configuration parameter traces a curve in the maximum-slowdown vs. weighted-speedup plane; adjusting ClusterThreshold moves TCM along a curve that dominates FRFCFS, STFM, PAR-BS, and ATLAS]

 TCM allows robust fairness-throughput tradeoff
94
Operating System Support
• ClusterThreshold is a tunable knob
– OS can trade off between fairness and throughput

• Enforcing thread weights


– OS assigns weights to threads
– TCM enforces thread weights within each cluster

95
Conclusion
• No previous memory scheduling algorithm provides
both high system throughput and fairness
– Problem: They use a single policy for all threads

• TCM groups threads into two clusters


1. Prioritize non-intensive cluster  throughput
2. Shuffle priorities in intensive cluster  fairness
3. Shuffling should favor nice threads  fairness

• TCM provides the best system throughput and fairness

96
TCM Pros and Cons
 Upsides:
 Provides both high fairness and high performance
 Caters to the needs for different types of threads
(latency vs. bandwidth sensitive)
 (Relatively) simple

 Downsides:
 Scalability to large buffer sizes?
 Robustness of clustering and shuffling algorithms?
 Ranking is still too complex?

97
More on TCM
 Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor
Harchol-Balter,
"Thread Cluster Memory Scheduling: Exploiting Dif
ferences in Memory Access Behavior"

Proceedings of the
43rd International Symposium on Microarchitecture
(MICRO), pages 65-76, Atlanta, GA, December 2010.
Slides (pptx) (pdf)

98
The Blacklisting Memory
Scheduler

Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu,
"The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost"
Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD),
Seoul, South Korea, October 2014. [Slides (pptx) (pdf)]
Tackling Inter-Application Interference: Application-Aware Memory Scheduling
[Figure: the scheduler monitors applications, computes a full rank order over them, and enforces it on the request buffer by comparing each request's application ID (AID) against the ranking]
Full ranking increases critical path latency and area significantly to improve performance and fairness
100
Performance vs. Fairness vs. Simplicity
[Three-way comparison of fairness, performance, and simplicity: the app-unaware FRFCFS is very simple but has low performance and fairness; app-aware ranking schedulers (PARBS, ATLAS, TCM) improve performance and fairness at high complexity; our solution, Blacklisting (no ranking), targets all three]

 Is it essential to give up simplicity to optimize for performance and/or fairness?
 Our solution achieves all three goals
101
Key Observation 1: Group Rather Than Rank
Observation 1: Sufficient to separate applications into two groups (interference-causing vs. vulnerable), rather than do full ranking
[Figure: instead of monitoring and producing a full rank order, the scheduler only classifies each application as vulnerable or interference-causing]
 Benefit 1: Low complexity compared to ranking
 Benefit 2: Lower slowdowns than ranking
102
Key Observation 1: Group Rather Than Rank
 How to classify applications into groups?
103
Key Observation 2
Observation 2: Serving a large number of consecutive
requests from an application causes interference

Basic Idea:
• Group applications with a large number of consecutive requests
as interference-causing  Blacklisting
• Deprioritize blacklisted applications
• Clear blacklist periodically (1000s of cycles)

Benefits:
• Lower complexity
• Finer grained grouping decisions  Lower unfairness
104
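The grouping idea can be sketched as a simple counter-based blacklist, shown below; the threshold, the clearing interval, and the counter structure are illustrative assumptions rather than the exact BLISS parameters.

  #define MAX_APPS            32
  #define BLACKLIST_THRESHOLD  4     /* consecutive serviced requests before blacklisting (assumed) */

  static int last_app_serviced = -1;
  static int consecutive_count = 0;
  static int blacklisted[MAX_APPS];  /* 1 = deprioritized (interference-causing) */

  /* Called whenever the controller services a request from application app_id. */
  void on_request_serviced(int app_id) {
      if (app_id == last_app_serviced) {
          if (++consecutive_count >= BLACKLIST_THRESHOLD)
              blacklisted[app_id] = 1;
      } else {
          last_app_serviced = app_id;
          consecutive_count = 1;
      }
  }

  /* Called periodically (every few thousand cycles) to clear the blacklist,
   * so an application is only deprioritized temporarily. */
  void clear_blacklist(void) {
      for (int i = 0; i < MAX_APPS; i++)
          blacklisted[i] = 0;
  }

Scheduling then only needs this one blacklist bit (non-blacklisted requests first), plus the usual row-hit-first and oldest-first rules, instead of a full thread ranking.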
Performance vs. Fairness vs. Simplicity
[Comparison of FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting on fairness, performance, and simplicity: Blacklisting is close to the fairest, achieves the highest performance, and is close to the simplest]

 Blacklisting is the closest scheduler to ideal
105
Performance and Fairness
[Scatter plot: unfairness (lower is better) vs. performance (higher is better) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; the plot highlights a 5% performance gain and a 21% unfairness reduction for Blacklisting]
1. Blacklisting achieves the highest performance
2. Blacklisting balances performance and fairness
106
Complexity
[Scatter plot: scheduler area (sq. um) vs. critical path latency (ns) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; the plot highlights 43% and 70% reductions for Blacklisting]
Blacklisting reduces complexity significantly
107
More on BLISS (I)
 Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri,
Harsha Rastogi, and Onur Mutlu,
"The Blacklisting Memory Scheduler: Achieving Hig
h Performance and Fairness at Low Cost"

Proceedings of the
32nd IEEE International Conference on Computer Design
(ICCD), Seoul, South Korea, October 2014. [Slides (pptx)
(pdf)]

108
More on BLISS: Longer Version
 Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha
Rastogi, and Onur Mutlu,
"BLISS: Balancing Performance, Fairness and Complexit
y in Memory Access Scheduling"

IEEE Transactions on Parallel and Distributed Systems (TPDS),


to appear in 2016. arXiv.org version, April 2015.
An earlier version as SAFARI Technical Report, TR-SAFARI-2015-
004, Carnegie Mellon University, March 2015.
[Source Code]

109
Computer Architecture
Lecture 11b: Memory
Interference
and Quality of Service
Prof. Onur Mutlu
ETH Zürich
Fall 2020
29 October 2020
Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu,
"Staged Memory Scheduling: Achieving High Performance
and Scalability in Heterogeneous Systems”
39th International Symposium on Computer Architecture (ISCA),
Portland, OR, June 2012.

SMS ISCA 2012 Talk


SMS: Executive Summary
 Observation: Heterogeneous CPU-GPU systems
require memory schedulers with large request buffers
 Problem: Existing monolithic application-aware
memory scheduler designs are hard to scale to large
request buffer sizes
 Solution: Staged Memory Scheduling (SMS)
decomposes the memory controller into three simple
stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between
applications
3) DRAM command scheduler: issues requests to DRAM
 Compared to state-of-the-art memory schedulers:
 SMS is significantly simpler and more scalable 112
SMS: Staged Memory Scheduling
[Figure: requests from Cores 1-4 and the GPU feed a monolithic memory scheduler with a large request buffer; SMS decomposes it into Stage 1 (Batch Formation), Stage 2 (Batch Scheduler), and Stage 3 (DRAM Command Scheduler) issuing to DRAM Banks 1-4]
113
SMS: Staged Memory Scheduling (cont.)
[Figure: the same system with the monolithic scheduler removed, leaving only the three SMS stages]
114
Putting Everything Together
[Figure: Stage 1 (Batch Formation) groups each source's requests (Cores 1-4 and GPU) into batches; Stage 2 (Batch Scheduler) picks the current batch using an SJF or round-robin (RR) scheduling policy; Stage 3 (DRAM Command Scheduler) issues commands to Banks 1-4]
115
Complexity
 Compared to a row hit first scheduler, SMS
consumes*
 66% less area
 46% less static power

 Reduction comes from:


 Monolithic scheduler  stages of simpler schedulers
 Each stage has a simpler scheduler (considers fewer
properties at a time to make the scheduling decision)
 Each stage has simpler buffers (FIFO instead of out-of-
order)
 Each stage has a portion of the total buffer size
(buffering is distributed across stages)
* Based on a Verilog model using 180nm 116
Performance at Different GPU Weights
[Line chart: system performance vs. GPUweight (0.001 to 1000) for ATLAS, TCM, and FR-FCFS; which previous scheduler is best differs across GPU weights]
117
Performance at Different GPU Weights (cont.)
[Line chart: SMS compared against the best previous scheduler at each GPUweight]
 At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight
118
More on SMS
 Rachata Ausavarungnirun, Kevin Chang, Lavanya
Subramanian, Gabriel Loh, and Onur Mutlu,
"Staged Memory Scheduling: Achieving High Perfo
rmance and Scalability in Heterogeneous Systems"

Proceedings of the
39th International Symposium on Computer Architecture
(ISCA), Portland, OR, June 2012. Slides (pptx)

119
DASH Memory Scheduler
[TACO 2016]

120
Current SoC Architectures
CPU CPU CPU CPU

Shared Cache HWA HWA HWA

DRAM Controller

DRAM

 Heterogeneous agents: CPUs and HWAs


 HWA : Hardware Accelerator
 Main memory is shared by CPUs and HWAs 
Interference
How to schedule memory requests from CPUs and HWAs
to mitigate interference?
121
DASH Scheduler: Executive
Summary
Problem: Hardware accelerators (HWAs) and CPUs share the
same memory subsystem and interfere with each other in main
memory
 Goal: Design a memory scheduler that improves CPU
performance while meeting HWAs’ deadlines
 Challenge: Different HWAs have different memory access
characteristics and different deadlines, which current schedulers
do not smoothly handle
 Memory-intensive and long-deadline HWAs significantly degrade CPU
performance when they become high priority (due to slow progress)
 Short-deadline HWAs sometimes miss their deadlines despite high
priority
 Solution: DASH Memory Scheduler
 Prioritize HWAs over CPU anytime when the HWA is not making good
progress
 Application-aware scheduling for CPUs and HWAs
 Key Results:
 1) Improves CPU performance for a wide variety of workloads by …
122
Goal of Our Scheduler (DASH)
• Goal: Design a memory scheduler that
– Meets GPU/accelerators’ frame rates/deadlines and
– Achieves high CPU performance

• Basic Idea:
– Different CPU applications and hardware
accelerators have different memory requirements
– Track progress of different agents and prioritize
accordingly

123
Key Observation:
Distribute Priority for Accelerators
• GPU/accelerators need priority to meet deadlines
• Worst case prioritization not always the best
• Prioritize when they are not on track to meet a
deadline

Distributing priority over time mitigates impact


of accelerators on CPU cores’ requests

124
Key Observation:
Not All Accelerators are Equal
• Long-deadline accelerators are more likely to
meet their deadlines
• Short-deadline accelerators are more likely to
miss their deadlines

Schedule short-deadline accelerators


based on worst-case memory access time

125
Key Observation:
Not All CPU cores are Equal
• Memory-intensive cores are much less
vulnerable to interference
• Memory non-intensive cores are much more
vulnerable to interference

Prioritize accelerators over memory-intensive cores


to ensure accelerators do not become urgent

126
DASH Summary:
Key Ideas and Results
• Distribute priority for HWAs
• Prioritize HWAs over memory-intensive CPU
cores even when not urgent
• Prioritize short-deadline-period HWAs based
on worst case estimates

Improves CPU performance by 7-21%


Meets (almost) 100% of deadlines for HWAs

127
DASH: Scheduling Policy
 DASH scheduling policy
 1. Short-deadline-period HWAs with high priority
 2. Long-deadline-period HWAs with high priority
 3. Memory non-intensive CPU applications
 4. Long-deadline-period HWAs with low priority   } switched probabilistically
 5. Memory-intensive CPU applications             }
 6. Short-deadline-period HWAs with low priority
128
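The six priority levels can be sketched as a single classification function; this is an illustrative mapping in which an HWA's "high priority" state is modeled with an urgent flag (set when the HWA is not on track to meet its deadline), and the probabilistic switching between levels 4 and 5 is omitted.

  /* Illustrative DASH priority levels: smaller return value = higher priority. */
  typedef enum { SRC_CPU, SRC_HWA } source_kind_t;

  typedef struct {
      source_kind_t kind;
      int short_deadline;    /* HWA only: 1 if short deadline period         */
      int urgent;            /* HWA only: 1 if not on track to meet deadline */
      int mem_intensive;     /* CPU only: 1 if memory intensive              */
  } source_t;

  int dash_priority_level(const source_t *s) {
      if (s->kind == SRC_HWA && s->short_deadline && s->urgent)   return 1;
      if (s->kind == SRC_HWA && !s->short_deadline && s->urgent)  return 2;
      if (s->kind == SRC_CPU && !s->mem_intensive)                return 3;
      if (s->kind == SRC_HWA && !s->short_deadline)               return 4;
      if (s->kind == SRC_CPU)                                     return 5;
      return 6;  /* short-deadline HWA that is currently on track */
  }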
More on DASH
 Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang,
and Onur Mutlu,
"DASH: Deadline-Aware High-Performance Memory S
cheduler for Heterogeneous Systems with Hardware
Accelerators"

ACM Transactions on Architecture and Code Optimization


(TACO), Vol. 12, January 2016.
Presented at the 11th HiPEAC Conference, Prague, Czech
Republic, January 2016.
[Slides (pptx) (pdf)]
[Source Code]

129
Predictable Performance:
Strong Memory Service
Guarantees

130
Goal: Predictable Performance in
Complex Systems
CPU CPU CPU CPU
GPU

Shared Cache HWA HWA

DRAM and Hybrid Memory Controllers

DRAM and Hybrid Memories

 Heterogeneous agents: CPUs, GPUs, and HWAs


 Main memory interference between CPUs, GPUs,

HWAs
How to allocate resources to heterogeneous agents
to mitigate interference and provide predictable performance?
131
Strong Memory Service
Guarantees
Goal: Satisfy performance/SLA requirements in the
presence of shared main memory, heterogeneous
agents, and hybrid memory/storage
 Approach:
 Develop techniques/models to accurately estimate the
performance loss of an application/agent in the
presence of resource sharing
 Develop mechanisms (hardware and software) to
enable the resource partitioning/prioritization needed
to achieve the required performance levels for all
applications
 All the while providing high system performance
 Subramanian et al., “MISE: Providing Performance Predictability and Improving
Fairness in Shared Main Memory Systems,” HPCA 2013.
 Subramanian et al., “The Application Slowdown Model,” MICRO 2015. 132
Predictable Performance Readings (I)
 Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Fairness via Source Throttling: A Configurable an
d High-Performance Fairness Substrate for Multi-C
ore Memory Systems"

Proceedings of the
15th International Conference on Architectural Support fo
r Programming Languages and Operating Systems
(ASPLOS), pages 335-346, Pittsburgh, PA, March 2010.
Slides (pdf)

133
Predictable Performance Readings (II)
 Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu,
"MISE: Providing Performance Predictability and I
mproving Fairness in Shared Main Memory System
s"

Proceedings of the
19th International Symposium on High-Performance Com
puter Architecture
(HPCA), Shenzhen, China, February 2013. Slides (pptx)

134
Predictable Performance Readings (III)
 Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu,
"The Application Slowdown Model: Quantifying and Con
trolling the Impact of Inter-Application Interference at
Shared Caches and Main Memory"

Proceedings of the
48th International Symposium on Microarchitecture (MICRO),
Waikiki, Hawaii, USA, December 2015.
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [
Poster (pptx) (pdf)]
[Source Code]

135
MISE:
Providing Performance
Predictability
in Shared Main Memory
Systems
Lavanya Subramanian, Vivek Seshadri,
Yoongu Kim, Ben Jaiyen, Onur Mutlu

136
Unpredictable Application Slowdowns
[Bar charts: slowdowns of co-running applications on a shared memory system; leslie3d (core 0) slows down moderately when running with gcc (core 1) but much more when running with mcf (core 1)]

 An application's performance depends on which application it is running with
137
Need for Predictable Performance
 There is a need for predictable performance
 When multiple applications share resources
 Especially if some applications require performance guarantees

 Our Goal: Predictable performance in the presence of memory interference

 Example 1: In mobile systems
 Interactive applications run with non-interactive applications
 Need to guarantee performance for interactive applications

 Example 2: In server systems
 Different users' jobs consolidated onto the same server
 Need to provide bounded slowdowns to critical jobs
138
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown

139
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
140
Slowdown: Definition

Slowdown = Performance Alone / Performance Shared

141
Key Observation 1
For a memory-bound application, Performance ∝ Memory request service rate
[Scatter plot: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar on an Intel Core i7 (4 cores, 8.5 GB/s memory bandwidth); the relationship is close to linear]

Slowdown = Performance Alone / Performance Shared ≈ Request Service Rate Alone / Request Service Rate Shared

Performance Alone is harder to measure; request service rates are easier to measure.
142
Key Observation 2
Request Service Rate Alone (RSRAlone) of an application
can be estimated by giving the application highest
priority in accessing memory

Highest priority  Little interference


(almost as if the application were run alone)

143
Key Observation 2 (cont.)
[Figure: three request-buffer scenarios. 1. Run alone: the application's requests are serviced back to back by main memory. 2. Run with another application: its requests are delayed by the other application's requests. 3. Run with another application but given highest priority: its requests are serviced almost as if it were running alone.]
144
Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

Slowdown = Request Service Rate Alone (RSR_Alone) / Request Service Rate Shared (RSR_Shared)

145
Key Observation 3
 Memory-bound application: execution alternates between a compute phase and a memory phase
[Timeline: with interference, the memory phase (requests Req, Req, Req) stretches while the compute phase is unchanged]

 Memory phase slowdown dominates overall slowdown
146
Key Observation 3 (cont.)
 Non-memory-bound application: compute phase is a fraction (1 − α) of execution, memory phase is a fraction α
[Timeline: with interference, only the memory phase stretches]

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications:

Slowdown = (1 − α) + α × (RSR_Alone / RSR_Shared)

 Only the memory fraction (α) slows down with interference
147
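The MISE formula translates directly into a one-line computation; the function below is a sketch, with α assumed to be the measured fraction of cycles the application stalls for memory.

  /* MISE slowdown estimate. For a fully memory-bound application (alpha ~= 1)
   * this reduces to rsr_alone / rsr_shared. */
  double mise_slowdown(double rsr_alone, double rsr_shared, double alpha) {
      return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared);
  }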
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
148
Interval Based Operation
[Timeline: execution is divided into intervals; during each interval, measure RSR_Shared and α and estimate RSR_Alone, then estimate the slowdown at the end of the interval; repeat every interval]
149
Measuring RSR_Shared and α
 Request Service Rate Shared (RSR_Shared)
 Per-core counter to track number of requests serviced
 At the end of each interval, measure

RSR_Shared = Number of Requests Serviced / Interval Length

 Memory Phase Fraction (α)
 Count number of stall cycles at the core
 Compute fraction of cycles stalled for memory
150
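The two measurements amount to dividing simple per-interval counters; a sketch is below, with the counter names as assumptions (in hardware they would be per-core registers).

  typedef struct {
      unsigned long requests_serviced;  /* per-core: requests serviced this interval */
      unsigned long mem_stall_cycles;   /* per-core: cycles stalled on memory        */
      unsigned long interval_cycles;    /* length of the interval in cycles          */
  } interval_counters_t;

  void measure_interval(const interval_counters_t *c,
                        double *rsr_shared, double *alpha) {
      *rsr_shared = (double)c->requests_serviced / (double)c->interval_cycles;
      *alpha      = (double)c->mem_stall_cycles  / (double)c->interval_cycles;
  }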
Estimating Request Service Rate Alone (RSR_Alone)
 Goal: Estimate RSR_Alone
 How: Periodically give each application the highest priority in accessing memory

 Divide each interval into shorter epochs
 At the beginning of each epoch
 The memory controller randomly picks an application as the highest priority application
 At the end of an interval, for each application, estimate

RSR_Alone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority

151
Inaccuracy in Estimating RSR_Alone
 Even when an application has highest priority, it still experiences some interference
[Figure: request-buffer timelines showing that the highest-priority application's request can still wait behind another application's request that was issued earlier; those waiting cycles are interference cycles]
152
Accounting for Interference in RSR_Alone Estimation
 Solution: Determine and remove interference cycles from the RSR_Alone calculation

RSR_Alone = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority − Interference Cycles)

 A cycle is an interference cycle if
 a request from the highest priority application is waiting in the request buffer, and
 another application's request was issued previously
153
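Combining the high-priority-epoch counters with the interference-cycle correction gives the RSR_Alone estimate; the sketch below assumes the three counters are maintained per application.

  typedef struct {
      unsigned long high_prio_requests;  /* requests serviced during its high-priority epochs  */
      unsigned long high_prio_cycles;    /* cycles the application was given highest priority  */
      unsigned long interference_cycles; /* high-priority cycles spent waiting behind another
                                            application's previously issued request            */
  } rsr_alone_counters_t;

  double estimate_rsr_alone(const rsr_alone_counters_t *c) {
      unsigned long effective_cycles = c->high_prio_cycles - c->interference_cycles;
      if (effective_cycles == 0)
          return 0.0;  /* not enough high-priority samples this interval */
      return (double)c->high_prio_requests / (double)effective_cycles;
  }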
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
154
MISE Model: Putting it All Together
[Timeline: in every interval, measure RSR_Shared and α, estimate RSR_Alone, and estimate the slowdown; repeat in the next interval]
155
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
156
Previous Work on Slowdown Estimation
 Previous work on slowdown estimation
 STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO '07]
 FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS '10]
 Per-thread Cycle Accounting [Du Bois+, HiPEAC '13]

 Basic Idea:

Slowdown = Stall Time Shared / Stall Time Alone

 Stall Time Shared is easy to measure; Stall Time Alone is hard to estimate: count the number of cycles the application receives interference
157
Two Major Advantages of MISE
Over STFM
 Advantage 1:
 STFM estimates alone performance while an
application is receiving interference  Hard
 MISE estimates alone performance while
giving an application the highest priority 
Easier

 Advantage 2:
 STFM does not take into account compute
phase for non-memory-bound applications
 MISE accounts for compute phase  Better
accuracy

158
Methodology
 Configuration of our simulated system
 4 cores
 1 channel, 8 banks/channel
 DDR3 1066 DRAM
 512 KB private cache/core

 Workloads
 SPEC CPU2006
 300 multi programmed workloads

159
Quantitative Comparison
SPEC CPU2006 application: leslie3d
[Line chart: actual slowdown and the slowdowns estimated by STFM and MISE over 100 million cycles; MISE tracks the actual slowdown closely]
160
Comparison to STFM
[Line charts: actual vs. estimated slowdowns for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]
Average error of MISE: 8.2%
Average error of STFM: 29.4%
(across 300 workloads)
161
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
162
Providing “Soft” Slowdown
Guarantees
Goal
1. Ensure QoS-critical applications meet a
prescribed slowdown bound
2. Maximize system performance for other
applications

 Basic Idea
 Allocate just enough bandwidth to QoS-critical

application
 Assign remaining bandwidth to other

applications

163
MISE-QoS: Mechanism to Provide
Soft QoS
 Assign an initial bandwidth allocation to QoS-critical
application
 Estimate slowdown of QoS-critical application using the
MISE model
 After every N intervals
 If slowdown > bound B +/- ε, increase bandwidth
allocation
 If slowdown < bound B +/- ε, decrease bandwidth
allocation
 When slowdown bound not met for N intervals
 Notify the OS so it can migrate/de-schedule jobs

164
Methodology
 Each application (25 applications in total)
considered the QoS-critical application
 Run with 12 sets of co-runners of different memory
intensities
 Total of 300 multiprogrammed workloads
 Each workload run with 10 slowdown bound values
 Baseline memory scheduling mechanism
 Always prioritize QoS-critical application
[Iyer+, SIGMETRICS 2007]
 Other applications’ requests scheduled in FRFCFS
order
[Zuravleff +, US Patent 1997, Rixner+, ISCA 2000]

165
A Look at One Workload
[Bar charts: slowdowns of leslie3d (QoS-critical) and the non-QoS-critical co-runners hmmer, lbm, and omnetpp under AlwaysPrioritize, MISE-QoS-10/1, and MISE-QoS-10/3, for slowdown bounds of 10, 3.33, and 2]
MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of non-QoS-critical applications
166
Effectiveness of MISE in Enforcing QoS
Across 3000 data points:

                     Predicted Met   Predicted Not Met
QoS Bound Met        78.8%           2.1%
QoS Bound Not Met    2.2%            16.9%

 MISE-QoS meets the bound for 80.9% of workloads
 MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads
 AlwaysPrioritize meets the bound for 83% of workloads
167
Performance of Non-QoS-Critical Applications
[Bar chart: harmonic speedup of the non-QoS-critical applications under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9, for workloads with 0 to 3 memory-intensive applications and on average]
 Higher performance when the slowdown bound is loose
 When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%
168
Outline
1. Estimate Slowdown
 Key Observations
 Implementation

 MISE Model: Putting it All Together

 Evaluating the Model

2. Control Slowdown
 Providing Soft Slowdown
Guarantees
 Minimizing Maximum Slowdown
169
Other Results in the Paper
 Sensitivity to model parameters
 Robust across different values of model parameters

 Comparison of STFM and MISE models in enforcing


soft slowdown guarantees
 MISE significantly more effective in enforcing

guarantees

 Minimizing maximum slowdown


 MISE improves fairness across several system

configurations

170
Summary
 Uncontrolled memory interference slows down
applications unpredictably
 Goal: Estimate and control slowdowns
 Key contribution
 MISE: An accurate slowdown estimation model
 Average error of MISE: 8.2%
 Key Idea
 Request Service Rate is a proxy for performance
 Request Service Rate Alone estimated by giving an
application highest priority in accessing memory
 Leverage slowdown estimates to control
slowdowns
 Providing soft slowdown guarantees
 Minimizing maximum slowdown
171
MISE: Pros and Cons
 Upsides:
 Simple new insight to estimate slowdown
 Much more accurate slowdown estimations than prior
techniques (STFM, FST)
 Enables a number of QoS mechanisms that can use
slowdown estimates to satisfy performance
requirements

 Downsides:
 Slowdown estimation is not perfect - there are still
errors
 Does not take into account caches and other shared
resources in slowdown estimation

172
More on MISE
 Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben
Jaiyen, and Onur Mutlu,
"MISE: Providing Performance Predictability and I
mproving Fairness in Shared Main Memory System
s"

Proceedings of the
19th International Symposium on High-Performance Com
puter Architecture
(HPCA), Shenzhen, China, February 2013. Slides (pptx)

173
Extending MISE to Shared Caches: ASM
 Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu,
"The Application Slowdown Model: Quantifying and Con
trolling the Impact of Inter-Application Interference at
Shared Caches and Main Memory"

Proceedings of the
48th International Symposium on Microarchitecture (MICRO),
Waikiki, Hawaii, USA, December 2015.
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [
Poster (pptx) (pdf)]
[Source Code]

174
Handling Memory Interference
In Multithreaded Applications

Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin,


Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Parallel Application Memory Scheduling"
Proceedings of the 44th International Symposium on Microarchitecture (MICRO),
Porto Alegre, Brazil, December 2011. Slides (pptx)
Multithreaded (Parallel)
Applications
Threads in a multi-threaded application can be inter-
dependent
 As opposed to threads from different applications

 Such threads can synchronize with each other


 Locks, barriers, pipeline stages, condition variables,
semaphores, …

 Some threads can be on the critical path of


execution due to synchronization; some threads are
not

 Even within a thread, some “code segments” may


be on the critical path of execution; some are not
176
Critical Sections
 Enforce mutually exclusive access to shared data
 Only one thread can be executing it at a time
 Contended critical sections make threads wait 
threads causing serialization can be on the critical
path
Each thread:
loop {
  Compute              // N: non-critical section
  lock(A)
  Update shared data   // C: critical section
  unlock(A)
}

177
Barriers
 Synchronization point
 Threads have to wait until all threads reach the
barrier
 Last thread arriving at the barrier is on the critical
path
Each thread:
loop1 {
Compute
}
barrier
loop2 {
Compute
}

178
Stages of Pipelined Programs
 Loop iterations are statically divided into code segments called
stages
 Threads execute stages on different cores
 Thread executing the slowest stage is on the critical path

A B C

loop {
Compute1 A

Compute2 B

Compute3 C
}

179
Handling Interference in Parallel
Applications
 Threads in a multithreaded application are inter-
dependent
 Some threads can be on the critical path of
execution due to synchronization; some threads are
not
 How do we schedule requests of inter-dependent
threads to maximize multithreaded application
performance?

 Idea: Estimate limiter threads likely to be on the critical


path and prioritize their requests; shuffle priorities of
non-limiter threads to reduce memory interference
among them [Ebrahimi+, MICRO’11]

 Hardware/software cooperative limiter thread estimation:
 Thread executing the most contended critical section
PAMS Micro 2011 Talk
180
Prioritizing Requests from Limiter Threads
[Figure: execution timelines of Threads A-D between two barriers, with non-critical sections, Critical Sections 1 and 2, waiting time for sync or lock, and the critical path marked. Limiter thread identification: Critical Section 1 is the most contended, so Thread C is the limiter thread; prioritizing its requests saves cycles for the whole application]
181
Parallel App Mem Scheduling: Pros
and Cons
 Upsides:

 Improves the performance of multi-threaded


applications
 Provides a mechanism for estimating “limiter threads”
 Opens a path for slowdown estimation for multi-
threaded applications

 Downsides:
 What if there are multiple multi-threaded applications
running together?
 Limiter thread estimation can become complex

182
More on PAMS
 Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin,
Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Parallel Application Memory Scheduling"
Proceedings of the
44th International Symposium on Microarchitecture
(MICRO), Porto Alegre, Brazil, December 2011.
Slides (pptx)

183
