
Heracles: Improving Resource Efficiency at Scale

David Lo† , Liqun Cheng‡ , Rama Govindaraju‡ , Parthasarathy Ranganathan‡ and Christos Kozyrakis†
Stanford University† Google, Inc.‡

Abstract

User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services, since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity.

We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.

1 Introduction

Public and private cloud frameworks allow us to host an increasing number of workloads in large-scale datacenters with tens of thousands of servers. The business models for cloud services emphasize reduced infrastructure costs. Of the total cost of ownership (TCO) for modern energy-efficient datacenters, servers are the largest fraction (50-70%) [7]. Maximizing server utilization is therefore important for continued scaling.

Until recently, scaling from Moore's law provided higher compute per dollar with every server generation, allowing datacenters to scale without raising the cost. However, with several imminent challenges in technology scaling [21, 25], alternate approaches are needed. Some efforts seek to reduce the server cost through balanced designs or cost-effective components [31, 48, 42]. An orthogonal approach is to improve the return on investment and utility of datacenters by raising server utilization. Low utilization negatively impacts both operational and capital components of cost efficiency. Energy proportionality can reduce operational expenses at low utilization [6, 47]. But, to amortize the much larger capital expenses, an increased emphasis on the effective use of server resources is warranted.

Several studies have established that the average server utilization in most datacenters is low, ranging between 10% and 50% [14, 74, 66, 7, 19, 13]. A primary reason for the low utilization is the popularity of latency-critical (LC) services such as social media, search engines, software-as-a-service, online maps, webmail, machine translation, online shopping, and advertising. These user-facing services are typically scaled across thousands of servers and access distributed state stored in memory or Flash across these servers. While their load varies significantly due to diurnal patterns and unpredictable spikes in user accesses, it is difficult to consolidate load on a subset of highly utilized servers, because the application state does not fit in a small number of servers and moving state is expensive. The cost of such under-utilization can be significant. For instance, Google websearch servers often have an average idleness of 30% over a 24-hour period [47]. For a hypothetical cluster of 10,000 servers, this idleness translates to a wasted capacity of 3,000 servers.

A promising way to improve efficiency is to launch best-effort batch (BE) tasks on the same servers and exploit any resources underutilized by LC workloads [52, 51, 18]. Batch analytics frameworks can generate numerous BE tasks and derive significant value even if these tasks are occasionally deferred or restarted [19, 10, 13, 16]. The main challenge of this approach is interference between colocated workloads on shared resources such as caches, memory, I/O channels, and network links. LC tasks operate with strict service-level objectives (SLOs) on tail latency, and even small amounts of interference can cause significant SLO violations [51, 54, 39]. Hence, some of the past work on workload colocation focused only on throughput workloads [58, 15]. More recent systems predict or detect when a LC task suffers significant interference from the colocated tasks, and avoid or terminate the colocation [75, 60, 19, 50, 51, 81]. These systems protect LC workloads, but reduce the opportunities for higher utilization through colocation.

Recently introduced hardware features for cache isolation and fine-grained power control allow us to improve colocation. This work aims to enable aggressive colocation of LC workloads and BE jobs by automatically coordinating multiple hardware and software isolation mechanisms in modern servers. We focus on two hardware mechanisms, shared cache partitioning and fine-grained power/frequency settings, and two software mechanisms, core/thread scheduling and network traffic control. Our goal is to eliminate SLO violations at all levels of load for the LC job while maximizing the throughput for BE tasks.

There are several challenges towards this goal. First, we must carefully share each individual resource; conservative allocation will minimize the throughput for BE tasks, while optimistic allocation will lead to SLO violations for the LC tasks. Second, the performance of both types of tasks depends on multiple resources, which leads to a large allocation space that must be explored in real-time as load changes. Finally, there are non-obvious interactions between isolated and non-isolated resources in modern servers. For instance, increasing the cache allocation for a LC task to avoid evictions of hot data may create memory bandwidth interference due to the increased misses for BE tasks.
We present Heracles¹, a real-time, dynamic controller that manages four hardware and software isolation mechanisms in a coordinated fashion to maintain the SLO for a LC job. Compared to existing systems [80, 51, 19] that prevent colocation of interfering workloads, Heracles enables a LC task to be colocated with any BE job. It guarantees that the LC workload receives just enough of each shared resource to meet its SLO, thereby maximizing the utility from the BE task. Using online monitoring and some offline profiling information for LC jobs, Heracles identifies when shared resources become saturated and are likely to cause SLO violations, and configures the appropriate isolation mechanism to proactively prevent that from happening.

¹ The mythical hero that killed the multi-headed monster, Lernaean Hydra.

The specific contributions of this work are the following. First, we characterize the impact of interference on shared resources for a set of production, latency-critical workloads at Google, including websearch, an online machine learning clustering algorithm, and an in-memory key-value store. We show that the impact of interference is non-uniform and workload dependent, thus precluding the possibility of static resource partitioning within a server. Next, we design Heracles and show that: a) coordinated management of multiple isolation mechanisms is key to achieving high utilization without SLO violations; b) carefully separating interference into independent subproblems is effective at reducing the complexity of the dynamic control problem; and c) a local, real-time controller that monitors latency in each server is sufficient. We evaluate Heracles on production Google servers by using it to colocate production LC and BE tasks. We show that Heracles achieves an effective machine utilization of 90% averaged across all colocation combinations and loads for the LC tasks while meeting the latency SLOs. Heracles also improves throughput/TCO by 15% to 300%, depending on the initial average utilization of the datacenter. Finally, we establish the need for hardware mechanisms to monitor and isolate DRAM bandwidth, which can improve Heracles' accuracy and eliminate the need for offline information.

To the best of our knowledge, this is the first study to make coordinated use of new and existing isolation mechanisms in a real-time controller to demonstrate significant improvements in efficiency for production systems running LC services.

2 Shared Resource Interference

When two or more workloads execute concurrently on a server, they compete for shared resources. This section reviews the major sources of interference, the available isolation mechanisms, and the motivation for dynamic management.

The primary shared resources in a server are the cores in its one or more CPU sockets. We cannot simply statically partition cores between the LC and BE tasks using mechanisms such as cgroups cpuset [55]. When user-facing services such as search face a load spike, they need all available cores to meet throughput demands without latency SLO violations. Similarly, we cannot simply assign high priority to LC tasks and rely on OS-level scheduling of cores between tasks. Common scheduling algorithms such as Linux's completely fair scheduler (CFS) have vulnerabilities that lead to frequent SLO violations when LC tasks are colocated with BE tasks [39]. Real-time scheduling algorithms (e.g., SCHED_FIFO) are not work-preserving and lead to lower utilization. The availability of HyperThreads in Intel cores leads to further complications, as a HyperThread executing a BE task can interfere with a LC HyperThread on instruction bandwidth, shared L1/L2 caches, and TLBs.

Numerous studies have shown that uncontrolled interference on the shared last-level cache (LLC) can be detrimental for colocated tasks [68, 50, 19, 22, 39]. To address this issue, Intel has recently introduced LLC cache partitioning in server chips. This functionality is called Cache Allocation Technology (CAT), and it enables way-partitioning of a highly-associative LLC into several subsets of smaller associativity [3]. Cores assigned to one subset can only allocate cache lines in their subset on refills, but are allowed to hit in any part of the LLC. It is already well understood that, even when the colocation is between throughput tasks, it is best to dynamically manage cache partitioning using either hardware [30, 64, 15] or software [58, 43] techniques. In the presence of user-facing workloads, dynamic management is more critical, as interference translates to large latency spikes [39]. It is also more challenging, as the cache footprint of user-facing workloads changes with load [36].

Most important LC services operate on large datasets that do not fit in on-chip caches. Hence, they put pressure on DRAM bandwidth at high loads and are sensitive to DRAM bandwidth interference. Despite significant research on memory bandwidth isolation [30, 56, 32, 59], there are no hardware isolation mechanisms in commercially available chips. In multi-socket servers, one can isolate workloads across NUMA channels [9, 73], but this approach constrains DRAM capacity allocation and address interleaving. The lack of hardware support for memory bandwidth isolation complicates and constrains the efficiency of any system that dynamically manages workload colocation.

Datacenter workloads are scale-out applications that generate network traffic. Many datacenters use rich topologies with sufficient bisection bandwidth to avoid routing congestion in the fabric [28, 4]. There are also several networking protocols that prioritize short messages for LC tasks over large messages for BE tasks [5, 76]. Within a server, interference can occur in both the incoming and outgoing directions of the network link. If a BE task causes incast interference, we can throttle its core allocation until networking flow-control mechanisms trigger [62]. In the outgoing direction, we can use traffic control mechanisms in operating systems like Linux to provide bandwidth guarantees to LC tasks and to prioritize their messages ahead of those from BE tasks [12]. Traffic control must be managed dynamically, as bandwidth requirements vary with load. Static priorities can cause underutilization and starvation [61]. Similar traffic control can be applied to solid-state storage devices [69].

Power is an additional source of interference between colocated tasks. All modern multi-core chips have some form of dynamic overclocking, such as Turbo Boost in Intel chips and Turbo Core in AMD chips. These techniques opportunistically raise the operating frequency of the processor chip higher than the nominal frequency in the presence of power headroom. Thus, the clock frequency for the cores used by a LC task depends not just on its own load, but also on the intensity of any BE task running on the same socket. In other words, the performance of LC tasks can suffer from unexpected drops in frequency due to colocated tasks. This interference can be mitigated with per-core dynamic voltage and frequency scaling (DVFS), as cores running BE tasks can have their frequency decreased to ensure that the LC jobs maintain a guaranteed frequency. A static policy would run all BE jobs at minimum frequency, thus ensuring that the LC tasks are not power-limited. However, this approach severely penalizes the vast majority of BE tasks. Most BE jobs do not have the profile of a power virus² and LC tasks only need the additional frequency boost during periods of high load. Thus, a dynamic solution that adjusts the allocation of power between cores is needed to ensure that LC cores run at a guaranteed minimum frequency while maximizing the frequency of cores for BE tasks.

² A computation that maximizes activity and power consumption of a core.
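As a concrete illustration of the cpuset mechanism mentioned above (and used later by Heracles in §4.1), the following is a minimal sketch of pinning LC and BE tasks to disjoint cores through the cgroup-v1 cpuset filesystem. The mount point is the common default; the group names, core ranges, and PIDs are hypothetical, not the paper's actual configuration:

    import os

    CGROOT = "/sys/fs/cgroup/cpuset"  # assumes a cgroup-v1 cpuset hierarchy mounted here

    def make_cpuset(name, cpus, mems, pids):
        """Create a cpuset cgroup, assign it CPUs/memory nodes, and move tasks into it."""
        path = os.path.join(CGROOT, name)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "cpuset.cpus"), "w") as f:
            f.write(cpus)   # e.g. "0-13": the cores reserved for this group
        with open(os.path.join(path, "cpuset.mems"), "w") as f:
            f.write(mems)   # the NUMA nodes the group may allocate memory from
        for pid in pids:
            with open(os.path.join(path, "tasks"), "w") as f:
                f.write(str(pid))  # migrating a task takes effect within milliseconds

    # Hypothetical split on a 16-core socket: LC gets cores 0-13, BE gets 14-15.
    # make_cpuset("lc", "0-13", "0", lc_pids)
    # make_cpuset("be", "14-15", "0", be_pids)

Because the files can be rewritten at any time, the same interface supports the dynamic core reallocation that §2 argues is necessary when the LC load spikes.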
A major challenge with colocation is cross-resource interactions. A BE task can cause interference in all the shared resources discussed. Similarly, many LC tasks are sensitive to interference on multiple resources. Therefore, it is not sufficient to manage one source of interference: all potential sources need to be monitored and carefully isolated if need be. In addition, interference sources interact with each other. For example, LLC contention causes both types of tasks to require more DRAM bandwidth, also creating a DRAM bandwidth bottleneck. Similarly, a task that notices network congestion may attempt to use compression, causing core and power contention. In theory, the number of possible interactions scales with the square of the number of interference sources, making this a very difficult problem.

3 Interference Characterization & Analysis

This section characterizes the impact of interference on shared resources for latency-critical services.

3.1 Latency-critical Workloads

We use three Google production latency-critical workloads. websearch is the query serving portion of a production web search service. It is a scale-out workload that provides high throughput with a strict latency SLO by using a large fan-out to thousands of leaf nodes that process each query on their shard of the search index. The SLO for leaf nodes is in the tens of milliseconds for the 99%-ile latency. Load for websearch is generated using an anonymized trace of real user queries.

websearch has a high memory footprint, as it serves shards of the search index stored in DRAM. It also has moderate DRAM bandwidth requirements (40% of available bandwidth at 100% load), as most index accesses miss in the LLC. However, there is a small but significant working set of instructions and data in the hot path. Also, websearch is fairly compute intensive, as it needs to score and sort search hits. However, it does not consume a significant amount of network bandwidth. For this study, we reserve a small fraction of DRAM on search servers to enable colocation of BE workloads with websearch.

ml_cluster is a standalone service that performs real-time text clustering using machine-learning techniques. Several Google services use ml_cluster to assign a cluster to a snippet of text. ml_cluster performs this task by locating the closest clusters for the text in a model that was previously learned offline. This model is kept in main memory for performance reasons. The SLO for ml_cluster is a 95%-ile latency guarantee of tens of milliseconds. ml_cluster is exercised using an anonymized trace of requests captured from production services.

Compared to websearch, ml_cluster is more memory bandwidth intensive (with 60% DRAM bandwidth usage at peak) but slightly less compute intensive (lower CPU power usage overall). It has low network bandwidth requirements. An interesting property of ml_cluster is that each request has a very small cache footprint, but, in the presence of many outstanding requests, this translates into a large amount of cache pressure that spills over to DRAM. This is reflected in our analysis as a super-linear growth in DRAM bandwidth use for ml_cluster versus load.

memkeyval is an in-memory key-value store, similar to memcached [2]. memkeyval is used as a caching service in the backends of several Google web services. Other large-scale web services, such as Facebook and Twitter, use memcached extensively. memkeyval has significantly less processing per request compared to websearch, leading to extremely high throughput on the order of hundreds of thousands of requests per second at peak. Since each request is processed quickly, the SLO latency is very low, in the few hundreds of microseconds for the 99%-ile latency. Load generation for memkeyval uses an anonymized trace of requests captured from production services.

At peak load, memkeyval is network bandwidth limited. Despite the small amount of network protocol processing done per request, the high request rate makes memkeyval compute-bound. In contrast, DRAM bandwidth requirements are low (20% DRAM bandwidth utilization at max load), as requests simply retrieve values from DRAM and put the response on the wire. memkeyval has both a static working set in the LLC for instructions, as well as a per-request data working set.

3.2 Characterization Methodology

To understand their sensitivity to interference on shared resources, we ran each of the three LC workloads with a synthetic benchmark that stresses each resource in isolation. While these are single-node experiments, there can still be significant network traffic, as the load is generated remotely. We repeated the characterization at various load points for the LC jobs and recorded the impact of the colocation on tail latency. We used production Google servers with dual-socket Intel Xeons based on the Haswell architecture. Each CPU has a high core count, with a nominal frequency of 2.3GHz and 2.5MB of LLC per core. The chips have hardware support for way-partitioning of the LLC.

We performed the following characterization experiments:

Cores: As we discussed in §2, we cannot share a logical core (a single HyperThread) between a LC and a BE task because OS scheduling can introduce latency spikes in the order of tens of milliseconds [39]. Hence, we focus on the potential of using separate HyperThreads that run pinned on the same physical core. We characterize the impact on the LC task of a colocated HyperThread that implements a tight spinloop. This experiment captures a lower bound of HyperThread interference. A more compute or memory intensive microbenchmark would antagonize the LC HyperThread for more core resources (e.g., execution units) and space in the private caches (L1 and L2). Hence, if this experiment shows high impact on tail latency, we can conclude that core sharing through HyperThreads is not a practical option.
LLC: The interference impact of LLC antagonists is measured by pinning the LC workload to enough cores to satisfy its SLO at the specific load and pinning a cache antagonist that streams through a large data array on the remaining cores of the socket. We use several array sizes that take up a quarter, half, and almost all of the LLC, and denote these configurations as LLC small, medium, and big respectively.

DRAM bandwidth: The impact of DRAM bandwidth interference is characterized in a similar fashion to LLC interference, using a significantly larger array for streaming. We use numactl to ensure that the DRAM antagonist and the LC task are placed on the same socket(s) and that all memory channels are stressed.

Network traffic: We use iperf, an open source TCP streaming benchmark [1], to saturate the network transmit (outgoing) bandwidth. All cores except for one are given to the LC workload. Since the LC workloads we consider serve requests from multiple clients connecting to the service they provide, we generate interference in the form of many low-bandwidth "mice" flows. Network interference can also be generated using a few "elephant" flows. However, such flows can be effectively throttled by TCP congestion control [11], while the many "mice" flows of the LC workload will not be impacted.

Power: To characterize the latency impact of a power antagonist, the same division of cores is used as in the cases of generating LLC and DRAM interference. Instead of running a memory access antagonist, a CPU power virus is used. The power virus is designed such that it stresses all the components of the core, leading to high power draw and lower CPU core frequencies.

OS Isolation: For completeness, we evaluate the overall impact of running a BE task along with a LC workload using only the isolation mechanisms available in the OS. Namely, we execute the two workloads in separate Linux containers and set the BE workload to be low priority. The scheduling policy is enforced by CFS using the shares parameter, where the BE task receives very few shares compared to the LC workload. No other isolation mechanisms are used in this case. The BE task is the Google brain workload [38, 67], which we will describe further in §5.1.

3.3 Interference Analysis

Figure 1 presents the impact of the interference microbenchmarks on the tail latency of the three LC workloads. Each row shows the tail latency of the LC workload when colocated with the corresponding microbenchmark, with each column a different load level. The interference impact is acceptable if and only if the tail latency is less than 100% of the target SLO. In Figure 1, we color-code (red/yellow) all cases where the SLO latency is violated.

By observing the rows for brain, we immediately notice that current OS isolation mechanisms are inadequate for colocating LC tasks with BE tasks. Even at low loads, the BE task creates sufficient pressure on shared resources to lead to SLO violations for all three workloads. A large contributor to this is that the OS allows both workloads to run on the same core and even the same HyperThread, further compounding the interference. Tail latency eventually goes above 300% of SLO latency. Proposed interference-aware cluster managers, such as Paragon [18] and Bubble-Up [51], would disallow these colocations. To enable aggressive task colocation, not only do we need to disallow different workloads on the same core or HyperThread, we also need to use stronger isolation mechanisms.

The sensitivity of LC tasks to interference on individual shared resources varies. For instance, memkeyval is quite sensitive to network interference, while websearch and ml_cluster are not affected at all. websearch is uniformly insensitive to small and medium amounts of LLC interference, while the same cannot be said for memkeyval or ml_cluster. Furthermore, the impact of interference changes depending on the load: ml_cluster can tolerate medium amounts of LLC interference at loads <50% but is heavily impacted at higher loads. These observations motivate the need for dynamic management of isolation mechanisms in order to adapt to differences across varying loads and different workloads. Any static policy would be either too conservative (missing opportunities for colocation) or overly optimistic (leading to SLO violations).

We now discuss each LC workload separately, in order to understand their particular resource requirements.

websearch: This workload has a small footprint, and LLC (small) and LLC (med) interference do not impact its tail latency. Nevertheless, the impact is significant with LLC (big) interference. The degradation is caused by two factors. First, the inclusive nature of the LLC in this particular chip means that high LLC interference leads to misses in the working set of instructions. Second, contention for the LLC causes significant DRAM pressure as well. websearch is particularly sensitive to interference caused by DRAM bandwidth saturation. As the load of websearch increases, the impact of LLC and DRAM interference decreases. At higher loads, websearch uses more cores while the interference generator is given fewer cores. Thus, websearch can defend its share of resources better.

websearch is moderately impacted by HyperThread interference until high loads. This indicates that the core has sufficient instruction issue bandwidth for both the spinloop and websearch until around 80% load. Since the spinloop only accesses registers, it doesn't cause interference in the L1 or L2 caches. However, since the HyperThread antagonist has the smallest possible effect, more intensive antagonists will cause far larger performance problems. Thus, HyperThread interference in practice should be avoided. Power interference has a significant impact on websearch at lower utilization, as more cores are executing the power virus. As expected, the network antagonist does not impact websearch, due to websearch's low bandwidth needs.

ml_cluster: ml_cluster is sensitive to LLC interference of smaller size, due to the small but significant per-request working set. This manifests itself as a large jump in latency at 75% load for LLC (small) and 50% load for LLC (medium). With larger LLC interference, ml_cluster experiences major latency degradation. ml_cluster is also sensitive to DRAM bandwidth interference, primarily at lower loads (see the explanation for websearch). ml_cluster is moderately resistant to HyperThread interference until high loads, suggesting that it only reaches high instruction issue rates at high loads. Power interference has a lesser impact on ml_cluster, since it is less compute intensive than websearch. Finally, ml_cluster is not impacted at all by network interference.

memkeyval: Due to its significantly stricter latency SLO, memkeyval is sensitive to all types of interference. At high load, memkeyval becomes sensitive even to small LLC interference as the small per-request working sets add up. When faced with medium LLC interference, there are two latency peaks. The first peak at low load is caused by the antagonist removing instructions from the cache. When memkeyval obtains enough cores at high load, it avoids these evictions. The second peak is at higher loads, when the antagonist interferes with the per-request working set. At high levels of LLC interference, memkeyval is unable to meet its SLO. Even though memkeyval has low DRAM bandwidth requirements, it is strongly affected by a DRAM streaming antagonist. Ironically, the few memory requests from memkeyval are overwhelmed by the DRAM antagonist.

memkeyval is not sensitive to the HyperThread antagonist except at high loads. In contrast, it is very sensitive to the power antagonist, as it is compute-bound. memkeyval does consume a large amount of network bandwidth, and thus is highly susceptible to competing network flows. Even at small loads, it is completely overrun by the many small "mice" flows of the antagonist and is unable to meet its SLO.
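As a rough illustration of the cache and memory streaming antagonists used in these experiments (the production microbenchmarks themselves are not published in the paper), the following Python/NumPy sketch streams through an array whose size determines whether it pressures mainly the LLC or DRAM. The byte sizes are assumptions for a CPU with roughly 45MB of LLC, chosen only to mirror the LLC small/medium/big and DRAM configurations of §3.2:

    import numpy as np

    MB = 1024 * 1024
    SIZES = {
        "LLC small": 11 * MB,    # ~a quarter of the assumed LLC
        "LLC med":   22 * MB,    # ~half of the assumed LLC
        "LLC big":   45 * MB,    # ~all of the assumed LLC
        "DRAM":      2048 * MB,  # far larger than the LLC: misses stream to DRAM
    }

    def antagonize(size_bytes):
        # Stream over the array forever. Each pass touches one 8-byte element
        # per 64-byte cache line, evicting the victim's data from the shared
        # LLC; for the DRAM configuration it also consumes memory bandwidth.
        data = np.zeros(size_bytes // 8, dtype=np.int64)
        while True:
            data[::8] += 1  # stride of 8 int64s = one write per cache line

    # Example: run the "LLC med" antagonist (core pinning would be done
    # externally, e.g. with taskset or the cpuset sketch shown in §2).
    # antagonize(SIZES["LLC med"])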
websearch
5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
LLC (small) 134% 103% 96% 96% 109% 102% 100% 96% 96% 104% 99% 100% 101% 100% 104% 103% 104% 103% 99%
LLC (med) 152% 106% 99% 99% 116% 111% 109% 103% 105% 116% 109% 108% 107% 110% 123% 125% 114% 111% 101%
LLC (big) >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 264% 222% 123% 102%
DRAM >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 270% 228% 122% 103%
HyperThread 81% 109% 106% 106% 104% 113% 106% 114% 113% 105% 114% 117% 118% 119% 122% 136% >300% >300% >300%
CPU power 190% 124% 110% 107% 134% 115% 106% 108% 102% 114% 107% 105% 104% 101% 105% 100% 98% 99% 97%
Network 35% 35% 36% 36% 36% 36% 36% 37% 37% 38% 39% 41% 44% 48% 51% 55% 58% 64% 95%
brain 158% 165% 157% 173% 160% 168% 180% 230% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300%

ml_cluster
5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
LLC (small) 101% 88% 99% 84% 91% 110% 96% 93% 100% 216% 117% 106% 119% 105% 182% 206% 109% 202% 203%
LLC (med) 98% 88% 102% 91% 112% 115% 105% 104% 111% >300% 282% 212% 237% 220% 220% 212% 215% 205% 201%
LLC (big) >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 276% 250% 223% 214% 206%
DRAM >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 287% 230% 223% 211%
HyperThread 113% 109% 110% 111% 104% 100% 97% 107% 111% 112% 114% 114% 114% 119% 121% 130% 259% 262% 262%
CPU power 112% 101% 97% 89% 91% 86% 89% 90% 89% 92% 91% 90% 89% 89% 90% 92% 94% 97% 106%
Network 57% 56% 58% 60% 58% 58% 58% 58% 59% 59% 59% 59% 59% 63% 63% 67% 76% 89% 113%
brain 151% 149% 174% 189% 193% 202% 209% 217% 225% 239% >300% >300% 279% >300% >300% >300% >300% >300% >300%

memkeyval
5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
LLC (small) 115% 88% 88% 91% 99% 101% 79% 91% 97% 101% 135% 138% 148% 140% 134% 150% 114% 78% 70%
LLC (med) 209% 148% 159% 107% 207% 119% 96% 108% 117% 138% 170% 230% 182% 181% 167% 162% 144% 100% 104%
LLC (big) >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 280% 225% 222% 170% 79% 85%
DRAM >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% 252% 234% 199% 103% 100%
HyperThread 26% 31% 32% 32% 32% 32% 33% 35% 39% 43% 48% 51% 56% 62% 81% 119% 116% 153% >300%
CPU power 192% 277% 237% 294% >300% >300% 219% >300% 292% 224% >300% 252% 227% 193% 163% 167% 122% 82% 123%
Network 27% 28% 28% 29% 29% 27% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300%
brain 197% 232% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300% >300%

In the original figure, each entry is color-coded: values at or below 100% of the SLO (e.g., 65%) meet the SLO, while values between 100% and 120% (e.g., 110%) and values at or above 120% (e.g., 140%) are SLO violations.
Figure 1. Impact of interference on shared resources on websearch, ml_cluster, and memkeyval. Each row is an antagonist and each column is a load point for the workload. The values are latencies, normalized to the SLO latency.

4 Heracles Design

We have established the need for isolation mechanisms beyond OS-level scheduling and for a dynamic controller that manages resource sharing between LC and BE tasks. Heracles is a dynamic, feedback-based controller that manages in real-time four hardware and software mechanisms in order to isolate colocated workloads. Heracles implements an iso-latency policy [47], namely that it can increase resource efficiency as long as the SLO is being met. This policy allows for increasing server utilization through tolerating some interference caused by colocation, as long as the difference between the SLO latency target for the LC workload and the actual latency observed (the latency slack) is positive. In its current version, Heracles manages one LC workload with many BE tasks. Since BE tasks are abundant, this is sufficient to raise utilization in many datacenters. We leave colocation of multiple LC workloads to future work.

4.1 Isolation Mechanisms

Heracles manages four mechanisms to mitigate interference.

For core isolation, Heracles uses Linux's cpuset cgroups to pin the LC workload to one set of cores and BE tasks to another set (software mechanism) [55]. This mechanism is necessary, since in §3 we showed that core sharing is detrimental to the latency SLO. Moreover, the number of cores per server is increasing, making core segregation finer-grained. The allocation of cores to tasks is done dynamically. The speed of core (re)allocation is limited by how fast Linux can migrate tasks to other cores, typically in the tens of milliseconds.

For LLC isolation, Heracles uses the Cache Allocation Technology (CAT) available in recent Intel chips (hardware mechanism) [3]. CAT implements way-partitioning of the shared LLC. In a highly-associative LLC, this allows us to define non-overlapping partitions at the granularity of a few percent of the total LLC capacity. We use one partition for the LC workload and a second partition for all BE tasks. Partition sizes can be adjusted dynamically by programming model specific registers (MSRs), with changes taking effect in a few milliseconds.

There are no commercially available DRAM bandwidth isolation mechanisms. We enforce DRAM bandwidth limits in the following manner: we implement a software monitor that periodically tracks the total bandwidth usage through performance counters and estimates the bandwidth used by the LC and BE jobs. If the LC workload does not receive sufficient bandwidth, Heracles scales down the number of cores that BE jobs use. We discuss the limitations of this coarse-grained approach in §4.2.
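To make the CAT interface described above concrete, here is a minimal sketch of way-partitioning from user space. The MSR addresses follow Intel's documented L3 CAT layout (the per-class capacity masks IA32_L3_MASK_n start at 0xC90, and IA32_PQR_ASSOC at 0xC8F selects a core's class of service), but the way masks, core assignments, and the use of the /dev/cpu/*/msr interface are illustrative assumptions, not Heracles' actual implementation:

    import struct

    IA32_PQR_ASSOC = 0xC8F  # per-core: class of service (CLOS) in bits 63:32
    IA32_L3_MASK_0 = 0xC90  # per-CLOS capacity bitmasks: 0xC90 + clos_id

    def wrmsr(cpu, msr, value):
        # Requires root and the 'msr' kernel module (one device file per logical CPU).
        with open("/dev/cpu/%d/msr" % cpu, "wb") as f:
            f.seek(msr)
            f.write(struct.pack("<Q", value))

    def partition_llc(lc_cpus, be_cpus, lc_mask, be_mask):
        """Give CLOS 0 (LC) and CLOS 1 (BE) non-overlapping way masks,
        then tag each core with its CLOS."""
        wrmsr(0, IA32_L3_MASK_0 + 0, lc_mask)  # mask MSRs are package-scoped
        wrmsr(0, IA32_L3_MASK_0 + 1, be_mask)
        for cpu in lc_cpus:
            wrmsr(cpu, IA32_PQR_ASSOC, 0 << 32)  # CLOS 0
        for cpu in be_cpus:
            wrmsr(cpu, IA32_PQR_ASSOC, 1 << 32)  # CLOS 1

    # Hypothetical 20-way LLC: 16 ways to the LC partition, 4 ways to BE.
    # partition_llc(range(0, 14), range(14, 16), 0xFFFF0, 0x0000F)

Because resizing a partition is just an MSR write, repartitioning takes effect in milliseconds, which is what makes the dynamic cache management of §4.3 practical.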
For power isolation, Heracles uses CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS (hardware features) [3, 37]. RAPL is used to monitor CPU power at the per-socket level, while per-core DVFS is used to redistribute power amongst cores. Per-core DVFS setting changes go into effect within a few milliseconds. The frequency steps are in 100MHz increments and span the entire operating frequency range of the processor, including Turbo Boost frequencies.

For network traffic isolation, Heracles uses Linux traffic control (software mechanism). Specifically, we use the qdisc [12] scheduler with the hierarchical token bucket (HTB) queueing discipline to enforce bandwidth limits for outgoing traffic from the BE tasks. The bandwidth limits are set by limiting the maximum traffic burst rate for the BE jobs (the ceil parameter in HTB parlance). The LC job does not have any limits set on it. HTB can be updated very frequently, with new bandwidth limits taking effect in less than hundreds of milliseconds. Managing ingress network interference has been examined in numerous previous works and is outside the scope of this work [33].

4.2 Design Approach

Each hardware or software isolation mechanism allows reasonably precise control of an individual resource. Given that, the controller must dynamically solve the high-dimensional problem of finding the right settings for all these mechanisms at any load for the LC workload and any set of BE tasks. Heracles solves this as an optimization problem, where the objective is to maximize utilization with the constraint that the SLO must be met.

Heracles reduces the optimization complexity by decoupling interference sources. The key insight that enables this reduction is that interference is problematic only when a shared resource becomes saturated, i.e., its utilization is so high that latency problems occur. This insight is derived from the analysis in §3: the antagonists do not cause significant SLO violations until an inflection point, at which point the tail latency degrades extremely rapidly. Hence, if Heracles can prevent any shared resource from saturating, then it can decompose the high-dimensional optimization problem into many smaller and independent problems of one or two dimensions each. Then each sub-problem can be solved using sound optimization methods, such as gradient descent.

Since Heracles must ensure that the target SLO is met for the LC workload, it continuously monitors latency and latency slack and uses both as key inputs in its decisions. When the latency slack is large, Heracles treats this as a signal that it is safe to be more aggressive with colocation; conversely, when the slack is small, it should back off to avoid an SLO violation. Heracles also monitors the load (queries per second), and during periods of high load, it disables colocation due to a high risk of SLO violations. Previous work has shown that indirect performance metrics, such as CPU utilization, are insufficient to guarantee that the SLO is met [47].

Ideally, Heracles should require no offline information other than SLO targets. Unfortunately, one shortcoming of current hardware makes this difficult. The Intel chips we used do not provide accurate mechanisms for measuring (or limiting) DRAM bandwidth usage at a per-core granularity. To understand how Heracles' decisions affect the DRAM bandwidth usage of latency-sensitive and BE tasks and to manage bandwidth saturation, we require some offline information. Specifically, Heracles uses an offline model that describes the DRAM bandwidth used by the latency-sensitive workloads at various loads, core, and LLC allocations. We verified that this model needs to be regenerated only when there are significant changes in the workload structure and that small deviations are fine. There is no need for any offline profiling of the BE tasks, which can vary widely compared to the better managed and understood LC workloads. There is also no need for offline analysis of interactions between latency-sensitive and best-effort tasks. Once we have hardware support for per-core DRAM bandwidth accounting [30], we can eliminate this offline model.

Figure 2. The system diagram of Heracles. (Diagram: latency readings from the LC workload feed a top-level controller that decides whether BE tasks can grow; internal feedback loops run the cores & memory subcontroller (DRAM BW, LLC via CAT), the CPU power subcontroller (CPU power, per-core DVFS), and the network subcontroller (net BW, HTB qdisc).)

4.3 Heracles Controller

Heracles runs as a separate instance on each server, managing the local interactions between the LC and BE jobs. As shown in Figure 2, it is organized as three subcontrollers (cores & memory, power, network traffic) coordinated by a top-level controller. The subcontrollers operate fairly independently of each other and ensure that their respective shared resources are not saturated.

Top-level controller: The pseudo-code for the controller is shown in Algorithm 1. The controller polls the tail latency and load of the LC workload every 15 seconds. This period allows for sufficient queries to calculate statistically meaningful tail latencies. If the load for the LC workload exceeds 85% of its peak on the server, the controller disables the execution of BE workloads. This empirical safeguard avoids the difficulties of latency management on highly utilized systems for minor gains in utilization. For hysteresis purposes, BE execution is enabled when the load drops below 80%. BE execution is also disabled when the latency slack, the difference between the SLO target and the current measured tail latency, is negative. This typically happens when there is a sharp spike in load for the latency-sensitive workload. We give all resources to the latency-critical workload for a while (e.g., 5 minutes) before attempting colocation again. The constants used here were determined through empirical tuning.

When these two safeguards are not active, the controller uses slack to guide the subcontrollers in providing resources to BE tasks. If slack is less than 10%, the subcontrollers are instructed to disallow growth for BE tasks in order to maintain a safety margin. If slack drops below 5%, the subcontroller for cores is instructed to switch cores from BE tasks to the LC workload. This improves the latency of the LC workload and reduces the ability of the BE job to cause interference on any resources. If slack is above 10%, the subcontrollers are instructed to allow BE tasks to acquire a larger share of system resources. Each subcontroller makes allocation decisions independently, provided of course that its resources are not saturated.
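Concretely, the latency input polled in Algorithm 1 below is a tail percentile over each 15-second window. As a minimal sketch of the slack computation — assuming per-request latencies are available from the LC workload, which is our assumption for illustration, not a detail the paper specifies:

    import numpy as np

    SLO_TARGET_MS = 10.0  # hypothetical leaf SLO; real targets are workload-specific

    def poll_slack(window_latencies_ms, percentile=99):
        """Compute tail latency and slack from one 15s window of request latencies."""
        tail = np.percentile(window_latencies_ms, percentile)
        slack = (SLO_TARGET_MS - tail) / SLO_TARGET_MS
        return tail, slack

    # Example: a window whose 99%-ile is 9 ms gives slack = 0.10, exactly the
    # boundary at which the top-level controller stops allowing BE growth.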
while True:
    latency = PollLCAppLatency()
    load = PollLCAppLoad()
    slack = (target - latency) / target
    if slack < 0:
        # SLO violated: stop all BE tasks and back off for a while
        DisableBE()
        EnterCooldown()
    elif load > 0.85:
        # too close to peak load to colocate safely
        DisableBE()
    elif load < 0.80:
        # hysteresis: re-enable BE tasks only once load has dropped
        EnableBE()
    elif slack < 0.10:
        DisallowBEGrowth()
        if slack < 0.05:
            # shrink the BE allocation down to a minimum of two cores
            be_cores.Remove(be_cores.Size() - 2)
    sleep(15)

Algorithm 1: High-level controller.

def PredictedTotalBW():
    return LcBwModel() + BeBw() + bw_derivative

while True:
    MeasureDRAMBw()
    if total_bw > DRAM_LIMIT:
        # DRAM bandwidth saturated: take cores away from BE tasks
        overage = total_bw - DRAM_LIMIT
        be_cores.Remove(overage / BeBwPerCore())
        continue
    if not CanGrowBE():
        continue
    if state == GROW_LLC:
        if PredictedTotalBW() > DRAM_LIMIT:
            state = GROW_CORES
        else:
            # grow the BE cache partition, then check whether it helped
            GrowCacheForBE()
            MeasureDRAMBw()
            if bw_derivative >= 0:
                # more cache did not reduce DRAM traffic: undo and switch
                Rollback()
                state = GROW_CORES
            if not BeBenefit():
                state = GROW_CORES
    elif state == GROW_CORES:
        needed = LcBwModel() + BeBw() + BeBwPerCore()
        if needed > DRAM_LIMIT:
            state = GROW_LLC
        elif slack > 0.10:
            be_cores.Add(1)
    sleep(2)

Algorithm 2: Core & memory sub-controller.

Core & memory subcontroller: Heracles uses a single subcontroller for core and cache allocation due to the strong coupling between core count, LLC needs, and memory bandwidth needs. If there were a direct way to isolate memory bandwidth, we would use independent controllers. The pseudo-code for this subcontroller is shown in Algorithm 2. Its output is the allocation of cores and LLC to the LC and BE jobs (two dimensions).

The first constraint for the subcontroller is to avoid memory bandwidth saturation. The DRAM controllers provide registers that track bandwidth usage, making it easy to detect when they reach 90% of peak streaming DRAM bandwidth. In this case, the subcontroller removes as many cores as needed from BE tasks to avoid saturation. Heracles estimates the bandwidth usage of each BE task using a model of bandwidth needs for the LC workload and a set of hardware counters that are proportional to the per-core memory traffic to the NUMA-local memory controllers. For the latter counters to be useful, we limit each BE task to a single socket for both cores and memory allocations using Linux numactl. Different BE jobs can run on either socket, and LC workloads can span across sockets for cores and memory.

When the top-level controller signals BE growth and there is no DRAM bandwidth saturation, the subcontroller uses gradient descent to find the maximum number of cores and cache partitions that can be given to BE tasks. Offline analysis of LC applications (Figure 3) shows that their performance is a convex function of core and cache resources, thus guaranteeing that gradient descent will find a global optimum. We perform the gradient descent in one dimension at a time, switching between increasing the cores and increasing the cache given to BE tasks. Initially, a BE job is given one core and 10% of the LLC and starts in the GROW_LLC phase. Its LLC allocation is increased as long as the LC workload meets its SLO, bandwidth saturation is avoided, and the BE task benefits. The next phase (GROW_CORES) grows the number of cores for the BE job. Heracles will reassign cores from the LC to the BE job one at a time, each time checking for DRAM bandwidth saturation and SLO violations for the LC workload. If bandwidth saturation occurs first, the subcontroller will return to the GROW_LLC phase. The process repeats until an optimal configuration has been converged upon. The search also terminates on a signal from the top-level controller indicating the end of growth or the disabling of BE jobs. The typical convergence time is about 30 seconds.

During gradient descent, the subcontroller must avoid trying suboptimal allocations that would either trigger DRAM bandwidth saturation or a signal from the top-level controller to disable BE tasks. To estimate the DRAM bandwidth usage of an allocation prior to trying it, the subcontroller uses the derivative of the DRAM bandwidth from the last reallocation of cache or cores. Heracles estimates whether it is close to an SLO violation for the LC task based on the amount of latency slack.

Power subcontroller: The simple subcontroller described in Algorithm 3 ensures that there is sufficient power slack to run the LC workload at a minimum guaranteed frequency. This frequency is determined by measuring the frequency used when the LC workload runs alone at full load. Heracles uses RAPL to determine the operating power of the CPU and its maximum design power, or thermal design power (TDP). It also uses CPU frequency monitoring facilities on each core. When the operating power is close to the TDP and the frequency of the cores running the LC workload is too low, it uses per-core DVFS to lower the frequency of cores running BE tasks in order to shift the power budget to cores running LC tasks. Both conditions must be met in order to avoid confusion when the LC cores enter active-idle modes, which also tends to lower frequency readings. If there is sufficient operating power headroom, Heracles will increase the frequency limit for the BE cores in order to maximize their performance. The control loop runs independently for each of the two sockets and has a cycle time of two seconds.
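As a hedged sketch of the monitoring and actuation interfaces this subcontroller relies on — written against the standard Linux powercap and cpufreq sysfs files rather than whatever internal facilities Heracles actually uses, and with hypothetical paths, cores, and constants:

    import time

    RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 energy counter
    CPUFREQ_MAX = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq"

    def read_package_power(interval_s=0.1):
        """Estimate socket power (watts) from two reads of the RAPL energy counter."""
        with open(RAPL_ENERGY) as f:
            e0 = int(f.read())
        time.sleep(interval_s)
        with open(RAPL_ENERGY) as f:
            e1 = int(f.read())
        return (e1 - e0) / 1e6 / interval_s  # microjoules -> watts

    def set_max_freq_khz(cpu, khz):
        """Cap one core's frequency; lowering BE cores shifts power budget to LC cores."""
        with open(CPUFREQ_MAX % cpu, "w") as f:
            f.write(str(khz))

    # One step of the Algorithm 3 policy, with hypothetical constants:
    # if read_package_power() > 0.9 * TDP_WATTS and lc_freq_khz < GUARANTEED_KHZ:
    #     for cpu in be_cpus:
    #         set_max_freq_khz(cpu, 1200000)  # throttle BE cores to 1.2 GHz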
Figure 3. Characterization of websearch showing that its performance is a convex function of cores and LLC. (Surface plot: maximum load under SLO (%) versus cores (%) and LLC size (%).)

while True:
    power = PollRAPL()                 # per-socket power from RAPL
    ls_freq = PollFrequency(ls_cores)  # frequency of the latency-sensitive cores
    if power > 0.90 * TDP and ls_freq < guaranteed:
        # power-limited and LC cores below their guaranteed frequency
        LowerFrequency(be_cores)
    elif power <= 0.90 * TDP and ls_freq >= guaranteed:
        # headroom available: let BE cores speed up
        IncreaseFrequency(be_cores)
    sleep(2)

Algorithm 3: CPU power sub-controller.

while True:
    ls_bw = GetLCTxBandwidth()
    # reserve headroom of max(5% of the link, 10% of current LC traffic)
    be_bw = LINK_RATE - ls_bw - max(0.05 * LINK_RATE, 0.10 * ls_bw)
    SetBETxBandwidth(be_bw)
    sleep(1)

Algorithm 4: Network sub-controller.

Network subcontroller: This subcontroller prevents saturation of network transmit bandwidth, as shown in Algorithm 4. It monitors the total egress bandwidth of flows associated with the LC workload (LCBandwidth) and sets the total bandwidth limit of all other flows as LinkRate − LCBandwidth − max(0.05·LinkRate, 0.10·LCBandwidth). A small headroom of 10% of the current LCBandwidth or 5% of the LinkRate is added to the reservation for the LC workload in order to handle spikes. For example, on a 10Gbps link with the LC job transmitting 2Gbps, BE flows would be capped at 10 − 2 − max(0.5, 0.2) = 7.5Gbps. The bandwidth limit is enforced via HTB qdiscs in the Linux kernel. This control loop is run once every second, which provides sufficient time for the bandwidth enforcer to settle.

5 Heracles Evaluation

5.1 Methodology

We evaluated Heracles with the three production, latency-critical workloads from Google analyzed in §3. We first performed experiments with Heracles on a single leaf server, introducing BE tasks as we run the LC workload at different levels of load. Next, we used Heracles on a websearch cluster with tens of servers, measuring end-to-end workload latency across the fan-out tree while BE tasks are also running. In the cluster experiments, we used a load trace that represents the traffic throughout a day, capturing diurnal load variation. In all cases, we used production Google servers.

For the LC workloads we focus on SLO latency. Since the SLO is defined over 60-second windows, we report the worst-case latency that was seen during experiments. For the production batch workloads, we compute the throughput rate of the batch workload with Heracles and normalize it to the throughput of the batch workload running alone on a single server. We then define the Effective Machine Utilization (EMU) = LC Throughput + BE Throughput. Note that Effective Machine Utilization can be above 100% due to better binpacking of shared resources. We also report the utilization of shared resources when necessary to highlight detailed aspects of the system.

The BE workloads we use are chosen from a set containing both production batch workloads and synthetic tasks that stress a single shared resource. The specific workloads are:

stream-LLC streams through data sized to fit in about half of the LLC and is the same as LLC (med) from §3.2. stream-DRAM streams through an extremely large array that cannot fit in the LLC (DRAM from the same section). We use these workloads to verify that Heracles is able to maximize the use of LLC partitions and avoid DRAM bandwidth saturation.

cpu_pwr is the CPU power virus from §3.2. It is used to verify that Heracles will redistribute power to ensure that the LC workload maintains its guaranteed frequency.

iperf is an open source network streaming benchmark used to verify that Heracles partitions network transmit bandwidth correctly to protect the LC workload.

brain is a Google production batch workload that performs deep learning on images for automatic labelling [38, 67]. This workload is very computationally intensive, is sensitive to LLC size, and also has high DRAM bandwidth requirements.

streetview is a production batch job that stitches together multiple images to form the panoramas for Google Street View. This workload is highly demanding on the DRAM subsystem.

5.2 Individual Server Results

Latency SLO: Figure 4 presents the impact of colocating each of the three LC workloads with BE workloads across all possible loads under the control of Heracles. Note that Heracles attempts to run as many copies of the BE task as possible and to maximize the resources they receive. At all loads and in all colocation cases, there are no SLO violations with Heracles. This is true even for brain, a workload that would render any colocated LC workload unusable under even the state-of-the-art OS isolation mechanisms. This validates that the controller keeps shared resources from saturating and allocates a sufficient fraction to the LC workload at any load. Heracles maintains a small latency slack as a guard band to avoid spikes and control instability. It also validates that local information on tail latency is sufficient for stable control for applications with SLOs in the milliseconds and microseconds range. Interestingly, the websearch binary and shard changed between generating the offline profiling model for DRAM bandwidth and performing this experiment. Nevertheless, Heracles is resilient to these changes and performs well despite the somewhat outdated model.
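To illustrate the EMU metric defined in §5.1 with made-up numbers (not measurements from the paper):

    def emu(lc_load, be_throughput, be_alone_throughput):
        """Effective Machine Utilization = LC throughput + normalized BE throughput."""
        return lc_load + be_throughput / be_alone_throughput

    # A server running the LC job at 60% of peak load while a colocated batch
    # task achieves half the throughput it would get on a dedicated server:
    print(emu(0.60, 50.0, 100.0))  # -> 1.1, i.e. 110% EMU; values above 100%
                                   # are possible because shared resources are
                                   # bin-packed better than on dedicated servers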
Figure 4. Latency of LC applications co-located with BE jobs under Heracles. For clarity we omit websearch and ml_cluster with iperf, as those workloads are extremely resistant to network interference. (Three panels: 99%-ile latency for websearch, 95%-ile for ml_cluster, and 99%-ile for memkeyval, each as % of SLO versus % of peak load, for the baseline and the colocations with stream-LLC, stream-DRAM, cpu_pwr, brain, streetview, and iperf.)

Heracles reduces the latency slack during periods of low utilization for all workloads. For websearch and ml_cluster, the slack is cut in half, from 40% to 20%. For memkeyval, the reduction is much more dramatic, from a slack of 80% to 40% or less. This is because the unloaded latency of memkeyval is extremely small compared to the SLO latency. The high variance of the tail latency for memkeyval is due to the fact that its SLO is in the hundreds of microseconds, making it more sensitive to interference than the other two workloads.

Server Utilization: Figure 5 shows the EMU achieved when colocating production LC and BE tasks with Heracles. In all cases, we achieve significant EMU increases. When the two most CPU-intensive and power-hungry workloads are combined, websearch and brain, Heracles still achieves an EMU of at least 75%. When websearch is combined with the DRAM bandwidth intensive streetview, Heracles can extract sufficient resources for a total EMU above 100% at websearch loads between 25% and 70%. This is because websearch and streetview have complementary resource requirements, where websearch is more compute bound and streetview is more DRAM bandwidth bound. The EMU results are similarly positive for ml_cluster and memkeyval. By dynamically managing multiple isolation mechanisms, Heracles exposes opportunities to raise EMU that would otherwise be missed with scheduling techniques that avoid interference.

Figure 5. EMU achieved by Heracles. (Effective machine utilization (%) versus % of peak load for search, ml_cluster, and memkeyval, each paired with brain and streetview, against the baseline.)

Shared Resource Utilization: Figure 6 plots the utilization of shared resources (cores, power, and DRAM bandwidth) under Heracles control. For memkeyval, we include measurements of network transmit bandwidth in Figure 7.

Across the board, Heracles is able to correctly size the BE workloads to avoid saturating DRAM bandwidth. For the stream-LLC BE task, Heracles finds the correct cache partitions to decrease total DRAM bandwidth requirements for all workloads. For ml_cluster, with its large cache footprint, Heracles balances the needs of stream-LLC with ml_cluster effectively, with a total DRAM bandwidth slightly above the baseline. For the BE tasks with high DRAM requirements (stream-DRAM, streetview), Heracles only allows them to execute on a few cores to avoid saturating DRAM. This is reflected in the lower CPU utilization but high DRAM bandwidth. However, EMU is still high, as the critical resource for those workloads is not compute, but memory bandwidth.

Looking at the power utilization, Heracles allows significant improvements to energy efficiency. Consider the 20% load case: EMU was raised by a significant amount, from 20% to 60%-90%. However, the CPU power only increased from 60% to 80%. This translates to an energy efficiency gain of 2.3-3.4x. Overall, Heracles achieves significant gains in resource efficiency across all loads for the LC task without causing SLO violations.

5.3 Websearch Cluster Results

We also evaluate Heracles on a small minicluster for websearch with tens of servers as a proxy for the full-scale cluster. The cluster root fans out each user request to all leaf servers and combines their replies. The SLO latency is defined as the average latency at the root over 30 seconds, denoted as µ/30s. The target SLO latency is set as the µ/30s when serving 90% load in the cluster without colocated tasks. Heracles runs on every leaf node with a uniform 99%-ile latency target set such that the latency at the root satisfies the SLO. We use Heracles to execute brain on half of the leaves and streetview on the other half. Heracles shares the same offline model for the DRAM bandwidth needs of websearch across all leaves, even though each leaf has a different shard. We generate load from an anonymized, 12-hour request trace that captures the part of the daily diurnal pattern when websearch is not fully loaded and colocation has high potential.
Figure 6. Various system utilization metrics of LC applications co-located with BE jobs under Heracles. (Nine panels, one column per workload — websearch, ml_cluster, memkeyval — showing DRAM bandwidth (% of available), CPU utilization (%), and CPU power (% of TDP) versus % of peak load, for the baseline and the colocations with stream-LLC, stream-DRAM, cpu_pwr, brain, streetview, and iperf.)

[Figure 7 plots the network bandwidth (% of available) of memkeyval versus % of peak load, for the baseline and for colocation with iperf.]

Figure 7. Network bandwidth of memkeyval under Heracles.

Latency SLO: Figure 8 shows the latency SLO with and without Heracles for the 12-hour trace. Heracles produces no SLO violations while reducing slack by 20-30%. Meeting the 99%-ile tail latency at each leaf is sufficient to guarantee the global SLO. We believe we can further reduce the slack in larger websearch clusters by introducing a centralized controller that dynamically sets the per-leaf tail latency targets based on the slack at the root [47]. This would allow a future version of Heracles to take advantage of slack in the higher layers of the fan-out tree.

Server Utilization: Figure 8 also shows that Heracles successfully converts the latency slack in the baseline case into significantly increased EMU. Throughout the trace, Heracles colocates sufficient BE tasks to maintain an average EMU of 90% and a minimum of 80% without causing SLO violations. The websearch load varies between 20% and 90% in this trace.

TCO: To estimate the impact on total cost of ownership, we use the TCO calculator by Barroso et al. with the parameters from the case study of a datacenter with low per-server cost [7]. This model assumes $2,000 servers with a PUE of 2.0 and a peak power draw of 500W, as well as electricity costs of $0.10/kWh. For our calculations, we assume a cluster size of 10,000 servers. Assuming pessimistically that a websearch cluster is highly utilized throughout the day, with an average load of 75%, Heracles' ability to raise utilization to 90% translates to a 15% throughput/TCO improvement over the baseline. This improvement includes the cost of the additional power consumed at higher utilization. Under the same assumptions, a controller that focuses only on improving energy proportionality for websearch would achieve throughput/TCO gains of roughly 3% [47].

If we assume a cluster for LC workloads utilized at an average of 20%, as many industry studies suggest [44, 74], Heracles can achieve a 306% increase in throughput/TCO, whereas a controller focusing on energy proportionality would achieve improvements of less than 7%. Heracles' advantage is due to the fact that it can raise utilization from 20% to 90% with a small increase in power consumption, which represents only 9% of the initial TCO. As long as there are useful BE tasks available, one should always choose to improve throughput/TCO by colocating them with LC jobs instead of lowering the power consumption of servers in modern datacenters. Also note that the improvements in throughput/TCO are large enough to offset the cost of reserving a small portion of each server's memory or storage for BE tasks.
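These headline numbers can be sanity-checked with a simplified calculation. The following is a hedged sketch, not the full TCO calculator of [7]; it folds amortization, PUE, and opex details into the single power-cost fraction quoted above.

    # Simplified throughput/TCO check for the 20%-average-load case.
    baseline_util = 0.20             # average utilization without colocation
    heracles_util = 0.90             # average utilization under Heracles
    extra_power_tco_fraction = 0.09  # added power cost as a fraction of initial TCO

    throughput_gain = heracles_util / baseline_util   # 4.5x the useful work
    tco_growth = 1.0 + extra_power_tco_fraction       # 1.09x the cost
    print(f"{throughput_gain / tco_growth - 1:.0%}")  # ~313%, vs. the reported 306%

The small residual gap reflects terms that the full calculator models explicitly.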
[Figure 8 plots two time series over the 12-hour trace: latency as a percentage of the SLO latency (µ/30s) for the baseline, Heracles, and the SLO target; and effective machine utilization (%) for the baseline and Heracles.]

Figure 8. Latency SLO and effective machine utilization for a websearch cluster managed by Heracles.

6 Related Work

Isolation mechanisms: There is significant work on shared cache isolation, including soft partitioning based on replacement policies [77, 78], way-partitioning [65, 64], and fine-grained partitioning [68, 49, 71]. Tessellation exposes an interface for throughput-based applications to request partitioned resources [45]. Most cache partitioning schemes have been evaluated with a utility-based policy that optimizes for aggregate throughput [64]. Heracles manages the coarse-grained way-partitioning scheme recently added in Intel CPUs, using a search for a right-sized allocation that eliminates latency SLO violations. We expect Heracles to work even better with fine-grained partitioning schemes once they become commercially available.

Iyer et al. explore a wide range of quality-of-service (QoS) policies for shared cache and memory systems with simulated isolation features [30, 26, 24, 23, 29]. They focus on throughput metrics, such as IPC and MPI, and do not consider latency-critical workloads or other resources such as network traffic. Cook et al. evaluate hardware cache partitioning for throughput-based applications without considering latency-critical tasks [15]. Wu et al. compare different capacity management schemes for shared caches [77]. The proposed Ubik controller for shared caches with fine-grained partitioning support boosts the allocation for latency-critical workloads during load transitions, but requires application-level changes to inform the runtime of load changes [36]. Heracles does not require any changes to the LC task, instead relying on a steady-state approach that changes cache partition sizes slowly.

There are several proposals for isolation and QoS features for memory controllers [30, 56, 32, 59, 57, 20, 40, 70]. While our work showcases the need for memory isolation for latency-critical workloads, such features are not commercially available at this point. Several network interface controllers implement bandwidth limiters and priority mechanisms in hardware. Unfortunately, these features are not exposed by device drivers. Hence, Heracles and related projects in network performance isolation currently use Linux qdisc [33] (a configuration sketch follows at the end of this section). Support for network isolation in hardware would strengthen this work.

The LC workloads we evaluated do not use disks or SSDs, in order to meet their aggressive latency targets. Nevertheless, disk and SSD isolation is quite similar to network isolation, so the same principles and controls used to mitigate network interference still apply. For disks, several isolation techniques are available: 1) the cgroups blkio controller [55], 2) native command queuing (NCQ) priorities [27], 3) prioritization in file-system queues, 4) partitioning LC and BE tasks to different disks, and 5) replicating LC data across multiple disks and selecting the disk/reply that responds first or has the lowest load [17]. For SSDs: 1) many SSDs support channel partitions, separate queueing, and prioritization at the queue level, and 2) SSDs also support suspending operations to allow LC requests to overtake BE requests.

Interference-aware cluster management: Several cluster-management systems detect interference between colocated workloads and generate schedules that avoid problematic colocations. Nathuji et al. develop a feedback-based scheme that tunes resource assignment to mitigate interference for colocated VMs [58]. Bubble-flux is an online scheme that detects memory pressure and finds colocations that avoid interference with latency-sensitive workloads [79, 51]. Bubble-flux has a backup mechanism that enables problematic colocations via execution modulation, but such a mechanism would struggle with applications like memkeyval, as the modulation would need to operate at the granularity of microseconds. DeepDive detects and manages interference between co-scheduled applications in a VM system [60]. CPI2 throttles low-priority workloads that interfere with important services [80]. Finally, Paragon and Quasar use online classification to estimate interference and to colocate workloads that are unlikely to cause interference [18, 19].

The primary difference of Heracles is its focus on latency-critical workloads and its use of multiple isolation schemes to allow aggressive colocation without SLO violations at scale. Many previous approaches use IPC instead of latency as the performance metric [79, 51, 60, 80]. Nevertheless, one can couple Heracles with an interference-aware cluster manager in order to optimize the placement of BE tasks.

Latency-critical workloads: There is also significant work on optimizing various aspects of latency-critical workloads, including energy proportionality [53, 54, 47, 46, 34], networking performance [35, 8], and hardware acceleration [41, 63, 72]. Heracles is largely orthogonal to these projects.
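To make the qdisc-based mechanism concrete, the following sketch caps best-effort egress bandwidth with a Linux HTB queueing discipline while leaving LC traffic unthrottled. It is a hedged illustration, not Heracles' production configuration: the device name, rates, class IDs, and the firewall mark assumed to tag BE flows are all placeholders.

    # Hedged sketch: capping BE egress bandwidth with a Linux HTB qdisc,
    # invoked through iproute2's `tc`. Requires root and an HTB-capable kernel.
    import subprocess

    DEV = "eth0"  # placeholder NIC name

    def tc(*args):
        subprocess.run(["tc", *args], check=True)

    # Root HTB qdisc; unclassified (LC) traffic falls into the default class 1:10.
    tc("qdisc", "add", "dev", DEV, "root", "handle", "1:", "htb", "default", "10")
    # LC class: effectively the whole link, so LC traffic is never throttled.
    tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:10",
       "htb", "rate", "9gbit")
    # BE class: a hard cap (rate == ceil), so BE bursts cannot crowd out LC flows.
    tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:20",
       "htb", "rate", "1gbit", "ceil", "1gbit")
    # Steer packets carrying firewall mark 20 (assumed to be set on BE traffic,
    # e.g., by an iptables MARK rule) into the capped class.
    tc("filter", "add", "dev", DEV, "parent", "1:", "protocol", "ip",
       "handle", "20", "fw", "flowid", "1:20")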
7 Conclusions

We present Heracles, a heuristic feedback-based system that manages four isolation mechanisms to enable a latency-critical workload to be colocated with batch jobs without SLO violations. We used an empirical characterization of several sources of interference to guide an important heuristic used in Heracles: interference effects are large only when a shared resource is saturated. We evaluated Heracles and several latency-critical and batch workloads used in production at Google on real hardware and demonstrated an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job. Through coordinated management of several isolation mechanisms, Heracles enables the colocation of tasks that would previously cause SLO violations. Compared to power-saving mechanisms alone, Heracles increases overall cost efficiency substantially through increased utilization.

8 Acknowledgements

We sincerely thank Luiz Barroso and Chris Johnson for their help and insight in making our work possible at Google. We also thank Christina Delimitrou, Caroline Suen, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by a Google research grant, the Stanford Experimental Datacenter Lab, and NSF grant CNS-1422088. David Lo was supported by a Google PhD Fellowship.

References
[1] "Iperf - The TCP/UDP Bandwidth Measurement Tool," https://iperf.fr/.
[2] "memcached," http://memcached.org/.
[3] "Intel® 64 and IA-32 Architectures Software Developer's Manual," vol. 3B: System Programming Guide, Part 2, Sep. 2014.
[4] Mohammad Al-Fares et al., "A Scalable, Commodity Data Center Network Architecture," in Proc. of the ACM SIGCOMM 2008 Conference on Data Communication, ser. SIGCOMM '08. New York, NY: ACM, 2008.
[5] Mohammad Alizadeh et al., "Data Center TCP (DCTCP)," in Proc. of the ACM SIGCOMM 2010 Conference, ser. SIGCOMM '10. New York, NY: ACM, 2010.
[6] Luiz Barroso et al., "The Case for Energy-Proportional Computing," Computer, vol. 40, no. 12, Dec. 2007.
[7] Luiz André Barroso et al., The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd ed. Morgan & Claypool Publishers, 2013.
[8] Adam Belay et al., "IX: A Protected Dataplane Operating System for High Throughput and Low Latency," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, Oct. 2014.
[9] Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," in Proc. of the 2011 USENIX Annual Technical Conference, ser. USENIX ATC '11. Berkeley, CA: USENIX Association, 2011.
[10] Eric Boutin et al., "Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, 2014.
[11] Bob Briscoe, "Flow Rate Fairness: Dismantling a Religion," SIGCOMM Comput. Commun. Rev., vol. 37, no. 2, Mar. 2007.
[12] Martin A. Brown, "Traffic Control HOWTO," http://linux-ip.net/articles/Traffic-Control-HOWTO/.
[13] Marcus Carvalho et al., "Long-term SLOs for Reclaimed Cloud Computing Resources," in Proc. of SOCC, Seattle, WA, Dec. 2014.
[14] McKinsey & Company, "Revolutionizing data center efficiency," Uptime Institute Symp., 2008.
[15] Henry Cook et al., "A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-efficiency While Preserving Responsiveness," in Proc. of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY: ACM, 2013.
[16] Carlo Curino et al., "Reservation-based Scheduling: If You're Late Don't Blame Us!" in Proc. of the 5th Annual Symposium on Cloud Computing, 2014.
[17] Jeffrey Dean et al., "The tail at scale," Commun. ACM, vol. 56, no. 2, Feb. 2013.
[18] Christina Delimitrou et al., "Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters," in Proc. of the 18th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, 2013.
[19] Christina Delimitrou et al., "Quasar: Resource-Efficient and QoS-Aware Cluster Management," in Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Salt Lake City, UT, 2014.
[20] Eiman Ebrahimi et al., "Fairness via Source Throttling: A Configurable and High-performance Fairness Substrate for Multi-core Memory Systems," in Proc. of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XV. New York, NY: ACM, 2010.
[21] H. Esmaeilzadeh et al., "Dark silicon and the end of multicore scaling," in Proc. of the 38th Annual International Symposium on Computer Architecture (ISCA), June 2011.
[22] Sriram Govindan et al., "Cuanta: Quantifying effects of shared on-chip resource interference for consolidated virtual machines," in Proc. of the 2nd ACM Symposium on Cloud Computing, 2011.
[23] Fei Guo et al., "From Chaos to QoS: Case Studies in CMP Resource Management," SIGARCH Comput. Archit. News, vol. 35, no. 1, Mar. 2007.
[24] Fei Guo et al., "A Framework for Providing Quality of Service in Chip Multi-Processors," in Proc. of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 40. Washington, DC: IEEE Computer Society, 2007.
[25] Nikos Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011.
[26] Lisa R. Hsu et al., "Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches As a Shared Resource," in Proc. of the 15th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '06. New York, NY: ACM, 2006.
[27] Intel, "Serial ATA II Native Command Queuing Overview," http://download.intel.com/support/chipsets/imsm/sb/sata2_ncq_overview.pdf, 2003.
[28] Teerawat Issariyakul et al., Introduction to Network Simulator NS2, 1st ed. Springer Publishing Company, Incorporated, 2010.
[29] Ravi Iyer, "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms," in Proc. of the 18th Annual International Conference on Supercomputing, ser. ICS '04. New York, NY: ACM, 2004.
[30] Ravi Iyer et al., "QoS Policies and Architecture for Cache/Memory in CMP Platforms," in Proc. of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '07. New York, NY: ACM, 2007.
[31] Vijay Janapa Reddi et al., "Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency," SIGARCH Comput. Archit. News, vol. 38, no. 3, Jun. 2010.
[32] Min Kyu Jeong et al., "A QoS-aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC," in Proc. of the 49th Annual Design Automation Conference, ser. DAC '12. New York, NY: ACM, 2012.
[33] Vimalkumar Jeyakumar et al., "EyeQ: Practical Network Performance Isolation at the Edge," in Proc. of the 10th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI '13. Berkeley, CA: USENIX Association, 2013.
[34] Svilen Kanev et al., "Tradeoffs between Power Management and Tail Latency in Warehouse-Scale Applications," in IISWC, 2014.
[35] Rishi Kapoor et al., "Chronos: Predictable Low Latency for Data Center Applications," in Proc. of the Third ACM Symposium on Cloud Computing, ser. SoCC '12. New York, NY: ACM, 2012.
[36] Harshad Kasture et al., "Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads," in Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX), March 2014.
[37] Wonyoung Kim et al., "System level analysis of fast, per-core DVFS using on-chip switching regulators," in Proc. of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2008.
[38] Quoc Le et al., "Building high-level features using large scale unsupervised learning," in International Conference on Machine Learning, 2012.
[39] Jacob Leverich et al., "Reconciling High Server Utilization and Sub-millisecond Quality-of-Service," in SIGOPS European Conf. on Computer Systems (EuroSys), 2014.
[40] Bin Li et al., "CoQoS: Coordinating QoS-aware Shared Resources in NoC-based SoCs," J. Parallel Distrib. Comput., vol. 71, no. 5, May 2011.
[41] Kevin Lim et al., "Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached," in Proc. of the 40th Annual International Symposium on Computer Architecture, 2013.
[42] Kevin Lim et al., "System-level Implications of Disaggregated Memory," in Proc. of the IEEE 18th International Symposium on High-Performance Computer Architecture, ser. HPCA '12. Washington, DC: IEEE Computer Society, 2012.
[43] Jiang Lin et al., "Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems," in Proc. of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2008.
[44] Huan Liu, "A Measurement Study of Server Utilization in Public Clouds," in Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth Intl. Conf. on, 2011.
[45] Rose Liu et al., "Tessellation: Space-time Partitioning in a Manycore Client OS," in Proc. of the First USENIX Conference on Hot Topics in Parallelism, ser. HotPar '09. Berkeley, CA: USENIX Association, 2009.
[46] Yanpei Liu et al., "SleepScale: Runtime Joint Speed Scaling and Sleep States Management for Power Efficient Data Centers," in Proc. of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ: IEEE Press, 2014.
[47] David Lo et al., "Towards Energy Proportionality for Large-scale Latency-critical Workloads," in Proc. of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ: IEEE Press, 2014.
[48] Krishna T. Malladi et al., "Towards Energy-proportional Datacenter Memory with Mobile DRAM," SIGARCH Comput. Archit. News, vol. 40, no. 3, Jun. 2012.
[49] R. Manikantan et al., "Probabilistic Shared Cache Management (PriSM)," in Proc. of the 39th Annual International Symposium on Computer Architecture, ser. ISCA '12. Washington, DC: IEEE Computer Society, 2012.
[50] J. Mars et al., "Increasing Utilization in Modern Warehouse-Scale Computers Using Bubble-Up," IEEE Micro, vol. 32, no. 3, May 2012.
[51] Jason Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," in Proc. of the 44th Annual IEEE/ACM Intl. Symp. on Microarchitecture, ser. MICRO-44, 2011.
[52] Paul Marshall et al., "Improving Utilization of Infrastructure Clouds," in Proc. of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011.
[53] David Meisner et al., "PowerNap: Eliminating Server Idle Power," in Proc. of the 14th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XIV, 2009.
[54] David Meisner et al., "Power Management of Online Data-Intensive Services," in Proc. of the 38th ACM Intl. Symp. on Computer Architecture, 2011.
[55] Paul Menage, "CGROUPS," https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
[56] Sai Prashanth Muralidhara et al., "Reducing Memory Interference in Multicore Systems via Application-aware Memory Channel Partitioning," in Proc. of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY: ACM, 2011.
[57] Vijay Nagarajan et al., "ECMon: Exposing Cache Events for Monitoring," in Proc. of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY: ACM, 2009.
[58] R. Nathuji et al., "Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds," in Proc. of EuroSys, France, 2010.
[59] K. J. Nesbit et al., "Fair Queuing Memory Systems," in Proc. of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), Dec. 2006.
[60] Dejan Novakovic et al., "DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments," in Proc. of the USENIX Annual Technical Conference (ATC '13), San Jose, CA, 2013.
[61] W. Pattara-Aukom et al., "Starvation prevention and quality of service in wireless LANs," in Proc. of the 5th International Symposium on Wireless Personal Multimedia Communications, vol. 3, Oct. 2002.
[62] M. Podlesny et al., "Solving the TCP-Incast Problem with Application-Level Scheduling," in Proc. of the IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2012.
[63] Andrew Putnam et al., "A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services," in Proc. of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ: IEEE Press, 2014.
[64] M. K. Qureshi et al., "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," in Proc. of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), Dec. 2006.
[65] Parthasarathy Ranganathan et al., "Reconfigurable Caches and Their Application to Media Processing," in Proc. of the 27th Annual International Symposium on Computer Architecture, ser. ISCA '00. New York, NY: ACM, 2000.
[66] Charles Reiss et al., "Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis," in ACM Symp. on Cloud Computing (SoCC), Oct. 2012.
[67] Chuck Rosenberg, "Improving Photo Search: A Step Across the Semantic Gap," http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html.
[68] Daniel Sanchez et al., "Vantage: Scalable and Efficient Fine-grain Cache Partitioning," SIGARCH Comput. Archit. News, vol. 39, no. 3, Jun. 2011.
[69] Yoon Jae Seong et al., "Hydra: A Block-Mapped Parallel Flash Memory Solid-State Disk Architecture," IEEE Transactions on Computers, vol. 59, no. 7, July 2010.
[70] Akbar Sharifi et al., "METE: Meeting End-to-end QoS in Multicores Through System-wide Resource Management," in Proc. of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '11. New York, NY: ACM, 2011.
[71] Shekhar Srikantaiah et al., "SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors," in Proc. of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY: ACM, 2009.
[72] Shingo Tanaka et al., "High Performance Hardware-Accelerated Flash Key-Value Store," in The 2014 Non-volatile Memories Workshop (NVMW), 2014.
[73] Lingjia Tang et al., "The impact of memory subsystem resource sharing on datacenter applications," in Proc. of the 38th Annual International Symposium on Computer Architecture (ISCA), June 2011.
[74] Arunchandar Vasan et al., "Worth their watts? An empirical study of datacenter servers," in Intl. Symp. on High-Performance Computer Architecture, 2010.
[75] Nedeljko Vasić et al., "DejaVu: Accelerating resource allocation in virtualized environments," in Proc. of the 17th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[76] Christo Wilson et al., "Better Never Than Late: Meeting Deadlines in Datacenter Networks," in Proc. of the ACM SIGCOMM 2011 Conference, ser. SIGCOMM '11. New York, NY: ACM, 2011.
[77] Carole-Jean Wu et al., "A Comparison of Capacity Management Schemes for Shared CMP Caches," in Proc. of the 7th Workshop on Duplicating, Deconstructing, and Debunking, vol. 15. Citeseer, 2008.
[78] Yuejian Xie et al., "PIPP: Promotion/Insertion Pseudo-partitioning of Multi-core Shared Caches," in Proc. of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY: ACM, 2009.
[79] Hailong Yang et al., "Bubble-flux: Precise Online QoS Management for Increased Utilization in Warehouse Scale Computers," in Proc. of the 40th Annual Intl. Symp. on Computer Architecture, ser. ISCA '13, 2013.
[80] Xiao Zhang et al., "CPI2: CPU performance isolation for shared compute clusters," in Proc. of the 8th ACM European Conference on Computer Systems (EuroSys), Prague, Czech Republic, 2013.
[81] Yunqi Zhang et al., "SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers," in International Symposium on Microarchitecture (MICRO), 2014.