
Compiler-Guided Throughput Scheduling for Many-core Machines

Girish Mururu (Google), Sharjeel Khan (Georgia Institute of Technology), Bodhisatwa Chatterjee (Georgia Institute of Technology), Chao Chen (Amazon), Chris Porter (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology), Santosh Pande (Georgia Institute of Technology)

arXiv:2103.06647v2 [cs.DC] 31 Oct 2021

Abstract

Typical schedulers in multi-tenancy environments make use of reactive, feedback-oriented mechanisms based on performance counters to avoid resource contention, but suffer from detection lag and loss of performance. In this paper, we address these limitations by exploring the utility of predictive analysis through dynamic forecasting of applications' resource-heavy regions during their execution. Our compiler framework classifies loops in programs and leverages traditional compiler analysis along with learning mechanisms to quantify their behaviour. Based on the predictability of their execution time, it then inserts different types of beacons at their entry/exit points. The information produced by beacons in multiple processes is aggregated and analyzed by the proactive scheduler to respond to the anticipated workload requirements. For throughput environments, we develop a framework that demonstrates high-quality predictions and improvements in throughput over CFS by 76.78% on average and up to 3.2x on an Amazon Graviton2 machine on consolidated workloads across 45 benchmarks.

1 Introduction

Modern systems offer a tremendous amount of computing resources by assembling a very large number of cores. However, when they are deployed in data-centers, it is very challenging to fully exploit such computing capability due to their shared memory resources. The challenge is that, on one hand, there is a need to co-locate a large number of workloads to maximize system utilization (and thus minimize the overall cost), and on the other hand, this workload co-location causes processes to interfere with each other by competing for shared cache resources. This problem is further exacerbated by the fact that modern machine learning and graph analytics workloads exhibit diverse resource demands throughout their execution. In particular, the resource requirements of processes vary not only across applications, but also within a single application across its various regions/phases as it executes [11, 17].

Prior works on resource management and utilization [5, 6, 8, 18, 29, 31] do not account for the dynamic phase behaviour of modern workloads and use feedback-driven reactive approaches instead. These include approaches that rely on resource usage history or current resource contention (monitored using hardware performance counters) and then react to contention by invoking suitable scheduling mechanisms. However, the majority of modern workloads are input-data dependent and do not exhibit highly regular or repetitive behavior across different input sets. Thus, neither history-based methods nor feedback-directed methods provide sufficiently accurate predictions for such workloads. Furthermore, these reactive approaches detect a resource-intense execution phase only after it has already occurred, and this leads to two more related drawbacks. First, by the time the phase is detected, it may be too late to act, especially if the detection lag is significant or the phase duration is short (a phase might have ended by the time it is detected). Second, as a result of this detection lag, the application state might have already bloated in terms of its resource consumption; the damage to other co-executing processes' cache state might already be done, and it might be prohibitively expensive to mitigate the offending process because of its large state, whose cache affinity can be lost by migrating or pausing. In addition, these approaches can predict neither the duration of the phase nor the sensitivity of its forthcoming resource usage, thus significantly limiting the chances of further optimizing the co-location of processes.

To efficiently co-locate processes and maximize system throughput, we propose the Beacons Framework, which predicts the resource requirements of applications based on a combination of compiler analysis and machine learning techniques and thereby enables pro-active scheduling decisions. This framework consists of the Beacons Compilation Component, which leverages the compiler to analyze programs and predict their resource requirements, and the Beacons Runtime Component, which includes the pro-active Beacons Scheduler that uses these analyses to undertake superior scheduling decisions. The beacons scheduler relies on the compiler to insert "beacons" (specialized markers) in the application at strategic program points to periodically produce and/or update details of anticipated (forthcoming) resource-heavy program region(s). The beacons framework classifies loops in programs based on cache usage and the predictability of their execution time, and inserts different types of beacons at their entry/exit points. Moreover, the beacons scheduler augments imprecise information through the use of performance counters to throttle higher concurrency when needed. The novelty of this scheme is that, through the use of loop timing, memory footprint, and reuse information provided via compiler analysis combined with machine learning mechanisms, the scheduler is able to precisely yet aggressively maximize co-location and concurrency without incurring cache degradation.

We evaluated the Beacons Framework on an Amazon Graviton2 [1] machine on consolidated benchmark suites from Machine Learning Workloads [7, 14-16], PolyBench [33], a numerical computational benchmark suite, and Rodinia [4], a heterogeneous compute-kernel suite drawn from diverse domains. Our experiments demonstrate that the end result of such a high-throughput-oriented framework is that it effectively schedules a large number of jobs and improves throughput over CFS (Completely Fair Scheduler, the production scheduler bundled with Linux distributions) by 76.78% on average (geometric mean) and up to 3.29x (§5). Overall, the main contributions of our work can be summarized as follows:

• We present a compiler-directed predictive analysis framework that determines resource requirements of dynamic modern workloads in order to aid dynamic resource management.

• We show that an application's resource requirement can be determined by its loop-timings (§3.1), which in turn depend upon the prediction of expected loop iterations (§3.1.2), memory footprints (§3.2) and data-reuse behaviour (§3.2.2). We show how these attributes can be estimated by combining compiler analysis with machine learning techniques in the Beacons Compilation Component (§3).

• To predict precise loop timing for proactive scheduling decisions, we develop a detailed loop classification scheme that covers all possible types of loops in modern workloads. Based on this scheme, we develop a framework that analyzes such loops by capturing both the data-flow and control-flow aspects (§3.1.2).

• We design and implement a lightweight Beacons Runtime Component (§4), which includes the proactive throughput-oriented Beacons Scheduler (§4.1), that aggregates the information generated by multiple processes, allocates resources to the processes, and enables communication between the application and the scheduler.

2 Motivation & Overall Framework

Maximizing system throughput through efficient process scheduling in a multi-tenant execution setting (data centers, cloud, etc.) for modern workloads is challenging. The reason is that these execution scenarios consist of thousands of jobs, typically in batches. In such cases, the completion time of each individual job itself is of no value; rather, the completion time of the whole batch of jobs, or the number of jobs completed per unit time (if the jobs are an incoming stream), is of utmost priority [2].

2.1 Motivating Example: Reactive vs Proactive

Consider an execution scenario in a multi-tenant environment, where we have the popular convolutional neural network (CNN) Alexnet being trained on different input sets with varied parameters (dropout rate, hidden units, etc.) concurrently, as a batch. During the training, several other jobs (matrix multiplication computations on different matrices) co-execute, having access to the same shared resources (LLC, etc.), but on different cores. Table 1 shows the comparison of the execution time of this scenario with (1) the widely used production Linux scheduler CFS [19], (2) a performance-counter based reactive scheduler, Merlin [29], and (3) the proposed Beacons Framework with the Beacons Scheduler.

Constituent Processes executing as a Batch    Total Processes
Alexnet Training                              20
Matrix Multiplication (2mm)                   133768

Execution Time (sec) for the batch:   CFS 249.43   Merlin 358.23   Beacons 100.58

Table 1: Alexnet Training and 2mm (Polybench) executing as a batch. Each 2mm process can be considered small compared to Alexnet, and hence the 2mm processes can hog system resources through sheer numbers.

As we can see from Table 1, schedulers like CFS that are agnostic to the diverse and overlapping resource requirements can suffer from slowdowns as high as 1.5x. On the other hand, a scheduler that uses a feedback-driven approach, first using performance counters to measure IPC/cache misses and then undertaking a scheduling decision, performs even worse, with a performance degradation of more than 3.5x. The reason is that Merlin [29] reacts to the resource requirements and suffers from detection lag and the other drawbacks mentioned earlier. Therefore, in order to maximize throughput, we need to act pro-actively, i.e., anticipate the resource demand before it happens and act at its onset, which forms the primary objective of the Beacons Framework.

2.2 Overall Framework

The Beacons framework consists of a Compilation Component (Fig. 1), which in turn comprises a sequence of compilation, training, and re-compilation steps to instrument the application; all of these steps are applied statically.
Several challenges are overcome to produce this information at loop entrances. Beacons consist of statically inserted code at the entrances of the loops that evaluates closed-form formulae and static learning models for predicting various artifacts related to a loop's resource requirements. Other aspects, e.g. the amount of time a region executes, can be predicted and refined by employing machine learning mechanisms that involve training. During runtime, this information is evaluated by the beacons using actual dynamic values and is conveyed to the beacons scheduler through library calls, which form the Beacons Runtime Component (Fig. 1). The scheduler aggregates and analyzes this information across all the processes, constructing a global picture about contentions, the sensitivity to them, and moreover the duration of contentions. For the purposes of this paper, the resources under contention are the last level cache and the memory bandwidth shared by all the processes in the machine. Since the resource demand and duration (loop timings) information is known to the scheduler before the loop executes, the scheduler is able to take pro-active scheduling decisions, overcoming the afore-mentioned limitations of reactive approaches.

3 Beacons Compilation Component

The Beacons Compilation Component (Fig. 1) is responsible for instrumenting beacons in an application in order to guide a scheduler through upcoming regions. The regions in code are analyzed at the granularity of loop nests. During the compilation phase, the applications are first profiled to generate sample loop timings through a compiler pass called the beacons loop analysis pass. This pass runs the program on a subset of application inputs, which form the training input set. These timings are then used for training a linear regression equation, which establishes the loop timing as a function of the total iterations of the loop (trip-count). Such an approach directly handles simple loops whose trip-counts can be predicted statically from their loop bounds. However, to enable analysis of loops whose trip-counts are unknown at compile time, we develop a loop categorization framework that unifies both the data and control flow of the loop bounds and its predicate variables. We then predict the trip counts for such loops by using a decision tree classifier and rule-based mechanisms, which is further integrated with the loop timings to enhance their accuracy. The timing regression equation is then embedded before the loop, along with two other pieces: the memory footprint equation, and the classification of the loop as either "reuse" or "streaming" based on its memory reference patterns. The memory footprints are estimated by polyhedral compiler analysis [13], while the reuse classification is done with the help of static reuse distance (SRD) analysis.

The beacons are inserted before the outermost loops and then hoisted inter-procedurally. To rectify potential scheduling errors arising from timing and memory footprint inaccuracies, the beacon-based scheduler is augmented with performance counters, which are used in cases of loop bounds that cannot be estimated even by machine learning; in addition, a completion beacon is inserted at loop exit points to signal the end of the loop region. We now describe each of these aspects in detail in the following subsections.

3.1 Loop Timing Analysis & Categorization

In order to predict the timing of a loop (nest) present in the application, we establish it as a function of its total iterations (trip-count). Our goal is to obtain a linear relationship of the form y = a·x1 + b·x2 + ... + k·xn, where x1, x2, ..., xn are the trip-counts of the individual loops in a loop nest and y is the execution time of the loop nest. The core idea here is that the constants a, b, ..., k can then be learnt using linear regression, and this closed-form equation can be directly utilized to generate the loop timing dynamically during runtime.

3.1.1 General Loop Modelling

For a given loop with various instructions, the loop time depends on the number of loop iterations, i.e. the loop time (T) is directly proportional to the loop iterations (N). Therefore, T ∝ N ⇒ T = α·N, where the proportionality constant α represents the average execution time of the loop body. In a nested loop, the execution duration depends on the number of iterations encountered at each nesting level. Consider a loop nest consisting of n individual loops with total iterations N1, N2, ..., Nn respectively. We observe that the loop nest duration (T) is a function of these loop iterations, i.e. T = f(N1, N2, ..., Nn). As far as the loop body is concerned, each instruction in the loop nest contributes to the loop time by the factor of the number of times the instruction is executed. Thus, a loop can be analyzed by determining the nesting level of each of the basic blocks in its body and then multiplying the timing of that basic block with its execution frequency, determined by the trip counts of the surrounding loops. Thus, the timing equation can be represented as: T ∝ g1(N1) + g2(N1·N2) + ... + gn(N1·N2·...·Nn), where g1(N1) is the time contributed by the outermost loop, or by the basic blocks belonging to the outermost loop with N1 iterations, g2(N1·N2) is the time contributed by the second loop at nesting level 2, and so on. Removing the proportionality sign, this equation can be rewritten as:

    T = c0 + c1·N1 + c2·(N1·N2) + ... + cn·(N1·N2·...·Nn)    (1)

where c0 is the constant term. Eq. 1 is a linear equation in terms of each loop iteration count Nk. Therefore, we use linear regression to learn the coefficients for each of the loop iteration counts in the loop nest to predict the loop timings.
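As a minimal sketch of how such a closed-form model is used at runtime (the coefficient values and the function name beacon_send below are illustrative placeholders, not the framework's generated code), a beacon at the entrance of a depth-2 loop nest would evaluate Eq. 1 with the now-known trip counts:

    #include <stdio.h>

    /* Placeholder coefficients of Eq. 1 for a depth-2 loop nest, as if
     * learned offline by linear regression on profiled samples. */
    static const double c0 = 0.8, c1 = 0.002, c2 = 0.00004; /* in ms */

    /* Stub for the beacon-library call; the real library conveys the
     * predicted attributes to the scheduler instead of printing. */
    static void beacon_send(double predicted_ms) {
        printf("beacon: predicted loop time = %.3f ms\n", predicted_ms);
    }

    /* Eq. 1 instantiated for a depth-2 nest: T = c0 + c1*N1 + c2*(N1*N2). */
    static double predict_loop_time_ms(long N1, long N2) {
        return c0 + c1 * (double)N1 + c2 * ((double)N1 * (double)N2);
    }

    void scale(double *a, long N1, long N2) {
        /* Beacon fired at the loop-nest entrance, where the dynamic
         * values of N1 and N2 are already known. */
        beacon_send(predict_loop_time_ms(N1, N2));
        for (long i = 0; i < N1; ++i)
            for (long j = 0; j < N2; ++j)
                a[i * N2 + j] *= 2.0;
    }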
Figure 1: Beacons Compilation Component Workflow. The source code is translated into IR with LLVM front-end and is subjected to various
‘beacon’ compiler passes to generate the loop and memory profile information. The IR is then instrumented with beacons information to
broadcast information at the runtime to the Beacons Scheduler with the help of Beacons library.

Figure 2: Beacons Runtime Component

3.1.2 Estimating Precise Loop Iterations

The loop timing model described above depends on compile-time knowledge of the exact loop iterations, i.e. the trip-count associated with the loop. For certain kinds of regular loops, it is simple to statically determine the total number of iterations, namely loops with regular bounds (Fig. 3a). Such loops can be normalized using the loop-simplify pass in the LLVM [21] compiler. Loop normalization converts a loop to a form with the lower bound set to zero and a step increment of one to reach an upper bound. The upper bound of the loop is then equal to the number of loop iterations and can be easily extracted from the loop and plugged into Eq. 1 for timing prediction. In practice, however, loops need not adhere to this template. Real-world workloads have loops with irregular bounds (Fig. 3b) and non-uniform, often input-dependent trip counts, which makes it harder to analyze their trip counts statically. This also extends to loop nests (either with affine or non-affine bounds) having control flow statements that break out of iterations, transferring control out of the loop body (Fig. 3c). Therefore, to estimate precise loop iterations, we need a framework that captures both the input data-flow characteristics and the control-flow behaviour of the entire loop nest.

Figure 3: Examples of different instances of loops for predicting trip counts. Each of these instances can be generalized to nested loops as well.

Loop Classification Scheme: In order to obtain trip counts for various loop-nests, we first categorize the loops by developing a classification scheme that takes into account both the data-flow and control-flow characteristics of a loop. The data-flow characteristics of a loop can be determined by the loop bounds, while the control-flow aspect can be captured by the nature of loop termination. Thus, each individual loop is either a Normally-Bounded or an Irregularly-Bounded loop (data-flow aspect), and either a Normal-Exit or a Multi-Exit loop (control-flow aspect). Unifying the data-flow and control-flow characterizations results in a comprehensive framework that covers all possible loop types in real workloads (Fig. 4). The resultant four classes of loops are Normally Bounded Normal Exit (NBNE) loops, Normally Bounded Multi Exit (NBME) loops, Irregularly Bounded Normal Exit (IBNE) loops, and Irregularly Bounded Multi Exit (IBME) loops; illustrative examples of each class are given below.

Based on this classification scheme, we develop a Loop Categorization Algorithm (Algo. 1) that detects loops of each category and invokes a specialized trip count estimation mechanism for each case. The algorithm first checks the bounds of the given loop. If the bound is entirely composed of numerical entities (integer, float, double, etc.), which allows a static estimation of a symbolic upper bound, the loop is classified as Normally-Bounded. Alternatively, loops whose bounds contain non-numerical entities (pointer references, function calls, etc.) are considered Irregularly-Bounded loops. To handle such cases, we propose the Upwards Exposed Control Backslicing (UECB) Algorithm (Algo. 2), a general framework that predicts trip counts based on the 'upward-exposed' values of the variables that decide the control flow in the loop, using learning mechanisms. It returns a learning model that estimates the loop trip counts based on a loop's critical variables.
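Before detailing the algorithms, the four classes can be illustrated with the following constructed C fragments (examples of ours, not the loops shown in Fig. 3 or Fig. 4):

    /* NBNE: numeric bound, single exit; trip count = n, known at entry. */
    long sum_nbne(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; ++i) s += a[i];
        return s;
    }

    /* NBME: numeric bound, but an early break adds a second exit. */
    long find_nbme(const long *a, long n, long key) {
        for (long i = 0; i < n; ++i)
            if (a[i] == key) return i;   /* early, data-dependent exit */
        return -1;
    }

    /* IBNE: irregular bound; termination depends on pointer contents. */
    typedef struct node { struct node *next; } node;
    long len_ibne(const node *p) {
        long c = 0;
        while (p) { ++c; p = p->next; }  /* bound is not a numeric entity */
        return c;
    }

    /* IBME: irregular bound plus an extra data-dependent exit. */
    long scan_ibme(const node *p, long limit) {
        long c = 0;
        while (p) {
            if (++c == limit) break;     /* second exit */
            p = p->next;
        }
        return c;
    }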
Figure 4: Loop Classification Scheme with 4 classes of loops: NBNE, NBME, IBNE & IBME

Algorithm 1: Loop Categorization Algorithm
Input: Function F
Result: Determine the precise trip count (as an expression or a linear model) for all the loops

    for each loop L ∈ F do
        regular, normal ← true
        exit_points, critical_bounds, critical_predicates ← ∅
        bounds ← L.getBounds()
        for each argument arg ∈ bounds do
            if type(arg) != num then
                regular ← false
                critical_bounds.push_back(arg)
            end
        end
        for each basic block bb ∈ L do
            if (bb.successor ∉ L) and (bb != L.header) then
                normal ← false
                exit_points.push_back(bb)
            end
        end
        for each point P ∈ exit_points do
            for each variable v ∈ P do
                critical_predicates.push_back(v)
            end
        end
        if regular and normal then
            tripCount ← (bound.upper() − bound.lower())
        end
        if regular and not normal then
            PredictionModel ← UECB(critical_predicates)
        end
        if not regular and normal then
            PredictionModel ← UECB(critical_bounds)
        end
        if not regular and not normal then
            PredictionModel ← UECB(critical_bounds ∪ critical_predicates)
        end
    end

After the detection of expected trip counts based on loop bounds, the loop bodies are analyzed for control-flow statements. If there are no statements in the loop body that exit out of the loop, then in the Normally-Bounded case the loop is categorized as Normal-Exit and the expected trip count is equal to the symbolic trip count. Next, if the loop is Irregularly-Bounded, its expected bound is learnt using the UECB algorithm described below. The cases of multiple exits in the body are classified as Multi-Exit, and the critical variables present in the predicates are extracted and passed on to the UECB algorithm to generate the respective model.

UECB Algorithm: The first step in UECB is to identify the critical variables that dictate a loop's trip count, i.e. the variables that decide whether control flows to the loop body or to the loop exit. These variables can be either irregular loop bounds, or part of predicates that break out of the loop. The core idea here is to use a learning model that can be trained on these variables to estimate the loop's trip count. The trained model can then be embedded in the beacon associated with the loop nest. However, since the inserted beacon exists at the entrance of the loop nest, the trip count prediction must be done only with the variables and definitions that are live at that point. Such variables are called out-of-loop variables, i.e. variables that are live at the outermost loop header and whose definitions come from outside the loop body. Therefore, it is essential to back-propagate the critical variables in order to express them in terms of out-of-loop variables. UECB achieves this back-propagation by following a stack-based approach that analyzes the critical variables and the upwards-exposed definitions on which they depend, in terms of their backward slice. When the program is run on representative inputs, these out-of-loop variables are logged to generate the training data for the learning model.

The learning model used to predict loop trip counts based on critical variables is the decision tree classifier. The input set for the entire program is divided into two parts, for training and testing respectively. This model is then embedded in the beacon associated with the outer loop and is used to predict the precise loop trip count when it is invoked.

Loops not suitable for machine learning: The UECB Algorithm generates training data by logging the values of critical variables across loop invocations. However, if the loop is invoked only a small number of times (< 10), then it is impractical to train decision trees for predicting the trip-counts of such loops. This is because machine learning models require the training and testing datasets to be sufficiently large in order to provide meaningful predictions. Thus, for predicting the trip-counts of loops that are not invoked enough times to train a classifier model, we use Rule-Based Trip Count Prediction. The core idea here is that the expected trip-count lies within one standard deviation of the mean trip-count over all the rules. The rule-based mechanism is preferred over the classifier model when the number of data points is less than a hyper-parameter threshold value (∼5).
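The paper does not spell out the exact rule representation; as a minimal sketch under that caveat, a rule-based predictor could keep the logged trip counts of the few profiled invocations and report their mean, with one standard deviation bounding the expected range:

    #include <math.h>

    /* Minimal sketch of rule-based trip count prediction for rarely
     * invoked loops; the logged trip counts stand in for the rules. */
    typedef struct {
        long samples[16];   /* trip counts logged per invocation */
        int  count;         /* small (< ~10) for loops on this path */
    } trip_log;

    /* Returns the mean trip count; *spread receives one standard
     * deviation, the range the estimate is expected to fall within. */
    double rule_based_trip_count(const trip_log *log, double *spread) {
        if (log->count == 0) { *spread = 0.0; return 0.0; }
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < log->count; ++i) mean += (double)log->samples[i];
        mean /= (double)log->count;
        for (int i = 0; i < log->count; ++i) {
            double d = (double)log->samples[i] - mean;
            var += d * d;
        }
        *spread = sqrt(var / (double)log->count);
        return mean;
    }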
After the expected trip counts for the various classes of loops are obtained, either by simple analysis (NBNE) or by classifier models (NBME/IBNE/IBME) from the Loop Categorization algorithm (Algo. 1), they are integrated with the timing equation (Eq. 1) to enhance the loop-timing predictions through the normal regression models.

Algorithm 2: Upwards Exposed Control Backslicing (UECB) Algorithm
Input: Set of Critical Variables C
Result: Classifier Model M that estimates the precise trip count based on unseen values of the critical variables

    model_parameters, worklist ← {}
    for each variable vc ∈ C do
        def_set ← getAllDefinitions(vc)
        for each definition d ∈ def_set do
            worklist.push_back(d)
        end
    end
    while worklist is not empty do
        d ← remove a definition from the worklist
        for each operand op ∈ d do
            if op is an out-of-loop variable then
                model_parameters.push_back(op)
            else
                def_set ← getAllDefinitions(op)
                for each definition d' ∈ def_set do
                    worklist.push_back(d')
                end
            end
        end
    end
    Generate training & testing data for model_parameters
    if total data points > threshold then
        Train Decision-Tree Classifier M(model_parameters)
    else
        Obtain Rule-based Prediction Model M(model_parameters)
    end

3.2 Memory Footprint Analysis

The footprint indicates the amount of cache that will be occupied by a loop. Memory footprint analysis consists of two parts: (a) calculating the memory footprint of the loop, and (b) classifying the loop as a reuse-oriented loop or a streaming loop (one which exhibits little or no reuse).

3.2.1 Calculating Memory Footprint

For a given loop, its memory footprint is estimated based on polyhedral analysis, a static program analysis performed on the LLVM intermediate representation (IR). For each memory access statement in the loop, a polyhedral access relation is constructed to describe the data points accessed by the statement across loop iterations. An access relation describes a map from a loop iteration to the data point accessed in that iteration. It contains three pieces of information: 1) parameters, which are compile-time unknown constants; 2) a map from the iteration to the array index(es); and 3) a Presburger formula [30] describing the conditions under which the memory access is performed. Generally, the parameters contain all loop-invariant variables that are involved in either the array index(es) or the Presburger formula, and the Presburger formula contains the loop conditions. We currently ignore if-conditions enclosing memory access statements; we thus get an upper bound on the estimated memory footprints. For illustration, Listing 1 shows a loop with three memory accesses, two of which access the same array but different elements. A polyhedral access relation is built for each of them. The polyhedral access relation for d[2*i] is: [N] → {[i] → [2*i] : 0 <= i <= N}, where [N] specifies the upper bound of the normalized loop. It is a compile-time unknown loop invariant, since its value is not updated across loop iterations. [i] → [2*i] is a map from the loop iteration to the accessed data point (simply the array indexes). 0 <= i <= N is the Presburger formula with constraints about when the access relation is valid.

Listing 1: Memory Footprint Estimation Example

    for (int i = 0; i <= N; ++i) {
        ... = a[i+3];
        d[2*i] = ...;
        d[3*i] = ...;
    }

Based on the polyhedral access relations constructed for every memory access in the loop, the whole memory footprint for the loop can be computed leveraging polyhedral arithmetic. It simply counts the number of data elements in each polyhedral access relation set, and then adds them together. Instead of a constant number, the result of the polyhedral arithmetic is an expression over the parameters. For d[2*i], the counting expression generated using polyhedral arithmetic is: [N] → {(1 + N) : N >= 0}. Therefore, as long as the value of N is available, the memory footprint of the loop can be estimated by evaluating the expressions. In our framework, the value of N is given by the expected trip count (through one of the five cases: NBNE, the classifier models (NBME/IBNE/IBME), or the rule-based system described in the last section). For statements that access the same arrays, e.g. d[2*i] and d[3*i], a union operation is first performed to compute the actual number of accessed elements as a function of the compile-time-unknown loop iterations, and this expression is instrumented into the program. It is evaluated at runtime to get the expected memory footprint.
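To make the runtime evaluation concrete, the following is a hand-worked sketch of the kind of closed-form expression the union-based counting could yield for Listing 1 (the paper does not show the generated expression itself, so this derivation is ours). The elements of d touched are {2i} ∪ {3i} for 0 <= i <= N: each set has N+1 elements, and they share the multiples of 6, of which there are floor(N/3)+1; array a contributes N+1 elements via a[i+3].

    #include <stdio.h>

    /* Hand-worked footprint expression for Listing 1; a sketch of what
     * instrumented code could evaluate, not the framework's output.
     * d: |{2i}| = N+1, |{3i}| = N+1, intersection (multiples of 6) has
     * N/3 + 1 elements, so |union| = 2*(N+1) - (N/3 + 1).
     * a[i+3] touches N+1 elements. */
    long footprint_bytes(long N, long elem_size) {
        long d_elems = 2 * (N + 1) - (N / 3 + 1); /* union of d[2i], d[3i] */
        long a_elems = N + 1;                     /* a[i+3]               */
        return (d_elems + a_elems) * elem_size;
    }

    int main(void) {
        /* At runtime, N is supplied by the expected trip count. */
        printf("%ld bytes\n", footprint_bytes(1000, (long)sizeof(double)));
        return 0;
    }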
3.2.2 Classifying Reuse and Streaming Loops

A loop that reuses memory locations over a large number of iterations (large reuse distance) needs enough cache space to hold its working memory and might be sensitive to the misses caused, whereas a loop that streams data which is reused within the next few iterations requires almost no cache space at all and might be insensitive, due to its small, fixed reuse distance. For efficient utilization of the cache, the scheduler must know this classification. We classify the loops using Static Reuse Distance (SRD), defined as the number of possible instructions between two accesses of a memory location. For example, in Fig. 5 the SRD between statements (S1, S2) is on the order of m*3, because an access in instruction S1 has to wait for m instructions within the inner loop to cover the distance of three outer iterations between successive accesses of the same memory location. The SRD between statements (S5, S6) is on the order of two, because the same memory is accessed after two iterations.

Figure 5: SRD estimation for two loop nests. Loop Nest 1 is classified as reuse, while Loop Nest 2 is streaming.

Any loop with a constant SRD, that is, where the distance between the accesses is covered within a few iterations of the same loop (e.g. the SRDs between statements (S5, S6), (S5, S7) and (S6, S7) in Fig. 5b), can be classified as streaming, because the memory locations must remain in the cache for only a few (constant) iterations of the loop, during which they are highly unlikely to be thrashed. More specifically, an SRD that involves an inner loop (e.g. the one between statements (S1, S2) in Fig. 5a) or an outer loop (e.g. between statements (S3, S4) in Fig. 5a) means a cache entry must wait in the cache over the duration of the entire loop that it is dependent on; such loops are classified as reuse. For example, array B must stay in the cache for the duration of the entire outer loop. Thus, we classify loops in which the SRD is dependent on either an outer or an inner loop as reuse loops (the reuse distance here is a function of the normalized loop bound N, for example), and we classify the remaining loops (with small, constant reuse distance) as streaming loops. Indirect and irregular references such as a[b[i]] do not typically have a large reuse distance associated with them (compared to the sizes of modern caches), and it is impossible to analyze them at compile time; in our current approach, they are therefore classified as non-reuse references.

3.3 Beacons Hoisting & Insertion

The beacon insertion compiler pass ensures that the beacons are hoisted to the entrances of the outermost loops intra-procedurally. However, inter-procedural loops (function calls in the loop body) can overload the scheduler with too many beacon calls. Hence, if the beacons are inside inter-procedural loops, then they are hoisted outside, and also above any other call sites that are not inside loops, along all paths. For hoisting the beacon call (and embedding the inner-loop attributes), the inner loop bounds may not be available (or live) at the outermost points inter-procedurally. Thus, we use the expected loop bounds of the inner loops to calculate the beacon properties, the timing information, and the memory footprint. Each inter-procedural inner loop's memory footprint is added to the outermost inter-procedural loop's memory footprint. On the other hand, the loop timing is based only on the outermost inter-procedural loop. The beacon is classified as reuse if there is a single inter-procedural nested loop. Unfortunately, such a conversion transforms many known/inferred beacons into unknown beacons. Also, hoisting is a repetitive process that stops once no beacons remain inside inter-procedural loops.

The decision trees are inserted as if-then-else blocks, with the trip count value being passed to both the loop timing model and the memory footprint formula. The equations with the coefficients and loop bounds are instrumented at the pre-header of the corresponding loop nests, followed by the memory footprint calculations. The variables that hold the timing and memory footprint values, along with the loop type (reuse or streaming) and the beacon type, are passed as arguments to the beacon library calls. Facilitated by the beacon library, the instrumented call fires a beacon with the loop properties to the scheduler. We use shared memory for the beacon communications between the library and the scheduler. For every beacon, a loop completion beacon is inserted either at the exit points of the loop nest or, for beacons hoisted above a call site, after that call site. The completion beacon sends no extra information other than signaling the completion of the loop phase, so that any sub-optimal scheduling decision can be rectified.
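Putting these pieces together, the following is a minimal sketch of what an instrumented pre-header might look like after beacon insertion. The decision-tree cut-points, coefficients, and the stand-in functions beacon() and loop_complete() are illustrative assumptions, not the framework's actual generated code or library interface:

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the beacon library; the real calls
     * write attributes to shared memory rather than printing. */
    enum loop_kind { REUSE, STREAMING };
    static void beacon(double time_ms, long footprint, enum loop_kind lk) {
        printf("beacon: %.2f ms, %ld bytes, %s\n",
               time_ms, footprint, lk == REUSE ? "reuse" : "streaming");
    }
    static void loop_complete(void) { printf("loop complete\n"); }

    void kernel(double *a, const double *b, long rows, long cols) {
        /* Decision tree emitted as an if-then-else block, predicting the
         * trip count from out-of-loop critical variables (cut-points
         * invented for illustration). */
        long trip = (rows > 4096) ? rows * 512 : rows * cols;
        /* Timing regression (Eq. 1) and footprint expression, evaluated
         * with the predicted trip count; coefficients are placeholders. */
        double time_ms   = 0.9 + 3.1e-5 * (double)trip;
        long   footprint = (long)((rows * cols + cols) * sizeof(double));
        beacon(time_ms, footprint, REUSE);      /* fired at pre-header */

        for (long i = 0; i < rows; ++i)
            for (long j = 0; j < cols; ++j)
                a[i * cols + j] += b[j];        /* b is reused per row */

        loop_complete();  /* completion beacon at the loop exit point */
    }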
4 Beacons Runtime Component

After the beacons and their attributes (loop timing regression models, trip count classifiers, memory footprint calculations, and reuse classification) are instrumented in the application, it is the job of the Runtime Component to evaluate and communicate the attributes during execution. These evaluated attributes are communicated to the scheduler through a library interface. We refer to these function calls to the library as "beacon calls". A beacon call fired from an application writes the beacon attributes to a shared memory region which is continuously polled by the scheduler. The Beacons Scheduler analyzes this beacon information and acts proactively on the resource requirements of the co-executing processes, which sets it apart from state-of-the-art schedulers like CFS. This establishes the communication between applications and the scheduler (no special privileges required), and the processes that may write to this shared memory are agreed upon during initialization with a key. The scheduler arbitrates the co-executing processes to maximize concurrency while simultaneously addressing the demand on shared resources such as caches and memory bandwidth. The goal of the scheduler is to facilitate efficient cache and memory bandwidth usage to improve system job throughput.

Types of Beacon Calls: Based on the system and application's execution state, there are three distinct beacon library calls: Beacon_Init (initializes the shared memory for writing the attributes), Beacon (writes the beacon attributes to the shared memory), and Loop_Complete (signals the end of a loop for a process).

Beacons Call Classification: Based on the precision of the attribute information, beacon calls can be classified into Known/Inferred Beacon Calls, where the loop trip-counts, timing, and memory footprints are calculated via closed-form expressions with high accuracy, and Unknown Beacon Calls, where the attribute information is non-exact, with expected values mainly calculated by rule-based trip-count estimation. This distinction helps us identify potential impreciseness in an application's resource requirements and allows us to rectify certain sub-optimal scheduling decisions.

4.1 Beacons Scheduler (BES)

The beacon information sent by the applications is collected by the scheduler to devise a schedule with efficient resource usage. The beacon scheduler dynamically operates in two modes based on the loop data-reuse behaviour: reuse mode or streaming mode. The two modes, corresponding to the two types of reuse behaviour, are meant to mitigate the adverse effects (resource demand overlaps) caused by multiple loops that co-execute together. Initially, the scheduler starts without a mode and schedules as many processes as required to fill up all the processors (cores) in the machine. One primary objective of the scheduler is to maximize system resource utilization and never keep any processors idle.

The scheduler enters one of the two modes based on the first beacon it collects. Until a beacon call is fired, a process is treated as having no memory requirement and is referred to as a non-cache-pressure type process. During the non-cache-pressure phase, processes typically have a memory footprint lower than the size of the L1 private cache and do not affect the shared cache, unlike streaming or reuse types with cache requirements exceeding the L1 cache. Based on the timing information received from an incoming beacon, the scheduler estimates the time by which a certain loop should complete its execution. An important point to note here is that, although the loop time values are obtained by compiling the process in isolation, the loop timing will still be similar even under multi-tenancy when scheduled by the beacon scheduler. This is because the scheduler ensures the avoidance of contention among the processes, as detailed below.

When any process fires a beacon, three possible timing scenarios (Fig. 6) can occur. In the first case, the completing (currently executing) beacon and the incoming beacon do not overlap (Fig. 6 (right)). The completing beacon will relinquish its resources, which will be accordingly updated by the scheduler. The incoming beacon is then scheduled based on the resource availability. In the second case, the completing beacon and the incoming beacon overlap for greater than 5-10% (configurable) of the execution time (Fig. 6 (middle)). Here, if the resources required by the incoming beacon exceed the available resources, then the incoming beacon process is descheduled and replaced with another process. Finally, in the third case, where the overlap is less than 5-10%, if the incoming beacon's resource requirement is satisfied on completion of the earliest beacon process, then the process is allowed to continue but with performance monitoring turned on. However, if the IPC of the beacon processes degrades, then the incoming beacon process is descheduled. Also, if the information is known to be imprecise (unknown beacons), then the scheduler turns on performance counters to rectify its actions. In each case, the resource is either the last level cache in reuse mode or the memory bandwidth in stream mode.

Figure 6: Different timing scenarios of an incoming beacon

Reuse Mode: The goal of the scheduler in reuse mode is to effectively utilize the cache by minimizing the execution overlap between the processes that are reuse bound. At any given time, the cores in the machine may be executing only a mix of reuse loops (RJ) that fit the cache and non-cache-pressure (FJ) applications, as shown in the scheduler Mealy machine of Fig. 7. If any of these FJ processes fires a reuse beacon (RB), the scheduler first checks the memory information to determine whether the beacon fits in the available cache space. The scheduler only allows the process to continue if it fits in the available cache space. Once the reuse loop completes (loop_completion beacon call), the process is re-classified as FJ again. If an FJ process fires a streaming beacon (SB), then the process is suspended and replaced by a suspended reuse process that fits in the cache. If no such suspended reuse process exists, then a non-cache-pressure process is scheduled. When all reuse loops are completed (RC), or the number of suspended stream loops hits a threshold (ST) (typically 90% of the number of cores in the machine), then the remaining RJ processes, if any, are suspended, all the SJ processes are resumed, and the scheduler switches to stream mode.

Figure 7: A simplified Mealy state machine of the beacon scheduler. Key: Beacon, Reuse, Streaming, Filler, Job, Complete, Threshold, dequeue, enqueue, $(cache)
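As a rough sketch of the reuse-mode admission decision described above (the scheduler's actual bookkeeping is not shown in the paper), the check amounts to comparing a reuse beacon's predicted footprint against the unclaimed LLC space:

    #include <stdbool.h>

    /* Rough sketch of the reuse-mode admission check; the real
     * scheduler's state tracking is more involved. */
    typedef struct {
        long llc_bytes;      /* shared LLC capacity (e.g. 32 MB)   */
        long claimed_bytes;  /* footprints of admitted reuse loops */
    } cache_state;

    /* A reuse beacon is admitted only if its predicted footprint fits
     * in the cache space left by already-running reuse loops. */
    bool admit_reuse_beacon(cache_state *cs, long footprint_bytes) {
        if (cs->claimed_bytes + footprint_bytes <= cs->llc_bytes) {
            cs->claimed_bytes += footprint_bytes;  /* claim the space */
            return true;   /* process continues running              */
        }
        return false;      /* suspend; schedule a fitting process    */
    }

    /* On a loop-completion beacon, the claimed space is released. */
    void release_reuse_beacon(cache_state *cs, long footprint_bytes) {
        cs->claimed_bytes -= footprint_bytes;
    }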
Stream Mode: A stream loop does not reuse its memory in the cache and hence is not disturbed by other co-executing processes as long as the system memory bandwidth is sufficient. The expected (mean) memory bandwidth (µbw) of a stream loop can be calculated using the memory footprint and the timing information as: µbw = MemoryFootprint / LoopTime. In stream mode, the scheduler schedules the SJ processes by replacing all other processes (RJ and FJ) as long as the Total Mean Memory Bandwidth (Tµbw) remains less than the memory bandwidth of the machine.

Any remaining core can only be occupied by an FJ process, because an RJ process would get thrashed by the streaming applications. If a streaming loop completes, then it is replaced by a suspended streaming process when memory bandwidth is available. Otherwise, the process is allowed to continue as long as it does not fire a reuse beacon (RB). In other words, any non-streaming, non-cache-pressure application firing a reuse beacon is suspended and replaced by either a suspended streaming process or a non-cache-pressure application. When the number of such suspended reuse processes hits a threshold (RT), typically 10% of the number of cores in the machine, and depending on whether the reuse processes can fill the cache, the scheduler switches from stream mode to reuse mode. An execution scenario is possible in which all streaming processes get suspended, all reuse processes are run, then, after suspending more streaming jobs, all streaming processes are scheduled again in a batch, and so on.

5 Evaluation

Our goal is to evaluate the Beacons Framework in environments where throughput, in terms of the number of jobs completed per unit time or the time for completion of a batch, is important, and where the latency of each individual process itself is of little value. A common example of such a scenario is a server conducting biological anomaly detection on thousands of patient data records (each patient as a new process), with the goal of completing as many as possible to discover a particular pattern [12].

Experimental Platform: The experiments were conducted on an Amazon Graviton2 (m6g.metal) machine running Linux Ubuntu 20.4. The system has one socket with 64 processors and a 32 MB L3 cache. We carried out our experiments on 60 processors, leaving the rest for other system applications to run smoothly and not interfere with our results. The Beacons Compilation Component was implemented as a unified set of compiler and profiling passes in LLVM 10.0. The machine learning library scikit-learn was used to implement the classifier models.

Benchmarks: We evaluated our system on four sets of diverse workload suites, consisting of 53 individual benchmarks. We perform our experiments on: PolyBench [33], a numerical computational suite consisting of linear algebra and matrix multiplication kernels; Rodinia [4], which consists of graph traversal and dynamic programming applications from diverse domains such as medical imaging; and various popular machine learning benchmarks such as AlexNet [15], DenseNet201 [16], Resnet101 [14], Resnet-18, Resnet-152, and VGG16 [28], trained on a subset of the CIFAR-10 dataset [20]. In addition, we perform experiments on well-known pre-trained networks such as TinyNet, Darknet, and an RNN (with 3 recurrent modules) for predicting data samples from the CIFAR-10, Imagenet [7], and NLTK [3] corpora respectively.

Methodology: We run experiments on and report all the benchmarks in Polybench. However, since the L1 data cache size is 32KB, the beacon calls are fired only if the loop memory footprint is above 32KB and the predicted loop time is above 10ms. This is necessary because the average processing times of loop-complete, reuse, and stream beacons are 116us, 427us, and 292us respectively, and thus loop timings below 10ms would be unnecessary for scheduling. Leukocyte in Rodinia is one such benchmark with all beacons statically removed, because the expected memory footprint is lower than 32KB; hence we do not report its values here. Our ML workloads were created using Darknet [25] and were divided into training benchmarks and prediction benchmarks. The training benchmarks are well-known convolutional neural networks (CNNs), which we ran by training on a subset of Cifar-10 images. The prediction benchmarks are pre-trained models that were used to classify images from the Imagenet dataset. For all the benchmarks, the input sets were divided into training & test sets to obtain the loop-timing (regression) & trip-count (classification) models. Table 2 summarizes all the benchmarks used in our experiments. All the benchmarks were compiled with their default optimization flags (-O2/O3, -pthread, etc.).

Benchmark Suite                              Benchmarks                                                      Dataset
Polybench                                    2mm, 3mm, atax, bicg, mvt, gemm, gesummv, symm, syr2k,          SMALL, STANDARD,
                                             syrk, trmm, cholesky, lu, ludcmp, trisolv, correlation,        EXTRALARGE (Training);
                                             covariance, floyd-warshall, nussinov, deriche, adi,            LARGE (Testing)
                                             fdtd-2d, heat-3d, jacobi-1d, seidel-2d
Rodinia                                      backprop, bfs, cfd, heartwall, hotspot, hotspot3D,             Custom Inputs
                                             kmeans-serial, lavaMD, nn, particlefilter, srad_v2             (Training & Testing)
Convolutional Neural Network (Training)      Alexnet, Resnet-18, Resnet-101, Resnet-151, VGG-16,            CIFAR-10
                                             Densenet-201                                                   (Training & Testing)
Convolutional Neural Network (Pre-Trained)   TinyNet, Darknet3                                              Imagenet
                                                                                                            (Training & Testing)

Table 2: Summary of benchmarks used in our experiments

Designing Scheduling Jobs: Our mixes consist of large and small processes. We set a reasonable number of large processes (20-200) so that a mix does not run for an unusually long duration (hours) during experimentation, although our scheduler is robust for long-running processes. Once the large processes are executing, we inject 4-5 small processes per large process to simulate a real-life scenario of other processes getting added to the mixes and trying to hog the cache. Thus, the scheduler can end up executing more than 10000 processes during an entire mix.
In addition, our mixes are homogeneous, so all processes are the same but with different inputs. We also created heterogeneous mixes of different applications, but a homogeneous mix tends to be the worst case, because all processes have the same phases and cache requirements and execute in somewhat of a phase synchrony, causing each process to demand the same resources as the other processes at the same time during its execution.

Baselines: To evaluate the effectiveness of the Beacons scheduling mechanism, we compare it against (a) the Linux CFS [19] baseline, which is the widely-used standard scheduler in most computing environments, and (b) Merlin [29], a performance-counter based reactive scheduler (RES). The purpose of this comparison is to investigate the necessity of compiler-guided scheduling vs traditional perf-based scheduling. Merlin first uses the cache misses per thousand instructions (MPKI) in the LLC to determine memory intensity; then it estimates cache reuse by calculating the memory factor (MF), which is the ratio of the LLC MPKI to the (LLC-1)-level MPKI. We use the same MF threshold as Merlin (0.6) to classify reuse and stream phases.

5.1 Beacons Prediction Accuracies

Loop Trip Count Prediction Accuracy: Fig. 8 shows the distribution of the different loop classes for the Rodinia and Machine Learning workloads. Polybench is not shown in the pie-chart since it contains 100% NBNE-type loops. Rodinia's loops are rule-based, so no classifiers are needed. The right part of Fig. 8 shows the accuracy of the classifier-based models in predicting the trip counts. The average accuracy is 85.3%. The reason UECB achieves consistently high accuracy is the nature of these benchmarks. Machine Learning workloads can be broadly dissected into several constituent layers: convolutional layers, fully-connected layers, softmax layers, dropout layers, etc. Each of these layers performs a unique set of operations that can be captured very well by the classifier models employed by UECB. These models find the correlation between specific loop properties, such as the trip counts and the critical variable values, and hence are able to perform well. Overall, the trip counts for 60% of the loops in the workloads are predicted by classifier models, while the rest are obtained by rule-based models.

Figure 8: Loop distribution across Rodinia & ML workloads and their trip-count prediction accuracy.

Figure 9: Three different cases of loop timing accuracy: positive deviation (CFD), negative (2mm) and perfect fit (Backprop)

Loop Timing Accuracy: The overall loop timing accuracy for both unknown and known beacons collectively was 83% across all the workloads (Fig. 10). This accuracy can vary depending on the different classes of loops and their respective training inputs for the closed-form regression equation (Eq. 1). For known/inferred beacons (NBNE and other loops covered by the UECB algorithm), the accuracy of the timing information mainly depends on training. The regression fit of Eq. 1 works particularly well for continuous input sets, as the inputs can be monotonically increased or decreased to obtain a proper domain for the regression function. For example, in Fig. 9, backprop takes continuous integer inputs and the predicted curve matches closely across different testing inputs (mean-squared error µ = 0.057). However, for applications with discrete inputs, the regression training becomes dependent on how representative the provided training inputs are of the actual application behaviour. Some benchmarks, like hotspot in Rodinia, have five inputs (four used for training) that capture the behaviour of the loop; the predicted loop curve overlaps almost exactly with the actual time curve, similar to Fig. 9. In contrast, a few cases had training inputs that were not sufficient to learn precise regression coefficients, thus decreasing the accuracy. For example, in 2mm, shown in Figure 9, the predicted curve deviates for the fourth input (which is the test input).

For some NBME, IBME and IBNE loops, the unknown beacons predict a loop time from the loop bound obtained during the regression runs. However, the actual loop time can deviate from the predicted time, resulting in either a positive or a negative error rate. These two cases are illustrated in Fig. 9. In CFD, the timing information is generated by hoisting the expected values of the inner loop bounds to a point above their inter-procedural outer loop in the beacon. As a result, we end up with a mean-squared error µ = 11.023, which shows that the "unknown" nature of these loops can sometimes cause unreliable prediction. However, to ensure that such cases do not impact the scheduling decisions, an end-of-loop beacon typically fires at the loop exit, which helps the scheduler correct its course. Ultimately, these few cases of expected predictions (with low accuracy) remain manageable, mainly due to the loop completion beacons.
Figure 10: Loop Timing Accuracy across all three workload suites

5.2 Throughput Improvement & Analysis prevents the inter-loop data reuse. While CFS can save
on memory accesses across loops (because CFS does not
The throughput of the system is calculated as the total time
preempt the application), BES gains performance during
required by the scheduler to finish a fixed number of incoming
the reuse loops, leading to smaller overall improvement.
jobs which is same as the average number of jobs completed
in unit time when normalized with a baseline (CFS). The • Medium Improvement (50% - 2x): Applications within
throughput of both the beacon scheduler and Merlin-based re- this category spend a considerable amount of time within
active scheduler normalized against CFS is shown in Fig. 11. reuse loops. Bicg, gemm and atax have a majority of
Based on geometric mean average, we achieved speedup of streaming loops but the reuse loop executes for the
76.78% on Gravitron2 compared to the 33% slowdown by the longest duration. On the other hand, trisolv has more
Merlin-based reactive scheduler. Among 45 evaluated bench- reuse loops (but lesser duration) which allows beacons
marks from Polybench, Rodinia, and modern ML workloads, scheduler to be in reuse mode for a longer period of time
we had significant throughput improvement for 28 of them collectively, and benefits from the proactive scheduling.
with the modern ML workloads showing the most through-
put improvement (2.47x) on geometric mean. Polybench and • High Improvement (2x - 3x): Applications run for
Rodinia have a modest improvement of 69.01% and 51.46% longer durations and spend a majority of their time within
respectively. These improvements are for the worst-case mix reuse loops (although they have a mix of streaming and
of homogeneous processes that have the same cache phases. reuse loops in some cases). Evidently, it it’s no surprise
The performance benefit is mainly attributed to the smart that all the ML workloads fall in this category. These
processes scheduling leveraging the knowledge of process’ workloads are mostly training or predicting so they have
loop type at a given time. The throughput improvement can to continuously reuse their neuron weights during the
be divided into several categories based on its throughput execution. Aside from the neuron weights, they typi-
improvement and are analyzed below: cally use matrix computations, which are further reused
• Insignificant Improvement: In four workloads, bea- to do back-propagation. Both the matrices and weights
cons performed worse or got similar timings compared account for a considerable amount of data, as the large
against CFS baseline. These workloads were typically networks like Alexnet has eight layers (especially five
dominated by streaming loops (> 95%), and thus ma- convolutional layers). Thus, beacons schedules the pro-
jority of execution time was spend on streaming loops. cesses to enable fast matrix multiplications due to less
In this scenario, beacons’ scheduling is similar to CFS, trashing of the cache.
where the resource demand are low and uniform. Adding
We finally present three interesting job completion time-
the small scheduling and beacons call overheads, we
lines of Cholesky (which showed substantial benefits with
end up with timing similar to CFS’ or small slowdowns
beacons), juxtaposed with those of Correlation (which had
(≤ 9%).
no noticeable benefit)(Fig. 12) with three scheduler - CFS,
• Small Improvement (10% - 50%): The majority of ap- Beacons’ scheduler (BES) and Merlin-based reactive sched-
plications within this category tend to have small amount uler (RES). In Cholesky, BES starts with the same jobs as
of reuse-based loops ( 5% − 10%). Most of the reuse CFS but soon replaces some of the reuse jobs with other non-
loops typically have smaller footprints or short duration. cache-pressure types to avoid cache overflow, unlike CFS
Deriche is a special case where reuse loops occur alterna- which takes longer to retire its first jobs. BES later on intelli-
tively between streaming loops. Therefore, the minimal gently places reuse and non-cache-pressure jobs to maintain
performance improvement is obtained from the data- high throughput, whereas CFS keeps scheduling both the non-
reuse in alternate loops, but the frequent streaming loops cache-pressure types and cache-pressure types throughout

Figure 11: Throughput improvement across benchmarks in Polybench, Rodinia, and machine learning workloads. Both schedulers' throughputs are normalized against the baseline Linux CFS scheduler.

Figure 12: Histograms of the job-completion times of CFS and the beacon-enabled scheduler (BES) for cholesky (left) and correlation (right). The X axis represents discrete timesteps, and the Y axis counts the number of jobs that completed within a given timestep.

6 Related Works

Conservative scheduling [32] presents a learning-based technique for load prediction on a processing node: the load observed at a node over time is extrapolated to predict the load at a future time, and task scheduling is done based on the predicted future load over a time window. A compiler-driven approach to reducing power consumption is proposed in [26]: it inserts statements that shut down functional units in regions of a program where the units are not accessed, using a profile-driven approach that predicts future execution paths and their latencies.

Profile-based scheduling: Prediction of applications' upcoming phases using a combination of offline and online profiling is proposed in [24, 27]. Similarly, another approach [9] uses a reuse-distance model for simple caches, calculated by profiling on a simulator, to predict L2 cache usage. A cache-aware scheduling algorithm is proposed in [10]; it monitors the cache usage of threads at runtime and attempts to schedule groups of threads that stay within the cache threshold, but it is neither compiler-driven nor sensitive to phase changes within the threads. Several efforts have focused on developing scheduling infrastructure for shared server platforms [22, 23, 29, 31, 34]. A key feature of these efforts is their use of observation-based methods (i.e., reactive approaches) to establish resource contention (e.g., for caches, memory bandwidth, or other platform resources), to determine the interference at runtime, and/or to assess the workloads' sensitivity to the contended resources by profiling.

7 Conclusion

In this work, we propose a compiler-directed approach for proactively carrying out scheduling. The key insight is that the compiler can produce predictions of loop timings and underlying memory footprints, along with the type of each loop (reuse-oriented vs. streaming), which are then used to make scheduling decisions. A new framework based on static analysis coupled with ML models was developed to handle irregular loops with statically non-determinable trip counts and multiple loop exits; this framework successfully predicts such loops with 85% accuracy. A prototype implementation demonstrates significant throughput improvements over CFS, averaging 76.78% and reaching up to 3.25x on Graviton2 for consolidated workloads. The value of prediction was also demonstrated against a reactive framework, which underperformed CFS by 9%. To conclude, prediction enables proactive scheduling decisions, which leads to significant improvements in throughput for a variety of workloads.
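To make the compiler/runtime contract concrete, the sketch below shows what beacon-instrumented code could look like after compilation; the beacon_enter/beacon_exit names, their signatures, and the predicted values are hypothetical illustrations of the idea, not the framework's actual interface.

#include <stddef.h>
#include <stdio.h>

enum loop_type { STREAMING, REUSE };

/* Stub for the runtime call that a compiler-inserted beacon would
 * make; in the real system this would feed the beacons scheduler. */
static void beacon_enter(enum loop_type type, size_t footprint_bytes,
                         double predicted_ms) {
    printf("beacon: type=%d footprint=%zu bytes predicted=%.1f ms\n",
           (int)type, footprint_bytes, predicted_ms);
}

static void beacon_exit(void) { printf("beacon: exit\n"); }

/* What instrumented user code could look like after the compiler pass. */
void compute(double *a, double *tile, size_t n, size_t k) {
    /* Streaming loop: announce low, uniform cache demand. */
    beacon_enter(STREAMING, 0, 12.0);           /* illustrative timing */
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 2.0 + 1.0;
    beacon_exit();

    /* Reuse loop nest: announce the tile footprint so the scheduler
     * can budget last-level cache capacity for it. */
    beacon_enter(REUSE, k * k * sizeof(double), 48.0);
    for (int t = 0; t < 1000; t++)
        for (size_t i = 0; i < k * k; i++)
            tile[i] += a[i % n];
    beacon_exit();
}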
References

[1] Amazon Graviton2. https://aws.amazon.com/ec2/graviton/, 2019. Accessed: 2021 April 16.

[2] George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman, and Nathan DeBardeleben. On the diversity of cluster workloads and its impact on research results. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 533–546, 2018.

[3] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

[4] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC), pages 44–54. IEEE, 2009.

[5] Shuang Chen, Christina Delimitrou, and José F. Martínez. PARTIES: QoS-aware resource partitioning for multiple interactive services. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 107–120, 2019.

[6] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. QoS-aware admission control in heterogeneous datacenters. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), pages 291–296, San Jose, CA, 2013. USENIX.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[8] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. Fairness via source throttling: A configurable and high-performance fairness substrate for multi-core memory systems. ACM SIGPLAN Notices, 45(3):335–346, 2010.

[9] Alexandra Fedorova and Margo Seltzer. Throughput-oriented scheduling on chip multithreading systems. 04 2019.

[10] Alexandra Fedorova, Margo I. Seltzer, Christopher Small, and Daniel Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In Proceedings of the 2005 USENIX Annual Technical Conference, April 10–15, 2005, Anaheim, CA, USA, pages 395–398. USENIX, 2005.

[11] Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, and Adam Belay. Caladan: Mitigating interference at microsecond timescales. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 281–297, 2020.

[12] Julie Greensmith, Uwe Aickelin, and Gianni Tedesco. Information fusion for anomaly detection with the dendritic cell algorithm. Information Fusion, 11(1):21–34, 2010.

[13] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly: Polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), volume 2011, page 1, 2011.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[17] Călin Iorgulescu, Reza Azimi, Youngjin Kwon, Sameh Elnikety, Manoj Syamala, Vivek Narasayya, Herodotos Herodotou, Paulo Tomita, Alex Chen, Jack Zhang, et al. PerfIso: Performance isolation for commercial latency-sensitive services. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 519–532, 2018.

[18] Seyyed Ahmad Javadi, Amoghavarsha Suresh, Muhammad Wajahat, and Anshul Gandhi. Scavenger: A black-box batch workload resource manager for improving utilization in cloud environments. In Proceedings of the ACM Symposium on Cloud Computing, pages 272–285, 2019.

[19] Jacek Kobus and Rafal Szklarski. Completely fair scheduler and its tuning. Draft on Internet, 2009.

[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[21] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, pages 75–, Washington, DC, USA, 2004. IEEE Computer Society.

[22] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. Bubble-Up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), pages 248–259, New York, NY, USA, 2011. ACM.

[23] Dejan Novaković, Nedeljko Vasić, Stanko Novaković, Dejan Kostić, and Ricardo Bianchini. DeepDive: Transparently identifying and managing performance interference in virtualized environments. In 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 219–230, 2013.

[24] Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. Trace based phase prediction for tightly-coupled heterogeneous cores. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 445–456. ACM, 2013.

[25] Joseph Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.

[26] Siddharth Rele, Santosh Pande, Soner Onder, and Rajiv Gupta. Optimizing static power dissipation by functional units in superscalar processors. In Compiler Construction, pages 261–275. Springer, 2002.

[27] Xipeng Shen, Yutao Zhong, and Chen Ding. Locality phase prediction. In ACM SIGOPS Operating Systems Review, volume 38, pages 165–176. ACM, 2004.

[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[29] Priyanka Tembey, Ada Gavrilovska, and Karsten Schwan. Merlin: Application- and platform-aware resource allocation in consolidated server systems. In ACM Symposium on Cloud Computing (SoCC), Seattle, WA, November 2014.

[30] Sven Verdoolaege. Presburger formulas and polyhedral compilation. 2016.

[31] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. Bubble-Flux: Precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), pages 607–618, New York, NY, USA, 2013. ACM.

[32] Lingyun Yang, Jennifer M. Schopf, and Ian Foster. Conservative scheduling: Using predicted variance to improve scheduling decisions in dynamic environments. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 31. ACM, 2003.

[33] Tomofumi Yuki and Louis-Noël Pouchet. PolyBench 4.0, 2015.

[34] Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys, 45(1), 2012.
