Analysis of Federated and Global Scheduling for Parallel Real-Time Tasks
Jing Li†, Jian-Jia Chen§, Kunal Agrawal†, Chenyang Lu†, Chris Gill†, Abusayeed Saifullah†
† Washington University in St. Louis, U.S.A.
§ TU Dortmund University, Germany
† [email protected], § [email protected], {kunal, lu, cdgill}@cse.wustl.edu, [email protected]
algorithm and proves the augmentation bound. Section IV proves a lower bound for any scheduler for parallel tasks. Section V presents a canonical form that gives an upper bound on the work of a DAG that must be done in a specified interval length. Section VI proves that G-EDF provides a capacity augmentation bound of 2.618. Section VII shows that G-RM provides a capacity augmentation bound of 3.732. We discuss some practical considerations of the three schedulers in Section VIII. Section IX discusses related work and Section X concludes this paper.

II. System Model

We now present the details of the DAG task model for parallel tasks and some additional definitions.

We consider a set τ of n independent sporadic real-time tasks {τ1, τ2, ..., τn}. A task τi represents an infinite sequence of arrivals and executions of task instances (also called jobs). We consider the sporadic task model [9, 29] where, for a task τi, the minimum inter-arrival time (or period) Ti represents the time between consecutive arrivals of task instances, and the relative deadline Di represents the temporal constraint for executing the job. If a task instance of τi arrives at time t, the execution of this instance must be finished no later than the absolute deadline t + Di, and the release of the next instance of task τi must be no earlier than t plus the minimum inter-arrival time, i.e., t + Ti. In this paper, we consider implicit-deadline tasks, where each task τi's relative deadline Di is equal to its minimum inter-arrival time Ti; that is, Ti = Di. We consider the schedulability of this task set on a uniform multicore system consisting of m identical cores.

Each task τi ∈ τ is a parallel task and is characterized as a directed acyclic graph (DAG). Each node (subtask) in the DAG represents a sequence of instructions (a thread) and each edge represents a dependency between nodes. A node (subtask) is ready to be executed when all its predecessors have been executed. Throughout this paper, as it is not necessary to build the analysis on the specific structure of the DAG, only two parameters related to the execution pattern of task τi are defined:
• Total execution time (or work) Ci of task τi: this is the summation of the worst-case execution times of all the subtasks of task τi.
• Critical-path length Li of task τi: this is the length of the critical path in the given DAG, in which each node is characterized by the worst-case execution time of the corresponding subtask of task τi. The critical-path length is the worst-case execution time of the task on an infinite number of cores.

Given a DAG, obtaining the work Ci and the critical-path length Li [46, pages 661-666] can both be done in linear time. For brevity, the utilization Ci/Ti = Ci/Di of task τi is denoted by ui for implicit deadlines. The total utilization of the task set is UP = Σ_{τi∈τ} ui.

Utilization-Based Schedulability Test. In this paper, we analyze schedulers in terms of their capacity augmentation bounds. The formal definition is presented here:

Definition 1. Given a task set τ with total utilization UP, a scheduling algorithm S with capacity augmentation bound b can always schedule this task set on m cores of speed b as long as τ satisfies the following conditions on unit-speed cores:

\[ \text{Utilization does not exceed the total number of cores:} \quad \sum_{\tau_i \in \tau} u_i \le m \quad (1) \]
\[ \text{For each task } \tau_i \in \tau, \text{ the critical-path length } L_i \le D_i \quad (2) \]

Since no scheduler can schedule a task set τ on m unit-speed cores unless Conditions (1) and (2) are met, a capacity augmentation bound automatically leads to a resource augmentation bound. This definition can be equivalently stated (without reference to the speedup factor) as follows: Condition (1) says that the total utilization UP is at most m/b, and Condition (2) says that the critical-path length of each task is at most 1/b of its relative deadline, that is, Li ≤ Di/b. Therefore, in order to check whether a task set is schedulable we only need to know the total task set utilization and the maximum critical-path utilization. Note that a scheduler with a smaller b is better than one with a larger b; when b = 1, S is an optimal scheduler.
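The equivalent statement of Definition 1 gives a simple schedulability test that needs only aggregate task parameters. The following Python sketch is our own minimal illustration of that test; the function name, task representation, and numbers in the example are ours and not from the paper.

```python
def capacity_augmentation_test(tasks, m, b):
    """Schedulability test from Definition 1, restated without the speedup factor:
    a scheduler with capacity augmentation bound b schedules the task set on m
    unit-speed cores if U_P <= m/b and L_i <= D_i/b for every task.
    Each task is a tuple (C_i, L_i, D_i) with implicit deadlines (T_i = D_i)."""
    total_utilization = sum(C / D for (C, L, D) in tasks)
    if total_utilization > m / b:                      # Condition (1) scaled by b
        return False
    return all(L <= D / b for (C, L, D) in tasks)      # Condition (2) scaled by b

# Example with hypothetical parameters: three tasks on m = 6 cores, b = 2 (federated scheduling).
tasks = [(20, 7, 16), (6, 2, 12), (4, 1, 10)]
print(capacity_augmentation_test(tasks, m=6, b=2))     # True: U_P = 2.15 <= 3 and L_i <= D_i/2
```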
III. Federated Scheduling

This section presents the federated scheduling strategy for parallel tasks with implicit deadlines. We prove that it provides a capacity augmentation bound of 2 for parallel real-time tasks on an m-core machine.

A. Federated Scheduling Algorithm

Given a task set τ, the federated scheduling algorithm works as follows. First, tasks are divided into two disjoint sets: τhigh contains all high-utilization tasks — tasks with worst-case utilization at least one (ui ≥ 1) — and τlow contains all the remaining low-utilization tasks. Consider a high-utilization task τi with worst-case execution time Ci, worst-case critical-path length Li, and deadline Di (which is equal to its period Ti). We assign ni dedicated cores to τi, where

\[ n_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil \quad (3) \]

We use nhigh = Σ_{τi∈τhigh} ni to denote the total number of cores assigned to high-utilization tasks τhigh. We assign the remaining cores to all low-utilization tasks τlow, denoted as nlow = m − nhigh. The federated scheduling algorithm admits the task set τ if nlow is non-negative and nlow ≥ 2 Σ_{τi∈τlow} ui.
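The core allocation in Equation (3) and the admission condition can be written down directly. The Python sketch below is our illustration (the function name `federated_admission`, the task tuple layout, and the example numbers are assumptions, not part of the paper).

```python
import math

def federated_admission(tasks, m):
    """tasks: list of (C_i, L_i, D_i) with implicit deadlines (T_i = D_i).
    Returns (admitted, assignment), where assignment maps a high-utilization
    task index to its number of dedicated cores n_i from Equation (3)."""
    assignment = {}
    low_utilization_sum = 0.0
    for idx, (C, L, D) in enumerate(tasks):
        u = C / D
        if u >= 1:                                     # high-utilization task: dedicated cores
            assignment[idx] = math.ceil((C - L) / (D - L))
        else:                                          # low-utilization task: executed sequentially
            low_utilization_sum += u
    n_high = sum(assignment.values())
    n_low = m - n_high                                 # cores left for all low-utilization tasks
    admitted = n_low >= 0 and n_low >= 2 * low_utilization_sum
    return admitted, assignment

# Example: one heavy task and two light tasks on 8 cores (hypothetical parameters).
tasks = [(20, 12, 16), (6, 2, 12), (4, 1, 10)]
print(federated_admission(tasks, m=8))                 # heavy task gets ceil(8/4) = 2 cores
```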
After a valid core allocation, runtime scheduling proceeds as follows: (1) Any greedy (work-conserving) parallel scheduler can be used to schedule a high-utilization task
τi on its assigned ni cores. Informally, a greedy scheduler is one that never keeps a core idle if some node is ready to execute. (2) Low-utilization tasks are treated and executed as though they are sequential tasks, and any multiprocessor scheduling algorithm (such as partitioned EDF [37] or various rate-monotonic schedulers [3]) with a utilization bound of at least 1/2 can be used to schedule all the low-utilization tasks on the allocated nlow cores. The important observation is that we can safely treat low-utilization tasks as sequential tasks, since Ci ≤ Di and parallel execution is not required to meet their deadlines.⁴

⁴ Even if these tasks are expressed as parallel programs, it is easy to enforce correct sequential execution of parallel tasks — any topologically ordered execution of the nodes of the DAG is a valid sequential execution.

B. Capacity Augmentation Bound of 2 for Federated Scheduling

Theorem 1. The federated scheduling algorithm has a capacity augmentation bound of 2.

To prove Theorem 1, we consider a task set τ that satisfies Conditions (1) and (2) from Definition 1 for b = 2. Then, we (1) state the relatively obvious Lemma 1; (2) prove that a high-utilization task τi meets its deadline when assigned ni cores; and (3) show that nlow is non-negative and satisfies nlow ≥ b Σ_{τi∈τlow} ui, and therefore all low-utilization tasks in τ will meet their deadlines when scheduled using any multiprocessor scheduling strategy with utilization bound no less than 1/b (i.e., one that can afford a total task set utilization of m/b = 50% of m). These three steps complete the proof.

Lemma 1. A task set τ is classified into disjoint subsets s1, s2, ..., sk, and each subset is assigned a dedicated cluster of cores with size n1, n2, ..., nk respectively, such that Σi ni ≤ m. If each subset sj is schedulable on its nj cores using some scheduling algorithm Sj (possibly different for each subset), then the whole task set is guaranteed to be schedulable on m cores.

High-Utilization Tasks Are Schedulable. Assume that a machine's execution time is divided into discrete quanta called steps. During each step a core can be either idle or performing one unit of work. We say a step is complete if no core is idle during that step, and otherwise we say it is incomplete. A greedy scheduler never keeps a core idle if there is ready work available. Then, for a greedy scheduler on ni cores, we can state two straightforward lemmas [35].

Lemma 2. [35] Consider a greedy scheduler running on ni cores for t time steps. If the total number of incomplete steps during this period is t*, the total work F_t done during these time steps is at least F_t ≥ ni·t − (ni − 1)·t*.

Lemma 3. [35] If a job of task τi is executed by a greedy scheduler, then every incomplete step reduces the remaining critical-path length of the job by 1.

From Lemmas 2 and 3, we can establish Theorem 2.

Theorem 2. If an implicit-deadline deterministic parallel task τi is assigned ni = ⌈(Ci − Li)/(Di − Li)⌉ dedicated cores, then all its jobs can meet their deadlines when using a greedy scheduler.

Proof: For contradiction, assume that some job of a high-utilization task τi misses its deadline when scheduled on ni cores by a greedy scheduler. Therefore, during the Di time steps between the release of this job and its deadline, there are fewer than Li incomplete steps; otherwise, by Lemma 3, the job would have completed. Therefore, by Lemma 2, the scheduler must have finished at least ni·Di − (ni − 1)·Li work.

\[
n_i D_i - (n_i - 1) L_i = n_i (D_i - L_i) + L_i
= \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil (D_i - L_i) + L_i
\ge \frac{C_i - L_i}{D_i - L_i} (D_i - L_i) + L_i = C_i
\]

Since the job has work of at most Ci, it must have finished within Di steps, leading to a contradiction.
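As a quick sanity check of Equation (3) and the work bound used in this proof, the short sketch below (ours, with hypothetical task parameters) evaluates ni and the guaranteed work ni·Di − (ni − 1)·Li for a few high-utilization tasks and confirms that it is at least Ci.

```python
import math

def dedicated_cores(C, L, D):
    """n_i from Equation (3) for a high-utilization task (C >= D > L)."""
    return math.ceil((C - L) / (D - L))

def guaranteed_work(C, L, D):
    """Lower bound on the work a greedy scheduler finishes within D steps on n_i
    cores when fewer than L of those steps are incomplete (Lemma 2 with t = D)."""
    n = dedicated_cores(C, L, D)
    return n * D - (n - 1) * L

# Hypothetical high-utilization tasks (C_i, L_i, D_i).
for C, L, D in [(20, 12, 16), (50, 5, 10), (33, 7, 20)]:
    n = dedicated_cores(C, L, D)
    w = guaranteed_work(C, L, D)
    print(f"C={C}, L={L}, D={D}: n_i={n}, bound={w}, bound >= C_i: {w >= C}")
```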
Low-Utilization Tasks are Schedulable. We first calculate a lower bound on nlow, the total number of cores assigned to low-utilization tasks, when a task set τ that satisfies Conditions (1) and (2) of Definition 1 for b = 2 is scheduled using the federated scheduling strategy.

Lemma 4. The number of cores assigned to low-utilization tasks is at least nlow ≥ 2 Σ_{τi∈τlow} ui.

Proof: Here, for brevity of the proof, we denote σi = Di/Li. It is obvious that Di = σi·Li and hence Ci = Di·ui = σi·ui·Li. Therefore,

\[ n_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil = \left\lceil \frac{\sigma_i u_i L_i - L_i}{\sigma_i L_i - L_i} \right\rceil = \left\lceil \frac{\sigma_i u_i - 1}{\sigma_i - 1} \right\rceil \]

Since each task τi in task set τ satisfies Condition (2) of Definition 1 for b = 2, the critical-path length of each task is at most 1/b of its relative deadline, that is, Li ≤ Di/b, and hence σi ≥ b = 2.

By the definition of a high-utilization task τi, we have 1 ≤ ui. Together with σi ≥ 2, we know that

\[ 0 \le \frac{(u_i - 1)(\sigma_i - 2)}{\sigma_i - 1} \]
From the definition of the ceiling function, we can derive

\[
n_i = \left\lceil \frac{\sigma_i u_i - 1}{\sigma_i - 1} \right\rceil
< \frac{\sigma_i u_i - 1}{\sigma_i - 1} + 1 = \frac{\sigma_i u_i + \sigma_i - 2}{\sigma_i - 1}
\le \frac{\sigma_i u_i + \sigma_i - 2}{\sigma_i - 1} + \frac{(u_i - 1)(\sigma_i - 2)}{\sigma_i - 1}
= \frac{2\sigma_i u_i - 2u_i}{\sigma_i - 1} = \frac{2u_i(\sigma_i - 1)}{\sigma_i - 1} = 2u_i = b\,u_i
\]

In summary, for each high-utilization task, ni < b·ui. So their sum satisfies nhigh = Σ_{τhigh} ni < b Σ_{τhigh} ui. Since the task set also satisfies Condition (1), we have

\[ n_{low} = m - n_{high} > b \sum_{all} u_i - b \sum_{high} u_i = b \sum_{low} u_i \]

Thus, the number of remaining cores allocated to low-utilization tasks is at least nlow > 2 Σ_{τi∈τlow} ui.

Corollary 1. For task sets satisfying Conditions (1) and (2), a multiprocessor scheduler with a utilization bound of at least 50% can schedule all the low-utilization tasks sequentially on the remaining nlow cores.

Proof: Low-utilization tasks are allocated nlow cores, and from Lemma 4 we know that the total utilization of the low-utilization tasks is less than nlow/b = 50% of nlow. Therefore, any multiprocessor scheduling algorithm that provides a utilization bound of 1/2 (i.e., can schedule any task set with total worst-case utilization no more than 50%) can schedule it.

Many multiprocessor scheduling algorithms (such as partitioned EDF or partitioned RM) provide a utilization bound of 1/2 (i.e., 50%) for sequential tasks. That is, given nlow cores, they can schedule any task set with a total worst-case utilization up to nlow/2. Using any of these algorithms for low-utilization tasks will guarantee that the federated algorithm meets all deadlines with a capacity augmentation of 2.

Therefore, since we can successfully schedule both high- and low-utilization tasks that satisfy Conditions (1) and (2), we have proven Theorem 1 (using Lemma 1).

As mentioned before, a capacity augmentation bound acts as a simple schedulability test. However, for federated scheduling, this test can be pessimistic, especially for tasks with high parallelism. Note, however, that the federated scheduling algorithm described in Section III-A can also be directly used as a (polynomial-time) schedulability test: given a task set, after assigning cores to each high-utilization task using this algorithm, if the remaining cores are sufficient for all low-utilization tasks, then the task set is schedulable and we can admit it without deadline misses. This schedulability test admits a strict superset of the task sets admitted by the capacity augmentation bound test, and in practice, it may admit task sets with utilization greater than m/2.

IV. Lower Bound on Capacity Augmentation of Any Scheduler for Parallel Tasks

On a system with m cores, consider a task set τ with a single task, τ1, which starts with sequential execution for 1 − ε time and then forks (m − 1)/ε + 1 subtasks, each with execution time ε. Here, we assume ε is an arbitrarily small positive number. Therefore, the total work of task τ1 is C1 = m and its critical-path length is L1 = 1. The deadline (and also the minimum inter-arrival time) of τ1 is 1.

Theorem 3. The capacity augmentation bound for any scheduler for parallel tasks on m cores is at least 2 − 1/m, when ε →+ 0.

Proof: Consider the system defined above. The finishing time of τ1 when running at speed α is not earlier than

\[ \frac{1 - \varepsilon}{\alpha} + \frac{m - 1 + \varepsilon}{m\alpha} = \frac{2}{\alpha} - \frac{1}{m\alpha} - \frac{\varepsilon(m-1)}{m\alpha} \]

If α < 2 − 1/m and ε →+ 0, then 2/α − 1/(mα) − ε(m−1)/(mα) > 1, and task τ1 misses its deadline. Therefore, we reach the conclusion.

Since Theorem 3 applies to any scheduler for parallel tasks, we can conclude that the lower bound on the capacity augmentation of federated scheduling is at least 2 when m is sufficiently large. Since we have shown that the upper bound on the capacity augmentation of federated scheduling is also 2, we have closed the gap between the lower and upper bounds of federated scheduling for large m. Moreover, for sufficiently large m, federated scheduling has the best capacity augmentation bound among all schedulers for parallel tasks.
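To see the lower bound numerically, the snippet below (ours) evaluates the finishing-time expression from the proof of Theorem 3 for the single-task construction and checks whether the deadline of 1 is missed at a given speed α.

```python
def finishing_time_lower_bound(m, alpha, eps):
    """Earliest possible finishing time of the task in Section IV: a sequential
    phase of length 1 - eps followed by m - 1 + eps units of work that can use
    at most m cores, everything running at speed alpha."""
    return (1 - eps) / alpha + (m - 1 + eps) / (m * alpha)

m, eps = 16, 1e-6
for alpha in [1.0, 1.5, 2 - 1 / m - 0.01, 2 - 1 / m + 0.01]:
    t = finishing_time_lower_bound(m, alpha, eps)
    print(f"alpha={alpha:.4f}: finish >= {t:.4f}, deadline missed: {t > 1}")
```

At any speed below 2 − 1/m the finishing time exceeds the deadline of 1, matching the theorem.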
V. Canonical Form of a DAG Task

In this section, we introduce the concept of a DAG's canonical form. Note that each task can have an arbitrarily complex DAG structure, which may be difficult to analyze and may not even be known before runtime. However, given the known task set parameters (work, critical-path length, utilization, critical-path utilization, etc.), we represent each task using a canonical DAG that allows us to upper bound the demand of the task in any given interval length t. These results play an important role when we analyze the capacity augmentation bounds for G-EDF in Section VI and G-RM in Section VII. Recall that in this paper we analyze tasks with implicit deadlines, so the period equals the deadline (Ti = Di).

Recall that we classify each task τi as a low-utilization task if ui = Ci/Di < 1 (and hence Ci < Di), or as a high-utilization task if τi's utilization ui ≥ 1.
For analytical purposes, instead of considering the complex DAG structure of individual tasks τi, we consider a canonical form τi* of task τi. The canonical form of a task is represented by a simpler DAG. In particular, each subtask (node) of task τi* has execution time ε, which is positive and arbitrarily small. Note that ε is a hypothetical unit-node execution time. Therefore, it is safe to assume that Di/ε and Ci/ε are both integers. Low- and high-utilization tasks have different canonical forms, described below.
• The canonical form τi* of a low-utilization task τi is simply a chain of Ci/ε nodes, each with execution time ε. Note that task τi* is a sequential task.
• The canonical form τi* of a high-utilization task τi starts with a chain of Di/ε − 1 nodes, each with execution time ε. The total work of this chain is Di − ε. The last node of the chain forks all the remaining nodes. Hence, all the remaining (Ci − Di + ε)/ε nodes have an edge from the last node of this chain. Therefore, all these nodes can execute entirely in parallel.

Figure 1 provides an example of such a transformation for a high-utilization task. It is important to note that the canonical form τi* does not depend on the DAG structure of τi at all. It depends only on the task parameters of τi.

Fig. 1: A high-utilization DAG task τi with Li = 12, Ci = 20, Ti = Di = 16, and ui = 1.25, and its canonical form, where the number in each node is its execution time. (a) original DAG; (b) canonical form (heavy task).

As an additional analysis tool, we define a hypothetical scheduling strategy S∞ that must schedule a task set τ on an infinite number of cores, that is, m = ∞. With an infinite number of cores, the prioritization of the sub-jobs becomes unnecessary and S∞ can obtain an optimal schedule by simply assigning a sub-job to a core as soon as that sub-job becomes ready for execution. Using this schedule, all the tasks finish within their critical-path length; therefore, if Li ≤ Di for all tasks τi in τ, the task set always meets the deadlines. We denote this schedule as S∞. Similarly, S∞,α is the resulting schedule when S∞ schedules tasks on cores of speed α ≥ 1. Note that S∞,α finishes a job of task τi exactly Li/α time units after it is released.

We now define some notation based on S∞,α. Let qi(t, α) be the total work finished by S∞,α between the arrival time ri of task τi and time ri + t. Therefore, in the interval from ri + t to ri + Di (an interval of length Di − t) the remaining Ci − qi(t, α) workload has to be finished. We define the maximum load, denoted by worki(t, α), of task τi as the maximum amount of work (computation) that S∞,α must do on the sub-jobs of τi in any interval of length t. We can derive worki(t, α) as follows:

\[
\mathrm{work}_i(t, \alpha) =
\begin{cases}
C_i - q_i(D_i - t, \alpha) & t \le D_i \\
\left\lfloor \frac{t}{D_i} \right\rfloor C_i + \mathrm{work}_i\!\left(t - \left\lfloor \frac{t}{D_i} \right\rfloor D_i, \alpha\right) & t > D_i
\end{cases} \quad (4)
\]

Clearly, both qi(t, α) and worki(t, α) for a task depend on the structure of the DAG.

We similarly define qi*(t, α) for the canonical form τi*. As the canonical form τi* is well defined, we can derive qi*(t, α) directly. Note that ε can be arbitrarily small, and, hence, its impact is ignored when calculating qi*(t, α). We can now define the canonical maximum load worki*(t, α) as the maximum workload of the canonical task τi* in any interval of length t in schedule S∞,α. For a low-utilization task τi, where Ci/Di < 1 and τi* is a chain, it is easy to see that the canonical workload is

\[
\mathrm{work}_i^*(t, \alpha) =
\begin{cases}
0 & t < D_i - \frac{C_i}{\alpha} \\
\alpha \left( t - \left( D_i - \frac{C_i}{\alpha} \right) \right) & D_i - \frac{C_i}{\alpha} \le t \le D_i \\
\left\lfloor \frac{t}{D_i} \right\rfloor C_i + \mathrm{work}_i^*\!\left(t - \left\lfloor \frac{t}{D_i} \right\rfloor D_i, \alpha\right) & t > D_i
\end{cases} \quad (5)
\]

For a high-utilization task τi, the canonical workload is

\[
\mathrm{work}_i^*(t, \alpha) =
\begin{cases}
0 & t < D_i - \frac{D_i}{\alpha} \\
C_i - D_i + \alpha \left( t - \left( D_i - \frac{D_i}{\alpha} \right) \right) & D_i - \frac{D_i}{\alpha} \le t \le D_i \\
\left\lfloor \frac{t}{D_i} \right\rfloor C_i + \mathrm{work}_i^*\!\left(t - \left\lfloor \frac{t}{D_i} \right\rfloor D_i, \alpha\right) & t > D_i
\end{cases} \quad (6)
\]

Figure 2 shows qi*(t, α), qi(t, α), worki*(t, α), and worki(t, α) of the high-utilization task τi in Figure 1 when Di = 16, α = 1, and α = 2. Note that worki*(t, α) ≥ worki(t, α). In fact, the following lemma proves that worki*(t, α) ≥ worki(t, α) for any t > 0 and α ≥ 1.
number of cores, the prioritization of the sub-jobs becomes worki∗ (t, α) ≥ worki (t, α) for any t > 0 and α ≥ 1.
unnecessary and S can obtain an optimal schedule by Lemma 5. For any t > 0 and α ≥ 1, worki∗ (t, α) ≥
simply assigning a sub-job to a core as soon as that sub- worki (t, α).
job becomes ready for execution. Using this schedule, all
the tasks finish within their critical-path length; therefore, Proof: For low-utilization tasks, the entire work Ci is se-
if Li ≤ Di for all tasks τi in τ , the task set always meets quential. When t < Cαi , qi∗ (t, α) is αt, so qi (t, α) ≥ αt =
the deadlines. We denote this schedule as S∞ . Similarly, qi∗ (t, α). When Cαi ≤ t < Di , qi (t, α) = Ci = qi∗ (t, α).
S∞,α is the resulting schedule when A∞ schedules tasks Similarly, for high-utilization tasks, the first Di units
on cores of speed α ≥ 1. Note that S∞,α finishes a job of of work is sequential, so when t < Dαi , qi∗ (t, α) = αt. In
task τi exactly Li /α time units after it is released. addition, S∞,α finishes τi exactly Lαi time units after it is
We now define some notations based on S∞,α . Let released, while it finishes the τi∗ at Dαi . Since the critical-
qi (t, α) be the total work finished by S∞,α between the path length Li ≤ Di for all τi and τi∗ at unit-speed system,
arrival time ri of task τi and time ri + t. Therefore, in the when t < Lαi , qi (t, α) ≥ αt = qi∗ (t, α). When Lαi ≤ t <
Di ∗ Di ∗
interval from ri + t to ri + Di (interval of length Di − t) α , qi (t, α) = qi ( α , α) > qi (t, α), When Li < t ≤ Di ,
Di
the remaining Ci − qi (t, α) workload has to be finished. Lastly, when α ≤ t < Di , qi (t, α) = Ci = qi∗ (t, α)
We define maximum load, denoted by worki (t, α), for We can conclude that qi∗ (t) ≤ qi (t) for any 0 ≤
6
40 40
38 38
36 36
34 34
32 32
30 30
28 qi∗ (t, 2) 28
26 qi (t, 2) 26
24 24
22 22 worki (t, 1)
20 20
18 18
16 16
14 14 worki (t, 2)
12 12 worki∗ (t, 1)
10 qi∗ (t, 1) 10
8 8
6 6
4 qi (t, 1) 4 worki∗ (t, 2)
2 2
0 t 0 t
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
(a) qi∗ (t, α) and qi (t, α) (b) worki∗ (t, α) and worki (t, α)
Fig. 2: qi∗ (t, α), qi (t, α), worki∗ (t, α) and worki (t, α) for the high-utilization task τi with Di = 20 in Figure 1.
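Using the `canonical_work` sketch above with the parameters of the task in Figure 1 reproduces the canonical curves of Figure 2 at a few sample points (our check, not from the paper):

```python
# High-utilization task from Figure 1: C_i = 20, D_i = 16.
for alpha in (1, 2):
    samples = {t: canonical_work(t, alpha, C=20, D=16) for t in (4, 8, 12, 16, 24)}
    print(f"alpha={alpha}: {samples}")
# For alpha = 1 the canonical load starts near C_i - D_i = 4 just above t = 0 and
# reaches 20 at t = D_i; for alpha = 2 it stays 0 until t = D_i - D_i/alpha = 8
# and then climbs with slope 2.
```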
We classify each task τi as a light or heavy task. A task is a light task if ui = Ci/Di < α. Otherwise, we say that τi is heavy (ui ≥ α). The following lemma provides an upper bound on the density (the ratio of the workload that has to be finished to the interval length) for heavy and light tasks.

Lemma 6. For any task τi, t > 0 and 1 < α, we have

\[
\frac{\mathrm{work}_i(t, \alpha)}{t} \le \frac{\mathrm{work}_i^*(t, \alpha)}{t} \le
\begin{cases}
u_i & (0 \le u_i < \alpha) \\
\dfrac{u_i - 1}{1 - \frac{1}{\alpha}} & (\alpha \le u_i)
\end{cases} \quad (7)
\]

Proof: The first inequality in Inequality (7) comes from Lemma 5. We now show that the second inequality also holds for any task. Note that the right-hand side is positive, since ui > 0. There are two cases:

Case 1: 0 < t ≤ Di. In the range where worki*(t, α) is non-zero,

\[
\frac{\mathrm{work}_i^*(t, \alpha)}{t} = \frac{C_i + \alpha t - \alpha D_i}{t} = \alpha + D_i \left( \frac{u_i - \alpha}{t} \right)
\]

Therefore, worki*(t, α)/t is maximized either (a) when t = Di − Di/α, or (b) when t = Di.

If τi is a light task with 1 ≤ ui < α and hence (b) is true, then we have worki*(Di, α)/Di = ui.

If τi is a heavy task with α ≤ ui and hence (a) is true, then

\[
\frac{\mathrm{work}_i^*\!\left(D_i - \frac{D_i}{\alpha}, \alpha\right)}{D_i - \frac{D_i}{\alpha}} = \frac{C_i - D_i}{D_i - \frac{D_i}{\alpha}} = \frac{u_i - 1}{1 - \frac{1}{\alpha}}
\]

Therefore, Inequality (7) holds for 0 < t ≤ Di.

Case 2: t > Di — Suppose that t is kDi + t′, where k is ⌊t/Di⌋ and 0 < t′ ≤ Di. When ui < α, by Equation (5) and Equation (6), we have

\[
\frac{\mathrm{work}_i^*(t, \alpha)}{t} = \frac{k C_i + \mathrm{work}_i^*(t', \alpha)}{k D_i + t'} \le \frac{k u_i D_i + u_i t'}{k D_i + t'} = u_i
\]

When α ≤ ui, we can derive that ui ≤ (ui − 1)/(1 − 1/α). By Equation (6), we have

\[
\frac{\mathrm{work}_i^*(t, \alpha)}{t} = \frac{k C_i + \mathrm{work}_i^*(t', \alpha)}{k D_i + t'} \le \frac{k u_i D_i + \frac{u_i - 1}{1 - \frac{1}{\alpha}} t'}{k D_i + t'} \le \frac{u_i - 1}{1 - \frac{1}{\alpha}}
\]

so Inequality (7) also holds for t > Di.

Proof: By Lemma 6, for any α > 1, it is clear that

\[
\sum_i \sup_{t>0} \frac{\mathrm{work}_i^*(t, \alpha)}{t}
= \sum_{\tau_L} \sup_{t>0} \frac{\mathrm{work}_i^*(t, \alpha)}{t} + \sum_{\tau_H} \sup_{t>0} \frac{\mathrm{work}_i^*(t, \alpha)}{t}
\le \sum_{\tau_L} u_i + \sum_{\tau_H} \frac{u_i - 1}{1 - \frac{1}{\alpha}}
\]

where sup denotes the supremum of a set of numbers, and τL and τH denote the sets of light and heavy tasks, respectively.
t is maximized either (a) when t = P P P P α
7
where sup is the supremum of a set of numbers, τL and 2.6
8
Figure 3 illustrates the required speedup of G-EDF 2.6
1
it is safe to assume that m is a positive integer. Therefore, the total work of task τ1 is C1 = m − 1 and its critical-path length is L1 = 1. The minimum inter-arrival time of τ1 is 1.

Task τ2 is simply a sequential task with work (execution time) 1 − 1/α and minimum inter-arrival time also 1 − 1/α, where α > 1 will be defined later. Clearly, the total utilization is m, and the critical-path length of each task is at most the relative deadline (minimum inter-arrival time).

Lemma 9. When α < (3 − 2/m − δ + √(5 − 12/m + 4/m²))/2 with δ = 2ε + g(ε) and m ≥ 3, then

\[ \frac{1 - 2\varepsilon}{\alpha} + \frac{m - 2}{m\alpha} > 1 - \frac{1 - \frac{1}{\alpha}}{\alpha} \]

holds.

Proof: By solving (1 − 2ε)/α + (m − 2)/(mα) = 1 − (1 − 1/α)/α, we know that the equality holds when

\[ \alpha = \frac{3 - \frac{2}{m} - 2\varepsilon + \sqrt{\left(3 - \frac{2}{m} - 2\varepsilon\right)^2 - 4}}{2} = \frac{3 - \frac{2}{m} - 2\varepsilon - g(\varepsilon) + \sqrt{5 - \frac{12}{m} + \frac{4}{m^2}}}{2} \]

where g(ε) is a positive function of ε which approaches 0 when ε approaches 0. Now, by setting δ to 2ε + g(ε), we reach the conclusion.

Theorem 5. The capacity augmentation bound for G-EDF is at least

\[ \frac{3 - \frac{2}{m} + \sqrt{5 - \frac{12}{m} + \frac{4}{m^2}}}{2}, \]

when ε → 0.

Proof: Consider the system with two tasks τ1 and τ2 defined at the beginning of Section VI-B. Suppose that the arrival of task τ1 is at time 0, and the arrival of task τ2 is at time 1/α + ε/α. By definition, the first jobs of τ1 and τ2 have absolute deadlines at 1 and 1 + ε/α, respectively. Hence, G-EDF prioritizes the job of τ1 and the job of τ2 misses its deadline. By Lemma 9, we reach the conclusion.

Figure 5 illustrates the upper bound of G-EDF provided in Theorem 4 and the lower bound in Theorem 5 with respect to the capacity augmentation bound. It can easily be seen that the upper and lower bounds get closer as m becomes larger. When m is 100, the gap between the upper and the lower bounds is roughly 0.00452.

It is important to note that the more precise speedup in Corollary 2 is tight even for small m. This is because in the above example task set, UP = m, the total high-task utilization is UH = m − 1, and the number of heavy tasks is ‖τH‖ = 1; then, according to the Corollary, the capacity augmentation bound for this task set under G-EDF is

\[
\frac{2 + \frac{U_P - \|\tau_H\| - 1}{m} + \sqrt{\frac{4(U_H - \|\tau_H\|)}{m} + \frac{(U_P - \|\tau_H\| - 1)^2}{m^2}}}{2}
= \frac{2 + \frac{m-2}{m} + \sqrt{\frac{4(m-2)}{m} + \frac{(m-2)^2}{m^2}}}{2}
= \frac{3 - \frac{2}{m} + \sqrt{5 - \frac{12}{m} + \frac{4}{m^2}}}{2}
\]

which is exactly the lower bound in Theorem 5 for ε →+ 0.

VII. G-RM Scheduling

This section proves that G-RM provides a capacity augmentation bound of 2 + √3 for large m. The structure of the proof is very similar to the analysis in Section VI. Again, we use a lemma from [15], restated below.

Lemma 10. If ∀t > 0, 0.5(α·m − m + 1)·t ≥ Σi worki(t, α), the task set is schedulable by G-RM on speed-α cores.
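Lemma 10 is stated for all t, but combined with the per-task density bound of Lemma 6 it yields a simple sufficient check: if the summed density bounds do not exceed 0.5(αm − m + 1), then the condition of Lemma 10 holds for every t. The sketch below is our illustration of that combination under these assumptions, not a procedure spelled out verbatim in the paper.

```python
def g_rm_sufficient(utilizations, m, alpha):
    """Sufficient G-RM schedulability check on speed-alpha cores:
    the sum of the Lemma 6 density bounds is at most 0.5 * (alpha*m - m + 1),
    which implies the per-interval condition of Lemma 10."""
    total = sum(u if u < alpha else (u - 1) / (1 - 1 / alpha)
                for u in utilizations)
    return total <= 0.5 * (alpha * m - m + 1)

# Hypothetical task set on m = 8 cores, checked at speed alpha = 2 + sqrt(3).
print(g_rm_sufficient([0.5, 0.8, 1.5, 2.0, 3.0], m=8, alpha=2 + 3 ** 0.5))
```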
… heavy tasks ‖τH‖, then this task set will be schedulable … we have α ≥ …, and prove Theorem 6.

The result in Theorem 6 is the best known result for the capacity augmentation bound for global fixed-priority scheduling of general DAG tasks with arbitrary structures. Interestingly, Kim et al. [31] obtain the same bound of 2 + √3 for global fixed-priority scheduling of parallel synchronous tasks (a subset of DAG tasks).

The strategy used in [31] is quite different. In their algorithm, the tasks undergo a stretch transformation which generates a set of sequential subtasks (each with its own release time and deadline) for each parallel task in the original task set. These subtasks are then scheduled using a G-DM scheduling algorithm [11]. Note that even though the parallel tasks in the original task set have implicit deadlines, the transformed sequential tasks have only constrained deadlines — hence the need for deadline-monotonic scheduling instead of rate-monotonic scheduling.

Corollary 3. If a task set has total utilization UP, total high-task utilization UH and the number of …

… cores exclusively) and can use either fixed or dynamic priority for low-utilization tasks. Thus, it is relatively easier to implement G-RM and federated scheduling.

For sequential tasks, in general, global scheduling may incur more overhead due to thread migration and the associated cache penalty, the extent of which depends on the cache architecture and the task sets. In particular, for parallel tasks, the overheads of global scheduling could be worse. For sequential tasks, preemptions and migrations only occur when a new job with higher priority is released. In contrast, for parallel tasks, a preemption and possibly a migration could occur whenever a node in the DAG of a job with higher priority is enabled. Since nodes in a DAG often represent fine-grained units of computation, the number of nodes in the task set can be larger than the number of tasks. Hence, we can expect a larger number of such events. Since federated scheduling is a generalization of partitioned scheduling to parallel tasks, it has advantages similar to partitioning. In fact, if we use a partitioned RM or partitioned EDF strategy for low-utilization tasks,
there are only preemptions but no migrations for low-utilization tasks. Meanwhile, federated scheduling only allocates the minimum number of dedicated cores needed to ensure the schedulability of each high-utilization task, so there are no preemptions for high-utilization tasks and the number of migrations is minimized. Hence, we expect that federated scheduling will have less overhead than global schedulers.

In addition, parallel runtime systems have additional parallel overheads, such as synchronization and scheduling overheads. These overheads (per task) are usually approximately linear in the number of cores allocated to each task. Under federated scheduling, a minimum number of cores is assigned. However, depending on the particular implementation, global scheduling may execute a task on all the cores in the system and may have higher overheads.

Finally, note that federated scheduling is not a greedy (work-conserving) strategy for the entire task set, although it uses a greedy schedule for each individual task. In many real systems, the worst-case execution times are normally over-estimated. Under federated scheduling, cores allocated to tasks with overestimated execution times may idle due to resource over-provisioning. In contrast, work-conserving strategies (such as G-EDF and G-RM) can utilize available cores dynamically through thread migration.

IX. Related Work

In this section, we review closely related work on real-time scheduling, concentrating primarily on parallel tasks.

Real-time multiprocessor scheduling considers scheduling sequential tasks on computers with multiple processors or cores and has been studied extensively (see [10, 23] for a survey). In addition, platforms such as LITMUSRT [17, 19] have been designed to support these task sets. Here, we review a few relevant theoretical results. Researchers have proven resource augmentation bounds, utilization bounds and capacity augmentation bounds. The best known resource augmentation bound for G-EDF for sequential tasks on a multiprocessor is 2 [7]; a capacity augmentation bound of 2 − 1/m + ε holds for small ε [14]. Partitioned EDF and versions of partitioned static-priority schedulers also provide a utilization bound of 1/2 [3, 37]. G-RM provides a capacity augmentation bound of 3 [2] for implicit-deadline tasks.

For parallel real-time tasks, most early work considered intra-task parallelism in limited task models such as malleable tasks [22, 30, 33] and moldable tasks [39]. Kato et al. [30] studied Gang EDF scheduling of moldable parallel task systems.

Researchers have since considered more realistic task models that represent programs generated by commonly used parallel programming languages such as the Cilk family [13, 21], OpenMP [41], and Intel's Threading Building Blocks [43]. These languages and libraries support primitives such as parallel for-loops and fork/join or spawn/sync in order to expose parallelism within the programs. Using these constructs generates tasks whose structure can be represented with different types of DAGs.

Parallel synchronous tasks have been studied more than other models in the real-time community. These tasks are generated if we use only parallel-for loops to generate parallelism. Lakshmanan et al. [32] proved a (capacity) augmentation bound of 3.42 for a restricted synchronous task model, which is generated when we restrict each parallel-for loop in a task to have the same number of iterations. General synchronous tasks (with no restriction on the number of iterations in the parallel-for loops) have also been studied [4, 31, 40, 45]. (More details on these results were presented in Section I.) Chwa et al. [20] provide a response-time analysis.

If we do not restrict the parallelism primitives to parallel-for loops, we get a more general task model — most easily represented by a general directed acyclic graph. A resource augmentation bound of 2 − 1/m for G-EDF was proved for a single DAG with arbitrary deadlines [8] and for multiple DAGs [15, 35]. A capacity augmentation bound of 4 − 2/m was proved in [35] for tasks with implicit deadlines. Liu et al. [36] provide a response-time analysis for G-EDF.

There has been significant work on scheduling non-real-time parallel systems [5, 6, 24–26, 42]. In this context, the goal is generally to maximize throughput. Various provably good scheduling strategies, such as list scheduling [18, 27] and work stealing [12], have been designed. In addition, many parallel languages and runtime systems have been built based on these results. While multiple tasks on a single platform have been considered in the context of fairness in resource allocation [1], none of this work considers real-time constraints.

X. Conclusions

In this paper, we consider parallel tasks in the DAG model and prove that, for parallel tasks with implicit deadlines, the capacity augmentation bounds of federated scheduling, G-EDF and G-RM are 2, 2.618 and 3.732, respectively. In addition, the bound of 2 for federated scheduling and the bound of 2.618 for G-EDF are both tight for large m, since there exist matching lower bounds. Moreover, the three bounds are the best known bounds for these schedulers for DAG tasks.

There are several directions for future work. The G-RM capacity augmentation bound is not known to be tight. The current lower bound for G-RM is 2.668, inherited from sequential sporadic real-time tasks without DAG structures [38]. Therefore, it is worth investigating a matching lower bound or lowering the upper bound. In addition, since the lower bound for any scheduler is 2 − 1/m, it would be interesting to investigate whether it is possible to design schedulers that reach this bound. Finally, all the known
capacity augmentation bound results are restricted to implicit-deadline tasks; we would like to generalize them to constrained and arbitrary deadline tasks.

Acknowledgment

This research was supported in part by the priority program "Dependable Embedded Systems" (SPP 1500 - spp1500.itec.kit.edu) by DFG, as part of the Collaborative Research Center SFB876 (http://sfb876.tu-dortmund.de/), and by NSF grants CCF-1136073 (CPS) and CCF-1337218 (XPS). The authors thank the anonymous reviewers for their suggestions on improving this paper.

References

[1] K. Agrawal, C. E. Leiserson, Y. He, and W. J. Hsu. "Adaptive work-stealing with parallelism feedback". In: ACM Trans. Comput. Syst. 26 (2008), pp. 112–120.
[2] B. Andersson, S. Baruah, and J. Jonsson. "Static-priority scheduling on multiprocessors". In: RTSS. 2001.
[3] B. Andersson and J. Jonsson. "The utilization bounds of partitioned and pfair static-priority scheduling on multiprocessors are 50%". In: ECRTS. 2003.
[4] B. Andersson and D. de Niz. "Analyzing Global-EDF for Multiprocessor Scheduling of Parallel Tasks". In: Principles of Distributed Systems. 2012, pp. 16–30.
[5] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. "Thread Scheduling for Multiprogrammed Multiprocessors". In: SPAA. 1998.
[6] N. Bansal, K. Dhamdhere, J. Konemann, and A. Sinha. "Non-clairvoyant Scheduling for Minimizing Mean Slowdown". In: Algorithmica 40.4 (2004), pp. 305–318.
[7] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, and S. Stiller. "Improved Multiprocessor Global Schedulability Analysis". In: Real-Time Systems 46.1 (2010), pp. 3–24.
[8] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, and A. Wiese. "A generalized parallel task model for recurrent real-time processes". In: RTSS. 2012.
[9] S. K. Baruah, A. K. Mok, and L. E. Rosier. "Preemptively Scheduling Hard-Real-Time Sporadic Tasks on One Processor". In: RTSS. 1990.
[10] M. Bertogna and S. Baruah. "Tests for global EDF schedulability analysis". In: Journal of Systems Architecture 57.5 (2011).
[11] M. Bertogna, M. Cirinei, and G. Lipari. "New Schedulability Tests for Real-time Task Sets Scheduled by Deadline Monotonic on Multiprocessors". In: Proceedings of the 9th International Conference on Principles of Distributed Systems. 2006.
[12] R. D. Blumofe and C. E. Leiserson. "Scheduling multithreaded computations by work stealing". In: Journal of the ACM 46.5 (1999), pp. 720–748.
[13] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. "Cilk: An Efficient Multithreaded Runtime System". In: PPoPP. 1995, pp. 207–216.
[14] V. Bonifaci, A. Marchetti-Spaccamela, and S. Stiller. "A constant-approximate feasibility test for multiprocessor real-time scheduling". In: Algorithmica 62.3-4 (2012), pp. 1034–1049.
[15] V. Bonifaci, A. Marchetti-Spaccamela, S. Stiller, and A. Wiese. "Feasibility Analysis in the Sporadic DAG Task Model". In: ECRTS. 2013.
[16] B. B. Brandenburg and J. H. Anderson. "On the Implementation of Global Real-Time Schedulers". In: RTSS. 2009.
[17] B. B. Brandenburg, A. D. Block, J. M. Calandrino, U. Devi, H. Leontyev, and J. H. Anderson. LITMUSRT: A Status Report. 2007.
[18] R. P. Brent. "The Parallel Evaluation of General Arithmetic Expressions". In: Journal of the ACM (1974), pp. 201–206.
[19] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. H. Anderson. "LITMUSRT: A Testbed for Empirically Comparing Real-Time Multiprocessor Schedulers". In: RTSS. 2006.
[20] H. S. Chwa, J. Lee, K.-M. Phan, A. Easwaran, and I. Shin. "Global EDF Schedulability Analysis for Synchronous Parallel Tasks on Multicore Platforms". In: ECRTS. 2013.
[21] CilkPlus. http://software.intel.com/en-us/articles/intel-cilk-plus.
[22] S. Collette, L. Cucu, and J. Goossens. "Integrating job parallelism in real-time scheduling theory". In: Information Processing Letters 106.5 (2008), pp. 180–187.
[23] R. I. Davis and A. Burns. "A survey of hard real-time scheduling for multiprocessor systems". In: ACM Computing Surveys 43 (2011), 35:1–44.
[24] X. Deng, N. Gu, T. Brecht, and K. Lu. "Preemptive Scheduling of Parallel Jobs on Multiprocessors". In: SODA. 1996.
[25] M. Drozdowski. "Real-time scheduling of linear speedup parallel tasks". In: Information Processing Letters 57 (1996), pp. 35–40.
[26] J. Edmonds, D. D. Chinn, T. Brecht, and X. Deng. "Non-clairvoyant Multiprocessor Scheduling of Jobs with Changing Execution Characteristics". In: Journal of Scheduling 6.3 (2003), pp. 231–250.
[27] R. L. Graham. "Bounds on Multiprocessing Anomalies". In: SIAM Journal on Applied Mathematics 17.2 (1969), pp. 416–429.
[28] H.-M. Huang, T. Tidwell, C. Gill, C. Lu, X. Gao, and S. Dyke. "Cyber-physical systems for real-time hybrid structural testing: a case study". In: International Conference on Cyber-Physical Systems. 2010.
[29] A. K.-L. Mok. Fundamental design problems of distributed systems for the hard-real-time environment. Tech. rep. 1983.
[30] S. Kato and Y. Ishikawa. "Gang EDF Scheduling of Parallel Task Systems". In: RTSS. 2009.
[31] J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar. "Parallel scheduling for cyber-physical systems: analysis and case study on a self-driving car". In: ICCPS. 2013.
[32] K. Lakshmanan, S. Kato, and R. R. Rajkumar. "Scheduling Parallel Real-Time Tasks on Multi-core Processors". In: RTSS. 2010.
[33] W. Y. Lee and H. Lee. "Optimal Scheduling for Real-Time Parallel Tasks". In: IEICE Transactions on Information and Systems E89-D.6 (2006), pp. 1962–1966.
[34] J. Lelli, G. Lipari, D. Faggioli, and T. Cucinotta. "An efficient and scalable implementation of global EDF in Linux". In: OSPERT. 2011.
[35] J. Li, K. Agrawal, C. Lu, and C. Gill. "Analysis of Global EDF for Parallel Tasks". In: ECRTS. 2013.
[36] C. Liu and J. Anderson. "Supporting Soft Real-Time Parallel Applications on Multicore Processors". In: RTCSA. 2012.
[37] J. M. López, J. L. Díaz, and D. F. García. "Utilization Bounds for EDF Scheduling on Real-Time Multiprocessor Systems". In: Real-Time Systems 28.1 (2004), pp. 39–68.
[38] L. Lundberg. "Analyzing Fixed-Priority Global Multiprocessor Scheduling". In: IEEE Real-Time Technology and Applications Symposium. 2002, pp. 145–153.
[39] G. Manimaran, C. S. R. Murthy, and K. Ramamritham. "A New Approach for Scheduling of Parallelizable Tasks in Real-Time Multiprocessor Systems". In: Real-Time Systems 15 (1998), pp. 39–60.
[40] G. Nelissen, V. Berten, J. Goossens, and D. Milojevic. "Techniques optimizing the number of processors to schedule multi-threaded tasks". In: ECRTS. 2012.
[41] OpenMP Application Program Interface v3.1. http://www.openmp.org/mp-documents/OpenMP3.1.pdf. 2011.
[42] C. D. Polychronopoulos and D. J. Kuck. "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers". In: IEEE Transactions on Computers C-36.12 (1987).
[43] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2010.
[44] A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu, and C. Gill. "Parallel real-time scheduling of DAGs". In: IEEE Transactions on Parallel and Distributed Systems (2014).
[45] A. Saifullah, J. Li, K. Agrawal, C. Lu, and C. Gill. "Multi-core real-time scheduling for generalized parallel task models". In: Real-Time Systems 49.4 (2013), pp. 404–435.
[46] R. Sedgewick and K. D. Wayne. Algorithms. 4th ed. 2011.