Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System
Abstract—Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based
approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of parallel and
heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a
very general class of task graph parallelism with in-graph control flow to enable end-to-end parallel optimization. To support our model
with high performance, we design an efficient system runtime that solves many of the new scheduling challenges arising out of our
models and optimizes the performance across latency, energy efficiency, and throughput. We have demonstrated the promising
performance of Taskflow in real-world applications. As an example, Taskflow solves a large-scale machine learning workload up to 29%
faster, with 1.5× less memory and 1.9× higher throughput, than the industrial system oneTBB, on a machine of 40 CPUs and 4 GPUs. We
have open-sourced Taskflow and deployed it to a large number of users in the open-source community.
Index Terms—Parallel programming, task parallelism, high-performance computing, modern C++ programming
algorithm prevents the graph execution from underutilizing threads, which is harmful to performance, while avoiding excessive waste of thread resources when available tasks are scarce. The result largely improves the overall system performance, including latency, energy usage, and throughput. We have derived theory results to justify the efficiency of our work-stealing algorithm.

We have evaluated Taskflow on real-world applications to demonstrate its promising performance. As an example, Taskflow solved a large-scale machine learning problem up to 29% faster, with 1.5× less memory and 1.9× higher throughput, than the industrial system oneTBB [2], on a machine of 40 CPUs and 4 GPUs. We believe Taskflow stands out as a unique system given the ensemble of software tradeoffs and architecture decisions we have made. Taskflow is open-source at GitHub under the MIT license and is being used by many academic and industrial projects [10].
2 MOTIVATIONS

Taskflow is motivated by our DARPA project to reduce the long design times of modern circuits [1]. The main research objective is to advance computer-aided design (CAD) tools with heterogeneous parallelism to achieve transformational performance and productivity milestones. Unlike traditional loop-parallel scientific computing problems, many CAD algorithms exhibit irregular computational patterns and complex control flow that require strategic task graph decompositions to benefit from heterogeneous parallelism [28]. This type of complex parallel algorithm is difficult to implement and execute efficiently using mainstream TGCSs. We highlight three reasons below: end-to-end tasking, GPU task graph parallelism, and heterogeneous runtimes.
End-to-End Tasking. Optimization engines implement various graph and combinatorial algorithms that frequently call for iterations, conditionals, and dynamic control flow. Existing TGCSs [2], [7], [12], [17], [18], [20], [24], [33], [39] closely rely on DAG models to define tasks and their dependencies. Users implement control-flow decisions outside the graph description via either statically unrolling the graph across fixed-length iterations or dynamically executing an "if statement" on the fly to decide the next path and so forth. These solutions often incur rather complicated implementations that lack end-to-end parallelism using just one task graph entity. For instance, when describing an iterative algorithm using a DAG model, we need to repetitively wait for the task graph to complete at the end of each iteration. This wait operation is not cheap because it involves synchronization between the application code and the TGCS runtime, which could otherwise be totally avoided by supporting in-graph control-flow tasks. More importantly, developers can benefit by making in-graph control-flow decisions to efficiently overlap tasks both inside and outside control flow, completely decided by a dynamic scheduler.

GPU Task Graph Parallelism. Emerging GPU task graph acceleration, such as CUDA Graph [4], can offer dramatic yet largely untapped performance advantages by running a GPU task graph directly on a GPU. This type of GPU task graph parallelism is particularly beneficial for many large-scale analysis and machine learning algorithms that compose thousands of dependent GPU operations to run on the same task graph using iterative methods. By creating an executable image for a GPU task graph, we can iteratively launch it with extremely low kernel overheads. However, existing TGCSs are short of a generic model to express and offload task graph parallelism directly on a GPU, as opposed to a simple encapsulation of GPU operations into CPU tasks.
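To make the launch-overhead argument concrete, the following minimal CUDA sketch (illustrative only, not Taskflow code; the captured operations and num_iterations are placeholders) builds one executable graph image and then relaunches it across iterations with a single inexpensive call each time:

#include <cuda_runtime.h>

void run_iterative(int num_iterations) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Capture the dependent GPU operations of one iteration into a graph.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  // ... enqueue the thousands of dependent kernels/copies here ...
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once to create an executable image of the task graph.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

  // Iterative methods relaunch the whole graph at near-zero launch
  // overhead, instead of re-issuing every operation each round.
  for (int i = 0; i < num_iterations; ++i) {
    cudaGraphLaunch(exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}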
Heterogeneous Runtimes. Many CAD algorithms compute extremely large circuit graphs. Different quantities are often dependent on each other, via either logical relations or physical net orders, and are expensive to compute. The resulting task graph in terms of encapsulated function calls and task dependencies is usually very large. For example, the task graph representing a timing analysis on a million-gate design can add up to billions of tasks that take several hours to finish [32]. During the execution, tasks can run on CPUs or GPUs, or, more frequently, a mix of both. Scheduling these heterogeneously dependent tasks is a big challenge. Existing runtimes are good at either CPU- or GPU-focused work but rarely both simultaneously.

Therefore, we argue that there is a critical need for a new heterogeneous task graph programming environment that supports in-graph control flow. The environment must handle new scheduling challenges, such as conditional dependencies and cyclic executions. To this end, Taskflow aims to (1) introduce a new programming model that enables end-to-end expressions of CPU-GPU dependent tasks along with algorithmic control flow and (2) establish an efficient system runtime to support our model with high performance across latency, energy efficiency, and throughput. Taskflow focuses on a single heterogeneous node of CPUs and GPUs.

3 PRELIMINARY RESULTS

Taskflow is established atop our prior system, Cpp-Taskflow [32], which targets CPU-only parallelism using a DAG model, and extends its capability to heterogeneous computing using a new heterogeneous task dependency graph (HTDG) programming model beyond DAG. Since we opened the source of Cpp-Taskflow/Taskflow, it has been successfully adopted by much software, including important CAD projects [14], [30], [43], [54] under the DARPA ERI IDEA/POSH program [1]. Because of this success, we were recently invited to publish a 5-page TCAD brief to overview how Taskflow addresses the parallelization challenges of CAD workloads [31]. For the rest of the paper, we will provide comprehensive details of the Taskflow system from the top-level programming model to the system runtime, including several new technical materials for control-flow primitives, capturer-based GPU task graph parallelism, work-stealing algorithms and theory results, and experiments.

4 TASKFLOW PROGRAMMING MODEL

This section discusses five fundamental task types of Taskflow: static task, dynamic task, module task, condition task, and cudaFlow task.
4.1 Static Tasking
Static tasking is the most basic task type in Taskflow. A static task takes a callable of no arguments and runs it. The callable can be a generic C++ lambda function object, binding expression, or a functor. Listing 1 demonstrates a simple Taskflow program of four static tasks, where A runs before B and C, and D runs after B and C. The graph is run by an executor, which schedules dependent tasks across worker threads. Overall, the code explains itself.

Listing 1. A task graph of four static tasks.
tf::Taskflow taskflow;
tf::Executor executor;
auto [A, B, C, D] = taskflow.emplace(
  [] () { std::cout << "Task A"; },
  [] () { std::cout << "Task B"; },
  [] () { std::cout << "Task C"; },
  [] () { std::cout << "Task D"; }
);
A.precede(B, C); // A runs before B and C
D.succeed(B, C); // D runs after B and C
executor.run(taskflow).wait();

4.2 Dynamic Tasking
Fig. 1. A task graph that spawns another task graph (B1, B2, and B3) during the execution of task B.

Listing 2 shows the Taskflow code in Fig. 1. A dynamic task accepts a reference of type tf::Subflow that is created by the executor during the execution of task B. A subflow inherits all graph building blocks of static tasking. By default, a spawned subflow joins its parent task (B3 precedes its parent B implicitly), forcing a subflow to follow the subsequent dependency constraints of its parent task. Depending on applications, users can detach a subflow from its parent task using the method detach, allowing its execution to flow independently. A detached subflow will eventually join its parent taskflow.
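A minimal sketch of the graph in Fig. 1 — in the spirit of Listing 2, with empty lambdas standing in for the real callables — looks like this:

tf::Taskflow taskflow;
tf::Executor executor;

// Static tasks A, C, and D (callables elided).
auto [A, C, D] = taskflow.emplace([](){}, [](){}, [](){});

// Dynamic task B spawns the subflow B1, B2, B3 when it executes.
auto B = taskflow.emplace([](tf::Subflow& sf) {
  auto [B1, B2, B3] = sf.emplace([](){}, [](){}, [](){});
  B3.succeed(B1, B2);  // B3 runs after B1 and B2
  // sf.detach();      // uncomment to detach the subflow from task B
});

A.precede(B, C);  // A runs before B and C
D.succeed(B, C);  // D runs after B and C
executor.run(taskflow).wait();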
4.3 Composable Tasking
Fig. 2. An example of taskflow composition.

Composable tasking enables developers to define task hierarchies and compose large task graphs from modular and reusable blocks that are easier to optimize. Fig. 2 gives an example of a Taskflow graph using composition. The top-level taskflow defines one static task C that runs before a dynamic task D that spawns two dependent tasks D1 and D2. Task D precedes a module task E that composes a taskflow of two dependent tasks A and B.
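The composition in Fig. 2 maps to Taskflow's composed_of interface; a minimal sketch (callables elided) may look as follows:

tf::Taskflow f1, f2;

// f1 contains two dependent tasks A and B.
auto [A, B] = f1.emplace([](){}, [](){});
A.precede(B);

// f2 defines C -> D -> E, where E is a module task composing f1.
auto C = f2.emplace([](){});
auto D = f2.emplace([](tf::Subflow& sf) {  // dynamic task spawning D1, D2
  auto [D1, D2] = sf.emplace([](){}, [](){});
  D1.precede(D2);
});
auto E = f2.composed_of(f1);  // module task reusing the graph of f1
C.precede(D);
D.precede(E);

A composed taskflow can itself be composed, so hierarchies of reusable modules nest naturally.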
Fig. 7. A saxpy ("single-precision A·X plus Y") task graph using two CPU tasks and one cudaFlow task.
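The saxpy graph in Fig. 7 can be sketched with the cudaFlow interface; the array size and launch shape below are illustrative assumptions rather than values from our experiments:

__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

tf::Taskflow taskflow;
tf::Executor executor;
const unsigned N = 1 << 20;                 // illustrative problem size
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};

// Two CPU (static) tasks allocate device memory.
auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, N*sizeof(float)); });
auto allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, N*sizeof(float)); });

// One cudaFlow task builds the H2D -> kernel -> D2H GPU task graph.
auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  auto h2d_x  = cf.copy(dx, hx.data(), N);
  auto h2d_y  = cf.copy(dy, hy.data(), N);
  auto kernel = cf.kernel((N + 255) / 256, 256, 0,  // grid, block, shared mem
                          saxpy, N, 2.0f, dx, dy);
  auto d2h_x  = cf.copy(hx.data(), dx, N);
  auto d2h_y  = cf.copy(hy.data(), dy, N);
  kernel.succeed(h2d_x, h2d_y).precede(d2h_x, d2h_y);
});

cudaflow.succeed(allocate_x, allocate_y);
executor.run(taskflow).wait();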
Fig. 8. A cyclic task graph using three cudaFlow tasks and one condition task to model an iterative k-means algorithm.
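The cycle in Fig. 8 relies on a condition task, which returns an integer selecting the next successor to run. A simplified CPU-only sketch of the looping pattern (the k-means body is reduced to a counter) looks like this:

tf::Taskflow taskflow;
tf::Executor executor;
int i = 0;

auto init = taskflow.emplace([&](){ i = 0; });
auto body = taskflow.emplace([&](){ ++i; });  // stand-in for one iteration
auto cond = taskflow.emplace([&]() -> int {
  return i < 5 ? 0 : 1;  // 0: loop back to body, 1: proceed to done
});
auto done = taskflow.emplace([](){});

init.precede(body);
body.precede(cond);
cond.precede(body, done);  // successor 0 is body, successor 1 is done
executor.run(taskflow).wait();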
Algorithm 2. worker_loop(w)
Input: w: a worker
Per-worker global: t: a task (initialized to NIL)
1: while true do
2:   exploit_task(w, t);
3:   if wait_for_task(w, t) == false then
4:     break;
5:   end
6: end

Fig. 12. Architecture of our work-stealing scheduler on two domains, CPU and GPU.
Keeping workers busy awaiting tasks with a yielding mechanism is commonly used in work-stealing frameworks [16], [17], [25]. However, this approach is not cost-efficient, because it can easily over-subscribe resources when tasks become scarce, especially around the decision-making points of control flow. The sleep-based mechanism is another way to suspend workers that frequently fail in steal attempts. A worker is put to sleep by waiting for a condition variable to become true. When the worker sleeps, the OS can grant resources to other workers for running useful jobs. Also, reducing wasteful steals can improve both the interoperability of a concurrent program and the overall system performance, including latency, throughput, and energy efficiency, to a large extent [23]. Nevertheless, deciding when and how to put workers to sleep, wake up workers to run, and balance the number of workers with dynamic task parallelism is notoriously challenging to design correctly and implement efficiently.
Our previous work [42] has introduced an adaptive work-stealing algorithm to address a similar line of challenge, yet in a CPU-only environment, by maintaining a loop invariant between active and idle workers. However, extending this algorithm to a heterogeneous target is not easy, because we need to consider the adaptiveness in different heterogeneous domains and bound the total number of wasteful steals across all domains at any time of the execution. To overcome this challenge, we introduce a new scheduler architecture and an adaptive worker management algorithm, both generalizable to arbitrary heterogeneous domains. We shall prove the proposed work-stealing algorithm can deliver a strong upper bound on the number of wasteful steals at any time during the execution.
5.2.1 Heterogeneous Work-Stealing Architecture
At the architecture level, our scheduler maintains a set of workers for each task domain (e.g., CPU, GPU). A worker can only steal tasks of the same domain from others. Fig. 12 shows the architecture of our work-stealing scheduler on two domains, CPU and GPU. By default, the number of domain workers equals the number of domain devices (e.g., CPU cores, GPUs). We associate each worker with two separate task queues, a CPU task queue (CTQ) and a GPU task queue (GTQ), and declare a pair of CTQ and GTQ shared by external threads to submit tasks.

We leverage two existing concurrent data structures, work-stealing queue and event notifier, to support our scheduling architecture. We implemented the task queue based on the lock-free algorithm proposed by [36]. Only the queue owner can pop/push a task from/into one end of the queue, while multiple threads can steal a task from the other end at the same time. The event notifier is a two-phase commit protocol (2PC) that allows a worker to wait on a binary predicate in a non-blocking fashion [11]. The idea is similar to the 2PC in distributed systems and computer networking. The waiting worker first checks the predicate and calls prepare_wait if it evaluates to false. The waiting worker then checks the predicate again and calls commit_wait to wait, if the outcome remains false, or cancel_wait to cancel the request. Conversely, the notifying worker changes the predicate to true and calls notify_one or notify_all to wake up one or all waiting workers. The event notifier is particularly useful for our scheduler architecture because we can keep notification between workers non-blocking. We develop one event notifier for each domain, based on Dekker's algorithm by [11].
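The following toy C++ sketch illustrates the two-phase-commit idea with an epoch counter; unlike the non-blocking, Dekker-based notifier described above, it blocks on a condition variable for brevity, and the class is our illustration rather than Taskflow's implementation:

#include <atomic>
#include <condition_variable>
#include <mutex>

class EventNotifier {
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<int> epoch_{0};
 public:
  // Phase 1: record the current epoch before re-checking the predicate.
  int prepare_wait() { return epoch_.load(std::memory_order_acquire); }
  // Phase 2: sleep only if no notification arrived since prepare_wait.
  void commit_wait(int ticket) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&]{ return epoch_.load() != ticket; });
  }
  // Back out of a pending wait (the predicate turned true in between).
  void cancel_wait(int /*ticket*/) {}
  // Notifier side: advance the epoch under the lock, then wake a sleeper.
  void notify_one() {
    { std::lock_guard<std::mutex> lk(mu_);
      epoch_.fetch_add(1, std::memory_order_release); }
    cv_.notify_one();
  }
};

A thief would call prepare_wait, re-check its predicate, and then either cancel_wait or commit_wait with the returned ticket, mirroring lines 9:41 of Algorithm 6 below.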
5.2.2 Heterogeneous Work-Stealing Algorithm
Atop this architecture, we devise an efficient algorithm to adapt the number of active workers to dynamically generated tasks such that threads are not underutilized when tasks are abundant nor overly subscribed when tasks are scarce. Our adaptiveness is different from existing frameworks, such as constant wake-ups [2], [23], data locality [21], [49], and watchdogs [23]. Instead, we extend our previous work [42] to keep a per-domain invariant to control the numbers of thieves and, consequently, wasteful steals based on the active worker count: when an active worker exists, we keep at least one worker making steal attempts unless all workers are active.

Unlike the CPU-only scheduling environment in [42], the challenge to keep this invariant in a heterogeneous target comes from the heterogeneously dependent tasks and cross-domain worker notifications, as a CPU task can spawn a GPU task and vice versa. Our scheduler architecture is particularly designed to tackle this challenge by separating decision controls to a per-domain basis. This design allows us to realize the invariant via an adaptive strategy: the last thief to become active will wake up a worker in the same domain to take over its thief role, and so forth. External threads (non-workers) submit tasks through the shared task queues and wake up workers to run tasks.
Our scheduling algorithm is symmetric by domain. Upon spawned, each worker enters the loop in Algorithm 2. Each worker has a per-worker global pointer t to a task that is either stolen from others or popped out from the worker's local task queue after initialization; the notation will be used in the rest of the algorithms. The loop iterates two functions, exploit_task and wait_for_task. Algorithm 3 implements the function exploit_task. We use two scheduler-level arrays of atomic variables, actives and thieves, to record for each domain the number of workers that are actively running tasks and the number of workers that are making steal attempts, respectively.¹ Our algorithm relies on these atomic variables to decide when to put a worker to sleep for reducing resource waste and when to bring back a worker for running new tasks. Lines 2:4 implement our adaptive strategy using two lightweight atomic operations. In our pseudocodes, the two atomic operations, AtomInc and AtomDec, return the results after incrementing and decrementing the values by one, respectively. Notice that the order of these two comparisons matters (i.e., active workers and then thieves), as they are used to synchronize with other workers in the later algorithms. Lines 5:8 drain out the local task queue and execute all the tasks using execute_task in Algorithm 4. Before leaving the function, the worker decrements actives by one (line 9).

Algorithm 3. exploit_task(w, t)
Input: w: a worker (domain dw)
Per-worker global: t: a task
1: if t ≠ NIL then
2:   if AtomInc(actives[dw]) == 1 and thieves[dw] == 0 then
3:     notifier[dw].notify_one();
4:   end
5:   do
6:     execute_task(w, t);
7:     t ← w.task_queue[dw].pop();
8:   while t ≠ NIL;
9:   AtomDec(actives[dw]);
10: end

¹ While our pseudocodes use array notations of atomic variables for the sake of brevity, the actual implementation considers padding to avoid false-sharing effects.
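For concreteness, Algorithm 3 translates to C++ roughly as follows; Worker, Task, execute_task, and the per-worker queues are placeholders rather than Taskflow's actual internals, and EventNotifier is the toy sketch above:

#include <array>
#include <atomic>

constexpr int kNumDomains = 2;                              // CPU and GPU
extern std::array<std::atomic<int>, kNumDomains> actives;   // running workers
extern std::array<std::atomic<int>, kNumDomains> thieves;   // stealing workers
extern std::array<EventNotifier, kNumDomains> notifier;     // one per domain

void exploit_task(Worker& w, Task*& t) {
  if (t != nullptr) {
    // AtomInc first, then the thieves check: this order synchronizes with
    // the 2PC guard of wait_for_task (Algorithm 6) and must not be swapped.
    if (actives[w.domain].fetch_add(1) + 1 == 1 && thieves[w.domain] == 0) {
      notifier[w.domain].notify_one();     // keep at least one thief alive
    }
    do {
      execute_task(w, t);                  // Algorithm 4: run + submit successors
      t = w.task_queue[w.domain].pop();    // drain the local queue
    } while (t != nullptr);
    actives[w.domain].fetch_sub(1);
  }
}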
Algorithm 4. execute_task(w, t)
Input: w: a worker
Per-worker global: t: a task
1: r ← invoke_task_callable(t);
2: if r.has_value() then
3:   submit_task(w, t.successors[r]);
4:   return;
5: end
6: foreach s ∈ t.successors do
7:   if AtomDec(s.strong_dependents) == 0 then
8:     submit_task(w, s);
9:   end
10: end

Algorithm 5. submit_task(w, t)
Input: w: a worker (domain dw)
Per-worker global: t: a task (domain dt)
1: w.task_queue[dt].push(t);
2: if dw ≠ dt then
3:   if actives[dt] == 0 and thieves[dt] == 0 then
4:     notifier[dt].notify_one();
5:   end
6: end

Algorithm 4 implements the function execute_task. We invoke the callable of the task (line 1). If the task returns a value (i.e., a condition task), we directly submit the task of the indexed successor (lines 2:5). Otherwise, we remove the task dependency from all immediate successors and submit new tasks of zero remaining strong dependencies (lines 6:10). The detail of submitting a task is shown in Algorithm 5. The worker inserts the task into the queue of the corresponding domain (line 1). If the task does not belong to the worker's domain (line 2), the worker wakes up one worker from that domain if there are no active workers or thieves (lines 3:5). The function submit_task is internal to the workers of a scheduler. External threads never touch this call.
When a worker completes all tasks in its local queue, it proceeds to wait_for_task (line 3 in Algorithm 2), as shown in Algorithm 6. At first, the worker enters explore_task to make steal attempts (line 2). When the worker steals a task and it is the last thief, it notifies a worker of the same domain to take over its thief role and returns to an active worker (lines 3:8). Otherwise, the worker becomes a sleep candidate. However, we must avoid underutilized parallelism, since new tasks may come at the time we put a worker to sleep. We use 2PC to adapt the number of active workers to the available task parallelism (lines 9:41). The predicate of our 2PC is: at least one task queue, both local and shared, in the worker's domain is nonempty. At line 8, the worker has drained out its local queue and devoted much effort to stealing tasks. Other task queues in the same domain are most likely to be empty. We make this worker a sleep candidate by submitting a wait request (line 9). From now on, all the notifications from other workers will be visible to at least one worker, including this worker. That is, if another worker calls notify at this moment, the 2PC guarantees one worker within the scope of lines 9:41 will be notified (i.e., line 42). Then, we inspect our predicate by examining the shared task queue again (lines 10:20), since external threads might have inserted tasks at the same time we call prepare_wait. If the shared queue is nonempty (line 10), the worker cancels the wait request and makes an immediate steal attempt at the queue (lines 11:12); if the steal succeeds and it is the last thief, the worker goes active and notifies a worker (lines 13:18), or otherwise enters the steal loop again (line 19). If the shared queue is empty (line 20), the worker checks whether the scheduler received a stop signal from the executor due to exception or task cancellation, and notifies all workers to leave (lines 21:28). Now, the worker is almost ready to sleep except if it is the last thief and: (1) an active worker in its domain exists (lines 30:33), or (2) at least one task queue of the same domain from other workers is nonempty (lines 34:39). The two conditions may happen because a task can spawn tasks of different domains and trigger the scheduler to notify the corresponding domain workers. Our 2PC guarantees the two conditions synchronize with lines 2:4 in Algorithm 3 and lines 3:5 in Algorithm 5, and vice versa, preventing the problem of undetected task parallelism. Passing all the above conditions, the worker commits to wait on our predicate (line 41).
Algorithm 6. wait_for_task(w, t)
Input: w: a worker (domain dw)
Per-worker global: t: a task
Output: a boolean signal of stop
1: AtomInc(thieves[dw]);
2: explore_task(w, t);
3: if t ≠ NIL then
4:   if AtomDec(thieves[dw]) == 0 then
5:     notifier[dw].notify_one();
6:   end
7:   return true;
8: end
9: notifier[dw].prepare_wait(w);
10: if task_queue[dw].empty() ≠ true then
11:   notifier[dw].cancel_wait(w);
12:   t ← task_queue[dw].steal();
13:   if t ≠ NIL then
14:     if AtomDec(thieves[dw]) == 0 then
15:       notifier[dw].notify_one();
16:     end
17:     return true;
18:   end
19:   goto Line 2;
20: end
21: if stop == true then
22:   notifier[dw].cancel_wait(w);
23:   foreach domain d ∈ D do
24:     notifier[d].notify_all();
25:   end
26:   AtomDec(thieves[dw]);
27:   return false;
28: end
29: if AtomDec(thieves[dw]) == 0 then
30:   if actives[dw] > 0 then
31:     notifier[dw].cancel_wait(w);
32:     goto Line 1;
33:   end
34:   foreach worker x ∈ W do
35:     if x.task_queue[dw].empty() ≠ true then
36:       notifier[dw].cancel_wait(w);
37:       goto Line 1;
38:     end
39:   end
40: end
41: notifier[dw].commit_wait(w);
42: return true;
41: notifier½dw .commit_wait(w); non-active worker must not be the last thief (contradic-
42: return true; tion) or it will notify another worker through line 3 in
Algorithm 6. The proof holds for other domains as our
scheduler design is symmetric. u
t
Algorithm 7 implements explore_task, which resem-
bles the normal work-stealing loop [16]. At each iteration, Theorem 1. Our work-stealing algorithm can correctly complete
the worker (thief) tries to steal a task from a randomly the execution of an HTDG.
Proof. There are two places where a new task is submitted: line 4 in Algorithm 8 and line 1 in Algorithm 5. In the first place, where a task is pushed to the shared task queue by an external thread, the notification (line 5 in Algorithm 8) is visible to a worker in the same domain of the task in two situations: (1) if a worker has prepared or committed to wait (lines 9:41 in Algorithm 6), it will be notified; (2) otherwise, at least one worker will eventually go through lines 9:20 in Algorithm 6 to steal the task. In the second place, where the task is pushed to the corresponding local task queue of that worker, at least one worker will execute it in either situation: (1) if the task is in the same domain of the worker, the worker itself may execute the task in the subsequent exploit_task, or a thief steals the task through explore_task; (2) if the worker has a different domain from the task (line 2 in Algorithm 5), the correctness can be proved by contradiction. Assume this task is undetected, which means either the worker did not notify a corresponding domain worker to run the task (false at the condition of line 3 in Algorithm 5), or notified one worker (line 4 in Algorithm 5) but none have come back. In the former case, we know at least one worker is active or stealing, which will eventually go through lines 29:40 of Algorithm 6 to steal this task. Similarly, the latter case is not possible under our 2PC, as it contradicts the guarding scan in lines 9:41 of Algorithm 6. □
Theorem 2. Our work-stealing algorithm does not under-subscribe thread resources during the execution of an HTDG.

Proof. Theorem 2 is a byproduct of Lemma 1 and Theorem 1. Theorem 1 proves that our scheduler never has a task leak (i.e., undetected task parallelism). During the execution of an HTDG, whenever the number of tasks is larger than the present number of workers, Lemma 1 guarantees one worker is making steal attempts, unless all workers are active. The 2PC guard (lines 34:39 in Algorithm 6) ensures that worker will successfully steal a task and become an active worker (unless no more tasks remain), which in turn wakes up another worker if that worker is the last thief. As a consequence, the number of workers will catch up with the number of tasks one after another to avoid under-subscribed thread resources. □
Theorem 3. At any moment during the execution of an HTDG, the number of wasteful steals is bounded by O(MAX_STEALS × (|W| + |D| × (E/e_s))), where W is the worker set, D is the domain set, E is the maximum execution time of any task, and e_s is the execution time of Algorithm 7.

Proof. We give a direct proof for Theorem 3 using the following notations: D denotes the domain set, d denotes a domain (e.g., CPU, GPU), W denotes the entire worker set, W_d denotes the worker set in domain d, w_d denotes a worker in domain d (i.e., w_d ∈ W_d), e_s denotes the time to complete one round of steal attempts (i.e., Algorithm 7), e_d denotes the maximum execution time of any task in domain d, and E denotes the maximum execution time of any task in the given HTDG.

At any time point, the worst case happens in the following scenario: for each domain d, only one worker w_d is actively running one task while all the other workers are making unsuccessful steal attempts. Due to Lemma 1 and lines 29:40 in Algorithm 6, only one thief w'_d will eventually remain in the loop, and the other |W_d| − 2 thieves will go to sleep after one round of unsuccessful steal attempts (line 2 in Algorithm 6), which ends up with MAX_STEALS × (|W_d| − 2) wasteful steals. The only thief w'_d keeps failing in steal attempts until the task run by the only active worker w_d finishes, and then both go to sleep. This results in another MAX_STEALS × (e_d/e_s) + MAX_STEALS wasteful steals; the second term comes from the active worker because it needs another round of steal attempts (line 2 in Algorithm 6) before going to sleep. Consequently, the number of wasteful steals across all domains is bounded as follows:

\[
\begin{aligned}
\sum_{d \in D} \mathrm{MAX\_STEALS} \times \big(|W_d| - 2 + (e_d/e_s) + 1\big)
&\le \sum_{d \in D} \mathrm{MAX\_STEALS} \times \big(|W_d| + e_d/e_s\big) \\
&\le \sum_{d \in D} \mathrm{MAX\_STEALS} \times \big(|W_d| + E/e_s\big) \\
&= O\big(\mathrm{MAX\_STEALS} \times (|W| + |D| \times (E/e_s))\big). \qquad (1)
\end{aligned}
\]

We do not derive the bound over the execution of an HTDG but the worst-case number of wasteful steals at any time point, because the presence of control flow can lead to non-deterministic execution time that requires a further assumption of task distribution. □
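As a worked instantiation under the experimental platform of Section 7 (40 CPU workers and 4 GPU workers, so |W| = 44 and |D| = 2) and the MAX_STEALS = 10 × |W| setting discussed with Algorithm 7, bound (1) evaluates to

\[
440 \times \left(44 + 2 \times \frac{E}{e_s}\right)
\]

wasteful steals at any time point; the only workload-dependent term is E/e_s, the length of the longest task measured in steal rounds.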
7 EXPERIMENTAL RESULTS

We evaluate the performance of Taskflow on two fronts: micro-benchmarks and two realistic workloads, VLSI incremental timing analysis and machine learning. We use micro-benchmarks to analyze the tasking performance of Taskflow without much bias of application algorithms. We will show that the performance benefits of Taskflow observed in micro-benchmarks become significant in real workloads. We will study the performance across runtime, energy efficiency, and throughput. All experiments ran on an Ubuntu Linux 5.0.0-21-generic x86 64-bit machine with 40 Intel Xeon CPU cores at 2.00 GHz, 4 GeForce RTX 2080 GPUs, and 256 GB RAM. We compiled all programs using Nvidia CUDA v11 on a host compiler of clang++ v10 with the C++17 standard (-std=c++17) and optimization flag -O2 enabled. We do not observe a significant difference between -O2 and -O3 in our experiments. Each run of N CPU cores and M GPUs corresponds to N CPU and M GPU worker threads. All data is an average of 20 runs.

7.1 Baseline
Given the large number of TGCSs, it is impossible to compare Taskflow with all of them. Each of the existing systems has its pros and cons and dominates certain applications. We consider oneTBB [2], StarPU [17], HPX [33], and OpenMP [7], each representing a particular paradigm that has gained some successful user experiences in CAD due to performance [44].
TABLE 1
Programming Effort on Micro-benchmark

Method    LOC  #Tokens  CC  WCC  Dev  Bug
Taskflow   69      650   6    8   14   1%
oneTBB    182     1854   8   15   25   6%
StarPU    253     2216   8   21   47  19%
HPX       255     2264  10   24   41  33%
OpenMP    182     1896  13   19   57  49%

CC: maximum cyclomatic complexity in a single function.
WCC: weighted cyclomatic complexity of the program.
Dev: minutes to complete the implementation.
Bug: time spent on debugging as opposed to coding task graphs.

TABLE 2
Overhead of Task Graph Creation

Method    Stask   Ttask   Tedge   r<10   r<5    r<1
Taskflow    272   61 ns   14 ns    550   2550  35050
oneTBB      136   99 ns   54 ns   1225   2750  40050
StarPU     1472  259 ns  384 ns   7550      -      -

Stask: static size per task in bytes.
Ttask/Tedge: amortized time to create a task/dependency.
rv: graph size where its creation overhead is below v%.
oneTBB (2021.1 release) is an industrial-strength parallel programming system under Intel oneAPI [2]. We consider its FlowGraph library and encapsulate each GPU task in a CPU function. At the time of this writing, FlowGraph does not have dedicated work stealing for HTDGs. StarPU (version 1.3) is a CPU-GPU task programming system widely used in the scientific computing community [17]. It provides a C-based syntax for writing HTDGs on top of a work-stealing runtime highly optimized for CPUs and GPUs. HPX (version 1.4) is a C++ standard library for concurrency and parallelism [33]. It supports implicit task graph programming through aggregating future objects in a dataflow API. OpenMP (version 4.5 in clang toolchains) is a directive-based programming framework for handling loop parallelism [7]. It supports static graph encoding using task dependency clauses.

To measure the expressiveness and programmability of Taskflow, we hire five PhD-level C++ programmers outside our research group to implement our experiments. We educate them with the essential knowledge about Taskflow and the baseline TGCSs and provide them all algorithm blocks such that they can focus on programming HTDGs. For each implementation, we record the lines of code (LOC), the number of tokens, cyclomatic complexity (measured by [8]), time to finish, and the percentage of time spent on debugging. We average these quantities over the five programmers until they obtain the correct result. This measurement may be subjective, but it highlights the programming productivity and turnaround time of each TGCS from a real user's perspective.
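For a side-by-side feel of the baseline interfaces, the four-task graph of Listing 1 can be written against oneTBB's FlowGraph roughly as follows; this is a sketch against the documented continue_node API, not code from our benchmark:

#include <tbb/flow_graph.h>

int main() {
  using namespace tbb::flow;
  graph g;
  continue_node<continue_msg> A(g, [](const continue_msg&){ /* task A */ });
  continue_node<continue_msg> B(g, [](const continue_msg&){ /* task B */ });
  continue_node<continue_msg> C(g, [](const continue_msg&){ /* task C */ });
  continue_node<continue_msg> D(g, [](const continue_msg&){ /* task D */ });
  make_edge(A, B);  // A runs before B and C
  make_edge(A, C);
  make_edge(B, D);  // D runs after B and C
  make_edge(C, D);
  A.try_put(continue_msg{});
  g.wait_for_all();
}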
7.2 Micro-benchmarks
We randomly generate a set of DAGs (i.e., HTDGs) with an equal distribution of CPU and GPU tasks. Each task performs a SAXPY operation over 1K elements. For fairness, we implemented CUDA Graph [4] for all baselines; each GPU task is a CUDA graph of three GPU operations, H2D copy, kernel, and D2H copy, in this order of dependencies. Table 1 summarizes the programming effort of each method. Taskflow requires the least amount of lines of code (LOC) and written tokens. The cyclomatic complexity of Taskflow, measured both at a single function and across the whole program, is also the smallest. Development of the Taskflow-based implementation is much more productive than the others. For this simple graph, Taskflow and oneTBB are very easy for our programmers to implement, whereas we found they spent a large amount of time on debugging task graph parallelism with StarPU, HPX, and OpenMP.

Next, we study the overhead of task graph parallelism among Taskflow, oneTBB, and StarPU. As shown in Table 2, the static size of a task, compiled on our platform, is 272, 136, and 1472 bytes for Taskflow, oneTBB, and StarPU, respectively. We do not report the data of HPX and OpenMP because they do not support explicit task graph construction at the functional level. The time it takes for Taskflow to create a task and add a dependency is also faster than oneTBB and StarPU. We amortize the time across 1M operations because all systems support pooled memory to recycle tasks. We found StarPU has significant overhead in creating HTDGs. The overhead always occupies 5-10% of the total execution time regardless of the HTDG size.

Fig. 13 shows the overall performance comparison between Taskflow and the baseline at different HTDG sizes. In terms of runtime (top left of Fig. 13), Taskflow outperforms others across most data points. We complete the largest HTDG 1.37×, 1.44×, 1.53×, and 1.40× faster than oneTBB, StarPU, HPX, and OpenMP, respectively. The memory footprint (top right of Fig. 13) of Taskflow is close to oneTBB and OpenMP. HPX has higher memory because it relies on aggregated futures to describe task dependencies at the cost of shared states. Likewise, StarPU does not offer a closure-based interface and thus requires a flat layout (i.e., codelet) to describe tasks. We use the Linux perf tool to collect these measurements.

Fig. 13. Overall system performance at different problem sizes using 40 CPUs and 4 GPUs.
TABLE 3
Programming Effort on VLSI Timing Closure

Fig. 19. Comparison of runtime and memory between cudaFlow (CUDA Graph) and stream-based execution in the VLSI incremental timing analysis workload.

Fig. 20. A partial HTDG of 4 cudaFlows (purple boxes), 8 conditioned cycles (green diamonds), and 6 static tasks (others) for the inference workload.

We corun the same inference program up to nine processes that compete for 40 CPUs and 4 GPUs. We use weighted speedup to measure the throughput. Fig. 22 plots the throughput of corunning inference programs on two different sparse neural networks. Taskflow outperforms oneTBB and StarPU across all coruns. oneTBB is slightly better than StarPU because StarPU tends to keep all workers busy all the time and results in large numbers of wasteful steals. The largest difference is observed at five coruns of inferencing the 1920×4096 neural network, where our throughput is 1.9× higher than oneTBB and 2.1× higher than StarPU. These CPU- and GPU-intensive workloads highlight the effectiveness of our heterogeneous work stealing. By keeping a per-domain invariant, we can control cross-domain wasteful steals to a bounded value at any time during the execution.

Fig. 22. Throughput of corunning inference workloads on two 1920-layered neural networks, one with 4096 neurons per layer and another with 65536 neurons per layer.

We study the performance of our cudaFlow capturer using different numbers of streams (i.e., max_streams). For complex GPU workloads like Fig. 20, stream concurrency is crucial to GPU performance. As shown in Fig. 23, explicit construction of a CUDA graph using cudaFlow achieves the best performance, because the CUDA runtime can dynamically decide the stream concurrency with internal optimization. For applications that must use existing stream-based APIs, our cudaFlow capturer achieves comparable performance to cudaFlow by using two or four streams. Taking the 1920×65536 neural network for example, the difference between our capturer with four streams and cudaFlow is only 10 ms. For this particular workload, we do not observe any performance benefit beyond four streams. Application developers can fine-tune this number.

Fig. 23. Performance of our cudaFlow capturer using 1, 2, 4, and 8 streams to complete the inference of two neural networks.

We finally compare the performance of cudaFlow with stream-based execution. As shown in Fig. 24, the line cudaFlow represents our default implementation using explicit CUDA graph construction, and the other lines represent stream-based implementations for the same task graph using one, two, and four streams. The advantage of CUDA Graph is clearly demonstrated in this large machine learning workload of over 2K dependent GPU operations per cudaFlow. Under four streams, which deliver the best performance for the baseline, cudaFlow is 1.5× (1451 versus 2172) faster at one GPU and 1.9× (750 versus 1423) faster at four GPUs. The cost of this performance improvement is increased memory usage, because CUDA Graph needs to store all the operating parameters in the graph. For instance, under four streams, cudaFlow has 4% and 6% higher memory usage than stream-based execution at one and four GPUs, respectively.

Fig. 24. Comparison of runtime and memory between cudaFlow (CUDA Graph) and stream-based execution.
8 RELATED WORK
8.1 Heterogeneous Programming Systems
Heterogeneous programming systems are the main driving force to advance scientific computing. Directive-based programming models [5], [6], [7], [25], [38] allow users to augment programs with information about loop mapping onto CPUs/GPUs and data sharing rules for designated compilers to generate parallel code automatically. These models are good at loop-based parallelism but cannot handle irregular task graph patterns efficiently [37]. Functional approaches [2], [15], [17], [18], [20], [24], [32], [33], [34], [41] offer either implicit or explicit task graph constructs that are more flexible in runtime control and on-demand tasking. Each of these systems has its pros and cons. However, few of them enable end-to-end expressions of heterogeneously dependent tasks with general control flow.

8.2 Heterogeneous Scheduling Algorithms
Among various heterogeneous runtimes, work stealing is a popular strategy to reduce the complexity of load balancing [16], [41] and has inspired the designs of many parallel runtimes [2], [12], [39], [40], [52]. A key challenge in work-stealing designs is worker management. Instead of keeping all workers busy most of the time [16], [17], [32], both oneTBB [2] and BWS [23] have developed sleep-based strategies. oneTBB employs a mixed strategy of fixed-number worker notification, exponential backoff, and noop assembly. BWS modifies the OS kernel to alter the yield behavior. [42] takes inspiration from BWS and oneTBB to develop an adaptive work-stealing algorithm to minimize the number of wasteful steals. Other approaches have also been proposed, such as [13], which targets a space-sharing environment, [47], which tunes hardware frequency scaling, [22], [48], which balance load on distributed memory, [21], [27], [51], [57], which deal with data locality, and [49], which focuses on memory-bound applications.
[30] T.-W. Huang, G. Guo, C.-X. Lin, and M. Wong, "OpenTimer v2: A new parallel incremental timing analysis engine," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 40, no. 4, pp. 776–789, Apr. 2021.
[31] T.-W. Huang, D.-L. Lin, Y. Lin, and C.-X. Lin, "Taskflow: A general-purpose parallel and heterogeneous task programming system," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., to be published, doi: 10.1109/TCAD.2021.3082507.
[32] T.-W. Huang, Y. Lin, C.-X. Lin, G. Guo, and M. D. F. Wong, "Cpp-Taskflow: A general-purpose parallel task programming system at scale," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 40, no. 8, pp. 1687–1700, Aug. 2021.
[33] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, "HPX: A task based programming model in a global address space," in Proc. 8th Int. Conf. Partitioned Glob. Address Space Program. Models, 2014, pp. 6:1–6:11.
[34] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on C++," in ACM SIGPLAN Notices, 1993, pp. 91–108.
[35] J. Kepner, S. Alford, V. Gadepally, M. Jones, L. Milechin, R. Robinett, and S. Samsi, "Sparse deep neural network graph challenge," in Proc. IEEE High Perform. Extreme Comput. Conf., 2019, pp. 1–7.
[36] N. M. Lê, A. Pop, A. Cohen, and F. Z. Nardelli, "Correct and efficient work-stealing for weak memory models," in Proc. 18th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2013, pp. 69–80.
[37] S. Lee and J. S. Vetter, "Early evaluation of directive-based GPU programming models for productive exascale computing," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2012, pp. 1–11.
[38] S. Lee and R. Eigenmann, "OpenMPC: Extended OpenMP programming and tuning for GPUs," in Proc. ACM/IEEE Int. Conf. High Perform. Comput., Netw., Storage Anal., 2010, pp. 1–11.
[39] D. Leijen, W. Schulte, and S. Burckhardt, "The design of a task parallel library," ACM SIGPLAN Notices, vol. 44, pp. 227–241, 2009.
[40] C. E. Leiserson, "The Cilk++ concurrency platform," in Proc. 46th Annu. Des. Automat. Conf., 2009, pp. 522–527.
[41] J. V. F. Lima, T. Gautier, V. Danjean, B. Raffin, and N. Maillard, "Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures," Parallel Comput., vol. 44, pp. 37–52, 2015.
[42] C.-X. Lin, T.-W. Huang, and M. D. F. Wong, "An efficient work-stealing scheduler for task dependency graph," in Proc. IEEE 26th Int. Conf. Parallel Distrib. Syst., 2020, pp. 64–71.
[43] Y. Lin, W. Li, J. Gu, H. Ren, B. Khailany, and D. Z. Pan, "ABCDPlace: Accelerated batch-based concurrent detailed placement on multi-threaded CPUs and GPUs," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 12, pp. 5083–5096, Dec. 2020.
[44] Y.-S. Lu and K. Pingali, "Can parallel programming revolutionize EDA tools?," in Advanced Logic Synthesis. Berlin, Germany: Springer, 2018, pp. 21–41.
[45] A. Mirhoseini et al., "Device placement optimization with reinforcement learning," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2430–2439.
[46] B. Qiao, M. A. Özkan, J. Teich, and F. Hannig, "The best of both worlds: Combining CUDA graph with an image processing DSL," in Proc. 57th ACM/IEEE Des. Automat. Conf., 2020, pp. 1–6.
[47] H. Ribic and Y. D. Liu, "Energy-efficient work-stealing language runtimes," in Proc. 19th Int. Conf. Architectural Support Program. Lang. Operat. Syst., 2014, pp. 513–528.
[48] V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krishnamoorthy, "Lifeline-based global load balancing," in Proc. 16th ACM Symp. Princ. Pract. Parallel Program., 2011, pp. 201–212.
[49] S. Shiina and K. Taura, "Almost deterministic work stealing," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2019, pp. 1–16.
[50] M. Steinberger, M. Kenzel, P. Boechat, B. Kerbl, M. Dokter, and D. Schmalstieg, "Whippletree: Task-based scheduling of dynamic workloads on the GPU," ACM Trans. Graph., vol. 33, no. 6, pp. 1–11, Nov. 2014.
[51] W. Suksompong, C. E. Leiserson, and T. B. Schardl, "On the efficiency of localized work stealing," Inf. Process. Lett., vol. 116, no. 2, pp. 100–106, Feb. 2016.
[52] O. Tardieu, H. Wang, and H. Lin, "A work-stealing scheduler for X10's task parallelism with suspension," ACM SIGPLAN Notices, vol. 47, no. 8, pp. 267–276, 2012.
[53] D. F. Wong, H. W. Leong, and C. L. Liu, Simulated Annealing for VLSI Design. Norwell, MA, USA: Kluwer Academic, 1988.
[54] B. Xu et al., "MAGICAL: Toward fully automated analog IC layout leveraging human and machine intelligence: Invited paper," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2019, pp. 1–8.
[55] C. Yu, S. Royuela, and E. Quiñones, "OpenMP to CUDA graphs: A compiler-based transformation to enhance the programmability of NVIDIA devices," in Proc. Int. Workshop Softw. Compilers Embedded Syst., 2020, pp. 42–47.
[56] Y. Yu et al., "Dynamic control flow in large-scale machine learning," in Proc. 13th EuroSys Conf., 2018, pp. 1–15.
[57] H. Zhao et al., "Bandwidth and locality aware task-stealing for manycore architectures with bandwidth-asymmetric memory," ACM Trans. Architect. Code Optim., vol. 15, no. 4, pp. 1–26, 2018.

Tsung-Wei Huang received the BS and MS degrees from the Department of Computer Science, National Cheng Kung University, Tainan, Taiwan, in 2010 and 2011, respectively, and the PhD degree from the Department of Electrical and Computer Engineering (ECE), University of Illinois at Urbana-Champaign. He is currently an assistant professor with the Department of ECE, University of Utah. His research interests include building software systems for parallel computing and timing analysis. He was the recipient of the prestigious 2019 ACM SIGDA Outstanding PhD Dissertation Award for his contributions to distributed and parallel VLSI timing analysis in his PhD thesis.

Dian-Lun Lin received the BS degree from the Department of Electrical Engineering, National Cheng Kung University, Taiwan, and the MS degree from the Department of Computer Science, National Taiwan University. He is currently working toward the PhD degree with the Department of Electrical and Computer Engineering, University of Utah. His research interests include parallel and heterogeneous computing with a specific focus on CAD applications.

Chun-Xun Lin received the BS degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, and the MS degree in electronics engineering from the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, in 2009 and 2011, respectively, and the PhD degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, in 2020. His research interests include parallel processing.

Yibo Lin (Member, IEEE) received the BS degree in microelectronics from Shanghai Jiaotong University in 2013, and the PhD degree from the Department of Electrical and Computer Engineering, University of Texas at Austin, in 2018. He is currently an assistant professor with the Department of Computer Science associated with the Center for Energy-Efficient Computing and Applications, Peking University, China. His research interests include physical design, machine learning applications, GPU acceleration, and hardware security.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.