
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 6, JUNE 2022

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin, Member, IEEE

Abstract—Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph parallelism with in-graph control flow to enable end-to-end parallel optimization. To support our model with high performance, we design an efficient system runtime that solves many of the new scheduling challenges arising out of our model and optimizes the performance across latency, energy efficiency, and throughput. We have demonstrated the promising performance of Taskflow in real-world applications. As an example, Taskflow solves a large-scale machine learning workload up to 29% faster, with 1.5x less memory, and with 1.9x higher throughput than the industrial system oneTBB on a machine with 40 CPUs and 4 GPUs. We have open-sourced Taskflow and deployed it to a large number of users in the open-source community.

Index Terms—Parallel programming, task parallelism, high-performance computing, modern C++ programming

Tsung-Wei Huang and Dian-Lun Lin are with the Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT 84112 USA. E-mail: [email protected], [email protected].
Chun-Xun Lin is with MathWorks, Natick, MA 01760 USA. E-mail: [email protected].
Yibo Lin is with the Department of Computer Science, Peking University, Beijing 100871, China. E-mail: [email protected].
Manuscript received 12 Apr. 2021; revised 4 Aug. 2021; accepted 6 Aug. 2021. Date of publication 11 Aug. 2021; date of current version 25 Oct. 2021. The work was supported in part by DARPA under Contract FA 8650-18-2-7843 and in part by NSF under Grant CCF-2126672. (Corresponding author: Tsung-Wei Huang.) Recommended for acceptance by H. E. Bal. Digital Object Identifier no. 10.1109/TPDS.2021.3104255.

1 INTRODUCTION

A task graph computing system (TGCS) plays an essential role in advanced scientific computing. Unlike loop-based models, TGCSs encapsulate function calls and their dependencies in a top-down task graph to implement irregular parallel decomposition strategies that scale to large numbers of processors, including manycore central processing units (CPUs) and graphics processing units (GPUs). As a result, recent years have seen a great deal of TGCS research, to name just a few: oneTBB FlowGraph [2], StarPU [17], TPL [39], Legion [18], Kokkos-DAG [24], PaRSEC [20], HPX [33], and Fastflow [15]. These systems have enabled vast success in a variety of scientific computing applications, such as machine learning, data analytics, and simulation.

However, three key limitations prevent existing TGCSs from exploring the full potential of task graph parallelism. First, existing TGCSs closely rely on directed acyclic graph (DAG) models to define tasks and dependencies. Users implement control-flow decisions outside the graph description, which typically results in rather complicated implementations that lack end-to-end parallelism. For instance, when encountering an if-else block, users need to synchronize the graph execution with a TGCS runtime, a synchronization that could be omitted altogether if in-graph control-flow tasks were supported. Second, existing TGCSs do not align well with modern hardware. In particular, new GPU task graph parallelism, such as CUDA Graph, can bring significant yet largely untapped performance benefits. Third, existing TGCSs are good at either CPU- or GPU-focused workloads, but rarely both simultaneously. Consequently, we introduce in this paper Taskflow, a lightweight TGCS that overcomes these limitations. We summarize the three main contributions of Taskflow as follows:

- Expressive programming model – We design an expressive task graph programming model by leveraging modern C++ closures. Our model enables efficient implementations of parallel and heterogeneous decomposition strategies using the task graph model. The expressiveness of our model lets developers perform a large amount of work with relative ease of programming. Our user experience leads us to believe that, although it requires some effort to learn, a programmer can master the APIs needed for many applications in just a few hours.

- In-graph control flow – We design a new conditional tasking model to support in-graph control flow beyond the capability of the traditional DAG models that prevail in existing TGCSs. Our condition tasks enable developers to integrate control-flow decisions, such as conditional dependencies, cyclic execution, and non-deterministic flows, into a task graph of end-to-end parallelism. When applications exhibit frequent dynamic behavior, such as optimization and branch and bound, programmers can efficiently overlap tasks both inside and outside the control flow to hide expensive control-flow costs.

- Heterogeneous work stealing – We design an efficient work-stealing algorithm to adapt the number of workers to dynamically generated task parallelism at any time during the graph execution. Our algorithm prevents the graph execution from being slowed by underutilized threads, while avoiding excessive waste of thread resources when available tasks are scarce. The result largely improves the overall system performance, including latency, energy usage, and throughput. We have derived theoretical results to justify the efficiency of our work-stealing algorithm.

We have evaluated Taskflow on real-world applications to demonstrate its promising performance. As an example, Taskflow solved a large-scale machine learning problem up to 29% faster, with 1.5x less memory, and with 1.9x higher throughput than the industrial system oneTBB [2] on a machine of 40 CPUs and 4 GPUs. We believe Taskflow stands out as a unique system given the ensemble of software tradeoffs and architecture decisions we have made. Taskflow is open-source at GitHub under the MIT license and is being used by many academic and industrial projects [10].

2 MOTIVATIONS

Taskflow is motivated by our DARPA project to reduce the long design times of modern circuits [1]. The main research objective is to advance computer-aided design (CAD) tools with heterogeneous parallelism to achieve transformational performance and productivity milestones. Unlike traditional loop-parallel scientific computing problems, many CAD algorithms exhibit irregular computational patterns and complex control flow that require strategic task graph decompositions to benefit from heterogeneous parallelism [28]. This type of complex parallel algorithm is difficult to implement and execute efficiently using mainstream TGCSs. We highlight three reasons below: end-to-end tasking, GPU task graph parallelism, and heterogeneous runtimes.

End-to-End Tasking. Optimization engines implement various graph and combinatorial algorithms that frequently call for iterations, conditionals, and dynamic control flow. Existing TGCSs [2], [7], [12], [17], [18], [20], [24], [33], [39] closely rely on DAG models to define tasks and their dependencies. Users implement control-flow decisions outside the graph description, either by statically unrolling the graph across fixed-length iterations or by dynamically executing an "if statement" on the fly to decide the next path, and so forth. These solutions often incur rather complicated implementations that lack end-to-end parallelism using just one task graph entity. For instance, when describing an iterative algorithm using a DAG model, we need to repetitively wait for the task graph to complete at the end of each iteration. This wait operation is not cheap because it involves synchronization between the application code and the TGCS runtime, which could otherwise be totally avoided by supporting in-graph control-flow tasks. More importantly, developers can benefit from making in-graph control-flow decisions to efficiently overlap tasks both inside and outside control flow, completely decided by a dynamic scheduler.

GPU Task Graph Parallelism. Emerging GPU task graph acceleration, such as CUDA Graph [4], can offer dramatic yet largely untapped performance advantages by running a GPU task graph directly on a GPU. This type of GPU task graph parallelism is particularly beneficial for many large-scale analysis and machine learning algorithms that compose thousands of dependent GPU operations to run on the same task graph using iterative methods. By creating an executable image for a GPU task graph, we can iteratively launch it with extremely low kernel overheads. However, existing TGCSs lack a generic model to express and offload task graph parallelism directly on a GPU, as opposed to a simple encapsulation of GPU operations into CPU tasks.

Heterogeneous Runtimes. Many CAD algorithms compute extremely large circuit graphs. Different quantities are often dependent on each other, via either logical relation or physical net order, and are expensive to compute. The resulting task graph in terms of encapsulated function calls and task dependencies is usually very large. For example, the task graph representing a timing analysis on a million-gate design can add up to billions of tasks that take several hours to finish [32]. During the execution, tasks can run on CPUs or GPUs, or more frequently a mix. Scheduling these heterogeneously dependent tasks is a big challenge. Existing runtimes are good at either CPU- or GPU-focused work but rarely both simultaneously.

Therefore, we argue that there is a critical need for a new heterogeneous task graph programming environment that supports in-graph control flow. The environment must handle new scheduling challenges, such as conditional dependencies and cyclic executions. To this end, Taskflow aims to (1) introduce a new programming model that enables end-to-end expressions of CPU-GPU dependent tasks along with algorithmic control flow and (2) establish an efficient system runtime to support our model with high performance across latency, energy efficiency, and throughput. Taskflow focuses on a single heterogeneous node of CPUs and GPUs.

3 PRELIMINARY RESULTS

Taskflow is established atop our prior system, Cpp-Taskflow [32], which targets CPU-only parallelism using a DAG model, and extends its capability to heterogeneous computing using a new heterogeneous task dependency graph (HTDG) programming model beyond DAG. Since we open-sourced Cpp-Taskflow/Taskflow, it has been successfully adopted by many software projects, including important CAD projects [14], [30], [43], [54] under the DARPA ERI IDEA/POSH program [1]. Because of this success, we were recently invited to publish a five-page TCAD brief that overviews how Taskflow addresses the parallelization challenges of CAD workloads [31]. For the rest of the paper, we provide comprehensive details of the Taskflow system from the top-level programming model to the system runtime, including several new technical materials for control-flow primitives, capturer-based GPU task graph parallelism, work-stealing algorithms and theory results, and experiments.

4 TASKFLOW PROGRAMMING MODEL

This section discusses the five fundamental task types of Taskflow: static task, dynamic task, module task, condition task, and cudaFlow task.

4.1 Static Tasking

Static tasking is the most basic task type in Taskflow. A static task takes a callable of no arguments and runs it. The callable can be a generic C++ lambda function object, a binding expression, or a functor. Listing 1 demonstrates a simple Taskflow program of four static tasks, where A runs before B and C, and D runs after B and C. The graph is run by an executor which schedules dependent tasks across worker threads. Overall, the code explains itself.

Listing 1. A task graph of four static tasks.

tf::Taskflow taskflow;
tf::Executor executor;
auto [A, B, C, D] = taskflow.emplace(
  [] () { std::cout << "Task A"; },
  [] () { std::cout << "Task B"; },
  [] () { std::cout << "Task C"; },
  [] () { std::cout << "Task D"; }
);
A.precede(B, C);  // A runs before B and C
D.succeed(B, C);  // D runs after B and C
executor.run(taskflow).wait();

4.2 Dynamic Tasking

Dynamic tasking refers to the creation of a task graph during the execution of a task. Dynamic tasks are spawned from a parent task and are grouped to form a hierarchy called a subflow. Fig. 1 shows an example of dynamic tasking. The graph has four static tasks, A, C, D, and B. The precedence constraints force A to run before B and C, and D to run after B and C. During the execution of task B, it spawns another graph of three tasks, B1, B2, and B3, where B1 and B2 run before B3. In this example, B1, B2, and B3 are grouped into a subflow parented at B.

Fig. 1. A task graph that spawns another task graph (B1, B2, and B3) during the execution of task B.

Listing 2. Taskflow code of Fig. 1.

auto [A, C, D] = taskflow.emplace(
  [] () { std::cout << "A"; },
  [] () { std::cout << "C"; },
  [] () { std::cout << "D"; }
);
auto B = taskflow.emplace([] (tf::Subflow& subflow) {
  std::cout << "B\n";
  auto [B1, B2, B3] = subflow.emplace(
    [] () { std::cout << "B1"; },
    [] () { std::cout << "B2"; },
    [] () { std::cout << "B3"; }
  );
  B3.succeed(B1, B2);
});
A.precede(B, C);
D.succeed(B, C);

Listing 2 shows the Taskflow code of Fig. 1. A dynamic task accepts a reference of type tf::Subflow that is created by the executor during the execution of task B. A subflow inherits all graph building blocks of static tasking. By default, a spawned subflow joins its parent task (B3 precedes its parent B implicitly), forcing the subflow to follow the subsequent dependency constraints of its parent task. Depending on the application, users can detach a subflow from its parent task using the method detach, allowing its execution to flow independently. A detached subflow will eventually join its parent taskflow.
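As an illustration of the detach semantics described above (our sketch, not a listing from the paper), the subflow of Listing 2 could be detached so that B1, B2, and B3 no longer implicitly precede the successors of B:

auto B = taskflow.emplace([] (tf::Subflow& subflow) {
  auto [B1, B2, B3] = subflow.emplace(
    [] () { std::cout << "B1"; },
    [] () { std::cout << "B2"; },
    [] () { std::cout << "B3"; }
  );
  B3.succeed(B1, B2);
  subflow.detach();  // B1, B2, and B3 now run independently of B's successors
});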

4.3 Composable Tasking

Composable tasking enables developers to define task hierarchies and compose large task graphs from modular and reusable blocks that are easier to optimize. Fig. 2 gives an example of a Taskflow graph using composition. The top-level taskflow defines one static task C that runs before a dynamic task D that spawns two dependent tasks D1 and D2. Task D precedes a module task E that composes a taskflow of two dependent tasks A and B.

Fig. 2. An example of taskflow composition.

Listing 3. Taskflow code of Fig. 2.

// file 1 defines taskflow1
tf::Taskflow taskflow1;
auto [A, B] = taskflow1.emplace(
  [] () { std::cout << "TaskA"; },
  [] () { std::cout << "TaskB"; }
);
A.precede(B);
// file 2 defines taskflow2
tf::Taskflow taskflow2;
auto [C, D] = taskflow2.emplace(
  [] () { std::cout << "TaskC"; },
  [] (tf::Subflow& sf) {
    std::cout << "TaskD";
    auto [D1, D2] = sf.emplace(
      [] () { std::cout << "D1"; },
      [] () { std::cout << "D2"; }
    );
    D1.precede(D2);
  }
);
auto E = taskflow2.composed_of(taskflow1);  // module task
D.precede(E);
C.precede(D);

Listing 3 shows the Taskflow code of Fig. 2. It declares two taskflows, taskflow1 and taskflow2. taskflow2 forms a module task E by calling the method composed_of on taskflow1, and E is then preceded by task D. Unlike a subflow task, a module task does not own the taskflow but maintains a soft mapping to its composed taskflow. Users can create multiple module tasks from the same taskflow, but they must not run concurrently; subflows, on the contrary, are created dynamically and can run concurrently. In practice, we use composable tasking to partition large parallel programs into smaller or reusable taskflows in separate files (e.g., taskflow1 in file 1 and taskflow2 in file 2) to improve program modularity and testability. Subflows are instead used for enclosing a task graph that needs stateful data referencing via lambda capture.
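As a usage note (our sketch, not a listing from the paper), the top-level taskflow of Listing 3 is submitted to an executor like any other taskflow; the module task E simply runs the composed taskflow1 once its dependencies are met:

tf::Executor executor;
executor.run(taskflow2).wait();  // C -> D (spawns D1, D2) -> E (runs taskflow1: A -> B)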
4.4 Conditional Tasking

We introduce a new conditional tasking model to overcome the limitation of existing frameworks in expressing general control flow beyond DAGs. A condition task is a callable that returns an integer index indicating the next successor task to execute. The index is defined with respect to the order of the successors preceded by the condition task. Fig. 3 shows an example of if-else control flow, and Listing 4 gives its implementation. The code is self-explanatory. The condition task, cond, precedes two tasks, yes and no. With this order, if cond returns 0, the execution moves on to yes, or to no if cond returns 1.

Fig. 3. A Taskflow graph of if-else control flow using one condition task (in diamond).

Listing 4. Taskflow program of Fig. 3.

auto [init, cond, yes, no] = taskflow.emplace(
  [] () { std::cout << "init"; },
  [] () { std::cout << "cond"; return 0; },
  [] () { std::cout << "cond returns 0"; },
  [] () { std::cout << "cond returns 1"; }
);
cond.succeed(init)
    .precede(yes, no);

Our condition task supports iterative control flow by introducing a cycle in the graph. Fig. 4 shows a task graph of do-while iterative control flow, implemented in Listing 5. The loop continuation condition is implemented by a single condition task, cond, that precedes two tasks, body and done. When cond returns 0, the execution loops back to body. When cond returns 1, the execution moves on to done and stops. In this example, we use only four tasks even though the control flow spans 100 iterations. Our model is more efficient and expressive than existing frameworks that count on dynamic tasking or recursive parallelism to execute conditions on the fly [18], [20].

Fig. 4. A Taskflow graph of iterative control flow using one condition task.

Listing 5. Taskflow program of Fig. 4.

int i;
auto [init, body, cond, done] = taskflow.emplace(
  [&](){ i=0; },
  [&](){ i++; },
  [&](){ return i<100 ? 0 : 1; },
  [&](){ std::cout << "done"; }
);
init.precede(body);
body.precede(cond);
cond.precede(body, done);

Furthermore, our condition task can model non-deterministic control flow that many existing models do not support. Fig. 5 shows an example of nested non-deterministic control flow frequently used in stochastic optimization (e.g., VLSI floorplan annealing [53]). The graph consists of two regular tasks, init and stop, and three condition tasks, F1, F2, and F3. Each condition task forms a dynamic control flow that randomly goes either to the next task or back to F1, each with a probability of 1/2. Starting from init, the expected number of condition tasks to execute before reaching stop is eight. Listing 6 implements Fig. 5 in just 11 lines of code.

Fig. 5. A Taskflow graph of non-deterministic control flow using three condition tasks.

Listing 6. Taskflow program of Fig. 5.

auto [init, F1, F2, F3, stop] = taskflow.emplace(
  [] () { std::cout << "init"; },
  [] () { return rand()%2; },
  [] () { return rand()%2; },
  [] () { return rand()%2; },
  [] () { std::cout << "stop"; }
);
init.precede(F1);
F1.precede(F2, F1);
F2.precede(F3, F1);
F3.precede(stop, F1);

The advantage of our conditional tasking is threefold. First, it is simple and expressive. Developers benefit from the ability to make in-graph control-flow decisions that are integrated within task dependencies. This type of decision making is different from dataflow [33], as we do not abstract data but tasks, and it is more general than the primitive-based method [56] that is limited to domain applications. Second, condition tasks can be associated with other tasks to integrate control flow into a unified graph entity. Users need not partition the control flow or unroll it into a flat DAG, but can focus on expressing dependent tasks and control flow. A later section explains our scheduling algorithms for condition tasks. Third, our model enables developers to efficiently overlap tasks both inside and outside control flow.

For example, Fig. 6 implements a task graph of three control-flow blocks, where cond_1 can run in parallel with cond_2 and cond_3. This example requires only 30 lines of code.

Fig. 6. A Taskflow graph of parallel control-flow blocks using three condition tasks.
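The index semantics generalizes beyond if-else to multi-way branching. The sketch below is our illustration, not a listing from the paper; the variable size is hypothetical and captured by reference. The condition task returns 0, 1, or 2 to select one of its three successors in their precede order:

auto [branch, small, medium, large] = taskflow.emplace(
  [&] () { return size < 1024 ? 0 : (size < 1048576 ? 1 : 2); },
  [] () { std::cout << "small"; },
  [] () { std::cout << "medium"; },
  [] () { std::cout << "large"; }
);
branch.precede(small, medium, large);  // 0 -> small, 1 -> medium, 2 -> large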
4.5 Heterogeneous Tasking

We introduce a new heterogeneous task graph programming model by leveraging C++ closures and the emerging GPU task graph acceleration, CUDA Graph [4]. Fig. 7 and Listing 7 show the canonical CPU-GPU saxpy (A times X plus Y) workload and its implementation using our model. Our model lets users describe a GPU workload in a task graph called a cudaFlow rather than as aggregated GPU operations using explicit CUDA streams and events. A cudaFlow lives inside a closure and defines methods for constructing a GPU task graph. In this example, we define two parallel CPU tasks (allocate_x, allocate_y) to allocate unified shared memory (cudaMallocManaged) and one cudaFlow task to spawn a GPU task graph consisting of two host-to-device (H2D) transfer tasks (h2d_x, h2d_y), one saxpy kernel task (kernel), and two device-to-host (D2H) transfer tasks (d2h_x, d2h_y), in this order of task dependencies. Task dependencies are established through precede or succeed. Apparently, cudaFlow must run after allocate_x and allocate_y. We emplace this cudaFlow on GPU 1 (emplace_on). When defining cudaFlows on specific GPUs, users are responsible for ensuring all involved memory operations stay in valid GPU contexts.

Fig. 7. A saxpy ("single-precision A times X plus Y") task graph using two CPU tasks and one cudaFlow task.

Listing 7. Taskflow program of Fig. 7.

__global__ void saxpy(int n, int a, float *x, float *y);

const unsigned N = 1<<20;
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};

auto [allocate_x, allocate_y] = taskflow.emplace(
  [&](){ cudaMallocManaged(&dx, N*sizeof(float)); },
  [&](){ cudaMallocManaged(&dy, N*sizeof(float)); }
);
auto cudaFlow = taskflow.emplace_on(
  [&](tf::cudaFlow& cf) {
    auto h2d_x = cf.copy(dx, hx.data(), N);
    auto h2d_y = cf.copy(dy, hy.data(), N);
    auto d2h_x = cf.copy(hx.data(), dx, N);
    auto d2h_y = cf.copy(hy.data(), dy, N);
    auto kernel = cf.kernel(
      GRID, BLOCK, SHM, saxpy, N, 2.0f, dx, dy
    );
    kernel.succeed(h2d_x, h2d_y)
          .precede(d2h_x, d2h_y);
  }, 1
);
cudaFlow.succeed(allocate_x, allocate_y);

Our cudaFlow has three key motivations. First, users focus on the graph-level expression of dependent GPU operations without wrangling with low-level streams. They can easily visualize the graph through Taskflow to reduce turnaround time. Second, our closure forces users to express their intention on what data storage mechanism should be used for each captured variable. For example, Listing 7 captures all data (e.g., hx, dx) by reference to form a stateful closure. When allocate_x and allocate_y finish, the cudaFlow closure can access the correct state of dx and dy. This property is very important for heterogeneous graph parallelism because CPU and GPU tasks need to share states of data to collaborate with each other. Our model makes it easy and efficient to capture data regardless of its scope. Third, by abstracting GPU operations into a task graph closure, we judiciously hide implementation details for portable optimization. By default, a cudaFlow maps to a CUDA graph that can be executed using a single CPU call. On a platform that does not support CUDA Graph, we fall back to a stream-based execution.

Taskflow does not dynamically choose whether to execute tasks on CPU or GPU, and does not manage GPU data with another abstraction. This is a software decision we made when designing cudaFlow, based on our experience in parallelizing CAD using existing TGCSs. While it is always interesting to see which abstraction is best suited for which application, in our field, developing high-performance CAD algorithms requires many custom efforts on optimizing the memory and data layouts [26], [28]. Developers tend to do this statically in their own hands, such as direct control over raw pointers and explicit memory placement on a GPU, while leaving tedious details of runtime load balancing to a dynamic scheduler.

After years of research, we have concluded not to abstract memory or data because they are application-dependent. This decision allows Taskflow to be framework-neutral while enabling application code to take full advantage of native or low-level GPU programming toolkits.

Constructing a GPU task graph using cudaFlow requires that all kernel parameters are known in advance. However, third-party applications, such as cuDNN and cuBLAS, do not expose these details but provide an API for users to invoke hidden kernels through custom streams. The burden is then on users to decide a stream layout and reason about its concurrency across dependent GPU tasks. To deal with this problem, we design a cudaFlow capturer to capture GPU tasks from existing stream-based APIs. Listing 8 outlines an implementation of the same saxpy task graph in Fig. 7 using a cudaFlow capturer, assuming the saxpy kernel is only invocable through a stream-based API.

Listing 8. Taskflow program of Fig. 7 using a capturer.

taskflow.emplace_on([&](tf::cudaFlowCapturer& cfc) {
  auto h2d_x = cfc.copy(dx, hx.data(), N);
  auto h2d_y = cfc.copy(dy, hy.data(), N);
  auto d2h_x = cfc.copy(hx.data(), dx, N);
  auto d2h_y = cfc.copy(hy.data(), dy, N);
  auto kernel = cfc.on([&](cudaStream_t s){
    invoke_3rdparty_saxpy_kernel(s);
  });
  kernel.succeed(h2d_x, h2d_y)
        .precede(d2h_x, d2h_y);
}, 1);

Both cudaFlow and the cudaFlow capturer work seamlessly with condition tasks. Control-flow decisions frequently happen at the boundary between CPU and GPU tasks. For example, a heterogeneous k-means algorithm iteratively uses the GPU to accelerate the finding of k centroids and then uses the CPU to check whether the newly found centroids converge according to application rules. Taskflow enables an end-to-end expression of such a workload in a single graph entity, as shown in Fig. 8 and Listing 9. This capability largely improves the efficiency of modeling complex CPU-GPU workloads, and our scheduler can dynamically overlap CPU and GPU tasks across different control-flow blocks.

Fig. 8. A cyclic task graph using three cudaFlow tasks and one condition task to model an iterative k-means algorithm.

Listing 9. Taskflow program of Fig. 8.

auto [h2d, update, cond, d2h] = taskflow.emplace(
  [&](tf::cudaFlow& cf){ /* copy input to GPU */ },
  [&](tf::cudaFlow& cf){ /* update kernel */ },
  [&](){ return converged() ? 1 : 0; },
  [&](tf::cudaFlow& cf){ /* copy result to CPU */ }
);
h2d.precede(update);
update.precede(cond);
cond.precede(update, d2h);

5 TASKFLOW SYSTEM RUNTIME

Taskflow enables users to express CPU-GPU dependent tasks that integrate control flow into an HTDG. To support our model with high performance, we design the system runtime at two scheduling levels, the task level and the worker level. The goal of task-level scheduling is to (1) devise a feasible, efficient execution for in-graph control flow and (2) transform each GPU task into a runnable instance on a GPU. The goal of worker-level scheduling is to optimize the execution performance by dynamically balancing the worker count with task parallelism.

Fig. 9. Flowchart of our task scheduling.

5.1 Task-level Scheduling Algorithm

5.1.1 Scheduling Condition Tasks

Conditional tasking is powerful but challenging to schedule. Specifically, we must deal with conditional dependency and cyclic execution without encountering task races, i.e., only one thread can touch a task at a time. More importantly, we need to let users easily understand our task scheduling flow such that they can infer whether a written task graph is properly conditioned and schedulable. To accommodate these challenges, we separate the execution logic between condition tasks and other tasks using two dependency notations: weak dependency (out of condition tasks) and strong dependency (all others). For example, the six dashed arrows in Fig. 5 are weak dependencies and the solid arrow init→F1 is a strong dependency. Based on these notations, we design a simple and efficient algorithm for scheduling tasks, as depicted in Fig. 9. When the scheduler receives an HTDG, it (1) starts with tasks of zero dependencies (both strong and weak) and continues executing tasks whenever their remaining strong dependencies are met, or (2) skips this rule for weak dependencies and directly jumps to the task indexed by the return value of the condition task.

Taking Fig. 5 as an example, the scheduler starts with init (zero weak and strong dependencies) and proceeds to F1. Assuming F1 returns 0, the scheduler proceeds to its first successor, F2. Now, assuming F2 returns 1, the scheduler proceeds to its second successor, F1, which forms a cyclic execution, and so forth. With this concept, the scheduler will cease at stop when F1, F2, and F3 all return 0. Based on this scheduling algorithm, users can quickly infer whether their task graph defines correct control flow. For instance, adding a strong dependency from init to F2 may cause a task race on F2, due to two execution paths, init→F2 and init→F1→F2.

Fig. 10. Common pitfalls of conditional tasking.

Fig. 10 shows two common pitfalls of conditional tasking, based on our task-level scheduling logic. The first example has no source for the scheduler to start with. A simple fix is to add a task S of zero dependencies. The second example may race on D, if C returns 0 at the same time E finishes. A fix is to partition the control flow at C and D with an auxiliary node X such that D is strongly conditioned by E and X.

5.1.2 Scheduling GPU Tasks

We leverage modern CUDA Graph [4] to schedule GPU tasks. CUDA Graph is a new asynchronous task graph programming model introduced in CUDA 10 to enable more efficient launch and execution of GPU work than streams.
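For readers unfamiliar with CUDA Graph, the following fragment shows the standard capture-and-replay pattern of the CUDA runtime API (our illustration, independent of Taskflow; error checking omitted). The key point is that the graph is instantiated once into an executable image and then relaunched many times at very low per-launch cost:

#include <cuda_runtime.h>

cudaStream_t stream;
cudaGraph_t graph;
cudaGraphExec_t exec;
cudaStreamCreate(&stream);
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// ... enqueue kernels and memory copies on `stream` here; they are recorded, not executed ...
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // build the executable image once
for (int i = 0; i < 100; ++i) {
  cudaGraphLaunch(exec, stream);                          // replay with low per-launch overhead
}
cudaStreamSynchronize(stream);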

There are two types of GPU tasks, cudaFlow and cudaFlow capturer. For each scheduled cudaFlow task, since we know all the operation parameters, we construct a CUDA graph that maps each task in the cudaFlow, such as copy and kernel, and each dependency to a node and an edge in the CUDA graph. Then, we submit it to the CUDA runtime for execution. This organization is simple and efficient, especially under modern GPU architectures (e.g., Nvidia Ampere) that support hardware-level acceleration for graph parallelism.

On the other hand, for each scheduled cudaFlow capturer task, our runtime transforms the captured GPU tasks and dependencies into a CUDA graph using stream capture [4]. The objective is to decide a stream layout optimized for kernel concurrency without breaking task dependencies. We design a greedy round-robin algorithm to transform a cudaFlow capturer into a CUDA graph, as shown in Algorithm 1. Our algorithm starts by levelizing the capturer graph into a two-level array of tasks in their topological order. Tasks at the same level can run simultaneously. However, assigning each independent task here a unique stream does not produce decent performance, because a GPU has a limit on the maximum kernel concurrency (e.g., 32 for an RTX 2080). We expose this constraint to users as a tunable parameter, max_streams. We assign each levelized task an id equal to its index in the array at its level. Then, we can quickly assign each task a stream using round-robin arithmetic (line 6). Since tasks at different levels have dependencies, we need to record an event (lines 13:17) and wait on the event (lines 7:11) on both sides of a dependency, except for those issued in the same stream (line 8 and line 14).

Algorithm 1. make_graph(C)
Input: a cudaFlow capturer C
Output: a transformed CUDA graph G
1:  S ← get_capture_mode_streams(max_streams);
2:  L ← levelize(C);
3:  l ← L.min_level;
4:  while l <= L.max_level do
5:    foreach t ∈ L.get_tasks(l) do
6:      s ← (t.id mod max_streams);
7:      foreach p ∈ t.predecessors do
8:        if s ≠ (p.id mod max_streams) then
9:          stream_wait_event(S[s], p.event);
10:       end
11:     end
12:     stream_capture(t, S[s]);
13:     foreach n ∈ t.successors do
14:       if s ≠ (n.id mod max_streams) then
15:         stream_record_event(S[s], t.event);
16:       end
17:     end
18:   end
19: end
20: G ← end_capture_mode_streams(S);
21: return G;

Fig. 11. Illustration of Algorithm 1 on transforming an application cudaFlow capturer graph into a native CUDA graph using two streams.

Fig. 11 gives an example of transforming a user-given cudaFlow capturer graph into a native CUDA graph using two streams (i.e., max_streams = 2) for execution. The algorithm first levelizes the graph by performing a topological traversal and assigns each node an id equal to its index at its level. For example, A and B are assigned 0 and 1; C, D, and E are assigned 0, 1, and 2; and so on. These ids are used to quickly determine the mapping between a stream and a node in our round-robin loop, because a CUDA stream only allows inserting events from the latest node in the queue. For instance, when A and B are assigned to stream 0 (upper row) and stream 1 (lower row) during the level-by-level traversal (line 4 of Algorithm 1), we can determine ahead of time the stream numbers of their successors and find out the two cross-stream dependencies, A→D and B→E, that need recording events. Similarly, we can wait on recorded events by scanning the predecessors of each node to find cross-stream event dependencies.
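To make the round-robin mapping concrete, the following CPU-side sketch (our illustration with a hard-coded two-level graph; the Node struct and the printed messages are ours, and the real runtime would also capture work into CUDA streams and record/wait on CUDA events) reproduces the stream assignment of line 6 and the cross-stream edge detection of lines 13:17:

#include <cstdio>
#include <utility>
#include <vector>

// Each task keeps its level, its index (id) within the level, and the
// (level, id) pairs of its successors at the next level.
struct Node { int level, id; std::vector<std::pair<int,int>> successors; };

int main() {
  const int max_streams = 2;
  // Level 0: A(id 0) -> D, B(id 1) -> E.  Level 1: C(id 0), D(id 1), E(id 2).
  std::vector<std::vector<Node>> levels = {
    { {0, 0, {{1, 1}}}, {0, 1, {{1, 2}}} },
    { {1, 0, {}}, {1, 1, {}}, {1, 2, {}} }
  };
  for (auto& level : levels) {
    for (auto& t : level) {
      int s = t.id % max_streams;  // round-robin stream assignment (line 6)
      // a real implementation would capture t into stream s here and wait on
      // events recorded by cross-stream predecessors (lines 7:11)
      for (auto [nl, nid] : t.successors) {
        if (s != nid % max_streams) {  // cross-stream dependency needs an event
          std::printf("record event: L%d-T%d (stream %d) -> L%d-T%d (stream %d)\n",
                      t.level, t.id, s, nl, nid, nid % max_streams);
        }
      }
    }
  }
}

Running this sketch reports the two cross-stream dependencies A→D and B→E, matching the Fig. 11 example.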
5.2 Worker-level Scheduling Algorithm

At the worker level, we leverage work stealing to execute submitted tasks with dynamic load balancing. Work stealing has been extensively studied in multicore programming [2], [12], [13], [16], [23], [39], [40], [47], [52], but an efficient counterpart for hybrid CPU-GPU or more general heterogeneous systems remains demanding. This is a challenging research topic, especially under Taskflow's HTDG model. When executing an HTDG, a CPU task can submit both CPU and GPU tasks and vice versa whenever dependencies are met. The available task parallelism changes dynamically, and there is no way to predict the next coming tasks under dynamic control flow. To achieve good system performance, the scheduler must balance the number of worker threads with dynamically generated tasks to control the number of wasteful steals, because the wasted resources should have been used by useful workers or other concurrent programs [13], [23].

Fig. 12. Architecture of our work-stealing scheduler on two domains, CPU and GPU.

Keeping workers busy awaiting tasks with a yielding mechanism is a commonly used work-stealing framework [16], [17], [25]. However, this approach is not cost-efficient, because it can easily over-subscribe resources when tasks become scarce, especially around the decision-making points of control flow. The sleep-based mechanism is another way to suspend workers that frequently fail in steal attempts. A worker is put to sleep by waiting for a condition variable to become true. When the worker sleeps, the OS can grant resources to other workers for running useful jobs. Also, reducing wasteful steals can improve both the interoperability of a concurrent program and the overall system performance, including latency, throughput, and energy efficiency, to a large extent [23]. Nevertheless, deciding when and how to put workers to sleep, wake up workers to run, and balance the number of workers with dynamic task parallelism is notoriously challenging to design correctly and implement efficiently.

Our previous work [42] introduced an adaptive work-stealing algorithm to address a similar line of challenges, yet in a CPU-only environment, by maintaining a loop invariant between active and idle workers. However, extending this algorithm to a heterogeneous target is not easy, because we need to consider the adaptiveness in different heterogeneous domains and bound the total number of wasteful steals across all domains at any time of the execution. To overcome this challenge, we introduce a new scheduler architecture and an adaptive worker management algorithm that are both generalizable to arbitrary heterogeneous domains. We shall prove that the proposed work-stealing algorithm delivers a strong upper bound on the number of wasteful steals at any time during the execution.

5.2.1 Heterogeneous Work-Stealing Architecture

At the architecture level, our scheduler maintains a set of workers for each task domain (e.g., CPU, GPU). A worker can only steal tasks of the same domain from others. Fig. 12 shows the architecture of our work-stealing scheduler on two domains, CPU and GPU. By default, the number of domain workers equals the number of domain devices (e.g., CPU cores, GPUs). We associate each worker with two separate task queues, a CPU task queue (CTQ) and a GPU task queue (GTQ), and declare a pair of CTQ and GTQ shared by all workers. The shared CTQ and GTQ pertain to the scheduler and are primarily used by external threads to submit HTDGs. A CPU worker can push and pop a new task into and from its local CTQ, and can steal tasks from all the other CTQs; the structure is symmetric for GPU workers. This separation allows a worker to quickly insert dynamically generated tasks into their corresponding queues without contending with other workers.

Algorithm 2. worker_loop(w)
Input: w: a worker
Per-worker global: t: a task (initialized to NIL)
1: while true do
2:   exploit_task(w, t);
3:   if wait_for_task(w, t) == false then
4:     break;
5:   end
6: end

Algorithm 3. exploit_task(w, t)
Input: w: a worker (domain d_w)
Per-worker global: t: a task
1: if t ≠ NIL then
2:   if AtomInc(actives[d_w]) == 1 and thieves[d_w] == 0 then
3:     notifier[d_w].notify_one();
4:   end
5:   do
6:     execute_task(w, t);
7:     t ← w.task_queue[d_w].pop();
8:   while t ≠ NIL;
9:   AtomDec(actives[d_w]);
10: end

We leverage two existing concurrent data structures, a work-stealing queue and an event notifier, to support our scheduling architecture. We implemented the task queue based on the lock-free algorithm proposed by [36]. Only the queue owner can pop/push a task from/into one end of the queue, while multiple threads can steal a task from the other end at the same time. The event notifier is a two-phase commit protocol (2PC) that allows a worker to wait on a binary predicate in a non-blocking fashion [11]. The idea is similar to the 2PC in distributed systems and computer networking. The waiting worker first checks the predicate and calls prepare_wait if it evaluates to false. The waiting worker then checks the predicate again and calls commit_wait to wait, if the outcome remains false, or cancel_wait to cancel the request. Conversely, the notifying worker changes the predicate to true and calls notify_one or notify_all to wake up one or all waiting workers. The event notifier is particularly useful for our scheduler architecture because we can keep notification between workers non-blocking. We develop one event notifier for each domain, based on Dekker's algorithm by [11].
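To illustrate the two-phase wait protocol described above, the following sketch is our own mutex-based approximation, not the Taskflow implementation (which is lock-free and based on Dekker's algorithm [11]); the class Notifier and the predicate out_of_work() are hypothetical names:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Epoch-based notifier mimicking prepare_wait / cancel_wait / commit_wait.
class Notifier {
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<uint64_t> epoch_{0};
 public:
  uint64_t prepare_wait() { return epoch_.load(); }   // announce intent to sleep
  void cancel_wait() {}                                // real impl deregisters the waiter
  void commit_wait(uint64_t ticket) {                  // sleep unless notified since prepare
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&]{ return epoch_.load() != ticket; });
  }
  void notify_one() {                                  // notifier side: bump epoch, then wake
    { std::lock_guard<std::mutex> lk(mu_); epoch_.fetch_add(1); }
    cv_.notify_one();
  }
};

// Worker-side usage pattern: prepare, re-check the predicate, then commit or cancel.
//   Notifier n;
//   auto ticket = n.prepare_wait();
//   if (!out_of_work()) n.cancel_wait();   // tasks appeared; keep stealing
//   else                n.commit_wait(ticket);

The epoch counter guarantees that a notification issued between prepare_wait and commit_wait is not lost: commit_wait returns immediately because the epoch no longer matches the ticket.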
5.2.2 Heterogeneous Work-Stealing Algorithm

Atop this architecture, we devise an efficient algorithm to adapt the number of active workers to dynamically generated tasks such that threads are neither underutilized when tasks are abundant nor overly subscribed when tasks are scarce.

Our adaptiveness is different from existing frameworks, such as constant wake-ups [2], [23], data locality [21], [49], and watchdogs [23]. Instead, we extend our previous work [42] to keep a per-domain invariant to control the number of thieves and, consequently, wasteful steals based on the active worker count: when an active worker exists, we keep at least one worker making steal attempts unless all workers are active.

Unlike the CPU-only scheduling environment in [42], the challenge of keeping this invariant in a heterogeneous target comes from the heterogeneously dependent tasks and cross-domain worker notifications, as a CPU task can spawn a GPU task and vice versa. Our scheduler architecture is particularly designed to tackle this challenge by separating decision controls on a per-domain basis. This design allows us to realize the invariant via an adaptive strategy: the last thief to become active will wake up a worker in the same domain to take over its thief role, and so forth. External threads (non-workers) submit tasks through the shared task queues and wake up workers to run tasks.

Algorithm 4. execute_task(w, t)
Input: w: a worker
Per-worker global: t: a task
1: r ← invoke_task_callable(t);
2: if r.has_value() then
3:   submit_task(w, t.successors[r]);
4:   return;
5: end
6: foreach s ∈ t.successors do
7:   if AtomDec(s.strong_dependents) == 0 then
8:     submit_task(w, s);
9:   end
10: end

Algorithm 5. submit_task(w, t)
Input: w: a worker (domain d_w)
Per-worker global: t: a task (domain d_t)
1: w.task_queue[d_t].push(t);
2: if d_w != d_t then
3:   if actives[d_t] == 0 and thieves[d_t] == 0 then
4:     notifier[d_t].notify_one();
5:   end
6: end

Our scheduling algorithm is symmetric by domain. Once spawned, each worker enters the loop in Algorithm 2. Each worker has a per-worker global pointer t to a task that is either stolen from others or popped from the worker's local task queue after initialization; this notation will be used in the rest of the algorithms. The loop iterates over two functions, exploit_task and wait_for_task. Algorithm 3 implements the function exploit_task. We use two scheduler-level arrays of atomic variables, actives and thieves, to record for each domain the number of workers that are actively running tasks and the number of workers that are making steal attempts, respectively.¹ Our algorithm relies on these atomic variables to decide when to put a worker to sleep to reduce resource waste and when to bring back a worker to run new tasks. Lines 2:4 implement our adaptive strategy using two lightweight atomic operations. In our pseudocode, the two atomic operations, AtomInc and AtomDec, return the results after incrementing and decrementing the values by one, respectively. Notice that the order of these two comparisons matters (i.e., active workers and then thieves), as they are used to synchronize with other workers in the later algorithms. Lines 5:8 drain the local task queue and execute all the tasks using execute_task in Algorithm 4. Before leaving the function, the worker decrements actives by one (line 9).

¹ While our pseudocode uses array notations of atomic variables for the sake of brevity, the actual implementation considers padding to avoid false-sharing effects.

Algorithm 4 implements the function execute_task. We invoke the callable of the task (line 1). If the task returns a value (i.e., a condition task), we directly submit the task of the indexed successor (lines 2:5). Otherwise, we remove the task dependency from all immediate successors and submit new tasks of zero remaining strong dependencies (lines 6:10). The details of submitting a task are shown in Algorithm 5. The worker inserts the task into the queue of the corresponding domain (line 1). If the task does not belong to the worker's domain (line 2), the worker wakes up one worker from that domain if there are no active workers or thieves (lines 3:5). The function submit_task is internal to the workers of a scheduler. External threads never touch this call.

When a worker completes all tasks in its local queue, it proceeds to wait_for_task (line 3 in Algorithm 2), as shown in Algorithm 6. At first, the worker enters explore_task to make steal attempts (line 2). When the worker steals a task and it is the last thief, it notifies a worker of the same domain to take over its thief role and returns as an active worker (lines 3:8). Otherwise, the worker becomes a sleep candidate. However, we must avoid underutilized parallelism, since new tasks may arrive at the time we put a worker to sleep. We use 2PC to adapt the number of active workers to the available task parallelism (lines 9:41). The predicate of our 2PC is: at least one task queue, local or shared, in the worker's domain is nonempty. At line 8, the worker has drained its local queue and devoted much effort to stealing tasks. Other task queues in the same domain are most likely to be empty. We make this worker a sleep candidate by submitting a wait request (line 9). From now on, all notifications from other workers will be visible to at least one worker, including this worker. That is, if another worker calls notify at this moment, the 2PC guarantees that one worker within the scope of lines 9:41 will be notified (i.e., line 42). Then, we inspect our predicate by examining the shared task queue again (lines 10:20), since external threads might have inserted tasks at the same time we called prepare_wait. If the shared queue is nonempty (line 10), the worker cancels the wait request and makes an immediate steal attempt at the queue (lines 11:12); if the steal succeeds and it is the last thief, the worker goes active and notifies a worker (lines 13:18), or otherwise enters the steal loop again (line 19). If the shared queue is empty (line 20), the worker checks whether the scheduler has received a stop signal from the executor due to an exception or task cancellation, and if so notifies all workers to leave (lines 21:28).

Now, the worker is almost ready to sleep, except when it is the last thief and (1) an active worker in its domain exists (lines 30:33) or (2) at least one task queue of the same domain from other workers is nonempty (lines 34:39). The two conditions may happen because a task can spawn tasks of different domains and trigger the scheduler to notify the corresponding domain workers. Our 2PC guarantees the two conditions synchronize with lines 2:4 in Algorithm 3 and lines 3:5 in Algorithm 5, and vice versa, preventing the problem of undetected task parallelism. Passing all the above conditions, the worker commits to wait on our predicate (line 41).

Algorithm 6. wait_for_task(w, t)
Input: w: a worker (domain d_w)
Per-worker global: t: a task
Output: a boolean signal of stop
1: AtomInc(thieves[d_w]);
2: explore_task(w, t);
3: if t ≠ NIL then
4:   if AtomDec(thieves[d_w]) == 0 then
5:     notifier[d_w].notify_one();
6:   end
7:   return true;
8: end
9: notifier[d_w].prepare_wait(w);
10: if task_queue[d_w].empty() ≠ true then
11:   notifier[d_w].cancel_wait(w);
12:   t ← task_queue[d_w].steal();
13:   if t ≠ NIL then
14:     if AtomDec(thieves[d_w]) == 0 then
15:       notifier[d_w].notify_one();
16:     end
17:     return true;
18:   end
19:   goto Line 2;
20: end
21: if stop == true then
22:   notifier[d_w].cancel_wait(w);
23:   foreach domain d ∈ D do
24:     notifier[d].notify_all();
25:   end
26:   AtomDec(thieves[d_w]);
27:   return false;
28: end
29: if AtomDec(thieves[d_w]) == 0 then
30:   if actives[d_w] > 0 then
31:     notifier[d_w].cancel_wait(w);
32:     goto Line 1;
33:   end
34:   foreach worker x ∈ W do
35:     if x.task_queue[d_w].empty() ≠ true then
36:       notifier[d_w].cancel_wait(w);
37:       goto Line 1;
38:     end
39:   end
40: end
41: notifier[d_w].commit_wait(w);
42: return true;

Algorithm 7 implements explore_task, which resembles the normal work-stealing loop [16]. At each iteration, the worker (thief) tries to steal a task from a randomly selected victim, including the shared task queue, in the same domain. We use a parameter MAX_STEALS to control the number of iterations. In our experiments, setting MAX_STEALS to ten times the number of all workers is sufficient for most applications. Up to this point, we have discussed the core work-stealing algorithm. To submit an HTDG for execution, we call submit_graph, shown in Algorithm 8. The caller thread inserts all tasks of zero dependencies (both strong and weak) into the shared task queues and notifies a worker of the corresponding domain (lines 4:5). Shared task queues may be accessed by multiple callers and are thus protected under a lock pertaining to the scheduler. Our 2PC guarantees that lines 4:5 synchronize with lines 10:20 of Algorithm 6 and vice versa, preventing undetected parallelism in which all workers are sleeping.

Algorithm 7. explore_task(w, t)
Input: w: a worker (a thief in domain d_w)
Per-worker global: t: a task (initialized to NIL)
1: steals ← 0;
2: while t == NIL and ++steals <= MAX_STEALS do
3:   yield();
4:   t ← steal_task_from_random_victim(d_w);
5: end

Algorithm 8. submit_graph(g)
Input: g: an HTDG to execute
1: foreach t ∈ g.source_tasks do
2:   scoped_lock lock(queue_mutex);
3:   d_t ← t.domain;
4:   task_queue[d_t].push(t);
5:   notifier[d_t].notify_one();
6: end

6 ANALYSIS

To justify the efficiency of our scheduling algorithm, we draw the following theorems and give their proof sketches.

Lemma 1. For each domain, when an active worker (i.e., one running a task) exists, at least one other worker is making steal attempts unless all workers are active.

Proof. We prove Lemma 1 by contradiction. Assume that no worker is making steal attempts while an active worker exists; this means an active worker (line 2 in Algorithm 3) fails to notify one worker if no thieves exist. There are only two scenarios for this to happen: (1) all workers are active; (2) a non-active worker misses the notification before entering the 2PC guard (line 9 in Algorithm 6). The first scenario is not possible as it has been excluded by the lemma. If the second scenario is true, the non-active worker must not be the last thief (a contradiction), or it will notify another worker through line 3 in Algorithm 6. The proof holds for other domains as our scheduler design is symmetric. □

Theorem 1. Our work-stealing algorithm can correctly complete the execution of an HTDG.

Proof. There are two places where a new task is submitted, making unsuccessful steal attempts. Due to Lemma 1 and
line 4 in Algorithm 8 and line 1 in Algorithm 5. In the first lines 29:40 in Algorithm 6, only one thief w0d will eventu-
place, where a task is pushed to the shared task queue by ally remain in the loop, and the other jWd j  2 thieves
an external thread, the notification (line 5 in Algorithm 8) will go sleep after one round of unsuccessful steal
is visible to a worker in the same domain of the task for attempts (line 2 in Algorithm 6) which ends up with
two situations: (1) if a worker has prepared or committed to wait (lines 9:41 in Algorithm 6), it will be notified; (2) otherwise, at least one worker will eventually go through lines 9:20 in Algorithm 6 to steal the task. In the second place, where the task is pushed to the corresponding local task queue of that worker, at least one worker will execute it in either situation: (1) if the task is in the same domain of the worker, the worker itself may execute the task in the subsequent exploit_task, or a thief steals the task through explore_task; (2) if the worker has a different domain from the task (line 2 in Algorithm 5), the correctness can be proved by contradiction. Assume this task is undetected, which means either the worker did not notify a corresponding domain worker to run the task (false at the condition of line 3 in Algorithm 5) or notified one worker (line 4 in Algorithm 5) but none have come back. In the former case, we know at least one worker is active or stealing, which will eventually go through lines 29:40 of Algorithm 6 to steal this task. Similarly, the latter case is not possible under our 2PC, as it contradicts the guarding scan in lines 9:41 of Algorithm 6. □

Theorem 2. Our work-stealing algorithm does not under-subscribe thread resources during the execution of an HTDG.

Proof. Theorem 2 is a byproduct of Lemma 1 and Theorem 1. Theorem 1 proves that our scheduler never has task leak (i.e., undetected task parallelism). During the execution of an HTDG, whenever the number of tasks is larger than the present number of workers, Lemma 1 guarantees one worker is making steal attempts, unless all workers are active. The 2PC guard (lines 34:39 in Algorithm 6) ensures that worker will successfully steal a task and become an active worker (unless no more tasks exist), which in turn wakes up another worker if that worker is the last thief. As a consequence, the number of workers catches up with the number of tasks one after another to avoid under-subscribed thread resources. □

Theorem 3. At any moment during the execution of an HTDG, the number of wasteful steals is bounded by O(MAX_STEALS × (|W| + |D| × (E/e_s))), where W is the worker set, D is the domain set, E is the maximum execution time of any task, and e_s is the execution time of Algorithm 7.

Proof. We give a direct proof for Theorem 3 using the following notations: D denotes the domain set, d denotes a domain (e.g., CPU, GPU), W denotes the entire worker set, W_d denotes the worker set in domain d, w_d denotes a worker in domain d (i.e., w_d ∈ W_d), e_s denotes the time to complete one round of steal attempts (i.e., Algorithm 7), e_d denotes the maximum execution time of any task in domain d, and E denotes the maximum execution time of any task in the given HTDG.

At any time point, the worst case happens at the following scenario: for each domain d only one worker w_d is actively running one task while all the other workers are making steal attempts; the |W_d| − 2 workers other than w_d and the last thief each give up after failing MAX_STEALS attempts, incurring at most MAX_STEALS × (|W_d| − 2) wasteful steals. For the only one thief w'_d, it keeps failing in steal attempts until the task running by the only active worker w_d finishes, and then both go to sleep. This results in another MAX_STEALS × (e_d/e_s) + MAX_STEALS wasteful steals; the second term comes from the active worker because it needs another round of steal attempts (line 2 in Algorithm 6) before going to sleep. Consequently, the number of wasteful steals across all domains is bounded as follows:

\[
\sum_{d \in D} \mathrm{MAX\_STEALS} \times (|W_d| - 2 + (e_d/e_s) + 1)
\le \sum_{d \in D} \mathrm{MAX\_STEALS} \times (|W_d| + e_d/e_s)
\le \sum_{d \in D} \mathrm{MAX\_STEALS} \times (|W_d| + E/e_s)
= O(\mathrm{MAX\_STEALS} \times (|W| + |D| \times (E/e_s))) \tag{1}
\]

We do not derive the bound over the execution of an HTDG but the worst-case number of wasteful steals at any time point, because the presence of control flow can lead to non-deterministic execution time that requires a further assumption of task distribution. □

7 EXPERIMENTAL RESULTS
We evaluate the performance of Taskflow on two fronts: micro-benchmarks and two realistic workloads, VLSI incremental timing analysis and machine learning. We use micro-benchmarks to analyze the tasking performance of Taskflow without much bias of application algorithms. We will show that the performance benefits of Taskflow observed in micro-benchmarks become significant in real workloads. We will study the performance across runtime, energy efficiency, and throughput. All experiments ran on a Ubuntu Linux 5.0.0-21-generic x86 64-bit machine with 40 Intel Xeon CPU cores at 2.00 GHz, 4 GeForce RTX 2080 GPUs, and 256 GB RAM. We compiled all programs using Nvidia CUDA v11 on a host compiler of clang++ v10 with the C++17 standard (-std=c++17) and optimization flag -O2 enabled. We do not observe significant difference between -O2 and -O3 in our experiments. Each run of N CPU cores and M GPUs corresponds to N CPU and M GPU worker threads. All data is an average of 20 runs.

7.1 Baseline
Given the large number of TGCSs, it is impossible to compare Taskflow with all of them. Each of the existing systems has its pros and cons and dominates certain applications. We consider oneTBB [2], StarPU [17], HPX [33], and OpenMP [7], each representing a particular paradigm that has gained some successful user experiences in CAD due to

performance [44]. oneTBB (2021.1 release) is an industrial-strength parallel programming system under Intel oneAPI [2]. We consider its FlowGraph library and encapsulate each GPU task in a CPU function. At the time of this writing, FlowGraph does not have dedicated work stealing for HTDGs. StarPU (version 1.3) is a CPU-GPU task programming system widely used in the scientific computing community [17]. It provides a C-based syntax for writing HTDGs on top of a work-stealing runtime highly optimized for CPUs and GPUs. HPX (version 1.4) is a C++ standard library for concurrency and parallelism [33]. It supports implicit task graph programming through aggregating future objects in a dataflow API. OpenMP (version 4.5 in clang toolchains) is a directive-based programming framework for handling loop parallelism [7]. It supports static graph encoding using task dependency clauses.

To measure the expressiveness and programmability of Taskflow, we hire five PhD-level C++ programmers outside our research group to implement our experiments. We educate them on the essential knowledge about Taskflow and the baseline TGCSs, and provide them all algorithm blocks such that they can focus on programming HTDGs. For each implementation, we record the lines of code (LOC), the number of tokens, cyclomatic complexity (measured by [8]), time to finish, and the percentage of time spent on debugging. We average these quantities over five programmers until they obtain the correct result. This measurement may be subjective but it highlights the programming productivity and turnaround time of each TGCS from a real user's perspective.

7.2 Micro-benchmarks
We randomly generate a set of DAGs (i.e., HTDGs) with equal distribution of CPU and GPU tasks. Each task performs a SAXPY operation over 1K elements. For fairness, we implemented CUDA Graph [4] for all baselines; each GPU task is a CUDA graph of three GPU operations, H2D copy, kernel, and D2H copy, in this order of dependencies. Table 1 summarizes the programming effort of each method. Taskflow requires the least amount of lines of code (LOC) and written tokens. The cyclomatic complexity of Taskflow measured at a single function and across the whole program is also the smallest. The development time of the Taskflow-based implementation is much more productive than the others. For this simple graph, Taskflow and oneTBB are very easy for our programmers to implement, whereas we found they spent a large amount of time on debugging task graph parallelism with StarPU, HPX, and OpenMP.

TABLE 1
Programming Effort on Micro-benchmark

Method    LOC  #Tokens  CC  WCC  Dev  Bug
Taskflow   69    650     6    8   14   1%
oneTBB    182   1854     8   15   25   6%
StarPU    253   2216     8   21   47  19%
HPX       255   2264    10   24   41  33%
OpenMP    182   1896    13   19   57  49%

CC: maximum cyclomatic complexity in a single function.
WCC: weighted cyclomatic complexity of the program.
Dev: minutes to complete the implementation.
Bug: time spent on debugging as opposed to coding task graphs.

Next, we study the overhead of task graph parallelism among Taskflow, oneTBB, and StarPU. As shown in Table 2, the static size of a task, compiled on our platform, is 272, 136, and 1472 bytes for Taskflow, oneTBB, and StarPU, respectively. We do not report the data of HPX and OpenMP because they do not support explicit task graph construction at the functional level. The time it takes for Taskflow to create a task and add a dependency is also faster than oneTBB and StarPU. We amortize the time across 1M operations because all systems support pooled memory to recycle tasks. We found StarPU has significant overhead in creating HTDGs. The overhead always occupies 5-10% of the total execution time regardless of the HTDG size.

TABLE 2
Overhead of Task Graph Creation

Method    Stask   Ttask    Tedge    r10    r5     r1
Taskflow   272     61 ns    14 ns    550   2550  35050
oneTBB     136     99 ns    54 ns   1225   2750  40050
StarPU    1472    259 ns   384 ns   7550     -      -

Stask: static size per task in bytes.
Ttask/Tedge: amortized time to create a task/dependency.
rv: graph size where its creation overhead is below v%.

Fig. 13 shows the overall performance comparison between Taskflow and the baseline at different HTDG sizes. In terms of runtime (top left of Fig. 13), Taskflow outperforms others across most data points. We complete the largest HTDG 1.37×, 1.44×, 1.53×, and 1.40× faster than oneTBB, StarPU, HPX, and OpenMP, respectively. The memory footprint (top right of Fig. 13) of Taskflow is close to oneTBB and OpenMP. HPX has higher memory because it relies on aggregated futures to describe task dependencies at the cost of shared states. Likewise, StarPU does not offer a closure-based interface and thus requires a flat layout (i.e., codelet) to describe tasks.

Fig. 13. Overall system performance at different problem sizes using 40 CPUs and 4 GPUs.
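To make the structure of these micro-benchmark HTDGs concrete, the following is a minimal sketch of one CPU task preceding one GPU task expressed as a cudaFlow of dependent GPU operations (H2D copies, a SAXPY kernel, and a D2H copy). It assumes the Taskflow cudaFlow interface referenced in this section; the buffer names, the 1K problem size, and the split of the H2D transfer into two input copies are illustrative rather than taken from our benchmark code.

```cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/cuda/cudaflow.hpp>
#include <vector>

// SAXPY over n elements: y = a*x + y
__global__ void saxpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = a * x[i] + y[i]; }
}

int main() {
  constexpr int N = 1024;                       // 1K elements per task
  std::vector<float> hx(N, 1.0f), hy(N, 2.0f);  // host buffers
  float *dx{nullptr}, *dy{nullptr};             // device buffers
  cudaMalloc(&dx, N * sizeof(float));
  cudaMalloc(&dy, N * sizeof(float));

  tf::Executor executor;    // spawns CPU and GPU worker threads
  tf::Taskflow taskflow;

  // a CPU task that prepares the input data
  tf::Task cpu = taskflow.emplace([&]() { /* fill hx and hy */ });

  // a GPU task: one cudaFlow of dependent GPU operations
  tf::Task gpu = taskflow.emplace([&](tf::cudaFlow& cf) {
    tf::cudaTask h2d_x  = cf.copy(dx, hx.data(), N);
    tf::cudaTask h2d_y  = cf.copy(dy, hy.data(), N);
    tf::cudaTask kernel = cf.kernel((N + 255) / 256, 256, 0, saxpy, N, 2.0f, dx, dy);
    tf::cudaTask d2h    = cf.copy(hy.data(), dy, N);
    kernel.succeed(h2d_x, h2d_y)  // input copies before the kernel
          .precede(d2h);          // kernel before the copy back
  });

  cpu.precede(gpu);               // the CPU task runs before the GPU task
  executor.run(taskflow).wait();

  cudaFree(dx);
  cudaFree(dy);
}
```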

Fig. 14. Runtime distribution of two task graphs.

We use the Linux perf tool to measure the power consumption of all cores plus LLC [3].


The total joules (bottom left of Fig. 13) consumed by
Taskflow is consistently smaller than the others, due to our
adaptive worker management. In terms of power (bottom
right of Fig. 13), Taskflow, oneTBB, and OpenMP are more
power-efficient than HPX and StarPU. The difference
between Taskflow and StarPU continues to increase as we
enlarge the HTDG size.
Fig. 14 displays the runtime distribution of each method
over a hundred runs of two HTDGs, 5K and 20K tasks. The
boxplot shows that the runtime of Taskflow is more consis-
tent than others and has the smallest variation. We attribute this result to the design of our scheduler, which effectively separates task execution into CPU and GPU workers and dynamically balances cross-domain wasteful steals with task parallelism.

Finally, we compare the throughput of each method on corunning HTDGs. This experiment emulates a server-like environment where multiple programs run simultaneously on the same machine to compete for the same resources. The effect of worker management propagates to all parallel processes. We consider up to nine corun processes, each executing the same HTDG of 20K tasks. We use the weighted speedup [23] to measure the system throughput. Fig. 15 compares the throughput of each method and relates the result to the CPU utilization. Both Taskflow and oneTBB produce significantly higher throughput than others. Our throughput is slightly better than oneTBB by 1–15% except for seven coruns. The result can be interpreted by the CPU utilization plot, reported by perf stat. We can see both Taskflow and oneTBB make effective use of CPU resources to schedule tasks. However, StarPU keeps workers busy most of the time and has no mechanism to dynamically control thread resources with task parallelism.

Fig. 15. Throughput of corunning task graphs and CPU utilization at different problem sizes under 40 CPUs and 1 GPU.

Since both oneTBB and StarPU provide explicit task graph programming models and work stealing for dynamic load balancing, we will focus on comparing Taskflow with oneTBB and StarPU for the next two real workloads.

7.3 VLSI Incremental Timing Analysis
As part of our DARPA project, we applied Taskflow to solve a VLSI incremental static timing analysis (STA) problem in an optimization loop. The goal is to optimize the timing landscape of a circuit design by iteratively applying design transforms (e.g., gate sizing, buffer insertion) and evaluating the timing improvement until all data paths are passing, aka timing closure. Achieving timing closure is one of the most time-consuming steps in the VLSI design closure flow process because optimization algorithms can call a timer millions or even billions of times to incrementally analyze the timing improvement of a design transform. We consider the GPU-accelerated critical path analysis algorithm [26] and run it across one thousand incremental iterations based on the design transforms given by the TAU 2015 Contest [29]. The data is generated by an industrial tool to evaluate the performance of an incremental timing algorithm. Each incremental iteration corresponds to at least one design modifier followed by a timing report operation to trigger an incremental timing update of the timer.

Fig. 16. A partial HTDG of 1 cudaFlow task (purple box), 4 condition tasks (green diamonds), and 8 static tasks (others) for one iteration of timing-driven optimization.

Fig. 16 shows a partial Taskflow graph of our implementation. One condition task forms a loop to implement iterative timing updates, and the other three condition tasks branch the execution to either a CPU-based timing update (over 10K tasks) or a GPU-based timing update (cudaFlow tasks). The motivation here is to adapt the timing update to different incrementalities. For example, if a design transform introduces only a few hundreds of nodes to update,

TABLE 3
Programming Effort on VLSI Timing Closure

Method    LOC   #Tokens  CC  WCC  Dev  Bug
Taskflow  3176   5989    30   67  3.9  13%
oneTBB    4671   8713    41   92  6.1  51%
StarPU    5643  13952    46   98  4.3  38%

CC: maximum cyclomatic complexity in a single function.
WCC: weighted cyclomatic complexity of the program.
Dev: hours to complete the implementation.
Bug: time spent on debugging versus coding task graphs.

there is no need to offload the computation to GPUs due to insufficient amount of data parallelism. The cudaFlow task
composes over 1K operations to compute large interconnect
delays, which often involves several gigabytes of parasitic
data. Since oneTBB FlowGraph and StarPU do not support
control flow, we unroll their task graphs across fixed-length
iterations found in hindsight to avoid expensive synchroni-
zation at each iteration; the number of concatenated graphs
is equal to the number of iterations.
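To make the in-graph control flow concrete, below is a minimal sketch of the structure in Fig. 16: one condition task selects between a CPU-based and a GPU-based timing update, and a second condition task loops the graph until the optimization finishes. Only the graph shape mirrors our implementation; the task bodies and the names num_affected_nodes, threshold, and max_iterations are illustrative placeholders.

```cpp
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  int iteration = 0;
  int num_affected_nodes = 0;      // illustrative: size of the incremental update
  const int threshold = 1000;      // illustrative cutoff for offloading to the GPU
  const int max_iterations = 1000;

  tf::Task apply_transform = taskflow.emplace([&]() {
    // apply a design transform and record how many nodes need updating
  });

  // condition task: the returned index selects which successor runs next
  tf::Task select = taskflow.emplace([&]() -> int {
    return (num_affected_nodes < threshold) ? 0 : 1;   // 0: CPU path, 1: GPU path
  });

  tf::Task cpu_update = taskflow.emplace([&]() {
    // CPU-based incremental timing update
  });
  tf::Task gpu_update = taskflow.emplace([&]() {
    // GPU-based timing update (a cudaFlow task in our implementation)
  });

  // condition task forming the optimization loop
  tf::Task check = taskflow.emplace([&]() -> int {
    return (++iteration < max_iterations) ? 0 : 1;     // 0: loop back, 1: exit
  });

  tf::Task done = taskflow.emplace([]() {});

  apply_transform.precede(select);
  select.precede(cpu_update, gpu_update);   // successors 0 and 1 of the branch
  cpu_update.precede(check);
  gpu_update.precede(check);
  check.precede(apply_transform, done);     // successor 0 loops, successor 1 exits

  executor.run(taskflow).wait();
}
```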
Table 3 compares the programming effort between Task-
flow, oneTBB, and StarPU. In a rough view, the implemen-
tation complexity using Taskflow is much less than that of
oneTBB and StarPU. The amount of time spent on imple-
menting the algorithm is about 3.9 hours for Taskflow, 6.1
hours for oneTBB, and 4.3 hours for StarPU. It takes 3–4× more time to debug oneTBB and StarPU than Taskflow, mostly on control flow. Interestingly, while StarPU involves more LOC and higher cyclomatic complexity than oneTBB, our programmers found StarPU easier to write due to its C-styled interface. Although there is no standard way to conclude the programmability of a library, we believe our measurement highlights the expressiveness of Taskflow and its ease of use from a real user's perspective.

Fig. 17. Runtime, memory, and power data of 1000 incremental timing iterations (up to 11K tasks and 17K dependencies per iteration) on a large design of 1.6M gates.

The overall performance is shown in Fig. 17. Using 40 CPUs and 1 GPU, Taskflow is consistently faster than oneTBB and StarPU across all incremental timing iterations. The gap continues to enlarge as the iteration number increases; at 100 and 1000 iterations, Taskflow reaches the goal in 3.45 and 39.11 minutes, whereas oneTBB requires 5.67 and 4.76 minutes and StarPU requires 48.51 and 55.43 minutes, respectively. Note that the gain is significant because a typical timing closure algorithm can invoke millions to billions of iterations that take several hours to finish [30]. We observed similar results at other CPU numbers; in terms of the runtime speed-up over 1 CPU (all finish in 113 minutes), Taskflow is always faster than oneTBB and StarPU, regardless of the CPU count. The speed-up of Taskflow saturates at about 16 CPUs (3×), primarily due to the inherent irregularity of the algorithm (see Fig. 16). The memory footprint (middle of Fig. 17) shows the benefit of our conditional tasking. By reusing condition tasks in the incremental timing loop, we do not suffer the significant memory growth of oneTBB and StarPU. On a vertical scale, increasing the number of CPUs bumps up the memory usage of both methods, but Taskflow consumes much less because we use only simple atomic operations to control wasteful steals. In terms of energy efficiency (bottom of Fig. 17, measured on all cores plus LLC using power/energy-pkg [3]), our scheduler is very power-efficient in completing the timing analysis workload, regardless of iterations and CPU numbers. Beyond 16 CPUs, where performance saturates, Taskflow does not suffer from increasing power as oneTBB and StarPU do, because our scheduler efficiently balances the number of workers with dynamic task parallelism.

Fig. 18. Throughput of corunning timing analysis workloads on two iteration numbers using 40 CPUs and 1 GPU.

We next compare the throughput of each implementation by corunning the same program. Corunning programs is a common strategy for optimization tools to search for the best parameters. The effect of worker management propagates to all simultaneous processes. Thus, the throughput can be a good measurement for the interoperability of a scheduling algorithm. We corun the same timing analysis program up to seven processes that compete for 40 CPUs and 1 GPU. We use the weighted speedup to measure the system throughput, which is the sum of the individual speedup of each process over a baseline execution time [23].
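Concretely, following the definition in [23], the weighted speedup of P corunning processes can be written as below, where T_i^alone denotes the baseline execution time of process i running alone and T_i^corun its execution time when corunning:

\[
\text{weighted speedup} = \sum_{i=1}^{P} \frac{T_i^{\mathrm{alone}}}{T_i^{\mathrm{corun}}}
\]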

Fig. 19. Comparison of runtime and memory between cudaFlow (CUDA Graph) and stream-based execution in the VLSI incremental timing analysis workload.

A throughput of one implies that the corun's throughput is the same as if the processes were run consecutively. Fig. 18 plots the throughput across nine coruns at two iteration numbers. Both Taskflow and oneTBB achieve decent throughput greater than one and are significantly better than StarPU. We found StarPU keeps workers busy most of the time and has no mechanism to balance the number of workers with dynamically generated task parallelism. For irregular HTDGs akin to Fig. 16, worker management is critical for corunning processes. When task parallelism becomes sparse, especially around the decision-making point of an iterative control flow, our scheduler can adaptively reduce the wasteful steals based on the active worker count, and we offer a stronger bound than oneTBB (Theorem 3). Saved wasteful resources can thus be used by other concurrent programs to increase the throughput.

Fig. 19 shows the performance advantage of CUDA Graph and its cost in handling this large GPU-accelerated timing analysis workload. The line cudaFlow represents our default implementation using explicit CUDA graph construction. The other two lines represent the implementation of the same GPU task graph but using stream and event insertions (i.e., non-CUDA Graph). As partially shown in Fig. 16, our cudaFlow composes over 1K dependent GPU operations to compute the interconnect delays. For large GPU workloads like this, the benefit of CUDA Graph is clear; we observed 9–17% runtime speed-up over stream-based implementations. The performance improvement mostly comes from reduced kernel call overheads and graph-level scheduling optimizations by the CUDA runtime. Despite the improved performance, cudaFlow incurs higher memory costs because CUDA Graph stores all kernel parameters in advance for optimization. For instance, creating a node in CUDA Graph can take over 300 bytes of opaque data structures.

7.4 Large Sparse Neural Network Inference
We applied Taskflow to solve the MIT/Amazon Large Sparse Deep Neural Network (LSDNN) Inference Challenge, a recent effort aimed at new computing methods for sparse AI analytics [35]. Each dataset comprises a sparse matrix of the input data for the network, 1920 layers of neurons stored in sparse matrices, truth categories, and the bias values used for the inference. Preloading the network to the GPU is impossible. Thus, we implement a model decomposition-based kernel algorithm inspired by [19] and construct an end-to-end HTDG for the entire inference workload.

Fig. 20. A partial HTDG of 4 cudaFlows (purple boxes), 8 conditioned cycles (green diamonds), and 6 static tasks (others) for the inference workload.

Unlike VLSI incremental timing analysis, this workload is both CPU- and GPU-heavy. Fig. 20 illustrates a partial HTDG. We create up to 4 cudaFlows on 4 GPUs. Each cudaFlow contains more than 2K GPU operations to run partitioned matrices in an iterative data dispatching loop formed by a condition task. Other CPU tasks evaluate the results with a golden reference. Since oneTBB FlowGraph and StarPU do not support in-graph control flow, we unroll their task graphs across fixed-length iterations found offline.

Fig. 21. Runtime and memory data of the LSDNN (1920 layers, 4096 neurons per layer) under different CPU and GPU numbers.

Fig. 21 compares the performance of solving a 1920-layered LSDNN with 4096 neurons per layer under different CPU and GPU numbers. Taskflow outperforms oneTBB and StarPU in all aspects. Both our runtime and memory scale better regardless of the CPU and GPU numbers. Using 4 GPUs, when performance saturates at 4 CPUs, we do not suffer from further runtime growth as oneTBB and StarPU do. This is because our work-stealing algorithm more efficiently controls wasteful steals upon available task parallelism. On the other hand, our memory usage is 1.5–1.7× less than that of oneTBB and StarPU. This result highlights the benefit of our condition task, which integrates iterative control flow into a cyclic HTDG, rather than unrolling it statically across iterations.
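The cyclic structure described above can be sketched as follows: a single condition task drives an iterative dispatch of matrix partitions into a cudaFlow, so one graph copy serves all iterations instead of concatenating one copy per iteration. This is only an illustrative skeleton; the partitioning logic, the kernel contents, and the names num_partitions and partition are placeholders rather than our actual inference code.

```cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/cuda/cudaflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  int partition = 0;
  const int num_partitions = 8;   // illustrative: partitions of the layer matrices

  // stage the next partition of sparse matrices on the host
  tf::Task stage = taskflow.emplace([&]() { /* prepare partition data */ });

  // run the staged partition on the GPU as one cudaFlow (copies + kernels)
  tf::Task infer = taskflow.emplace([&](tf::cudaFlow& cf) {
    // in the real workload this composes >2K dependent GPU operations
  });

  // condition task: loop back while partitions remain, otherwise move on
  tf::Task more = taskflow.emplace([&]() -> int {
    return (++partition < num_partitions) ? 0 : 1;
  });

  // CPU task that evaluates the results against a golden reference
  tf::Task score = taskflow.emplace([]() { /* compare with truth categories */ });

  stage.precede(infer);
  infer.precede(more);
  more.precede(stage, score);   // successor 0 loops back, successor 1 exits

  executor.run(taskflow).wait();
}
```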

We next compare the throughput of each implementation by corunning the same inference program to study the interoperability of an implementation. We corun the same inference program up to nine processes that compete for 40 CPUs and 4 GPUs. We use weighted speedup to measure the throughput. Fig. 22 plots the throughput of corunning inference programs on two different sparse neural networks. Taskflow outperforms oneTBB and StarPU across all coruns. oneTBB is slightly better than StarPU because StarPU tends to keep all workers busy all the time and results in large numbers of wasteful steals. The largest difference is observed at five coruns of inferencing the 1920×4096 neural network, where our throughput is 1.9× higher than oneTBB and 2.1× higher than StarPU. These CPU- and GPU-intensive workloads highlight the effectiveness of our heterogeneous work stealing. By keeping a per-domain invariant, we can control cross-domain wasteful steals to a bounded value at any time during the execution.

Fig. 22. Throughput of corunning inference workloads on two 1920-layered neural networks, one with 4096 neurons per layer and another with 65536 neurons per layer.

We study the performance of our cudaFlow capturer using different numbers of streams (i.e., max streams). For complex GPU workloads like Fig. 20, stream concurrency is crucial to GPU performance. As shown in Fig. 23, explicit construction of a CUDA graph using cudaFlow achieves the best performance, because the CUDA runtime can dynamically decide the stream concurrency with internal optimization. For applications that must use existing stream-based APIs, our cudaFlow capturer achieves comparable performance as cudaFlow by using two or four streams. Taking the 1920×65536 neural network for example, the difference between our capturer with four streams and cudaFlow is only 10 ms. For this particular workload, we do not observe any performance benefit beyond four streams. Application developers can fine-tune this number.

Fig. 23. Performance of our cudaFlow capturer using 1, 2, 4, and 8 streams to complete the inference of two neural networks.

We finally compare the performance of cudaFlow with stream-based execution. As shown in Fig. 24, the line cudaFlow represents our default implementation using explicit CUDA graph construction, and the other lines represent stream-based implementations for the same task graph using one, two, and four streams. The advantage of CUDA Graph is clearly demonstrated in this large machine learning workload of over 2K dependent GPU operations per cudaFlow. Under four streams that deliver the best performance for the baseline, cudaFlow is 1.5× (1451 versus 2172) faster at one GPU and 1.9× (750 versus 1423) faster at four GPUs. The cost of this performance improvement is increased memory usage because CUDA Graph needs to store all the operating parameters in the graph. For instance, under four streams, cudaFlow has 4% and 6% higher memory usage than stream-based execution at one and four GPUs, respectively.

Fig. 24. Comparison of runtime and memory between cudaFlow (CUDA Graph) and stream-based execution.

8 RELATED WORK
8.1 Heterogeneous Programming Systems
Heterogeneous programming systems are the main driving force to advance scientific computing. Directive-based programming models [5], [6], [7], [25], [38] allow users to augment programs with information of loop mapping onto CPUs/GPUs and data-sharing rules to designated compilers for automatic parallel code generation. These models are good at loop-based parallelism but cannot handle irregular task graph patterns efficiently [37]. Functional approaches [2], [15], [17], [18], [20], [24], [32], [33], [34], [41] offer either implicit or explicit task graph constructs that are more flexible in runtime control and on-demand tasking. Each of these systems has its pros and cons. However, few of them enable end-to-end expressions of heterogeneously dependent tasks with general control flow.

8.2 Heterogeneous Scheduling Algorithms
Among various heterogeneous runtimes, work stealing is a popular strategy to reduce the complexity of load balancing [16], [41] and has inspired the designs of many parallel runtimes [2], [12], [39], [40], [52]. A key challenge in work-stealing designs is worker management. Instead of keeping all workers busy most of the time [16], [17], [32], both oneTBB [2] and BWS [23] have developed sleep-based strategies. oneTBB employs a mixed strategy of fixed-number worker notification, exponential backoff, and noop assembly. BWS modifies the OS kernel to alter the yield behavior. [42] takes inspiration from BWS and oneTBB to develop an adaptive work-stealing algorithm to minimize the number of wasteful steals. Other approaches, such as [13] that targets a space-sharing environment, [47] that tunes hardware frequency scaling, [22], [48] that balance load on distributed memory, [21], [27], [51], [57] that deal with data locality, and [49] that focuses on memory-bound applications, have

improved work stealing in certain performance aspects, but their results are limited to the CPU domain. How to migrate the above approaches to a heterogeneous target remains an open question.

In terms of GPU-based task schedulers, Whippletree [50] designs a fine-grained resource scheduling algorithm for sparse and scattered parallelism atop a custom program model. [45] leverages reinforcement learning to place machine learning workloads onto GPUs. Hipacc [46] introduces a pipeline-based optimization for CUDA graphs to speed up image processing workloads. [55] develops a compiler to transform OpenMP directives to a CUDA graph. These works have primarily focused on scheduling GPU tasks in various applications, which are orthogonal to our generic heterogeneous scheduling approaches.

9 CONCLUSION
In this paper, we have introduced Taskflow, a lightweight task graph computing system to streamline the creation of heterogeneous programs with control flow. Taskflow has introduced a new programming model that enables an end-to-end expression of heterogeneously dependent tasks with general control flow. We have developed an efficient work-stealing runtime optimized for latency, energy efficiency, and throughput, and derived theory results to justify its efficiency. We have evaluated the performance of Taskflow on both micro-benchmarks and real applications. As an example, Taskflow solved a large-scale machine learning problem up to 29% faster, with 1.5× less memory and 1.9× higher throughput than the industrial system, oneTBB, on a machine of 40 CPUs and 4 GPUs.

Taskflow is an on-going project under active development. We are currently exploring three directions: First, we are designing a distributed tasking model based on partitioned taskflow containers with each container running on a remote machine. Second, we are extending our model to incorporate SYCL [9] to provide a single-source heterogeneous task graph programming environment. The author Dr. Huang is a member of the SYCL Advisory Panel and is collaborating with the working group to design a new SYCL Graph abstraction. Third, we are researching automatic translation methods between different task graph programming models using Taskflow as an intermediate representation. Programming-model translation has emerged as an important research area in today's diverse computing environments because no one programming model is optimal across all applications. The recent 2021 DOE X-Stack program directly calls for novel translation methods to facilitate performance optimizations on different computing environments.

One important future direction is to collaborate with Nvidia CUDA teams to design a conditional tasking interface within the CUDA Graph itself. This design will enable efficient control-flow decisions to be made completely in the CUDA runtime, thereby largely reducing the control-flow cost between the CPU and GPU.

ACKNOWLEDGMENTS
We appreciate all Taskflow contributors and reviewers' comments for improving this article.

REFERENCES
[1] DARPA, "Intelligent design of electronic assets (IDEA) program," 2021. [Online]. Available: https://www.darpa.mil/program/intelligent-design-of-electronic-assets
[2] Intel oneTBB, 2021. [Online]. Available: https://github.com/oneapi-src/oneTBB
[3] Linux kernel profiler, 2021. [Online]. Available: https://man7.org/linux/man-pages/man1/perf-stat.1.html
[4] Nvidia CUDA graph, 2021. [Online]. Available: https://devblogs.nvidia.com/cuda-10-features-revealed/
[5] OmpSs, 2021. [Online]. Available: https://pm.bsc.es/ompss
[6] OpenACC, 2021. [Online]. Available: http://www.openacc-standard.org
[7] OpenMP, 2021. [Online]. Available: https://www.openmp.org/
[8] SLOCCount, 2021. [Online]. Available: https://dwheeler.com/sloccount/
[9] SYCL, 2021. [Online]. Available: https://www.khronos.org/sycl/
[10] Taskflow GitHub, 2021. [Online]. Available: https://taskflow.github.io/
[11] Two-phase commit protocol, 2021. [Online]. Available: http://www.1024cores.net/home/lock-free-algorithms/eventcounts
[12] K. Agrawal, C. E. Leiserson, and J. Sukha, "Nabbit: Executing task graphs using work-stealing," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2010, pp. 1–12.
[13] K. Agrawal, Y. He, and C. E. Leiserson, "Adaptive work stealing with parallelism feedback," in Proc. 12th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2007, pp. 112–120.
[14] T. Ajayi et al., "Toward an open-source digital flow: First learnings from the OpenROAD project," in Proc. 56th Annu. Des. Automat. Conf., 2019, pp. 1–4.
[15] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati, Fastflow: High-Level and Efficient Streaming on Multicore. Hoboken, NJ, USA: Wiley, 2017, ch. 13, pp. 261–280.
[16] N. S. Arora, R. D. Blumofe, and C. G. Plaxton, "Thread scheduling for multiprogrammed multiprocessors," in Proc. 10th Annu. ACM Symp. Parallel Algorithms Architectures, 1998, pp. 119–129.
[17] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," Concurrency Comput.: Pract. Experience, vol. 23, no. 2, pp. 187–198, 2011.
[18] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Legion: Expressing locality and independence with logical regions," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2012, pp. 1–11.
[19] M. Bisson and M. Fatica, "A GPU implementation of the sparse deep neural network graph challenge," in Proc. IEEE High Perform. Extreme Comput. Conf., 2019, pp. 1–8.
[20] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault, and J. J. Dongarra, "PaRSEC: Exploiting heterogeneity to enhance scalability," Comput. Sci. Eng., vol. 15, no. 6, pp. 36–45, 2013.
[21] Q. Chen, M. Guo, and H. Guan, "LAWS: Locality-aware work-stealing for multi-socket multi-core architectures," ACM Trans. Architecture Code Optim., 2014, pp. 1–24.
[22] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, "Scalable work stealing," in Proc. Conf. High Perform. Comput. Netw., Storage Anal., 2009, pp. 1–11.
[23] X. Ding, K. Wang, P. B. Gibbons, and X. Zhang, "BWS: Balanced work stealing for time-sharing multicores," in Proc. 7th ACM Eur. Conf. Comput. Syst., 2012, pp. 365–378.
[24] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns," J. Parallel Distrib. Comput., vol. 74, no. 12, pp. 3202–3216, 2014.
[25] T. Gautier, J. V. F. Lima, N. Maillard, and B. Raffin, "XKaapi: A runtime system for data-flow task programming on heterogeneous architectures," in Proc. IEEE 27th Int. Symp. Parallel Distrib. Process., 2013, pp. 1299–1308.
[26] G. Guo, T.-W. Huang, Y. Lin, and M. Wong, "GPU-accelerated path-based timing analysis," in Proc. 39th Int. Conf. Comput.-Aided Des., 2021, pp. 1–9.
[27] Y. Guo, "A scalable locality-aware adaptive work-stealing scheduler for multi-core task parallelism," Ph.D. dissertation, Rice Univ., Houston, TX, USA, 2010.
[28] Z. Guo, T.-W. Huang, and Y. Lin, "GPU-accelerated static timing analysis," in Proc. 39th Int. Conf. Comput.-Aided Des., 2020, pp. 1–8.
[29] J. Hu, G. Schaeffer, and V. Garg, "TAU 2015 contest on incremental timing analysis," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2015, pp. 895–902.

[30] T.-W. Huang, G. Guo, C.-X. Lin, and M. Wong, "OpenTimer v2: A new parallel incremental timing analysis engine," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 40, no. 4, pp. 776–789, Apr. 2021.
[31] T.-W. Huang, D.-L. Lin, Y. Lin, and C.-X. Lin, "Taskflow: A general-purpose parallel and heterogeneous task programming system," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., to be published, doi: 10.1109/TCAD.2021.3082507.
[32] T.-W. Huang, Y. Lin, C.-X. Lin, G. Guo, and M. D. F. Wong, "Cpp-taskflow: A general-purpose parallel task programming system at scale," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 40, no. 8, pp. 1687–1700, Aug. 2021.
[33] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, "HPX: A task based programming model in a global address space," in Proc. 8th Int. Conf. Partitioned Glob. Address Space Program. Models, 2014, pp. 6:1–6:11.
[34] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on C++," in ACM SIGPLAN Notices, 1993, pp. 91–108.
[35] J. Kepner, S. Alford, V. Gadepally, M. Jones, L. Milechin, R. Robinett, and S. Samsi, "Sparse deep neural network graph challenge," in Proc. IEEE High Perform. Extreme Comput. Conf., 2019, pp. 1–7.
[36] N. M. Lê, A. Pop, A. Cohen, and F. Z. Nardelli, "Correct and efficient work-stealing for weak memory models," in Proc. 18th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2013, pp. 69–80.
[37] S. Lee and J. S. Vetter, "Early evaluation of directive-based GPU programming models for productive exascale computing," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2012, pp. 1–11.
[38] S. Lee and R. Eigenmann, "OpenMPC: Extended OpenMP programming and tuning for GPUs," in Proc. ACM/IEEE Int. Conf. High Perform. Comput., Netw., Storage Anal., 2010, pp. 1–11.
[39] D. Leijen, W. Schulte, and S. Burckhardt, "The design of a task parallel library," ACM SIGPLAN Notices, vol. 44, pp. 227–241, 2009.
[40] C. E. Leiserson, "The Cilk++ concurrency platform," in Proc. 46th Annu. Des. Automat. Conf., 2009, pp. 522–527.
[41] J. V. F. Lima, T. Gautier, V. Danjean, B. Raffin, and N. Maillard, "Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures," Parallel Comput., vol. 44, pp. 37–52, 2015.
[42] C.-X. Lin, T.-W. Huang, and M. D. F. Wong, "An efficient work-stealing scheduler for task dependency graph," in Proc. IEEE 26th Int. Conf. Parallel Distrib. Syst., 2020, pp. 64–71.
[43] Y. Lin, W. Li, J. Gu, H. Ren, B. Khailany, and D. Z. Pan, "ABCDPlace: Accelerated batch-based concurrent detailed placement on multi-threaded CPUs and GPUs," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 12, pp. 5083–5096, Dec. 2020.
[44] Y.-S. Lu and K. Pingali, "Can parallel programming revolutionize EDA tools?," in Advanced Logic Synthesis. Berlin, Germany: Springer, 2018, pp. 21–41.
[45] A. Mirhoseini et al., "Device placement optimization with reinforcement learning," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2430–2439.
[46] B. Qiao, M. A. Özkan, J. Teich, and F. Hannig, "The best of both worlds: Combining CUDA graph with an image processing DSL," in Proc. 57th ACM/IEEE Des. Automat. Conf., 2020, pp. 1–6.
[47] H. Ribic and Y. D. Liu, "Energy-efficient work-stealing language runtimes," in Proc. 19th Int. Conf. Architectural Support Program. Lang. Operat. Syst., 2014, pp. 513–528.
[48] V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krishnamoorthy, "Lifeline-based global load balancing," in Proc. 16th ACM Symp. Princ. Pract. Parallel Program., 2011, pp. 201–212.
[49] S. Shiina and K. Taura, "Almost deterministic work stealing," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2019, pp. 1–16.
[50] M. Steinberger, M. Kenzel, P. Boechat, B. Kerbl, M. Dokter, and D. Schmalstieg, "Whippletree: Task-based scheduling of dynamic workloads on the GPU," ACM Trans. Graph., vol. 33, no. 6, pp. 1–11, Nov. 2014.
[51] W. Suksompong, C. E. Leiserson, and T. B. Schardl, "On the efficiency of localized work stealing," Inf. Process. Lett., vol. 116, no. 2, pp. 100–106, Feb. 2016.
[52] O. Tardieu, H. Wang, and H. Lin, "A work-stealing scheduler for X10's task parallelism with suspension," ACM SIGPLAN Notices, vol. 47, no. 8, pp. 267–276, 2012.
[53] D. F. Wong, H. W. Leong, and C. L. Liu, Simulated Annealing for VLSI Design. Norwell, MA, USA: Kluwer Academic, 1988.
[54] B. Xu et al., "MAGICAL: Toward fully automated analog IC layout leveraging human and machine intelligence: Invited paper," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2019, pp. 1–8.
[55] C. Yu, S. Royuela, and E. Quiñones, "OpenMP to CUDA graphs: A compiler-based transformation to enhance the programmability of NVIDIA devices," in Proc. Int. Workshop Softw. Compilers Embedded Syst., 2020, pp. 42–47.
[56] Y. Yu et al., "Dynamic control flow in large-scale machine learning," in Proc. 13th EuroSys Conf., 2018, pp. 1–15.
[57] H. Zhao et al., "Bandwidth and locality aware task-stealing for manycore architectures with bandwidth-asymmetric memory," ACM Trans. Architect. Code Optim., vol. 15, no. 4, pp. 1–26, 2018.

Tsung-Wei Huang received the BS and MS degrees from the Department of Computer Science, National Cheng Kung University, Tainan, Taiwan, in 2010 and 2011, respectively, and the PhD degree from the Department of Electrical and Computer Engineering (ECE), University of Illinois at Urbana-Champaign. He is currently an assistant professor with the Department of ECE, University of Utah. His research interests include building software systems for parallel computing and timing analysis. He was the recipient of the prestigious 2019 ACM SIGDA Outstanding PhD Dissertation Award for his contributions to distributed and parallel VLSI timing analysis in his PhD thesis.

Dian-Lun Lin received the BS degree from the Department of Electrical Engineering, Taiwan's Cheng Kung University, and the MS degree from the Department of Computer Science, National Taiwan University. He is currently working toward the PhD degree with the Department of Electrical and Computer Engineering, University of Utah. His research interests include parallel and heterogeneous computing with a specific focus on CAD applications.

Chun-Xun Lin received the BS degree in electrical engineering from the National Cheng Kung University, Tainan, Taiwan, and the MS degree in electronics engineering from the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, in 2009 and 2011, respectively, and the PhD degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, in 2020. His research interest includes parallel processing.

Yibo Lin (Member, IEEE) received the BS degree in microelectronics from Shanghai Jiaotong University in 2013, and the PhD degree from the Department of Electrical and Computer Engineering, University of Texas at Austin, in 2018. He is currently an assistant professor with the Department of Computer Science associated with the Center for Energy-Efficient Computing and Applications, Peking University, China. His research interests include physical design, machine learning applications, GPU acceleration, and hardware security.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
