Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System Using Modern C++
Tsung-Wei Huang
CppCon 2020
Why Parallel Computing?
• It’s critical to advance your application performance
[Figure: speedup over 1 CPU — 10x faster with 40 CPUs, 100x faster with 1 GPU]
Parallel Programming is Not Easy, Yet
• You need to deal with many difficult technical details:
  • Standard concurrency control
  • Task dependencies
  • Scheduling
  • Data race
  • … (more)
[Figure: word cloud of challenges — constraints, efficiency, task and data race, dependency, scheduling, debugging, concurrency control; many developers have a hard time getting them right!]
Taskflow offers a solution
“Hello World” in Taskflow
#include <taskflow/taskflow.hpp> // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C); // A runs before B and C
  D.succeed(B, C); // D runs after B and C
  executor.run(taskflow).wait(); // submit the taskflow to the executor
  return 0;
}
Only 15 lines of code to get a parallel task execution!
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Motivation: Parallelizing VLSI CAD Tools
• Billions of tasks with diverse computational patterns
  • Machine learning in the loop
  • NP-hard problems over trees, graphs, and dynamic structures
  • 10B+ transistors in a modern chip
[Figure: VLSI CAD flow — system specification, partition, architecture, floorplan, placement, physical design, DRC/LVS — alongside partial task graphs of iterations 0 and 1 with CPU-GPU kernels such as construct_cost_matrices_kernel, maximum_independent_set_parallel_kernel, solve_assignment_kernel, and apply_solution_kernel]
How can we write efficient C++ parallel programs for this monster computational task graph with millions of CPU-GPU dependent tasks along with algorithmic control flow?
We Invested a Lot in Existing Tools …
Two Big Problems of Existing Tools
• Our problems define complex task dependencies
  • Example: analysis algorithms compute a circuit network of millions of nodes and dependencies
  • Problem: existing tools are often good at loop parallelism but weak at expressing heterogeneous task graphs at this large scale
• Our problems define complex control flow
  • Example: optimization algorithms make essential use of dynamic control flow to implement various patterns
    • Combinatorial optimization, analytical methods
  • Problem: existing tools are directed acyclic graph (DAG)-based and do not anticipate cycles or conditional dependencies, lacking end-to-end parallelism
Example: An Iterative Optimizer
• 4 computational tasks with dynamic control flow
  #1: starts with the init task
  #2: enters the optimizer task (e.g., a GPU math solver)
  #3: checks if the optimization converged
    • No: loops back to the optimizer
    • Yes: proceeds to stop
  #4: outputs the result
How can we easily describe this workload with dynamic control flow using existing tools (e.g., OpenMP, TBB, StarPU, SYCL, Kokkos)?
[Diagram: init → optimizer → converged? —N→ optimizer (loop back), —Y→ output]
Many arguments are based on my personal opinions – no offense, no criticism, just plain C++ from an end user’s perspective
“Hello World” in Taskflow (Revisited)
#include <taskflow/taskflow.hpp> // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C); // A runs before B and C
  D.succeed(B, C); // D runs after B and C
  executor.run(taskflow).wait(); // submit the taskflow to the executor
  return 0;
}
Taskflow defines five tasks:
1. static task
2. dynamic task
3. cudaFlow task
4. condition task
5. module task
“Hello World” in OpenMP
#include <omp.h>    // OpenMP is a language extension to describe parallelism using compiler directives
#include <iostream>
#include <thread>
int main(){
  #pragma omp parallel num_threads(std::thread::hardware_concurrency())
  #pragma omp single   // a single thread creates the tasks
  {
    int A_B, A_C, B_D, C_D;
    #pragma omp task depend(out: A_B, A_C)            // task dependency clauses
    { std::cout << "TaskA\n"; }
    #pragma omp task depend(in: A_B) depend(out: B_D) // task dependency clauses
    { std::cout << "TaskB\n"; }
    #pragma omp task depend(in: A_C) depend(out: C_D) // task dependency clauses
    { std::cout << "TaskC\n"; }
    #pragma omp task depend(in: B_D, C_D)             // task dependency clauses
    { std::cout << "TaskD\n"; }
  }
  return 0;
}
OpenMP task clauses are static and explicit; programmers are responsible for a proper order of writing tasks consistent with sequential execution
“Hello World” in TBB
#include <tbb/tbb.h> // Intel’s TBB is a general-purpose parallel programming library in C++
int main(){
  using namespace tbb;
  using namespace tbb::flow;
  int n = task_scheduler_init::default_num_threads();
  task_scheduler_init init(n);
  graph g; // use TBB’s FlowGraph for task parallelism
  // declare each task as a continue_node
  continue_node<continue_msg> A(g, [](const continue_msg&){ std::cout << "TaskA"; });
  continue_node<continue_msg> B(g, [](const continue_msg&){ std::cout << "TaskB"; });
  continue_node<continue_msg> C(g, [](const continue_msg&){ std::cout << "TaskC"; });
  continue_node<continue_msg> D(g, [](const continue_msg&){ std::cout << "TaskD"; });
  make_edge(A, B);
  make_edge(A, C);
  make_edge(B, D);
  make_edge(C, D);
  A.try_put(continue_msg());
  g.wait_for_all();
}
TBB has excellent performance in generic parallel computing. Its drawback is mostly in the ease-of-use standpoint (simplicity, expressivity, and programmability).
TBB FlowGraph: https://software.intel.com/content/www/us/en/develop/home.html
“Hello World” in Kokkos
// fixed-layout task functors (no lambda interface …?)
struct A {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskA\n"; }
};
struct B {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskB\n"; }
};
struct C {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskC\n"; }
};
struct D {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskD\n"; }
};
// task dependency is represented by instances of Kokkos::BasicFuture
auto scheduler = scheduler_type(/* ... */);
auto futA = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler), A() );
auto futB = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), B() );
auto futC = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), C() );
auto futD = Kokkos::host_spawn(                   // aggregated dependencies
  Kokkos::TaskSingle(scheduler, when_all(futB, futC)), D()
);
// more scheduling code to follow …
Kokkos is powerful in describing asynchronous tasks but not efficient in large task graph parallelism
Kokkos task parallelism: https://github.com/kokkos/kokkos/wiki/Task-Parallelism
“Hello World” Summary (Less Biased)
Vote for Simplicity (100 C++ programmers with 2-5 years of C++11 experience)
[Pie chart: vote shares of 74, 15, 6, 4, and 1]
Dynamic Tasking (Subflow)
// create three regular tasks (the dynamic subflow task B is sketched below)
tf::Task A = taskflow.emplace([](){}).name("A");
tf::Task C = taskflow.emplace([](){}).name("C");
tf::Task D = taskflow.emplace([](){}).name("D");
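The slide’s snippet stops before the dynamic part. Below is a sketch of the missing subflow task B, patterned after the subflow example in the Taskflow documentation; the task bodies are placeholders:

// sketch: task B spawns a subflow of B1, B2, B3 at runtime (placeholder bodies)
tf::Task B = taskflow.emplace([](tf::Subflow& subflow){
  tf::Task B1 = subflow.emplace([](){}).name("B1");
  tf::Task B2 = subflow.emplace([](){}).name("B2");
  tf::Task B3 = subflow.emplace([](){}).name("B3");
  B3.succeed(B1, B2); // B3 runs after B1 and B2
}).name("B");
A.precede(B, C); // A runs before B and C
D.succeed(B, C); // D runs after B and C

By default a subflow joins its parent task, so B’s successors run only after B1-B3 finish.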
Subflow can be Nested
• Find the 7th Fibonacci number using a subflow
• Fib(n) = Fib(n-1) + Fib(n-2)
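A sketch following the recursive Fibonacci example shipped with Taskflow: each task spawns two nested subflows and joins them before summing. The helper spawn() is how that example structures the recursion:

#include <taskflow/taskflow.hpp>
#include <iostream>

// each call spawns two child subflow tasks and joins them
int spawn(int n, tf::Subflow& sbf) {
  if (n < 2) return n;
  int res1, res2;
  sbf.emplace([&res1, n](tf::Subflow& sbf){ res1 = spawn(n - 1, sbf); });
  sbf.emplace([&res2, n](tf::Subflow& sbf){ res2 = spawn(n - 2, sbf); });
  sbf.join(); // wait for the two nested subflows before summing
  return res1 + res2;
}

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  int res;
  taskflow.emplace([&res](tf::Subflow& sbf){ res = spawn(7, sbf); });
  executor.run(taskflow).wait();
  std::cout << "Fib(7) = " << res << '\n'; // 13
}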
Heterogeneous Tasking (cudaFlow)
const unsigned N = 1<<20;
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};
auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, 4*N); });
auto allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, 4*N); });
auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf){
  // ... define the GPU work (copies and a saxpy kernel) inside the cudaFlow graph
});
cudaflow.succeed(allocate_x, allocate_y);
executor.run(taskflow).wait();
Users define GPU work in a graph rather than aggregated operations → single kernel launch to reduce overheads
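For reference, the elided cudaFlow body might look like the following sketch, patterned after the saxpy example in the Taskflow repository; the saxpy kernel is an assumed user-defined CUDA kernel:

// assumed user-defined kernel: y = a*x + y
__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf){
  auto h2d_x  = cf.copy(dx, hx.data(), N);  // host-to-device copies
  auto h2d_y  = cf.copy(dy, hy.data(), N);
  auto kernel = cf.kernel((N + 255) / 256, 256, 0, saxpy, N, 2.0f, dx, dy);
  auto d2h_x  = cf.copy(hx.data(), dx, N);  // device-to-host copies
  auto d2h_y  = cf.copy(hy.data(), dy, N);
  kernel.succeed(h2d_x, h2d_y)              // kernel runs after the copies in
        .precede(d2h_x, d2h_y);             // and before the copies out
}).name("saxpy");

Because the copies and the kernel form one cudaFlow graph, the whole GPU workload can be launched at once, which is where the overhead reduction on this slide comes from.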
Three Key Motivations
• Our closure enables a stateful interface
  • Users capture data by reference to marshal data exchange between CPU and GPU tasks
• Our closure hides implementation details judiciously
  • We use cudaGraph (since CUDA 10) due to its excellent performance, much faster than streams in large graphs
• Our closure extends to new accelerator types
  • syclFlow, openclFlow, coralFlow, tpuFlow, fpgaFlow, etc.
We do not simplify kernel programming but focus on CPU-GPU tasking, which affects performance to a large extent! (same for data abstraction)
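To illustrate the stateful closure (first bullet), a minimal sketch reusing names from the previous slide (taskflow, N, hx): the pointer dx is captured by reference, so the later GPU task sees the value assigned by the allocation task:

// dx is captured by reference: allocate_x assigns it at run time,
// and the cudaFlow task reads the assigned value afterwards
float* dx{nullptr};
auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, 4*N); });
auto gpu_task   = taskflow.emplace([&](tf::cudaFlow& cf){
  cf.copy(dx, hx.data(), N); // sees the dx written by allocate_x
});
allocate_x.precede(gpu_task); // the dependency guarantees dx is valid here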
Conditional Tasking
auto init = taskflow.emplace([&](){ initialize_data_structure(); })
                    .name("init");
auto optimizer = taskflow.emplace([&](){ matrix_solver(); })
                         .name("optimizer");
auto converged = taskflow.emplace([&](){ return is_converged() ? 1 : 0; })
                         .name("converged"); // is_converged(): user-defined convergence check
auto output = taskflow.emplace([&](){ std::cout << "done!\n"; })
                      .name("output");
init.precede(optimizer);
optimizer.precede(converged);
converged.precede(optimizer, output); // returning 0 loops back to the optimizer
[Diagram: init → optimizer → converged? —0→ optimizer (loop back), —1→ output]
Condition task integrates control flow into a task graph to form end-to-end parallelism; in this example, there are ultimately four tasks ever created
Conditional Tasking (cont’d)
auto A = taskflow.emplace([&](){});
auto B = taskflow.emplace([&](){ return rand()%2; });
auto C = taskflow.emplace([&](){ return rand()%2; });
auto D = taskflow.emplace([&](){ return rand()%2; });
auto E = taskflow.emplace([&](){ return rand()%2; });
auto F = taskflow.emplace([&](){ return rand()%2; });
auto G = taskflow.emplace([&](){});
A.precede(B).name("init");
B.precede(C, B).name("flip-coin-1"); // each task flips a binary coin to decide the next path
C.precede(D, B).name("flip-coin-2");
D.precede(E, B).name("flip-coin-3");
E.precede(F, B).name("flip-coin-4");
F.precede(G, B).name("flip-coin-5");
G.name("end");
Composable Tasking
tf::Taskflow f1, f2;
// ... create tasks in f1, and tasks f2A, f2B, f2C in f2
auto f1_module_task = f2.composed_of(f1); // compose f1 as a module task of f2
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);
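A fuller sketch of this composition, patterned after the composition example in the Taskflow documentation; the task bodies are placeholders:

tf::Taskflow f1, f2;

// f1 has three tasks (placeholder bodies)
auto [f1A, f1B, f1C] = f1.emplace(
  [](){ std::cout << "f1A\n"; },
  [](){ std::cout << "f1B\n"; },
  [](){ std::cout << "f1C\n"; }
);
f1C.succeed(f1A, f1B);

// f2 has three tasks and composes f1 as a module task
auto [f2A, f2B, f2C] = f2.emplace(
  [](){ std::cout << "f2A\n"; },
  [](){ std::cout << "f2B\n"; },
  [](){ std::cout << "f2C\n"; }
);
auto f1_module_task = f2.composed_of(f1);
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);

composed_of wraps f1 in a module task inside f2, so the same taskflow can be reused across multiple graphs.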
Everything is Unified in Taskflow
• Use “emplace” to create a task
• Use “precede” to add a task dependency
• No need to learn different sets of API
• You can create a really complex graph
  • Subflow(ConditionTask(cudaFlow))
  • ConditionTask(StaticTask(cudaFlow))
  • Composition(Subflow(ConditionTask))
  • …
[Figure: a taskflow graph mixing dynamic tasks, cudaFlows, composition, and control flow]
Submit Taskflow to Executor
• Executor manages a set of threads to run taskflows
  • All execution methods are non-blocking
  • All execution methods are thread-safe
{
  tf::Taskflow taskflow1, taskflow2, taskflow3;
  tf::Executor executor;
  // create tasks and dependencies
  // …
  auto future1 = executor.run(taskflow1);
  auto future2 = executor.run_n(taskflow2, 1000);
  auto future3 = executor.run_until(taskflow3, [i=0]() mutable { return i++ > 5; });
  executor.async([](){ std::cout << "async task\n"; });
  executor.wait_for_all(); // wait for all the above tasks to finish
}
Executor Scheduling Algorithm
• Task-level scheduling
  • Decides how tasks are enqueued under control flow
    • Goal #1: ensures a feasible path to carry out control flow
    • Goal #2: avoids task race under cyclic and conditional execution
    • Goal #3: maximizes the capability of conditional tasking
• Worker-level scheduling
  • Decides how tasks are executed and by which workers
    • Goal #1: adopts work stealing to dynamically balance load
    • Goal #2: adapts workers to available task parallelism
    • Goal #3: maximizes performance, energy efficiency, and throughput
Task-level Scheduling
• “Strong dependency” versus “weak dependency”
  • Weak dependency: dependencies out of condition tasks
  • Strong dependency: all the others
[Scheduling loop: if the queue is empty, wait for tasks; otherwise dequeue a task t. If t is a condition task, compute r = invoke(t) and enqueue only the r-th successor; otherwise invoke(t), decrement the strong dependency count of t’s successors by one, and enqueue the successors whose count reaches zero.]
[Diagram: the init → optimizer → converged? → output example, where the 0-edge back to optimizer and the 1-edge to output are weak dependencies]
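To make the loop concrete, here is a minimal, self-contained C++ sketch of the task-level scheduling loop above; the Node type and the single-threaded driver are illustrative assumptions, not Taskflow’s real internals:

#include <atomic>
#include <deque>
#include <functional>
#include <vector>

// illustrative task node, not Taskflow's real internals
struct Node {
  std::function<int()> work;            // condition tasks return a successor index
  bool is_condition = false;
  std::atomic<int> strong_dependents{0};
  std::vector<Node*> successors;
};

// single-threaded sketch of the task-level scheduling loop
void schedule(std::deque<Node*> queue) {
  while (!queue.empty()) {              // a real executor waits for more tasks here
    Node* t = queue.front(); queue.pop_front();
    int r = t->work();                  // invoke(t); r is ignored for non-condition tasks
    if (t->is_condition) {
      queue.push_back(t->successors[r]);     // enqueue only the r-th successor
    } else {
      for (Node* s : t->successors) {        // decrement strong dependency counts
        if (s->strong_dependents.fetch_sub(1) == 1) {
          queue.push_back(s);                // ready when the count reaches zero
        }
      }
    }
  }
}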
Task-level Scheduling (cont’d)
• Condition task is powerful but prone to mistakes …
Worker-level Scheduling (cont’d)
[Figure: CPU worker threads and GPU worker threads steal tasks from per-domain task queues]
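A generic, minimal sketch of the work-stealing idea pictured above (not Taskflow’s actual implementation): each worker owns a deque, pops from its own bottom, and steals from a victim’s top when idle:

#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <vector>

// generic work-stealing illustration, not Taskflow's actual implementation
struct WorkerQueue {
  std::deque<std::function<void()>> deque;
  std::mutex mutex;                     // a real executor uses a lock-free deque

  void push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex);
    deque.push_back(std::move(task));   // the owner pushes/pops at the bottom
  }
  bool pop(std::function<void()>& task) {
    std::lock_guard<std::mutex> lock(mutex);
    if (deque.empty()) return false;
    task = std::move(deque.back()); deque.pop_back();
    return true;
  }
  bool steal(std::function<void()>& task) {
    std::lock_guard<std::mutex> lock(mutex);
    if (deque.empty()) return false;
    task = std::move(deque.front()); deque.pop_front(); // thieves take the top
    return true;
  }
};

// worker loop: run local tasks first, then try stealing from a random victim
void worker_loop(size_t me, std::vector<WorkerQueue>& queues, std::mt19937& rng) {
  std::function<void()> task;
  std::uniform_int_distribution<size_t> pick(0, queues.size() - 1);
  while (true) {                        // a real worker would eventually sleep
    if (queues[me].pop(task) || queues[pick(rng)].steal(task)) task();
    else break;                         // nothing left anywhere (simplified)
  }
}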
Micro-benchmarks
• Randomly generate graphs with CPU-GPU tasks
  • CPU task: aX + Y (saxpy) with 1K elements
  • GPU task: aX + Y (saxpy) with 1M elements
• Comparison with TBB, StarPU, HPX, and OpenMP
  • What is the turnaround time to program?
  • What is the overhead of task graph parallelism?
[Table I: programming cost; Table II: task graph overhead (amortized)]
SLOCCount: https://dwheeler.com/sloccount/
Micro-benchmarks (cont’d)
• Performance on 40 Intel CPUs and 4 Nvidia GPUs
Application 1: Machine Learning
• Compute a 1920-layer DNN, each layer of 65536 neurons
• IEEE HPEC 2020 Neural Network Challenge
[Figure: a partial taskflow graph of 4 cudaFlows, 6 static tasks, and 8 conditioned cycles for this workload; each cudaFlow contains thousands of GPU tasks]
Application 1: Machine Learning (cont’d)
• Comparison with TBB and StarPU
  • Unroll task graphs across iterations (iteration counts found in hindsight)
  • Implement cudaGraph for all
VLSI optimization makes essential use of dynamic control flow
Parallel programming infrastructure matters
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Parallel Computing is Never Standalone
No One Can Express All Parallelisms …
• Languages ∪ Compilers ∪ Libraries ∪ Programmers
IMHO, C++ Parallelism Needs Enhancement
Dr. Tsung-Wei Huang
[email protected]
Taskflow: https://taskflow.github.io