Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System Using Modern C++
Tsung-Wei Huang
CppCon 2020
Why Parallel Computing?
• It’s critical to advance your application performance
[Figure: speedup over 1 CPU — 10x faster with 40 CPUs, 100x faster with 1 GPU]
Parallel Programming is Not Easy, Yet
• You need to deal with many difficult technical details:
  • Standard concurrency control
  • Task dependencies
  • Scheduling
  • Data race
  • … (more)
[Figure: word cloud of challenges — constraints, efficiency, task and data race, dependency, scheduling, debugging, concurrency control; many developers have a hard time getting them right!]
Taskflow offers a solution
“Hello World” in Taskflow
#include <taskflow/taskflow.hpp> // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C); // A runs before B and C
  D.succeed(B, C); // D runs after B and C
  executor.run(taskflow).wait(); // submit the taskflow to the executor
  return 0;
}
Only 15 lines of code to get a parallel task execution!
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Motivation: Parallelizing VLSI CAD Tools
• Billions of tasks with diverse computational patterns
  • Machine learning in the loop
  • NP-hard problems over trees, graphs, and dynamic structures
  • 10B+ transistors in a modern chip
[Figure: VLSI CAD flow — system specification, partition, architecture, floorplan, placement, physical design, DRC/LVS — alongside partial task graphs of iterations 0 and 1 with CPU-GPU kernels such as construct_cost_matrices_kernel, maximum_independent_set_parallel_kernel, solve_assignment_kernel, and apply_solution_kernel]
How can we write efficient C++ parallel programs for this monster computational task graph with millions of CPU-GPU dependent tasks along with algorithmic control flow?
We Invested a Lot in Existing Tools …
Two Big Problems of Existing Tools
• Our problems define complex task dependencies
  • Example: analysis algorithms compute a circuit network of millions of nodes and dependencies
  • Problem: existing tools are often good at loop parallelism but weak at expressing heterogeneous task graphs at this large scale
• Our problems define complex control flow
  • Example: optimization algorithms make essential use of dynamic control flow to implement various patterns
    • Combinatorial optimization, analytical methods
  • Problem: existing tools are directed acyclic graph (DAG)-based and do not anticipate cycles or conditional dependencies, lacking end-to-end parallelism
Example: An Iterative Optimizer
• 4 computational tasks with dynamic control flow
  #1: starts with the init task
  #2: enters the optimizer task (e.g., a GPU math solver)
  #3: checks if the optimization converged
    • No: loops back to the optimizer
    • Yes: proceeds to stop
  #4: outputs the result
How can we easily describe this workload with dynamic control flow using existing tools (e.g., OpenMP, TBB, StarPU, SYCL, Kokkos)?
[Diagram: init → optimizer → converged? —N→ optimizer (loop back), —Y→ output]
Many arguments are based on my personal opinions – no offense, no criticism, just plain C++ from an end user’s perspective
“Hello World” in Taskflow (Revisited)
#include <taskflow/taskflow.hpp> // Taskflow is header-only
int main(){
  tf::Taskflow taskflow;
  tf::Executor executor;
  auto [A, B, C, D] = taskflow.emplace(
    [] () { std::cout << "TaskA\n"; },
    [] () { std::cout << "TaskB\n"; },
    [] () { std::cout << "TaskC\n"; },
    [] () { std::cout << "TaskD\n"; }
  );
  A.precede(B, C); // A runs before B and C
  D.succeed(B, C); // D runs after B and C
  executor.run(taskflow).wait(); // submit the taskflow to the executor
  return 0;
}
Taskflow defines five tasks:
1. static task
2. dynamic task
3. cudaFlow task
4. condition task
5. module task
“Hello World” in OpenMP
#include <omp.h>    // OpenMP is a language extension to describe parallelism using compiler directives
#include <iostream>
#include <thread>
int main(){
  #pragma omp parallel num_threads(std::thread::hardware_concurrency())
  #pragma omp single   // a single thread creates the tasks
  {
    int A_B, A_C, B_D, C_D;
    #pragma omp task depend(out: A_B, A_C)            // task dependency clauses
    { std::cout << "TaskA\n"; }
    #pragma omp task depend(in: A_B) depend(out: B_D) // task dependency clauses
    { std::cout << "TaskB\n"; }
    #pragma omp task depend(in: A_C) depend(out: C_D) // task dependency clauses
    { std::cout << "TaskC\n"; }
    #pragma omp task depend(in: B_D, C_D)             // task dependency clauses
    { std::cout << "TaskD\n"; }
  }
  return 0;
}
OpenMP task clauses are static and explicit; programmers are responsible for a proper order of writing tasks consistent with sequential execution
“Hello World” in TBB
#include <tbb/tbb.h> // Intel’s TBB is a general-purpose parallel programming library in C++
int main(){
  using namespace tbb;
  using namespace tbb::flow;
  int n = task_scheduler_init::default_num_threads();
  task_scheduler_init init(n);
  graph g; // use TBB’s FlowGraph for task parallelism
  // declare each task as a continue_node
  continue_node<continue_msg> A(g, [](const continue_msg&){ std::cout << "TaskA"; });
  continue_node<continue_msg> B(g, [](const continue_msg&){ std::cout << "TaskB"; });
  continue_node<continue_msg> C(g, [](const continue_msg&){ std::cout << "TaskC"; });
  continue_node<continue_msg> D(g, [](const continue_msg&){ std::cout << "TaskD"; });
  make_edge(A, B);
  make_edge(A, C);
  make_edge(B, D);
  make_edge(C, D);
  A.try_put(continue_msg());
  g.wait_for_all();
}
TBB has excellent performance in generic parallel computing. Its drawback is mostly in the ease-of-use standpoint (simplicity, expressivity, and programmability).
TBB FlowGraph: https://software.intel.com/content/www/us/en/develop/home.html
“Hello World” in Kokkos
// fixed-layout task functors (no lambda interface …?)
struct A {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskA\n"; }
};
struct B {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskB\n"; }
};
struct C {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskC\n"; }
};
struct D {
  template <class TeamMember> KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member) { std::cout << "TaskD\n"; }
};
// task dependency is represented by instances of Kokkos::BasicFuture
auto scheduler = scheduler_type(/* ... */);
auto futA = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler), A() );
auto futB = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), B() );
auto futC = Kokkos::host_spawn( Kokkos::TaskSingle(scheduler, futA), C() );
auto futD = Kokkos::host_spawn(                   // aggregated dependencies
  Kokkos::TaskSingle(scheduler, when_all(futB, futC)), D()
);
// more scheduling code to follow …
Kokkos is powerful in describing asynchronous tasks but not efficient in large task graph parallelism
Kokkos task parallelism: https://github.com/kokkos/kokkos/wiki/Task-Parallelism
“Hello World” Summary (Less Biased)
Vote for Simplicity (100 C++ programmers with 2-5 years of C++11 experience)
[Pie chart: vote shares of 74, 15, 6, 4, and 1]
Dynamic Tasking (Subflow)
// create three regular tasks (the dynamic subflow task B is sketched below)
tf::Task A = taskflow.emplace([](){}).name("A");
tf::Task C = taskflow.emplace([](){}).name("C");
tf::Task D = taskflow.emplace([](){}).name("D");
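The slide’s snippet stops before the dynamic part. Below is a sketch of the missing subflow task B, patterned after the subflow example in the Taskflow documentation; the task bodies are placeholders:

// sketch: task B spawns a subflow of B1, B2, B3 at runtime (placeholder bodies)
tf::Task B = taskflow.emplace([](tf::Subflow& subflow){
  tf::Task B1 = subflow.emplace([](){}).name("B1");
  tf::Task B2 = subflow.emplace([](){}).name("B2");
  tf::Task B3 = subflow.emplace([](){}).name("B3");
  B3.succeed(B1, B2); // B3 runs after B1 and B2
}).name("B");
A.precede(B, C); // A runs before B and C
D.succeed(B, C); // D runs after B and C

By default a subflow joins its parent task, so B’s successors run only after B1-B3 finish.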
Subflow can be Nested
• Find the 7th Fibonacci number using a subflow
• Fib(n) = Fib(n-1) + Fib(n-2)
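A sketch following the recursive Fibonacci example shipped with Taskflow: each task spawns two nested subflows and joins them before summing. The helper spawn() is how that example structures the recursion:

#include <taskflow/taskflow.hpp>
#include <iostream>

// each call spawns two child subflow tasks and joins them
int spawn(int n, tf::Subflow& sbf) {
  if (n < 2) return n;
  int res1, res2;
  sbf.emplace([&res1, n](tf::Subflow& sbf){ res1 = spawn(n - 1, sbf); });
  sbf.emplace([&res2, n](tf::Subflow& sbf){ res2 = spawn(n - 2, sbf); });
  sbf.join(); // wait for the two nested subflows before summing
  return res1 + res2;
}

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;
  int res;
  taskflow.emplace([&res](tf::Subflow& sbf){ res = spawn(7, sbf); });
  executor.run(taskflow).wait();
  std::cout << "Fib(7) = " << res << '\n'; // 13
}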
Heterogeneous Tasking (cudaFlow)
const unsigned N = 1<<20;
std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
float *dx{nullptr}, *dy{nullptr};
auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, 4*N); });
auto allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, 4*N); });
auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf){
  // ... define the GPU work (copies and a saxpy kernel) inside the cudaFlow graph
});
cudaflow.succeed(allocate_x, allocate_y);
executor.run(taskflow).wait();
Users define GPU work in a graph rather than aggregated operations → single kernel launch to reduce overheads
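For reference, the elided cudaFlow body might look like the following sketch, patterned after the saxpy example in the Taskflow repository; the saxpy kernel is an assumed user-defined CUDA kernel:

// assumed user-defined kernel: y = a*x + y
__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

auto cudaflow = taskflow.emplace([&](tf::cudaFlow& cf){
  auto h2d_x  = cf.copy(dx, hx.data(), N);  // host-to-device copies
  auto h2d_y  = cf.copy(dy, hy.data(), N);
  auto kernel = cf.kernel((N + 255) / 256, 256, 0, saxpy, N, 2.0f, dx, dy);
  auto d2h_x  = cf.copy(hx.data(), dx, N);  // device-to-host copies
  auto d2h_y  = cf.copy(hy.data(), dy, N);
  kernel.succeed(h2d_x, h2d_y)              // kernel runs after the copies in
        .precede(d2h_x, d2h_y);             // and before the copies out
}).name("saxpy");

Because the copies and the kernel form one cudaFlow graph, the whole GPU workload can be launched at once, which is where the overhead reduction on this slide comes from.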
Three Key Motivations
• Our closure enables a stateful interface
  • Users capture data by reference to marshal data exchange between CPU and GPU tasks
• Our closure hides implementation details judiciously
  • We use cudaGraph (since CUDA 10) due to its excellent performance, much faster than streams in large graphs
• Our closure extends to new accelerator types
  • syclFlow, openclFlow, coralFlow, tpuFlow, fpgaFlow, etc.
We do not simplify kernel programming but focus on CPU-GPU tasking, which affects performance to a large extent! (same for data abstraction)
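To illustrate the stateful closure (first bullet), a minimal sketch reusing names from the previous slide (taskflow, N, hx): the pointer dx is captured by reference, so the later GPU task sees the value assigned by the allocation task:

// dx is captured by reference: allocate_x assigns it at run time,
// and the cudaFlow task reads the assigned value afterwards
float* dx{nullptr};
auto allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, 4*N); });
auto gpu_task   = taskflow.emplace([&](tf::cudaFlow& cf){
  cf.copy(dx, hx.data(), N); // sees the dx written by allocate_x
});
allocate_x.precede(gpu_task); // the dependency guarantees dx is valid here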
Conditional Tasking
auto init = taskflow.emplace([&](){ initialize_data_structure(); })
                    .name("init");
auto optimizer = taskflow.emplace([&](){ matrix_solver(); })
                         .name("optimizer");
auto converged = taskflow.emplace([&](){ return is_converged() ? 1 : 0; })
                         .name("converged"); // is_converged(): user-defined convergence check
auto output = taskflow.emplace([&](){ std::cout << "done!\n"; })
                      .name("output");
init.precede(optimizer);
optimizer.precede(converged);
converged.precede(optimizer, output); // returning 0 loops back to the optimizer
[Diagram: init → optimizer → converged? —0→ optimizer (loop back), —1→ output]
Condition task integrates control flow into a task graph to form end-to-end parallelism; in this example, there are ultimately four tasks ever created
Conditional Tasking (cont’d)
auto A = taskflow.emplace([&](){});
auto B = taskflow.emplace([&](){ return rand()%2; });
auto C = taskflow.emplace([&](){ return rand()%2; });
auto D = taskflow.emplace([&](){ return rand()%2; });
auto E = taskflow.emplace([&](){ return rand()%2; });
auto F = taskflow.emplace([&](){ return rand()%2; });
auto G = taskflow.emplace([&](){});
A.precede(B).name("init");
B.precede(C, B).name("flip-coin-1"); // each task flips a binary coin to decide the next path
C.precede(D, B).name("flip-coin-2");
D.precede(E, B).name("flip-coin-3");
E.precede(F, B).name("flip-coin-4");
F.precede(G, B).name("flip-coin-5");
G.name("end");
Composable Tasking
tf::Taskflow f1, f2;
// ... create tasks in f1, and tasks f2A, f2B, f2C in f2
auto f1_module_task = f2.composed_of(f1); // compose f1 as a module task of f2
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);
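A fuller sketch of this composition, patterned after the composition example in the Taskflow documentation; the task bodies are placeholders:

tf::Taskflow f1, f2;

// f1 has three tasks (placeholder bodies)
auto [f1A, f1B, f1C] = f1.emplace(
  [](){ std::cout << "f1A\n"; },
  [](){ std::cout << "f1B\n"; },
  [](){ std::cout << "f1C\n"; }
);
f1C.succeed(f1A, f1B);

// f2 has three tasks and composes f1 as a module task
auto [f2A, f2B, f2C] = f2.emplace(
  [](){ std::cout << "f2A\n"; },
  [](){ std::cout << "f2B\n"; },
  [](){ std::cout << "f2C\n"; }
);
auto f1_module_task = f2.composed_of(f1);
f1_module_task.succeed(f2A, f2B)
              .precede(f2C);

composed_of wraps f1 in a module task inside f2, so the same taskflow can be reused across multiple graphs.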
Everything is Unified in Taskflow
• Use “emplace” to create a task
• Use “precede” to add a task dependency
• No need to learn different sets of API
• You can create a really complex graph
  • Subflow(ConditionTask(cudaFlow))
  • ConditionTask(StaticTask(cudaFlow))
  • Composition(Subflow(ConditionTask))
  • …
[Figure: a taskflow graph mixing dynamic tasks, cudaFlows, composition, and control flow]
Submit Taskflow to Executor
• Executor manages a set of threads to run taskflows
  • All execution methods are non-blocking
  • All execution methods are thread-safe
{
  tf::Taskflow taskflow1, taskflow2, taskflow3;
  tf::Executor executor;
  // create tasks and dependencies
  // …
  auto future1 = executor.run(taskflow1);
  auto future2 = executor.run_n(taskflow2, 1000);
  auto future3 = executor.run_until(taskflow3, [i=0]() mutable { return i++ > 5; });
  executor.async([](){ std::cout << "async task\n"; });
  executor.wait_for_all(); // wait for all the above tasks to finish
}
Executor Scheduling Algorithm
• Task-level scheduling
  • Decides how tasks are enqueued under control flow
    • Goal #1: ensures a feasible path to carry out control flow
    • Goal #2: avoids task race under cyclic and conditional execution
    • Goal #3: maximizes the capability of conditional tasking
• Worker-level scheduling
  • Decides how tasks are executed and by which workers
    • Goal #1: adopts work stealing to dynamically balance load
    • Goal #2: adapts workers to available task parallelism
    • Goal #3: maximizes performance, energy efficiency, and throughput
Task-level Scheduling
• “Strong dependency” versus “weak dependency”
  • Weak dependency: dependencies out of condition tasks
  • Strong dependency: all the others
[Scheduling loop: if the queue is empty, wait for tasks; otherwise dequeue a task t. If t is a condition task, compute r = invoke(t) and enqueue only the r-th successor; otherwise invoke(t), decrement the strong dependency count of t’s successors by one, and enqueue the successors whose count reaches zero.]
[Diagram: the init → optimizer → converged? → output example, where the 0-edge back to optimizer and the 1-edge to output are weak dependencies]
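To make the loop concrete, here is a minimal, self-contained C++ sketch of the task-level scheduling loop above; the Node type and the single-threaded driver are illustrative assumptions, not Taskflow’s real internals:

#include <atomic>
#include <deque>
#include <functional>
#include <vector>

// illustrative task node, not Taskflow's real internals
struct Node {
  std::function<int()> work;            // condition tasks return a successor index
  bool is_condition = false;
  std::atomic<int> strong_dependents{0};
  std::vector<Node*> successors;
};

// single-threaded sketch of the task-level scheduling loop
void schedule(std::deque<Node*> queue) {
  while (!queue.empty()) {              // a real executor waits for more tasks here
    Node* t = queue.front(); queue.pop_front();
    int r = t->work();                  // invoke(t); r is ignored for non-condition tasks
    if (t->is_condition) {
      queue.push_back(t->successors[r]);     // enqueue only the r-th successor
    } else {
      for (Node* s : t->successors) {        // decrement strong dependency counts
        if (s->strong_dependents.fetch_sub(1) == 1) {
          queue.push_back(s);                // ready when the count reaches zero
        }
      }
    }
  }
}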
Task-level Scheduling (cont’d)
• Condition task is powerful but prone to mistakes …
Worker-level Scheduling (cont’d)
[Figure: CPU worker threads and GPU worker threads steal tasks from per-domain task queues]
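A generic, minimal sketch of the work-stealing idea pictured above (not Taskflow’s actual implementation): each worker owns a deque, pops from its own bottom, and steals from a victim’s top when idle:

#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <vector>

// generic work-stealing illustration, not Taskflow's actual implementation
struct WorkerQueue {
  std::deque<std::function<void()>> deque;
  std::mutex mutex;                     // a real executor uses a lock-free deque

  void push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex);
    deque.push_back(std::move(task));   // the owner pushes/pops at the bottom
  }
  bool pop(std::function<void()>& task) {
    std::lock_guard<std::mutex> lock(mutex);
    if (deque.empty()) return false;
    task = std::move(deque.back()); deque.pop_back();
    return true;
  }
  bool steal(std::function<void()>& task) {
    std::lock_guard<std::mutex> lock(mutex);
    if (deque.empty()) return false;
    task = std::move(deque.front()); deque.pop_front(); // thieves take the top
    return true;
  }
};

// worker loop: run local tasks first, then try stealing from a random victim
void worker_loop(size_t me, std::vector<WorkerQueue>& queues, std::mt19937& rng) {
  std::function<void()> task;
  std::uniform_int_distribution<size_t> pick(0, queues.size() - 1);
  while (true) {                        // a real worker would eventually sleep
    if (queues[me].pop(task) || queues[pick(rng)].steal(task)) task();
    else break;                         // nothing left anywhere (simplified)
  }
}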
Micro-benchmarks
• Randomly generate graphs with CPU-GPU tasks
  • CPU task: aX + Y (saxpy) with 1K elements
  • GPU task: aX + Y (saxpy) with 1M elements
• Comparison with TBB, StarPU, HPX, and OpenMP
  • What is the turnaround time to program?
  • What is the overhead of task graph parallelism?
[Table I: programming cost; Table II: task graph overhead (amortized)]
SLOCCount: https://dwheeler.com/sloccount/
Micro-benchmarks (cont’d)
• Performance on 40 Intel CPUs and 4 Nvidia GPUs
Application 1: Machine Learning
• Compute a 1920-layer DNN, each layer of 65536 neurons
• IEEE HPEC 2020 Neural Network Challenge
[Figure: a partial taskflow graph of 4 cudaFlows, 6 static tasks, and 8 conditioned cycles for this workload; each cudaFlow contains thousands of GPU tasks]
Application 1: Machine Learning (cont’d)
• Comparison with TBB and StarPU
  • Unroll task graphs across iterations (iteration counts found in hindsight)
  • Implement cudaGraph for all
VLSI optimization makes essential use of dynamic control flow
Parallel programming infrastructure matters
Agenda
• Express your parallelism in the right way
• Parallelize your applications using Taskflow
• Understand our scheduling algorithm
• Boost performance in real applications
• Make C++ amenable to heterogeneous parallelism
Parallel Computing is Never Standalone
No One Can Express All Parallelisms …
• Languages ∪ Compilers ∪ Libraries ∪ Programmers
IMHO, C++ Parallelism Needs Enhancement
Dr. Tsung-Wei Huang
[email protected]
Taskflow: https://taskflow.github.io