Parallelizing The Standard Algorithms Library - Jared Hoberock - CppCon 2014
Jared Hoberock
NVIDIA
Programming Systems and Applications Research Group
CppCon
September, 2014
Bringing Parallelism to C++
Technical Specification for Parallel Algorithms
Multi-vendor collaboration
Based on proven technologies
● Thrust (NVIDIA)
● PPL (Microsoft)
● TBB (Intel)
Multiple implementations in progress
Targeting C++17
Roadmap
Parallelism?
Motivating example
What’s included in the box
The details
Future work
What do I mean by parallelism?
That’s like threads, right?
When I say “parallel”, think “independent”
● Concurrency is an optimization
● Concurrency can be hard
● locking, exclusion, communication, shared state, data races...
What do I mean by parallelism?
Parallel programming => identifying tasks which may be performed independently
[image credit: npr.org]
How to communicate this information to the system?
It’s easy!
Simple parallelism for everyone
Easy to access
Interoperability with existing codes
Supported as broadly as possible
Concurrency is an invisible optimization
Vendor extensible
Motivating Example
Motivating Example: Weld Vertices
Marching Cubes Algorithm
[Figure: marching cubes cases 0-3]
Motivating Example: Weld Vertices
Problem: Marching Cubes produces “triangle soup”
Solution: “Weld” redundant vertices together
[Figure: triangle soup vertex indices before and after welding]
Motivating Example: Weld Vertices
Easy with the right high-level algorithms
Procedure:
1. Sort triangle vertices
2. Collapse spans of like vertices
3. Search for each vertex’s unique index
Motivating Example: Weld Vertices
using namespace std;
...
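The rest of this slide's code did not survive extraction. A minimal sequential sketch of the three-step procedure above (the vertex type, the 2D coordinates, and the function name weld are illustrative, not from the talk):

#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using vertex = std::array<float, 2>;  // hypothetical 2D vertex type

// replace a "triangle soup" vertex list with a unique vertex list plus indices
void weld(const std::vector<vertex>& input,
          std::vector<vertex>& unique_vertices,
          std::vector<std::size_t>& indices)
{
  // 1. sort a copy of the triangle vertices
  unique_vertices = input;
  std::sort(unique_vertices.begin(), unique_vertices.end());

  // 2. collapse spans of like vertices
  unique_vertices.erase(
    std::unique(unique_vertices.begin(), unique_vertices.end()),
    unique_vertices.end());

  // 3. search for each original vertex's unique index
  indices.resize(input.size());
  std::transform(input.begin(), input.end(), indices.begin(),
                 [&](const vertex& v)
                 {
                   return std::lower_bound(unique_vertices.begin(),
                                           unique_vertices.end(),
                                           v) - unique_vertices.begin();
                 });
}

With the parallel algorithms described next, each of these calls can additionally accept an execution policy (for example sort(par, ...)) to run in parallel.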
Communicate dependencies
What’s included in the box
Execution Policies
Parallel Algorithms
Parallel Exceptions
Execution Policies
using namespace std::experimental::parallelism;
std::vector<int> data = ...
// vanilla sort
sort(data.begin(), data.end());
Implementation-defined (non-standard) execution policies may also be provided (vendor extensible)
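For contrast, a sketch of the same sort using the execution policies that appear later in these slides (seq and par); the exact namespace and spellings follow the evolving TS drafts:

// request parallelizable execution
sort(par, data.begin(), data.end());

// request explicitly sequential execution
sort(seq, data.begin(), data.end());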
What is an execution policy?
Promise that a particular kind of reordering will preserve the meaning of a program
Parallelizable Execution
algo(par, begin, end, func);
Parallelizable Execution
It is the caller’s responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks.
Parallelizable Execution
// data race
int a[] = {0,1};
std::vector<int> v;
for_each(par, a, a + 2, [&](int& i)
{
v.push_back(i);
});
Parallelizable Execution
// OK (don't do this):
int a[] = {0,1};
std::vector<int> v;
std::mutex mut;
for_each(par, a, a + 2, [&](int& i)
{
mut.lock();
v.push_back(i);
mut.unlock();
});
Parallelizable Execution
// OK (do this):
int a[] = {0,1};
std::vector<int> v(2);
for_each(par, a, a + 2, [&](int& i)
{
v[i] = i;
});
Parallelizable Execution
// may deadlock (don't do this):
std::atomic<int> counter = 0;
int a[] = {0,1};
for_each(par, a, a + 2, [&](int& i)
{
  counter++;
  // spin-waiting for the other invocation may deadlock:
  // par does not guarantee that invocations execute concurrently
  while(counter == 1) ;
});
Difference between par & par_vec
● par: function invocations are unsequenced with respect to each other when executed in different threads (and indeterminately sequenced when executed in the same thread)
● par_vec: function invocations are unsequenced with respect to each other, even when executed in the same thread
Parallelizable + Vectorizable Execution
// may deadlock under par_vec (don't do this):
// vectorized invocations may interleave on one thread, which can then try to lock the mutex twice
int counter = 0;
std::mutex m;
for_each(par_vec, a, a + 2, [&](int)
{
  m.lock();
  ++counter;
  m.unlock();
});
Parallelizable + Vectorizable Execution
// OK:
std::atomic<int> counter = 0;
int a[] = {0,1};
for_each(par_vec, a, a + 2, [&](int)
{
++counter;
});
Parallelizable + Vectorizable Execution
// Best:
int count = count_if(par_vec, ...);
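The slide elides the arguments. A complete call, reusing the array a from the earlier examples with an illustrative predicate of my own choosing:

// count matching elements directly instead of incrementing a shared counter
int a[] = {0,1};
int count = count_if(par_vec, a, a + 2, [](int x) { return x != 0; });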
Exceptions Example
try
{
for_each(data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(superstition_error& error)
{
std::cerr << error.what() << std::endl;
}
Exceptions Example
struct superstition_error { const char* what() { return "eek"; } };
try
{
for_each(par, data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(exception_list& errors)
{
  std::cerr << "Encountered " << errors.size() << " unlucky numbers" << std::endl;
}
Exceptions Example
try
{
for_each(seq, data.begin(), data.end(), [](auto x)
{
if(x == 13)
{
throw superstition_error();
}
});
}
catch(exception_list& errors)
{
  std::cerr << "Encountered " << errors.size() << " unlucky numbers" << std::endl;
}
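Beyond reporting a count, the collected exceptions can be examined individually. A sketch assuming the TS's exception_list interface (a sequence of std::exception_ptr); this is not code from the talk:

catch(exception_list& errors)
{
  // each collected exception is a std::exception_ptr; rethrow to inspect it
  for(std::exception_ptr e : errors)
  {
    try
    {
      std::rethrow_exception(e);
    }
    catch(superstition_error& err)
    {
      std::cerr << err.what() << std::endl;
    }
  }
}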
auto num_within_quarter_circle =
count_if(par, iter, iter + n, test_quarter_circle);
[Chart: measured speedups of 1x, 5.25x, and 19.6x]
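This slide's count_if appears to be a Monte Carlo estimate of π: count random points that fall within the unit quarter circle. A self-contained sketch using the C++17 spelling std::execution::par, which is what this TS eventually became; the setup code is mine, not from the talk:

#include <algorithm>
#include <execution>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

int main()
{
  const std::size_t n = 1'000'000;

  // generate n random points in the unit square (illustrative setup)
  std::mt19937 gen(13);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<std::pair<double, double>> points(n);
  for(auto& p : points)
  {
    p = {dist(gen), dist(gen)};
  }

  auto test_quarter_circle = [](const std::pair<double, double>& p)
  {
    return p.first * p.first + p.second * p.second <= 1.0;
  };

  auto num_within_quarter_circle =
    std::count_if(std::execution::par, points.begin(), points.end(), test_quarter_circle);

  std::cout << "pi is approximately " << 4.0 * num_within_quarter_circle / n << std::endl;
}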
Future Work
Scheduling?
algo(exec, begin, end, func);
algo needs to compose with scheduling decisions in the surrounding application
exec specifies how algo is allowed to execute
● specifies what work an implementation is allowed to create
● does not specify where the work should be executed
Placement is orthogonal
Scheduling?
algo(exec(sched), begin, end, func);
We anticipate extending our execution policies to accept scheduling requirements as parameters
parallelstl.codeplex.com
● based on PPL?
github.com/t-lutz/ParallelSTL
● based on std::thread
Summary
High-level algorithms make parallelism easy
● Portable & Composable
● Concurrency is invisible
Standardization
● On track for C++17
● Experimental Tech Spec in the meantime
● github.com/cplusplus/parallelism-ts
Questions?
[email protected]
github.com/jaredhoberock