OpenMP offload implicitly using streams

Hi,
How can I turn on streams when using OpenMP offload?
When different host threads individually start target regions (even without nowait), the offloaded computation goes to different CUDA streams and may execute concurrently. This is currently available in XL.
With Clang, nvprof shows that the run uses only the default stream.
Is there a way to do that with Clang?
On a related note, nvcc has the option --default-stream per-thread.
I’m not familiar with Clang CUDA; is there a similar option?
Best,

Ye

Hi all,
After going through the source, I didn’t find CUDA stream support.
Luckily, I only need to add
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
before
#include <cuda.h>
in libomptarget/plugins/cuda/src/rtl.cpp
Then the multiple target regions go to different streams and may execute concurrently.

    #pragma omp parallel
    {
      #pragma omp target
      {
        // offload computation
      }
    }

This is exactly what I want.

I know the XL compiler uses streams in a different way but achieves similar effects.

Is anyone working on using streams with OpenMP target in LLVM?
Will clang-ykt get something similar to XL, and will it be upstreamed to mainline?

If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the CUDA RTL, will it cause any trouble?
As a compiler user, I’d like to have a better solution rather than a patch just for myself.

Best,

Ye

On Sun, Mar 17, 2019 at 2:26 PM, Ye Luo <[email protected]> wrote:

Thanks, Ye. I suppose that I thought it always worked that way :)

Alexey, Doru, do you know if there’s any semantic problem or other concerns with enabling this option and/or making it the default?

-Hal

I’m adding Alex to this thread. He should be able to shed some light on this issue.

Thanks,

–Doru

Hi Hal, it is hard to tell. I can try adding an option to Clang that defines this macro, if you want to try it.

That would be great, thanks!

-Hal

Hal,

Supporting async for targets is not trivial. Right now, since everything is “blocking”, a target can easily be separated into 5 tasks:

  1. wait for dependences to resolve
  2. perform all copies to the device
  3. execute the target
  4. perform all copies from device to host
  5. resolve all dependences for other tasks

Going async à la LOMP has many advantages: all operations are asynchronous, and dependences from target to target tasks are directly enforced on the device. To us, this was the only way that users could effectively hide the high overhead of generating targets: by enqueuing many dependent target tasks on the host, to prime the pipe of targets on the devices.

To do so, I believe you have to do the following.

  1. wait for all the host dependences; dependences from other device targets are tabulated but not waited for
  2. select a stream (the stream of one of the targets it depends on, if any) and enqueue a wait-for-event for all tabulated dependences
  3. enqueue all copy to device on stream (or enqueue sync event for data currently being copied over by other targets)
  4. enqueue computation on stream
  5. enqueue all copy from device on stream (this is speculative, as ref count may increase by another target executed before the data is actually copied back, but it’s legal)
  6. cleanup
    1. blocking: wait for stream to be finished
    2. non-blocking: have a callback from CUDA (which involves a separate thread) or have active polling by OpenMP threads when doing nothing and/or before doing a subsequent target task to determine when the stream is finished
    3. when 1 or 2 above are finished, cleanup the map data structures, resolve dependences for dependent tasks.

This is compounded by the fact that async data movements are only performed with pinned memory, and CUDA memory cannot be allocated directly because allocation is a synchronizing event. So the runtime must handle its own pool of device and pinned memory, which requires additional work in Steps 3, 5, and 6.3 above.

To perform the cleanup in Step 6, you also need to cache all info associated with a target in a dedicated data structure.
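
A minimal sketch of such a per-target record, covering the blocking and polling variants of Step 6. This is not LOMP or libomptarget code: `PendingTarget`, `PendingTargets`, `poll`, and `wait_all` are hypothetical names, and a `std::future<void>` (assumed to come from a `std::promise`) stands in for querying a CUDA stream or event so the example runs without a GPU.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <future>
#include <list>
#include <utility>

// Hypothetical record of everything needed to finish a target once its
// device work is done. `done` stands in for a CUDA stream/event query.
struct PendingTarget {
    std::future<void> done;         // becomes ready when the device work finishes
    std::function<void()> cleanup;  // e.g. update map tables, resolve dependent tasks
};

// Not thread-safe: a real runtime would protect this list with a lock.
class PendingTargets {
public:
    void add(std::future<void> done, std::function<void()> cleanup) {
        pending_.push_back(PendingTarget{std::move(done), std::move(cleanup)});
    }

    // Non-blocking variant (Step 6.2): retire every target that has finished
    // and return how many were retired. Unfinished targets stay in the list.
    std::size_t poll() {
        std::size_t retired = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->done.wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready) {
                it->done.get();
                it->cleanup();  // Step 6.3
                it = pending_.erase(it);
                ++retired;
            } else {
                ++it;
            }
        }
        return retired;
    }

    // Blocking variant (Step 6.1): wait for every outstanding target.
    void wait_all() {
        for (auto& t : pending_) {
            t.done.get();  // blocks until the work has finished
            t.cleanup();   // Step 6.3
        }
        pending_.clear();
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::list<PendingTarget> pending_;
};
```

An idle OpenMP thread would call `poll()` opportunistically, while a synchronous target would call the blocking path right after enqueuing its work.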

As you may have noticed, if you want some async to work, you basically have to treat all targets as async; the synchronous ones differ only by having an explicit wait in Step 6.1. So all this handling is in the critical path.

You will also need to carefully manage CUDA events associated with any explicit data movements, as subsequent target operations may depend on an actual memory operation completing (in either direction).

This has been done in LOMP. Was it fun? Maybe not, but it’s all feasible.

There is a possible saving grace, namely that you could implement async only under unified memory, which would simplify greatly the whole thing: eliminate Steps 3 & 5 above and associated bookkeeping.

However, most application writers who have optimized their code will tell you that unified-only programs tend not to work too well, and that hybrid models (copy predictable data, use unified memory for unstructured data) are likely to deliver better performance. So you could simplify your implementation at the cost of precluding async for the most optimized programs.

Happy to discuss it further, and explore with you alternative implementations.

Alexandre

Hi, Alex,

Thanks; this is helpful in general.

In this case, we’re wondering about the following: if we have multiple threads on the host, and those threads all have target offload regions, can we “simply” allow all of these threads’ target regions to progress independently by using a different CUDA stream for each host thread? As Ye pointed out, just compiling the current LLVM implementation with CUDA_API_PER_THREAD_DEFAULT_STREAM seems to provide this effect. I’m just wondering if this is semantically correct or if there are some inter-thread dependencies that the current single-stream-for-all-threads model is implicitly enforcing. Please correct me if I’m wrong, but my impression is that this usage model, where multiple host threads have (blocking) target regions, is simpler to support than full async (by which you mean target tasks with nowait, right?).

-Hal

Thank you, Alex, for the explanation. It is a very nice read for understanding how fully async offload works. I believe this is the eventual goal, but it will take time to get there.

At the moment, I’m looking for a way that needs only blocking offload but still gets concurrent execution.

My code pattern is

    #pragma omp parallel
    { // parallel is at the very high level of the code hierarchy and contains loops over target regions
      #pragma omp target
      { // target is very local
        // offload computation
      }
    }

Using 1 stream per host thread easily achieves concurrent execution.

I noticed some presentations showing a pipelining offload example using nowait

http://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf

slide 30.
If we enable one stream per thread, the pipelining can work as well, even with blocking offload.

    #pragma omp parallel for
    for (int i = 0; i < nblocks; i++)
    {
      #pragma omp target
      {
        // offload the computation of one block
      }
    }
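
To make the pattern above concrete, here is a minimal sketch of a blocked computation in which each loop iteration issues its own blocking target region. The function name `scale_blocks` and its parameters are hypothetical, `block_size` is assumed to be greater than zero, and whether the target regions actually overlap on the device depends on the runtime (for example, on the per-thread stream behavior discussed in this thread). Built without OpenMP support, the pragmas are ignored and the code runs serially on the host.

```cpp
#include <cstddef>
#include <vector>

// Scale every element of `data` by `factor`, one block per loop iteration.
// Each iteration is a separate, blocking target region, so several host
// threads can have target regions in flight at the same time.
// Assumes block_size > 0.
void scale_blocks(std::vector<double>& data, std::size_t block_size, double factor) {
    double* p = data.data();
    const std::size_t n = data.size();
    const std::size_t nblocks = (n + block_size - 1) / block_size;

    #pragma omp parallel for
    for (std::size_t i = 0; i < nblocks; ++i) {
        const std::size_t begin = i * block_size;
        const std::size_t len = (begin + block_size <= n) ? block_size : n - begin;

        // Blocking offload of one block: only this block's slice is mapped.
        #pragma omp target teams distribute parallel for map(tofrom: p[begin:len])
        for (std::size_t j = begin; j < begin + len; ++j) {
            p[j] *= factor;
        }
    }
}
```

With a single stream shared by all host threads, these regions would be expected to execute one after another on the device; with one stream per host thread, a profiler such as nvprof can show whether transfers and kernels from different threads overlap.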

So I’m thinking of turning on one stream per thread to gain something, and I’m not aware of any downside.

Best,

Ye

On Wed, Mar 20, 2019 at 1:26 PM, Alexandre Eichenberger <[email protected]> wrote:

Ye, Hal,

Just using a distinct stream per thread does not necessarily give you concurrency for transfers. For example, is the runtime holding a single map lock during the entire mapping (including the data transfer)? I don’t recall precisely what LLVM does, but I think that was the case. If so, you don’t have any asynchrony during transfers, but you have a simple implementation. You could relax this by having one lock per mapped memory location, so that a second target mapping the same data would wait until the data is moved. Or you can use one CUDA event per data transfer and have a second thread’s target computation wait for that event.
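
A minimal sketch of the "one lock per mapped memory location" idea, written as plain C++ so it runs without a GPU. This is an illustration under stated assumptions, not how libomptarget is implemented: `MapEntry`, `MapTable`, and `map_to_device` are hypothetical names, and the `transfer` callback stands in for the host-to-device copy.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical per-location map entry. Its lock is held while the data is
// transferred, so only mappers of the same host address wait on the copy.
struct MapEntry {
    std::mutex lock;
    bool on_device = false;
    int ref_count = 0;
};

class MapTable {
public:
    // `transfer` stands in for the host-to-device copy. Returns true if this
    // call performed the transfer, false if the data was already present.
    bool map_to_device(const void* host_ptr, const std::function<void()>& transfer) {
        std::shared_ptr<MapEntry> entry = find_or_create(host_ptr);
        // Long critical section, but per entry: other addresses are not blocked.
        std::lock_guard<std::mutex> guard(entry->lock);
        ++entry->ref_count;
        if (entry->on_device) return false;
        transfer();
        entry->on_device = true;
        return true;
    }

    int ref_count(const void* host_ptr) {
        std::shared_ptr<MapEntry> entry = find_or_create(host_ptr);
        std::lock_guard<std::mutex> guard(entry->lock);
        return entry->ref_count;
    }

private:
    // Short critical section: table lookup or insert only, never a copy.
    std::shared_ptr<MapEntry> find_or_create(const void* host_ptr) {
        std::lock_guard<std::mutex> guard(table_lock_);
        std::shared_ptr<MapEntry>& slot = entries_[host_ptr];
        if (!slot) slot = std::make_shared<MapEntry>();
        return slot;
    }

    std::mutex table_lock_;
    std::unordered_map<const void*, std::shared_ptr<MapEntry>> entries_;
};
```

The table-wide lock is held only for the lookup, so two threads mapping different arrays no longer serialize on each other's transfers, while a second thread mapping the same array still waits until the first copy has finished.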

Also, early reports from our users were that having multiple threads make CUDA calls performed badly because the driver’s locking scheme was poor. Hopefully, this has been addressed since. I would conduct a simple experiment to test this before you recommend that your users go that way.

Alexandre

Ye, do you know if the benefit you're seeing is from concurrency in data transfers, or in compute, or both?

Regarding the driver's locking when multiple threads make CUDA calls: my understanding from talking to folks at NVIDIA is that this has been an area they've actively worked on improving. It may be better, although testing is certainly required.

Thanks again,

Hal


Hal,

There is no concurrent data transfer because the hardware has only one pipe for each direction, D2H and H2D. That is the bottleneck anyway. I don’t think we can work around that, and this is expected.

I do want data transfers to overlap with computation from different host threads (streams), and computation from different threads to overlap with each other.
Here is what I got from nvprof: both kinds of overlap are happening.
Ye

On Wed, Mar 20, 2019 at 4:37 PM, Finkel, Hal J. <[email protected]> wrote:

Hal,

I don’t have results from our initial experiments, and generally I implemented the scheme with the most asynchrony. You can probably write a CUDA example of the scheme you are looking into, and compare it with a similar scheme using LOMP or using CUDA calls directly.

Ye, Hal,

It is true that you have one pipe in each direction, but serializing the data transfers themselves is one thing; serializing everything under a lock is another. When mapping, the pattern in the LLVM runtime (on entry to a target region) is probably like this:

  1. get map lock
  2. for each map
    1. test if already there, if not
    2. alloc
    3. cuda mem copy if needed by map operation
  3. release map lock

And under the covers, the synchronous memory operations in Step 2.3 will also allocate a pinned memory location, copy the data into it, and then transfer it over to the device. So the serialization you are talking about covers only one part of Step 2.3; all the other parts are still serialized by the external lock.

And most importantly, you have to ensure that the alloc does not use cudaMalloc (or the equivalent pinned-memory operations), because that will prevent any asynchronous behavior:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization

Along with using distinct streams for each thread, that is probably the most important change to do.
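
One way to keep the synchronizing allocator off the common path is a runtime-managed pool. The sketch below is only an illustration of that idea: `DevicePool`, `acquire`, and `release` are hypothetical names, and `std::malloc`/`std::free` stand in for the real device or pinned-memory allocator so that the example runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical size-bucketed pool. std::malloc/std::free stand in for the
// synchronizing allocator (device or pinned memory) that should be avoided
// on the common path.
class DevicePool {
public:
    ~DevicePool() {
        // Frees only blocks that were returned to the pool.
        for (auto& bucket : free_)
            for (void* p : bucket.second) std::free(p);
    }

    void* acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;  // fast path: reuse, no synchronizing allocation
        }
        ++backing_allocs_;
        return std::malloc(bytes);  // slow path: stand-in for the real allocator
    }

    // Return a block to the pool instead of freeing it.
    void release(void* p, std::size_t bytes) {
        std::lock_guard<std::mutex> guard(lock_);
        free_[bytes].push_back(p);
    }

    std::size_t backing_allocs() const {
        std::lock_guard<std::mutex> guard(lock_);
        return backing_allocs_;
    }

private:
    mutable std::mutex lock_;
    std::map<std::size_t, std::vector<void*>> free_;
    std::size_t backing_allocs_ = 0;
};
```

Exact-size buckets keep the sketch short; a real pool would more likely round sizes up to a few size classes so that blocks are reused more often.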

Alexandre

Hi Alex,
Could you explain why 2.3 is inside the lock region? I feel maintaining the memory consistency is the responsibility of code developers anyway.

Ye

On Wed, Mar 20, 2019 at 6:37 PM, Alexandre Eichenberger <[email protected]> wrote: