We never emitted those instructions in the compiler, and gfx10+ do not have them anymore.
The control flow lowering scheme is split across many passes. StructurizeCFG is the primary pass that gets the CFG into a form where we can use explicit exec masking instructions; annotating the structured CFG for that lowering is handled by SIAnnotateControlFlow.
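For intuition, here is a tiny host-side C++ model of what the structured, exec-masked form does for a simple if/else diamond. It only illustrates the masking idea (save the mask, run each side under a sub-mask, restore the mask at the join), not actual compiler output or hardware behavior:

#include <bitset>
#include <cstdio>

int main() {
  std::bitset<4> cond("1100");   // per-lane branch condition (lanes taking the "then" side)
  std::bitset<4> exec("1111");   // block A: all lanes active
  std::bitset<4> saved = exec;   // save the mask before diverging

  exec = saved & cond;           // block B runs under the "then" lanes
  std::printf("B exec = %s\n", exec.to_string().c_str());   // 1100

  exec = saved & ~cond;          // block C runs under the complementary lanes
  std::printf("C exec = %s\n", exec.to_string().c_str());   // 0011

  exec = saved;                  // reconverge: restore the saved mask at the join block
  std::printf("D exec = %s\n", exec.to_string().c_str());   // 1111, so the join runs once
}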
Thank you very much for the reply. If there is no fork/join instruction, how can we force control flow to reconverge at some point on the new AMDGPU architecture? Take the diamond CFG below as an example: the execution sequence may be A(1111) B(1100) D(1100) C(0011) D(0011), so the D basic block may be executed twice for different threads. Should the compiler schedule the execution sequence A(1111) B(1100) C(0011) D(1111) in topological order, so that D is executed only after both B and C have finished, and thus D is executed only once with all the threads?
Thank you, Matt! Is there any online compiler that I can use? I tried to build OpenCL code with https://godbolt.org/, but it seems godbolt doesn’t support the AMDGPU target.
You can also just pick upstream clang and use -x cl. If you don’t want to depend on any headers, you can use __builtin_amdgcn_workitem_id_x to get the equivalent of threadIdx.x.
Thank you! The web compiler Compiler Explorer is very good, but I’d also like to build clang/llvm myself to take a look at how each pass transforms the code. I use cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD="AMDGPU" ../llvm to build it, but got the error below when I used my local compiler to build the OpenCL source code.
__global__ void kernel(int *array, int n) {
  int tid = __builtin_amdgcn_workitem_id_x();
  if (tid < n) {
    array[tid] = array[tid] * 3;
  } else {
    array[tid] = array[tid] + 4;
  }
  array[tid + 1] = array[tid] + 2;
}
clang -x cl diverge.cl -emit-llvm -S
diverge.cl:3:1: error: unknown type name '__global__'
3 | __global__ void kernel(int *array, int n) {
| ^
diverge.cl:3:24: error: expected identifier or '('
3 | __global__ void kernel(int *array, int n) {
| ^
diverge.cl:3:24: error: expected ')'
diverge.cl:3:23: note: to match this '('
3 | __global__ void kernel(int *array, int n) {
| ^
3 errors generated.
Could you share the command line of building clang/llvm for AMDGPU?
Compiler Explorer can do that, too. In the compiler menu bar select “Add New → Opt Pipeline” and it will show you how every pass in the compilation pipeline changes the IR.
Thank you! Now I can use clang diverge.hip --cuda-device-only -nogpulib -nogpuinc -emit-llvm -S -O2 to build my first HIP source file with my local clang compiler.
Add New → Opt Pipeline is great. I notice it can also dump the CFG. Take https://godbolt.org/z/o4hMTWr3Y as an example; it looks like a classic GPU live-lock issue. Some threads of a warp acquired the lock successfully, and the other threads in the same warp are waiting for them to release it. In the generated code, it seems it would spin on s_cbranch_execnz .LBB0_1 forever, because the warp can only leave the loop once all of its threads meet the condition (have acquired the lock).
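In HIP/CUDA terms, the pattern that produces this live lock has roughly the following shape (a sketch, not necessarily the exact source behind the link; lock is a placeholder):

__device__ void naive_lock(int *lock) {
  while (atomicCAS(lock, 0, 1) != 0) { }  // losing lanes spin here...
  // ... critical section ...
  atomicExch(lock, 0);                    // ...but on a lock-step wavefront the winning lane
                                          // waits at the loop exit and never reaches this
                                          // release, so the spin never terminates
}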
The threads on AMD (and pre-Volta NVIDIA) move in lock step. If execution follows the threads that did not get the lock, you are doomed.
You can make a critical section, but you cannot (easily) make the threads in the warp “spin-wait” for one of them. If you really need them all to do something sequentially, use a mask to keep track of which threads are done and prevent them from participating in the future. Loop until all are done. In the loop you do an if (cas(...)) { count++; updateMask(); }.
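A minimal sketch of that pattern in HIP/CUDA terms (placeholder names, not tested code): every lane loops until it has had its turn, and the winning lane releases the lock within the same iteration, before the loop back-edge, so another lane can win the next time around.

__device__ void serialize_lanes(int *lock) {
  bool done = false;
  while (!done) {
    if (atomicCAS(lock, 0, 1) == 0) {  // at most one lane wins per iteration
      // ... critical section ...
      __threadfence();                 // publish writes before releasing
      atomicExch(lock, 0);             // release inside the same iteration
      done = true;                     // this lane stops contending
    }
    // Lanes that lost the CAS fall through and retry on the next iteration.
  }
}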
You can’t implement a generic mutex lock on the GPU, at least according to the OpenCL standard. Actual hardware implementations may vary, but I don’t think they are very well documented. Trying to do per-thread semantics on a SIMT machine with lock-step execution is pretty much guaranteed to deadlock as well.
You may be able to deal with it by launching only one thread per warp (in NVIDIA’s terms). This way you avoid running into the issue with deadlocks across divergent threads.
Of course, the downside is that by running only one out of 32 (or 64) threads you lose a couple of orders of magnitude in performance…
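A sketch of that idea in HIP/CUDA terms, expressed as a lane guard inside the kernel rather than through the launch configuration (placeholder names, untested): only one lane per warp ever touches the lock, so no sibling lane can keep the holder from making progress.

__global__ void one_lane_per_warp(int *lock, int *data) {
  if (threadIdx.x % warpSize == 0) {        // only lane 0 of each warp contends
    while (atomicCAS(lock, 0, 1) != 0) { }  // the other lanes of this warp never hold the
                                            // lock, so this spin cannot starve the holder
    // ... critical section on data ...
    __threadfence();
    atomicExch(lock, 0);                    // release
  }
}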
ISO C++ allows you to implement a generic mutex lock on the GPU using the std::execution::par policy. All NVIDIA GPUs since Volta have Independent Thread Scheduling to support that. All threads in a warp can try to take the lock in any arbitrary order without issues. You can also do this with CUDA C++, OpenMP offloading with OpenMP atomics, etc.
Compiler Explorer supports running code on NVIDIA GPUs online, so you can try this out. This example implements a ticket lock and takes it from all threads in a kernel (across all warps): Compiler Explorer.
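For reference, a rough sketch of a ticket lock written with plain CUDA atomics (the linked example uses the ISO C++ facilities mentioned above; this is not that exact code, and the names are placeholders). Forward progress of the spin below still relies on sm_70+ Independent Thread Scheduling.

struct ticket_lock {
  unsigned next;     // next ticket to hand out (zero-initialize before use)
  unsigned serving;  // ticket currently being served (zero-initialize before use)

  __device__ void lock() {
    unsigned ticket = atomicAdd(&next, 1u);        // draw a ticket
    while (atomicAdd(&serving, 0u) != ticket) { }  // atomic read: spin until it is our turn
    __threadfence();                               // order the critical section after acquire
  }

  __device__ void unlock() {
    __threadfence();                               // publish writes before release
    atomicAdd(&serving, 1u);                       // admit the next ticket holder
  }
};

// Every thread in the grid takes the lock once, as in the linked example.
__global__ void take_lock_from_all_threads(ticket_lock *lk, int *counter) {
  lk->lock();
  ++*counter;  // critical section
  lk->unlock();
}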
You can write something that appears to be a ‘lock’ on the GPU using atomic builtins. However, a lock is only sound if threads cannot impede another thread’s progress, and the scheduler is fair. The first issue is somewhat solved by NVIDIA’s independent thread scheduling, but AMDGPU doesn’t have that, so it’s not a general solution.
According to the OpenCL standard, threads cannot depend on results from other threads at any time, because the standard makes no assumptions about the behavior of the scheduler. Actual hardware varies, and I don’t know of anything NVIDIA has published about how their scheduler works. AMDGPU’s ordering is based on the HSA standard, which allows some kinds of ordering, but not much.
So basically you end up with this situation where a mutex lock “works” until it doesn’t. You could have a warp claim a lock, then some graphics job gets launched on the GPU and boots it out; the scheduler is then under no contract to ever schedule the evicted thread that owns the lock back in.
I think the original blog post on the Volta thread scheduling stated that mutex locks can only work if they are guaranteed to be ‘starvation free’, i.e. each thread must eventually succeed in taking the lock. This matches what we do in systems like my RPC interface, where we guarantee that there are enough locks that each hardware thread can have its own slot; otherwise you’d risk being blocked by another thread.