We never emitted those instructions in the compiler, and gfx10+ do not have them anymore.
The control flow lowering scheme is split across many passes. StructurizeCFG is the primary pass that gets the CFG into a form where we can use explicit exec masking instructions; annotating the structured CFG for that lowering is handled by SIAnnotateControlFlow.
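For intuition, here is a tiny host-side C++ model of what the structured, exec-masked form does for a simple if/else diamond. It only illustrates the masking idea (save the mask, run each side under a sub-mask, restore the mask at the join), not actual compiler output or hardware behavior:

#include <bitset>
#include <cstdio>

int main() {
  std::bitset<4> cond("1100");   // per-lane branch condition (lanes taking the "then" side)
  std::bitset<4> exec("1111");   // block A: all lanes active
  std::bitset<4> saved = exec;   // save the mask before diverging

  exec = saved & cond;           // block B runs under the "then" lanes
  std::printf("B exec = %s\n", exec.to_string().c_str());   // 1100

  exec = saved & ~cond;          // block C runs under the complementary lanes
  std::printf("C exec = %s\n", exec.to_string().c_str());   // 0011

  exec = saved;                  // reconverge: restore the saved mask at the join block
  std::printf("D exec = %s\n", exec.to_string().c_str());   // 1111, so the join runs once
}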
Thank you very much for the reply. If there is no fork/join instruction, how can we force control flow to reconverge at some point on the new AMDGPU architecture? Take the diamond CFG below as an example: the execution sequence may be A(1111) B(1100) D(1100) C(0011) D(0011), so the D basic block may be executed twice for different threads. Should the compiler schedule the execution sequence A(1111) B(1100) C(0011) D(1111) in topological order, so that D is executed only after both B and C have finished, and thus D is executed only once with all the threads?
Thank you, Matt! Is there any online compiler that I can use? I tried to build OpenCL code with https://godbolt.org/, but it seems godbolt doesn’t support the AMDGPU target.
You can also just pick upstream clang and use -x cl. If you don’t want to depend on any headers, you can use __builtin_amdgcn_workitem_id_x to get the equivalent of threadIdx.x.
Thank you! The web compiler Compiler Explorer is very good, but I’d also like to build clang/llvm myself to take a look at how each pass transforms the code. I use cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD="AMDGPU" ../llvm to build it, but got the error below when I used my local compiler to build the OpenCL source code.
__global__ void kernel(int *array, int n) {
  int tid = __builtin_amdgcn_workitem_id_x();
  if (tid < n) {
    array[tid] = array[tid] * 3;
  } else {
    array[tid] = array[tid] + 4;
  }
  array[tid + 1] = array[tid] + 2;
}
clang -x cl diverge.cl -emit-llvm -S
diverge.cl:3:1: error: unknown type name '__global__'
3 | __global__ void kernel(int *array, int n) {
| ^
diverge.cl:3:24: error: expected identifier or '('
3 | __global__ void kernel(int *array, int n) {
| ^
diverge.cl:3:24: error: expected ')'
diverge.cl:3:23: note: to match this '('
3 | __global__ void kernel(int *array, int n) {
| ^
3 errors generated.
Could you share the command line of building clang/llvm for AMDGPU?
Compiler Explorer can do that, too. In the compiler menu bar select “Add New → Opt Pipeline” and it will show you how every pass in the compilation pipeline changes the IR.
Thank you! Now I can use clang diverge.hip --cuda-device-only -nogpulib -nogpuinc -emit-llvm -S -O2 to build my first HIP source file with my local clang compiler.
Add New → Opt Pipeline is great. I notice it can also dump the CFG. Take https://godbolt.org/z/o4hMTWr3Y as an example; it looks like a classic GPU live-lock issue. Some threads of a warp acquired the lock successfully, and the other threads in the same warp are waiting for them to release it. In the generated code, it seems it would spin on s_cbranch_execnz .LBB0_1 forever, because the warp can only leave the loop once all of its threads meet the condition (have acquired the lock).
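In HIP/CUDA terms, the pattern that produces this live lock has roughly the following shape (a sketch, not necessarily the exact source behind the link; lock is a placeholder):

__device__ void naive_lock(int *lock) {
  while (atomicCAS(lock, 0, 1) != 0) { }  // losing lanes spin here...
  // ... critical section ...
  atomicExch(lock, 0);                    // ...but on a lock-step wavefront the winning lane
                                          // waits at the loop exit and never reaches this
                                          // release, so the spin never terminates
}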
The threads on AMD (and pre-Volta NVIDIA) move in lock step. If execution follows the threads that did not get the lock, you are doomed.
You can make a critical section, but you cannot (easily) make the threads in the warp “spin-wait” for one of them. If you really need them all to do something sequentially, use a mask to keep track of which threads are done and prevent them from participating in the future. Loop until all are done. In the loop you do an if (cas(...)) { count++; updateMask(); }.
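A minimal sketch of that pattern in HIP/CUDA terms (placeholder names, not tested code): every lane loops until it has had its turn, and the winning lane releases the lock within the same iteration, before the loop back-edge, so another lane can win the next time around.

__device__ void serialize_lanes(int *lock) {
  bool done = false;
  while (!done) {
    if (atomicCAS(lock, 0, 1) == 0) {  // at most one lane wins per iteration
      // ... critical section ...
      __threadfence();                 // publish writes before releasing
      atomicExch(lock, 0);             // release inside the same iteration
      done = true;                     // this lane stops contending
    }
    // Lanes that lost the CAS fall through and retry on the next iteration.
  }
}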
You can’t implement a generic mutex lock on the GPU, at least according to the OpenCL standard. Actual hardware implementations may vary, but I don’t think they are very well documented. Trying to do per-thread semantics on a SIMT machine with lock-step execution is pretty much guaranteed to deadlock as well.
You may be able to deal with it by launching only one thread per warp (in NVIDIA’s terms). This way you avoid running into the issue with deadlocks across divergent threads.
Of course, the downside is that by running only one out of 32 (or 64) threads you lose a couple of orders of magnitude in performance…
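A sketch of that idea in HIP/CUDA terms, expressed as a lane guard inside the kernel rather than through the launch configuration (placeholder names, untested): only one lane per warp ever touches the lock, so no sibling lane can keep the holder from making progress.

__global__ void one_lane_per_warp(int *lock, int *data) {
  if (threadIdx.x % warpSize == 0) {        // only lane 0 of each warp contends
    while (atomicCAS(lock, 0, 1) != 0) { }  // the other lanes of this warp never hold the
                                            // lock, so this spin cannot starve the holder
    // ... critical section on data ...
    __threadfence();
    atomicExch(lock, 0);                    // release
  }
}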
ISO C++ allows you to implement a generic mutex lock on the GPU using the std::execution::par policy. All NVIDIA GPUs since Volta have Independent Thread Scheduling to support that. All threads in a warp can try to take the lock in any arbitrary order without issues. You can also do this with CUDA C++, OpenMP offloading with OpenMP atomics, etc.
Compiler Explorer supports running code on NVIDIA GPUs online, so you can try this out. This example implements a ticket lock and takes it from all threads in a kernel (across all warps): Compiler Explorer.
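For reference, a rough sketch of a ticket lock written with plain CUDA atomics (the linked example uses the ISO C++ facilities mentioned above; this is not that exact code, and the names are placeholders). Forward progress of the spin below still relies on sm_70+ Independent Thread Scheduling.

struct ticket_lock {
  unsigned next;     // next ticket to hand out (zero-initialize before use)
  unsigned serving;  // ticket currently being served (zero-initialize before use)

  __device__ void lock() {
    unsigned ticket = atomicAdd(&next, 1u);        // draw a ticket
    while (atomicAdd(&serving, 0u) != ticket) { }  // atomic read: spin until it is our turn
    __threadfence();                               // order the critical section after acquire
  }

  __device__ void unlock() {
    __threadfence();                               // publish writes before release
    atomicAdd(&serving, 1u);                       // admit the next ticket holder
  }
};

// Every thread in the grid takes the lock once, as in the linked example.
__global__ void take_lock_from_all_threads(ticket_lock *lk, int *counter) {
  lk->lock();
  ++*counter;  // critical section
  lk->unlock();
}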
You can write something that appears to be a ‘lock’ on the GPU using atomic builtins. However, a lock is only sound if threads cannot impede another thread’s progress, and the scheduler is fair. The first issue is somewhat solved by NVIDIA’s independent thread scheduling, but AMDGPU doesn’t have that, so it’s not a general solution.
According to the OpenCL standard, threads cannot depend on results from other threads at any time, because the standard makes no assumptions about the behavior of the scheduler. Actual hardware varies, and I don’t know of anything NVIDIA has published about how their scheduler works. AMDGPU’s ordering is based on the HSA standard, which allows some kinds of ordering, but not much.
So basically you end up with this situation where a mutex lock “works” until it doesn’t. You could have a warp claim a lock, then some graphics job gets launched on the GPU and boots it out; the scheduler is then under no contract to ever schedule the evicted thread that owns the lock back in.
I think the original blog post on the Volta thread scheduling stated that mutex locks can only work if they are guaranteed to be ‘starvation free’, i.e. each thread must eventually succeed in taking the lock. This matches what we do in systems like my RPC interface, where we guarantee that there are enough locks that each hardware thread can have its own slot; otherwise you’d risk being blocked by another thread.