`computeKnownBits` calls are used widely throughout optimization passes, as they enable bit-aware optimizations and code canonicalization. As an example, InstCombine will use `computeKnownBits` to canonicalize `xor A, B` to `or disjoint A, B` if the known bits of `A` and `B` are disjoint – that is, there is no bit position where both `A` and `B` can be 1.
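As a rough sketch (not the actual InstCombine code), the check can be phrased directly in terms of the `KnownBits` results returned by `computeKnownBits`; the helper name below is made up, and the exact `computeKnownBits` signature varies between LLVM versions:

```cpp
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/KnownBits.h"

using namespace llvm;

// Sketch: is it safe to rewrite `xor A, B` as `or disjoint A, B`?
// The rewrite is valid if every bit position is known to be zero in at
// least one of the two operands, i.e. no position can be 1 in both.
static bool canUseOrDisjoint(const Value *A, const Value *B,
                             const DataLayout &DL) {
  KnownBits KnownA = computeKnownBits(A, DL); // recursive walk over A's operands
  KnownBits KnownB = computeKnownBits(B, DL); // recursive walk over B's operands
  return (KnownA.Zero | KnownB.Zero).isAllOnes();
}
```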
These `computeKnownBits` calls recursively analyze the instruction's operands in the hope of gathering more information about the `KnownBits` of the instruction in question. For long instruction sequences, computing the `KnownBits` can therefore involve an expensive recursive analysis, and these calls are a common source of increased compile time (see, for example, "Make LLVM fast again").
To bound this cost, the recursion is cut off at a hardcoded depth limit of 6. However, in certain cases this limit leads to significant degradations in the quality of the generated code.
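To make the effect of the cutoff concrete, here is a small toy model of a known-bits walk with a depth limit. It is not LLVM's implementation (which lives in ValueTracking and is far more elaborate); it only illustrates how all information about an operand is discarded once the walk reaches the limit:

```cpp
#include <cstdint>

// Same numeric limit that ValueTracking uses (MaxAnalysisRecursionDepth).
constexpr unsigned MaxAnalysisRecursionDepth = 6;

// Minimal stand-ins for KnownBits and an expression tree; both are toy types.
struct Known {
  uint64_t Zero = 0; // bits known to be 0
  uint64_t One = 0;  // bits known to be 1
};

struct Expr {
  enum Kind { ConstMasked, Or } K;
  uint64_t Mask = 0;                         // for ConstMasked: bits that may be 1
  const Expr *LHS = nullptr, *RHS = nullptr; // for Or
};

Known computeKnown(const Expr *E, unsigned Depth) {
  Known Result;
  // Once the depth budget is exhausted, return "nothing known" instead of
  // looking at the operand at all; this is where precision is lost.
  if (Depth >= MaxAnalysisRecursionDepth)
    return Result;

  switch (E->K) {
  case Expr::ConstMasked:
    Result.Zero = ~E->Mask; // bits outside the mask are known zero
    break;
  case Expr::Or: {
    Known L = computeKnown(E->LHS, Depth + 1);
    Known R = computeKnown(E->RHS, Depth + 1);
    Result.Zero = L.Zero & R.Zero; // zero only if zero in both operands
    Result.One = L.One | R.One;    // one if one in either operand
    break;
  }
  }
  return Result;
}
```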
A notable example is Triton's Linear Layout (`include/triton/Tools/LinearLayout.h` in the triton-lang/triton repository). This feature provides a simplified way of mapping hardware data locations on GPUs to tensor indices, so less computation is needed to calculate memory addresses. However, by design, these memory address calculations now involve `xor`s.
In a specific function using Linear Layout, we end up with the following IR:
```llvm
%130 = or disjoint i32 %128, %129, !dbg !34
%131 = getelementptr inbounds nuw half, ptr addrspace(3) @global_smem, i32 %130, !dbg !34
store <4 x i32> %71, ptr addrspace(3) %131, align 16, !dbg !34
%132 = xor i32 %130, 4096, !dbg !34
%133 = getelementptr inbounds nuw half, ptr addrspace(3) @global_smem, i32 %132, !dbg !34
store <4 x i32> %74, ptr addrspace(3) %133, align 16, !dbg !34
%134 = xor i32 %130, 8192, !dbg !34
%135 = getelementptr inbounds nuw half, ptr addrspace(3) @global_smem, i32 %134, !dbg !34
store <4 x i32> %77, ptr addrspace(3) %135, align 16, !dbg !34
%136 = xor i32 %130, 12288, !dbg !34
%137 = getelementptr inbounds nuw half, ptr addrspace(3) @global_smem, i32 %136, !dbg !34
store <4 x i32> %80, ptr addrspace(3) %137, align 16, !dbg !34
```
The final instruction involved in each base address calculation is an `xor`. These `xor`s have disjoint operands, but we do not canonicalize them to `or disjoint` due to the recursion depth limit.
For one specific `xor` (`%132`), the entirety of the address calculation is:
```llvm
%28 = tail call i32 @llvm.amdgcn.workitem.id.x(), !dbg !26
%29 = and i32 %28, 8, !dbg !26
%.not = icmp eq i32 %29, 0, !dbg !26
%30 = and i32 %28, 16, !dbg !26
%31 = icmp eq i32 %30, 0, !dbg !26
%32 = and i32 %28, 32, !dbg !26
%33 = icmp eq i32 %32, 0, !dbg !26
%34 = and i32 %28, 256, !dbg !26
%53 = shl i32 %28, 3, !dbg !29
%54 = and i32 %53, 56, !dbg !29
%121 = select i1 %.not, i32 0, i32 72
%122 = select i1 %31, i32 0, i32 144
%123 = or disjoint i32 %121, %122
%124 = select i1 %33, i32 0, i32 288
%125 = or disjoint i32 %123, %124
%126 = xor i32 %125, %54
%127 = and i32 %53, 1536
%128 = or disjoint i32 %127, %126
%129 = shl nuw nsw i32 %34, 3
%130 = or disjoint i32 %128, %129, !dbg !34
%132 = xor i32 %130, 4096, !dbg !34
```
For the `%132` instruction, the longest recursion path before reaching a leaf node is:

`%130` → `%128` → `%126` → `%125` → `%123` → `%121` → `%.not` → `%29` → `%28`
Thus, the recursion exceeds the depth limit of 6, and we are left with `xor`s as the final instructions in the address calculations. SeparateConstOffsetFromGEP does not currently analyze `xor` operands, so it does not attempt to fold the constants into the GEPs. For these particular GEPs, the users (the stores) occur immediately after the GEPs, so CodeGenPrepare's address sinking does not apply. However, there are also load users in other blocks whose GEPs are not sunk, since our target (AMDGPU) does not find a matching AddrMode. As a result, we end up with many distinct base addresses instead of a few bases with constant offsets.
For this particular case, if we use a new `computeKnownBits` API which overrides the depth limit, the result is that 18 fewer registers are used for the load addresses. Importantly, this has also eliminated the spilling and doubled the FLOPS.
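For concreteness, a hypothetical shape for such an API (the overload name, the `MaxDepth` parameter, and the budget of 16 are all made up for illustration; nothing like this exists upstream today):

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/KnownBits.h"

using namespace llvm;

// Hypothetical overload that accepts a per-query recursion budget instead of
// the fixed MaxAnalysisRecursionDepth. Declaration only; not an existing API.
KnownBits computeKnownBitsWithDepthLimit(const Value *V, const DataLayout &DL,
                                         unsigned MaxDepth);

// Sketch of how a client that values precision over compile time (e.g. the
// xor -> or disjoint check that benefits SeparateConstOffsetFromGEP) might
// opt in to a deeper walk.
static bool isDisjoint(const Value *A, const Value *B, const DataLayout &DL) {
  const unsigned RelaxedDepth = 16; // illustrative budget, not a real constant
  KnownBits KA = computeKnownBitsWithDepthLimit(A, DL, RelaxedDepth);
  KnownBits KB = computeKnownBitsWithDepthLimit(B, DL, RelaxedDepth);
  return (KA.Zero | KB.Zero).isAllOnes();
}
```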
The conclusion is that the recursion depth limit can significantly hurt runtime performance. We are exploring solutions that clean up the instruction sequences involved in calculating the addresses, but that type of approach is not very robust and may still result in missed optimizations. The more stable solution is to make the recursion depth limit more flexible. Some `computeKnownBits`-based optimizations (e.g. the `xor` → `or disjoint` canonicalization that benefits SeparateConstOffsetFromGEP) seem to provide a stronger performance uplift than others, so it makes sense to me that some clients could use a more relaxed depth limit than others. However, I'm curious whether the community agrees, and what the consensus is for issues of this type.