44242 – Folding casts into phi's in InstCombine sometimes creates redundant registers

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.


Status:	RESOLVED FIXED

Alias:	None

Product:	new-bugs
Classification:	Unclassified
Component:	new bugs
Version:	unspecified
Hardware:	PC Linux

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-12-06 09:03 PST by Connor Abbott
Modified:	2020-01-15 06:41 PST
CC List:	11 users

See Also:
Fixed By Commit(s):	fb114694e939c0204ac356fc0e830332175cd008

Attachments
simplified test case (2.24 KB, text/plain) 2019-12-09 05:19 PST, Connor Abbott

Description Connor Abbott 2019-12-06 09:03:58 PST

InstCombine has a transform that does "cast(phi(a, b)) -> phi(cast(a), cast(b))", but it doesn't check that the phi has a single use, so if we do something like this where both the original phi and the cast have a use:

int a1 = ...;
while(true) {
    a2 = phi(a1, a4);
    a3 = bitcast_to_float(a2);
    ...
    a4 = bitcast_to_int(a3 * 0.5);
}

... = a2;
... = a3;

then, after instcombine runs, the original loop phi will stick around, and after folding casts some more, we'll end up with something like this:

int a1 = ...;
float a5 = bitcast_to_float(a1);
while(true) {
    a2 = phi(a1, a4);
    a6 = phi(a5, a7);
    ...
    a7 = a6 * 0.5;
    a4 = bitcast_to_int(a7);
}

... = a2;
... = a6;

Codegen then inserts extra, unnecessary copies, and register pressure goes up because a2 and a6 each get their own register. On AMDGPU, integers and floats use the same register file, so the first form is preferable.
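To make the redundancy concrete, here is a rough Python model of the two snippets above. It is only a sketch: the helper names mirror the pseudocode, `to_f32` and the starting bit pattern are assumptions added for illustration, and the `...` parts of the loop body are left out. It shows that a2 and a6 in the second form carry the same 32 bits on every iteration, which is why keeping both alive costs an extra register for no benefit.

```python
import struct

def bitcast_to_float(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def bitcast_to_int(value: float) -> int:
    """Reinterpret a single-precision float as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def to_f32(value: float) -> float:
    """Round a Python float (a double) to single precision."""
    return bitcast_to_float(bitcast_to_int(value))

def loop_one_phi(a1: int, iterations: int):
    """First form: one loop-carried integer, cast inside the loop."""
    trace = []
    a2 = a1
    for _ in range(iterations):
        a3 = bitcast_to_float(a2)
        trace.append((a2, a3))
        a2 = bitcast_to_int(a3 * 0.5)  # a4 feeds back into the phi
    return trace

def loop_two_phis(a1: int, iterations: int):
    """Second form: the integer phi a2 plus a new float phi a6."""
    trace = []
    a2, a6 = a1, bitcast_to_float(a1)  # a5 is the hoisted cast of a1
    for _ in range(iterations):
        trace.append((a2, a6))
        a7 = to_f32(a6 * 0.5)
        a4 = bitcast_to_int(a7)
        a2, a6 = a4, a7  # two values carried around the loop
    return trace

start = 0x3F800000  # bit pattern of 1.0f, chosen only for illustration
before = loop_one_phi(start, 4)
after = loop_two_phis(start, 4)
assert before == after  # both forms compute the same values
assert all(bitcast_to_int(f) == i for i, f in after)  # a2 and a6 hold the same bits
```

The model says nothing about actual register allocation; it only illustrates that the second phi is a bit-for-bit copy of the first.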

For some context as to why we ran into this: in Mesa, our internal IRs don't distinguish between floating-point and integer types. GLSL and SPIR-V, which we consume, do have a way to express this information, but most games don't provide it because they're transpiled from D3D bytecode, which doesn't have it. GPU instruction sets also universally have one register file for integer and floating-point operations, so tracking the distinction internally would be mostly pointless. Thus, when translating to LLVM, we pick one type (either int or float) and insert bitcasts as necessary around operations so that the resulting value is always that type.

Recently, in Mesa's driver for AMD, we've been trying to switch from a legacy IR called TGSI, which doesn't really understand smaller integer types and happens to pick float as its preferred type, to a newer IR that does understand small integer types. Using float there would be uglier, so the LLVM emitter for the new IR picks int as its default type and inserts bitcasts around every floating-point operation. We thought this choice wouldn't matter, because LLVM is usually pretty good about cleaning up these casts, but we ran into this case where it doesn't. In particular, this happens whenever there's a loop-carried floating-point value in the original source and we use it both with and without a cast after the loop. The cast after the loop gets CSE'd with the cast inside the loop, and you get something like my example. In the case I'm looking at (a shader from Deus Ex: Mankind Divided), this roughly doubles register pressure compared to the old IR, which avoids the problem because it uses float as the default type. On AMD GPUs, excess register pressure can be a problem regardless of whether it causes spills, due to the register-sharing scheme they use.

I can see two solutions for this inside LLVM:

1. Disable the transform if the phi has other users.
2. Add a MachineGVN pass inside the AMDGPU backend that gets rid of the extra phi.

I guess the answer depends on which representation should be canonical, the first or the second. I think the first is better, because having two phi nodes for the same value might confuse different analysis passes. (Of course, (1) is also way less effort.) But I'm no expert here. The theoretically best form for this code is probably something like:

float a1 = ...;
while(true) {
    a2 = phi(a1, a4);
    ...
    a4 = a2 * 0.5;
}

... = bitcast_to_int(a2);
... = a2;

but transforming the first snippet into this requires replacing the non-bitcasted uses of the original phi with a bitcast of the new phi, which would result in instcombine fighting with itself. To settle on this as the best form, you'd have to do some sort of global analysis of where to optimally insert the bitcasts; local transforms alone aren't going to cut it. It's these sorts of tricky situations that motivated us to not have separate integer and floating-point types in our IR in the first place.
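Option (1) above boils down to a use check before the fold. The following is a minimal sketch of that idea on a toy IR, not InstCombine's actual code: `Inst`, `make`, and `may_fold_cast_into_phi` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # compare instructions by identity, not by field values
class Inst:
    opcode: str
    operands: list = field(default_factory=list)
    users: list = field(default_factory=list)

def make(opcode: str, *operands: Inst) -> Inst:
    """Create an instruction and register it as a user of each operand."""
    inst = Inst(opcode, list(operands))
    for op in operands:
        op.users.append(inst)
    return inst

def may_fold_cast_into_phi(cast: Inst) -> bool:
    """Allow cast(phi(...)) -> phi(cast(...)) only if the old phi would die."""
    if cast.opcode != "bitcast" or not cast.operands:
        return False
    phi = cast.operands[0]
    if phi.opcode != "phi":
        return False
    # If anything besides the cast uses the phi, the fold would leave
    # two phis alive for the same value, so refuse it.
    return all(user is cast for user in phi.users)

a1 = make("const")
phi = make("phi", a1)
cast = make("bitcast", phi)
assert may_fold_cast_into_phi(cast)      # the cast is the phi's only user

store = make("store", phi)               # a second, non-cast use of the phi
assert not may_fold_cast_into_phi(cast)  # folding would now duplicate the phi
```

The second assertion corresponds to the problem case in this report, where the original phi is still used after the loop.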

Comment 1 Roman Lebedev 2019-12-06 09:28:36 PST

Would be good to have a standalone snippet showing the problem.

Comment 2 Connor Abbott 2019-12-09 05:19:25 PST

Created attachment 22921
simplified test case

Okay, so I just noticed that there's optimizeBitCastFromPhi which I believe is supposed to not create a redundant phi, and I guess that something is going wrong with it on my shader. Also, I found out that the bug only triggers when the initial value when entering the loop is a constant. I'm attaching a simplified test case which demonstrates the increase in register pressure and extra copies:

$ ~/build/llvm-debug/bin/llc < test.ll | grep NumVgprs
; NumVgprs: 4

$ ~/build/llvm-debug/bin/opt -S --instcombine test.ll | ~/build/llvm-debug/bin/llc | grep NumVgprs
; NumVgprs: 8

If I replace the 0, 1, 2, and 3 in the phis with a function argument, instcombine does nothing, and if I replace them with a bitcasted function argument, it replaces the phi wholesale with a floating-point equivalent, also resulting in no extra register pressure. So I think this isn't intended. In this example I made 4 phis to make what's going on a little more visible, but in the actual shader I'm looking at there are around 20 (!), which causes a drastic increase in register pressure.

Comment 3 Connor Abbott 2019-12-09 07:13:51 PST

https://reviews.llvm.org/D71209 should fix it.

Comment 4 Jay Foad 2019-12-10 02:14:24 PST

> Also, I found out that the bug only triggers when the initial value when entering the loop is a constant.

+Ryan Taylor, who has been looking at how we lower PHI nodes with constant operands during instruction selection. In MIR the PHI nodes always have register operands, so we have to insert move-immediate instructions, which might be related to the register pressure problems you're seeing.

Comment 5 Roman Lebedev 2020-01-15 06:31:19 PST

Wasn't this fixed?

Comment 6 Nikita Popov 2020-01-15 06:41:17 PST

Yes, the revision has landed in the meantime.