[LLVM] Addressing Rust optimization failures in LLVM

Description of the project: The Rust programming language uses LLVM for code generation, and heavily relies on LLVM’s optimization capabilities. However, there are many cases where LLVM fails to optimize typical code patterns that are emitted by rustc. Such issues are reported using the I-slow and/or A-LLVM labels.

The usual approach to fixing these issues is:

  1. Inspect the --emit=llvm-ir output on Godbolt.
  2. Create an LLVM IR test case that is not optimized when run through opt -O3.
  3. Identify a minimal missing transform and prove its correctness using alive2.
  4. Identify which LLVM pass or passes could perform the transform.
  5. Add necessary test coverage and implement the transform.
  6. (Much later: Check that the issue is really resolved after the next major LLVM version upgrade in Rust.)

The goal of this project is to address some of the less hard optimization failures. This means that in some cases, the process would stop after step 3 or 4 without proceeding to implementation, because it’s unclear how the issue could be addressed, or it would take a large amount of effort. Having an analysis of the problem is still valuable in that case.

Expected results: Fixes for a number of easy to medium Rust optimization failures. Preliminary analysis for some failures even if no fix was implemented.

Desirable skills: Intermediate C++ for implementation. Some familiarity with LLVM (at least ability to understand LLVM IR) for analysis. Basic Rust knowledge (enough to read, but not write Rust).

Project size: Either medium or large.

Difficulty: Medium.

Confirmed Mentor: @nikic

9 Likes

Hi!

I am interested in working on this project.

I am thinking of picking a few issues with the A-LLVM, and I-slow tag to create a proposal. How can I identify “less hard optimization failures”? And how many such issues would be reasonable to include in a proposal for a large-size project.

Utkarsh

How can I identify “less hard optimization failures”?

This is hard to answer generally, so I’ll try to go through a couple examples, based on recent issues.

Let’s pick the newest reported issue Suboptimal code generated for alignment checks and similar via `number.trailing_zeros() >= bit_count` · Issue #107554 · rust-lang/rust · GitHub as an example. (Edit: This has already been fixed by ⚙ D143368 [InstCombine] Look through truncate to fold icmp with intrinsics since I originally wrote this.)

  1. Inspect LLVM IR on Godbolt: Compiler Explorer
  2. Create an LLVM IR test case:Compiler Explorer Here we see that it fails to optimize if there is a truncation between the cttz and the comparison. The basic pattern itself is already supported.
  3. As such, we have two ways this might be fixed. First one is to just add support for the extra trunc, which would be this transform: Compiler Explorer Second one is to instead remove the trunc and then reuse the existing transform: Compiler Explorer
  4. In either case, this is a simple peephole transform that can be added to the InstCombine pass.

This is the hallmark of an “easy” optimization problem. It doesn’t get much easier than adding a peephole transform to InstCombine. An experienced LLVM developer would be able to implement and test this within an hour. For a new contributor this would take a few more hours, both in implementation time and review cycles. I have written a contribution guide that targets precisely this kind of InstCombine change as a general guidance.

Let’s look at the second issue: Missed optimization opportunity when trivially moving tuple of slices · Issue #107436 · rust-lang/rust · GitHub

This one already has an LLVM IR sample posted in the comment, but here’s the Godbolt link: Compiler Explorer. Here we perform a memcpy to an alloca and then only use it in a readonly noalias argument. This sounds kind of similar to a byval argument optimization LLVM already implements.

I would tentatively classify this as medium – this is not a simple peephole transform and would require some careful thinking about when exactly it is correct. But at least there is an existing transform to follow.

However, after implementing that transform one would find that it doesn’t actually fix the original issue (though it does fix others, so it’s valuable in any case). That’s because Rust’s LLVM IR is missing an align annotation on the %val argument for some reason. So a full fix would also need a change on the Rust side. That could be done as part of the project or not, depending on whether there is interest in working on Rust itself.

The third issue is missed(?) optimization with a const array of same item · Issue #107208 · rust-lang/rust · GitHub. This is about a load from a constant that contains a repeating pattern.

This one already has some analysis on the issue, and points out that we currently fail to handle even the basic case of a load from an all-zero global. Here is a reduced LLVM IR example of the missing transform: Compiler Explorer Fixing this would be another “easy” optimization.

Fixing the more general case of a load from a pattern-filled global I would classify as “medium”. It’s pretty clear what that transform has to do in abstract, but it would take a lot more implementation effort.

(I’m out of office for the next two weeks and have to catch a flight now, I’ll finish responding some other time.)

(Continuing from the previous post.)

All of the above ended up in the easy/medium category, and would be reasonable things to work on as part of this project (and it’s nice that they have varying levels of difficulty).

For a “hard” example, I’m not going to keep going down the list and just pick this one from the recent issues: `array::zip` in combination with `array::map` optimises very poorly · Issue #103555 · rust-lang/rust · GitHub

This is basically asking to fuse an array “zip” operation together with a following array “map” optimization. LLVM has a default-disabled loop fusion pass that might be able to handle this, but even if it can handle it, getting an experimental pass ready for default enablement would likely be a whole project in itself.

And how many such issues would be reasonable to include in a proposal for a large-size project.

I find that really hard to say. This can’t be reduced to a number of issues, because the time needed will vary significantly between them (the medium issues above will take a lot more time than the easy ones).

I’m not sure what requirements GSoC has in this area, but what I had in mind is to keep this relatively open ended, and just cover as much as time permits. It’s probably a good idea to have some starting points in mind (such as the examples above), but I’m not sure it makes sense to specify a fixed list of issues upfront. There is also some risk that some specific issues (especially if they’re very easy) already get resolved in the meantime.

Thank you so much for your helpful reply. I’ll look into the contribution guide and the issues you explained and reach out with further questions!

Hi! I’m a third year CS student, currently working on static analysis for Rust at DeepSource. This project looks really interesting, and I would love to work on it, especially with the missed optimisations for the more common Rust patterns. May I please know what mode of communication you would prefer to review the draft proposal?

Hello, I am second-year computer engineering student, and am interested in this project. Via what platform should I contact you? I am a beginner to gsoc, and it seems I should start preparing a proposal on my own and then review it with you, do I understand correctly? What would you suggest to work on right now as preparation for project or proposal?

Ilyas

Hi, Nikita. Thank you for all information above, I’m a CS master’s student and am also interested in this project.

I want to start from InstCombine patches around GEP and load (for third example you gave above). I tried to send a patch to add test.

I have the plan to add codes to find GEP to uniformly initialized arrays around InstCombinerImpl::visitLoadInst
but I’m slightly getting lost on where to implement new transformations. How do you find and decide where to add codes when you add new transformations? (this might be too abstract)

(I’m currently reading and finding where to implement the patch with LLVM_DEBUG macro, print methods with dyn_cast. But I’m not confident to catch an overview of the passes)

I think you’re already on the right track here – this transform is pretty much independent of other load transforms, so putting it anywhere inside visitLoadInst() would work. The only caveat is that the function has this early return: https://github.com/llvm/llvm-project/blob/35276f16e5a2cae0dfb49c0fbf874d4d2f177acc/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp#L1060-L1062 This excludes volatile loads and ordered atomic loads. The transform in question is indeed illegal for volatile loads, but it is legal for atomic loads, so one could add it in front of that check. This also suggests some additional tests that could be added (using volatile/atomic).

The second consideration here is that this transform does not create new instructions. The load will be converted into a constant value. Such transforms are (ideally) performed not in InstCombine, but in InstSimplify. The relevant code would be here: https://github.com/llvm/llvm-project/blob/8f7e7400b74b362251dcf155e45f4097fd574196/llvm/lib/Analysis/InstructionSimplify.cpp#L6572 You can see that this function already contains code to fold a load from a constant at a fixed offset. You would be extending it to also handle loads from non-constant offsets (in specific circumstances).

Architecturally, this is the right place to add the new code. However, it might turn out that this is too expensive (in terms of compile-time) to perform there, in which case it might have to go into InstCombine (or AggressiveInstCombine) after all. But simplifyLoadInst() would be the place to try first.

I would recommend to first only handle the all-zeroes case, because it is much simpler than the general one. Basically, you only need to call ConstantFoldLoadFromUniformValue() on getUnderlyingObject() of the pointer.

Regarding the abstract question – not sure there’s a good answer to that. If you don’t know at all where to place something, I’d try to find something similar to what you want to do that already works, and then find out which pass does the transform using -print-after-all (or on Godbolt, click “Add new” and then “LLVM Opt Pipeline” for a nice presentation) and then -debug to narrow down where in the pass implementation it happens.

1 Like

Feel free contact me at npopov@redhat.com. I’m also on the LLVM discord and the Rust zulip for online discussion. (Edit: I recently realized that Discourse has a private messaging feature, so that would work as well.)

Sounds reasonable to me, but I’m new to GSoC as well, so I’m not very familiar with how these proposals look like…

It probably wouldn’t hurt to take a look at some A-LLVM/I-slow issues and see whether anything catches your eye as a good starting point for the project.

The transform in question is indeed illegal for volatile loads, but it is legal for atomic loads, so one could add it in front of that check. This also suggests some additional tests that could be added (using volatile/atomic).

That seems right, thanks!

You can see that this function already contains code to fold a load from a constant at a fixed offset. You would be extending it to also handle loads from non-constant offsets (in specific circumstances)…

Ah, I noticed the constant-offset load for global constants is folded by try at hand, but couldn’t identify where. Yeah, I’ll try it for the all-zeroes cases first. I’ll try to confirm compile time on llvm-compile-time-tracker.com. (I may ask you about it later.)

I’d try to find something similar to what you want to do that already works, and then find out which pass does the transform using -print-after-all (or on Godbolt, click “Add new” and then “LLVM Opt Pipeline” for a nice presentation)

Oh, I didn’t notice those options, those help me a lot. I appreciate your help.

@nikic Hi Nikita
I’m currently working on folding uniformly initialized constant aggregate types. Comparing llvm:main...khei4:khei4/perf/constantFoldFromAllEqAggregate · llvm/llvm-project · GitHub
(I believed constant uniquely initialized global variables could be safely folded. Although I need to search how to handle empty struct and array.)
This implementation might be too naive and need to be considered architecturally but I want to measure compile-time on this change. Could you enable compile-time-tracker on my fork?

@khei4 Done. You’ll want to rebase over [InstCombine] Call simplifyLoadInst() · llvm/llvm-project@be88b58 · GitHub to get meaningful numbers.

@nikic Thank you! I could see the result!

Hi. @nikic I’m writing a proposal for this project and am curious about this call slot optimization opportunity after data mutation. If possible, I want to try to solve on the project. Can I tackle this issue by subsequently to your patch? Do you think it’s worth trying?

I believe there already is an open patch for handling this case: ⚙ D140089 [MemCpyOpt] Add a stack-move optimization to opportunistically merge allocas together.

1 Like

Ah. Thanks! I’ll find another one!

Hi, Nikita.

Thank you for detailed description given above, I am also interested in this project, if you have room for more.

I am a graduate student in applied-math/CS with a BSc in CS.
I have sent you a mail regarding writing up a proposal for this project.

Maybe starting by looking at the missing handling for tanh:

Hi there, are you still accepting applicants for proposals along these lines? As someone with an interest in Rust and compilers, this project really jumped off the page at me.
Having a quick look through the github issues page, there seems to be no shortage of issues that need investigation. So I can’t see there being an issue with my proposal co-existing with the other enthusiastic contributors here, as long as we communicate on which issues we’re working on!
I have the required C++ experience, and some familiarity with LLVM through my experience with Rust.

Would a proposal for this still be welcome?