I get "llvm::thinLTOInternalizeModule ... Assertion `GS != DefinedGlobals.end()' failed" consistently when compiling a particular Rust crate. This happens when handling the following symbol name: switch.table._ZN77_$LT$omaha_client..protocol..request..Event$u20$as$u20$core..clone..Clone$GT$5clone17h829b64c9ab982ff5E.llvm.10390335839252477638.llvm.11308644296266801080 Note that this has _two_ .llvm extensions appended. getOriginalNameBeforePromote chops off both extensions instead of only the last one. Changing getOriginalNameBeforePromote to only chop off the last one (with rsplit instead of split) fixes the issue. I do not yet have a simple reproducer. I'm hoping this is clearly a bug without one, but if not I can get one ready. See also https://github.com/rust-lang/rust/issues/67855
I'm not sure what sequence of events would result in double promotion, and I haven't seen this before. We run the code that performs the renaming/promotion once per module in the ThinLTO backend. Can you make a reproducer?
I'm having trouble making a minimal reproducer, but I've attached some bitcode files that demonstrate the problem. They show the different stages of the ThinLTO pipeline as implemented in the Rust compiler [1], [2], in the following order:

xxx.thin-lto-input.bc
xxx.thin-lto-after-nounwind.bc
xxx.thin-lto-after-rename.bc
xxx.thin-lto-after-resolve.bc

(after-internalize would come next, but the abort happens before it gets written.)

The offending symbol in these files is different from the one in my original description, but it is also a switch table. In the input, it starts out as "switch.table._ZN61_$LT$omaha_client..AppEntry$u20$as$u20$core..clone..Clone$GT$5clone17h800435b9c4a89163E.llvm.1861638680630777414". Then, in renameModuleForThinLTO, a second suffix gets added and it becomes "switch.table._ZN61_$LT$omaha_client..AppEntry$u20$as$u20$core..clone..Clone$GT$5clone17h800435b9c4a89163E.llvm.1861638680630777414.llvm.3579491054261050351". There are other symbols that carry the same suffix in the input file, but only this one gets a second suffix added. Does this help?

[1]: https://github.com/rust-lang/rust/blob/72b2bd55edbb1e63a930c5ddd08b25e4f9044786/src/librustc_codegen_llvm/back/lto.rs#L720-L861
[2]: https://github.com/rust-lang/rust/blob/72b2bd55edbb1e63a930c5ddd08b25e4f9044786/src/rustllvm/PassWrapper.cpp#L963
I should also mention that I only hit this when compiling without landing pads, which adds the "nounwind" stage directly before the rename pass. This is definitely a less exercised path in the Rust compiler, so it's possible that that pass is doing something to cause this, or it might just be a red herring.
Created attachment 23008 [details] bitcode files
What's unexpected is that the symbols have already gone through a round of renaming. I see lots of symbols with the ".llvm.xxxxxxx" extension in the input.bc. So it seems as though renameModuleForThinLTO is being called twice. Do you know why/how it is being called before the "input" stage?
Hm, so those symbols are from an external static library whose bitcode appears to be going through the same LTO pipeline before being archived. I find it surprising that the library would be going through the same pipeline twice, so I need to investigate.
So it turns out that:
- the Rust compiler uses the ThinLTO pipeline _by default_ to optimize individual Rust libraries, when a high enough optimization level is used and parallel codegen is enabled
- when you plan to use ThinLTO on the final binary, it won't do this
- but to get that behavior you have to pass the -Clto=thin flag when building the Rust library, which actually *disables* the ThinLTO optimization pipeline when building that library (and enables it for the final executable)

Since we weren't passing the flag to the library build, we got the ThinLTO pipeline running twice. I still think this is a bug, because I should be able to use ThinLTO to optimize a static library, and then use it again to optimize the final executable. There might be reasons why you wouldn't want to do that, but it should still work. (As it turns out, doing so is actually faster in our build, presumably because we're doing more of the work up front.)
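Concretely, each rename/promotion round appends a ".llvm.<module hash>" suffix to a local symbol it promotes, so running the pipeline twice explains the doubled suffix shown earlier. A rough standalone sketch (plain std::string and an abbreviated symbol name, not LLVM's actual implementation):

```cpp
#include <iostream>
#include <string>

// Not LLVM's actual code: during ThinLTO renaming, a local symbol that needs
// promotion gets ".llvm.<module hash>" appended so its external name is
// unique. If the rename step runs a second time over an already-promoted
// module, a second suffix ends up appended.
static std::string promoteLocalName(const std::string &Name,
                                    const std::string &ModuleHash) {
  return Name + ".llvm." + ModuleHash;
}

int main() {
  // Abbreviated form of the switch-table symbol from the attached bitcode.
  std::string Name = "switch.table._ZN61...clone17h800435b9c4a89163E";
  Name = promoteLocalName(Name, "1861638680630777414");  // library build's rename
  Name = promoteLocalName(Name, "3579491054261050351");  // executable build's rename
  std::cout << Name << '\n';  // carries two ".llvm." suffixes
  return 0;
}
```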
Ah, I see. That's interesting. Essentially your library build is doing a complete ThinLTO link but capturing the precodegen bitcode modules and producing an archive of those. Then you are doing another ThinLTO whole-program link consuming those library-ThinLTO'ed bitcode archives. Out of curiosity, at what point are the bitcode files from your library re-summarized for ThinLTO? I don't think there's a way to trigger this kind of behavior with C++/clang + ThinLTO without using a linker save-temps option and archiving the emitted precodegen temp bitcode files, after first feeding those through something like "opt -module-summary" to re-summarize the new modules, which is why we have never seen anything like it. As you point out, it seems like reasonable/desirable behavior for your pipeline, and it is a simple enough fix, so we can support that. I'll fix it next week.
> Out of curiosity, at what point are the bitcode files from your library re-summarized for ThinLTO?

They're summarized again in the second compile invocation (the executable depending on a library).

The use of ThinLTO in the library build is actually more of an implementation detail. Because rustc splits up the work to generate and optimize LLVM IR across multiple codegen units, it can use ThinLTO to "stitch everything back together again" in a fairly optimized fashion. So in effect, we have a pre-pass for optimizing each individual library and a post-pass for optimizing the whole executable.

What's actually going on is that rustc is doing some unnecessary work here. While building the library it actually runs the first ThinLTO pipeline to completion and generates an object file, which is never used. It also discards the summary at this stage. Then, while building the executable which depends on the library, it reruns the entire ThinLTO pipeline on just the library code again. Thus the bitcode gets run through some optimization passes that have already been run (which will be much faster this time), and the summary gets generated again.

Of course, all this extra work is not part of the original design - just an artifact of a missing compiler flag in our build. If we had the missing flag, rustc would defer all optimization until the executable build. But in a build where the same libraries get used for many executables, this turns out to be slower! All the extra work I mentioned is less than the savings that we get from running the optimization pipeline on bitcode at the time of building the library.

Hopefully at some point, we can get rustc to do this without all the silly duplicated work. But I think the effect of the duplicated work should be slower compile times (and perhaps a warning from the compiler frontend), not a crash.
Fix mailed for review: https://reviews.llvm.org/D72711

I'm not surprised that you are seeing faster build times with an earlier round of optimization. Have you measured the resulting performance with and without the first round of ThinLTO? Thinking through this scenario, there are a couple of things to be aware of.

First, the initial round of function importing and inlining will have an impact on the importing and inlining in the full binary ThinLTO. It could be positive or negative. I.e. if the first round results in a lot of cleanup and more compact code, you may get more successful importing and inlining (generally positive). If the first round results in larger functions, however, you may get less cross-module importing/inlining, which could be a negative. E.g. A calls B calls C, where B and C are in the same library but A is in a different one (see the sketch at the end of this comment). If the first round imports and inlines C into B, it is possible that the combined B+C may be above the threshold to import/inline in the full binary build later. Whereas if we hadn't done this, we might have been able to import B alone and inline it into A. If the A->B edge is hotter than the B->C edge, this could be a lost opportunity.

Another thing to think about for the future is whole program devirtualization. I know next to nothing about Rust, but a quick search suggests it has virtual functions. In that case it could benefit from the ThinLTO WPD. To take advantage of this you would presumably have to modify the Rust compiler to emit the type tests that clang emits under -fwhole-program-vtables. However, this optimization generally can only be done safely on the final binary, and currently the type tests are stripped early in the optimization pipeline. Which means with an initial round of ThinLTO on the libraries they would be removed. That could presumably be fixed, I don't see any fundamental reason the type tests can stick around until just before codegen, but that's the situation at the moment.
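A hypothetical sketch of the A/B/C scenario (the functions, bodies, and sizes are invented; only the call graph and the module boundaries matter):

```cpp
// Hypothetical code illustrating the import/inline interaction above.

// --- library module (gets its own ThinLTO round first) ---
int C(int x) { return x * 2; }      // small: easily importable on its own
int B(int x) { return C(x) + 1; }   // the library round may inline C here

// --- executable module ---
extern int B(int x);                // hot call edge: A -> B
int A(int x) { return B(x); }

// If the early round inlines C into B, the combined B+C body that the
// whole-program ThinLTO link later sees may exceed the importing/inlining
// threshold, so B is not imported into A's module or inlined into A.
// Without the early round, B alone might have been imported and inlined
// into A, which matters when A -> B is hotter than B -> C.

int main() { return A(1); }
```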
> I don't see any fundamental reason the type tests can stick around until just before codegen

This should be "*can't* stick around".
Fix committed in 7dc4bbf8ab311606388faacca58b6c3e3e508b77. Unfortunately, I forgot to put the PR number in the description.
Thank you for the fix and for all the added insight! It's clear there are some drawbacks to (ab)using the ThinLTO pipeline this way. For production releases, we should probably go with the sequence that rustc was designed with in the first place. In the future, I suppose we could tinker with running a different set of passes at each stage. For `-Oz`, for example, maybe we could run all the passes we would anyway but skip inlining at the first stage. Then again, without inlining, I'm not sure we'd be saving all that much work. That said, this question is pretty academic at this point, since I doubt I'll have time to tinker with this anytime soon.