[RFC] Measuring GlobalISel compile-time performance

Hi,

At EuroLLVM, we had some discussions about GlobalISel/(g)MIR being a bit too slow overall, with MachineInstr::addOperand singled out as unnecessarily slow. I’m interested in improving GlobalISel performance overall, but for that I want to make sure I measure the improvements properly, and that I’m looking at the right things.

Micro-Benchmarks

I started adding some benchmarks for MachineInstr APIs in this branch: GitHub - Pierre-vh/llvm-project at mir-benchmarks

Right now they’re limited to MachineInstr constructor and MachineInstr::addOperand. New benchmark ideas are welcome.

I think adding some mini benchmarks to test individual MIR APIs is a good thing as it allows contributors and maintainers to quickly check the performance of common APIs, and track them over time.
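As a language-agnostic sketch of the shape such a micro-benchmark takes (the list-append workload and function name are stand-ins for illustration, not LLVM's actual harness, which would use the C++ Google Benchmark library):

```python
import timeit

# Hypothetical stand-in: time the cost of appending N operands one call at a
# time, mirroring the shape of a MachineInstr::addOperand micro-benchmark.
def bench_add_operand(num_operands: int, repeat: int = 5) -> float:
    def workload():
        operands = []
        for i in range(num_operands):
            operands.append(i)  # stand-in for MI.addOperand(...)
        return operands
    # Report the best of several runs to reduce noise, as benchmark
    # harnesses like Google Benchmark typically do.
    return min(timeit.repeat(workload, number=1000, repeat=repeat))

best = bench_add_operand(32)
```

The point is that each benchmark is tiny, self-contained, and cheap to rerun across revisions.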

Would this be a welcome addition upstream? The llvm/benchmarks folder is under-used anyway (it only has one YAML benchmark), so even if this ends up not being of much use, it doesn’t add much bloat and is easy to remove.

If yes, should I wait until I have a solid number of benchmarks to upstream, or is it worth upstreaming ASAP to try and give this some momentum?

Real-world Benchmarks

This is likely the type of benchmark that matters the most for end users, but as we have no default-enabled GlobalISel backend upstream, I’m unsure how to test this accurately.

I think the AArch64 backend is the best benchmark candidate as it’s the most complete one.

However, I’m stuck on figuring out what inputs to use to test it.
I personally use a Linux x64 machine so I don’t have an AArch64 toolchain available. I’d like to find some code that:

  • Does not need a complete toolchain (minimal to no dependencies - only needs an LLVM build)
  • Does not fall back to DAGISel at all
  • Is large enough that it stress tests GlobalISel in a meaningful way
    • e.g. very large functions with many instructions and basic blocks

Does anyone have a suggestion?


Hi!

My approach to get a larger input would be the following. The AArch64 headers are available in most Linux distros. With the header files available, I would pick a larger application, either from the LLVM test suite or from SPEC. Compile the source files targeting AArch64 with opt level -O3 to bitcode files, and then use llvm-link to create one large bitcode file. Using -O3 should inline enough code that you get large functions, and having a whole application in one file should stress the various parts of GlobalISel.
However, it is difficult to control whether there is a fallback to DAGISel. Since this should not happen that often, one approach might be to identify the functions requiring DAGISel and remove them from the file.

Kai

You can take ComPile and retarget the IR to AArch64.
You’ll lose 20%, some modules might not contain the large functions you are looking for, and there is no guarantee w.r.t. fallbacks.
That said, it’s super easy to iterate over millions of LLVM-IR modules with real code in 10 lines of python.
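For illustration, a minimal sketch of that iteration with the HuggingFace datasets library; the dataset id ("llvm-ml/ComPile") and the field name ("content") are assumptions to check against the dataset card, and the corpus may store bitcode rather than textual IR, in which case llvm-dis is needed first:

```python
# Rough size proxy: count textual IR function definitions in a module.
def count_defined_functions(ir_text: str) -> int:
    return sum(1 for line in ir_text.splitlines() if line.startswith("define "))

# Stream ComPile and keep only sizeable modules. Dataset id and field name
# are assumptions; consult the dataset card before relying on them.
def large_modules(min_functions: int = 100):
    from datasets import load_dataset  # pip install datasets
    ds = load_dataset("llvm-ml/ComPile", split="train", streaming=True)
    for sample in ds:
        if count_defined_functions(sample["content"]) >= min_functions:
            yield sample["content"]
```

Streaming mode avoids downloading the whole corpus before filtering.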

Maybe I am a bit too optimistic here. My steps on Ubuntu:

# Install the library and headers. Attn, this pulls in the whole toolchain.
sudo apt install libstdc++-13-dev-arm64-cross

# Cross-compile a benchmark
cd $LLVM/test-suite/MultiSource/Benchmarks/MiBench/consumer-jpeg
clang --sysroot /usr/aarch64-linux-gnu -target aarch64-linux -O3 -c -emit-llvm *.c
llvm-link -o jpeg.bc *.bc
llc -mtriple aarch64-linux -global-isel < jpeg.bc

This results in a nice input file (ca 290 functions with many basic blocks), but llc fails due to a legalization problem (G_PTR_ADD with vector typed operands). Looks like more options are needed.

Could you provide the error message for the legalisation issue?

Sure.

LLVM ERROR: unable to legalize instruction: %275:_(<4 x p0>) = G_PTR_ADD %268:_, %274:_(<4 x s64>) (in function: prepare_for_pass)

clang and llc based on a855eea7fe86ef09a87f6251b3b711b821ae32bf.

Along those lines, I’m curious if anyone knows the answer to my previous question about GISel performance. Did it get slower vs FastISel, or are people finding performance is less than the early results when applying it to a wider set of inputs?

I used the scalarizer pass to get rid of that PTR_ADD error but then I hit

LLVM ERROR: unable to legalize instruction: %347:_(s16) = G_MERGE_VALUES %702:_(s8), %703:_(s8) (in function: process_data_context_main)

I just enabled DAG fallback for that single function.

Overall it’s 60k lines of IR, not too bad, but it still compiles within a second, so I’d need something bigger to stress-test ISel. It’s difficult to find real bottlenecks this way.

I tried looking into the fuzzers, but I’ve never used those and I’m not sure where to start to generate huge functions. I’d like to find a way to start with one or more IR modules, then generate a huge LLVM function from that by repeatedly piecing together functions with small modifications in between.
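A crude textual way to inflate a test input along those lines would be to clone a module's functions under fresh names before linking; a sketch (assumes simple `@name` symbols without quoting, and producing one huge function would still need inlining afterwards, e.g. via opt):

```python
import re

# Inflate a module by cloning its function definitions under fresh names,
# so llc sees many more (structurally identical) functions. Purely textual.
def clone_functions(ir_text: str, copies: int) -> str:
    defs = re.findall(r"^define .*?^}\s*$", ir_text, re.MULTILINE | re.DOTALL)
    chunks = [ir_text]
    for n in range(copies):
        for d in defs:
            # Rename every symbol reference inside the cloned body so clones
            # call their own renamed siblings, not the originals.
            chunks.append(re.sub(r"@([\w.]+)", rf"@\1.clone{n}", d))
    return "\n".join(chunks)
```

This is a blunt instrument (no semantic variation between clones), but it scales an input arbitrarily for stress testing.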

HuggingFace seemed down today so I couldn’t try that.

In any case, I checked the jpeg file with perf record -F max and, unsurprisingly, MatchTable is fairly high in the list (about 1% of execution).
addOperand is also high in the list.

I think we could benefit from adding a batched mode to MachineInstrBuilder + a batched addOperand to avoid multiple calls whenever possible.
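A sketch of the batching idea in Python (class and method names are illustrative, not LLVM's actual API):

```python
# Collect a whole batch of operands and commit them in one call, so the
# per-call overhead (growth, bookkeeping) is paid once per batch instead
# of once per operand, as with repeated addOperand calls.
class BatchedInstrBuilder:
    def __init__(self):
        self.operands = []

    def add_operands(self, *ops):
        # One extend for the whole batch.
        self.operands.extend(ops)
        return self

mi = BatchedInstrBuilder().add_operands("dst", "src0", "src1")
```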

I also think we lack a bit of caching in MRI. getVRegDef is used extensively but it still does an iterator walk. Maybe we could have a map of Reg → Def and invalidate entries when the use list is updated for that Reg? Not sure if it’d be a big speedup/worth the effort.
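The invalidation scheme could look like this sketch (names are illustrative; the real MRI hook points would need care):

```python
# Memoize Reg -> Def lookups and drop an entry whenever that register's
# use/def list changes, so cached answers never go stale.
class VRegDefCache:
    def __init__(self, lookup):
        self._lookup = lookup  # the slow iterator-walk lookup
        self._cache = {}

    def get_vreg_def(self, reg):
        if reg not in self._cache:
            self._cache[reg] = self._lookup(reg)
        return self._cache[reg]

    def invalidate(self, reg):
        # Call this from the use-list update hook for the register.
        self._cache.pop(reg, None)
```

Whether the invalidation traffic eats the savings is exactly the question the micro-benchmarks could answer.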

I will also put MatchTable on my radar to see if it can be improved in some way.

I have a PR to address the G_PTR_ADD issue. AArch64 is unhappy with s16 in merge values. Another issue we have to address.

I got myself Gravitons from AWS and bootstrapped Clang. I was mainly interested in fallbacks. I never bootstrapped Clang for speed and profiling.

Along those lines, I’m curious if anyone knows the answer to my previous question about GISel performance. Did it get slower vs FastISel, or are people finding performance is less than the early results when applying it to a wider set of inputs?

I think it’s the former; we haven’t had a strong focus on compile-time performance in the last few years. There’s likely room for improvement. That said, we’ll never be as fast as FastISel: just iterating over the MIR in multiple passes puts us behind in some cases (e.g. sqlite3).


I did a small round of optimization on the MatchTable as it’s the most time consuming operation of the pipeline so far: [TableGen][GlobalISel] Specialize more MatchTable Opcodes by Pierre-vh · Pull Request #89736 · llvm/llvm-project · GitHub

It’d be nice to convert all matchers to use MIR patterns for at minimum the trivial matching (the basic structure of the pattern). It actually matters a lot: I checked which opcodes were the most time consuming, and it’s always the ones that fall back to C++.

I guess in many cases the MIR pattern is not precise enough: the rule jumps into C++, the C++ is unhappy with the MIR, and it returns false.

def combine_extracted_vector_load : GICombineRule<
  (defs root:$root, build_fn_matchinfo:$matchinfo),
  (match (wip_match_opcode G_EXTRACT_VECTOR_ELT):$root,
        [{ return Helper.matchCombineExtractedVectorLoad(*${root}, ${matchinfo}); }]),
  (apply [{ Helper.applyBuildFn(*${root}, ${matchinfo}); }])>;

It expects a G_EXTRACT_VECTOR_ELT with a G_LOAD on $src. Nobody has the bandwidth to rewrite these.

In many cases the MIR pattern just checks whether the rule is enabled, checks the opcode, and C++ does everything.
If we moved those to basic MIR patterns (match all the instructions, maybe types as well), it’d help performance.

If we rewrote the combine match to:

(match (G_LOAD $vector, $mmo),
       (G_EXTRACT_VECTOR_ELT $root, $vector, $idx)),

It would only fire when there is a load on the vector. The probability of the C++ returning false would be lower. We cannot move everything into MIR.

Yes, that’s already enough. We simply need to add more checks, more context into the MatchTable so it can hoist common checks between rules into common blocks to avoid repetitive work.

Though, I definitely agree this is a huge bandwidth hog, and even if I had the time for it, I’m not sure I’d commit to doing this repetitive work unless I’m 100% confident it helps (i.e., unless I have perf data to prove how expensive the C++ fallbacks are, beyond a hacked-in TimeTraceScope). However, I’d really like to get rid of wip_match_opcode at some point, so this isn’t something I want to dismiss either.

Overall, it’s probably better in the short term to do more generic optimizations that’d benefit both ISel and the Combiners. I want to look at making the MatchTables more compact (even more than with the specialized opcodes I added) and also do more micro-benchmarking to find optimizations that aren’t immediately obvious with a profiler.

Totally agreed! Removing wip_match_opcode or linting would be great.

It states that it wants to combine anyext(trunc), but registers on anyext!?!

As I stated before, I would like to have a marker to state that $idx is const, e.g., ?const_idx. It would make the pattern even more precise.

We could add a gallery of good and bad MIR patterns (e.g., wip_match_opcode) to the MIR guide.

@Pierre-vh I like the idea. It’s very useful in case someone decides to optimize the API. However, I have doubts regarding the applicability. Is the main idea to compare scores among revisions and/or simply keep track of slow things (old ones and todos)?

I gather these stats from time to time and they’re usually the same. This is the latest one (I measure overall compilation time using an AArch64 toolchain on x86).

It’s an S-curve over SPEC2017INT. In general, GlobalISel is slightly slower than FastISel. There are two benchmarks where the difference reaches 10%.


I’ve been profiling the combiners recently. I’m using perf record from Linux through a GUI tool called HotSpot, which allows me to zoom in on specific parts of the build process and filter events that only happened during a given function call. With that, I can filter on something like tryCombineAll for the PreLegalizerCombiner and see where the time is spent.

So far I did:

It’s not an exact science; I’ve found a few percent of variance in the timing of some functions depending on the run, but it tells me where we waste the most time, which is still useful.

For instance, if we zoom in on AArch64PreLegalizerCombiner, we see that:

  • executeMatchTable takes 75% of its execution (including all callees), but only 10% of the time is spent in that function itself (self time)! So the MatchTable is very fast already.
  • getKnownBits is as expensive as the MatchTable with 10% use. 7.47% of that is in getKnownBitsImpl, and the DenseMap construction takes 1.83%. I think this is due to KnownBits being an expensive object.
  • matchICmpToTrueFalseKnownBits is very expensive too, with 8% of the time spent, due to calls to getKnownBits on all G_ICMP occurrences.
  • The MatchInfoTy struct is very expensive to construct on every iteration (every instruction visited). operator= seems to take about 7.47% and the constructor 4.37%! I’m going to look at optimizing that very soon.
    • My current idea is to create a “lazy” allocator that only allocates a field when it’s requested, so we don’t pay for what we don’t use.
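A minimal sketch of that lazy-field idea (illustrative names, not the actual MatchInfoTy layout):

```python
# A field is only constructed the first time a matcher asks for it,
# so rules that never touch it pay nothing for its construction.
class LazyField:
    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self.constructed = False

    def get(self):
        if not self.constructed:
            self._value = self._factory()
            self.constructed = True
        return self._value
```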

Regarding the lookthrough, do we actually have to look through MIs to find the constant?
Could we combine
zext(iconstant) → iconstant
to reduce/eliminate the chains?
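The arithmetic of that fold is simple; a sketch of the semantics as I understand them (this is not GlobalISel's constant-folding code, just the bit manipulation, with explicit widths):

```python
# Fold an extend of a constant into the wider constant, so later matchers
# never need to look through the extend instruction.
def fold_zext_const(value: int, from_bits: int, to_bits: int) -> int:
    assert to_bits >= from_bits
    # Zero-extension keeps the low from_bits and clears the rest; since
    # Python ints are unbounded, to_bits only bounds-checks the request.
    return value & ((1 << from_bits) - 1)

def fold_sext_const(value: int, from_bits: int, to_bits: int) -> int:
    assert to_bits >= from_bits
    low = value & ((1 << from_bits) - 1)
    sign = 1 << (from_bits - 1)
    # Sign-extend: flip the sign bit, then subtract it back out.
    return (low ^ sign) - sign
```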

We combine them except for trunc: