[LLVM-DEV'24] LLVM ♥ ML Workshop

Hello all,

Continuing from Tanya’s announcement, the LLVM ♥ ML workshop is a continuation (and rebranding) of last year’s ML-Guided Compiler Optimization in LLVM, with pretty much the same format…

…except this time there’ll be lunch!

As announced:

The workshop aims to bring together LLVM contributors and researchers who work on applying machine learning techniques to LLVM - including compiler optimizations, latency estimation, IR comprehension, code generation, and data set curation (to name a few) - to discuss their current, ongoing projects, explore ways to better collaborate, and plot pathways for integration into community LLVM.

Essentially, the workshop is an unconference. It has two parts. In the first part, folks who want to propose topics or update the community on their work give a short presentation followed by Q&A. If you are interested in presenting in this first part, please DM us (@mtrofin, @jdoerfert) by Oct. 1st so we can figure out how to budget the time; we’ll get back to you on Oct. 2nd to confirm the timing for this first part. In addition, and optionally, if you’re presenting and want to share more information with the participants before the workshop, send us whatever materials you’d like, a week before the workshop at the latest, so we can group them all in one post on Discord a few days before the workshop.

The second part will be round tables on the topics the workshop participants choose. We’ll ask each group to find a volunteer to drive & moderate, and another volunteer to take notes, which will then be shared back on Discord.

Looking forward to seeing you all in October!
Mircea & Johannes.


Tentative Agenda

Oct. 22, 2024

1-5PM, in the “Hall of Cities - Portland/Seattle” Room (Santa Clara Marriott)

1 PM - 3 PM

  • A Quick ComPile Tutorial: Using Big Data for ML/AI Tasks (Aiden Grossman, UC Davis)
  • LLM Compiler: Foundation Models of Compiler Optimization (Chris Cummins & Volker Seeker, Meta)
  • Imitation Based Optimization of Inlining Decisions Without Compiler in the Loop (Alekh Agarwal, Google)
  • Accurate Rewards for Production Workloads: Status and Results (Aiden Grossman, Google)
  • Inputgen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training (Johannes Doerfert, LLNL)
  • Fuzzlang: Generating Compilation Errors to Teach ML Code Fixes (Baodi Shan, Stony Brook University)

3 PM - 3:15 PM Break

3:15 PM - 3:35 PM

  • Connecting the pieces, ML in Rust (Manuel Drehwald)

3:35 PM - 3:40 PM

  • Ad hoc identification of 2-4 round table topics, plus note takers and moderators for each

3:40 PM - 5 PM : Round Tables

  • Round table minutes will be posted back here after the event

Post-workshop

Thank you to the awesome speakers and to all attendees! We hope everybody enjoyed the talks and the discussions afterwards.

The workshop slides are available here.

If you are interested in keeping in touch:

  • monthly meeting details (how to join, etc).
  • please feel free to propose an agenda for that meeting! It can be anything you want to discuss with the part of the LLVM community interested in applying ML - from presenting work to asking questions. Just start a thread on Community and tag it with mlgo.

Thanks!

Mircea & Johannes

Unconference discussion notes:

(These were written out as we were talking and the conversation jumped around a bit, so they are not necessarily ordered exactly by topic).

  • Problems other than inlining and regalloc
    • What else has been tried?
      • Phase ordering
      • Original paper out of Edinburgh was for performance, probably for loop unrolling
      • Regalloc at Google, improvements were noticeable.
        • Training methodology was not very good; many models had to be produced to find one that provided any effect at all.
        • Trying a bunch of things to see what sticks. Cannot do microbenchmarks/benchmarking - functions are enormous after ThinLTO/inlining. Did a bit of analytical (PGO-based) cost modeling based on reasonable assumptions. Very few models came out of that cost model (all of which looked good from the ML side). In microbenchmarks, the poor reward worked okay. Matthias Braun’s presentation showed similar results.
        • Turn the cost modeling problem into a classification problem: give a model two variants, A and B, and ask which is better. Hypothesized by Chris to be easier from an ML perspective. Possibly more explainable too. (A rough sketch of this idea follows these notes.)
        • Part of the problem with cost modeling is that performance might change vastly with microarch.
        • Training multiple heuristic models against specific uarches would be perfectly reasonable in the datacenter case.
        • How do we measure the distance between two pieces of code? BLEU for natural language versus something else for code.
          • Could just completely ignore the text, see if it produces the same outputs for a set of inputs.
          • Just running the code is good enough.
        • Missing proper evaluations for any of this, everything is sort of ad hoc within the community.
          • Give people a way to evaluate a real problem. Even if the problem is there, can be hard to evaluate.
          • Having a task builds interest.
    • Instruction scheduling - always within reach, but lots of problems with regressions when making improvements in certain areas. Questions about whether or not there are any ML-based approaches:
      • Concern about work within upstream LLVM/other research being specific to X86. Are we just overtraining for the weird special cases of X86?
      • Existing regalloc models: X86 regalloc on Skylake for datacenter applications
      • Intuition about models probably not generalizing across architectures, but infrastructure can be common to generate new models to be specific to an application/architecture.
      • Some people do not care at all about CPU performance, infrastructure might still be useful. GPU code is much harder to run stuff on, lots of effort in the past couple years spent on building up tooling to generate training data.
    • A good solution for regalloc and instruction scheduling would solve both at the same time, they are tightly coupled problems.
    • Locality matters. Code locality matters, and tools like BOLT are applicable here. Can we learn better code placement policies? Data locality is also important, but much harder. Improved locality could have a large impact on performance.
      • Code locality can mostly be solved with PGO/profiling.
      • Prefetching for data locality does not have correctness problems. There is some infra in LLVM, but previous attempts have failed so far.
    • Memory optimizations like load elimination.
    • Inlining - adding additional features (like whether or not a callsite is in a loop). Can help model better downstream benefits.
    • One thing we do poorly across compilers - we fiddle with things and then hope that the performance is better than before. We do not know what performance is possible.
      • Even if we just get to 80%, it’s still way better than what we’ve seen in the past.
    • Do we look at individual inlining decisions currently when training the inlining models? Fix everything else and just look at whether or not an individual decision is good.
      • The effects are cascading. As soon as we consider more than one, there are complicated interactions between each instance.
    • RL4ReAl did co-training. Questions about coupled effects between multiple models.
    • Passes are all coupled together by the ordering. How do we solve this for a model?
      • We only care about two orderings: O3 and Oz.
      • Taking an off-the-shelf model and applying it to an arbitrary problem is not something Google is trying to optimize for.
      • General answer is just to retrain.
      • Retraining due to code changes (anecdotes from Fuchsia/Chrome/Cloud infra) really needs to be done only when the feature distribution changes.
      • For the community upstream, train on a representative set of inputs from the community, and then ship it with the release. If you want to do better, train on your apps specifically.
  • Compiler correctness
    • Johannes, on his roadmap: take two pieces of IR that are supposedly the same and determine whether or not they are semantically equivalent.
      • Suggestion to look at AST subtrees: Doesn’t work over IR though.
      • Could do things like throwing out similar def-use chains.
      • Some bucketization strategy with divide and conquer was suggested. Replace a sequence of basic blocks that are identical and then delete them.
    • Alive2 - perfect for straight-line code, assuming you don’t need to scale too much. Breaks apart with loops, or if the CFG structure changes at all.
      • Cummins et al. tried Alive2 for flag tuning and ran into scalability issues.
      • Edge cases are important to consider that will make certain optimizations illegal.
    • Data augmentation - on the ML side, applying semantics-preserving transformations and then adding many semantically equivalent inputs to the training set. (See the second sketch after these notes.)
      • Doing semantically aware tokenization for LLMs (Doerfert and co).
      • Still doesn’t gain much ground on the correctness issue, even if we get significant tangible benefits.
      • Confidence scales might be helpful.
        • Shouldn’t be trusted. Lots of examples in the literature of them being overconfident.
    • LLMs typically depend on a human in the loop for compiler tasks, for things like checking and verifying.
      • If we can reduce the number of things that need to be checked by automatically approving the trivial cases, that might still be important.
      • Analogy to Gran Turismo with the car going off the road; it is hard for an agent to know that here.
    • Symbolic execution/input generation - can validate (somewhat) that things are semantics-preserving by running autogenerated inputs and asserting that the outputs are the same. (See the last sketch after these notes.)
    • Some domains are easier for correctness. Communication channels in a modem can easily be fuzzed, almost exhaustively, for correctness.
    • ML can easily be used when it does not make correctness impacting decisions.
    • One potential use given the above: code translation. Use ML to take code from one language to another while maintaining a bunch of properties. Correctness on interesting inputs.
    • A conversational compiler opens up a bunch of interesting ideas.
      • Could help with transpilation.
      • Autovectorization
    • How to debug a bug that pops up due to an optimization problem. How to fix it?
      • If we keep correctness in the compiler, the bugs we come across are still compiler problems.
    • Start with smaller things. As we gain confidence with those, we can move up from there.
      • The breakthrough in self-driving and machine translation was to feed everything to the ML model rather than relying on hand-crafted correctness rules.
      • Not really applicable to the compiler case.
    • For some tasks, like filtering compiler output, using an LLM to pick out the useful pieces can save engineering time.
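
A minimal sketch of the “cost modeling as classification” idea from the notes above, assuming entirely synthetic, made-up features and labels (the hidden linear cost, the feature extraction, and the measurement labels are placeholders - none of this is MLGO code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder feature extraction: in practice these would be features of a
# compiled variant (instruction mix, register pressure, spill counts, ...).
def extract_features(variant):
    return np.asarray(variant, dtype=float)

# Training pairs: (features_a, features_b, a_is_faster), where a_is_faster
# would come from actually measuring both variants.
def make_pairwise_dataset(pairs):
    X, y = [], []
    for feats_a, feats_b, a_is_faster in pairs:
        X.append(extract_features(feats_a) - extract_features(feats_b))
        y.append(1 if a_is_faster else 0)
    return np.vstack(X), np.asarray(y)

# Synthetic stand-in data: the "true" latency is a hidden linear function of
# 8 made-up features; a lower hidden cost means faster.
true_w = rng.normal(size=8)
pairs = []
for _ in range(200):
    a, b = rng.normal(size=8), rng.normal(size=8)
    pairs.append((a, b, a @ true_w < b @ true_w))

X, y = make_pairwise_dataset(pairs)
clf = LogisticRegression().fit(X, y)

# Inference: rank two candidates without ever predicting an absolute cost.
a, b = rng.normal(size=8), rng.normal(size=8)
prefer_a = clf.predict((extract_features(a) - extract_features(b)).reshape(1, -1))[0]
print("prefer variant A" if prefer_a else "prefer variant B")
```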
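
A similar sketch for the data-augmentation point: assuming `opt` from an LLVM install is on PATH and `input.ll` is a placeholder textual IR module, running several semantics-preserving pass pipelines yields equivalent-but-different variants of the same module that could be added to a training set:

```python
import subprocess

# Semantics-preserving pipelines (new pass manager syntax); each produces an
# equivalent module with a different surface form.
PIPELINES = [
    "default<O1>",
    "default<O2>",
    "instcombine,simplifycfg",
    "mem2reg,gvn",
]

def augment(ll_path: str) -> list[str]:
    variants = []
    for pipeline in PIPELINES:
        result = subprocess.run(
            ["opt", "-S", f"-passes={pipeline}", ll_path],
            capture_output=True, text=True, check=True,
        )
        variants.append(result.stdout)  # equivalent IR, different surface form
    return variants

if __name__ == "__main__":
    for variant in augment("input.ll"):  # input.ll is a placeholder path
        print(variant.count("\n"), "lines of IR")
```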
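
And a last sketch of the “just run the code” approach to checking equivalence: assuming `clang` is on PATH and that `a.ll` and `b.ll` (hypothetical files) both define a pure `int f(int)` with no undefined behavior on the sampled inputs, compile each against a small harness and compare outputs on generated inputs:

```python
import pathlib
import random
import subprocess
import tempfile

# Tiny driver that calls the function under test with a single integer argument.
HARNESS = r"""
#include <stdio.h>
#include <stdlib.h>
int f(int);
int main(int argc, char **argv) {
    printf("%d\n", f(atoi(argv[1])));
    return 0;
}
"""

def build(ir_file: str, workdir: pathlib.Path, tag: str) -> pathlib.Path:
    harness = workdir / "main.c"
    harness.write_text(HARNESS)
    exe = workdir / f"prog_{tag}"
    # clang accepts textual LLVM IR (.ll) directly alongside C sources.
    subprocess.run(["clang", "-O0", ir_file, str(harness), "-o", str(exe)], check=True)
    return exe

def outputs_agree(ir_a: str, ir_b: str, num_inputs: int = 100) -> bool:
    with tempfile.TemporaryDirectory() as d:
        workdir = pathlib.Path(d)
        exe_a = build(ir_a, workdir, "a")
        exe_b = build(ir_b, workdir, "b")
        for _ in range(num_inputs):
            x = str(random.randint(-1_000_000, 1_000_000))
            out_a = subprocess.run([str(exe_a), x], capture_output=True, text=True).stdout
            out_b = subprocess.run([str(exe_b), x], capture_output=True, text=True).stdout
            if out_a != out_b:
                return False  # found an input where behavior differs
    return True

if __name__ == "__main__":
    print(outputs_agree("a.ll", "b.ll"))  # a.ll / b.ll are placeholder files
```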