
The Next 700 ML-Enabled Compiler Optimizations

S. VenkataKeerthy (IIT Hyderabad, India), Siddharth Jain (IIT Hyderabad, India), Umesh Kalvakuntla (IIT Hyderabad, India), Pranav Sai Gorantla (IIT Hyderabad, India), Rajiv Shailesh Chitale (IIT Hyderabad, India), Eugene Brevdo (Google DeepMind, USA), Albert Cohen (Google DeepMind, France), Mircea Trofin (Google, USA), and Ramakrishna Upadrasta (IIT Hyderabad, India)

Abstract
There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals, raising issues with modularity, performance and framework independence. Practical deployment and transparency for the end-user are also important concerns. We propose ML-Compiler-Bridge to enable ML model development within a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient. We evaluate it on both research and production use cases, for training and inference, over several optimization problems, multiple compilers and their versions, and gym infrastructures.

CCS Concepts: • Software and its engineering → Compilers; Software infrastructure; • Computing methodologies → Machine learning.

Keywords: Machine Learning for Compiler Optimizations, gRPC, Pipes, ONNX, TensorFlow AOT

ACM Reference Format:
S. VenkataKeerthy, Siddharth Jain, Umesh Kalvakuntla, Pranav Sai Gorantla, Rajiv Shailesh Chitale, Eugene Brevdo, Albert Cohen, Mircea Trofin, and Ramakrishna Upadrasta. 2024. The Next 700 ML-Enabled Compiler Optimizations. In Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction (CC '24), March 2–3, 2024, Edinburgh, United Kingdom. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3640537.3641580

This work is licensed under a Creative Commons Attribution 4.0 International License.
CC '24, March 2–3, 2024, Edinburgh, United Kingdom
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0507-6/24/03
https://doi.org/10.1145/3640537.3641580

1 Introduction
With the success of Machine Learning (ML) models in various domains, there is a growing interest in applying ML to improve optimization heuristics in compilers [4, 12]. Several ML and Reinforcement Learning (RL) approaches have been proposed to improve optimizations like vectorization [21, 36], loop unrolling and distribution [25, 43], function inlining [27, 47], register allocation [17, 26, 46, 50], and prediction of phase sequences [5, 23, 24], among many others [2, 53]. More specifically, the widely used LLVM compiler [29] has support for RL-based inlining decisions from version 11, and RL-based eviction decisions in its register allocator from version 14 [46]. The title of our paper acknowledges this growing trend and anticipates the needs of the ML-enabled optimizations that are yet to come, in the spirit of Landin's seminal paper [28] on the diversity of existing and future programming languages.
Setting up an ML-based compiler optimization is a challenging task. In addition to model design, it involves specialized data collection, compiler engineering, and packaging:
1. Preparing or generating the data sets for training the model [14, 16].
2. Engineering objective-specific features [19], or extracting objective-independent program embeddings [3, 6, 13, 48], or a combination of both.
3. Setting up a training interface with the compiler, with examples ranging from communicating the output via compiler flags [24] and offline file logs [21], to generic gym APIs [10] and recent compiler-specific gym APIs like CompilerGym [15], PolyGym [9] and Supersonic [51].
4. Finally, building and deploying the compiler with the trained model for inference.
In most works, the process ends with step (3) and a simplified benchmark-oriented version of step (4) to evaluate the trained model. Indeed, while there exist a number of solutions for steps (1) and (2), proper methodology-based solutions for steps (3) and (4), which involve model-compiler interaction, have not yet been adequately addressed.
The diversity of compiler optimizations and ML models is associated with an equally broad range of requirements for model-compiler interaction. In Tab. 1, we illustrate this on recent proposals. There exist multiple ML frameworks and even more types of ML models. A model's input may be a plain floating point vector, or tensors of different ranks and shapes. Outputs range from a unique Boolean decision to complex data structures. These need to be communicated with the compiler; it may be only once for simple scenarios, or many times and involving large amounts of data for more intricate ones. And this may involve extensive source code modifications for the sole purpose of implementing the compiler-model interface.


Table 1. Diverse ML and RL requirements in previous work; unknown or unclear ones are left blank.

| Work | Communication | Model Input | Model Output | Commn Freq | #Agents | Model Type | ML Framework |
|---|---|---|---|---|---|---|---|
| SLP Vectorization [36] | | LLVM IR | Instructions to pack | | Single agent | GGNN | |
| NeuroVectorizer [21] | Pragmas in source code, added in Python | Code2Vec vectors | | Once, at the end | Single agent | FCNN | Keras, RLLib |
| Register Allocation [17] | No integration | Interference graph | Coloured IG | None | NA | LSTM | TensorFlow |
| Register Allocation [26] | | PBQP graph | Allocated PBQP graph | | Single agent | GCN, ResNet | PyTorch |
| POSET-RL [24] | Opt flags | IR2Vec vectors | Pass sequence | Multiple times per episode | Single agent | FCNN | PyTorch |
| Loop Distribution [25] | Python wrappers | IR2Vec vectors | Distribution sequence | Once, at the end | Two agents | GNN, FCNN | PyTorch |
| Inliner [47] | Precompiled TF model | Features | Yes/No | Once, at the end | Single agent | FCNN | TensorFlow |
| RegAlloc Eviction [46] | Precompiled TF model | Features | Index of live range to evict | Once, at the end | Single agent | FCNN | TensorFlow |
| RL4ReAl [50] | gRPC | IG + node-level embeddings | Color map | Multiple times per episode | Four agents; hierarchical | GNN, FCNN | PyTorch, RLLib |

Some of these interactions have been explored in the literature and even landed in production; however, there does not exist a single generic method to address the vast diversity of scenarios that are imaginable and the trade-offs therein. Such a situation limits the scope, applicability and effectiveness of ML for compiler optimizations in the following ways:
• Scalability: Integrating a Python model with C++ code using wrappers induces significant compile-time overhead, e.g., 6×–100× [25].
• Integration: Not all optimizations are simple enough that the outputs of the model can be communicated using flags [25, 26, 47, 50]. As ML-based optimizations grow in popularity, flag-based approaches become unwieldy.
• Programmability: ML models are typically written in Python across different frameworks like TensorFlow, JAX, PyTorch, etc. Expecting the model to be written in C++ within the compiler is not ML developer-friendly.
• Portability: Several proposals involve a tight coupling between the compiler and a specific ML framework; we however believe that a generic compiler infrastructure like LLVM should remain ML-framework-independent.
The existing gym libraries primarily aim at facilitating model training for research and reproducibility by providing a high-level integration. For example, the recent CompilerGym [15] provides a high-level interface in the form of C++ wrapper methods outside the compiler to invoke out-of-tree compiler APIs to materialize the predicted actions. Such integration caters well to training certain interactions like phase ordering [24]. However, other optimizations like register allocation [17, 26, 50], loop distribution [25] and inlining [47] necessitate a deeper interfacing of the model within the compiler, with multiple rounds of interaction for both training and inference scenarios. Further, in these gym libraries, the inference flow is driven by Python: the compilation starts by invoking a Python process, breaking the isolation between the end user and the internal compiler algorithms; this limits deployment opportunities among other downsides. We discuss these issues in detail in Sec. 4.
To address these shortcomings, we propose ML-Compiler-Bridge, a library that allows ML model development within a traditional Python framework while providing tightly coupled and efficient end-to-end integration with the compiler. Our library bridges the compiler and ML model by providing a suite of communication approaches (model runners) and the related (de-)serialization mechanisms (SerDes) to cater to diverse scenarios. It also provides support for both inter- and in-process communication by exposing different model runners: gRPC and named pipes for the former, and TensorFlow and ONNX for the latter. Diverse SerDes options based on Protobuf, JSON, and native bitstreams improve efficiency and versatility. The appropriate model runner and SerDes can be chosen based on the usage scenario and requirements, and these may differ during training and inference. Our library provides C++ and Python APIs to expose model runners and SerDes for integration with compilers and ML frameworks respectively.
We show that the inter-process model runners effectively support training. Once the model is trained, the in-process model runners provide interfacing of the model within the compiler in a transparent manner, with much lower latency to aid deployment. Besides, both our model runner and SerDes modules can be easily extended to support more forms of communication and serialization. Our library also provides C APIs to aid integration with C-based compiler infrastructures like Pluto, GCC, and SQLite.
We evaluate ML-Compiler-Bridge on four ML-enabled optimizations in LLVM: RL-LoopDistribution, POSET-RL, RL4ReAl, and Inliner. We show that our library can be integrated with other compilers like Pluto [7] and MLIR [30] with minimal effort. We study the impact of communication and serialization options on compile time under different complex scenarios that the existing infrastructures could not handle. We conduct extensive evaluations to measure the overhead caused by each model runner and SerDes. We also study the impact of integrating ML-Compiler-Bridge with LLVM in terms of additional dependencies, compile time, and binary size overhead. Here are our contributions:


• We propose ML-Compiler-Bridge, a library to enable the deeper integration of ML models and the compiler in a framework-independent manner.
• We provide a suite of two inter- and two in-process model runners, and three (de-)serialization mechanisms (SerDes) to support different interaction scenarios.
• We provide multi-language user APIs: C++ and C APIs to interface model runners and serializers with compilers, and Python APIs to interface inter-process model runners with ML frameworks.
• We show that our library is easy to integrate with three different compilers spanning different representations, and carry out extensive evaluations on four ML-enabled optimizations on two versions of LLVM (V10, V17).
• We characterize the impact of each communication and serialization option on compilation and training times and other overheads.

2 Background

Figure 1. ML-enabled compiler optimizations: (1) Inputs and other metadata required by the model are prepared in the appropriate format. (2) Serialized input is passed on to the model by a suitable communication channel. (3) Input is deserialized to the appropriate format. (4) The model is queried to obtain optimization decisions as output. (5) Output is serialized, and (6) sent back to the compiler optimization as a response. (7) The received response is deserialized, and optimization decisions are taken according to the output.

ML-enabled Compiler Optimizations. The process of supporting or fully implementing optimization decisions with one or more ML models involves the steps shown in Fig. 1. This process repeats until the end of the compilation process for each ML-based optimization. The above scheme is generic enough to capture any optimization involving single or multiple ML models with multiple two-way interactions. For the cases that would need multiple interactions, steps (1)–(7) are repeated until the final outcome.
More broadly, there are three actors involved in developing and using such an ML-enabled compiler: (i) the compiler expert who develops the compiler optimization, (ii) the ML expert who designs the ML model for the optimization problem, and (iii) the end-user who uses the compiler. Ideally, compiler experts should use the ML models with minimal understanding of the internals and process specific to ML modeling and the framework on which the model is built to arrive at the result. Similarly, ML experts should instead design the models with minimal or no understanding of compiler internals, infrastructural details, and integration points, focusing on the optimization objectives and information flow. For the end-user, however, the presence of an ML-compiler optimization should be transparent, and indistinguishable from the conventional (non-ML based) compilation process. To achieve this scheme of abstraction/segregation among all three actors, it is important to distinguish between the training and inference flows.

Training. Typically, training the ML model becomes part of compiler development and build-up, and inference becomes part of the compiler deployment and execution. However, occasionally this boundary may shift towards the user, as with domain-specific training or fine-tuning at deployment time. Since ML developers usually prefer developing models within a Python-based framework, the training process involving a C++ compiler infrastructure like LLVM requires a communication channel, typically inter-process, while catering to the needs of (de-)serializing data between the native types of C++ and Python. The distributed nature of training processes may also require extending communication beyond a single operating system node.

Inference. When focusing on inference/deployment, compile time and ease-of-use become crucial factors. The communication and serialization methods involved should take this into account, along with considering converting the Python model to a streamlined C++ implementation. These factors are true even for the simplest forms of communication, like one-time evaluations of the ML model and communicating via flags. Making the flow transparent to the user also requires a deeper, end-to-end integration with the compiler.
There is no tool providing the necessary layers of abstraction between the three actors while supporting the required training and inference scenarios, not to mention ML-framework independence. Designing such a library and evaluating its suitability for diverse use cases is the challenge we tackle in this paper.
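To make the interaction of Fig. 1 concrete, the following is a minimal, self-contained sketch of one query round following steps (1)–(7) from the compiler's side. Every name in it is a hypothetical placeholder, and the transport is stubbed out with a fixed reply; it is not an implementation of any particular library.

#include <string>
#include <vector>

namespace sketch {
// (1)-(2): prepare the inputs and serialize them into a transferable form.
std::string serializeFeatures(const std::vector<float> &features) {
  std::string out;
  for (float f : features) out += std::to_string(f) + " ";
  return out;
}
// (2) send the serialized input; (3)-(5) the model deserializes it, runs, and
// serializes its decision; (6) the response comes back. Faked here.
std::string exchangeWithModel(const std::string &request) {
  (void)request;
  return "0";
}
// (7): decode the decision so the optimization can act on it.
int deserializeDecision(const std::string &response) {
  return std::stoi(response);
}
int queryModelOnce(const std::vector<float> &features) {
  return deserializeDecision(exchangeWithModel(serializeFeatures(features)));
}
} // namespace sketch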


3 ML-Compiler-Bridge
We propose an abstraction mechanism made of two main components: Serializer and Model Runner. The SerDes module (de-)serializes the data to/from the requested format, and the MLModelRunner module is responsible for communication with the model. The model runner obtains the serialized data, writes it to a communication channel, queries the model, and deserializes the output received from the model. ML-Compiler-Bridge exposes methods to be invoked by the user to interact with the model, decoupled from serialization and communication. We provide three framework-independent model runners, gRPC, named pipes, and ONNX, and one framework-specific TensorFlow model runner. These can be combined with three different serializations: Protobuf, JSON, and bitstream. The modular design enables new forms of communication and serialization to be added by overriding a minimal set of methods. Fig. 2 shows the components and interactions of ML-Compiler-Bridge.

Figure 2. The compiler instantiates a model runner and sets the input features to be used by the model. MLModelRunner internally invokes SerDes to serialize the data in one of the supported formats and query the model. The returned decision is deserialized and provided to the optimization.

class MLModelRunner {
public:
  // Populates inputs as key-value pairs
  template <typename T, typename... Types>
  void populateFeatures(std::pair<string, T> &var1,
                        std::pair<string, Types> &...var2);
  // Exposed to the user; returns model's output
  template <typename T> T evaluate() {
    return *reinterpret_cast<T *>(evaluateUntyped());
  }
protected:
  // To be overridden by derived classes
  virtual void *evaluateUntyped() = 0;
};
Listing 1. Skeleton of MLModelRunner class

3.1 ML Model Runners
We provide two classes of model runners. The inter-process class provides the easiest mechanism to decouple Python models from a compiler running as a separate process. The in-process class assumes that the ML model is readily available in a compiled form and can be accessed within the compiler through a specific API. Clearly, in-process communication is designed with inference and deployment in mind, while inter-process communication enjoys more diverse use cases. Model runners may support simple ML queries and feed-forward networks as well as more involved Reinforcement Learning (RL) algorithms or Graph Neural Networks (GNNs).
Internally, MLModelRunner is the abstract base class from which the other model runners are derived (List. 1). It exposes two APIs: populateFeatures() populates the input features, and evaluate() queries the model.

3.1.1 Inter-process Model Runners. gRPCModelRunner uses gRPC, and may run the model and compiler on different machines, whereas pipeModelRunner uses named pipes for single-machine scenarios. At training time, the compiler acts as a server and the Python-based ML model acts as a client. The sequence of steps is as follows:
(1) Compilation starts and the compiler listens for queries at the wait() call inserted at the point of interest.
(2) The Python model starts training; this can be started concurrently with Step (1).
(3) When input from the compiler is required, the model sends requests to the compiler with appropriate queries and waits for the response.
(4) The compiler gets out of the blocked state and processes the query to generate an appropriate response.
(5) The response is sent back to the client, and the model goes on to complete training on that input.
Inference follows the same steps, yet the compiler becomes the client and the model becomes the server so as to support a regular compilation process.

gRPC Model Runner. gRPC [52] provides RPC methods specifying the type of input and output in Protobuf format [41]. During the build process of the library, the proto files are automatically translated to C++ and Python code by invoking the protoc compiler. The generated code defines the Service class that exposes the RPC methods to be overridden by the user in the optimization that makes use of gRPCModelRunner.
gRPCModelRunner takes in the server address and the port number at which the connection is to be established. In training mode, gRPCModelRunner starts the server and listens for an RPC call invoked by the model. The overridden RPC method is directly called by the Python model to generate new observations by applying the action predicted by the model. In inference mode, gRPCModelRunner starts the gRPC connection at the given address and port.

Pipe Model Runner. As the name suggests, the pipe model runner relies on named pipes for inter-process communication (the mkfifo system call). Pipes provide a simple and effective means of communication that is local to the machine without any network or security constraints.
As pipes are unidirectional, pipeModelRunner creates a read pipe and a write pipe for communication. The read pipe in the compiler obtains the data written by the model in Python, and the write pipe provides the data that is read by the model on the other end. read() is a blocking call forcing the compiler to wait till data is written by the model. Once the data is written, the model gets to a blocking state by invoking read() on the second pipe, waiting for the response from the compiler. The pipe model runner ensures proper opening, closing, and clean-up. pipeModelRunner provides a simpler interface for establishing communication as the user directly invokes evaluate() after setting the inputs.
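To illustrate the blocking handshake described above, here is a minimal sketch of the compiler-side pattern using POSIX named pipes. The FIFO paths and the length-prefixed payload framing are hypothetical illustrations, not the library's actual wire format, and error handling is omitted for brevity.

#include <cstdint>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Sketch only: one request/response exchange over a pair of named pipes.
std::string queryOverPipes(const std::string &request) {
  const char *toModel = "/tmp/compiler_to_model.pipe";   // written by the compiler
  const char *fromModel = "/tmp/model_to_compiler.pipe"; // written by the model
  mkfifo(toModel, 0666);   // no-ops if the FIFOs already exist
  mkfifo(fromModel, 0666);

  // open() blocks until the model opens the other end for reading,
  // mirroring the rendezvous described above.
  int wfd = open(toModel, O_WRONLY);
  uint64_t len = request.size();
  (void)write(wfd, &len, sizeof(len));
  (void)write(wfd, request.data(), len);
  close(wfd);

  // Block on the second pipe until the model writes its response.
  int rfd = open(fromModel, O_RDONLY);
  uint64_t rlen = 0;
  (void)read(rfd, &rlen, sizeof(rlen));
  std::vector<char> buf(rlen);
  (void)read(rfd, buf.data(), rlen);
  close(rfd);
  return std::string(buf.begin(), buf.end());
}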


3.1.2 In-process Model Runners. In-process model runners are designed to provide an effective means of compiler deployment. It is important to optimize the inference time as it adds up to the overall compile time. One may obtain significantly lower compile times by removing inter-process communication overhead, and by turning the complications of a compiled model into an advantage, by reducing the query time compared to models running in Python. Serialization/deserialization overhead is also lowered.

ONNX Model Runner. The Open Neural Network Exchange [34] (ONNX) is an open format to represent machine learning models. Models built from various frameworks like TensorFlow, PyTorch, etc. can be represented in ONNX format in an interoperable manner. Additionally, it supports a wide variety of hardware architectures ranging from edge devices to general-purpose CPUs and GPUs. Once the model is trained in Python, it is converted into a common ONNX representation and is imported into the compiler via the ONNX runtime. ONNXModelRunner exposes the necessary wrapper APIs to read the ONNX model, query it with inputs and obtain outputs.

Figure 3. Sequence diagram indicating different events and the interaction between various classes for RL-based optimization by ONNXModelRunner (the Opt Pass, ONNXModelRunner, Agent, and ONNXModel exchanging dispatch, evaluate(), reset(), step(), computeAction(), and query() events along with observations, actions, and model outputs). Only the methods that are highlighted are to be overridden by the user. Other methods are internal to the library.

For RL, the agent is usually the learner trained to predict appropriate actions given the observations from the environment. Exporting a trained model to ONNX implies exporting only the agent. To facilitate RL-based interaction for a generic multi-agent scenario between the environment and the agents, ONNXModelRunner provides Environment and Agent classes separately and accesses the APIs internally. The sequence of events describing this interaction is shown in Fig. 3.
ONNXModelRunner queries the model using the C++ APIs. A map containing the identifier of the agent (label) and the corresponding model path is passed while instantiating the ONNXModelRunner. In the case of multiple agents, the identifier of the next one to use is set by the Environment while returning the observation. ONNXModelRunner queries the corresponding agent with the observation to obtain the requested action. This process goes on until the Environment invokes setDone().

TensorFlow Model Runners. This is a framework-specific model runner built on the TensorFlow ahead-of-time (AOT) saved model. There are two implementations: (i) the "Release Mode Model Runner" used in production environments, and (ii) the "Model Under Training Model Runner" intended either for finetuning or for quickly evaluating candidate models and parameters.
The TensorFlow model runner uses the AOT saved model compiler which produces a header exposing the model as a C++ class, and a native object file with its implementation. The model runner reduces again to a simple adapter [20] around that class. The compiler binary does not expose new runtime dependencies as it is statically linked, and this highly simplifies its deployment. Note that the model compiler can be configured to generate code by loading the weights from a file passed via the command line to the compiler.

3.2 SerDes: Serializer and Deserializer Module
When data is transferred, specifically across two processes, it is important to convert data that is present in the native types (of C++ and Python) from one format to another. This is the purpose of (de-)serialization as implemented by the SerDes module.
The MLModelRunner interacts with SerDes to (de-)serialize C++ native data to model-specific types and back. The choice of (de-)serialization depends on the optimization and ML model. We currently provide three options: bitstream, JSON, and Protobuf. They vary in terms of usage scenario, usage effort, and (de)serialization time. SerDes effectively abstracts away the underlying mechanism while providing the flexibility of different serialization options.
Internally, each SerDes is derived from the BaseSerDes class that exposes the APIs to (de-)serialize inputs/outputs. Our library supports (de)serializing basic (int, float, double, string, bool) and compound (vector, list) data types.

Protobuf SerDes. Protobuf SerDes needs the user to provide the input and output data specifications in a proto file. These are compiled to generate the C++ and Python sources (Sec. 3.1.1). ProtobufSerDes serializes the input key-value pair by overriding the setFeature methods to set the appropriate fields of the message described in the proto file. Deserializing protobuf data to the native format only involves reading and returning the appropriate fields of the message. Except for providing the proto file, ProtobufSerDes is transparent to the user.

JSON SerDes. JSONSerDes overrides the setFeature methods to populate the JSON buffer appropriately, given the key-value pairs. Similarly, the received data is deserialized by first converting it to a JSON object; the JSON fields are then cast to native types and returned.

Bitstream SerDes. The bitstream starts with a JSON header which specifies the key (identifier), type and shape of the tensors, and the order in which they will be serialized. Tensor values themselves are dumped as raw bytes. The received bitstream is interpreted based on the type and shape specified in the header and converted to native types. Processing the header induces negligible overhead if the communicated data does not involve complex data types.

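As a rough illustration of the header-plus-raw-bytes layout described above, here is a minimal sketch of serializing a single embedding vector. The header field names and the newline framing are assumptions for the example; the library's actual header schema may differ.

#include <string>
#include <vector>

// Sketch of bitstream-style serialization: a JSON header describing the
// tensor, followed by its values as raw bytes. Field names are illustrative.
std::string serializeEmbedding(const std::vector<float> &embedding) {
  std::string header = "{\"name\":\"embedding\",\"type\":\"float\",\"shape\":[" +
                       std::to_string(embedding.size()) + "]}\n";
  std::string out = header;
  out.append(reinterpret_cast<const char *>(embedding.data()),
             embedding.size() * sizeof(float));
  return out;
}
// The receiver reads the header up to the newline, then reinterprets the
// remaining bytes according to the advertised type and shape.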

3.3 C-APIs
We provide C wrappers around the C++ implementation to integrate with C-based compilers. These wrappers are C++ files written in C style. Each method internally queries the original C++ implementation and returns results in a way compatible with C calling conventions. This code is built as a separate library that may be linked with a C-based compiler.

3.4 Extensions
Both MLModelRunners and SerDes can be easily extended to support new model runners and serializers. New runners may include TVM [11], ahead-of-time compiled PyTorch models and FlatBuffers [18], and serialization can also be extended to YAML formats. New model runners can be contributed by inheriting MLModelRunner and overriding the evaluateUntyped method according to the model runner. Similarly, a new (de)serializer can be added by inheriting BaseSerDes and overriding the setFeature and deserialize methods specific to the new serializer.
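As a hedged illustration of this extension point, the sketch below shows the shape a new runner could take by overriding evaluateUntyped() from Listing 1. The class name, constructor, and constant-returning logic are hypothetical placeholders and not part of the library.

// Hypothetical example of a new runner derived from MLModelRunner (Listing 1).
class ConstantAdviceModelRunner : public MLModelRunner {
public:
  explicit ConstantAdviceModelRunner(int advice) : Advice(advice) {}

protected:
  // Return a pointer to the decision; evaluate<T>() in the base class
  // reinterprets this buffer as the requested type.
  void *evaluateUntyped() override { return &Advice; }

private:
  int Advice;
};

A real runner would instead serialize the populated features, query its backend (e.g., a TVM or FlatBuffers-backed model), and return a pointer to the decoded result.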
3.5 Error Checking and Recovery
The model runners and SerDes modules are designed to handle compiler/model crashes, communication failures, and infinite loops. The failures are handled appropriately by allowing graceful termination of the processes. In the case of gRPC, we have implemented an exponential backoff algorithm that retries to overcome failures due to delays in communication resulting from network-related issues and packet losses. The communication fails gracefully upon exhausting the number of retries. In all other cases, we use a timeout-based mechanism for handling the failure. These mechanisms proved invaluable in practical experiments due to compiler bugs and network errors.

3.6 Compiler/ML Experts View
To use ML-Compiler-Bridge, developers need to invoke a minimal set of APIs by instantiating the necessary model runner with appropriate options specifying the SerDes type. List. 2 illustrates this on an example of invoking a user-defined model runner with a user-defined SerDes from the compiler. A similar API abstracting the communication and SerDes in Python is provided (List. 3) to query the ML model with inter-process model runners and respond back.

4 Use Cases: ML-LLVM Optimizations
We integrated ML-Compiler-Bridge with four ML-based compiler optimizations in LLVM: phase ordering [24], loop distribution [25], register allocation [50] and method inliner [47]. The first three optimizations are built using RLlib [32] with PyTorch [39] and LLVM V10, using program embeddings called IR2Vec [48]. The fourth optimization—inlining—uses TensorFlow [1], is built within LLVM V17, and uses feature-based representations [47]. There are two ML-based register allocators [46, 50] available for LLVM; we chose RL4ReAl [50] because it emphasizes finer-grained, high-bandwidth interactions with an ML model. All the components are configured, compiled and linked during the regular build process of LLVM. Integration challenges range from redesigning the entire framework of the original publication, to minor changes to the communication mechanisms.

4.1 Phase Ordering of Optimization Passes
POSET-RL predicts the ordering sequence of passes to jointly optimize code size along with execution time. An RL agent is trained with the DDQN algorithm [22] to predict a subsequence as action, given program embeddings as input observation. There are about 15 predetermined subsequences provided by the authors. The predicted optimization subsequence is applied on the input program, and the embeddings corresponding to the transformed program are used as the new observation. This process goes on until reaching a threshold on the number of subsequences.
In the published version, the above process was not integrated within LLVM but driven from a Python model. An LLVM opt process was spawned, passing the optimization sequence through a compiler flag for each prediction by the agent. In addition, embeddings involve spawning yet another process to invoke IR2Vec on the .ll IR file generated by the compiler. A similar strategy was in place for training.
We revisited the above using ML-Compiler-Bridge to operate directly within LLVM as a new transformation pass. Our new PosetRL pass implements a pass manager that applies the predicted optimization sequence, and also generates the next observation by invoking IR2Vec. The MLModelRunner communicates with the model and serializes the data to be transferred. The model communicates the predicted optimization subsequence as an integer ID (one among 15) to PosetRL, and the R^300 module-level embedding vectors are sent to the model for the next prediction.

4.2 Loop Distribution for Vectorization and Locality
Jain et al. [25] improve loop distribution by modeling SIMD parallelization and locality optimization opportunities. It uses two RL agents with fully-connected networks to identify the vertex processing order and when to distribute. Along with these agents, a Gated Graph Neural Network [31] processes the connected components of the dependence graph, where each node holds the embeddings for the corresponding instructions.
During training, a Python driver spawns a process to invoke the Loop Distribution pass. The RL model processes the input graph and predicts the sequence of instructions to be packed together as a loop. Upon applying the prediction, the rewards indicate the effectiveness of distribution. All these steps involve model-compiler interaction via file I/O. Inference itself is integrated with LLVM using Python wrappers.


#include "MLCompilerBridge/MLModelRunner.h"
#include "MLCompilerBridge/yourMLModelRunner.h"

// Instantiate the required model runner with SerDes type
std::unique_ptr<MLModelRunner> MLRunner =
    std::make_unique<yourModelRunner>(Arg, SerDes::Kind::yourSerDesType);
// Process input features
std::pair<std::string, InType> p = ...; // Input
MLRunner->populateFeatures(p);
// Get ML advice/output
OutType advice = MLRunner->evaluate<OutType>();
// Use the obtained advice
...
Listing 2. C++ APIs of ML-Compiler-Bridge

import CompilerInterface as CI

# Instantiate the required CompilerInterface with serdes type
interface = CI.YourCompilerInterface(Arg, yourSerdesType)
while True:
    # Send buffer data to compiler and wait for next request
    request = interface.evaluate()
    # Query model to get advice
    # Populate buffer with advice (serialized and stored in serdes)
    interface.populate_buffer(advice)
    # Break on condition
Listing 3. Python APIs of ML-Compiler-Bridge

In this paper, we eliminate the need for Python wrappers, file I/O, and spawning new processes. The model runners internally (de-)serialize data depending on the chosen SerDes and the MLModelRunner. For the runners that use serialization, the input graph is represented as key-value pairs, and a variable-length matrix in R^(n×300) encodes the sequence of n 300-D instruction embeddings. The output takes the form of a variable-length integer array with node identifiers that are to be distributed.

4.3 RL-Based Register Allocation
We also evaluate RL4ReAl, an RL-based register allocator implementing the splitting, coloring, and spilling sub-tasks as separate RL agents on LLVM's Machine IR. These RL agents pose a formidable engineering challenge in interfacing the model with the compiler during both training and inference.
Unlike other optimizations that need one single communication at the end, RL4ReAl involves multiple interleaved communication rounds to obtain a new observation and let the relevant agent make the next prediction. Also, the RL agents are arranged hierarchically: the outcome of one agent determines which agent would be invoked next. Unlike other use cases, this optimization involves transferring an interference graph where each variable is associated with an R^(n×100) matrix in which each one of the n instructions in the live range of the variable is represented in 100-D, a variable-length integer array to specify interferences and use points, and a variable-length floating point array of spill weights. Other metadata like function name, file name, and status are also sent as string fields. The model returns key-value pairs mapping variables to split or color decisions. Both training and inference use gRPC and Protobuf serialization.
We will investigate different communication and serialization improvements in this paper, with specialized scenarios for distributed training and deployment-friendly inference.

4.4 LLVM Inliner
The inliner pass traverses call sites in a bottom-up fashion, one connected component of functions at a time. For a given component, a work queue is initialized with the set of all static call sites. As the algorithm marks some call sites for inlining, it appends the former callee's call sites to the work queue. The decision to inline or not is made in two steps. First, it determines legality and whether the user provided any guidance (always/never inline). Only if the operation is legal and non-mandatory does a heuristic determine its profitability. The decision is driven by a simple RL-based model. It takes a number of scalar features characterizing the caller and callee (instruction counts, basic block counts, maximum loop depth), the call site itself (the number of compile-time constant parameters), as well as module-wide features (the current number of functions and statically known call edges). For the published version [47], the cost metric was size, with no reliance on dynamic profile data. The implementation uses an AOT-compiled TensorFlow model for inference with C++ APIs. We modularized it to use any model runner.

5 Evaluation
We measure compilation time on an Intel Xeon SkyLake W2133 with 6 cores, 12 threads and 32GB RAM. Training time is measured on an Intel Xeon W1390P with 8 cores, 16 threads, 64GB RAM and an Nvidia 3060 GPU. We evaluate POSET-RL, RL-LoopDistribution and RL4ReAl with the gRPC, Pipe and ONNX model runners and different SerDes options, and take the median of 3 runs. Most experiments use SPEC CPU 2006 and SPEC CPU 2017 benchmarks.

5.1 Impact on Deployment
Table 2. Compile time (in seconds) for POSET-RL.
| | Original | gRPC | Pipe + JSON | Pipe + Bitstream | ONNX |
|---|---|---|---|---|---|
| SPEC06 | 5,829 | 1,318 | 1,236 | 1,227 | 1,140 |
| SPEC17 | 10,342 | 1,221 | 1,141 | 1,132 | 1,093 |

Table 3. Multithreaded compile time with -O3 (in s) with in-process model runners. Compile time with gRPC is shown for RL4ReAl for comparison.
| | gRPC | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|---|---|---|---|---|---|
| LLVM Inliner (TF Runner) | - | 596 | 501 | 361 | 307 |
| RL4ReAl (ONNX Runner) | 5,572 | 291 | 257 | 248 | 248 |

Tab. 2 shows the POSET-RL compile time using different model runners. Among the in-process runners, we use ONNX for PyTorch models and RLlib. Overall, in-process runners achieve better compile times in all cases in comparison with any of the inter-process ones. Among the latter, gRPC has higher compile times (6.8–7.6%) compared to pipes with JSON and bitstream SerDes. This is because of the overheads associated with establishing connections and invoking RPC methods. Pipes with bitstream SerDes yield slightly higher performance than JSON SerDes due to the lower (de-)serialization overhead with bit streams. ONNXModelRunner yields a 7.2× speedup with POSET-RL compared to the original method in Sec. 4.1 that involved spawning new processes to invoke the compiler and other dependencies.
In-process model runners natively support multithreaded compilation, while inter-process model runners necessitate concurrently running multiple instances of the model, resulting in a high memory and compute overhead. Tab. 3 shows compile times with in-process model runners on the LLVM Inliner and RL4ReAl optimizations by varying the degree of parallelism. As the LLVM Inliner and RL4ReAl respectively rely on TensorFlow and PyTorch (and RLlib), we use the TensorFlow and ONNX model runners accordingly. In comparison to the original gRPC-based inference flow of RL4ReAl, the ONNX runner reduces compile time by 22.4× and 19× using 8 threads and 1 thread respectively. Using RL4ReAl results in a higher compile time, as it involves a larger number of model-compiler interactions. This overhead is effectively reduced by using the model runners of ML-Compiler-Bridge.
Similar trends are observed for RL-driven loop distribution [25] on TSVC [8] and the LLVM Test Suite [35]. The ONNX model runner yields an improvement of 16× in comparison to the original Python wrapper.

5.2 Impact on Training
In this section, we evaluate the effectiveness of ML-Compiler-Bridge during the training of POSET-RL and RL4ReAl. We use inter-process model runners for training.

5.2.1 Training Time. Fig. 4(a) shows the cumulative training time and number of training iterations observed in POSET-RL. We obtain large improvements in the training time across all the model runners. We see similar trends with gRPC and Pipe, as explained in the previous experiment.
The original training process of POSET-RL involves spawning processes and takes ≈10Ks to complete 500 iterations. In comparison, the gRPC model takes about 5.7Ks, while the pipes with JSON and bitstream serialization options take about 5.5Ks each. Throughout the iterations, we observe an overhead of about 20s between the JSON and bitstream serialization options. This minimal overhead is associated with the additional serialization effort involved while using JSON SerDes. Overall, using the inter-process model runners enables an end-to-end integration of the model and the compiler while training, and yields a significant improvement.

5.2.2 Multi-Worker Support. ML-Compiler-Bridge supports multi-worker training on both CPUs and GPUs. To support multiple workers while using gRPC, we expose a method taking an array of ports to establish connections with each worker. Similarly, multi-worker support with pipes is enabled by instantiating one pair of pipes per worker. We extended RL4ReAl to handle multi-worker scenarios; training times are shown in Fig. 4(b) for CPU and GPU workers. Using 10 workers with a GPU trainer takes about 2 seconds per episode, while a CPU trainer with <10, 5, 1> workers takes <4s, 8s, 15s> respectively. We obtained similar trends among the workers even upon using pipes for communication.

5.2.3 Using Different RL Policies. One may train and deploy models with different RL policies without impacting the compiler. For this experiment, we evaluate RL4ReAl with the different RL policies provided by RLlib. We perform hyperparameter tuning using Tune [33]. We trained the models with PPO [42], APPO [42], and A2C [37] policies until convergence. On the SPEC CPU 2017 benchmarks, this resulted in a 2% improvement on average using the APPO policy. The PPO and A2C policies perform similarly to the original paper.

5.3 Round-Trip Time
Let us finally isolate the Round-Trip Time (RTT) of each model runner as a limit study of the achievable communication throughput. We consider random floating point vectors of increasing length ranging from 500 to 50K elements in steps of 500. The model itself is a single fully-connected layer that consumes the vector and returns a scalar float. Fig. 4(c) shows the RTT of the whole process. The TF and ONNX runners achieve a very high throughput with a total RTT of 21 and 68ms respectively, while Pipes+JSON and Pipes+Bitstream yield 3154ms and 772ms respectively, and gRPC yields a larger RTT of 5948ms. These differences can be attributed to the serialization and communication overhead. The TF and ONNX runners benefit from in-process communication, proving to be suitable candidates for deployment. The higher throughput of TF is due to the AOT precompiled model. The Pipe runner proves to be a good candidate for training on local machines. And the gRPC runner provides support for training in a distributed environment. This makes all the model runners important in their own way.
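For reference, a measurement loop for such a limit study could look like the sketch below. It assumes only the Listing 1 interface; the feature key, the growth schedule, and the harness itself are illustrative and not the exact benchmark used in the paper.

#include <chrono>
#include <cstdio>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical RTT harness: time one query per input length for any runner
// exposing the Listing 1 interface.
void measureRoundTrips(MLModelRunner &Runner) {
  std::mt19937 Gen(42);
  std::uniform_real_distribution<float> Dist(0.0f, 1.0f);
  for (std::size_t Len = 500; Len <= 50000; Len += 500) {
    std::vector<float> Input(Len);
    for (float &V : Input) V = Dist(Gen);
    std::pair<std::string, std::vector<float>> Feature{"input", Input};
    auto Start = std::chrono::steady_clock::now();
    Runner.populateFeatures(Feature);
    float Out = Runner.evaluate<float>();
    auto End = std::chrono::steady_clock::now();
    long long Us = std::chrono::duration_cast<std::chrono::microseconds>(End - Start).count();
    std::printf("len=%zu out=%f rtt=%lld us\n", Len, Out, Us);
  }
}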


Figure 4. Performance characterization of model runners on different compilers and optimizations: (a) training times of POSET-RL with different model runners; (b) training times of RL4ReAl with CPU/GPU multi-workers; (c) microbenchmarking of individual model runners; (d) MLIR performance; (e) Pluto performance.

5.4 Gym Integration
We carried out additional experiments to evaluate the benefits of our library in the context of a state-of-the-art RL gym. The two goals are to facilitate deployment and to reduce compilation time by using in-process model runners. For this purpose, we trained the pass ordering for code size of CompilerGym [15] and exported the resulting model in the ONNX format. We then used our ONNX model runner within LLVM to materialize predictions and generate code. The inference times are shown in Fig. 5, with speedups ranging from 2× to 13×. These are primarily due to gRPC overheads in CompilerGym, as shown in Fig. 4(c).

Figure 5. Compile times when using the phase ordering for code size model with CompilerGym and the ONNX model runner.

5.5 Domain-Specific Compilers
Given LLVM's dominance in the general-purpose and back-end compiler landscape, it forms the natural basis for most ML/RL tools (Tab. 1). However, some ML-based domain-specific optimizations target higher-level frameworks like MLIR [30] and the polyhedral compilers Pluto [7] and PoCC [40]. Let us illustrate the cases of MLIR and Pluto.

Integration with MLIR. Given that end-to-end ML compilers based on MLIR are still undergoing rapid changes [44], we designed a simple experiment to demonstrate the integration of ML-Compiler-Bridge with MLIR. We wrote a custom pass in MLIR to communicate data with a dummy ML model to mimic a typical ML-compiler interaction. We use the same experimental setup as discussed in Sec. 5.3 and measure the round-trip time. The results are shown in Fig. 4(d). This opens up ML-based optimizations in MLIR-native compilers such as IREE and OpenXLA [44], Triton [45], Polygeist [38], and many other frameworks.

Integration with Pluto. We also experimented with the polyhedral source-to-source compiler Pluto. As Pluto is written in C, we use the C-APIs of ML-Compiler-Bridge for interfacing the models, illustrating the Pipe model runners and SerDes. We measured round-trip time using different SerDes and show the same in Fig. 4(e). This integration opens new opportunities for ML-based polyhedral optimizations, including autoscheduling and tile size selection.

6 Discussion
6.1 Lines of Code
Table 4. LOC to integrate model runners. gRPC shows LOC for API calls and RPC; values in parentheses indicate LOC in the protobuf specification. Other SerDes do not need any additional code.
| Optimizations | Original | gRPC | Pipe | ONNX |
|---|---|---|---|---|
| POSET-RL | - | 3+3 (4) | 3 | 3 |
| RL-LoopDistribution | 65 | 3+3 (5) | 4 | 3 |
| RL4ReAl | 75 | 10+3 (28) | 4 | 3 |

In Tab. 4, we show the number of additional Lines of Code (LOC) to integrate ML-Compiler-Bridge with different compiler optimizations. We observe a significant reduction in LOC compared to the original published works.
We do not compare with the size of the published version of POSET-RL, as its model was not integrated with the compiler. With Loop Distribution and RL4ReAl, the effort of writing Python wrappers and invoking protobuf and gRPC is completely removed. Among the available model runners and SerDes, only gRPC, ONNX and Protobuf involve (small) additional code to handle RPC, the environment, and Protobuf messages. It is pertinent to note that ML-Compiler-Bridge removes the tedious work of managing dependencies like gRPC and Python wrappers, which was otherwise necessary.

6.2 Impact on Compile Time, Size and Memory
Table 5. Comparison of the time taken to build clang and the final binary size with/without ML-Compiler-Bridge.
| Characteristics | Native Clang | Clang with ML-Compiler-Bridge |
|---|---|---|
| Compilation Time | 5m 7s | 5m 15s |
| Binary Size | 102.79 MB | 102.87 MB |
| Average RSS | 1.5538 GB | 1.5542 GB |

In Tab. 5, we show the compile time, binary size and average resident set size (RSS) used during compilation of Clang 10 with/without ML-Compiler-Bridge. The difference in binary size is ≈80KB, while the average RSS value differs by 400KB, with the release build time increasing only by a few seconds. ML-Compiler-Bridge incurs only a negligible overhead in terms of binary size, compile time and memory upon statically linking with the production version of Clang.


6.3 Characterization
As discussed earlier, different model runners exhibit different characteristics. During deployment, neither of the inter-process model runners offers multi-threaded compilation upon running a single model instance. It could be done by instantiating multiple model instances, but this would consume unreasonable amounts of memory. The in-process model runners, however, do not face this problem. Though there is a separate serialization overhead involved with the gRPC and pipe model runners, it is handled automatically without the involvement of the developer. Due to the nature of inter-process communication, there is a possibility of encountering communication errors arising from network and compiler crashes. We handle such cases as explained in Sec. 3.5. We summarize these characteristics in Tab. 6.

Table 6. Characteristics of different model runners
| Characteristics | gRPC | Pipes | ONNX | TF |
|---|---|---|---|---|
| Multithreaded Compilation | ✗ | ✗ | ✓ | ✓ |
| Distributed Training | ✓ | ✗ | - | - |
| Need for separate model process | ✓ | ✓ | ✗ | ✗ |
| Autoserialization | ✓ | ✓ | - | - |
| Communication Fidelity | ✗ | ✗ | ✓ | ✓ |
| ML Framework agnostic | ✓ | ✓ | ✓ | ✗ |
| Additional code by compiler writer | Y | N | Y | N |
| Serialization Requirement | Y | Y | - | - |
| Time overhead | Y | Y | N | N |

6.4 Limitations
As mentioned earlier, not all model runners are compatible with all ML models due to the nature of the underlying libraries. For instance, TensorFlow AOT compilation supports any TensorFlow or JAX model, but not PyTorch. Also, upon exporting the inliner model from TensorFlow to ONNX, we encountered an operator (TFL-Bucketize¹) that is not supported by ONNX. To handle such cases, the ONNX runtime allows registering custom operators. Once exported, the models can be used seamlessly without restriction.
Similarly, protobuf does not natively support a C runtime. Hence, our C APIs do not support using the gRPC model runner with protobuf serialization. The current TF AOT compilation generates C++ code, thereby making it not usable directly with C. This issue can be mitigated by using the TF C-APIs instead of AOT models.

¹ https://www.tensorflow.org/mlir/tfl_ops#tflbucketize_tflbucketizeop

7 Related Work
RL environments for compilers come closest to our work, such as CompilerGym [15], PolyGym [9], and Supersonic [51]. These primarily aim at facilitating research and reproducibility, which are only two of the broader ambitions of our research (e.g., deployment, programmable compiler interface, finer-grained interaction). CompilerGym internally calls the compiler APIs from a C++ wrapper, and the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits the functionality to only the APIs supported by the library and a particular compiler version with which the library is compatible. Supersonic [51] also uses the CompilerGym way of interfacing via gRPC. And, to our understanding, PolyGym [9] does not provide a programmable compiler interface.
The gym libraries and ML-Compiler-Bridge solve different problems; the former facilitate research and training, while our library aims to facilitate different interfaces for communication. We envision ML-Compiler-Bridge supplementing these gym environments by providing a variety of options for more diverse, finer-grained, and framework-independent interfacing of ML models with compilers, facilitating the transition from research to production.

8 Conclusions
We present ML-Compiler-Bridge, a modular and extensible library to integrate ML models within compiler optimizations. It provides inter- and in-process model runners with different serialization options to support both training and deployment scenarios. We show that a model and compiler pass can be integrated with only 3 lines of code, while also enabling very deep interleaving of RL-based algorithms like RL4ReAl, as well as leaner and production-friendly optimizations like function inlining.
Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively. We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, and multiple compilers (and versions) written in different languages to show the versatility and suitability of ML-Compiler-Bridge in research and production environments. Source code along with documentation and the related artifacts are available at https://compilers.cse.iith.ac.in/research/mlcompilerbridge [49].

Acknowledgments
We are grateful to Govindarajan Ramaswamy and Dibyendu Das for valuable discussions and feedback. We would like to thank Nilesh Shah, Soumya Banerjee, and Vikas Patnala for their help in various experiments. We also thank Raj Vilas Ambekar and Neha Bhargava for helping in validating the artifacts.
This work is partially funded by PhD fellowships from Google and PMRF, a research grant from Suzuki Motor Corporation, and a faculty grant from AMD.


References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
[2] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3, POPL, Article 40 (Jan 2019), 29 pages. https://doi.org/10.1145/3290353
[4] Saman P. Amarasinghe. 2020. Compiler 2.0: Using Machine Learning to Modernize Compiler Technology. In Proceedings of the 21st ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES 2020, London, UK, June 16, 2020, Jingling Xue and Changhee Jung (Eds.). ACM, 1–2. https://doi.org/10.1145/3372799.3397167
[5] Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, and John Cavazos. 2017. MiCOMP: Mitigating the Compiler Phase-Ordering Problem Using Optimization Sub-Sequences and Machine Learning. ACM Trans. Archit. Code Optim. 14, 3, Article 29 (Sep 2017), 28 pages. https://doi.org/10.1145/3124452
[6] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code comprehension: a learnable representation of code semantics. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 3589–3601.
[7] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI '08). Association for Computing Machinery, New York, NY, USA, 101–113. https://doi.org/10.1145/1375581.1375595
[8] Michael Boulton. [n. d.]. TSVC_2. https://github.com/UoB-HPC/TSVC_2.git. Accessed 2015-09-16.
[9] Alexander Brauckmann, Andrés Goens, and Jeronimo Castrillon. 2021. PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). 17–29. https://doi.org/10.1109/PACT52795.2021.00009
[10] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540
[11] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI'18). USENIX Associa-
[13] Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O'Boyle, and Hugh Leather. 2021. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 2244–2253. http://proceedings.mlr.press/v139/cummins21a.html
[14] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. Synthesizing benchmarks for predictive modeling. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 86–99. https://doi.org/10.1109/CGO.2017.7863731
[15] Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather. 2022. CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 92–105. https://doi.org/10.1109/CGO53902.2022.9741258
[16] Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 2021. AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (Virtual Event, Republic of Korea) (CGO '21). IEEE Press, 378–390. https://doi.org/10.1109/CGO51591.2021.9370322
[17] Dibyendu Das, Shahid Asghar Ahmad, and Venkataramanan Kumar. 2020. Deep Learning-based Approximate Graph-Coloring Algorithm for Register Allocation. In 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). 23–32. https://doi.org/10.1109/LLVMHPCHiPar51896.2020.00008
[18] FlatBuffers [n. d.]. FlatBuffers. https://flatbuffers.dev/index.html. [Online; accessed 29-Aug-2022].
[19] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V. Bonilla, John Thomson, Christopher K. I. Williams, and Michael F. P. O'Boyle. 2011. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. Int. J. Parallel Program. 39, 3 (2011), 296–327. https://doi.org/10.1007/s10766-010-0161-2
[20] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Deutschland GmbH.
[21] Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. NeuroVectorizer: end-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (San Diego, CA, USA) (CGO 2020). Association for Computing Machinery, New York, NY, USA, 242–255. https://doi.org/10.1145/3368826.3377928
[22] Hado van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI'16). AAAI Press, 2094–2100.
[23] Qijing Huang, Ameer Haj-Ali, William Moses, John Xiang, Ion Stoica, Krste Asanovic, and John Wawrzynek. 2019. AutoPhase: Compiler Phase-Ordering for HLS with Deep Reinforcement Learning. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE Computer Society, Los
tion, USA, 579–594. Alamitos, CA, USA, 308–308. https://doi.org/10.1109/FCCM.2019.
[12] Keith D. Cooper, Devika Subramanian, and Linda Torczon. 2002. Adap- 00049
tive Optimizing Compilers for the 21st Century. J. Supercomput. 23, 1 [24] Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, and Ramakrishna
(2002), 7–22. https://doi.org/10.1023/A:1015729001611 Upadrasta. 2022. POSET-RL: Phase ordering for Optimizing Size and
[13] Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Execution Time using Reinforcement Learning. In International IEEE
Michael F. P. O’Boyle, and Hugh Leather. 2021. ProGraML: A Graph- Symposium on Performance Analysis of Systems and Software, ISPASS
based Program Representation for Data Flow Analysis and Compiler 2022, Singapore, May 22-24, 2022. IEEE, 121–131. https://doi.org/10.

248
CC ’24, March 2–3, 2024, Edinburgh, United Kingdom S. VenkataKeerthy et al.

1109/ISPASS55109.2022.00012 [38] William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zi-
[25] Shalini Jain, S. VenkataKeerthy, Rohit Aggarwal, Tharun Kumar Dan- nenko. 2021. Polygeist: Raising C to Polyhedral MLIR. In 2021 30th
geti, Dibyendu Das, and Ramakrishna Upadrasta. 2022. Reinforcement International Conference on Parallel Architectures and Compilation Tech-
Learning assisted Loop Distribution for Locality and Vectorization. niques (PACT). 45–59. https://doi.org/10.1109/PACT52795.2021.00011
In 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infras- [39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Brad-
tructure in HPC (LLVM-HPC). 1–12. https://doi.org/10.1109/LLVM- bury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein,
HPC56686.2022.00006 Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary
[26] Minsu Kim, Jeong-Keun Park, and Soo-Mook Moon. 2022. Solving DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit
PBQP-Based Register Allocation Using Deep Reinforcement Learning. Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An
In Proceedings of the 20th IEEE/ACM International Symposium on Code Imperative Style, High-Performance Deep Learning Library. In Ad-
Generation and Optimization (Virtual Event, Republic of Korea) (CGO vances in Neural Information Processing Systems 32. Curran Associates,
’22). IEEE Press, 230–241. https://doi.org/10.1109/CGO53902.2022. Inc., 8024–8035. https://doi.org/10.5555/3454287.3455008
9741272 [40] Louis-Noel Pouchet, C´edric Bastoul, and Uday Bondhugula. 2010.
[27] Sameer Kulkarni, John Cavazos, Christian Wimmer, and Douglas Si- PoCC: the polyhedral compiler collection. https://web.cs.ucla.edu/
mon. 2013. Automatic construction of inlining heuristics using ma- ~pouchet/software/pocc. Accessed 2023-10-25.
chine learning. In Proceedings of the 2013 IEEE/ACM International Sym- [41] Protobuf [n. d.]. Protocol Buffers. https://developers.google.com/
posium on Code Generation and Optimization (CGO). 1–12. https: protocol-buffers. [Online; accessed 29-Aug-2022].
//doi.org/10.1109/CGO.2013.6495004 [42] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and
[28] P. J. Landin. 1966. The next 700 Programming Languages. Commun. Oleg Klimov. 2017. Proximal Policy Optimization Algorithms.
ACM 9, 3 (mar 1966), 157–166. https://doi.org/10.1145/365230.365257 arXiv:1707.06347 [cs.LG]
[29] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Frame- [43] M. Stephenson and S. Amarasinghe. 2005. Predicting unroll factors
work for Lifelong Program Analysis & Transformation. In Proceedings using supervised classification. In International Symposium on Code
of the International Symposium on Code Generation and Optimization: Generation and Optimization. 123–134. https://doi.org/10.1109/CGO.
Feedback-Directed and Runtime Optimization (Palo Alto, California) 2005.29
(CGO ’04). IEEE Computer Society, USA, 75. [44] The IREE Team. 2023. https://github.com/openxla/iree. Accessed
[30] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy 2023-11-13.
Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasi- [45] The Triton Team. 2023. https://github.com/openai/triton. Accessed
lache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infras- 2023-11-13.
tructure for Domain Specific Computation. In 2021 IEEE/ACM Interna- [46] Mircea Trofin, Yundi Qian, Eugene Brevdo, and David Li. 2021.
tional Symposium on Code Generation and Optimization (CGO). 2–14. RFC: MLGO Regalloc: learned eviction policy for regalloc. https://
https://doi.org/10.1109/CGO51591.2021.9370308 lists.llvm.org/pipermail/llvm-dev/2021-November/153639.html, https:
[31] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. //lists.llvm.org/pipermail/llvm-dev/2021-November/153639.html. [On-
2016. Gated Graph Sequence Neural Networks. In 4th International line; accessed 08-May-2022].
Conference on Learning Representations, ICLR 2016, San Juan, Puerto [47] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choro-
Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and manski, and David Li. 2021. MLGO: a Machine Learning Guided
Yann LeCun (Eds.). http://arxiv.org/abs/1511.05493 Compiler Optimizations Framework. CoRR abs/2101.04808 (2021).
[32] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, arXiv:2101.04808 https://arxiv.org/abs/2101.04808
Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. [48] S. VenkataKeerthy, R Aggarwal, S Jain, M S Desarkar, R Upadrasta,
RLlib: Abstractions for Distributed Reinforcement Learning. In Proceed- and Y. N. Srikant. 2020. IR2Vec: LLVM IR Based Scalable Program
ings of the 35th International Conference on Machine Learning (Proceed- Embeddings. ACM Trans. Archit. Code Optim. 17, 4, Article 32 (Dec.
ings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas 2020), 27 pages. https://doi.org/10.1145/3418463
Krause (Eds.). PMLR, 3053–3062. https://proceedings.mlr.press/v80/ [49] S. VenkataKeerthy and Siddharth Jain. 2024. ML-Compiler-Bridge: The
liang18b.html Next 700 ML-Enabled Compiler Optimizations. https://doi.org/10.5281/
[33] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. zenodo.10574579
Gonzalez, and Ion Stoica. 2018. Tune: A Research Platform for Dis- [50] S. VenkataKeerthy, Siddharth Jain, Anilava Kundu, Rohit Aggarwal,
tributed Model Selection and Training. arXiv:1807.05118 [cs.LG] Albert Cohen, and Ramakrishna Upadrasta. 2023. RL4ReAl: Reinforce-
[34] ONNX (Linux Foundation). 2017. ONNX: Open Neural Network Ex- ment Learning for Register Allocation. In CC 2023 (Montréal, QC,
change. https://github.com/onnx/onnx. [Online; accessed 11-Mar- Canada). Association for Computing Machinery, New York, NY, USA,
2023]. 133–144. https://doi.org/10.1145/3578360.3580273
[35] LLVM-Org. [n. d.]. LLVM Test Suite. https://github.com/llvm/llvm- [51] Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris
test-suite. Accessed 2021-08-25. Cummins, Hugh Leather, and Zheng Wang. 2022. Automating Re-
[36] Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, inforcement Learning Architecture Design for Code Optimization.
and Michael Carbin. 2019. Compiler auto-vectorization with imitation In Proceedings of the 31st ACM SIGPLAN International Conference on
learning. Curran Associates Inc., Red Hook, NY, USA. https://doi.org/ Compiler Construction (Seoul, South Korea) (CC 2022). Association
10.5555/3454287.3455597 for Computing Machinery, New York, NY, USA, 129–143. https:
[37] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex //doi.org/10.1145/3497776.3517769
Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray [52] X Wang, H Zhao, and J Zhu. 1993. GRPC: A Communication Coopera-
Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement tion Mechanism in Distributed Systems. SIGOPS Oper. Syst. Rev. 27, 3
Learning. In Proceedings of The 33rd International Conference on Ma- (jul 1993), 75–86. https://doi.org/10.1145/155870.155881
chine Learning (Proceedings of Machine Learning Research, Vol. 48), [53] Zheng Wang and Michael O’Boyle. 2018. Machine Learning in Com-
Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New piler Optimization. Proc. IEEE 106, 11 (2018), 1879–1901. https:
York, New York, USA, 1928–1937. https://proceedings.mlr.press/v48/ //doi.org/10.1109/JPROC.2018.2817118
mniha16.html
Received 13-NOV-2023; accepted 2023-12-23

249
