1 Introduction

With the success of Machine Learning (ML) models in various domains, there is a growing interest in applying ML to improve optimization heuristics in compilers [4, 12]. Several [...]

In most works, the process ends with step (3) and a simplified, benchmark-oriented version of step (4) to evaluate the trained model. Indeed, while there exist a number of solutions for steps (1 & 2), a proper methodology for steps (3) & (4), which involve model-compiler interaction, has not yet been adequately addressed.

The diversity of compiler optimizations and ML models is associated with an equally broad range of requirements for model-compiler interaction; Tab. 1 illustrates this for recent proposals. There exist multiple ML frameworks and even more types of ML models. A model's input may be a plain floating point vector, or tensors of different ranks and shapes. Outputs range from a single Boolean decision to complex data structures. These need to be communicated with the compiler; it may be only once for simple scenarios, or many times and involving large amounts of data for more intricate ones. And this may involve extensive source code modifications for the sole purpose of implementing the compiler-model interface.
Table 1. Diverse ML and RL requirements in previous work; unknown or unclear ones are left blank.

| Work | Communication | Model Input | Model Output | Comm. Freq | #Agents | Model Type | ML Framework |
| SLP Vectorization [36] | | LLVM IR | Instructions to pack | | Single Agent | GGNN | |
| NeuroVectorizer [21] | Source code with pragma added in Python | Code2Vec vectors | Pragmas in source | Once, at the end | Single Agent | FCNN | Keras, RLlib |
| Register Allocation [17] | No integration | Interference graph | Coloured IG | None | NA | LSTM | TensorFlow |
| Register Allocation [26] | | PBQP graph | Allocated PBQP graph | | Single Agent | GCN, ResNet | PyTorch |
| POSET-RL [24] | Opt flags | IR2Vec vectors | Pass sequence | Multiple times per episode | Single Agent | FCNN | PyTorch |
| Loop Distribution [25] | Python wrappers | IR2Vec vectors | Distribution sequence | Once, at the end | Two agents | GNN, FCNN | PyTorch |
| Inliner [47] | Precompiled TF model | Features | Yes/No | Once, at the end | Single Agent | FCNN | TensorFlow |
| RegAlloc Eviction [46] | Precompiled TF model | Features | Index of live range to evict | Once, at the end | Single Agent | FCNN | TensorFlow |
| RL4ReAl [50] | gRPC | IG + node level embeddings | Color map | Multiple times per episode | Four agents; hierarchical | GNN, FCNN | PyTorch, RLlib |
Some of these interactions have been explored in the literature and have even landed in production; however, there does not exist a single generic method to address the vast diversity of scenarios that are imaginable and the trade-offs therein. Such a situation limits the scope, applicability and effectiveness of ML for compiler optimizations in the following ways:

• Scalability: Integrating a Python model with C++ code using wrappers induces significant compile-time overhead, e.g. 6×–100× [25].
• Integration: Not all optimizations are simple enough that the outputs of the model can be communicated using flags [25, 26, 47, 50]. As ML-based optimizations grow in popularity, flag-based approaches become unwieldy.
• Programmability: ML models are typically written in Python across different frameworks like TensorFlow, JAX, PyTorch, etc. Expecting the model to be written in C++ within the compiler is not ML developer-friendly.
• Portability: Several proposals involve a tight coupling between the compiler and a specific ML framework; we however believe that a generic compiler infrastructure like LLVM should remain ML-framework-independent.

The existing gym libraries primarily aim at facilitating model training for research and reproducibility by providing a high-level integration. For example, the recent CompilerGym [15] provides a high-level interface in the form of C++ wrapper methods outside the compiler to invoke out-of-tree compiler APIs to materialize the predicted actions. Such integration caters well to training certain interactions like Phase Ordering [24]. However, other optimizations like RegAlloc [17, 26, 50], loop distribution [25] and inlining [47] necessitate a deeper interfacing of the model within the compiler, with multiple rounds of interaction in both training and inference scenarios. Further, in these gym libraries, the inference flow is driven by Python: the compilation starts by invoking a Python process, breaking the isolation between the end user and the internal compiler algorithms; this limits deployment opportunities, among other downsides. We discuss these issues in detail in Sec. 4.

To address these shortcomings, we propose ML-Compiler-Bridge, a library that allows ML model development within a traditional Python framework while providing tightly coupled and efficient end-to-end integration with the compiler. Our library bridges the compiler and ML model by providing a suite of communication approaches (model runners) and the related (de-)serialization mechanisms (SerDes) to cater to diverse scenarios. It supports both inter- and in-process communication by exposing different model runners: gRPC and named pipes for the former, and TensorFlow and ONNX for the latter. Diverse SerDes options based on Protobuf, JSON, and native bitstreams improve efficiency and versatility. The appropriate model runner and SerDes can be chosen based on the usage scenario and requirements, and these may differ during training and inference. Our library provides C++ and Python APIs to expose model runners and SerDes for integration with compilers and ML frameworks respectively.

We show that the inter-process model runners effectively support training. Once the model is trained, the in-process model runners interface the model within the compiler in a transparent manner, with much lower latency, to aid deployment. Besides, both the model runner and SerDes modules can be easily extended to support more forms of communication and serialization. Our library also provides C APIs to aid integration with C-based compiler infrastructures like Pluto, GCC, and SQLite.

We evaluate ML-Compiler-Bridge on four ML-enabled optimizations in LLVM: RL-LoopDistribution, POSET-RL, RL4ReAl, and Inliner. We show that our library can be integrated with other compilers like Pluto [7] and MLIR [30] with minimal effort. We study the impact of communication and serialization options on compile time under complex scenarios that the existing infrastructures could not handle. We conduct extensive evaluations to measure the overhead caused by each model runner and SerDes. We also study the impact of integrating ML-Compiler-Bridge with LLVM in terms of additional dependencies, compile time, and binary size overhead. Here are our contributions:
• We propose ML-Compiler-Bridge, a library to enable the deeper integration of ML models and the compiler in a framework-independent manner.
• We provide a suite of two inter- and two in-process model runners, and three (de-)serialization mechanisms (SerDes) to support different interaction scenarios.
• We provide multi-language user APIs: C++ and C APIs to interface model runners and serializers with compilers, and Python APIs to interface inter-process model runners with ML frameworks.
• We show that our library is easy to integrate with three different compilers spanning different representations, and carry out extensive evaluations on four ML-enabled optimizations on two versions of LLVM (V10, V17).
• We characterize the impact of each communication and serialization option on compilation and training times and other overheads.

2 Background

[...] internals, infrastructural details, and integration points, focusing on the optimization objectives and information flow. For the end-user, however, the presence of ML-compiler optimization should be transparent and indistinguishable from the conventional (non-ML based) compilation process. To achieve this scheme of abstraction/segregation among all three actors, it is important to distinguish between the training and inference flows.

Training. Typically, training the ML model becomes part of compiler development and build-up, and inference becomes part of compiler deployment and execution. However, occasionally this boundary may shift towards the user, as with domain-specific training or fine-tuning at deployment time. Since ML developers usually prefer developing models within a Python-based framework, the training process involving a C++ compiler infrastructure like LLVM requires a communication channel, typically inter-process, while catering to the needs of (de-)serializing data between the native types of C++ and Python. The distributed nature of training processes may also require extending communication beyond a single operating system node.
Figure 2. The compiler instantiates a model runner and sets the input features to be used by the model. MLModelRunner internally invokes SerDes to serialize the data in one of the supported formats and query the model. The returned decision is deserialized and provided to the optimization.

    class MLModelRunner {
    public:
      // Populates inputs as key-value pairs
      template <typename T, typename... Types>
      void populateFeatures(std::pair<string, T> &var1,
                            std::pair<string, Types> &...var2);

      // Exposed to the user; returns the model's output
      template <typename T> T evaluate() {
        return *reinterpret_cast<T *>(evaluateUntyped());
      }

    protected:
      // To be overridden by derived classes
      virtual void *evaluateUntyped() = 0;
    };

Listing 1. Skeleton of the MLModelRunner class.
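To make the interface concrete, the following is a minimal usage sketch from within an optimization pass. Only populateFeatures() and evaluate() come from Listing 1; the feature names and the assumption that vector- and scalar-valued features are supported are placeholders for illustration, not the library's exact API.

    // Hypothetical integration inside a pass; feature names ("ir2vec_embedding",
    // "num_loops") and the supported feature types are illustrative assumptions.
    int queryModel(MLModelRunner &Runner, const std::vector<float> &FnEmbedding,
                   int NumLoops) {
      std::pair<std::string, std::vector<float>> Embedding{"ir2vec_embedding",
                                                           FnEmbedding};
      std::pair<std::string, int> Loops{"num_loops", NumLoops};
      Runner.populateFeatures(Embedding, Loops); // serialize and send the inputs
      return Runner.evaluate<int>();             // block until the model's decision
    }

The same two calls remain unchanged whether the underlying runner is pipe-, gRPC-, ONNX-, or TF-based, which is what allows training and deployment to use different runners.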
[...] enables new forms of communication and serialization to be added by overriding a minimal set of methods. Fig. 2 shows the components and interactions of ML-Compiler-Bridge.

3.1 ML Model Runners

We provide two classes of model runners. The inter-process class provides the easiest mechanism to decouple Python models from a compiler running as a separate process. The in-process class assumes that the ML model is readily available in a compiled form and can be accessed within the compiler through a specific API. Clearly, in-process communication is designed with inference and deployment in mind, while inter-process communication enjoys more diverse use cases. Model runners may support simple ML queries and feed-forward networks as well as more involved Reinforcement Learning (RL) algorithms or Graph Neural Networks (GNNs).

Internally, MLModelRunner is the abstract base class from which the other model runners are derived (List. 1). It exposes two APIs: populateFeatures() populates the input features, and evaluate() queries the model.

3.1.1 Inter-process Model Runners. gRPCModelRunner uses gRPC and may run the model and compiler on different machines, whereas pipeModelRunner uses named pipes for single-machine scenarios. At training time, the compiler acts as a server and the Python-based ML model acts as a client. The sequence of steps is as follows:
(1) Compilation starts and the compiler listens for queries at the wait() call inserted at the point of interest.
(2) The Python model starts training; this can be started concurrently with Step (1).
(3) When input from the compiler is required, the model sends requests to the compiler with appropriate queries and waits for the response.
(4) The compiler gets out of the blocked state and processes the query to generate an appropriate response.
(5) The response is sent back to the client, and the model goes on to complete training on that input.
Inference follows the same steps, yet the compiler becomes the client and the model becomes the server, so as to support a regular compilation process.

gRPC Model Runner. gRPC [52] provides RPC methods specifying the type of input and output in Protobuf format [41]. During the build process of the library, the proto files are automatically translated to C++ and Python code by invoking the protoc compiler. The generated code defines the Service class that exposes the RPC methods to be overridden by the user in the optimization that makes use of gRPCModelRunner. gRPCModelRunner takes in the server address and the port number at which the connection is to be established. In training mode, gRPCModelRunner starts the server and listens for an RPC call invoked by the model. The overridden RPC method is directly called by the Python model to generate new observations by applying the action predicted by the model. In inference mode, gRPCModelRunner starts the gRPC connection at the given address and port.
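For orientation, the compiler-side bootstrapping in training mode can be pictured as standard gRPC C++ server setup; the port and the service object (generated by protoc and overridden in the pass) are placeholders, and this sketch is not gRPCModelRunner's own code, which handles these steps internally.

    #include <grpcpp/grpcpp.h>
    #include <memory>

    // Start a gRPC server inside the compiler and block, serving queries issued
    // by the Python model. Address and service are illustrative placeholders.
    void runTrainingServer(grpc::Service *OverriddenService) {
      grpc::ServerBuilder Builder;
      Builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
      Builder.RegisterService(OverriddenService); // RPC methods overridden by the pass
      std::unique_ptr<grpc::Server> Server = Builder.BuildAndStart();
      Server->Wait();                             // compiler waits for the model's RPCs
    }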
Pipe Model Runner. As the name suggests, the pipe model runner relies on named pipes for inter-process communication (the mkfifo system call). Pipes provide a simple and effective means of communication that is local to the machine, without any network or security constraints. As pipes are unidirectional, pipeModelRunner creates a read pipe and a write pipe for communication. The read pipe in the compiler obtains the data written by the model in Python, and the write pipe carries the data that is read by the model on the other end. read() is a blocking call, forcing the compiler to wait till data is written by the model. Once the data is written, the model enters a blocking state by invoking read() on the second pipe, waiting for the response from the compiler. The pipe model runner ensures proper opening, closing, and cleanup. pipeModelRunner provides a simpler interface for establishing communication, as the user directly invokes evaluate() after setting the inputs.
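The mechanics described above can be illustrated with plain POSIX named pipes. The following simplified one-shot exchange (fixed pipe paths, string payloads) is a sketch of the idea under those assumptions, not pipeModelRunner itself.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <string>

    // Compiler side: write the serialized query, then block on the read pipe
    // until the model writes back its decision. Pipe names are placeholders.
    std::string exchangeOverPipes(const std::string &Request) {
      const char *ToModel = "/tmp/compiler_to_model.pipe";
      const char *FromModel = "/tmp/model_to_compiler.pipe";
      mkfifo(ToModel, 0666);   // no-op if the FIFO already exists
      mkfifo(FromModel, 0666);

      int WriteFd = open(ToModel, O_WRONLY);   // blocks until the model opens it
      (void)write(WriteFd, Request.data(), Request.size());
      close(WriteFd);                          // EOF marks the end of the request

      int ReadFd = open(FromModel, O_RDONLY);  // blocks until the model responds
      char Buf[4096];
      std::string Reply;
      ssize_t N;
      while ((N = read(ReadFd, Buf, sizeof(Buf))) > 0)
        Reply.append(Buf, static_cast<size_t>(N)); // read() blocks while the model works
      close(ReadFd);
      return Reply;
    }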
3.1.2 In-process Model Runners. In-process model runners are designed to provide an effective means of compiler deployment. It is important to optimize the inference time as it adds up to the overall compile time. One may obtain significantly lower compile times by removing the inter-process communication overhead, and by turning the complications of a compiled model into an advantage by reducing the query time compared to models running in Python. Serialization/deserialization overhead is also lowered.

ONNX Model Runner. The Open Neural Network Exchange [34] (ONNX) is an open format to represent machine learning models. Models built from various frameworks like TensorFlow, PyTorch, etc. can be represented in ONNX format in an interoperable manner. Additionally, it supports a wide variety of hardware architectures ranging from edge devices to general-purpose CPUs and GPUs. Once the model is trained in Python, it is converted into a common ONNX representation and is imported into the compiler via the ONNX runtime. ONNXModelRunner exposes the necessary wrapper APIs to read the ONNX model, query it with inputs, and obtain outputs.

[...] finetuning or when quickly evaluating candidate models and parameters.
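A minimal sketch of what an in-process query through the ONNX Runtime C++ API looks like is given below; the model path and the "obs"/"action" tensor names are assumptions, and this is not the ONNXModelRunner implementation itself.

    #include <onnxruntime_cxx_api.h>
    #include <array>
    #include <cstdint>
    #include <vector>

    // Load an exported .onnx model and run a single query.
    float queryOnnxModel(const std::vector<float> &Features) {
      static Ort::Env Env(ORT_LOGGING_LEVEL_WARNING, "mlcb-sketch");
      Ort::SessionOptions Opts;
      Ort::Session Session(Env, "model.onnx", Opts);   // path is a placeholder

      std::array<int64_t, 2> Shape{1, static_cast<int64_t>(Features.size())};
      Ort::MemoryInfo Mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
      Ort::Value Input = Ort::Value::CreateTensor<float>(
          Mem, const_cast<float *>(Features.data()), Features.size(),
          Shape.data(), Shape.size());

      const char *InNames[] = {"obs"};       // assumed input tensor name
      const char *OutNames[] = {"action"};   // assumed output tensor name
      auto Outputs = Session.Run(Ort::RunOptions{nullptr}, InNames, &Input, 1,
                                 OutNames, 1);
      return Outputs[0].GetTensorMutableData<float>()[0]; // the model's decision
    }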
The TensorFlow model runner uses the AOT saved-model compiler, which produces a header exposing the model as a C++ class, and a native object file with its implementation. The model runner reduces again to a simple adapter [20] around that class. The compiler binary does not expose new runtime dependencies as it is statically linked, and this highly simplifies its deployment. Note that the model compiler can be configured to generate code by loading the weights from a file passed via the command line to the compiler.

[Figure: interaction of the optimization pass, ONNXModelRunner, the agent(s), and the ONNX model through dispatch and evaluate() calls.]

3.2 SerDes: Serializer and Deserializer Module

When data is transferred, specifically across two processes, it is important to convert data that is present in the native types (of C++ and Python) from one format to another. This is the purpose of (de-)serialization as implemented by the SerDes module. The MLModelRunner interacts with SerDes to (de-)serialize C++ native data to model-specific types and back. The choice [...]
[...] header induces negligible overhead if communicated data does not involve complex data types.

3.3 C-APIs

We provide C wrappers around the C++ implementation to integrate with C-based compilers. These wrappers are C++ files written in C style. Each method internally queries the original C++ implementation and returns results in a way compatible with C calling conventions. This code is built as a separate library that may be linked with a C-based compiler.
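As a sketch of what such a wrapper can look like (all mlcb_* names are illustrative placeholders, not the library's actual C API):

    /* C-facing declarations that a C-based compiler can call; each function is
       implemented in C++ and forwards to the underlying MLModelRunner object. */
    #ifdef __cplusplus
    extern "C" {
    #endif

    typedef void *mlcb_runner_t;                      /* opaque handle */

    mlcb_runner_t mlcb_create_pipe_runner(const char *pipe_name);
    void mlcb_set_float_features(mlcb_runner_t runner, const char *key,
                                 const float *values, int count);
    int mlcb_evaluate_int(mlcb_runner_t runner);      /* forwards to evaluate<int>() */
    void mlcb_destroy_runner(mlcb_runner_t runner);

    #ifdef __cplusplus
    }
    #endif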
[...] embeddings called IR2Vec [48]. The fourth optimization, inlining, uses TensorFlow [1], is built within LLVM V17, and uses feature-based representations [47]. There are two ML-based register allocators [46, 50] available for LLVM; we chose RL4ReAl [50] because it emphasizes finer-grained, high-bandwidth interactions with an ML model. All the components are configured, compiled, and linked during the regular build process of LLVM. Integration challenges range from redesigning the entire framework of the original publication to minor changes to the communication mechanisms.
[...] All these steps involve model-compiler interaction via file I/O. Inference itself is integrated with LLVM using Python wrappers. In this paper, we eliminate the need for Python wrappers, file I/O, and spawning new processes. The model runners internally (de-)serialize data depending on the chosen SerDes and the MLModelRunner. For the runners that use serialization, the input graph is represented as key-value pairs, and a variable-length matrix in R^{n×300} encodes the sequence of n 300-D instruction embeddings. The output takes the form of a variable-length integer array with the node identifiers that are to be distributed.

4.3 RL-Based Register Allocation

We also evaluate RL4ReAl, an RL-based register allocator implementing the splitting, coloring, and spilling sub-tasks as separate RL agents on LLVM's Machine IR. These RL agents pose a formidable engineering challenge in interfacing the model with the compiler during both training and inference. Unlike other optimizations that need one single communication at the end, RL4ReAl involves multiple interleaved communication rounds to obtain a new observation and let the relevant agent make the next prediction. Also, the RL agents are arranged hierarchically: the outcome of one agent determines which agent is invoked next. Unlike other use cases, this optimization involves transferring an interference graph where each variable is associated with an R^{n×100} matrix, in which each one of the n instructions in the live range of the variable is represented in 100-D, a variable-length integer array specifying interferences and use points, and a variable-length floating point array of spill weights. Other metadata like function name, file name, and status are also sent as string fields. The model returns key-value pairs mapping variables to split or color decisions. Both training and inference use gRPC and Protobuf serialization. We will investigate different communication and serialization improvements in this paper, with specialized scenarios for distributed training and deployment-friendly inference.

4.4 LLVM Inliner

The inliner pass traverses call sites in a bottom-up fashion, one connected component of functions at a time. For a given component, a work queue is initialized with the set of all static call sites. As the algorithm marks some call sites for inlining, it appends the former callee's call sites to the work queue. The decision to inline or not is made in two steps. First, it determines legality and whether the user provided any guidance (always/never inline). Only if the operation is legal and non-mandatory does a heuristic determine its profitability. The decision is driven by a simple RL-based model. It takes a number of scalar features characterizing the caller and callee (instruction counts, basic block counts, maximum loop depth), the call site itself (the number of compile-time constant parameters), as well as module-wide features (the current number of functions and statically known call edges). For the published version [47], the cost metric was size, with no reliance on dynamic profile data. The implementation uses an AOT-compiled TensorFlow model for inference with C++ APIs. We modularized it to use any model runner.
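The work-queue structure of that policy can be condensed into the following sketch; the CallSite fields and the callbacks are simplified stand-ins rather than LLVM's actual inliner, with the profitability callback playing the role of the learned model.

    #include <deque>
    #include <functional>
    #include <vector>

    // Toy stand-ins for the structures the text describes; not LLVM's types.
    struct CallSite { int Callee; bool Legal; bool AlwaysInline; bool NeverInline; };

    // Bottom-up, work-queue driven decision procedure for one connected component.
    // `profitability` stands in for the learned model; `calleeSites` returns the
    // call sites that inlining a callee would expose.
    void processComponent(std::vector<CallSite> Initial,
                          const std::function<bool(const CallSite &)> &profitability,
                          const std::function<std::vector<CallSite>(int)> &calleeSites) {
      std::deque<CallSite> Work(Initial.begin(), Initial.end());
      while (!Work.empty()) {
        CallSite CS = Work.front();
        Work.pop_front();
        // Step 1: legality and explicit user guidance (always/never inline).
        if (!CS.Legal || CS.NeverInline)
          continue;
        // Step 2: only legal, non-mandatory sites reach the learned profitability model.
        if (!CS.AlwaysInline && !profitability(CS))
          continue;
        // Inlining exposes the former callee's call sites; append them to the queue.
        for (const CallSite &New : calleeSites(CS.Callee))
          Work.push_back(New);
      }
    }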
5 Evaluation

We measure compilation time on an Intel Xeon SkyLake W2133 with 6 cores, 12 threads, and 32GB RAM. Training time is measured on an Intel Xeon W1390P with 8 cores, 16 threads, 64GB RAM, and an Nvidia 3060 GPU. We evaluate POSET-RL, RL-LoopDistribution, and RL4ReAl with gRPC, Pipe, and ONNX model runners and different SerDes options, and take the median of 3 runs. Most experiments use the SPEC CPU 2006 and SPEC CPU 2017 benchmarks.

5.1 Impact on Deployment

Tab. 2 shows the POSET-RL compile time using different model runners. Among the in-process runners, we use ONNX for PyTorch and RLlib models. Overall, in-process runners achieve better compile times in all cases in comparison with any of the inter-process ones. Among the latter, gRPC has higher compile times (6.8–7.6%) compared to pipes with JSON and bitstream SerDes. This is because of the overheads associated with establishing connections and invoking RPC methods. Pipes with bitstream SerDes yield slightly higher performance than JSON SerDes due to the lower (de-)serialization overhead with bit streams. ONNXModelRunner yields a 7.2× speedup with POSET-RL compared to the original method in Sec. 4.1, which involved spawning new processes to invoke the compiler and other dependencies.

Table 2. Compile time (in seconds) for POSET-RL.
| | Original | gRPC | Pipe + JSON | Pipe + Bitstream | ONNX |
| SPEC06 | 5,829 | 1,318 | 1,236 | 1,227 | 1,140 |
| SPEC17 | 10,342 | 1,221 | 1,141 | 1,132 | 1,093 |

In-process model runners natively support multithreaded compilation, while inter-process model runners necessitate concurrently running multiple instances of the model, resulting in a high memory and compute overhead. Tab. 3 shows compile times with in-process model runners on the LLVM Inliner and RL4ReAl optimizations by varying the degree of parallelism. As the LLVM Inliner and RL4ReAl respectively rely on TensorFlow and PyTorch (with RLlib), we use the TensorFlow and ONNX model runners accordingly. In comparison to the original gRPC-based inference flow of RL4ReAl, the ONNX runner reduces compile time by 22.4× and 19× using 8 threads and 1 thread respectively. Using RL4ReAl results in a higher compile time, as it involves a larger number of model-compiler interactions. This overhead is effectively reduced by using the model runners of ML-Compiler-Bridge. Similar trends are observed for RL-driven loop distribution [25] on TSVC [8] and the LLVM Test Suite [35]: the ONNX model runner yields an improvement of 16× in comparison to the original Python wrapper.

Table 3. Multithreaded compile time with -O3 (in s) with in-process model runners. Compile time with gRPC is shown for RL4ReAl for comparison.
| | gRPC | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
| LLVM Inliner (TF Runner) | - | 596 | 501 | 361 | 307 |
| RL4ReAl (ONNX Runner) | 5,572 | 291 | 257 | 248 | 248 |

5.2 Impact on Training

In this section, we evaluate the effectiveness of ML-Compiler-Bridge during the training of POSET-RL and RL4ReAl. We use inter-process model runners for training.

5.2.1 Training Time. Fig. 4(a) shows the cumulative training time and number of training iterations observed in POSET-RL. We obtain large improvements in the training time across all the model runners. We see similar trends with gRPC and Pipe, as explained in the previous experiment. The original training process of POSET-RL involves spawning processes and takes ≈10Ks to complete 500 iterations. In comparison, the gRPC model takes about 5.7Ks, while the pipes with JSON and bitstream serialization options take about 5.5Ks each. Throughout the iterations, we observe an overhead of about 20s between the JSON and bitstream serialization options. This minimal overhead is associated with the additional serialization effort involved while using the JSON SerDes. However, using the inter-process model runners enables an end-to-end integration of the model and the compiler while training, and yields a significant improvement.

5.2.2 Multi-Worker Support. ML-Compiler-Bridge supports multi-worker training on both CPUs and GPUs. To support multiple workers while using gRPC, we expose a method taking an array of ports to establish connections with each worker. Similarly, multi-worker support with pipes is enabled by instantiating one pair of pipes per worker. We extended RL4ReAl to handle multi-worker scenarios; training times are shown in Fig. 4(b) for CPU and GPU workers. Using 10 workers with a GPU trainer takes about 2 seconds per episode, while a CPU trainer with <10, 5, 1> workers takes <4s, 8s, 15s> respectively. We obtained similar trends among the workers even upon using pipes for communication.

5.2.3 Using Different RL Policies. One may train and deploy models with different RL policies without impacting the compiler. For this experiment, we evaluate RL4ReAl with the different RL policies provided by RLlib. We perform hyperparameter tuning using Tune [33]. We trained the models with PPO [42], APPO [42], and A2C [37] policies until convergence. On the SPEC CPU 2017 benchmarks, this resulted in a 2% improvement on average using the APPO policy. PPO and A2C perform similarly to the original paper.

5.3 Round-Trip Time

Let us finally isolate the Round-Trip Time (RTT) of each model runner as a limit study of the achievable communication throughput. We consider random floating point vectors of increasing length ranging from 500 to 50K elements in steps of 500. The model itself is a single fully-connected layer that consumes the vector and returns a scalar float. Fig. 4(c) shows the RTT of the whole process. The TF and ONNX runners achieve very high throughput with total RTTs of 21ms and 68ms respectively, while Pipes+JSON and Pipes+Bitstream yield 3154ms and 772ms respectively, and gRPC yields a larger RTT of 5948ms. These differences can be attributed to serialization and communication overhead. The TF and ONNX runners benefit from in-process communication, proving to be suitable candidates for deployment. The higher throughput of TF is due to the AOT precompiled model. The Pipe runner proves to be a good candidate for training on local machines. And the gRPC runner provides support for training in a distributed environment. This makes all the model runners important in their own way.
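The micro-benchmark loop itself is straightforward; a sketch under the assumption that the runner accepts a vector-valued feature (the feature name and reporting are placeholders) could look like this:

    #include <chrono>
    #include <random>
    #include <string>
    #include <utility>
    #include <vector>

    // Time one populateFeatures()/evaluate() round trip for growing input vectors,
    // mirroring the RTT study described above.
    void measureRTT(MLModelRunner &Runner) {
      std::mt19937 Gen(42);
      std::uniform_real_distribution<float> Dist(0.0f, 1.0f);
      for (size_t Len = 500; Len <= 50000; Len += 500) {
        std::vector<float> V(Len);
        for (float &X : V)
          X = Dist(Gen);
        std::pair<std::string, std::vector<float>> Feature{"input", V};

        auto Start = std::chrono::steady_clock::now();
        Runner.populateFeatures(Feature);     // serialize and send the vector
        float Out = Runner.evaluate<float>(); // single fully-connected layer's output
        auto RTT = std::chrono::steady_clock::now() - Start;
        (void)Out;
        (void)RTT;                            // record/plot elsewhere
      }
    }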
5.4 Gym Integration

We carried out additional experiments to evaluate the benefits of our library in the context of a state-of-the-art RL [...]
Figure 4. Performance characterization of model runners on different compilers and optimizations. (a) Training times of POSET-RL with different model runners; (b) training times of RL4ReAl with CPU/GPU multi-workers; (c) microbenchmarking of individual model runners; (d) MLIR performance; (e) Pluto performance.

[...]
Table 5. Comparison of the time taken to build clang and the final binary size with/without ML-Compiler-Bridge.
| Characteristics | Native Clang | Clang with ML-Compiler-Bridge |
| Compilation Time | 5m 7s | 5m 15s |
| Binary Size | 102.79 MB | 102.87 MB |
| Average RSS | 1.5538 GB | 1.5542 GB |

Neither of the inter-process model runners offers multi-threaded compilation upon running a single model instance. It could be done by instantiating multiple model instances, but this would consume unreasonable amounts of memory. The in-process model runners, however, do not face this problem. Though there is a separate serialization overhead involved with the gRPC and pipe model runners, it is handled automatically without the involvement of the developer. Due to the nature of inter-process communication, there is a possibility of encountering communication errors arising from network and compiler crashes. We handle such cases as explained in Sec. 3.5. We summarize these characteristics in Tab. 6.

Table 6. Characteristics of different model runners.
| Characteristics | gRPC | Pipes | ONNX | TF |
| Multithreaded Compilation | ✗ | ✗ | ✓ | ✓ |
| Distributed Training | ✓ | ✗ | - | - |
| Need for separate model process | ✓ | ✓ | ✗ | ✗ |
| Autoserialization | ✓ | ✓ | - | - |
| Communication Fidelity | ✗ | ✗ | ✓ | ✓ |
| ML Framework agnostic | ✓ | ✓ | ✓ | ✗ |
| Additional code by compiler writer | Y | N | Y | N |
| Serialization Requirement | Y | Y | - | - |
| Time overhead | Y | Y | N | N |

6.4 Limitations

As mentioned earlier, not all model runners are compatible with all ML models due to the nature of the underlying libraries. For instance, TensorFlow AOT compilation supports any TensorFlow or JAX model, but not PyTorch. Also, upon exporting the inliner model from TensorFlow to ONNX, we encountered an operator (TFL-Bucketize¹) that is not supported by ONNX. To handle such cases, the ONNX runtime allows registering custom operators. Once exported, the models can be used seamlessly without restriction. Similarly, protobuf does not natively support a C runtime; hence, our C APIs do not support using the gRPC model runner with protobuf serialization. The current TF AOT compilation generates C++ code, thereby making it not directly usable from C. This issue can be mitigated by using the TF C APIs instead of AOT models.

¹ https://www.tensorflow.org/mlir/tfl_ops#tflbucketize_tflbucketizeop

7 Related Work

RL environments for compilers come closest to our work, such as CompilerGym [15], PolyGym [9], and Supersonic [51]. These primarily aim at facilitating research and reproducibility, which are only two of the broader ambitions of our research (e.g., deployment, a programmable compiler interface, finer-grained interaction). CompilerGym internally calls the compiler APIs from a C++ wrapper, and the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits the functionality to only the APIs supported by the library and a particular compiler version with which the library is compatible. Supersonic [51] also uses the CompilerGym way of interfacing via gRPC. And, to our understanding, PolyGym [9] does not provide a programmable compiler interface.

The gym libraries and ML-Compiler-Bridge solve different problems; the former facilitate research and training, while our library aims to facilitate different interfaces for communication. We envision ML-Compiler-Bridge to supplement these gym environments by providing a variety of options for more diverse, finer-grained, and framework-independent interfacing of ML models with compilers, facilitating the transition from research to production.

8 Conclusions

We present ML-Compiler-Bridge, a modular and extensible library to integrate ML models within compiler optimizations. It provides inter- and in-process model runners with different serialization options to support both training and deployment scenarios. We show that a model and a compiler pass can be integrated with only 3 lines of code, while also enabling very deep interleaving of RL-based algorithms like RL4ReAl, as well as leaner and production-friendly optimizations like function inlining. Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively. We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, and multiple compilers (and versions) written in different languages to show the versatility and suitability of ML-Compiler-Bridge in research and production environments. Source code along with documentation and the related artifacts is available at https://compilers.cse.iith.ac.in/research/mlcompilerbridge [49].

Acknowledgments

We are grateful to Govindarajan Ramaswamy and Dibyendu Das for valuable discussions and feedback. We would like to thank Nilesh Shah, Soumya Banerjee, and Vikas Patnala for their help in various experiments. We also thank Raj Vilas Ambekar and Neha Bhargava for helping in validating the artifacts. This work is partially funded by PhD fellowships from Google and PMRF, a research grant from Suzuki Motor Corporation, and a faculty grant from AMD.
References

[...] https://doi.org/10.1109/ISPASS55109.2022.00012

[25] Shalini Jain, S. VenkataKeerthy, Rohit Aggarwal, Tharun Kumar Dangeti, Dibyendu Das, and Ramakrishna Upadrasta. 2022. Reinforcement Learning assisted Loop Distribution for Locality and Vectorization. In 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC). 1–12. https://doi.org/10.1109/LLVMHPC56686.2022.00006

[26] Minsu Kim, Jeong-Keun Park, and Soo-Mook Moon. 2022. Solving PBQP-Based Register Allocation Using Deep Reinforcement Learning. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization (Virtual Event, Republic of Korea) (CGO '22). IEEE Press, 230–241. https://doi.org/10.1109/CGO53902.2022.9741272

[27] Sameer Kulkarni, John Cavazos, Christian Wimmer, and Douglas Simon. 2013. Automatic construction of inlining heuristics using machine learning. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 1–12. https://doi.org/10.1109/CGO.2013.6495004

[28] P. J. Landin. 1966. The next 700 Programming Languages. Commun. ACM 9, 3 (mar 1966), 157–166. https://doi.org/10.1145/365230.365257

[29] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (Palo Alto, California) (CGO '04). IEEE Computer Society, USA, 75.

[30] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 2–14. https://doi.org/10.1109/CGO51591.2021.9370308

[31] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated Graph Sequence Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1511.05493

[32] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 3053–3062. https://proceedings.mlr.press/v80/liang18b.html

[33] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. 2018. Tune: A Research Platform for Distributed Model Selection and Training. arXiv:1807.05118 [cs.LG]

[34] ONNX (Linux Foundation). 2017. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx. [Online; accessed 11-Mar-2023].

[35] LLVM-Org. [n. d.]. LLVM Test Suite. https://github.com/llvm/llvm-test-suite. Accessed 2021-08-25.

[36] Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, and Michael Carbin. 2019. Compiler auto-vectorization with imitation learning. Curran Associates Inc., Red Hook, NY, USA. https://doi.org/10.5555/3454287.3455597

[37] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA, 1928–1937. https://proceedings.mlr.press/v48/mniha16.html

[38] William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. 2021. Polygeist: Raising C to Polyhedral MLIR. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). 45–59. https://doi.org/10.1109/PACT52795.2021.00011

[39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024–8035. https://doi.org/10.5555/3454287.3455008

[40] Louis-Noel Pouchet, Cédric Bastoul, and Uday Bondhugula. 2010. PoCC: the polyhedral compiler collection. https://web.cs.ucla.edu/~pouchet/software/pocc. Accessed 2023-10-25.

[41] Protobuf [n. d.]. Protocol Buffers. https://developers.google.com/protocol-buffers. [Online; accessed 29-Aug-2022].

[42] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG]

[43] M. Stephenson and S. Amarasinghe. 2005. Predicting unroll factors using supervised classification. In International Symposium on Code Generation and Optimization. 123–134. https://doi.org/10.1109/CGO.2005.29

[44] The IREE Team. 2023. https://github.com/openxla/iree. Accessed 2023-11-13.

[45] The Triton Team. 2023. https://github.com/openai/triton. Accessed 2023-11-13.

[46] Mircea Trofin, Yundi Qian, Eugene Brevdo, and David Li. 2021. RFC: MLGO Regalloc: learned eviction policy for regalloc. https://lists.llvm.org/pipermail/llvm-dev/2021-November/153639.html. [Online; accessed 08-May-2022].

[47] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. 2021. MLGO: a Machine Learning Guided Compiler Optimizations Framework. CoRR abs/2101.04808 (2021). arXiv:2101.04808 https://arxiv.org/abs/2101.04808

[48] S. VenkataKeerthy, R. Aggarwal, S. Jain, M. S. Desarkar, R. Upadrasta, and Y. N. Srikant. 2020. IR2Vec: LLVM IR Based Scalable Program Embeddings. ACM Trans. Archit. Code Optim. 17, 4, Article 32 (Dec. 2020), 27 pages. https://doi.org/10.1145/3418463

[49] S. VenkataKeerthy and Siddharth Jain. 2024. ML-Compiler-Bridge: The Next 700 ML-Enabled Compiler Optimizations. https://doi.org/10.5281/zenodo.10574579

[50] S. VenkataKeerthy, Siddharth Jain, Anilava Kundu, Rohit Aggarwal, Albert Cohen, and Ramakrishna Upadrasta. 2023. RL4ReAl: Reinforcement Learning for Register Allocation. In CC 2023 (Montréal, QC, Canada). Association for Computing Machinery, New York, NY, USA, 133–144. https://doi.org/10.1145/3578360.3580273

[51] Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, and Zheng Wang. 2022. Automating Reinforcement Learning Architecture Design for Code Optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (Seoul, South Korea) (CC 2022). Association for Computing Machinery, New York, NY, USA, 129–143. https://doi.org/10.1145/3497776.3517769

[52] X. Wang, H. Zhao, and J. Zhu. 1993. GRPC: A Communication Cooperation Mechanism in Distributed Systems. SIGOPS Oper. Syst. Rev. 27, 3 (jul 1993), 75–86. https://doi.org/10.1145/155870.155881

[53] Zheng Wang and Michael O'Boyle. 2018. Machine Learning in Compiler Optimization. Proc. IEEE 106, 11 (2018), 1879–1901. https://doi.org/10.1109/JPROC.2018.2817118
Received 13-NOV-2023; accepted 2023-12-23