
Compiling ONNX Neural Network Models Using MLIR

Tung D. Le1,a)  Gheorghe-Teodor Bercea2  Tong Chen2  Alexandre E. Eichenberger2  Haruki Imai1  Tian Jin2  Kiyokuni Kawachiya1  Yasushi Negishi1  Kevin O'Brien2

arXiv:2008.08272v1 [cs.PL] 19 Aug 2020

Abstract: Deep neural network models are becoming popular and have been used in various tasks such as computer vision, speech recognition, and natural language processing. It is often the case that the training phase of a model is executed in one environment, while the inference phase is executed in another environment. This is because the optimization characteristics of each phase differ significantly. Therefore, it is critical to efficiently compile a trained model for inferencing on different environments. To represent neural network models, users often use Open Neural Network Exchange (ONNX), which is an open standard format for machine learning interoperability. We are developing a compiler for rewriting a model in ONNX into a standalone binary that is executable on different target hardware such as x86 machines, IBM Power Systems, and IBM System Z. The compiler was written using Multi-level Intermediate Representation (MLIR), a modern compiler infrastructure. In particular, we introduce two internal representations: ONNX IR for representing ONNX operators, and Kernel IR for efficiently lowering ONNX operators into LLVM bitcode. In this paper, we discuss the overall structure of our compiler and give some practical examples of converting ONNX operators and models. We also cover several issues related to endianness. Our framework is publicly available as an open source project under the ONNX project.

1. Introduction

Deep neural network models have been used widely for various tasks such as computer vision, speech recognition, and natural language processing. The success of such models originated mainly from the development of accelerators, especially GPU accelerators, back in 2012 [3]. Since then, many deep learning frameworks, such as Torch, Caffe, Theano, and TensorFlow, have been developed to facilitate the training and inferencing of deep neural network models, which significantly speeds up the explosion of deep learning in many areas. However, training and inferencing are often done in different environments due to their different optimization characteristics. For example, a model is trained using a large-scale distributed system, since training might need weeks or months to finish, and can then be used on light-weight devices such as Internet of Things devices or mobile phones for inferencing. Hence, it is desirable to dynamically rewrite a trained model so that it runs efficiently on a target environment.

Many deep learning frameworks utilize a highly optimized library written for a target accelerator. Rewriting a model for inferencing then consists of replacing the operations in the model with function calls in the library. While such a library-call approach simplifies the rewriting procedure and can lead to improved performance, it has the following drawbacks. Firstly, the number of models that can be rewritten is limited by the functions provided in the library. Secondly, users often need to install additional packages to make the library work well. Thirdly, it lacks the ability to tailor code to different problems, since the same function may be used for all of them.

We tackle these drawbacks by developing a compiler that rewrites a trained model into native code for a target hardware. It uses many mature optimization techniques developed during the long history of compilers, such as the ability to tailor code for a specific problem, memory optimizations, and parallelization. Our compiler is completely based on open-source software. In particular, we chose Open Neural Network Exchange (ONNX) [1] as the format for representing the input model of our compiler. ONNX is an open, machine-independent format that is widely used for exchanging neural network models, and it is actively maintained by and contributed to by open source communities. Our compiler was written using Multi-level Intermediate Representation (MLIR) [5], a modern open source compiler infrastructure for multi-level intermediate representations, which is developed by Google and is currently a subproject inside LLVM [4].

1 IBM Research - Tokyo, 19-21 Nihonbashi Hakozaki-cho, Chuo-ku, Tokyo 103-8510, Japan
2 IBM T.J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, USA
a) [email protected]
Authors are listed in alphabetical order of their first names, except the first author.

Our compiler is completely open-sourced and is a subproject inside the ONNX project*1. Although it is still under development, it can already compile some popular models, such as MNIST and ResNet50, to native code on x86 machines, IBM Power Systems*2, and IBM System Z*3. In this paper, we introduce our compiler by

• presenting the overall design and architecture of the compiler,

• introducing two new IRs: ONNX IR for representing ONNX operators, and Kernel IR for efficiently lowering ONNX operators into LLVM bitcode,

• introducing optimization passes such as graph rewriting, constant propagation, and memory management, and

• discussing some problems we encountered when emitting native code for different architectures.

The remainder of the paper is organized as follows. In Sec. 2, we briefly discuss ONNX and MLIR, on which our compiler is based. In Sec. 3, we introduce our compiler, its design principles, and its architecture. We also discuss in this section two new IRs, ONNX IR and Kernel IR, and some optimization passes. In Sec. 4, we present some preliminary experimental results for the MNIST and ResNet50 models on IBM Power Systems. Finally, we conclude the paper and discuss future work in Sec. 5.

2. Background

2.1 ONNX

Open Neural Network Exchange (ONNX) [1] is an open source format for artificial intelligence models, including both deep learning and traditional machine learning. It defines an extensible computational graph model, operators, and standard data types, which provides a common IR for different frameworks. There are two ONNX variants: the neural-network-only ONNX variant recognizes only tensors as input and output types, while the classic machine learning ONNX-ML also recognizes sequences and maps. ONNX-ML extends the ONNX operator set with machine learning algorithms that are not based on neural networks. In this paper, we focus on the neural-network-only ONNX variant and refer to it as just ONNX. Support for ONNX-ML is under development in our compiler, and thus we do not discuss it in this paper.

In ONNX, the top-level structure is a 'Model', which associates metadata with a graph. Operators in ONNX are divided into a set of primitive operators and functions, where a function is an operator whose calculation can be expressed via a subgraph of other operators. A graph is used to describe a function. There are lists of nodes, inputs, outputs, and initializers (constant values or default values for inputs) in a graph. An acyclic dataflow graph is constructed as a topological sort of the list of nodes in the graph. Each node in a graph contains the name of the operator it invokes, inputs, outputs, and attributes associated with the operator. Inputs and outputs can be marked as variadic or optional. There are three data types used to define inputs and outputs, i.e., 'Tensor', 'Sequence', and 'Map'.

ONNX uses the Protocol Buffers*4 definition language for its syntax. Listing 1 shows an example of an ONNX model for the LeakyRelu operator. There is one node in the graph (Lines 4–13), which is associated with LeakyRelu and has one input, one output, and one attribute. The input and output tensors have the shape ⟨3x4x5⟩ and element type float32 (elem_type: 1 at Lines 19 and 38).

*1 https://github.com/onnx/onnx-mlir
*2 https://www.ibm.com/it-infrastructure/power/power9
*3 https://www.ibm.com/it-infrastructure/z/hardware
*4 https://developers.google.com/protocol-buffers

Listing 1: ONNX model for the LeakyRelu operator (printed using the 'protoc' command).
 1 ir_version: 3
 2 producer_name: "backend-test"
 3 graph {
 4   node {
 5     input: "x"
 6     output: "y"
 7     op_type: "LeakyRelu"
 8     attribute {
 9       name: "alpha"
10       f: 0.1
11       type: FLOAT
12     }
13   }
14   name: "test_leakyrelu"
15   input {
16     name: "x"
17     type {
18       tensor_type {
19         elem_type: 1
20         shape {
21           dim {
22             dim_value: 3
23           }
24           dim {
25             dim_value: 4
26           }
27           dim {
28             dim_value: 5
29           }
30         }
31       }
32     }
33   }
34   output {
35     name: "y"
36     type {
37       tensor_type {
38         elem_type: 1
39         shape {
40           dim {
41             dim_value: 3
42           }
43           dim {
44             dim_value: 4
45           }
46           dim {
47             dim_value: 5
48           }
49         }
50       }
51     }
52   }
53 }
54 opset_import {
55   version: 9
56 }
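As a companion to Listing 1, the following Python sketch builds an equivalent LeakyRelu test model with the onnx.helper API. It is illustrative only and is not part of onnx-mlir; the file name is our own choice, while the opset version, producer name, and attribute value are taken from the listing.

# Illustrative sketch: build an ONNX model equivalent to Listing 1 using the
# onnx Python package (not part of onnx-mlir).
import onnx
from onnx import TensorProto, helper

# One node invoking LeakyRelu with attribute alpha = 0.1.
node = helper.make_node("LeakyRelu", inputs=["x"], outputs=["y"], alpha=0.1)

# Graph input/output: float32 tensors of shape 3x4x5 (elem_type 1 = FLOAT).
x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [3, 4, 5])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [3, 4, 5])

graph = helper.make_graph([node], "test_leakyrelu", [x], [y])
model = helper.make_model(graph, producer_name="backend-test",
                          opset_imports=[helper.make_opsetid("", 9)])
onnx.checker.check_model(model)
onnx.save(model, "test_leakyrelu.onnx")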

2.2 MLIR

Multi-level Intermediate Representation (MLIR) [5] is a modern compiler infrastructure, developed by Google, that is reusable and extensible. It reduces the cost of building domain-specific compilers by facilitating the design and implementation of code generators, translators, and optimizers at different abstraction levels. MLIR is a subproject of the LLVM project [6] and has many similarities to the LLVM compiler infrastructure [4]. In this section, we briefly review some of the features in MLIR that were used to build our compiler. For more information about MLIR, one can refer to a previous study [5]. Readers who are familiar with MLIR can skip this section.

Similar to LLVM, MLIR is a three-address static single assignment (SSA)-based IR, where values are defined before use and have a scope defined by their dominance relations. Operations may produce zero or more results, and each result is a distinct SSA value with its own type defined by the type system. The type system in MLIR is open, and one can define application-specific types. There are a number of primitive types, e.g., integers, as well as aggregate types for tensors and memory buffers, e.g., 'Tensor' and 'MemRef' types. A Tensor type is abstract and does not have a pointer to the data, while a MemRef type is a lower-level representation referring to a region of memory. In MLIR, Tensor and MemRef types are syntactically represented as tensor⟨D1×D2×...×DN×dtype⟩ and memref⟨D1×D2×...×DN×dtype⟩, respectively, where D1, D2, ..., DN are integers representing the dimensions of a tensor or memref, and dtype is the type of the elements in a tensor or memref, e.g., f32 for float32. ⟨D1×D2×...×DN⟩ is called the shape of a tensor or memref. Tensor and MemRef types can be unranked when their shapes are unknown. In MLIR, unranked Tensor and MemRef types are syntactically represented as tensor⟨*×dtype⟩ and memref⟨*×dtype⟩, respectively.

An operation is the unit of code in MLIR. To define an operation, a TableGen-based [7] specification for an operation descriptor is used. Figure 1 shows the structure of an operation. An operation has a list of SSA operands and may have attributes that store static information. An operation can hold a region, which is a list of blocks. A block contains a list of operations and ends with a terminator operation that may have successor blocks to which control flow may be transferred. In other words, nested regions are a first-class concept in MLIR, which makes it efficient to represent control flow graphs. A function is an operation with a single region and attributes. A module is an operation with a single region containing a single block and terminated by a dummy operation.

Fig. 1: Operations and Regions in MLIR.

To develop a compiler using MLIR, users often need to define dialects and optimization passes. A dialect serves as an abstraction level or intermediate representation, and an optimization pass enables optimization at an abstraction level or transformation among abstraction levels.

There are dialects in MLIR that are ready to use, e.g., 'llvm', 'std', 'scf', and 'affine'. The 'llvm' dialect is a low-level dialect. It wraps the LLVM IR types and instructions into MLIR types and operations. The 'std' dialect includes standard operations such as load, store, addi, addf, absf, and call. The 'scf' dialect defines control flow operations such as for and if. The 'affine' dialect provides an abstraction for affine operations and analyses.

Optimization passes can be roughly classified into three categories: general transformation, conversion, and dialect-specific. General transformation passes include common passes such as the 'canonicalize' pass for operation canonicalization, the 'cse' pass for eliminating common sub-expressions, and passes that print IR information, such as 'print-op-graph', 'print-op-stats', and 'print-cfg-graph'. Conversion passes convert operations in one dialect to operations in another dialect, e.g., the 'convert-std-to-llvm' pass converts standard operations into LLVM instructions. Finally, dialect-specific passes perform transformations within a dialect, e.g., the 'affine-loop-unroll-jam' pass unrolls and jams affine loops in the 'affine' dialect. MLIR passes can be expressed via declarative rewriting rules (DRRs) using tablegen records or via writing code in C++.

To denote an operation in a dialect, we explicitly use the form dialect_name.operation_name. For example, std.load means the operation load of dialect 'std'. Optimization passes are named with the prefix '--'; for example, --canonicalize is the canonicalization pass.

Listing 2 shows an example of calculating the exponential of a given input tensor, element-wise, using the 'std' and 'affine' dialects. The top level is a module containing a function 'exp'. The function 'exp' accepts one input of memref type and produces an output of the same type. The memory for the output is allocated via std.alloc (Line 3). There is a nested loop (Lines 4–10), iterating over the dimensions of the input using affine.for, loading each element from the input using affine.load (Line 6), computing the exponential using std.exp (Line 7), and storing the result in the output using affine.store (Line 8). The output of the function is finally returned using std.return.

Listing 2: Compute the exponential of a tensor in MLIR.
 1 module {
 2   func @exp(%arg0: memref<3x4xf32>) -> memref<3x4xf32> {
 3     %1 = std.alloc() : memref<3x4xf32>
 4     affine.for %arg1 = 0 to 3 {
 5       affine.for %arg2 = 0 to 4 {
 6         %2 = affine.load %arg0[%arg1, %arg2] : memref<3x4xf32>
 7         %3 = std.exp %2 : f32
 8         affine.store %3, %1[%arg1, %arg2] : memref<3x4xf32>
 9       }
10     }
11     std.return %1 : memref<3x4xf32>
12   }
13 }
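For readers less familiar with MLIR, the following Python sketch performs the same element-wise computation as Listing 2: the nested Python loops mirror the two affine.for loops, and the reads and writes mirror affine.load and affine.store. It only illustrates the semantics and is not code produced by MLIR.

# Python/NumPy illustration of the semantics of Listing 2 (not generated by MLIR).
import math
import numpy as np

def exp_kernel(arg0: np.ndarray) -> np.ndarray:   # arg0 : memref<3x4xf32>
    out = np.empty((3, 4), dtype=np.float32)      # std.alloc
    for i in range(3):                            # outer affine.for
        for j in range(4):                        # inner affine.for
            v = arg0[i, j]                        # affine.load
            out[i, j] = math.exp(v)               # std.exp + affine.store
    return out                                    # std.return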

3. Compiling ONNX Models

This section introduces our compiler, onnx-mlir. We first discuss its overall architecture. We then introduce two new IRs, ONNX IR and Kernel IR. Finally, we present MLIR passes for carrying out optimization.

3.1 Overview

Figure 2 shows the overall architecture of onnx-mlir. The input is an ONNX model, and the output is a library containing the compiled code. The output library contains an entry function called '_dyn_entry_point_main_graph' whose inputs and outputs are similar to the ONNX model's inputs and outputs, respectively. To carry out inference with the output library, users write a program that calls the entry function, passing inputs to the function and obtaining the results.

Fig. 2: Architecture of onnx-mlir. Names prefixed with '--' are passes.

There are four IRs in onnx-mlir, i.e., ONNX IR, Kernel IR, AffineStd IR, and LLVM IR. ONNX IR and Kernel IR are new IRs and are discussed in Sections 3.2 and 3.3, respectively. ONNX IR is the first abstraction level in onnx-mlir and is a high-level representation of ONNX operations. It consists of operations in the ONNX and Standard dialects, where the ONNX dialect is automatically generated via an importer that is a python script. Kernel IR provides a representation that is suitable for polyhedral optimizations, which helps carry out affine transformations such as tile, skew, and permutation easily. It serves as an intermediate abstraction for efficiently lowering ONNX IR into low-level IRs (e.g., AffineStd IR and LLVM IR). AffineStd IR is not a new IR; we use the term to refer to the abstraction level where the 'affine' and 'std' dialects are used. LLVM IR is the lowest abstraction level in onnx-mlir, and programs represented at this level are quite similar to an LLVM program.

There are MLIR passes for converting one IR to another and for doing optimizations at a specific IR. ONNX IR is converted to Kernel IR via the pass --convert-onnx-to-kernel. Then Kernel IR (except some of its operations) is converted into AffineStd IR via the pass --convert-kernel-to-affine. The remaining operations in Kernel IR and the operations in AffineStd IR are directly converted into instructions in LLVM IR via the pass --convert-kernel-to-llvm. The right side of Fig. 2 shows optimization passes that can be carried out at each abstraction level. We only enumerate the important optimizations here; the list of optimization passes is not exhaustive.

Before discussing the IRs and optimization passes in detail, we give a brief running example and go through the IRs in onnx-mlir. This example is a testcase model in ONNX that performs element-wise binary addition. Figure 3 shows the ONNX model of the testcase. Operation add accepts two tensors of type ⟨3x4x5xf32⟩ (element type is float32) and returns a result tensor, i.e., sum, of the same type. Listings 3, 4, and 5 show the ONNX IR, Kernel IR, and AffineStd IR of the operation, respectively. We omit the LLVM IR of the operation due to space limitations.

Fig. 3: ONNX model for element-wise addition.

Listing 3: ONNX IR for operation add, generated using the importer.
module {
  func @main_graph(%arg0: tensor<3x4x5xf32>, %arg1: tensor<3x4x5xf32>) -> tensor<*xf32> {
    %0 = "onnx.add"(%arg0, %arg1) : (tensor<3x4x5xf32>, tensor<3x4x5xf32>) -> tensor<*xf32>
    std.return %0 : tensor<*xf32>
  }
  "onnx.EntryPoint"() {func = @main_graph, numInputs = 2 : i32, numOutputs = 1 : i32} : () -> ()
}

Listing 4: Kernel IR for operation add, generated by applying the passes --shape-inference and --convert-onnx-to-kernel.
module {
  func @main_graph(%arg0: memref<3x4x5xf32>, %arg1: memref<3x4x5xf32>) -> memref<3x4x5xf32> {
    %0 = alloc() : memref<3x4x5xf32>
    %1:3 = krnl.define_loops 3
    krnl.iterate(%1#0, %1#1, %1#2) with (%1#0 -> %arg2 = 0 to 3, %1#1 -> %arg3 = 0 to 4, %1#2 -> %arg4 = 0 to 5) {
      %2 = affine.load %arg0[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
      %3 = affine.load %arg1[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
      %4 = std.addf %2, %3 : f32
      affine.store %4, %0[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
    }
    std.return %0 : memref<3x4x5xf32>
  }
  "krnl.entry_point"() {func = @main_graph, numInputs = 2 : i32, numOutputs = 1 : i32} : () -> ()
}

Listing 5: AffineStd IR for operation add, generated by applying the pass --convert-kernel-to-affine.
module {
  func @main_graph(%arg0: memref<3x4x5xf32>, %arg1: memref<3x4x5xf32>) -> memref<3x4x5xf32> {
    %0 = alloc() : memref<3x4x5xf32>
    affine.for %arg2 = 0 to 3 {
      affine.for %arg3 = 0 to 4 {
        affine.for %arg4 = 0 to 5 {
          %1 = affine.load %arg0[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
          %2 = affine.load %arg1[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
          %3 = std.addf %1, %2 : f32
          affine.store %3, %0[%arg2, %arg3, %arg4] : memref<3x4x5xf32>
        }
      }
    }
    std.return %0 : memref<3x4x5xf32>
  }
  "krnl.entry_point"() {func = @main_graph, numInputs = 2 : i32, numOutputs = 1 : i32} : () -> ()
}

At ONNX IR, operations are represented similarly to their descriptions in ONNX. The ONNX model is converted into the function main_graph. To generate an entry point function into which users feed their inputs, we create a helper operation in the ONNX dialect, i.e., onnx.EntryPoint, which keeps metadata in the operation's attributes, such as the function name to call and the number of inputs and outputs.

At Kernel IR, operation onnx.add is translated into a loop-based computation represented by operations in the 'Kernel' dialect, where scalar computation is represented by primitive operations in the 'affine' and 'std' dialects. We can apply polyhedral optimizations, such as tile, skew, or transpose, to the loop-based computation. At this level, we allocate memory for output tensors, and memory management can be performed.

At AffineStd IR, the optimized loop-based computation in the 'Kernel' dialect is translated into affine.for loops. At this level, we still have one Kernel operation, i.e., krnl.entry_point. Such an operation is not related to the main computation and will be directly converted to LLVM IR. Operations in the 'affine' dialect will be converted to operations in the 'std' and 'scf' dialects before being lowered to instructions in the 'llvm' dialect.

3.2 ONNX IR

ONNX IR is the first abstraction level in onnx-mlir and represents an ONNX model in the MLIR language. We wrote a python script to automatically import ONNX operations into tablegen-based operation definitions in MLIR. These imported operations are organized into the 'onnx' dialect. Thanks to tablegen, an operation definition in the 'onnx' dialect is quite similar to the operation description in ONNX, and we are able to represent all necessary information, such as inputs, outputs, attributes, and description, in a single tablegen-based definition in human-readable textual form.
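The following Python sketch illustrates the idea behind such an importer: it reads an operation schema from the onnx package and prints a skeleton of a tablegen record similar to the one shown in Listing 6 below. It is a simplified illustration under our own naming and formatting assumptions, not the actual importer script shipped with onnx-mlir.

# Simplified illustration of generating a tablegen skeleton from an ONNX schema.
# This is not the actual onnx-mlir importer; names and formatting are assumptions.
import onnx.defs

def tablegen_skeleton(op_name: str) -> str:
    schema = onnx.defs.get_schema(op_name)
    inputs = ", ".join(p.name for p in schema.inputs)
    outputs = ", ".join(p.name for p in schema.outputs)
    attrs = ", ".join(sorted(schema.attributes.keys()))
    return (f'def ONNX{schema.name}Op : ONNX_Op<"{schema.name}", [NoSideEffect]> {{\n'
            f'  let summary = "ONNX {schema.name} operation";\n'
            f'  // inputs: {inputs}; outputs: {outputs}; attributes: {attrs}\n'
            f'}}\n')

print(tablegen_skeleton("LeakyRelu"))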

Listing 6: Tablegen-based definition for the LeakyRelu operation.
1 def ONNXLeakyReluOp : ONNX_Op<"LeakyRelu",
2   [NoSideEffect, DeclareOpInterfaceMethods<ShapeInferenceOpInterface>]> {
3   let summary = "ONNX LeakyRelu operation";
4   let description = [{"LeakyRelu takes ..."}];
5   let arguments = (ins AnyTypeOf<[TensorOf<[F16]>, TensorOf<[F32]>, TensorOf<[F64]>]>:$X,
      DefaultValuedAttr<F32Attr, "0.01">:$alpha);
6   let results = (outs AnyTypeOf<[TensorOf<[F16]>, TensorOf<[F32]>, TensorOf<[F64]>]>:$Y);
7   let extraClassDeclaration = [{ ... }];
8 }

We also created a new operation in the 'onnx' dialect, i.e., onnx.EntryPoint, to keep information related to the dynamic list of inputs in an ONNX model. This operation will be lowered to generate the entry function '_dyn_entry_point_main_graph' of the generated library.

Listing 6 shows the tablegen-based definition for the LeakyRelu operation imported via the importer in onnx-mlir. The operation description is represented in the 'description' field (Line 4). Inputs and attributes are represented in the 'arguments' field, while outputs are represented in the 'results' field (Lines 5–6). All inputs and outputs are imported as tensors in MLIR. The importer automatically infers element types for inputs, attributes, and outputs. However, the shape of a tensor is inferred via the --shape-inference pass, which is declared as a trait of the LeakyRelu operation (Line 2). MLIR generates a C++ class definition for an operation from its tablegen-based definition. If users want to add custom declarations to the class, it can be done via the 'extraClassDeclaration' field (Line 7).

3.3 Kernel IR

A computation kernel in a neural network workload has local structural simplicity: loop nests are often simple, e.g., hyper-rectangular, and statements carry quite straightforward arithmetic semantics. Such a characteristic is quite suitable for being represented in a polyhedral model for optimization [8]. Kernel IR aims to host both loop optimization and scalar semantic optimization in a single representation. It is expected to provide interpretability: not only is the polyhedral representation readable, but it also makes program semantics (what to execute) and program schedules (how and when to execute) independent. In other words, our goal is to optimize not only programs but also the composition of individual schedules, which is a feature that is often lacking in other existing systems.

Below is an example that defines a nested loop in Kernel IR:

1 %ii, %jj = krnl.define_loops 2
2 krnl.iterate(%ii, %jj) with (%ii -> %i = 0 to 10, %jj -> %j = 0 to 10) {
3   %foo = std.addi %i, %j : index
4 }

where krnl.define_loops defines two loops, called ii and jj. These loop variables are used to express both program semantics and schedules. Operation krnl.iterate semantically accepts two types of loop variables: variables for the original loops and variables for the scheduled loops. In syntactic-sugar form, we separate the two types of loops by the keyword with, i.e., '('scheduled loops')' with '('original loops')'. Induction variables, e.g., i and j in the above example, are defined using the original loops. If there is no schedule (e.g., block, skew, etc.), the scheduled loops are the same as the original loops.

Now, we insert a schedule for blocking or tiling. Without loss of generality, we define just one loop instead of two.

1 %ii = krnl.define_loops 1
2 %ib, %il = krnl.block %ii 2 : (!krnl.loop) -> (!krnl.loop, !krnl.loop)
3 krnl.iterate(%ib, %il) with (%ii -> %i = 0 to 10) {
4   %foo = std.addi %i, %i : index
5 }

Operation krnl.block (Line 2) takes a loop and an integer as inputs, where the integer is the tile size with which we want to carry out blocking. Its results are two loop variables: one for the outer loop and the other for the inner loop. The two loops are used as the result of scheduling and are passed to krnl.iterate (Line 3). It is worth noting that the original loops and the computation in krnl.iterate remain unchanged while inserting a schedule, which is exactly what we want for separating program semantics and schedules in our Kernel IR.

The --convert-kernel-to-affine pass automatically generates optimized affine.for-based loops as follows.

1 #map0 = affine_map<(d0) -> (d0)>
2 #map1 = affine_map<(d0) -> (d0 + 2)>
3 affine.for %arg0 = 0 to 10 step 2 {
4   affine.for %arg1 = #map0(%arg0) to #map1(%arg0) {
5     %0 = addi %arg1, %arg1 : index
6   }
7 }

The outer affine.for iterates with step 2, i.e., the tile size, and the inner affine.for iterates over the elements in a tile.

Other schedules, such as skew and permutation, are used in a similar manner. All schedules are composable and can be nested.
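As a plain illustration of what the blocking schedule above expresses, the following Python sketch (not onnx-mlir code) enumerates the loop indices first by tile and then within each tile, and checks that the same iterations are visited as in the original loop.

# Illustration only: the iteration order produced by blocking a loop of size 10
# with tile size 2, mirroring the affine.for loops generated above.
def tiled_indices(n: int, tile: int):
    for start in range(0, n, tile):                   # outer loop: steps by the tile size
        for i in range(start, min(start + tile, n)):  # inner loop: elements within a tile
            yield i

assert list(tiled_indices(10, 2)) == list(range(10))  # same indices as the original loop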

3.4 Optimization Passes

In this section, we discuss some of the optimization passes in onnx-mlir. Thanks to the expressive power of MLIR, many optimizations can be expressed easily via Declarative Rewriting Rules (DRRs) using tablegen records or by writing code in C++.

3.4.1 Operation Decomposition

In ONNX, many operations can be expressed using other basic operations. For example, ReduceL1 over a vector x is mathematically calculated by summing up the absolute values of the elements in x. In other words, we have

ReduceL1(x) = ReduceSum(Abs(x))

We only need to lower a subset of the operations in the 'onnx' dialect to the 'kernel' dialect, while the remaining operations in the 'onnx' dialect are decomposed into operations in that subset. Using the DRRs in MLIR, operation decomposition is concisely written as the following pattern:

1 def ReduceL1Pattern : Pat<
2   (ReduceL1Op $x, $axes, $keepdims),
3   (ReduceSumOp (AbsOp $x), $axes, $keepdims)
4 >;

where ReduceL1Op, ReduceSumOp, and AbsOp are programmable forms of the operations onnx.ReduceL1, onnx.ReduceSum, and onnx.Abs, respectively. Variables x, axes, and keepdims hold the input values of operation ReduceL1Op. The pattern 'ReduceL1Pattern' contains a source pattern to match a graph of one operation, ReduceL1Op (Line 2), and a destination pattern to generate a graph of two operations, ReduceSumOp and AbsOp (Line 3). Whenever an operation ReduceL1Op appears in an ONNX model, it will be replaced with a combination of ReduceSumOp and AbsOp.

3.4.2 Shape Inference

The --shape-inference pass attempts to infer shapes for all tensors in a program at ONNX IR. The pass traverses all operations in a program, infers the shapes of tensors with unranked shapes (i.e., tensor⟨*xf32⟩), propagates the ranked shapes to consuming operations, and terminates once all tensors have ranked shapes. For one operation, if its inputs have static shapes, it is likely that the --shape-inference pass will be able to infer static shapes for its outputs. If the inputs have dynamic shapes (e.g., tensor⟨?x?x?xf32⟩), the outputs will also have dynamic shapes, except for some operations whose output tensors' shapes are specified in the operation attributes.

3.4.3 Graph Rewriting

Graph rewriting is a powerful optimization tool. It is intensively applied to neural networks since the calculation in a neural network is expressed via a dataflow graph. In MLIR, graph rewriting rules are conveniently represented using DRRs.

For example, the following rule fuses onnx.add and onnx.MatMul into a single operation, onnx.Gemm, under the condition that the result of MatMulOp is only consumed by AddOp:

def MulAddToGemmPattern : Pat<
  (AddOp (MatMulOp:$res $m1, $m2), $m3),
  (GemmOp $m1, $m2, $m3),
  [(HasOneUse $res)]
>;

Another example removes an IdentityOp operation by passing its input directly to its consuming operations:

def IdentityEliminationPattern : Pat<
  (ONNXIdentityOp $arg),
  (replaceWithValue $arg)
>;

Users can write as many rewriting rules as needed in the same manner.

3.4.4 Constant Propagation

Constant propagation is a well-known optimization in compilers. In onnx-mlir, we created a pass to do this during compilation. There are two key ideas in constant propagation: (1) if all the inputs of an operation are constant, compute its outputs at compile time and remove the operation; (2) if there is a mix of constant and non-constant inputs, normalize the operation. Normalization increases the possibility of constant propagation and strongly depends on the mathematical properties of an operation. Below are some normalization rules in onnx-mlir for the onnx.add operation, whose properties are associative and commutative:

(1) c + x ⇒ x + c
(2) (x + c1) + c2 ⇒ x + (c1 + c2)
(3) (x + c) + y ⇒ (x + y) + c
(4) x + (y + c) ⇒ (x + y) + c
(5) (x + c1) + (y + c2) ⇒ (x + y) + (c1 + c2)

where x and y are non-constant values, and c, c1, and c2 are constant values. Normalization rules are expressed using the DRRs in MLIR.

3.4.5 Memory Management

This pass is under development. The central idea is to create a memory pool to efficiently manage memory usage in a program by statically analyzing memory allocations and deallocations. In the current version of onnx-mlir, the memory pool simply creates a single memory area for the tensors in a model. The mechanism for memory reuse has not yet been implemented.

4. Preliminary Experiments

4.1 ONNX Operation Support and Testcases

ONNX provides a set of test cases for each operation. When we support an operation in onnx-mlir, we enable its ONNX test cases to check whether the operation behaves correctly and produces correct results. At the time of writing this paper, onnx-mlir supports 51 operations out of the 139 operations in ONNX, including important operations such as convolution, pooling, Gemm, and LSTM. These are enough to compile and execute major networks such as MNIST and ResNet50. On the GitHub repository of onnx-mlir, we enable continuous integration on different environments, i.e., Windows, Linux, and Docker environments, and on different systems, i.e., x86 machines, IBM Power Systems, and System Z. All supported operations have passed tests on the above environments.
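As an illustration of how an operation can be checked against a reference, the sketch below compares a NumPy implementation of LeakyRelu semantics with the output a compiled model would produce; the call into the compiled model is left as a hypothetical placeholder because the runtime interface is not described in this paper, and this is not the actual onnx-mlir test harness.

# Illustrative check of LeakyRelu against a NumPy reference.
# run_compiled_model is a hypothetical placeholder, not an onnx-mlir API.
import numpy as np

def leakyrelu_reference(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    # ONNX LeakyRelu semantics: y = x if x >= 0 else alpha * x
    return np.where(x >= 0, x, alpha * x)

x = np.random.randn(3, 4, 5).astype(np.float32)
expected = leakyrelu_reference(x, alpha=0.1)
# actual = run_compiled_model(x)   # hypothetical call into the compiled library
# np.testing.assert_allclose(actual, expected, rtol=1e-5)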

4.2 MNIST and ResNet50

In this section, we present some of our preliminary results for two neural network models in the ONNX Model Zoo: MNIST and ResNet50 [2]. The MNIST*5 and ResNet50*6 models have already been trained in the CNTK and Caffe2 frameworks, respectively. We ran inferences on the test data set given with each model. The experiments were conducted on a machine with 2.3-GHz POWER9 processors. For onnx-mlir, the graph rewriting and canonicalization passes were enabled. Polyhedral optimizations were turned off since they are under development and not yet mature. The memory pool was applied to create a single memory area for all necessary tensors in a model, but there was no mechanism for memory reuse. Under these conditions, the results shown here are not suitable for use as a reference for performance comparison.

Table 1: Run inferencing with MNIST and ResNet50 on a POWER9 machine. Time in seconds.

Model       Compilation time    Inference time
MNIST       0.237               0.001
ResNet50    7.661               7.540

Table 1 shows the running times for the MNIST and ResNet50 models when doing inferencing. For each model, we measured the compilation time for compiling the model to native code and the inference time for running the native code with real inputs. MNIST is a small model with two convolutional operations, one max pooling operation, and a matrix multiplication followed by an element-wise addition. Compiling the MNIST model and carrying out inferencing was rather fast, i.e., it finished in less than one second. In the MNIST model, the graph rewriting rule MulAddToGemmPattern mentioned in Sec. 3.4.3 was applied to fuse the matrix multiplication and element-wise addition into a Gemm operation. ResNet50 is a complex deep model consisting of 50 layers of operations such as convolutions and poolings. The model is about 100 megabytes including learned weights. For ResNet50, the current version of onnx-mlir does not apply any optimization to the model during compilation. However, we believe that the compilation time looks reasonable and the inference time is not too slow. We hope that once we integrate important optimizations, such as polyhedral optimizations, SIMD optimization, and loop fusion, in the near future, the inference time will be significantly reduced.

4.3 Supported Systems

Although onnx-mlir is completely built upon widely-used open source software such as ONNX and MLIR, we found a problem related to supporting different systems. In particular, we could not run ONNX models on Linux on IBM System Z (s390-Linux) because the big-endian format was not well supported in ONNX and MLIR. There are two reasons for this problem. First, a large amount of public input data and models in ONNX are stored in little-endian format. Hence, they must be converted to big-endian format before they are used on a big-endian system. Second, we found that constant values in ONNX models were not correctly loaded in MLIR: big-endian systems were well supported in LLVM, but not in MLIR. We created two patches to solve this problem, one in ONNX*7 and one in MLIR*8, and they are now available in the master branches of ONNX and MLIR. As a result, onnx-mlir now supports Linux on x86 (x86-Linux), Linux on Power Systems (ppc64le-Linux), Linux on IBM Z (s390-Linux), and Windows.

5. Conclusion

We are developing an open source compiler called onnx-mlir for compiling ONNX models into native code. MLIR was used as the infrastructure to build the compiler, and two novel IRs were introduced, i.e., ONNX IR and Kernel IR. We also discussed some optimizations such as graph rewriting and constant propagation. It is worth noting that new optimizations can be easily integrated into onnx-mlir thanks to the MLIR infrastructure. In the future, we will add more optimizations, e.g., polyhedral optimization, loop fusion, and SIMD optimization, and enable code generation for accelerators.

References

[1] Bai, J., Lu, F., Zhang, K. et al.: ONNX: Open Neural Network Exchange, GitHub (online), available from https://github.com/onnx/onnx (accessed 2020-07-01).
[2] He, K., Zhang, X., Ren, S. and Sun, J.: Deep Residual Learning for Image Recognition, CoRR, Vol. abs/1512.03385 (online), available from http://arxiv.org/abs/1512.03385 (2015).
[3] Krizhevsky, A., Sutskever, I. and Hinton, G. E.: ImageNet Classification with Deep Convolutional Neural Networks, International Conference on Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012).
[4] Lattner, C. and Adve, V.: LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation, International Symposium on Code Generation and Optimization (CGO), San Jose, CA, USA, pp. 75–88 (2004).
[5] Lattner, C., Amini, M., Bondhugula, U., Cohen, A., Davis, A., Pienaar, J., Riddle, R., Shpeisman, T., Vasilache, N. and Zinenko, O.: MLIR: A Compiler Infrastructure for the End of Moore's Law, (online), available from http://arxiv.org/abs/2002.11054 (2020).
[6] LLVM: The LLVM Project, LLVM (online), available from https://github.com/llvm/llvm-project (accessed 2020-07-01).
[7] LLVM: TableGen, LLVM (online), available from https://llvm.org/docs/TableGen/ (accessed 2020-07-01).
[8] Pouchet, L.-N., Bastoul, C., Cohen, A. and Cavazos, J.: Iterative optimization in the polyhedral model: Part II, multidimensional time, ACM SIGPLAN Notices, Vol. 43, No. 6, pp. 90–100 (2008).

*5 https://github.com/onnx/models/tree/master/vision/classification/mnist
*6 https://github.com/onnx/models/tree/master/vision/classification/resnet
*7 https://github.com/onnx/onnx/pull/2633
*8 https://reviews.llvm.org/D78076
