
Machine_Learning_in_compiler_optimisation

The paper 'Machine Learning in Compiler Optimization' by Wang and O'Boyle discusses the integration of machine learning techniques into compiler optimization, highlighting their potential to enhance performance through automated decision-making. It provides an overview of key concepts such as feature engineering, model learning, and deployment, while also addressing challenges and future research directions in this evolving field. The authors emphasize the importance of evidence-based practices in compiler development and the role of machine learning in bridging the performance gap in software optimization.


Edinburgh Research Explorer

Machine Learning in Compiler Optimization

Citation for published version:


Wang, Z & O'Boyle, M 2018, 'Machine Learning in Compiler Optimization', Proceedings of the IEEE, vol. 106, no. 11, pp. 1879-1901. https://doi.org/10.1109/JPROC.2018.2817118

Digital Object Identifier (DOI):


10.1109/JPROC.2018.2817118

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Peer reviewed version

Published In:
Proceedings of the IEEE

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)
and / or other copyright owners and it is a condition of accessing these publications that users recognise and
abide by the legal requirements associated with these rights.

Take down policy


The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer
content complies with UK legislation. If you believe that the public display of this file breaches copyright please
contact [email protected] providing details, and we will remove access to the work immediately and
investigate your claim.

Download date: 13. Jan. 2025


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Machine Learning in Compiler Optimization
By Zheng Wang and Michael O'Boyle

ABSTRACT | In the last decade, machine-learning-based compilation has moved from an obscure research niche to a mainstream activity. In this paper, we describe the relationship between machine learning and compiler optimization and introduce the main concepts of features, models, training, and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast moving area of machine-learning-based compilation and a detailed bibliography of its main achievements.

KEYWORDS | Code optimization; compiler; machine learning; program tuning

Manuscript received October 30, 2017; accepted January 23, 2018. (Corresponding author: Michael O'Boyle.)
Z. Wang is with the MetaLab, School of Computing and Communications, Lancaster University, Lancaster LA1 4WA, U.K. (e-mail: [email protected]).
M. O'Boyle is with the School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, U.K. (e-mail: [email protected]).
Digital Object Identifier: 10.1109/JPROC.2018.2817118
0018-9219 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

"Why would anyone want to use machine learning to build a compiler?" It is a view expressed by many colleagues over the last decade. Compilers translate programming languages written by humans into binary executable by computer hardware. It is a serious subject studied since the 1950s [1]–[3] where correctness is critical and caution is a byword. Machine learning, on the other hand, is an area of artificial intelligence (AI) aimed at detecting and predicting patterns. It is a dynamic field looking at subjects as diverse as galaxy classification [4] and predicting elections based on Twitter feeds [5]. When an open-source machine learning compiler was announced by IBM in 2009 [6], some wry Slashdot commentators picked up on the AI aspect, predicting the start of sentient computers, global net, and the war with machines from the Terminator film series.

In fact, as we will see in this paper, compilers and machine learning are a natural fit and have developed into an established research domain.

A. It Is All About Optimization

Compilers have two jobs: translation and optimization. First, they must translate programs into binary correctly. Second, they have to find the most efficient translation possible. There are many different correct translations whose performance varies significantly. The vast majority of research and engineering practices is focused on this second goal of performance, traditionally misnamed optimization. The goal was misnamed because in most cases, until recently, finding an optimal translation was dismissed as too hard and an unrealistic endeavor.¹ Instead, work focused on developing compiler heuristics that transform the code in the hope of improving performance but that could in some instances damage it.

¹ In fact, the term superoptimizer [7] was coined to describe systems that tried to find the optimum.

Machine learning predicts an outcome for a new data point based on prior data. In its simplest guise, it can be considered a form of interpolation. This ability to predict based on prior information can be used to find the data point with the best outcome and is closely tied to the area of optimization. It is at this overlap, between viewing code improvement as an optimization problem and machine learning as a predictor of the optima, that we find machine-learning-based compilation.

Optimization as an area, machine learning based or otherwise, has been studied since the 1800s [8], [9]. An interesting question is therefore why the convergence of these two areas has taken so long. There are two fundamental reasons. First, despite the year-on-year increasing potential performance of hardware, software is increasingly unable to realize it, leading to a software gap. This gap has yawned right open with the advent of multicores (see also Section VI-B). Compiler writers are looking for new ways to bridge this gap.

Second, computer architecture evolves so quickly that it is difficult to keep up. Each generation has new quirks and compiler writers are always trying to play catch-up. Machine learning has the desirable property of being automatic. Rather than relying on expert compiler writers to develop clever heuristics to optimize the code, we can let the machine learn how to optimize a compiler to make the machine run faster, an approach sometimes referred to as autotuning [10]–[13]. Machine learning is, therefore, ideally suited to making any code optimization decision where the performance impact depends on the underlying platform. As described later in this paper, it can be used for topics ranging from selecting the best compiler flags to determining how to map parallelism to processors.

Machine learning is part of a tradition in computer science and compilation of increasing automation. The 1950s to 1970s were spent trying to automate compiler translation, e.g., lex for lexical analysis [14] and yacc for parsing [15]; the last decade by contrast has focused on trying to automate compiler optimization. As we will see, it is not "magic" or a panacea for compiler writers; rather, it is another tool allowing automation of tedious aspects of compilation, providing new opportunities for innovation. It also brings compilation nearer to the standards of evidence-based science. It introduces an experimental methodology where we separate out evaluation from design and consider the robustness of solutions. Machine-learning-based schemes, in general, have the problem of relying on black boxes whose workings we do not understand and hence do not trust. This problem is just as true for machine-learning-based compilers. In this paper, we aim to demystify machine-learning-based compilation and show it is a trustworthy and exciting direction for compiler research.

The remainder of this paper is structured as follows. First, we give an intuitive overview of machine learning in compilers in Section II. Then, we describe how machine learning can be used to search for or to directly predict good compiler optimizations in Section III. This is followed by a comprehensive discussion in Section IV of a wide range of machine learning models that have been employed in prior work. Next, in Section V, we review how previous work chooses quantifiable properties, or features, to represent programs. We discuss the challenges and limitations of applying machine learning to compilation, as well as open research directions, in Section VII before we summarize and conclude in Section VIII.

II. OVERVIEW OF MACHINE LEARNING IN COMPILERS

Given a program, compiler writers would like to know what compiler heuristic or optimization to apply in order to make the code better. Better often means executing faster, but can also mean a smaller code footprint or reduced power. Machine learning can be used to build a model used within the compiler that makes such decisions for any given program.

There are two main stages involved: learning and deployment. The first stage learns the model based on training data, while the second uses the model on new unseen programs. Within the learning stage, we need a way of representing programs in a systematic way. This representation is known as the program features [16].

Fig. 1 gives an intuitive view of how machine learning can be applied to compilers. This process, which includes feature engineering, learning a model, and deployment, is described in the following sections.

Fig. 1. A generic view of supervised machine learning in compilers. (a) Feature engineering. (b) Learning a model. (c) Deployment.

A. Feature Engineering

Before we can learn anything useful about programs, we first need to be able to characterize them. Machine learning relies on a set of quantifiable properties, or features, to characterize the programs [Fig. 1(a)]. There are many different features that can be used. These include the static data structures extracted from the program source code or the compiler intermediate representation (such as the number of instructions or branches), dynamic profiling information (such as performance counter values) obtained through runtime profiling, or a combination of both. Standard machine learning algorithms typically work on fixed length inputs, so the selected properties will be summarized into a fixed length feature vector. Each element of the vector can be an integer, real, or Boolean value. The process of feature selection and tuning is referred to as feature engineering. This process may need to be repeated multiple times to find a set of high-quality features to build an accurate machine learning model. In Section V, we provide a comprehensive review of feature engineering for the topic of program optimization.
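As a minimal illustration of the fixed length feature vector described above, the following Python sketch summarizes a few static and dynamic properties of a program into one numeric vector. The property names (`num_instructions`, `num_branches`, `cache_miss_rate`, `has_loops`) and the sample values are hypothetical and chosen only for illustration; they are not the features used in any of the cited works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProgramProperties:
    """Hypothetical raw measurements for one program (illustrative names)."""
    num_instructions: int    # static: counted on the intermediate representation
    num_branches: int        # static: counted on the intermediate representation
    cache_miss_rate: float   # dynamic: read from a performance counter
    has_loops: bool          # static: Boolean property


def to_feature_vector(p: ProgramProperties) -> List[float]:
    """Summarize the properties into a fixed length numeric vector."""
    # Normalizing by program size keeps small and large programs comparable.
    branch_ratio = p.num_branches / p.num_instructions if p.num_instructions else 0.0
    return [
        float(p.num_instructions),   # integer feature
        branch_ratio,                # real-valued feature
        p.cache_miss_rate,           # real-valued feature
        1.0 if p.has_loops else 0.0, # Boolean feature encoded as 0/1
    ]


small = ProgramProperties(120, 12, 0.02, True)
large = ProgramProperties(50_000, 4_000, 0.15, False)
vectors = [to_feature_vector(small), to_feature_vector(large)]
```

Whatever the size of the program, the vector has the same length, which is what most standard learning algorithms expect as input.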

B. Learning a Model

The second step is to use training data to derive a model using a learning algorithm. This process is depicted in Fig. 1(b). Unlike other applications of machine learning, we typically generate our own training data using existing applications or benchmarks. The compiler developer will select training programs which are typical of the application domain. For each training program, we calculate the feature values, compile the program with different optimization options, and run and time the compiled binaries to discover the best performing option. This process produces, for each training program, a training instance that consists of the feature values and the optimal compiler option for the program.

The compiler developer then feeds these examples to a machine learning algorithm to automatically build a model. The learning algorithm's job is to find from the training examples a correlation between the feature values and the optimal optimization decision. The learned model can then be used to predict, for a new set of features, what the optimal optimization option should be.

Because the performance of the learned model strongly depends on how well the features and training programs are chosen, the processes of feature engineering and training data generation often need to be repeated multiple times.

C. Deployment

In the final step, the learned model is inserted into the compiler to predict the best optimization decisions for new programs. This is demonstrated in Fig. 1(c). To make a prediction, the compiler first extracts the features of the input program, and then feeds the extracted feature values to the learned model to make a prediction.

The advantage of the machine-learning-based approach is that the entire process of building the model can be easily repeated whenever the compiler needs to target a new hardware architecture, operating system, or application domain. The model built is entirely derived from experimental results and is hence evidence based.

D. Example

As an example to illustrate these steps, consider thread coarsening [18] for GPU programs. This code transformation technique works by giving multiple work items (or work elements) to one single thread. It is similar to loop unrolling, but applied across parallel work items rather than across serial loop iterations.

Fig. 2(a) shows a simple OpenCL kernel where a thread operates on a work item of the 1-D input array, in, at a time. The work item to be operated on is specified by the value returned from the OpenCL get_global_id() API. Fig. 2(b) shows the transformed code after applying a thread coarsening factor of two, where each thread processes two elements of the input array.

Fig. 2. An OpenCL thread coarsening example reproduced from [17]. The original OpenCL code is shown in (a) where each thread takes the square of one element of the input array. When coarsened by a factor of two, as shown in (b), each thread now processes two elements of the input array.

Thread coarsening can improve performance through increasing instruction-level parallelism [19], reducing the number of memory-access operations [20], and eliminating redundant computation when the same value is computed in every work item. However, it can also have several negative side effects, such as reducing the total amount of parallelism and increasing the register pressure, which can lead to slowdowns. Determining when and how to apply thread coarsening is nontrivial, because the best coarsening factor depends on the target program and the hardware architecture that the program runs on [17], [19].

Magni et al. show that machine learning techniques can be used to automatically construct effective thread-coarsening heuristics across GPU architectures [17]. Their approach considers six coarsening factors (1, 2, 4, 8, 16, 32). The goal is to develop a machine-learning-based model to decide whether an OpenCL kernel should be coarsened on a specific GPU architecture and, if so, what the best coarsening factor is. Among many machine learning algorithms, they chose to use an artificial neural network to model² the problem. Constructing such a model follows the classical three-step supervised learning process, which is depicted in Fig. 1 and described in more detail as follows.

² In fact, Magni et al. employed a hierarchical approach consisting of multiple artificial neural networks [17]. However, these networks are trained using the same process.

1) Feature Engineering: To describe the input OpenCL kernel, Magni et al. use static code features extracted from the compiler's intermediate representation. Specifically, they developed a compiler-based tool to obtain the feature values from the program's LLVM bitcode [21]. They started from 17 candidate features. These include things like the number and types of instructions and memory level parallelism (MLP) within an OpenCL kernel. Table 1 gives the list of candidate features used in [17]. Typically, candidate features can be chosen based on developers' intuitions, suggestions from prior works, or a combination of both. After choosing the candidate features, a statistical method called principal component analysis (PCA; see also Section IV-B) is applied to map the 17 candidate features into seven aggregated features, so that each aggregated feature is a linear combination of the original features. This technique is known as "feature dimension reduction," which is discussed in Section V-D2. Dimension reduction helps eliminate redundant information among candidate features, allowing the learning algorithm to perform more effectively.

Table 1. Candidate Code Features Used in [17]

2) Learning the Model: For the work presented in [17], 16 OpenCL benchmarks were used to generate training data. To find out which of the six coarsening factors performs best for a given OpenCL kernel on a specific GPU architecture, we can apply each of the six factors to an OpenCL kernel and record its execution time. Since the optimal thread-coarsening factor varies across hardware architectures, this process needs to be repeated for each target architecture. In addition to finding the best performing coarsening factor, Magni et al. also extracted the aggregated feature values for each kernel. Applying these two steps on the training benchmarks results in a training data set where each training example is composed of the optimal coarsening factor and feature values for a training kernel. The training examples are then fed into a learning algorithm which tries to find a set of model parameters (or weights) so that the overall prediction error on the training examples is minimized. The output of the learning algorithm is an artificial neural network model whose weights are determined from the training data.

3) Deployment: The learned model can then be used to predict the optimal coarsening factor for unseen OpenCL programs. To do so, static source code features are first extracted from the target OpenCL kernel; the extracted feature values are then fed into the model, which decides whether to coarsen or not and which coarsening factor should be used. The technique proposed in [17] achieves an average speedup between 1.11x and 1.33x across four GPU architectures and does not degrade performance on any benchmark.

III. METHODOLOGY

One of the key challenges for compilation is to select the right code transformation, or sequence of transformations, for a given program. This requires effectively evaluating the quality of a possible compilation option, e.g., how a code transformation will affect eventual performance.

A naive approach is to exhaustively apply each legal transformation option and then profile the program to collect the relevant performance metric. Given that many compiler problems have a massive number of options, exhaustive search and profiling is infeasible, prohibiting the use of this approach at scale. This search-based approach to compiler optimization is known as iterative compilation [22], [23] or autotuning [10], [24]. Many techniques have been proposed to reduce the cost of searching a large space [25], [26]. In certain cases, the overhead is justifiable if the program in question is to be used many times, e.g., in a deeply embedded device. However, its main limitation remains: it only finds a good optimization for one program and does not generalize into a compiler heuristic.

There are two main approaches for solving the problem of scalably selecting compiler options that work across programs. A high level comparison of both approaches is given in Fig. 3. The first strategy attempts to develop a cost (or priority) function to be used as a proxy to estimate the quality of a potential compiler decision, without relying on extensive profiling. The second strategy is to directly predict the best performing option.

Fig. 3. There are, in general, two approaches to determine the optimal compiler decision using machine learning. The first one is to learn a cost or priority function to be used as a proxy to select the best performing option (a). The second one is to learn a predictive model to directly predict the best option (b).

A. Building a Cost Function

Many compiler heuristics rely on a cost function to estimate the quality of a compiler option. Depending on the optimization goal, the quality metric can be execution time, code size, energy consumption, etc. Using a cost function, a compiler can evaluate a range of possible options to choose the best one, without needing to compile and profile the program with each option.
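The difference between exhaustive profiling and selection through a cost function can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the options are loop unroll factors, the `MEASURED_RUNTIME` table stands in for real compile-and-run measurements, and the cost function `estimated_cost` uses made-up weights and feature names (`loop_body_size`, `trip_count`); none of these come from the cited works.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical measured runtimes (seconds) per unroll factor. In practice each
# value would come from compiling and running the program with that option.
MEASURED_RUNTIME: Dict[int, float] = {1: 10.0, 2: 7.5, 4: 6.0, 8: 6.8}


def exhaustive_search(options: List[int],
                      profile: Callable[[int], float]) -> Tuple[int, int]:
    """Profile every option; return the fastest one and the number of runs."""
    timings = {opt: profile(opt) for opt in options}
    best = min(timings, key=lambda opt: timings[opt])
    return best, len(timings)


def estimated_cost(features: Dict[str, float], unroll: int) -> float:
    """Toy cost function; the weights 0.05 and 0.01 are made up."""
    code_growth = 0.05 * features["loop_body_size"] * unroll
    loop_overhead = 0.01 * features["trip_count"] / unroll
    return code_growth + loop_overhead


def select_with_cost_function(options: List[int],
                              features: Dict[str, float]) -> int:
    """Rank options by estimated cost only; nothing is compiled or run."""
    return min(options, key=lambda opt: estimated_cost(features, opt))


options = [1, 2, 4, 8]
features = {"loop_body_size": 20.0, "trip_count": 1000.0}

best_profiled, runs = exhaustive_search(options, MEASURED_RUNTIME.__getitem__)
best_estimated = select_with_cost_function(options, features)
print(best_profiled, runs, best_estimated)
```

The exhaustive search needs one profiling run per option, whereas the cost function only evaluates a formula over the program features. How well the second approach works depends entirely on how closely the cost function tracks real behavior, which is the motivation for learning it from data.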


1) The Problem of Handcrafted Heuristics: Traditionally, a compiler cost function is manually crafted. For example, a heuristic for function inlining adds up a number of relevant metrics, such as the number of instructions of the target function to be inlined, the callee and stack size after inlining, and compares the resulting value against a predefined threshold to determine if it is profitable to inline a function [27]. Here, the importance or weights for metrics and the threshold are determined by compiler developers based on their experience or via "trial-and-error." Because the efforts involved in tuning the cost function are so expensive, many compilers simply use a "one-size-fits-all" cost function for inlining. However, such a strategy is ineffective. For example, Cooper et al. show that a "one-size-fits-all" strategy for inlining often delivers poor performance [28]; other studies also show that the optimal thresholds to use to determine when to inline change from one program to the other [29], [30].

Handcrafted cost functions are widely used in compilers. Other examples include the work conducted by Wagner et al. [31] and Tiwari et al. [32]. The former combines a Markov model and a human-derived heuristic to statically estimate the execution frequency of code regions (such as function invocation counts). The latter calculates the energy consumption of an application by assigning a weight to each instruction type. The efficiency of these approaches highly depends on the accuracy of the estimations given by the manually tuned heuristic.

The problem of relying on a hand-tuned heuristic is that the cost and benefit of a compiler optimization often depend on the underlying hardware; while handcrafted cost functions could be effective, manually developing one can take months or years on a single architecture. This means that tuning the compiler for each newly released processor is hard and is often infeasible due to the drastic efforts involved. Because cost functions are important and manually tuning a good function is difficult for each individual architecture, researchers have investigated ways to use machine learning to automate this process.

In Section III-A2, we review a range of previous studies on using machine learning to tune cost functions for performance and energy consumption, many of which can be applied to other optimization targets such as the code size [33] or a tradeoff between energy and runtime.

2) Cost Functions for Performance: The Meta Optimization framework [34] uses genetic programming (GP) to search for a cost function y ← f(x), which takes in a feature vector x and produces a real-valued priority y. Fig. 4 depicts the workflow of the framework. This approach is evaluated on a number of compiler problems, including hyperblock formation,³ register allocation, and data prefetching, showing that machine learned cost functions outperform human-crafted ones. A similar approach is employed by Cavazos et al. who find cost functions for performance and compilation overhead for a Java just-in-time compiler [35]. The COLE compiler [36] uses a variant of the GP algorithm called strength Pareto evolutionary algorithm 2 (SPEA2) [37] to learn cost functions to balance multiple objectives (such as program runtime, compilation overhead, and code size). In Section IV-C, we describe the working mechanism of GP-like search algorithms.

³ Hyperblock formation combines basic blocks from multiple control paths to form a predicated, larger code block to expose instruction level parallelism.

Fig. 4. A simple view of the GP approach presented in [34] for tuning compiler cost functions. Each candidate cost function is represented as an expression tree (a). The workflow of the GP algorithm is presented in (b).

Another approach to tune the cost functions is to predict the execution time or speedup of the target program. The Qilin compiler [38] follows such an approach. It uses curve fitting algorithms to estimate the runtime for executing the target program of a given input size on the CPU and the GPU. The compiler then uses this information to determine the optimal loop iteration partition across the CPU and the GPU. The Qilin compiler relies on an application-specific function which is built on a per-program basis using reference inputs. The curve fitting (or regression; see also Section IV) model employed by the Qilin compiler can model continuous values, making it suitable for estimating runtime and speedup. This approach is extended in [39], which developed a relative predictor that predicts whether an unseen program will improve significantly on a GPU relative to a CPU. This is used for runtime scheduling of OpenCL jobs.

The early work conducted by Brewer proposed a regression-based model to predict the execution time of a data layout scheme for parallelization, by considering three parameters [40]. Using the model, his approach can select the optimal layout over 99% of the time for a partial differential equation (PDE) solver across four evaluation platforms. Other previous works also use curve fitting algorithms to build a cost function to estimate the speedup or runtime of sequential [41]–[43], OpenMP [44]–[46], and, more recently, deep learning applications [47].

3) Cost Functions for Energy Consumption: In addition to performance, there is an extensive body of work that investigates ways to build energy models for software optimization and hardware architecture design. As power or energy readings are continuous real values, most of the prior work on power modeling uses regression-based approaches.

Linear regression is a widely used technique for energy modeling. Benini et al. developed a linear-regression-based model to estimate power consumption at the instruction level [48]. The framework presented by Rethinagiri et al. [49] uses parameterized formulas to estimate power consumption of embedded systems. The parameters of the formulas are determined by applying a regression-based algorithm to reference data obtained with handcrafted assembly code and power measurements. In a more recent work, Schürmans et al. also adopt a regression-based method for power modeling [50], but the weights of the regression model are determined using standard benchmarks instead of handwritten assembly programs.

Other works employ the artificial neural network (ANN) to automatically construct power models. Curtis-Maury et al. develop an ANN-based model to predict the power consumption of OpenMP programs on multicore systems [51]. The inputs to the model are hardware performance counter values such as the cache miss rate, and the output is the estimated power consumption. Su et al. adopt a similar approach by developing an ANN predictor to estimate the runtime and power consumption for mapping OpenMP programs on nonuniform memory access (NUMA) multicores. This approach is also based on runtime profiling of the target program, but it explicitly considers NUMA-specific information like local and remote memory accesses per cycle.

B. Directly Predicting the Best Option

While a cost function is useful for evaluating the quality of compiler options, the overhead involved in searching for the optimal option may still be prohibitive. For this reason, researchers have investigated ways to directly predict the best compiler decision using machine learning for relatively small compilation problems.

Monsifrot et al. pioneered the use of machine learning to predict the optimal compiler decision [16]. This work developed a decision-tree-based approach to determine whether it is beneficial to unroll a loop based on information such as the number of statements and arithmetic operations of the loop. Their approach makes a binary decision on whether to unroll a loop but not how many times the loop should be unrolled. Later work went further by directly predicting the loop unroll factor [52], considering eight unroll factors (1, 2, …, 8). The authors formulated the problem as a multiclass classification problem (i.e., each loop unroll factor is a class). They used over 2500 loops from 72 benchmarks to train two machine learning models [a nearest neighbor and a support vector machine (SVM) model] to predict the loop unroll factor for unseen loops. Using a richer set of features than [16], their techniques correctly predict the unroll factor for 65% of the testing loops, leading to, on average, a 5% improvement for the SPEC 2000 benchmark suite.

For sequential programs, there is extensive work in predicting the best compiler flags [53], [54], code transformation options [55], or tile size for loops [56], [57]. This level of interest is possibly due to the restricted nature of the problem, allowing easy experimentation and comparison against prior work.

Directly predicting the optimal option for parallel programs is harder than doing it for sequential programs, due to the complex interactions between the parallel programs and the underlying parallel architectures. Nonetheless, there are works on predicting the optimal number of threads to be used to run an OpenMP program [46], [58], the best parameters to be used to compile a CUDA program for a given input [59], and the thread coarsening parameters for OpenCL programs for GPUs [17]. These papers show that supervised machine learning can be a powerful tool for modeling problems with a relatively small number of optimization options.

IV. MACHINE LEARNING MODELS

In this section, we review the wide range of machine learning models used for compiler optimization. Table 2 summarizes the set of machine learning models discussed in this section.

There are two major subdivisions of machine learning techniques that have previously been used in compiler optimizations: supervised and unsupervised learning. Using supervised machine learning, a predictive model is trained on empirical performance data (labeled outputs) and important quantifiable properties (features) of representative programs. The model learns the correlation between these feature values and the optimization decision that delivers the optimal (or near-optimal) performance. The learned correlations are used to predict the best optimization decisions for new programs. Depending on the nature of the outputs, the predictive model can be either a regression model for continuous outputs or a classification model for discrete outputs.

In the other subdivision of machine learning, termed unsupervised learning, the input to the learning algorithm is merely a set of input values; there is no labeled output. One form of unsupervised learning is clustering, which groups the input data items into several subsets. For example, SimPoint [60], a simulation technique, uses clustering to pick representative program execution points for program simulation. It
Later, Stephenson and Amarasinghe advanced [16] by directly does so by first dividing a set of program runtime information

6 Proceedings of the IEEE


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Wang and O'Boyle: Machine Learning in Compiler Optimization

Table 2 Machine Learning Methods Discussed in Section IV
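To make the supervised workflow of Section IV concrete, the following is a minimal sketch of the k-nearest-neighbor classifier reviewed in Section IV-A2: labeled training programs are stored as feature vectors, and an unseen program receives the majority label of its k closest neighbors under the Euclidean distance. The function name knn_predict, the two features (statement count and arithmetic-operation count), and all data values are assumptions made for illustration; they are not taken from any of the cited works.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training examples nearest to query."""
    # Rank every training example by Euclidean distance to the query vector.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    # Vote among the labels of the k closest examples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (statements, arithmetic ops) -> best unroll factor.
# The numbers are invented for illustration only.
TRAIN = [
    ((4.0, 2.0), 8),
    ((5.0, 3.0), 8),
    ((6.0, 2.0), 8),
    ((40.0, 25.0), 1),
    ((45.0, 30.0), 1),
    ((38.0, 22.0), 1),
]

if __name__ == "__main__":
    print("small loop ->", knn_predict(TRAIN, (7.0, 4.0)))
    print("large loop ->", knn_predict(TRAIN, (42.0, 27.0)))
```

Every prediction scans the whole training set, which is the first KNN drawback noted in Section IV-A2; in practice the features would usually be normalized so that no single dimension dominates the distance.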

into groups (or clusters), such that points within each cluster are similar to each other in terms of program structures (loops, memory usages, etc.); it then chooses a few points of each cluster to represent all the simulation points within that group without losing much information.

There are also techniques that sit at the boundary of supervised and unsupervised learning. These techniques refine the knowledge gathered during offline learning or previous runs using empirical observations obtained during deployment. We review such techniques in Section IV-C. This section concludes with a discussion of the relative merits of different modeling approaches for compiler optimization.

A. Supervised Learning
1) Regression: A widely used supervised learning technique is called regression. This technique has been used in various tasks, such as predicting the program execution time [38] or speedup [39] for a given input, or estimating the tail latency for parallel workloads [61].

Regression is essentially curve fitting. As an example, consider Fig. 5 where a regression model is learned from five data points. The model takes in a program input size X and predicts the execution time of the program Y. Adhering to supervised learning nomenclature, the set of five known data points is the training data set and each of the five points that comprise the training data is called a training example. Each training example (x_i, y_i) is defined by a feature vector x_i (i.e., the input size in our case) and a desired output y_i (i.e., the program execution time in our case). Learning in this context is understood as discovering the relation between the inputs (x_i) and the outputs (y_i) so that the predictive model can be used to make predictions for any new, unseen input features in the problem domain. Once the function f is in place, one can use it to make a prediction by taking in a new input feature vector x. The prediction y is the value of the curve that the new input feature vector x corresponds to.

There are a range of machine learning techniques that can be used for regression. These include the simple linear regression model and more advanced models like SVMs and ANNs. Linear regression is effective when the input (i.e., feature vectors) and output (i.e., labels) have a strong linear relation. SVMs and ANNs can model both linear and nonlinear relations, but typically require more training examples to learn an effective model when compared with simple linear regression models.

Table 3 gives some examples of regression techniques that have been used in prior work for code optimization and the problem to be modeled.

2) Classification: Supervised classification is another technique that has been widely used in prior work on machine-learning-based code optimization. This technique takes in a feature vector and predicts which of a set of classes the feature vector is associated with. For example, classification can be used to predict which of a set of unroll factors should be used for a given loop, by taking in a feature vector that describes the characteristics of the target loop (see also Section II-D).

The k-nearest neighbor (KNN) algorithm is a simple yet effective classification technique. It finds the k closest training examples to the input instance (or program) on the feature space. The closeness (or distance) is often evaluated using the Euclidean distance, but other metrics can also be used. This technique has been used to predict the optimal optimization parameters in prior works [52], [66], [67]. It

Fig. 5. A simple regression-based curve-fitting example. There are five training examples in this case. A function f is trained with the training data, which maps the input x to the output y. The trained function can predict the output of an unseen x.
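The curve fitting of Fig. 5 can be sketched with ordinary least squares on five training examples. This is only an illustration of the idea under stated assumptions: the helper names fit_line and make_predictor are hypothetical, and the five (input size, execution time) pairs are invented rather than read off the figure.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for a single input feature.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_predictor(slope, intercept):
    """Return the learned function f that maps an input x to a prediction y."""
    return lambda x: slope * x + intercept


# Five invented training examples (x_i, y_i): input size -> execution time.
INPUT_SIZES = [1.0, 2.0, 3.0, 4.0, 5.0]
EXEC_TIMES = [2.1, 3.9, 6.2, 7.8, 10.1]

if __name__ == "__main__":
    f = make_predictor(*fit_line(INPUT_SIZES, EXEC_TIMES))
    # Predict the execution time for an unseen input size.
    print(f"predicted time for input size 6: {f(6.0):.2f}")
```

A straight line is the simplest choice of f; when the relation between input and output is nonlinear, the same train-then-predict interface would be kept while swapping in a richer model.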


works by first finding which of the training programs are closest (i.e., nearest neighbors) to the incoming program on the feature space; it then uses the optimal parameters (which are found during training time) of the nearest neighbors as the prediction output. While it is effective on small problems, KNN also has two main drawbacks. First, it must compute the distance between the input and all training data at each prediction. This can be slow if there is a large number of training programs to be considered. Second, the algorithm itself does not learn from the training data; instead, it simply selects the k nearest neighbors. This means that the algorithm is not robust to noisy training data and could choose an ill-suited training program as the prediction.

Table 3 Regression Techniques Used in Prior Works

As an alternative, the decision tree has been used in prior works for a range of optimization problems. These include choosing the parallel strategy for loop parallelization [69], determining the loop unroll factor [16], [70], deciding the profitability of using GPU acceleration [68], [71], and selecting the optimal algorithm implementation [72]. The advantage of a decision tree is that the learned model is interpretable and can be easily visualized. This enables users to understand why a particular decision is made by following the path from the root node to a leaf decision node. For example, Fig. 6 depicts the decision tree model developed in [68] for selecting the best performing device (CPU or GPU) to run an OpenCL program. To make a prediction, we start from the root of the tree; we compare a feature value (e.g., the communication–computation ratio) of the target program against a threshold to determine which branch of the tree to follow; and we repeat this process until we reach a leaf node where a decision will be made. Note that the structure and thresholds of the tree are automatically determined by the machine learning algorithm, and they may change when we target a different architecture or application domain.

Decision trees make the assumption that the feature space is convex, i.e., it can be divided up using hyperplanes into different regions, each of which belongs to a different category. This restriction is often appropriate in practice. However, a significant drawback of using a single decision tree is that the model can overfit due to outliers in the training data (see also Section IV-D). Random forests [73] have, therefore, been proposed to alleviate the problem of overfitting. Random forests are an ensemble learning method [74]. As illustrated in Fig. 7, they work by constructing multiple decision trees at training time. The prediction of each tree depends on the values of a random vector sampled independently on the feature value. In this way, each tree is randomly forced to be insensitive to some feature dimensions. To make a prediction, random forests then aggregate the outcomes of individual trees to form an overall prediction. They have been employed to determine whether to inline a function or not [75], delivering better performance than a single-model-based approach. We want to highlight that random forests can also be used for regression tasks. For instance, they have been used to model energy consumption of OpenMP [76] and CUDA [77] programs.

Logistic regression is a variation of linear regression but is often used for classification. It takes in the feature vector and calculates the probability of some outcome. For example, Cavazos and O’Boyle used logistic regression to determine the optimization level of Jikes RVM. Like decision trees, logistic regression also assumes that the feature values and the prediction have a linear relation.

More advanced models, such as SVM classification, have been used for various compiler optimization tasks [46], [79]–[81]. SVMs use kernel functions to compute the similarity of feature vectors. The radial basis function (RBF) is commonly used in prior works [46], [82] because it can model both linear and nonlinear problems. It works by mapping the input feature vector to a higher dimensional space where it may be easier to find a linear hyperplane that separates the labeled data (or classes) well.

Other machine learning techniques, such as kernel canonical correlation analysis and naive Bayes, have also

Fig. 6. A decision tree for determining which device (CPU or GPU) to use to run an OpenCL program. This diagram is reproduced from [68].
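The root-to-leaf procedure described for Fig. 6 can be sketched as follows. The tree below is hand-written and purely hypothetical: the feature names comm_comp_ratio and work_items, the thresholds, and the leaf decisions are assumptions chosen for illustration and are not the model learned in [68]. In a real system the structure and thresholds would be produced by the learning algorithm.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    decision: str  # e.g., "CPU" or "GPU"


@dataclass
class Node:
    feature: str      # name of the feature tested at this node
    threshold: float  # value the feature is compared against
    below: "Tree"     # branch taken when feature value <= threshold
    above: "Tree"     # branch taken when feature value > threshold


Tree = Union[Leaf, Node]


def predict(tree, features):
    """Walk from the root to a leaf; return the decision and the path taken."""
    path = []
    while isinstance(tree, Node):
        value = features[tree.feature]
        go_below = value <= tree.threshold
        path.append(f"{tree.feature}={value} {'<=' if go_below else '>'} {tree.threshold}")
        tree = tree.below if go_below else tree.above
    return tree.decision, path


# Hypothetical tree: low communication and enough parallel work favors the GPU.
TREE = Node(
    feature="comm_comp_ratio",
    threshold=0.5,
    below=Node(feature="work_items", threshold=1024, below=Leaf("CPU"), above=Leaf("GPU")),
    above=Leaf("CPU"),
)

if __name__ == "__main__":
    decision, path = predict(TREE, {"comm_comp_ratio": 0.1, "work_items": 4096})
    print(decision, "via", " -> ".join(path))
```

Returning the path alongside the decision reflects the interpretability argument made in Section IV-A2: a user can read off exactly which comparisons led to the chosen device.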


been used in prior works to predict stencil program configurations [83] or detect parallel patterns [84].

Fig. 7. Random forests are an ensemble learning algorithm. A random forest aggregates the outputs of multiple decision trees to form a final prediction. The idea is to combine the predictions from multiple individual models together to make a more robust, accurate prediction than any individual model.

3) Deep Neural Networks: In recent years, deep neural networks [85] have been shown to be a powerful tool for tackling a range of machine learning tasks such as image recognition [86], [87] and audio processing [88]. Deep neural networks (DNNs) have recently been used to model program source code [89] for various software engineering tasks (see also Section VI-C), but so far there is little work on applying DNNs to compiler optimization. A recent attempt in this direction is the DeepTune framework [78], which uses DNNs to extract source code features (see also Section V-C).

The advantage of DNNs is that they can compactly represent a significantly larger set of functions than a shallow network, where each function is specialized at processing part of the input. This capability allows DNNs to model the complex relationship between the input and the output (i.e., the prediction). As an example, consider Fig. 8 that visualizes the internal state of DeepTune [78] when predicting the optimal thread coarsening factor for an OpenCL kernel (see Section II-D). Fig. 8(a) shows the first 80 elements of the input source code tokens as a heatmap in which each cell’s color reflects an integer value assigned to a specific token. Fig. 8(b) shows the neurons of the first DNN for each of the four GPU platforms, using a red–blue heatmap to visualize the intensity of each activation. A close look at the heatmap reveals a number of neurons in the layer with different responses across platforms. This indicates that the DNN is partly specialized to the target platform. As information flows through the network [layers (c) and (d) in Fig. 8], the layers become progressively more specialized to the specific platform.

B. Unsupervised Learning
Unlike supervised learning models, which learn a correlation from the input feature values to the corresponding outputs, unsupervised learning models take in only the input data (e.g., the feature values). This technique is often used to model the underlying structure or distribution of the data.

Clustering is a classical unsupervised learning problem. The k-means clustering algorithm [90] groups the input data into k clusters. For example, in Fig. 9, a k-means algorithm is used to group data points into three clusters on a 2-D

Fig. 8. A simplified view of the internal state for the DeepTune DNN framework [78] when it predicts the optimal OpenCL thread coarsening factor. Here, a DNN is learned for each of the four target GPU architectures. The activations in each layer of the four models increasingly diverge (or specialize) toward the lower layers of the model. Note that some of the DeepTune layers are omitted to aid presentation.
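The k-means grouping discussed in Section IV-B can be sketched as an alternation between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members (Lloyd's algorithm). The sketch makes two assumptions that are not taken from the cited works: a deterministic farthest-point initialization (chosen here only to keep the example reproducible) and nine invented 2-D points forming three visibly separate groups.

```python
import math


def init_centroids(points, k):
    """Farthest-point initialization: a deterministic stand-in for random seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=100):
    """Return (centroids, assignment) after Lloyd's iterations converge."""
    centroids = init_centroids(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Invented 2-D feature vectors forming three well-separated groups.
POINTS = [
    (1.0, 1.0), (1.5, 1.2), (0.8, 1.4),
    (8.0, 8.0), (8.3, 7.6), (7.7, 8.4),
    (1.0, 8.0), (1.2, 8.5), (0.7, 7.8),
]

if __name__ == "__main__":
    centroids, labels = kmeans(POINTS, 3)
    print(labels)
```

In the program-characterization uses cited in Section IV-B, each point would be a feature vector describing an execution phase, and a few members per cluster would then stand in for the whole group.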


feature space. The algorithm works by grouping data points that are close to each other on the feature space into a cluster. K-means is used to characterize program behavior [60], [91]. It does so by clustering program execution into phase groups, so that we can use a few samples of a group to represent the entire program phases within a group. K-means is also used in the work presented in [92] to summarize the code structures of parallel programs that benefit from similar optimization strategies. In addition to k-means, Martins et al. employed the fast Newman clustering algorithm [93], which works on network structures, to group functions that may benefit from similar compiler optimizations [94].

Fig. 9. Using k-means to group data points into three clusters. In this example, we group the data points into three clusters on a 2-D feature space.

PCA is a statistical method for unsupervised learning. This method has been heavily used in prior work to reduce the feature dimension [17], [25], [95]–[97]. Doing so allows us to model a high-dimensional feature space with a smaller number of representative variables which, in combination, describe most of the variability found in the original feature space. PCA is often used to discover the common pattern in the data sets in order to help clustering exercises. It is used to select representative programs from a benchmark suite [95], [98]. In Section V-D, we discuss PCA in further detail.

Autoencoders are a recently proposed artificial neural network architecture for discovering the efficient codings of input data in an unsupervised fashion [99]. This technique can be used in combination with a natural language model to first extract features from program source code and then find a compact representation of the source code features [100]. We discuss autoencoders in Section V-D when reviewing feature dimensionality reduction techniques.

C. Online Learning
1) Evolutionary Search: Evolutionary algorithms (EAs) or evolutionary computation such as genetic algorithms (GAs), GP,⁴ and stochastic-based search have been employed to find a good optimization solution from a large search space. An EA applies principles inspired by biological evolution to find an optimal or near-optimal solution for the target problem. For instance, the SPIRAL autotuning framework uses a stochastic evolutionary search algorithm to choose a fast formula (or transformation) for signal processing applications [101]. Li et al. use GAs to search for the optimal configuration to determine which sorting algorithm to use based on the unsorted data size [102]. The Petabricks compiler offers a more general solution by using EAs to search for the best performing configurations for a set of algorithms specified by the programmer [103]. In addition to code optimization, EAs have also been used to create Pareto optimal program benchmarks under various criteria [104].

⁴ A GA is represented as a list of actions and values, often a string, while a GP is represented as a tree structure of actions and values. For example, GP is applied to the abstract syntax tree of a program to search for useful features in [70].

As an example, consider how an EA can be employed in the context of iterative compilation to find the best compiler flags for a program [25], [36], [105]. Fig. 10 depicts how an EA can be used for this purpose. The algorithm starts from several populations of randomly chosen compiler flag settings. It compiles the program using each individual compiler flag sequence, and uses a fitness function to evaluate how well a compiler flag sequence performs. In our case, a fitness function can simply return the reciprocal of a program runtime measurement, so that compiler settings that give faster execution time will have a higher fitness score. In the next epoch, the EA generates

Fig. 10. Using an EA to perform iterative compilation. The algorithm starts from several initial populations of randomly chosen compiler flag sequences. It evaluates the performance of individual sequences to remove poorly performing sequences in each population. It then applies crossover and mutation to create a new generation of populations. The algorithm returns the best performing program binary when it terminates.
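The iterative-compilation loop of Fig. 10 can be sketched as a small genetic algorithm over on/off compiler options. Several simplifications are assumed here: a single population is used instead of several, each individual is a bit vector of eight hypothetical options rather than real compiler flags, and measure_runtime is a synthetic stand-in for compiling and timing the program (its numbers are invented). The fitness function is the reciprocal of the runtime, as suggested in the text.

```python
import random

# Invented per-option effects on runtime (positive = faster when enabled).
GAINS = [1.5, -0.5, 2.0, 0.3, -1.0, 0.8, 0.0, 1.2]


def measure_runtime(setting):
    """Synthetic stand-in for compiling with `setting` and timing the binary."""
    runtime = 20.0 - sum(gain for gain, on in zip(GAINS, setting) if on)
    if setting[0] and setting[2]:
        runtime += 1.0  # pretend these two options interfere with each other
    return runtime


def fitness(setting):
    """Reciprocal of runtime: faster settings receive a higher score."""
    return 1.0 / measure_runtime(setting)


def evolve(n_flags=8, pop_size=12, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_flags)] for _ in range(pop_size)]
    history = []  # best runtime seen at the start of each generation
    for _ in range(generations):
        # Selection: the less fit half of the population dies out.
        population.sort(key=fitness, reverse=True)
        history.append(measure_runtime(population[0]))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Crossover: parents are drawn with probability proportional to fitness.
            weights = [fitness(s) for s in survivors]
            mum, dad = rng.choices(survivors, weights=weights, k=2)
            cut = rng.randrange(1, n_flags)
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each option with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    history.append(measure_runtime(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("best setting:", best, "runtime:", measure_runtime(best))
```

Because survivors are carried over unchanged, the best runtime never gets worse from one generation to the next; in a real deployment each call to measure_runtime would be a full compile-and-run, which is why the search cost discussed in Section IV-C1 matters.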


the next populations of compiler settings via mechanisms such as reproduction (crossover) and mutation among compiler flag settings. This results in a new generation of compiler flag settings and the quality of each setting will be evaluated again. In a mechanism analogous to natural selection, a certain number of poorly performing compiler flags within a population are chosen to die in each generation. This process terminates when no further improvement is observed or the maximum number of generations is reached, and the algorithm will return the best found program binary as a result.

There are three key operations in an EA: selection, crossover, and mutation. The probability of an optimization option being selected for dying is often inversely proportional to its fitness score. In other words, options that are relatively fitter (e.g., give faster program runtime) are more likely to survive and remain a part of the population after selection. In crossover, a certain number of offspring are produced by mixing some existing optimization options (e.g., compiler flags). The likelihood of an existing option being chosen for crossover is again proportional to its fitness. This strategy ensures that good optimizations will be preserved over generations, while poorly performing optimizations will gradually die out. Finally, mutation randomly changes a preserved optimization, e.g., by turning on/off an option or replacing a threshold value in a compiler flag sequence. Mutation reduces the chance that the algorithm gets stuck with a locally optimal optimization.

EAs are useful for exploring a large optimization space where it is infeasible to just enumerate all possible solutions. This is because an EA can often converge to the most promising area in the optimization space more quickly than a general search heuristic. The EA is also shown to be faster than a dynamic-programming-based search [24] in finding the optimal transformation for the fast Fourier transformation (FFT) [101]. When compared to supervised learning, EAs have the advantage of requiring little problem-specific knowledge, and hence they can be applied on a broad range of problems. However, because an EA typically relies on empirical evidence (e.g., running time) for fitness evaluation, the search time can still be prohibitively expensive. This overhead can be reduced by using a machine-learning-based cost model [43] to estimate the potential gain (e.g., speedup) of a configuration (see also Section III-A). Another approach is to combine supervised learning and EAs [25], [106] by first using an offline learned model to predict the most promising areas of the design space (i.e., to narrow down the search areas), and then searching over the predicted areas to refine the solutions. Moreover, instead of predicting where in the search space to focus on, one can also first prune the search space to reduce the number of options to search over. For example, Jantz and Kulkarni show that the search space of phase ordering⁵ can be greatly reduced if we can first remove phases whose application order is irrelevant to the produced code [107]. Their techniques are claimed to prune the exhaustive phase order search space size by 89% on average.

⁵ Compiler phase ordering determines in which order a set of compiler optimization passes should be applied to a given program.

2) Reinforcement Learning: Another class of online learning algorithms is reinforcement learning (RL), which is sometimes called “learning from interactions.” The algorithm tries to learn how to maximize the rewards (or performance) itself. In other words, the algorithm needs to learn, for a given input, what the correct output or decision to take is. This is different from supervised learning where the correct input/output pairs are presented in the training data.

Fig. 11 illustrates the working mechanism of RL. Here the learning algorithm interacts with its environment over a discrete set of time steps. At each step, the algorithm evaluates the current state of its environment, and executes an action. The action leads to a change in the state of the environment (which the algorithm can evaluate in the next time step), and produces an immediate reward. For example, in a multitasking environment, a state could be the CPU contention; when processor cores are idle, an action could be where to place a process, and a reward could be the overall system throughput. The goal of RL is to maximize the long-term cumulative reward by learning an optimal strategy to map states to actions.

RL is particularly suitable for modeling problems that have an evolving nature, such as dynamic task scheduling, where the optimal outcome is achieved through a series of actions. RL has been used in prior research to schedule RAM memory traffic [108], select software component configurations at runtime [109], and configure virtual machines [110]. An early work on using RL for program optimization was conducted by Lagoudakis and Littman [111]. They use RL to find the cutoff point to switch between two sorting algorithms: quickSort and insertionSort. CALOREE combines machine learning and control theory to schedule CPU resources on heterogeneous multicores [112]. For a given application, CALOREE uses control-theoretic methods to dynamically adjust the resource allocation, and machine learning to estimate the application’s latency and power for a given resource allocation plan (to offer decision support).

An interesting RL-based approach for scheduling parallel OpenMP programs is presented in [113]. This approach predicts the best number of threads for a target OpenMP

Fig. 11. The working mechanism of reinforcement learning.
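The interaction loop of Fig. 11 (observe the state, take an action, receive a reward, update) can be sketched with tabular Q-learning. Q-learning is one common RL algorithm chosen here for illustration; the text does not single it out. The environment is a toy invented for this sketch: two contention states, three thread counts as actions, made-up rewards standing in for throughput, and a rule that running four threads drives the system into high contention.

```python
import random

STATES = ("low_contention", "high_contention")
ACTIONS = (1, 2, 4)  # number of threads to run with

# Invented immediate rewards (think: throughput) for each (state, action) pair.
REWARD = {
    ("low_contention", 1): 1.0,
    ("low_contention", 2): 2.0,
    ("low_contention", 4): 4.0,
    ("high_contention", 1): 1.0,
    ("high_contention", 2): 1.5,
    ("high_contention", 4): 0.5,
}


def step(state, action):
    """Toy environment: using four threads causes high contention next step."""
    reward = REWARD[(state, action)]
    next_state = "high_contention" if action == 4 else "low_contention"
    return next_state, reward


def q_learning(steps=20000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a state -> action policy that maximizes discounted cumulative reward."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = STATES[0]
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        # Move the estimate toward reward plus discounted best future value.
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


if __name__ == "__main__":
    print(q_learning())
```

The discount factor gamma makes the learner weigh the delayed cost of an action (here, the contention caused by using four threads) against its immediate reward, which is the long-term cumulative objective described in Section IV-C2.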


program when it runs with other competing workloads, aiming to make the target program run faster. This approach first learns a reward function offline based on static code features and runtime system information. The reward function is used to estimate the reward of a runtime scheduling action, i.e., the expected speedup when assigning a certain number of processor cores to an OpenMP program. In the next scheduling epoch, this approach uses the empirical observation of the application speedup to check if the reward function was accurate and the decision was good, and updates the reward function if the model is found to be inaccurate.

In general, RL is an intuitive and comprehensive solution for autonomous decision making. But its performance depends on the effectiveness of the value function, which estimates the immediate reward. An optimal value function should lead to the greatest cumulative reward in the longer term. For many problems, it is difficult to design an effective value function or policy, because the function needs to foresee the impact of an action in the future. The effectiveness of RL also depends on the environment; if the number of possible actions is large, it can take RL a long time to converge to a good solution. RL also requires the environment to be fully observed, i.e., all the possible states of the environment can be anticipated ahead of time. However, this assumption may not hold in a dynamic computing environment due to unpredictable disturbances, e.g., changes in application inputs or application mixes. In recent years, deep learning techniques have been used in conjunction with RL to learn a value function. The combined technique is able to solve some problems that were deemed impossible in the past [114]. However, how to combine deep learning with RL to solve compilation and code optimization problems remains an open question.

D. Discussion
What model is best is the $64 000 question. The answer is: it depends. More sophisticated techniques may provide greater accuracy but they require large amounts of labeled training data, a real problem in compiler optimization. Techniques such as linear regression and decision trees require less training data compared to more advanced models such as SVMs and ANNs. Simple models typically work well when the prediction problem can be described using a feature vector that has a small number of dimensions, and when the feature vector and the prediction are linearly correlated. More advanced techniques such as SVMs and ANNs can model both linear and nonlinear problems on a higher dimensional feature space, but they often require more training data to learn an effective model. Furthermore, the performance of an SVM and an ANN also highly depends on the hyperparameters used to train the model. The optimal hyperparameter values can be chosen by performing cross validation on the training data. However, how to select parameters to avoid overfitting while achieving a good prediction accuracy remains an outstanding challenge.

Choosing which modeling technique to use is nontrivial. This is because the choice of model depends on a number of factors: the prediction problem (e.g., regression or classification), the set of features to use, the available training examples, the training and prediction overhead, etc. In prior works, the choice of modeling technique largely relied on developer experience and empirical results. Many of the studies in the field of machine-learning-based code optimization do not fully justify the choice of the model, although some do compare the performance of alternate techniques. The OpenTuner framework addresses the problem by employing multiple techniques for program tuning [115]. OpenTuner runs multiple search techniques at the same time. Techniques which perform well will be given more candidate tuning options to examine, while poorly performing algorithms will be given fewer choices or disabled entirely. In this way, OpenTuner can discover which algorithm works best for a given problem during search.

One technique that has seen little investigation is the use of Gaussian processes [116]. Before the recent widespread interest in DNNs, these were a highly popular method in many areas of machine learning [117]. They are particularly powerful when the amount of training data is sparse and expensive to collect. They also automatically give a confidence interval with any decision. This allows the compiler writer to trade off risk versus reward depending on the application scenario.

Using a single model has a significant drawback in practice. This is because a one-size-fits-all model is unlikely to precisely capture behaviors of diverse applications, and no matter how parameterized the model is, it is highly unlikely that a model developed today will always be suited for tomorrow. To allow the model to adapt to the change of the computing environment and workloads, ensemble learning was exploited in prior works [73], [118], [119]. The idea of ensemble learning is to use multiple learning algorithms, where each algorithm is effective for particular problems, to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone [120], [121]. Making a prediction using an ensemble typically requires more computational time than doing that using a single model, so ensembles can be seen as a way to compensate for poor learning algorithms by performing extra computation. To reduce the overhead, fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well.

V. FEATURE ENGINEERING

Machine-learning-based code optimization relies on having a set of high-quality features that capture the important characteristics of the target program. Given that there is an


unbounded number of potential features, finding the right set is a nontrivial task. In this section, we
review how previous work chooses features, a task known as feature engineering. Tables 4 and 5 summarize
the range of program features and feature engineering techniques discussed in this section, respectively.

Table 4 Summary of Features Discussed in Section V

Table 5 Feature Engineering Techniques Discussed in Section V

A. Feature Representation
Various forms of program features have been used in compiler-based machine learning. These include
static code structures [122] and runtime information such as system load [118], [123] and performance
counters [53].
1) Static Code Features: Static program features such as the number and type of instructions are often
used to describe a program. These features are typically extracted from the compiler intermediate
representations [29], [46], [52], [80] in order to avoid using information extracted from dead code.
Table 6 gives some of the static code features that were used in previous studies. Raw code features are
often used together to create a combined feature. For example, one can divide the number of load
instructions by the total number of instructions to get the memory load ratio. An advantage of using
static code features is that the features are readily available from the compiler intermediate
representation.

Table 6 Example Code Features Used in Prior Works

2) Tree- and Graph-Based Features: Singer and Veloso represent the FFT in a split tree [124]. They
extract from the tree a set of features, by counting the number of nodes of various types and
quantifying the shape of the tree. These tree-based features are then used to build a
neural-network-based cost function that predicts which of the two FFT formulas runs faster. The cost
function is used to search for the best performing transformation.
Park et al. present a unique graph-based approach for feature representations [125]. They use an SVM
where the kernel is based on a graph similarity metric. Their technique requires hand-coded features at
the basic block level, but thereafter, graph similarity against each of the training programs takes the
place of global features. Malik shows that spatial-based information, i.e., how instructions are
distributed within a program, extracted from the program’s data flow graph could be a useful feature for
machine-learning-based compiler optimization [126]. Nobre et al. also exploit graph structures for code
generation [26]. Their approach targets the phase ordering problem. The order of compiler optimization
passes is represented as a graph. Each node of the graph is an optimization pass and connections between
nodes are weighted in a way that subsequences with higher aggregated weights are more likely to lead to
faster runtime. The graph is automatically constructed and updated using iterative compilation (where
the target program is compiled using different compiler passes with different orders). A design space
exploration algorithm is employed to drive the iterative compilation process.
3) Dynamic Features: While static code features are useful and can be extracted at static compile time
(hence feature extraction has no runtime overhead), they have drawbacks. For example, static code
features may contain information about code segments that rarely get executed, and such information can
confuse the machine learning model; some program information such as the loop bound depends on the
program input, which can only be obtained during execution time; and static code features often may not
precisely capture the application behavior in the runtime environment [such as resource contention and
input/output (I/O) behavior] as such behavior highly depends on the computing environment such as the
number of available processors and corunning workloads.
As illustrated in Fig. 12, dynamic features can be extracted from multiple layers of the runtime
environment. At the application layer, we can obtain information such as loop iteration counts that
cannot be decided at compile time, dynamic control flows, frequently executed code regions, etc. At the
operating system level, we can observe the memory and I/O behavior of the application as well as CPU
load and thread contention, etc. At the hardware
level, we can use performance counters to track information such as how many instructions have been
executed and of what types, and the number of cache loads/stores as well as branch misses, etc.

Fig. 12. Dynamic features can be extracted from multiple layers of the computing environment.
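
As a rough illustration of how raw counts are typically turned into model inputs, the sketch below normalizes a few static instruction counts and hardware counter readings into ratio features, such as the memory load ratio or a cache miss rate. The counter names, the dictionary layout, and the `build_feature_vector` helper are hypothetical and are not taken from any particular compiler or profiling tool.

```python
from typing import Dict, List


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def build_feature_vector(static_counts: Dict[str, int],
                         perf_counters: Dict[str, int]) -> List[float]:
    """Combine raw static counts and dynamic counter values into ratio features.

    All key names are illustrative assumptions, not a real tool's output format.
    """
    total_insts = static_counts.get("total_instructions", 0)
    return [
        # Static: fraction of IR instructions that are loads (memory load ratio).
        ratio(static_counts.get("load_instructions", 0), total_insts),
        # Static: fraction of IR instructions that are branches.
        ratio(static_counts.get("branch_instructions", 0), total_insts),
        # Dynamic: cache miss rate observed during a profiling run.
        ratio(perf_counters.get("cache_misses", 0),
              perf_counters.get("cache_references", 0)),
        # Dynamic: branch miss rate observed during a profiling run.
        ratio(perf_counters.get("branch_misses", 0),
              perf_counters.get("branches", 0)),
    ]


# Made-up numbers for a single program, purely for demonstration.
static_counts = {"total_instructions": 200, "load_instructions": 50,
                 "branch_instructions": 20}
perf_counters = {"cache_references": 1000, "cache_misses": 100,
                 "branches": 400, "branch_misses": 8}
features = build_feature_vector(static_counts, perf_counters)
print(features)
```

Using ratios rather than raw counts keeps programs of very different sizes on a comparable scale, which is one common reason for building combined features of this kind.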
Hardware performance counter values, such as executed instruction counts and cache miss rate, are
therefore used to understand the application’s dynamic behaviors [53], [127], [128]. These counters can
capture low-level program information such as data access patterns, branches, and computational
instructions. One of the advantages of performance counters is that they capture how the target program
behaves on specific hardware and avoid the irrelevant information that static code features may bring
in. In addition to hardware performance counters, operating system level metrics, such as system load
and I/O contention, are also used to model an application’s behavior [39], [123]. Such information can
be externally observed without instrumenting the code, and can be obtained during offline profiling or
program execution time.
While effective, collecting dynamic information could incur prohibitive overhead and the collected
information can be noisy due to competing workloads and operating system scheduling [129] or even subtle
settings of the execution environment [130]. Another drawback of performance counters and dynamic
features is that they can only capture the application’s past behavior. Therefore, if the application
behaves significantly differently in the future due to the change of program phases or inputs, then the
prediction will be drawn on an unreliable observation. As such, dynamic and static features are often
used in combination in prior works in order to build a robust model.

B. Reaction-Based Features
Cavazos et al. present a reaction-based predictive model for software–hardware codesign [131]. Their
approach profiles the target program using several carefully selected compiler options to see how
program runtime changes under these options for a given microarchitecture setting. They then use the
program “reactions” to predict the best available application speedup. Fig. 13 illustrates the
difference between a reaction-based model and a standard program feature-based model. A similar
reaction-based approach is used in [132] to predict speedup and energy efficiency for an application
that is parallelized using thread-level speculation (TLS) under a given microarchitectural
configuration. Note that while a reaction-based approach does not use static code features, developers
must carefully select a few settings from a large number of candidate options for profiling, because
poorly chosen options can significantly affect the quality of the model.

Fig. 13. Standard feature-based modeling (a) versus reaction-based modeling (b). Both models try to
predict the speedup for a given compiler transformation sequence. The program feature-based predictor
takes in static program features extracted from the transformed program, while the reaction-based model
takes in the target transformation sequence and the measured speedups of the target program, obtained by
applying a number of carefully selected transformation sequences. Diagrams are reproduced from [131].

C. Automatic Feature Generation
As deriving good features is a time-consuming task, a few methods have been proposed to automatically
generate features from the compiler’s intermediate representation (IR) [70], [133]. The work of [70]
uses GP to search for features, but required a huge grammar to be written, some 160 kB in length.
Although much of this can be created from templates, selecting the right range of capabilities and
search space bias is nontrivial and up to the expert. The work of [133] expresses the space of features
via logic programming over relations that represent information from the IRs. It greedily searches for
expressions that represent good features. However, their approach relies on expert-selected relations,
combinators, and constraints to work. Both approaches closely tie the implementation of the predictive
model to the compiler IR, which means changes to the IR will require modifications to the model.
Furthermore, the time spent in searching features could be significant for these approaches.
The first work to employ neural networks to extract features from program source code for compiler
optimization was conducted by Cummins et al. [78]. Their system, namely DeepTune, automatically
abstracts and selects appropriate features from the raw source code. Unlike prior work
where the predictive model takes in a set of human-crafted features, program code is used directly in
the training data. Programs are fed through a series of neural-network-based language models which learn
how the code correlates with the desired optimization options (see also Fig. 8). Their work also shows
that the properties of the raw code that are abstracted by the top layers of the neural networks are
mostly independent of the optimization problem. While promising, it is worth mentioning that dynamic
information such as the program input size and performance counter values are often essential for
characterizing the behavior of the target program. Therefore, DeepTune does not completely remove human
involvement for feature engineering when static code features are insufficient for the optimization
problem.

D. Feature Selection and Dimension Reduction
Machine learning uses features to capture the essential characteristics of a training example. Sometimes
we have too many features. As the number of features increases, so does the number of training examples
needed to build an accurate model [134]. Hence, we need to limit the dimension of the feature space. In
compiler research, commonly, an initial large, high-dimensional candidate feature space is pruned via
feature selection [52], or projected into a lower dimensional space [17]. In this section, we review a
number of feature selection and dimension reduction methods.

1) Feature Selection: Feature selection requires understanding how a particular feature affects the
prediction accuracy. One of the simplest methods for doing this is applying the Pearson correlation
coefficient. This metric measures the linear correlation between two variables and is used in numerous
works [55], [92], [122], [135] to filter out redundant features by removing features that have a strong
correlation with an already selected feature. It has also been used to quantify the relation of the
selected features in regression. One obvious drawback of using Pearson correlation as a feature ranking
mechanism is that it is only sensitive to a linear relationship.
Another approach for correlation estimation is mutual information [131], [136], which quantifies how
much information of one variable (or feature) can be obtained through another variable (feature). Like
the correlation coefficient, mutual information can be used to remove redundant features. For example,
if the information of feature $x$ can be largely obtained through another existing feature $y$, feature
$x$ can then be taken out from the feature set without losing much information on the reduced feature
set.
Both correlation coefficient and mutual information evaluate each feature independently with respect to
the prediction. A different approach is to utilize regression analysis for feature ranking. The
underlying principle of regression analysis is that if the prediction is the outcome of a regression
model based on the features, then the most important features should have the highest weights (or
coefficients) in the model, while features uncorrelated with the output variables should have weights
close to zero. For example, least absolute shrinkage and selection operator (LASSO) regression analysis
is used in [137] to remove less useful features to build a compiler-based model to predict performance.
LASSO has also been used for feature selection to tune the compiler heuristics for the TRIPS processor
[138].
In general, feature selection remains an open problem for machine learning, and researchers often follow
a “trial-and-error” approach to test a range of methods and feature candidates. This makes automatic
feature selection frameworks like FEAST [139] and HERCULES [140] attractive. The former framework
employs a range of existing feature selection methods to select useful candidate features, while the
latter searches for the most important static code features from a set of predefined patterns for loops.

2) Feature Dimensionality Reduction: While feature selection allows us to select the most important
features, the resulting feature set can still be too large to train a good model, especially when we
only have a small number of training examples. By reducing the number of dimensions, the learning
algorithm can often perform more efficiently on a limited training data set. Dimension reduction is also
important for some machine learning algorithms such as KNN to avoid the effect of the curse of
dimensionality [141].
PCA is a well-established feature reduction technique [142]. It uses orthogonal linear transformations
to reduce the dimensionality of a set of variables, i.e., features in our case.

Fig. 14. Using PCA to reduce dimensionality of a 3-D feature space. The principal components are first
computed (a). Then, the first two principal components ($PC_1$ and $PC_2$) are selected to represent the
original 3-D feature space on a new 2-D space (b).

Fig. 14 demonstrates the use of PCA to reduce the number of dimensions. The input in this example is a
3-D space defined by $M_1$, $M_2$, and $M_3$, as shown in Fig. 14(a). Three components, $PC_1$, $PC_2$,
and $PC_3$, which account for the variance of the data, are first calculated. Here, $PC_1$ and $PC_2$
contribute most to the variance of the data and $PC_3$ accounts for the least variance. Using only
$PC_1$ and $PC_2$, one can transform the original, 3-D space into a new, 2-D coordinate
system [as illustrated in Fig. 14(b)] while preserving much of the variance of the original data.
PCA has been used in many prior compiler research works for feature reduction [17], [25], [55], [92],
[95]–[97], [143]. It has also been used in prior works to visualize the working mechanism of a machine
learning model, e.g., to show how benchmarks can be grouped in the feature space [123], by projecting
features from a high-dimensional space into a 2-D space.
We want to stress that PCA does not select some features and discard the others. Instead, it linearly
combines the original features to construct new features that can summarize the list of the original
features. PCA is useful when there is some redundancy in the raw features, i.e., some of the features
are correlated with one another. Similar feature reduction methods include factor analysis and linear
discriminant analysis (LDA), which all try to reduce the number of features by linearly combining
multiple raw features. However, PCA seems to be the most popular feature reduction method used in
compiler research, probably due to its simplicity.
An alternative way of reducing the number of features used is via an autoencoder [144]. It is a neural
network that finds a representation (encoding) for a set of data, by dimensionality reduction.
Autoencoders work by learning an encoder and a decoder from the input data. The encoder tries to
compress the original input into a low-dimensional representation, while the decoder tries to
reconstruct the original input based on the low-dimensional representations generated by the encoder. As
a result, the autoencoder has been widely used to remove the data noise as well as to reduce the data
dimension [145].
Autoencoders have been applied to various natural language processing tasks [99], often being used
together with DNNs. Recently, they have been employed to model program source code to obtain a compact
set of features that can characterize the input program source [78], [146]–[149].

VI. SCOPE
Machine learning has been used to solve a wide range of problems, from the early successful work of
selecting compiler flags for sequential programs, to recent works on scheduling and optimizing parallel
programs on heterogeneous multicores. In this section, we review the types of problems that have been
exploited in prior works.

A. Optimizing Sequential Programs
Early works for machine learning in compilers look at how, or if, a compiler optimization should be
applied to a sequential program. Some of the previous studies build supervised classifiers to predict
the optimal loop unroll factor [52], [70] or to determine whether a function should be inlined [29],
[35]. These works target a fixed set of compiler options, by representing the optimization problem as a
multiclass classification problem, where each compiler option is a class. For example, Leather et al.
[70] considered a loop unroll factor between 0 and 15 (16 configurations in total), treating each
candidate unroll factor as a class; they compiled and profiled each training program by trying all 16
configurations to find out the best loop unroll factor for each program, and then learned a decision
tree model from the training data.
There are other compiler problems where the number of possible options is massive. For instance, the
work presented in [55] considers 54 code transformations of GCC. While these options are only a subset
of the hundreds of transformations provided by GCC, the resulting combinatorial compiler configurations
lead to a space of approximately $10^{34}$. Although it is possible to build a classifier to directly
predict the optimal setting from a large space, to learn an effective model would require a large volume
of training programs in order to have an adequate sampling over the space. Doing so is difficult because
1) there are only a few dozen common benchmarks available; and 2) compiler developers need to generate
the training data themselves. EAs such as genetic search are often used to explore a large design space
(see also Section IV-C1). Prior works have used EAs to solve the phase ordering problem (i.e., in which
order a set of compiler transformations should be applied) [150]–[152], determining the compiler flags
during iterative compilation [153]–[156], selecting loop transformations [157], tuning algorithmic
choices [11], [103], etc.

B. Optimizing Parallel Programs
How to effectively optimize parallel programs has received significant attention in the past decade,
largely because the hardware industry has adopted multicore design to avoid the power wall [158]. While
multicore and many-core architectures provide the potential for high-performance and energy-efficient
computing, the potential performance can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. Without this, the myriad cores
on multicore processors and their specialized processing elements will sit idle or be poorly utilized.
To this end, researchers have extended the reach of machine learning to optimize parallel programs.
A line of research in parallel program optimization is parallelism mapping. That is, given an already
parallelized program, how to map the application parallelism to match the underlying hardware to make
the program run as fast as possible or be as energy efficient as possible. Zhang et al. developed a
decision-tree-based approach to predict the scheduling policy to use for an OpenMP parallel region
[159]. The work presented in [46] employs two machine learning techniques to predict the optimal number
of threads as well as the scheduling policy to use for an OpenMP
parallel loop. Specifically, it uses a regression-based ANN model to predict the speedup of a parallel
loop when it runs with a given number of threads (to search for the optimal number of threads), and an
SVM classifier to predict the scheduling policy. There are also works that use machine learning to
determine the optimum degree of parallelism for transactional memory [160] and hardware resource
allocation [161], or to select a code version from a pool of choices to use [162]. Castro et al.
developed a decision tree classifier to predict the thread mapping strategy in the context of software
transactional memory [163]. Jung et al. constructed an ANN-based predictor to select an effective data
structure on a specific microarchitecture [164].
The work presented in [92] and [165] is a unique approach for applying machine learning to map complex
parallel programs with unbounded parallel graph structures. The work considers the question of finding
the optimal graph structure of a streaming program. The idea was that rather than trying to predict a
sequence of transformations over an unbounded graph, where legality and consistency are a real problem,
we should consider the problem from the dual feature space. The work showed that it is possible to
predict the best target feature (i.e., the characteristics that an ideal transformed program should
have) which then can be used to evaluate the worth of candidate transformed graphs (without compiling
and profiling the resulting graphs) in the original feature space.
The Petabricks project [103], [166], [167] takes an evolutionary approach for program tuning. The
Petabricks compiler employs genetic search algorithms to tune algorithmic choices. Due to the expensive
overhead of the search, much of autotuning is done at static compile time. Their work shows that one can
utilize the idle processors on a multicore system to perform online tuning [168], where half of the
cores are devoted to a known safe program configuration, while the other half are used for an
experimental program configuration. In this way, when the results of the faster configuration are
returned, the slower version will be terminated.
The idea of combining compile-time knowledge and runtime information to achieve better optimizations has
been exploited by the ADAPT compiler [169]. Using the ADAPT compiler, users describe what optimizations
are available and provide heuristics for applying these optimizations. The compiler then reads these
descriptions and generates application-specific runtime systems to apply the heuristics. Runtime code
tuning is also exploited by Active Harmony [170], which utilizes the computing resources in HPC systems
to evaluate different code variants on different nodes to find the best performing version.
There is also an extensive body of work on how to optimize programs on heterogeneous multicore systems.
One of the problems for heterogeneous multicore optimization is to determine when and how to use the
heterogeneous processors. Researchers have used machine learning to build classifiers to determine which
processor to use [68] and at which clock frequency the processor should operate [80], [171]. Others used
regression techniques to build curve fitting models to search for the sweet spot for work partitioning
among processors [38] or a tradeoff of energy and performance [172].
Another line of research combines compiler-based analysis and machine learning to optimize programs in
the presence of competing workloads. This research problem is important because programs rarely run in
isolation and must share the computing resources with other corunning workloads. In [173] and [174], an
ANN model based on static code features and runtime information was built to predict the number of
threads to use for a target program when it runs with external workloads. Later, in [118], an
ensemble-learning-based approach was used, which leads to significantly better performance over [173].
In [118], several models are first trained offline; and then one of the models is selected at runtime,
taking into consideration the competing workloads and available hardware resources. The central idea is
that instead of using a single monolithic model, we can use multiple models where each model is
specialized for modeling a subset of applications or a particular runtime scenario. Using this approach,
a model is used when its predictions are effective.
Some recent works developed machine learning models based on static code features and dynamic runtime
information to schedule OpenCL programs in the presence of GPU contention. The work presented in [175]
uses SVM classification to predict the work partition ratio between the CPU and GPU when multiple
programs are competing to run on a single GPU. The work described in [39] aims to improve the overall
system throughput when there are multiple OpenCL programs competing to run on the GPU. They developed an
ANN model to predict the potential speedup for running an OpenCL kernel on the GPU. The speedup
prediction is then used as a proxy to determine which of the waiting OpenCL tasks get to run on the GPU
and in what order.
The approaches presented in [176] and [177] target task colocation in a data center environment. They
use compiler-based code transformations to reduce the contention for multiple corunning tasks. A linear
regression model was employed to calculate the contention score of code regions based on performance
counter values. Then, a set of compiler-based code transformations is applied to reduce the resource
demands of highly contentious code.

C. Other Research Problems
Many works have demonstrated that machine learning is a powerful technique in performance and cost
modeling [47], [178]–[180], and in task and resource scheduling [161], [181]–[183]. We envision that
many of these techniques can be used to provide evidence to support runtime program optimizations
through, e.g., just-in-time compilation.
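
The idea from Section VI-B of using a predicted speedup as a proxy for GPU scheduling [39] can be sketched as follows. This is only one plausible policy under stated assumptions: the threshold of 1.0, the "highest predicted speedup first" ordering, the feature values, and the `toy_predictor` stand-in (a fixed linear function rather than a trained ANN) are all hypothetical and are not taken from [39].

```python
from typing import Callable, Dict, List, Tuple


def order_gpu_tasks(waiting: Dict[str, List[float]],
                    predict_speedup: Callable[[List[float]], float],
                    min_speedup: float = 1.0) -> Tuple[List[str], List[str]]:
    """Split waiting kernels into a GPU queue and a CPU list.

    Kernels whose predicted GPU speedup exceeds min_speedup are queued for the
    GPU, highest prediction first; the remaining kernels stay on the CPU.
    """
    scored = {name: predict_speedup(feats) for name, feats in waiting.items()}
    gpu_queue = sorted((name for name, s in scored.items() if s > min_speedup),
                       key=lambda name: scored[name], reverse=True)
    cpu_tasks = [name for name in waiting if scored[name] <= min_speedup]
    return gpu_queue, cpu_tasks


def toy_predictor(feats: List[float]) -> float:
    """Stand-in for a trained speedup model (assumption: two made-up features)."""
    return 0.5 + 4.0 * feats[0] - 1.0 * feats[1]


# Hypothetical waiting kernels with made-up feature vectors.
waiting = {"matmul": [0.9, 0.1], "reduce": [0.3, 0.4], "scan": [0.1, 0.5]}
gpu_queue, cpu_tasks = order_gpu_tasks(waiting, toy_predictor)
print(gpu_queue, cpu_tasks)
```

Keeping the predictor behind a plain callable means the scheduling policy can be changed or evaluated independently of whichever model supplies the speedup estimates.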

While not directly targeting code optimization, compiler-based code analysis and machine learning
techniques have been used in conjunction to solve various software engineering tasks. These include
detecting code similarities [184], [185], automatic comment generation [186], mining API usage patterns
[187], [188], predicting program properties [189], code de-obfuscation for malware detection [190], etc.
It is worth mentioning that many of these recent works show that the past development knowledge
extracted from large code bases such as GitHub is valuable for learning an effective model. There were
two recent studies performed by Cummins et al., which mine GitHub to synthesize OpenCL benchmarks [148]
and extract code features from source code [78]. Both studies demonstrate the usefulness of large code
bases and deep learning techniques for learning predictive models for compiler optimizations. We
envision that the rich information in large open source code bases could provide a powerful knowledge
base for training machine learning models to solve compiler optimization problems, and deep learning
could be used as an effective tool to extract such knowledge from massive program source code.

VII. DISCUSSION
One of the real benefits of machine-learning-based approaches is that they force an empirically driven
approach to compiler construction. New models have to be based on empirical data which can then be
verified by independent experimentation. This experiment, hypothesis, test cycle is well known in the
physical sciences but is a relatively new addition to compiler construction.
As machine-learning-based techniques require a sampling of the optimization space for training data, we
typically know the best optimization for any program in the training set. If we exclude this benchmark
from training, we therefore have access to an upper bound on performance or oracle for this program.
This immediately lets us know how good existing techniques are. If they are 50% of this optimum or 95%
of this optimum, this immediately tells us whether the problem is worth exploring.
Furthermore, we can construct naive techniques, e.g., a random optimization, and see its performance. If
it is performed a number of times, it will have an expected value of the mean of the optimization
speedups. We can then demand that any new heuristic should outperform this, though in our experience
there have been cases where state-of-the-art work actually performed worse than random.

A. Not a Panacea
This paper has, by and large, been very upbeat about the use of machine learning. However, there are a
number of hurdles to overcome to make it a practical reality and this opens up new questions about
optimization.
Training cost is an issue that many find alarming. In practice, the cost is much less than that of a
compiler writer, and techniques such as active learning can be employed to reduce the overhead of
training data generation [191]–[194]. Although it is true to say that generating many differently
compiled programs and executing and timing them are entirely automatic, finding the right data requires
careful consideration. If the optimizations explored have little positive performance impact on the
programs, then there is nothing worth learning.
The most immediate problem continues to be gathering enough high-quality training data. Although there
are numerous benchmark sites publicly available, the number of programs available is relatively sparse
compared to the number that a typical compiler will encounter in its lifetime. This is particularly true
in specialist domains where there may not be any public benchmarks. Automatic benchmark generation work
will help here, but existing approaches do not guarantee that the generated benchmarks effectively
represent the design space. Therefore, the larger issue of the structure of the program space remains.
A really fundamental problem is that if we build our optimization models based purely on empirical data,
then we must guarantee that these data are correct and representative; we must learn the signal, not the
noise. Peer review of a machine learning approach is difficult. Black box modeling prevents the quality
of the model from being questioned unlike handcrafted heuristics. In a sense, reviewers now have to
scrutinize that the experiments were fairly done. This means all training and test data must be publicly
available for scrutiny. This is common practice in other empirical sciences. The artefact evaluation
committee is an example of this [195], [196].
Although the ability to automatically learn how to best optimize an application and adapt to change is a
big step forward, machine learning can only learn from what is provided by the compiler writer. Machine
learning can neither invent new program transformations to apply nor derive analysis that determines
whether a transformation is legal; all of this is beyond its scope.

B. Will This Put Compiler Writers Out of a Job?
In fact, machine-learning-based compilation will paradoxically lead to a renaissance in compiler
optimization. Compilers have become so complex that adding a new optimization or compiler phase can lead
to performance regressions. This, in turn, has led to a conservative mindset where new transformations
are not considered if they may rock the boat. The core issue is that systems are so complex that it is
impossible to know for sure when to use such an optimization. Machine learning can remove this
uncertainty by automatically determining when an optimization is profitable. This now frees the compiler
writer to develop ever more sophisticated techniques. He/she does not need to worry about how they
interfere with other optimizations; machine learning looks after this. We can now develop optimizations
that will typically only work for specific domains,

and not worry about coordinating their integration into a general purpose system. It allows different communities to develop novel optimizations and naturally integrate them. So rather than closing down the opportunity for new ideas, it opens up new vistas.

C. Open Research Directions

Machine learning has demonstrated its utility as a means of automating compiler profitability analysis. It will continue to be used for more complex optimization problems and is likely to be the default approach to selecting compiler optimizations in the coming decade.

The open research directions go beyond predicting the best optimizations to apply. One central issue is what the program space looks like. We know that programs with linear array accesses inside perfect loop nests need different treatment compared to, say, distributed graph processing programs. If we could have a map that allows us to measure distances between programs, then we could see whether there are regions that are well served by compiler characterization and other regions that are sparse and currently ignored. If we could do the same for hardware, then we may be better able to design hardware likely to be of use for emerging applications.

Can machine learning also be applied to compiler analysis? For instance, is it possible to learn dataflow or points-to analysis? As deep learning has the ability to automatically construct features, can we find a set of features that are common across all optimizations and analyses? Can we learn the ideal compiler intermediate representation? There is a wide range of interesting research questions that remain unexplored.

VIII. CONCLUSION

This paper has introduced machine-learning-based compilation and described its power in determining an evidence-based approach to compiler optimization. It is the latest stage in 50 years of compiler automation. Machine-learning-based compilation is now a mainstream compiler research area and, over the last decade or so, has generated a large amount of academic interest and papers. While it is impossible to provide a definitive catalog of all research, we have tried to provide a comprehensive and accessible survey of the main research areas and future directions. Machine learning is not a panacea. It can only learn from the data we provide. Rather than dumbing down the role of compiler writers, as some fear, it opens up the possibility of much greater creativity and new research areas.

REFERENCES

[1] J. Chipps, M. Koschmann, S. Orgel, A. Perlis, and J. Smith, “A mathematical language compiler,” in Proc. 11th ACM Nat. Meeting, 1956, pp. 114–117.
[2] P. B. Sheridan, “The arithmetic translator-compiler of the IBM FORTRAN automatic coding system,” Commun. ACM, vol. 2, no. 2, pp. 9–21, 1959.
[3] M. D. McIlroy, “Macro instruction extensions of compiler languages,” Commun. ACM, vol. 3, no. 4, pp. 214–220, 1960.
[4] A. Gauci, K. Z. Adami, and J. Abela. (2010). “Machine learning for galaxy morphology classification.” [Online]. Available: https://arxiv.org/abs/1005.0390
[5] H. Schoen, D. Gayo-Avello, P. T. Metaxas, E. Mustafaraj, M. Strohmaier, and P. Gloor, “The power of prediction with social media,” Internet Res., vol. 23, no. 5, pp. 528–543, 2013.
[6] Slashdot. (2009). IBM Releases Open Source Machine Learning Compiler. [Online]. Available: https://tech.slashdot.org/story/09/07/03/0143233/ibm-releases-open-source-machine-learning-compiler
[7] H. Massalin, “Superoptimizer: A look at the smallest program,” ACM SIGPLAN Notices, vol. 22, no. 10, pp. 122–126, 1987.
[8] J. Ivory, “I. On the method of the least squares,” Philos. Mag. J., Comprehending Various Branches Sci., Liberal Fine Arts, Agriculture, Manuf. Commerce, vol. 65, no. 321, pp. 3–10, 1825.
[9] R. J. Adcock, “A problem in least squares,” Analyst, vol. 5, no. 2, pp. 53–54, 1878.
[10] K. Datta et al., “Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures,” in Proc. ACM/IEEE Conf. Supercomput., Nov. 2008, pp. 1–12.
[11] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe, “Language and compiler support for auto-tuning variable-accuracy algorithms,” in Proc. Int. Symp. Code Generat. Optim. (CGO), Apr. 2011, pp. 85–96.
[12] J. Kurzak, H. Anzt, M. Gates, and J. Dongarra, “Implementation and tuning of batched Cholesky factorization and solve for NVIDIA GPUs,” IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 7, pp. 2036–2048, Jul. 2016.
[13] Y. M. Tsai, P. Luszczek, J. Kurzak, and J. Dongarra, “Performance-portable autotuning of OpenCL kernels for convolutional layers of deep neural networks,” in Proc. Workshop Mach. Learn. HPC Environ. (MLHPC), 2016, pp. 9–18.
[14] M. E. Lesk and E. Schmidt, “Lex—A lexical analyzer generator,” Tech. Rep., 1975.
[15] S. C. Johnson, Yacc: Yet Another Compiler-Compiler, vol. 32. Murray Hill, NJ, USA: Bell Laboratories, 1975.
[16] A. Monsifrot, F. Bodin, and R. Quiniou, “A machine learning approach to automatic production of compiler heuristics,” in Proc. Int. Conf. Artif. Intell. Methodol. Syst. Appl., 2002, pp. 41–50.
[17] A. Magni, C. Dubach, and M. O’Boyle, “Automatic optimization of thread-coarsening for graphics processors,” in Proc. 23rd Int. Conf. Parallel Archit. Compilation (PACT), 2014, pp. 455–466.
[18] S. Unkule, C. Shaltz, and A. Qasem, “Automatic restructuring of GPU kernels for exploiting inter-thread data locality,” in Proc. 21st Int. Conf. Compil. Construction (CC), 2012, pp. 21–40.
[19] V. Volkov and J. W. Demmel, “Benchmarking GPUs to tune dense linear algebra,” in Proc. ACM/IEEE Conf. Supercomput. (SC), Nov. 2008, pp. 1–11.
[20] Y. Yang, P. Xiang, J. Kong, M. Mantor, and H. Zhou, “A unified optimizing compiler framework for different gpgpu architectures,” ACM Trans. Archit. Code Optim., vol. 9, no. 2, p. 9, 2012.
[21] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2004, pp. 75–86.
[22] F. Bodin, T. Kisuki, P. Knijnenburg, M. O’Boyle, and E. Rohou, “Iterative compilation in a non-linear optimization space,” in Proc. Workshop Profile Feedback-Directed Compilation, 1998.
[23] P. M. Knijnenburg, T. Kisuki, and M. F. O’Boyle, “Combined selection of tile sizes and unroll factors using iterative compilation,” J. Supercomput., vol. 24, no. 1, pp. 43–67, 2003.
[24] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005.
[25] F. Agakov et al., “Using machine learning to focus iterative optimization,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2006, pp. 295–305.
[26] R. Nobre, L. G. A. Martins, and J. M. P. Cardoso, “A graph-based iterative compiler pass selection and phase ordering approach,” in Proc. 17th ACM SIGPLAN/SIGBED Conf. Lang. Compil. Tools Theory Embedded Syst. (LCTES), 2016, pp. 21–30.
[27] R. Leupers and P. Marwedel, “Function inlining under code size constraints for embedded processors,” in Dig. Tech. Papers IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 1999, pp. 253–256.
[28] K. D. Cooper, T. J. Harvey, and T. Waterman, “An adaptive strategy for inline substitution,” in Proc. Joint Eur. Conf. Theory Pract. Softw. 17th Int. Conf. Compil. Construction (CC/ETAPS), 2008, pp. 69–84.
[29] D. Simon, J. Cavazos, C. Wimmer, and S. Kulkarni, “Automatic construction of inlining heuristics using machine learning,” in Proc. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), 2013, pp. 1–12.
[30] P. Zhao and J. N. Amaral, “To inline or not to inline? Enhanced inlining decisions,” Lang. Compil. Parallel Comput., pp. 405–419, 2004.
[31] T. A. Wagner, V. Maverick, S. L. Graham, and M. A. Harrison, “Accurate static estimators for program optimization,” in Proc. ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 1994, pp. 85–96.
[32] V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first step towards software power minimization,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 1994, pp. 384–390.
[33] K. D. Cooper, P. J. Schielke, and D. Subramanian, “Optimizing for reduced code space using genetic algorithms,” in Proc. ACM SIGPLAN Workshop Lang. Compil. Tools Embedded Syst. (LCTES), 1999, pp. 1–9.
[34] M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O’Reilly, “Meta optimization: Improving compiler heuristics with machine learning,” in Proc. ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2003, pp. 77–90.
[35] J. Cavazos and M. F. P. O’Boyle, “Automatic tuning of inlining heuristics,” in Proc. ACM/IEEE Conf. Supercomput. (SC), Nov. 2005, p. 14.
[36] K. Hoste and L. Eeckhout, “Cole: Compiler optimization level exploration,” in Proc. 6th Annu. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), 2008, pp. 165–174.
[37] M. Kim, T. Hiroyasu, M. Miki, and S. Watanabe, “SPEA2+: Improving the performance of the Strength Pareto Evolutionary Algorithm 2,” in Proc. Int. Conf. Parallel Problem Solving from Nature, 2004, pp. 742–751.
[38] C.-K. Luk, S. Hong, and H. Kim, “Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping,” in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchit. (MICRO), Dec. 2009, pp. 45–55.
[39] Y. Wen, Z. Wang, and M. F. P. O’Boyle, “Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms,” in Proc. 21st Ann. IEEE Int. Conf. High Perform. Comput. (HiPC), Dec. 2014, pp. 1–10.
[40] E. A. Brewer, “High-level optimization via automated statistical modeling,” in Proc. 5th ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPOPP), 1995, pp. 80–91.
[41] K. Vaswani, M. J. Thazhuthaveetil, Y. N. Srikant, and P. J. Joseph, “Microarchitecture sensitive empirical models for compiler optimizations,” in Proc. Int. Symp. Code Generat. Optim. (CGO), Mar. 2007, pp. 131–143.
[42] B. C. Lee and D. M. Brooks, “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” in Proc. 12th Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2006, pp. 185–194.
[43] E. Park, L.-N. Pouche, J. Cavazos, A. Cohen, and P. Sadayappan, “Predictive modeling in a polyhedral optimization space,” in Proc. 9th Annu. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), Apr. 2011, pp. 119–129.
[44] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz, “Prediction models for multi-dimensional power-performance optimization on many cores,” in Proc. 17th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2008, pp. 250–259.
[45] K. Singh et al., “Comparing scalability prediction strategies on an SMP of CMPs,” in Proc. Eur. Conf. Parallel Process., 2010, pp. 143–155.
[46] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores: A machine learning based approach,” in Proc. 14th ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2009, pp. 75–84.
[47] Y. Kang et al., “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2017, pp. 615–629.
[48] L. Benini, A. Bogliolo, M. Favalli, and G. De Micheli, “Regression models for behavioral power estimation,” Integr. Comput.-Aided Eng., vol. 5, no. 2, pp. 95–106, 1998.
[49] S. K. Rethinagiri, R. B. Atitallah, and J.-L. Dekeyser, “A system level power consumption estimation for MPSoC,” in Proc. Int. Symp. Syst. Chip (SoC), 2011, pp. 56–61.
[50] S. Schürmans, G. Onnebrink, R. Leupers, G. Leupers, and X. Chen, “Frequency-aware ESL power estimation for ARM cortex-A9 using a black box processor model,” ACM Trans. Embedded Comput. Syst., vol. 16, no. 1, p. 26, 2016.
[51] M. Curtis-Maury et al., “Identifying energy-efficient concurrency levels using machine learning,” in Proc. IEEE Int. Conf. Cluster Comput., Sep. 2007, pp. 488–495.
[52] M. Stephenson and S. Amarasinghe, “Predicting unroll factors using supervised classification,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2005, pp. 123–134.
[53] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. P. O’Boyle, and O. Temam, “Rapidly selecting good compiler optimizations using performance counters,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2007, pp. 185–197.
[54] J. Cavazos and M. F. P. O’Boyle, “Method-specific dynamic compilation using logistic regression,” in Proc. 21st Annu. ACM SIGPLAN Conf. Object-Oriented Program. Syst. Lang. Appl. (OOPSLA), 2006, pp. 229–240.
[55] C. Dubach, J. Cavazos, B. Franke, G. Fursin, M. F. O’Boyle, and O. Temam, “Fast compiler optimization evaluation using code-feature based performance prediction,” in Proc. 4th Int. Conf. Comput. Frontiers (CF), 2007, pp. 131–142.
[56] T. Yuki, L. Renganarayanan, S. Rajopadhye, C. Anderson, A. E. Eichenberger, and K. O’Brien, “Automatic creation of tile size selection models,” in Proc. 8th Annu. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), 2010, pp. 190–199.
[57] A. M. Malik, “Optimal tile size selection problem using machine learning,” in Proc. Optim. Tile Size Selection Problem Using Mach. Learn., vol. 2. Dec. 2012, pp. 275–280.
[58] R. W. Moore and B. R. Childers, “Building and using application utility models to dynamically choose thread counts,” J. Supercomput., vol. 68, no. 3, pp. 1184–1213, 2014.
[59] Y. Liu, E. Z. Zhang, and X. Shen, “A cross-input adaptive framework for GPU program optimizations,” in Proc. IEEE Int. Symp. Parallel Distrib. Process., May 2009, pp. 1–10.
[60] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, “Using simpoint for accurate and efficient simulation,” in Proc. ACM SIGMETRICS Int. Conf. Meas. Modeling Comput. Syst., 2003, pp. 318–319.
[61] Y. Zhang, D. Meisner, J. Mars, and L. Tang, “Treadmill: Attributing the source of tail latency through precise load testing and statistical inference,” in Proc. 43rd Int. Symp. Comput. Archit. (ISCA), 2016, pp. 456–468.
[62] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee, “Methods of inference and learning for performance modeling of parallel applications,” in Proc. 12th ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2007, pp. 249–258.
[63] M. Curtis-Maury, J. Dzierwa, C. D. Antonopoulos, and D. S. Nikolopoulos, “Online power-performance adaptation of multithreaded programs using hardware event-based prediction,” in Proc. 20th Annu. Int. Conf. Supercomput. (ICS), 2006, pp. 157–166.
[64] P. E. Bailey, D. K. Lowenthal, V. Ravi, B. Rountree, M. Schulz, and B. R. de Supinski, “Adaptive configuration selection for power-constrained heterogeneous systems,” in Proc. 43rd Int. Conf. Parallel Process., 2014, pp. 371–380.
[65] J. L. Berral et al., “Towards energy-aware scheduling in data centers using machine learning,” in Proc. 1st Int. Conf. Energy-Efficient Comput. Netw. (e-Energy), 2010, pp. 215–224.
[66] D. D. Vento, “Performance optimization on a supercomputer with ctuning and the PGI compiler,” in Proc. 2nd Int. Workshop Adapt. Self-Tuning Comput. Syst. Exaflop Era (EXADAPT), 2012, pp. 12–20.
[67] P.-J. Micolet, A. Smith, and C. Dubach, “A machine learning approach to mapping streaming workloads to dynamic multicore processors,” ACM SIGPLAN Notices, vol. 51, no. 5, pp. 113–122, 2016.
[68] D. Grewe, Z. Wang, and M. F. P. O’Boyle, “Portable mapping of data parallel programs to OpenCL for heterogeneous systems,” in Proc. IEEE/ACM Int. Symp. Code Generation Optim. (CGO), Feb. 2013, pp. 1–10.
[69] H. Yu and L. Rauchwerger, “Adaptive reduction parallelization techniques,” in Proc. 14th Int. Conf. Supercomput. (ICS), 2000, pp. 66–77.
[70] H. Leather, E. Bonilla, and M. O’Boyle, “Automatic feature generation for machine learning based optimizing compilation,” in Proc. 7th Annu. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), Mar. 2009, pp. 81–91.
[71] Z. Wang, D. Grewe, and M. F. P. O’Boyle, “Automatic and portable mapping of data parallel programs to opencl for GPU-based heterogeneous systems,” ACM Trans. Archit. Code Optim., vol. 11, no. 4, p. 42, 2014.
[72] Y. Ding, J. Ansel, K. Veeramachaneni, X. Shen, U.-M. O’Reilly, and S. Amarasinghe, “Autotuning algorithmic choice for input sensitivity,” in Proc. 36th ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2015, pp. 379–390.
[73] T. K. Ho, “Random decision forests,” in Proc. 3rd Int. Conf. Document Anal. Recognit. (ICDAR), vol. 1. 1995, pp. 278–282.
[74] T. G. Dietterich, “Ensemble methods in machine learning,” in Proc. 1st Int. Workshop Multiple Classifier Syst. (MCS), 2000, pp. 1–15.
[75] P. Lokuciejewski, F. Gedikli, P. Marwedel, and K. Morik, “Automatic WCET reduction by machine learning based heuristics for function inlining,” in Proc. 3rd Workshop Statistical Mach. Learn. Approaches Archit. Compilation (SMART), 2009, pp. 1–15.
[76] S. Benedict, R. S. Rejitha, P. Gschwandtner, R. Prodan, and T. Fahringer, “Energy prediction of OpenMP applications using random forest modeling approach,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshop, May 2015, pp. 1251–1260.
[77] R. S. Rejitha, S. Benedict, S. A. Alex, and S. Infanto, “Energy prediction of CUDA application instances using dynamic regression models,” Computing, vol. 99, no. 8, pp. 765–790, 2017.
[78] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, “End-to-end deep learning of optimization heuristics,” in Proc. 26th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2017, pp. 219–232.
[79] Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’Boyle, “Integrating profile-driven parallelism detection and machine-learning-based mapping,” ACM Trans. Archit. Code Optim., vol. 11, no. 1, p. 2, 2014.
[80] B. Taylor, V. S. Marco, and Z. Wang, “Adaptive optimization for OpenCL programs on embedded heterogeneous systems,” in Proc. 18th Annu. ACM SIGPLAN/SIGBED Conf. Lang. Compil. Tools Embedded Syst. (LCTES), 2017, pp. 11–20.
[81] P. Zhang, J. Fang, T. Tang, C. Yang, and Z. Wang, “Auto-tuning streamed applications on Intel Xeon Phi,” in Proc. 32nd IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2018.
[82] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “A predictive performance model for superscalar processors,” in Proc. 39th Annu. IEEE/ACM Int. Symp. Microarchit. (MICRO), Dec. 2006, pp. 161–170.
[83] A. Ganapathi, K. Datta, A. Fox, and D. Patterson, “A case for machine learning to optimize multicore performance,” in Proc. 1st USENIX Conf. Hot Topics Parallelism (HotPar), 2009, p. 1.
[84] E. Deniz and A. Sen, “Using machine learning techniques to detect parallel patterns of multi-threaded applications,” Int. J. Parallel Program., vol. 44, no. 4, pp. 867–900, 2016.
[85] Y. LeCun, Y. Bengio, and G. Hinton, Deep Learning, 2015.
[86] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[87] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[88] H. Lee, Y. Largman, P. Pham, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Proc. 22nd Int. Conf. Neural Inf. Process. Syst. (NIPS), 2009, pp. 1096–1104.
[89] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. (2017). “A survey of machine learning for big code and naturalness.” [Online]. Available: https://arxiv.org/abs/1709.06182
[90] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967, pp. 281–297.
[91] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Proc. 10th Int. Conf. Archit. Support Program. Lang. Operat. Syst. (ASPLOS), 2002, pp. 45–57.
[92] Z. Wang and M. F. O’Boyle, “Partitioning streaming parallelism for multi-cores: A machine learning based approach,” in Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2010, pp. 307–318.
[93] M. Newman, Networks: An Introduction. New York, NY, USA: Oxford Univ. Press, 2010.
[94] L. G. Martins, R. Nobre, A. C. B. Delbem, E. Marques, and J. A. M. Cardoso, “Exploration of compiler optimization sequences using clustering-based selection,” in Proc. SIGPLAN/SIGBED Conf. Lang. Compil. Tools Embedded Syst. (LCTES), 2014, pp. 63–72.
[95] L. Eeckhout, H. Vandierendonck, and K. D. Bosschere, “Workload design: Selecting representative program-input pairs,” in Proc. Int. Conf. Parallel Archit. Compilation Techn., 2002, pp. 83–94.
[96] Y. Chen et al., “Evaluating iterative optimization across 1000 datasets,” in Proc. 31st ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2010, pp. 448–459.
[97] A. H. Ashouri, G. Mariani, G. Palermo, and C. Silvano, “A Bayesian network approach for compiler auto-tuning for embedded processors,” in Proc. IEEE 12th Symp. Embedded Syst. Real-time Multimedia (ESTIMedia), Oct. 2014, pp. 90–97.
[98] A. Phansalkar, A. Joshi, and L. K. John, “Analysis of redundancy and application balance in the spec cpu2006 benchmark suite,” in Proc. 34th Annu. Int. Symp. Comput. Archit. (ISCA), 2007, pp. 412–423.
[99] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. 25th Int. Conf. Mach. Learn. (ICML), 2008, pp. 1096–1103.
[100] X. Gu, H. Zhang, D. Zhang, and S. Kim. (2016). “Deep API learning.” [Online]. Available: https://arxiv.org/abs/1605.08535
[101] B. Singer and M. Veloso, “Learning to construct fast signal processing implementations,” J. Mach. Learn. Res., vol. 3, pp. 887–919, Dec. 2002.
[102] X. Li, M. J. Garzaran, and D. Padua, “Optimizing sorting with genetic algorithms,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2005, pp. 99–110.
[103] J. Ansel et al., “Petabricks: A language and compiler for algorithmic choice,” in Proc. ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2009, pp. 38–49.
[104] M. Harman, W. B. Langdon, Y. Jia, D. R. White, A. Arcuri, and J. A. Clark, “The GISMOE challenge: Constructing the Pareto program surface using genetic programming to find better programs (keynote paper),” in Proc. 27th IEEE/ACM Int. Conf. Autom. Softw. Eng. (ASE), Sep. 2012, pp. 1–14.
[105] U. Garciarena and R. Santana, “Evolutionary optimization of compiler flag selection by learning and exploiting flags interactions,” in Proc. Genetic Evol. Comput. Conf. Companion (GECCO), 2016, pp. 1159–1166.
[106] M. Zuluaga, E. Bonilla, and N. Topham, “Predicting best design trade-offs: A case study in processor customization,” in Proc. Design Autom. Test Eur. Conf. Exhibit. (DATE), 2012, pp. 1030–1035.
[107] M. R. Jantz and P. A. Kulkarni, “Exploiting phase inter-dependencies for faster iterative compiler optimization phase order searches,” in Proc. Int. Conf. Compil. Archit. Synthesis Embedded Syst. (CASES), Oct. 2013, pp. 1–10.
[108] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing memory controllers: A reinforcement learning approach,” in Proc. IEEE 35th Int. Symp. Comput. Archit. (ISCA), Jun. 2008, pp. 39–50.
[109] B. Porter, M. Grieves, R. R. Filho, and D. Leslie, “Rex: A development platform and online learning approach for runtime emergent software systems,” in Proc. Usenix Conf. Symp. Oper. Syst. Design Implement., Nov. 2016, pp. 333–348.
[110] J. Rao, X. Bu, C.-Z. Xu, L. Wang, and G. Yin, “Vconf: A reinforcement learning approach to virtual machines auto-configuration,” in Proc. 6th Int. Conf. Autonom. Comput. (ICAC), 2009, pp. 137–146.
[111] M. G. Lagoudakis and M. L. Littman, “Algorithm selection using reinforcement learning,” in Proc. 7th Int. Conf. Mach. Learn. (ICML), 2000, pp. 511–518.
[112] N. Mishra, J. D. Lafferty, H. Hoffmann, and C. Imes, “CALOREE: Learning control for predictable latency and low energy,” in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2018, pp. 184–198.
[113] M. K. Emani and M. O’Boyle, “Change detection based parallelism mapping: Exploiting offline models and Online adaptation,” in Proc. 27th Int. Workshop Lang. Compil. Parallel Comput. (LCPC), 2014, pp. 208–223.
[114] Y. Li, “Deep reinforcement learning: An overview,” CoRR, 2017.
[115] J. Ansel et al., “Opentuner: An extensible framework for program autotuning,” in Proc. PACT, 2014, pp. 303–316.
[116] C. K. Williams and C. E. Rasmussen, “Gaussian processes for regression,” in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 514–520.
[117] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2006.
[118] M. K. Emani and M. O’Boyle, “Celebrating diversity: A mixture of experts approach for runtime mapping in dynamic environments,” in Proc. 36th ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2015, pp. 499–508.
[119] H. D. Nguyen and F. Chamroukhi. (2017). “An introduction to the practical and theoretical aspects of mixture-of-experts modeling.” [Online]. Available: https://arxiv.org/abs/1707.03538
[120] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits Syst. Mag., vol. 6, no. 3, pp. 21–45, Sep. 2006.
[121] L. Rokach, “Ensemble-based classifiers,” Artif. Intell. Rev., vol. 33, nos. 1–2, pp. 1–39, 2010.
[122] Y. Jiang et al., “Exploiting statistical correlations for proactive prediction of program behaviors,” in Proc. 8th Annu. IEEE/ACM Int. Symp. Code Generat. Optim. (CGO), Apr. 2010, pp. 248–256.
[123] V. S. Marco, B. Taylor, B. Porter, and Z. Wang, “Improving spark application


throughput via memory aware task co-location: A mixture of experts approach,” in Proc. ACM/IFIP/USENIX Middleware Conf., 2017, pp. 95–108.
[124] B. Singer and M. M. Veloso, “Learning to predict performance from formula modeling and training data,” in Proc. 7th Int. Conf. Mach. Learn. (ICML), 2000, pp. 887–894.
[125] E. Park, J. Cavazos, and M. A. Alvarez, “Using graph-based program characterization for predictive modeling,” in Proc. 10th Int. Symp. Code Generat. Optim. (CGO), 2012, pp. 196–206.
[126] A. M. Malik, “Spatial based feature generation for machine learning based optimization compilation,” in Proc. 9th Int. Conf. Mach. Learn. Appl., 2010, pp. 925–930.
[127] M. Burtscher, R. Nasre, and K. Pingali, “A quantitative study of irregular programs on GPUs,” in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Nov. 2012, pp. 141–151.
[128] Y. Luo, G. Tan, Z. Mo, and N. Sun, “Fast: A fast stencil autotuning framework based on an optimal-solution space model,” in Proc. 29th ACM Int. Conf. Supercomput., 2015, pp. 187–196.
[129] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A portable programming interface for performance evaluation on modern processors,” Int. J. High Perform. Comput. Appl., vol. 14, no. 3, pp. 189–204, 2000.
[130] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney, “Producing wrong data without doing anything obviously wrong!” in Proc. 14th Int. Conf. Archit. Support Program. Lang. Operat. Syst. (ASPLOS XIV), 2009, pp. 265–276.
[131] J. Cavazos et al., “Automatic performance model construction for the fast software exploration of new hardware designs,” in Proc. Int. Conf. Compil. Archit. Synthesis Embedded Syst. (CASES), 2006, pp. 24–34.
[132] S. Khan, P. Xekalakis, J. Cavazos, and M. Cintra, “Using predictive modeling for cross-program design space exploration in multicore systems,” in Proc. IEEE 16th Int. Conf. Parallel Archit. Compilation Techn., Sep. 2007, pp. 327–338.
[133] M. Namolaru, A. Cohen, G. Fursin, A. Zaks, and A. Freund, “Practical aggregation of semantical program properties for machine learning based optimization,” in Proc. Int. Conf. Compil. Archit. Synth. Embedded Syst. (CASES), 2010, pp. 197–206.
[134] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and
[138] M. E. Taylor, K. E. Coons, B. Robatmili, B. A. Maher, D. Burger, and K. S. McKinley, “Evolving compiler heuristics to manage communication and contention,” in Proc. 24th Conf. Artif. Intell. (AAAI), 2010, pp. 1690–1693.
[139] P.-S. Ting, C.-C. Tu, P.-Y. Chen, Y.-Y. Lo, and S.-M. Cheng. (2016). “FEAST: An automated feature selection framework for compilation tasks.” [Online]. Available: https://arxiv.org/abs/1610.09543
[140] E. Park, C. Kartsaklis, and J. Cavazos, “HERCULES: Strong patterns towards more intelligent predictive modeling,” in Proc. 43rd Int. Conf. Parallel Process., 2014, pp. 172–181.
[141] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ‘nearest neighbor’ meaningful?” in Proc. Int. Conf. Database Theory, 1999, pp. 217–235.
[142] I. Fodor, “A survey of dimension reduction techniques,” Lawrence Livermore Nat. Lab., Tech. Rep., 2002.
[143] J. Thomson, M. F. O’Boyle, G. Fursin, and B. Franke, “Reducing training time in a one-shot machine learning-based compiler,” Lang. Compil. Parallel Comput., vol. 5898, pp. 399–407, 2009.
[144] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[145] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.-R. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Proc. 11th Annu. Conf. Int. Speech Commun. Assoc., 2010.
[146] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for programming language processing,” in Proc. AAAI, 2016, pp. 1287–1293.
[147] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in Proc. ASE 31st IEEE/ACM Int. Conf. Autom. Softw. Eng., 2016, pp. 87–98.
[148] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, “Synthesizing benchmarks for predictive modeling,” in Proc. Int. Symp. Code Generat. Optim. (CGO), 2017, pp. 86–99.
[149] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk. (2017). “Sorting and transforming program repair ingredients via deep learning code similarities.” [Online]. Available: https://arxiv.org/abs/1707.04742
[150] L. Almagor et al., “Finding effective compilation sequences,” in Proc. ACM SIGPLAN/SIGBED Conf. Lang. Compil. Tools
IEEE Trans. Evol. Comput., vol. 15, no. 4, pp. 515–538, Aug. 2011.
[155] G. Fursin and O. Temam, “Collective optimization: A practical collaborative approach,” ACM Trans. Archit. Code Optim., vol. 7, no. 4, p. 20, Dec. 2010.
[156] J. Kukunas, R. D. Cupper, and G. M. Kapfhammer, “A genetic algorithm to improve linux kernel performance on resource-constrained devices,” in Proc. 12th Annu. Conf. Companion Genetic Evol. Comput. (GECCO), 2010, pp. 2095–2096.
[157] L.-N. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos, “Iterative optimization in the polyhedral model: Part II, multidimensional time,” in Proc. 29th ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2008, pp. 90–100.
[158] K. Asanovic et al., “The landscape of parallel computing research: A view from Berkeley,” Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2006-183, 2006.
[159] Y. Zhang, M. Voss, and E. S. Rogers, “Runtime empirical selection of loop schedulers on hyperthreaded SMPs,” in Proc. 19th IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), Apr. 2005, p. 44b.
[160] D. Rughetti, P. D. Sanzo, B. Ciciani, and F. Quaglia, “Machine learning-based self-adjusting concurrency in software transactional memory systems,” in Proc. IEEE 20th Int. Symp. Modeling Anal. Simulation Comput. Telecommun. Syst., Aug. 2012, pp. 278–285.
[161] C. Delimitrou and C. Kozyrakis, “Quasar: Resource-efficient and Qos-aware cluster management,” in Proc. 19th Int. Conf. Archit. Support Program. Lang. Operat. Syst. (ASPLOS), 2014, pp. 127–144.
[162] X. Chen and S. Long, “Adaptive multi-versioning for openmp parallelization via machine learning,” in Proc. 15th Int. Conf. Parallel Distrib. Syst. (ICPADS), 2009, pp. 907–912.
[163] M. Castro, L. F. W. Góes, C. P. Ribeiro, M. Cole, M. Cintra, and J. F. Méhaut, “A machine learning-based approach for thread mapping on transactional memory applications,” in Proc. 18th Int. Conf. High Perform. Comput., 2011, pp. 1–10.
[164] C. Jung, S. Rus, B. P. Railing, N. Clark, and S. Pande, “Brainy: Effective selection of data structures,” in Proc. 32nd ACM SIGPLAN Conf. Program. Lang. Design Implement. (PLDI), 2011, pp. 86–97.
[165] Z. Wang and M. F. P. O’Boyle, “Using machine learning to partition streaming programs,” ACM Trans. Archit. Code Optim., vol. 10, no. 3, p. 20, 2013.
[166] C. Chan, J. Ansel, Y. L. Wong,
Statistics). Secaucus, NJ, USA: Springer- Embedded Syst. (LCTES), 2004, pp. 231–239. S. Amarasinghe, and A. Edelman,
Verlag, 2006. [151] K. D. Cooper et al., “ACME: Adaptive “Autotuning multigrid with petabricks,” in
[135] K. Hoste, A. Phansalkar, L. Eeckhout, compilation made efficient,” in Proc. ACM Proc. ACM/IEEE Conf. Supercomput. (SC),
A. Georges, L. K. John, and K. de SIGPLAN/SIGBED Conf. Lang. Compil. 2009, Art. no. 5.
Bosschere, “Performance prediction based Embedded Syst. (LCTES), 2005, pp. 69–77. [167] M. Pacula, J. Ansel, S. Amarasinghe, and
on inherent program similarity,” in Proc. [152] A. H. Ashouri, A. Bignoli, G. Palermo, U.-M. O’Reilly, “Hyperparameter tuning in
IEEE Int. Conf. Parallel Archit. Compilation C. Silvano, S. Kulkarni, and J. Cavazos, bandit-based adaptive operator selection,”
Techn. (PACT), Swep. 2006, pp. 114–122. “MiCOMP: Mitigating the compiler phase- in Proc. Eur. Conf. Appl. Evol. Comput.
[136] N. E. Rosenblum, B. P. Miller, and X. Zhu, ordering problem using optimization sub- (EuroSys), 2012, pp. 73–82.
“Extracting compiler provenance from sequences and machine learning,” ACM [168] J. Ansel, “Siblingrivalry: Online autotuning
program binaries,” in Proc. 9th ACM Trans. Archit. Code Optim., vol. 14, no. 3, through local competitions,” in Proc. Int.
SIGPLAN-SIGSOFT Workshop Program Anal. p. 29, 2017. Conf. Compil., Archit. Synth. Embedded Syst.
Softw. Tools Eng. (PASTE), 2010, pp. 21–28. [153] K. D. Cooper, D. Subramanian, and (CASES), 2012, pp. 91–100.
[137] A. Bhattacharyya, G. Kwasniewski, and L. Torczon, “Adaptive optimizing compilers [169] M. J. Voss and R. Eigemann, “High-level
T. Hoefler, “Using compiler techniques to for the 21st century,” J. Supercomput., adaptive program optimization with
improve automatic performance vol. 23, no. 1, pp. 7–22, 2002. adapt,” in Proc. 8th ACM SIGPLAN Symp.
modeling,” in Proc. Int. Conf. Parallel Archit. [154] D. R. White, A. Arcuri, and J. A. Clark, Principles Pract. Parallel Program. (PPoPP),
Compilation (PACT), 2015, pp. 468–479. “Evolutionary improvement of programs,” 2001, pp. 93–102.


[170] A. Tiwari and J. K. Hollingsworth, “Online Cloud Grid Comput. (CCGRID), 2010, [187] J. Fowkes and C. Sutton, “Parameter-free
adaptive code generation and tuning,” in pp. 495–504. probabilistic api mining across GitHub,” in
Proc. IEEE Int. Parallel Distrib. Process. [179] S. Venkataraman, Z. Yang, M. J. Franklin, Proc. 24th ACM SIGSOFT Int. Symp. Found.
Symp. (IPDPS), May 2011, pp. 879–892. B. Recht, and I. Stoica, “Ernest: Efficient Softw. Eng. (FSE), 2016, pp. 254–265.
[171] J. Ren, L. Gao, H. Wang, and Z. Wang, performance prediction for large-scale [188] A. T. Nguyen et al., “API code
“Optimise Web browsing on heterogeneous advanced analytics,” in Proc. NSDI, 2016, recommendation using statistical learning
mobile platforms: A machine learning pp. 363–378. from fine-grained changes,” in Proc. 24th
based approach,” in Proc. IEEE Int. Conf. ACM SIGSOFT Int. Symp. Found. Softw. Eng.
[180] S. Sankaran, “Predictive modeling based
Comput. Commun. (INFOCOM), May 2017, (FSE), 2016, pp. 511–522.
power estimation for embedded multicore
pp. 1–9.
systems,” in Proc. ACM Int. Conf. Comput. [189] V. Raychev, P. Bielik, and M. Vechev,
[172] Y. Zhu and V. J. Reddi, “High-performance Frontiers (CF), 2016, pp. 370–375. “Probabilistic model for code with decision
and energy-efficient mobile Web browsing trees,” in Proc. ACM SIGPLAN Int. Conf.
on big/little systems,” in Proc. HPCA, [181] Y. Zhang, M. A. Laurenzano, J. Mars, and
L. Tang, “Smite: Precise QoS prediction on Object-Oriented Program. Syst. Lang. Appl.
Feb. 2013, pp. 13–24. (OOPSLA), 2016, pp. 731–747.
real-system smt processors to improve
[173] Z. Wang, M. F. P. O’Boyle, and utilization in warehouse scale computers,” [190] B. Bichsel, V. Raychev, P. Tsankov, and
M. K. Emani, “Smart, adaptive mapping of in Proc. 47th Annu. IEEE/ACM Int. Symp. M. Vechev, “Statistical deobfuscation of
parallelism in the presence of external Microarchit. (MICRO-47), Dec. 2014, pp. Android applications,” in Proc. ACM
workload,” in Proc. IEEE/ACM Int. Symp. 406–418. SIGSAC Conf. Comput. Commun. Secur.
Code Generat. Optim. (CGO), Feb. 2013, (CCS), 2016, pp. 343–355.
pp. 1–10. [182] V. Petrucci, “Octopus-man: QoS-driven
task management for heterogeneous [191] P. Balaprakash, R. B. Gramacy, and
[174] D. Grewe, Z. Wang, and M. F. P. O’Boyle, S. M. Wild, “Active-learning-based surrogate
multicores in warehouse-scale computers,”
“A workload-aware mapping approach for models for empirical performance tuning,”
in Proc. IEEE 21st Int. Symp. High Perform.
data-parallel programs,” in Proc. 6th Int. in Proc. IEEE Int. Conf. Cluster Comput.
Comput. Archit. (HPCA), Feb. 2015, pp.
Conf. High Perform. Embedded Archit. (CLUSTER), Sep. 2013, pp. 1–8.
Compil. (HiPEAC), 2011, pp. 117–126. 246–258.
[183] N. J. Yadwadkar, B. Hariharan, [192] W. F. Ogilvie, P. Petoumenos, Z. Wang, and
[175] D. Grewe, Z. Wang, and M. F. O’Boyle, H. Leather, “Fast automatic heuristic
“OpenCL task partitioning in the presence J. E. Gonzalez, and R. Katz, “Multi-task
learning for straggler avoiding predictive construction using active learning,” in
of GPU contention,” in Proc. Int. Workshop Proc. Int. Workshop Lang. Compil. Parallel
Lang. Compil. Parallel Comput., 2013, job scheduling,” J. Mach. Learn. Res., vol. 17,
no. 1, pp. 3692–3728, 2016. Comput., 2014, pp. 146–160.
pp. 87–101.
[184] Y. David and E. Yahav, “Tracelet-based [193] M. Zuluaga, G. Sergent, A. Krause, and
[176] L. Tang, J. Mars, and M. L. Soffa,
code search in executables,” in Proc. M. Püschel, “Active learning for multi-
“Compiling for niceness: Mitigating
35th ACM SIGPLAN Conf. Program. Lang. objective optimization,” in Proc. Int. Conf.
contention for Qos in warehouse scale
Design Implement. (PLDI), 2014, Mach. Learn., 2013, pp. 462–470.
computers,” in Proc. 10th Int. Symp. Code
Generat. Optim. (CGO), 2012, pp. 1–12. pp. 349–360. [194] W. F. Ogilvie, P. Petoumenos, Z. Wang, and
[177] L. Tang, J. Mars, W. Wang, T. Dey, and [185] Y. David, N. Partush, and E. Yahav, H. Leather, “Minimizing the cost of
M. L. Soffa, “Reqos: Reactive static/ “Statistical similarity of binaries,” in Proc. iterative compilation with active learning,”
dynamic compilation for Qos in warehouse 37th ACM SIGPLAN Conf. Program. Lang. in Proc. Int. Symp. Code Generat. Optim.
scale computers,” in Proc. 18th Int. Conf. Design Implement. (PLDI), 2016, (CGO), 2017, pp. 245–256.
Archit. Support Program. Lang. Oper. Syst. pp. 266–280. [195] A. Evaluation. About Artifact Evaluation.
(ASPLOS), 2013, pp. 89–100. [186] E. Wong, T. Liu, and L. Tan, “Clocom: [Online]. Available: http://www.artifact-
[178] A. Matsunaga and J. A. B. Fortes, “On the Mining existing source code for automatic eval.org/about.html
use of machine learning to predict the time comment generation,” in Proc. IEEE 22nd [196] cTuning Foundation. Artifact Evaluation for
and resources consumed by applications,” Int. Conf. Softw. Anal. Evol. Reeng. (SANER), Computer Systems Research. [Online].
in Proc. 10th IEEE/ACM Int. Conf. Cluster Mar. 2015, pp. 380–389. Available: http://ctuning.org/ae/

ABOUT THE AUTHORS


Zheng Wang received the Ph.D. degree in computer science from The University of Edinburgh, Edinburgh, U.K., in 2011.
Currently, he is an Assistant Professor at Lancaster University, Lancaster, U.K., where he leads the Distributed Systems research group. From 2005 to 2007, he worked as an R&D Engineer at IBM China. His research focus is in the areas of parallel compilers, runtime systems, code security, and the application of machine learning to tackle the challenging optimization problems within these areas.
Prof. Wang received three best paper awards for his work on machine-learning-based compiler optimization (PACT'10, CGO'17, and PACT'17).

Michael O'Boyle is a Professor of Computer Science at the University of Edinburgh, Edinburgh, U.K. He is a founding member of HiPEAC, the Director of the ARM Research Centre of Excellence at Edinburgh, and the Director of the Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in Pervasive Parallelism. He is best known for his work in incorporating machine learning into compilation and parallelization, automating the design and construction of optimizing technology. He has published over 100 papers and received three best paper awards.
Prof. O'Boyle is a Senior Research Fellow of EPSRC and a Fellow of the British Computer Society (BCS). He was presented with the ACM International Symposium on Code Generation and Optimization (ACM CGO) Test of Time award in 2017.
