Machine_Learning_in_compiler_optimisation
Document Version:
Peer reviewed version
Published In:
Proceedings of the IEEE
Machine Learning in
Compiler Optimization
By Zheng Wang and Michael O’Boyle
0018-9219 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 1. A generic view of supervised machine learning in compilers. (a) Feature engineering. (b) Learning a model. (c) Deployment.
B. Learning a Model
The second step is to use training data to derive a model using a learning algorithm. This process is depicted in Fig. 1(b). Unlike other applications of machine learning, we typically generate our own training data using existing applications or benchmarks. The compiler developer will select training programs which are typical of the application domain. For each training program, we calculate the feature values, compile the program with different optimization options, and run and time the compiled binaries to discover the best performing option. This process produces, for each training program, a training instance that consists of the feature values and the optimal compiler option for the program.
The compiler developer then feeds these examples to a machine learning algorithm to automatically build a model. The learning algorithm’s job is to find from the training examples a correlation between the feature values and the optimal optimization decision. The learned model can then be used to predict, for a new set of features, what the optimal optimization option should be.
Because the performance of the learned model strongly depends on how well the features and training programs are chosen, the processes of feature engineering and training data generation often need to be repeated multiple times.
C. Deployment
In the final step, the learned model is inserted into the compiler to predict the best optimization decisions for new programs. This is demonstrated in Fig. 1(c). To make a prediction, the compiler first extracts the features of the input program, and then feeds the extracted feature values to the learned model to make a prediction.
The advantage of the machine-learning-based approach is that the entire process of building the model can be easily repeated whenever the compiler needs to target a new hardware architecture, operating system, or application domain. The model built is entirely derived from experimental results and is hence evidence based.
D. Example
As an example to illustrate these steps, consider thread coarsening [18] for GPU programs. This code transformation technique works by giving multiple work items (or work elements) to one single thread. It is similar to loop unrolling, but applied across parallel work items rather than across serial loop iterations.
Fig. 2(a) shows a simple OpenCL kernel where a thread operates on a work item of the 1-D input array, in, at a time. The work item to be operated on is specified by the value returned from the OpenCL get_global_id() API. Fig. 2(b) shows the transformed code after applying a thread-coarsening factor of two, where each thread processes two elements of the input array.
Fig. 2. An OpenCL thread coarsening example reproduced from [17]. The original OpenCL code is shown in (a) where each thread takes the square of one element of the input array. When coarsened by a factor of two, as shown in (b), each thread now processes two elements of the input array.
Thread coarsening can improve performance through increasing instruction-level parallelism [19], reducing the number of memory-access operations [20] and eliminating redundant computation when the same value is computed in every work item. However, it can also have several negative side effects, such as reducing the total amount of parallelism and increasing the register pressure, which can slow the program down. Determining when and how to apply thread coarsening is nontrivial, because the best coarsening factor depends on the target program and the hardware architecture that the program runs on [17], [19].
Magni et al. show that machine learning techniques can be used to automatically construct effective thread-coarsening heuristics across GPU architectures [17]. Their approach considers six coarsening factors (1, 2, 4, 8, 16, 32). The goal is to develop a machine-learning-based model to decide whether an OpenCL kernel should be coarsened on a specific GPU architecture and, if so, what is the best coarsening factor. Among many machine learning algorithms, they chose to use an artificial neural network to model the problem.² Constructing such a model follows the classical three-step supervised learning process, which is depicted in Fig. 1 and described in more detail as follows.
² In fact, Magni et al. employed a hierarchical approach consisting of multiple artificial neural networks [17]. However, these networks are trained using the same process.
1) Feature Engineering: To describe the input OpenCL kernel, Magni et al. use static code features extracted from the compiler’s intermediate representation. Specifically, they developed a compiler-based tool to obtain the feature values from the program’s LLVM bitcode [21]. They started from 17 candidate features. These include things like the number and types of instructions and memory level parallelism (MLP) within an OpenCL kernel. Table 1 gives the list of candidate features used in [17]. Typically, candidate features can be chosen based on developers’ intuitions, suggestions from prior works, or a combination of both. After choosing the candidate features, a statistical method called principal component analysis (PCA; see also Section IV-B) is applied to map the 17 candidate features into seven aggregated features, so that each aggregated feature is a linear combination of the original features. This technique is known as “feature dimension reduction,” which is discussed in Section V-D2. Dimension reduction helps eliminate redundant information among candidate features, allowing the learning algorithm to perform more effectively.
2) Learning the Model: For the work presented in [17], 16 OpenCL benchmarks were used to generate training data. To find out which of the six coarsening factors performs best for a given OpenCL kernel on a specific GPU architecture, we can apply each of the six factors to an OpenCL kernel and record its execution time. Since the optimal thread-coarsening factor varies across hardware architectures, this process needs to be repeated for each target architecture. In addition to finding the best performing coarsening factor, Magni et al. also extracted the aggregated feature values for each kernel. Applying these two steps on the training benchmarks results in a training data set where each training example is composed of the optimal coarsening factor and feature values for a training kernel. The training examples are then fed into a learning algorithm which tries to find a set of model parameters (or weights) so that overall prediction error on the training examples can be minimized. The output of the learning algorithm is an artificial neural network model whose weights are determined from the training data.
3) Deployment: The learned model can then be used to predict the optimal coarsening factor for unseen OpenCL programs. To do so, static source code features are first extracted from the target OpenCL kernel; the extracted feature values are then fed into the model which decides whether to coarsen or not and which coarsening factor should be used. The technique proposed in [17] achieves an average speedup between 1.11x and 1.33x across four GPU architectures and does not degrade performance on any benchmark.
III. METHODOLOGY
One of the key challenges for compilation is to select the right code transformation, or sequence of transformations, for a given program. This requires effectively evaluating the quality of a possible compilation option, e.g., how a code transformation will affect eventual performance.
A naive approach is to exhaustively apply each legal transformation option and then profile the program to collect the relevant performance metric. Given that many compiler problems have a massive number of options, exhaustive search and profiling is infeasible, prohibiting the use of this approach at scale. This search-based approach to compiler optimization is known as iterative compilation [22], [23] or autotuning [10], [24]. Many techniques have been proposed to reduce the cost of searching a large space [25], [26]. In certain cases, the overhead is justifiable if the program in question is to be used many times, e.g., in a deeply embedded device. However, its main limitation remains: it only finds a good optimization for one program and does not generalize into a compiler heuristic.
There are two main approaches for solving the problem of scalably selecting compiler options that work across programs. A high-level comparison of both approaches is given in Fig. 3. The first strategy attempts to develop a cost (or priority) function to be used as a proxy to estimate the quality of a potential compiler decision, without relying on extensive profiling. The second strategy is to directly predict the best performing option.
A. Building a Cost Function
Many compiler heuristics rely on a cost function to estimate the quality of a compiler option. Depending on the optimization goal, the quality metric can be execution time, the code size, or energy consumption, etc. Using a cost function, a compiler can evaluate a range of possible options to choose the best one, without needing to compile and profile the program with each option.
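The role of a cost function described in Section III-A, ranking candidate options without compiling and profiling each one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the heuristic of any real compiler: the feature names, the weights in `WEIGHTS`, and the toy `unroll_features` estimator are all hypothetical.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, float]

# Hypothetical weights; in a real compiler these would be hand-tuned or learned.
WEIGHTS: Features = {"instructions": 1.0, "register_pressure": 4.0}


def linear_cost(features: Features) -> float:
    """Weighted sum of feature values; lower means a better option."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


def unroll_features(factor: int) -> Features:
    """Toy static estimate: unrolling cuts loop overhead but uses more registers."""
    return {"instructions": 100.0 / factor + 10.0,
            "register_pressure": float(factor)}


def pick_option(options: Sequence[int],
                features_for: Callable[[int], Features],
                cost: Callable[[Features], float]) -> int:
    """Rank options by estimated cost only; nothing is compiled or profiled."""
    return min(options, key=lambda option: cost(features_for(option)))


best_factor = pick_option([1, 2, 4, 8], unroll_features, linear_cost)
```

With these made-up weights the estimated costs are 114, 68, 51 and 54.5 for factors 1, 2, 4 and 8, so factor 4 is selected. Changing the weights changes the decision, which is why tuning them by hand for every architecture is costly.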
1) The Problem of Handcrafted Heuristics: Traditionally, a compiler cost function is manually crafted. For example, a heuristic of function inlining adds up a number of relevant metrics, such as the number of instructions of the target function to be inlined, the callee and stack size after inlining, and compares the resulting value against a predefined threshold to determine if it is profitable to inline a function [27]. Here, the importance or weights for metrics and the threshold are determined by compiler developers based on their experience or via trial and error. Because the effort involved in tuning the cost function is so expensive, many compilers simply use a “one-size-fits-all” cost function for inlining. However, such a strategy is ineffective. For example, Cooper et al. show that a “one-size-fits-all” strategy for inlining often delivers poor performance [28]; other studies also show that the optimal thresholds to use to determine when to inline change from one program to the other [29], [30].
Handcrafted cost functions are widely used in compilers. Other examples include the work conducted by Wagner et al. [31] and Tiwari et al. [32]. The former combines a Markov model and a human-derived heuristic to statically estimate the execution frequency of code regions (such as function invocation counts). The latter calculates the energy consumption of an application by assigning a weight to each instruction type. The efficiency of these approaches highly depends on the accuracy of the estimations given by the manually tuned heuristic.
The problem of relying on a hand-tuned heuristic is that the cost and benefit of a compiler optimization often depend on the underlying hardware; while handcrafted cost functions could be effective, manually developing one can take months or years on a single architecture. This means that tuning the compiler for each newly released processor is hard and is often infeasible due to the substantial effort involved. Because cost functions are important and manually tuning a good function is difficult for each individual architecture, researchers have investigated ways to use machine learning to automate this process.
In Section III-A2, we review a range of previous studies on using machine learning to tune cost functions for performance and energy consumption, many of which can be applied to other optimization targets such as the code size [33] or a tradeoff between energy and runtime.
2) Cost Functions for Performance: The Meta Optimization framework [34] uses genetic programming (GP) to search for a cost function y ← f(x), which takes in a feature vector x and produces a real-valued priority y. Fig. 4 depicts the workflow of the framework. This approach is evaluated on a number of compiler problems, including hyperblock formation,³ register allocation, and data prefetching, showing that machine learned cost functions outperform human-crafted ones. A similar approach is employed by Cavazos et al. who find cost functions for performance and compilation overhead for a Java just-in-time compiler [35]. The COLE compiler [36] uses a variant of the GP algorithm called strength Pareto evolutionary algorithm 2 (SPEA2) [37] to learn cost functions to balance multiple objectives (such as program runtime, compilation overhead, and code size). In Section IV-C, we describe the working mechanism of GP-like search algorithms.
³ Hyperblock formation combines basic blocks from multiple control paths to form a predicated, larger code block to expose instruction level parallelism.
Fig. 4. A simple view of the GP approach presented in [34] for tuning compiler cost functions. Each candidate cost function is represented as an expression tree (a). The workflow of the GP algorithm is presented in (b).
Another approach to tune the cost functions is to predict the execution time or speedup of the target program. The Qilin compiler [38] follows such an approach. It uses curve fitting algorithms to estimate the runtime for executing the target program of a given input size on the CPU and the GPU. The compiler then uses this information to determine the optimal loop iteration partition across the CPU and the GPU. The Qilin compiler relies on an application-specific function which is built on a per-program basis using reference inputs. The curve fitting (or regression; see also Section IV) model employed by the Qilin compiler can model continuous values, making it suitable for estimating runtime and speedup. This approach is extended in [39], which develops a relative predictor that predicts whether an unseen program will improve significantly on a GPU relative to a CPU. This is used for runtime scheduling of OpenCL jobs.
The early work conducted by Brewer proposed a regression-based model to predict the execution time of a data layout scheme for parallelization, by considering three parameters [40]. Using the model, his approach can select the optimal layout over 99% of the time for a partial differential equation (PDE) solver across four evaluation platforms. Other previous works also use curve fitting algorithms to build a cost function to estimate the speedup or runtime of sequential [41]–[43], OpenMP [44]–[46], and, more recently, deep learning applications [47].
3) Cost Functions for Energy Consumption: In addition to performance, there is an extensive body of work that investigates ways to build energy models for software optimization and hardware architecture design. As power or energy readings are continuous real values, most of the prior work on power modeling uses regression-based approaches.
Linear regression is a widely used technique for energy modeling. Benini et al. developed a linear-regression-based model to estimate power consumption at the instruction level [48]. The framework presented by Rethinagiri et al. [49] uses parameterized formulas to estimate power consumption of embedded systems. The parameters of the formulas are determined by applying a regression-based algorithm to reference data obtained with handcrafted assembly code and power measurements. In a more recent work, Schürmans et al. also adopt a regression-based method for power modeling [50], but the weights of the regression model are determined using standard benchmarks instead of handwritten assembly programs.
Other works employ the artificial neural network (ANN) to automatically construct power models. Curtis-Maury et al. develop an ANN-based model to predict the power consumption of OpenMP programs on multicore systems [51]. The inputs to the model are hardware performance counter values such as the cache miss rate, and the output is the estimated power consumption. Su et al. adopt a similar approach by developing an ANN predictor to estimate the runtime and power consumption for mapping OpenMP programs on nonuniform memory access (NUMA) multicores. This approach is also based on runtime profiling of the target program, but it explicitly considers NUMA-specific information like local and remote memory accesses per cycle.
B. Directly Predicting the Best Option
While a cost function is useful for evaluating the quality of compiler options, the overhead involved in searching for the optimal option may still be prohibitive. For this reason, researchers have investigated ways to directly predict the best compiler decision using machine learning for relatively small compilation problems.
Monsifrot et al. pioneered the use of machine learning to predict the optimal compiler decision [16]. This work developed a decision-tree-based approach to determine whether it is beneficial to unroll a loop based on information such as the number of statements and arithmetic operations of the loop. Their approach makes a binary decision on whether to unroll a loop but not how many times the loop should be unrolled. Later, Stephenson and Amarasinghe advanced [16] by directly predicting the loop unroll factor [52] by considering eight unroll factors (1, 2, …, 8). They formulated the problem as a multiclass classification problem (i.e., each loop unroll factor is a class). They used over 2500 loops from 72 benchmarks to train two machine learning models [a nearest neighbor and a support vector machine (SVM) model] to predict the loop unroll factor for unseen loops. Using a richer set of features than [16], their techniques correctly predict the unroll factor for 65% of the testing loops, leading to, on average, a 5% improvement for the SPEC 2000 benchmark suite.
For sequential programs, there is extensive work in predicting the best compiler flags [53], [54], code transformation options [55], or tile size for loops [56], [57]. This level of interest is possibly due to the restricted nature of the problem, allowing easy experimentation and comparison against prior work.
Directly predicting the optimal option for parallel programs is harder than doing it for sequential programs, due to the complex interactions between the parallel programs and the underlying parallel architectures. Nonetheless, there are works on predicting the optimal number of threads to be used to run an OpenMP program [46], [58], the best parameters to be used to compile a CUDA program for a given input [59], and the thread coarsening parameters for OpenCL programs for GPUs [17]. These papers show that supervised machine learning can be a powerful tool for modeling problems with a relatively small number of optimization options.
IV. MACHINE LEARNING MODELS
In this section, we review the wide range of machine learning models used for compiler optimization. Table 2 summarizes the set of machine learning models discussed in this section.
There are two major subdivisions of machine learning techniques that have previously been used in compiler optimizations: supervised and unsupervised learning. Using supervised machine learning, a predictive model is trained on empirical performance data (labeled outputs) and important quantifiable properties (features) of representative programs. The model learns the correlation between these feature values and the optimization decision that delivers the optimal (or near-optimal) performance. The learned correlations are used to predict the best optimization decisions for new programs. Depending on the nature of the outputs, the predictive model can be either a regression model for continuous outputs or a classification model for discrete outputs.
In the other subdivision of machine learning, termed unsupervised learning, the input to the learning algorithm is merely a set of input values; there is no labeled output. One form of unsupervised learning is clustering, which groups the input data items into several subsets. For example, SimPoint [60], a simulation technique, uses clustering to pick representative program execution points for program simulation. It does so by first dividing a set of program runtime information
into groups (or clusters), such that points within each cluster are similar to each other in terms of program structures (loops, memory usages, etc.); it then chooses a few points of each cluster to represent all the simulation points within that group without losing much information.
There are also techniques that sit at the boundary of supervised and unsupervised learning. These techniques refine the knowledge gathered during offline learning or previous runs using empirical observations obtained during deployment. We review such techniques in Section IV-C. This section concludes with a discussion of the relative merits of different modeling approaches for compiler optimization.
A. Supervised Learning
1) Regression: A widely used supervised learning technique is called regression. This technique has been used in various tasks, such as predicting the program execution time [38] or speedup [39] for a given input, or estimating the tail latency for parallel workloads [61].
Regression is essentially curve fitting. As an example, consider Fig. 5 where a regression model is learned from five data points. The model takes in a program input size X and predicts the execution time of the program Y. Adhering to supervised learning nomenclature, the set of five known data points is the training data set and each of the five points that comprise the training data is called a training example. Each training example (x_i, y_i) is defined by a feature vector (i.e., the input size in our case) x_i and a desired output (i.e., the program execution time in our case) y_i. Learning in this context is understood as discovering the relation between the inputs (x_i) and the outputs (y_i) so that the predictive model can be used to make predictions for any new, unseen input features in the problem domain. Once the function f is in place, one can use it to make a prediction by taking in a new input feature vector x. The prediction y is the value of the curve that the new input feature vector x corresponds to.
Fig. 5. A simple regression-based curve-fitting example. There are five training examples in this case. A function f is trained with the training data, which maps the input x to the output y. The trained function can predict the output of an unseen x.
There are a range of machine learning techniques that can be used for regression. These include the simple linear regression model and more advanced models like SVMs and ANNs. Linear regression is effective when the input (i.e., feature vectors) and output (i.e., labels) have a strong linear relation. SVM and ANNs can model both linear and nonlinear relations, but typically require more training examples to learn an effective model when compared with simple linear regression models.
Table 3 gives some examples of regression techniques that have been used in prior work for code optimization and the problem to be modeled.
Table 3 Regression Techniques Used in Prior Works
2) Classification: Supervised classification is another technique that has been widely used in prior work on machine-learning-based code optimization. This technique takes in a feature vector and predicts which of a set of classes the feature vector is associated with. For example, classification can be used to predict which of a set of unroll factors should be used for a given loop, by taking in a feature vector that describes the characteristics of the target loop (see also Section II-D).
The k-nearest neighbor (KNN) algorithm is a simple yet effective classification technique. It finds the k closest training examples to the input instance (or program) on the feature space. The closeness (or distance) is often evaluated using the Euclidean distance, but other metrics can also be used. This technique has been used to predict the optimal optimization parameters in prior works [52], [66], [67]. It works by first identifying which of the training programs are closest (i.e., nearest neighbors) to the incoming program on the feature space; it then uses the optimal parameters (which are found during training time) of the nearest neighbors as the prediction output. While it is effective on small problems, KNN also has two main drawbacks. First, it must compute the distance between the input and all training data at each prediction. This can be slow if there is a large number of training programs to be considered. Second, the algorithm itself does not learn from the training data; instead, it simply selects the k nearest neighbors. This means that the algorithm is not robust to noisy training data and could choose an ill-suited training program as the prediction.
As an alternative, the decision tree has been used in prior works for a range of optimization problems. These include choosing the parallel strategy for loop parallelization [69], determining the loop unroll factor [16], [70], deciding the profitability of using GPU acceleration [68], [71], and selecting the optimal algorithm implementation [72]. The advantage of a decision tree is that the learned model is interpretable and can be easily visualized. This enables users to understand why a particular decision is made by following the path from the root node to a leaf decision node. For example, Fig. 6 depicts the decision tree model developed in [68] for selecting the best performing device (CPU or GPU) to run an OpenCL program. To make a prediction, we start from the root of the tree; we compare a feature value (e.g., the communication–computation ratio) of the target program against a threshold to determine which branch of the tree to follow; and we repeat this process until we reach a leaf node where a decision will be made. Note that the structure and thresholds of the tree are automatically determined by the machine learning algorithm and may change when we target a different architecture or application domain.
Decision trees make the assumption that the feature space is convex, i.e., it can be divided up using hyperplanes into different regions, each of which belongs to a different category. This restriction is often appropriate in practice. However, a significant drawback of using a single decision tree is that the model can overfit due to outliers in the training data (see also Section IV-D). Random forests [73] have, therefore, been proposed to alleviate the problem of overfitting. Random forests are an ensemble learning method [74]. As illustrated in Fig. 7, a random forest works by constructing multiple decision trees at training time. The prediction of each tree depends on the values of a random vector sampled independently on the feature value. In this way, each tree is randomly forced to be insensitive to some feature dimensions. To make a prediction, random forests then aggregate the outcomes of individual trees to form an overall prediction. This technique has been employed to determine whether to inline a function or not [75], delivering better performance than a single-model-based approach. We want to highlight that random forests can also be used for regression tasks. For instance, they have been used to model energy consumption of OpenMP [76] and CUDA [77] programs.
Logistic regression is a variation of linear regression but is often used for classification. It takes in the feature vector and calculates the probability of some outcome. For example, Cavazos and O’Boyle used logistic regression to determine the optimization level of Jikes RVM. Like decision trees, logistic regression also assumes that the feature values and the prediction have a linear relation.
More advanced models, such as SVM classification, have been used for various compiler optimization tasks [46], [79]–[81]. SVMs use kernel functions to compute the similarity of feature vectors. The radial basis function (RBF) is commonly used in prior works [46], [82] because it can model both linear and nonlinear problems. It works by mapping the input feature vector to a higher dimensional space where it may be easier to find a linear hyperplane that separates the labeled data (or classes) well.
Other machine learning techniques, such as kernel canonical correlation analysis and naive Bayes, have also
Fig. 6. A decision tree for determining which device (CPU or GPU) to use to run an OpenCL program. This diagram is reproduced from [68].
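The root-to-leaf prediction procedure described for decision trees (compare one feature against a threshold, follow a branch, repeat until a leaf) can be sketched as below. The tree here is hypothetical and hand-written for illustration only: its feature names (`comm_comp_ratio`, `parallel_work`) and thresholds are assumptions and are not those of the model in [68], where the structure is learned from training data.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    decision: str  # e.g. "CPU" or "GPU"


@dataclass
class Split:
    feature: str      # name of the feature tested at this node
    threshold: float
    below: "Node"     # branch taken when value < threshold
    above: "Node"     # branch taken when value >= threshold


Node = Union[Leaf, Split]


def predict(node: Node, features: Dict[str, float]) -> str:
    """Walk from the root to a leaf, testing one feature per internal node."""
    while isinstance(node, Split):
        value = features[node.feature]
        node = node.below if value < node.threshold else node.above
    return node.decision


# Hypothetical tree: heavy data transfer favors the CPU; otherwise a kernel
# needs enough parallel work to be worth running on the GPU.
tree = Split("comm_comp_ratio", 0.5,
             below=Split("parallel_work", 10000.0,
                         below=Leaf("CPU"), above=Leaf("GPU")),
             above=Leaf("CPU"))

device = predict(tree, {"comm_comp_ratio": 0.1, "parallel_work": 1e6})
```

Because every prediction is a short chain of threshold tests, the path taken can be printed or inspected directly, which is the interpretability advantage noted in the text.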
Fig. 8. A simplified view of the internal state for the DeepTune DNN framework [78] when it predicts the optimal OpenCL thread coarsening
factor. Here, a DNN is learned for each of the four target GPU architectures. The activations in each layer of the four models increasingly
diverge (or specialize) toward the lower layers of the model. Note that some of the DeepTune layers are omitted to aid presentation.
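The k-nearest neighbor prediction step described in Section IV-A2 can be sketched in a few lines. This is a minimal illustration under assumptions: the two-dimensional feature vectors and unroll-factor labels are invented, and combining the neighbors' labels by majority vote is one common choice that the text does not prescribe.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, best option found in training)


def knn_predict(train: List[Example], x: Sequence[float], k: int = 3) -> int:
    """Return the most common label among the k training examples closest to x."""
    # Euclidean distance to every training example: this full scan on each
    # prediction is the cost the text identifies as KNN's first drawback.
    nearest = sorted(train, key=lambda example: math.dist(example[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: feature vectors of loops and their best unroll factor.
training_loops: List[Example] = [
    ((1.0, 1.0), 2), ((1.2, 0.9), 2), ((0.9, 1.1), 2),
    ((5.0, 5.0), 8), ((5.2, 4.8), 8),
]

factor_small = knn_predict(training_loops, (1.1, 1.0))
factor_large = knn_predict(training_loops, (5.1, 5.0))
```

There is no training phase beyond storing the examples, which is why a single noisy or mislabeled neighbor can sway the prediction when k is small.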
C. Online Learning
1) Evolutionary Search: Evolutionary algorithms (EAs) or evolutionary computation such as genetic algorithms (GAs), GP,⁴ and stochastic-based search have been employed
⁴ A GA is represented as a list of actions and values, often a string, while a GP is represented as a tree structure of actions and values. For example, GP is applied to the abstract syntax tree of a program to search for useful features in [70].
Fig. 10. Using an EA to perform iterative compilation. The algorithm starts from several initial populations of randomly chosen compiler flag sequences. It evaluates the performance of individual sequences to remove poorly performing sequences in each population. It then applies crossover and mutation to create a new generation of populations. The algorithm returns the best performing program binary when it terminates.
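The loop summarized in the Fig. 10 caption (evaluate, discard poor performers, apply crossover and mutation, repeat) can be sketched as follows. Several simplifications are assumptions of this sketch: it uses a single population rather than several, represents a flag sequence as a tuple of on/off bits, and replaces compiling and timing a binary with a hypothetical `toy_speedup` function.

```python
import random
from typing import Callable, List, Tuple

Flags = Tuple[bool, ...]  # one on/off bit per compiler flag


def evolve(fitness: Callable[[Flags], float], n_flags: int,
           pop_size: int = 20, generations: int = 30,
           mutation_rate: float = 0.05, seed: int = 0) -> Flags:
    """Search flag settings with selection, one-point crossover and mutation."""
    rng = random.Random(seed)
    pop: List[Flags] = [tuple(rng.random() < 0.5 for _ in range(n_flags))
                        for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half (higher fitness is better).
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children: List[Flags] = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_flags)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)


def toy_speedup(flags: Flags) -> float:
    """Hypothetical stand-in for compiling with `flags` and timing the binary."""
    score = 1.0 + 0.2 * flags[0] - 0.1 * flags[1] + 0.1 * flags[2]
    if flags[0] and flags[3]:  # two flags that only pay off together
        score += 0.15
    return score


best_flags = evolve(toy_speedup, n_flags=4)
```

Because the better half of each generation always survives, the best sequence found never gets worse from one generation to the next. In a real setting each fitness call costs a compile and a run, which dominates the search time.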

program when it runs with other competing workloads, aiming to make the target program run faster. This approach first learns a reward function offline based on static code features and runtime system information. The reward function is used to estimate the reward of a runtime scheduling action, i.e., the expected speedup when assigning a certain number of processor cores to an OpenMP program. In the next scheduling epoch, this approach uses the empirical observation of the application speedup to check if the reward function was accurate and the decision was good, and updates the reward function if the model is found to be inaccurate.

In general, RL is an intuitive and comprehensive solution for autonomous decision making. But its performance depends on the effectiveness of the value function, which estimates the immediate reward. An optimal value function should lead to the greatest cumulative reward in the longer term. For many problems, it is difficult to design an effective value function or policy, because the function needs to foresee the impact of an action in the future. The effectiveness of RL also depends on the environment; if the number of possible actions is large, it can take RL a long time to converge to a good solution. RL also requires the environment to be fully observed, i.e., all the possible states of the environment can be anticipated ahead of time. However, this assumption may not hold in a dynamic computing environment due to unpredictable disturbances, e.g., changes in application inputs or application mixes. In recent years, deep learning techniques have been used in conjunction with RL to learn a value function. The combined technique is able to solve some problems that were deemed impossible in the past [114]. However, how to combine deep learning with RL to solve compilation and code optimization problems remains an open question.

D. Discussion

What model is best is the $64,000 question. The answer is: it depends. More sophisticated techniques may provide greater accuracy but they require large amounts of labeled training data, which is a real problem in compiler optimization. Techniques such as linear regression and decision trees require less training data compared to more advanced models such as SVMs and ANNs. Simple models typically work well when the prediction problem can be described using a feature vector that has a small number of dimensions, and when the feature vector and the prediction are linearly correlated. More advanced techniques such as SVMs and ANNs can model both linear and nonlinear problems on a higher dimensional feature space, but they often require more training data to learn an effective model. Furthermore, the performance of an SVM or an ANN also depends heavily on the hyperparameters used to train the model. The optimal hyperparameter values can be chosen by performing cross validation on the training data. However, how to select parameters to avoid overfitting while achieving a good prediction accuracy remains an outstanding challenge.

Choosing which modeling technique to use is nontrivial. This is because the choice of model depends on a number of factors: the prediction problem (e.g., regression or classification), the set of features to use, the available training examples, the training and prediction overhead, etc. In prior works, the choice of modeling technique largely relied on developer experience and empirical results. Many of the studies in the field of machine-learning-based code optimization do not fully justify the choice of the model, although some do compare the performance of alternate techniques. The OpenTuner framework addresses the problem by employing multiple techniques for program tuning [115]. OpenTuner runs multiple search techniques at the same time. Techniques which perform well will be given more candidate tuning options to examine, while poorly performing algorithms will be given fewer choices or disabled entirely. In this way, OpenTuner can discover which algorithm works best for a given problem during search.

One technique that has seen little investigation is the use of Gaussian processes [116]. Before the recent widespread interest in DNNs, these were a highly popular method in many areas of machine learning [117]. They are particularly powerful when training data are sparse and expensive to collect. They also automatically give a confidence interval with any decision. This allows the compiler writer to trade off risk versus reward depending on the application scenario.

Using a single model has a significant drawback in practice. This is because a one-size-fits-all model is unlikely to precisely capture behaviors of diverse applications, and no matter how parameterized the model is, it is highly unlikely that a model developed today will always be suited for tomorrow. To allow the model to adapt to the change of the computing environment and workloads, ensemble learning was exploited in prior works [73], [118], [119]. The idea of ensemble learning is to use multiple learning algorithms, where each algorithm is effective for particular problems, to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone [120], [121]. Making a prediction using an ensemble typically requires more computational time than doing that using a single model, so ensembles can be seen as a way to compensate for poor learning algorithms by performing extra computation. To reduce the overhead, fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well.

V. FEATURE ENGINEERING

Machine-learning-based code optimization relies on having a set of high-quality features that capture the important characteristics of the target program. Given that there is an

Table 4 Summary of Features Discussed in Section V
Table 6 Example Code Features Used in Prior Works

where the predictive model takes in a set of human-crafted features, program code is used directly in the training data. Programs are fed through a series of neural-network-based language models which learn how the code correlates with the desired optimization options (see also Fig. 8). Their work also shows that the properties of the raw code that are abstracted by the top layers of the neural networks are mostly independent of the optimization problem. While promising, it is worth mentioning that dynamic information such as the program input size and performance counter values is often essential for characterizing the behavior of the target program. Therefore, DeepTune does not completely remove human involvement for feature engineering when static code features are insufficient for the optimization problem.

D. Feature Selection and Dimension Reduction

Machine learning uses features to capture the essential characteristics of a training example. Sometimes we have too many features. As the number of features increases, so does the number of training examples needed to build an accurate model [134]. Hence, we need to limit the dimension of the feature space. In compiler research, commonly, an initial large, high-dimensional candidate feature space is pruned via feature selection [52], or projected into a lower dimensional space [17]. In this section, we review a number of feature selection and dimension reduction methods.

1) Feature Selection: Feature selection requires understanding how a particular feature affects the prediction accuracy. One of the simplest methods for doing this is applying the Pearson correlation coefficient. This metric measures the linear correlation between two variables and is used in numerous works [55], [92], [122], [135] to filter out redundant features by removing features that have a strong correlation with an already selected feature. It has also been used to quantify the relation of the selected features in regression. One obvious drawback of using Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship.

Another approach for correlation estimation is mutual information [131], [136], which quantifies how much information of one variable (or feature) can be obtained through another variable (feature). Like the correlation coefficient, mutual information can be used to remove redundant features. For example, if the information of feature x can be largely obtained through another existing feature y, feature x can then be taken out from the feature set without losing much information in the reduced feature set.

Both the correlation coefficient and mutual information evaluate each feature independently with respect to the prediction. A different approach is to utilize regression analysis for feature ranking. The underlying principle of regression analysis is that if the prediction is the outcome of a regression model based on the features, then the most important features should have the highest weights (or coefficients) in the model, while features uncorrelated with the output variables should have weights close to zero. For example, least absolute shrinkage and selection operator (LASSO) regression analysis is used in [137] to remove less useful features to build a compiler-based model to predict performance. LASSO has also been used for feature selection to tune the compiler heuristics for the TRIPS processor [138].

In general, feature selection remains an open problem for machine learning, and researchers often follow a "trial-and-error" approach to test a range of methods and feature candidates. This makes automatic feature selection frameworks like FEAST [139] and HERCULES [140] attractive. The former framework employs a range of existing feature selection methods to select useful candidate features, while the latter searches for the most important static code features from a set of predefined patterns for loops.

2) Feature Dimensionality Reduction: While feature selection allows us to select the most important features, the resulting feature set can still be too large to train a good model, especially when we only have a small number of training examples. By reducing the number of dimensions, the learning algorithm can often perform more efficiently on a limited training data set. Dimension reduction is also important for some machine learning algorithms such as KNN to avoid the effect of the curse of dimensionality [141].

PCA is a well-established feature reduction technique [142]. It uses orthogonal linear transformations to reduce the dimensionality of a set of variables, i.e., features in our case.

Fig. 14. Using PCA to reduce dimensionality of a 3-D feature space. The principal components are first computed (a). Then, the first two principal components (PC1 and PC2) are selected to represent the original 3-D feature space on a new 2-D space (b).

Fig. 14 demonstrates the use of PCA to reduce the number of dimensions. The input in this example is a 3-D space defined by M1, M2, and M3, as shown in Fig. 14(a). Three components, PC1, PC2, and PC3, which account for the variance of the data, are first calculated. Here, PC1 and PC2 contribute most to the variance of the data and PC3 accounts for the least variance. Using only PC1 and PC2, one can transform the original, 3-D space into a new, 2-D coordinate
system [as illustrated in Fig. 14(b)] while preserving much of the variance of the original data.

PCA has been used in many prior compiler research works for feature reduction [17], [25], [55], [92], [95]–[97], [143]. It has also been used in prior works to visualize the working mechanism of a machine learning model, e.g., to show how benchmarks can be grouped in the feature space [123], by projecting features from a high-dimensional space into a 2-D space.

We want to stress that PCA does not select some features and discard the others. Instead, it linearly combines the original features to construct new features that can summarize the list of the original features. PCA is useful when there is some redundancy in the raw features, i.e., some of the features are correlated with one another. Similar feature reduction methods include factor analysis and linear discriminant analysis (LDA), which all try to reduce the number of features by linearly combining multiple raw features. However, PCA seems to be the most popular feature reduction method used in compiler research, probably due to its simplicity.

An alternative way of reducing the number of features used is via an autoencoder [144]. It is a neural network that finds a representation (encoding) for a set of data, by dimensionality reduction. Autoencoders work by learning an encoder and a decoder from the input data. The encoder tries to compress the original input into a low-dimensional representation, while the decoder tries to reconstruct the original input based on the low-dimensional representation generated by the encoder. As a result, the autoencoder has been widely used to remove the data noise as well as to reduce the data dimension [145].

Autoencoders have been applied to various natural language processing tasks [99], often being used together with DNNs. Recently, they have been employed to model program source code to obtain a compact set of features that can characterize the input program source [78], [146]–[149].

VI. SCOPE

Machine learning has been used to solve a wide range of problems, from the early successful work of selecting compiler flags for sequential programs, to recent works on scheduling and optimizing parallel programs on heterogeneous multicores. In this section, we review the types of problems that have been explored in prior works.

A. Optimizing Sequential Programs

Early works for machine learning in compilers looked at how, or if, a compiler optimization should be applied to a sequential program. Some of the previous studies build supervised classifiers to predict the optimal loop unroll factor [52], [70] or to determine whether a function should be inlined [29], [35]. These works target a fixed set of compiler options, by representing the optimization problem as a multiclass classification problem, where each compiler option is a class. For example, Leather et al. [70] considered a loop unroll factor between 0 and 15 (16 configurations in total), treating each candidate unroll factor as a class; they compiled and profiled each training program by trying all 16 configurations to find out the best loop unroll factor for each program, and then learned a decision tree model from the training data.

There are other compiler problems where the number of possible options is massive. For instance, the work presented in [55] considers 54 code transformations of GCC. While these options are only a subset of the hundreds of transformations provided by GCC, the resulting combinatorial compiler configurations lead to a space of approximately 10^34. Although it is possible to build a classifier to directly predict the optimal setting from a large space, to learn an effective model would require a large volume of training programs in order to have an adequate sampling over the space. Doing so is difficult because 1) there are only a few dozen common benchmarks available; and 2) compiler developers need to generate the training data themselves.

EAs such as genetic search are often used to explore a large design space (see also Section IV-C1). Prior works have used EAs to solve the phase ordering problem (i.e., in which order a set of compiler transformations should be applied) [150]–[152], determining the compiler flags during iterative compilation [153]–[156], selecting loop transformations [157], tuning algorithmic choices [11], [103], etc.

B. Optimizing Parallel Programs

How to effectively optimize parallel programs has received significant attention in the past decade, largely because the hardware industry has adopted multicore design to avoid the power wall [158]. While multicore and many-core architectures provide the potential for high-performance and energy-efficient computing, the potential performance can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. Without this, the myriad cores on multicore processors and their specialized processing elements will sit idle or be poorly utilized. To this end, researchers have extended the reach of machine learning to optimize parallel programs.

A line of research in parallel program optimization is parallelism mapping. That is, given an already parallelized program, how to map the application parallelism to match the underlying hardware to make the program run as fast as possible or be as energy efficient as possible. Zhang et al. developed a decision-tree-based approach to predict the scheduling policy to use for an OpenMP parallel region [159]. The work presented in [46] employs two machine learning techniques to predict the optimal number of threads as well as the scheduling policy to use for OpenMP
parallel loop. Specifically, it uses a regression-based ANN model to predict the speedup of a parallel loop when it runs with a given number of threads (to search for the optimal number of threads), and an SVM classifier to predict the scheduling policy. There are also works that use machine learning to determine the optimum degree of parallelism for transactional memory [160] and hardware resource allocation [161], or to select a code version from a pool of choices to use [162]. Castro et al. developed a decision tree classifier to predict the thread mapping strategy in the context of software transactional memory [163]. Jung et al. constructed an ANN-based predictor to select an effective data structure on a specific microarchitecture [164].

The work presented in [92] and [165] is a unique approach for applying machine learning to map complex parallel programs with unbounded parallel graph structures. The work considers the question of finding the optimal graph structure of a streaming program. The idea was that rather than trying to predict a sequence of transformations over an unbounded graph, where legality and consistency are a real problem, we should consider the problem from the dual feature space. The work showed that it is possible to predict the best target feature (i.e., the characteristics that an ideal transformed program should have) which then can be used to evaluate the worth of candidate transformed graphs (without compiling and profiling the resulting graphs) in the original feature space.

The Petabricks project [103], [166], [167] takes an evolutionary approach for program tuning. The Petabricks compiler employs genetic search algorithms to tune algorithmic choices. Due to the expensive overhead of the search, much of the autotuning is done at static compile time. Their work shows that one can utilize the idle processors on a multicore system to perform online tuning [168], where half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration. In this way, when the results of the faster configuration are returned, the slower version will be terminated.

The idea of combining compile-time knowledge and runtime information to achieve better optimizations has been exploited by the ADAPT compiler [169]. Using the ADAPT compiler, users describe what optimizations are available and provide heuristics for applying these optimizations. The compiler then reads these descriptions and generates application-specific runtime systems to apply the heuristics. Runtime code tuning is also exploited by Active Harmony [170], which utilizes the computing resources in HPC systems to evaluate different code variants on different nodes to find the best performing version.

There is also an extensive body of work on how to optimize programs on heterogeneous multicore systems. One of the problems for heterogeneous multicore optimization is to determine when and how to use the heterogeneous processors. Researchers have used machine learning to build classifiers to determine which processor to use [68] and at which clock frequency the processor should operate [80], [171]. Others used regression techniques to build curve fitting models to search for the sweet spot for work partitioning among processors [38] or a tradeoff of energy and performance [172].

Another line of research combines compiler-based analysis and machine learning to optimize programs in the presence of competing workloads. This research problem is important because programs rarely run in isolation and must share the computing resources with other corunning workloads. In [173] and [174], an ANN model based on static code features and runtime information was built to predict the number of threads to use for a target program when it runs with external workloads. Later, in [118], an ensemble-learning-based approach was used, which leads to significantly better performance than [173]. In [118], several models are first trained offline, and then one of the models is selected at runtime, taking into consideration the competing workloads and available hardware resources. The central idea is that instead of using a single monolithic model, we can use multiple models where each model is specialized for modeling a subset of applications or a particular runtime scenario. Using this approach, a model is used when its predictions are effective.

Some recent works developed machine learning models based on static code features and dynamic runtime information to schedule OpenCL programs in the presence of GPU contention. The work presented in [175] uses SVM classification to predict the work partition ratio between the CPU and GPU when multiple programs are competing to run on a single GPU. The work described in [39] aims to improve the overall system throughput when there are multiple OpenCL programs competing to run on the GPU. They developed an ANN model to predict the potential speedup for running an OpenCL kernel on the GPU. The speedup prediction is then used as a proxy to determine which of the waiting OpenCL tasks get to run on the GPU and in what order.

The approaches presented in [176] and [177] target task colocation in a data center environment. They use compiler-based code transformations to reduce the contention for multiple corunning tasks. A linear regression model was employed to calculate the contention score of code regions based on performance counter values. Then, a set of compiler-based code transformations is applied to reduce the resource demands of highly contentious code.

C. Other Research Problems

Many works have demonstrated that machine learning is a powerful technique in performance and cost modeling [47], [178]–[180], and in task and resource scheduling [161], [181]–[183]. We envision that many of these techniques can be used to provide evidence to support runtime program optimizations through, e.g., just-in-time compilation.

While not directly targeting code optimization, compiler-based code analysis and machine learning techniques have been used in conjunction to solve various software engineering tasks. These include detecting code similarities [184], [185], automatic comment generation [186], mining API usage patterns [187], [188], predicting program properties [189], code de-obfuscation for malware detection [190], etc. It is worth mentioning that many of these recent works show that the past development knowledge extracted from large code bases such as GitHub is valuable for learning an effective model. There were two recent studies performed by Cummins et al., which mine GitHub to synthesize OpenCL benchmarks [148] and extract code features from source code [78]. Both studies demonstrate the usefulness of large code bases and deep learning techniques for learning predictive models for compiler optimizations. We envision that the rich information in large open source code bases could provide a powerful knowledge base for training machine learning models to solve compiler optimization problems, and deep learning could be used as an effective tool to extract such knowledge from massive program source code.

VII. DISCUSSION

One of the real benefits of machine-learning-based approaches is that they force an empirically driven approach to compiler construction. New models have to be based on empirical data which can then be verified by independent experimentation. This experiment, hypothesis, test cycle is well known in the physical sciences but is a relatively new addition to compiler construction.

As machine-learning-based techniques require a sampling of the optimization space for training data, we typically know the best optimization for any program in the training set. If we exclude this benchmark from training, we therefore have access to an upper bound on performance or oracle for this program. This immediately lets us know how good existing techniques are. If they are 50% of this optimum or 95% of this optimum, this immediately tells us whether the problem is worth exploring.

Furthermore, we can construct naive techniques, e.g., a random optimization, and see its performance. If it is performed a number of times, it will have an expected value of the mean of the optimization speedups. We can then demand that any new heuristic should outperform this, though in our experience there have been cases where state-of-the-art work was actually worse than random.

A. Not a Panacea

This paper has, by and large, been very upbeat about the use of machine learning. However, there are a number of hurdles to overcome to make it a practical reality and this opens up new questions about optimization.

Training cost is an issue that many find alarming. In practice, the cost is much less than that of a compiler writer, and techniques such as active learning can be employed to reduce the overhead of training data generation [191]–[194]. Although it is true to say that generating many differently compiled programs and executing and timing them are entirely automatic, finding the right data requires careful consideration. If the optimizations explored have little positive performance impact on the programs, then there is nothing worth learning.

The most immediate problem continues to be gathering sufficient high-quality training data. Although there are numerous benchmark sites publicly available, the number of programs available is relatively sparse compared to the number that a typical compiler will encounter in its lifetime. This is particularly true in specialist domains where there may not be any public benchmarks. Automatic benchmark generation work will help here, but existing approaches do not guarantee that the generated benchmarks effectively represent the design space. Therefore, the larger issue of the structure of the program space remains.

A really fundamental problem is that if we build our optimization models based purely on empirical data, then we must guarantee that these data are correct and representative; we must learn the signal, not the noise. Peer review of a machine learning approach is difficult. Black box modeling prevents the quality of the model from being questioned, unlike handcrafted heuristics. In a sense, reviewers now have to scrutinize whether the experiments were fairly done. This means all training and test data must be publicly available for scrutiny. This is common practice in other empirical sciences. The artefact evaluation committee is an example of this [195], [196].

Although the ability to automatically learn how to best optimize an application and adapt to change is a big step forward, machine learning can only learn from what is provided by the compiler writer. Machine learning can neither invent new program transformations to apply nor derive analysis that determines whether a transformation is legal; all of this is beyond its scope.

B. Will This Put Compiler Writers Out of a Job?

In fact, machine-learning-based compilation will paradoxically lead to a renaissance in compiler optimization. Compilers have become so complex that adding a new optimization or compiler phase can lead to performance regressions. This, in turn, has led to a conservative mindset where new transformations are not considered if they may rock the boat. The core issue is that systems are so complex that it is impossible to know for sure when to use such an optimization. Machine learning can remove this uncertainty by automatically determining when an optimization is profitable. This now frees the compiler writer to develop ever more sophisticated techniques. He/she does not need to worry about how they interfere with other optimizations; machine learning looks after this. We can now develop optimizations that will typically only work for specific domains,
and not worry about coordinating their integration into a general purpose system. It allows different communities to develop novel optimizations and naturally integrate them. So rather than closing down the opportunity for new ideas, it opens up new vistas.

C. Open Research Directions

Machine learning has demonstrated its utility as a means of automating compiler profitability analysis. It will continue to be used for more complex optimization problems and is likely to be the default approach to selecting compiler optimizations in the coming decade.

The open research directions go beyond predicting the best optimizations to apply. One central issue is what the program space looks like. We know that programs with linear array accesses inside perfect loop nests need different treatment compared to, say, distributed graph processing programs. If we could have a map that allows us to measure distances between programs, then we could see whether there are regions that are well served by compiler characterization and other regions that are sparse and currently ignored. If we could do the same for hardware, then we may be better able to design hardware likely to be of use for emerging applications.

Can machine learning also be applied to compiler analysis? For instance, is it possible to learn dataflow or points-to analysis? As deep learning has the ability to automatically construct features, can we find a set of features that are common across all optimizations and analyses? Can we learn the ideal compiler intermediate representation? There is a wide range of interesting research questions that remain unexplored.

VIII. CONCLUSION

This paper has introduced machine-learning-based compilation and described its power in determining an evidence-based approach to compiler optimization. It is the latest stage in 50 years of compiler automation. Machine-learning-based compilation is now a mainstream compiler research area and, over the last decade or so, has generated a large amount of academic interest and papers. While it is impossible to provide a definitive catalog of all research, we have tried to provide a comprehensive and accessible survey of the main research areas and future directions. Machine learning is not a panacea. It can only learn from the data we provide. Rather than dumbing down the role of compiler writers, as some fear, it opens up the possibility of much greater creativity and new research areas.

REFERENCES

[1] J. Chipps, M. Koschmann, S. Orgel, A. Perlis, and J. Smith, "A mathematical language compiler," in Proc. 11th ACM Nat. Meeting, 1956, pp. 114–117.
[2] P. B. Sheridan, "The arithmetic translator-compiler of the IBM FORTRAN automatic coding system," Commun. ACM, vol. 2, no. 2, pp. 9–21, 1959.
[3] M. D. McIlroy, "Macro instruction extensions of compiler languages," Commun. ACM, vol. 3, no. 4, pp. 214–220, 1960.
[4] A. Gauci, K. Z. Adami, and J. Abela. (2010). "Machine learning for galaxy morphology classification." [Online]. Available: https://arxiv.org/abs/1005.0390
[5] H. Schoen, D. Gayo-Avello, P. T. Metaxas, E. Mustafaraj, M. Strohmaier, and P. Gloor, "The power of prediction with social media," Internet Res., vol. 23, no. 5, pp. 528–543, 2013.
[6] Slashdot. (2009). IBM Releases Open Source Machine Learning Compiler. [Online]. Available: https://tech.slashdot.org/story/09/07/03/0143233/ibm-releases-open-source-machine-learning-compiler
[7] H. Massalin, "Superoptimizer: A look at the smallest program," ACM SIGPLAN Notices, vol. 22, no. 10, pp. 122–126, 1987.
[8] J. Ivory, "I. On the method of the least squares,"
[11] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe, "Language and compiler support for auto-tuning variable-accuracy algorithms," in Proc. Int. Symp. Code Generat. Optim. (CGO), Apr. 2011, pp. 85–96.
[12] J. Kurzak, H. Anzt, M. Gates, and J. Dongarra, "Implementation and tuning of batched Cholesky factorization and solve for NVIDIA GPUs," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 7, pp. 2036–2048, Jul. 2016.
[13] Y. M. Tsai, P. Luszczek, J. Kurzak, and J. Dongarra, "Performance-portable autotuning of OpenCL kernels for convolutional layers of deep neural networks," in Proc. Workshop Mach. Learn. HPC Environ. (MLHPC), 2016, pp. 9–18.
[14] M. E. Lesk and E. Schmidt, "Lex - A lexical analyzer generator," Tech. Rep., 1975.
[15] S. C. Johnson, Yacc: Yet Another Compiler-Compiler, vol. 32. Murray Hill, NJ, USA: Bell Laboratories, 1975.
[16] A. Monsifrot, F. Bodin, and R. Quiniou, "A machine learning approach to automatic production of compiler heuristics," in Proc. Int. Conf. Artif. Intell. Methodol. Syst. Appl., 2002, pp. 41–50.
[17] A. Magni, C. Dubach, and M. O'Boyle, "Automatic optimization of thread-coarsening for graphics processors," in Proc. 23rd Int. Conf. Parallel Archit. Compilation
[20] Y. Yang, P. Xiang, J. Kong, M. Mantor, and H. Zhou, "A unified optimizing compiler framework for different gpgpu architectures," ACM Trans. Archit. Code Optim., vol. 9, no. 2, p. 9, 2012.
[21] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. Int. Symp. Code Generat. Optim. (CGO), 2004, pp. 75–86.
[22] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou, "Iterative compilation in a non-linear optimization space," in Proc. Workshop Profile Feedback-Directed Compilation, 1998.
[23] P. M. Knijnenburg, T. Kisuki, and M. F. O'Boyle, "Combined selection of tile sizes and unroll factors using iterative compilation," J. Supercomput., vol. 24, no. 1, pp. 43–67, 2003.
[24] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005.
[25] F. Agakov et al., "Using machine learning to focus iterative optimization," in Proc. Int. Symp. Code Generat. Optim. (CGO), 2006, pp. 295–305.
[26] R. Nobre, L. G. A. Martins, and J. M. P. Cardoso, "A graph-based iterative compiler pass selection and phase ordering approach," in Proc. 17th ACM SIGPLAN/
Philos. Mag. J., Comprehending Various Branches (PACT), 2014, pp. 455–466. SIGBED Conf. Lang. Compil. Tools Theory
Sci., Liberal Fine Arts, Agriculture, Manuf. [18] S. Unkule, C. Shaltz, and A. Qasem, Embedded Syst. (LCTES), 2016, pp. 21–30.
Commerce, vol. 65, no. 321, pp. 3–10, 1825. “Automatic restructuring of GPU kernels for [27] R. Leupers and P. Marwedel, “Function
[9] R. J. Adcock, “A problem in least squares,” exploiting inter-thread data locality,” in Proc. inlining under code size constraints for
Analyst, vol. 5, no. 2, pp. 53–54, 1878. 21st Int. Conf. Compil. Construction (CC), embedded processors,” in Dig. Tech. Papers
[10] K. Datta et al., “Stencil computation 2012, pp. 21–40. IEEE/ACM Int. Conf. Comput.-Aided Design,
optimization and auto-tuning on state-of- [19] V. Volkov and J. W. Demmel, “Benchmarking Nov. 1999, pp. 253–256.
the-art multicore architectures,” in Proc. GPUs to tune dense linear algebra,” in Proc. [28] K. D. Cooper, T. J. Harvey, and T. Waterman,
ACM/IEEE Conf. Supercomput., Nov. 2008, ACM/IEEE Conf. Supercomput. (SC), “An adaptive strategy for inline substitution,”
pp. 1–12. Nov. 2008, pp. 1–11. in Proc. Joint Eur. Conf. Theory Pract. Softw.
17th Int. Conf. Compil. Construction (CC/ M. Schulz, “Prediction models for multi- [60] E. Perelman, G. Hamerly, M. Van
ETAPS), 2008, pp. 69–84. dimensional power-performance Biesbrouck, T. Sherwood, and B. Calder,
[29] D. Simon, J. Cavazos, C. Wimmer, and optimization on many cores,” in Proc. 17th “Using simpoint for accurate and efficient
S. Kulkarni, “Automatic construction of Int. Conf. Parallel Archit. Compilation Techn. simulation,” in Proc. ACM SIGMETRICS Int.
inlining heuristics using machine learning,” (PACT), 2008, pp. 250–259. Conf. Meas. Modeling Comput. Syst., 2003,
in Proc. IEEE/ACM Int. Symp. Code Generat. [45] K. Singhet et al., “Comparing scalability pp. 318–319.
Optim. (CGO), 2013, pp. 1–12. prediction strategies on an SMP of CMPs,” [61] Y. Zhang, D. Meisner, J. Mars, and L. Tang,
[30] P. Zhao and J. N. Amaral, “To inline or not to in Proc. Eur. Conf. Paralell Process., 2010, “Treadmill: Attributing the source of tail
inline? Enhanced inlining decisions,” Lang. pp. 143–155. latency through precise load testing and
Compil. Parallel Comput., pp. 405–419, 2004. [46] Z. Wang and M. F. O’Boyle, “Mapping statistical inference,” in Proc. 43rd Int. Symp.
parallelism to multi-cores: A machine Comput. Archit. (ISCA), 2016, pp. 456–468.
[31] T. A. Wagner, V. Maverick, S. L. Graham,
and M. A. Harrison, “Accurate static learning based approach,” in Proc. 14th ACM [62] B. C. Lee, D. M. Brooks, B. R. de Supinski,
estimators for program optimization,” in SIGPLAN Symp. Principles Pract. Parallel M. Schulz, K. Singh, and S. A. McKee,
Proc. ACM SIGPLAN Conf. Program. Lang. Program. (PPoPP), 2009, pp. 75–84. “Methods of inference and learning for
Design Implement. (PLDI), 1994, pp. 85–96. [47] Y. Kang et al., “Neurosurgeon: Collaborative performance modeling of parallel
intelligence between the cloud and mobile applications,” in Proc. 12th ACM SIGPLAN
[32] V. Tiwari, S. Malik, and A. Wolfe, “Power Symp. Principles Pract. Parallel Program.
analysis of embedded software: A first step edge,” in Proc. 22nd Int. Conf. Archit. Support
Program. Lang. Oper. Syst. (ASPLOS), 2017, (PPoPP), 2007, pp. 249–258.
towards software power minimization,” in
pp. 615–629. [63] M. Curtis-Maury, J. Dzierwa,
Proc. IEEE/ACM Int. Conf. Comput.-Aided
Design, Nov. 1994, pp. 384–390. [48] L. Benini, A. Bogliolo, M. Favalli, and G. De C. D. Antonopoulos, and D. S. Nikolopoulos,
Micheli, “Regression models for behavioral “Online power-performance adaptation of
[33] K. D. Cooper, P. J. Schielke, and multithreaded programs using hardware
power estimation,” Integr. Comput.-Aided
D. Subramanian, “Optimizing for reduced event-based prediction,” in Proc. 20th
Eng., vol. 5, no. 2, pp. 95–106, 1998.
code space using genetic algorithms,” in Annu. Int. Conf. Supercomput. (ICS), 2006,
Proc. ACM SIGPLAN Workshop Lang. Compil. [49] S. K. Rethinagiri, R. B. Atitallah, and
pp. 157–166.
Tools Embedded Syst. (LCTES), 1999, pp. 1–9. J.-L. Dekeyser, “A system level power
consumption estimation for MPSoC,” in [64] P. E. Bailey, D. K. Lowenthal, V. Ravi,
[34] M. Stephenson, S. Amarasinghe, M. Martin, Proc. Int. Symp. Syst. Chip (SoC), 2011, B. Rountree, M. Schulz, and B. R. de
and U.-M. O‘Reilly, “Meta optimization: pp. 56–61. Supinski, “Adaptive configuration selection
Improving compiler heuristics with machine for power-constrained heterogeneous
learning,” in Proc. ACM SIGPLAN Conf. [50] S. Schürmans, G. Onnebrink, R. Leupers,
systems,” in Proc. 43rd Int. Conf. Parallel
Program. Lang. Design Implement. (PLDI), G. Leupers, and X. Chen, “Frequency-aware
Process., 2014, pp. 371–380.
2003, pp. 77–90. ESL power estimation for ARM cortex-A9
using a black box processor model,” ACM [65] J. L. Berral et al., “Towards energy-aware
[35] J. Cavazos and M. F. P. O’Boyle, “Automatic Trans. Embedded Comput. Syst., vol. 16, no. 1, scheduling in data centers using machine
tuning of inlining heuristics,” in Proc. ACM/ p. 26, 2016. learning,” in Proc. 1st Int. Conf. Energy-
IEEE Conf. Supercomput. (SC), Nov. 2005, p. 14. Efficient Comput. Netw. (e-Energy), 2010,
[51] M. Curtis-Maury et al., “Identifying energy-
[36] K. Hoste and L. Eeckhout, “Cole: Compiler pp. 215–224.
efficient concurrency levels using machine
optimization level exploration,” in Proc. 6th learning,” in Proc. IEEE Int. Conf. Cluster [66] D. D. Vento, “Performance optimization on a
Annu. IEEE/ACM Int. Symp. Code Generat. Comput., Sep. 2007, pp. 488–495. supercomputer with ctuning and the PGI
Optim. (CGO), 2008, pp. 165–174. compiler,” in Proc. 2nd Int. Workshop Adapt.
[52] M. Stephenson and S. Amarasinghe,
[37] M. Kim, T. Hiroyasu, M. Miki, and S. Watanabe, “Predicting unroll factors using supervised Self-Tuning Comput. Syst. Exaflop Era
“SPEA2+: Improving the performance of the classification,” in Proc. Int. Symp. Code (EXADAPT), 2012, pp. 12–20.
Strength Pareto Evolutionary Algorithm 2,” in Generat. Optim. (CGO), 2005, pp. 123–134. [67] P.-J. Micolet, A. Smith, and C. Dubach,
Proc. Int. Conf. Parallel Problem Solving from “A machine learning approach to mapping
Nature, 2004, pp. 742–751. [53] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla,
M. F. P. O’Boyle, and O. Temam, “Rapidly streaming workloads to dynamic multicore
[38] C.-K. Luk, S. Hong, and H. Kim, “Qilin: selecting good compiler optimizations using processors,” ACM SIGPLAN Notices, vol. 51,
Exploiting parallelism on heterogeneous performance counters,” in Proc. Int. Symp. no. 5, pp. 113–122, 2016.
multiprocessors with adaptive mapping,” in Code Generat. Optim. (CGO), 2007, [68] D. Grewe, Z. Wang, and M. F. P. O’Boyle,
Proc. 42nd Annu. IEEE/ACM Int. Symp. pp. 185–197. “Portable mapping of data parallel programs
Microarchit. (MICRO), Dec. 2009, pp. 45–55. to OpenCL for heterogeneous systems,” in
[54] J. Cavazos and M. F. P. O’Boyle, “Method-
[39] Y. Wen, Z. Wang, and M. F. P. O’Boyle, specific dynamic compilation using logistic Proc. Proc. IEEE/ACM Int. Symp. Code
“Smart multi-task scheduling for OpenCL regression,” in Proc. 21st Annu. ACM SIGPLAN Generation Optim. (CGO), Feb. 2013, pp. 1–10.
programs on CPU/GPU heterogeneous Conf. Object-Oriented Program. Syst. Lang. [69] H. Yu and L. Rauchwerger, “Adaptive
platforms,” in Proc. 21st Ann. IEEE Int. Conf. Appl. (OOPSLA), 2006, pp. 229–240. reduction parallelization techniques,” in
High Perform. Comput. (HiPC), Dec. 2014, Proc. 14th Int. Conf. Supercomput. (ICS), 2000,
[55] C. Dubach, J. Cavazos, B. Franke, G. Fursin,
pp. 1–10. pp. 66–77.
M. F. O’Boyle, and O. Temam, “Fast compiler
[40] E. A. Brewer, “High-level optimization via optimization evaluation using code-feature [70] H. Leather, E. Bonilla, and M. O’Boyle,
automated statistical modeling,” in Proc. 5th based performance prediction,” in Proc. 4th “Automatic feature generation for machine
ACM SIGPLAN Symp. Principles Pract. Parallel Int. Conf. Comput. Frontiers (CF), 2007, learning based optimizing compilation,” in
Program. (PPOPP), 1995, pp. 80–91. pp. 131–142. Proc. 7th Annu. IEEE/ACM Int. Symp. Code
[41] K. Vaswani, M. J. Thazhuthaveetil, [56] T. Yuki, L. Renganarayanan, S. Rajopadhye, Generat. Optim. (CGO), Mar. 2009, pp. 81–91.
Y. N. Srikant, and P. J. Joseph, C. Anderson, A. E. Eichenberger, and [71] Z. Wang, D. Grewe, and M. F. P. O’Boyle,
“Microarchitecture sensitive empirical K. O’Brien, “Automatic creation of tile size “Automatic and portable mapping of data
models for compiler optimizations,” in Proc. selection models,” in Proc. 8th Annu. IEEE/ parallel programs to opencl for GPU-based
Int. Symp. Code Generat. Optim. (CGO), ACM Int. Symp. Code Generat. Optim. (CGO), heterogeneous systems,” ACM Trans. Archit.
Mar. 2007, pp. 131–143. 2010, pp. 190–199. Code Optim., vol. 11, no. 4, p. 42, 2014.
[42] B. C. Lee and D. M. Brooks, “Accurate and [57] A. M. Malik, “Optimal tile size selection [72] Y. Ding, J. Ansel, K. Veeramachaneni,
efficient regression modeling for problem using machine learning,” in Proc. X. Shen, U.-M. O’Reilly, and S. Amarasinghe,
microarchitectural performance and power Optim. Tile Size Selection Problem Using Mach. “Autotuning algorithmic choice for input
prediction,” in Proc. 12th Int. Conf. Archit. Learn., vol. 2. Dec. 2012, pp. 275–280. sensitivity,” in Proc. 36th ACM SIGPLAN Conf.
Support Program. Lang. Oper. Syst. (ASPLOS), [58] R. W. Moore and B. R. Childers, “Building Program. Lang. Design Implement. (PLDI),
2006, pp. 185–194. and using application utility models 2015, pp. 379–390.
[43] E. Park, L.-N. Pouche, J. Cavazos, A. Cohen, to dynamically choose thread counts,” [73] T. K. Ho, “Random decision forests,” in
and P. Sadayappan, “Predictive modeling in a J. Supercomput., vol. 68, no. 3, Proc. 3rd Int. Conf. Document Anal. Recognit.
polyhedral optimization space,” in Proc. 9th pp. 1184–1213, 2014. (ICDAR), vol. 1. 1995, pp. 278–282.
Annu. IEEE/ACM Int. Symp. Code Generat. [59] Y. Liu, E. Z. Zhang, and X. Shen, “A cross- [74] T. G. Dietterich, “Ensemble methods
Optim. (CGO), Apr. 2011, pp. 119–129. input adaptive framework for GPU program in machine learning,” in Proc. 1st Int.
[44] M. Curtis-Maury, A. Shah, F. Blagojevic, optimizations,” in Proc. IEEE Int. Symp. Workshop Multiple Classifier Syst. (MCS),
D. S. Nikolopoulos, B. R. de Supinski, and Parallel Distrib. Process., May 2009, pp. 1–10. 2000, pp. 1–15.
[75] P. Lokuciejewski, F. Gedikli, P. Marwedel, [91] T. Sherwood, E. Perelman, G. Hamerly, study in processor customization,” in Proc.
and K. Morik, “Automatic WCET reduction and B. Calder, “Automatically Design Autom. Test Eur. Conf. Exhibit.
by machine learning based heuristics for characterizing large scale program (DATE), 2012, pp. 1030–1035.
function inlining,” in Proc. 3rd Workshop behavior,” in Proc. 10th Int. Conf. Archit. [107] M. R. Jantz and P. A. Kulkarni, “Exploiting
Statistical Mach. Learn. Approaches Archit. Support Program. Lang. Operat. Syst. phase inter-dependencies for faster
Compilation (SMART), 2009, pp. 1–15. (ASPLOS), 2002, pp. 45–57. iterative compiler optimization phase order
[76] S. Benedict, R. S. Rejitha, P. Gschwandtner, [92] Z. Wang and M. F. O’Boyle, “Partitioning searches,” in Proc. Int. Conf. Compil. Archit.
R. Prodan, and T. Fahringer, “Energy streaming parallelism for multi-cores: A Synthesis Embedded Syst. (CASES),
prediction of OpenMP applications using machine learning based approach,” in Proc. Oct. 2013, pp. 1–10.
random forest modeling approach,” in Proc. 19th Int. Conf. Parallel Archit. Compilation [108] E. Ipek, O. Mutlu, J. F. Martínez, and
IEEE Int. Parallel Distrib. Process. Symp. Techn. (PACT), 2010, pp. 307–318. R. Caruana, “Self-optimizing memory
Workshop, May 2015, pp. 1251–1260. [93] M. Newman, Networks: An Introduction. controllers: A reinforcement learning
[77] R. S. Rejitha, S. Benedict, S. A. Alex, and New York, NY, USA: Oxford Univ. Press, approach,” in Proc. IEEE 35th Int. Symp.
S. Infanto, “Energy prediction of CUDA 2010. Comput. Archit. (ISCA), Jun. 2008, pp. 39–50.
application instances using dynamic [94] L. G. Martins, R. Nobre, A. C. B. Delbem, [109] B. Porter, M. Grieves, R. R. Filho, and
regression models,” Computing, vol. 99, E. Marques, and J. A. M. Cardoso, D. Leslie, “Rex: A development platform
no. 8, pp. 765–790, 2017. “Exploration of compiler optimization and online learning approach for runtime
[78] C. Cummins, P. Petoumenos, Z. Wang, and sequences using clustering-based emergent software systems,” in Proc. Usenix
H. Leather, “End-to-end deep learning of selection,” in Proc. SIGPLAN/SIGBED Conf. Conf. Symp. Oper. Syst. Design Implement.,
optimization heuristics,” in Proc. 26th Int. Lang. Compil. Tools Embedded Syst. (LCTES), Nov. 2016, pp. 333–348.
Conf. Parallel Archit. Compilation Techn. 2014, pp. 63–72. [110] J. Rao, X. Bu, C.-Z. Xu, L. Wang, and G. Yin,
(PACT), 2017, pp. 219–232. [95] L. Eeckhout, H. Vandierendonck, and “Vconf: A reinforcement learning approach
[79] Z. Wang, G. Tournavitis, B. Franke, and K. D. Bosschere, “Workload design: to virtual machines auto-configuration,” in
M. F. P. O’Boyle, “Integrating profile-driven Selecting representative program-input Proc. 6th Int. Conf. Autonom. Comput. (ICAC),
parallelism detection and machine-learning- pairs,” in Proc. Int. Conf. Parallel Archit. 2009, pp. 137–146.
based mapping,” ACM Trans. Archit. Code Compilation Techn., 2002, pp. 83–94. [111] M. G. Lagoudakis and M. L. Littman,
Optim., vol. 11, no. 1, p. 2, 2014. [96] Y. Chen et al., “Evaluating iterative “Algorithm selection using reinforcement
[80] B. Taylor, V. S. Marco, and Z. Wang, optimization across 1000 datasets,” in Proc. learning,” in Proc. 7th Int. Conf. Mach.
“Adaptive optimization for OpenCL 31st ACM SIGPLAN Conf. Program. Lang. Learn. (ICML), 2000, pp. 511–518.
programs on embedded heterogeneous Design Implement. (PLDI), 2010, [112] N. Mishra, J. D. Lafferty, H. Hoffmann, and
systems,” in Proc. 18th Annu. ACM SIGPLAN/ pp. 448–459. C. Imes, “CALOREE: Learning control for
SIGBED Conf. Lang. Compil. Tools Embedded [97] A. H. Ashouri, G. Mariani, G. Palermo, predictable latency and low energy,” in Proc.
Syst. (LCETS), 2017, pp. 11–20. and C. Silvano, “A Bayesian network 23rd Int. Conf. Archit. Support Program. Lang.
[81] P. Zhang, J. Fang, T. Tang, C. Yang, and approach for compiler auto-tuning for Oper. Syst. (ASPLOS), 2018, pp. 184–198.
Z. Wang, “Auto-tuning streamed applications embedded processors,” in Proc. IEEE 12th [113] M. K. Emani and M. O’Boyle, “Change
on Intel Xeon Phi,” in Proc. 32nd IEEE Int. Symp. Embedded Syst. Real-time Multimedia
detection based parallelism mapping:
Parallel Distrib. Process. Symp. (IPDPS), 2018. (ESTIMedia), Oct. 2014, pp. 90–97.
Exploiting offline models and Online
[82] P. J. Joseph, K. Vaswani, and [98] A. Phansalkar, A. Joshi, and L. K. John, adaptation,” in Proc. 27th Int. Workshop
M. J. Thazhuthaveetil, “A predictive “Analysis of redundancy and application Lang. Compil. Parallel Comput. (LCPC),
performance model for superscalar balance in the spec cpu2006 benchmark 2014, pp. 208–223.
processors,” in Proc. 39th Annu. IEEE/ACM suite,” in Proc. 34th Annu. Int. Symp.
[114] Y. Li, “Deep reinforcement learning: An
Int. Symp. Microarchit. (MICRO), Comput. Archit. (ISCA), 2007, pp. 412–423.
overview,” CoRR, 2017.
Dec. 2006, pp. 161–170. [99] P. Vincent, H. Larochelle, Y. Bengio, and
[115] J. Ansel et al., “Opentuner: An extensible
[83] A. Ganapathi, K. Datta, A. Fox, and D. P.-A. Manzagol, “Extracting and composing
framework for program autotuning,” in
Patterson, “A case for machine learning to robust features with denoising
Proc. PACT, 2014, pp. 303–316.
optimize multicore performance,” in Proc. autoencoders,” in Proc. 25th Int. Conf. Mach.
1st USENIX Conf. Hot Topics Parallelism Learn. (ICML), 2008, pp. 1096–1103. [116] C. K. Williams and C. E. Rasmussen,
(HotPar), 2009, p. 1. “Gaussian processes for regression,” in
[100] X. Gu, H. Zhang, D. Zhang, and S. Kim
Proc. Adv. Neural Inf. Process. Syst., 1996,
[84] E. Deniz and A. Sen, “Using machine (2016). Deep API learning. [Online].
pp. 514–520.
learning techniques to detect parallel Available: https://arxiv.org/abs/1605.08535
patterns of multi-threaded applications,” [101] B. Singer and M. Veloso, “Learning to [117] C. E. Rasmussen and C. K. Williams,
Int. J. Parallel Program., vol. 44, no. 4, construct fast signal processing Gaussian Processes for Machine Learning,
pp. 867–900, 2016. implementations,” J. Mach. Learn. Res., vol. 1. Cambridge, MA, USA: MIT Press,
[85] Y. LeCun, Y. Bengio, and G. Hinton, Deep vol. 3, pp. 887–919, Dec. 2002. 2006.
Learning, 2015. [102] X. Li, M. J. Garzaran, and D. Padua, [118] M. K. Emani and M. O’Boyle, “Celebrating
[86] A. Krizhevsky, I. Sutskever, and G. E. “Optimizing sorting with genetic diversity: A mixture of experts approach
Hinton, “Imagenet classification with deep algorithms,” in Proc. Int. Symp. Code for runtime mapping in dynamic
convolutional neural networks,” in Proc. Adv. Generat. Optim. (CGO), 2005, pp. 99–110. environments,” in Proc. 36th ACM SIGPLAN
Neural Inf. Process. Syst. (NIPS), 2012, Conf. Program. Lang. Design Implement.
[103] J. Ansel et al., “Petabricks: A language and (PLDI), 2015, pp. 499–508.
pp. 1097–1105. compiler for algorithmic choice,” in Proc.
[87] K. He, X. Zhang, S. Ren, and J. Sun, “Deep ACM SIGPLAN Conf. Program. Lang. Design [119] H. D. Nguyen and F. Chamroukhi (2017).
residual learning for image recognition,” in Implement. (PLDI), 2009, pp. 38–49. “An introduction to the practical and
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. theoretical aspects of mixture-of-experts
[104] M. Harman, W. B. Langdon, Y. Jia, D. R. modeling.” [Online]. Available: https://
(CVPR), Jun. 2016, pp. 770–778. White, A. Arcuri, and J. A. Clark, “The arxiv.org/abs/1707.03538
[88] H. Lee, Y. Largman, P. Pham, and A. Y. Ng, GISMOE challenge: Constructing the
“Unsupervised feature learning for audio Pareto program surface using genetic [120] R. Polikar, “Ensemble based systems in
classification using convolutional deep belief programming to find better programs decision making,” IEEE Circuits Syst. Mag.,
networks,” in Proc. 22nd Int. Conf. Neural Inf. (keynote paper),” in Proc. 27th IEEE/ACM vol. 6, no. 3, pp. 21–45, Sep. 2006.
Process. Syst. (NIPS), 2009, pp. 1096–1104. Int. Conf. Autom. Softw. Eng. (ASE), Sep. [121] L. Rokach, “Ensemble-based classifiers,”
[89] M. Allamanis, E. T. Barr, P. Devanbu, and 2012, pp. 1–14. Artif. Intell. Rev., vol. 33, nos. 1–2, pp. 1–39,
C. Sutton (2017). “A survey of machine [105] U. Garciarena and R. Santana, 2010.
learning for big code and naturalness.” “Evolutionary optimization of compiler [122] Y. Jiang et al., “Exploiting statistical
[Online]. Available: https://arxiv.org/ flag selection by learning and exploiting correlations for proactive prediction of
abs/1709.06182 flags interactions,” in Proc. Genetic Evol. program behaviors,” in Proc. 8th Annu.
[90] J. MacQueen, “Some methods for Comput. Conf. Companion (GECCO), 2016, IEEE/ACM Int. Symp. Code Generat. Optim.
classification and analysis of multivariate pp. 1159–1166. (CGO), Apr. 2010, pp. 248–256.
observations,” in Proc. 5th Berkeley Symp. [106] M. Zuluaga, E. Bonilla, and N. Topham, [123] V. S. Marco, B. Taylor, B. Porter, and
Math. Statist. Prob., 1967, pp. 281–297. “Predicting best design trade-offs: A case Z. Wang, “Improving spark application
throughput via memory aware task [138] M. E. Taylor, K. E. Coons, B. Robatmili, IEEE Trans. Evol. Comput., vol. 15, no. 4,
co-location: A mixture of experts B. A. Maher, D. Burger, and K. S. pp. 515–538, Aug. 2011.
approach,” in Proc. ACM/IFIP/USENIX McKinley, “Evolving compiler heuristics to [155] G. Fursin and O. Temam, “Collective
Middleware Conf., 2017, pp. 95–108. manage communication and contention,” optimization: A practical collaborative
[124] B. Singer and M. M. Veloso, “Learning to in Proc. 24th Conf. Artif. Intell. (AAAI), 2010, approach,” ACM Trans. Archit. Code Optim.,
predict performance from formula pp. 1690–1693. vol. 7, no. 4, p. 20, Dec. 2010.
modeling and training data,” in Proc. 7th [139] P.-S. Ting, C.-C. Tu, P.-Y. Chen, Y.-Y. Lo, [156] J. Kukunas, R. D. Cupper, and
Int. Conf. Mach. Learn. (ICML), 2000, and S.-M. Cheng (2016). “FEAST: An G. M. Kapfhammer, “A genetic algorithm
pp. 887–894. automated feature selection framework for to improve linux kernel performance on
[125] E. Park, J. Cavazos, and M. A. Alvarez, compilation tasks.” [Online]. Available: resource-constrained devices,” in Proc. 12th
“Using graph-based program https://arxiv.org/abs/1610.09543 Annu. Conf. Companion Genetic Evol.
characterization for predictive modeling,” [140] E. Park, C. Kartsaklis, and J. Cavazos, Comput. (GECCO), 2010, pp. 2095–2096.
in Proc. 10th Int. Symp. Code Generat. Optim. “HERCULES: Strong patterns towards [157] L.-N. Pouchet, C. Bastoul, A. Cohen, and
(CGO), 2012, pp. 196–206. more intelligent predictive modeling,” in J. Cavazos, “Iterative optimization in the
[126] A. M. Malik, “Spatial based feature Proc. 43rd Int. Conf. Parallel Process., 2014, polyhedral model: Part II,
generation for machine learning based pp. 172–181. multidimensional time,” in Proc. 29th ACM
optimization compilation,” in Proc. 9th Int. [141] K. Beyer, J. Goldstein, R. Ramakrishnan, SIGPLAN Conf. Program. Lang. Design
Conf. Mach. Learn. Appl., 2010, and U. Shaft, “When is ‘nearest neighbor’ Implement. (PLDI), 2008, pp. 90–100.
pp. 925–930. meaningful?” in Proc. Int. Conf. Database [158] K. Asanovic et al., “The landscape of parallel
[127] M. Burtscher, R. Nasre, and K. Pingali, Theory, 1999, pp. 217–235. computing research: A view from Berkeley,”
“A quantitative study of irregular programs [142] I. Fodor, “A survey of dimension reduction Univ. California, Berkeley, CA, USA, Tech.
on GPUs,” in Proc. IEEE Int. Symp. Workload techniques,” Lawrence Livermore Nat. Rep. UCB/EECS-2006-183, 2006.
Characterization (IISWC), Nov. 2012, Lab., Tech. Rep., 2002. [159] Y. Zhang, M. Voss, and E. S. Rogers,
pp. 141–151. [143] J. Thomson, M. F. O’Boyle, G. Fursin, and “Runtime empirical selection of loop
[128] Y. Luo, G. Tan, Z. Mo, and N. Sun, “Fast: A B. Franke, “Reducing training time in a schedulers on hyperthreaded SMPs,” in
fast stencil autotuning framework based on one-shot machine learning-based Proc. 19th IEEE Int. Parallel Distrib. Process.
an optimal-solution space model,” in Proc. compiler,” Lang. Compil. Parallel Comput., Symp. (IPDPS), Apr. 2005, p. 44b.
29th ACM Int. Conf. Supercomput., 2015, vol. 5898, pp. 399–407, 2009. [160] D. Rughetti, P. D. Sanzo, B. Ciciani, and
pp. 187–196. F. Quaglia, “Machine learning-based self-
[144] Y. Bengio, “Learning deep architectures for
[129] S. Browne, J. Dongarra, N. Garner, G. Ho, and AI,” Found. Trends Mach. Learn., vol. 2, no. adjusting concurrency in software
P. Mucci, “A portable programming interface 1, pp. 1–127, 2009. transactional memory systems,” in Proc.
for performance evaluation on modern IEEE 20th Int. Symp. Modeling Anal.
[145] L. Deng, M. L. Seltzer, D. Yu, A. Acero, Simulation Comput. Telecommun. Syst.,
processors,” Int. J. High Perform. Comput. Appl.,
A.-R. Mohamed, and G. Hinton, “Binary Aug. 2012, pp. 278–285.
vol. 14, no. 3, pp. 189–204, 2000.
coding of speech spectrograms using a
[130] T. Mytkowicz, A. Diwan, M. Hauswirth, deep auto-encoder,” in Proc. 11th Annu. [161] C. Delimitrou and C. Kozyrakis, “Quasar:
and P. F. Sweeney, “Producing wrong data Conf. Int. Speech Commun. Assoc., 2010. Resource-efficient and Qos-aware cluster
without doing anything obviously wrong!” management,” in Proc. 19th Int. Conf. Archit.
[146] L. Mou, G. Li, L. Zhang, T. Wang, and Support Program. Lang. Operat. Syst.
in Proc. 14th Int. Conf. Archit. Support
Z. Jin, “Convolutional neural networks (ASPLOS), 2014, pp. 127–144.
Program. Lang. Operat. Syst. (ASPLOS XIV),
over tree structures for programming
2009, pp. 265–276. [162] X. Chen and S. Long, “Adaptive multi-
language processing,” in Proc. AAAI, 2016,
[131] J. Cavazos et al., “Automatic performance pp. 1287–1293. versioning for openmp parallelization via
model construction for the fast software machine learning,” in Proc. 15th Int. Conf.
[147] M. White, M. Tufano, C. Vendome, and Parallel Distrib. Syst. (ICPADS), 2009,
exploration of new hardware designs,” in
Proc. Int. Conf. Compil. Archit. Synthesis D. Poshyvanyk, “Deep learning code pp. 907–912.
Embedded Syst. (CASES), 2006, pp. 24–34. fragments for code clone detection,” in
Proc. ASE 31st IEEE/ACM Int. Conf. Autom. [163] M. Castro, L. F. W. Góes, C. P. Ribeiro,
[132] S. Khan, P. Xekalakis, J. Cavazos, and Softw. Eng., 2016, pp. 87–98. M. Cole, M. Cintra, and J. F. Méhaut, “A
M. Cintra, “Using predictivemodeling for machine learning-based approach for
cross-program design space exploration in [148] C. Cummins, P. Petoumenos, Z. Wang, and thread mapping on transactional memory
multicore systems,” in Proc. IEEE 16th Int. H. Leather, “Synthesizing benchmarks for applications,” in Proc. 18th Int. Conf. High
Conf. Parallel Archit. Compilation Techn., predictive modeling,” in Proc. Int. Symp. Code Perform. Comput., 2011, pp. 1–10.
Sep. 2007, pp. 327–338. Generat. Optim. (CGO), 2017, pp. 86–99.
[164] C. Jung, S. Rus, B. P. Railing, N. Clark, and
[133] M. Namolaru, A. Cohen, G. Fursin, [149] M. White, M. Tufano, and M. Martinez, S. Pande, “Brainy: Effective selection of
A. Zaks, and A. Freund, “Practical M. Monperrus, and D. Poshyvanyk (2017). data structures,” in Proc. 32nd ACM
aggregation of semantical program “Sorting and transforming program repair SIGPLAN Conf. Program. Lang. Design
properties for machine learning based ingredients via deep learning code Implement. (PLDI), 2011, pp. 86–97.
optimization,” in Proc. Proc. Int. Conf. similarities.” [Online]. Available: https://
[165] Z. Wang and M. F. P. O’Boyle, “Using
Compil. Archit. Synth. Embedded Syst. arxiv.org/abs/1707.04742
machine learning to partition streaming
(CASES), 2010, pp. 197–206. [150] L. Almagor et al., “Finding effective programs,” ACM Trans. Archit. Code Optim.,
[134] C. M. Bishop, Pattern Recognition and compilation sequences,” in Proc. ACM vol. 10, no. 3, p. 20, 2013.
Machine Learning (Information Science and SIGPLAN/SIGBED Conf. Lang. Compil. Tools
[166] C. Chan, J. Ansel, Y. L. Wong,
Statistics). Secaucus, NJ, USA: Springer- Embedded Syst. (LCTES), 2004, pp. 231–239. S. Amarasinghe, and A. Edelman,
Verlag, 2006. [151] K. D. Cooper et al., “ACME: Adaptive “Autotuning multigrid with petabricks,” in
[135] K. Hoste, A. Phansalkar, L. Eeckhout, compilation made efficient,” in Proc. ACM Proc. ACM/IEEE Conf. Supercomput. (SC),
A. Georges, L. K. John, and K. de SIGPLAN/SIGBED Conf. Lang. Compil. 2009, Art. no. 5.
Bosschere, “Performance prediction based Embedded Syst. (LCTES), 2005, pp. 69–77. [167] M. Pacula, J. Ansel, S. Amarasinghe, and
on inherent program similarity,” in Proc. [152] A. H. Ashouri, A. Bignoli, G. Palermo, U.-M. O’Reilly, “Hyperparameter tuning in
IEEE Int. Conf. Parallel Archit. Compilation C. Silvano, S. Kulkarni, and J. Cavazos, bandit-based adaptive operator selection,”
Techn. (PACT), Swep. 2006, pp. 114–122. “MiCOMP: Mitigating the compiler phase- in Proc. Eur. Conf. Appl. Evol. Comput.
[136] N. E. Rosenblum, B. P. Miller, and X. Zhu, ordering problem using optimization sub- (EuroSys), 2012, pp. 73–82.
“Extracting compiler provenance from sequences and machine learning,” ACM [168] J. Ansel, “Siblingrivalry: Online autotuning
program binaries,” in Proc. 9th ACM Trans. Archit. Code Optim., vol. 14, no. 3, through local competitions,” in Proc. Int.
SIGPLAN-SIGSOFT Workshop Program Anal. p. 29, 2017. Conf. Compil., Archit. Synth. Embedded Syst.
Softw. Tools Eng. (PASTE), 2010, pp. 21–28. [153] K. D. Cooper, D. Subramanian, and (CASES), 2012, pp. 91–100.
[137] A. Bhattacharyya, G. Kwasniewski, and L. Torczon, “Adaptive optimizing compilers [169] M. J. Voss and R. Eigemann, “High-level
T. Hoefler, “Using compiler techniques to for the 21st century,” J. Supercomput., adaptive program optimization with
improve automatic performance vol. 23, no. 1, pp. 7–22, 2002. adapt,” in Proc. 8th ACM SIGPLAN Symp.
modeling,” in Proc. Int. Conf. Parallel Archit. [154] D. R. White, A. Arcuri, and J. A. Clark, Principles Pract. Parallel Program. (PPoPP),
Compilation (PACT), 2015, pp. 468–479. “Evolutionary improvement of programs,” 2001, pp. 93–102.
[170] A. Tiwari and J. K. Hollingsworth, “Online Cloud Grid Comput. (CCGRID), 2010, [187] J. Fowkes and C. Sutton, “Parameter-free
adaptive code generation and tuning,” in pp. 495–504. probabilistic api mining across GitHub,” in
Proc. IEEE Int. Parallel Distrib. Process. [179] S. Venkataraman, Z. Yang, M. J. Franklin, Proc. 24th ACM SIGSOFT Int. Symp. Found.
Symp. (IPDPS), May 2011, pp. 879–892. B. Recht, and I. Stoica, “Ernest: Efficient Softw. Eng. (FSE), 2016, pp. 254–265.
[171] J. Ren, L. Gao, H. Wang, and Z. Wang, performance prediction for large-scale [188] A. T. Nguyen et al., “API code
“Optimise Web browsing on heterogeneous advanced analytics,” in Proc. NSDI, 2016, recommendation using statistical learning
mobile platforms: A machine learning pp. 363–378. from fine-grained changes,” in Proc. 24th
based approach,” in Proc. IEEE Int. Conf. ACM SIGSOFT Int. Symp. Found. Softw. Eng.
[180] S. Sankaran, “Predictive modeling based
Comput. Commun. (INFOCOM), May 2017, (FSE), 2016, pp. 511–522.
power estimation for embedded multicore
pp. 1–9.
systems,” in Proc. ACM Int. Conf. Comput. [189] V. Raychev, P. Bielik, and M. Vechev,
[172] Y. Zhu and V. J. Reddi, “High-performance Frontiers (CF), 2016, pp. 370–375. “Probabilistic model for code with decision
and energy-efficient mobile Web browsing trees,” in Proc. ACM SIGPLAN Int. Conf.
on big/little systems,” in Proc. HPCA, [181] Y. Zhang, M. A. Laurenzano, J. Mars, and
L. Tang, “Smite: Precise QoS prediction on Object-Oriented Program. Syst. Lang. Appl.
Feb. 2013, pp. 13–24. (OOPSLA), 2016, pp. 731–747.
real-system smt processors to improve
[173] Z. Wang, M. F. P. O’Boyle, and utilization in warehouse scale computers,” [190] B. Bichsel, V. Raychev, P. Tsankov, and
M. K. Emani, “Smart, adaptive mapping of in Proc. 47th Annu. IEEE/ACM Int. Symp. M. Vechev, “Statistical deobfuscation of
parallelism in the presence of external Microarchit. (MICRO-47), Dec. 2014, pp. Android applications,” in Proc. ACM
workload,” in Proc. IEEE/ACM Int. Symp. 406–418. SIGSAC Conf. Comput. Commun. Secur.
Code Generat. Optim. (CGO), Feb. 2013, (CCS), 2016, pp. 343–355.
pp. 1–10. [182] V. Petrucci, “Octopus-man: QoS-driven
task management for heterogeneous [191] P. Balaprakash, R. B. Gramacy, and
[174] D. Grewe, Z. Wang, and M. F. P. O’Boyle, S. M. Wild, “Active-learning-based surrogate
multicores in warehouse-scale computers,”
“A workload-aware mapping approach for models for empirical performance tuning,”
in Proc. IEEE 21st Int. Symp. High Perform.
data-parallel programs,” in Proc. 6th Int. in Proc. IEEE Int. Conf. Cluster Comput.
Comput. Archit. (HPCA), Feb. 2015, pp.
Conf. High Perform. Embedded Archit. (CLUSTER), Sep. 2013, pp. 1–8.
Compil. (HiPEAC), 2011, pp. 117–126. 246–258.
[183] N. J. Yadwadkar, B. Hariharan, [192] W. F. Ogilvie, P. Petoumenos, Z. Wang, and
[175] D. Grewe, Z. Wang, and M. F. O’Boyle, H. Leather, “Fast automatic heuristic
“OpenCL task partitioning in the presence J. E. Gonzalez, and R. Katz, “Multi-task
learning for straggler avoiding predictive construction using active learning,” in
of GPU contention,” in Proc. Int. Workshop Proc. Int. Workshop Lang. Compil. Parallel
Lang. Compil. Parallel Comput., 2013, job scheduling,” J. Mach. Learn. Res., vol. 17,
no. 1, pp. 3692–3728, 2016. Comput., 2014, pp. 146–160.
pp. 87–101.
[184] Y. David and E. Yahav, “Tracelet-based [193] M. Zuluaga, G. Sergent, A. Krause, and
[176] L. Tang, J. Mars, and M. L. Soffa,
code search in executables,” in Proc. M. Püschel, “Active learning for multi-
“Compiling for niceness: Mitigating
35th ACM SIGPLAN Conf. Program. Lang. objective optimization,” in Proc. Int. Conf.
contention for Qos in warehouse scale
Design Implement. (PLDI), 2014, Mach. Learn., 2013, pp. 462–470.
computers,” in Proc. 10th Int. Symp. Code
Generat. Optim. (CGO), 2012, pp. 1–12. pp. 349–360. [194] W. F. Ogilvie, P. Petoumenos, Z. Wang, and
[177] L. Tang, J. Mars, W. Wang, T. Dey, and [185] Y. David, N. Partush, and E. Yahav, H. Leather, “Minimizing the cost of
M. L. Soffa, “Reqos: Reactive static/ “Statistical similarity of binaries,” in Proc. iterative compilation with active learning,”
dynamic compilation for Qos in warehouse 37th ACM SIGPLAN Conf. Program. Lang. in Proc. Int. Symp. Code Generat. Optim.
scale computers,” in Proc. 18th Int. Conf. Design Implement. (PLDI), 2016, (CGO), 2017, pp. 245–256.
Archit. Support Program. Lang. Oper. Syst. pp. 266–280. [195] A. Evaluation. About Artifact Evaluation.
(ASPLOS), 2013, pp. 89–100. [186] E. Wong, T. Liu, and L. Tan, “Clocom: [Online]. Available: http://www.artifact-
[178] A. Matsunaga and J. A. B. Fortes, “On the Mining existing source code for automatic eval.org/about.html
use of machine learning to predict the time comment generation,” in Proc. IEEE 22nd [196] cTuning Foundation. Artifact Evaluation for
and resources consumed by applications,” Int. Conf. Softw. Anal. Evol. Reeng. (SANER), Computer Systems Research. [Online].
in Proc. 10th IEEE/ACM Int. Conf. Cluster Mar. 2015, pp. 380–389. Available: http://ctuning.org/ae/