IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 5, MAY 2020

Automated Performance Modeling of HPC Applications Using Machine Learning

Jingwei Sun, Guangzhong Sun, Shiyan Zhan, Jiepeng Zhang, and Yong Chen

Abstract—Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases,
such as in guiding task management and job scheduling, offering insights into application behaviors, and assisting resource requirement
estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications,
algorithms, and input parameters, so accurate performance prediction is often a challenging and daunting task. In this article, we
focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at
different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent
runtime features. These features include values of variables, counters of branches, loops, and MPI communications. Through
automatically instrumenting an MPI program, each execution of the program will output a feature vector and its corresponding
execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build
an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is
used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution
data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems
confirm that our method performs well, with less than 20 percent prediction error on average.

Index Terms—Parallel computing, performance modeling, machine learning, model transferring

• J. Sun, G. Sun, S. Zhan, and J. Zhang are with the School of Computer Science and Technology, University of Science and Technology of China, Hefei 230052, China. E-mail: {sunjw, zsynew, hitzjp}@mail.ustc.edu.cn, [email protected].
• Y. Chen is with the Department of Computer Science, Texas Tech University, Lubbock, TX 79409. E-mail: [email protected].

Manuscript received 21 Feb. 2019; revised 6 Dec. 2019; accepted 26 Dec. 2019. Date of publication 10 Jan. 2020; date of current version 7 Apr. 2020. (Corresponding author: Guangzhong Sun.) Recommended for acceptance by T. F. Pena. Digital Object Identifier no. 10.1109/TC.2020.2964767

0018-9340 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

1 INTRODUCTION

Performance modeling is a widely studied problem in the high performance computing (HPC) community. An accurate model of parallel program performance, particularly an accurate model for predicting execution time, can yield many benefits. First, a performance model can be used for task management and scheduling, assisting the scheduler to decide how to map tasks to proper compute nodes [19], [37]. Therefore, the utilization of the entire HPC system can be improved. Second, the model can offer insights about application behaviors [22], [25], which helps developers understand the scaling potential and better tune applications. Third, the model helps HPC users to estimate the number of CPU cores they need [43], [44]. According to the predicted performance, users can consider the predicted computation time and estimated resources systematically, and then request a reasonable number of compute nodes and CPU cores from HPC systems.

Building an accurate performance model of parallel programs, however, is a very challenging task. Due to the variance and complexity of both system architectures and applications, the execution time of a parallel program often carries significant uncertainty. For example, numerous factors can affect the performance, including but not limited to hardware, applications, algorithms, and input parameters. It is especially difficult to build a general-purpose model that synthesizes all factors. In this paper, we focus on designing and developing a model, particularly for predicting the execution time of parallel programs, on an HPC cluster with different inputs and at different scales. We also focus on MPI programs as MPI is the de facto standard parallel programming model.

Previous studies have mainly introduced three types of methods: analytical modeling, replay-based modeling, and statistical modeling. An analytical modeling method [8], [26], [36], [40] uses arithmetic formulas to describe the performance of a parallel program and can offer a prediction of execution time quickly. However, this method needs extensive efforts of human experts with in-depth understanding of a particular HPC application (e.g., consider the time complexity analysis process of a parallel program). Since HPC applications have a wide range of domains, it is difficult to build an analytical model. Furthermore, it is challenging to generalize a model for various domains.

A replay-based model [21], [38], [44], [45] is built from historical execution traces, which contain detailed information about computation and communication of an HPC program. Through analyzing traces, a synthetic program can be automatically reconstructed for replaying behaviors of the original program and predicting its performance. However, replay-based modeling usually requires large storage space to keep traces (ranging from hundreds of megabytes to tens of gigabytes for each run [9], [44]). Besides, a synthetic program can only represent one specific execution path of the original program, which also restricts the application of replay-based modeling.
A statistical model [9], [10], [14] uses machine learning techniques to fit the mapping function between performance metrics and certain features. With sufficient training data, statistical models can make relatively accurate predictions of the performance, without requiring domain knowledge and human efforts. It is natural and convenient to use input parameters of an HPC program as features. However, important performance factors may not be explicitly covered by input parameters. For instance, it is difficult to automatically parse non-scalar inputs (e.g., files, matrices, and strings) for modeling without domain experts. Domain experts can determine a small number of scalar variables in the source code of a program as model features since these variables can directly expose the actual performance impact from inputs [9]. For applications that contain adaptive preprocess or auto-tuning [30], [31], some key features are dynamically decided at runtime. In this case, input parameters also cannot cover all performance features.

As a superset of input parameters, runtime features generally contain more complete and relevant information that affects performance. But effectively identifying an important subset of runtime features without domain knowledge is still challenging, since the number of possible runtime features usually is much larger than the number of input parameters. A statistical performance model should properly maintain its complexity, namely the number of features. Redundant features result in high expense and high generalization error, while a lack of important performance features makes the model useless.

In this paper, we propose a performance modeling and prediction method, which identifies runtime features that are highly relevant to performance, without requiring domain knowledge and human efforts. It automatically inserts instrumentation code, detects runtime features, and analyzes feature data. We adopt a lightweight instrumentation to reduce the overhead so that the instrumented program can be normally used for production executions and meanwhile continuously output feature data for constructing a more accurate model. The random forest regression approach [20] is used to predict execution time. With this approach, the performance model can fit the complex nonlinear mapping function between features and performance. It can also analyze the statistical importance of each feature and only reserve a subset of important features to further reduce the overhead of instrumentations. In addition to making a point prediction (e.g., scalar value of execution time), our model can also predict an interval of performance to show the performance uncertainty.

Statistical modeling requires repetitive executions of an application on a certain platform. For a new platform, it is difficult to build an accurate model due to the lack of historical execution data. It is the so-called cold start problem. To deal with this problem, we introduce a transfer learning method that can reuse an existing performance model to assist in building an accurate model on a new platform with a small number of execution samples.

The main contributions of this study are summarized as follows:

• We propose a method to model and predict the performance of parallel programs (MPI programs) with different inputs. Our experiments with three different programs on three different HPC systems show that the average prediction error is less than 20 percent.
• We develop a tool to automatically analyze the syntax tree of an MPI program and instrument it. Thus we can detect domain-independent runtime features related to performance. These features are necessary to support the automated modeling and prediction with machine learning techniques.
• We design a strategy to automatically analyze and reorganize the runtime feature data of an MPI program using machine learning, so that we can extract the factors that significantly affect the performance and reduce the overhead and storage demand from redundant instrumentations.
• We introduce a transfer learning method that can transfer an existing performance model to adapt to a new platform. It only uses 5 percent of the execution samples from the new platform as the training set to predict the execution time of the remaining samples, and the average prediction error is less than 20 percent.

The rest of this paper is organized as follows. Section 2 describes our method of modeling and predicting the performance in detail. Section 3 presents the experimental results and analyzes our method. Section 4 reviews a series of existing studies relevant to this research. Section 5 summarizes this study and discusses our plan for future studies.

2 METHODOLOGY

2.1 Overview

Performance prediction is arguably a challenging problem. Theoretically, it is impossible to find a perfect prediction for every program (e.g., the halting problem). In practice, it is also difficult to predict the performance of a program driven by dynamic factors in its entire execution period (e.g., many randomized algorithms). In this work, we aim to model programs whose execution time mainly depends on their early execution phase. This research focus is inspired by the fact that a typical HPC application consists of three phases: initialization, repetitive calculation, and termination. Among these three phases, the initialization phase is usually used to define what will be calculated and how it will be calculated [43].

Our method of performance modeling and prediction includes two phases: a training phase and a predicting phase, as shown in Fig. 1. The training phase is used to collect data and build a performance model. It mainly consists of four stages: instrumentation, model learning, feature reduction, and model transfer. The predicting phase is used to handle a new input data of the target program, calculate the value of features, and output a predictive execution time with the transferred model or the non-transferred model depending on whether the transferred model is available. Next, we describe the processes of our method in detail.

2.2 Instrumentation

To capture behavior patterns of parallel programs without domain knowledge, we collect the runtime features through instrumentation. Instrumentation is a dynamic analysis for a program, which extracts program features from sample

executions. We develop an instrumentor using clang [1]. The instrumentor automatically analyzes the abstract syntax tree (AST) of the source code of the target program, and inserts detection code around assignments, branches, loops, and communications to generate the instrumented program. Assignments reflect the data flow of a program. Branches and loops reflect the control flow. MPI communications can be regarded as the skeleton of a program [45]. To reduce the overhead of instrumentation, the inserted code is kept lightweight, such as an incrementing integer counter for each branch feature and loop feature, and an assignment for each assignment feature [28]. We describe the instrumentation of these different types of features, respectively, below.

Fig. 1. Overview of automated performance modeling of HPC applications using machine learning.

2.2.1 Assignments

The size of a problem and the amount of calculation are decided by key variables like the problem size, iteration count, convergence condition, and solution accuracy. To discover the key variables from the source code, we insert the instrumentation code after assignments. If a variable is assigned twice or more, all values are recorded as different features. We do not instrument the variables in a loop, because their values are updated frequently. Recording all versions of these variables is impractical. Code fragment 1 demonstrates an example of instrumentation for assignments.

    // Code fragment 1
    n = parse();
    variable[1] = n;          // instrument
    n = preprocess(n);
    variable[2] = n;          // instrument
    result = init();
    variable[3] = result;     // instrument
    while (i < n)
    {
        result = calculate(i, result);
        i = i + 1;
    }

2.2.2 Branches

Branches can lead to different execution paths, which may have significantly different execution times. Thus the results of conditional statements in branches are important features for predicting performance. This type of feature is difficult to fetch via domain knowledge or static code analysis. For example, code fragment 2 shows a common process that examines whether a file is successfully opened. If successful, the program will load data from the file and execute a heavy calculation, otherwise the program exits immediately. We cannot predict the result of this branch according to the file path string until the branch is actually executed. This example also indicates that black-box performance modeling that only considers program input parameters as features is insufficient, since some important features can often only be fetched at runtime. We insert instrumentation code after the conditional statement of branches and record their results as runtime features.

    // Code fragment 2
    fp = fopen(filename, "r");
    if (fp)
    {
        if_counter[1]++;      // instrument
        load_data(fp);
        calculate();
    }
    else
    {
        if_counter[2]++;      // instrument
        return;
    }

2.2.3 Loops

Loops are usually the main calculation kernel of a program. The number of loop iterations has a direct impact on, and correlation with, the performance. Code fragment 3 demonstrates an example of inserting instrumentations to count the number of iterations of loops, including nested loops. This type of instrumentation can introduce huge overhead, especially when the loop is deeply nested and the loop body is fine-grained. Section 2.4 will describe the reduction method, which can remove unnecessary features and corresponding instrumentations.

    // Code fragment 3
    while (i < n)
    {
        loop_counter[1]++;    // instrument
        for (int k = 0; k < n; k++)
        {
            loop_counter[2]++;    // instrument
            calculation();
        }
        i = i + 1;
    }
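
The following is a minimal Python sketch, not the authors' clang-based tool, of how the assignment, branch, and loop instrumentation of Sections 2.2.1 to 2.2.3 can turn one execution into a training sample, that is, a feature vector plus its execution time. The function name `run_instrumented` and the toy program body are assumptions made for illustration only; the dictionary names mirror the arrays used in the code fragments.

```python
import time


def run_instrumented(n):
    """Toy stand-in for an instrumented program (hypothetical example)."""
    variable = {}                  # assignment features (Section 2.2.1)
    if_counter = {1: 0, 2: 0}      # branch features (Section 2.2.2)
    loop_counter = {1: 0, 2: 0}    # loop features (Section 2.2.3)

    start = time.perf_counter()
    variable[1] = n                # instrument: record the assigned value
    if n > 0:
        if_counter[1] += 1         # instrument: "if" path taken
    else:
        if_counter[2] += 1         # instrument: "else" path taken
    i = 0
    while i < n:
        loop_counter[1] += 1       # instrument: outer loop iteration
        for _ in range(n):
            loop_counter[2] += 1   # instrument: inner loop iteration
        i += 1
    elapsed = time.perf_counter() - start

    # One execution yields one feature vector and one execution time.
    features = [variable[1], if_counter[1], if_counter[2],
                loop_counter[1], loop_counter[2]]
    return features, elapsed


features, elapsed = run_instrumented(4)
print(features)  # prints [4, 1, 0, 4, 16]
```

The inner loop counter grows with the square of `n` here, which is the kind of fine-grained, deeply nested counter that makes loop instrumentation expensive.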

2.2.4 MPI Communications

Existing approaches can measure the communication characteristics of a program using benchmarks or communication traces. A benchmark can measure performance features of a network, like latency, bandwidth, or other parameters of models like LogP [17] and its variants [5], [27]. However, a benchmark mainly focuses on the general performance of a network, and it does not capture the details like data movement size, communication type, communication group of an MPI communication function call in an execution, etc. Using benchmarks to predict the communication time cost of a certain execution still needs an in-depth understanding of the application. Communication tracing can capture this information automatically, but tracing MPI events of a parallel program completely generates traces that often need very complex analysis, even if the program is a fixed-behavior benchmark program [45].
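
To illustrate why benchmark results alone are not enough, here is a small Python sketch of a plain latency-bandwidth cost estimate. This is a deliberately simpler model than LogP and is not part of the paper's method; the function name `message_time` and all numeric values are hypothetical.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-bandwidth estimate for one point-to-point message."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Network parameters of the kind a benchmark can measure (illustrative values).
latency_s = 2e-6
bandwidth = 10e9  # bytes per second

# Per-call message sizes: application-specific details that the benchmark
# itself does not provide (illustrative values).
call_sizes = [8, 4096, 1_000_000]

total = sum(message_time(s, latency_s, bandwidth) for s in call_sizes)
print(total)
```

The network parameters come from the benchmark, but the list of message sizes has to come from knowledge of the application or from a trace, which is the gap the paragraph above describes.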
In our method, we do not aim for modeling the communication behaviors of a parallel program precisely. It meets our needs as long as we can characterize a statistical correlation between performance and a small number of communication features. We take the data size and the number of targets of MPI communication function calls to represent communication features. MPI communication functions, such as MPI_Send, MPI_Bcast, MPI_Gather, MPI_Allgather, MPI_Reduce, MPI_Allreduce, are instrumented before they are invoked. Code fragment 4 shows an example of instrumenting MPI communications. To minimize the overhead from synchronization, each MPI process maintains its local features during execution and synthesizes these features to the root process at the end of the program.

    // Code fragment 4
    int type_size;
    MPI_Type_size(MPI_INT, &type_size);       // size in bytes of one MPI_INT
    data_size[1] = n * type_size;             // instrument
    MPI_Comm_size(my_comm, &comm_size[1]);    // instrument

    MPI_Bcast(data, n, MPI_INT, root, my_comm);
    calculate(data);

2.3 Model Learning

After collecting runtime features via instrumentation, we then try to discover the correlation between features and program execution time. It can be treated as a multivariate nonlinear regression problem. Assume that there are $n$ samples. Each sample is expressed as $(x, y)$, where $x$ is a vector consisting of $m$ features and $y$ is the corresponding execution time. The goal of this regression problem is to find a mapping relation, $f: x \rightarrow y$, that minimizes the mean square error (MSE) between the predictive value and the real execution time in the $n$ samples

$$\min \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2. \tag{1}$$

There exist numerous approaches to solve this regression problem, like ridge regression [11], LASSO [42], artificial neural network (ANN), and random forest [12]. We adopt a random forest approach with an optimization called extremely randomized trees [20]. Random forest is widely used in classification or regression tasks. Besides the capability of modeling complex nonlinear data, another advantage of random forest is that it can process a mix of different feature types, including float, integer, and enumeration [25]. Such a characteristic makes it suitable to model the runtime features we trace from MPI program execution. Besides, random forest can analyze the importance of each feature. It enables reducing redundant features and corresponding instrumentations.

Random forest is an ensemble learning method based on CART regression trees [13]. Ensemble learning trains many weak models that may have low prediction accuracy and synthesizes them to a strong model. A random forest consists of multiple regression trees. Fig. 2 shows an example of a data space of two features, $x_1$ and $x_2$, and a regression tree divides the data space into five regions.

Fig. 2. A data space (top) consisting of two features ($x_1$, $x_2$) is divided into five regions (bottom).

A node in a regression tree selects a feature value $\theta_i$ to divide the input data space into two disjoint regions. The regression tree recursively divides the input data space into multiple disjoint regions $R_i$. Each region is denoted by a leaf node and represents a response value, which means the program execution time in this study. When handling a new input, we can find a path from the tree root to a leaf node according to the value of features of this input, decide which region this input should belong to, and then take the response value as a prediction. This inference process also fits the nature of human decision. The model of a regression tree can be presented as follows:

$$f(x) = \sum_{i=1}^{k} c_i \, I(x \in R_i), \tag{2}$$
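
Read as code, Equation (2) is a piecewise-constant lookup: each of the disjoint regions carries one response value, and the indicator selects the region that contains the input. The Python sketch below evaluates that sum directly. The representation of regions as per-feature intervals, the function name `tree_predict`, and the five example regions with their response values are assumptions for illustration; they are not taken from Fig. 2.

```python
def tree_predict(x, regions):
    """Evaluate f(x) = sum_i c_i * I(x in R_i) for axis-aligned regions.

    regions is a list of (bounds, c) pairs. bounds holds one half-open
    interval (low, high) per feature, and c is the response value.
    """
    total = 0.0
    for bounds, c in regions:
        inside = all(low <= xi < high for xi, (low, high) in zip(x, bounds))
        total += c * (1 if inside else 0)  # c_i * I(x in R_i)
    return total


INF = float("inf")
# Hypothetical split of a two-feature space into five disjoint regions.
regions = [
    ([(-INF, 2.0), (-INF, 3.0)], 1.5),
    ([(-INF, 2.0), (3.0, INF)], 4.0),
    ([(2.0, 5.0), (-INF, INF)], 7.0),
    ([(5.0, INF), (-INF, 1.0)], 9.0),
    ([(5.0, INF), (1.0, INF)], 12.0),
]

print(tree_predict([1.0, 3.5], regions))  # prints 4.0
```

Because the regions are disjoint and cover the whole space, exactly one indicator is 1 for any input, so the sum returns that region's response value. A real tree finds the region by walking from the root to a leaf instead of scanning all regions.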

In Equation (2), $k$ is the number of regions and $c_i$ is the response value. In other words, $c_i$ is the predictive execution time of samples in region $R_i$. $I(\cdot)$ is an indicator, which outputs 1 if $x$ belongs to region $R_i$ and 0 otherwise.

To achieve the optimization goal in Formula (1), $c_i$ for leaf nodes and $\theta_i$ for inner nodes should be carefully considered. The division should ensure that each region contains similar data, where the similarity is measured by MSE between $\theta_i$ and all response values in this region. Generally, if a division feature is selected, it is easy to prove that the average of the corresponding $y$ of $x$ in region $R_i$ can minimize the MSE [13]. The best division feature can be selected by enumerating all features in the subset and selecting the one with minimum MSE.

Building only one regression tree to divide regions and to predict the response value often leads to over-fitting. Thus, a random forest needs numerous regression trees. Each regression tree fits a random subset of the original features and learns from different parts of training data. After building all regression trees, when handling a new input, the average output of all regression trees is taken as the final prediction of a random forest. The ensemble of these trees constructs a robust and well-generalized model. Extremely randomized trees adopt a modification that, in each division, the division feature and its division value are randomly selected. This extremely random division can further improve the accuracy of the ensemble model.

A useful improvement is building a model from a series of new features that are generated from a basis function, instead of the original features. The transformed model has almost the same form as the original model in Equation (2), but replaces $x$ with $\Phi(x)$, where $\Phi(\cdot)$ is a basis function.

$$f(\Phi(x)) = \sum_{i=1}^{k} c_i \, I(\Phi(x) \in R_i). \tag{3}$$

To predict the program execution time, the $d$-order polynomial expansion function is a reasonable and effective basis function [24], [25], [28], which enumerates all products of the original features of order less than or equal to $d$. It transforms the original $m$ features to $\binom{m+d}{d}$ new features. For example, let $\Phi(\cdot)$ be a 2-order polynomial expansion function. The transformation of a vector $x$ with two features $(a_1, a_2)$ is

$$\Phi(x) = (1, a_1, a_2, a_1 a_2, a_1^2, a_2^2). \tag{4}$$

This transformation is inspired by the fact that the time complexity of an algorithm is usually a polynomial function of its parameters. Even if it contains other forms of functions like logarithmic function or trigonometric function, a polynomial function can still provide an accurate approximation, according to the Taylor series expansion. Experimental results also confirm that this transformation can improve the accuracy of predictions.

2.4 Features Reduction

A straightforward solution of using the collected features as discussed in Section 2.2 and the modeling technique discussed in Section 2.3 will generate a fairly complex model, because a parallel program may contain many assignments, branches, loops, and communications. The polynomial expansion will further increase the number of features. Redundant features take significant storage and introduce extra overhead. Even if we could afford such heavy storage and calculation, according to Occam's razor, an unnecessarily complex model tends to over-fit, which results in poor generalization performance. Therefore, it is strongly desired to filter these features to generate a reduced model.

2.4.1 Reduction by Time

Some features are too expensive to fetch their values when predicting the execution time of a new input. For example, if taking a variable that is generated at the end of the program as a feature, we need to completely execute the program to fetch the value of this feature. Such an expense makes the prediction much less useful.

This problem can be solved by setting a time threshold that only reserves the features below the threshold. In each execution of the training phase, the time cost for obtaining a feature can be recorded. Since the training dataset contains many executions with different time costs, we then calculate the ratio between the average time cost of a feature and the average time cost of program executions. Long-time executions occupy a larger portion for calculating averages since they are more sensitive to the expense. This ratio generally measures the expense of obtaining a feature. If we set the time threshold to be 5 percent, then features of which the time ratio is greater than 5 percent will be removed. Meanwhile, the corresponding instrumentation codes are also removed. After removing these instrumentations, the overhead can be significantly reduced.

When setting a high time threshold, the reduction is ineffective. In contrast, if setting a low time threshold, some features that are closely related to performance may be removed and the prediction accuracy would decrease. The trade-off between reduction effectiveness and prediction accuracy depends on the characteristics of the target application. As discussed in [43], in HPC applications, many actions that appear in the early execution phase can be effective predictors of performance, like loading and distributing data, parsing key parameters (e.g., scale, initial value, step size, etc.), evaluating and selecting solvers, early part of iterative calculation, etc. In our experiments, we evaluated different time thresholds and verified that a low time threshold can effectively reduce redundant features and still maintain desired prediction accuracy.

2.4.2 Reduction by Importance

After reducing features by time threshold, we can further remove some features that have less impact on performance. At first, we need to define importance, a quantitative metric for measuring the impact of every feature. When we adopt random forest to build a performance model, intermediate results can be used to analyze the importance of features. As we have described in Section 2.3, a node in a regression tree has its corresponding division feature, division value, and MSE. After dividing data by this node, data within a new region have higher similarity to each other, therefore the sum of MSE (weighted by data size of

corresponding regions) of children nodes is less than that of this node. The extent of MSE reduction can represent the importance of the division feature [12].

Algorithm 1. Calculate Importance(forest)

    1: forest.importances = zeros(m)
    2: for each tree in forest do
    3:     tree.importances = zeros(m)
    4:     for each non-leaf node in tree do
    5:         left = node.leftchild
    6:         right = node.rightchild
    7:         tmp = (node.data_size × node.MSE -
    8:                left.data_size × left.MSE -
    9:                right.data_size × right.MSE)
    10:        tree.importances[node.feature] += tmp
    11:    end for
    12:    tree.importances /= tree.root.data_size
    13:    forest.importances += tree.importances
    14: end for
    15: forest.importances /= forest.n_trees
    16: forest.importances /= sum(forest.importances)

Algorithm 1 presents the pseudocode of calculating feature importance using random forest. Assume that a random forest model is trained and the MSE of each node is calculated. The forest contains n_trees trees and m features. Each tree node contains data_size data samples in its corresponding data region. The division feature of a tree node is denoted by node.feature. In line 1 and line 3, we initialize the importance as zero. The for loop starting from line 4 calculates the weighted MSE decrease of each non-leaf tree node. The decreases from the same feature are accumulated as shown in line 10. The feature importance of the forest is the average value of the feature importance of its trees. Finally, the sum of the feature importance of the forest is normalized to 100 percent.

After calculating importance, we can sort features in descending order, set a threshold, and reserve the top features with accumulated importance not greater than the threshold. This process reduces useless features. The effectiveness of reduction by importance is verified in our evaluation.

Traditional dimensionality reduction techniques, like principal component analysis (PCA) or singular value decomposition (SVD), can effectively reduce the number of features. However, these techniques are not suitable for our

and select different parameters before performing an actual calculation. This selection is obviously related to the execution time, but often it is not a hot spot since it is a quick pre-process for performance optimization.

2.5 Model Transferring

The method we have discussed so far requires the training data and the testing data from the same platform since a fundamental assumption of a typical machine learning method is that training data and testing data follow the same distribution. It largely limits the use scenarios of our performance model, since the platform settings may be modified or the application may need to be run on a completely new platform. It is feasible to repeat the modeling process, like collecting data, building model, and reducing features, on the new platform, but the new model may not achieve the desired accuracy until a sufficient amount of execution samples are collected and analyzed. This challenge is commonly referred to as the cold start problem.

A possible solution to dealing with this problem is based on a critical observation that models on different platforms are not absolutely irrelevant to each other. The inherent behaviors of an application determine that it often maintains similar performance patterns across different platforms. A simple scenario is that two platforms have the same components except their CPU frequency has a slight difference. In this case, the performance model of one platform can be easily converted to that of another platform via arithmetic calculations. However, real-world scenarios are much more complicated, since modern computing systems consist of many components like CPU, memory, network, etc. Each of them is also a fairly complicated subsystem and is developed and updated rapidly. Therefore the relevance of performance models on different platforms is non-trivial. To leverage this relevance and to reuse an existing model, we adopt a transfer learning method.

In transfer learning, a domain $\mathcal{D} = \{\mathcal{X}, P(x)\}$ consists of a feature space $\mathcal{X}$, where $x = (x_1, \ldots, x_m) \in \mathcal{X}$, and a marginal probability distribution $P(x)$. A task $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$ consists of a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$. Note that the meanings of symbols remain the same as in Section 2.3. Generally, our model transferring problem can be regarded as an inductive transfer learning problem [34]. Given a source domain $\mathcal{D}_s$ and its task $\mathcal{T}_s$, a target domain $\mathcal{D}_t$ and its task $\mathcal{T}_t$, we aim to find the target predictive function $f_t(\cdot)$ in $\mathcal{T}_t$ using the knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$,
purpose. New features generated from PCA or SVD are lin- where Ds ¼ Dt but P ðys jxs Þ 6¼ P ðyt jxt Þ.
ear combinations of original features, therefore we need to We introduce a transfer method to solve this problem.
reserve all the instrumentation of original features to fetch An instance refers to an execution sample. An execution
their values and calculate the value of new features. It does sample of an application on a certain platform can be
not help reduce the overhead and storage demand from regarded as a measure of this platform, and the measure-
instrumentation. ment result is represented by a ðm þ 1Þ-dimension point
Reduction by importance is also different from finding pro- ðy; x1 ; . . . ; xm Þ. Thus on two different platforms, multiple
gram hot spots via profiling, although they have some similar executions will generate two point sets with different mar-
processes like instrumentation, recording runtime informa- ginal probability distributions of y in the same
tion, and analyzing performance data. Hot spots are program ðm þ 1Þ-dimension space. Although we do not know the
snippets that occupy a large proportion of the whole execu- exact distributions, we can learn a transfer function hðÞ
tion time and might be performance bottlenecks. Important from the source platform to the target platform by
features in our work denote runtime information that is statis-
tically related to the execution time. For example, high perfor-
1X u
mance sparse matrix-vector production algorithms [16], [30], min MSE ¼ ðyti  hðysi ; xti1 ; . . . ; xtim ÞÞ2 : (5)
[31] analyze the data pattern of the input matrix and vector u i¼1
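As a minimal illustration of the objective in Equation (5), the sketch below fits a deliberately simplified transfer function $h(y_s) = a \cdot y_s + b$ by ordinary least squares, using only the Python standard library. This is not the transfer function of the paper, which is a random forest that also takes the target features as inputs; the linear form corresponds to the simple "arithmetic conversion" scenario mentioned above. The function names and all data values are hypothetical.

```python
def fit_linear_transfer(y_source_pred, y_target):
    """Least-squares fit of h(y_s) = a * y_s + b on u target samples.

    Simplifying assumption: h ignores the target features x_t1..x_tm,
    unlike the random-forest h described in the paper.
    """
    u = len(y_target)
    ys_bar = sum(y_source_pred) / u
    yt_bar = sum(y_target) / u
    sxx = sum((s - ys_bar) ** 2 for s in y_source_pred)
    sxy = sum((s - ys_bar) * (t - yt_bar)
              for s, t in zip(y_source_pred, y_target))
    a = sxy / sxx
    b = yt_bar - a * ys_bar
    return a, b


def mse(y_true, y_pred):
    """Mean squared error, the quantity minimised in Equation (5)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: source-model predictions y_si for five target inputs,
# and the execution times y_ti actually measured on the target platform.
y_source_pred = [10.0, 20.0, 30.0, 40.0, 50.0]
y_target = [7.4, 13.1, 19.2, 24.8, 31.0]

a, b = fit_linear_transfer(y_source_pred, y_target)  # a ≈ 0.589, b ≈ 1.43

# Error of using the source prediction directly vs. after transfer.
mse_raw = mse(y_target, y_source_pred)
mse_transferred = mse(y_target, [a * s + b for s in y_source_pred])
```

With only five target samples the fitted h already removes most of the systematic gap between the two platforms in this toy data, which is the intuition behind reusing a source model when target data are scarce.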
Solving this regression problem still requires that the target platform has executed a small number of execution samples $(y_{ti}, x_{ti1}, \ldots, x_{tim})$, where $i \in [1, u]$. Then we use the existing source model to generate execution time predictions as $y_{si} = f(\Phi(x_{ti1}, \ldots, x_{tim}))$, where $f(\Phi(x))$ is described in Equation (3). Another random forest model is applied to these samples to learn the transfer function $h(\cdot)$. It converts a prediction from the source model to a prediction from the target model. The final form of the target model is $h(f(\Phi(x)), x)$.

2.6 Predicting Phase
The predicting phase of our method is simpler than the training phase. There are two types of predictions: (1) using a non-transferred model to predict the performance on the target platform with all trace data from this platform itself, and (2) using a transferred model to predict the performance on the target platform with help from a source platform.

Algorithm 2. Predict(instru_app, inst, h, f_s, f_t, Φ)
1: x = instru_app(inst)
2: if h = NULL || f_s = NULL then
3:   return f_t(Φ(x))
4: else
5:   return h(f_s(Φ(x)), x)
6: end if

Algorithm 2 shows the procedure of predicting the execution time of an application with an input instance on the target platform. Its parameters include the instrumented application instru_app, an input instance inst of the application, the transfer function $h(\cdot)$, the performance model on the source platform $f_s(\cdot)$ (a random forest trained by all trace data from the source platform), the performance model on the target platform $f_t(\cdot)$ (a random forest trained by all trace data from the target platform), and the polynomial expansion function $\Phi(\cdot)$. First, it executes the instrumented application with inst. During this execution, instrumentation codes export runtime features $x$. Then this execution is terminated early according to the time threshold set by the feature reduction step.

If a source model or a transfer function does not exist, we can only make a prediction with the model $f_t(\cdot)$. As described in Section 2.3, each regression tree in the forest calculates its response according to feature values with polynomial expansion. Finally the model outputs the average response of all trees as the predicted execution time of a given input. If a source model $f_s(\cdot)$ and a corresponding transfer function $h(\cdot)$ exist, we first use $f_s(\cdot)$ to produce a prediction, and then transfer the prediction to the target platform. In general, both the non-transferred model $f_t(\cdot)$ and the transferred model $h(f_s(\Phi(x)), x)$ can predict the execution time of inst on the target platform. The transferred model can achieve better prediction accuracy if the training data set is small. When the amount of training data increases, the transferred model still achieves accuracy similar to that of a non-transferred model. Thus, we always adopt the transferred model as long as it is available.

Besides making point prediction (e.g., scalar value of execution time), our model can also predict an interval of performance. It enables many useful functions, such as quantifying performance variation and detecting abnormal performance. The same code run with different background traffic can result in different performance. An interval prediction can provide more comprehensive insight into the varying performance.

The prediction of performance interval is based on the empirical variance of random forest. A leaf node in a regression tree, which represents a part of the feature space, may contain more than one training sample. When making point prediction, the regression tree returns the mean value of the data in a leaf node as the prediction, while the variance can reflect the uncertainty of this prediction. Assume that each tree generates predictive mean $\mu_i$ and predictive variance $\sigma_i^2$. When a leaf only contains one sample, we set a small number (e.g., $\sigma_{\min}^2 = 0.01$) as its variance. The random forest with $b$ trees can predict [25] joint mean $\mu$ and variance $\sigma^2$ across all trees by:

$$\mu = \frac{1}{b} \sum_{i=1}^{b} \mu_i \tag{6}$$

$$\sigma^2 = \left( \frac{1}{b} \sum_{i=1}^{b} \left( \mu_i^2 + \sigma_i^2 \right) \right) - \mu^2. \tag{7}$$

Then we can make interval predictions as $[\mu - \sigma, \mu + \sigma]$, which represents the uncertainty of performance predictions learned from training data.

3 EVALUATION
In this section, we present the evaluation results and analyses of our method.

3.1 Experimental Setup
3.1.1 Applications
Three applications, Graph500, GalaxSee, and SMG2000, were tested to predict their execution time under different input parameters.

Graph500 (version 2.1.4) [3] is a widely used benchmark focusing on data intensive computing. The main kernel of Graph500 is a Breadth-First Search (BFS) of a graph which starts with a single source vertex.

GalaxSee [2] is a parallel N-body simulation program used for simulating the movements of multiple celestial objects. It contains categorical parameters that determine different implementations of algorithms to solve the problem.

SMG2000 [4] is a parallel semicoarsening multigrid solver for the linear systems arising from finite difference, finite volume, or finite element discretizations of the diffusion equation. This solver is a key component for achieving scalability in radiation diffusion simulations.

Among these three applications, Graph500 is a simple application since it contains few input parameters and each parameter has a straightforward impact on performance. In contrast, GalaxSee contains different types of parameters, and their impact on performance is not obvious. Tables 1, 2, and 3 show their parameters and value range in our experiments, respectively.
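To make the role of these value ranges concrete, the following sketch draws input instances uniformly at random from the GalaxSee ranges of Table 2, in the spirit of the sampling described in Section 3.3. It is an illustration only: the dictionary layout, the function name, the fixed seed, and the treatment of enumeration parameters as integer codes are assumptions, not details taken from the paper.

```python
import random

# Ranges copied from Table 2 (GalaxSee): name -> (type, low, high).
GALAXSEE_RANGES = {
    "N": ("integer", 5000, 10000),
    "ROTATION_FACTOR": ("float", 0.1, 1.0),
    "SCALE": ("integer", 10, 1000),
    "MASS": ("float", 100.0, 100000.0),
    "INT_METHOD": ("enumeration", 1, 6),
    "FORCE_METHOD": ("enumeration", 1, 3),
    "N_PROC": ("integer", 16, 1024),
}


def sample_instance(ranges, rng):
    """Draw one input instance with every parameter uniform in its range."""
    inst = {}
    for name, (kind, lo, hi) in ranges.items():
        if kind == "float":
            inst[name] = rng.uniform(lo, hi)
        else:
            # Assumption: integer and enumeration parameters are both
            # sampled as integers in the closed interval [lo, hi].
            inst[name] = rng.randint(lo, hi)
    return inst


rng = random.Random(0)  # fixed seed so the sketch is reproducible
instances = [sample_instance(GALAXSEE_RANGES, rng) for _ in range(1000)]
```

Each element of `instances` is one candidate input configuration; on platforms with fewer cores the N_PROC range would need to be narrowed accordingly.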
TABLE 1
Input Parameters of Graph500

Parameter         Type         Range
SCALE             integer      [10, 18]
EDGEFACTOR        integer      [10, 100]
N_PROC            integer      [16, 1024]

TABLE 2
Input Parameters of GalaxSee

Parameter         Type         Range
N                 integer      [5000, 10000]
ROTATION_FACTOR   float        [0.1, 1.0]
SCALE             integer      [10, 1000]
MASS              float        [100.0, 100000.0]
INT_METHOD        enumeration  [1, 6]
FORCE_METHOD      enumeration  [1, 3]
N_PROC            integer      [16, 1024]

TABLE 3
Input Parameters of SMG2000

Parameter         Type         Range
nx,ny,nz          integer      [50, 200]
cx,cy,cz          float        [0.1, 10.0]
SOLVER            enumeration  [0, 3]
N_PROC            integer      [16, 1024]

3.1.2 Platforms and Environment
The experiments were conducted on three different platforms, denoted as A, B, and C. Table 4 lists the configuration of each platform. These three platforms have different characteristics. Platform B has higher single-core performance, but the number of cores per node is only 4. Platform C is a fat node with 8 low-frequency, 18-core CPUs. All communications in Platform C are intra-node communications. A single node of Platform A is in between Platform B and Platform C, but the size of the entire Platform A is much larger than that of B and C. Since the maximum number of CPU cores in platform B and C is 160 and 144, respectively, the parameter N_PROC, namely the number of processes, of each application is set to [16, 128] on these two platforms.

TABLE 4
Configuration of Experimental Platforms

               Platform A     Platform B            Platform C
CPU type       E5-2680v4      E3-1240v5             E7-8860v4
frequency      2.4 GHz        3.5 GHz               2.2 GHz
#cores/node    28             4                     144
mem/node       128 GB         32 GB                 1 TB
#nodes         40             40                    1
network        100 Gbps OPA   100 Gbps InfiniBand   N/A

Applications were compiled using Intel C/C++ compiler 15.0.0 and ran on CentOS 7.3 system. The regression model and the model transferring programs were written in Python 3.6.1 and scikit-learn library [35]. The Intel MPI version 5.0 was used as the MPI library.

3.2 Feature Reduction
We first evaluated the feature reduction methods introduced in Section 2.4. In this series of experiments, each application was tested 100 times on Platform A with different input parameters to trace runtime features. Since a $d$-order polynomial expansion on $m$ features will generate $\binom{m+d}{d}$ new features, in this series of tests, we did not adopt polynomial expansion. We aim for reducing the number of features to under 20 so that a 3-order polynomial expansion will not generate too many new features.

TABLE 5
The Impact of Different Time Thresholds on GalaxSee

Time Threshold   Actual Time   # Features   Overhead   Error
5%               1.5%          84           1.2%       22.1%
10%              5.8%          85           1.2%       21.3%
25%              23.7%         124          29.4%      19.7%
50%              33.8%         126          29.6%      14.6%
75%              58.2%         132          32.8%      9.4%
100%             99.9%         357          1521.2%    7.5%

The effectiveness of reductions is controlled by the time threshold and the importance threshold as defined in Section 2.4. We first evaluated the impact of varying time threshold settings. Table 5 shows the evaluation results, taking GalaxSee as an example. The actual time means the real time proportion of fetching feature values under its corresponding time threshold. For example, when setting the time threshold to be 5 percent of the complete program execution time, the actual time proportion we measured is 1.5 percent. The actual value is always smaller since the threshold is an upper bound. We took 50 percent of data as training set and used random forest to predict the others. The prediction error in our experiments is calculated by the below formula

$$\frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_{predict}[i] - y_{real}[i]}{y_{real}[i]} \right| \times 100\%, \tag{8}$$

where $y_{predict}[i]$ is the predicted performance and $y_{real}[i]$ is the actual performance of the $i$th testing sample. Table 5 reports these results, which confirms our hypothesis discussed in Section 2.4 that lower time threshold can achieve better reduction, but it may reduce some useful features and decrease the prediction accuracy. When setting the time threshold to be 100 percent, it can achieve a very low error rate, but it is almost useless since the actual time and the overhead are impractical. The 5 percent time threshold is sufficient to achieve acceptable accuracy, and meanwhile to keep low actual time and overhead. Note that in this series of evaluation tests, we only ran 100 samples since some of the full instrumented programs took too much time. These samples were used to analyze the trend of the impact under varying thresholds. The errors measured in these tests are higher than those presented in Section 3.3 with the same reduction setting, since the latter was evaluated with more data samples.

We also evaluated with varying the importance threshold on GalaxSee with 5 percent time thresholds, and the results are shown in Table 6. With the decrease of the
importance threshold, the number of reserved features also decreases, but the error rate becomes higher. The results of 95 percent threshold in Table 6 indicate that, after reduction by 5 percent time as shown in Table 5, there still exist many redundant features among 84 features. Removing these redundant features only slightly increased the error rate from 22.1 to 23.1 percent. Further reduction may remove some important features and result in a higher error rate.

TABLE 6
The Impact of Different Importance Thresholds on GalaxSee

Importance Threshold   # Features   Error
99%                    29           22.7%
95%                    18           23.1%
90%                    15           27.7%
80%                    8            33.6%
70%                    3            45.3%

We then evaluated all three applications with 5 percent time threshold and 95 percent importance threshold. Tables 7, 8, and 9 report these experimental results.

TABLE 7
The Impact of Feature Reduction on Graph500

Feature State    # Features   Overhead   Storage (per run)
full             87           71.6%      1,458 KB
by time          15           1.4%       85 KB
by importance    5            1.1%       28 KB

TABLE 8
The Impact of Feature Reduction on GalaxSee

Feature State    # Features   Overhead   Storage (per run)
full             357          1521.2%    5,164 KB
by time          84           1.2%       615 KB
by importance    18           0.5%       51 KB

TABLE 9
The Impact of Feature Reduction on SMG2000

Feature State    # Features   Overhead   Storage (per run)
full             648          29.4%      7,821 KB
by time          87           0.5%       697 KB
by importance    17           0.1%       151 KB

In general, feature reduction can effectively reduce redundant features as well as instrumentation overhead and storage. Taking GalaxSee as an example, and as Table 8 shows, the complete number of instrumentations of GalaxSee is 357. These instrumentations introduced a total of 1521.2 percent extra overhead, which means that running the instrumented GalaxSee takes roughly 16 times as long as the original program without any instrumentation. In each run, the trace data generated from instrumentation required 5,164 KB storage on average. After two processes of reduction, only 18 important features were reserved. The storage demand was reduced to 51 KB. Their instrumentations only introduced 0.5 percent overhead, which means that running an instrumented program is almost the same as the original one. We can provide the instrumented version of programs for HPC users to run their jobs. It will continuously generate runtime feature data and help the regression model to be more accurate.

3.3 Performance Prediction
In this section, we discuss the prediction accuracy of different machine learning methods for our performance model.

We ran each application on each platform 1,000 times with different input parameters. In other words, each modeling task had 1,000 data samples. The input parameters we used for experiments were generated from the parameter value range of each application uniformly and randomly. Fig. 3 shows the time distribution of three applications.

After fetching runtime feature values of these samples, each feature was normalized to zero mean and unit variance independently. Part of the data samples was randomly selected as the training set while the others were used as the testing set.

Fig. 6 presents the mean errors of the testing set under different ratios of the training data. The regression methods we tested include least absolute shrinkage and selection operator (LASSO), support vector machine regression (SVR) with radial basis function (rbf) kernel, and random forest (RF). Each method was applied to both the raw form of data and its 3-order polynomial expansion. We also tested two sophisticated modeling methods from related works. One is a linear model based on performance model normal form (PMNF) [9], [10], [14]. The other one is a deep learning model called PerfNet [32]. PMNF contains a special form of nonlinear expansion. PerfNet does not have a process for feature reduction, so the polynomial expansion would make the model unnecessarily large and possibly cause overfitting. Therefore we did not use polynomial expansion on PMNF and PerfNet.

Since the mapping relation between the features and the execution time is complicated, a linear regression method, like LASSO, has higher prediction error than nonlinear methods. Using a polynomial basis function, it is actually converted to a nonlinear method, and the corresponding errors are significantly reduced. It indicates that a polynomial function is an acceptable approximation between features and program execution time.

As for nonlinear regression methods, SVR, PerfNet and RF can achieve lower prediction error. Polynomial expansion provides little benefit to SVR and RF in most cases as shown in Fig. 6. Generally, RF has better prediction accuracy. For a simple program like Graph500, PerfNet can achieve comparable results. An important factor of the prediction accuracy is the impact of categorical features. A categorical feature is a variable whose value belongs to a finite and discrete set. The value of a categorical feature is just a tag and does not have a numerical meaning. Thus, if a regression model considers it as a numerical feature, the result can be worse. Although there are many transformation measures to avoid this problem, they need a precondition to verify which one is a categorical feature. However, because we do not have domain knowledge, we cannot realize which feature is categorical and adopt transformation
measures for it. The advantage of RF is that categorical features naturally have less impact on it. A regression tree in RF can generate nodes to divide the sample data by a categorical feature, then within each divided data region, the categorical feature is a constant and has no impact on further regression. In our experiments, GalaxSee contains categorical features, therefore the superiority of RF is more apparent than that in experiments with other applications.

Fig. 3. Execution time distribution of three applications.

In addition to the execution time prediction, another objective that attracts much attention in many applications is the relative rank of performance on different input parameters. Sometimes we just want to understand which input will be faster while the exact execution time is not necessary [15], [23], [32]. To measure the ability of predicting the performance rank using our model, we took 50 percent data as the training set and tested the accuracy of predicting the performance rank of all sample pairs in the testing set. Assume that the testing set contains $k$ samples, the accuracy is calculated as

$$\mathrm{accuracy} = \frac{2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} z_{ij}}{k(k-1)} \times 100\%, \tag{9}$$

where

$$z_{ij} = \begin{cases} 1 & (y_{predict}[i] - y_{predict}[j])(y_{real}[i] - y_{real}[j]) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$

This objective measures the ratio of successfully predicting the faster input parameter in each sample pair where there are $\frac{k(k-1)}{2}$ pairs in total. Fig. 4 shows the experimental results of three applications on three platforms. Considering all experimental configurations, the accuracy of predicting the performance rank of each sample pair is around 90 percent. It confirms that in most cases, our model can predict the correct rank relationship when comparing the performance of two different input parameters.

Fig. 4. The prediction accuracy of performance rank.

3.4 Performance Variation
As we discussed in Section 2.6, besides predicting the scalar value of execution time, our model can also predict an
interval of performance, based on calculating empirical variance. Fig. 5 shows an example of predicting performance interval of SMG2000 on Platform A. We draw an interval $[\mu - \sigma, \mu + \sigma]$ for each prediction, as Equations (6) and (7) describe. Some points diverge from the ideal predicting line, thus they have large variance intervals. It indicates that these predictions have relatively high uncertainty and more training data around these points are required to improve their predictions.

Fig. 5. Predicting performance interval of SMG2000.

Fig. 6. The prediction error of different machine learning methods with various applications and platforms.

3.5 Impact of Runtime Features
In this section, we discuss the impact of runtime features. Taking Graph500 on Platform A as an instance, we show the importance of each feature in Table 10 in descending order. The importance is defined and calculated by Algorithm 1. Since Graph500 does not require complicated domain knowledge, we can check the meaning of each feature from the source code. To clarify the meaning of features, this measurement of importance is conducted before the polynomial transformation. In Table 10, these features represent the problem scale, the topology of MPI communication, the number of processes, the average node degree and a for-loop counter, respectively. The value of the for-loop counter is the logarithm of the number of processes if it is required to be a power of two, otherwise zero. All these five features are automatically detected by our model. Among them, the first, the third and the fourth features are directly covered by input parameters while the others are not. The three input parameters contribute about 0.685 importance in sum, which means they are quite important but not dominant. Further comparison is shown in Fig. 7. With the same
random forest model, if we only use input parameters as features, the predictive error will increase.

TABLE 10
Feature Importance of Graph500

No.   Importance   Type           Location
1     0.353        Variable       main.c line 75
2     0.188        Variable       main.c line 143
3     0.184        Variable       utils.c line 25
4     0.148        Variable       main.c line 76
5     0.125        Loop counter   utils.c line 32

Fig. 7. Impact of runtime features on Graph500.

3.6 Effectiveness of Model Transferring
In this section, we discuss the effectiveness of model transferring. Similar to these tests reported in Section 3.3, there are 1,000 random data samples of each application available on each platform. The difference in the experiment setting is that we only take a small number of samples (1-10 percent) as the training set and others (99-90 percent) as the testing set.

The transfer method used an existing well-trained random forest model from another platform, discussed in Section 3.3, as a source model, and transferred it to a target model. We used Platform B, C as the source platform to generate the transferred model on Platform A. Fig. 8 reports mean errors on the testing set under different ratios of the training data. As a comparison, a traditional machine learning method alone, like random forest, has worse results on predicting performance, because the training data is too small to learn a reliable model from it. With the increase of training percentage, the effectiveness of these two methods tends to be closer.

Fig. 8. The prediction error of transfer method and random forest with various applications and platforms. Transfer method uses an existing model from a source platform to help build the model on the target platform, while random forest method can only exploit data from the target platform.

4 RELATED WORK
In this section, we review and discuss existing studies along four categories, analytical modeling methods, replay-based modeling methods, statistical modeling, and model transferring for the performance prediction of parallel programs.

4.1 Analytical Modeling
Analytical modeling uses an analytical formula to describe the program performance. As we have introduced in Section 1, an analytical model is tightly coupled with a particular algorithm and a particular application domain, thus it involves extensive efforts from human experts. For example, Eller et al. [18] proposed an analytical model for Krylov Solver on structured grid problems. Hang et al. [36] proposed a detailed performance model for deep neural networks. Barker's model [8] focused on the Krak Hydrodynamics Application. In general, an analytical model for a specific application is difficult to be applied to other applications.

4.2 Replay-Based Modeling
Replay-based modeling uses instrumentation or similar techniques to trace detailed information from program executions. Through analyzing the trace, a synthetic program can be reconstructed for replaying behaviors of the original program
and predicting its performance. Since tracing and analysis can be automated, this type of modeling and prediction method can eliminate the requirement of domain knowledge and can be generalized for different applications.

Several studies exist in this area. For instance, Sodhi, Zhang and Hao [21], [38], [45] constructed a skeleton of a parallel program from traces. Skeleton preserves the flow and logic of the original program but reduces calculations and communications. Zhai et al. [44] analyzed traces and introduced a deterministic replay to predict the performance. However, traces require a large storage space. Even when tracing simple parallel programs like NPB benchmark [7], storage requirements can range from hundreds of megabytes to tens of gigabytes for each run [9], [44]. Besides, a trace-based modeling usually consists of a program skeleton or other forms of a synthetic program. A synthetic program shrinks computation and communication of the original code. It loses the semantics of the original code; therefore, it is not human-readable. The prediction lacks interpretability, which does not locate the performance factors of parallel programs.

4.3 Statistical Modeling
The development of machine learning techniques enables the possibility of empirically analyzing the performance patterns of a parallel program under different input parameters. Song et al. [39] adopted a machine learning method called Delta Latent Dirichlet Allocation (DLDA) [6] to model application executions. It can locate possibly low-performance code blocks but not predict the exact execution time cost. Ogilvie and Thiagarajan [33], [41] focused on constructing surrogate models for auto-tuning. In these problem settings, performance models are mainly required to achieve good accuracy on high-performance sub-space while low-performance part can be ignored. Lee et al. [29] proposed methods that employ artificial neural networks to predict the performance of parallel programs. Their methods can capture system complexity implicitly from various input data, but their work only focuses on a fixed number of cores. Additionally, their method cannot analyze the impact of each feature since the artificial neural network is a black-box model. A series of studies [9], [10], [14] attempted to model the performance of kernels in a parallel program using linear regression methods like ridge regression, least absolute shrinkage and selection operator, or their variants to model the relationship between features and execution time. Linear regression methods are easy to implement and their prediction results are concise and interpretable. However, since parallel programs can have complex behavior patterns, a linear model may not be accurate enough to characterize the performance under different input parameters. EPMNF transformation [9], [10] can be used to improve the nonlinear fitness of linear regressions, but before the transformation, domain experts are required to determine a small range of feature candidates.

troublesome and often unnecessary. A better solution would be transferring the existing model to reuse it or assist in building the new model. Numerous studies have been conducted, but these existing studies usually do not focus on the execution time prediction but the prediction of the performance rank of an application under different conditions, since transferring the rank correlation across different platforms is easier than transferring the execution time model. Hoste et al. [23] used benchmarks to measure different platforms, then according to the similarity between benchmarks and the application of interest, their work can predict the performance rank of an application on different platforms. Chen et al. [15] used the Bayesian network to capture the parameter dependencies of an application. Marathe et al. [32] built a performance model with deep neural networks. They transferred neural networks with a fine-tuning method, which took a trained network for a platform to initialize the connection weights of a network for a new platform. Their work showed that deep neural networks do not outperform traditional machine learning methods (e.g., random forest) on execution time prediction in general, but they can help users find better application configurations on different platforms.

4.5 Comparison of This Study and Existing Studies
As a comparison of this research and existing studies, our method is a statistical modeling method and uses machine learning techniques to analyze historical executions and to predict the execution time of a parallel program. There are several critical differences between our work and existing work though. First, existing studies using machine learning techniques mainly consider input parameters as features to model the performance of parallel programs, whereas our work considers runtime features about variable assignments, branches, loops, and communications, which are a superset of input parameters. Runtime features contain more information that may not be covered by input parameters, thus our method can achieve equal or higher accuracy for predicting performance, compared with models based on input parameters. On the other hand, there is sufficient difference between our method and existing works that also take runtime features into account. To effectively utilize runtime features and eliminate the negative effect of redundant features, our method automatically identifies the important subset of all possible runtime features using time and importance reductions. Besides making scalar prediction of performance, our method can also predict performance interval, which enables quantifying performance variation and checking the confidence of a prediction. Moreover, our method can transfer an existing random forest-based performance model to a new platform with few historical program executions. It effectively alleviates the cold start problem, which is a common problem for statistical modeling.

5 CONCLUSION
In this paper, we introduce a novel method to model and pre-
4.4 Model Transferring dict the performance of parallel programs (MPI programs).
The traces of an application from a certain platform cannot be We develop a tool to automatically analyze the syntax tree of
used directly when modeling the performance of the applica- MPI programs and instrument them, so that we can detect its
tion on another platform. It is feasible to collect data and build runtime features related to computation and communication,
another model on the new platform; however, it is without requiring any domain knowledge. We design a
strategy to automatically analyze and fit the runtime feature data of MPI programs using the random forest technique, so that we can predict the performance. Since we adopt lightweight instrumentation and further reduce features by two reduction processes, the overhead of instrumentation is low, with much less storage demand compared to existing methods. Combined with the model transferring method, an existing performance model can be reused to predict the performance on a new platform with a small number of training samples.

Although we have achieved the desired prediction results on three tested applications that well represent typical HPC applications, our method can be further improved. Since we only extract features from the early execution phase of an application, our method may not predict accurately if the application's behavior is not determined solely by the early phase. In the future, we will further investigate how to extract behavior patterns throughout the entire execution period of an application and further optimize our model.

ACKNOWLEDGMENTS
This study was supported by the NSF of China (Grant numbers 61772485 and 61432016), and the Open Research Fund (CARCH201711) of the State Key Laboratory of Computer Architecture, Institute of Computing Technology, CAS. This work was also supported by the Youth Innovation Promotion Association of CAS. Experiments in this study were conducted on the supercomputer system in the Supercomputing Center of University of Science and Technology of China.

REFERENCES
[1] Clang: A C language family frontend for LLVM, 2016. [Online]. Available: http://clang.llvm.org/
[2] GalaxSee HPC module 1: The N-body problem, serial and parallel simulation, 2010. [Online]. Available: http://shodor.org/petascale/materials/UPModules/NBody/
[3] Graph 500 reference implementations, 2017. [Online]. Available: http://www.graph500.org/referencecode
[4] The SMG2000 benchmark code, 2001. [Online]. Available: https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/smg/
[5] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, “LogGP: Incorporating long messages into the logP model for parallel computation,” J. Parallel Distrib. Comput., vol. 44, no. 1, pp. 71–79, 1997.
[6] D. Andrzejewski, A. Mulhern, B. Liblit, and X. Zhu, “Statistical debugging using latent topic models,” in Proc. Eur. Conf. Mach. Learn., 2007, pp. 6–17.
[7] D. H. Bailey et al., “The NAS parallel benchmarks,” The Int. J. Supercomputing Appl., vol. 5, no. 3, pp. 63–73, 1991.
[8] K. J. Barker, S. Pakin, and D. J. Kerbyson, “A performance model of the krak hydrodynamics application,” in Proc. Int. Conf. Parallel Process., 2006, pp. 245–254.
[9] A. Bhattacharyya and T. Hoefler, “PEMOGEN: Automatic adaptive performance modeling during program runtime,” in Proc. 23rd Int. Conf. Parallel Archit. Compilation, 2014, pp. 393–404.
[10] A. Bhattacharyya, G. Kwasniewski, and T. Hoefler, “Using compiler techniques to improve automatic performance modeling,” in Proc. Int. Conf. Parallel Archit. Compilation, 2015, pp. 468–479.
[11] C. M. Bishop, “Pattern recognition,” Mach. Learn., vol. 128, pp. 1–58, 2006.
[12] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[13] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1984.
[14] A. Calotoiu et al., “Fast multi-parameter performance modeling,” in Proc. IEEE Int. Conf. Cluster Comput., 2016, pp. 172–181.
[15] H. Chen, W. Zhang, and G. Jiang, “Experience transfer for the configuration tuning in large-scale computing systems,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 3, pp. 388–401, Mar. 2011.
[16] J. W. Choi, A. Singh, and R. W. Vuduc, “Model-driven autotuning of sparse matrix-vector multiply on GPUs,” ACM SIGPLAN Notices, vol. 45, no. 5, pp. 115–126, Jan. 2010.
[17] D. Culler et al., “LogP: Towards a realistic model of parallel computation,” ACM SIGPLAN Notices, vol. 28, pp. 1–12, 1993.
[18] P. R. Eller, T. Hoefler, and W. Gropp, “Using performance models to understand scalable krylov solver performance at scale for structured grid problems,” in Proc. ACM Int. Conf. Supercomputing, 2019, pp. 138–149.
[19] E. Gaussier, D. Glesser, V. Reis, and D. Trystram, “Improving backfilling by using machine learning to predict running times,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2015, Art. no. 64.
[20] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.
[21] M. Hao, W. Zhang, Y. Zhang, M. Snir, and L. T. Yang, “Automatic generation of benchmarks for I/O-intensive parallel applications,” J. Parallel Distrib. Comput., vol. 124, pp. 1–13, 2019.
[22] T. Hoefler, W. Gropp, W. Kramer, and M. Snir, “Performance modeling for systematic performance tuning,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2011, Art. no. 6.
[23] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. D. Bosschere, “Performance prediction based on inherent program similarity,” in Proc. Int. Conf. Parallel Archit. Compilation Techn., 2006, pp. 114–122.
[24] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, “Predicting execution time of computer programs using sparse polynomial regression,” in Proc. 23rd Int. Conf. Neural Inf. Process. Syst., 2010, pp. 883–891.
[25] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, “Algorithm runtime prediction: Methods & evaluation,” Artif. Intell., vol. 206, pp. 79–111, 2014.
[26] D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings, “Predictive performance and scalability modeling of a large-scale application,” in Proc. ACM/IEEE Conf. Supercomputing, 2001, pp. 37–37.
[27] T. Kielmann, H. E. Bal, and K. Verstoep, “Fast measurement of logP parameters for message passing platforms,” in Proc. Int. Parallel Distrib. Process. Symp., 2000, pp. 1176–1183.
[28] Y. Kwon et al., “Mantis: Automatic performance prediction for smartphone applications,” in Proc. USENIX Conf. Annu. Techn. Conf., 2013, pp. 297–308.
[29] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee, “Methods of inference and learning for performance modeling of parallel applications,” in Proc. 12th ACM SIGPLAN Symp. Princ. Practice Parallel Program., 2007, pp. 249–258.
[30] J. Li, G. Tan, M. Chen, and N. Sun, “SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication,” SIGPLAN Notices, vol. 48, no. 6, pp. 117–126, Jun. 2013.
[31] K. Li, W. Yang, and K. Li, “Performance analysis and optimization for SpMV on GPU using probabilistic modeling,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 1, pp. 196–205, Jan. 2015.
[32] A. Marathe et al., “Performance modeling under resource constraints using deep transfer learning,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2017, pp. 31:1–31:12.
[33] W. F. Ogilvie, P. Petoumenos, Z. Wang, and H. Leather, “Minimizing the cost of iterative compilation with active learning,” in Proc. Int. Symp. Code Gener. Optim., 2017, pp. 245–256.
[34] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[35] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[36] H. Qi, E. R. Sparks, and A. Talwalkar, “Paleo: A performance model for deep neural networks,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[37] H. Sanjay and S. Vadhiyar, “Performance modeling of parallel applications for grid scheduling,” J. Parallel Distrib. Comput., vol. 68, no. 8, pp. 1135–1145, 2008.
[38] S. Sodhi, J. Subhlok, and Q. Xu, “Performance prediction with skeletons,” Cluster Comput., vol. 11, no. 2, pp. 151–165, 2008.
[39] L. Song and S. Lu, “Statistical debugging for real-world performance problems,” ACM SIGPLAN Notices, vol. 49, no. 10, pp. 561–578, Oct. 2014.
[40] D. Sundaram-Stukel and M. K. Vernon, “Predictive analysis of a wavefront application using logGP,” in Proc. 7th ACM SIGPLAN Symp. Princ. Practice Parallel Program., 1999, pp. 141–150.
[41] J. J. Thiagarajan et al., “Bootstrapping parameter space exploration for fast tuning,” in Proc. Int. Conf. Supercomputing, 2018, pp. 385–395.
[42] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. Series B (Methodological), vol. 58, pp. 267–288, 1996.
[43] L. T. Yang, X. Ma, and F. Mueller, “Cross-platform performance prediction of parallel applications using partial execution,” in Proc. ACM/IEEE Conf. Supercomputing, 2005, pp. 40–40.
[44] J. Zhai, W. Chen, W. Zheng, and K. Li, “Performance prediction for large-scale parallel applications using representative replay,” IEEE Trans. Comput., vol. 65, no. 7, pp. 2184–2198, Jul. 2016.
[45] W. Zhang, A. M. Cheng, and J. Subhlok, “DwarfCode: A performance prediction tool for parallel applications,” IEEE Trans. Comput., vol. 65, no. 2, pp. 495–507, Feb. 2016.

Jingwei Sun is working toward the doctoral degree at the University of Science and Technology of China, Hefei, China. His research interests include distributed deep learning, high performance computing, and algorithm optimizations.

Guangzhong Sun is currently an associate professor with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, China. He is also a member of National High Performance Computing Center (Hefei) and the head of Algorithm and Data Application (Ada) Research Group. His research interests include high performance computing, algorithm optimizations, and data processing.

Shiyan Zhan is currently working toward the graduate degree at the University of Science and Technology of China, Hefei, China. His research interest includes high performance computing.

Jiepeng Zhang is currently working toward the graduate degree at the University of Science and Technology of China, Hefei, China. His research interest includes high performance computing.

Yong Chen is currently an associate professor and director of the Data-Intensive Scalable Computing Laboratory in the Computer Science Department of Texas Tech University, Lubbock, Texas. He is also a site director of the Cloud and Autonomic Computing center at Texas Tech University, Lubbock, Texas. His research interests include data-intensive computing, parallel and distributed computing, high-performance computing, and cloud computing.