
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Predicting Cloud Performance for HPC Applications: a User-oriented Approach

Giovanni Mariani∗, Andreea Anghel†, Rik Jongerius∗, Gero Dittmann†

∗IBM Research, the Netherlands
Email: {giovanni.mariani, r.jongerius}@nl.ibm.com
†IBM Research – Zurich, Switzerland
Email: {aan, ged}@zurich.ibm.com

Abstract—Cloud computing enables end users to execute high-performance computing applications by renting the required computing power. This pay-for-use approach enables small enterprises and startups to run HPC-related businesses with a significant saving in capital investment and a short time to market.

When deploying an application in the cloud, the users may a) fail to understand the interactions of the application with the software layers implementing the cloud system, b) be unaware of some hardware details of the cloud system, and c) fail to understand how sharing part of the cloud system with other users might degrade application performance. These misunderstandings may lead the users to select suboptimal cloud configurations in terms of cost or performance.

To aid the users in selecting the optimal cloud configuration for their applications, we suggest that the cloud provider generate a prediction model for the provided system. We propose applying machine-learning techniques to generate this prediction model. First, the cloud provider profiles a set of training applications by means of a hardware-independent profiler and then executes these applications on a set of training cloud configurations to collect actual performance values. The prediction model is trained to learn the dependencies of actual performance data on the application profile and cloud configuration parameters. The advantage of using a hardware-independent profiler is that the cloud users and the cloud provider can analyze applications on different machines and interface with the same prediction model.

We validate the proposed methodology for a cloud system implemented with OpenStack. We apply the prediction model to the NAS parallel benchmarks. The resulting relative error is below 15%, and the Pareto-optimal cloud configurations finally found when maximizing application speed and minimizing execution cost on the prediction model are also at most 15% away from the actual optimal solutions.

[Figure 1: Overview of the proposed methodology.]

I. INTRODUCTION

Cloud technologies enable researchers and businesses to easily access a large amount of computing resources by renting the computing power, thus avoiding the capital cost of acquiring the physical hardware. Recent work demonstrated that, with acceptable performance degradation, significant economic advantages can be achieved when running HPC applications in the cloud [1]–[3]. Today there are businesses that are taking advantage of this paradigm, and several cloud providers are offering specialized compute-optimized machines.

When a user decides to run an application in the cloud, he or she is asked to select a cloud configuration. Generally, the user can select the number of machines to use and their size in terms of cores, memory, disk space, and, depending upon the specific cloud provider, different machine types and network parameters. Users of HPC applications often have an in-depth understanding of their applications. Nonetheless, when deploying an application in the cloud they may a) fail to understand the interactions between the application and the software layers implementing the cloud, b) be unaware of some hardware details of the cloud system, and c) fail to understand how sharing part of the cloud system with other users might degrade application performance. These misunderstandings may lead the user to select suboptimal cloud configurations in terms of cost or performance. Today, the selection of the cloud configuration is based mainly on experience or on consultation between the user and the provider. Acquiring experience and consulting are time consuming, and overall the process is prone to human error and may lead to suboptimal results.

In this work we suggest a structured approach that does not rely on the user's experience, does not require direct consultation between the user and the cloud provider, and can be automated (Figure 1). We suggest that the cloud provider generates a prediction model by applying machine learning to find the dependencies between the application features (profile), the cloud configuration parameters, and the resulting performance. We make use of hardware-independent features to describe the application profile [4], [5]. In this way, the cloud provider and the user can generate the profiling data on different machines while interfacing with the same prediction model.

The cloud provider trains the cloud prediction model by fitting the actual performance observed for a set of training applications executing on a set of training cloud configurations. The model is then released to the user, who can query it for performance predictions for his target application. In this work the prediction model is trained to predict application speed, application execution time, and the computing cost associated with running the target application on a user-selected cloud configuration. The cost associated with transferring data to and from the provider typically depends only on the dataset the user wants to process and is independent of the selected cloud configuration. This data-transfer cost cannot be optimized by selecting different cloud configurations; thus, it is not considered in this work.

The cloud environment is usually very noisy because of the activity of several independent users. To minimize the effect of noise in the training data on the prediction model, we apply statistical techniques in the selection of the training configurations. These techniques are known as design of experiments and are specifically intended to distribute the training runs so as to minimize the perturbation related to noisy data during the estimation of the parameters in the prediction model. This enables us to achieve a high prediction accuracy, with mean error rates below 15%. When carrying out a multi-objective optimization on the prediction model to maximize the application speed and minimize the execution cost, the optimal solutions found differ from the actual optima by at most 25% during the planning phase, i.e. when considering the predicted metrics that the user has available before deploying the application in the cloud infrastructure. After deployment, i.e. when the user can observe the actual performance metrics for the selected cloud configurations, the solutions found with the proposed methodology differ by at most 15% from the actual optimal ones.

The remainder of this paper is organized as follows. Section II analyzes the state of the art in performance prediction for HPC applications. Section III describes the proposed methodology in detail. Section IV provides empirical evidence on the accuracy of the proposed prediction model and clarifies its practical use.

II. BACKGROUND

In recent years many researchers have addressed the question of if and when it is more economically advantageous to use cloud computing resources, compared to more traditional solutions, when running an HPC application [1]–[3], [6]–[8]. A general result is that application performance in the cloud is commonly subject to degradation whose extent is application and platform dependent. Nonetheless, for some applications cloud computing is actually an interesting option because, with an acceptable performance degradation, one can achieve significant economic advantages [6].

Starting from this preliminary result, in this work we propose a methodology to aid cloud users in selecting the best cloud configuration for a target application by predicting the performance and cost trade-off before the actual deployment. To the best of our knowledge, this is the first methodology enabling application-specific performance prediction for a cloud system before the target application has ever been deployed in that system.

Several studies present performance prediction methodologies for HPC applications [9]–[13]. The authors of [11]–[13] generate prediction models to extrapolate application performance for workloads larger than those processed during a set of training runs. Other authors propose performance models for supercomputer design optimization, analyzing different workloads with varying architectural parameters [9], [14]. Finally, Hutter et al. [10] present a survey of state-of-the-art techniques aimed at predicting application performance for different input datasets. None of these studies is suitable for predicting the performance achievable by a target application in a cloud environment before this application has ever been deployed in the cloud. In fact, they either assume that the target architecture is dedicated only to the execution of the target application [9], [12]–[14] or require several runs of the target application on the target system in order to generate the application-specific prediction model [10]–[13].

There is quite a lot of work on the optimization of HPC applications executing on cloud systems [1], [15]–[19]. Most of these studies do not aim at an application-specific optimization but at a domain-level optimization, i.e. they target the optimization of a cloud system for the execution of HPC applications in general [16]–[18]. Other authors suggest that HPC applications can be optimized for cloud deployment [1], [19]. Gupta et al. suggest that network-related application parameters need re-tuning to maximize performance in cloud systems [1]. da Rosa Righi et al. suggest that some HPC applications can exploit the run-time elasticity provided by some cloud systems [19].

In this paper we aim at identifying the best cloud configuration at the application-specific level, not at the optimization of the cloud system for the HPC domain in general, nor at the optimization of the application code for execution in the cloud. We assume that a user wants to deploy an existing application onto an existing cloud system, that the user can select the cloud configuration, and that he or she wants to find the best solution for the target application. The cloud provider generates a prediction model and releases it to the user. The user queries this prediction model to search for the best cloud configuration for the target application.

[Figure 2: Construction of the prediction model.]

III. PROPOSED METHODOLOGY

A. Overview

In the proposed approach, the cloud provider generates a performance prediction model for the provided infrastructure. Then, the users query this model to infer the performance and costs of their applications for different cloud configurations. The prediction model takes as input a cloud configuration c and an application profile p, and returns as output the expected speed, execution time, and cost.

The prediction model is generated by the cloud provider by following the procedure illustrated in Figure 2. First of all, a set of training cloud configurations is selected together with a training set of applications and their datasets. For each training application a and dataset d we collect a hardware-independent profile p(a, d). Then, actual performance values are collected by running the application a with the dataset d using the cloud configuration c, for each triple a, d, c in the training sets. Performance measurements are subject to noise related to the activity of other cloud users. For the selection of the training sets, when possible, we apply statistical techniques named design of experiments to minimize the effect of noise on the learnt model.

B. Gathering performance and profiling data

Hardware-independent profile. We characterize the application execution by means of the platform-independent software analysis (PISA) tool proposed by Anghel et al. [4]. This hardware-independent profiling choice enables the cloud provider and the cloud user to independently generate application profiles on different machines while interfacing with the same prediction model.

The profiler is based on LLVM. The application is first compiled to the LLVM intermediate representation (IR) and then instrumented with the necessary profiling code. The instrumentation is automated by an LLVM plugin [4] and is carried out at the level of LLVM IR before generating the actual binary. This approach grants independence from the actual machine, including its microarchitecture and its instruction set architecture. The profile p(a, d) is obtained by executing the instrumented application a while processing a target dataset d. Being hardware independent, a single profile p(a, d) gathered on any machine is enough to describe the computation and communication requirements for any other machine or any cloud configuration c. The profile p(a, d) is a vector where each parameter is a statistic about an application feature f. We collect a wide set of features:

• Operation mix: percentage of operations per type (integer, floating point, memory read and write, etc.). For each type we also collect information about how well the operations can be vectorized. We do not directly count machine instructions but LLVM IR operations.
• ILP: instruction-level parallelism for an ideal machine executing LLVM IR operations. This feature is computed for the overall operations as well as for each operation type.
• Reuse distance: for different distances δ we compute the probability ρ(δ) that a memory reference is reused before accessing δ other unique memory references [20]. We store values of ρ(δ) for δ at power-of-two levels between 2^7 and 2^18. While being hardware independent, this metric can be used for inferring the miss rate of a cache of size δ [21].
• Memory traffic: this metric is a derivative of the reuse distance and the operation mix. It represents the percentage of operations that need to access the next level of the memory hierarchy when the current one has size δ. It is computed by multiplying ρ(δ) by the percentage of memory operations.
• Register traffic: the average number of registers accessed by an operation.
• Library calls: the call count to each function in external libraries.
• Communication requirements: for each function in the communication library (in this work we consider MPI applications), we count the number of function calls and the overall number of bytes sent or received through it.

In this work, we aim at parallel applications. For each of the listed features, the profile p(a, d) stores statistics about the per-thread distribution in terms of minimum, maximum, mean, median, and variance. Additionally, we also store the number of threads and the total count of LLVM IR operations o(a, d).
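To make the structure of p(a, d) concrete, the following is a minimal sketch of how per-thread raw counters could be flattened into the kind of profile vector described above. The counter names and the helper function are hypothetical; the sketch only illustrates the per-thread statistics (minimum, maximum, mean, median, variance) plus the thread count and the total LLVM IR operation count o(a, d).

```python
import statistics

def flatten_profile(per_thread_counters):
    """Aggregate hypothetical per-thread raw counters into a flat profile vector.

    per_thread_counters: dict mapping a feature name (e.g. 'mem_ops_pct',
    'ilp', 'reuse_prob_2^7') to a list with one value per thread.
    Returns a dict with five summary statistics per feature, plus the thread
    count and the total LLVM IR operation count o(a, d).
    """
    profile = {}
    for feature, values in per_thread_counters.items():
        profile[f"{feature}_min"] = min(values)
        profile[f"{feature}_max"] = max(values)
        profile[f"{feature}_mean"] = statistics.mean(values)
        profile[f"{feature}_median"] = statistics.median(values)
        profile[f"{feature}_var"] = statistics.pvariance(values)
    profile["num_threads"] = len(next(iter(per_thread_counters.values())))
    # Total LLVM IR operation count o(a, d), summed over threads
    # (assumes a per-thread 'ir_ops' counter is present).
    profile["total_ir_ops"] = sum(per_thread_counters["ir_ops"])
    return profile

# Example with two hypothetical threads:
example = flatten_profile({"ir_ops": [1.2e9, 1.1e9], "mem_ops_pct": [31.0, 29.5]})
```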
Performance indices. Actual performance data is also needed for training the prediction model (Figure 2). As the main performance measure we consider the wall-clock execution time τ(a, d, c), i.e. the actual time elapsed between the start and the completion of the application a when processing the dataset d using the cloud configuration c. To minimize the effect of noise, for each triple a, d, c we measure the execution time τ(a, d, c) ten times and we use the median value as a robust estimate¹. Starting from the time τ(a, d, c), we compute the execution speed in terms of LLVM IR operations per second as ς(a, d, c) = o(a, d)/τ(a, d, c). Then, the execution cost ε(a, d, c) is the time τ(a, d, c) multiplied by the cost associated with the configuration c.

¹In this work we focus on the prediction of the median performance values. One could also be interested in a worst-case analysis or, for example, in the prediction of the 90th percentile to give the cloud user a feel for the worst-case performance degradation.
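As a brief illustration of these performance indices, the sketch below derives the median execution time, the execution speed ς, and the execution cost ε from a set of repeated runs. The run times, the operation count, and the hourly price are made-up values used only to show the arithmetic.

```python
import statistics

def performance_indices(run_times_s, total_ir_ops, hourly_cost_cents):
    """Median runtime, speed (LLVM IR ops/s) and cost for one (a, d, c) triple."""
    tau = statistics.median(run_times_s)        # robust estimate over the repeated runs
    speed = total_ir_ops / tau                  # ς(a, d, c) = o(a, d) / τ(a, d, c)
    cost = tau / 3600.0 * hourly_cost_cents     # ε(a, d, c): time multiplied by the
    return tau, speed, cost                     # hourly cost of configuration c

# Ten hypothetical measurements of the same triple (seconds):
runs = [118.0, 121.5, 119.2, 120.1, 118.8, 140.3, 119.9, 121.0, 118.5, 119.4]
tau, speed, cost = performance_indices(runs, total_ir_ops=2.4e12, hourly_cost_cents=12.5)
```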

The model construction aims at generating an approximation of the execution speed, ς̂(p, c) ∼ ς(a, d, c), where p is the hardware-independent profile representation of a when processing d. We will use the notations τ̂(p, c) and ε̂(p, c) to refer to the execution time and cost approximations computed starting from ς̂(p, c).

C. Selecting the training sets

Training applications. It is important to select a set of training applications general enough to represent all possible applications that the users may be interested in when querying the model for predictions. There are several studies on benchmarking [5], [22]–[25] that address the problem of selecting a representative set of applications. Without loss of generality, in this work we focus on scientific computing applications implemented in MPI; in particular we target the NAS parallel benchmarks (NPB) [26].

Training datasets. Design of experiments (DOE) is a family of statistical techniques for spreading a set of points within a parameter space so as to optimize some information criterion. There exist techniques specifically intended to minimize the uncertainty of an extrapolation model built upon the designed experiments [27]. Because there are many sources of noise in a cloud environment, we suggest applying DOE techniques whenever possible.

For the NPB applications, the input dataset is a set of a few numeric parameters. Thus, the input dataset d is a parameter vector and its configuration is a point in the parameter space. For this kind of applications, DOE techniques can be applied for the selection of the datasets d to be included in the training set. We apply the central composite DOE, whose goal is to minimize the uncertainty of a nonlinear polynomial model that accounts for parameter interactions. In this DOE, each parameter in the vector d can assume one out of five levels: {minimum, low, central, high, maximum}. All parameter combinations with the low and high values are included in the DOE (the corners of the square in Figure 3). Additionally, two configurations of d are included in the DOE to experiment with the maximum and the minimum values of each parameter while holding the central value for all the other parameters (the points lying on the diameters in Figure 3). Finally, the central configuration is also included.

[Figure 3: Central composite DOE for two parameters.]

Note that, for other applications where the format of the input dataset does not allow for a clear definition of a parameter space², the suggested DOE technique cannot be applied. For those applications we recommend the use of representative datasets such as the ones commonly released with the benchmark suites themselves.

²For example, with a small set of numeric parameters it is not easy to describe the input of a 3D rendering application or the positioning of several chess pieces on a chessboard [28].

Training cloud configurations. The cloud configuration parameters in c are generally subject to constraints that the central composite DOE cannot handle. For example, the number of cores per node and the amount of memory per node are two parameters that cannot be set independently: only a set of their combined values (flavors) is exposed. It is generally impossible to implement a central composite DOE (Figure 3) on a space that includes only a set of predefined flavors.

For placing experimental training configurations in a constrained parameter space we apply a D-optimal DOE [27]. This DOE is generated by solving a numerical optimization problem.

The optimization takes as input the desired number of experiments and a target extrapolation model. Discovering the exact expression of the extrapolation model before running the experiments is not possible, thus an expression general enough to match any possible behavior is needed. We consider a second-order polynomial model to account for parameter interactions and nonlinearities:

\hat{y}(c) = k_0 + \sum_{\gamma_i \in c} k_i \gamma_i + \sum_{\gamma_i \in c} \sum_{\gamma_j \in c} k_{i,j} \gamma_i \gamma_j ,   (1)

where ŷ is a generic approximation function, the k are the coefficients to be learnt, and the γi are the cloud configuration parameters in c. While the final prediction model might differ from the one in Equation 1, using this form enables us to spread experiments specifically to expose possible parameter interactions and nonlinearities.

Let us describe all experiments with a matrix C whose rows c are experimental cloud configurations and whose columns γ are configuration parameters (Figure 4a). The design matrix X is a matrix that describes both the experiments to run and the polynomial model to learn [27]. The matrix X includes a row for each experiment and a column for each term γi and γiγj in Equation 1 (Figure 4b).

[Figure 4: Example of an experiment matrix C (a) and the resulting design matrix X (b), considering six cloud configurations c1–c6 and two cloud configuration parameters γ1, γ2.]
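As an illustration of how the design matrix X is derived from the experiment matrix C, the following sketch expands each cloud configuration into the intercept, linear, and second-order terms of Equation 1 and evaluates the D-optimality criterion det(XᵀX) that is maximized in the next step. It is a minimal sketch under the definitions above, not the authors' implementation (which uses Mathematica); the example configurations are made up.

```python
import numpy as np

def design_matrix(C):
    """Expand an experiment matrix C (rows: configurations, columns: parameters)
    into the design matrix X of the second-order model in Equation 1:
    one column for the intercept, one per parameter, and one per pairwise
    product of parameters (including squares)."""
    C = np.asarray(C, dtype=float)
    n, p = C.shape
    cols = [np.ones(n)]                       # intercept term k0
    cols += [C[:, i] for i in range(p)]       # linear terms ki * gamma_i
    cols += [C[:, i] * C[:, j]                # interaction/quadratic terms k_ij
             for i in range(p) for j in range(i, p)]
    return np.column_stack(cols)

def d_optimality(C):
    """Criterion used to compare candidate experiment sets: det(X^T X)."""
    X = design_matrix(C)
    return np.linalg.det(X.T @ X)

# Example with two parameters and six candidate configurations (cf. Figure 4):
C = [[1, 1], [1, 8], [8, 1], [8, 8], [4, 4], [2, 6]]
print(d_optimality(C))
```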

To generate a D-optimal DOE, the experiments in the matrix C have to be tuned to maximize the determinant of the information matrix XᵀX. This is equivalent to minimizing the generalized variance (uncertainty) of the coefficients k to be learnt [27]. To solve this optimization problem we use a genetic algorithm (GA). An individual of the GA is the experiment matrix C. Within the GA kernel, C is unrolled into a vector to apply the traditional genetic operators of crossover and mutation. For any GA individual C, its fitness is computed as the information matrix determinant |XᵀX|, where X can be derived from C (Figure 4). Since the fitness function of a GA individual is computed analytically, the GA runs very fast, taking a few seconds in our Mathematica implementation.

D. Training the prediction model

The goal of the prediction model is to approximate the execution speed ς(a, d, c) with an analytic function ς̂(p, c) ∼ ς(a, d, c), where p is the profile vector for the application a when processing d. For the applications investigated in this paper, the profile p counts a total of 795 features. Before generating the prediction models, we clean the input dataset by removing the features having constant values and, for each couple of features fi and fj with a correlation greater than 0.99, we drop one of them because including it would not bring additional information. This cleaning process reduces the number of features in p to about 200. This large number of features enables the identification of complex relationships between the analyzed software and its performance. On the other hand, the accuracy of many prediction models, such as support vector machines and neural networks, is degraded by the introduction of too many features [29]. To avoid a complex feature-selection scheme, in this work we apply random forests (RFs) to implement the prediction models. RFs embed automatic procedures to screen many input features and to select the predictors during model construction [30].

An RF is a collection of several regression trees. Each tree is generated to fit the behavior of the target performance metric by using a randomly selected subset of the training data. This lets each tree focus on a slightly different problem, a process that significantly improves the robustness of the RF to noisy data. The data that a tree does not use during training is referred to as out-of-the-bag data. The number of trees and the amount of per-tree out-of-the-bag data are selected to let every observation in the training dataset be part of the out-of-the-bag set for many trees. An approximate estimation of the prediction accuracy of an RF model can be obtained on the training dataset itself by considering only the out-of-the-bag predictions.

A tree is constructed (grown) starting from a root node that represents the whole input space, including both the application features in p and the cloud configuration parameters in c. Then, iteratively, each node is grown by associating it with a splitting value for an input variable to generate two subspaces, thus two child nodes. Each node is associated with a prediction of the target metric equal to the mean observed value in the training dataset for the input subspace the node represents. The RF automatically selects the input variables to be included in the model. Each time a node is grown, the input variables are screened by means of the reduction in the mean relative error they would bring to the model if selected for the splitting. The split associated with the variable leading to the lowest mean squared error is assigned to the node. Thus, unimportant features do not perturb the prediction.

In this work, we aim to predict the execution speed ς̂(p, c) ∼ ς(a, d, c). In order to maximize the prediction accuracy [31], we consider the Box-Cox preprocessing functions g(ς). We construct an RF model to fit g(ς), then we predict the speed ς̂ by applying the inverse of g to the value predicted by the RF. We consider the following set of preprocessing functions: g(ς) ∈ {ς, log(ς), ς⁻¹}. For the final prediction model we select the one that minimizes the mean squared error for the out-of-the-bag predictions.
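The following sketch illustrates this training step with an off-the-shelf random-forest implementation: for each candidate preprocessing function g ∈ {identity, log, reciprocal} it fits a forest to g(ς) and keeps the variant with the lowest out-of-bag mean squared error, inverting g at prediction time. It is a simplified stand-in (using scikit-learn) for the setup described above, with made-up argument names and placeholder hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Candidate Box-Cox-style preprocessing functions g and their inverses.
TRANSFORMS = {
    "identity":   (lambda s: s,        lambda z: z),
    "log":        (np.log,             np.exp),
    "reciprocal": (lambda s: 1.0 / s,  lambda z: 1.0 / z),
}

def train_speed_model(X, speed, n_trees=500):
    """Fit one RF per preprocessing function g(speed) and keep the one with the
    lowest out-of-bag MSE. X holds profile features plus cloud parameters;
    speed holds the measured execution speeds (LLVM IR ops/s)."""
    best = None
    for name, (g, g_inv) in TRANSFORMS.items():
        rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True)
        rf.fit(X, g(speed))
        # Out-of-bag predictions approximate accuracy on unseen data.
        oob_mse = mean_squared_error(g(speed), rf.oob_prediction_)
        if best is None or oob_mse < best[0]:
            best = (oob_mse, name, rf, g_inv)
    _, name, rf, g_inv = best
    # The returned predictor applies the inverse transform to the RF output.
    return name, (lambda X_new: g_inv(rf.predict(X_new)))
```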

IV. EXPERIMENTAL RESULTS

In this section we first introduce the experimental setup, including a description of the cloud infrastructure and a presentation of the benchmarks, datasets, and cloud configurations in use. Then, empirical evidence on the accuracy of the prediction models is presented. Finally, the prediction models are applied to the problem of identifying the best cloud configuration to use for a target application, thus clarifying the way a cloud user can exploit them.

A. Experimental setup

The target cloud infrastructure. The evaluation of the proposed methodology is carried out by using a cloud infrastructure reserved for research purposes in our labs. This infrastructure includes eight 24-core machines and eight 16-core machines. All these machines are interconnected with 10 Gigabit Ethernet. On top of this hardware, the cloud system is implemented with OpenStack.

Several researchers are concurrently using this system for different purposes and are generating traffic on it. Collecting performance data on this system for the considered benchmarks and datasets took about eight days. To give an idea of how busy the system was, on the first day 466 virtual machines were instantiated. During those eight days, 295 virtual machines were created while 198 machines were deleted. We are not aware of what applications were executed by other users.

Benchmarks and datasets. In this work we consider the Fortran-MPI implementation of the NAS parallel benchmarks (NPB) as representative supercomputing workloads [26]. These benchmarks are popular for the performance analysis of HPC applications, also in the context of cloud systems [2], [6]. MPI enables us to distribute the workload on a computing cluster allocated in the cloud infrastructure.

The Fortran benchmarks are three applications (A) and four kernels (K). To select the input parameters defining the datasets, we apply the central composite DOE (Figure 3). The benchmarks, their parameters, and the values associated with the different parameter levels are listed in Table I. All the benchmarks are installed on an Ubuntu 14.04 image. System performance is influenced by the underlying operating system and the software environment in general. The prediction model is specific to the software environment in use during the training phase. A provider can train a different prediction model for each operating system considered interesting to the user. Then, the user can investigate different operating systems by applying the different prediction models. This topic will be addressed in future work.

Table I: List of the NAS parallel benchmarks and the parameter settings for the considered central composite DOE. The column Type indicates with K the benchmarks that are kernels and with A the ones that are applications.

App.  Type  Param. name     Min    Low    Central  High   Max
BT    A     Grid dimension  40     47     65       83     90
            Iterations      30     40     65       90     100
            Processes       1      9      16       25     49
CG    K     No. rows        12000  14000  16000    18000  20000
            No. nonzeros    14     16     18       20     22
            Iterations      21     24     27       30     33
            Processes       1      4      8        16     64
EP    K     M               26     27     28       29     30
            Processes       1      4      8        16     32
FT    K     Iterations      3      4      6        8      9
            Processes       1      4      8        16     64
LU    A     Grid dimension  50     57     75       93     100
            Iterations      42     50     70       90     98
            Processes       1      9      16       25     49
MG    K     Grid dimension  64     64     128      256    256
            Iterations      115    120    125      130    135
            Processes       2      4      8        16     32
SP    A     Grid dimension  50     57     75       93     100
            Iterations      42     50     70       90     98
            Processes       1      9      16       25     49

Cloud configurations. We distribute the workloads on a cluster of homogeneous virtual machines (nodes) allocated in the cloud. As parameters in the cloud configuration c we consider the number of nodes n, the number of cores per node m, and the amount of RAM per node r. The underlying cloud infrastructure does not allow tuning of network parameters. The physical machines are connected with 10 Gigabit Ethernet. Additionally, the target benchmarks are not sensitive to the storage system, thus we do not consider storage-related parameters in the cloud configuration c.

We consider that a cloud user can allocate up to 64 cores and up to 64 GB of RAM, and can instantiate up to 10 nodes. These kinds of limitations can be found either because of specific policies of the cloud provider (e.g. limitations on the type of user account targeted by the modeling) or because of physical limitations in the cloud system. Thus: (n ≤ 10) ∧ (n × m ≤ 64) ∧ (n × r ≤ 64). Additionally, only the set of flavors in Table II can be used, and that limits the combinations of m and r. A D-optimal DOE for this constrained space is organized using a genetic algorithm available in Mathematica [32]. To balance the time required to gather performance data and the accuracy of the prediction model, we set the number of experiments in the D-optimal DOE to 15 (this is the same number of runs required by a central composite DOE for a space with three parameters, such as the considered cloud configuration space). The resulting cloud configurations are listed in Table III. To each configuration we also associate an hourly cost that has been derived by extrapolating the costs of the Amazon on-demand cloud [33] to the specific cloud configuration.

Table II: Virtual machine flavors for the target cloud system.

Flavor id    1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Cores m      1  2  1  2  1  2  4  8  1  2   4   8   24  8   12
RAM r [GB]   1  1  2  2  4  4  4  4  8  8   8   8   8   16  24

Table III: Experimental cloud configurations and the considered cost in US dollar cents per hour [¢/h].

Conf. id   1     2    3     4     5   6     7     8     9     10    11    12    13     14    15
Cores m    1     8    8     8     24  12    24    4     8     8     1     8     8      1     2
RAM r      8     4    8     16    8   24    8     4     4     8     1     4     8      2     1
Nodes n    1     1    1     1     1   2     2     4     4     5     6     8     8      9     10
Cost       12.5  9.1  14.8  26.1  20  77.3  40.1  31.2  36.5  73.8  15.5  72.3  118.1  35.9  29.1
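As a small illustration of this constrained configuration space, the sketch below enumerates all feasible (flavor, node count) combinations under the constraints n ≤ 10, n × m ≤ 64, and n × r ≤ 64, using the flavors of Table II. It is only a helper showing how the feasibility check works, not part of the authors' tooling.

```python
# Flavors of Table II as (cores m, RAM r in GB), indexed by flavor id.
FLAVORS = {
    1: (1, 1),   2: (2, 1),   3: (1, 2),   4: (2, 2),   5: (1, 4),
    6: (2, 4),   7: (4, 4),   8: (8, 4),   9: (1, 8),  10: (2, 8),
    11: (4, 8), 12: (8, 8), 13: (24, 8), 14: (8, 16), 15: (12, 24),
}

def feasible_configurations(max_nodes=10, max_cores=64, max_ram=64):
    """All (flavor id, n) pairs satisfying n <= 10, n*m <= 64 and n*r <= 64."""
    feasible = []
    for flavor_id, (m, r) in FLAVORS.items():
        for n in range(1, max_nodes + 1):
            if n * m <= max_cores and n * r <= max_ram:
                feasible.append((flavor_id, n))
    return feasible

configs = feasible_configurations()
print(len(configs), "feasible cloud configurations")  # the candidate space sampled by the D-optimal DOE
```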
B. Accuracy analysis

We assess model accuracy when predicting performance and costs for different benchmarks and for different cloud configurations by applying cross-validation. We first cross-validate benchmark-wise, i.e. we iteratively train a prediction model using the training data of all benchmarks but one and measure prediction accuracy for that one benchmark.
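A compact sketch of this leave-one-benchmark-out procedure (applied analogously per cloud configuration) is given below. The data layout, the column names, and the forest hyperparameters are assumptions used only for illustration; the Box-Cox transform selection of Section III-D is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def leave_one_group_out(df, group_col, feature_cols, target_col="speed"):
    """Benchmark-wise (or configuration-wise) cross-validation: train on all
    groups but one, predict the held-out group, and report its mean relative
    error normalized by the range of the actual metric (cf. Equation 2 below).
    df is assumed to be a pandas DataFrame with one row per profiled run."""
    errors = {}
    for held_out in df[group_col].unique():
        train, test = df[df[group_col] != held_out], df[df[group_col] == held_out]
        rf = RandomForestRegressor(n_estimators=500)
        rf.fit(train[feature_cols].values, train[target_col].values)
        pred = rf.predict(test[feature_cols].values)
        actual = test[target_col].values
        errors[held_out] = np.mean(np.abs(pred - actual) / (actual.max() - actual.min()))
    return errors

# Usage (hypothetical):
# errors = leave_one_group_out(runs_df, "benchmark", feature_cols)
# errors = leave_one_group_out(runs_df, "conf_id", feature_cols)
```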

[Figure 5: Mean relative error for the prediction of execution speed, execution time, and execution cost: (a) for the considered benchmarks (Table I); (b) for the considered cloud configurations (Table III).]

[Figure 6: Fidelity of the prediction models in terms of Pearson correlation for execution speed, time, and cost: (a) for the considered benchmarks (Table I); (b) for the considered cloud configurations (Table III).]

Then, we cross-validate cloud-configuration-wise. We assess both accuracy and fidelity [34].

The accuracy is assessed in terms of relative error and indicates how close the predicted values are to the actual ones. For a generic prediction ŷ(x), the error ε is computed relative to the difference between the maximum and the minimum observed value of the actual metric:

\epsilon = \frac{|\hat{y}(x) - y(x)|}{y_{max} - y_{min}} .   (2)

The fidelity is a measure of how much the trend in the predicted metric ŷ follows the trend in the actual metric y. As suggested by Javaid et al. [34], we use the Pearson correlation to measure fidelity. While accuracy tells us how close a prediction is to the actual metric, fidelity tells us how good the prediction model is at discriminating between good and bad configurations. Thus, both metrics are important when the prediction goal is to drive an optimization process.
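Both assessment metrics can be computed in a few lines; the sketch below is a minimal illustration using NumPy, with made-up arrays of actual and predicted values as inputs.

```python
import numpy as np

def relative_error(actual, predicted):
    """Equation 2: error normalized by the range of the observed actual metric."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.abs(predicted - actual) / (actual.max() - actual.min())

def fidelity(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    return np.corrcoef(actual, predicted)[0, 1]

actual = [10.0, 22.0, 35.0, 41.0, 58.0]      # made-up execution speeds [Gops/s]
predicted = [12.0, 20.0, 37.0, 44.0, 55.0]
print(relative_error(actual, predicted).mean(), fidelity(actual, predicted))
```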
Figures 5 and 6 show the prediction accuracy and fidelity for the three considered metrics: execution speed, execution time, and execution cost. The mean relative error (Figure 5) shows values below 15% for all benchmarks, cloud configurations, and target metrics. This demonstrates that the predictions of the proposed model are reliable. Additionally, the high correlation between the actual metric and the predicted one (Figure 6) demonstrates that the predictor model is capable of discriminating between fast and slow, and cheap and expensive, solutions. We consider model fidelity to be particularly important when the model is applied for predicting the performance of a previously unseen benchmark for different cloud configurations (Figure 6a).

To clarify this point, Figure 7 shows the scatter plot of the actual execution speed (x axis) with respect to the predicted one (y axis). Each subplot is obtained by removing the benchmark in the subplot title from the training dataset; thus the predictions are for a previously unseen benchmark. For each subplot, the predicted execution speed tends to increase with the actual speed, which results in high correlation (Figure 6a). We can thus state the following: if for a given benchmark we predict that a given cloud configuration is a good one with respect to other cloud configurations, we can expect it to actually be the case.

C. Optimization results

The main contribution of this work stands in the proposed prediction methodology. To demonstrate how the prediction models can be used to tune a cloud configuration before actually running the target application in the cloud, we set up a multi-objective optimization problem to maximize the performance and minimize the cost related to the execution of the NAS Parallel Benchmarks (NPBs) in the cloud. We target the optimization of cloud configurations for the deployment of the three NPB applications (BT, LU, and SP).

[Figure 7: Scatter plots of the actual execution speed [Gops/s] (x axis) with respect to the predicted one (y axis) for each of the considered benchmarks (BT, CG, EP, FT, LU, MG, SP).]

The optimization is set up to consider the central dataset configuration (Table I). We assume that the number of MPI processes is not constrained to its central value but can be tuned according to the cloud configuration parameters. To avoid running an exhaustive analysis over the more than one hundred possible cloud configurations, we consider only the cloud configurations listed in Table III. We consider the first cloud configuration (Table III, conf. id 1) as the reference configuration c1 to compute the predicted and actual speedup τ(c1)/τ(c) and saving [ε(c1) − ε(c)]/ε(c1) for a generic configuration c. Both speedup and saving are to be maximized. We consider ε(c) to be the total execution cost, computed as the hourly cost in Table III times the execution time τ(c). To reduce the cost with respect to the reference cloud configuration, we may either use a smaller cloud configuration or a configuration that completes the workload in a shorter time.
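To make the objectives and the planning step concrete, the sketch below computes speedup and saving against the reference configuration and extracts the non-dominated (Pareto) set from a list of predicted (time, cost) pairs. It is an illustrative helper under the definitions above, not the authors' optimization code; the prediction values are made up.

```python
def objectives(tau_ref, eps_ref, tau, eps):
    """Speedup and saving of a configuration with respect to the reference one."""
    speedup = tau_ref / tau                  # τ(c1) / τ(c)
    saving = (eps_ref - eps) / eps_ref       # [ε(c1) − ε(c)] / ε(c1)
    return speedup, saving

def pareto_front(points):
    """Keep the points not dominated by any other point.
    points: list of (conf_id, speedup, saving); both objectives are maximized."""
    front = []
    for cid, sp, sv in points:
        dominated = any(sp2 >= sp and sv2 >= sv and (sp2, sv2) != (sp, sv)
                        for _, sp2, sv2 in points)
        if not dominated:
            front.append((cid, sp, sv))
    return front

# Hypothetical predicted (execution time [s], total cost [¢]) per configuration:
tau_ref, eps_ref = 1200.0, 4.2
predictions = {2: (900.0, 2.3), 5: (300.0, 1.7), 13: (210.0, 6.9)}
points = [(cid,) + objectives(tau_ref, eps_ref, t, e) for cid, (t, e) in predictions.items()]
print(pareto_front(points))
```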
For each application a, we first construct the proposed prediction model by using the data from all benchmarks except a. We then use this model to predict the performance and cost associated with the considered cloud configurations. Figure 8 shows the optimization results for each of the applications. There are three sets of points. The actual trade-off includes all the best options actually available. The predicted trade-off shows the best predictions of the model; these results are available to the user before deploying the application in the cloud (during a planning phase). The final trade-off shows the actual objectives observable after deploying the application on the configurations that were predicted to belong to the best trade-off.

For all the considered applications, the prediction model always finds the two cheapest configurations. These are the crosses surrounded by circles that maximize the saving in the three plots. The squares close to those configurations demonstrate that the predictions of saving and speedup for those configurations are accurate. For the identification of the configurations maximizing the speedup, the situation is different. For the BT application (Figure 8a), the predicted and final results follow the actual optima pretty closely, but they no longer overlap them. For the LU application (Figure 8b), there is an actual solution that provides more than 5× speedup and about 50% saving. There is a prediction that overlaps this actual observation. Nonetheless, this prediction does not correspond to the actual optimal configuration but to a suboptimal one (the circle at about 4.5× speedup and 50% saving in Figure 8b). Using the solution of the prediction model it is still possible to achieve more than 5× speedup, but at a much higher cost than the actual optimal one, by spending more than with the reference configuration (i.e. with negative savings).

[Figure 8: Optimization results (saving [%] vs. speedup) for the three NPB applications, (a) BT, (b) LU, and (c) SP, including the actual Pareto front (crosses), the predicted Pareto front (squares), and the final Pareto front obtained when actually implementing the predicted optimal solutions in the cloud system (circles).]

For the SP application we correctly identify most of the actual optimal configurations (circles overlapping with crosses in Figure 8c). Nonetheless, we predict a maximum speedup of about 4× while the actual one is about 5×.

To quantify the difference between the actual optimal Pareto fronts and the predicted and final ones, we compute the average distance from reference set (ADRS). This metric estimates on average how much a reference Pareto front deviates from an alternative one [35], [36]. For each point in the actual reference Pareto front we compute its relative distance to the closest point in the alternative front. The distances are computed relative to the metric ranges observed in the actual reference front and then averaged:

ADRS(P_{ref}, P) = \frac{1}{|P_{ref}|} \sum_{y \in P_{ref}} \min_{y' \in P} \left[ d(y, y') \right] ,   (3)

d(y, y') = \max_i \frac{y_i - y'_i}{y_i^{max} - y_i^{min}} ,   (4)

where P_ref is the actual reference optimal Pareto front, P is the alternative one (either the predicted or the final front), y is the vector representation of the two objectives (speedup and saving), y_i is the i-th objective in y, and y_i^max, y_i^min are respectively the maximum and minimum values observed for the i-th objective in the actual reference Pareto front P_ref.

By computing the ADRS metric we estimate that the deviations of the predicted Pareto fronts from the actual optimal ones are 20%, 14%, and 25%, respectively, for BT, LU, and SP. The final Pareto fronts are closer to the actual optimal ones, with ADRS values of 10%, 15%, and 4%.
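A small sketch of the ADRS computation as defined in Equations 3 and 4 is given below; the two example fronts are made-up (speedup, saving) points used only to show the mechanics, and the normalization ranges are taken from the reference front.

```python
def adrs(reference_front, alternative_front):
    """Average distance from reference set (Equations 3 and 4).
    Fronts are lists of objective tuples (speedup, saving); normalization uses
    the per-objective ranges observed in the reference front."""
    dims = len(reference_front[0])
    y_max = [max(y[i] for y in reference_front) for i in range(dims)]
    y_min = [min(y[i] for y in reference_front) for i in range(dims)]

    def d(y, y_alt):
        # Normalized worst-case gap over the objectives (Equation 4).
        return max((y[i] - y_alt[i]) / (y_max[i] - y_min[i]) for i in range(dims))

    # For each reference point, distance to the closest alternative point (Equation 3).
    return sum(min(d(y, y_alt) for y_alt in alternative_front)
               for y in reference_front) / len(reference_front)

actual = [(5.2, -0.6), (4.1, 0.1), (1.0, 0.55)]      # made-up actual Pareto front
predicted = [(4.3, -0.5), (3.8, 0.05), (1.0, 0.52)]  # made-up predicted front
print(adrs(actual, predicted))
```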
V. CONCLUSION

In this work we presented a machine-learning-based model capable of predicting the performance of HPC applications running in the cloud. The main idea behind the proposed methodology is that the cloud provider can collect training data to learn a model that predicts application performance. This model is then released to the cloud user, who queries it to search for the cloud configuration most suitable for the application needs. The prediction model takes as input an application profile and a cloud configuration and returns as output the expected execution speed, time, and cost. The application profile describes the behavior of the application in terms of its computation and communication requirements. The cloud provider interfaces with the prediction model during the training phase, and in a later phase the cloud user interfaces with the same prediction model by querying it for prediction data. To enable the provider and the user to easily interface with the same model, the application profile we use describes the application behavior in a hardware-independent manner. Thus, the provider and the user can profile different applications independently on different machines.

The prediction model demonstrates high accuracy, with mean relative errors below 15% for the parallel MPI implementation of the NAS Parallel Benchmark suite. We also show how a cloud user can exploit the prediction model to search for the cloud configurations that are Pareto optimal for performance and cost. For the considered applications, the predicted Pareto solutions differ from the actual optimal ones by at most 25% when considering the predicted objectives (i.e. the objectives returned by the prediction model to the user when planning which cloud configuration to use), and by at most 15% when considering the actual objectives (i.e. when the user finally decides to deploy the application on the selected cloud configuration and actually observes the objective values).

ACKNOWLEDGMENT

This work is conducted in the context of the joint ASTRON and IBM DOME project and is funded by the Dutch Ministry of Economische Zaken and the Province of Drenthe.

REFERENCES

[1] A. Gupta, P. Faraboschi, F. Gioachin, L. V. Kale, R. Kaufmann, B. S. Lee, V. March, D. Milojicic, and C. H. Suen, "Evaluating and improving the performance and scheduling of HPC applications in cloud," IEEE Transactions on Cloud Computing, vol. 4, pp. 307–321, July 2016.
[2] E. Roloff, M. Diener, A. Carissimi, and P. Navaux, "High performance computing in the cloud: Deployment, performance and cost efficiency," in Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on, pp. 371–378, Dec 2012.
[3] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: The Montage example," in High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pp. 1–12, Nov 2008.
[4] A. Anghel, L. M. Vasilescu, G. Mariani, R. Jongerius, and G. Dittmann, "An instrumentation approach for hardware-agnostic software characterization," International Journal of Parallel Programming, pp. 1–25, 2016.
[5] K. Hoste and L. Eeckhout, "Microarchitecture-independent workload characterization," IEEE Micro, vol. 27, pp. 63–72, May-June 2007.
[6] A. Gupta, L. Kale, F. Gioachin, V. March, C. H. Suen, B.-S. Lee, P. Faraboschi, R. Kaufmann, and D. Milojicic, "The who, what, why, and how of high performance computing in the cloud," in Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 1, pp. 306–314, Dec 2013.
[7] M. Cinque, D. Cotroneo, F. Frattini, and S. Russo, "To cloudify or not to cloudify: The question for a scientific data center," IEEE Transactions on Cloud Computing, vol. 4, pp. 90–103, Jan 2016.
[8] C. Li, J. Xie, and X. Zhang, "Performance evaluation based on open source cloud platforms for high performance computing," in Intelligent Networks and Intelligent Systems (ICINIS), 2013 6th International Conference on, pp. 90–94, Nov 2013.
[9] G. Mariani, A. Anghel, R. Jongerius, and G. Dittmann, "Scaling properties of parallel applications to exascale," International Journal of Parallel Programming, pp. 1–28, 2016.
[10] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, "Algorithm runtime prediction: Methods & evaluation," Artificial Intelligence, vol. 206, pp. 79–111, Jan. 2014.
[11] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf, "Using automated performance modeling to find scalability bugs in complex codes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, (New York, NY, USA), pp. 45:1–45:12, ACM, 2013.
[12] X. Wu and F. Mueller, "ScalaExtrap: Trace-based communication extrapolation for SPMD programs," ACM Transactions on Programming Languages and Systems, vol. 34, pp. 5:1–5:29, May 2012.
[13] A. Wong, D. Rexachs, and E. Luque, "Parallel application signature for performance analysis and prediction," IEEE Transactions on Parallel and Distributed Systems, vol. 26, pp. 2009–2019, July 2015.
[14] R. Jongerius, G. Mariani, A. Anghel, G. Dittmann, E. Vermij, and H. Corporaal, "Analytic processor model for fast design-space exploration," in Computer Design (ICCD), 2015 33rd IEEE International Conference on, pp. 440–443, Oct 2015.
[15] A. Gupta, O. Sarood, L. V. Kale, and D. Milojicic, "Improving HPC application performance in cloud through dynamic load balancing," in Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pp. 402–409, May 2013.
[16] I. Sadooghi, S. Palur, A. Anthony, I. Kapur, K. Belagodu, P. Purandare, K. Ramamurty, K. Wang, and I. Raicu, "Achieving efficient distributed scheduling with message queues in the cloud for many-task computing and high-performance computing," in Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pp. 404–413, May 2014.
[17] K. Le, R. Bianchini, J. Zhang, Y. Jaluria, J. Meng, and T. D. Nguyen, "Reducing electricity cost through virtual machine placement in high performance computing clouds," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, (New York, NY, USA), pp. 22:1–22:12, ACM, 2011.
[18] M. Musleh, V. Pai, J. P. Walters, A. Younge, and S. Crago, "Bridging the virtualization performance gap for HPC using SR-IOV for InfiniBand," in Proceedings of the 2014 IEEE International Conference on Cloud Computing, CLOUD '14, (Washington, DC, USA), pp. 627–635, IEEE Computer Society, 2014.
[19] R. da Rosa Righi, C. A. da Costa, V. F. Rodrigues, and G. Rostirolla, "Joint-analysis of performance and energy consumption when enabling cloud elasticity for synchronous HPC applications," Concurrency and Computation: Practice and Experience, vol. 28, no. 5, pp. 1548–1571, 2016. cpe.3710.
[20] Q. Guo, T. Chen, Y. Chen, L. Li, and W. Hu, "Microarchitectural design space exploration made fast," Microprocessors and Microsystems, vol. 37, no. 1, pp. 41–51, 2013.
[21] G. Marin and J. Mellor-Crummey, "Cross-architecture performance predictions for scientific applications using parameterized models," in Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '04/Performance '04, (New York, NY, USA), pp. 2–13, ACM, 2004.
[22] Z. Jia, J. Zhan, L. Wang, R. Han, S. McKee, Q. Yang, C. Luo, and J. Li, "Characterizing and subsetting big data workloads," in Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp. 191–201, Oct 2014.
[23] A. Phansalkar, A. Joshi, and L. K. John, "Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite," SIGARCH Computer Architecture News, vol. 35, pp. 412–423, June 2007.
[24] Z. Jin and A. Cheng, "Improve simulation efficiency using statistical benchmark subsetting - an ImplantBench case study," in Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pp. 970–973, June 2008.
[25] Q. Guo, T. Chen, Y. Chen, and F. Franchetti, "Accelerating architectural simulation via statistical techniques: A survey," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, pp. 433–446, March 2016.
[26] "NAS parallel benchmarks," 2016. http://www.nas.nasa.gov/publications/npb.html.
[27] D. Montgomery, Design and Analysis of Experiments, 8th Edition. John Wiley & Sons, Incorporated, 2012.
[28] "SPEC CPU benchmarks." http://www.spec.org/benchmarks.html.
[29] M. H. Nguyen and F. de la Torre, "Optimal feature selection for support vector machines," Pattern Recognition, vol. 43, pp. 584–591, Mar. 2010.
[30] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[31] G. Palermo, C. Silvano, and V. Zaccaria, "ReSPIR: A response surface-based Pareto iterative refinement for application-specific design space exploration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, pp. 1816–1829, Dec 2009.
[32] "Mathematica 10," 2014. http://www.wolfram.com/mathematica/.
[33] https://aws.amazon.com/ec2/pricing/on-demand/.
[34] H. Javaid, A. Ignjatovic, and S. Parameswaran, "Fidelity metrics for estimation models," in 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, Nov 2010.
[35] G. Mariani, G. Palermo, V. Zaccaria, and C. Silvano, "OSCAR: An optimization methodology exploiting spatial correlation in multicore design spaces," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, pp. 740–753, May 2012.
[36] E. Zitzler, M. Laumanns, L. Thiele, C. M. Fonseca, and V. Grunert da Fonseca, "Why quality assessment of multiobjective optimizers is difficult," in Genetic and Evolutionary Computation Conference (GECCO 2002), (New York, NY, USA), pp. 666–674, Morgan Kaufmann Publishers, July 2002.
