Figure 2: Construction of the prediction model.

III. PROPOSED METHODOLOGY

A. Overview

In the proposed approach, the cloud provider generates a performance prediction model for the provided infrastructure. Then, the users query this model to infer the performance and costs of their applications for different cloud configurations. The prediction model takes as input a cloud configuration c and an application profile p, and returns as output the expected speed, execution time, and cost.

The prediction model is generated by the cloud provider by following the procedure illustrated in Figure 2. First of all, a set of training cloud configurations is selected together with a training set of applications and their datasets. For each training application a and dataset d we collect a hardware-independent profile p(a, d). Then, actual performance values are collected by running the application a with the dataset d using the cloud configuration c, for each triple a, d, c in the training sets. Performance measurements are subject to noise related to the activity of other cloud users. For the selection of the training sets, when possible, we apply statistical techniques known as design of experiments to minimize the effect of noise on the learnt model.

B. Gathering performance and profiling data

Hardware-independent profile. We characterize the application execution by means of the platform-independent software analysis (PISA) tool proposed by Anghel et al. [4]. This hardware-independent profiling choice enables cloud provider and cloud user to independently generate application profiles on different machines while interfacing with the same prediction model.

The profiler is based on LLVM. The application is first compiled to LLVM intermediate representation (IR) and then instrumented with the necessary profiling code. The instrumentation is automated by an LLVM plugin [4] and is carried out at the level of LLVM IR before generating the actual binary. This approach grants independence from the actual machine, including its microarchitecture and its instruction set architecture. The profile p(a, d) is obtained by executing the instrumented application a while processing a target dataset d. Being hardware independent, a single profile p(a, d) gathered on any machine is enough to describe the computation and communication requirements for any other machine or any cloud configuration c. The profile p(a, d) is a vector where each parameter is a statistic about an application feature f. We collect a wide set of features:

• Operation mix: percentage of operations per type (integer, floating point, memory read and write, etc.). For each type we also collect information about how well the operations can be vectorized. We do not directly count machine instructions but LLVM IR operations.
• ILP: instruction-level parallelism for an ideal machine executing LLVM IR operations. This feature is computed for the overall operations as well as for each operation type.
• Reuse distance: for different distances δ we compute the probability ρ(δ) that a memory reference is reused before accessing other δ unique memory references [20]. We store values of ρ(δ) for δ at power-of-two levels between 2^7 and 2^18. While being hardware independent, this metric can be used for inferring the miss rate for a cache of size δ [21] (see the sketch below).
• Memory traffic: this metric is a derivative of the reuse distance and the operation mix. It represents the percentage of operations that need to access the next level of the memory hierarchy when the current one is of size δ. It is computed by multiplying ρ(δ) times the percentage of memory operations.
• Register traffic: the average number of registers accessed by an operation.
• Library calls: the call count to each function in external libraries.
• Communication requirements: for each function in the communication library (in this work we consider MPI applications), we count the number of function calls and the overall number of bytes sent or received through it.

In this work, we aim at parallel applications. For each feature among the listed ones, the profile p(a, d) stores statistics about the per-thread distribution in terms of minimum, maximum, mean, median, and variance. Additionally, we also store the number of threads and the total count of LLVM IR operations o(a, d).
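To make the reuse-distance feature concrete, the following minimal sketch (ours, for illustration only; it is not the PISA implementation and all names are hypothetical) estimates ρ(δ) from a trace of memory addresses and derives the miss rate of an idealized fully associative LRU cache holding δ blocks:

    from collections import OrderedDict

    # Illustrative sketch, not the PISA profiler: estimate the reuse-distance profile
    # rho(delta) from a trace of memory addresses and use it to infer the miss rate of
    # an idealized fully-associative LRU cache that holds delta blocks.

    DELTA_LEVELS = [2 ** k for k in range(7, 19)]    # delta = 2^7 ... 2^18, as in the profile

    def reuse_distances(address_trace):
        """One value per access: number of distinct addresses touched since the
        previous access to the same address (infinite on the first touch)."""
        recency = OrderedDict()                      # LRU stack: oldest first, newest last
        distances = []
        for addr in address_trace:
            if addr in recency:
                stack = list(recency.keys())
                distances.append(len(stack) - 1 - stack.index(addr))
                recency.move_to_end(addr)            # addr becomes the most recent entry
            else:
                distances.append(float("inf"))
                recency[addr] = None
        return distances

    def rho(distances, delta):
        """Probability that a reference is reused before delta other unique addresses."""
        return sum(1 for d in distances if d < delta) / len(distances)

    def estimated_miss_rate(distances, delta):
        """References not reused within delta distinct addresses miss a cache of size delta."""
        return 1.0 - rho(distances, delta)

    trace = [0x10, 0x18, 0x20, 0x10, 0x18, 0x28, 0x10]        # toy address trace
    dists = reuse_distances(trace)
    profile = {delta: rho(dists, delta) for delta in DELTA_LEVELS}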
Performance indices. Actual performance data is also needed for training the prediction model (Figure 2). As the main performance measure we consider the wall-clock execution time τ(a, d, c), i.e. the actual time elapsed between the start and the completion of the application a when processing the dataset d using the cloud configuration c. To minimize the effect of noise, for each triple a, d, c we measure the execution time τ(a, d, c) ten times and use the median value as a robust estimate¹. Starting from the time τ(a, d, c), we compute the execution speed in terms of LLVM IR operations per second as ς(a, d, c) = o(a, d)/τ(a, d, c). Then, the execution cost ε(a, d, c) is the time τ(a, d, c) multiplied by the cost associated with the configuration c.

¹ In this work we focus on the prediction of the median performance values. One could also be interested in a worst-case analysis or, for example, in the prediction of the 90th percentile to give the cloud user a feel for the worst-case performance degradation.
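A minimal sketch of how these three indices can be derived from the repeated measurements (the numbers, the helper name, and the hourly price are hypothetical):

    import statistics

    # Sketch: derive the three performance indices for one (a, d, c) triple from
    # repeated wall-clock measurements; values and the hourly price are made up.

    def performance_indices(measured_times_s, llvm_ir_ops, hourly_cost):
        tau = statistics.median(measured_times_s)    # robust wall-clock estimate over ten runs
        sigma = llvm_ir_ops / tau                    # speed: LLVM IR operations per second
        epsilon = (tau / 3600.0) * hourly_cost       # cost: execution time times the price of c
        return tau, sigma, epsilon

    tau, speed, cost = performance_indices(
        measured_times_s=[118.2, 121.7, 119.9, 120.4, 118.8,
                          122.3, 119.5, 120.1, 121.0, 119.2],   # ten repetitions [s]
        llvm_ir_ops=3.1e12,                                     # o(a, d) from the profile
        hourly_cost=0.42)                                       # price attached to configuration c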
The model construction aims at generating an approximation of the execution speed ς̂(p, c) ∼ ς(a, d, c), where p is the hardware-independent profile representation of a when processing d. We will use the notations τ̂(p, c) and ε̂(p, c) to refer to the execution time and cost approximations computed starting from ς̂(p, c).

C. Selecting the training sets

Training applications. It is important to select a set of training applications general enough to represent all possible applications that the users may be interested in when querying the model for predictions. There are several studies on benchmarking [5], [22]–[25] that address the problem of selecting a representative set of applications. Without loss of generality, in this work we focus on scientific computing applications implemented in MPI; in particular, we target the NAS parallel benchmarks (NPB) [26].

Training datasets. Design of experiments (DOE) denotes a family of statistical techniques to spread a set of points within a parameter space so as to optimize some information criterion. There exist techniques specifically intended to minimize the uncertainty of an extrapolation model built upon the designed experiments [27]. Because there are many sources of noise in a cloud environment, we suggest applying DOE techniques whenever possible.

For the NPB applications, the input dataset is a set of a few numeric parameters. Thus, the input dataset d is a parameter vector and its configuration is a point in the parameter space. For this kind of applications, DOE techniques can be applied for the selection of the datasets d to be included in the training set. We apply the central composite DOE, whose goal is minimizing the uncertainty of a nonlinear polynomial model that accounts for parameter interactions. In this DOE, each parameter in the vector d can assume one out of five levels: {minimum, low, central, high, maximum}. All parameter combinations with the low and high values are included in the DOE (the corners of the square in Figure 3). Additionally, two configurations of d are included in the DOE to experiment with the maximum and the minimum values of each parameter while holding the central value for all the other parameters (the points lying on the diameters in Figure 3). Finally, the central configuration is also included.

Figure 3: Central composite DOE for two parameters.
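As a worked illustration of this design, the sketch below enumerates the central composite points for a hypothetical two-parameter dataset space (the parameter names and level values are assumptions made for the example):

    from itertools import product

    # Sketch of the central composite design described above: factorial corners at the
    # low/high levels, axial points at the minimum/maximum of each parameter with all
    # other parameters held at their central level, plus the central configuration.

    levels = {                        # parameter: (minimum, low, central, high, maximum)
        "grid_size":  (32, 64, 128, 256, 512),
        "iterations": (10, 50, 100, 150, 200),
    }
    MIN, LOW, CENTER, HIGH, MAX = range(5)
    names = list(levels)

    def central_composite(levels):
        points = []
        # Corners: every combination of the low and high levels.
        for combo in product((LOW, HIGH), repeat=len(names)):
            points.append({n: levels[n][l] for n, l in zip(names, combo)})
        # Axial points: one parameter at its minimum or maximum, the others central.
        for n in names:
            for extreme in (MIN, MAX):
                point = {m: levels[m][CENTER] for m in names}
                point[n] = levels[n][extreme]
                points.append(point)
        # Central configuration.
        points.append({n: levels[n][CENTER] for n in names})
        return points

    training_datasets = central_composite(levels)    # 2^k + 2k + 1 = 9 points for k = 2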
Note that, for other applications where the format of the input dataset does not allow for a clear definition of a parameter space², the suggested DOE technique cannot be applied. For those applications we recommend the use of representative datasets such as the ones commonly released with the benchmark suites themselves.

² For example, with a small set of numeric parameters it is not easy to describe the input of a 3D rendering application or the positioning of several chess pieces on a chessboard [28].

Training cloud configurations. Cloud configuration parameters in c are generally subject to constraints that the central composite DOE cannot handle. For example, the number of cores per node and the amount of memory per node are two parameters that cannot be set independently: only a set of their combined values (flavors) is exposed. It is generally impossible to implement a central composite DOE (Figure 3) on a space that includes only a set of predefined flavors.

For placing experimental training configurations in a constrained parameter space we apply a D-optimal DOE [27]. This DOE is generated by solving a numerical optimization problem. The optimization takes as input the desired number of experiments and a target extrapolation model. Discovering the exact expression of the extrapolation model before running the experiments is not possible, thus an expression general enough to match any possible behavior is needed. We consider a second-order polynomial model to account for parameter interactions and nonlinearities:

ŷ(c) = k0 + Σ_{γi ∈ c} ki γi + Σ_{γi ∈ c} Σ_{γj ∈ c} ki,j γi γj ,    (1)

where ŷ is a generic approximation function, the k are the coefficients to be learnt, and the γi are the cloud configuration parameters in c. While the final prediction model might differ from the one in Equation 1, using this form enables us to spread experiments specifically to expose possible parameter interactions and nonlinearities.

Let us describe all experiments with a matrix C whose rows c are experimental cloud configurations and whose columns γ are configuration parameters (Figure 4a). The design matrix X is a matrix that describes both the experiments to run and the polynomial model to learn [27]. The matrix X includes a row for each experiment and a column for each term γi and γi γj in Equation 1 (Figure 4b).

Figure 4: Example of experiment matrix C and the resulting design matrix X considering six cloud configurations c1–c6 and two cloud configuration parameters γ1, γ2. (a) Experiment matrix. (b) Design matrix.

To generate a D-optimal DOE, the experiments in the matrix C have to be tuned to maximize the determinant of the information matrix XᵀX. This is equivalent to minimizing the generalized variance (uncertainty) of the coefficients k to be learnt [27]. To solve this optimization problem we use a genetic algorithm (GA). An individual of the GA is the experiment matrix C. Within the GA kernel, C is unrolled into a vector to apply the traditional genetic operators of crossover and mutation. For any GA individual C, its fitness is computed as the information matrix determinant |XᵀX|, where X can be derived from C (Figure 4). Since the fitness function of a GA individual is computed analytically, the GA runs very fast, taking a few seconds in our Mathematica implementation.
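To make the criterion concrete, the following sketch (ours; an exhaustive search over a handful of candidate flavors stands in for the genetic algorithm) expands an experiment matrix C into the design matrix X of Equation 1 and scores candidate designs by the determinant |XᵀX|:

    import numpy as np
    from itertools import combinations

    # Sketch: expand an experiment matrix C (rows = cloud configurations, columns =
    # parameters gamma) into the design matrix X of Equation 1 and pick, among a few
    # candidate flavors, the subset of experiments maximizing the D-optimality score
    # |X^T X|. Exhaustive search replaces the genetic algorithm used above.

    def design_matrix(C):
        C = np.asarray(C, dtype=float)
        cols = [np.ones(len(C))]                              # constant term k0
        cols += [C[:, i] for i in range(C.shape[1])]          # linear terms ki * gamma_i
        cols += [C[:, i] * C[:, j]                            # interaction/quadratic terms ki,j
                 for i in range(C.shape[1]) for j in range(i, C.shape[1])]
        return np.column_stack(cols)

    def d_optimality(C):
        X = design_matrix(C)
        return np.linalg.det(X.T @ X)

    # Candidate flavors as (cores, RAM in GB) pairs; only predefined combinations exist.
    flavors = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 4), (8, 4),
               (4, 8), (8, 8), (8, 16), (12, 24)]

    n_experiments = 7
    best_design = max(combinations(flavors, n_experiments), key=d_optimality)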
D. Training the prediction model

The goal of the prediction model is to approximate the execution speed ς(a, d, c) with an analytic function ς̂(p, c) ∼ ς(a, d, c), where p is the profile vector for the application a when processing d. For the applications investigated in this paper, the profile p counts a total of 795 features. Before generating the prediction models, we clean the input dataset by removing the features having constant values and, for each pair of features fi and fj with a correlation greater than 0.99, we drop one of them because including it would not bring additional information. This cleaning process reduces the number of features in p to about 200. This large number of features enables the identification of complex relationships between the analyzed software and its performance. On the other hand, the accuracy of many prediction models such as support vector machines and neural networks is degraded by the introduction of too many features [29]. To avoid a complex feature-selection scheme, in this work we apply random forests (RFs) to implement the prediction models. RFs embed automatic procedures to screen many input features and to select the predictors during model construction [30].

An RF is a collection of several regression trees. Each tree is generated to fit the behavior of the target performance metric by using a randomly selected subset of the training data. This lets each tree focus on a slightly different problem, a process that significantly improves the RF robustness to noisy data. The data that a tree does not use during training is referred to as out-of-the-bag data. The number of trees and the amount of per-tree out-of-the-bag data are selected to let every observation in the training dataset be part of the out-of-the-bag set for many trees. An approximate estimation of the prediction accuracy of an RF model can be obtained on the training dataset itself by considering only the out-of-the-bag predictions.

A tree is constructed (grown) starting from a root node that represents the whole input space, including both application features in p and cloud configuration parameters in c. Then, iteratively, each node is grown by associating it with a splitting value for an input variable to generate two subspaces, thus two child nodes. Each node is associated with a prediction of the target metric equal to the mean observed value in the training dataset for the input subspace the node represents. The RF automatically selects the input variables to be included in the model. Each time a node is grown, the input variables are screened by means of the reduction in the mean relative error they would bring to the model if selected for the splitting. The split associated with the variable leading to the lowest mean squared error is assigned to the node. Thus, unimportant features do not perturb the prediction.

In this work, we aim to predict the execution speed: ς̂(p, c) ∼ ς(a, d, c). In order to maximize the prediction accuracy [31], we consider the Box-Cox preprocessing functions g(ς). We construct an RF model to fit g(ς), then we predict the speed ς̂ by applying the inverse of g to the value predicted by the RF. We consider the following set of preprocessing functions: g(ς) ∈ {ς, log(ς), ς⁻¹}. For the final prediction model we select the one that minimizes the mean squared error for the out-of-the-bag predictions.
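A compact sketch of this training step, assuming scikit-learn's RandomForestRegressor as a stand-in for the forest implementation actually used (the feature matrix and measured speeds are placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Sketch: fit one random forest per candidate preprocessing function g, compare the
    # models on out-of-the-bag mean squared error, and keep the best-performing one.
    # scikit-learn's RandomForestRegressor stands in for the authors' implementation.

    TRANSFORMS = {                       # g and its inverse
        "identity":   (lambda s: s,        lambda y: y),
        "log":        (np.log,             np.exp),
        "reciprocal": (lambda s: 1.0 / s,  lambda y: 1.0 / y),
    }

    def train_speed_model(features, speeds, n_trees=500):
        """features: one row per training run (profile p plus configuration c);
        speeds: the measured execution speeds sigma(a, d, c)."""
        speeds = np.asarray(speeds, dtype=float)
        best = None
        for name, (g, g_inv) in TRANSFORMS.items():
            rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=0)
            rf.fit(features, g(speeds))
            oob_mse = np.mean((g_inv(rf.oob_prediction_) - speeds) ** 2)  # out-of-the-bag error
            if best is None or oob_mse < best[0]:
                best = (oob_mse, name, rf, g_inv)
        _, name, rf, g_inv = best

        def predict_speed(x):            # predicted speed = inverse transform of the RF output
            return g_inv(rf.predict(x))

        return name, predict_speed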
IV. EXPERIMENTAL RESULTS

In this section we first introduce the experimental setup, including a description of the cloud infrastructure and a presentation of the benchmarks, datasets, and cloud configurations in use. Then, empirical evidence on the accuracy of the prediction models is presented. Finally, the prediction models are applied to the problem of identifying the best cloud configuration to use for a target application, thus clarifying the way a cloud user can exploit them.

A. Experimental setup

The target cloud infrastructure. The evaluation of the proposed methodology is carried out by using a cloud infrastructure reserved for research purposes in our labs.
Table I: List of the NAS parallel benchmarks and the values of the parameter settings for the considered central composite DOE. The column type indicates with K the benchmarks that are kernels and with A the ones that are applications.

Table II: Virtual machine flavors for the target cloud system.

Flavor id    1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Cores m      1  2  1  2  1  2  4  8  1   2   4   8  24   8  12
RAM r [GB]   1  1  2  2  4  4  4  4  8   8   8   8   8  16  24
Figure 5: Mean relative error for the prediction of execution speed, execution time, and execution cost. (a) Mean relative error for the considered benchmarks (Table I). (b) Mean relative error for the considered cloud configurations (Table III).

Figure 6: Fidelity of the prediction models in terms of Pearson correlation for execution speed, time, and cost. (a) Fidelity for the considered benchmarks (Table I). (b) Fidelity for the considered cloud configurations (Table III).
we measure prediction accuracy for that one benchmark. Then, we cross-validate cloud-configuration-wise. We assess both accuracy and fidelity [34].

The accuracy is assessed in terms of relative error and indicates how close the predicted values are to the actual ones. For a generic prediction ŷ(x), the error is computed relative to the difference between the maximum and the minimum observed value of the actual metric:

|ŷ(x) − y(x)| / (y_max − y_min).    (2)

The fidelity is a measure of how much the trend in the predicted metric ŷ follows the trend in the actual metric y. As suggested by Javaid et al. [34], we use the Pearson correlation to measure fidelity. While accuracy tells us how close a prediction is to the actual metric, fidelity tells us how good the prediction model is in discriminating between good and bad configurations. Thus, both metrics are important when the prediction goal is to drive an optimization process.
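Both metrics are inexpensive to compute; a minimal sketch, assuming NumPy arrays of actual and predicted values:

    import numpy as np

    # Sketch of the two evaluation metrics: the relative error of Equation 2
    # (normalized by the observed range of the actual metric) and the fidelity
    # expressed as the Pearson correlation between actual and predicted values.

    def relative_errors(y_actual, y_predicted):
        y_actual = np.asarray(y_actual, dtype=float)
        y_predicted = np.asarray(y_predicted, dtype=float)
        span = y_actual.max() - y_actual.min()           # y_max - y_min of the actual metric
        return np.abs(y_predicted - y_actual) / span

    def fidelity(y_actual, y_predicted):
        return np.corrcoef(y_actual, y_predicted)[0, 1]  # Pearson correlation coefficient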
Figures 5 and 6 show prediction accuracy and fidelity for the three considered metrics: execution speed, execution time, and execution cost. The mean relative error (Figure 5) shows values below 15% for all benchmarks, cloud configurations, and target metrics. This demonstrates that the predictions of the proposed model are reliable. Additionally, the high correlation between the actual metric and the predicted one (Figure 6) demonstrates that the predictor model is capable of discriminating between fast and slow, and cheap and expensive, solutions. We consider model fidelity to be particularly important when the model is applied for predicting the performance of a previously unseen benchmark for different cloud configurations (Figure 6a).

To clarify this point, Figure 7 shows the scatter plot of the actual execution speed (x axis) with respect to the predicted one (y axis). Each subplot is obtained by removing the benchmark in the subplot title from the training dataset, thus the predictions are for a previously unseen benchmark. For each subplot, the predicted execution speed tends to increase with the increase of the actual speed, which results in high correlation (Figure 6a). We can thus state the following: if for a given benchmark we predict that a given cloud configuration is a good one with respect to other cloud configurations, we can expect it to actually be the case.

C. Optimization results

The main contribution of this work lies in the proposed prediction methodology. To demonstrate how the prediction models can be used to tune a cloud configuration before actually running the target application in the cloud, we set up a multi-objective optimization problem to maximize the performance and minimize the cost related to the execution of the NAS Parallel Benchmarks (NPBs) in the cloud. We target the optimization of cloud configurations for the deployment of the three NPB applications (BT, LU, and SP).
Figure 7: Scatter plot of the actual execution speed (x axis, Gops/s) with respect to the predicted one (y axis, Gops/s) for each of the considered benchmarks (BT, CG, EP, FT, LU, MG, SP).
The optimization is set up to consider the central dataset configuration (Table I). We assume that the number of MPI processes is not constrained to its central value but can be tuned according to the cloud configuration parameters. To avoid running an exhaustive analysis for over a hundred cloud configurations, we consider only the cloud configurations listed in Table III. We consider the first cloud configuration (Table III, conf. id 1) as the reference configuration c1 to compute the predicted and actual speedup τ(c1)/τ(c) and saving [ε(c1) − ε(c)]/ε(c1) for a generic configuration c. Both speedup and saving are to be maximized. We consider ε(c) to be the total execution cost, computed as the hourly cost in Table III times the execution time τ(c). To reduce the cost with respect to the reference cloud configuration, we may either use a smaller cloud configuration or a configuration that completes the workload in a shorter time.
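The two objectives and the Pareto filter applied to the candidate configurations can be sketched as follows (the data layout and names are hypothetical):

    # Sketch: compute the two objectives (speedup and saving with respect to the
    # reference configuration c1) and keep only the Pareto-optimal configurations.
    # times and costs map a configuration id to tau(c) and epsilon(c); made-up layout.

    def objectives(times, costs, reference=1):
        return {c: (times[reference] / times[c],                       # speedup
                    (costs[reference] - costs[c]) / costs[reference])  # saving
                for c in times}

    def pareto_front(points):
        """Keep configurations not dominated in both objectives (both maximized)."""
        front = {}
        for c, (sp, sv) in points.items():
            dominated = any((sp2 > sp and sv2 >= sv) or (sp2 >= sp and sv2 > sv)
                            for sp2, sv2 in points.values())
            if not dominated:
                front[c] = (sp, sv)
        return front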
For each application a, we first construct the proposed prediction model by using the data from all benchmarks except a. We then use this model to predict the performance and cost associated with the considered cloud configurations. Figure 8 shows the optimization results for each of the applications. There are three sets of points. The actual trade-off includes all the best options actually available. The predicted trade-off shows the best predictions of the model; these results are available to the user before deploying the application in the cloud (during a planning phase). The final trade-off comprises the actual objectives observable after deploying the application on the configurations that were predicted to belong to the best trade-off.

For all the considered applications, the prediction model always finds the two cheapest configurations. These are the crosses surrounded by circles that maximize the saving in the three plots. The squares close to those configurations demonstrate that the predictions of saving and speedup for those configurations are accurate. For the identification of the configurations maximizing the speedup, the situation is different. For the BT application (Figure 8a), the predicted and final results follow the actual optima quite closely but no longer overlap them. For the LU application (Figure 8b), there is an actual solution that provides more than 5× speedup and about 50% saving. There is a prediction that overlaps this actual observation. Nonetheless, this prediction does not correspond to the actual optimal configuration but to a suboptimal one (the circle at about 4.5× speedup and 50% saving, Figure 8b). Using the solution of the prediction model it is still possible to achieve more than 5× speedup, but at a much higher cost than the actual optimum, spending more than with the reference configuration (i.e. with negative saving).
Figure 8: Optimization results for the three NPB applications, including the actual Pareto front (crosses), the predicted Pareto front (squares), and the final Pareto front obtained when actually implementing the predicted optimal solutions in the cloud system (circles). The vertical axis of each plot reports the saving [%].
For the SP application we correctly identify most of the actual optimal configurations (circles overlapping the crosses in Figure 8c). Nonetheless, we predict a maximum speedup of about 4× while the actual one is about 5×.

To quantify the difference between the actual optimal Pareto fronts and the predicted and final ones we compute the average distance from reference set (ADRS). This metric estimates on average how much a reference Pareto front deviates from an alternative one [35], [36]. For each point in the actual reference Pareto front we compute its relative distance to the closest point in the alternative front. The distances are computed relative to the metric ranges observed in the actual reference front and then averaged:

ADRS(Pref, P) = (1/|Pref|) Σ_{y ∈ Pref} min_{y′ ∈ P} [d(y, y′)],    (3)

d(y, y′) = max_i (yi − y′i) / (yi^max − yi^min),    (4)

where Pref is the actual reference optimal Pareto front, P is the alternative one (either the predicted or the final front), y is the vector representation of the two objectives (speedup and saving), yi is the ith objective in y, and yi^max, yi^min are respectively the maximum and minimum values observed for the ith objective in the actual reference Pareto front Pref.

By computing the ADRS metric we estimate that the deviations of the predicted Pareto fronts from the actual optimal ones are 20%, 14%, and 25%, respectively, for BT, LU, and SP. The final Pareto fronts are closer to the actual optimal ones, with ADRS of 10%, 15%, and 4%.
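A direct transcription of Equations 3 and 4 as reconstructed above (sketch; fronts are given as lists of (speedup, saving) pairs):

    import numpy as np

    # Sketch of Equations 3 and 4: for every point of the actual reference Pareto front,
    # take the distance to the closest point of the alternative (predicted or final)
    # front, normalized by the objective ranges observed in the reference front, and average.

    def adrs(reference_front, alternative_front):
        ref = np.asarray(reference_front, dtype=float)   # rows: (speedup, saving)
        alt = np.asarray(alternative_front, dtype=float)
        ranges = ref.max(axis=0) - ref.min(axis=0)       # yi^max - yi^min per objective

        def d(y, y_alt):                                 # Equation 4
            return np.max((y - y_alt) / ranges)

        return float(np.mean([min(d(y, y_alt) for y_alt in alt) for y in ref]))  # Equation 3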
V. CONCLUSION

In this work we presented a machine-learning-based model capable of predicting the performance of HPC applications running in the cloud. The main idea behind the proposed methodology is that the cloud provider can collect training data to learn a model that predicts application performance. This model is then released to the cloud user, who queries it to search for the cloud configuration most suitable for the application needs. The prediction model takes as input an application profile and a cloud configuration and returns as output the expected execution speed, time, and cost. The application profile describes the behavior of the application in terms of its computation and communication requirements. The cloud provider interfaces with the prediction model during the training phase and, in a later phase, the cloud user interfaces with the same prediction model by querying it to gather prediction data. To enable the provider and the user to easily interface with the same model, the application profile we use describes the application behavior in a hardware-independent manner. Thus, the provider and the user can profile different applications independently on different machines.

The prediction model demonstrates high accuracy, with mean relative errors below 15% for the parallel MPI implementation of the NAS parallel benchmark suite. We also show how a cloud user can exploit the prediction model to search for the cloud configurations that are Pareto optimal for performance and cost. For the considered applications, the predicted Pareto solutions differ from the actual optimal ones by at most 25% when considering the predicted objectives (i.e. the objectives returned by the prediction model to the user when planning which cloud configuration to use), and by at most 15% when considering the actual objectives (i.e. when the user finally decides to deploy the application on the selected cloud configuration and actually observes the objective values).

ACKNOWLEDGMENT

This work is conducted in the context of the joint ASTRON and IBM DOME project and is funded by the Dutch Ministry of Economische Zaken and the Province of Drenthe.

REFERENCES

[1] A. Gupta, P. Faraboschi, F. Gioachin, L. V. Kale, R. Kaufmann, B. S. Lee, V. March, D. Milojicic, and C. H. Suen, "Evaluating and improving the performance and scheduling of HPC applications in cloud," IEEE Transactions on Cloud Computing, vol. 4, pp. 307–321, July 2016.

[2] E. Roloff, M. Diener, A. Carissimi, and P. Navaux, "High performance computing in the cloud: Deployment, performance and cost efficiency," in Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on, pp. 371–378, Dec 2012.

[3] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, "The cost of doing science on the cloud: The montage example," in High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pp. 1–12, Nov 2008.

[4] A. Anghel, L. M. Vasilescu, G. Mariani, R. Jongerius, and G. Dittmann, "An instrumentation approach for hardware-agnostic software characterization," International Journal of Parallel Programming, pp. 1–25, 2016.

[5] K. Hoste and L. Eeckhout, "Microarchitecture-independent workload characterization," Micro, IEEE, vol. 27, pp. 63–72, May–June 2007.

[6] A. Gupta, L. Kale, F. Gioachin, V. March, C. H. Suen, B.-S. Lee, P. Faraboschi, R. Kaufmann, and D. Milojicic, "The who, what, why, and how of high performance computing in the cloud," in Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 1, pp. 306–314, Dec 2013.

[7] M. Cinque, D. Cotroneo, F. Frattini, and S. Russo, "To cloudify or not to cloudify: The question for a scientific data center," IEEE Transactions on Cloud Computing, vol. 4, pp. 90–103, Jan 2016.
[8] C. Li, J. Xie, and X. Zhang, "Performance evaluation based on open source cloud platforms for high performance computing," in Intelligent Networks and Intelligent Systems (ICINIS), 2013 6th International Conference on, pp. 90–94, Nov 2013.

[9] G. Mariani, A. Anghel, R. Jongerius, and G. Dittmann, "Scaling properties of parallel applications to exascale," International Journal of Parallel Programming, pp. 1–28, 2016.

[10] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, "Algorithm runtime prediction: Methods & evaluation," Artif. Intell., vol. 206, pp. 79–111, Jan. 2014.

[11] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf, "Using automated performance modeling to find scalability bugs in complex codes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, (New York, NY, USA), pp. 45:1–45:12, ACM, 2013.

[12] X. Wu and F. Mueller, "ScalaExtrap: Trace-based communication extrapolation for spmd programs," ACM Trans. Program. Lang. Syst., vol. 34, pp. 5:1–5:29, May 2012.

[13] A. Wong, D. Rexachs, and E. Luque, "Parallel application signature for performance analysis and prediction," Parallel and Distributed Systems, IEEE Transactions on, vol. 26, pp. 2009–2019, July 2015.

[14] R. Jongerius, G. Mariani, A. Anghel, G. Dittmann, E. Vermij, and H. Corporaal, "Analytic processor model for fast design-space exploration," in Computer Design (ICCD), 2015 33rd IEEE International Conference on, pp. 440–443, Oct 2015.

[15] A. Gupta, O. Sarood, L. V. Kale, and D. Milojicic, "Improving HPC application performance in cloud through dynamic load balancing," in Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pp. 402–409, May 2013.

[16] I. Sadooghi, S. Palur, A. Anthony, I. Kapur, K. Belagodu, P. Purandare, K. Ramamurty, K. Wang, and I. Raicu, "Achieving efficient distributed scheduling with message queues in the cloud for many-task computing and high-performance computing," in Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pp. 404–413, May 2014.

[17] K. Le, R. Bianchini, J. Zhang, Y. Jaluria, J. Meng, and T. D. Nguyen, "Reducing electricity cost through virtual machine placement in high performance computing clouds," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, (New York, NY, USA), pp. 22:1–22:12, ACM, 2011.

[18] M. Musleh, V. Pai, J. P. Walters, A. Younge, and S. Crago, "Bridging the virtualization performance gap for HPC using SR-IOV for InfiniBand," in Proceedings of the 2014 IEEE International Conference on Cloud Computing, CLOUD '14, (Washington, DC, USA), pp. 627–635, IEEE Computer Society, 2014.

[19] R. da Rosa Righi, C. A. da Costa, V. F. Rodrigues, and G. Rostirolla, "Joint-analysis of performance and energy consumption when enabling cloud elasticity for synchronous HPC applications," Concurrency and Computation: Practice and Experience, vol. 28, no. 5, pp. 1548–1571, 2016. cpe.3710.

[20] Q. Guo, T. Chen, Y. Chen, L. Li, and W. Hu, "Microarchitectural design space exploration made fast," Microprocessors and Microsystems, vol. 37, no. 1, pp. 41–51, 2013.

[21] G. Marin and J. Mellor-Crummey, "Cross-architecture performance predictions for scientific applications using parameterized models," in Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '04/Performance '04, (New York, NY, USA), pp. 2–13, ACM, 2004.

[22] Z. Jia, J. Zhan, L. Wang, R. Han, S. Mckee, Q. Yang, C. Luo, and J. Li, "Characterizing and subsetting big data workloads," in Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp. 191–201, Oct 2014.

[23] A. Phansalkar, A. Joshi, and L. K. John, "Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite," SIGARCH Comput. Archit. News, vol. 35, pp. 412–423, June 2007.

[24] Z. Jin and A. Cheng, "Improve simulation efficiency using statistical benchmark subsetting - an implantbench case study," in Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pp. 970–973, June 2008.

[25] Q. Guo, T. Chen, Y. Chen, and F. Franchetti, "Accelerating architectural simulation via statistical techniques: A survey," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, pp. 433–446, March 2016.

[26] "NAS parallel benchmarks," 2016. http://www.nas.nasa.gov/publications/npb.html.

[27] D. Montgomery, Design and Analysis of Experiments, 8th Edition. John Wiley & Sons, Incorporated, 2012.

[28] "SPEC CPU benchmarks." http://www.spec.org/benchmarks.html.

[29] M. H. Nguyen and F. de la Torre, "Optimal feature selection for support vector machines," Pattern Recogn., vol. 43, pp. 584–591, Mar. 2010.

[30] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[31] G. Palermo, C. Silvano, and V. Zaccaria, "Respir: A response surface-based Pareto iterative refinement for application-specific design space exploration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, pp. 1816–1829, Dec 2009.

[32] "Mathematica 10," 2014. http://www.wolfram.com/mathematica/.

[33] https://aws.amazon.com/ec2/pricing/on-demand/.

[34] H. Javaid, A. Ignjatovic, and S. Parameswaran, "Fidelity metrics for estimation models," in 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, Nov 2010.

[35] G. Mariani, G. Palermo, V. Zaccaria, and C. Silvano, "OSCAR: An optimization methodology exploiting spatial correlation in multicore design spaces," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, pp. 740–753, May 2012.

[36] E. Zitzler, M. Laumanns, L. Thiele, C. M. Fonseca, and V. Grunert da Fonseca, "Why Quality Assessment Of Multiobjective Optimizers Is Difficult," in Genetic and Evolutionary Computation Conference (GECCO 2002), (New York, NY, USA), pp. 666–674, Morgan Kaufmann Publishers, July 2002.