Sampling-Based Binary-Level Cross-Platform Performance Estimation
Xinnian Zheng, Haris Vikalo, Shuang Song, Lizy K. John, Andreas Gerstlauer
The University of Texas at Austin, TX, USA
{xzheng1, hvikalo, songshuang1990, ljohn, gerstl}@utexas.edu
Abstract—Fast and accurate performance estimation is a key challenge in modern system design. Recently, machine learning-based approaches have emerged that allow predicting the performance of an application on a target platform from executions on a different host. However, existing approaches rely on expensive instrumentation that requires source code to be available. We propose a novel sampling-based, binary-level cross-platform prediction method that accurately predicts performance of a workload on a target by relying on various performance statistics sampled on a host using built-in hardware counters. In our proposed framework, samples acquired from the host and target do not satisfy the straightforward one-to-one correspondence that characterizes prior instrumentation-based approaches. The resulting alignment problem is NP-hard; to solve it efficiently, we develop a stochastic dynamic coupling (SDC) algorithm which, under mild assumptions, closely approximates the optimal alignment with high probability. The prediction model constructed using SDC-aligned samples achieves on average 96.5% accuracy for 45 benchmarks at speeds of over 3 GIPS. At similar accuracies, this is up to 6× faster than instrumentation-based prediction, and approximately twice the speed of executing the same applications natively on our ARM target.

I. INTRODUCTION

Estimating the performance of complex software on hardware that is not readily available is among the key challenges in modern system-level design. Due to the increasing complexity of software and hardware, performance modeling and prediction are difficult tasks. Widely adopted simulation-based approaches, such as cycle-accurate instruction set simulators (ISSs), excel in accuracy yet lack significantly in speed. By contrast, traditional analytical models are computationally efficient, but their accuracy is limited. To this end, novel machine learning-based cross-platform prediction techniques have recently been proposed [1], [2]. Compared to traditional cycle-accurate ISSs, cross-platform methods offer several orders of magnitude speedup while maintaining similar accuracy.

The key idea behind such approaches is the simple intuition that performance of the same application running on two different hardware platforms is correlated, and that this latent correlation can be extracted by supervised learning methods to ultimately predict the performance of applications running on a target while executing them natively on a host. The resulting prediction models, trained by executing small micro-benchmarks on an early reference implementation or model of the target, can be used by software and hardware developers to evaluate large, real-world application behavior that would otherwise require target access or be too slow, if not infeasible, to obtain.

Existing cross-platform prediction approaches achieve high accuracy using a setup where training and prediction are both performed at program phase granularity. Such approaches require intrusive instrumentation of a compiler-generated intermediate representation (IR) with profiling calls at basic block boundaries to capture phase-level performance features of a program on the host as well as, during training, reference performance on the target. Instrumentation at the IR basic block level leads to three major problems, however: (1) it introduces performance overhead, (2) application source code is required, and (3) prediction is only possible for a single standalone application at a time. These inherently limit practical applicability, e.g. to predict the performance of an actual application and system code mix running on a platform.

In this paper, we instead propose a sampling-based, binary-level technique that addresses the drawbacks of existing instrumentation-based cross-platform prediction methods. Our approach is source-oblivious and can perform background prediction for arbitrary binary code while maintaining similar prediction accuracy with a significant increase in speed. A key challenge in sampling-based learning methods is proper alignment of samples during training. In instrumentation-based approaches, since the IR is architecture-independent, profiling every fixed number of dynamic basic blocks guarantees an exact correspondence between the performance features on the host and the reference performance on the target for each captured phase. By contrast, when host and target measurements are sampled with respect to time rather than to fixed block or phase boundaries, this one-to-one correspondence between host and target measurements is lost. Due to a different pace of application progress, sampling an application at the same rate on different hardware platforms almost always results in a vastly different number of samples. Hence, the main challenge is to align and synchronize the host and target samples during training such that they approximately correspond to the same section of executed code. To solve this alignment problem, we propose a novel, efficient and effective stochastic dynamic coupling (SDC) heuristic. Crucially, the numerical measurements from the host and target can be viewed as two stochastic signals, and our aim is to align them such that their cross-covariance is maximized. Using the aligned samples from the host and target, we then train prediction models, which correlate performance from the host to the target.
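To make the coupling idea concrete, the sketch below greedily aligns two unequal-length sample streams by always advancing to the index pair with the largest windowed cross-covariance. It is a minimal illustration only, not the actual SDC algorithm (which, with its guarantees, is developed in Section II); all names and the window parameter are our own.

import numpy as np

def couple_samples(host, target, window=8):
    """Greedy monotone coupling of two unequal-length sampled signals;
    an illustrative stand-in for the SDC algorithm of Section II."""
    host = np.asarray(host, dtype=float)
    target = np.asarray(target, dtype=float)

    def local_cov(i, j):
        # Windowed cross-covariance around the candidate pair (i, j);
        # larger values suggest the windows cover the same code section.
        a = host[max(0, i - window):i + 1]
        b = target[max(0, j - window):j + 1]
        n = min(len(a), len(b))
        a, b = a[-n:], b[-n:]
        return float(np.mean((a - a.mean()) * (b - b.mean())))

    pairs, i, j = [(0, 0)], 0, 0
    while i < len(host) - 1 or j < len(target) - 1:
        # Consider advancing the host index, the target index, or both,
        # and take the move with the highest windowed cross-covariance.
        moves = []
        if i + 1 < len(host) and j + 1 < len(target):
            moves.append((local_cov(i + 1, j + 1), i + 1, j + 1))
        if i + 1 < len(host):
            moves.append((local_cov(i + 1, j), i + 1, j))
        if j + 1 < len(target):
            moves.append((local_cov(i, j + 1), i, j + 1))
        _, i, j = max(moves)
        pairs.append((i, j))
    return pairs  # list of (host_index, target_index) couplings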
The rest of the paper is organized as follows: after a discussion of related work and an overview of our approach, Section II presents the formulation of the alignment problem and the SDC algorithm as well as theoretical performance guarantees. Section III discusses the experimental setup and Section IV presents results. Finally, Section V concludes the paper with a summary of key contributions and results.
[Figure: Framework overview. For training applications 1..N, performance features sampled on the host and reference performance sampled on the target are aligned and fed to a machine learning algorithm that produces the prediction model.]
Fig. 4: Runtime of 45 benchmarks. (a) 19 SPEC CPU C/C++ programs. (b) 13 DaCapo Java and 13 PyBench Python programs.
oap, xalan) from the DaCapo Java benchmark suite v9.12 [13] and 13 Python programs (2to3, bzr_startup, django_v3, fastpickle, hg_startup, html5lib, json_load, nbody, nqueen, regex_v8, spectral_norm, telco, tornado_http) from the commercial Unified Python benchmark suite (PyBench) [14]. The Java and Python benchmarks include significant binary-only library or virtual machine performance components. In our current setup, we are interested in predicting performance for single-core workloads, excluding disturbances due to host/target operating system variations. Thus, all programs are restricted to run on one core until completion, which minimizes measurement noise due to core migration. On both the host and target, the Java programs are executed with the OpenJDK Runtime v2.6.7, and the Python test programs with Python v2.7.3.
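The paper does not detail the pinning mechanism; a minimal sketch of one standard way to restrict a workload to a single core on Linux, with a placeholder benchmark path, is:

import os
import subprocess

# Restrict the current process, and any children it spawns, to CPU
# core 0 so the workload cannot migrate between cores while measured.
os.sched_setaffinity(0, {0})

# Run the benchmark to completion on the pinned core;
# "./benchmark" is a placeholder path, not from the paper.
subprocess.run(["./benchmark"], check=True)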
To demonstrate the effectiveness of our approach on state-of-the-art, real-world mobile target platforms, we employ a physical hardware reference as the target for training and prediction. We specifically use the ODROID-U3 development board with a quad-core ARM Cortex-A9 based Samsung Exynos 4412 SoC as our target platform for experiments.

IV. EXPERIMENTAL RESULTS

To demonstrate the accuracy of the proposed approach, we apply our sampling-based prediction framework to the 45 test programs on the Intel host in order to predict the performance of each benchmark on the ARM target. We compare predictions against actual measurements obtained from the U3 hardware. Fig. 3 shows the accuracy of predicting whole-program performance with a host sampling period of T = 500 ms. Predicted cycles are very close to actual hardware measurements. Worst-case prediction error is around 6%, with average errors of less than 3.5%. For the C++ benchmarks for which complete source code is available (Fig. 3a), this is comparable to results reported by existing instrumentation-based, source-level cross-platform prediction techniques [2], which achieve on average over 97% accuracy, despite our approach using fewer counters and coarser sampling, albeit with a larger training set.

The total runtime of each test program is shown in Fig. 4. It consists of profiling and prediction times. The profiling time is the time it takes to collect the various counters on the host. Our Intel host supports simultaneous reading of 3 counters at a time. Since the Linux perf tool allows for over-sampled counter multiplexing, only one run of each program is necessary to collect all 6 counters with little overhead. Compared to prior work [2], which incurs instrumentation overhead at every basic block, samples at faster rates, and requires 5 separate runs of each program to collect 14 counters, profiling in our approach is significantly faster while achieving similar accuracies.
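As a sketch of this collection step, the snippet below drives perf stat in interval mode with a 500 ms period; the six events listed are stand-ins, since this excerpt does not name the exact counters used:

import subprocess

# Six hardware events read in one run via perf's counter multiplexing;
# the paper's exact counter set is not listed in this excerpt, so
# these six are placeholders.
EVENTS = "cycles,instructions,branches,branch-misses,cache-references,cache-misses"

# "perf stat -I 500" emits counts every 500 ms, matching the host
# sampling period T = 500 ms; "-x ," selects CSV output (on stderr).
cmd = ["perf", "stat", "-e", EVENTS, "-I", "500", "-x", ",", "./benchmark"]
proc = subprocess.run(cmd, capture_output=True, text=True)

# Each non-comment stderr line is one interval sample, carrying a
# timestamp, a counter value, and the event name, with one row per
# event per interval.
for line in proc.stderr.splitlines():
    if line and not line.startswith("#"):
        print(line)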
The prediction time measures the total duration of solving the optimization problem (II.4) for all the samples obtained from the program. Solving time is governed by the dimension of the data matrix X and the neighborhood defined by the distance threshold μ in (II.4). Prior work [2] uses a convex but non-smooth objective function. By contrast, our objective function is both convex and smooth, which results in faster overall solving speed.
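Since (II.4) itself is not reproduced in this excerpt, the following is only a hedged stand-in illustrating the roles of X and μ: a neighborhood-restricted least-squares fit, which is smooth and convex and therefore solves directly, unlike a non-smooth L1-style objective. All names are illustrative.

import numpy as np

def predict_sample(x_query, X_train, y_train, mu=1.0):
    """Illustrative stand-in for the paper's objective (II.4), which is
    not reproduced in this excerpt: a smooth, convex least-squares fit
    restricted to training samples within distance mu of the query."""
    # The neighborhood defined by mu governs the size of the data
    # matrix actually solved over, and hence the solving time.
    dist = np.linalg.norm(X_train - x_query, axis=1)
    mask = dist <= mu
    if not np.any(mask):        # fall back to the nearest sample
        mask = dist == dist.min()
    X, y = X_train[mask], y_train[mask]

    # A smooth quadratic objective ||Xw - y||^2 admits a direct
    # least-squares solution; a non-smooth objective would require
    # slower iterative solvers.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(x_query @ w)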
Combined, the shorter profiling and prediction times significantly increase the speed of our approach compared to prior work [1], [2]. The overall simulation speed of our approach is on the order of 3 giga target instructions per second (GIPS). Related work runs at 500-800 MIPS counting instrumented